Seed7 Manual: Structured syntax definition

Syntax

9. STRUCTURED SYNTAX DEFINITION

Most programming languages have only predefined constructs like statements and operators. Seed7, on the other hand, additionally allows user defined constructs. This chapter introduces the Seed7 Structured Syntax Description (S7SSD) which is used to define the syntax of new constructs. The syntax of predefined constructs is also defined with S7SSD.

The syntax descriptions used in manuals of conventional programming languages have no relationship to the approach used by the syntax analysis of the corresponding interpreters/compilers. S7SSD is a simple syntax description that can be used by humans and compilers/interpreters. Although compiler-compilers follow the path of machine readable syntax descriptions, they use much more complicated syntax and semantic descriptions and do not allow users of the language to define new constructs.

There are different existing notations to specify the syntax of programming languages. Backus-Naur Form (BNF) and its variants like Extended Backus-Naur Form (EBNF) are examples of such syntax specifications. Since it is easier to understand new concepts if they are compared to well known concepts, EBNF will be used as a base to explain S7SSD.

9.1 The Extended Backus-Naur Form

As the name says the Extended Backus-Naur Form is an extension of BNF. The extension allows the definition of repetitions and optional parts without the use of recursion. EBNF has the following elements:

Nonterminal symbols are described with identifiers. An identifier consist of lower case letters and underline characters.
Terminal symbols are quoted strings or names in upper case characters, which describe unprintable characters (control characters).
The concatenation of nonterminal and/or terminal symbols is described by writing them in sequence.
With | two alternatives can be separated.
Expressions of the extended Backus-Naur form can be put within parentheses ( ... ) .
If an expression is optional it is enclosed in square brackets [ ... ] .
If an expression may be omitted or repeated it is enclosed in curly braces { ... } .

The syntax of the extended Backus-Naur form can be described in extended Backus-Naur form:

syntax_description ::= { ebnf_statement } . ebnf_statement ::= identifier '::=' ebnf_expression '.' . ebnf_expression ::= term { '|' term } . term ::= factor { factor } . factor ::= identifier | string | control_character_description | '(' ebnf_expression ')' | '[' ebnf_expression ']' | '{' ebnf_expression '}' .

9.2 The Seed7 Structured Syntax Description

The Seed7 Structured Syntax Description is abbreviated with S7SSD. The S7SSD can describe most but not all of the syntax of a programming language. The syntax of identifiers, literals and comments is not described with S7SSD. S7SSD views a program as a big typeless expression. The syntax of this expression is described with prefix, infix and postfix operators. The operators have a priority and an associativity. Operators can have one or more operator symbols. The operator symbols of an operator can be adjacent or they can have parameters in between. The S7SSD of an infix + is:

$ syntax expr: .(). + .()   is -> 7;

This defines the + as left associative infix operator with priority 7. The actual syntax is described with:

. () . + . ()

The dots are used to create a list of elements. For the purpose of the syntax description we can just remove the dots, which gives:

() + ()

The place of the parameters is specified with (). The + operator is an infix operator, because the + is preceded and succeeded by a (). Any expression can be used as () parameter. The type of the () parameters and the type of the result of + is not specified by the S7SSD. Checks for the correct type are not done at the syntactic level. This way S7SSD allows syntax that would not be allowed in a corresponding EBNF description. S7SSD considers just operator symbols and their priority and associativity.

9.3 The syntax of a statement

To explain the Seed7 Structured Syntax Description we design a new statement, the loop-statement. The loop-statement should be similar to while- and repeat-loops but instead of having the conditional exit at the beginning or at the end, it should have a conditional exit in the middle of the loop. This middle conditional exit should be part of the loop-statement. Note that the break-statement, which exists in some programming languages, is a statement on its own and is not part of the loop which it leaves. Therefore the middle conditional exit should not be confused with a break-statement. An example of the new loop-statement is:

loop
  ch := getc(inFile);
until ch = '\n' do
  stri &:= ch;
end loop;

The 'loop' example above reads characters from a file and concatenates them to a string until the character '\n' is read. The '\n' ends the loop. Hence it is not added to the string. An equivalent solution without the usage of the loop-statement would be:

repeat
  ch := getc(inFile);
  if ch <> '\n' then
    stri &:= ch;
  end if;
until ch = '\n';

The S7SSD of the loop-statement is:

$ syntax expr: .loop.().until.().do.().end.loop   is -> 25;

The details of the S7SSD 'syntax' definition will be explained later. For now we concentrate at the heart of the S7SSD, the expression:

.loop.().until.().do.().end.loop

For the purpose of the syntax description we can just remove the dots, which gives:

 loop () until () do () end loop

This are the keywords used in a loop-statement. The symbol () acts as placeholder for an expression. With EBNF the loop-statement can be described as:

loop_statement ::= 'loop' statement 'until' expression 'do' statement 'end' 'loop' .

An EBNF description may use many nonterminal symbols such as 'statement' or 'expression'. S7SSD does not distinguish between different nonterminal symbols. Instead S7SSD only knows one nonterminal symbol: ()

Therefore S7SSD cannot distinguish between 'statement', 'expression' or something else. At the syntax level any kind of expression can by substituted for a S7SSD nonterminal symbol (). With EBNF it is possible to describe constraints such as the type of an expression. S7SSD relies on semantic checks to verify such constraints. Given the S7SSD of the loop-statement an expression like

loop
  "X"
until 1+2 do
  integer
end loop

would be legal as it contains the required keywords

loop  until  do  end  loop

and the expressions

"X"  1+2  integer

at the places of the () symbols. This is exactly what the syntax definition specifies, but it would be not be considered correct given the description of the loop-statement at the beginning of the chapter. To determine which types of expressions are allowed at the places of the () symbol, a semantic definition of the loop-statement is necessary. A semantic definition is just a function definition which uses the keywords and parameters from the syntax definition. The definition of the 'loop' function (semantic definition of the loop-statement) is:

const proc: loop
              (in proc: statements1)
            until (in func boolean: condition) do
              (in proc: statements2)
            end loop is func
  local
    var boolean: exitLoop is FALSE;
  begin
    repeat
      statements1;
      if not condition then
        statements2;
      else
        exitLoop := TRUE;
      end if;
    until exitLoop;
  end func;

This definition determines the types of the expressions accepted between the keywords. Besides that the semantic definition of the loop-statement is just a normal function definition. Note that the sequence of keywords and parameters in the header of this function definition is determined by the corresponding syntax definition.

The parameters 'statements1', 'condition' and 'statements2' are call-by-name parameters. A call-by-name parameter is a function without parameters. Function types such as proc or func boolean are used as type of formal call-by-name parameters. An expression with the correct type is allowed as actual call-by-name parameter. This actual parameter expression is not evaluated when the function is called. Instead the expression is evaluated every time the formal call-by-name parameter is used. This way 'statements1', 'condition' and 'statements2' are not executed when the 'loop' function is called. Inside the body of the 'loop' function the call-by-name parameters are executed at some places.

The 'loop' function uses a repeat- and an if-statement to implement the desired behavior. If necessary the call-by-name parameters are executed several times.

For the 'loop' example with the semantic errors (see above) we would get an error message like:

*** chkloop.sd7(35):57: Match for {loop "X" until {1 + 2 } do integer end loop } failed

9.4 Priority and associativity

If a syntax construct has parameters before the first symbol or after the last symbol the priority and the associativity of the construct are significant. Constructs with stronger priority bind their parameters earlier than constructs with weaker priority. The priority is described by a natural number (inclusive 0). The strongest priority is 0. Weaker priorities are described by larger numbers. What bind means is can be explained with an example:

                                      =
    A = B + C * D                    / \
                                    A   +
    * priority  6                      / \
    + priority  7                     B   *
    = priority 12                        / \
                                        C   D

The * operator has the strongest priority (6) of all operators involved. Therefore the * takes its parameters first. Then the + (with priority 7) and at last the = (with priority 12) follows. This leads to the following interpretation of the expression:

A = (B + (C * D))

The associativity describes, in which order constructs with equal priority bind their parameters. For example

A - B - C

can be interpreted in two ways:

(A - B) - C    or   A - (B - C)

The first interpretation is usually preferred by mathematicians and is described with the associativity -> . Generally four associativities are possible:

Associativity	Symbol
Binding from left to right	`->`
Binding from right to left	`<-`
Neither the left nor the right parameter are allowed to have the same priority	`<->`
At the left side there is a binding from left to right and at the right side there is a binding from right to left	`-><-`

The last two possibilities give no legal interpretation in the subtraction example. The third kind of associativity ( <-> ) is used by the equal operator ( = ) of Pascal because there an expression like

A = B = C

is not legal.

There is a second way to describe the associativity. The associativity describes, if an operand must have a stronger priority than the priority of the operator. For example:

                             -                     7
    A - B - C              /   \                 /   \
                          /     \           <=7 /     \ <7
    - priority 7 ->      /       \             /       \
                        -         C           7         0
                      /   \                 /   \
                     /     \           <=7 /     \ <7
                    /       \             /       \
                   A         B           0         0

The numbers in the nodes of the right tree show the priority of each sub expression (sub tree). With < and <= the required condition for the priority of an operand is described. An interpretation is legal if all this conditions are met. If there are more than one legal interpretations or no legal interpretation the expression is illegal.

Table for the possibilities of associativity:

associativity	The priority of the
associativity	left operand must be	right operand must be
`->`	`<=`	`<`
`<-`	`<`	`<=`
`<->`	`<`	`<`
`-><-`	`<=`	`<=`
	than that of the operator

The parameter before the operator symbol is called left operand. The parameter after the last symbol of a construct is called right operand. In case of normal operators the last symbol of a construct and the operator symbol are identical. If this is not the case there is a third kind of operand. Between the operator symbol and the last symbol of a construct are the middle operands. Middle operands can have any priority.

9.5 The syntax of operators

A syntax definition specifies the way a usage of a statement or operator must be written. For example a call of the not operator looks like:

not okay

To describe the syntax of the not operator we write:

$ syntax expr: .not.() is <- 13;

This means that a not expression is constructed with the symbol not followed by a parameter. The place of the parameter is marked with the () sign. The syntax description contains no information about the types of the parameters. At the syntax level a parameter may be anything. With <- the associativity of the not operator is specified as right associative. This means that the right operand is allowed to have the same priority as the operator symbol. So the expression

not not okay

is legal and means

not (not okay)

If the associativity of the not operator is specified with -> instead of <- the 'not not' expression above is not legal. With 13 the priority of the whole not operator is determined. As convention priorities from 1 to 20 are used by operators and priority 25 is used by statements. Arithmetic operators have priorities from 1 to 11 and comparisons have priority 12.

To define the not operator completely there must be also a semantic definition which is as follows:

const func boolean: not (in boolean: aBool) is func
  result
    var boolean: negation is TRUE;
  begin
    if aBool then
      negation := FALSE;
    end if;
  end func;

In the declaration the not operator is written exactly in the same way it is written when it is called. The syntax definition is used at both places: declaration and call. The syntax and semantic declarations define precisely how the not operator works.

As next example we try an infix operator like the and operator. A call of the and operator may look like:

okay and not error

To describe the syntax of the and operator we write:

$ syntax expr: .().and.() is    -> 14;

This means that an and expression is constructed with the symbol and surrounded by parameters. The -> defines the and operator as left associative. This means that an expression like

A and B and C

is interpreted as

(A and B) and C

With 14 the priority of the whole and operator is determined. Since priority 14 is weaker than the priority of the not operator which is 13 the example expression is evaluated as:

okay and (not error)

Note that the expression

okay and not error

makes no sense if the and operator has priority 12 instead of 14.

S7SSD treats everything as operator description. Operators have priority and associativity. The priority and associativity determine in which succession S7SSD syntax rules get applied. To explain priority and associativity we use the basic arithmetic operations (+,-,*,/). To describe them with EBNF we can write:

This describes the following things:

The operators have different priorities:
- Plus and minus signs are executed first
- Multiplication and division are executed second.
- Addition and subtraction are executed last.
These priorities are exactly what we expect from an arithmetic expression.
Additionally we see that ++2 is allowed and interpreted as +(+(2)) which means that the plus sign is a right-associative operator.
We can also see that a*b*c is allowed and interpreted as (a*b)*c which means that the multiplication is a left-associative operator.

All this things can also be described with S7SSD:

$ syntax expr: . + .()      is <-  5;
$ syntax expr: . - .()      is <-  5;
$ syntax expr: .(). * .()   is  -> 6;
$ syntax expr: .(). / .()   is  -> 6;
$ syntax expr: .(). + .()   is  -> 7;
$ syntax expr: .(). - .()   is  -> 7;

As we can see S7SSD is shorter as the description with EBNF. A syntax statement is explained as follows:

The $ is used to introduce all hard coded statements.
The keyword 'syntax' introduces a structured syntax description.
The result of the recognized expression will have the type expr. The type expr is used between the syntax and the semantic analysis. The type expr describes expressions which are syntactically analyzed but not semantically analyzed. After the semantic analysis (and during the runtime) the type expr is not used.
The colon ':' is used as separator between type and syntax description.
A dot expression like '.(). * .()' is introduced (as can probably be guessed by the name) with a dot. For the purpose of the syntax description we can just remove the dots in our mind: '() * ()'
The symbol 'is' is used in all Seed7 declarations as separator between the name and the value.
The associativity is described with one of the symbols -> (left-associative), <- (right-associative), <-> (not associative) and -><- (both associativities). If there are no left or right operands, as it is the case for the loop-statement, the associativity is irrelevant.
Finally the priority of the syntax construct is defined with a integer literal like '6'. The priority '6' is used for the operators *, /, div, rem, mdiv and mod.

9.6 Syntax of predefined statements

Predefined statements can also be defined with S7SSD. E.g.: The while-statement. A use of the while-statement is:

while element_index > 0 and okay do
  processElement;
  write(".");
end while;

To describe the syntax of the while-statement we write:

$ syntax expr: .while.().do.().end.while is -> 25;

This means that the while-statement is an expression with the symbols 'while', 'do', 'end' and 'while'. With -> the associativity of the while-statement is specified as left associative. The associativity has no meaning for the while-statement since there is no parameter before the first symbol or after the last symbol. The priority of the whole while-statement is 25.

The semantic definition of the while-statement is as follows:

const proc: while (in func boolean: condition) do
    (in proc: statement) end while is func
  begin
    if condition then
      statement;
      while condition do
        statement;
      end while;
    end if;
  end func;

The syntax definition is used for the declaration and for the call. This declaration defines precisely how the while-statement works. It is based on the if-statement and uses recursion to emulate the repetition of the loop body. Another example for a syntax description is the repeat-statement

repeat
  processElement;
  write(".");
until element_index = 0 or not okay;

which has the following syntax description:

$ syntax expr: .repeat.().until.() is -> 25;

This means that the repeat-statement is an expression with the symbols 'repeat' and 'until' and a parameter between 'repeat' and 'until' and after 'until'. With 25 the priority of the whole repeat-statement is determined. With -> the associativity of the repeat-statement is specified as left associative. This allows priorities from 0 to 24 for the parameter after 'until'. Since statements have priority 25 it is not possible to write a statement direct behind 'until'.

A simple if-statement, without 'elsif' part, is the next example. A usage of this if-statement might be:

if okay then
  writeln("okay");
else
  writeln("not okay");
end if;

As syntax description we use

$ syntax expr: .if.().then.().end.if is            -> 25;
$ syntax expr: .if.().then.().else.().end.if is    -> 25;

Note that this description allows if-statements with and without 'else' parts. As semantic description we use

const proc: if (in boolean: condition) then
              (in proc: statement)
            end if is func
  begin
    case condition of
      when {TRUE}: statement;
    end case;
  end func;

const proc: if (in boolean: condition) then
              (in proc: statement1)
            else
              (in proc: statement2)
            end if is func
  begin
    case condition of
      when {TRUE}:  statement1;
      when {FALSE}: statement2;
    end case;
  end func;

The two forms of the if-statement are based on the case-statement. A more complex if-statement with 'elsif' parts can be:

if number < 0 then
  write("less");
elsif number = 0 then
  write("equal");
else
  write("greater");
end if;

How to define the syntax and the semantic for this statement is described in the next chapter.

9.7 Advanced syntax definitions

If we want to use some special syntax which should be only allowed at some place we do the following:

Define the special syntax with S7SSD in a way that does not contradict with the rest of the syntax definitions.
Use semantic definitions to make sure that this syntax construct can only be used at the place desired.

The EBNF of the if-statement with 'elsif' parts is:

if_statement ::= 'if' expression 'then' statement { 'elsif' expression 'then' statement } [ 'else' statement ] 'end' 'if' .

The S7SSD of this if-statement is:

$ syntax expr : .if.().then.().end.if           is -> 25;
$ syntax expr : .if.().then.().().end.if        is -> 25;

$ syntax expr : .elsif.().then.()               is <- 60;
$ syntax expr : .elsif.().then.().()            is <- 60;
$ syntax expr : .else.()                        is <- 60;

Instead of one rule (as EBNF does) the rule is broken into several S7SSD rules. This is necessary because S7SSD does not support the [ ] and { } notations. They are not supported for good reasons: They complicate the parameter lists and they are also not so easy to implement. On the other hand, the BNF like rules of S7SSD lead to semantic constructs which are easy to parse and easy to compile. The broken down S7SSD rules of the if-statement corresponds to the following EBNF description:

if_statement ::= 'if' expression 'then' statement 'end' 'if' . if_statement ::= 'if' expression 'then' statement elseif_or_else_part 'end' 'if' . elseif_or_else_part ::= 'elsif' expression 'then' statement . elseif_or_else_part ::= 'elsif' expression 'then' statement elseif_or_else_part . elseif_or_else_part ::= 'else' statement .

Since S7SSD uses only one nonterminal symbol '()' it is the job of the semantic level to make sure that only the right nonterminal symbol can be used. This is done by introducing the type ELSIF_PROC (which corresponds to the nonterminal symbol 'elseif_or_else_part' of the EBNF) and the type ELSIF_RESULT (which is the result of the ELSIF_PROC).

Normally a syntax declaration can be used in many semantic declarations. E.g.: The syntax of the '+' operator is defined once and the semantic of the '+' operator is defined for the types integer, bigInteger, float, complex, ... This possibility is not needed for the if-statement. For each of the five S7SSD syntax rules of the if-statement just one corresponding semantic declaration is done:

# Semantic for the syntax: .if.().then.().end.if
const proc: if (in boolean: condition) then
              (in proc: statements)
            end if                                    is func
  begin
    case condition of
      when {TRUE}: statements;
    end case;
  end func;

# Semantic for the syntax: .if.().then.().().end.if
const proc: if (in boolean: condition) then
              (in proc: statements)
            (in ELSIF_PROC: elsifPart)
            end if                                    is func
  begin
    case condition of
      when {TRUE}: statements;
      when {FALSE}: elsifPart;
    end case;
  end func;

# Semantic for the syntax: .elsif.().then.()
const ELSIF_PROC: elsif (in boolean: condition) then
                    (in proc: statements)             is func
  begin
    case condition of
      when {TRUE}: statements;
    end case;
  end func;

# Semantic for the syntax: .elsif.().then.().()
const ELSIF_PROC: elsif (in boolean: condition) then
                    (in proc: statements)
                  (in ELSIF_PROC: elsifPart)          is func
  begin
    case condition of
      when {TRUE}: statements;
      when {FALSE}: elsifPart;
    end case;
  end func;

# Semantic for the syntax: .else.()
const ELSIF_PROC: else
                    (ref void: voidValue)        is ELSIF_EMPTY;

Since no other functions of type 'ELSIF_PROC' are defined only legal if-statements can be written.

9.8 Comparison of EBNF and S7SSD

In the S7SSD of the loop-statement

$ syntax expr: .loop.().until.().do.().end.loop is -> 25;

are no nonterminal expressions '()' before the first keyword or after the last keyword. Therefore the associativity does not play any role. The nonterminal expressions '()' of the loop-statement are all surrounded by keywords and therefore they can have any priority. As priority of the 'loop' 25 is chosen just because most other statements have also priority 25. The assignments (:= +:= *:= ...) have priority 20 and all operators used in arithmetic, boolean and string expressions have priorities less than 20. BTW: The semicolon operator (;) is defined with the priority 50. Operators with a priority of 0 get their parameters before operators with priority 1 and so on.

The corresponding EBNF description of the loop-statement would be:

expression_25 ::= 'loop' expression_127 'until' expression_127 'do' expression_127 'end' 'loop' .

We must keep in mind that alternative rules for expression_25 are also possible and that for every priority level a rule like

expression_127 ::= expression_126 .

is defined. Additionally the following rules are defined:

expression_0 ::= token | parentheses_expression | call_expression | dot_expression . token ::= identifier | literal . parentheses_expression ::= '(' expression_127 ')' . call_expression ::= expression_127 [ '(' [ expression_127 { ',' expression_127 } ] ')' ] . dot_expression ::= [ '.' ] call_expression { '.' call_expression } .

The EBNF description can become long if many priority levels exist, as it is the case in Seed7.

There are some things which are out of the scope of S7SSD. The syntax of comments, tokens (identifiers and literals) and expressions (parentheses, function calls and dot expressions) is hard coded. The hard coded constructs are described in chapter 10 (Tokens) and chapter 11 (Expressions).

For the reasons mentioned above it is not possible to transform every EBNF syntax description into S7SSD. Transforming S7SSD descriptions to EBNF is always possible.

The advantage of S7SSD lies in its simplicity and that a fast automated syntax recognition algorithm can be easily implemented. It is exactly the combination of hard coded syntax recognition and flexible syntax rules that make it successful.