Synopsis The symbols that can occur in a syntax definition.
Syntax
Non-terminal symbols are identifier names that start with an uppercase letter.
Symbol | Description |
---|
Symbol fieldName | Any symbol can be labeled with a field name that starts with a lowercase letter |
The following literal symbols and character classes are defined:
Symbol | Description |
---|
"stringliteral" | Literal string |
'stringliteral' | Case-insensitive literal string |
[range1 range2 ... ] | Character class |
The following operations on character classes can be composed arbitrarily:
Class | Description |
---|
!Class | Complement of Class with respect to the Unicode universe of characters |
Class1 - Class2 | Difference of character classes Class1 and Class2 |
Class1 || Class2 | Union of character classes Class1 and Class2 |
Class1 && Class2 | Intersection of character classes Class1 and Class2 |
(Class) | Brackets for defining application order of class operators |
The following regular expressions can be constructed over
Symbols:
Symbol | Description |
---|
Symbol? | Optional Symbol |
Symbol+ | Non-empty list of Symbols |
Symbol* | Possibly empty list of Symbols |
{Symbol1 Symbol2}+ | Non-empty list of Symbol1 separated by Symbol2 |
{Symbol1 Symbol2}* | Possibly empty list of Symbol1 separated by Symbol2 |
(Symbol1 Symbol2 ... ) | Embedded sequence of symbols |
(Symbol1 | Symbol2 | ... ) | Embedded choice of alternative symbols |
() | The anonymous non-terminal for the language with the empty string |
Inline conditions (Disambiguations) can be added to symbols to constrain their acceptability:
Disambiguation | Description |
---|
Symbol $ | Symbol ends at the end of a line or the end of the file |
^Symbol | Symbol starts at the beginning of a line |
Symbol @ ColumnIndex | Symbol starts at a certain column index |
Symbol1 >> Symbol2 | Symbol1 must be (directly) followed by Symbol2 |
Symbol1 !>> Symbol2 | Symbol1 must not be (directly) followed by Symbol2 |
Symbol1 << Symbol2 | Symbol2 must be (directly) preceded by Symbol1 |
Symbol1 !<< Symbol2 | Symbol2 must not be (directly) preceded by Symbol1 |
Symbol1 \ Symbol2 | Symbol1 must not be in the language defined by Symbol2 |
Symbols can be composed arbitrarily.
Types Every non-terminal symbol is a type.
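As a minimal sketch of what this means in practice, the fragment below uses a non-terminal as a parameter type and reads a labeled symbol as a field. The names Id, Layout, Assignment, target, source and targetName are hypothetical and chosen only for illustration.

```rascal
// Hypothetical grammar, for illustration only.
lexical Id = [a-z]+ !>> [a-z];
layout Layout = [\ \t\n]* !>> [\ \t\n];

// The labels "target" and "source" become field names on Assignment trees.
syntax Assignment = Id target ":=" Id source;

// Assignment is used here as an ordinary parameter type;
// a.target selects the subtree labeled "target".
str targetName(Assignment a) = "<a.target>";
```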
Description The basic symbols are the non-terminal name and the labeled non-terminal name. These refer to the names defined by SyntaxDefinition. You can use any defined non-terminal name in any other definition (lexical in syntax, syntax in lexical, etc.).
Then we have literals and character classes to define the terminals of a grammar. When you use a literal such as "begin", Rascal will produce a definition for it down to the character level before generating a parser:
syntax "begin" = [b][e][g][i][n];
This effect will be visible in the ParseTrees produced by the parser. For case-insensitive literals you will see a similar effect; the use of 'begin' produces:
syntax 'begin' = [bB][eE][gG][iI][nN];
Character classes follow the same escaping conventions as characters in a String literal, with two additions: spaces and newlines are meaningless inside a class and must be escaped to be included, and the brackets [ and ] as well as the dash - need escaping. For example, one writes [\[ \] \ \n\-] for a class that includes the open and close square brackets, a space, a newline and a dash. Character classes support ranges, as in [a-zA-Z0-9]. Please note the following about character classes:
- The operations on character classes are executed before parser generation. You will not find an explicit representation of these operations in ParseTrees, but rather their net effect as resulting character classes.
- Character classes are also ordered by Rascal and overlapping ranges are merged before parsers are generated. Equality between character classes is checked after this canonicalization.
- Although all Symbols are type constructors, the character class operators are not allowed in types.
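The class operators and escaping rules described above can be sketched as follows. All non-terminal names are hypothetical, and the "equivalent to" comments state the expected net effect of the operators rather than the exact form in which Rascal stores the class.

```rascal
// Hypothetical names, for illustration only.
lexical NotDigit  = ![0-9];                   // complement
lexical Consonant = [a-z] - [aeiou];          // difference
lexical AlphaNum  = [a-z] || [0-9];           // union, equivalent to [a-z0-9]
lexical HexLetter = [a-z] && [a-f0-9];        // intersection, equivalent to [a-f]
lexical Letter    = ([a-z] || [A-Z]) - [xX];  // brackets fix the application order
lexical Special   = [\[ \] \ \n\-];           // escaped brackets, space, newline and dash
```

Because the operators are evaluated before parser generation, a parse tree for AlphaNum would be expected to show a single merged character class, not a union node.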
The other symbols either generate parts of the construction of a grammar for you, or they constrain the rules of the grammar to generate a smaller set of trees, as Disambiguations.
The generative symbols are referred to as the regular symbols. These are like named non-terminals, except that they are defined implicitly and interpreted by the parser generator to produce a parser that can recognize a symbol optionally, iteratively, alternatively, sequentially, etc. You also need to know this about the regular symbols:
- In ParseTrees you will find special nodes for the regular expression symbols that hide how these were recognized.
- Patterns using ConcreteSyntax have special semantics for the regular symbols (list matching, separator handling, ignoring layout, etc.).
- Regular symbols are not allowed in keyword SyntaxDefinitions.
- Depending on their occurrence in a lexical, syntax or layout SyntaxDefinition, the semantics of regular symbols changes. In the syntax context, layout non-terminals will be woven into the regular symbol, but not in the lexical and layout contexts. For example, a Symbol* in a syntax definition such as syntax X = A*; is processed to syntax X = {A Layout}*; and, similarly, syntax X = {A B}+; is processed to syntax X = {A (Layout B Layout)}+;
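To make the layout weaving concrete, here is a small sketch with a layout non-terminal in scope. The names Layout, Number, Numbers, Arguments and Digits are hypothetical, and the comments restate the transformation described above; they do not show literal output from Rascal.

```rascal
// Hypothetical grammar, for illustration only.
layout Layout = [\ \t\n\r]* !>> [\ \t\n\r];
lexical Number = [0-9]+ !>> [0-9];

// syntax context: layout is woven between the list elements
syntax Numbers = Number*;            // processed as {Number Layout}*
syntax Arguments = {Number ","}+;    // processed as {Number (Layout "," Layout)}+

// lexical context: no layout is woven in, so the digits must be adjacent
lexical Digits = [0-9]*;
```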
The constraint symbols exist specifically to deal with the fact that Rascal does not generate a scanner. There are no a priori disambiguation rules such as prefer keywords or longest match. Instead, you should use the constraint symbols to define the effect of keyword reservation and longest match.
- It is important to note that these constraints work on a character-by-character level in the input stream. So, a follow constraint such as A >> [a-z] means that the character immediately following a recognized A must be in the range [a-z].
- Read more on the constraint symbols via Disambiguations.
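Keyword reservation and longest match can be combined as in the sketch below. The module name Demo, the non-terminals Reserved and Id, and the helper accepts are hypothetical; parse and ParseError come from the standard ParseTree module.

```rascal
module Demo

import ParseTree;

// Hypothetical set of reserved words.
keyword Reserved = "if" | "else" | "fi";

// Longest match via the follow restriction, keyword reservation via exclusion.
lexical Id = ([a-z]+ !>> [a-z]) \ Reserved;

// Hypothetical helper: reports whether the input parses as an Id.
bool accepts(str input) {
  try {
    parse(#Id, input);
    return true;
  }
  catch ParseError(_): return false;
}
```

With these definitions, accepts("iffy") is intended to succeed, while accepts("if") is intended to fail because "if" is in the language of Reserved.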
Examples A character class that defines all alphanumeric characters:
rascal>lexical AlphaNumeric = [a-zA-Z0-9];
ok
A character class that defines anything except quotes:
rascal>lexical AnythingExceptQuote = ![\"];
ok
An identifier class with longest match (cannot be followed immediately by [a-z]):
rascal>lexical Id = [a-z]+ !>> [a-z];
ok
An identifier class with longest match and first match (cannot be preceded or followed by [a-z]):
rascal>lexical Id = [a-z] !<< [a-z]+ !>> [a-z];
ok
An identifier class with some reserved keywords and longest match:
rascal>lexical Id = [a-z]+ !>> [a-z] \ "if" \ "else" \ "fi";
ok
An optional else branch coded using sequence and optional symbols:
rascal>syntax Statement = "if" Expression "then" Statement ("else" Statement)? "fi";
ok
A block of statements separated by semicolons:
rascal>syntax Statement = "{" {Statement ";"}* "}";
ok
A declaration with an embedded list of alternative modifiers and a list of typed parameters:
rascal>syntax Declaration = ("public" | "private" | "static" | "final")* Type Id "(" {(Type Id) ","}* ")" Statement;
ok
Benefits - The symbol language is very expressive and can lead to short definitions of complex syntactic constructs.
- There is no built-in longest match for iterators, which makes syntax definitions open to languages that do not have longest match.
- There is no built-in keyword preference or reservation, which makes syntax definitions open to language composition and legacy languages.
Pitfalls - By nesting too many symbols, definitions can become hard to understand.
- By nesting too many symbols, pattern matching and term construction become more complex. Extra non-terminals and rules with meaningful names can make a language specification more manageable.
- The lack of automatic longest match and prefer keyword heuristics (you have to define them yourself) sometimes leads to unexpected ambiguity. See Disambiguations.