Navigation
Synopsis The symbols that can occur in a syntax definition.
Syntax
Nonterminal symbols are identifier names that start with an uppercase letter.

Symbol Description
Symbol fieldName Any symbol can be labeled with a field name that starts with a lowercase letter
The following literal symbols and character classes are defined:

Symbol Description
"stringliteral" Literal string
'stringliteral' Case-insensitive literal string
[range1 range2 ... ] Character class


The following operations on character classes can be composed arbitrarily:

Class Description
!Class Complement of Class with respect to the UTF8 universe of characters
Class1 - Class2 Difference of character classes Class1 and Class2
Class1 || Class2 Union of character classes Class1 and Class2
Class1 && Class2 Intersection of character classes Class1 and Class2
(Class) Brackets for defining application order of class operators


The following regular expressions can be constructed over Symbols:

Symbol Description
Symbol? Optional Symbol
Symbol+ Non-empty list of Symbols
Symbol* Possibly empty list of Symbols.
{Symbol1 Symbol2}+ Non-empty list of Symbol1 separated by Symbol2
{Symbol1 Symbol2}* Possibly empty list of Symbol1 separated by Symbol2.
(Symbol1 Symbol2 ... ) Embedded sequence of symbols
(Symbol1 | Symbol2 | ... ) Embedded choice of alternative symbols
() The anonymous non-terminal for the language with the empty string


Inline conditions (Disambiguations) can be added to symbols to constrain their acceptability:

Disambiguation Description
Symbol $ Symbol ends at end of line or end of file
^Symbol Symbol starts at begin of line
Symbol @ ColumnIndex Symbol starts at certain column index.
Symbol1 >> Symbol2 Symbol1 must be (directly) followed by Symbol2
Symbol1 !>> Symbol2 Symbol1 must not be (directly) followed by Symbol2
Symbol1 << Symbol2 Symbol2 must be (directly) preceded by Symbol1
Symbol1 !<< Symbol2 Symbol2 must not be (directly) preceded by Symbol1
Symbol1 \ Symbol2 Symbol1 must not be in the language defined by Symbol2


Symbols can be composed arbitrarily.
Types Every non-terminal symbol is a type.
Description The basic symbols are the non-terminal name and the labeled non-terminal name. These refer to the names defined by SyntaxDefinition. You can use any defined non-terminal name in any other definition (lexical in syntax, syntax in lexical, etc).

Then we have literals and character classes to define the terminals of a grammar. When you use a literal such as "begin", Rascal will produce a definition for it down to the character level before generating a parser: syntax "begin" = [b][e][g][i][n];. This effect will be visible in the ParseTrees produced by the parser. For case insensitive literals you will see a similar effect; the use of 'begin' produces syntax 'begin' = [bB][eE][gG][iI][nN].

Character classes have the same escaping conventions as characters in a String literal, but spaces and newlines are meaningless and have to be escaped and the [ and ] brackets as well as the dash - need escaping. For example, one writes [\[ \] \ \n\-] for a class that includes the open and close square brackets and a space, a newline and a dash. Character classes support ranges as in [a-zA-Z0-9]. Please note about character classes that:
  • the operations on character classes are executed before parser generation time. You will not find explicit representation of these operations in ParseTrees, but rather their net effect as resulting character classes.
  • Character classes are also ordered by Rascal and overlapping ranges are merged before parsers are generated. Equality between character classes is checked after this canonicalization.
  • Although all Symbols are type constructors, the character class operators are not allowed in types.
The other symbols either generate for you parts of the construction of a grammar, or they constrain the rules of the grammar to generate a smaller set of trees as Disambiguations.

The generative symbols are referred to as the regular symbols. These are like named non-terminals, except that they are defined implicitly and interpreted by the parser generator to produce a parser that can recognize a symbol optionally, iteratively, alternatively, sequentially, etc. You also need to know this about the regular symbols:
  • In ParseTrees you will find special nodes for the regular expression symbols that hide how these were recognized.
  • Patterns using ConcreteSyntax have special semantics for the regular symbols (list matching, separator handling, ignoring layout, etc.).
  • Regular symbols are not allowed in keyword SyntaxDefinitions
  • Depending on their occurrence in a lexical, syntax or layout SyntaxDefinition the semantics of regular symbols changes. In the syntax context, layout non-terminals will be woven into the regular symbol, but not in the lexical and layout contexts. For example, a Symbol* in a syntax definition such as syntax X = A*; will be processed to syntax X = {A Layout}*. Similarly, syntax X = {A B}+; will be processed to syntax X = {A (Layout B Layout)}+;`.
The constraint symbols are specially there to deal with the fact that Rascal does not generate a scanner. There are no a priori disambiguation rules such as prefer keywords or longest match. Instead, you should use the constraint symbols to define the effect of keyword reservation and longest match.
  • It is important to note that these constraints work on a character-by-character level in the input stream. So, a follow constraint such as A >> [a-z] means that the character immediately following a recognized A must be in the range [a-z].
  • Read more on the constraint symbols via Disambiguations.
Examples A character class that defines all alphanumeric characters:
rascal>lexical AlphaNumeric = [a-zA-Z0-9];
ok
A character class that defines anything except quotes:
rascal>lexical AnythingExceptQuote = ![\"];
ok
An identifier class with longest match (can not be followed immediately by [a-z]):
rascal>lexical Id = [a-z]+ !>> [a-z];
ok
An identifier class with longest match and first match (can not be preceded or followed by [a-z]):
rascal>lexical Id = [a-z] !<< [a-z]+ !>> [a-z];
ok
An identifier class with some reserved keywords and longest match:
rascal>lexical Id = [a-z]+ !>> [a-z] \ "if" \ "else" \ "fi";
ok
An optional else branch coded using sequence and optional symbols:
rascal>syntax Statement = "if" Expression "then" Statement ("else" Statement)? "fi";
ok
A block of statements separated by semicolons:
rascal>syntax Statement = "{" {Statement ";"}* "}";
ok
A declaration with an embedded list of alternative modifiers and a list of typed parameters:
rascal>syntax Declaration = ("public" | "private" | "static" | "final")* Type Id "(" {(Type Id) ","}* ")" Statement;
ok
Benefits
  • The symbol language is very expressive and can lead to short definitions of complex syntactic constructs.
  • There is no built-in longest match for iterators, which makes syntax definitions open to languages that do not have longest match.
  • There is no built-in keyword preference or reservation, which makes syntax definitions open to language composition and legacy languages.
Pitfalls
  • By nesting too many symbols definitions can be become hard to understand.
  • By nesting too many symbols pattern matching and term construction becomes more complex. Extra non-terminals and rules with meaningful names can make a language specification more manageable.
  • The lack of automatic longest match and prefer keyword heuristics (you have to define it yourself), sometimes leads to unexpected ambiguity. See Disambiguation.
Is this page unclear, or have you spotted an error? Please add a comment below and help us to improve it. For all other questions and remarks, visit ask.rascal-mpl.org.