-
Notifications
You must be signed in to change notification settings - Fork 0
Grammar syntax
The top of each grammar has to declare its name (which is the default name for generated parsers).
Name Declarations have the format:
%% NAME GrammarName
The name of a grammar should be an identifier.
The grammar is based on a sequence of rules of the form:
rule
= sequence_1
| sequence_2
...
| sequence_n
;This means that rule matches the first sequence that does not fail.
The Grammar has to include at least one rule. Additionally the rule declared on the top of the grammar will be seen as the Root and is matched by the generated parser.
NOTE: per default the generated parser will not fail if it finds an error, as it is not forced to match the full text. The solution for that is to add
!.as a last rule.
Sequences are made up of multiple expressions behind one another, for example the following sequence:
expr_1 expr_2 ... expr_n
It succeeds if exr_1 matches and expr_2 is matched against the remaining
input after expr_1 and also succeeds, and so on.
The important thing is that a sequence should always contain at least one Expression.
Expressions are the core building blocks that make up the grammar and match test.
A character in zapp is the same thing as a zig character.
- it accepts any byte (except zero)
- it supports common escape sequences (
\n,\t,\r) - it supports hexadecimals (
\xNNlike in zig) except for\x00
The String match Expression matches a string, it works with single quotes or double quotes. A string is made up of bytes which get matched sequentially.
Examples would be:
"Hello\nWorld!\x12"or
'single quotes'A class matches only one character at a time, but it allows multiple characters at once, An example would be:
[a-zA-Z0-9_]
Which is made up of single character matches (_) and ranges (a-z, A-Z, 0-9)
which match everything from the character left of the - to the character right of the - (including the both ends).
NOTE: Some Pattern matching languages define
[^ ...](a class starting with a^) to invert the class. This is not the case here.
Additionally characters can be defined more than once in a class (it serves no purpose but generates no error message).
The Dot Expression just accepts every character.
for example:
Root = . ;is equivalent to:
Root = [\x01-\xff] ;A Nonterminal Expression is the case where the expression matches and fails exactly when a rule matches or fails.
An Example would be:
Top = Bottom ;
Bottom = "Hello World" ;Let's walk through the example.
The Rule Bottom Has only one sequence which has only on expression, which only accepts the string Hello World.
Top also has just one sequence with one Expression, but this one is a nonterminal match to the Rule bottom.
This means that Top and Bottom act exactly the same way.
A Grouping Expression contains a Sequence in Parentheses. An example would be:
Hello = ("Hello" [ \t] "World") ;Which is Equivalent to:
Hello = "Hello" [ \t] "World" ;As Sequences have to contain at least one Expression there would be no way for them to accept everything without consuming any character.
Epsilon is used for that, it is represented by an @.
An example would be:
Root = "Hello" @ [ ] @ "World" ;Which is Equivalent to:
Root = "Hello" [ ] "World" ;The cut operator (denoted by a ^) is fundamentally an Expression, with no operators allowed on it.
it is used to change the backtracking path of the parser when in a sequence.
For example:
Root
= "Hello" ^ " World"
| @
;Would always accept if the cut operator was not there (because of the @ as last sequence).
Because of the cut operator, if "Hello" is in the input but not " World", Root would fail.
So one could think of the cut operator as some sort of if then else (if "Hello" then " World" else @).
The Cut operator helps for performance reasons as is creates less backtracking and can reduce memory consumption, as the memoization table is reset when the backtracking stack is empty.
There Are some operators that can be added to Expressions to make them match more complex strings.
The star operator means that the expression is matched repeatedly zero or more times. An example would be:
Addition = [0-9] ('+' [0-9])* ;Which matches all addition expressions with one digit natural numbers.
It is exactly equivalent to:
Addition = [0-9] Repetition ;
Repetition
= '+' [0-9] Repetition
| @
;The Plus Operator means that the expression is matched one or more times. For example:
Hello = "Hello"+ ;is equivalent to:
Hello = "Hello" "Hello"* ;The Question Operator means that the expression is matched zero or one time. For example:
Greeting = "Hello" " World"? ;is equivalent to:
Greeting = "Hello" World ;
World
= " World"
| @
;NOTE:
Worldcould not be written with the two operators inverted as it would alway accept@and never match" World"
A Lookahead Operator on an expression means it will not consume the text it matches.
Positive Lookahead means (symbolized by an &) means that the expression it operates on will have to match the text to not fail, but will not consume it.
For example:
Root = &"Hello" FollowUp;
FollowUp
= "Hello to you"
| "Hello World"
| "How Are You"
; Root will never accept "How Are You" because of the Positive lookahead that Would Fail if the start of the parsed string is not Hello.
Negative Lookahead (symbolized with a !) is the inverse of positive lookahead, so it accepts iff the expression fails.
For example if we modified the last example:
Root = !"Hello" FollowUp;
FollowUp
= "Hello to you"
| "Hello World"
| "How Are You"
; Then Root would only accepts the input How Are You because it fails on Hello.
In zapp there are two types of simgle-line comments, simple comments:
// some stringand transparent comments (with a *):
//* This is a transparent comment.which have to be at the top level of the grammar (above %% NAME) and are added to the top of the generated parser by zapp.