Skip to content

Grammar syntax

PiisRational edited this page Jun 4, 2024 · 10 revisions

Name Declaration

The top of each grammar has to declare its name (which is the default name for generated parsers).

Name Declarations have the format:

%% NAME GrammarName

The name of a grammar should be an identifier.

Rules

The grammar is based on a sequence of rules of the form:

rule
    = sequence_1
    | sequence_2 
    ...
    | sequence_n
    ;

This means that rule matches the first sequence that does not fail.

The Grammar has to include at least one rule. Additionally the rule declared on the top of the grammar will be seen as the Root and is matched by the generated parser.

NOTE: per default the generated parser will not fail if it finds an error, as it is not forced to match the full text. The solution for that is to add !. as a last rule.

Sequences

Sequences are made up of multiple expressions behind one another, for example the following sequence:

expr_1 expr_2 ... expr_n

It succeeds if exr_1 matches and expr_2 is matched against the remaining input after expr_1 and also succeeds, and so on.

The important thing is that a sequence should always contain at least one Expression.

Expressions

Expressions are the core building blocks that make up the grammar and match test.

Character Notation

A character in zapp is the same thing as a zig character.

  • it accepts any byte (except zero)
  • it supports common escape sequences (\n, \t, \r)
  • it supports hexadecimals (\xNN like in zig) except for \x00

String Expression

The String match Expression matches a string, it works with single quotes or double quotes. A string is made up of bytes which get matched sequentially.

Examples would be:

"Hello\nWorld!\x12"

or

'single quotes'

Class Expression

A class matches only one character at a time, but it allows multiple characters at once, An example would be:

[a-zA-Z0-9_]

Which is made up of single character matches (_) and ranges (a-z, A-Z, 0-9) which match everything from the character left of the - to the character right of the - (including the both ends).

NOTE: Some Pattern matching languages define [^ ...] (a class starting with a ^) to invert the class. This is not the case here.

Additionally characters can be defined more than once in a class (it serves no purpose but generates no error message).

Dot Expression

The Dot Expression just accepts every character.

for example:

Root = . ;

is equivalent to:

Root = [\x01-\xff] ;

Nonterminal Expression

A Nonterminal Expression is the case where the expression matches and fails exactly when a rule matches or fails.

An Example would be:

Top = Bottom ;

Bottom = "Hello World" ;

Let's walk through the example. The Rule Bottom Has only one sequence which has only on expression, which only accepts the string Hello World. Top also has just one sequence with one Expression, but this one is a nonterminal match to the Rule bottom. This means that Top and Bottom act exactly the same way.

Grouping Expression

A Grouping Expression contains a Sequence in Parentheses. An example would be:

Hello = ("Hello" [ \t] "World") ;

Which is Equivalent to:

Hello = "Hello" [ \t] "World" ;

Epsilon Expression

As Sequences have to contain at least one Expression there would be no way for them to accept everything without consuming any character.

Epsilon is used for that, it is represented by an @.

An example would be:

Root = "Hello" @ [ ] @ "World" ;

Which is Equivalent to:

Root = "Hello" [ ] "World" ;

Cut Operator

The cut operator (denoted by a ^) is fundamentally an Expression, with no operators allowed on it. it is used to change the backtracking path of the parser when in a sequence.

For example:

Root 
    = "Hello" ^ " World"
    | @
    ;

Would always accept if the cut operator was not there (because of the @ as last sequence). Because of the cut operator, if "Hello" is in the input but not " World", Root would fail.

So one could think of the cut operator as some sort of if then else (if "Hello" then " World" else @).

The Cut operator helps for performance reasons as is creates less backtracking and can reduce memory consumption, as the memoization table is reset when the backtracking stack is empty.

Repetition Operators

There Are some operators that can be added to Expressions to make them match more complex strings.

Star Operator

The star operator means that the expression is matched repeatedly zero or more times. An example would be:

Addition = [0-9] ('+' [0-9])* ;

Which matches all addition expressions with one digit natural numbers.

It is exactly equivalent to:

Addition = [0-9] Repetition ; 

Repetition
    = '+' [0-9] Repetition
    | @
    ;

Plus Operator

The Plus Operator means that the expression is matched one or more times. For example:

Hello = "Hello"+ ;

is equivalent to:

Hello = "Hello" "Hello"* ;

Question Operator

The Question Operator means that the expression is matched zero or one time. For example:

Greeting = "Hello" " World"? ;

is equivalent to:

Greeting = "Hello" World ;

World 
    = " World"
    | @
    ;

NOTE: World could not be written with the two operators inverted as it would alway accept @ and never match " World"

Lookahead Operators

A Lookahead Operator on an expression means it will not consume the text it matches.

Positive Lookahead

Positive Lookahead means (symbolized by an &) means that the expression it operates on will have to match the text to not fail, but will not consume it.

For example:

Root = &"Hello" FollowUp;

FollowUp 
    = "Hello to you"
    | "Hello World"
    | "How Are You"
    ; 

Root will never accept "How Are You" because of the Positive lookahead that Would Fail if the start of the parsed string is not Hello.

Negative Lookahead

Negative Lookahead (symbolized with a !) is the inverse of positive lookahead, so it accepts iff the expression fails.

For example if we modified the last example:

Root = !"Hello" FollowUp;

FollowUp 
    = "Hello to you"
    | "Hello World"
    | "How Are You"
    ; 

Then Root would only accepts the input How Are You because it fails on Hello.

Comments

In zapp there are two types of simgle-line comments, simple comments:

// some string

and transparent comments (with a *):

//* This is a transparent comment.

which have to be at the top level of the grammar (above %% NAME) and are added to the top of the generated parser by zapp.