grmtools
grmtools is a suite of Rust
libraries and binaries for parsing text, both at compile-time and run-time.
Most users will probably be interested in the compile-time Yacc feature, which
allows traditional .y
files to be used mostly unchanged in Rust. See the
Quickstart Guide for a quick introduction to this feature.
Quickstart Guide
Most users will probably be interested in the compile-time Yacc feature of
grmtools, which allows traditional .y
files to be used mostly unchanged in
Rust. This page is a short guide to get you up and running with this feature as
quickly as possible.
grmtools includes both a Yacc-style LR parser (lrpar
) and a
lex-style lexer (lrlex
). The lexer breaks input up into individual lexemes and
the parser checks to see if the lexemes conform to a grammar. As the parser
executes, it can either create a generic parse tree, or execute user-specified
Rust code.
A calculator evaluator
Let's assume we want to create a simple calculator which can evaluate
expressions such as 2 + 3 * 4
. Assuming a fresh Rust project, we first create
a Cargo.toml
file with the following dependencies:
[package]
name = "calc"
version = "0.0.1"
authors = ["<authors>"]
edition = "2021"
[[bin]]
doc = false
name = "calc"
[build-dependencies]
cfgrammar = "0.13"
lrlex = "0.13"
lrpar = "0.13"
[dependencies]
cfgrammar = "0.13"
lrlex = "0.13"
lrpar = "0.13"
In this situation we want to statically compile the .y
grammar and .l
lexer
into Rust code. We thus need to create a
build.rs
file inside the root of our project which can process the lexer and grammar.
lrlex
provides a simple interface which does both jobs for us in one go.
Our build.rs
file thus looks as follows:
use cfgrammar::yacc::YaccKind;
use lrlex::CTLexerBuilder;
fn main() {
CTLexerBuilder::new()
.lrpar_config(|ctp| {
ctp.yacckind(YaccKind::Grmtools)
.grammar_in_src_dir("calc.y")
.unwrap()
})
.lexer_in_src_dir("calc.l")
.unwrap()
.build()
.unwrap();
}
grmtools accepts several different Yacc variants as input. In our case, we want
to execute Rust code as the input is parsed (rather than creating a generic
parse tree which we traverse later), so we specified that the yacckind
(i.e.
what variant of Yacc file we're using) is YaccKind::Grmtools
. The grammar file
is stored in src/calc.y
, but we only specify calc.y
as the filename to
lrpar
, since it searches relative to src/
automatically. Similarly, we only
needed to specify calc.l
to lrlex
.
The lexer
While Yacc-style parsing is powerful, lex-style lexing is less powerful.
grmtools allows you to use whatever lexer you want with lrpar
. Fortunately, in
this case, lrlex
is powerful enough for us. calc.l
looks as follows:
%%
[0-9]+ "INT"
\+ "+"
\* "*"
\( "("
\) ")"
[\t ]+ ;
Roughly speaking, each line after the %%
line is a regular expression (we use
the regex
crate), a space character, and a
quoted lexeme type name. For example, if the user gives us input such as 234
we will create a single lexeme with a value (234
) and a type (INT
).
The one exception is the final line: if a lexeme type name is replaced with ‘;
’
then any matching input is discarded. In this case, whitespace (tabs and spaces)
is lexed, but no lexemes are created from it.
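For example, a hypothetical extra rule (not part of calc.l) to discard #-style line comments would use the same mechanism:
#[^\n]* ;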
The grammar
Our initial version of calc.y looks as follows:
%start Expr
%%
Expr -> Result<u64, ()>:
Expr '+' Term { Ok($1? + $3?) }
| Term { $1 }
;
Term -> Result<u64, ()>:
Term '*' Factor { Ok($1? * $3?) }
| Factor { $1 }
;
Factor -> Result<u64, ()>:
'(' Expr ')' { $2 }
| 'INT'
{
let v = $1.map_err(|_| ())?;
parse_int($lexer.span_str(v.span()))
}
;
%%
// Any functions here are in scope for all the grammar actions above.
fn parse_int(s: &str) -> Result<u64, ()> {
match s.parse::<u64>() {
Ok(val) => Ok(val),
Err(_) => {
eprintln!("{} cannot be represented as a u64", s);
Err(())
}
}
}
The grammar is in 3 parts, separated by the %%
lines.
The first part specifies general settings for the grammar, at a minimum the
start rule (%start Expr
).
The second part is the Yacc
grammar. It consists of 3
rules (Expr
, Term
, and Factor
) and 6 productions (2 for each rule,
separated by |
characters). Because we are using the Grmtools
Yacc variant,
each rule has a Rust type associated with it (after ->
) which specifies the
type that each production’s action must return. A production (sometimes called an “alternative”)
consists of zero or more symbols. Symbols either reference rules or lexemes. If a
production matches text, its ”action” (the Rust code between curly brackets at
the end of the production) is executed.
lrpar
's actions are subtly different to Yacc. The $x
variables refer to
the respective symbol in the production, numbered from 1 (i.e. $1
refers to the first symbol in
the production). If the symbol references a rule R
then an instance of
R
's type will be stored in the $x
variable; if the symbol references a lexeme then an Option<Lexeme>
instance is returned. A special $lexer
variable allows access to the lexer.
The most commonly used function that $lexer
exposes is the span_str
function, which allows us to extract &'input str
s from a Span
(e.g. to
extract the string represented by a Lexeme
, we would use
$lexer.span_str(lexeme.span())
). As this may suggest, actions may also
reference the special lifetime 'input
(without any $
prefix), which allows
strings to be returned / stored by the grammar without copying memory.
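For example, a rule can return a &'input str directly. This is a hypothetical sketch (the rule name and the 'ID' lexeme are not part of the calc example):
Ident -> Result<&'input str, ()>:
    'ID' { Ok($lexer.span_str($1.map_err(|_| ())?.span())) }
    ;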
The third part is arbitrary Rust code which can be called by productions’
actions. In our case we have a simple function which converts integers as
strings into integers as u64s: if the user provides an invalid number (e.g.
one that is too big), an error message is printed and Err(()) is returned.
This example uses a common grmtools idiom: making use of Result
types. This
allows us to deal with two different issues that prevent evaluation.
First is the “obvious” issue of integers which are too big to represent as
u64
s: these cause Err
to be percolated upwards, preventing evaluation.
Second is the issue of error recovery telling us that the user should have
inserted an integer: since it would be confusing for us to insert a default
value in such cases, we map_err
such cases to Err
, preventing evaluation.
See the section below on error recovery for more details
about error recovery.
Putting everything together
The build.rs
file will statically compile the lexer and grammar into Rust code
that we can then call. The src/main.rs
file below provides a simple
Python-esque REPL to the user into which they can write calculator expressions:
use std::io::{self, BufRead, Write};
use lrlex::lrlex_mod;
use lrpar::lrpar_mod;
// Using `lrlex_mod!` brings the lexer for `calc.l` into scope. By default the
// module name will be `calc_l` (i.e. the file name, minus any extensions,
// with a suffix of `_l`).
lrlex_mod!("calc.l");
// Using `lrpar_mod!` brings the parser for `calc.y` into scope. By default the
// module name will be `calc_y` (i.e. the file name, minus any extensions,
// with a suffix of `_y`).
lrpar_mod!("calc.y");
fn main() {
// Get the `LexerDef` for the `calc` language.
let lexerdef = calc_l::lexerdef();
let stdin = io::stdin();
loop {
print!(">>> ");
io::stdout().flush().ok();
match stdin.lock().lines().next() {
Some(Ok(ref l)) => {
if l.trim().is_empty() {
continue;
}
// Now we create a lexer with the `lexer` method with which
// we can lex an input.
let lexer = lexerdef.lexer(l);
// Pass the lexer to the parser and lex and parse the input.
let (res, errs) = calc_y::parse(&lexer);
for e in errs {
println!("{}", e.pp(&lexer, &calc_y::token_epp));
}
match res {
Some(Ok(r)) => println!("Result: {:?}", r),
_ => eprintln!("Unable to evaluate expression.")
}
}
_ => break
}
}
}
We can now cargo run
our project and evaluate simple expressions:
>>> 2 + 3
Result: 5
>>> 2 + 3 * 4
Result: 14
>>> (2 + 3) * 4
Result: 20
Error recovery
Because powerful error recovery is built into lrpar
, we can even make minor
errors and have the system recover automatically:
>>> 2 + + 3
Parsing error at line 1 column 5. Repair sequences found:
1: Delete +
2: Insert INT
Result: 5
>>> 2 + 3 3
Parsing error at line 1 column 7. Repair sequences found:
1: Delete 3
2: Insert +
3: Insert *
Result: 5
>>> 2 + 3 4 5
Parsing error at line 1 column 7. Repair sequences found:
1: Insert *, Delete 4
2: Insert +, Delete 4
3: Delete 4, Delete 5
4: Insert *, Shift 4, Delete 5
5: Insert *, Shift 4, Insert +
6: Insert *, Shift 4, Insert *
7: Insert +, Shift 4, Delete 5
8: Insert +, Shift 4, Insert +
9: Insert +, Shift 4, Insert *
Result: 17
Note that we didn't have to do anything clever in order for error recovery to
happen: it happens by default, and it works with whatever grammar we throw at
it. The way to read the resulting error messages is that each numbered repair
sequence is a way that the error recovery system found to make sense of the
input. For example, for the input 2 + + 3
, an error is detected at the second
+
: we could either delete the second +
(option 1 above) or insert an
integer. In all cases, error recovery applies repair sequence 1, and continues
parsing. 2 + + 3
was thus parsed as if the user had written 2 + 3
,
hence why it evaluated to 5. Similarly, 2 + 3 4 5
was parsed as if the user
had written 2 + 3 * 5
.
Error recovery opens up a number of possibilities to customise and streamline
the user experience. For example, the simple approach above refuses to evaluate the expression if the user provides a non-u64 number or if error recovery inserts an integer.
For more details about the possibilities, see the section on error
recovery.
Lexing
Lexing is the act of taking in an input stream and splitting it into lexemes.
Colloquially, lexing is often described as splitting input into words. In
grmtools
, a Lexeme has a type (e.g. "INT", "ID"), a value (e.g. "23",
"xyz"), and knows which part of the user's input matched (e.g. "the input
starting at index 7 to index 10"). There is also a simple mechanism to
differentiate lexemes of zero length (e.g. DEDENT
tokens in Python) from
lexemes inserted by error recovery.
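As a minimal sketch of this interface (the token ID and byte offsets here are arbitrary, chosen only for illustration):

use lrlex::DefaultLexeme;
use lrpar::Lexeme;

fn main() {
    // A lexeme with (hypothetical) token ID 0, starting at byte 7, 3 bytes long.
    let lex = DefaultLexeme::new(0u32, 7, 3);
    // The span records which part of the input the lexeme covers...
    assert_eq!((lex.span().start(), lex.span().end()), (7, 10));
    // ...and `faulty` reports whether it was inserted by error recovery.
    assert!(!lex.faulty());
}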
lrpar
provides a generic lexing interface which any lexer can plug into.
Many lexing tasks can easily be carried out by lrlex
, a
lex
replacement. lrlex
also provides helper functions which make it easier
to hand-write lexers.
Lex compatibility
grmtools currently supports one common use of Lex, which is to produce a sequence of tokens. All Lex files require at least some porting to grmtools, though in many cases this is fairly trivial. Nevertheless, aspects such as the longest match rule are identical to Lex, and we assume familiarity with Lex syntax and its major features: the Lex manual is recommended reading.
Major differences
There are several major differences between Lex and grmtools:
- Lex has its own regular expression language whereas grmtools uses the well known Rust regex crate for regular expressions. These two regular expression languages are very similar, but complex regular expressions might not be supported under one or the other.
- Lex files consist of a sequence of regular expressions and an action for each. grmtools lex files consist of a sequence of regular expressions and a token name. Actions are not currently supported (and, by extension, nor are special action expressions such as ECHO and REJECT).
- Both Lex and grmtools lex files support start conditions as an optional prefix to regular expressions, listing the states in which the expression is considered for matching against the input. Lex uses a special action expression BEGIN(state) to switch to the named state. Start states in grmtools are described in start_states.
- Character sets, and changes to internal array sizes, are not supported by grmtools.
- Escape sequences: in addition to the escape sequences involved in the escaping of regular expressions, Lex and grmtools support the escape sequences \123 (octal), \x1234 (hexadecimal), and the ASCII escape sequences \\ \a \f \n \r \t \v. Lex also interprets the escape sequence \b as backspace; since regex treats \b as a word boundary, grmtools does too. Additional escape sequences supported by regex are the \u1234 and \U12345678 escape sequences for unicode characters, the \p and \P unicode character classes, as well as the \d \D \s \S \w \W perl character classes, and the \A \b \B \z escape sequences. Both Lex and grmtools support escaping arbitrary characters: for all other characters besides those listed above, an escaped character \c is passed to the regex engine as the character c. This is useful when a character has a meaning within the lex format. An example of this is when the character < is used at the beginning of a regex: both Lex and grmtools interpret this as the beginning of a start condition prefix, which can be escaped with \< to ensure it is treated as the start of a regular expression. The characters to which this behaviour applies are affected by the escape sequence differences listed above.
- Lex treats lines in the rules section beginning with whitespace as code to be copied verbatim into the generated lexer source. grmtools lex does not support these and produces an error.
Hand-written lexers
lrpar
provides a generic lexing interface which any lexer can plug into.
Users can provide
one or both of a custom lexeme type -- conforming to
lrpar::Lexeme
-- and a custom lexing type -- conforming to
lrpar::NonStreamingLexer
.
If you wish to use a custom lexer, you will need to instantiate lrpar
appropriately (both
CTParserBuilder
and
RTParserBuilder
).
For many purposes, the low-level control and performance that lrpar
gives you is unneeded,
and the boiler-plate that comes with it unwanted. Fortunately, lrlex
provides the following convenience mechanisms to make it easier to use a hand-written lexer with lrpar
:
- lrlex's normal LRNonStreamingLexer struct can be instantiated by an end-user with an input stream, a list of lexemes created from that input stream, and the newlines encountered while lexing that input stream. This saves having to define a custom instance of the lrpar::NonStreamingLexer trait.
- lrlex's DefaultLexeme struct can also be instantiated by end-users, saving having to define a custom instance of the lrpar::Lexeme trait.
- lrlex exposes a ct_token_map function to be used from build.rs scripts which automatically produces a Rust module with one constant per token ID. ct_token_map is explicitly designed to be easy to use with lrpar's compile-time building.
Putting these together is then relatively easy. First a build.rs
file for a
hand-written lexer will look roughly as follows:
use cfgrammar::yacc::YaccKind;
use lrlex::{ct_token_map, DefaultLexerTypes};
use lrpar::CTParserBuilder;

fn main() {
    let ctp = CTParserBuilder::<DefaultLexerTypes<u8>>::new()
        .yacckind(YaccKind::Grmtools)
        .grammar_in_src_dir("grammar.y")
        .unwrap()
        .build()
        .unwrap();
    ct_token_map::<u8>("token_map", ctp.token_map(), None).unwrap()
}
This produces a module that can be imported with lrlex_mod!("token_map")
. The
module will contain one constant, prefixed with T_, per token identifier in the
grammar. For example, for the following grammar excerpt:
Expr -> Result<u64, ()>:
Expr 'PLUS' Term { Ok($1? + $3?) }
| Term { $1 }
;
the module will contain const T_PLUS: u8 = ...;
.
Since Yacc grammars can contain token identifiers which are not valid Rust
identifiers, ct_token_map
allows you to provide a map from the token
identifier to a "Rust friendly" variant. For example, for the following grammar
excerpt:
Expr -> Result<u64, ()>:
Expr '+' Term { Ok($1? + $3?) }
| Term { $1 }
;
we would provide a map '+' => 'PLUS'
leading, again, to a constant T_PLUS
being defined.
One can then write a simple custom lexer which lexes all the input in one go
and returns an LRNonStreamingLexer
as follows:
use cfgrammar::NewlineCache;
use lrlex::{lrlex_mod, DefaultLexeme, DefaultLexerTypes, LRNonStreamingLexer};
use lrpar::{lrpar_mod, Lexeme, NonStreamingLexer, Span};

lrlex_mod!("token_map");
use token_map::*;

fn lex(s: &str) -> LRNonStreamingLexer<DefaultLexerTypes<u8>> {
    let mut lexemes = Vec::new();
    let mut newlines = NewlineCache::new();
    let mut i = 0;
    while i < s.len() {
        if i == ... {
            lexemes.push(DefaultLexeme::new(T_PLUS, i, ...));
        } else {
            ...
        }
    }
    LRNonStreamingLexer::new(s, lexemes, newlines)
}
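Such a lexer can then be used just like an lrlex-generated one. The following usage sketch assumes the grammar from the build.rs above was brought into scope with lrpar_mod!("grammar.y") (so its module is grammar_y) and that lex is the hand-written function sketched above:

use lrpar::lrpar_mod;

lrpar_mod!("grammar.y");

fn main() {
    // `lex` is the hand-written lexing function sketched above.
    let lexer = lex("2+3");
    let (res, errs) = grammar_y::parse(&lexer);
    for e in errs {
        println!("{}", e.pp(&lexer, &grammar_y::token_epp));
    }
    if res.is_some() {
        println!("parsed successfully");
    }
}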
Start States
The following explains the syntax and semantics of Start States in lrlex.
A working example can be found in the repository at lrpar/examples/start_states
Motivation
Start states are a feature from lex which can be used for context sensitive lexing. For instance, they can be used to implement nested comments (see the example in the repository), so that the start/end markers of comments remain balanced.
This is achieved by making rules which are qualified to match only when the lexer is in a particular state. Additionally, the lexer has a stack of states, and matching rules perform actions which modify the stack.
The INITIAL start state
Unless specified otherwise all lex rules are members of the INITIAL start state.
%%
<INITIAL>a "A"
<INITIAL>[\t \n]+ ;
This is equivalent to the following lex file with no start states specified:
%%
a "A"
[\t \n]+ ;
Rules matching multiple states
Rules can be matched in multiple states; just separate the states a rule should match in with commas.
The following matches the a
character when in either of the states FirstState
or SecondState
.
<FirstState,SecondState>a "A"
Differences from POSIX lex
In POSIX lex, start states are entered via code in the action, through either BEGIN(STATE) or calling combinations of yy_push_state and yy_pop_state.
Because lrlex is actionless and does not support code actions, it instead provides operators that perform the common modifications to the stack of start states.
Push
The push operator is given by adding '+' to the target state on the right hand side within angle brackets. In the following, when the regex matches while in CURRENT_STATE, TARGET_STATE is pushed onto the top of the stack of states.
<CURRENT_STATE>Regex <+TARGET_STATE>;
Pop
The pop operator is given by adding '-' to the target state on the right hand side within angle
brackets. In the following, when the regex matches while in CURRENT_STATE, the current state is popped off the
stack of states, similar to calling yy_pop_state
from action code.
<CURRENT_STATE>Regex <-CURRENT_STATE>;
ReplaceStack
The ReplaceStack operator is given by naming the target state within angle brackets. The ReplaceStack operator clears the entire stack of states, then pushes the target state.
<CURRENT_STATE>Regex <TARGET_STATE>;
Returning a token while performing an operator.
Start state operators can be combined with returning a token for example:
<CURRENT_STATE>Regex <+TARGET_STATE>"TOKEN"
Adding a start state
Start states come in two forms, exclusive and inclusive. These are given by %x
and %s
respectively.
Exclusive states
In an exclusive state, a rule can be matched only if it is prefixed with that state.
In the following, because ExclState is exclusive, the #= rule is only matched during the INITIAL state, while the a and =# characters are only matched while in ExclState.
%x ExclState
%%
#= <+ExclState>;
<ExclState>a "A"
<ExclState>=# <-ExclState>;
Inclusive states
When the lexer is in an inclusive state, rules with no start state specified are also considered for matching.
%s InclusiveState
%%
a "A"
<InclusiveState>b "B"
<INITIAL>#= <+InclusiveState>;
<InclusiveState>=# <-InclusiveState>;
This is equivalent to the following using exclusive states.
%x Excl
%%
<INITIAL, Excl>a "A"
<Excl>b "B"
<INITIAL>#= <+Excl>;
<Excl>=# <-Excl>;
Parsing
Parsing is the act of checking whether a stream of lexemes match a grammar. Since a simple "yes/no" answer is rarely useful, it is common to execute user-defined actions during parsing.
grmtools
contains libraries (cfgrammar
and
lrtable
) which allow users to build their own LR parsers in
whatever fashion they want. However, for 99% of cases, the lrpar
library is what users want and need: a (largely) Yacc-compatible parser. Roughly
speaking, the core parts of grammars work identically in Yacc and lrpar
, but
some other parts of the system have been modernised (e.g. to avoid the use of
global variables) and given a more idiomatic Rust feel. Notably, lrpar
is
built from the ground-up to have a powerful, flexible approach to error
recovery.
Yacc compatibility
grmtools supports most major Yacc features, to the extent that many Yacc grammars can be used unchanged with grmtools. In this book we assume familiarity with Yacc syntax and its major features: the Yacc manual is recommended reading.
Major differences
There are several differences between Yacc and grmtools including:
- grmtools has no equivalent of any of the yy* functions (e.g. yyerror, yylex, yylval, yyparse and so on). This means, for example, that grammar actions cannot currently influence the lexer in any way.
- grmtools has an entirely different approach to error recovery. The token error and the special action expressions yyerrok and yyclearin are not supported. In general, users can simply remove alternatives that consist solely of error.
- %union can be mapped to %actiontype in grmtools, though this is rarely the best way of using a Yacc grammar in Rust. See the Grmtools Yacc variant below for the most common way of making grammars do something useful; in a limited number of cases (e.g. if you just want to build a parse tree), you may find the "Original" Yacc variant useful.
- grmtools allows both Yacc's %expect and Bison's %expect-rr declarations in its base "Yacc" mode.
- Bison's %parse-param can take multiple arguments. grmtools' %parse-param takes a single argument which can be a tuple, thus emulating multiple arguments while integrating naturally into Rust's type system (see the sketch after this list).
- Although rare, it is possible to generate accept/reduce conflicts (e.g. for a grammar with the sole rule A: A;). grmtools considers accept/reduce conflicts to be a hard error, and refuses to generate anything for the resulting grammar, whereas Yacc allows them through (with unclear consequences). Bison also appears to consider accept/reduce conflicts a hard error, though it appears to detect them in a more generic way (reporting such rules as "not generating any sentences").
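As a hedged illustration of the tuple idiom (the rule, parameter name, and types here are made up for this sketch, and an 'ID' lexeme is assumed), a grammar might emulate two arguments as follows:
%start R
%parse-param args: (u64, &'static str)
%%
R -> String:
    'ID' { format!("{}{}", args.0, args.1) }
    ;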
YaccKinds
Grmtools
YaccKind::Grmtools
is grmtools' own variant of Yacc syntax, and the one that
most users will want to use. The most significant difference to "normal" Yacc
is that rules are annotated with a Rust type to which all of their productions'
actions must adhere. Note that whilst a rule's productions must all adhere
to a single type, different rules can have different types. Consider the
following snippet:
R1 -> Result<i32, ()>:
'a' { Ok(5) }
| 'b' { Err(()) }
;
R2 -> u64:
| { 0 }
;
Here the rule R1
has a Rust return type of Result<X, ()>
(between ->
and
:
). Both of its productions adhere to this type, the first by instantiating
Ok(5)
and the second Err(())
. The rule R2
has a return type of u64
.
“Original” Yacc
Although the name is not fully accurate (grmtools supports a slightly disjoint subset of original Yacc's input), this mode allows users to most easily test externally created Yacc files. Several sub-variants are allowed:
- YaccKind::Original(YaccOriginalActionKind::GenericParseTree) does not execute user actions, but instead creates a generic parse tree, where elements are instances of the lrpar::parser::Node enum. This is useful for quickly testing whether a parser is accepting the intended language.
- YaccKind::Original(YaccOriginalActionKind::NoAction) parses input and reports errors but does not execute any user actions. This is useful if you are trying to find out whether a corpus of input parses successfully against your grammar or not.
- YaccKind::Original(YaccOriginalActionKind::UserAction) models original Yacc most closely but, in a Rust setting, is probably of little use beyond simple calculator like languages. Instead of Yacc's %union directive, users can specify %actiontype which is a Rust type to which every production's actions in the grammar must adhere. Unless all actions happen to naturally return the same type, this quickly becomes cumbersome to use. For most use cases, YaccKind::Grmtools is a superior alternative.
Action code and return types
Action code
Action code is normal Rust code with the addition of the following special variables:
- $1 ... $n refer to the respective symbol in the production, numbered from 1 (i.e. $1 refers to the first symbol in the production). If the symbol references a rule R then an instance of R's type will be stored in the $i variable. If the symbol references a lexeme then a Result<Lexeme<StorageT>, Lexeme<StorageT>> instance is returned where the Ok variant is used for lexemes that are directly derived from the user's input and the Err variant is used for lexemes that have been inserted by error recovery.
- $lexer allows access to the lexer and its various functions. The most commonly used of these is the span_str function, which allows us to extract &'input strs from a Span (e.g. to extract the string represented by a Lexeme, we would use $lexer.span_str(lexeme.span())). As this may suggest, actions may also reference the special lifetime 'input (without any $ prefix), which allows strings to be returned / stored by the grammar without copying memory.
- $span is a cfgrammar::Span which captures how much of the user's input the current production matched (see the sketch below).
- $$ is equivalent to $ in normal Rust code.
Any other variables beginning with $
are treated as errors.
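For example, the following hypothetical rule (assuming use cfgrammar::Span; in the grammar's final section) returns the Span covering everything the production matched alongside the first lexeme's own span:
Pair -> (Span, Span):
    'INT' 'INT'
    {
        // $1 is the first lexeme; $span covers both 'INT's and anything
        // matched in between. unwrap() will panic on inserted lexemes.
        ($1.unwrap().span(), $span)
    }
    ;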
Return types
Productions' return types can be any arbitrary Rust type. You may in addition make use of the following:
- The generic parameter StorageT references the type of lexemes and is typically used with the Lexeme type i.e. Lexeme<StorageT>. This allows you to return lexemes from rules (see the sketch below).
- The lifetime 'input allows you to extract strings whose lifetime is tied to the lexer and return them from rules / store them in structs without copying. Lexer::span_str returns such strings and the typical idiom of use is &'input str.
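For instance, a rule can hand a lexeme back to its caller unchanged. This is a hypothetical sketch, using the type StorageT = u32; alias shown in the idioms section later:
IdTok -> Result<Lexeme<StorageT>, ()>:
    "ID" { $1.map_err(|_| ()) }
    ;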
Additional parse parameter
A single extra parameter can be passed to action functions if the %parse-param <var>: <type>
declaration is used. The variable <var>
is then visible in all
action code. <type>
must implement the Clone
trait (note that Copy
bounds imply Clone
, and &
references implement Copy
).
For example if a grammar has a declaration:
%parse-param p: u64
then the statically generated parse
function will take two parameters
(lexer: &..., p: u64)
and the variable p
can be used in action code e.g.:
R -> ...:
'ID' { format!("{}{}", p, ...) }
;
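On the calling side, the extra argument is simply passed after the lexer. This is a hedged sketch, assuming the calc grammar from the quickstart has been given the %parse-param p: u64 declaration above:

use lrlex::lrlex_mod;
use lrpar::lrpar_mod;

lrlex_mod!("calc.l");
lrpar_mod!("calc.y");

fn main() {
    let lexerdef = calc_l::lexerdef();
    let lexer = lexerdef.lexer("1 + 2");
    // With `%parse-param p: u64`, the generated `parse` function takes the
    // parameter after the lexer argument.
    let (_res, _errs) = calc_y::parse(&lexer, 42u64);
}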
grmtools parsing idioms
grmtools is a flexible tool and can be used in many ways. However, for those
using the Grmtools
format, the simple idioms below can often make life easier.
Return Span
s when possible
When executing grammar actions one is often building up an Abstract Syntax Tree (AST) or equivalent. For example consider a simple language with assignments:
Assign: "ID" "=" Expr;
Perhaps the "obvious" way to build this into an AST is to extract the string representing the identifier as follows:
Assign -> ASTAssign: "ID" "=" Expr
{
let id = $lexer.span_str($1.as_ref().unwrap().span()).to_string();
ASTAssign::new(id, $3)
}
%%
struct ASTAssign {
id: String,
expr: Box<Expr>
}
impl ASTAssign {
fn new(id: String, expr: Expr) -> Self {
ASTAssign { id, expr: Box::new(expr) }
}
}
This approach is easy to work with, but isn't as performant as may be desired:
the to_string
call allocates memory and copies part of the user's input into
that. It also loses information about the part of the user's input that the
string relates to.
An alternative approach is not to convert the lexeme into a String
during
parsing, but simply to return a
Span
. An outline of this
is as follows:
Assign -> ASTAssign: "ID" "=" Expr
{
ASTAssign { id: $1.as_ref().unwrap().span(), expr: Box::new($3) }
}
%%
type StorageT = u32;
struct ASTAssign {
id: Span,
expr: Box<Expr>
}
enum Expr { ... }
If this is not quite what you want to do, you can use largely the same trick with
the Lexeme
struct
.
Working with Lexeme
s has the advantage that you can tell what the type of the
lexeme in question is, though generally this is entirely clear from AST
context, and Lexeme
's type parameter makes it marginally more fiddly to work
with than Span
.
Alternatively, if you really want to extract strings during parsing, consider
using the 'input
to extract &str
's during parsing, since this does not
cause any additional memory to be allocated.
Have rules return a Result
type
As described in the error recovery
section, it
is generally a good idea to give rules a Result
return type as this allows
you to easily stop, or change, action code execution if you encounter
"important" inserted lexemes. There are many ways that you can use this, but
many simple cases work well using either:
- Err(()) works well if you are creating a parse tree and simply want to stop creating the tree when you encounter an important inserted lexeme.
- Err(Box<dyn Error>) works well if you are performing more detailed evaluation while parsing and wish to explain to the user why you stopped evaluating when you encountered an important inserted lexeme.
Using Err(())
The idea here is that we stop evaluating normal action code by returning
Err(())
. However, this can lead to endless instances of the following
map_err
idiom:
R -> Result<..., ()>:
"ID" { $1.map_err(|_| ())? }
;
It can be helpful to define a custom map_err
function which hides some of this
mess for you:
R -> Result<Lexeme<StorageT>, ()>:
"ID" { map_err($1)? }
;
%%
fn map_err(r: Result<Lexeme<StorageT>, Lexeme<StorageT>>)
-> Result<Lexeme<StorageT>, ()>
{
r.map_err(|_| ())
}
Using Err(Box<dyn Error>)
The idea here is that we both stop evaluating normal action code, and explain
why, by returning Err(Box<dyn Error>)
. Although Box<dyn Error>
is something
of a mouthful, it allows you significant flexibility in what you return in
error situations. If you want to quickly experiment, then this is convenient
because the token type Result<Lexeme<StorageT>, Lexeme<StorageT>>
can be
automatically coerced to Box<dyn Error>
(e.g. $1?
in action code will
return the Err
variant without additional code). You can also return
strings-as-errors with Box::<dyn Error>::from("...")
.
Using this idiom we can change our calculator example to deal with many more possible sources of error:
%start Expr
%avoid_insert "INT"
%%
Expr -> Result<u64, Box<dyn Error>>:
Expr '+' Term
{
Ok($1?.checked_add($3?)
.ok_or(Box::<dyn Error>::from("Overflow detected."))?)
}
| Term { $1 }
;
Term -> Result<u64, Box<dyn Error>>:
Term '*' Factor
{
Ok($1?.checked_mul($3?)
.ok_or(Box::<dyn Error>::from("Overflow detected."))?)
}
| Factor { $1 }
;
Factor -> Result<u64, Box<dyn Error>>:
'(' Expr ')' { $2 }
| 'INT'
{
parse_int(
$lexer.span_str(
$1.map_err(|_| "<evaluation aborted>")?.span()))
}
;
%%
// Any imports here are in scope for all the grammar actions above.
use std::error::Error;
fn parse_int(s: &str) -> Result<u64, Box<dyn Error>> {
match s.parse::<u64>() {
Ok(val) => Ok(val),
Err(_) => {
Err(Box::from(
format!("{} cannot be represented as a u64", s)))
}
}
}
Define a flatten
function
Yacc grammars make specifying sequences of things something of a bore. A common idiom is thus:
ListOfAs -> Result<Vec<A>, ()>:
A { Ok(vec![$1?]) }
| ListOfAs A
{
let mut $1 = $1?;
$1.push($2?);
Ok($1)
}
;
A -> Result<A, ()>: ... ;
Since this idiom is often present multiple times in a grammar, it's generally
worth adding a flatten
function to hide some of this:
ListOfAs -> Result<Vec<A>, ()>:
A { Ok(vec![$1?]) }
| ListOfAs A { flatten($1, $2) }
;
A -> Result<A, ()>: ... ;
%%
fn flatten<T>(lhs: Result<Vec<T>, ()>, rhs: Result<T, ()>)
-> Result<Vec<T>, ()>
{
let mut flt = lhs?;
flt.push(rhs?);
Ok(flt)
}
Note that flatten
is generic with respect to T
so that it can be used in
multiple places in the grammar.
Composing idioms
The above idioms compose well together. For example, flatten
, map_err
, and
Lexeme
can be used together as shown in the following example:
ListOfIds -> Result<Vec<Lexeme<StorageT>>, ()>:
"ID" { Ok(vec![map_err($1)?]) }
| ListOfIds "Id" { flatten($1, map_err($2)?) }
;
%%
type StorageT = u32;
fn map_err(r: Result<Lexeme<StorageT>, Lexeme<StorageT>>)
-> Result<Lexeme<StorageT>, ()>
{
r.map_err(|_| ())
}
fn flatten<T>(lhs: Result<Vec<T>, ()>, rhs: Result<T, ()>)
-> Result<Vec<T>, ()>
{
let mut flt = lhs?;
flt.push(rhs?);
Ok(flt)
}
Error recovery
One of lrpar
's most powerful features is its approach to error recovery, which
can be used with any grammar. This section outlines the background to error
recovery, the choices that users can make, and how to best make use of this
feature.
Error recovery background
Programmers frequently make mistakes when entering input, either because of simple typos, or an outright failure to use the correct syntax. Happily, LR parsing guarantees to report syntax errors at the first point that an error can be definitively proven to have occurred (though note that this might not be the same point that a user would consider the error to have been made). It has long been a goal of parsing technologies to recover from such errors, and allow parsing to continue. This allows users to fix all their syntax errors in one go and, optionally, post-parsing phases to operate as if no syntax errors had been made at all. For example, a compiler author might decide to run the compiler's static type checker even in the presence of syntax errors (since many static type errors are unaffected by syntax errors), but not generate code (which might incorrectly give users the illusion that their code is safe to run).
However, most mainstream parsers do a bad job of error recovery. The most common generic error recovery algorithm is "panic mode" (in reality, a family of algorithms). Unfortunately such simple error recovery algorithms do a poor job of recovering from syntax errors, causing a cascade of spurious further syntax errors to be reported. Programmers quickly learn that only the first reported syntax error can be trusted on to be correct.
lrpar
implements the CPCT+
error recovery algorithm from Reducing
Cascading Parsing Errors Through Fast Error
Recovery, which, in our biased opinion, does
a better job than previous approaches. It is fast, grammar neutral, and reports
multiple repair sequences to users, allowing them to consider which best
matches their intentions.
No matter how clever we think CPCT+
is, it is important to understand that it has
a fundamental limitation: it only knows about a language's syntax; it has no
concept of the language's semantics beyond that implied by the structure of the
grammar; and it cannot control what the user does with the result of error
recovery. Thus, grammar writers can significantly influence how useful error
recovery is for users. Most of the rest of this section explains how best to
make use of error recovery.
Error recovery basics
A simple calculator grammar looks as follows:
%start Expr
%%
Expr -> u64:
Expr '+' Term { $1 + $3 }
| Term { $1 }
;
Term -> u64:
Term '*' Factor { $1 * $3 }
| Factor { $1 }
;
Factor -> u64:
'(' Expr ')' { $2 }
| 'INT' { parse_int($lexer.span_str($1.unwrap().span())) }
;
%%
// Any functions here are in scope for all the grammar actions above.
fn parse_int(s: &str) -> u64 {
match s.parse::<u64>() {
Ok(val) => val,
Err(_) => panic!("{} cannot be represented as a u64", s)
}
}
For this simplification we need to make a small tweak to our main.rs
changing:
match res {
Some(Ok(r)) => println!("Result: {}", r),
_ => eprintln!("Unable to evaluate expression.")
}
to:
match res {
Some(r) => println!("Result: {}", r),
_ => eprintln!("Unable to evaluate expression.")
}
For many examples, this simple grammar and its actions work well leading to output such as the following:
>>> 2 + + 3
Parsing error at line 1 column 5. Repair sequences found:
1: Delete +
2: Insert INT
Result: 5
Insert x
means “error recovery inserted a lexeme of type x”; Delete x
means “error recovery deleted the next lexeme in the stream”; and Shift x
means “error recovery kept the user’s lexeme x as-is”.
Repair sequences are minimal ways of adjusting the user’s input such that it
becomes correct relative to the underlying grammar. Intuitively, in this
example, the two repair sequences would adjust the input to be equivalent to
2 + 3
(repair sequence 1) or 2 + <some int> + 3
(repair sequence 2). When
more than one repair sequence is presented to the user, the first is used by the
algorithm to continue parsing: in this case, the input was parsed as if it was
equivalent to 2 + 3
, hence the evaluation of the input to 5
.
Repair sequences can, as their name suggests, be of arbitrary length:
>>> 2 + 3 4 5
Parsing error at line 1 column 7. Repair sequences found:
1: Insert *, Delete 4
2: Insert +, Delete 4
3: Delete 4, Delete 5
4: Insert *, Shift 4, Delete 5
5: Insert *, Shift 4, Insert +
6: Insert *, Shift 4, Insert *
7: Insert +, Shift 4, Delete 5
8: Insert +, Shift 4, Insert +
9: Insert +, Shift 4, Insert *
Result: 17
In this case, the first repair sequence caused the input to be parsed as if it
was equivalent to 2 + 3 * 5
, hence the evaluation of the input to 17
.
Syntax errors and language semantics
Our example inputs so far have deliberately exploited cases where the first
repair sequence at worst inserted “unimportant” lexemes such as +
and *
.
Since the grammar’s actions never read the values of such lexemes, only their
type is important. However, what should happen if error recovery inserts an
integer, whose value is later read by one of the grammar’s actions? An example
shows the unhappy result:
>>> 2+
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Lexeme { start: 2, len: 4294967295, tok_id: 4 }', libcore/result.rs:1009:5
note: Run with `RUST_BACKTRACE=1` for a backtrace.
>>>
In this case, the first repair sequence was Insert INT
. The fundamental
problem is that while error recovery can adjust the user’s input to insert a
lexeme of type INT
, neither it nor the parser have any idea what value might
have made sense for that lexeme. Thus the expression above caused the expression
$lexer.span_str($1.unwrap().span())
to panic, since $1
was Err(<lexeme>)
.
It is thus up to the user to decide what to do in the face of the inevitable semantic issues that error recovery highlights. Fortunately, this is generally simpler than it sounds with only a slight rethink in the way that we tend to write a grammar's actions.
A rule of thumb: have rules return a Result
type
Although rules can have any Rust type you can imagine, using a Result
type
allows a (deliberately) simple interaction with the effects of error recovery.
The basic idea is simple: in actions, we ignore lexemes whose value we don't
care about (e.g. brackets); for lexemes whose value we care about, we either
introduce a default value, or percolate an Err
upwards. Default values make
sense in certain situations. For example, if you're writing a compiler, and want
to run a static type checker even after syntax errors, it might make sense to
assume that Insert 0
is a good substitute for Insert INT
. However, in the
case of the calculator, default values are likely to lead to confusing results.
We thus change the grammar so that inserted integers prevent evaluation from
occurring:
%start Expr
%%
Expr -> Result<u64, ()>:
Expr '+' Term { Ok($1? + $3?) }
| Term { $1 }
;
Term -> Result<u64, ()>:
Term '*' Factor { Ok($1? * $3?) }
| Factor { $1 }
;
Factor -> Result<u64, ()>:
'(' Expr ')' { $2 }
| 'INT' { parse_int($lexer.span_str($1.map_err(|_| ())?.span())) }
;
%%
// Any functions here are in scope for all the grammar actions above.
fn parse_int(s: &str) -> Result<u64, ()> {
match s.parse::<u64>() {
Ok(val) => Ok(val),
Err(_) => panic!("{} cannot be represented as a u64", s)
}
}
The basic idea here is that every action returns an instance of Result<u64, ()>
: if we receive Ok(u64)
we successfully evaluated the expression, but if
we received Err(())
we were not able to evaluate the expression. If we
encounter an integer lexeme which is the result of error recovery, then the
INT
lexeme in the second Factor
action will be Err(<lexeme>)
. By writing
$1.map_err(|_| ())?
we’re saying “if the integer lexeme was created by error
recovery, percolate Err(())
upwards”. We then have to tweak a couple of other
actions to percolate errors upwards, but this is a trivial change. We'll also need to
change main.rs
back to expecting a Result
.
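In other words, the match over res reverts to the Result-aware form shown in the quickstart:

match res {
    Some(Ok(r)) => println!("Result: {}", r),
    _ => eprintln!("Unable to evaluate expression.")
}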
Now the input which previously caused a panic simply tells the user that it could not evaluate the expression:
>>> 2+
Parsing error at line 1 column 3. Repair sequences found:
1: Insert INT
Unable to evaluate expression.
Usefully, our inability (or unwillingness) to evaluate the expression does not prevent further syntax errors from being discovered and repaired:
>>> (2+)+3+4+
Parsing error at line 1 column 4. Repair sequences found:
1: Insert Int
Parsing error at line 1 column 10. Repair sequences found:
1: Insert Int
Unable to evaluate expression.
Using a Result
type allows the user arbitrary control over the classes of
syntax errors they are prepared to deal with or not. For example, we could
remove the panic
from parse_int
by making the rules have a type Result<u64, String>
where the Err
case would report a string such as “18446744073709551616 cannot
be represented as a u64” for the first unrepresentable u64
in the user's
input. If we wanted to report all unrepresentable u64
s, we could have
the rules have a type Result<u64, Vec<String>>
, though merging together the errors found
on the left and right hand sides of the +
and *
operators requires adding a
few lines of code.
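A hedged sketch of such a merge (the function name and error strings are made up for illustration):

// Merge the errors found on both sides of a binary operator when rules have
// the type Result<u64, Vec<String>>.
fn merge_add(lhs: Result<u64, Vec<String>>, rhs: Result<u64, Vec<String>>)
    -> Result<u64, Vec<String>>
{
    match (lhs, rhs) {
        // Both sides evaluated: add them, reporting overflow as an error.
        (Ok(l), Ok(r)) => l.checked_add(r)
            .ok_or_else(|| vec!["Overflow detected.".to_string()]),
        // Both sides failed: concatenate their error lists.
        (Err(mut l), Err(r)) => { l.extend(r); Err(l) }
        // Exactly one side failed: pass its errors through.
        (Err(e), _) | (_, Err(e)) => Err(e),
    }
}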
Making use of %epp
for easier to read repair sequences
By default, pretty-printing lexeme types prints out their identifier in the grammar. Up to now, we have used lexeme types when showing output to the user. While the lexeme types are sometimes adequate for this purpose, this is not always the case. Consider this lex file:
%%
[0-9]+ "INT"
\+ "PLUS"
\* "MUL"
\( "LBRACK"
\) "RBRACK"
[\t ]+ ;
The user would see output such as:
>>> 2 3
Parsing error at line 1 column 3. Repair sequences found:
1: Delete 3
2: Insert PLUS
3: Insert MUL
Result: 2
What are PLUS
and MUL
? These might be semi-obvious, but many lexeme types
are far from obvious. grmtools
allows users to provide human friendly versions
of these for error recovery using the %epp
declaration in grammars. For
example, we can extend the calc
grammar as follows:
%epp PLUS "+"
%epp MUL "*"
%epp LBRACK "("
%epp RBRACK ")"
%epp INT "Int"
leading to the following output:
>>> 2 3
Parsing error at line 1 column 3. Repair sequences found:
1: Delete 3
2: Insert +
3: Insert *
Result: 2
Biasing repair sequences
Depending on your language, some repair sequences are better than others. For
example, sometimes Insert
repairs are less welcome than Delete
repairs:
>>> 2 + + 3
Parsing error at line 1 column 3. Repair sequences found:
1: Insert INT
2: Delete +
Unable to evaluate expression.
>>> 2 + + 3
Parsing error at line 1 column 3. Repair sequences found:
1: Delete +
2: Insert INT
Result: 5
Why does the same input sometimes produce a result and sometimes fail to produce
a result? The problem is that 2 + + 3
has two repair sequences Delete +
and
Insert Int
. As things stand, both are equally good, and so one is chosen
non-deterministically. If Insert Int
is chosen, we hit the Err
case from
earlier, and fail to produce a result; if the Delete
case is chosen, we can
produce a result.
To lessen this problem, the %avoid_insert L
directive causes grmtools to
prefer repair sequences that don't include Insert L
over those that do.
Intuitively, we want to annotate lexemes whose value we care about in this
way (e.g. INT
), but we don't need to worry about lexemes whose value we never
expect (e.g. (
, +
etc.). In the case of the calculator grammar a good
use of this directive is as follows:
%avoid_insert "INT"
With this, the Delete +
repair sequence is consistently favoured over Insert INT
.
Turning lexing errors into parsing errors
Most lexers do not have lexical rules for all possible inputs. For example, our
running calculator example has no lexical rule for the character @
. Typically
this causes the lexer to generate an error and stop lexing further. For example
with lrlex
we would encounter the following:
>>> 2@3
Lexing error at line 1 column 2.
This error message is correct, but not as helpful as we might like (what is the error specifically?). Furthermore, any further errors in the input will not be found until the lexing error is fixed.
Fortunately we can fix this easily for nearly all grammars by adding a line
similar to this to the end of your .l
file:
. "UNMATCHED"
Any single character which is not matched by any other lex rule will now lead
to a token of type UNMATCHED
. Note that it is vital that this is the last rule
in your .l
file, and that only a single character is matched, otherwise you
will incorrectly lex correct input as UNMATCHED
!
We then need to add a dummy rule to your .y
file, simply so that lrpar
knows about UNMATCHED
tokens. This dummy rule won't be referenced by other
rules, so its return type and action are irrelevant. The simplest example is
thus:
Unmatched -> ():
"UNMATCHED" { }
;
Assuming you have the "warnings are errors" option set to true (its default),
you will then receive a warning about the unused rule (Unmatched
) and token
(UNMATCHED
). You can inform grmtools that you expect both to be unused by
adding this declaration in the top part of your .y
file:
%expect-unused Unmatched "UNMATCHED"
With this done, all possible input will be lexed, and what were previously lexing errors are now parsing errors. This means that error recovery kicks in, giving us more detailed and informative errors, and ensuring that multiple "lexing" errors are reported at once:
>>> 2@3+4+5+6@7
Parsing error at line 1 column 2. Repair sequences found:
1: Delete @, Delete 3
2: Insert +, Delete @
3: Insert *, Delete @
Parsing error at line 1 column 10. Repair sequences found:
1: Insert +, Delete @
2: Delete @, Delete 7
3: Insert *, Delete @
Result: 24
Under the bonnet
For any given syntax error there are, potentially, a finite but vast number of
possible valid repair sequences: far too many to exhaustively search. Error
recovery algorithms such as CPCT+
use various heuristics to cut the search space
down to something that is (generally) manageable. Although surprisingly rare in
practise, this inevitably leads to occasional situations where the repair
sequences found (or, more accurately, those not found) surprise humans.
Timeout
The first surprising condition is that even with the small calc
grammar, some
user inputs lead to such a massive search space that no repair sequences can be
found. The easiest way to trigger this in most grammars is bracket expressions:
>>> 1+(
Parsing error at line 1 column 4. Repair sequences found:
1: Insert Int, Insert )
Unable to evaluate expression.
>>> 1+((
Parsing error at line 1 column 5. Repair sequences found:
1: Insert Int, Insert ), Insert )
Unable to evaluate expression.
>>> 1+(((((((((((
Parsing error at line 1 column 14. No repair sequences found.
Unable to evaluate expression.
At a certain number of open brackets (which will partly depend on the speed of
your machine), CPCT+
simply cannot find suitable repair sequences within its
internal timeout, hence the “No repair sequences found” message. In practise
this happens in less than 2% of real-world inputs, so it is not a significant
worry.
Some “obvious” repair sequences aren't reported at the end of a file
The second surprising condition is more subtle. Before we can show the issue, we
need to introduce the concept of repair sequence ranking: CPCT+
only presents the
lowest cost repair sequences to users (where Insert
s and Delete
s cost 1, and
Shift
s cost 0). Higher cost repair sequences are discarded.
In an ideal world, CPCT+
would find repair sequences that allow a file to parse
completely successfully. In practice, this is only feasible if a syntax error
occurs near the very end of the input. In most cases, CPCT+
is happy with a
weaker condition, which is that a repair sequence ends with 3 Shift
repairs,
showing that parsing has got back on track, at least for a little bit. This
condition explains the following:
>>> 2 + + 3
Parsing error at line 1 column 5. Repair sequences found:
1: Delete +
2: Insert Int
Result: 5
>>> 2 + + 3 +
Parsing error at line 1 column 5. Repair sequences found:
1: Insert Int
Parsing error at line 1 column 10. Repair sequences found:
1: Insert Int
Unable to evaluate expression.
For 2 + + 3
we match the human intuition that the input could have been 2 + 3
or 2 + <some int> + 3
. However, for the input 2 + + 3 +
we do not report
a Delete +
repair sequence for the first error in the input. Why?
The first thing we need to know is that repair sequences are always reported
with trailing Shift
repairs pruned: for the rest of this subsection it aids
understanding to leave them unpruned. Thus, for 2 + + 3
, the two repair
sequences found are Delete +, Shift 3
and Insert Int, Shift +, Shift 3
, both
of which cause the entire input to parse successfully, and both of which have
the same cost.
For 2 + + 3 +
, however, the first error leads to 3 repair sequences, Insert Int, Shift +, Shift 3, Shift +
, Delete +, Shift 3, Delete
or Delete +, Shift 3, Shift +, Insert Int
: the latter two are not even completed since they're
provably higher than the Insert Int
repair sequence and thus aren’t reported
to the user.
In practise, this situation is rarer than the timeout problem, to the point that
it’s arguably not worth worrying about or explaining to end users. Even when it
happens, the repair sequences that CPCT+
reports are always correct and at least
one repair sequence will be reported (assuming that error recovery doesn't time
out!).
Error recovery on real-world grammars
Continuing the example from the nimbleparse
section, we
can see that error recovery works well on arbitrary grammars. Consider the
following syntactically incorrect Lua 5.3 program:
$ cat test.lua
x = 0
if x > 0
print("greater than")
else
print("less than"}
When run through nimbleparse
, the following output is
generated:
$ cargo run --release --bin nimbleparse lua5_3.l lua5_3.y test.lua
...
Error at line 3 col 4. Repair sequences found:
1: Insert then
Error at line 5 col 21. Repair sequences found:
1: Insert ), Insert end, Delete }
2: Insert ), Insert {, Shift }, Insert end
Turning off error recovery
By default, lrpar
uses the CPCT+
error recovery algorithm. You can use the
None
error recovery algorithm, which causes parsing to stop as soon as it hits
the first parsing error, with the recoverer
method in CTParserBuilder
or
RTParserBuilder
. For example, we can change calc
's build.rs
file to:
CTLexerBuilder::new()
.lrpar_config(|ctp| {
ctp.yacckind(YaccKind::Grmtools)
.recoverer(lrpar::RecoveryKind::None)
.grammar_in_src_dir("calc.y")
.unwrap()
})
.lexer_in_src_dir("calc.l")?
.build()?;
and then no matter how many syntax errors we make, only one is reported:
>>> 2++3++
Parsing error at line 1 column 3. No repair sequences found.
Unable to evaluate expression.
Unless you have a good reason to do so (e.g. quickly hacking together a grammar where you would prefer not to think about error recovery at all), we do not recommend turning off error recovery.
An AST evaluator
We now know enough to put together a more sophisticated version of our simple calculator example that builds an Abstract Syntax Tree (AST) while parsing, which is then evaluated separately. This models a common way of building real compilers. The full example code can be found at https://github.com/softdevteam/grmtools/tree/master/lrpar/examples/calc_ast.
The calc.l
file remains unchanged from that in the Quickstart
guide. However the calc.y
file is changed as follows:
%start Expr
%avoid_insert "INT"
%%
Expr -> Result<Expr, ()>:
Expr '+' Term { Ok(Expr::Add{ span: $span, lhs: Box::new($1?), rhs: Box::new($3?) }) }
| Term { $1 }
;
Term -> Result<Expr, ()>:
Term '*' Factor { Ok(Expr::Mul{ span: $span, lhs: Box::new($1?), rhs: Box::new($3?) }) }
| Factor { $1 }
;
Factor -> Result<Expr, ()>:
'(' Expr ')' { $2 }
| 'INT' { Ok(Expr::Number{ span: $span }) }
;
%%
use cfgrammar::Span;
#[derive(Debug)]
pub enum Expr {
Add {
span: Span,
lhs: Box<Expr>,
rhs: Box<Expr>,
},
Mul {
span: Span,
lhs: Box<Expr>,
rhs: Box<Expr>,
},
Number {
span: Span
}
}
The most obvious difference here is that we have defined a simple enum
Expr
,
with three variants, for our AST. Each AST variant also records a Span
which
records how much input the AST element covers. By using the
$span
variable we can ensure that AST elements record their
relationship to portions of the user's input that span multiple tokens (e.g.
for the expressions 1 + 2
the resulting Expr::Add
will have a Span
starting at byte index 0 and ending at byte index 5 -- in other words covering
the complete input string in this case).
After parsing, we thus end up with a Result<Expr, ()>
. In the case of a
successful parse, this will give us an arbitrarily deeply nested Expr
.
Our main.rs
file then looks as follows:
use std::io::{self, BufRead, Write};
use lrlex::{lrlex_mod, DefaultLexeme};
use lrpar::{lrpar_mod, NonStreamingLexer, Span};
lrlex_mod!("calc.l");
lrpar_mod!("calc.y");
use calc_y::Expr;
fn main() {
let lexerdef = calc_l::lexerdef();
let stdin = io::stdin();
loop {
print!(">>> ");
io::stdout().flush().ok();
match stdin.lock().lines().next() {
Some(Ok(ref l)) => {
if l.trim().is_empty() {
continue;
}
let lexer = lexerdef.lexer(l);
let (res, errs) = calc_y::parse(&lexer);
for e in errs {
println!("{}", e.pp(&lexer, &calc_y::token_epp));
}
if let Some(Ok(r)) = res {
// We have a successful parse.
match eval(&lexer, r) {
Ok(i) => println!("Result: {}", i),
Err((span, msg)) => {
let ((line, col), _) = lexer.line_col(span);
eprintln!(
"Evaluation error at line {} column {}, '{}' {}.",
line,
col,
lexer.span_str(span),
msg
)
}
}
}
}
_ => break
}
}
}
fn eval(
lexer: &dyn NonStreamingLexer<DefaultLexeme, u32>,
e: Expr,
) -> Result<u64, (Span, &'static str)> {
match e {
Expr::Add { span, lhs, rhs } => eval(lexer, *lhs)?
.checked_add(eval(lexer, *rhs)?)
.ok_or((span, "overflowed")),
Expr::Mul { span, lhs, rhs } => eval(lexer, *lhs)?
.checked_mul(eval(lexer, *rhs)?)
.ok_or((span, "overflowed")),
Expr::Number { span } => lexer
.span_str(span)
.parse::<u64>()
.map_err(|_| (span, "cannot be represented as a u64"))
}
}
Let's start by running this and seeing what happens:
>>> 2+3*4
Result: 14
>>> 2++3*4
Parsing error at line 1 column 3. Repair sequences found:
1: Delete +
2: Insert INT
Result: 14
>>> 999999*888888 + 777777*666666
Result: 1407404592594
>>> 9999999999*8888888888 + 7777777777*6666666666
Evaluation error at line 1 column 6, '9999999999*8888888888' overflowed.
The first three expressions evaluate just as before. However, the fourth is
interesting: we have explicitly captured the fact that the result of
9999999999*8888888888
is too big to fit into a u64
; and not only have we
told the user which character the input starts out, but we've printed out the
precise sub-part of the input which caused that error. This works even when
it's in the middle of the input:
>>> 10 + 9999999999*8888888888 + 20
Evaluation error at line 1 column 6, '9999999999*8888888888' overflowed.
The key to this is that each AST element knows the $span
of the production it
is related to; and the resulting Span
can extract the user's input with
lexer.span_str(span)
.
Happily, this facility composes nicely with error recovery:
>>> 10 ++ 9999999999*8888888888 + 20
Parsing error at line 1 column 5. Repair sequences found:
1: Delete +
2: Insert INT
Evaluation error at line 1 column 7, '9999999999*8888888888' overflowed.
Rust Editions
The edition
of rust used by grmtools
updates as the rust language evolves. We try to
keep code generated by CTParserBuilder
and CTLexerBuilder
building with
older versions of rust, so that downstream users can use the edition that
suits their requirements.
Controlling edition used during code generation
CTLexerBuilder and CTParserBuilder both have a rust_edition() function that accepts an lrlex::RustEdition and an lrpar::RustEdition respectively.
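For example, a build.rs might pin the edition of the generated code as follows. This is a hedged sketch: the RustEdition variant name used here is an assumption and should be checked against the lrlex/lrpar versions in use.

use cfgrammar::yacc::YaccKind;
use lrlex::CTLexerBuilder;

fn main() {
    CTLexerBuilder::new()
        // Assumed variant name; see lrlex::RustEdition for the actual options.
        .rust_edition(lrlex::RustEdition::Rust2021)
        .lrpar_config(|ctp| {
            ctp.rust_edition(lrpar::RustEdition::Rust2021)
                .yacckind(YaccKind::Grmtools)
                .grammar_in_src_dir("calc.y")
                .unwrap()
        })
        .lexer_in_src_dir("calc.l")
        .unwrap()
        .build()
        .unwrap();
}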
Known edition incompatibility in the book
While there is a preference for keeping the code in this manual working with all editions, exceptions may be made for clarity.
- In An AST evaluator, the rust_2018_idioms lint deprecates some behaviour which was previously accepted by the 2015 edition. The eval function has an elided lifetime that must be given explicitly as lexer: &dyn NonStreamingLexer<'_, DefaultLexeme, u32>.
.
The individual libraries and tools
grmtools consists of several libraries and command-line tools. The following sections describe each.
lrpar
lrpar
(crate;
source) is the LR
parser library aspect of grmtools. It takes in streams of lexemes (using a
lexer of the user's choice) and parses them, determining if they successfully
match a grammar or not; if not, it can optionally recover from errors.
lrlex
lrlex
(crate;
source) is a
partial replacement for lex
/
flex
. It takes an input string and
splits it into lexemes based on a .l
file. Unfortunately, many real-world
languages have corner cases which exceed the power that lrlex
can provide.
However, when it is suitable, it is a very convenient way of expressing lexing.
lrlex
also has a simple command-line interface, allowing you to check whether
your lexing rules are working as expected:
$ cat C.java
class C {
int x = 0;
}
$ cargo run --bin lrlex -- ../grammars/java7/java.l /tmp/C.java
Finished dev [unoptimized + debuginfo] target(s) in 0.18s
Running `target/debug/lrlex ../grammars/java7/java.l /tmp/C.java`
CLASS class
IDENTIFIER C
LBRACE {
INT int
IDENTIFIER x
EQ =
INTEGER_LITERAL 0
SEMICOLON ;
RBRACE }
nimbleparse
nimbleparse
is a simple grammar debugging aid. It takes as input a Lex
specification, a Yacc specification, and an input file and prints any warnings
about the specifications (e.g. shift/reduce errors) as well as the resulting
parse tree to stdout. If the parse is unsuccessful it will report parsing
errors and, when possible, fixes. If parsing is successful, nimbleparse
exits
with 0; if an error is detected it exits with 1.
The full command-line specification is as follows:
nimbleparse [-r <cpctplus|none>] [-y <eco|grmtools|original>] [-q] <lexer.l> <parser.y> <input file>
where:
- -r selects the recovery algorithm to be used. Defaults to cpctplus.
- -y selects the Yacc variant to be used. Defaults to original.
- -q prevents warnings (e.g. shift/reduce errors) from being reported.
You can use your own Lex/Yacc files. A small repository of example grammars can be found at https://github.com/softdevteam/grammars/.
An example invocation is as follows:
$ cat Hello.java
class Hello {
public static void main(String[] args) {
System.out.println("Hello world");
}
}
$ nimbleparse java7.l java7.y Hello.java
goal
compilation_unit
type_declarations_opt
type_declarations
type_declaration
class_declaration
modifiers_opt
CLASS class
IDENTIFIER Hello
type_parameters_opt
super_opt
interfaces_opt
class_body
LBRACE {
class_body_declarations_opt
class_body_declarations
class_body_declaration
class_member_declaration
method_declaration
method_header
modifiers_opt
modifiers
modifiers
modifier
PUBLIC public
modifier
STATIC static
VOID void
method_declarator
IDENTIFIER main
LPAREN (
formal_parameter_list_opt
formal_parameter_list
formal_parameter
type
reference_type
array_type
name
simple_name
IDENTIFIER String
dims
LBRACK [
RBRACK ]
variable_declarator_id
IDENTIFIER args
RPAREN )
throws_opt
method_body
block
LBRACE {
block_statements_opt
block_statements
block_statement
statement
statement_without_trailing_substatement
expression_statement
statement_expression
method_invocation
qualified_name
name
qualified_name
name
simple_name
IDENTIFIER System
DOT .
IDENTIFIER out
DOT .
IDENTIFIER println
LPAREN (
argument_list_opt
argument_list
expression
assignment_expression
conditional_expression
conditional_or_expression
conditional_and_expression
inclusive_or_expression
exclusive_or_expression
and_expression
equality_expression
instanceof_expression
relational_expression
shift_expression
additive_expression
multiplicative_expression
unary_expression
unary_expression_not_plus_minus
postfix_expression
primary
primary_no_new_array
literal
STRING_LITERAL "Hello world"
RPAREN )
SEMICOLON ;
RBRACE }
RBRACE }
$ cat SyntaxError.java
class SyntaxError {
int x y;
}
$ nimbleparse java7.l java7.y SyntaxError.java
goal
compilation_unit
type_declarations_opt
type_declarations
type_declaration
class_declaration
modifiers_opt
CLASS class
IDENTIFIER SyntaxError
type_parameters_opt
super_opt
interfaces_opt
class_body
LBRACE {
class_body_declarations_opt
class_body_declarations
class_body_declaration
class_member_declaration
field_declaration
modifiers_opt
type
primitive_type
numeric_type
integral_type
INT int
variable_declarators
variable_declarators
variable_declarator
variable_declarator_id
IDENTIFIER x
COMMA
variable_declarator
variable_declarator_id
IDENTIFIER y
SEMICOLON ;
RBRACE }
Parsing error at line 2 column 11. Repair sequences found:
1: Insert ,
2: Insert =
3: Delete y
cfgrammar
cfgrammar
(crate;
source) reads
in grammar files, processes them, and provides a convenient API for operating
with them. Most users only need to think about cfgrammar
to the
extent that they are required to use it to specify what Yacc variant they wish
to use.
cfgrammar
may also be of interest to those manipulating grammars directly, or
who wish to use custom types of parsers. Note that cfgrammar
's API should be
considered semi-stable at best. As the needs of other parts of grmtools change,
cfgrammar
tends to have to change too. Since it is likely to have few direct
users, the consequences of changing the API are relatively slight.
lrtable
lrtable
(crate;
source) takes in
grammars from cfgrammar
and creates LR state tables from
them. Few users will be interested in its functionality directly, except those
doing advanced forms of grammar analysis.
One, admittedly fairly advanced, aspect worth noting is that
lrtable
uses Pager's
algorithm to compress the
resulting LR state tables. In rare cases this can provide surprising results:
see Denny and Malloy's
paper for
more.
Libraries and tools developed by third parties
Besides using grmtools to develop parsers, the following items extend or augment grmtools' functionality and may be useful to people developing parsers with grmtools.
Other tools
When parsing text in Rust, you should also evaluate the following tools to see if they are more suitable for your purposes: