grmtools

grmtools is a suite of Rust libraries and binaries for parsing text, both at compile-time, and run-time. Most users will probably be interested in the compile-time Yacc feature, which allows traditional .y files to be used mostly unchanged in Rust. See the Quickstart Guide for a quick introduction to this feature.

Quickstart Guide

Most users will probably be interested in the compile-time Yacc feature of grmtools, which allows traditional .y files to be used mostly unchanged in Rust. This page is a short guide to get you up and running with this feature as quickly as possible.

grmtools includes both a Yacc-style LR parser (lrpar) and a lex-style lexer (lrlex). The lexer breaks input up into individual lexemes and the parser checks to see if the lexemes conform to a grammar. As the parser executes, it can either create a generic parse tree, or execute user-specified Rust code.

A calculator evaluator

Let's assume we want to create a simple calculator which can evaluate expressions such as 2 + 3 * 4. Assuming a fresh Rust project, we first create a Cargo.toml file with the following dependencies:

[package]
name = "calc"
version = "0.0.1"
authors = ["<authors>"]

[[bin]]
doc = false
name = "calc"

[build-dependencies]
cfgrammar = "0.6"
lrlex = "0.6"
lrpar = "0.6"

[dependencies]
cfgrammar = "0.6"
lrlex = "0.6"
lrpar = "0.6"

In this situation we want to statically compile the .y grammar and .l lexer into Rust code. We thus need to create a build.rs file inside the root of our project which can process the lexer and grammar. Our build.rs file thus looks as follows:

use cfgrammar::yacc::YaccKind;
use lrlex::LexerBuilder;
use lrpar::{CTParserBuilder};

fn main() -> Result<(), Box<std::error::Error>> {
    let lex_rule_ids_map = CTParserBuilder::new()
        .yacckind(YaccKind::Grmtools)
        .process_file_in_src("calc.y")?;
    LexerBuilder::new()
        .rule_ids_map(lex_rule_ids_map)
        .process_file_in_src("calc.l")?;
    Ok(())
}

grmtools accepts several different Yacc variants as input. In our case, we want to execute Rust code as the input is parsed (rather than creating a generic parse tree which we traverse later), so we specified that the yacckind (i.e. what variant of Yacc file we're using) is YaccKind::Grmtools. The grammar file is stored in src/calc.y, but we only specify calc.y as the filename to lrpar, since it searches relative to src/ automatically.

The lexer

While Yacc-style parsing is powerful, lex-style lexing is less powerful. grmtools allows you to use whatever lexer you want with lrpar. Fortunately, in this case, lrlex is powerful enough for us. Our lex file is stored in src/calc.l. The rule_ids_map dance synchronises the parser and lexer (the details of this are unimportant to us).

calc.l is as follows:

%%
[0-9]+ "INT"
\+ "PLUS"
\* "MUL"
\( "LBRACK"
\) "RBRACK"
[\t ]+ ;

Roughly speaking, each line after the %% line is a regular expression (we use the regex crate), a space character, and a quoted lexeme type name. For example, if the user gives us input such as 234 we will create a single lexeme with a value (234) and a type (INT).

The one exception is the final line: if a lexeme type name is replaced with ‘;’ then any matching input is discarded. In this case, whitespace (tabs and spaces) is lexed, but no lexemes are created from it.

The grammar

Our initial version of calc.y looks as follows:

%start Expr
%%
Expr -> Result<u64, ()>:
      Term 'PLUS' Expr { Ok($1? + $3?) }
    | Term { $1 }
    ;

Term -> Result<u64, ()>:
      Factor 'MUL' Term { Ok($1? * $3?) }
    | Factor { $1 }
    ;

Factor -> Result<u64, ()>:
      'LBRACK' Expr 'RBRACK' { $2 }
    | 'INT'
      {
          let v = $1.map_err(|_| ())?;
          parse_int($lexer.span_str(v.span()))
      }
    ;
%%
// Any functions here are in scope for all the grammar actions above.

fn parse_int(s: &str) -> Result<u64, ()> {
    match s.parse::<u64>() {
        Ok(val) => Ok(val),
        Err(_) => {
            eprintln!("{} cannot be represented as a u64", s);
            Err(())
        }
    }
}

The grammar is in 3 parts, separated by the %% lines.

The first part specifies general settings for the grammar, at a minimum the start rule (%start Expr).

The second part is the Yacc grammar. It consists of 3 rules (Expr, Term, and Factor) and 6 productions (2 for each rule, separated by | characters). Because we are using the Grmtools Yacc variant, each rule has a Rust type associated with it (after ->) which specifies the type that each production’s action must return. A production (sometimes called an “alternative”) consists of zero or more symbols. Symbols either reference rules or lexemes. If a production matches text, its ”action” (the Rust code between curly brackets at the end of the production) is executed.

lrpar's actions are subtly different to Yacc. The $x variables refer to the respective symbol in the production, numbered from 1 (i.e. $1 refers to the first symbol in the production). If the symbol references a rule R then an instance of R's type will be stored in the $x variable; if the symbol references a lexeme then an Option<Lexeme> instance is returned. A special $lexer variable allows access to the lexer. The most commonly used function that $lexer exposes is the span_str function, which allows us to extract &'input strs from a Span (e.g. to extract the string represented by a Lexeme, we would use $lexer.span_str(lexeme.span())). As this may suggest, actions may also reference the special lifetime 'input (without any $ prefix), which allows strings to be returned / stored by the grammar without copying memory.

The third part is arbitrary Rust code which can be called by productions’ actions. In our case we have a simple function which converts integers as strings into integers as u64s: if the user provides an invalid number (e.g. one that is too big) the system panics.

This example uses a common grmtools idiom: making use of Result types. This allows us to deal with two different issues that prevent evaluation. First is the “obvious” issue of integers which are too big to represent as u64s: these cause Err to be percolated upwards, preventing evaluation. Second is the issue of error recovery telling us that the user should have inserted an integer: since it would be confusing for us to insert a default value in such cases, we map_err such cases to Err, preventing evaluation. See the section below on error recovery for more details about error recovery.

Putting everything together

The build.rs file will statically compile the lexer and grammar into Rust code that we can then call. The src/main.rs file below provides a simple Python-esque REPL to the user into which they can write calculator expressions:

use std::io::{self, BufRead, Write};

use lrlex::lrlex_mod;
use lrpar::lrpar_mod;

// Using `lrlex_mod!` brings the lexer for `calc.l` into scope. By default the
// module name will be `calc_l` (i.e. the file name, minus any extensions,
// with a suffix of `_l`).
lrlex_mod!("calc.l");
// Using `lrpar_mod!` brings the parser for `calc.y` into scope. By default the
// module name will be `calc_y` (i.e. the file name, minus any extensions,
// with a suffix of `_y`).
lrpar_mod!("calc.y");

fn main() {
    // Get the `LexerDef` for the `calc` language.
    let lexerdef = calc_l::lexerdef();
    let stdin = io::stdin();
    loop {
        print!(">>> ");
        io::stdout().flush().ok();
        match stdin.lock().lines().next() {
            Some(Ok(ref l)) => {
                if l.trim().is_empty() {
                    continue;
                }
                // Now we create a lexer with the `lexer` method with which
                // we can lex an input.
                let lexer = lexerdef.lexer(l);
                // Pass the lexer to the parser and lex and parse the input.
                let (res, errs) = calc_y::parse(&lexer);
                for e in errs {
                    println!("{}", e.pp(&lexer, &calc_y::token_epp));
                }
                match res {
                    Some(r) => println!("Result: {}", r),
                    _ => eprintln!("Unable to evaluate expression.")
                }
            }
            _ => break
        }
    }
}

We can now cargo run our project and evaluate simple expressions:

>>> 2 + 3
Result: 5
>>> 2 + 3 * 4
Result: 14
>>> (2 + 3) * 4
Result: 20

Error recovery

Because powerful error recovery is built into lrpar, we can even make minor errors and have the system recover automatically:

>>> 2 + + 3
Parsing error at line 1 column 5. Repair sequences found:
   1: Delete +
   2: Insert INT
Result: 5
>>> 2 + 3 3
Parsing error at line 1 column 7. Repair sequences found:
   1: Delete 3
   2: Insert PLUS
   3: Insert MUL
Result: 5
>>> 2 + 3 4 5
Parsing error at line 1 column 7. Repair sequences found:
   1: Insert MUL, Delete 4
   2: Insert PLUS, Delete 4
   3: Delete 4, Delete 5
   4: Insert MUL, Shift 4, Delete 5
   5: Insert MUL, Shift 4, Insert PLUS
   6: Insert MUL, Shift 4, Insert MUL
   7: Insert PLUS, Shift 4, Delete 5
   8: Insert PLUS, Shift 4, Insert PLUS
   9: Insert PLUS, Shift 4, Insert MUL
Result: 17

Note that we didn't have to do anything clever in order for error recovery to happen: it happens by default, and it works with whatever grammar we throw at it. The way to read the resulting error messages are that each numbered repair sequence is a way that the error recovery system found to make sense of the input. For example, for the input 2 + + 3, an error is detected at the second +: we could either delete the second + (option 1 above) or insert an integer. In all cases, error recovery applies repair sequence 1, and continues parsing. 2 + + 3 was thus parsed as if the user had written 2 + 3, hence why it evaluated to 5. Similarly, 2 + 3 4 5 was parsed as if the user had written 2 + 3 * 5.

Error recovery opens up a number of possibilities to customise and streamline the user experience. For example, the simple approach above causes a panic if the user provides a non-u64 number or if error recovery inserts an integer. For more details about the possibilities, see the section on error recovery.

Lexing

Lexing is the act of taking in an input stream and splitting it into lexemes. Colloquially, lexing is often described as splitting input into words. In grmtools, a Lexeme has a type (e.g. "INT", "ID"), a value (e.g. "23", "xyz"), and knows which part of the user's input matched (e.g. "the input starting at index 7 to index 10"). There is also a simple mechanism to differentiate lexemes of zero length (e.g. DEDENT tokens in Python) from lexemes inserted by error recovery.

Users can write custom lexers that conform to the lrpar::lex::Lexer trait. This API allows users to deal with streaming data since the parser asks the Lexer for one token at a time. However, note that users can later ask the Lexer to return the string from the input matching a lexeme: users need to buffer input to provide this information.

Hand-written lexers are not particularly difficult to write and, for better or worse, are necessary for many real-world languages. However, a subset of languages can use a simpler lex/flex style approach to lexing, for which lrlex can be used.

Lex compatibility

grmtools currently supports one common use of Lex, which is to produce a sequence of tokens. All Lex files require at least some porting to grmtools, though in many cases this is fairly trivial. Nevertheless, aspects such as the longest match rule are identical to Lex, and we assume familiarity with Lex syntax and its major features: the Lex manual is recommended reading.

Major differences

There are several major differences between Lex and grmtools:

Lex has its own regular expression language whereas grmtools uses the well known Rust regex crate for regular expressions. These two regular expression languages are very similar, but complex regular expressions might not be supported under one or the other.
Lex files consist of a sequence of regular expressions and an action for each. grmtools lex files consists of a sequence of regular expressions and a token name. Actions are not currently supported (and, by extension, nor are special action expressions such as ECHO and REJECT).
Start conditions, character sets, and changes to internal array sizes are not supported by grmtools.

Parsing

Parsing is the act of checking whether a stream of lexemes match a grammar. Since a simple "yes/no" answer is rarely useful, it is common to execute user-defined actions during parsing.

grmtools contains libraries (cfgrammar and lrtable) which allow users to build their own LR parsers in whatever fashion they want. However, for 99% of cases, the lrpar library is what users want and need: a (largely) Yacc-compatible parser. Roughly speaking, the core parts of grammars work identically in Yacc and lrpar, but some other parts of the system have been modernised (e.g. to avoid the use of global variables) and given a more idiomatic Rust feel. Notably, lrpar is built from the ground-up to have a powerful, flexible approach to error recovery.

Yacc compatibility

grmtools supports most major Yacc features, to the extent that many Yacc grammars can be used unchanged with grmtools. In this book we assume familiarity with Yacc syntax and its major features: the Yacc manual is recommended reading.

Major differences

There are several differences between Yacc and grmtools including:

grmtools has no equivalent of any of the yy* functions (e.g. yyerror, yylex, yylval, yyparse and so on). This means, for example, that grammar actions cannot currently influence the lexer in any way.
grmtools has an entirely different approach to error recovery. The token error and the special action expressions yyerrok and yyclearin are not supported. In general, users can simply remove alternatives that consist solely of error.
%union can be mapped to %actiontype in grmtools, though this is rarely the best way of using a Yacc grammar in Rust. See the Grmtools Yacc variant below for the most common way of making grammars do something useful; in a limited number of cases (e.g. if you just want to build a parse tree), you may find the "Original" Yacc variants useful.

Grmtools

YaccKind::Grmtools is grmtools' own variant of Yacc syntax, and the one that most users will want to use. The most significant difference to "normal" Yacc is that rules are annotated with a Rust type to which all their production's actions must adhere to. Note that whilst a rule's productions must all adhere to a single type, different rules can have different types. Consider the following snippet:

R1 -> Result<i32, ()>:
     'a' { Ok(5) }
   | 'b' { Err(()) }
   ;

R2 -> u64:
   | { 0 }
   ;

Here the rule R1 has a Rust return type of Result<X, ()> (between -> and :). Both of its productions adhere to this type, the first by instantiating Ok(5) and the second Err(()). The rule R2 has a return type of u64.

“Original” Yacc

Although the name is not fully accurate (grmtools supports a slightly disjoint subset of original Yacc's input), this mode allows users to most easily test externally created Yacc files. Several sub-variants are allowed:

YaccKind::Original(YaccOriginalActionKind::GenericParseTree) does not execute user actions, but instead creates a generic parse tree, where elements are instances of the lrpar::parser::Node enum. This is useful for quickly testing whether a parser is accepting the intended language.
YaccKind::Original(YaccOriginalActionKind::NoActions) parses input and reports errors but does not execute any user actions. This is useful if you are trying to test a large corpus of input for correctness.
YaccKind::Original(YaccOriginalActionKind::UserAction) models original Yacc most closely but, in a Rust setting, is probably of little use beyond simple calculator like languages. Instead of Yacc's %union directive, users can specify %actiontype which is a Rust type to which every production's actions in the grammar must adhere to. Unless all actions happen to naturally return the same type, this quickly becomes cumbersome to use. For most use cases, YaccKind::Grmtools is a superior alternative.

Action code and return types

Action code

Action code is normal Rust code with the addition of the following special variables:

$1 ... $n refer to the respective symbol in the production, numbered from 1 (i.e. $1 refers to the first symbol in the production). If the symbol references a rule R then an instance of R's type will be stored in the $i variable; if the symbol references a lexeme then an Option<Lexeme> instance is returned.
$lexer allows access to the lexer and its various functions. The most commonly used of these is the span_str function, which allows us to extract &'input strs from a Span (e.g. to extract the string represented by a Lexeme, we would use $lexer.span_str(lexeme.span())). As this may suggest, actions may also reference the special lifetime 'input (without any $ prefix), which allows strings to be returned / stored by the grammar without copying memory.
$span is a lrpar::Span tuple (with both elements of type usize) which captures how much of the user's input the current production matched.
$$ is equivalent to $ in normal Rust code.

Any other variables beginning with $ are treated as errors.

Return types

Productions' return types can be any arbitrary Rust type. You may in addition make use of the following:

The generic parameter StorageT references the type of lexemes and is typically used with the Lexeme type i.e. Lexeme<StorageT>. This allows you to return lexemes from rules.
The lifetime 'input allows you to extract strings whose lifetime is tied to the lexer and return them from rules / store them in structs without copying. Lexer::span_str returns such strings and the typical idiom of use is &'input str.

grmtools parsing idioms

grmtools is a flexible tool and can be used in many ways. However, for those using the Grmtools format, the simple idioms below can often make life easier.

Return `Span`s when possible

When executing grammar actions one is often building up an Abstract Syntax Tree (AST) or equivalent. For example consider a simple language with assignments:

Assign: "ID" "=" Expr;

Perhaps the "obvious" way to build this into an AST is to extract the string representing the identifier as follows:

Assign -> ASTAssign: "ID" "=" Expr
    {
        let id = $lexer.span_str($1.as_ref().unwrap().span()).to_string();
        ASTAssign::new(id, $3)
    }

%%

struct ASTAssign {
    id: String
}

impl ASTAssign {
    fn new(name: String) -> Self {
        ASTAssign { name }
    }
}

This approach is easy to work with, but isn't as performant as may be desired: the to_string call allocates memory and copies part of the user's input into that. It also loses information about the part of the user's input that the string relates to.

An alternative approach is not to convert the lexeme into a String during parsing, but simply to return a Span. An outline of this is as follows:

Assign -> ASTAssign: "ID" "=" Expr
    {
        ASTAssign { id: $1, expr: Box::new($3.span()) }
    }

%%

type StorageT = u32;

struct ASTAssign {
    id: Span
    expr: Box<Expr>
}

enum Expr { ... }

If this is not quite what you want to do, you can use largely the same trick with the Lexeme struct. Working with Lexemes has the advantage that you can tell what the type of the lexeme in question is, though generally this is entirely clear from AST context, and Lexeme's type parameter makes it marginally more fiddly to work with than Span.

Alternatively, if you really want to extract strings during parsing, consider using the 'input to extract &str's during parsing, since this does not cause any additional memory to be allocated.

Have rules return a `Result` type and add a function to avoid `map_err` directly

As described in the error recovery section, it is generally a good idea to give rules a Result return type. This allows a simple interaction with error recovery. However, it can lead to endless instances of the following map_err idiom:

R -> Result<..., ()>:
    "ID" { $1.map_err(|_| ())? }
    ;

It can be helpful to define a custom map_err function which hides some of this mess for you:

R -> Result<Lexeme<StorageT>, ()>:
    "ID" { map_err($1)? }
    ;

%%

fn map_err(r: Result<Lexeme<StorageT>, Lexeme<StorageT>>)
        -> Result<Lexeme<StorageT>, ()>
{
    r.map_err(|_| ())
}

Define a `flatten` function

Yacc grammars make specifying sequences of things something of a bore. A common idiom is thus:

ListOfAs -> Result<Vec<A>, ()>:
      A { Ok(vec![$1?]) }
    | ListOfAs A
      {
          let mut $1 = $1?;
          $1.push($1?);
          Ok($1)
      }
    ;

A -> Result<A, ()>: ... ;

Since this idiom is often present multiple times in a grammar, it's generally worth adding a flatten function to hide some of this:

ListOfAs -> Result<Vec<A>, ()>:
      A { Ok(vec![$1?]) }
    | ListOfAs A { flatten($1, $2) }
    ;

A -> Result<A, ()>: ... ;
%%

fn flatten<T>(lhs: Result<Vec<T>, ()>, rhs: Result<T, ()>)
           -> Result<Vec<T>, ()>
{
    let mut flt = lhs?;
    flt.push(rhs?);
    Ok(flt)
}

Note that flatten is generic with respect to T so that it can be used in multiple places in the grammar.

Composing the idioms

Happily, flatten, map_err, and Lexeme combine well:

ListOfIds -> Result<Vec<Lexeme<StorageT>>, ()>:
      "ID" { Ok(vec![map_err($1)?]) }
    | ListOfIds "Id" { flatten($1, map_err($2)?) }
    ;

%%

type StorageT = u32;

fn map_err(r: Result<Lexeme<StorageT>, Lexeme<StorageT>>)
        -> Result<Lexeme<StorageT>, ()>
{
    r.map_err(|_| ())
}

fn flatten<T>(lhs: Result<Vec<T>, ()>, rhs: Result<T, ()>)
           -> Result<Vec<T>, ()>
{
    let mut flt = lhs?;
    flt.push(rhs?);
    Ok(flt)
}

Error recovery

One of lrpar's most powerful features is its approach to error recovery, which can be used with any grammar. This section outlines the background to error recovery, the choices that users can make, and how to best make use of this feature.

Error recovery background

Programmers frequently make mistakes when entering input, either because of simple typos, or an outright failure to use the correct syntax. Happily, LR parsing guarantees to report syntax errors at the first point that an error can be definitively proven to have occurred (though note that this might not be the same point that a user would consider the error to have been made). It has long been a goal of parsing technologies to recover from such errors, and allow parsing to continue. This allows users to fix all their syntax errors in one go and, optionally, post-parsing phases to operate as if no syntax errors had been made at all. For example, a compiler author might decide to run the compiler's static type checker even in the presence of syntax errors (since many static type errors are unaffected by syntax errors), but not generate code (which might incorrectly give users the illusion that their code is safe to run).

However, most mainstream parsers do a bad job of error recovery. The most common generic error recovery algorithm is "panic mode" (in reality, a family of algorithms). Unfortunately such simple error recovery algorithms do a poor job of recovering from syntax errors, causing a cascade of spurious further syntax errors to be reported. Programmers quickly learn that only the first reported syntax error can be trusted on to be correct.

lrpar implements the CPCT+ error recovery algorithm from Reducing Cascading Parsing Errors Through Fast Error Recovery, which, in our biased opinion, does a better job than previous approaches. It is fast, grammar neutral, and reports multiple repair sequences to users, allowing them to consider which best matches their intentions.

No matter how clever we think CPCT+ is, it is important to understand that it has a fundamental limitation: it only knows about a language's syntax; it has no concept of the language's semantics beyond that implied by the structure of the grammar; and it cannot control what the user does with the result of error recovery. Thus, grammar writers can significantly influence how useful error recovery is for users. Most of the rest of this section explains how best to make use of error recovery.

Error recovery basics

A simple calculator grammar looks as follows:

%start Expr
%%
Expr -> u64:
      Term 'PLUS' Expr { $1 + $3 }
    | Term { $1 }
    ;

Term -> u64:
      Factor 'MUL' Term { $1 * $3 }
    | Factor { $1 }
    ;

Factor -> u64:
      'LBRACK' Expr 'RBRACK' { $2 }
    | 'INT' { parse_int($lexer.span_str($1.unwrap().unwrap())) }
    ;
%%
// Any functions here are in scope for all the grammar actions above.

fn parse_int(s: &str) -> u64 {
    match s.parse::<u64>() {
        Ok(val) => val,
        Err(_) => panic!("{} cannot be represented as a u64", s)
    }
}

For many examples, this simple grammar and its actions work well leading to output such as the following:

>>> 2 + + 3
Parsing error at line 1 column 5. Repair sequences found:
   1: Delete +
   2: Insert INT
Result: 5

Insert x means “error recovery inserted a lexeme of type x”; Delete x means “error recovery deleted the next lexeme in the stream”; and Shift x means “error recovery kept the user’s lexeme x as-is”.

Repair sequences are minimal ways of adjusting the user’s input such that it becomes correct relative to the underlying grammar. Intuitively, in this example, the two repair sequences would adjust the input to be equivalent to 2 + 3 (repair sequence 1) or 2 + <some int> + 3 (repair sequence 2). When more than one repair sequence is presented to the user, the first is used by the algorithm to continue parsing: in this case, the input was parsed as if it was equivalent to 2 + 3, hence the evaluation of the input to 5.

Repair sequences can, as their name suggests, be of arbitrary length:

>>> 2 + 3 4 5
Parsing error at line 1 column 7. Repair sequences found:
   1: Insert MUL, Delete 4
   2: Insert PLUS, Delete 4
   3: Delete 4, Delete 5
   4: Insert MUL, Shift 4, Delete 5
   5: Insert MUL, Shift 4, Insert PLUS
   6: Insert MUL, Shift 4, Insert MUL
   7: Insert PLUS, Shift 4, Delete 5
   8: Insert PLUS, Shift 4, Insert PLUS
   9: Insert PLUS, Shift 4, Insert MUL
Result: 17

In this case, the first repair sequence caused the input to be parsed as if it was equivalent to 2 + 3 * 5, hence the evaluation of the input to 17.

Syntax errors and language semantics

Our example inputs so far have deliberately exploited cases where the first repair sequence at worst inserted “unimportant” lexemes such as + and *. Since the grammar’s actions never read the values of such lexemes, only their type is important. However, what should happen if error recovery inserts an integer, whose value is later read by one of the grammar’s actions? An example shows the unhappy result:

>>> 2+
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Lexeme { start: 2, len: 4294967295, tok_id: 4 }', libcore/result.rs:1009:5
note: Run with `RUST_BACKTRACE=1` for a backtrace.
>>>

In this case, the first repair sequence was Insert INT. The fundamental problem is that while error recovery can adjust the user’s input to insert a lexeme of type INT, neither it nor the parser have any idea what value might have made sense for that lexeme. Thus the expression above caused the expression $lexer.span_str($1.unwrap().span()) to panic, since $1 was Err(<lexeme>).

It is thus up to the user to decide what to do in the face of the inevitable semantic issues that error recovery highlights. Fortunately, this is generally simpler than it sounds with only a slight rethink in the way that we tend to write a grammar's actions.

A rule of thumb: have rules return a `Result` type

Although rules can have any Rust type you can imagine, using a Result type allows a (deliberately) simple interaction with the effects of error recovery. The basic idea is simple: in actions, we ignore lexemes whose value we don't care about (e.g. brackets); for lexemes whose value we care about, we either introduce a default value, or percolate an Err upwards. Default values make sense in certain situations. For example, if you're writing a compiler, and want to run a static type checker even after syntax errors, it might make sense to assume that Insert 0 is a good substitute for Insert INT. However, in the case of the calculator, default values are likely to lead to confusing results. We thus change the grammar so that inserted integers prevent evaluation from occurring:

%start Expr
%%
Expr -> Result<u64, ()>:
      Term 'PLUS' Expr { Ok($1? + $3?) }
    | Term { $1 }
    ;

Term -> Result<u64, ()>:
      Factor 'MUL' Term { Ok($1? * $3?) }
    | Factor { $1 }
    ;

Factor -> Result<u64, ()>:
      'LBRACK' Expr 'RBRACK' { $2 }
    | 'INT' { parse_int($lexer.span_str($1.map_err(|_| ())?.span())) }
    ;
%%
// Any functions here are in scope for all the grammar actions above.

fn parse_int(s: &str) -> u64 {
    match s.parse::<u64>() {
        Ok(val) => val,
        Err(_) => panic!("{} cannot be represented as a u64", s)
    }
}

The basic idea here is that every action returns an instance of Result<u64, ()>: if we receive Ok(u64) we successfully evaluated the expression, but if we received Err(()) we were not able to evaluate the expression. If we encounter an integer lexeme which is the result of error recovery, then the INT lexeme in the second Factor action will be Err(<lexeme>). By writing $1.map_err(|_| ())? we’re saying “if the integer lexeme was created by error recovery, percolate Err(()) upwards”. We then have to tweak a couple of other actions to percolate errors upwards, but this is a trivial change.

We then need to make a small tweak to our main.rs changing:

match res {
    Some(r) => println!("Result: {}", r),
    _ => eprintln!("Unable to evaluate expression.")
}

to:

match res {
    Some(Ok(r)) => println!("Result: {}", r),
    _ => eprintln!("Unable to evaluate expression.")
}

Now the input which previously caused a panic simply tells the user that it could not evaluate the expression:

>>> 2+
Parsing error at line 1 column 3. Repair sequences found:
   1: Insert INT
Unable to evaluate expression.

Usefully, our inability (or unwillingness) to evaluate the expression does not prevent further syntax errors from being discovered and repaired:

>>> (2+)+3+4+
Parsing error at line 1 column 4. Repair sequences found:
   1: Insert Int
Parsing error at line 1 column 10. Repair sequences found:
   1: Insert Int
Unable to evaluate expression.

Using a Result type allows the user arbitrary control over the classes of syntax errors they are prepared to deal with or not. For example, we could remove the panic from parse_int by making the rules have a type Result<u64, String> where the Err case would report a string such as “18446744073709551616 cannot be represented as a u64” for the first unrepresentable u64 in the user's input. If we wanted to report all unrepresentable u64s, we could have the rules have a type Result<u64, Vec<String>>, though merging together the errors found on the left and right hand sides of the + and * operators requires adding a few lines of code.

Making use of `%epp` for easier to read repair sequences

By default, pretty-printing lexeme types prints out their identifier in the grammar. These rarely match what the user would expect:

>>> 2 3
Parsing error at line 1 column 3. Repair sequences found:
   1: Delete 3
   2: Insert PLUS
   3: Insert MUL
Result: 2

What are PLUS and MUL? These might be semi-obvious, but many lexeme types are far from obvious. grmtools allows users to provide human friendly versions of these for error recovery using the %epp declaration in grammars. For example, we can extend the calc grammar as follows:

%epp PLUS "+"
%epp MUL "*"
%epp LBRACK "("
%epp RBRACK ")"
%epp INT "Int"

leading to the following output:

>>> 2 3
Parsing error at line 1 column 3. Repair sequences found:
   1: Delete 3
   2: Insert +
   3: Insert *
Result: 2

Biasing repair sequences

Depending on your language, some repair sequences are better than others. For example, sometimes Insert repairs are less welcome than Delete repairs:

>>> 2 + + 3
Parsing error at line 1 column 3. Repair sequences found:
   1: Insert INT
   2: Delete +
Unable to evaluate expression.
>>> 2 + + 3
Parsing error at line 1 column 3. Repair sequences found:
   1: Delete +
   2: Insert INT
Result: 5

Why does the same input sometimes produce a result and sometimes fail to produce a result? The problem is that 2 + + 3 has two repair sequences Delete + and Insert Int. As things stand, both are equally good, and so one is chosen non-deterministically. If Insert Int is chosen, we hit the Err case from earlier, and fail to produce a result; if the Delete case is chosen, we can produce a result.

To lessen this problem, the %avoid_insert L directive causes grmtools to prefer repair sequences that don't include Insert L over those that do. Intuitively, we want to annotate lexemes whose value we care about in this way (e.g. INT), but we don't need to worry about lexemes whose value we never expect (e.g. (, + etc.). In the case of the calculator grammar a good use of this directive is as follows:

%avoid_insert "INT"

With this, the Delete + repair sequence is consistently favoured over Insert INT.

Turning lexing errors into parsing errors

Most lexers do not have lexical rules for all possible inputs. For example, our running calculator example has no lexical rule for the character @. Typically this causes the lexer to generate an error and stop lexing further. For example with lrlex we would encounter the following:

>>> 2@3
Lexing error at line 1 column 2.

This error message is correct, but not as helpful as we might like (what is the error specifically?). Furthermore, any further errors in the input will not be found until the lexing error is fixed.

Fortunately we can fix this easily for nearly all grammars by adding a line similar to this to the end of your .l file:

. "UNMATCHED"

Any single character which is not matched by any other lex rule will now lead to a token of type UNMATCHED. Note that it is vital that this is the last rule in your .l file, and that only a single character is matched, otherwise you will incorrectly lex correct input as UNMATCHED!

We then need to add a dummy rule to your .y file, simply so that lrpar knows about UNMATCHED tokens. This dummy rule won't be referenced by other rules, so its return type and action are irrelevant. The simplest example is thus:

Unmatched -> ():
  "UNMATCHED" { } 
  ;

With this done, all possible input will be lexed, and what were previously lexing errors are now parsing errors. This means that error recovery section kicks in, giving us more detailed and informative errors, and ensuring that multiple "lexing" errors are reported at once:

>>> 2@3+4+5+6@7
Parsing error at line 1 column 2. Repair sequences found:
   1: Delete @, Delete 3
   2: Insert +, Delete @
   3: Insert *, Delete @
Parsing error at line 1 column 10. Repair sequences found:
   1: Insert +, Delete @
   2: Delete @, Delete 7
   3: Insert *, Delete @
Result: 24

Under the bonnet

For any given syntax error there are, potentially, a finite but vast number of possible valid repair sequences: far too many to exhaustively search. Error recovery algorithms such as CPCT+ use various heuristics to cut the search space down to something that is (generally) manageable. Although surprisingly few in practise, this inevitably leads to occasional situations where the repair sequences found (or, more accurately, those not found) surprise humans.

Timeout

The first surprising condition is that even with the small calc grammar, some user inputs lead to such a massive search space that no repair sequences can be found. The easiest way to trigger this in most grammars is bracket expressions:

>>> 1+(
Parsing error at line 1 column 4. Repair sequences found:
   1: Insert Int, Insert )
Unable to evaluate expression.
>>> 1+((
Parsing error at line 1 column 5. Repair sequences found:
   1: Insert Int, Insert ), Insert )
Unable to evaluate expression.
>>> 1+(((((((((((
Parsing error at line 1 column 14. No repair sequences found.
Unable to evaluate expression.

At a certain number of open brackets (which will partly depend on the speed of your machine), CPCT+ simply cannot find suitable repair sequences within its internal timeout, hence the “No repair sequences found” message. In practise this happens in less than 2% of real-world inputs, so it is not a significant worry.

Some “obvious” repair sequences aren't reported at the end of a file

The second surprising condition is more subtle. Before we can show the issue, we need to introduce the concept of repair sequence ranking: CPCT+ only presents the lowest cost repair sequences to users (where Inserts and Deletes cost 1, and Shifts cost 0). Higher cost repair sequences are discarded.

In an ideal world, CPCT+ would find repair sequences that allow a file to parse completely successfully. In practice, this is only feasible if a syntax error occurs near the very end of the input. In most cases, CPCT+ is happy with a weaker condition, which is that a repair sequence ends with 3 Shift repairs, showing that parsing has got back on track, at least for a little bit. This condition explains the following:

>>> 2 + + 3
Parsing error at line 1 column 5. Repair sequences found:
   1: Delete +
   2: Insert Int
Result: 5
>>> 2 + + 3 +
Parsing error at line 1 column 5. Repair sequences found:
   1: Insert Int
Parsing error at line 1 column 10. Repair sequences found:
   1: Insert Int
Unable to evaluate expression.

For 2 + + 3 we match the human intuition that the input could have been 2 + 3 or 2 + <some int> + 3. However, for the input 2 + + 3 + we do not report a Delete + repair sequence for the first error in the input. Why?

The first thing we need to know is that repair sequences are always reported with trailing Shift repairs pruned: for the rest of this subsection it aids understanding to leave them unpruned. Thus, for 2 + + 3, the two repair sequences found are Delete +, Shift 3 and Insert Int, Shift +, Shift 3, both of which cause the entire input to parse successfully, and both of which have the same cost.

For 2 + + 3 +, however, the first error leads to 3 repair sequences, Insert Int, Shift +, Shift 3, Shift +, Delete +, Shift 3, Delete or Delete +, Shift 3, Shift +, Insert Int: the latter two are not even completed since they're provably higher than the Insert Int repair sequence and thus aren’t reported to the user.

In practise, this situation is rarer than the timeout problem, to the point that it’s arguably not worth worrying about or explaining to end users. Even when it happens, the repair sequences that CPCT+ reports are always correct and at least one repair sequence will be reported (assuming that error recovery doesn't time out!).

Error recovery on real-world grammars

Continuing the example from the nimbleparse section, we can see that error recovery works well on arbitrary grammars. Consider the following syntactically incorrect Lua 5.3 program:

$ cat test.lua
x = 0
if x > 0
   print("greater than")
else
   print("less than"}

When run through nimbleparse, the following output is generated:

$ caro run --release --bin nimbleparse lua5_3.l lua5_3.y test.lua
...
Error at line 3 col 4. Repair sequences found:
   1: Insert then
Error at line 5 col 21. Repair sequences found:
   1: Insert ), Insert end, Delete }
   2: Insert ), Insert {, Shift }, Insert end

Turning off error recovery

By default, lrpar uses the CPCT+ error recovery algorithm. You can use the None error recovery algorithm, which causes parsing to stop as soon as it hits the first parsing error, with the recoverer method in CTParserBuilder or RTParserBuilder. For example, we can change calc's build.rs file to:

    let lex_rule_ids_map = CTParserBuilder::new()
        .yacckind(YaccKind::Grmtools)
        .recoverer(lrpar::RecoveryKind::None)
        .process_file_in_src("calc.y")?;

and then no matter how many syntax errors we make, only one is reported:

>>> 2++3++
Parsing error at line 1 column 3. No repair sequences found.
Unable to evaluate expression.

Unless you have a good reason to do so (e.g. quickly hacking together a grammar where you would prefer not to think about error recovery at all), we do not recommend turning off error recovery.

An AST evaluator

We now know enough to put together a more sophisticated version of our simple calculator example that builds an Abstract Syntax Tree (AST) while parsing, which is then evaluated separately. This models a common way of building real compilers. The full example code can be found at https://github.com/softdevteam/grmtools/tree/master/lrpar/examples/calc_ast.

The calc.l file remains unchanged from that in the Quickstart guide. However the calc.y file is change as follows:

#![allow(unused)]
fn main() {
%start Expr
%avoid_insert "INT"
%%
Expr -> Result<Expr, ()>:
      Term '+' Expr { Ok(Expr::Add{ span: $span, lhs: Box::new($1?), rhs: Box::new($3?) }) }
    | Term { $1 }
    ;

Term -> Result<Expr, ()>:
      Factor '*' Term { Ok(Expr::Mul{ span: $span, lhs: Box::new($1?), rhs: Box::new($3?) }) }
    | Factor { $1 }
    ;

Factor -> Result<Expr, ()>:
      '(' Expr ')' { $2 }
    | 'INT' { Ok(Expr::Number{ span: $span }) }
    ;
%%

use lrpar::Span;

#[derive(Debug)]
pub enum Expr {
    Add {
        span: Span,
        lhs: Box<Expr>,
        rhs: Box<Expr>,
    },
    Mul {
        span: Span,
        lhs: Box<Expr>,
        rhs: Box<Expr>,
    },
    Number {
        span: Span
    }
}
}

The most obvious difference here is that we have defined a simple enum Expr, with three variants, for our AST. Each AST variant also records a Span which records how much input the AST element covers. By using the $span variable we can ensure that AST elements record their relationship to portions of the user's input that span multiple tokens (e.g. for the expressions 1 + 2 the resulting Expr::Add will have a Span starting at byte index 0 and ending at byte index 5 -- in other words covering the complete input string in this case).

After parsing, we thus end up with a Result<Expr, ()>. In the case of a successful parse, this will give us an arbitrarily deeply nested Expr.

Our main.rs file then looks as follows:

use std::io::{self, BufRead, Write};

use lrlex::lrlex_mod;
use lrpar::{lrpar_mod, Lexer, Span};

lrlex_mod!("calc.l");
lrpar_mod!("calc.y");

use calc_y::Expr;

fn main() {
    let lexerdef = calc_l::lexerdef();
    let stdin = io::stdin();
    loop {
        print!(">>> ");
        io::stdout().flush().ok();
        match stdin.lock().lines().next() {
            Some(Ok(ref l)) => {
                if l.trim().is_empty() {
                    continue;
                }
                let lexer = lexerdef.lexer(l);
                let (res, errs) = calc_y::parse(&lexer);
                for e in errs {
                    println!("{}", e.pp(&lexer, &calc_y::token_epp));
                }
                if let Some(Ok(r)) = res {
                    // We have a successful parse.
                    match eval(&lexer, r) {
                        Ok(i) => println!("Result: {}", i),
                        Err((span, msg)) => {
                            let ((line, col), _) = lexer.line_col(span);
                            eprintln!(
                                "Evaluation error at line {} column {}, '{}' {}.",
                                line,
                                col,
                                lexer.span_str(span),
                                msg
                            )
                        }
                    }
                }
            }
            _ => break
        }
    }
}

fn eval(lexer: &dyn Lexer<u32>, e: Expr) -> Result<u64, (Span, &'static str)> {
    match e {
        Expr::Add { span, lhs, rhs } => eval(lexer, *lhs)?
            .checked_add(eval(lexer, *rhs)?)
            .ok_or((span, "overflowed")),
        Expr::Mul { span, lhs, rhs } => eval(lexer, *lhs)?
            .checked_mul(eval(lexer, *rhs)?)
            .ok_or((span, "overflowed")),
        Expr::Number { span } => lexer
            .span_str(span)
            .parse::<u64>()
            .map_err(|_| (span, "cannot be represented as a u64"))
    }
}

Let's start by running this and seeing what happens:

>>> 2+3*4
Result: 14
>>> 2++3*4
Parsing error at line 1 column 3. Repair sequences found:
   1: Delete +
   2: Insert INT
Result: 14
>>> 999999*888888 + 777777*666666
Result: 1407404592594
>>> 9999999999*8888888888 + 7777777777*6666666666
Evaluation error at line 1 column 6, '9999999999*8888888888' overflowed.

The first three expressions evaluate just as before. However, the fourth is interesting: we have explicitly captured the fact that the result of 9999999999*8888888888 is too big to fit into a u64; and not only have we told the user which character the input starts out, but we've printed out the precise sub-part of the input which caused that error. This works even when it's in the middle of the input:

>>> 10 + 9999999999*8888888888 + 20
Evaluation error at line 1 column 6, '9999999999*8888888888' overflowed.

The key to this is that each AST element knows the $span of the production it is related to; and the resulting Span can extract the user's input with lexer.span_str(span).

Happily, this facility composes nicely with error recovery:

>>> 10 ++ 9999999999*8888888888 + 20
Parsing error at line 1 column 5. Repair sequences found:
   1: Delete +
   2: Insert INT
Evaluation error at line 1 column 7, '9999999999*8888888888' overflowed.

The individual libraries and tools

grmtools consists of several libraries and command-line tools. The following sections describe each.

`lrpar`

lrpar (crate; source) is the LR parser library aspect of grmtools. It takes in streams of lexemes and parses them, determining if they successfully match a grammar or not; if not, it can optionally recover from errors.

`lrlex`

lrlex (crate; source) is a partial replacement for lex / flex. It takes an input string and splits it into lexemes based on a .l file. Unfortunately, many real-world languages have corner cases which exceed the power that lrlex can provide. However, when it is suitable, it is a very convenient way of expressing lexing.

lrlex also has a simple command-line interface, allowing you to check whether your lexing rules are working as expected:

$ cat C.java
class C {
    int x = 0;
}
$ cargo run --lrlex java.l /tmp/C.java
    Finished dev [unoptimized + debuginfo] target(s) in 0.18s
     Running `target/debug/lrlex ../grammars/java7/java.l /tmp/C.java`
CLASS class
IDENTIFIER C
LBRACE {
INT int
IDENTIFIER x
EQ =
INTEGER_LITERAL 0
SEMICOLON ;
RBRACE }

nimbleparse

If you have a lrlex compatible .l file and an lrpar compatible .y file, you can use nimbleparse as a quick way of testing inputs and exploring the resulting parse tree:

$ cargo build --release
$ target/release/nimbleparse -h
Usage: nimbleparse [-r <cpctplus|mf|panic|none>] [-y <eco|original>] <lexer.l> <parser.y> <input file>

For example, let's assume you are using the Lua 5.3 .l and .y files from the grammars repository you might run nimbleparse as follows:

$ cat test.lua
print("Hello world")
$ target/release/nimbleparse lua5_3.l lua5_3.y test.lua
block
 statlistopt
  statlist
   stat
    functioncall
     prefixexp
      var
       NAME print
     args
      LBRACKET (
      explistopt
       explist
        exp
         exp1
          exp2
           exp3
            exp4
             exp5
              exp6
               exp7
                exp8
                 exp9
                  exp10
                   exp11
                    exp12
                     literalstring
                      SHORT_STR "Hello world"
      RBRACKET )

`cfgrammar`

cfgrammar (crate; source) reads in grammar files, processes them, and provides a convenient API for operating with them. Most users only need to think about cfgrammar to the extent that they are required to use it to specify what Yacc variant they wish to use.

cfgrammar may also be of interest to those manipulating grammars directly, or who wish to use custom types of parsers. Note that cfgrammar's API should be considered semi-stable at best. As the needs of other parts of grmtools change, cfgrammar tends to have to change too. Since it is unlikely to have few direct users, the consequences of changing the API are relatively slight.

`lrtable`

lrtable (crate; source) takes in grammars from cfgrammar and creates LR state tables from them. Few users will be interested in its functionality directly, except those doing advanced forms of grammar analysis.

One, admittedly fairly advanced, aspect worth noting is that lrtable uses Pager's algorithm to compress the resulting LR state tables. In rare cases this can provide surprising results: see Denny and Malloy's paper for more.

Other tools

When parsing text in Rust, you should also evaluate the following tools to see if they are more suitable for your purposes:

LALRPOP

Keyboard shortcuts

grmtools