grmtools

grmtools is a suite of Rust libraries and binaries for parsing text, both at compile-time and at run-time. Most users will probably be interested in the compile-time Yacc feature, which allows traditional .y files to be used mostly unchanged in Rust. See the Quickstart Guide for a quick introduction to this feature.

Quickstart Guide

Most users will probably be interested in the compile-time Yacc feature of grmtools, which allows traditional .y files to be used mostly unchanged in Rust. This page is a short guide to get you up and running with this feature as quickly as possible.

grmtools includes both a Yacc-style LR parser (lrpar) and a lex-style lexer (lrlex). The lexer breaks input up into individual lexemes and the parser checks to see if the lexemes conform to a grammar. As the parser executes, it can either create a generic parse tree, or execute user-specified Rust code.

A calculator evaluator

Let's assume we want to create a simple calculator which can evaluate expressions such as 2 + 3 * 4. Assuming a fresh Rust project, we first create a Cargo.toml file with the following dependencies:

[package]
name = "calc"
version = "0.0.1"
authors = ["<authors>"]

[[bin]]
doc = false
name = "calc"

[build-dependencies]
lrlex = { path="<path to lrlex>" }
lrpar = { path="<path to lrpar>" }

[dependencies]
cfgrammar = { path="<path to cfgrammar>" }
lrlex = { path="<path to lrlex>" }
lrpar = { path="<path to lrpar>" }

In this situation we want to statically compile the .y grammar and .l lexer into Rust code. We thus need a build.rs file which can process the lexer and grammar. It looks as follows:

extern crate lrlex;
extern crate lrpar;

use lrlex::LexerBuilder;
use lrpar::{CTParserBuilder, ActionKind};

fn main() -> Result<(), Box<std::error::Error>> {
    let lex_rule_ids_map = CTParserBuilder::new()
        .action_kind(ActionKind::CustomAction)
        .process_file_in_src("calc.y")?;
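    // `lex_rule_ids_map` maps lexeme type names to the IDs the parser has
    // allocated for them; passing it to lrlex below keeps the lexer and
    // parser in sync.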
    LexerBuilder::new()
        .rule_ids_map(lex_rule_ids_map)
        .process_file_in_src("calc.l")?;
    Ok(())
}

In our case, we want to specify Rust code which is run as the input is parsed (rather than creating a generic parse tree which we traverse later), so we specify that the action_kind is ActionKind::CustomAction. The grammar file is stored in src/calc.y, but we only specify calc.y as the filename to lrpar, since it searches relative to src/ automatically.

The lexer

While Yacc-style parsing is powerful, lex-style lexing is less powerful. grmtools allows you to use whatever lexer you want with lrpar. Fortunately, in this case, lrlex is powerful enough for us. Our lex file is stored in src/calc.l. The rule_ids_map dance synchronises the parser and lexer (the details of this are unimportant to us).

calc.l is as follows:

%%
[0-9]+ "INT"
\+ "PLUS"
\* "MUL"
\( "LBRACK"
\) "RBRACK"
[\t ]+ ;

Roughly speaking, each line after the %% line is a regular expression (we use the regex crate), a space character, and a quoted lexeme type name. For example, if the user gives us input such as 234 we will create a single lexeme with a value (234) and a type (INT).

The one exception is the final line: if a lexeme type name is replaced with ‘;’ then any matching input is discarded. In this case, whitespace (tabs and spaces) is lexed, but no lexemes are created from it.
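
For example, hypothetical rules (not part of calc.l) for identifiers and #-style comments might look as follows:

[a-zA-Z_][a-zA-Z0-9_]* "ID"
#[^\n]* ;

The first rule creates ID lexemes; the second matches comments running to the end of a line and discards any matching input.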

The grammar

Our initial version of calc.y looks as follows:

%start Expr
// Define the Rust type that is to be returned by the actions.
%type u64
%%
Expr: Term 'PLUS' Expr { $1 + $3 }
    | Term { $1 }
    ;

Term: Factor 'MUL' Term { $1 * $3 }
    | Factor { $1 }
    ;

Factor: 'LBRACK' Expr 'RBRACK' { $2 }
      | 'INT' { parse_int($lexer.lexeme_str(&$1.unwrap())) }
      ;
%%
// Any functions here are in scope for all the grammar actions above.

fn parse_int(s: &str) -> u64 {
    match s.parse::<u64>() {
        Ok(val) => val,
        Err(_) => panic!("{} cannot be represented as a u64", s)
    }
}

The grammar is in 3 parts, separated by the %% lines.

The first part specifies general settings for the grammar: its start rule (%start Expr) and the Rust type that actions in the grammar must produce (%type u64).

The second part is the Yacc grammar. It consists of 3 rules (Expr, Term, and Factor) and 6 productions (2 for each rule, separated by | characters). A production (sometimes called an “alternative”) consists of zero or more symbols. Symbols either reference rules or lexemes. If a production matches text, its “action” (the Rust code between curly brackets at the end of the production) is executed.

lrpar's actions are somewhat different to Yacc. The $x variables refer to the respective symbol in the production (i.e. $1 refers to the first symbol in the production). If the symbol is a rule then the $x variable holds an instance of %type; if the symbol is a lexeme then it holds an Option<Lexeme>. A special $lexer variable allows access to the lexer. This allows us to turn Lexemes into strings with the lexeme_str function, which, given a Lexeme, returns a &str representing the corresponding piece of input.
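
To make this concrete, here is the grammar's first Expr production again, with its action annotated (the comments are ours, not part of calc.y):

Expr: Term 'PLUS' Expr {
        // $1 and $3 hold u64s (the %type), already computed by the Term
        // and Expr actions; $2 would hold an Option<Lexeme> for the PLUS
        // lexeme, which this action ignores.
        $1 + $3
      }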

The third part is arbitrary Rust code which can be called by productions’ actions. In our case we have a simple function which converts integers as strings into integers as u64s: if the user provides an invalid number (e.g. one that is too big) the system panics.

Putting everything together

The build.rs file will statically compile the lexer and grammar into Rust code that we can then call. The src/main.rs file below provides a simple Python-esque REPL to the user into which they can write calculator expressions:

// The cfgrammar import will not be needed once the 2018 edition is stable.
extern crate cfgrammar;
// We import lrpar and lrlex with macros so that lrlex_mod! and lrpar_mod! are in scope.
#[macro_use] extern crate lrpar;
#[macro_use] extern crate lrlex;

use std::io::{self, BufRead, Write};

// Using `lrlex_mod!` brings the lexer for `calc.l` into scope.
lrlex_mod!(calc_l);
// Using `lrpar_mod!` brings the parser for `calc.y` into scope.
lrpar_mod!(calc_y);


fn main() {
    // We need to get a `LexerDef` for the `calc` language in order that we can lex input.
    let lexerdef = calc_l::lexerdef();
    let stdin = io::stdin();
    loop {
        print!(">>> ");
        io::stdout().flush().ok();
        match stdin.lock().lines().next() {
            Some(Ok(ref l)) => {
                if l.trim().is_empty() {
                    continue;
                }
                // Now we create a lexer with the `lexer` method with which we can lex an input.
                let mut lexer = lexerdef.lexer(l);
                // Pass the lexer to the parser and lex and parse the input.
                let (res, errs) = calc_y::parse(&mut lexer);
                for e in errs {
                    println!("{}", e.pp(&lexer, &calc_y::token_epp));
                }
                match res {
                    Some(r) => println!("Result: {}", r),
                    _ => eprintln!("Unable to evaluate expression.")
                }
            }
            _ => break
        }
    }
}

We can now cargo run our project and evaluate simple expressions:

>>> 2 + 3
Result: 5
>>> 2 + 3 * 4
Result: 14
>>> (2 + 3) * 4
Result: 20

Because powerful error recovery is built into lrpar, we can even make minor errors and have the system recover automatically:

>>> 2 + + 3
Parsing error at line 1 column 5. Repair sequences found:
   1: Delete +
   2: Insert INT
Result: 5
>>> 2 + 3 3
Parsing error at line 1 column 7. Repair sequences found:
   1: Delete 3
   2: Insert PLUS
   3: Insert MUL
Result: 5
>>> 2 + 3 4 5
Parsing error at line 1 column 7. Repair sequences found:
   1: Insert MUL, Delete 4
   2: Insert PLUS, Delete 4
   3: Delete 4, Delete 5
   4: Insert MUL, Shift 4, Delete 5
   5: Insert MUL, Shift 4, Insert PLUS
   6: Insert MUL, Shift 4, Insert MUL
   7: Insert PLUS, Shift 4, Delete 5
   8: Insert PLUS, Shift 4, Insert PLUS
   9: Insert PLUS, Shift 4, Insert MUL
Result: 17

Note that we didn't have to do anything clever in order for error recovery to happen: it happens by default, and it works with whatever grammar we throw at it. The way to read the resulting error messages is that each numbered repair sequence is a way that the error recovery system found to make sense of the input. For example, for the input 2 + + 3, an error is detected at the second +: we could either delete the second + (option 1 above) or insert an integer. In all cases, error recovery applies repair sequence 1, and continues parsing. 2 + + 3 was thus parsed as if the user had written 2 + 3, which is why it evaluated to 5. Similarly, 2 + 3 4 5 was parsed as if the user had written 2 + 3 * 5.

Error recovery opens up a number of possibilities to customise and streamline the user experience. For example, the simple approach above causes a panic if the user provides a non-u64 number or if error recovery inserts an integer. For more details about the possibilities, see the section on error recovery.

Lexing

Lexing is the act of taking in an input stream and splitting it into lexemes. Colloquially, lexing is often described as splitting input into words. In grmtools, a Lexeme has a type (e.g. "INT", "ID"), a value (e.g. "23", "xyz"), and knows which part of the user's input matched (e.g. "the input starting at index 7 to index 10"). There is also a simple mechanism to differentiate lexemes of zero length (e.g. DEDENT tokens in Python) from lexemes inserted by error recovery.
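
Conceptually, then, a Lexeme can be thought of along the following lines (a hedged, simplified sketch: the real lrpar type differs in detail, for example by being generic over the storage type used for token IDs):

struct LexemeSketch {
    tok_id: u16,  // the lexeme's type, e.g. the ID allocated to INT
    start: usize, // byte offset of the match in the user's input
    len: usize,   // length of the match; the real type additionally
                  // distinguishes zero-length from inserted lexemes
}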

Users can write custom lexers that conform to the lrpar::lex::Lexer trait. This API allows users to deal with streaming data, since the parser asks the Lexer for one token at a time. However, note that the Lexer can later be asked to return the string from the input that a lexeme matched: lexers thus need to buffer input in order to provide this information.

Hand-written lexers are not particularly difficult to write and, for better or worse, are necessary for many real-world languages. However, a subset of languages can use a simpler lex/flex style approach to lexing, for which lrlex can be used.
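
To illustrate, the following is a hedged sketch in plain Rust of a hand-written lexer for the quickstart's calc language. It does not implement the lrpar::lex::Lexer trait (whose exact interface we leave to its documentation); it simply produces (type name, start, length) triples:

fn lex(input: &str) -> Vec<(&'static str, usize, usize)> {
    let mut lexemes = Vec::new();
    let bytes = input.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b' ' | b'\t' => i += 1, // discard whitespace
            b'+' => { lexemes.push(("PLUS", i, 1)); i += 1; }
            b'*' => { lexemes.push(("MUL", i, 1)); i += 1; }
            b'(' => { lexemes.push(("LBRACK", i, 1)); i += 1; }
            b')' => { lexemes.push(("RBRACK", i, 1)); i += 1; }
            b'0'..=b'9' => {
                let start = i;
                while i < bytes.len() && bytes[i].is_ascii_digit() {
                    i += 1;
                }
                lexemes.push(("INT", start, i - start));
            }
            c => panic!("unexpected character {} at index {}", c as char, i),
        }
    }
    lexemes
}

For example, lex("2 + 3") yields [("INT", 0, 1), ("PLUS", 2, 1), ("INT", 4, 1)].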

Parsing

Parsing is the act of checking whether a stream of lexemes match a grammar. Since a simple "yes/no" answer is rarely useful, it is common to execute user-defined actions during parsing.

grmtools contains libraries (cfgrammar and lrtable) which allow users to build their own LR parsers in whatever fashion they want. However, for 99% of cases, the lrpar library is what users want and need: a (largely) Yacc-compatible parser. Roughly speaking, the core parts of grammars work identically in Yacc and lrpar, but some other parts of the system have been modernised (e.g. to avoid the use of global variables) and given a more idiomatic Rust feel. Notably, lrpar is built from the ground-up to have a powerful, flexible approach to error recovery.

Actions

Users can specify what sort of actions they want performed when parsing occurs. The default is ActionKind::GenericParseTree which, as its name probably suggests, creates a generic parse tree, where elements are instances of the lrpar::parser::Node enum.
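
Having selected ActionKind::GenericParseTree in build.rs, the resulting tree can be traversed with a short recursive function. The following is a hedged sketch which counts the lexemes in a tree: we assume Node has Term and Nonterm variants roughly as below, so check the lrpar::parser::Node documentation for the exact definition.

use lrpar::parser::Node;

// Count the lexemes in a generic parse tree.
fn count_lexemes<StorageT>(node: &Node<StorageT>) -> usize {
    match *node {
        // A terminal node wraps a single lexeme.
        Node::Term { .. } => 1,
        // A nonterminal node wraps the sub-nodes of the production that
        // matched; recurse into each of them.
        Node::Nonterm { ref nodes, .. } => {
            nodes.iter().map(|n| count_lexemes(n)).sum()
        }
    }
}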

Most users will probably want to specify ActionKind::CustomAction, where each production can be annotated with an action. A brief example of this is shown in the quickstart guide; a more detailed explanation can be found in the error recovery section.

The individual libraries and tools

grmtools consists of several libraries and command-line tools. The following sections describe each.

lrpar

lrpar is the LR parser library aspect of grmtools. It takes in streams of lexemes and parses them, determining if they successfully match a grammar or not; if not, it can optionally recover from errors.

lrlex

lrlex is a partial replacement for lex / flex. It takes an input string and splits it into lexemes based on a .l file. Unfortunately, many real-world languages have corner cases which exceed the power that lrlex can provide. However, when it is suitable, it is a very convenient way of expressing lexing.

lrlex also has a simple command-line interface, allowing you to check whether your lexing rules are working as expected:

$ cat C.java
class C {
    int x = 0;
}
$ cargo run --bin lrlex ../grammars/java7/java.l /tmp/C.java
    Finished dev [unoptimized + debuginfo] target(s) in 0.18s
     Running `target/debug/lrlex ../grammars/java7/java.l /tmp/C.java`
CLASS class
IDENTIFIER C
LBRACE {
INT int
IDENTIFIER x
EQ =
INTEGER_LITERAL 0
SEMICOLON ;
RBRACE }

nimbleparse

If you have a lrlex compatible .l file and an lrpar compatible .y file, you can use nimbleparse as a quick way of testing inputs and exploring the resulting parse tree:

$ cargo build --release
$ target/release/nimbleparse -h
Usage: nimbleparse [-r <cpctplus|mf|panic|none>] [-y <eco|original>] <lexer.l> <parser.y> <input file>

For example, assuming you are using the Lua 5.3 .l and .y files from the grammars repository, you might run nimbleparse as follows:

$ cat test.lua
print("Hello world")
$ target/release/nimbleparse lua5_3.l lua5_3.y test.lua
block
 statlistopt
  statlist
   stat
    functioncall
     prefixexp
      var
       NAME print
     args
      LBRACKET (
      explistopt
       explist
        exp
         exp1
          exp2
           exp3
            exp4
             exp5
              exp6
               exp7
                exp8
                 exp9
                  exp10
                   exp11
                    exp12
                     literalstring
                      SHORT_STR "Hello world"
      RBRACKET )

cfgrammar

cfgrammar reads in grammar files, processes them, and provides a convenient API for operating with them. It may be of interest to those manipulating grammars directly, or who wish to use custom types of parsers.

lrtable

lrtable takes in grammars from cfgrammar and creates LR state tables from them. Few users will be interested in its functionality directly, except those doing advanced forms of grammar analysis.

One, admittedly fairly advanced, aspect worth noting is that lrtable uses Pager's algorithm to compress the resulting LR state tables. In rare cases this can provide surprising results: see Denny and Malloy's paper for more.

Error recovery

One of lrpar's most powerful features is its approach to error recovery, which can be used with any grammar. This section outlines the background to error recovery, the choices that users can make, and how to best make use of this feature.

Error recovery background

Programmers frequently make mistakes when entering input, either because of simple typos, or an outright failure to use the correct syntax. Happily, LR parsing guarantees to report syntax errors at the first point that an error can be definitively proven to have occurred (though note that this might not be the same point that a user would consider the error to have been made). It has long been a goal of parsing technologies to recover from such errors, and allow parsing to continue. This allows users to fix all their syntax errors in one go and, optionally, post-parsing phases to operate as if no syntax errors had been made at all. For example, a compiler author might decide to run the compiler's static type checker even in the presence of syntax errors (since many static type errors are unaffected by syntax errors), but not generate code (which might incorrectly give users the illusion that their code is safe to run).

However, most mainstream parsers do a bad job of error recovery. The most common generic error recovery algorithm is "panic mode" (in reality, a family of algorithms). Unfortunately such simple error recovery algorithms do a poor job of recovering from syntax errors, causing a cascade of spurious further syntax errors to be reported. Programmers quickly learn that only the first reported syntax error can be trusted to be correct.

lrpar implements the MF error recovery algorithm from Reducing Cascading Parsing Errors Through Fast Error Recovery, which, in our biased opinion, does a better job than previous approaches. It is fast, grammar neutral, and reports multiple repair sequences to users, allowing them to consider which best matches their intentions.

No matter how clever we think MF is, it is important to understand that it has a fundamental limitation: it only knows about a language's syntax; it has no concept of the language's semantics beyond that implied by the structure of the grammar; and it cannot control what the user does with the result of error recovery. Thus, grammar writers can significantly influence how useful error recovery is for users. Most of the rest of this section explains how best to make use of error recovery.

Error recovery basics

Consider the calc grammar from the quickstart guide:

%start Expr
// Define the Rust type that is to be returned by each
// production's action.
%type u64
%%
Expr: Term 'PLUS' Expr { $1 + $3 }
    | Term { $1 }
    ;

Term: Factor 'MUL' Term { $1 * $3 }
    | Factor { $1 }
    ;

Factor: 'LBRACK' Expr 'RBRACK' { $2 }
      | 'INT' { parse_int($lexer.lexeme_str(&$1.unwrap())) }
      ;
%%
// Functions / imports in this section are in scope for
// every production's actions.

fn parse_int(s: &str) -> u64 {
    match s.parse::<u64>() {
        Ok(val) => val,
        Err(_) => panic!("{} cannot be represented as a u64", s)
    }
}

In this grammar, every production has an action: each action must evaluate to an instance of the %type type (in this case u64). The $x variables refer to the respective symbol in the production (i.e. $1 refers to the first symbol in the production). If the symbol is a rule then the $x variable holds an instance of %type; if the symbol is a lexeme then it holds an Option<Lexeme>. A special $lexer variable allows access to the lexer. This allows us to turn Lexemes into strings with the lexeme_str function, which, given a Lexeme, returns a &str of the relevant part of the user’s input.

For many examples, this simple grammar and its actions work well, leading to output such as the following:

>>> 2 + + 3
Parsing error at line 1 column 5. Repair sequences found:
   1: Delete +
   2: Insert INT
Result: 5

Insert x means “error recovery inserted a lexeme of type x”; Delete x means “error recovery deleted the next lexeme in the stream”; and Shift x means “error recovery kept the user’s lexeme x as-is”.

Repair sequences are minimal ways of adjusting the user’s input such that it becomes correct relative to the underlying grammar. Intuitively, in this example, the two repair sequences would adjust the input to be equivalent to 2 + 3 (repair sequence 1) or 2 + <some int> + 3 (repair sequence 2). When more than one repair sequence is presented to the user, the first is used by the algorithm to continue parsing: in this case, the input was parsed as if it was equivalent to 2 + 3, hence the evaluation of the input to 5.

Repair sequences can, as their name suggests, be of arbitrary length:

>>> 2 + 3 4 5
Parsing error at line 1 column 7. Repair sequences found:
   1: Insert MUL, Delete 4
   2: Insert PLUS, Delete 4
   3: Delete 4, Delete 5
   4: Insert MUL, Shift 4, Delete 5
   5: Insert MUL, Shift 4, Insert PLUS
   6: Insert MUL, Shift 4, Insert MUL
   7: Insert PLUS, Shift 4, Delete 5
   8: Insert PLUS, Shift 4, Insert PLUS
   9: Insert PLUS, Shift 4, Insert MUL
Result: 17

In this case, the first repair sequence caused the input to be parsed as if it was equivalent to 2 + 3 * 5, hence the evaluation of the input to 17.

Syntax errors and language semantics

Our example inputs so far have deliberately exploited cases where the first repair sequence at worst inserted “unimportant” lexemes such as + and *. Since the grammar’s actions never read the values of such lexemes, only their type is important. However, what should happen if error recovery inserts an integer, whose value is later read by one of the grammar’s actions? An example shows the unhappy result:

>>> 2+
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Lexeme { start: 2, len: 4294967295, tok_id: 4 }', libcore/result.rs:1009:5
note: Run with `RUST_BACKTRACE=1` for a backtrace.
>>> 

In this case, the first repair sequence was Insert INT. The fundamental problem is that while error recovery can adjust the user’s input to insert a lexeme of type INT, neither it nor the parser have any idea what value might have made sense for that lexeme. Thus the expression above caused the expression $lexer.lexeme_str(&$1.unwrap()) to panic, since $1 was Err(<lexeme>).

It is thus up to the user to decide what to do in the face of the inevitable semantic issues that error recovery highlights. Fortunately, this is generally simpler than it sounds, requiring only a slight rethink in the way that we tend to write a grammar's actions.

A rule of thumb: make %type return a Result type

Although you can use whatever type you want for %type, using a Result type allows a (deliberately) simple interaction with the effects of error recovery. The basic idea is simple: in actions, we ignore lexemes whose value we don't care about (e.g. brackets); for lexemes whose value we care about, we either introduce a default value, or percolate an Err upwards. Default values make sense in certain situations. For example, if you're writing a compiler, and want to run a static type checker even after syntax errors, it might make sense to assume that Insert 0 is a good substitute for Insert INT. However, in the case of the calculator, default values are likely to lead to confusing results. We thus change the grammar so that inserted integers prevent evaluation from occurring:

%start Expr
%type Result<u64, ()>
%%
Expr: Term 'PLUS' Expr { Ok($1? + $3?) }
    | Term { $1 }
    ;

Term: Factor 'MUL' Term { Ok($1? * $3?) }
    | Factor { $1 }
    ;

Factor: 'LBRACK' Expr 'RBRACK' { $2 }
      | 'INT' {
            let l = $1.map_err(|_| ())?;
            Ok(parse_int($lexer.lexeme_str(&l)))
         }
      ;
%%
fn parse_int(s: &str) -> u64 {
    match s.parse::<u64>() {
        Ok(val) => val,
        Err(_) => panic!("{} cannot be represented as a u64", s)
    }
}

The basic idea here is that every action returns an instance of Result<u64, ()>: if we receive Ok(u64) we successfully evaluated the expression, but if we received Err(()) we were not able to evaluate the expression. If we encounter an integer lexeme which is the result of error recovery, then the INT lexeme in the second Factor action will be Err(<lexeme>). By writing $1.map_err(|_| ())? we’re saying “if the integer lexeme was created by error recovery, percolate Err(()) upwards”. We then have to tweak a couple of other actions to percolate errors upwards, but this is a trivial change.
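
The percolation pattern can be seen in miniature outside the grammar. In the following standalone illustration, lexeme_val is our own hypothetical function, with Result<u64, u64> standing in for the Result type that lexemes actually use:

fn lexeme_val(l: Result<u64, u64>) -> Result<u64, ()> {
    // Err means "this lexeme came from error recovery": discard the
    // details and percolate Err(()) upwards via `?`.
    let v = l.map_err(|_| ())?;
    Ok(v)
}

fn main() {
    assert_eq!(lexeme_val(Ok(3)), Ok(3));
    assert_eq!(lexeme_val(Err(3)), Err(()));
}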

We then need to make a small tweak to our main.rs changing:

match res {
    Some(r) => println!("Result: {}", r),
    _ => eprintln!("Unable to evaluate expression.")
}

to:

match res {
    Some(Ok(r)) => println!("Result: {}", r),
    _ => eprintln!("Unable to evaluate expression.")
}

Now the input which previously caused a panic simply tells the user that it could not evaluate the expression:

>>> 2+
Parsing error at line 1 column 3. Repair sequences found:
   1: Insert INT
Unable to evaluate expression.

Usefully, our inability (or unwillingness) to evaluate the expression does not prevent further syntax errors from being discovered and repaired:

>>> (2+)+3+4+
Parsing error at line 1 column 4. Repair sequences found:
   1: Insert INT
Parsing error at line 1 column 10. Repair sequences found:
   1: Insert INT
Unable to evaluate expression.

Using a Result type allows the user arbitrary control over the classes of syntax errors they are prepared to deal with or not. For example, we could remove the panic from parse_int by having %type be Result<u64, String> where the Err case would report a string such as “18446744073709551616 cannot be represented as a u64” for the first unrepresentable u64 in the user's input. If we wanted to report all unrepresentable u64s, we could have %type be Result<u64, Vec<String>>, though merging together the errors found on the left and right hand sides of the + and * operators requires adding a few lines of code.
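
As a hedged sketch of the first of these variations (the exact error strings are our own choice), the grammar might become:

%start Expr
%type Result<u64, String>
%%
Expr: Term 'PLUS' Expr { Ok($1? + $3?) }
    | Term { $1 }
    ;

Term: Factor 'MUL' Term { Ok($1? * $3?) }
    | Factor { $1 }
    ;

Factor: 'LBRACK' Expr 'RBRACK' { $2 }
      | 'INT' {
            let l = $1.map_err(|_| "<no value for inserted Int>".to_string())?;
            parse_int($lexer.lexeme_str(&l))
        }
      ;
%%
fn parse_int(s: &str) -> Result<u64, String> {
    s.parse::<u64>()
     .map_err(|_| format!("{} cannot be represented as a u64", s))
}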

Making use of %epp for easier to read repair sequences

By default, lexeme types are pretty-printed as their identifier in the grammar. These identifiers rarely match what the user would expect:

>>> 2 3
Parsing error at line 1 column 3. Repair sequences found:
   1: Delete 3
   2: Insert PLUS
   3: Insert MUL
Result: 2

What are PLUS and MUL? These might be semi-obvious, but many lexeme types are far from obvious. grmtools allows users to provide human friendly versions of these for error recovery using the %epp declaration in grammars. For example, we can extend the calc grammar as follows:

%epp PLUS "+"
%epp MUL "*"
%epp LBRACK "("
%epp RBRACK ")"
%epp INT "Int"

leading to the following output:

>>> 2 3
Parsing error at line 1 column 3. Repair sequences found:
   1: Delete 3
   2: Insert +
   3: Insert *
Result: 2

Under the bonnet

For any given syntax error there are, potentially, a finite but vast number of possible valid repair sequences: far too many to exhaustively search. Error recovery algorithms such as MF use various heuristics to cut the search space down to something that is (generally) manageable. Although such cases are surprisingly few in practice, this inevitably leads to occasional situations where the repair sequences found (or, more accurately, those not found) surprise humans.

Timeout

The first surprising condition is that even with the small calc grammar, some user inputs lead to such a massive search space that no repair sequences can be found. The easiest way to trigger this in most grammars is bracket expressions:

>>> 1+(
Parsing error at line 1 column 4. Repair sequences found:
   1: Insert Int, Insert )
Unable to evaluate expression.
>>> 1+((
Parsing error at line 1 column 5. Repair sequences found:
   1: Insert Int, Insert ), Insert )
Unable to evaluate expression.
>>> 1+(((((((((((
Parsing error at line 1 column 14. No repair sequences found.
Unable to evaluate expression.

At a certain number of open brackets (which will partly depend on the speed of your machine), MF simply cannot find suitable repair sequences within its internal timeout, hence the “No repair sequences found” message. In practice this happens on less than 2% of real-world inputs, so it is not a significant worry.

Some “obvious” repair sequences with Delete aren’t reported

The second surprising condition is more subtle. Before we can show the issue, we need to introduce the concept of repair sequence ranking: MF only presents the lowest cost repair sequences to users (where Inserts and Deletes cost 1, and Shifts cost 0). Higher cost repair sequences are discarded. For example, in the earlier 2 + 3 4 5 example, Insert MUL, Shift 4, Delete 5 costs 2 (1 + 0 + 1), the same as Insert MUL, Delete 4, which is why both are reported.

In an ideal world, MF would find repair sequences that allow a file to parse completely successfully. In practice, this is only feasible if a syntax error occurs near the very end of the input. In most cases, MF is happy with a weaker condition, which is that a repair sequence ends with 3 Shift repairs, showing that parsing has got back on track, at least for a little bit. [Inevitably we need to gloss over several of the subtleties of the algorithm, though one is worth mentioning here: Shifts at the end of a repair sequence are stripped before being reported to the user, as they don't aid understanding.] This condition explains the following:

>>> 2 + + 3
Parsing error at line 1 column 5. Repair sequences found:
   1: Delete +
   2: Insert Int
Result: 5
>>> 2 + + 3 +
Parsing error at line 1 column 5. Repair sequences found:
   1: Insert Int
Parsing error at line 1 column 10. Repair sequences found:
   1: Insert Int
Unable to evaluate expression.

For 2 + + 3 we match the human intuition that the input could have been 2 + 3 or 2 + <some int> + 3. However, for the input 2 + + 3 + we do not report a Delete + repair sequence for the first error in the input. Why?

For 2 + + 3, the two repair sequences found are Delete +, Shift 3 and Insert Int, Shift +, Shift 3: both cause the entire input to parse successfully, and both have the same cost.

For 2 + + 3 +, however, the first error leads to 3 repair sequences: Insert Int, Shift +, Shift 3, Shift +; Delete +, Shift 3, Delete +; and Delete +, Shift 3, Shift +, Insert Int. The latter two are not even fully completed, since part-way through they are provably higher cost than the Insert Int repair sequence, and are thus discarded.

In practice, this situation is much rarer than the timeout problem, to the point that it's not worth worrying about. Even when it does happen, the repair sequences that MF reports are always correct, and (assuming it does not time out) it will always report at least one repair sequence at a given syntax error location.

Error recovery on real-world grammars

Continuing the example from the nimbleparse section, we can see that error recovery works well on arbitrary grammars. Consider the following syntactically incorrect Lua 5.3 program:

$ cat test.lua
x = 0
if x > 0
   print("greater than")
else
   print("less than"}

When run through nimbleparse, the following output is generated:

$ cargo run --release --bin nimbleparse lua5_3.l lua5_3.y test.lua
...
Error at line 3 col 4. Repair sequences found:
   1: Insert then
Error at line 5 col 21. Repair sequences found:
   1: Insert ), Insert end, Delete }
   2: Insert ), Insert {, Shift }, Insert end

Turning off error recovery

By default, lrpar uses the MF error recovery algorithm. Using the recoverer method in CTParserBuilder or RTParserBuilder, you can instead select the None error recovery algorithm, which causes parsing to stop as soon as it hits the first parsing error. For example, we can change calc's build.rs file to:

    let lex_rule_ids_map = CTParserBuilder::new()
        .action_kind(ActionKind::CustomAction)
        .recoverer(lrpar::RecoveryKind::None)
        .process_file_in_src("calc.y")?;

and then no matter how many syntax errors we make, only one is reported:

>>> 2++3++
Parsing error at line 1 column 3. No repair sequences found.
Unable to evaluate expression.

Unless you have a good reason to do so (e.g. quickly hacking together a grammar where you would prefer not to think about error recovery at all), we do not recommend turning off error recovery.