Error recovery

One of lrpar's most powerful features is its approach to error recovery, which can be used with any grammar. This section outlines the background to error recovery, the choices that users can make, and how to best make use of this feature.

Error recovery background

Programmers frequently make mistakes when entering input, either because of simple typos, or an outright failure to use the correct syntax. Happily, LR parsing guarantees to report syntax errors at the first point that an error can be definitively proven to have occurred (though note that this might not be the same point that a user would consider the error to have been made). It has long been a goal of parsing technologies to recover from such errors, and allow parsing to continue. This allows users to fix all their syntax errors in one go and, optionally, post-parsing phases to operate as if no syntax errors had been made at all. For example, a compiler author might decide to run the compiler's static type checker even in the presence of syntax errors (since many static type errors are unaffected by syntax errors), but not generate code (which might incorrectly give users the illusion that their code is safe to run).

However, most mainstream parsers do a bad job of error recovery. The most common generic error recovery algorithm is "panic mode" (in reality, a family of algorithms). Unfortunately such simple error recovery algorithms do a poor job of recovering from syntax errors, causing a cascade of spurious further syntax errors to be reported. Programmers quickly learn that only the first reported syntax error can be trusted on to be correct.

lrpar implements the CPCT+ error recovery algorithm from Reducing Cascading Parsing Errors Through Fast Error Recovery, which, in our biased opinion, does a better job than previous approaches. It is fast, grammar neutral, and reports multiple repair sequences to users, allowing them to consider which best matches their intentions.

No matter how clever we think CPCT+ is, it is important to understand that it has a fundamental limitation: it only knows about a language's syntax; it has no concept of the language's semantics beyond that implied by the structure of the grammar; and it cannot control what the user does with the result of error recovery. Thus, grammar writers can significantly influence how useful error recovery is for users. Most of the rest of this section explains how best to make use of error recovery.

Error recovery basics

A simple calculator grammar looks as follows:

%start Expr
%%
Expr -> u64:
      Expr '+' Term { $1 + $3 }
    | Term { $1 }
    ;

Term -> u64:
      Term '*' Factor { $1 * $3 }
    | Factor { $1 }
    ;

Factor -> u64:
      '(' Expr ')' { $2 }
    | 'INT' { parse_int($lexer.span_str($1.unwrap().span())) }
    ;
%%
// Any functions here are in scope for all the grammar actions above.

fn parse_int(s: &str) -> u64 {
    match s.parse::<u64>() {
        Ok(val) => val,
        Err(_) => panic!("{} cannot be represented as a u64", s)
    }
}

For this simplification we need to make a small tweak to our main.rs changing:

match res {
    Some(Ok(r)) => println!("Result: {}", r),
    _ => eprintln!("Unable to evaluate expression.")
}

to:

match res {
    Some(r) => println!("Result: {}", r),
    _ => eprintln!("Unable to evaluate expression.")
}

For many examples, this simple grammar and its actions work well leading to output such as the following:

>>> 2 + + 3
Parsing error at line 1 column 5. Repair sequences found:
   1: Delete +
   2: Insert INT
Result: 5

Insert x means “error recovery inserted a lexeme of type x”; Delete x means “error recovery deleted the next lexeme in the stream”; and Shift x means “error recovery kept the user’s lexeme x as-is”.

Repair sequences are minimal ways of adjusting the user’s input such that it becomes correct relative to the underlying grammar. Intuitively, in this example, the two repair sequences would adjust the input to be equivalent to 2 + 3 (repair sequence 1) or 2 + <some int> + 3 (repair sequence 2). When more than one repair sequence is presented to the user, the first is used by the algorithm to continue parsing: in this case, the input was parsed as if it was equivalent to 2 + 3, hence the evaluation of the input to 5.

Repair sequences can, as their name suggests, be of arbitrary length:

>>> 2 + 3 4 5
Parsing error at line 1 column 7. Repair sequences found:
   1: Insert *, Delete 4
   2: Insert +, Delete 4
   3: Delete 4, Delete 5
   4: Insert *, Shift 4, Delete 5
   5: Insert *, Shift 4, Insert +
   6: Insert *, Shift 4, Insert *
   7: Insert +, Shift 4, Delete 5
   8: Insert +, Shift 4, Insert +
   9: Insert +, Shift 4, Insert *
Result: 17

In this case, the first repair sequence caused the input to be parsed as if it was equivalent to 2 + 3 * 5, hence the evaluation of the input to 17.

Syntax errors and language semantics

Our example inputs so far have deliberately exploited cases where the first repair sequence at worst inserted “unimportant” lexemes such as + and *. Since the grammar’s actions never read the values of such lexemes, only their type is important. However, what should happen if error recovery inserts an integer, whose value is later read by one of the grammar’s actions? An example shows the unhappy result:

>>> 2+
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Lexeme { start: 2, len: 4294967295, tok_id: 4 }', libcore/result.rs:1009:5
note: Run with `RUST_BACKTRACE=1` for a backtrace.
>>> 

In this case, the first repair sequence was Insert INT. The fundamental problem is that while error recovery can adjust the user’s input to insert a lexeme of type INT, neither it nor the parser have any idea what value might have made sense for that lexeme. Thus the expression above caused the expression $lexer.span_str($1.unwrap().span()) to panic, since $1 was Err(<lexeme>).

It is thus up to the user to decide what to do in the face of the inevitable semantic issues that error recovery highlights. Fortunately, this is generally simpler than it sounds with only a slight rethink in the way that we tend to write a grammar's actions.

A rule of thumb: have rules return a Result type

Although rules can have any Rust type you can imagine, using a Result type allows a (deliberately) simple interaction with the effects of error recovery. The basic idea is simple: in actions, we ignore lexemes whose value we don't care about (e.g. brackets); for lexemes whose value we care about, we either introduce a default value, or percolate an Err upwards. Default values make sense in certain situations. For example, if you're writing a compiler, and want to run a static type checker even after syntax errors, it might make sense to assume that Insert 0 is a good substitute for Insert INT. However, in the case of the calculator, default values are likely to lead to confusing results. We thus change the grammar so that inserted integers prevent evaluation from occurring:

%start Expr
%%
Expr -> Result<u64, ()>:
      Expr '+' Term { Ok($1? + $3?) }
    | Term { $1 }
    ;

Term -> Result<u64, ()>:
      Term '*' Factor { Ok($1? * $3?) }
    | Factor { $1 }
    ;

Factor -> Result<u64, ()>:
      '(' Expr ')' { $2 }
    | 'INT' { parse_int($lexer.span_str($1.map_err(|_| ())?.span())) }
    ;
%%
// Any functions here are in scope for all the grammar actions above.

fn parse_int(s: &str) -> Result<u64, ()> {
    match s.parse::<u64>() {
        Ok(val) => Ok(val),
        Err(_) => panic!("{} cannot be represented as a u64", s)
    }
}

The basic idea here is that every action returns an instance of Result<u64, ()>: if we receive Ok(u64) we successfully evaluated the expression, but if we received Err(()) we were not able to evaluate the expression. If we encounter an integer lexeme which is the result of error recovery, then the INT lexeme in the second Factor action will be Err(<lexeme>). By writing $1.map_err(|_| ())? we’re saying “if the integer lexeme was created by error recovery, percolate Err(()) upwards”. We then have to tweak a couple of other actions to percolate errors upwards, but this is a trivial change. We'll also need to change main.rs back to expecting a Result.

Now the input which previously caused a panic simply tells the user that it could not evaluate the expression:

>>> 2+
Parsing error at line 1 column 3. Repair sequences found:
   1: Insert INT
Unable to evaluate expression.

Usefully, our inability (or unwillingness) to evaluate the expression does not prevent further syntax errors from being discovered and repaired:

>>> (2+)+3+4+
Parsing error at line 1 column 4. Repair sequences found:
   1: Insert Int
Parsing error at line 1 column 10. Repair sequences found:
   1: Insert Int
Unable to evaluate expression.

Using a Result type allows the user arbitrary control over the classes of syntax errors they are prepared to deal with or not. For example, we could remove the panic from parse_int by making the rules have a type Result<u64, String> where the Err case would report a string such as “18446744073709551616 cannot be represented as a u64” for the first unrepresentable u64 in the user's input. If we wanted to report all unrepresentable u64s, we could have the rules have a type Result<u64, Vec<String>>, though merging together the errors found on the left and right hand sides of the + and * operators requires adding a few lines of code.

Making use of %epp for easier to read repair sequences

By default, pretty-printing lexeme types prints out their identifier in the grammar. Up to now, we have used lexeme types when showing output to the user. While the lexeme types are sometimes adequate for this purpose, this is not always the case. Consider this lex file:

%%
[0-9]+ "INT"
\+ "PLUS"
\* "MUL"
\( "LBRACK"
\) "RBRACK"
[\t ]+ ;

The user would see output such as:

>>> 2 3
Parsing error at line 1 column 3. Repair sequences found:
   1: Delete 3
   2: Insert PLUS
   3: Insert MUL
Result: 2

What are PLUS and MUL? These might be semi-obvious, but many lexeme types are far from obvious. grmtools allows users to provide human friendly versions of these for error recovery using the %epp declaration in grammars. For example, we can extend the calc grammar as follows:

%epp PLUS "+"
%epp MUL "*"
%epp LBRACK "("
%epp RBRACK ")"
%epp INT "Int"

leading to the following output:

>>> 2 3
Parsing error at line 1 column 3. Repair sequences found:
   1: Delete 3
   2: Insert +
   3: Insert *
Result: 2

Biasing repair sequences

Depending on your language, some repair sequences are better than others. For example, sometimes Insert repairs are less welcome than Delete repairs:

>>> 2 + + 3
Parsing error at line 1 column 3. Repair sequences found:
   1: Insert INT
   2: Delete +
Unable to evaluate expression.
>>> 2 + + 3
Parsing error at line 1 column 3. Repair sequences found:
   1: Delete +
   2: Insert INT
Result: 5

Why does the same input sometimes produce a result and sometimes fail to produce a result? The problem is that 2 + + 3 has two repair sequences Delete + and Insert Int. As things stand, both are equally good, and so one is chosen non-deterministically. If Insert Int is chosen, we hit the Err case from earlier, and fail to produce a result; if the Delete case is chosen, we can produce a result.

To lessen this problem, the %avoid_insert L directive causes grmtools to prefer repair sequences that don't include Insert L over those that do. Intuitively, we want to annotate lexemes whose value we care about in this way (e.g. INT), but we don't need to worry about lexemes whose value we never expect (e.g. (, + etc.). In the case of the calculator grammar a good use of this directive is as follows:

%avoid_insert "INT"

With this, the Delete + repair sequence is consistently favoured over Insert INT.

Turning lexing errors into parsing errors

Most lexers do not have lexical rules for all possible inputs. For example, our running calculator example has no lexical rule for the character @. Typically this causes the lexer to generate an error and stop lexing further. For example with lrlex we would encounter the following:

>>> 2@3
Lexing error at line 1 column 2.

This error message is correct, but not as helpful as we might like (what is the error specifically?). Furthermore, any further errors in the input will not be found until the lexing error is fixed.

Fortunately we can fix this easily for nearly all grammars by adding a line similar to this to the end of your .l file:

. "UNMATCHED"

Any single character which is not matched by any other lex rule will now lead to a token of type UNMATCHED. Note that it is vital that this is the last rule in your .l file, and that only a single character is matched, otherwise you will incorrectly lex correct input as UNMATCHED!

We then need to add a dummy rule to your .y file, simply so that lrpar knows about UNMATCHED tokens. This dummy rule won't be referenced by other rules, so its return type and action are irrelevant. The simplest example is thus:

Unmatched -> ():
  "UNMATCHED" { } 
  ;

Assuming you have the "warnings are errors" option set to true (its default), you will then receive a warning about the unused rule (Unmatched) and token (UNMATCHED). You can inform grmtools that you expect both to be unused by adding this declaration in the top part of your .y file:

%expect-unused Unmatched "UNMATCHED"

With this done, all possible input will be lexed, and what were previously lexing errors are now parsing errors. This means that error recovery section kicks in, giving us more detailed and informative errors, and ensuring that multiple "lexing" errors are reported at once:

>>> 2@3+4+5+6@7
Parsing error at line 1 column 2. Repair sequences found:
   1: Delete @, Delete 3
   2: Insert +, Delete @
   3: Insert *, Delete @
Parsing error at line 1 column 10. Repair sequences found:
   1: Insert +, Delete @
   2: Delete @, Delete 7
   3: Insert *, Delete @
Result: 24

Under the bonnet

For any given syntax error there are, potentially, a finite but vast number of possible valid repair sequences: far too many to exhaustively search. Error recovery algorithms such as CPCT+ use various heuristics to cut the search space down to something that is (generally) manageable. Although surprisingly few in practise, this inevitably leads to occasional situations where the repair sequences found (or, more accurately, those not found) surprise humans.

Timeout

The first surprising condition is that even with the small calc grammar, some user inputs lead to such a massive search space that no repair sequences can be found. The easiest way to trigger this in most grammars is bracket expressions:

>>> 1+(
Parsing error at line 1 column 4. Repair sequences found:
   1: Insert Int, Insert )
Unable to evaluate expression.
>>> 1+((
Parsing error at line 1 column 5. Repair sequences found:
   1: Insert Int, Insert ), Insert )
Unable to evaluate expression.
>>> 1+(((((((((((
Parsing error at line 1 column 14. No repair sequences found.
Unable to evaluate expression.

At a certain number of open brackets (which will partly depend on the speed of your machine), CPCT+ simply cannot find suitable repair sequences within its internal timeout, hence the “No repair sequences found” message. In practise this happens in less than 2% of real-world inputs, so it is not a significant worry.

Some “obvious” repair sequences aren't reported at the end of a file

The second surprising condition is more subtle. Before we can show the issue, we need to introduce the concept of repair sequence ranking: CPCT+ only presents the lowest cost repair sequences to users (where Inserts and Deletes cost 1, and Shifts cost 0). Higher cost repair sequences are discarded.

In an ideal world, CPCT+ would find repair sequences that allow a file to parse completely successfully. In practice, this is only feasible if a syntax error occurs near the very end of the input. In most cases, CPCT+ is happy with a weaker condition, which is that a repair sequence ends with 3 Shift repairs, showing that parsing has got back on track, at least for a little bit. This condition explains the following:

>>> 2 + + 3
Parsing error at line 1 column 5. Repair sequences found:
   1: Delete +
   2: Insert Int
Result: 5
>>> 2 + + 3 +
Parsing error at line 1 column 5. Repair sequences found:
   1: Insert Int
Parsing error at line 1 column 10. Repair sequences found:
   1: Insert Int
Unable to evaluate expression.

For 2 + + 3 we match the human intuition that the input could have been 2 + 3 or 2 + <some int> + 3. However, for the input 2 + + 3 + we do not report a Delete + repair sequence for the first error in the input. Why?

The first thing we need to know is that repair sequences are always reported with trailing Shift repairs pruned: for the rest of this subsection it aids understanding to leave them unpruned. Thus, for 2 + + 3, the two repair sequences found are Delete +, Shift 3 and Insert Int, Shift +, Shift 3, both of which cause the entire input to parse successfully, and both of which have the same cost.

For 2 + + 3 +, however, the first error leads to 3 repair sequences, Insert Int, Shift +, Shift 3, Shift +, Delete +, Shift 3, Delete or Delete +, Shift 3, Shift +, Insert Int: the latter two are not even completed since they're provably higher than the Insert Int repair sequence and thus aren’t reported to the user.

In practise, this situation is rarer than the timeout problem, to the point that it’s arguably not worth worrying about or explaining to end users. Even when it happens, the repair sequences that CPCT+ reports are always correct and at least one repair sequence will be reported (assuming that error recovery doesn't time out!).

Error recovery on real-world grammars

Continuing the example from the nimbleparse section, we can see that error recovery works well on arbitrary grammars. Consider the following syntactically incorrect Lua 5.3 program:

$ cat test.lua
x = 0
if x > 0
   print("greater than")
else
   print("less than"}

When run through nimbleparse, the following output is generated:

$ caro run --release --bin nimbleparse lua5_3.l lua5_3.y test.lua
...
Error at line 3 col 4. Repair sequences found:
   1: Insert then
Error at line 5 col 21. Repair sequences found:
   1: Insert ), Insert end, Delete }
   2: Insert ), Insert {, Shift }, Insert end

Turning off error recovery

By default, lrpar uses the CPCT+ error recovery algorithm. You can use the None error recovery algorithm, which causes parsing to stop as soon as it hits the first parsing error, with the recoverer method in CTParserBuilder or RTParserBuilder. For example, we can change calc's build.rs file to:

CTLexerBuilder::new()
    .lrpar_config(|ctp| {
        ctp.yacckind(YaccKind::Grmtools)
            .recoverer(lrpar::RecoveryKind::None)
            .grammar_in_src_dir("calc.y")
            .unwrap()
    })
    .lexer_in_src_dir("calc.l")?
    .build()?;

and then no matter how many syntax errors we make, only one is reported:

>>> 2++3++
Parsing error at line 1 column 3. No repair sequences found.
Unable to evaluate expression.

Unless you have a good reason to do so (e.g. quickly hacking together a grammar where you would prefer not to think about error recovery at all), we do not recommend turning off error recovery.