grmtools
grmtools is a suite of Rust libraries and binaries for parsing text, both at
compile-time and run-time. Most users will probably be interested in the
compile-time Yacc feature, which allows traditional .y files to be used mostly
unchanged in Rust. See the Quickstart Guide for a quick introduction to this
feature.
Quickstart Guide
Most users will probably be interested in the compile-time Yacc feature of
grmtools, which allows traditional .y files to be used mostly unchanged in
Rust. This page is a short guide to get you up and running with this feature
as quickly as possible.
grmtools includes both a Yacc-style LR parser (lrpar) and a lex-style lexer
(lrlex). The lexer breaks input up into individual lexemes and the parser
checks to see if the lexemes conform to a grammar. As the parser executes, it
can either create a generic parse tree, or execute user-specified Rust code.
A calculator evaluator
Let's assume we want to create a simple calculator which can evaluate
expressions such as 2 + 3 * 4. Assuming a fresh Rust project, we first create
a Cargo.toml file with the following dependencies:
[package]
name = "calc"
version = "0.0.1"
authors = ["<authors>"]
[[bin]]
doc = false
name = "calc"
[build-dependencies]
cfgrammar = "0.3"
lrlex = "0.3"
lrpar = "0.3"
[dependencies]
cfgrammar = "0.3"
lrlex = "0.3"
lrpar = "0.3"
In this situation we want to statically compile the .y grammar and .l lexer
into Rust code, so we need a build.rs file which can process them both. It
looks as follows:
use cfgrammar::yacc::YaccKind;
use lrlex::LexerBuilder;
use lrpar::CTParserBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let lex_rule_ids_map = CTParserBuilder::new()
        .yacckind(YaccKind::Grmtools)
        .process_file_in_src("calc.y")?;
    LexerBuilder::new()
        .rule_ids_map(lex_rule_ids_map)
        .process_file_in_src("calc.l")?;
    Ok(())
}
grmtools accepts several different Yacc variants as input. In our case, we
want to execute Rust code as the input is parsed (rather than creating a
generic parse tree which we traverse later), so we specified that the yacckind
(i.e. what variant of Yacc file we're using) is YaccKind::Grmtools. The
grammar file is stored in src/calc.y, but we only specify calc.y as the
filename to lrpar, since it searches relative to src/ automatically.
The lexer
While Yacc-style parsing is powerful, lex-style lexing is less powerful.
grmtools allows you to use whatever lexer you want with lrpar. Fortunately, in
this case, lrlex is powerful enough for us. Our lex file is stored in
src/calc.l. The rule_ids_map dance synchronises the parser and lexer (the
details of this are unimportant to us). calc.l is as follows:
%%
[0-9]+ "INT"
\+ "PLUS"
\* "MUL"
\( "LBRACK"
\) "RBRACK"
[\t ]+ ;
Roughly speaking, each line after the %% line is a regular expression (we use
the regex crate), a space character, and a quoted lexeme type name. For
example, if the user gives us input such as 234 we will create a single lexeme
with a value (234) and a type (INT).

The one exception is the final line: if a lexeme type name is replaced with
‘;’ then any matching input is discarded. In this case, whitespace (tabs and
spaces) is lexed, but no lexemes are created from it.
The grammar
Our initial version of calc.y looks as follows:
%start Expr
%%
Expr -> Result<u64, ()>:
Term 'PLUS' Expr { Ok($1? + $3?) }
| Term { $1 }
;
Term -> Result<u64, ()>:
Factor 'MUL' Term { Ok($1? * $3?) }
| Factor { $1 }
;
Factor -> Result<u64, ()>:
'LBRACK' Expr 'RBRACK' { $2 }
| 'INT'
{
let v = $1.map_err(|_| ())?;
parse_int($lexer.lexeme_str(&v))
}
;
%%
// Any functions here are in scope for all the grammar actions above.
fn parse_int(s: &str) -> Result<u64, ()> {
match s.parse::<u64>() {
Ok(val) => Ok(val),
Err(_) => {
eprintln!("{} cannot be represented as a u64", s);
Err(())
}
}
}
The grammar is in 3 parts, separated by the %% lines.

The first part specifies general settings for the grammar, at a minimum the
start rule (%start Expr).
The second part is the Yacc grammar. It consists of 3 rules (Expr, Term, and
Factor) and 6 productions (2 for each rule, separated by | characters).
Because we are using the Grmtools Yacc variant, each rule has a Rust type
associated with it (after ->) which specifies the type that each production's
action must return. A production (sometimes called an “alternative”) consists
of zero or more symbols. Symbols either reference rules or lexemes. If a
production matches text, its “action” (the Rust code between curly brackets at
the end of the production) is executed.
lrpar's actions are subtly different to Yacc's. The $x variables refer to the
respective symbols in the production, numbered from 1 (i.e. $1 refers to the
first symbol in the production). If the symbol references a rule R then an
instance of R's type will be stored in the $x variable; if the symbol
references a lexeme then a Result<Lexeme, Lexeme> instance is returned (Ok if
the lexeme came directly from the user's input; Err if it was inserted by
error recovery). A special $lexer variable allows access to the lexer. This
allows us to turn Lexemes into strings with the lexeme_str function, which
given a Lexeme returns a &str representing the corresponding portion of the
user's input.
The third part is arbitrary Rust code which can be called by productions'
actions. In our case we have a simple function which converts integers as
strings into integers as u64s: if the user provides an invalid number (e.g.
one that is too big), an error message is printed and Err(()) returned,
preventing evaluation.
This example uses a common grmtools idiom: making use of Result types. This
allows us to deal with two different issues that prevent evaluation. First is
the “obvious” issue of integers which are too big to represent as u64s: these
cause an Err to be percolated upwards, preventing evaluation. Second is the
issue of error recovery telling us that the user should have inserted an
integer: since it would be confusing for us to insert a default value in such
cases, we map_err such cases to Err, preventing evaluation. See the error
recovery section below for more details.
Putting everything together
The build.rs file will statically compile the lexer and grammar into Rust code
that we can then call. The src/main.rs file below provides a simple
Python-esque REPL into which users can enter calculator expressions:
use std::io::{self, BufRead, Write};

use lrlex::lrlex_mod;
use lrpar::lrpar_mod;

// Using `lrlex_mod!` brings the lexer for `calc.l` into scope.
lrlex_mod!(calc_l);
// Using `lrpar_mod!` brings the parser for `calc.y` into scope.
lrpar_mod!(calc_y);

fn main() {
    // We need to get a `LexerDef` for the `calc` language in order that we
    // can lex input.
    let lexerdef = calc_l::lexerdef();
    let stdin = io::stdin();
    loop {
        print!(">>> ");
        io::stdout().flush().ok();
        match stdin.lock().lines().next() {
            Some(Ok(ref l)) => {
                if l.trim().is_empty() {
                    continue;
                }
                // Now we create a lexer with the `lexer` method with which
                // we can lex an input.
                let mut lexer = lexerdef.lexer(l);
                // Pass the lexer to the parser and lex and parse the input.
                let (res, errs) = calc_y::parse(&mut lexer);
                for e in errs {
                    println!("{}", e.pp(&lexer, &calc_y::token_epp));
                }
                match res {
                    Some(Ok(r)) => println!("Result: {}", r),
                    _ => eprintln!("Unable to evaluate expression.")
                }
            }
            _ => break
        }
    }
}
We can now cargo run our project and evaluate simple expressions:
>>> 2 + 3
Result: 5
>>> 2 + 3 * 4
Result: 14
>>> (2 + 3) * 4
Result: 20
Error recovery
Because powerful error recovery is built into lrpar, we can even make minor
errors and have the system recover automatically:
>>> 2 + + 3
Parsing error at line 1 column 5. Repair sequences found:
1: Delete +
2: Insert INT
Result: 5
>>> 2 + 3 3
Parsing error at line 1 column 7. Repair sequences found:
1: Delete 3
2: Insert PLUS
3: Insert MUL
Result: 5
>>> 2 + 3 4 5
Parsing error at line 1 column 7. Repair sequences found:
1: Insert MUL, Delete 4
2: Insert PLUS, Delete 4
3: Delete 4, Delete 5
4: Insert MUL, Shift 4, Delete 5
5: Insert MUL, Shift 4, Insert PLUS
6: Insert MUL, Shift 4, Insert MUL
7: Insert PLUS, Shift 4, Delete 5
8: Insert PLUS, Shift 4, Insert PLUS
9: Insert PLUS, Shift 4, Insert MUL
Result: 17
Note that we didn't have to do anything clever in order for error recovery to
happen: it happens by default, and it works with whatever grammar we throw at
it. The way to read the resulting error messages is that each numbered repair
sequence is a way that the error recovery system found to make sense of the
input. For example, for the input 2 + + 3, an error is detected at the second
+: we could either delete the second + (option 1 above) or insert an integer.
In all cases, error recovery applies repair sequence 1 and continues parsing.
2 + + 3 was thus parsed as if the user had written 2 + 3, which is why it
evaluated to 5. Similarly, 2 + 3 4 5 was parsed as if the user had written
2 + 3 * 5.
Error recovery opens up a number of possibilities to customise and streamline
the user experience. For example, the simple approach above refuses to
evaluate an expression if the user provides an unrepresentable u64 or if error
recovery inserts an integer. For more details about the possibilities, see the
section on error recovery.
Lexing
Lexing is the act of taking in an input stream and splitting it into lexemes.
Colloquially, lexing is often described as splitting input into words. In
grmtools, a Lexeme has a type (e.g. "INT", "ID"), a value (e.g. "23", "xyz"),
and knows which part of the user's input matched (e.g. "the input starting at
index 7 to index 10"). There is also a simple mechanism to differentiate
lexemes of zero length (e.g. DEDENT tokens in Python) from lexemes inserted by
error recovery.
Users can write custom lexers that conform to the lrpar::lex::Lexer trait.
This API allows users to deal with streaming data, since the parser asks the
Lexer for one token at a time. However, note that users can later ask the
Lexer to return the string from the input matching a lexeme: users therefore
need to buffer input to provide this information.
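To make that concrete, the sketch below shows the general shape such a
hand-written lexer often takes. It is purely illustrative: the names
(TokenSpan, HandWrittenLexer, next_token, span_str) are hypothetical and do
not correspond to lrpar's actual Lexer trait, whose exact signature should be
checked in lrpar's API documentation.

struct TokenSpan {
    tok_id: u32,  // the lexeme type, as agreed with the parser
    start: usize, // byte offset at which the match begins
    len: usize,   // length of the match in bytes
}

struct HandWrittenLexer<'input> {
    s: &'input str, // the whole input is buffered...
    pos: usize,
}

impl<'input> HandWrittenLexer<'input> {
    // Called (conceptually by the parser) one token at a time.
    fn next_token(&mut self) -> Option<TokenSpan> {
        // Skip whitespace without producing a lexeme.
        let rest = &self.s[self.pos..];
        self.pos += rest.len() - rest.trim_start().len();
        let start = self.pos;
        match self.s[start..].chars().next()? {
            c if c.is_ascii_digit() => {
                let len = self.s[start..]
                    .chars()
                    .take_while(|c| c.is_ascii_digit())
                    .count();
                self.pos += len;
                Some(TokenSpan { tok_id: 0 /* e.g. INT */, start, len })
            }
            '+' => {
                self.pos += 1;
                Some(TokenSpan { tok_id: 1 /* e.g. PLUS */, start, len: 1 })
            }
            _ => None, // a real lexer would report a lexing error here
        }
    }

    // ...so that the string matching any lexeme can be recovered on demand,
    // which is what functions such as lexeme_str rely on.
    fn span_str(&self, t: &TokenSpan) -> &'input str {
        &self.s[t.start..t.start + t.len]
    }
}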
Hand-written lexers are not particularly difficult to write and, for better or
worse, are necessary for many real-world languages. However, a subset of
languages can use a simpler lex/flex style approach to lexing, for which lrlex
can be used.
Parsing
Parsing is the act of checking whether a stream of lexemes matches a grammar.
Since a simple "yes/no" answer is rarely useful, it is common to execute
user-defined actions during parsing.
grmtools contains libraries (cfgrammar and lrtable) which allow users to build
their own LR parsers in whatever fashion they want. However, for 99% of cases,
the lrpar library is what users want and need: a (largely) Yacc-compatible
parser. Roughly speaking, the core parts of grammars work identically in Yacc
and lrpar, but some other parts of the system have been modernised (e.g. to
avoid the use of global variables) and given a more idiomatic Rust feel.
Notably, lrpar is built from the ground up to have a powerful, flexible
approach to error recovery.
Yacc variants
grmtools can deal with several different variants of Yacc input.
Grmtools
YaccKind::Grmtools is grmtools' own variant of Yacc syntax, and the one that
most users will want to use. The most significant difference from "normal"
Yacc is that rules are annotated with a Rust type to which all of their
productions' actions must adhere. Note that whilst a rule's productions must
all adhere to a single type, different rules can have different types.
Consider the following snippet:
R1 -> Result<i32, ()>:
    'a' { Ok(5) }
  | 'b' { Err(()) }
  ;

R2 -> u64:
    { 0 }
  ;
Here the rule R1 has a Rust return type of Result<i32, ()> (between -> and :).
Both of its productions adhere to this type, the first by instantiating Ok(5)
and the second Err(()). The rule R2 has a return type of u64.
“Original” Yacc
Although the name is not fully accurate (grmtools supports most, but not all,
of original Yacc's input), this mode allows users to most easily test
externally created Yacc files. Several sub-variants are allowed:
- YaccKind::Original(YaccOriginalActionKind::GenericParseTree) does not
  execute user actions, but instead creates a generic parse tree, where
  elements are instances of the lrpar::parser::Node enum. This is useful for
  quickly testing whether a parser is accepting the intended language (see the
  build.rs sketch after this list).
- YaccKind::Original(YaccOriginalActionKind::NoActions) parses input and
  reports errors but does not execute any user actions. This is useful if you
  are trying to test a large corpus of input for correctness.
- YaccKind::Original(YaccOriginalActionKind::UserAction) models original Yacc
  most closely but, in a Rust setting, is probably of little use beyond simple
  calculator-like languages. Instead of Yacc's %union directive, users can
  specify %actiontype, a Rust type to which every production action in the
  grammar must adhere. Unless all actions happen to naturally return the same
  type, this quickly becomes cumbersome to use. For most use cases,
  YaccKind::Grmtools is a superior alternative.
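Purely as an illustration of selecting a sub-variant, the sketch below adapts
the quickstart's build.rs to ask for a generic parse tree. It is a sketch: we
assume YaccOriginalActionKind is exported from cfgrammar::yacc alongside
YaccKind, and that the rest of the build pipeline is unchanged from the
quickstart.

use cfgrammar::yacc::{YaccKind, YaccOriginalActionKind};
use lrlex::LexerBuilder;
use lrpar::CTParserBuilder;

// A sketch: ask lrpar for a generic parse tree rather than executing user
// actions, e.g. to quickly check what language a grammar accepts.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let lex_rule_ids_map = CTParserBuilder::new()
        .yacckind(YaccKind::Original(YaccOriginalActionKind::GenericParseTree))
        .process_file_in_src("calc.y")?;
    LexerBuilder::new()
        .rule_ids_map(lex_rule_ids_map)
        .process_file_in_src("calc.l")?;
    Ok(())
}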
Action code
Action code is normal Rust code with the addition of the following special variables:
- $1 ... $n refer to the respective symbols in the production, numbered from 1
  (i.e. $1 refers to the first symbol in the production). If the symbol
  references a rule R then an instance of R's type will be stored in the $i
  variable; if the symbol references a lexeme then a Result<Lexeme, Lexeme>
  instance is returned.
- $lexer allows access to the lexer and its various functions. The most
  commonly used of these is the lexeme_str function, which allows us to turn
  Lexemes into &strs representing the corresponding portion of the user's
  input.
- $span is a (start, end) tuple (with both elements of type usize) which
  captures how much of the user's input the current production matched. This
  can be useful when storing debugging information. Note that this variable is
  not enabled by default: use CTParserBuilder::span_var(true) to enable it. An
  example follows this list.
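For instance, assuming span_var(true) has been set in build.rs, a variant of
the calculator's Expr rule could record what it matched. This is a sketch; the
eprintln! is purely for debugging:

Expr -> Result<u64, ()>:
    Term 'PLUS' Expr
    {
        // $span is a (start, end) pair of byte offsets into the input.
        let (start, end) = $span;
        eprintln!("Expr matched input bytes {}..{}", start, end);
        Ok($1? + $3?)
    }
  | Term { $1 }
  ;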
grmtools parsing idioms
grmtools is a flexible tool and can be used in many ways. However, for those
using the Grmtools format, the simple idioms below can often make life easier.
Return Lexemes rather than &strs or Strings when performance is critical
When executing grammar actions one is often building up an Abstract Syntax
Tree (AST) or equivalent. For example, consider a simple language with
assignments:
Assign: "ID" "=" Expr;
Perhaps the "obvious" way to build this into an AST is to extract the string representing the identifier as follows:
Assign -> ASTAssign: "ID" "=" Expr
    {
        let id = $lexer.lexeme_str($1.as_ref().unwrap()).to_string();
        ASTAssign::new(id, $3)
    }
    ;
%%
struct ASTAssign {
    id: String,
    expr: Expr
}

impl ASTAssign {
    fn new(id: String, expr: Expr) -> Self {
        ASTAssign { id, expr }
    }
}
This approach is easy to work with, but isn't as performant as may be desired:
the to_string call allocates memory and copies part of the user's input into
it.
An alternative approach is not to convert the lexeme into a String during
parsing, but simply to return the Lexeme itself. The relevant &str slice can
be extracted from the user's input later and memory allocation avoided
entirely. An example of this is as follows:
Assign -> ASTAssign: "ID" "=" Expr
    {
        ASTAssign::new($1.unwrap(), $3)
    }
    ;
%%
type StorageT = u32;

struct ASTAssign {
    id: Lexeme<StorageT>,
    expr: Expr
}

impl ASTAssign {
    fn new(id: Lexeme<StorageT>, expr: Expr) -> Self {
        ASTAssign { id, expr }
    }
}
Dealing with the Lexeme type can be somewhat fiddly because it has a type
parameter <StorageT> with somewhat complex bounds. In essence, StorageT must
be an unsigned integer that is big enough to index (individually) a grammar's
rules, productions, and lexemes. Users can manually specify StorageT by using
CTParserBuilder::new_with_storaget; if not specified explicitly, it defaults
to u32.
A simple way of avoiding the complexity of the parameter's bounds is to encode
it directly into the Lexeme type. A pleasant, and easily adapted, way of doing
this is to semi-hard-code the value of StorageT into the AST, which we do by
adding type StorageT = u32; to the programs part of the grammar (note that if
a different type than u32 was passed to CTParserBuilder::new_with_storaget
then that will need to be used instead).
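For example, a grammar small enough to be indexed by a u8 might be built with
a build.rs fragment along the following lines. This is a sketch: we assume
CTParserBuilder::<u8>::new_with_storaget() otherwise behaves exactly like
CTParserBuilder::new().

// In build.rs: explicitly select u8 as StorageT, which suffices for grammars
// with fewer than 256 rules, productions, and lexeme types.
let lex_rule_ids_map = CTParserBuilder::<u8>::new_with_storaget()
    .yacckind(YaccKind::Grmtools)
    .process_file_in_src("calc.y")?;

The grammar's programs part would then declare type StorageT = u8; to match.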
Have rules return a Result type and add a function to avoid map_err directly
As described in the error recovery section, it is generally a good idea to
give rules a Result return type. This allows a simple interaction with error
recovery. However, it can lead to endless instances of the following map_err
idiom:
R -> Result<..., ()>:
    "ID" { Ok($1.map_err(|_| ())?) }
    ;
It can be helpful to define a custom map_err function which hides some of this
mess for you:
R -> Result<Lexeme<StorageT>, ()>:
    "ID" { map_err($1) }
    ;
%%
fn map_err(r: Result<Lexeme<StorageT>, Lexeme<StorageT>>)
    -> Result<Lexeme<StorageT>, ()>
{
    r.map_err(|_| ())
}
Define a flatten function
Yacc grammars make specifying sequences of things something of a bore. A common idiom is thus:
ListOfAs -> Result<Vec<A>, ()>:
    A { Ok(vec![$1?]) }
  | ListOfAs A
    {
        let mut flt = $1?;
        flt.push($2?);
        Ok(flt)
    }
  ;
A -> Result<A, ()>: ... ;
Since this idiom is often present multiple times in a grammar, it's generally
worth adding a flatten function to hide some of this:
ListOfAs -> Result<Vec<A>, ()>:
A { Ok(vec![$1?]) }
| ListOfAs A { flatten($1, $2) }
;
A -> Result<A, ()>: ... ;
%%
fn flatten<T>(lhs: Result<Vec<T>, ()>, rhs: Result<T, ()>)
-> Result<Vec<T>, ()>
{
let mut flt = lhs?;
flt.push(rhs?);
Ok(flt)
}
Note that flatten is generic with respect to T so that it can be used in
multiple places in the grammar.
Composing the idioms
Happily, flatten, map_err, and Lexeme combine well:
ListOfIds -> Result<Vec<Lexeme<StorageT>>, ()>:
    "ID" { Ok(vec![map_err($1)?]) }
  | ListOfIds "ID" { flatten($1, map_err($2)) }
  ;
%%
type StorageT = u32;

fn map_err(r: Result<Lexeme<StorageT>, Lexeme<StorageT>>)
    -> Result<Lexeme<StorageT>, ()>
{
    r.map_err(|_| ())
}

fn flatten<T>(lhs: Result<Vec<T>, ()>, rhs: Result<T, ()>)
    -> Result<Vec<T>, ()>
{
    let mut flt = lhs?;
    flt.push(rhs?);
    Ok(flt)
}
The individual libraries and tools
grmtools consists of several libraries and command-line tools. The following
sections describe each.
lrpar
lrpar is the LR parser library aspect of grmtools. It takes in streams of
lexemes and parses them, determining if they successfully match a grammar or
not; if not, it can optionally recover from errors.
lrlex
lrlex is a partial replacement for lex/flex. It takes an input string and
splits it into lexemes based on a .l file. Unfortunately, many real-world
languages have corner cases which exceed the power that lrlex can provide.
However, when it is suitable, it is a very convenient way of expressing
lexing. lrlex also has a simple command-line interface, allowing you to check
whether your lexing rules are working as expected:
$ cat C.java
class C {
int x = 0;
}
$ cargo run --bin lrlex ../grammars/java7/java.l /tmp/C.java
Finished dev [unoptimized + debuginfo] target(s) in 0.18s
Running `target/debug/lrlex ../grammars/java7/java.l /tmp/C.java`
CLASS class
IDENTIFIER C
LBRACE {
INT int
IDENTIFIER x
EQ =
INTEGER_LITERAL 0
SEMICOLON ;
RBRACE }
nimbleparse
If you have an lrlex-compatible .l file and an lrpar-compatible .y file, you
can use nimbleparse as a quick way of testing inputs and exploring the
resulting parse tree:
$ cargo build --release
$ target/release/nimbleparse -h
Usage: nimbleparse [-r <cpctplus|mf|panic|none>] [-y <eco|original>] <lexer.l> <parser.y> <input file>
For example, assuming you are using the Lua 5.3 .l and .y files from the
grammars repository, you might run nimbleparse as follows:
$ cat test.lua
print("Hello world")
$ target/release/nimbleparse lua5_3.l lua5_3.y test.lua
block
statlistopt
statlist
stat
functioncall
prefixexp
var
NAME print
args
LBRACKET (
explistopt
explist
exp
exp1
exp2
exp3
exp4
exp5
exp6
exp7
exp8
exp9
exp10
exp11
exp12
literalstring
SHORT_STR "Hello world"
RBRACKET )
cfgrammar
cfgrammar reads in grammar files, processes them, and provides a convenient
API for operating on them. Most users only need to think about cfgrammar to
the extent that they are required to use it to specify what Yacc variant they
wish to use.

cfgrammar may also be of interest to those manipulating grammars directly, or
who wish to use custom types of parsers. Note that cfgrammar's API should be
considered semi-stable at best. As the needs of other parts of grmtools
change, cfgrammar tends to have to change too. Since it is likely to have few
direct users, the consequences of changing the API are relatively slight.
lrtable
lrtable takes in grammars from cfgrammar and creates LR state tables from
them. Few users will be interested in its functionality directly, except those
doing advanced forms of grammar analysis.
One, admittedly fairly advanced, aspect worth noting is that lrtable uses
Pager's algorithm to compress the resulting LR state tables. In rare cases
this can provide surprising results: see Denny and Malloy's paper for more
details.
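For those who do want to experiment, a minimal use of lrtable looks something
like the following sketch. It assumes cfgrammar's YaccGrammar::new constructor
and lrtable's from_yacc function (with Minimiser::Pager selecting Pager's
algorithm); check both crates' API documentation for the precise signatures.

use cfgrammar::yacc::{YaccGrammar, YaccKind, YaccOriginalActionKind};
use lrtable::{from_yacc, Minimiser};

// A sketch: parse a grammar from a string, then build its Pager-compressed
// LR state graph and state table.
fn main() {
    let src = "%start S\n%%\nS: 'A';";
    let grm = YaccGrammar::new(
        YaccKind::Original(YaccOriginalActionKind::GenericParseTree),
        src,
    )
    .unwrap();
    let (_state_graph, _state_table) = from_yacc(&grm, Minimiser::Pager).unwrap();
}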
Error recovery
One of lrpar's most powerful features is its approach to error recovery, which
can be used with any grammar. This section outlines the background to error
recovery, the choices that users can make, and how to best make use of this
feature.
Error recovery background
Programmers frequently make mistakes when entering input, either because of simple typos, or an outright failure to use the correct syntax. Happily, LR parsing guarantees to report syntax errors at the first point that an error can be definitively proven to have occurred (though note that this might not be the same point that a user would consider the error to have been made). It has long been a goal of parsing technologies to recover from such errors, and allow parsing to continue. This allows users to fix all their syntax errors in one go and, optionally, post-parsing phases to operate as if no syntax errors had been made at all. For example, a compiler author might decide to run the compiler's static type checker even in the presence of syntax errors (since many static type errors are unaffected by syntax errors), but not generate code (which might incorrectly give users the illusion that their code is safe to run).
However, most mainstream parsers do a bad job of error recovery. The most common generic error recovery algorithm is "panic mode" (in reality, a family of algorithms). Unfortunately such simple error recovery algorithms do a poor job of recovering from syntax errors, causing a cascade of spurious further syntax errors to be reported. Programmers quickly learn that only the first reported syntax error can be trusted to be correct.
lrpar implements the MF error recovery algorithm from Reducing Cascading
Parsing Errors Through Fast Error Recovery which, in our biased opinion, does
a better job than previous approaches. It is fast, grammar neutral, and
reports multiple repair sequences to users, allowing them to consider which
best matches their intentions.
No matter how clever we think MF is, it is important to understand that it has
a fundamental limitation: it only knows about a language's syntax; it has no
concept of the language's semantics beyond that implied by the structure of
the grammar; and it cannot control what the user does with the result of error
recovery. Thus, grammar writers can significantly influence how useful error
recovery is for users. Most of the rest of this section explains how best to
make use of error recovery.
Error recovery basics
A simple calculator grammar looks as follows:
%start Expr
%%
Expr -> u64:
Term 'PLUS' Expr { $1 + $3 }
| Term { $1 }
;
Term -> u64:
Factor 'MUL' Term { $1 * $3 }
| Factor { $1 }
;
Factor -> u64:
'LBRACK' Expr 'RBRACK' { $2 }
| 'INT' { parse_int($lexer.lexeme_str(&$1.unwrap())) }
;
%%
// Any functions here are in scope for all the grammar actions above.
fn parse_int(s: &str) -> u64 {
match s.parse::<u64>() {
Ok(val) => val,
Err(_) => panic!("{} cannot be represented as a u64", s)
}
}
For many examples, this simple grammar and its actions work well, leading to
output such as the following:
>>> 2 + + 3
Parsing error at line 1 column 5. Repair sequences found:
1: Delete +
2: Insert INT
Result: 5
Insert x means “error recovery inserted a lexeme of type x”; Delete x means
“error recovery deleted the next lexeme in the stream”; and Shift x means
“error recovery kept the user's lexeme x as-is”.
Repair sequences are minimal ways of adjusting the user's input such that it
becomes correct relative to the underlying grammar. Intuitively, in this
example, the two repair sequences would adjust the input to be equivalent to
2 + 3 (repair sequence 1) or 2 + <some int> + 3 (repair sequence 2). When more
than one repair sequence is presented to the user, the first is used by the
algorithm to continue parsing: in this case, the input was parsed as if it was
equivalent to 2 + 3, hence the evaluation of the input to 5.
Repair sequences can, as their name suggests, be of arbitrary length:
>>> 2 + 3 4 5
Parsing error at line 1 column 7. Repair sequences found:
1: Insert MUL, Delete 4
2: Insert PLUS, Delete 4
3: Delete 4, Delete 5
4: Insert MUL, Shift 4, Delete 5
5: Insert MUL, Shift 4, Insert PLUS
6: Insert MUL, Shift 4, Insert MUL
7: Insert PLUS, Shift 4, Delete 5
8: Insert PLUS, Shift 4, Insert PLUS
9: Insert PLUS, Shift 4, Insert MUL
Result: 17
In this case, the first repair sequence caused the input to be parsed as if it
was equivalent to 2 + 3 * 5, hence the evaluation of the input to 17.
Syntax errors and language semantics
Our example inputs so far have deliberately exploited cases where the first
repair sequence at worst inserted “unimportant” lexemes such as + and *. Since
the grammar's actions never read the values of such lexemes, only their type
is important. However, what should happen if error recovery inserts an
integer, whose value is later read by one of the grammar's actions? An example
shows the unhappy result:
>>> 2+
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Lexeme { start: 2, len: 4294967295, tok_id: 4 }', libcore/result.rs:1009:5
note: Run with `RUST_BACKTRACE=1` for a backtrace.
>>>
In this case, the first repair sequence was Insert INT. The fundamental
problem is that while error recovery can adjust the user's input to insert a
lexeme of type INT, neither it nor the parser have any idea what value might
have made sense for that lexeme. Thus the input above caused the expression
$lexer.lexeme_str(&$1.unwrap()) to panic, since $1 was Err(<lexeme>).
It is thus up to the user to decide what to do in the face of the inevitable semantic issues that error recovery highlights. Fortunately, this is generally simpler than it sounds, requiring only a slight rethink in the way that we tend to write a grammar's actions.
A rule of thumb: have rules return a Result type
Although rules can have any Rust type you can imagine, using a Result type
allows a (deliberately) simple interaction with the effects of error recovery.
The basic idea is simple: in actions, we ignore lexemes whose value we don't
care about (e.g. brackets); for lexemes whose value we care about, we either
introduce a default value, or percolate an Err upwards. Default values make
sense in certain situations. For example, if you're writing a compiler, and
want to run a static type checker even after syntax errors, it might make
sense to assume that Insert 0 is a good substitute for Insert INT. However, in
the case of the calculator, default values are likely to lead to confusing
results. We thus change the grammar so that inserted integers prevent
evaluation from occurring:
%start Expr
%%
Expr -> Result<u64, ()>:
Term 'PLUS' Expr { Ok($1? + $3?) }
| Term { $1 }
;
Term -> Result<u64, ()>:
Factor 'MUL' Term { Ok($1? * $3?) }
| Factor { $1 }
;
Factor -> Result<u64, ()>:
'LBRACK' Expr 'RBRACK' { $2 }
| 'INT' { Ok(parse_int($lexer.lexeme_str(&$1.map_err(|_| ())?))) }
;
%%
// Any functions here are in scope for all the grammar actions above.
fn parse_int(s: &str) -> u64 {
match s.parse::<u64>() {
Ok(val) => val,
Err(_) => panic!("{} cannot be represented as a u64", s)
}
}
The basic idea here is that every action returns an instance of
Result<u64, ()>: if we receive Ok(u64) we successfully evaluated the
expression, but if we received Err(()) we were not able to evaluate the
expression. If we encounter an integer lexeme which is the result of error
recovery, then the INT lexeme in the second Factor action will be
Err(<lexeme>). By writing $1.map_err(|_| ())? we're saying “if the integer
lexeme was created by error recovery, percolate Err(()) upwards”. We then have
to tweak a couple of other actions to percolate errors upwards, but this is a
trivial change.
We then need to make a small tweak to our main.rs, changing:
match res {
Some(r) => println!("Result: {}", r),
_ => eprintln!("Unable to evaluate expression.")
}
to:
match res {
Some(Ok(r)) => println!("Result: {}", r),
_ => eprintln!("Unable to evaluate expression.")
}
Now the input which previously caused a panic simply tells the user that it could not evaluate the expression:
>>> 2+
Parsing error at line 1 column 3. Repair sequences found:
1: Insert INT
Unable to evaluate expression.
Usefully, our inability (or unwillingness) to evaluate the expression does not prevent further syntax errors from being discovered and repaired:
>>> (2+)+3+4+
Parsing error at line 1 column 4. Repair sequences found:
1: Insert Int
Parsing error at line 1 column 10. Repair sequences found:
1: Insert Int
Unable to evaluate expression.
Using a Result type allows the user arbitrary control over the classes of
syntax errors they are prepared to deal with or not. For example, we could
remove the panic from parse_int by making the rules have a type
Result<u64, String>, where the Err case would report a string such as
“18446744073709551616 cannot be represented as a u64” for the first
unrepresentable u64 in the user's input. If we wanted to report all
unrepresentable u64s, we could have the rules have a type
Result<u64, Vec<String>>, though merging together the errors found on the left
and right hand sides of the + and * operators requires adding a few lines of
code.
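A sketch of such a panic-free parse_int is below; the rules' types, and the
actions that combine sub-results, would need adjusting to match, as described
above.

// parse_int for rules typed Result<u64, String>: the Err case carries a
// human-readable message instead of panicking.
fn parse_int(s: &str) -> Result<u64, String> {
    s.parse::<u64>()
        .map_err(|_| format!("{} cannot be represented as a u64", s))
}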
Making use of %epp for easier-to-read repair sequences
By default, pretty-printing lexeme types prints out their identifier in the grammar. These rarely match what the user would expect:
>>> 2 3
Parsing error at line 1 column 3. Repair sequences found:
1: Delete 3
2: Insert PLUS
3: Insert MUL
Result: 2
What are PLUS and MUL? These might be semi-obvious, but many lexeme types are
far from obvious. grmtools allows users to provide human-friendly versions of
these for error recovery using the %epp declaration in grammars. For example,
we can extend the calc grammar as follows:
%epp PLUS "+"
%epp MUL "*"
%epp LBRACK "("
%epp RBRACK ")"
%epp INT "Int"
leading to the following output:
>>> 2 3
Parsing error at line 1 column 3. Repair sequences found:
1: Delete 3
2: Insert +
3: Insert *
Result: 2
Biasing repair sequences
Depending on your language, some repair sequences are better than others. For
example, sometimes Insert repairs are less welcome than Delete repairs:
>>> 2 + + 3
Parsing error at line 1 column 3. Repair sequences found:
1: Insert INT
2: Delete +
Unable to evaluate expression.
>>> 2 + + 3
Parsing error at line 1 column 3. Repair sequences found:
1: Delete +
2: Insert INT
Result: 5
Why does the same input sometimes produce a result and sometimes fail to
produce a result? The problem is that 2 + + 3 has two repair sequences:
Delete + and Insert Int. As things stand, both are equally good, and so one is
chosen non-deterministically. If Insert Int is chosen, we hit the Err case
from earlier, and fail to produce a result; if Delete + is chosen, we can
produce a result.
To lessen this problem, the %avoid_insert L directive causes grmtools to
prefer repair sequences that don't include Insert L over those that do.
Intuitively, we want to annotate lexemes whose value we care about in this way
(e.g. INT), but we don't need to worry about lexemes whose value we never read
(e.g. (, + etc.). In the case of the calculator grammar, a good use of this
directive is as follows:
%avoid_insert "INT"
With this, the Delete + repair sequence is consistently favoured over
Insert INT.
Under the bonnet
For any given syntax error there are, potentially, a finite but vast number of
possible valid repair sequences: far too many to search exhaustively. Error
recovery algorithms such as MF use various heuristics to cut the search space
down to something that is (generally) manageable. Although such cases are
surprisingly few in practice, this inevitably leads to occasional situations
where the repair sequences found (or, more accurately, those not found)
surprise humans.
Timeout
The first surprising condition is that, even with the small calc grammar, some
user inputs lead to such a massive search space that no repair sequences can
be found. The easiest way to trigger this in most grammars is bracket
expressions:
>>> 1+(
Parsing error at line 1 column 4. Repair sequences found:
1: Insert Int, Insert )
Unable to evaluate expression.
>>> 1+((
Parsing error at line 1 column 5. Repair sequences found:
1: Insert Int, Insert ), Insert )
Unable to evaluate expression.
>>> 1+(((((((((((
Parsing error at line 1 column 14. No repair sequences found.
Unable to evaluate expression.
At a certain number of open brackets (which will partly depend on the speed of
your machine), MF simply cannot find suitable repair sequences within its
internal timeout, hence the “No repair sequences found” message. In practice
this happens on less than 2% of real-world inputs, so it is not a significant
worry.
Some “obvious” repair sequences aren't reported at the end of a file
The second surprising condition is more subtle. Before we can show the issue,
we need to introduce the concept of repair sequence ranking: MF only presents
the lowest cost repair sequences to users (where Inserts and Deletes cost 1,
and Shifts cost 0). Higher cost repair sequences are discarded.
In an ideal world, MF would find repair sequences that allow a file to parse
completely successfully. In practice, this is only feasible if a syntax error
occurs near the very end of the input. In most cases, MF is happy with a
weaker condition, which is that a repair sequence ends with 3 Shift repairs,
showing that parsing has got back on track, at least for a little bit. This
condition explains the following:
>>> 2 + + 3
Parsing error at line 1 column 5. Repair sequences found:
1: Delete +
2: Insert Int
Result: 5
>>> 2 + + 3 +
Parsing error at line 1 column 5. Repair sequences found:
1: Insert Int
Parsing error at line 1 column 10. Repair sequences found:
1: Insert Int
Unable to evaluate expression.
For 2 + + 3 we match the human intuition that the input could have been 2 + 3
or 2 + <some int> + 3. However, for the input 2 + + 3 + we do not report a
Delete + repair sequence for the first error in the input. Why?
The first thing we need to know is that repair sequences are always reported
with trailing Shift repairs pruned: for the rest of this subsection it aids
understanding to leave them unpruned. Thus, for 2 + + 3, the two repair
sequences found are Delete +, Shift 3 and Insert Int, Shift +, Shift 3, both
of which cause the entire input to parse successfully, and both of which have
the same cost.
For 2 + + 3 +, however, the first error leads to 3 repair sequences:
Insert Int, Shift +, Shift 3, Shift +; Delete +, Shift 3, Delete +; and
Delete +, Shift 3, Shift +, Insert Int. The latter two are not even completed,
since they are provably higher cost than the Insert Int repair sequence, and
thus aren't reported to the user.
In practice, this situation is rarer than the timeout problem, to the point
that it's arguably not worth worrying about or explaining to end users. Even
when it happens, the repair sequences that MF reports are always correct and
at least one repair sequence will be reported (assuming that error recovery
doesn't time out!).
Error recovery on real-world grammars
Continuing the example from the nimbleparse section, we can see that error
recovery works well on arbitrary grammars. Consider the following
syntactically incorrect Lua 5.3 program:
$ cat test.lua
x = 0
if x > 0
print("greater than")
else
print("less than"}
When run through nimbleparse, the following output is generated:
$ cargo run --release --bin nimbleparse lua5_3.l lua5_3.y test.lua
...
Error at line 3 col 4. Repair sequences found:
1: Insert then
Error at line 5 col 21. Repair sequences found:
1: Insert ), Insert end, Delete }
2: Insert ), Insert {, Shift }, Insert end
Turning off error recovery
By default, lrpar uses the MF error recovery algorithm. You can use the None
error recovery algorithm, which causes parsing to stop as soon as it hits the
first parsing error, with the recoverer method in CTParserBuilder or
RTParserBuilder. For example, we can change calc's build.rs file to:
let lex_rule_ids_map = CTParserBuilder::new()
.yacckind(YaccKind::Grmtools)
.recoverer(lrpar::RecoveryKind::None)
.process_file_in_src("calc.y")?;
and then no matter how many syntax errors we make, only one is reported:
>>> 2++3++
Parsing error at line 1 column 3. No repair sequences found.
Unable to evaluate expression.
Unless you have a good reason to do so (e.g. quickly hacking together a grammar where you would prefer not to think about error recovery at all), we do not recommend turning off error recovery.
Other tools
When parsing text in Rust, you should also evaluate the following tools to see if they are more suitable for your purposes: