grmtools
grmtools is a suite of Rust
libraries and binaries for parsing text, both at compile-time and run-time.
Most users will probably be interested in the compile-time Yacc feature, which
allows traditional .y
files to be used mostly unchanged in Rust. See the
Quickstart Guide for a quick introduction to this feature.
Quickstart Guide
Most users will probably be interested in the compile-time Yacc feature of
grmtools, which allows traditional .y
files to be used mostly unchanged in
Rust. This page is a short guide to get you up and running with this feature as
quickly as possible.
grmtools includes both a Yacc-style LR parser (lrpar
) and a
lex-style lexer (lrlex
). The lexer breaks input up into individual lexemes and
the parser checks to see if the lexemes conform to a grammar. As the parser
executes, it can either create a generic parse tree, or execute user-specified
Rust code.
A calculator evaluator
Let's assume we want to create a simple calculator which can evaluate
expressions such as 2 + 3 * 4
. Assuming a fresh Rust project, we first create
a Cargo.toml
file with the following dependencies:
[package]
name = "calc"
version = "0.0.1"
authors = ["<authors>"]
edition = "2021"
[[bin]]
doc = false
name = "calc"
[build-dependencies]
cfgrammar = "0.13"
lrlex = "0.13"
lrpar = "0.13"
[dependencies]
cfgrammar = "0.13"
lrlex = "0.13"
lrpar = "0.13"
In this situation we want to statically compile the .y
grammar and .l
lexer
into Rust code. We thus need to create a
build.rs
file inside the root of our project which can process the lexer and grammar.
lrlex
provides a simple interface which does both jobs for us in one go.
Our build.rs
file thus looks as follows:
use cfgrammar::yacc::YaccKind;
use lrlex::CTLexerBuilder;
fn main() {
CTLexerBuilder::new()
.lrpar_config(|ctp| {
ctp.yacckind(YaccKind::Grmtools)
.grammar_in_src_dir("calc.y")
.unwrap()
})
.lexer_in_src_dir("calc.l")
.unwrap()
.build()
.unwrap();
}
grmtools accepts several different Yacc variants as input. In our case, we want
to execute Rust code as the input is parsed (rather than creating a generic
parse tree which we traverse later), so we specified that the yacckind
(i.e.
what variant of Yacc file we're using) is YaccKind::Grmtools
. The grammar file
is stored in src/calc.y
, but we only specify calc.y
as the filename to
lrpar
, since it searches relative to src/
automatically. Similarly, we only
needed to specify calc.l
to lrlex
.
The lexer
While Yacc-style parsing is powerful, lex-style lexing is less powerful.
grmtools allows you to use whatever lexer you want with lrpar
. Fortunately, in
this case, lrlex
is powerful enough for us. calc.l
looks as follows:
%%
[0-9]+ "INT"
\+ "+"
\* "*"
\( "("
\) ")"
[\t ]+ ;
Roughly speaking, each line after the %%
line is a regular expression (we use
the regex
crate), a space character, and a
quoted lexeme type name. For example, if the user gives us input such as 234
we will create a single lexeme with a value (234
) and a type (INT
).
The one exception is the final line: if a lexeme type name is replaced with ‘;
’
then any matching input is discarded. In this case, whitespace (tabs and spaces)
is lexed, but no lexemes are created from it.
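For example, a hypothetical extra rule (not part of calc.l) to discard #-style line comments would use the same mechanism:
#[^\n]* ;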
The grammar
Our initial version of calc.y looks as follows:
%start Expr
%%
Expr -> Result<u64, ()>:
Expr '+' Term { Ok($1? + $3?) }
| Term { $1 }
;
Term -> Result<u64, ()>:
Term '*' Factor { Ok($1? * $3?) }
| Factor { $1 }
;
Factor -> Result<u64, ()>:
'(' Expr ')' { $2 }
| 'INT'
{
let v = $1.map_err(|_| ())?;
parse_int($lexer.span_str(v.span()))
}
;
%%
// Any functions here are in scope for all the grammar actions above.
fn parse_int(s: &str) -> Result<u64, ()> {
match s.parse::<u64>() {
Ok(val) => Ok(val),
Err(_) => {
eprintln!("{} cannot be represented as a u64", s);
Err(())
}
}
}
The grammar is in 3 parts, separated by the %%
lines.
The first part specifies general settings for the grammar, at a minimum the
start rule (%start Expr
).
The second part is the Yacc
grammar. It consists of 3
rules (Expr
, Term
, and Factor
) and 6 productions (2 for each rule,
separated by |
characters). Because we are using the Grmtools
Yacc variant,
each rule has a Rust type associated with it (after ->
) which specifies the
type that each production’s action must return. A production (sometimes called an “alternative”)
consists of zero or more symbols. Symbols either reference rules or lexemes. If a
production matches text, its ”action” (the Rust code between curly brackets at
the end of the production) is executed.
lrpar
's actions are subtly different to Yacc. The $x
variables refer to
the respective symbol in the production, numbered from 1 (i.e. $1
refers to the first symbol in
the production). If the symbol references a rule R
then an instance of
R
's type will be stored in the $x
variable; if the symbol references a lexeme then an Option<Lexeme>
instance is returned. A special $lexer
variable allows access to the lexer.
The most commonly used function that $lexer
exposes is the span_str
function, which allows us to extract &'input str
s from a Span
(e.g. to
extract the string represented by a Lexeme
, we would use
$lexer.span_str(lexeme.span())
). As this may suggest, actions may also
reference the special lifetime 'input
(without any $
prefix), which allows
strings to be returned / stored by the grammar without copying memory.
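For example, a rule can return a &'input str directly. This is a hypothetical sketch (the rule name and the 'ID' lexeme are not part of the calc example):
Ident -> Result<&'input str, ()>:
    'ID' { Ok($lexer.span_str($1.map_err(|_| ())?.span())) }
    ;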
The third part is arbitrary Rust code which can be called by productions’
actions. In our case we have a simple function which converts integers as
strings into integers as u64s: if the user provides an invalid number (e.g.
one that is too big), an error message is printed and Err(()) is returned.
This example uses a common grmtools idiom: making use of Result
types. This
allows us to deal with two different issues that prevent evaluation.
First is the “obvious” issue of integers which are too big to represent as
u64
s: these cause Err
to be percolated upwards, preventing evaluation.
Second is the issue of error recovery telling us that the user should have
inserted an integer: since it would be confusing for us to insert a default
value in such cases, we map_err
such cases to Err
, preventing evaluation.
See the section below on error recovery for more details
about error recovery.
Putting everything together
The build.rs
file will statically compile the lexer and grammar into Rust code
that we can then call. The src/main.rs
file below provides a simple
Python-esque REPL to the user into which they can write calculator expressions:
use std::io::{self, BufRead, Write};
use lrlex::lrlex_mod;
use lrpar::lrpar_mod;
// Using `lrlex_mod!` brings the lexer for `calc.l` into scope. By default the
// module name will be `calc_l` (i.e. the file name, minus any extensions,
// with a suffix of `_l`).
lrlex_mod!("calc.l");
// Using `lrpar_mod!` brings the parser for `calc.y` into scope. By default the
// module name will be `calc_y` (i.e. the file name, minus any extensions,
// with a suffix of `_y`).
lrpar_mod!("calc.y");
fn main() {
// Get the `LexerDef` for the `calc` language.
let lexerdef = calc_l::lexerdef();
let stdin = io::stdin();
loop {
print!(">>> ");
io::stdout().flush().ok();
match stdin.lock().lines().next() {
Some(Ok(ref l)) => {
if l.trim().is_empty() {
continue;
}
// Now we create a lexer with the `lexer` method with which
// we can lex an input.
let lexer = lexerdef.lexer(l);
// Pass the lexer to the parser and lex and parse the input.
let (res, errs) = calc_y::parse(&lexer);
for e in errs {
println!("{}", e.pp(&lexer, &calc_y::token_epp));
}
match res {
Some(Ok(r)) => println!("Result: {:?}", r),
_ => eprintln!("Unable to evaluate expression.")
}
}
_ => break
}
}
}
We can now cargo run
our project and evaluate simple expressions:
>>> 2 + 3
Result: 5
>>> 2 + 3 * 4
Result: 14
>>> (2 + 3) * 4
Result: 20
Error recovery
Because powerful error recovery is built into lrpar
, we can even make minor
errors and have the system recover automatically:
>>> 2 + + 3
Parsing error at line 1 column 5. Repair sequences found:
1: Delete +
2: Insert INT
Result: 5
>>> 2 + 3 3
Parsing error at line 1 column 7. Repair sequences found:
1: Delete 3
2: Insert +
3: Insert *
Result: 5
>>> 2 + 3 4 5
Parsing error at line 1 column 7. Repair sequences found:
1: Insert *, Delete 4
2: Insert +, Delete 4
3: Delete 4, Delete 5
4: Insert *, Shift 4, Delete 5
5: Insert *, Shift 4, Insert +
6: Insert *, Shift 4, Insert *
7: Insert +, Shift 4, Delete 5
8: Insert +, Shift 4, Insert +
9: Insert +, Shift 4, Insert *
Result: 17
Note that we didn't have to do anything clever in order for error recovery to
happen: it happens by default, and it works with whatever grammar we throw at
it. The way to read the resulting error messages is that each numbered repair
sequence is a way that the error recovery system found to make sense of the
input. For example, for the input 2 + + 3
, an error is detected at the second
+
: we could either delete the second +
(option 1 above) or insert an
integer. In all cases, error recovery applies repair sequence 1, and continues
parsing. 2 + + 3
was thus parsed as if the user had written 2 + 3
,
hence why it evaluated to 5. Similarly, 2 + 3 4 5
was parsed as if the user
had written 2 + 3 * 5
.
Error recovery opens up a number of possibilities to customise and streamline
the user experience. For example, the simple approach above refuses to evaluate the expression if the user provides a non-u64 number or if error recovery inserts an integer.
For more details about the possibilities, see the section on error
recovery.
Lexing
Lexing is the act of taking in an input stream and splitting it into lexemes.
Colloquially, lexing is often described as splitting input into words. In
grmtools
, a Lexeme has a type (e.g. "INT", "ID"), a value (e.g. "23",
"xyz"), and knows which part of the user's input matched (e.g. "the input
starting at index 7 to index 10"). There is also a simple mechanism to
differentiate lexemes of zero length (e.g. DEDENT
tokens in Python) from
lexemes inserted by error recovery.
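As a minimal sketch of this interface (the token ID and byte offsets here are arbitrary, chosen only for illustration):

use lrlex::DefaultLexeme;
use lrpar::Lexeme;

fn main() {
    // A lexeme with (hypothetical) token ID 0, starting at byte 7, 3 bytes long.
    let lex = DefaultLexeme::new(0u32, 7, 3);
    // The span records which part of the input the lexeme covers...
    assert_eq!((lex.span().start(), lex.span().end()), (7, 10));
    // ...and `faulty` reports whether it was inserted by error recovery.
    assert!(!lex.faulty());
}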
lrpar
provides a generic lexing interface which any lexer can plug into.
Many lexing tasks can easily be carried out by lrlex
, a
lex
replacement. lrlex
also provides helper functions which make it easier
to hand-write lexers.
Lex compatibility
grmtools currently supports one common use of Lex, which is to produce a sequence of tokens. All Lex files require at least some porting to grmtools, though in many cases this is fairly trivial. Nevertheless, aspects such as the longest match rule are identical to Lex, and we assume familiarity with Lex syntax and its major features: the Lex manual is recommended reading.
Major differences
There are several major differences between Lex and grmtools:
- Lex has its own regular expression language whereas grmtools uses the well known Rust regex crate for regular expressions. These two regular expression languages are very similar, but complex regular expressions might not be supported under one or the other.
- Lex files consist of a sequence of regular expressions and an action for each. grmtools lex files consist of a sequence of regular expressions and a token name. Actions are not currently supported (and, by extension, nor are special action expressions such as ECHO and REJECT).
- Both Lex and grmtools lex files support start conditions as an optional prefix to regular expressions, listing the states in which the expression is considered for matching against the input. Lex uses a special action expression BEGIN(state) to switch to the named state. Start states in grmtools are described in start_states.
- Character sets, and changes to internal array sizes, are not supported by grmtools.
- Escape sequences: in addition to the escape sequences involved in the escaping of regular expressions, Lex and grmtools support the escape sequences \123 (octal), \x1234 (hexadecimal), and the ASCII escape sequences \\ \a \f \n \r \t \v. Lex also interprets the escape sequence \b as backspace; since regex treats \b as a word boundary, grmtools does too. Additional escape sequences supported by regex are the \u1234 and \U12345678 escape sequences for unicode characters, the \p and \P unicode character classes, as well as the \d \D \s \S \w \W perl character classes, and the \A \b \B \z escape sequences. Both Lex and grmtools support escaping arbitrary characters: for all other characters besides those listed above, an escaped character \c is passed to the regex engine as the character c. This is useful when a character has a meaning within the lex format. An example of this is when the character < is used at the beginning of a regex: both Lex and grmtools interpret this as the beginning of a start condition prefix, which can be escaped with \< to ensure it is treated as the start of a regular expression. The characters to which this behaviour applies are affected by the escape sequence differences listed above.
- Lex treats lines in the rules section beginning with whitespace as code to be copied verbatim into the generated lexer source. grmtools lex does not support these and produces an error.
Hand-written lexers
lrpar
provides a generic lexing interface which any lexer can plug into.
Users can provide
one or both of a custom lexeme type -- conforming to
lrpar::Lexeme
-- and a custom lexing type -- conforming to
lrpar::NonStreamingLexer
.
If you wish to use a custom lexer, you will need to instantiate lrpar
appropriately (both
CTParserBuilder
and
RTParserBuilder
).
For many purposes, the low-level control and performance that lrpar
gives you is unneeded,
and the boiler-plate that comes with it unwanted. Fortunately, lrlex
provides the following convenience mechanisms to make it easier to use a hand-written lexer with lrpar
:
- lrlex's normal LRNonStreamingLexer struct can be instantiated by an end-user with an input stream, a list of lexemes created from that input stream, and the newlines encountered while lexing that input stream. This saves having to define a custom instance of the lrpar::NonStreamingLexer trait.
- lrlex's DefaultLexeme struct can also be instantiated by end-users, saving having to define a custom instance of the lrpar::Lexeme trait.
- lrlex exposes a ct_token_map function to be used from build.rs scripts which automatically produces a Rust module with one constant per token ID. ct_token_map is explicitly designed to be easy to use with lrpar's compile-time building.
Putting these together is then relatively easy. First a build.rs
file for a
hand-written lexer will look roughly as follows:
use cfgrammar::yacc::YaccKind;
use lrlex::{ct_token_map, DefaultLexerTypes};
use lrpar::CTParserBuilder;

fn main() {
    let ctp = CTParserBuilder::<DefaultLexerTypes<u8>>::new()
        .yacckind(YaccKind::Grmtools)
        .grammar_in_src_dir("grammar.y")
        .unwrap()
        .build()
        .unwrap();
    ct_token_map::<u8>("token_map", ctp.token_map(), None).unwrap()
}
This produces a module that can be imported with lrlex_mod!("token_map")
. The
module will contain one constant, prefixed with T_, per token identifier in the
grammar. For example, for the following grammar excerpt:
Expr -> Result<u64, ()>:
Expr 'PLUS' Term { Ok($1? + $3?) }
| Term { $1 }
;
the module will contain const T_PLUS: u8 = ...;
.
Since Yacc grammars can contain token identifiers which are not valid Rust
identifiers, ct_token_map
allows you to provide a map from the token
identifier to a "Rust friendly" variant. For example, for the following grammar
excerpt:
Expr -> Result<u64, ()>:
Expr '+' Term { Ok($1? + $3?) }
| Term { $1 }
;
we would provide a map '+' => 'PLUS'
leading, again, to a constant T_PLUS
being defined.
One can then write a simple custom lexer which lexes all the input in one go
and returns an LRNonStreamingLexer
as follows:
use cfgrammar::NewlineCache;
use lrlex::{lrlex_mod, DefaultLexeme, DefaultLexerTypes, LRNonStreamingLexer};
use lrpar::{lrpar_mod, Lexeme, NonStreamingLexer, Span};

lrlex_mod!("token_map");
use token_map::*;

fn lex(s: &str) -> LRNonStreamingLexer<DefaultLexerTypes<u8>> {
    let mut lexemes = Vec::new();
    let mut newlines = NewlineCache::new();
    let mut i = 0;
    while i < s.len() {
        if i == ... {
            lexemes.push(DefaultLexeme::new(T_PLUS, i, ...));
        } else {
            ...
        }
    }
    LRNonStreamingLexer::new(s, lexemes, newlines)
}
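Such a lexer can then be used just like an lrlex-generated one. The following usage sketch assumes the grammar from the build.rs above was brought into scope with lrpar_mod!("grammar.y") (so its module is grammar_y) and that lex is the hand-written function sketched above:

use lrpar::lrpar_mod;

lrpar_mod!("grammar.y");

fn main() {
    // `lex` is the hand-written lexing function sketched above.
    let lexer = lex("2+3");
    let (res, errs) = grammar_y::parse(&lexer);
    for e in errs {
        println!("{}", e.pp(&lexer, &grammar_y::token_epp));
    }
    if res.is_some() {
        println!("parsed successfully");
    }
}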
Start States
The following explains the syntax and semantics of Start States in lrlex.
A working example can be found in the repository at lrpar/examples/start_states
Motivation
Start states are a feature from lex which can be used for context sensitive lexing. For instance, they can be used to implement nested comments (see the example in the repository), so that the start/end markers of comments remain balanced.
This is achieved by making rules which are qualified to match only when the lexer is in a particular state. Additionally, the lexer has a stack of states, and matching rules perform actions which modify the stack.
The INITIAL start state
Unless specified otherwise all lex rules are members of the INITIAL start state.
%%
<INITIAL>a "A"
<INITIAL>[\t \n]+ ;
This is equivalent to the following lex file with no start states specified:
%%
a "A"
[\t \n]+ ;
Rules matching multiple states
Rules can be matched in multiple states; just separate the states a rule should match in with commas.
The following matches the a
character when in either of the states FirstState
or SecondState
.
<FirstState,SecondState>a "A"
Differences from POSIX lex
In POSIX lex, start states are entered via code in the action, through either BEGIN(STATE) or calling combinations of yy_push_state and yy_pop_state.
Because lrlex is actionless and does not support code actions, it instead provides operators that perform the common modifications to the stack of start states.
Push
The push operator is given by adding '+' to the target state on the right hand side within angle brackets. In the following, when the regex matches while in CURRENT_STATE, TARGET_STATE is pushed onto the top of the stack of states.
<CURRENT_STATE>Regex <+TARGET_STATE>;
Pop
The pop operator is given by adding '-' to the target state on the right hand side within angle
brackets. In the following, when the regex matches while in CURRENT_STATE, the current state is popped off the
stack of states, similar to calling yy_pop_state
from action code.
<CURRENT_STATE>Regex <-CURRENT_STATE>;
ReplaceStack
The ReplaceStack operator is given by naming the target state within angle brackets. The ReplaceStack operator clears the entire stack of states, then pushes the target state.
<CURRENT_STATE>Regex <TARGET_STATE>;
Returning a token while performing an operator.
Start state operators can be combined with returning a token for example:
<CURRENT_STATE>Regex <+TARGET_STATE>"TOKEN"
Adding a start state
Start states come in two forms, exclusive and inclusive. These are given by %x
and %s
respectively.
Exclusive states
In an exclusive state, a rule can be matched only if it is prefixed with that state.
In the following, because ExclState is exclusive, the #= rule is only matched during the INITIAL state, while the a and =# characters are only matched while in ExclState.
%x ExclState
%%
#= <+ExclState>;
<ExclState>a "A"
<ExclState>=# <-ExclState>;
Inclusive states
When the lexer is in an inclusive state, rules with no start state specified are also considered for matching.
%s InclusiveState
%%
a "A"
<InclusiveState>b "B"
<INITIAL>#= <+InclusiveState>;
<InclusiveState>=# <-InclusiveState>;
This is equivalent to the following using exclusive states.
%x Excl
%%
<INITIAL, Excl>a "A"
<Excl>b "B"
<INITIAL>#= <+Excl>;
<Excl>=# <-Excl>;
Parsing
Parsing is the act of checking whether a stream of lexemes match a grammar. Since a simple "yes/no" answer is rarely useful, it is common to execute user-defined actions during parsing.
grmtools
contains libraries (cfgrammar
and
lrtable
) which allow users to build their own LR parsers in
whatever fashion they want. However, for 99% of cases, the lrpar
library is what users want and need: a (largely) Yacc-compatible parser. Roughly
speaking, the core parts of grammars work identically in Yacc and lrpar
, but
some other parts of the system have been modernised (e.g. to avoid the use of
global variables) and given a more idiomatic Rust feel. Notably, lrpar
is
built from the ground-up to have a powerful, flexible approach to error
recovery.
Yacc compatibility
grmtools supports most major Yacc features, to the extent that many Yacc grammars can be used unchanged with grmtools. In this book we assume familiarity with Yacc syntax and its major features: the Yacc manual is recommended reading.
Major differences
There are several differences between Yacc and grmtools including:
- grmtools has no equivalent of any of the yy* functions (e.g. yyerror, yylex, yylval, yyparse and so on). This means, for example, that grammar actions cannot currently influence the lexer in any way.
- grmtools has an entirely different approach to error recovery. The token error and the special action expressions yyerrok and yyclearin are not supported. In general, users can simply remove alternatives that consist solely of error.
- %union can be mapped to %actiontype in grmtools, though this is rarely the best way of using a Yacc grammar in Rust. See the Grmtools Yacc variant below for the most common way of making grammars do something useful; in a limited number of cases (e.g. if you just want to build a parse tree), you may find the "Original" Yacc variant useful.
- grmtools allows both Yacc's %expect and Bison's %expect-rr declarations in its base "Yacc" mode.
- Bison's %parse-param can take multiple arguments. grmtools' %parse-param takes a single argument which can be a tuple, thus emulating multiple arguments while integrating naturally into Rust's type system (see the sketch after this list).
- Although rare, it is possible to generate accept/reduce conflicts (e.g. for a grammar with the sole rule A: A;). grmtools considers accept/reduce conflicts to be a hard error, and refuses to generate anything for the resulting grammar, whereas Yacc allows them through (with unclear consequences). Bison also appears to consider accept/reduce conflicts a hard error, though it appears to detect them in a more generic way (reporting such rules as "not generating any sentences").
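As a hedged illustration of the tuple idiom (the rule, parameter name, and types here are made up for this sketch, and an 'ID' lexeme is assumed), a grammar might emulate two arguments as follows:
%start R
%parse-param args: (u64, &'static str)
%%
R -> String:
    'ID' { format!("{}{}", args.0, args.1) }
    ;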
YaccKinds
Grmtools
YaccKind::Grmtools
is grmtools' own variant of Yacc syntax, and the one that
most users will want to use. The most significant difference to "normal" Yacc
is that rules are annotated with a Rust type to which all of their productions'
actions must adhere. Note that whilst a rule's productions must all adhere
to a single type, different rules can have different types. Consider the
following snippet:
R1 -> Result<i32, ()>:
'a' { Ok(5) }
| 'b' { Err(()) }
;
R2 -> u64:
| { 0 }
;
Here the rule R1
has a Rust return type of Result<X, ()>
(between ->
and
:
). Both of its productions adhere to this type, the first by instantiating
Ok(5)
and the second Err(())
. The rule R2
has a return type of u64
.
“Original” Yacc
Although the name is not fully accurate (grmtools supports a slightly disjoint subset of original Yacc's input), this mode allows users to most easily test externally created Yacc files. Several sub-variants are allowed:
- YaccKind::Original(YaccOriginalActionKind::GenericParseTree) does not execute user actions, but instead creates a generic parse tree, where elements are instances of the lrpar::parser::Node enum. This is useful for quickly testing whether a parser is accepting the intended language.
- YaccKind::Original(YaccOriginalActionKind::NoAction) parses input and reports errors but does not execute any user actions. This is useful if you are trying to find out whether a corpus of input parses successfully against your grammar or not.
- YaccKind::Original(YaccOriginalActionKind::UserAction) models original Yacc most closely but, in a Rust setting, is probably of little use beyond simple calculator like languages. Instead of Yacc's %union directive, users can specify %actiontype which is a Rust type to which every production's actions in the grammar must adhere. Unless all actions happen to naturally return the same type, this quickly becomes cumbersome to use. For most use cases, YaccKind::Grmtools is a superior alternative.
Action code and return types
Action code
Action code is normal Rust code with the addition of the following special variables:
- $1 ... $n refer to the respective symbol in the production, numbered from 1 (i.e. $1 refers to the first symbol in the production). If the symbol references a rule R then an instance of R's type will be stored in the $i variable. If the symbol references a lexeme then a Result<Lexeme<StorageT>, Lexeme<StorageT>> instance is returned where the Ok variant is used for lexemes that are directly derived from the user's input and the Err variant is used for lexemes that have been inserted by error recovery.
- $lexer allows access to the lexer and its various functions. The most commonly used of these is the span_str function, which allows us to extract &'input strs from a Span (e.g. to extract the string represented by a Lexeme, we would use $lexer.span_str(lexeme.span())). As this may suggest, actions may also reference the special lifetime 'input (without any $ prefix), which allows strings to be returned / stored by the grammar without copying memory.
- $span is a cfgrammar::Span which captures how much of the user's input the current production matched (see the sketch below).
- $$ is equivalent to $ in normal Rust code.
Any other variables beginning with $
are treated as errors.
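For example, the following hypothetical rule (assuming use cfgrammar::Span; in the grammar's final section) returns the Span covering everything the production matched alongside the first lexeme's own span:
Pair -> (Span, Span):
    'INT' 'INT'
    {
        // $1 is the first lexeme; $span covers both 'INT's and anything
        // matched in between. unwrap() will panic on inserted lexemes.
        ($1.unwrap().span(), $span)
    }
    ;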
Return types
Productions' return types can be any arbitrary Rust type. You may in addition make use of the following:
- The generic parameter StorageT references the type of lexemes and is typically used with the Lexeme type i.e. Lexeme<StorageT>. This allows you to return lexemes from rules (see the sketch below).
- The lifetime 'input allows you to extract strings whose lifetime is tied to the lexer and return them from rules / store them in structs without copying. Lexer::span_str returns such strings and the typical idiom of use is &'input str.
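For instance, a rule can hand a lexeme back to its caller unchanged. This is a hypothetical sketch, using the type StorageT = u32; alias shown in the idioms section later:
IdTok -> Result<Lexeme<StorageT>, ()>:
    "ID" { $1.map_err(|_| ()) }
    ;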
Additional parse parameter
A single extra parameter can be passed to action functions if the %parse-param <var>: <type>
declaration is used. The variable <var>
is then visible in all
action code. <type>
must implement the Clone
trait (note that Copy
bounds imply Clone
, and &
references implement Copy
).
For example if a grammar has a declaration:
%parse-param p: u64
then the statically generated parse
function will take two parameters
(lexer: &..., p: u64)
and the variable p
can be used in action code e.g.:
R -> ...:
'ID' { format!("{}{}", p, ...) }
;
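On the calling side, the extra argument is simply passed after the lexer. This is a hedged sketch, assuming the calc grammar from the quickstart has been given the %parse-param p: u64 declaration above:

use lrlex::lrlex_mod;
use lrpar::lrpar_mod;

lrlex_mod!("calc.l");
lrpar_mod!("calc.y");

fn main() {
    let lexerdef = calc_l::lexerdef();
    let lexer = lexerdef.lexer("1 + 2");
    // With `%parse-param p: u64`, the generated `parse` function takes the
    // parameter after the lexer argument.
    let (_res, _errs) = calc_y::parse(&lexer, 42u64);
}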
grmtools parsing idioms
grmtools is a flexible tool and can be used in many ways. However, for those
using the Grmtools
format, the simple idioms below can often make life easier.
Return Span
s when possible
When executing grammar actions one is often building up an Abstract Syntax Tree (AST) or equivalent. For example consider a simple language with assignments:
Assign: "ID" "=" Expr;
Perhaps the "obvious" way to build this into an AST is to extract the string representing the identifier as follows:
Assign -> ASTAssign: "ID" "=" Expr
{
let id = $lexer.span_str($1.as_ref().unwrap().span()).to_string();
ASTAssign::new(id, $3)
}
%%
struct ASTAssign {
id: String,
expr: Box<Expr>
}
impl ASTAssign {
fn new(id: String, expr: Expr) -> Self {
ASTAssign { id, expr: Box::new(expr) }
}
}
This approach is easy to work with, but isn't as performant as may be desired:
the to_string
call allocates memory and copies part of the user's input into
that. It also loses information about the part of the user's input that the
string relates to.
An alternative approach is not to convert the lexeme into a String
during
parsing, but simply to return a
Span
. An outline of this
is as follows:
Assign -> ASTAssign: "ID" "=" Expr
{
ASTAssign { id: $1.as_ref().unwrap().span(), expr: Box::new($3) }
}
%%
type StorageT = u32;
struct ASTAssign {
id: Span,
expr: Box<Expr>
}
enum Expr { ... }
If this is not quite what you want to do, you can use largely the same trick with
the Lexeme
struct
.
Working with Lexeme
s has the advantage that you can tell what the type of the
lexeme in question is, though generally this is entirely clear from AST
context, and Lexeme
's type parameter makes it marginally more fiddly to work
with than Span
.
Alternatively, if you really want to extract strings during parsing, consider
using the 'input
to extract &str
's during parsing, since this does not
cause any additional memory to be allocated.
Have rules return a Result
type
As described in the error recovery
section, it
is generally a good idea to give rules a Result
return type as this allows
you to easily stop, or change, action code execution if you encounter
"important" inserted lexemes. There are many ways that you can use this, but
many simple cases work well using either:
- Err(()) works well if you are creating a parse tree and simply want to stop creating the tree when you encounter an important inserted lexeme.
- Err(Box<dyn Error>) works well if you are performing more detailed evaluation while parsing and wish to explain to the user why you stopped evaluating when you encountered an important inserted lexeme.
Using Err(())
The idea here is that we stop evaluating normal action code by returning
Err(())
. However, this can lead to endless instances of the following
map_err
idiom:
R -> Result<..., ()>:
"ID" { $1.map_err(|_| ())? }
;
It can be helpful to define a custom map_err
function which hides some of this
mess for you:
R -> Result<Lexeme<StorageT>, ()>:
"ID" { map_err($1)? }
;
%%
fn map_err(r: Result<Lexeme<StorageT>, Lexeme<StorageT>>)
-> Result<Lexeme<StorageT>, ()>
{
r.map_err(|_| ())
}
Using Err(Box<dyn Error>)
The idea here is that we both stop evaluating normal action code, and explain
why, by returning Err(Box<dyn Error>)
. Although Box<dyn Error>
is something
of a mouthful, it allows you significant flexibility in what you return in
error situations. If you want to quickly experiment, then this is convenient
because the token type Result<Lexeme<StorageT>, Lexeme<StorageT>>
can be
automatically coerced to Box<dyn Error>
(e.g. $1?
in action code will
return the Err
variant without additional code). You can also return
strings-as-errors with Box::<dyn Error>::from("...")
.
Using this idiom we can change our calculator example to deal with many more possible sources of error:
%start Expr
%avoid_insert "INT"
%%
Expr -> Result<u64, Box<dyn Error>>:
Expr '+' Term
{
Ok($1?.checked_add($3?)
.ok_or(Box::<dyn Error>::from("Overflow detected."))?)
}
| Term { $1 }
;
Term -> Result<u64, Box<dyn Error>>:
Term '*' Factor
{
Ok($1?.checked_mul($3?)
.ok_or(Box::<dyn Error>::from("Overflow detected."))?)
}
| Factor { $1 }
;
Factor -> Result<u64, Box<dyn Error>>:
'(' Expr ')' { $2 }
| 'INT'
{
parse_int(
$lexer.span_str(
$1.map_err(|_| "<evaluation aborted>")?.span()))
}
;
%%
// Any imports here are in scope for all the grammar actions above.
use std::error::Error;
fn parse_int(s: &str) -> Result<u64, Box<dyn Error>> {
match s.parse::<u64>() {
Ok(val) => Ok(val),
Err(_) => {
Err(Box::from(
format!("{} cannot be represented as a u64", s)))
}
}
}
Define a flatten
function
Yacc grammars make specifying sequences of things something of a bore. A common idiom is thus:
ListOfAs -> Result<Vec<A>, ()>:
A { Ok(vec![$1?]) }
| ListOfAs A
{
let mut $1 = $1?;
$1.push($2?);
Ok($1)
}
;
A -> Result<A, ()>: ... ;
Since this idiom is often present multiple times in a grammar, it's generally
worth adding a flatten
function to hide some of this:
ListOfAs -> Result<Vec<A>, ()>:
A { Ok(vec![$1?]) }
| ListOfAs A { flatten($1, $2) }
;
A -> Result<A, ()>: ... ;
%%
fn flatten<T>(lhs: Result<Vec<T>, ()>, rhs: Result<T, ()>)
-> Result<Vec<T>, ()>
{
let mut flt = lhs?;
flt.push(rhs?);
Ok(flt)
}
Note that flatten
is generic with respect to T
so that it can be used in
multiple places in the grammar.
Composing idioms
The above idioms compose well together. For example, flatten
, map_err
, and
Lexeme
can be used together as shown in the following example:
ListOfIds -> Result<Vec<Lexeme<StorageT>>, ()>:
"ID" { Ok(vec![map_err($1)?]) }
| ListOfIds "Id" { flatten($1, map_err($2)?) }
;
%%
type StorageT = u32;
fn map_err(r: Result<Lexeme<StorageT>, Lexeme<StorageT>>)
-> Result<Lexeme<StorageT>, ()>
{
r.map_err(|_| ())
}
fn flatten<T>(lhs: Result<Vec<T>, ()>, rhs: Result<T, ()>)
-> Result<Vec<T>, ()>
{
let mut flt = lhs?;
flt.push(rhs?);
Ok(flt)
}
Error recovery
One of lrpar
's most powerful features is its approach to error recovery, which
can be used with any grammar. This section outlines the background to error
recovery, the choices that users can make, and how to best make use of this
feature.
Error recovery background
Programmers frequently make mistakes when entering input, either because of simple typos, or an outright failure to use the correct syntax. Happily, LR parsing guarantees to report syntax errors at the first point that an error can be definitively proven to have occurred (though note that this might not be the same point that a user would consider the error to have been made). It has long been a goal of parsing technologies to recover from such errors, and allow parsing to continue. This allows users to fix all their syntax errors in one go and, optionally, post-parsing phases to operate as if no syntax errors had been made at all. For example, a compiler author might decide to run the compiler's static type checker even in the presence of syntax errors (since many static type errors are unaffected by syntax errors), but not generate code (which might incorrectly give users the illusion that their code is safe to run).
However, most mainstream parsers do a bad job of error recovery. The most common generic error recovery algorithm is "panic mode" (in reality, a family of algorithms). Unfortunately such simple error recovery algorithms do a poor job of recovering from syntax errors, causing a cascade of spurious further syntax errors to be reported. Programmers quickly learn that only the first reported syntax error can be trusted on to be correct.
lrpar
implements the CPCT+
error recovery algorithm from Reducing
Cascading Parsing Errors Through Fast Error
Recovery, which, in our biased opinion, does
a better job than previous approaches. It is fast, grammar neutral, and reports
multiple repair sequences to users, allowing them to consider which best
matches their intentions.
No matter how clever we think CPCT+
is, it is important to understand that it has
a fundamental limitation: it only knows about a language's syntax; it has no
concept of the language's semantics beyond that implied by the structure of the
grammar; and it cannot control what the user does with the result of error
recovery. Thus, grammar writers can significantly influence how useful error
recovery is for users. Most of the rest of this section explains how best to
make use of error recovery.
Error recovery basics
A simple calculator grammar looks as follows:
%start Expr
%%
Expr -> u64:
Expr '+' Term { $1 + $3 }
| Term { $1 }
;
Term -> u64:
Term '*' Factor { $1 * $3 }
| Factor { $1 }
;
Factor -> u64:
'(' Expr ')' { $2 }
| 'INT' { parse_int($lexer.span_str($1.unwrap().span())) }
;
%%
// Any functions here are in scope for all the grammar actions above.
fn parse_int(s: &str) -> u64 {
match s.parse::<u64>() {
Ok(val) => val,
Err(_) => panic!("{} cannot be represented as a u64", s)
}
}
For this simplification we need to make a small tweak to our main.rs
changing:
match res {
Some(Ok(r)) => println!("Result: {}", r),
_ => eprintln!("Unable to evaluate expression.")
}
to:
match res {
Some(r) => println!("Result: {}", r),
_ => eprintln!("Unable to evaluate expression.")
}
For many examples, this simple grammar and its actions work well leading to output such as the following:
>>> 2 + + 3
Parsing error at line 1 column 5. Repair sequences found:
1: Delete +
2: Insert INT
Result: 5
Insert x
means “error recovery inserted a lexeme of type x”; Delete x
means “error recovery deleted the next lexeme in the stream”; and Shift x
means “error recovery kept the user’s lexeme x as-is”.
Repair sequences are minimal ways of adjusting the user’s input such that it
becomes correct relative to the underlying grammar. Intuitively, in this
example, the two repair sequences would adjust the input to be equivalent to
2 + 3
(repair sequence 1) or 2 + <some int> + 3
(repair sequence 2). When
more than one repair sequence is presented to the user, the first is used by the
algorithm to continue parsing: in this case, the input was parsed as if it was
equivalent to 2 + 3
, hence the evaluation of the input to 5
.
Repair sequences can, as their name suggests, be of arbitrary length:
>>> 2 + 3 4 5
Parsing error at line 1 column 7. Repair sequences found:
1: Insert *, Delete 4
2: Insert +, Delete 4
3: Delete 4, Delete 5
4: Insert *, Shift 4, Delete 5
5: Insert *, Shift 4, Insert +
6: Insert *, Shift 4, Insert *
7: Insert +, Shift 4, Delete 5
8: Insert +, Shift 4, Insert +
9: Insert +, Shift 4, Insert *
Result: 17
In this case, the first repair sequence caused the input to be parsed as if it
was equivalent to 2 + 3 * 5
, hence the evaluation of the input to 17
.
Syntax errors and language semantics
Our example inputs so far have deliberately exploited cases where the first
repair sequence at worst inserted “unimportant” lexemes such as +
and *
.
Since the grammar’s actions never read the values of such lexemes, only their
type is important. However, what should happen if error recovery inserts an
integer, whose value is later read by one of the grammar’s actions? An example
shows the unhappy result:
>>> 2+
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Lexeme { start: 2, len: 4294967295, tok_id: 4 }', libcore/result.rs:1009:5
note: Run with `RUST_BACKTRACE=1` for a backtrace.
>>>
In this case, the first repair sequence was Insert INT
. The fundamental
problem is that while error recovery can adjust the user’s input to insert a
lexeme of type INT
, neither it nor the parser have any idea what value might
have made sense for that lexeme. Thus the expression above caused the expression
$lexer.span_str($1.unwrap().span())
to panic, since $1
was Err(<lexeme>)
.
It is thus up to the user to decide what to do in the face of the inevitable semantic issues that error recovery highlights. Fortunately, this is generally simpler than it sounds with only a slight rethink in the way that we tend to write a grammar's actions.
A rule of thumb: have rules return a Result
type
Although rules can have any Rust type you can imagine, using a Result
type
allows a (deliberately) simple interaction with the effects of error recovery.
The basic idea is simple: in actions, we ignore lexemes whose value we don't
care about (e.g. brackets); for lexemes whose value we care about, we either
introduce a default value, or percolate an Err
upwards. Default values make
sense in certain situations. For example, if you're writing a compiler, and want
to run a static type checker even after syntax errors, it might make sense to
assume that Insert 0
is a good substitute for Insert INT
. However, in the
case of the calculator, default values are likely to lead to confusing results.
We thus change the grammar so that inserted integers prevent evaluation from
occurring:
%start Expr
%%
Expr -> Result<u64, ()>:
Expr '+' Term { Ok($1? + $3?) }
| Term { $1 }
;
Term -> Result<u64, ()>:
Term '*' Factor { Ok($1? * $3?) }
| Factor { $1 }
;
Factor -> Result<u64, ()>:
'(' Expr ')' { $2 }
| 'INT' { parse_int($lexer.span_str($1.map_err(|_| ())?.span())) }
;
%%
// Any functions here are in scope for all the grammar actions above.
fn parse_int(s: &str) -> Result<u64, ()> {
match s.parse::<u64>() {
Ok(val) => Ok(val),
Err(_) => panic!("{} cannot be represented as a u64", s)
}
}
The basic idea here is that every action returns an instance of Result<u64, ()>
: if we receive Ok(u64)
we successfully evaluated the expression, but if
we received Err(())
we were not able to evaluate the expression. If we
encounter an integer lexeme which is the result of error recovery, then the
INT
lexeme in the second Factor
action will be Err(<lexeme>)
. By writing
$1.map_err(|_| ())?
we’re saying “if the integer lexeme was created by error
recovery, percolate Err(())
upwards”. We then have to tweak a couple of other
actions to percolate errors upwards, but this is a trivial change. We'll also need to
change main.rs
back to expecting a Result
.
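In other words, the match over res reverts to the Result-aware form shown in the quickstart:

match res {
    Some(Ok(r)) => println!("Result: {}", r),
    _ => eprintln!("Unable to evaluate expression.")
}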
Now the input which previously caused a panic simply tells the user that it could not evaluate the expression:
>>> 2+
Parsing error at line 1 column 3. Repair sequences found:
1: Insert INT
Unable to evaluate expression.
Usefully, our inability (or unwillingness) to evaluate the expression does not prevent further syntax errors from being discovered and repaired:
>>> (2+)+3+4+
Parsing error at line 1 column 4. Repair sequences found:
1: Insert Int
Parsing error at line 1 column 10. Repair sequences found:
1: Insert Int
Unable to evaluate expression.
Using a Result
type allows the user arbitrary control over the classes of
syntax errors they are prepared to deal with or not. For example, we could
remove the panic
from parse_int
by making the rules have a type Result<u64, String>
where the Err
case would report a string such as “18446744073709551616 cannot
be represented as a u64” for the first unrepresentable u64
in the user's
input. If we wanted to report all unrepresentable u64
s, we could have
the rules have a type Result<u64, Vec<String>>
, though merging together the errors found
on the left and right hand sides of the +
and *
operators requires adding a
few lines of code.
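A hedged sketch of such a merge (the function name and error strings are made up for illustration):

// Merge the errors found on both sides of a binary operator when rules have
// the type Result<u64, Vec<String>>.
fn merge_add(lhs: Result<u64, Vec<String>>, rhs: Result<u64, Vec<String>>)
    -> Result<u64, Vec<String>>
{
    match (lhs, rhs) {
        // Both sides evaluated: add them, reporting overflow as an error.
        (Ok(l), Ok(r)) => l.checked_add(r)
            .ok_or_else(|| vec!["Overflow detected.".to_string()]),
        // Both sides failed: concatenate their error lists.
        (Err(mut l), Err(r)) => { l.extend(r); Err(l) }
        // Exactly one side failed: pass its errors through.
        (Err(e), _) | (_, Err(e)) => Err(e),
    }
}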
Making use of %epp
for easier to read repair sequences
By default, pretty-printing lexeme types prints out their identifier in the grammar. Up to now, we have used lexeme types when showing output to the user. While the lexeme types are sometimes adequate for this purpose, this is not always the case. Consider this lex file:
%%
[0-9]+ "INT"
\+ "PLUS"
\* "MUL"
\( "LBRACK"
\) "RBRACK"
[\t ]+ ;
The user would see output such as:
>>> 2 3
Parsing error at line 1 column 3. Repair sequences found:
1: Delete 3
2: Insert PLUS
3: Insert MUL
Result: 2
What are PLUS
and MUL
? These might be semi-obvious, but many lexeme types
are far from obvious. grmtools
allows users to provide human friendly versions
of these for error recovery using the %epp
declaration in grammars. For
example, we can extend the calc
grammar as follows:
%epp PLUS "+"
%epp MUL "*"
%epp LBRACK "("
%epp RBRACK ")"
%epp INT "Int"
leading to the following output:
>>> 2 3
Parsing error at line 1 column 3. Repair sequences found:
1: Delete 3
2: Insert +
3: Insert *
Result: 2
Biasing repair sequences
Depending on your language, some repair sequences are better than others. For
example, sometimes Insert
repairs are less welcome than Delete
repairs:
>>> 2 + + 3
Parsing error at line 1 column 3. Repair sequences found:
1: Insert INT
2: Delete +
Unable to evaluate expression.
>>> 2 + + 3
Parsing error at line 1 column 3. Repair sequences found:
1: Delete +
2: Insert INT
Result: 5
Why does the same input sometimes produce a result and sometimes fail to produce
a result? The problem is that 2 + + 3
has two repair sequences Delete +
and
Insert Int
. As things stand, both are equally good, and so one is chosen
non-deterministically. If Insert Int
is chosen, we hit the Err
case from
earlier, and fail to produce a result; if the Delete
case is chosen, we can
produce a result.
To lessen this problem, the %avoid_insert L
directive causes grmtools to
prefer repair sequences that don't include Insert L
over those that do.
Intuitively, we want to annotate lexemes whose value we care about in this
way (e.g. INT
), but we don't need to worry about lexemes whose value we never
expect (e.g. (
, +
etc.). In the case of the calculator grammar a good
use of this directive is as follows:
%avoid_insert "INT"
With this, the Delete +
repair sequence is consistently favoured over Insert INT
.
Turning lexing errors into parsing errors
Most lexers do not have lexical rules for all possible inputs. For example, our
running calculator example has no lexical rule for the character @
. Typically
this causes the lexer to generate an error and stop lexing further. For example
with lrlex
we would encounter the following:
>>> 2@3
Lexing error at line 1 column 2.
This error message is correct, but not as helpful as we might like (what is the error specifically?). Furthermore, any further errors in the input will not be found until the lexing error is fixed.
Fortunately we can fix this easily for nearly all grammars by adding a line
similar to this to the end of your .l
file:
. "UNMATCHED"
Any single character which is not matched by any other lex rule will now lead
to a token of type UNMATCHED
. Note that it is vital that this is the last rule
in your .l
file, and that only a single character is matched, otherwise you
will incorrectly lex correct input as UNMATCHED
!
We then need to add a dummy rule to your .y
file, simply so that lrpar
knows about UNMATCHED
tokens. This dummy rule won't be referenced by other
rules, so its return type and action are irrelevant. The simplest example is
thus:
Unmatched -> ():
"UNMATCHED" { }
;
Assuming you have the "warnings are errors" option set to true (its default),
you will then receive a warning about the unused rule (Unmatched
) and token
(UNMATCHED
). You can inform grmtools that you expect both to be unused by
adding this declaration in the top part of your .y
file:
%expect-unused Unmatched "UNMATCHED"
With this done, all possible input will be lexed, and what were previously lexing errors are now parsing errors. This means that error recovery kicks in, giving us more detailed and informative errors, and ensuring that multiple "lexing" errors are reported at once:
>>> 2@3+4+5+6@7
Parsing error at line 1 column 2. Repair sequences found:
1: Delete @, Delete 3
2: Insert +, Delete @
3: Insert *, Delete @
Parsing error at line 1 column 10. Repair sequences found:
1: Insert +, Delete @
2: Delete @, Delete 7
3: Insert *, Delete @
Result: 24
Under the bonnet
For any given syntax error there are, potentially, a finite but vast number of
possible valid repair sequences: far too many to exhaustively search. Error
recovery algorithms such as CPCT+
use various heuristics to cut the search space
down to something that is (generally) manageable. Although surprisingly rare in
practise, this inevitably leads to occasional situations where the repair
sequences found (or, more accurately, those not found) surprise humans.
Timeout
The first surprising condition is that even with the small calc
grammar, some
user inputs lead to such a massive search space that no repair sequences can be
found. The easiest way to trigger this in most grammars is bracket expressions:
>>> 1+(
Parsing error at line 1 column 4. Repair sequences found:
1: Insert Int, Insert )
Unable to evaluate expression.
>>> 1+((
Parsing error at line 1 column 5. Repair sequences found:
1: Insert Int, Insert ), Insert )
Unable to evaluate expression.
>>> 1+(((((((((((
Parsing error at line 1 column 14. No repair sequences found.
Unable to evaluate expression.
At a certain number of open brackets (which will partly depend on the speed of
your machine), CPCT+
simply cannot find suitable repair sequences within its
internal timeout, hence the “No repair sequences found” message. In practise
this happens in less than 2% of real-world inputs, so it is not a significant
worry.
Some “obvious” repair sequences aren't reported at the end of a file
The second surprising condition is more subtle. Before we can show the issue, we
need to introduce the concept of repair sequence ranking: CPCT+
only presents the
lowest cost repair sequences to users (where Insert
s and Delete
s cost 1, and
Shift
s cost 0). Higher cost repair sequences are discarded.
In an ideal world, CPCT+
would find repair sequences that allow a file to parse
completely successfully. In practice, this is only feasible if a syntax error
occurs near the very end of the input. In most cases, CPCT+
is happy with a
weaker condition, which is that a repair sequence ends with 3 Shift
repairs,
showing that parsing has got back on track, at least for a little bit. This
condition explains the following:
>>> 2 + + 3
Parsing error at line 1 column 5. Repair sequences found:
1: Delete +
2: Insert Int
Result: 5
>>> 2 + + 3 +
Parsing error at line 1 column 5. Repair sequences found:
1: Insert Int
Parsing error at line 1 column 10. Repair sequences found:
1: Insert Int
Unable to evaluate expression.
For 2 + + 3
we match the human intuition that the input could have been 2 + 3
or 2 + <some int> + 3
. However, for the input 2 + + 3 +
we do not report
a Delete +
repair sequence for the first error in the input. Why?
The first thing we need to know is that repair sequences are always reported
with trailing Shift
repairs pruned: for the rest of this subsection it aids
understanding to leave them unpruned. Thus, for 2 + + 3
, the two repair
sequences found are Delete +, Shift 3
and Insert Int, Shift +, Shift 3
, both
of which cause the entire input to parse successfully, and both of which have
the same cost.
For 2 + + 3 +
, however, the first error leads to 3 repair sequences, Insert Int, Shift +, Shift 3, Shift +
, Delete +, Shift 3, Delete
or Delete +, Shift 3, Shift +, Insert Int
: the latter two are not even completed since they're
provably higher than the Insert Int
repair sequence and thus aren’t reported
to the user.
In practise, this situation is rarer than the timeout problem, to the point that
it’s arguably not worth worrying about or explaining to end users. Even when it
happens, the repair sequences that CPCT+
reports are always correct and at least
one repair sequence will be reported (assuming that error recovery doesn't time
out!).
Error recovery on real-world grammars
Continuing the example from the nimbleparse
section, we
can see that error recovery works well on arbitrary grammars. Consider the
following syntactically incorrect Lua 5.3 program:
$ cat test.lua
x = 0
if x > 0
print("greater than")
else
print("less than"}
When run through nimbleparse
, the following output is
generated:
$ cargo run --release --bin nimbleparse lua5_3.l lua5_3.y test.lua
...
Error at line 3 col 4. Repair sequences found:
1: Insert then
Error at line 5 col 21. Repair sequences found:
1: Insert ), Insert end, Delete }
2: Insert ), Insert {, Shift }, Insert end
Turning off error recovery
By default, lrpar
uses the CPCT+
error recovery algorithm. You can use the
None
error recovery algorithm, which causes parsing to stop as soon as it hits
the first parsing error, with the recoverer
method in CTParserBuilder
or
RTParserBuilder
. For example, we can change calc
's build.rs
file to:
CTLexerBuilder::new()
.lrpar_config(|ctp| {
ctp.yacckind(YaccKind::Grmtools)
.recoverer(lrpar::RecoveryKind::None)
.grammar_in_src_dir("calc.y")
.unwrap()
})
.lexer_in_src_dir("calc.l")?
.build()?;
and then no matter how many syntax errors we make, only one is reported:
>>> 2++3++
Parsing error at line 1 column 3. No repair sequences found.
Unable to evaluate expression.
Unless you have a good reason to do so (e.g. quickly hacking together a grammar where you would prefer not to think about error recovery at all), we do not recommend turning off error recovery.
An AST evaluator
We now know enough to put together a more sophisticated version of our simple calculator example that builds an Abstract Syntax Tree (AST) while parsing, which is then evaluated separately. This models a common way of building real compilers. The full example code can be found at https://github.com/softdevteam/grmtools/tree/master/lrpar/examples/calc_ast.
The calc.l
file remains unchanged from that in the Quickstart
guide. However the calc.y
file is changed as follows:
%start Expr
%avoid_insert "INT"
%%
Expr -> Result<Expr, ()>:
Expr '+' Term { Ok(Expr::Add{ span: $span, lhs: Box::new($1?), rhs: Box::new($3?) }) }
| Term { $1 }
;
Term -> Result<Expr, ()>:
Term '*' Factor { Ok(Expr::Mul{ span: $span, lhs: Box::new($1?), rhs: Box::new($3?) }) }
| Factor { $1 }
;
Factor -> Result<Expr, ()>:
'(' Expr ')' { $2 }
| 'INT' { Ok(Expr::Number{ span: $span }) }
;
%%
use cfgrammar::Span;
#[derive(Debug)]
pub enum Expr {
Add {
span: Span,
lhs: Box<Expr>,
rhs: Box<Expr>,
},
Mul {
span: Span,
lhs: Box<Expr>,
rhs: Box<Expr>,
},
Number {
span: Span
}
}
The most obvious difference here is that we have defined a simple enum
Expr
,
with three variants, for our AST. Each AST variant also records a Span
which
records how much input the AST element covers. By using the
$span
variable we can ensure that AST elements record their
relationship to portions of the user's input that span multiple tokens (e.g.
for the expressions 1 + 2
the resulting Expr::Add
will have a Span
starting at byte index 0 and ending at byte index 5 -- in other words covering
the complete input string in this case).
After parsing, we thus end up with a Result<Expr, ()>
. In the case of a
successful parse, this will give us an arbitrarily deeply nested Expr
.
Our main.rs
file then looks as follows:
use std::io::{self, BufRead, Write};
use lrlex::{lrlex_mod, DefaultLexeme};
use lrpar::{lrpar_mod, NonStreamingLexer, Span};
lrlex_mod!("calc.l");
lrpar_mod!("calc.y");
use calc_y::Expr;
fn main() {
let lexerdef = calc_l::lexerdef();
let stdin = io::stdin();
loop {
print!(">>> ");
io::stdout().flush().ok();
match stdin.lock().lines().next() {
Some(Ok(ref l)) => {
if l.trim().is_empty() {
continue;
}
let lexer = lexerdef.lexer(l);
let (res, errs) = calc_y::parse(&lexer);
for e in errs {
println!("{}", e.pp(&lexer, &calc_y::token_epp));
}
if let Some(Ok(r)) = res {
// We have a successful parse.
match eval(&lexer, r) {
Ok(i) => println!("Result: {}", i),
Err((span, msg)) => {
let ((line, col), _) = lexer.line_col(span);
eprintln!(
"Evaluation error at line {} column {}, '{}' {}.",
line,
col,
lexer.span_str(span),
msg
)
}
}
}
}
_ => break
}
}
}
fn eval(
lexer: &dyn NonStreamingLexer<DefaultLexeme, u32>,
e: Expr,
) -> Result<u64, (Span, &'static str)> {
match e {
Expr::Add { span, lhs, rhs } => eval(lexer, *lhs)?
.checked_add(eval(lexer, *rhs)?)
.ok_or((span, "overflowed")),
Expr::Mul { span, lhs, rhs } => eval(lexer, *lhs)?
.checked_mul(eval(lexer, *rhs)?)
.ok_or((span, "overflowed")),
Expr::Number { span } => lexer
.span_str(span)
.parse::<u64>()
.map_err(|_| (span, "cannot be represented as a u64"))
}
}
Let's start by running this and seeing what happens:
>>> 2+3*4
Result: 14
>>> 2++3*4
Parsing error at line 1 column 3. Repair sequences found:
1: Delete +
2: Insert INT
Result: 14
>>> 999999*888888 + 777777*666666
Result: 1407404592594
>>> 9999999999*8888888888 + 7777777777*6666666666
Evaluation error at line 1 column 6, '9999999999*8888888888' overflowed.
The first three expressions evaluate just as before. However, the fourth is
interesting: we have explicitly captured the fact that the result of
9999999999*8888888888
is too big to fit into a u64
; and not only have we
told the user which character the input starts out, but we've printed out the
precise sub-part of the input which caused that error. This works even when
it's in the middle of the input:
>>> 10 + 9999999999*8888888888 + 20
Evaluation error at line 1 column 6, '9999999999*8888888888' overflowed.
The key to this is that each AST element knows the $span
of the production it
is related to; and the resulting Span
can extract the user's input with
lexer.span_str(span)
.
Happily, this facility composes nicely with error recovery:
>>> 10 ++ 9999999999*8888888888 + 20
Parsing error at line 1 column 5. Repair sequences found:
1: Delete +
2: Insert INT
Evaluation error at line 1 column 7, '9999999999*8888888888' overflowed.
Rust Editions
The edition
of rust used by grmtools
updates as the rust language evolves. We try to
keep code generated by CTParserBuilder
and CTLexerBuilder
building with
older versions of rust, so that downstream users can use the edition that
suits their requirements.
Controlling edition used during code generation
CTLexerBuilder and CTParserBuilder both have a rust_edition() function that accepts an lrlex::RustEdition and an lrpar::RustEdition respectively.
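For example, a build.rs might pin the edition of the generated code as follows. This is a hedged sketch: the RustEdition variant name used here is an assumption and should be checked against the lrlex/lrpar versions in use.

use cfgrammar::yacc::YaccKind;
use lrlex::CTLexerBuilder;

fn main() {
    CTLexerBuilder::new()
        // Assumed variant name; see lrlex::RustEdition for the actual options.
        .rust_edition(lrlex::RustEdition::Rust2021)
        .lrpar_config(|ctp| {
            ctp.rust_edition(lrpar::RustEdition::Rust2021)
                .yacckind(YaccKind::Grmtools)
                .grammar_in_src_dir("calc.y")
                .unwrap()
        })
        .lexer_in_src_dir("calc.l")
        .unwrap()
        .build()
        .unwrap();
}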
Known edition incompatibility in the book
While there is a preference for keeping the code in this manual working with all editions, exceptions may be made for clarity.
- In An AST evaluator, the rust_2018_idioms lint deprecates some behaviour which was previously accepted by the 2015 edition. The eval function has an elided lifetime that must be given explicitly as lexer: &dyn NonStreamingLexer<'_, DefaultLexeme, u32>.
.
The individual libraries and tools
grmtools consists of several libraries and command-line tools. The following sections describe each.
lrpar
lrpar
(crate;
source) is the LR
parser library aspect of grmtools. It takes in streams of lexemes (using a
lexer of the user's choice) and parses them, determining if they successfully
match a grammar or not; if not, it can optionally recover from errors.
lrlex
lrlex
(crate;
source) is a
partial replacement for lex
/
flex
. It takes an input string and
splits it into lexemes based on a .l
file. Unfortunately, many real-world
languages have corner cases which exceed the power that lrlex
can provide.
However, when it is suitable, it is a very convenient way of expressing lexing.
lrlex
also has a simple command-line interface, allowing you to check whether
your lexing rules are working as expected:
$ cat C.java
class C {
int x = 0;
}
$ cargo run --bin lrlex -- ../grammars/java7/java.l /tmp/C.java
Finished dev [unoptimized + debuginfo] target(s) in 0.18s
Running `target/debug/lrlex ../grammars/java7/java.l /tmp/C.java`
CLASS class
IDENTIFIER C
LBRACE {
INT int
IDENTIFIER x
EQ =
INTEGER_LITERAL 0
SEMICOLON ;
RBRACE }
nimbleparse
nimbleparse
is a simple grammar debugging aid. It takes as input a Lex
specification, a Yacc specification, and an input file and prints any warnings
about the specifications (e.g. shift/reduce errors) as well as the resulting
parse tree to stdout. If the parse is unsuccessful it will report parsing
errors and, when possible, fixes. If parsing is successful, nimbleparse
exits
with 0; if an error is detected it exits with 1.
The full command-line specification is as follows:
nimbleparse [-r <cpctplus|none>] [-y <eco|grmtools|original>] [-q] <lexer.l> <parser.y> <input file>
where:
- -r selects the recovery algorithm to be used. Defaults to cpctplus.
- -y selects the Yacc variant to be used. Defaults to original.
- -q prevents warnings (e.g. shift/reduce errors) from being reported.
You can use your own Lex/Yacc files. A small repository of example grammars can be found at https://github.com/softdevteam/grammars/.
An example invocation is as follows:
$ cat Hello.java
class Hello {
public static void main(String[] args) {
System.out.println("Hello world");
}
}
$ nimbleparse java7.l java7.y Hello.java
goal
compilation_unit
type_declarations_opt
type_declarations
type_declaration
class_declaration
modifiers_opt
CLASS class
IDENTIFIER Hello
type_parameters_opt
super_opt
interfaces_opt
class_body
LBRACE {
class_body_declarations_opt
class_body_declarations
class_body_declaration
class_member_declaration
method_declaration
method_header
modifiers_opt
modifiers
modifiers
modifier
PUBLIC public
modifier
STATIC static
VOID void
method_declarator
IDENTIFIER main
LPAREN (
formal_parameter_list_opt
formal_parameter_list
formal_parameter
type
reference_type
array_type
name
simple_name
IDENTIFIER String
dims
LBRACK [
RBRACK ]
variable_declarator_id
IDENTIFIER args
RPAREN )
throws_opt
method_body
block
LBRACE {
block_statements_opt
block_statements
block_statement
statement
statement_without_trailing_substatement
expression_statement
statement_expression
method_invocation
qualified_name
name
qualified_name
name
simple_name
IDENTIFIER System
DOT .
IDENTIFIER out
DOT .
IDENTIFIER println
LPAREN (
argument_list_opt
argument_list
expression
assignment_expression
conditional_expression
conditional_or_expression
conditional_and_expression
inclusive_or_expression
exclusive_or_expression
and_expression
equality_expression
instanceof_expression
relational_expression
shift_expression
additive_expression
multiplicative_expression
unary_expression
unary_expression_not_plus_minus
postfix_expression
primary
primary_no_new_array
literal
STRING_LITERAL "Hello world"
RPAREN )
SEMICOLON ;
RBRACE }
RBRACE }
$ cat SyntaxError.java
class SyntaxError {
int x y;
}
$ nimbleparse java7.l java7.y SyntaxError.java
goal
compilation_unit
type_declarations_opt
type_declarations
type_declaration
class_declaration
modifiers_opt
CLASS class
IDENTIFIER SyntaxError
type_parameters_opt
super_opt
interfaces_opt
class_body
LBRACE {
class_body_declarations_opt
class_body_declarations
class_body_declaration
class_member_declaration
field_declaration
modifiers_opt
type
primitive_type
numeric_type
integral_type
INT int
variable_declarators
variable_declarators
variable_declarator
variable_declarator_id
IDENTIFIER x
COMMA
variable_declarator
variable_declarator_id
IDENTIFIER y
SEMICOLON ;
RBRACE }
Parsing error at line 2 column 11. Repair sequences found:
1: Insert ,
2: Insert =
3: Delete y
cfgrammar
cfgrammar
(crate;
source) reads
in grammar files, processes them, and provides a convenient API for operating
with them. Most users only need to think about cfgrammar
to the
extent that they are required to use it to specify what Yacc variant they wish
to use.
cfgrammar
may also be of interest to those manipulating grammars directly, or
who wish to use custom types of parsers. Note that cfgrammar
's API should be
considered semi-stable at best. As the needs of other parts of grmtools change,
cfgrammar
tends to have to change too. Since it is likely to have few direct
users, the consequences of changing the API are relatively slight.
lrtable
lrtable
(crate;
source) takes in
grammars from cfgrammar
and creates LR state tables from
them. Few users will be interested in its functionality directly, except those
doing advanced forms of grammar analysis.
One, admittedly fairly advanced, aspect worth noting is that
lrtable
uses Pager's
algorithm to compress the
resulting LR state tables. In rare cases this can provide surprising results:
see Denny and Malloy's
paper for
more.
Libraries and tools developed by third parties
Besides using grmtools to develop parsers, the following items extend or augment grmtools' functionality and may be useful to people developing parsers with grmtools.
Other tools
When parsing text in Rust, you should also evaluate the following tools to see if they are more suitable for your purposes: