grmtools parsing idioms
grmtools is a flexible tool and can be used in many ways. However, for those
using the Grmtools
format, the simple idioms below can often make life easier.
Return Span
s when possible
When executing grammar actions one is often building up an Abstract Syntax Tree (AST) or equivalent. For example consider a simple language with assignments:
Assign: "ID" "=" Expr;
Perhaps the "obvious" way to build this into an AST is to extract the string representing the identifier as follows:
Assign -> ASTAssign: "ID" "=" Expr
{
let id = $lexer.span_str($1.as_ref().unwrap().span()).to_string();
ASTAssign::new(id, $3)
}
%%
struct ASTAssign {
id: String
}
impl ASTAssign {
fn new(name: String) -> Self {
ASTAssign { name }
}
}
This approach is easy to work with, but isn't as performant as may be desired:
the to_string
call allocates memory and copies part of the user's input into
that. It also loses information about the part of the user's input that the
string relates to.
An alternative approach is not to convert the lexeme into a String
during
parsing, but simply to return a
Span
. An outline of this
is as follows:
Assign -> ASTAssign: "ID" "=" Expr
{
ASTAssign { id: $1, expr: Box::new($3.span()) }
}
%%
type StorageT = u32;
struct ASTAssign {
id: Span
expr: Box<Expr>
}
enum Expr { ... }
If this is not quite what you want to do, you can use largely the same trick with
the Lexeme
struct
.
Working with Lexeme
s has the advantage that you can tell what the type of the
lexeme in question is, though generally this is entirely clear from AST
context, and Lexeme
's type parameter makes it marginally more fiddly to work
with than Span
.
Alternatively, if you really want to extract strings during parsing, consider
using the 'input
to extract &str
's during parsing, since this does not
cause any additional memory to be allocated.
Have rules return a Result
type
As described in the error recovery
section, it
is generally a good idea to give rules a Result
return type as this allows
you to easily stop, or change, action code execution if you encounter
"important" inserted lexemes. There are many ways that you can use this, but
many simple cases work well using either:
-
Err(())
works well if you are creating a parse tree and simply want to stop creating the tree when you encounter an important inserted lexeme. -
Err(Box<dyn Error>)
works well if you are performing more detailed evaluation while parsing and wish to explain to the user why you stopped evaluating when you encountered an important inserted lexeme.
Using Err(())
The idea here is that we stop evaluating normal action code by returning
Err(())
. However, this can lead to endless instances of the following
map_err
idiom:
R -> Result<..., ()>:
"ID" { $1.map_err(|_| ())? }
;
It can be helpful to define a custom map_err
function which hides some of this
mess for you:
R -> Result<Lexeme<StorageT>, ()>:
"ID" { map_err($1)? }
;
%%
fn map_err(r: Result<Lexeme<StorageT>, Lexeme<StorageT>>)
-> Result<Lexeme<StorageT>, ()>
{
r.map_err(|_| ())
}
Using Err(Box<dyn Error>)
The idea here is that we both stop evaluating normal action code, and explain
why, by returning Err(Box<dyn Error>)
. Although Box<dyn Error>
is something
of a mouthful, it allows you significant flexibility in what you return in
error situations. If you want to quickly experiment, then this is convenient
because the token type Result<Lexeme<StorageT>, Lexeme<StorageT>>
can be
automatically coerced to Box<dyn Error>
(e.g. $1?
in action code will
return the Err
variant without additional code). You can also return
strings-as-errors with Box::<dyn Error>::from("...")
.
Using this idiom we can change our calculator example to deal with many more possible sources of error:
%start Expr
%avoid_insert "INT"
%%
Expr -> Result<u64, Box<dyn Error>>:
Expr '+' Term
{
Ok($1?.checked_add($3?)
.ok_or(Box::<dyn Error>::from("Overflow detected."))?)
}
| Term { $1 }
;
Term -> Result<u64, Box<dyn Error>>:
Term '*' Factor
{
Ok($1?.checked_mul($3?)
.ok_or(Box::<dyn Error>::from("Overflow detected."))?)
}
| Factor { $1 }
;
Factor -> Result<u64, Box<dyn Error>>:
'(' Expr ')' { $2 }
| 'INT'
{
parse_int(
$lexer.span_str(
$1.map_err(|_| "<evaluation aborted>")?.span()))
}
;
%%
// Any imports here are in scope for all the grammar actions above.
use std::error::Error;
fn parse_int(s: &str) -> Result<u64, Box<dyn Error>> {
match s.parse::<u64>() {
Ok(val) => Ok(val),
Err(_) => {
Err(Box::from(
format!("{} cannot be represented as a u64", s)))
}
}
}
Define a flatten
function
Yacc grammars make specifying sequences of things something of a bore. A common idiom is thus:
ListOfAs -> Result<Vec<A>, ()>:
A { Ok(vec![$1?]) }
| ListOfAs A
{
let mut $1 = $1?;
$1.push($1?);
Ok($1)
}
;
A -> Result<A, ()>: ... ;
Since this idiom is often present multiple times in a grammar, it's generally
worth adding a flatten
function to hide some of this:
ListOfAs -> Result<Vec<A>, ()>:
A { Ok(vec![$1?]) }
| ListOfAs A { flatten($1, $2) }
;
A -> Result<A, ()>: ... ;
%%
fn flatten<T>(lhs: Result<Vec<T>, ()>, rhs: Result<T, ()>)
-> Result<Vec<T>, ()>
{
let mut flt = lhs?;
flt.push(rhs?);
Ok(flt)
}
Note that flatten
is generic with respect to T
so that it can be used in
multiple places in the grammar.
Composing idioms
The above idioms compose well together. For example, flatten
, map_err
, and
Lexeme
can be used together as shown in the following example:
ListOfIds -> Result<Vec<Lexeme<StorageT>>, ()>:
"ID" { Ok(vec![map_err($1)?]) }
| ListOfIds "Id" { flatten($1, map_err($2)?) }
;
%%
type StorageT = u32;
fn map_err(r: Result<Lexeme<StorageT>, Lexeme<StorageT>>)
-> Result<Lexeme<StorageT>, ()>
{
r.map_err(|_| ())
}
fn flatten<T>(lhs: Result<Vec<T>, ()>, rhs: Result<T, ()>)
-> Result<Vec<T>, ()>
{
let mut flt = lhs?;
flt.push(rhs?);
Ok(flt)
}