# grmtools parsing idioms
grmtools is a flexible tool and can be used in many ways. However, for those using the Grmtools format, the simple idioms below can often make life easier.
## Return `Span`s when possible
When executing grammar actions one is often building up an Abstract Syntax Tree (AST) or equivalent. For example, consider a simple language with assignments:

```
Assign: "ID" "=" Expr;
```
Perhaps the "obvious" way to build this into an AST is to extract the string representing the identifier as follows:
```
Assign -> ASTAssign: "ID" "=" Expr
    {
        let id = $lexer.span_str($1.as_ref().unwrap().span()).to_string();
        ASTAssign::new(id, $3)
    }
%%
struct ASTAssign {
    id: String,
    expr: Expr
}

impl ASTAssign {
    fn new(id: String, expr: Expr) -> Self {
        ASTAssign { id, expr }
    }
}
```
This approach is easy to work with, but isn't as performant as might be desired: the `to_string` call allocates memory and copies part of the user's input into it. It also loses track of which part of the user's input the string relates to.
An alternative approach is not to convert the lexeme into a `String` during parsing, but simply to return a `Span`. An outline of this is as follows:
```
Assign -> ASTAssign: "ID" "=" Expr
    {
        ASTAssign { id: $1.unwrap().span(), expr: Box::new($3) }
    }
%%
type StorageT = u32;

struct ASTAssign {
    id: Span,
    expr: Box<Expr>
}

enum Expr { ... }
```
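The pay-off of this style is that resolving a span back to text is deferred until, and unless, it is needed. The sketch below shows the idea outside any grammar file; `Span` here is a hand-rolled (start, end) byte range standing in for grmtools' real `Span` type, and `span_str` mimics what the lexer provides:

```rust
// Hypothetical stand-in for grmtools' Span: a (start, end) byte range.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Span {
    start: usize,
    end: usize,
}

// An AST node that records where its identifier sits in the input
// rather than copying the text out.
struct ASTAssign {
    id: Span,
}

// Resolve a span against the original input only when the text is
// actually needed; this borrows from the input, so nothing is allocated.
fn span_str(input: &str, span: Span) -> &str {
    &input[span.start..span.end]
}
```

Because the AST only holds indices, it stays `Copy`-friendly and cheap, at the cost of needing the input around whenever the text is resolved.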
If this is not quite what you want to do, you can use largely the same trick with the `Lexeme` struct. Working with `Lexeme`s has the advantage that you can tell what the type of the lexeme in question is, though generally this is entirely clear from AST context, and `Lexeme`'s type parameter makes it marginally more fiddly to work with than `Span`.
Alternatively, if you really want to extract strings during parsing, consider using the `'input` lifetime to extract `&str`s, since this does not cause any additional memory to be allocated.
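The borrowing version can be sketched as follows; the struct and helper names are made up for illustration, but the shape mirrors what the `'input` lifetime gives you in action code: the AST holds `&str` slices of the source and therefore cannot outlive it:

```rust
// An AST node that borrows its identifier directly from the input; the
// 'input lifetime ties the AST to the source text, which is the
// trade-off for avoiding allocation.
struct ASTAssign<'input> {
    id: &'input str,
}

// Slice the identifier out of the input: a borrow, not a copy.
fn make_assign(input: &str, start: usize, end: usize) -> ASTAssign<'_> {
    ASTAssign { id: &input[start..end] }
}
```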
## Have rules return a `Result` type and add a function to avoid `map_err` directly
As described in the error recovery section, it is generally a good idea to give rules a `Result` return type. This allows a simple interaction with error recovery. However, it can lead to endless instances of the following `map_err` idiom:

```
R -> Result<..., ()>:
    "ID" { Ok($1.map_err(|_| ())?) }
    ;
```
It can be helpful to define a custom `map_err` function which hides some of this mess for you:
```
R -> Result<Lexeme<StorageT>, ()>:
    "ID" { Ok(map_err($1)?) }
    ;
%%
fn map_err(r: Result<Lexeme<StorageT>, Lexeme<StorageT>>)
    -> Result<Lexeme<StorageT>, ()>
{
    r.map_err(|_| ())
}
```
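The helper's behaviour is easy to check outside a grammar; in this sketch `Lexeme` is faked as a plain `u32` token id, since the real grmtools lexeme type needs a lexer behind it:

```rust
// Stand-in for Lexeme<StorageT>: a bare token id.
type Lexeme = u32;

// Collapse the error half of the lexer's Result to (), so that actions
// with a Result<_, ()> return type can apply `?` to it.
fn map_err(r: Result<Lexeme, Lexeme>) -> Result<Lexeme, ()> {
    r.map_err(|_| ())
}
```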
## Define a `flatten` function
Yacc grammars make specifying sequences of things something of a bore. A common idiom is thus:
```
ListOfAs -> Result<Vec<A>, ()>:
    A { Ok(vec![$1?]) }
    | ListOfAs A
      {
          let mut lst = $1?;
          lst.push($2?);
          Ok(lst)
      }
    ;
A -> Result<A, ()>: ... ;
```
Since this idiom is often present multiple times in a grammar, it's generally worth adding a `flatten` function to hide some of this:
```
ListOfAs -> Result<Vec<A>, ()>:
    A { Ok(vec![$1?]) }
    | ListOfAs A { flatten($1, $2) }
    ;
A -> Result<A, ()>: ... ;
%%
fn flatten<T>(lhs: Result<Vec<T>, ()>, rhs: Result<T, ()>)
    -> Result<Vec<T>, ()>
{
    let mut flt = lhs?;
    flt.push(rhs?);
    Ok(flt)
}
```
Note that `flatten` is generic with respect to `T` so that it can be used in multiple places in the grammar.
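A standalone check of `flatten` shows both the genericity and the error behaviour the `?`s buy: a single `Err` on either side poisons the whole list:

```rust
// Append one parsed item to an accumulated list, short-circuiting on
// the first error from either argument.
fn flatten<T>(lhs: Result<Vec<T>, ()>, rhs: Result<T, ()>)
    -> Result<Vec<T>, ()>
{
    let mut flt = lhs?;
    flt.push(rhs?);
    Ok(flt)
}
```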
## Composing the idioms
Happily, `flatten`, `map_err`, and `Lexeme` combine well:
```
ListOfIds -> Result<Vec<Lexeme<StorageT>>, ()>:
    "ID" { Ok(vec![map_err($1)?]) }
    | ListOfIds "ID" { flatten($1, map_err($2)) }
    ;
%%
type StorageT = u32;

fn map_err(r: Result<Lexeme<StorageT>, Lexeme<StorageT>>)
    -> Result<Lexeme<StorageT>, ()>
{
    r.map_err(|_| ())
}

fn flatten<T>(lhs: Result<Vec<T>, ()>, rhs: Result<T, ()>)
    -> Result<Vec<T>, ()>
{
    let mut flt = lhs?;
    flt.push(rhs?);
    Ok(flt)
}
```