grmtools parsing idioms

grmtools is a flexible tool and can be used in many ways. However, for those using the Grmtools format, the simple idioms below can often make life easier.

Return Spans when possible

When executing grammar actions one is often building up an Abstract Syntax Tree (AST) or equivalent. For example, consider a simple language with assignments:

Assign: "ID" "=" Expr;

Perhaps the "obvious" way to build this into an AST is to extract the string representing the identifier as follows:

Assign -> ASTAssign: "ID" "=" Expr
    {
        let id = $lexer.span_str($1.as_ref().unwrap().span()).to_string();
        ASTAssign::new(id, $3)
    }

%%

struct ASTAssign {
    id: String,
    expr: Expr
}

impl ASTAssign {
    fn new(id: String, expr: Expr) -> Self {
        ASTAssign { id, expr }
    }
}

This approach is easy to work with, but isn't as performant as it could be: the to_string call allocates memory and copies part of the user's input into it. It also loses the information about which part of the user's input the string relates to.

An alternative approach is not to convert the lexeme into a String during parsing, but simply to return a Span. An outline of this is as follows:

Assign -> ASTAssign: "ID" "=" Expr
    {
        ASTAssign { id: $1.as_ref().unwrap().span(), expr: Box::new($3) }
    }

%%

type StorageT = u32;

struct ASTAssign {
    id: Span,
    expr: Box<Expr>
}

enum Expr { ... }

If this is not quite what you want, you can use largely the same trick with the Lexeme struct. Working with Lexemes has the advantage that you can tell what type of lexeme is involved, though this is generally clear from the AST context anyway, and Lexeme's type parameter makes it marginally more fiddly to work with than Span.
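
For example, the outline above could store the identifier's Lexeme rather than its Span. The following is only a sketch, reusing the StorageT and Expr definitions from the previous example:

Assign -> ASTAssign: "ID" "=" Expr
    {
        ASTAssign { id: $1.unwrap(), expr: Box::new($3) }
    }

%%

type StorageT = u32;

struct ASTAssign {
    id: Lexeme<StorageT>,
    expr: Box<Expr>
}

enum Expr { ... }

The stored lexeme's span can still be recovered later with id.span(), and its token type, which a plain Span does not record, with id.tok_id().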

Alternatively, if you really want to extract strings during parsing, consider using the 'input lifetime to extract &strs during parsing, since this does not cause any additional memory to be allocated.
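
For instance, assuming that the 'input lifetime is available to rule return types in your grmtools version, a sketch along the following lines borrows &strs directly from the lexer's input rather than allocating:

Assign -> ASTAssign<'input>: "ID" "=" Expr
    {
        let id = $lexer.span_str($1.as_ref().unwrap().span());
        ASTAssign { id, expr: Box::new($3) }
    }

%%

struct ASTAssign<'input> {
    id: &'input str,
    expr: Box<Expr>
}

enum Expr { ... }

Note that an AST built this way cannot outlive the lexer it borrows from.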

Have rules return a Result type and define a custom map_err function

As described in the error recovery section, it is generally a good idea to give rules a Result return type. This allows a simple interaction with error recovery. However, it can lead to endless instances of the following map_err idiom:

R -> Result<..., ()>:
    "ID" { $1.map_err(|_| ())? }
    ;

It can be helpful to define a custom map_err function which hides some of this mess for you:

R -> Result<Lexeme<StorageT>, ()>:
    "ID" { map_err($1)? }
    ;

%%

type StorageT = u32;

fn map_err(r: Result<Lexeme<StorageT>, Lexeme<StorageT>>)
        -> Result<Lexeme<StorageT>, ()>
{
    r.map_err(|_| ())
}

Define a flatten function

Yacc grammars make specifying sequences of things something of a bore. A common idiom is thus:

ListOfAs -> Result<Vec<A>, ()>:
      A { Ok(vec![$1?]) }
    | ListOfAs A
      {
          let mut flt = $1?;
          flt.push($2?);
          Ok(flt)
      }
    ;

A -> Result<A, ()>: ... ;

Since this idiom is often present multiple times in a grammar, it's generally worth adding a flatten function to hide some of this:

ListOfAs -> Result<Vec<A>, ()>:
      A { Ok(vec![$1?]) }
    | ListOfAs A { flatten($1, $2) }
    ;

A -> Result<A, ()>: ... ;
%%

fn flatten<T>(lhs: Result<Vec<T>, ()>, rhs: Result<T, ()>)
           -> Result<Vec<T>, ()>
{
    let mut flt = lhs?;
    flt.push(rhs?);
    Ok(flt)
}

Note that flatten is generic with respect to T so that it can be used in multiple places in the grammar.
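
For example, the same flatten function can serve both a list of lexemes and a list built from some other rule; the Exprs rule below is purely illustrative:

Exprs -> Result<Vec<Expr>, ()>:
      Expr { Ok(vec![$1?]) }
    | Exprs Expr { flatten($1, $2) }
    ;

Expr -> Result<Expr, ()>: ... ;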

Composing the idioms

Happily, flatten, map_err, and Lexeme combine well:

ListOfIds -> Result<Vec<Lexeme<StorageT>>, ()>:
      "ID" { Ok(vec![map_err($1)?]) }
    | ListOfIds "Id" { flatten($1, map_err($2)?) }
    ;

%%

type StorageT = u32;

fn map_err(r: Result<Lexeme<StorageT>, Lexeme<StorageT>>)
        -> Result<Lexeme<StorageT>, ()>
{
    r.map_err(|_| ())
}

fn flatten<T>(lhs: Result<Vec<T>, ()>, rhs: Result<T, ()>)
           -> Result<Vec<T>, ()>
{
    let mut flt = lhs?;
    flt.push(rhs?);
    Ok(flt)
}
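
Since the rules above only hand back lexemes, the caller can decide when, and whether, to turn them into strings. The driver below is a minimal sketch of this: the file names list.l and list.y, the generated module names list_l and list_y, and the input string are all assumptions and must match your build.rs setup with grmtools' CTLexerBuilder/CTParserBuilder:

use lrlex::lrlex_mod;
use lrpar::{lrpar_mod, Lexeme, NonStreamingLexer};

// Hypothetical lexer and grammar files; the module names below derive from them.
lrlex_mod!("list.l");
lrpar_mod!("list.y");

fn main() {
    let lexerdef = list_l::lexerdef();
    let lexer = lexerdef.lexer("a b c");
    let (res, errs) = list_y::parse(&lexer);
    for e in errs {
        println!("{}", e.pp(&lexer, &list_y::token_epp));
    }
    if let Some(Ok(ids)) = res {
        for lx in ids {
            // Resolve each lexeme back to a &str only at the point of use.
            println!("{}", lexer.span_str(lx.span()));
        }
    }
}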