grmtools parsing idioms

grmtools is a flexible tool and can be used in many ways. However, for those using the Grmtools format, the simple idioms below can often make life easier.

Return Lexemes rather than &strs or Strings when performance is critical

When executing grammar actions one is often building up an Abstract Syntax Tree (AST) or equivalent. For example consider a simple language with assignments:

Assign: "ID" "=" Expr;

Perhaps the "obvious" way to build this into an AST is to extract the string representing the identifier as follows:

Assign -> ASTAssign: "ID" "+" Expr
    {
        let id = $lexer.lexeme_str($1.as_ref().unwrap()).to_string();
        ASTAssign::new(id, $3)
    }

%%

struct ASTAssign {
    id: String
}

impl ASTAssign {
    fn new(name: String) -> Self {
        ASTAssign { name }
    }
}

This approach is easy to work with, but isn't as performant as may be desired: the to_string call allocates memory and copies part of the user's input into that.

An alternative approach is not to convert the lexeme into a String during parsing, but simply to return the Lexeme itself. The relevant &str slice can be extracted from the user's input later and memory allocation avoided entirely. An example of this is as follows:

Assign -> ASTAssign: "ID" "+" Expr
    {
        let id = $lexer.lexeme_str($1);
        ASTAssign::new(id, $3)
    }

%%

type StorageT = u32;

struct ASTAssign {
    id: Lexeme<StorageT>
}

impl ASTAssign {
    fn new(name: Lexeme<StorageT>) -> Self {
        ASTAssign { name }
    }
}

Dealing with the Lexeme type can be somewhat fiddly because it has a type paramater <StorageT> with somewhat complex bounds. In essence, StorageT must be an unsigned integer that is big enough to index (individually) a grammar's rules, productions, and lexemes. Users can manually specify StorageT by using CTParserBuilder::new_with_storaget; if not specified explicitly it defaults to u32.

A simple way of avoiding the complexity of the parameter's bounds is to encode it directly into the Lexeme type. A pleasant, and easily adapted way of doing this, is to semi-hard-code the value of StorageT into the AST, which we do by adding type StorageT = u32; to the programs part of the grammar (note that if a different type than u32 was passed to CTParserBuilder::new_with_storaget then that will need to be used instead).

Have rules return a Result type and add a function to avoid map_err directly

As described in the error recovery section, it is generally a good idea to give rules a Result return type. This allows a simple interaction with error recovery. However, it can lead to endless instances of the following map_err idiom:

R -> Result<..., ()>:
    "ID" { $1.map_err(|_| ())? }
    ;

It can be helpful to define a custom map_err function which hides some of this mess for you:

R -> Result<Lexeme<StorageT>, ()>:
    "ID" { map_err($1)? }
    ;

%%

fn map_err(r: Result<Lexeme<StorageT>, Lexeme<StorageT>>)
        -> Result<Lexeme<StorageT>, ()
{
    r.map_err(|_| ())
}

Define a flatten function

Yacc grammars make specifying sequences of things something of a bore. A common idiom is thus:

ListOfAs -> Result<Vec<A>, ()>:
      A { Ok(vec![$1?]) }
    | ListOfAs A
      {
          let mut $1 = $1?;
          $1.push($1?);
          Ok($1)
      }
    ;

A -> Result<A, ()>: ... ;

Since this idiom is often present multiple times in a grammar, it's generally worth adding a flatten function to hide some of this:

ListOfAs -> Result<Vec<A>, ()>:
      A { Ok(vec![$1?]) }
    | ListOfAs A { flatten($1, $2) }
    ;

A -> Result<A, ()>: ... ;
%%

fn flatten<T>(lhs: Result<Vec<T>, ()>, rhs: Result<T, ()>)
           -> Result<Vec<T>, ()>
{
    let mut flt = lhs?;
    flt.push(rhs?);
    Ok(flt)
}

Note that flatten is generic with respect to T so that it can be used in multiple places in the grammar.

Composing the idioms

Happily, flatten, map_err, and Lexeme combine well:

ListOfIds -> Result<Vec<Lexeme<StorageT>>, ()>:
      "ID" { Ok(vec![map_err($1)?]) }
    | ListOfIds "Id" { flatten($1, map_err($2)?) }
    ;

%%

type StorageT = u32;

fn map_err(r: Result<Lexeme<StorageT>, Lexeme<StorageT>>)
        -> Result<Lexeme<StorageT>, ()
{
    r.map_err(|_| ())
}

fn flatten<T>(lhs: Result<Vec<T>, ()>, rhs: Result<T, ()>)
           -> Result<Vec<T>, ()>
{
    let mut flt = lhs?;
    flt.push(rhs?);
    Ok(flt)
}