Hand-written lexers

lrpar provides a generic lexing interface that any lexer can plug into. Users can provide a custom lexeme type (conforming to lrpar::Lexeme), a custom lexer type (conforming to lrpar::NonStreamingLexer), or both. If you wish to use a custom lexer, you will need to instantiate lrpar appropriately (both CTParserBuilder and RTParserBuilder).

For many purposes, the low-level control and performance that lrpar gives you are unneeded, and the boilerplate that comes with it is unwanted. Fortunately, lrlex provides the following convenience mechanisms to make it easier to use a hand-written lexer with lrpar:

  1. lrlex's normal LRNonStreamingLexer struct can be instantiated by an end-user with an input stream, a list of lexemes created from that input stream, and the newlines encountered while lexing that input stream. This saves having to define a custom instance of the lrpar::NonStreamingLexer trait.

  2. lrlex's DefaultLexeme struct can also be instantiated by end-users, saving having to define a custom instance of the lrpar::Lexeme trait.

  3. lrlex exposes a ct_token_map function, intended to be called from build.rs scripts, which automatically produces a Rust module with one constant per token ID. ct_token_map is explicitly designed to be easy to use with lrpar's compile-time building.

Putting these together is then relatively easy. First, a build.rs file for a hand-written lexer will look roughly as follows:

use cfgrammar::yacc::YaccKind;
use lrlex::{ct_token_map, DefaultLexerTypes};
use lrpar::CTParserBuilder;

fn main() {
    // Statically build the parser for grammar.y at compile-time.
    let ctp = CTParserBuilder::<DefaultLexerTypes<u8>>::new()
        .yacckind(YaccKind::Grmtools)
        .grammar_in_src_dir("grammar.y")
        .unwrap()
        .build()
        .unwrap();
    // Write out a Rust module named token_map with one constant per token
    // ID; None means no token identifiers need renaming.
    ct_token_map::<u8>("token_map", ctp.token_map(), None).unwrap()
}

This produces a module that can be imported with lrlex_mod!("token_map"). The module will contain one constant, prefixed with T_, per token identifier in the grammar. For example, for the following grammar excerpt:

Expr -> Result<u64, ()>:
      Expr 'PLUS' Term { Ok($1? + $3?) }
    | Term { $1 }
    ;

the module will contain const T_PLUS: u8 = ...;.
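
For illustration, the generated module's contents will look roughly as follows; the numeric value shown is hypothetical, as lrpar chooses the actual token IDs:

// One T_-prefixed constant per token identifier, with the storage type
// chosen in build.rs (u8 here). The value 5 is hypothetical.
pub const T_PLUS: u8 = 5;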

Since Yacc grammars can contain token identifiers which are not valid Rust identifiers, ct_token_map allows you to provide a map from the token identifier to a "Rust friendly" variant. For example, for the following grammar excerpt:

Expr -> Result<u64, ()>:
      Expr '+' Term { Ok($1? + $3?) }
    | Term { $1 }
    ;

we would provide a map '+' => 'PLUS', leading, again, to a constant T_PLUS being defined.
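
Concretely, the map is passed as ct_token_map's third argument (the None in the build.rs above). The following sketch assumes the map is a HashMap<&str, &str> from token identifier to Rust-friendly name:

use std::collections::HashMap;

use cfgrammar::yacc::YaccKind;
use lrlex::{ct_token_map, DefaultLexerTypes};
use lrpar::CTParserBuilder;

fn main() {
    let ctp = CTParserBuilder::<DefaultLexerTypes<u8>>::new()
        .yacckind(YaccKind::Grmtools)
        .grammar_in_src_dir("grammar.y")
        .unwrap()
        .build()
        .unwrap();
    // Expose the token '+' as the Rust-friendly constant T_PLUS.
    let rename_map = HashMap::from([("+", "PLUS")]);
    ct_token_map::<u8>("token_map", ctp.token_map(), Some(&rename_map)).unwrap()
}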

One can then write a simple custom lexer which lexes all the input in one go and returns an LRNonStreamingLexer; in the following sketch, only the 'PLUS' token is lexed concretely:

use cfgrammar::NewlineCache;
use lrlex::{lrlex_mod, DefaultLexeme, DefaultLexerTypes, LRNonStreamingLexer};
use lrpar::Lexeme;

lrlex_mod!("token_map");
use token_map::*;

fn lex(s: &str) -> LRNonStreamingLexer<DefaultLexerTypes<u8>> {
  let mut lexemes = Vec::new();
  let mut newlines = NewlineCache::new();
  // Record the byte offset of each newline so that errors can later be
  // reported in terms of line and column numbers.
  newlines.feed(s);
  let mut i = 0;
  while i < s.len() {
    let c = s[i..].chars().next().unwrap();
    if c == '+' {
      // A lexeme couples a token ID with the start offset and length of
      // its span in the input.
      lexemes.push(DefaultLexeme::new(T_PLUS, i, 1));
      i += 1;
    } else if c.is_whitespace() {
      i += c.len_utf8();
    } else {
      // Lex the remaining tokens of your language here, remembering to
      // advance i past each lexeme.
      todo!();
    }
  }
  LRNonStreamingLexer::new(s, lexemes, newlines)
}
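
The lexer can then be combined with the parser module that build.rs generated, following lrpar's usual pattern. The sketch below assumes the grammar.y from above (which lrpar_mod! exposes as the module grammar_y) and that lex has been extended to handle the remaining tokens of the input:

use lrpar::lrpar_mod;

lrpar_mod!("grammar.y");

fn main() {
    let lexer = lex("2 + 3");
    // Parse the lexemes, collecting any lexing or parsing errors.
    let (res, errs) = grammar_y::parse(&lexer);
    for e in errs {
        println!("{}", e.pp(&lexer, &grammar_y::token_epp));
    }
    if let Some(Ok(r)) = res {
        println!("Result: {}", r);
    }
}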