Hand-written lexers
lrpar provides a generic lexing interface into which any lexer can plug.
Users can provide one or both of a custom lexeme type (conforming to
lrpar::Lexeme) and a custom lexing type (conforming to
lrpar::NonStreamingLexer).
If you wish to use a custom lexer, you will need to instantiate lrpar
appropriately (both CTParserBuilder and RTParserBuilder).
For many purposes, the low-level control and performance that lrpar gives you are unneeded,
and the boilerplate that comes with them is unwanted. Fortunately, lrlex provides the following convenience mechanisms to make it easier to use a hand-written lexer with lrpar:
- lrlex’s normal LRNonStreamingLexer struct can be instantiated by an end-user with an input stream, a list of lexemes created from that input stream, and the newlines encountered while lexing that input stream. This saves having to define a custom instance of the lrpar::NonStreamingLexer trait.
- lrlex’s DefaultLexeme struct can also be instantiated by end-users, saving having to define a custom instance of the lrpar::Lexeme trait.
- lrlex exposes CTTokenMapBuilder to be used from build.rs scripts, which automatically produces a Rust module with one constant per token ID. It is explicitly designed to be easy to use with lrpar’s compile-time building.
Putting these together is then relatively easy. First, a build.rs file for a
hand-written lexer will look roughly as follows:
use lrlex::{CTTokenMapBuilder, DefaultLexerTypes};
use lrpar::CTParserBuilder;

fn main() {
    let ctp = CTParserBuilder::<DefaultLexerTypes<u8>>::new()
        .grammar_in_src_dir("grammar.y")
        .unwrap()
        .build()
        .unwrap();
    CTTokenMapBuilder::<u8>::new("token_map", ctp.token_map()).build().unwrap();
}
This produces a module that can be imported with lrlex_mod!("token_map"). The
module will contain one constant, prefixed with T_, per token identifier in the
grammar. For example, for the following grammar excerpt:
Expr -> Result<u64, ()>:
      Expr 'PLUS' Term { Ok($1? + $3?) }
    | Term { $1 }
    ;
the module will contain const T_PLUS: u8 = ...;.
Since Yacc grammars can contain token identifiers which are not valid Rust
identifiers, CTTokenMapBuilder allows you to provide a map from the token
identifier to a “Rust friendly” variant. For example, for the following grammar
excerpt:
Expr -> Result<u64, ()>:
      Expr '+' Term { Ok($1? + $3?) }
    | Term { $1 }
    ;
we would provide a map '+' => 'PLUS', leading, again, to a constant T_PLUS
being defined.
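The renaming idea can be illustrated without any grmtools types. The following is only a sketch, not lrlex's implementation: `const_name` and `rename_map` are hypothetical helpers, and the sketch assumes that a token identifier with no entry in the map keeps its own name. It shows how a map from token identifiers to "Rust friendly" names leads to the T_-prefixed constant names described above.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of lrlex): look `token` up in the rename
/// map, fall back to the identifier itself, and add the `T_` prefix.
fn const_name(token: &str, rename: &HashMap<&str, &str>) -> String {
    let friendly = rename.get(token).copied().unwrap_or(token);
    format!("T_{}", friendly)
}

/// Hypothetical helper: the rename map for the grammar excerpt above,
/// where the token identifier `+` is not a valid Rust identifier.
fn rename_map() -> HashMap<&'static str, &'static str> {
    let mut m = HashMap::new();
    m.insert("+", "PLUS");
    m
}
```

With this map, the identifier `+` is named via its "Rust friendly" variant, while an identifier that is already a valid Rust identifier is used unchanged.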
One can then write a simple custom lexer which lexes all the input in one go
and returns an LRNonStreamingLexer as follows:
use cfgrammar::NewlineCache;
use lrlex::{lrlex_mod, DefaultLexeme, DefaultLexerTypes, LRNonStreamingLexer};
use lrpar::{lrpar_mod, Lexeme, NonStreamingLexer, Span};

// Import the module of T_ constants generated by build.rs.
lrlex_mod!("token_map");
use token_map::*;

fn lex(s: &str) -> LRNonStreamingLexer<DefaultLexerTypes<u8>> {
    let mut lexemes = Vec::new();
    let mut newlines = NewlineCache::new();
    let mut i = 0;
    // The loop body is elided: a real lexer must recognise each token at
    // position `i`, push a lexeme for it, and advance `i`.
    while i < s.len() {
        if i == ... {
            lexemes.push(DefaultLexeme::new(T_PLUS, i, ...));
        } else {
            ...
        }
    }
    // Hand the input, the lexemes, and the newline information to lrlex.
    LRNonStreamingLexer::new(s, lexemes, newlines)
}
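To make the shape of such a lexing loop concrete, here is a self-contained sketch that uses only the standard library. Everything in it is an assumption made for illustration: `Tok` is a stand-in for a lexeme (a token ID, a start offset, and a length), the token IDs are written by hand rather than taken from the generated token_map module, and the "language" consists of just `+` and decimal integers. A real lexer would push DefaultLexeme values instead, and return an LRNonStreamingLexer as in the lex function above.

```rust
// Hypothetical token IDs; in a real project these come from the generated
// `token_map` module rather than being written by hand.
const T_PLUS: u8 = 1;
const T_INT: u8 = 2;

/// Stand-in for a lexeme: a token ID plus the byte span it covers.
#[derive(Debug, PartialEq)]
struct Tok {
    tok_id: u8,
    start: usize,
    len: usize,
}

/// Lex `s` in one go. On success, returns the lexemes and the byte offset at
/// which each new line starts; on failure, returns the offset of the first
/// byte that could not be lexed.
fn lex(s: &str) -> Result<(Vec<Tok>, Vec<usize>), usize> {
    let bytes = s.as_bytes();
    let mut toks = Vec::new();
    let mut line_starts = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'+' => {
                toks.push(Tok { tok_id: T_PLUS, start: i, len: 1 });
                i += 1;
            }
            b'0'..=b'9' => {
                // Consume the longest run of ASCII digits.
                let start = i;
                while i < bytes.len() && bytes[i].is_ascii_digit() {
                    i += 1;
                }
                toks.push(Tok { tok_id: T_INT, start, len: i - start });
            }
            b'\n' => {
                // The next line starts just after the newline byte.
                i += 1;
                line_starts.push(i);
            }
            // Skip other whitespace without producing a lexeme.
            b' ' | b'\t' | b'\r' => i += 1,
            // Any other byte is an error, reported as its offset.
            _ => return Err(i),
        }
    }
    Ok((toks, line_starts))
}
```

The important property is that every branch of the loop either advances `i` or returns, so the loop always terminates. Recording offsets rather than copying substrings mirrors the way lexemes refer back to the input by position.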