Hand-written lexers
lrpar provides a generic lexing interface into which any lexer can plug.
Users can provide one or both of a custom lexeme type -- conforming to
lrpar::Lexeme -- and a custom lexing type -- conforming to
lrpar::NonStreamingLexer. If you wish to use a custom lexer, you will need to
instantiate lrpar appropriately (both CTParserBuilder and RTParserBuilder).
For many purposes, the low-level control and performance that lrpar gives you
are unneeded, and the boilerplate that comes with them is unwanted. Fortunately,
lrlex provides the following convenience mechanisms to make it easier to use a
hand-written lexer with lrpar:
- lrlex's normal LRNonStreamingLexer struct can be instantiated by an end-user with an input stream, a list of lexemes created from that input stream, and the newlines encountered while lexing that input stream. This saves having to define a custom instance of the lrpar::NonStreamingLexer trait.
- lrlex's DefaultLexeme struct can also be instantiated by end-users, saving having to define a custom instance of the lrpar::Lexeme trait.
- lrlex exposes CTTokenMapBuilder, to be used from build.rs scripts, which automatically produces a Rust module with one constant per token ID. It is explicitly designed to be easy to use with lrpar's compile-time building.
Putting these together is then relatively easy. First, a build.rs file for a
hand-written lexer will look roughly as follows:
use lrlex::{CTTokenMapBuilder, DefaultLexerTypes};
use lrpar::CTParserBuilder;

fn main() {
    let ctp = CTParserBuilder::<DefaultLexerTypes<u8>>::new()
        .grammar_in_src_dir("grammar.y")
        .unwrap()
        .build()
        .unwrap();
    CTTokenMapBuilder::<u8>::new("token_map", ctp.token_map()).build().unwrap()
}
This produces a module that can be imported with lrlex_mod!("token_map"). The
module will contain one constant, prefixed with T_, per token identifier in the
grammar. For example, for the following grammar excerpt:
Expr -> Result<u64, ()>:
Expr 'PLUS' Term { Ok($1? + $3?) }
| Term { $1 }
;
the module will contain const T_PLUS: u8 = ...;.
Since Yacc grammars can contain token identifiers which are not valid Rust
identifiers, CTTokenMapBuilder allows you to provide a map from the token
identifier to a "Rust friendly" variant. For example, for the following grammar
excerpt:
Expr -> Result<u64, ()>:
Expr '+' Term { Ok($1? + $3?) }
| Term { $1 }
;
we would provide a map '+' => 'PLUS' leading, again, to a constant T_PLUS
being defined.
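To make the renaming step concrete, here is a minimal, std-only sketch of how a token map and a rename map could be turned into T_ constants. It is not lrlex's implementation: render_token_consts and is_ident_suffix are hypothetical helpers introduced purely for illustration, and the token IDs in the usage note are made up.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical check (not part of lrlex): can `s` follow the `T_` prefix
/// in a Rust constant name?
fn is_ident_suffix(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_ascii_alphanumeric() || c == '_')
}

/// Hypothetical helper (not part of lrlex): render one `T_` constant per
/// token identifier, consulting `rename_map` for "Rust friendly" names.
fn render_token_consts(
    token_map: &BTreeMap<String, u8>,
    rename_map: &HashMap<&str, &str>,
) -> Result<Vec<String>, String> {
    let mut out = Vec::new();
    for (name, id) in token_map {
        // Prefer the user-supplied name; otherwise use the identifier as-is.
        let friendly = rename_map
            .get(name.as_str())
            .copied()
            .unwrap_or(name.as_str());
        if !is_ident_suffix(friendly) {
            return Err(format!("token '{}' needs a Rust friendly name", name));
        }
        out.push(format!("pub const T_{}: u8 = {};", friendly, id));
    }
    Ok(out)
}
```

With a token map containing '+' and INT, and a rename map '+' => 'PLUS', this sketch yields a T_PLUS and a T_INT constant; without the rename map it reports an error for '+', which is the situation the rename map exists to avoid.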
One can then write a simple custom lexer which lexes all the input in one go
and returns an LRNonStreamingLexer as follows:
use cfgrammar::NewlineCache;
use lrlex::{lrlex_mod, DefaultLexeme, DefaultLexerTypes, LRNonStreamingLexer};
use lrpar::{lrpar_mod, Lexeme, NonStreamingLexer, Span};

lrlex_mod!("token_map");
use token_map::*;

fn lex(s: &str) -> LRNonStreamingLexer<DefaultLexerTypes<u8>> {
    let mut lexemes = Vec::new();
    let mut newlines = NewlineCache::new();
    let mut i = 0;
    while i < s.len() {
        if i == ... {
            lexemes.push(DefaultLexeme::new(T_PLUS, i, ...));
        } else {
            ...
        }
    }
    LRNonStreamingLexer::new(s, lexemes, newlines)
}
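The skeleton above leaves the body of the lexing loop elided. As an illustration of what such a loop might look like, the following std-only sketch lexes integers and '+' signs, skipping whitespace. It does not use lrlex at all: the Lexeme struct is a hypothetical stand-in for DefaultLexeme, the Vec of line-start offsets is a stand-in for what a NewlineCache records, and the T_PLUS and T_INT values are invented (in a real project they would come from the generated token_map module).

```rust
// Stand-ins for constants a generated `token_map` module would provide;
// the values here are made up for illustration.
const T_PLUS: u8 = 1;
const T_INT: u8 = 2;

/// Hypothetical stand-in for `DefaultLexeme`: token ID, start byte, length.
#[derive(Debug, PartialEq)]
struct Lexeme {
    tok_id: u8,
    start: usize,
    len: usize,
}

/// Lex all of `s` in one go. Returns the lexemes plus the byte offsets at
/// which each new line begins (the kind of data a newline cache tracks).
fn lex(s: &str) -> Result<(Vec<Lexeme>, Vec<usize>), String> {
    let bytes = s.as_bytes();
    let mut lexemes = Vec::new();
    let mut newlines = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'+' => {
                lexemes.push(Lexeme { tok_id: T_PLUS, start: i, len: 1 });
                i += 1;
            }
            b'0'..=b'9' => {
                // Consume the longest run of ASCII digits.
                let start = i;
                while i < bytes.len() && bytes[i].is_ascii_digit() {
                    i += 1;
                }
                lexemes.push(Lexeme { tok_id: T_INT, start, len: i - start });
            }
            b'\n' => {
                // The next line starts just after the newline byte.
                newlines.push(i + 1);
                i += 1;
            }
            b' ' | b'\t' | b'\r' => i += 1,
            b => return Err(format!("unexpected byte {:#04x} at offset {}", b, i)),
        }
    }
    Ok((lexemes, newlines))
}
```

In a real lexer built on lrlex, the lexemes and newline information collected this way would be handed to LRNonStreamingLexer::new together with the input string, as in the skeleton. How lexing errors should be reported in that setting is left open here; returning a String is only a placeholder.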