Hand-written lexers
`lrpar` provides a generic lexing interface into which any lexer can plug.
Users can provide one or both of a custom lexeme type (conforming to
`lrpar::Lexeme`) and a custom lexing type (conforming to
`lrpar::NonStreamingLexer`).
If you wish to use a custom lexer, you will need to instantiate `lrpar`
appropriately (both `CTParserBuilder` and `RTParserBuilder`).
For many purposes, the low-level control and performance that `lrpar`
gives you are unneeded, and the boilerplate that comes with them is unwanted.
Fortunately, `lrlex` provides the following convenience mechanisms to make it
easier to use a hand-written lexer with `lrpar`:
- `lrlex`'s normal `LRNonStreamingLexer` struct can be instantiated by an
  end-user with an input stream, a list of lexemes created from that input
  stream, and the newlines encountered while lexing that input stream. This
  saves having to define a custom instance of the `lrpar::NonStreamingLexer`
  trait.
- `lrlex`'s `DefaultLexeme` struct can also be instantiated by end-users,
  saving having to define a custom instance of the `lrpar::Lexeme` trait.
- `lrlex` exposes a `ct_token_map` function, to be used from `build.rs`
  scripts, which automatically produces a Rust module with one constant per
  token ID. `ct_token_map` is explicitly designed to be easy to use with
  `lrpar`'s compile-time building.
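The newlines mentioned in the first item matter because they are what allow a byte offset in the input to be reported as a line and column. The following is a minimal, standard-library-only sketch of that idea. `line_col` is a hypothetical helper written for illustration, not part of `lrlex` or `cfgrammar`, and it assumes columns are counted in bytes:

```rust
/// Hypothetical helper: converts a byte offset into a 1-based (line, column)
/// pair, given the sorted byte offsets of every '\n' in the input.
/// Columns are counted in bytes, which is an assumption of this sketch.
fn line_col(newlines: &[usize], offset: usize) -> (usize, usize) {
    // The number of newlines strictly before `offset` is the zero-based line index.
    let line_idx = newlines.partition_point(|&nl| nl < offset);
    // A line starts just after the previous newline, or at 0 for the first line.
    let line_start = if line_idx == 0 {
        0
    } else {
        newlines[line_idx - 1] + 1
    };
    (line_idx + 1, offset - line_start + 1)
}
```

For the input `"12 + 3\n+4"`, whose only newline is at byte offset 6, `line_col(&[6], 7)` gives `(2, 1)`: the second `+` is the first character of the second line.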
Putting these together is then relatively easy. First, a `build.rs` file for a
hand-written lexer will look roughly as follows:

    use cfgrammar::yacc::YaccKind;
    use lrlex::{ct_token_map, DefaultLexerTypes};
    use lrpar::CTParserBuilder;

    fn main() {
        let ctp = CTParserBuilder::<DefaultLexerTypes<u8>>::new()
            .yacckind(YaccKind::Grmtools)
            .grammar_in_src_dir("grammar.y")
            .unwrap()
            .build()
            .unwrap();
        ct_token_map::<u8>("token_map", ctp.token_map(), None).unwrap()
    }

This produces a module that can be imported with `lrlex_mod!("token_map")`. The
module will contain one constant, prefixed with `T_`, per token identifier in
the grammar. For example, for the following grammar excerpt:

    Expr -> Result<u64, ()>:
          Expr 'PLUS' Term { Ok($1? + $3?) }
        | Term { $1 }
        ;

the module will contain `const T_PLUS: u8 = ...;`.
Since Yacc grammars can contain token identifiers which are not valid Rust
identifiers, `ct_token_map` allows you to provide a map from the token
identifier to a "Rust friendly" variant. For example, for the following grammar
excerpt:

    Expr -> Result<u64, ()>:
          Expr '+' Term { Ok($1? + $3?) }
        | Term { $1 }
        ;

we would provide a map `'+' => 'PLUS'` leading, again, to a constant `T_PLUS`
being defined.
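To make the renaming step concrete, here is a rough standard-library-only model of how a token map and a rename map could be combined into the text of such a module. `render_token_module` is a hypothetical function written for this illustration. It is not `lrlex`'s implementation, and the exact text that `ct_token_map` generates may differ:

```rust
use std::collections::HashMap;

/// Hypothetical model of token-map generation: renders the source text of a
/// module with one `T_`-prefixed constant per token identifier.
fn render_token_module(
    mod_name: &str,
    token_map: &HashMap<String, u8>,
    rename_map: Option<&HashMap<&str, &str>>,
) -> String {
    // Sort by token ID so that the output is deterministic.
    let mut entries: Vec<(&String, &u8)> = token_map.iter().collect();
    entries.sort_by_key(|(_, id)| **id);

    let mut out = format!("mod {} {{\n", mod_name);
    for (name, id) in entries {
        // Prefer the "Rust friendly" name if one was supplied; otherwise use
        // the identifier exactly as it appears in the grammar.
        let friendly = rename_map
            .and_then(|m| m.get(name.as_str()).copied())
            .unwrap_or(name.as_str());
        out.push_str(&format!("    pub const T_{}: u8 = {};\n", friendly, id));
    }
    out.push_str("}\n");
    out
}
```

Given a token map containing `"+"` with ID 1 and a rename map from `"+"` to `"PLUS"`, the rendered text contains `pub const T_PLUS: u8 = 1;` rather than the invalid identifier `T_+`.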
One can then write a simple custom lexer which lexes all the input in one go
and returns an `LRNonStreamingLexer` as follows:

    use cfgrammar::NewlineCache;
    use lrlex::{lrlex_mod, DefaultLexeme, DefaultLexerTypes, LRNonStreamingLexer};
    use lrpar::{lrpar_mod, Lexeme, NonStreamingLexer, Span};

    lrlex_mod!("token_map");
    use token_map::*;

    fn lex(s: &str) -> LRNonStreamingLexer<DefaultLexerTypes<u8>> {
        let mut lexemes = Vec::new();
        let mut newlines = NewlineCache::new();
        let mut i = 0;
        while i < s.len() {
            if i == ... {
                lexemes.push(DefaultLexeme::new(T_PLUS, i, ...));
            } else {
                ...
            }
        }
        LRNonStreamingLexer::new(s, lexemes, newlines)
    }
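The lexing loop above is deliberately elided. As a sketch of what such a loop might look like, the following standard-library-only example lexes a toy language of decimal integers and `+`. Everything in it is an assumption made for illustration: `SimpleLexeme` stands in for `DefaultLexeme`, the `T_PLUS` and `T_INT` constants stand in for a generated token map, and a plain `Vec<usize>` of newline offsets stands in for `NewlineCache`.

```rust
// Stand-ins for the constants that a generated token-map module would provide.
const T_PLUS: u8 = 1;
const T_INT: u8 = 2;

/// Hypothetical stand-in for a lexeme: a token ID plus a byte span.
#[derive(Debug, PartialEq)]
struct SimpleLexeme {
    tok_id: u8,
    start: usize,
    len: usize,
}

/// Lexes `s` in one go. On success, returns the lexemes and the byte offset
/// of each newline; on failure, returns the byte offset of the first
/// unrecognised character.
fn lex_all(s: &str) -> Result<(Vec<SimpleLexeme>, Vec<usize>), usize> {
    let bytes = s.as_bytes();
    let mut lexemes = Vec::new();
    let mut newlines = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'+' => {
                lexemes.push(SimpleLexeme { tok_id: T_PLUS, start: i, len: 1 });
                i += 1;
            }
            b'0'..=b'9' => {
                // Consume the longest run of ASCII digits as one integer lexeme.
                let start = i;
                while i < bytes.len() && bytes[i].is_ascii_digit() {
                    i += 1;
                }
                lexemes.push(SimpleLexeme { tok_id: T_INT, start, len: i - start });
            }
            b'\n' => {
                // Record the newline so offsets can later be mapped to lines.
                newlines.push(i);
                i += 1;
            }
            // Other whitespace is skipped without producing a lexeme.
            b' ' | b'\t' | b'\r' => i += 1,
            _ => return Err(i),
        }
    }
    Ok((lexemes, newlines))
}
```

Every branch of the loop advances `i`, which is what guarantees termination. In a real lexer, the two collections built here would be handed to `LRNonStreamingLexer::new` along with the input, as in the outline above.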