An AST evaluator
We now know enough to put together a more sophisticated version of our simple calculator example that builds an Abstract Syntax Tree (AST) while parsing, which is then evaluated separately. This models a common way of building real compilers. The full example code can be found at https://github.com/softdevteam/grmtools/tree/master/lrpar/examples/calc_ast.
The calc.l
file remains unchanged from that in the Quickstart
guide. However the calc.y
file is change as follows:
%start Expr
%avoid_insert "INT"
%%
Expr -> Result<Expr, ()>:
Expr '+' Term { Ok(Expr::Add{ span: $span, lhs: Box::new($1?), rhs: Box::new($3?) }) }
| Term { $1 }
;
Term -> Result<Expr, ()>:
Term '*' Factor { Ok(Expr::Mul{ span: $span, lhs: Box::new($1?), rhs: Box::new($3?) }) }
| Factor { $1 }
;
Factor -> Result<Expr, ()>:
'(' Expr ')' { $2 }
| 'INT' { Ok(Expr::Number{ span: $span }) }
;
%%
use cfgrammar::Span;
#[derive(Debug)]
pub enum Expr {
Add {
span: Span,
lhs: Box<Expr>,
rhs: Box<Expr>,
},
Mul {
span: Span,
lhs: Box<Expr>,
rhs: Box<Expr>,
},
Number {
span: Span
}
}
The most obvious difference here is that we have defined a simple enum
Expr
,
with three variants, for our AST. Each AST variant also records a Span
which
records how much input the AST element covers. By using the
$span
variable we can ensure that AST elements record their
relationship to portions of the user's input that span multiple tokens (e.g.
for the expressions 1 + 2
the resulting Expr::Add
will have a Span
starting at byte index 0 and ending at byte index 5 -- in other words covering
the complete input string in this case).
After parsing, we thus end up with a Result<Expr, ()>
. In the case of a
successful parse, this will give us an arbitrarily deeply nested Expr
.
Our main.rs
file then looks as follows:
use std::io::{self, BufRead, Write};
use lrlex::{lrlex_mod, DefaultLexeme, LRLexError};
use lrpar::{lrpar_mod, NonStreamingLexer, Span};
lrlex_mod!("calc.l");
lrpar_mod!("calc.y");
use calc_y::Expr;
fn main() {
let lexerdef = calc_l::lexerdef();
let stdin = io::stdin();
loop {
print!(">>> ");
io::stdout().flush().ok();
match stdin.lock().lines().next() {
Some(Ok(ref l)) => {
if l.trim().is_empty() {
continue;
}
let lexer = lexerdef.lexer(l);
let (res, errs) = calc_y::parse(&lexer);
for e in errs {
println!("{}", e.pp(&lexer, &calc_y::token_epp));
}
if let Some(Ok(r)) = res {
// We have a successful parse.
match eval(&lexer, r) {
Ok(i) => println!("Result: {}", i),
Err((span, msg)) => {
let ((line, col), _) = lexer.line_col(span);
eprintln!(
"Evaluation error at line {} column {}, '{}' {}.",
line,
col,
lexer.span_str(span),
msg
)
}
}
}
}
_ => break
}
}
}
fn eval(
lexer: &dyn NonStreamingLexer<DefaultLexeme, u32>,
e: Expr,
LRLexError)
-> Result<u64, (Span, &'static str)> {
match e {
Expr::Add { span, lhs, rhs } => eval(lexer, *lhs)?
.checked_add(eval(lexer, *rhs)?)
.ok_or((span, "overflowed")),
Expr::Mul { span, lhs, rhs } => eval(lexer, *lhs)?
.checked_mul(eval(lexer, *rhs)?)
.ok_or((span, "overflowed")),
Expr::Number { span } => lexer
.span_str(span)
.parse::<u64>()
.map_err(|_| (span, "cannot be represented as a u64"))
}
}
Let's start by running this and seeing what happens:
>>> 2+3*4
Result: 14
>>> 2++3*4
Parsing error at line 1 column 3. Repair sequences found:
1: Delete +
2: Insert INT
Result: 14
>>> 999999*888888 + 777777*666666
Result: 1407404592594
>>> 9999999999*8888888888 + 7777777777*6666666666
Evaluation error at line 1 column 6, '9999999999*8888888888' overflowed.
The first three expressions evaluate just as before. However, the fourth is
interesting: we have explicitly captured the fact that the result of
9999999999*8888888888
is too big to fit into a u64
; and not only have we
told the user which character the input starts out, but we've printed out the
precise sub-part of the input which caused that error. This works even when
it's in the middle of the input:
>>> 10 + 9999999999*8888888888 + 20
Evaluation error at line 1 column 6, '9999999999*8888888888' overflowed.
The key to this is that each AST element knows the $span
of the production it
is related to; and the resulting Span
can extract the user's input with
lexer.span_str(span)
.
Happily, this facility composes nicely with error recovery:
>>> 10 ++ 9999999999*8888888888 + 20
Parsing error at line 1 column 5. Repair sequences found:
1: Delete +
2: Insert INT
Evaluation error at line 1 column 7, '9999999999*8888888888' overflowed.