OCaml is a great language for creating programming languages. In fact, it is inspired by the ML language (hence the “ml” in its name); as its author Robin Milner explains in The Definition of Standard ML book:
‘ML’ stands for meta language; this is the term logicians use for a language in which other (formal or informal) languages are discussed and analyzed.
Besides being very flexible, having a nice type system, and offering powerful pattern matching, all of which are helpful here, it also has great libraries: ocamllex or sedlex (with better Unicode support) for building lexers, and ocamlyacc and its more modern cousin menhir for building parsers. There are many tutorials on using them, showing how to build a parser for a calculator or for JSON, the latter also discussed in the great Real World OCaml book. There is also a nice live coding video showing an implementation similar to mine.
Below, I’ll intentionally omit some implementation details and boilerplate, so for the full code remember to check the actual implementation or an interpreter.
Lambda calculus
Lambda calculus can be seen as the simplest programming language. It has three kinds of terms:
- a variable, e.g. \(x\),
- an application like \(t ~ u\), where \(t\) is a function applied to the argument \(u\),
- and an abstraction \(\lambda x . t\), which is a function with an argument \(x\) and the body \(t\).
This can be translated to the following OCaml data type
type term =
  | Var of string
  | App of term * term
  | Abs of string * term
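For instance, the identity function \(\lambda x . x\) and the application \((\lambda x . x) ~ y\) correspond to the following values of this type:

let identity = Abs ("x", Var "x")
let applied = App (identity, Var "y")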
I’ll use it as an example for writing a parser.
Tokens
Having defined the types, we can start building a parser that would convert the source code representation of lambda calculus expressions to an abstract syntax tree. Let’s start with the tokens that we will be extracting from the source code.
A variable is a string identifying its name; that would make an ID token holding a <string> value. An abstraction starts with the Greek letter \(\lambda\), which we’ll represent with the LAMBDA token. It also uses \(.\), the DOT token, to separate the function argument from the body. But that is not all, as we can also group things by placing them between the left (LPAREN) and right (RPAREN) brackets. Unlike many programming languages that mark the end of an expression with something like ;, nothing like this exists in lambda calculus, so we will assume that the end of a line, the END token, also marks the end of the expression. We will also use EOF for the end of the file, so that we know there’s nothing more to parse.
%token <string> ID
%token LPAREN "("
%token RPAREN ")"
%token LAMBDA "λ"
%token DOT "."
%token END
%token EOF
The quoted symbols on the right-hand side are just aliases, so we could write "λ" instead of LAMBDA to make the parser code more readable.
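For example, a line like λx. x y should produce the following stream of tokens, written here as an OCaml list purely for illustration:

[LAMBDA; ID "x"; DOT; ID "x"; ID "y"; END; EOF]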
Lexer
The lexer reads the source code character by character, transforming it into tokens.
A mathematician might be fine with writing \(fxy\) to mean applying the function \(f\) to the arguments \(x\) and \(y\), but the rest of us would probably appreciate being able to write meaningful multi-character names like myfunc foo bar. To do this, we need to be able to mark where names start and end, and the simplest way is to use whitespace for that, so we need to define the white characters. Since we don’t care whether we saw a single whitespace or multiple whitespaces in a row, we use + in the regular expression to say “one or more”.
let white = [' ' '\t']+
For marking the end of an expression, we’ll use newline characters.
let newline = '\r' | '\n' | "\r\n"
There are some reserved characters like (, ), ., λ, etc, but all the others can be considered identifier characters, so we define the set of reserved characters and negate it with ^ in the regular expression.
let string = [^ '(' ')' '\\' '.' '#' ' ' '\t' '\n' '\r']+
Now we can start defining the lexing rules. The simplest cases are reaching the end of the file or a newline character, for which we return the EOF or END token, respectively.
rule read =
  parse
  | eof { EOF }
  | newline { END }
For whitespaces, the lexing rule is to skip them and read the next token by recursively calling the read rule.
| white { read lexbuf }
For the brackets, we would be returning the appropriate tokens.
"(" { LPAREN }
| ")" { RPAREN } |
The same happens for \(\lambda\), but since it is not straightforward to write Greek letters on most keyboards, we also accept \ as an alternative (as in Haskell).
"\\" { LAMBDA }
| "λ" { LAMBDA }
| "." { DOT } |
For the identifiers, i.e. all the other strings, we would return the ID token with the string as its value.
| string { ID (lexeme lexbuf) }
Finally, it would be useful to support code comments. # marks the start of a comment that continues till the end of the line. In such a case, the lexer calls the skip_line rule, which recursively calls itself, skipping characters until hitting the newline character, which brings us back to the read rule again.
"#" { skip_line lexbuf }
| and skip_line =
parse
| newline { new_line lexbuf; read lexbuf }
| eof { EOF } | _ { skip_line lexbuf }
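To sanity-check the lexer, we can feed it a string directly. Here is a minimal sketch, where token_name is a hypothetical helper (not part of the original code) that pretty-prints the tokens generated by menhir:

(* token_name is a hypothetical helper for printing tokens *)
let token_name = function
  | Parser.ID s -> Printf.sprintf "ID %S" s
  | Parser.LPAREN -> "LPAREN"
  | Parser.RPAREN -> "RPAREN"
  | Parser.LAMBDA -> "LAMBDA"
  | Parser.DOT -> "DOT"
  | Parser.END -> "END"
  | Parser.EOF -> "EOF"

let () =
  (* read tokens from a string until EOF, printing each one *)
  let lexbuf = Lexing.from_string "λx. x y\n" in
  let rec dump () =
    match Lexer.read lexbuf with
    | Parser.EOF -> print_endline "EOF"
    | tok -> print_endline (token_name tok); dump ()
  in
  dump ()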
Parser
The parser converts the stream of tokens to an abstract syntax tree representation of the code.
We would start with the simplest case of parsing the variables. When reaching the ID token, the parser transforms its value x into Var x.
let variable :=
  | x = ID; { Var x }
Parsing the applications is slightly harder. As a reminder, applications take the \(x ~ y\) form, where both \(x\) and \(y\) can be any terms (let’s assume for now that term is already defined). We could naively define it as one term following another.
| t = term; u = term; { App (t, u) }
But in such a case, the compiler would show us warnings that the rule is ambiguous.
Warning: 2 states have shift/reduce conflicts.
Warning: 6 shift/reduce conflicts were arbitrarily resolved.
The application could be \(t ~ u\), but also \(a ~ b ~ c\), or \(a ~ b ~ \lambda x . x\). Since App has only two fields, how would we parse such cases? By convention, application is associative to the left, so \(a ~ b ~ c ~ d\) reads as \(((a ~ b) ~ c) ~ d\). Moreover, when we would like to change the application order, we can use brackets, e.g. \(f ~ x ~ (y ~ z)\) would become \((f ~ x) ~ (y ~ z)\). To parse this, we first define an element, which can be a result of the variable rule that we already defined, or a term (wait for it…) surrounded by brackets.
let element :=
  | variable
  | "("; x = term; ")"; { x }
Now we can create a recursive application rule. Starting from the second part of the rule, if there is an application t followed by an element u, it is read as a new application App (App (_, _), u). The first part of the rule says that we just read an element. By the definition of element, this can be something surrounded by brackets or a single variable.
let application :=
  | element
  | t = application; u = element; { App (t, u) }
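For instance, this rule parses a b c and a (b c) as follows:

(* a b c    ↦  App (App (Var "a", Var "b"), Var "c")
   a (b c)  ↦  App (Var "a", App (Var "b", Var "c")) *)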
As a side note, this makes the definition of application slightly misleading, since it will also read things like the variable \(x\) or \((\lambda x . x)\) (using the element part of the rule), but we’ll sacrifice a bit of purity to make it work.
Now we need to define the parsing rule for abstractions. As a reminder, an abstraction takes the \(\lambda x . t\) form; the identity function \(\lambda x . x\) is the simplest example. But, what I haven’t mentioned yet, it can have multiple arguments, e.g. \(\lambda x y . x ~ y\) reads as \(\lambda x . \lambda y . x ~ y\). Abstractions follow a different convention than applications: they extend to the right, so \(\lambda x . x ~ y ~ z\) is \(\lambda x . (x ~ y ~ z)\), not \((\lambda x . x) ~ y ~ z\). This is related to currying, but that’s a different story.
The parsing rule for an abstraction is that it starts with the "λ" token, followed by an argument, and then the rest of it. In the simplest case, after the argument there is a "." token and the body of the function. Alternatively, it may be followed by another argument and something following it, using the above rule recursively. This translates to two rules: abstraction, which reads the head of the function, and body, which reads its tail.
let abstraction :=
  | "λ"; x = ID; u = body; { Abs (x, u) }
let body :=
  | "."; u = term; { u }
  | x = ID; u = body; { Abs (x, u) }
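With these two rules, a multi-argument abstraction desugars exactly as promised, e.g.:

(* λx y. x y  ↦  Abs ("x", Abs ("y", App (Var "x", Var "y"))) *)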
Having all the rules for reading the individual terms, we can finally define the term rule, which reads an application (which, recall, can be a single variable or an application consisting of more elements) or an abstraction.
let term :=
  | application
  | abstraction
Finally, we want a general rule for reading terms. It terminates the parser on EOF, reads a term followed by a line-ending token (the last rule), and skips over a leading END to proceed to whatever follows it, which handles empty lines (the middle rule).
let prog :=
  | EOF; { None }
  | END; p = prog; { p }
  | t = term; line_end; { Some t }

let line_end := END | EOF
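To see the whole grammar in action, we can wrap the generated entry point in a small helper. parse_string below is a hypothetical function of my own, assuming prog is declared as the %start symbol:

(* parse a single line from a string; returns a term option *)
let parse_string s =
  Parser.prog Lexer.read (Lexing.from_string s)

(* parse_string "(λx. x) y\n" should evaluate to:
   Some (App (Abs ("x", Var "x"), Var "y")) *)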
Wrap it up
Having finished the above, we need to tell the build system to call ocamllex and menhir so that they create the lexer and parser for us. This is specified in the dune file, which uses a Lisp-like syntax for the compiler configuration. The definition below assumes that we have the lexer defined in the lexer.mll file and the parser in the parser.mly file.
(ocamllex lexer)
(menhir (modules parser))
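For a complete build, the dune file also needs an executable stanza; here is a sketch of the full file, assuming the driver below lives in main.ml (note that menhir support must also be enabled in the dune-project file with a (using menhir ...) stanza):

(executable
 (name main))

(ocamllex lexer)

(menhir (modules parser))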
Calling dune build would now create the lexer and parser modules for us. To check how they work, we can create an executable command-line program that reads user input from standard input and prints the parsed representation. For this, we would use a recursive loop function that calls the prog function from our parser, which in turn uses the read function from the lexer.
let rec loop lexer =
  flush stdout;
  let _ =
    match Parser.prog Lexer.read lexer with
    | Some t -> Printf.printf "%s\n\n" (show_term t)
    | None -> ()
  in
  loop lexer

let () = loop (Lexing.from_channel stdin)
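One detail worth noting: show_term is not a built-in. A common way to get it, and an assumption on my part here, is to derive it with ppx_deriving.show, which also requires (preprocess (pps ppx_deriving.show)) in the dune file:

(* [@@deriving show] generates show_term : term -> string *)
type term =
  | Var of string
  | App of term * term
  | Abs of string * term
[@@deriving show]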
Conclusion
OCaml has great tools for building parsers. It lets us write lexers using rules based on regular expressions, which can be made even more flexible by calling other rules (basically, functions). For building parsers, we can use a rich, functional, pattern-matching-based language to write declarative parsing rules that menhir or ocamlyacc will translate into a parser for us. If you know OCaml, they are really easy to learn and quite intuitive.