Writing a parser in ocamllex and menhir

Published March 10, 2023

OCaml is a great language for creating programming languages. In fact, it is inspired by the ML language (hence the “ml” in its name), which, as its author Robin Milner explains in The Definition of Standard ML book:

‘ML’ stands for meta language; this is the term logicians use for a language in which other (formal or informal) languages are discussed and analyzed.

Besides being very flexible and having a nice type system and powerful pattern matching, all of which are helpful here, it also has great libraries, including ocamllex and sedlex (which has better Unicode support) for building lexers, and ocamlyacc and its more modern cousin menhir for building parsers. There are many tutorials on using them that show how to build a parser for a calculator or for JSON, the latter also discussed in the great Real World OCaml book. There is also a nice live coding video showing an implementation similar to mine.

Below I’ll be intentionally omitting some implementation details and boilerplate, so for the full code remember to check the actual implementation or an interpreter.

Lambda calculus

Lambda calculus can be seen as the simplest programming language. It has three kinds of terms:

  • a variable, e.g. \(x\),
  • an application like \(t ~ u\), where \(t\) is the function applied to the \(u\) argument,
  • and an abstraction \(\lambda x . t\), which is a function with an argument \(x\) and the body \(t\).

This can be translated to the following OCaml data type

term.ml

type term =
  | Var of string
  | App of term * term
  | Abs of string * term
[@@deriving show] (* generates the show_term function used later; needs the ppx_deriving preprocessor enabled in dune *)
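
For example, the identity function applied to a variable, \((\lambda x . x) ~ y\), corresponds to the following value:

App (Abs ("x", Var "x"), Var "y")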

I’ll use it as an example for writing a parser.

Tokens

Having defined the types, we can start building a parser that would convert the source code representation of lambda calculus expressions to an abstract syntax tree. Let’s start with the tokens that we will be extracting from the source code.

A variable is a string identifying its name; that makes a <string> ID token. An abstraction starts with the Greek letter \(\lambda\), which we’ll represent with the LAMBDA token. It also uses \(.\) (the DOT token) to separate the function argument from the body. But that is not all, as we can also group things by placing them between the left (LPAREN) and right (RPAREN) brackets. Unlike in many programming languages that mark the end of an expression with things like ;, nothing like this exists in lambda calculus, so we would assume that the end of a line (END) also marks the end of an expression. We would also use EOF for the end of the file, so that we know there’s nothing more to parse.

parser.mly

%token <string> ID
%token LPAREN "("
%token RPAREN ")"
%token LAMBDA "λ"
%token DOT "."
%token END
%token EOF

The quoted symbols on the right-hand side are just aliases, so we could write "λ" instead of LAMBDA to make the parser code more readable.
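
For example, once the lexer below is in place, the input λ x . x y (on its own line) will be turned into the token stream:

LAMBDA  ID "x"  DOT  ID "x"  ID "y"  END  EOF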

Lexer

The lexer reads the source code character by character, transforming it into tokens.

A mathematician might be fine with writing \(fxy\) to mean applying function \(f\) to arguments \(x\) and \(y\), but the rest of us would probably appreciate being able to write meaningful multi-character names like myfunc foo bar. To do this, we need to be able to mark where names start and end, and the simplest way is to use whitespace for that, so we need to define the white characters. Since we don’t care whether we saw a single whitespace character or several in a row, we would use + in the regular expression to say “one or more”.

lexer.mll

let white = [' ' '\t']+

For marking the end of an expression, we’ll use newline characters.

let newline = '\r' | '\n' | "\r\n"

There are some reserved characters like (, ), ., λ, etc., but all the other characters can appear in identifiers, so we define a set of reserved characters and negate it with ^ in the regular expression.

let string = [^ '(' ')' '\\' '.' '#' ' ' '\t' '\n' '\r']+

Now we can start defining the lexing rules. The simplest cases are reaching the end of the file or a newline character, where we return the EOF or END token, respectively.

rule read =
  parse
    | eof { EOF }
    | newline { new_line lexbuf; END } (* new_line keeps the position info up to date for error reporting *)

For whitespace, the lexing rule skips it and reads the next token by recursively calling the read rule.

    | white { read lexbuf }

For the brackets, we would be returning the appropriate tokens.

    | "(" { LPAREN }
    | ")" { RPAREN }

The same happens for \(\lambda\), but since it is not straightforward to type Greek letters on most keyboards, we would also accept \ as an alternative (as in Haskell).

    | "\\" { LAMBDA }
    | "λ" { LAMBDA }
    | "." { DOT }

For the identifiers, i.e. all the other strings, we would return the ID token with the string as its value.

    | string { ID (lexeme lexbuf) }

Finally, it would be useful to support code comments. # marks the start of a comment that continues until the end of the line. In such a case, the lexer calls the skip_line rule. This rule recursively calls itself, skipping characters until it hits the newline character, which brings us back to the read rule again.

    | "#" { skip_line lexbuf }
and skip_line =
  parse
    | newline { new_line lexbuf; read lexbuf }
    | eof { EOF }
    | _ { skip_line lexbuf }
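
One piece of the omitted boilerplate is worth spelling out here: the fragments above use the token constructors and the Lexing helpers (lexeme, new_line) unqualified, so lexer.mll needs a header along these lines:

{
open Lexing
open Parser
}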

Parser

The parser converts the stream of tokens to an abstract syntax tree representation of the code.

We would start with the simplest case of parsing variables. When reaching the ID token, the parser transforms its value x into Var x.

parser.mly

let variable :=
  | x = ID; { Var x }

Parsing the applications is slightly harder. As a reminder, applications can take \(x ~ y\) form, where both \(x\) and \(y\) can be any terms (let’s assume for now that term is already defined). We could naively define it as one term following another.

  | t = term; u = term; { App (t, u) }

But in such a case, menhir would warn us that the grammar is ambiguous:

Warning: 2 states have shift/reduce conflicts.
Warning: 6 shift/reduce conflicts were arbitrarily resolved.

The application could be \(t ~ u\), but also \(a ~ b ~ c\), or \(a ~ b ~ \lambda x . x\). Since App has only two fields, how would we parse such cases? By convention, application is left-associative, so \(a ~ b ~ c ~ d\) reads as \(((a ~ b) ~ c) ~ d\). Moreover, when we would like to change the application order, we can use brackets, e.g. \(f ~ x ~ (y ~ z)\) becomes \((f ~ x) ~ (y ~ z)\). To parse this, we will first define an element, which can be a result of the variable rule that we already defined, or a term (wait for it…) surrounded by brackets.

let element :=
  | variable
  | "("; x = term; ")"; { x }

Now we can create a recursive application rule. Starting from the second part of the rule: if there is an application t followed by an element u, it is read as a new application App (App (_, _), u). The first part of the rule just reads a single element, which, by the definition of element, can be something surrounded by brackets or a single variable.

let application :=
  | element
  | t = application; u = element; { App (t, u) }
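
For instance, this rule reads \(a ~ b ~ c\) as

App (App (Var "a", Var "b"), Var "c")

matching the left-associative convention described above.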

As a side note, this makes the definition of application slightly misleading, since it will also read things like the bare variable \(x\) or \((\lambda x . x)\) (through the element part of the rule), but we’ll sacrifice a bit of purity to make it work.

Now we need to define the parsing rule for abstractions. As a reminder, an abstraction takes the \(\lambda x . t\) form; the identity function \(\lambda x . x\) is the simplest example. But, what I didn’t mention yet, it can have multiple arguments, e.g. \(\lambda x y . x ~ y\) reads as \(\lambda x . \lambda y . x ~ y\). Abstractions follow a different convention than applications: they extend to the right, so \(\lambda x . x ~ y ~ z\) is \(\lambda x . (x ~ y ~ z)\), not \((\lambda x . x) ~ y ~ z\). This is related to currying, but that’s a different story.

The parsing rule for an abstraction is that it is something starting with the "λ" token that is followed by an argument, and then the rest of it follows. In the simplest case, after the argument, there is a "." token and the body of a function. Alternatively, it may be followed by another argument and something following it, using the above rule recursively. This translates to the two rules: abstraction which reads the head of the function and body which reads its tail.

let abstraction :=
  | "λ"; x = ID; u = body; { Abs (x, u) }

let body :=
  | "."; u = term; { u }
  | x = ID; u = body; { Abs (x, u) }
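
So, for example, the two-argument function \(\lambda x y . x\) parses to

Abs ("x", Abs ("y", Var "x"))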

Having all the rules for reading the individual terms, we can finally define term, which reads either an application (which, as noted above, can also be a single variable or a bracketed term) or an abstraction.

let term :=
  | application
  | abstraction

Finally, we want a general rule for reading terms. It terminates the parser on EOF, reads a term followed by a line end, and skips empty lines by consuming a leading END token and recursing (the middle rule).

let prog :=
  | EOF; { None }
  | END; p = prog; { p }
  | t = term; line_end; { Some t }

let line_end := END | EOF
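
As with the lexer, some boilerplate is omitted here: menhir also needs to know the grammar’s entry point and its type, and the semantic actions need access to the term type. A minimal sketch, assuming the type lives in the Term module generated from term.ml:

%{ open Term %}

%start <Term.term option> prog

%%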

Wrap it up

Having finished the above, we need to tell the build system to call ocamllex and menhir so that they generate the lexer and parser for us. This is specified in the dune file, which uses an s-expression (Lisp-like) syntax for the build configuration. The definition assumes that we have a lexer defined in the lexer.mll file and a parser in the parser.mly file.

dune

(ocamllex lexer)

(menhir
 (modules parser))
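
Depending on the project setup, two more pieces are typically needed: the menhir stanza must be enabled in the dune-project file with a line like (using menhir 2.1), and the command-line program from the next section needs its own executable stanza, e.g.:

(executable
 (name main)
 (preprocess (pps ppx_deriving.show)))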

Calling dune build would now create the lexer and parser modules for us. To check how they work, we can create an executable command-line program that reads user input from standard input and prints the parsed representation. For this, we would use the recursive loop function that calls the prog function from our parser, which uses the read function from the lexer.

main.ml

let rec loop lexer =
  flush stdout;
  (match Parser.prog Lexer.read lexer with
   | Some t -> Printf.printf "%s\n\n" (Term.show_term t)
   | None -> exit 0 (* EOF, nothing more to parse *));
  loop lexer

let () =
  loop (Lexing.from_channel stdin)
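
A session might then look roughly like this (the exact output depends on how show_term is derived):

$ dune exec ./main.exe
λ x . x
Abs ("x", Var "x")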

Conclusion

OCaml has great tools for building parsers. It lets us write lexers using rules based on regular expressions, which can be made even more flexible by calling other rules (basically, functions). For building parsers, we get a rich functional, pattern-matching-based language for writing declarative parsing rules, which menhir or ocamlyacc then translate into a parser for us. If you know OCaml, they are really easy to learn and quite intuitive.