Parser_combinator (beluga.Beluga_parser.Parser

Introduction to Parser Combinators

This is a hand-made parser combinator system, inspired by Haskell's parsec library. The basic idea in parser combinators is to use higher-order functions to manipulate parsing functions to build up more complex parsers. The most basic type we can give for a parsing function would be something like:

  type 'a parser = char list -> char list * 'a option

char list represents the input stream. We must return a new stream, perhaps with some characters removed since we parsed them. The result of the parse is parametric. Since parsing may fail, we use a type constructor to represent this, namely option.

However this design has many shortcomings.

Representing the input as a char list requires the entire input to be buffered in memory. This is wasteful. Instead, we should progressively read input as we need it. Therefore in our implementation we use a lazily-evaluated persistent sequence.
Using option to represent failure doesn't give us a means to specify what the error is. Instead we use ('a, exn) result in order to also return some information in case of failure.
Finally, in order to control backtracking and other parsing features, we will need to not only pass around and transform the input, but a more general parser state.

So our parsing function type now looks like:

  type 'a parser = state -> state * ('a, exn) result

where state contains a Located_token.t Seq.t for the input as well as extra stuff for handling backtracking. This is an instance of the state monad as defined in Support.State.STATE.

Backtracking

The naive implementation of the alt alternation combinator is to say that alt p1 p2 first runs p1, and if it fails, runs p2. This implementation has the major drawback of allowing unlimited backtracking. This is undesirable because it results in terrible error messages. What we would like is a way to control the backtracking behaviour of parsers on a more fine-grained level.

Instead, the library below is non-backtracking, so it introduces the trying combinator to selectively enable backtracking. trying p runs p, and if p fails having even consumed input, then trying p can still be backtracked out of. The alt combinator is implemented like this:

Run p1. If it succeeds, return its result without trying p2.
If p1 failed without consuming any input, then run p2.
If p1 failed under a trying, then run p2.
Otherwise, the error generated by p1 was fatal, so return it.

The rationale for this is that it allows the parser writer to commit to a parse tree once certain conditions have been met. For example, in Beluga, after a case keyword, we know for sure that we're parsing a case expression. Therefore, if we fail afterwards to parse the scrutinee of the case (a synthesizable expression), we should not backtrack out of the parser for case expressions.

Language Design Considerations

This parser combinator library is only suited for top-down data-independent parsing. That is, tokens are parsed sequentially from left to right, and no auxiliary data is accumulated during parsing. This means that left-recursive grammars and data-dependent grammars are not supported without either rewriting the grammar, or disambiguating and rewriting parsed ASTs at a later stage.

It is best to implement a parser with a pre-defined context-free grammar in a language specification document. This grammar may need rewriting to eliminate left recursion, assign precedences and associativities using recursive descent parsing.

Parser State

module type PARSER_STATE = sig ... end

Abstract parser state definition.

module type LOCATED = sig ... end

Module type definition for elements having location annotations.

Constructors

module Make_persistent_state (Token : LOCATED) : sig ... end

Functor creating an instance of PARSER_STATE backed by an immutable data structure.

Parsers

module type PARSER = sig ... end

Constructors

module Make
  (State : PARSER_STATE with type location = Beluga_syntax.Location.t) : 
  PARSER
    with type state = State.state
     and type token = State.token
     and type location = State.location