Beluga_parser.Parser_combinator
General purpose parser combinator library.
This is a hand-made parser combinator system, inspired by Haskell's parsec library. The basic idea in parser combinators is to use higher-order functions to manipulate parsing functions to build up more complex parsers. The most basic type we can give for a parsing function would be something like:
type 'a parser = char list -> char list * 'a option
char list
represents the input stream. We must return a new stream, perhaps with some characters removed since we parsed them. The result of the parse is parametric. Since parsing may fail, we use a type constructor to represent this, namely option
.
However this design has many shortcomings.
char list
requires the entire input to be buffered in memory. This is wasteful. Instead, we should progressively read input as we need it. Therefore in our implementation we use a lazily-evaluated persistent sequence.option
to represent failure doesn't give us a means to specify what the error is. Instead we use ('a, exn) result
in order to also return some information in case of failure.So our parsing function type now looks like:
type 'a parser = state -> state * ('a, exn) result
where state
contains a Located_token.t Seq.t
for the input as well as extra stuff for handling backtracking. This is an instance of the state monad as defined in Support.State.STATE
.
The naive implementation of the alt
alternation combinator is to say that alt p1 p2
first runs p1
, and if it fails, runs p2
. This implementation has the major drawback of allowing unlimited backtracking. This is undesirable because it results in terrible error messages. What we would like is a way to control the backtracking behaviour of parsers on a more fine-grained level.
Instead, the library below is non-backtracking, so it introduces the trying
combinator to selectively enable backtracking. trying p
runs p
, and if p
fails having even consumed input, then trying p
can still be backtracked out of. The alt
combinator is implemented like this:
p1
. If it succeeds, return its result without trying p2
.p1
failed without consuming any input, then run p2
.p1
failed under a trying
, then run p2
.p1
was fatal, so return it.The rationale for this is that it allows the parser writer to commit to a parse tree once certain conditions have been met. For example, in Beluga, after a case
keyword, we know for sure that we're parsing a case expression. Therefore, if we fail afterwards to parse the scrutinee of the case (a synthesizable expression), we should not backtrack out of the parser for case expressions.
This parser combinator library is only suited for top-down data-independent parsing. That is, tokens are parsed sequentially from left to right, and no auxiliary data is accumulated during parsing. This means that left-recursive grammars and data-dependent grammars are not supported without either rewriting the grammar, or disambiguating and rewriting parsed ASTs at a later stage.
It is best to implement a parser with a pre-defined context-free grammar in a language specification document. This grammar may need rewriting to eliminate left recursion, assign precedences and associativities using recursive descent parsing.
module type PARSER_STATE = sig ... end
Abstract parser state definition.
module type LOCATED = sig ... end
Module type definition for elements having location annotations.
module Make_persistent_state (Token : LOCATED) : sig ... end
Functor creating an instance of PARSER_STATE
backed by an immutable data structure.
module type PARSER = sig ... end
module Make
(State : PARSER_STATE with type location = Beluga_syntax.Location.t) :
PARSER
with type state = State.state
and type token = State.token
and type location = State.location