Parsby
Parser combinator library for Ruby, based on Haskell's Parsec.
- Installation
- Examples
- Introduction
- Some commonly used combinators
- Defining combinators
- Defining parsers as modules
- ExpectationFailed
- Cleaning up the parse tree for the trace
- Recursive parsers with lazy
- Parsing left-recursive languages with reduce combinator
- Parsby.new
- Parsing from a string, a file, a pipe, a socket, ...
- Comparing with Haskell's Parsec
- Development
Installation
Add this line to your application's Gemfile:
gem 'parsby'
And then execute:
$ bundle
Or install it yourself as:
$ gem install parsby
Examples
If you'd like to jump right into example parsers that use this library, there are a few in this source, such as Parsby::Example::LispParser and Parsby::Example::ArithmeticParser, both referenced later in this document.
Introduction
This is a library used to define parsers by declaratively describing a syntax using what's commonly referred to as combinators. Parser combinators are functions that take parsers as inputs and/or return parsers as outputs, i.e. they combine parsers into new parsers.
As an example, between is a combinator with 3 parameters: a parser for
what's to the left, one for what's to the right, and lastly one for what's
in-between them. It returns a parser that, after parsing, returns the
result of the in-between parser:
between(lit("<"), lit(">"), decimal).parse "<100>"
#=> 100
Some commonly used combinators
# Parse argument string literally
lit("foo").parse "foo"
#=> "foo"
# Case insensitive lit
ilit("Foo").parse "fOo"
#=> "fOo"
# Make any value into a parser that results in that value without
# consuming input.
pure("foo").parse ""
#=> "foo"
# Parse foo or bar
(lit("foo") | lit("bar")).parse "bar"
#=> "bar"
# Like `|`, parse one of foo or bar. `choice` is better when you have
# many choices to choose from. You can pass it any number of parsers or
# arrays of parsers.
choice(lit("foo"), lit("bar")).parse "bar"
#=> "bar"
# Parse with each argument in succession and group the results in an
# array.
group(lit("foo"), lit("bar")).parse "foobar"
#=> ["foo", "bar"]
# Parse foo and bar, returning bar.
(lit("foo") > lit("bar")).parse "foobar"
#=> "bar"
# Parse foo and bar, returning foo.
(lit("foo") < lit("bar")).parse "foobar"
#=> "foo"
# Make parser optional
group(optional(lit("foo")), lit("bar")).parse "bar"
#=> [nil, "bar"]
# Use parser zero or more times, grouping results in an array. many_1 does
# the same, but requires parsing at least once.
many(lit("foo")).parse "foofoo"
#=> ["foo", "foo"]
# Parse many, but each separated by something. sep_by_1 requires at least
# one element to be parsed.
sep_by(lit(","), lit("foo")).parse "foo,foo"
#=> ["foo", "foo"]
# `whitespace` (alias `ws`) is zero or more whitespace characters.
# `whitespace_1` (alias `ws_1`) is one or more whitespace characters.
# `spaced` allows a parser to be surrounded by optional whitespace.
# `whitespace_1` is the base definition. If you extend it to e.g. add the
# parsing of comments, the other combinators will also recognize that
# change.
(whitespace > lit("foo")).parse " foo"
#=> "foo"
group(lit("foo"), ws_1 > lit("bar")).parse "foo bar"
#=> ["foo", "bar"]
spaced(lit("foo")).parse " foo "
#=> "foo"
# Transform the parse result according to the block.
lit("foo").fmap {|x| x.upcase }.parse "foo"
#=> "FOO"
# join(p) is the same as p.fmap {|xs| xs.join }
join(sep_by(lit(","), lit("foo") | lit("bar"))).parse "foo,bar"
#=> "foobar"
# Parse a character from the choices in a set of strings or ranges
char_in(" \t\r\n").parse "\t"
#=> "\t"
typical_identifier_characters = ['a'..'z', 'A'..'Z', '0'..'9', "_"]
join(many(char_in("!?", typical_identifier_characters))).parse "foo23? bar"
#=> "foo23?"
# Parse any one character
any_char.parse "foo"
#=> "f"
# Require end of input at end of parse.
(lit("foo") < eof).parse "foobar"
#=> Parsby::ExpectationFailed: line 1:
foobar
| * failure: eof
\-/ *| success: lit("foo")
\|
| * failure: (lit("foo") < eof)
# Parse only when other parser fails.
join(many(any_char.that_fails(whitespace_1))).parse "foo bar"
#=> "foo"
# single(p) is the same as p.fmap {|x| [x] }
single(lit("foo")).parse "foo"
#=> ["foo"]
# p1 + p2 is the same as group(p1, p2).fmap {|(r1, r2)| r1 + r2 }
(lit("foo") + (ws > lit("bar"))).parse "foo bar"
#=> "foobar"
(single(lit("foo")) + many(ws > lit("bar"))).parse "foo bar bar"
#=> ["foo", "bar", "bar"]
Defining combinators
If you look at the examples in this source, you'll notice that almost all
combinators are defined with define_combinator
. Strictly speaking, it's
not necessary to use that to define combinators. You can do it with
variable assignment or def
syntax. Nevertheless, define_combinator
is
preferred because it automates the assignment of a label to the combinator.
Consider this example:
define_combinator :between do |left, right, p|
left > p < right
end
between(lit("<"), lit(">"), lit("foo")).label
#=> 'between(lit("<"), lit(">"), lit("foo"))'
Having labels that resemble the source code is helpful for the error messages.
If we use def instead of define_combinator, then the label would be
that of its definition. In the following case, it would be the one assigned
by <.
def between(left, right, p)
left > p < right
end
between(lit("<"), lit(">"), lit("foo")).label
#=> '((lit("<") > lit("foo")) < lit(">"))'
If we were to wrap that parser in a new one, then the label would simply be unknown.
def between(left, right, p)
Parsby.new {|c| (left > p < right).parse c }
end
between(lit("<"), lit(">"), lit("foo")).label.to_s
#=> "unknown"
Defining parsers as modules
The typical pattern I use is something like this:
module FoobarParser
include Parsby::Combinators
extend self
# Entrypoint nicety
def parse(s)
foobar.parse s
end
define_combinator :foobar do
foo + bar
end
define_combinator :foo do
lit("foo")
end
define_combinator :bar do
lit("bar")
end
end
From that, you can use it directly as:
FoobarParser.parse "foobar"
#=> "foobar"
FoobarParser.foo.parse "foo"
#=> "foo"
Being able to use subparsers directly is useful when you want to e.g. parse a JSON array rather than any JSON value.
Writing the parser as a module like that also makes it easy to make a new parser based on it:
module FoobarbazParser
include FoobarParser
extend self
def parse(s)
foobarbaz.parse s
end
define_combinator :foobarbaz do
foobar + baz
end
define_combinator :baz do
lit("baz")
end
end
You can also define such a module to hold your own project's combinators to use in multiple parsers.
ExpectationFailed
Here's an example of the error raised when parsing fails:
pry(main)> Parsby::Example::LispParser.sexp.parse "(foo `(foo ,bar) 2.3 . . nil)"
Parsby::ExpectationFailed: line 1:
(foo `(foo ,bar) 2.3 . . nil)
| * failure: char_in("([")
| * failure: list
| *| failure: symbol
| *|| failure: nil
| *||| failure: string
| *|||| failure: number
\\\||
| *| failure: atom
| *|| failure: abbrev
\\|
| * failure: sexp
V *| success: lit(".")
\-/ *|| success: sexp
\---------/ *||| success: sexp
\-/ *|||| success: sexp
V *||||| success: char_in("([")
\\\\\|
| * failure: list
| * failure: sexp
As can be seen, Parsby manages a tree structure representing parsers and their subparsers, with the information of where a particular parser began parsing, where it ended, whether it succeeded or failed, and the label of the parser.
It might be worth mentioning that when debugging a parser from an
unexpected ExpectationFailed error, the backtrace isn't really useful.
That's because the backtrace points to the code involved in parsing, not
the code involved in constructing the parsers, which succeeded, but is
where the problem typically lies. The tree-looking exception message above
is meant to somewhat substitute the utility of the backtrace in these
cases.
Relating to that, the right-most text shows the labels of the corresponding
parsers. I find that labels that resemble the source code are quite useful,
just like the code location descriptions that appear right-most in
backtraces. It's because of this that I consider the use of
define_combinator preferable to using def and explicitly assigning labels.
Cleaning up the parse tree for the trace
If you look at the source of the example lisp parser, you might note that
there are a lot more parsers in between those shown in the tree above.
sexp is not a direct child of list, for example, despite appearing so.
There are at least 6 ancestor/descendant parsers between list and sexp.
It'd be pointless to show them all: they convey little additional
information and their labels are very verbose.
The reason they don't appear is that the splicer.start combinator is used
to make the tree look a little cleaner. To show an example of how it
works, here's a simplified definition of choice:
define_combinator :choice do |*ps|
ps = ps.flatten
ps.reduce(unparseable) do |a, p|
a | p
end
end
Let's fail it:
pry(main)> choice(lit("foo"), lit("bar"), lit("baz")).parse "qux"
Parsby::ExpectationFailed: line 1:
qux
\-/ * failure: lit("baz")
\-/ *| failure: lit("bar")
\-/ *|| failure: lit("foo")
| *||| failure: unparseable
\|||
| *|| failure: (unparseable | lit("foo"))
\||
| *| failure: ((unparseable | lit("foo")) | lit("bar"))
\|
| * failure: (((unparseable | lit("foo")) | lit("bar")) | lit("baz"))
| * failure: choice(lit("foo"), lit("bar"), lit("baz"))
Those parser intermediaries that use | aren't really making things any
clearer. Let's use splicer to remove them:
define_combinator :choice do |*ps|
ps = ps.flatten
splicer.start do |m|
ps.reduce(unparseable) do |a, p|
a | m.end(p)
end
end
end
This makes the p parsers appear as direct children of the splicer.start
parser in the trace. Let's fail it again:
pry(main)> choice(lit("foo"), lit("bar"), lit("baz")).parse "qux"
Parsby::ExpectationFailed: line 1:
qux
\-/ * failure: lit("baz")
\-/ *| failure: lit("bar")
\-/ *|| failure: lit("foo")
\\|
| * failure: splicer.start((((unparseable | splicer.end(lit("foo"))) | splicer.end(lit("bar"))) | splicer.end(lit("baz"))))
| * failure: choice(lit("foo"), lit("bar"), lit("baz"))
Now, the only issue left is that define_combinator wraps the resulting
parser in another parser. It does this so you can see the label assigned to
the combinator and the one assigned to its definition separately. Let's
disable that wrapping by passing wrap: false to it:
define_combinator :choice, wrap: false do |*ps|
ps = ps.flatten
splicer.start do |m|
ps.reduce(unparseable) do |a, p|
a | m.end(p)
end
end
end
That causes it to assign the label directly to the parser returned by the block. Let's fail it again:
pry(main)> choice(lit("foo"), lit("bar"), lit("baz")).parse "qux"
Parsby::ExpectationFailed: line 1:
qux
\-/ * failure: lit("baz")
\-/ *| failure: lit("bar")
\-/ *|| failure: lit("foo")
\\|
| * failure: choice(lit("foo"), lit("bar"), lit("baz"))
Recursive parsers with lazy
If we try to define a recursive parser using combinators like so:
define_combinator :value do
list | lit("foo")
end
define_combinator :list do
between(lit("["), lit("]"), sep_by(lit(","), spaced(value)))
end
value
#=> SystemStackError: stack level too deep
We get a stack overflow.
This isn't a problem in Haskell because the language evaluates lazily by default, which allows one to define recursive parsers without even thinking about it.
In Ruby's case, we need to be explicit about our laziness. For that,
there's lazy. We just need to wrap one of the expressions in the
recursive loop with it. It could be the value call in list; it could be
the list call in value; it could be the whole of value. It really doesn't
matter.
define_combinator :value do
lazy { list | lit("foo") }
end
define_combinator :list do
between(lit("["), lit("]"), sep_by(lit(","), spaced(value)))
end
value.parse "[[[[foo, foo]]]]"
#=> [[[["foo", "foo"]]]]
Parsing left-recursive languages with reduce combinator
Here's a little arithmetic parser based on
Parsby::Example::ArithmeticParser:
define_combinator(:div_op) {|left, right| group(left, spaced(lit("/")), right) }
define_combinator(:mul_op) {|left, right| group(left, spaced(lit("*")), right) }
define_combinator(:add_op) {|left, right| group(left, spaced(lit("+")), right) }
define_combinator(:sub_op) {|left, right| group(left, spaced(lit("-")), right) }
def scope(x, &b)
b.call x
end
define_combinator :expr do
lazy do
e = decimal
# hpe -- higher precedence level expression
# spe -- same precedence level expression
e = scope e do |hpe|
recursive do |spe|
choice(
mul_op(hpe, spe),
div_op(hpe, spe),
hpe,
)
end
end
e = scope e do |hpe|
recursive do |spe|
choice(
add_op(hpe, spe),
sub_op(hpe, spe),
hpe,
)
end
end
end
end
expr.parse "5 - 4 - 3"
#=> [5, "-", [4, "-", 3]]
Now, that's incorrectly right-associative because we made the
precedence-level parsers right-recursive. See how the block parameter of
recursive is used for the right operands and not the left ones?
Let's fix that by switching the parsers used for the operands:
define_combinator :expr do
lazy do
e = decimal
# hpe -- higher precedence level expression
# spe -- same precedence level expression
e = scope e do |hpe|
recursive do |spe|
choice(
mul_op(spe, hpe),
div_op(spe, hpe),
hpe,
)
end
end
e = scope e do |hpe|
recursive do |spe|
choice(
add_op(spe, hpe),
sub_op(spe, hpe),
hpe,
)
end
end
end
end
expr.parse "5 - 4 - 3"
# ...
If you ran this, it might take a while, but eventually you'll get a bunch
of SystemStackError: stack level too deep errors.
What's happening is that e.g. while trying to check whether the expression is a subtraction, it needs to first resolve the left operand, and as part of that it needs to check whether that's a subtraction, and so on and so forth. In other words, this causes infinite recursion. It can't even read a single character of the input because of this.
Our problem is that we're parsing top-down. We're trying to understand what
the whole thing is before looking at the parts. We need to parse bottom-up:
successfully parse a small piece, then figure out what the whole thing is
as we keep reading. To do that while keeping our definitions declarative,
we can use the reduce combinator (in combination with pure):
define_combinator :expr do
lazy do
e = decimal
# hpe -- higher precedence level expression
# spe -- same precedence level expression
e = scope e do |hpe|
reduce hpe do |left_result|
choice(
mul_op(pure(left_result), hpe),
div_op(pure(left_result), hpe),
)
end
end
e = scope e do |hpe|
reduce hpe do |left_result|
choice(
add_op(pure(left_result), hpe),
sub_op(pure(left_result), hpe),
)
end
end
end
end
expr.parse "5 - 4 - 3"
#=> [[5, "-", 4], "-", 3]
reduce starts parsing with its argument, in this case hpe, then passes
the result to the block, which uses it for the resolved left operand.
reduce then parses with the parser returned by the block and passes the
result again to the block, and so on until parsing fails, returning the
result of the last successful parse.
In effect, we're parsing left operands bottom-up and right operands top-down.
Parsby.new
Normally one ought to be able to define parsers using just combinators, but
there are times when one might need more control. For those times, the most
raw way to define a parser is using Parsby.new.
Here's lit as an example:
define_combinator :lit, wrap: false do |e, case_sensitive: true|
Parsby.new do |c|
a = c.bio.read e.length
if case_sensitive ? a == e : a.to_s.downcase == e.downcase
a
else
raise ExpectationFailed.new c
end
end
end
It takes a string argument (e) for what it expects to parse, and returns
what was actually parsed (a) if it matches the expectation.
The block parameter c is a Parsby::Context. c.bio holds a
Parsby::BackedIO. The parse method of Parsby objects accepts ideally
any IO (and Strings, which it turns into StringIO) and then wraps
them with BackedIO to give the IO the ability to backtrack.
Parsing from a string, a file, a pipe, a socket, ...
Any IO ought to work (unit tests currently only cover pipes, though).
When you pass a string to Parsby#parse, it wraps it with StringIO
before using it.
Comparing with Haskell's Parsec
If you're already familiar with Parsec, here are some similarities:
# Parsby # -- Parsec
#
lit("foo") # string "foo"
#
foo | bar # foo <|> bar
#
pure "foo" # pure "foo"
#
foo.then {|x| bar x } # foo >>= \x -> bar x
#
foobar = Parsby.new do |c| # foobar = do
x = foo.parse c # x <- foo
bar(x).parse c # bar x
end #
#
lit("(") > foo < lit(")") # string "(" *> foo <* string ")"
#
lit("5").fmap {|n| n.to_i + 1 } # fmap (\n -> read n + 1) (string "5")
#
group(x, y, z) # (,,) <$> x <*> y <*> z
#
group( #
w, #
group(x, y), #
z, #
).fmap do |(wr, (xr, yr), zr)| #
Foo.new(wr, Bar.new(xr, yr), zr) # Foo <$> w <*> (Bar <$> x <*> y) <*> z
end #
#
# -- Means the same, but this
# -- raises an error in Haskell
# -- because it requires an
# -- infinite type [[[[...]]]]
recursive do |p| # fix $ \p ->
between(lit("("), lit(")"), # between (string "(") (string ")") $
single(p) | pure([]) # ((:[]) <$> p) <|> pure []
end #
end #
Development
After checking out the repo, run bin/setup
to install dependencies. Then,
run rake spec
to run the tests. You can also run bin/console
for an
interactive prompt that will allow you to experiment.
bin/console includes Parsby::Combinators into the top-level, so the
combinators and define_combinator are available directly from the prompt.
It also defines reload! to quickly load changes made to the source.
To install this gem onto your local machine, run bundle exec rake install.