A Ruby library for constructing general parsers for any context-free language.
What is Rley?
Rley uses the Earley algorithm which is a general parsing algorithm that can handle any context-free grammar. Earley parsers can literally swallow anything that can be described by a context-free grammar. That's why Earley parsers find their place in so many NLP (Natural Language Processing) libraries/toolkits.
In addition, Rley goes beyond most Earley parser implementations by providing support for ambiguous parses. Indeed, it delivers the results of a parse as a Shared Packed Parse Forest (SPPF). A SPPF is a data structure that allows to encode efficiently all the possible parse trees that result from an ambiguous grammar.
As another distinctive mark, Rley is also the first Ruby implementation of a parsing library based on the new Grammar Flow Graph approach .
What it can do?
Maybe parsing algorithms and internal implementation details are of lesser interest to you and the good question to ask is "what Rley can really do?".
In a nutshell:
- Rley can parse context-free languages that other well-known libraries cannot handle
- Built-in support for ambiguous grammars that typically occur in NLP
In short, the foundations of Rley are strong enough to be useful in a large application range such as:
- computer languages -e.g. Simple Regex Language - ,
- artificial intelligence and
- Natural Language Processing.
Features
- Simple API for context-free grammar definition,
- Allows ambiguous grammars,
- Generates shared packed parse forests,
- Accepts left-recursive rules/productions,
- Provides syntax error detection and reporting.
Compatibility
Rley supports the following Ruby implementations:
- MRI 2.5
- MRI 2.6
- MRI 2.7
- JRuby 9.1+
Getting Started
Installation
Installing the latest stable version is simple:
$ gem install rley
A whirlwind tour of Rley
The purpose of this section is show how to create a parser for a minimalistic English language subset. The tour is organized as follows:
- Creating facade object of Rley library
- Defining the language grammar
- Creating a lexicon
- Creating a tokenizer
- Parsing some input
- Generating the parse tree
The complete source code of the example used in this tour can be found in the examples directory
Creating facade object of Rley library
require 'rley' # Load Rley library
# Let's create a facade object called 'engine'
# It provides a unified, higher-level interface
engine = Rley::Engine.new
Defining the language grammar
The subset of English grammar is based on an example from the NLTK book.
engine.build_grammar do
# Terminal symbols (= word categories in lexicon)
add_terminals('Noun', 'Proper-Noun', 'Verb')
add_terminals('Determiner', 'Preposition')
# Here we define the productions (= grammar rules)
rule 'S' => 'NP VP'
rule 'NP' => 'Proper-Noun'
rule 'NP' => 'Determiner Noun'
rule 'NP' => 'Determiner Noun PP'
rule 'VP' => 'Verb NP'
rule 'VP' => 'Verb NP PP'
rule 'PP' => 'Preposition NP'
end
Creating a lexicon
# To simplify things, lexicon is implemented as a Hash with pairs of the form:
# word => terminal symbol name
Lexicon = {
'man' => 'Noun',
'dog' => 'Noun',
'cat' => 'Noun',
'telescope' => 'Noun',
'park' => 'Noun',
'saw' => 'Verb',
'ate' => 'Verb',
'walked' => 'Verb',
'John' => 'Proper-Noun',
'Mary' => 'Proper-Noun',
'Bob' => 'Proper-Noun',
'a' => 'Determiner',
'an' => 'Determiner',
'the' => 'Determiner',
'my' => 'Determiner',
'in' => 'Preposition',
'on' => 'Preposition',
'by' => 'Preposition',
'with' => 'Preposition'
}.freeze
Creating a tokenizer
require 'strscan'
# A tokenizer reads the input string and converts it into a sequence of tokens.
# Remark: Rley doesn't provide tokenizer functionality.
# Highly simplified tokenizer implementation
def tokenizer(aTextToParse)
scanner = StringScanner.new(aTextToParse)
tokens = []
loop do
scanner.skip(/\s+/)
curr_pos = scanner.pos
word = scanner.scan(/\S+/)
break unless word
term_name = Lexicon[word]
raise StandardError, "Word '#{word}' not found in lexicon" if term_name.nil?
pos = Rley::Lexical::Position.new(1, curr_pos + 1)
tokens << Rley::Lexical::Token.new(word, term_name, pos)
end
return tokens
end
More ambitious NLP applications will surely rely on a Part-of-Speech tagger instead of creating a lexicon and tokenizer from scratch. Here are a few Ruby Part-of-Speech gems:
Parsing some input
input_to_parse = 'John saw Mary with a telescope'
# Convert input text into a sequence of token objects...
tokens = tokenizer(input_to_parse)
result = engine.parse(tokens)
puts "Parsing successful? #{result.success?}" # => Parsing successful? true
At this stage, we're done with parsing. What we need next are convenient means
to exploit the parse result. As it is, the result
variable in the last code snippet
above is a data structure ("Earley item sets") that is highly depending on the intricate details
of the Earley's parsing algorithm. Obviously, it contains all the necessary data to exploit
the parsing results but it is rather low-level and inconvenient from a programming viewpoint.
Therefore, Rley provides out of the box two convenient data structures for
representing the parse outcome:
- Parse tree (optimal when the parse is unambiguous)
- Parse forest (a more sophisticated data structure that copes with ambiguity)
For our whirlwind tour, we will opt for parse trees.
Generating the parse tree
ptree = engine.convert(result)
OK. Now that we have the parse tree, what we can do with it?
One option is to manipulate the parse tree and its node directly. For instance,
one could write code to customize and transform the parse tree. This approach gives
most the of flexibility needed for advanced applications. The other, more common
option is to use an Rley::ParseTreeVisitor
instance.
Such a visitor walks over the parse tree nodes and generates visit events that
are dispatched to subscribed event listeners. All this may, at first, sound
complicated but the coming code snippets show it otherwise.
Let's do it by:
- Creating a parse tree visitor
- Using one of the built-in visit subscribers specifically created to render the parse tree in a given output format.
Creating a parse tree visitor
Good news: creating a parse tree visitor for the parse tree ptree
is just
an one-liner:
# Let's create a parse tree visitor
visitor = engine.ptree_visitor(ptree)
Visiting the parse tree
Unsurprisingly, to start the parse tree visit, one calls the #start
method:
visitor.start
If you try the above line, no particular result will be visible and for a good reason:
no object was specified as a visit event subscriber. As a convenience, Rley
bundles a number of formatter classes
that were designed to listen to the visit event and then render the parse tree
in a specific format. To begin with, we'll use the simple formatter
Rley::Formatter::Debug
class. Its purpose is just to print out the visit event
name.
Remove the line with the call to the #start
method and replace it with the two
statements:
# Let's create a formatter (i.e. visit event listener)
renderer = Rley::Formatter::Debug.new($stdout)
# Subscribe the formatter to the visitor's event and launch the visit
renderer.render(visitor)
These two lines will generate the following output:
before_ptree
before_non_terminal
before_subnodes
before_non_terminal
before_subnodes
before_terminal
after_terminal
after_subnodes
after_non_terminal
before_non_terminal
before_subnodes
before_terminal
after_terminal
before_non_terminal
before_subnodes
before_terminal
after_terminal
after_subnodes
after_non_terminal
before_non_terminal
before_subnodes
before_terminal
after_terminal
before_non_terminal
before_subnodes
before_terminal
after_terminal
before_terminal
after_terminal
after_subnodes
after_non_terminal
after_subnodes
after_non_terminal
after_subnodes
after_non_terminal
after_subnodes
after_non_terminal
after_ptree
At least is something visible: these are the parse tree visit events. Note that the indentation of event names depends on the nesting level of the tree node being visited.
Not really impressive? So let's use another formatter...
Visualizing the parse tree structure
If one replaces the previous formatter by an instance of
Rley::Formatter::Asciitree
the output now shows the parse tree structure.
# Let's create a formatter that will render the parse tree with characters
renderer = Rley::Formatter::Asciitree.new($stdout)
# Subscribe the formatter to the visitor's event and launch the visit
renderer.render(visitor)
The outputs looks like this:
S
+-- NP
| +-- Proper-Noun: 'John'
+-- VP
+-- Verb: 'saw'
+-- NP
| +-- Proper-Noun: 'Mary'
+-- PP
+-- Preposition: 'with'
+-- NP
+-- Determiner: 'a'
+-- Noun: 'telescope'
If you are more inclined for graphical representation, then replace the last formatter by yet another one:
# Let's create a formatter that will render the parse tree in labelled bracket notation
renderer = Rley::Formatter::BracketNotation.new($stdout)
# Subscribe the formatter to the visitor's event and launch the visit
renderer.render(visitor)
This results in the strange-looking output:
[S [NP [Proper-Noun John]][VP [Verb saw][NP [Proper-Noun Mary]][PP [Preposition with][NP [Determiner a][Noun telescope]]]]]
This output is in a format that is recognized by many NLP softwares. The next diagram was created by copy-pasting the output above in the online tool RSyntaxTree. By the way, this tool is also a Ruby gem, rsyntaxtree.
Error reporting
Rley is a non-violent parser, that is, it won't throw an exception when it
detects a syntax error. Instead, the parse result will be marked as
non-successful. The parse error can then be identified by calling the
GFGParsing#failure_reason
method. This method returns an error reason object
which can help to produce an error message.
Consider the example from the Parsing some input section
above and, as an error, we delete the verb saw
in the sentence to parse.
# Verb has been removed from the sentence on next line
input_to_parse = 'John Mary with a telescope'
# Convert input text into a sequence of token objects...
tokens = tokenizer(input_to_parse)
result = engine.parse(tokens)
puts "Parsing successful? #{result.success?}" # => Parsing successful? false
exit(1)
As expected, the parse is now failing.
To get an error message, one just need to retrieve the error reason and
ask it to generate a message.
# Show error message if parse fails...
puts result.failure_reason.message unless result.success?
Re-running the example with the error, results in the error message:
Syntax error at or near token line 1, column 6 >>>Mary<<<
Expected one 'Verb', found a 'Proper-Noun' instead.
The standard Rley message not only inform about the location of the mistake, it also provides some hint by disclosing its expectations.
Let's experiment again with the original sentence but without the word
telescope
.
# Last word has been removed from the sentence on next line
input_to_parse = 'John saw Mary with a '
# Convert input text into a sequence of token objects...
tokens = tokenizer(input_to_parse)
result = engine.parse(tokens)
puts "Parsing successful? #{result.success?}" # => Parsing successful? false
unless result.success?
puts result.failure_reason.message
exit(1)
end
This time, the following output is displayed:
Parsing successful? false
Premature end of input after 'a' at position line 1, column 20
Expected one 'Noun'.
Again, the resulting error message is user-friendly.
Examples
The project source directory contains several example scripts that demonstrate how grammars are to be constructed and used.
Other similar Ruby projects
Rley isn't the sole implementation of the Earley parser algorithm in Ruby.
Here are a few other ones:
-
Kanocc gem -- Advertised as a Ruby based parsing and translation framework.
Although the gem dates from 2009, the author still maintains its in a public repository in Github
The grammar symbols (tokens and non-terminals) must be represented as (sub)classes. Grammar rules are methods of the non-terminal classes. A rule can have a block code argument that specifies the semantic action when that rule is applied. -
lc1 project -- Advertised as a combination of Earley and Viterbi algorithms for [Probabilistic] Context Free Grammars
Aimed in parsing brazilian portuguese.
earley project -- An Earley parser (grammar rules are specified in JSON format).
The code doesn't seem to be maintained: latest commit dates from Nov. 2011. -
linguist project -- Advertised as a library for parsing context-free languages.
It is a recognizer not a parser. In other words it can only tell whether a given input conforms to the grammar rules or not. As such it cannot build parse trees.
The code doesn't seem to be maintained: latest commit dates from Oct. 2011.
Other interesting Ruby resources
The extensive resource list not to miss: Awesome NLP with Ruby actively curated by Andrei Beliankou (aka arbox).
Thanks to:
- Professor Keshav Pingali, one of the creators of the Grammar Flow Graph parsing approach for his encouraging e-mail exchange.
-
Arjun Menon for his NLP example that uses
engtagger
gem. -
Gui Heurich for spotting a mistake in the code sample in
README
file.
Grammar Flow Graph
Since the Grammar Flow Graph parsing approach is quite new, it has not yet taken a place in standard parser textbooks. Here are a few references (and links) of papers on GFG:
- K. Pingali, G. Bilardi. Parsing with Pictures
- K. Pingali, G. Bilardi. A Graphical Model for Context-Free Grammar Parsing. In : International Conference on Compiler Construction. Springer Berlin Heidelberg, 2015. p. 3-27.
- M. Fulbright. An Evaluation of Two Approaches to Parsing
Copyright
Copyright (c) 2014-2022, Dimitri Geshef.
Rley is released under the MIT License see LICENSE.txt for details.