Antelope
Antelope is a parser generator that can generate parsers for any
language*. In the sense of actually creating a parser, it works
kind of like [Bison][bison] - you give it an input file, say,
language.ace
, and it generates a parser for it, in, say,
language.rb
. Only, instead of Bison's support only for C, C++,
and Java, Antelope is meant to generate parsers for multiple
languages. Antelope is also written in Ruby for understandability.
Enough about that, though, let's get into Antelope.
Installation
Since you'll only typically use Antelope from the command line, I suggest you install it like so:
$ gem install antelope
If, however, you plan on using it in an application, or need it as a
part of a library, you can add gem "antelope"
to your Gemfile.
Usage
Antelope is fairly simple to use; you define an .ace
file, compile
it, and use the proper API for the generated language.
How Antelope Works
Before getting into Ace files, however, you have to understand how Antelope works. Antelope generates a LALR(1) parser, like Bison; for your benefit, however, there are some terms here to understand:
-
LL: A type of parser. If followed by parenthesis, the number (or letter) in the parenthesis denotes the number of tokens of lookahead; i.e., LL(0) has 0 tokens of lookahead. Standing for L eft to Right, L eftmost Derivation, these tend to be handwritten as Recursive Decent parsers. LL parsers can be represented normally by a set of productions. LL parsers are Top-Down parsers, meaning they start with only the Starting symbol and try to match the input. A look at an LL parser is given [here][ll-parser]; check out [this post][tumblr-ll-parser] for more information about LL parsers.
Antelope does not generate LL parsers. -
Production: In parsing, a production associates a nonterminal with a string of nonterminals and terminals. It is said that the string of nonterminals and terminals reduces to the nonterminal.
Productions take the form ofA -> y
, with A being the left hand side (and the nonterminal), and y being the string. -
Starting symbol: In parsing, it is the nonterminal that is used to represent any kind of valid input.
-
Symbol: A nonterminal or a terminal.
-
Nonterminal: In parsing, a nonterminal is an abstraction used to represent any number of strings of nonterminals and terminals.
-
String: In parsing, an ordered set of symbols.
-
Terminal: In parsing, it is a concrete value given by the lexer.
-
LR: A family of types of parsers. If followed by parenthesis, the number (or letter) in the parenthesis denotes the number of tokens of lookahead; i.e., LR(0) has 0 tokens of lookahead. Standing for L eft to Right, R ightmost Derivation, they tend to be more complicated than their LL brethren. LR parsers work by starting with the entire input and finding Handles from that input, eventually ending up at the Starting symbol. LR parsers typically do this by splitting the input into two parts: the stack, and the input. LR(0), LR(1), SLR(1), and LALR(1) are all examples of LR parsers.
-
Handle: In a LR parser, it is the Leftmost complete cluster of leaf nodes (in a representative AST). When a handle is found, a reduction is performed.
-
Stack: Initially empty, it can contain any symbol, and is primarily used to represent what the parser has seen. Finding handles will purely occur at the top of the stack.
-
Reduction/Reduce: In a LR parser, this is an action corresponding to replacing the right side of a production with its left side. This purely occurs at the top of the stack, and correlates to finding a Handle.
-
Input: Initially containing the full input, it can contain only terminals; it primarily contains what the parser has yet to see.
-
LR(0): In parsing, it is a type of LR parser that uses no lookahead.
It essentially uses a deterministic finite automaton to find possible handles. It does no checking to make sure that the possible handles are legitimate. -
SLR(1): A part of the LR family of parsers, it upgrades LR(0) by checking to make sure that the reduction that it will make (as a part of finding a handle) is valid in the context; basically, for every reduction that it can make, it defines a set of terminals that can FOLLOW the corresponding nonterminal.
-
FOLLOW(A) set: In parsing, it defines a set of terminals that can follow the nonterminal A anywhere in the grammar.
-
LALR(1): A part of the LR family of parsers, it upgrades SLR by using a more precise FOLLOW set, called LA.
-
LA set: LA(q, A -> y) = { t | S =>* aAtw and ay reaches q }
-
Panic mode: In parsing, this is the mode that a parser can go in for recovery, if it encounters a terminal that it was not expecting. In panic mode, the parser pops terminals off of the input until it reaches a valid synchronization token. In order to utilize panic mode, at least one production must have the special error terminal in it. If the parser encounters an error, it will attempt to find a production it can use to resynchronize; if it cannot resynchronize, it will error. It then attempts to resynchronize by continuously pop terminals off of the input and discarding them, attempting to find a synchronization token. A synchronization token is a token that follows an error terminal.
-
Shift/reduce conflict: This occurs when the parser is unable to decide if it should shift the next token over from the input to the stack, or to reduce the top token on the stack. If a shift/reduce conflict cannot be solved by changing the grammar, then precedence rules may be used (see
examples/example.ace
). -
Reduce/reduce conflict: This occurs when the parser is unable to decide which production to reduce. This cannot be solved by precedence.
-
Precedence: In some grammars, the Antelope runs into Shift/reduce conflicts when attempting to construct a parser. To resolve these conflicts, Antelope provides precedence declarations. Precedence is separated into levels, which each have a type; levels can be left-associative, right-associative, or non-associative. The higher the level, the higher the precedence. Think of the Order of Operations here; the operations multiply and divide are left associative, and on a higher level than add and subtract, which are still left-associative:
MULTIPLY, DIVIDE (left-associative) ADD, SUBTRACT (left-associative)
Exponentiation, however, is right-associative, and is higher than MULTIPLY or DIVIDE; basically,
2**2**2
would be parsed as2**(2**2)
, instead of the left-associative(2**2)**2
. For an example of a grammar that uses precedence, seeexamples/example.ace
.
Defining the Ace file
The Ace file format is very similar to Bison's y files; this was intentional, to make transitions between the two easy. The Ace file should be formatted like so:
<directives>
%%
<rules>
%%
<code>
Both %%
(internally called content boundaries) are required; the
minimum file that is technically accepted by Antelope is therefore
two content boundaries separated by a newline.
In the <directives>
section, there can be any number and
combinations of code blocks and directives. Code blocks are
blocks of code delimited by %{
and %}
, with the ending delimiter
on its own line. These are copied into the output of the file
directly. Directives tell Antelope information about the grammar.
An example directive would be the token
or terminal
directive;
this lets Antelope know that a terminal by the given name exists.
Directives take the form %<name> [<value>]*
, with <name>
being the
directive name, and <value>
being a string delimited by braces,
angle brackets, quotes, or nothing at all. An example of a directive
would be %token ADD "+"
. The available directives are determined by
the code generators available to Antelope at the time that the Ace
file is being compiled. Some directives, however, are always
available:
-
require
(1 argument): This makes Antelope check its version against the first argument to this. If the versions do not match, Antelope will raise an error and fail to parse the file. It is recommended to at least require the minor version of Antelope (i.e.%require "~> 0.1"
). -
token
,terminal
(1-2 arguments): Defines a terminal. The first argument defines its name; the second argument defines its value.
Its value isn't used anywhere but the.output
file, to make it easier to read. -
left
,right
,nonassoc
(1+ arguments): Defines a precedence level, and sets the type of the level based on the directive name used. -
type
: The code generator to use. Currently, the possible values for this can benull
,ruby
, andoutput
. -
define
(1+ arguments): Sets a key to a value. This would do the exact same thing that using the key as a directive would do, i.e.%define something "value"
does the same thing as%something "value"
. (note: This is not entirely true. If the key were one of the above, it would most likely raise an error, complaining that there is no directive named that.) -
panic-mode
(0-1 arguments): Enables/disables panic mode being put in the output code. Not included by default, but should be.
In the <rules>
section, there can be any number of rules (which are
definitions for productions). Rules have this syntax:
<head>: <body> ["|" <body>]* [";"]
With <head>
being the nonterminal that the production(s) reduce to,
and <body>
being one or more symbols followed by an optional block
that is executed when is a reduction is made using that production. A
semicolon terminating the rule is optional. Rules are what make up
the grammar. error
, nothing
, and ε
are all special symbols; the
first one defines the special error
terminal (used for panic mode,
ignored otherwise), whereas the second two are used to literally
mean nothing (i.e., the rule reduces to nothing). It is not always
a good idea to use the nothing
symbol, since most rules can be
written without it.
In the <code>
section, custom code used to wrap the generated parser
can be placed. In order to embed the generated parser, you must place
%{write}
where you want the generated parser.
Compiling the Ace file
Compiling the Ace file is somewhat straightforward;
antelope compile /path/to/file.ace
will cover most use cases. If
you want to override the type in the Ace file, you can use the
--type=
command option. If it is giving an error, and you're not
sure what's causing it, you can use the --verbose
command option to
see a backtrace.
By default, Antelope always includes the Output
generator as a
part of the output. This means that an .output
file will always be
generated along with any other files. The .output
file contains
information about the parser, like the productions that were used,
precedence levels, states, and lookahead sets.
Language API
todo.
Contributing
- Fork it (https://github.com/medcat/antelope/fork)
- Create your feature branch (
git checkout -b my-new-feature
) - Commit your changes (
git commit -am 'Add some feature'
) - Push to the branch (
git push origin my-new-feature
) - Create a new Pull Request
* Only if there's a generator for it. [bison]: http://www.gnu.org/software/bison/ [ll-parser]: http://i.imgur.com/XhJKrDW.png [tumblr-ll-parser]: http://redjazz96.tumblr.com/post/88336053195/what-antelope-does-and-what-i-hope-it-will-do-part