Finite Automata Algebra
This gem provides bindings to libfa, a library for doing algebra on regular expressions. Using this library, it is easy to answer questions like "Are these two regular expressions matching the same set of strings ?" or "What is the regular expression that matches all strings that match one regular expression, but not a second one ?"
Usage
To perform computations on regular expressions, libfa
needs to first
convert your regular expression into a finite automaton. This is done by
compiling your regular expression:
require 'fa'
fa1 = Fa.compile("(a|b)") # can also be written as Fa["(a|b)"]
fa2 = fa1.plus
Notice that the regular expression needs to be given as a
string. Unfortunately, Ruby regular expressions allow constructs that go
beyond the mathematical notion of a regular expression and can therefore
not be used to do the kinds of computation that libfa
performs. The
regular expressions that libfa
deals in must be written using a (subset
of) the notation for
extended POSIX regular expressions
(see this section for syntax details). The
biggest difference between POSIX ERE and the syntax that libfa
understands is that libfa
does not allow backreferences, does not support
anchors like ^
and $
, and does not support named character classes like
[[:space:]]
. Luckily, this notation is also a subset of Ruby's regular
expression syntax, and the results of computations with libfa
can be used
to construct normal Ruby regular expressions.
Here are some examples of what you can do, see the API documentation for more details on what these operations do.
puts fa1
# "b|a"
puts fa1.minimize
# "[ab]"
puts fa1.union(fa2).minimize
# "[ab][ab]*"
puts fa1.concat(fa2).minimize
# "[ab][ab][ab]*"
puts fa2.intersect(fa1).minimize
# "b|a"
puts fa2.intersect(Fa["a*"]).minimize
# "aa*"
puts fa1.intersect(Fa["a*"]).minimize
# "a"
puts Fa["a+"].minus(Fa["a{2,}"])
# "a"
You can also compare finite automata, and therefore learn things about how they behave on all strings, for example if they match the same exact set of strings, or if one matches more strings than another:
fa = Fa["[a-z]"].intersect(Fa["a*"])
puts Fa["a"].equals(fa)
# true
fa = Fa["a"].union(Fa["b"]).star.concat(Fa["c"].plus)
puts Fa["(a|b)*c+"].equals(fa)
# true
puts Fa["[ab]*"].contains(Fa["a*"])
# true
puts Fa["a+"].minus(Fa["a*"]).empty?
# true
You could, for example, construct a regular expression that matches all strings made up of digits where adjacent digits are always different:
require 'fa'
numbers = Fa["[0-9]+"]
# Matches numbers that have twin digits somewhere
twins = Fa["[0-9]*(00|11|22|33|44|55|66|77|88|99)[0-9]*"]
# Numbers where adjacent digits are always different
no_twins = numbers.minus(twins)
# Ruby regex from no_twins. We need to mark all '(' as noncapturing groups,
# otherwise Ruby's matcher freaks out
rx = /\A(#{no_twins.to_s.gsub("(", "(?:")})\Z/
["789", "202", "911", "666"].each do |s|
if s =~ rx
puts "#{s} has no duplicated digits"
else
puts "#{s} has duplicated digits"
end
end
Syntax
The regular expressions that libfa
understands are a subset of the POSIX
extended regular expression syntax. If you are a regular expression
aficionado, you should note that libfa
does not support some popular
syntax extensions. Most importantly, it does not support backreferences,
anchors such as ^
and $
, and named character classes such as
[[:upper:]]
. The first two are not supported since they take the notation
out of the realm of finite automata and actual regular expressions. Named
character classes are not implemented because there's a lot of work to
support them, even though there is no objection from theory to them.
The character set that libfa
operates on is simple 8 bit ASCII. In other
words, to libfa
, a character is a byte, and it does not support larger
character sets such as UTF-8.
The following characters have special meaning for libfa
. The symbols R
,
S
, T
, etc. in the list below can be regular expressions themselves. The
list is ordered by increasing precendence of the operators.
-
R|S
: matches anything that matches eitherR
orS
-
R*
: matches any number ofR
, including none -
R+
: matches any number ofR
, but at least one -
R?
: matches no or one occurence ofR
-
R{n,m}
: matches at leastn
but no more thanm
occurences ofR
.n
must be nonnegative. Ifm
is missing or equals-1
, matches an unlimited number ofR
. -
(R)
: the parentheses are solely there for grouping, and this expression matches the same strings asR
alone -
[C]
: matches the characters in the character setC
; see below for the syntax of character sets -
.
: matches any single character except for newline -
\c
: the literal characterc
, even if it would otherwise have special meaning; the expression\(
matches an opening parenthesis.
Character classes [C]
use the following notation:
-
[^C]
: matches all characters not in[C]
.[^a-zA-Z]
matches everything that is not a letter. -
[a-z]
: matches all characters betweena
andz
, inclusive. Multiple ranges can be specified in the same character set.[a-zA-Z0-9]
is a perfectly valid character set. - if a character set should include
]
, it must be listed as the first character.[][]
matches the opening and closing bracket. - if a character set should include
-
, it must be listed as the last character.[-]
matches solely a dash. - no characters in character sets are special, and there is no backslash
escaping of characters in character classes.
[.]
matches a literal dot.
The regular expression syntax has no notation for control characters: when
libfa
sees \n
in a string you are compiling, it will match that against
the character n
, not a newline. That's not a problem as the strings you
write in Ruby code go through Ruby's backslash interpretation first. When
you write Fa.compile("[\n]")
, libfa
never sees the backslash as Ruby
replaces \n
with a newline character before that string ever makes it to
libfa
. That has the funny consequence that if you want to use a literal
backslash in your regular expression, your input string must have four
backslashes in it: when you write Fa.compile("\\\\")
, Ruby first turns
that into a string with two backslashes, which libfa
then interprets as a
single
backslash. [If you are reading this in YARD documentation and only saw two backslashes in the Fa.compile
, it's because YARD reduced them from the markdown source. Github does not, and so this example will always be wrong in one of them.]
Installation
Add this line to your application's Gemfile:
gem 'fa'
And then execute:
$ bundle
Or install it yourself as:
$ gem install fa
For things to work out, you will also have to have libfa
installed; the
library is distributed as part of augeas. On Red
Hat-derived distros like Fedora, CentOS, or RHEL, you will need to yum install augeas-libs
, on Debian-derived distros, run apt-get install libaugeas0
.
Development
After checking out the repo, run bin/setup
to install dependencies. Then,
run rake test
to run the tests. You can also run bin/console
for an
interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install
. To release a new version, update the version number in
version.rb
, and then run bundle exec rake release
, which will create a
git tag for the version, push git commits and tags, and push the .gem
file to rubygems.org.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/lutter/ruby-fa.