regexgen

Generate regular expressions that match a set of strings.

This is a Ruby port of @devongovett's JavaScript regexgen package.

Installation

Add this line to your application's Gemfile:

gem 'regexgen'

And then execute:

$ bundle install

Or install it yourself as:

$ gem install regexgen

Usage

require 'regexgen'

Regexgen.generate(['foobar', 'foobaz', 'foozap', 'fooza']) #=> /foo(?:zap?|ba[rz])/
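The regex shown above has no anchors, so it also matches inside longer strings. The sketch below assumes the generated pattern looks like the one in the example (the literal is copied from it rather than produced by calling the gem) and shows one way to wrap it for whole-string matching. The variable names are illustrative only.

```ruby
# Literal copied from the usage example; in practice this would be the
# return value of Regexgen.generate.
pattern = /foo(?:zap?|ba[rz])/
inputs  = %w[foobar foobaz foozap fooza]

# Every input string matches the pattern.
all_match = inputs.all? { |s| pattern.match?(s) }

# Without anchors, the pattern also matches as a substring.
substring_match = pattern.match?("xfoobarx")

# Wrap the source in \A...\z to require a whole-string match.
anchored = /\A(?:#{pattern.source})\z/
anchored_rejects = !anchored.match?("xfoobarx")
anchored_accepts = inputs.all? { |s| anchored.match?(s) }
```

The non-capturing group around `pattern.source` keeps the anchors applied to the whole alternation rather than to its first or last branch.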

CLI

regexgen also has a simple CLI to generate regexes using inputs from the command line.

$ regexgen
usage: regexgen [-mix] strings...
    -m                               Multiline flag
    -i                               Case-insensitive flag
    -x                               Extended flag

Unicode handling

Unlike the JavaScript version, this package does no special Unicode handling, because Ruby's strings and regular expressions are already encoding-aware. Using a Unicode encoding such as UTF-8 for your strings is recommended.
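As a small illustration of why the encoding matters, the following sketch uses only core Ruby (not the gem) to show that a UTF-8 string is matched character by character, while the same bytes tagged as binary are counted byte by byte. It assumes the source file is UTF-8, which is Ruby's default.

```ruby
word = "日本語"                       # UTF-8 string literal

char_count = word.length            # counts characters
byte_count = word.b.length          # same bytes, binary encoding: counts bytes

# `.` consumes one whole character in a UTF-8 string.
three_chars = word.match?(/\A.{3}\z/)

# Character class ranges work on code points, not bytes.
hiragana_only = /\A[ぁ-ん]+\z/.match?("ひらがな")
```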

How does it work?

Just like the JavaScript version:

  1. Generate a Trie containing all of the input strings. This is a tree structure where each edge represents a single character. This removes redundancies at the start of the strings, but common branches further down are not merged.

  2. A trie can be seen as a tree-shaped deterministic finite automaton (DFA), so DFA algorithms can be applied. In this case, we apply Hopcroft's DFA minimization algorithm to merge the indistinguishable states.

  3. Convert the resulting minimized DFA to a regular expression. This is done using Brzozowski's algebraic method, which expresses the DFA as a system of equations that can be solved for a resulting regex. Along the way, some additional optimizations are made, such as hoisting common substrings out of an alternation and using character class ranges. This produces an Abstract Syntax Tree (AST) for the regex, which is then converted to a string and compiled to a Ruby Regexp object.
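The first two steps can be sketched in a few lines of Ruby. This is not the gem's implementation: the names `Node`, `build_trie`, `merge_equivalent`, and `count_states` are hypothetical, and instead of Hopcroft's algorithm it uses a simpler bottom-up merge that works only because a trie is acyclic. It is meant to show what "merging indistinguishable states" does to the trie, nothing more.

```ruby
# A trie node: `children` maps a character to the next node,
# `terminal` marks the end of an input string.
Node = Struct.new(:children, :terminal)

# Step 1: build a trie with one edge per character.
def build_trie(words)
  root = Node.new({}, false)
  words.each do |word|
    node = root
    word.each_char do |ch|
      node = (node.children[ch] ||= Node.new({}, false))
    end
    node.terminal = true
  end
  root
end

# Step 2 (simplified stand-in for Hopcroft's algorithm): merge states
# bottom-up. Two nodes are equivalent when they agree on `terminal` and
# their edges lead to the same already-merged nodes.
def merge_equivalent(node, registry = {})
  node.children.keys.each do |ch|
    node.children[ch] = merge_equivalent(node.children[ch], registry)
  end
  edges = node.children.sort_by { |ch, _| ch }
                       .map { |ch, child| [ch, child.object_id] }
  registry[[node.terminal, edges]] ||= node
end

# Count distinct states reachable from `node`.
def count_states(node, seen = {})
  return 0 if seen[node.object_id]
  seen[node.object_id] = true
  1 + node.children.values.sum { |child| count_states(child, seen) }
end

trie = build_trie(%w[foobar foobaz foozap fooza])
states_before = count_states(trie)   # 11 nodes in the plain trie
merged = merge_equivalent(trie)
states_after = count_states(merged)  # 9 states once shared suffixes merge
```

For a smaller input such as "ab" and "cb", the same merge leaves three states. Writing one equation per state, Q2 = ε, Q1 = b·Q2, and Q0 = a·Q1 | c·Q1, and substituting upward gives Q0 = (a|c)b, which can be written with a character class as [ac]b. That substitution is the core idea of step 3.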

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake test to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/amake/regexgen-ruby.

License

The gem is available as open source under the terms of the MIT License.