FastFuzzy

This gem only supports JRuby.

FastFuzzy performs fast, fuzzy text pattern matching. It tokenizes the text with a configurable chain of Lucene analyzers, matches the extracted tokens against the searched text using ngram matching, and computes a resulting match score.

The original intent of this code was to perform on-the-fly matching of specific categories of expressions or sentences in social media. Using an analyzer chain to tokenize, remove stop words, etc., restricts the matching to the relevant text tokens. Ngram scoring provides the fuzziness: a confidence score that finds approximately matching text despite typos, different spellings, or other variations. With experimentation, an "acceptable" score can be chosen to decide whether or not the searched text matches what we are looking for.
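To make the idea of ngram scoring concrete, here is a minimal pure-Ruby sketch. It is not FastFuzzy's implementation: the trigram size, the Dice coefficient, and the helper names `ngrams` and `dice_score` are assumptions chosen for illustration, and the scores it produces will not match the gem's.

```ruby
# Illustrative only: character n-grams compared with a Dice coefficient.
# FastFuzzy's own scoring runs on Lucene tokens and may use a different formula.

# Split a string into overlapping character n-grams (trigrams by default).
def ngrams(text, n = 3)
  chars = text.downcase.gsub(/\s+/, " ").strip
  return [] if chars.empty?
  return [chars] if chars.length <= n
  (0..chars.length - n).map { |i| chars[i, n] }
end

# Score two strings between 0.0 (no shared n-grams) and 1.0 (identical n-gram sets).
def dice_score(a, b, n = 3)
  grams_a = ngrams(a, n).uniq
  grams_b = ngrams(b, n).uniq
  return 0.0 if grams_a.empty? || grams_b.empty?
  shared = (grams_a & grams_b).size
  2.0 * shared / (grams_a.size + grams_b.size)
end

puts dice_score("recommend", "recommend")           # identical text scores 1.0
puts dice_score("recommend", "recomment").round(3)  # 6 of 7 trigrams shared => 0.857
puts dice_score("recommend", "tonight")             # no shared trigrams scores 0.0
```

A single typo only disturbs the few ngrams that overlap it, which is why the score degrades gradually instead of dropping to zero.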

Note that this gem also includes a custom Lucene Twitter tokenizer; see the usage examples below.

Installation

This gem only supports JRuby.

Add this line to your application's Gemfile:

gem 'fast_fuzzy'

And then execute:

$ bundle

Or install it yourself as:

$ gem install fast_fuzzy

Usage

Percolator

First, configure the Percolator with any number of text strings that represent what we are looking for (the queries):

p = FastFuzzy::Percolator.new

p << "looking for a restaurant"
p << "recommend a restaurant"

Run the Percolator against some text. The result is the list of matching "queries", each given as a pair of query index and score, sorted in ascending score order.

p.percolate("hey! anyone can recomment a good restaurant in montreal tonight?")
=> [[0, 0.5294117647058824], [1, 0.8421052631578947]]

In this example, the last query, "recommend a restaurant", matched with a score of 0.842 (~84%).

p.percolate("I am looking for a good suchi restaurant")
=> [[1, 0.47368421052631576], [0, 0.8823529411764706]]

In this example, the first query, "looking for a restaurant", matched with a score of 0.882 (~88%).
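Deciding whether a text "matches" then comes down to comparing each score with a chosen threshold. The following sketch post-processes a percolate result in plain Ruby, so it runs without the gem. The `ACCEPTABLE_SCORE` value and the `accepted_matches` helper are hypothetical, and it assumes each result pair is `[query_index, score]` with indexes following the order in which the queries were added, as the examples above suggest.

```ruby
# Queries in the order they were added to the Percolator above.
QUERIES = ["looking for a restaurant", "recommend a restaurant"].freeze

# Hypothetical cut-off; tune it by experimenting on your own data.
ACCEPTABLE_SCORE = 0.75

# Keep only results at or above the threshold, best match first,
# and translate each query index back to its query text.
def accepted_matches(results, queries, threshold = ACCEPTABLE_SCORE)
  results
    .select { |_index, score| score >= threshold }
    .sort_by { |_index, score| -score }
    .map { |index, score| [queries[index], score] }
end

# Result copied from the second percolate call above.
results = [[1, 0.47368421052631576], [0, 0.8823529411764706]]

accepted_matches(results, QUERIES).each do |query, score|
  puts format("%s (%.3f)", query, score)
end
```

With this threshold only "looking for a restaurant" is kept; lowering the threshold trades precision for recall.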

Custom Analyzer Chain

A custom Twitter tokenizer is included and can be used by defining an analyzer chain for the Percolator:

p = FastFuzzy::Percolator.new(:analyzer_chain => [
  [Lucene::TwitterTokenizer],
  [Lucene::LowerCaseFilter],
  [Lucene::StopFilter, Lucene::StandardAnalyzer::STOP_WORDS_SET],
])

p << "looking for a restaurant"
p << "recommend a restaurant"

p.percolate("RT yo lookin for a good restaurant #montreal #foodie")
=> [[1, 0.5263157894736842], [0, 0.7647058823529411]]

Custom Twitter Tokenizer

The included custom Twitter tokenizer was created to classify more Twitter-specific tokens and help match only on actual text words. YMMV.
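As a rough illustration of what classifying Twitter-specific tokens can mean, here is a small pure-Ruby sketch. It is not the gem's tokenizer (which is built with JFlex and plugs into Lucene); the token classes, the regular expressions, and the helper names `classify` and `text_words` are all assumptions made for this example.

```ruby
# Hypothetical token classes; the gem's actual grammar may differ.
TOKEN_PATTERNS = {
  url:     %r{\Ahttps?://\S+\z},
  mention: /\A@\w+\z/,
  hashtag: /\A#\w+\z/,
  word:    /\A[[:alpha:]][[:alnum:]']*\z/
}.freeze

# Return the first token class whose pattern matches, or :other.
def classify(token)
  TOKEN_PATTERNS.each { |type, pattern| return type if token.match?(pattern) }
  :other
end

# Split on whitespace, strip trailing punctuation, and keep only plain words
# so that hashtags, mentions and URLs do not take part in the matching.
def text_words(text)
  text.split
      .map { |token| token.sub(/[.,!?;:]+\z/, "") }
      .reject(&:empty?)
      .select { |token| classify(token) == :word }
end

p text_words("RT yo lookin for a good restaurant #montreal #foodie")
# => ["RT", "yo", "lookin", "for", "a", "good", "restaurant"]
```

In this sketch the hashtags are dropped before any matching takes place, so only the ordinary words would contribute to a score.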

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Building

Install gradle-jflex-plugin

The build uses the gradle-jflex-plugin. The original version does not support jflex 1.6.1, so you can use my forked version at https://github.com/colinsurprenant/gradle-jflex-plugin or the original if/when thomaslee/gradle-jflex-plugin#2 is merged.

$ git clone https://github.com/colinsurprenant/gradle-jflex-plugin
#  or
$ git clone https://github.com/thomaslee/gradle-jflex-plugin

$ cd gradle-jflex-plugin
$ gradle build
$ gradle install
# or use gradlew if you do not have Gradle installed
$ ./gradlew build
$ ./gradlew install

Build Java sources

$ gradle build
# or use gradlew if you do not have Gradle installed
$ ./gradlew build

Tests / Specs

$ bundle install

$ bundle exec rspec

Author

Colin Surprenant on GitHub and Twitter.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/colinsurprenant/fast_fuzzy.

License and Copyright

FastFuzzy is released under the Apache License, Version 2.0.
