Project

fuzz_ball

0.01
No commit activity in last 3 years
No release in over 3 years
FuzzBall is a gem that finds fuzzy matches of a string (the 'needle') within an array of strings (the 'haystack'). It does so via a two-step process: first, it finds candidate strings from the haystack that have high similarity to the needle, then uses a Smith-Waterman algorithm to fuzzily match from these candidates. Strings are returned along with a matching score. Both steps of the search are written in C for greater performance.
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
2025
 Dependencies
 Project Readme

fuzz_ball

FuzzBall is a gem that finds fuzzy matches of a string (the 'needle') within an array of strings (the 'haystack'). It does so via a two-step process: first, it finds candidate strings from the haystack that have high similarity to the needle, then uses a Smith-Waterman algorithm to fuzzily match from these candidates. Strings are returned along with a matching score. Both steps of the search are written in C for greater performance.

Usage

require 'fuzz_ball'
ruby-1.9.2-p290 :001 > words = %Q[Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.].split(/\s+/)

ruby-1.9.2-p290 :002 > fuzz = FuzzBall::Searcher.new( words )
=> <FuzzBall::Searcher n_files=19>

ruby-1.9.2-p290 :003 > fuzz.search("lor")
=> [{:alignment=>[2, 4, 1, 3, 0, 2], :score=>0.6666666666666666, :string=>"dolore"}, {:alignment=>[2, 4, 1, 3, 0, 2], :score=>0.8, :string=>"dolor"}]

ruby-1.9.2-p290 :004 > fuzz.add("a_new_string")
=> true

ruby-1.9.2-p290 :005 > fuzz.search("new_string")
=> [{:alignment=>[9, 11, 8, 10, 7, 9, 6, 8, 5, 7, 4, 6, 3, 5, 2, 4, 1, 3, 0, 2], :score=>1.5, :string=>"a_new_string"}]

Algorithms

FuzzBall uses a two-step process. When a search is executed, the number of haystack strings that must be searched is reduced by first eliminating haystack strings that are not similar to the needle string. Similarity is determined by a high number of shared duples between the needle string and a haystack string (in the appropriate order). This metric is computed by pre-indexing the duples in the haystack strings.

These candidate strings are then fuzzily matched using the Smith-Waterman algorithm for performaing local sequence alignment.

Duple Indexing

When a new instance of FuzzBall::Searcher is instantiated, each haystack string is decomposed into duples. The position of each duple is indexed internally by the C implementation and stored in a lightweight hash data structure, allowing fast (O(1)) lookups of the haystack strings that contain a given duple. The C implementation uses uthash. See DupleIndex.c in ext/duple_index.

Smith-Waterman

The Smith-Waterman algorithm is a dynamic programming algorithm that performs local sequence alighment by comparing segments of all possible lengths. It is commonly used to align protein or nucleotide sequences and is heavily used within the computational biology communinity. However it works just as well on normal strings.

Dependencies

The C extension makes use of uthash, a C hash implementation written by Troy D. Hanson under a modified BSD license.

Author

FuzzBall was written by Vincent Chu (vincentchu [at] gmail.com).