Blurrily — Millisecond fuzzy string matching
Show me photos of Marakech !
Here are some photos of Marrakesh, Morroco. Did you mean Martanesh, Albania, Marakkanam, India, or Marasheshty, Romania?
Blurrily finds misspelled, prefix, or partial needles in a haystack of strings, quickly. It scales well: its response time is typically 1-2ms on user-input datasets and 75-100ms on pathological datasets (more).
Blurrily is compatible and tested with all MRI Rubies from 1.9.3 to 2.2.0. It is tested on Linux 2.6 (32bit and 64bit) and MacOS X 10.8.
Blurrily uses a tweaked trigram-based approach to find good matches. If you're using ActiveRecord and looking for a lightweight (albeit much slower), in-process, Rails-friendly version of this, check out fuzzily, a Ruby gem to perform fuzzy text searching in ActiveRecord.
Installation
Add this line to your application's Gemfile:
gem 'blurrily'
Or install it yourself as:
$ gem install blurrily
Docker
You can optionally run Burrily as a Docker Container. Maintained by MrMattWright.
Usage
You can use blurrily as a client/server combination (recommended in production), or use the internals standalone.
See the API Documentation for more details.
Client/server
Fire up a blurrily server:
$ blurrily
Open up a console and connect:
$ irb -rubygems
> require 'blurrily/client'
> client = Blurrily::Client.new
Store a needle with a reference:
> client.put('London', 1337)
Recover a reference form the haystack:
> client.find('lonndon')
#=> [1337]
Standalone
Create the in-memory database:
> map = Blurrily::Map.new
Store a needle with a reference:
> map.put('London', 1337)
Recover a reference form the haystack:
> map.find('lonndon')
#=> [1337]
Save the database to disk:
> map.save('/var/db/data.trigrams')
Load a previously saved database:
> map = Blurrily::Map.load('/var/db/data.trigrams')
Caveats
Diacritics, non-latin languages
Blurrily forms trigrams from the 26 latin letters and a stop character (used to model start-of-string and separation between words in multi-word strings).
This means that case and diacritrics are completely ignored by Blurrily. For instance, Puy-de-Dôme is strictly equivalent to puy de dome.
It also means that any non-latin input will probably result in garbage data and garbage results (although it won't crash).
Multi-word needles and edge stickyness.
Multi-word needles (say, New York) are supported.
The engine always favours matches that begin and end similarly to the needle, with a bias to the beginning of the strings.
This is because internally, the string New York is turned into this
sequence of trigrams: **n
, *ne
, new
, ew*
, w*y
, *yo
, yor
,
ork
, rk*
.
Production notes
Memory usage
Blurrily does not store your original strings but rather a flat map of references and weights for each trigram in your input strings.
In practice any database will use up a base 560KB for the index header, plus 128 bits per trigram.
As a rule of thumb idea memory usages is 40MB + 8 times the size of your input data, and 50% extra on top during bulk imports (lots of writes to the database).
For instance, /usr/share/dict/words
is a list of 235k English words, and
weighs 2.5MB. Importing the whole list uses up 75MB of memory, 51MB of which
are the database.
Note that once a database has been written to disk and loaded from disk, memory usage is minimal (560KB per database) as the database file is memory mapped. For performance you do need as much free memory as the database size.
Disk usage
Disk usage is almost exactly like memory usage, since database files are nothing more than a memory dump.
In the /usr/share/dict/words
example, on-disk size is 51MB.
For the whole list of Geonames places, on-disk size is 1.1GB.
Read v write
Writing to blurrily (with #put
) is fairly expensive—it's a search engine
after all, optimized for intensive reads.
Supporting writes means the engine needs to keep a hash table of all references around, typically weighing 50% of your total input. This is build lazily while writing however; so if you load a database from disk and only ever read, you will not incur the memory penalty.
Saving & backing up
Blurrily saves atomically (writing to a separate file, then using rename(2) to overwrite the old file), meaning you should never lose data.
The server does this for you every 60 seconds and when quitting. If using
Blurrily::Map
directly, remember that a map loaded from disk is more
memory efficient that a map in memory, so if your workload is read-heavy,
you should .load
after each #save
.
Backing up comes with a caveat: database files are only portable across architectures if endianness and pointer size are the same (tested between darwin-x86_64 and linux-amd64).
Database files are very compressible; bzip2
typically shrinks them to 20%
of their original size.
Benchmarks
Blurrily is wicked fast, often 100x faster than it's ancestor, fuzzily. This is because it's a close-to- the-metal, single-purpose index using almost exclusively libc primitives. On the inside the only expensive operations it performs are
- memcpy(2) lots of data around (selection);
- mergesort(3) to aggregate/count similar entries (reduction);
- qsort(3) to order by counts (sort).
It tends to be faster with large datasets on BSD than on Linux because the
former has fast quicksort and mergesort, wheras the latter only has qsort
,
a slower, catch-all sorter. In complexity terms this is because FIND tends
to be O(n) on BSD and O(n ln n) on Linux.
Enough talk, here are the graphs. The LOAD
and PUT
operations are O(1)
and take respectively ~10ms and ~100µs on any platform, so they aren't
graphed here.
Contributing
- Fork it
- Create your feature branch (
git checkout -b my-new-feature
) - Commit your changes (
git commit -am 'Add some feature'
) - Push to the branch (
git push origin my-new-feature
) - Create new Pull Request