Project

winnow

0.0
No commit activity in last 3 years
No release in over 3 years
A tiny Ruby library that implements Winnowing, an algorithm for document fingerprinting.
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
 Dependencies

Development

~> 1.3
>= 0
~> 2.14

Runtime

~> 1.3
 Project Readme

Winnow

A tiny Ruby library for document fingerprinting.

What is document fingerprinting?

Document fingerprinting converts a document (e.g. a book, a piece of code, or any other string) into a much smaller number of hashes called fingerprints. If two documents share any fingerprints, then this means there is an exact substring match between the two documents.

Document fingerprinting has many applications, but the most obvious one is for plagiarism detection. By taking fingerprints of two documents, you can detect if parts of one document were copied from another.

This library implements a fingerprinting algorithm called winnowing, described by Saul Schleimer, Daniel S. Wilkerson, and Alex Aiken in a paper titled Winnowing: Local Algorithms for Document Fingerprinting.

Winnowing is perfect for when you want to precalculate fingerprints about some string. However, due to an issue with how Ruby's String#hash works, you might run into issues. Please see the section on caveats for more info and a workaround.

Installation

Add this to your Gemfile:

gem 'winnow'

Or install it directly:

gem install winnow

Usage

The Fingerprinter class takes care of fingerprinting documents. To create a fingerprint, you need to provide two parameters, called the noise threshold and the guarantee threshold. When comparing two documents' fingerprints, no match shorter than the noise threshold will be detected, but any match at least as long as the guarantee threshold is guaranteed to be found.

The proper values for your noise and guarantee thresholds varies by context. Experiment with the data you're looking at until you're happy with the results.

Creating a fingerprinter is easy:

fingerprinter = Winnow::Fingerprinter.new(noise_threshold: 10, guarantee_threshold: 18)

Then, use #fingerprints get the fingerprints. Optionally, pass :source (default is nil) so that Winnow can later report which document a match is from.

document = File.new('hamlet.txt')
fingerprints = fingerprinter.fingerprints(document.read, source: document)

#fingerprints just returns a plain-old Ruby Hash. The keys of the hash are generated from substrings of the document being fingerprinted. Finding shared substrings between documents is as simple as seeing if they share any of the keys in their #fingerprints hash.

To keep things easier for you, Winnow comes with a Matcher class that will find matches between two documents.

Here's an example that puts everything together:

require 'winnow'

str_a = <<-EOF
'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
This is copied.
All mimsy were the borogoves,
And the mome raths outgrabe.
EOF

str_b = <<-EOF
"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious -- This is copied. -- Bandersnatch!"
EOF

fprinter = Winnow::Fingerprinter.new(
  guarantee_threshold: 13, noise_threshold: 9)

f1 = fprinter.fingerprints(str_a, source: "Stanza 1")
f2 = fprinter.fingerprints(str_b, source: "Stanza 2")

matches = Winnow::Matcher.find_matches(f1, f2)

# Because 'This is copied' is longer than the guarantee threshold, there might
# be a couple of matches found here. For the sake of brevity, let's only look at
# the first match found.
match = matches.first

# It's possible for the same key to appear in a document multiple times (e.g. if
# 'This is copied' appears more than once). Winnow::Matcher will return all
# matches from the same key in array.
#
# In this case, we know there's only one match (because 'This is copied' appears
# only once in each document), so let's only look at the first one.
match_a = match.matches_from_a.first
match_b = match.matches_from_b.first

p match_a.index, match_b.index # 71, 125

match_context_a = str_a[match_a.index - 10 .. match_a.index + 20]
match_context_b = str_b[match_b.index - 10 .. match_b.index + 20]

# Match from Stanza 1: "e wabe;\nThis is copied.\nAll mim"
puts "Match from #{match_a.source}: #{match_context_a.inspect}"

# Match from Stanza 2: "ious -- This is copied. -- Band"
puts "Match from #{match_b.source}: #{match_context_b.inspect}"

You may find that Matcher doesn't handle your exact use-case. That's not a problem. The built-in matcher.rb file is only about 10 lines of code, so you could easily make your own.

💥 💣 A major caveat with String#hash 💣 💥

In order to avoid algorithmic complexity attacks, the value returned from Ruby's String#hash method changes every time you restart the interpreter:

$ irb
2.0.0p353 :001 > "hello".hash
 => 482951767139383391
2.0.0p353 :002 > exit

$ irb
2.0.0p353 :001 > "hello".hash
 => 3216751850140847920
2.0.0p353 :002 > exit

(This is the case even if you're using JRuby.)

This means that although the winnowing algorithm should allow you to precalculate a document's fingerprints and store them somewhere, doing so in Ruby will not work unless you're careful to make sure you never restart your Ruby runtime.

A workaround

Winnow looks for the presence of a String#consistent_hash method. If it finds one, it'll call that rather than call String#hash. You can therefore describe your own hash function if you want to precalculate fingerprint data.

I've put together a super-simple (but effective) gem called consistent_hash that implements exactly this. It's about a dozen lines of MRI C code and it'll probably work for you as well.