Project

simhash

0.03
No commit activity in last 3 years
No release in over 3 years
Implementation of Charikar simhashes in Ruby

 Project Readme

Abstract

This is an implementation of Moses Charikar’s simhashes in Ruby.

Usage

When you have a string and want to calculate its simhash, call:

my_string.simhash

By default this returns a 64-bit integer, which is the simhash of the string.

It’s always better to tokenize the string before simhashing. It’s as simple as:
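For readers unfamiliar with the technique, here is a minimal sketch of how a Charikar-style simhash can be computed. It is not the gem's implementation: `sketch_simhash` is a hypothetical helper, and MD5 truncated to the requested number of bits is an assumed token hash, chosen only because it ships with Ruby.

```ruby
require "digest"

# Hypothetical helper, not part of the simhash gem.
# Builds a simhash of `bits` bits (at most 128 here, since MD5 yields 128 bits).
def sketch_simhash(tokens, bits = 64)
  counts = Array.new(bits, 0)
  mask = (1 << bits) - 1

  tokens.each do |token|
    # Assumed token hash: the low `bits` bits of the token's MD5 digest.
    h = Digest::MD5.hexdigest(token).to_i(16) & mask
    # Each token votes +1 for every bit it has set and -1 for every bit it lacks.
    bits.times { |i| counts[i] += h[i] == 1 ? 1 : -1 }
  end

  # Bit i of the result is set when the votes for it are positive.
  counts.each_with_index.reduce(0) do |acc, (count, i)|
    count > 0 ? acc | (1 << i) : acc
  end
end

a = sketch_simhash("the quick brown fox".split(/ /))
b = sketch_simhash("the quick brown dog".split(/ /))
```

Because every bit is decided by a majority vote over the token hashes, inputs that share most of their tokens tend to agree in most bits.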

my_string.simhash(:split_by => / /)

This also generates a 64-bit integer, but splits the string into words first. It’s handy when you need to calculate the similarity of strings based on word usage. You can split the string however you like: by letters, sentences, specific letter combinations, etc.
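Simhashes are usually compared by counting the bits in which they differ (the Hamming distance): the fewer differing bits, the more similar the inputs. The sketch below uses plain Ruby only. `hamming_distance` is a hypothetical helper, not part of the gem, the integers are made-up values standing in for simhashes, and the tokenization lines assume the pattern behaves like the argument to `String#split`.

```ruby
# Hypothetical helper: number of bit positions in which two
# non-negative integers differ.
def hamming_distance(a, b)
  (a ^ b).to_s(2).count("1")
end

# The splitting pattern decides what counts as a token.
words   = "to be or not to be".split(/ /) # => ["to", "be", "or", "not", "to", "be"]
letters = "to be".split(//)               # => ["t", "o", " ", "b", "e"]

hamming_distance(0b1011, 0b1001) # => 1
```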

my_string.simhash(:split_by => /\./, :hashbits => 512)

Sometimes you might need a longer simhash (finding similarity between very long strings is a good example). You can set the length of the resulting hash by passing the hashbits parameter. This example returns a 512-bit simhash for your string split by sentences.

Advanced usage

Cleaning your string before simhashing is often useful. But sometimes it is useful not to clean it, too.

Here are examples:

my_string.simhash(:stop_words => true) # here we clean

This will find stop-words in your string and remove them before simhashing. Stop-words are “the”, “not”, “about”, etc. Currently we remove only Russian and English stop-words.
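As a rough illustration of what stop-word removal does to the token list, here is a sketch in plain Ruby. `remove_stop_words` and its three-word default list are hypothetical and only for illustration. The gem uses its own lists and may filter differently.

```ruby
# Hypothetical helper with a tiny illustrative stop-word list.
def remove_stop_words(words, stop_words = %w[the not about])
  words.reject { |word| stop_words.include?(word.downcase) }
end

remove_stop_words("The story is not about the fox".split(/ /))
# => ["story", "is", "fox"]
```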

my_string.simhash(:preserve_punctuation => true) # here we do not

This will not remove punctuation before simhashing. By default, we remove all dots, commas, etc. after splitting the string into words, because different punctuation generally does not mean the strings differ. If you do not agree, you can turn this default off.

Installation
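The default punctuation clean-up can be pictured with the following sketch. `strip_punctuation` is a hypothetical helper written for this example. Which characters the gem actually treats as punctuation is an assumption here.

```ruby
# Hypothetical helper: drops punctuation characters from each token
# and discards tokens that end up empty.
def strip_punctuation(words)
  words.map { |word| word.gsub(/[[:punct:]]/, "") }.reject(&:empty?)
end

strip_punctuation("Hello, world! It works.".split(/ /))
# => ["Hello", "world", "It", "works"]
```

With this clean-up, "Hello," and "Hello" become the same token, so they contribute identically to the simhash.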

As usual:

gem install simhash

If you have the GNU MP library installed, simhash will work faster. To check which version is in use, type:

Simhash::DEFAULT_STRING_HASH_METHOD

It should return a symbol. If the symbol ends with “rb”, your simhash is slow. If you want to make it faster, install GNU MP.