StrMetrics
A Ruby gem (native extension written in Rust) that implements several string metrics. Currently supported metrics: Sørensen–Dice, Levenshtein, Damerau–Levenshtein, Jaro, and Jaro–Winkler. Any string that can be converted to a UTF-8 representation is supported. All comparisons are performed at the grapheme cluster level, as described by Unicode Standard Annex #29; results may therefore differ from those of other gems that calculate string metrics. See the Known compatibility section for supported Ruby versions, Rust versions, and platforms.
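To make the grapheme cluster point concrete, here is a small sketch that uses only Ruby's built-in String methods (it does not call this gem). It assumes Ruby 2.5 or newer, where String#grapheme_clusters is available. The same visible character can be one or several codepoints, so a metric that counts codepoints or bytes can disagree with one that counts grapheme clusters.

```ruby
# "é" written as 'e' followed by U+0301 (combining acute accent).
decomposed = "e\u0301"
# The same visible character as a single precomposed codepoint, U+00E9.
precomposed = "\u00E9"

puts decomposed.bytes.length               # bytes in the UTF-8 encoding
puts decomposed.chars.length               # codepoints
puts decomposed.grapheme_clusters.length   # user-perceived characters
puts precomposed.grapheme_clusters.length  # user-perceived characters
```

The decomposed form is 3 bytes and 2 codepoints but a single grapheme cluster, which is the unit this gem compares.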
Getting Started
Prerequisites
Install Rust (tested with version >= 1.47.0) with:
curl https://sh.rustup.rs -sSf | sh
Known compatibility
Ruby
3.1, 3.0, 2.7, 2.6, 2.5, 2.4, 2.3, jruby, truffleruby
Rust
1.60.0, 1.59.0, 1.58.1, 1.57.0, 1.56.1, 1.55.0, 1.54.0, 1.53.0, 1.52.1, 1.51.0, 1.50.0, 1.49.0, 1.48.0, 1.47.0
Platforms
Linux, macOS, Windows
Installation
With bundler
Add this line to your application's Gemfile:
gem 'str_metrics'
And then execute:
$ bundle install
Without bundler
$ gem install str_metrics
Usage
To use the metrics provided by this gem, require str_metrics:
require 'str_metrics'
Each metric is shown below with an example and a description of its optional parameters.
Sørensen–Dice
StrMetrics::SorensenDice.coefficient('abc', 'bcd', ignore_case: false)
=> 0.5
Options:
Keyword | Type | Default | Description |
---|---|---|---|
ignore_case | boolean | false | Case insensitive comparison? |
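For readers unfamiliar with the coefficient, the following pure-Ruby sketch shows one common formulation: twice the number of shared bigrams divided by the total number of bigrams. The helper name dice_sketch is hypothetical and not part of this gem, and the choices made here (multiset counting, returning 0.0 when there are no bigrams) are assumptions that may not match the native implementation.

```ruby
# Hypothetical helper, NOT part of str_metrics: Sørensen–Dice over bigrams
# of grapheme clusters, with bigrams counted as a multiset (an assumption).
def dice_sketch(a, b)
  bigrams = ->(s) { s.grapheme_clusters.each_cons(2).map(&:join) }
  x = bigrams.call(a)
  y = bigrams.call(b)
  # Arbitrary choice for this sketch: no bigrams at all means 0.0.
  return 0.0 if x.empty? && y.empty?

  remaining = y.dup
  # Count bigrams of x that can be paired with a not-yet-used bigram of y.
  shared = x.count do |bg|
    idx = remaining.index(bg)
    idx && remaining.delete_at(idx)
  end
  2.0 * shared / (x.length + y.length)
end

puts dice_sketch('abc', 'bcd')
```

For 'abc' and 'bcd' the bigrams are (ab, bc) and (bc, cd); one is shared, giving 2 * 1 / (2 + 2) = 0.5, which agrees with the example above.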
Levenshtein
StrMetrics::Levenshtein.distance('abc', 'acb', ignore_case: false)
=> 2
Options:
Keyword | Type | Default | Description |
---|---|---|---|
ignore_case | boolean | false | Case insensitive comparison? |
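As a reference for what the distance counts, here is a minimal pure-Ruby sketch of the classic dynamic-programming formulation (insertions, deletions, and substitutions each cost 1), applied to grapheme clusters. The name levenshtein_sketch is hypothetical; this is not the gem's Rust implementation and is far slower than it would be.

```ruby
# Hypothetical helper, NOT part of str_metrics: two-row Levenshtein
# distance computed over grapheme clusters.
def levenshtein_sketch(a, b)
  s = a.grapheme_clusters
  t = b.grapheme_clusters
  # prev[j] is the distance between the first i clusters of s and the
  # first j clusters of t; it starts at i = 0.
  prev = (0..t.length).to_a
  s.each_with_index do |sc, i|
    curr = [i + 1]
    t.each_with_index do |tc, j|
      cost = sc == tc ? 0 : 1
      curr << [prev[j + 1] + 1, curr[j] + 1, prev[j] + cost].min
    end
    prev = curr
  end
  prev.last
end

puts levenshtein_sketch('abc', 'acb')
```

Swapping two adjacent characters counts as two substitutions here, which is why 'abc' and 'acb' are at distance 2.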
Damerau–Levenshtein
StrMetrics::DamerauLevenshtein.distance('abc', 'acb', ignore_case: false)
=> 1
Options:
Keyword | Type | Default | Description |
---|---|---|---|
ignore_case | boolean | false | Case insensitive comparison? |
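The difference from plain Levenshtein is that swapping two adjacent characters counts as a single edit. The sketch below implements the restricted variant (often called optimal string alignment); whether the gem uses this variant or the unrestricted Damerau–Levenshtein distance is not stated here, so treat that as an assumption. The name osa_sketch is hypothetical.

```ruby
# Hypothetical helper, NOT part of str_metrics: optimal string alignment
# distance (restricted Damerau–Levenshtein) over grapheme clusters.
def osa_sketch(a, b)
  s = a.grapheme_clusters
  t = b.grapheme_clusters
  d = Array.new(s.length + 1) { Array.new(t.length + 1, 0) }
  (0..s.length).each { |i| d[i][0] = i }
  (0..t.length).each { |j| d[0][j] = j }

  (1..s.length).each do |i|
    (1..t.length).each do |j|
      cost = s[i - 1] == t[j - 1] ? 0 : 1
      best = [d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost].min
      # Adjacent transposition: the last two clusters are swapped.
      if i > 1 && j > 1 && s[i - 1] == t[j - 2] && s[i - 2] == t[j - 1]
        best = [best, d[i - 2][j - 2] + 1].min
      end
      d[i][j] = best
    end
  end
  d[s.length][t.length]
end

puts osa_sketch('abc', 'acb')
```

For 'abc' and 'acb' the single swap of 'b' and 'c' gives a distance of 1, matching the example above.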
Jaro
StrMetrics::Jaro.similarity('abc', 'aac', ignore_case: false)
=> 0.7777777777777777
Options:
Keyword | Type | Default | Description |
---|---|---|---|
ignore_case | boolean | false | Case insensitive comparison? |
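To show where a value like 0.777... comes from, here is a pure-Ruby sketch of the textbook Jaro definition: with m matching characters and t transpositions, the similarity is (m/|a| + m/|b| + (m - t)/m) / 3. The name jaro_sketch is hypothetical, and details such as the matching window and the handling of empty input are assumptions that may differ from the gem.

```ruby
# Hypothetical helper, NOT part of str_metrics: textbook Jaro similarity
# over grapheme clusters.
def jaro_sketch(a, b)
  s = a.grapheme_clusters
  t = b.grapheme_clusters
  # Clusters match only if equal and at most `window` positions apart.
  window = [[s.length, t.length].max / 2 - 1, 0].max
  t_used = Array.new(t.length, false)
  s_matched = []

  s.each_with_index do |ch, i|
    lo = [i - window, 0].max
    hi = [i + window, t.length - 1].min
    (lo..hi).each do |j|
      next if t_used[j] || t[j] != ch
      t_used[j] = true
      s_matched << ch
      break
    end
  end

  m = s_matched.length
  # Arbitrary choice for this sketch: no matches (or empty input) is 0.0.
  return 0.0 if m.zero?

  t_matched = t.each_index.select { |j| t_used[j] }.map { |j| t[j] }
  # Half the number of matched positions that disagree in order.
  transpositions = s_matched.zip(t_matched).count { |x, y| x != y } / 2
  (m.to_f / s.length + m.to_f / t.length + (m - transpositions).to_f / m) / 3
end

puts jaro_sketch('abc', 'aac')
```

For 'abc' and 'aac' there are two matches ('a' and 'c') and no transpositions, so the result is (2/3 + 2/3 + 2/2) / 3, roughly 0.778.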
Jaro–Winkler
StrMetrics::JaroWinkler.similarity('abc', 'aac', ignore_case: false, prefix_scaling_factor: 0.1, prefix_scaling_bonus_threshold: 0.7)
=> 0.7999999999999999
StrMetrics::JaroWinkler.distance('abc', 'aac', ignore_case: false, prefix_scaling_factor: 0.1, prefix_scaling_bonus_threshold: 0.7)
=> 0.20000000000000007
Options:
Keyword | Type | Default | Description |
---|---|---|---|
ignore_case | boolean | false | Case insensitive comparison? |
prefix_scaling_factor | decimal | 0.1 | Constant scaling factor for how much to weight common prefixes. Should not exceed 0.25. |
prefix_scaling_bonus_threshold | decimal | 0.7 | Prefix bonus weighting will only be applied if the Jaro similarity is greater than the given value. |
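The two prefix options can be read as parameters of the usual Winkler adjustment, sketched below in pure Ruby. The helper name winkler_adjust_sketch is hypothetical and takes an already-computed Jaro similarity as input. Capping the common prefix at 4 characters is the conventional choice and an assumption about this gem; it is also why a scaling factor above 0.25 could push the result past 1.0.

```ruby
# Hypothetical helper, NOT part of str_metrics: applies the Winkler prefix
# bonus to a precomputed Jaro similarity.
def winkler_adjust_sketch(jaro, a, b,
                          prefix_scaling_factor: 0.1,
                          prefix_scaling_bonus_threshold: 0.7)
  # No bonus unless the Jaro similarity is above the threshold.
  return jaro unless jaro > prefix_scaling_bonus_threshold

  # Length of the common prefix, capped at 4 grapheme clusters (assumed).
  prefix = a.grapheme_clusters.zip(b.grapheme_clusters)
            .take(4)
            .take_while { |x, y| x == y }
            .length
  jaro + prefix * prefix_scaling_factor * (1.0 - jaro)
end

similarity = winkler_adjust_sketch(7.0 / 9, 'abc', 'aac')
puts similarity
puts 1.0 - similarity
```

With a Jaro similarity of about 0.778 and a one-character common prefix, the adjusted value is 0.778 + 1 * 0.1 * (1 - 0.778), roughly 0.8. The distance in the example above is consistent with 1 minus the similarity.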
Motivation
The main motivation was to have a central gem which can provide a variety of string metric calculations. Secondary motivation was to experiment with writing a native extension in Rust (instead of C).
Development
Getting started
gem install bundler
git clone https://github.com/anirbanmu/str_metrics.git
cd ./str_metrics
bundle install
Building (for native component)
rake rust_build
Testing (will build native component before running tests)
rake spec
Local installation
rake install
Deploying a new version
To deploy a new version of the gem to rubygems:
- Bump version in version.rb according to SemVer.
- Get your code merged to the main branch.
- After a git pull on the main branch, run:
rake build && rake release
Authors
See all repo contributors here.
Versioning
SemVer is employed. See tags for released versions.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/anirbanmu/str_metrics.
Code of Conduct
Everyone interacting in this project's codebase, issue trackers, etc. is expected to follow the code of conduct.
License
This project is licensed under the MIT License; see the LICENSE file for details.