Twins
Twins sorts through the small differences between multiple objects and smartly consolidates them.
Usage
Let's say you have a collection of objects representing the same book but gathered from different sources, so each object may differ slightly from the others.
books = [{
title: "Shantaram: A Novel",
author: "Gregory David Roberts",
published: 2012,
details: {
paperback: true
}
},
{
title: "Shantaram",
author: "Gregory David Roberts & Alejandro Palomas",
published: 2012,
details: {
paperback: false
}
},
{
title: "Shantaram",
author: "Gregory David Roberts",
published: 2012,
details: {
paperback: true
}
},
{
title: "Shantaram",
author: "Gregory D. Roberts",
published: 2005,
details: {
paperback: true
}
}]
Consolidate
Assembles a new Hash based on every element in the collection. By default, Twins.consolidate determines each candidate value from the most frequent value present for a given key, also known as the mode.
Twins.consolidate(books)
{
title: "Shantaram",
author: "Gregory David Roberts",
published: 2012,
details: {
paperback: true
}
}
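To make the mode-based rule concrete, here is a minimal sketch of the idea in plain Ruby. It is not the gem's implementation: `mode_of` and `sketch_consolidate` are hypothetical names, and consolidating nested hashes recursively is an assumption about how nesting is handled.

```ruby
# Illustrative sketch only; mode_of and sketch_consolidate are hypothetical
# names, not part of the Twins API.

# Returns the most frequent value in a list.
def mode_of(values)
  values.group_by(&:itself).max_by { |_value, group| group.size }.first
end

# Builds a new Hash holding the mode of every key. Nested hashes are
# consolidated recursively (an assumption about how nesting is handled).
def sketch_consolidate(collection)
  keys = collection.flat_map(&:keys).uniq
  keys.each_with_object({}) do |key, result|
    values = collection.select { |element| element.key?(key) }
                       .map { |element| element[key] }
    result[key] =
      if values.all? { |value| value.is_a?(Hash) }
        sketch_consolidate(values)
      else
        mode_of(values)
      end
  end
end

books = [
  { title: "Shantaram: A Novel", published: 2012, details: { paperback: true } },
  { title: "Shantaram",          published: 2012, details: { paperback: false } },
  { title: "Shantaram",          published: 2005, details: { paperback: true } }
]

consolidated = sketch_consolidate(books)
p consolidated
```

Each key is resolved independently, so the consolidated Hash may combine values that never appeared together in any single source object.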
You may also provide Twins.consolidate with priorities for String and Numeric attributes, which take precedence over the mode when determining the candidate value.
options = {
priority: {
title: "Novel"
}
}
Twins.consolidate(books, options)
{
title: "Shantaram: A Novel",
author: "Gregory David Roberts",
published: 2012,
details: {
paperback: true
}
}
Pick
Selects the collection's most representative element. By default, Twins.pick chooses the element containing the highest number of modes (the most frequent value for each key).
Twins.pick(books)
{
title: "Shantaram",
author: "Gregory David Roberts",
published: 2012,
details: {
paperback: true
}
}
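The default selection rule can be sketched as follows: compute the mode of each key, then keep the element whose values match the most modes. This is only an illustration under that reading of the rule; `mode_of` and `sketch_pick` are hypothetical names, and nested values are compared as whole objects here.

```ruby
# Illustrative sketch only; mode_of and sketch_pick are hypothetical names,
# not part of the Twins API.

# Returns the most frequent value in a list.
def mode_of(values)
  values.group_by(&:itself).max_by { |_value, group| group.size }.first
end

# Returns the element whose values match the largest number of per-key modes.
def sketch_pick(collection)
  keys = collection.flat_map(&:keys).uniq
  modes = keys.each_with_object({}) do |key, result|
    result[key] = mode_of(collection.map { |element| element[key] })
  end
  collection.max_by do |element|
    modes.count { |key, mode| element[key] == mode }
  end
end

books = [
  { title: "Shantaram: A Novel", author: "Gregory David Roberts", published: 2012 },
  { title: "Shantaram",          author: "Gregory David Roberts", published: 2012 },
  { title: "Shantaram",          author: "Gregory D. Roberts",    published: 2005 }
]

picked = sketch_pick(books)
p picked
```

Unlike consolidation, the result is always one of the original elements, returned unchanged.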
You may also provide Twins.pick with priorities for String and Numeric attributes, which are used to compute each element's overall distance when determining the candidate element.
options = {
priority: {
title: "Novel"
}
}
Twins.pick(books, options)
{
title: "Shantaram: A Novel",
author: "Gregory David Roberts",
published: 2012,
details: {
paperback: true
}
}
Internals
Distance
String distances are calculated using a longest subsequence algorithm, and Numeric distances are calculated as the difference between the two values.
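As a rough illustration, the sketch below assumes "longest subsequence" refers to the classic longest common subsequence (LCS) and turns its length into a distance by subtracting it from the longer string's length. Both that formula and the use of an absolute difference for numbers are assumptions made for this example, and `lcs_length`, `string_distance` and `numeric_distance` are hypothetical names rather than the gem's internals.

```ruby
# Illustrative sketch only; these helpers are hypothetical and are not the
# gem's internals.

# Length of the longest common subsequence of two strings, computed with
# row-by-row dynamic programming.
def lcs_length(a, b)
  previous = Array.new(b.length + 1, 0)
  a.each_char do |char_a|
    current = [0]
    b.each_char.with_index(1) do |char_b, j|
      current << if char_a == char_b
                   previous[j - 1] + 1
                 else
                   [previous[j], current[j - 1]].max
                 end
    end
    previous = current
  end
  previous.last
end

# Assumed distance: characters of the longer string not covered by the LCS.
def string_distance(a, b)
  [a.length, b.length].max - lcs_length(a, b)
end

# Assumed distance: absolute difference between the two numbers.
def numeric_distance(a, b)
  (a - b).abs
end

p lcs_length("Shantaram", "Shantaram: A Novel")      # 9
p string_distance("Shantaram", "Shantaram: A Novel") # 9
p numeric_distance(2012, 2005)                       # 7
```

With this definition, identical strings have a distance of 0 and the distance grows as fewer characters are shared in order.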
Contributing
1. Fork it
2. Create a topic branch
3. Add specs for your unimplemented modifications
4. Run bundle exec rspec. If specs pass, return to step 3.
5. Implement your modifications
6. Run bundle exec rspec. If specs fail, return to step 5.
7. Commit your changes and push
8. Submit a pull request
9. Thank you!
TODO
- Consider using the Jaccard index to weight items
Author
License
See LICENSE