Generates a tree representing the branches or revisions to a set of HTML files

SimilarityTree

This library generates a tree representing the branches and revisions of a set of text or HTML files, without requiring any prior knowledge of their timelines or change history. You only need to know which document is the original source; the library builds the tree from the extent of the differences between each pair of documents.

Installation

Add this line to your application's Gemfile:

gem 'similarity_tree'

And then execute:

$ bundle

Or install it yourself as:

$ gem install similarity_tree

Usage

First, build a "similarity matrix" of the diff scores between the documents; the input is a set of HTML or text documents. Then, to build the tree itself, specify the document id or filename of the original (root) document. For example, for the set of Creative Commons licences in the test directory:

documents = Dir.glob('../../similarity_tree/test/cc_licences/*.html')
tree = SimilarityTree::SimilarityMatrix.new(documents).build_tree("CC-BY-3.0.html")
puts tree.to_s  # to_h and to_json are also available as other tree output formats

Result:

CC-BY-3.0.html
-CC-BY-NC-3.0.html (0.9197574893009985)
--CC-BY-NC-SA-3.0.html (0.9503146737330241)
--CC-BY-NC-ND-3.0.html (0.9456402772710689)
-CC-BY-ND-3.0.html (0.9434472109631346)
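
This README does not spell out how the tree is derived from the matrix. As a rough sketch of one possible approach, and not necessarily what the gem does internally, the following greedy procedure (essentially Prim's algorithm for a maximum spanning tree) repeatedly attaches the unplaced document that is most similar to any document already in the tree. The method names build_similarity_tree and render_tree, and the layout of the scores hash, are hypothetical and exist only for this illustration.

```ruby
# Hypothetical helpers for illustration; not part of the similarity_tree API.

# ids:    array of document ids
# scores: hash mapping [id_a, id_b] pairs to a similarity score
# root:   id of the original document
# Returns a hash mapping each parent id to an array of [child_id, score].
def build_similarity_tree(ids, scores, root)
  sim = ->(a, b) { scores[[a, b]] || scores[[b, a]] || 0.0 }
  children = {}
  placed = [root]
  remaining = ids - [root]

  until remaining.empty?
    # Choose the (placed, unplaced) pair with the highest similarity.
    parent, child = placed.product(remaining).max_by { |p, c| sim.call(p, c) }
    (children[parent] ||= []) << [child, sim.call(parent, child)]
    placed << child
    remaining.delete(child)
  end
  children
end

# Render the tree as lines, with one leading dash per level of depth.
def render_tree(children, node, depth = 0, score = nil)
  label = "#{'-' * depth}#{node}"
  label += " (#{score})" if score
  lines = [label]
  children.fetch(node, []).each do |child, child_score|
    lines.concat(render_tree(children, child, depth + 1, child_score))
  end
  lines
end

# Made-up scores for four documents, purely for demonstration.
example_ids = %w[a b c d]
example_scores = {
  %w[a b] => 0.9, %w[a c] => 0.5, %w[a d] => 0.4,
  %w[b c] => 0.8, %w[b d] => 0.3, %w[c d] => 0.2
}
example_tree = build_similarity_tree(example_ids, example_scores, 'a')
puts render_tree(example_tree, 'a')
```

In this sketch each document ends up under the already-placed document it most closely resembles, so a revision of a revision nests one level deeper than a direct revision of the root.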

You can also operate directly on strings rather than files (in this case, the node ids in the tree will be the array indices of the documents):

documents = Dir.glob('../../similarity_tree/test/cc_licences/*.html').map { |f| File.read(f) }
tree = SimilarityTree::SimilarityMatrix.new(documents).build_tree(0)  # root id is the array index of the original document
puts tree.to_s  # to_h and to_json are also available as other tree output formats

Result:

0
-1 (0.9197574893009985)
--3 (0.9503146737330241)
--4 (0.9456402772710689)
-2 (0.9434472109631346)

Alternatively, you can use any enumerable list of objects (e.g. ActiveRecord models) as the input. Consider the model:

class Document < ActiveRecord::Base
  attr_accessible :title, :text_filename
  ...
end

Generate the tree as follows:

tree = SimilarityTree::SimilarityMatrix.new(Document.all,
    id_func: :title, content_func: :text_filename).build_tree(Document.first.title)

Additional Options

Calculation method

You can calculate the diff scores using either term frequency–inverse document frequency (:tf_idf, the default) or Dice's coefficient computed from a standard unix-style diff (:diff). Tf-idf works much better where a document has a lot of translations (that is, sections of text cut and pasted into different locations) and is often faster. However, if your intent is to show diffs of the text, the :diff option will correlate better with your diff rendering.

tf_idf_tree = SimilarityTree::SimilarityMatrix.new(documents,
    calculation_method: :tf_idf).build_tree("CC-BY-3.0.html")
diff_tree = SimilarityTree::SimilarityMatrix.new(documents,
    calculation_method: :diff).build_tree("CC-BY-3.0.html")
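
To make the difference between the two methods concrete, here is a minimal, self-contained sketch of both scores. It is not the gem's implementation: the helper names (lcs_length, dice_similarity, tf_idf_similarity), the line-based diff, the word tokenizer, and the smoothed idf formula are all assumptions chosen for illustration. It requires Ruby 2.7 or later for Enumerable#tally.

```ruby
# Hypothetical helpers for illustration; none of these are part of the gem.

# Length of the longest common subsequence of two arrays, i.e. the number
# of lines a unix-style diff would report as unchanged.
def lcs_length(a, b)
  prev = Array.new(b.size + 1, 0)
  a.each do |x|
    curr = [0]
    b.each_with_index do |y, j|
      curr << (x == y ? prev[j] + 1 : [prev[j + 1], curr[j]].max)
    end
    prev = curr
  end
  prev.last
end

# Dice's coefficient over lines: 2 * |common| / (|a| + |b|).
def dice_similarity(text_a, text_b)
  a = text_a.lines.map(&:chomp)
  b = text_b.lines.map(&:chomp)
  return 1.0 if a.empty? && b.empty?
  2.0 * lcs_length(a, b) / (a.size + b.size)
end

# Cosine similarity of tf-idf vectors. Word order is ignored, so moving
# a block of text to another location does not change the score.
def tf_idf_similarity(text_a, text_b, corpus)
  tokenize = ->(text) { text.downcase.scan(/\w+/) }
  doc_tokens = corpus.map { |doc| tokenize.call(doc).uniq }
  idf = lambda do |term|
    df = doc_tokens.count { |tokens| tokens.include?(term) }
    Math.log((1.0 + corpus.size) / (1.0 + df)) + 1.0 # smoothed idf (an assumption)
  end
  vector = lambda do |text|
    tokenize.call(text).tally.to_h { |term, count| [term, count * idf.call(term)] }
  end
  va = vector.call(text_a)
  vb = vector.call(text_b)
  dot = va.sum { |term, weight| weight * vb.fetch(term, 0.0) }
  norm = ->(v) { Math.sqrt(v.values.sum { |w| w * w }) }
  denom = norm.call(va) * norm.call(vb)
  denom.zero? ? 0.0 : dot / denom
end

original = "alpha\nbeta\ngamma\ndelta\n"
moved    = "gamma\ndelta\nalpha\nbeta\n" # same lines, two blocks swapped

puts dice_similarity(original, moved)                      # 0.5
puts tf_idf_similarity(original, moved, [original, moved]) # approximately 1.0
```

The swapped document scores only 0.5 with the diff-based measure, because a diff can keep just one of the two blocks aligned, while the tf-idf score stays at roughly 1.0 since both documents contain the same words.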

Progress output

Performing all the diffs to build a similarity matrix can take a while for large document sets. If you're using this gem from a script or a console, you can add a progress bar:

tree = SimilarityTree::SimilarityMatrix.new(documents, show_progress: true).build_tree(id)

Licence and Credits

(c) 2012-2013, Kent Mewhort (similarity tree) and Open North (original similarity_matrix implementation, see https://github.com/jpmckinney/clip-analysis), licensed under the MIT licence. See LICENSE.txt for details.