Ruby GloVe
Ruby implementation of Global Vectors for Word Representations.
Overview
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
NOTE This is an early prototype.
Resources
Dependencies
This library relies on the rb-gsl gem for Matrix and Vector operations, therefore you need GNU Scientific Library installed.
Linux:
$ sudo apt-get install libgsl0-dev
OS X:
$ brew install gsl
Only compatible with MRI: tested in versions 2.0.x and 2.1.x
Installation
$ gem install glove
or add to your Gemfile
gem 'glove'
Usage
require 'glove'
# See documentation for all available options
model = Glove::Model.new
# Next feed it some text.
text = File.read('quantum-physics.txt')
model.fit(text)
# Or you can pass it a Glove::Corpus object as the text argument instead
corpus = Glove::Corpus.build(text)
model.fit(corpus)
# Finally, to query the model, we need to train it
model.train
# So far, word similarity and analogy task methods have been included:
# Most similar words to quantum
model.most_similar('quantum')
# => [["physic", 0.9974459436353388], ["mechan", 0.9971606266531394], ["theori", 0.9965966776283189]]
# What words relate to atom like quantum relates to physics?
model.analogy_words('quantum', 'physics', 'atom')
# => [["electron", 0.9858380292886947], ["energi", 0.9815122410243475], ["photon", 0.9665073849076669]]
# Save the trained matrices and vectors for later usage in binary formats
model.save('corpus.bin', 'cooc-matrix.bin', 'word-vec.bin', 'word-biases.bin')
# Later on create a new instance and call #load
model = Glove::Model.new
model.load('corpus.bin', 'cooc-matrix.bin', 'word-vec.bin', 'word-biases.bin')
# Now you can query the model again and get the same results as above
Performance
Thanks to the rb-gsl bindings for GSL, matrix/vector operations are fast. The glove algorythm itself, however, requires quite a bit of computational power, even the original C library. If you need speed, use smaller texts with vocabulaty size no more than 100K words. Processing text with 160K words (compilation of several books on quantum mechanics) on a late 2012 MBP (8GB RAM) with ruby-2.1.5 takes about 7 minutes:
$ ruby -Ilib benchmark/benchmark.rb
user system total real
Fit Text 11.320000 0.070000 11.390000 ( 11.387612)
Vocabulary size: 158323
Unique tokens: 2903
Co-occur 1.330000 0.250000 1107.720000 (300.738453)
Train 121.120000 12.960000 134.080000 (128.409034)
Similarity 0.010000 0.000000 0.010000 ( 0.057423)
Give me the 3 most similar words to quantum
[["problem", 0.9977609386134489], ["mechan", 0.9977529272587808], ["classic", 0.9974759411408415]]
Analogy 0.010000 0.000000 0.010000 ( 0.010674)
What 3 words relate to atom like quantum relates to mechanics?
[["particl", 0.9982711579369483], ["find", 0.9982303885530384], ["expect", 0.9982017117355527]]
TODO
- Word Vector graphs
Contributing
- Fork it ( https://github.com/vesselinv/glove/fork )
- Create your feature branch (
git checkout -b my-new-feature
) - Commit your changes (
git commit -am 'Add some feature'
) - Push to the branch (
git push origin my-new-feature
) - Create a new Pull Request