word2vec-rb
Gem using word2vec functionality from https://code.google.com/archive/p/word2vec/
This gem was developed using the .c
files of the Google word2vec as base. Mostly by applying copy-and-paste.
Installation
Add this line to your application's Gemfile:
gem 'word2vec-rb'
And then execute:
$ bundle install
Or install it yourself as:
$ gem install word2vec-rb
Usage
Distance arithmetic: to find the nearest words, try:
require 'word2vec'
model = Word2vec::Model.load("./data/minimal.bin")
words = model.distance("from")
words.each do |w|
puts "#{w.first} #{w.last}"
end
Analogy arithmetic: to find the analogy with three words, try:
require 'word2vec'
model = Word2vec::Model.load("./data/minimal.bin")
words = model.analogy("spain", "madrid", "france")
# In a well prepared vectors file (high quality), first word would be "Paris"
words.each do |w|
puts "#{w.first} #{w.last}"
end
Accuray: test accuracy of the vectors:
Define a file with the analogies to test, format: : section heading Word1 Word2 Word3 Word4
Sample:
: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
require 'word2vec'
model = Word2vec::Model.load(file_name)
model.accuracy("./data/questions-words.txt")
# Outputs the results on terminal
Vocabulary: create a vocabulary file from a train file:
require 'word2vec'
Word2vec::Model.build\_vocab("./data/text7", "./data/vocab.txt")
The output file will have a list of words and its number of appearances separated by line break.
Tokenizer: create a binary file by tokenizing an input file
This method requires a vocabulary file precreated.
require 'word2vec'
Word2vec::Model.tokenize("./data/text7", "./data/vocab.txt", "./data/tokenized.bin")
The output file will contain a sequence of binary identificators of each word of the input file.
Read output file with:
long long id;
fread(&id, sizeof(id), 1, fi);
Load the word2vec output bin file (vectors.bin), into ruby array
require 'word2vec'
vector_array = Word2vec::load_vectors("./data/minimal.bin")
The vector_array
variable will contain an array of pairs with the vocab and the vector the float values of each word.
Set parameter normalize: true
to normalize the vectors.
require 'word2vec'
vector_array = Word2vec::Model.load_vectors("./data/minimal.bin", normalize: true)
Development
After checking out the repo, run bin/setup
to install dependencies. Then, run rake spec
to run the tests. You can also run bin/console
for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install
. To release a new version, update the version number in version.rb
, and then run bundle exec rake release
, which will create a git tag for the version, push git commits and tags, and push the .gem
file to rubygems.org.
Build extension
$ rake build
Launch tests
$ rake spec
Build extension
$ rake compile
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/madcato/word2vec-rb