RubyTokenizer
RubyTokenizer is a simple language processing command-line tool, modeled loosely after Apache Solr's Classic Tokenizer. It performs low-level tokenization through word-segmentation by filtering whitespaces, punctuation marks, parantheses and other special characters, and returns the top 10 most frequent words in a body of text. At the moment it's only available for English texts in UTF-8, Unicode 6.3 format. All results are case-insensitive.
Installation
To use this tool, you need to have Ruby installed. You can find more detailed instructions here: http://www.ruby-lang.org/en/downloads/
Add this line to your application's Gemfile:
gem 'ruby_tokenizer'
And then execute:
$ bundle
Or install it yourself as:
$ gem install ruby_tokenizer
Usage
To tokenize a text, enter the command "tokenizer" followed by the paths to the targeted file(s):
$ tokenizer /file1/path/here.txt /file2/path/here.txt
To use the files that come bundled with this gem, clone the repo and while in the root directory input:
$ tokenizer lib/samples/frankenstein.txt lib/samples/war_of_the_worlds.txt
If you are in the folder containing the targeted text files:
$ tokenizer file1.txt file2.text
This is the expected output:
$ [["the", 1782],
["and", 855],
["to", 790],
["a", 672],
["of", 610],
["she", 533],
["it", 463],
["said", 457],
["in", 416],
["alice", 384]]
If only the 'tokenizer' command is entered, then the user will be prompted to enter a string:
$ tokenizer
$ "--- Please input your text below ----"
$
If the file path cannot be found or the file has a format that cannot be read, a LoadError will be displayed:
$ `read_file': File not found: Please try again. (LoadError)
Special Cases
RubyTokenizer accounts for e-mail addresses, URLs, hyphenated words and certain abbreviations as follows:
Email addresses:
$ ["leslie.knope@gmail.com"]
$ ["leslie_knope@gmail.com"]
$ ["leslie-knope@gmail.com"]
URLs:
$ ["www.frankestein.com"]
Hyphenated words:
$ ["Chicago-based"]
Abbreviations:
$ ["U.S.A"]
Numbers (phone numbers and numbers with a comma format are not tokenized):
$ ["3.50"]
Development
The following dependencies are required: Bundler, Rake, RSpec, and Pry. To install these dependencies manually:
$ gem install name
To run the test suite, fork the repo, clone it to a local directory and in the root directory run the following command:
$ bundle exec rspec
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/irinarenteria/ruby_tokenizer.
License
The gem is available as open source under the terms of the MIT License.