
RubyTokenizer


RubyTokenizer is a simple language-processing command-line tool, modeled loosely after Apache Solr's Classic Tokenizer. It performs low-level tokenization through word segmentation, filtering out whitespace, punctuation marks, parentheses and other special characters, and returns the top 10 most frequent words in a body of text. At the moment it only supports English text in UTF-8 (Unicode 6.3) format. All results are case-insensitive.
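The core idea can be sketched in a few lines of Ruby. This is a simplified illustration, not the gem's actual implementation: downcase the text, segment it into words, count occurrences and keep the most frequent ones.

```ruby
# Minimal sketch of case-insensitive top-N word counting
# (illustrative only; the gem's real segmentation is more involved).
def top_words(text, limit = 10)
  text.downcase
      .scan(/[a-z0-9']+/)                 # crude word segmentation
      .tally                              # Ruby 2.7+: word => count
      .sort_by { |_word, count| -count }  # most frequent first
      .first(limit)
end

top_words("the cat and the hat and the bat", 2)
# => [["the", 3], ["and", 2]]
```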

Installation

To use this tool, you need to have Ruby installed. You can find more detailed instructions here: http://www.ruby-lang.org/en/downloads/

Add this line to your application's Gemfile:

gem 'ruby_tokenizer'

And then execute:

$ bundle

Or install it yourself as:

$ gem install ruby_tokenizer

Usage

To tokenize a text, enter the command "tokenizer" followed by the path(s) to the target file(s):

$ tokenizer /file1/path/here.txt /file2/path/here.txt

To use the files that come bundled with this gem, clone the repo and, from its root directory, run:

$ tokenizer lib/samples/frankenstein.txt lib/samples/war_of_the_worlds.txt

If you are in the folder containing the target text files:

$ tokenizer file1.txt file2.txt

This is the expected output:

[["the", 1782],
 ["and", 855],
 ["to", 790],
 ["a", 672],
 ["of", 610],
 ["she", 533],
 ["it", 463],
 ["said", 457],
 ["in", 416],
 ["alice", 384]]

If only the 'tokenizer' command is entered, then the user will be prompted to enter a string:

$ tokenizer
"--- Please input your text below ----"

If the file path cannot be found or the file has a format that cannot be read, a LoadError is raised:

`read_file': File not found: Please try again. (LoadError)
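A sketch of how such a check might be implemented (this is an assumption about the gem's internals; only the method name and error message above come from the actual output):

```ruby
# Hypothetical file-reading helper: raise LoadError when the path
# does not point to a readable regular file.
def read_file(path)
  raise LoadError, "File not found: Please try again." unless File.file?(path)
  File.read(path)
end

begin
  read_file("missing.txt")
rescue LoadError => e
  puts e.message  # prints "File not found: Please try again."
end
```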

Special Cases

RubyTokenizer preserves e-mail addresses, URLs, hyphenated words and certain abbreviations as single tokens, as follows:

Email addresses:

["leslie.knope@gmail.com"]
["leslie_knope@gmail.com"]
["leslie-knope@gmail.com"]

URLs:

["www.frankestein.com"]

Hyphenated words:

["Chicago-based"]

Abbreviations:

["U.S.A"]

Decimal numbers (phone numbers and comma-formatted numbers are not tokenized):

["3.50"]
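One way to express these special cases is a single alternation pattern. The regex below is illustrative only (an assumption, not the gem's actual pattern) and does not cover the phone-number and comma-number exclusions:

```ruby
# Illustrative token pattern: keeps e-mail addresses, www. URLs,
# abbreviations like U.S.A, decimal numbers and hyphenated words
# intact, while splitting on other punctuation and whitespace.
TOKEN = %r{
  [\w.+-]+@[\w-]+\.[\w.]+        | # e-mail addresses
  (?:www\.)[\w-]+(?:\.[\w-]+)+   | # URLs beginning with www.
  (?:[A-Za-z]\.){2,}[A-Za-z]?    | # abbreviations such as U.S.A
  \d+\.\d+                       | # decimal numbers such as 3.50
  \w+(?:-\w+)*                     # plain and hyphenated words
}x

"Contact leslie.knope@gmail.com or visit www.frankestein.com".scan(TOKEN)
# => ["Contact", "leslie.knope@gmail.com", "or", "visit", "www.frankestein.com"]
```

Because every group in the pattern is non-capturing, `String#scan` returns the whole tokens rather than subgroups.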

Development

The following dependencies are required: Bundler, Rake, RSpec and Pry. To install a dependency manually:

$ gem install <name>

To run the test suite, fork the repo, clone it locally, and run the following command from the root directory:

$ bundle exec rspec

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/irinarenteria/ruby_tokenizer.

License

The gem is available as open source under the terms of the MIT License.