Tokeneyes
A string tokenizer designed to capture words with associated punctuation and sentence flow information (e.g. if they start or end a sentence).
Why write a tokenizer?
As I was writing markovian, I realized that the Markov text generator needed significantly more information about the corpus than it could get by simply calling String#split on the input text. To add punctuation or to end sentences properly (rather than on a series of short, frequent prepositions or pronouns), the gem has to better understand how words are used in context.
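For example, plain String#split keeps punctuation glued to the surrounding words and says nothing about where sentences begin or end:

"It was night. The rain fell.".split
# => ["It", "was", "night.", "The", "rain", "fell."]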
There are a number of excellent tokenizers available, such as the tokenizer gem, Apache's OpenNLP, and the OpeNER Project -- if you're looking to do serious language processing, you should click on one of those links.
Tokeneyes is a learning exercise; text parsing is a rich, fun, and deceptive problem -- you can quickly get 80% of the way to proper tokenization, but it's the other 20% of language use that makes the difference between "amusingly off" and "passes the Turing test". Mine doesn't and won't, but I've still enjoyed writing it and look forward to refining it further.
Installation
Add this line to your application's Gemfile:
gem 'tokeneyes'
And then execute:
$ bundle
Or install it yourself as:
$ gem install tokeneyes
Usage
In a console session, you can run
tokenizer = Tokeneyes::Tokenizer.new(text_to_parse)
tokens = tokenizer.parse_into_words
This will return an array of Tokeneyes::Word objects, each of which provides the text of the word, the punctuation before and after it (if any), and whether the word began or ended a sentence (as I have somewhat arbitrarily defined the concept 😁).
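As a rough sketch of how you might inspect those objects (the attribute names below are illustrative assumptions rather than the gem's documented API -- check Tokeneyes::Word for the real interface):

tokens = Tokeneyes::Tokenizer.new("It was night. The rain fell!").parse_into_words
tokens.each do |word|
  # Each entry is assumed to expose the word text, surrounding punctuation,
  # and sentence-boundary flags described above (names here are hypothetical).
  puts [word.text, word.punctuation_before, word.punctuation_after,
        word.begins_sentence, word.ends_sentence].inspect
end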
Still to do
There are several significant areas left to do:
- Capture periods at the end of a sentence
- Capture dividing punctuation that occurs after spaces (e.g. -, —, etc.)
- Capture ellipses and other multiple-character punctuation (e.g. ?!, --, etc.)
- Capture URLs as one word
Most of these should be doable by rewriting WordBuilder. Currently, a new WordBuilder is initialized for each character; if we instead initialize one per word and pass it each new character, letting it build up the word and set or clear punctuation as the word's shape changes, we should be able to handle many of these cases properly.
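As a very rough sketch of that direction (this is not the current WordBuilder implementation; the class and method names are made up for illustration):

class PerWordBuilder
  attr_reader :text, :trailing_punctuation

  def initialize
    @text = +""
    @trailing_punctuation = +""
  end

  # Feed the builder one character at a time: letters extend the word,
  # while anything else accumulates as trailing punctuation. Because the
  # same builder sees every character of the word, multi-character marks
  # like "?!", "--", or an ellipsis build up naturally.
  def receive(char)
    if char.match?(/[[:alnum:]']/)
      @text << char
    else
      @trailing_punctuation << char
    end
  end
end

builder = PerWordBuilder.new
"night?!".each_char { |c| builder.receive(c) }
builder.text                  # => "night"
builder.trailing_punctuation  # => "?!"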
Development
After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/arsduo/tokeneyes. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.
License
The gem is available as open source under the terms of the MIT License.