Ruby port of TinySegmenter.js for tokenizing Japanese text. Ruby 1.9 or higher required.
Install
gem install tiny_segmenter
or add tiny_segmenter to your Gemfile.
Usage
ts = TinySegmenter.new
ts.segment("今晩は!良い天気ですね。")
# => ["今晩", "は", "!", "良い", "天気", "です", "ね", "。"]
ts.segment("今晩は!良い天気ですね。", ignore_punctuation: true)
# => ["今晩", "は", "良い", "天気", "です", "ね"]
Input text should be UTF-8 encoded.
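If the text arrives in another encoding, it can be transcoded before it is passed to segment. The following is a minimal sketch using Ruby's standard String#encode; the helper name to_utf8 is hypothetical and not part of this gem.

```ruby
# encoding: utf-8

# Hypothetical helper (not part of tiny_segmenter): returns a UTF-8 copy of
# the string, transcoding from whatever encoding the string is tagged with.
def to_utf8(text)
  utf8 = text.encode(Encoding::UTF_8)
  # A string already tagged as UTF-8 is not re-checked by encode,
  # so verify explicitly that its bytes are valid.
  raise ArgumentError, "input is not valid UTF-8" unless utf8.valid_encoding?
  utf8
end

sjis = "今晩は".encode(Encoding::Shift_JIS)  # simulate non-UTF-8 input
text = to_utf8(sjis)
puts text.encoding  # prints "UTF-8"
```

The resulting string can then be handed to ts.segment as in the examples above.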
How it works
The Naive Bayes model was trained on the RWCP corpus and optimized with L1-norm regularization. The resulting model is quite compact, yet (according to the author) it is about 95% accurate.
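To give a feel for how a compact model of this kind can segment text, the sketch below scores each gap between two characters by adding a feature weight to a bias and starts a new token when the score is positive. It is a toy under stated assumptions, not the gem's implementation: the weights, the single character-type feature, and the names char_type and toy_segment are all invented for illustration, whereas the trained model draws on many more features.

```ruby
# encoding: utf-8

# Invented values for illustration only; these are NOT the trained weights.
BIAS = -1

# Weight of the character-type pair around a candidate boundary.
# H = hiragana, K = kanji, O = anything else.
TYPE_PAIR_WEIGHTS = {
  "HK" => 3,   # hiragana then kanji: likely a boundary
  "KH" => -1,  # kanji then hiragana: often the same word
  "KK" => -2,  # kanji then kanji: likely a compound
  "HH" => -1   # hiragana then hiragana
}
TYPE_PAIR_WEIGHTS.default = 0

# Classify a single character into a coarse type.
def char_type(ch)
  case ch
  when /\p{Hiragana}/ then "H"
  when /\p{Han}/ then "K"
  else "O"
  end
end

# Split text wherever the summed score of a gap is positive.
def toy_segment(text)
  chars = text.chars.to_a
  return [] if chars.empty?
  words = [chars.first.dup]
  chars.each_cons(2) do |left, right|
    score = BIAS + TYPE_PAIR_WEIGHTS[char_type(left) + char_type(right)]
    if score > 0
      words << right.dup    # boundary: start a new token
    else
      words.last << right   # no boundary: extend the current token
    end
  end
  words
end

p toy_segment("良い天気")  # => ["良い", "天気"]
```

With only one feature the toy handles very few inputs correctly; the point is that segmentation reduces to cheap table lookups and additions, which is why the model can stay small.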
License
BSD - see http://chasen.org/~taku/software/TinySegmenter/LICENCE.txt