TactfulTokenizer uses a naive Bayesian model trained on the Brown and WSJ corpora to provide high-quality sentence tokenization.
 Dependencies

Development

~> 10.3.1
~> 2.14.1
 Project Readme

TactfulTokenizer


TactfulTokenizer is a Ruby library for high-quality sentence tokenization. It uses a naive Bayesian statistical model and is based on Splitta, but adds support for ‘?’ and ‘!’ as sentence terminators, as well as primitive handling of XHTML markup. Better support for XHTML parsing is coming shortly.

It also supports tokenization of Unicode text.
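
To give a rough feel for how a naive Bayesian model can decide whether a punctuation mark ends a sentence, here is a toy sketch. It is not TactfulTokenizer's implementation: the class `ToyBoundaryClassifier`, the helper `features_for`, the two features, and the five training examples are all made up for illustration, and the real model is trained on far more data with its own feature set.

```ruby
# Toy naive Bayes classifier for sentence-boundary candidates.
# Illustrative only: names, features, and training data are assumptions,
# not part of the tactful_tokenizer gem.
class ToyBoundaryClassifier
  def initialize
    @label_counts = Hash.new(0)
    @feature_counts = Hash.new { |h, k| h[k] = Hash.new(0) }
    @vocabulary = {}
  end

  # Record one labelled example (label is :boundary or :no_boundary).
  def train(features, label)
    @label_counts[label] += 1
    features.each do |f|
      @feature_counts[label][f] += 1
      @vocabulary[f] = true
    end
  end

  # log P(label) + sum of log P(feature | label), with add-one smoothing.
  def log_score(features, label)
    total = @label_counts.values.sum
    score = Math.log(@label_counts[label].to_f / total)
    denom = @feature_counts[label].values.sum + @vocabulary.size
    features.each do |f|
      score += Math.log((@feature_counts[label][f] + 1).to_f / denom)
    end
    score
  end

  def boundary?(features)
    log_score(features, :boundary) > log_score(features, :no_boundary)
  end
end

# Features for a candidate split between the token `left` (ending in
# punctuation) and the token `right` that follows it.
def features_for(left, right)
  ["left=#{left.downcase}", "right_capitalized=#{right.match?(/\A[A-Z]/)}"]
end

classifier = ToyBoundaryClassifier.new
classifier.train(features_for("U.S.", "Senate"), :no_boundary)
classifier.train(features_for("Mr.", "Smith"), :no_boundary)
classifier.train(features_for("e.g.", "the"), :no_boundary)
classifier.train(features_for("friends.", "Is"), :boundary)
classifier.train(features_for("way?", "Yes"), :boundary)

classifier.boundary?(features_for("Mr.", "Jones"))    #=> false
classifier.boundary?(features_for("friends.", "Is"))  #=> true
```

The point of the sketch is only that each candidate is scored under both labels and the more probable label wins. Abbreviations such as "Mr." end up classified as non-boundaries because that is how they appear in the training examples.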

Usage

require "tactful_tokenizer"
m = TactfulTokenizer::Model.new
m.tokenize_text("Here in the U.S. Senate we prefer to eat our friends. Is it easier that way? <em>Yes.</em> <em>Maybe</em>!")
#=> ["Here in the U.S. Senate we prefer to eat our friends.", "Is it easier that way?", "<em>Yes.</em>", "<em>Maybe</em>!"]

The input text is expected to consist of paragraphs delimited by line breaks.
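
If your text is hard-wrapped, with paragraphs separated by blank lines, you may want to put each paragraph on a single line first. The helper below is a minimal sketch of that preprocessing step. `paragraphs_on_single_lines` is a hypothetical name, not part of the gem, and it assumes that "delimited by line breaks" means one paragraph per line.

```ruby
# Hypothetical preprocessing helper (not part of tactful_tokenizer):
# joins hard-wrapped lines so that each paragraph occupies one line.
def paragraphs_on_single_lines(text)
  text
    .split(/\n\s*\n/)                     # blank lines separate paragraphs
    .map { |para| para.split.join(" ") }  # collapse newlines and runs of spaces
    .reject(&:empty?)                     # drop empty paragraphs
    .join("\n")                           # one paragraph per line
end

wrapped = "First sentence. Second\nsentence here.\n\nNew paragraph\nstarts here."
paragraphs_on_single_lines(wrapped)
#=> "First sentence. Second sentence here.\nNew paragraph starts here."
```

The resulting string can then be passed to `tokenize_text` as in the usage example above.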

Installation

gem install tactful_tokenizer

Author

Copyright © 2010 Matthew Bunday. All rights reserved. Released under the GNU GPL v3.