te_rex

No commit activity in last 3 years
No release in over 3 years
Simple text processing for small data sets.
 Dependencies

Development

>= 1.5.3, ~> 1.5
>= 0.10.1, ~> 0.10
>= 1.4.1, ~> 1.4
>= 0.4.9.1, ~> 0.4
>= 0.5.5, ~> 0.5
>= 10.3.2, ~> 10.3

Runtime

>= 1.0.2, ~> 1.0
 Project Readme

TeRex

te_rex API Documentation

A lot of power with a little reach.

A little reach with a big bite.

If for some reason you have been using this, the change from version 0.5 to 0.8 broke the initialization of the classifier. See the git tags, tests, and examples.

The Bayes classifier is written to solve some small domain-specific problems. This means it is not a classifier to be used for general problems where the domain context is unknown or general, or where large data sets are required. In other words, you shouldn't use this gem.

I'm using Bayesian analysis because I wanted something that worked pretty well, was easy to implement, had a Ruby runtime, and most importantly DID NOT REQUIRE VECTOR/MATRIX MULTIPLICATION (mostly because I don't want to introduce a dependency on Ruby C extensions for matrix multiplication, probably the GNU Scientific Library; though I am using fast-stemmer, which is a C extension).
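
The appeal is that naive Bayes boils down to keeping per-category word counts and summing (log) likelihoods. A throwaway sketch of that idea in plain Ruby, which is not the gem's implementation, looks roughly like this:

# Illustrative only -- not how TeRex::Classifier::Bayes is written.
# Naive Bayes needs nothing beyond word counts per category; no vectors or matrices anywhere.
COUNTS = Hash.new { |h, k| h[k] = Hash.new(0) }  # category => { word => count }
TOTALS = Hash.new(0)                             # category => total word count

def nb_train(category, text)
  text.downcase.scan(/[a-z']+/).each do |word|
    COUNTS[category][word] += 1
    TOTALS[category] += 1
  end
end

def nb_classify(text)
  words = text.downcase.scan(/[a-z']+/)
  COUNTS.keys.max_by do |category|
    # add-one smoothing, summed in log space to avoid float underflow
    words.sum { |w| Math.log((COUNTS[category][w] + 1.0) / (TOTALS[category] + 1.0)) }
  end
end

nb_train("Refund", "you will get a full refund")
nb_train("Nonrefund", "there are no refunds available")
nb_classify("full refund for you")  # => "Refund"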

Bayes

This Bayes classifier was written specifically to classify cancellation policies and error messages from hotel reservation providers. This doesn't mean it can't be used for other things, but it does mean it should NOT be used as a general solution for text classification.

The small-domain focus of this classifier can best be gleaned from the BayesData class. It cleans the text in a way specific to the goals I had in mind.
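
To give a feel for the sort of cleanup meant here, a hypothetical cleaning step might downcase, strip punctuation and digits, drop stopwords (including weekday and month names, see Stopwords below), and stem with fast-stemmer. The stop list and helper below are made up for illustration; the real rules live in BayesData:

require 'fast_stemmer'   # the C-extension stemmer the gem already depends on

# Made-up stop list; the gem's own list also carries weekday/month names.
STOP = %w[a an the is are was were be been by will not no nov wed monday]

def clean(text)
  text.downcase
      .gsub(/[^a-z\s]/, ' ')            # strip punctuation, digits, symbols
      .split
      .reject { |w| STOP.include?(w) }  # drop stopwords
      .map(&:stem)                      # Porter-stem each remaining token
end

clean("Cancellation requests received by Wed, Nov 3 are non-refundable.")
# tokens come back downcased, stopword-filtered, and stemmed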

Tests

Just run pry-test. For tests against some pre-built larger corpora, which I consider the full test suite, you'll want to switch to the testing branch... then run the tests just as you would here in master: mt.

Testing is pretty liberal. The training and test data sets are small and a bit ambiguous in nature. Because of this, the classification tests can accept multiple categories that are similar in nature (for example, CreditServiceError and CreditDeclineError). Also, it is assumed people aren't really using this gem AND if you are using it IT IS ASSUMED YOU ARE TESTING CLASSIFICATION AGAINST YOUR OWN DATA SETS. Therefore, I didn't see much need in providing strict classification tests, which are easy enough to construct in order to give the impression the classifier is working really well.

Also, the tests should be run multiple times, given that the input is not the same for each run and the outputs are the result of statistical functions.

Usage

For usage see the tests, though here is a snippet below. Also, if you don't know what Bayesian classification is you should probably check it out (just google it): your classifier is only as good as your training data and training methods!

cls = TeRex::Classifier::Bayes.new(
  {:tag => "Refund", :msg => "You'll get a refund"},
  {:tag => "Nonrefund", :msg => "You won't get a refund"}
)

["You will get a refund.","Full refund for you!","You will receive a full refund.","You may only get a partial refund."].each {|txt| cls.train("Refund", txt)}
["You will not get a refund.","There are no refunds.","Refunds not available.","You will not get a refund."].each {|txt| cls.train("Nonrefund", txt)}


cls.classify("We understand that you work hard for your money, but we will not give you a refund.")

Examples

The corpus builder is mostly used to test the classification on larger data sets. I need to verify the classifier actually works to some degree, and so running it against some well-known corpora and comparing results with other classifiers provides feedback on te_rex.

Corpus builder

pos_corpus = TeRex::Corpus::Body.new(glob: '/Users/jbowles/x/training_data/corpora/words/en*', format_klass: TeRex::Format::BasicFile)
pos_corpus.build

# Then look at what you've got:
pos_corpus.training.count #Array of sentences
pos_corpus.testing.count  #Array of sentences
pos_corpus.files.count    #Array of file paths
pos_corpus.sample_size    #total files multiplied by 0.75; used to split files for test/train (0.25 for test, rest for train)


#Or you could do all this
pos_corpus.get_files
pos_train = pos_corpus.partition_train
pos_test = pos_corpus.partition_test
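
To connect the corpus builder back to the classifier, a rough end-to-end run might look like the sketch below. It assumes the training/testing partitions yield sentence strings (as the comments above suggest), that classify returns the winning tag, and it uses made-up paths and seed messages:

pos_corpus = TeRex::Corpus::Body.new(glob: '/path/to/corpora/movie/pos/*', format_klass: TeRex::Format::BasicFile)
neg_corpus = TeRex::Corpus::Body.new(glob: '/path/to/corpora/movie/neg/*', format_klass: TeRex::Format::BasicFile)
pos_corpus.build
neg_corpus.build

cls = TeRex::Classifier::Bayes.new(
  {:tag => "Positive", :msg => "a wonderful film"},
  {:tag => "Negative", :msg => "a terrible film"}
)

pos_corpus.training.each { |txt| cls.train("Positive", txt) }
neg_corpus.training.each { |txt| cls.train("Negative", txt) }

# Crude accuracy check on the held-out sentences.
hits  = pos_corpus.testing.count { |txt| cls.classify(txt) == "Positive" }
hits += neg_corpus.testing.count { |txt| cls.classify(txt) == "Negative" }
total = pos_corpus.testing.size + neg_corpus.testing.size
puts "accuracy: #{(hits.to_f / total).round(3)}"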

Stopwords

A class is provided so you can append to or delete from the stop list if needed. I typically go for smaller stop lists rather than larger, and this one is no exception. However, due to the custom nature of this classifier the stop list also contains weekday and month names with the usual abbreviations (e.g., nov, november, wed, monday, ...).
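
For illustration only, customizing the stop list might look roughly like this; the class and accessor names here are hypothetical, so check the gem source for the real ones:

# Hypothetical names -- the actual class and accessor may differ in the gem.
stopwords = TeRex::Stopword.new   # assumed class name
stopwords.list << "gratuity"      # assumed Array-like accessor; add a domain term
stopwords.list.delete("monday")   # keep weekday names if your data relies on them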

Corpora

Some notes on corpora to download for testing.

Brown and Movie datasets

Downloadable from the NLTK svn trunk index.

The Brown corpus is for part-of-speech tagging and is not organized for text classification.

The Movie Review corpus is labelled as:

  • pos (1000 text files) -- Positive
  • neg (1000 text files) -- Negative

4UDS dataset

Downloadable from the CMU 4 Universities Data Set.

Labelled as:

  • student (1641 html files)
  • faculty (1124 html files)
  • staff (137 html files)
  • department (182 html files)
  • course (930 html files)
  • project (504 html files)
  • other (3764 html files)

They suggest "Since each university's web pages have their own idiosyncrasies, we do not recommend training and testing on pages from the same university. We recommend training on three of the universities plus the misc collection, and testing on the pages from a fourth, held-out university. There is a simple Perl script for creating a directory structure, which should make it easier to do this four-fold cross validation. No guarantees."

I've organized the directory as corpora/4uds/COLLEGE/LABEL. So far I only have staff and student moved over.
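
A plain-Ruby sketch of the hold-one-university-out split they suggest, against the corpora/4uds/COLLEGE/LABEL layout described above (only the directory convention is taken from this README; the rest is an assumption):

root     = 'corpora/4uds'
colleges = Dir.glob("#{root}/*").select { |p| File.directory?(p) }.map { |p| File.basename(p) }

colleges.each do |held_out|
  rest  = colleges - [held_out]
  train = Dir.glob("#{root}/{#{rest.join(',')}}/*/*")   # the other universities (plus misc, if present)
  test  = Dir.glob("#{root}/#{held_out}/*/*")           # the held-out university
  # each file's label is its parent directory, e.g. File.basename(File.dirname(path))
  puts "hold out #{held_out}: #{train.size} train files / #{test.size} test files"
end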

20Newsgroups

Downloadable from the 20Newsgroups Homepage.

Labelled as: