NlpArabic
This gem provides tools for Arabic natural language processing. As of version 0.1, it allows you to:
- Clean a text using a stop list. This stop list was generated using the tf-idf score calculated on words from over 900 articles. The selected words were then checked and validated by hand, which resulted in a stop list of over 270 words.
- Stem a word or a text. The stemming algorithm used is the ISRI Arabic stemmer, described in the research paper "Arabic Stemming without a root dictionary". This root-extraction stemmer is similar to the Khoja stemmer but does not use a root dictionary, which can be laborious to maintain. Also, when the root cannot be found, the ISRI stemmer returns a normalized form rather than the original unmodified form. Overall, the ISRI stemmer is reported to perform as well as, if not better than, the Khoja stemmer.
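How the tf-idf scores behind the stop list were turned into a word selection is not spelled out here, so the following is only a minimal sketch of one plausible approach: score each word by its mean tf-idf over a corpus and treat the lowest-scoring words as candidates for manual review. The method name stop_word_candidates, the plain log(N/df) idf formula, and the three-sentence toy corpus are illustrative assumptions, not part of the gem.

```ruby
# Hypothetical helper (not part of nlp_arabic): ranks words by their mean
# tf-idf over a corpus and returns the lowest-scoring ones, which are the
# natural stop-word candidates to review by hand.
def stop_word_candidates(documents, limit)
  tokenized = documents.map(&:split).reject(&:empty?)
  n_docs = tokenized.size.to_f

  # Document frequency: the number of documents each word appears in.
  doc_freq = Hash.new(0)
  tokenized.each { |words| words.uniq.each { |word| doc_freq[word] += 1 } }

  # Accumulate tf-idf per word; a document without the word contributes 0.
  totals = Hash.new(0.0)
  tokenized.each do |words|
    counts = Hash.new(0)
    words.each { |word| counts[word] += 1 }
    counts.each do |word, count|
      tf = count / words.size.to_f
      idf = Math.log(n_docs / doc_freq[word])
      totals[word] += tf * idf
    end
  end

  # Lowest mean score first; ties are broken by comparing the words themselves.
  ranked = totals.sort_by { |word, total| [total / n_docs, word] }
  ranked.first(limit).map(&:first)
end

# Toy corpus: the preposition "في" (in) is the only word shared by all sentences.
corpus = [
  "الولد في البيت",
  "القطة في الحديقة",
  "الكتاب في المكتبة"
]
candidates = stop_word_candidates(corpus, 1)
```

A word that occurs in every document gets an idf of zero, so it sorts to the front of the ranking. That is why low tf-idf is a reasonable first filter before validating the list by hand.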
Installation
Add this line to your application's Gemfile:
gem 'nlp_arabic'
And then execute:
$ bundle
Or install it yourself as:
$ gem install nlp_arabic
Usage
Once installed, you can use it like this:
NlpArabic.clean(text) will return the text without the stop words.
NlpArabic.stem(word) will return the stemmed word.
NlpArabic.stem_text(text) will stem an entire text.
NlpArabic.clean_and_stem(text) will do both.
NlpArabic.wash_and_stem(text) will stem the text, removing stop words and delimiters from it.
NlpArabic.tokenize_text(text) will break the text into an array of words and delimiters.
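To make the tokenizing and cleaning steps more concrete, here is a minimal, self-contained sketch of the general technique. It does not call the gem: sketch_tokenize, sketch_clean, and the three-word SAMPLE_STOP_WORDS list are hypothetical stand-ins, and the gem's actual tokenization rules and stop list may well differ.

```ruby
# Illustrative stand-in for the gem's stop list of over 270 words.
SAMPLE_STOP_WORDS = ["في", "من", "على"].freeze

# Splits a text into words (runs of letters and combining marks) and
# single-character delimiters; whitespace is dropped.
def sketch_tokenize(text)
  text.scan(/[\p{L}\p{M}]+|[^\p{L}\p{M}\s]/)
end

# Removes stop words and joins the remaining tokens with single spaces.
# Note that this also puts a space before any punctuation token.
def sketch_clean(text)
  sketch_tokenize(text).reject { |token| SAMPLE_STOP_WORDS.include?(token) }.join(' ')
end

tokens  = sketch_tokenize("الولد في البيت.")  # three words, then the final "."
cleaned = sketch_clean("الولد في البيت")       # "في" is filtered out
```

Keeping delimiters as separate tokens, as tokenize_text is described as doing, lets later steps decide whether to drop them or put them back.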
Each step of the ISRI algorithm is implemented in a separate function, so you should be able to find any helper you need by browsing the code.
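As an illustration of what one such step might look like, the sketch below shows the normalization commonly described for ISRI-style stemmers: stripping short-vowel diacritics and mapping an initial hamza-carrying alef to a bare alef. The names sketch_normalize, DIACRITICS, and INITIAL_HAMZA are hypothetical, and the gem's own helper may be named differently and apply further rules.

```ruby
# Hypothetical names; this is a sketch, not the gem's implementation.
DIACRITICS    = /[\u064B-\u0652]/          # fathatan through sukun
INITIAL_HAMZA = /\A[\u0622\u0623\u0625]/   # alef with madda, hamza above, hamza below

# Removes diacritics, then replaces a leading hamza form with bare alef (U+0627).
def sketch_normalize(word)
  word.gsub(DIACRITICS, '').sub(INITIAL_HAMZA, "\u0627")
end

# U+0623 U+064E U+0643 U+064E U+0644 U+064E is the fully vowelled word "akala".
normalized = sketch_normalize("\u0623\u064E\u0643\u064E\u0644\u064E")
# normalized is now U+0627 U+0643 U+0644: bare alef, kaf, lam
```

A normalized form of this kind is what the stemmer can fall back on when no root is found.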
Development
After checking out the repo, run bin/console for an interactive prompt that will allow you to experiment. For now the gem doesn't use any dependencies, so you don't need to run bin/setup.
To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release to create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.
Contributing
You are more than welcome to contribute to this project :) Please try to respect the Ruby style guidelines. The default encoding used is UTF-8.
- Fork it ( https://github.com/othmanela/nlp_arabic/fork )
- Create your feature branch (git checkout -b my-new-feature)
- Write unit tests and make sure all of them (including the old ones) pass
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request