Bling Fire Ruby

Bling Fire - high-speed text tokenization - for Ruby

Installation

Add this line to your application’s Gemfile:

gem "blingfire"

Getting Started

Create a model

model = BlingFire::Model.new

Tokenize words

model.text_to_words(text)

Tokenize sentences

model.text_to_sentences(text)

Get offsets for words

words, start_offsets, end_offsets = model.text_to_words_with_offsets(text)

Get offsets for sentences

sentences, start_offsets, end_offsets = model.text_to_sentences_with_offsets(text)
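
The offset arrays can be used to map each word or sentence back to its position in the original text. The following is a minimal sketch, not part of the gem: `slice_by_offsets` is a hypothetical helper, and it assumes the offsets are character indexes with exclusive end offsets. Check this against the output of your installed version and pass `exclusive_end: false` if the end offsets turn out to be inclusive. The sample arrays are stand-in data so the sketch runs without the gem.

```ruby
# Hypothetical helper (not part of the blingfire gem): slice the original
# text using parallel start/end offset arrays.
# Assumption: offsets are character indexes and end offsets are exclusive.
# Pass exclusive_end: false if your version returns inclusive end offsets.
def slice_by_offsets(text, start_offsets, end_offsets, exclusive_end: true)
  unless start_offsets.length == end_offsets.length
    raise ArgumentError, "offset arrays must have the same length"
  end

  start_offsets.zip(end_offsets).map do |start_pos, end_pos|
    exclusive_end ? text[start_pos...end_pos] : text[start_pos..end_pos]
  end
end

# Stand-in data so the sketch runs without the gem. With the gem installed,
# these arrays would come from model.text_to_words_with_offsets(text).
text = "Hello world"
start_offsets = [0, 6]
end_offsets = [5, 11]

spans = slice_by_offsets(text, start_offsets, end_offsets)
```

Comparing `spans` with the returned words is a quick way to confirm which offset convention applies.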

Pre-trained Models

Bling Fire comes with a default model that follows the tokenization logic of NLTK with a few changes. You can also download other models:

BERT Base, BERT Base Cased, BERT Chinese, BERT Multilingual Cased
GPT-2
Laser 100k, Laser 250k, Laser 500k
RoBERTa
Syllab
URI 100k, URI 250k, URI 500k
XLM-RoBERTa
XLNet, XLNet No Norm
WBD

Load a model

model = BlingFire.load_model("bert_base_tok.bin")

Convert text to ids

model.text_to_ids(text)

Get offsets for ids

ids, start_offsets, end_offsets = model.text_to_ids_with_offsets(text)
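
When ids need to be traced back to the input, it can help to keep each id together with its offsets. The sketch below is illustrative only: `TokenSpan` and `align_ids` are hypothetical names, not part of the gem, and it assumes the three returned arrays have the same length. The ids and offsets shown are made-up values, not real model output.

```ruby
# Hypothetical record type for illustration; not part of the blingfire gem.
TokenSpan = Struct.new(:id, :start_offset, :end_offset)

# Combine parallel arrays into one record per id.
# Assumption: ids, start_offsets, and end_offsets have equal lengths.
def align_ids(ids, start_offsets, end_offsets)
  unless ids.length == start_offsets.length && ids.length == end_offsets.length
    raise ArgumentError, "ids and offsets must have the same length"
  end

  ids.zip(start_offsets, end_offsets).map do |id, start_pos, end_pos|
    TokenSpan.new(id, start_pos, end_pos)
  end
end

# Made-up values so the sketch runs without the gem or a model file. With
# the gem, they would come from model.text_to_ids_with_offsets(text).
ids = [101, 202]
start_offsets = [0, 6]
end_offsets = [5, 11]

aligned = align_ids(ids, start_offsets, end_offsets)
```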

Disable prefix space

model = BlingFire.load_model("roberta.bin", prefix: false)

Ids to Text

Load a model

model = BlingFire.load_model("bert_base_tok.i2w")

Convert ids to text

model.ids_to_text(ids)

History

View the changelog

Contributing

Everyone is encouraged to help improve this project. Here are a few ways you can help:

Report bugs
Fix bugs and submit pull requests
Write, clarify, or fix documentation
Suggest or add new features

To get started with development:

git clone https://github.com/ankane/blingfire-ruby.git
cd blingfire-ruby
bundle install
bundle exec rake vendor:all download:models
bundle exec rake test
