Tokenizers Ruby
🙂 Fast state-of-the-art tokenizers for Ruby
Installation
Add this line to your application’s Gemfile:
gem "tokenizers"
Getting Started
Load a pretrained tokenizer
tokenizer = Tokenizers.from_pretrained("bert-base-cased")
Encode
encoded = tokenizer.encode("I can feel the magic, can you?")
encoded.tokens
encoded.ids
Decode
tokenizer.decode(encoded.ids)
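The encode and decode steps above can be combined into one round trip. The sketch below is only an illustration: `round_trip` is a hypothetical helper, not part of the gem, and it assumes nothing beyond the `encode`, `tokens`, `ids`, and `decode` calls shown above.

```ruby
# Hypothetical helper (not part of the tokenizers gem).
# Encodes the text, then decodes the resulting IDs, and returns all three
# pieces so they can be compared side by side.
# Works with any object that responds to encode and decode as shown above.
def round_trip(tokenizer, text)
  encoded = tokenizer.encode(text)
  {
    tokens: encoded.tokens,                 # token strings
    ids: encoded.ids,                       # integer IDs, one per token
    decoded: tokenizer.decode(encoded.ids)  # text rebuilt from the IDs
  }
end
```

With the gem loaded, `round_trip(Tokenizers.from_pretrained("bert-base-cased"), "I can feel the magic, can you?")` should return a hash with the tokens, the IDs, and the decoded string. The decoded text may not match the input character for character, depending on the tokenizer.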
Training
Create a tokenizer
tokenizer = Tokenizers::Tokenizer.new(Tokenizers::Models::BPE.new(unk_token: "[UNK]"))
Set the pre-tokenizer
tokenizer.pre_tokenizer = Tokenizers::PreTokenizers::Whitespace.new
Train the tokenizer (example data)
trainer = Tokenizers::Trainers::BpeTrainer.new(special_tokens: ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer)
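Training reads the listed files from disk, so it can help to confirm they are present before calling `train`. The following is a minimal sketch in plain Ruby. `missing_training_files` is a hypothetical helper, not part of the gem, and the file names are the example data from above.

```ruby
# Hypothetical helper (not part of the tokenizers gem).
# Returns the paths that do not point to an existing regular file.
def missing_training_files(paths)
  paths.reject { |path| File.file?(path) }
end

files = ["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"]
missing = missing_training_files(files)

if missing.empty?
  puts "All training files found"
else
  warn "Missing training files: #{missing.join(", ")}"
end
```

If nothing is missing, the same `files` array can be passed to `tokenizer.train(files, trainer)`.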
Encode
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
output.tokens
Save the tokenizer to a file
tokenizer.save("tokenizer.json")
Load a tokenizer from a file
tokenizer = Tokenizers.from_file("tokenizer.json")
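The saved file is JSON, so it can be inspected with Ruby's standard library before loading it. This sketch assumes the file holds a single JSON object. `tokenizer_file_keys` is a hypothetical helper, not part of the gem, and it makes no assumption about which keys are present.

```ruby
require "json"

# Hypothetical helper (not part of the tokenizers gem).
# Parses a saved tokenizer file and returns its top-level keys.
# Raises JSON::ParserError if the file is not valid JSON.
def tokenizer_file_keys(path)
  JSON.parse(File.read(path)).keys
end
```

For example, `tokenizer_file_keys("tokenizer.json")` lists the sections written by `tokenizer.save`, which is a quick way to confirm the file was written completely.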
Check out the Quicktour and equivalent Ruby code for more info.
API
This library follows the Tokenizers Python API. You can follow Python tutorials and convert the code to Ruby in many cases. Feel free to open an issue if you run into problems.
History
View the changelog
Contributing
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features
To get started with development:
git clone https://github.com/ankane/tokenizers-ruby.git
cd tokenizers-ruby
bundle install
bundle exec rake compile
bundle exec rake download:files
bundle exec rake test