Secryst
Purpose
A seq2seq transformer suited for transliteration. Written in Ruby.
Secryst was originally built for the Interscript project on GitHub.
The goal is to allow:
- Developers to train models and provide the trained models to users. In order to train models, raw computing resources and their bindings can be used, e.g. CUDA or OpenCL.
- Users of the library in Ruby who only want to use trained models to do so without requiring special bindings.
Introduction
Secryst works with a number of Romanization and transliteration systems.
It is composed of two separate pieces of software (Ruby gems):
- Secryst Translator: the secryst gem, for users of trained models
- Secryst Trainer: the secryst-trainer gem, for users who wish to train models
Secryst models are platform-independent, interoperable and transferable across installations.
There are two types of Secryst models:
- Secryst ONNX model (recommended): can be run on any platform without libtorch (which requires installation of development tools).
- Secryst Torch model: requires libtorch to run. Secryst Torch models are trained in Ruby and can be converted to Secryst ONNX models via Python using the instructions below.
Secryst Trainer also supports checkpoint resumption.
Examples
Under the examples/ directory the following systems are provided:
- examples/khm-latn: Khmer Romanization
- examples/arm-latn: Armenian Romanization based on Wikipedia data
Prerequisites
Secryst Translator
Basic (usage of Secryst ONNX models only)
If you only need to use Secryst ONNX models, there is no need to install libtorch or development tools.
On macOS and Ubuntu:
$ gem install bundler
$ bundle install
To use Secryst Torch models
Usage of Secryst Torch models requires installation of the following packages:
- libtorch (1.8.1)
- fftw
- gsl
You must also explicitly add the torch-rb gem to your Gemfile:
gem 'torch-rb', '~> 0.6'
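For reference, a minimal Gemfile could look like the following (a sketch; adjust the gem list to your own project):
# Gemfile
source 'https://rubygems.org'

gem 'secryst'
gem 'torch-rb', '~> 0.6'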
On Ubuntu:
$ sudo apt-get -y install libfftw3-dev libgsl-dev unzip automake make gcc g++
$ wget https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.8.1%2Bcpu.zip
$ unzip libtorch-cxx11-abi-shared-with-deps-1.8.1%2Bcpu.zip
$ gem install bundler
$ bundle config build.torch-rb \
--with-torch-dir=$(pwd)/libtorch
$ bundle install
On Fedora:
$ sudo dnf -y install ruby-devel gsl-devel fftw-devel rubygem-bundler unzip \
automake make gcc gcc-c++
$ git clone --recurse-submodules --depth=5 -b release/1.7 https://github.com/pytorch/pytorch.git
$ pushd pytorch
$ python setup.py install
$ popd
$ TORCH_DIR=$HOME/.local/lib/python3.9/site-packages/torch # Ensure this dir exists
$ ln -s lib $TORCH_DIR/lib64
$ bundle config build.torch-rb \
--with-torch-dir=$TORCH_DIR
$ bundle install
On macOS:
$ brew install libtorch gsl fftw automake gcc
$ gem install bundler
$ bundle config build.torch-rb \
--with-torch-dir=$(brew --prefix libtorch)
$ bundle install
Secryst Trainer
In order to use Secryst Trainer, two additional components are necessary:
- lapack
- openblas
On Ubuntu:
$ sudo apt-get -y install libfftw3-dev libgsl-dev libopenblas-dev \
liblapack-dev liblapacke-dev unzip automake make gcc g++
$ wget https://download.pytorch.org/libtorch/cu111/libtorch-cxx11-abi-shared-with-deps-1.8.1%2Bcu111.zip
$ unzip libtorch-cxx11-abi-shared-with-deps-1.8.1%2Bcu111.zip
$ gem install bundler
$ bundle config build.torch-rb \
--with-torch-dir=$(pwd)/libtorch
$ bundle install
# To enable ONNX training, you must also install the Python portions
$ pip3 install -r requirements.txt
On macOS:
$ brew install libtorch gsl lapack openblas fftw automake gcc
$ gem install bundler
$ bundle config build.numo-linalg \
--with-openblas-dir=$(brew --prefix openblas) \
--with-lapack-lib=$(brew --prefix lapack)
$ bundle config build.torch-rb \
--with-torch-dir=$(brew --prefix libtorch)
$ bundle install
# To enable ONNX training, you must also install the Python portions
$ pip3 install -r requirements.txt
Note (for macOS): If you mistakenly installed numo-linalg without the above configuration options, uninstall it with the following command and configure the bundle as described above:
$ bundle exec gem uninstall numo-linalg
Usage
Secryst provides a CLI for training models and re-using trained models.
Using trained models
You will need to install the secryst gem (prerequisites must be fulfilled):
$ gem install secryst
To utilize a trained model:
# Transform all individual lines of `--input_text_file`.
# Specifying:
# - trained model zip archive at `--model-file`.
# Must include `metadata.yaml`, `vocabs.yaml` and
# a `.pth` or `.onnx` model file.
secryst translate \
--input_text_file=examples/to-translate.txt \
--model-file=examples/checkpoints/checkpoint-500.zip
Both Secryst ONNX models and Secryst Torch models can be used with this command.
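The gem can also be used programmatically. The following is a minimal sketch assuming the gem exposes a translator class along these lines; the class and method names here are assumptions, so consult the gem's own documentation for the exact API:
require 'secryst'

# Hypothetical API: load a trained model archive and translate a phrase.
translator = Secryst::Translator.new(model_file: 'examples/checkpoints/checkpoint-500.zip')
puts translator.translate('input text')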
Training models in Ruby (output: Secryst Torch model)
Secryst supports training models in Ruby into the Secryst Torch model format.
The created models can then be used by other users through the secryst gem.
Note: To make a trained Secryst model available for all platforms, you should convert the Secryst Torch model into a Secryst ONNX model.
You will need to install the secryst-trainer gem (prerequisites must be fulfilled):
$ gem install secryst-trainer
Note: The secryst gem will be automatically installed alongside secryst-trainer.
Training a typical model:
# Train all individual lines of the file specified in `-i` to the
# corresponding line in target `-t`.
#
# Specifying:
# - `max-epochs` specifies how many epochs training will be run
# - `log-interval` specifies how often Secryst should report on
# learning parameters.
# - `checkpoint-every` indicates how often Secryst saves a checkpoint
# file to `checkpoint_dir`, in the format `checkpoint-{epoch}.zip`.
# - `checkpoint_dir` specifies the directory to store checkpoint files.
# If some checkpoints are already in the directory, training will
# continue from the latest one.
secryst-trainer train \
-i 'data/khm-latn-small/input.csv' \
-t 'data/khm-latn-small/target.csv' \
--max-epochs=500 \
--log-interval=1 \
--checkpoint-every=50 \
--checkpoint_dir=examples/checkpoints
Training with all options:
# Train all individual lines of the file specified in `-i` to the
# corresponding line in target `-t`.
#
# Specifying:
# - `batch-size` specifies the batch size for training
# - `max-epochs` specifies how many epochs training will be run
# - `log-interval` specifies how often Secryst should report on
# learning parameters.
# - `checkpoint-every` indicates how often Secryst saves a checkpoint
# file to `checkpoint_dir`, in the format `checkpoint-{epoch}.zip`.
# - `checkpoint_dir` specifies the directory to store checkpoint files.
# If some checkpoints are already in the directory, training will
# continue from the latest one.
# - `gamma` specifies the gamma value used by the learning-rate scheduler
# - `-h` passes model hyperparameters as key:value pairs
secryst-trainer train --model=transformer \
-i 'data/khm-latn-small/input.csv' \
-t 'data/khm-latn-small/target.csv' \
--batch-size=32 \
--max-epochs=500 \
--log-interval=1 \
--checkpoint-every=50 \
--checkpoint_dir=checkpoints \
--gamma=0.2 \
-h d_model:64 nhead:8 num_encoder_layers:4 num_decoder_layers:4 \
dim_feedforward:256 dropout:0.05 activation:relu
Convert Secryst Torch models to Secryst ONNX models
Because libtorch's C++ interface cannot export trained models to ONNX, Secryst uses PyTorch to generate Secryst ONNX models.
First, clone this repository.
To convert a Secryst Torch model to a Secryst ONNX model, run:
python3 python/pth_to_onnx.py checkpoint.zip output.zip
The trained Secryst ONNX model can be used as usual:
bundle exec secryst translate --model-file output.zip -t texts.txt
Resuming training
Secryst Trainer supports checkpoint resumption.
It detects whether checkpoint model files (named checkpoint-nnn.zip) already exist in the model output directory, and attempts to resume training from the latest one. Note that when resuming training you must use identical parameters and the identical training dataset, otherwise the process will throw an error.
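For example, re-running the exact training command from above against the same checkpoint directory resumes from the newest checkpoint (the epoch number here is a hypothetical illustration):
# Assuming checkpoints up to checkpoint-250.zip already exist in
# examples/checkpoints, this run resumes training at epoch 250.
secryst-trainer train \
-i 'data/khm-latn-small/input.csv' \
-t 'data/khm-latn-small/target.csv' \
--max-epochs=500 \
--log-interval=1 \
--checkpoint-every=50 \
--checkpoint_dir=examples/checkpoints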
Importing non-Secryst ONNX models
You can also use ONNX models trained outside of Secryst.
You need to prepare a zip file with:
- An .onnx model file
- The vocabs.yaml file
The vocabs.yaml file has to contain two keys, input and target (it's okay if they are the same), each listing all vocabulary tokens in their original order.
Like this:
input:
- "[UNK]"
- ...
target:
- "[UNK]"
- ...
Then use the model as usual, as described above; a packaging sketch follows below. You can find a complete example in the examples folder (onnx_import.rb).
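As a minimal sketch of the packaging step, the archive can be assembled with the rubyzip gem (the file names here are placeholders for your own model and vocabulary files):
require 'zip' # rubyzip gem

# Bundle an externally trained ONNX model and its vocabs.yaml into
# a zip archive that Secryst can load via --model-file.
Zip::File.open('my-model.zip', Zip::File::CREATE) do |zipfile|
  zipfile.add('model.onnx', 'path/to/model.onnx')
  zipfile.add('vocabs.yaml', 'path/to/vocabs.yaml')
end
The resulting my-model.zip can then be passed to secryst translate via --model-file.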
Training on GPU
To allow training on CUDA devices, Secryst ships a Python trainer.
Install Python 3.8 and the required packages:
pip3 install -r requirements.txt
Then start the training (all options are the same as for the Ruby trainer):
python3 python/train.py -i 'data/khm-latn-small/input.csv' \
-t 'data/khm-latn-small/target.csv' \
--max-epochs=500 \
--log-interval=1 \
--checkpoint-every=50 \
--checkpoint-dir=examples/checkpoints
Examples
The Khmer transliteration system is implemented as an example.
To run the training:
$ bundle exec examples/training.rb
To run translations through the transformer:
$ bundle exec examples/translating.rb
- Checkpoint files are generated as examples/checkpoints/*.zip
- Each checkpoint includes metadata.yaml, model.pth and vocabs.yaml files
References
Secryst is built on the transformer model with architecture based on:
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017. In: Advances in Neural Information Processing Systems, pages 6000-6010.
Origin of name
Scrying is the practice of peering into a crystal sphere for fortune telling.
The purpose of seq2seq is much like scrying: looking into a crystal sphere for some machine-learning magic to happen.
“Secryst” comes from the combination of “seq2seq” + “crystal” + “scrying”.