Secryst

Purpose

A seq2seq transformer suited for transliteration. Written in Ruby.

Secryst was originally built for the Interscript project on GitHub.

The goal is to allow:

  • Developers to train models and provide the trained models to users. Training may rely on raw computing frameworks and their bindings, e.g. OpenCL.

  • Ruby users who only want to use the trained models to do so without requiring any special bindings.

Introduction

Secryst works with a number of Romanization and transliteration systems.

It is composed of two separate pieces of software (Ruby gems):

  • Secryst Translator: secryst, for users of trained models

  • Secryst Trainer: secryst-trainer, for users who wish to train models

Secryst models are platform-independent, interoperable and transferable across installations.

There are two types of Secryst models:

Secryst ONNX model (recommended)

These can be run on any platform without libtorch (which requires installation of development tools).

Secryst Torch model

These require libtorch to run. Secryst Torch models are trained in Ruby.

Secryst Torch models can be converted to Secryst ONNX models via Python using instructions below.
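
Both kinds of model ship as the same kind of zip archive (metadata.yaml, vocabs.yaml, plus either a .pth or an .onnx model file, as described in the usage section below), so a script can tell them apart from the extension of the model file inside the archive. Below is a minimal sketch of such a check; it is not part of Secryst itself and assumes the rubyzip gem is installed:

require "zip"

# Report whether a Secryst model archive contains an ONNX or a Torch model,
# based on the extension of the model file inside the zip.
def secryst_model_type(path)
  Zip::File.open(path) do |archive|
    names = archive.map(&:name)
    return :onnx  if names.any? { |n| n.end_with?(".onnx") }
    return :torch if names.any? { |n| n.end_with?(".pth") }
  end
  :unknown
end

puts secryst_model_type("examples/checkpoints/checkpoint-500.zip")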

Secryst Trainer also supports checkpoint resumption.

Examples

Under the examples/ directory, the following systems are provided:

  • examples/khm-latn: Khmer Romanization

  • examples/arm-latn: Armenian Romanization based on Wikipedia data

Prerequisites

Secryst Translator

Basic (usage of Secryst ONNX models only)

If you only need to use Secryst ONNX models, there is no need to install libtorch and development tools.

On macOS and Ubuntu:

$ gem install bundler
$ bundle install

To use Secryst Torch models

Using Secryst Torch models requires libtorch, which in turn requires installation of the following packages:

  • libtorch (1.8.1)

  • fftw

  • gsl

You must also explicitly add the torch-rb gem to your Gemfile:

gem 'torch-rb', '~> 0.6'

On Ubuntu:

$ sudo apt-get -y install libfftw3-dev libgsl-dev unzip automake \
  make gcc g++ libtorch libtorch-dev
$ wget https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.8.1%2Bcpu.zip
$ unzip libtorch-cxx11-abi-shared-with-deps-1.8.1%2Bcpu.zip

$ gem install bundler
$ bundle config build.torch-rb \
    --with-torch-dir=$(pwd)/libtorch

$ bundle install

On Fedora:

$ sudo dnf -y install ruby-devel gsl-devel fftw-devel rubygem-bundler unzip \
  automake make gcc gcc-c++
$ git clone --recurse-submodules --depth=5 -b release/1.7 https://github.com/pytorch/pytorch.git
$ pushd pytorch
$ python setup.py install
$ popd
$ TORCH_DIR=$HOME/.local/lib/python3.9/site-packages/torch # Ensure this dir exists
$ ln -s lib $TORCH_DIR/lib64

$ bundle config build.torch-rb \
    --with-torch-dir=$TORCH_DIR

$ bundle install

On macOS:

$ brew install libtorch gsl fftw automake gcc
$ gem install bundler
$ bundle config build.torch-rb \
    --with-torch-dir=$(brew --prefix libtorch)
$ bundle install

Secryst Trainer

In order to use Secryst Trainer two additional components are necessary:

  • lapack

  • openblas

On Ubuntu:

$ sudo apt-get -y install libfftw3-dev libgsl-dev libopenblas-dev \
    liblapack-dev liblapacke-dev unzip automake make gcc g++ \
    libtorch libtorch-dev
$ wget https://download.pytorch.org/libtorch/cu111/libtorch-cxx11-abi-shared-with-deps-1.8.1%2Bcu111.zip
$ unzip libtorch-cxx11-abi-shared-with-deps-1.8.1%2Bcu111.zip

$ gem install bundler
$ bundle config build.torch-rb \
    --with-torch-dir=$(pwd)/libtorch

$ bundle install

# To enable ONNX training, you must also install the Python portions
$ pip3 install -r requirements.txt

On macOS:

$ brew install libtorch gsl lapack openblas fftw automake gcc

$ gem install bundler
$ bundle config build.numo-linalg \
    --with-openblas-dir=$(brew --prefix openblas) \
    --with-lapack-lib=$(brew --prefix lapack)
$ bundle config build.torch-rb \
    --with-torch-dir=$(brew --prefix libtorch)

$ bundle install

# To enable ONNX training, you must also install the Python portions
$ pip3 install -r requirements.txt
Note
(for macOS) If you mistakenly installed numo-linalg without the above configuration options, uninstall it as follows and then configure the bundle as described above:
$ bundle exec gem uninstall numo-linalg

Usage

Secryst provides a CLI for training models and re-using trained models.

Using trained models

You will need to install the secryst gem (prerequisites must be fulfilled):

$ gem install secryst

To utilize a trained model:

# Transform all individual lines of `--input_text_file`.
# Specifying:
#   - trained model zip archive at `--model-file`.
#     Must include `metadata.yaml`, `vocabs.yaml` and
#     a `.pth` or `.onnx` model file.

secryst translate \
  --input_text_file=examples/to-translate.txt \
  --model-file=examples/checkpoints/checkpoint-500.zip

Both Secryst ONNX models and Secryst Torch models can be used with this command.
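
If you want to drive the same translation from a Ruby script, one option is to shell out to the CLI shown above. Here is a minimal sketch using only the documented flags; the input lines and the model path are placeholders:

require "tempfile"

lines = ["line to transliterate", "another line"]  # placeholder input

Tempfile.create(["to-translate", ".txt"]) do |file|
  file.puts(lines)  # one entry per line, as expected by --input_text_file
  file.flush
  ok = system("secryst", "translate",
              "--input_text_file=#{file.path}",
              "--model-file=examples/checkpoints/checkpoint-500.zip")
  abort "secryst translate failed" unless ok
end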

Training models in Ruby (output: Secryst Torch model)

Secryst supports training models in Ruby into the Secryst Torch model format. The resulting models can then be used by other users through the secryst gem.

Note
To make a trained Secryst model available for all platforms, you should convert the Secryst Torch model into a Secryst ONNX model.

You will need to install the secryst-trainer gem (prerequisites must be fulfilled):

$ gem install secryst-trainer
Note
The secryst gem will be automatically installed alongside secryst-trainer.

Training a typical model:

# Train all individual lines of the file specified in `-i` to the
# corresponding line in target `-t`.
#
# Specifying:
#   - `max-epochs` specifies for how many epochs training will run
#   - `log-interval` specifies how often Secryst should report
#     learning parameters.
#   - `checkpoint-every` indicates how often Secryst saves a checkpoint
#     file to `checkpoint_dir`, in the format `checkpoint-{epoch}.zip`.
#   - `checkpoint_dir` specifies the directory in which checkpoint files are
#     stored. If checkpoints are already present there, training will resume
#     from the latest one.

secryst-trainer train \
  -i 'data/khm-latn-small/input.csv' \
  -t 'data/khm-latn-small/target.csv' \
  --max-epochs=500 \
  --log-interval=1 \
  --checkpoint-every=50 \
  --checkpoint_dir=examples/checkpoints
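
Since each line of the input file is paired with the corresponding line of the target file, it is worth verifying that the two files are line-aligned before starting a long run. A small sanity check in plain Ruby (not part of Secryst), using the paths from the example above:

input_lines  = File.readlines("data/khm-latn-small/input.csv").size
target_lines = File.readlines("data/khm-latn-small/target.csv").size

if input_lines == target_lines
  puts "OK: #{input_lines} aligned input/target pairs"
else
  abort "Mismatch: #{input_lines} input lines vs #{target_lines} target lines"
end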

Training with all options:

# Train all individual lines of the file specified in `-i` to the
# corresponding line in target `-t`.
#
# Specifying:
#   - `batch-size` specifies the batch size for training
#   - `max-epochs` specifies for how many epochs training will run
#   - `log-interval` specifies how often Secryst should report
#     learning parameters.
#   - `checkpoint-every` indicates how often Secryst saves a checkpoint
#     file to `checkpoint_dir`, in the format `checkpoint-{epoch}.zip`.
#   - `checkpoint_dir` specifies the directory in which checkpoint files are
#     stored. If checkpoints are already present there, training will resume
#     from the latest one.
#   - `gamma` specifies the gamma value used
#   - `-h` passes model hyperparameters as key:value pairs

secryst-trainer train --model=transformer \
  -i 'data/khm-latn-small/input.csv' \
  -t 'data/khm-latn-small/target.csv' \
  --batch-size=32 \
  --max-epochs=500 \
  --log-interval=1 \
  --checkpoint-every=50 \
  --checkpoint_dir=checkpoints \
  --gamma=0.2 \
  -h d_model:64 nhead:8 num_encoder_layers:4 num_decoder_layers:4 \
    dim_feedforward:256 dropout:0.05 activation:relu

Convert Secryst Torch models to Secryst ONNX models

Because libtorch's C++ interface cannot export trained models to ONNX, PyTorch has to be used to convert Secryst Torch models into Secryst ONNX models.

Secryst supports generation of Secryst ONNX models using PyTorch.

First, clone this repository.

To convert a Secryst Torch model to a Secryst ONNX model, run:

python3 python/pth_to_onnx.py checkpoint.zip output.zip

The trained Secryst ONNX model can be used as usual:

bundle exec secryst translate --model-file output.zip -t texts.txt

Resuming training

Secryst Trainer supports checkpoint resumption.

Secryst Trainer detects whether checkpoint files (named checkpoint-nnn.zip) already exist in the model output directory and attempts to resume training from the latest one. Note that when resuming training you must use identical parameters and the identical training dataset, otherwise the process will throw an error.
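
To see which checkpoint training would resume from (or to pass the newest checkpoint to secryst translate), the epoch number can be read straight from the file name. A minimal sketch in plain Ruby, relying only on the checkpoint-{epoch}.zip naming described above:

# Return the checkpoint with the highest epoch number in a directory,
# or nil if the directory contains no checkpoints.
def latest_checkpoint(dir)
  Dir.glob(File.join(dir, "checkpoint-*.zip"))
     .max_by { |path| path[/checkpoint-(\d+)\.zip\z/, 1].to_i }
end

puts latest_checkpoint("examples/checkpoints")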

Importing non-Secryst ONNX models

You can also use ONNX models that were trained outside Secryst.

You need to prepare a zip file with:

  • An .onnx model file

  • The vocabs.yaml file

The vocabs.yaml file must contain two keys, input and target (they may be identical), each listing all vocabulary tokens in their original order.

Like this:

input:
- "[UNK]"
- ...
target:
- "[UNK]"
- ...

Then use the model as described above. You can find an example of this in the examples folder (onnx_import.rb).
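
For reference, here is a minimal sketch (not part of Secryst) of packaging such an archive with plain Ruby. It assumes the rubyzip gem is available; the model file name and the token list are placeholders:

require "yaml"
require "zip"

tokens = ["[UNK]", "a", "b", "c"]  # placeholder vocabulary, in its original order

# Write vocabs.yaml with the required `input` and `target` keys.
File.write("vocabs.yaml", { "input" => tokens, "target" => tokens }.to_yaml)

# Package the existing ONNX model and vocabs.yaml into a Secryst-style zip.
Zip::File.open("imported-model.zip", Zip::File::CREATE) do |zip|
  zip.add("model.onnx", "model.onnx")
  zip.add("vocabs.yaml", "vocabs.yaml")
end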

Training on GPU

To allow training on CUDA GPUs, Secryst ships a Python trainer.

Install Python 3.8 and the required packages:

pip3 install -r requirements.txt

Then start training (the options are the same as for the Ruby trainer):

python3 python/train.py -i 'data/khm-latn-small/input.csv' \
  -t 'data/khm-latn-small/target.csv' \
  --max-epochs=500 \
  --log-interval=1 \
  --checkpoint-every=50 \
  --checkpoint-dir=examples/checkpoints

Examples

The Khmer transliteration system is implemented as an example.

To run the training:

$ bundle exec examples/training.rb

To run translations through the transformer:

$ bundle exec examples/translating.rb

  • Checkpoint files are generated as examples/checkpoints/*.zip

  • Each checkpoint includes the metadata.yaml, model.pth and vocabs.yaml files

References

Secryst is built on the transformer model with architecture based on:

  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017. In: Advances in Neural Information Processing Systems, pages 6000-6010.

Origin of name

Scrying is the practice of peering into a crystal sphere for fortune telling. Using seq2seq is not unlike scrying: looking into a crystal sphere and waiting for some machine-learning magic to happen.

“Secryst” comes from the combination of “seq2seq” + “crystal” + “scrying”.