Efficient pure Ruby Unicode normalization.

Efficient Pure Ruby Unicode Normalization (eprun)

(pronounced e-prune)

The Talk

Please see the Internationalization & Unicode Conference 37 talk on Implementing Normalization in Pure Ruby - the Fast and Easy Way.

Directories and Files

  • lib/normalize.rb: The core normalization code.
  • lib/string_normalize.rb: String#normalize.
  • lib/generate.rb: Generation script, generates lib/normalize_tables.rb from data/UnicodeData.txt and data/CompositionExclusions.txt. This needs to be run only once when updating to a new Unicode version.
  • lib/normalize_tables.rb: Data used for normalization, automatically generated by lib/generate.rb.
  • data/: All three files in this directory are downloaded from the Unicode Character Database. They are currently at Unicode version 6.3 and need to be updated for each newer Unicode version (a new version appears about once a year).
  • test/test_normalize.rb: Tests for lib/string_normalize.rb, using data/NormalizationTest.txt.
  • benchmark/benchmark.rb: Runs the benchmark with example text files. Automatically checks for existing gems/libraries; if e.g. the unicode_utils gem is not available, that part of the benchmark is skipped. This also applies to eprun itself, which will not be run on Ruby 1.8.
  • benchmark/Deutsch_.txt, Japanese_.txt, Korean_.txt, Vietnamese_.txt: Example texts extracted from random Wikipedia pages (see http://en.wikipedia.org/wiki/Wikipedia:Random). The languages are chosen based on the number of characters affected by normalization (Deutsch < Japanese < Vietnamese < Korean). The files differ somewhat in length, so the results cannot be compared directly across languages. Adding other files with names ending in "_.txt" will include them in the benchmark.
  • benchmark/benchmark_results.rb: Benchmark results for eprun, unicode_utils, ActiveSupport::Multibyte (version 3.0.0), twitter_cldr, and the unicode gem. The eprun, unicode_utils, and unicode normalizations are run 100 times each, ActiveSupport::Multibyte is run 10 times each, and twitter_cldr is run only once (didn't want to wait any longer).
  • benchmark/benchmark_results_jruby.txt: Benchmark results when using JRuby (excludes the unicode gem), version 1.7.4 (1.9.3p392, 2013-05-16 2390d3b on Java HotSpot(TM) Client VM 1.7.0_07-b10 [Windows 7-x86]).
  • benchmark/benchmark.pl: Runs the benchmark using Perl, both with xsub (i.e. C) version (run 100 times) and pure Perl version (run 10 times).
  • benchmark/benchmark_results_pl.txt: Results of Perl benchmarks.
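
To give a feel for what a generation script like lib/generate.rb has to extract, here is a minimal sketch that reads records in UnicodeData.txt format. It is not the actual script: the hash names (canonical, compatibility, combining) are made up for illustration, the three records are inlined instead of being read from data/UnicodeData.txt, and composition exclusions are left out entirely. It relies only on the documented field layout of UnicodeData.txt: field 0 is the code point, field 3 the canonical combining class, and field 5 the decomposition mapping, with a leading <tag> marking a compatibility mapping.

```ruby
# Three sample records in UnicodeData.txt format (semicolon-separated fields).
SAMPLE = <<~DATA
  00C0;LATIN CAPITAL LETTER A WITH GRAVE;Lu;0;L;0041 0300;;;;N;LATIN CAPITAL LETTER A GRAVE;;;00E0;
  0300;COMBINING GRAVE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING GRAVE;;;;
  FB01;LATIN SMALL LIGATURE FI;Ll;0;L;<compat> 0066 0069;;;;N;;;;;
DATA

# Hypothetical table names, chosen for this sketch only.
canonical     = {} # code point => canonical decomposition (array of code points)
compatibility = {} # code point => compatibility decomposition
combining     = {} # code point => non-zero canonical combining class

SAMPLE.each_line do |line|
  fields = line.chomp.split(';', -1) # -1 keeps trailing empty fields
  code_point = fields[0].hex

  ccc = fields[3].to_i
  combining[code_point] = ccc unless ccc.zero?

  mapping = fields[5]
  next if mapping.empty?

  if mapping.start_with?('<')
    # A leading <tag> marks a compatibility mapping; strip it before parsing.
    compatibility[code_point] = mapping.sub(/\A<[^>]+>\s*/, '').split.map(&:hex)
  else
    canonical[code_point] = mapping.split.map(&:hex)
  end
end

puts format('U+%04X decomposes canonically to %s', 0xC0,
            canonical[0xC0].map { |cp| format('U+%04X', cp) }.join(' '))
```

A real generator would go on to serialize such tables into a Ruby source file, which is the role the list above gives to lib/normalize_tables.rb.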

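The way a benchmark can skip libraries that are not installed, as described for benchmark/benchmark.rb, can be sketched as follows. This is an illustration under stated assumptions, not the code in that file: library_available?, CANDIDATES, and run_benchmark are hypothetical names, and Ruby's built-in String#unicode_normalize (Ruby 2.2 and later) stands in for eprun so the sketch runs without this repository checked out.

```ruby
require 'benchmark'

# Hypothetical helper: true if the named library can be required.
def library_available?(name)
  require name
  true
rescue LoadError
  false
end

# Each candidate names the library it needs (nil means nothing to load)
# and a lambda that normalizes a string to NFC.
CANDIDATES = [
  { label: 'String#unicode_normalize', lib: nil,
    run: ->(s) { s.unicode_normalize(:nfc) } },
  { label: 'unicode_utils', lib: 'unicode_utils',
    run: ->(s) { UnicodeUtils.nfc(s) } }
].freeze

# Times every available candidate; unavailable ones are reported and skipped.
def run_benchmark(text, repetitions)
  CANDIDATES.each_with_object({}) do |candidate, results|
    if candidate[:lib] && !library_available?(candidate[:lib])
      puts "#{candidate[:label]}: not installed, skipped"
      next
    end
    results[candidate[:label]] =
      Benchmark.realtime { repetitions.times { candidate[:run].call(text) } }
  end
end

sample_text = "Ti\u1EBFng Vi\u1EC7t " * 1_000
timings = run_benchmark(sample_text, 100)
timings.each { |label, seconds| puts format('%-26s %.4f s', label, seconds) }
```

Rescuing LoadError around require keeps one missing gem from aborting the whole run, so the same script works on machines with different sets of libraries installed.
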
TODOs and Ideas

  • Publish as a gem, or several gems.
  • Deal better with encodings other than UTF-8.
  • Add methods such as String#nfc, String#nfd, ...
  • Add methods for normalization variants.
  • See talk for more.
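
As a rough idea of what shortcut methods such as String#nfc and String#nfd could look like, here is a minimal sketch. It is an assumption-laden illustration, not part of eprun: the module name NormalizationShortcuts is invented, and the methods delegate to Ruby's built-in String#unicode_normalize (Ruby 2.2 and later) rather than to this library's String#normalize, whose exact signature is not shown in this README.

```ruby
# Hypothetical shortcut methods; the module name is invented for this sketch.
module NormalizationShortcuts
  %i[nfc nfd nfkc nfkd].each do |form|
    # Each method delegates to the built-in normalizer with the matching form.
    define_method(form) { unicode_normalize(form) }
  end
end

class String
  include NormalizationShortcuts
end

composed   = "\u00E9"  # "e with acute" as a single code point
decomposed = "e\u0301" # "e" followed by COMBINING ACUTE ACCENT

puts composed.nfd == decomposed # true
puts decomposed.nfc == composed # true
puts "\uFB01".nfkc == 'fi'      # true: the fi ligature has a compatibility mapping
```

Generating the four methods from a list of form names keeps them consistent with each other; a refinement could be used instead of reopening String if a global patch is undesirable.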