EncodingEstimator: Detect encoding of strings
This gem allows you to detect the encoding of strings/files based on their content. This can be useful if you need to load data from sources with unknown encodings. The gem uses character distribution statistics to determine which candidate encoding best fits the content.
Usage in Ruby Code
The gem has two main high-level methods. Use the first one when you want to know how a string is encoded:
detection = EncodingEstimator.detect( File.read( 'foo.txt' ), languages: [ :en, :de ] )
puts "Encoding: #{detection.result.encoding}"
The second one is a shortcut for when you just want a string of unknown encoding converted to a UTF-8 encoded string (UTF-8 should be the Ruby default):
utf8_txt = EncodingEstimator.ensure_utf8( File.read( 'foo.txt' ), languages: [ :en, :de ] )
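For comparison, here is a minimal sketch of what the conversion step looks like in plain Ruby once the source encoding is already known. It uses only Ruby's standard String methods, not this gem's internals, and the sample bytes and the choice of ISO-8859-1 are assumptions made for illustration.

```ruby
# Bytes of "Grüße" as they would be stored in an ISO-8859-1 file
# (sample data, assumed for this illustration).
raw = "Gr\xFC\xDFe".b

# Tag the bytes with the known source encoding, then transcode to UTF-8.
utf8 = raw.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

puts utf8                  # Grüße
puts utf8.encoding         # UTF-8
puts utf8.valid_encoding?  # true
```

The hard part is knowing which encoding to put in the `force_encoding` call, which is the question the detection step answers.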
More detailed tutorials can be found here.
If you need more control over the operations to perform, just have a look at EncodingEstimator::Detector and EncodingEstimator::Conversion.
Installation
Add this line to your application's Gemfile:
gem 'encoding_estimator'
And then execute:
$ bundle
Or install it yourself as:
$ gem install encoding_estimator
Note: if you want to use the multithreaded versions of the algorithms, please install the parallel and ruby-progressbar gems.
Command line utilities
This gem provides two command line utilities: encest-detect and encest-gen.
encest-detect
This tool detects the encoding of files. It has some command line options you should use whenever you know more about a file (e.g. which languages it could be written in or which encodings it could have).
usage: encest-detect [options]
--encodings, -e Encodings to test (default: iso-8859-1,utf-16le,windows-1251)
--operations, -o Operations (enc/dec) to test (default: dec)
--languages, -l Language profiles to apply (default: en,de)
--threads, -t Number of threads to use (0 to disable multithreading, default)
--help, -h Display help
other arguments: files to parse
Please note that the -l argument accepts the short two-letter codes for the included language profiles as well as paths to language model files. These can be generated using encest-gen.
The output might look like this:
$ encest-detect -l en,de,fr */*.txt
de/iso-8859-1.txt: dec_iso-8859-1
    keep_utf-8: 0.9983638601518013
    dec_iso-8859-1: 1.0
    dec_utf-16le: 0.0
    dec_windows-1251: 0.9984215377764598
en/utf-16le.txt: dec_utf-16le
    keep_utf-8: 0.0
    dec_iso-8859-1: 0.3981167811176304
    dec_utf-16le: 1.0
    dec_windows-1251: 0.005410547626031029
fr/utf-8.txt: keep_utf-8
    keep_utf-8: 1.0
    dec_iso-8859-1: 0.9957726010451553
    dec_utf-16le: 0.0
    dec_windows-1251: 0.9957810888135232
encest-gen
This tool generates the language models that the encest-detect tool (and the other classes in this gem) use. The language models are very simple JSON files that look roughly like this:
{"W":0.222539,"ä":0.288427,"-":0.513657,"Z":0.118473 ... }
The encest-gen command generates these scores from a large amount of input text. To generate the language models this gem provides by default, I used Wikipedia dumps, but you can use any (UTF-8-encoded) text files you like. Create a directory, let's call it pt (for Portuguese), and extract the files you want to learn the language model from (e.g. the Wikipedia dump) into it. Please split large files into smaller chunks of text (max ~20MiB); otherwise Ruby will crash with NoMemoryError and you won't see a progress bar.
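If you do not have a splitting tool at hand, the chunking can be done in a few lines of Ruby. The following is only a sketch: `split_text_file` is a hypothetical helper, not part of this gem, and the chunk file naming scheme is an assumption. It cuts only at line boundaries so that no UTF-8 character is split in the middle.

```ruby
require 'fileutils'

# Hypothetical helper (not part of the gem): split one large UTF-8 text file
# into chunks of at most max_bytes each, cutting only at line boundaries.
# A single line longer than max_bytes still ends up in a chunk of its own.
def split_text_file(path, out_dir, max_bytes: 20 * 1024 * 1024)
  FileUtils.mkdir_p(out_dir)
  base = File.basename(path, '.*')
  index = 0
  written = 0
  out = nil
  paths = []

  File.foreach(path, encoding: 'UTF-8') do |line|
    # Start a new chunk for the first line, or when this line would overflow.
    if out.nil? || (written > 0 && written + line.bytesize > max_bytes)
      out&.close
      chunk_path = File.join(out_dir, format('%s-%04d.txt', base, index))
      out = File.open(chunk_path, 'w', encoding: 'UTF-8')
      paths << chunk_path
      index += 1
      written = 0
    end
    out.write(line)
    written += line.bytesize
  end
  paths
ensure
  out&.close
end

# Example call (paths are placeholders):
# split_text_file('/path/to/dump.txt', '/path/to/the/directory/with/text')
```

The helper reads the input line by line, so memory use stays small regardless of the size of the source file.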
Usage of encest-gen is quite simple:
usage: encest-gen [options]
--threshold, -t Minimum character count threshold to include a char in the model (default: 0.00001)
--threads, -n Number of threads used to process the files (default: 4)
--silent, -s Disable progressbars and other outputs
--help, -h Display help
other arguments: lang1=directory1 ... langN=directoryN
So for our Portuguese language model on an 8-core machine we call:
encest-gen -n 8 pt=/path/to/the/directory/with/text
The command will produce a file called pt.json, which is your new language model.
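Because a model is a flat JSON object mapping characters to scores, it can be inspected with Ruby's standard json library. The snippet below is a small sketch; the inline JSON string is made-up sample data standing in for a real model file.

```ruby
require 'json'

# Made-up sample data standing in for the contents of a model file.
# For a real file you would use something like:
#   JSON.parse(File.read('pt.json', encoding: 'UTF-8'))
model_json = '{"a":0.61,"e":0.58,"ã":0.21,"z":0.07}'

model = JSON.parse(model_json)

# Print the characters from highest to lowest score.
model.sort_by { |_char, score| -score }.each do |char, score|
  puts format('%-2s %.2f', char, score)
end
```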
How it works
This gem uses a statistical approach to determine the encoding of an input string. It interprets the input in each of the encodings to test and compares the resulting character distribution against one or more language models. The detector then returns the likelihood of every encoding.
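The idea can be illustrated with a toy version in plain Ruby. This is a sketch under simplifying assumptions, not the gem's actual implementation: the model weights are made up, the scoring rule (average weight per character) is an assumption, and `decode_as`, `score` and `rank_encodings` are hypothetical helper names.

```ruby
# Toy language model: character => weight (made-up values, not the gem's).
MODEL = { 'e' => 0.9, 'n' => 0.8, 'r' => 0.7,
          'ü' => 0.4, 'ß' => 0.3, 'G' => 0.2 }.freeze

# Interpret the raw bytes as `encoding` and return the text as UTF-8,
# or nil if the bytes cannot be read in that encoding.
def decode_as(bytes, encoding)
  candidate = bytes.dup.force_encoding(encoding)
  return nil unless candidate.valid_encoding?

  candidate.encode('UTF-8')
rescue EncodingError
  nil
end

# Assumed scoring rule: average model weight per character.
# Characters missing from the model count as 0.
def score(text, model)
  return 0.0 if text.nil? || text.empty?

  text.each_char.sum { |c| model.fetch(c, 0.0) } / text.length
end

# Score every candidate encoding and sort from best to worst.
def rank_encodings(bytes, encodings, model)
  encodings.map { |enc| [enc, score(decode_as(bytes, enc), model)] }
           .sort_by { |_enc, s| -s }
end

bytes = "Gr\xFC\xDFe".b # "Grüße" stored as ISO-8859-1
rank_encodings(bytes, %w[UTF-8 ISO-8859-1 Windows-1251], MODEL).each do |enc, s|
  puts format('%-13s %.3f', enc, s)
end
```

With this toy model, ISO-8859-1 ranks first because the decoded text contains ü and ß, which the model knows. Windows-1251 decodes the same bytes to Cyrillic letters that the model does not contain, and the bytes are not valid UTF-8 at all.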
Supported languages
Currently, the gem supports 10 languages: English, German, French, Spanish, Russian, Portuguese, Greek, Turkish, Chinese and Arabic. The language profiles were generated from Wikipedia dumps. You can generate your own language profiles using the encest-gen tool. For more information on this tool, see above.
Supported encodings
The gem supports all encodings your Ruby implementation supports. Note, however, that every additional encoding in the list of encodings to test slows down the detection process.
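To see which encoding names your Ruby build knows, you can ask Ruby itself. The sketch below uses only the standard Encoding class; the candidate list is an example, with one deliberately bogus entry.

```ruby
# Number of encoding names (including aliases) known to this Ruby build.
puts Encoding.name_list.size

# Check a candidate list up front. Encoding.find raises ArgumentError
# for names Ruby does not know.
candidates = %w[iso-8859-1 utf-16le windows-1251 no-such-encoding]
supported, unknown = candidates.partition do |name|
  begin
    Encoding.find(name)
    true
  rescue ArgumentError
    false
  end
end

puts "supported: #{supported.join(', ')}"
puts "unknown:   #{unknown.join(', ')}"
```

Keeping the candidate list short and limited to encodings that are plausible for your data also keeps detection fast.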
License
The gem is available as open source under the terms of the MIT License.