ICU - A Ruby gem for Unicode processing, binding to ICU
Beta stage. I'm sure ffi-icu can also do the same thing once you understand the internal C module for transcoding.
Requires Ruby 2.3.1.
Usage
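On OS X, install ICU with Homebrew first and build the gem against the system libraries:
brew install icu4c
gem install icu -- --use-system-libraries
On other platforms:
gem install icu
Then load the gem in Ruby:
require 'icu'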
Design
Almost all arguments are expected to be Ruby Strings, which may be in
various encodings.
Symbols are also allowed in some places. One restriction: ICU::Locale
accepts only ASCII-compatible strings.
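The argument rules above could be checked roughly as follows. This is only a sketch: `normalize_locale_id` is a hypothetical helper invented for illustration and is not part of this gem's API. It relies on the standard `Encoding#ascii_compatible?` method.

```ruby
# Hypothetical helper (not the gem's API): accept a Symbol or a String
# and reject strings whose encoding is not ASCII-compatible.
def normalize_locale_id(value)
  str = value.is_a?(Symbol) ? value.to_s : value
  unless str.is_a?(String)
    raise TypeError, "expected String or Symbol, got #{value.class}"
  end
  unless str.encoding.ascii_compatible?
    raise ArgumentError, "locale must be ASCII-compatible, got #{str.encoding}"
  end
  str
end

locale_from_symbol = normalize_locale_id(:en_US)

# UTF-16 is not ASCII-compatible, so this argument is rejected.
utf16_rejected =
  begin
    normalize_locale_id("zh_Hant".encode(Encoding::UTF_16LE))
    false
  rescue ArgumentError
    true
  end
```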
Ruby API
Ruby API should be higher level and easily configurable. The underlying encoding conversion should be transparent to the user.
Encoding
Ruby uses a Code Set Independent (CSI) model for its string implementation, a choice driven by its community.
Ruby strings should honour Encoding.default_internal; otherwise __ENCODING__ from the environment is used to encode strings, including newly created ones.
A string holds only a byte array, so it may carry a mismatched encoding
or contain invalid bytes.
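A short plain-Ruby illustration of these points, using only core String and Encoding methods. The sample bytes are chosen for illustration.

```ruby
# Encoding.default_internal is nil unless configured, so fall back to
# the source encoding of this file.
preferred = Encoding.default_internal || __ENCODING__

# The UTF-8 bytes of a two-character string, tagged as binary.
raw = "\xE4\xBD\xA0\xE5\xA5\xBD".b
binary_length = raw.length            # counted as 6 single bytes

# Same bytes, different encoding tag: now counted as 2 characters.
utf8 = raw.dup.force_encoding(Encoding::UTF_8)
utf8_length = utf8.length

# A truncated sequence is accepted by force_encoding but is not valid.
broken = "\xE4\xBD".b.force_encoding(Encoding::UTF_8)
broken_is_valid = broken.valid_encoding?

# String#scrub replaces the invalid bytes so the result is valid again.
repaired_is_valid = broken.scrub("?").valid_encoding?
```

Because the encoding tag and the bytes are independent, any binding has to validate or convert input before handing it to ICU.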
ICU uses UTF-16 internally, but it also has a dedicated fast code path for UTF-8.
In most environments used by the Ruby community,
MRI represents strings as UTF-8,
so the UTF-8 code path should be considered.
If possible, and if it matches the MRI encoding,
ICU should also be compiled with -DU_CHARSET_IS_UTF8=1.
Given the facts above, instances should generally follow the encoding settings when returning strings. An input string that can be treated as UTF-8 can use the UTF-8 code path; otherwise, a conversion by ICU should be employed. The output string should honour the encoding settings. The conversion should be transparent to Ruby users.
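The flow described above can be sketched in plain Ruby. All three helpers are hypothetical names made up for this example, and `String#encode` stands in for the conversion that the gem would delegate to ICU; the block passed to `with_transcoding` stands in for an ICU operation working on UTF-8.

```ruby
# Hypothetical helpers (not the gem's API).

# Input side: pass UTF-8 through untouched, convert everything else.
def to_utf8(str)
  unless str.valid_encoding?
    raise ArgumentError, "invalid byte sequence in #{str.encoding}"
  end
  str.encoding == Encoding::UTF_8 ? str : str.encode(Encoding::UTF_8)
end

# Output side: honour the encoding settings.
def to_user_encoding(utf8, target)
  utf8.encoding == target ? utf8 : utf8.encode(target)
end

# Wrap an operation so the caller never sees the intermediate UTF-8.
def with_transcoding(input, target = Encoding.default_internal || __ENCODING__)
  to_user_encoding(yield(to_utf8(input)), target)
end

# An ISO-8859-1 string goes in and an ISO-8859-1 string comes out.
latin1 = "caf\u00E9".encode(Encoding::ISO_8859_1)
result = with_transcoding(latin1, Encoding::ISO_8859_1) { |s| s.reverse }
```

The target encoding is passed explicitly here so the example does not depend on the environment's default settings.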
Some details about MRI and encoding:
- string.unpack("U*") returns Unicode scalar values, whereas the n* directive knows nothing about UTF-16.
- The macro ENCODING_GET retrieves an object's encoding index. Depending on the index, the encoding is stored either in the object's RBasic or in an instance variable of that object.
- rb_default_internal_encoding() and rb_enc_default_internal() return the C encoding struct and the Ruby encoding object, respectively.
- rb_locale_encindex() gets the encoding index from the locale.
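The difference between the U* and n* directives shows up with a character outside the Basic Multilingual Plane. This example uses only core String methods; the sample string is an assumption chosen for illustration.

```ruby
# "a" followed by U+1F600, which needs a surrogate pair in UTF-16.
s = "a\u{1F600}"

# U* decodes UTF-8 and yields one Unicode scalar value per character.
scalars = s.unpack("U*")            # [0x61, 0x1F600]

# n* just reads 16-bit big-endian integers from the bytes, so the
# surrogate pair comes back as two separate code units.
code_units = s.encode(Encoding::UTF_16BE).unpack("n*")
# [0x0061, 0xD83D, 0xDE00]
```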
Contributing
Feel free to fork and submit a pull request.
TODO
- Support Ruby 2.2+. Rails 5 requires Ruby 2.2.2.
- Merge ffi-icu. This branch can be a start
- Merge some resources from this branch (old icu gem).
- Port the time/number_formatting module from ffi-icu.
- binary distribution of ICU & system library support
- documentation