URI::IDNA
An IDNA2008, UTS46, WHATWG URL Standard IDNA, and Punycode implementation in pure Ruby.
This gem provides a number of functions for converting internationalized domain names (IDNs) between the Unicode and ASCII Compatible Encoding (ACE) forms.
Installation
Add to your Gemfile:
gem "uri-idna"
And then run bundle install.
Usage
The gem offers several ways to convert IDNs between Unicode and ACE forms, depending on which standard you need to follow.
IDNA2008
RFC 5891 defines two protocols for IDN conversion: Registration and Domain Name Lookup.
Registration Protocol
URI::IDNA.register(alabel:, ulabel:, **options)
Options
- check_hyphens: true – whether to check hyphens according to Section 5.4.
- leading_combining: true – whether to check leading combining marks according to Section 5.4.
- check_joiners: true – whether to check CONTEXTJ code points according to Section 5.4.
- check_others: true – whether to check CONTEXTO code points according to Section 5.4.
- check_bidi: true – whether to check bidirectional characters according to Section 5.4.
require "uri/idna"
URI::IDNA.register(alabel: "xn--gdkl8fhk5egc.jp", ulabel: "ハロー・ワールド.jp")
#=> "xn--gdkl8fhk5egc.jp"
URI::IDNA.register(ulabel: "ハロー・ワールド.jp")
#=> "xn--gdkl8fhk5egc.jp"
URI::IDNA.register(alabel: "xn--gdkl8fhk5egc.jp")
#=> "xn--gdkl8fhk5egc.jp"
URI::IDNA.register(ulabel: "☕.us")
#<URI::IDNA::InvalidCodepointError: Codepoint U+2615 at position 1 of "☕" not allowed>
Domain Name Lookup Protocol
URI::IDNA.lookup(domain_name, **options)
Options
- check_hyphens: true – whether to check hyphens according to Section 4.2.3.1.
- leading_combining: true – whether to check leading combining marks according to Section 4.2.3.2.
- check_joiners: true – whether to check CONTEXTJ code points according to Section 4.2.3.3.
- check_others: true – whether to check CONTEXTO code points according to Section 4.2.3.3.
- check_bidi: true – whether to check bidirectional characters according to Section 4.2.3.4.
- verify_dns_length: true – whether to check DNS length according to Section 4.4.
require "uri/idna"
URI::IDNA.lookup("ハロー・ワールド.jp")
#=> "xn--pck0a1b0a6a2e.jp"
URI::IDNA.lookup("xn--pck0a1b0a6a2e.jp")
#=> "xn--pck0a1b0a6a2e.jp"
URI::IDNA.lookup("Ῠ.me")
#<URI::IDNA::InvalidCodepointError: Codepoint U+1FE8 at position 1 of "Ῠ" not allowed>
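To make the first two checks more concrete, here is a minimal sketch in plain Ruby of the rules that Sections 4.2.3.1 and 4.2.3.2 of RFC 5891 describe for a Unicode label. The method names hyphens_ok? and leading_combining_mark? are hypothetical and are not part of this gem's API; the gem's own checks may be implemented differently.

```ruby
# Illustrative only: these helpers are NOT part of URI::IDNA.

# RFC 5891, Section 4.2.3.1: a Unicode label must not start or end with
# a hyphen and must not have "--" in the third and fourth positions.
def hyphens_ok?(label)
  return false if label.start_with?("-") || label.end_with?("-")

  label[2, 2] != "--"
end

# RFC 5891, Section 4.2.3.2: a label must not begin with a combining mark.
# \p{M} matches code points in the Unicode "Mark" general categories.
def leading_combining_mark?(label)
  label.match?(/\A\p{M}/)
end

hyphens_ok?("a-b-c")               # => true
hyphens_ok?("ab--cd")              # => false
leading_combining_mark?("\u0301a") # => true
```

Both helpers operate on a single label, so a full domain name would first be split on its dots.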
Unicode UTS46 (TR46)
Current revision: 31
UTS46 defines two IDN conversion functions: ToASCII and ToUnicode.
ToASCII
URI::IDNA.to_ascii(domain_name, **options)
Options
- use_std3_ascii_rules: true – whether to apply STD3 rules for both mapping and validation.
- check_hyphens: true – whether to check hyphens according to Section 4.2.3.1 of RFC 5891.
- check_bidi: true – whether to check bidirectional characters according to Section 4.2.3.4 of RFC 5891.
- check_joiners: true – whether to check CONTEXTJ code points according to Section 4.2.3.3 of RFC 5891.
- transitional_processing: false – (deprecated) whether to apply transitional processing for mapping.
- ignore_invalid_punycode: false – whether to fast-path invalid Punycode labels according to the 4th step of Processing.
- verify_dns_length: true – whether to check DNS length according to Section 4.4 of RFC 5891.
require "uri/idna"
URI::IDNA.to_ascii("Bloß.de")
#=> "xn--blo-7ka.de"
# UTS46 transitional processing is disabled by default,
# but can be enabled via option:
URI::IDNA.to_ascii("Bloß.de", transitional_processing: true)
#=> "bloss.de"
# Note that UTS46 processing is not fully IDNA2008 compliant:
URI::IDNA.to_ascii("☕.us")
#=> "xn--53h.us"
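As a rough illustration of what the verify_dns_length option guards against, the sketch below checks the length limits that UTS46 applies to the ASCII form of a name: 1 to 253 characters for the whole name (ignoring a trailing root dot) and 1 to 63 characters per label. The method name dns_length_ok? is hypothetical and not part of this gem; it is an assumption-laden sketch, not the gem's implementation.

```ruby
# Illustrative only: this helper is NOT part of URI::IDNA.
# Expects a domain name that is already in ASCII (ACE) form.
def dns_length_ok?(ascii_domain)
  # A single trailing dot denotes the root label and is not counted.
  name = ascii_domain.delete_suffix(".")
  return false unless (1..253).cover?(name.bytesize)

  # The -1 limit keeps empty labels (e.g. from "a..com") so they fail.
  name.split(".", -1).all? { |label| (1..63).cover?(label.bytesize) }
end

dns_length_ok?("xn--blo-7ka.de")   # => true
dns_length_ok?("#{"a" * 64}.com")  # => false
```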
ToUnicode
URI::IDNA.to_unicode(domain_name, **options)
Options
- use_std3_ascii_rules: true – whether to apply STD3 rules for both mapping and validation.
- check_hyphens: true – whether to check hyphens according to Section 4.2.3.1 of RFC 5891.
- check_bidi: true – whether to check bidirectional characters according to Section 4.2.3.4 of RFC 5891.
- check_joiners: true – whether to check CONTEXTJ code points according to Section 4.2.3.3 of RFC 5891.
- transitional_processing: false – (deprecated) whether to apply transitional processing for mapping.
- ignore_invalid_punycode: false – whether to fast-path invalid Punycode labels according to the 4th step of Processing.
require "uri/idna"
URI::IDNA.to_unicode("xn--blo-7ka.de")
#=> "bloß.de"
IDNA2008 compatibility
It's possible to apply UTS46 mapping first and then run IDNA2008 processing, so that the result fully conforms to IDNA2008:
require "uri/idna"
# For example we can use UTS46 mapping to downcase some characters
char = "⼤"
char.ord # "\u2F24"
#=> 12068
# just downcase doesn't work in this case
char.downcase.ord
#=> 12068
# but UTS46 mapping does its thing:
URI::IDNA::UTS46::Mapping.call(char).ord
#=> 22823
# so here is a full example:
domain = "⼤.cn" # "\u2F24.cn"
URI::IDNA.lookup(domain)
#<URI::IDNA::InvalidCodepointError: Codepoint U+2F24 at position 1 of "⼤" not allowed>
mapped_domain = URI::IDNA::UTS46::Mapping.call(domain)
URI::IDNA.lookup(mapped_domain)
#=> "xn--pss.cn"
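For this particular character, the mapped code point is the same one that Unicode compatibility normalization (NFKC) produces, which can be checked with Ruby's standard library alone. This is only a sanity check for one character: UTS46 mapping covers more than NFKC, so the two should not be treated as interchangeable in general.

```ruby
# U+2F24 (KANGXI RADICAL BIG) has a compatibility mapping to U+5927.
char = "\u2F24"
normalized = char.unicode_normalize(:nfkc)

char.ord       # => 12068 (0x2F24)
normalized.ord # => 22823 (0x5927)
```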
WHATWG
WHATWG's URL Standard uses the UTS46 algorithm to define its ToASCII and ToUnicode functions. It hides all of the available flags behind a single one: the be_strict flag.
Note that the check_hyphens UTS46 option is set to false in this algorithm.
ToASCII
URI::IDNA.whatwg_to_ascii(domain_name, **options)
Options
- be_strict: true – defines the values of the use_std3_ascii_rules and verify_dns_length UTS46 options.
require "uri/idna"
URI::IDNA.whatwg_to_ascii("Bloß.de")
#=> "xn--blo-7ka.de"
# The be_strict flag sets use_std3_ascii_rules and verify_dns_length UTS46 flags to its value
URI::IDNA.whatwg_to_ascii("2003_rules.com", be_strict: false)
#=> "2003_rules.com"
# By default be_strict is set to true
URI::IDNA.whatwg_to_ascii("2003_rules.com")
#<URI::IDNA::InvalidCodepointError: Codepoint U+005F at position 5 of "2003_rules" not allowed>
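The underscore is rejected above because, under STD3 ASCII rules, the only ASCII characters allowed in a label are letters, digits, and the hyphen. The sketch below shows that rule in isolation; std3_violation is a hypothetical helper, not part of this gem, and it reports a zero-based index whereas the error message above counts positions from 1.

```ruby
# Illustrative only: this helper is NOT part of URI::IDNA.
# Returns [character, zero_based_index] for the first ASCII character
# that is not a letter, digit, or hyphen; returns nil if there is none.
# Non-ASCII characters are ignored here (other rules cover them).
def std3_violation(label)
  disallowed = /[\x00-\x2C\x2E\x2F\x3A-\x40\x5B-\x60\x7B-\x7F]/
  index = label.index(disallowed)
  index && [label[index], index]
end

std3_violation("2003_rules")  # => ["_", 4]
std3_violation("xn--blo-7ka") # => nil
```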
ToUnicode
URI::IDNA.whatwg_to_unicode(domain_name, **options)
Options
- be_strict: true – defines the value of the use_std3_ascii_rules UTS46 option.
require "uri/idna"
URI::IDNA.whatwg_to_unicode("xn--blo-7ka.de")
#=> "bloß.de"
Punycode
The Punycode module performs conversion between Unicode and Punycode. Note that Punycode on its own is not IDNA2008 compliant: it only converts between the two forms and performs no validation.
require "uri/idna/punycode"
URI::IDNA::Punycode.encode("ハロー・ワールド")
#=> "gdkl8fhk5egc"
URI::IDNA::Punycode.decode("gdkl8fhk5egc")
#=> "ハロー・ワールド"
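For readers curious about what the encoding step involves, here is a compact sketch of the encoding procedure from RFC 3492 (Section 6.3) in plain Ruby. It is an independent illustration written for this README, not the gem's implementation; punycode_encode and adapt_bias are hypothetical names, and overflow handling and decoding are left out.

```ruby
# Illustrative only: NOT the implementation used by URI::IDNA::Punycode.
# Bootstring parameters for Punycode (RFC 3492, Section 5).
BASE = 36
TMIN = 1
TMAX = 26
SKEW = 38
DAMP = 700
INITIAL_BIAS = 72
INITIAL_N = 128
DIGITS = [*"a".."z", *"0".."9"].freeze

# Bias adaptation function (RFC 3492, Section 6.1).
def adapt_bias(delta, num_points, first_time)
  delta = first_time ? delta / DAMP : delta / 2
  delta += delta / num_points
  k = 0
  while delta > ((BASE - TMIN) * TMAX) / 2
    delta /= BASE - TMIN
    k += BASE
  end
  k + ((BASE - TMIN + 1) * delta) / (delta + SKEW)
end

def punycode_encode(input)
  code_points = input.codepoints
  # Basic (ASCII) code points are copied first, followed by a delimiter.
  basic = code_points.select { |cp| cp < INITIAL_N }
  output = basic.pack("U*")
  output << "-" unless basic.empty?

  n = INITIAL_N
  delta = 0
  bias = INITIAL_BIAS
  handled = basic.length

  while handled < code_points.length
    # Next smallest code point that has not been encoded yet.
    m = code_points.select { |cp| cp >= n }.min
    delta += (m - n) * (handled + 1)
    n = m
    code_points.each do |cp|
      delta += 1 if cp < n
      next unless cp == n

      # Emit delta as a generalized variable-length integer.
      q = delta
      k = BASE
      loop do
        t = (k - bias).clamp(TMIN, TMAX)
        break if q < t

        output << DIGITS[t + (q - t) % (BASE - t)]
        q = (q - t) / (BASE - t)
        k += BASE
      end
      output << DIGITS[q]
      bias = adapt_bias(delta, handled + 1, handled == basic.length)
      delta = 0
      handled += 1
    end
    delta += 1
    n += 1
  end
  output
end

punycode_encode("bücher") # => "bcher-kva"
```

Prefixing an encoded label with "xn--" gives its ACE form, so the label above would appear as "xn--bcher-kva" in a domain name.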
Full technical reference:
IDNA2008
- RFC 5890 – Definitions and Document Framework
- RFC 5891 – Protocol
- RFC 5892 – The Unicode Code Points
- RFC 5893 – Bidi rule
Punycode
- RFC 3492 – Punycode: A Bootstring encoding of Unicode
UTS46 (also referenced as TR46)
Development
After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and the created tag, and push the .gem file to rubygems.org.
Generating Unicode data
This gem uses Unicode data files to perform IDN conversion. To generate new Unicode data files, run bundle exec rake idna:generate.
To specify the Unicode version, use the VERSION environment variable, e.g. VERSION=15.1.0 bundle exec rake idna:generate.
By default, the Unicode version is the one supported by the running Ruby (RbConfig::CONFIG["UNICODE_VERSION"]).
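To see which Unicode version that default resolves to on your machine, you can query RbConfig directly. The local variable name below is only for illustration.

```ruby
require "rbconfig"

# The Unicode version the running Ruby interpreter was built against.
unicode_version = RbConfig::CONFIG["UNICODE_VERSION"]
puts "Ruby #{RUBY_VERSION} uses Unicode #{unicode_version}"
```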
To set the directory for generated files, use the DEST_DIR environment variable, e.g. DEST_DIR=lib/uri/idna/data bundle exec rake idna:generate.
Unicode data is cached in the tmp directory by default. To change this, use the CACHE_DIR environment variable, e.g. CACHE_DIR=~/.cache/unicode_data bundle exec rake idna:generate.
Note: rake idna:generate might produce different results on different versions of Ruby because it relies on Ruby's built-in Unicode normalization methods.
Inspect Unicode data
To inspect Unicode data, run bundle exec rake 'idna:inspect[<HEX_CODE>]'.
To specify the Unicode version or the cache directory, use the VERSION or CACHE_DIR environment variables, e.g. VERSION=15.1.0 bundle exec rake 'idna:inspect[1f495]'.
Update UTS46 test suite data
To update UTS46 test suite data, run bundle exec rake idna:update_uts46_test_suite.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/skryukov/uri-idna.
License
The gem is available as open source under the terms of the MIT License.