Project

utf8

0.01
Repository is archived
No commit activity in last 3 years
No release in over 3 years
A lightweight UTF8-aware String class meant for use with Ruby 1.8
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
 Dependencies

Development

 Project Readme

A lightweight UTF8-aware String class meant for use with Ruby 1.8.x

The scanning code is implemented in C (would love a Java version as well, if any of you want to help with that).

At the moment, this gem is tested on 1.8.7, 1.9.2, 1.9.3 and Rubinius - it may work on others but ymmv.

String::UTF8 Example

The String::UTF8#[] API isn't fully supported yet, and may never support regular expressions. I'll update this readme as I make progress.

# String (on 1.8)
str = "你好世界"
str.length # 12
str[0] # 228
str.chars.to_a.inspect # ["\344", "\275", "\240", "\345", "\245", "\275", "\344", "\270", "\226", "\347", "\225", "\214"]


# String::UTF8 (on 1.8)
utf8_str = "你好世界".as_utf8
utf8_str.length # 4
utf8.chars.to_a.inspect # => ["你", "好", "世", "界"]

utf8[0]        # => "你"
utf8_str[0, 3] # => "你好世"
utf8_str[0..3] # => "你好世界"

StringScanner::UTF8 Example

The full StringScanner API hasn't been made UTF8-aware yet. I may try and finish it later, but the getch method is all I need at the moment.

# StringScanner (on 1.8)
scanner = StringScanner.new("你好世界")
scanner.getch # => "\344"

# StringScanner::UTF8
require 'utf8/string_scanner'

# NOTE: we could have used scanner.as_utf8 instead
utf8_scanner = StringScanner::UTF8.new("你好世界")
utf8_scanner.getch # => "你"
utf8_scanner.getch # => "好"
utf8_scanner.getch # => "世"
utf8_scanner.reset
utf8_scanner.getch # => "你"

Disclaimer

I'm only planning on making the methods I need UTF8-aware, if you need others I welcome pull requests :)