ScrubRb

Pure-ruby polyfill of MRI 2.1's String#scrub, for ruby 1.9 and 2.0 on any interpreter.

Installation

Add this line to your application's Gemfile:

gem 'scrub_rb'

And then execute:

$ bundle

Or install it yourself as:

$ gem install scrub_rb

What it is

Ruby 2.1 introduces String#scrub, a method that replaces bytes in a string that are invalid for its specified encoding. See the docs in the MRI ruby source.

If you need String#scrub in MRI ruby 2.0, you can use the string-scrub gem, which provides a backport of the C code from MRI ruby 2.1 into MRI 2.0.

What if you need this functionality in ruby 1.9, in JRuby in 1.9 or 2.0 mode, or on any other ruby platform that does not (or does not yet) support String#scrub? What if you need to write code that will work on any of these platforms?

This gem provides a pure-ruby implementation of String#scrub and #scrub!, monkey-patched into String, that should work on any ruby platform. It only monkey-patches String if String does not already have a #scrub method, so it's safe to include this gem in multi-platform code. When the code runs on ruby 2.1, String#scrub will still be the original built-in implementation.
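The "patch only when the method is missing" guard described above can be sketched roughly as follows. This is an illustration of the pattern, not the gem's actual source: the module `GuardSketch`, the helper `install_fallback`, and the `Demo` class are all hypothetical names made up for the example.

```ruby
# Hypothetical sketch of a conditional monkey-patch guard.
module GuardSketch
  # Defines +name+ on +klass+ only when no such instance method exists yet.
  # Returns true if the fallback was installed, false if the existing
  # method was left untouched.
  def self.install_fallback(klass, name, &body)
    return false if klass.method_defined?(name)
    # define_method is private on ruby 1.9/2.0, hence the send.
    klass.send(:define_method, name, &body)
    true
  end
end

# Stand-in class so the example does not touch String itself.
class Demo
  def greet
    "native"
  end
end

kept      = GuardSketch.install_fallback(Demo, :greet) { "fallback" }
installed = GuardSketch.install_fallback(Demo, :shout) { "FALLBACK" }
```

Here `Demo#greet` keeps its original body because it already existed, while `Demo#shout` is filled in by the fallback. The same idea applied to `String` and `:scrub` is what keeps the native method in place on ruby 2.1.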

# Encoding: utf-8

"abc\u3042\x81".scrub #=> "abc\u3042\uFFFD"
"abc\u3042\x81".scrub("*") #=> "abc\u3042*"
"abc\u3042\xE3\x80".scrub{|bytes| '<'+bytes.unpack('H*')[0]+'>' } #=> "abc\u3042<e380>"

Performance

This pure-ruby implementation is about an order of magnitude slower than the built-in String#scrub on ruby 2.1, or than the string-scrub C gem on MRI 2.0. For most applications, string scrubbing will probably be a small portion of total execution time; it is still fairly fast, and hopefully won't be a problem.
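If you want to check whether scrubbing matters for your own workload, a rough timing like the one below may help. It is only a sketch: the sample string and the iteration count are arbitrary assumptions, and it times whichever String#scrub is available (the built-in one on ruby 2.1 or later, or this gem's once it is loaded on older platforms).

```ruby
# encoding: utf-8
require 'benchmark'

# Hypothetical sample data: mostly valid text with an occasional invalid byte.
sample = ("valid text \u3042 " * 50 + "\x81") * 200

# Time 20 scrubs of the sample with whatever String#scrub is defined.
seconds = Benchmark.realtime { 20.times { sample.scrub } }

puts "20 scrubs of #{sample.bytesize} bytes took #{(seconds * 1000).round(1)} ms"
```

Running the same script on each platform you target gives a like-for-like comparison on your own data rather than a general figure.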

Discrepancy with MRI 2.1 String#scrub

If there is a sequence of multiple contiguous invalid bytes in a string, should the entire block be replaced with only one replacement, or should each invalid byte be replaced with a replacement?

I have not been able to understand the logic MRI 2.1 uses to divide contiguous invalid bytes into certain sub-sequences for replacement, as represented in its test suite. The test suite may be suggesting that the examples come from Unicode documentation, but I wasn't able to find such documentation to see if it sheds any light on the matter.

scrub_rb always combines contiguous invalid bytes into a single replacement. As a result, it fails several tests from the original String#scrub test suite, which expect other divisions of contiguous invalid bytes. I've altered the local tests to match the current behavior.
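The "one replacement per run of invalid bytes" behavior described above can be sketched as follows. This is an illustrative approximation, not the gem's actual source: the module name `ScrubSketch` is hypothetical, the default replacement assumes a UTF-8 string, and the sketch relies on `each_char` yielding each invalid byte as its own invalid "character", which is how MRI behaves.

```ruby
# encoding: utf-8

# Hypothetical sketch: replaces each run of contiguous invalid bytes
# with a single replacement string (or a single block call).
module ScrubSketch
  def self.scrub(str, replacement = "\uFFFD", &block)
    result = "".dup.force_encoding(str.encoding)
    bad    = "".dup.force_encoding(Encoding::BINARY)

    # Emits one replacement for the accumulated run of invalid bytes, if any.
    flush = lambda do
      next if bad.empty?
      result << (block ? block.call(bad) : replacement)
      bad = "".dup.force_encoding(Encoding::BINARY)
    end

    str.each_char do |char|
      if char.valid_encoding?
        flush.call
        result << char
      else
        # Collect invalid bytes until the next valid character.
        bad << char.dup.force_encoding(Encoding::BINARY)
      end
    end
    flush.call
    result
  end
end

one_run = ScrubSketch.scrub("a\x80\x81\x82b", "?")
hexed   = ScrubSketch.scrub("abc\u3042\xE3\x80") { |bytes| '<' + bytes.unpack('H*')[0] + '>' }
```

With this approach the three invalid bytes in the first call collapse into a single "?", and the block in the second call is invoked once with both invalid bytes.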

Beware of this potential difference, especially when using the block form of #scrub: with scrub_rb the block may be called a different number of times, with a run of invalid bytes divided into different substrings, than with official MRI 2.1 String#scrub or string-scrub.
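One way to sidestep the difference is to write blocks whose output does not depend on how the invalid bytes are grouped. The sketch below formats each byte separately, so it should produce the same string whether the block is called once per run or once per sub-sequence. It assumes some String#scrub is available (built in on ruby 2.1 or later, or provided by this gem); the hex formatting is just an arbitrary choice for the example.

```ruby
# encoding: utf-8

# Format every invalid byte on its own, so the result is the same
# regardless of how the implementation groups contiguous invalid bytes.
per_byte_hex = lambda do |bytes|
  bytes.bytes.map { |b| format('<%02x>', b) }.join
end

cleaned = "a\x80\x81b".scrub(&per_byte_hex)
tail    = "abc\u3042\xE3\x80".scrub(&per_byte_hex)
```

Blocks that count calls, or that emit one fixed marker per call, are the ones that can behave differently across implementations.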

For most uses, this discrepancy is probably not of consequence.

If anyone can explain what's going on here, I'm very curious! I can't read C well enough to figure it out from the source.

JRuby in earlier versions may raise

Use JRuby 1.7.11 or later to avoid a known bug that made JRuby raise exceptions on certain unusual illegal byte combinations, preventing scrub_rb from scrubbing them.
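If your code may run on older JRuby versions, a startup check along these lines can surface the problem early. This is only a sketch: the helper name `jruby_scrub_safe?` and the warning text are hypothetical, while `JRUBY_VERSION` is the constant JRuby itself defines and `Gem::Version` comes with RubyGems.

```ruby
require 'rubygems'

# Hypothetical helper: true when not on JRuby (nil version) or when the
# given JRuby version is at least 1.7.11.
def jruby_scrub_safe?(jruby_version)
  return true if jruby_version.nil?
  Gem::Version.new(jruby_version) >= Gem::Version.new('1.7.11')
end

current = defined?(JRUBY_VERSION) ? JRUBY_VERSION : nil
unless jruby_scrub_safe?(current)
  warn "JRuby #{current} may raise on some invalid byte sequences; use 1.7.11 or later"
end
```

On MRI the constant is undefined, so the check passes silently.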

Contributions

Pull requests or suggestions welcome, especially on performance, on the JRuby issue, and on discrepancies with official String#scrub.

scrub_rb