errata
Define an errata in table format (CSV) and then apply it to an arbitrary source. Inspired by RFC Errata, it lets you keep your own errata in a transparent way.
Tested in MRI 1.8.7+, MRI 1.9.2+, and JRuby 1.6.7+. Thread safe.
Inspiration
The RFC Editor has a process for reporting errata on RFCs.
Example
Every errata has a table structure based on the IETF RFC Editor's "How to Report Errata".
| date | name | email | type | section | action | x | y | condition | notes |
|---|---|---|---|---|---|---|---|---|---|
| 2011-03-22 | Ian Hough | ian@brighterplanet.com | meta | Intended use | | http://example.com/original-data-with-errors.xls | | A hypothetical document that uses non-ISO country names | |
| 2011-03-22 | Ian Hough | ian@brighterplanet.com | technical | Country Name | replace | /ANTIGUA & BARBUDA/ | ANTIGUA AND BARBUDA | | |
| 2011-03-22 | Ian Hough | ian@brighterplanet.com | technical | Country Name | replace | /BOLIVIA/ | BOLIVIA, PLURINATIONAL STATE OF | | |
| 2011-03-22 | Ian Hough | ian@brighterplanet.com | technical | Country Name | replace | /BOSNIA & HERZEGOVINA/ | BOSNIA AND HERZEGOVINA | | |
| 2011-03-22 | Ian Hough | ian@brighterplanet.com | technical | Country Name | replace | /BRITISH VIRGIN ISLANDS/ | VIRGIN ISLANDS, BRITISH | | |
| 2011-03-22 | Ian Hough | ian@brighterplanet.com | technical | Country Name | replace | /COTE D'IVOIRE/ | CÔTE D'IVOIRE | | |
| 2011-03-22 | Ian Hough | ian@brighterplanet.com | technical | Country Name | replace | /DEM\. PEOPLE'S REP\. OF KOREA/ | KOREA, DEMOCRATIC PEOPLE'S REPUBLIC OF | | |
| 2011-03-22 | Ian Hough | ian@brighterplanet.com | technical | Country Name | replace | /DEM\. REP\. OF THE CONGO/ | CONGO, THE DEMOCRATIC REPUBLIC OF THE | | |
| 2011-03-22 | Ian Hough | ian@brighterplanet.com | technical | Country Name | replace | /HONG KONG SAR/ | HONG KONG | | |
| 2011-03-22 | Ian Hough | ian@brighterplanet.com | technical | Country Name | replace | /IRAN \(ISLAMIC REPUBLIC OF\)/ | IRAN, ISLAMIC REPUBLIC OF | | |
Which would be saved as a CSV:
date,name,email,type,section,action,x,y,condition,notes
2011-03-22,Ian Hough,ian@brighterplanet.com,meta,Intended use,,http://example.com/original-data-with-errors.xls,,A hypothetical document that uses non-ISO country names
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/ANTIGUA & BARBUDA/,ANTIGUA AND BARBUDA,,
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/BOLIVIA/,"BOLIVIA, PLURINATIONAL STATE OF",,
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/BOSNIA & HERZEGOVINA/,BOSNIA AND HERZEGOVINA,,
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/BRITISH VIRGIN ISLANDS/,"VIRGIN ISLANDS, BRITISH",,
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/COTE D'IVOIRE/,CÔTE D'IVOIRE,,
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/DEM\. PEOPLE'S REP\. OF KOREA/,"KOREA, DEMOCRATIC PEOPLE'S REPUBLIC OF",,
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/DEM\. REP\. OF THE CONGO/,"CONGO, THE DEMOCRATIC REPUBLIC OF THE",,
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/HONG KONG SAR/,HONG KONG,,
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/IRAN \(ISLAMIC REPUBLIC OF\)/,"IRAN, ISLAMIC REPUBLIC OF",,
And then used:
require 'errata'
require 'remote_table' # provides RemoteTable
errata = Errata.new(:url => 'http://example.com/errata.csv')
original = RemoteTable.new(:url => 'http://example.com/original-data-with-errors.xls')
original.each do |row|
  errata.correct! row # destructively correct each row
end
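To make the roles of the columns concrete, here is a minimal sketch of what a single replace row amounts to. It is not the errata gem's implementation. It assumes, based only on the example table above, that section names the column to correct, x holds a regular expression written as /text/, and y holds the replacement. RULES, pattern_for, and apply_replacements! are hypothetical names used for illustration, and the condition column is ignored.

```ruby
# Hypothetical stand-in for two rows of the errata table above.
# This is an illustration, not the errata gem's own code.
RULES = [
  { :section => 'Country Name', :action => 'replace',
    :x => '/ANTIGUA & BARBUDA/', :y => 'ANTIGUA AND BARBUDA' },
  { :section => 'Country Name', :action => 'replace',
    :x => '/IRAN \(ISLAMIC REPUBLIC OF\)/', :y => 'IRAN, ISLAMIC REPUBLIC OF' }
]

# Turn the "/.../" text from column x into a Regexp
# by stripping the leading and trailing slash.
def pattern_for(x)
  Regexp.new(x.sub(%r{\A/}, '').sub(%r{/\z}, ''))
end

# Destructively apply every "replace" rule to a row (a Hash keyed by column name).
# Assumption: "section" is the name of the column to correct.
def apply_replacements!(rules, row)
  rules.each do |rule|
    next unless rule[:action] == 'replace'
    value = row[rule[:section]]
    next if value.nil?
    row[rule[:section]] = value.gsub(pattern_for(rule[:x]), rule[:y])
  end
  row
end

row = { 'Country Name' => 'IRAN (ISLAMIC REPUBLIC OF)' }
apply_replacements!(RULES, row)
puts row['Country Name'] # prints "IRAN, ISLAMIC REPUBLIC OF"
```

Rows whose value matches no pattern pass through unchanged, which is why the same errata can be applied to every row of the source.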
UTF-8
errata assumes all input strings are UTF-8. Otherwise, Ruby 1.9 can run into problems with Regexp::FIXEDENCODING. Specifically, ASCII-8BIT regexps might be applied to UTF-8 strings (or vice versa), raising Encoding::CompatibilityError.
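The mismatch can be reproduced in a few lines, and one way to avoid it is to normalize strings before they reach the errata. The sketch below is an assumption about how you might do that yourself. ensure_utf8 is a hypothetical helper, not part of the errata gem, and it assumes that strings tagged ASCII-8BIT really contain UTF-8 bytes.

```ruby
# encoding: utf-8

# Hypothetical helper, not part of the errata gem.
# Assumption: ASCII-8BIT (binary) strings actually hold UTF-8 bytes
# and were merely mislabeled, e.g. by a file or network read.
def ensure_utf8(value)
  return value unless value.is_a?(String)
  if value.encoding == Encoding::ASCII_8BIT
    value.dup.force_encoding(Encoding::UTF_8) # relabel only, bytes untouched
  else
    value.encode(Encoding::UTF_8)             # transcode, e.g. from ISO-8859-1
  end
end

# Same bytes as the UTF-8 string, but tagged as ASCII-8BIT.
mislabeled = "CÔTE D'IVOIRE".dup.force_encoding(Encoding::ASCII_8BIT)

# A regexp literal containing non-ASCII characters has a fixed UTF-8 encoding.
pattern = /CÔTE/

begin
  pattern =~ mislabeled
rescue Encoding::CompatibilityError => e
  puts "mismatch: #{e.message}"
end

fixed = ensure_utf8(mislabeled)
puts fixed.encoding      # prints "UTF-8"
puts(pattern =~ fixed)   # prints 0, the index of the match
```

Relabeling with force_encoding is only safe when the bytes are known to be UTF-8. For data in another encoding, tag it with its real encoding first and let encode transcode it.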
More advanced usage
The earth library has dozens of real-life examples showing errata in action.
Real-world usage
We use errata for data science at Brighter Planet and in production.
The killer combination:
- active_record_inline_schema - define table structure
- remote_table - download data and parse it
- errata (this library!) - apply corrections in a transparent way
- data_miner - import data idempotently
Authors
- Seamus Abshere seamus@abshere.net
- Andy Rossmeissl andy@rossmeissl.net
- Ian Hough ijhough@gmail.com
Copyright
Copyright (c) 2012 Brighter Planet. See LICENSE for details.