data_miner
Download, pull out of a ZIP/TAR/GZ/BZ2 archive, parse, correct, and import XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models.
Tested in MRI 1.8.7+, MRI 1.9.2+, and JRuby 1.6.7+. Thread safe.
Real-world usage
We use data_miner for data science at Brighter Planet and in production.
The killer combination for us is:
- active_record_inline_schema - define table structure
- remote_table - download data and parse it
- errata - apply corrections in a transparent way
- data_miner (this library!) - import data idempotently
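To make "import data idempotently" concrete, here is a minimal plain-Ruby sketch of a keyed import: each record is found or created by its key and then overwritten, so running the import twice leaves the same rows. The KeyedStore class and its import method are hypothetical stand-ins for a database table. This is not data_miner's implementation, only an illustration of the idea.

```ruby
# Hypothetical in-memory stand-in for a database table; not part of data_miner.
class KeyedStore
  attr_reader :rows

  def initialize
    @rows = {}
  end

  # Find or create the row for each record's key, then overwrite its attributes.
  def import(records, key)
    records.each do |record|
      row = (@rows[record.fetch(key)] ||= {})
      row.merge!(record)
    end
    self
  end
end

# Illustrative records only.
source = [
  { :iso_3166_code => 'GB', :name => 'United Kingdom' },
  { :iso_3166_code => 'FR', :name => 'France' }
]

store = KeyedStore.new
store.import(source, :iso_3166_code)
first_pass = Marshal.load(Marshal.dump(store.rows)) # deep copy for comparison
store.import(source, :iso_3166_code)                # second run, same input

puts store.rows.size
puts store.rows == first_pass
```

Because rows are addressed by key rather than appended, a second run updates existing rows instead of duplicating them.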
Documentation
Check out the extensive documentation.
Quick start
You define data_miner blocks in your ActiveRecord models. For example, in app/models/country.rb:
class Country < ActiveRecord::Base
  self.primary_key = 'iso_3166_code'

  # the "col" class method is provided by a different library - active_record_inline_schema
  col :iso_3166_code                             # alpha-2 2-letter like GB
  col :iso_3166_numeric_code, :type => :integer  # numeric like 826; aka UN M49 code
  col :iso_3166_alpha_3_code                     # 3-letter like GBR
  col :name

  data_miner do
    # auto_upgrade! is provided by active_record_inline_schema
    process :auto_upgrade!

    import("OpenGeoCode.org's Country Codes to Country Names list",
           :url => 'http://opengeocode.org/download/countrynames.txt',
           :format => :delimited,
           :delimiter => '; ',
           :headers => false,
           :skip => 22) do
      key :iso_3166_code, :field_number => 0
      store :iso_3166_alpha_3_code, :field_number => 1
      store :iso_3166_numeric_code, :field_number => 2
      store :name, :field_number => 5
    end
  end
end
Now you can run:
>> Country.run_data_miner!
=> nil
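As a rough illustration of what the :skip, :delimiter, and :field_number options in the quick start describe, the following plain-Ruby sketch skips leading lines, splits each remaining line on '; ', and picks fields by zero-based position. The parse_delimited helper, its :fields option, and the sample text are all made up for this example. The real file has 22 skipped lines and different contents, and data_miner delegates the actual parsing to remote_table.

```ruby
# Made-up sample in the shape the quick start options imply; not the real file.
SAMPLE = <<-TEXT
# comment line to skip
# another comment line to skip
GB; GBR; 826; -; -; United Kingdom
FR; FRA; 250; -; -; France
TEXT

# Hypothetical helper mirroring :skip, :delimiter and :field_number.
def parse_delimited(text, options)
  lines = text.lines.map { |line| line.chomp }.drop(options[:skip])
  lines.map do |line|
    fields = line.split(options[:delimiter])
    attrs = {}
    # Map each attribute name to the field at its zero-based position.
    options[:fields].each { |name, number| attrs[name] = fields[number] }
    attrs
  end
end

rows = parse_delimited(SAMPLE,
  :skip => 2,
  :delimiter => '; ',
  :fields => {
    :iso_3166_code => 0,
    :iso_3166_alpha_3_code => 1,
    :iso_3166_numeric_code => 2,
    :name => 5
  })

rows.each { |row| puts row.inspect }
```

Values come out as strings here. In the quick start model, iso_3166_numeric_code is declared with :type => :integer, so the stored column is an integer.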
More advanced usage
The earth library has dozens of real-life examples showing how to download, pull out of a ZIP/TAR/BZ2 archive, parse, correct, and import CSVs, fixed-width files, ODS, XLS, XLSX, even HTML and XML:
Model | Highlights | Reference |
---|---|---|
Aircraft | parsing Microsoft Frontpage HTML (!) | data_miner.rb |
Airports | forcing column names and use of :select block (Proc) | data_miner.rb |
Automobile model variants | super advanced usage of "custom parser" and errata | data_miner.rb |
Country | parsing CSV and a few other tricks | data_miner.rb |
EGRID regions | parsing XLS | data_miner.rb |
Flight segment (stage) | super advanced usage of POSTing form data | data_miner.rb |
Zip codes | downloading a ZIP file and pulling an XLSX out of it | data_miner.rb |
And many more - look for the data_miner.rb file that corresponds to each model. Note that you would normally put the data_miner declaration right inside the ActiveRecord model file. It's kept separate in earth so that loading it is optional.
Authors
- Seamus Abshere seamus@abshere.net
- Andy Rossmeissl andy@rossmeissl.net
- Derek Kastner dkastner@gmail.com
- Ian Hough ijhough@gmail.com
- Tower He towerhe@gmail.com
Wishlist
- Make the tests real unit tests
- sql steps shouldn't shell out if binaries are missing
Copyright
Copyright (c) 2013 Seamus Abshere