Ingestor

A simple DSL for importing data from text and CSV files into ActiveRecord. It was originally designed to continually import changing data from EAN and Geonames.

Great for parsing JSON, XML, CSV and plain text into ActiveRecord. If you need to scrape HTML into ActiveRecord, check out klepto.

Installation

Add this line to your application's Gemfile:

gem 'ingestor'

And then execute:

$ bundle install

Or install it yourself as:

$ gem install ingestor

Add the following to your Rakefile:

require 'ingestor/tasks'

Usage

Given a text file:

id|name|population
1|China|1,354,040,000
2|India|1,210,193,422
3|United States|315,550,000

And an AR Class:

class Country < ActiveRecord::Base
  attr_accessible :name, :population
end

Sync the file with AR:

ingest("path/to/countries.txt") do
  map_attributes do |values|
    {
      id:           values[0],
      name:         values[1],
      population:  values[2]
    }
  end

  # attrs holds the mapped values for the current line
  finder{|attrs| 
    Country.where(id: attrs[:id]).first || Country.new
  }
end

It can handle remote files and zip files as well.

ingest("http://example.com/a_lot_of_countries.zip") do
  compressed true
  map_attributes do |values|
    {
      id:           values[0],
      name:         values[1],
      population:  values[2]
    }
  end

  # attrs holds the mapped values for the current line
  finder{|attrs| 
    Country.where(id: attrs[:id]).first || Country.new
  }
end

It can handle XML, JSON, and more...

require 'ingestor/parser/xml'
ingest("http://example.com/books.xml") do
  parser :xml
  parser_options xpath: '//book'
  map_attributes do |values|
    {
      id:           values['id'],
      title:        values['title'],
      author:       {
        name: values['author']
      }
    }
  end

  # attrs holds the mapped values for the current entry
  finder{|attrs| 
    Book.where(id: attrs[:id]).first || Book.new
  }

  processor{|attrs,record|
    record.update_attributes(attrs)
    record.reviews.create({
      stars: 5,
      comment: "Every book they sell is so great!"
    })
  }
end

CSV Example

require 'ingestor/parser/csv'
ingest "./samples/contracts.csv" do
  parser :csv
  
  # all options come directly from Ruby core CSV class
  parser_options :headers => true,
    :col_sep            => ",",
    :row_sep            => :auto,
    :quote_char         => '"',
    :field_size_limit   => nil,
    :converters         => nil,
    :unconverted_fields => nil,
    :return_headers     => false,
    :header_converters  => nil,
    :skip_blanks        => false,
    :force_quotes       => false    

  # How to map out the columns from text to AR
  map_attributes do |row|
    {
      id:                 row[0],
      seller_name:        row[1],
      customer_name:      row[2],
      commencement_date:  row[7],
      termination_date:   row[8]
    }
  end
  
  # before{|attrs| attrs}
  
  # Your strategy for finding or instantiating a new object to be handled by the processor block
  finder{|attrs|
    Contract.new
  }

  processor{|attrs,record|
    # ... custom processor here ...
    record.update_attributes attrs
  }
  
  after{|record| 
    puts "Created: #{record.summary}"
  }
end
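
Because the parser options above are passed through to Ruby's CSV class, it can help to see what `headers: true` does on its own. The following is a minimal sketch using only the standard `csv` library, not Ingestor itself; the inline sample data and variable names are made up for illustration.

```ruby
require 'csv'

# Hypothetical sample data standing in for a contracts file.
data = "id,seller_name,customer_name\n1,Acme,Globex\n"

# With headers: true the first line is treated as a header row
# and is not returned as a data row.
table = CSV.parse(data, headers: true, col_sep: ',', quote_char: '"')

first_row = table.first

# A row can be read by position (as in map_attributes above) or by header name.
attrs = {
  id:            first_row[0],
  seller_name:   first_row[1],
  customer_name: first_row['customer_name']
}
```

Positional access such as row[0] keeps working when headers are enabled, so the mapping style shown in the example above does not have to change.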

JSON Example

require 'ingestor/parser/json'
ingest("http://example.com/people.json") do
  parser :json
  parser_options collection: lambda{|document|
    document['people']
  }
  map_attributes do |values|
    {
      name:         values["first_name"] + " " + values["last_name"],
      age:          values['age'],
      address:      values['address']
    }
  end

  # attrs holds the mapped values for the current entry
  finder{|attrs| 
    Person.where(name: attrs[:name]).first || Person.new
  }

  processor{|attrs,record|
    record.update_attributes(attrs)
    record.send_junk_mail!
  }
end

Advanced Usage

DSL Options

parser - the parser to use on the file
- Symbol
- Optional
- Default: :plain_text
- Available Values: :plain_text, :xml, :json, :csv, :html
- See 'Included Parsers' below
parser_options - options for a specific parser
- Hash
- Optional
- Default: set per parser
- See 'Included Parsers' below
sample - dump a single raw entry from the file to STDOUT and exit
- Boolean
- Optional
- Default: false
includes_header - Tells the parser that the first line is a header and should be ignored
- Boolean
- Optional
- Default: false
compressed - whether the file should be decompressed before parsing
- Boolean
- Optional
- Default: false
working_directory - where to store remote or decompressed files for local processing
- String
- Optional
- Default: /tmp/ingestor
before - callback that receives attributes for each record BEFORE call to [finder]
- Proc(attributes)
- Optional
- Default: nil
finder - Arel finder for each object
- Proc(attributes)
- Returns: ~ActiveModel
- Required
processor - What to do with the attributes and object
- Proc(attributes,record)
- Returns: ~ActiveModel
- Optional
- Default: Proc, calls #update_attributes on record without protection
after - callback that receives each record after [processor]
- Proc(record)
- Optional
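
The order in which the callbacks above run can be sketched in plain Ruby. This is an illustration under assumptions, not Ingestor's implementation: the Struct, the in-memory store and the log array are hypothetical stand-ins for an ActiveRecord model and its table, and the sketch assumes the return value of before is ignored.

```ruby
# Hypothetical in-memory stand-ins for a model class and its table.
record_class = Struct.new(:id, :name)
store = {} # id => record
log = []   # records the order in which callbacks fire

before = lambda { |attrs| log << :before }

# Find an existing record by id or build a new one.
finder = lambda do |attrs|
  log << :finder
  store[attrs[:id]] || record_class.new
end

# Copy the attributes onto the record and "save" it.
processor = lambda do |attrs, record|
  log << :processor
  attrs.each { |key, value| record[key] = value }
  store[record.id] = record
  record
end

after = lambda { |record| log << :after }

rows = [
  { id: 1, name: 'China' },
  { id: 1, name: "People's Republic of China" }
]

rows.each do |attrs|
  before.call(attrs)
  record = finder.call(attrs)
  record = processor.call(attrs, record)
  after.call(record)
end
```

Because the finder looks up by id, the second row updates the record created by the first instead of adding a duplicate, which is the behaviour you want when re-importing a changing file.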

Included Parsers

Writing parsers is simple (see examples).

Plain Text Parser

Parses a plain text document.

Options

delimiter - how to split up each line
- String
- Default: '|'
- Optional
line_processor - override default_line_processor. The default_line_processor simply splits the string using the delimiter
- Proc(string)
- Returns: Array
- Default: nil
- Optional
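
As a rough illustration of the two options above, a line processor is just a Proc that takes a string and returns an array. The lambdas below are hypothetical stand-ins written in plain Ruby; they are not copied from Ingestor's source.

```ruby
delimiter = '|'

# Stand-in for the default behaviour: split the line on the delimiter.
default_line_processor = lambda do |line|
  line.chomp.split(delimiter)
end

# A custom processor that also trims whitespace around each field.
stripping_line_processor = lambda do |line|
  line.chomp.split(delimiter).map(&:strip)
end

values = default_line_processor.call("3|United States|315,550,000\n")
# values is ["3", "United States", "315,550,000"]
```

Something like the stripping variant could be supplied as line_processor when the source file pads its fields with spaces.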

XML Parser

Parses an XML document

Options

selector - xpath selector to get the node collection
- String
- Required
encoding - XML Encoding. See nokogiri encoding
- String
- Optional
- Default: libxml2 best guess

JSON Parser

Parses a JSON document

Options

collection - receives the document and narrows it down to the collection you are interested in
- Proc(Hash)
- Returns Hash | Array
- Required
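
To make the collection option concrete, here is a small sketch using only Ruby's standard `json` library. The sample document and names are invented for illustration, and the code does not call Ingestor.

```ruby
require 'json'

# Hypothetical document with the records nested under a "people" key.
raw = '{"people": [{"first_name": "Ada", "last_name": "Lovelace", "age": 36}]}'
document = JSON.parse(raw)

# The collection Proc receives the parsed document and returns
# the part that should be iterated over.
collection = lambda { |doc| doc['people'] }

entries = collection.call(document)
names = entries.map do |person|
  "#{person['first_name']} #{person['last_name']}"
end
```

Each element of entries is what a map_attributes block would then receive as values.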

CSV Parser

Coming soon...

HTML Parser

Coming soon...

Contributing

Fork it
Create your feature branch (git checkout -b my-new-feature)
Commit your changes (git commit -am 'Add some feature')
Push to the branch (git push origin my-new-feature)
Create new Pull Request

Running Tests

Copy spec/orm/database.example.yml to spec/orm/database.yml
Configure spec/orm/database.yml
bundle exec guard

Todos

Deprecate plain_text (this was the first thing I created)
rdoc http://rdoc.rubyforge.org/RDoc/Markup.html
Move includes_header to CSV, PlainText
Mongoid Support
sort/limit options
configure travis
A way to sample a file without building an ingestor first
- bin/ingestor --sample --path=./my.xml --parser xml --parser_options_xpath '//book'
