Project

simple_etl

 Project Readme

Simple ETL

An easy-to-use toolkit to help you with ETL (Extract Transform Load) operations.

Simple ETL 'would be' (:D) framework-agnostic and easy to use.

Source

The Source namespace is responsible for parsing input files.

Note: every format plugin will define its own field syntax derived from the base syntax described here, so remember to read the Wiki.

First of all you have to define a "source template" inside a definition file (for example my_template.stl):

    define :format_name do
      field :name
      field :surname
    end

Then you will load the template with the following code:

    my_template = SimpleEtl::Source.load './etl/my_template.stl'

At this point you can parse a source and process the result, as in the following code:

    my_template.parse '....', :type => :inline # load data inline
    result = my_template.parse 'source.dat' # load from file

    if result.valid?
      result.rows.each do |row|
        puts "|\t#{row.name}\t|\t#{row.surname}\t|"
      end
      puts "Parse Completed!"
    else
      result.errors.each do |error|
        puts "Error while parsing line #{error.row_index}: #{error.message}"
      end
    end

As you can see, the result is valid if there are no errors.

The rows array contains all the parsed rows. Each row contains the parsed attributes as accessors.

The errors array contains all the generated errors. Each error is an object with 'row_index', 'message' and 'exception' properties.
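To make the shape of the result more concrete, here is a minimal stand-alone sketch that does not use Simple ETL itself. `SketchResult`, `SketchError`, `ParsedRow` and `sketch_parse` are hypothetical names, the comma-separated input format is an assumption, and so is the zero-based `row_index`. Only `valid?`, `rows`, `errors`, `row_index`, `message` and `exception` come from the description above.

```ruby
# Hypothetical stand-ins for illustration; these are not the gem's classes.
ParsedRow   = Struct.new(:name, :surname)
SketchError = Struct.new(:row_index, :message, :exception)

class SketchResult
  attr_reader :rows, :errors

  def initialize
    @rows = []
    @errors = []
  end

  # The result is valid when no errors were collected.
  def valid?
    errors.empty?
  end
end

# Parses "name,surname" lines (an assumed format); bad lines become errors.
def sketch_parse(text)
  result = SketchResult.new
  text.each_line.with_index do |line, index|
    name, surname = line.chomp.split(',', 2)
    begin
      raise ArgumentError, 'surname is missing' if surname.nil? || surname.empty?
      result.rows << ParsedRow.new(name, surname)
    rescue ArgumentError => e
      # Keep going: one bad line does not stop the parse.
      result.errors << SketchError.new(index, e.message, e)
    end
  end
  result
end

result = sketch_parse("Ada,Lovelace\nGrace\n")
puts result.valid?                # false, the second line has no surname
puts result.errors.first.message  # surname is missing
```

The point of the sketch is that parsing collects errors instead of stopping at the first one, so a single pass can report every bad line.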

Structure of the template definition

A template definition is composed of three layers:

  • raw fields
  • transformations
  • generators

Fields

    field :name
    field :surname, :type => :string, :required => true

By default the type is 'object', which means the value is not converted to any format. Other possible types are:

  • string: the field is stripped of extra spaces;

  • integer: the field is stripped. If the input value is nil or empty, nil is returned; it is converted to an integer if the value contains numbers; otherwise a CastError is raised;

  • float: the field is stripped. If the input value is nil or empty, nil is returned; it is converted to a float if the value contains numbers; otherwise a CastError is raised;

  • boolean: the field is stripped. If the input value is nil or empty, nil is returned; it is converted to a boolean if the input value is true, false, 1 or 0; otherwise a CastError is raised.
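The casting rules above can be sketched in plain Ruby. This is not the gem's code: `cast` is a hypothetical helper and `CastError` is redefined locally so the example runs on its own. It also makes two assumptions: "contains numbers" is read strictly (the whole stripped value must be numeric), and a nil string stays nil.

```ruby
# Local stand-in named after the CastError mentioned above.
class CastError < StandardError; end

# Casts a raw value following the rules in the list above (sketch only).
def cast(value, type)
  return value if type == :object || value.nil? # default type: no conversion

  text = value.to_s.strip
  return text if type == :string
  return nil if text.empty?                     # blank input becomes nil

  case type
  when :integer then Integer(text, 10)
  when :float   then Float(text)
  when :boolean
    return true  if %w[true 1].include?(text.downcase)
    return false if %w[false 0].include?(text.downcase)
    raise ArgumentError, 'not a boolean'
  else
    raise CastError, "unknown type #{type.inspect}"
  end
rescue ArgumentError
  # Integer(), Float() and the boolean branch all signal bad input this way.
  raise CastError, "cannot cast #{value.inspect} to #{type}"
end

p cast('  42 ', :integer)  # 42
p cast('', :float)         # nil
p cast('0', :boolean)      # false
```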

The template definition provides a helper for each defined type, so you can write:

    string :name
    integer :age

For each helper, an additional 'required' helper will also be available:

    required_string :name
    required_integer :age
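One way such helpers could be built is with `define_method`, where each helper is just shorthand for `field` with the matching options. The sketch below is a guess at the idea, not Simple ETL's implementation: `SketchTemplate` and its `fields` array are hypothetical, and the 'object' type is left out of the helper list as an assumption.

```ruby
# Hypothetical template builder; not Simple ETL's real class.
class SketchTemplate
  TYPES = %i[string integer float boolean].freeze

  attr_reader :fields

  def initialize
    @fields = []
  end

  # Base syntax: field :name, :type => :string, :required => true
  def field(name, options = {})
    @fields << { name: name, type: :object, required: false }.merge(options)
  end

  TYPES.each do |type|
    # string :name  is shorthand for  field :name, :type => :string
    define_method(type) do |name, options = {}|
      field(name, options.merge(type: type))
    end

    # required_string :name  also sets :required => true
    define_method("required_#{type}") do |name, options = {}|
      field(name, options.merge(type: type, required: true))
    end
  end
end

template = SketchTemplate.new
template.instance_eval do
  string :name
  required_integer :age
end
p template.fields
```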

Note: every format plugin will define its own field syntax derived from the base syntax described here, so remember to read the Wiki.

Transformers and generators

They are functions that help you manipulate the parsed raw data:

    transform(:name) { |name| name.downcase } # => the name field is converted to lowercase

    # a full_name field will be present in the row
    generate :full_name do
      "#{name} #{surname}"
    end

    generate :company do
      if cmp = Company.find(company_id)
        cmp
      else
        raise ParseError.new "Cannot find a company with id #{company_id}"
      end
    end

A transformer is a code block that transforms a particular value. It's executed as soon as the input value is parsed (if it's valid).

A generator is a code block that generates a new property for the current row. All the generators are executed when the entire row has been read and transformed.
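The ordering described above (transformers first, field by field, then generators on the complete row) can be illustrated with a small stand-alone sketch. `SketchRow` and `process_row` are hypothetical names, and evaluating each generator in the context of the row is an assumption based on the bare `name` and `surname` calls in the example.

```ruby
# Hypothetical row object exposing its values as accessors.
class SketchRow
  def initialize(values)
    @values = values
  end

  def []=(key, value)
    @values[key] = value
  end

  def respond_to_missing?(name, include_private = false)
    @values.key?(name) || super
  end

  def method_missing(name, *args)
    @values.key?(name) ? @values[name] : super
  end
end

def process_row(raw, transformers, generators)
  # 1. Each transformer runs on its own field value.
  values = raw.to_h do |field, value|
    transformer = transformers[field]
    [field, transformer ? transformer.call(value) : value]
  end

  # 2. Generators run after the whole row is transformed,
  #    so they see the transformed values.
  row = SketchRow.new(values)
  generators.each do |field, generator|
    row[field] = row.instance_exec(&generator)
  end
  row
end

transformers = { name: proc { |name| name.downcase } }
generators   = { full_name: proc { "#{name} #{surname}" } }

row = process_row({ name: 'ADA', surname: 'Lovelace' }, transformers, generators)
puts row.full_name  # ada Lovelace
```

Because generators run last, `full_name` is built from the already downcased name.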

Skip rows

If you need to skip some rows before parsing the file you can use the helper 'skip_rows':

    define :format do
      skip_rows 2
      field :name
    end

This will start the parsing from the third row.
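The effect on the input can be pictured with plain Ruby. This is only an illustration: `lines_to_parse` is a hypothetical helper, not part of Simple ETL, and it says nothing about how the gem numbers rows in error messages.

```ruby
# Hypothetical helper: drop the first skip_rows lines before parsing.
def lines_to_parse(text, skip_rows = 0)
  text.each_line.map(&:chomp).drop(skip_rows)
end

data = "Monthly report\nname\nAda\nGrace\n"
p lines_to_parse(data, 2)  # the two header lines are ignored
```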