Project

remi

0.0
No commit activity in last 3 years
No release in over 3 years
Data manipulation and ETL in Ruby
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
 Dependencies

Development

= 0.2.7
~> 3.3
~> 0.9

Runtime

~> 0.5
~> 2.1
~> 1.1
~> 1.6
~> 2.1
~> 0.18
~> 3.3
 Project Readme

Remi - Ruby Extract Map Integrate

Warning: This project is very much in the development stage and all features are subject to change. Documentation is skimpy at best. Test coverage is light but improving. To get a taste of what Remi has to offer, check out the sample feature and sample job.

Purpose: Remi is a Ruby-based ETL package that provides an expressive data transformation language and facilitates the implementation and validation of non-technical business logic.

Borrowing from principles of test/behavior-driven development (TDD/BDD), Remi is a system that supports business-rule-driven development (BRDD). BRDD captures the idea that the rules that describe data transformations should both (1) be accessible to non-technical business users and (2) be strictly enforced in the logic that executes those transformations. Remi is a Ruby application that allows a developer to write data transformation logic and have that logic validated according to business rule documentation.

Remi will follow semantic versioning principles. Of course, while we're still on major version zero, little effort will be made to maintain backward compatibility.

The data transformation layer is built on top of Daru dataframe. Familiarity with Daru dataframes is essential for writing complex transformations in Remi.

Getting Started

Add the gem to your Gemfile, bundle install, and then initialize your repository as Remi project

remi --init

This command will create two directories: jobs and features. The jobs directory contains an example of a Remi job that can be tested using the BRDD spec defined in the features directory. Test to make sure this works by running

cucumber

All of the test should pass.

Remi Jobs

A Remi job describes the data sources that will be used to collect data, the transformations that will be performed on the data, and the data targets that will be populated when all transformations are complete. With Remi, an ETL process is defined in a class that inherits from the Remi::Job class.

Hello World

A very simple "Hello World" example of a Remi job would be

class HelloWorldJob < Remi::Job
  transform :say_hi do
    puts "Hello World"
  end
end

This job doesn't make use of any data subjects (data sources or data targets), but it does define a single data transform called :say_hi. The full job can be executed by calling the #execute method on an instance of the HelloWorldJob class

job = HelloWorldJob.new
job.execute
#=> "Hello World"

The transform called say_hi is just a method of the HelloWorldJob class representing a job transform object. Multiple transforms can be defined in a Remi job. To execute a specific transform we can call that transform by name using

job = HelloWorldJob.new
job.say_hi.execute
#=> "Hello World"

A more complete example

Suppose we have a database containing data on beer sales. It's a normalized database where we store data on individual beers sold in a beer_sales_fact table and information on the details of the beer in a beers_dim table. We'd like to extract data from both of these sources, combine them into a single flattened table and save it as a CSV file. This operation could be performed with the following Remi job. (Of course, if this were a real world problem, we'd do the join in the database before extracting; this is a contrived example to show how one can combine data from multiple arbitrary sources).

class DenormalizeBeersJob < Remi::Job
  source :beer_sales_fact do
    extractor Remi::Extractor::Postgres.new(
      credentials: {
        dbname: 'my_local_db'
      },
      query: 'SELECT beer_id, sold_date, quantity FROM beer_sales_fact'
    )
    parser Remi::Parser::Postgres.new

    fields(
      {
        :beer_id  => {},
        :sold_at  => { type: :date, in_format: '%Y-%m-%d' },
        :quantity => { type: :integer }
      }
    )
  end

  source :beers_dim do
    extractor Remi::Extractor::Postgres.new(
      credentials: {
        dbname: 'my_local_db'
      },
      query: 'SELECT beer_id, name, price_per_unit FROM beers_dim'
    )
    parser Remi::Parser::Postgres.new

    fields(
      {
        :beer_id        => {},
        :name           => {},
        :price_per_unit => { type: :decimal, scale: 2 }
      }
    )
  end

  target :flat_beer_file do
    encoder Remi::Encoder::CsvFile.new
    loader Remi::Loader::LocalFile.new(
      path: 'flat_beers.csv'
    )
  end

  transform :type_enforcement do
    beer_sales_fact.enforce_types
    beers_dim.enforce_types
  end

  transform :flatten do
    flat_beer_file.df = beer_sales_fact.df.join(flat_beer_file.df, on: [:beer_id], how: :inner)

    Remi::SourceToTargetMap.apply(flat_beer_file.df) do
      map source(:quantity, :price_per_unit) .target(:total_price)
        .transform(->(row) {
          row[:quantity] * row[:price_per_unit]
        })
    end
  end
end

Components of a Remi Job

A Remi job is composed of one or more of the following elements, which are described in more detail below. All of these elements are defined using class methods (part of Remi::Job). Each of the elements is given a name and defined in a block.

  • Data Subjects - A data subject is either a data source or a data target.

    • Data Sources - A data source describes where data is extracted from.
    source :my_source do
      # ... source definition
    end
    • Data Targets - A data target describes where data is loaded to.
    target :my_target do
      # ... target definition
    end
  • Transforms - A transform is essentially arbitrary block of of Ruby code, but is typically used to transform data sources into data targets.

    transform :my_transform do
      # ... lots of code
    end
  • Job Parameters - A job parameter is a memoized block of code (similar to RSpecs' let method) that is used to configure a job and may be overridden at runtime if needed.

    param :my_param do
      # ... the return value of this block is memoized
    end
  • Sub Transforms - Sub transforms are essentially transforms, but they are NOT automatically executed when the job is executed. Instead, they must be imported in a transform. They are meant to be reusable bits of transform code.

    sub_transform :my_sub_transform do
      #... sub_transform stuff
    end
  • Sub Jobs - Sub jobs are simply references to other Remi jobs that may be executed within the current job.

    sub_job :my_sub_job { MySubJob.new }

Execution Plan

The DenormalizeBeersJob example above can be executed using

job = DenormalizeBeersJob.new
job.execute

Calling #execute on an instance of a job does the following, in this order:

  1. All transforms defined in the job (via transform :name do ... end) are executed in the order they were defined in the class definition.
  2. All data targets are loaded in the order they are defined in the job.

Note that data sources are not extracted until the moment the data is needed in a transform. If the source data is never referenced in a transform, it is never extracted.

Data Subjects

A data subject refers to either a data source or a data target. Either way, a data subject is associated with a data frame. Currently the only data frames supported are Daru data frames, but support for other data frames may be developed in the future. The data frame associated with a data subject is accessed with the #df method and assigned with the #df= method.

  my_data_subject.df #=> Daru::DataFrame
  my_data_subject.df = Daru::DataFrame.new(...)

Additionally, all data subjects can be associated with a set of fields and field metadata. Associating a data subject with feild data allows us to develop generic ETL routines that triggered by arbitrary metadata that may be associated with a field.

Sources

Targets

Field Metadata

Available Data Subjects

  • CSV Files
  • DataFrames
  • None
  • Local files
  • SFTP Files
  • S3 Files
  • Salesforce
  • Postgres

Transforms

Sub Jobs

Job Parameters

Sub Transforms

Transforming Data

When #execute is called on an instance of a Remi::Job, all transforms are executed in the order defined in the class

TODO:

Describe Daru foundation

Examples setting up a job class with

  • csv source
  • sf source
  • dataframe intermediate target
  • csv target
  • parameters
  • maps

Transform cardinality

Within a source-to-target map block, there are a few different possible transform cardinalities: one-to-one, many-to-one, one-to-many, many-to-many, zero-to-one, and zero-to-many. The lambda functions that are supplied to #transfrom method must satisfy different conditions based on cardinality.

For all of the following examples, we'll assume that a dataframe exists defined by

  df = Remi::DataFrame::Daru.new(
    [
      ['a1','b1','c1', ['d',1]],
      ['a2','b2','c2', ['d',2]],
      ['a3','b3','c3', ['d',3]],
    ].transpose,
    order: [:a, :b, :c, :d]
  )

one-to-one - These maps expect a lambda that accepts the value of a field as an argument and returns the result of some operation, which is used to populate the target.

Remi::SourceToTargetMap.apply(df) do
  map source(:a) .target(:aprime)
    .transform(->(v) { "#{v}prime" })
end

df[:aprime].to_a #=> ['a1prime', 'a2prime', 'a3prime']

many-to-one - These maps expect that the lambda accepts a row object as an argument and returns the result of the operation, which is used to populate the target.

Remi::SourceToTargetMap.apply(df) do
  map source(:a, :b) .target(:ab)
    .transform(->(row) { "#{row[:a]}#{row[:b]}" })
end

df[:ab].to_a #=> ['a1b1', 'a2b2', 'a3b3']

zero-to-many/one-to-many/many-to-many - These maps expect that the lambda accepts a row object as an argument. The row object is then modified in place, which is used to populate the targets. The return value of the lambda is ignored.

Remi::SourceToTargetMap.apply(df) do
  map source(:a, :b) .target(:aprime, :ab)
    .transform(->(row) {
      row[:aprime] = row[:a]
      row[:ab] = "#{row[:a]}#{row[:b]}"
    })
end

df[:aprime].to_a #=> ['a1prime', 'a2prime', 'a3prime']
df[:ab].to_a #=> ['a1b1', 'a2b2', 'a3b3']

zero-to-one - These maps expect that the lambda accepts no arguments and returns the result of some operation, which is used to populate the target.

Remi::SourceToTargetMap.apply(df) do
  counter = 1.upto(3).to_a
  map target(:counter)
    .transform(->() { counter.pop })
end

df[:counter].to_a #=> [1, 2, 3]

Business Rules

TODO: Description of writing Business Rules.

Conventions to follow when writing features

  • Sources, targets, examples, field names enclosed in single quotes - 'field name'
  • Field values enclosed in double quotes - "field value"
  • Special functions enclosed in stars - *function*
  • Example values encolsed in angular brackets - <example>

Write whatever in scenario and feature descriptions

Common step library

Given the job is 'My Cool Job' Given the job source 'Client File' Given the job source 'Salesforce Extract' Given the job target 'Salesforce Contact'`

Given the following example record called 'my killer example record':
  | Id   | Name                |
  | 1234 | OneTwoThreeFour     |

... etc ...

Business Rule Validation

TODO: Description of how to write Business Rule validations.

Contributing

The best way to contribute would be to try it out and provide as much feedback as possible.

If you want to develop the Remi framework then just fork, code, pull request, repeat. Try to follow the Ruby style guide and suggest other best practices. I'm very interested in getting other ETL developers contribute their own perspective to the project.