Mini ETL
Basic toolkit for Extract/Transform/Load operations. Abstracts the details of sourcing, intermediate structure generation and data persistence.
Usage
Sourcing
A MiniEtl Process is kicked off by configuring a process. For a basic CSV file deserialization and bulk load:
process = MiniEtl.create_process do |process|
  process.source.type = :csv
  process.source.location = 'samples/small.csv'
end
process.bootstrap
TODO: Write a strategy for HTTP, use JSON server
process = MiniEtl.create_process do |process|
  process.source.type = :http
  process.source.location = 'localhost:8080/sample'
end
process.bootstrap
Strategies are available for CSV and JSON. If you need something else entirely, a manual source can be used instead:
process = MiniEtl.create_process do |process|
  process.source.type = :manual
  process.source.method = Proc.new do
    ...
  end
end
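As a rough illustration of what a manual source could do, the sketch below builds a Proc that parses a pipe-delimited string. It assumes, without confirmation from the toolkit itself, that the Proc is expected to return an array of row hashes. The `raw_data` sample and its field names are hypothetical.

```ruby
# Hypothetical raw input in a format that has no built-in strategy.
raw_data = <<~DATA
  name|last_name
  Ada|Lovelace
  Alan|Turing
DATA

# Assumption: a manual source returns an array of row hashes.
manual_source = Proc.new do
  header, *lines = raw_data.lines.map(&:chomp)
  keys = header.split('|').map(&:to_sym)
  # Pair each header key with the value in the same column.
  lines.map { |line| keys.zip(line.split('|')).to_h }
end

rows = manual_source.call
```

A Proc like this could then be assigned to `process.source.method`, provided the return shape matches what the generator expects.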
Structure generation
Once data sourcing is complete, data can be fetched in-place.
process = MiniEtl.create_process do |process|
  process.source.type = :csv
  process.source.location = 'samples/small.csv'
end
process.bootstrap
process.generate
process.generator.structures # intermediate structure for bulk import
If the data source is too large to process in memory, an iterator can be given instead:
process = MiniEtl.create_process do |process|
  process.source.type = :csv
  process.source.location = 'samples/large.csv'
  process.generator.lazy = true
end
process.bootstrap
process.generator.start do |structures|
  ...
end
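The idea behind lazy generation can be sketched in plain Ruby, independent of MiniEtl: rows are read on demand and handed to a block in small batches, so only one batch is in memory at a time. This is only an illustration of the technique, not the toolkit's implementation. The batch size of 2, the `StringIO` stand-in for a large file, and the naive comma split (which ignores quoted fields) are all assumptions.

```ruby
require 'stringio'

# Stand-in for a large CSV file; a real source would be an open File.
io = StringIO.new("name,last_name\nAda,Lovelace\nAlan,Turing\nGrace,Hopper\n")

# Read the header line first and turn it into symbol keys.
header = io.gets.chomp.split(',').map(&:to_sym)

# each_line without a block returns an Enumerator, so lines are pulled
# on demand; each_slice groups the mapped rows into fixed-size batches.
batches = io.each_line
            .lazy
            .map { |line| header.zip(line.chomp.split(',')).to_h }
            .each_slice(2)

batch_sizes = []
batches.each do |structures|
  # Only the current batch of structures is materialized here.
  batch_sizes << structures.size
end
```

With three data rows and a batch size of 2, the block runs twice: once with two structures and once with the remaining one.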
Data persistence
Finally, once the data is shaped the way you need it, it can be persisted to
any kind of store. The receiver class is expected to respond to
.create(args):
process = MiniEtl.create_process do |process|
  process.source.type = :csv
  process.source.location = 'samples/large.csv'
  process.store.type = Person # An Active Record model
end
process.bootstrap
process.generate
process.persist
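In this way, any arbitrary store can be created:
class Payroll
  # Defines Struct::Target; keyword_init lets it be built from named params
  Struct.new('Target', :name, :last_name, ..., keyword_init: true)
  @@data = []
  # Class-level method, since the store is expected to respond to .create
  def self.create(params = {})
    @@data << Struct::Target.new(name: params[:name], last_name: params[:last_name], ...)
  end
end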
process = MiniEtl.create_process do |process|
  process.source.type = :csv
  process.source.location = 'samples/small.csv'
  process.store.type = Payroll
end
process.bootstrap
process.generate
process.persist
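For a version of the store contract that runs on its own, here is a minimal in-memory sketch. `InMemoryPayroll`, its `Entry` struct and the `entries` reader are hypothetical names, and the final loop only assumes that persisting amounts to calling `.create` once per generated structure. How MiniEtl actually drives the store is not shown here.

```ruby
class InMemoryPayroll
  # keyword_init allows entries to be built from named arguments.
  Entry = Struct.new(:name, :last_name, keyword_init: true)

  # Class-level instance variable holding everything stored so far.
  @entries = []

  class << self
    attr_reader :entries

    # The store contract from the text: respond to .create(args).
    def create(params = {})
      entry = Entry.new(name: params[:name], last_name: params[:last_name])
      @entries << entry
      entry
    end
  end
end

# Assumption: persisting means one .create call per structure.
structures = [
  { name: 'Ada', last_name: 'Lovelace' },
  { name: 'Alan', last_name: 'Turing' }
]
structures.each { |attrs| InMemoryPayroll.create(attrs) }
```

Defining `create` on the class rather than on instances matters here, because the process is configured with the class itself as `process.store.type`.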
Development
TODO: Test stuff
$ rake
Runs RSpec and RuboCop, and generates a coverage report.
TODO: Explain how to generate CSV files and the rest of the samples
NOTE: This takes about 5.5 minutes, which is very slow; a parallel version would be needed
$ rake sample:csv:all
TODO: Explain how to use JSON Server to provide a fake API
$ npm install -g json-server
$ rake sample:json:small
$ json-server --watch samples/small.json --port 3001
The API is now available at localhost:3001/payroll
...
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/etl.