Database package for the metacrunch ETL toolkit.
 Dependencies

Runtime

>= 5.0.0
 Project Readme

metacrunch-db


This is the official SQL database package for the metacrunch ETL toolkit. The implementation is built on the Sequel gem, so every database that Sequel supports can be used with this package.

Installation

Include the gem in your Gemfile

gem "metacrunch-db", "~> 1.0.0"

and run $ bundle install to install it.

Or install it manually

$ gem install metacrunch-db

Usage

Note: For working examples of how to use this package, check out our demo repository.

Metacrunch::DB::Source

This class provides a metacrunch source implementation that can be used to read data from SQL databases into a metacrunch job.

# my_job.metacrunch

# Create a Sequel database connection 
SOURCE_DB = Sequel.connect(...)

# Create a Sequel dataset with an unambiguous order.
my_source_dataset = SOURCE_DB[:my_table].order(:id)

# Set the source
source Metacrunch::DB::Source.new(my_source_dataset [, OPTIONS])

The implementation uses Sequel's paged_each to iterate efficiently even over large result sets. You can provide the following options to control how paged_each works.

Options

For detailed descriptions, consult the Sequel documentation of paged_each. Please note that this package changes the default for :strategy to :filter.

  • :rows_per_fetch: The number of rows to fetch per query. Defaults to 1000.
  • :strategy: Either :offset or :filter. Defaults to :filter.
  • :filter_values: Defaults to nil.
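
To make the difference between the two :strategy values more concrete, here is a minimal sketch in plain Ruby. It is not Sequel's implementation and does not touch a database: ROWS, pages_by_offset and pages_by_filter are hypothetical names that only mimic the two paging ideas on an in-memory array ordered by a unique :id.

```ruby
# Hypothetical in-memory stand-in for a table with a unique :id column.
ROWS = (1..10).map { |i| { id: i, title: "record #{i}" } }.freeze

# Mimics the idea behind :offset: every fetch skips the rows already seen,
# so the skipped range grows with each page.
def pages_by_offset(rows, rows_per_fetch)
  ordered = rows.sort_by { |r| r[:id] }
  pages = []
  offset = 0
  loop do
    page = ordered.drop(offset).first(rows_per_fetch)
    break if page.empty?
    pages << page
    offset += page.size
  end
  pages
end

# Mimics the idea behind :filter: every fetch asks only for rows that sort
# after the last row of the previous page. This relies on an unambiguous order.
def pages_by_filter(rows, rows_per_fetch)
  ordered = rows.sort_by { |r| r[:id] }
  pages = []
  last_id = nil
  loop do
    remaining = last_id ? ordered.select { |r| r[:id] > last_id } : ordered
    page = remaining.first(rows_per_fetch)
    break if page.empty?
    pages << page
    last_id = page.last[:id]
  end
  pages
end

# Both strategies visit the same rows in the same order.
pages_by_filter(ROWS, 4).each do |page|
  puts page.map { |r| r[:id] }.join(", ")
end
```

Both helpers return the same pages; the difference lies in how each page is located. This is why the source dataset needs an unambiguous order when the :filter strategy is used.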

Metacrunch::DB::Destination

This class provides a metacrunch destination implementation that can be used to write data from a metacrunch job to SQL databases.

# my_job.metacrunch

# Create a Sequel database connection 
DEST_DB = Sequel.connect(...)

# Create a Sequel dataset where data should be written
my_target_dataset = DEST_DB[:my_table]

# For performance reasons it may be useful to create a batch
# of records that gets written to the database
transformation ->(row) { row }, buffer: 1000

# Set the destination
destination Metacrunch::DB::Destination.new(my_target_dataset [, OPTIONS])

Options

  • :use_upsert: When set to true, the destination performs an upsert (it updates the record if it already exists and inserts it otherwise) instead of a plain insert. Defaults to false.
  • :primary_key: The primary key used to identify an existing record in case of an upsert. Defaults to :id.
  • :transaction_options: A hash of options that control how the database handles the transaction. For a complete list of available options, see the Sequel documentation.
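
The interplay of :use_upsert and :primary_key can be sketched in plain Ruby as follows. This is an illustration under stated assumptions, not the gem's code: InMemoryTable and write_record are hypothetical helpers that stand in for a Sequel dataset and the destination's write logic.

```ruby
# Hypothetical in-memory table; a real job would write to a Sequel dataset.
class InMemoryTable
  attr_reader :rows

  def initialize
    @rows = []
  end

  # Append a copy of the record as a new row.
  def insert(record)
    @rows << record.dup
  end

  # Update all rows whose primary key matches; returns the number of rows changed.
  def update(primary_key, record)
    matches = @rows.select { |r| r[primary_key] == record[primary_key] }
    matches.each { |r| r.merge!(record) }
    matches.size
  end
end

# With use_upsert the record is updated if it exists, otherwise inserted.
# Without it, every record is inserted as a new row.
def write_record(table, record, use_upsert: false, primary_key: :id)
  return :updated if use_upsert && table.update(primary_key, record) > 0

  table.insert(record)
  :inserted
end

table = InMemoryTable.new
write_record(table, { id: 1, title: "first" }, use_upsert: true)
write_record(table, { id: 1, title: "revised" }, use_upsert: true)
puts table.rows.inspect
```

With use_upsert enabled, the second write changes the existing row instead of adding a duplicate, which is useful when a job is re-run against a table that already contains data.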

License

metacrunch-db is available on GitHub under the MIT license.