Project

samidare

0.01
No commit activity in last 3 years
No release in over 3 years
Generate Embulk config and BigQuery schema from MySQL schema
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
 Dependencies

Development

~> 1.7
~> 10.0
>= 2.0
= 0.8.0

Runtime

 Project Readme

Samidare

Generate Embulk config and BigQuery schema from MySQL schema and run Embulk.

Installation

Add this line to your application's Gemfile:

gem 'samidare'

And then execute:

$ bundle

Or install it yourself as:

$ gem install samidare

Embulk setup

Samidare is utility for Embulk . You need to install Embulk and install some gems like below.

embulk gem install embulk-input-mysql --version 0.8.2
embulk gem install embulk-output-bigquery --version 0.4.3
embulk gem install embulk-parser-jsonl --version 0.2.0
embulk gem install embulk-formatter-jsonl --version 0.1.4

Usage

Require database.yml and table.yml. Below is a sample config file.

database.yml

db01:
  host: localhost
  username: root
  password: pswd
  database: production
  bq_dataset: mysql_db01

db02:
  host: localhost
  username: root
  password: pswd
  database: production
  bq_dataset: mysql_db02

Caution: Embulk doesn't allow no password for MySQL

table.yml

db01:
  tables:
    - name: users
    - name: events
    - name: hobbies

db02:
  tables:
    - name: administrators
    - name: configs

Samidare requires BigQuery parameters like below.

[sample.rb]
require 'samidare'

config = {
 'project_id' => 'BIGQUERY_PROJECT_ID',
 'service_email' => 'SERVICE_ACCOUNT_EMAIL',
 'key' => '/etc/embulk/bigquery.p12',
 'schema_dir' => '/var/tmp/embulk/schema',
 'config_dir' => '/var/tmp/embulk/config',
 'auth_method' => 'private_key'
}

client = Samidare::EmbulkClient.new
client.generate_config(config)
client.run(config)
ruby sample.rb

Features

process status

Samidare returns process status as boolean.
If all tables are succeed, then returns true, else false .
It is useful to control system flow.

process_status = Samidare::EmbulkClient.new.run(config)
exit 1 unless process_status

narrow tables

You can narrow actual target tables from table.yml for test or to retry.
If no target tables is given, Samidare will execute all tables.

# in case, all tables are ['users', 'purchases', 'items']
target_tables = ['users', 'purchases']
Samidare::EmbulkClient.new.run(config, target_tables)

retry

You can set retry count.
If any table failed, only failed table will be retried until retry count.
If no retry count is given, Samidare dosen't retry.

# 2 times retry will execute
Samidare::EmbulkClient.new.run(config, [], 2)

SQL condition

If you set condition to a table in table.yml , SQL is generated like below.
It is useful for large size table.

[table.yml]
production:
  tables:
    - name: users
    - name: events
      conditon: created_at < CURRENT_DATE()
SELECT * FROM users
SELECT * FROM events WHERE created_at < CURRENT_DATE()

daily snapshot

BigQuery supports table wildcard expression of a specific set of daily tables, for example, sales20150701 .
If you need daily snapshot of a table for BigQuery, use daily_snapshot option to database.yml or table.yml like below.
daily_snapshot option effects all tables in case of database.yml .
On the other hand, only target table in table.yml .
Daily part is determined by execute date.

[database.yml]
production:
  host: localhost
  username: root
  password: pswd
  database: production
  bq_dataset: mysql
  daily_snapshot: true
[table.yml]
production:
  tables:
    - name: users
    - name: events
      daily_snapshot: true
    - name: hobbies

Only `events` is renamed to `eventsYYYYMMDD` for BigQuery.

Contributing

  1. Fork it ( https://github.com/[my-github-username]/samidare/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request