Raka is a DSL (Domain Specific Language) on top of Rake for defining rules and running data processing workflows. It is designed specifically for data processing, with improved pattern matching, scopes, language extensions and many conventions that reduce verbosity.
Installation
Raka is a library based on rake. Though rake is cross-platform, raka may not work on Windows since it relies on some shell facilities. Ruby is available for most *nix systems including Mac OS X, so the only task is to install raka:
gem install raka
Quick Start
First create a file named main.raka, then import and initialize the DSL:
require 'raka'
dsl = Raka.new(self,
  output_types: [:txt],
  input_types: [:txt]
)
Then the code below will define two simple rules:
txt._.first50 = shell* "cat $< | head -n 50 > $@"
txt.sort = [txt.input] | shell* "cat $(dep0) | sort -rn > $@"
For testing, let's prepare an input file named input.txt:
seq 1000 > input.txt
Invoke:
raka first50__sort.txt
Raka will read data from input.txt, sort the numbers in descending order and copy the first 50 lines to first50__sort.txt.
The workflow here is as follows:
- Try to find first50__sort.txt: it does not exist.
- Rule with target txt.sort.first50 matched.
- Find input file sort.txt: it does not exist.
- Rule with target txt.sort matched.
- This rule has no input but a dependency target txt.input.
- File input.txt exists. Use it.
- Run rule txt.sort and create sort.txt.
- Run rule txt.sort.first50 and create first50__sort.txt.
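The mapping between a requested file name and the token chain used in the steps above can be sketched in plain Ruby. The helper name to_token_chain is hypothetical and not part of raka; it only illustrates the naming convention (tokens joined by "__", in reverse order) under the assumption that no scope directory is involved.

```ruby
# Hypothetical helper, not part of raka's API: turn a target file name
# into the dotted token chain that a rule is written with.
def to_token_chain(file_name)
  ext = File.extname(file_name)          # e.g. ".txt"
  stem = File.basename(file_name, ext)   # e.g. "first50__sort"
  tokens = stem.split('__').reverse      # the file name lists tokens in reverse order
  ([ext.delete_prefix('.')] + tokens).join('.')
end

puts to_token_chain('first50__sort.txt')   # prints "txt.sort.first50"
puts to_token_chain('first50__input.txt')  # prints "txt.input.first50"
```

The second call corresponds to the variant that skips the sort step.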
We may want to skip the sort step, and invoke:
raka first50__input.txt
Raka will read data from input.txt and copy the first 50 lines to first50__input.txt.
This illustrates some basic ideas but may not be particularly interesting. The following is a slightly more complex example that covers more features.
require 'raka'
dsl = Raka.new(self,
  output_types: %i[csv pdf],
  input_types: %i[csv],
  lang: ['lang/shell', 'lang/python'])
py_template = <<~PYTHON
  import os.path
  import pandas as pd
  def write_variety(input, output, variety):
      print(variety)
      folder = os.path.dirname(output)
      if len(folder) > 0:
          os.makedirs(folder, exist_ok=True)
      df = pd.read_csv(input)
      df[df['class'] == variety].to_csv(output)
  <code>
PYTHON
py.config script_template: py_template
groups = %i[virginica versicolor]
csv(groups.join('|')).iris =
  [csv.iris_all] | py* %(write_variety('$<', '$@', 'Iris-$(target_scope)'))
csv.iris_all = shell* %(curl -L https://datahub.io/machine-learning/iris/r/iris.csv > $@)
dsl.scope(*groups) do
  pdf.iris.plot['plot_(\S+)_(\S+)'] = py do |rask|
    <<~PYTHON
      import seaborn as sns
      from matplotlib import pyplot as plt
      df = pd.read_csv('#{rask.input}')
      ax = sns.displot(x=df['#{rask.captures.plot0}#{rask.captures.plot1}'])
      ax.set_axis_labels('#{rask.captures.plot0} #{rask.captures.plot1}', 'frequency')
      plt.savefig('#{rask.output}')
    PYTHON
  end
end
task figures: (groups.product(%w[sepal petal], %w[length width]).map do |info|
  "_out/#{info[0]}/plot_#{info[1]}_#{info[2]}__iris.pdf"
end)
In this example, we download the classic iris.csv dataset, use Python code to extract two varieties (virginica and versicolor), and generate frequency histograms for both varieties.
To invoke the script, run in a terminal:
raka -j 8 -v figures
The option -j 8 tells raka to parallelize the tasks with at most 8 concurrent processes where possible. The option -v makes raka print detailed information so we can view the generated Python code.
The tool then proceeds as follows:
1. Match figures with the last rule, which is a normal rake task.
2. The prerequisites include 8 figures, none of which exists yet. Take _out/versicolor/plot_petal_length__iris.pdf as an example from now on.
3. Rule pdf.iris.plot['plot_(\S+)_(\S+)']... is matched, where "petal" is bound to plot0 and "length" is bound to plot1.
4. Neither of the 2 possible input files, _out/versicolor/iris.csv and _out/versicolor/iris.pdf, can be found. But the rule csv(groups.join('|')).iris = ... (csv('virginica|versicolor').iris) can be matched for the former, where the target scope is matched as versicolor.
5. The only dependency csv.iris_all is resolved as _out/iris_all.csv. The path does not contain versicolor since the target scope only applies to the target.
6. Rule csv.iris_all is matched without any dependencies.
7. The protocol shell replaces the automatic variable $@ with _out/iris_all.csv to build a curl command and download the iris dataset from datahub.io.
8. Now raka goes back to generate output _out/versicolor/iris.csv by executing the code generated by the python protocol, which extracts rows where the class field equals "Iris-versicolor".
9. Raka goes back to generate output _out/versicolor/plot_petal_length__iris.pdf by executing the code generated by the python protocol, which draws a histogram to depict the distribution of petal length.
10. Raka continues to generate plot files until all 8 figures exist.
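The demand-driven order of the steps above (resolve prerequisites first, build a target only if it is missing) can be sketched as a small recursion. This is a simplified illustration and not how raka or Rake are implemented: the RULES table is written out by hand instead of being derived from pattern matching, file existence is simulated with a Set, and timestamps are ignored.

```ruby
require 'set'

# Hand-written stand-in for the rules raka would match in the walk-through.
RULES = {
  '_out/versicolor/plot_petal_length__iris.pdf' => ['_out/versicolor/iris.csv'],
  '_out/versicolor/iris.csv' => ['_out/iris_all.csv'],
  '_out/iris_all.csv' => []
}.freeze

# Build `target` only if it is not in `existing`; prerequisites come first.
# Returns the list of targets that were actually built, in order.
def build(target, existing, log = [])
  return log if existing.include?(target)

  RULES.fetch(target).each { |dep| build(dep, existing, log) }
  existing << target   # stands in for running the action of the matched rule
  log << target
end

puts build('_out/versicolor/plot_petal_length__iris.pdf', Set.new)
```

Calling build again with a set that already contains all three files returns an empty list, which mirrors the "only when necessary" behaviour.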
As an example, the Python code generated in step 9 is:
import sys
import os.path
import pandas as pd
def write_variety(input, output, variety):
    print(variety)
    folder = os.path.dirname(output)
    if len(folder) > 0:
        os.makedirs(folder, exist_ok=True)
    df = pd.read_csv(input)
    df[df['class'] == variety].to_csv(output)
import seaborn as sns
from matplotlib import pyplot as plt
df = pd.read_csv('_out/versicolor/iris.csv')
ax = sns.displot(x=df['petallength'])
ax.set_axis_labels('petal length', 'frequency')
plt.savefig('_out/versicolor/plot_petal_length__iris.pdf')
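The generated script above is the configured script_template with the rule's code placed where the <code> marker was. A minimal sketch of that substitution follows; fill_template is a hypothetical name, and the assumption that the protocol performs a plain textual replacement of <code> is ours, not something the source confirms.

```ruby
# Hypothetical helper, not raka's API: put the rule's code where the
# <code> marker sits in the script template.
def fill_template(template, code)
  # The block form avoids special handling of backslashes in the replacement.
  template.sub('<code>') { code }
end

template = <<~PYTHON
  import pandas as pd
  <code>
PYTHON

puts fill_template(template, "df = pd.read_csv('_out/versicolor/iris.csv')")
```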
The rule-based system, the strategy of executing tasks only when necessary, and the capable host language make it fairly easy to adjust experiments during exploration. For example, to apply the experiments to the setosa class as well, we can just change the line
groups = %i[virginica versicolor]
to
groups = %i[virginica versicolor setosa]
The command raka -j 8 -v figures will then generate 4 figures for the new class, without re-executing tasks for the other two classes.
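The figure counts can be checked with the same expression the figures task uses. The wrapper method figure_paths is introduced here only for illustration.

```ruby
# Same expression as in the `figures` task, wrapped in a hypothetical helper.
def figure_paths(groups)
  groups.product(%w[sepal petal], %w[length width]).map do |info|
    "_out/#{info[0]}/plot_#{info[1]}_#{info[2]}__iris.pdf"
  end
end

before = figure_paths(%i[virginica versicolor])
after = figure_paths(%i[virginica versicolor setosa])

puts before.size   # prints 8
puts after - before
```

The last line lists the four new setosa paths, which are the only prerequisites that do not exist yet.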
Why Raka
Data processing tasks can involve plenty of steps, each with its dependencies. Compared to bare Rake or the more classical Make, Raka offers the following advantages:
- Advanced pattern matching and template resolving to define general rules and maximize code reuse.
- Extensible and context-aware protocol architecture.
- Multilingual. Other programming languages can be easily embedded.
- Auto dependency and naming by conventions.
- Scopes to ease comparative studies.
- Terser syntax.
... and more.
Compared to more complex, GUI-based solutions (perhaps classified as scientific-workflow software) like Kepler, etc., Raka has the following advantages:
- Lightweight and easy to set up, especially on platforms with ruby preinstalled.
- Easy to deploy, version-control, backup or share workflows since the workflows are merely text files.
- Easy to reuse modules or create reusable modules, which are merely plain ruby code snippets (or in other languages with protocols).
- Expressive so a few lines of code can replace many manual operations.
Documentation
Conceptual Model
A raka rule consists of a target, dependencies, an action and post targets.
Syntax Definition
It is possible to use Raka with little knowledge of ruby / rake, though a minimal understanding is highly recommended. The formal syntax of a rule can be defined as follows (W3C EBNF form):
rule ::= target "=" (dependencies "|")* action ("|" post_target)*
target ::= ext "." ltoken ("." ltoken)*
dependencies ::= "[]" | "[" dependency ("," dependency)* "]"
dependency ::= rexpr | template
post_target ::= rexpr | template
rexpr ::= ext "." rtoken ("." rtoken)*
ltoken ::= word | word "[" pattern "]"
rtoken ::= word | word "(" template ")"
word ::= ("_" | letter) ( letter | digit | "_" )*
action ::= ("shell" | "r" | "psql" | "py" ) ("*" template | block ) | "run" block
The corresponding railroad diagrams are: rule, target, dependencies, dependency, post_target, rexpr, ltoken, rtoken, word, action.
The definition is concise but several details are omitted for simplicity:
- block and hash refer to ruby's block and hash objects.
- A template is just a ruby string with some placeholders (see the next section for details).
- A pattern is just a ruby string that represents a regex (see the next section for details).
- The listed protocols are merely those offered at present; the set can be greatly extended.
- Nearly any concept in the syntax can be replaced by a suitable ruby variable.
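Because rules are plain Ruby, a target such as pdf.buildings.indicator['\S+'] is an ordinary method chain. The following sketch shows one way such a chain could be collected with method_missing; the class TokenChain is hypothetical and is not raka's implementation.

```ruby
# Hypothetical sketch, not raka's implementation: record chained tokens.
class TokenChain
  attr_reader :ext, :tokens

  def initialize(ext)
    @ext = ext
    @tokens = []
  end

  # An unknown method name becomes a token; an optional argument is kept
  # as the template of an rtoken such as _('%{indicator}').
  def method_missing(name, *args)
    return super unless word?(name)

    @tokens << { word: name.to_s, template: args.first, pattern: nil }
    self
  end

  def respond_to_missing?(name, include_private = false)
    word?(name) || super
  end

  # word[pattern] attaches a regex pattern to the last token (an ltoken).
  def [](pattern)
    @tokens.last[:pattern] = pattern
    self
  end

  def to_s
    ([@ext] + @tokens.map { |t| t[:word] }).join('.')
  end

  private

  # Follows the `word` production; conversion hooks such as to_ary are excluded.
  def word?(name)
    text = name.to_s
    text.match?(/\A[_a-zA-Z][a-zA-Z0-9_]*\z/) && !text.start_with?('to_')
  end
end

chain = TokenChain.new('pdf').buildings.indicator['\S+'].top['top_(\d+)']
puts chain.to_s                    # prints "pdf.buildings.indicator.top"
puts chain.tokens.last[:pattern]   # prints "top_(\d+)"
```

A sketch like this cannot record tokens whose names collide with methods every Ruby object already has (for example class or hash); how raka deals with such names is not covered here.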
Pattern matching and template resolving
When a rule is defined like target = <specification>, the left side represents a pattern and the right side contains specifications for extra dependencies, actions and some targets to create thereafter. When raking a target file, the left sides of the rules are examined one by one until a rule matches. The regex-based matching process also supports named captures so that some variables can be extracted for use on the right side.
The specifications on the right side of a rule can contain templates. The "holes" in the templates are filled by automatic variables and by variables captured when matching the left side.
Pattern matching
To match a given file with a target, the extension is matched first. The substrings of the file name separated by "__" are mapped to the tokens separated by ".", in reverse order. After that, each substring is matched against the corresponding token or the regex in []. For example, the rule
pdf.buildings.indicator['\S+'].top['top_(\d+)']
can match "top_50__node_num__buildings.pdf". The logical process is:
- The extension pdf matches.
- The substrings and the tokens are paired and they all match:
  buildings ~ buildings
  '\S+' ~ node_num
  top_(\d+) ~ top_50
- Two levels of captures are made. First, 'node_num' is captured as indicator and 'top_50' is captured as top; second, '50' is captured as top0 since \d+ is wrapped in parentheses and is the first group.
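The matching process above can be sketched in a few lines of Ruby. The function match_target and its rule representation (pairs of word and optional pattern) are hypothetical, and the sketch assumes that the first-level capture holds the whole substring; raka's actual matcher may differ.

```ruby
# Hypothetical sketch of the matching logic, not raka's code.
# rule_tokens is a list of [word, pattern]; pattern is nil for a plain token.
def match_target(rule_ext, rule_tokens, file_name)
  ext = File.extname(file_name).delete_prefix('.')
  return nil unless ext == rule_ext

  # Substrings between "__" map to the tokens in reverse order.
  parts = File.basename(file_name, ".#{ext}").split('__').reverse
  return nil unless parts.size == rule_tokens.size

  captures = {}
  rule_tokens.zip(parts).each do |(word, pattern), part|
    if pattern.nil?
      return nil unless word == '_' || word == part

      next
    end

    m = Regexp.new("\\A(?:#{pattern})").match(part)   # prefix matching
    return nil unless m

    captures[word] = part                             # first level: the token
    m.captures.each_with_index { |text, i| captures["#{word}#{i}"] = text }
  end
  captures
end

rule = [['buildings', nil], ['indicator', '\S+'], ['top', 'top_(\d+)']]
p match_target('pdf', rule, 'top_50__node_num__buildings.pdf')
```

For the example file this yields indicator = node_num, top = top_50 and top0 = 50; a file such as top_50__node_num__roads.pdf yields nil because the plain token buildings does not match.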
The special token _ matches any token. Since raka uses prefix matching, something like token0[''] also matches any token and additionally captures it as token0. The end-of-line symbol $ can be used to match the whole token, e.g., token0['word$'] will not match word_bench.
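The effect of prefix matching can be reproduced with a regex that is anchored only at the start. The helper token_match? is hypothetical and assumes this anchoring is how the prefix behaviour arises.

```ruby
# Hypothetical helper: a token pattern anchored at the start only.
def token_match?(pattern, text)
  Regexp.new("\\A(?:#{pattern})").match?(text)
end

puts token_match?('word', 'word_bench')    # prints true (prefix match)
puts token_match?('word$', 'word_bench')   # prints false
puts token_match?('word$', 'word')         # prints true
puts token_match?('', 'anything')          # prints true (empty pattern)
```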
Template resolving
In some places of an rexpr, templates can be written instead of plain strings so that they represent different values at runtime. Two types of variables can be used in templates. The first is automatic variables, which are just like $@ in Make or task.name in Rake. We even preserve some Make conventions for easier migration. All automatic variables begin with $. The possible automatic variables are:
symbol | description |
---|---|
$@, $(output) | the output file |
$<, $(input) | the input file defined in the chained target |
$^, $(deps) | all dependencies joined by commas (including input) |
$(dep0), $(dep1), ... | the i-th dependency (input is $(dep0)) |
$(input_stem) | stem of the input file |
$(output_stem) | stem of the output file |
$(func) | the token added to input to generate output, e.g., stat in csv.data.stat |
$(ext) | extension of the output file |
$(scope) | scope for current task, i.e. the common directory for output, input and dependencies |
$(target_scope) | the inline scope defined in target |
$(target_scope0), $(target_scope1), ... | the i-th captured value by inline scope defined in target |
$(rule_scope0), $(rule_scope1), ... | the i-th scope defined at rule level by nested calls of the dsl.scope function (i increases from the innermost scope outwards) |
The other type of variables are those captured during pattern matching, which can be referred to using %{var}. In the example of the pattern matching section, %{indicator} is replaced by node_num, %{top} by top_50 and %{top0} by 50. In that case, the template 'calculate top %{top0} of %{indicator} for $@' is resolved as 'calculate top 50 of node_num for top_50__node_num__buildings.pdf'.
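Template resolving as described above amounts to two substitutions: one for captured variables written as %{var} and one for automatic variables written as $@ or $(name). The sketch below uses a hypothetical resolve_template function with hand-filled variable tables; it is not raka's resolver.

```ruby
# Hypothetical resolver, not raka's implementation.
def resolve_template(template, captures, auto_vars)
  # %{var} placeholders come from pattern matching captures.
  text = template.gsub(/%\{(\w+)\}/) { captures.fetch(Regexp.last_match(1)) }
  # $(name) and the Make-style $@, $<, $^ come from automatic variables.
  text.gsub(/\$\((\w+)\)|\$([@<^])/) do
    auto_vars.fetch(Regexp.last_match(1) || Regexp.last_match(2))
  end
end

captures = { 'indicator' => 'node_num', 'top' => 'top_50', 'top0' => '50' }
auto_vars = { '@' => 'top_50__node_num__buildings.pdf',
              'output' => 'top_50__node_num__buildings.pdf' }

puts resolve_template('calculate top %{top0} of %{indicator} for $@', captures, auto_vars)
# prints "calculate top 50 of node_num for top_50__node_num__buildings.pdf"
```

Using fetch makes an unknown variable raise a KeyError instead of silently producing an empty string, which is a design choice of this sketch.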
Templates can appear in various places. For dependencies and post targets, tokens with parentheses can contain templates, like csv._('%{indicator}'). The word of a token with parentheses is of no use and is generally replaced by an underscore. It is also possible to write a template literal directly, e.g. '%{indicator}.csv'. Templates can also be used in actions, but this depends on the implementation of each protocol.
Actions and protocols
Raka invokes actions when the input and all dependencies are present. Generally, users define an action that generates the output. To maximize flexibility, users can feed code in an arbitrary programming language to the corresponding protocol, which then transforms and executes the code. Raka natively supports the host (ruby) protocol and several foreign protocols including shell, python, psql, and r.
The host protocol is special and just executes the given ruby block. All other protocols accept either a templated code string given with the asterisk operator or a block producing a templated code string. Examples for each protocol are given below.
In the host protocol and the block versions of other protocols, a raka task (the rask variable) is provided, which offers the following properties:
property | description |
---|---|
output | the output file |
input | the input file defined in the chained target |
deps | the dependencies (input is deps[0]) |
func | the token added to input to generate output, e.g., stat in csv.data.stat |
ext | extension of the output file |
captures | captured text during pattern matching, key-value |
scope | scope for current task, i.e. the common directory for output, input and dependencies |
target_scope | the inline scope defined in target |
target_scope_captures | captured values by inline scope defined in target |
rule_scopes | the scope components bound by the rule scopes |
require 'raka'
require 'csv'
require 'ostruct' # needed for the OpenStruct used in the psql example
dsl = Raka.new(
  self, output_types: %i[table view csv],
  lang: ['lang/psql', 'lang/shell', 'lang/python', 'lang/r']
)
csv.iris_all = shell* %(curl -L https://datahub.io/machine-learning/iris/r/iris.csv > $@)
# host(ruby) protocol
csv.rb_out = [csv.iris_all] | run do |rask|
  # CSV.filter writes every row regardless of the block's return value,
  # so the rows are selected explicitly here.
  CSV.open(rask.output, 'w') do |out|
    CSV.foreach(rask.deps[0], headers: true, return_headers: true) do |row|
      out << row if row.header_row? || row['class'] == 'Iris-versicolor'
    end
  end
end
# python protocol
csv.py_out = [csv.iris_all] | py* %(
import pandas as pd
df = pd.read_csv('$(dep0)')
df[df['class'] == 'Iris-versicolor'].to_csv('$@')
)
# python protocol (block)
csv.py_out2 = [csv.iris_all] | py do |rask|
  <<~PYTHON
    import pandas as pd
    df = pd.read_csv('#{rask.deps[0]}')
    df[df['class'] == 'Iris-versicolor'].to_csv('#{rask.output}')
  PYTHON
end
# r protocol
csv.r_out = [csv.iris_all] | r* %(
df <- read.csv("$(dep0)")
write.csv(df[(df$class == "Iris-versicolor"),], file="$@")
)
# r protocol (block)
csv.r_out2 = [csv.iris_all] | r do |rask|
  <<~R
    df <- read.csv("#{rask.deps[0]}")
    write.csv(df[(df$class == "Iris-versicolor"),], file="#{rask.output}")
  R
end
# shell protocol
csv.shell_out = [csv.iris_all] | shell* %(
cat <(head -1 $(dep0)) <(grep "Iris-versicolor" $(dep0)) > $@
)
# shell protocol (block)
csv.shell_out2 = [csv.iris_all] | shell do |rask|
  "cat <(head -1 #{rask.deps[0]}) <(grep 'Iris-versicolor' #{rask.deps[0]}) > #{rask.output}"
end
# psql protocol
pg = OpenStruct.new(
  user: 'postgres',
  port: 5433,
  host: '127.0.0.1',
  db: 'postgres',
  password: 'postgres'
)
psql.config conn: pg, create: :mview
table.iris_all = [csv.iris_all] | psql(create: nil)* %(
DROP TABLE IF EXISTS $(output_stem);
CREATE TABLE $(output_stem) (
sepallength float,
sepalwidth float,
petallength float,
petalwidth float,
class varchar
);
\\COPY $(output_stem) FROM '$(dep0)' CSV HEADER;
)
table.psql_out = [table.iris_all] | psql* %(
SELECT * FROM $(dep0_stem) WHERE class='Iris-versicolor'
)
# psql protocol (block)
table.psql_out2 = [table.iris_all] | psql do |rask|
  <<~SQL
    SELECT * FROM #{dsl.stem(rask.deps[0])} WHERE class='Iris-versicolor'
  SQL
end
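The asterisk form used in the examples above, such as shell* "...", is ordinary Ruby: without a space before the asterisk it is parsed as the binary * operator, so a protocol object only needs to define that method. The classes below are a hypothetical illustration of this idea and not raka's protocol implementation.

```ruby
# Hypothetical protocol object, not raka's implementation.
SketchAction = Struct.new(:lang, :template)

class SketchProtocol
  def initialize(lang)
    @lang = lang
  end

  # `shell* "..."` is parsed as `shell.*("...")`, so defining `*` is enough
  # to accept a templated code string.
  def *(template)
    SketchAction.new(@lang, template)
  end
end

shell = SketchProtocol.new(:shell)
action = shell* "cat $< | head -n 50 > $@"

puts action.lang       # prints "shell"
puts action.template   # prints "cat $< | head -n 50 > $@"
```

The template is stored unresolved; in this sketch, filling in $< and $@ would happen later, once the input and output of a concrete task are known.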
Initialization and options
These APIs are bound to an instance of the DSL, which you can create at the top of the file:
dsl = Raka.new(<env>, <options>)
The argument <env> should be the self of a running Rakefile. In most cases you can directly write:
dsl = Raka.new(self, <options>)
Two important fields of options are output_types and input_types. For each item in output_types, you get a global function to bootstrap a rule. For example, with
dsl = Raka.new(self, output_types: [:csv, :pdf])
you can write these rules like:
csv.data = ...
pdf.graph = ...
which will match /data.csv and /graph.pdf.
The input_types option determines the strategy for finding inputs. All possible input types are tried when resolving an input file in a chained target. For example, raka will try to find both numbers.csv and numbers.table for a rule like table.numbers.mean = … if input_types: [:csv, :table].
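The lookup described above can be pictured as generating one candidate file name per input type. The helper input_candidates is hypothetical; the order in which raka tries the candidates is an assumption.

```ruby
# Hypothetical helper: candidate input files for a chained target,
# one per configured input type, tried in the given order (assumption).
def input_candidates(input_stem, input_types)
  input_types.map { |type| "#{input_stem}.#{type}" }
end

p input_candidates('numbers', %i[csv table])
# prints ["numbers.csv", "numbers.table"]
```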
Scope
Scopes represent the context of a running task and its components, which are generally physical folders. Users can attach scope constraints to rules to define them more precisely; constraints can appear in several places.
Scope constraints:
Rule scope is the scope to restrict possible task scope. Rule scopes can be layered, each layer with several options, like:
dsl.scope :de, :fr do
  dsl.scope :food, :med do
    # ...rules
  end
end
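With two layers of two options each, the nested rule scopes above admit four combinations. The sketch below enumerates them as relative folder paths; rule_scope_paths is a hypothetical helper, and the assumption that the outer layer maps to the outer folder follows the out/de/food/... path used in the example later in this section.

```ruby
# Hypothetical helper: enumerate the folder combinations allowed by
# layered rule scopes (outer layer first).
def rule_scope_paths(layers)
  first, *rest = layers
  first.product(*rest).map { |parts| parts.join('/') }
end

p rule_scope_paths([%i[de fr], %i[food med]])
# prints ["de/food", "de/med", "fr/food", "fr/med"]
```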
Target scope is the scope to restrict a single target. It can be used on the left-hand side of a rule, or in dependencies on the right-hand side. For example:
# The constraint "de" only applies to food.csv and fruit__food.csv, not classifier.csv
csv('de').food.fruit = [csv.classifier] | action
When resolved, scope constraints are verified and scopes are extracted and parsed, including the following concepts:
Scope (task scope) is the common scope of the task; every target is resolved under this scope.
Rule-bound scopes (abbr. rule scopes) are the parts of the scope bound by the rule scope constraints.
Target-bound scope (abbr. target scope) is the part of the scope bound by the target scope constraints.
Output scope is the scope of output, while Dep scope is the scope of dependencies.
The following example illustrates the relationships of the above concepts.
Rule definition:
dsl.scope :de, :fr do
  dsl.scope :food, :med do
    csv('percent_(\d+)').data.cheap = [csv('base').price] | ...
  end
end
When running raka out/de/food/percent_50/cheap__data.csv, the extracted scopes are exposed through the following automatic variables:
var | value | var | value |
---|---|---|---|
$(scope) | out/de/food | $(output_scope) | out/de/food/percent_50 |
$(rule_scope0) | food | $(rule_scope1) | de |
$(target_scope) | percent_50 | $(target_scope0) | 50 |
$(dep1_scope) | out/de/food/base | | |
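The values in the table can be reproduced by splitting the output path by hand. The function split_scopes below is a hypothetical decomposition for this one example, not raka's resolver; it assumes the target scope is the innermost folder, that the rule scopes are the folders directly above it, and that the target scope pattern must match the whole folder name.

```ruby
# Hypothetical decomposition of the example path; not raka's resolver.
def split_scopes(path, rule_layers, target_scope_pattern)
  dirs = File.dirname(path).split('/')
  target_scope = dirs.pop                        # innermost folder
  m = Regexp.new("\\A(?:#{target_scope_pattern})\\z").match(target_scope.to_s)
  return nil unless m

  rule_scopes = dirs.last(rule_layers.size)      # folders bound by dsl.scope layers
  return nil unless rule_scopes.size == rule_layers.size
  return nil unless rule_layers.zip(rule_scopes).all? { |opts, dir| opts.include?(dir.to_sym) }

  vars = {
    'scope' => dirs.join('/'),
    'output_scope' => (dirs + [target_scope]).join('/'),
    'target_scope' => target_scope
  }
  m.captures.each_with_index { |text, i| vars["target_scope#{i}"] = text }
  # Index 0 is the innermost rule scope.
  rule_scopes.reverse.each_with_index { |s, i| vars["rule_scope#{i}"] = s }
  vars
end

vars = split_scopes('out/de/food/percent_50/cheap__data.csv',
                    [%i[de fr], %i[food med]], 'percent_(\d+)')
vars.each { |name, value| puts "$(#{name}) = #{value}" }
```

The dependency scope $(dep1_scope) is not derived here; per the table it is the task scope plus the dependency's own target scope, out/de/food/base.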
Rakefile Template
Write your own protocols
Compare to other tools
Raka borrows some ideas from Drake but not much (currently mainly the name "protocol"). Briefly, we have different visions and maybe different suitable scenarios.