metacrunch-elasticsearch
This is the official Elasticsearch package for the metacrunch ETL toolkit.
NOTE: metacrunch-elasticsearch 5.x requires Elasticsearch 7.x. For older versions of Elasticsearch, try metacrunch-elasticsearch 4.x.
Installation
Include the gem in your Gemfile
gem "metacrunch-elasticsearch", "~> 5.0.0"
and run $ bundle install to install it.
Or install it manually
$ gem install metacrunch-elasticsearch
Usage
Note: For working examples of how to use this package, check out our demo repository.
Metacrunch::Elasticsearch::Source
This class provides a metacrunch source implementation that can be used to read data from Elasticsearch into a metacrunch job.
# my_job.metacrunch
# Create an Elasticsearch connection
elasticsearch = Elasticsearch::Client.new(...)
# Set the source
source Metacrunch::Elasticsearch::Source.new(elasticsearch, OPTIONS)
Options
- :search_options: A hash with search options (including your query) as described here. We have set some meaningful defaults though: size: 100, scroll: 1m, sort: ["_doc"]. Depending on your use case, you may need to modify :size and :scroll for optimal performance.
- :total_hits_callback: You can set a Proc that gets called with the total number of hits your query will match. You can use this callback to set up a progress bar, for example. Defaults to nil.
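As a minimal sketch, the options hash for the source could be assembled as follows. The index name, the query, and the tuned :size and :scroll values are hypothetical. The sketch also assumes that :search_options is handed to the search call of the Elasticsearch client and that the callback receives the hit count as its single argument.

```ruby
# Hypothetical search options; the index name and query are placeholders.
search_options = {
  index: "my-index",
  size: 500,     # overrides the default of 100
  scroll: "5m",  # overrides the default of 1m
  body: { query: { match_all: {} } }
}

# Remember the total so that a progress bar could be sized from it.
# Assumption: the callback is called with the number of hits.
expected_total = nil
total_hits_callback = ->(total_hits) { expected_total = total_hits }

source_options = {
  search_options: search_options,
  total_hits_callback: total_hits_callback
}

# In a job file this hash would take the place of OPTIONS:
# source Metacrunch::Elasticsearch::Source.new(elasticsearch, source_options)
```

Keeping the callback as a small lambda that only records the total makes it easy to plug in whichever progress reporting you prefer.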
Metacrunch::Elasticsearch::Destination
This class provides a metacrunch destination implementation that can be used to write data from a metacrunch job to Elasticsearch.
The data that gets passed to the destination must be in a proper format. You can use a transformation to transform your data before it reaches the destination.
As Metacrunch::Elasticsearch::Destination utilizes the Elasticsearch bulk API, the expected format must match one of the available options for the body parameter described here. Please note that you can use the bulk API not only to index records. You can update or delete records as well.
# my_job.metacrunch
# Transform data into a format that the destination can understand.
# In this example `data` is some hash.
transformation ->(data) do
  {
    index: {
      _index: "my-index",
      _id: data.delete(:id),
      data: data
    }
  }
end
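To see the shape this transformation produces, the same logic can be run on a sample record in plain Ruby. The lambda names and the record are made up for illustration. The delete action is a sketch that assumes the bulk body format of the Elasticsearch Ruby client.

```ruby
# Builds an index action; note that data.delete(:id) mutates the given hash.
to_index_action = ->(data) do
  { index: { _index: "my-index", _id: data.delete(:id), data: data } }
end

# Builds a delete action; only the index name and the id are needed.
to_delete_action = ->(data) do
  { delete: { _index: "my-index", _id: data[:id] } }
end

record = { id: 42, title: "Example" }

# Pass a copy so the original record keeps its :id.
index_action  = to_index_action.call(record.dup)
delete_action = to_delete_action.call(record)
```

Because the id is removed from the hash before it is stored under data, the indexed document itself does not carry an id field.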
It is not efficient to call Elasticsearch for every single record. Instead, we can use a transformation with a buffer to create bulks of records. In this example we use a buffer size of 10. In production environments, and depending on your data, larger buffers may be useful.
# my_job.metacrunch
transformation ->(data) { data }, buffer: 10
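The effect of buffering can be illustrated in plain Ruby. This is only a sketch of the batching idea using each_slice, not metacrunch's own buffer implementation. The record list is hypothetical, and how metacrunch handles a partially filled buffer at the end of a job is not shown here.

```ruby
# 25 hypothetical bulk actions, one per record.
actions = (1..25).map do |i|
  { index: { _index: "my-index", _id: i, data: { title: "Record #{i}" } } }
end

# Group the actions into bulks of at most 10, mirroring a buffer size of 10.
bulks = actions.each_slice(10).to_a

# One request per bulk instead of one request per record.
bulk_sizes = bulks.map(&:size)
```

With 25 actions and a buffer size of 10, this yields three bulks instead of 25 single requests.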
If these transformations are in place, you can now use the Metacrunch::Elasticsearch::Destination class as a destination.
# my_job.metacrunch
# Write data into Elasticsearch
destination Metacrunch::Elasticsearch::Destination.new(elasticsearch [, OPTIONS])
Options
- :result_callback: You can set a Proc that gets called with the result from the bulk operation. Defaults to nil.
- :bulk_options: A hash of options for the Elasticsearch bulk API as described here. Setting body here will be ignored. Defaults to {}.
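As a minimal sketch, a result callback could collect the ids of failed bulk items. It assumes that the callback receives the parsed bulk response as a hash with string keys ("errors" and "items"), as returned by the Elasticsearch bulk API. The sample response and the refresh setting are made up for illustration.

```ruby
# Collects the ids of items that the bulk API reported as failed.
failed_ids = []
result_callback = ->(result) do
  next unless result["errors"]

  result["items"].each do |item|
    # Each item is a one-entry hash: action name => details.
    _action, info = item.first
    failed_ids << info["_id"] if info["error"]
  end
end

# Hypothetical bulk options; refresh is a parameter of the bulk API.
destination_options = {
  result_callback: result_callback,
  bulk_options: { refresh: true }
}

# Made-up response to exercise the callback outside of a job.
sample_result = {
  "errors" => true,
  "items" => [
    { "index" => { "_id" => "1", "status" => 201 } },
    { "index" => { "_id" => "2", "status" => 400,
                   "error" => { "type" => "mapper_parsing_exception" } } }
  ]
}
result_callback.call(sample_result)

# In a job file this hash would take the place of OPTIONS:
# destination Metacrunch::Elasticsearch::Destination.new(elasticsearch, destination_options)
```

Checking the top-level errors flag first avoids walking the items of every successful bulk.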
License
metacrunch-elasticsearch is available on GitHub under the MIT license.