Parqueteur
Parqueteur enables you to generate Apache Parquet files from raw data.
Dependencies
Since I only tested Parqueteur on Ubuntu, I don't have any install scripts for other operating systems.
Debian/Ubuntu packages
libgirepository1.0-dev
libarrow-dev
libarrow-glib-dev
libparquet-dev
libparquet-glib-dev
You can check the scripts/apache-arrow-ubuntu-install.sh script for a quick way to install all of them.
Installation
Add this line to your application's Gemfile:
gem 'parqueteur', '~> 1.0'
(Optional) If you don't want to require Parqueteur globally, you can add require: false to the Gemfile instruction:
gem 'parqueteur', '~> 1.0', require: false
And then execute:
$ bundle install
Or install it yourself as:
$ gem install parqueteur
Usage
Parqueteur provides an elegant way to generate Apache Parquet files from a defined schema.
Converters accept any object that implements Enumerable as a data source.
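As an illustration of what "implements Enumerable" can mean in practice, here is a minimal sketch of a custom data source. The class name OrderRows and its columns are hypothetical and not part of Parqueteur; the only assumption is that rows are hashes with string keys, as in the working example below.

```ruby
# Hypothetical data source: any class that includes Enumerable
# and defines #each can act as a collection of rows.
class OrderRows
  include Enumerable

  def initialize(count)
    @count = count
  end

  # Yield one row at a time as a Hash with string keys.
  def each
    return to_enum(:each) unless block_given?

    1.upto(@count) do |i|
      yield({ 'id' => i, 'reference' => "order #{i}" })
    end
  end
end

rows = OrderRows.new(3)

# Enumerable methods such as #map, #first and #to_a now work on the source.
references = rows.map { |row| row['reference'] }
```

An object like this could presumably be passed to a converter in place of an Array, provided the converter's columns match the keys of the yielded hashes.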
Working example
require 'parqueteur'
class FooParquetConverter < Parqueteur::Converter
  column :id, :bigint
  column :reference, :string
  column :datetime, :timestamp
end
data = [
  { 'id' => 1, 'reference' => 'hello world 1', 'datetime' => Time.now },
  { 'id' => 2, 'reference' => 'hello world 2', 'datetime' => Time.now },
  { 'id' => 3, 'reference' => 'hello world 3', 'datetime' => Time.now }
]
# initialize Converter with Parquet GZIP compression mode
converter = FooParquetConverter.new(data, compression: :gzip)
# write result to file
converter.write('hello_world.parquet')
# in-memory result (StringIO)
converter.to_io
# write to temporary file (Tempfile)
# don't forget to `close` / `unlink` it after usage
converter.to_tmpfile
# convert to Arrow::Table
pp converter.to_arrow_table
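The close / unlink reminder above can be handled with an ensure block so the temporary file is removed even if processing raises. This is a minimal sketch: the helper name with_cleanup is hypothetical, and a plain Tempfile stands in for the result of converter.to_tmpfile so the snippet runs without Parqueteur installed.

```ruby
require 'tempfile'

# Hypothetical helper: yields the Tempfile to the block,
# then always closes and unlinks it, even if the block raises.
def with_cleanup(tmpfile)
  yield tmpfile
ensure
  tmpfile.close
  tmpfile.unlink
end

# Stand-in for `converter.to_tmpfile` (assumption: it returns a Tempfile).
tmpfile = Tempfile.new(['hello_world', '.parquet'])
tmpfile.write('placeholder bytes')
tmpfile.flush
path = tmpfile.path

# Use the file (here: read its size), then let the helper clean up.
size = with_cleanup(tmpfile) { |file| File.size(file.path) }
```

With a real converter, the stand-in lines would be replaced by a call to to_tmpfile, and the block would upload or copy the file before it is deleted.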
Using transformers
You can use transformers to apply transformations to data items.
From examples/cars.rb:
require 'parqueteur'
class Car
  attr_reader :name, :production_year
  def initialize(name, production_year)
    @name = name
    @production_year = production_year
  end
end
class CarParquetConverter < Parqueteur::Converter
  column :name, :string
  column :production_year, :integer
  transform do |car|
    {
      'name' => car.name,
      'production_year' => car.production_year
    }
  end
end
cars = [
  Car.new('Alfa Romeo 75', 1985),
  Car.new('Alfa Romeo 33', 1983),
  Car.new('Audi A3', 1996),
  Car.new('Audi A4', 1994),
  Car.new('BMW 503', 1956),
  Car.new('BMW X5', 1999)
]
# initialize Converter with Parquet GZIP compression mode
converter = CarParquetConverter.new(cars, compression: :gzip)
# print the result as an Arrow::Table
pp converter.to_arrow_table
Output:
#<Arrow::Table:0x7fc1fb24b958 ptr=0x7fc1faedd910>
# name production_year
0 Alfa Romeo 75 1985
1 Alfa Romeo 33 1983
2 Audi A3 1996
3 Audi A4 1994
4 BMW 503 1956
5 BMW X5 1999
Available Types
Name (Symbol) | Apache Parquet Type |
---|---|
:array | Array |
:bigdecimal | Decimal256 |
:bigint | Int64 or UInt64 with unsigned: true option |
:boolean | Boolean |
:date | Date32 |
:date32 | Date32 |
:date64 | Date64 |
:decimal | Decimal128 |
:decimal128 | Decimal128 |
:decimal256 | Decimal256 |
:int32 | Int32 or UInt32 with unsigned: true option |
:int64 | Int64 or UInt64 with unsigned: true option |
:integer | Int32 or UInt32 with unsigned: true option |
:map | Map |
:string | String |
:struct | Struct |
:time | Time32 |
:time32 | Time32 |
:time64 | Time64 |
:timestamp | Timestamp |
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/pocketsizesun/parqueteur-ruby.