Project

parquet

0.0
The project is in a healthy, maintained state
Parquet is a high-performance Parquet library for Ruby, written in Rust. It wraps the official Apache Rust implementation to provide fast, correct Parquet parsing.
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
2025
 Dependencies

Development

Runtime

~> 0.9.39
 Project Readme

parquet-ruby

Gem Version

This project is a Ruby library wrapping the parquet-rs rust crate.

Usage

This library provides high-level bindings to parquet-rs with two primary APIs for reading Parquet files: row-wise and column-wise iteration. The column-wise API generally offers better performance, especially when working with subset of columns.

Row-wise Iteration

The each_row method provides sequential access to individual rows:

require "parquet"

# Basic usage with default hash output
Parquet.each_row("data.parquet") do |row|
  puts row.inspect  # {"id"=>1, "name"=>"name_1"}
end

# Array output for more efficient memory usage
Parquet.each_row("data.parquet", result_type: :array) do |row|
  puts row.inspect  # [1, "name_1"]
end

# Select specific columns to reduce I/O
Parquet.each_row("data.parquet", columns: ["id", "name"]) do |row|
  puts row.inspect
end

# Reading from IO objects
File.open("data.parquet", "rb") do |file|
  Parquet.each_row(file) do |row|
    puts row.inspect
  end
end

Column-wise Iteration

The each_column method reads data in column-oriented batches, which is typically more efficient for analytical queries:

require "parquet"

# Process columns in batches of 1024 rows
Parquet.each_column("data.parquet", batch_size: 1024) do |batch|
  # With result_type: :hash (default)
  puts batch.inspect
  # {
  #   "id" => [1, 2, ..., 1024],
  #   "name" => ["name_1", "name_2", ..., "name_1024"]
  # }
end

# Array output with specific columns
Parquet.each_column("data.parquet",
                    columns: ["id", "name"],
                    result_type: :array,
                    batch_size: 1024) do |batch|
  puts batch.inspect
  # [
  #   [1, 2, ..., 1024],           # id column
  #   ["name_1", "name_2", ...]    # name column
  # ]
end

Arguments

Both methods accept these common arguments:

  • input: Path string or IO-like object containing Parquet data
  • result_type: Output format (:hash or :array, defaults to :hash)
  • columns: Optional array of column names to read (improves performance)

Additional arguments for each_column:

  • batch_size: Number of rows per batch (defaults to implementation-defined value)

When no block is given, both methods return an Enumerator.