No release in over a year
This small gem contains several data structures that I have implemented in Ruby to learn how they work. Sometimes it is not enough to read the description of a data structure and accompanying pseudo-code. Actually implementing it is often helpful in understanding what is going on. It is also usually fun. The gem contains basic implementions of Disjoint Union, Heap, Priority Search Tree, and Segment Tree. See the homepage for more details.
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
2025
 Dependencies

Development

~> 11.1.3
~> 1.4.5
~> 0.22.0

Runtime

~> 1.1.0
 Project Readme

Data Structures

This is a small collection of Ruby data structures that I have implemented for my own interest. Implementing the code for a data structure is almost always more educational than simply reading about it and is usually fun. I wrote some of them while participating in the Advent of Code (https://adventofcode.com/).

The implementations are based on the expository descriptions and pseudo-code I found as I read about each structure and so are not as fast as possible.

The code lives in the DataStructuresRMolinari module to avoid polluting the global namespace.

It is distributed under the MIT license.

It is available as a gem: https://rubygems.org/gems/data_structures_rmolinari.

Implementations

Disjoint Union

We represent a set S of non-negative integers as the disjoint union of subsets. Equivalently, we represent a partition of S. The data structure provides very efficient implementation of the two key operations

  • unite(e, f), which merges the subsets containing e and f; and
  • find(e), which returns the canonical representative of the subset containing e. Two elements e and f are in the same subset exactly when find(e) == find(f).

It also provides

  • make_set(v), which adds a new value v to the set S, starting out in a singleton subset.

For more details see https://en.wikipedia.org/wiki/Disjoint-set_data_structure and the paper [TvL1984] by Tarjan and van Leeuwen.

require 'data_structures_rmolinari'
DisjointUnion = DataStructuresRMolinari::DisjointUnion

# Create an instance over the "universe" 0, 1, ..., 9.
du = DisjointUnion.new(10)
du.subset_count          # => 10; each element starts out in its own subset

du.unite(2, 3)           # say that 2 and 3 are actually in the same subset
du.subset_count          # => 9
du.find(2) == du.find(3) # => true

du.unite(4, 5)
du.unite(3, 4)           # now 2, 3, 4, and 5 are all in the same subset
du.subset_count          # => 7

Heap

This is a standard binary heap with an update method, suitable for use as a priority queue. There are several supported operations:

  • insert(item, priority), insert the given item with the stated priority.
    • By default, items must be distinct.
  • top, return the element with smallest priority
  • top_priority, return the priority of the top element
  • pop, return the element with smallest priority and remove it from the structure
  • update(item, priority), update the priority of the given item, which must already be in the heap
  • update_by_delta(item, delta), update the priorityof the given item by adding delta to the priority; the item must already be in the heap

top and top_priority are O(1). The others are O(log n) where n is the number of items in the heap.

By default we have a min-heap: the top element is the one with smallest priority. A configuration parameter at construction can make it a max-heap.

Another configuration parameter allows the creation of a "non-addressable" heap. This makes it impossible to call update or update_by_delta, but allows the insertion of duplicate items (which is sometimes useful) and slightly faster operation overall.

See https://en.wikipedia.org/wiki/Binary_heap and the paper by Edelkamp, Elmasry, and Katajainen [EEK2017].

require 'data_structures_rmolinari'
Heap = DataStructuresRMolinari::Heap

data = [4, 3, 2, 1]

heap = Heap.new

# Insert the elements of data, each with itself as priority.
data.each { |v| heap.insert(v, v) }

heap.top           # => 1, since we have a min-heap.
heap.pop           # => 1
heap.top           # => 2; with 1 gone, this is the element with least priority
heap.update(3, -3)
heap.top           # => 3; now 3 is the element with least priority

Priority Search Tree

A PST stores a set P of two-dimensional points in a way that allows certain queries about P to be answered efficiently. The data structure was introduced by McCreight [McC1985]. De, Maheshawari, Nandy, and Smid [DMNS2011] showed how to build the structure in-place and we use their approach here.

  • largest_y_in_ne(x0, y0) and largest_y_in_nw(x0, y0), the "highest" (max-y) point in the quadrant to the northest/northwest of (x0, y0);
  • smallest_x_in_ne(x0, y0), the "leftmost" (min-x) point in the quadrant to the northeast of (x0, y0);
  • largest_x_in_nw(x0, y0), the "rightmost" (max-x) point in the quadrant to the northwest of (x0, y0);
  • largest_y_in_3_sided(x0, x1, y0), the highest point in the region specified by x0 <= x <= x1 and y0 <= y; and
  • enumerate_3_sided(x0, x1, y0), enumerate all the points in that region.

Here compass directions are the natural ones in the x-y plane with the positive x-axis pointing east and the positive y-axis pointing north.

There is no smallest_x_in_3_sided(x0, x1, y0). Just use smallest_x_in_ne(x0, y0).

(These queries appear rather abstract at first but there are interesting applications. See, for example, section 4 of [McC85], keeping in mind that the data structure in that paper is actually a MinPST.)

Each method also has a named parameter open: that makes the search region an open set. For example, if we call smallest_x_in_ne with open: true then we consider points satisifying x > x0 and y > y0. The default value for this parameter is always false.

The single-point queries run in O(log n) time, where n is the size of P, while enumerate_3_sided runs in O(m + log n), where m is the number of points actually enumerated.

The implementation is in MaxPrioritySearchTree (MaxPST for short), so called because internally the structure is, among other things, a max-heap on the y-coordinates.

We also provide a MinPrioritySearchTree, which answers analagous queries in the southward-infinite quadrants and 3-sided regions.

By default these data structures are immutable: once constructed they cannot be changed. But there is a constructor option that makes the instance "dynamic". This allows us to delete the element at the root of the tree - the one with largest y value (smallest for MinPST) - with the delete_top! method. This operation is important in certain algorithms, such as enumerating all maximal empty rectangles (see the second paper by De et al[DMNS2013]). Note that points can still not be added to the PST in any case, and choosing the dynamic option makes certain internal bookkeeping operations slower.

In [DMNS2013] De et al. generalize the in-place structure to a Min-max Priority Search Tree (MinmaxPST) that can answer queries in all four quadrants and both "kinds" of 3-sided boxes. Having one of these would save the trouble of constructing both a MaxPST and MinPST. But the presentiation is hard to follow in places and the paper's pseudocode is buggy.1

require 'data_structures_rmolinari'
MaxPST = DataStructuresRMolinari::MaxPrioritySearchTree
Point = Shared::Point # simple (x, y) struct. Anything responding to #x and #y will work

data = [Point.new(0, 0), Point.new(1, 2), Point.new(2, 1)]
pst = MaxPST.new(data)

pst.largest_y_in_ne(0, 0)              # => #<struct Shared::Point x=1, y=2>
pst.largest_y_in_ne(1, 1)              # => #<struct Shared::Point x=1, y=2>
pst.largest_y_in_ne(1.5, 1)            # => #<struct Shared::Point x=2, y=1>
pst.largest_y_in_3_sided(-0.5, 0.5, 0) # => #<struct Shared::Point x=0, y=0>

Segment Tree

A segment tree stores information related to subintervals of a certain array. For example, a segment tree can be used to find the sum of the elements in an arbitrary subinterval A(i..j) of an array A(0..n) in O(log n) time. Each node in the tree corresponds to a subarray of A in such a way that the values we store in the nodes can be combined efficiently to determine the desired result for arbitrary subarrays.

An excellent description of the idea is found at https://cp-algorithms.com/data_structures/segment_tree.html.

Generic code is provided in SegmentTree::SegmentTreeTemplate and its equivalent (and faster) C-based sibling, SegmentTree::CSegmentTreeTemplate (see below).

Writing a concrete segment tree class just means providing some simple lambdas and constants to the template class's initializer. Figuring out the details requires some knowledge of the internal mechanisms of a segment tree, for which the link at cp-algorithms.com is very helpful. See the implementations of the concrete classes MaxValSegmentTree and IndexOfMaxValSegmentTree for examples.

Since there are several concrete "types" and two underlying generic implementions there is a convenience method on the SegmentTree module to get instances.

require 'data_structures_rmolinari'
SegmentTree = DataStructuresRMolinari::SegmentTree # namespace module

data = [1, -3, 2, 1, 5, -9]

# Get a segment tree instance that will answer "max over this subinterval?" questions about data.
# Here we get one using the ruby implementation of the generic functionality.
#
# Put :index_of_max in place of :map to get an instance that returns "an index of the maximum value
# over this subinterval".
#
# To use the generic code written in C, put :c instead of :ruby.
seg_tree = SegmentTree.construct(data, :max, :ruby)

seg_tree.max_on(0, 2) # => 2
seg_tree.max_on(1, 4) # => 5
# ..etc..

Algorithms

The Algorithms submodule contains some algorithms using the data structures.

  • maximal_empty_rectangles(points)
    • We are given a set P contained in a minimal box B = [x_min, x_max] x [y_min, y_max]. An empty rectangle is a axis-parallel rectangle with positive area contained in B containing no element of P in its interior. A maximal empty rectangle is an empty rectangle not properly contained in any other empty rectangle. This method yields each maximal empty rectangle in the form [left, right, bottom, top].
    • The algorithm is due to [DMNS2013].

C Extensions

As another learning process I have implemented several of these data structures as C extensions. The APIs are the same.

Disjoint Union

The C version is called CDisjointUnion. A benchmark suggests that a long sequence of unite operations is about 3 times as fast with CDisjointUnion as with DisjointUnion.

The implementation uses the remarkable Convenient Containers library from Jackson Allan.[Allan].

Segment Tree

CSegmentTreeTemplate is the C implementation of the generic class. Concrete classes are built on top of this in Ruby, just as with the pure Ruby SegmentTreeTemplate class.

A benchmark suggests that a long sequence of max_on operations against a max-val Segment Tree is about 4 times as fast with C as with Ruby. The speedup factor is lowered to a little over 2 when the Ruby code is run with YJIT (under Ruby 3.1.3).

I'm a bit suprised the improvement isn't larger, but remember that the C code must still interact with the Ruby objects in the underlying data array, and must access and combine them via Ruby lambdas.

References

Footnotes

  1. See the comments in the fragmentary class MinMaxPrioritySearchTree for further details. ↩