Project

hetchy

0.05
No commit activity in last 3 years
No release in over 3 years
A high performance, thread-safe reservoir sampler with snapshot and percentile support.
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
 Dependencies
 Project Readme

Hetchy

Gem Version Build Status Code Climate

A high performance, thread-safe reservoir sampler for ruby with snapshot and percentile support.

Benefits/Goals

  • Generate statistics for large datasets with a very small in-memory footprint
  • Generate percentiles/quantiles for a known set or numbers or for real-time streaming set
  • Ability to capture sample state at a moment in time for further analysis
  • Speed
  • Thorough test suite
  • No dependencies

Installation

Add this line to your application's Gemfile:

gem 'hetchy'

And then $ bundle

Or install it yourself with:

$ gem install hetchy

Usage

Create a reservoir and designate how big you want it to be:

reservoir = Hetchy::Reservoir.new(size: 1000)

Add samples as they arrive or are generated:

reservoir << 123

You can add sets of samples as well:

reservoir << [45,89,124,96]

You can calculate a percentile at any time:

reservoir.percentile(95)

Hetchy supports high resolution percentiles as well:

reservoir.percentile(99.99)

NOTE: you may need to increase reservoir size for very high resolution percentiles. Experiment to see what works for your data set.

For threaded applications where the reservoir is accepting samples rapidly you can increase performance by snapshotting before running a series of calculations on the reservoir:

dataset  = reservoir.snapshot

perc_95  = dataset.percentile(95)
perc_99  = dataset.percentile(99)
perc_999 = dataset.percentile(99.9)

Clear the reservoir to reset it:

reservoir.clear

Datasets

If you have an existing series you can use Dataset to generate stats for it:

my_series = Array(1..1000)
dataset = Hetchy::Dataset.new(my_series)

perc_95 = dataset.percentile(95)   #=> 950.95
median  = dataset.median           #=> 500.5

Stats Details

For those interested:

  • Reservoir sampling is based on Vitter's algorithm R, ensures a uniform sampling probability for every entry in the series
  • Percentile calculations use weighted averages, not nearest neighbor

Contributing

  • Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet.
  • Check out the issue tracker to make sure someone already hasn't requested it and/or contributed it.
  • Fork the project and submit a pull request from a feature or bugfix branch.
  • Please include tests. This is important so we don't break your changes unintentionally in a future version.
  • Please don't modify the gemspec, Rakefile, version, or changelog. If you do change these files, please isolate a separate commit so we can cherry-pick around it.

Credits

Parts of Hetchy are inspired by Eric Lindvall's metriks gem.

Copyright

Copyright (c) 2015 Matt Sanders. See LICENSE for details.