Wisconsin Benchmark
Wisconsin Benchmark is a benchmark for relational database systems and machines developed at University of Wisconsin. It was used to asses early relational data system performance, but I think it will be useful to measure the capability of DataFrame and its scalability because it is well designed synthetic dataset.
The Scalable Wisconsin Benchmark Dataset has the following structure: Attribute and Tuple are words used in database systems, corresponding to column name and row (or record) in the data frame, respectively.
Table: Attribute Specification of "Scalable" Wisconsin Benchmark
Attribute Name | Range of Values | Order | Comment |
---|---|---|---|
unique1 | 0-(MAXTUPLES-1) | random | unique, random order |
unique2 | 0-(MAXTUPLES-1) | sequential | unique, sequential |
two | 0-1 | random | (unique1 mod 2) |
four | 0-3 | random | (unique1 mod 4) |
ten | 0-9 | random | (unique1 mod 10) |
twenty | 0-19 | random | (unique1 mod 20) |
onePercent | 0-99 | random | (unique1 mod 100) |
tenPercent | 0-9 | random | (unique1 mod 10) |
twentyPercent | 0-4 | random | (unique1 mod 5) |
fiftyPercent | 0-1 | random | (unique1 mod 2) |
unique3 | 0-(MAXTUPLES-1) | random | unique1 |
evenOnePercent | 0,2,4,...,198 | random | (onePercent * 2) |
oddOnePercent | 1,3,5,...,199 | random | (onePercent * 2)+1 |
stringu1 | (string from unique1) | random | unique, random order, 52bytes each |
stringu2 | (string from unique2) | sequential | unique, sequential, 52bytes each |
string4 | (string) | cyclic | 4 unique string, 52bytes each |
(This table is taken from Table 2 in Wisconsin Benchmark. I added italic letters to describe in detail.)
Benchmark dataset
This project will provide a generator of "Scaled Wisconsin Benchmark Dataset" and generated table in arrow, parquet, and csv with records ranging from 100 rows to 10,000 rows in 10x increments. The Dataset Generator is able to generate up to 100_000_000 rows if memory is sufficiently supplied.
Benchmark suites for DataFrames
Coming soon.
Installation
Install Apache Arrow, Arrow GLib and Parquet GLib. See Apache Arrow install document.
- Apache Arrow
- Apache Arrow GLib
- Apache Parquet GLib
Install the gem and add to the application's Gemfile by executing:
$ bundle add wisconsin-benchmark
If bundler is not being used to manage dependencies, install the gem by executing:
$ gem install wisconsin-benchmark
Usage
To start the Dataset Generator run;
generate-dataset (dataset_size)
Here we can specify dataset_size (number of rows). Default size is 1,000.
Then generator will create .arrow, .csv, .parquet files in specified size. Filename will be like 'WB_1E3.arrow'.
Generated Datasets are stored in datasets
directory.
To experiment with the code, run bin/console
for an interactive prompt.
Development
After checking out the repo, run bin/setup
to install dependencies. Then, run rake test-unit
to run the tests. You can also run bin/console
for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install
. To release a new version, update the version number in version.rb
, and then run bundle exec rake release
, which will create a git tag for the version, push git commits and the created tag, and push the .gem
file to rubygems.org.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/wisconsin-benchmark. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the code of conduct.
License
The gem is available as open source under the terms of the MIT License.
Code of Conduct
Everyone interacting in the Wisconsin::Benchmark project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the code of conduct.