Goldmine
Extract a wealth of information from lists.
Goldmine is especially helpful when working with source data that is difficult to query. e.g. CSV files, API results, etc...
Uses
- Data mining
- Data transformation
- Data blending
- Data visualization prep
- CSV report generation
Quick Start
gem install goldmine
require "goldmine"
list = [1,2,3,4,5,6,7,8,9]
Goldmine(list)
.pivot("< 5") { |i| i < 5 }
.to_h
{
[["< 5", true]] => [1, 2, 3, 4],
[["< 5", false]] => [5, 6, 7, 8, 9]
}
Array Value Pivots
users = [
{ :name => "Sally", :favorite_colors => [:blue] },
{ :name => "John", :favorite_colors => [:blue, :green] },
{ :name => "Stephen", :favorite_colors => [:red, :pink, :purple] },
{ :name => "Emily", :favorite_colors => [:orange, :green] },
{ :name => "Joe", :favorite_colors => [:red] }
]
Goldmine(users)
.pivot(:favorite_color) { |record| record[:favorite_colors] }
.to_h
{
[:favorite_color, :blue] => [{:name=>"Sally", :favorite_colors=>[:blue]}, {:name=>"John", :favorite_colors=>[:blue, :green]}],
[:favorite_color, :green] => [{:name=>"John", :favorite_colors=>[:blue, :green]}, {:name=>"Emily", :favorite_colors=>[:orange, :green]}],
[:favorite_color, :red] => [{:name=>"Stephen", :favorite_colors=>[:red, :pink, :purple]}, {:name=>"Joe", :favorite_colors=>[:red]}],
[:favorite_color, :pink] => [{:name=>"Stephen", :favorite_colors=>[:red, :pink, :purple]}],
[:favorite_color, :purple] => [{:name=>"Stephen", :favorite_colors=>[:red, :pink, :purple]}],
[:favorite_color, :orange] => [{:name=>"Emily", :favorite_colors=>[:orange, :green]}]
}
Chained pivots
users = [
{ :name => "Sally", :age => 21 },
{ :name => "John", :age => 28 },
{ :name => "Stephen", :age => 37 },
{ :name => "Emily", :age => 32 },
{ :name => "Joe", :age => 18 }
]
Goldmine(users)
.pivot("'e' in name") { |user| !!user[:name].match(/e/i) }
.pivot("21 or over") { |user| user[:age] >= 21 }
.to_h
{
[["'e' in name", false], ["21 or over", true]] => [{:name=>"Sally", :age=>21}, {:name=>"John", :age=>28}],
[["'e' in name", true], ["21 or over", true]] => [{:name=>"Stephen", :age=>37}, {:name=>"Emily", :age=>32}],
[["'e' in name", true], ["21 or over", false]] => [{:name=>"Joe", :age=>18}]
}
Rollups
Rollups provide an intuitive way to aggregate pivoted data into a report friendly format. Think computed columns.
Rollups are blocks
that get executed once for each pivot entry.
They can be also be chained.
list = [1,2,3,4,5,6,7,8,9]
Goldmine(list)
.pivot("< 5") { |i| i < 5 }
.pivot("even") { |i| i % 2 == 0 }
.rollup("count", &:count)
.to_h
{
[["< 5", true], ["even", false]] => [["count", 2]],
[["< 5", true], ["even", true]] => [["count", 2]],
[["< 5", false], ["even", false]] => [["count", 3]],
[["< 5", false], ["even", true]] => [["count", 2]]
}
Rollup Caching
Rollups can be computationally expensive. Optional caching can be used to reduce this computational overhead.
list = [1,2,3,4,5,6,7,8,9]
Goldmine(list, cache: true)
.pivot(:less_than_5) { |i| i < 5 }
.rollup(:count, &:count)
.rollup(:evens) { |list| list.select { |i| i % 2 == 0 }.count }
.rollup(:even_percentage) { |list| cache[:evens] / cache[:count].to_f }
.to_h
{
[[:less_than_5, true]] => [[:count, 4], [:evens, 2], [:even_percentage, 0.5]],
[[:less_than_5, false]] => [[:count, 5], [:evens, 2], [:even_percentage, 0.4]]
}
Rows
It's often helpful to flatten rollups into rows.
list = [1,2,3,4,5,6,7,8,9]
result = Goldmine(list, cache: true)
.pivot(:less_than_5) { |i| i < 5 }
.rollup(:count, &:count)
.rollup(:evens) { |list| list.select { |i| i % 2 == 0 }.count }
.rollup(:even_percentage) { |list| cache[:evens] / cache[:count].to_f }
.result
result.to_rows
[
[[:less_than_5, true], [:count, 4], [:evens, 2], [:even_percentage, 0.5]],
[[:less_than_5, false], [:count, 5], [:evens, 2], [:even_percentage, 0.4]]
]
result.to_hash_rows
[
{:less_than_5=>true, :count=>4, :evens=>2, :even_percentage=>0.5},
{:less_than_5=>false, :count=>5, :evens=>2, :even_percentage=>0.4}
]
Tabular
Rollups can also be converted into tabular format.
list = [1,2,3,4,5,6,7,8,9]
Goldmine(list)
.pivot(:less_than_5) { |i| i < 5 }
.pivot(:even) { |i| i % 2 == 0 }
.rollup(:count, &:count)
.to_tabular
[
[:less_than_5, :even, :count],
[true, false, 2],
[true, true, 2],
[false, false, 3],
[false, true, 2]
]
CSV
Rollups can also be converted into CSV format.
list = [1,2,3,4,5,6,7,8,9]
Goldmine(list)
.pivot(:less_than_5) { |i| i < 5 }
.pivot(:even) { |i| i % 2 == 0 }
.rollup(:count, &:count)
.to_csv
"less_than_5,even,count\ntrue,false,2\ntrue,true,2\nfalse,false,3\nfalse,true,2\n"
Example Apps
All examples are small Sinatra apps. They are designed to help communicate Goldmine use-cases.
Setup
git clone git@github.com:hopsoft/goldmine.git
cd /path/to/goldmine
bundle
Uses data from https://github.com/hopsoft/goldmine/blob/master/examples/new_york_wifi_hotspots/DOITT_WIFI_HOTSPOT_01_13SEPT2010.csv
In this example, we mine out the following information.
- Total hotspots by city, zip, & area code
- Free hotspots by city, zip, & area code
- Paid hotspots by city, zip, & area code
- Library hotspots by city, zip, & area code
- Starbucks hotspots by city, zip, & area code
- McDonalds hotspots by city, zip, & area code
ruby examples/new_york_wifi_hotspots/app.rb
curl http://localhost:3000/raw
curl http://localhost:3000/pivoted
curl http://localhost:3000/rolled_up
curl http://localhost:3000/rows
curl http://localhost:3000/tabular
curl http://localhost:3000/csv
Uses data from http://dev.socrata.com/foundry/#/data.medicare.gov/aeay-dfax
In this example, we mine out the following information.
- Total doctors by state & specialty
- Preferred doctors by state & specialty
- Female doctors by state & specialty
- Male doctors by state & specialty
- Preferred female doctors by state & specialty
- Preferred male doctors by state & specialty
ruby examples/medicare_physician_compare/app.rb
curl http://localhost:3000/raw
curl http://localhost:3000/pivoted
curl http://localhost:3000/rolled_up
curl http://localhost:3000/rows
curl http://localhost:3000/tabular
curl http://localhost:3000/csv
Performance
The Medicare dataset is large & works well for performance testing.
My Macbook Pro yields the following benchmarks.
- 3.1 GHz Intel Core i7
- 16 GB 1867 MHz DDR3
100,000 Records
user system total real
pivoted 0.630000 0.030000 0.660000 ( 0.670409)
rolled_up 0.570000 0.030000 0.600000 ( 0.626413)
rows 0.010000 0.000000 0.010000 ( 0.003258)
tabular 0.010000 0.000000 0.010000 ( 0.010110)
csv 0.050000 0.000000 0.050000 ( 0.057677)
1,000,000 Records
user system total real
pivoted 7.270000 0.300000 7.570000 ( 8.053166)
rolled_up 6.800000 0.830000 7.630000 ( 8.051707)
rows 0.000000 0.000000 0.000000 ( 0.003934)
tabular 0.010000 0.000000 0.010000 ( 0.011825)
csv 0.210000 0.010000 0.220000 ( 0.222752)
Summary
Goldmine makes data highly malleable. It allows you to combine the power of pivots, rollups, tabular data, & csv to construct deep insights with minimal effort.
Real world use cases include:
- Build a better understanding of database data before canonizing reports in SQL
- Create source data for building user interfaces & data visualizations
- Transform CSV data from one format to another