Tablestakes
Tablestakes is a gem for processing tabular data. It is for people who would rather not meddle with a spreadsheet, or load their data into a SQL database. You get the instant gratification of being able to read a tab-delimited file, with header values, and then do field counts, field modifications, selections, joins to your heart's content. Tablestakes operates only in memory, so it is fast. Of course that also means that there are some size limitations -- very large tables should be processed with another library.
Contents
- How to install
- Philosophy and Conventions
- Loading and Writing Files
- Selecting Data
- Counting Data
- Updating Data
- Join, Union, and Intersect
- Sorting Data
- Interacting with your Data
How to Install
-
Install the gem
gem install tablestakes
-
Add the
tablestakes
gem to your ruby coderequire 'tablestakes'
Now you're ready to start slicing and dicing your data tables!
Philosophy and Conventions
Tablestakes is meant to be fast and easy for manipulating your data. It maintains Ruby conventions, like method chaining and mostly non-destructive methods.
Tablestakes tables also maintain some conventions for simplicity:
- Table column names are always the values in the first row of your data file.
- Fields in the table are always strings (conversion to numbers or dates is a potential enhancement).
- Methods only modify one dimension at a time. So, for instance,
Table#select
only selects columns andTable#where
only selects rows. Chain them together for the desired effect. - Tables are ordered, both columns and rows, until modified.
Loading and Writing Files
Tables can be created from tab-delimited data files, using the Table constructor:
cities = Table.new('cities.txt')
Tables can also be created from other tables (useful in saving sub-tables), they can be
created from an Array of rows (embedded Arrays), and the Table#new
function also
creates a blank table when no parameters (or nil) are given.
Tables are saved as tab-delimited files using the Table#write_file
method:
cities.write_file('new-cities.txt')
Tables can also be sent to your favorite I/O channel using the Table#to_s
function,
which creates a tab-delimited string.
Selecting Data
Selecting your data happens in two dimensions - rows and columns. First, you can create an ordered Array of a row or column just by asking for it by header name.
cities.column('State') # returns ["Texas", "Tennessee", "California", ...]
If you're uncertain of your header names, they are accessible as an Array via the
Table#headers
method.
cities.headers # returns ["2012 rank", "City", "State", ...]
Rows can be returned when a numeric index is known. To return the first data row:
cities.row(0) # returns ["119", "Amarillo", "Texas", "195250", "190695", ...]
Table columns can be selected specifically with the Table#select
method:
cities.select("City", "State", "2010 Census") # returns a table with only those columns
In order to select rows, use Table#where
, which allows you to select rows given a ruby
statement to be evaluated against a given value in a column. For instance:
cities.where("State", "=~ /^N.*/") # returns a sub-table of cities in states that begin with 'N'
Use single quotes when comparing your column value to a string. Also, note that all numeric data is stored as a string value unless explicitly converted by your selection statement.
Counting Data
One reason to manipulate tables quickly in memory is to get counts for histograms, pie charts, and other data analysis representations. Tablestakes gives you simple methods for counting.
cities.size # returns the number of rows in the cities table
cities.length # same as cities.size
cities.count # same as cities.size
cities.count('State', 'New York') # returns the number of entries that have State=='New York'
If you want to know the frequency of certain values in your data set, there are a couple of methods for selecting the most and least frequent items.
cities.top("State") # returns the state with the most cities listed
cities.top("State", 5) # returns the 5 most frequent states
cities.bottom("State", 5) # returns the 5 least frequent states
Additionally, you can create a separate Table object that tallies on a given column
cities.tally('State') # returns a Table of States and the number of times they appear
puts cities.tally('State').to_s # print a table of the frequency that states appear
Updating Data
Sometimes data in a table needs to be cleaned up and modified.
the Table#sub
method provides a way to eliminate common garbage from
your data such as stray characters.
cities.sub("2012 land area", /(.*) sq mi/, '\1') # deletes ' sq mi' from the end of the 2012 land area field
Table#sub
takes a regular expression and a substitute string, which
gives some flexibility in how data is updated. Note that this is
a method which modifies the table object.
Join, Union, and Intersect
Once your tables are read into memory, it is useful to join them with other tables or find the common elements. Tablestakes provides a simple join function as follows
capitals.join(cities, "Capital", "City") # create a table which only contains highly populated Capital cities
You may also need to quickly compare the elements of one column
in a table with the elements in another table. Table#union
and Table#intersect
are for that situation.
capitals.union(cities, "Capitals", "Cities") # returns an array of all cities in both tables
capitals.intersect(cities, "Capitals", "Cities") # returns an array of only the cities in both tables
Sorting Data
Sorting your data table can be done on any given column (if not specified, it defaults to the first column). Any blocks passed to the sort function are then used internally to sort the column. Here are some examples:
cities.sort("State") # Re-orders the cities table based on State name
cities.sort { |a,b| b<=>a } # Reverse the order of the cities table
cities.sort("State") { |a,b| b<=>a } # Sort by State in reverse alpha order
Interacting with your Data
Typically, you can accomplish your goals with chained queries of the datatable. Here are some examples:
-
Find all of the cities in New York
ny_cities = cities.where("State", "== 'New York'")
-
Find all of the capitals which are not in the set of most populated cities
small_caps = capitals.column("Capital") - capitals.join(cities, 'Capital', 'City').column('Capital')
-
Read a file, select the columns and rows you want, and write the subtable as a tab-delimited file.
Table.new('cities.txt').select('City','State','2012 estimate').where('2012 estimate', ".to_i > 1000000").write_file('big_cities.txt')
Some methods, such as Table#row
and Table#column
return Arrays, and of course these are
readily modified using their own native methods.