RStore
A library for easy batch storage of csv data into a database
Uses the CSV standard library for parsing, Nokogiri for URL handling, and Sequel ORM for database management.
Special Features
- Batch processing of csv files
- Fetches data from different sources: files, directories, URLs
- Customizable using additional options (also see section Available Options)
-
Validation of field values. At the moment validation of the following types is supported:
-
String
,Integer
,Float
,Date
,DateTime
,Time
, andBoolean
-
- Descriptive error messages pointing helping you to find any invalid data quickly.
- Only define your database and table classes once, then just
require
them when needed. -
Safe and transparent data storage:
- Using database transactions: Either the data from all all files is stored or none (also see section Database Requirements)
- To avoid double entry of data, the
run
method can only be run once on a single instance ofRStore::CSV
.
Database Requirements
- Expects the database table to have an addition column storing an auto-incrementing primary key.
-
Requires the database to support transactions:
Most other database platforms support transactions natively.
In MySQL, you'll need to be running
InnoDB
orBDB
table types rather than the more commonMyISAM
. If you are using MySQL and the table has not been created yet, RStore::CSV will take care of using the correct table type upon creation.
Installation
$ gem install rstore
Note:
As RStore
depends on Nokogiri for fetching data from URLs, you need to install Nokogiri first to use this feature.
However, on some operating systems there can be problems due to missing libraries,
so you might want to take a look at the following installation instructions:
Debian Users of Debian Linux (e.g. Ubuntu) need to run:
$ sudo apt-get install libxslt1-dev libxml2-dev
$ gem install nokogiri
Mac OS X The following instruction should work, but I haven't tested them personally
$ sudo port install libxml2 libxslt
$ gem install nokogiri
Source: Installing Nokogiri
If you have any difficulties installing Nokogiri, please let me know, so that I can help you.
Public API Documentation
The documentation is hosted on RubyDoc.info: RStore Public API documentation.
Sample Usage
Sample csv file
"product","quantity","price","created_at","min_demand","max_demand","on_stock" "toy1","1","1.12","2011-2-4","1:30","1:30am","true" "toy2","2","2.22","2012/2/4","2:30","2:30pm","false "toy3","3","3.33","2013/2/4","3:30","3:30 a.m.","True "toy4","4",,,"4:30","4:30 p.m.","False" "toy4","5","5.55","2015-2-4","5:30","5:30AM","1" "toy5","6","6.66","2016/2/4","6:30","6:30 P.M.","0" "toy6","7","7.77",,,,"false"
- Load gem
require 'rstore/csv'
- Store database information in a subclass of
RStore::BaseDB
Naming convention: name => NameDB
class CompanyDB < RStore::BaseDB
# Same as Sequel.connect, except that you don't need to
# provide the :database key.
info(:adapter => 'mysql',
:host => 'localhost',
:user => 'root',
:password => 'xxx')
end
- Store table information in a subclass of
RStore::BaseTable
Naming convention: name => NameTable
class ProductsTable < RStore::BaseTable
# Specify the database table the same way
# you do in Sequel
create do
primary_key :id, :allow_null => false
String :product
Integer :quantity
Float :price
Date :created_at
DateTime :min_demand
Time :max_demand
Boolean :on_stock, :allow_null => false, :default => false
end
end
Note:
You can put the database and table class definitions in separate files
and require
them when needed.
- Enter csv data into the database
The
from
method accepts a path to a file or directory as well as an URL. Theto
metthod accepts a string of the form db_name.table_name
RStore::CSV.new do
from '../easter/children', :recursive => true # select a directory or
from '../christmas/children/toys.csv' # file, or
from 'www.example.com/sweets.csv', :selector => 'pre div.line' # URL
to 'company.products' # provide database and table name
run # run the program
end
Additional Features
You can change and reset the default options (see section Available Options below for details)
# Search directories recursively and handle the first row of a file as data by default
RStore::CSV.change_default_options(:recursive => true, :has_headers => false)
RStore::CSV.new do
from 'dir1'
from 'dir2'
from 'dir3'
to 'company.products'
run
end
# Restore default options
RStore::CSV.reset_default_options
There is also a convenience method enabling you to use all of Sequels query methods.
RStore::CSV.query('company.products') do |table| # table = Sequel::Dataset object
table.all # fetch everything
table.all[3] # fetch row number 4 (see output below)
table.filter(:id => 2).update(:on_stock => true) # update entry
table.filter(:id => 3).delete # delete entry
end
*)
Output of db[table.name].all[3]
# {:product => "toy4",
# :quantity => 4,
# :price => nil,
# :created_at => nil,
# :min_demand => <DateTime: 2011-10-25T04:30:00+00:00 (39293755/16,0/1,2299161)>,
# :max_demand => <DateTime: 2011-10-25T16:30:00+00:00 (39293763/16,0/1,2299161)>,
# :on_stock => false}
Access all of Sequels functionality by using the convenience methods
BaseDB.connect
, BaseTable.name
, and BaseTable.table_info
:
DB = CompanyDB.connect # Open connection to 'company' database
name = ProductTable.name # Table name, :products, used as an argument to the following methods.
layout = ProductsTable.table_info # The Proc that was passed to ProductsTable.create
DB.create_table(name, &layout) # Create table
DB.alter_table name do # Alter table
drop_column :created_at
add_column :entry_date, :date
end
DB.drop_table(name) # Drop table
Available Options
The method from
accepts two kinds of options, file options and parse options:
File Options
File options are used for fetching csv data from a source. The following options are recognized:
-
:has_headers, default:
true
- When set to false, the first line of a file is processed as data, otherwise it is discarded.
-
:recursive, default:
false
- When set to true and a directory is given, recursively search for files. Non-csv files are skipped.
-
:digit_seps, default
[',', '.']
- The thousands separator and decimal mark used for numbers in the data source. Different countries use different thousands separators and decimal marks, and setting this options ensures that parsing of these numbers succeeds. Note that all numbers will still be stored in the format that Ruby recognizes, that is with a point (.) as the decimal mark.
-
:selector, default:
""
- Mandatory css selector when fetching data from an URL. For more details please see the section Further Reading below
Parse Options
Parse options are arguments to CSV::parse
. The following options are recognized:
-
:col_sep, default:
","
- The String placed between each field.
-
:row_sep, default:
:auto
- The String appended to the end of each row.
-
:quote_char, default:
'"'
- The character used to quote fields.
-
:field_size_limit, default:
nil
- The maximum size CSV will read ahead looking for the closing quote for a field.
-
:skip_blanks, default:
false
- When set to a true value, CSV will skip over any rows with no content.
For more information on the parse options, please see section Further Reading below.
Further Reading
- Sequel
- CSV
- Parse options documentation
- Common Format and MIME Type for Comma-Separated Values (CSV) Files
- Nokogiri
Feedback
Any suggestions or criticism are highly welcome! Whatever your feedback, it will help me make this gem better!