Extraloop Redis Storage
Description
Persistence layer for the ExtraLoop data extraction toolkit. This module is implemented as a wrapper around Ohm, an object-hash mapping library that makes it easy to store structured data in Redis. It includes a convenient command line tool for listing, filtering, and deleting harvested datasets, as well as exporting them to local files or remote data stores (e.g. Google Fusion Tables).
Installation
gem install extraloop-redis-storage
Usage
ExtraLoop’s Redis storage module decorates ExtraLoop::ScraperBase and ExtraLoop::IterativeScraper instances with the set_storage method, a helper that specifies how the scraped data should be stored.
require "extraloop/redis-storage"

# Model for the extracted records; attributes are persisted to Redis through Ohm.
class AmazonReview < ExtraLoop::Storage::Record
  attribute :title
  attribute :rank
  attribute :date

  # Reject records whose rank falls outside the 0..5 range.
  def validate
    assert((0..5).include?(rank.to_i), "Rank not in range")
  end
end

scraper = AmazonReviewScraper.new("0262560992")
  .set_storage(AmazonReview, "Amazon reviews of 'The Little Schemer'")
  .run()
At each scraper run, the ExtraLoop storage module internally instantiates a session (see ExtraLoop::Storage::ScrapingSession) and associates the extracted records with it. The AmazonReview records just created can then be accessed by calling the #records method on the scraper's session object.
reviews = scraper.session.records
#set_storage
The set_storage method accepts the following arguments:
- model: A Ruby constant or a symbol specifying the model to be used for storing the extracted data. If a symbol is passed, the storage module assumes that no such model exists and dynamically generates one by subclassing ExtraLoop::Storage::Record.
- session_title: A human-readable title for the extracted dataset (optional).
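The symbol case described above can be pictured with a small, self-contained sketch. It is not the gem's actual code: Record here is a stand-in for ExtraLoop::Storage::Record so the example runs without the gem or Redis, and resolve_model is a hypothetical helper name.

```ruby
# Stand-in for ExtraLoop::Storage::Record (assumption: lets the sketch run without the gem).
class Record; end

# Hypothetical helper: returns a class unchanged, or defines a new
# Record subclass named after the given symbol.
def resolve_model(model)
  return model if model.is_a?(Class)

  # Assigning the anonymous class to a constant gives it that name.
  Object.const_set(model, Class.new(Record))
end

class AmazonReview < Record; end

existing  = resolve_model(AmazonReview)     # an existing class is used as-is
generated = resolve_model(:GoogleNewsStory) # a new Record subclass is created

puts existing.name
puts generated.name
puts generated.superclass.name
```

Note that the symbol has to be a valid Ruby constant name (that is, it must start with an uppercase letter) for const_set to accept it.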
Command line interface
Once installed, the gem also adds the extraloop executable to your system path: a command line interface to the datasets harvested through ExtraLoop. A list of datasets can be obtained by running:
extraloop datastore list
This will generate a table like the following one:
id | title                              | model           | records
-------------------------------------------------------------------
48 | 1330106699 GoogleNewsStory Dataset | GoogleNewsStory | 110
49 | 1330106948 AmazonReview Dataset    | AmazonReview    | 0
51 | 1330107087 GoogleNewsStory Dataset | GoogleNewsStory | 110
52 | 1330111630 AmazonReview Dataset    | AmazonReview    | 10
Datasets can be removed using the delete subcommand:
extraloop datastore delete [id]
where id is either a single scraping session id or a range of session ids (e.g. 48..52).
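As an illustration of the two accepted forms, the sketch below turns such an argument into a list of session ids. parse_session_ids is a hypothetical helper, not part of the gem; treating the range as inclusive and accepting a reversed range such as 51..48 as equivalent to 48..51 are both assumptions of this sketch.

```ruby
# Hypothetical helper: turns "48" or "48..52" into an array of session ids.
def parse_session_ids(arg)
  case arg
  when /\A(\d+)\z/
    [Integer($1, 10)]
  when /\A(\d+)\.\.(\d+)\z/
    # Assumption: ranges are inclusive, and a reversed range is normalized.
    low, high = [Integer($1, 10), Integer($2, 10)].minmax
    (low..high).to_a
  else
    raise ArgumentError, "expected an id or an id range, got #{arg.inspect}"
  end
end

p parse_session_ids("48")
p parse_session_ids("48..52")
```

Anything that is neither a plain number nor two numbers joined by `..` raises an ArgumentError rather than being silently ignored.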
From the Redis datastore, ExtraLoop datasets can be exported to disk as CSV, JSON, or YAML documents:
extraloop datastore export 51..52 -f csv
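To give a rough idea of what the three formats look like for record-like data, here is a sketch that uses only Ruby's standard csv, json, and yaml libraries. It does not reproduce the tool's actual output layout: serialize_records is a hypothetical helper, and the sample reviews are made up, borrowing only the attribute names from the AmazonReview model above.

```ruby
require "csv"
require "json"
require "yaml"

# Hypothetical helper: serializes a list of record hashes to CSV, JSON, or YAML.
def serialize_records(records, format)
  case format
  when "csv"
    return "" if records.empty?

    # Use the keys of the first record as the header row.
    headers = records.first.keys
    CSV.generate do |csv|
      csv << headers
      records.each { |record| csv << record.values_at(*headers) }
    end
  when "json"
    JSON.pretty_generate(records)
  when "yaml"
    records.to_yaml
  else
    raise ArgumentError, "unsupported format: #{format.inspect}"
  end
end

# Made-up sample data (assumption: not taken from a real dataset).
reviews = [
  { "title" => "A classic", "rank" => "5", "date" => "2012-02-24" },
  { "title" => "Too terse", "rank" => "3", "date" => "2012-02-25" }
]

puts serialize_records(reviews, "csv")
```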
Similarly, stored datasets can be uploaded to a remote datastore:
extraloop datastore push 51..48 fusion_tables -c google_username:password
Google Fusion Tables is currently the only remote datastore implemented; support for pushing datasets to others (e.g. CouchDB, CartoDB, and the CKAN Webstore) will be added soon.