Monkeyshines is a tool for doing an algorithmic scrape.
It’s designed to handle large-scale scrapes that may exceed the capabilities of single-machine relational databases, so it plays nicely with Hadoop / Wukong, with distributed databases (MongoDB, tokyocabinet, etc.), and with distributed job queues (e.g. edamame/beanstalk).
Install
Get the code
We’re still actively developing monkeyshines. The newest version is available via Git on github:
$ git clone git://github.com/mrflip/monkeyshines
A gem is available from gemcutter:
$ sudo gem install monkeyshines --source=http://gemcutter.org
(don’t use the gems.github.com version — it’s way out of date.)
You can instead download this project in either zip or tar formats.
Dependencies and setup
To finish setting up, see the detailed setup instructions and then read the usage notes.
Overview
Runner
- Constructed using the builder pattern
- Does the running itself:
  - Set everything up
  - Loop until there are no more requests:
    - Get a request from #source
    - Pass that request to the fetcher
      - The fetcher has a #get method, which stuffs the response contents into the request object
    - If the fetcher gets a successful response, pass the filled-in request along to the store
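In plain Ruby, that loop looks roughly like the sketch below; the class, accessor and method names here are illustrative assumptions, not monkeyshines’ actual API.

    # Illustrative sketch only: class, accessor and method names are assumptions,
    # not monkeyshines' actual API.
    class SketchRunner
      attr_reader :source, :fetcher, :dest

      def initialize(source, fetcher, dest)
        @source, @fetcher, @dest = source, fetcher, dest
      end

      def run
        source.each do |request|        # get a request from #source
          fetcher.get(request)          # #get stuffs the response into the request
          next unless request.response_code && request.response_code.to_i < 400
          dest.save(request)            # successful response: hand it to the store
        end
      end
    end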
Bulk URL Scraper
- Open a file with URLs, one per line
- Loop until there are no more requests:
  - Get a simple_request from the #source
    - The source is a FlatFileStore; it generates simple_requests (objects of type SimpleRequest), each with a #url and attributes for the contents, response_code and response_message.
  - Pass that request to an http_fetcher
    - The fetcher has a #get method, which stuffs the body of the response (basically, the HTML for the page) into the request object’s contents, and does likewise for the response_code and response_message.
  - If the fetcher gets a successful response,
    - pass it to a flat_file_store,
    - which just writes the request to disk, one line per request, with tab-separated fields:
url moreinfo scraped_at response_code response_message contents
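Put together, the whole flow is roughly the stand-alone sketch below. It bypasses the gem entirely; the file names and timestamp format are placeholders, and the code is only meant to show the flat-file-in, TSV-out shape.

    #!/usr/bin/env ruby
    # Stand-alone illustration of the bulk URL scrape (not monkeyshines itself):
    # read URLs one per line, fetch each, append one tab-separated line per request.
    require 'open-uri'
    require 'time'

    File.open('scraped.tsv', 'a') do |store|
      File.foreach('urls.txt') do |line|
        url = line.strip
        next if url.empty?
        begin
          contents = URI.open(url, &:read).gsub(/[\t\r\n]/, ' ')  # keep the record on one line
          code, message = 200, 'OK'
        rescue OpenURI::HTTPError => e
          contents      = ''
          code, message = e.io.status        # e.g. ["404", "Not Found"]
        end
        # url  moreinfo  scraped_at  response_code  response_message  contents
        store.puts [url, '', Time.now.utc.iso8601, code, message, contents].join("\t")
      end
    end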
beanstalk == queue
ttserver == distributed lightweight DB
god == monitoring & restart
shotgun == runs sinatra for development
thin == runs sinatra for production
- work directory holds everything generated: logs, output, dumps of the scrape queue
- ./dump_twitter_search_jobs.rb --handle=com.twitter.search --dest-filename=dump.tsv
  serializes the queue to a flat file in work/seed
- ./load_twitter_search_jobs.rb
- scrape_twitter_search.rb: run it in the background with
  nohup ./scrape_twitter_search.rb --handle=com.twitter.search >> work/log/twitter_search-console-`date "+%Y%m%d%H%M%S"`.log 2>&1 &
- tail -f work/log/twitter_search-console-20091006.log (<- replace the date with that of the latest run)
- the actual file being stored:
- tail -f work/20091013/comtwittersearch+20091013164824-17240.tsv | cutc 150
Request Source
- runner.source
- request stream
- Supplies raw material to initialize a job
Twitter search scraper
Request Queue
Periodic requests
The request stream can be metered using read-through caching, a schedule (e.g. cron), or test-and-sleep.
- Scheduled: kicked off periodically, e.g. from cron.
- Test and sleep: a queue of resources is cyclically polled, sleeping whenever bored.
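A rough sketch of test-and-sleep: the Resource struct, the nap length and the one-minute recheck are all invented for the example.

    # Sketch of test-and-sleep metering: cycle over a set of resources, scraping
    # whatever is due and napping whenever bored.
    Resource = Struct.new(:url, :next_due_at)

    BORED_SLEEP = 5   # seconds to nap when nothing is due

    def test_and_sleep(resources)
      loop do
        due = resources.select { |r| r.next_due_at <= Time.now }
        if due.empty?
          sleep BORED_SLEEP                   # bored: nothing to do yet
        else
          due.each do |r|
            puts "would scrape #{r.url}"      # stand-in for the real fetch + store
            r.next_due_at = Time.now + 60     # naive reschedule: look again in a minute
          end
        end
      end
    end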
Requests
- Base: simple fetch and store of a URI (the URI specifies an immutable, unique resource).
- : single resource; want to check for updates over time.
- Timeline: message stream, e.g. a Twitter search or user timeline. Want to do paginated requests back to the last-seen item.
- Feed: poll the resource and extract its contents, storing items by GUID. Want to poll frequently enough that a single-page request gives full coverage.
Scraper
- HttpScraper — comes in JSON and HTML flavors
  - \0 separates records, \t separates the initial fields
  - map \ to \\, then map tab, cr and newline to \t, \r and \n resp.
  - alternately, map tab, cr and newline to their character entities (the control characters x9, xa, xd, x7f)
- HeadScraper — records the HEAD parameters
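A sketch of the backslash escaping described above (not the gem’s own code):

    # Backslashes first, then tab / carriage return / newline, so each record
    # stays on one line and the tab-delimited fields stay intact.
    def escape_flat_field(str)
      str.gsub("\\") { "\\\\" }     # \  -> \\
         .gsub("\t") { "\\t" }      # tab -> \t
         .gsub("\r") { "\\r" }      # carriage return -> \r
         .gsub("\n") { "\\n" }      # newline -> \n
    end

    escape_flat_field("a\tb\nc")    # => "a\\tb\\nc"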
Store
- Flat file (chunked)
- Key store
- Read-through cache
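For illustration, read-through caching amounts to something like this; a plain Hash stands in for the key store here, not the gem’s store API.

    # Sketch of a read-through cache: consult the store first, fall back to the
    # given block (the real fetch) only on a miss. In practice the store would
    # be tokyocabinet- or MongoDB-backed rather than a Hash.
    def read_through(store, key)
      cached = store[key]
      return cached unless cached.nil?
      fresh = yield(key)        # cache miss: do the expensive fetch
      store[key] = fresh        # remember it for next time
      fresh
    end

    cache = {}
    read_through(cache, 'http://example.com/') { |url| "pretend-fetched #{url}" }
    read_through(cache, 'http://example.com/') { |url| raise 'never called: cache hit' }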
Periodic
- Log only every N requests, or t minutes, or whatever.
- Restart session every hour
- Close the file and start a new chunk every 4 hours or so. (Mitigates data loss if a file is corrupted, and makes for easy batch processing.)
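A tiny sketch of that kind of periodic gate, with invented names:

    # Run a block only every nth call or every `secs` seconds, whichever comes first.
    class PeriodicGate
      def initialize(every_n, secs)
        @every_n   = every_n
        @secs      = secs
        @count     = 0
        @last_time = Time.now
      end

      def maybe
        @count += 1
        return unless (@count % @every_n).zero? || (Time.now - @last_time) >= @secs
        @last_time = Time.now
        yield
      end
    end

    log_gate = PeriodicGate.new(1000, 300)   # every 1000 requests or 5 minutes
    # inside the scrape loop:  log_gate.maybe { warn "still alive at #{Time.now}" }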
Pagination
Session
- Twitter Search: Each req brings in up to 100 results in strict reverse ID (pseudo time) order. If the last item ID in a request is less than the previous scrape session’s max_id, or if fewer than 100 results are returned, the scrape session is complete. We maintain two scrape_intervals: one spans from the earliest seen search hit to the highest one from the previous scrape; the other ranges backwards from the highest in this scrape session (the first item in the first successful page request) to the lowest in this scrape session (the last item on the most recent successful page request).
- Set no upper limit on the first request.
- Request by page, holding the max_id fixed
- Use the lowest ID from the previous request as the new max_id
- Use the supplied ‘next page’ parameter
- Twitter Followers: Each request brings in 100 followers in reverse order of when the relationship formed. A separate call to the user can tell you how many total followers there are, and you can record how many there were at the end of the last scrape, but there’s some slop (if 100 people in the middle of the list unfollow and 100 more people at the front follow, the total will be the same). High-degree accounts may have as many as 2M followers (20,000 calls).
- FriendFeed: Up to four pages. Expiry given by result set of <100 results.
- Paginated: one resource, but one that requires one or more requests to retrieve in full.
- Paginated + limit (max_id/since_date): rather than requesting by increasing page, request one page at a time with a limit parameter until the last item on the page overlaps the previous scrape. For example, say you are scraping search results, and that when you last made the request the max ID was 120_000; the current max_id is 155_000. Request the first page (no limit), then use the last result on each page as the new limit_id until that last result is less than 120_000 (see the sketch after this list).
- Paginated + stop_on_duplicate: request pages until the last one on the page matches an already-requested instance.
- Paginated + velocity_estimate: estimate how many pages to request from the rate at which items accrue. For example, say a user acquires on average 4.1 followers/day and it has been 80 days since the last scrape. With 100 followers/req you will want to request ceil( 4.1 * 80 / 100 ) = 4 pages.
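Here is a sketch of the paginated + limit strategy referenced above. fetch_page is a hypothetical client call, and the loop shape follows the Twitter search description.

    # Walk backwards from the newest item until results overlap the previous
    # scrape. fetch_page is assumed to return an array of {id: ...} hashes in
    # strict reverse-id order, as the Twitter search example describes.
    def scrape_back_to(prev_max_id, per_page: 100)
      max_id = nil                         # no upper limit on the first request
      seen   = []
      loop do
        items = fetch_page(max_id: max_id, per_page: per_page)
        break if items.empty?
        seen.concat(items)
        lowest = items.last[:id]           # last item on the page has the lowest id
        break if items.size < per_page || lowest <= prev_max_id   # session complete
        max_id = lowest - 1                # hold the window and page further back
      end
      seen
    end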
Rescheduling
Want to time the next scrape so that it yields a couple of pages, or at least a mostly-full page. Need to track a rate (num_items / timespan), clamped to min_reschedule / max_reschedule bounds.
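For instance (a sketch; the one-full-page target, constants and names are assumptions):

    # Rate-based rescheduling: aim the next scrape at roughly one full page of
    # new items, clamped to sane bounds.
    PER_PAGE       = 100
    MIN_RESCHEDULE = 60 * 10           # no sooner than 10 minutes
    MAX_RESCHEDULE = 60 * 60 * 24      # no later than a day

    def next_scrape_delay(num_items, timespan)
      rate  = num_items.to_f / timespan                      # items per second recently
      delay = rate.zero? ? MAX_RESCHEDULE : PER_PAGE / rate  # seconds until ~one full page
      delay.clamp(MIN_RESCHEDULE, MAX_RESCHEDULE)
    end

    next_scrape_delay(410, 86_400)   # => ~21_000 seconds: check back in about six hours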
More info
There are many useful examples in the examples/ directory.
Credits
Monkeyshines was written by Philip (flip) Kromer (flip@infochimps.org / @mrflip) for the infochimps project
Help!
Send monkeyshines questions to the Infinite Monkeywrench mailing list