Clownfish

A helper for Anemone, the Ruby web crawler. It makes common crawls easier to repeat.

Installation

Add this line to your application's Gemfile:

gem 'clownfish'

And then execute:

$ bundle

Or install it yourself as:

$ gem install clownfish

Usage

require 'clownfish'

clownfish = MyClownfish.new

Anemone.crawl_with_clownfish(start_url, clownfish)

# query clownfish for data from crawl
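As a minimal sketch of what `MyClownfish` above might look like: the class below is hypothetical (it is not shipped with the gem) and implements only one of the hooks described in the spec that follows. It assumes the page object passed in responds to `url`, as Anemone::Page does.

```ruby
# Hypothetical clownfish: records the URL of every page the crawl visits.
class MyClownfish
  attr_reader :urls

  def initialize
    @urls = []
  end

  # Invoked once per crawled page; `page` is expected to respond to #url.
  def on_every_page(page)
    @urls << page.url.to_s
  end
end
```

After the crawl, `clownfish.urls` would hold whatever the hook collected, which is the "query clownfish for data" step in the usage snippet.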

Clownfish Spec

A clownfish is any object that implements one or more of the following instance methods (reference: Anemone RDocs):

anemone_options

Returns a Hash mapping Symbol keys to option values. See Anemone::Core::DEFAULT_OPTS for the available options. The Hash is forwarded as the second argument to Anemone.crawl (rdoc). Invoked once before the crawl.
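For illustration, a clownfish that asks for a slow, shallow crawl could look like the sketch below. The class name is hypothetical; the option keys (`:threads`, `:delay`, `:obey_robots_txt`, `:depth_limit`) are taken from Anemone's default options, and the values are arbitrary examples.

```ruby
# Hypothetical clownfish that only configures the crawl.
class PoliteClownfish
  # Forwarded to Anemone.crawl as its options Hash.
  def anemone_options
    {
      threads: 1,            # fetch pages one at a time
      delay: 1,              # wait one second between requests
      obey_robots_txt: true, # honor robots.txt exclusions
      depth_limit: 3         # do not follow links more than 3 levels deep
    }
  end
end
```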

skip_links_like

Returns a single Regexp or an Array of Regexps. URLs matching any of these will not be crawled. Invoked once before the crawl.
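A sketch of this hook, using a hypothetical class and example patterns that skip static assets and a logout link. The patterns are written against the path part of the URL on the assumption that this is what they are matched against, as in Anemone's own `skip_links_like`.

```ruby
# Hypothetical clownfish that keeps the crawl away from assets and logout links.
class SkipAssetsClownfish
  def skip_links_like
    [
      /\.(?:png|jpe?g|gif|css|js)\z/i, # static assets
      %r{/logout}                      # session-ending links
    ]
  end
end
```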

on_every_page

Takes one argument, an Anemone::Page (rdoc). Invoked once per page during crawl.
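As a rough example, the hypothetical clownfish below tallies pages by HTTP status code. It assumes only that the page responds to `code`, which Anemone::Page provides.

```ruby
# Hypothetical clownfish that counts crawled pages per HTTP status code.
class StatusTallyClownfish
  attr_reader :tally

  def initialize
    @tally = Hash.new(0) # missing codes start at zero
  end

  # Invoked once per crawled page.
  def on_every_page(page)
    @tally[page.code] += 1
  end
end
```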

focus_crawl

Takes one argument, an Anemone::Page (rdoc). Returns the links (Array of URI) on that page that should be crawled. See Anemone::Page#links for a starting point. Invoked once per page during crawl.
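One possible use of this hook, sketched with a hypothetical class and an arbitrary `/docs/` prefix, is to restrict the crawl to one section of a site. It assumes `page.links` returns an Array of URI objects, as described above.

```ruby
require 'uri'

# Hypothetical clownfish that only follows links under /docs/.
class DocsOnlyClownfish
  # Returns the subset of the page's links that should be crawled next.
  def focus_crawl(page)
    page.links.select { |uri| uri.path.start_with?('/docs/') }
  end
end
```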

after_crawl

Takes one argument, an Anemone::PageStore (rdoc). Invoked once after crawl is done.
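A sketch of a post-crawl summary, again with a hypothetical class: it walks the stored pages and keeps the URLs that returned 404. It assumes the page store responds to `each_value` and yields page objects with `url` and `code`; a plain Hash behaves the same way, which makes the hook easy to try out without running a crawl.

```ruby
# Hypothetical clownfish that lists the URLs of pages that were not found.
class BrokenPagesClownfish
  attr_reader :broken

  # Invoked once with all stored pages after the crawl finishes.
  def after_crawl(page_store)
    @broken = []
    page_store.each_value do |page|
      @broken << page.url.to_s if page.code == 404
    end
  end
end
```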

What's Included

See wiki for examples.

Clownfish::LinksByPage

Lists every page that has links, along with those links and the status code returned when following each one.

Clownfish::ResponseTimes

Records every URL and its response time.

Clownfish::Count

Counts pages.

Contributing

1. Fork it
2. Create your feature branch (git checkout -b my-new-feature)
3. Commit your changes (git commit -am 'Add some feature')
4. Push to the branch (git push origin my-new-feature)
5. Create new Pull Request
