Clownfish
Helper for Anemone. Makes common crawls easier to repeat.
Installation
Add this line to your application's Gemfile:
gem 'clownfish'
And then execute:
$ bundle
Or install it yourself as:
$ gem install clownfish
Usage
require 'clownfish'
clownfish = MyClownfish.new
Anemone.crawl_with_clownfish(start_url, clownfish)
# query clownfish for data from crawl
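The usage snippet above does not define MyClownfish. As a minimal sketch of what such a class might look like, the hypothetical StatusTally below implements only on_every_page and counts pages per HTTP status code. The class name and the counts reader are illustrative assumptions, not part of the gem; it relies only on Anemone::Page#code.

```ruby
# Hypothetical clownfish: tallies crawled pages by HTTP status code.
class StatusTally
  attr_reader :counts

  def initialize
    @counts = Hash.new(0) # unseen status codes start at zero
  end

  # Invoked once per page during the crawl; page.code is the HTTP status.
  def on_every_page(page)
    @counts[page.code] += 1
  end
end

# With the gems loaded, the crawl would follow the usage shown above:
#   tally = StatusTally.new
#   Anemone.crawl_with_clownfish(start_url, tally)
#   tally.counts   # query the clownfish for data from the crawl
```

Keeping the collected data on the object itself is what makes it possible to query the clownfish after the crawl finishes.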
Clownfish Spec
A clownfish is an object that has one or more of the following instance methods:
Reference: Anemone RDocs
anemone_options
Returns a Hash mapping Symbol keys to option values. See Anemone::Core::DEFAULT_OPTS for available options. This is forwarded as the second argument to Anemone.crawl (rdoc). Invoked once before crawl.
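As an illustration, a clownfish that wants a shallow, polite crawl might return something like the following. The class name PoliteClownfish and the chosen values are assumptions for the sketch; the keys are taken from Anemone::Core::DEFAULT_OPTS.

```ruby
# Hypothetical clownfish that only customizes crawl options.
class PoliteClownfish
  # Returned Hash is forwarded to Anemone.crawl as its options argument.
  def anemone_options
    {
      depth_limit: 2,        # do not follow links more than two hops from the start
      delay: 1,              # seconds to wait between requests
      obey_robots_txt: true  # honor the site's robots.txt
    }
  end
end
```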
skip_links_like
Returns a single Regexp or an Array of Regexp. URLs matching any of these will not be crawled. Invoked once before crawl.
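A sketch of a clownfish that skips static assets and a logout link could look like this. The class name and both patterns are illustrative assumptions; adjust them to the site being crawled.

```ruby
# Hypothetical clownfish that excludes some links from the crawl.
class NoAssetsClownfish
  # Any link matching one of these patterns is not crawled.
  def skip_links_like
    [
      /\.(?:png|jpe?g|gif|css|js|pdf)\z/i, # static assets, matched by extension
      %r{/logout\b}                        # avoid ending an authenticated session
    ]
  end
end
```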
on_every_page
Takes one argument, an Anemone::Page (rdoc). Invoked once per page during crawl.
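For example, on_every_page can be used to flag slow pages as they are fetched. The sketch below is hypothetical: the class name, the slow_pages reader, and the threshold are assumptions, and it treats Anemone::Page#response_time as a number of milliseconds that may be nil when no response was received.

```ruby
# Hypothetical clownfish that remembers pages slower than a threshold.
class SlowPageClownfish
  attr_reader :slow_pages

  def initialize(threshold_ms = 500)
    @threshold_ms = threshold_ms
    @slow_pages = {} # url string => response time
  end

  # Invoked once per page; records the page only when it exceeds the threshold.
  def on_every_page(page)
    time = page.response_time
    @slow_pages[page.url.to_s] = time if time && time > @threshold_ms
  end
end
```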
focus_crawl
Takes one argument, an Anemone::Page (rdoc). Returns the links (Array of URI) on that page that should be crawled. See Anemone::Page#links for a starting point. Invoked once per page during crawl.
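A minimal sketch of focus_crawl, starting from Anemone::Page#links and keeping only links under one section of the site. The class name and the /docs/ prefix are assumptions made for illustration.

```ruby
# Hypothetical clownfish that restricts the crawl to one path prefix.
class DocsOnlyClownfish
  # Returns the subset of the page's links (URI objects) to follow.
  def focus_crawl(page)
    page.links.select { |uri| uri.path.start_with?('/docs/') }
  end
end
```

Returning an empty Array for a page means none of its links are followed from that page.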
after_crawl
Takes one argument, an Anemone::PageStore (rdoc). Invoked once after crawl is done.
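Because after_crawl sees the whole page store, it suits summaries that need every page at once. The following is a sketch under assumptions: the class name and broken_urls reader are hypothetical, and it assumes the page store yields each stored page from each_value, as a Hash does.

```ruby
# Hypothetical clownfish that reports pages answering with an error status.
class BrokenPageReport
  attr_reader :broken_urls

  def initialize
    @broken_urls = []
  end

  # Invoked once after the crawl with the store of all crawled pages.
  def after_crawl(page_store)
    page_store.each_value do |page|
      # A missing status code (nil) converts to 0 and is not reported.
      @broken_urls << page.url.to_s if page.code.to_i >= 400
    end
  end
end
```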
What's Included
See wiki for examples.
Clownfish::LinksByPage
Lists every page that contains links, along with each link and the status code returned when following it.
Clownfish::ResponseTimes
Records every URL and its response time.
Clownfish::Count
Counts pages.
Contributing
- Fork it
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create new Pull Request