Project: spiderman - The Ruby Toolbox

your friendly neighborhood web crawler

Spiderman is a Ruby gem for crawling and processing web pages.

Installation

Add this line to your application's Gemfile:

gem 'spiderman'

And then execute:

$ bundle install

Or install it yourself as:

$ gem install spiderman

Usage

class HackerNewsCrawler
  include Spiderman

  # Start at the front page and queue each story link for processing.
  crawl "https://news.ycombinator.com/" do |response|
    response.css('a.storylink').each do |a|
      process! a["href"], :story
    end
  end

  # Runs once for every URL queued with the :story handler.
  process :story do |response|
    logger.info "#{response.uri} #{response.css('title').text}"
    save_page(response)
  end

  def save_page(page)
    # logic here for saving the page
  end
end

Run the crawler:

HackerNewsCrawler.crawl!
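
The save_page method above is left as a placeholder. As a minimal sketch of one way to fill it in, the helper below writes a page's HTML to disk under a name derived from its URL. The helper name save_page_to_disk and its dir: parameter are hypothetical and not part of Spiderman; it uses only the Ruby standard library.

```ruby
require "digest"
require "fileutils"
require "tmpdir"
require "uri"

# Hypothetical helper (not part of Spiderman): writes HTML to a file whose
# name is derived from the page URL, and returns the path that was written.
def save_page_to_disk(url, html, dir:)
  uri = URI.parse(url.to_s)
  # Hash the full URL so distinct pages on the same host get distinct files.
  digest = Digest::SHA256.hexdigest(uri.to_s)[0, 12]
  path = File.join(dir, "#{uri.host}-#{digest}.html")
  FileUtils.mkdir_p(dir)
  File.write(path, html)
  path
end

# Usage: store one page in a temporary directory.
dir = Dir.mktmpdir("spiderman-pages")
saved_path = save_page_to_disk("https://example.com/story/1", "<title>Example</title>", dir: dir)
puts saved_path
```

Inside the crawler, save_page could delegate to a helper like this, passing response.uri and the page's HTML. How the raw HTML is read from the response object is not shown in this README, so that part is left as an assumption.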

ActiveJob

Spiderman works with ActiveJob out of the box. If your crawler class inherits from ActiveJob::Base, requests are made in your background worker, and each request runs as a separate job.

class MyCrawler < ActiveJob::Base
  include Spiderman
  queue_as :crawler

  crawl "https://example.com" do |response|
    response.css('a').each {|a| process! a["href"], :link }
  end

  process :link do |response|
    logger.info "Processing #{response.uri}"
  end
end
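
To illustrate what "each request will run as a separate job" means in practice, here is a toy in-memory model of that fan-out. It is not Spiderman's or ActiveJob's implementation: the pending list, the links hash, and the handler names are all hypothetical stand-ins, and no HTTP requests are made.

```ruby
# Hard-coded link graph standing in for real HTTP responses (hypothetical).
links = {
  "https://example.com" => ["https://example.com/a", "https://example.com/b"]
}

# Each entry models one background job: a single URL plus the handler to run.
pending = [["https://example.com", :crawl]]
performed = []

until pending.empty?
  url, handler = pending.shift

  # The crawl job fans out: every discovered link becomes its own job
  # instead of being fetched inline.
  if handler == :crawl
    links.fetch(url, []).each { |link| pending << [link, :link] }
  end

  performed << [url, handler]
end

performed.each { |url, handler| puts "#{handler}: #{url}" }
```

With a real queue backend, each of those entries would be picked up by a worker independently, so a slow or failing page would affect only its own job.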

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/bkeepers/spiderman.

License

The gem is available as open source under the terms of the MIT License.
