Cangrejo

TODO: Write a gem description

Installation

Add this line to your application's Gemfile:

gem 'cangrejo'

And then execute:

$ bundle

Or install it yourself as:

$ gem install cangrejo

Usage

Configuration

Cangrejo.configure do |config|
  config.set_crawler_cache_path 'tmp' # make sure this path exists!
  config.set_temp_path '/tmp/crawler_cache' # make sure this path exists!

  if Rails.env.development?
    # Override crawler configurations, more on this later
    config.set_crawler_setup_for 'platanus/some-crawler', {
      path: '/path/to/crawler',
      git_remote: 'git://crawler/repo',
      git_commit: 'ThEcr4wl3rc0m1ty0un33d'
    }
  end
end

Rails integration

When using cangrejo inside a rails app use the following base configuration inside an initializer (railtie is comming soon!):

Cangrejo.configure do |config|
  config.set_temp_path Rails.root.join '/tmp'
end

About crawler configurations

There are three ways to run a crawler:

By default, crawlers are identified by their unique uri (like platanus/demo)and ran in the Crabfarm.io cloud. To do so you will need to create an account and register the crawler repo.

Crawlers can also be run from a local repository, just map the crawler uri to a path in the initializer:

config.set_crawler_setup_for 'org/repo', {
  path: '/path/to/crawler'
}

Crawlers can also be ran from a git remote, the crawler is downloaded to the path specified using config.set_crawler_cache_path and then ran locally:

config.set_crawler_setup_for 'org/repo', {
  git_remote: 'git://crawler/repo',
  git_commit: 'ThEcr4wl3rc0m1ty0un33d'
}

Sessions

To communicate with crawlers you use crawling sessions. event though you can manually build and start a session, it is recommended to use the Cangrejo.connect method to handle session lifecycle for you:

Cangrejo.connect 'org/repo' do |session|
  session.navigate(:front_page, param1: 'hello')
end

You can also call connect with no crawler name, if so, connect will use the crawler that was registered first in the configuration.

Once inside a connect block, you can change the session state using navigate

session.navigate(:front_page, param1: 'hello')

Data extracted by last navigation is available at doc property as an open struct

session.doc.title
session.doc.price

You can also create, start and stop sessions manually;

session = Cangrejo::Session.new 'org/repo'
session.navigate(:front_page, param1: 'hello')
session.relase

Don't forget to release the session when you are done!! Once released the session becomes unusable.

Contributing

Fork it ( https://github.com/[my-github-username]/cangrejo/fork )
Create your feature branch (git checkout -b my-new-feature)
Commit your changes (git commit -am 'Add some feature')
Push to the branch (git push origin my-new-feature)
Create a new Pull Request

cangrejo

Development

Runtime