Crabfarm Toolbelt
This gem facilitates the creation of new crabfarm crawlers.
Building your first crawler
One of our biggest advantages comes from having a structured development process based on TDD.
1. Install crabfarm and create the crawler application
Install the crabfarm gem:
$ gem install crabfarm
Install crabtrap (a Node.js-based recording proxy server used in tests)
$ npm install crabtrap -g
Install phantomjs (only if you plan to use it as your browser)
$ npm install phantomjs -g
Generate the application
$ crabfarm g app my_crawler
Run bundler and rspec to check everything is in place
$ cd my_crawler
$ bundle
$ rspec
2. Record a memento
You start developing a crawler by recording a memento. A memento is a piece of the web that gets stored in a single file and is used to test the crawlers without loading any remote resources.
$ crabfarm r memento my_memento
This will open your web browser. Now you should pretend to be the crawler and access the pages and perform the actions you expect your crawler to perform. For this example, go to www.btc-e.com and press the LTC/BTC market button. Wait for the page to load completely and then just close the browser; your new memento should be available at /spec/mementos/my_memento.json.gz.
3. Generate a navigator
Navigators are like your controllers: they receive some input parameters, navigate and interact with one or more web resources, and then generate some useful output.
We are going to build a btc-e.com crawler to extract the last price for a given market:
$ crabfarm g navigator BtcPrice -u www.btc-e.com
This should generate a navigator source file and a corresponding spec (this will also generate a reducer, more on that later). You can see we passed the target url using the -u option in the generator; this is optional.
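If you open the generated /app/navigators/btc_price.rb, it should look roughly like the sketch below; the exact base class name is an assumption here (Crabfarm::BaseNavigator, mirroring Crabfarm::BaseReducer used later):
class BtcPrice < Crabfarm::BaseNavigator
  def run
    # load the url passed with the -u option
    browser.goto 'www.btc-e.com'
    # run the default reducer over the loaded document
    reduce_with_defaults
  end
end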
It's time to take a look at the generated spec at /spec/navigators/btc_price_spec.rb and add some tests. Let's add an example to test that the navigator reached the correct page:
it "should navigate to correct market page", navigating: 'my_memento' do
navigate market: 'LTC/BTC'
expect(browser.li(class: 'pairs-selected').text.lines.first.strip).to eq('LTC/BTC')
end
Let's go line by line:
it "should navigate to correct market page", navigating: 'my_memento' do
By adding the navigating: 'my_memento' metadata, we are telling the example to run the crawler over the memento recorded in step 2.
navigate market: 'LTC/BTC'
Calling navigate executes the navigator; every keyed argument is passed to the navigator.
expect(browser.li(class: 'pairs-selected').text.lines.first.strip).to eq('LTC/BTC')
The browser property exposes the browser session used by the navigator; it can be used to check the browser status right after the navigator finishes.
We could also add a test to check that the navigator output has the proper structure.
it "should provide the last, high and low prices", navigating: 'my_memento' do
expect(state.document).to have_key :last
expect(state.document).to have_key :high
expect(state.document).to have_key :low
end
The main difference here is the use of the state method. It contains the crawling session state AFTER the navigator is called. If state is called before any call to navigate, then navigate is automatically called by state.
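In other words, the two examples below should behave the same (a sketch; we assume the implicit call triggered by state runs navigate with no parameters):
it "reduces the document", navigating: 'my_memento' do
  navigate
  expect(state.document).to have_key :last
end

it "reduces the document", navigating: 'my_memento' do
  # no explicit call: state triggers navigate before exposing the session state
  expect(state.document).to have_key :last
end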
Let's move to the /app/navigators/btc_price.rb file now; that's where the navigator code is located. As you can see there is already some code there: just a call to browser.goto to load the requested url and another to the reduce_with_defaults method that will run the default reducer. Let's add some additional navigation logic to select the required market.
def run
  browser.goto 'www.btc-e.com'
  if params[:market]
    browser.search('ul.pairs li').find { |li|
      li.text.include? params[:market]
    }.search('a').click
  end
  reduce_with_defaults
end
This is mainly pincers on webdriver code. You can access the current browser session using the browser property. You should be able to call rspec now and get the first example passing.
TIP: There is a very nice tool to help you with HTML CSS selectors called Selector Gadget.
4. Code the reducer
During the navigator generation, a reducer with the same name was generated too. The reducer is responsible for extracting data from the document being crawled. The most common use case is having one reducer per navigator, but in some cases more than one reducer may be needed per navigator, so a reducer generator is included as well.
As with the navigator, you start developing the reducer by generating a document snapshot. For HTML reducers, a snapshot is just a portion of HTML. A snapshot can be generated manually but we recommend using the snapshot recorder command.
The snapshot recorder uses an already coded navigator spec to capture the HTML passed by the navigator to the reducer. To generate a snapshot, call:
$ crabfarm r snapshot BtcPrice
The command above tells crabfarm to extract snapshots from the BtcPrice navigator using the last BtcPrice navigator spec.
Crabfarm will ask you to give the snapshot a name; call it my_snapshot. Notice it is stored in /spec/snapshots/my_snapshot.html.
Now that you have the snapshot, let's write some reducer specs. Go to /spec/reducers/btc_price_reducer_spec.rb and add the following example:
it "should extract low, high and last values", reducing: 'my_snapshot' do
# the tested values depend on the ltc value at the time
# the snapshot/memento is recorded.
expect(reducer.low).to eq 0.0061
expect(reducer.high).to eq 0.0064
expect(reducer.last).to eq 0.0061
end
Notice that the structure is very similar to a navigator spec; this time use the reducing: 'my_snapshot' option to select the snapshot to reduce, and the reducer property to refer to the reducer AFTER processing the given snapshot.
The last step is writing the reducer code; parsing code goes inside the run method. By default the reducer uses pincers on nokogiri for parsing HTML.
class BtcPriceReducer < Crabfarm::BaseReducer
  has_float :last, greater_or_equal_to: 0.0
  has_float :high, greater_or_equal_to: 0.0
  has_float :low, greater_or_equal_to: 0.0

  def run
    self.last = search('.orderStats:nth-child(1) strong').text
    self.low = search '.orderStats:nth-child(2) strong'
    self.high = search '.orderStats:nth-child(3) strong'
  end
end
Chunk by chunk:
has_float :last, greater_or_equal_to: 0.0
has_float :high, greater_or_equal_to: 0.0
has_float :low, greater_or_equal_to: 0.0
The reducer allows you to define fields that take care of the parsing and validation of text values for you. Also, declared fields help keep things DRY, since they are included in reducer.to_json.
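For instance, with the values used in the spec above, reducer.to_json should produce something along these lines (key order and formatting may differ):
{ "last": 0.0061, "high": 0.0064, "low": 0.0061 }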
self.last = search('.orderStats:nth-child(1) strong').text
If you dig a little deeper, you will see that last is being assigned something like "0.0061 BTC". The assertion framework is smart enough to extract just the floating point number (since we declared last as a float) and to fail if no number can be extracted from the string. search is just a pincers method; the reducer exposes every parser method.
self.low = search '.orderStats:nth-child(2) strong'
The only difference between this line and the previous one is that it shows it is not necessary to call text every time: the field setter detects if the passed value provides a text method and calls it.
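To illustrate both points, here is a small sketch (values taken from the spec above; the exact failure behavior for unparsable strings is up to the framework):
self.low = search('.orderStats:nth-child(2) strong')        # setter calls .text for us
self.low = search('.orderStats:nth-child(2) strong').text   # explicit .text, same result
self.low = "0.0061 BTC"                                     # the float 0.0061 is extracted from the string
self.low = "n/a"                                            # fails: no number can be extracted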
And that's all! Run your specs and everything should pass.
Trying the crawler in the console
Run the crabfarm console from inside the crawler's root:
$ crabfarm c
Call a navigator with some parameters; let's get the LTC/USD value using the BtcPrice navigator we built in the example above.
nav :btc_price, market: 'LTC/USD'
You can make changes to the crawler classes and reload the code in the console by calling reload!.
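A typical console session might look like the following sketch (prompts and output omitted):
nav :btc_price, market: 'LTC/USD'
# ... edit app/navigators/btc_price.rb ...
reload!
nav :btc_price, market: 'LTC/USD'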
You can also extract snapshots in the console:
snap :btc_price, market: 'LTC/USD'
Integrating the crawler into your application
Depending on your app's language, the following client libraries are available:
- Ruby/Rails - cangrejo gem: The cangrejo gem has support for spawning your crawlers locally or in a crabfarm grid (crabfarm.io).
If the language you are using is not listed here, you can submit an issue or, better yet, an implementation.
For more information on how to create a new client library refer to the Crabfarm client developer guide and the cangrejo source.
About the Crabfarm.io service
The best way to run your crawlers is on the crabfarm.io grid, which also provides monitoring and alert notifications for your crawlers. For more information visit www.crabfarm.io.