Pincers is a jQuery-inspired Ruby DSL on top of webdriver or plain net/http. In other words: an easy-to-use set of functions that lets you scrape or automate navigation on a JavaScript-intensive webpage.
Example
require 'pincers'
Pincers.for_webdriver :chrome do |pincers|
pincers.goto "https://github.com"
pincers.search("input[name=q]").set("pincers")
pincers.search("form[action='/search']").submit
pincers_repo = pincers.search(".repo-list-item").first
name = pincers_repo.search("h3 a").text
stars = pincers_repo.search("a[href$=stargazers]").text
puts "The repo #{name} has #{stars} stars"
end
Great! But I already know ( selenium | watir | mechanize | nokogiri ) ... why do I need this?
The jQuery interface solves DOM element traversal in a very practical way that most programmers feel comfortable with. When using any of the options listed above, we found ourselves missing jQuery's ease of use.
Also, by harnessing the power of nokogiri, pincers lets you extract complex data like tables or lists in a fraction of the time required by using pure webdriver. Take a look at Read-only Results.
Features:
- Full support for jQuery selectors.
- Simple interface: as in jQuery, you only interact with one type of pincers object.
- Sensible waiting conventions, built for dynamic webpages.
- Support for both webdriver and net/http + nokogiri backends using the same DSL.
- Ability to switch to nokogiri for parsing (keeping the same DSL) for heavy duty data extraction. Take a look at Read-only Results.
- Ability to perform arbitrary HTTP requests impersonating the current browser (cookies and headers).
Install
To install just run:
gem install pincers
Or add it to your Gemfile and run bundle install:
gem 'pincers'
Basic usage
Create a new pincers root context using your favorite browser:
Pincers.for_webdriver :chrome do |pincers|
# do something, driver object will be discarded at the end of the block.
end
You can also pass a webdriver object, or another symbol like :firefox or :phantomjs.
NOTE: You can also use the pincers DSL on top of our non-webdriver backend.
Cleaning up
It is possible to use the Pincers.for_webdriver factory method without a block. In that case you need to release the associated resources manually by calling close when you are done:
pincers = Pincers.for_webdriver :chrome
# do something
pincers.close # release webdriver resources
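When closing manually, an exception raised between creation and close would leave the resources unreleased. A common Ruby way to guard against that is an ensure clause. The sketch below illustrates the pattern with a hypothetical FakeSession class and with_session helper (neither is part of pincers); the same shape should apply to any object that exposes close.

```ruby
# Hypothetical stand-in for a pincers root context; NOT part of the pincers API.
class FakeSession
  attr_reader :closed

  def initialize
    @closed = false
  end

  def close
    @closed = true
  end
end

# Hypothetical helper: runs the block and always closes the session,
# even when the block raises.
def with_session(session)
  yield session
ensure
  session.close
end

session = FakeSession.new
begin
  with_session(session) { |_s| raise 'scraping failed' }
rescue RuntimeError => e
  puts "rescued: #{e.message}"
end
puts session.closed # the session was released despite the error
```

This is essentially what the block form of the factory method saves you from writing by hand.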
Basic Navigation
The first thing to do is to navigate to some url:
pincers.goto 'www.crabfarm.io'
Searching
If you have used jQuery before, all this will sound quite familiar to you.
Consider the following HTML structure for the examples below:
<body>
<div id="first-parent" class="my-class">
<p id="first-child" class="child-class other-class">Foo</p>
<p id="second-child" class="child-class">Bar</p>
</div>
<div id="second-parent" class="my-class">
<p id="third-child" class="child-class">Imateapot</p>
</div>
<p id="fourth-child" class="child-class">Imateapot</p>
</body>
Most element traversing in pincers is done via jQuery extended selectors using the search method:
# Select the second parent by jumping through hoops:
pincers.search(".my-class:has(p:contains('Imateapo'))")
This call will return another context containing all elements matching the given selector. The context object is an enumerable that yields single-element contexts, so you can use pincers methods on separate elements too:
pincers.search('.my-class').map do |div|
div.search('.child-class') # div is also a context!
end
Pincers contexts also have first and last methods that return the first and last element, each wrapped in a separate context.
pincers.search('.my-class').first # first is also a context!
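To make the "everything is a context" idea concrete, here is a miniature sketch in plain Ruby. TinyContext is a hypothetical class written for this illustration only; it is not pincers code and says nothing about how pincers is implemented. It wraps plain strings instead of DOM elements and shows how an enumerable can yield single-element instances of its own type.

```ruby
# Hypothetical miniature of the "one context type" idea; NOT pincers code.
class TinyContext
  include Enumerable

  def initialize(elements)
    @elements = elements
  end

  # Yields every element wrapped in its own single-element context.
  def each
    return enum_for(:each) unless block_given?

    @elements.each { |el| yield TinyContext.new([el]) }
  end

  # Enumerable supplies first; last has to be written by hand.
  def last
    TinyContext.new(@elements.last(1))
  end

  # Concatenated text of all contained elements.
  def text
    @elements.join
  end
end

ctx = TinyContext.new(%w[Foo Bar])
puts ctx.map(&:text).inspect # each yielded item is itself a context
puts ctx.first.text
puts ctx.last.text
puts ctx.text
```

Because every result has the same type, the same methods can be chained at any depth without checking whether you hold a collection or a single element.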
Searching over a context will search among the children of all contained elements:
parents = pincers.search('.my-class')
parents.search('.child-class') # will select all children except fourth-child
If you don't feel comfortable using CSS, pincers also provides a more idiomatic search method that lets you search by tag, contents, class or any attribute:
pincers.search(tag: 'p', class: 'some-class other-class')
pincers.search(tag: 'input', value: 'email@crabfarm.io')
pincers.search(content: 'Title')
Context properties
Retrieve the concatenated text contents for all matched elements.
pincers.search('#first-parent').search('.child-class').text # = 'FooBar'
Retrieve the concatenated html contents for all matched elements.
pincers.search('.child-class').to_html # will dump all p elements in our example.
First element properties
There are several methods that when called on a context will only apply to the first element contained by that context:
Retrieve an attribute from the first matching element:
pincers.search('.child-class')[:id] # = 'first-child'
pincers.search('.child-class').attribute('id') # same as above
Retrieve the tag name from an element:
pincers.search('.child-class').tag # = 'p'
Retrieve an array with all classes from the first matching element:
pincers.search('.child-class').classes # = ['child-class', 'other-class']
Element interaction
The following methods change the element or document state and are only available in some backends. Like the First element properties, when called, these methods only affect the first element in the context.
To set the text on a text input
pincers.search('input#some-input').set 'sometext'
Choose a select box option by its label
pincers.search('select#some-select').set 'Some Label'
Choose a select box option by the option text
pincers.search('select#some-select').set 'Option text'
Or by the option value
pincers.search('select#some-select').set by_value: 'option-value'
Change a checkbox or radio button state
pincers.search('input#some-checkbox').set # check
pincers.search('input#some-checkbox').set false # uncheck
Click on a button (or any other element)
pincers.search('a#some-link').click
Submit a form directly
pincers.css('form').submit
Hover over an element
pincers.search('div#some-menu').hover
Root properties
The root context has some special methods to access document properties.
To get the document title
pincers.title
To get the document url
pincers.url
pincers.uri # same as url but returns a URI object
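A URI object is convenient when you need individual parts of the current location. The sketch below assumes that uri returns an instance from Ruby's standard uri library (the exact class is an assumption, the text above only says "URI object"). A parsed literal stands in for the value pincers would return, and the URL itself is just an example.

```ruby
require 'uri'

# Stand-in for the value returned by pincers.uri; the URL is only an example.
uri = URI.parse('https://github.com/search?q=pincers&type=repositories')

puts uri.host
puts uri.path

# Decode the query string into a hash of parameters.
params = URI.decode_www_form(uri.query).to_h
puts params['q']
```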
To get the document driver itself (webdriver driver or nokogiri root node)
pincers.document
Advanced topics
Read-only results
Using webdriver to extract data that requires iterating over big lists or lots of table rows can be painfully slow. To process big datasets, pincers provides the readonly method, which transforms the webdriver-backed result into a nokogiri-backed one.
list_contents = pincers.search('#long-list').readonly do |list|
# operating over list is very fast
list.search('li').map &:text
end
Chenso backend
The chenso backend provides a performant way of navigating simple pages (similar to mechanize). It uses net/http + nokogiri instead of webdriver and provides support for most pincers features.
Chenso doesn't run JavaScript, so waiting is disabled on chenso-backed pincers objects.
To use the chenso backend, call the for_chenso factory method to generate a new pincers context:
Pincers.for_chenso do |pincers|
# same DSL as the webdriver backed context.
end
Chenso also supports client SSL certificates. To use one, pass the ssl_cert and ssl_key options:
Pincers.for_chenso(
ssl_cert: OpenSSL::X509::Certificate.new(File.read('./client.cert.pem')),
ssl_key: OpenSSL::PKey::RSA.new(File.read('./client.key.pem'))
)
Navigating frames
Pincers operations can only target one frame at a time. By default, the top frame is selected when the location changes. To switch to a different frame use the goto method with the frame: option:
pincers.goto 'http://www.someurlwithfram.es'
pincers.goto frame: pincers.search('#my-frame')
pincers.text # this will return the '#my-frame' frame contents
Tip: You can also use a selector directly
pincers.goto frame: '#my-frame'
To navigate back to the top frame after working on a child frame use the special identifier :top:
pincers.goto frame: :top
Waiting for a condition
In JavaScript-enabled backends like webdriver, even though pincers will do its best to do most of the waiting, it is sometimes necessary to wait for a special condition before interacting with an element:
pincers.search('#my-async-stuff').wait(:enabled)
It's possible to wait on the following states:
- :present: wait for element to be visible
- :actionable: wait for element to be able to receive input
- :enabled: wait for input to be enabled
- Any valid DOM property, like :disabled or :value
It's also possible to wait for custom conditions by passing a block. The process will wait until the block stops returning false (only false, not nil).
pincers.search('#my-async-stuff').wait { |r| r.count > 10 }
When using a custom condition, you can also wait for the block not to raise a navigation error.
pincers.search('#async-button').wait { |r| r.click } # wait until click succeeds
By default, the waiting process times out in 10 seconds. This can be changed by setting the Pincers.config.wait_timeout property or by calling the search function with the timeout: option:
pincers.search('#my-async-stuff').wait(:enabled, timeout: 5.0)
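The "only false, not nil" rule and the timeout are easier to see in a small polling loop. The following is a sketch of those semantics in plain Ruby, not the pincers implementation: poll_until, PollTimeout and the interval parameter are hypothetical names introduced for this example, and how pincers actually polls is not specified here.

```ruby
# Hypothetical helper illustrating the semantics; NOT the pincers implementation.
class PollTimeout < StandardError; end

def poll_until(timeout: 10.0, interval: 0.05)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    # Only a literal false keeps the loop going; nil counts as "done".
    return result unless result == false

    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    raise PollTimeout, "condition not met after #{timeout}s" if now >= deadline

    sleep interval
  end
end

attempts = 0
value = poll_until(timeout: 1.0) do
  attempts += 1
  attempts < 3 ? false : nil # nil ends the wait on the third attempt
end
puts attempts
```

Under these assumptions, a block that ends with a statement returning nil (such as puts) finishes the wait immediately, so a custom condition should return an explicit false while it is not yet satisfied.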
Downloading a resource
You can download resources from the currently loaded document using the download method on a link, image or any other element that has a src attribute. JavaScript-triggered downloads are not supported by this method.
dl = pincers.search('#a-download-link').download
dl.content # the resource data as string
dl.content_type # the resource content type
dl.save('/some-file.txt') # store resource in file
Driver options
Pincers tries its best to configure the webdriver bridge in a way that fits most use cases. If you need to further configure the driver for a special situation, the following options are available when using the for_webdriver method:
- :proxy: either a URL like www.myproxy.com:40 or a selenium Proxy object.
- :wait_timeout: default wait timeout for element lookup and any call to context.wait
- :page_timeout: page load timeout, in ms, defaults to 60 seconds.
- any valid webdriver configuration key
It's also possible to call for_webdriver with an already created webdriver object:
pincers = Pincers.for_webdriver some_driver_object
If this creation method is used, then only the page_timeout and wait_timeout options are available.
Accessing the underlying backend objects
Sometimes (hopefully not too often) you will need to access the original webdriver or nokogiri api. Pincers provides a couple of methods for you to do so.
To get the document handler itself call document on the root context.
pincers.document # webdriver driver or nokogiri root node
To get the contained nodes on a pincers context use elements:
pincers.search('foo').elements # array of webdriver elements or nokogiri nodes.
Contributing
- Fork it
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create new Pull Request
Credits
Thank you contributors!
Pincers is part of the Crabfarm Framework.
License
Pincers is © 2015 Platanus, spa. It is free software and may be redistributed under the MIT License terms specified in the LICENSE file.