Klepto
A mean little DSL'd capybara (poltergeist) based web scraper that structures data into ActiveRecord or wherever(TM).
Features
- CSS or XPath Syntax
- Full JavaScript processing via PhantomJS / Poltergeist
- All the fun of capybara
- Scrape multiple pages with a single bot
- Pretty nifty DSL
- Test coverage!
Installing
You need at least PhantomJS 1.8.1. There are no other external dependencies (you don't need Qt, or a running X server, etc.)
Mac
- Homebrew: brew install phantomjs
- MacPorts: sudo port install phantomjs
- Manual install: Download this
Linux
Windows
- Download the precompiled binary for Windows
Manual compilation
Do this as a last resort if the binaries don't work for you. It will take quite a long time as it has to build WebKit.
- Download the source tarball
- Extract and cd in
./build.sh
(See also the PhantomJS building guide.)
Then put klepto in your Gemfile.
gem 'klepto', '>= 0.2.5'
Usage (All your content are belong to us)
Say you want a bunch of Bieb tweets! How is there not profit in that?
# Fetch a web site or multiple. Bot#new takes a *splat!
@bot = Klepto::Bot.new("https://twitter.com/justinbieber"){
# By default, it uses CSS selectors
name 'h1.fullname'
# If you love C# or you are over 40, XPath is an option!
username "//span[contains(concat(' ',normalize-space(@class),' '),' screen-name ')]", :syntax => :xpath
# By default Klepto uses the #text method, you can pass an :attr to use instead...
# or a block that will receive the Capybara Node or Result set.
tweet_ids 'li.stream-item', :match => :all, :attr => 'data-item-id'
# Want to match all the nodes for the selector? Pass :match => :all
links 'span.url a', :match => :all do |node|
node[:href]
end
# Nested structures? Let klepto know this is a resource
last_tweet 'li.stream-item', :as => :resource do
twitter_id do |node|
node['data-item-id']
end
content '.content p'
timestamp '._timestamp', :attr => 'data-time'
permalink '.time a', :attr => :href
end
# Multiple Nested structures? Let klepto know this is a collection of resources
# Does Bieber tweet too much? Maybe. Let's only get the new stuff kids crave.
tweets 'li.stream-item', :as => :collection, :limit => 10 do
twitter_id do |node|
node['data-item-id']
end
tweet '.content p', :css
timestamp '._timestamp', :attr => 'data-time'
permalink '.time a', :css, :attr => :href
end
# Set some headers, why not.
config.headers({
'Referer' => 'http://www.twitter.com'
})
# on_http_status can take a splat of statuses or ~statuses(4xx,5xx)
# you can also have multiple handlers on a status
# Note: Capybara automatically follows redirects, so 3xx statuses
# are never present. To watch for a redirect, pass :redirect as shown below.
config.on_http_status(:redirect){
puts "Something redirected..."
}
config.on_http_status(200){
puts "Expected this, NBD."
}
config.on_http_status('5xx','4xx'){
puts "HOLY CRAP!"
}
config.after(:get) do |page|
# This is fired after each HTTP GET. It receives a Capybara::Node
end
# If you want to do something with each resource, like stick it in AR
# go for it here...
config.after do |resource|
@user = User.new
@user.name = resource[:name]
@user.username = resource[:username]
@user.save
resource[:tweets].each do |tweet|
Tweet.create(tweet)
end
end #=> Profit!
}
# You can get an array of hashes(resources), so if you wanted to do something else
# you could do it here...
@bot.resources.each do |resource|
pp resource
end
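Since the resources come back as an array of hashes, plain Ruby is enough to post-process them. The sketch below does not call Klepto at all: `SAMPLE_RESOURCES` is a hand-written stand-in whose keys mirror the DSL fields above (the exact shape Klepto returns is an assumption), and `summarize` is a hypothetical helper.

```ruby
# Hand-written stand-in for @bot.resources. The keys mirror the DSL fields
# defined above; the exact shape Klepto returns is an assumption.
SAMPLE_RESOURCES = [
  {
    name: 'Example User',
    username: '@example',
    tweet_ids: %w[101 102],
    tweets: [
      { twitter_id: '101', tweet: 'first',  timestamp: '1360000000' },
      { twitter_id: '102', tweet: 'second', timestamp: '1360000600' }
    ]
  }
].freeze

# Hypothetical helper: reduce each resource to a small summary hash.
def summarize(resources)
  resources.map do |resource|
    tweets = resource.fetch(:tweets, [])
    # Timestamps are strings in this sample, so compare them as integers.
    newest = tweets.max_by { |t| t[:timestamp].to_i }
    {
      username: resource[:username],
      tweet_count: tweets.size,
      newest_tweet_id: newest && newest[:twitter_id]
    }
  end
end

summarize(SAMPLE_RESOURCES).each { |summary| puts summary.inspect }
```

With a real bot you would pass `@bot.resources` instead of the sample, provided the keys match what your DSL block defines.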
Got a string of HTML you don't need to crawl first?
@html = Capybara::Node::Simple.new(@html_string)
@structure = Klepto::Structure.build(@html){
# inside the build method, everything works the same as Bot.new
name 'h1.fullname'
username 'span.screen-name'
links 'span.url a', :match => :all do |node|
node[:href]
end
tweets 'li.stream-item', :as => :collection do
twitter_id do |node|
node['data-item-id']
end
tweet '.content p', :css
timestamp '._timestamp', :attr => 'data-time'
permalink '.time a', :css, :attr => :href
end
}
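If you are wondering how bare calls like `name 'h1.fullname'` can work inside a block, here is a rough sketch of the general Ruby technique (`instance_eval` plus `method_missing`). This is not Klepto's source: `FieldRecorder` is a hypothetical class that only records field definitions and never touches a page.

```ruby
# Minimal sketch of a block-based field DSL. This is NOT Klepto's source;
# FieldRecorder is hypothetical and only records what the block declares.
class FieldRecorder
  attr_reader :fields

  def initialize
    @fields = {}
  end

  # Evaluate the block with self set to a fresh recorder, so bare calls
  # such as `name 'h1.fullname'` land in method_missing below.
  def self.build(&block)
    recorder = new
    recorder.instance_eval(&block)
    recorder.fields
  end

  # Any unknown method call is stored as a field definition.
  def method_missing(field, selector = nil, options = {}, &block)
    @fields[field] = { selector: selector, options: options, block: block }
  end

  def respond_to_missing?(_name, _include_private = false)
    true
  end
end

DEFINITIONS = FieldRecorder.build do
  name 'h1.fullname'
  links 'span.url a', :match => :all do |node|
    node[:href]
  end
end

DEFINITIONS.each do |field, definition|
  puts "#{field}: #{definition[:selector]} #{definition[:options]}"
end
```

One catch with this approach: a field named after an existing method (`puts`, `format`, and friends) never reaches `method_missing`, so a real DSL has to handle those names some other way.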
Configuration Options
- config.headers - Hash; Sets request headers
- config.url - String; Set URL to structure
- config.abort_on_failure - Boolean(Default: true); Should structuring be aborted on 4xx or 5xx
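To make the 4xx/5xx wording concrete, here is a small sketch of how class patterns such as '4xx' could be matched against a numeric status. Both `status_matches?` and `failure_status?` are hypothetical helpers written for illustration; how Klepto matches statuses internally is not shown in this README.

```ruby
# Hypothetical helper, not part of Klepto: does an HTTP status code match
# an exact pattern (200, '404') or a class pattern ('4xx', '5xx')?
def status_matches?(pattern, code)
  text = pattern.to_s.downcase
  if text =~ /\A([1-5])xx\z/
    # Class pattern: compare the hundreds digit only.
    code / 100 == Regexp.last_match(1).to_i
  else
    text == code.to_s
  end
end

# Statuses that would count as a failure when abort_on_failure is true.
def failure_status?(code)
  %w[4xx 5xx].any? { |pattern| status_matches?(pattern, code) }
end

[200, 302, 404, 503].each do |code|
  puts "#{code}: failure=#{failure_status?(code)}"
end
```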
Callbacks & Processing
- before
- :get (browser, url)
- after
- :structure (Hash) - receives the structure from the page
- :get (browser, url) - called after each HTTP GET
- :abort (browser, hash(details)) - called after a 4xx or 5xx if config.abort_on_failure is true (default)
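The before/after hooks listed above follow a common registry pattern, sketched below in plain Ruby. `HookRegistry` is hypothetical and is not Klepto's implementation. It also assumes that calling `after` with no event means `:structure`, which is a guess based on the usage example.

```ruby
# Hypothetical hook registry, not Klepto's implementation.
class HookRegistry
  def initialize
    # @hooks[timing][event] is an array of blocks, created on first access.
    @hooks = Hash.new do |by_timing, timing|
      by_timing[timing] = Hash.new { |by_event, event| by_event[event] = [] }
    end
  end

  def before(event, &block)
    @hooks[:before][event] << block
    self
  end

  # Assumption: a bare `after { ... }` registers a :structure hook.
  def after(event = :structure, &block)
    @hooks[:after][event] << block
    self
  end

  # Call every handler registered for the timing/event pair, in order.
  def run(timing, event, *args)
    @hooks[timing][event].map { |hook| hook.call(*args) }
  end
end

EVENTS_SEEN = []
REGISTRY = HookRegistry.new
REGISTRY.before(:get) { |_browser, url| EVENTS_SEEN << "before get #{url}" }
REGISTRY.after(:get)  { |_browser, url| EVENTS_SEEN << "after get #{url}" }
REGISTRY.after        { |structure| EVENTS_SEEN << "structured #{structure[:name]}" }

# Simulate one page fetch; nil stands in for the browser object.
REGISTRY.run(:before, :get, nil, 'http://example.com')
REGISTRY.run(:after, :get, nil, 'http://example.com')
REGISTRY.run(:after, :structure, { name: 'Example' })

puts EVENTS_SEEN
```

Keeping handlers in arrays means several blocks can be attached to the same event and they fire in registration order.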
Stuff I'm going to add.
- Ensure after(:each) works at the resource/collection level as well
- Add after(:all)
- :if, :unless for as: (:collection|:resource) too. The context should be the captured node that the block is run against
- Access to hash from within a block (for bulk assignment of other attributes) ?
- config.allow_rescue_in_block #should exceptions in blocks be auto rescued with nil as the return value
- :default should be able to take a proc
Async
-> https://github.com/igrigorik/em-synchrony
Cookie Stuffing
cookies({
'Has Fun' => true
})
Pre-req Steps
prepare [
[:GET, 'http://example.com'],
[:POST, 'http://example.com/login', {username: 'cory', password: '123456'}],
]
Page Assertions
assertions do
#presence and value assertions...
end
on_assertion_failure{ |response, bot| }
Structure :if unless: lambda{|node| node.class.include?("newsflash")}