Scraped
Write declarative scrapers in Ruby.
If you need to write a web scraper (perhaps to scrape a single page, or one that hits an index page listing many other pages, and then visits each of those to pull out the same data), the scraped gem will help you write it quickly and clearly.
Installation
Add this line to your application’s Gemfile:
gem 'scraped'
And then execute:
$ bundle
Or install it yourself as:
$ gem install scraped
Usage
Scraped currently has support for working with HTML and JSON documents (pull requests for other formats are welcome).
To write a scraper, start by creating a subclass of Scraped::HTML or Scraped::JSON for each type of page you’re scraping. The choice of class matters: the way you’ll want to work with an HTML page is very different from a JSON document, and each class gives you the tools most useful for its format. The sections below show both at work.
Then specify the data fields you want to extract from that page. You can also control the strategy used to fetch the page, and decorate the response you get back to make it easier to parse.
HTML
Here’s the HTML source from the webpage at example.com:
<html>
  <body>
    <div>
      <h1>Example Domain</h1>
      <p>This domain is established to be used for illustrative examples
      in documents. You may use this domain in examples without prior
      coordination or asking for permission.</p>
      <p><a href="http://www.iana.org/domains/example">More information...</a></p>
    </div>
  </body>
</html>
So, if that’s your target and you want data such as the title and the more-info URL from that page, you pick them out as fields with Nokogiri (using CSS selectors, in this example). Note that ExamplePage is a subclass of Scraped::HTML because it’s specifically HTML that you’re scraping:
require 'scraped'

class ExamplePage < Scraped::HTML
  field :title do
    noko.at_css('h1').text
  end

  field :more_info_url do
    noko.at_css('a/@href').text
  end
end
Now you’ve defined your ExamplePage, you can create a new instance and pass in a Scraped::Response instance. The resulting page object has the data you’ve scraped:
page = ExamplePage.new(response: Scraped::Request.new(url: 'http://example.com').response)
page.title
# => "Example Domain"
page.more_info_url
# => "http://www.iana.org/domains/example"
page.to_h
# => { :title => "Example Domain", :more_info_url => "http://www.iana.org/domains/example" }
Those fields now contain the data scraped from ExamplePage, and of course .to_h is handy if you want to dump the whole record into a database. That’s why we call each data item you’ve picked out a field: if you’re scraping data that’s going into a database, these will often be fields on each record.
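If you do want to push that hash straight into a datastore, here’s a minimal sketch using the scraperwiki gem; that gem is an assumption on our part, not something scraped depends on:

require 'scraperwiki'

# Save the scraped record, treating :title as the unique key (a hypothetical choice).
ScraperWiki.save_sqlite([:title], page.to_h)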
JSON
JSON documents are handled in a similar way.
If http://example.com/data.json returns something like this:
{
  "name": "John Doe",
  "email": "john_doe@example.com"
}
This time, create a subclass of Scraped::JSON. Fields are typically easier to extract from a JSON document because the data is already structured (hence there’s no noko here):
require 'scraped'

class ExampleRecord < Scraped::JSON
  field :name do
    json[:name]
  end

  field :email do
    json[:email]
  end
end
Again, create a new instance and pass in a Scraped::Response instance:
record = ExampleRecord.new(response: Scraped::Request.new(url: 'http://example.com/data.json').response)
record.name
# => "John Doe"
record.email
# => "john_doe@example.com"
record.to_h
# => { :name => "John Doe", :email => "john_doe@example.com" }
In practice, you may be accessing the JSON data via an API rather than as a static URL, but in either case scraped handles this as a simple request and response.
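For example, an API endpoint that expects an Accept header is still just a request (the headers: option is covered under Passing request headers below):

response = Scraped::Request.new(url: 'http://example.com/data.json', headers: { 'Accept' => 'application/json' }).response
record = ExampleRecord.new(response: response)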
Dealing with sections of a page
In the example above (scraping example.com), the scraper was handling a whole webpage; but often you’re only interested in working with part of a page. For example, you might want to scrape just a table containing a list of people and some associated data.
To do this, use the fragment method, passing it a one-entry hash: the key is the noko fragment you want to use, and the value is the class that should handle that fragment.
fragment row => MemberRow
In the example below, the fragment is one table row (an HTML tr element) within the table that has a CSS class of members-list. The MemberRow class extracts fields from the fragment, rather than from the whole page.
class MemberRow < Scraped::HTML
  field :name do
    noko.css('td')[2].text
  end

  field :party do
    noko.css('td')[3].text
  end
end

class AllMembersPage < Scraped::HTML
  field :members do
    noko.css('table.members-list tr').map do |row|
      fragment row => MemberRow
    end
  end
end
If you restrict your class to a fragment like this, the CSS or XPath expressions you use to identify the data you want can often be simpler. In the example above, noko.css('td')[2] selects the cell with index 2 (that is, the third column) of the table row.
The fragment method is also available in Scraped::JSON. This time, the key is the part of the JSON document you want to use:
{
  "house": "Upper",
  "members": [{
    "id": "001",
    "name": "John Doe"
  }, {
    "id": "002",
    "name": "Jane Doe"
  }]
}
class MemberRecord < Scraped::JSON
  field :id do
    json[:id]
  end

  field :name do
    json[:name]
  end
end

class MemberRecords < Scraped::JSON
  field :members do
    json[:members].map do |member|
      fragment member => MemberRecord
    end
  end
end
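As before, instantiate the top-level class with a response; each fragment then behaves like a small scraper of its own. The URL here is illustrative:

records = MemberRecords.new(response: Scraped::Request.new(url: 'http://example.com/members.json').response)
records.members.map(&:to_h)
# => [{ :id => "001", :name => "John Doe" }, { :id => "002", :name => "Jane Doe" }]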
Extending
There are two main ways to extend scraped with your own custom logic: custom request strategies and decorated responses.
Request strategies allow you to change where (and perhaps how) the scraper gets its responses from. By default, scraped uses its built-in LiveRequest strategy, which attempts to make an HTTP request to the URL provided. The response to that request is typically an HTML page, and it’s that response that gets passed to your scraper to work on.
When you need more control over how this works, you can create custom request strategies. For example, you might want to fall back to archive.org if the site you’re scraping is unavailable at the moment the scraper runs; or use a cache and only refresh it under certain calendar conditions; or negotiate authentication before making the request.
Decorated responses allow you to manipulate the response before it’s passed to the scraper. This is useful because sometimes your scraper’s code will be much simpler if you clean up or standardise the incoming page before you parse it.
scraped comes with some built-in decorators for common tasks. For example, CleanUrls (see below) tidies up all the link and image source URLs before your scraper code extracts them. You can write your own custom decorators too, to fix up something specific or idiosyncratic about the pages you’re scraping.
Custom request strategies
To make a custom request strategy, create a class that subclasses Scraped::Request::Strategy and defines a response method:
require 'digest'
require 'uri'

class FileOnDiskRequest < Scraped::Request::Strategy
  def response
    { body: File.read(filename) }
  end

  private

  # Cached pages live in a directory per host, named by a SHA1 of the full URL.
  def filename
    @filename ||= File.join(URI.parse(url).host, Digest::SHA1.hexdigest(url))
  end
end
The response method should return a Hash with a body key. You can also include status and headers keys in the hash to fill out those fields of the response; if not given, status defaults to 200 (OK) and headers defaults to {} (empty).
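For example, a hypothetical strategy that fabricates a complete response could supply all three keys:

class StubbedRequest < Scraped::Request::Strategy
  def response
    # A canned response, with status and headers given explicitly.
    { status: 200, headers: { 'Content-Type' => 'text/html' }, body: '<h1>Example Domain</h1>' }
  end
end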
To use a custom request strategy, pass it to Scraped::Request:
request = Scraped::Request.new(url: 'http://example.com', strategies: [FileOnDiskRequest, Scraped::Request::Strategy::LiveRequest])
page = MyPersonPage.new(response: request.response)
Note that you can provide multiple strategies; scraped will try each in turn until one returns a response.
Custom response decorators
We’ve found decorators useful in our own work. For example, we "unspan" fiddly HTML tables that have colspan in them, because extracting data is much easier once tables have been normalised so that every row has the same number of columns. Sometimes it’s also helpful to clean up whitespace before parsing, converting HTML entities such as &nbsp; into plain spaces.
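For instance, here’s a minimal sketch of such a whitespace-cleaning decorator (the class name is ours; the decorator API it uses is described next):

class ReplaceNonBreakingSpaces < Scraped::Response::Decorator
  def body
    # Swap non-breaking space entities for plain spaces before the page is parsed.
    super.gsub('&nbsp;', ' ')
  end
end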
To manipulate the response before it is processed by the scraper, create a class that subclasses Scraped::Response::Decorator. This class must define a body method, and may also optionally define url, status, or headers.
class AbsoluteLinks < Scraped::Response::Decorator
  def body
    doc = Nokogiri::HTML(super)
    doc.css('a').each do |link|
      # Resolve each href against the page URL, turning relative links into absolute ones.
      link[:href] = URI.join(url, link[:href]).to_s
    end
    doc.to_s
  end
end
You can access the current response body by calling super from your method. You can also call url, headers, or status to access those properties of the current response.
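For example, a hypothetical decorator could use status to leave error pages untouched:

class CleanSuccessfulBodies < Scraped::Response::Decorator
  def body
    # Only tidy the body of successful responses; pass anything else through unchanged.
    status == 200 ? super.strip : super
  end
end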
To use a response decorator, use the decorator class method in a Scraped::HTML subclass:
class PageWithRelativeLinks < Scraped::HTML
  decorator AbsoluteLinks

  # Other fields...
end
Note: see CleanUrls under Built-in decorators, which implements a more thorough version of this example.
Configuring requests and responses
When passing an array of request strategies or response decorators you should always pass the class, rather than the instance. If you want to configure an instance you can pass in a two-element array where the first element is the class and the second element is the config:
class CustomHeader < Scraped::Response::Decorator
  def headers
    # config contains whatever options were passed alongside the decorator class.
    response.headers.merge('X-Greeting' => config[:greeting])
  end
end

class ExamplePage < Scraped::HTML
  decorator CustomHeader, greeting: 'Hello, world'
end
The code above adds this header to the response: X-Greeting: Hello, world.
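The same two-element array form works for request strategies. In this sketch, FileOnDiskRequest is the class from earlier and the dir config key is illustrative; it assumes the strategy reads config[:dir] in the same way CustomHeader reads config[:greeting]:

request = Scraped::Request.new(
  url: 'http://example.com',
  strategies: [[FileOnDiskRequest, { dir: 'cache' }], Scraped::Request::Strategy::LiveRequest]
)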
Passing request headers
Note that you don’t need to define a custom strategy if you just want to set headers on a request. You can pass a headers: argument directly to Scraped::Request.new:
response = Scraped::Request.new(url: 'http://example.com', headers: { 'Cookie' => 'user_id=42' }).response
page = ExamplePage.new(response: response)
Inheritance with decorators
When you inherit from a class that already has decorators, the child class inherits those decorators too. There’s currently no way to re-order or remove decorators in child classes, though that may be added in the future.
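For example, a sketch reusing CustomHeader from the previous section together with the built-in CleanUrls decorator (covered below):

class BasePage < Scraped::HTML
  decorator Scraped::Response::Decorator::CleanUrls
end

class ChildPage < BasePage
  # ChildPage runs the inherited CleanUrls decorator as well as its own CustomHeader.
  decorator CustomHeader, greeting: 'Hello again'
end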
Built-in decorators
Clean link and image URLs
If your scraper is capturing URLs, you may find that they appear in the HTML as relative URLs. It’s better to convert them all to fully-qualified absolute URLs before your scraper extracts them. This is such a common inconvenience that the scraped gem supports it out of the box with the Scraped::Response::Decorator::CleanUrls decorator.
The CleanUrls decorator also fixes up any encoding issues the URL may have, as part of making sure it’s fully qualified.
So if you apply the CleanUrls decorator, any src or href attributes (of image and anchor elements respectively) will be correctly encoded and made absolute.
require 'scraped'

class MemberPage < Scraped::HTML
  decorator Scraped::Response::Decorator::CleanUrls

  field :image do
    # The image URL will be absolute thanks to the decorator.
    noko.at_css('.profile-picture/@src').text
  end
end
Declarative tests
We’ve also been working on making it easy to write simple, declarative tests for scrapers: see scraper-test. That’s useful because often you can state very clearly what data (the fields and their values) you’re expecting your scraper to return before you’ve even started writing it.
Development
If you want to work on scraped itself (rather than simply using it, or writing custom strategies or decorators), this section is for you.
After checking out the repo, run bin/setup to install dependencies. Then run rake test to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/everypolitician/scraped.
License
The gem is available as open source under the terms of the MIT License.