Scraped
Write declarative scrapers in Ruby.
If you need to write a web scraper (perhaps to scrape a single page, or one that hits an index page listing many other pages, and then visits each of those to pull out the same data), the scraped gem will help you write it quickly and clearly.
Installation
Add this line to your application’s Gemfile:
gem 'scraped'
And then execute:
$ bundle
Or install it yourself as:
$ gem install scraped
Usage
Scraped currently has support for working with HTML and JSON documents (pull requests for other formats are welcome).
To write a scraper, start by creating a subclass of Scraped::HTML or Scraped::JSON for each type of page you’re scraping. The choice of class matters: the way you’ll want to work with an HTML page is very different from a JSON document, and each class gives you the tools most useful for its format. The sections below show both at work.
Then specify the data fields you want to extract from that page. You can also control the strategy used to fetch the page, and decorate the response you get back to make it easier to parse.
HTML
Here’s the HTML source from the webpage at example.com:
<html>
  <body>
    <div>
      <h1>Example Domain</h1>
      <p>This domain is established to be used for illustrative examples
      in documents. You may use this domain in examples without prior
      coordination or asking for permission.</p>
      <p><a href="http://www.iana.org/domains/example">More information...</a></p>
    </div>
  </body>
</html>
So, if that’s your target and you want data such as the title and the more-info URL from that page, you pick them out as fields with Nokogiri (using CSS selectors, in this example). Note that ExamplePage is a subclass of Scraped::HTML because it’s specifically HTML that you’re scraping:
require 'scraped'

class ExamplePage < Scraped::HTML
  field :title do
    noko.at_css('h1').text
  end

  field :more_info_url do
    noko.at_css('a/@href').text
  end
end
Now you’ve defined your ExamplePage, you can create a new instance and pass in a Scraped::Response instance. The resulting page object has the data you’ve scraped:
page = ExamplePage.new(response: Scraped::Request.new(url: 'http://example.com').response)
page.title
# => "Example Domain"
page.more_info_url
# => "http://www.iana.org/domains/example"
page.to_h
# => { :title => "Example Domain", :more_info_url => "http://www.iana.org/domains/example" }
Those fields now contain the data scraped from ExamplePage, and of course .to_h is handy if you want to dump the whole record into a database. That’s why we call each data item you’ve picked out a field: if you’re scraping data that’s going into a database, these will often be fields on each record.
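If you do want to push that hash straight into a datastore, here’s a minimal sketch using the scraperwiki gem; that gem is an assumption on our part, not something scraped depends on:

require 'scraperwiki'

# Save the scraped record, treating :title as the unique key (a hypothetical choice).
ScraperWiki.save_sqlite([:title], page.to_h)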
JSON
JSON documents are handled in a similar way.
If http://example.com/data.json returns something like this:
{
  "name": "John Doe",
  "email": "john_doe@example.com"
}
This time, create a subclass of Scraped::JSON. Fields are typically easier to extract from a JSON document because the data is already structured (hence there’s no noko here):
require 'scraped'

class ExampleRecord < Scraped::JSON
  field :name do
    json[:name]
  end

  field :email do
    json[:email]
  end
end
Again, create a new instance and pass in a Scraped::Response instance:
record = ExampleRecord.new(response: Scraped::Request.new(url: 'http://example.com/data.json').response)
record.name
# => "John Doe"
record.email
# => "john_doe@example.com"
record.to_h
# => { :name => "John Doe", :email => "john_doe@example.com" }
In practice, you may be accessing the JSON data via an API rather than as a static URL, but in either case scraped handles this as a simple request and response.
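For example, an API endpoint that expects an Accept header is still just a request (the headers: option is covered under Passing request headers below):

response = Scraped::Request.new(url: 'http://example.com/data.json', headers: { 'Accept' => 'application/json' }).response
record = ExampleRecord.new(response: response)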
Dealing with sections of a page
In the example above (scraping example.com), the scraper was handling a whole webpage; but often you’re only interested in working with part of a page. For example, you might want to scrape just a table containing a list of people and some associated data.
To do this, use the fragment method, passing it a one-entry hash: the key is the noko fragment you want to use, and the value is the class that should handle that fragment.
fragment row => MemberRow
In the example below, the fragment is one table row (an HTML tr element) within the table that has a CSS class of members-list. The MemberRow class extracts fields from the fragment, rather than from the whole page.
class MemberRow < Scraped::HTML
  field :name do
    noko.css('td')[2].text
  end

  field :party do
    noko.css('td')[3].text
  end
end

class AllMembersPage < Scraped::HTML
  field :members do
    noko.css('table.members-list tr').map do |row|
      fragment row => MemberRow
    end
  end
end
If you restrict your class to a fragment like this, the CSS or XPath expressions you use to identify the data you want can often be simpler. In the example above, noko.css('td')[2] selects the cell with index 2 (that is, the third column) of the table row.
The fragment method is also available in Scraped::JSON. This time, the key is the part of the JSON document you want to use:
{
  "house": "Upper",
  "members": [{
    "id": "001",
    "name": "John Doe"
  }, {
    "id": "002",
    "name": "Jane Doe"
  }]
}
class MemberRecord < Scraped::JSON
  field :id do
    json[:id]
  end

  field :name do
    json[:name]
  end
end

class MemberRecords < Scraped::JSON
  field :members do
    json[:members].map do |member|
      fragment member => MemberRecord
    end
  end
end
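As before, instantiate the top-level class with a response; each fragment then behaves like a small scraper of its own. The URL here is illustrative:

records = MemberRecords.new(response: Scraped::Request.new(url: 'http://example.com/members.json').response)
records.members.map(&:to_h)
# => [{ :id => "001", :name => "John Doe" }, { :id => "002", :name => "Jane Doe" }]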
Extending
There are two main ways to extend scraped with your own custom logic: custom request strategies and decorated responses.
Request strategies allow you to change where (and perhaps how) the scraper gets its responses from. By default, scraped uses its built-in LiveRequest strategy, which attempts to make an HTTP request to the URL provided. The response to that request is typically an HTML page, and it’s that response that gets passed to your scraper to work on.
When you need more control over how this works, you can create custom request strategies. For example, you might want to fall back to archive.org if the site you’re scraping is unavailable at the moment the scraper runs; or use a cache and only refresh it under certain calendar conditions; or negotiate authentication before making the request.
Decorated responses allow you to manipulate the response before it’s passed to the scraper. This is useful because sometimes your scraper’s code will be much simpler if you clean up or standardise the incoming page before you parse it.
scraped comes with some built-in decorators for common tasks. For example, CleanUrls (see below) tidies up all the link and image source URLs before your scraper code extracts them. You can write your own custom decorators too, to fix up something specific or idiosyncratic about the pages you’re scraping.
Custom request strategies
To make a custom request strategy, create a class that subclasses Scraped::Request::Strategy and defines a response method:
require 'digest'
require 'uri'

class FileOnDiskRequest < Scraped::Request::Strategy
  def response
    { body: File.read(filename) }
  end

  private

  # Cached pages live in a directory per host, named by a SHA1 of the full URL.
  def filename
    @filename ||= File.join(URI.parse(url).host, Digest::SHA1.hexdigest(url))
  end
end
The response method should return a Hash with a body key. You can also include status and headers keys in the hash to fill out those fields of the response; if not given, status defaults to 200 (OK) and headers defaults to {} (empty).
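For example, a hypothetical strategy that fabricates a complete response could supply all three keys:

class StubbedRequest < Scraped::Request::Strategy
  def response
    # A canned response, with status and headers given explicitly.
    { status: 200, headers: { 'Content-Type' => 'text/html' }, body: '<h1>Example Domain</h1>' }
  end
end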
To use a custom request strategy, pass it to Scraped::Request:
request = Scraped::Request.new(url: 'http://example.com', strategies: [FileOnDiskRequest, Scraped::Request::Strategy::LiveRequest])
page = MyPersonPage.new(response: request.response)
Note that you can provide multiple strategies; scraped will try each in turn until one returns a response.
Custom response decorators
We’ve found decorators useful in our own work. For example, we "unspan" fiddly HTML tables that have colspan in them, because extracting data is much easier once tables have been normalised so that every row has the same number of columns. Sometimes it’s also helpful to clean up whitespace before parsing, converting HTML entities such as &nbsp; into plain spaces.
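For instance, here’s a minimal sketch of such a whitespace-cleaning decorator (the class name is ours; the decorator API it uses is described next):

class ReplaceNonBreakingSpaces < Scraped::Response::Decorator
  def body
    # Swap non-breaking space entities for plain spaces before the page is parsed.
    super.gsub('&nbsp;', ' ')
  end
end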
To manipulate the response before it is processed by the scraper, create a class that subclasses Scraped::Response::Decorator. This class must define a body method, and may also optionally define url, status, or headers.
class AbsoluteLinks < Scraped::Response::Decorator
  def body
    doc = Nokogiri::HTML(super)
    doc.css('a').each do |link|
      # Resolve each href against the page URL, turning relative links into absolute ones.
      link[:href] = URI.join(url, link[:href]).to_s
    end
    doc.to_s
  end
end
You can access the current response body by calling super from your method. You can also call url, headers, or status to access those properties of the current response.
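For example, a hypothetical decorator could use status to leave error pages untouched:

class CleanSuccessfulBodies < Scraped::Response::Decorator
  def body
    # Only tidy the body of successful responses; pass anything else through unchanged.
    status == 200 ? super.strip : super
  end
end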
To use a response decorator, use the decorator class method in a Scraped::HTML subclass:
class PageWithRelativeLinks < Scraped::HTML
  decorator AbsoluteLinks

  # Other fields...
end
Note: see CleanUrls under Built-in decorators, which implements a more thorough version of this example.
Configuring requests and responses
When passing an array of request strategies or response decorators you should always pass the class, rather than the instance. If you want to configure an instance you can pass in a two-element array where the first element is the class and the second element is the config:
class CustomHeader < Scraped::Response::Decorator
  def headers
    # config contains whatever options were passed alongside the decorator class.
    response.headers.merge('X-Greeting' => config[:greeting])
  end
end

class ExamplePage < Scraped::HTML
  decorator CustomHeader, greeting: 'Hello, world'
end
The code above adds this header to the response: X-Greeting: Hello, world.
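The same two-element array form works for request strategies. In this sketch, FileOnDiskRequest is the class from earlier and the dir config key is illustrative; it assumes the strategy reads config[:dir] in the same way CustomHeader reads config[:greeting]:

request = Scraped::Request.new(
  url: 'http://example.com',
  strategies: [[FileOnDiskRequest, { dir: 'cache' }], Scraped::Request::Strategy::LiveRequest]
)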
Passing request headers
Note that you don’t need to define a custom strategy if you just want to set headers on a request. You can pass a headers: argument directly to Scraped::Request.new:
response = Scraped::Request.new(url: 'http://example.com', headers: { 'Cookie' => 'user_id=42' }).response
page = ExamplePage.new(response: response)
Inheritance with decorators
When you inherit from a class that already has decorators, the child class inherits those decorators too. There’s currently no way to re-order or remove decorators in child classes, though that may be added in the future.
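For example, a sketch reusing CustomHeader from the previous section together with the built-in CleanUrls decorator (covered below):

class BasePage < Scraped::HTML
  decorator Scraped::Response::Decorator::CleanUrls
end

class ChildPage < BasePage
  # ChildPage runs the inherited CleanUrls decorator as well as its own CustomHeader.
  decorator CustomHeader, greeting: 'Hello again'
end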
Built-in decorators
Clean link and image URLs
If your scraper is capturing URLs, you may find that they appear in the HTML as relative URLs. It’s better to convert them all to fully-qualified absolute URLs before your scraper extracts them. This is such a common inconvenience that the scraped gem supports it out of the box with the Scraped::Response::Decorator::CleanUrls decorator.
The CleanUrls decorator also fixes up any encoding issues the URL may have, as part of making sure it’s fully qualified.
So if you apply the CleanUrls decorator, any src or href attributes (of image and anchor elements respectively) will be correctly encoded and made absolute.
require 'scraped'

class MemberPage < Scraped::HTML
  decorator Scraped::Response::Decorator::CleanUrls

  field :image do
    # The image URL will be absolute thanks to the decorator.
    noko.at_css('.profile-picture/@src').text
  end
end
Declarative tests
We’ve also been working on making it easy to write simple, declarative tests for scrapers: see scraper-test. That’s useful because often you can state very clearly what data (the fields and their values) you’re expecting your scraper to return before you’ve even started writing it.
Development
If you want to work on scraped itself (rather than simply using it, or writing custom strategies or decorators), this section is for you.
After checking out the repo, run bin/setup to install dependencies. Then run rake test to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/everypolitician/scraped.
License
The gem is available as open source under the terms of the MIT License.