Sagrone scraper

A simple library for scraping web pages. Below you will find information on how to use it.

Installation
Basic Usage
Modules
- SagroneScraper::Agent
- SagroneScraper::Base
  - Create a scraper class
  - Instantiate the scraper
  - Scrape the page
  - Extract the data
- SagroneScraper::Collection

Installation

Add this line to your application's Gemfile:

gem 'sagrone_scraper'

And then execute:

$ bundle

Or install it yourself as:

$ gem install sagrone_scraper

Basic Usage

In order to scrape a web page you will need to:

1. create a new scraper class by inheriting from SagroneScraper::Base,
2. instantiate it with a url or a page, and
3. use the scraper instance to scrape the page and extract structured data.

More information can be found in the SagroneScraper::Base section.

Modules

`SagroneScraper::Agent`

The agent is responsible for obtaining a page, Mechanize::Page, from a URL. Here is how you can create an agent:

require 'sagrone_scraper'

agent = SagroneScraper::Agent.new(url: 'https://twitter.com/Milano_JS')
agent.page
# => Mechanize::Page

agent.page.at('.ProfileHeaderCard-bio').text
# => "Javascript User Group Milano #milanojs"

`SagroneScraper::Base`

The examples in this section define a TwitterScraper by inheriting from the SagroneScraper::Base class.

The scraper is responsible for extracting structured data from a page or a url. The page can be obtained by the agent.

Public instance methods are used to extract data, whereas private instance methods are ignored (they are treated as helper methods). Most importantly, the self.can_scrape?(url) class method ensures that only a known subset of pages can be scraped for data.
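
To make the public/private convention concrete, here is a minimal plain-Ruby sketch of how such a rule could be implemented. It is not the gem's actual code: SketchBase, ProfileSketch, fetch and the hash used as a fake page are all hypothetical names invented for this illustration.

```ruby
# Hypothetical stand-in for SagroneScraper::Base (NOT the gem's real code).
class SketchBase
  attr_reader :attributes

  def initialize(page)
    @page = page
    @attributes = {}
  end

  # Call every public instance method defined directly on the subclass
  # and store each result under the method's name.
  def scrape_page!
    self.class.public_instance_methods(false).sort.each do |name|
      @attributes[name] = public_send(name)
    end
    nil
  end

  private

  attr_reader :page
end

class ProfileSketch < SketchBase
  def bio
    fetch(:bio)
  end

  def location
    fetch(:location)
  end

  private

  # Private helper: scrape_page! never calls it, so it is not an attribute.
  def fetch(key)
    page[key].to_s.strip
  end
end

# A plain Hash plays the role of the page in this sketch.
fake_page = { bio: '  Javascript User Group Milano ', location: 'Milan, Italy' }
scraper = ProfileSketch.new(fake_page)
scraper.scrape_page!
scraper.attributes
# => {bio: "Javascript User Group Milano", location: "Milan, Italy"}
```

Only bio and location end up in the result; the private fetch helper is skipped because public_instance_methods does not list private methods.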

Create a scraper class

require 'sagrone_scraper'

class TwitterScraper < SagroneScraper::Base
  TWITTER_PROFILE_URL = /\Ahttps?:\/\/twitter\.com\/\w+\/?\z/i

  def self.can_scrape?(url)
    url.match(TWITTER_PROFILE_URL) ? true : false
  end

  # Public instance methods are used for data extraction.

  def bio
    text_at('.ProfileHeaderCard-bio')
  end

  def location
    text_at('.ProfileHeaderCard-locationText')
  end

  private

  # Private instance methods are not used for data extraction.

  def text_at(selector)
    page.at(selector).text if page.at(selector)
  end
end
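
It can help to check a can_scrape? pattern against a few URLs before relying on it. The sketch below uses only plain Ruby and does not need the gem. PROFILE_URL and profile_url? are hypothetical names for this illustration, and the pattern is a strict variant of the profile regex (escaped dot, \A and \z anchors) chosen as an assumption of the sketch.

```ruby
# Hypothetical pattern in the spirit of TWITTER_PROFILE_URL.
# \A and \z anchor the whole string; ^ and $ would only anchor a line.
PROFILE_URL = %r{\Ahttps?://twitter\.com/\w+/?\z}i

# Returns true only when the URL looks like a profile page.
def profile_url?(url)
  url.match?(PROFILE_URL) # String#match? needs Ruby 2.4 or newer
end

profile_url?('https://twitter.com/Milano_JS')           # => true
profile_url?('http://twitter.com/Milano_JS/')           # => true
profile_url?('https://twitter.com/Milano_JS/status/1')  # => false
profile_url?('https://twitterXcom/Milano_JS')           # => false
profile_url?("http://evil.example\nhttps://twitter.com/Milano_JS") # => false
```

The last two cases are the ones a looser pattern tends to let through: an unescaped dot matches any character, and line anchors accept a valid-looking line inside a multi-line string.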

Instantiate the scraper

# Instantiate the scraper with a "url".
scraper = TwitterScraper.new(url: 'https://twitter.com/Milano_JS')

# Instantiate the scraper with a "page" (Mechanize::Page).
agent = SagroneScraper::Agent.new(url: 'https://twitter.com/Milano_JS')
scraper = TwitterScraper.new(page: agent.page)

Scrape the page

scraper.scrape_page!

Extract the data

scraper.attributes
# => {bio: "Javascript User Group Milano #milanojs", location: "Milan, Italy"}
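
Since the attributes are shown above as a plain Hash, they can be serialized with Ruby's standard library. The following is a small sketch that assumes the hash has exactly the shape shown; the literal below stands in for scraper.attributes so the snippet runs without the gem or network access.

```ruby
require 'json'

# Stand-in for the value returned by scraper.attributes.
attributes = { bio: 'Javascript User Group Milano #milanojs', location: 'Milan, Italy' }

# Symbol keys become JSON string keys.
json = JSON.generate(attributes)
# => "{\"bio\":\"Javascript User Group Milano #milanojs\",\"location\":\"Milan, Italy\"}"

# Round-trip back to a Hash with symbol keys.
JSON.parse(json, symbolize_names: true) == attributes
# => true
```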

`SagroneScraper::Collection`

This is the simplest way to scrape a web page:

require 'sagrone_scraper'

# 1) Define a scraper. For example, the TwitterScraper above.

# 2) Newly created scrapers are registered automatically.
SagroneScraper::Collection.registered_scrapers
# => ['TwitterScraper']

# 3) Here we use the collection to scrape data at a URL.
SagroneScraper::Collection.scrape(url: 'https://twitter.com/Milano_JS')
# => {bio: "Javascript User Group Milano #milanojs", location: "Milan, Italy"}
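
For readers curious how automatic registration and URL dispatch could fit together, here is a minimal plain-Ruby sketch. It is an assumption about one possible design, not the gem's internals: SketchCollection, SketchScraper, TwitterSketch, GithubSketch, register and scraper_for are all hypothetical names.

```ruby
# Hypothetical registry (NOT the gem's real implementation).
module SketchCollection
  @scrapers = []

  class << self
    attr_reader :scrapers

    def register(klass)
      @scrapers << klass unless @scrapers.include?(klass)
    end

    # Return the first registered scraper class that accepts the URL.
    def scraper_for(url)
      @scrapers.find { |klass| klass.can_scrape?(url) }
    end
  end
end

class SketchScraper
  # Ruby calls this hook each time a subclass is defined.
  def self.inherited(subclass)
    super
    SketchCollection.register(subclass)
  end

  def self.can_scrape?(_url)
    false
  end
end

class TwitterSketch < SketchScraper
  def self.can_scrape?(url)
    url.start_with?('https://twitter.com/')
  end
end

class GithubSketch < SketchScraper
  def self.can_scrape?(url)
    url.start_with?('https://github.com/')
  end
end

SketchCollection.scrapers.map(&:name)
# => ["TwitterSketch", "GithubSketch"]
SketchCollection.scraper_for('https://github.com/rails')
# => GithubSketch
SketchCollection.scraper_for('https://example.com/')
# => nil
```

In this design, defining a subclass is enough to register it, and can_scrape? decides which scraper handles a given URL.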

Contributing

1. Fork it (https://github.com/[my-github-username]/sagrone_scraper/fork)
2. Create your feature branch (git checkout -b my-new-feature)
3. Commit your changes (git commit -am 'Add some feature')
4. Push to the branch (git push origin my-new-feature)
5. Create a new Pull Request
