No commit activity in last 3 years
No release in over 3 years
Interactive and declarative XPath driven HTML scraper

Scrapouille

Scrapouille is a declarative, XPath-driven HTML scraper with an interactive mode as a bonus.

Why XPath? XPath is powerful enough to extract any data from an HTML document (see http://www.w3schools.com/xpath/xpath_axes.asp).

Scrapouille runs XPath queries using the nokogiri gem.

Install

gem install scrapouille

Test

rake

Usage

Interactive mode

From the command line you can interact with a remote web page as if it were local:

$ scrapouille http://tennis.com/player.html        # launch scrapouille on the command line with a provided URI
> //div[@class='player-name']/h1/child::text()     # you will get a prompt; enter an XPath query
Richard Gasquet                                    # get the result string
>

Behind the scenes, for the duration of the session, the remote web page is stored in a Tempfile for fast XPath interaction.

You can also interact directly with a local file:
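
The caching idea can be sketched with Ruby's standard Tempfile class. This is an illustration of the approach under stated assumptions, not Scrapouille's actual code: the `body` string stands in for a downloaded page, and the variable names are invented for the example.

```ruby
require 'tempfile'

# Stand-in for the body of a downloaded page (no network access here).
body = "<html><body><h1>Richard Gasquet</h1></body></html>"

# With a block, Tempfile.create removes the file when the block returns.
cached = Tempfile.create(['page', '.html']) do |file|
  file.write(body)
  file.flush
  # Later queries can re-read the local copy instead of re-fetching the URL.
  File.read(file.path)
end

puts cached == body
```

Fetching once and querying the local copy keeps each XPath query from triggering a new HTTP request.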

$ scrapouille /Users/simon/web/player.html         # launch scrapouille on the command line with a provided filepath
> //div[@class='player-name']/h1/child::text()     # enter your XPath query
Richard Gasquet                                    # get the result string
>

Scraping programmatically

Define a scraper

scraper = Scrapouille.new do
  scrap 'fullname', at: "//div[@class='player-name']/h1/child::text()"
  scrap 'image_url', at: "//div[@id='basic']//img/attribute::src"
  scrap 'rank', at: "//div[@class='position']/text()" do |c|
    Integer(c.sub('#', ''))
  end
end
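
The block given to `scrap 'rank'` converts the scraped text before it is stored. The sketch below isolates that conversion in a standalone lambda so it can be tried on its own. It assumes the block receives the matched text as a String, and `'#9'` is a made-up sample value.

```ruby
# Same conversion as the block passed to `scrap 'rank'`,
# written as a lambda so it can be called directly.
to_rank = ->(c) { Integer(c.sub('#', '')) }

rank = to_rank.call('#9') # '#9' is an invented sample value
puts rank.class

# Integer() raises ArgumentError when the remaining text is not numeric.
begin
  to_rank.call('unranked')
rescue ArgumentError => e
  puts "not a number: #{e.message}"
end
```

Using `Integer()` rather than `to_i` means unexpected page content fails loudly instead of silently becoming 0.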

Use the scraper instance on a URI (as defined by open-uri: filepath, http, ...)

results = scraper.scrap!('http://tennis-player.com/richard-gasquet')
results['fullname'] # => 'Richard Gasquet'

You can also run your scraper against a local HTML file path for testing purposes

scraper.scrap!(File.join('..', 'player.html'))