No commit activity in last 3 years
No release in over 3 years
Interactive and declarative XPath driven HTML scraper

Scrapouille

Scrapouille is a declarative, XPath-driven HTML scraper, with an interactive mode as a bonus.

Why XPath? XPath is expressive enough to select any data in an HTML document (see http://www.w3schools.com/xpath/xpath_axes.asp).

Scrapouille runs XPath queries using the nokogiri gem.

Install

gem install scrapouille

Test

rake

Usage

Interactive mode

From the command line you can interact with a remote web page as if it were local:

$ scrapouille http://tennis.com/player.html        # Launch scrapouille with a URI
> //div[@class='player-name']/h1/child::text()     # You get a prompt; enter an XPath query
Richard Gasquet                                    # The result string is printed
>

Behind the scenes, the remote web page is stored in a Tempfile for the duration of the session, which keeps XPath queries fast.
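The caching idea can be sketched with Ruby's standard Tempfile class. This is only an illustration under assumptions, not Scrapouille's actual code: the helper name cache_page and the file name prefix are invented here, and a literal string stands in for the downloaded page so the snippet runs offline.

```ruby
require 'tempfile'

# Hypothetical helper: copy a page body into a Tempfile once, so that later
# queries read from local disk instead of fetching the page again.
def cache_page(body)
  file = Tempfile.new(['scrapouille-page', '.html'])
  file.write(body)
  file.flush   # make sure the content is on disk before anyone reads it
  file.rewind
  file
end

# In a real session the body would come from the remote page; a literal
# string keeps this sketch runnable without network access.
page = cache_page('<div class="player-name"><h1>Richard Gasquet</h1></div>')

# Every query after the first download only touches the local copy.
puts File.read(page.path)
puts File.read(page.path)

# Close and delete the temporary file when the session ends.
page.close!
```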

You can also interact directly with a local file:

$ scrapouille /Users/simon/web/player.html         # Launch scrapouille with a file path
> //div[@class='player-name']/h1/child::text()     # Enter your XPath query
Richard Gasquet                                    # The result string is printed
>

Scraping programmatically

Define a scraper

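scraper = Scrapouille.new do
  # Each rule names a result key and the XPath that locates its value
  scrap 'fullname', at: "//div[@class='player-name']/h1/child::text()"
  scrap 'image_url', at: "//div[@id='basic']//img/attribute::src"
  # An optional block post-processes the matched text:
  # here it strips the '#' and converts the rest to an Integer
  scrap 'rank', at: "//div[@class='position']/text()" do |c|
    Integer(c.sub('#', ''))
  end
end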

Use the scraper instance on a URI (as defined by open-uri: filepath, http, ...)

results = scraper.scrap!('http://tennis-player.com/richard-gasquet')
results['fullname'] # => 'Richard Gasquet'

You can also run your scraper using a local HTML filepath for testing purposes

scraper.scrap!(File.join('..', 'player.html'))