Project

scrape

Repository is archived
No commit activity in last 3 years
No release in over 3 years
An easy-to-use utility for scraping websites using a DSL similar to Rake.
 Project Readme

Scrape

A really simple web scraper.

site "https://github.com/explore" # The site to scrape. Used as the base address.
match /evilmarty/ do |doc| # A regexp/string/proc to match against the current URL.
  doc.search('a[href]') # doc is the Nokogiri document for the contents of the current URL.
end

site "http://www.tumblr.com" # Multiple sites can be defined.
queue "http://www.tumblr.com/tagged" # Add specific URLs to scrape.
match "/tagged" do |doc|
  # Do whatever we want with the document.
end
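
The comment on match says a matcher can be a regexp, a string, or a proc. The sketch below illustrates one way such a check could be dispatched. It is not taken from the gem's source: url_matches? is a hypothetical helper, and treating a string matcher as a substring test is an assumption.

```ruby
# Hypothetical helper, NOT part of the scrape gem: decides whether a
# matcher (Regexp, String, or Proc) applies to a given URL.
def url_matches?(matcher, url)
  case matcher
  when Regexp then matcher.match?(url)    # pattern found anywhere in the URL
  when String then url.include?(matcher)  # assumption: plain substring test
  when Proc   then !!matcher.call(url)    # caller-supplied predicate
  else raise ArgumentError, "unsupported matcher: #{matcher.inspect}"
  end
end

# Example calls using the URLs from the Scrapefile above.
url_matches?(/evilmarty/, "https://github.com/evilmarty/scrape")
url_matches?("/tagged", "http://www.tumblr.com/tagged/ruby")
url_matches?(->(u) { u.start_with?("https://") }, "https://github.com/explore")
```

Dispatching on the matcher's class keeps the DSL call sites short, since the same match keyword accepts all three forms.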

Usage

After creating a Scrapefile, simply run:

scrape -f [FILE]

If no file is specified, Scrapefile is used by default.
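
The default-file behaviour can be pictured with a small sketch built on Ruby's standard OptionParser. This is an illustration only, not the gem's actual command-line code: scrapefile_path is a hypothetical name, and only the -f flag shown above is assumed.

```ruby
require "optparse"

# Hypothetical helper, NOT the gem's CLI: returns the file named by -f,
# or "Scrapefile" when the flag is absent.
def scrapefile_path(argv)
  path = "Scrapefile" # default described in the Usage section
  OptionParser.new do |opts|
    opts.on("-f FILE", "Scrapefile to load") { |file| path = file }
  end.parse(argv) # parse (not parse!) leaves argv unmodified
  path
end

scrapefile_path([])                # falls back to the default name
scrapefile_path(["-f", "sites.rb"]) # uses the file given on the command line
```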

Installation

Install the gem:

gem install scrape

Or download the source by cloning the repository:

git clone https://github.com/evilmarty/scrape.git

Contribute

Please fork the repository and open a pull request on GitHub.

If you discover an issue, please report it.

TODO

  • Fix bugs
  • Depth limiting
  • Better docs