SpiderCrawl
A Ruby gem that crawls a domain and gives you information about the pages it visits.
With the help of Nokogiri, SpiderCrawl parses each page and returns its title, links, CSS, words, and much more. You can also customize what happens before and after each fetch request.
Long story short: feed a URL to SpiderCrawl and it will crawl and scrape the content for you.
Installation
Add this line to your application's Gemfile:
gem 'spidercrawl'
And then execute:
$ bundle
Or install it yourself as:
$ gem install spidercrawl
Usage
Start crawling a domain by calling Spiderman.shoot(url); it returns a list of pages it has crawled and scraped:
pages = Spiderman.shoot('http://forums.hardwarezone.com.sg/hwm-magazine-publication-38/')
To restrict the crawl to URLs that match a pattern, pass a :pattern option:
pages = Spiderman.shoot('http://forums.hardwarezone.com.sg/hwm-magazine-publication-38/',
  :pattern => Regexp.new('^http:\/\/forums\.hardwarezone\.com\.sg\/hwm-magazine-publication-38\/?(.*\.html)?$'))
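Hand-escaping a long URL like this is error-prone. As a small convenience (this is plain Ruby's Regexp.escape, not a SpiderCrawl-specific API), the same pattern can be built from the base URL:

base = 'http://forums.hardwarezone.com.sg/hwm-magazine-publication-38/'
pattern = Regexp.new("^#{Regexp.escape(base.chomp('/'))}\\/?(.*\\.html)?$")
pages = Spiderman.shoot(base, :pattern => pattern)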
Access the following scraped data:
pages.each do |page|
  page.url            # URL of the page
  page.scheme         # Scheme of the page (http, https, etc.)
  page.host           # Hostname of the page
  page.base_url       # Root URL of the page
  page.doc            # Nokogiri document
  page.headers        # Response headers of the page
  page.title          # Title of the page
  page.links          # Every link found in the page, returned as an array
  page.internal_links # Internal links only, returned as an array
  page.external_links # External links only, returned as an array
  page.emails         # Every email found in the page, returned as an array
  page.images         # Every image found in the page, returned as an array
  page.words          # Every word that appears in the page, returned as an array
  page.css            # CSS stylesheets used in the page, returned as an array
  page.content        # Contents of the HTML document as a string
  page.content_type   # Content type of the page
  page.text           # Text of the page with HTML tags stripped
  page.response_code  # HTTP response code of the page
  page.response_time  # HTTP response time of the page
  page.crawled_time   # Time the page was crawled/fetched, in milliseconds since epoch
end
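Putting it together, here is a quick sketch that prints a summary of each crawled page using only the accessors listed above (the report format is just for illustration):

pages = Spiderman.shoot('http://forums.hardwarezone.com.sg/hwm-magazine-publication-38/')

pages.each do |page|
  puts "#{page.response_code} #{page.url}"
  puts "  title:          #{page.title}"
  puts "  internal links: #{page.internal_links.size}"
  puts "  images:         #{page.images.size}"
end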
TODO
- Include Faraday
- Replace the Curb dependency with Patron
Dependencies
- Colorize
- Curb
- Nokogiri
- Typhoeus
Contributing
- Fork it ( https://github.com/belsonheng/spidercrawl/fork )
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request
License
SpiderCrawl is released under the MIT license.