SuperCrawler
An easy (yet efficient) Ruby gem to crawl your favorite website.
Quick Start
Open your terminal, then:
git clone https://github.com/htaidirt/super_crawler
cd super_crawler
bundle
./bin/console
Then, in the console:
sc = SuperCrawler::Crawl.new('https://gocardless.com')
sc.start(10) # => Start crawling the website using 10 threads
sc.render(5) # => Show the first 5 results of the crawling as sitemap
Installation
Add this line to your application's Gemfile:
gem 'super_crawler'
And then execute:
bundle install
Or install it yourself as:
gem install super_crawler
Want to experiment with the gem without installing it? Clone the repository (see Quick Start above) and run bin/console for an interactive prompt that will allow you to experiment.
Warning!
This gem is an experiment and is not meant for production use. Please use it with caution if you want to use it in your projects.
It also has a number of limitations that weren't addressed due to time constraints. You'll find more information on these limitations below.
SuperCrawler gem was only tested on MRI 2.3.1 and Rubinius 2.5.8.
Philosophy
Starting from a given URL, the crawler extracts all the internal links and assets within the page. The links are added to a list of unique links for further exploration. The crawler repeats the exploration, visiting each collected link, until no new link is found.
Because of the volume of work (potentially thousands of pages) and the network time needed to fetch each page's content, we use threads to perform near-parallel processing.
In order to keep the code readable and structured, we created two classes:
- SuperCrawler::Scrap is responsible for scraping a single page and extracting all relevant information (internal links and assets)
- SuperCrawler::Crawl is responsible for crawling a whole website by collecting and managing links (using SuperCrawler::Scrap on every internal link found). This class is also responsible for rendering results.
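As a rough sketch of the idea (not the gem's actual internals), a pool of worker threads can pull unexplored links from a shared queue, scrape each page with SuperCrawler::Scrap, and push newly discovered internal links back onto the queue:

require 'super_crawler'

start_url = 'http://example.com/'   # hypothetical starting point
queue     = Queue.new               # links waiting to be explored
visited   = { start_url => true }   # unique links already collected
mutex     = Mutex.new
queue << start_url

threads = Array.new(10) do          # 10 near-parallel workers
  Thread.new do
    # pop(true) raises when the queue is empty, which stops the worker.
    # (Simplified: a real crawler must also handle the race where other
    # workers are still about to enqueue new links.)
    while (url = (queue.pop(true) rescue nil))
      SuperCrawler::Scrap.new(url).get_links.each do |link|
        mutex.synchronize do
          next if visited[link]     # skip links already seen
          visited[link] = true
          queue << link             # schedule the new internal link
        end
      end
    end
  end
end
threads.each(&:join)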
More detailed use
Open your favorite ruby console and require the gem:
require 'super_crawler'
Scraping a single web page
Read the following if you would like to scrape a single web page and extract relevant information (internal links and assets).
page = SuperCrawler::Scrap.new( url )
Where url should be the URL of the page you would like to scrape.
Nota: If the given URL has a missing scheme (http:// or https://), SuperCrawler will prepend http:// to the URL.
Get the encoded URL
Run page.url to get the encoded URL.
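For instance, assuming a scheme-less URL and an illustrative path (the exact normalized output may differ):

page = SuperCrawler::Scrap.new('gocardless.com/about')
page.url # => "http://gocardless.com/about" (scheme prepended, URL encoded)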
Get internal links of a page
Run page.get_links to get the list of internal links in the page. An internal link is a link that has the same scheme and host as the provided URL. Subdomains are rejected.
This method searches the href attribute of all <a> anchor tags.
Nota:
- This method returns an array of absolute URLs (all internal links).
- Bad links and special links (like mailto and javascript) are discarded.
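An illustrative call (the returned links obviously depend on the page):

page.get_links
# => ["http://gocardless.com/", "http://gocardless.com/about", ...]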
Get images of a page
Run page.get_images to get a list of image links within the page. The image links are extracted from the src="..." attribute of all <img> tags.
Nota: Images included using CSS or JavaScript aren't detected by the method.
Nota 2: This method returns an array of absolute URLs.
Get stylesheets of a page
Run page.get_stylesheets to get a list of stylesheet links within the page. The links are extracted from the href="..." attribute of all <link rel="stylesheet"> tags.
Nota:
- Inline styling isn't yet detected by the method.
- This method returns an array of absolute URLs.
Get scripts of a page
Run page.get_scripts to get a list of script links within the page. The links are extracted from the src="..." attribute of all <script> tags.
Nota:
- Inline script isn't yet detected by the method.
- This method returns an array of absolute URLs.
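The three helpers follow the same pattern; an illustrative session (the URLs are made up):

page.get_images      # => ["http://gocardless.com/images/logo.png", ...]
page.get_stylesheets # => ["http://gocardless.com/css/main.css", ...]
page.get_scripts     # => ["http://gocardless.com/js/app.js", ...]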
List all assets of a page
Run page.get_assets to get a list of all assets (links of images, stylesheets and scripts) as a hash of arrays.
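Assuming the same keys used by the crawl results shown further below, the returned hash should look like this (illustrative):

page.get_assets
# => {
#      images:      [...array of image links...],
#      stylesheets: [...array of stylesheet links...],
#      scripts:     [...array of script links...]
#    }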
Crawling a whole web site
sc = SuperCrawler::Crawl.new(url)
where url is the URL of the website to crawl.
Next, start the crawler:
sc.start(number_of_threads)
where number_of_threads
is the number of threads that will perform the job (10 by default.) This can take some time, depending on the site to crawl.
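For example:

sc.start     # use the default 10 threads
sc.start(20) # or set the number of threads explicitly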
To access the crawl results, use the following:
sc.links # The array of unique internal links
sc.crawl_results # Array of hashes containing links and assets for every unique internal link found
To see the crawling as a sitemap, use:
sc.render(5) # Will render the sitemap of the first 5 pages
TODO: Add more sophisticated rendering methods that can render to files in different formats (HTML, XML, JSON, ...).
Tips on searching assets and links
After sc.start, you can access all collected resources (links and assets) using sc.crawl_results. This has the following structure:
[
  {
    url: 'http://example.com/',
    links: [...array of internal links...],
    assets: {
      images: [...array of image links...],
      stylesheets: [...array of stylesheet links...],
      scripts: [...array of script links...]
    }
  },
  ...
]
You can use sc.crawl_results.select{ |resource| ... } to select a particular resource.
Example:
images = sc.crawl_results.map{ |page| page[:assets][:images] }.flatten.uniq
# => Returns an array of all unique images found during the crawling
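In the same spirit, select lets you filter pages; for example, to keep only the pages that reference at least one script (illustrative):

pages_with_scripts = sc.crawl_results.select { |page| page[:assets][:scripts].any? }
# => Array of crawl result hashes for pages that load at least one script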
Get assets of a whole crawling
You can collect all the assets of a given type from a crawl into a single array, using the following:
images = sc.get_assets :images # => Returns an array of unique images
stylesheets = sc.get_assets :stylesheets # => Returns an array of unique stylesheets
scripts = sc.get_assets :scripts # => Returns an array of unique scripts
It is important to note that all the returned arrays contain unique absolute URLs. Note also that the assets are not necessarily internal assets.
Limitations
Currently, the gem has the following limitations:
- Subdomains are not considered as internal links
- A link with the same domain but a different scheme is ignored (http -> https, or the opposite)
- Only links within <a href="..."> tags are extracted
- Only image links within <img src="..."/> tags are extracted
- Only stylesheet links within <link rel="stylesheet" href="..." /> tags are extracted
- Only script links within <script src="..."> tags are extracted
- A page that is not accessible (not status 200) is not checked again later
Development
After checking out the repo, run bin/setup to install dependencies. Then, run rake test to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/htaidirt/super_crawler. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.
Please follow this process:
- Fork the project
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request
License
The gem is available as open source under the terms of the MIT License.