sq

sq is a web scraping tool for PDFs. Give it a URL and an optional regex, and it’ll download all the PDFs linked from that page.

Install

gem install sq

Usage

From the command-line:

$ sq [-V] [-o <directory>] [-F <format>] <url> [<regex>]

Available options:

-F: output format (see below), default is %s.pdf
-o: choose the output directory
-V: be more verbose
--formats: list available formats

The regex is case-sensitive and is tested against each PDF’s full URL, not just its filename.
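To illustrate the matching rule, here is a minimal sketch in plain Ruby. It is not sq's actual code: `filter_urls` is a hypothetical helper, the URLs are made up, and it assumes the regex is searched for anywhere in the URL (unanchored), which is what the `fiches/\d+` example suggests.

```ruby
# Hypothetical helper (not part of sq): keep only the URLs matching the regex.
# Assumption: the regex is unanchored and may match anywhere in the URL.
def filter_urls(urls, regex = nil)
  return urls if regex.nil? # no regex given: keep every PDF

  urls.select { |url| url.match?(regex) }
end

# Made-up URLs, for illustration only.
urls = [
  'http://example.com/cours/fiches/01.pdf',
  'http://example.com/cours/Fiches/02.pdf',
  'http://example.com/cours/slides/intro.pdf'
]

# The command-line argument is a string, so it is compiled first.
regex = Regexp.new('fiches/\d+')

selected = filter_urls(urls, regex)
# selected == ["http://example.com/cours/fiches/01.pdf"]
# "Fiches/02.pdf" is skipped because matching is case-sensitive.
```

Because the pattern is checked against the full URL, it can refer to directories in the path (`fiches/`) as well as to the filename itself.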

Examples

# Get all PDFs from a Web page
sq http://liafa.fr/~yunes/cours/interfaces/

# Use a regexp to get only those you want
sq http://liafa.fr/~yunes/cours/interfaces/ 'fiches/\d+'

# Be more verbose
sq -V http://liafa.fr/~yunes/cours/interfaces/ 'fiches/\d+'

# Add a filename format
sq -V http://liafa.fr/~yunes/cours/interfaces/ 'fiches/\d+' -F 'class-%Z.pdf'

Formats

The output format is used to build each PDF’s filename. It’s a string containing zero or more of the following placeholders, each of which is replaced by the corresponding value:

%n - PDF number, starting at 0
%N - PDF number, starting at 1
%z - same as %n, but zero-padded
%Z - same as %N, but zero-padded
%c - total number of PDFs
%s - name of the PDF, extracted from its URI, without `.pdf`
%S - name of the PDF, extracted from the link text
%_ - same as %S, but spaces are replaced with underscores
%- - same as %S, but spaces are replaced with hyphens
%% - literal %
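The placeholders above can be read as a simple substitution table. The following is a sketch of how such an expansion could work, not sq's actual implementation: `expand_format` and its keyword arguments are hypothetical, and the zero-padding width (the number of digits in the total count) is an assumption, since the padding width is not specified here.

```ruby
# Hypothetical sketch of format expansion (not sq's real code).
# index:     zero-based position of the PDF
# count:     total number of PDFs
# uri_name:  name taken from the URI, without ".pdf"
# link_text: text of the link pointing to the PDF
def expand_format(format, index:, count:, uri_name:, link_text:)
  width = count.to_s.length # assumed padding width

  values = {
    '%n' => index.to_s,
    '%N' => (index + 1).to_s,
    '%z' => index.to_s.rjust(width, '0'),
    '%Z' => (index + 1).to_s.rjust(width, '0'),
    '%c' => count.to_s,
    '%s' => uri_name,
    '%S' => link_text,
    '%_' => link_text.tr(' ', '_'),
    '%-' => link_text.tr(' ', '-'),
    '%%' => '%'
  }

  # Single pass over the string, so the "%" produced by "%%" is never
  # re-read as the start of another placeholder.
  format.gsub(/%[nNzZcsS_\-%]/) { |placeholder| values[placeholder] }
end

pdf = { index: 0, count: 12, uri_name: 'fiche1', link_text: 'Fiche 1' }

expand_format('%s.pdf', **pdf)        # => "fiche1.pdf"
expand_format('class-%Z.pdf', **pdf)  # => "class-01.pdf"
expand_format('%_ (%N of %c).pdf', **pdf) # => "Fiche_1 (1 of 12).pdf"
```

Doing the replacement in one `gsub` pass keeps `%%s` from being expanded twice: it yields the literal text `%s` rather than the PDF name.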

API

In a Ruby file:

require 'sq'

urls = SQ.query('http://example.com', /important/i)

Tests

$ git clone https://github.com/bfontaine/sq.git
$ cd sq
$ bundle install
$ rake test

Running the tests generates a coverage report at coverage/index.html, which you can open in a web browser.
