HtmlsToPdf
DESCRIPTION
HtmlsToPdf enables you to package one or more (ordered) HTML pages as a PDF.
WHY?
I often see multi-page websites with content I would rather have in a single PDF file for searching and offline viewing. Examples include the Ruby on Rails Guides and the RSpec documentation.
Viewing docs offline also reduces browser "tab-itis," browser crashes, and unnecessary re-downloading of server content.
REQUIREMENTS
I have run this only on Linux. It likely works on OS X. It may not work on Windows.
HtmlsToPdf uses the PDFKit gem, which itself uses the wkhtmltopdf program, which uses qtwebkit.
Dependency chain summary: HtmlsToPdf -> PDFKit -> wkhtmltopdf -> qtwebkit -> webkit
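Because everything in this chain ultimately relies on the wkhtmltopdf program, it can help to confirm the executable is on your PATH before running a script. The following is a minimal sketch in plain Ruby for Unix-like systems. The helper name find_executable is hypothetical and is not part of HtmlsToPdf or PDFKit. PDFKit can also be configured to locate the binary elsewhere, so treat a miss here as a hint rather than proof.

```ruby
# Hypothetical helper (not part of HtmlsToPdf or PDFKit): return the full
# path of an executable found on PATH, or nil if it is not there.
def find_executable(name)
  ENV.fetch('PATH', '').split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

path = find_executable('wkhtmltopdf')
puts(path ? "wkhtmltopdf found at #{path}" : 'wkhtmltopdf not found on PATH')
```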
For information on qtwebkit:
For information on wkhtmltopdf:
For information on PDFKit:
BASIC USAGE
Create a new HtmlsToPdf object, passing in all your configuration options. Then call create_pdf on the new object:
require 'rubygems'
require 'htmls_to_pdf'
config = {}
config[:urls] = ['http://.../url1.htm', 'https://.../url2.html']
config[:savedir] = '~/my/savedir'
config[:savename] = 'Name_to_save_file_as.pdf'
config[:css] = ['http://www.example.com/css_file.css',
                'h1 {color: red; margin: 10px 5px;} p {color: blue; border: 1px solid green; font-size: 80%;}']
HtmlsToPdf.new(config).create_pdf
(Alternatively, you can set configuration options by calling setters on an HtmlsToPdf instance, e.g.: h2p = HtmlsToPdf.new({}); h2p.savedir = '~/my/savedir')
OPTIONS
config[:css]
takes an array of CSS file URLs and/or valid CSS strings (you can mix URLs and CSS strings within an array) to apply during PDF rendering. (If you have just one CSS URL/string, you can pass it without an array.)
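The mixed-input rule above can be sketched in plain Ruby. This illustrates how such a value could be normalized and is not the gem's actual code. The helper name split_css_sources and the URL test (a leading http:// or https://) are assumptions.

```ruby
# Hypothetical helper (not part of HtmlsToPdf): accept a single value or an
# array, and separate CSS file URLs from inline CSS strings.
def split_css_sources(css)
  entries = Array(css) # wraps a lone string in an array, leaves arrays alone
  urls, inline = entries.partition { |entry| entry =~ %r{\Ahttps?://}i }
  { :urls => urls, :inline => inline }
end

mixed = ['http://www.example.com/css_file.css', 'h1 {color: red;}']
p split_css_sources(mixed)
p split_css_sources('p {color: blue;}')
```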
config[:debug]
(default: false) determines whether the program outputs verbose information while processing create_pdf()
config[:overwrite_existing_pdf]
(default: false) determines whether the program can overwrite a previously generated PDF file
config[:options]
takes a hash of options that are passed through to PDFKit
config[:remove_css_files]
(default: true) determines whether CSS files used to generate the PDF file are deleted or retained. You probably want to set this to false if you want to modify the CSS file(s).
config[:remove_html_files]
(default: true) determines whether HTML files downloaded from websites and used to generate the PDF file are deleted or retained. You probably want to set this to false if you think you may want to regenerate the PDF, perhaps because you're tweaking the CSS file to adjust rendering.
config[:remove_tmp_pdf_files]
(default: true) determines whether temporary PDF files (one per HTML file) created during the PDF generation process are deleted or retained. You probably want to accept the default and always regenerate the temporary PDFs.
config[:remove_temp_files]
(default: false) sets :remove_css_files, :remove_html_files, and :remove_tmp_pdf_files all to true
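The relationship between :remove_temp_files and the three specific flags can be sketched as follows. The helper expand_remove_temp_files is hypothetical and only mirrors the description above. The gem may implement this differently.

```ruby
# Hypothetical helper mirroring the description above: when
# :remove_temp_files is true, switch on all three specific removal flags.
# Returns a new hash and leaves the original untouched.
def expand_remove_temp_files(config)
  return config unless config[:remove_temp_files]
  config.merge(:remove_css_files => true,
               :remove_html_files => true,
               :remove_tmp_pdf_files => true)
end

config = { :remove_css_files => false, :remove_temp_files => true }
p expand_remove_temp_files(config)
```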
EXAMPLES
You will find 20 example scripts in the /examples directory. Each creates a PDF from a website:
- The 12 Factor App (Adam Wiggins)
- Advanced Rails - Five-Day (Jumpstart Labs)
- Backbone Fundamentals (Addy Osmani)
- Bash Guide (Greg Wooledge)
- Coffeescript Meet Backbone.js (Adam J. Spooner)
- Coffeescript Cookbook (Various authors)
- Coffeescript official documentation
- Exploring Coffeescript (ElegantCode.com)
- Jasmine Wiki (Pivotal Labs)
- The Little Book on Coffeescript (Alex MacCaw)
- Natural Language Processing for the Working Programmer (Daniël de Kok)
- Learn Python the Hard Way (Zed A. Shaw)
- Practicing Ruby Vol 2 (Gregory Brown)
- The Python Tutorial
- Rails 3.1 release notes
- Ruby on Rails Guides
- RSpec-Rails documentation
- RSpec documentation
- Learn Ruby the Hard Way (Zed A. Shaw)
- RubyGems User Guide
After you install HtmlsToPdf and its dependencies, you can write an ordinary Ruby script to save multiple ordered HTML pages as a single PDF.
EXAMPLE 1: Single HTML page without CSS, with debugging
Annotated version of /examples/get_rails_3_1_release_notes.rb:
# require the gem
require 'rubygems'
require 'htmls_to_pdf'
# Get 'Rails 3.1 Release Notes' as pdf file
# Source: 'http://guides.rubyonrails.org/3_1_release_notes.html'
# create an empty hash to hold your configuration options
config = {}
config[:urls] = ['http://guides.rubyonrails.org/3_1_release_notes.html']
# enable verbose messages during PDF creation process
config[:debug] = true
# set a :savedir key with a string value indicating the directory to create
# your PDF file in. If the directory does not exist, it will be created
config[:savedir] = '~/Tech/Rails/3.1'
# set a :savename key with a string value indicating the name of the PDF file
config[:savename] = 'Rails_3.1_Release_Notes.pdf'
# create a new HtmlsToPdf object, passing in your hash, and then call create_pdf
# on the new object
HtmlsToPdf.new(config).create_pdf
EXAMPLE 2: Multiple HTML pages without CSS
Annotated version of /examples/get_rubygems_user_guide.rb:
# require the gem
require 'rubygems'
require 'htmls_to_pdf'
# Get 'RubyGems User Guide' as pdf file
# Source: 'http://docs.rubygems.org/read/book/1'
# create an empty hash to hold your configuration options
config = {}
# set a :urls key with a value of an array containing all the
# urls you want in your PDF (in the order you want them)
config[:urls] = ['http://docs.rubygems.org/read/book/1']
# I have no idea why these chapters are numbered as they are!
[1, 2, 3, 4, 16, 7, 5, 6, 21].each do |val|
  config[:urls] << 'http://docs.rubygems.org/read/chapter/' + val.to_s
end
# set a :savedir key with a string value indicating the directory to create
# your PDF file in. If the directory does not exist, it will be created
config[:savedir] = '~/Tech/Ruby/GEMS/DOCUMENTATION'
# set a :savename key with a string value indicating the name of the PDF file
config[:savename] = 'RubyGems_User_Guide.pdf'
# create a new HtmlsToPdf object, passing in your hash, and then call create_pdf
# on the new object
HtmlsToPdf.new(config).create_pdf
EXAMPLE 3: Multiple HTML pages with CSS & PDFKit formatting options
Annotated version of /examples/get_coffeescript_meet_backbone.rb:
require 'rubygems'
require 'htmls_to_pdf'
# Get 'CoffeeScript, Meet Backbone.js' as pdf file
# Source: 'http://adamjspooner.github.com/coffeescript-meet-backbonejs/'
config = {}
config[:urls] = ['http://adamjspooner.github.com/coffeescript-meet-backbonejs/']
(1..5).each do |val|
  config[:urls] << 'http://adamjspooner.github.com/coffeescript-meet-backbonejs/0' + val.to_s + '/docs/script.html'
end
config[:savedir] = '~/Tech/Javascript/COFFEESCRIPT/BACKBONE.JS'
config[:savename] = 'CoffeeScript_Meet_Backbone.js.pdf'
# If a :css key is given with an array value, the CSS files in the array will be used to generate
# the PDF document. This allows you to modify the CSS file(s) to, for example, hide HTML headers,
# sidebars and footers you do not wish to appear in your PDF.
config[:css] = ['http://adamjspooner.github.com/coffeescript-meet-backbonejs/05/docs/docco.css']
# If an :options key is passed with a hash value, that hash is passed through to PDFKit (and on to wkhtmltopdf).
# Many options are available through wkhtmltopdf; see the wkhtmltopdf documentation:
# http://madalgo.au.dk/~jakobt/wkhtmltoxdoc/wkhtmltopdf-0.9.9-doc.html
config[:options] = {:page_size => 'Letter', :orientation => 'Landscape'}
HtmlsToPdf.new(config).create_pdf
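The URL list in this example can also be built with Range#map and Kernel#format, which makes the zero-padding explicit. This sketch uses the same base URL as above. The variable names are illustrative. It pads to two digits, so it matches the string concatenation above only for chapter numbers below 10.

```ruby
base = 'http://adamjspooner.github.com/coffeescript-meet-backbonejs/'

# Start with the index page, then append chapters 01 through 05.
urls = [base] + (1..5).map { |n| format('%s%02d/docs/script.html', base, n) }

puts urls
```

The resulting array could then be assigned to config[:urls] in place of the loop.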
EXAMPLE 4: Multiple HTML pages with hand-modified CSS file to adjust rendering
Annotated version of /examples/get_ruby_core_docs.rb:
require 'rubygems'
require 'htmls_to_pdf'
# Get 'Ruby Core documentation' as pdf file
# Source: 'http://www.ruby-doc.org/core-1.9.3/'
config = {}
config[:urls] = %w(
ARGF.html
ArgumentError.html
Array.html
BasicObject.html
...
ZeroDivisionError.html
fatal.html)
config[:urls] = config[:urls].map { |u| 'http://www.ruby-doc.org/core-1.9.3/' + u }
config[:savedir] = '~/Tech/Ruby/DOCUMENTATION'
config[:savename] = 'Ruby_Core_docs.pdf'
# Specify a CSS file
config[:css] = 'http://www.ruby-doc.org/core-1.9.3/css/obf.css'
# Tell HtmlsToPdf not to remove the CSS file
config[:remove_css_files] = false
# You are now free to create an "obf.css" file in the directory
# and edit it however you choose. It will not be overwritten.
# (Alternatively, you can run the program once and then modify
# the downloaded CSS file.)
#
# I added the following to the CSS file to suppress unwanted output:
#
# .info, noscript, #footer, #metadata, #actionbar, .dsq-brlink {
# display: none;
# width: 0;
# }
# .class #documentation, .file #documentation, .module #documentation {
# margin: 2em 1em 5em 1em;
# }
#
# If you're playing around with CSS to optimize the display in your
# PDF, I recommend you set config[:remove_html_files] = false to
# avoid repeatedly downloading the HTML files from the server.
HtmlsToPdf.new(config).create_pdf
EXAMPLE 5: Using CSS string to remove unwanted cruft
Abbreviated version of /examples/get_jasmine_wiki.rb:
# When I tried to create this PDF, lots of unwanted formatting (headers, footers, etc.) appeared in the PDF.
# When this happens, I tell the HtmlsToPdf instance to NOT re-download the content each time:
config[:remove_css_files] = false
config[:remove_html_files] = false
config[:overwrite_existing_pdf] = true
# And then I start building up a CSS string I pass into config[:css] that suppresses the unwanted output:
config[:css] = 'div#header{display:none;} ul.tabs{display:none;} div#logo-popup{display:none;} div#footer{display:none;} div#markdown-help{display:none;} div.pagehead{display:none;} ul.wiki-actions{display:none;} div#keyboard_shortcuts_pane{display:none;} div.js-hidden-pane{display:none;} div#ajax-error-message{display:none;}'
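A long CSS string like the one above is easier to maintain when it is generated from a list of selectors. The sketch below is one way to do that in plain Ruby. The selectors are copied from the example, and the variable names are illustrative.

```ruby
# Selectors copied from the example above.
selectors = %w(
  div#header ul.tabs div#logo-popup div#footer div#markdown-help
  div.pagehead ul.wiki-actions div#keyboard_shortcuts_pane
  div.js-hidden-pane div#ajax-error-message
)

# Produce one "selector{display:none;}" rule per selector.
css = selectors.map { |selector| "#{selector}{display:none;}" }.join(' ')

puts css
```

The resulting string could then be assigned to config[:css]. Hiding another element becomes a one-word addition to the list.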
LEGAL DISCLAIMER
Please use at your own risk. I guarantee nothing about this program.