Project

html2tex

0.01
No commit activity in last 3 years
No release in over 3 years
Converts an HTML document to a LaTeX version with reasonably sensible markup.
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
2025
 Dependencies

Development

Runtime

>= 4.2.0
>= 0.2.0
 Project Readme

HTML2TeX

This is a simple library and command-line tool to convert HTML documents into LaTeX. The intended use is for preparing ebooks, but it might also be useful for other purposes.

Command-line use

Usage: html2tex [options] input.html [output.tex]
    -t, --title TITLE                Set (or override) the title of the document
    -a, --author AUTHOR              Set (or override) the author of the document
    -c, --document-class CLASS       Set the LaTeX document class to be used; default is book
    -h, --help                       Display this screen

If the output file is not specified, the program will create an output file based on the input filename, replacing the extension with .tex – or, if there is no extension, by appending .tex.

The title and author will, by default, be extracted from the HTML document, using either Dublin Core metadata or the HTML title and an author META element.

Library use

require "html2tex"

# Return output as a string
tex = HTML2TeX.new(html).to_tex

# Write directly to a stream
File.open("output.tex", "w") do |f|
  HTML2TeX.new(html).to_tex(f)
end

A hash of options can be supplied as the second argument to to_tex; the available keys are:

  • :author
  • :title
  • :class

See the command-line section above for an explanation of these.

Dependencies

  • htmlentities, to decode character entities
  • rubypants, to convert ' and " into something more appealing

Limitations

The HTML recognised is currently limited to headings, paragraphs, and bold and italic mark-up. This covers the majority of novels, but it's far from comprehensive.

StringScanner is used to process the HTML, but cannot read from a stream directly, so the entire input document must be read into memory as a string first.

UTF-8 is assumed everywhere; other character encodings will produce odd results. If the HTML file to be processed is not in UTF-8 encoding with unix line endings (at least, on Linux/OS X/etc.), fix that first. The usual suspects will help here:

iconv -f windows-1252 -t utf-8 < somefile-win1252.html > somefile-utf8.html
dos2unix somefile-utf8.html

Next steps

If you have XeLaTex, you can easily turn the generated .tex file into a PDF:

xelatex my-book.tex

For better results, tweak the font settings or use a custom class like this.