Copy-Paste PDF
Tabula was written for those cases where you can’t easily copy-and-paste tables from a PDF to a spreadsheet. Surprisingly, Tabula sometimes fails where copy-and-pasting succeeds. This project is for those cases when copy-and-pasting is all you need (and where nothing else works).
This gem only works on OS X. It works best on PDFs whose source materials are Excel spreadsheets.
Getting Started
PDF to CSV
Install with:
gem install --no-wrappers copy_paste_pdf
If you omit the --no-wrappers
switch, the AppleScript will not install properly. You may run the script with:
copy-paste-pdf.applescript /path/to/input.pdf /path/to/output.csv
- The script will open the PDF in Preview and copy the contents of the PDF
- The script will open Microsoft Excel, paste the contents and save as CSV
If you want the script to quit Preview and Excel once it's done, pass a third argument, like:
copy-paste-pdf.applescript /path/to/input.pdf /path/to/output.csv true
The script may pinwheel while copying the contents of the PDF and while pasting the contents to the spreadsheet. If it looks like nothing is happening, wait a few seconds.
You can work in other applications while the script is running - just don't use the clipboard as it may interfere with the script.
This method is admittedly not very efficient. Running time averages under 2 seconds per page but varies considerably depending on your system's load.
Data Cleaning
The Ruby gem defines helper methods for cleaning the CSV. In most cases, the PDF to CSV conversion will create many empty rows. You can easily remove those rows with:
require 'csv'
require 'copy_paste_pdf'
table = CopyPastePDF::Table.new(CSV.read('/path/to/output.csv'))
table.remove_empty_rows!
CSV.open('/path/to/clean.csv', 'w') do |csv|
table.each do |row|
csv << row
end
end
If the table in the PDF contained vertically-merged cells, then, in the CSV, the first of the merged cells will have a value and the others will be empty. To copy the value of the first cell to the others, use the copy_into_cell_below
method, which accepts the indices of columns containing merged cells:
table.copy_into_cell_below(0, 3, 4)
Sometimes, if a cell contains multiple lines of text, the PDF to CSV conversion will incorrectly break the cell into multiple rows. To remove the spurious row and merge its values into the row above, use the merge_into_cell_above
method, which accepts the indices of columns in which this error occurs:
table.merge_into_cell_above(1, 2)
With additional time and effort, these two methods can be made to operate without needing columns as hints.
Troubleshooting
If you see warnings on the command-line like:
2013-10-09 14:30:03.704 osascript[2056:707] Error loading /Library/ScriptingAdditions/Adobe Unit Types.osax/Contents/MacOS/Adobe Unit Types: dlopen(/Library/ScriptingAdditions/Adobe Unit Types.osax/Contents/MacOS/Adobe Unit Types, 262): no suitable image found. Did find:
/Library/ScriptingAdditions/Adobe Unit Types.osax/Contents/MacOS/Adobe Unit Types: no matching architecture in universal wrapper
osascript: OpenScripting.framework - scripting addition "/Library/ScriptingAdditions/Adobe Unit Types.osax" declares no loadable handlers.
Developers
If, like me, you almost never write AppleScript, you can access much of AppleScript's documentation through Apple's AppleScript Editor. See, for example, how to access the entries about Microsoft Excel.
Why?
Most of the PDFs I work with contain no tables. In those cases I either:
- Run
pdftotext filename.pdf
to convert the PDF to text, and write a script using regular expressions to parse the output. - Run
pdftotext -layout filename.pdf
to convert the PDF to text while preserving the text layout – very useful when working with two-column layouts. - Use commercial software like Adobe Acrobat Pro to save the PDF to another format, usually Excel.
- I recently learned that Apple's Automator has an
Extract PDF Text
action which performs well.
For PDFs containing tables, I discovered that copy-pasting from Apple's Preview to Microsoft Excel worked better than all alternatives tested, for the PDFs I was interested in.
Related projects
- If you need to extract tables from text-based PDFs, see Tabula
- If you need to extract text from text-based PDFs, use docsplit or
pdftotext
- If you need to extract tables from image-based PDFs, see Carpenter or DocHive
- If you need to extract text from image-based PDFs, use docsplit or Tesseract
- If you have many volunteers, consider using Crowdcrafting to transcribe tables
You may also be interested in the Open Knowledge Foundation's messytables and pdftables.
Copyright (c) 2013 James McKinney, released under the MIT license