PDFBox text extraction
This gem lets you extract plain text from PDF documents. It is a JRuby wrapper for the Apache PDFBox library.
Installation
Add this line to your application's Gemfile:
gem 'pdfbox_text_extraction'
And then execute:
$ bundle
Or install it yourself as:
$ gem install pdfbox_text_extraction
Usage
To extract all text on every page:
extracted_text = PdfboxTextExtraction.run(path_to_pdf)
To extract text inside a crop area:
extracted_text = PdfboxTextExtraction.run(
  path_to_pdf,
  {
    crop_x: 0,        # crop area top left corner x-coordinate
    crop_y: 1.0,      # crop area top left corner y-coordinate
    crop_width: 8.5,  # crop area width
    crop_height: 9.4, # crop area height
  }
)
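If you need to process several PDFs with the same settings, the calls above can be wrapped in a small helper. The following is only a sketch: extract_texts and DEFAULT_EXTRACTOR are hypothetical names that are not part of this gem, and the code assumes the gem is already loaded (for example via Bundler) under JRuby.

```ruby
# Hypothetical helper, not part of the gem.
# The default extractor calls PdfboxTextExtraction.run exactly as shown in
# the usage examples above. It assumes the gem has already been loaded.
DEFAULT_EXTRACTOR = lambda do |path, options|
  if options.nil?
    PdfboxTextExtraction.run(path)          # all text on every page
  else
    PdfboxTextExtraction.run(path, options) # text inside the crop area
  end
end

# Returns a Hash that maps each PDF path to its extracted text.
# `options` is the optional crop hash (crop_x, crop_y, crop_width, crop_height).
# `extractor` can be replaced with any callable, e.g. a stub in tests.
def extract_texts(paths, options = nil, extractor: DEFAULT_EXTRACTOR)
  paths.each_with_object({}) do |path, texts|
    texts[path] = extractor.call(path, options)
  end
end
```

For example, extract_texts(['a.pdf', 'b.pdf']) would return a hash with one entry per file. Passing the extractor as a keyword argument is a design choice that lets the helper be tested without JRuby or real PDF files.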
Contributing
- Fork it ( https://github.com/jhund/pdfbox_text_extraction/fork )
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request
Resources
License
Copyright
Copyright (c) 2016 Jo Hund. See (MIT) LICENSE for details.