Rika
A JRuby wrapper for Apache Tika to extract text and metadata from various file formats.
More information about Apache Tika can be found here: http://tika.apache.org/
Jeremy's modifications
Basically, this fork just uses my own version of Tika with special email parsing fixes. It adds an X-Attachments metadata key (listing attachment filenames from emails) and removes bouncycastle from tika-parsers' requirements, because everything is awful.
For instance, Tika by itself only detects a file as an .eml if "Received: " is the very first string of bytes. I've changed it to detect an email when that string appears within the first 300 bytes, to cope with leaked emails that put non-standard headers before the Received: line.
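The difference between the two detection rules can be sketched in plain Ruby. This is only an illustration of the idea, not the patched Tika code (Tika's detection is implemented in Java). The method names `strict_email?` and `looks_like_email?` and the `window` and `marker` parameters are hypothetical, and the sketch assumes the marker has to fit entirely inside the first 300 bytes.

```ruby
require 'stringio'

# Hypothetical helpers for illustration; not part of Rika or Tika.

# Stock rule as described above: the marker must be the very first bytes.
def strict_email?(io, marker: 'Received: ')
  head = io.read(marker.bytesize) || ''
  head.b == marker.b
end

# Relaxed rule: the marker may appear anywhere inside the first `window` bytes.
def looks_like_email?(io, window: 300, marker: 'Received: ')
  head = io.read(window) || ''
  head.b.include?(marker.b)
end

# A leaked email with a non-standard header before the Received: line.
leaked = "X-Leak-Id: 42\r\nReceived: from mail.example.org\r\n"
puts strict_email?(StringIO.new(leaked))      # false
puts looks_like_email?(StringIO.new(leaked))  # true
```

Both helpers accept any IO-like object that responds to `read`, so an open `File` would work in place of the `StringIO` used here.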
Installation
Add this line to your application's Gemfile:
gem 'rika'
Note that this gem only works on JRuby.
Then execute:
$ bundle
Or install it yourself as:
$ gem install rika
Usage
For a quick start with the simplest use cases, the following functions each return what you need in a single call:
require 'rika'
content = Rika.parse_content('document.pdf') # string containing all content text
metadata = Rika.parse_metadata('document.pdf') # hash containing the document metadata
content, metadata = Rika.parse_content_and_metadata('document.pdf') # both of the above
For other use cases and finer control, you can work directly with the Rika::Parser object:
require 'rika'
parser = Rika::Parser.new('document.pdf')
# Return the content of the document:
parser.content
# Return the media type for the document:
parser.media_type
# => "application/pdf"
# Return the metadata field title if it exists:
parser.metadata["title"] if parser.metadata_exists?("title")
# Return all the available metadata keys that can be read from the document
parser.available_metadata
# Return only the first 10000 chars of the content:
parser = Rika::Parser.new('document.pdf', 10000)
parser.content # first 10000 chars returned
# Return content from URL
parser = Rika::Parser.new('http://riakhandbook.com/sample.pdf', 200)
parser.content
# Return the language for the content:
parser = Rika::Parser.new('german document.pdf')
parser.language
# => "de"
# Check whether the language identification is certain enough to be trusted:
parser.language_is_reasonably_certain?
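One way to combine the two language calls above is to use the detected language only when Rika reports it as reasonably certain. The helper name `trusted_language` and its `fallback` parameter are hypothetical. The sketch relies only on `language` and `language_is_reasonably_certain?` as shown above, so it works with any object that responds to both.

```ruby
# Hypothetical helper, not part of Rika.
# Returns the detected language when the parser considers the detection
# reasonably certain, and `fallback` (nil by default) otherwise.
def trusted_language(parser, fallback = nil)
  parser.language_is_reasonably_certain? ? parser.language : fallback
end
```

With a real parser this might be called as `trusted_language(Rika::Parser.new('german document.pdf'), 'unknown')`.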
Credits
The following people have contributed ideas, documentation, or code to Rika:
- Keith Bennett
- Richard Nyström
Contributing
- Fork it
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create new Pull Request