Extractula
http://github.com/pauldix/extractula
Summary
Extracts content like title, summary, and images from web pages like Dracula extracts blood: with care and finesse.
Description
Extractula attempts to extract the core content from a web page. For a news article or blog post this would be the content of the article itself. For a GitHub project this would be the main README file. The library also lets you plug in your own custom extractors, which is useful when you want dedicated support for specific popular sites.
Installation
gem install extractula --source http://gemcutter.org
Use
    require 'extractula'

    url = "..."       # the url the html came from
    some_html = "..." # get some html to extract, yo!
    extracted_content = Extractula.extract(url, some_html)
    extracted_content.title       # pulled from the page
    extracted_content.url         # what you passed in
    extracted_content.content     # the main content body (article, blog post, etc)
    extracted_content.summary     # an automatically generated plain text summary of the content
    extracted_content.image_urls  # the urls for images that appear in the content
    extracted_content.video_embed # the embed code if a video is embedded in the content

    Extractula.add_extractor(SomeClass) # so you can add a custom extractor
Custom Extractors
The “Use” section showed how to add a custom extractor. A custom extractor is a class that implements, at a minimum, the following methods.
    class MyCustomExtractor
      def self.can_extract?(url, html)
        # should return true if this extractor can handle the page
      end

      def extract(url, html)
        # should return an Extractula::ExtractedContent object
      end
    end
Notice that can_extract? is a class method, while extract is an instance method that should return an ExtractedContent object.
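As a rough illustration of that contract, here is a minimal sketch of a custom extractor. Everything specific in it is an assumption made for the example: the ExampleBlogExtractor class, the blog.example.com host, the regex-based article lookup, and the extract_with helper, which only guesses at how a dispatcher might call the two methods and is not the library's own code. A small stand-in for Extractula::ExtractedContent is defined so the sketch runs without the gem; it is skipped when the real class is already loaded.

```ruby
require 'uri'

# Stand-in so the sketch runs without the gem; skipped when the real
# Extractula::ExtractedContent is loaded. Only :url and :content are modelled.
unless defined?(Extractula::ExtractedContent)
  module Extractula
    class ExtractedContent
      attr_reader :url, :content

      def initialize(attributes = {})
        @url     = attributes[:url]
        @content = attributes[:content]
      end
    end
  end
end

# Hypothetical extractor for pages hosted on blog.example.com.
class ExampleBlogExtractor
  # Class method: it can be asked about a page before any instance exists.
  def self.can_extract?(url, html)
    URI.parse(url).host == "blog.example.com"
  rescue URI::InvalidURIError
    false
  end

  # Instance method: take the first <article> element, or the whole page
  # if there is none, and wrap it in an ExtractedContent.
  def extract(url, html)
    body = html[%r{<article[^>]*>(.*?)</article>}mi, 1] || html
    Extractula::ExtractedContent.new(:url => url, :content => body.strip)
  end
end

# Hypothetical dispatcher: pick the first class that claims the page,
# then instantiate it and extract. Returns nil if no class claims it.
def extract_with(extractors, url, html)
  extractor_class = extractors.find { |klass| klass.can_extract?(url, html) }
  extractor_class && extractor_class.new.extract(url, html)
end

html   = "<html><body><article> <p>Hello</p> </article></body></html>"
result = extract_with([ExampleBlogExtractor], "http://blog.example.com/post", html)
puts result.content
```

With the gem loaded, a class like this would be registered through Extractula.add_extractor, as shown in the “Use” section, rather than passed to a helper by hand.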
ExtractedContent
The ExtractedContent object holds the results of an extraction. It also has methods that automatically generate a summary, image_urls, and video_embed code from the content. If you implement a custom extractor and want to provide the summary, image_urls, and video_embed yourself, simply pass those values into the constructor for ExtractedContent. Here are some examples:
    # only url and content given: the rest is generated on demand
    extracted_content = Extractula::ExtractedContent.new(:url => "http://pauldix.net", :content => "...some content...")
    extracted_content.summary     # auto-generated from content
    extracted_content.image_urls  # auto-generated from content
    extracted_content.video_embed # auto-generated from content

    # values passed in are returned as given
    extracted_content = Extractula::ExtractedContent.new(:url => "http://pauldix.net", :content => "...some content...", :summary => "a summary", :image_urls => ["foo.jpg"], :video_embed => "blah")
    extracted_content.summary     # "a summary"
    extracted_content.image_urls  # ["foo.jpg"]
    extracted_content.video_embed # "blah"
Any number of these values, including none, can be passed into the ExtractedContent constructor. It will auto-generate the ones not passed in and keep the ones that were.
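The "use what was passed in, otherwise generate it" behaviour can be pictured with a small sketch. LazyContent below is a hypothetical stand-in, not the gem's ExtractedContent, and the tag-stripping summary and the img regex are simplifying assumptions chosen only to keep the example short; the library's real generation logic may differ.

```ruby
# Hypothetical stand-in, NOT the gem's class: it only illustrates falling
# back to a value derived from the content when none was passed in.
class LazyContent
  attr_reader :url, :content

  def initialize(attributes = {})
    @url        = attributes[:url]
    @content    = attributes[:content].to_s
    @summary    = attributes[:summary]
    @image_urls = attributes[:image_urls]
  end

  # A summary passed to the constructor wins; otherwise strip the tags,
  # collapse whitespace, and keep the first 100 characters.
  def summary
    @summary ||= content.gsub(/<[^>]+>/, " ").split.join(" ")[0, 100]
  end

  # A list passed to the constructor wins; otherwise collect the src
  # attribute of every <img> tag found in the content.
  def image_urls
    @image_urls ||= content.scan(/<img[^>]*\bsrc=["']([^"']+)["']/i).flatten
  end
end

auto = LazyContent.new(:url => "http://pauldix.net",
                       :content => '<p>Hello <img src="a.png"> world</p>')
auto.summary    # derived from the content
auto.image_urls # derived from the content

given = LazyContent.new(:url => "http://pauldix.net",
                        :content => "<p>Hello world</p>",
                        :summary => "a summary")
given.summary    # the value passed in
given.image_urls # derived, since no list was passed in
```

Memoizing with ||= means each value is computed at most once and only when asked for, so a custom extractor that supplies its own summary never pays for the automatic one.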
LICENSE
(The MIT License)
Copyright © 2009:
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
‘Software’), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED ‘AS IS’, WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.