Webtractor
The Webtractor is a ruby library which is able to extract main content from webpages like news, blogs, etc. As a result you can just have a main content without any boilerplate (menu, footer, comments, etc).
Installation
You can install it directly via gem:
gem install webtractor
Or you can put it in your Gemfile:
gem 'webtractor'
Then run:
bundle install
Basic usage
extractor = Webtractor::Extractor.new
result = extractor.extract_from_url
'http://techcrunch.com/2014/05/24/dont-believe-anyone-who-tells-you-learning-to-code-is-easy/'
puts result.text
Or
extractor = Webtractor::Extractor.new
result = extractor.extract '<html><body>...</body></html>'
Or
page = Nokogiri::HTML(...)
extractor = Webtractor::Extractor.new
result = extractor.extract_from_xml page
You can also access Nokogiri document from result via xml attribute:
puts result.xml.xpath('...').text
Advanced usage
Process of getting main content from the webpage is really simple. It consists of applying multiple filters on the document where every filter gets on input output of the last applied filter.
You can look at the names of default filters:
p Webtractor::Filters::DefaultFilter.new.filters.map{|f| f.class.to_s}
You can remove any filter:
extractor.remove_filter Webtractor::Filters::RemoveComments
Or you can also create your own filter. It can be any class which implements process method which takes page as an argument and returns page:
class RemoveBolds
def process page
page.css('b').remove
page
end
end
extractor.add_filter RemoveBolds.new
License
This library is distributed under the Beerware license.