Nikkou
Extract useful data from HTML and XML with ease!
Description
Nikkou adds additional methods to Nokogiri to make extracting commonly-used data from HTML and XML easier. It lets you transform HTML into structured data very quickly, and it integrates nicely with Mechanize.
Installation
Add Nikkou to your Gemfile:
gem 'nikkou'
Method Overview
Here's a summary of the methods Nikkou provides (see "Methods" for details):
Formatting
parse_text - Parses the node's text as XML and returns it as a Nokogiri::XML::NodeSet
time(options={}) - Intelligently parses the time (relative or absolute) of either the text or a specified attribute; accepts a time_zone
option
url(attribute='href') - Converts the href (or other specified attribute) into an absolute URL using the document's URI; <a href="/p/1">Link</a>
yields http://mysite.com/p/1
Searching
attr_equals(attribute, string) - Finds nodes where the attribute equals the string
attr_includes(attribute, string) - Finds nodes where the attribute includes the string
attr_matches(attribute, pattern) - Finds nodes where the attribute matches the pattern
*drill(methods) - Nil-safe method chaining
find(path) - Same as search
but returns the first matched node
text_equals(string) - Finds nodes where the text equals the string
text_includes(string) - Finds nodes where the text includes the string
text_matches(pattern) - Finds nodes where the text matches the pattern
Methods
Formatting
time(options={})
Returns a Time object (in UTC) by automatically parsing the text or specified attribute of the node.
# <a href="/p/1">3 hours ago</a>
doc.search('a').first.time
Options
attribute
The attribute to parse:
# <a href="/p/1" data-published-at="2013-05-22 02:42:34">My link</a>
doc.search('a').first.time(attribute: 'data-published-at')
time_zone
The document's time zone (the time will be converted from that to UTC):
# <a href="/p/1">3 hours ago</a>
doc.search('a').first.time(time_zone: 'America/New_York')
url(attribute='href')
Returns an absolute URL; useful for parsing relative hrefs. The document's uri
needs to be set for Nikkou to know what domain to add to relative paths.
# <a href="/p/1">My link</a>
doc.uri = 'http://mysite.com/mypage'
doc.search('a').first.url # "http://mysite.com/p/1"
If Mechanize is being used, the uri
doesn't need to be manually set.
Options
attribute
The attribute to parse:
# <a href="/p/1" data-comments-url="/p/1#comments">My Link</a>
doc.uri = 'http://mysite.com/mypage'
doc.search('a').first.url('data-comments-url') # "http://mysite.com/p/1#comments"
Searching
attr_equals(attribute, string)
Selects nodes where the specified attribute equals the string.
# <div data-type="news">My Text</div>
doc.attr_equals('data-type', 'news').first.text # "My Text"
attr_includes(attribute, string)
Selects nodes where the specified attribute includes the string.
# <div data-type="major-news">My Text</div>
doc.attr_includes('data-type', 'news').first.text # "My Text"
attr_matches(attribute, pattern)
Selects nodes with an attribute matching a pattern. The pattern's matches are available in Node#matches
.
# <span data-tooltip="3 Comments">My Text</span>
doc.attr_matches('data-tooltip', /(\d+) comments/i).first.text # "My Text"
doc.attr_matches('data-tooltip', /(\d+) comments/i).first.matches # ["3 Comments", "3"]
drill(*methods)
Nil-safe method chaining. Replaces this:
node = doc.find('.count')
if node
attribute = node.attr('data-count')
if attribute
return attribute.to_i
end
end
With this:
return doc.drill([:find, '.count'], [:attr, 'data-count'], :to_i)
find(path)
Same as search
, but returns the first matched node. Replaces this:
nodes = node.search('h4')
if nodes
return nodes.first
end
With this:
return node.find('h4')
text_includes(string)
Selects nodes where the text includes the string.
# <div data-type="news">My Text</div>
doc.text_includes('Text').first.text # "My Text"
text_matches(pattern)
Selects nodes with text matching a pattern. The pattern's matches are available in Node#matches
.
# <a href="/p/1">3 Comments</a>
doc.text_matches(/^(\d+) comments$/i).first.attr('href') # "/p/1"
doc.text_matches(/^(\d+) comments$/i).first.matches # ["3 Comments", "3"]
License
Nikkou is released under the MIT License. Please see the MIT-LICENSE file for details.