jules
Experimental
gem 'jules'
A fast new way to write scrapers. This is still an ongoing project.
require 'open-uri'
require 'jules'
source = URI.parse('https://news.ycombinator.com').read
filters = {
title: 'td.title',
comments: [/(\d+) comments/, :optional],
points: /(\d+) points/
}
items = Jules.collect(source, filters)
# [{title: '2 years with Angular', comments: '95', points: '245'},
# {title: 'PolarSSL is now a part of ARM', comments: '13', points: '48'},
# {title: 'My boys love 1986 computing', comments: '25', points: '105'},
# {title: 'Kill init by touching a bunch of files', comments: '66', points: '102'},
# ...
How?
Jules uses the repitition of HTML structure. It rearranges the document using locality sensitive hashing (Simhash). It then continues to extract data using the user-specified filters.
Filters
Filters can be
- Strings (CSS selector / XPath query)
- Regexp
- Anonymous methods (
lambda
)
By default filters are required fields. If a field is optional, mark it with ['#example', :optional]
.
Options
Enabled HTML elements
By default, div
, tr
or li
are enabled HTML elements for repitition. If a website wraps every item in ul
and div
elements, do this:
Jules.collect(html, filters, ['ul', 'div'])
Examples
The Onion
require 'open-uri'
require 'jules'
source = URI.parse('http://www.theonion.com/search/?q=why').read
filters = {
title: 'h1',
pubdate: /(\d{2}\.\d{2}\.\d{2})/,
img: 'img'
}
items = Jules.collect(source, filters)
# [{title: 'Why Are We Leaving Facebook?', pubdate: '10.10.13', img: 'http://o.onionstatic.com/images/23/23823/16x9/350.jpg?0553'},
# {title: 'Why Are We Filing For Disability?', pubdate: '01.24.14', img: 'http://o.onionstatic.com/images/25/25070/16x9/350.jpg?8738'},
# {title: 'Why Are We Waiting To Have Children?', pubdate: '07.10.14', img: 'http://o.onionstatic.com/images/26/26746/16x9/350.jpg?7206'},
# {title: 'Why Are We Postponing The Wedding?', pubdate: '04.25.13', img: '/images/21/21801/16x9/350.jpg?8189'},
# {title: 'Why Are We Canceling Our Netflix Account?', pubdate: '01.09.14', img: 'http://o.onionstatic.com/images/24/24668/16x9/350.jpg?3803'},
# {title: 'Why Aren't We Watching The Olympics?', pubdate: '02.20.14', img: 'http://o.onionstatic.com/images/25/25345/16x9/350.jpg?2178'},
# ...
See the tests folder for more examples.