Yasuri
Yasuri (鑢) is a library for declarative web scraping, together with a command-line tool built on it. You describe the expected result in a simple declarative notation, and Yasuri performs the scraping.
Yasuri makes common scraping operations easy to write. For example:
- Open each link in a page, scrape the linked pages, and get the results as a Hash.
- Scrape text from a page and store it in a Hash under named keys.
- Scrape a table that appears repeatedly in a page and get the results as an Array.
- Scrape only the first 3 pages of a paginated listing.
All of these are easy to implement with Yasuri.
Sample
https://yasuri-sample.herokuapp.com/
(source code: https://github.com/tac0x2a/yasuri-sample)
Installation
Add this line to your application's Gemfile:
gem 'yasuri'
or
# for Ruby 1.9.3 or lower
gem 'yasuri', '~> 2.0', '>= 2.0.13'
# for Ruby 3.0.0 or lower
gem 'yasuri', '~> 3.1'
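To see which versions the constraints above admit, the following sketch checks a few version strings with RubyGems' own `Gem::Requirement`. The helper name `allowed?` and the sample version numbers are illustrative assumptions; they are not a list of actual Yasuri releases.

```ruby
require 'rubygems' # usually loaded already; provides Gem::Requirement and Gem::Version

# Hypothetical helper: true when `version` satisfies every constraint string.
def allowed?(constraints, version)
  Gem::Requirement.new(*constraints).satisfied_by?(Gem::Version.new(version))
end

legacy  = ['~> 2.0', '>= 2.0.13'] # constraint pair from the Gemfile line above
v3_line = ['~> 3.1']              # constraint from the Gemfile line above

# Sample versions are made up for illustration only.
%w[2.0.12 2.0.13 2.5.0 3.0.0].each do |v|
  puts "#{v} matches #{legacy.inspect}: #{allowed?(legacy, v)}"
end
%w[3.0.9 3.1.0 3.9.0 4.0.0].each do |v|
  puts "#{v} matches #{v3_line.inspect}: #{allowed?(v3_line, v)}"
end
```

`~> 2.0` allows any 2.x version, and the extra `>= 2.0.13` raises the lower bound; `~> 3.1` allows 3.1 and later 3.x versions but not 4.0.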
And then execute:
$ bundle
Or install it yourself as:
$ gem install yasuri
Usage
Use as library
# Construct a node tree with the DSL
root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
  text_title '//*[@id="contents"]/h2'
  text_content '//*[@id="contents"]/p[1]'
end
# Construct a node tree from YAML
src = <<-EOYAML
links_root:
  path: "//*[@id='menu']/ul/li/a"
  text_title: "//*[@id='contents']/h2"
  text_content: "//*[@id='contents']/p[1]"
EOYAML
root = Yasuri.yaml2tree(src)
# Construct a node tree from JSON
src = <<-EOJSON
{
  "links_root": {
    "path": "//*[@id='menu']/ul/li/a",
    "text_title": "//*[@id='contents']/h2",
    "text_content": "//*[@id='contents']/p[1]"
  }
}
EOJSON
root = Yasuri.json2tree(src)
# Run the scrape and get the result
result = root.scrape("http://some.scraping.page.tac42.net/")
# => [
# {"title" => "PageTitle 01", "content" => "Page Contents 01" },
# {"title" => "PageTitle 02", "content" => "Page Contents 02" },
# ...
# {"title" => "PageTitle N", "content" => "Page Contents N" }
# ]
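The YAML and JSON definitions above are meant to describe the same node tree. As a quick sanity check that does not call Yasuri at all, the sketch below parses both texts with Ruby's standard `yaml` and `json` libraries and compares the resulting Hashes. The helper name `same_tree?` is a hypothetical name chosen for this illustration.

```ruby
require 'json'
require 'yaml'

# Tree definitions copied from the usage example above.
yaml_src = <<~EOYAML
  links_root:
    path: "//*[@id='menu']/ul/li/a"
    text_title: "//*[@id='contents']/h2"
    text_content: "//*[@id='contents']/p[1]"
EOYAML

json_src = <<~EOJSON
  {
    "links_root": {
      "path": "//*[@id='menu']/ul/li/a",
      "text_title": "//*[@id='contents']/h2",
      "text_content": "//*[@id='contents']/p[1]"
    }
  }
EOJSON

# Hypothetical helper: true when both texts parse to the same plain Hash.
def same_tree?(yaml_text, json_text)
  YAML.safe_load(yaml_text) == JSON.parse(json_text)
end

puts same_tree?(yaml_src, json_src)
```

Note that the nesting under `links_root` is carried by indentation in YAML, so a definition whose indentation is lost no longer parses to the same structure.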
Use as CLI
# After gem installation..
$ yasuri help scrape
Usage:
yasuri scrape <URI> [[--file <TREE_FILE>] or [--json <JSON>]]
Options:
f, [--file=FILE] # path to file that written yasuri tree as json or yaml
j, [--json=JSON] # yasuri tree format json string
i, [--interval=N] # interval each request [ms]
Getting from <URI> and scrape it. with <JSON> or json/yml from <TREE_FILE>. They should be Yasuri's format json or yaml string.
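As a sketch of the `--file` variant, the snippet below writes a tree definition to a JSON file using only Ruby's standard library and prints a command line that points at it. The tree is the two-node tree used in the Example section; the helper name `write_tree_file` and the file name `tree.json` are assumptions made for illustration.

```ruby
require 'json'
require 'tmpdir'

# Same two-node tree as in the Example section.
tree = {
  'text_title' => '/html/head/title',
  'text_desc'  => '//*[@id="intro"]/p'
}

# Hypothetical helper: serialize the tree as JSON and return the file path.
def write_tree_file(tree, path)
  File.write(path, JSON.pretty_generate(tree))
  path
end

path = write_tree_file(tree, File.join(Dir.tmpdir, 'tree.json'))

# Print the command to run; this script does not invoke yasuri itself.
puts %(yasuri scrape "https://www.ruby-lang.org/en/" --file #{path})
```

Keeping the tree in a file avoids shell quoting issues with XPath expressions that contain double quotes.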
Example
$ yasuri scrape "https://www.ruby-lang.org/en/" -j '
{
"text_title": "/html/head/title",
"text_desc": "//*[@id=\"intro\"]/p"
}'
{"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
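The scraped text keeps the page's original line breaks and indentation, as the `desc` value above shows. A minimal sketch of post-processing that output in Ruby follows, assuming the CLI prints a single JSON object as in this example. The helper name `tidy` is hypothetical, and the sample string is an abbreviated copy of the output above.

```ruby
require 'json'

# Abbreviated sample of the CLI output; the single-quoted heredoc keeps
# the \n sequences as JSON escapes rather than Ruby newlines.
raw = <<~'EOJSON'
  {"title":"Ruby Programming Language","desc":"\n  A dynamic, open source programming language with a focus on\n  simplicity and productivity.\n "}
EOJSON

# Hypothetical helper: parse the JSON object and collapse runs of
# whitespace in every top-level String value into single spaces.
def tidy(raw_json)
  JSON.parse(raw_json).transform_values do |value|
    value.is_a?(String) ? value.split.join(' ') : value
  end
end

tidy(raw).each { |key, value| puts "#{key}: #{value}" }
```

Only top-level String values are normalized here; nested Arrays or Hashes, such as results from link or pagination nodes, would need a recursive version.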
Run on Docker
$ docker run tac0x2a/yasuri yasuri scrape "https://www.ruby-lang.org/en/" -j '
{
"text_title": "/html/head/title",
"text_desc": "//*[@id=\"intro\"]/p"
}'
{"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
Dev
$ gem install bundler
$ bundle install
Test
$ rake
# or
$ rspec spec/*spec.rb
Test gem in local
$ gem build yasuri.gemspec
$ gem install yasuri-*.gem
Release RubyGems
# Only first time
$ curl -u <user_name> https://rubygems.org/api/v1/api_key.yaml > ~/.gem/credentials
$ chmod 0600 ~/.gem/credentials
$ nano lib/yasuri/version.rb # edit gem version
$ rake release
Contributing
- Fork it ( https://github.com/tac0x2a/yasuri/fork )
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request