Yasuri

Yasuri (鑢) is a library for declarative web scraping and a command line tool for scraping with it. You describe the expected result in a simple declarative notation, and Yasuri performs the scraping.

Yasuri makes it easy to write common scraping operations. For example, the following can be implemented easily:

  • Open each link in a page, scrape each linked page, and get the results as a Hash.
  • Scrape texts in a page and get them as named values in a Hash.
  • Scrape each table that appears repeatedly in a page and get the results as an array.
  • Scrape only the top 3 of the pages provided by pagination.

Sample

https://yasuri-sample.herokuapp.com/

(source code: https://github.com/tac0x2a/yasuri-sample)

Installation

Add this line to your application's Gemfile:

gem 'yasuri'

or

# for Ruby 1.9.3 or lower
gem 'yasuri', '~> 2.0', '>= 2.0.13'

# for Ruby 3.0.0 or lower
gem 'yasuri', '~> 3.1'

And then execute:

$ bundle

Or install it yourself as:

$ gem install yasuri

Usage

Use as library

# Construct the node tree with the DSL
root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
         text_title '//*[@id="contents"]/h2'
         text_content '//*[@id="contents"]/p[1]'
       end


# Construct the node tree from YAML
src = <<-EOYAML
links_root:
  path: "//*[@id='menu']/ul/li/a"
  text_title: "//*[@id='contents']/h2"
  text_content: "//*[@id='contents']/p[1]"
EOYAML
root = Yasuri.yaml2tree(src)


# Construct the node tree from JSON
src = <<-EOJSON
{
  "links_root": {
    "path": "//*[@id='menu']/ul/li/a",
    "text_title": "//*[@id='contents']/h2",
    "text_content": "//*[@id='contents']/p[1]"
  }
}
EOJSON
root = Yasuri.json2tree(src)

# Run the scrape and get the result
result = root.scrape("http://some.scraping.page.tac42.net/")
# => [
#      {"title" => "PageTitle 01", "content" => "Page Contents  01" },
#      {"title" => "PageTitle 02", "content" => "Page Contents  02" },
#      ...
#      {"title" => "PageTitle N",  "content" => "Page Contents  N" }
#    ]
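
The YAML and JSON sources above are meant to describe the same tree as the DSL. As a minimal sketch, using only Ruby's standard `yaml` and `json` libraries (Yasuri itself is not loaded here), the following checks that both sources parse to the same Hash. It also shows how, judging from the sample output, each key appears to split into a node type and a result name. The helper `split_node_key` is hypothetical and is not part of Yasuri's API.

```ruby
require "json"
require "yaml"

yaml_src = <<~EOYAML
  links_root:
    path: "//*[@id='menu']/ul/li/a"
    text_title: "//*[@id='contents']/h2"
    text_content: "//*[@id='contents']/p[1]"
EOYAML

json_src = <<~EOJSON
  {
    "links_root": {
      "path": "//*[@id='menu']/ul/li/a",
      "text_title": "//*[@id='contents']/h2",
      "text_content": "//*[@id='contents']/p[1]"
    }
  }
EOJSON

yaml_tree = YAML.safe_load(yaml_src)
json_tree = JSON.parse(json_src)

# Both formats should yield the same nested Hash.
same_tree = (yaml_tree == json_tree)

# Hypothetical helper (not Yasuri API): split "text_title" into
# the node type ("text") and the result name ("title").
def split_node_key(key)
  type, name = key.split("_", 2)
  { type: type, name: name }
end

# Every key except "path" is treated here as a child node.
child_keys = yaml_tree["links_root"].keys - ["path"]
children = child_keys.map { |key| split_node_key(key) }
```

Because both sources reduce to the same Hash, choosing between YAML and JSON is a matter of taste; the result names (`title`, `content`) match the keys in the scraped result shown above.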

Use as CLI

# After installing the gem
$ yasuri help scrape
Usage:
  yasuri scrape <URI> [[--file <TREE_FILE>] or [--json <JSON>]]

Options:
  f, [--file=FILE]   # path to file that written yasuri tree as json or yaml
  j, [--json=JSON]   # yasuri tree format json string
  i, [--interval=N]  # interval each request [ms]

Getting from <URI> and scrape it. with <JSON> or json/yml from <TREE_FILE>. They should be Yasuri's format json or yaml string.
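
The `--file` option takes a tree written as JSON or YAML. A minimal sketch of producing such files from a Ruby Hash with the standard `json` and `yaml` libraries follows. The tree is the two-node one used in the CLI example; the file names and the temporary directory are arbitrary choices made for this sketch.

```ruby
require "json"
require "tmpdir"
require "yaml"

# Tree with two text nodes, keyed as "<type>_<name>" => XPath.
tree = {
  "text_title" => "/html/head/title",
  "text_desc"  => "//*[@id=\"intro\"]/p"
}

# File names and location are assumptions for this sketch.
dir = Dir.mktmpdir("yasuri-tree")
json_path = File.join(dir, "tree.json")
yaml_path = File.join(dir, "tree.yml")

File.write(json_path, JSON.pretty_generate(tree))
File.write(yaml_path, YAML.dump(tree))

# Reading the files back gives the original Hash in both formats.
json_roundtrip = JSON.parse(File.read(json_path))
yaml_roundtrip = YAML.safe_load(File.read(yaml_path))
```

Either file could then be passed on the command line in place of `<TREE_FILE>`.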

Example

$ yasuri scrape "https://www.ruby-lang.org/en/" -j '
{
  "text_title": "/html/head/title",
  "text_desc": "//*[@id=\"intro\"]/p"
}'

{"title":"Ruby Programming Language","desc":"\n    A dynamic, open source programming language with a focus on\n    simplicity and productivity. It has an elegant syntax that is\n    natural to read and easy to write.\n    "}
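
The command prints a single JSON object, so the result is easy to consume from another program. A minimal sketch, assuming the output shown above has been captured into a string; the whitespace cleanup with `split.join(" ")` is a choice made for this sketch, not something Yasuri does.

```ruby
require "json"

# Output of the command above, copied verbatim.
# Single quotes keep the "\n" escapes intact for the JSON parser.
output = '{"title":"Ruby Programming Language","desc":"\n    A dynamic, open source programming language with a focus on\n    simplicity and productivity. It has an elegant syntax that is\n    natural to read and easy to write.\n    "}'

result = JSON.parse(output)

# Collapse newlines and runs of spaces left over from the HTML source.
desc = result["desc"].split.join(" ")

puts result["title"]
puts desc
```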

Run on Docker

$ docker run tac0x2a/yasuri yasuri scrape "https://www.ruby-lang.org/en/" -j '
{
  "text_title": "/html/head/title",
  "text_desc": "//*[@id=\"intro\"]/p"
}'

{"title":"Ruby Programming Language","desc":"\n    A dynamic, open source programming language with a focus on\n    simplicity and productivity. It has an elegant syntax that is\n    natural to read and easy to write.\n    "}
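
When the CLI is driven from Ruby, passing the command as an argument list avoids shell quoting of the JSON tree. A minimal sketch under that assumption: `yasuri_scrape_argv` is a hypothetical helper written for this example, and it uses only the `scrape`, `-j`, and `docker run tac0x2a/yasuri` forms shown above.

```ruby
require "json"

# Hypothetical helper: build the argument list for `yasuri scrape`,
# optionally wrapped in the Docker invocation shown above.
def yasuri_scrape_argv(uri, tree, docker: false)
  argv = ["yasuri", "scrape", uri, "-j", JSON.generate(tree)]
  docker ? ["docker", "run", "tac0x2a/yasuri", *argv] : argv
end

tree = {
  "text_title" => "/html/head/title",
  "text_desc"  => "//*[@id=\"intro\"]/p"
}

argv = yasuri_scrape_argv("https://www.ruby-lang.org/en/", tree, docker: true)

# To actually run it (requires Docker and network access):
#   require "open3"
#   stdout, status = Open3.capture2(*argv)
```

The sketch only builds the argument list; nothing is executed until it is handed to something like `Open3.capture2`.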

Dev

$ gem install bundler
$ bundle install

Test

$ rake
# or
$ rspec spec/*spec.rb

Test the gem locally

$ gem build yasuri.gemspec
$ gem install yasuri-*.gem

Release to RubyGems

# First time only
$ curl -u <user_name> https://rubygems.org/api/v1/api_key.yaml > ~/.gem/credentials
$ chmod 0600 ~/.gem/credentials

$ nano lib/yasuri/version.rb # edit gem version
$ rake release

Contributing

  1. Fork it ( https://github.com/tac0x2a/yasuri/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request