Scrape page by page, according to URL pattern

PageByPage

Scrape a site page by page according to a URL pattern, and return an array of the Nokogiri::XML::Element nodes you want.

Installation

Add this line to your application's Gemfile:

gem 'page_by_page'

And then execute:

$ bundle

Or install it yourself as:

$ gem install page_by_page

Usage

number pattern

If you know the page number pattern, use fetch:

nodes = PageByPage.fetch do
  url 'https://book.douban.com/subject/25846075/comments/hot?p=<%= n %>'
  selector '.comment-item'
  # from 2
  # step 2
  # to 100
  # interval 3
  # threads 4
  # no_progress
  # header Cookie: 'douban-fav-remind=1'
end
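
The `<%= n %>` placeholder looks like ERB syntax, so the way from, step and to expand into page URLs can be sketched with Ruby's standard ERB class. This is only an illustration under that assumption, not the gem's actual implementation; `expand_urls` and the example.com URL are hypothetical.

```ruby
require 'erb'

# Hypothetical helper, not part of page_by_page: renders an ERB-style URL
# template once for every page number between `from` and `to`.
def expand_urls(template, from: 1, step: 1, to: 1)
  erb = ERB.new(template)
  from.step(to, step).map { |n| erb.result_with_hash(n: n) }
end

urls = expand_urls('https://example.com/comments?p=<%= n %>', from: 2, step: 2, to: 6)
puts urls
```

With from 2, step 2 and to 6, the sketch renders the template for pages 2, 4 and 6.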

other pattern

If the pattern is not a simple number sequence, pass an enumerator to fetch:

nodes = PageByPage.fetch do
  url 'http://mysql.taobao.org/monthly/<%= n %>'
  selector 'h3'
  enumerator ['2020/09/', '2020/08/'].to_enum
end
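
For longer ranges the enumerator does not have to be written out by hand. A minimal sketch, assuming the 'YYYY/MM/' path format used above; `month_paths` is a hypothetical helper, not part of the gem:

```ruby
# Hypothetical helper: builds an Enumerator of 'YYYY/MM/' path segments
# for one year, counting down from `from_month` to `to_month`.
def month_paths(year, from_month, to_month)
  from_month.downto(to_month).map { |m| format('%d/%02d/', year, m) }.to_enum
end

months = month_paths(2020, 9, 7)
puts months.to_a
```

The resulting Enumerator could then be passed as the enumerator value in the block above.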

unknown pattern

If you don't know the pattern, but the page has a link to the next page, use jump:

nodes = PageByPage.jump do
  start 'https://book.douban.com/subject/25846075/comments/hot'
  iterate '.comment-paginator li:nth-child(3) a'
  selector '.comment-item'
  # to 100
  # interval 3
  # no_progress
  # header Cookie: 'douban-fav-remind=1'
end
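
The idea behind following a next-page link can be sketched without any network access. The loop below is an illustration only, not how the gem implements jump: `site` is a hypothetical in-memory stand-in for HTTP responses, and `follow_next_links` and its `max_pages` limit are assumed names.

```ruby
# Hypothetical stand-in for a paginated site: every page lists its items
# and the URL of the next page (nil on the last page).
site = {
  '/comments'     => { items: %w[a b], next_url: '/comments?p=2' },
  '/comments?p=2' => { items: %w[c d], next_url: '/comments?p=3' },
  '/comments?p=3' => { items: %w[e],   next_url: nil }
}

# Starts at `start`, asks the block for each page, and keeps following
# next_url until there is none or `max_pages` pages have been visited.
def follow_next_links(start, max_pages: nil)
  items = []
  url = start
  visited = 0
  while url && (max_pages.nil? || visited < max_pages)
    page = yield(url)
    items.concat(page[:items])
    url = page[:next_url]
    visited += 1
  end
  items
end

all_items = follow_next_links('/comments') { |url| site.fetch(url) }
first_two_pages = follow_next_links('/comments', max_pages: 2) { |url| site.fetch(url) }
```

In a real scraper the block would download and parse the page, and the next URL would come from the element matched by the iterate selector.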

parameters instead of block

You can also pass parameters instead of a block:

nodes = PageByPage.fetch(
  url: 'https://book.douban.com/subject/25846075/comments/hot?p=<%= n %>',
  selector: '.comment-item',
  # from: 2,
  # step: 2,
  # to: 100,
  # interval: 3,
  # threads: 4,
  # no_progress: true,
  # header: {Cookie: 'douban-fav-remind=1'}
)
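
One common way to support both calling styles is to evaluate the block against an options object that also accepts keyword arguments. The sketch below shows that general technique; `FetchOptions` is a hypothetical class and may not match how page_by_page handles its options internally.

```ruby
# Hypothetical options holder: accepts keyword arguments, a DSL block,
# or both, and stores everything in one hash.
class FetchOptions
  attr_reader :values

  def initialize(**kwargs, &block)
    @values = kwargs.dup
    # Run the block with self set to this object so `url '...'` calls
    # the setter methods below.
    instance_eval(&block) if block
  end

  def url(value)
    @values[:url] = value
  end

  def selector(value)
    @values[:selector] = value
  end

  # Flag-style option: takes no argument in block form.
  def no_progress
    @values[:no_progress] = true
  end
end

from_block = FetchOptions.new do
  url 'https://example.com/?p=<%= n %>'
  selector '.comment-item'
  no_progress
end

from_params = FetchOptions.new(
  url: 'https://example.com/?p=<%= n %>',
  selector: '.comment-item',
  no_progress: true
)
```

Both objects end up holding the same options hash, which is why the two styles can be interchangeable.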

lazy

Note that lazy_fetch returns an Enumerator instead of an Array, so pages are loaded lazily, only as you consume them:

nodes = PageByPage.lazy_fetch(
  #...
)
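
What lazy loading means for an Enumerator can be shown with the standard library alone. In this sketch, which does not use the gem, the `fetched` array records which hypothetical pages were "requested"; only the pages needed to satisfy the caller are touched.

```ruby
fetched = []

# An Enumerator over an endless sequence of pages. The body runs only as
# far as the consumer asks for elements.
pages = Enumerator.new do |yielder|
  n = 0
  loop do
    n += 1
    fetched << n                    # stands in for an HTTP request
    yielder << "item from page #{n}"
  end
end

first_two = pages.first(2)
puts first_two
puts "pages requested: #{fetched.inspect}"
```

Taking two items requests only pages 1 and 2, even though the sequence itself never ends.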