# RubyScraper
RubyScraper is a gem built to scrape one- or two-layer listing sites. It was originally written to scrape job posting sites, and the included example scrapes.json config file reflects that. The gem lets you pull summary listings from a main index page, then follow the nested URLs to each listing's sub-page to scrape additional data. The example targets job sites, but the same approach works for blogs, recipe sites, news sites, product listings, etc.

RubyScraper currently only supports sending results to an API endpoint. It will eventually be built out to allow additional output options (CSV, etc.).

NOTE: Only the specified parameters are sent in the POST request for the time being. TODO: Dynamically generate POST request data.
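For illustration only, a single scraped record POSTed to the endpoint might look something like this (the field names are taken from the sample scrapes.json below; the exact payload shape is an assumption, not a documented contract):

```json
{
  "position": "Ruby Developer",
  "url": "http://www.careers.stackoverflow.com/jobs/12345",
  "posting_date": "yesterday",
  "company": "Example Co.",
  "location": "New York, NY",
  "description": "We are looking for a Ruby developer...",
  "tags": "ruby, rails, tdd"
}
```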
###### Usage Examples:
```
# Scrapes all sites defined in the scrapes.json file (located in the PWD)
# and sends results as POST requests to the specified endpoint.
rubyscraper -f scrapes.json -e http://myserver.com/api/v1/post-endpoint

# Scrapes only the site named 'stackoverflow' in the scrapes.json file.
rubyscraper -f scrapes.json -e http://myserver.com/api/v1/post-endpoint -s stackoverflow

# Pulls full pages totaling at least 75 records (defaults to 50).
# NOTE: For unpaginated sites, all results are scraped.
# NOTE: For a site with 25 records/page, -r 51 will pull 3 full pages, or 75 records.
rubyscraper -f scrapes.json -e http://myserver.com/api/v1/post-endpoint -r 75
```
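The rounding behavior of `-r` is presumably simple ceiling division on whole pages; a quick sketch of that arithmetic:

```ruby
# Assumed page-count arithmetic behind the -r examples above
records_wanted   = 51
records_per_page = 25
pages_to_scrape  = (records_wanted.to_f / records_per_page).ceil # => 3
records_pulled   = pages_to_scrape * records_per_page            # => 75
```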
## Installation
### Dependency
RubyScraper relies on PhantomJS as its headless web browser. Install it before installing the gem:

```
brew install phantomjs
```
### CLI
Install RubyScraper by running:

```
gem install rubyscraper
```
### Gemfile
Work in progress.
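Gemfile support is still being worked out; once it lands, the entry would presumably be the standard one, followed by `bundle install`:

```ruby
# Assumed standard Gemfile entry (see the "Work in progress" note above)
gem 'rubyscraper'
```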
## Usage
First, configure a scrape config file. See the example file (scrapes.json) for format and options; all available options are demonstrated in the examples it contains. A rough overview:
#### Scrape Config File Options
```
[
  {
    "name": "stackoverflow",            # => REQUIRED Site name (no spaces)
    "base_url": "http://www.careers.stackoverflow.com",
                                        # => REQUIRED Base site URL (through the domain, no trailing '/')
    "summary": {                        # => REQUIRED Summary block (main scrape page)
      "url": "/jobs/tag/ruby?sort=p",   # => REQUIRED Any URL additions to access the main scrape page
                                        #    (if only pulling the base site, use "/")
      "has_sub_pages": "true",          # => REQUIRED Are there sub-pages to scrape?
      "paginated": "true",              # => REQUIRED Are all listings on the main page, or is the site paginated?
      "pagination": {                   # => OPTIONAL Required for paginated sites
        "format": "&pg=",               # URL pagination param
        "start": "1",                   # Starting point (some sites go by records)
        "scale": "1",                   # The increment for pages/records
        "records_per_page": "25"        # Number of records on each page
      },
      "loop": ".listResults .-item",    # => REQUIRED The main container of each scrape element
      "fields": [                       # => REQUIRED Which fields should be scraped from this page
        {
          "field": "position",          # Output field name
          "method": "find",             # Capybara search method (find: only 1 matching elem)
          "path": "h3.-title a"         # Path to the containing element
        },
        {
          "field": "url",               # To scrape sub-pages, this field is required, with the name 'url'
          "method": "find",
          "path": "h3.-title a",
          "attr": "href"                # To access an HTML attribute, add an "attr" row
        },
        {
          "field": "posting_date",
          "method": "first",            # Use 'first' to access the first occurrence of an element
          "path": "p._muted"
        }
      ]
    },
    "sub_page": {                       # => OPTIONAL If scraping sub-pages, list their fields here
      "fields": [
        {
          "field": "company",
          "method": "find",
          "path": "a.employer"
        },
        {
          "field": "location",
          "method": "find",
          "path": "span.location"
        },
        {
          "field": "description",
          "method": "all",              # Use the 'all' method to collect data from multiple elems
          "path": "div.description p",  # Path to the collection of elems to be aggregated
          "loop_collect": "text",       # What is selected from each element in that collection
          "join": "\n"                  # How to join the collected strings
        },
        {
          "field": "tags",
          "method": "all",
          "path": "div.tags a.post-tag",
          "loop_collect": "text",
          "join": ", "
        }
      ]
    }
  }
]
```
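To make the `find`/`first`/`all` options concrete, here is a rough sketch of how a single field definition could map onto Capybara calls. This is illustrative only, not the gem's actual internals; `extract_field` and its handling of `attr` and `loop_collect` are assumptions based on the options above:

```ruby
# Illustrative sketch only; not RubyScraper's real implementation.
# `node` is a Capybara node for one ".listResults .-item" container;
# `field` is one entry from a "fields" array above.
def extract_field(node, field)
  case field["method"]
  when "find"   # exactly one matching element expected
    element = node.find(field["path"])
    field["attr"] ? element[field["attr"]] : element.text
  when "first"  # first occurrence of a matching element
    node.first(field["path"]).text
  when "all"    # aggregate a collection of elements
    node.all(field["path"])
        .map { |el| el.send(field["loop_collect"]) } # e.g. collect each element's text
        .join(field["join"])                         # e.g. join with "\n" or ", "
  end
end
```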
#### RubyScraper Runtime Options
Type `rubyscraper -h` for the full option list.
```
Usage: RubyScraper [options]

Specific options:
    REQUIRED:
        -f, --file FILENAME.JSON     Specify the file name of your RubyScraper config file
    REQUIRED (if using as a service to send results as POST requests):
        -e, --endpoint URL           Enter the API endpoint URL here
                                     (if using the scraper as a service to send
                                     POST requests to a server)
    OPTIONAL:
        -r, --record-limit N         Pull N records per site
                                     (approximate: if there are 25 records per
                                     page and 51 is given, 3 pages are pulled)
        -d, --delay N                Delay N seconds before executing
        -s, --site SITENAME          Scrape a single SITENAME from the config file

Common options:
        -h, --help                   Show this message
            --version                Show version
```
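For concreteness, the pagination options from the config file above would presumably combine into successive page URLs along these lines (an illustration inferred from the option descriptions, not the gem's confirmed internals):

```ruby
# Hypothetical sketch of how "format", "start", and "scale" appear to combine;
# RubyScraper's actual URL construction may differ.
base_url   = "http://www.careers.stackoverflow.com"
url        = "/jobs/tag/ruby?sort=p"
pagination = { "format" => "&pg=", "start" => 1, "scale" => 1 }

3.times do |i| # e.g. three pages for -r 75 at 25 records/page
  page = pagination["start"] + i * pagination["scale"]
  puts base_url + url + pagination["format"] + page.to_s
end
# http://www.careers.stackoverflow.com/jobs/tag/ruby?sort=p&pg=1
# http://www.careers.stackoverflow.com/jobs/tag/ruby?sort=p&pg=2
# http://www.careers.stackoverflow.com/jobs/tag/ruby?sort=p&pg=3
```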
Contributing
- Fork it ( https://github.com/[my-github-username]/rubyscraper/fork )
- Create your feature branch (
git checkout -b my-new-feature
) - Write your tests and don't break anything :) run tests with
rspec
- Commit your changes (
git commit -am 'Add some feature'
) - Push to the branch (
git push origin my-new-feature
) - Create a new Pull Request