Selenium Standalone DSL
A gem that makes Selenium WebDriver simple to use.
Let's scrape sites that rely on JavaScript and iframes!
This project aims to extend Selenium and Nokogiri.
Installation
Ensure you have already installed Firefox:
sudo apt-get install firefox
If you want to run Firefox headlessly, install Xvfb:
sudo apt-get install xvfb
Also on Arch Linux:
sudo pacman -S firefox xorg-server-xvfb
Add this line to your application's Gemfile:
gem 'selenium_standalone_dsl'
And then execute:
bundle
Or install it yourself as:
gem install selenium_standalone_dsl
Usage
require 'selenium_standalone_dsl'
class YahooSearcher < SeleniumStandaloneDSL::Base
  def search_for_wikipedia
    visit 'https://www.yahoo.com/'
    fill_in 'p', with: 'wikipedia'
    # You can declare how to find elements: [class|id|css|xpath]
    click 'IconNavSearch', find_by: :class
    if has_element? :text, 'Next'
      click 'Next'
    end
    # 'search' accepts full jQuery-style CSS selectors
    puts search('span:contains("results")').inner_text
  end
end
config = {
  log_path: '/tmp/selenium_standalone_dsl.log',
  user_agent: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 ' +
              '(KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36',
  headless: true,
}
driver = YahooSearcher.new(config)
driver.search_for_wikipedia
# => 141,000,000 results
Supported API
Method | Summary | Arguments | Remarks |
---|---|---|---|
click | Click a button or link | text of link, name, etc. | |
select | Choose an option in a select box | option text | |
visit | Navigate Firefox to a URL | link url | |
fill_in | Fill in an input box | element name, with: value | |
search | Search page source using Nokogiri | CSS selector | Returns Nokogiri::XML::Element |
Options
Name | Arguments | Default |
---|---|---|
:find_by | :link_text, :name, :id, :class, :css, :xpath | :link_text |
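As a rough illustration of how a :find_by option with a :link_text default could be turned into a locator, here is a minimal sketch. The helper name build_locator and the constant SUPPORTED_FINDERS are hypothetical and are not part of this gem; the only outside fact assumed is that Selenium's find_element accepts a one-entry hash such as { id: 'foo' }.

```ruby
# Hypothetical helper, not part of selenium_standalone_dsl: it turns a target
# string and a :find_by option into a locator hash of the form that
# Selenium's find_element accepts.
SUPPORTED_FINDERS = %i[link_text name id class css xpath].freeze

def build_locator(target, find_by: :link_text)
  unless SUPPORTED_FINDERS.include?(find_by)
    raise ArgumentError, "unsupported :find_by value: #{find_by.inspect}"
  end

  { find_by => target }
end

p build_locator('Next')                            # falls back to :link_text
p build_locator('IconNavSearch', find_by: :class)  # explicit strategy
```

Rejecting unknown strategies up front keeps a typo such as :clas from surfacing later as a confusing "element not found" error.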
Development
git clone git@github.com:acro5piano/selenium_standalone_dsl.git
cd selenium_standalone_dsl
bundle install --path vendor/bundle
bundle exec rake install
bundle exec rake spec
To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.
Roadmap
I will create Selenium Spider after completing Selenium Standalone DSL.
Selenium Spider aims to scrape websites that rely on JavaScript.
There are many tools such as mechanize, but they do not work well on SPA websites.
As I envision it, Selenium Spider will have these features:
Full JavaScript support
Because it is built on Selenium Standalone DSL, which runs Firefox headlessly, it handles JavaScript fully.
PMC architecture
PMC = Pagination Model Controller
Generally, scraping consists of two parts: listing pages and detail pages.
In the PMC architecture, Pagination handles listing items and moving between pages.
Model extracts information from a detail page and stores the data in a database.
Controller coordinates the two.
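The three roles described above can be sketched in plain Ruby. This is only a toy illustration of the idea: the class names, method names, and the in-memory SITE hash are all hypothetical, nothing here drives a real browser, and Selenium Spider itself may end up looking quite different.

```ruby
# Toy in-memory "website": listing pages point at detail pages.
# All names and data here are hypothetical and exist only for this sketch.
SITE = {
  '/list?page=1' => { items: ['/item/1', '/item/2'], next: '/list?page=2' },
  '/list?page=2' => { items: ['/item/3'], next: nil },
  '/item/1' => { title: 'First' },
  '/item/2' => { title: 'Second' },
  '/item/3' => { title: 'Third' }
}.freeze

# Pagination: walks the listing pages and yields each detail-page URL.
class Pagination
  def initialize(start_url)
    @start_url = start_url
  end

  def each_detail_url
    url = @start_url
    while url
      page = SITE.fetch(url)
      page[:items].each { |item_url| yield item_url }
      url = page[:next] # nil on the last listing page ends the loop
    end
  end
end

# Model: extracts fields from one detail page and stores the record.
class Model
  attr_reader :records

  def initialize
    @records = [] # stands in for a database table
  end

  def extract_and_store(url)
    @records << { url: url, title: SITE.fetch(url)[:title] }
  end
end

# Controller: connects Pagination and Model.
class Controller
  def initialize(pagination, model)
    @pagination = pagination
    @model = model
  end

  def run
    @pagination.each_detail_url { |url| @model.extract_and_store(url) }
    @model.records
  end
end

records = Controller.new(Pagination.new('/list?page=1'), Model.new).run
puts records.map { |r| r[:title] }.join(', ')
# prints "First, Second, Third"
```

Keeping page traversal and field extraction in separate classes means a change in a site's pagination markup would only touch Pagination, while a change in the detail-page layout would only touch Model.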
Web-based task execution
Scraping tasks are often numerous and difficult to organize.
Imagine web-based task definition, execution, CSV export, and scheduling, similar to Jenkins.
TODO
- Fix module name. Dsl => DSL
- Default config
- Add API document
- TravisCI
- CommonLogger
- Add current_url
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/acro5piano/selenium_standalone_dsl. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.
License
The gem is available as open source under the terms of the MIT License.