Pagerecognizer -- the visual web page structure recognizing A.I. tool
The idea is to forget that the DOM is a tree and view the page as a human would, then apply smart algorithms to recognize the main blocks that actually form the UI. This is particularly useful in test automation because HTML/CSS internals change more frequently than the design.
Example of splitting into rows (also check examples/google.rb for some other details):
I'll show how to use this tool on www.google.com as an example. Its HTML might already have some convenient ids or classes, but let's pretend there are none. The gem currently uses Ferrum, so you may already know some of the basic methods:
require "ferrum"
require "pagerecognizer"
Ferrum::Node.include PageRecognizer
browser = Ferrum::Browser.new
browser.goto "https://google.com/"
We've just added some methods to Ferrum::Node. Let's call the private method #recognize to export what the A.I. sees to an HTML file, like this:
File.write "dump.htm", browser.at_css("body").send(:recognize).dump
This is the view of node rectangles that the A.I. will later use for recognition. Let's do a web search and see what it sees now:
browser.at_css("input[type=text]").focus.type "Ruby", :enter
Now let's try the magic method #rows and see whether it recognizes the search result sections of the page:
File.write "dump.htm", browser.at_css("body").rows([:AREA, :SIZE]).dump
:AREA and :SIZE are the recommended heuristics for the rows and cols methods; you can find others in the source code.
The Google Search page is complex these days and, as you can see, with the default options it did not recognize the first result and misrecognized others. The misrecognized ones either have no blue hyperlinks or no text at all. What can we do? Each recognized node has a method #texts that gives access to its text blocks and their styles. It also classifies each text's color as one of the 16 basic web colors. Let's use it to add a custom heuristic that hints to process only nodes containing both black and blue text:
results = browser.at_css("body").rows([:AREA, :SIZE]) do |node|
  colors = node.texts.map{ |text, style, color, | color }
  colors.any?{ |c| :black == c } &&
  colors.any?{ |c| :blue == c || :navy == c }
end
File.write "dump.htm", results.dump
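To make the color classification more concrete, here is a minimal sketch of how an RGB value could be mapped to one of the 16 basic web colors. It is not the gem's actual implementation: nearest_basic_color is a hypothetical helper, and picking the closest color by squared distance in RGB space is an assumption made for illustration.

```ruby
# The 16 basic web colors (HTML 4.01 keywords) and their RGB values.
BASIC_COLORS = {
  black: [0, 0, 0], silver: [192, 192, 192], gray: [128, 128, 128], white: [255, 255, 255],
  maroon: [128, 0, 0], red: [255, 0, 0], purple: [128, 0, 128], fuchsia: [255, 0, 255],
  green: [0, 128, 0], lime: [0, 255, 0], olive: [128, 128, 0], yellow: [255, 255, 0],
  navy: [0, 0, 128], blue: [0, 0, 255], teal: [0, 128, 128], aqua: [0, 255, 255],
}.freeze

# Hypothetical helper (not part of the gem): returns the name of the basic
# color closest to (r, g, b) by squared Euclidean distance in RGB space.
def nearest_basic_color(r, g, b)
  BASIC_COLORS.min_by do |_name, (cr, cg, cb)|
    (r - cr)**2 + (g - cg)**2 + (b - cb)**2
  end.first
end

nearest_basic_color(26, 13, 171)  # => :navy  (a dark link blue, #1a0dab)
nearest_basic_color(32, 33, 36)   # => :black (a near-black text color, #202124)
```

Under this assumption a dark link blue lands on :navy rather than :blue, which may be why the heuristic above accepts both.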
A custom heuristic not only helps the A.I. but may also make recognition faster because it leaves fewer nodes to process. It still picks some wrong nodes, though. So let's select only the nodes in which the biggest text is blue and occurs only once:
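... do |node|
  texts = node.texts
  # skip nodes that have no black text at all
  next if texts.none?{ |text, style, color, | :black == color }
  # group the texts by font size and take the group with the largest size
  _, group = texts.group_by{ |text, style, | style["fontSize"].to_i }.to_a.max_by(&:first)
  next unless group
  # the largest text must be unique within the node and be blue or navy
  next unless group.size == 1 && %i{ blue navy }.include?(group[0][2])
  true
end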
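To see how this block treats different nodes without launching a browser, the same checks can be run on plain arrays. This is only a sketch: the entries below are invented stand-ins for what #texts yields (text, style and color, in that order, as the block parameters above suggest), and looks_like_result? is a hypothetical name.

```ruby
# Invented sample data, not captured from a real page.
SEARCH_RESULT = [
  ["Ruby Programming Language", { "fontSize" => "20px" }, :navy],
  ["A dynamic, open source programming language", { "fontSize" => "14px" }, :black],
].freeze
NAV_BAR = [
  ["Images", { "fontSize" => "14px" }, :black],
  ["Videos", { "fontSize" => "14px" }, :black],
].freeze
TWO_LINKS = [
  ["First link", { "fontSize" => "20px" }, :blue],
  ["Second link", { "fontSize" => "20px" }, :blue],
  ["caption", { "fontSize" => "12px" }, :black],
].freeze

# Same logic as the block above, written as a standalone predicate.
def looks_like_result?(texts)
  # there must be some black text
  return false if texts.none? { |_text, _style, color| :black == color }
  # find the texts sharing the largest font size ("20px".to_i is 20)
  _, group = texts.group_by { |_text, style, _color| style["fontSize"].to_i }.to_a.max_by(&:first)
  return false unless group
  # exactly one biggest text, and it must be blue or navy
  group.size == 1 && %i[blue navy].include?(group[0][2])
end

looks_like_result?(SEARCH_RESULT)  # => true
looks_like_result?(NAV_BAR)        # => false (the biggest text is not unique)
looks_like_result?(TWO_LINKS)      # => false (two texts share the biggest size)
```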
Perfect. Now we can reject the nodes with images because we are not interested in video results (note that we use node.node since node is a recognized object, a structure, while node.node is the actual Ferrum object), and then parse the results:
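results.reject{ |result| result.node.at_css "img" }.map do |result|
  [
    # the first 40 characters of the link URL
    result.node.at_css("a").property("href")[0,40],
    # the text with the largest font size, shortened to roughly 40 characters
    result.texts.max_by{ |t, s, | s["fontSize"].to_i }[0].sub(/(.{40}) .+/, "\\1..."),
  ]
end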
https://ru.wikipedia.org/wiki/Ruby Ruby - Википедия
https://www.ruby-lang.org/ru/ Язык программирования Ruby
https://evrone.ru/why-ruby 5 причин, почему мы выбираем Ruby - evrone.ru
https://habr.com/ru/hub/ruby/ Ruby — Динамический высокоуровневый язык...
https://ru.wikibooks.org/wiki/Ruby Ruby - Викиучебник
https://context.reverso.net/%D0%BF%D0%B5 ruby - Перевод на русский - примеры английский...
https://web-creator.ru/articles/ruby Язык программирования Ruby - Веб Креатор
https://ru.hexlet.io/courses/ruby Введение в Ruby - Хекслет
https://www.ozon.ru/product/yazyk-progra Книга "Язык программирования Ruby" - OZON
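The title shortening in the snippet above relies on a small regular expression trick: once at least 40 characters have passed, the string is cut at the next space and "..." is appended, so words are never split. A standalone sketch of that behavior follows; shorten is a hypothetical helper name and the sample strings are made up.

```ruby
# Hypothetical helper wrapping the same substitution as in the snippet above.
def shorten(title)
  title.sub(/(.{40}) .+/, "\\1...")
end

shorten("Ruby - Wikipedia")
# => "Ruby - Wikipedia" (too short to match, left unchanged)
shorten("Ruby is a dynamic, open source programming language with a focus on simplicity")
# => "Ruby is a dynamic, open source programming..."
```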
We've just scraped the SERP knowing nothing about its DOM other than that there are big blue links with black descriptions!
Example of grid detection
browser.goto "https://youtube.com/"
grid = browser.at_css("#content").grid
grid.size # => 24
grid.cols.size # => 3
grid.cols.map(&:size) # => [8, 8, 8]
grid.rows.size # => 8
grid.rows.map(&:size) # => [3, 3, 3, 3, 3, 3, 3, 3]
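One way to read these numbers is that every item of the grid belongs to exactly one row and one column. The sketch below does not use the gem or a browser; it arranges 24 hypothetical rectangles in a regular layout and groups them by their coordinates. The real recognition presumably has to be more tolerant, since tiles on a real page rarely align to the pixel. Rect, TILES, rows_of and cols_of are names invented for this illustration.

```ruby
Rect = Struct.new(:left, :top, :width, :height)

# Hypothetical layout: 24 equally sized tiles, 3 per row.
TILES = Array.new(24) do |i|
  Rect.new((i % 3) * 320, (i / 3) * 240, 300, 220)
end.freeze

# Tiles sharing the same top coordinate form a row, ordered top to bottom.
def rows_of(rects)
  rects.group_by(&:top).sort_by(&:first).map(&:last)
end

# Tiles sharing the same left coordinate form a column, ordered left to right.
def cols_of(rects)
  rects.group_by(&:left).sort_by(&:first).map(&:last)
end

rows_of(TILES).size        # => 8
rows_of(TILES).map(&:size) # => [3, 3, 3, 3, 3, 3, 3, 3]
cols_of(TILES).size        # => 3
cols_of(TILES).map(&:size) # => [8, 8, 8]
```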