Nokolexbor is a high-performance HTML5 parser with support for both CSS selectors and XPath. Its API is designed to be compatible with Nokogiri.
Nokolexbor

Nokolexbor is a drop-in replacement for Nokogiri. It's 5.2x faster at parsing HTML and up to 997x faster at CSS selectors.

It's a performance-focused HTML5 parser for Ruby based on Lexbor. It supports both CSS selectors and XPath. Nokolexbor's API is designed to match Nokogiri's API 1:1 wherever possible.

Requirements

Nokolexbor ships with pre-compiled gems for most common platforms:

  • Linux: x86_64, with glibc >= 2.17
  • macOS: x86_64 and arm64
  • Windows: ucrt64, mingw32 and mingw64

If you are on a supported platform, just jump to the Installation section. Otherwise, you need to install CMake to compile C extensions:

macOS

brew install cmake

Linux (Debian, Ubuntu, etc.)

sudo apt-get install cmake
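
To get a rough idea of whether a machine falls under the platform list above, one option is to inspect the RubyGems platform string. The sketch below is only illustrative: the regular expressions and the likely_precompiled? helper are assumptions made for this example (they are not part of Nokolexbor), and the glibc version requirement is not checked.

```ruby
# Assumed mapping from the platform list above to RubyGems platform strings.
# These patterns are illustrative and are not taken from Nokolexbor itself.
PRECOMPILED_PATTERNS = [
  /\Ax86_64-linux/,          # Linux x86_64 (the glibc version is NOT checked)
  /\A(x86_64|arm64)-darwin/, # macOS x86_64 and arm64
  /\A(x64|x86)-mingw/        # Windows mingw and ucrt builds
].freeze

# Hypothetical helper: true when the platform string looks like one of the
# platforms listed above.
def likely_precompiled?(platform)
  name = platform.to_s
  return false if name.include?('musl') # musl-based Linux is not glibc

  PRECOMPILED_PATTERNS.any? { |pattern| pattern.match?(name) }
end

local = Gem::Platform.local.to_s
if likely_precompiled?(local)
  puts "#{local}: a pre-compiled gem is probably available"
else
  puts "#{local}: expect a source build, so install CMake first"
end
```

If the check is wrong for a given setup, nothing breaks: RubyGems falls back to compiling the extension, which is where CMake is needed.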

Installation

Add to your Gemfile:

gem 'nokolexbor'

Then, run bundle install.

Or, install the gem directly:

gem install nokolexbor

Quick start

require 'nokolexbor'
require 'open-uri'

# Parse HTML document
doc = Nokolexbor::HTML(URI.open('https://github.com/serpapi/nokolexbor'))

# Search for nodes by css
doc.css('#readme h1', 'article h2', 'p[dir=auto]').each do |node|
  puts node.content
end

# Search for text nodes by css
doc.css('#readme p > ::text').each do |text|
  puts text.content
end

# Search for nodes by xpath
doc.xpath('//div[@id="readme"]//h1', '//article//h2').each do |node|
  puts node.content
end

Features

  • Nokogiri-compatible APIs.
  • High-performance HTML parsing, DOM manipulation, and CSS selector engine.
  • XPath search engine (ported from libxml2).
  • CSS selector support for text nodes: ::text.

Searching methods overview

  • css and at_css
    • Based on Lexbor.
    • Accepts only CSS selectors; mixed syntax such as div#abc /text() is not supported.
    • To select text nodes, use the ::text pseudo-element, e.g. div#abc > ::text.
    • Much faster than the libxml2-based methods.
  • xpath and at_xpath
    • Based on libxml2.
    • Only accepts XPath syntax.
    • Works in the same way as Nokogiri's xpath and at_xpath.
  • nokogiri_css and nokogiri_at_css (requires Nokogiri installed)
    • Based on libxml2.
    • Accepts mixed syntax like div#abc /text().
    • Works in the same way as Nokogiri's css and at_css.

Different behaviors from Nokogiri

  • For the selector :nth-of-type(n), n is not affected by preceding filters. For example, suppose we want to select the 3rd div that has neither class a nor class b, which is the last div in the following HTML:

    <body>
      <div></div>
      <div class="a"></div>
      <div class="b"></div>
      <div></div>
      <div></div>
    </body>
    

    In Nokogiri, the selector is div:not(.a):not(.b):nth-of-type(3).

    In Nokolexbor, as in browsers, the divs excluded by :not still count toward the position of the last div, so the selector has to be div:not(.a):not(.b):nth-of-type(5), which defeats the purpose of the filter.
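
The two counting rules can be modelled in plain Ruby without either library. This is only a sketch of the behavior described above, not how either engine is implemented; Div, filter_then_count and count_then_filter are hypothetical names introduced for the illustration.

```ruby
# A div is modelled by its 1-based position among its siblings and its classes.
Div = Struct.new(:position, :classes)

# The five sibling divs from the HTML above.
DIVS = [[], ['a'], ['b'], [], []].each_with_index.map { |classes, index|
  Div.new(index + 1, classes)
}.freeze

# Filter corresponding to :not(.a):not(.b).
NOT_A_OR_B = ->(div) { (div.classes & %w[a b]).empty? }

# Nokogiri-style: apply the filter first, then take the n-th survivor.
def filter_then_count(divs, n, keep)
  divs.select(&keep)[n - 1]
end

# Browser/Nokolexbor-style: take the n-th div among all siblings, then
# require that it passes the filter as well.
def count_then_filter(divs, n, keep)
  candidate = divs[n - 1]
  candidate if candidate && keep.call(candidate)
end

puts filter_then_count(DIVS, 3, NOT_A_OR_B).position # prints 5
puts count_then_filter(DIVS, 3, NOT_A_OR_B).inspect  # prints nil
puts count_then_filter(DIVS, 5, NOT_A_OR_B).position # prints 5
```

In the second model, n = 3 lands on the div with class b, which the filter rejects, so nothing is selected; only n = 5 reaches the last div.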

Benchmarks

The benchmark parses a Google results page (368 KB) and selects nodes using CSS and XPath. It was run on a MacBook Pro (2019) with a 2.3 GHz 8-core Intel Core i9.

Run with: ruby bench/bench.rb

Operation   Nokolexbor (iters/s)   Nokogiri (iters/s)   Diff
parsing     487.6                  93.5                 5.22x faster
at_css      50798.8                50.9                 997.87x faster
css         7437.6                 52.3                 142.11x faster
at_xpath    57.077                 53.176               same-ish
xpath       51.523                 58.438               same-ish
Raw data
Warming up --------------------------------------
    Nokolexbor parse    56.000  i/100ms
      Nokogiri parse     8.000  i/100ms
Calculating -------------------------------------
    Nokolexbor parse    487.564  (±10.9%) i/s -      9.688k in  20.117173s
      Nokogiri parse     93.470  (±21.4%) i/s -      1.736k in  20.024163s

Comparison:
    Nokolexbor parse:      487.6 i/s
      Nokogiri parse:       93.5 i/s - 5.22x  (± 0.00) slower

Warming up --------------------------------------
   Nokolexbor at_css     5.548k i/100ms
     Nokogiri at_css     6.000  i/100ms
Calculating -------------------------------------
   Nokolexbor at_css     50.799k (±13.8%) i/s -    987.544k in  20.018481s
     Nokogiri at_css     50.907  (±35.4%) i/s -    828.000  in  20.666258s

Comparison:
   Nokolexbor at_css:    50798.8 i/s
     Nokogiri at_css:       50.9 i/s - 997.87x  (± 0.00) slower

Warming up --------------------------------------
      Nokolexbor css   709.000  i/100ms
        Nokogiri css     4.000  i/100ms
Calculating -------------------------------------
      Nokolexbor css      7.438k (±14.7%) i/s -    145.345k in  20.083833s
        Nokogiri css     52.338  (±36.3%) i/s -    816.000  in  20.042053s

Comparison:
      Nokolexbor css:     7437.6 i/s
        Nokogiri css:       52.3 i/s - 142.11x  (± 0.00) slower

Warming up --------------------------------------
 Nokolexbor at_xpath     2.000  i/100ms
   Nokogiri at_xpath     4.000  i/100ms
Calculating -------------------------------------
 Nokolexbor at_xpath     57.077  (±31.5%) i/s -    920.000  in  20.156393s
   Nokogiri at_xpath     53.176  (±35.7%) i/s -    876.000  in  20.036717s

Comparison:
 Nokolexbor at_xpath:       57.1 i/s
   Nokogiri at_xpath:       53.2 i/s - same-ish: difference falls within error

Warming up --------------------------------------
    Nokolexbor xpath     3.000  i/100ms
      Nokogiri xpath     3.000  i/100ms
Calculating -------------------------------------
    Nokolexbor xpath     51.523  (±31.1%) i/s -    903.000  in  20.102568s
      Nokogiri xpath     58.438  (±35.9%) i/s -    852.000  in  20.001408s

Comparison:
      Nokogiri xpath:       58.4 i/s
    Nokolexbor xpath:       51.5 i/s - same-ish: difference falls within error
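
The Diff column can be reproduced from the raw figures with a little arithmetic. The sketch below divides the two throughputs and labels a pair "same-ish" when their error intervals overlap. That overlap rule is an assumption about how the comparison is decided, used here as an approximation; Result, interval and diff_label are hypothetical names.

```ruby
# Throughput (i/s) and relative error (%) copied from the raw data above.
Result = Struct.new(:name, :nokolexbor, :nokolexbor_err, :nokogiri, :nokogiri_err)

RESULTS = [
  Result.new('parsing',  487.564,  10.9, 93.470, 21.4),
  Result.new('at_css',   50_798.8, 13.8, 50.907, 35.4),
  Result.new('css',      7_437.6,  14.7, 52.338, 36.3),
  Result.new('at_xpath', 57.077,   31.5, 53.176, 35.7),
  Result.new('xpath',    51.523,   31.1, 58.438, 35.9)
].freeze

# Range covered by a measurement once its relative error is applied.
def interval(ips, err_pct)
  delta = ips * err_pct / 100.0
  (ips - delta)..(ips + delta)
end

def intervals_overlap?(a, b)
  a.begin <= b.end && b.begin <= a.end
end

# Assumed rule: overlapping intervals count as "same-ish"; otherwise
# report the throughput ratio.
def diff_label(result)
  ours = interval(result.nokolexbor, result.nokolexbor_err)
  theirs = interval(result.nokogiri, result.nokogiri_err)
  return 'same-ish' if intervals_overlap?(ours, theirs)

  format('%.2fx faster', result.nokolexbor / result.nokogiri)
end

RESULTS.each { |result| puts format('%-9s %s', result.name, diff_label(result)) }
```

With the error margins above 30% on the XPath runs, the two intervals overlap widely, which is why those rows carry no ratio.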