MediaWiki API and Page content parser for Headlines (nested), TextBlocks, ListItems, and Links.

Wiki::Api

Wiki API is a Ruby gem that interfaces with the MediaWiki API (https://www.mediawiki.org/wiki/API:Main_page). It is more than an interface: it provides classes that abstract Page and Headline parsing, so you can iterate through a page's headlines and access their data.

NOTE: This gem uses nokogiri (http://nokogiri.org/Nokogiri.html) as its backend for HTML parsing. The major components (Page, Headline, Block, ListItem, and Link) are wrappers for easy data access; however, it's still possible to retrieve the raw HTML within these objects.

Requests to the MediaWiki API use the following URI structure:

http(s)://somemediawiki.org/w/api.php?action=parse&format=json&page="anypage"
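
As a rough illustration of that URI structure, the sketch below builds such a request URI using only Ruby's standard library. The helper name `parse_request_uri` is hypothetical and not part of the gem; the gem's own request code may differ.

```ruby
require 'uri'

# Hypothetical helper (not part of wiki-api): builds the action=parse
# request URI for a MediaWiki base URI and a page name.
def parse_request_uri(base, page)
  uri = URI.join(base, '/w/api.php')
  # encode_www_form escapes the page name for use in a query string
  uri.query = URI.encode_www_form(action: 'parse', format: 'json', page: page)
  uri
end

puts parse_request_uri('https://en.wiktionary.org', 'Wiktionary:Welcome,_newcomers')
```

Note that the page name is passed without the surrounding quotes shown in the pattern above; any special characters in it are escaped by `URI.encode_www_form`.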

Dependencies

  • nokogiri

Installation

Add this line to your application's Gemfile (bundler):

gem 'wiki-api', git: 'https://github.com/dblommesteijn/wiki-api.git'

And then execute:

$ bundle

Or install it yourself (RubyGems):

$ gem install wiki-api

Or try it from this repository (local) in a console:

$ bin/console

Setup

Define a configuration for your connection in an initializer script. This example uses wiktionary.org. NOTE: it can connect to both HTTP and HTTPS MediaWikis (however, you'll get a 302 response from MediaWiki).

Set up the default configuration (initializer script)

Wiki::Api::Connect.config = { uri: 'https://en.wiktionary.org' }

Running tests

$ rake test

Usage

Query a Page and Headline

Requesting headlines from a given page.

page = Wiki::Api::Page.new(name: 'Wiktionary:Welcome,_newcomers')
# the root headline equals the pagename
puts page.root_headline.name
# iterate next level of headlines
page.root_headline.headlines.each do |headline_name, headline|
  # printing headline name (PageHeadline)
  puts headline.name
end

Getting headlines for a given name.

page = Wiki::Api::Page.new(name: 'Wiktionary:Welcome,_newcomers')
# lookup headline by name (underscore and case are ignored)
headline = page.root_headline.headline('editing wiktionary').first
# printing headline name (PageHeadline)
puts headline.name
# get the type of nested headline (html h1,2,3,4 etc.)
puts headline.type

Basic Page structure

page = Wiki::Api::Page.new(name: 'Wiktionary:Welcome,_newcomers')
# iterate PageHeadline objects
page.root_headline.headlines.each do |headline_name, headline|
  # exposing nokogiri internal elements
  elements = headline.elements.flatten
  elements.each do |element|
    # print will result in: Nokogiri::XML::Text or Nokogiri::XML::Element
    puts element.class
  end

  # the block holds the content belonging to this headline
  block = headline.block
  # string representation of all nested text
  block.to_texts
  # iterate PageListItem objects
  block.list_items.each do |list_item|
    # string representation of nested text
    list_item.to_text
    # iterate PageLink objects
    list_item.links.each do |link|
      # see the part 'iterate PageLink objects' further down
    end
  end

  # iterate PageLink objects
  block.links.each do |link|
    # absolute URI object
    link.uri
    # html link
    link.html
    # link name
    link.title
    # string representation of nested text
    link.to_text
  end
end

Example using Global config (https://en.wikipedia.org/wiki/Ruby_on_Rails)

This is an example of querying wikipedia.org for the page "Ruby_on_Rails" and printing the links of each list item under the References headline.

# setting a target config
Wiki::Api::Connect.config = { uri: 'https://en.wikipedia.org' }

# querying the page
page = Wiki::Api::Page.new(name: 'Ruby_on_Rails')

# get headlines with the name References (there can be multiple headlines with the same name!)
headlines = page.root_headline.headline('References')

# iterate headlines
headlines.each do |headline|
  # iterate list items on the given headline
  headline.block.list_items.each do |list_item|
    # print the uri of all links
    puts list_item.links.map(&:uri)
  end
end

This is the same example as the one above, except that instead of setting a global config, the URI is passed directly to the page to direct the requests.

# querying the page
page = Wiki::Api::Page.new(name: 'Ruby_on_Rails', uri: 'https://en.wikipedia.org')

# get headlines with the name References (there can be multiple headlines with the same name!)
headlines = page.root_headline.headline('References')

# iterate headlines
headlines.each do |headline|
  # iterate list items on the given headline
  headline.block.list_items.each do |list_item|
    # print the uri of all links
    puts list_item.links.map(&:uri)
  end
end

Example searching headlines

This example shows how the headlines can be searched. For more info check: https://github.com/dblommesteijn/wiki-api/blob/master/lib/wiki/api/page.rb#L97

# querying the page
page = Wiki::Api::Page.new(name: 'Ruby_on_Rails', uri: 'https://en.wikipedia.org')

# NOTE: the following are all valid headline names:
# request headline (by literal name)
headlines = page.root_headline.headline('Philosophy_and_design')
puts headlines.map(&:name)
# request headline (by downcase name)
headlines = page.root_headline.headline('philosophy_and_design')
puts headlines.map(&:name)
# request headline (by human name)
headlines = page.root_headline.headline('philosophy and design')
puts headlines.map(&:name)

# NOTE2: headlines are matched on headline.start_with?(requested_headline),
# so because of the start_with? comparison a name prefix should work as well!
headlines = page.root_headline.headline('philosophy')
puts headlines.map(&:name)
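
The matching behaviour described in the comments above (case and underscores ignored, prefix match via `start_with?`) can be sketched in plain Ruby as follows. The helpers `normalize_headline` and `headline_match?` are hypothetical names used for illustration only; the gem's actual implementation lives in the linked page.rb and may differ in detail.

```ruby
# Hypothetical helpers (not part of wiki-api) illustrating the matching rule:
# names are compared lowercase, with spaces treated as underscores.
def normalize_headline(name)
  name.downcase.tr(' ', '_')
end

# True when the headline name starts with the requested (normalized) name.
def headline_match?(headline_name, requested)
  normalize_headline(headline_name).start_with?(normalize_headline(requested))
end

names = %w[History Philosophy_and_design Technical_overview]
matches = names.select { |name| headline_match?(name, 'philosophy and') }
puts matches.inspect # prints ["Philosophy_and_design"]
```

Because the comparison is a prefix match, a short search term can return more than one headline, which is presumably why `headline` returns a list rather than a single object.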

Example searching headlines in depth

Search recursively through all nested headlines, optionally limiting the depth of the search.

# querying the page
page = Wiki::Api::Page.new(name: 'Ruby_on_Rails', uri: 'https://en.wikipedia.org')
# get root
root_headline = page.root_headline
# lookup 'framework structure' recursively through all nested headlines
headline = root_headline.headline_in_depth('framework structure').first
puts headline.name
# NOTE: lookup of nested headlines does not work with the headline function,
# which only searches the current level ('Framework_structure' is nested within 'Technical_overview')
headline = root_headline.headline('framework structure').first
# the depth of the search can be limited by adding the depth parameter
# NOTE: the example below will return nil, 'Framework_structure' is nested beyond depth = 0!
depth = 0
headline = root_headline.headline_in_depth('framework structure', depth).first
# increasing depth search will show the requested headline
depth = 5
headline = root_headline.headline_in_depth('framework structure', depth).first
puts headline.name
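
To make the effect of the depth parameter concrete, here is a minimal sketch of a depth-limited recursive search over a nested tree. `HeadlineNode` and `find_in_depth` are hypothetical stand-ins, not the gem's classes or methods, and the sketch assumes that depth = 0 means "direct children only" and that omitting the depth means "no limit".

```ruby
# Hypothetical stand-in for a nested headline tree (not the gem's classes).
HeadlineNode = Struct.new(:name, :children)

# Collects descendants whose name starts with `prefix` (case-insensitive).
# Assumption: depth = 0 inspects direct children only; nil means no limit.
def find_in_depth(node, prefix, depth = nil)
  node.children.flat_map do |child|
    found = child.name.downcase.start_with?(prefix.downcase) ? [child] : []
    # stop descending once the depth budget is used up
    next found if depth && depth <= 0

    found + find_in_depth(child, prefix, depth && depth - 1)
  end
end

root = HeadlineNode.new('Ruby_on_Rails', [
  HeadlineNode.new('History', []),
  HeadlineNode.new('Technical_overview', [HeadlineNode.new('Framework_structure', [])])
])

p find_in_depth(root, 'framework', 0).map(&:name) # prints []
p find_in_depth(root, 'framework', 5).map(&:name) # prints ["Framework_structure"]
```

With depth = 0 the nested headline is not reached, so calling `.first` on the empty result gives nil, mirroring the behaviour described in the comments above.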