Arx
A Ruby interface for querying academic papers on the arXiv search API.
arXiv provides an advanced search utility on their website, as well as an extensive search API that allows for the external querying of academic papers hosted on their website.
Although Scholastica offers a great Ruby gem for retrieving papers from arXiv through the search API, that gem only retrieves one paper at a time and only supports searching for papers by ID.
Arx is a gem that allows for quick and easy querying of the arXiv search API, without having to worry about manually writing your own search query strings or parsing the resulting XML query response to find the data you need.
Examples
- Suppose we wish to search for papers in the cs.FL (Formal Languages and Automata Theory) category whose title contains "Buchi Automata", not authored by Tomáš Babiak, sorted by submission date (latest first).
require 'arx'
papers = Arx(sort_by: :submitted_at) do |query|
  query.category('cs.FL')
  query.title('Buchi Automata').and_not.author('Tomáš Babiak')
end
- Suppose we wish to retrieve the main category of the paper with arXiv ID 1809.09415, the name of the first author and the date it was published.
require 'arx'
paper = Arx('1809.09415')
paper.authors.first.name
#=> "Christof Löding"
paper.categories.first.full_name # or paper.primary_category.full_name
#=> "Formal Languages and Automata Theory"
paper.published_at
#=> #<DateTime: 2018-09-25T11:40:39+00:00 ((2458387j,42039s,0n),+0s,2299161j)>
Features
- Ruby classes Arx::Paper, Arx::Author and Arx::Category that wrap the resulting Atom XML query result from the search API.
- Supports querying by a paper's ID, title, author(s), abstract, subject category, comment, journal reference, report number or last updated date.
- Provides a small DSL for writing queries.
- Supports searching fields by exact match.
Installation
To install Arx, run the following in your terminal:
gem install arx
Documentation
The documentation for Arx is hosted on .
Contributing
All contributions to Arx are greatly appreciated. Contribution guidelines can be found here.
Usage
Before you start using Arx, you'll have to ensure that the gem is required (either in your current working file, or shell such as IRB):
require 'arx'
Building search queries
Query requests submitted to the arXiv search API are typically of the following form (where the query string is the part following the ?):
http://export.arxiv.org/api/query?search_query=ti:%22Buchi+Automata%22+AND+cat:%22cs.FL%22
This particular query searches for papers whose title includes the string Buchi Automata and that are in the Formal Languages and Automata Theory (cs.FL) category.
Writing out queries like this by hand quickly becomes time-consuming and tedious.
The Arx::Query class provides a small DSL for writing these query strings.
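To make the URL above concrete, here is a minimal sketch of assembling it by hand with Ruby's standard library, which is the kind of work the DSL is meant to save you from. The helpers field_term and search_url are hypothetical names used for illustration and are not part of Arx.

```ruby
require 'uri'

# Hypothetical helper (not part of Arx): builds one quoted field term, e.g. ti:%22...%22.
# Only the value is encoded, so the colon after the field prefix stays literal.
def field_term(prefix, value)
  "#{prefix}:#{URI.encode_www_form_component(%("#{value}"))}"
end

# Hypothetical helper: joins field terms with AND and appends them to the API endpoint.
def search_url(*terms)
  "http://export.arxiv.org/api/query?search_query=#{terms.join('+AND+')}"
end

puts search_url(field_term('ti', 'Buchi Automata'), field_term('cat', 'cs.FL'))
```

Even for two fields, the quoting and encoding rules have to be handled manually, and the effort grows with every extra connective or group.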
Sorting criteria and order
The order in which search results are returned can be modified through the sort_by and sort_order keyword arguments (in the Arx::Query initializer):
- sort_by accepts the symbols :relevance, :updated_at or :submitted_at
- sort_order accepts the symbols :ascending or :descending
# Sort by submission date in ascending order (earliest first)
Arx::Query.new(sort_by: :submitted_at, sort_order: :ascending)
#=> sortBy=submittedDate&sortOrder=ascending
Note: The default setting is to sort by :relevance in :descending order:
Arx::Query.new #=> sortBy=relevance&sortOrder=descending
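The relationship between the Ruby symbols and the API parameter values shown above can be sketched as a simple lookup. This is an assumed mapping inferred from the outputs in this section, not Arx's actual source; the helper name sort_params is hypothetical.

```ruby
# Assumed mapping from the DSL's symbols to the arXiv API's parameter values.
# This is an illustration, not Arx's actual implementation.
def sort_params(sort_by: :relevance, sort_order: :descending)
  sort_by_values = {
    relevance: 'relevance',
    updated_at: 'lastUpdatedDate',
    submitted_at: 'submittedDate'
  }
  sort_order_values = { ascending: 'ascending', descending: 'descending' }

  # Reject unknown symbols early instead of sending an invalid request.
  by = sort_by_values.fetch(sort_by) { raise ArgumentError, "unknown sort_by: #{sort_by.inspect}" }
  order = sort_order_values.fetch(sort_order) { raise ArgumentError, "unknown sort_order: #{sort_order.inspect}" }
  "sortBy=#{by}&sortOrder=#{order}"
end

puts sort_params(sort_by: :submitted_at, sort_order: :ascending)
puts sort_params
```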
Paging
The arXiv API offers a paging mechanism that allows you to get chunks of the result set at a time. It can be used through the start and max_results keyword arguments (in the Arx::Query initializer):
- start is the index of the first returned result (using 0-based indexing)
- max_results is the number of results returned by the query
# Get results 10-29
Arx::Query.new(start: 10, max_results: 20)
#=> start=10&max_results=20
Note: The default values are those of the arXiv API: start defaults to 0 and max_results defaults to 10:
Arx::Query.new #=> start=0&max_results=10
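When iterating over a large result set, it can help to compute start from a page number. The sketch below shows the arithmetic; page_params and result_range are hypothetical helpers, not part of Arx.

```ruby
# Hypothetical helper (not part of Arx): converts a zero-based page number
# into the start and max_results values for that page.
def page_params(page, per_page: 10)
  { start: page * per_page, max_results: per_page }
end

# Hypothetical helper: the zero-based result indices covered by a start/max_results pair.
def result_range(start:, max_results:)
  start...(start + max_results)
end

params = page_params(2, per_page: 20)
puts params.inspect
puts result_range(**params).inspect
```

Since the initializer takes start and max_results as keyword arguments, a hash like this could presumably be splatted into it, e.g. Arx::Query.new(**page_params(2, per_page: 20)).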
Searching by ID
The arXiv search API supports searching for papers not only by metadata fields, but also by ID. When searching by ID, a different URL query string parameter, id_list, is used (instead of search_query as seen before).
Although id_list can be used to "search by ID", it is better to think of it as restricting the search space to the papers with the provided IDs:
search_query present? | id_list present? | Returns
---|---|---
Yes | No | Articles that match search_query
No | Yes | Articles that are in id_list
Yes | Yes | Articles in id_list that also match search_query
To search by ID, pass the arXiv paper identifiers (IDs) or URLs to the Arx::Query initializer:
Arx::Query.new('https://arxiv.org/abs/1711.05738', '1809.09415')
#=> sortBy=relevance&sortOrder=descending&id_list=1711.05738,1809.09415
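The example above accepts both a URL and a bare identifier. One way such normalization might work is sketched below; extract_id and id_list_param are hypothetical helpers, and the URL pattern is an assumption that only covers arxiv.org/abs/ links.

```ruby
# Hypothetical helper (not part of Arx): strips an arXiv abstract URL down to its identifier.
# Plain identifiers are returned unchanged.
def extract_id(id_or_url)
  id_or_url.sub(%r{\Ahttps?://arxiv\.org/abs/}, '')
end

# Hypothetical helper: builds the id_list parameter from identifiers and/or URLs.
def id_list_param(*ids_or_urls)
  "id_list=#{ids_or_urls.map { |value| extract_id(value) }.join(',')}"
end

puts id_list_param('https://arxiv.org/abs/1711.05738', '1809.09415')
```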
Searching by metadata fields
The arXiv search API supports searches for the following paper metadata fields:
FIELDS = {
  title: 'ti',                    # Title
  author: 'au',                   # Author
  abstract: 'abs',                # Abstract
  comment: 'co',                  # Comment
  journal: 'jr',                  # Journal reference
  category: 'cat',                # Subject category
  report: 'rn',                   # Report number
  updated_at: 'lastUpdatedDate',  # Last updated date
  submitted_at: 'submittedDate',  # Submission date
  all: 'all'                      # All (of the above)
}
Each of these fields has an instance method defined under the Arx::Query class. For example:
# Papers whose title contains the string "Buchi Automata".
q = Arx::Query.new
q.title('Buchi Automata')
#=> sortBy=relevance&sortOrder=descending&search_query=ti:%22Buchi+Automata%22
Exact matches
By default, this searches for exact matches of the provided string (by adding double quotes around the string; in the query string, these appear as %22). To disable this, set the exact keyword argument (which defaults to true) to false:
# Papers whose title contains either the words "Buchi" or "Automata".
q = Arx::Query.new
q.title('Buchi Automata', exact: false)
#=> sortBy=relevance&sortOrder=descending&search_query=ti:Buchi+Automata
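The only difference between the two query strings above is whether the value is wrapped in double quotes before encoding. A minimal sketch of that step, using a hypothetical encode_term helper that is not part of Arx:

```ruby
require 'uri'

# Hypothetical helper (not part of Arx): encodes one search term,
# wrapping it in double quotes only when an exact match is requested.
def encode_term(prefix, value, exact: true)
  value = %("#{value}") if exact
  "#{prefix}:#{URI.encode_www_form_component(value)}"
end

puts encode_term('ti', 'Buchi Automata')
puts encode_term('ti', 'Buchi Automata', exact: false)
```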
Multiple values for one field
Sometimes you might want to search a single field for multiple values. This can be done by passing them as additional arguments (or by providing an Array):
Note: The default logical connective used when there are multiple values for one field is and.
# Papers authored by both "Eleonora Andreotti" and "Dominik Edelmann".
q = Arx::Query.new
q.author('Eleonora Andreotti', 'Dominik Edelmann')
To change the logical connective to or or and_not (and not), use the connective keyword argument:
# Papers authored by either "Eleonora Andreotti" or "Dominik Edelmann".
q = Arx::Query.new
q.author('Eleonora Andreotti', 'Dominik Edelmann', connective: :or)
# Papers authored by "Eleonora Andreotti" and not "Dominik Edelmann".
q = Arx::Query.new
q.author('Eleonora Andreotti', 'Dominik Edelmann', connective: :and_not)
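This document does not show the query strings these calls produce. As a rough illustration, the sketch below joins several values for one field using the arXiv API's boolean operators (AND, OR, ANDNOT). The helper combine is hypothetical, and the exact output Arx generates may differ.

```ruby
require 'uri'

# Hypothetical helper (not part of Arx): joins several values for one field
# with one of the arXiv API's boolean operators (AND, OR, ANDNOT).
def combine(prefix, values, connective: :and)
  operators = { and: 'AND', or: 'OR', and_not: 'ANDNOT' }
  operator = operators.fetch(connective)
  values
    .map { |value| "#{prefix}:#{URI.encode_www_form_component(%("#{value}"))}" }
    .join("+#{operator}+")
end

authors = ['Eleonora Andreotti', 'Dominik Edelmann']
puts combine('au', authors)
puts combine('au', authors, connective: :or)
puts combine('au', authors, connective: :and_not)
```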
Chaining subqueries (logical connectives)
Note: By default, subqueries (successive instance method calls) are chained with a logical and connective.
# Papers authored by "Dominik Edelmann" in the "Numerical Analysis" (math.NA) category.
q = Arx::Query.new
q.author('Dominik Edelmann')
q.category('math.NA')
To change the logical connective used to chain subqueries, use the and, or, and_not instance methods between the subquery calls:
# Papers authored by "Eleonora Andreotti" in neither the "Numerical Analysis" (math.NA) nor the "Combinatorics" (math.CO) category.
q = Arx::Query.new
q.author('Eleonora Andreotti')
q.and_not
q.category('math.NA', 'math.CO', connective: :or)
Grouping subqueries
Sometimes you'll have a query that requires nested or grouped logic, using parentheses. This can be done using the Arx::Query#group method.
This method accepts a block and parenthesises the result of whichever methods were called within the block.
For example, this allows the last query from the previous section to be written as:
# Papers authored by "Eleonora Andreotti" in neither the "Numerical Analysis" (math.NA) nor the "Combinatorics" (math.CO) category.
q = Arx::Query.new
q.author('Eleonora Andreotti')
q.and_not
q.group do
  q.category('math.NA').or.category('math.CO')
end
Another more complicated example with two grouped subqueries:
# Papers whose title contains "Buchi Automata", either authored by "Tomáš Babiak", or in the "Formal Languages and Automata Theory" (cs.FL) category and not the "Computational Complexity" (cs.CC) category.
q = Arx::Query.new
q.title('Buchi Automata')
q.group do
  q.author('Tomáš Babiak')
  q.or
  q.group do
    q.category('cs.FL').and_not.category('cs.CC')
  end
end
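To visualise what grouping means for the boolean expression, the sketch below builds human-readable (not URL-encoded) versions of the two grouped queries above. The helper group here is a hypothetical top-level method for illustration only; it is not Arx::Query#group, and the strings are an assumption about the logical shape rather than Arx's actual output.

```ruby
# Hypothetical helper (not Arx::Query#group): wraps a subquery in parentheses.
def group(expression)
  "(#{expression})"
end

# Single group: author ANDNOT (category OR category).
categories = group('cat:"math.NA" OR cat:"math.CO"')
puts 'au:"Eleonora Andreotti" ANDNOT ' + categories

# Nested groups: title AND (author OR (category ANDNOT category)).
nested = group('au:"Tomáš Babiak" OR ' + group('cat:"cs.FL" ANDNOT cat:"cs.CC"'))
puts 'ti:"Buchi Automata" AND ' + nested
```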
Running search queries
Search queries can be executed with the Arx() method (alias of Arx.search). This method accepts the same parameters as the Arx::Query initializer, including the list of IDs.
Without a predefined query
Calling the Arx() method with a block allows for the construction and execution of a new query.
Note: If running a search query this way, then the sort_by and sort_order parameters can be added as additional keyword arguments.
# Papers in the cs.FL category whose title contains "Buchi Automata", not authored by Tomáš Babiak
results = Arx(sort_by: :submitted_at) do |query|
  query.category('cs.FL')
  query.title('Buchi Automata').and_not.author('Tomáš Babiak')
end
results.size #=> 18
With a predefined query
The Arx() method accepts a predefined Arx::Query object through the query keyword parameter.
Note: If using the query parameter, the sort_by and sort_order criteria should be defined in the Arx::Query object initializer rather than as arguments in Arx().
# Papers in the cs.FL category whose title contains "Buchi Automata", not authored by Tomáš Babiak
q = Arx::Query.new(sort_by: :submitted_at)
q.category('cs.FL')
q.title('Buchi Automata').and_not.author('Tomáš Babiak')
results = Arx(query: q)
results.size #=> 18
With IDs
The Arx() method accepts a list of IDs as a splat parameter, just like the Arx::Query initializer.
If only one ID is specified, then a single Arx::Paper is returned:
result = Arx('1809.09415')
result.class #=> Arx::Paper
Otherwise, an Array of Arx::Paper objects is returned.
Query results
Search results are typically:
- an Array, either empty if no papers matched the supplied query, or containing Arx::Paper objects.
- a single Arx::Paper object (when the search method is only supplied with one ID).
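Because the return type depends on how the search was called, code that processes results generically may want to normalise them first. A minimal sketch, using a hypothetical as_papers helper that is not part of Arx:

```ruby
# Hypothetical helper (not part of Arx): always returns an Array,
# whether the search returned a single paper or a list of papers.
def as_papers(result)
  result.is_a?(Array) ? result : [result]
end

# Strings stand in for Arx::Paper objects so the sketch runs without the gem.
puts as_papers('paper-a').inspect
puts as_papers(%w[paper-a paper-b]).inspect
puts as_papers([]).inspect
```

An explicit is_a?(Array) check is used rather than Kernel#Array, since the latter would call to_a on the object if it happens to define one.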
Entities
The Arx::Paper, Arx::Author and Arx::Category classes provide a simple interface for the metadata concerning a single arXiv paper:
Arx::Paper
paper = Arx('1809.09415')
#=> #<Arx::Paper:0x00007fb657b59bd0>
paper.id
#=> "1809.09415"
paper.id(true)
#=> "1809.09415v1"
paper.url
#=> "http://arxiv.org/abs/1809.09415"
paper.url(true)
#=> "http://arxiv.org/abs/1809.09415v1"
paper.version
#=> 1
paper.revision?
#=> false
paper.title
#=> "On finitely ambiguous Büchi automata"
paper.summary
#=> "Unambiguous B\\\"uchi automata, i.e. B\\\"uchi automata allowing..."
paper.authors
#=> [#<Arx::Author:0x00007fb657b63108>, #<Arx::Author:0x00007fb657b62438>]
# Paper's categories
paper.primary_category
#=> #<Arx::Category:0x00007fb657b61830>
paper.categories
#=> [#<Arx::Category:0x00007fb657b60e80>]
# Dates
paper.published_at
#=> #<DateTime: 2018-09-25T11:40:39+00:00 ((2458387j,42039s,0n),+0s,2299161j)>
paper.updated_at
#=> #<DateTime: 2018-09-25T11:40:39+00:00 ((2458387j,42039s,0n),+0s,2299161j)>
# Paper's comment
paper.comment?
#=> false
paper.comment
#=> Arx::Error::MissingField (arXiv paper 1809.09415 is missing the `comment` metadata field)
# Paper's journal reference
paper.journal?
#=> false
paper.journal
#=> Arx::Error::MissingField (arXiv paper 1809.09415 is missing the `journal` metadata field)
# Paper's PDF URL
paper.pdf?
#=> true
paper.pdf_url
#=> "http://arxiv.org/pdf/1809.09415v1"
# Paper's DOI (Digital Object Identifier) URL
paper.doi?
#=> true
paper.doi_url
#=> "http://dx.doi.org/10.1007/978-3-319-98654-8_41"
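As shown above, reading comment or journal on a paper that lacks the field raises Arx::Error::MissingField, so it is safer to check the matching predicate first. The sketch below wraps that pattern in a hypothetical optional_field helper (not part of Arx) and uses a stand-in object instead of a real Arx::Paper. It only applies to fields whose predicate is the field name followed by a question mark, such as comment and journal.

```ruby
# Hypothetical helper (not part of Arx): reads an optional field, returning a
# default instead of raising when the predicate method (e.g. comment?) is false.
def optional_field(paper, name, default: nil)
  paper.public_send("#{name}?") ? paper.public_send(name) : default
end

# Stand-in for an Arx::Paper without a comment, so the sketch runs without the gem.
def paper_without_comment
  paper = Object.new
  paper.define_singleton_method(:comment?) { false }
  paper.define_singleton_method(:comment) { raise 'missing the comment metadata field' }
  paper
end

puts optional_field(paper_without_comment, :comment, default: '(no comment)')
```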
Arx::Author
paper = Arx('cond-mat/9609089')
#=> #<Arx::Paper:0x00007fb657a7b8d0>
author = paper.authors.first
#=> #<Arx::Author:0x00007fb657a735e0>
author.name
#=> "F. Gebhard"
author.affiliated?
#=> true
author.affiliations
#=> ["ILL Grenoble, France"]
Arx::Category
paper = Arx('cond-mat/9609089')
#=> #<Arx::Paper:0x00007fb657b59bd0>
category = paper.primary_category
#=> #<Arx::Category:0x00007fb6570609b8>
category.name
#=> "cond-mat"
category.full_name
#=> "Condensed Matter"
Acknowledgements
A large portion of this library is based on the brilliant work done by Scholastica in their arxiv gem for retrieving individual papers from arXiv through the search API.
Arx was created mostly due to the seemingly inactive nature of Scholastica's repository. Additionally, it would have been infeasible to contribute such large changes to an already well-established gem, especially since https://scholasticahq.com/ appears to be dependent upon this gem.
Nevertheless, a special thanks goes out to Scholastica for providing the influence for Arx.
Contributors
All contributions to this repository are greatly appreciated. Contribution guidelines can be found here.
- eonu (Edwin Onuonga)
- xuanxu (Juanjo Bazán)
Arx © 2019-2020, Edwin Onuonga - Released under the MIT License.
Authored and maintained by Edwin Onuonga.