Project

biblicit

0.02
No commit activity in last 3 years
No release in over 3 years
Extract citations from PDFs.
 Project Readme

biblicit

Extract citations from PDFs.

Note: The version is 2.x, but really should be 0.2.x.

You might also want to try Grobid. I have found it to perform better than the version of ParsCit used here, which throws away non-textual information (font, formatting, etc.).

Usage

  # Extract metadata from PDF content held in a string, using default tools and settings
  result = Biblicit::Extractor.extract(content: "a string containing the content of a PDF file")

  # Extract metadata from a file using all available tools
  result = Biblicit::Extractor.extract(file: "myfile.pdf", tools: [:parshed, :cb2bib], remote: true, token: false)

  # See reference information for "myfile.pdf"
  result[:citeseer][:title]
  result[:parshed][:title]
  result[:citeseer][:authors]
  # etc
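
Because each tool reports its own answer, you may want a single value per field. The following is a minimal sketch, assuming the result is a hash keyed by tool and then by field, as in the lookups above. The helper first_present and the sample hash are hypothetical and not part of biblicit; the value types (strings and arrays) are also an assumption.

```ruby
# Hypothetical helper, not part of biblicit: returns the first non-empty
# value of `field` across the given tools, in preference order.
def first_present(result, field, order = [:citeseer, :parshed])
  order.each do |tool|
    value = (result[tool] || {})[field]
    next if value.nil?
    next if value.respond_to?(:empty?) && value.empty?
    return value
  end
  nil
end

# Stand-in for the hash returned by Biblicit::Extractor.extract (assumed shape).
sample = {
  citeseer: { title: "", authors: ["A. Author"] },
  parshed:  { title: "An Example Title", authors: [] }
}

title   = first_present(sample, :title)   # falls through to :parshed
authors = first_present(sample, :authors) # taken from :citeseer
```

Changing the order argument changes which tool wins when both return a value.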

Algorithms

CiteSeer (default)

Wrapper around Perl code extracted from CiteSeerX.

Uses a model trained with the svm-light Support Vector Machine library.

ParsCit (default)

Wrapper around Perl & Ruby code from ParsCit, which is included as a Git submodule.

Uses a model trained with the CRF++ Conditional Random Fields library.

cb2Bib (not recommended)

Wrapper around cb2Bib in command-line mode.

Uses an apparently less sophisticated algorithm than the others to parse metadata. If remote: true is set, it then scrapes one of a large number of journal or public repository websites for a structured version of the citation data. Warning: sometimes it finds the wrong work!
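
One way to guard against a wrong remote match is to compare the title cb2Bib returns with the title found by a local tool. The following is a rough sketch, not something biblicit does itself: the helper names and the 0.5 threshold are hypothetical, and the word-overlap (Jaccard) measure is just one simple choice.

```ruby
require "set"

# Hypothetical helpers, not part of biblicit.

# Lowercased alphanumeric words of a title, as a set.
def title_tokens(title)
  title.to_s.downcase.scan(/[a-z0-9]+/).to_set
end

# Jaccard similarity between the word sets of two titles (0.0 to 1.0).
def title_similarity(a, b)
  ta = title_tokens(a)
  tb = title_tokens(b)
  union = ta | tb
  return 0.0 if union.empty?
  (ta & tb).size.to_f / union.size
end

# Treat the remote result as suspect when the two titles share few words.
# The default threshold is an arbitrary assumption; tune it on your own data.
def suspicious_remote_match?(local_title, remote_title, threshold = 0.5)
  title_similarity(local_title, remote_title) < threshold
end
```

For example, suspicious_remote_match?(result[:parshed][:title], result[:cb2bib][:title]) could be checked before trusting the scraped data, assuming the cb2Bib output is stored under a :cb2bib key.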

Requirements

There are a lot of requirements, but depending on your use case you may not need all of them.

Required to support various input file formats

Different tools are used for different input file formats.

PDF - Poppler

This provides pdftotext. You could install xpdf instead.

From source

Requires fontconfig.

  wget http://poppler.freedesktop.org/poppler-0.22.1.tar.gz
  tar -xzf poppler-0.22.1.tar.gz
  cd poppler-0.22.1
  ./configure
  make
  sudo make install

On Debian/Ubuntu

  sudo apt-get install poppler-utils

On OS X with Homebrew

  brew install poppler

PostScript - Ghostscript

This provides ps2ascii.

From source

  wget http://downloads.ghostscript.com/public/ghostscript-9.06.tar.gz
  tar -xzf ghostscript-9.06.tar.gz
  cd ghostscript-9.06
  ./configure
  make
  sudo make install

On Debian/Ubuntu

  sudo apt-get install ghostscript

On OS X with Homebrew

  brew install ghostscript

Other (e.g. docx) - AbiWord

This provides abiword.

On Debian/Ubuntu

  sudo apt-get install abiword

On OS X

As of writing, you're out of luck, because AbiWord doesn't compile on recent versions of OS X. According to their website, however, this is being actively worked on.
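
To see which of these converters are already installed, something like the following can help. It is a sketch only: the extension-to-command mapping is inferred from the headings in this section, biblicit's own dispatch may differ, and converter_for and on_path? are hypothetical helpers.

```ruby
# Hypothetical mapping, taken from the headings in this section;
# biblicit's internal dispatch may differ.
CONVERTERS = {
  ".pdf" => "pdftotext", # from Poppler (or xpdf)
  ".ps"  => "ps2ascii"   # from Ghostscript
}.freeze

# Anything that is not PDF or PostScript falls through to AbiWord.
def converter_for(path)
  CONVERTERS.fetch(File.extname(path).downcase, "abiword")
end

# True if an executable file with this name exists in a directory on PATH.
def on_path?(command)
  ENV.fetch("PATH", "").split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, command)
    File.file?(candidate) && File.executable?(candidate)
  end
end

# Report the converter each example file would need and whether it is installed.
%w[paper.pdf paper.ps paper.docx].each do |name|
  tool = converter_for(name)
  puts "#{name}: #{tool} #{on_path?(tool) ? 'found' : 'MISSING'}"
end
```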

Required to use either the ParsCit or CiteSeer algorithms

Perl modules

More than these might be required; this is what I had to add to my default installation.

From CPAN

  sudo cpan Digest::SHA1
  sudo cpan String::Approx

Required to use the ParsCit algorithm

CRF++

You can specify where you have installed CRF++ by setting the CRFPP_HOME environment variable.

From source

  wget http://crfpp.googlecode.com/files/CRF%2B%2B-0.57.tar.gz
  tar xvzf CRF++-0.57.tar.gz
  cd CRF++-0.57
  ./configure
  make
  sudo make install

On Debian/Ubuntu

  sudo apt-add-repository 'deb http://cl.naist.jp/~eric-n/ubuntu-nlp oneiric all'
  sudo apt-get update
  sudo apt-get install libcrf++-dev crf++

On OS X with Homebrew

  brew install crf++

Required to use the CiteSeer algorithm

svm-light

Required for header extraction (reference information for the input work itself).

The included model requires version 5, not the current version. You can specify where you have installed svm-light by setting the SVM_LIGHT_HOME environment variable.

From source

  mkdir svm_light5
  cd svm_light5
  wget http://download.joachims.org/svm_light/v5.00/svm_light.tar.gz
  tar -xzf svm_light.tar.gz
  make
  echo "export SVM_LIGHT_HOME=`pwd`" >> ~/.profile # or .bashrc or whatever
  source ~/.profile
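
If you rely on CRFPP_HOME or SVM_LIGHT_HOME, a quick check that they point at real directories can save some debugging. This is a minimal sketch with a hypothetical helper, tool_home_status. It treats an unset variable as a separate state rather than an error, on the assumption that the variables are only needed for non-default install locations.

```ruby
# Hypothetical pre-flight check, not part of biblicit.
# Maps each variable name to :unset, :ok, or :missing_dir.
def tool_home_status(env = ENV, names = %w[CRFPP_HOME SVM_LIGHT_HOME])
  names.to_h do |name|
    dir = env[name]
    status =
      if dir.nil? || dir.empty?
        :unset
      elsif File.directory?(dir)
        :ok
      else
        :missing_dir
      end
    [name, status]
  end
end

# Warn only about variables that are set but point nowhere.
tool_home_status.each do |name, status|
  warn "#{name} is set but is not a directory" if status == :missing_dir
end
```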

Required to use the cb2Bib algorithm

cb2Bib

From source (Linux)

  wget http://www.molspaces.com/dl/progs/cb2bib-1.4.9.tar.gz
  tar -xzvf cb2bib-1.4.9.tar.gz
  cd cb2bib-1.4.9
  ./configure --prefix /usr/local
  make
  sudo make install

From source (OS X)

Requires Qt & X11, unfortunately, and still requires a hack to work on recent versions of OS X.

  wget http://www.molspaces.com/dl/progs/cb2bib-1.4.9.tar.gz
  tar -xzvf cb2bib-1.4.9.tar.gz
  cd cb2bib-1.4.9
  ./configure --prefix /Applications/cb2Bib
  make # fails first time...
  mv src/Makefile src/Makefile.old
  sed 's|-lX11 -framework QtWebKit|-lX11 -L/usr/X11/lib -I/usr/X11/include -framework QtWebKit|' src/Makefile.old > src/Makefile
  make # should succeed now
  sudo make install

On Debian/Ubuntu

  sudo apt-get install cb2bib

Other

(I'm not currently sure what this was required for; TODO figure it out!)

On Debian/Ubuntu

  sudo apt-get install libicu-dev

Copying

Copyright Academia.edu or the original author(s) - see documentation in the included parscit and svm-header-parse directories.

Apache licensed (see LICENSE.TXT).

Please note svm-light is in general free only for non-commercial use, but can be used in this gem by permission of the author. For conditions on additional uses see the website.