biblicit
Extract citations from PDFs.
Note: The version is 2.x, but really should be 0.2.x.
You might also want to try Grobid, which I have found to perform better than the version of Parscit used here, which throws away non-textual information (font, formatting, etc).
Usage
# Extract metadata from a file using default tools and settings
result = Biblicit::Extractor.extract(content: "a string containing the content of a PDF file")
# Extract metadata from a file using all available tools
result = Biblicit::Extractor.extract(file: "myfile.pdf", tools: [:parshed, :cb2bib], remote: true, token: false)
# See reference information for "myfile.pdf"
result[:citeseer][:title]
result[:parshed][:title]
result[:citeseer][:authors]
# etc
Algorithms
CiteSeer (default)
Wrapper around Perl code extracted from CiteSeerX.
Uses a model trained with the svm-light Support Vector Machine library.
ParsCit (default)
Wrapper around Perl & Ruby code from ParsCit, which is included as a Git submodule.
Uses a model trained with the CRF++ Conditional Random Fields library.
cb2Bib (not recommended)
Wrapper around cb2Bib in command-line mode.
Uses an apparently less-sophisticated parsing algorithm than the others to parse metadata, but then, if :remote=true, scrapes one of a large number of journal or public repository websites for a structured version of the citation data. Warning: sometimes it finds the wrong work!
Requirements
There are a lot, but you may not need all of them, depending on your use case.
Required to support various input file formats
Different tools are used for different input file formats.
PDF - Poppler
This provides pdftotext
. You could install xpdf
instead.
From source
Requires fontconfig.
wget http://poppler.freedesktop.org/poppler-0.22.1.tar.gz
tar -xzf poppler-0.22.1.tar.gz
cd poppler-0.22.1
./configure
make
sudo make install
On Debian/Ubuntu
sudo apt-get install poppler-utils
On OS X with Homebrew
brew install poppler
Postscript - Ghostscript
This provides ps2ascii
.
From source
wget http://downloads.ghostscript.com/public/ghostscript-9.06.tar.gz
tar -xzf ghostscript-9.06.tar.gz
cd ghostscript-9.06
make
sudo make install
On Debian/Ubuntu
sudo apt-get install ghostscript
On OS X with Homebrew
brew install ghostscript
Other (e.g. docx) - AbiWord
This provides abiword
.
On Debian/Ubuntu
sudo apt-get install abiword
On OS X
As of writing, you're out of luck, because AbiWord doesn't compile on recent versions of OS X. According to their website, however, this is being actively worked on.
Required to use either the ParsCit or CiteSeer algorithms
Perl modules
More than these might be required; this is what I had to add to my default installation.
From CPAN
sudo cpan install Digest::SHA1
sudo cpan install String::Approx
Required to use the ParsCit algorithm
CRF++
You can specify where you have installed CRF++ by setting the CRFPP_HOME environment variable.
From source
wget http://crfpp.googlecode.com/files/CRF%2B%2B-0.57.tar.gz
tar xvzf CRF++-0.57.tar.gz
cd CRF++-0.57
./configure
make
sudo make install
On Debian/Ubuntu
sudo apt-add-repository 'deb http://cl.naist.jp/~eric-n/ubuntu-nlp oneiric all'
sudo apt-get update
sudo apt-get install libcrf++-dev crf++
On OS X with Homebrew
brew install crf++
Required to use the CiteSeer algorithm
svm-light
Required for header extraction (reference information for the input work itself).
The included model requires version 5, not the current version. You can specify where you have installed svm-light by setting the SVM_LIGHT_HOME environment variable.
From source
mkdir svm_light5
cd svm_light5
wget http://download.joachims.org/svm_light/v5.00/svm_light.tar.gz
tar -xzf svm_light.tar.gz
make
echo "export SVM_LIGHT_HOME=`pwd`" >> ~/.profile # or .bashrc or whatever
source ~/.profile
Required to use the cb2bib algorithm
cb2Bib
From source (Linux)
wget http://www.molspaces.com/dl/progs/cb2bib-1.4.9.tar.gz
tar -xzvf cb2bib-1.4.9.tar.gz
cd cb2bib-1.4.9
./configure --prefix /usr/local
make
sudo make install
From source (OS X)
Requires Qt & X11, unfortunately, and still requires a hack to work on recent versions of OS X.
wget http://www.molspaces.com/dl/progs/cb2bib-1.4.9.tar.gz
tar -xzvf cb2bib-1.4.9.tar.gz
cd cb2bib-1.4.9
./configure --prefix /Applications/cb2Bib
make # fails first time...
mv src/Makefile src/Makefile.old
sed 's|-lX11 -framework QtWebKit|-lX11 -L/usr/X11/lib -I/usr/X11/include -framework QtWebKit|' src/Makefile.old > src/Makefile
make # should succeed now
sudo make install
On Debian/Ubuntu
sudo apt-get install cb2bib
Other
(I'm not currently sure what this was required for; TODO figure it out!)
On Debian/Ubuntu
sudo apt-get install libicu-dev
Copying
Copyright Academia.edu or the original author(s) - see documentation in the included parscit and svm-header-parse directories.
Apache licensed (see LICENSE.TXT).
Please note svm-light is in general free only for non-commercial use, but can be used in this gem by permission of the author. For conditions on additional uses see the website.