bio-rdf

Library and tools for using an RDF triple-store with biological data. To work with RDF it is often necessary to parse some (tabular) data and output RDF. This repository contains a collection of functions to parse files and generate RDF, allowing one to store data into a triple store.

There are some nuggets of gold to be found in this repository

Generic data parsers and generic generators that churn out (Turtle) RDF
bioruby-table as a tool for RDF generation from tabular data
Command line SPARQL combined with ERB templating

also

Cucumber tests for RDF generation

The name includes RDF, but that really is too a narrow view of the purpose of this tool, unfortunately alternative names (bio-semweb and bio-triplestore) are even worse, so we'll stick to bio-rdf.

At this point this library is a work in progress. Together with the https://github.com/wstrinz/publisci module by Will Strinz, we aim to get to a unified approach for data parsing and RDF output.

Parsers

First there are the parsers. Every (native) data-type has a parser module. This parser module controls the parsing flow. The actual parsing is handled by lower level routines, which may even reside in other libraries, such as BioRuby. The basic flow is

input -> parser -> output

The input can be anything, from directories, files to web based resources.

The output of the parser should be in some form of RDF triple format, though simple tab delimited tables can also be supported (depending on the parser/outputter).

Existing functionality:

PubMed:Entrez to table and RDF
GSEA, gene set enrichment analysis, to table and RDF
Somatic calling and copy number variation: varscan2, somatic sniper and more

more information on that below.

Note that any table file can be turned into RDF using the bio-table rubygem, which is automatically installed by bio-rdf. Example:

bio-table --format rdf table.csv

This project is linked with next generation sequencing, genome browsing, visualisation and QTL mapping. E.g.

Parser functionality

Existing bio-rdf parsers are listed on github. Here we describe the important ones:

Pubmed:Entrez (NCBI Pubmed)

PubMed comprises more than 22 million citations for biomedical literature from MEDLINE, life science journals, and online books.

bio-rdf uses the BioRuby PubMed module to fetch Pubmed records and writes them to a tab delimited output with

  bio-rdf pubmed --tabulate --search "Prins [au] BioGem"

prints

"pubmed","authors","title","journal","year","volume","issue","pages","doi","url" "Sharing programming resources between Bio* projects through remote procedure call and native call stack strategies.","Methods Mol Biol","2012","856","","513-527","10.1007/978-1-61779-585-5_21","http://www.ncbi.nlm.nih.gov/pubmed/22399473"

You can reformat the author list output with

  bio-rdf pubmed --tabulate --search 'Pjotr Prins [au] Bio' --format-author "surname+' '+initials.join('')" --format-authors-join ', '

which produces "Prins P, Goto N, Yates A, Gautier L, Willis S, Fields C, Katayama T" rather than the default.

You can convert the tabular format to RDF by using the included bioruby-table tool.

Gene set enrichment analysis (GSEA)

GSEA is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states. The GSEA tool produces two result files for every two biological states. We wrote a parser for the summary files, which outputs either a single table of results (based on a cut-off value). This table can be converted into a triple-store.

To create a tab delimited file from a GSEA result, where FDR < 0.25

  bio-rdf gsea --tabulate --exec "rec.fdr <= 0.25" ./gsea/output/ > results.txt

You can convert the tabular format to RDF by using the included bioruby-table tool.

To convert GMT input files to RDF use ./templates/gsea/gsea_gmt.erb.

Mapping Affymetrix probes to sequence information, through R/Bioconductor

R/Bioconductor contains a lot of modules with annotation data. This document. explores getting annotation data into a triple store. E.g., the first exercise matches Arabidipsis Affy probe to gene ID mapping information, and fetches the matching nucleotide sequences via a shared TAIR ID.

See document.

Somatic variant calling and copy number variation

  bio-rdf variant --help

More examples

Research oriented RDF parsers (aka one-offs) are included in extra. The documentation can be found in doc.

Installation

    gem install bio-rdf

In principle you can use bio-rdf with any RDF triple store.

To run the tests, however, you'll also need to install and run the 4store RDF triple store, which supports Linux and OS X only. E.g. on Debian/Ubuntu

    apt-get install 4store

you may need to add the user to the fourstore group and create the /var/lib/4store/ directory with the appropriate permissions. E.g.

    dbname=biorubyrdftest
    # as root
    mkdir /var/lib/4store/
    mkdir /var/lib/4store/$dbname
    chown fourstore.fourstore -R /var/lib/4store/
    chmod g+rw -R /var/lib/4store/
    # as normal user (this is what the test system does)
    4s-backend-setup $dbname
    4s-backend $dbname
    4s-httpd -p 8000 $dbname

now this should work

    # as user (if added to the fourstore group)
    bundle exec rake

To test for valid Turtle RDF rapper may help:

    rapper -i turtle file.rdf

Also visit http://localhost:8000/status/

Load 4-store with a database through

Load with

  rdf=file.rdf
  rapper -i turtle $rdf
  uri=http://localhost:8000/data/http://biobeat.org/data/$rdf

  curl -X DELETE $uri
  curl -T $rdf -H 'Content-Type: application/x-turtle' $uri

Again visit http://localhost:8000/status/

  ~/opt/local/bin/sparql-query --pipe localhost:8000/sparql/ < test_gsea.rq

Convert XML with XSL to a comma separated file (CSV)

  xalan -xsl ./scripts/sparql-results-csv.xsl -in sparql-result.xml

~ Useful headers

RDF output by bio-exominer https://github.com/pjotrp/bioruby-exominer

@prefix rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# . @prefix dc: http://purl.org/dc/elements/1.1/ . @prefix hgnc: http://identifiers.org/hgnc.symbol/ . @prefix ncbigene: https://www.google.nl/search?q=ncbi+gene+alias+ . @prefix : http://biobeat.org/rdf/exominer/ns# .

Command line SPARQL queries can be done with 4s-query and sparql-query. The latter has the advantage of generating readable (non-XML) output. To install sparql-query, fetch the git repository and

    apt-get install libglib2.0-dev libcurl4-openssl-dev libxml2-dev
    make

Example:

./sparql-query "http://localhost:8000/sparql/" 'SELECT * WHERE { ?s ?p ?o } LIMIT 10'
  .--------------.
  | ?s | ?p | ?o |
  |----+----+----|
  '--------------'

Empty result, if it is empty :). Note the final slash on the URL. For 4store you also may want to disable the the soft-limit, e.g.

./sparql-query "http://localhost:8000/sparql/?soft-limit=-1" 'SELECT * WHERE { ?s ?p ?o } LIMIT 10'

SPARQL with ERB and parameters

ERB can help reuse SPARQL queries. Here is an example.

Usage

    require 'bio-rdf'

The API doc is online. For more code examples see the test files in the source tree.

Project home page

Information on the source tree, documentation, examples, issues and how to contribute, see

http://github.com/pjotrp/bioruby-rdf

Cite

If you use this software, please cite one of

Biogems.info

This Biogem is published at #bio-rdf

bio-rdf

Development

Runtime

bio-rdf

Parsers

Parser functionality

Pubmed:Entrez (NCBI Pubmed)

Gene set enrichment analysis (GSEA)

Mapping Affymetrix probes to sequence information, through R/Bioconductor

Somatic variant calling and copy number variation

More examples

Installation

RDF output by bio-exominer https://github.com/pjotrp/bioruby-exominer

SPARQL with ERB and parameters

Usage

Project home page

Cite

Biogems.info

Copyright