<img src="https://secure.travis-ci.org/pjotrp/bioruby-gff3-plugin.png" />
bio-gff3 is listed at biogems.info
bio-gff3
GFF3 parser, aimed at parsing big data GFF3 to return sequences of any type, including assembled mRNA, protein and CDS sequences.
Features:
- Take GFF3 (genome browser) information of any type, and assemble sequences, e.g. mRNA and CDS
- Options for low memory use and caching of records
- Support for external FASTA input files
- Use of multi-cores (NYI)
Currently the output is a FASTA file.
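To illustrate what "assembling" a sequence from GFF3 records means, here is a minimal sketch in plain Ruby. It does not use the bio-gff3 API: `Feature`, `assemble` and `to_fasta` are hypothetical names made up for this example, and the reference sequence is toy data. The sketch assumes the component records of one parent are joined in genomic order and reverse-complemented on the minus strand.

```ruby
# Hypothetical sketch, not the bio-gff3 implementation.
# A feature holds the GFF3 columns needed for assembly.
Feature = Struct.new(:seqid, :type, :start, :stop, :strand, :parent)

def reverse_complement(seq)
  seq.reverse.tr('ACGTacgt', 'TGCAtgca')
end

# Join the component segments in genomic order. GFF3 coordinates are
# 1-based and inclusive, so subtract 1 for Ruby string indexing.
# Minus-strand features are reverse-complemented after joining.
def assemble(features, reference)
  sorted = features.sort_by(&:start)
  joined = sorted.map { |f| reference[(f.start - 1)..(f.stop - 1)] }.join
  sorted.first.strand == '-' ? reverse_complement(joined) : joined
end

# Format one FASTA record, wrapping the sequence at `width` characters.
def to_fasta(id, seq, width = 60)
  ">#{id}\n" + seq.scan(/.{1,#{width}}/).join("\n") + "\n"
end

reference = 'AAATGGCCTTTGGGTAACCC' # toy reference sequence
cds = [
  Feature.new('chr1', 'CDS', 3, 8, '+', 'mrna1'),
  Feature.new('chr1', 'CDS', 12, 17, '+', 'mrna1')
]
puts to_fasta('mrna1', assemble(cds, reference))
```

With the toy data above, the two CDS segments join into ATGGCCGGGTAA, which is printed as a single FASTA record.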
You can use this plugin in two ways. First as a standalone program, second as a plugin library to BioRuby.
Note: a much faster and more flexible GFF3 parser is in the works. See github.com/mamarjan/gff3-pltools.
Install and run gff3-fetch
After installing Ruby 1.9 or later, you can install the gem with RubyGems:
gem install bio-gff3
Then, fetch mRNA and CDS information from GFF3 files and output to FASTA:
gff3-fetch mrna test/data/gff/test.gff3
gff3-fetch cds test/data/gff/test.gff3
Development
To use the library
require 'bio-gff3'
For coding examples see ./bin/gff3-fetch and the ./spec/*rb
You can run RSpecs with something like
rspec -I ../bioruby/lib/ spec/*.rb
(assuming ../bioruby points to a BioRuby source repository)
This implementation depends on BioRuby's basic GFF3 parser. Compared with using that parser alone, the plugin can assemble sequences, is faster, and does not consume all memory. The GFF3 specs are based on the output of the WormBase genome browser.
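For readers unfamiliar with the format, the following sketch shows what a single GFF3 record carries: nine tab-separated columns, with the last one holding `key=value` attributes. It is a plain-Ruby illustration only. It is neither BioRuby's parser nor this plugin's line parser, and `parse_gff3_line`, `parse_attributes` and `unescape` are hypothetical helper names.

```ruby
# Hypothetical sketch of GFF3 line parsing, for illustration only.
GFF3_COLUMNS = %i[seqid source type start stop score strand phase attributes].freeze

# GFF3 uses percent-encoding for reserved characters such as ';' and ','.
def unescape(text)
  text.gsub(/%([0-9A-Fa-f]{2})/) { Regexp.last_match(1).hex.chr }
end

# Turn "ID=cds1;Parent=mrna1,mrna2" into { 'ID' => [...], 'Parent' => [...] }.
def parse_attributes(field)
  return {} if field.nil? || field == '.'
  field.split(';').reject(&:empty?).each_with_object({}) do |pair, attrs|
    key, value = pair.split('=', 2)
    attrs[unescape(key)] = value.to_s.split(',').map { |v| unescape(v) }
  end
end

# Returns a Hash for a feature line, or nil for comments, directives,
# blank lines and lines that do not have exactly nine columns.
def parse_gff3_line(line)
  line = line.chomp
  return nil if line.empty? || line.start_with?('#')
  fields = line.split("\t", -1)
  return nil unless fields.size == 9
  record = GFF3_COLUMNS.zip(fields).to_h
  record[:start] = Integer(record[:start])
  record[:stop] = Integer(record[:stop])
  record[:phase] = record[:phase] == '.' ? nil : Integer(record[:phase])
  record[:attributes] = parse_attributes(record[:attributes])
  record
end

p parse_gff3_line("chr1\tgeneid\tCDS\t3\t8\t.\t+\t0\tID=cds1;Parent=mrna1,mrna2")
```

The `Parent` attribute is what links component records (exons, CDS segments) to the mRNA or gene they belong to, which is the information an assembler needs.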
See also
gff3-fetch --help
For a write-up see thebird.nl/bioruby/BioRuby_GFF3.html
Command line usage (CLI)
Fetch and assemble GFF3 types (ORF, mRNA, CDS) + print in FASTA format.

  gff3-fetch [options] type [filename.fa] filename.gff3

    --translate      : output as amino acid sequence
    --validate       : validate GFF3 file by translating
    --fix            : check 3-frame translation and fix, if possible
    --fix-wormbase   : fix 3-frame translation on ORFs named 'gene1'
    --no-assemble    : output each record as a sequence
    --phase          : output records using phase (useful w. no-assemble CDS to AA)

  type is any valid type in the GFF3 definition. For example:

    mRNA             : assemble mRNA
    CDS              : assemble CDS
    exon             : list all exons
    gene|ORF         : list gene ORFs
    other            : use any type from GFF3 definition, e.g. 'Terminate'

  and the following performance options:

    --parser bioruby : use BioRuby GFF3 parser (slow)
    --parser line    : use GFF3 line parser (faster, default)
    --block          : parse GFF3 by block (optimistic) -- NYI
    --cache full     : load all in RAM (fast, default)
    --cache none     : do not load anything in memory (slow)
    --cache lru      : use least recently used cache (limit RAM use, fast) -- NYI
    --max-cpus num   : use num threads -- NYI
    --emboss         : use EMBOSS translation (fast) -- NYI

  where NYI == Not Yet Implemented.

  Multiple GFF3 files can be used. With external FASTA files, always the last one before the GFF3 filename is matched. Make sure the FASTA file comes before the GFF3 file on the command line.

  Note that the above switches are only partially implemented at this stage.

Examples:

  Assemble mRNA and CDS information from test.gff3 (which includes sequence information)

    gff3-fetch mRNA test/data/gff/test.gff3
    gff3-fetch CDS test/data/gff/test.gff3

  Find CDS records from external FASTA file, adding phase and translate to protein sequence

    gff3-fetch --no-assemble --phase --translate CDS test/data/gff/MhA1_Contig1133.fa test/data/gff/MhA1_Contig1133.gff3

  Find mRNA from external FASTA file, without loading everything in RAM

    gff3-fetch --cache none mRNA test/data/gff/test-ext-fasta.fa test/data/gff/test-ext-fasta.gff3

  Validate GFF3 file using EMBOSS translation and validation

    gff3-fetch --cache none --validate --emboss mRNA test/data/gff/test-ext-fasta.fa test/data/gff/test-ext-fasta.gff3

  Find GENEID predicted terminal exons

    gff3-fetch terminal chromosome1.fa geneid.gff3

  Fine tuning output - show errors only

    gff3-fetch mRNA test/data/gff/test.gff3 --trace ERROR

  Fine tuning output - show messages matching regex

    gff3-fetch mRNA test/data/gff/test.gff3 --trace '=msg =~ /component/'

  Fine tuning output - write log messages to file.log

    gff3-fetch mRNA test/data/gff/test.gff3 --trace ERROR --logger file.log
For more information on output, see the bioruby-logger plugin.
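The usage text mentions that `--phase` is useful when translating unassembled CDS records to amino acids. As a rough illustration of why: in GFF3, the phase of a CDS record (0, 1 or 2) is the number of bases to skip at the start of the record before the first complete codon. The sketch below applies that rule in plain Ruby. `translate_with_phase` is a hypothetical helper, not part of gff3-fetch, and the standard genetic code is assumed.

```ruby
# Hypothetical sketch of phase-aware translation, for illustration only.
# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = %w[T C A G].freeze
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'.freeze
CODON_TABLE = BASES.product(BASES, BASES).map(&:join).zip(AMINO_ACIDS.chars).to_h.freeze

# Translate one unassembled CDS segment (plus-strand orientation).
# `phase` bases are skipped first; a trailing partial codon is dropped,
# and codons with unknown bases (e.g. N) are translated as 'X'.
def translate_with_phase(segment, phase)
  in_frame = segment.upcase[phase..-1] || ''
  in_frame.scan(/.{3}/).map { |codon| CODON_TABLE.fetch(codon, 'X') }.join
end

puts translate_with_phase('GGATGGCCTAA', 2) # skips "GG", then ATG GCC TAA
```

Without the phase, the same segment would be read as GGA TGG CCT, a different reading frame, which is why the phase column matters when records are not assembled first.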
Performance
time gff3-fetch cds m_hapla.WS217.dna.fa m_hapla.WS217.gff3 2> /dev/null > test.fa
Digesting parser:

Cache              real     user     sys      version           RAM
---------------------------------------------------------------------
full,bioruby       12m41    12m28    0m09     (0.8.0)
full,line          12m13    12m06    0m07     (0.8.5)
full,line,lazy     11m51    11m43    0m07     (0.8.6)           6,600M
none,bioruby       504m     477m     26m50    (0.8.0)
none,line          297m     267m     28m36    (0.8.5)
none,line,lazy     132m     106m     26m01    (0.8.6)           650M
lru,bioruby        533m     510m     22m47    (0.8.5)
lru,line           353m     326m     26m44    (0.8.5)   1K
lru,line           305m     281m     22m30    (0.8.5)   10K
lru,line,lazy      182m     161m     21m10    (0.8.6)   10K
lru,line,lazy      75m      75m      0m17     (0.8.6)   50K     730M
---------------------------------------------------------------------

Block parser:

Cache              real     user     sys      gff3     version
---------------------------------------------------------------------
in preparation
---------------------------------------------------------------------
where the input files are

52M m_hapla.WS217.dna.fa
456M m_hapla.WS217.gff3

and the timings were taken with ruby 1.9.2p136 (2010-12-25 revision 30365) [x86_64-linux] on an 8 CPU, 2.6 GHz (6MB cache), 16 GB RAM machine.
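The lru rows in the timings refer to a least-recently-used cache, which bounds memory by keeping only a fixed number of recently accessed entries. A minimal sketch of the idea in plain Ruby follows. It is not the plugin's cache: `LRUCache` and its methods are hypothetical names, and the sketch relies on Ruby's Hash preserving insertion order.

```ruby
# Hypothetical sketch of a least-recently-used cache, for illustration only.
class LRUCache
  def initialize(max_size)
    @max_size = max_size
    @store = {} # insertion order doubles as recency order (oldest first)
  end

  # Return the cached value for `key`; on a miss, compute it with the
  # given block and evict the least recently used entry if the cache is full.
  def fetch(key)
    if @store.key?(key)
      value = @store.delete(key) # removed here, re-inserted below as newest
    else
      value = yield(key)
      @store.shift if @store.size >= @max_size
    end
    @store[key] = value
  end

  def keys
    @store.keys
  end
end

cache = LRUCache.new(2)
cache.fetch('contig1') { |id| "sequence for #{id}" } # miss, loaded
cache.fetch('contig2') { |id| "sequence for #{id}" } # miss, loaded
cache.fetch('contig1') { |id| "sequence for #{id}" } # hit, now most recent
cache.fetch('contig3') { |id| "sequence for #{id}" } # miss, evicts contig2
p cache.keys
```

In a setting like this one, the block passed to `fetch` would stand in for the expensive step, such as reading a sequence from disk, so a larger cache trades RAM for fewer reloads.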
Note: the new parser by Marjan is orders of magnitude faster and far more flexible. See github.com/mamarjan/gff3-pltools
Cite
If you use this software, please cite http://dx.doi.org/10.1093/bioinformatics/bts080 or http://dx.doi.org/10.1093/bioinformatics/btq475
Copyright
Copyright © 2010-2012 Pjotr Prins <pjotr.prins@thebird.nl>