bio-maf
This is a plugin for BioRuby adding support for the Multiple Alignment Format (MAF), used in bioinformatics to store whole-genome sets of multiple sequence alignments.
This library provides indexed and sequential access to MAF data, as well as performing various manipulations on it and writing modified MAF files.
For more information, see the project wiki.
Developer documentation generated with YARD is available at rubydoc.info.
This is being developed by Clayton Wheeler as part of the Google Summer of Code 2012, under the auspices of the Open Bioinformatics Foundation. The development blog may be of interest.
Dependencies
Kyoto Cabinet is a database library, required for building MAF indexes. Install the core library in the appropriate way for your platform, as documented here.
If you're using MRI, the kyotocabinet-ruby gem will be used to interact with Kyoto Cabinet. For best performance, however, you should really consider using JRuby. On JRuby, the kyotocabinet-java gem will be used instead; this builds a Java library using JNI to call into Kyoto Cabinet. Please file a bug report if you encounter problems building or using this gem, which is still fairly new.
Installation
bio-maf
is now published as a Ruby gem.
$ gem install bio-maf
Performance
This parser performs best under JRuby, particularly with Java 7. See the Performance wiki page for more information. For best results, use JRuby in 1.9 mode with the ObjectProxyCache disabled:
$ export JRUBY_OPTS='--1.9 -Xji.objectProxyCache=false'
Many parsing modes are multithreaded. Under JRuby, it will default to
using one parser thread per available core, but if desired this can be
configured with the :threads
parser option.
Ruby 1.9.3 is fully supported, but does not perform as well, especially since its concurrency features are not useful for this workload.
Usage
Create an index on a MAF file
Much of the functionality of this library relies on an index. You can create one with maf_index(1), like so:
$ maf_index test/data/mm8_chr7_tiny.maf /tmp/mm8_chr7_tiny.kct
To index all sequences for searching, not just the reference sequence:
$ maf_index --all test/data/mm8_chr7_tiny.maf /tmp/mm8_chr7_tiny.kct
To build an index programmatically:
require 'bio-maf'
parser = Bio::MAF::Parser.new("test/data/mm8_chr7_tiny.maf")
idx = Bio::MAF::KyotoIndex.build(parser, "/tmp/mm8_chr7_tiny.kct", false)
Compress and index a MAF file
This library fully supports BGZF-compressed MAF files, which
combine gzip compression with blocking for efficient random
access. These can be generated with blocking optimized for MAF access
using the included
maf_bgzip(1)
tool. This writes BGZF-compressed MAF files and optionally indexes
them as well:
$ maf_bgzip --dir /tmp --index --all test/data/mm8.chrM.maf
This is the easiest way to prepare MAF files for use with this library.
Extract blocks from an indexed MAF file, by genomic interval
Refer to mm8_chr7_tiny.maf
.
require 'bio-maf'
access = Bio::MAF::Access.maf_dir('test/data')
q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
access.find(q) do |block|
ref_seq = block.sequences[0]
puts "Matched block at #{ref_seq.start}, #{ref_seq.size} bases"
end
# => Matched block at 80082592, 121 bases
# => Matched block at 80082713, 54 bases
Or, equivalently, one can work with a specific MAF file and index directly:
require 'bio-maf'
parser = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
idx = Bio::MAF::KyotoIndex.open('test/data/mm8_chr7_tiny.kct')
q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
idx.find(q, parser).each do |block|
ref_seq = block.sequences[0]
puts "Matched block at #{ref_seq.start}, #{ref_seq.size} bases"
end
# => Matched block at 80082592, 121 bases
# => Matched block at 80082713, 54 bases
This can be done with maf_extract(1)
as well:
$ maf_extract -d test/data --interval mm8.chr7:80082592-80082766
Extract alignment blocks truncated to a given interval
Given a genomic interval of interest, one can also extract only the
subsets of blocks that intersect with that interval, using the
#slice
method like so:
require 'bio-maf'
access = Bio::MAF::Access.maf_dir('test/data')
int = Bio::GenomicInterval.zero_based('mm8.chr7', 80082350, 80082380)
blocks = access.slice(int).to_a
puts "Got #{blocks.size} blocks, first #{blocks.first.ref_seq.size} base pairs."
# => Got 2 blocks, first 18 base pairs.
Or, with maf_extract(1)
:
$ maf_extract -d test/data --mode slice --interval mm8.chr7:80082592-80082766
Filter species returned in alignment blocks
require 'bio-maf'
access = Bio::MAF::Access.maf_dir('test/data')
access.sequence_filter = { :only_species => %w(hg18 mm8 rheMac2) }
q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
blocks = access.find(q)
block = blocks.first
puts "Block has #{block.sequences.size} sequences."
# => Block has 3 sequences.
With maf_extract(1)
:
$ maf_extract -d test/data --interval mm8.chr7:80082592-80082766 --only-species hg18,mm8,rheMac2
Extract blocks matching certain conditions
See also the Cucumber feature and step definitions for this.
Match only blocks with all specified species
access = Bio::MAF::Access.maf_dir('test/data')
q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082471, 80082730)]
access.block_filter = { :with_all_species => %w(panTro2 loxAfr1) }
n_blocks = access.find(q).count
# => 1
With maf_extract(1)
:
$ maf_extract -d test/data --interval mm8.chr7:80082471-80082730 --with-all-species panTro2,loxAfr1
Match only blocks with a certain number of sequences
access = Bio::MAF::Access.maf_dir('test/data')
q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082767, 80083008)]
access.block_filter = { :at_least_n_sequences => 6 }
n_blocks = access.find(q).count
# => 1
With maf_extract(1)
:
$ maf_extract -d test/data --interval mm8.chr7:80082767-80083008 --min-sequences 6
Match only blocks within a text size range
access = Bio::MAF::Access.maf_dir('test/data')
q = [Bio::GenomicInterval.zero_based('mm8.chr7', 0, 80100000)]
access.block_filter = { :min_size => 72, :max_size => 160 }
n_blocks = access.find(q).count
# => 3
With maf_extract(1)
:
$ maf_extract -d test/data --interval mm8.chr7:0-80100000 --min-text-size 72 --max-text-size 160
Process each block in a MAF file
require 'bio-maf'
p = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
puts "MAF version: #{p.header.version}"
# => MAF version: 1
p.each_block do |block|
block.sequences.each do |seq|
do_something(seq)
end
end
Parse empty ('e') lines
Refer to chr22_ieq.maf
.
require 'bio-maf'
p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf',
:parse_empty => false)
block = p.parse_block
block.sequences.size
# => 3
p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf',
:parse_empty => true)
block = p.parse_block
block.sequences.size
# => 4
block.sequences.find { |s| s.empty? }
# => #<Bio::MAF::EmptySequence:0x007fe1f39882d0
# @source="turTru1.scaffold_109008", @start=25049,
# @size=1601, @strand=:+, @src_size=50103, @text=nil,
# @status="I">
Such options can also be set on a Bio::MAF::Access object:
require 'bio-maf'
access = Bio::MAF::Access.maf_dir('test/data')
access.parse_options[:parse_empty] = true
Remove gaps from parsed blocks
After filtering out species with
Parser#sequence_filter
,
gaps may be left where there was an insertion present only in
sequences that were filtered out. Such gaps can be removed by setting
the :remove_gaps
parser option:
require 'bio-maf'
access = Bio::MAF::Access.maf_dir('test/data')
access.parse_options[:remove_gaps] = true
Join blocks after filtering together
Similarly, filtering out species may remove a species which had caused
two adjacent alignment blocks to be split. By enabling the
:join_blocks
parser option, such blocks can be joined together:
require 'bio-maf'
access = Bio::MAF::Access.maf_dir('test/data')
access.parse_options[:join_blocks] = true
See the Cucumber feature for more details.
Extract bio-alignment representations of blocks
When the :as_bio_alignment
parser option is given, blocks will be
returned as Bio::BioAlignment::Alignment objects as used in the
bio-alignment Biogem. This offers a great deal of built-in
functionality for column-wise operations, alignment manipulation, and
more.
require 'bio-maf'
access = Bio::MAF::Access.maf_dir('test/data')
access.parse_options[:as_bio_alignment] = true
q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
access.find(q) do |aln|
col = aln.columns[3]
puts "bases in column 3: #{col}"
end
Tile blocks together over an interval
Extracts alignment blocks overlapping the given genomic interval and
constructs a single alignment block covering the entire interval for
the specified species. Optionally, any gaps in coverage of the MAF
file's reference sequence can be filled in from a FASTA sequence
file. See the Cucumber feature for examples of output, and also
the
maf_tile(1)
man page.
require 'bio-maf'
access = Bio::MAF::Access.maf_dir('test/data')
interval = Bio::GenomicInterval.zero_based('mm8.chr7',
80082334,
80082468)
access.tile(interval) do |tiler|
# reference is optional
tiler.reference = 'reference.fa.gz'
tiler.species = %w(mm8 rn4 hg18)
# species_map is optional
tiler.species_map = {
'mm8' => 'mouse',
'rn4' => 'rat',
'hg18' => 'human'
}
tiler.write_fasta($stdout)
end
Compression
MAF files can optionally be compressed in the BGZF format defined
in the SAM specification. This is best done with
maf_bgzip(1)
,
but files compressed with the bgzip(1)
tool from samtools will also
work, though less efficiently.
MAF files compressed with plain gzip will be decompressed on the fly,
but random access to these files will not be possible. However,
gzipped MAF files are suitable as input to
maf_bgzip(1)
.
Command line tools
Man pages for command line tools:
With gem-man installed, these can be read with:
$ gem man bio-maf
Other documentation
Also see the API documentation. For more code examples see the RSpec and Cucumber test files in the source tree.
Also, the scripts in the bin directory provide good worked examples of how to use the existing parsing API.
Project home page
For information on the source tree, documentation, examples, issues and how to contribute, see
http://github.com/csw/bioruby-maf
The BioRuby community is on IRC server: irc.freenode.org, channel: #bioruby.
Cite
If you use this software, please cite one of
- BioRuby: bioinformatics software for the Ruby programming language
- Biogem: an effective tool-based approach for scaling up open source software development in bioinformatics
Biogems.info
This Biogem is published at biogems.info.
Copyright
Copyright (c) 2012 Clayton Wheeler. See LICENSE.txt for further details.