ParseFasta
So you want to parse a fasta file...
Installation
Add this line to your application's Gemfile:
gem 'parse_fasta'
And then execute:
$ bundle
Or install it yourself as:
$ gem install parse_fasta
JRuby
ParseFasta doesn't work with JRuby for now D:
Overview
Provides nice, programmatic access to fasta and fastq files. It's faster and more lightweight than BioRuby. And more fun!
It handles a lot of wacky edge cases, like parsing multi-blob gzipped files, and it is strict about formatting by default.
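For context on the multi-blob case: a gzip file can hold several compressed members back to back, and Ruby's Zlib::GzipReader stops after the first one. The sketch below shows one common way to read all of them with only the standard library. It is not ParseFasta's actual implementation, and read_all_gzip_members is a hypothetical helper name used just for illustration.

```ruby
require "zlib"
require "stringio"

# Hypothetical helper (not part of ParseFasta): reads every member of a
# multi-member gzip stream and returns the concatenated, decompressed data.
def read_all_gzip_members(io)
  out = +""
  loop do
    gz = Zlib::GzipReader.new(io)
    out << gz.read
    unused = gz.unused # bytes read past the end of this member, or nil
    gz.finish          # finish the reader without closing the underlying io
    break if unused.nil?
    io.pos -= unused.length # rewind to the start of the next member
  end
  out
end

# Two gzip members concatenated, like `cat a.fa.gz b.fa.gz > both.fa.gz`.
blob = Zlib.gzip(">seq1\nACTG\n") + Zlib.gzip(">seq2\nGGCC\n")
text = read_all_gzip_members(StringIO.new(blob))
puts text
```

Reading only the first member would silently drop seq2 here, which is why this edge case is worth handling.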
Documentation
Check out the parse_fasta docs for the full API documentation.
Usage
Here are some examples of using ParseFasta. Don't forget to require "parse_fasta"
at the top of your program!
Print header and length of each record.
ParseFasta::SeqFile.open(ARGV[0]).each_record do |rec|
  puts [rec.header, rec.seq.length].join "\t"
end
You can parse fastQ files in exactly the same way.
ParseFasta::SeqFile.open(ARGV[0]).each_record do |rec|
  printf "Header: %s, Sequence: %s, Description: %s, Quality: %s\n",
         rec.header,
         rec.seq,
         rec.desc,
         rec.qual
end
Record#desc and Record#qual will be nil if the file you are parsing is a fastA file.
ParseFasta::SeqFile.open(ARGV[0]).each_record do |rec|
  if rec.qual
    # it's a fastQ record
  else
    # it's a fastA record
  end
end
You can also check this with Record#fastq?
ParseFasta::SeqFile.open(ARGV[0]).each_record do |rec|
  if rec.fastq?
    # it's a fastQ record
  else
    # it's a fastA record
  end
end
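One way to picture why desc and qual are nil for fastA: a fastQ record has four parts (header, sequence, description, quality), while a fastA record has only the first two. Here is a toy model of that idea. SketchRecord is a hypothetical stand-in, not ParseFasta's Record class, and the real fastq? may be implemented differently.

```ruby
# Toy stand-in for a record (hypothetical; not ParseFasta's Record class).
SketchRecord = Struct.new(:header, :seq, :desc, :qual) do
  # Treat a record as fastQ when it carries a quality string.
  def fastq?
    !qual.nil?
  end
end

fasta_rec = SketchRecord.new("seq1", "ACTG") # desc and qual default to nil
fastq_rec = SketchRecord.new("seq1", "ACTG", "", "IIII")

[fasta_rec, fastq_rec].each do |rec|
  kind = rec.fastq? ? "fastQ" : "fastA"
  puts [rec.header, kind, rec.seq.length].join("\t")
end
```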
And there is a nice #to_s method that does what it should whether the record is fastA-like or fastQ-like. Check out the docs for info on the fancy #to_fasta and #to_fastq methods!
ParseFasta::SeqFile.open(ARGV[0]).each_record do |rec|
  puts rec.to_s
end
But of course, since it is a #to_s override... you don't even have to call it directly!
ParseFasta::SeqFile.open(ARGV[0]).each_record do |rec|
  puts rec
end
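This works because Ruby's puts and string interpolation both call to_s on the object for you. A small sketch of the mechanism, assuming a to_s that emits the record in its own format: PrintableRecord is a hypothetical toy class, not ParseFasta's Record, and the library's exact output may differ.

```ruby
# Toy record (hypothetical; not ParseFasta's Record class) whose to_s
# renders fastQ text when a quality string is present, fastA text otherwise.
PrintableRecord = Struct.new(:header, :seq, :desc, :qual) do
  def to_s
    if qual
      "@#{header}\n#{seq}\n+#{desc}\n#{qual}"
    else
      ">#{header}\n#{seq}"
    end
  end
end

rec = PrintableRecord.new("seq1", "ACTG")
puts rec              # puts calls to_s for us
line = "got: #{rec}"  # so does string interpolation
puts line
```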
Sometimes your fasta file might have record separators (>) within the "sequence". For example, CD-HIT's .clstr files have headers within what would be the sequence part of the record. ParseFasta is really strict about formatting and will raise an error when trying to read these types of files. If you would like to parse them, pass the check_fasta_seq: false option like so:
ParseFasta::SeqFile.open(ARGV[0], check_fasta_seq: false).each_record do |rec|
  puts rec
end
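The kind of check that this option turns off can be pictured roughly as follows. This is a hypothetical sketch, not ParseFasta's code: check_seq! and SketchSequenceError are made-up names, the sample line is only shaped like a .clstr entry, and the real error class and message may differ.

```ruby
# Hypothetical error class and helper, for illustration only.
class SketchSequenceError < StandardError; end

# Raises if the sequence contains a record separator, unless checking is off.
def check_seq!(seq, check_fasta_seq: true)
  if check_fasta_seq && seq.include?(">")
    raise SketchSequenceError, "found '>' inside the sequence"
  end
  seq
end

# A made-up line with a '>' in the middle of the "sequence" part.
clstr_like = "0 100aa, >prot1... *"

# With the check disabled, the text passes through untouched.
puts check_seq!(clstr_like, check_fasta_seq: false)

# With the default strict behavior, the same text raises.
begin
  check_seq!(clstr_like)
rescue SketchSequenceError => e
  puts "strict mode: #{e.message}"
end
```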