fstrozzi edited this page Jun 13, 2012 · 24 revisions

Intro

Bio::Faster is a BioRuby gem that implements a fast and simple parser for FastQ files. The new version drops support for FastA files to focus on the more resource-demanding FastQ parsing. It is a rewrite of the old one: the C extension has been written from scratch, and the parser now also checks for formatting problems in FastQ files. A full set of RSpec tests has been defined, based on the test files available in the official FastQ paper. The gem uses Ruby-FFI to bind against the C extension and is also compatible with JRuby. For a full list of supported Rubies, check Travis-CI.

Usage

A Bio::Faster object is created by passing a FastQ file name, and the each_record method is then used to parse the whole file. The method yields a simple array for each sequence in the file. The array includes the complete sequence header (ID and comments), the sequence itself and, by default, an array with the quality values as integers. The default quality encoding is expected to be Sanger (Phred33), and the conversion is done directly during parsing.

fastq = Bio::Faster.new("sequences.fastq")
fastq.each_record do |sequence_header, sequence, quality|
     puts sequence_header, sequence, quality
end
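
To make the Phred33 conversion concrete, here is a minimal plain-Ruby sketch of what turning a quality string into integers means. The helper name decode_quality is hypothetical and is not part of Bio::Faster, which does this work inside its C extension.

```ruby
# Illustrative helper (an assumption, not the gem's API): decode a FastQ
# quality string into integer scores by subtracting the ASCII offset.
# Sanger (Phred33) encoding uses an offset of 33.
def decode_quality(quality_string, offset = 33)
  quality_string.each_byte.map { |byte| byte - offset }
end

# 'I' is ASCII 73, '5' is ASCII 53, '+' is ASCII 43.
p decode_quality("II5+") # prints [40, 40, 20, 10]
```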

Different quality encoding

If the quality values are in Phred64 format (e.g. Solexa), you need to specify it directly in the each_record call:

fastq = Bio::Faster.new("sequences.fastq")
fastq.each_record(:quality => :solexa) do |sequence_header, sequence, quality|
     puts sequence_header, sequence, quality
end
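
If you are not sure which encoding a file uses, a common heuristic is to look at the lowest ASCII code in a sample of raw quality strings. The sketch below is only an illustration: guess_offset is a hypothetical helper, not part of the gem. It relies on the fact that 64-offset encodings never use characters below ASCII 59 (Solexa scores bottom out at -5). A small sample of high-quality Phred33 reads can still be misclassified, so feed it many reads.

```ruby
# Hypothetical helper: guess the ASCII offset (33 or 64) from a sample of
# raw quality strings. Any character below ASCII 59 (';') rules out the
# 64-offset encodings, so the data must be Phred33.
def guess_offset(raw_qualities)
  lowest = raw_qualities.flat_map(&:bytes).min
  raise ArgumentError, "no quality values given" if lowest.nil?

  lowest < 59 ? 33 : 64
end

p guess_offset(["II5+", "IIII"]) # '+' is ASCII 43, so this prints 33
p guess_offset(["hhTJ", "hhhh"]) # lowest is 'J' (ASCII 74), so this prints 64
```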

Raw qualities (no conversion)

The each_record method can also return the raw qualities as a string of ASCII characters, without doing any conversion. To do this, specify the quality as :raw when calling the method.

fastq = Bio::Faster.new("sequences.fastq")
fastq.each_record(:quality => :raw) do |sequence_header, sequence, quality|
     puts sequence_header, sequence, quality
end
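
One reason to keep the raw string is that simple checks can be done on the bytes without building an integer array. The following is a sketch under the assumption that the file is Sanger (Phred33) encoded; all_at_least? is a hypothetical helper, not part of Bio::Faster.

```ruby
# Hypothetical helper: check whether every base in a raw quality string
# reaches a minimum score, comparing ASCII codes directly.
# Assumes Phred33 unless another offset is passed in.
def all_at_least?(raw_quality, min_score, offset = 33)
  threshold = min_score + offset
  raw_quality.each_byte.all? { |byte| byte >= threshold }
end

p all_at_least?("II5+", 10) # lowest character '+' is score 10, prints true
p all_at_least?("II5+", 20) # '+' (score 10) is below 20, prints false
```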

Reading compressed files

Bio::Faster can also read directly from STDIN, which is useful when dealing with compressed FastQ files.

Just specify stdin as the input:

Bio::Faster.new(:stdin).each_record do |seq|
...

and you can call the Ruby script with pipes in a standard Unix terminal:

zcat sequences.fastq.gz | ruby my_parser.rb

So you can read gzipped files without any drop in the parsing performance.
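
To illustrate what arrives on the pipe, here is a plain-Ruby sketch that reads four-line FastQ records from any IO object such as $stdin. It is not the gem's parser: each_fastq_record is a hypothetical helper, it does no format validation beyond spotting a truncated record, and stripping the leading "@" from the header is a choice made for this sketch. A StringIO stands in for STDIN so the example is self-contained.

```ruby
require "stringio"

# Hypothetical pure-Ruby reader (not Bio::Faster): yields header, sequence
# and raw quality string for every four-line record read from an IO.
def each_fastq_record(io)
  return enum_for(:each_fastq_record, io) unless block_given?

  io.each_slice(4) do |header, sequence, _separator, quality|
    raise ArgumentError, "truncated FastQ record" if quality.nil?

    yield header.chomp.sub(/\A@/, ""), sequence.chomp, quality.chomp
  end
end

# In a real script you would pass $stdin instead of this StringIO.
input = StringIO.new("@read1 sample\nACGT\n+\nII5+\n")
each_fastq_record(input) do |header, sequence, quality|
  puts header, sequence, quality
end
```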

Performance

Bio::Faster is roughly 3-4 times faster than the standard object-oriented FastQ parser in BioRuby (and faster still with JRuby).

This is a comparison of the time needed to parse a 5.4 Gb Illumina 1.8+ FastQ file.

Using Bio::Faster

Bio::Faster.new("test_file.fastq").each_record {|sequence_header, sequence, quality|}

Ruby 1.9.3-p194

real	4m1.337s
user	3m56.447s
sys	0m4.339s

JRuby 1.6.7 OpenJDK 64-Bit Server VM 1.6.0_18

real	3m12.023s
user	3m9.040s
sys	0m4.277s

Using standard BioRuby parser

Bio::FlatFile.open(Bio::Fastq,File.open("test_file.fastq")).each_entry {|seq|}

Ruby 1.9.3-p194

real	11m35.946s
user	11m26.762s
sys	0m7.764s
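
The wall-clock ("real") times above can be turned into speedup factors with a little arithmetic. The sketch below uses only the timings reported on this page; to_seconds is a hypothetical helper for parsing the output format of the time command.

```ruby
# Hypothetical helper: convert `time` output such as "4m1.337s" to seconds.
def to_seconds(timing)
  minutes, seconds = timing.match(/\A(\d+)m([\d.]+)s\z/).captures
  minutes.to_i * 60 + seconds.to_f
end

bioruby = to_seconds("11m35.946s") # standard BioRuby parser, Ruby 1.9.3
mri     = to_seconds("4m1.337s")   # Bio::Faster, Ruby 1.9.3
jruby   = to_seconds("3m12.023s")  # Bio::Faster, JRuby 1.6.7

puts format("Ruby 1.9.3 speedup: %.1fx", bioruby / mri)   # prints 2.9x
puts format("JRuby speedup:      %.1fx", bioruby / jruby) # prints 3.6x
```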