NAME

PerlIO::via::SeqIO - PerlIO layer for biological sequence formats

SYNOPSIS

use PerlIO::via::SeqIO;

# open a FASTA file for reading:
open( my $f, "<:via(SeqIO)", 'my.fas');

# open an EMBL file for writing
open( my $e, ">:via(SeqIO::embl)", 'my.embl');

# convert
print $e $_ while (<$f>);

# add comments (this really works)
while (<$f>) {
  # get the real sequence object
  my $seq = O($_);
  if ($seq->desc =~ /Pongo/) {
    print $e "# this one is almost human...\n";
  }
  print $e $_;
}

# a one-liner, sort of
$ alias scvt="perl -Ilib -MPerlIO::via::SeqIO -e \"open(STDIN, '<:via(SeqIO)'); open(STDOUT, '>:via(SeqIO::'.shift().')'); while (<STDIN>) { print }\""
$ cat my.fas | scvt gcg > my.gcg

DESCRIPTION

PerlIO::via::SeqIO attempts to provide an easy option for harnessing the magic sequence format I/O of the BioPerl (http://bioperl.org) toolkit. Opening a biological sequence file under via(SeqIO) yields a filehandle that can be used to read and write Bio::Seq objects sequentially with an absolute minimum of setup code.

via(SeqIO) also allows the user to mix plain text and sequence formats on a single filehandle transparently. Different sequence formats can be written to a single file by switching the handle's write format (see "Switching write formats").

DETAILS

Basics

Here's the basic idea, in code converting FASTA to EMBL format:

open($in, '<:via(SeqIO)', 'my.fas');
open($out, '>:via(SeqIO::embl)', 'my.embl');
while (<$in>) {
  print $out $_;
}

Both scalar and bareword filehandles are understood by via(SeqIO), as are STDIN, STDOUT, and DATA. For example:

open(STDIN, '<:via(SeqIO)');
...

allows

cat my.gcg | perl your.pl > out

where your.pl can read STDIN and acquire the sequence objects by using the object getter O() (see "UTILITIES"). The format of the input in this case will be guessed by the Bio::SeqIO machinery.

On reading, you can rely on Bio::SeqIO's format guesser by invoking an unqualified

open($in, '<:via(SeqIO)', 'mystery.txt');

or you can specify the format, like so:

open($in, '<:via(SeqIO::embl)', 'mystery.txt');
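
To give a feel for what guessing a format involves, here is a toy sketch that inspects only the first non-blank line of some text. It is not Bio::SeqIO's guesser; the function name guess_format_toy and its three rules are assumptions made for illustration, based on the conventional leading markers of each format ('>' for FASTA, 'ID' for EMBL, 'LOCUS' for GenBank).

```perl
use strict;
use warnings;

# Toy guesser (hypothetical; not the Bio::SeqIO implementation).
# Returns a format name, or undef when no rule matches.
sub guess_format_toy {
    my ($text) = @_;
    for my $line (split /\n/, $text) {
        next unless $line =~ /\S/;                   # skip blank lines
        return 'fasta'   if $line =~ /^>/;           # FASTA header line
        return 'embl'    if $line =~ /^ID\s+\S/;     # EMBL ID line
        return 'genbank' if $line =~ /^LOCUS\s+\S/;  # GenBank LOCUS line
        return;                                      # first real line matched nothing
    }
    return;                                          # empty input
}

my $fasta = ">seq1 Pongo pygmaeus\nACGTACGT\n";
my $guess = guess_format_toy($fasta);
print defined $guess ? "guessed: $guess\n" : "no guess\n";
```

When a guess like this would be ambiguous or wrong, the qualified form shown above sidesteps guessing altogether.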

On writing, a qualified invocation is required:

open($out, '>:via(SeqIO)', 'my.fas');        # throws
open($out, '>:via(SeqIO::fasta)', 'my.fas'); # that's better

Retrieving the sequence object itself

This does what you mean:

open($in, '<:via(SeqIO)', 'my.fas');
open($out, '>:via(SeqIO::embl)', 'my.embl');
while (<$in>) {
  print $out $_;
}

However, $_ here is not the sequence object itself. To get that, use the all-purpose object getter O() (see "UTILITIES"):

while (<$in>) {
  print join("\t", O($_)->id, O($_)->desc), "\n";
}
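
One plausible way for a plain string to lead back to an object is a registry keyed by the object's stringified form. The sketch below illustrates that general idea only; it is an assumption about the technique, not the module's actual code, and Toy::Seq, remember() and lookup() are hypothetical names.

```perl
use strict;
use warnings;

package Toy::Seq;    # hypothetical stand-in for a sequence class
sub new  { my ($class, %args) = @_; return bless {%args}, $class }
sub id   { return $_[0]{id} }
sub desc { return $_[0]{desc} }

package main;

my %registry;        # stringified object => object

# Store the object and return the string that stands in for it.
sub remember {
    my ($obj) = @_;
    my $key = sprintf('%s', $obj);   # e.g. "Toy::Seq=HASH(0x...)"
    $registry{$key} = $obj;
    return $key;
}

# Recover the object from its string form (undef if unknown).
sub lookup {
    my ($key) = @_;
    return $registry{$key};
}

my $seq  = Toy::Seq->new(id => 'seq1', desc => 'Pongo pygmaeus');
my $line = remember($seq);           # a plain string, safe to print or compare
my $obj  = lookup($line);            # the original object again
print join("\t", $obj->id, $obj->desc), "\n";
```

In this sketch lookup() plays the role that O() plays in the loop above: the string is only a token, and the methods live on the object it resolves to.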

Writing plain text

Interspersing plain text among your sequences is easy; just print the desired text to the handle. See "SYNOPSIS".

Switching write formats

You can also easily switch write formats. (Why? Because...who knows?) Call set_write_format on the tied handle object:

open($in, "<:via(SeqIO)", 'my.fas');
open($out, ">:via(SeqIO::embl)", 'multi.txt');

$seq1 = <$in>;
print $out "This is sequence 1 in embl format:\n";
print $out $seq1;
(tied $out)->set_write_format('gcg');
print $out "while this is sequence 1 in GCG format:\n";
print $out $seq1;

Supported Formats

The supported formats are contained in @PerlIO::via::SeqIO::SUPPORTED_FORMATS. Currently they are

fasta, embl, gcg, genbank
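
Because a write handle needs a qualified layer name, it can help to check a requested format before calling open. The following is a minimal sketch under stated assumptions: write_mode_for() is a hypothetical helper, and @supported is a local copy of the list above rather than the package variable itself.

```perl
use strict;
use warnings;

# Local copy of the formats listed above; real code could consult
# @PerlIO::via::SeqIO::SUPPORTED_FORMATS instead.
my @supported = qw(fasta embl gcg genbank);

# Hypothetical helper: build a qualified write mode, or die early
# when the format is not in the list.
sub write_mode_for {
    my ($format) = @_;
    die "unsupported write format '$format'\n"
        unless grep { $_ eq $format } @supported;
    return ">:via(SeqIO::$format)";
}

print write_mode_for('embl'), "\n";

# An unlisted format is rejected before any file is touched.
my $ok = eval { write_mode_for('nexus'); 1 };
print "nexus rejected\n" unless $ok;
```

The returned string can be passed as the mode argument of open in place of a hand-written literal.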

IMPLEMENTATION

This is essentially a hack, but one that attempts to behave fairly well. The handles are highly overloaded, with one foot in PerlIO::via and the other in tie. Things to keep in mind:

PerlIO::via::SeqIO exports open()

Neither PerlIO::via nor tie provides a low enough hook, so the module supplies its own open(). When the mode does not contain a :via() call, your opens are passed through to CORE::open. If you run into problems, please ping me. "Why didn't you do it like ..." comments are also most welcome.
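
The pass-through idea can be sketched in a few lines of core Perl. This is a generic illustration of overriding open in the importing package, not the module's own code: Toy::Open, toy_open() and @intercepted are hypothetical names, and the intercepting branch only records the mode instead of building a handle.

```perl
use strict;
use warnings;

package Toy::Open;    # hypothetical package, not part of PerlIO::via::SeqIO

our @intercepted;     # modes that contained a :via() layer

# Installing a sub named 'open' into the caller at compile time
# overrides the builtin open in that package.
sub import {
    my $caller = caller;
    no strict 'refs';
    *{"${caller}::open"} = \&toy_open;
}

sub toy_open (*;$@) {
    my (undef, $mode, @rest) = @_;
    if (defined $mode && $mode =~ /:via\(/) {
        # A real module would set up its special handle here;
        # this sketch only records the request and opens nothing.
        push @intercepted, $mode;
        return 1;
    }
    # No :via() in the mode: hand everything to the builtin.
    # $_[0] aliases the caller's variable, so 'my $fh' is filled in.
    return CORE::open($_[0], $mode, @rest);
}

package main;
BEGIN { Toy::Open->import }

my $text = "plain line\n";
open(my $plain, '<', \$text) or die "open failed: $!";
my $line = <$plain>;
close $plain;

open(my $special, '<:via(SeqIO)', 'my.fas');   # intercepted, nothing opened

print "read: $line";
print "intercepted: @Toy::Open::intercepted\n";
```

The override is lexically invisible but package-wide, which is why an ordinary open in the same package still has to behave exactly like the builtin.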

Peeking at the guts

The filehandle takes notes in a hash under the hood. To look at it, use the object getter:

$o = O($fh);
print join("\n", keys %$o), "\n";

The "public" interface (see "UTILITIES") is available through the tied object; that is

(tied $fh)

and not

O($fh).
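
For readers unfamiliar with tied handles, the toy class below shows why a method such as set_write_format is reached through tied rather than through the handle itself. It is a sketch using only core Perl; Toy::Writer and its bracketed output are invented for illustration and say nothing about how this module renders sequences.

```perl
use strict;
use warnings;
use Symbol qw(gensym);

package Toy::Writer;    # hypothetical tie class, for illustration only

sub TIEHANDLE {
    my ($class, $format) = @_;
    return bless { format => $format, out => [] }, $class;
}

# Called for every print on the tied handle.
sub PRINT {
    my ($self, @items) = @_;
    push @{ $self->{out} }, "[$self->{format}] " . join('', @items);
    return 1;
}

# An ordinary method, reachable only through the tied object.
sub set_write_format {
    my ($self, $format) = @_;
    $self->{format} = $format;
    return 1;
}

package main;

# $fh is a glob reference here, so the glob is named explicitly as *$fh.
my $fh = gensym();
tie *$fh, 'Toy::Writer', 'embl';

print $fh "seq1";
(tied *$fh)->set_write_format('gcg');
print $fh "seq1";

print "$_\n" for @{ (tied *$fh)->{out} };
```

The handle only understands I/O operations such as print; anything else has to be called on the object that tied returns.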

UTILITIES

O()

Title   : O
Usage   : $o = O($sym)
Function: get the object "represented" by the argument
Returns : the right object
Args    : PerlIO::via::SeqIO GLOB, or
          *PerlIO::via::SeqIO::TFH (tied fh) or
          scalar string (sprintf-rendered Bio::SeqI object)

set_write_format()

Title   : set_write_format
Usage   : (tied $fh)->set_write_format($format)
Function: Set a write handle to write a specified 
          sequence format
Returns : true on success
Args    : scalar string; a supported format 
          (see @PerlIO::via::SeqIO::SUPPORTED_FORMATS)

TODO

Allow writing of de novo (not previously read) sequence objects; i.e.

$seq = $seqio->next_seq;
print $out $seq;

in this scheme.

SEE ALSO

perlio, PerlIO::via, Bio::SeqIO, Bio::Seq, http://bioperl.org

AUTHOR - Mark A. Jensen

Email maj -at- fortinbras -dot- us
http://fortinbras.us
http://bioperl.org/wiki/Mark_Jensen