NAME

PerlIO::via::SeqIO - PerlIO layer for biological sequence formats

SYNOPSIS

use PerlIO::via::SeqIO qw(open O);

# open a FASTA file for reading:
open( my $f, "<:via(SeqIO)", 'my.fas');

# open an EMBL file for writing
open( my $e, ">:via(SeqIO::embl)", 'my.embl');

# convert
print $e $_ while (<$f>);

# add comments (this really works)
while (<$f>) {
  # get the real sequence object
  my $seq = O($_);
  if ($seq->desc =~ /Pongo/) {
    print $e "# this one is almost human...\n";
  }
  print $e $_;
}

# a one-liner, sort of
$ alias scvt="perl -Ilib -MPerlIO::via::SeqIO -e \"open(STDIN, '<:via(SeqIO)'); open(STDOUT, '>:via(SeqIO::'.shift().')'); while (<STDIN>) { print }\""
$ cat my.fas | scvt gcg > my.gcg

DESCRIPTION

PerlIO::via::SeqIO attempts to provide an easy option for harnessing the magic sequence format I/O of the BioPerl (http://bioperl.org) toolkit. Opening a biological sequence file under via(SeqIO) yields a filehandle that can be used to read and write Bio::Seq objects sequentially with an absolute minimum of setup code.

via(SeqIO) also allows the user to mix plain text and sequence formats on a single filehandle transparently. Different sequence formats can be written to a single file by a simple filehandle tweak.
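For readers new to the :via() mechanism that this module builds on, the following is a minimal sketch of a PerlIO::via layer using only core Perl. The layer name UpCase and the temporary file are assumptions made for illustration; this is not the code of PerlIO::via::SeqIO, which does considerably more work behind the same kind of hook.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

package PerlIO::via::UpCase;    # hypothetical demo layer, not part of this module

# Called when the layer is pushed onto a handle; returns the layer object.
sub PUSHED {
    my ($class, $mode, $fh) = @_;
    return bless {}, $class;
}

# Called when the handle needs more data; reads one line from the layer
# below and hands back a transformed copy (undef signals end of file).
sub FILL {
    my ($self, $fh) = @_;
    my $line = <$fh>;
    return defined($line) ? uc($line) : undef;
}

package main;

# Write a small FASTA-like file to read back through the layer.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} ">seq1 test\nacgt\n";
close $tmp;

open(my $in, '<:via(UpCase)', $path) or die "open failed: $!";
my @lines = <$in>;
close $in;

print @lines;
```

The point of the sketch is only that a :via(Name) layer sees the data on its way through the handle, which is the hook via(SeqIO) uses to turn text into sequence records and back.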

DETAILS

Basics

Here's the basic idea, in code converting FASTA to EMBL format:

open($in, '<:via(SeqIO)', 'my.fas');
open($out, '>:via(SeqIO::embl)', 'my.embl');
while (<$in>) {
  print $out $_;
}

Both scalar and bareword filehandles are understood by via(SeqIO), as are STDIN, STDOUT, and DATA. For example:

open(STDIN, '<:via(SeqIO)');
...

allows

cat my.gcg | perl your.pl > out

where your.pl can read STDIN and acquire the sequence objects by using the object getter O(). The format of the input in this case will be guessed by the Bio::SeqIO machinery.

On reading, you can rely on Bio::SeqIO's format guesser by invoking an unqualified via(SeqIO):

open($in, '<:via(SeqIO)', 'mystery.txt');

or you can specify the format, like so:

open($in, '<:via(SeqIO::embl)', 'mystery.txt');

On writing, a qualified invocation is required:

open($out, '>:via(SeqIO)', 'my.fas');        # throws
open($out, '>:via(SeqIO::fasta)', 'my.fas'); # that's better

Retrieving the sequence object itself

This does what you mean:

open($in, '<:via(SeqIO)', 'my.fas');
open($out, '>:via(SeqIO::embl)', 'my.embl');
while (<$in>) {
  print $out $_;
}

However, $_ here is not the sequence object itself. To get that use the all-purpose object getter O():

while (<$in>) {
  print join("\t", O($_)->id, O($_)->desc), "\n";
}

Writing a de novo sequence object

Use the T() mapper to convert a Bio::Seq object into a thing that can be formatted by via(SeqIO):

open($seqfh, ">:via(SeqIO::embl)", "my.embl");
my $result = Bio::SearchIO->new( -file=>'my.blast' )->next_result;
while (my $hit = $result->next_hit()) {
  while (my $hsp = $hit->next_hsp()) {
    my $aln = $hsp->get_aln;
    print $seqfh T($_) for ($aln->each_seq);
  }
}

Writing plain text

Interspersing plain text among your sequences is easy; just print the desired text to the handle. See "SYNOPSIS" for an example.

Even the following works:

open($in, "<:via(SeqIO)", 'my.fas');
open($out, ">:via(SeqIO::embl)", 'annotated.txt');

$seq = <$in>;
print $out "In EMBL format, the sequence would be rendered:", $seq;

Switching write formats

You can also easily switch write formats. (Why? Because...who knows?) Call set_write_format on the tied handle object:

open($in, "<:via(SeqIO)", 'my.fas');
open($out, ">:via(SeqIO::embl)", 'multi.txt');

$seq1 = <$in>;
print $out "This is sequence 1 in embl format:\n";
print $out $seq1;
(tied $out)->set_write_format('gcg');
print $out "while this is sequence 1 in GCG format:\n";
print $out $seq1;

Supported Formats

The supported formats are contained in @PerlIO::via::SeqIO::SUPPORTED_FORMATS. Currently they are

fasta, embl, gcg, genbank

IMPLEMENTATION

This is essentially a hack, but one that attempts to behave fairly well. The handles are highly overloaded, with one foot in PerlIO::via and the other in tie. Things to keep in mind:

PerlIO::via::SeqIO exports open()

Neither PerlIO::via nor tie provides a low enough hook, so the module exports its own open(). When the mode does not contain a :via() call, your opens are passed through to CORE::open. If you run into problems, please ping me. "Why didn't you do it like ..." comments are also most welcome.
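The pass-through rule can be pictured with a small sketch in core Perl. The function names via_layer_in_mode and open_dispatch, the regular expression, and the callback argument are all assumptions made for illustration; the module's actual open() is not reproduced here and may decide differently.

```perl
use strict;
use warnings;

# Hypothetical helper: return the layer name requested in a mode string,
# or undef when the mode contains no :via() call.
sub via_layer_in_mode {
    my ($mode) = @_;
    return unless defined $mode;
    return $mode =~ /:via\(([\w:]+)\)/ ? $1 : undef;
}

# Hypothetical dispatcher: modes with a :via() call go to a caller-supplied
# handler; everything else is handed straight to CORE::open.
sub open_dispatch {
    my ($mode, $path, $on_via) = @_;
    my $layer = via_layer_in_mode($mode);
    return $on_via->($layer, $mode, $path) if defined $layer;
    CORE::open(my $fh, $mode, $path) or return;
    return $fh;
}

# Usage: report which layer each mode asks for.
for my $mode ('<', '<:via(SeqIO)', '>:via(SeqIO::embl)') {
    my $layer = via_layer_in_mode($mode);
    print "$mode => ", (defined $layer ? $layer : 'CORE::open'), "\n";
}
```

Keeping the decision in one small function makes it easy to see that ordinary opens are left untouched.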

Peeking at the guts

The filehandle takes notes in a hash under the hood. To look at it, use the object getter:

$o = O($fh);
print join("\n", keys %$o), "\n";

The "public" interface (see "UTILITIES") is available through the tied object; that is

(tied $fh)

and not

O($fh).
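As general background on why the public methods hang off the tied object, here is a minimal sketch of a tied handle in core Perl. The class My::FormatHandle and its internals are hypothetical and only mimic the shape of set_write_format(); the sketch ties a glob explicitly, whereas the module arranges for (tied $fh) to work on the handles it returns.

```perl
use strict;
use warnings;
use Symbol qw(gensym);

package My::FormatHandle;    # hypothetical class, not part of PerlIO::via::SeqIO

# tie() calls this and keeps the returned object behind the handle.
sub TIEHANDLE {
    my ($class, $format) = @_;
    return bless { format => $format, lines => [] }, $class;
}

# print on the tied handle lands here; tag each item with the current format.
sub PRINT {
    my ($self, @items) = @_;
    push @{ $self->{lines} }, "[$self->{format}] " . join('', @items);
    return 1;
}

# An ordinary method, reachable only through the object that tied() returns.
sub set_write_format {
    my ($self, $format) = @_;
    $self->{format} = $format;
    return 1;
}

package main;

my $fh = gensym();
tie *$fh, 'My::FormatHandle', 'embl';

print {$fh} "first";
(tied *$fh)->set_write_format('gcg');    # method call on the tied object
print {$fh} "second";

my @written = @{ (tied *$fh)->{lines} };
print "$_\n" for @written;
```

The handle itself only understands I/O operations such as print; anything else, like switching formats, has to be called on the object that tied() hands back.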

UTILITIES

These functions live in the PerlIO::via::SeqIO namespace. To use them, do

use PerlIO::via::SeqIO qw(open O T);

(The open hook needs to be available for the package to function. It is a member of @EXPORT. See "IMPLEMENTATION" for details.)

O()

Title   : O
Usage   : $o = O($sym) # export it; not an object method
Function: get the object "represented" by the argument
Returns : the right object
Args    : PerlIO::via::SeqIO GLOB, or 
          *PerlIO::via::SeqIO::TFH (tied fh) or
          scalar string (sprintf-rendered Bio::SeqI object)
Example : $seqobj = O($s = <$seqfh>);

T()

Title   : T
Usage   : T($seqobj) # export it; not an object method
Function: Transform a real Bio::Seq object to a
          via(SeqIO)-writeable thing
Returns : A thing writeable as a formatted sequence
          by a via(SeqIO) filehandle
Args    : a[n array of] Bio::Seq or related object[s]
Example : print $seqfh T($seqobj);

set_write_format()

Title   : set_write_format
Usage   : (tied $fh)->set_write_format($format)
Function: Set a write handle to write a specified 
          sequence format
Returns : true on success
Args    : scalar string; a supported format 
          (see @PerlIO::via::SeqIO::SUPPORTED_FORMATS)

SEE ALSO

PerlIO, PerlIO::via, Bio::SeqIO, Bio::Seq, http://bioperl.org

AUTHOR - Mark A. Jensen

Email maj -at- fortinbras -dot- us
http://fortinbras.us
http://bioperl.org/wiki/Mark_Jensen