NAME
PerlIO::via::SeqIO - PerlIO layer for biological sequence formats
SYNOPSIS
use PerlIO::via::SeqIO;
# open a FASTA file for reading:
open( my $f, "<:via(SeqIO)", 'my.fas');
# open an EMBL file for writing
open( my $e, ">:via(SeqIO::embl)", 'my.embl');
# convert
print $e $_ while (<$f>);
# add comments (this really works)
while (<$f>) {
    # get the real sequence object
    my $seq = O($_);
    if ($seq->desc =~ /Pongo/) {
        print $e "# this one is almost human...";
    }
    print $e $_;
}
# a one-liner, sort of
$ alias scvt="perl -Ilib -MPerlIO::via::SeqIO -e \"open(STDIN, '<:via(SeqIO)'); open(STDOUT, '>:via(SeqIO::'.shift().')'); while (<STDIN>) { print }\""
$ cat my.fas | scvt gcg > my.gcg
DESCRIPTION
PerlIO::via::SeqIO attempts to provide an easy option for harnessing the magic sequence format I/O of the BioPerl (http://bioperl.org) toolkit. Opening a biological sequence file under via(SeqIO) yields a filehandle that can be used to read and write Bio::Seq objects sequentially with an absolute minimum of setup code.
via(SeqIO) also allows the user to mix plain text and sequence formats on a single filehandle transparently. Different sequence formats can be written to a single file by a simple filehandle tweak.
DETAILS
- Basics
Here's the basic idea, in code converting FASTA to EMBL format:
open($in, '<:via(SeqIO)', 'my.fas');
open($out, '>:via(SeqIO::embl)', 'my.embl');
while (<$in>) {
    print $out $_;
}
Scalar and bareword filehandles are both understood by via(SeqIO), as are STDIN, STDOUT, and DATA. For example:
open(STDIN, '<:via(SeqIO)'); ...
allows
cat my.gcg | perl your.pl > out
where your.pl can read STDIN and acquire the sequence objects by using the object getter "UTILITIES/O()". The format of the input in this case will be guessed by the Bio::SeqIO machinery.
On reading, you can rely on Bio::SeqIO's format guesser by invoking an unqualified
open($in, '<:via(SeqIO)', 'mystery.txt');
or you can specify the format, like so:
open($in, '<:via(SeqIO::embl)', 'mystery.txt');
On writing, a qualified invocation is required:
open($out, '>:via(SeqIO)', 'my.fas'); # throws
open($out, '>:via(SeqIO::fasta)', 'my.fas'); # that's better
- Retrieving the sequence object itself
This does what you mean:
open($in, '<:via(SeqIO)', 'my.fas');
open($out, '>:via(SeqIO::embl)', 'my.embl');
while (<$in>) {
    print $out $_;
}
However, $_ here is not the sequence object itself. To get that, use the all-purpose object getter "UTILITIES/O()":
while (<$in>) {
    print join("\t", O($_)->id, O($_)->desc), "\n";
}
- Writing a de novo sequence object
Use the "UTILITIES/T()" mapper to convert a Bio::Seq object into a thing that can be formatted by via(SeqIO):
open($seqfh, ">:via(SeqIO::embl)", "my.embl");
my $result = Bio::SearchIO->new( -file=>'my.blast' )->next_result;
while (my $hit = $result->next_hit()) {
    while (my $hsp = $hit->next_hsp()) {
        my $aln = $hsp->get_aln;
        print $seqfh T($_) for ($aln->each_seq);
    }
}
- Writing plain text
Interspersing plain text among your sequences is easy; just print the desired text to the handle. See the "SYNOPSIS".
Even the following works:
open($in, "<:via(SeqIO)", 'my.fas');
open($out, ">:via(SeqIO::embl)", 'annotated.txt');
$seq = <$in>;
print $out "In EMBL format, the sequence would be rendered:", $seq;
- Switching write formats
You can also easily switch write formats. (Why? Because...who knows?) Use set_write_format off the tied handle object:
open($in, "<:via(SeqIO)", 'my.fas');
open($out, ">:via(SeqIO::embl)", 'multi.txt');
$seq1 = <$in>;
print $out "This is sequence 1 in embl format:\n";
print $out $seq1;
(tied $out)->set_write_format('gcg');
print $out "while this is sequence 1 in GCG format:\n";
print $out $seq1;
- Supported Formats
The supported formats are contained in @PerlIO::via::SeqIO::SUPPORTED_FORMATS. Currently they are fasta, embl, gcg, genbank.
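Since a write handle needs a qualified layer name, it can be handy to check a user-supplied format before building the mode string. The following is a minimal sketch, not part of the module: write_mode_for is a hypothetical helper, and @SUPPORTED is a local copy of the list above (a real script would presumably consult @PerlIO::via::SeqIO::SUPPORTED_FORMATS after loading the module). It uses only core Perl.

```perl
use strict;
use warnings;

# Local copy of the formats listed in this document (an assumption made so
# the sketch runs without BioPerl installed).
my @SUPPORTED = qw(fasta embl gcg genbank);

# Hypothetical helper: return a qualified write mode such as
# ">:via(SeqIO::embl)", or die if the format is not in the list.
sub write_mode_for {
    my ($format) = @_;
    die "no format given\n"
        unless defined $format && length $format;
    die "unsupported format '$format'\n"
        unless grep { $_ eq $format } @SUPPORTED;
    return ">:via(SeqIO::$format)";
}

print write_mode_for('embl'), "\n";
```

The returned string is meant to be passed as the mode argument of a three-argument open on a write handle.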
IMPLEMENTATION
This is essentially a hack, but one that attempts to behave fairly well. The handles are highly overloaded, with one foot in PerlIO::via and the other in tie. Things to keep in mind:
- PerlIO::via::SeqIO exports open()
Neither PerlIO::via nor tie provided a low enough hook. When the mode does not contain a :via() call, your opens are passed through to CORE::open. If you run into problems, please ping me. "Why didn't you do it like ..." comments are also most welcome.
- Peeking at the guts
The filehandle takes notes in a hash under the hood. To look at it, use the object getter:
$o = O($fh); print join("\n", keys %$o), "\n";
The "public" interface (see "UTILITIES") is available through the tied object; that is, (tied $fh) and not O($fh).
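To illustrate the pass-through behavior described under "PerlIO::via::SeqIO exports open()", here is a rough sketch of how an open() wrapper might decide between its own handling and CORE::open. This is not the module's code: open_dispatch is a hypothetical name, the 'via' branch is a placeholder that sets nothing up, and only the three-argument form of open is considered.

```perl
use strict;
use warnings;

# Hypothetical dispatcher, for illustration only. Returns 'via' when the
# mode contains a :via( call, 'core' when CORE::open succeeded, and undef
# when CORE::open failed.
sub open_dispatch {
    my $mode = $_[1];
    if (defined $mode && $mode =~ /:via\(/) {
        # A real hook would set up the layered, tied handle here.
        return 'via';
    }
    # $_[0] aliases the caller's variable, so CORE::open can autovivify
    # a handle into it just as a direct call to open would.
    return CORE::open($_[0], $mode, @_[2 .. $#_]) ? 'core' : undef;
}

# An ordinary mode goes straight to CORE::open (here, an in-memory file).
my $data = "plain text\n";
my $how  = open_dispatch(my $fh, '<', \$data);
my $line = <$fh>;
close $fh;

# A mode naming a :via() layer takes the other branch.
my $via = open_dispatch(my $unused, '<:via(SeqIO::embl)', 'my.embl');

print "$how $via\n";
```

Operating on $_[0] directly, rather than on a copy, is what lets the wrapper fill in the caller's filehandle variable.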
UTILITIES
In the PerlIO::via::SeqIO namespace. To use, do
use PerlIO::via::SeqIO qw(open O T);
(The open hook needs to be available for the package to function. It is a member of @EXPORT. See "IMPLEMENTATION/PerlIO::via::SeqIO exports open()".)
O()
Title : O
Usage : $o = O($sym) # export it; not an object method
Function: get the object "represented" by the argument
Returns : the right object
Args : PerlIO::via::SeqIO GLOB, or
       *PerlIO::via::SeqIO::TFH (tied fh) or
       scalar string (sprintf-rendered Bio::SeqI object)
Example : $seqobj = O($s = <$seqfh>);
T()
Title : T
Usage : T($seqobj) # export it; not an object method
Function: Transform a real Bio::Seq object to a
          via(SeqIO)-writeable thing
Returns : A thing writeable as a formatted sequence
          by a via(SeqIO) filehandle
Args : a[n array of] Bio::Seq or related object[s]
Example : print $seqfh T($seqobj);
set_write_format()
Title : set_write_format
Usage : (tied $fh)->set_write_format($format)
Function: Set a write handle to write a specified
          sequence format
Returns : true on success
Args : scalar string; a supported format
       (see @PerlIO::via::SeqIO::SUPPORTED_FORMATS)
SEE ALSO
perlio, PerlIO::via, Bio::SeqIO, Bio::Seq, http://bioperl.org
AUTHOR - Mark A. Jensen
Email maj -at- fortinbras -dot- us
http://fortinbras.us
http://bioperl.org/wiki/Mark_Jensen