NAME

Bio::Assembly::IO::tigr - Driver to read and write assembly files in the TIGR Assembler v2 default format.

SYNOPSIS

# Building an input stream
use Bio::Assembly::IO;

# Assembly loading methods
my $asmio = Bio::Assembly::IO->new( -file   => 'SGC0-424.tasm',
                                    -format => 'tigr' );
my $scaffold = $asmio->next_assembly;

# Do some things on contigs...

# Assembly writing methods
my $outasm = Bio::Assembly::IO->new( -file   => ">SGC0-modified.tasm",
                                     -format => 'tigr' );
$outasm->write_assembly( -scaffold => $scaffold,
                         -singlets => 1 );

DESCRIPTION

This package reads assembly information from, and writes it to, files in the default TIGR Assembler v2 format. The files are lassie-formatted and often have the .tasm extension. The module is meant to be used as a driver for Bio::Assembly::IO input/output.

Implementation

Assemblies are loaded into Bio::Assembly::Scaffold objects composed of Bio::Assembly::Contig and Bio::Assembly::Singlet objects. Because the tasm files provide both the aligned reads and the gapped contig consensus, only aligned (gapped) sequences are added to the BioPerl objects.

Additional assembly information is stored as features. Contig objects have SeqFeature information associated with the primary_tag:

_main_contig_feature:$contig_id -> misc contig information
_quality_clipping:$read_id      -> quality clipping position

Read objects have sub_seqFeature information associated with the primary_tag:

_main_read_feature:$read_id     -> misc read information

Singlets are treated by TIGR Assembler as contigs of one sequence and are represented here with features having these primary_tags:

_main_contig_feature:$contig_id
_quality_clipping:$read_primary_id
_main_read_feature:$read_primary_id
_aligned_coord:$read_primary_id

THE TIGR TASM LASSIE FORMAT

Description

In the TIGR tasm lassie format, contigs are separated by a line containing a single pipe character "|", whereas the reads in a contig are separated by a blank line. Singlets can be present in the file and are represented as a contig composed of a single sequence.

Other than the two above-mentioned separators, each line has an attribute name, followed by a tab and then an attribute value.
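The layout described above can be read with a few lines of plain Perl. The following is a minimal sketch, not the module's own parser: the helper names parse_tasm and parse_block and the returned data structure are hypothetical, and the sample input is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): split tasm text into contig
# records, each holding the contig attributes and a list of read attributes.
sub parse_tasm {
    my ($text) = @_;
    my @contigs;
    # Contig records are separated by a line holding a single "|"
    for my $record ( split /^\|[ \t]*\n/m, $text ) {
        next unless $record =~ /\S/;
        # Blank lines separate the contig block from each read block
        my ($contig_block, @read_blocks) =
            grep { /\S/ } split /\n[ \t]*\n/, $record;
        push @contigs, {
            attributes => parse_block($contig_block),
            reads      => [ map { parse_block($_) } @read_blocks ],
        };
    }
    return \@contigs;
}

# Turn "name<TAB>value" lines into a hash; missing values become ''
sub parse_block {
    my ($block) = @_;
    my %attr;
    for my $line ( split /\n/, $block ) {
        next unless length $line;
        my ($name, $value) = split /\t/, $line, 2;
        $attr{$name} = defined $value ? $value : '';
    }
    return \%attr;
}

# Made-up input: two contigs, the first with two reads
my $tasm = join "\n",
    "asmbl_id\t93", "seq#\t2", "",
    "seq_name\tread1", "asm_lend\t1", "",
    "seq_name\tread2", "asm_lend\t339",
    "|",
    "asmbl_id\t94", "seq#\t1", "",
    "seq_name\tread3", "asm_lend\t1",
    "|", "";

my $contigs = parse_tasm($tasm);
printf "%d contigs; first has %d reads\n",
    scalar @$contigs, scalar @{ $contigs->[0]{reads} };
```

The sketch keeps every value as a string and does no validation; real code would normally go through Bio::Assembly::IO instead.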

The tasm format is used by more TIGR applications than just TIGR Assembler. Some of the attributes are not used by TIGR Assembler or have constant values. They are marked with an asterisk (*) below.

Contigs have the following attributes:

asmbl_id   -> contig ID
sequence   -> contig ungapped consensus sequence (ambiguities are lowercase)
lsequence  -> gapped consensus sequence (lowercase ambiguities)
quality    -> gapped consensus quality score (in hexadecimal)
seq_id     -> *
com_name   -> *
type       -> *
method     -> always 'asmg' *
ed_status  -> *
redundancy -> fold coverage of the contig consensus
perc_N     -> percent of ambiguities in the contig consensus
seq#       -> number of sequences in the contig
full_cds   -> *
cds_start  -> start of coding sequence *
cds_end    -> end of coding sequence *
ed_pn      -> name of editor (always 'GRA') *
ed_date    -> date and time of edition
comment    -> some comments *
frameshift -> *

Each read has the following attributes:

seq_name  -> read name
asm_lend  -> position of first base on contig ungapped consensus sequence
asm_rend  -> position of last base on contig ungapped consensus sequence
seq_lend  -> start of quality-trimmed sequence (aligned read coordinates)
seq_rend  -> end of quality-trimmed sequence (aligned read coordinates)
best      -> always '0' *
comment   -> some comments *
db        -> database name associated with the sequence (e.g. >my_db|seq1234)
offset    -> offset of the sequence (gapped consensus coordinates)
lsequence -> aligned read sequence (ambiguities are uppercase)
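Some of these attributes refer to the ungapped consensus (asm_lend, asm_rend) and others to the gapped consensus (offset), so it can help to convert between the two coordinate systems. A minimal sketch, assuming 1-based positions and '-' as the gap symbol; the helper name ungapped_to_gapped is hypothetical and is not a method of this module.

```perl
use strict;
use warnings;

# Hypothetical helper: map a 1-based position on the ungapped consensus to
# the matching 1-based position on the gapped consensus ('-' marks a gap).
sub ungapped_to_gapped {
    my ($gapped_seq, $ungapped_pos) = @_;
    my $seen = 0;    # number of non-gap symbols passed so far
    for my $i ( 0 .. length($gapped_seq) - 1 ) {
        next if substr($gapped_seq, $i, 1) eq '-';
        return $i + 1 if ++$seen == $ungapped_pos;
    }
    return;          # position lies beyond the end of the consensus
}

my $gapped = 'ACG-TAC-GT';
print ungapped_to_gapped($gapped, 3), "\n";   # 3 (before any gap)
print ungapped_to_gapped($gapped, 4), "\n";   # 5 (after the first gap)
print ungapped_to_gapped($gapped, 7), "\n";   # 9 (after both gaps)
```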

When asm_rend < asm_lend, the read came from the complementary DNA strand. In that case the aligned sequence in the assembly file is the reverse complement of the read, not the original read.

Ambiguities are reflected in the contig consensus sequence as lowercase IUPAC characters: a c g t u m r w s y k x n . In the read sequences, however, ambiguities are uppercase: M R W S Y K X N

Example

Example of a contig containing three sequences:

sequence	CGATGCTGTACGGCTGTTGCGACAGATTGCGCTGGGTCGATACCGCGTTGGTGATCGGCTTGTTCAGCGGGCTCTGGTTCGGCGACAGCGCGGCGATCTTGGCGGCTGCGAAGGTTGCCGGCGCAATCATGCGCTGCTGACCGTTGACCTGGTCCTGCCAGTACACCCAGTCGCCCACCATGACCTTCAGCGCGTAGCTGTCACAGCCGGCTGTGGTCAGCGCAGTGGCGACGGTGGTGTAGGAGGCGCCAGCAACACCTTGGGTGATCATGTAGCAGCCTTCTGACAGGCCGTAGGTCAGCATGGTCGGCCACTGGGTACCAGTCAGTCGGGTCAACCGAGATTCGCAsCTGAGCGCCACTGCCGCGCAGAGCGTACATGCCCTTGCGGGTCGCGCCGGTAACACCATCCACGCCGATCAGAACTGCGTCGGTGATGGTGGTGTTACCCGAGGTGCCAGTGGTGAAGGCGACGGTCTGGGTGCTGGCCACAGGCGCCAGAGTGGTCGCGCCAACGGTGGCGATGACCAGTTGCGATGGGCCACGGATACCTGACTGCCCGTTGTTCACGGCGCTGACGATGTTCTGCCACAGCGCCAGGCCAGAGCCGGTGATGTTGTCGAACACTTCGGGCGCAACGCCAGGGAGCGAGACGGTCAGCTTCCAGCTCGAAGCAGCGGAGCCAGTAGCCAGGGCGGCGCTGAGCGAGTTGCCGAGCGTGCCGGTGTAGAACGCGGTCAGCGTGGCGCCGGTGGCGGCGGCAGTGTCCTTCAGCGCACTGGTCGCGGCGGTGTCGGTGCCGTCAGTGACGCGCACGGCGCGGATGTTCGAGGCGCCGCCCTGGATTGATACCGCCAGCGCGGTGCACAGGTCGTACTTGCGCACGGTCyGAGTGCCGAACTTCTGCGATGCGTCACCTGGCGAGCCGATAaGCGTGGCGCTGTTCACCGGCCCCCAGTCAGCAATGCCGACGATGCCGAGAATGTCAGTCGGGACGCCATTGATGTAGCGGGTCTTGGGCGCCACTATTTGTATGTACAAATCTGGCGCAGATAAAGCCGCCGTATTCAAATAACCAGCAGGATAGATAGGCATCACGCCTCCAGAATGAAAAAGGCCACCGATTAGGTGGCCTTTGTTGTGTTCGGCTGGCTGTTAGAGCAGCAGCCCGTTTTCCCGCGCAAACGCGAATGGGTCCTTGTCATGCTTCCTGCAATTGCAGGTAGGACAAAGAATTTGCAGGTTGGATTTGTCGTTCGATCCGCCCTTTGCAAGCGGGAACACGTGGTCAACGTGATACCCATCCCTTATGGATATAGTGCACATGGCGCATTTCCAGCGCTGAGCAGCCAGCAAAAATTTTATGTCGTCGCCGGTGTGTGAGCCGACAGCATTTTTCTTGCGAGCCTTGTATGTCCGCGAGAGTGAACGAACTTGCTCCTTGTTGGCTGTCTTCCAGAGCTTTTGAGTAAGCGCACAGAGATCCTTGTTTCTTGATCTCCACTCTCTGGTTGCGGAAAT
lsequence	CGATGCTGTACGGCTGTTGCGACAGATTGCGCTGGGTCGATACCGCGTTGGTGATCGGCTTGTTCAGCGGGCTCTGGTTCGGCGACAGCGCGGCGATCTTGGCGGCTGCGAAGGTTGCCGGCGCAATCATGCGCTGCTGACCGTTGACCTGGTCCTGCCAGTACACCCAGTCGCCCACCATGACCTTCAGCGCGTAGCTGTCACAGCCGGCTGTGGTCAGCGCAGTGGCGACGGTGGTGTAGGAGGCGCCAGCAACACCTTGGGTGATCATGTAGCAGCCTTCTGACAGGCCGTAGGTCAGCATGGTCGGCCACTGGGTACCAGTCAGTCGGGTCAACCGAGATTCG-CAsCTGAGCGCCACTGCCGCGCAGAGCGTACATGCCCTTGCGGGTCGCGCCGGTAACACCATCCACGCCGATCAGAACTGCGTCGGTGATGGTGGTGTTACCCGAGGTGCCAGTGGTGAAGGCGACGGTCTGGGTGCTGGCCACAGGCGCCAGAGTGGTCGCGCCAACGGTGGCGATGACCAGTTGCGATGGGCCACGGATACCTGACTGCCCGTTGTTCACGGCGCTGACGATGTTCTGCCACAGCGCCAGGCCAGAGCCGGTGATGTTGTCGAACACTTCGGGCGCAACGCCAGGGAGCGAGACGGTCAGCTTCCAGCTCGAAGCAGCGGAGCCAGTAGCCAGGGCGGCGCTGAGCGAGTTGCCGAGCGTGCCGGTGTAGAACGCGGTCAGCGTGGCGCCGGTGGCGGCGGCAGTGTCCTTCAGCGCACTGGTCGCGGCGGTGTCGGTGCCGTCAGTGACGCGCACGGCGCGGATGTTCGAGGCGCCGCCCTGGATTGATACCGCCAGCGCGGTGCACAGGTCGTACTTGCGCACGGTCyGAGTGCCGAACTTCTGCGATGCGTCACCTGGCGAGCCGATAaGCGTGGCGCTGTTCACCGGCCCCCAGTCAGCAATGCCGACGATGCCGAGAATGTCAGTCGGGACGCCATTGATGTAGCGGGTCTTGGGCGCCACTATTTGTATGTACAAATCTGGCGCAGATAAAGCCGCCGTATTCAAATAACCAGCAGGATAGATAGGCATCACGCCTCCAGAATGAAAAAGGCCACCGATTAGGTGGCCTTTGTTGTGTTCGGCTGGCTGTTAGAGCAGCAGCCCGTTTTCCCGCGCAAACGCGAATGGGTCCTTGTCATGCTTCCTGCAATTGCAGGTAGGACAAAGAATTTGCAGGTTGGATTTGTCGTTCGATCCGCCCTTTGCAAGCGGGAACACGTGGTCAACGTGATACCCATCCCTTATGGATATAGTGCACATGGCGCATTTCCAGCGCTGAGCAGCCAGCAAAAATTTTATGTCGTCGCCGGTGTGTGAGCCGACAGCATTTTTCTTGCGAGCCTTGTATGTCCGCGAGAGTGAACGAACTTGCTCCTTGTTGGCTGTCTTCCAGAGCTTTTGAGTAAGCGCACAGAGATCCTTGTTTCTTGATCTCCACTCTCTGGTTGCGGAAAT
quality	0x0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0505050505050505050E0505160505050505050505050505050505050505050505050505050505050303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0404040404040404041604040404040404040404040404040404040404040404040404040404040404040404040404040404040E0404040404040404040B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B
0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B
asmbl_id	93
seq_id	
com_name	
type	
method	asmg
ed_status	
redundancy	1.11
perc_N	0.20
seq#	3
full_cds	
cds_start	
cds_end	
ed_pn	GRA
ed_date	08/16/07 17:10:12
comment	
frameshift	

seq_name	SDSU_RFPERU_010_C09.x01.phd.1
asm_lend	1
asm_rend	442
seq_lend	1
seq_rend	442
best	0
comment	
db	
offset	0
lsequence	CGATGCTGTACGGCTGTTGCGACAGATTGCGCTGGGTCGATACCGCGTTGGTGATCGGCTTGTTCAGCGGGCTCTGGTTCGGCGACAGCGCGGCGATCTTGGCGGCTGCGAAGGTTGCCGGCGCAATCATGCGCTGCTGACCGTTGACCTGGTCCTGCCAGTACACCCAGTCGCCCACCATGACCTTCAGCGCGTAGCTGTCACAGCCGGCTGTGGTCAGCGCAGTGGCGACGGTGGTGTAGGAGGCGCCAGCAACACCTTGGGTGATCATGTAGCAGCCTTCTGACAGGCCGTAGGTCAGCATGGTCGGCCACTGGGTACCAGTCAGTCGGGTCAACCGAGATTCG-CAGCTGAGCGCCACTGCCGCGCAGAGCGTACATGCCCTTGCGGGTCGCGCCGGTAACACCATCCACGCCGATCAGAACTGCGTCGGTGATGGTGG

seq_name	SDSU_RFPERU_002_H12.x01.phd.1
asm_lend	339
asm_rend	940
seq_lend	1
seq_rend	602
best	0
comment	
db	
offset	338
lsequence	CGAGATTCGCCACCTGAGCGCCACTGCCGCGCAGAGCGTACATGCCCTTGCGGGTCGCGCCGGTAACACCATCCACGCCGATCAGAACTGCGTCGGTGATGGTGGTGTTACCCGAGGTGCCAGTGGTGAAGGCGACGGTCTGGGTGCTGGCCACAGGCGCCAGAGTGGTCGCGCCAACGGTGGCGATGACCAGTTGCGATGGGCCACGGATACCTGACTGCCCGTTGTTCACGGCGCTGACGATGTTCTGCCACAGCGCCAGGCCAGAGCCGGTGATGTTGTCGAACACTTCGGGCGCAACGCCAGGGAGCGAGACGGTCAGCTTCCAGCTCGAAGCAGCGGAGCCAGTAGCCAGGGCGGCGCTGAGCGAGTTGCCGAGCGTGCCGGTGTAGAACGCGGTCAGCGTGGCGCCGGTGGCGGCGGCAGTGTCCTTCAGCGCACTGGTCGCGGCGGTGTCGGTGCCGTCAGTGACGCGCACGGCGCGGATGTTCGAGGCGCCGCCCTGGATTGATACCGCCAGCGCGGTGCACAGGTCGTACTTGCGCACGGTCCGAGTGCCGAACTTCTGCGATGCGTCACCTGGCGAGCCGATA-GCGTGGCGC

seq_name	SDSU_RFPERU_009_E07.x01.phd.1
asm_lend	880
asm_rend	1520
seq_lend	641
seq_rend	1
best	0
comment	
db	
offset	880
lsequence	CGCACGGTCTGAGTGCCGAACTTCTGCGATGCGTCACCTGGCGAGCCGATAAGCGTGGCGCTGTTCACCGGCCCCCAGTCAGCAATGCCGACGATGCCGAGAATGTCAGTCGGGACGCCATTGATGTAGCGGGTCTTGGGCGCCACTATTTGTATGTACAAATCTGGCGCAGATAAAGCCGCCGTATTCAAATAACCAGCAGGATAGATAGGCATCACGCCTCCAGAATGAAAAAGGCCACCGATTAGGTGGCCTTTGTTGTGTTCGGCTGGCTGTTAGAGCAGCAGCCCGTTTTCCCGCGCAAACGCGAATGGGTCCTTGTCATGCTTCCTGCAATTGCAGGTAGGACAAAGAATTTGCAGGTTGGATTTGTCGTTCGATCCGCCCTTTGCAAGCGGGAACACGTGGTCAACGTGATACCCATCCCTTATGGATATAGTGCACATGGCGCATTTCCAGCGCTGAGCAGCCAGCAAAAATTTTATGTCGTCGCCGGTGTGTGAGCCGACAGCATTTTTCTTGCGAGCCTTGTATGTCCGCGAGAGTGAACGAACTTGCTCCTTGTTGGCTGTCTTCCAGAGCTTTTGAGTAAGCGCACAGAGATCCTTGTTTCTTGATCTCCACTCTCTGGTTGCGGAAAT
|

...

FEEDBACK

Mailing Lists

User feedback is an integral part of the evolution of this and other BioPerl modules. Send your comments and suggestions preferably to the BioPerl mailing lists. Your participation is much appreciated.

bioperl-l@bioperl.org                 - General discussion
http://bio.perl.org/MailList.html     - About the mailing lists

Reporting Bugs

Report bugs to the BioPerl bug tracking system to help us keep track of the bugs and their resolution. Bug reports can be submitted via email or the web:

bioperl-bugs@bio.perl.org
http://bugzilla.bioperl.org/

AUTHOR - Florent E Angly

Email florent dot angly at gmail dot com

APPENDIX

The rest of the documentation details each of the object methods. Internal methods are usually preceded with a "_".

next_assembly

Title   : next_assembly
Usage   : my $scaffold = $asmio->next_assembly()
Function: return the next assembly in the tasm-formatted stream
Returns : Bio::Assembly::Scaffold object
Args    : none

_qual_hex2dec

Title   : _qual_hex2dec
Usage   : my $dec_quality = $self->_qual_hex2dec($hex_quality);
Function: convert a hexadecimal quality score into a decimal quality score
Returns : string
Args    : string

_qual_dec2hex

Title   : _qual_dec2hex
Usage   : my $hex_quality = $self->_qual_dec2hex($dec_quality);
Function: convert a decimal quality score into a hexadecimal quality score
Returns : string
Args    : string
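The two conversions can be illustrated with plain Perl. This is a minimal sketch, not the module's code: it assumes the hexadecimal string is "0x" followed by two hex digits per base (as in the quality line of the example contig) and that the decimal form is a space-separated list of scores. The function names are hypothetical stand-ins for the private methods.

```perl
use strict;
use warnings;

# Hypothetical stand-in: "0x0B09" -> "11 9"
sub qual_hex2dec {
    my ($hex) = @_;
    $hex =~ s/^0x//i;                          # drop the leading "0x"
    return join ' ', map { hex($_) } $hex =~ /([0-9A-Fa-f]{2})/g;
}

# Hypothetical stand-in: "11 9" -> "0x0B09"
sub qual_dec2hex {
    my ($dec) = @_;
    return '0x' . join( '', map { sprintf '%02X', $_ } split ' ', $dec );
}

my $dec = qual_hex2dec('0x0B0B09160E');
print "$dec\n";                    # 11 11 9 22 14
print qual_dec2hex($dec), "\n";    # 0x0B0B09160E
```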

_store_contig

Title   : _store_contig
Usage   : my $contigobj; $contigobj = $self->_store_contig(
          \%contiginfo, $contigobj, $scaffoldobj);
Function: store information of a contig belonging to a scaffold in the
          appropriate object
Returns : Bio::Assembly::Contig object
Args    : hash, Bio::Assembly::Contig, Bio::Assembly::Scaffold

_store_read

Title   : _store_read
Usage   : my $readobj = $self->_store_read(\%readinfo, $contigobj);
Function: store information of a read belonging to a contig in the appropriate object
Returns : Bio::LocatableSeq
Args    : hash, Bio::Assembly::Contig

_store_singlet

Title   : _store_singlet
Usage   : my $singletobj = $self->_store_singlet(\%readinfo, \%contiginfo,
              $scaffoldobj);
Function: store information of a singlet belonging to a scaffold in the appropriate object
Returns : Bio::Assembly::Singlet
Args    : hash, hash, Bio::Assembly::Scaffold

write_assembly

Title   : write_assembly
Usage   : $ass_io->write_assembly($assembly)
Function: Write the assembly object in TIGR Assembler-compatible tasm lassie
          format
Returns : 1 on success, 0 for error
Args    : A Bio::Assembly::Scaffold object

_perc_N

Title   : _perc_N
Usage   : my $perc_N = $ass_io->_perc_N($sequence_string)
Function: Calculate the percent of ambiguities in a sequence.
          M R W S Y K X N are regarded as ambiguities in an aligned read
          sequence by TIGR Assembler. In the case of a gapped contig
          consensus sequence, all lowercase symbols are ambiguities, i.e.:
          a c g t u m r w s y k x n.
Returns : decimal number
Args    : string
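As an illustration of the calculation, here is a minimal sketch in plain Perl. It is not the module's implementation: the function name perc_ambiguous and its second argument are hypothetical, and excluding gaps from the sequence length is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: percentage of ambiguous symbols in a sequence.
# Assumptions: gaps ('-') do not count toward the length; lowercase symbols
# are ambiguities in a consensus, M R W S Y K X N are ambiguities in a read.
sub perc_ambiguous {
    my ($seq, $is_consensus) = @_;
    (my $ungapped = $seq) =~ tr/-//d;
    return 0 unless length $ungapped;
    my $count = $is_consensus
        ? ( $ungapped =~ tr/a-z// )          # tr without /d only counts
        : ( $ungapped =~ tr/MRWSYKXN// );
    return 100 * $count / length($ungapped);
}

printf "%.2f\n", perc_ambiguous('ACGT-sCGTA', 1);   # 11.11 (1 of 9 symbols)
printf "%.2f\n", perc_ambiguous('ACGTNACGTA', 0);   # 10.00 (1 of 10 symbols)
```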

_redundancy

Title   : _redundancy
Usage   : my $ref = $ass_io->_redundancy($contigobj)
Function: Calculate the fold coverage (redundancy) of a contig consensus
          (average number of read base pairs covering the consensus)
Returns : decimal number
Args    : Bio::Assembly::Contig
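The fold coverage is the total number of read bases divided by the consensus length. A minimal sketch in plain Perl, working on lengths rather than on a Bio::Assembly::Contig object; the function name fold_coverage is hypothetical, and counting ungapped lengths is an assumption.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: fold coverage = total read bases / consensus length
sub fold_coverage {
    my ($consensus_length, @read_lengths) = @_;
    return 0 unless $consensus_length && @read_lengths;
    return sum(@read_lengths) / $consensus_length;
}

# Read lengths 442, 602 and 641 on a 1520-base consensus
printf "%.2f\n", fold_coverage(1520, 442, 602, 641);   # 1.11
```

With the read lengths of the example contig (seq_rend values 442, 602 and 641), this gives the redundancy of 1.11 listed for that contig.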

_ungap

Title   : _ungap
Usage   : my $ungapped = $ass_io->_ungap($gapped)
Function: Remove the gaps from a sequence. Gaps are - in TIGR Assembler
Returns : string
Args    : string
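Removing gaps is a one-line transliteration in Perl. A minimal sketch; the function name ungap is a hypothetical stand-in for the private method.

```perl
use strict;
use warnings;

# Hypothetical stand-in: strip '-' gap symbols from a sequence string
sub ungap {
    my ($gapped) = @_;
    (my $ungapped = $gapped) =~ tr/-//d;   # copy first, leave the input intact
    return $ungapped;
}

print ungap('GATTCG-CAsCTG'), "\n";   # GATTCGCAsCTG
```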

_date_time

Title   : _date_time
Usage   : my $timepoint = $ass_io->_date_time
Function: Get date and time (MM/DD/YY HH:MM:SS)
Returns : string
Args    : none
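A timestamp in this layout, which matches the ed_date value of the example contig (08/16/07 17:10:12), can be produced with POSIX::strftime. A minimal sketch; the function name date_time and its optional epoch argument are hypothetical, and using local time is an assumption.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical stand-in: format a timestamp as MM/DD/YY HH:MM:SS (local time)
sub date_time {
    my ($epoch) = @_;
    $epoch = time unless defined $epoch;   # default to the current time
    return strftime( '%m/%d/%y %H:%M:%S', localtime($epoch) );
}

print date_time(), "\n";
```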

_split_seq_name_and_db

Title   : _split_seq_name_and_db
Usage   : my ($seqname, $db) = $ass_io->_split_seq_name_and_db($id)
Function: Extract seq_name and db from sequence id
Returns : seq_name, db
Args    : id

_merge_seq_name_and_db

Title   : _merge_seq_name_and_db
Usage   : my $id = $ass_io->_merge_seq_name_and_db($seq_name, $db)
Function: Construct id from seq_name and db
Returns : id
Args    : seq_name, db
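The split and merge operations are inverses of each other. A minimal sketch in plain Perl, assuming that an id carrying a database name has the form "db|seq_name" and that an id without "|" has no database; both function names are hypothetical stand-ins for the private methods, and the module's actual id layout may differ.

```perl
use strict;
use warnings;

# Hypothetical stand-in: "my_db|seq1234" -> ("seq1234", "my_db")
sub split_seq_name_and_db {
    my ($id) = @_;
    return ($2, $1) if $id =~ /^([^|]+)\|(.+)$/;
    return ($id, undef);                   # no database part in the id
}

# Hypothetical stand-in: ("seq1234", "my_db") -> "my_db|seq1234"
sub merge_seq_name_and_db {
    my ($seq_name, $db) = @_;
    return ( defined($db) && length($db) ) ? "$db|$seq_name" : $seq_name;
}

my ($seq_name, $db) = split_seq_name_and_db('my_db|seq1234');
print "$seq_name from $db\n";                        # seq1234 from my_db
print merge_seq_name_and_db($seq_name, $db), "\n";   # my_db|seq1234
```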

_coord

Title   : _coord
Usage   : my @coords = $ass_io->_coord($readobj, $contigobj)
Function: Get different coordinates for the read
Returns : number, number, number, number, number
Args    : Bio::Assembly::Seq, Bio::Assembly::Contig