NAME
Bio::UnivAln - Bioperl alignment object
SYNOPSIS
This Perl module is intended to simplify the handling of biosequence alignments. When in doubt, always check our Homepage for the newest version, contact emails, help files, etc: http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/welcome.html
See the "REVISION HISTORY" for recent bugfixes and enhancements.
Object Creation in a Nutshell
use Bio::UnivAln;
my $aln = Bio::UnivAln->new('t/alnfile.fasta');
$aln = Bio::UnivAln->new(-file=>'t/alnfile.aa',
-desc=>'Sample alignment',
-type=>'amino',
-ffmt=>'raw' # 1 line in file -> 1 sequence
);
$aln = Bio::UnivAln->new(-seqs=>"TCCCGCGTCAACTG\nTGGTGCTTCAACCG\nACTTG--TCAACTG");
$aln = Bio::UnivAln->new(-seqs=>[$sequence_strg,\@character_list,$bioSeqObject]);
$aln = Bio::UnivAln->new(-seqs=> ['ACCCGCGTCAACTG',
['A','G','G','G','G','C','T','T','C','A','A','C','C','G'],
Bio::Seq->new(-seq=>'ACTTG--TCAACTG')
]);
$aln = Bio::UnivAln->new($file,$seqs,$id,$desc,$names,$row_ids,$col_ids,
$row_descs,$col_descs,$numbering,$type,$ffmt,$descffmt,$inplace);
Object Manipulation in a Nutshell
OUTPUT
$aln->ffmt('fasta'); # set default output format
print "\n aln in default format:\n", $aln->layout();
print "\n aln in raw format:\n", $aln->layout("raw");
print "\n aln in fasta format:\n", $aln->layout("fasta"), "\n";
print "\n aln in MSF format (via readseq):\n", $aln->layout("MSF"), "\n";
SLICING
my $alnSlice = $aln->seqs(1,3,1,2); # multiline string of rows 1-3, columns 1-2
print $alnSlice, "\n";
my @alnSlice = $aln->seqs([1..3], [1,4]); # multidimensional array
# of rows 1-3, columns 1+4
for $aref ( @alnSlice ) { print @$aref, "\n"; }
$alnSlice = $aln->seqs([3,2,3,1], [1,3..5]); # rows 3,2,3,1, cols 1,3..5
ADVANCED SLICING
sub has_purine {
my $str = join "", @{ $_[0] };
if ($str =~ /[AaGgRr]+/) {return 1;} else {return 0;}
}
@alnSlice = $aln->seqs([1,2,3], \&has_purine); # rows 1-3, and from these
# only the entries of columns for which has_purine returns 1
$alnSlice = $aln->seqs({ids=>'SeqA SeqB'},{ids=>'ColA ColB ColC'});
$aln->inplace(1); $aln->seqs({ids=>'A B'},[1..6]); $aln->inplace(0);
# manipulates the object itself, assigning the slice internally
MAPPING
@resSlice = $aln->map_r(\&has_purine, [1..3]); # 1,0,1 if row 1+3 has purine
@resSlice = $aln->map_c(\&has_purine, [1,4]); # 1,0 if column 1 has purine
UTILITIES
$resSlice = $aln->consensus(); # 75% majority needed for consensus letter
$resSlice = $aln->consensus(0.6, [1,3]); # 60% majority, columns 1+3 only
$resSlice = $aln->var_sites(); # no columns that are invariable ...
$indices = $aln->var_inds(); # ... and their indices
$resSlice = $aln->invar_sites(); # only columns that are invariable
$indices = $aln->invar_inds(); # (..inds() is available for most utilities)
$resSlice = $aln->var_sites(0.6); # no columns with >= 60% maj. of 1 letter
$resSlice = $aln->invar_sites(0.6); # only columns with >= 60% majority
$resSlice = $aln->gap_free_sites();
$resSlice = $aln->no_allgap_sites(); # exclude sites that only have gaps
$resSlice = $aln->reverse([1,3]); # reverse of rows 1+3 only
$resSlice = $aln->complement([1,3]); # dna/rna complement of rows 1+3 only
$resSlice = $aln->revcom([1,3]); # reverse complement, rows 1+3 only
$resSlice = $aln->remove_gaps(); # original sequences without gaps
INSTALLATION
This module is included with the central Bioperl distribution:
http://bio.perl.org/Core/Latest
ftp://bio.perl.org/pub/DIST
Follow the installation instructions included in the README file.
The following are the installation instructions from the original UnivAlign distribution for installing UnivAlign.pm by itself. These are not necessary if installing from the central Bioperl distribution. Note that the central distribution does not currently run the more extensive univaln.t2 test harness.
The original installation package is available from:
http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/#univaln
To install, untar it and run the following commands in the module directory created by tar:
% perl Makefile.PL
% make
% make test
% make install
For intensive testing, see t/univaln.t2. Run that script via
% perl t/univaln.t2 > t/my_univaln
and compare the output with the file t/univaln.o . Expected error messages can be found in t/univaln2_expected_errors .
If you do not have superuser installation privileges, or to install in a different directory, you can do one of two things:
1) Either copy the UnivAln.pm module into the ``Bio'' subdirectory of an accessible Perl module directory (e.g. /my/perl/lib/Bio/). (One possible ``accessible Perl module directory'' is your current working directory; there you can create a subdirectory named ``Bio'' and place UnivAln.pm in that subdirectory. Then, you can start scripts using Bio::UnivAln from your current working directory.)
2) Or, run the above commands but specify an alternate location for the module by supplying a PREFIX argument to the first command:
% perl Makefile.PL PREFIX=/my/perl/stuff
This will place the module into '/my/perl/stuff/lib/site_lib/'. To specify a directory for the module file, set the $INSTALLSITELIB variable in Makefile.PL (e.g., $INSTALLSITELIB = '/my/perl/lib');
The make install command may report problems with creating documentation files (pod2man/perllocal.pod); please find the documentation in UnivAln.pm.html instead.
If you have Bio::Seq installed and want to test Bio::Seq usage by Bio::UnivAln, you need enable the line ``use Bio::Seq;'' at the beginning of t/univaln.t2; Bio::Seq can be found via http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/ . Note that the test script will also test error handling; you can expect the error messages included in the file univaln.t2_expected_errors -- these are OK.
If you wish that the module uses Don Gilbert's readseq package for sequence format conversion (Version 1 Feb 1993), you can set the environment variable `READSEQ_DIR'" appropriately. (Currently, only ``fasta'' and ``raw'' format are supported directly by UnivAln.pm.) Then, the program detects and uses `readseq' automatically, if it is in the specified directory (the default directory is ``./''). Modifying the environment variable `READSEQ' changes the expected name of the executable. For example, $ENV{READSEQ_DIR} may be ``/vol/biotools/bin/'' and $ENV{READSEQ} ``readseq2.0''. Readseq will give you support for PIR/CODATA, MSF/GCG and PAUP/NEXUS formats; ASN.1 does not seem to work reliably. (URLs: http://iubio.bio.indiana.edu/IUBio-Software+Data/molbio/readseq/ http://dot.imgen.bcm.tmc.edu:9331/seq-util/Help/readseq.html http://bimas.dcrt.nih.gov/molbio/readseq/formats.html )
Similar support for conversion from Clustal format, using Clustal as a converter, is implemented, but not properly tested and documented. The relevant environment variables are `CLUSTAL_DIR' and `CLUSTAL'. (URL: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalW/Top.html )
(Thanks to Steve A. Chervitz for his help with bundling the module !)
DESCRIPTION
This module is the Bio::UnivAln alignment object which is part of the Bioperl project. Currently it has some nice methods for accessing an alignment after reading it in from certain formats, incl. utilities like consensus and reverse complement. Bio::Seq (single sequences) is only needed if you explicitly want to use these.
(Most examples below are taken from the test script(s) that can be found in directory ``t'' of the Bio::UnivAln distribution. There you will also find a CGI script producing some graphics, which is currently in alpha status: I suspect it needs some refitting to run on a different server. If you'd like to know more about multiple alignments, in theory and practice, check out the tutorial at http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/mulali.html )
CREATION OF ALIGNMENTS
Alignments can be constructed from files, (multi-line) strings, arrays and Bio::Seq objects. Files need to be in a standard format, as described below, under the header "ALIGNMENT FORMATS".
my $aln = Bio::UnivAln->new('t/alnfile.fasta');
The first parameter is regarded as a file name; if you pass additional parameters, they will overwrite the parameters read in from the file. You can use named parameters; take a look at the documentation on the new() method in the appendix for a list of all parameters, and their names. In the following example, description, sequence type, and file format are provided. The file format will relieve Bio::UnivAln from guessing it; however, there are no guarantees if you bypass Bio::UnivAln's guessing _and_ provide an incorrect file format.
$aln = Bio::UnivAln->new(-file=>'t/alnfile.aa',
-desc=>'Sample alignment',
-type=>'amino',
-ffmt=>'raw' # 1 line in file -> 1 sequence
);
If no description (``-desc'') is given, a default one will be based on the file name. The format type is also the default format for output; if both differ, you need to specify the input format when you construct the $aln object, and then use the accessor ffmt() to set the default output format. Bio::UnivAln can be passed the aligned sequences directly, using the named parameter ``-seqs''. It takes a multi-line string, or any mix of strings, array references, and Bio::Seq objects:
$aln = Bio::UnivAln->new(-seqs=>"TCCCGCGTCAACTG\nTGGTGCTTCAACCG\nACTTG--TCAACTG");
$aln = Bio::UnivAln->new(-seqs=>[$sequence_strg,\@character_list,$bioSeqObject]);
$aln = Bio::UnivAln->new(-seqs=> ['ACCCGCGTCAACTG',
['A','G','G','G','G','C','T','T','C','A','A','C','C','G'],
Bio::Seq->new(-seq=>'ACTTG--TCAACTG')
]);
ACCESS TO THE DATA, AND MANIPULATION
The layout() method returns the sequence in a specified format; supported formats are listed under the header "ALIGNMENT FORMATS".
$aln->ffmt('fasta'); # set default output format
print "\n aln in default format:\n", $aln->layout();
print "\n aln in raw format:\n", $aln->layout("raw");
print "\n aln in fasta format:\n", $aln->layout("fasta"), "\n";
print "\n aln in MSF format (via readseq):\n", $aln->layout("MSF"), "\n";
Access by Specifying Boundaries
You can calculate slices of alignments in a very flexible way; interval slices like the intersection of rows 1-3 and columns 1-2 are calculated using seqs(). Here, intersection means that those elements are returned that are both in rows 1-3 and in columns 1-2.
$alnSlice = $aln->seqs(1,3,1,2); # rows 1-3, columns 1-2
$alnSlice = $aln->seqs(); # returns the whole alignment
Here's a diagram illustrating the general case, intersecting rows $y_lo to $y_hi, and columns $x_lo to $x_hi.
$x_lo $x_hi
: :
.. $y_lo ...:...........:.................
:::::::::::::
:: SELECTED :
::: PART ::::
:::::::::::::
.. $y_hi ...:::::::::::::.................
: :
Fig.1
Maximal intervals will be assumed if no parameters are provided. Per default, the first row (sequence) has index 1 (not 0), and the first column has index 1 (not 0). The latter can be modified using numbering().
Access by Index Lists
If you desire non-consecutive row / column elements, you can specify the indices as lists, one list of desired row indices and one list of desired column indices.
$alnSlice = $aln->seqs([1..3], [1,4]); # rows 1-3, columns 1+4
Here, letters in columns 2+3 will not be returned. Another example:
$alnSlice = $aln->seqs([3,2,3,1], [1,3..5]); # rows 3,2,3,1, cols 1,3..5
If you specify the empty list (``[]''), all rows/columns will be returned. The following diagram shows the case where the list of row indices is [$r1,$r2,$r3,$r4], and the list of column indices is [$c1,$c2,$c3].
: : :
: : :
............*....*..........*............. $r1
: : :
............*....*..........*............. $r2
............*....*..........*............. $r3
: : :
............*....*..........*............. $r4
: : :
: : :
$c1 $c2 $c3 Fig.2
Again, an element is selected for the slice if and only if it lies in the intersection of a row and a column which are both desired according to the index lists.
Return Values
In the examples above, a string (scalar) is returned; the standard sequence accessor seqs() always returns a (multi-line) string in a scalar context. In a list context, it returns an array; each element of such an array is a reference to another array holding the letters of one sequence, i.e. one single row.
@alnSlice = $aln->seqs([1..3], [1,4]); # rows 1-3, columns 1+4
for $aref ( @alnSlice ) { print @$aref, "\n"; }
If you use the result of an accessor or a utility function in the ``-seqs'' slot of new(), you may need to force that result into a scalar context, because the accessor, etc, returns a list in a list context, and the constructor naturally provides such a list context since it expects a list of parameters.
$aln = new Bio::UnivAln(-seqs=>scalar($aln2->seqs()));
In the example above, the list-context return value of $aln2->seqs()
, i.e. the list of rows of $aln2, would be fed one by one as additional parameters into the constructor, if you didn't ``protect'' it by scalar(). You will be warned about the problem because Bio::UnivAln detects any named parameters that it can't use.
Access by Id
(The following access method is currently in alpha status, it may need some revision until the code is fully released.)
Any list of desired row/column indices can be replaced by a hash of desired ids, which are recognized if they are in the object's own list of row or column ids. You need to pass a reference to a hash that has one key, ``ids'', and one value, which is a string containing the ids seperated by `` ''(blank) :
$alnSlice = $aln->seqs({ids=>'SeqA SeqB'},{ids=>'ColA ColB ColC'});
Row (sequence) ids are automatically extracted when reading fasta files and Bio::Seq objects. Otherwise, they are set to the default numerical index list, like (1..20) if the alignment has 20 rows. Since there's currently no way to extract column (site) ids (none of the supported formats has this feature), these always hold the default numerical indices. However, both lists may be set using the accessors row_ids() and col_ids(). Note that arbitrary numbering schemes can be supported this way.
Access by Selector Function
Finally, you can specify a function such that seqs() returns exactly those letters which lie in a row (or column) for which your function returns true. E.g. has_purine() returns true if the row/column contains A, a, T, t, R, or r; the following $alnSlice will contain only columns that have one of these letters in them.
sub has_purine {
my $str = join "", @{ $_[0] };
if ($str =~ /[AaGgRr]+/) {return 1;} else {return 0;}
}
$alnSlice = $aln->seqs([1..3], \&has_purine); # rows 1-3, and from these
# only the columns for which has_purine returns 1
Similarly, the list of row indices, [1..3], could be replaced by a function, which is then used to designate the desired rows. (It has the same role that the expression EXPR has in the Perl code template grep EXPR, LIST, see the perlfunc manpage. Bio::UnivAln also provides the equivalent to map EXPR, LIST; this ``Mapping'' will be discussed soon.)
In other words, any list of indices can be replaced by the reference to a function r or c. The function then selects those elements (*) that lie in a row or column which meets a criterion, returning true if the criterion is met, and false otherwise. If both lists of indices are replaced by functions, the picture is like this:
: : :
: : :
............*....*..........*............. r(row) = true
: : :
............*....*..........*............. r(row) = true
............*....*..........*............. r(row) = true
: : :
............*....*..........*............. r(row) = true
: : :
: : :
c(column) c(column) c(column)
= true = true = true Fig.3
The function out_fasta observes all types of selectors that seqs() does, and returns the result in fasta format :
print "\n aln in fasta format:\n", $aln->out_fasta([1..3],\&has_purine), "\n";
Mapping functions onto Sequences and Columns
You can map a function onto selected rows or columns, and receive the results as a list:
@resSlice = $aln->map_r(\&has_purine, [1..3]); # 1,0,1 if row 1+3 has purine
@resSlice = $aln->map_c(\&has_purine, [1,4]); # 1,0 if column 1 has purine
As you may expect, maximal index lists (i.e, all rows / all columns) will be assumed if no second parameter is provided. In the same way as before, any one if the index lists may be replaced by a hash of ids, or a selector function. In the section on "User-defined Utility Functions", map_r() and map_c() are used to implement user-defined consensus, reverse, complement, etc.
By the way, the following map is the same as @alnSlice = $aln->seqs([1..3]) since mapping the identity function ``sub id { return @_ }'' to desired row/column subsets and collecting the result is just like slicing.
@alnSlice = $aln->map_r(\&id, [ 1..3 ]);
The number of rows/columns of an alignment can be obtained by using height() and width() respectively. Ids and descriptions of rows (sequences) and columns (sites) can be manipulated using row_ids(), col_ids(), row_descs() and col_descs(). Be warned that these accessors just return a reference to the array of ids/descriptions; if you'd like to process the array without changing the object's data, you need to create your own copy. This decision was taken because especially the result of col_ids() can be huge, and for a lot of applications a deep copy is unnecessary.
Inplace Manipulation
Slices can be applied to the object itself, replacing the old alignment by a new one that is sliced from the old ('inplace' manipulation). This is particularly useful is the alignment is large. The row (sequence) and column (site) ids are taken over from the old alignment. They are used as the lookup tables for "Access by Id", and available via row_ids() and col_ids().
The following code sets the ``inplace'' flag, and overwrites the current alignment with the rows named A and B, and columns 1-6:
$aln->inplace(1);
$aln->seqs({ids=>'A B'},[1..6]);
$aln->inplace(0);
'inplace' manipulation is also available for most utility functions below, with the notable exception of consensus(). A full list is given in the description of inplace(), see the Appendix.
UTILITY FUNCTIONS LIKE CONSENSUS, AND REVERSE COMPLEMENT
In the following paragraph, Bio::UnivAln's direct support for utility functions like consensus, (in)variable sites, gap-free sites, reverse, complement, and reverse complement is explained.
$resSlice = $aln->consensus();
calculates the consensus of the columns, i.e. those columns for which there exists a letter which has an absolute majority of 75% or more. The letter '!' designates the case that no consensus is given. To bypass the default threshold, write
$resSlice = $aln->consensus(0.6);
Note that for values smaller or equal to 0.5, two letters may have an equal (relative) majority, and the tie is currently broken arbitrarily. Thresholds larger than or equal to 1 imply that the site that has a consensus letter must be invariant.
$resSlice = $aln->consensus(1, [1..10]);
Just like seqs() and map_c()/map_r(), consensus() can be passed a reference to a hash of desired column ids, a function that selects the columns, or the desired column indices themselves. (Here, it's columns 1 to 10. Internally, map_c() is used to implement consensus().)
The methods var_sites() and invar_sites() are using consensus(), just checking whether there is a consensus letter ('invariable'), or not ('variable'). Naturally, the default threshold is 1, i.e. sites must be truly invariable (100% majority of 1 letter), or truly variable (strictly less than 100% majority). Per default, var_sites() and invar_sites() return a multiline string of rows (sequences), with elements from the desired columns (sites) only. They do not return the alignment in a column-by-column fashion ! Also, don't be confused by the possibilty to specify a list of desired row indices / ids, or a selector function: It just constrains the output further, without influencing which columns are selected. (Internally, seqs() is used to implement these functions; it is passed a row selector, and a function that returns true if there is a consensus letter.)
$resSlice = $aln->var_sites();
$resSlice = $aln->invar_sites();
$resSlice = $aln->var_sites(0.6, [1,3]); # no columns with >= 60% majority
# of one letter; also, print rows 1+3 only
$resSlice = $aln->invar_sites(0.6, [1,3]); # only columns with >= 60% maj.
$resSlice = $aln->no_allgap_sites([1,3]); # exclude sites that only have gaps
$resSlice = $aln->gap_free_sites([1,3]); # exclude sites that have >= 1 gaps
In a similar fashion, the last example selects gap-free columns, and from those only prints the elements that happen to be in rows 1+3. All these utilities support "Inplace Manipulation":
$aln->inplace(1);
$aln->var_sites(0.6, {ids=>'1 2'});
$aln->inplace(0);
The utilities reverse(), complement(), and revcom() also allow for "Inplace Manipulation", and again you can specify the rows which shall form the new alignment by overwriting the old, or be returned as reverse'd, complement'ed, or reverse complement'ed :
$resSlice = $aln->reverse([1,3]);
$resSlice = $aln->complement([1,3]);
$resSlice = $aln->revcom([1,3]);
If an inplace manipulation reverses column order, e.g. in the case of reverse(), this will be reflected in the column ids available via col_ids(). Note that complement is defined according to the IUPAC code, using the same substitutions that Bio::Seq uses, such that results obtained for e.g. Amino Acid sequences are probably nonsensical.
For all functions that have ``sites'' in their name, a corresponding ``inds'' function is available that returns the relevant array of indices instead of the sites themselves:
$indices = $aln->var_inds(); # array of indices of the variable sites
$indices = $aln->no_allgap_inds([1,3]);
Finally, the original sequences (without gaps) are available via remove_gaps().
$some_original_sequences = $aln->remove_gaps([1,3]);
In a list context, all these utility functions return an array of array references, just like seqs(); the exception is consensus(), which then returns a simple array of consensus letters.
ALIGNMENT FORMATS
The directly supported formats are fasta and raw (1 line in file -> 1 sequence, where "\n" (newline) is the delimiter.) If available (see INSTALLATION), readseq is used to parse and write PIR/CODATA, MSF/GCG and PAUP/NEXUS formats; ASN.1 does not seem to work reliably. Clustal is used to parse in clustal format, if available.
ADVANCED STUFF
An exhaustive list of accessors and methods is given in the appendix. E.g., Alignments may be copied using copy(), and compared for equality modulo gaps with equal_nogaps().
Bio::UnivAln doesn't really care whether the sequences passed in are indeed from an alignment (i.e. have the same length); sequence bags (i.e. multisets of sequences where the same element may occur more than once) are therefore handled, too. However, you need to be careful with some methods (e.g. the accessor seqs(), and width()) because their default behavior may depend on inspecting the first sequence only (or the first requested sequence, if this information is available -- new feature in 1.006) : If you use seqs() on a sequence bag, and don't provide the number of columns explicitly, you may be surprised to find out that the length of the first sequence is used as the default length, and not the length of the longest sequence. width() uses the same heuristic.
You can add gap symbols to the elements of a multiset so that each element takes the length of the longest sequence by calling equalize_length(); you can even pad more gap characters by specifying the new width yourself, as an argument to equalize_length().
If you need to have the alignment in an intermediate form, i.e. neither an array of array references, nor a single multiline string, but an array of strings, just request the output as a multiline string, and split it on "\n". For example, to just extract a single column, you can do
@col = split "\n", $aln->seqs([],[$colindex]);
Checking an alignment for characters that should not be there according to the Alignment Type is currently not well-supported; see alphabet_check() for a _preliminary_ way of doing manual checkup, though.
The hash referenced by $names stores {loc,name} pairs of other database locations and corresponding names where the alignment is located. Currently, loc and name must both be set as text, and must consist entirely of a string matching the regexp /^[A-Za-z]\w*$/. That is, they must be a single "word" of at least one character, without a leading underscore. This restriction is not enforced, but code which deviates is subject to break in future releases. Note also that the object may place any other sort of items in the name string, so users of this hash should not rely on accessing entries conforming the requirements above.
Alignment Types
The supported sequence types and corresponding alphabets are the same as reported in the documentation of the Bio::Seq object: ``dna'', ``rna'', and ``amino''. They are carried along without much checking, etc.
User-defined Utility Functions
Here are some user-defined auxiliary functions which can be applied to rows and columns of an alignment; this technique can be used if no appropriate built-in functions are available, or if you don't want to use them.
maj_dominance returns false if all letters in a column/row are the same.
sub maj_dominance {
my $str = join "", @{ $_[0] };
my $first_res = @{ $_[0] }[0];
$str =~ s/$first_res//g;
if ($str) {return 0;} else {return 1;}
}
consensus returns the letter that has the (relative) majority among all residues, if it exceeds a certain threshold. (For simplicity, the threshold is hard-wired.)
sub consensus {
my @chars = @{ $_[0] };
my %temp = ();
my $threshold = 0.34; # more than 1/3
my @list = sort { $temp{$a}<=>$temp{$b} }
grep ++$temp{$_} >= $threshold * ($#chars+1), @chars;
# In case of a tie, it's not specified which residue is in $list[-1]
return (defined($list[-1]) ? $list[-1] : '!');
}
reverse_ and complement implement reversing a sequence (note the underscore; I'm looking for better ways to implement function passing), and complementing it according to IUPAC conventions.
sub reverse_ {
return [ reverse @{ $_[0] } ];
}
sub complement {
my $str = join "", @{ $_[0] };
$str =~ tr/acgtrymkswhbvdnxACGTRYMKSWHBVDNX/tgcayrkmswdvbhnxTGCAYRKMSWDVBHNX/;
return [ split "", $str ];
}
Here, the functions above are applied:
@resSlice = $aln->map_c(\&maj_dominance);
print "\nDominated sites:\n", @resSlice, "\n";
@resSlice = $aln->map_c(\&consensus);
@resSlice = $aln->map_c(\&consensus, \&has_purine);
print "\nConsensus of the columns that have purine\n", @resSlice, "\n";
@resSlice = $aln->map_r(\&reverse_);
@resSlice = $aln->map_r(\&complement);
You can also parametrize your functions by using ``closures'' (See ``Programming Perl'', 2nd Ed., p.253). Basically, you set up a function that has the parameter built into it, and pass around the reference to that function. Here's a function that does the set-up:
sub _setup_consensus_with_threshold {
my $threshold = shift;
return sub {
my @chars = @{ $_[0] };
my %temp = ();
my @list = sort { $temp{$a}<=>$temp{$b} }
grep ++$temp{$_} >= $threshold * ($#chars+1), @chars;
# In case of a tie, it's not specified which residue is in $list[-1]
return (defined($list[-1]) ? $list[-1] : 'N');
}
}
Here, you create an instance of the function, incl. the parameter. $consensus holds a reference to the new instance.
my $consensus = _setup_consensus_with_threshold(0.75);
and finally, you pass the reference to the function around, e.g. to map_c().
@resSlice = $aln->map_c(\&$consensus);
(You can do pretty cute tricks using closures, e.g. you can set a counter to 1 in the setup function, and increment it in the ``real'' function.)
Bio::UnivAln Guts
Currently, the object hash has the following keys. This may be subject to change; in particular the alignment data may at some point be stored more efficiently in a PDL (Perl Data Language) array.
$self->{'seqs'} : An array of array references, each of which holds
one sequence of the alignment
$self->{'id'} : String specifying the ID; shall be in \w+ (i.e. composed
of characters in [a-zA-Z_0-9]; only \S+ is enforced, though)
$self->{'desc'} : String giving a description, (later) to be formatted
according to $descffmt
$self->{'names'} : A reference to a hash which stores {loc,name} pairs of
other database locations and corresponding names where
the alignment is located.
$self->{'row_ids'}: A reference to an array which stores row (sequence) ids
$self->{'col_ids'}: Same as $self->{'row_ids'}, for the columns (sites)
$self->{'row_descs'}: A reference to an array which stores row (sequence)
descriptions
$self->{'col_descs'}: Same as $self->{'row_descs'}, for the columns (sites)
$self->{'numbering'}: The offset of the first column, an integer
$self->{'type'} : The type of the alignment, concatenated from the molecule
type (see L<Alignment Types>) and a flag that is not
currently used, but intended to flag sequence bags
$self->{'ffmt'} : alignment format, see L<ALIGNMENT FORMATS>.
$self->{'descffmt'}: format of $desc; right now this should be ``raw''
or ``fasta'' which just implies that no specific
format is being followed, any text is allowed
excluding ``\n''(newline). More support is planned.
$self->{'inplace'}: Flag which is set to true if accessors (and utility
functions) should make the modification to the object
itself, and just return true on success. (See inplace().)
Some more helpful comments... In Perl, false is 0 or "", true is everything else. Not all internal functions have a POD/HTML documentation. If you read the code of this module, you need to be familiar with map and grep...
TO-DO
Soon:
Better handling of access, copying and slicing of $self->{'row_ids'}
, etc. Add pointer to a UnivAln demo page based on the draft at http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/Docs/phylosnapshot.html. Fix alphabet_check() (current workaround cannot be generalized), Fix UnivAlnAlph{UnivAlnType{"unknown"}} (current setting is not so nice; what's the easiest way to set it to all word characters ?), Support more formats, especially Nexus, Improved sequence alphabet support; not just providing a function alphabet_check() for manual checking. Then, maybe use strings to represent alphabets. Test and document aln() for returning a slice as an alignment.
Later:
Validity marker for correctly initialized/manipulated objects. Assign _unique_ IDs if none are provided. Functions like has_seqs(), etc. Using `undef' during initialization, and functions for regaining such a state. Use Perl Data Language ?!
DISCLAIMER
How is this for a maximum of disclaiming warranty ? In short, I'm developing this module in my spare time and it's for free, don't sue me :-)
IN NO EVENT SHALL THE GLOBEWIDE NETWORK ACADEMY, THE VIRTUAL SCHOOL OF NATURAL
SCIENCES, THE AUTHOR OR THE UNIVERSITY OF BIELEFELD BE LIABLE TO ANY PARTY
FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING
OUT OF THE USE OF THIS CODE, EVEN IF THE GLOBEWIDE NETWORK ACADEMY,
THE VIRTUAL SCHOOL OF NATURAL SCIENCES, THE AUTHOR OR THE UNIVERSITY OF
BIELEFELD HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
THE GLOBEWIDE NETWORK ACADEMY, THE VIRTUAL SCHOOL OF NATURAL SCIENCES, THE
AUTHOR AND THE UNIVERSITY OF BIELEFELD SPECIFICALLY DISCLAIM ANY WARRANTIES,
INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS FOR A PARTICULAR PURPOSE. THE CODE PROVIDED HEREUNDER IS ON AN
"AS IS" BASIS, AND THERE IS NO OBLIGATION WHATSOEVER TO PROVIDE MAINTENANCEE,
SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
COPYRIGHT
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
Copyright (c) 1996, 1997, 1998 Georg Fuellen. All Rights Reserved. Some pieces of the code were contributed by Steven E. Brenner, Richard Resnick and Chris Dagdigian. Thanks !!!!
FEEDBACK
Mailing Lists
User feedback is an integral part of the evolution of this and other Bioperl modules. Send your comments and suggestions preferably to one of the Bioperl mailing lists. Your participation is much appreciated.
bioperl-l@bioperl.org - General discussion
http://bio.perl.org/MailList.html - About the mailing lists
Reporting Bugs
Report bugs to the Bioperl bug tracking system to help us keep track the bugs and their resolution. Bug reports can be submitted via email or the web:
bioperl-bugs@bio.perl.org
http://bio.perl.org/bioperl-bugs/
AUTHOR
Georg Fuellen
Technische Fakultaet - AG Praktische Informatik, Universitaet Bielefeld, D-33501 Bielefeld, Germany, georg.fuellen@uni-bielefeld.de
http://www.techfak.uni-bielefeld.de/~fuellen/
ACKNOWLEDGEMENTS
Steven E. Brenner, Steve A. Chervitz, Michael Constant, Richard Resnick, Chris Dagdigian, Lew Gramer [more to follow]
SEE ALSO
Bio::Seq.pm - The biosequence object
http://bio.perl.org/Projects/modules.html - Online module documentation
http://bio.perl.org/Projects/SeqAlign/ - Bioperl sequence alignment project
http://bio.perl.org/ - Bioperl Project Homepage
REFERENCES
If you'd like to acknowledge use of Bio::UnivAln in your work, please cite
Fuellen, G. (1997). Bio::UnivAln - bioperl alignment object [WWW-Document]. URL http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/welcome.html
And please drop me a note :-) An article for the Perl Journal is planned. The page is mirrored at
http://merlin.mbcr.bcm.tmc.edu:8001/bcd/Perl/Bio/welcome.html http://www.biotech.ist.unige.it/bcd/Perl/Bio/welcome.html
REVISION HISTORY
Version 1.000 on 12 Feb 1997.
Version 1.001 on 19 Feb 1997. Fixed a bug that triggered _rowbounds() and _colbounds() to use maximal index lists whenever the first index in an index list was 0. New example in POD:
$aln = new Bio::UnivAln(-seqs=>scalar($aln2->var_sites()));
Internal: Now avoiding any global parameter passing by using closures, for the utility functions.
Version 1.002 on 21 Feb 1997. Renamed the module to UnivAln, see the discussion (Feb 1997) in the vsns-bcd-perl mailing list archive. Fixed hopefully all problems in out_graph(), all of them triggered by bugs/features of PGPLOT.
Version 1.003 on 25 Feb 1997. Added POD on using closures. Moved out_graph() into the cgi-script t/univaln.cgi (this is alpha-code !) so that ``use PGPLOT'' is no longer required.
Version 1.004 on 13 Mar 1997. Fixed bug with reading fasta files. Changed file format identifiers ``Fasta'' and ``Raw'' to ``fasta'' and ``raw''. Added support for arbitrary numeration schemes by providing access by name/id for rows (sequences) and columns (sites). Made _arys() and _strs() accept same row/column designations like seqs(). Added 'inplace' manipulation for slicing and most utilities. Improved speed a little. Deriving default description string from file name, if available.
Version 1.005 on 18 Jun 1997. Using simple filename (``basename'') as default decription. Changed type identifiers ``Unknown'', ``Dna'', ``Rna'', ``Amino'' to ``unknown'', ``dna'', ``rna'', ``amino''. The old acronyms should still be supported for the forseeable future.
Version 1.006 on 15 Dec 1997. Added function no_allgap_sites(); out_fasta() now accepts same row/column designations like seqs(). For advanced users: the width() of sequence bags can now be controlled by supplying the index of the row whose width is taken -- by default the width of the first row is used.
Version 1.007 on 15 Mar 1998. * Added functions that return indices instead of actual sequence data: var_inds(), invar_inds(), gap_free_inds(), unknown_free_inds(), special_free_inds() and no_allgap_inds() return the indices of (in)variable, gap-free, unknown-free, gap+unknown-free and non-gap-only columns; they are the counterparts of var_sites(), etc. * Added remove_gaps() to return ungapped (original) sequence data * Added equal_no_gaps() to check for equality ignoring gaps * Added equalize_lengths() for padding a sequence bag with gaps such that all rows have equal length (current procedure is slow) * The default id is now ``_'' (was: ``No_Id_Given'') * Internal functions _rowbounds() and _colbounds() now return indices from the user's perspective, but WITHOUT substracting offset. Instead, all functions subtract offset after applying _rowbounds()/_colbounds()
Version 1.008 on 13 May 1998. * Added readseq conversion support, making it possible to read and write the following formats: MSF, Paup, PIR/Codata, ASN.1. (ASN.1 was not parsed successfully: readseq seems to be unable to read in its own ASN.1 output). Technically, readseq is now used to parse files that have been processed as ``raw'' before; now ``raw'' format is recognized using the expression /^[A-Z_0-9$_GAP_SYMBOL$_UNKN_SYMBOL\s]+$/im, i.e. the file may only have alphanumerical characters, gap and unknown-symbol, and whitespace. If commata, etc, are detected, readseq is used for parsing. Readseq itself seems to be unable to detect ``raw'' format in some cases, causing weird results. * Added Clustal support for parsing; still to be tested and documented.
Version 1.009 on 25 May 1998. * The module can now be ``built'' in the standard way (perl Makefile.PL, etc).
APPENDIX
Please note that functions starting with an underscore (``_'') are intended for internal use only: Use them only if you know what you're doing :-)
new()
Usage : $myAln = Bio::UnivAln->new($file,$seqs,$id,$desc,$names,
$row_ids,$col_ids,$row_descs,$col_descs,$numbering,$type,
$ffmt,$descffmt,$inplace);
- or -
$myAln = Bio::UnivAln->new(-file=$file,
-seqs=>$seqs,
-id=>$id,
-desc=>$desc,
-names=>$names,
-row_ids=>$row_ids,
-col_ids=>$col_ids,
-row_descs=>$row_descs,
-col_descs=>$col_descs,
-numbering=>$numbering,
-type=>$type,
-ffmt=>$ffmt,
-descffmt=>$descffmt,
-inplace=>$inplace);
Function : The constructor for this class, returns a new object.
Returns : Bio::UnivAln object
Argument : $file: file from which the alignment data can be read; all
the other arguments will overwrite the data read in.
(see ``Alignment Formats'')
$seqs: EITHER a reference to a list of either Bio::Seq objects, or
arrays of letters, or strings, or any mix of these,
OR a single (multi-line) string
$id: String specifying the ID.
$desc: String giving a description, (later) to be formatted
according to $descffmt
$names:A reference to a hash which stores {loc,name} pairs of
other database locations and corresponding names where
the alignment is located. (See L<ADVANCED STUFF>).
$row_ids :A reference to an array which stores row (sequence) ids.
$row_descs:A reference to an array which stores row (sequence)
descriptions
$col_ids: Same as $self->{'row_ids'}, for the columns (sites)
$col_descs:Same as $self->{'row_descs'}, for the columns (sites)
$numbering: The offset of the first column
$type: The type of the alignment, see ``Alignment Types''.
$ffmt: alignment format, see ``Alignment Formats''.
$descffmt: format of $desc; right now this should be ``raw''
or ``fasta'' which just implies that no specific
format is being followed, any text is allowed
excluding ``\n''(newline).
$inplace: Flag which is set to true if accessors and utility
functions should make the modification to the object
itself, and just return true on success. See inplace().
_initialize()
Usage : n/a (internal function)
Function : Assigns initial parameters to a blessed object.
Returns : 1 on success
Argument : As Bio::UnivAln->new, allows for named or listed parameters.
See ->new for the legal types of these values.
_rearrange()
Usage : n/a (internal function)
Function : Rearranges named parameters to requested order.
Returns : @params - an array of parameters in the requested order.
Argument : $order : a reference to an array which describes the desired
order of the named parameters.
@param : an array of parameters, either as a list (in
which case the function simply returns the list),
or as an associative array (in which case the
function sorts the values according to @{$order}
and returns that new array.
_rowbounds()
Usage : $corrected_bounds = $aln->_rowbounds($uncorrected_bounds);
Function : create default row index list if necessary,
create row index list if specified by a hash of ids or
by a selector function that acts on rows and returns true/false,
check row index list for bounds errors,
NO LONGER DONE IN VERSION 1.007 AND HIGHER: substract offset of 1.
Returns : reference to corrected row index list
Argument : reference to uncorrected row index list
_colbounds()
Usage : $corrected_bounds = $aln->_colbounds($uncorrected_bounds);
Function : create default column index list if necessary,
create column index list if specified by a hash of ids or
by a selector function that acts on columns and returns true/false,
check column index list for bounds errors,
NOT IN 1.007 and higher: substract offset (according to numbering scheme).
Returns : reference to corrected column index list
Argument : reference to uncorrected column index list
reference to row index list: its first row is used to
provide the width of the alignment considered, which matters
if the alignment is really a sequence bag
_fixbounds()
Usage : ($corrected_rowbounds,$corrected_colbounds) =
$aln->_fixbounds($uncorrected_rowbounds,$uncorrected_colbounds);
OR
($corrected_rowbounds,$corrected_colbounds) =
$aln->_fixbounds($firstpos1,$lastpos1,$firstpos2,$lastpos2)
Function : Convert unfixed index list information into the standard internal
one, allowing as input either max. 2 references or max. 4 boundary
coordinates (2 coord. for row indices and 2 for column indices).
Call functions to create maximal default index lists if needed,
to create index lists if specified by a hash of ids or by a selector
function that acts on rows/columns and returns true/false,
to check index lists for bounds errors, and to substract offsets.
Returns : 2 references to corrected index lists
Argument : EITHER max. 2 references to uncorrected index lists,
OR max. 4 boundary coordinates (integers)
_select()
Usage : @alnSlice = $aln->_select($rrowsel2,$rcolsel2);
Function : select elements from the 2-dimensional array $self->{'seqs'}
using index lists
Returns : array of references to array of characters
Argument : 2 index lists, one for the rows and one for the columns,
Comment : Here's a diagram of dependencies of some methods relying on _select:
seqs()
/ |
|/_ \|/
_arys() <-- _strs()
|
\|/
_select()
_arys()
Usage : @alnSlice = $aln->_arys([$row1,$row2,$rowx],[col1,$col2,$colx])
(other usages see seqs())
Function : Same as seqs(), except that an array is returned always
Returns : array of references to array of characters
Argument : Same as seqs()
_strs()
Usage : $alnSlice = $aln->_strs([$row1,$row2,$rowx],[col1,$col2,$colx])
(other usages see seqs())
Function : Same as seqs(), except that a multiline string is returned always
Returns : multiline string (including newline characters)
Argument : Same as seqs()
seqs()
Usage : (1) $alnSlice = $aln->seqs($firstpos1,$lastpos1,$firstpos2,$lastpos2)
(2) $alnSlice = $aln->seqs([$row1,$row2,$rowx],[col1,$col2,$colx])
(3) $alnSlice = $aln->seqs(\&function_of_row,\&function_of_column)
(4) $alnSlice = $aln->seqs({ids=>'r1 r2 r3'},{ids=>'c1 c2 c3'})
[ (2),(3) and (4) can be intermixed ]
Function : (1) Returns part of an alignment, from row $firstpos1 to row
$lastpos1, and from column $firstpos2 to column $lastpos2
Missing parameters are replaced by default (maximal possible)
values, where the length of the first row of the _returned_ (*)
`alignment' determines the default length in case the alignment is
really a sequence bag. ((*) is a new feature in 1.006.)
(2) Returns part of an alignment, i.e. those elements that lie
in a row designated by an index from the first list
([$row1,$row2,$rowx]), and at the same time lie in a column
designated in the second list ([col1,$col2,$colx]).
The empty list (``[]'') is replaced by default (maximal possible)
values, where the length of the first row of the _returned_
`alignment' determines the default length in case the alignment is
really a sequence bag. (Note that the first element
of rows/columns has index 1 per default.)
(3) Instead of an index list, a function acting on a row / column
may be supplied; whenever the function returns true, the
row / column is designated.
(4) Instead of an index list, a hash of ids may be supplied;
the ids are looked up in the alignment's list of row (sequence)
ids / list of column (site) ids. The former list may be set
during the construction of the alignment (e.g. it may be read
from the fasta file, or the Bio::Seq objects), or it may be
manipulated using row_ids(). col_ids() sets the the latter list.
Returns : in a scalar context: multiline string (including newline characters)
in an array context: array of references to arrays of characters
Argument : (1) $firstpos1,$lastpos1,$firstpos2,$lastpos2 (all integers; note
that the first element of rows/columns has index 1 per default.)
(2-4) 2 selectors, one for the rows and one for the columns,
each of which may be a reference to a list of indices, or a
reference to a hash that has the format ``{ids=>'id1 id2 idx'}''
where ids is the mandatory key, and the value is a string
containing the desired ids, seperated by `` '' (space), or a
function that acts on a list and returns true/false.
_map_r()
Usage : n/a (internal function)
Function : apply a row function to selected rows,
then return the list of all return values
Returns : list of all return values
Argument : $rowf: Reference to the function that is applied to selected rows,
its results are put into a list and returned.
$rrowsel: Reference to the selector designating the rows
to which rowf is applied, as in cases (2)-(4) in seqs().
_map_c()
Usage : n/a (internal function)
Function : apply a column function to selected columns,
then return the list of all return values
Returns : list of all return values
Argument : analogous to _map_r()
map_r()
Usage : @resSlice = $aln->map_r($rowf,$rrowsel);
Function : apply a function to selected rows,
then return the list of all return values
Returns : list of all return values
Argument : $rowf: Reference to the function that is applied to selected rows,
its results are put into a list and returned.
$rrowsel: Reference to the selector designating the rows
to which rowf is applied, as in cases (2)-(4) in seqs().
map_c()
Usage : @resSlice = $aln->map_c($colf,$rcolsel,$rrowsel);
Function : apply a function to selected columns,
then return the list of all return values
Returns : list of all return values
Argument : $colf: Reference to the function that is applied to selected
columns, its results are put into a list and returned.
$rcolsel: Reference to the selector designating the columns
to which colf is applied, as in cases (2)-(4) in seqs().
$rrowsel: NOT used as a selector, but as a hint for
determining the width of a sequence bag if $rcolsel is undef:
The last index of the first row specified by $rrowsel is
taken as the maximum column index.
consensus()
Usage : $cons_letters = $aln->consensus($threshold,$rcolsel);
Function : return the consensus of a (subset of) the columns; the letter '!'
($_NO_CONSENSUS_SYMBOL) indicates that no consensus letter exists
Returns : in a scalar context: string of consensus letters
in an array context: array of consensus letters
Argument : $threshold: A letter is considered consensus of a column
if the fraction of the letters in the column that form a
(relative) majority is >= $threshold. Ties between 2 letters with
an equal relative majority are broken arbitrarily.
The default value is 0.75.
$rcolsel: Reference to the selector designating the columns
of which the consensus is calculated, as in cases (2)-(4) in seqs().
var_sites()
Usage : $resSlice = $aln->var_sites($threshold,$rrowsel);
Function : return the variable sites of an alignment
Returns : in a scalar context: multiline string (including newline characters)
in an array context: array of references to arrays of characters
Argument : $threshold: A column is considered variable
if the fraction of the letters in the column that form a
(relative) majority is NOT >= $threshold.
The default $threshold is 1, i.e. only constant, INvariable columns
are excluded.
$rrowsel: Reference to the selector designating the rows
of which in turn the letters in the variable columns are printed,
as in cases (2)-(4) in seqs(). $rrowsel DOES NOT influence the
calculation of the variable sites, it just constrains the output
further !
var_inds()
Usage : $indices = $aln->var_inds($threshold,$rrowsel);
Function : return the _indices_ of the variable sites of an alignment
Returns : reference to array of indices
Argument : $threshold: A column is considered variable
if the fraction of the letters in the column that form a
(relative) majority is NOT >= $threshold.
The default $threshold is 1, i.e. only constant, INvariable columns
are excluded.
$rrowsel: the first row in $rrowsel is used for
determining the width of a sequence bag. In other words,
the last index of the first row specified by $rrowsel is
taken as the maximum column index.
invar_sites()
Usage : $resSlice = $aln->invar_sites($threshold,$rrowsel);
Function : return the INvariable columns of an alignment
Returns : in a scalar context: multiline string (including newline characters)
in an array context: array of references to arrays of characters
Argument : $threshold: A column is considered INvariable
if the fraction of the letters in the column that form a
(relative) majority is >= $threshold.
The default $threshold is 1, i.e. no variability is allowed.
$rrowsel: see var_sites()
invar_inds()
Comment : This is the indices-returning version of invar_sites(),
cf. var_sites() and var_inds().
gap_free_sites()
Usage : $resSlice = $aln->gap_free_sites($rrowsel);
Function : return the gap-free columns of an alignment.
Returns : in a scalar context: multiline string (including newline characters)
in an array context: array of references to arrays of characters
Argument : $rrowsel: see var_sites()
gap_free_inds()
Comment : This is the indices-returning version of gap_free_sites(),
cf. var_sites() and var_inds().
unknown_free_sites()
Usage : $resSlice = $aln->unknown_free_sites($rrowsel);
Function : return the unknown-free columns of an alignment.
Returns : in a scalar context: multiline string (including newline characters)
in an array context: array of references to arrays of characters
Argument : $rrowsel: see var_sites()
unknown_free_inds()
Comment : This is the indices-returning version of unknown_free_sites(),
cf. var_sites() and var_inds().
no_allgap_sites()
Usage : $resSlice = $aln->no_allgap_sites($rrowsel);
Function : return the columns which do not have gaps only, of an alignment.
Returns : in a scalar context: multiline string (including newline characters)
in an array context: array of references to arrays of characters
Argument : $rrowsel: see var_sites()
no_allgap_inds()
Comment : This is the indices-returning version of no_allgap_sites(),
cf. var_sites() and var_inds().
special_free_sites()
Usage : $resSlice = $aln->special_free_sites($rrowsel);
Function : return the special-free (neither gap nor unknown-symbols) columns
of an alignment.
Returns : in a scalar context: multiline string (including newline characters)
in an array context: array of references to arrays of characters
Argument : $rrowsel: see var_sites()
special_free_inds()
Comment : This is the indices-returning version of special_free_sites(),
cf. var_sites() and var_inds().
reverse()
Usage : $resSlice = $aln->reverse($rrowsel);
Function : return the rows of an alignment, in reversed form (right->left)
Returns : in a scalar context: multiline string (including newline characters)
in an array context: array of references to arrays of characters
Argument : $rrowsel: Reference to the selector designating the rows
which are returned in reversed form, as in cases (2)-(4) in seqs().
remove_gaps()
Usage : $resSlice = $aln->remove_gaps($rrowsel);
Function : return the rows of an alignment, gaps removed (i.e. the original
sequences)
Returns : in a scalar context: multiline string (including newline characters)
in an array context: array of references to arrays of characters
Argument : $rrowsel: Reference to the selector designating the rows
which are returned without gaps, as in cases (2)-(4) in seqs().
complement()
Usage : $resSlice = $aln->complement($rrowsel);
Function : return the rows of an alignment, in complemented form
In the case of dna/rna, the complement is given according to the
IUPAC code; in other cases the result is currently calculated in the
same way and probably meaningless
Returns : in a scalar context: multiline string (including newline characters)
in an array context: array of references to arrays of characters
Argument : $rrowsel: see reverse()
revcom()
Usage : $aln->revcom($rrowsel);
Function : return the rows of an alignment, in reversed complement form
In the case of dna/rna, the complement is given according to the
IUPAC code; otherwise the result is currently calculated in the
same way and probably nonsensical.
Returns : in a scalar context: multiline string (including newline characters)
in an array context: array of references to arrays of characters
Argument : $rrowsel: see reverse()
equal_nogaps()
Usage : $aln->equal_nogaps($other_aln);
Function : checks whether two alignments have the same original sequences
(i.e. gaps are removed and the rows are compared for equality)
Returns : 1 if alignments are equal modulo gaps, 0 otherwise
Argument : $other_aln: the other alignment
equalize_lengths()
Usage : $aln->equalize_lengths($width);
Function : modifies the alignment / sequence bag such that all rows
have length $width; in the case of sequence bags, this
is a primitive way to obtain a true alignment with
rows of equal length
* The current procedure is straightforward but slow *
Returns : 1 if alignments are equal modulo gaps, 0 otherwise
Argument : $width: the length until which sequences should be padded;
per default the length of the longest row is taken
_seqs()
Usage : @oldSeqs = $aln->_seqs(@sequences,$start)
@oldSeqs = $aln->_seqs($sequences,$start)
Function : to APPEND/OVERWRITE sequences to an alignment
Returns : old list of sequences, i.e. ($self->{'seqs'})
Argument : 1. EITHER a reference to a list of either Bio::Seq objects, or
arrays of letters, or strings, or any mix of these,
OR a single (multi-line) string
2. $rrowsel: Reference to the selector designating the rows
which shall be appended/overwritten (feature in alpha status)
id()
Usage : $aln_id = $aln->id();
$aln->id($id_string);
Function : Accessor, also sets field if an ID is passed in.
Returns : (original) ID value
Argument : sequence string with no whitespace
desc()
Usage : $aln_desc = $aln->desc();
$aln->desc($desc_string);
Function : Accessor, also sets field if a description string is passed in.
Returns : (original) description value
Argument : sequence string
names()
Usage : %names = $aln->names;
$aln->names($hash_ref)
Function : Accessor, also sets field if names hash is passed in.
The names hash is 'human-readable' data; each key is a
location (whether it be URL, database, database query, etc.)
and each value is the id at that location.
Returns : (original) names hash value
Argument : reference to a hash
row_ids()
Usage : $row_ids = $aln->row_ids();
$aln->row_ids($row_ids);
Function : Accessor, also sets field.
Returns : A reference to the (original) array of row (sequence) ids
Argument : A reference to an array of row (sequence) ids
col_ids()
Usage : $col_ids = $aln->col_ids();
$aln->col_ids($col_ids);
Function : Accessor, also sets field.
Returns : A reference to the (original) array of column (site) ids
Argument : A reference to an array of column (site) ids
row_descs()
Usage : $row_descs = $aln->row_descs();
$aln->row_descs($row_descs);
Function : Accessor, also sets field.
Returns : A reference to the (original) array of row (sequence) descriptions
Argument : A reference to an array of row (sequence) descriptions
col_descs()
Usage : $col_descs = $aln->col_descs();
$aln->col_descs($col_descs);
Function : Accessor, also sets field.
Returns : A reference to the (original) array of column (site) descriptions
Argument : A reference to an array of column (site) descriptions
numbering()
Usage : $num_start = $aln->numbering;
$aln->numbering($value);
Function : Accessor, also sets field if a new numbering scheme is passed in.
Returns : (original) numbering value
Argument : number that is used as the offset of the first column
type()
Usage : $aln_type = $aln->type;
$aln->type($value); # May be dangerous !
Function : Accessor, also sets field if a new type is passed in.
The latter is considered dangerous !
Returns : (original) type value
Argument : new type, see the list %UnivAlnTypes in the code
(currently, ``dna'', ``rna'', ``amino'')
ffmt()
Usage : $ffmt = $aln->ffmt;
$aln->ffmt($value);
Function : Accessor, also sets field if a new format acronym is passed in.
This can be done before reading from a file, so that
the presumably correct parsing routine is called, or
the value can be set before writing the object using
layout(), so that layout uses a specified default format.
Returns : (original) format acronym
Argument : string describing the format, see ``Alignment Formats''
inplace()
Usage : $inplace = $aln->inplace();
$aln->inplace($value);
Function : Accessor, also sets field if a value is passed in.
'inplace' is a flag which is set to true if accessors and
functions should make the modification to the object itself,
and just return true on success. Currently, inplace manipulation
is supported for seqs(), _arys(), _strs(), remove_gaps(),
var_sites(), invar_sites(), gap_free_sites(), no_allgap_sites(),
reverse(), complement(), and revcom() (reverse complement).
Returns : (original) inplace value
Argument : currently 0 or 1
width()
Usage : $width = $aln->width();
Function : number of columns (of the first row per default)
Returns : number of columns (of the first row per default)
Argument : row
height()
Usage : $height = $aln->height();
Function : number of rows (sequences)
Returns : number of rows (sequences)
Argument : ./.
alphabet_check()
Usage : @offendig_characters = $aln->alphabet_check($rowsel);
Function : Check rows of the alignment for ``offending characters'', i.e.
characters that are not (currently) expected to be found in the
aligned sequences, because they're not in the default alphabet
that belongs to the specified L<Alignment Type>.
Currently, the Alignment Type can only be set explicitly
via the constructor, or accessor. Since the default alphabet
is not the IUPAC code (e.g. just A,C,G,T,-,? in case of DNA),
this is just a proof of concept.
Returns : an array containing the offending characters, one string per row
Argument : $rrowsel: Reference to the selector designating the rows
to which the check is applied, as in cases (2)-(4) in seqs().
_file_read()
Usage : n/a (internal function)
Function : Read data from file, and call the parsing system to store data
into the object fields
Returns : 1 on success
Argument : filename, and (optionally) the format if known.
_parse()
Usage : $aln->_parse($ent,[$ffmt]);
Function : Parses $ent into the object fields, according to
$ffmt or $self->{'ffmt'}.
Returns : 1 on success
Argument : the prospective alignment to be parsed,
and optionally its format so that it doesn't need to be estimated
Note that ``raw'' is not estimated in a reliable way by readseq
if readseq is installed; I presume it is then estimated only if
the sequence(s) are longer than 111 basepairs
_parse_unknown()
Usage : $aln->_parse_unknown($ent);
Function : tries to figure out the format of $ent and then
calls the appropriate function to parse it into $self->{'seqs'}.
Returns : 1 on success
Argument : $ent : the rough multi-line string to be parsed
_parse_bad()
Usage : $aln->_parse_bad;
Function : Carp on the bad data that the user gave us.
Returns : undef
Argument : (multiline) string that cannot be parsed
_parse_readseq()
Usage : $aln->_parse_readseq;
Function : Try readseq to parse data.
Returns : 1 on success
Argument : $ent : the rough multi-line string to be parsed
_parse_raw()
Usage : $aln->_parse_raw;
Function : parses $ent into the $self->{'seqs'} field, using raw
file format.
Returns : 1 on success
Argument : (multiline) string to be parsed
_parse_fasta()
Usage : $aln->_parse_fasta;
Function : parses $ent into the 'seqs' field, using fasta
file format.
Returns : 1 on success
Argument : (multiline) string to be parsed
copy()
Usage : $copyOfObj = $myUnivAln->copy;
Function : Returns an identical copy of the object.
Returns : Bio::UnivAln
Argument : n/a
layout()
Usage : $aln->layout($format);
Function : Returns the alignment in whichever format the user specifies,
or according to the "ffmt" field if the user does not specify
a format.
Returns : varies; "" if unsuccessful
Argument : $format (one of the formats as defined in $UnivAlnForm).
out_bad()
Usage : $aln->out_bad;
Function : Carp if we don't know the output format.
Returns : undef
Argument : n/a
out_raw()
Usage : $aln->out_raw;
Function : Returns the alignment in raw format.
Returns : multiline string
Argument : n/a
out_fasta()
Usage : $aln->out_fasta;
Function : Returns the alignment as a string in fasta format.
Returns : multiline string
Argument : Same as seqs()
Comment : The old ``out_fasta()'' function with no arguments
is slightly faster and simpler; still available as out_fasta2().