NAME

Bio::CUA::CUB::Builder -- A module to calculate codon usage bias (CUB) metrics at codon level and other parameters

SYNOPSIS

        use Bio::CUA::CUB::Builder;

        # initialize the builder
        my $builder = Bio::CUA::CUB::Builder->new(
                      codon_table => 1 ); # using standard genetic code
        
        # calculate RSCU for each codon, and result is stored in "rscu.out" as
        # well as returned as a hash reference
        my $rscuHash = $builder->build_rscu("seqs.fa",undef, 0.5,"rscu.out");

        # calculate CAI for each codon, normalizing RSCU values of codons
        # for each amino acid by the expected RSCUs under even usage,
        # rather than the maximal RSCU used by the traditional CAI method.
        my $caiHash = $builder->build_cai($codonList,2,'mean',"cai.out");

        # calculate tAI for each codon
        my $taiHash = $builder->build_tai("tRNA_copy_number.txt","tai.out", undef, 1);

DESCRIPTION

Codon usage bias (CUB) can be represented at two levels: codon and sequence. The sequence-level value is often computed as the geometric mean of the values of the sequence's codons. This module calculates CUB metrics at the codon level.
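
To illustrate how the two levels relate, here is a minimal sketch in plain Perl. It does not use this module; the function name sequence_index and the codon values are made up for illustration, and skipping codons without a value is an assumption of the sketch.

```perl
use strict;
use warnings;

# Hypothetical codon-level values (e.g., CAI-like weights); not real data.
my %codon_value = (GCT => 1.0, GCC => 0.5, AAA => 0.25);

# Sequence-level index as the geometric mean of the codons' values.
sub sequence_index {
    my ($seq, $values) = @_;
    my ($log_sum, $n) = (0, 0);
    for my $codon (uc($seq) =~ /(...)/g) {
        my $v = $values->{$codon};
        next unless defined $v && $v > 0;   # skip codons without a value
        $log_sum += log($v);
        $n++;
    }
    return $n ? exp($log_sum / $n) : undef;
}

# ATG has no value in %codon_value, so only GCT, GCC and AAA contribute.
printf "%.3f\n", sequence_index('GCTGCCAAAATG', \%codon_value);   # 0.500
```

Summing logarithms rather than multiplying the values directly avoids underflow on long sequences.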

Supported CUB metrics include CAI (codon adaptation index), tAI (tRNA adaptation index), RSCU (relative synonymous codon usage), and their variants. See the methods below for details.

The output can be stored in a file, which methods in Bio::CUA::CUB::Calculator then use to calculate CUB indices for each protein-coding sequence.

METHODS

new

 Title   : new
 Usage   : $analyzer = Bio::CUA::CUB::Builder->new(-codon_table => 1)
 Function: initialize the analyzer
 Returns : an object
 Args    : accepted options are as follows

 B<options needed for building parameters of all CUB indices>
-codon_table

the genetic code table applied in subsequent sequence analyses. It can be specified by an integer (genetic code table id), an object of Bio::CUA::CodonTable, or a map-file. See the method "new" in Bio::CUA::Summarizer for details.

 B<options needed for building tAI index's parameters>
-a_to_i
 a switch option. If true (any nonzero value), every
 'A' nucleotide at the 1st position of an anticodon is regarded as I
 (inosine), which can pair with more nucleotides at the codon's wobble
 position (A, T, C at the 3rd position). The default is true.
-no_atg
 a switch option to indicate whether ATG codons should be
 excluded in tAI calculation. Default is true, following I<dos Reis,
 et al., 2004, NAR>. To include ATG in tAI calculation, provide '0' here.
-wobble
 reference to a hash containing anticodon-codon basepairs at
 wobbling position, such as ('U' is equivalent to 'T')
 %wobblePairs = (
        A => [qw/T/],
        C => [qw/G/],
        T => [qw/A G/],
        G => [qw/C T/],
        I => [qw/A C T/]
        ); # this is the default setting
 Hash keys are the wobble bases in anticodons and hash values are the
 paired bases at the codon's 3rd position. This option is optional; the
 default value is the one shown in the example above.
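
As an illustration of what the -wobble and -a_to_i options describe, the following sketch expands an anticodon into the codons it can read. It is plain Perl and independent of this module; the helper name codons_for_anticodon is hypothetical, and the module's internal logic may differ.

```perl
use strict;
use warnings;

# Default wobble pairs from the text: anticodon 1st base => codon 3rd bases.
my %wobble = (
    A => [qw/T/],
    C => [qw/G/],
    T => [qw/A G/],
    G => [qw/C T/],
    I => [qw/A C T/],
);
my %complement = (A => 'T', T => 'A', C => 'G', G => 'C');

# List the codons that an anticodon (written 5'->3') can pair with.
# $a_to_i mimics the -a_to_i switch: a leading 'A' is treated as inosine.
sub codons_for_anticodon {
    my ($anticodon, $a_to_i) = @_;
    (my $ac = uc $anticodon) =~ tr/U/T/;        # 'U' is equivalent to 'T'
    my ($b1, $b2, $b3) = split //, $ac;
    $b1 = 'I' if $a_to_i && $b1 eq 'A';
    # Codon positions 1 and 2 pair (Watson-Crick) with anticodon positions 3 and 2.
    my $prefix = $complement{$b3} . $complement{$b2};
    return map { $prefix . $_ } @{ $wobble{$b1} || [] };
}

print join(',', codons_for_anticodon('GAA', 1)), "\n";   # TTC,TTT
print join(',', codons_for_anticodon('AGC', 1)), "\n";   # GCA,GCC,GCT
print join(',', codons_for_anticodon('AGC', 0)), "\n";   # GCT
```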

no_atg

 Title   : no_atg
 Usage   : $status = $self->no_atg([$newVal])
 Function: get/set the status whether ATG should be excluded in tAI
 calculation.
 Returns : current status after updating
 Args    : optional. 1 for true, 0 for false

build_rscu

 Title   : build_rscu
 Usage   : $rscuHash = $self->build_rscu($input,[$minTotal,$pseudoCnt,$output]);
 Function: calculate RSCU values for all sense codons
 Returns : reference of a hash using the format 'codon => RSCU value'.
 return undef if failed.
 Args    : accepted arguments are as follows (note: not as hash):
input
 name of a file containing fasta CDS sequences of the genes of
 interest, or a sequence object with method I<seq> to extract the sequence
 string, or a plain sequence string, or reference to a hash containing
 codon counts with structure like I<{ AGC => 50, GTC => 124}>.
minTotal
 optional, minimal count of an amino acid in sequences; if the observed
 count is smaller than this minimum, all codons of this amino acid are
 assigned equal RSCU values. This is to reduce sampling errors in
 rarely observed amino acids. Default value is 5.
pseudoCnt
 optional. Pseudo-counts for unobserved codons. Default is 0.5.
output
 optional, name of the file to store the result. If omitted,
 no result will be written.
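
The role of minTotal and pseudoCnt can be sketched in plain Perl as follows. This is not the module's code: the function rscu_sketch and the two-amino-acid table are hypothetical, the equal RSCU value is assumed to be 1 (even usage), and the order in which the module applies the two parameters is not documented here.

```perl
use strict;
use warnings;

# Synonymous codon families for two amino acids only (standard code);
# a real calculation would cover every degenerate amino acid.
my %family = (
    Lys => [qw/AAA AAG/],
    Ala => [qw/GCT GCC GCA GCG/],
);

sub rscu_sketch {
    my ($counts, $min_total, $pseudo_cnt) = @_;
    $min_total  = 5   unless defined $min_total;
    $pseudo_cnt = 0.5 unless defined $pseudo_cnt;
    my %rscu;
    for my $aa (keys %family) {
        my @codons   = @{ $family{$aa} };
        my $observed = 0;
        $observed += $counts->{$_} || 0 for @codons;
        if ($observed < $min_total) {
            # Too few observations: assume even usage (RSCU = 1).
            $rscu{$_} = 1 for @codons;
            next;
        }
        # Unobserved codons receive the pseudo-count.
        my %c = map { $_ => ($counts->{$_} || $pseudo_cnt) } @codons;
        my $total = 0;
        $total += $_ for values %c;
        my $expected = $total / @codons;   # count per codon under even usage
        $rscu{$_} = $c{$_} / $expected for @codons;
    }
    return \%rscu;
}

my $rscu = rscu_sketch({ AAA => 30, AAG => 10, GCT => 2, GCC => 1 });
printf "%s %.2f\n", $_, $rscu->{$_} for sort keys %$rscu;
```

With these made-up counts, lysine is observed 40 times, so AAA and AAG get 1.50 and 0.50, while alanine is observed only 3 times and all four of its codons fall back to 1.00.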

build_cai

 Title   : build_cai
 Usage   : $caiHash = $self->build_cai($input,[$minTotal,$norm_method,$output]);
 Function: calculate CAI values for all sense codons
 Returns : reference of a hash in which codons are keys and CAI values
 are values. return undef if failed.
 Args    : accepted arguments are as follows:
input
 name of a file containing fasta CDS sequences of the genes of
 interest, or a sequence object with method I<seq> to derive the sequence
 string, or a plain sequence string, or reference to a hash containing
 codon counts with structure like I<{ AGC => 50, GTC => 124}>.
minTotal
 optional, minimal codon count for an amino acid; if observed
 count is smaller than this count, all codons of this amino acid would 
 be assigned equal CAI values. This is to reduce sampling errors in
 rarely observed amino acids. Default value is 5.
norm_method
 optional, indicating how to normalize RSCU to get CAI
 values. Valid values are 'max' and 'mean'. The former is the
 original method used by I<Sharp and Li, 1987, NAR>, i.e., dividing
 all RSCUs by the maximum RSCU of the amino acid, while 'mean' indicates
 dividing RSCU by the expected average fraction assuming even usage of
 all codons, i.e., 0.5 for amino acids encoded by 2 codons, 0.25 for
 amino acids encoded by 4 codons, etc. The CAI metric determined by
 the latter method is named I<mCAI>. mCAI can assign
 different CAI values to the most preferred codons of different
 amino acids, which would otherwise all be the same under CAI (i.e., 1).
output
 optional. If provided, result will be stored in the file
 specified by this argument.
 
 Note: codons which are not observed are assigned a count of
 0.5, and codons which are not degenerate (such as AUG and UGG in the
 standard genetic code table) are excluded. These are the defaults of
 the paper I<Sharp and Li, 1987, NAR>. You can also reduce
 sampling error by setting the parameter $minTotal.
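
The difference between the two norm_method values can be illustrated with a small plain-Perl sketch. It is not the module's implementation: the function cai_weights and the counts are hypothetical, pseudo-counts and minTotal are ignored, and the 'mean' branch follows one reading of the description above, namely dividing each codon's usage fraction within its amino acid by the even-usage fraction 1/k.

```perl
use strict;
use warnings;
use List::Util qw(max sum);

# Turn the codon counts of ONE amino acid into CAI-like values.
sub cai_weights {
    my ($counts, $method) = @_;
    my @codons = sort keys %$counts;
    my @n      = @{$counts}{@codons};
    # 'max' : divide by the largest count (same as RSCU / maximal RSCU).
    # 'mean': divide the usage fraction by 1/k, i.e. count / (total / k).
    my $denom = $method eq 'max' ? max(@n) : sum(@n) / @codons;
    return { map { $_ => $counts->{$_} / $denom } @codons };
}

# Hypothetical counts for a four-codon and a two-codon amino acid.
my %ala = (GCT => 40, GCC => 20, GCA => 10, GCG => 10);
my %lys = (AAA => 30, AAG => 10);

for my $method (qw(max mean)) {
    my $ala = cai_weights(\%ala, $method);
    my $lys = cai_weights(\%lys, $method);
    printf "%-4s GCT=%.2f AAA=%.2f\n", $method, $ala->{GCT}, $lys->{AAA};
}
```

Under 'max' the preferred codons GCT and AAA both get 1.00, whereas under this reading of 'mean' they get 2.00 and 1.50, which is the distinction the mCAI description refers to.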

build_b_cai

 Title   : build_b_cai
 Usage   : $caiHash =
 $self->build_b_cai($input,$background,[$minTotal,$output]);
 Function: calculate CAI values for all sense codons. Instead of
 normalizing RSCUs by maximal RSCU or expected fractions, each RSCU value is
 normalized by the corresponding background RSCU, then these
 normalized RSCUs are used to calculate CAI values.
 Returns : reference of a hash in which codons are keys and CAI values
 are values. return undef if failed.
 Args    : accepted arguments are as follows:
input
 name of a file containing fasta CDS sequences of the genes of
 interest, or a sequence object with method I<seq> to derive the sequence
 string, or a plain sequence string, or reference to a hash containing
 codon counts with structure like I<{ AGC => 50, GTC => 124}>.
background
 background data from which background codon usage (RSCUs)
 is computed. Acceptable formats are the same as the above argument
 'input'.
minTotal
 optional, minimal codon count for an amino acid; if observed
 count is smaller than this count, all codons of this amino acid would 
 be assigned equal RSCU values. This is to reduce sampling errors in
 rarely observed amino acids. Default value is 5.
output
 optional. If provided, result will be stored in the file
 specified by this argument.
 
 Note: codons which are not observed are assigned a count of
 0.5, and codons which are not degenerate (such as AUG and UGG in the
 standard genetic code table) are excluded.
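
A rough sketch of background normalization, in plain Perl and independent of this module, is given below. The function names are hypothetical, and the final rescaling by the largest ratio within the amino acid is an assumption: the description above only states that the background-normalized RSCUs are used to calculate the CAI values.

```perl
use strict;
use warnings;
use List::Util qw(max sum);

# RSCU for the codons of ONE amino acid: count divided by the mean count.
sub family_rscu {
    my ($counts) = @_;
    my $mean = sum(values %$counts) / scalar(keys %$counts);
    return { map { $_ => $counts->{$_} / $mean } keys %$counts };
}

# Normalize input RSCUs by background RSCUs (both for the same amino acid).
# Every codon must have a nonzero background count in this sketch.
sub b_cai_weights {
    my ($input, $background) = @_;
    my $in    = family_rscu($input);
    my $bg    = family_rscu($background);
    my %ratio = map { $_ => $in->{$_} / $bg->{$_} } keys %$in;
    my $max   = max(values %ratio);   # assumption: rescale so the top codon is 1
    return { map { $_ => $ratio{$_} / $max } keys %ratio };
}

# Hypothetical lysine counts in the genes of interest and two backgrounds.
my %input  = (AAA => 30, AAG => 10);
my $even   = b_cai_weights(\%input, { AAA => 20, AAG => 20 });
my $biased = b_cai_weights(\%input, { AAA => 60, AAG => 20 });
printf "even background:   AAG=%.3f\n", $even->{AAG};     # 0.333
printf "biased background: AAG=%.3f\n", $biased->{AAG};   # 1.000
```

When the background shows the same bias as the input, the ratios are all equal and no codon is scored as preferred.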

build_tai

 Title   : build_tai
 Usage   : $taiHash =
 $self->build_tai($input,[$output,$selective_constraints, $kingdom]);
 Function: build tAI values for all sense codons
 Returns : reference of a hash in which codons are keys and tAI values
 are values. return undef if failed. See Formulas 1 and 2 in I<dos
 Reis, et al., 2004, NAR> to see how they are computed.
 Args    : accepted arguments are as follows:
 
input
 name of a file containing tRNA copies/abundance in the format
 'anticodon<tab>count' per line, where 'anticodon' is anticodon in
 the tRNA and count can be the tRNA gene copy number or abundance.
output
 optional. If provided, result will be stored in the file
 specified by this argument.
selective_constraints
 optional, reference to hash containing wobble base-pairing and its
 selective constraint compared to Watson-Crick base-pair, the format
 is like this:
 $selective_constraints = {
                 ...   ...   ...
                 'C-G'   => 0,
                 'G-T'   => 0.41,
                 'I-C'   => 0.28,
                 ...   ...   ...
                 };
 The key follows the 'anticodon-codon' order, and the values are codon
 selective constraints. The smaller the constraint, the stronger the
 pairing, so all Watson-Crick pairings have value 0.
 If this option is omitted, the values are looked up in the 'input' file,
 in a section that follows the anticodon section and starts with a line
 '>SC'. If the input file has no such section, the values in Table 2 of
 I<dos Reis, et al., 2004, NAR> are used.
kingdom
 kingdom = 1 for prokaryota and 0 or undef for eukaryota, which
 affects the calculation for the bacterial isoleucine ATA codon. Default is
 undef, i.e., eukaryota.
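
The following plain-Perl sketch shows how tRNA counts, wobble pairs and selective constraints could combine into codon-level tAI values: each codon's absolute adaptiveness sums (1 - constraint) * count over the tRNAs that read it, and the result is divided by the largest value. It is not the module's implementation; the function tai_weights and the tRNA counts are hypothetical, pairs missing from the constraint hash are assumed to be 0, and codons read by no tRNA as well as the kingdom-specific ATA rule are ignored.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Default wobble pairs: anticodon 1st base => codon 3rd bases.
my %wobble = (
    A => [qw/T/],   C => [qw/G/],   T => [qw/A G/],
    G => [qw/C T/], I => [qw/A C T/],
);
my %complement = (A => 'T', T => 'A', C => 'G', G => 'C');

# Hypothetical tRNA gene copy numbers keyed by anticodon (5'->3').
my %trna = (GAA => 10, CTT => 4, GTT => 6);

# Constraints keyed 'anticodon-codon'; pairs not listed default to 0 here.
my %constraint = ('G-T' => 0.41, 'I-C' => 0.28);

sub tai_weights {
    my ($trna, $s) = @_;
    my %abs;   # absolute adaptiveness per codon
    for my $anticodon (keys %$trna) {
        my ($b1, $b2, $b3) = split //, $anticodon;
        my $prefix = $complement{$b3} . $complement{$b2};
        for my $third (@{ $wobble{$b1} || [] }) {
            my $sc = $s->{"$b1-$third"} // 0;
            $abs{ $prefix . $third } += (1 - $sc) * $trna->{$anticodon};
        }
    }
    my $max = max(values %abs);
    return { map { $_ => $abs{$_} / $max } keys %abs };   # relative values
}

my $w = tai_weights(\%trna, \%constraint);
printf "%s %.3f\n", $_, $w->{$_} for sort keys %$w;
```

With these made-up counts, TTC is read by GAA through a Watson-Crick pair and scores 1.000, while TTT is read by the same tRNA through a G-T wobble pair and scores 0.590.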

AUTHOR

Zhenguo Zhang, <zhangz.sci at gmail.com>

BUGS

Please report any bugs or feature requests to bug-bio-cua at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Bio-CUA. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Bio::CUA::CUB::Builder

ACKNOWLEDGEMENTS

LICENSE AND COPYRIGHT

Copyright 2015 Zhenguo Zhang.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.