NAME
Bio::ToolBox::GeneTools - SeqFeature agnostic methods for working with gene models
SYNOPSIS
use Bio::ToolBox::GeneTools qw(:all);
my $gene; # a SeqFeatureI compliant gene object obtained elsewhere
# for example, from Bio::DB::SeqFeature::Store database
# or parsed from a GFF3, GTF, or UCSC-style gene table using
# Bio::ToolBox parsers
if (is_coding($gene)) { # boolean test
# collect all exons from all transcripts in gene
my @exons = get_exons($gene);
# find just the alternate exons used only once
my @alternate_exons = get_alt_exons($gene);
# collect UTRs, which may not be defined in the original source
my @utrs;
foreach my $t (get_transcripts($gene)) {
my @u = get_utrs($t);
push @utrs, @u;
}
}
DESCRIPTION
This module provides numerous exportable functions for working with gene SeqFeature models. This assumes that the gene models follow the BioPerl Bio::SeqFeatureI convention with nested SeqFeature objects representing the gene, transcript, and exons. For example,
gene
transcript
exon
CDS
Depending upon how the SeqFeatures were generated or defined, subfeatures may or may not be defined or be obvious. For example, UTRs or introns may not be present. Furthermore, the primary_tag
or type may not follow Sequence Ontology terms. Regular expressions are deployed to handle varying naming schemes and exceptions.
These functions should work with most or all Bio::SeqFeatureI compliant objects. It has been tested with Bio::ToolBox::SeqFeature, Bio::SeqFeature::Lite, and Bio::DB::SeqFeature classes.
New SeqFeature objects that are generated use the same class for simplicity and expediency.
METHODS
Function Import
None of the functions are exported by default. Specify which ones you want when you import the module. Alternatively, use one of the tags below.
- :all
-
Import all of the methods.
- :exon
-
Import all of the exon methods, including "get_exons", "get_alt_exons", "get_common_exons", "get_uncommon_exons", and "get_alt_common_exons".
- :intron
-
Import all of the intron methods, including "get_introns", "get_alt_introns", "get_common_introns", "get_uncommon_introns", and "get_alt_common_introns".
- :transcript
-
Import the transcript related methods, including "get_transcripts", "get_transcript_length", and "collapse_transcripts".
- :cds
-
Import the CDS pertaining methods, including "is_coding", "get_cds", "get_cdsStart", "get_cdsEnd", "get_transcript_cds_length", and "get_utrs".
- :export
-
Import all of the export methods, including "gff_string", "gtf_string", "ucsc_string", and "bed12_string";
- :filter
-
Import all of the transcript filter methods, including "filter_transcript_biotype", "filter_transcript_gencode_basic", and "filter_transcript_support_level".
Exon Methods
Functions to get a list of exons from a gene or transcript
- get_exons
-
my @exons = get_exons($gene); my @exons = get_exons($transcript);
This will return an array or array reference of all the exon subfeatures in the SeqFeature object, either gene or transcript. No discrimination whether they are used once or more than once. Non-defined exons can be assembled from CDS and/or UTR subfeatures. Exons are sorted by start coordinate.
- get_alt_exons
-
my @alternate_exons = get_alt_exons($gene);
This will return an array or array reference of all the exon subfeatures for a multi-transcript gene that are used only once in all of the transcripts.
- get_common_exons
-
my @common_exons = get_common_exons($gene);
This will return an array or array reference of all the exon subfeatures for a multi-transcript gene that are used in all of the transcripts.
- get_uncommon_exons
-
my @uncommon_exons = get_uncommon_exons($gene);
This will return an array or array reference of all the exon subfeatures for a multi-transcript gene that are used in some of the transcripts, i.e. more than one but not all.
- get_alt_common_exons
-
my %exon_hash = get_alt_common_exons($gene);
This will return a hash reference with several keys, including "common", "uncommon", and each of the transcript IDs. Each key value is an array reference with the exons for that category. The "common" will be all common exons, "uncommon" will be uncommon exons (used more than once but less than all), and each transcript ID will include their specific alternate exons (used only once).
For genes with only a single transcript, all exons will be marked as "common" for simplicity, although technically they could all be considered "alternate" since they're only used once.
Intron Methods
Functions to get a list of introns from a gene or transcript. Introns are not usually defined in gene annotation files, but are inferred from the exons and total gene or transcript length. In this case, new SeqFeature elements are generated for each intron.
- get_introns
-
my @introns = get_introns($gene); my @introns = get_introns($transcript);
This will return an array or array reference of all the intron subfeatures in the SeqFeature object, either gene or transcript. No discrimination whether they are used once or more than once. Non-defined introns can be assembled from CDS and/or UTR subfeatures. Introns are sorted by start coordinate.
- get_alt_introns
-
my @alternate_introns = get_alt_introns($gene);
This will return an array or array reference of all the intron subfeatures for a multi-transcript gene that are used only once in all of the transcripts.
- get_common_introns
-
my @common_introns = get_common_introns($gene);
This will return an array or array reference of all the intron subfeatures for a multi-transcript gene that are used in all of the transcripts.
- get_uncommon_introns
-
my @uncommon_introns = get_uncommon_introns($gene);
This will return an array or array reference of all the intron subfeatures for a multi-transcript gene that are used in some of the transcripts, i.e. more than one but not all.
- get_alt_common_introns
-
my %intron_hash = get_alt_common_introns($gene);
This will return a hash reference with several keys, including "common", "uncommon", and each of the transcript IDs. Each key value is an array reference with the introns for that category. The "common" will be all common introns, "uncommon" will be uncommon introns (used more than once but less than all), and each transcript ID will include their specific alternate introns (used only once).
For genes with only a single transcript, all introns will be marked as "common" for simplicity, although technically they could all be considered "alternate" since they're only used once.
Transcript Methods
These methods work on transcripts, typically alternate transcripts from a gene SeqFeature.
- get_transcripts
-
my @transcripts = get_transcripts($gene);
Returns an array or array reference of the transcripts associated with a gene feature.
- collapse_transcripts
-
my $new_transcript = collapse_transcripts($gene); my $new_transcript = collapse_transcripts($transcript1, $transcript2);
This method will collapse all of the transcripts associated with a gene SeqFeature into a single artificial transcript, merging exons as necessary to maximize exon length and minimize introns. This is useful when performing, for example, RNASeq analysis on genes. A single SeqFeature transcript object is returned containing the merged exon subfeatures.
Pass either a gene or a list of transcripts to collapse.
- get_transcript_length
-
my $length = get_transcript_length($transcript);
Calculates and returns the transcribed length of a transcript, i.e the sum of its exon lengths. Warning! If you pass a gene object, you will get the maximum of all transcript exon lengths, which may not be what you anticipate!
CDS methods
These methods calculate values related to the coding sequence of the mRNA transcripts.
- is_coding
-
if( is_coding($transcript) ) { # do something }
This method will return a boolean value (1 or 0) if the passed transcript object appears to be a coding transcript. GFF and GTF files are not always immediately clear about the type of transcript; there are (unfortunately) multiple ways to encode the feature as a protein coding transcript:
primary_tag
,source_tag
, GFF attribute, presence of CDS subfeatures, etc. This method checks all of these possibilities. Note: If you pass a multi-transcript gene, only one transcript need to be coding to pass a true value. - get_cds
-
my @cds = get_cds($transcript);
Returns the CDS subfeatures of the given transcript, if they are defined. Returns either an array or array reference.
- get_cdsStart
-
my $start = get_cdsStart($trancript);
Returns the start coordinate of the CDS for the given transcript. Note that this is the leftmost (smallest) coordinate of the CDS and not necessarily the coordinate of the start codon, similar to what the UCSC gene tables report. Use the transcript strand to determine the 5' end.
- get_cdsEnd
-
my $end = get_cdsEnd($trancript);
Returns the stop coordinate of the CDS for the given transcript. Note that this is the rightmost (largest) coordinate of the CDS and not necessarily the coordinate of the stop codon, similar to what the UCSC gene tables report. Use the transcript strand to determine which is the
3'
and5'
end. - get_start_codon
-
my $start_codon = get_start_codon($trancript);
Returns a SeqFeature object representing the start codon. If one is not explicitly defined in the hierarchy, then a new object is generated.
- get_stop_codon
-
my $stop_codon = get_stop_codon($transcript);
Returns a SeqFeature object representing the stop codon. If one is not defined in the hierarchy, then a new object is created. Note that this assumes that the stop codon is inclusive to the defined CDS, which is the case with GFF3 and UCSC gene table derived features. On the other hand, features derived from GTF is defined with the stop codon exclusive to the CDS. This shouldn't matter with GTF, however, since GTF usually explicitly includes stop codon features, whereas the other two formats do not.
- get_transcript_cds_length
-
my $length = get_transcript_cds_length($transcript);
Calculates and returns the length of the coding sequence for a transcript, i.e. the sum of the CDS lengths.
- get_utrs
-
my @utrs = get_utrs($trancript);
Returns both
5'
and3'
untranslated regions of the transcript. If these are not defined in the SeqFeature subfeature hierarchy, then the coordinates will be determined from from the exon and CDS subfeatures, if available, and new SeqFeature objects generated. Non-coding transcripts will not return anything. - get_5p_utrs
-
my @5p_utrs = get_5p_utrs($trancript);
Returns only the
5'
untranslated regions of the transcript. - get_3p_utrs($transcript)
-
my @3p_utrs = get_3p_utrs($trancript);
Returns only the
3'
untranslated regions of the transcript.
Export methods
These methods are used for exporting a gene and/or transcript model into a text string based on the specified format.
- gff_string
-
my $string .= gff_string($gene, 1); my $string .= gff_string($transcript, 1);
This is just a convenience method. SeqFeature objects based on Bio::SeqFeature::Lite, Bio::DB::SeqFeature, or Bio::ToolBox::SeqFeature have a
gff_string
method, and this will simply call that method. SeqFeature objects that do not have this method will, of course, cause the script to terminate.Pass the seqfeature object and a boolean value to recursively append all subfeatures (e.g. exons) to the string. In most cases, this will generate a GFF3-style string.
Bio::ToolBox::Data::Feature also provides a simplified gff_string method.
- gtf_string
-
my $string .= gtf_string($gene); my $string .= gtf_string($transcript);
This will export a gene or transcript model as a series of GTF formatted text lines, following the defined Gene Transfer Format (also known as GFF version 2.5). It will ensure that each feature is properly tagged with the
gene_id
andtranscript_id
attributes.This method will automatically recurse through all subfeatures.
- ucsc_string
-
my $string = ucsc_string($gene);
This will export a gene or transcript model as a refFlat formatted Gene Prediction line (11 columns). See http://genome.ucsc.edu/FAQ/FAQformat.html#format9 for details. Multiple transcript genes are exported as multiple text lines concatenated together.
- bed12_string
-
my $string = bed12_string($gene);
This will export a gene or transcript model as a UCSC Bed formatted transcript line (12 columns). See http://genome.ucsc.edu/FAQ/FAQformat.html#format1 for details. Multiple transcript genes are exported as multiple text lines concatenated together. Note that gene information is not preserved with Bed12 files; only the transcript name is used. The
RGB
value is set to 0.
Filter methods
These methods are used to filter genes.
- filter_transcript_support_level
-
my $new_gene = filter_transcript_support_level($gene, 'best2'); my @good_transcripts = filter_transcript_support_level(\@transcripts);
This will filter a gene object for transcripts that match or exceed the provided transcript support level. This assumes that the transcripts contain the attribute tag 'transcript_support_level', which are present in Ensembl provided GFF3 and GTF annotation files. The values are a digit (1-5), or 'NA', where 1 is experimentally supported and 5 is entirely predicted with no experimental evidence. See Ensembl TSL glossary entry for details.
Pass a gene SeqFeature object with one or more transcript subfeatures. Alternatively, an array reference of transcripts could be passed as well.
A level may be provided as a second argument. The default is 'best'.
- best
-
Only the transcripts with the highest existing value will be retained.
- best<digit>
-
All transcripts up to the indicated level are retained. For example, 'best3' would indicate that transcripts with support levels 1, 2, and 3 would be retained.
- <digit>
-
Only transcripts at the given level are retained.
- NA
-
Only transcripts with 'NA' as the value are retained. These are typically pseudogenes or single-exon transcripts.
If none of the transcripts have the attribute, then all are returned (nothing is filtered).
If a gene object was provided, a new gene object will be returned with only the retained transcripts as subfeatures. If an array reference of transcripts was provided, then an array reference of the filtered transcripts is returned.
- filter_transcript_gencode_basic
-
my $new_gene = filter_transcript_gencode_basic($gene); my @good_transcripts = filter_transcript_gencode_basic(\@transcripts);
This will filter a gene object for transcripts for the Ensembl GENCODE tag "basic", which indicates that a transcript is tagged as GENCODE Basic transcript.
If a gene object was provided, a new gene object will be returned with only the retained transcripts as subfeatures. If an array reference of transcripts was provided, then an array reference of the filtered transcripts is returned.
- filter_transcript_biotype
-
my $new_gene = filter_transcript_gencode_basic($gene, $biotype); my @good_transcripts = filter_transcript_gencode_basic(\@transcripts, 'miRNA');
This will filter a gene object for transcripts for specific biotype values using the
transcript_biotype
orbiotype
attribute tags, commonly found in Ensembl annotation.If a gene object was provided, a new gene object will be returned with only the retained transcripts as subfeatures. If an array reference of transcripts was provided, then an array reference of the filtered transcripts is returned.
SEE ALSO
Bio::ToolBox::SeqFeature, Bio::ToolBox::Parser, Bio::Tools::GFF, Bio::SeqFeature::Lite, Bio::DB::SeqFeature, Bio::SeqFeatureI
AUTHOR
Timothy J. Parnell, PhD
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112
This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.