NAME
Bio::ToolBox::parser::ucsc - Parser for UCSC genePred, refFlat, etc formats
SYNOPSIS
use Bio::ToolBox::parser::ucsc;
### A simple transcript parser
my $ucsc = Bio::ToolBox::parser::ucsc->new('file.genePred');
### A full fledged gene parser
my $ucsc = Bio::ToolBox::parser::ucsc->new(
file => 'ensGene.genePred',
do_gene => 1,
do_cds => 1,
do_utr => 1,
ensname => 'ensemblToGene.txt',
enssrc => 'ensemblSource.txt',
);
### Retrieve one transcript line at a time
my $transcript = $ucsc->next_feature;
### Retrieve one assembled gene at a time
my $gene = $ucsc->next_top_feature;
### Retrieve array of all assembled genes
my @genes = $ucsc->top_features;
# Each gene or transcript is a SeqFeatureI compatible object
printf "gene %s is located at %s:%s-%s\n",
$gene->display_name, $gene->seq_id,
$gene->start, $gene->end;
# Multiple transcripts can be assembled into a gene
foreach my $transcript ($gene->get_SeqFeatures) {
# each transcript has exons
foreach my $exon ($transcript->get_SeqFeatures) {
printf "exon is %sbp long\n", $exon->length;
}
}
# Features can be printed in GFF3 format
$gene->version(3);
print STDOUT $gene->gff_string(1);
# the 1 indicates to recurse through all subfeatures
DESCRIPTION
This is a parser for converting UCSC-style gene prediction flat file formats into BioPerl-style Bio::SeqFeatureI compliant objects, complete with nested objects representing transcripts, exons, CDS, UTRs, and start and stop codons. Full control is available over what to parse, e.g. exons on, CDS and codons off. Further gene information, such as common gene names and descriptions, can be added by supplying supplemental tables available from the UCSC repository.
Table formats supported
Supported files are tab-delimited text files obtained from UCSC and described at http://genome.ucsc.edu/FAQ/FAQformat.html#format9. Formats are identified by the number of columns, rather than by specific file extensions, column name headers, or other metadata. Therefore, only unmodified tables should be used to ensure correct parsing. Some errors are reported for incorrect lines. Unadulterated files can safely be downloaded from http://hgdownload.soe.ucsc.edu/downloads.html. Files obtained from the UCSC Table Browser can also be used with caution. Files may be gzip compressed.
File formats supported include the following.
Gene Prediction (genePred), 10 columns
Gene Prediction with RefSeq gene Name (refFlat), 11 columns
Extended Gene Prediction (genePredExt), 15 columns
Extended Gene Prediction with bin (genePredExt), 16 columns
knownGene table, 12 columns
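Because formats are recognized purely by column count, the idea can be illustrated with a minimal standalone sketch. This is not the module's own code: the sub name guess_ucsc_format and the returned labels are hypothetical, and only the column counts come from the list above.

```perl
use strict;
use warnings;

# Column counts are taken from the list above; the labels are illustrative.
my %FORMAT_BY_COLUMNS = (
    10 => 'genePred',
    11 => 'refFlat',
    12 => 'knownGene',
    15 => 'genePredExt',
    16 => 'genePredExt with bin',
);

# Hypothetical helper: guess the table format from one tab-delimited line.
sub guess_ucsc_format {
    my ($line) = @_;
    chomp $line;
    # A limit of -1 keeps trailing empty fields so they are still counted.
    my @fields = split /\t/, $line, -1;
    return $FORMAT_BY_COLUMNS{ scalar @fields };    # undef if unrecognized
}

# A made-up 10-column genePred line.
my $line = join "\t", qw(NM_000001 chr1 + 1000 5000 1200 4800 3),
    '1000,2000,4000,', '1500,2500,5000,';

my $format = guess_ucsc_format($line);
print "Detected format: ", ( $format // 'unknown' ), "\n";
```

A table with a column added or removed would be misidentified or rejected under a scheme like this, which is why unmodified tables are recommended.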
Supplemental information
The UCSC gene prediction tables include essential information, but not detailed information, such as common gene names, description, protein accession IDs, etc. This additional information can be associated with the genes or transcripts during parsing if the appropriate tables are supplied. These tables can be obtained from the UCSC download site http://hgdownload.soe.ucsc.edu/downloads.html.
Supported tables include the following.
refSeqStatus, for refGene, knownGene, and xenoRefGene tables
refSeqSummary, for refGene, knownGene, and xenoRefGene tables
ensemblToGeneName, for ensGene tables
ensemblSource, for ensGene tables
kgXref, for knownGene tables
Implementation
For an implementation of this module to generate GFF3 formatted files from UCSC data sources, see the Bio::ToolBox script ucsc_table2gff3.pl.
METHODS
Initialize the parser object
- new
-
Initiate a UCSC table parser object. Pass a single value (a table file name) to open a table and parse its objects. Alternatively, pass an array of key value pairs to control how the table is parsed. Options include the following.
- file
- table
-
Provide a file name for a UCSC gene prediction table. The file may be gzip compressed.
- source
-
Pass a string to be added as the source tag value of the SeqFeature objects. The default value is 'UCSC'. If the file name has a recognizable name, such as 'refGene' or 'ensGene', it will be used instead.
- do_gene
-
Pass a boolean (1 or 0) value to combine multiple transcripts with the same gene name under a single gene object. Default is true.
- do_cds
- do_utr
- do_codon
-
Pass a boolean (1 or 0) value to parse certain subfeatures. Exon subfeatures are always parsed, but CDS, five_prime_UTR, three_prime_UTR, stop_codon, and start_codon features may be optionally parsed. Default is false.
- do_name
-
Pass a boolean (1 or 0) value to assign names to subfeatures, including exons, CDSs, UTRs, and start and stop codons. Default is false.
-
Pass a boolean (1 or 0) value to recycle shared subfeatures (exons and UTRs) between multiple transcripts of the same gene. This results in reduced memory usage, and smaller exported GFF3 files. Default is true.
- refseqsum
- refseqstat
- kgxref
- ensembltogene
- ensemblsource
-
Pass the appropriate file name for additional information.
- class
-
Pass the name of a Bio::SeqFeatureI compliant class that will be used to create the SeqFeature objects. The default is to use Bio::ToolBox::SeqFeature.
Modify the parser object
These methods set or retrieve parameters, and load supplemental files and new tables.
- source
- do_gene
- do_cds
- do_utr
- do_codon
- do_name
-
These methods retrieve or set parameters to the parsing engine, same as the options to the new method.
- fh
-
Set or retrieve the file handle of the current table. This module uses IO::Handle objects. Be careful manipulating file handles of open tables!
- open_file($file)
-
Pass the name of a new table to parse. Existing gene models loaded in memory, if any, are discarded. Counts are reset to 0. Supplemental tables are not discarded.
- load_extra_data($file, $type)
-
Pass two values: the file name of the supplemental file and the type of supplemental data. The type can be one of the following:
refseqstatus or status
refseqsummary or summary
kgxref
ensembltogene or ensname
ensemblsource or enssrc
The number of transcripts with information loaded from the supplemental data file is returned.
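As a rough usage sketch, supplemental data might be attached to an already opened table as shown below. The file names are placeholders, and loading the supplemental table before retrieving any features is an assumption about the safest ordering rather than a documented requirement.

```perl
use strict;
use warnings;
use Bio::ToolBox::parser::ucsc;

# File names are placeholders; substitute tables downloaded from UCSC.
my $ucsc = Bio::ToolBox::parser::ucsc->new(
    file    => 'refGene.txt.gz',
    do_gene => 1,
);

# Assumption: load the supplemental table before any features are parsed.
my $loaded = $ucsc->load_extra_data( 'refSeqSummary.txt.gz', 'summary' );
print "supplemental data loaded for $loaded transcripts\n";

# Walk through the assembled genes one at a time.
while ( my $gene = $ucsc->next_top_feature ) {
    printf "%s\t%s:%d-%d\n",
        $gene->display_name, $gene->seq_id, $gene->start, $gene->end;
}
```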
Feature retrieval
The following methods parse the table lines into SeqFeature objects. It is best if methods are not mixed; unexpected results may occur.
- next_feature
-
This will read the next line of the table and parse it into a gene or transcript object. However, multiple transcripts from the same gene are not assembled together under the same gene object.
- next_top_feature
-
This method will return all top features (typically genes), with multiple transcripts of the same gene assembled under the same gene object. Transcripts are assembled together if they share the same gene name and the transcripts overlap. If transcripts share the same gene name but do not overlap, they are placed into separate gene objects with the same name but different primary_id tags. Calling this method will parse the entire table into memory (so that multiple transcripts may be assembled), but only one object is returned at a time. Call this method repeatedly using a while loop to get all features.
- top_features
-
This method is similar to next_top_feature(), but instead returns an array of all the top features.
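The assembly rule described for next_top_feature() (same gene name and overlapping coordinates go into one gene; same name without overlap yields separate genes) can be illustrated with a small standalone sketch. It uses plain hashes rather than SeqFeature objects, the sub name assemble_genes and all data are hypothetical, and it is not the module's actual implementation.

```perl
use strict;
use warnings;

# Hypothetical transcripts as plain hashes with 1-based inclusive coordinates.
my @transcripts = (
    { id => 'tx1', gene => 'ABC1', chr => 'chr1', start => 100,  end => 900 },
    { id => 'tx2', gene => 'ABC1', chr => 'chr1', start => 500,  end => 1500 },
    { id => 'tx3', gene => 'ABC1', chr => 'chr1', start => 9000, end => 9800 },
    { id => 'tx4', gene => 'XYZ2', chr => 'chr1', start => 200,  end => 700 },
);

# Group transcripts that share a gene name and chromosome and that overlap.
sub assemble_genes {
    my @sorted = sort {
             $a->{chr} cmp $b->{chr}
          or $a->{gene} cmp $b->{gene}
          or $a->{start} <=> $b->{start}
    } @_;
    my @genes;
    for my $tx (@sorted) {
        my $last = $genes[-1];
        if (    $last
            and $last->{name} eq $tx->{gene}
            and $last->{chr} eq $tx->{chr}
            and $tx->{start} <= $last->{end} )
        {
            # Overlaps the current gene: add the transcript and extend the end.
            push @{ $last->{transcripts} }, $tx->{id};
            $last->{end} = $tx->{end} if $tx->{end} > $last->{end};
        }
        else {
            # New name, new chromosome, or no overlap: start a separate gene.
            push @genes, {
                name        => $tx->{gene},
                chr         => $tx->{chr},
                start       => $tx->{start},
                end         => $tx->{end},
                transcripts => [ $tx->{id} ],
            };
        }
    }
    return @genes;
}

for my $gene ( assemble_genes(@transcripts) ) {
    printf "%s %s:%d-%d (%s)\n", $gene->{name}, $gene->{chr},
        $gene->{start}, $gene->{end}, join( ',', @{ $gene->{transcripts} } );
}
```

In this made-up data, tx1 and tx2 merge into one ABC1 gene, while tx3 shares the name but does not overlap and so becomes a second ABC1 gene.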
Other methods
Additional methods for working with the parser object and the parsed SeqFeature objects.
- parse_table
-
Parses the table into memory. If a table wasn't provided using the new() or open_file() methods, then a filename can be passed to this method and it will automatically be opened for you.
- find_gene
-
Pass a gene name, or an array of key => value pairs (name, display_name, ID, primary_ID, and/or coordinate information), that can be used to find a gene already loaded into memory. This is only reliable if the entire table has been loaded into memory. Genes with a matching name are confirmed by a matching ID or overlapping coordinates, if available. Otherwise the first match is returned.
- counts
-
This method will return a hash of the number of genes and RNA types that have been parsed.
- from_ucsc_string
-
A bare bones method that will convert a tab-delimited text line from a UCSC formatted gene table into a SeqFeature object for you. Don't expect alternate transcripts to be assembled into genes.
- seq_ids
-
Returns an array or array reference of the names of the chromosomes or reference sequences present in the table.
- seq_id_lengths
-
Returns a hash reference to the chromosomes or reference sequences and their corresponding lengths. In this case, the length is inferred by the greatest gene end position.
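The length inference described for seq_id_lengths can be sketched in standalone Perl as follows. The sub name infer_seq_lengths and the sample lines are hypothetical; the sketch assumes plain 10-column genePred lines, where chrom is the second column and txEnd the fifth.

```perl
use strict;
use warnings;

# Made-up 10-column genePred lines: name, chrom, strand, txStart, txEnd, ...
my @lines = (
    join( "\t", 'tx1', 'chr1', '+', 1000, 5000, 1200, 4800, 1, '1000,', '5000,' ),
    join( "\t", 'tx2', 'chr1', '-', 7000, 9500, 7000, 7000, 1, '7000,', '9500,' ),
    join( "\t", 'tx3', 'chr2', '+', 300,  800,  300,  300,  1, '300,',  '800,' ),
);

# Hypothetical helper: record the greatest transcript end seen per chromosome.
sub infer_seq_lengths {
    my %length;
    for my $line (@_) {
        my @fields = split /\t/, $line;
        my ( $chrom, $end ) = @fields[ 1, 4 ];
        $length{$chrom} = $end
            if !exists $length{$chrom} or $end > $length{$chrom};
    }
    return \%length;
}

my $lengths = infer_seq_lengths(@lines);
print "$_\t$lengths->{$_}\n" for sort keys %$lengths;
```

Values obtained this way are only lower bounds on the true sequence lengths, since nothing is known about the sequence beyond the last annotated gene.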
Bio::ToolBox::parser::ucsc::builder
This is a private module that is responsible for building SeqFeature objects from UCSC table lines. It is not intended for general public use.
AUTHOR
Timothy J. Parnell, PhD
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112
This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.