NAME
Bio::ToolBox::Parser::ucsc - Parser for UCSC genePred, refFlat, and related formats
SYNOPSIS
use Bio::ToolBox::Parser;
my $filename = 'file.refFlat';
my $Parser = Bio::ToolBox::Parser->new(
    file    => $filename,
    do_gene => 1,
    do_exon => 1,
) or die "unable to open file!\n";
# the Parser will taste the file and open the appropriate
# subclass parser, ucsc in this case
while (my $feature = $Parser->next_top_feature() ) {
    # each $feature is a parent SeqFeature object, usually a gene
    printf "%s:%d-%d\n", $feature->seq_id, $feature->start, $feature->end;
    # subfeatures such as transcripts, exons, etc are nested within
    my @children = $feature->get_SeqFeatures();
}
DESCRIPTION
This is the UCSC-specific parser subclass of the Bio::ToolBox::Parser object, and as such inherits generic methods from the parent.
This is a parser for converting UCSC-style gene prediction flat file formats into BioPerl-style Bio::SeqFeatureI compliant objects, complete with nested objects representing transcripts, exons, CDS, UTRs, start- and stop-codons. Full control is available on what to parse, e.g. exons on, CDS and codons off. Additional gene information can be added by supplying additional tables of information, such as common gene names and descriptions, available from the UCSC repository.
Table formats supported
Supported files are tab-delimited text files obtained from UCSC and described at http://genome.ucsc.edu/FAQ/FAQformat.html#format9. Formats are identified by the number of columns, rather than by file extension, column name headers, or other metadata. Therefore, only unmodified tables should be used to ensure correct parsing. Some errors are reported for incorrect lines. Unmodified files can be downloaded from http://hgdownload.soe.ucsc.edu/downloads.html. Files obtained from the UCSC Table Browser can also be used, with caution. Files may be gzip compressed.
File formats supported include the following.
Gene Prediction (genePred), 10 columns
Gene Prediction with RefSeq gene Name (refFlat), 11 columns
Extended Gene Prediction (genePredExt), 15 columns
Extended Gene Prediction with bin (genePredExt), 16 columns
knownGene table, 12 columns
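Because the format is inferred from the column count alone, it can be useful to check a file before handing it to the parser. The following is a minimal sketch of that idea in plain Perl. The helper guess_ucsc_format and the lookup table are hypothetical and not part of Bio::ToolBox; the column counts are taken from the list above, and the example line is made up for demonstration.

```perl
use strict;
use warnings;

# Hypothetical lookup of column count => format name, from the list above.
my %FORMAT_BY_COLUMNS = (
    10 => 'genePred',
    11 => 'refFlat',
    12 => 'knownGene',
    15 => 'genePredExt',
    16 => 'genePredExt with bin',
);

# Illustrative helper, not part of Bio::ToolBox.
# Returns the format name for one tab-delimited line, or undef when the
# column count is not one of the recognized values.
sub guess_ucsc_format {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;    # strip the line ending
    # a limit of -1 keeps trailing empty fields so they are still counted
    my @fields = split /\t/, $line, -1;
    return $FORMAT_BY_COLUMNS{ scalar @fields };
}

# A made-up refFlat-style line with 11 columns, for demonstration only.
my $example = join "\t", 'GENE1', 'NM_000001', 'chr1', '+', 1000, 5000,
    1200, 4800, 2, '1000,4000,', '2000,5000,';

my $format = guess_ucsc_format($example);
print defined $format ? "$format\n" : "unrecognized column count\n";
```

A table that has gained or lost a column, for example after editing in a spreadsheet, would no longer match any entry, which is why unmodified tables are recommended.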
Supplemental information
The UCSC gene prediction tables include essential information, but not detailed information, such as common gene names, description, protein accession IDs, etc. This additional information can be associated with the genes or transcripts during parsing if the appropriate tables are supplied. These tables can be obtained from the UCSC download site http://hgdownload.soe.ucsc.edu/downloads.html.
Supported tables include the following.
refSeqStatus, for refGene, knownGene, and xenoRefGene tables
refSeqSummary, for refGene, knownGene, and xenoRefGene tables
ensemblToGeneName, for ensGene tables
ensemblSource, for ensGene tables
kgXref, for knownGene tables
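The pairing of supplemental tables with gene tables listed above can be expressed as a simple lookup. This is only a sketch in plain Perl; the hash and the helper supplements_for are hypothetical names and not part of Bio::ToolBox.

```perl
use strict;
use warnings;

# Supplemental table => gene tables it applies to, from the list above.
my %SUPPLEMENT_FOR = (
    refSeqStatus      => [qw(refGene knownGene xenoRefGene)],
    refSeqSummary     => [qw(refGene knownGene xenoRefGene)],
    ensemblToGeneName => [qw(ensGene)],
    ensemblSource     => [qw(ensGene)],
    kgXref            => [qw(knownGene)],
);

# Illustrative helper, not part of Bio::ToolBox.
# Returns the sorted names of supplemental tables usable with a gene table.
sub supplements_for {
    my ($gene_table) = @_;
    my @matches;
    foreach my $supplement ( keys %SUPPLEMENT_FOR ) {
        if ( grep { $_ eq $gene_table } @{ $SUPPLEMENT_FOR{$supplement} } ) {
            push @matches, $supplement;
        }
    }
    my @sorted = sort @matches;
    return @sorted;
}

foreach my $table (qw(refGene knownGene ensGene)) {
    printf "%s: %s\n", $table, join( ', ', supplements_for($table) );
}
```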
Implementation
For an implementation of this module to generate GFF3 formatted files from UCSC data sources, see the Bio::ToolBox script ucsc_table2gff3.pl.
METHODS
Initialize the parser object
In most cases, users should initialize an object using the generic Bio::ToolBox::Parser object.
- new
-
Initiate a UCSC table parser object. Pass a single value (a table file name) to open a table and parse its objects. Alternatively, pass a list of key => value pairs to control how the table is parsed. Options include the following.
- file
- table
-
Provide a file name for a UCSC gene prediction table. The file may be gzip compressed.
- source
-
Pass a string to be added as the source tag value of the SeqFeature objects. The default value is 'UCSC'. If the file name has a recognizable name, such as 'refGene' or 'ensGene', it will be used instead.
- do_gene
-
Pass a boolean (1 or 0) value to combine multiple transcripts with the same gene name under a single gene object. Default is true.
- do_exon
- do_cds
- do_utr
- do_codon
-
Pass a boolean (1 or 0) value to parse certain subfeatures, including exon, CDS, five_prime_UTR, three_prime_UTR, stop_codon, and start_codon features. Default is false.
- do_name
-
Pass a boolean (1 or 0) value to assign names to subfeatures, including exons, CDSs, UTRs, and start and stop codons. Default is false.
-
Pass a boolean (1 or 0) value to recycle shared subfeatures (exons and UTRs) between multiple transcripts of the same gene. This results in reduced memory usage, and smaller exported GFF3 files. Default is true.
- refseqsum
- refseqstat
- kgxref
- ensembltogene
- ensemblsource
-
Pass the appropriate file name for additional information.
- class
-
Pass the name of a Bio::SeqFeatureI compliant class that will be used to create the SeqFeature objects. The default is to use Bio::ToolBox::SeqFeature, which is lighter-weight and consumes less memory. A suitable BioPerl alternative is Bio::SeqFeature::Lite.
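As an illustration of how the options above might be combined, here is a hedged sketch that enables exons and CDS, leaves UTRs and codons off, and attaches a refSeqSummary table. The file names are placeholders rather than files shipped with the module, and the guard around the constructor is only there so that the example does nothing harmful when Bio::ToolBox or the files are absent.

```perl
use strict;
use warnings;

# Option names come from the list above; file names are placeholders.
my %options = (
    file      => 'hg19_refGene.txt.gz',
    source    => 'refGene',
    do_gene   => 1,    # group transcripts under gene objects
    do_exon   => 1,
    do_cds    => 1,
    do_utr    => 0,
    do_codon  => 0,
    refseqsum => 'hg19_refSeqSummary.txt.gz',
);

# Only attempt to parse when the module is installed and the files exist.
if (    eval { require Bio::ToolBox::Parser::ucsc; 1 }
    and -e $options{file}
    and -e $options{refseqsum} )
{
    my $parser = Bio::ToolBox::Parser::ucsc->new(%options)
        or die "unable to open $options{file}\n";
    while ( my $gene = $parser->next_top_feature ) {
        printf "%s:%d-%d\n", $gene->seq_id, $gene->start, $gene->end;
    }
}
else {
    print "Bio::ToolBox or the example files are not available; nothing parsed\n";
}
```

Keeping the options in a hash makes it easy to toggle individual subfeature types without rewriting the constructor call.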
Other methods
See Bio::ToolBox::Parser for generic methods for accessing the features. Below are some methods specific to this subclass.
- load_extra_data($file, $type)
-
my $file = 'hg19_refSeqSummary.txt.gz';
my $success = $ucsc->load_extra_data($file, 'summary');
Pass two values, the file name of the supplemental file and the type of supplemental data. Values can include the following:
refseqstatus or status
refseqsummary or summary
kgxref
ensembltogene or ensname
ensemblsource or enssrc
The number of transcripts with information loaded from the supplemental data file is returned.
- counts
-
This method will return a hash of the number of genes and RNA types that have been parsed.
- typelist
-
This method will return a comma-delimited list of the feature types or primary_tags found in the parsed file. If a file has not yet been parsed, it will return a generic list of expected (typical) feature types. Otherwise, it will return the feature types observed in the parsed file.
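The type argument to load_extra_data described above accepts a long and a short spelling for most tables. A small sketch of checking the type before making the call follows. The helper canonical_extra_type and the alias table are hypothetical and not part of Bio::ToolBox; the spellings are copied from the list of accepted values.

```perl
use strict;
use warnings;

# Accepted spellings mapped to one canonical name, from the list above.
my %TYPE_ALIAS = (
    refseqstatus  => 'refseqstatus',
    status        => 'refseqstatus',
    refseqsummary => 'refseqsummary',
    summary       => 'refseqsummary',
    kgxref        => 'kgxref',
    ensembltogene => 'ensembltogene',
    ensname       => 'ensembltogene',
    ensemblsource => 'ensemblsource',
    enssrc        => 'ensemblsource',
);

# Illustrative helper, not part of Bio::ToolBox.
# Returns the canonical type name, or undef for an unrecognized type.
sub canonical_extra_type {
    my ($type) = @_;
    return unless defined $type;
    return $TYPE_ALIAS{$type};
}

foreach my $type (qw(summary enssrc bogus)) {
    my $canonical = canonical_extra_type($type);
    printf "%s => %s\n", $type, defined $canonical ? $canonical : 'not supported';
}
```

Checking the type first lets a script report an unsupported value clearly instead of relying on the return count alone.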
SEE ALSO
Bio::ToolBox::Parser, Bio::ToolBox::SeqFeature
AUTHOR
Timothy J. Parnell, PhD
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112
This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.