NAME

get_gene_regions.pl

A script to collect specific, often un-annotated regions from genes.

SYNOPSIS

get_gene_regions.pl [--options...] --db <text> --out <filename>

get_gene_regions.pl [--options...] --in <filename> --out <filename>

Options:
--db <text>
--in <filename>
--out <filename> 
--feature <type | type:source>
--transcript [all|mRNA|miRNA|ncRNA|snRNA|snoRNA|tRNA|rRNA]
--region [tss|tts|firstExon|lastExon|splice|intron|firstIntron|lastIntron]
--start=<integer>
--stop=<integer>
--unique
--slop <integer>
--bed
--gz
--version
--help

OPTIONS

The command line flags and descriptions:

--db <text>

Specify the name of a Bio::DB::SeqFeature::Store annotation database from which gene or feature annotation may be derived. A database is required for generating new data files with features. For more information about using annotation databases, see https://code.google.com/p/biotoolbox/wiki/WorkingWithDatabases. Also see --in as an alternative.

--in <filename>

As an alternative to a database, a GFF3 annotation file may be provided. For best results, the database or file should include hierarchical parent-child annotation in the form of gene -> mRNA -> [exon or CDS]. The GFF3 file may be gzipped.

--out <filename>

Specify the output filename.

--feature <type | type:source>

Specify the parental gene feature type (primary_tag) or type:source when using a database. If not specified, a list of available types will be presented interactively to the user for selection. This is not relevant for GFF3 source files (all gene or transcript features are considered). Helpful when gene annotation from multiple sources is listed in the same database, e.g. refSeq and ensembl sources.

--transcript [all|mRNA|miRNA|ncRNA|snRNA|snoRNA|tRNA|rRNA]

Specify the transcript feature or gene subfeature type from which to collect the regions. Multiple types may be specified as a comma-delimited list, or 'all' may be specified. The default value is mRNA.

--region [tss|tts|firstExon|lastExon|splice|intron|firstIntron|lastIntron]

Specify the type of region to retrieve. If not specified on the command line, the list is presented interactively to the user for selection. Eight region types are available.

tss         The first base of transcription
tts         The last base of transcription
firstExon   The first exon of each transcript
lastExon    The last exon of each transcript
splice      The first and last base of each intron
intron      Each intron (usually not defined in the GFF3)
firstIntron The first intron of each transcript
lastIntron  The last intron of each transcript
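
The region types listed above can all be derived from a transcript's coordinates, strand, and exons. The following is a minimal Python sketch of that derivation, not the script's actual code. It assumes 1-based inclusive coordinates as in GFF3; the function name derive_regions and its return layout are hypothetical.

```python
def derive_regions(tx_start, tx_stop, strand, exons):
    """Derive region types from one transcript (illustrative sketch only).

    exons: list of (start, stop) tuples, 1-based inclusive, in any order.
    strand: "+" or "-".
    """
    exons = sorted(exons)
    # Introns are the gaps between consecutive exons in coordinate order.
    introns = [
        (left_stop + 1, right_start - 1)
        for (_, left_stop), (right_start, _) in zip(exons, exons[1:])
        if right_start - left_stop > 1
    ]
    if strand == "-":
        # On the reverse strand, 5' to 3' order runs from high to low coordinates.
        exons = exons[::-1]
        introns = introns[::-1]
        tss, tts = tx_stop, tx_start
    else:
        tss, tts = tx_start, tx_stop
    # Splice sites: the first and last base of each intron.
    splice = [pos for intron in introns for pos in intron]
    return {
        "tss": tss,
        "tts": tts,
        "firstExon": exons[0] if exons else None,
        "lastExon": exons[-1] if exons else None,
        "splice": splice,
        "intron": introns,
        "firstIntron": introns[0] if introns else None,
        "lastIntron": introns[-1] if introns else None,
    }
```

For a reverse-strand transcript, the "first" exon and intron are those with the highest coordinates, which is why the sketch reverses both lists before picking the first and last elements.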

--start=<integer>

--stop=<integer>

Optionally specify values to adjust the reported start and end coordinates of the collected regions. A negative value shifts the coordinate upstream (5' direction), and a positive value shifts it downstream. Adjustments are made relative to the feature's strand, such that a start adjustment always modifies the feature's 5' end, which is either the feature's start point or end point, depending on its orientation.
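
The strand-relative behavior can be sketched as follows. This is an illustration of the rule described above under the assumption of 1-based coordinates with start <= stop; the function name adjust_region is hypothetical and this is not the script's own code.

```python
def adjust_region(start, stop, strand, start_adj=0, stop_adj=0):
    """Apply strand-aware adjustments to a region (illustrative sketch only).

    start_adj modifies the 5' end and stop_adj modifies the 3' end.
    Negative values move upstream, positive values move downstream.
    """
    if strand == "-":
        # Reverse strand: the 5' end is the higher coordinate, and
        # "downstream" means decreasing coordinates.
        new_stop = stop - start_adj
        new_start = start - stop_adj
    else:
        # Forward (or unstranded) features: coordinates increase downstream.
        new_start = start + start_adj
        new_stop = stop + stop_adj
    return new_start, new_stop
```

For example, a start adjustment of -200 and a stop adjustment of 50 applied to a TSS at position 1000 gives 800 to 1050 on the forward strand and 950 to 1200 on the reverse strand. In both cases the window covers 200 bp upstream and 50 bp downstream of the TSS.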

--unique

For gene features only, take only the unique regions. Useful when multiple alternative transcripts are defined for a single gene.

--slop <integer>

When identifying unique regions, specify the number of bp to add to and subtract from the start position of each region (the slop or fudge factor) when considering duplicates. Any other region whose start falls within this window will be considered a duplicate. Useful, for example, when start sites of transcription are not precisely mapped, but not useful with defined introns and exons. Only regions from the current gene are compared; transcripts from other genes are not taken into consideration. The default is 0 (no sloppiness).
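
A minimal sketch of the slop comparison for single-base regions such as transcription start sites within one gene. The function name unique_positions is hypothetical, and keeping the first position encountered is an assumption; the documentation does not state which member of a group of duplicates the script retains.

```python
def unique_positions(positions, slop=0):
    """Filter near-duplicate positions within one gene (illustrative sketch only).

    A position is treated as a duplicate if it lies within +/- slop bp of a
    position that has already been kept. Assumption: the first one seen wins.
    """
    kept = []
    for pos in positions:
        if any(abs(pos - other) <= slop for other in kept):
            continue  # falls inside the window of an already-kept position
        kept.append(pos)
    return kept
```

With a slop of 0, only exact duplicates are removed. With a slop of 5, start sites at 1000 and 1003 collapse into one region, while a start site at 1050 is kept separately.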

--bed

Automatically convert the output file to a BED file.
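
For reference, BED uses 0-based, half-open coordinates, so converting a 1-based inclusive region means subtracting one from the start and leaving the end unchanged. The sketch below illustrates that conversion only. It assumes the collected regions are 1-based inclusive (as in GFF3) and that strand is given as "+" or "-"; the function name to_bed_line and the default score of 0 are hypothetical and do not describe the script's actual output.

```python
def to_bed_line(chrom, start, stop, strand, name, score=0):
    """Format one 1-based inclusive region as a 6-column BED line (sketch only)."""
    # BED start is 0-based; BED end is exclusive, so it equals the 1-based stop.
    fields = [chrom, str(start - 1), str(stop), name, str(score), strand]
    return "\t".join(fields)
```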

--gz

Specify whether (or not) the output file should be compressed with gzip.

--version

Print the version number.

--help

Display this POD documentation.

DESCRIPTION

This program will collect specific regions from annotated genes and/or transcripts. Often these regions are not explicitly defined in the source GFF3 annotation, necessitating a script to pull them out. These regions include the start and stop sites of transcription, introns, the splice sites (both 5' and 3'), and the first and last exons. Importantly, the output may be restricted to unique regions, which is especially useful when a single gene has multiple alternative transcripts. A slop factor is included for imprecise annotation.

The program will report the chromosome, start and stop coordinates, strand, name, and parent and transcript names for each region identified. The reported start and stop sites may be adjusted with modifiers. A standard biotoolbox data-formatted text file is generated. This may be converted into a standard BED or GFF file using the appropriate biotoolbox scripts. The file may also be used directly in data collection.

AUTHOR

Timothy J. Parnell, PhD
Howard Hughes Medical Institute
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the GPL (either version 1, or at your option, any later version) or the Artistic License 2.0.
