NAME

average_gene.pl

A program to generate class average summaries for a list of genes or features.

SYNOPSIS

average_gene.pl --db <database> --feature <text> --out <file> [--options]

average_gene.pl --in <file> --out <file> [--options]
 
 Options:
 --in <filename> 
 --out <filename>
 --db <name | filename>
 --ddb <name | filename>
 --feature <type | type:source | alias>,...
 --data <dataset_name | filename>
 --method [mean|median|stddev|min|max|range|sum|rpm|rpkm]
 --value [score|count|length]
 --bins <integer>
 --ext <integer>
 --extsize <integer>
 --min <integer>
 --strand [all|sense|antisense]
 --long
 --sum
 --smooth
 --force_strand
 --(no)log
 --gz
 --cpu <integer>
 --version
 --help

OPTIONS

The command line flags and descriptions:

--in <filename>

Specify an input file containing either a list of database features or genomic coordinates for which to collect data. The file should be a tab-delimited text file, one row per feature, with columns representing feature identifiers, attributes, coordinates, and/or data values. The first row should be column headers. Bed files are acceptable, as are text files generated with this program.

--out <filename>

Specify the output file name.

--db <name | filename>

Specify the name of a BioPerl database from which to obtain the annotation, chromosomal information, and/or data. Typically a Bio::DB::SeqFeature::Store database schema is used, either from a relational database, SQLite file, or a single GFF3 file to be loaded into memory. Alternatively, a BigWigSet directory, or a single BigWig, BigBed, or Bam file may be specified.

A database is required for generating new files. When generating a new genome interval file, a bigFile or Bam file listed as a data source will be adopted as the database.

For input files, the database name may be obtained from the file metadata. A different database may be specified from that listed in the metadata when a different source is desired.

--ddb <name | filename>

Optionally specify the name of an alternate data database from which the data should be collected, separate from the primary annotation database. The same options apply as to the --db option.

--feature <type | type:source | alias>,...

Specify the type of feature from which to collect values. This is required only for new feature tables. Three types of values may be passed: the feature type, feature type and source expressed as 'type:source', or an alias to one or more feature types. Aliases are specified in the biotoolbox.cfg file and provide a shortcut to a list of one or more database features. More than one feature may be included as a comma-delimited list (no spaces).
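As an illustration of the comma-delimited format described above, the following minimal Python sketch splits a --feature value into type and source parts. The function name and the example value "gene:SGD,ncRNA" are hypothetical, and the sketch does not resolve aliases from biotoolbox.cfg; it is not the program's own parser.

```python
def parse_features(spec):
    """Split a comma-delimited feature list into (type, source) pairs.

    The source is None when only a type (or an alias) is given.
    """
    parsed = []
    for item in spec.split(","):
        # 'type:source' is split at the first colon; a bare type has no source
        ftype, _, source = item.partition(":")
        parsed.append((ftype, source or None))
    return parsed


# Hypothetical example value, not taken from any real database
features = parse_features("gene:SGD,ncRNA")
```

Here features holds one pair for each requested feature, with None marking a missing source.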

--data <dataset_name | filename>

Provide the name of the dataset from which to collect the values. If no dataset is specified on the command line, then the program will interactively present a list of datasets from the database from which to select.

The dataset may be a feature type in a BioPerl Bio::DB::SeqFeature::Store or Bio::DB::BigWigSet database. Provide either the feature type or type:source. The feature may point to another data file whose path is stored in the feature's attribute tag (for example a binary Bio::Graphics::Wiggle .wib file, a bigWig file, or Bam file), or the features' scores may be used in data collection.

Alternatively, the dataset may be a database file, including bigWig (.bw), bigBed (.bb), or Bam alignment (.bam) files. The files may be local or remote (specified with a http: or ftp: prefix).

--method [mean|median|stddev|min|max|range|sum|rpm|rpkm]

Specify the method for combining all of the dataset values within the genomic region of the feature. Accepted values include:

- mean        (default)
- median
- sum
- stddev      Standard deviation of the population (within the region)
- min
- max
- range       Difference between the maximum and minimum values
- rpm         Reads Per Million mapped, for Bam and BigBed only
- rpkm        Same as rpm but normalized for gene length in kb
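The combining methods listed above can be sketched in Python roughly as follows. This is an illustrative approximation, not the program's implementation: the function names are hypothetical, range is assumed to mean maximum minus minimum, and the rpm and rpkm helpers assume the total number of mapped reads is already known.

```python
import statistics


def combine(values, method="mean"):
    """Combine the values collected within one region into a single number."""
    if not values:
        return None  # no data in the region gives a null value
    if method == "mean":
        return statistics.mean(values)
    if method == "median":
        return statistics.median(values)
    if method == "sum":
        return sum(values)
    if method == "stddev":
        return statistics.pstdev(values)  # population standard deviation
    if method == "min":
        return min(values)
    if method == "max":
        return max(values)
    if method == "range":
        return max(values) - min(values)  # assumed definition
    raise ValueError(f"unknown method: {method}")


def rpm(count, total_mapped):
    """Reads per million mapped reads."""
    return count * 1_000_000 / total_mapped


def rpkm(count, total_mapped, length_bp):
    """rpm normalized by region length in kilobases."""
    return rpm(count, total_mapped) / (length_bp / 1000)
```

For example, 50 alignments in a 2000 bp region with 10 million mapped reads in total give an rpm of 5 and an rpkm of 2.5.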

--value [score|count|length]

Optionally specify the type of data value to collect from the dataset or data file. Three values are accepted: score, count, or length. The default value type is score. Note that some data sources only support certain types of data values. Wig and BigWig files only support score and count; BigBed and database features support count and length and optionally score; Bam files support basepair coverage (score), count (number of alignments), and length.

--bins <integer>

Specify the number of bins that will be generated over the length of the feature. The size of each bin is a percentage of the feature length. The default number is 10, which results in bins of size equal to 10% of the feature length.
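To make the percentage-based bin size concrete, here is a minimal Python sketch that divides a feature into equal bins. The function name, the 1-based inclusive coordinates, and the rounding of bin edges are assumptions made for illustration; the program may compute boundaries differently.

```python
def feature_bins(start, end, bins=10):
    """Return (bin_start, bin_end) pairs, 1-based inclusive, covering start..end."""
    length = end - start + 1
    size = length / bins  # each bin spans 100/bins percent of the feature
    windows = []
    for i in range(bins):
        # Round shared edges once so consecutive bins stay contiguous
        bin_start = start + round(i * size)
        bin_end = start + round((i + 1) * size) - 1
        windows.append((bin_start, bin_end))
    return windows


# A 1000 bp feature with the default 10 bins gives 100 bp bins
windows = feature_bins(1001, 2000, bins=10)
```

With these inputs the first bin is 1001 to 1100 and the last is 1901 to 2000. A feature shorter than the number of bins would yield empty bins in this sketch, which is one reason a minimum feature size can be useful.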

--ext <integer>

Specify the number of extended bins on either side of the feature. The bins are of the same size as determined by the feature length and the --bins value. The default is 0.

--extsize <integer>

Specify the exact bin size in bp of the extended bins rather than using a percentage of the feature length.
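The interaction between --bins, --ext, and --extsize can be sketched in Python as below. The function and parameter names are hypothetical, coordinates are assumed to be 1-based and inclusive, and the sketch ignores strand (so "left" and "right" are in genomic order) and does not clip at chromosome ends.

```python
def extended_bins(start, end, bins=10, ext=0, extsize=None):
    """Return (left, right) lists of flanking bins around the feature start..end."""
    length = end - start + 1
    # Use the fixed size if given, otherwise the same size as the feature's own bins
    size = extsize if extsize is not None else length / bins
    left = [
        (start - round((ext - i) * size), start - round((ext - i - 1) * size) - 1)
        for i in range(ext)
    ]
    right = [
        (end + round(i * size) + 1, end + round((i + 1) * size))
        for i in range(ext)
    ]
    return left, right


# Two extended bins per side, sized from the feature (100 bp each)
left, right = extended_bins(1001, 2000, bins=10, ext=2)
# Two extended bins per side with an exact size of 50 bp
left50, right50 = extended_bins(1001, 2000, bins=10, ext=2, extsize=50)
```

In the first call the flanking bins are 100 bp wide because they inherit the feature's bin size; in the second they are 50 bp wide regardless of feature length.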

--min <integer>

Specify the minimum feature size to be averaged. Features with a length below this value will be skipped (all of their bins will have null values). This is to avoid having bin sizes below the average microarray tiling distance. The default is undefined (no limit).

--strand [all|sense|antisense]

Specify whether stranded data should be collected. Three values are allowed: all (collect all datasets regardless of strand; the default), sense (collect only sense datasets), or antisense (collect only antisense datasets).

--long

Indicate that the dataset from which scores are collected consists of long features (counting genomic annotation, for example) rather than point data (microarray data or sequence coverage). Normally long features are only recorded at their midpoint, leading to inaccurate representation at some windows. This option forces the program to collect data separately at each window, rather than once for each file feature or region and subsequently assigning scores to windows. Execution times may be longer than otherwise. The default is false.

--sum

Indicate that the data should be averaged across all features at each position, suitable for graphing. A separate text file will be written with the suffix '_summed' with the averaged data. The default is false.
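The summed profile described above amounts to a column-wise average over the per-feature bin values. The following Python sketch shows the idea under two assumptions that are not confirmed by this text: null bins (represented here as None) are skipped, and the arithmetic mean is used. The function name is hypothetical.

```python
def summed_profile(rows):
    """Average each bin position across all features, skipping null bins."""
    profile = []
    for column in zip(*rows):  # one column per bin position
        present = [value for value in column if value is not None]
        profile.append(sum(present) / len(present) if present else None)
    return profile


# Three features with three bins each; None marks a bin without data
rows = [
    [1.0, 2.0, None],
    [3.0, None, None],
    [5.0, 4.0, None],
]
profile = summed_profile(rows)
```

The result has one value per bin position, which is the form needed for plotting an average profile across the gene body.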

--smooth

Indicate that windows without values should be interpolated from neighboring values. The default is false.
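One way to picture the interpolation is the Python sketch below. It assumes linear interpolation between the nearest non-null neighbors on each side and leaves gaps at either end untouched; the program's actual scheme is not described here and may differ. The function name is hypothetical.

```python
def interpolate_gaps(values):
    """Fill runs of None with values interpolated linearly between their neighbors."""
    filled = list(values)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is None:
            # Find the end of this run of missing windows
            j = i
            while j < n and filled[j] is None:
                j += 1
            # Only fill when there is a real value on both sides of the gap
            if i > 0 and j < n:
                left_value, right_value = filled[i - 1], filled[j]
                span = j - (i - 1)
                for k in range(i, j):
                    step = (k - (i - 1)) / span
                    filled[k] = left_value + (right_value - left_value) * step
            i = j
        else:
            i += 1
    return filled


smoothed = interpolate_gaps([None, 2.0, None, None, 5.0, None])
```

In this example the two interior gaps become 3.0 and 4.0, while the leading and trailing nulls remain because they have a neighbor on only one side.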

--force_strand

For features that are not inherently stranded (strand value of 0), or on which you want to impose a different strand, set this option when collecting stranded data. This will reassign the specified strand for each feature regardless of its original orientation. This requires the presence of a "strand" column in the input data file. This option only works with input file lists of database features, not defined genomic regions (e.g. BED files). The default is false.

--(no)log

Dataset values are (not) in log2 space and should be treated accordingly. Output values will be in the same space.
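A common way to treat log2 values "accordingly" is to convert them to linear space before combining and convert the result back afterwards. The Python sketch below shows that approach for a mean; it is an assumption about what the handling involves, not a description of the program's code, and the function name is hypothetical.

```python
import math


def mean_log2(values):
    """Average log2-space values in linear space and return the result in log2 space."""
    linear = [2 ** value for value in values]  # undo the log2 transform
    return math.log2(sum(linear) / len(linear))  # average, then return to log2


# Log2 values 1 and 3 are 2 and 8 in linear space, with a linear mean of 5
result = mean_log2([1.0, 3.0])
```

Note that the result is log2(5), about 2.32, rather than 2.0, which is what averaging the log2 values directly would give.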

--gz

Specify whether (or not) the output file should be compressed with gzip.

--cpu <integer>

Specify the number of CPU cores to use for parallel execution. This requires the installation of Parallel::ForkManager. With support enabled, the default is 2. Disable parallel execution by setting this to 1.

--version

Print the version number.

--help

This help text.

DESCRIPTION

This program will collect data across a gene or feature body into numerous percentile bins. It is used to determine if there is a spatial distribution preference for the dataset over gene bodies. The number of bins may be specified as a command argument (default 10). Additionally, extra bins may be extended on either side of the gene (default 0 on either side). The bin size is determined as a percentage of gene length.

The program writes out a tim data formatted text file. It will also optionally generate a "summed" or average profile for all of the features.

AUTHOR

Timothy J. Parnell, PhD
Howard Hughes Medical Institute
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the GPL (either version 1, or at your option, any later version) or the Artistic License 2.0.