NAME

pull_features.pl

A script to pull out a specific list of data rows from a data file.

SYNOPSIS

pull_features.pl --data <filename> --list <filename> --out <filename>

Options:
--data <filename>
--list <filename>
--out <filename>
--dindex <integer>
--lindex <integer>
--order [list | data]
--sum
--starti <integer>
--stopi <integer>
--log
--version
--help

OPTIONS

The command line flags and descriptions:

--data

Specify a tab-delimited text file as the data source file. One of the columns in the input file should contain the identifiers to be used in the lookup. The file may be gzipped.

--list

Specify the name of a text file containing the list of feature names or identifiers to pull. The file may be a single-column file or a tab-delimited multi-column file with column headers. A .kgg file from a Cluster k-means analysis may also be used.

--out

Specify the output file name.

--dindex <integer>
--lindex <integer>

Specify the index numbers of the columns in the data and list files, respectively, that contain the identifiers used to match features. If these are not specified, the program will attempt to identify appropriate matching columns with the same header name. If no matching columns are found, the user must select interactively from a list of available column names.
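As an illustration of the header-based matching described above, the following Python sketch finds column names shared by the two files. It is not taken from the script itself: the function name shared_columns is hypothetical, and the 0-based indices are an assumption, since the index base expected by --dindex and --lindex is not stated here.

```python
def shared_columns(data_header, list_header):
    """Return (name, data_index, list_index) for every header name
    that appears in both files, in data-file column order.
    Indices are 0-based (an assumption made for this sketch)."""
    list_positions = {name: i for i, name in enumerate(list_header)}
    return [
        (name, d_idx, list_positions[name])
        for d_idx, name in enumerate(data_header)
        if name in list_positions
    ]


# Hypothetical headers from a data file and a list file.
data_header = ["Name", "Type", "Score"]
list_header = ["Cluster", "Name"]
matches = shared_columns(data_header, list_header)
```

An empty result corresponds to the case where no shared header exists and the columns must be chosen some other way, for example interactively.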

--order [list | data]

Optionally specify the order of features in the output file. Specify 'list' to match the order of features in the list file, or 'data' to match the order of features in the data file. The default is 'list'.

--sum

Indicate that the pulled data should be averaged across all features at each position, suitable for graphing. A separate text file with '_summed' appended to the filename will be written.

--starti <integer>

When re-summarizing the pulled data, indicate the start column index that begins the range of datasets to summarize. Defaults to the leftmost column without a standard feature description name.

--stopi <integer>

When re-summarizing the pulled data, indicate the stop column index that ends the range of datasets to summarize. Defaults to the last or rightmost column.

--log

The data is in log2 space. Only necessary when re-summarizing the pulled data.

--version

Print the version number.

--help

Display this POD documentation.

DESCRIPTION

Given a list of requested unique feature identifiers, this program will pull out those features (rows) from a data file and write them to a new file. It is comparable in function to the VLOOKUP command found in popular spreadsheet programs. The list is provided as a separate text file, either as a single-column file or as a multi-column tab-delimited file from which one column is selected. All rows from the source data file that match an identifier in the list will be written to the new file. The order of the features in the output file may match either the list file or the data file.
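The lookup can be pictured with a short Python sketch. This is a minimal illustration under stated assumptions, not the script's own code: the function pull_rows, its parameters, and the sample identifiers are all hypothetical, the data is assumed to have a single header row, and column indices are 0-based.

```python
import csv
import io


def pull_rows(data_file, wanted_ids, dindex=0, order="list"):
    """Return (header, rows) for data rows whose identifier column
    matches one of wanted_ids. 'order' mirrors the list/data choice."""
    reader = csv.reader(data_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    wanted = set(wanted_ids)
    # Rows come out in data-file order at this point.
    matched = [row for row in reader if row[dindex] in wanted]
    if order == "list":
        # Regroup by identifier, then walk the list in its own order.
        by_id = {}
        for row in matched:
            by_id.setdefault(row[dindex], []).append(row)
        ordered_ids = dict.fromkeys(wanted_ids)  # drop repeats, keep order
        matched = [row for ident in ordered_ids for row in by_id.get(ident, [])]
    return header, matched


# Hypothetical tab-delimited data and identifier list.
data = "Name\tScore\nYAL001C\t1.5\nYAL002W\t2.0\nYAL003W\t0.7\n"
ids = ["YAL003W", "YAL001C"]
header, list_order = pull_rows(io.StringIO(data), ids, order="list")
_, data_order = pull_rows(io.StringIO(data), ids, order="data")
```

With these inputs, list_order starts with the YAL003W row because that identifier comes first in the list, while data_order starts with the YAL001C row because it comes first in the data.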

The program will also accept a Cluster gene file (with .kgg extension) as a list file. In this case, the genes for each cluster are written to separate files, with the cluster number appended to the output file name.
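A rough Python sketch of the per-cluster grouping follows. It assumes that a .kgg file is a two-column tab-delimited file with one header line, the gene identifier in the first column and the cluster number in the second. That layout, the function name group_by_cluster, and the sample genes are assumptions made for illustration. How the script names the per-cluster output files is not shown.

```python
from collections import defaultdict


def group_by_cluster(kgg_lines):
    """Map each cluster number (kept as a string) to its gene
    identifiers, preserving the order in which genes appear."""
    clusters = defaultdict(list)
    lines = iter(kgg_lines)
    next(lines, None)  # skip the header line (assumed present)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        gene, cluster = line.split("\t")[:2]
        clusters[cluster].append(gene)
    return dict(clusters)


# Hypothetical .kgg content.
kgg = ["ORF\tGROUP\n", "YAL001C\t0\n", "YAL002W\t1\n", "YAL003W\t0\n"]
groups = group_by_cluster(kgg)
```

Each value in groups can then serve as the identifier list for one lookup, giving one output file per cluster.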

The program will optionally regenerate a summed data file, in which values in the specified data columns are averaged and written out as rows in a separate data file. Compare this function to the summary option in the biotoolbox scripts get_relative_data.pl or average_gene.pl.
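The averaging step might look like the following Python sketch. Several details are assumptions rather than documented behaviour: the function name summarize is hypothetical, the column range is treated as 0-based and inclusive, every value is assumed to be numeric, and log2 data is assumed to be converted to linear space before averaging and back to log2 afterwards.

```python
import math


def summarize(rows, starti, stopi, log2=False):
    """Return the mean of each column from starti to stopi (inclusive)
    across all rows. Requires at least one row and no missing values."""
    means = []
    for col in range(starti, stopi + 1):
        values = [float(row[col]) for row in rows]
        if log2:
            # Assumed handling: average in linear space, not log space.
            values = [2 ** v for v in values]
        mean = sum(values) / len(values)
        means.append(math.log2(mean) if log2 else mean)
    return means


# Hypothetical pulled rows: one identifier column, two data columns.
rows = [["geneA", "1.0", "3.0"], ["geneB", "3.0", "5.0"]]
linear_means = summarize(rows, 1, 2)
log_means = summarize(rows, 1, 2, log2=True)
```

Here linear_means is [2.0, 4.0]. The log2 variant gives log2(5) and log2(20), which are larger than the plain means because averaging in linear space weights high values more heavily.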

AUTHOR

Timothy J. Parnell, PhD
Howard Hughes Medical Institute
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.