NAME

data2wig.pl

A script to convert a generic data file into a wig file.

SYNOPSIS

data2wig.pl [--options...] <filename>

Options:
--in <filename>
--out <filename> 
--step [fixed | variable | bed]
--bed | --bdg
--size <integer>
--span <integer>
--index | --score <column_index>
--chr <column_index>
--start | --pos <column_index>
--stop | --end <column_index>
--name <text>
--(no)track
--mid
--inter | --zero
--format [0 | 1 | 2 | 3]
--method [mean | median | sum | max]
--log
--bigwig | --bw
--chromof <filename>
--db <database>
--bwapp </path/to/wigToBigWig>
--keep
--gz
--version
--help

OPTIONS

The command line flags and descriptions:

--in <filename>

Specify an input file containing either a list of database features or genomic coordinates for which to collect data. The file should be a tab-delimited text file, one row per feature, with columns representing feature identifiers, attributes, coordinates, and/or data values. Genome coordinates are required. The first row should be column headers. Text files generated by other BioToolBox scripts are acceptable. Files may be gzip compressed.

--out <filename>

Optionally specify the name of the output file. The track name is used as default. The '.wig' extension is automatically added if required.

--step [fixed | variable | bed]

The type of step progression for the wig file. Three wig formats are available:

  - fixedStep: where data points are positioned at equal distances along the chromosome.
  - variableStep: where data points are variably positioned along the chromosome.
  - bed (bedGraph): where scores are associated with intervals defined by start and stop coordinates.

The fixedStep wig file has one column of data (score), the variableStep wig file has two columns (position and score), and the bedGraph has four columns of data (chromosome, start, stop, score). If the option is not defined, then the format is automatically determined from the metadata of the file.
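To make the column counts above concrete, here is a minimal Python sketch of how the three layouts look as text, following the UCSC wiggle and bedGraph descriptions. The function names are hypothetical and are not part of data2wig.pl; the script itself is written in Perl and may build its lines differently.

```python
# Illustrative only: these helpers are hypothetical, not data2wig.pl code.

def fixed_step(chrom, start, step, scores, span=1):
    """One declaration line, then one score per line (one column)."""
    lines = [f"fixedStep chrom={chrom} start={start} step={step} span={span}"]
    lines += [str(score) for score in scores]
    return lines

def variable_step(chrom, points, span=1):
    """One declaration line, then 'position score' pairs (two columns)."""
    lines = [f"variableStep chrom={chrom} span={span}"]
    lines += [f"{pos} {score}" for pos, score in points]
    return lines

def bedgraph(intervals):
    """Four tab-delimited columns: chromosome, start, stop, score."""
    return [f"{c}\t{start}\t{stop}\t{score}" for c, start, stop, score in intervals]

# Toy values, chosen only to show the layout.
print("\n".join(fixed_step("chr1", 1, 10, [0.5, 1.2, 0.9])))
print("\n".join(variable_step("chr1", [(1, 0.5), (25, 1.2)])))
print("\n".join(bedgraph([("chr1", 0, 10, 0.5), ("chr1", 10, 40, 1.2)])))
```

The fixedStep form is the most compact but only works when every data point is the same distance from the previous one, which is why the step size must be known before it can be chosen.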

--bed
--bdg

Convenience option to specify a bedGraph file should be written. Same as specifying --step=bed.

--size <integer>

Optionally define the step size in bp for 'fixedStep' wig file. This value is automatically determined from the table's metadata, if available. If the --step option is explicitly defined as 'fixed', then the step size may also be explicitly defined. If this value is not explicitly defined or automatically determined, the variableStep format is used by default.

--span <integer>

Optionally indicate the size of the region in bp to which the data value should be assigned. The same size is assigned to all data values in the wig file. This is useful, for example, with microarray data where all of the oligo probes are the same length and you wish to assign the value across the oligo rather than the midpoint. The default is inherently 1 bp.

--index <column_index>
--score <column_index>

Indicate the column index (0-based) of the dataset in the data table to be used for the score. If a GFF file is used as input, the score column is automatically selected. If not defined as an option, then the program will interactively ask the user for the column index from a list of available columns.

--chr <column_index>

Optionally specify the column index (0-based) of the chromosome or sequence identifier. This is required to generate the wig file. It may be identified automatically from the column header names.

--start <column_index>
--pos <column_index>

Optionally specify the column index (0-based) of the start or chromosome position. This is required to generate the wig file. It may be identified automatically from the column header names.

--stop <column_index>
--end <column_index>

Optionally specify the column index (0-based) of the stop or end position. It may be identified automatically from the column header names.

--name <text>

The name of the track defined in the wig file. The default is to use the name of the chosen score column, or, if the input file is a GFF file, the base name of the input file.

--(no)track

Do (not) include the track line at the beginning of the wig file. Wig files normally require a track line, but if you will be converting to the binary bigwig format, the converter requires no track line. Why it can't simply ignore the line is beyond me. This option is automatically set to false when the --bigwig option is enabled.

--mid

A boolean value to indicate whether the midpoint between the actual 'start' and 'stop' values should be used. The default is to use only the 'start' position.

--zero
--inter

Indicate that the source data uses the interbase (0-based) coordinate system. The start position is shifted to the base (1-based) coordinate system. Wig files are by definition 1-based. This is automatically enabled when converting from Bed or BedGraph files. Default is false.
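The effect of the --mid and --zero options on the reported position can be sketched in a few lines of Python. This is an assumption-laden illustration: the function name is hypothetical, and both the order of operations (shift first, then take the midpoint) and the rounding (integer floor) are guesses rather than documented behavior of data2wig.pl.

```python
# Hypothetical helper; not taken from the data2wig.pl source.

def wig_position(start, stop=None, use_mid=False, interbase=False):
    """Return the 1-based position to write for one data row."""
    if interbase:
        # Interbase (0-based) start becomes a 1-based start.
        start += 1
    if use_mid and stop is not None:
        # Midpoint of the region; floor rounding is an assumption.
        return (start + stop) // 2
    # Default: report only the start position.
    return start

# A Bed-style row covering interbase coordinates 100..150:
print(wig_position(100, 150, interbase=True))                # start only
print(wig_position(100, 150, use_mid=True, interbase=True))  # midpoint
```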

--format [0 | 1 | 2 | 3]

Indicate the number of decimal places to which the score value should be formatted. Acceptable values include 0, 1, 2, or 3 places. The default is to not format the score value.
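As a rough illustration of what fixed-decimal formatting does to a score, the following Python sketch applies the four accepted values. The helper name is hypothetical, and the exact rounding rule used by data2wig.pl for values that fall exactly on a tie is not stated here, so ties are avoided in the example.

```python
# Hypothetical helper showing the effect of --format; not the script's code.

def format_score(score, places=None):
    """Format a score to a fixed number of decimal places, or leave it alone."""
    if places is None:
        # No --format given: the value is written unchanged.
        return str(score)
    if places not in (0, 1, 2, 3):
        raise ValueError("places must be 0, 1, 2, or 3")
    return f"{score:.{places}f}"

for places in (None, 0, 1, 2, 3):
    print(places, format_score(1.23456, places))
```

Fewer decimal places give smaller text files at the cost of precision, which matters less once the file is converted to bigWig.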

--method [mean | median | sum | max]

Define the method used to combine multiple data values at a single position. Wig files do not tolerate multiple identical positions.

--log

If multiple data values need to be combined at a single identical position, indicate whether the data is in log2 space or not. This affects the mathematics behind the combination method.
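The interaction between --method and --log can be sketched as follows. This Python example assumes that log2 values are converted to linear space, combined, and converted back to log2; that is one common way to handle log data and is stated here as an assumption, not as the confirmed behavior of data2wig.pl. The function names are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, median

# The four methods listed for --method.
METHODS = {"mean": mean, "median": median, "sum": sum, "max": max}

def combine(values, method, log2=False):
    """Combine several scores into one value."""
    func = METHODS[method]
    if log2:
        # Assumed handling: de-log, combine, then re-log.
        return math.log2(func([2 ** v for v in values]))
    return func(values)

def collapse(points, method, log2=False):
    """Merge (position, score) pairs that share a position."""
    grouped = defaultdict(list)
    for pos, score in points:
        grouped[pos].append(score)
    return [(pos, combine(vals, method, log2)) for pos, vals in sorted(grouped.items())]

# Two probes at position 10, one at position 20 (toy values).
points = [(10, 1.0), (10, 3.0), (20, 5.0)]
print(collapse(points, "mean"))
print(collapse(points, "mean", log2=True))
```

Under this assumption the mean of the log2 values 1 and 3 is log2(5) rather than 2, which shows why the flag changes the result.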

--bigwig
--bw

Indicate that a binary BigWig file should be generated instead of a text wiggle file. A .wig file is first generated, then converted to a .bw file, and then the .wig file is removed.

--chromof <filename>

When converting to a BigWig file, provide a two-column tab-delimited text file containing the chromosome names and their lengths in bp. Alternatively, provide a name of a database, below.
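For reference, a chromosome file of this kind has one chromosome per line, with the name and the length in bp separated by a tab. The Python sketch below reads such a file into a dictionary; the function name and the toy lengths are made up for illustration, and skipping blank lines is a convenience of the sketch rather than a documented rule.

```python
import io

def read_chrom_sizes(handle):
    """Parse 'name<TAB>length' lines into a {name: length} dictionary."""
    sizes = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines (assumption of this sketch)
        name, length = line.split("\t")[:2]
        sizes[name] = int(length)
    return sizes

# Toy content standing in for a real chromosome file.
example = io.StringIO("chr1\t1000\nchr2\t2500\n")
print(read_chrom_sizes(example))
```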

--db <database>

Specify the name of a Bio::DB::SeqFeature::Store annotation database or other indexed data file, e.g. Bam or bigWig file, from which chromosome length information may be obtained. For more information about using databases, see https://code.google.com/p/biotoolbox/wiki/WorkingWithDatabases. It may be supplied from the input file metadata.

--bwapp </path/to/wigToBigWig>

Specify the path to the UCSC wigToBigWig or bedGraphToBigWig conversion utility. The default is to first check the BioToolBox configuration file biotoolbox.cfg for the application path. Failing that, it will search the default environment path for the utility. If found, it will automatically execute the utility to convert the wig file.
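The lookup order described above can be approximated in Python as shown below. This is only a sketch: the function names are hypothetical, the biotoolbox.cfg lookup is omitted because its layout is not described here, and the command is built but not executed. The argument order (input file, chromosome sizes file, output file) follows the usage of the UCSC utilities.

```python
import shutil

def find_converter(explicit_path=None, bedgraph=False):
    """Return a path to the conversion utility, or None if not found."""
    if explicit_path:
        # Corresponds to a path given with --bwapp.
        return explicit_path
    # The configuration file step is skipped in this sketch.
    name = "bedGraphToBigWig" if bedgraph else "wigToBigWig"
    return shutil.which(name)  # search the environment PATH

def build_command(app, wig_file, chrom_file, bw_file):
    """Assemble the argument list: input, chromosome sizes, output."""
    return [app, wig_file, chrom_file, bw_file]

app = find_converter()
if app is None:
    print("wigToBigWig not found on PATH")
else:
    print(build_command(app, "data.wig", "chrom.sizes", "data.bw"))
```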

--keep

Keep the wig or bedGraph file after converting to a bigWig file. The default is to delete the file if the bigWig conversion is successful.

--gz

A boolean value to indicate whether the output wiggle file should be compressed with gzip.

--version

Print the version number.

--help

Display the POD documentation.

DESCRIPTION

This program will convert any tab-delimited data text file into a wiggle formatted text file. This requires that the file contains not only the scores but also chromosomal coordinates, i.e. chromosome, start, and (optionally) stop. The program should automatically detect these columns (if appropriately labeled) or they can be specified. An option exists to use the midpoint of a region, e.g. microarray probe.

The wig file format is specified by documentation supporting the UCSC Genome Browser and detailed here: http://genome.ucsc.edu/goldenPath/help/wiggle.html. Three formats are supported, 'fixedStep', 'variableStep', and 'bedGraph'. The format may be requested or determined empirically from the input file metadata. Genomic bin files generated with BioToolBox scripts record the window and step values in the metadata, which are used to determine the span and step wig values, respectively. The variableStep format is otherwise generated by default. The span is, by default, 1 bp.

Wiggle files cannot tolerate multiple data points at the same position, e.g. multiple microarray probes matching a repetitive sequence. An option exists to mathematically combine these data points into one value.

Strand is not inherently supported in wig files. If you have stranded data, they should be split into separate files. The BioToolBox script split_data_file.pl can be used for this purpose.

A binary BigWig file may also be generated from the text wiggle file. The binary format is preferable to the text version for a variety of reasons, including fast, random access and no loss in data value precision. More information can be found at this location: http://genome.ucsc.edu/goldenPath/help/bigWig.html. Conversion requires BigWig file support, supplied by the external wigToBigWig or bedGraphToBigWig utility available from UCSC.

AUTHOR

Timothy J. Parnell, PhD
Howard Hughes Medical Institute
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the GPL (either version 1, or at your option, any later version) or the Artistic License 2.0.