NAME
data2bed.pl
A script to convert a data file to a bed file.
SYNOPSIS
data2bed.pl [--options...] <filename>
Options:
--in <filename>
--ask
--chr <column_index>
--start <column_index>
--stop | --end <column_index>
--name <column_index | base_text>
--score <column_index>
--strand <column_index>
--zero
--out <filename>
--bigbed | --bb
--chromof <filename>
--db <database>
--bbapp </path/to/bedToBigBed>
--gz
--version
--help
OPTIONS
The command line flags and descriptions:
- --in <filename>
-
Specify an input file containing either a list of database features or genomic coordinates for which to collect data. The file should be a tab-delimited text file, one row per feature, with columns representing feature identifiers, attributes, coordinates, and/or data values. Genome coordinates are required. The first row should be column headers. Text files generated by other BioToolBox scripts are acceptable. Files may be gzip-compressed.
- --ask
-
Indicate that the program should interactively ask for the indices for feature data. It will present a list of the column names to choose from. Enter nothing for non-relevant columns or to accept default values.
- --chr <column_index>
-
The index of the dataset in the data table to be used as the chromosome or sequence ID column in the BED data.
- --start <column_index>
-
The index of the dataset in the data table to be used as the start position column in the BED data.
- --stop <column_index>
- --end <column_index>
-
The index of the dataset in the data table to be used as the stop or end position column in the BED data.
- --score <column_index>
-
The index of the dataset in the data table to be used as the score column in the BED data.
- --name <column_index | base_text>
-
Supply either the index of the column in the data table to be used as the name column in the BED data, or the base text to be used when auto-generating unique feature names. The auto-generated names are in the format 'text_00000001'. If the source file is GFF3, it will automatically extract the Name attribute.
- --strand <column_index>
-
The index of the dataset in the data table to be used for strand information. Accepted values include any of the following: "f(orward), r(everse), w(atson), c(rick), +, -, 1, -1, 0, .".
- --zero
-
Indicate that the source data is already in interbase (0-based) coordinates and does not need to be converted. By convention, all BioPerl (and, by extension, all biotoolbox) scripts use base (1-based) coordinates. Default behavior is to convert.
- --out <filename>
-
Specify the output filename. By default it uses the basename of the input file.
- --bigbed
- --bb
-
Indicate that a binary BigBed file should be generated instead of a text BED file. A .bed file is first generated, then converted to a .bb file, and then the .bed file is removed.
- --chromof <filename>
-
When converting to a BigBed file, provide a two-column tab-delimited text file containing the chromosome names and their lengths in bp. Alternatively, provide a name of a database, below.
- --db <database>
-
Specify the name of a Bio::DB::SeqFeature::Store annotation database or other indexed data file, e.g. Bam or bigWig file, from which chromosome length information may be obtained. For more information about using databases, see https://code.google.com/p/biotoolbox/wiki/WorkingWithDatabases. It may be supplied from the input file metadata.
- --bbapp </path/to/bedToBigBed>
-
Specify the path to Jim Kent's bedToBigBed conversion utility. The default is to first check the BioToolBox configuration file biotoolbox.cfg for the application path. Failing that, it will search the default environment path for the utility. If found, it will automatically execute the utility to convert the bed file.
- --gz
-
Specify whether (or not) the output file should be compressed with gzip.
- --version
-
Print the version number.
- --help
-
Display this POD documentation.
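As an illustration of the --name and --strand options described above, the following Python sketch shows one way auto-generated names and strand values could be produced. It is not the script's own code (the script is written in Perl). The helper names auto_names and normalize_strand are hypothetical, and the mapping of "0" and "." to an unstranded "." is an assumption based on the BED convention of "+", "-", or ".".

```python
def auto_names(base_text, count):
    """Generate unique names in the 'text_00000001' format (8 digits, zero-padded)."""
    return [f"{base_text}_{i:08d}" for i in range(1, count + 1)]


# Assumed mapping from the accepted strand values to BED strand symbols.
_STRAND = {
    "f": "+", "forward": "+", "w": "+", "watson": "+", "+": "+", "1": "+",
    "r": "-", "reverse": "-", "c": "-", "crick": "-", "-": "-", "-1": "-",
    "0": ".", ".": ".",
}


def normalize_strand(value):
    """Translate one strand value into '+', '-', or '.' (case-insensitive)."""
    key = str(value).strip().lower()
    if key not in _STRAND:
        raise ValueError(f"unrecognized strand value: {value!r}")
    return _STRAND[key]
```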
DESCRIPTION
This program will convert a tab-delimited data file into a BED file, according to the specifications here http://genome.ucsc.edu/goldenPath/help/customTrack.html#BED. A minimum of three and a maximum of six columns may be generated. Thin and thick block data (columns greater than 6) are not written.
Column identification may be specified on the command line, chosen interactively, or automatically determined from the column headers. GFF source files should have columns automatically identified.
All lower-numbered columns must be defined before writing higher-numbered columns, as per the specification. Dummy data may be filled in for Name and/or Score if a higher column is requested.
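The column-ordering rule can be sketched in Python as follows. This is only an illustration, not the script's implementation: the function name bed_fields is hypothetical, and the placeholder values "." for Name and "0" for Score are assumptions, since the actual dummy values are not stated here.

```python
def bed_fields(chrom, start, end, name=None, score=None, strand=None):
    """Return BED fields, filling Name/Score when a higher column is requested."""
    fields = [chrom, str(start), str(end)]
    optional = [name, score, strand]  # BED columns 4, 5, 6
    # Index of the highest-numbered optional column that was supplied (-1 if none).
    last = max((i for i, v in enumerate(optional) if v is not None), default=-1)
    fillers = [".", "0"]  # assumed placeholders for Name and Score
    for i in range(last + 1):
        value = optional[i]
        # A filler is only ever needed for a column below the highest supplied one.
        fields.append(str(value) if value is not None else fillers[i])
    return fields
```

Requesting only a strand, for example, yields six fields, with the Name and Score positions filled by the placeholders.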
Browser and Track lines are not written.
Following the specification, all coordinates are written in interbase (0-based) coordinates. Base (1-based) coordinates (the BioPerl standard) will be converted.
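The coordinate conversion amounts to subtracting one from the start position while leaving the end unchanged, which turns a 1-based, inclusive interval into a 0-based, half-open one of the same length. A minimal Python sketch, with the hypothetical function name to_bed_interval and a zero_based flag standing in for the --zero option:

```python
def to_bed_interval(start, end, zero_based=False):
    """Convert a 1-based, inclusive interval to 0-based, half-open BED coordinates."""
    if zero_based:
        # Source is already interbase; leave it untouched (analogous to --zero).
        return start, end
    # Only the start shifts; the end value is numerically the same in both systems.
    return start - 1, end
```

For instance, the first 100 bases of a chromosome are 1..100 in base coordinates and 0..100 in interbase coordinates.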
Score values should be integers within the range 0..1000, as given in the UCSC BED specification. Score values are not converted in this script. However, the biotoolbox script manipulate_datasets.pl has tools to do this if required.
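For readers who need to bring arbitrary data values into an integer score range themselves, one generic approach is a linear rescale. The Python sketch below is an assumption-laden illustration and does not describe how manipulate_datasets.pl works; the function name rescale_scores is hypothetical, and the target bounds are passed in rather than hard-coded.

```python
def rescale_scores(values, low, high):
    """Linearly rescale numeric values to integers in the range [low, high]."""
    if not values:
        return []
    vmin, vmax = min(values), max(values)
    if vmin == vmax:
        # All values identical: no spread to scale, so assign the upper bound
        # (an arbitrary choice for this sketch).
        return [high for _ in values]
    span = vmax - vmin
    return [round(low + (v - vmin) * (high - low) / span) for v in values]
```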
An option exists to further convert the BED file to an indexed, binary BigBed format. Jim Kent's bedToBigBed conversion utility must be available, and either a chromosome definition file or access to a Bio::DB database is required.
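The two inputs to the BigBed step can be sketched in Python as follows. The chromosome definition file is two tab-delimited columns (name and length in bp), and bedToBigBed is invoked with the BED file, the chromosome sizes file, and the output file, in that order. The function names chrom_sizes_text and bigbed_command are hypothetical, and the sketch only builds the argument list rather than running the utility.

```python
def chrom_sizes_text(lengths):
    """Format a {chromosome: length} mapping as two-column, tab-delimited text."""
    return "".join(f"{name}\t{length}\n" for name, length in lengths.items())


def bigbed_command(bed_path, sizes_path, out_path, app="bedToBigBed"):
    """Build the argument list: bedToBigBed in.bed chrom.sizes out.bb"""
    return [app, bed_path, sizes_path, out_path]
```

If the utility is installed, the resulting list could be passed to subprocess.run(cmd, check=True). Note that bedToBigBed expects the BED file to be sorted by chromosome and start position.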
AUTHOR
Timothy J. Parnell, PhD
Howard Hughes Medical Institute
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112
This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.