NAME
data2bed.pl
A program to convert a data file to a bed file.
SYNOPSIS
data2bed.pl [--options...] <filename>
File Options:
-i --in <filename> input file: txt, gff, vcf, etc
-o --out <filename> output file name
-H --noheader input file has no header row
-0 --zero file is in 0-based coordinate system
Column indices:
--bed [3|4|5|6] type of bed to write
-a --ask interactive selection of columns
-c --chr <index> chromosome column
-b --begin --start <index> start coordinate column
-e --end --stop <index> stop coordinate column
-n --name <text | index> name column or base name text
-s --score <index> score column
-t --strand <index> strand column
BigBed options:
-B --bb --bigbed generate a bigBed file
-d --db <database> database to collect chromosome lengths
--chromf <filename> specify a chromosome file
--bbapp </path/to/bedToBigBed> specify path to bedToBigBed
General Options:
--sort sort output by genomic coordinates
-z --gz compress output file
-Z --bgz bgzip compress output file
-v --version print version and exit
-h --help show extended documentation
OPTIONS
The command line flags and descriptions:
File Options
- --in <filename>
-
Specify an input file containing either a list of database features or genomic coordinates. The file should be a tab-delimited text file, one row per feature, with columns representing feature identifiers, attributes, coordinates, and/or data values. Genome coordinates are required. The first row should contain column headers unless --noheader is given. Text files generated by other BioToolBox scripts are acceptable. Files may be gzip compressed.
- --out <filename>
-
Specify the output filename. By default it uses the basename of the input file.
- --noheader
-
Indicate that the input file does not have column headers, as is often the case with UCSC-derived annotation data tables.
- --zero
-
Indicate that the source data is already in interbase (0-based) coordinates and does not need to be converted. By convention, all BioPerl (and, by extension, all biotoolbox) scripts use base (1-based) coordinates. The default behavior is to convert.
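As a rough illustration of the conversion that --zero switches off, the following Python sketch shifts a 1-based, inclusive start to the 0-based form used by BED and leaves already 0-based input untouched. It is not the script's own Perl code, and the function name to_bed_interval is hypothetical.

```python
def to_bed_interval(start, end, zero_based=False):
    """Return (start, end) in BED-style 0-based, half-open coordinates."""
    if zero_based:
        # Input is already interbase; nothing to convert.
        return start, end
    # 1-based inclusive -> 0-based half-open: only the start shifts by one.
    return start - 1, end


print(to_bed_interval(101, 200))                   # 1-based input
print(to_bed_interval(100, 200, zero_based=True))  # already 0-based input
```

Both calls describe the same 100 bp interval, which is why only the start coordinate changes.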
Column indices
- --bed [3|4|5|6]
-
Explicitly set the number of bed columns in the output file. Otherwise, it will attempt to write as many columns as available, filling in mock data as needed.
- --ask
-
Indicate that the program should interactively ask for the indices for feature data. It will present a list of the column names to choose from. Enter nothing for non-relevant columns or to accept default values.
- --chr <column_index>
-
The index of the dataset in the data table to be used as the chromosome or sequence ID column in the BED data.
- --start <column_index>
- --begin <column_index>
-
The index of the dataset in the data table to be used as the start position column in the BED data.
- --stop <column_index>
- --end <column_index>
-
The index of the dataset in the data table to be used as the stop or end position column in the BED data.
- --name <column_index | base_text>
-
Supply either the index of the column in the data table to be used as the name column in the BED data, or the base text to be used when auto-generating unique feature names. The auto-generated names are in the format 'text_00000001'. If the source file is GFF3, it will automatically extract the Name attribute.
- --score <column_index>
-
The index of the dataset in the data table to be used as the score column in the BED data.
- --strand <column_index>
-
The index of the dataset in the data table to be used for strand information. Accepted values might include any of the following: +, -, 1, -1, 0, .
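The name and strand handling described above might look roughly like the following Python sketch. It is an illustration only, not the script's Perl implementation: the helper names are hypothetical, the mapping of 1, -1, and 0 onto BED's +, -, and . symbols is an assumption, and so is the choice to treat unrecognized values as unstranded.

```python
def make_name(base, counter):
    # Mirrors the documented 'text_00000001' pattern:
    # base text, underscore, zero-padded eight-digit counter.
    return f"{base}_{counter:08d}"


# Assumed mapping of the accepted strand values onto BED symbols.
STRAND_MAP = {"+": "+", "1": "+", "-": "-", "-1": "-", "0": ".", ".": "."}


def normalize_strand(value):
    # Assumption: anything unrecognized is written as unstranded ('.').
    return STRAND_MAP.get(str(value).strip(), ".")


print(make_name("peak", 1))
print(normalize_strand(-1), normalize_strand("+"), normalize_strand(0))
```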
BigBed options
- --bigbed
- --bb
-
Indicate that a binary BigBed file should be generated instead of a text BED file. A .bed file is first generated, then converted to a .bb file, and then the .bed file is removed.
- --db <database>
-
Specify the name of a Bio::DB::SeqFeature::Store annotation database or other indexed data file, e.g. Bam or bigWig file, from which chromosome length information may be obtained. For more information about using databases, see https://code.google.com/p/biotoolbox/wiki/WorkingWithDatabases. It may be supplied from the input file metadata.
- --chromf <filename>
-
When converting to a BigBed file, provide a two-column tab-delimited text file containing the chromosome names and their lengths in bp. Alternatively, provide the name of a database with --db.
- --bbapp </path/to/bedToBigBed>
-
Specify the path to the UCSC bedToBigBed conversion utility. The default is to first check the BioToolBox configuration file biotoolbox.cfg for the application path. Failing that, it will search the default environment path for the utility. If found, it will automatically execute the utility to convert the bed file.
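The two inputs to the BigBed conversion can be sketched as follows. This is a Python illustration under stated assumptions, not the script's own Perl code: the helper names chrom_file_text and bigbed_command are hypothetical, the chromosome lengths are made-up placeholders, and the lookup here (explicit path, then the environment search path) omits the biotoolbox.cfg check described above. The argument order of input bed, chromosome file, output bb follows the UCSC utility's usage message.

```python
import shutil


def chrom_file_text(lengths):
    # Two tab-delimited columns: chromosome name, length in bp.
    return "".join(f"{name}\t{length}\n" for name, length in lengths.items())


def bigbed_command(bed_path, chrom_path, bb_path, app=None):
    # Prefer an explicit path, then the search path, then the bare name.
    app = app or shutil.which("bedToBigBed") or "bedToBigBed"
    return [app, bed_path, chrom_path, bb_path]


# Placeholder lengths for illustration only.
print(chrom_file_text({"chr1": 1000, "chr2": 500}), end="")
print(bigbed_command("regions.bed", "chrom.sizes", "regions.bb",
                     app="/usr/local/bin/bedToBigBed"))
```

The returned list is suitable for passing to subprocess.run once the bed file and chromosome file exist on disk.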
General options
- --sort
-
Sort the output file by genomic coordinates. Automatically enabled when compressing with bgzip or saving to bigBed.
- --gz
-
Specify whether the output file should be compressed with gzip.
- --bgz
-
Specify whether the output file should be compressed with block gzip (bgzip) for tabix compatibility.
- --version
-
Print the version number.
- --help
-
Display this POD documentation.
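To make the --sort option above concrete, here is a minimal Python sketch of sorting BED rows by chromosome and then numerically by start and end. It is an assumption that this matches the script's ordering; in particular, chromosome names are compared as plain strings here (so chr10 sorts before chr2), which is the ordering produced by sort -k1,1 -k2,2n and expected by bedToBigBed. The function name sort_bed_rows is hypothetical.

```python
def sort_bed_rows(rows):
    # rows: sequences such as [chrom, start, end, ...].
    # Chromosome is compared as text, coordinates as integers.
    return sorted(rows, key=lambda r: (r[0], int(r[1]), int(r[2])))


rows = [
    ["chr2", "500", "600"],
    ["chr1", "1000", "1100"],
    ["chr1", "200", "300"],
]
for row in sort_bed_rows(rows):
    print("\t".join(row))
```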
DESCRIPTION
This program converts a tab-delimited data file into a BED file, according to the specification at http://genome.ucsc.edu/goldenPath/help/customTrack.html#BED. A minimum of three and a maximum of six columns may be generated. Thin and thick block data (columns greater than 6) are not written.
Column identification may be specified on the command line, chosen interactively, or automatically determined from the column headers. GFF source files should have columns automatically identified.
All lower-numbered columns must be defined before writing higher-numbered columns, as per the specification. Dummy data may be filled in for Name and/or Score if a higher column is requested.
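The column-filling rule can be sketched in Python as below. This is only an illustration of the idea: the function name build_bed_row is hypothetical, and the placeholder values used for a missing name, score, or strand ("region", 0, and ".") are assumptions, not necessarily what the script writes.

```python
def build_bed_row(chrom, start, end, name=None, score=None, strand=None, width=None):
    optional = [name, score, strand]
    if width is None:
        # Write as many columns as the highest supplied optional field needs.
        supplied = [i + 1 for i, v in enumerate(optional) if v is not None]
        width = 3 + max(supplied, default=0)
    if not 3 <= width <= 6:
        raise ValueError("bed width must be between 3 and 6")
    placeholders = ["region", 0, "."]  # assumed dummy values
    row = [chrom, start, end]
    for i in range(width - 3):
        # Every lower-numbered column is filled before a higher one is written.
        row.append(optional[i] if optional[i] is not None else placeholders[i])
    return row


print(build_bed_row("chr1", 0, 100, strand="+"))
print(build_bed_row("chr1", 0, 100, name="a", width=5))
```

Requesting only a strand still yields six columns, because the name and score columns must be present before the strand column can be written.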
Browser and Track lines are not written.
Following the specification, all coordinates are written as interbase (0-based) coordinates. Base (1-based) coordinates (the BioPerl standard) are converted.
Score values should be integers within the range 0..1000. Score values are not converted in this script. However, the biotoolbox script manipulate_datasets.pl has tools to do this if required.
An option exists to further convert the BED file to an indexed, binary BigBed format. Jim Kent's bedToBigBed conversion utility must be available, and either a chromosome definition file or access to a Bio::DB database is required.
AUTHOR
Timothy J. Parnell, PhD
Howard Hughes Medical Institute
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112
This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.