NAME
Bio::ToolBox::file_helper
SYNOPSIS
use Bio::ToolBox::file_helper qw(
load_tim_data_file
open_tim_data_file
write_tim_data_file
open_to_read_fh
open_to_write_fh
);
my $input_data = load_tim_data_file($file) or die "can't open file!";
my ($fh, $metadata) = open_tim_data_file($file) or die;
while (my $line = $fh->getline) {
...
}
my $input_fh = open_to_read_fh($file);
my $output_fh = open_to_write_fh($file);
# or, with the optional gzip and append flags
$output_fh = open_to_write_fh($file, $gz, $append);
my $success = write_tim_data_file(
'data' => $data,
'filename' => $file,
'gz' => $gz,
);
DESCRIPTION
These are general file helper subroutines for working with data text files, primarily opening, loading, and writing them. Specifically, they are designed to work with tim data text files, a generic tab-delimited text format of rows and columns with special metadata column headers. While the module works best with this tim data format, it will read any tab-delimited text file. Special file formats used in bioinformatics, including GFF and BED files, are automatically recognized by their file extension and appropriate metadata is added.
Files opened using these subroutines are stored in a specific complex data structure described below. This format allows for data access and also records metadata about each column (dataset) and the file in general. This metadata helps preserve a "history" of the dataset: where it came from, how it was collected, and how it was processed.
Additional subroutines are also present for general processing and output of this data structure.
The tim data file format is described below, and following that a description of the data structure.
FORMAT OF TIM DATA TEXT FILE
The tim data file format is not indicated by a special file extension. Rather, a generic '.txt' extension is used to preserve functionality with other text processing programs. The file is essentially a simple tab delimited text file representing rows (lines) and columns (demarcated by the tabs).
What makes it unique are the metadata header lines, each prefixed by a '# '. These metadata lines describe the data within the table with regard to its type, source, methodology, history, and processing. The metadata is designed to be read by both human and computer. Opening files without this metadata will result in basic default metadata assigned to each column. Special files recognized by their extension (e.g. GFF or BED) will have appropriate metadata assigned.
The metadata lines that are specifically recognized are listed below.
- Feature
-
The Feature describes the types of features represented on each row in the data table. These can include gene, transcript, genome, etc.
- Database
-
The name of the database used in generation of the feature table. This is often also the database used in collecting the data, unless the dataset metadata specifies otherwise.
- Program
-
The name of the program generating the data table and file. It usually includes the whole path of the executable.
- Column
-
The next header lines include column-specific metadata. Each column has a separate header line, beginning with the word 'Column' followed by an underscore and the column number (0-based). Following this is a series of 'key=value' pairs separated by ';'. Spaces are generally not allowed, and neither '=' nor ';' may appear within a key or value, since they would interfere with parsing. The metadata describes how and where the data was collected. Any modifications performed on the data are also recorded here. The only required key is 'name'. If the file being read does not contain metadata, basic metadata is generated automatically.
A list of standard column header keys is below, but is not exhaustive.
- name
-
The name of the column. This should be identical to the table header.
- database
-
Included if different from the main database indicated above.
- window
-
The size of the window for genome datasets
- step
-
The step size of the window for genome datasets
- dataset
-
The name of the dataset(s) from which data is collected. Comma delimited.
- start
-
The starting point for the feature in collecting values
- stop
-
The stopping point of the feature in collecting values
- extend
-
The extension of the region in collecting values
- strand
-
The strandedness of the data collected. Values include 'sense', 'antisense', or 'none'.
- method
-
The method of collecting values
- log2
-
A boolean indicating whether or not the values are in log2 space.
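As an illustration of the column metadata line described above, the following is a minimal sketch of how such a line might be parsed. The helper name parse_column_metadata is hypothetical and is not part of this module, and the sketch assumes that whitespace separates the 'Column_N' label from the key=value list, which the format description does not state explicitly. The sample keys and values are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one column metadata line into (index, hashref).
# Assumes whitespace separates 'Column_N' from the key=value list.
sub parse_column_metadata {
    my ($line) = @_;
    chomp $line;
    my ($index, $pairs) = $line =~ /^#\s*Column_(\d+)\s+(\S+)$/
        or return;
    my %meta = (index => $index);
    foreach my $pair (split /;/, $pairs) {
        my ($key, $value) = split /=/, $pair, 2;
        next unless defined $value;
        $meta{$key} = $value;
    }
    return unless exists $meta{name};    # 'name' is the only required key
    return ($index, \%meta);
}

# Illustrative line only; the dataset name is invented
my ($index, $meta) = parse_column_metadata(
    "# Column_3 name=H3K4me3;dataset=chip_data;method=mean;log2=1\n"
);
print "column $index is named $meta->{name}\n";
```

A line lacking the 'name' key is rejected here; a real reader would more likely fall back to generated default metadata.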
Finally, the data table follows the metadata. The table consists of tab-delimited data. The same number of fields should be present in each row. Each row represents a genomic feature or landmark, and each column contains either identifying information or a collected dataset. The first row will always contain the column names, except in special cases such as the GFF format where the columns are strictly defined. The column name should be the same as defined in the column's metadata. When loading GFF files, the header names and metadata are automatically generated for convenience.
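To make the overall layout concrete, here is a small sketch that separates metadata lines, the column-name row, and the data rows of a made-up file held in memory. It uses only core Perl rather than this module's subroutines, and the exact spelling of the sample header lines (for example, a space between 'Program' and its value) is an assumption.

```perl
use strict;
use warnings;

# A tiny made-up file; the metadata values are illustrative only.
my $text = join "\n",
    '# Program /usr/local/bin/example_program',
    '# Feature gene',
    '# Column_0 name=Name',
    '# Column_1 name=Score;log2=1',
    "Name\tScore",
    "YAL001C\t1.5",
    "YAL002W\t-0.3",
    '';

open my $fh, '<', \$text or die "cannot open in-memory file: $!";
my (@header_lines, @column_names, @rows);
while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ /^#\s*(.+)$/) {
        push @header_lines, $1;              # metadata line
    }
    elsif (!@column_names) {
        @column_names = split /\t/, $line;   # first table row holds the names
    }
    else {
        my @fields = split /\t/, $line, -1;
        die "row has wrong number of fields\n"
            unless @fields == @column_names;
        push @rows, \@fields;
    }
}
close $fh;
printf "%d metadata lines, %d columns, %d data rows\n",
    scalar @header_lines, scalar @column_names, scalar @rows;
```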
USAGE
Call the module at the beginning of your Perl script and pass a list of the desired subroutines to import. None are imported by default.
use Bio::ToolBox::file_helper qw(load_tim_data_file write_tim_data_file);
The specific usage for each subroutine is detailed below.
- load_tim_data_file()
-
This is a newer, updated file loader and parser for tim's data files. It will completely parse and load the file contents into the described data structure in memory. Files with metadata lines (described in tim data format) will have the metadata lines loaded. Files without metadata lines will have basic metadata (column name and index) automatically generated. The first non-header line should contain the column (dataset) name. Recognized file formats without headers, including GFF, BED, and SGR, will have the columns automatically named.
This subroutine uses the open_tim_data_file() subroutine and completes the loading of the file into memory.
BED and BedGraph style files, recognized by .bed or .bdg file extensions, have their start coordinate adjusted by +1 to convert from 0-based interbase numbering system to 1-based numbering format, the convention used by BioPerl. A metadata attribute is applied informing the user of the change. When writing a valid Bed or BedGraph file, converted start positions are changed back to interbase format.
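The coordinate shift can be sketched in a few lines. The helper names below are hypothetical and only illustrate the arithmetic; they are not this module's code. Only the start changes, because a 0-based, half-open BED end is numerically equal to a 1-based, inclusive end.

```perl
use strict;
use warnings;

# Hypothetical helpers illustrating the coordinate shift described above.
sub bed_start_to_one_based { my ($start) = @_; return $start + 1; }
sub one_based_to_bed_start { my ($start) = @_; return $start - 1; }

# A BED interval covering the first 100 bases of chr1
my ($chrom, $bed_start, $bed_end) = ('chr1', 0, 100);
my $start = bed_start_to_one_based($bed_start);    # the end is unchanged
print "$chrom:$start-$bed_end\n";                  # 1-based, inclusive
```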
Strand information is parsed from recognizable symbols, including "+, -, 1, -1, f, r, w, c, 0, .", to the BioPerl convention of 1, 0, and -1. Valid BED and GFF files are changed back when writing these files.
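A minimal sketch of such a strand conversion follows. The lookup table and helper names are hypothetical. Case-insensitive matching and mapping unrecognized symbols to 0 are assumptions, not documented behavior.

```perl
use strict;
use warnings;

# Hypothetical lookup table; matching is case-insensitive by assumption.
my %STRAND = (
    '+' => 1,  '1'  => 1,  'f' => 1,  'w' => 1,
    '-' => -1, '-1' => -1, 'r' => -1, 'c' => -1,
    '0' => 0,  '.'  => 0,
);

sub normalize_strand {
    my ($symbol) = @_;
    return 0 unless defined $symbol;
    my $value = $STRAND{ lc $symbol };
    return defined $value ? $value : 0;    # unrecognized symbols become 0
}

# Convert back to the symbols used in BED and GFF files
sub strand_to_symbol {
    my ($strand) = @_;
    return $strand > 0 ? '+' : $strand < 0 ? '-' : '.';
}

my @converted = map { normalize_strand($_) } qw(+ - W c .);
print join(' ', @converted), "\n";
```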
Pass the subroutine the filename. The file may be compressed with gzip, recognized by the .gz extension.
The subroutine will return a scalar reference to the hash, described above. Failure to read or parse the file will return an empty value.
Example:
my $filename = 'my_data.txt.gz';
my $data_ref = load_tim_data_file($filename);
- open_tim_data_file()
-
This is a file opener and metadata parser for data files, including tim's data formatted files and other recognized data formats (gff, bed, sgr). It will open the file, parse the metadata, and return an open file handle ready for reading. It will NOT load the entire file contents into memory. This is to allow for processing those gigantic data files that will break Perl with malloc errors.
The subroutine will open the file, parse the header lines (marked with a # prefix) into a metadata hash as described above, parse the data column names (the first row in the table), set the file pointer to the first row of data in the table, and return the open file handle along with a scalar reference to the metadata hash. The calling program may then process the file through the filehandle line by line as appropriate.
The data column names may be found in an array in the data hash under the key 'column_names'.
Pass the subroutine the filename. The file may be compressed with gzip, recognized by the .gz extension.
The subroutine will return two items: a scalar reference to the file handle, and a scalar reference to the data hash, described as above. The file handle is an IO::Handle object and may be manipulated as such. Failure to read or parse the file will return an empty value.
Example:
my $filename = 'my_data.txt.gz';
my ($fh, $metadata_ref) = open_tim_data_file($filename);
while (my $line = $fh->getline) {
  ...
}
$fh->close;
- write_tim_data_file()
-
This subroutine will write out a data file formatted for tim's data files. Please refer to "FORMAT OF TIM DATA TEXT FILE" for more information regarding the file format. If the 'gff' key is true in the data hash, then a gff file will be written.
The arguments are passed to the subroutine as a list of key => value pairs, as in the example below. The keys include
Required:
  data => A scalar reference to the tim data structure as described in Bio::ToolBox::data_helper.
Optional:
  filename => A scalar value containing the name of the file to write. This value is required for new data files and optional for overwriting existing files (the filename stored in the metadata is used). Appropriate extensions are added (e.g. .txt, .gz, etc.) as necessary.
  format => A string to indicate the file format to be written. Acceptable values include 'text' and 'simple'. Text files are text in nature, include all metadata, and usually have '.txt' extensions. Simple files are tab-delimited text files without metadata, useful for exporting data. If the format is not specified, the extension of the passed filename will be used as a guide. The default behavior is to write standard text files.
  gz => A boolean value (1 or 0) indicating whether the file should be written through a gzip filter to compress. If this value is undefined, then the file name is checked for the presence of the '.gz' extension and the value set appropriately. Default is false.
  simple => A boolean value (1 or 0) indicating whether a simple tab-delimited text data file should be written. This is an old alias for setting 'format' to 'simple'.
The subroutine will return true if the write was successful, otherwise it will return undef. The true value is the name of the file written, including any changes to the extension if necessary.
Note that by explicitly providing the filename extension, some of these options may be set without providing the arguments to the subroutine. The arguments always take precedence over the filename extensions, however.
Example
my $filename = 'my_data.txt.gz';
my $data_ref = load_tim_data_file($filename);
...
my $success_write = write_tim_data_file(
  'data' => $data_ref,
  'filename' => $filename,
  'format' => 'simple',
);
if ($success_write) {
  print "wrote $success_write!";
}
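The precedence rule for the 'gz' option (an explicit argument wins, otherwise the file extension decides, and the default is no compression) might be sketched as follows. The helper resolve_gz is hypothetical and only models the rule as documented above; it is not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether to compress, given the file name
# and the optional 'gz' argument.
sub resolve_gz {
    my ($filename, $gz) = @_;
    return $gz ? 1 : 0 if defined $gz;       # explicit argument wins
    return $filename =~ /\.gz$/i ? 1 : 0;    # otherwise use the extension
}

printf "%d\n", resolve_gz('my_data.txt.gz');       # extension requests gzip
printf "%d\n", resolve_gz('my_data.txt.gz', 0);    # argument overrides it
printf "%d\n", resolve_gz('my_data.txt');          # default: no compression
```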
- open_to_read_fh()
-
This subroutine will open a file for reading. If the passed filename has a '.gz' extension, it will appropriately open the file through a gunzip filter.
Pass the subroutine the filename. It will return a scalar reference to the open filehandle. The filehandle is an IO::Handle object and may be manipulated as such.
Example
my $filename = 'my_data.txt.gz';
my $fh = open_to_read_fh($filename);
while (my $line = $fh->getline) {
  # do something
}
$fh->close;
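For readers who want to see the idea without this module installed, the following stand-in mimics the documented behavior using core modules. The name read_fh_sketch is hypothetical, and the use of IO::Uncompress::Gunzip is an assumption made for the sketch; the module itself may open gzip files by a different mechanism.

```perl
use strict;
use warnings;
use IO::File;
use IO::Uncompress::Gunzip qw($GunzipError);
use IO::Compress::Gzip qw(gzip $GzipError);
use File::Temp qw(tempdir);

# Hypothetical stand-in: choose the reader based on the '.gz' extension.
sub read_fh_sketch {
    my ($filename) = @_;
    my $fh = $filename =~ /\.gz$/i
        ? IO::Uncompress::Gunzip->new($filename)
        : IO::File->new($filename, 'r');
    warn "unable to open '$filename'\n" unless $fh;
    return $fh;
}

# Write a small gzip file to a temporary directory, then read it back
my $dir     = tempdir(CLEANUP => 1);
my $file    = "$dir/my_data.txt.gz";
my $content = "Name\tScore\nYAL001C\t1.5\n";
gzip(\$content => $file) or die "gzip failed: $GzipError\n";

my $fh = read_fh_sketch($file) or die "no file handle\n";
my @lines;
while (my $line = $fh->getline) {
    chomp $line;
    push @lines, $line;
}
$fh->close;
print scalar(@lines), " lines read\n";
```

Both reader classes provide getline and close, so the calling loop is the same for compressed and uncompressed files.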
- open_to_write_fh()
-
This subroutine will open a file for writing. If the passed filename has a '.gz' extension, it will appropriately open the file through a gzip filter.
Pass the subroutine three values: the filename, a boolean value indicating whether the file should be compressed with gzip, and a boolean value indicating that the file should be appended. The gzip and append values are optional. The compression status may be determined automatically by the presence or absence of the passed filename extension; the default is no compression. The default is also to write a new file and not to append.
If gzip compression is requested, but the filename does not have a '.gz' extension, it will be automatically added. However, the change in file name is not passed back to the originating program; beware!
The subroutine will return a scalar reference to the open filehandle. The filehandle is an IO::Handle object and may be manipulated as such.
Example
my $filename = 'my_data.txt.gz';
my $gz = 1; # compress output file with gzip
my $fh = open_to_write_fh($filename, $gz); # write to new compressed file
$fh->print("something interesting\n");
$fh->close;
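Because the adjusted file name is not returned, a calling script may want to work out the name itself before opening the file. A minimal sketch, assuming the only adjustment is appending '.gz' when compression is requested and the extension is missing (the helper name is hypothetical):

```perl
use strict;
use warnings;

# Hypothetical helper: predict the file name that would actually be written.
sub expected_output_name {
    my ($filename, $gz) = @_;
    $filename .= '.gz' if $gz and $filename !~ /\.gz$/i;
    return $filename;
}

my $written = expected_output_name('my_data.txt', 1);
print "look for output in $written\n";
```

Passing the adjusted name to open_to_write_fh() in the first place avoids any mismatch.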
- convert_genome_data_2_gff_data()
-
This subroutine will convert an existing data hash structure, as described above, into a defined gff data structure, i.e. one that has the nine defined columns. Once converted, a gff data file may then be written using the write_tim_data_file() subroutine. To convert and write the gff file in one step, see the following subroutine, convert_and_write_to_gff_file().
NOTE: This method is DESTRUCTIVE!!!! Since the data table will be completely reorganized, any extraneous data in the data table will be discarded. Since referenced data is being used, any data loss may be significant and unexpected. A normal data file should be written first to preserve extraneous data, and the conversion to gff data be the last operation done.
Since the gff data structure requires genomic coordinates, this data must be present as identifiable datasets in the data table and metadata. It looks specifically for datasets labeled 'Chromosome', 'Start', and 'Stop' or 'End'. Failure to identify these datasets will simply return nothing. A dataset generated with get_new_genome_list() in Bio::ToolBox::db_helper will generate these datasets.
The arguments must be passed to the subroutine as a list of key => value pairs, as in the example below. The keys include
Required:
  data => A scalar reference to the data hash. The data hash should be as described in this module.
Optional:
  chromo => The index of the column in the data table that contains the chromosome or reference name. By default it searches for the first column with a name that begins with 'chr' or 'refseq' or 'seq'.
  start => The index of the column with the start position. By default it searches for the first column with a name that contains 'start'.
  stop => The index of the column with the stop position. By default it searches for the first column with a name that contains 'stop' or 'end'.
  score => The index of the dataset in the data table to be used as the score column in the gff data.
  name => The name to be used for the GFF features. Pass either the index of the dataset in the data table that contains the unique name for each gff feature, or a text string to be used as the name for all of the features. This information will be used in the 'group' column.
  strand => The index of the dataset in the data table to be used for strand information. Accepted values include any of the following: f(orward), r(everse), w(atson), c(rick), +, -, 1, -1, 0, or '.'.
  source => A scalar value representing either the index of the column containing values, or a text string to be used as the GFF source value. Default is 'data'.
  type => A scalar value representing either the index of the column containing values, or a text string to be used as the GFF type or method value. If not defined, it will use the column name of the dataset used for either the 'score' or 'name' column, if defined. As a last resort, it will use the most creative method of 'Experiment'.
  method => Alias for "type".
  midpoint => A boolean (1 or 0) value to indicate whether the midpoint between the actual 'start' and 'stop' values should be used instead of the actual values. Default is false.
  zero => The coordinates are 0-based (interbase). Convert to 1-based format (BioPerl conventions).
  tags => Provide an anonymous array of indices to be added as tags in the Group field of the GFF feature. The tag's key will be the column's name. As many tags may be added as desired.
  id => Provide the index of the column containing unique values which will be used in generating the GFF ID in v.3 GFF files. If not provided, the ID is automatically generated from the name.
  version => The GFF version (2 or 3) to be written. The default is version 3.
The subroutine will return true if the conversion was successful, otherwise it will return nothing.
Example
my $data_ref = load_tim_data_file($filename);
...
my $success = convert_genome_data_2_gff_data(
  'data' => $data_ref,
  'score' => 3,
  'midpoint' => 1,
);
if ($success) {
  # write a gff file
  my $success_write = write_tim_data_file(
    'data' => $data_ref,
    'filename' => $filename,
  );
  if ($success_write) {
    print "wrote $success_write!";
  }
}
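To show what the conversion of a single row into the nine GFF columns could look like, here is a simplified sketch. The helper row_to_gff3 is hypothetical. It handles 'name' only as a column index, leaves the strand unset, and truncates the midpoint to an integer, all of which are assumptions made for brevity rather than the module's documented behavior.

```perl
use strict;
use warnings;

# Hypothetical helper: build one GFF3 line from a row of a data table.
# The rounding of the midpoint is an assumption.
sub row_to_gff3 {
    my ($row, %opt) = @_;
    my ($chromo, $start, $stop) = @{$row}[ @opt{qw(chromo start stop)} ];
    if ($opt{midpoint}) {
        $start = $stop = int( ($start + $stop) / 2 );
    }
    my $score = defined $opt{score} ? $row->[ $opt{score} ] : '.';
    my $name  = defined $opt{name}  ? $row->[ $opt{name} ]  : 'feature';
    return join("\t",
        $chromo,
        $opt{source} || 'data',         # documented default source
        $opt{type}   || 'Experiment',   # documented last-resort type
        $start, $stop, $score,
        '.',                            # strand, not handled in this sketch
        '.',                            # phase
        "ID=$name;Name=$name",
    );
}

# Made-up row: chromosome, start, stop, name, score
my @row  = ('chr1', 101, 200, 'window1', 2.75);
my $line = row_to_gff3(\@row,
    chromo => 0, start => 1, stop => 2, name => 3, score => 4,
    midpoint => 1,
);
print "$line\n";
```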
- convert_and_write_to_gff_file()
-
This subroutine will convert a tim data structure as described above into GFF format and write the file. It will preserve the current data structure and convert the data on the fly as the file is written, unlike the destructive subroutine convert_genome_data_2_gff_data().
Either a v.2 or v.3 GFF file may be written. The only metadata written is the original data's filename (if present) and any dataset (column) metadata that contains more than the basics (name and index).
Since the gff data structure requires genomic coordinates, this data must be present as identifiable datasets in the data table and metadata. It looks specifically for datasets labeled 'Chromosome', 'Start', and optionally 'Stop' or 'End'. Failure to identify these datasets will simply return nothing. A dataset generated with get_new_genome_list() in Bio::ToolBox::db_helper will generate these coordinate datasets.
If successful, the subroutine will return the name of the output gff file written.
The arguments must be passed to the subroutine as a list of key => value pairs, as in the example below. The keys include
Required:
  data => A scalar reference to the data hash. The data hash should be as described in this module.
Optional:
  filename => The name of the output GFF file. If not specified, the default value is, in order, the method, name of the indicated 'name' dataset, name of the indicated 'score' dataset, or the originating file basename.
  version => The version of GFF file to write. Acceptable values include '2' or '3'. For v.3 GFF files, unique ID values will be auto generated, unless provided with a 'name' dataset index. Default is to write v.3 files.
  chromo => The index of the column in the data table that contains the chromosome or reference name. By default it searches for the first column with a name that begins with 'chr' or 'refseq' or 'seq'.
  start => The index of the column with the start position. By default it searches for the first column with a name that contains 'start'.
  stop => The index of the column with the stop position. By default it searches for the first column with a name that contains 'stop' or 'end'.
  score => The index of the dataset in the data table to be used as the score column in the gff data.
  name => The name to be used for the GFF features. Pass either the index of the dataset in the data table that contains the unique name for each gff feature, or a text string to be used as the name for all of the features. This information will be used in the 'group' column.
  strand => The index of the dataset in the data table to be used for strand information. Accepted values include any of the following: f(orward), r(everse), w(atson), c(rick), +, -, 1, -1, 0, or '.'.
  source => A scalar value representing either the index of the column containing values, or a text string to be used as the GFF source value. Default is 'data'.
  type => A scalar value representing either the index of the column containing values, or a text string to be used as the GFF type or method value. If not defined, it will use the column name of the dataset used for either the 'score' or 'name' column, if defined. As a last resort, it will use the most creative method of 'Experiment'.
  method => Alias for "type".
  midpoint => A boolean (1 or 0) value to indicate whether the midpoint between the actual 'start' and 'stop' values should be used instead of the actual values. Default is false.
  tags => Provide an anonymous array of indices to be added as tags in the Group field of the GFF feature. The tag's key will be the column's name. As many tags may be added as desired.
  id => Provide the index of the column containing unique values which will be used in generating the GFF ID in v.3 GFF files. If not provided, the ID is automatically generated from the name.
Example
my $data_ref = load_tim_data_file($filename);
...
my $success = convert_and_write_to_gff_file(
  'data' => $data_ref,
  'score' => 3,
  'midpoint' => 1,
  'filename' => "$filename.gff",
  'version' => 2,
);
if ($success) {
  print "wrote file '$success'!";
}
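The documented fallback order for the output file name (method, then the 'name' dataset, then the 'score' dataset, then the originating file basename) can be read as "take the first candidate that is defined". A sketch of that reading, with a hypothetical helper and hypothetical key names; appending '.gff' is also an assumption:

```perl
use strict;
use warnings;

# Hypothetical helper: pick the first defined, non-empty candidate.
sub default_gff_basename {
    my (%candidate) = @_;
    foreach my $key (qw(method name_column score_column basename)) {
        my $value = $candidate{$key};
        return $value if defined $value and length $value;
    }
    return;
}

my $base = default_gff_basename(
    method       => undef,
    name_column  => undef,
    score_column => 'H3K4me3',    # made-up dataset name
    basename     => 'my_data',
);
print "$base.gff\n";
```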
- write_summary_data()
-
This subroutine will summarize the data in a data file, generating mean values for all the values in each dataset (column), and writing an output file with the summarized data. This is useful for data collected in windows across a feature, for example, microarray data values across the body of genes, and then generating a composite or average gene occupancy.
The output file is a tim data tab-delimited file as described above with three columns: the Name of the window, the Midpoint of the window (calculated as the mean of the start and stop points for the window), and the mean value. The table is essentially rotated 90 degrees from the original table; the average of each column dataset becomes a row of data.
Pass the subroutine the arguments as a list of key => value pairs, as in the example below. These include:
Required:
  data => A scalar reference to the data hash. The data hash should be as described in this module.
  filename => The base filename for the file. This will be appended with '_summed' to differentiate from the original data file. This may be automatically obtained from the metadata of an opened file if not specified, otherwise it will not work.
Optional:
  startcolumn => The index of the beginning dataset containing the data to be summarized. This may be automatically calculated by taking the leftmost column without a known feature-description name (using examples from Bio::ToolBox::db_helper).
  stopcolumn => The index of the last dataset containing the data to be summarized. This may be automatically calculated by taking the rightmost column.
  dataset => The name of the original dataset used in collecting the data. It may be obtained from the metadata for the startcolumn.
  log => The data is in log2 space. It may be obtained from the metadata for the startcolumn.
Example
my $main_data_ref = load_tim_data_file($filename);
...
my $summary_success = write_summary_data(
  'data' => $main_data_ref,
  'filename' => $outfile,
  'startcolumn' => 4,
);
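The summarizing step, where each column is reduced to its mean and becomes one row with a window name and midpoint, can be illustrated on a small made-up table. This is a sketch in core Perl, not the module's implementation. The window bounds stand in for the 'start' and 'stop' column metadata, and treating '.' as a missing value is an assumption.

```perl
use strict;
use warnings;

# Made-up table: one name column followed by three window columns.
my @column_names = ('Name', 'win_-100', 'win_0', 'win_100');
# Hypothetical per-column window bounds, as might come from the
# 'start' and 'stop' column metadata keys.
my @windows = (undef, [-150, -50], [-50, 50], [50, 150]);
my @rows = (
    [ 'geneA', 1.0, 2.0, 3.0 ],
    [ 'geneB', 3.0, 4.0, '.' ],    # '.' marks a missing value (assumption)
);

my @summary;
for my $col (1 .. $#column_names) {
    my @values = grep { $_ ne '.' } map { $_->[$col] } @rows;
    next unless @values;
    my $sum = 0;
    $sum += $_ for @values;
    my ($start, $stop) = @{ $windows[$col] };
    push @summary, [
        $column_names[$col],
        ($start + $stop) / 2,    # midpoint of the window
        $sum / @values,          # mean of the column
    ];
}
print join("\t", @$_), "\n" for @summary;
```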
INTERNAL SUBROUTINES
These are internally used subroutines and are not exported for general usage.
- _check_file
-
This subroutine confirms the existence of a passed filename. If not immediately found, it will attempt to append common file extensions and verify its existence. This allows the user to pass only the base file name and not worry about missing the extension. This may be useful in shell scripts.
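The lookup strategy can be sketched as below. The helper find_file is hypothetical, and the list of extensions it tries is an assumption, since the actual list used by _check_file is not documented here.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper; the list of extensions tried is an assumption.
sub find_file {
    my ($filename) = @_;
    return $filename if -e $filename;
    foreach my $ext (qw(.gz .txt .txt.gz .bed .bed.gz)) {
        my $candidate = $filename . $ext;
        return $candidate if -e $candidate;
    }
    return;    # nothing found
}

# Create a small file in a temporary directory, then look it up by base name
my $dir = tempdir(CLEANUP => 1);
open my $out, '>', "$dir/my_data.txt" or die "cannot write: $!";
print {$out} "Name\tScore\n";
close $out;

my $found = find_file("$dir/my_data");
print "found $found\n" if $found;
```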
AUTHOR
Timothy J. Parnell, PhD
Howard Hughes Medical Institute
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112
This package is free software; you can redistribute it and/or modify it under the terms of the GPL (either version 1, or at your option, any later version) or the Artistic License 2.0.