
NAME

Bio::ToolBox::file_helper

SYNOPSIS

  use Bio::ToolBox::file_helper qw(
    load_tim_data_file
    open_tim_data_file
    write_tim_data_file
    open_to_read_fh
    open_to_write_fh
  );
  
  my $input_data = load_tim_data_file($file) or die "can't open file!";
  
  my ($fh, $metadata) = open_tim_data_file($file) or die;
  
  while (my $line = $fh->getline) {
    ...
  }
  
  my $input_fh = open_to_read_fh($file);
  
  my $output_fh = open_to_write_fh($file);
  
  my $output_fh = open_to_write_fh($file, $gz, $append);
  
  my $success = write_tim_data_file(
    'data'       => $data,
    'filename'   => $file,
    'gz'         => $gz,
  );

DESCRIPTION

These are general file helper subroutines for working with data text files, primarily opening, loading, and writing them. They are specifically designed to work with tim data text files: generic tab-delimited text files of rows and columns accompanied by special metadata column headers. While the subroutines work best with the tim data format, they can read any tab-delimited text file. Special file formats used in bioinformatics, including GFF and BED files, are automatically recognized by their file extensions and given appropriate metadata.

Files opened using these subroutines are stored in a specific complex data structure, described in Bio::ToolBox::data_helper. This format allows for data access as well as recording metadata about each column (dataset) and the file in general. The metadata helps preserve a "history" of the dataset: where it came from, how it was collected, and how it was processed.

Additional subroutines are also present for general processing and output of this data structure.

The tim data file format is described below; the in-memory data structure itself is described in Bio::ToolBox::data_helper.

FORMAT OF TIM DATA TEXT FILE

The tim data file format is not indicated by a special file extension. Rather, a generic '.txt' extension is used to preserve functionality with other text processing programs. The file is essentially a simple tab delimited text file representing rows (lines) and columns (demarcated by the tabs).

What makes it unique are the metadata header lines, each prefixed by a '# '. These metadata lines describe the data within the table with regards to its type, source, methodology, history, and processing. The metadata is designed to be read by both human and computer. Opening files without this metadata will result in basic default metadata assigned to each column. Special files recognized by their extension (e.g. GFF or BED) will have appropriate metadata assigned.

The metadata lines that are specifically recognized are listed below.

Feature

The Feature describes the types of features represented on each row in the data table. These can include gene, transcript, genome, etc.

Database

The name of the database used in generation of the feature table. This is often also the database used in collecting the data, unless the dataset metadata specifies otherwise.

Program

The name of the program generating the data table and file. It usually includes the whole path of the executable.

Column

The next header lines contain column-specific metadata. Each column has a separate header line, beginning with the word 'Column', followed by an underscore and the column number (0-based). Following this is a series of 'key=value' pairs separated by ';'. Spaces are generally not allowed, and the '=' and ';' characters must not appear within keys or values, as they would interfere with parsing. The metadata describes how and where the data was collected, and any modifications performed on the data are also recorded here. The only required key is 'name'. If the file being read does not contain metadata, basic metadata will be generated automatically. (See the example file below.)

A list of standard column header keys follows; the list is not exhaustive.

name

The name of the column. This should be identical to the table header.

database

Included if different from the main database indicated above.

window

The size of the window for genome datasets.

step

The step size of the window for genome datasets.

dataset

The name of the dataset(s) from which data is collected. Comma delimited.

start

The starting point for the feature in collecting values.

stop

The stopping point of the feature in collecting values.

extend

The extension of the region in collecting values.

strand

The strandedness of the data collected. Values include 'sense', 'antisense', or 'none'.

method

The method of collecting values.

log2

A boolean indicating whether the values are in log2 space.

Finally, the data table follows the metadata. The table consists of tab-delimited data; the same number of fields should be present in each row. Each row represents a genomic feature or landmark, and each column contains either identifying information or a collected dataset. The first row always contains the column names, except in special cases such as the GFF format, where the columns are strictly defined. Each column name should match the name defined in that column's metadata. When loading GFF files, the header names and metadata are automatically generated for convenience.
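
As an illustration, a minimal, hypothetical tim data file collecting one dataset across 500 bp genomic windows might look like the following (fields are tab-delimited; the exact spacing of the header lines may vary):

        # Program /usr/local/bin/get_datasets.pl
        # Database cerevisiae
        # Feature genome
        # Column_0 name=Chromosome
        # Column_1 name=Start;window=500;step=500
        # Column_2 name=Stop
        # Column_3 name=data1;dataset=my_dataset;method=mean;log2=1
        Chromosome	Start	Stop	data1
        chrI	1	500	0.523
        chrI	501	1000	-0.247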

USAGE

Load the module at the beginning of your Perl script and pass a list of the desired subroutines to import. None are imported by default.

  use Bio::ToolBox::file_helper qw(load_tim_data_file write_tim_data_file);
  

The specific usage for each subroutine is detailed below.

load_tim_data_file()

This is a newer, updated file loader and parser for tim's data files. It will completely parse and load the file contents into the described data structure in memory. Files with metadata lines (described in the tim data format above) will have those metadata lines loaded. Files without metadata lines will have basic metadata (column name and index) automatically generated. The first non-header line should contain the column (dataset) names. Recognized file formats without headers, including GFF, BED, and SGR, will have their columns automatically named.

This subroutine uses the open_tim_data_file() subroutine and completes the loading of the file into memory.

BED and BedGraph style files, recognized by .bed or .bdg file extensions, have their start coordinate adjusted by +1 to convert from 0-based interbase numbering system to 1-based numbering format, the convention used by BioPerl. A metadata attribute is applied informing the user of the change. When writing a valid Bed or BedGraph file, converted start positions are changed back to interbase format.
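
For example, a hypothetical BED interval covering the first 500 bp of a chromosome illustrates the conversion:

        on disk (0-based interbase):  chrI  0  500
        in memory (1-based):          chrI  1  500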

Strand information is parsed from recognizable symbols, including "+, -, 1, -1, f, r, w, c, 0, .", to the BioPerl convention of 1, 0, and -1. Valid BED and GFF files are changed back when writing these files.
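
The grouping follows the usual forward/watson versus reverse/crick convention (a summary for reference, not verbatim code):

        forward   ( 1):  '+', 'f', 'w', '1'
        reverse   (-1):  '-', 'r', 'c', '-1'
        no strand ( 0):  '.', '0'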

Pass the module the filename. The file may be compressed with gzip, recognized by the .gz extension.

The subroutine will return a scalar reference to the data hash (see Bio::ToolBox::data_helper for a description of the structure). Failure to read or parse the file will return an empty value.

Example:

        my $filename = 'my_data.txt.gz';
        my $data_ref = load_tim_data_file($filename);
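
Once loaded, the rows may be walked through the data table. A minimal sketch continuing the example above, assuming the structure keys described in Bio::ToolBox::data_helper (data_table, the table as an array of arrays with the column names at row 0, and last_row, the index of the final row):

        # walk each data row, skipping the column header row at index 0
        for my $row (1 .. $data_ref->{last_row}) {
                # the column index (3) is hypothetical; consult the
                # column names for the actual layout
                my $value = $data_ref->{data_table}->[$row][3];
                # do something with $value
        }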
        
open_tim_data_file()

This is a file opener and metadata parser for data files, including tim's data formatted files and other recognized data formats (gff, bed, sgr). It will open the file, parse the metadata, and return an open file handle ready for reading. It will NOT load the entire file contents into memory. This is to allow for processing those gigantic data files that will break Perl with malloc errors.

The subroutine will open the file, parse the header lines (marked with a # prefix) into a metadata hash as described above, parse the data column names (the first row in the table), set the file pointer to the first row of data in the table, and return the open file handle along with a scalar reference to the metadata hash. The calling program may then process the file through the filehandle line by line as appropriate.

The data column names may be found in an array in the data hash under the key 'column_names'.

Pass the module the filename. The file may be compressed with gzip, recognized by the .gz extension.

The subroutine will return two items: the open file handle and a scalar reference to the metadata hash, described above. The file handle is an IO::Handle object and may be manipulated as such. Failure to read or parse the file will return an empty value.

Example:

        my $filename = 'my_data.txt.gz';
        my ($fh, $metadata_ref) = open_tim_data_file($filename);
        while (my $line = $fh->getline) {
                ...
        }
        $fh->close;
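
The column names stored in the metadata hash may be retrieved directly, for example:

        my @names = @{ $metadata_ref->{column_names} };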
write_tim_data_file()

This subroutine will write out a data file formatted for tim's data files. Please refer to "FORMAT OF TIM DATA TEXT FILE" for more information regarding the file format. If the 'gff' key is true in the data hash, then a gff file will be written.

The subroutine is passed a reference to an anonymous hash containing the arguments. The keys include

  Required:
  data     => A scalar reference to the tim data structure as described
              in C<Bio::ToolBox::data_helper>. 
  Optional: 
  filename => A scalar value containing the name of the file to 
              write. This value is required for new data files and 
              optional for overwriting existing files (the filename 
              stored in the metadata is used). Appropriate extensions 
              are added (e.g. .txt, .gz) as necessary. 
  format   => A string to indicate the file format to be written.
              Acceptable values include 'text' and 'simple'.
              Text files are text in nature, include all metadata, and
              usually have '.txt' extensions. Simple files are
              tab-delimited text files without metadata, useful for
              exporting data. If the format is not specified, the
              extension of the passed filename will be used as a
              guide. The default behavior is to write standard text
              files.
  gz       => A boolean value (1 or 0) indicating whether the file 
              should be written through a gzip filter to compress. If 
              this value is undefined, then the file name is checked 
              for the presence of the '.gz' extension and the value 
              set appropriately. Default is false.
  simple   => A boolean value (1 or 0) indicating whether a simple 
              tab-delimited text data file should be written. This is 
              an old alias for setting 'format' to 'simple'.

The subroutine will return true if the write was successful, otherwise it will return undef. The true value is the name of the file written, including any changes to the extension if necessary.

Note that by explicitly providing the filename extension, some of these options may be set without passing the arguments to the subroutine. The arguments always take precedence over the filename extensions, however.

Example

        my $filename = 'my_data.txt.gz';
        my $data_ref = load_tim_data_file($filename);
        ...
        my $success_write = write_tim_data_file(
                'data'     => $data_ref,
                'filename' => $filename,
                'format'   => 'simple',
        );
        if ($success_write) {
                print "wrote $success_write!";
        }
open_to_read_fh()

This subroutine will open a file for reading. If the passed filename has a '.gz' extension, it will appropriately open the file through a gunzip filter.

Pass the subroutine the filename. It will return the open filehandle. The filehandle is an IO::Handle object and may be manipulated as such.

Example

        my $filename = 'my_data.txt.gz';
        my $fh = open_to_read_fh($filename);
        while (my $line = $fh->getline) {
                # do something
        }
        $fh->close;
        
open_to_write_fh()

This subroutine will open a file for writing. If the passed filename has a '.gz' extension, it will appropriately open the file through a gzip filter.

Pass the subroutine three values: the filename, a boolean value indicating whether the file should be compressed with gzip, and a boolean value indicating whether the file should be appended. The gzip and append values are optional. The compression status may be determined automatically from the presence or absence of the '.gz' extension in the filename; the default is no compression. The default is also to write a new file rather than append.

If gzip compression is requested, but the filename does not have a '.gz' extension, it will be automatically added. However, the change in file name is not passed back to the originating program; beware!

The subroutine will return the open filehandle. The filehandle is an IO::Handle object and may be manipulated as such.

Example

        my $filename = 'my_data.txt.gz';
        my $gz = 1; # compress output file with gzip
        my $fh = open_to_write_fh($filename, $gz);
        # write to new compressed file
        $fh->print("something interesting\n");
        $fh->close;
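
To append to an existing file rather than overwrite it, pass the third argument (a hypothetical log file is used here):

        my $log_fh = open_to_write_fh('results.log', 0, 1); # gzip off, append on
        $log_fh->print("another line\n");
        $log_fh->close;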
        
convert_genome_data_2_gff_data()

This subroutine will convert an existing data hash structure, as described above, into a defined GFF data structure, i.e. one with the nine defined GFF columns. Once converted, a GFF data file may then be written using the write_tim_data_file() subroutine. To convert and write the GFF file in one step, see convert_and_write_to_gff_file() below.

NOTE: This method is DESTRUCTIVE! Since the data table will be completely reorganized, any extraneous data in the data table will be discarded. Since referenced data is being used, any data loss may be significant and unexpected. A normal data file should be written first to preserve the extraneous data, and the conversion to GFF data should be the last operation performed.

Since the GFF data structure requires genomic coordinates, these must be present as identifiable datasets in the data table and metadata. The subroutine looks specifically for datasets labeled 'Chromosome', 'Start', and 'Stop' or 'End'. Failure to identify these datasets will simply return nothing. A data table generated with get_new_genome_list() in Bio::ToolBox::db_helper will contain these datasets.

The subroutine must be passed a reference to an anonymous hash with the arguments. The keys include

  Required:
  data     => A scalar reference to the data hash. The data hash 
              should be as described in this module.
  Optional: 
  chromo   => The index of the column in the data table that contains
              the chromosome or reference name. By default it 
              searches for the first column with a name that begins 
              with 'chr' or 'refseq' or 'seq'.
  start    => The index of the column with the start position. By 
              default it searches for the first column with a name 
              that contains 'start'.
  stop     => The index of the column with the stop position. By 
              default it searches for the first column with a name 
              that contains 'stop' or 'end'.
  score    => The index of the dataset in the data table to be used 
              as the score column in the gff data.
  name     => The name to be used for the GFF features. Pass either 
              the index of the dataset in the data table that 
              contains the unique name for each gff feature, or a 
              text string to be used as the name for all of the 
              features. This information will be used in the 
              'group' column.
  strand   => The index of the dataset in the data table to be used
              for strand information. Accepted values include any of 
              the following: 'f(orward)', 'r(everse)', 'w(atson)', 
              'c(rick)', '+', '-', '1', '-1', '0', '.'.
  source   => A scalar value representing either the index of the 
              column containing values, or a text string to 
              be used as the GFF source value. Default is 'data'.
  type     => A scalar value representing either the index of the 
              column containing values, or a text string to 
              be used as the GFF type or method value. If not 
              defined, it will use the column name of the dataset 
              used for either the 'score' or 'name' column, if 
              defined. As a last resort, it will use the most 
              creative method of 'Experiment'.
  method   => Alias for "type".
  midpoint => A boolean (1 or 0) value to indicate whether the 
              midpoint between the actual 'start' and 'stop' values
              should be used instead of the actual values. Default 
              is false.
  zero     => The coordinates are 0-based (interbase). Convert to 
              1-based format (bioperl conventions).
  tags     => Provide an anonymous array of indices to be added as 
              tags in the Group field of the GFF feature. The tag's 
              key will be the column's name. As many tags may be 
              added as desired.
  id       => Provide the index of the column containing unique 
              values which will be used in generating the GFF ID 
              in v.3 GFF files. If not provided, the ID is 
              automatically generated from the name.
  version  => The GFF version (2 or 3) to be written. The default is 
              version 3.

The subroutine will return true if the conversion was successful, otherwise it will return nothing.

Example

        my $data_ref = load_tim_data_file($filename);
        ...
        my $success = convert_genome_data_2_gff_data(
                'data'     => $data_ref,
                'score'    => 3,
                'midpoint' => 1,
        );
        if ($success) {
                # write a gff file
                my $success_write = write_tim_data_file(
                        'data'     => $data_ref,
                        'filename' => $filename,
                );
                if ($success_write) {
                        print "wrote $success_write!";
                }
        }
convert_and_write_to_gff_file()

This subroutine will convert a tim data structure as described above into GFF format and write the file. It will preserve the current data structure and convert the data on the fly as the file is written, unlike the destructive subroutine convert_genome_data_2_gff_data().

Either a v.2 or v.3 GFF file may be written. The only metadata written is the original data's filename (if present) and any dataset (column) metadata that contains more than the basics (name and index).

Since the GFF data structure requires genomic coordinates, these must be present as identifiable datasets in the data table and metadata. The subroutine looks specifically for datasets labeled 'Chromosome', 'Start', and optionally 'Stop' or 'End'. Failure to identify these datasets will simply return nothing. A data table generated with get_new_genome_list() in Bio::ToolBox::db_helper will contain these coordinate datasets.

If successful, the subroutine will return the name of the output gff file written.

The subroutine must be passed a reference to an anonymous hash with the arguments. The keys include

  Required:
  data     => A scalar reference to the data hash. The data hash 
              should be as described in this module.
  Optional: 
  filename => The name of the output GFF file. If not specified, 
              the default value is, in order, the method, name of 
              the indicated 'name' dataset, name of the indicated 
              'score' dataset, or the originating file basename.
  version  => The version of GFF file to write. Acceptable values 
              include '2' or '3'. For v.3 GFF files, unique ID 
              values will be auto generated, unless provided with a 
              'name' dataset index. Default is to write v.3 files.
  chromo   => The index of the column in the data table that contains
              the chromosome or reference name. By default it 
              searches for the first column with a name that begins 
              with 'chr' or 'refseq' or 'seq'.
  start    => The index of the column with the start position. By 
              default it searches for the first column with a name 
              that contains 'start'.
  stop     => The index of the column with the stop position. By 
              default it searches for the first column with a name 
              that contains 'stop' or 'end'.
  score    => The index of the dataset in the data table to be used 
              as the score column in the gff data.
  name     => The name to be used for the GFF features. Pass either 
              the index of the dataset in the data table that 
              contains the unique name for each gff feature, or a 
              text string to be used as the name for all of the 
              features. This information will be used in the 
              'group' column.
  strand   => The index of the dataset in the data table to be used
              for strand information. Accepted values include any of 
              the following: 'f(orward)', 'r(everse)', 'w(atson)', 
              'c(rick)', '+', '-', '1', '-1', '0', '.'.
  source   => A scalar value representing either the index of the 
              column containing values, or a text string to 
              be used as the GFF source value. Default is 'data'.
  type     => A scalar value representing either the index of the 
              column containing values, or a text string to 
              be used as the GFF type or method value. If not 
              defined, it will use the column name of the dataset 
              used for either the 'score' or 'name' column, if 
              defined. As a last resort, it will use the most 
              creative method of 'Experiment'.
  method   => Alias for "type".
  midpoint => A boolean (1 or 0) value to indicate whether the 
              midpoint between the actual 'start' and 'stop' values
              should be used instead of the actual values. Default 
              is false.
  tags     => Provide an anonymous array of indices to be added as 
              tags in the Group field of the GFF feature. The tag's 
              key will be the column's name. As many tags may be 
              added as desired.
  id       => Provide the index of the column containing unique 
              values which will be used in generating the GFF ID 
              in v.3 GFF files. If not provided, the ID is 
              automatically generated from the name.

Example

        my $data_ref = load_tim_data_file($filename);
        ...
        my $success = convert_and_write_to_gff_file(
                'data'     => $data_ref,
                'score'    => 3,
                'midpoint' => 1,
                'filename' => "$filename.gff",
                'version'  => 2,
        );
        if ($success) {
                print "wrote file '$success'!";
        }
        
write_summary_data()

This subroutine will summarize the data in a data file, generating mean values for all the values in each dataset (column) and writing an output file with the summarized data. This is useful for data collected in windows across a feature; for example, microarray data values collected across the bodies of genes may be summarized to generate a composite or average gene occupancy.

The output file is a tim data tab-delimited file as described above with three columns: the Name of the window, the Midpoint of the window (calculated as the mean of the start and stop points for the window), and the mean value. The table is essentially rotated 90° from the original table; the averages of each column dataset become rows of data.

Pass the subroutine an anonymous hash of arguments. These include:

  Required:
  data        => A scalar reference to the data hash. The data hash 
                 should be as described in this module.
  filename    => The base filename for the file. This will be 
                 appended with '_summed' to differentiate it from 
                 the original data file. If not specified, it may 
                 be obtained automatically from the metadata of an 
                 opened file; without a filename from either source 
                 the subroutine will fail.
  Optional: 
  startcolumn => The index of the first dataset containing the 
                 data to be summarized. This may be automatically 
                 calculated by taking the leftmost column without
                 a known feature-description name (using examples 
                 from Bio::ToolBox::db_helper).
  stopcolumn  => The index of the last dataset containing the 
                 data to be summarized. This may be automatically 
                 calculated by taking the rightmost column. 
  dataset     => The name of the original dataset used in 
                 collecting the data. It may be obtained from the 
                 metadata for the startcolumn.
  log         => The data is in log2 space. It may be obtained 
                 from the metadata for the startcolumn.

Example

        my $main_data_ref = load_tim_data_file($filename);
        ...
        my $summary_success = write_summary_data(
                'data'         => $main_data_ref,
                'filename'     => $outfile,
                'startcolumn'  => 4,
        );

INTERNAL SUBROUTINES

These are internally used subroutines and are not exported for general usage.

_check_file

This subroutine confirms the existence of a passed filename. If the file is not immediately found, it will attempt to append common file extensions and verify its existence. This allows the user to pass only the base file name without worrying about the extension. This may be useful in shell scripts.
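
A minimal sketch of the idea (hypothetical; the actual list of extensions tried is internal to the module):

        sub _check_file_sketch {
                my $filename = shift;
                return $filename if -e $filename;
                # try appending common extensions (hypothetical list)
                foreach my $ext (qw(.txt .txt.gz .bed .gff .gz)) {
                        return "$filename$ext" if -e "$filename$ext";
                }
                return; # not found
        }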

AUTHOR

 Timothy J. Parnell, PhD
 Howard Hughes Medical Institute
 Dept of Oncological Sciences
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the GPL (either version 1, or at your option, any later version) or the Artistic License 2.0.
