NAME

Bio::ToolBox::data_helper

DESCRIPTION

These are general subroutines for working with data, and specifically with what was known colloquially as the "tim data structure" before it became Bio::ToolBox. The module provides a catchall location for common subroutines that don't fit in either Bio::ToolBox::file_helper or Bio::ToolBox::db_helper.

TIM DATA STRUCTURE

The tim data structure is a complex data structure that is commonly used throughout the biotoolbox scripts, thus simplifying data input/output and manipulation. The primary structure is a hash with numerous keys. The actual data table is represented as an array of arrays. Metadata for the columns (datasets) are stored as hashes.

This whole data structure is intended to eventually become the structure for a blessed object and the basis for a class of object-oriented methods that work on the structure. One of these days I'll find time to implement that and rewrite all of the biotoolbox scripts.... Maybe for that mythical 2.0 release.

The primary keys in the tim data structure are described here.

program

This includes the scalar value from the Program header line and represents the name of the program that generated the data file.

db

This includes the scalar value from the Database header line and is the name of the database from which the file data was generated.

feature

This includes the scalar value from the Feature header line and describes the type of features in the data file.

gff

This includes a scalar value of the source GFF file version, obtained from either the GFF file pragma or the file extension. The default value is 0 (not a GFF file). As such, it may be treated as a boolean value.

bed

If the source file is a BED file, then this tag value is set to the number of columns in the original BED file, an integer of 3 to 12. The default value is 0 (not a BED file). As such, it may be treated as a boolean value.

number_columns

This includes an integer representing the total number of columns in the data table. It is automatically calculated from the data table and updated each time a column is added.

last_row

This represents the integer for the index number (0-based) of the last row in the data table. It is calculated automatically from the data table.

other

This key points to an anonymous array of additional, unrecognized header lines in the parsed file. Examples include metadata from older file formats or general comments not suitable for other locations. The entire line is added to the array, and is written back out before the column metadata. The line ending character is automatically stripped when the line is added to this array upon file loading, and automatically added when writing out to a text file.

filename

The original path and filename of the file opened and parsed (just in case you forgot). Joking aside, if the extension was not initially provided, the different functions may have added it to the filename upon opening as a convenience for users. The actual complete name will be found here.

basename

The base name of the original file name, minus the extension(s). Useful when needing to assign a new file name based on the current file name.

extension

The known extension(s) of the original file name. Known extensions currently include '.txt, .gff, .gff3, .bed, .sgr' as well as their gzip equivalents.

path

The parent directories of the original file. The full filename can be regenerated by concatenating the path, basename, and extension.
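
As a minimal sketch of that concatenation, the following uses a hand-built hash with made-up values rather than a structure returned by the module. It assumes that the path value ends with the directory separator; the file name shown is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical values, as they might appear after parsing a file.
# Assumption: 'path' ends with the directory separator.
my %data = (
    path      => '/home/user/project/',
    basename  => 'sample1',
    extension => '.txt.gz',
);

# Rebuild the full file name from its three parts.
my $filename = $data{path} . $data{basename} . $data{extension};

# Derive a new output name based on the current one.
my $outfile = $data{path} . $data{basename} . '_sorted' . $data{extension};

print "$filename\n$outfile\n";
```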

headers

A boolean flag (1 or 0) to indicate whether headers are present or not. Some file formats, e.g. BED, GFF, etc., do not explicitly have column headers; the headers flag should be set to false in this case. Standard tim data formatted text files should be set to true.

<column_index_number>

Each column will have a metadata index. Usually this is read from the column's metadata line. The key will be the index number (0-based) of the column. The value will be an anonymous hash consisting of the column metadata. For metadata header lines from a parsed file, these will be the key=value pairs listed in the line. There should always be two mandatory keys, 'name' and 'index'.

data_table

This key will point to an anonymous array of arrays, representing the tab-delimited data table in the file. The primary array will be row, representing each feature. The secondary array will be the column, representing the descriptive and data elements for each feature. Any value can be looked up by $data_structure_ref->{'data_table'}->[$row][$column]. The first row should always contain the column (dataset) names, regardless of whether the original data file had dataset names (GFF and BED files, for example, do not).
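
The lookup described above can be sketched with a small hand-built table. The values are made up, and only the data_table, number_columns, and last_row keys are shown; a real structure would carry the other keys as well.

```perl
use strict;
use warnings;

# A made-up table: row 0 holds the column names, later rows hold features.
my $data = {
    data_table => [
        [ 'Chromo', 'Start', 'Stop' ],
        [ 'chr1',   1,       100 ],
        [ 'chr1',   101,     200 ],
    ],
};

# Both values can be derived from the table itself.
$data->{number_columns} = scalar @{ $data->{data_table}->[0] };
$data->{last_row}       = $#{ $data->{data_table} };

# Look up one value by row and column index.
my $value = $data->{data_table}->[2][1];

# Walk the data rows, skipping the header in row 0.
my $total_length = 0;
for my $row ( 1 .. $data->{last_row} ) {
    my ( $start, $stop ) = @{ $data->{data_table}->[$row] }[ 1, 2 ];
    $total_length += $stop - $start + 1;
}

print "value=$value total_length=$total_length\n";
```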

index

This is an optional index hash to provide faster lookup of specific data table rows for genomic bin features. The index is generated using the index_data_table() function. The index consists of a nested hash, where the first key is the chromosome name and the second key is an index value. The index value is the integer portion (rounding down) of the start value divided by the index_increment value. For example, with a genomic bin feature at chr1:10691..10700 and an index_increment value of 100, the index value would be 106, stored at {chr1}{106}. The value of that key is the row index of the first occurrence of that index value (which would have been the genomic bin feature chr1:10601..10610). Hence, the index will get you relatively close to your desired genomic position within the data_table, but you will still need to step through the features (rows) starting at the indexed position until you find the row you want. That should save you a little bit of time, at least. The index is not stored upon writing to a standard tim data text file.
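
The arithmetic can be sketched by building such an index by hand. This is an illustration of the layout described above, not the code of index_data_table(); the 10 bp bins and the variable names are made up.

```perl
use strict;
use warnings;

# Hypothetical 10 bp genomic bins on chr1; row 0 is the header row.
my @table = ( [ 'Chromo', 'Start', 'Stop' ] );
for my $i ( 0 .. 29 ) {
    my $start = 10501 + 10 * $i;
    push @table, [ 'chr1', $start, $start + 9 ];
}

my $increment = 100;
my %index;
for my $row ( 1 .. $#table ) {
    my ( $chr, $start ) = @{ $table[$row] }[ 0, 1 ];
    my $index_value = int( $start / $increment );

    # Keep only the first row seen for each index value.
    $index{$chr}{$index_value} = $row
        unless exists $index{$chr}{$index_value};
}

# The bin chr1:10691..10700 falls under index value 106.
my $lookup = int( 10691 / $increment );
my $first  = $index{chr1}{$lookup};
print "index value $lookup starts at row $first ",
    "(start $table[$first][1])\n";
```

Stepping forward from that row reaches the bin starting at 10691 after nine more rows.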

index_increment

This is a single number representing the increment value to calculate the index value for the index. It is generated along with the index by the index_data_table() function. The index_increment value is not stored upon writing to a standard tim data text file.

USAGE

Call the module at the beginning of your perl script. Include the name(s) of the subroutines to import.

  use Bio::ToolBox::data_helper qw(generate_tim_data_structure);
  

The specific usage for each subroutine is detailed below.

generate_tim_data_structure()

As the name implies, this generates a new empty data structure as described above. Populating the data table and metadata is the responsibility of the end user.

Pass the subroutine an array. The first element should be the name of the features in the data table. This is an arbitrary, but required, value. The remainder of the array should be the name(s) of the columns (datasets). A rudimentary metadata hash for each dataset is generated (consisting only of name and index). The name is also entered into the first row of the data table (row 0, the header row).

It will return the reference to the tim_data_structure.
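
To give a rough picture of what the returned structure might contain, here is a stand-in written from the description above. The subroutine sketch_generate() is hypothetical and is not the module's code; it only builds the pieces the text mentions (the feature name, the name and index metadata, and the header row), and the real structure carries additional keys.

```perl
use strict;
use warnings;

# Hypothetical stand-in, NOT the module's own code.
sub sketch_generate {
    my ( $feature, @names ) = @_;
    my %data = (
        feature        => $feature,
        number_columns => scalar @names,
        last_row       => 0,
        data_table     => [ [@names] ],    # row 0 is the header row
    );

    # One rudimentary metadata hash per column, keyed by column index.
    for my $i ( 0 .. $#names ) {
        $data{$i} = { name => $names[$i], index => $i };
    }
    return \%data;
}

my $ref = sketch_generate(qw(genomic_intervals Chromo Start Stop));

# Populating the table is left to the caller.
push @{ $ref->{data_table} }, [ 'chr1', 1, 100 ];
$ref->{last_row} = $#{ $ref->{data_table} };

print join( "\t", @{ $ref->{data_table}->[0] } ), "\n";
```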

Example

        my $main_data_ref = generate_tim_data_structure(qw(
                genomic_intervals
                Chromo
                Start
                Stop
        ));

verify_data_structure()

This subroutine verifies the data structure. It checks items such as the presence of the data table array, the number of columns in the data table and metadata, the metadata index of the last row, the presence of basic metadata, and verification of dataset names for each column. For data structures with the GFF or BED tags set to true, it will verify the format, including column number and column names; if a check fails, it will reset the GFF or BED key to false. It will automatically correct some simple errors, and complain about others.

Pass the data structure reference. It will return 1 if successfully verified, or false if not.
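
A few of the checks listed above can be sketched as follows. The subroutine sketch_check() is hypothetical and covers only a small part of what the text describes; unlike verify_data_structure(), it corrects nothing and simply returns a list of problems.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's implementation. Returns a list
# of problems; an empty list means these few checks passed.
sub sketch_check {
    my $data  = shift;
    my $table = $data->{data_table};
    return ('no data table') unless ref($table) eq 'ARRAY' and @$table;

    my @problems;
    my $width = scalar @{ $table->[0] };
    push @problems, 'number_columns does not match the header row'
        unless defined $data->{number_columns}
        and $data->{number_columns} == $width;
    push @problems, 'last_row does not match the table'
        unless defined $data->{last_row}
        and $data->{last_row} == $#$table;

    # Each column needs metadata whose name agrees with the header row.
    for my $i ( 0 .. $width - 1 ) {
        my $meta = $data->{$i};
        if ( ref($meta) ne 'HASH' ) {
            push @problems, "column $i has no metadata";
            next;
        }
        push @problems, "column $i name differs from the header row"
            unless defined $meta->{name}
            and $meta->{name} eq $table->[0][$i];
    }
    return @problems;
}

my $data = {
    number_columns => 2,
    last_row       => 1,
    0              => { name => 'Chromo', index => 0 },
    1              => { name => 'Start',  index => 1 },
    data_table     => [ [ 'Chromo', 'Start' ], [ 'chr1', 1 ] ],
};
my @ok = sketch_check($data);

$data->{last_row} = 5;    # deliberately wrong
my @bad = sketch_check($data);
print scalar(@ok), " problems, then ", scalar(@bad), "\n";
```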

sort_data_structure()

This subroutine will sort the data table by the values in a given column. It will automatically determine whether the contents of the column are numbers or alphanumeric, and will sort accordingly, either numerically or asciibetically. The first non-null value in the column is used to determine the type. The sort may fail if the values are not consistent. The sort may be done in either increasing or decreasing order.
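
The type detection can be sketched as below. The subroutine sketch_sort() is hypothetical, takes a bare table rather than the full data structure, and assumes that an empty string or '.' marks a null value; the module's own rules may differ.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical helper: sort the data rows (not the header) by one column.
sub sketch_sort {
    my ( $table, $column, $direction ) = @_;
    my ( $header, @rows ) = @$table;

    # The first non-null value decides how values are compared.
    # Assumption: '' and '.' represent null values.
    my ($example) =
        grep { defined $_ and $_ ne '' and $_ ne '.' }
        map  { $_->[$column] } @rows;
    my $numeric = defined $example && looks_like_number($example);

    my @sorted;
    if ($numeric) {
        @sorted = sort { $a->[$column] <=> $b->[$column] } @rows;
    }
    else {
        @sorted = sort { $a->[$column] cmp $b->[$column] } @rows;
    }
    @sorted = reverse @sorted if $direction =~ /^d/i;
    return [ $header, @sorted ];
}

my $table = [ [ 'Name', 'Score' ], [ 'a', 10 ], [ 'b', 9 ], [ 'c', 100 ] ];
my $up    = sketch_sort( $table, 1, 'i' );
my $down  = sketch_sort( $table, 1, 'decreasing' );
my $up_scores   = join ',', map { $_->[1] } @{$up}[ 1 .. $#$up ];
my $down_scores = join ',', map { $_->[1] } @{$down}[ 1 .. $#$down ];
print "$up_scores\n$down_scores\n";
```

An asciibetical sort of the same column would place 10 and 100 before 9, which is why the detection step matters.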

Pass the function three values:

    1. the data structure reference, as described here
    2. the index of the column or dataset by which to sort
    3. a scalar value indicating the direction of the sort, 
       either 'increasing', 'i', 'decreasing', or 'd'.

gsort_data_structure()

This subroutine will sort the data table by increasing chromosomal coordinates. It will attempt to automatically identify the chromosome and start or position columns by their column name. Failure to find these columns means a failure to sort the table. Chromosome names are sorted first by their digits (e.g. chr2 before chr10), and then alphanumerically. Base coordinates are sorted by increasing value. Identical positions are kept in their original order.
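
The ordering rules can be sketched with a comparison routine. The subroutine compare_position() and the rows are made up for illustration, and the placement of chromosome names without digits (such as chrX) after the numbered ones is an assumption. The original row number serves as the final tie-breaker, so identical positions keep their order.

```perl
use strict;
use warnings;

# Made-up rows of [chromosome, start, label].
my @rows = (
    [ 'chr10', 500, 'a' ],
    [ 'chr2',  100, 'first' ],
    [ 'chrX',  10,  'b' ],
    [ 'chr2',  100, 'second' ],
    [ 'chr10', 5,   'c' ],
);

# Hypothetical comparison: digits first, then name, then start.
sub compare_position {
    my ( $x, $y ) = @_;
    my ($nx) = $x->[0] =~ /(\d+)/;
    my ($ny) = $y->[0] =~ /(\d+)/;
    my $by_chr;
    if ( defined $nx and defined $ny ) {
        $by_chr = $nx <=> $ny || $x->[0] cmp $y->[0];
    }
    elsif ( defined $nx ) { $by_chr = -1 }    # assumption: numbered first
    elsif ( defined $ny ) { $by_chr = 1 }
    else                  { $by_chr = $x->[0] cmp $y->[0] }
    return $by_chr || $x->[1] <=> $y->[1];
}

# Sort row numbers, falling back to the original order on ties.
my @order =
    sort { compare_position( $rows[$a], $rows[$b] ) || $a <=> $b }
    0 .. $#rows;
my @sorted = @rows[@order];

my $labels = join ',', map { $_->[2] } @sorted;
print "$labels\n";
```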

Pass the function one parameter, the data structure.

splice_data_structure()

This function will splice an ordinal section out of a data structure in preparation for forking and parallel execution. Pass the function three parameters:

    1. the data structure reference, as described here
    2. the 1-based ordinal index to keep
    3. the total number of parts to split the data structure

Each spliced data structure will maintain the same metadata and column headings (data table row 0), but the data table will have only a fraction of the original data.
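
One way the fraction could be computed is sketched below. The subroutine sketch_splice() is hypothetical, and it assumes the rows are divided into contiguous, nearly equal blocks; the module may partition the rows differently.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only part $part (1-based) of $total parts.
# Assumption: contiguous, nearly equal blocks of data rows.
sub sketch_splice {
    my ( $data, $part, $total ) = @_;
    my $table = $data->{data_table};
    my $rows  = $#$table;    # data rows only; row 0 is the header
    my $first = 1 + int( $rows * ( $part - 1 ) / $total );
    my $last  = int( $rows * $part / $total );
    my @keep  = @{$table}[ $first .. $last ];

    # The header row is kept; last_row reflects the smaller table.
    $data->{data_table} = [ $table->[0], @keep ];
    $data->{last_row}   = scalar @keep;
    return $data;
}

# Ten made-up data rows below a header row.
my @table = ( ['Start'], map { [$_] } 1 .. 10 );

my ( @sizes, @firsts );
for my $part ( 1 .. 4 ) {
    # Each call works on its own copy, as each child process would.
    my $copy = { data_table => [@table], last_row => $#table };
    sketch_splice( $copy, $part, 4 );
    push @sizes,  $copy->{last_row};
    push @firsts, $copy->{data_table}->[1][0];
}
print "sizes: @sizes\nfirst values: @firsts\n";
```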

For example, to split a data table into four segments for parallel execution in four child processes, call this function once in each child, increasing the index (second parameter) each time.

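        use Parallel::ForkManager;

        # load_tim_data_file() and write_tim_data_file() are assumed
        # to be imported from Bio::ToolBox::file_helper
        my $data = load_tim_data_file($file);
        my $pm = Parallel::ForkManager->new(4);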
        for my $i (1..4) {
                $pm->start and next;
                ### in child
                splice_data_structure($data, $i, 4);
                # do something with this fraction
                write_tim_data_file('data' => $data, 'filename' => "file#$i");
                $pm->finish;
        }
        $pm->wait_all_children;
        

The child data structure will be lost upon exiting the child process unless it is saved somehow. The easiest thing is to write it to disk. The biotoolbox script join_data_file.pl may then be used to join the file segments back into a single file. The Parallel::ForkManager also has a method of merging the data structure into the parent process using a disk file intermediate.

index_data_table()

This function creates an index hash for genomic bin features in the data table. Rather than stepping through an entire data table of genomic coordinates looking for a specific chromosome and start feature (or data row), an index may be generated to speed up the search, such that only a tiny portion of the data_table needs to be stepped through to identify the correct feature.

This function generates two additional keys in the tim data structure described above, "index" and "index_increment". Please refer to those items in "TIM DATA STRUCTURE" for their description and usage.

Pass this subroutine one or two arguments. The first is the reference to the data structure. The optional second argument is an integer value to be used as the index_increment value. This value determines the size and efficiency of the index; small values generate a larger but more efficient index, while large values do the opposite. A balance should be struck between memory consumption and speed. The default value is 20 x the feature window size (determined from the metadata). Therefore, finding the specific genomic coordinate feature should take no more than 20 steps from the indexed position. If successful, the subroutine returns a true value.

Example

        my $main_data_ref = load_tim_data_file($filename);
        index_data_table($main_data_ref) or 
                die " unable to index data table!\n";
        ...
        my $chr = 'chr9';
        my $start = 123456;
        my $index_value = 
                int( $start / $main_data_ref->{index_increment} ); 
        my $starting_row = $main_data_ref->{index}{$chr}{$index_value};
        for (
                my $row = $starting_row;
                $row <= $main_data_ref->{last_row};
                $row++
        ) {
                if (
                        $main_data_ref->{data_table}->[$row][0] eq $chr and
                        $main_data_ref->{data_table}->[$row][1] <= $start and
                        $main_data_ref->{data_table}->[$row][2] >= $start
                ) {
                        # do something
                        # you could stop here, but what if you had overlapping
                        # genomic bins for some odd reason?
                } elsif (
                        $main_data_ref->{data_table}->[$row][0] ne $chr
                ) {
                        # no longer on same chromosome, stop the loop
                        last;
                } elsif (
                        $main_data_ref->{data_table}->[$row][1] > $start
                ) {
                        # moved beyond the window, stop the loop
                        last;
                }
        }
                
find_column_index()

This subroutine helps to find the index number of a dataset or column given only the name. This is useful if the file contents are not in a standard order, for example a typical tim data text file instead of a GFF or BED file.

Pass the subroutine two arguments: 1) the reference to the data structure, and 2) a scalar text string that represents the name. The string will be used in a regular expression pattern, so Perl regular expression notation may be used. The search is performed with the case-insensitive flag. The index position of the first match is returned.

Example

        my $main_data_ref = load_tim_data_file($filename);
        my $chromo_index = find_column_index($main_data_ref, "^chr|seq");
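
The search can be sketched as follows. The subroutine sketch_find_column() is a hypothetical stand-in that walks the column metadata in index order; what the real function returns when nothing matches is not stated here, so the undefined return value is an assumption. The sketch also shows that in a pattern such as "^chr|seq" the anchor applies only to the first alternative.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the search described above.
sub sketch_find_column {
    my ( $data, $pattern ) = @_;
    for my $i ( 0 .. $data->{number_columns} - 1 ) {
        return $i if $data->{$i}{name} =~ /$pattern/i;
    }
    return;    # assumption: undefined when nothing matches
}

my $data = {
    number_columns => 2,
    0              => { name => 'Subsequence', index => 0 },
    1              => { name => 'Chromo',      index => 1 },
};

# "seq" is not anchored, so it matches inside "Subsequence".
my $loose = sketch_find_column( $data, '^chr|seq' );

# Grouping the alternatives anchors both of them.
my $anchored = sketch_find_column( $data, '^(?:chr|seq)' );

my $missing = sketch_find_column( $data, '^score$' );
print "loose=$loose anchored=$anchored\n";
```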
        

AUTHOR

 Timothy J. Parnell, PhD
 Howard Hughes Medical Institute
 Dept of Oncological Sciences
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the GPL (either version 1, or at your option, any later version) or the Artistic License 2.0.