NAME

Bio::ToolBox::Data - Reading, writing, and manipulating a data structure

SYNOPSIS

  use Bio::ToolBox::Data;
  
  ### Create new gene list from database
  my $Data = Bio::ToolBox::Data->new(
        db      => 'hg19',
        feature => 'gene:ensGene',
  );
  
  my $Data = Bio::ToolBox::Data->new(
        db      => 'hg19',
        feature => 'genome',
        win     => 1000,
        step    => 1000,
  );
  
  
  ### Open a pre-existing file
  my $Data = Bio::ToolBox::Data->new(
        file    => 'coordinates.bed',
  );
  
  
  ### Get a specific value
  my $value = $Data->value($row, $column);
  
  
  ### Replace or add a value
  $Data->value($row, $column, $new_value);
  
  
  ### Iterate through a Data structure one row at a time
  my $stream = $Data->row_stream;
  while (my $row = $stream->next_row) {
          # get the positional information from the file data
          # assuming that the input file had these identifiable columns
          my $seq_id = $row->seq_id;
          my $start  = $row->start;
          my $stop   = $row->end;
          
          # generate a Bio::Seq object from the database using 
          # these coordinates 
          my $region = $db->segment($seq_id, $start, $stop);
          
          # modify a row value
          my $value = $row->value($column);
          my $new_value = $value + 1;
          $row->value($column, $new_value);
  }
  
  
  ### write the data to file
  my $success = $Data->write_file(
       filename     => 'new_data.txt',
       gz           => 1,
  );
  print "wrote new file $success\n"; # file is new_data.txt.gz
  

DESCRIPTION

This module works with the primary Bio::ToolBox Data structure. Put simply, it is a complex data structure representing a tab-delimited table (array of arrays), with plenty of options for metadata. Many common bioinformatic file formats are simply tab-delimited text files (think BED, GFF, VCF). Each row is a feature or genomic interval, and each column is a piece of information about that feature, such as name, type, and/or coordinates. We can append to that table additional columns of information, perhaps scores from genomic data sets. We can record metadata regarding how and where we obtained that data. Finally, we can write the updated table to a new file.

METHODS

Initializing the structure

new()

Initialize a new Data structure. This generally requires options, provided as an array of key => values. A new list of features may be obtained from an annotation database or an existing file may be loaded. If you do not pass any options, a new empty structure will be generated for you to populate.

These are the options available.

file => $filename
in => $filename

Provide the path and name of an existing tab-delimited text file from which to load the contents. This is a shortcut to the load_file() method. See that method for more details.

stream => 1

Boolean option indicating that the file should be opened as a file stream. A Bio::ToolBox::Data::Stream object will be returned. This is a convenience option.

noheader => 1

Boolean option indicating that the file does not have file headers, in which case dummy headers are provided. This is not necessary for defined file types that don't normally have file headers, such as BED, GFF, or UCSC files.

parse => 1

Boolean option indicating that a gene annotation table or file should be parsed into SeqFeature objects and a general table of names and IDs representing those objects be generated. The annotation file may be specified in one of two ways: Through the file option above, or in the database metadata of an existing table file representing previously parsed objects.

feature => $type
feature => "$type:$source"
feature => 'genome'

For de novo lists from an annotation database, provide the GFF type or type:source (columns 3 and 2) for collection. A comma delimited string may be accepted (not an array).

For a list of genomic intervals across the genome, specify a feature of 'genome'.

db => $name
db => $path
db => $database_object

Provide the name of the database from which to collect the features. It may be a short name, whereupon it is checked in the Bio::ToolBox configuration file .biotoolbox.cfg for connection information. Alternatively, a path to a database file or directory may be given.

If you already have an opened Bio::DB::SeqFeature::Store database object, you can simply pass that. See Bio::ToolBox::db_helper for more information. However, this in general should be discouraged, since the name of the database will not be properly recorded when saving to file. It is better to simply pass the name of database again; multiple connections to the same database are smartly handled in the background.

win => $integer
step => $integer

If generating a list of genomic intervals, optionally provide the window and step values. Default values are defined in the Bio::ToolBox configuration file .biotoolbox.cfg.

columns => [qw(Column1 Column2 ...)]
datasets => [qw(Column1 Column2 ...)]

When neither a file nor a database to search is given, a new, empty Data object is returned. In this case, you may optionally provide the column names in advance as an anonymous array. You may also optionally provide a general feature name, if desired.

If successful, the method will return a Bio::ToolBox::Data object.

duplicate()

This will create a new Data object containing the same column headers and metadata, but lacking the table content, i.e. no rows of data. File name metadata, if present in the original, is not preserved. The purpose here, for example, is to allow one to selectively copy rows from one Data object to another.

parse_table($filename)

This will parse a gene annotation table into SeqFeature objects. If this is called from an empty Data object, then the table will be filled with the SeqFeature object names and IDs. If this is called from a non-empty Data object, then the table's contents will be associated with the SeqFeature objects using their name and ID. The stored SeqFeature objects can be retrieved using the get_seqfeature() method.

General Metadata

There is a variety of general metadata regarding the Data structure.

The following methods may be used to access or set these metadata properties.

feature()
feature($text)

Returns or sets the name of the features used to collect the list of features. The actual feature types are listed in the table, so this metadata is merely descriptive.

feature_type

Returns one of three specific values describing the contents of the data table inferred by the presence of specific column names. This provides a clue as to whether the table features represent genomic regions (defined by coordinate positions) or named database features. The return values include:

coordinate: Table includes at least chromosome and start
named: Table includes name, type, and/or Primary_ID
unknown: unrecognized

program($name)

Returns or sets the name of the program generating the list.

database($name)

Returns or sets the name or path of the database from which the features were derived.

gff

Returns or sets the version of loaded GFF files. Supported versions include 1, 2, 2.5 (GTF), and 3.

bed

Returns or sets the BED file version. Here, the BED version is simply the number of columns.

ucsc

Returns or sets the UCSC file format version. Here, the version is simply the number of columns. Supported versions include 10 (gene prediction), 11 (refFlat, or gene prediction with gene name), 12 (knownGene table), 15 (extended gene prediction), or 16 (extended gene prediction with bin).

vcf

Returns or sets the VCF file version number. VCF support is limited.

File information

filename
path
basename
extension

Returns the filename, full path, basename, and extension of the filename. Concatenating the last three values will reconstitute the first original filename.

add_file_metadata($filename)

Add filename metadata. This will automatically parse the path, basename, and recognized extension from the passed filename.
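
As a rough illustration of how a filename splits into path, basename, and extension, the core File::Basename module can be used. This is only a sketch: the split_filename() helper and the list of recognized extensions are assumptions made for the example, not the module's own code or list.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical list of recognized extensions, each optionally gzipped
my @suffixes = map { qr/\.$_(?:\.gz)?/ } qw(bed gff3 gff gtf vcf txt);

# Split a filename into path, basename, and extension
sub split_filename {
    my $filename = shift;
    my ($basename, $path, $extension) = fileparse($filename, @suffixes);
    return ($path, $basename, $extension);
}

my ($path, $basename, $extension) = split_filename('/data/run1/coordinates.bed.gz');

# Concatenating the three parts gives back the original filename
print $path . $basename . $extension, "\n";
```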

Comments

Comments are any other commented lines from a text file (lines beginning with a #) that were not parsed as metadata.

comments

Returns a copy of the array containing commented lines.

add_comment($text)

Appends the text string to the comment array.

delete_comment
delete_comment($index)

Deletes a comment. Provide the array index of the comment to delete. If an index is not provided, ALL comments will be deleted!

vcf_headers

For VCF files, this will partially parse the VCF headers into a hash structure that can be queried or manipulated. Each header line is parsed for the primary key, being the first word after the ## prefix, e.g. INFO, FORMAT, FILTER, contig, etc. Simple values are stored directly as the value. For complex entries, such as with INFO and FORMAT, a second level hash is created with the ID extracted and used as the second level key. The value is always the remainder of the string.

For example, the following would be a simple parsed vcf header in code representation.

  $vcf_header = {
     FORMAT => {
        GT => q(ID=GT,Number=1,Type=String,Description="Genotype"),
        AD => q(ID=AD,Number=.,Type=Integer,Description="ref,alt Allelic depths"),
     },
     fileDate => 20150715,
  };

rewrite_vcf_headers

If you have altered the vcf headers exported by the vcf_headers() method, then this method will rewrite the hash structure as new comment lines. Do this prior to writing or saving the Data structure or else you will lose your changed VCF header metadata.
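
To make the two-level structure concrete, here is a minimal stand-alone sketch of parsing header lines into such a hash and writing them back out. The parse_vcf_headers() and vcf_header_lines() functions are hypothetical names for this illustration and are not the module's implementation; the sketch assumes complex entries are wrapped in angle brackets and begin with ID=.

```perl
use strict;
use warnings;

# Parse "##key=value" lines into a two-level hash
sub parse_vcf_headers {
    my @lines = @_;
    my %header;
    foreach my $line (@lines) {
        next unless $line =~ /^##([^=]+)=(.+)$/;
        my ($key, $value) = ($1, $2);
        if ($value =~ /^<(ID=([^,>]+).*)>$/) {
            # complex entry: the ID is the second-level key,
            # the text inside the brackets is the stored value
            $header{$key}{$2} = $1;
        }
        else {
            # simple entry: store the value directly
            $header{$key} = $value;
        }
    }
    return \%header;
}

# Turn the hash back into header lines, sorted by key and ID
sub vcf_header_lines {
    my $header = shift;
    my @lines;
    foreach my $key (sort keys %$header) {
        my $value = $header->{$key};
        if (ref $value eq 'HASH') {
            foreach my $id (sort keys %$value) {
                push @lines, sprintf("##%s=<%s>", $key, $value->{$id});
            }
        }
        else {
            push @lines, sprintf("##%s=%s", $key, $value);
        }
    }
    return @lines;
}
```

Any edit made to the hash, such as changing a Description string, only appears in the output once the lines are regenerated, which is why the rewrite step matters.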

The Data table

The Data table is the array of arrays containing all of the actual information. Rows and columns are indexed using 0-based indexing as with all Perl arrays. Row 0 is always the column header row containing the column names, regardless whether an actual header name existed in the original file format (e.g. BED or GFF formats). Any individual table "cell" can be specified as [$row][$column].

list_columns

Returns an array or array reference of the column names in ascending (left to right) order.

number_columns

Returns the number of columns in the Data table.

last_column

Returns the array index number for the last (right most) column. This number is always 1 less than the value returned by number_columns() due to 0-based indexing.

last_row

Returns the array index number of the last row. Since the header row is index 0, this is also the number of actual content rows.
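
The relationship between these index values can be seen with an ordinary Perl array of arrays. This is a plain-Perl sketch of the layout described above, with made-up data; it does not use the Data object itself.

```perl
use strict;
use warnings;

# A small table: row 0 holds the column names, later rows hold data
my @table = (
    [ 'Chromosome', 'Start', 'Stop', 'Name' ],
    [ 'chr1', 100, 200, 'geneA' ],
    [ 'chr1', 500, 900, 'geneB' ],
    [ 'chr2', 150, 450, 'geneC' ],
);

my $number_columns = scalar @{ $table[0] };  # count of columns
my $last_column    = $#{ $table[0] };        # always number_columns - 1
my $last_row       = $#table;                # also the count of data rows
my $cell           = $table[2][3];           # value at row 2, column 3

print "$number_columns columns, last column $last_column, ",
      "last row $last_row, cell [2][3] is $cell\n";
```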

column_values($index)

Returns an array or array reference representing the values in the specified column. This includes the column header as the first element. Pass the method the column index.

add_column($name)
add_column(\@column_data)

Appends a new column to the Data table at the rightmost position (highest index). It adds the column header name and creates a new column metadata hash. Pass the method one of two possibilities. Pass a text string representing the new column name, in which case no data will be added to the column. Alternatively, pass an array reference, and the contents of the array will become the column data. If the Data table already has rows, then the passed array reference must have the same number of elements.

It returns the new column index if successful.

copy_column($index)

This will copy a column, appending the duplicate column at the rightmost position (highest index). It will duplicate column metadata as well. It will return the new index position.

delete_column($index1, $index2, ...)

Deletes one or more specified columns. Any remaining columns rightwards will have their indices shifted down appropriately. If you had identified one of the shifted columns, you may need to re-find or calculate its new index.

reorder_column($index1, $index2, ...)

Reorders columns into the specified order. Provide the new desired order of indices. Columns could be duplicated or deleted using this method. The columns will adopt their new index numbers.
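
Reordering by a list of indices is essentially an array slice applied to every row. The following plain-Perl sketch, using a hypothetical reorder_columns() helper on an array of arrays, shows why an index listed twice duplicates a column and an index left out deletes it. It ignores column metadata, which the real method also has to maintain.

```perl
use strict;
use warnings;

# Return a new table whose columns follow the given index order
sub reorder_columns {
    my ($table, @order) = @_;
    return [ map { [ @{$_}[@order] ] } @$table ];
}

my $table = [
    [ 'Name',  'Start', 'Stop' ],
    [ 'geneA', 100,     200    ],
];

# Put Stop first, keep Name twice, and drop Start
my $new = reorder_columns($table, 2, 0, 0);
print join("\t", @$_), "\n" foreach @$new;
```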

add_row
add_row(\@values)
add_row($Row)

Add a new row of data values to the end of the Data table. Optionally provide either a reference to an array of values to put in the row, or pass a Bio::ToolBox::Data::Feature Row object, such as one obtained from another Data object. If the number of columns does not match, the array is filled up with null values for missing columns, or excess values are dropped.
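
The padding and truncation rule can be sketched in plain Perl as follows. The fit_row() helper is hypothetical, and the '.' used as the null placeholder is an assumption for this illustration.

```perl
use strict;
use warnings;

# Pad or truncate a list of values to fit a table with $ncol columns.
# The '.' null placeholder is an assumption made for this sketch.
sub fit_row {
    my ($values, $ncol, $null) = @_;
    $null = '.' unless defined $null;
    my @row = @$values;
    push @row, ($null) x ($ncol - @row) if @row < $ncol;  # fill missing columns
    splice(@row, $ncol) if @row > $ncol;                  # drop excess values
    return \@row;
}

print join("\t", @{ fit_row([ 'chr1', 100 ], 4) }), "\n";
```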

delete_row($row1, $row2, ...)

Deletes one or more specified rows. Rows are spliced out highest to lowest index to avoid issues. Be very careful deleting rows while simultaneously iterating through the table!
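
The reason for splicing from the highest index down is that removing a row shifts every later row up by one. A plain-Perl sketch on an array of arrays, with a hypothetical delete_rows() helper, shows the idea:

```perl
use strict;
use warnings;

# Remove the given row indices from a table (array of arrays) in place.
# Working from the highest index down keeps the lower indices valid.
sub delete_rows {
    my ($table, @indices) = @_;
    foreach my $i (sort { $b <=> $a } @indices) {
        splice(@$table, $i, 1);
    }
    return scalar @$table;    # number of rows left, header included
}

my $table = [ ['Name'], ['r1'], ['r2'], ['r3'], ['r4'] ];
delete_rows($table, 1, 3);
print join(' ', map { $_->[0] } @$table), "\n";
```

Had the rows been removed in ascending order, the second splice would have hit the wrong row, since the original index 3 would have become index 2 after the first deletion.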

row_values($row)

Returns a copy of an array for the specified row index. Modifying this returned array does not migrate back to the Data table; use the value method below instead.

value($row, $column)
value($row, $column, $new_value)

Returns or sets the value at a specific row and column index. Index positions are 0-based (header row is index 0).

Column Metadata

Each column has metadata. Each metadata is a series of key => value pairs. The minimum keys are 'index' (the 0-based index of the column) and 'name' (the column header name). Additional keys and values may be queried or set as appropriate. When the file is written, these are stored as commented metadata lines at the beginning of the file.

name($index)
name($index, $new_name)

Convenient method to return the name of the column given the index number. A column may also be renamed by passing a new name.

metadata($index, $key)
metadata($index, $key, $new_value)

Returns or sets the metadata value for a specific $key for a specific column $index.

This may also be used to add a new metadata key. Simply provide the name of a new $key that is not present.

If no key is provided, then a hash or hash reference is returned representing the entire metadata for that column.

copy_metadata($source, $target)

This method will copy the metadata (everything except name and index) between the source column and target column. Returns 1 if successful.

delete_metadata($index, $key);

Deletes a column-specific metadata $key and value for a specific column $index. If a $key is not provided, then all metadata keys for that index will be deleted.

find_column($name)

Searches the column names for the specified column name. This employs a case-insensitive grep search, so simple substitutions may be made.
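
A case-insensitive pattern search over column names might look like the sketch below. The find_column_index() helper is hypothetical, and returning the first match is an assumption; the module's own matching rules may differ.

```perl
use strict;
use warnings;

# Return the index of the first column name matching the pattern,
# ignoring case, or nothing if no column matches
sub find_column_index {
    my ($names, $pattern) = @_;
    foreach my $i (0 .. $#{$names}) {
        return $i if $names->[$i] =~ /$pattern/i;
    }
    return;
}

my @names = qw(Chromosome Start Stop Name);
my $start = find_column_index(\@names, 'start');     # matches "Start"
my $chrom = find_column_index(\@names, 'chr|seq');   # alternatives in one pattern
print "start column $start, chromosome column $chrom\n";
```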

chromo_column
start_column
stop_column
strand_column
name_column
type_column
id_column

These methods will return the identified column best matching the description. Returns undef if that column is not present. These use the find_column() method with a predefined list of aliases.

Efficient Data Access

Most of the time we need to iterate over the Data table, one row at a time, collecting data or processing information. These methods simplify the process.

iterate(sub {})

This method will process a code reference on every row in the data table. Pass a subroutine or code reference. The subroutine will receive the row as a Bio::ToolBox::Data::Feature object. With this object, you can retrieve values, set values, and add new values. For example

    $Data->iterate( sub {
       my $row = shift;
       my $number = $row->value($index);
       my $log_number = log($number);
       $row->value($index, $log_number);
    } );

row_stream()

This returns a Bio::ToolBox::Data::Iterator object, which has one method, next_row(). Call this method repeatedly until it returns undef to work through each row of data.

Users of the Bio::DB family of database adaptors may recognize the analogy to the seq_stream() method.

next_row()

Called from a Bio::ToolBox::Data::Iterator object, it returns a Bio::ToolBox::Data::Feature object. This object represents the values in the current Data table row.

An example using the iterator is shown below.

  my $stream = $Data->row_stream;
  while (my $row = $stream->next_row) {
     # each $row is a Bio::ToolBox::Data::Feature object
     # representing the row in the data table
     my $value = $row->value($index);
     # do something with $value
  }

SeqFeature Objects

SeqFeature objects corresponding to data rows can be stored in the Data object. This can be useful if the SeqFeature object is not readily available from a database or is processor intensive in generating or parsing. Note that storing large numbers of objects will increase memory usage.

store_seqfeature($row_index, $seqfeature)

Stores the SeqFeature object for the given row index. Only one SeqFeature object can be stored per row.

get_seqfeature($row_index)

Retrieves the SeqFeature object for the given row index.

Data Table Functions

These methods alter the Data table en masse.

verify()

This method will verify the Data structure, including the metadata and the Data table. It ensures that the table has the correct number of rows and columns as described in the metadata, and that each column has the basic metadata.

If the Data structure is marked as a GFF or BED structure, then the table is checked that the structure matches the proper format. If not, for example when additional columns have been added, then the GFF or BED value is set to null.

This method is automatically called prior to writing the Data table to file.

sort_data($index, $direction);

This method will sort the Data table by the values in the indicated column. It will automatically determine whether the contents of the column are numbers or alphanumeric, and will sort accordingly, either numerically or asciibetically. The first non-null value in the column is used to determine. The sort may fail if the values are not consistent. Pass a second optional value to indicate the direction of the sort. The value should be either 'i' for 'increasing' or 'd' for 'decreasing'. The default order is increasing.
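
The type detection described above can be sketched in plain Perl. The sort_table() function is a hypothetical stand-in working on an array of arrays, and treating '.' as a null value is an assumption for this example.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Sort the data rows of a table (row 0 is the header) by one column.
# Direction is 'i' (increasing, default) or 'd' (decreasing).
sub sort_table {
    my ($table, $index, $direction) = @_;
    my ($header, @rows) = @$table;

    # the first non-null value decides between numeric and string comparison
    my ($first) = grep { defined($_) && length($_) && $_ ne '.' }
                  map  { $_->[$index] } @rows;
    my $numeric = defined($first) && looks_like_number($first);

    my @sorted = $numeric
        ? (sort { $a->[$index] <=> $b->[$index] } @rows)
        : (sort { $a->[$index] cmp $b->[$index] } @rows);
    @sorted = reverse @sorted if defined $direction && $direction eq 'd';
    return [ $header, @sorted ];
}
```

Because only the first value is inspected, a column that mixes numbers and text will be compared with whichever operator that first value selects, which is one way the sort can go wrong on inconsistent data.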

gsort_data

This method will sort the Data table by increasing genomic coordinates. It requires the presence of chromosome and start (or position) columns, identified by their column names. These are automatically identified. Failure to find these columns means a failure to sort the table. Chromosome names are sorted first by their digits (e.g. chr2 before chr10), and then alphanumerically. Base coordinates are sorted by increasing value. Identical positions are kept in their original order.
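
The ordering rules can be sketched as a multi-key sort in plain Perl. The genomic_sort() function below is a hypothetical illustration on an array of data rows (no header row), not the module's implementation. Placing chromosomes without digits after the numbered ones is an assumption, and the original row position is used as the final tie-breaker to keep identical positions in their original order.

```perl
use strict;
use warnings;

# Sort rows by chromosome (digits first, then name) and start position
sub genomic_sort {
    my ($rows, $chr_i, $start_i) = @_;
    my $order = 0;
    my @keyed = map {
        my ($digits) = $_->[$chr_i] =~ /(\d+)/;   # undef if no digits
        [ $_, $digits, $order++ ];
    } @$rows;
    my @sorted = sort {
        (defined $a->[1] ? 0 : 1) <=> (defined $b->[1] ? 0 : 1)  # numbered first
            or ($a->[1] // 0) <=> ($b->[1] // 0)                 # by digits
            or $a->[0][$chr_i] cmp $b->[0][$chr_i]               # then by name
            or $a->[0][$start_i] <=> $b->[0][$start_i]           # then by start
            or $a->[2] <=> $b->[2];                              # original order
    } @keyed;
    return [ map { $_->[0] } @sorted ];
}
```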

splice_data($current_part, $total_parts)

This method will splice the Data table into $total_parts number of pieces, retaining the $current_part piece. The other parts are discarded. This method is intended to be used when a program is forked into separate processes, allowing each child process to work on a subset of the original Data table.

Two values are passed to the method. The first is the current part number, 1-based. The second value is the total number of parts that the table should be divided, corresponding to the number of concurrent processes. One easy approach to forking is to use Parallel::ForkManager. The example below shows how to fork into four concurrent processes.

        use Parallel::ForkManager;

        my $Data = Bio::ToolBox::Data->new(file => $file);
        my $pm = Parallel::ForkManager->new(4);
        for my $i (1..4) {
                $pm->start and next;
                ### in child ###
                $Data->splice_data($i, 4);
                # do something with this portion
                # then save to a temporary unique file;
                # the braces stop Perl from reading "$file_" as a variable name
                $Data->save("${file}_$i");
                $pm->finish;
        }
        $pm->wait_all_children;
        # reload children files
        $Data->reload_children(glob "${file}_*");

Since each forked child process is separate from its parent process, its contents must be reloaded into the current Data object. The Parallel::ForkManager documentation recommends going through a disk file intermediate. Therefore, write each child Data object to file using a unique name. Once all children have been reaped, they can be reloaded into the current Data object using the reload_children() method.

Remember that if you fork your script into child processes, any database connections must be re-opened; they are typically not clone safe. If you have an existing database connection by using the open_database() method, it should be automatically re-opened for you when you use the splice_data() method, but you will need to call open_database() again in the child process to obtain the new database object.
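
To see how a table might be divided, the sketch below computes the range of data rows belonging to a given part. The part_range() function is hypothetical, and the exact boundaries used by splice_data() may differ; here every part gets an equal share and the final part absorbs any remainder, which is an assumption.

```perl
use strict;
use warnings;

# Return the first and last data row index (1-based, since row 0 is
# the header) for one part out of $total_parts
sub part_range {
    my ($last_row, $current_part, $total_parts) = @_;
    my $size  = int($last_row / $total_parts);
    my $start = 1 + ($current_part - 1) * $size;
    my $end   = $current_part == $total_parts
              ? $last_row                  # last part takes the remainder
              : $current_part * $size;
    return ($start, $end);
}

# Ten data rows split four ways
foreach my $part (1 .. 4) {
    my ($start, $end) = part_range(10, $part, 4);
    print "part $part: rows $start to $end\n";
}
```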

reload_children(@children_files)

Discards the current data table in memory and reloads two or more files written from forked children processes. Provide the name of the child files in the order you want them loaded. The files will be automatically deleted if successfully loaded. Returns the number of lines reloaded on success.

File Functions

The Data table may be read in from a file or written out as a file. In all cases, it is a tab-delimited text file, whether as an ad hoc table or a specific bioinformatic format, e.g. BED, GFF, etc. Multiple common file formats are supported. Column headers are assumed, except for formats that do not have them, e.g. BED, GFF, etc. Metadata may be included as commented lines at the beginning of the file, prefixed with a # symbol. Reading and writing gzip compressed files is fully supported.

load_file($filename)

This will load a file into a new, empty Data table. This function is called automatically when a filename is provided to the new() function. The existence of the file is first checked (appending common missing extensions as necessary), metadata and column headers processed and/or generated from default settings, the content loaded into the table, and the structure verified. Error messages may be printed if the structure or format is inconsistent or doesn't match the expected format, e.g. a file with a .bed extension doesn't match the UCSC specification. Pass the name of the file.

taste_file($filename)

Tastes, or checks, a file for a certain flavor, or known gene file formats. This is based on file extension, metadata headers, and/or file content in the first 10 lines or so. Returns a string based on the file format. Values include gff, bed, ucsc, or undefined. Useful for determining if the file represents a known gene table format that lacks a defined file extension, e.g. UCSC formats.

save()
write_file()
write_file($filename)
write_file(%options)

These methods will write the Data structure out to file. It will be first verified as to proper structure. Opened BED and GFF files are checked to see if their structure is maintained. If so, they are written in the same format; if not, they are written as regular tab-delimited text files. You may pass additional options.

filename => $filename

Optionally pass a new filename. Required for new objects; previously opened files may be overwritten if a new name is not provided. If necessary, the file extension may be changed; for example, BED files that no longer match the defined format lose the .bed and gain a .txt extension. Compression may add or strip .gz as appropriate. If a path is not provided, the current working directory is used.

gz => boolean

Change the compression status of the output file. The default is to maintain the status of the original opened file.

If the file save is successful, it will return the full path and name of the saved file, complete with any changes to the file extension.

summary_file(%options)

Write a separate file summarizing columns of data (mean values). The mean value of each column becomes a row value, and each column header becomes a row identifier (i.e. the table is transposed). The best use of this is to summarize the mean profile of windowed data collected across a feature. See the Bio::ToolBox scripts get_relative_data.pl and get_binned_data.pl as examples. You may pass these options. They are optional.

filename => $filename

Pass an optional new filename. The default is to take the basename and append "_summed" to it.

startcolumn => $index
stopcolumn => $index

Provide the starting and ending columns to summarize. The default start is the leftmost column without a recognized standard name. The default ending column is the last rightmost column. Indexes are 0-based.

dataset => $name

Pass a string that is the name of the dataset. This could be collected from the metadata, if present. This will become the name of the score column if defined.

The name of the summarized column is either the provided dataset name, the defined basename in the metadata of the Data structure, or a generic name. If successful, it will return the name of the file saved.
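
The transposition can be pictured with a plain-Perl sketch. The summarize_columns() function and the 'Column' and 'Mean' header names are hypothetical; skipping non-numeric values and reporting '.' for an empty column are assumptions, not documented behavior.

```perl
use strict;
use warnings;
use List::Util qw(sum);
use Scalar::Util qw(looks_like_number);

# Build a summary table with one row per summarized column:
# the column header becomes the row identifier, the mean its value
sub summarize_columns {
    my ($table, $start, $stop) = @_;
    my ($header, @rows) = @$table;
    my @summary = ( [ 'Column', 'Mean' ] );
    foreach my $i ($start .. $stop) {
        # keep only numeric values from this column
        my @values = grep { looks_like_number($_) } map { $_->[$i] } @rows;
        my $mean = @values ? sum(@values) / @values : '.';
        push @summary, [ $header->[$i], $mean ];
    }
    return \@summary;
}
```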

Verifying Datasets

When working with row Features and collecting scores, the dataset from which you are collecting must be verified prior to collection. This ensures that the proper database adaptor is available and loaded, and that the dataset is correctly specified (otherwise nothing would be collected). This verification is normally performed transparently when you call get_score() or get_position_scores(). However, datasets may be explicitly verified prior to calling the score methods.

verify_dataset($dataset)
verify_dataset($dataset, $database)

Pass the name of the dataset (GFF type or type:source) for a GFF3-based database, e.g. Bio::DB::SeqFeature::Store, or path and file name for a data file, e.g. Bam, BigWig, BigBed, or USeq file. If a separate database is being used, pass the name or opened database object as a second parameter. For more advanced options, see "verify_or_request_feature_types" in Bio::ToolBox::db_helper.

The name of the verified dataset, with a prefix if necessary, is returned.

AUTHOR

 Timothy J. Parnell, PhD
 Dept of Oncological Sciences
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.