NAME
Bio::ToolBox::Data::Stream - Read, Write, and Manipulate Data File Line by Line
SYNOPSIS
    use Bio::ToolBox::Data;

    ### Open a pre-existing file
    my $Stream = Bio::ToolBox::Data->new(
        in     => 'regions.bed',
        stream => 1,
    );

    # or directly
    my $Stream = Bio::ToolBox::Data::Stream->new(
        in => 'regions.bed',
    );

    ### Open a new file for writing
    my $Stream = Bio::ToolBox::Data::Stream->new(
        out     => 'output.txt',
        columns => [qw(chromosome start stop name)],
    );

    ### Working line by line
    while (my $line = $Stream->next_line) {
        # get the positional information from the file data
        # assuming that the input file had these identifiable columns
        # each line is a Bio::ToolBox::Data::Feature item
        my $seq_id = $line->seq_id;
        my $start  = $line->start;
        my $stop   = $line->end;

        # change values
        $line->value(1, 100); # index, new value
    }

    ### Working with two file streams
    my $inStream = Bio::ToolBox::Data::Stream->new(
        in => 'regions.bed',
    );
    my $outStream = $inStream->duplicate('regions_ext100.bed');
    my $sc = $inStream->start_column;
    my $ec = $inStream->end_column;
    while (my $line = $inStream->next_line) {
        # adjust positions by 100 bp
        my $s = $line->start;
        my $e = $line->end;
        $line->value($sc, $s - 100);
        $line->value($ec, $e + 100);
        $outStream->write_row($line);
    }

    ### Finishing
    # close your file handles when you are done
    $Stream->close_fh;
DESCRIPTION
This module works similarly to the Bio::ToolBox::Data object, except that rows are read from a file handle rather than from a memory structure. This allows very large files to be read, manipulated, and even written without slurping the entire contents into memory.
For an introduction to the Bio::ToolBox::Data object and methods, refer to its documentation and the Bio::ToolBox::Data::Feature documentation.
Typically, manipulations are performed on only one row at a time, not on the entire table. Therefore, large-scale table manipulations, such as sorting, are not possible.
A typical workflow consists of opening two Stream objects, one for reading and one for writing. Rows are read, one at a time, from the read Stream, manipulated as necessary, and then written to the write Stream. Each row is passed as a Bio::ToolBox::Data::Feature object. It can be manipulated as such, or the corresponding values may be dumped as an array. Working with the row data as an array is required when adding or deleting columns, since these manipulations are not allowed with a Feature object. The write Stream can then be passed either the Feature object or the array of values to be written.
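A minimal sketch of this workflow, assuming a hypothetical input file and scoring routine; row_values() is the Bio::ToolBox::Data::Feature method that dumps the row as an array:

    use Bio::ToolBox::Data::Stream;

    # open a read Stream and duplicate it as a write Stream
    my $inStream  = Bio::ToolBox::Data::Stream->new( in => 'regions.bed' );
    my $outStream = $inStream->duplicate('regions_scored.txt');
    $outStream->add_column('Score');    # columns may still be changed here

    while ( my $row = $inStream->next_line ) {
        # dump the row as an array, since Feature objects
        # cannot accommodate added or deleted columns
        my @values = $row->row_values;
        push @values, calculate_score($row);    # hypothetical scoring routine
        $outStream->write_row( \@values );
    }
    $inStream->close_fh;
    $outStream->close_fh;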
METHODS
Initializing the structure
- new()
Create a new Bio::ToolBox::Data::Stream object. For simplicity, a new file may also be opened using the Bio::ToolBox::Data new function.
    my $Stream = Bio::ToolBox::Data->new(
        stream => 1,
        in     => $filename,
    );
Options to the new function are listed below. Streams are inherently either read or write mode, determined by the mode given through the options.
- in => $filename
Provide the path and name of the file to open for reading. File types are recognized by the extension, and compressed files (.gz) are supported. File types supported include all those listed in Bio::ToolBox::file_helper.
- out => $filename
Provide the path and name of the file to open for writing. No check is made for pre-existing files; if it exists it will be overwritten! A new data object is prepared, therefore column names must be provided.
- noheader => 1
Boolean option indicating that the input file does not have file headers, in which case dummy headers are provided. This is not necessary for defined file types that don't normally have file headers, such as BED, GFF, or UCSC files. Ignored for output files.
- columns => [qw(Column1 Column2 ...)]
When a new file is written, provide the names of the columns as an anonymous array. If no columns are provided, then a completely empty data structure is made. Columns must be added with the add_column() method below.
- gff => $gff_version
When writing a GFF file, provide a GFF version. When this is given, the nine standard column names and metadata are automatically provided based on the file format specification. Note that the column names are not actually written in the file, but are maintained for internal use. Acceptable versions include 1, 2, 2.5 (GTF), and 3 (GFF3).
- bed => $number_of_bed_columns
When writing a BED file, provide the number of bed columns that the file will have. When this is given, the standard column names and metadata will be automatically provided based on the standard file format specification. Note that column names are not actually written to the file, but are maintained for internal use. Acceptable values are integers from 3 to 12.
- ucsc => $number_of_columns
When writing a UCSC-style file format, provide the number of columns that the file will have. When this is given, the standard column names and metadata will be automatically provided based on the file format specification. Note that column names are not actually written to the file, but are maintained for internal use. Acceptable values include 10 (refFlat without gene names), 11 (refFlat with gene names), 12 (knownGene gene prediction table), and 15 (an extended gene prediction or genePredExt table).
- gz => $gz
Optional boolean value that indicates whether the output file should be written with compression. This can also be inferred from the file name.
- duplicate($filename)
For an opened-to-read Stream object, you may duplicate the object as a new opened-to-write Stream object that maintains the same columns and metadata. A new, different filename must be provided.
General Metadata
There is a variety of general metadata regarding the Data structure that is available.
The following methods may be used to access or set these metadata properties. Note that metadata is only written at the beginning of the file, and so must be set prior to iterating through the file.
- feature()
- feature($text)
Returns or sets the name of the features used to collect the list of features. The actual feature types are listed in the table, so this metadata is merely descriptive.
- feature_type
Returns one of three specific values describing the contents of the data table, inferred from the presence of specific column names. This provides a clue as to whether the table features represent genomic regions (defined by coordinate positions) or named database features. The return values include coordinate, named, and unknown.
- program($name)
Returns or sets the name of the program generating the list.
- database($name)
Returns or sets the name or path of the database from which the features were derived.
- gff
Returns or sets the version of loaded GFF files. Supported versions include 1, 2, 2.5 (GTF), and 3.
- bed
Returns or sets the BED file version. Here, the BED version is simply the number of columns.
- ucsc
Returns or sets the UCSC file format version. Here, the version is simply the number of columns. Supported versions include 10 (gene prediction), 11 (refFlat, or gene prediction with gene name), 12 (knownGene table), 15 (extended gene prediction), or 16 (extended gene prediction with bin).
- vcf
Returns or sets the VCF file version number. VCF support is limited.
File information
- filename
- path
- basename
- extension
Returns the filename, full path, basename, and extension of the file. Concatenating the last three values will reconstitute the original filename.
- add_file_metadata($filename)
Add filename metadata. This will automatically parse the path, basename, and recognized extension from the passed filename.
Comments
Comments are the commented lines in a text file (lines beginning with a #) that were not parsed as metadata.
- comments
Returns a copy of the array containing commented lines.
- add_comment($text)
Appends the text string to the comment array.
- delete_comment
- delete_comment($index)
Deletes a comment. Provide the array index of the comment to delete. If an index is not provided, ALL comments will be deleted!
- vcf_headers
For VCF files, this will partially parse the VCF headers into a hash structure that can be queried or manipulated. Each header line is parsed for the primary key, being the first word after the ## prefix, e.g. INFO, FORMAT, FILTER, contig, etc. Simple entries are stored directly as the value. For complex entries, such as with INFO and FORMAT, a second-level hash is created with the extracted ID used as the second-level key. The value is always the remainder of the string.
For example, the following would be a simple parsed VCF header in code representation.

    $vcf_header = {
        FORMAT => {
            GT => q(ID=GT,Number=1,Type=String,Description="Genotype"),
            AD => q(ID=AD,Number=.,Type=Integer,Description="ref,alt Allelic depths"),
        },
        fileDate => 20150715,
    };
- rewrite_vcf_headers
If you have altered the vcf headers exported by the vcf_headers() method, then this method will rewrite the hash structure as new comment lines. Do this prior to writing the new file stream or else you will lose your changed VCF header metadata.
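For example, a custom INFO entry might be added before writing; the MYANN key shown here is hypothetical:

    my $headers = $Stream->vcf_headers;
    $headers->{INFO}{MYANN} =    # hypothetical new INFO field
        q(ID=MYANN,Number=1,Type=String,Description="Custom annotation");
    $Stream->rewrite_vcf_headers;    # must be called before the first write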
Column Metadata
Information about the columns may be accessed. This includes the names of the columns and shortcuts to specific identifiable columns, such as name and coordinates. In addition, each column may have additional metadata, consisting of a series of key => value pairs. The minimum keys are 'index' (the 0-based index of the column) and 'name' (the column header name). Additional keys and values may be queried or set as appropriate. When the file is written, these are stored as commented metadata lines at the beginning of the file. Setting metadata is futile after reading or writing has begun.
- list_columns
Returns an array or array reference of the column names in ascending (left to right) order.
- number_columns
Returns the number of columns in the Data table.
- last_column
Returns the array index of the last (rightmost) column in the Data table.
- name($index)
Convenient method to return the name of the column given the index number.
- metadata($index, $key)
- metadata($index, $key, $new_value)
Returns or sets the metadata value for a specific $key for a specific column $index.
This may also be used to add a new metadata key: simply provide the name of a new $key that is not already present, along with its value.
If no key is provided, then a hash or hash reference is returned representing the entire metadata for that column.
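For example, with a hypothetical column at index 2 and an illustrative 'log2' key:

    my $name = $Stream->metadata( 2, 'name' );    # query the 'name' key
    $Stream->metadata( 2, 'log2', 1 );            # add a new key => value pair
    my $md = $Stream->metadata(2);                # hash reference of all metadata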
- find_column($name)
Searches the column names for the specified column name. This employs a case-insensitive grep search, so simple substitutions may be made.
- chromo_column
- start_column
- stop_column
- strand_column
- name_column
- type_column
- id_column
These methods will return the identified column best matching the description, or undef if that column is not present. These use the find_column() method with a predefined list of aliases.
Modifying Columns
These methods allow modification to the number and order of the columns in a Stream object. These methods can only be employed prior to opening a file handle for writing, i.e. before the first write_row() method is called. This enables one, for example, to duplicate a read-only Stream object to create a write-only Stream, add or delete columns, and then begin the row iteration.
- add_column($name)
Appends a new column at the rightmost position (highest index). It adds the column header name and creates a new column metadata hash. Pass a text string representing the new column name. It returns the new column index if successful.
- copy_column($index)
This will copy a column, appending the duplicate column at the rightmost position (highest index). It will duplicate column metadata as well. It will return the new index position.
- delete_column($index1, $index2, ...)
Deletes one or more specified columns. Any remaining columns rightwards will have their indices shifted down appropriately. If you had identified one of the shifted columns, you may need to re-find or calculate its new index.
- reorder_column($index1, $index2, ...)
Reorders columns into the specified order. Provide the new desired order of indices. Columns could be duplicated or deleted using this method. The columns will adopt their new index numbers.
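For example, a duplicated Stream might be trimmed and rearranged before the first row is written; the file names and indices are illustrative:

    my $inStream  = Bio::ToolBox::Data::Stream->new( in => 'data.txt' );
    my $outStream = $inStream->duplicate('data_subset.txt');
    $outStream->delete_column(5);               # drop the sixth column (0-based)
    $outStream->reorder_column( 0, 2, 1, 3, 4 );
    # write_row() may now be called; further column changes are not allowed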
Row Data Access
Once a file Stream object has been opened, and metadata and/or columns adjusted as necessary, then the file contents can be iterated through, one row at a time. This is typically a one-way direction. If you need to go back or start over, the easiest thing to do is re-open the file as a new Stream object.
There are two main methods, next_row() for reading and write_row() for writing. They cannot and should not be used on the same Stream object.
- next_row()
- next_line()
- read_line()
This method reads the next line in the file handle and returns a Bio::ToolBox::Data::Feature object. This object represents the values in the current file row.
Note that strand values and 0-based start coordinates are automatically converted to BioPerl conventions if required by the file type.
- add_row( $data )
- add_line( $data )
- write_row( $data )
- write_line( $data )
This method writes a new row or line to a file handle. The first time this method is called the file handle is automatically opened for writing. Up to this point, columns may be manipulated. After this point, columns cannot be adjusted (otherwise the file structure becomes inconsistent).
This method may be used in one of three ways, based on the type of data that is passed.
- A Bio::ToolBox::Data::Feature object
A Feature object representing a row from another Bio::ToolBox::Data data table or Stream. The values from this object will be automatically obtained. Modified strand and 0-based coordinates may be adjusted back as necessary.
- An array reference of values
Pass an array reference of values. The number of elements should match the number of expected columns. The values will be automatically joined using tabs. This implementation should be used if you are using values from another Stream and the number of columns has been modified.
Manipulation of strand and 0-based starts may be performed if the metadata indicates this should be done.
- A string
Pass a text string. This assumes the column values are already concatenated with tabs. A newline character is appended if one is not included. No data manipulation (strand or 0-based starts) or sanity checking of the required number of columns is performed. Use with caution!
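The three forms might look like this; the variable names are illustrative:

    # 1. a Feature object from another Data table or Stream
    $outStream->write_row($row);

    # 2. an array reference of values, joined with tabs automatically
    $outStream->write_row( [ $chromo, $start, $stop, $name ] );

    # 3. a pre-formatted string; no checking or conversion is performed
    $outStream->write_row( join( "\t", $chromo, $start, $stop, $name ) . "\n" );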
- iterate( \&sub )
A convenience method that will execute a code reference for every line in the file. Pass a subroutine or code reference. The subroutine will receive the line as a Bio::ToolBox::Data::Feature object, just as with the read_line() method. See also the Bio::ToolBox::Data iterate() method.
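For example, printing coordinates for every line:

    $Stream->iterate( sub {
        my $row = shift;    # a Bio::ToolBox::Data::Feature object
        printf "%s:%d-%d\n", $row->seq_id, $row->start, $row->end;
    } );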
File Handle methods
The methods below work with the file handle. When you are finished with a Stream, you should be kind and close the file handle properly.
- mode
Returns the write mode of the Stream object. Read-only objects return false (0) and write-only Stream objects return true (1).
- close_fh
Closes the file handle.
- fh
Returns the IO::File compatible file handle object representing the file handle. Use with caution.
AUTHOR
Timothy J. Parnell, PhD
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112
This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.