NAME

Bio::ToolBox::Data::file - File functions for the Bio::ToolBox::Data family

DESCRIPTION

File methods for reading and writing data files for both Bio::ToolBox::Data and Bio::ToolBox::Data::Stream objects. This module should not be used directly. See the respective modules for more information.

These methods provide file IO for the Bio::ToolBox::Data data structure. They work with any generic tab-delimited text file of rows and columns. They also handle the comment lines, file metadata, and column-specific metadata that are custom to Bio::ToolBox programs. Special file formats used in bioinformatics, for example GFF and BED files, are automatically recognized by their file extension, and the appropriate metadata is added.

Files opened using these subroutines are stored in a specific complex data structure described below. This structure allows for data access and also records metadata about each column (dataset) and the file in general. This metadata helps preserve a "history" of the dataset: where it came from, how it was collected, and how it was processed.

Additional subroutines are also present for general processing and output of this data structure.

The data file format is described below, followed by a description of the data structure.

RECOGNIZED FILE FORMATS

Bio::ToolBox recognizes a number of standard bioinformatic file formats, almost all of which are identified by their file extension. Recognition is NOT guaranteed if an alternate file extension is used.

These formats include:

BED .bed .bedgraph .bdg

Bed files must have 3-12 columns. BedGraph files must have 4 columns.

GFF .gff .gff3 .gtf

These may also be recognized by the gff-version pragma. These must have 9 columns.

UCSC tables .refFlat .genePred

These are typically recognized by the number of columns, and can include simple refFlat, gene prediction, extended gene prediction, and knownGene tables.

Peak files .narrowPeak .broadPeak

These are special "BED6+4" file formats.

CDT .cdt

Cluster data files used with Cluster 3.0 and Treeview.

SGR

Rare file format of chromosome, position, score.

TEXT .txt

Almost any tab-delimited text file can be loaded.

Compression .gz .bz2

Compressed files are usually read through an external decompression program. All of the above formats can be loaded as compressed files.
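
As an illustration of extension-based recognition, the following sketch maps a file name to one of the format labels listed above. The function guess_format and its lookup table are hypothetical and are not part of Bio::ToolBox; the module's own logic may also consult pragmas and column counts. SGR is left out of the table because no extension is listed for it above.

```perl
use strict;
use warnings;

# Hypothetical lookup table built from the extensions listed above.
my %FORMAT_FOR = (
    bed        => 'BED',  bedgraph   => 'BED',  bdg => 'BED',
    gff        => 'GFF',  gff3       => 'GFF',  gtf => 'GFF',
    refflat    => 'UCSC', genepred   => 'UCSC',
    narrowpeak => 'Peak', broadpeak  => 'Peak',
    cdt        => 'CDT',
    txt        => 'TEXT',
);

# Return a format label for a file name, or undef if it is not recognized.
sub guess_format {
    my ($filename) = @_;

    # Strip a trailing compression extension first.
    ( my $name = $filename ) =~ s/\.(?:gz|bz2)$//i;

    # Capture the last extension, ignoring any directory part.
    my ($ext) = $name =~ /\.([^.\/]+)$/;
    return unless defined $ext;
    return $FORMAT_FOR{ lc $ext };
}

for my $file (qw(peaks.narrowPeak genes.gtf.gz scores.bdg notes.doc)) {
    my $format = guess_format($file) // 'unrecognized';
    print "$file\t$format\n";
}
```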

DEFAULT BIO::TOOLBOX DATA TEXT FILE FORMAT

When not writing to a defined format, e.g. BED or GFF, a Bio::ToolBox Data structure is written as a simple tab-delimited text file, with the first line being the column header names. Such files are easily parsed by other programs.

If additional metadata is included in the Data object, it is written as comment lines, each prefixed by "# ", before the table. Metadata can describe the data within the table with regard to its type, source, methodology, history, and processing. The metadata is designed to be read by both humans and computers. Files opened without this metadata are assigned basic default metadata for each column.
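
The layout described above (comment lines, then a header line, then data rows) can be read with a few lines of plain Perl. This is only a sketch of the file layout, not the module's parser. The file content is invented for the example, and the exact form of the metadata lines (a key followed by a space and a value) is an assumption.

```perl
use strict;
use warnings;

# Invented example content: two metadata comment lines, a header, two rows.
my $content = join "\n",
    '# Program example_script.pl',
    '# Feature gene',
    "Name\tScore",
    "geneA\t1.5",
    "geneB\t2.25",
    '';

my ( @comments, @header, @rows );

# Read the string through an in-memory filehandle.
open my $fh, '<', \$content or die "cannot open in-memory file: $!";
while ( my $line = <$fh> ) {
    chomp $line;
    if ( $line =~ /^#\s?(.*)$/ ) {
        push @comments, $1;                  # metadata or comment line
    }
    elsif ( not @header ) {
        @header = split /\t/, $line;         # first table line holds column names
    }
    else {
        push @rows, [ split /\t/, $line ];   # data row
    }
}
close $fh;

printf "%d comment lines, %d columns, %d rows\n",
    scalar @comments, scalar @header, scalar @rows;
```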

Some common metadata lines that are specifically recognized are listed below.

Feature

The Feature describes the types of features represented on each row in the data table. These can include gene, transcript, genome, etc.

Database

The name of the database used in generation of the feature table. This is often also the database used in collecting the data, unless the dataset metadata specifies otherwise.

Program

The name of the program generating the data table and file. It usually includes the whole path of the executable.

Column

The next header lines contain column-specific metadata. Each column has a separate header line, beginning with the word 'Column', followed by an underscore and the column number (0-based). Following this is a series of 'key=value' pairs separated by ';'. Spaces are generally not allowed. The characters '=' and ';' are not allowed within keys or values, since they would interfere with parsing. The metadata describes how and where the data was collected. Any modifications performed on the data are also recorded here.

A list of common column metadata keys follows.

name

The name of the column. This should be identical to the table header.

database

Included if different from the main database indicated above.

window

The size of the window for genome datasets.

step

The step size of the window for genome datasets.

dataset

The name of the dataset(s) from which data is collected. Comma delimited.

start

The starting point for the feature in collecting values.

stop

The stopping point of the feature in collecting values.

extend

The extension of the region in collecting values.

strand

The strandedness of the data collected. Values include 'sense', 'antisense', or 'none'.

method

The method of collecting values.

log2

Boolean indicating whether the values are in log2 space.
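
To make the column metadata line concrete, the sketch below splits one such line into its column index and a hash of key/value pairs. The helper parse_column_line is hypothetical and not part of Bio::ToolBox, the example line and its values are invented, and the whitespace between 'Column_N' and the pairs is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one column metadata line.
# Returns (index, hashref), or an empty list if the line does not match.
# Assumes whitespace separates 'Column_N' from the key=value pairs.
sub parse_column_line {
    my ($line) = @_;
    my ( $index, $pairs ) = $line =~ /^#\s*Column_(\d+)\s+(\S+)/
        or return;

    my %meta;
    for my $pair ( split /;/, $pairs ) {
        my ( $key, $value ) = split /=/, $pair, 2;
        $meta{$key} = $value;
    }
    return ( $index, \%meta );
}

# Invented example line using keys from the list above.
my $line = '# Column_3 name=Score;dataset=sample1.bw;method=mean;strand=none;log2=0';
my ( $index, $meta ) = parse_column_line($line);

print "column $index\n";
print "  $_ = $meta->{$_}\n" for sort keys %$meta;
```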

USER METHODS REFERENCE

These methods are generally available to Bio::ToolBox::Data objects and can be used by the user.

load_file($filename)

This will load a file into a new, empty Data table. Pass the name of the file. This method is called automatically when a filename is provided to the new() function. The existence of the file is first checked (appending common missing extensions as necessary), the metadata and column headers are processed and/or generated from default settings, the content is loaded into the table, and the structure is verified. Error messages may be printed if the structure or format is inconsistent or doesn't match the expected format, e.g. a file with a .bed extension that doesn't match the UCSC specification.

taste_file($filename)

Tastes, or checks, a file for a certain flavor, or known gene file format. This is based on the file extension, metadata headers, and/or file content in the first 10 lines or so. Returns a string based on the file format. Values include gff, bed, ucsc, or undefined. Useful for determining if the file represents a known gene table format that lacks a defined file extension, e.g. UCSC formats.

add_file_metadata($filename)

Add or update the file metadata in a Data object. This will automatically parse the path, basename, and recognized file extension.

write_file()
save()

This method will write out a Bio::ToolBox Data structure to file. Zero or more values may be passed to the method.

Pass no values, and the filename stored in the metadata will be used in writing the file, effectively overwriting the original file. If no filename is stored in the metadata, an error is generated.

Pass a single value representing the filename to write. The current working directory is assumed if no path is provided in the filename.

Pass an array of key => value pairs for fine control of the write process. Keys include the following:

filename => A scalar value containing the name of the file to 
            write. This value is required for new data files and 
            optional for overwriting existing files (the filename 
            stored in the metadata is used). Appropriate extensions 
            are added (e.g. .txt, .gz, etc.) as necessary.
format   => A string to indicate the file format to be written.
            Acceptable values include 'text' and 'simple'.
            Text files are text in nature, include all metadata, and
            usually have '.txt' extensions. Simple files are
            tab-delimited text files without metadata, useful for
            exporting data. If the format is not specified, the
            extension of the passed filename will be used as a
            guide. The default behavior is to write standard text
            files.
gz       => A boolean value (1 or 0) indicating whether the file 
            should be written through a gzip filter to compress. If 
            this value is undefined, then the file name is checked 
            for the presence of the '.gz' extension and the value 
            set appropriately. Default is false.
simple   => A boolean value (1 or 0) indicating whether a simple 
            tab-delimited text data file should be written. This is 
            an old alias for setting 'format' to 'simple'.

The method will return the real name of the file written if the write was successful. The filename may be modified slightly as necessary, for example by appending or changing the file extension to match the specified file format.
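
A minimal usage sketch of the key => value form follows. It assumes that Bio::ToolBox::Data is installed and that its new() accepts a columns list and provides add_row(), as described in that module's own documentation; check both against your installed version. The column names, values, and file name are invented. The module is loaded defensively so the script still runs where it is absent.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Bio::ToolBox::Data may not be installed everywhere, so load it defensively.
my $available = eval { require Bio::ToolBox::Data; 1 };
my $written;

if ($available) {
    # Build a tiny table; column names and values are invented.
    my $Data = Bio::ToolBox::Data->new( columns => [qw(Name Score)] );
    $Data->add_row( [ 'geneA', 1.5 ] );

    # Write into a temporary directory that is removed at exit.
    my $dir = tempdir( CLEANUP => 1 );
    $written = $Data->write_file(
        filename => File::Spec->catfile( $dir, 'example' ),
        format   => 'text',    # keep the metadata lines
        gz       => 0,         # no compression
    );

    # Use the returned name, since an extension may have been added.
    print defined $written ? "wrote $written\n" : "write failed\n";
}
else {
    print "Bio::ToolBox::Data is not installed; skipping\n";
}
```

Relying on the returned name rather than the requested one avoids surprises when the method adjusts the extension.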

open_to_read_fh()

This subroutine will open a file for reading. If the passed filename has a '.gz' extension, it will appropriately open the file through a gunzip filter.

Pass the subroutine the filename. It will return a scalar reference to the open filehandle. The filehandle is an IO::Handle object and may be manipulated as such.

Example

my $filename = 'my_data.txt.gz';
my $fh = Bio::ToolBox::Data::file->open_to_read_fh($filename);
while (my $line = $fh->getline) {
	# do something
}
$fh->close;

open_to_write_fh()

This subroutine will open a file for writing. If the passed filename has a '.gz' extension, it will appropriately open the file through a gzip filter.

Pass the subroutine three values: the filename, a boolean value indicating whether the file should be compressed with gzip, and a boolean value indicating that the file should be appended. The gzip and append values are optional. The compression status may be determined automatically by the presence or absence of a '.gz' extension on the passed filename; the default is no compression. The default is also to write a new file and not to append.

If gzip compression is requested, but the filename does not have a '.gz' extension, it will be automatically added. However, the change in file name is not passed back to the originating program; beware!

The subroutine will return a scalar reference to the open filehandle. The filehandle is an IO::Handle object and may be manipulated as such.

Example

my $filename = 'my_data.txt.gz';
my $gz = 1; # compress output file with gzip
my $fh = Bio::ToolBox::Data::file->open_to_write_fh($filename, $gz);
# write to new compressed file
$fh->print("something interesting\n");
$fh->close;
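
Because the adjusted file name is not passed back when a '.gz' extension is added, a calling script may want to work out the expected name itself before opening the handle. The helper below is a hypothetical sketch of that bookkeeping, based only on the behavior described above; it is not part of Bio::ToolBox.

```perl
use strict;
use warnings;

# Hypothetical helper: the name a caller should expect on disk when
# gzip compression is requested for $filename.
sub expected_gz_name {
    my ( $filename, $gz ) = @_;
    return $filename unless $gz;
    return $filename =~ /\.gz$/ ? $filename : "$filename.gz";
}

my $requested = 'my_data.txt';
my $on_disk   = expected_gz_name( $requested, 1 );
print "requested $requested, expect $on_disk\n";
```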

OTHER METHODS

These methods are used internally by Bio::ToolBox::Core and other objects, and are not recommended for use by general users.

parse_headers($noheader)

This will determine the file format, parse any metadata lines that may be present, add metadata and inferred column names for known file formats, and determine the table column header names. This is automatically called by load_file(), and generally need not be called.

Pass a true boolean value if the file has no headers.

add_data_line($line)

Parses a text line from the file into a Data table row.

check_file($filename)

This subroutine confirms the existence of a passed filename. If the file is not immediately found, it will try appending common file extensions and check again. This allows the user to pass only the base file name and not worry about missing the extension. This may be useful in shell scripts.
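
The lookup strategy described above can be sketched in plain Perl as follows. The function find_file and the list of extensions to try are hypothetical; the module's own list and order may differ.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical list of extensions to try; the module's own list may differ.
my @TRY = qw(.txt .txt.gz .bed .gff .gz);

# Return the first existing path, trying the bare name and then each
# extension in turn. Returns nothing if no candidate exists.
sub find_file {
    my ($name) = @_;
    for my $candidate ( $name, map { $name . $_ } @TRY ) {
        return $candidate if -e $candidate;
    }
    return;
}

# Demonstration in a temporary directory that is removed at exit.
my $dir  = tempdir( CLEANUP => 1 );
my $real = File::Spec->catfile( $dir, 'my_data.txt.gz' );
open my $fh, '>', $real or die "cannot create $real: $!";
close $fh;

# Only the base name is passed; the extension is found by trial.
my $base  = File::Spec->catfile( $dir, 'my_data' );
my $found = find_file($base);
print defined $found ? "found $found\n" : "not found\n";
```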

add_column_metadata()

Parse a column metadata line from a file into a Data structure.

add_gff_metadata($version, $force)

Add default column metadata for a GFF file. Specify which GFF version.

add_bed_metadata($column_count, $force)

Add default column metadata for a BED file. Specify the number of BED columns.

add_peak_metadata($column_count, $force)

Add default column metadata for a narrowPeak or broadPeak file. Specify the number of columns.

add_ucsc_metadata($column_count, $force)

Add default column metadata for a UCSC refFlat or genePred file. Specify the number of columns to define the format.

add_sgr_metadata($force)

Add default column metadata for an SGR file.

add_standard_metadata($line)

Add default column metadata for a generic file. Pass the text line containing the tab-delimited column headers.

standard_column_names($format)

Returns an anonymous array of standard file format column header names. Pass a value representing the file format. Values include gff, bed12, bed6, bdg, narrowpeak, broadpeak, sgr, ucsc16, ucsc15, genepredext, ucsc12, knowngene, ucsc11, genepred, ucsc10, refflat.

AUTHOR

Timothy J. Parnell, PhD
Howard Hughes Medical Institute
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.