NAME

Bio::ToolBox::Data::file - File functions for the Bio::ToolBox::Data family

DESCRIPTION

File methods for reading and writing data files for both Bio::ToolBox::Data and Bio::ToolBox::Data::Stream objects. This module should not be used directly. See the respective modules for more information.

DESCRIPTION

These methods provide file IO for the Bio::ToolBox::Data data structure. They work with any generic tab-delimited text file of rows and columns, and they properly handle comment lines, metadata, and column-specific metadata custom to Bio::ToolBox programs. Special file formats used in bioinformatics, for example GFF and BED files, are automatically recognized by their file extension, and the appropriate metadata is added.

Files opened using these subroutines are stored in a specific complex data structure described below. This format allows for data access and also records metadata about each column (dataset) and the file in general. This metadata helps preserve a "history" of the dataset: where it came from, how it was collected, and how it was processed.

Additional subroutines are also present for general processing and output of this data structure.

The data file format is described below, and following that a description of the data structure.

RECOGNIZED FILE FORMATS

Bio::ToolBox will recognize a number of standard bioinformatic file formats, almost all of which are recognized by their extension. Recognition is NOT guaranteed if an alternate file extension is used.

These formats include

BED .bed .bedgraph .bdg: Bed files must have 3-12 columns. BedGraph files must have 4 columns.
GFF .gff .gff3 .gtf: These may also be recognized by the gff-version pragma. These must have 9 columns.
UCSC tables .refFlat .genePred: These are typically recognized by the number of columns, and can include simple refFlat, gene prediction, extended gene prediction, and knownGene tables.
Peak files .narrowPeak .broadPeak: These are special "BED6+4" file formats.
CDT .cdt: Cluster data files used with Cluster 3.0 and Treeview.
SGR: Rare file format of chromosome, position, score.
TEXT .txt: Almost any tab-delimited text file can be loaded.
Compression .gz .bz2: Compressed files are usually read through an external decompression program. All of the above formats can be loaded as compressed files.
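
The extension-based recognition described above can be pictured with a small lookup. This is only a sketch: the helper guess_format() and the table %FORMAT_FOR are hypothetical names built from the list above, not part of Bio::ToolBox, and the module's own detection (for example by column count or the gff-version pragma) is more thorough.

```perl
use strict;
use warnings;

# Hypothetical lookup table assembled from the extensions listed above.
my %FORMAT_FOR = (
    bed        => 'BED',
    bedgraph   => 'BedGraph',
    bdg        => 'BedGraph',
    gff        => 'GFF',
    gff3       => 'GFF',
    gtf        => 'GFF',
    refflat    => 'UCSC',
    genepred   => 'UCSC',
    narrowpeak => 'Peak',
    broadpeak  => 'Peak',
    cdt        => 'CDT',
    txt        => 'TEXT',
);

# guess_format() is an illustrative helper, not a Bio::ToolBox function.
sub guess_format {
    my ($filename) = @_;
    my $name = lc $filename;
    $name =~ s/\.(?:gz|bz2)$//;    # a compression suffix does not change the format
    my ($ext) = $name =~ /\.([^.\/]+)$/;
    return 'unknown' unless defined $ext;
    return $FORMAT_FOR{$ext} // 'unknown';
}

print guess_format('peaks.narrowPeak.gz'), "\n";
print guess_format('genes.gtf'),           "\n";
print guess_format('scores.dat'),          "\n";
```

A file with an unlisted extension falls through to 'unknown' here, which mirrors the warning that recognition is not guaranteed for alternate extensions.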

DEFAULT BIO::TOOLBOX DATA TEXT FILE FORMAT

When not writing to a defined format, e.g. BED or GFF, a Bio::ToolBox Data structure is written as a simple tab-delimited text file, with the first line being the column header names. Such files are easily parsed by other programs.

If additional metadata is included in the Data object, it is written as comment lines, each prefixed by "# ", before the table. Metadata can describe the data within the table with regard to its type, source, methodology, history, and processing. The metadata is designed to be read by both humans and computers. Opening a file without this metadata will result in basic default metadata being assigned to each column.

Some common metadata lines that are specifically recognized are listed below.

Feature

The Feature describes the types of features represented on each row in the data table. These can include gene, transcript, genome, etc.

Database

The name of the database used in generation of the feature table. This is often also the database used in collecting the data, unless the dataset metadata specifies otherwise.

Program

The name of the program generating the data table and file. It usually includes the whole path of the executable.

Column

The next header lines include column-specific metadata. Each column has a separate header line, beginning with the word 'Column' followed by an underscore and the column number (0-based). Following this is a series of 'key=value' pairs separated by ';'. Spaces are generally not allowed, and neither '=' nor ';' may appear within a key or value because they would interfere with the parsing. The metadata describes how and where the data was collected. Any modifications performed on the data are also recorded here.

A list of common column metadata keys is shown.

name: The name of the column. This should be identical to the table header.
database: Included if different from the main database indicated above.
window: The size of the window for genome datasets
step: The step size of the window for genome datasets
dataset: The name of the dataset(s) from which data is collected. Comma delimited.
start: The starting point for the feature in collecting values
stop: The stopping point of the feature in collecting values
extend: The extension of the region in collecting values
strand: The strandedness of the data collected. Values include 'sense', 'antisense', or 'none'
method: The method of collecting values
log2: boolean indicating the values are in log2 space or not
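
A column metadata line of this kind can be split into a hash with a few lines of Perl. The sketch below is illustrative only: parse_column_line() is a hypothetical helper, the sample line and its values are made up, and it assumes the 'Column_N' label is separated from the key=value pairs by whitespace.

```perl
use strict;
use warnings;

# parse_column_line() is a hypothetical helper for illustration only.
# Assumed line layout: "# Column_<index> key=value;key=value;..."
sub parse_column_line {
    my ($line) = @_;
    chomp $line;
    my ($index, $pairs) = $line =~ /^#\s*Column_(\d+)\s+(\S+)/
        or return;    # not a column metadata line
    my %meta = (index => $index);
    for my $pair (split /;/, $pairs) {
        my ($key, $value) = split /=/, $pair, 2;
        next unless defined $value;    # skip malformed pairs
        $meta{$key} = $value;
    }
    return \%meta;
}

# Invented example line using keys from the list above.
my $meta = parse_column_line(
    "# Column_3 name=H3K4me3;dataset=chip.bw;method=mean;strand=none;log2=0\n"
);
printf "column %d is '%s', collected by %s\n",
    $meta->{index}, $meta->{name}, $meta->{method};
```

Because '=' and ';' act as delimiters, a value containing either character would be split incorrectly, which is why the format forbids them.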

USER METHODS REFERENCE

These methods are generally available to Bio::ToolBox::Data objects and can be used by the user.

load_file($filename)

Loads a file into memory. Any metadata lines will be automatically parsed and the table loaded into the Data object. Some basic consistency checks are performed. Structured file formats, such as BED, GFF, etc., are recognized automatically and the appropriate metadata is added (see RECOGNIZED FILE FORMATS).

add_file_metadata($filename)

Add or update the file metadata to a Data object. This will automatically parse the path, basename, and recognized file extension.

write_file()

save()

This method will write out a Bio::ToolBox Data structure to file. Zero or more values may be passed to the method.

Pass no values, and the filename stored in the metadata will be used in writing the file, effectively overwriting the original. If no filename is stored in the metadata, an error is generated.

Pass a single value representing the filename to write. The current working directory is assumed if no path is provided in the filename.

Pass an array of key => values for fine control of the write process. Keys include the following:

filename => A scalar value containing the name of the file to 
            write. This value is required for new data files and 
            optional for overwriting existing files (the filename 
            stored in the metadata is used). Appropriate extensions 
            are added (e.g., .txt, .gz, etc.) as necessary. 
format   => A string to indicate the file format to be written.
            Acceptable values include 'text', and 'simple'.
            Text files are text in nature, include all metadata, and
            usually have '.txt' extensions. Simple files are
            tab-delimited text files without metadata, useful for
            exporting data. If the format is not specified, the
            extension of the passed filename will be used as a
            guide. The default behavior is to write standard text
            files.
gz       => A boolean value (1 or 0) indicating whether the file 
            should be written through a gzip filter to compress. If 
            this value is undefined, then the file name is checked 
            for the presence of the '.gz' extension and the value 
            set appropriately. Default is false.
simple   => A boolean value (1 or 0) indicating whether a simple 
            tab-delimited text data file should be written. This is 
            an old alias for setting 'format' to 'simple'.

The method will return the real name of the file written if the write was successful. The filename may be modified slightly as necessary, for example by appending or changing the file extension to match the specified file format.
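
The calling styles above might be combined as in the following sketch. The wrapper export_copies() and the file names are hypothetical; $Data stands for any object that provides write_file() as documented here, such as a Bio::ToolBox::Data object loaded elsewhere.

```perl
use strict;
use warnings;

# export_copies() is a hypothetical wrapper, not part of Bio::ToolBox.
# $Data must provide write_file() as described above.
sub export_copies {
    my ($Data, $basename) = @_;
    my @written;

    # Single value: treated as the filename to write.
    push @written, scalar $Data->write_file("$basename.txt");

    # Key => value pairs: a simple table without metadata, gzip compressed.
    push @written, scalar $Data->write_file(
        filename => "$basename.simple.txt",
        format   => 'simple',
        gz       => 1,
    );

    # Keep only successful writes. The returned names may differ from the
    # requested ones, for example with '.gz' appended.
    return grep { defined } @written;
}
```

Using the returned names rather than the requested ones avoids surprises when an extension has been added or changed. Calling write_file() with no arguments is left out of the sketch because it overwrites the original file.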

open_to_read_fh()

This subroutine will open a file for reading. If the passed filename has a '.gz' extension, it will appropriately open the file through a gunzip filter.

Pass the subroutine the filename. It will return a scalar reference to the open filehandle. The filehandle is an IO::Handle object and may be manipulated as such.

Example

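use Bio::ToolBox::Data::file;

my $filename = 'my_data.txt.gz';
my $fh = Bio::ToolBox::Data::file->open_to_read_fh($filename)
	or die "unable to open $filename for reading\n";
while (my $line = $fh->getline) {
	chomp $line;
	# do something with $line
}
$fh->close;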

open_to_write_fh()

This subroutine will open a file for writing. If the passed filename has a '.gz' extension, it will appropriately open the file through a gzip filter.

Pass the subroutine three values: the filename, a boolean value indicating whether the file should be compressed with gzip, and a boolean value indicating that the file should be appended. The gzip and append values are optional. The compression status may be determined automatically by the presence or absence of the passed filename extension; the default is no compression. The default is also to write a new file and not to append.

If gzip compression is requested, but the filename does not have a '.gz' extension, it will be automatically added. However, the change in file name is not passed back to the originating program; beware!

The subroutine will return a scalar reference to the open filehandle. The filehandle is an IO::Handle object and may be manipulated as such.

Example

use Bio::ToolBox::Data::file;

my $filename = 'my_data.txt.gz';
my $gz = 1; # compress output file with gzip
my $fh = Bio::ToolBox::Data::file->open_to_write_fh($filename, $gz)
	or die "unable to open $filename for writing\n";
# write to new compressed file
$fh->print("something interesting\n");
$fh->close;
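
Because the adjusted file name is not passed back by open_to_write_fh(), a calling script may want to work out the expected name itself before opening the file. A minimal sketch, assuming the rule is simply "append '.gz' when compression is requested and the extension is missing"; expected_output_name() is a hypothetical helper, not part of Bio::ToolBox.

```perl
use strict;
use warnings;

# expected_output_name() is a hypothetical helper, not part of Bio::ToolBox.
# It mirrors the documented rule for explicit gzip requests only.
sub expected_output_name {
    my ($filename, $gz) = @_;

    # gzip explicitly requested but extension missing: '.gz' gets appended
    if ($gz and $filename !~ /\.gz$/i) {
        return "$filename.gz";
    }

    # otherwise the name is used as given
    return $filename;
}

my $requested = 'my_data.txt';
my $actual    = expected_output_name($requested, 1);
print "requested $requested, expect $actual on disk\n";
```

Passing the computed name to open_to_write_fh() in the first place keeps the script and the file system in agreement.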

OTHER METHODS

These methods are used internally by Bio::ToolBox::Core and other objects, and are not recommended for use by general users.

parse_headers: This will determine the file format, parse any metadata lines that may be present, add metadata and inferred column names for known file formats, and determine the table column header names. This is automatically called by load_file(), and generally need not be called.
add_data_line($line): Parses a text line from the file into a Data table row.
check_file($filename): This subroutine confirms the existence of a passed filename. If not immediately found, it will attempt to append common file extensions and verify its existence. This allows the user to pass only the base file name and not worry about missing the extension. This may be useful in shell scripts.
add_column_metadata(): Parse a column metadata line from a file into a Data structure.
add_gff_metadata(): Add default column metadata for a GFF file.
add_bed_metadata(): Add default column metadata for a BED file.
add_peak_metadata(): Add default column metadata for a narrowPeak or broadPeak file.
add_ucsc_metadata(): Add default column metadata for a UCSC refFlat or genePred file.
add_sgr_metadata(): Add default column metadata for a SGR file.
add_standard_metadata(): Add default column metadata for a generic file.
standard_column_names(): Returns an anonymous array of standard file format column header names. Pass a value representing the file format. Values include gff, bed12, bed6, bdg, narrowpeak, broadpeak, sgr, ucsc16, ucsc15, genepredext, ucsc12, knowngene, ucsc11, genepred, ucsc10, refflat.
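
The extension search described for check_file() can be approximated as follows. This is a sketch of the idea rather than the module's code: find_data_file() is a hypothetical stand-in, and the suffix list is an assumption that may differ from what check_file() actually tries.

```perl
use strict;
use warnings;

# Assumed list of suffixes to try; the list used by check_file() may differ.
my @SUFFIXES = qw(.txt .bed .gff .gff3 .gtf .txt.gz .bed.gz .gff.gz);

# find_data_file() is a hypothetical stand-in mimicking the described behavior.
sub find_data_file {
    my ($name) = @_;
    return $name if -e $name;    # found exactly as given
    for my $suffix (@SUFFIXES) {
        my $candidate = $name . $suffix;
        return $candidate if -e $candidate;
    }
    return;                      # nothing found
}

my $found = find_data_file('my_data');
print defined $found ? "using $found\n" : "no matching file\n";
```

The first matching suffix wins, so the order of the list decides which file is chosen when several candidates exist.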

AUTHOR

Timothy J. Parnell, PhD
Howard Hughes Medical Institute
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.
