NAME
Bio::ToolBox::Data::file - File functions for the Bio::ToolBox::Data family
DESCRIPTION
File methods for reading and writing data files for both Bio::ToolBox::Data and Bio::ToolBox::Data::Stream objects. This module should not be used directly. See the respective modules for more information.
These methods provide file IO for the Bio::ToolBox::Data data structure. They work with any generic tab-delimited text file of rows and columns. They also properly handle comment lines, metadata, and column-specific metadata custom to Bio::ToolBox programs. Special file formats used in bioinformatics, such as GFF and BED files, are automatically recognized by their file extension, and the appropriate metadata is added.
Files opened using these subroutines are stored in a specific complex data structure described below. This format allows for data access as well as records metadata about each column (dataset) and the file in general. This metadata helps preserve a "history" of the dataset: where it came from, how it was collected, and how it was processed.
Additional subroutines are also present for general processing and output of this data structure.
The data file format is described below, and following that a description of the data structure.
RECOGNIZED FILE FORMATS
Bio::ToolBox will recognize a number of standard bioinformatic file formats, almost all of which are recognized by their extension. Recognition is not guaranteed if an alternate file extension is used.
These formats include:
- BED .bed .bedgraph .bdg
-
Bed files must have 3-12 columns. BedGraph files must have 4 columns.
- GFF .gff .gff3 .gtf
-
These may also be recognized by the gff-version pragma. These must have 9 columns.
- UCSC tables .refFlat .genePred
-
These are typically recognized by the number of columns, and can include simple refFlat, gene prediction, extended gene prediction, and knownGene tables.
- Peak files .narrowPeak .broadPeak
-
These are special "BED6+4" file formats.
- CDT .cdt
-
Cluster data files used with Cluster 3.0 and Treeview.
- SGR
-
Rare file format of chromosome, position, score.
- TEXT .txt
-
Almost any tab-delimited text file can be loaded.
- Compression .gz .bz2
-
Compressed files are usually read through an external decompression program. All of the above formats can be loaded as compressed files.
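The extension-based recognition described above can be illustrated with a small, self-contained sketch. The lookup table and the guess_format() helper are hypothetical names written for this example from the extensions listed in this section; they are not the module's internal code, which also considers column counts and pragmas.

```perl
use strict;
use warnings;

# Hypothetical lookup table built from the extensions listed above.
# Illustration only; the module's real recognition logic may differ.
my %FORMAT_FOR = (
    bed        => 'BED',
    bedgraph   => 'BED',
    bdg        => 'BED',
    gff        => 'GFF',
    gff3       => 'GFF',
    gtf        => 'GFF',
    refflat    => 'UCSC table',
    genepred   => 'UCSC table',
    narrowpeak => 'Peak',
    broadpeak  => 'Peak',
    cdt        => 'CDT',
    txt        => 'TEXT',
);

# Return (format, compression) for a file name; either may be undef.
sub guess_format {
    my ($filename) = @_;
    my $compression;

    # Strip a trailing compression extension first, remembering it.
    if ( $filename =~ s/\.(gz|bz2)$//i ) {
        $compression = lc $1;
    }

    # Look up the remaining extension without regard to case.
    my ($ext) = $filename =~ /\.([^.\/]+)$/;
    my $format = defined $ext ? $FORMAT_FOR{ lc $ext } : undef;
    return ( $format, $compression );
}

my ( $format, $compression ) = guess_format('peaks.narrowPeak.gz');
printf "%s (%s)\n", $format // 'unknown', $compression // 'uncompressed';
```

A file with an unlisted extension returns an undefined format here, which mirrors the warning that recognition is not guaranteed for alternate extensions.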
DEFAULT BIO::TOOLBOX DATA TEXT FILE FORMAT
When not writing to a defined format, e.g. BED or GFF, a Bio::ToolBox Data structure is written as a simple tab-delimited text file, with the first line being the column header names. Such files are easily parsed by other programs.
If additional metadata is included in the Data object, it is written as comment lines, each prefixed by "# ", before the table. Metadata can describe the data within the table with regard to its type, source, methodology, history, and processing. The metadata is designed to be read by both humans and computers. Opening a file without this metadata will result in basic default metadata being assigned to each column.
Some common metadata lines that are specifically recognized are listed below.
- Feature
-
The Feature describes the types of features represented on each row in the data table. These can include gene, transcript, genome, etc.
- Database
-
The name of the database used in generation of the feature table. This is often also the database used in collecting the data, unless the dataset metadata specifies otherwise.
- Program
-
The name of the program generating the data table and file. It usually includes the whole path of the executable.
- Column
-
The remaining header lines contain column-specific metadata. Each column has a separate header line that begins with the word 'Column', followed by an underscore and the column number (0-based). Following this is a series of 'key=value' pairs separated by ';'. Spaces are generally not allowed. The characters '=' and ';' are not allowed within keys or values because they would interfere with parsing. The metadata describes how and where the data was collected. Any modifications performed on the data are also recorded here.
A list of common column metadata keys is shown.
- name
-
The name of the column. This should be identical to the table header.
- database
-
Included if different from the main database indicated above.
- window
-
The size of the window for genome datasets.
- step
-
The step size of the window for genome datasets.
- dataset
-
The name of the dataset(s) from which data is collected. Comma delimited.
- start
-
The starting point for the feature in collecting values.
- stop
-
The stopping point of the feature in collecting values.
- extend
-
The extension of the region in collecting values.
- strand
-
The strandedness of the data collected. Values include 'sense', 'antisense', or 'none'.
- method
-
The method of collecting values.
- log2
-
A boolean indicating whether the values are in log2 space.
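As an illustration of the column metadata line layout described above, here is a minimal parsing sketch in plain Perl. The parse_column_metadata() helper is a hypothetical name, the sample values are made up, and the sketch assumes that whitespace separates the 'Column_N' token from the key=value pairs. It is not the module's own parser.

```perl
use strict;
use warnings;

# Parse one column metadata line into (index, hashref of key => value).
# Returns an empty list if the line is not a column metadata line.
sub parse_column_metadata {
    my ($line) = @_;
    chomp $line;

    # Assumed layout: "# Column_<index> key=value;key=value;..."
    my ( $index, $pairs ) = $line =~ /^#\s*Column_(\d+)\s+(.+)$/
        or return;

    my %meta;
    for my $pair ( split /;/, $pairs ) {
        # Split on the first '=' only and skip malformed pairs.
        my ( $key, $value ) = split /=/, $pair, 2;
        next unless defined $key && length $key && defined $value;
        $meta{$key} = $value;
    }
    return ( $index, \%meta );
}

# Made-up example line using keys from the list above.
my $line = "# Column_3 name=H3K4me3;dataset=chip.bw;window=500;"
         . "step=250;method=mean;strand=none;log2=0\n";

my ( $index, $meta ) = parse_column_metadata($line);
print "column $index\n";
print "  $_ = $meta->{$_}\n" for sort keys %$meta;
```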
USER METHODS REFERENCE
These methods are generally available to Bio::ToolBox::Data objects and can be used by the user.
- load_file($filename)
-
Loads a file into memory. Any metadata lines will be automatically parsed and the table loaded into the Data object. Some basic consistency checks are performed. Structured file formats, such as BED, GFF, etc are
- add_file_metadata($filename)
-
Add or update the file metadata to a Data object. This will automatically parse the path, basename, and recognized file extension.
- write_file()
- save()
-
This method will write out a Bio::ToolBox Data structure to file. Zero or more values may be passed to the method.
Pass no values, and the filename stored in the metadata will be used in writing the file, effectively overwriting the original file. If no filename is stored in the metadata, an error is generated.
Pass a single value representing the filename to write. The current working directory is assumed if no path is provided in the filename.
Pass an array of key => values for fine control of the write process. Keys include the following:
filename => A scalar value containing the name of the file to write. This value is required for new data files and optional for overwriting existing files (the filename stored in the metadata is used). Appropriate extensions are added (e.g., .txt, .gz, etc.) as necessary.
format => A string indicating the file format to be written. Acceptable values include 'text' and 'simple'. Text files include all metadata and usually have '.txt' extensions. Simple files are tab-delimited text files without metadata, useful for exporting data. If the format is not specified, the extension of the passed filename will be used as a guide. The default behavior is to write standard text files.
gz => A boolean value (1 or 0) indicating whether the file should be written through a gzip filter to compress it. If this value is undefined, then the file name is checked for the presence of the '.gz' extension and the value set appropriately. Default is false.
simple => A boolean value (1 or 0) indicating whether a simple tab-delimited text data file should be written. This is an old alias for setting 'format' to 'simple'.
The method will return the actual name of the file written if the write was successful. The filename may be modified slightly as necessary, for example by appending or changing the file extension to match the specified file format.
- open_to_read_fh()
-
This subroutine will open a file for reading. If the passed filename has a '.gz' extension, it will appropriately open the file through a gunzip filter.
Pass the subroutine the filename. It will return a scalar reference to the open filehandle. The filehandle is an IO::Handle object and may be manipulated as such.
Example
    my $filename = 'my_data.txt.gz';
    my $fh = Bio::ToolBox::Data::file->open_to_read_fh($filename);
    while (my $line = $fh->getline) {
        # do something
    }
    $fh->close;
- open_to_write_fh()
-
This subroutine will open a file for writing. If the passed filename has a '.gz' extension, it will appropriately open the file through a gzip filter.
Pass the subroutine three values: the filename, a boolean value indicating whether the file should be compressed with gzip, and a boolean value indicating that the file should be appended. The gzip and append values are optional. The compression status may be determined automatically by the presence or absence of the passed filename extension; the default is no compression. The default is also to write a new file and not to append.
If gzip compression is requested, but the filename does not have a '.gz' extension, it will be automatically added. However, the change in file name is not passed back to the originating program; beware!
The subroutine will return a scalar reference to the open filehandle. The filehandle is an IO::Handle object and may be manipulated as such.
Example
    my $filename = 'my_data.txt.gz';
    my $gz = 1; # compress output file with gzip
    my $fh = Bio::ToolBox::Data::file->open_to_write_fh($filename, $gz);
    # write to new compressed file
    $fh->print("something interesting\n");
    $fh->close;
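For readers who want to see the extension-driven compression idea in isolation, the following sketch round-trips a gzip file without Bio::ToolBox. It uses the core IO::Compress::Gzip and IO::Uncompress::Gunzip modules in place of the external gzip filter described above, so it is a substitute technique and not the module's implementation. The open_write() and open_read() helpers are hypothetical names for this example.

```perl
use strict;
use warnings;
use IO::File;
use IO::Compress::Gzip;
use IO::Uncompress::Gunzip;
use File::Temp qw(tempdir);

# Open a file for writing, compressing when the name ends in '.gz'.
sub open_write {
    my ($filename) = @_;
    my $fh = $filename =~ /\.gz$/i
        ? IO::Compress::Gzip->new($filename)
        : IO::File->new( $filename, 'w' );
    die "cannot write $filename\n" unless $fh;
    return $fh;
}

# Open a file for reading, decompressing when the name ends in '.gz'.
sub open_read {
    my ($filename) = @_;
    my $fh = $filename =~ /\.gz$/i
        ? IO::Uncompress::Gunzip->new($filename)
        : IO::File->new( $filename, 'r' );
    die "cannot read $filename\n" unless $fh;
    return $fh;
}

# Write a compressed file in a temporary directory, then read it back.
my $dir  = tempdir( CLEANUP => 1 );
my $file = "$dir/my_data.txt.gz";

my $out = open_write($file);
$out->print("something interesting\n");
$out->close;

my $in = open_read($file);
while ( my $line = $in->getline ) {
    print $line;
}
$in->close;
```

Both helpers return objects that support print, getline, and close, so calling code does not need to know whether the file is compressed.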
OTHER METHODS
These methods are used internally by Bio::ToolBox::Core and other objects and are not recommended for use by general users.
- parse_headers
-
This will determine the file format, parse any metadata lines that may be present, add metadata and inferred column names for known file formats, and determine the table column header names. This is automatically called by load_file(), and generally need not be called.
- add_data_line($line)
-
Parses a text line from the file into a Data table row.
- check_file($filename)
-
This subroutine confirms the existence of a passed filename. If the file is not immediately found, it will attempt to append common file extensions and verify its existence. This allows the user to pass only the base file name and not worry about missing the extension. This may be useful in shell scripts.
- add_column_metadata()
-
Parse a column metadata line from a file into a Data structure.
- add_gff_metadata()
-
Add default column metadata for a GFF file.
- add_bed_metadata()
-
Add default column metadata for a BED file.
- add_peak_metadata()
-
Add default column metadata for a narrowPeak or broadPeak file.
- add_ucsc_metadata()
-
Add default column metadata for a UCSC refFlat or genePred file.
- add_sgr_metadata()
-
Add default column metadata for an SGR file.
- add_standard_metadata()
-
Add default column metadata for a generic file.
- standard_column_names()
-
Returns an anonymous array of standard file format column header names. Pass a value representing the file format. Values include gff, bed12, bed6, bdg, narrowpeak, broadpeak, sgr, ucsc16, ucsc15, genepredext, ucsc12, knowngene, ucsc11, genepred, ucsc10, refflat.
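The extension-probing behavior described for check_file() above can be sketched in a few lines of plain Perl. The find_file() helper and its extension list are hypothetical and chosen only for illustration; the extensions and the order in which the module tries them may differ.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical extension list; the module's real list may differ.
my @EXTENSIONS = qw(.txt .txt.gz .bed .bed.gz .gff .gff.gz);

# Return the first existing path among the name itself and the name
# plus each candidate extension, or nothing if none exists.
sub find_file {
    my ($name) = @_;
    return $name if -e $name;
    for my $ext (@EXTENSIONS) {
        my $candidate = $name . $ext;
        return $candidate if -e $candidate;
    }
    return;
}

# Demonstration in a temporary directory.
my $dir = tempdir( CLEANUP => 1 );
open my $fh, '>', "$dir/my_data.txt.gz" or die "cannot create file: $!";
close $fh;

my $found = find_file("$dir/my_data");
print defined $found ? "found $found\n" : "not found\n";
```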
AUTHOR
Timothy J. Parnell, PhD
Howard Hughes Medical Institute
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112
This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.