######## Bio::ToolBox revision history #############
v1.52
- Added binning option to wig files in script bam2wig. Default
is to write wig files in 10 bp bins with significant decreases
in runtime and memory usage while not appreciably diminishing
resolution.
- Add support to calculate shift values without doing wig
conversion in script bam2wig
- Add support for mRNA transcript subfeatures, including CDS,
5 prime UTR, and 3 prime UTRs, in data collection scripts
get_datasets and get_binned_data.
- Add new UTR methods to GeneTools library
- Changed behavior of reporting common and alternate exons and
introns in GeneTools. Genes with single transcripts now report
all exons and introns as common for simplicity.
- Add option to search at the 5 prime, middle, or 3 prime end
of features in script get_intersecting_features
- Fix bug in specifying which database feature to collect
regions from in script get_gene_regions
- Fix bug where tables with coordinates could not be used in
database lookups in script get_feature_info
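
As an illustration of the 10 bp binning described above for bam2wig, the following Python sketch collapses per-base coverage into fixed-width bins. The function name bin_coverage and the choice of the mean as the bin value are assumptions made for this example; the script itself is written in Perl and may aggregate values differently.

```python
def bin_coverage(coverage, bin_size=10):
    """Collapse per-base coverage values into fixed-width bins (mean per bin)."""
    bins = []
    for start in range(0, len(coverage), bin_size):
        window = coverage[start:start + bin_size]
        # The last bin may be shorter than bin_size; average what is there.
        bins.append(sum(window) / len(window))
    return bins
```

Writing one value per 10 bp instead of one per base reduces the number of records roughly tenfold, which is the likely source of the runtime and memory savings noted above.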
v1.51
- Changed how bam alignments are recorded for indexed position
data hashes. Alignments are now recorded at their 5' position
instead of midpoint, which wreaked havoc with large gaps and pairs.
- Reporting indexed bam alignment names (ncount method) now returns
the actual names rather than count. The db_helper calculate_score
method can properly count these. This avoids double-counting across
exons, etc.
- Fix major bug in script bam2wig that prevented paired-end
alignments from working. Thanks to Mengyao for pointing this out.
- Add additional checks when loading malformed files that have a
missing column header or extraneous hidden columns (extra tabs)
- Add format checks for numeric columns in some file formats
- Miscellaneous code improvements here and there
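
The two alignment changes in this release can be sketched in Python as follows. Both function names are hypothetical and the code is not taken from the library; it only illustrates recording an alignment at its 5' end and counting read names once across several regions such as exons.

```python
def five_prime_position(start, end, strand):
    """Return the 5' coordinate of an alignment (1-based, inclusive ends)."""
    # Reverse-strand alignments have their 5' end at the rightmost base.
    return end if strand < 0 else start


def count_unique_names(names_per_region):
    """Count alignments across regions, counting each read name only once."""
    seen = set()
    for names in names_per_region:
        seen.update(names)
    return len(seen)
```

With two exons returning the names r1, r2 and r2, r3, the combined count is 3 rather than 4, because r2 spans both exons.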
v1.50
- Major upgrade of the data collection libraries to simplify data
collection and improve efficiency. The value type is no longer
specified, being rolled into the specified collection method. Low-level
optimizations have been added to improve speed. Speed increases of
30% to over 300% have been measured, depending on the collection
method and adapter.
- Rewrite of data collection scripts to work with the improved libraries
- Added support for the modern Bio::DB::HTS module for Bam files,
while keeping support for the older Bio::DB::Sam module.
- Added more agnostic support for multiple different fasta indexing
adapters
- Script bam2wig is completely rewritten to handle multiple bam
files for merging, independent bam scaling, improved alignment
filtering, customizable output, improved cross-strand correlation
for peak shifting, improved speed and memory management, and lots
more features.
- Updated script data2fasta
- Numerous other features and changes too small to mention
- Relaxed requirements for external modules, namely BioPerl, so
that scripts and functions that don't absolutely require them can
still be used. All database functions will require it though.
v1.45
- Fix endless loop bug with opening files with metadata but no data,
e.g. empty VCF files
- Revert support for opening bedGraphToBigWig file handles
v1.44
- Added new function to GeneTools for exporting to GTF format.
- Added new function to filter transcript subfeatures in a gene
SeqFeature object by available Ensembl Transcript Support Level tags.
- Fixed critical bug with collapsing multiple transcripts in
GeneTools function that resulted in too many overlapping exons.
- Fixed bug in exporting non-coding gene models to UCSC refFlat format.
- Other minor bug fixes.
v1.43
- Fix bug with unique option in script get_gene_regions where
too many regions were being discarded. Thanks to Mengyao.
- Fix bug with generating bigWig files in script bam2wig, and
restore option to prefer bedGraphToBigWig if so desired
- Add option to ignore extraneous attribute tags when parsing
GFF and GTF files to reduce memory (simplify). Enable this
option by default when parsing annotation files when loading a
table in Bio::ToolBox::Data.
v1.42
- Changed bigWig converter method to use primarily the wigToBigWig
utility for simplicity
- Introduced new method to open a wigToBigWig utility filehandle to
"print" wig files directly to a bigWig
- Updated bam2wig and data2wig scripts to write directly to the
bigWig utility and skip writing temporary intermediate wig file
- Added functionality to bam2wig to record stranded shifted counts
- Fixed a critical bug in script get_gene_regions where transcripts
weren't being filtered
- Improved file format taste testing to avoid GFF false positives
- Improved UCSC gene table parser behavior
v1.41
- Added no header option when loading text files missing a
column header row. Updated script manipulate_datasets to take
advantage of the feature.
- Added option to combine multiple score columns into a single score
when converting a file to a wig file in script data2wig
- Added option to split gff or vcf data files by an attribute tag
in script split_data_file
- Improve handling of writing vcf files
- Fix critical errors with calculating cdsStart and cdsEnd in the
GeneTools library
- Fix bugs in gff parser to continue when encountering errors in
parsing and interpret transcript biotype gtf attributes
- Fix bug in properly handling start coordinates in script data2wig
v1.40
- Major update introduces new SeqFeature object Bio::ToolBox::SeqFeature
that is a little faster and more compact than equivalent BioPerl objects.
This is the default object used in gene table parsers.
- New Module Bio::ToolBox::GeneTools for working with SeqFeature objects
representing traditional nested feature gene, transcript, exon models.
The script get_gene_regions now uses this module, as do other scripts.
- Expunged many scripts that are no longer considered part of the primary
mission of the BioToolBox distribution. These are now available in a
separate repository located at https://github.com/tjparnell/HCI-Scripts.
- Bio::ToolBox::Data objects can now parse all gene tables into memory
and store the features in the object. This allows gene tables to be
used without requiring a database to be setup.
- Added a file tasting method to determine whether a file looks like a
specific file format, e.g. gff, UCSC gene table, etc.
- Added numerous little methods and method aliases here and there to
improve functionality
- Added attribute rewrite functions for both GFF and VCF files
- Improved file format testing
- Numerous little optimizations in loading files
v1.36 (git 44b9dea)
- added new option to script get_relative_data to allow user to specify
what feature types to avoid
- fix bugs in scripts manipulate_datasets when exporting log2 treeview
files and defining x axes in graph_profile
- fix annoying bug where manipulate_datasets will not re-show column list
- improve data file summarization
- some library method optimizations
v1.35 (git e489d52)
- Add new options for setting dimensions and linear regression lines in
script graph_data.
- Restored unique option in script data2gff.
- New convenience methods for Feature objects.
- Fixed bug with smoothing interpolation in get_relative_data
- Numerous other bug fixes regarding bed files, column names,
file support, warnings.
v1.34 (git 5d4803c)
- Changed the behavior of automatically converting interbase coordinates
to base coordinates upon loading a file, and converting back as necessary
when writing. This had the side effect of effectively changing coordinates
when writing out nonstandard text files. Conversion is now done on the fly
when using the start method of row Features. Start interbase coordinates
are now recognized by appending a 0 to the column name. Output files should
now look like the input files.
- Strand values are not automatically converted upon loading; they are
converted as necessary on the fly using the row Feature strand method.
- Null values are not automatically converted to internal '.' null values.
They are converted as necessary using the row Feature value method to
maintain backward compatibility.
- Scripts data2bed and data2wig go back to using a Stream input to avoid
high memory usage.
- Script data2wig now has a fast option to skip lots of checks on values
and intervals. This speeds up conversion considerably at the risk of
making improper wig files if the source file has issues.
- Script join_data_file is considerably faster by simply concatenating
data lines without processing or checking.
- Script bam2wig has new recording option, mid extend, to record the
middle portion of alignments or proper paired-end alignments. Credit to
Ohad for recommending.
- Add explicit interbase support to scripts data2gff and data2fasta.
- Fix critical bug where extensions were not scored properly for coordinate
features in script get_binned_data. Thanks to Mengyao.
- Fix bam2wig alignment illustrations in POD. Thanks to Ohad.
- Bug fixes regarding bed file integrity checking that were introduced in
the previous release.
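
The on-the-fly interbase conversion described above can be pictured with this Python sketch. The function name base_start and the example column names are assumptions; the only detail taken from the notes is that a start column whose name ends in 0 holds interbase (0-based) values.

```python
def base_start(column_name, value):
    """Return a 1-based start; columns named with a trailing 0 are interbase."""
    value = int(value)
    if column_name.endswith("0"):
        # Interbase starts sit one position before the 1-based start.
        return value + 1
    return value
```

Because the stored value is never modified, writing the table back out reproduces the input coordinates unchanged.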
v.1.33 (git ba1a70e)
- Removed legacy_helper module. All scripts now properly updated to
use Bio::ToolBox::Data and related objects. This was the last step of
a long process to modernize all of the scripts to use the new libraries.
- All data collection modules are now chromosome naming-scheme agnostic,
meaning that "chr1" and "1" for chromosome can be used equally, regardless
of what the annotation or big data file uses.
- Minimal VCF file support is added, including the ability to parse INFO
and SAMPLE attributes, and verify some file format integrity.
- Significantly improve GTF file parsing.
- Improve file format verification, including printing error messages.
This should alleviate cryptic reasons for automatic file extension changes.
- Tons of bug fixes. See GitHub for a full change log.
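
One way to picture the naming-scheme agnostic lookup mentioned above is to try a chromosome name both with and without the "chr" prefix. This Python sketch is an assumption about the general approach, not the library's code, and both function names are hypothetical.

```python
def chromosome_aliases(name):
    """Return candidate spellings of a chromosome name, with and without 'chr'."""
    if name.startswith("chr"):
        return [name, name[3:]]
    return [name, "chr" + name]


def resolve_chromosome(name, available):
    """Return the spelling of name that is present in available, or None."""
    for candidate in chromosome_aliases(name):
        if candidate in available:
            return candidate
    return None
```

A request for "1" against a file that names its sequences "chr1" resolves to "chr1", and the reverse case works the same way.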
v.1.32 (git 67749a7)
- Fix bug with adding a new column to Data object, particularly
when selected from a database.
- Fix bugs related to adding, deleting, or modifying columns for
a specific file format, such as BED or GFF
- Introduce additional Data structure verification tests, including
proper strand information, to verify correct file formatting, such
as BED and GFF
- Fix bugs when writing data files that incorrectly maintained
file extensions for a given format even when the structure was no
longer valid.
- Add support for .bigwig and .bigbed file extensions.
- Fix bug with opening fai fasta index and forked databases in script
CpG_calculator.
v.1.31 (git 9a4e122)
- Major addition of parsers for GFF and UCSC gene table formats.
This replaces the old gff3_parser and now supports GFF, GTF, and GFF3.
Also moved UCSC gene table parsing out of ucsc_table2gff3 and into
own parser module, available for all. This supports refFlat, genePred,
and knownGene tables. Tests for these parsers are included.
- Updated script get_gene_regions to use parsers.
- Greatly optimized bedGraph writing from script bam2wig to reduce
memory usage. Also ensure that bedGraph is written over entire chromosome.
- Fix bugs when sorting and performing math with null, NA, and inf
values, especially with script manipulate_datasets.
- Fix bug where coverage shifts by 1 bp after each write to fixedStep wig
in script bam2wig. Thanks to Magda for reporting.
v.1.30 (git 9ab9ff4)
- Major upgrade of the Bio::ToolBox::Data library internals.
Old data_helper and file_helper modules are gone, and a
legacy_helper module added for those programs that still haven't
been upgraded yet. Numerous improvements and bug fixes to Data and
Stream objects, structure verification, standard file format metadata,
file writing, and more. Several new methods have been added too.
- Added support for ncount, or name count, of bam files. By
counting unique alignment names, we can avoid double-counting
of reads in adjacent search areas. Also works for counting
paired-end reads. Supported by get_datasets script.
- Updated pull_features script to use new Data objects.
v.1.26 (git 21c800b)
- Removed Extras folder and outdated library functions. These
are available as a separate GitHub project, biotoolbox-extra.
- Improved GFF3 parser to handle orphans more gracefully, and
simplify parsing by adding a next_top_feature function. It is
moved out of the db_helper hierarchy, where it never really belonged.
- Changed license to exclusively Artistic License 2.0.
- Fixed bug when using input files with coordinate information in
script get_datasets. Thanks to Mengyao for reporting.
- Fixed bug when opening a new Data::Stream not based on a file or
data list.
v.1.25 (svn 955)
- Added a new option to manually specify the extension length
and allow new ways to record read coverage in the script bam2wig.pl.
A text graphic is included in the documentation to illustrate
different methods.
- Broke out database and fasta functionality from
Bio::ToolBox::db_helper into a separate sub module, which should
limit the number of modules loaded at compile time.
- Allow main Data feature_type to be specified by command line
option, useful when your input file has names of database features
but not a type column, for scripts get_feature_info.pl,
get_datasets.pl, get_binned_data.pl, and get_relative_data.pl.
- Added BED and GFF string export to Bio::ToolBox::Data::Feature
objects.
- Changed library version reporting for default new Data files.
- Fix bugs with setting and removing AUTO metadata properly
when opening and writing Data files.
- Fix bugs regarding deleting metadata, which had a side effect
of adding unwanted metadata to files written by manipulate_datasets.
- Added more name possibilities when looking for possible name
columns.
- Fix bug where a database may sometimes not be opened properly
after forking into children in data collection scripts.
- Fix bug that prevented statistics from being recovered from
child processes in script graph_data.pl.
v.1.24001 (svn 940)
- Updated tests to catch possible sources of error, including
recent UCSC BigFile libraries that power Bio::DB::BigWig
adaptors, DB_File required for GFF3 loading into memory database,
and path verification in Data metadata.
v.1.24 (svn 936)
- Added new module Bio::ToolBox::Data::Stream for working with
data files line by line instead of loading them into memory.
Moved lots of shared methods into Bio::ToolBox::Data::common.
- Added explicit file support for UCSC-style refSeq and genePred
file formats, as well as Encode narrowPeak and broadPeak files.
- Added new value type, pcount, in data collection scripts and
library score methods. Features, such as Bam alignments, must
be entirely contained within the search region, and not just
overlapping as with the count value.
- Added improved method for reloading forked children files
back into Data objects without having to call external
join_data_file script.
- Improved forking in data collection scripts, including a
delay in the parent after forking to prevent race conditions
on fast servers with high fork numbers.
- Removed all vanity names to data_helper and file_helper
subroutines. All scripts updated to reflect changes.
- Improved identification of overlapping features when avoiding
neighboring features when collecting relative data.
- Optimized Bam score data collection methods.
- Disabled bins when writing coverage in bam2wig.
- Fix bugs with writing CDT files in manipulate_datasets.
- Improved ToolBox::Data::Feature methods to handle internal nulls.
- Improved retrieval of sequence list, particularly for
SeqFeature::Store databases.
- Updated and improved library testing for Data and Stream objects
and database interaction.
- Fixed bug where negative coordinates would not be accepted
when collecting relative coordinates.
- Fixed bug where Bam and BigBed databases may not be opened
properly in some instances, such as precounting features for RPM
scores.
- Fix bug where in some cases all database features could be
returned with the method get_feature().
- Fix bug where the type option was not properly implemented in
script get_feature_info.
- Fix bug limiting to chromosome length in script
get_intersecting_features.
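
The difference between the count and pcount value types introduced in this release can be sketched in Python as below. Alignments are modeled as (start, end) tuples with 1-based inclusive coordinates, which is an assumption for the example; the function names simply mirror the value type names.

```python
def count(alignments, start, end):
    """Count alignments that overlap the region [start, end] at all."""
    return sum(1 for a_start, a_end in alignments
               if a_start <= end and a_end >= start)


def pcount(alignments, start, end):
    """Count only alignments entirely contained within [start, end]."""
    return sum(1 for a_start, a_end in alignments
               if a_start >= start and a_end <= end)
```

For a region of 100 to 200 and alignments at 90-110, 100-150, and 190-210, count reports all three while pcount reports only the one that lies fully inside.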
v.1.23 (svn 915)
- Improved script get_gene_regions to recognize non_coding exons;
prompt for region, feature, and RNA type; specify for more than
one feature type at a time; and avoid mixing RNA sub types from
the same gene. Thanks to Mengyao for troubleshooting.
- Fixed bugs pertaining to collecting relative windows that may
extend beyond the beginning of the chromosome. Thanks to Nate
for reporting.
- Fixed bugs sorting by genomic coordinate, especially when
only Position is provided and not Start.
- Made Bio::ToolBox::Features return smart coordinates only, no
funny values.
v.1.22 (svn 906)
- Added new export options of alternate, common, or all exons
to script get_gene_regions.
- Changed behavior of Bio::ToolBox::Data::Feature such that
database features must now be explicitly retrieved rather
than automatically retrieved, which could lead to runaway
execution if it could not be found.
- Improved how name columns are recognized and used when
retrieving database features.
- Improved writing of strand information in proper format
for Bed and GFF files.
- Fixed numerous bugs that prevented proper execution in
several scripts, including manipulate_datasets, get_feature_info,
graphing scripts. Thanks to Mengyao and Yixuan for reporting.
- Standardize data file loading message among several scripts.
v.1.21 (svn 896)
- Fixed critical bug that prevented upstream windows from
collecting data in script get_relative_data.
- Fixed critical bug that prevented some bigBed files from
being opened.
- Fixed critical bugs that prevented scripts data2fasta and
get_intersecting_features from working properly.
- Fixed bugs where strand may be inappropriately assigned or
sometimes ignored when collecting regional positioned scores.
- Fix minor bugs in output of scripts ucsc_table2gff3 and
get_ensembl_data
- Include checks in data collection scripts to exit gracefully if
datasets can't be verified.
- Interactive list of values to keep or toss is now sorted
alphanumerically in script manipulate_datasets.
v.1.20 (svn 884)
- Refactored db_helper so that all database adaptors are loaded
dynamically only as needed during runtime, rather than loading
everything all at once regardless of need. This results in
faster load times and reduced memory footprint.
- Added new methods to Bio::ToolBox::Data objects, including
sorting, genomic sorting, and feature_type.
- Split out metadata-related methods and Feature objects as
separate modules in Bio::ToolBox::Data. Feature objects will
now automatically retrieve represented database features as
necessary to collect attributes.
- Rewrote many, many scripts to use Bio::ToolBox::Data objects.
Simplify, unify, and improve all Data functions.
- Moved many specialized, outdated, or esoteric scripts to an
optional extras folder that will no longer be distributed via
CPAN but will be available through SVN.
- Added new functions to script manipulate_datasets.pl, including
processing rows with specific values, split and concatenate columns,
view table contents, and add additional manipulations prior to
writing CDT files. Also, several old functions were removed.
- Added support for converting refFlat and simple genePred
file formats to GFF3 in script ucsc_table2gff3.pl.
- Add better warnings for reading files with DOS or MAC line endings.
- Removed file extension manipulation in join_data_file script.
- Replaced fatal errors with warnings in merge_datasets script.
- Fix critical error where midpoints were not calculated correctly
for features in script get_relative_data.pl, preventing data
collection around a feature midpoint.
- Fix bug to properly collect extended bins at 3' end and avoid
undefined start errors in average_gene.pl; plus write a summary
file when executing with forks.
- Fix bugs with collecting features from a database.
- Fix bug with renaming M to UCSC-style chrMT in
get_ensembl_annotation.
- Numerous other small fixes scattered about.
v.1.19 (svn 843)
- Implemented subfeature sharing and multiple parentage when
exporting UCSC tables as GFF3. For example, exons can now be
shared between multiple transcripts of the same gene. This
leads to considerable reduction in file size at the expense
of increased complexity. Naming of subfeatures is now optional.
- Renamed script print_feature_types.pl to simply db_types.pl.
Known databases in the configuration file can now be
interactively chosen from a list.
- Added support for multiple parentage in the gff3 parser
library and script gff3_to_ucsc_table.pl.
- Added a verbose option and improved path detection in script
db_setup.pl.
- Script filter_bam.pl now works on unsorted and non-indexed
bam files, making it more useful than before.
- Bam files opened using db_helper::bam may now be sorted as
necessary before indexing.
- Increase default buffer value in script bam2wig.pl.
- Fixed bug where firstExon features were misnamed as lastExon
in script get_gene_regions.pl.
v.1.18 (svn 826)
- Fixed critical bug when calculating RPM and RPKM values in
data collection scripts. This is a long-standing bug that
produced erroneous values. The bug does not affect bam2wig.pl
rpm reporting.
- Improved methods for collecting from subfeatures such as
exons of genes or transcripts in script get_datasets.pl.
- Added option to specify which UCSC table(s) to use when
setting up a new database in script db_setup.pl.
- Added new options to extend and concatenate sequences in
script data2fasta.pl.
- Added ability to use the samtools fasta index when available
in scripts data2fasta.pl and CpG_calculator.pl. This index is
about 10-20% faster than the BioPerl fasta index.
- Fixed bug to avoid illegal characters in filenames when
splitting data files, and added an option to use a custom
file prefix in script split_data_file.pl.
- Fixed bug where ensembl gene names may not be properly
recorded in the output GFF3 file in script ucsc_table2gff3.pl.
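
For reference, the conventional definitions of RPM and RPKM are sketched below in Python. These are the standard formulas, offered as a reminder of what the values mean; the notes do not state how the library computes them internally, and the function names are illustrative.

```python
def rpm(count, total_reads):
    """Reads per million mapped reads."""
    return count * 1_000_000 / total_reads


def rpkm(count, total_reads, length_bp):
    """Reads per kilobase of feature per million mapped reads."""
    return count * 1_000_000_000 / (total_reads * length_bp)
```

For example, 50 reads over a 2000 bp feature in a library of 10 million mapped reads gives an RPM of 5.0 and an RPKM of 2.5.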
v.1.17 (svn 808)
- Added six new method functions to Bio::ToolBox::Data for
working with columns and metadata.
- Updated script correlate_position_data.pl with parallel
execution plus an ANOVA statistical analysis between data.
- Fixed bug where the --bwapp option was not being used in
script bam2wig.pl. Thanks to Michael D. for reporting.
- Removed extraneous BioPerl warnings when opening a fasta file
or directory fails, and replaced with some suggestions.
- Fixed bug with RPM option that led to warnings in db_helper.
- Simplified warning for duplicate lookup values in script
merge_datasets.pl.
- Reorganized the POD summary and provided examples of usage
for main data collection scripts, plus provide default values
in POD summaries for a number of scripts. Thanks to Christian
for the recommendation.
v.1.16 (svn 794)
- Fixed critical bug that prevented the forward strand from
being written when generating stranded coverage in script
bam2wig.pl. Thanks to Michael D. for reporting.
- Fixed critical bug that prevented the script get_bam_seq_stats.pl
from compiling properly.
- Fixed bug that prevented filtering more than one length at
a time in script filter_bam.pl. Thanks to Yixuan for reporting.
- Fixed again the bug where passing a negative or zero start
to data collection methods issues a warning and resets the value
to 1 in db_helper.
v.1.15 (svn 786)
- Added Bio::ToolBox::Data method to delete column metadata
and improved adding new metadata.
- Added back cached database objects for data collection,
which brings back speed lost in the previous version.
- Original strand format is now maintained when rewriting data
files. For example, + and - from Bed and GFF files as opposed
to 1 and -1.
- Passing a negative or zero start value to data collection
methods in db_helper now issues a friendly warning and resets
the value to 1.
- Opening a BigWigSet directory of bigWig files can now infer
strand based on filename and set the metadata appropriately.
For example, files whose basename ends in f, forward, or plus
will be interpreted as strand 1.
- Script gff3_to_ucsc_table.pl was significantly updated to
address critical flaws and change the output format to refFlat.
- Script manipulate_datasets.pl no longer writes metadata for
simple file formats when using certain functions that do not
change data content.
- Script bam2wig.pl now includes a --flip strand option.
- Fixed errors and made improvements regarding fonts and sizes in
scripts graph_data.pl and graph_profile.pl.
- Various other small bug fixes and checks for optional Perl
module installs.
- Updated shebang lines to use universal /usr/bin/perl
- Updated script POD documentation to make common options more
uniform.
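
The filename-based strand inference for BigWigSet directories noted above might look something like this Python sketch. Only the forward suffixes (f, forward, plus) come from the notes; the reverse suffixes, the extension handling, and the function name are assumptions, and a plain suffix test like this can misfire on names that merely happen to end in f or r.

```python
import os

FORWARD_SUFFIXES = ("forward", "plus", "f")
REVERSE_SUFFIXES = ("reverse", "minus", "r")  # assumed by symmetry, not stated in the notes


def infer_strand(path):
    """Guess strand (1, -1, or 0 for unknown) from a bigWig file's basename."""
    base = os.path.basename(path).lower()
    for ext in (".bigwig", ".bw"):
        if base.endswith(ext):
            base = base[:-len(ext)]
            break
    if base.endswith(FORWARD_SUFFIXES):
        return 1
    if base.endswith(REVERSE_SUFFIXES):
        return -1
    return 0
```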
v.1.14.1 (svn 763)
- Changed the method of caching database objects introduced
in version 1.14, which wreaked havoc with forked child
processes. All database connections are cached by default
and returned if subsequently re-opened, unless explicitly
told to not use the cached connection. Multiple scripts
were updated to reflect the new connection caching.
- Bio::ToolBox::Data now automatically re-clones existing
database connections if you splice the data table.
- Bam file index files are now explicitly generated prior
to opening the bam file database connection. Additionally,
existing .bai files are copied as .bam.bai in preference
to creating a new .bam.bai file. Thanks to Yixuan for
reporting.
- Fixed POD errors in script bar2wig.pl and updated method
for finding the java executable file. Thanks to Guillaume
for reporting.
- Removed debugging warn statements in script
get_relative_data.pl.
- Added POD documentation to Bio::ToolBox::db_helper::useq.
v.1.14 (svn 737)
- Massive reorganization of the entire package into a proper
Perl module distribution that is installed using standard
Module::Build methods. This will install the libraries into
site-specific Perl library directories as Bio::ToolBox::*.
Scripts will install into a standard bin directory. All
scripts have been updated to reflect these changes.
- Added new module Bio::ToolBox::Data, which provides an easy
object-oriented interface to working with data files and the
rest of the Bio::ToolBox functions.
- Added new script db_setup.pl to ease generating an annotation
database with UCSC data
- Added Build tests for all major library functions, including
score collections from all binary database adaptors.
- Added capability to properly collect value types, including
score, count, and length, from useq and wiggle database adaptors
- Loosened restriction for counting Bam alignments where the
midpoint had to be within the query region; now any overlapping
alignment that intersects the region will be counted.
- Reworked the interpolation algorithm to interpolate as many
datapoints as possible in script get_relative_data.pl.
- Removed cryptic error messages when opening databases, and
added database handle caching to avoid repeated openings
- Newly generated feature lists no longer append all aliases to
the feature name
- Added additional attributes to the list of available ones to
retrieve from the database in script get_feature_info.pl. Also
added a --type command line option to set a feature type to
named features.
- Improved data table checking to include a count of columns
for every row.
- Added max_count option to script bam2wig.pl to control for
high Bam coverage
- Fixed bug where the summary file was not created for
script get_relative_data.pl
v.1.13 (svn 691)
- Updated to include native support for USeq archive files
with data collection scripts. USeq files may be used in
the same manner as BigWig, BigBed, or Bam files for data
collection. USeq files may be generated using tools from
the USeq package (useq.sourceforge.net). The
Bio::DB::USeq adaptor is available via CPAN.
- Added new script filter_bam.pl, which can filter alignments
based on various criteria and write a new Bam file. Filters
are one or more boolean tests, including attributes, scores,
lengths, sequence, etc.
- Added new script get_bam_seq_stats.pl, which collects
information about the read sequences themselves and summarizes
the sequence composition and nucleotide frequencies, suitable
for generating sequence logos.
- Updated script manipulate_datasets.pl to allow any integer
to be used when formatting decimal values.
- Restored ability to write a new data file without collecting
data from script get_datasets.pl.
- Changed the log conversion step to avoid having to increase
read count by 1 to avoid log of 0 errors in script bam2wig.pl.
- Use the command line --log argument in preference over
metadata in script manipulate_datasets.pl.
- Method sum now writes 0 instead of null in script
bin_genomic_data.pl.
- Fixed issue where joining data files may not maintain gzip
status. This caused issues when combining forked children files.
- Fixed bug where a provided, indexed data source file
(e.g. BigWig) could not be used as a database in script
get_datasets.pl
v.1.12.6 (svn 680)
- Updated the script novo_wrapper.pl to use Parallel::ForkManager
instead of GNU Parallel. This should make it more stable,
particularly under nohup.
- Consolidated the standard out results when functions were
applied to multiple columns in script manipulate_datasets.pl.
This will make the script much less chatty.
- Fixed bug with naming temporary forked children file names.
- Fixed bugs with the generation of summary files.
- Fixed bug with the automatic identification of the X axis in
script graph_profile.pl.
- Fixed bug where features not found in a database could crash
the script get_feature_info.pl.
v.1.12.5 (svn 667)
- Improved the shift value determination to make it more robust
against outliers in script bam2wig.pl. Additionally, the model
data that is written is now centered over the shift peak to
make evaluations more interpretable.
- Fixed a bug where 0 or negative coordinates may be written
to varStep wig files in script bam2wig.pl.
v.1.12.4 (svn 662)
- Improved the efficiency of scanning for high coverage regions
and calculating 3 prime shift values in script bam2wig.pl. Each
reference sequence is now scanned in parallel. Also added a new
option to write the shift profile model and correlation data.
The efficiency of writing bedGraph files was improved, giving
up to 2X increase in performance. The default maximum duplicate
value is now unlimited. Warnings about coverage beyond the ends
of chromosomes are now silenced unless verbose is turned on.
- The script graph_data.pl can now execute in parallel to improve
efficiency when a list of datasets is provided in advance. A
list may now be provided in conjunction with the --all option.
- Improved recognition of the X-axis column in script
graph_profile.pl.
- Fixed critical error when writing extended position bedGraph
files from script bam2wig.pl where reverse reads were not
extended appropriately in the 3 prime direction.
v.1.12.3 (svn 651)
- Added user options to control the size of the memory buffer
when writing bedGraph files and the disk write frequency in
script bam2wig.pl.
- Added option to control the output order of the features from
script pull_features.pl. The order may match either the input
list or input data file. Also improved automatic column identification
and avoid empty output files.
- Script data2wig.pl will now write bedGraph files.
- Fixed bug leading to excessive memory usage when writing a
fixedStep wig file from script bam2wig.pl. Thanks to Jeff for
reporting.
- Fixed bug where writing strand values for gff or bed files may
not be written correctly.
- Fixed bug leading to errors loading input files with comment or
empty lines in the middle of data lines.
- Fixed bug to avoid log of 0 errors in script bam2wig.pl.
v.1.12.2 (svn 642)
- Scripts find_enriched_regions.pl and CpG_calculator.pl are now
multi-threaded. The find_enriched_regions.pl also has additional
optimizations to reduce memory usage.
- The script merge_datasets.pl now has the option to use a coordinate
string as a unique identifier when looking up features. This is
particularly helpful with BED, GFF, and other files with genomic
coordinates that do not have unique name identifiers.
- A coordinate string in the format chromo:start-stop may now be
generated from coordinate values in data files using a new function
in the script manipulate_datasets.pl.
- Fixed a bug regarding changing file extensions in script
join_data_file.pl, which gave odd output file names with scripts that
executed in parallel.
v.1.12.1 (svn 635)
- Fixed bugs where gzip status and file extensions may be inappropriately
inherited. This may cause problems when joining children files from
parallel process forks.
- Fixed bug where the interactive menu would exit upon an empty value
in script manipulate_datasets.pl. A "q" must now be provided to exit.
- Minor optimization when calculating shift values in script bam2wig.pl.
v.1.12 (svn 619)
- Major improvements to performance of some data collection scripts by
adding multi-threaded options. These include get_datasets.pl,
get_relative_data.pl, average_gene.pl, and bam2wig.pl. The number of
CPU forks may be specified with the --cpu option (default 2). This option
requires the installation of Parallel::ForkManager, available through
CPAN. Run the check_dependencies.pl script to install it.
- All gzip compression reads and writes are now forked through an
external gzip utility for a considerable boost in performance (2-5X).
The gzip executable must be in your path for this to work (it usually
is on most Unix-like environments).
- Added --long option when collecting data from long features in script
average_gene.pl.
- Improved efficiency when collecting data from very large windows in
both get_relative_data.pl and average_gene.pl.
- Summing the total number of read alignments in Bam files is also
multi-threaded. Summing the total number of intervals in a BigBed file
is also improved.
- Fixed a critical error where not all windows had data collected when
using the script get_relative_data.pl
v.1.11 (svn 603)
- Major revision of how features are now retrieved from the database
using primary_IDs rather than relying on unique names in the database.
Generating lists of features will now return Primary_ID, Name, and Type.
The Primary_ID is unique to a database and is usually non-portable.
Current feature lists with only Name and Type will still work, but are
subject to limitations of non-unique Names in the database. This affects
all scripts that work with database features, including get_features.pl,
get_feature_info.pl, get_datasets.pl, get_relative_data.pl,
average_gene.pl, get_intersecting_features.pl, and correlate_position_data.pl.
- GFF3 annotation scripts get_ensembl_annotation.pl and ucsc_table2gff3.pl
now produce GFF3 files that better match the GFF3 specification. Names
are no longer made unique (which broke ties with the originating data),
proper Dbxref tags are attributed when external sources could be
identified, and chromosomes are now sorted by name. Other minor
improvements were also made.
- Fixed critical bug that prevented spliced alignments from being
counted in script bam2wig.pl. Thanks to Pinal K. for reporting.
v.1.10.3 (svn 597)
- Unified column names and improved their recognition in scripts
get_feature_info.pl and the graphing scripts graph_data.pl,
graph_histogram.pl, and graph_profile.pl.
- Graphing scripts now write the output graph directory in the input
file parent directory instead of the current directory.
v.1.10.2 (svn 591)
- Added a new option of position when adjusting coordinates of retrieved
features using the script get_features.pl. Coordinates may be adjusted
at the 5 prime, 3 prime, or both ends of stranded features. This also
fixes bugs where collected features on the reverse strand with adjusted
coordinates were not reported properly.
- Improved automatic recognition of the name, score, and other columns
in the convertor scripts data2bed.pl, data2gff.pl, and data2wig.pl.
- Improved the Cluster and Treeview export function in script
manipulate_datasets.pl. The CDT files generated now include separate ID
and NAME columns per the specification, and new manipulations are
included prior to exporting, including percentile rank and log2.
- The convert null function now also converts zero values if requested
in script manipulate_datasets.pl.
- Added new option of a minimum size when trimming windows in the script
find_enriched_regions.pl.
- Increased the radius from 35 bp to 50 bp when verifying a putative
mapped nucleosome in script map_nucleosomes.pl, leading to fewer
overlapping or offset nucleosomes.
- Added new option to re-center offset nucleosomes in script
verify_nucleosome_mapping.pl. Also improved report formatting.
- Added checks and warnings when writing file names longer than 256
characters. Some scripts automatically generate file names that may
exceed this limit, preventing writing. File names are now truncated.
Thanks to Adam F. for reporting.
- Added new methods and code improvements to the gff3 parsing library.
- Fixed a bug in script merge_datasets.pl where the column index for a
second file may not be properly validated leading to premature
termination.
- Fixed a bug where multiple datasets combined with an ampersand for
merging were not properly verified.
- Fixed a bug where a user may not be prompted to select a dataset from
a database if none was supplied from the command line.
- Fixed a bug where files containing trailing nulls do not load
properly.
- Fixed a bug related to finding specific data columns by name.
- Fixed a bug with writing summary files.
v.1.10.1 (svn 568)
- Added support for Bio::DB::Fasta in the main BioToolBox library, and
added the support to scripts data2fasta.pl and CpG_calculator.pl. Any
BioToolBox program that requires chromosome information or sequence can
now use a genomic multi-fasta or directory of fasta files in the --db
option.
- Fixed critical error in data2gff.pl that prevented files from being
converted to GFF format.
- Fixed critical error in merge_datasets.pl that prevented column headers
from being written to the output file.
- Made the warning about unavailable files on the UCSC FTP server less
scary in the script ucsc_table2gff3.pl.
- Updated and clarified some script documentation.
v.1.10 (svn 559)
- Significantly improved performance when collecting data from Bam files
by using a low level API. Improvements of at least 2X may be realized.
- Significantly improved the performance of the bam2wig.pl script by at
least 2X. Added a new option of recording extended regions across the
predicted fragment based on empirically determined shift values.
Sampling to determine shift values has been increased. BedGraph files
are now written more efficiently. A maximum number of identical reads is
now enforced.
- Significantly improved the performance of the split_bam_by_isize.pl
script to increase speed by at least 2X. Added an option to skip
checking of mates. Improved reporting of results.
- Added a filter option to remove overlapping nucleosomes in script
verify_nucleosome_mapping.pl; also fixed bugs in reporting offset
distances and improved output reporting.
- Removed confusing separate scan and tag datasets required for script
map_nucleosomes.pl. Cleaned up and organized code. Fixed bugs that
prevented datasets from being validated.
- Fixed critical bug where data was not collected for the final row in
script get_datasets.pl.
- Fixed bugs with parsing unusual input files, for example commented
header lines in bed files or inconsistent column numbers.
- Fixed bug in script get_intersecting_features.pl where a strand column
was expected even if it was not present.
- Changed all tim library calls to use arrays instead of anonymous
hashes for a cleaner API.
- Changed shebang lines to use /usr/bin/env to improve portability on
systems with different Perl versions installed.
- Cleaned up and made POD documentation more consistent.
- Add warnings about database users and passwords in configuration file.
v.1.9.7 (svn 539)
- Fixed critical bug where an exon containing all three 5'UTR, CDS, and
3'UTR was not properly parsed in the script get_ensembl_annotation.pl.
New command line options to include or exclude CDS, UTR, and start/stop
codons were added. Significant changes to improve and organize the code
were also made.
- Changed the method of assigning the GFF type for chromosomes and
scaffolds based on their name in the script ucsc_table2gff3.pl. Also
made the inclusion of start and stop codons enabled by default.
- Removed annoying automatic column assignment for input GFF files in
script data2bed.pl. GFF files are still handled properly if no columns
are specified on the command line.
v.1.9.6 (svn 533)
- Fixed critical bug in script ucsc_table2gff3.pl where single exons
containing all three 5'UTR, CDS, and 3'UTR subfeatures were not properly
parsed into GFF3. This had resulted in an extended CDS longer than
expected. Thanks to H. Stovall for reporting.
- Added warnings when a sequence could not be generated to avoid
division by 0 errors, and a slight correction to fraction calculations,
in script CpG_calculator.pl.
v.1.9.5 (svn 525)
- Changed the non-intuitive --except option to a more intuitive --zero
option in script manipulate_datasets.pl; this is now a boolean option to
include or exclude zero values when calculating statistics. The printed
statistics output has also been cleaned up and no longer includes
decimal formatting. The export function now generates a name when
executed automatically.
- Added capability to use a column of source values rather than a static
text string for the GFF source tag in script data2gff.pl. Also made
improvements to the interactive ask session.
- Added the capability to use a big file dataset as the database for
chromosome information in script find_enriched_regions.pl.
- Added an option to automatically convert the output file to a BED file
in script get_gene_regions.pl, and included a description of the --in
option in the POD documentation.
v.1.9.4 (svn 519)
- Fixed first critical bug in script get_datasets.pl where strand
information in input files with genomic coordinates (e.g. BED files) was
not considered when adjusting coordinates (start, stop, or fractional).
- Fixed second critical bug in script get_datasets.pl where collecting
fractional data for named database features resulted in data collection
over the entire feature.
- Improved interpretation of input file features as genomic regions or
named features in script get_datasets.pl.
- Changed the --set_strand option to --force_strand in multiple data
collection scripts. This should make the function a little more obvious
as to its purpose. Documentation changed as appropriate.
v.1.9.3 (svn 516)
- Fixed bug where wig definition lines may not be written when no
alignments exist in the first 2 Mb of a chromosome when converting a bam
file to a wig file in script bam2wig.pl. Definition lines are now always
written. Thanks to Matt J. for reporting.
- Fixed bug where the format_with_commas sub was not properly imported
into the tim_db_helper library
- Fixed bug where the bed output from script get_features.pl did not
properly report strand information.
v.1.9.2 (svn 510)
- Fixed critical bug where codon changes were not reported correctly for
minus strand genes in script locate_SNPs.pl. Thanks to Craig K. for
reporting.
v.1.9.1 (svn 507)
- Added critical code to interpret strand information from input files
such as Bed and GFF into BioPerl standards. Essential for collecting
stranded data. Also properly writes back strand information for valid
Bed and GFF files
- Updated and unified internal library methods for validating and
requesting database feature types. By default, all database features are
presented to the user as a list when selecting database features to
collect data. The source_exclude parameter in the biotoolbox.cfg
configuration file is now deprecated.
- Upgraded script get_intersecting_features.pl to automatically
recognize input file columns and search for more than 1 feature type
- Fixed bug in script get_datasets.pl where the program would not
continue when only a data database was provided
- Fixed bug of requesting index when using a .kgg file as a gene list in
script pull_features.pl
- Fixed bug in generating file name for Treeview export function in
script manipulate_datasets.pl
- Fixed behavior when reading files to prevent adding the current
program name to the metadata when the input file does not have this
metadata
- Minor updates to script novo_wrapper.pl
v.1.9.0 (svn 493)
- Added new script get_features.pl which generates a list of features
for one or more feature types from a database. Information about the
features may be returned, including name, type, and coordinates. Sub
features may be included. The data may be written as a BioToolBox
formatted text file, GFF or BED.
- Added new script correlate_position_data.pl that calculates a Pearson
correlation between the score values at identical positions along a
feature between two datasets. This helps in identifying changes in
spatial distribution of values. An option for calculating shifts is also
available.
- Improved Big File generation such that Bio::DB::BigWig or
Bio::DB::BigBed is no longer required just to generate the big file, as
conversion uses external utilities anyway.
- Fixed generation of bin values when calculating distribution
frequencies in scripts data2frequency.pl and graph_histogram.pl
v.1.8.7 (svn 487)
- Added new command line options to script merge_datasets.pl to control
the program's behavior. The "--lookupname" option allows you to specify
the name of the lookup column, while "--manual" turns off all automatic
guessing of columns. Also improved handling of original_file metadata.
- Added a new option to collect data from long features (such as genomic
annotations) instead of point data (microarray or sequence data) in
script get_relative_data.pl.
- Added option to convert to and from Roman numerals in chromosome names
and support for wig files in script change_chr_prefix.pl
- Added option to change the IP port number when connecting to a remote
MySQL database host in script get_ensembl_annotation.pl
- Fixed bug to properly close opened files in script split_data_file.pl
and avoid unnecessary error messages.
- Modified statements and warnings regarding step and span values in
script data2wig.pl
v.1.8.6 (svn 477)
- Added numerous enhancements and bug fixes to script data2wig.pl,
including automatically assigning the span parameter in the wig file,
identifying coordinate columns, adding command line options for
coordinate columns, and updating the POD documentation
- Improved the treeview export function in script manipulate_datasets.pl
to include different manipulations, including median center of genes or
datasets, converting to Z-scores, and converting null values. Also
changed the default output name to <basename>.cdt.
- Added advanced option to script merge_datasets.pl to specify the
column order on the command line instead of interactively. Also
increased the number of columns that can be specified as letters.
- Added the "value" command line option to specify the type of data to
collect to the script find_enriched_regions.pl. Also added the sum
method plus some improvements for identifying depleted regions.
- Updated the script run_cluster.pl to accept any file name as input,
and added basic file format validation checks prior to running the
cluster algorithm, among a few other minor improvements
- Improved handling of error messages when attempting to open databases
that do not exist or can not otherwise be opened.
- Added more support for reading bedgraph files, dealing with track
lines and possibly empty lines
- Data from bigWig files that use spanned features (span > 1 bp) are
now collected at every base rather than just the start position
- Fixed bug where more than two files were not properly merged using
lookup in script merge_datasets.pl
- Fixed bug to allow data to be collected for Bed files from indexed
data files without specifying a database in script get_datasets.pl
v.1.8.5 (svn 461)
- Fixed critical bug where all knownGene feature strands are reversed in
script ucsc_table2gff3.pl
- Fixed critical bug where the sign is flipped when generating Z-scores
with script manipulate_datasets.pl
- Added new functions "convert null values" and "absolute value" to
script manipulate_datasets.pl
- Added additional file format checks when writing formatted files
including GFF, BED, and SGR. File extensions may automatically change to
default txt if the format does not match.
- Better handling of input Bed files and generating appropriate default
file names in script data2gff.pl
- Improved merging of datasets by lookup, and loosened restrictions on
metadata checking, issuing warnings instead, in script merge_datasets.pl
- Loosened restrictions on metadata differences and failures in script
join_data_file.pl
- Included fix for finding column indices when name is prefixed with #
- Added another check to avoid returning undefined values from BigWig
data collection
v.1.8.4 (svn 448)
- Changed shift value determination to use trimmed mean to avoid
outliers, and added new option to control the minimum acceptable R^2
value in script bam2wig.pl
- Improved script merge_datasets.pl to identify appropriate lookup
columns automatically and successfully merge more than two files using
lookup
- Changed my implementation of Z-score generation so that signed values
are properly reported instead of absolute values in script
manipulate_datasets.pl
- Fixed critical bug where output files were prematurely closed when
splitting a data file in script split_data_file.pl
- Reduced some unnecessary error reporting when opening databases that
do not exist
- Updated list of column names to avoid in script graph_data.pl
- Updated interactive prompts in script manipulate_datasets.pl
- Fixed bug where the --pos option in script get_datasets.pl did not accept
the 'm' argument
- Fixed bug where strand was reported as '.' instead of '0' in script
get_feature_info.pl
- Fixed bug regarding writing headers, especially with new BED files
- Fixed bug when providing an index of 0 on the command line with script
manipulate_datasets.pl
v.1.8.3 (svn 431)
- Improved mapping efficiency, made tag dataset optional, added direct
support of BigWig and BigWigSet datasources, and updated documentation
to script map_nucleosomes.pl.
- Updated script verify_nucleosome_mapping.pl to accommodate changes in
map_nucleosomes.pl output, added support for generic input files, added
option for other datasources, and added direct support for BigWig and
BigWigSet datasources.
- Added multiply and add methods to script manipulate_datasets.pl.
- Added firstIntron and lastIntron to list of regions to collect in
script get_gene_regions.pl
- Fixed critical bug when collecting data about GFF features from a
database that caused a crash when no features were found.
- Fixed bug in get_gene_regions.pl when collecting introns where the
last intron was skipped and reverse strand coordinates were flipped
- Fixed bugs in manipulate_datasets.pl where a list of invalid index
numbers could still evaluate to index 0, and the start column may not be
recognized when performing a genomic sort.
- Fixed bug where text files with DOS/Windows line endings (CRLF) were
not loaded properly
- Fixed bug in data2wig.pl to skip positions less than or equal to 0
- Improved null value reporting when collecting data
v.1.8.2 (svn r411)
- Added new script CpG_calculator.pl to count observed and expected CpG
dinucleotides across a genome sequence or defined regions.
- Added R61 SacCer2 to R64 SacCer3 conversion to script
convert_yeast_genome_version.pl. Also improved chromosome name
recognition and identification of columns in custom file structures.
- Fixed and improved bin generation and output in scripts
data2frequency.pl and graph_histogram.pl. Values outside of the
requested range are now ignored. Script data2frequency.pl also has
considerable code cleanup and reorganization.
- Added a sum method and made minor enhancements to wig data collection
to script bin_genomic_data.pl, along with considerable code cleanup.
- Added automatic capability to script merge_datasets.pl. All unique
columns are automatically merged without manual interaction. This is now
useful for automated shell scripts.
- Enforced no compression when generating bigWig files, and improved
column recognition in script data2wig.pl
- Changed 'primary_tag' to 'type' in the generated metadata and subtrack
selection for BigWigSet database output in script big_file2gff3.pl. Also
improved conf stanza renaming scheme for BigWigSets.
- Fixed bug in script bar2wig.pl that prevented the USeq App Bar2Gr from
being used.
v.1.8.1 (svn r392)
- Updated script find_enriched_regions.pl to handle separate feature and
data databases if desired, and add capability to restrict searches to
specific strands.
- Updated script map_transcripts to handle chromosome names without
integers
- Brought script convert_yeast_genome.pl back out of retirement and
updated with R63 to R64 convertor
- Added chromosome and sequence sorting to GFF3 output from script
get_ensembl_annotation.pl. Also include Ensembl API version reporting.
- Updated script check_dependencies.pl to report the installed Ensembl
API version number
- Improved GFF3 parsing and minor improvements to script
gff3_to_ucsc_table.pl
- Fixed bugs when working with BigWigSet databases, where a trailing
slash in the directory name may lead to different behaviors, and
unexpected results when collecting data from BigWigSet databases using
two different methods in the same program
- Fixed bug with null values in tab-delimited text files; they are now
internally converted to the null character '.'.
- Fixed sorting issues in script split_bam_by_isize.pl
- Fixed bugs in script novo_wrapper.pl that prevented an uncompressed
Fastq input file from being split properly, split input files from being
removed after aligning, and a single unsorted Bam file from being further
processed
v.1.8.0 (svn r378)
- Moved script novo_wrapper.pl out of retirement (due to popular demand)
and significantly updated it to handle parallel execution
- Retired old script merge_SNPs and replaced it with new
intersect_SNPs.pl script, which is an improved version that uses the VCF
format.
- Updated script locate_SNPs.pl to work with multiple alternate
sequences, multiple features, and importantly with the VCF format
- Added .vcf and .bdg extensions as properly recognized file format
extensions. Changed default bedgraph extension to use .bdg in script
bam2wig.pl
- Stripped all code and mention of binary tim_data_formatted files based
on Storable. Not really a prominent feature and never lived up to its
hype anyway, so removing it
v.1.7.4 (svn r363) (not released)
- Fixed critical bug that prevented local Bam files from opening for data
collection
- Added warnings if a chromosome segment failed to be found in a
database
v.1.7.3 (svn r355)
- Fixed bugs in script bam2wig.pl that prevented it from finding its
libraries and compiling properly; and another bug that prevented
stranded start positions from being recorded properly
v.1.7.2 (svn r351)
- Fixed bug in script ucsc_table2gff3.pl where the output file name may
not be properly generated, leading to an overwrite of the input file.
- Fixed bug in script bam2wig.pl where the recorded position is off by 1
bp
- Added recommended settings in the POD for bam2wig.pl
v.1.7.1 (svn r346)
- Fixed critical bug in data collection library that allowed too many
datapoints to be collected by ignoring the stop position. This could
affect scripts get_datasets.pl, get_relative_data.pl, average_gene.pl,
find_enriched_regions.pl, and others.
- Major overhaul of script pull_features.pl to include better automatic
identification of identifier columns, the capability to match multiple
features, and to simultaneously write all groups from a .kgg list
- Updated script get_datasets.pl so that it would rewrite the output
file after each round of data collection.
- Minor bug fixes in script find_enriched_regions.pl
- Retired outdated script convert_yeast_genome_version.pl. Users should
use the liftOver program from UCSC and chain files from SGD.
v.1.7.0 (svn r340)
- Added new program get_gene_regions.pl which helps in retrieving
regions not explicitly annotated in a database, including start and stop
sites of transcription and introns.
- Added new program data2fasta.pl which generates a multi-Fasta file
from a tab-delimited text file of coordinates or a list of sequences,
such as microarray probes.
- Added new program compare_subfeature_scores.pl which compares a list
of features and subfeatures and finds the subfeature with the minimum and
maximum score.
- Major update to the data collection scripts to improve memory
consumption and efficiency, and a significant boost in speed when
working with BigWig data sources (I have seen up to 10 fold increase,
depending on collection methods).
- Improvements when working with BigWigSet directories, including
working with impromptu directories of BigWig files that do not have a
defined metadata file.
- Added the option of using separate annotation and data databases when
using the data collection scripts. This greatly simplifies things when
you have, for example, an annotation SeqFeature::Store database and a
BigWigSet database of data.
- Added the rpkm method to work with any segment, not just genes with
exons, in data collection scripts get_datasets.pl and average_gene.pl
- Fixed bugs in script ucsc_table2gff3.pl, data2wig.pl,
find_enriched_regions.pl, and bar2wig.pl
v.1.6.4 (svn r314)
- Major update to script bam2wig.pl to reduce memory consumption by
writing incremental portions. The strand option is now a boolean option,
and when enabled, automatically writes both strands simultaneously. The
binning of read counts into windows of user-selected size is now
possible. The optimal shift value for ChIP-Seq data can now be empirically
determined from the reads using a statistical method.
- Added additional support for UCSC ensGene tables by including
ensemblToGeneName and ensemblSource supplemental tables in script
ucsc_table2gff3.pl. The common gene name is now included in the output
GFF3 file.
- Added rna_count function to script get_feature_info.pl
- Added minimum and maximum value functions to script
manipulate_datasets.pl
- Included a range option when generating a summary file in script
manipulate_datasets.pl
- Improved the regular expression matching of the chromosome name when
sorting by genomic coordinates in the script manipulate_datasets.pl
- Increased the number of available letters when requesting indices from
the second file in script merge_datasets.pl
- Updated script check_dependencies.pl to handle missing dependencies
more gracefully
- Updated error handling of missing Perl module dependencies, including
IO::Zlib
- Fixed bug where the default chromosome exclusion list in
biotoolbox.cfg wasn't being used when generating a new genome interval
list
- Fixed bug where a script might ignore the --nogz option when the
original file was gzipped
- Fixed bug in script split_data_file.pl where a filename may get out of
sync with what was requested and what is written
v.1.6.3 (svn r293)
- Added knownGene as a source in script ucsc_table2gff3.pl
- Improved handling of the chromosome exclusion list in library
tim_db_helper
- Fixed bug where an exception could occur if multiple genomic regions
on different chromosomes are returned from a database query. Included
logic to help identify the appropriate intended chromosome.
- Fixed bug where an exception and crash could occur if the query
chromosome is not present in a bigWig, bigBed, or Bam file when
collecting data. Chromosome names are now checked prior to query.
- Fixed bug in script get_datasets.pl where a null value is returned
instead of 0 when using the method of sum.
- Removed several minor bugs that could generate non-fatal Perl warnings
v.1.6.2 (svn r282)
- Fixed bugs in script data2bed.pl that prevented a bigBed file from
being generated. Also improved autodetection of data columns and allowed
for dummy data to be inserted in lower column data when writing higher
column data. Also added ability to use either the GFF Name or ID
attribute as the Bed feature name.
- Added span option to script data2wig.pl when making wig files.
- Renamed script process_agilent.pl to process_microarray.pl. Completely
restructured internal data to accommodate multi-slide arrays and other
file formats, including NimbleGen and GenePix.
- Removed annoying verbose output from script split_data_file.pl and
improved efficiency.
- Stopped writing index keys in the metadata of tim data file formats.
Index is now automatically calculated and retained internally. Also
avoids writing metadata automatically if it wasn't present in the first
place.
- Added summary export function to script manipulate_datasets.pl. This
replicates the summary option from script get_relative_data.pl.
- Added multi-column support to the subtract and division functions in
script manipulate_datasets.pl.
- Minor bug fixes and improvements to script map_oligo_data2gff.pl.
- Improved script gff3_to_ucsc_table.pl to handle gzip files and make
the UCSC bin column optional.
- Added character escaping when generating GFF3 files.
- Improved handling of BigWigSet directories in script big_file2gff3.pl
where the set name is used as the final subdirectory in the target path.
Also improved name handling.
- Fixed bug in writing Sam files in script change_chr_prefix.pl. Also
added increased support for pragmas and fasta sequences in GFF3 files,
and support for non-standard text files.
- Changed the score column name to the more meaningful outfile basename
when writing summary files.
- Fixed data collection from Bed files in script bin_genomic_data.pl.
- Renamed script map_relative_data.pl to get_relative_data.pl; updated
the POD to be more helpful.
v.1.6.1 (svn r258)
- updated the inline documentation for all perl scripts to include the
version option
v1.6.0 (svn r253)
- added version numbers and reporting to all perl scripts and modules
- retired a number of outdated scripts
- renamed script map_data.pl to map_relative_data.pl
v1.5.9 (svn r247)
- updated script big_file2gff3.pl to generate BigWigSet conf stanzas
with subtracks, also more thorough conf stanzas
- added additional axis formatting options to script graph_profile.pl
- fixed critical error in library tim_db_helper where relative
coordinates were not correctly reported in function
get_region_dataset_hash()
- improved handling of opening a bigwigset database in library
tim_db_helper::bigwig
- major overhaul of script average_gene.pl to work with bed files, add
new methods including rpm support, and general much-needed
reorganization
- improved error messaging in biotoolbox libraries by using confess
instead of croak
- reorganize the order of checking for the biotoolbox configuration in
tim_db_helper::config
v1.5.8 (svn r240) (not released)
- fix some bugs with script graph_histogram.pl concerning the bins and
their labels
- updated script gff3_to_ucsc_table.pl to work with gene models without
transcripts and fix bugs handling comments and pragmas
- fixed bug with trimming windows in script find_enriched_regions.pl by
including absolute option to get_region_dataset_hash() function in
library tim_db_helper
- added option to randomly assign strand for paired-end features to
script bam2gff_bed.pl
- fix chromosome regex issue with non-standard chromosome names in
script bar2wig.pl
- updated methods to get chromosome sizes in libraries
tim_db_helper::bigwig and tim_db_helper::bigbed
- added new parameter chromosome_exclude in configuration file
biotoolbox.cfg, which allows specific chromosomes to be excluded when
generating new feature or genomic interval lists
- removed all references to key reference_sequence_type from config file
biotoolbox.cfg and associated scripts
- updated chromosome reference, and added logic to automatically
identify column indices in script data2bed.pl
- updated several scripts to use seq_ids to retrieve chromosome lists
- fixed bug in script get_feature_info.pl where short feature lists
would cause a failure when generating a list of possible attributes from
sample features
v1.5.7 (svn r227) (not released)
- major overhaul of script get_datasets.pl
- removed subs get_feature_dataset() and get_genome_dataset() from
library tim_db_helper, functionality moved to script get_datasets.pl
- added data color options to script graph_profile.pl
- completely updated script map_data.pl to work with chromosome segments
rather than named features, and added rpm support
- added new sub to check datasets for rpm support in library
tim_db_helper
- fixed bug when specifying no datasets in script get_datasets.pl
- improved support for BigWigSet databases in library tim_db_helper and
script print_feature_types.pl
v1.5.6 (svn r223) (not released)
- added rpm method to score functions in library tim_db_helper
- minor bug fixes and adjustments to help rpm method in tim_db_helper
bigwig, bigbed, and bam libraries
- minor bug fix in script find_enriched_regions.pl
- fixed export bug in library tim_db_helper::bigbed
- fixed bug in library tim_db_helper sub process_and_verify_dataset()
where new datasets would never be prompted
- corrected the method for counting bed features in library
tim_db_helper::bigbed
- fixed alignment collection to only take alignments with midpoint
positions within the requested region in library tim_db_helper::bam
v1.5.5 (svn r219) (not released)
- added new avoid option to method get_region_dataset_hash() in library
tim_db_helper
- updated script map_data.pl to use get_region_dataset_hash()
- fixed bug in method validate_dataset_list() in library tim_db_helper
- fixed bug in script merge_datasets.pl where table headers may not be
written properly
- fixed bug in tim_db_helper::get_genome_dataset() if more than one
segment was found
- made numerous improvements in opening db connections in library
tim_db_helper
- made changes to assigning feature type when opening certain files in
library tim_file_helper
- fixed bug in library tim_db_helper where bed file coordinates were not
written out in interbase
- moved the sum_total_alignments() subroutine from the script bam2wig.pl
to the library tim_db_helper::bam
- added support for stranded paired-end RNA-Seq bam files aligned with
TopHat which use the XS attribute to record strand information in
scripts bam2wig.pl and bam2gff_bed.pl
- disabled splices on paired-end bam files in script bam2wig.pl
v1.5.4 (svn r209) (not released)
- added more explicit support for bed files in the tim_file_helper and
tim_data_helper libraries, including data structure verification,
interbase to base conversion, and metadata handling
- generalized bam and bigfile database handling to tim_db_helper
libraries
- simplified generating genomic windows in tim_db_helper
- improved handling of collecting data from bigfile databases in
tim_db_helper libraries
- added chromosome feature output to script big_file2gff3.pl
- updated numerous scripts to reflect tim_db_helper changes; general
code cleanup
- further simplification and code cleanup of library tim_db_helper,
including database and dataset list verification, and removing redundant
code in collecting dataset values
- added new subroutine process_and_verify_dataset() to library
tim_db_helper
- updated scripts average_gene.pl, find_enriched_regions.pl, and
map_data.pl to use the new sub process_and_verify_dataset()
v1.5.3 (svn r205)
- Fixed bug in script bam2wig.pl that prevented spliced alignments from
being properly checked and recorded.
- Fixed numerous bugs in script ucsc_table2gff3.pl, including a bug
where the gene start coordinate may not be updated from interbase to
base, and not accurately converting the CDS phase
- Added new features to the script ucsc_table2gff3.pl, including
automatic table retrieval through FTP from UCSC to greatly simplify
conversion, adding support for knownGene and xenoRefGene tables,
customizing the type of features to output, properly handling features
with duplicate names by creating unique IDs, and optionally including
chromosome information in the output GFF3 file
- Deleted the now redundant script ucsc_chrom2gff3.pl
v1.5.2 (svn r200)
- Updated several scripts and libraries to fix bugs in handling GFF
version numbers and pragmas.
- Added unique IDs to the gff3 output from bam2gff_bed.pl
- Added option to deal with multiple values at identical positions in
the script data2wig.pl
- Added support for log2 values when combining multiple values at
identical positions in scripts data2wig.pl, bar2wig.pl, and
useq2bigfile.pl.
- Retired the outdated script just_blast_oligos.pl.
v1.5.1 (svn r193)
- Fixed critical bug in script bar2wig.pl where values from multiple
positions were not combined properly. Also fixed bug with processing a
single bar file.
- Removed required dependencies of bioperl for scripts bar2wig.pl and
useq2bigfile.pl
- Fixed small bug in tim_db_helper::bigbed library to ensure positions
were within the region of interest
- Added mapping quality filter and other improvements to script
bam2wig.pl
- Changed score reporting to record mapping quality in script
bam2gff_bed.pl
v1.5 (svn r184)
- Added script useq2bigfile.pl for converting USeq archives
- Added script check_dependencies.pl for assisting in checking for Perl
module dependencies. It will help install the latest versions through
CPAN
- Changed the biotoolbox configuration file from lib/tim_db_helper.cfg
to biotoolbox.cfg in the root directory.
- Moved the biotoolbox configuration loader into a separate module as
lib/tim_db_helper/config.pm. This avoids requiring installing BioPerl
and loading all of tim_db_helper.pm when it may not be necessary.
- Updated numerous scripts to reflect changes with the biotoolbox
configuration loader.
- added axes labeling options to scripts graph_data.pl and
graph_histogram.pl
- fixed bug in handling bed files in library tim_file_helper
- minor fixes in script data2wig.pl
- improved working with bigfile conversions
- fixed minor bug in script big_file2gff3.pl when leaving files in the
current directory
v1.4.4 (svn r162)
- Added reads per million option to script bam2wig.pl
- Added parent, exon, and transcript_length attributes to script
get_feature_info.pl
- Updated scripts find_enriched_regions.pl and map_transcripts.pl to
work with standalone data files (BigWig, BigBed, Bam)
- Added configuration, description, and capabilities for working with
SQLite database files in tim_db_helper
- Added midpoint as acceptable coordinate in script data2wig.pl
- Bug fixes to scripts locate_SNPs.pl and bam2wig.pl and library
tim_db_helper::bam
v1.4.3 (svn r144)
- Changed script bar2wig.pl to require method for combining values and
removed interbase option
- Updated peak identification in script map_nucleosomes.pl to use the
tag dataset and not the scan dataset
- Updated script big_file2gff3.pl to produce more useful conf files with
BigWigSets
- Added overlap data column to output of script
get_intersecting_features.pl and added --set_strand option to enforce
directionality
- Added three new functions to script manipulate_datasets.pl, including
new column, strandsign, and mergestrand
- Fixed script wig2data.pl so it works now
- Updated script get_feature_info.pl to parse an attribute list from the
command line
- Improved handling of metadata when opening tim data files
v1.4.2 (svn r129)
- Added fast low level coverage function to the script bam2wig.pl
- Fixed script pull_features.pl to keep the order of features in the
list file.
- Fixed script bar2wig.pl to correctly identify the chromosome name.
- Various bug fixes to the database library helper tim_db_helper.pm.
v1.4.1 (svn r119)
- Fixed bug with get_ensembl_annotation.pl where a protein_coding gene
encoding a transcript lacking a CDS would write inappropriate
coordinates. These transcripts no longer write start_codon, stop_codon,
or CDS subfeatures.
- Fixed bug with script get_intersecting_features.pl where regions
with a start, stop modifier were not being selected properly.
- Fixed bug with tim_db_helper modules that prevented working with
source data files specified in a database feature
- Added log transformation of count in script bam2wig.pl
v1.4 (svn r111)
- Added script bam2wig.pl for enumerating alignments and writing a wig
file of the counts.
- Added script change_chr_prefix.pl for adding or stripping chromosome
prefixes from data and annotation files.
- Bug fixes to ucsc_table2gff3.pl.
v1.3 (svn r104)
- Added ability to restrict data collection to exon subfeatures to
script get_datasets.pl. Useful for RNA-seq analysis.
- Added exon count as attribute to script get_feature_info.pl.
- Bug fixes to get_datasets.pl.
v1.2 (svn r98)
- Added support for bam files as a data source.
- Updated data collection scripts to allow direct referencing of data
source files, including bigWig, bigBed, and Bam files, on the command
line, without having to reference the files from within the database.
v1.1 (svn r92)
- Updated script ucsc_table2gff3.pl to use Bio::SeqFeature::Lite. Now
outputs exon and codon features.
- Updated script get_ensembl_annotation.pl to collect RNA features from
Ensembl as well as generate exon and codon features.
- Added script gff3_to_ucsc_table.pl to generate UCSC style refSeq
tables from GFF3 formatted data.
v1.0.2 (svn r91)
- Bug fixes to libs tim_file_helper and tim_db_helper
- Bug fixes to scripts print_feature_types.pl,
get_intersecting_features.pl, big_file2gff3.pl, graph_data.pl,
graph_histogram.pl, graph_profile.pl
v1.0 (svn r68)
- Initial public release of an archive. Previous versions were only
available through SVN.