Sub-Method: run
### run ###########################################################################################
# Description:
#   Main method called by shatterproof.pl
#   Calls primary sub methods
#
# Input variables:
#   $argv_ref: reference to @ARGV
Sub-Method: validate_input
### validate_input ################################################################################
# Description:
#   Validates command line arguments. Prints error messages if any input is invalid.
#
# Input variables:
#   $argv_ref: reference to @ARGV
#   $cnv_directory_ref: reference to variable storing the CNV input directory
#   $trans_directory_ref: reference to variable storing the translocation input directory
#   $insertion_directory_ref: reference to variable storing the insertion input directory
#   $loh_directory_ref: reference to variable storing the LOH input directory
#   $tp53_mutated_ref: reference to variable storing the tp53 mutated flag
#   $output_directory_ref: reference to variable storing the output directory
#   $config_file_path_ref: reference to variable storing the path to the config file
Sub-Method: analyze_cnv_data
### analyze_cnv_data ##############################################################################
# Description:
#   Reads data from files located in the CNV input directory and populates:
#     $genome_cnv_data_hash_ref
#     $chromosome_copy_number_count_hash_ref
#     $chromosome_cnv_breakpoints_hash_ref
#
# Input variables:
#   $output_directory: stores the path to the output directory
#   $cnv_files_array_ref: reference to array containing all the CNV input files
#   $bin_size: stores the size of the bins which the chromosome will be divided into
#   $tp53_mutated_ref: reference to the tp53 mutated flag
Sub-Method: check_for_overlaps
### check_for_overlaps ############################################################################
# Description:
#   Checks if there were overlapping CNV regions with different copy-numbers in the input files.
#   Also checks if there are any overlapping translocation destinations or overlapping LOH
#   regions.
#
# Input variables:
#   $type: flag variable indicating which type of overlap to check for: "cnv", "trans", or "loh"
#   $file_data_ref: reference to an array storing all the data lines read in from the specific
#     type of input file
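The CNV case of this check can be sketched as follows. This is a minimal illustration, not the module's own code: the sub name find_cnv_overlaps_sketch, the record keys (chr, start, end, number), and the choice to treat touching end points as overlapping are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical sketch: report pairs of CNV regions on the same chromosome
# that overlap but carry different copy numbers.
# $regions_ref: reference to an array of hashes with the (assumed) keys
#               chr, start, end, number
# Returns a reference to an array of [region_a, region_b] pairs.
sub find_cnv_overlaps_sketch {
    my ($regions_ref) = @_;

    # Sort by chromosome, then by start position, so that all candidates
    # overlapping a region follow it directly in the list.
    my @sorted = sort {
        $a->{chr} cmp $b->{chr} || $a->{start} <=> $b->{start}
    } @{$regions_ref};

    my @conflicts;
    for my $i (0 .. $#sorted) {
        for my $j ($i + 1 .. $#sorted) {
            # Stop once later regions cannot overlap region $i any more.
            last if $sorted[$j]{chr} ne $sorted[$i]{chr}
                 || $sorted[$j]{start} > $sorted[$i]{end};
            push @conflicts, [ $sorted[$i], $sorted[$j] ]
                if $sorted[$j]{number} != $sorted[$i]{number};
        }
    }
    return \@conflicts;
}
```

Sorting first keeps the inner loop short for typical inputs, because the scan for a region ends at the first later region that starts beyond its end.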
Sub-Method: analyze_trans_data
### analyze_trans_data ############################################################################
# Description:
#   Reads data from files located in the trans input directory and populates:
#     $genome_trans_data_hash_ref
#     $chromosome_translocation_count_hash_ref
#     $genome_trans_breakpoints_hash_ref
#
# Input variables:
#   $output_directory: stores the path to the output directory
#   $trans_files_array_ref: reference to array containing all the translocation input files
#   $bin_size: stores the size of the bins which the chromosome will be divided into
#   $tp53_mutated_ref: reference to the tp53 mutated flag
Sub-Method: analyze_insertion_data
### analyze_insertion_data ##############################################################################
# Description:
#   Reads data from files located in the insertion input directory and populates:
#     $genome_insertion_data_hash_ref
#     $chromosome_insertion_count_hash_ref
#     $genome_trans_insertion_breakpoints_hash_ref
#
# Input variables:
#   $output_directory: stores the path to the output directory
#   $insertion_files_array_ref: reference to array containing all the insertion input files
#   $bin_size: stores the size of the bins which the chromosome will be divided into
#   $genome_trans_breakpoints_hash_ref: stores a reference to the hash that contains the
#     translocation breakpoints on each chromosome
#   $tp53_mutated_ref: reference to the tp53 mutated flag
Sub-Method: analyze_loh_data
### analyze_loh_data ##############################################################################
# Description:
#   Reads data from files located in the LOH input directory and populates:
#     $chromosome_loh_breakpoints_hash_ref
#
# Input variables:
#   $output_directory: stores the path to the output directory
#   $loh_files_array_ref: reference to array containing all the LOH input files
Sub-Method: calculate_genome_localization
### calculate_genome_localization #################################################################
# Description:
#   Calculates the mutation density for each chromosome
#
# Input variables:
#   $output_directory: stores the path to the output directory
#   $chromosome_copy_number_count_hash_ref: stores a reference to the hash storing the number
#     of CNV events on each chromosome
#   $chromosome_translocation_count_hash_ref: stores a reference to the hash storing the number
#     of translocation events on each chromosome
Sub-Method: calculate_chromosome_localization
### calculate_chromosome_localization #############################################################
# Description:
#   Performs a sliding window analysis on the CNV and translocation data. Identifies regions
#   that have a density of mutation much greater than the average rate of mutation of the
#   genome.
#
# Input variables:
#   $output_directory: stores the directory where output files are created
#   $genome_cnv_data_hash_ref: reference to hash that stores position of all CNV breakpoints in
#     the genome
#   $genome_trans_data_hash_ref: reference to hash that stores position of all the
#     translocation breakpoints in the genome
#   $bin_size: size of the bins that divide up the genome
#   $window_size: number of bins to evaluate in each window
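A sliding window analysis of this kind can be sketched as below. The sub name dense_windows_sketch, the flat list of breakpoint positions, and the plain count threshold are assumptions for illustration; the module's own rule for deciding that a window is "much greater than the average" is not reproduced here.

```perl
use strict;
use warnings;
use List::Util qw(max sum);

# Hypothetical sketch of a sliding window over binned breakpoints on one
# chromosome.
# $positions_ref: reference to an array of breakpoint positions (base pairs)
# $bin_size:      number of base pairs per bin
# $window_size:   number of consecutive bins per window
# $threshold:     windows with more breakpoints than this are reported
#                 (assumed parameter; stands in for the outlier cut off)
sub dense_windows_sketch {
    my ($positions_ref, $bin_size, $window_size, $threshold) = @_;
    return [] unless @{$positions_ref};

    # Count breakpoints per bin.
    my $last_bin = int(max(@{$positions_ref}) / $bin_size);
    my @bins = (0) x ($last_bin + 1);
    $bins[ int($_ / $bin_size) ]++ for @{$positions_ref};

    # Slide the window one bin at a time and keep the dense windows.
    my @hits;
    for my $first (0 .. $#bins - $window_size + 1) {
        my $count = sum(@bins[ $first .. $first + $window_size - 1 ]);
        next unless $count > $threshold;
        push @hits, {
            start => $first * $bin_size,
            end   => ($first + $window_size) * $bin_size,
            count => $count,
        };
    }
    return \@hits;
}
```

In practice the threshold would be derived from the data, for example from the mean and standard deviation of the window counts; that derivation is left out of the sketch.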
Sub-Method: check_copy_number_count
### check_copy_number_count #######################################################################
# Description:
#   Produces an output file that records the number of regions of copy-number variation that
#   are present in each chromosome.
#
# Input variables:
#   $output_directory: stores the path to the output directory
#   $chromosome_copy_number_count_hash_ref: reference to hash that stores the count of regions
#     of copy-number variation on each chromosome
Sub-Method: check_copy_number_switches
### check_copy_number_switches ####################################################################
# Description:
#   Creates an output file that records the number of breakpoints between CNV regions on each
#   chromosome
#
# Input variables:
#   $output_directory: stores path to output directory
#   $chromosome_copy_number_count_hash_ref: reference to hash that stores the count of regions
#     of copy-number variation on each chromosome
Sub-Method: calculate_interchromosomal_translocation_rate
### calculate_interchromosomal_translocation_rate #################################################
# Description:
#   Creates an output file that records the number of translocations between each and every
#   chromosome
#
# Input variables:
#   $output_directory: stores path to output directory
#   $chromosome_translocation_count_hash_ref: reference to hash that stores the count of
#     translocations between each chromosome
Sub-Method: analyze_suspect_regions
### analyze_suspect_regions #######################################################################
# Description:
#   Produces the final report output file, which includes the chromothriptic scores for each of
#   the highly mutated regions
#
# Input variables:
#   $output_directory: stores path to the output directory
#   $suspect_regions_array_ref: reference to array storing the chromosome, start, and end
#     position of highly mutated regions
#   $genome_mutation_density_hash_ref: stores the average mutation density of each chromosome
#   $genome_cnv_data_hash_ref: stores the position of CNV mutations on each chromosome
#   $genome_trans_data_hash_ref: stores the position of translocation events on each chromosome
#   $genome_trans_insertion_breakpoints_hash_ref: stores the position of insertions on each
#     chromosome
#   $bin_size: stores the size of a single bin
#   $localization_window_size: stores the number of bins to include in a window
#   $tp53_mutated: stores whether the TP53 gene is mutated or not
#   $chromosome_cnv_breakpoints_hash_ref: stores the breakpoints of CNV mutations on each
#     chromosome
#   $chromosome_loh_breakpoints_hash_ref: stores the breakpoints of LOH regions on each
#     chromosome
Sub-Method: analyze_likely_regions
### analyze_likely_regions ########################################################################
# Description:
#   Generates an output file that lists the regions that have a mutation density that is less
#   than the outlier cut off but greater than 1 - the outlier cut off
#
# Input variables:
#   $output_directory: stores path to the output directory
#   $likely_regions_array_ref: reference to array storing the chromosome, start, and end
#     position of highly mutated regions
#   $genome_mutation_density_hash_ref: stores the average mutation density of each chromosome
#   $genome_cnv_data_hash_ref: stores the position of CNV mutations on each chromosome
#   $genome_trans_data_hash_ref: stores the position of translocation events on each chromosome
#   $bin_size: stores the size of a single bin
Sub-Method: calculate_score
### calculate_score ###############################################################################
# Description:
#   Calculates the chromothriptic score for the given region. Calls sub methods to generate the
#   score for each hallmark
#
# Input variables:
#   $chr: stores the chromosome on which the region is found
#   $start: stores the start base pair of the region
#   $end: stores the end base pair of the region
#   $genome_cnv_data_hash_ref: stores the position of CNV mutations on each chromosome
#   $genome_trans_data_hash_ref: stores the position of translocation events on each chromosome
#   $genome_mutation_density_hash_ref: stores the average mutation density of each chromosome
#   $genome_trans_insertion_breakpoints_hash_ref: stores the position of insertions on each
#     chromosome
#   $tp53_mutated: stores whether the TP53 gene is mutated or not
#   $chromosome_cnv_breakpoints_hash_ref: stores the breakpoints of CNV mutations on each
#     chromosome
#   $chromosome_loh_breakpoints_hash_ref: stores the breakpoints of LOH regions on each
#     chromosome
#   $bin_size: stores the size of a single bin
Sub-Method: calculate_copy_number_score
### calculate_copy_number_score ##################################################################
# Description:
#   Calculates the score for the copy-number variation hallmark
#
# Input variables:
#   $chr: stores the chromosome where the region is located
#   $start: stores the starting location of the region
#   $end: stores the end location of the region
#   $genome_cnv_data_hash_ref: stores the position of CNV mutations on each chromosome
#   $bin_size: stores the size of a single bin
Sub-Method: calculate_genome_localization_score
### calculate_genome_localization_score ##########################################################
# Description:
#   Calculates the genome localization hallmark score
#
# Input variables:
#   $chr: stores the chromosome where the region is located
#   $genome_mutation_density_hash_ref: stores the average mutation density of each chromosome
Sub-Method: calculate_region_mutation_density_score
### calculate_region_mutation_density_score ######################################################
# Description:
#   Calculates the chromosome localization hallmark score
#
# Input variables:
#   $chr: chromosome where the region is located
#   $start: starting location of the region
#   $end: end location of the region
#   $genome_cnv_data_hash_ref: stores the position of CNV mutations on each chromosome
#   $genome_trans_data_hash_ref: stores the position of translocation events on each chromosome
#   $genome_mutation_density_hash_ref: stores the average mutation density of each chromosome
#   $bin_size: stores the size of a single bin
Sub-Method: calculate_translocation_score
### calculate_translocation_score #################################################################
# Description:
#   Calculates the translocation hallmark score
#
# Input variables:
#   $chr: chromosome where the region is located
#   $start: starting location of the region
#   $end: end location of the region
#   $genome_trans_data_hash_ref: stores the position of translocation events on each chromosome
#   $bin_size: stores the size of a single bin
Sub-Method: calculate_insertion_breakpoint_score
### calculate_insertion_breakpoint_score ##########################################################
# Description:
#   Calculates the insertions at translocation breakpoints hallmark score
#
# Input variables:
#   $chr: chromosome where the region is located
#   $start: starting location of the region
#   $end: end location of the region
#   $genome_trans_data_hash_ref: stores the position of translocation events on each chromosome
#   $genome_trans_insertion_breakpoints_hash_ref: stores the position of breakpoints with
#     insertions nearby
#   $bin_size: stores the size of a single bin
Sub-Method: calculate_loh_score
### calculate_loh_score ###########################################################################
# Description:
#   Calculates the loss of heterozygosity hallmark score
#
# Input variables:
#   $chr: chromosome where the region is located
#   $start: starting location of the region
#   $end: end location of the region
#   $chromosome_cnv_breakpoints_hash_ref: stores the breakpoints of CNV mutations on each
#     chromosome
#   $chromosome_loh_breakpoints_hash_ref: stores the breakpoints of LOH regions on each
#     chromosome
Sub-Method: standard_deviation_and_mean
### standard_deviation_and_mean ###################################################################
# Description:
#   Calculates the standard deviation and mean for a given set of values
#
# Input variables:
#   $data_ref: reference to either a hash or an array
#   $type: 0 indicates a hash, 1 indicates an array
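The calculation described above can be sketched as follows. This is an illustrative re-implementation under stated assumptions, not the module's code: the sub name sd_and_mean_sketch is hypothetical, hash input is taken to mean "use the hash values", and the sample (n - 1) estimator is an assumption about which standard deviation is intended.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical sketch.
# $data_ref: reference to a hash (its values are used) or to an array
# $type:     0 indicates a hash, 1 indicates an array
# Returns the list (standard deviation, mean).
sub sd_and_mean_sketch {
    my ($data_ref, $type) = @_;
    my @values = $type == 0 ? values %{$data_ref} : @{$data_ref};

    my $n = scalar @values;
    return (0, 0) if $n == 0;

    my $mean = sum(@values) / $n;
    return (0, $mean) if $n == 1;    # no spread with a single value

    # Sample standard deviation (divides by n - 1; an assumption).
    my $sum_sq = sum(map { ($_ - $mean) ** 2 } @values);
    return (sqrt($sum_sq / ($n - 1)), $mean);
}

# Usage with an array reference and with a hash reference.
my ($sd_a, $mean_a) = sd_and_mean_sketch([2, 4, 4, 4, 5, 5, 7, 9], 1);
my ($sd_h, $mean_h) = sd_and_mean_sketch({ chr1 => 1, chr2 => 3 }, 0);
printf "array: mean=%.3f sd=%.3f\n", $mean_a, $sd_a;
printf "hash:  mean=%.3f sd=%.3f\n", $mean_h, $sd_h;
```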
NAME
ShatterProof - a script for analyzing next-generation sequencing data
SYNOPSIS
use Shatterproof
DESCRIPTION
ShatterProof is a tool for analyzing next-generation sequencing data for signs of chromothripsis. Link to publication will be posted soon.
README
Input File Types
ShatterProof takes four different types of input files. See the scripts/conversion_scripts directory for Perl scripts that convert the output of some common tools to the required input formats.
Translocation Input Files (.spt)
Tab-delimited columns. The first line is a header line:
    #chr1 start end chr2 start end quality
Example data entry line:
    1 1000 2000 4 4000 5000 78
If no value is available for quality, use a ".", e.g.:
    1 1000 2000 4 4000 5000 .
Copy-Number Input Files (.spc)
Tab-delimited columns. The first line is a header line:
    #chr start end number quality
Example data entry line:
    12 2000 3000 2 63
If no value is available for quality, use a ".", e.g.:
    12 2000 3000 2 .
Loss of Heterozygosity Input Files (.spl)
Tab-delimited columns. The first line is a header line:
    #chr start end quality
Example data entry line:
    12 2000 3000 63
If no value is available for quality, use a ".", e.g.:
    12 2000 3000 .
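Reading one line of these tab-delimited formats can be sketched as below. The sub name parse_input_line_sketch is hypothetical, and the key names start1, end1, start2, and end2 are introduced here only to distinguish the two start/end pairs of the .spt header; this is not how ShatterProof itself parses its input.

```perl
use strict;
use warnings;

# Column layouts taken from the documented header lines. The numbered
# start/end names for .spt are an assumption made for this sketch.
my %columns_for = (
    spt => [qw(chr1 start1 end1 chr2 start2 end2 quality)],
    spc => [qw(chr start end number quality)],
    spl => [qw(chr start end quality)],
);

# Hypothetical helper: parse one line of a .spt, .spc, or .spl file.
# Returns nothing for header and blank lines, otherwise a hash reference.
sub parse_input_line_sketch {
    my ($type, $line) = @_;
    my $names = $columns_for{$type} or die "unknown file type: $type\n";

    $line =~ s/\r?\n\z//;
    return if $line =~ /^#/ || $line =~ /^\s*$/;

    my @fields = split /\t/, $line, -1;
    die "expected " . scalar(@{$names}) . " columns, got " . scalar(@fields) . "\n"
        if @fields != @{$names};

    my %record;
    @record{ @{$names} } = @fields;

    # A "." in the quality column means no quality value is available.
    $record{quality} = undef if $record{quality} eq '.';
    return \%record;
}

# Usage with the copy-number example line documented above.
my $cnv = parse_input_line_sketch('spc', "12\t2000\t3000\t2\t.\n");
print "chr=$cnv->{chr} copies=$cnv->{number}\n";
```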
Insertion Input Files (.vcf)
Additionally, ShatterProof accepts insertion calls in VCF files as input. See http://www.1000genomes.org/node/101 for details on the VCF file format. ShatterProof analyzes the CHROM and POS fields of these files.
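Extracting the CHROM and POS fields from a VCF file can be sketched as below. CHROM and POS are the first two tab-separated columns of a VCF data line; the sub name vcf_positions_sketch and the sample records are made up for illustration, and the sketch ignores every other VCF field.

```perl
use strict;
use warnings;

# Hypothetical sketch: collect [CHROM, POS] pairs from an open filehandle.
# Header lines (starting with "#") and blank lines are skipped.
sub vcf_positions_sketch {
    my ($fh) = @_;
    my @positions;
    while (my $line = <$fh>) {
        next if $line =~ /^#/ || $line =~ /^\s*$/;
        chomp $line;
        my ($chrom, $pos) = split /\t/, $line;
        next unless defined $pos && $pos =~ /^\d+$/;
        push @positions, [ $chrom, $pos ];
    }
    return \@positions;
}

# Usage with an in-memory file holding two made-up insertion records.
my $vcf_text = join "\n",
    '##fileformat=VCFv4.1',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1500\t.\tA\tATTG\t50\tPASS\t.",
    "4\t4200\t.\tG\tGCA\t.\tPASS\t.",
    '';
open my $vcf_fh, '<', \$vcf_text or die "cannot open in-memory file: $!";
my $positions = vcf_positions_sketch($vcf_fh);
close $vcf_fh;
print "$_->[0]:$_->[1]\n" for @{$positions};
```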
Installing ShatterProof
To install this module, type the following:
perl Makefile.PL
make
make test
make install
Configuring ShatterProof
See the config.pl file in the scripts directory for a sample ShatterProof configuration file.
$bin_size: number (integer) of base pairs to include in each bin of the sliding window analysis
$localization_window_size: number (integer) of bins to include in each window of the sliding window analysis
$expected_mutation_density: a reference value (double) used in determining if the concentration of translocation events on a particular chromosome is higher than expected
$collapse_regions: flag variable
    value 1: merge overlapping CNV regions that have the same copy number
    value 0: do not merge overlapping CNV regions that have the same copy number. If such regions are found, an error is thrown
$outlier_deviations: the number of standard deviations away from the mean a value has to be in order to be considered non-significant. Used to identify highly mutated regions.
$genome_localization_weight: weight given to the localization of mutations to one chromosome hallmark
$chromosome_localization_weight: weight given to the localization of mutations to one area of a particular chromosome hallmark
$cnv_weight: weight given to the concentrated CNV hallmark
$translocation_weight: weight given to the concentrated translocations hallmark
$insertion_breakpoint_weight: weight given to the short breakpoint insertions hallmark
$loh_weight: weight given to the loss/retention of heterozygosity hallmark
$tp53_mutated_weight: weight given to the TP53 mutation hallmark
Running ShatterProof
From the scripts directory, execute the shatterproof.pl file using Perl.
Main Usage:
    perl -w shatterproof.pl --cnv <dir> --trans <dir> [--insrt <dir>] [--loh <dir>] [--tp53] --config <path> --output <dir>
Arguments:
    --cnv       Define the path to the directory containing the CNV input files
    --trans     Define the path to the directory containing the Translocation input files
    --insrt     Define the path to the directory containing the insertion VCF input files
    --loh       Define the path to the directory containing the LOH input files
    --tp53      Indicate that TP53 should be considered mutated, regardless of data
    --config    Define the path to the ShatterProof config file
    --output    Define the path to the directory where output should be placed
    dir         Path to a directory
    path        Path to a file
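The option set above could be parsed along the following lines. This is a sketch that uses the core Getopt::Long module, which is not among the listed prerequisites, so ShatterProof itself presumably handles its arguments differently; the sub name parse_args_sketch and the example directory names are hypothetical.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical sketch: parse the documented options from an argument list.
# Returns a hash reference keyed by option name.
sub parse_args_sketch {
    my ($argv_ref) = @_;
    my @args = @{$argv_ref};    # copy, so the caller's list is left intact
    my %opt  = (tp53 => 0);

    GetOptionsFromArray(
        \@args, \%opt,
        'cnv=s', 'trans=s', 'insrt=s', 'loh=s', 'tp53', 'config=s', 'output=s',
    ) or die "invalid command line options\n";

    # --insrt, --loh and --tp53 are optional; the rest are required.
    for my $required (qw(cnv trans config output)) {
        die "missing required option --$required\n"
            unless defined $opt{$required};
    }
    return \%opt;
}

# Usage with a made-up argument list.
my $opt = parse_args_sketch(
    [qw(--cnv cnv_dir --trans trans_dir --tp53 --config config.pl --output out_dir)]
);
print "CNV directory: $opt->{cnv}, TP53 flag: $opt->{tp53}\n";
```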
PREREQUISITES
strict; warnings; Carp; Switch; File::Basename; List::Util qw[min max]; Statistics::Distributions; POSIX
any
CPAN