NAME
intersect_SNPs.pl
A script to identify unique and common SNPs from multiple sequence runs.
SYNOPSIS
intersect_SNPs.pl <file1.vcf> <file2.vcf> ...
Options:
--in <filename>
--gz
--version
--help
OPTIONS
The command line flags and descriptions:
- --in <filename>
Specify the input SNP files. The files should be in the .vcf format, and may be gzipped. Each SNP file is assumed to contain one sample or strain only.
- --gz
Compress the output files with gzip.
- --version
Print the version number.
- --help
Display this POD documentation.
DESCRIPTION
This simple program will identify unique and common Single Nucleotide Polymorphisms (SNPs) between two or more sequenced strains. This is useful, for example, in isolating unique SNPs that may be responsible for a mutant phenotype from background polymorphisms common to the strains.
Each strain should have a separate SNP file in the Variant Call Format (VCF), version 4.0 or 4.1, a tab-delimited text format with metadata. See http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41 for more information about the format. The files may be gzipped.
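As an illustration of the input format, the following Python sketch reads a plain or gzipped VCF file and separates the metadata and header lines (those starting with "#") from the tab-delimited data records. It is not taken from intersect_SNPs.pl itself; the function names open_vcf and parse_vcf are hypothetical, and choosing gzip by the ".gz" suffix is an assumption.

```python
import gzip
import io

def open_vcf(path):
    """Open a plain or gzipped VCF file as text (assumes a .gz suffix means gzip)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")

def parse_vcf(lines):
    """Split VCF lines into header lines and data records.

    Each record is the list of tab-separated fields:
    CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, then optional
    FORMAT and per-sample genotype columns.
    """
    header, records = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            header.append(line)
        else:
            records.append(line.split("\t"))
    return header, records

# A tiny in-memory VCF used in place of a real file.
example = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t1042\t.\tG\tA\t50\tPASS\tDP=20\n"
)
header, records = parse_vcf(example)
```

For a real file, parse_vcf(open_vcf("strain1.vcf.gz")) would be the equivalent call, where the file name is only a placeholder.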
Numerous SNP callers can generate VCF files from sequence (usually BAM) files. Samtools is one such program, using its "mpileup" function in conjunction with its "bcftools" tool. See the Samtools site at http://samtools.sourceforge.net for more information.
Note that this program currently loads all SNPs into memory, so large genomes may require substantial memory.
SNPs are identified as unique or common based on the reported coordinate and the alternate sequence. Overlapping SNPs will likely be treated separately. The unique SNPs for each strain are written to a new file named with the input file's base name followed by "_unique". The VCF format and headers are maintained.
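The comparison described above can be sketched in Python as follows. This is an illustration only, not the script's actual Perl code: the names snp_key and classify are hypothetical, and treating a SNP as common when it appears in two or more strains (rather than in every strain) is an assumption. Records are lists of VCF fields, where the first eight columns are the fixed fields and any later columns hold genotype data.

```python
from collections import defaultdict

def snp_key(record):
    """Key a record by chromosome, position, and the whole ALT string."""
    chrom, pos, alt = record[0], record[1], record[4]
    return (chrom, int(pos), alt)

def classify(strains):
    """Sort records into per-strain unique lists and one common list.

    strains maps a strain name to its list of VCF records.
    Assumption: a SNP seen in two or more strains counts as common.
    """
    seen = defaultdict(list)  # key -> list of (strain name, record)
    for name, records in strains.items():
        for rec in records:
            seen[snp_key(rec)].append((name, rec))

    unique = {name: [] for name in strains}
    common = []
    for key in sorted(seen):
        hits = seen[key]
        owners = {name for name, _ in hits}
        if len(owners) == 1:
            for name, rec in hits:
                unique[name].append(rec)
        else:
            # Keep one representative and drop FORMAT and sample columns.
            common.append(hits[0][1][:8])
    return unique, common

strains = {
    "mutant": [
        ["chr1", "1042", ".", "G", "A", "50", "PASS", "DP=20", "GT", "1/1"],
        ["chr1", "2500", ".", "C", "T", "60", "PASS", "DP=18", "GT", "1/1"],
    ],
    "parent": [
        ["chr1", "1042", ".", "G", "A", "48", "PASS", "DP=22", "GT", "1/1"],
    ],
}
unique, common = classify(strains)
```

In this example the SNP at chr1:2500 is unique to the "mutant" strain, while the SNP at chr1:1042 is shared and recorded once without its genotype columns.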
The common SNPs are written to a separate VCF file, with the file name composed of the input file base names followed by "_common". Genotype data, if present, is stripped from each common SNP, and only one representative is recorded.
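The output naming can be sketched in Python as below. The helper names are hypothetical, and several details are assumptions not stated in this documentation: that the base name is the file name without its ".vcf" or ".vcf.gz" suffix, that the output keeps a ".vcf" extension (plus ".gz" when compressed), and that base names are joined with an underscore in the common file name.

```python
import os

def base_name(path):
    """Strip the directory and any .gz / .vcf suffix (assumed definition)."""
    name = os.path.basename(path)
    for suffix in (".gz", ".vcf"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name

def unique_name(path, gz=False):
    """Name of the unique-SNP file for one input file."""
    return base_name(path) + "_unique.vcf" + (".gz" if gz else "")

def common_name(paths, gz=False, sep="_"):
    """Name of the common-SNP file; the separator is an assumption."""
    joined = sep.join(base_name(p) for p in paths)
    return joined + "_common.vcf" + (".gz" if gz else "")

inputs = ["mutant.vcf.gz", "/data/parent.vcf"]
unique_files = [unique_name(p) for p in inputs]
common_file = common_name(inputs, gz=True)
```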
Note that certain sequence variants may be reported as unique when in fact identical alternate sequences are also present in other strains. Usually in these cases, one of the strains has an additional variant sequence not present in the other.
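A small Python sketch shows how this can happen, assuming the alternate sequence is compared as the whole ALT field. In VCF, multiple alternate alleles at one position are listed comma-separated in ALT, so "T" and "T,C" differ as strings even though both contain the allele "T". The coordinates and alleles here are made up for illustration.

```python
def alt_key(chrom, pos, alt):
    """Key built from the coordinate and the whole ALT string (assumed)."""
    return (chrom, int(pos), alt)

# Strain A reports one alternate allele; strain B reports an extra one.
strain_a = alt_key("chr1", 3300, "T")
strain_b = alt_key("chr1", 3300, "T,C")

# Comparing whole keys misses the shared allele, so both look unique.
whole_match = strain_a == strain_b

# Splitting ALT on commas reveals the allele the strains have in common.
shared = set(strain_a[2].split(",")) & set(strain_b[2].split(","))
```

Here whole_match is False while shared contains "T", which is the situation described above.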
AUTHOR
Timothy J. Parnell, PhD
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112
This package is free software; you can redistribute it and/or modify it under the terms of the GPL (either version 1, or at your option, any later version) or the Artistic License 2.0.