NAME
bam2wig.pl
A script to convert Bam alignments into a wig representation file.
SYNOPSIS
bam2wig.pl [--options...] <file.bam>
bam2wig.pl --extend --rpm --separate --out file --bw file1.bam file2.bam
Required options:
--in <filename.bam> repeat if multiple bams, or comma-delimited list
Reporting options (pick one):
--start record at 5' position
--mid record at midpoint of alignment or pair
--span record across entire alignment or pair
--extend extend alignment (record predicted fragment)
--cspan record a span centered on midpoint
--coverage raw alignment coverage
Alignment reporting options:
--splice split alignment at N splices
--strand record separate strands
--flip flip the strands for convenience
Paired-end alignments:
--pe treat as paired-end alignments
--minsize <integer> minimum allowed insertion size (30)
--maxsize <integer> maximum allowed insertion size (600)
Alignment filtering options:
--qual <integer> minimum mapping quality (0)
--nosecondary skip secondary alignments (false)
--noduplicate skip marked duplicate alignments (false)
--nosupplementary skip supplementary alignments (false)
--chrskip <regex> regular expression to skip chromosomes
--blacklist <file> interval file of regions to skip (bed, gff, txt)
Shift options:
--shift shift reads in the 3' direction
--shiftval <integer> explicit shift value in bp (default is to calculate)
--extval <integer> explicit extension size in bp (default is to calculate)
--chrom <integer> number of chromosomes to sample (4)
--minr <float> minimum pearson correlation to calculate shift (0.5)
--zmin <float> minimum z-score from average to test peak for shift (3)
--zmax <float> maximum z-score from average to test peak for shift (10)
--model write peak shift model file for graphing
Score options:
--rpm scale depth to Reads Per Million mapped
--scale <float> explicit scaling factor, repeat for each bam file
--mean or --separate scale multiple bams independently before averaging
--fraction assign fractional counts to all multi-mapped alignments
--format <integer> number of decimal positions (4)
Output options:
--out <filename> output file name, default is bam file basename
--bin <integer> bin size for span or extend mode (10)
--bw convert to bigWig format
--bwapp /path/to/wigToBigWig path to external converter
--gz gzip compress output
Wig format:
--bdg bedGraph, default for span and extend at bin 1
--fix fixedStep, default for bin > 1
--var varStep, default for start, mid
General options:
--cpu <integer> number of parallel processes (2)
--verbose report additional information
--version print version information
--help show full documentation
OPTIONS
The command line flags and descriptions:
Input
- --in <filename>
-
Specify the input Bam alignment file. More than one file may be specified, either with repeated options, a comma-delimited list, or simply appended to the command. Bam files will be automatically indexed if necessary.
Reporting Options
- --start
-
Specify that the 5' position should be recorded in the wig file.
- --mid
-
Specify that the midpoint of the alignment (single-end) or fragment (paired-end) will be recorded in the wig file.
- --span
-
Specify that the entire span of the alignment (single-end) or fragment (paired-end) will be recorded in the wig file.
- --extend
-
Specify that the alignment should be extended in the 3' direction and that the entire length of the extension be recorded in the wig file. The extension may be defined by the user or empirically determined.
- --cspan
-
Specify that a defined span centered at the alignment (single-end) or fragment (paired-end) midpoint will be recorded in the wig file. The span is defined by the extension value.
- --coverage
-
Specify that the raw alignment coverage be calculated and reported in the wig file. This utilizes a special low-level operation and precludes any alignment filtering or post-normalization methods.
- --position [start|mid|span|extend|cspan|coverage]
-
Legacy option for supporting previous versions of bam2wig.
Alignment reporting options
- --splice
-
Indicate that the bam file contains alignments with splices, such as from RNASeq experiments. Alignments will be split on cigar N operations and each sub fragment will be recorded. This only works with single-end alignments, and is disabled for paired-end reads. Only start and span recording options are supported.
- --strand
-
Indicate that separate wig files should be written for each strand. The output file basename is appended with either '_f' or '_r' for both files. Strand for paired-end alignments are determined by the strand of the first read.
- --flip
-
Flip the strand of the output files when generating stranded wig files. Do this when RNA-Seq alignments map to the opposite strand of the coding sequence, depending on the library preparation method.
Paired-end alignments
- --pe
-
The Bam file consists of paired-end alignments, and only properly mapped pairs of alignments will be counted. Properly mapped pairs include FR reads on the same chromosome, and not FF, RR, RF, or pairs aligning to separate chromosomes. The default is to treat all alignments as single-end.
- --minsize <integer>
-
Specify the minimum paired-end fragment size in bp to accept for recording. Default is 30 bp.
- --maxsize <integer>
-
Specify the maximum paired-end fragment size in bp to accept for recording. Default is 600 bp.
Alignment filtering options:
- --qual <integer>
-
Set a minimum mapping quality score of alignments to count. The mapping quality is a range from 0-255, with higher numbers indicating lower probability of a mapping error. Multi-mapping alignments often have a map quality of 0. The default is 0 (accept everything).
- --secondary | --nosecondary
-
Boolean flag to accept or skip secondary alignments, indicated by the alignment bit flag 0x100. Secondary alignments typically represent alternative mapping locations, or multi-mapping events. By default, all alignments are accepted.
- --duplicate | --noduplicate
-
Boolean flag to accept or skip duplicate alignments, indicated by the alignment bit flag 0x400. Duplicates alignments may represent a PCR or optical duplication. By default, all alignments are accepted.
- --supplementary | --nosupplementary
-
Boolean flag to accept or skip supplementary alignments, indicated by the alignment bit flag 0x800. Supplementary alignments are typically associated with chimeric fragments. By default, all alignments are accepted.
- --chrskip <regex>
-
Provide a regular expression to skip certain chromosomes. Perl-based regular expressions are employed. Expressions should be quoted or properly escaped on the command line. Examples might be
'chrM' 'scaffold.+' 'chr.+alt|chrUn.+|chr.+_random'
- --blacklist <file>
-
Provide a file of genomic intervals from which to exclude alignments. Examples might include repeats, ribosomal RNA, or heterochromatic regions. The file should be any text file interpretable by Bio::ToolBox::Data with chromosome, start, and stop coordinates, including BED and GFF formats. Note that this only excludes overlapping alignments, and does not include extended alignments.
Shift options
- --shift
-
Specify that the positions of the alignment should be shifted towards the 3' end. Useful for ChIP-Seq applications, where only the ends of the fragments are counted and often seen as separated discrete peaks on opposite strands flanking the true target site. This option is disabled with paired-end and spliced reads (where it is not needed).
- --shiftval <integer>
-
Provide the value in bp that the recorded position should be shifted. The value should be 1/2 the average length of the library insert size. The default is to automatically and empirically determine the appropriate shift value using cross-strand correlation (recommended).
- --extval <integer>
-
Manually set the length for reads to be extended. By default, the shift value is determined empirically and extension is set to 2X the shift value. This is also used for the cspan mode.
- --chrom <integer>
-
Indicate the number of sequences or chromosomes to sample when empirically determining the shift value. The reference sequences listed in the Bam file header are taken in order of decreasing length, and one or more are taken as a representative sample of the genome. The default value is 4.
- --minr <float>
-
Provide the minimum Pearson correlation value to accept a shift value when empirically determining the shift value. Enter a decimal value between 0 and 1. Higher values are more stringent. The default is 0.5.
- --zmin <float>
-
Specify the minimum z-score (or number of standard deviations) from the chromosomal mean depth to test for a peak shift. Increase this number to test for strong robust peaks, which give a better estimations of the shift value. Default is 3.
- --zmax <float>
-
Specify the maximum z-score (or number of standard deviations) from the chromosomal mean depth to test for a peak shift. This excludes erroneous peaks due to repetitive sequence alignments with high coverage. Increase this number to include more robust peaks that can give a better estimation of the shift value. Default is 10.
- --model
-
Indicate that the shift model profile data should be written to file for examination. The average profile, including for each sampled chromosome, are reported for the forward and reverse strands, as well as the shifted profile. A standard text file is generated using the output base name. The default is to not write the model shift data.
Score Options
- --rpm
-
Convert the data to Reads (or Fragments) Per Million mapped. This is useful for comparing read coverage between different datasets. The default is no RPM conversion.
- --scale <float>
-
Optionally provide your own scaling factor. This will be multiplied with every position when generating the wig file. This may be combined with the rpm factor as well. When combining multiple bam files, either a single scale factor may be supplied for all files, or individual scale factors may be supplied for each bam file. If supplying multiple, use the option multiple times or give a comma-delimited list. The values should be in the same order as the bam files.
- --mean
- --separate
-
When processing multiple bam files, this option will take the mean or average across all bam files. Without this option, the bam files are simply added. When combined with the rpm option, each bam file will be scaled separately before taking the average.
- --fraction
-
Indicate that multi-mapping alignments should be given fractional counts instead of full counts. The number of alignments is determined using the NH alignment tag. If a read has 10 alignments, then each alignment is given a count of 0.1.
- --format <integer>
-
Indicate the number of decimal postions reported in the wig file. This is only applicable when rpm, scale, or fraction options are provided. The default value is 4 decimal positions.
Output Options
- --out <filename>
-
Specify the output base filename. An appropriate extension will be added automatically. By default it uses the base name of the input file.
- --bin <integer>
-
Specify the bin size in bp for the output wig file. In general, specifying a larger bin size will decrease the run time and memory requirements in exchange for loss of resolution. The default for span, center span, or extend mode is 10 bp; all other modes is 1 bp.
- --bw
-
Specify whether or not the wig file should be further converted into an indexed, compressed, binary BigWig file. The default is false.
- --bwapp /path/to/wigToBigWig
-
Optionally specify the full path to the UCSC wigToBigWig conversion utility. The application path may be set in the .biotoolbox.cfg file or found in the default executable path, which makes this option unnecessary.
- --gz
-
Specify whether (or not) the output file should be compressed with gzip. Disable with --nogz.
Wig format
- --bdg
-
Specify that the output wig format is a bedGraph-style wig format. This is the default format for extend, span, and cspan modes of operation.
- --fix
-
Specify that the output wig format is in fixedStep wig format. This is the default format for coverage mode of operation.
- --var
-
Specify that the output wig format is in variableStep wig format. This is the default format for start and midpoint modes of operation.
General options
- --cpu <integer>
-
Specify the number of parallel instances to run simultaneously. This requires the installation of Parallel::ForkManager. With support enabled, the default is 2. Disable multi-threaded execution by setting to 1.
- --verbose
-
Print extra informational statements during processing. The default is false.
- --version
-
Print the version number.
- --help
-
Display this POD documentation.
DESCRIPTION
This program will enumerate aligned sequence tags and generate a wig, or optionally BigWig, file. Alignments may be counted and recorded in several different ways. Strict enumeration may be performed and recorded at either the alignment's start or midpoint position. Alternatively, either the alignment or fragment may be recorded across its span. Finally, a basic unstranded, unshifted, and non-transformed alignment coverage may be generated.
Both paired-end and single-end alignments may be counted. Alignments with splices (e.g. RNA-Seq) may be counted singly or separately. Alignment counts may be separated by strand, facilitating analysis of RNA-Seq experiments.
For ChIP-Seq experiments, the alignment position may be shifted in the 3 prime direction. This effectively merges the separate peaks (representing the ends of the enriched fragments) on each strand into a single peak centered over the target locus. Alternatively, the entire predicted fragment may be recorded across its span. This extended method of recording infers the mean size of the library fragments, thereby emulating the coverage of paired-end sequencing using single-end sequence data. The shift value is empirically determined from the sequencing data or provided by the user. If requested, the shift model profile may be written to file.
The output wig file may be either a variableStep, fixedStep, or bedGraph format. The wig file may be further converted into a compressed, indexed, binary bigWig format, dependent on the availability of the appropriate conversion utilities.
RECOMMENDED SETTINGS
The type of wig file to generate for your Bam sequencing file can vary depending on your particular experimental application. Here are a few common sequencing applications and my recommended settings for generating the wig or bigWig file.
- Straight coverage
-
To generate a straight-forward coverage map, similar to what most genome browsers display when using a Bam file as source, use the following settings:
bam2wig.pl --coverage --in <bamfile>
- Single-end ChIP-Seq
-
When sequencing Chromatin Immuno-Precipitation products, one generally performs a 3 prime shift adjustment to center the fragment's end reads over the predicted center and putative target. To adjust the positions of tag count peaks, let the program empirically determine the shift value from the sequence data (recommended). Otherwise, if you know the mean size of your ChIP eluate fragments, you can use the --shiftval option.
To evaluate the empirically determined shift value, be sure to include the --model option to examine the profiles of stranded and shifted read counts and the distribution of cross-strand correlations.
Depending on your downstream applications and/or preferences, you can record strict enumeration (start positions) or coverage (extend position).
Finally, to compare ChIP-Seq alignments from multiple experiments, convert your reads to Reads Per Million Mapped, which will help to normalize read counts.
bam2wig.pl --start --shift --model --rpm --in <bamfile> bam2wig.pl --extend --model --rpm --in <bamfile>
- Paired-end ChIP-Seq
-
If both ends of the ChIP eluate fragments are sequenced, then we do not need to calculate a shift value. Instead, we will simply count at the midpoint of each properly-mapped sequence pair, or record the defined fragment span.
bam2wig.pl --mid --pe --rpm --in <bamfile> bam2wig.pl --span --pe --rpm --in <bamfile>
- Unstranded RNA-Seq
-
With RNA-Sequencing, we may be interested in either coverage (generating a transcriptome map) or simple tag counts (differential gene expression), so we can count in one of two ways.
To compare RNA-Seq data from different experiments, convert the read counts to Reads Per Million Mapped, which will help to normalize read counts.
bam2wig --span --splice --rpm --in <bamfile> bam2wig --mid --rpm --in <bamfile>
- Stranded, single-end RNA-Seq
-
If the library was generated in such a way as to preserve strand, then we can separate the counts based on the strand of the alignment. Note that the reported strand may be accurate or flipped, depending upon whether first-strand or second-strand synthesized cDNA was sequenced, and whether your aligner took this into account. Check the Bam alignments in a genome browser to confirm the orientation relative to coding sequences. If alignments are opposite to the direction of transcription, you can include the --flip option to switch the output.
bam2wig --span ---splice -strand --rpm --in <bamfile> bam2wig --pos mid --strand --rpm --in <bamfile>
- Paired-end RNA-Seq
-
Since paired-end mode of bam2wig interprets pairs as a single fragment, and splices are disable with paired-end alignments, paired-end RNAseq may be best treated as single-end RNAseq.
TEXT REPRESENTATION OF RECORDING ALIGNMENTS
To help users visualize how this program records alignments in a wig file, drawn below are 10 alignments, five forward and five reverse. They may be interpreted as either single-end or paired-end. Drawn below are the numbers that would be recorded in a wig file for various parameter settings. Note that alignments are not drawn to scale and are drawn for visualization purposes only. Values of X represent 10.
- Alignments
-
....>>>>>>.....................................<<<<<<............. .....>>>>>>..................................<<<<<<............... ........>>>>>>.......................................<<<<<<....... ........>>>>>>.........................................<<<<<<..... ..........>>>>>>............................................<<<<<<
- Starts
-
....11..2.1.......................................1.1.....1.1....1
- Midpoints
-
......11..2.1..................................1.1.....1.1....1...
- Stranded Starts
-
F...11..2.1....................................................... R.................................................1.1.....1.1....1
- Span (Coverage)
-
....122244433311.............................112222111122221211111
- Mid Span (extend value 2)
-
......121.2211.................................1111....1111...11..
- Stranded Span
-
F...122244433311.................................................. R............................................112222111122221211111
- Shifted Starts (shift value 26)
-
........................1.1...11121.1..1..........................
- Shifted Span (shift value 26)
-
...................11222211112344365544411........................
- Extend (extend value 52)
-
12223445789999XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX999887665321111
- Paired-End Midpoints
-
............................1...111..1............................
- Paired-End Mid span (extend value 6)
-
..........................111123333432111.........................
- Paired-End Span
-
....12224455555555555555555555555555555555555555555443333332211111
AUTHOR
Timothy J. Parnell, PhD
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112
This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.
1 POD Error
The following errors were encountered while parsing the POD:
- Around line 2815:
'=item' outside of any '=over'