NAME
Tutorial_pipeline00.pl - Inferring quantitative and qualitative parameters from an input BAM file.
SYNOPSIS
Tutorial_pipeline00.pl
DESCRIPTION
This tutorial illustrates how the libraries Bio::ViennaNGS::BamStat and Bio::ViennaNGS::BamStatSummary can be used to count mapped reads, read alignments, single-end and paired-end reads and to check and compare quality features stored in the BAM file. The latter is exemplified by using the module to deduce the distribution of edit distances for all read alignments.
Note that this tutorial does not cover all features of Bio::ViennaNGS::BamStat and Bio::ViennaNGS::BamStatSummary. It is merely meant to illustrate the basic usage principles. For more details, please refer to the documentation of Bio::ViennaNGS::BamStat and Bio::ViennaNGS::BamStatSummary.
INTRODUCTION
In this tutorial we determine exact counts for the different types of reads and visualize the distribution of edit distances between the aligned reads and the reference genome.
The input BAM file can be downloaded here. From here on, the input file C1R1.bam is assumed to be accessible in your working directory.
PROCEDURE
Include libraries
use Bio::ViennaNGS::BamStat;
use Bio::ViennaNGS::BamStatSummary;
For this tutorial the following Bio::ViennaNGS libraries are included. Bio::ViennaNGS::BamStat provides methods to read key quality and quantity information from a given BAM file and stores the essential information in a data object. Bio::ViennaNGS::BamStatSummary provides methods to compare and visualize the data stored in BamStat objects.
Define control variables
my @bams = qw# C1R1.bam #;
my $odir = '.';
my $edit_control = 1;
my $segemehl_control = 1;
In the first step of this tutorial, control and parameter variables are set. The array @bams holds a list of all BAM files to be analyzed. This tutorial is restricted to a single file, C1R1.bam, which should be accessible in the current working directory.
Since Tutorial_pipeline00.pl produces several output files with fixed file names, an output directory has to be specified. This is done in the $odir variable, which is set to the current working directory by default. Please note that files with the same names in this directory will be overwritten.
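Because existing files with the same names are overwritten, it can be convenient to write into a dedicated directory. The following is a minimal sketch using only core Perl modules. The helper prepare_outdir and the directory name bamstat_out are hypothetical and not part of Bio::ViennaNGS.

```perl
use strict;
use warnings;
use File::Path qw(make_path);

# Hypothetical helper (not part of Bio::ViennaNGS): make sure the output
# directory exists and is writable before any files are written into it.
sub prepare_outdir {
    my ($dir) = @_;
    make_path($dir) unless -d $dir;    # create missing directories
    die "Output directory '$dir' is not writable\n" unless -w $dir;
    return $dir;
}

# 'bamstat_out' is an arbitrary example name for a dedicated directory.
my $odir = prepare_outdir('bamstat_out');
```

The returned path can then be used as the value of $odir in the remaining steps.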
Some methods in Bio::ViennaNGS::BamStatSummary use the Statistics::R library. Therefore, a valid path to a working version of R has to be specified. Please note that the path is set to /usr/bin/R by default.
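Before running the plotting steps, one may want to verify that R is actually installed at the expected location. This is a small sketch in plain Perl. The helper is_executable_file and the variable $rbin are hypothetical names introduced for illustration, and the check does not involve Bio::ViennaNGS or Statistics::R.

```perl
use strict;
use warnings;

# Hypothetical helper: return 1 if $path is a regular, executable file.
sub is_executable_file {
    my ($path) = @_;
    return ( -f $path && -x _ ) ? 1 : 0;    # '_' reuses the stat of -f
}

my $rbin = '/usr/bin/R';    # default path mentioned in this tutorial
warn "No executable R found at $rbin\n" unless is_executable_file($rbin);
```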
The $edit_control flag has to be set if information on the edit distance of the read alignments should be stored by Bio::ViennaNGS::BamStat. If set, this information can be visualized in a subsequent step by Bio::ViennaNGS::BamStatSummary.
Bio::ViennaNGS::BamStat and Bio::ViennaNGS::BamStatSummary are in principle compatible with any BAM file from any read aligner. Nevertheless, one has to be aware that mapping tools differ in their BAM dialect and in the information they store in the BAM file. A special flag to use the auxiliary information stored in a BAM file produced by segemehl is available as $segemehl_control. Set it to '1' if your input file was mapped with segemehl, as is the case for the provided C1R1.bam. Otherwise set it to '0'.
Creating a new BamStatSummary object.
my $bamsummary = Bio::ViennaNGS::BamStatSummary->new(files        => \@bams,
                                                     outpath      => $odir,
                                                     is_segemehl  => $segemehl_control,
                                                     control_edit => $edit_control,
                                                    );
This initializes a new BamStatSummary object representing data from the segemehl-mapped BAM files (is_segemehl => 1) specified in @bams. The output directory is set to $odir ('.'). In addition to the standard read quantification, the edit distance of each read alignment will be stored (control_edit => 1).
Read-in BAM files @bams
In the next step, the initialized data object is populated with essential data extracted from each BAM file.
$bamsummary->populate_data();
For each BAM file in @bams, the method new from Bio::ViennaNGS::BamStat is called as follows:
$bo = Bio::ViennaNGS::BamStat->new(bam => $bamfile);
Quantify data from $bamsummary
A basic measure in the quantification of BAM files is how many reads are uniquely mapped or multi-mapped and how many alignments exist in total. For paired-end data, the number of mapped pairs is essential as well. To this end, we compile quantitative information for all reads stored in $bamsummary by,
$bamsummary->populate_countStat();
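To make the distinction between alignments, uniquely mapped reads, and multi-mapped reads concrete, here is an illustrative sketch in plain Perl that works on SAM-formatted text records. It is not the implementation used by Bio::ViennaNGS::BamStat. It assumes single-end records and relies on the optional SAM tag NH:i (number of reported alignments for the read), and it treats records without that tag as unique. The function count_mappings and the example records are made up for illustration.

```perl
use strict;
use warnings;

# Count alignments, uniquely mapped reads and multi-mapped reads from
# SAM-formatted records. Assumption of this sketch: reads lacking an
# NH:i tag are treated as uniquely mapped.
sub count_mappings {
    my (@sam_lines) = @_;
    my %nh_of;                        # read name => NH value
    my $alignments = 0;
    for my $line (@sam_lines) {
        next if $line =~ /^\@/;       # skip header lines
        my @field = split /\t/, $line;
        next if $field[1] & 4;        # FLAG bit 0x4: read is unmapped
        $alignments++;
        my ($nh) = $line =~ /\tNH:i:(\d+)/;
        $nh_of{ $field[0] } = defined $nh ? $nh : 1;
    }
    my $unique = grep { $_ == 1 } values %nh_of;
    my $multi  = grep { $_ > 1 } values %nh_of;
    return { alignments => $alignments, unique => $unique, multi => $multi };
}

# Made-up example: r1 maps once, r2 maps twice, r3 is unmapped.
my @count_records = (
    join( "\t", qw(r1 0 chr1 100 255 10M * 0 0 ACGTACGTAC IIIIIIIIII NM:i:0 NH:i:1) ),
    join( "\t", qw(r2 0 chr1 200 1 10M * 0 0 ACGTACGTAC IIIIIIIIII NM:i:1 NH:i:2) ),
    join( "\t", qw(r2 256 chr2 300 1 10M * 0 0 ACGTACGTAC IIIIIIIIII NM:i:2 NH:i:2) ),
    join( "\t", qw(r3 4 * 0 0 * * 0 0 ACGTACGTAC IIIIIIIIII) ),
);
my $counts = count_mappings(@count_records);
printf "alignments=%d unique=%d multi=%d\n",
    @{$counts}{qw(alignments unique multi)};
# prints "alignments=3 unique=1 multi=1"
```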
Produce output for read quantification.
Next, statistical output will be written to a CSV file in $odir, which can easily be screened with any text editor or spreadsheet program.
$bamsummary->dump_countStat("csv");
Plot read quantification.
To get a visual overview of the consistency of examined samples, bar-plots can be produced via,
$bamsummary->make_BarPlot();
which creates a bar-plot of the read quantification in PDF format in $odir.
Plot edit distance distribution.
To gain a quick overview of the quality of different mapped RNA-seq samples, we plot the distribution of edit distances for all reads aligned to the reference genome for all samples in @bams.
$bamsummary->make_BoxPlot("data_edit") if( $bamsummary->has_control_edit );
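As a rough illustration of what an edit distance distribution is, the following plain-Perl sketch tallies the optional SAM tag NM:i (edit distance to the reference) over a handful of SAM-formatted records. It is a simplified stand-in and not the code behind make_BoxPlot. The function edit_distance_histogram and the example records are hypothetical.

```perl
use strict;
use warnings;

# Tally the NM:i tag (edit distance to the reference) over SAM records.
# Returns a hash reference: edit distance => number of alignments.
sub edit_distance_histogram {
    my (@sam_lines) = @_;
    my %hist;
    for my $line (@sam_lines) {
        next if $line =~ /^\@/;                # skip header lines
        my ($nm) = $line =~ /\tNM:i:(\d+)/;
        next unless defined $nm;               # record carries no NM tag
        $hist{$nm}++;
    }
    return \%hist;
}

# Made-up example records with edit distances 0, 0, 1 and 3.
my @edit_records = (
    join( "\t", qw(r1 0 chr1 100 255 10M * 0 0 ACGTACGTAC IIIIIIIIII NM:i:0) ),
    join( "\t", qw(r2 0 chr1 150 255 10M * 0 0 ACGTACGTAC IIIIIIIIII NM:i:0) ),
    join( "\t", qw(r3 0 chr1 200 255 10M * 0 0 ACGTACGTAC IIIIIIIIII NM:i:1) ),
    join( "\t", qw(r4 0 chr2 300 255 10M * 0 0 ACGTACGTAC IIIIIIIIII NM:i:3) ),
);
my $edit_hist = edit_distance_histogram(@edit_records);
print "$_\t$edit_hist->{$_}\n" for sort { $a <=> $b } keys %$edit_hist;
```

Each output line lists an edit distance followed by the number of alignments with that distance, which is the kind of data a box plot per sample summarizes.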
Summary
We used Bio::ViennaNGS::BamStat and Bio::ViennaNGS::BamStatSummary to extract, store, summarize, and visualize quantitative and qualitative data stored in a BAM file. Only exemplary features of the library were illustrated. Further useful functions are implemented in the corresponding script bam_quality_stat.pl, or can be implemented according to one's needs.
AUTHOR
Fabian Amman <fabian@tbi.univie.ac.at>