NAME
Bio::Cigar - Parse CIGAR strings and translate coordinates to/from reference/query
SYNOPSIS
use 5.014;
use Bio::Cigar;
my $cigar = Bio::Cigar->new("2M1D1M1I4M");
say "Query length is ", $cigar->query_length;
say "Reference length is ", $cigar->reference_length;
my ($qpos, $op) = $cigar->rpos_to_qpos(3);
say "Alignment operation at reference position 3 is $op";
my $query = "GCAAATGC";
my $ref = "AAAAGCAAATGC";
my $aln = $cigar->align($query, $ref, 5); # align query to pos 5 of ref
say foreach @$aln;
DESCRIPTION
Bio::Cigar is a small library to parse CIGAR strings ("Compact Idiosyncratic Gapped Alignment Report"), such as those used in the SAM file format. CIGAR strings are a run-length encoding which minimally describes the alignment of a query sequence to an (often longer) reference sequence.
Parsing follows the SAM v1 spec for the CIGAR column.
Parsed strings are represented by an object that provides a few utility methods.
ATTRIBUTES
All attributes are read-only.
string
The CIGAR string for this object.
reference_length
The length of the reference sequence segment aligned with the query sequence described by the CIGAR string.
query_length
The length of the query sequence described by the CIGAR string.
ops
An arrayref of [length, operation] tuples describing the CIGAR string. Lengths are integers; the possible operations are listed below.
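To make the shape of the ops structure concrete, here is a minimal, self-contained sketch that splits a CIGAR string into [length, operation] tuples. The helper name parse_cigar_sketch is hypothetical and is not part of Bio::Cigar; the module's own parser may validate input differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not Bio::Cigar's parser): split a CIGAR string
# into an arrayref of [length, operation] tuples.
sub parse_cigar_sketch {
    my ($string) = @_;

    # The whole string must be a sequence of <count><operation> runs.
    die "Not a valid CIGAR string: $string\n"
        unless $string =~ /\A(?:\d+[MIDNSHP=X])+\z/;

    my @ops;
    while ($string =~ /(\d+)([MIDNSHP=X])/g) {
        push @ops, [ $1 + 0, $2 ];   # numeric length, one-character op
    }
    return \@ops;
}

my $ops = parse_cigar_sketch("2M1D1M1I4M");
print join(" ", map { "$_->[0]$_->[1]" } @$ops), "\n";   # prints "2M 1D 1M 1I 4M"
```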
CIGAR operations
The CIGAR operations are given in the following table, taken from the SAM v1 spec:
Op  Description
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
M   alignment match (can be a sequence match or mismatch)
I   insertion to the reference
D   deletion from the reference
N   skipped region from the reference
S   soft clipping (clipped sequences present in SEQ)
H   hard clipping (clipped sequences NOT present in SEQ)
P   padding (silent deletion from padded reference)
=   sequence match
X   sequence mismatch
• H can only be present as the first and/or last operation.
• S may only have H operations between them and the ends of the string.
• For mRNA-to-genome alignment, an N operation represents an intron. For other types of alignments, the interpretation of N is not defined.
• Sum of the lengths of the M/I/S/=/X operations shall equal the length of SEQ.
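The last rule above can be turned into a short length calculation. The sketch below sums run lengths over the operations that consume the query (M, I, S, =, X) and, following the SAM spec, over those that consume the reference (M, D, N, =, X). The helper name length_over_ops is hypothetical, and that the query_length and reference_length attributes are computed exactly this way is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: sum the run lengths of every operation whose
# one-character code appears in $consuming.
sub length_over_ops {
    my ($cigar, $consuming) = @_;
    my $total = 0;
    while ($cigar =~ /(\d+)([MIDNSHP=X])/g) {
        $total += $1 if index($consuming, $2) >= 0;
    }
    return $total;
}

my $cigar     = "2M1D1M1I4M";
my $query_len = length_over_ops($cigar, "MIS=X");   # query-consuming ops
my $ref_len   = length_over_ops($cigar, "MDN=X");   # reference-consuming ops
print "query=$query_len reference=$ref_len\n";      # prints "query=8 reference=8"
```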
CONSTRUCTOR
new
Takes a CIGAR string as the sole argument and returns a new Bio::Cigar object.
METHODS
rpos_to_qpos
Takes a reference position (origin 1, base-numbered) and returns the corresponding position (origin 1, base-numbered) on the query sequence. Indels affect how the numbering maps from reference to query.
In list context, returns a tuple of [query position, operation at position]. The operation is a single-character string; see the table of CIGAR operations.
If the reference position does not map to the query sequence (as with a deletion, for example), returns undef, or [undef, operation] in list context.
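One way to picture the mapping is to walk the operations while counting how many reference and query bases have been consumed. The following is a minimal sketch of that idea, not the module's implementation; rpos_to_qpos_sketch is a hypothetical name, it always returns a list, and returning an empty list for out-of-range positions is a choice made for this example only.

```perl
use strict;
use warnings;

# Hypothetical helper: map a 1-based reference position to
# (query position, operation) by walking the CIGAR runs.
sub rpos_to_qpos_sketch {
    my ($cigar, $target) = @_;
    return if $target < 1;

    my ($rpos, $qpos) = (0, 0);   # bases consumed so far on each sequence
    while ($cigar =~ /(\d+)([MIDNSHP=X])/g) {
        my ($len, $op) = ($1, $2);
        my $uses_ref   = $op =~ /[MDN=X]/;
        my $uses_query = $op =~ /[MIS=X]/;

        if ($uses_ref && $target <= $rpos + $len) {
            # The target falls inside this run. Deletions and skips
            # have no counterpart on the query.
            return (undef, $op) unless $uses_query;
            return ($qpos + ($target - $rpos), $op);
        }
        $rpos += $len if $uses_ref;
        $qpos += $len if $uses_query;
    }
    return;   # target lies beyond the aligned reference segment
}

my ($qpos, $op) = rpos_to_qpos_sketch("2M1D1M1I4M", 3);
printf "qpos=%s op=%s\n", defined $qpos ? $qpos : "undef", $op;   # prints "qpos=undef op=D"
```

With the same CIGAR string, reference position 5 maps to query position 5 in this sketch, because the one-base deletion and the one-base insertion before it cancel out.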
qpos_to_rpos
Takes a query position (origin 1, base-numbered) and returns the corresponding position (origin 1, base-numbered) on the reference sequence. Indels affect how the numbering maps from query to reference.
In list context, returns a tuple of [reference position, operation at position]. The operation is a single-character string; see the table of CIGAR operations.
If the query position does not map to the reference sequence (as with an insertion, for example), returns undef, or [undef, operation] in list context.
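The query-to-reference direction can be sketched the same way, with the roles of the two counters swapped. As before, qpos_to_rpos_sketch is a hypothetical name and this is an illustration under the SAM consumption rules, not the module's code. Note that in this sketch soft-clipped query positions, like insertions, yield an undefined reference position.

```perl
use strict;
use warnings;

# Hypothetical helper: map a 1-based query position to
# (reference position, operation) by walking the CIGAR runs.
sub qpos_to_rpos_sketch {
    my ($cigar, $target) = @_;
    return if $target < 1;

    my ($rpos, $qpos) = (0, 0);   # bases consumed so far on each sequence
    while ($cigar =~ /(\d+)([MIDNSHP=X])/g) {
        my ($len, $op) = ($1, $2);
        my $uses_ref   = $op =~ /[MDN=X]/;
        my $uses_query = $op =~ /[MIS=X]/;

        if ($uses_query && $target <= $qpos + $len) {
            # Insertions and soft clips have no reference counterpart.
            return (undef, $op) unless $uses_ref;
            return ($rpos + ($target - $qpos), $op);
        }
        $rpos += $len if $uses_ref;
        $qpos += $len if $uses_query;
    }
    return;   # target lies beyond the end of the query
}

my ($rpos, $op_q) = qpos_to_rpos_sketch("2M1D1M1I4M", 4);
printf "rpos=%s op=%s\n", defined $rpos ? $rpos : "undef", $op_q;   # prints "rpos=undef op=I"
```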
op_at_rpos
Takes a reference position and returns the operation at that position. Simply a shortcut for calling "rpos_to_qpos" in list context and discarding the first return value.
op_at_qpos
Takes a query position and returns the operation at that position. Simply a shortcut for calling "qpos_to_rpos" in list context and discarding the first return value.
reversed
Returns a new Bio::Cigar object with a CIGAR string that's the reverse of this one, i.e. the last operation becomes the first, the second-to-last the second, etc. until the first operation becomes the last.
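At the string level, reversal keeps each run intact and only changes the order of the runs. A minimal sketch, using the hypothetical helper reverse_cigar_sketch (which returns a plain string rather than a Bio::Cigar object):

```perl
use strict;
use warnings;

# Hypothetical helper: reverse the order of the <count><operation>
# runs in a CIGAR string without altering the runs themselves.
sub reverse_cigar_sketch {
    my ($cigar) = @_;
    my @runs = $cigar =~ /(\d+[MIDNSHP=X])/g;
    return join "", reverse @runs;
}

print reverse_cigar_sketch("2M1D1M1I4M"), "\n";   # prints "4M1I1M1D2M"
```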
align($query, $reference, $start_pos=1, $reversed=0)
Takes a query sequence and a reference sequence and aligns them according to the CIGAR string, using gap characters (-) for indels and spaces for soft clipping. This is pure string manipulation, so the match and mismatch operators (= and X) are assumed to be correct for the given input sequences and are not verified. Returns an array ref of [query seq, ref seq].
Optionally, the leftmost reference position (origin 1) can be passed, i.e. the query is aligned starting at that position.
When $reversed is given a true value, the reverse complement of the passed query sequence is used to generate the alignment. Only the IUPAC nucleotide codes ATCGU are currently supported for reverse complementation.
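The string manipulation described above can be sketched as follows. This is an illustration only, not the module's implementation: align_sketch is a hypothetical function, it handles just the M, I, D, = and X operations (clipping, N and P are deliberately rejected), and its argument order differs from the method because the CIGAR string is passed explicitly.

```perl
use strict;
use warnings;

# Hypothetical helper: build gapped query and reference strings from a
# CIGAR string. Supports only M, I, D, = and X.
sub align_sketch {
    my ($cigar, $query, $ref, $start_pos, $reversed) = @_;
    $start_pos = 1 unless defined $start_pos;

    if ($reversed) {
        # Reverse complement; only A, T, C, G and U are translated.
        $query = reverse $query;
        $query =~ tr/ATCGU/TAGCA/;
    }

    my ($qi, $ri) = (0, $start_pos - 1);   # 0-based string offsets
    my ($qaln, $raln) = ("", "");
    while ($cigar =~ /(\d+)([MIDNSHP=X])/g) {
        my ($len, $op) = ($1, $2);
        die "Operation $op is not handled by this sketch\n"
            if $op =~ /[NSHP]/;

        if ($op eq "I") {          # extra query bases: gap in reference
            $qaln .= substr($query, $qi, $len);
            $raln .= "-" x $len;
            $qi   += $len;
        }
        elsif ($op eq "D") {       # missing query bases: gap in query
            $qaln .= "-" x $len;
            $raln .= substr($ref, $ri, $len);
            $ri   += $len;
        }
        else {                     # M, = or X: copy from both sequences
            $qaln .= substr($query, $qi, $len);
            $raln .= substr($ref,   $ri, $len);
            $qi   += $len;
            $ri   += $len;
        }
    }
    return [ $qaln, $raln ];
}

my $aln = align_sketch("2M1D1M1I4M", "GCAAATGC", "AAAAGCAAATGC", 5);
print "$_\n" for @$aln;
# prints:
# GC-AAATGC
# GCAA-ATGC
```

Both returned strings have the same length, so they can be printed one above the other to inspect the alignment column by column.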
AUTHOR
Thomas Sibley <trsibley@uw.edu>
Felix Kühnl <felix@bioinf.uni-leipzig.de>
COPYRIGHT
Copyright 2014- Mullins Lab, Department of Microbiology, University of Washington.
LICENSE
This library is free software; you can redistribute it and/or modify it under the GNU General Public License, version 2.