The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

LICENSE

Copyright [1999-2015] Wellcome Trust Sanger Institute and the EMBL-European Bioinformatics Institute Copyright [2016-2024] EMBL-European Bioinformatics Institute

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

CONTACT

  Please email comments or questions to the public Ensembl
  developers list at <http://lists.ensembl.org/mailman/listinfo/dev>.

  Questions may also be sent to the Ensembl help desk at
  <http://www.ensembl.org/Help/Contact>.

NAME

Bio::EnsEMBL::BaseAlignFeature - Baseclass providing a common abstract implementation for alignment features

SYNOPSIS

  my $feat = new Bio::EnsEMBL::DnaPepAlignFeature(
    -slice        => $slice,
    -start        => 100,
    -end          => 120,
    -strand       => 1,
    -hseqname     => 'SP:RF1231',
    -hstart       => 200,
    -hend         => 220,
    -analysis     => $analysis,
    -cigar_string => '10M3D5M2I',
    -align_type   => 'ensembl'
  );

  where $analysis is a Bio::EnsEMBL::Analysis object.

  Alternatively if you have an array of ungapped features:

    my $feat =
      new Bio::EnsEMBL::DnaPepAlignFeature( -features => \@features );

  where @features is an array of Bio::EnsEMBL::FeaturePair objects.

  There is a method to (re)create ungapped features from the cigar_string:

    my @ungapped_features = $feat->ungapped_features();

  where @ungapped_features is an array of Bio::EnsEMBL::FeaturePair's.

  Bio::EnsEMBL::BaseAlignFeature inherits from:
    Bio::EnsEMBL::FeaturePair, which in turn inherits from:
      Bio::EnsEMBL::Feature,
  thus methods from both parent classes are available.


  The cigar_string is a condensed representation of the matches and gaps
  which make up the gapped alignment (where CIGAR stands for
  Concise Idiosyncratic Gapped Alignment Report).

  CIGAR format is: n <matches> [ x <deletes or inserts> m <matches> ]*
  where M = match, D = delete, I = insert; n, m are match lengths;
  x is delete or insert length.

  Spaces are omitted, thus: "23M4I12M2D1M"
  as are counts for any lengths of 1, thus 1M becomes M: "23M4I12M2DM"


  To make things clearer this is how a blast HSP would be parsed:

  >AK014066
         Length = 146

    Minus Strand HSPs:

    Score = 76 (26.8 bits), Expect = 1.4, P = 0.74
    Identities = 20/71 (28%), Positives = 29/71 (40%), Frame = -1

  Query:   479 GLQAPPPTPQGCRLIPPPPLGLQAPLPTLRAVGSSHHHP*GRQGSSLSSFRSSLASKASA 300
               G  APPP PQG R   P P G + P   L             + + ++  R  +A   +
  Sbjct:     7 GALAPPPAPQG-RWAFPRPTG-KRPATPLHGTARQDRQVRRSEAAKVTGCRGRVAPHVAP 64

  Query:   299 SSPHNPSPLPS 267
                  H P+P P+
  Sbjct:    65 PLTHTPTPTPT 75

  The alignment goes from 479 down to 267 in the query sequence on the reverse
  strand, and from 7 to 75 in the subject sequence.

  The alignment is made up of the following ungapped pieces:

  query_seq start 447 , sbjct_seq hstart  7 , match length  33 , strand -1
  query_seq start 417 , sbjct_seq hstart 18 , match length  27 , strand -1
  query_seq start 267 , sbjct_seq hstart 27 , match length 147 , strand -1

  When assembled into a DnaPepAlignFeature where:
    (seqname, start, end, strand) refer to the query sequence,
    (hseqname, hstart, hend, hstrand) refer to the subject sequence,
  these ungapped pieces are represented by the cigar string:
    33M3I27M3I147M
  with start 267, end 479, strand -1, and hstart 7, hend 75, hstrand 1.

CAVEATS

  AlignFeature cigar strings have the opposite 'sense'
  ('D' and 'I' swapped) compared with Exonerate cigar strings.

  Exonerate modules in Bio::EnsEMBL::Analysis use this convention:

   The longer genomic sequence specified by:
      exonerate:    target
      AlignFeature: (sequence, start, end, strand)

   A shorter sequence (such as EST or protein) specified by:
      exonerate:    query
      AlignFeature: (hsequence, hstart, hend, hstrand)

  The resulting AlignFeature cigar strings have 'D' and 'I'
  swapped compared with the Exonerate output, i.e.:

    exonerate:    M 123 D 1 M 11 I 1 M 39
    AlignFeature: 123MI11MD39M

METHODS

new

  Arg [..]   : List of named arguments. (-cigar_string , -features, -align_type) defined
               in this constructor, others defined in FeaturePair and 
               SeqFeature superclasses.  Either cigar_string or a list
               of ungapped features should be provided - not both.
  Example    : $baf = new BaseAlignFeatureSubclass(-cigar_string => '3M3I12M', -align_type => 'ensembl');
  Description: Creates a new BaseAlignFeature using either a cigar string or
               a list of ungapped features.  BaseAlignFeature is an abstract
               baseclass and should not actually be instantiated - rather its
               subclasses should be.
  Returntype : Bio::EnsEMBL::BaseAlignFeature
  Exceptions : thrown if both feature and cigar string args are provided
               thrown if neither feature nor cigar string args are provided
               warn if cigar string is provided without cigar type
  Caller     : general
  Status     : Stable

cigar_string

  Arg [1]    : string $cigar_string
  Example    : $feature->cigar_string( "12MI3M" );
  Description: get/set for attribute cigar_string.
               cigar_string describes the alignment:
                 "xM" stands for x matches (or mismatches),
                 "xI" for x inserts into the query sequence,
                 "xD" for x deletions from the query sequence
                 where the query sequence is specified by (seqname, start, ...)
                 and the subject sequence by (hseqname, hstart, ...).
               An "x" that is 1 can be omitted.
               See the SYNOPSIS for an example.
  Returntype : string
  Exceptions : none
  Caller     : general
  Status     : Stable

align_type

  Arg [1]    : type $align_type
  Example    : $feature->align_type( "ensembl" );
  Description: get/set for attribute align_type.
               align_type specifies which cigar string 
               is used to describe the alignment:
               The default is 'ensembl'
  Returntype : string
  Exceptions : none
  Caller     : general
  Status     : Stable

alignment_length

  Arg [1]    : None
  Description: return the alignment length (including indels) based on the alignment_type ('ensembl', 'mdtag')
  Returntype : int
  Exceptions : 
  Caller     : 
  Status     : Stable

_ensembl_cigar_alignment_length

  Arg [1]    : None
  Description: return the alignment length (including indels) based on the cigar_string
  Returntype : int
  Exceptions :
  Caller     :
  Status     : Stable

ungapped_features

  Args       : none
  Example    : @ungapped_features = $align_feature->get_feature
  Description: converts the internal cigar_string into an array of
               ungapped feature pairs
  Returntype : list of Bio::EnsEMBL::FeaturePair
  Exceptions : cigar_string not set internally
  Caller     : general
  Status     : Stable

strands_reversed

  Arg [1]    : int $strands_reversed
  Description: get/set for attribute strands_reversed
               0 means that strand and hstrand are the original strands obtained
                 from the alignment program used
               1 means that strand and hstrand have been flipped as compared to
                 the original result provided by the alignment program used.
                 You may want to use the reverse_complement method to restore the
                 original strandness.
  Returntype : int
  Exceptions : none
  Caller     : general
  Status     : Stable
 

reverse_complement

  Args       : none
  Description: reverse complement the FeaturePair based on the cigar type
               modifing strand, hstrand and cigar_string in consequence
  Returntype : none
  Exceptions : none
  Caller     : general
  Status     : Stable

_ensembl_reverse_complement

  Args       : none
  Description: reverse complement the FeaturePair for ensembl cigar string,
               modifing strand, hstrand and cigar_string in consequence
  Returntype : none
  Exceptions : none
  Caller     : general
  Status     : Stable

transform

  Arg  1     : String $coordinate_system_name
  Arg [2]    : String $coordinate_system_version
  Example    : $feature = $feature->transform('contig');
               $feature = $feature->transform('chromosome', 'NCBI33');
  Description: Moves this AlignFeature to the given coordinate system.
               If the feature cannot be transformed to the destination 
               coordinate system undef is returned instead.
  Returntype : Bio::EnsEMBL::BaseAlignFeature;
  Exceptions : wrong parameters
  Caller     : general
  Status     : Medium Risk

_parse_ensembl_cigar

  Args       : none
  Description: PRIVATE (internal) method - creates ungapped features from 
               internally stored cigar line in ensembl format
  Returntype : list of Bio::EnsEMBL::FeaturePair
  Exceptions : none
  Caller     : ungapped_features
  Status     : Stable

_parse_features

  Arg  [1]   : listref Bio::EnsEMBL::FeaturePair $ungapped_features
  Description: creates internal cigar_string and start,end hstart,hend
               entries.
  Returntype : none, fills in values of self
  Exceptions : argument list undergoes many sanity checks - throws under many
               invalid conditions
  Caller     : new
  Status     : Stable

_hit_unit

  Args       : none
  Description: abstract method, overwrite with something that returns
               one or three
  Returntype : int 1,3
  Exceptions : none
  Caller     : internal
  Status     : Stable

_query_unit

  Args       : none
  Description: abstract method, overwrite with something that returns
               one or three
  Returntype : int 1,3
  Exceptions : none
  Caller     : internal
  Status     : Stable

_mdtag_alignment_length

  Arg [1]    : None
  Description: return the alignment length (including indels) based on the mdtag (mdz) string
  Returntype : int
  Exceptions : none
  Caller     : internal
  Status     : Stable

_get_mdz_chunks

  Arg [1]    : mdtag string - MD Z String for mismatching positions. Regex : [0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)* (Refer:  SAM/BAM specification)
  Description: parses the mdtag string and group it according the type
               eg: MD:Z:35^VIVALE31^GRPLIQPRRKKAYQLEHTFQGLLGKRSLFTE10 returns ['35', '^', 'VIVALE', '31', '^', 'GRPLIQPRRKKAYQLEHTFQGLLGKRSLFTE', '10']
  Returntype : array of strings
  Exceptions : none
  Caller     : internal
  Status     : Stable

_get_mdz_alignment_length

  Arg [1]    : array of strings
  Description: calculate the alignment length from the given chunks
  Returntype : array of strings
  Exceptions : none
  Caller     : internal
  Status     : Stable

_get_mdz_chunk_type

  Arg [1]    : char
  Description: get the chunk type
  Returntype : string
  Exceptions : none
  Caller     : internal
  Status     : Stable

_mdz_alignment_string

  Arg [1]    : input sequence
  Arg [2]    : MD Z String for mismatching positions. Regex : [0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)* (Refer:  SAM/BAM specification)
               eg: MD:Z:96^RHKTDSFVGLMGKRALNS0V14
  Example    : $pf->alignment_strings
  Description: Allows to rebuild the alignment string of both the seq and hseq sequence
  Returntype : array reference containing 2 strings
               the first corresponds to seq
               the second corresponds to hseq
  Exceptions : none
  Caller     : general
  Status     : Stable