NAME

Text::Levenshtein::Edlib - XS edit distance and optimal alignment path calculation

SYNOPSIS

use feature 'say';

use Text::Levenshtein::Edlib qw/:all/;
say distance 'kitten', 'sitting'; # prints '3'
say 'Distance > 2!'
    if !defined distance 'kitten', 'sitting', 2; # prints 'Distance > 2!'

my $align = align('kitten', 'sitting');
say "Edit distance is: $align->{editDistance}";
say "Alphabet length is: $align->{alphabetLength}";
say "Start locations are: @{$align->{startLocations}}";
say "End locations are: @{$align->{endLocations}}";
say "Alignment path is: @{$align->{alignment}}";

DESCRIPTION

Text::Levenshtein::Edlib is a wrapper around the edlib library that computes Levenshtein edit distance and optimal alignment path for a pair of strings.

It does not handle UTF-8 strings, for those Text::Levenshtein::XS can compute edit distance but not alignment path.

This module has two functions:

distance($query, $target, [$max_distance])

This is the basic interface to the library. It is compatible with the function of the same name in Text::Levenshtein::XS.

It returns the edit distance between the two given strings. If the third argument is specified, and the edit distance is greater than the value of the third argument, then the function finishes the computation early and returns undef.

align($query, $target, [$max_distance, [$mode, [$task]]])

This is the full-featured interface to the library.

It returns a hashref with the following keys:

editDistance

The edit distance of the two strings.

alphabetLength

The number of different characters in the query and target together.

endLocations

Array of zero-based positions in target where optimal alignment paths end. If gap after query is penalized, gap counts as part of query (NW), otherwise not.

startLocations

Array of zero-based positions in target where optimal alignment paths start, they correspond to endLocations. If gap before query is penalized, gap counts as part of query (NW), otherwise not.

alignment

Alignment is found for first pair of start and end locations. Alignment is sequence of numbers: 0, 1, 2, 3. 0 stands for match. 1 stands for insertion to target. 2 stands for insertion to query. 3 stands for mismatch. You can use the EDLIB_EDOP_* constants instead of 0, 1, 2, and 3. Alignment aligns query to target from begining of query till end of query. If gaps are not penalized, they are not in alignment.

The third argument, $max_distance, works similarly to the third argument of distance: if the distance is more than its value, this function returns an empty hashref. Default value is -1, which disables this optimization.

The fourth argument, $mode, chooses how Edlib should treat gaps before and after query. The options are:

EDLIB_MODE_NW (default)

Global method - gaps are not ignored. This is the standard Levenshtein distance, and is the default if $mode is not specified.

EDLIB_MODE_SHW

Prefix method - gaps at query end are ignored. So the edit distance between AACT and AACTGGC is 0, because we can ignore the GGC at the end.

EDLIB_MODE_HW

Infix method - gaps at both query start and end are ignored. So the edit distance between ACT and CGACTGAC is 0, because CG at the beginning and GAC at the end of the target are ignored.

The fifth argument, $task, chooses what we want to compute. If set to EDLIB_TASK_PATH (default), all the keys described above will be computed. If set to EDLIB_TASK_LOC, all keys except for alignment will be computed. If set to EDLIB_TASK_DISTANCE, all keys except for alignment and startLocations will be computed. The less we compute, the faster the function will run.

EXPORT

All constants by default. You can export the functions align and distance and any of the constants below. You can use the tags :constants to export every constant, and :all to export every constant, align and distance.

Exportable constants

EDLIB_CIGAR_EXTENDED
EDLIB_CIGAR_STANDARD
EDLIB_EDOP_DELETE
EDLIB_EDOP_INSERT
EDLIB_EDOP_MATCH
EDLIB_EDOP_MISMATCH
EDLIB_MODE_HW
EDLIB_MODE_NW
EDLIB_MODE_SHW
EDLIB_STATUS_ERROR
EDLIB_STATUS_OK
EDLIB_TASK_DISTANCE
EDLIB_TASK_LOC
EDLIB_TASK_PATH

SEE ALSO

https://github.com/Martinsos/edlib/

AUTHOR

Marius Gavrilescu, <marius@ieval.ro>

COPYRIGHT AND LICENSE

Copyright (C) 2017 by Marius Gavrilescu

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.22.3 or, at your option, any later version of Perl 5 you may have available.