NAME

Code::DRY - Cut-and-Paste-Detector for Perl code

SYNOPSIS

use Code::DRY;

# high level usage: scan some directories for Perl code
# and let the module report duplicates, sorted by
# length. The minimum length is 4 lines, and all
# filters are set to undef.
#
Code::DRY::scan_directories(4, undef, undef, undef, @dirs);

or

# mid level usage: let the function report duplicates
# from a list of files. The ignore filter is set to undef.
# This time the minimum length is set to 40 bytes.
Code::DRY::find_duplicates_in(-40, undef, @files);

or

# low level usage: analyse the raw data yourself
# build the suffix and lcp arrays
Code::DRY::build_suffixarray_and_lcp($longstringwithcode);
# avoid matches crossing file boundaries
Code::DRY::clip_lcp_to_fileboundaries(\@Code::DRY::fileoffsets);
# avoid matches overlapping into each other
Code::DRY::reduce_lcp_to_nonoverlapping_lengths();
# avoid matches that are included in longer matches
Code::DRY::set_lcp_to_zero_for_shadowed_substrings();
# then iterate through the lcp array via get_len_at(index)
# and through the suffix/offset array via get_offset_at(index)
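# for example, with a loop like this (a sketch):
for my $i (0 .. Code::DRY::get_size() - 1) {
    my $len = Code::DRY::get_len_at($i);
    next unless $len > 0;
    my $off = Code::DRY::get_offset_at($i);
    print "repeated fragment of $len bytes at offset $off:\n",
          substr($longstringwithcode, $off, $len), "\n";
}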

DESCRIPTION

The module's main purpose is to report repeated text fragments (typically Perl code) that could be considered for isolation and/or abstraction in order to reduce multiple copies of the same code (aka cut and paste code).

Code duplicates may occur within the same line, within the same file, or across several files and directories.

The ad hoc approach of comparing every item against every other item leads to computing times that grow quadratically with the amount of code, which makes it useless for all but the smallest code bases.

So an efficient data structure is needed.

This module can create the suffix array and the longest common prefix array for a string of 8-bit characters. These data structures can be used to search for repetitions of substrings in O(n) time.

The current strategy is to concatenate code from all files into one string and then use the suffix array and its companion, the longest-common-prefix (lcp) array on this string.

Example:

Instead of real Perl code the string 'mississippi' is used here for simplicity. A suffix of a string is a substring that extends to the end of the string; a prefix is a substring that starts at the beginning of the string. The suffix array of a string is the list of offsets of all its suffixes, sorted lexicographically by suffix:

#  offset suffix
================
0  10:    i
1   7:    ippi
2   4:    issippi
3   1:    ississippi
4   0:    mississippi
5   9:    pi
6   8:    ppi
7   6:    sippi
8   3:    sissippi
9   5:    ssippi
10  2:    ssissippi

The other structure needed is the longest common prefix array (lcp). For each entry it contains the length of the longest prefix that the entry's suffix shares with the suffix of the previous entry in the suffix array. For this example it looks like this:

#  offset lcp  (common prefixes shown in ())
=====================
0  10:    0    ()
1   7:    1    (i)
2   4:    1    (i)
3   1:    4    (issi) overlap!
3         3    (iss)  corrected non overlapping prefixes
4   0:    0    ()
5   9:    0    ()
6   8:    1    (p)
7   6:    0    ()
8   3:    2    (si)
9   5:    1    (s)
10  2:    3    (ssi)
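
For illustration, both arrays for this toy string can be computed naively in pure Perl (the module itself uses the linear-time sais-lite implementation in XS instead):

my $s  = 'mississippi';
# suffix array: offsets sorted lexicographically by suffix
my @sa = sort { substr($s, $a) cmp substr($s, $b) } 0 .. length($s) - 1;
# lcp array: length of the longest common prefix with the previous suffix
my @lcp = (0);
for my $i (1 .. $#sa) {
    my ($u, $v) = (substr($s, $sa[$i - 1]), substr($s, $sa[$i]));
    my $l = 0;
    $l++ while substr($u, $l, 1) ne ''
            && substr($u, $l, 1) eq substr($v, $l, 1);
    push @lcp, $l;
}
print "@sa\n";   # 10 7 4 1 0 9 8 6 3 5 2
print "@lcp\n";  # 0 1 1 4 0 0 1 0 2 1 3
# note: the entry with value 4 still contains the overlap;
# the module clips it to 3 afterwards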

The standard lcp array may contain overlapping prefixes, but for our purposes only non-overlapping prefix lengths are useful. In the example above, the suffixes at offsets 1 and 4 share the prefix 'issi' of length 4, but because the offsets differ by only 3 the two copies would overlap, so the length is clipped to 3. A similar overlap may occur for prefixes that extend from the end of one source file into the start of the next when the concatenated content of several source files is analysed. The clipping of internal overlaps and of file-crossing prefix lengths is done afterwards by two dedicated functions.

If we sort the lcp values obtained this way in descending order, we get:

#  offset lcp  (prefix shown in ())
===================================
3   1:    3    (iss) now corrected to non overlapping prefixes
10  2:    3    (ssi)
8   3:    2    (si)
1   7:    1    (i)
2   4:    1    (i)
6   8:    1    (p)
9   5:    1    (s)
0  10:    0    ()
4   0:    0    ()
5   9:    0    ()
7   6:    0    ()

The first entry shows the longest repetition in the given string. Not all of the remaining entries are of interest, since shorter copies may be contained in a longer match. After removing all such 'shadowed' repetitions, the next entry can be reported. Eventually the remaining lcp values become too small to be of interest.

Currently this is experimental code.

The most appropriate mailing list on which to discuss this module would be perl-qa. See http://lists.perl.org/list/perl-qa.html.

REQUIREMENTS & OPTIONAL MODULES

REQUIREMENTS

  • The ability to compile XS extensions.

    This means a working C compiler, a linker, a make program and so on. If you built perl from source you will have these already and they will be used automatically. If your perl was built in some other way (for example, installed using your operating system's packaging mechanism), you will need to ensure that the appropriate tools are installed.

  • Module File::Find

    This has been a core module for quite a while now.

OPTIONAL MODULES

  • Test::More

    Required if you want to run Code::DRY's own tests.

SUBROUTINES

scan_directories($minlength, $ignoreContent, $regexAccept, $regexIgnore, @array_of_pathnames_of_directories)

Scans the given directories in @array_of_pathnames_of_directories recursively for file names that match the regexp $regexAccept, if it is defined. Files whose names additionally do not match the regexp $regexIgnore (unless $regexIgnore is undefined) are included in the analysis.

If $regexAccept and $regexIgnore both are undef, all file names will be considered for analysis.

If either of $regexAccept and $regexIgnore is not a ref of type 'Regexp', it is expected to be a pattern string that will be converted into a regexp with qr{}xms.

The parameter $ignoreContent can be used to avoid duplication reports for content matching this regex. If $ignoreContent is not a ref of type 'Regexp', it is expected to be a pattern string that will be converted into a regexp with qr{}xms.

The parameter $minlength is interpreted in units of lines when it is positive; otherwise its absolute value is interpreted in units of bytes.

All repetitions with a minimum length of $minlength will be reported by the report callback function.
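
For example, a call like the following (directory names and patterns chosen for illustration) reports all duplicates of at least 10 lines among the *.pl and *.pm files below lib and bin, skipping test files:

Code::DRY::scan_directories(
    10,                  # positive: minimum length in lines
    undef,               # do not ignore any content
    qr{\.p[lm]\z}xms,    # accept *.pl and *.pm files
    qr{\.t\z}xms,        # but ignore *.t files
    'lib', 'bin',
);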

find_duplicates_in($minlength, $ignoreContent, @array_of_pathnames_of_files)

Reads the files given by the file names and composes one long string, which is then analysed for repetitions.

The parameter $minlength is interpreted in units of lines when it is positive; otherwise its absolute value is interpreted in units of bytes.

The parameter $ignoreContent can be used to avoid duplication reports for content matching this regex. If $ignoreContent is not a ref of type 'Regexp', it is expected to be a pattern string that will be converted into a regexp with qr{}xms.

All repetitions with a minimum length of $minlength will be reported by the report callback function.
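
For example (the paths are placeholders), the following call reports duplicates of at least 50 bytes in two modules and suppresses reports for duplicates consisting of whitespace only:

Code::DRY::find_duplicates_in(
    -50,                 # negative: minimum length in bytes
    qr{\A\s*\z}xms,      # ignore whitespace-only duplicates
    'lib/Foo.pm',
    'lib/Bar.pm',
);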

set_reporter(sub{ CODE BLOCK })

Sets custom code to report duplicates of a code fragment. The callback is invoked with position information for the copies found during analysis.

The supplied code has to accept two scalars and an array reference.

The first parameter is the required minimum length of duplicates to be reported.

The second parameter contains a string describing the units for minimum length ('lines' or 'bytes').

The referenced array (third parameter) contains one anonymous array reference for each copy found. Copies are reported in the order in which the files were processed, and then in the order of their positions. Each copy is described by the following position information (see also the sketch after the list):

1. filename
2. line number at start of copy (starting with 1). This is the line number of the first line completely contained in the copy.
3. line number at end of copy. This is the line number of the last line completely contained in the copy.
4. offset from start of file at start of copy (starting with 0) clipped to the next completely contained line.
5. offset from start of file at end of copy clipped to the last completely contained line.
6. offset from start of file at start of copy (starting with 0) (used in 'bytes' mode)
7. offset from start of file at end of copy (used in 'bytes' mode)
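
As an illustration, a minimal custom reporter based on this layout might look like this (a sketch, not the shipped default):

Code::DRY::set_reporter(sub {
    my ($minlength, $unit, $copies) = @_;
    print scalar(@$copies), " copies (minimum length $minlength $unit):\n";
    for my $copy (@$copies) {
        my ($file, $firstline, $lastline) = @{$copy}[0 .. 2];
        print "  $file, lines $firstline..$lastline\n";
    }
});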

The default reporter is like this:

XXX insert code when stable

set_default_reporter

Resets the reporter callback function to the default shown above.

report_dupes($minlength, $copies, $length, $index_in_suffix_and_lcp_array)

This function builds a data structure with position information for the duplicated copies described by the input parameters. The entries in the suffix array from $index_in_suffix_and_lcp_array to $index_in_suffix_and_lcp_array + $copies - 1 give the offsets in the string where the copies start. Each copy has a length of $length characters. From these values, file names and line numbers are retrieved and stored in the structure. Then the reporter callback function is called with the minimum length of this scan, $minlength, and this structure. See also function set_reporter().

enter_files($ref_to_array_of_pathnames_of_files)

Reads the files given by the pathnames. Files with a length of zero are skipped (and removed from the file name array). Offset arrays for file and line end positions are built, and the content of all files is concatenated. Currently the content need not be valid Perl (but this might change when parsing gets involved in a future release).
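
For example (the paths are placeholders):

Code::DRY::enter_files([ 'lib/Foo.pm', 'lib/Bar.pm' ]);
# @Code::DRY::fileoffsets now holds the accumulated file end offsets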

build_suffixarray_and_lcp($textstring_to_analyse)

The XS routine calls the sais-lite function to construct the suffix and longest common prefix arrays for the complete string to analyse.

clip_lcp_to_fileboundaries(@array_with_endoffset_positions_of_files)

The XS routine limits lcp values that cross file boundaries, according to the given file end positions. The accumulated end positions have to be sorted in ascending order. Internally a binary search is done in order to find the right file end position. @Code::DRY::fileoffsets contains the needed offsets when enter_files has been used before.

reduce_lcp_to_nonoverlapping_lengths

The XS routine limits lcp values that overlap with the preceding entry. This is necessary to avoid overlaps between the reported duplicates.

offset2fileindex($offset)

This function uses binary search to get the index of the respective file entry for this offset.

offset2filename($offset)

This function uses binary search to get the index of the respective file entry for this offset. It returns the filename for this entry.

offset2line($offset)

This function uses binary search to get the index of the respective file entry for this offset. Then it again uses binary search to get the max offset of the respective line and returns the line number for this entry.
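
For example, given a byte offset $offset into the concatenated string, a location can be resolved like this (a sketch):

my $file = Code::DRY::offset2filename($offset);
my $line = Code::DRY::offset2line($offset);
print "offset $offset maps to $file, line $line\n";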

offsetAndFileindex2line($offset, $fileindex, \$nextcompleteLine, \$lastcompleteLine)

This function uses binary search to get the max offset of the respective line belonging to file $fileindex and returns the line number for this entry.

If the third parameter is defined, it is expected to be a reference to a scalar. It will be set to the line number of the next line, unless $offset points to the start of a line, in which case it is set to the current line number.

If the fourth parameter is defined, it is expected to be a reference to a scalar. It will be set to the line number of the previous line, unless $offset points to the end of a line or there is no previous line, in which case it is set to the current line number.
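
For example, a sketch that determines the first and last complete lines around a copy starting at $offset in the file with index $fileindex:

my ($next_complete, $last_complete);
my $line = Code::DRY::offsetAndFileindex2line(
    $offset, $fileindex, \$next_complete, \$last_complete,
);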

clearData

This function clears all resources that were used for a scan.

get_len_at($index)

This XS function returns the lcp value at index $index or ~0, if the index is out of range.

get_offset_at($index)

This XS function returns the offset value from the suffix array at index $index or ~0, if the index is out of range.

get_isa_at($offset)

This XS function returns the index number from the suffix array where the $offset is found or ~0, if the offset is out of range.

set_lcp_to_zero_for_shadowed_substrings($index)

This XS function sets all prefix lengths to zero for those entries where the suffix is contained in another longer (or more leftward) suffix.

get_concatenated_text($offset, $length)

This function returns the given text portion of the internal concatenated string at offset $offset with a length of $length.

get_line_offsets_of_fileindex($fileindex)

This function returns the array reference of the line end offsets for the file at index $fileindex.
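
For example, a sketch that counts the lines of the first file entered:

my $line_ends = Code::DRY::get_line_offsets_of_fileindex(0);
my $number_of_lines = @{$line_ends};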

get_next_ranked_index not yet implemented

This XS function returns the next index number of the sorted lcp values or ~0, if there are no more entries left.

reset_rank_iterator not yet implemented

This XS function resets the iterator of the sorted lcp values.

get_size

This XS function returns the size of string (in 8-bit characters) used by the build_suffixarray_and_lcp function.

get_lcp

This XS function returns a reference to a copy of the lcp array from the build_suffixarray_and_lcp function.

get_sa

This XS function returns a reference to a copy of the suffix array from the build_suffixarray_and_lcp function.
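
For example, both arrays can be dumped side by side (a sketch):

my $sa  = Code::DRY::get_sa();
my $lcp = Code::DRY::get_lcp();
printf "%3d: offset=%3d lcp=%3d\n", $_, $sa->[$_], $lcp->[$_]
    for 0 .. $#{$sa};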

__get_text

Internal function

__free_all

Internal function

DIAGNOSTICS

Output messages

Duplicates are reported (by the default reporter) in the following format:

1 duplicate(s) found with a length of 8 (>= 2 lines) and 78 bytes reduced to complete lines:
1.  File: t/00_lowlevel.t in lines 57..64 (offsets 1467..1544)
2.  File: t/00_lowlevel.t in lines 74..81 (offsets 1865..1942)
=================
...<duplicated content>
=================

Error messages

This module can die with the following error messages:

  • "cannot open file $file: $!\n";

    The opening of a file for read access failed.

  • "Error building suffix array:$!\n"

    The XS code could not allocate enough memory for the combined file content.

BUGS AND LIMITATIONS

Probably some; this is new code :-).

Currently the underlying XS code operates on 8-bit characters only. For Perl source code this seems to work for most texts.

The full extent of masking out submatches has not yet been implemented.

To report bugs, go to <http://rt.cpan.org/NoAuth/Bugs.html?Dist=Code-DRY> or send mail to <bug-Code-DRY#rt.cpan.org>

EXPORTED SYMBOLS

None by default.

ACKNOWLEDGEMENTS

Thanks to Yuta Mori for providing the C code for the construction of the suffix array (sais-lite) and to Johannes Fischer for extending it with the efficient generation of lcp values. I am grateful that both authors provided their work as open source.

Some code and ideas cribbed from:

Ovid's blog http://blogs.perl.org/users/ovid/2012/12/finding-duplicate-code-in-perl.html

SEE ALSO

AUTHOR

Heiko Eißfeldt, <hexcoder@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2014 by hexcoder

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.

For files salcpis.[ch] from the sais-lite-lcp-master package: