NAME

Test::CPAN::Health::Check::DuplicateCode - Detect copy-paste code blocks across source files

SYNOPSIS

use Test::CPAN::Health::Check::DuplicateCode;

my $check  = Test::CPAN::Health::Check::DuplicateCode->new;
my $result = $check->run($dist);

printf "%s: %s\n", $result->status, $result->summary;

DESCRIPTION

Implements a lightweight, dependency-free clone detector using a sliding-window hash approach:

  1. Each source file is reduced to a sequence of code lines: lines that are not blank, not pure comments, and not POD. Each line is whitespace-normalised (leading/trailing whitespace stripped; runs of whitespace collapsed to a single space).

  2. A sliding window of 6 consecutive normalised lines forms a "chunk". Each chunk is keyed by its own text (the 6 lines joined into a single string; no MD5 or other digest is computed), and that key is recorded along with the originating file.

  3. Chunks whose hash appears in more than one distinct file are cross-file duplicates.
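The three steps above can be sketched in plain Perl as follows. This is a minimal illustration under stated assumptions, not the module's actual code: the sub names (code_lines, index_chunks, cross_file_duplicates), the POD detection rule, the "pure comment" test and the newline used to join a chunk are all hypothetical choices made for the example.

```perl
use strict;
use warnings;

my $WINDOW = 6;    # chunk size given in the description

# Step 1: reduce source text to normalised code lines.
# Assumptions: POD runs from a line starting "=word" to "=cut", and a
# pure comment is a line whose first non-blank character is '#'.
sub code_lines {
    my ($text) = @_;
    my (@lines, $in_pod);
    for my $line (split /\n/, $text) {
        if ($line =~ /^=cut\b/)    { $in_pod = 0; next }
        if ($line =~ /^=[a-zA-Z]/) { $in_pod = 1; next }
        next if $in_pod;
        next if $line =~ /^\s*(?:#.*)?$/;    # blank line or pure comment
        $line =~ s/^\s+|\s+$//g;             # strip leading/trailing whitespace
        $line =~ s/\s+/ /g;                  # collapse runs of whitespace
        push @lines, $line;
    }
    return @lines;
}

# Step 2: key every 6-line window by its joined text and remember
# which files it was seen in.
sub index_chunks {
    my (%sources) = @_;    # file name => source text
    my %seen;
    for my $file (sort keys %sources) {
        my @lines = code_lines($sources{$file});
        for my $i (0 .. $#lines - $WINDOW + 1) {
            my $key = join "\n", @lines[$i .. $i + $WINDOW - 1];
            $seen{$key}{$file} = 1;
        }
    }
    return \%seen;
}

# Step 3: keep only the chunks seen in more than one distinct file.
sub cross_file_duplicates {
    my ($seen) = @_;
    return grep { scalar(keys %{ $seen->{$_} }) > 1 } keys %$seen;
}

# Two hypothetical files sharing a body that differs only in indentation.
my %sources = (
    'lib/A.pm' => join("\n", 'package A;', (map { "step_$_();" } 1 .. 6), ''),
    'lib/B.pm' => join("\n", 'package B;', (map { "    step_$_();" } 1 .. 6), ''),
);
my @dups = cross_file_duplicates(index_chunks(%sources));
printf "%d cross-file duplicate chunk(s)\n", scalar @dups;
```

Files with fewer than 6 code lines contribute no chunks in this sketch, because the window range is empty.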

Score = round((1 - dup_chunks / total_chunks) * 100) when total_chunks > 0, else 100. Status: pass if score ≥ 90, warn if 50 ≤ score < 90, fail if score < 50.
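The scoring rule can be written out as a short sketch. The sub names score and status_for are hypothetical, and rounding half up via int(x + 0.5) is an assumption; the text only says "round".

```perl
use strict;
use warnings;

# Score formula from the text; 100 when there are no chunks at all.
sub score {
    my ($dup_chunks, $total_chunks) = @_;
    return 100 unless $total_chunks > 0;
    # Assumption: round half up (the value is never negative here).
    return int((1 - $dup_chunks / $total_chunks) * 100 + 0.5);
}

# Thresholds from the text: pass >= 90, warn >= 50, otherwise fail.
sub status_for {
    my ($score) = @_;
    return $score >= 90 ? 'pass'
         : $score >= 50 ? 'warn'
         :                'fail';
}

for my $case ([0, 0], [1, 10], [5, 10], [6, 10]) {
    my $s = score(@$case);
    printf "dup=%d total=%d score=%d status=%s\n", @$case, $s, status_for($s);
}
```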

LIMITATIONS

  • Whitespace normalisation is basic: it does not account for string literals or heredocs that span multiple lines.

  • The detector only finds cross-file duplicates; intra-file repetition is not flagged.

  • Short common boilerplate (e.g. use strict; use warnings; 1;) is usually not flagged, because a run of boilerplate lines rarely fills a whole 6-line chunk in the same order in two files.
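The first limitation above can be seen with a two-line experiment. The normalise sub below is a hypothetical stand-in that applies only the trimming and collapsing described in step 1; because it knows nothing about quoting, whitespace inside a string literal is collapsed as well, so two statements that print different strings normalise to the same line.

```perl
use strict;
use warnings;

# Normalisation as described in step 1: trim, then collapse whitespace runs.
sub normalise {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;
    $line =~ s/\s+/ /g;
    return $line;
}

my $x = normalise(q{    print "a    b";});    # four spaces inside the string
my $y = normalise(q{print "a b";});           # one space inside the string
print $x eq $y ? "treated as identical\n" : "treated as different\n";
```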

run

PURPOSE

Extract code chunks from all source files, identify cross-file duplicates, and return a scored Result.

API SPECIFICATION

INPUT

dist     Test::CPAN::Health::Distribution  required
context  Hashref                           optional

OUTPUT

Test::CPAN::Health::Result with check_id 'duplicate_code'.

MESSAGES

Code  | Severity | Message                                     | Resolution
------+----------+---------------------------------------------+-----------
DC001 | SKIP     | No source files found                       | Add source files
DC002 | PASS     | No cross-file duplicate code blocks found   |
DC003 | WARN     | N duplicate code block(s) found             | Refactor to shared sub
DC004 | FAIL     | Many duplicate code blocks found            | Refactor to shared sub

FORMAL SPECIFICATION

-- Z schema (placeholder) --
DuplicateCodeOp
total_chunks : N
dup_chunks   : N
score        : 0..100
status       : {pass, warn, fail}
-------------------------------------------------------
total_chunks = 0    => status = pass /\ score = 100
score >= 90         => status = pass
50 <= score < 90    => status = warn
score < 50          => status = fail

SIDE EFFECTS

Reads source files only; no network or subprocess I/O.

USAGE EXAMPLE

my $result = Test::CPAN::Health::Check::DuplicateCode->new->run($dist);
print $result->summary;

AUTHOR

Nigel Horne, <njh at nigelhorne.com>

LICENSE AND COPYRIGHT

Copyright (C) 2025-2026 Nigel Horne.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.