NAME
Test::CPAN::Health::Check::DuplicateCode - Detect copy-paste code blocks across source files
SYNOPSIS
use Test::CPAN::Health::Check::DuplicateCode;
my $check = Test::CPAN::Health::Check::DuplicateCode->new;
my $result = $check->run($dist);
printf "%s: %s\n", $result->status, $result->summary;
DESCRIPTION
Implements a lightweight, dependency-free clone detector using a sliding-window hash approach:
Each source file is reduced to a sequence of code lines: lines that are not blank, not pure comments, and not POD. Each line is whitespace-normalised (leading/trailing whitespace stripped; runs of whitespace collapsed to a single space).
A sliding window of 6 consecutive normalised lines forms a "chunk". The hash of each chunk (MD5-free: the lines are joined as a string) is recorded along with the originating file.
Chunks whose hash appears in more than one distinct file are cross-file duplicates.
Score = round((1 - dup_chunks / total_chunks) * 100) when total_chunks > 0, else 100. Status: pass ≥ 90, warn ≥ 50, fail below 50.
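The steps above can be sketched in a few lines of Perl. This is a minimal illustration, not the module's actual code: the helper names code_lines, chunk_index and score are hypothetical, POD is assumed to run from any line starting with "=word" to "=cut", comments are assumed to be lines beginning with "#", and total_chunks is assumed to count distinct chunk strings.

```perl
use strict;
use warnings;

my $WINDOW = 6;    # chunk size given in the description

# Hypothetical helper: reduce source text to normalised code lines.
sub code_lines {
    my ($text) = @_;
    my (@lines, $in_pod);
    for my $line (split /\n/, $text) {
        if ($line =~ /^=cut\b/)    { $in_pod = 0; next }
        if ($line =~ /^=[a-zA-Z]/) { $in_pod = 1; next }
        next if $in_pod;
        $line =~ s/^\s+|\s+$//g;                  # strip leading/trailing whitespace
        next if $line eq '' || $line =~ /^#/;     # skip blank and pure-comment lines
        $line =~ s/\s+/ /g;                       # collapse whitespace runs
        push @lines, $line;
    }
    return @lines;
}

# Hypothetical helper: map each chunk string to the set of files containing it.
sub chunk_index {
    my (%sources) = @_;    # file name => source text
    my %seen;
    for my $file (keys %sources) {
        my @lines = code_lines($sources{$file});
        for my $i (0 .. $#lines - $WINDOW + 1) {
            # The joined string itself serves as the hash key (no MD5).
            my $key = join "\n", @lines[$i .. $i + $WINDOW - 1];
            $seen{$key}{$file}++;
        }
    }
    return \%seen;
}

# Hypothetical helper: apply the score formula to a chunk index.
# Assumption: total_chunks counts distinct chunk strings.
sub score {
    my ($seen) = @_;
    my $total = scalar keys %$seen;
    return 100 unless $total;
    my $dup = grep { scalar(keys %$_) > 1 } values %$seen;
    return int((1 - $dup / $total) * 100 + 0.5);    # round to nearest integer
}
```

A file with fewer than 6 code lines contributes no chunks in this sketch, which is consistent with the score defaulting to 100 when total_chunks is 0.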
LIMITATIONS
Whitespace normalisation is basic: it does not account for string literals or heredocs that span multiple lines.
The detector only finds cross-file duplicates; intra-file repetition is not flagged.
Short common boilerplate (e.g. use strict; use warnings; 1;) is naturally filtered out because consecutive boilerplate lines rarely reach the minimum chunk size of 6 in the same relative order.
run
PURPOSE
Extract code chunks from all source files, identify cross-file duplicates, and return a scored Result.
API SPECIFICATION
INPUT
dist     Test::CPAN::Health::Distribution  required
context  Hashref                           optional
OUTPUT
Test::CPAN::Health::Result with check_id 'duplicate_code'.
MESSAGES
Code  | Severity | Message                                   | Resolution
------+----------+-------------------------------------------+-----------------------
DC001 | SKIP     | No source files found                     | Add source files
DC002 | PASS     | No cross-file duplicate code blocks found |
DC003 | WARN     | N duplicate code block(s) found           | Refactor to shared sub
DC004 | FAIL     | Many duplicate code blocks found          | Refactor to shared sub
FORMAL SPECIFICATION
-- Z schema (placeholder) --
DuplicateCodeOp
total_chunks : N
dup_chunks : N
score : 0..100
-------------------------------------------------------
total_chunks = 0 => status = pass /\ score = 100
score >= 90 => status = pass
50 <= score < 90 => status = warn
score < 50 => status = fail
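The threshold predicates in the schema can be read as a simple ordered check. The sketch below is illustrative only; status_for_score is a hypothetical name and is not part of the documented API.

```perl
use strict;
use warnings;

# Hypothetical helper: map a 0..100 score to a status string
# using the thresholds stated in this document.
sub status_for_score {
    my ($score) = @_;
    return 'pass' if $score >= 90;
    return 'warn' if $score >= 50;    # reached only when score < 90
    return 'fail';
}
```

Checking the higher threshold first keeps the three ranges mutually exclusive, so a score of 95 yields pass rather than matching both the pass and warn conditions.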
SIDE EFFECTS
Reads source files only; no network or subprocess I/O.
USAGE EXAMPLE
my $result = Test::CPAN::Health::Check::DuplicateCode->new->run($dist);
print $result->summary;
AUTHOR
Nigel Horne, <njh at nigelhorne.com>
LICENSE AND COPYRIGHT
Copyright (C) 2025-2026 Nigel Horne.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.