Name
Data::Edit::Xml::Reuse - Reuse Xml via the Dita conref facility.
Synopsis
Reusing Identical Content
Data::Edit::Xml::Reuse scans an entire document corpus looking for opportunities to reuse identical Xml via the Dita conref facility. Duplicated identical content is moved to a separate Xml file called the dictionary. Duplicated content in the corpus is replaced with references to the singular content in the dictionary. Larger blocks of identical content are favored over smaller blocks of content where possible.
Data::Edit::Xml::Reuse provides parameters that qualify the minimum size of a block of content and the minimum number of references to a block of content to be moved to the dictionary.
The following example checks a corpus of Dita Xml documents held in the folder named by inputFolder. A copy of the corpus, with a conref replacing each block of identical content under the table and p tags, is placed in the outputFolder, provided that such content is at least 32 characters long and is referenced at least 4 times:
use Data::Edit::Xml::Reuse;
my $x = Data::Edit::Xml::Reuse::reuse
(inputFolder => q(in),
outputFolder => q(out),
reportsFolder => q(reports),
minimumLength => 32,
minimumReferences => 4,
tags => {map {$_=>1} qw(table p)},
);
The actual number of times each block of content was reused can be found in report:
lists/reused_content_by_tag.txt
in the reportsFolder.
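The effect of minimumLength and minimumReferences can be pictured with a minimal sketch. It is not the module's implementation: the helper name reusableBlocks is hypothetical, and it assumes that length is measured on the serialized block and that identical blocks are recognized by their md5 sum.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Hypothetical helper, not part of Data::Edit::Xml::Reuse.
# Returns, in sorted order, each block that is at least $minimumLength
# characters long and occurs at least $minimumReferences times.
sub reusableBlocks
 {my ($minimumLength, $minimumReferences, @blocks) = @_;
  my %count;                                   # md5 sum => number of occurrences
  my %content;                                 # md5 sum => block text
  for my $block (@blocks)
   {next if length($block) < $minimumLength;   # Too short to be worth a conref
    my $md5 = md5_hex(encode_utf8($block));    # Identical blocks share an md5 sum
    $count{$md5}++;
    $content{$md5} = $block;
   }
  my @found  = map {$content{$_}} grep {$count{$_} >= $minimumReferences} keys %count;
  my @sorted = sort @found;
  return @sorted;
 }

my @blocks =
 ((q(<p>Made in Sweden</p>)) x 4,              # Repeated often enough, but too short
  (q(<p>For further information, please visit our web site.</p>)) x 4,
  q(<p>A long paragraph that occurs only once in the corpus.</p>),
 );

print "$_\n" for reusableBlocks(32, 4, @blocks);
```

With these sample blocks only the "For further information" paragraph meets both thresholds, so it is the only candidate for the dictionary.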
Matching Similar Content
Optionally, Data::Edit::Xml::Reuse will also report similar content using the:
matchSimilarTagContent => 0.9,
keyword. Content under the specified tags that matches to the specified level of confidence between 0 and 1 is assigned a guid id attribute and written to report:
similar/tag_blocks_by_vocabulary.txt
in the reportsFolder.
The tags containing similar content will have this guid listed on their xtrf attribute making it easy to locate related content using grep.
The report, combined with the id and xtrf attributes, helps identify similar text, in situ, perhaps to be standardized further and eventually reused.
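To give a feel for what matching on vocabulary at a given confidence level might mean, here is a small sketch. The measure it uses, the number of shared words divided by the number of distinct words in either text, is an assumption made for illustration; the documentation does not state which measure the module applies. The names wordsOf and vocabularySimilarity are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: the set of lower-cased words in a text.
sub wordsOf
 {my ($text) = @_;
  my %words;
  $words{lc $_}++ for $text =~ m/(\w+)/g;
  return %words;
 }

# Assumed measure: shared words divided by all distinct words,
# giving a value between 0 and 1.
sub vocabularySimilarity
 {my ($first, $second) = @_;
  my %f      = wordsOf($first);
  my %s      = wordsOf($second);
  my %all    = (%f, %s);
  my $shared = grep {$s{$_}} keys %f;          # Count of words present in both
  my $total  = keys %all;                      # Count of distinct words overall
  return $total ? $shared / $total : 0;
 }

my $similarity = vocabularySimilarity
 (q(aaa bbb ccc ddd eee),
  q(aaa bbb ccc ddd fff));

printf "%.2f\n", $similarity;
```

Under this assumed measure the two sample paragraphs share 4 of 6 distinct words, so they would match at a level of 0.5 but not at 0.9.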
Description
Reuse Xml via the Dita conref facility.
Version 20191221.
The following sections describe the methods in each functional area of this module. For an alphabetic listing of all methods by name see Index.
Reuse Xml
Reuse Xml via Dita conrefs.
reuse(%)
Check Xml for reuse opportunities.
Parameter Description
1 %attributes Reuse attributes
Example:
if (1) {
owf($in[0], <<END); # Base file
<concept id="c">
<title>Ordering information</title>
<conbody>
<p>For further information, please visit our web site or contact your local sales company.</p>
<p>Made in Sweden</p>
<table>
<tbody>
<row>
<entry><p>Ice Associates</p></entry>
<entry><p>North Pole 1</p></entry>
</row>
</tbody>
</table>
<p>aaa bbb ccc ddd eee</p>
</conbody>
</concept>
END
owf($in[1], <<END); # Similar file
<concept id="c">
<title>Ordering information</title>
<conbody>
<p>For further information, please visit our web site or contact your local sales company.</p>
<p>Copyright © 2018 - 2019. All rights reserved.</p>
<p>Made in Norway</p>
<table>
<tbody>
<row>
<entry><p>Ice Associates</p></entry>
<entry><p>North Pole 1</p></entry>
</row>
</tbody>
</table>
<p>aaa bbb ccc ddd fff</p>
</conbody>
</concept>
END
my $dictionary = fpf($outputFolder, qw(dictionary xml)); # Dictionary file name
my $r = Data::Edit::Xml::Reuse::reuse
(dictionary => $dictionary, # Reuse request
inputFolder => $inputFolder,
matchSimilarTagContent => 0.5,
outputFolder => $outputFolder,
reportsFolder => $reportsFolder,
tags => {map {$_=>1} qw(p table)},
);
ok readFile($dictionary) eq <<END; # Resulting dictionary
<concept id="dictionary">
<title>Dictionary</title>
<conbody>
<p id="GUID-63271233-3e86-9ac1-5fe8-086bc8b37b51">For further information, please visit our web site or contact your local sales company.</p>
<table id="GUID-e7a012a5-8ff5-6f12-5b96-0137c5c0a0b4">
<tbody>
<row>
<entry>
<p>Ice Associates</p>
</entry>
<entry>
<p>North Pole 1</p>
</entry>
</row>
</tbody>
</table>
</conbody>
</concept>
END
if (my $h = # Similar XML report
readFile(fpe($reportsFolder, qw(similar tag_blocks_by_vocabulary txt))))
{ok index($h, <<END) > 0;
Similar Tag_Content Md5Sum
1 2 aaa bbb ccc ddd eee GUID-3c8810e0-d8aa-0484-84b8-a57230b756de
2 aaa bbb ccc ddd fff GUID-7472a890-4587-8393-9c34-0aa3859d2e21
END
}
ok readFile(fpe($testFolder, qw(out 1 xml))) eq <<END; # Deduplicated XML file - Sweden
<concept id="c">
<title>Ordering information</title>
<conbody>
<p conref="dictionary/xml#dictionary/GUID-63271233-3e86-9ac1-5fe8-086bc8b37b51"/>
<!-- For further information, please visit our web site or contact your local sales company. -->
<p>Made in Sweden</p>
<table conref="dictionary/xml#dictionary/GUID-e7a012a5-8ff5-6f12-5b96-0137c5c0a0b4"/>
<!-- <tbody><row><entry><p>Ice Associates</p></entry><entry><p>North Pole 1</p></entry></row></tbody> -->
<p id="GUID-3c8810e0-d8aa-0484-84b8-a57230b756de" xtrf="GUID-7472a890-4587-8393-9c34-0aa3859d2e21">aaa bbb ccc ddd eee</p>
</conbody>
</concept>
END
ok readFile(fpe($testFolder, qw(out 2 xml))) eq <<END; # Deduplicated XML file - Norway
<concept id="c">
<title>Ordering information</title>
<conbody>
<p conref="dictionary/xml#dictionary/GUID-63271233-3e86-9ac1-5fe8-086bc8b37b51"/>
<!-- For further information, please visit our web site or contact your local sales company. -->
<p>Copyright © 2018 - 2019. All rights reserved.</p>
<p>Made in Norway</p>
<table conref="dictionary/xml#dictionary/GUID-e7a012a5-8ff5-6f12-5b96-0137c5c0a0b4"/>
<!-- <tbody><row><entry><p>Ice Associates</p></entry><entry><p>North Pole 1</p></entry></row></tbody> -->
<p id="GUID-7472a890-4587-8393-9c34-0aa3859d2e21" xtrf="GUID-3c8810e0-d8aa-0484-84b8-a57230b756de">aaa bbb ccc ddd fff</p>
</conbody>
</concept>
END
}
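The conref values in the deduplicated files above follow the usual Dita pattern of file, then #, then topic id, then /, then element id. The sketch below rebuilds one such reference and the trailing comment that preserves the original content. The helper names conrefValue and conrefElement are hypothetical and are not functions provided by this module.

```perl
use strict;
use warnings;

# Hypothetical helper: build a Dita conref value of the form file#topicId/elementId.
sub conrefValue
 {my ($dictionaryFile, $topicId, $elementId) = @_;
  return qq($dictionaryFile#$topicId/$elementId);
 }

# Hypothetical helper: an empty element carrying the conref, followed by a
# comment holding the replaced content. Assumes $content contains no "--",
# which is not allowed inside an Xml comment.
sub conrefElement
 {my ($tag, $conref, $content) = @_;
  return qq(<$tag conref="$conref"/>\n<!-- $content -->);
 }

my $conref = conrefValue
 (q(dictionary/xml), q(dictionary), q(GUID-63271233-3e86-9ac1-5fe8-086bc8b37b51));

print conrefElement(q(p), $conref, q(For further information, please visit our web site or contact your local sales company.)), "\n";
```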
Data::Edit::Xml::Reuse Definition
Attributes used by the reuser.
Input fields
dictionary - The dictionary file into which to store the duplicate Xml.
fileExtensions - The extensions of the Xml files to examine in the inputFolder.
getFileUrl - An optional url to retrieve a specified file from the server running xref used in generating html reports. The complete url is obtained by appending the fully qualified file name to this value.
htmlFolder - Folder into which to write reports as html.
inputFolder - A folder containing the Xml files with extensions named in fileExtensions to be analyzed for reuse.
matchSimilarTagContent - Confidence level between 0 and 1: match content under tags with this level of confidence.
maximumColumnWidth - Truncate columns in text reports to this length or allow any width if undef.
minimumLength - The minimum length content must have to be considered for matching.
minimumReferences - The minimum number of references content must have before it can be reused.
outputFolder - A folder into which to write the deduplicated Xml.
reportsFolder - A folder into which reports will be written.
tags - {tag=>1} only consider tags that appear as keys in this hash with true values.
Output fields
inputFiles - The files selected from inputFolder for analysis because their extensions matched fileExtensions.
matchBlocks - [[md5, content]*] blocks of content that match with the confidence level expressed by matchSimilarTagContent.
matchInBlock - {md5 => matchBlocks} : index into matchBlocks by md5 sum.
reusableContent - {tag}{md5sum}{content}++ potentially reusable content.
timeEnded - Time the run ended.
timeStart - Time the run started.
Private Methods
newReuse(%)
Create a new reuser.
Parameter Description
1 %attributes Attributes
formatTables($$%)
Format reports.
Parameter Description
1 $reuse Reuser
2 $data Table to be formatted
3 %options Options
loadInputFiles($)
Load the names of the files to be processed.
Parameter Description
1 $reuse Reuser
ffc($$)
First few characters of a string with white space normalized.
Parameter Description
1 $reuse Reuser
2 $string String
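A minimal sketch of what such a helper might look like follows. The name firstFewCharacters, the explicit $length parameter and its default of 80 are assumptions made for illustration; the documentation does not say how many characters ffc keeps.

```perl
use strict;
use warnings;

# Hypothetical stand-in for ffc: collapse runs of white space to single
# spaces, trim both ends, then keep the first $length characters.
sub firstFewCharacters
 {my ($string, $length) = @_;
  $length //= 80;                              # Assumed default, not taken from the module
  (my $normalized = $string) =~ s/\s+/ /g;     # Collapse white space
  $normalized =~ s/\A\s+|\s+\z//g;             # Trim leading and trailing space
  return substr($normalized, 0, $length);
 }

print firstFewCharacters("  aaa \n  bbb\tccc  ", 7), "\n";
```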
reuseParams($)
Tabulate reuse parameters.
Parameter Description
1 $reuse Reuser
analyzeOneFile($$)
Analyze one input file.
Parameter Description
1 $reuse Reuser
2 $file File to analyze
analyzeInputFiles($)
Analyze the input files.
Parameter Description
1 $reuse Reuser
conRefOneFile($$)
Conref one file.
Parameter Description
1 $reuse Reuser
2 $file File to analyze
conRef($)
Replace common text with conrefs.
Parameter Description
1 $reuse Reuser
reportSimilarContent($)
Report content likely to be similar on the basis of their vocabulary.
Parameter Description
1 $reuse Reuser
Index
1 analyzeInputFiles - Analyze the input files.
2 analyzeOneFile - Analyze one input file.
3 conRef - Replace common text with conrefs.
4 conRefOneFile - Conref one file.
5 ffc - First few characters of a string with white space normalized.
6 formatTables - Format reports.
7 loadInputFiles - Load the names of the files to be processed.
8 newReuse - Create a new reuser.
9 reportSimilarContent - Report content likely to be similar on the basis of their vocabulary.
10 reuse - Check Xml for reuse opportunities.
11 reuseParams - Tabulate reuse parameters.
Installation
This module is written in 100% Pure Perl and, thus, it is easy to read, comprehend, use, modify and install via cpan:
sudo cpan install Data::Edit::Xml::Reuse
Author
Copyright
Copyright (c) 2016-2019 Philip R Brenan.
This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.