Name
Data::Edit::Xml::Reuse - Reuse Xml via Dita conrefs.
Synopsis
Data::Edit::Xml::Reuse scans an entire document corpus looking for opportunities to reuse identical Xml via the Dita conref facility. Duplicated content is moved to a separate Xml file called the dictionary, and each duplicate in the corpus is replaced with a reference to the single copy in the dictionary. Where possible, larger blocks of identical content are favored over smaller blocks.
Data::Edit::Xml::Reuse provides parameters that control the minimum length of duplicated content moved to the dictionary and the minimum number of references to it.
The following example checks a corpus of Dita XML documents held in the folder named by inputFolder. A copy of the corpus, with conrefs replacing identical content under table and p tags, is placed in the folder named by outputFolder, as long as such content is at least 32 characters long and has a minimum of 4 references to it:
use Data::Edit::Xml::Reuse;
my $x = Data::Edit::Xml::Reuse::reuse(inputFolder => q(in),
outputFolder => q(out),
reportsFolder => q(reports),
minimumLength => 32,
minimumReferences => 4,
tags => {map {$_=>1} qw(table p)},
);
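The selection rule described above, which keeps only content that is long enough and referenced often enough, can be pictured with the following minimal sketch. It is not the module's implementation: the sub reusableCandidates and the in-memory list of [tag, content] pairs are hypothetical, and the sketch assumes that identical content is recognized by the MD5 sum of its text.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Hypothetical helper, not part of Data::Edit::Xml::Reuse.
# Given [tag, content] pairs, return {tag}{md5} = [content, count] for content
# that meets the minimum length and the minimum number of references.
sub reusableCandidates
 {my ($minimumLength, $minimumReferences, @blocks) = @_;
  my %seen;                                        # {tag}{md5}{content} = count
  for my $block(@blocks)
   {my ($tag, $content) = @$block;
    next if length($content) < $minimumLength;     # Too short to be worth reusing
    $seen{$tag}{md5_hex(encode_utf8($content))}{$content}++;
   }
  my %keep;
  for my $tag(keys %seen)
   {for my $md5(keys %{$seen{$tag}})
     {for my $content(keys %{$seen{$tag}{$md5}})
       {my $count = $seen{$tag}{$md5}{$content};
        $keep{$tag}{$md5} = [$content, $count] if $count >= $minimumReferences;
       }
     }
   }
  \%keep
 }

my $visit = q(For further information, please visit our web site.);
my $keep  = reusableCandidates(32, 2,
  [p => $visit], [p => $visit], [p => q(Made in Sweden)], [p => q(Made in Norway)]);

for my $md5(sort keys %{$$keep{p}})                # Print the surviving candidates
 {my ($content, $count) = @{$$keep{p}{$md5}};
  print "$count x $content\n";
 }
```

In this sketch only the repeated sentence survives: the two short paragraphs fall below the assumed minimum length of 32 characters.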
Optionally, Data::Edit::Xml::Reuse will also report similar content when given the keyword:
matchSimilarTagContent => 0.9,
Content under the specified tags that matches to the specified level of confidence, a number between 0 and 1, is written to the report:
similar/tag_blocks_by_vocabulary.txt
in the reportsFolder. This report helps identify further text to be standardized and thus reused.
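This documentation does not state how the level of confidence is computed. As a rough illustration only, the sketch below assumes a Jaccard similarity over the sets of words in two blocks of text; the sub vocabularySimilarity is hypothetical and the measure the module actually uses may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: Jaccard similarity of the vocabularies of two strings.
# An assumption made for illustration, not necessarily the module's measure.
sub vocabularySimilarity
 {my ($s, $t) = @_;
  my %s = map {(lc($_) => 1)} $s =~ m/(\w+)/g;     # Words in the first string
  my %t = map {(lc($_) => 1)} $t =~ m/(\w+)/g;     # Words in the second string
  my $common = grep {$t{$_}} keys %s;              # Size of the intersection
  my %union  = (%s, %t);
  my $union  = keys %union;                        # Size of the union
  $union ? $common / $union : 0
 }

printf "%.2f\n", vocabularySimilarity
 (q(aaa bbb ccc ddd eee), q(aaa bbb ccc ddd fff));
```

Under this assumed measure the two strings share 4 of 6 distinct words, giving about 0.67, so they would be reported as similar at a confidence level of 0.5 but not at 0.9.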
Description
Reuse Xml via Dita conrefs.
Version 20191212.
The following sections describe the methods in each functional area of this module. For an alphabetic listing of all methods by name see Index.
Reuse Xml
Reuse Xml via Dita conrefs.
reuse(%)
Check Xml for reuse opportunities.
Parameter Description
1 %attributes Reuse attributes
Example:
if (1) {
owf($in[0], <<END); # Base file
<concept id="c">
<title>Ordering information</title>
<conbody>
<p>For further information, please visit our web site or contact your local sales company.</p>
<p>Made in Sweden</p>
<table>
<tbody>
<row>
<entry><p>Ice Associates</p></entry>
<entry><p>North Pole 1</p></entry>
</row>
</tbody>
</table>
<p>aaa bbb ccc ddd eee</p>
</conbody>
</concept>
END
owf($in[1], <<END); # Similar file
<concept id="c">
<title>Ordering information</title>
<conbody>
<p>For further information, please visit our web site or contact your local sales company.</p>
<p>Copyright © 2018 - 2019. All rights reserved.</p>
<p>Made in Norway</p>
<table>
<tbody>
<row>
<entry><p>Ice Associates</p></entry>
<entry><p>North Pole 1</p></entry>
</row>
</tbody>
</table>
<p>aaa bbb ccc ddd fff</p>
</conbody>
</concept>
END
my $dictionary = fpf($outputFolder, qw(dictionary xml)); # Dictionary file name
my $r = Data::Edit::Xml::Reuse::reuse # Reuse request
(dictionary => $dictionary,
inputFolder => $inputFolder,
matchSimilarTagContent => 0.5,
outputFolder => $outputFolder,
reportsFolder => $reportsFolder,
tags => {map {$_=>1} qw(p table)},
);
ok readFile($dictionary) eq <<END; # Resulting dictionary
<concept id="dictionary">
<title>Dictionary</title>
<conbody>
<p id="GUID-63271233-3e86-9ac1-5fe8-086bc8b37b51">For further information, please visit our web site or contact your local sales company.</p>
<table id="GUID-e7a012a5-8ff5-6f12-5b96-0137c5c0a0b4">
<tbody>
<row>
<entry>
<p>Ice Associates</p>
</entry>
<entry>
<p>North Pole 1</p>
</entry>
</row>
</tbody>
</table>
</conbody>
</concept>
END
if (my $h = # Similar XML report
readFile(fpe($reportsFolder, qw(similar tag_blocks_by_vocabulary txt))))
{ok index($h, <<END) > 0;
Similar Tag_Content Md5Sum
1 2 aaa bbb ccc ddd eee GUID-3c8810e0-d8aa-0484-84b8-a57230b756de
2 aaa bbb ccc ddd fff GUID-7472a890-4587-8393-9c34-0aa3859d2e21
END
}
if (my $h = removeFilePathsFromStructure readFiles($outputFolder))
{is_deeply $h,
{"1.xml" => "<concept id=\"c\"title>
<conbody>
<p conref=\"GUID-63271233-3e86-9ac1-5fe8-086bc8b37b51\"p>
<table conref=\"GUID-e7a012a5-8ff5-6f12-5b96-0137c5c0a0b4\"tbody> -->
<p id=\"GUID-3c8810e0-d8aa-0484-84b8-a57230b756de\" xtrf=\"GUID-7472a890-4587-8393-9c34-0aa3859d2e21\"concept>
",
"2.xml" => "<concept id=\"c\"title>
<conbody>
<p conref=\"GUID-63271233-3e86-9ac1-5fe8-086bc8b37b51\"p>
<table conref=\"GUID-e7a012a5-8ff5-6f12-5b96-0137c5c0a0b4\"tbody> -->
<p id=\"GUID-7472a890-4587-8393-9c34-0aa3859d2e21\" xtrf=\"GUID-3c8810e0-d8aa-0484-84b8-a57230b756de\"concept>
",
"xml" => "<concept id=\"dictionary\"title>
<conbody>
<p id=\"GUID-63271233-3e86-9ac1-5fe8-086bc8b37b51\"p>
<table id=\"GUID-e7a012a5-8ff5-6f12-5b96-0137c5c0a0b4\"concept>
",
};
}
}
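The ids in the example above have the shape of an MD5 sum written in GUID form, and the similar content report lists them under the heading Md5Sum. The sketch below shows one way such an id could be derived from a block of text. It is an assumption for illustration: the sub guidFromContent is hypothetical, and the exact text the module hashes is not stated here, so the sketch is not expected to reproduce the ids shown above.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Hypothetical helper: format the MD5 sum of some content as GUID-8-4-4-4-12.
sub guidFromContent
 {my ($content) = @_;
  my $md5 = md5_hex(encode_utf8($content));        # 32 hexadecimal characters
  join '-', 'GUID', unpack 'A8 A4 A4 A4 A12', $md5;
 }

my $guid = guidFromContent(q(<p>Made in Sweden</p>));
print qq(<p conref="$guid"/>\n);                   # Id usable as a conref target
```

Because the id depends only on the content, identical blocks in different files receive the same id, which is what allows them to share one dictionary entry.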
Data::Edit::Xml::Reuse Definition
Attributes used by the reuser.
Input fields
dictionary - The file into which to store the common Xml
getFileUrl - A url used in html reports to retrieve a specified file from the server running the reuser. The complete url is obtained by appending the fully qualified file name to this value.
htmlFolder - Folder into which to write reports as html
inputFolder - A folder containing the dita and ditamap files to be analyzed for reuse.
matchSimilarTagContent - Confidence level between 0 and 1: match content under tags with this level of confidence
maximumColumnWidth - Truncate columns in text reports to this length or allow any width if undef
minimumLength - The minimum length content must have to be considered for matching
minimumReferences - The minimum number of references required for reuse
outputFolder - A folder to write the Xml with reuse applied.
reportsFolder - Reports folder
Output fields
fileExtensions - Default file extensions to load
inputFiles - Input files from inputFolder.
matchBlocks - [[md5, content]*] blocks of content that match with the confidence level expressed by matchSimilarTagContent
matchInBlock - {md5 => matchBlocks} : index into matchBlocks by md5 sum
reusableContent - {tag}{md5sum}{content}++ potentially reusable content
significantCount - Minimum number of significant duplications
tags - {tag=>reusable} only reuse tags that appear as keys in this hash with true values
timeEnded - Time the run ended
timeStart - Time the run started
Private Methods
newReuse(%)
Create a new reuser
Parameter Description
1 %attributes Attributes
formatTables($$%)
Using the options of reuser $reuse and an array of arrays $data, format a report as a table using %options as described in Data::Table::Text::formatTable and Data::Table::Text::formatHtmlTable.
Parameter Description
1 $reuse Reuser
2 $data Table to be formatted
3 %options Options
loadInputFiles($)
Load the names of the files to be processed
Parameter Description
1 $reuse Reuser
ffc($$)
First few characters of a string with white space normalized
Parameter Description
1 $reuse Reuser
2 $string String
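A minimal sketch of what such a helper might look like follows. The sub firstFewCharacters, its explicit $length parameter and the default of 32 are all assumptions made for illustration; the number of characters kept by ffc is not stated in this documentation.

```perl
use strict;
use warnings;

# Hypothetical stand-in for ffc: trim, collapse white space, then truncate.
sub firstFewCharacters
 {my ($string, $length) = @_;
  $length //= 32;                                  # Assumed default length
  (my $s = $string) =~ s/\A\s+|\s+\z//g;           # Remove outer white space
  $s =~ s/\s+/ /g;                                 # Collapse inner white space
  substr($s, 0, $length)
 }

print firstFewCharacters("  aaa \n\t bbb   ccc ", 7), "\n";
```

Normalizing white space before truncating keeps the displayed prefix stable regardless of how the source Xml was indented.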
reuseParams($)
Tabulate reuse parameters
Parameter Description
1 $reuse Reuser
analyzeOneFile($$)
Analyze one input file
Parameter Description
1 $reuse Reuser
2 $file File to analyze
analyzeInputFiles($)
Analyze the input files
Parameter Description
1 $reuse Reuser
conRefOneFile($$)
Conref one file
Parameter Description
1 $reuse Reuser
2 $file File to analyze
conRef($)
Replace common text with conrefs
Parameter Description
1 $reuse Reuser
reportSimilarContent($)
Report content likely to be similar on the basis of their vocabulary
Parameter Description
1 $reuse Reuser
Index
1 analyzeInputFiles - Analyze the input files
2 analyzeOneFile - Analyze one input file
3 conRef - Replace common text with conrefs
4 conRefOneFile - Conref one file
5 ffc - First few characters of a string with white space normalized
6 formatTables - Using the options of reuser $reuse and an array of arrays $data, format a report as a table using %options as described in Data::Table::Text::formatTable and Data::Table::Text::formatHtmlTable.
7 loadInputFiles - Load the names of the files to be processed
8 newReuse - Create a new reuser
9 reportSimilarContent - Report content likely to be similar on the basis of their vocabulary
10 reuse - Check Xml for reuse opportunities.
11 reuseParams - Tabulate reuse parameters
Installation
This module is written in 100% Pure Perl and, thus, it is easy to read, comprehend, use, modify and install via cpan:
sudo cpan install Data::Edit::Xml::Reuse
Author
Copyright
Copyright (c) 2016-2019 Philip R Brenan.
This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.