Name

Data::Edit::Xml::Reuse - Reuse Xml via Dita conrefs.

Synopsis

Data::Edit::Xml::Reuse scans an entire document corpus looking for opportunities to reuse identical Xml via the Dita conref facility. Duplicated identical content is moved to a separate Xml file called the dictionary. Each duplicate in the corpus is then replaced with a reference to the single copy held in the dictionary. Where possible, larger blocks of identical content are favored over smaller blocks.

Data::Edit::Xml::Reuse provides parameters that control the minimum sizes and references to duplicated content moved to the dictionary.

The following example checks a corpus of Dita XML documents held in the folder named by inputFolder. A copy of the corpus, with conrefs replacing identical content under table and p tags, is placed in outputFolder, provided that such content is at least 32 characters long and is referenced at least 4 times:

use Data::Edit::Xml::Reuse;

my $x = Data::Edit::Xml::Reuse::reuse
 (inputFolder       => q(in),
  outputFolder      => q(out),
  reportsFolder     => q(reports),
  minimumLength     => 32,
  minimumReferences => 4,
  tags              => {map {$_=>1} qw(table p)},
 );
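
The effect of minimumLength and minimumReferences can be pictured with the following minimal sketch. It is not the module's implementation: the helper candidatesForReuse is hypothetical, content is held in plain strings rather than parsed Xml, and Digest::MD5 is assumed as the way of recognizing identical content because the module's reports refer to md5 sums.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: return the strings that are long enough and occur often
# enough to be worth moving to a dictionary.
sub candidatesForReuse
 {my ($minimumLength, $minimumReferences, @content) = @_;
  my (%count, %text);
  for my $c(@content)
   {next if length($c) < $minimumLength;                                        # Too short to be worth a conref
    my $m = md5_hex($c);                                                        # Identical content gives an identical digest
    $count{$m}++;                                                               # Number of references to this content
    $text{$m} = $c;
   }
  my @found = map {$text{$_}} grep {$count{$_} >= $minimumReferences} keys %count;
  return sort @found;
 }

my @reusable = candidatesForReuse(32, 2,                                        # Minimum length 32, minimum references 2
  q(For further information, please visit our web site.),
  q(For further information, please visit our web site.),
  q(Made in Sweden),
  q(Made in Sweden),
  q(Copyright 2019. All rights reserved. Used once only.),
 );

print "$_\n" for @reusable;
```

In this sketch only the first string qualifies: the second is repeated but too short, and the third is long enough but occurs only once.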

Optionally, Data::Edit::Xml::Reuse will also report similar content when the keyword:

matchSimilarTagContent => 0.9,

is supplied. Content under the specified tags that matches to the specified level of confidence, a number between 0 and 1, is written to the report:

similar/tag_blocks_by_vocabulary.txt

in the reportsFolder. This report helps identify further text that could be standardized and thus reused.
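
One way to picture matching by vocabulary is sketched below. The measure used here, the number of shared words divided by the number of distinct words in either string, is an assumption chosen for illustration; this document does not state how the module actually computes its confidence level. The function name vocabularySimilarity is likewise hypothetical.

```perl
use strict;
use warnings;

# Hypothetical measure: words in common divided by words in either string.
sub vocabularySimilarity
 {my ($s, $t) = @_;
  my (%s, %t);
  $s{lc($_)} = 1 for grep {length} split /\s+/, $s;                             # Vocabulary of the first string
  $t{lc($_)} = 1 for grep {length} split /\s+/, $t;                             # Vocabulary of the second string
  my $shared = grep {$t{$_}} keys %s;                                           # Words present in both strings
  my %all    = (%s, %t);                                                        # Words present in either string
  my $total  = keys %all;
  return $total ? $shared / $total : 0;
 }

my $similarity = vocabularySimilarity
 (q(aaa bbb ccc ddd eee),
  q(aaa bbb ccc ddd fff));

printf "%.2f\n", $similarity;
```

Under this assumed measure the two strings share 4 of 6 distinct words, so they would be reported as similar at a threshold of 0.5 but not at 0.9.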

Description

Reuse Xml via Dita conrefs.

Version 20191212.

The following sections describe the methods in each functional area of this module. For an alphabetic listing of all methods by name see Index.

Reuse Xml

Reuse Xml via Dita conrefs.

reuse(%)

Check Xml for reuse opportunities.

   Parameter    Description
1  %attributes  Reuse attributes

Example:

if (1) {                                                                        
  owf($in[0], <<END);                                                           # Base file
  <concept id="c">
    <title>Ordering information</title>
    <conbody>
      <p>For further information, please visit our web site or contact your local sales company.</p>
      <p>Made in Sweden</p>
      <table>
        <tbody>
          <row>
            <entry><p>Ice Associates</p></entry>
            <entry><p>North Pole 1</p></entry>
          </row>
        </tbody>
      </table>
      <p>aaa bbb ccc ddd eee</p>
    </conbody>
  </concept>
END

  owf($in[1], <<END);                                                           # Similar file
  <concept id="c">
    <title>Ordering information</title>
    <conbody>
      <p>For further information, please visit our web site or contact your local sales company.</p>
      <p>Copyright © 2018 - 2019. All rights reserved.</p>
      <p>Made in Norway</p>
      <table>
        <tbody>
          <row>
            <entry><p>Ice Associates</p></entry>
            <entry><p>North Pole 1</p></entry>
          </row>
        </tbody>
      </table>
      <p>aaa bbb ccc ddd fff</p>
    </conbody>
  </concept>
END

  my $dictionary  = fpf($outputFolder, qw(dictionary xml));                     # Dictionary file name

  my $r = Data::Edit::Xml::Reuse::reuse                                         # Reuse request
   (dictionary             => $dictionary,
    inputFolder            => $inputFolder,
    matchSimilarTagContent => 0.5,
    outputFolder           => $outputFolder,
    reportsFolder          => $reportsFolder,
    tags                   => {map {$_=>1} qw(p table)},
   );

  ok readFile($dictionary) eq <<END;                                            # Resulting dictionary
<concept id="dictionary">
  <title>Dictionary</title>
  <conbody>
    <p id="GUID-63271233-3e86-9ac1-5fe8-086bc8b37b51">For further information, please visit our web site or contact your local sales company.</p>
    <table id="GUID-e7a012a5-8ff5-6f12-5b96-0137c5c0a0b4">
      <tbody>
        <row>
          <entry>
            <p>Ice Associates</p>
          </entry>
          <entry>
            <p>North Pole 1</p>
          </entry>
        </row>
      </tbody>
    </table>
  </conbody>
</concept>
END

  if (my $h =                                                                   # Similar XML report
    readFile(fpe($reportsFolder, qw(similar tag_blocks_by_vocabulary txt))))
   {ok index($h, <<END) > 0;
   Similar  Tag_Content          Md5Sum
1        2  aaa bbb ccc ddd eee  GUID-3c8810e0-d8aa-0484-84b8-a57230b756de
2           aaa bbb ccc ddd fff  GUID-7472a890-4587-8393-9c34-0aa3859d2e21
END
   }

  if (my $h = removeFilePathsFromStructure readFiles($outputFolder))
   {is_deeply $h,
     {"1.xml" => "<concept id=\"c\"title>
<conbody>
  <p conref=\"GUID-63271233-3e86-9ac1-5fe8-086bc8b37b51\"p>
  <table conref=\"GUID-e7a012a5-8ff5-6f12-5b96-0137c5c0a0b4\"tbody> -->
  <p id=\"GUID-3c8810e0-d8aa-0484-84b8-a57230b756de\" xtrf=\"GUID-7472a890-4587-8393-9c34-0aa3859d2e21\"concept>
",
      "2.xml" => "<concept id=\"c\"title>
<conbody>
  <p conref=\"GUID-63271233-3e86-9ac1-5fe8-086bc8b37b51\"p>
  <table conref=\"GUID-e7a012a5-8ff5-6f12-5b96-0137c5c0a0b4\"tbody> -->
  <p id=\"GUID-7472a890-4587-8393-9c34-0aa3859d2e21\" xtrf=\"GUID-3c8810e0-d8aa-0484-84b8-a57230b756de\"concept>
",
      "xml"   => "<concept id=\"dictionary\"title>
<conbody>
  <p id=\"GUID-63271233-3e86-9ac1-5fe8-086bc8b37b51\"p>
  <table id=\"GUID-e7a012a5-8ff5-6f12-5b96-0137c5c0a0b4\"concept>
",
     };
   }
 }

Data::Edit::Xml::Reuse Definition

Attributes used by the reuser.

Input fields

dictionary - The file into which to store the common Xml

getFileUrl - A url to retrieve a specified file from the server running xref used in generating html reports. The complete url is obtained by appending the fully qualified file name to this value.

htmlFolder - Folder into which to write reports as html

inputFolder - A folder containing the dita and ditamap files to be analyzed for reuse.

matchSimilarTagContent - Confidence level between 0 and 1: match content under tags with this level of confidence

maximumColumnWidth - Truncate columns in text reports to this length or allow any width if undef

minimumLength - The minimum length that content must have to be considered for matching

minimumReferences - The minimum number of references required for reuse

outputFolder - A folder to write the Xml with reuse applied.

reportsFolder - Folder into which to write reports
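
Two of the fields above describe simple string operations, sketched here under stated assumptions. The server address and file name are invented for illustration, and truncateColumn is a hypothetical helper, not a function supplied by this module.

```perl
use strict;
use warnings;

my $getFileUrl = q(http://localhost/cgi-bin/getFile.pl?file=);                  # Hypothetical url prefix
my $file       = q(/home/user/out/dictionary.xml);                              # Hypothetical fully qualified file name
my $url        = $getFileUrl.$file;                                             # Complete url: the prefix with the file name appended

sub truncateColumn                                                              # Hypothetical helper
 {my ($text, $maximumColumnWidth) = @_;
  return $text unless defined $maximumColumnWidth;                              # undef allows any width
  return substr($text, 0, $maximumColumnWidth);                                 # Otherwise keep only the leading characters
 }

print "$url\n";
print truncateColumn(q(aaa bbb ccc ddd eee), 7), "\n";
```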

Output fields

fileExtensions - Default file extensions to load

inputFiles - Input files from inputFolder.

matchBlocks - [[md5, content]*] blocks of content that match with the confidence level expressed by matchSimilarTagContent

matchInBlock - {md5 => matchBlocks} : index into matchBlocks by md5 sum

reusableContent - {tag}{md5sum}{content}++ potentially reusable content

significantCount - Minimum number of significant duplications

tags - {tag=>reusable} only reuse tags that appear as keys in this hash with true values

timeEnded - Time the run ended

timeStart - Time the run started
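
The shape notation used for reusableContent, matchBlocks and matchInBlock can be read as nested Perl data structures. The sketch below shows one reading of that notation built by hand from sample strings; it is an interpretation for illustration, not output produced by the module.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my %reusableContent;                                                            # {tag}{md5sum}{content}++
for my $item (['p', 'Made in Sweden'], ['p', 'Made in Sweden'], ['p', 'Made in Norway'])
 {my ($tag, $content) = @$item;
  $reusableContent{$tag}{md5_hex($content)}{$content}++;                        # Count each occurrence of the content
 }

my @matchBlocks =                                                               # One block holding two similar items as [md5, content]
 ([map {[md5_hex($_), $_]} 'aaa bbb ccc ddd eee', 'aaa bbb ccc ddd fff']);

my %matchInBlock;                                                               # md5 => the block that contains it
for my $block (@matchBlocks)
 {$matchInBlock{$_->[0]} = $block for @$block;
 }

for my $md5 (sort keys %{$reusableContent{p}})                                  # Show the reference count for each distinct paragraph
 {my ($content) = keys %{$reusableContent{p}{$md5}};
  print "$reusableContent{p}{$md5}{$content} $content\n";
 }
```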

Private Methods

newReuse(%)

Create a new reuser

   Parameter    Description
1  %attributes  Attributes

formatTables($$%)

Using reuser $reuse options and an array of arrays $data, format a report as a table using %options as described in Data::Table::Text::formatTable and Data::Table::Text::formatHtmlTable.

   Parameter  Description
1  $reuse     Reuser
2  $data      Table to be formatted
3  %options   Options

loadInputFiles($)

Load the names of the files to be processed

   Parameter  Description
1  $reuse     Reuser

ffc($$)

First few characters of a string with white space normalized

   Parameter  Description
1  $reuse     Reuser
2  $string    String
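
A stand-alone sketch of the idea behind ffc follows. The name firstFewCharacters, its signature and the default length of 80 are all assumptions made for illustration; the private method itself takes the reuser as its first parameter and its actual limit is not documented here.

```perl
use strict;
use warnings;

sub firstFewCharacters                                                          # Hypothetical stand-in for ffc
 {my ($string, $length) = @_;
  $length //= 80;                                                               # Assumed default length
  (my $s = $string) =~ s/\s+/ /g;                                               # Each run of white space becomes one space
  $s =~ s/\A\s+|\s+\z//g;                                                       # Remove leading and trailing white space
  return substr($s, 0, $length);                                                # Keep only the first few characters
 }

print firstFewCharacters("  For further\n   information, please visit  ", 11), "\n";
```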

reuseParams($)

Tabulate reuse parameters

   Parameter  Description
1  $reuse     Reuser

analyzeOneFile($$)

Analyze one input file

   Parameter  Description
1  $reuse     Reuser
2  $file      File to analyze

analyzeInputFiles($)

Analyze the input files

   Parameter  Description
1  $reuse     Reuser

conRefOneFile($$)

Conref one file

   Parameter  Description
1  $reuse     Reuser
2  $file      File to conref

conRef($)

Replace common text with conrefs

   Parameter  Description
1  $reuse     Reuser

reportSimilarContent($)

Report content likely to be similar on the basis of their vocabulary

   Parameter  Description
1  $reuse     Reuser

Index

1 analyzeInputFiles - Analyze the input files

2 analyzeOneFile - Analyze one input file

3 conRef - Replace common text with conrefs

4 conRefOneFile - Conref one file

5 ffc - First few characters of a string with white space normalized

6 formatTables - Using reuser $reuse options and an array of arrays $data, format a report as a table using %options as described in Data::Table::Text::formatTable and Data::Table::Text::formatHtmlTable.

7 loadInputFiles - Load the names of the files to be processed

8 newReuse - Create a new reuser

9 reportSimilarContent - Report content likely to be similar on the basis of their vocabulary

10 reuse - Check Xml for reuse opportunities.

11 reuseParams - Tabulate reuse parameters

Installation

This module is written in 100% Pure Perl and is therefore easy to read, comprehend, use, modify and install via cpan:

sudo cpan install Data::Edit::Xml::Reuse

Author

philiprbrenan@gmail.com

http://www.appaapps.com

Copyright

Copyright (c) 2016-2019 Philip R Brenan.

This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.