Name

Data::Edit::Xml::Lint - lint xml files in parallel using xmllint, report the failure rate and reprocess linted files to fix cross references.

Synopsis

Linting and reporting

Create some sample xml files, some with errors, lint them in parallel and retrieve the number of errors and failing files:

 for my $n(1..$N)                                                              # Some projects
  {my $x = Data::Edit::Xml::Lint::new();                                       # New xml file linter

   my $catalog = $x->catalog = catalogName;                                    # Use catalog if possible
   my $project = $x->project = projectName($n);                                # Project name
   my $file    = $x->file    =    fileName($n);                                # Target file

   $x->source = <<END;                                                         # Sample source
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//HPE//DTD HPE DITA Concept//EN" "concept.dtd" []>
<concept id="$project">
<title>Project $project</title>
<conbody>
  <p>Body of $project</p>
</conbody>
</concept>
END

   $x->source =~ s/id="\w+?"//gs if addError($n);                              # Introduce an error into some projects

   $x->lint(foo=>1);                                                           # Write the source to the target file, lint it with xmllint and record the supplied attributes as comments at the end of the target file
  }

 Data::Edit::Xml::Lint::waitAllProcesses;                                      # Wait for lints to complete

 say STDERR Data::Edit::Xml::Lint::report($outDir, "xml")->print;              # Report total pass fail rate

Produces:

50 % success converting 3 projects containing 10 xml files on 2017-07-13 at 17:43:24

ProjectStatistics
   #  Percent   Pass  Fail  Total  Project
   1  33.3333      1     2      3  aaa
   2  50.0000      2     2      4  bbb
   3  66.6667      2     1      3  ccc

FailingFiles
   #  Errors  Project       File
   1       1  ccc           out/ccc5.xml
   2       1  aaa           out/aaa9.xml
   3       1  bbb           out/bbb1.xml
   4       1  bbb           out/bbb7.xml
   5       1  aaa           out/aaa3.xml

Rereading

Once a file has been linted, it can be reread with read to obtain details about the xml including any id attributes defined (see: idDefs below) and any labels that refer to these id attributes (see: labelDefs below). Such labels provide additional identities for a node beyond that provided by the id attribute.

{catalog    => "/home/phil/hp/dtd/Dtd_2016_07_12/catalog-hpe.xml",
 definition => "bbb",
 docType    => "<!DOCTYPE concept PUBLIC \"-//HPE//DTD HPE DITA Concept//EN\" \"concept.dtd\" []>",
 errors     => 1,
 file       => "out/bbb1.xml",
 foo        => 1,
 header     => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
 idDefs     => { bbb => 1, c1 => 1 },
 labelDefs  => {
                 bbb => "bbb",
                 c1 => "c1",
                 conbody1 => "c1",
                 conbody2 => "c1",
                 concept1 => "bbb",
                 concept2 => "bbb",
               },
 labels     => "bbb concept1 concept2",
 project    => "bbb",
 sha256     => "b00cdebf2e1837fa15140d25315e5558ed59eb735b5fad4bade23969babf9531",
 source     => "..."
}

ReLinting

In order to fix references between files, a list of files can be relinted which performs the following actions:

  1. Reads the specified files via read.

  2. Constructs an id map to locate ids from the labels defined in the specified files.

  3. Reparses each of the specified files to build a parse tree representing the xml in each file.

  4. Calls a user supplied sub, passing it the parse tree for each specified file and the id map. The sub should traverse the parse tree, fixing attributes which make references between the files using the supplied id map.

  5. Writes any modified parse trees back to the originating file, thus saving the changes.
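
The id map built in step 2 and consulted in step 4 can be pictured with plain Perl hashes. The sketch below is not the module's implementation: the helper names build_link_map and resolve_unique and the sample file out/bbb2.xml are hypothetical, and the map shape {project}{label or id} = [[file, id], ...] is an assumption loosely based on the labelDefs data shown under Rereading.

```perl
use strict;
use warnings;

# Hypothetical per-file data in the shape obtained by rereading a linted file:
# each entry has a project, a file and labelDefs mapping {label or id} => id.
my @linted =
 ({project   => "bbb", file => "out/bbb1.xml",
   labelDefs => {bbb => "bbb", c1 => "c1", concept1 => "bbb", conbody1 => "c1"}},
  {project   => "bbb", file => "out/bbb2.xml",
   labelDefs => {b2 => "b2", concept3 => "b2"}},
 );

# Build {project}{label or id} = [[file, id], ...] over all the files (assumed shape)
sub build_link_map
 {my (@files) = @_;
  my %map;
  for my $f (@files)
   {for my $label (sort keys %{$f->{labelDefs}})
     {push @{$map{$f->{project}}{$label}}, [$f->{file}, $f->{labelDefs}{$label}];
     }
   }
  return \%map;
 }

# Return (file, id) if the label has exactly one definition in the project, else ()
sub resolve_unique
 {my ($map, $project, $label) = @_;
  my $targets = $map->{$project}{$label} // [];
  return @$targets == 1 ? @{$targets->[0]} : ();
 }

my $map = build_link_map(@linted);
my ($file, $id) = resolve_unique($map, "bbb", "concept3");
print "concept3 -> ${file}#${id}\n";                                            # concept3 -> out/bbb2.xml#b2
```

A user supplied sub could use a lookup of this kind to rewrite each cross reference attribute to the file and id that the label resolves to, leaving unresolved or ambiguous labels untouched.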

Description

Version 20181215.

The following sections describe the methods in each functional area of this module. For an alphabetic listing of all methods by name see Index.

Constructor

Construct a new linter

new()

Create a new xml linter - call this method statically as Data::Edit::Xml::Lint::new() and then fill in the relevant Attributes.

This is a static method and so should be invoked as:

Data::Edit::Xml::Lint::new

Attributes

Attributes describing a lint.

author :lvalue

Optional author of the xml - only needed if you want to generate an SDL file map.

catalog :lvalue

Optional catalog file containing the locations of the DTDs used to validate the xml or use dtds to supply a DTD instead.

compressedErrors :lvalue

Number of compressed errors discovered.

compressedErrorText :lvalue

Text of compressed errors.

ditaType :lvalue

Optional Dita topic type (concept|task|troubleshooting|reference) of the xml - only needed if you want to generate an SDL file map.

docType :lvalue

The second line: the document type extracted from the source.

dtds :lvalue

Optional directory containing the DTDs used to validate the xml.

errors :lvalue

Total number of uncompressed lint errors detected by xmllint over all files.

errorText :lvalue

Text of uncompressed lint errors detected by xmllint over all files.

file :lvalue

File that the xml should be written to or read from by lint, read or relint.

fileNumber :lvalue

File number - assigned by the caller to help debugging transformations.

guid :lvalue

Guid or id of the outermost tag - if not supplied the first definition encountered in each file will be used on the basis that all Dita topics require an id.

header :lvalue

The first line: the xml header extracted from source.

idDefs :lvalue

{id} = count - the number of times this id is defined in the xml contained in this file.

labelDefs :lvalue

{label or id} = id - the id of the node containing a label defined on the xml.

labels :lvalue

Optional parse tree to supply labels for the current source as the labels are present in the parse tree not in the string representing the parse tree.

linted :lvalue

Date the lint was performed by lint. We avoid adding a time as well because this then induces much longer sync times with AWS S3.

preferredSource :lvalue

Preferred representation of the xml source, used by relint to supply a preferred representation for the source.

processes :lvalue

Maximum number of xmllint processes to run in parallel - 8 by default if linting in parallel is being used. Linting in parallel is pointless if each file is already being converted in parallel. Conversely, linting in parallel is helpful if the xml files are being converted serially.

project :lvalue

Optional project name to allow error counts to be aggregated by project and to allow id and labels to be scoped to the files contained in each project.

reusedInProject :lvalue

List of projects in which this file is reused, which can be set via reuseFileInProject every time you discover another project in which a file is reused.

sha256 :lvalue

Sha256 hash of the string containing the xml processed by lint or read.

source :lvalue

The source Xml to be written to file and linted.

title :lvalue

Optional title of the xml - only needed if you want to generate an SDL file map.

Lint

Lint xml files in parallel

lint($@)

Store some xml in a file, apply xmllint in parallel and update the source file with the results

   Parameter    Description
1  $lint        Linter
2  %attributes  Attributes to be recorded as xml comments

lintNOP($@)

Store some xml in a file, apply xmllint in series and update the source file with the results

   Parameter    Description
1  $lint        Linter
2  %attributes  Attributes to be recorded as xml comments

nolint($@)

Store just the attributes in a file so that they can be retrieved later to process non xml objects referenced in the xml - like images

   Parameter    Description
1  $lint        Linter
2  %attributes  Attributes to be recorded as xml comments

read($)

Reread a linted xml file and extract the attributes associated with the lint

   Parameter  Description
1  $file      File containing xml

This is a static method and so should be invoked as:

Data::Edit::Xml::Lint::read

waitAllProcesses()

Wait for all lints to finish

This is a static method and so should be invoked as:

Data::Edit::Xml::Lint::waitAllProcesses

clear(@)

Clear the results of a prior run

   Parameter              Description
1  @foldersAndExtensions  Directories to clear and extensions of files to remove

This is a static method and so should be invoked as:

Data::Edit::Xml::Lint::clear

relint($$$@)

Locate all the labels or ids in the specified files, analyze the map of labels and ids with analysisSub, parse each file, process each parse with processSub, then lint the reprocessed xml back to the original file. This allows you to reprocess the contents of each file with knowledge of where labels or ids are located in the other files associated with a project. The analysisSub(linkmap = {project}{label or id} = [file, id]) should return true if the processing of each file is to be performed subsequently. The processSub(parse tree representation of a file, id and label mapping, reloaded linter) should return true if a lint is required to save the results after each file has been processed, else false. Optionally, the analysisSub may set the preferredSource attribute to indicate the preferred representation of the xml.

   Parameter              Description
1  $processes             Maximum number of processes to use
2  $analysisSub           Analysis sub
3  $processSub            Process sub
4  @foldersAndExtensions  Folders and extensions of files to process (recursively)

This is a static method and so should be invoked as:

Data::Edit::Xml::Lint::relint

resolveUniqueLink($$)

Return the unique (file, leading id) of the specified link in the link map or () if no such definition exists

   Parameter  Description
1  $linkMap   Link map
2  $link      Label

This is a static method and so should be invoked as:

Data::Edit::Xml::Lint::resolveUniqueLink

urlEncode($)

Return a url encoded string

   Parameter  Description
1  $s         String

This is a static method and so should be invoked as:

Data::Edit::Xml::Lint::urlEncode
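
As a rough illustration of what url encoding a string involves, the sketch below percent-encodes every byte outside the RFC 3986 unreserved set. The name url_encode_sketch is hypothetical and the exact set of characters that the module's urlEncode escapes is not stated here, so treat this as an assumption rather than a description of its behavior.

```perl
use strict;
use warnings;

# Percent-encode every byte that is not an RFC 3986 unreserved character
sub url_encode_sketch
 {my ($s) = @_;
  utf8::encode($s);                                                             # Work on UTF-8 bytes so non-ASCII characters encode correctly
  $s =~ s/([^A-Za-z0-9\-._~])/sprintf("%%%02X", ord($1))/ge;
  return $s;
 }

print url_encode_sketch("out/bbb 1.xml"), "\n";                                 # out%2Fbbb%201.xml
```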

resolveDitaLink($$$$)

Return the unique (file, leading id, topic) of the specified link in the link map or () if no such definition exists

   Parameter    Description
1  $linkMap     Link map
2  $fileToGuid  File map
3  $link        Label
4  $sourceFile  File we are resolving from

This is a static method and so should be invoked as:

Data::Edit::Xml::Lint::resolveDitaLink

reuseFileInProject($$)

Record the reuse of the specified file in the specified project

   Parameter  Description
1  $file      Name of file that is being reused
2  $project   Name of project in which it is reused

This is a static method and so should be invoked as:

Data::Edit::Xml::Lint::reuseFileInProject

resolveFileToGuid($$)

Return the unique guid of the specified file in the file to guids map or undef if no such definition exists

   Parameter     Description
1  $fileToGuids  File to guids map
2  $file         File

This is a static method and so should be invoked as:

Data::Edit::Xml::Lint::resolveFileToGuid

multipleLabelDefs($)

Return ([project; source label or id; targets count]*) of all labels or id that have multiple definitions

   Parameter   Description
1  $labelDefs  Label definitions

This is a static method and so should be invoked as:

Data::Edit::Xml::Lint::multipleLabelDefs

multipleLabelDefsReport($)

Return a report showing labels and id with multiple definitions in each project ordered by most defined

   Parameter   Description
1  $labelDefs  Label and Id definitions

This is a static method and so should be invoked as:

Data::Edit::Xml::Lint::multipleLabelDefsReport
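
The following sketch shows one way to find labels or ids with more than one definition and order them by most defined. It assumes a label definitions structure of the form {project}{label or id} = [[file, id], ...]; that shape, the sample data and the name multiple_defs_sketch are illustrative assumptions, not the module's internals.

```perl
use strict;
use warnings;

# Hypothetical label definitions: {project}{label or id} = [[file, id], ...]
my %labelDefs =
 (aaa => {a1 => [["out/aaa1.xml", "a1"]],
          t1 => [["out/aaa1.xml", "a1"], ["out/aaa3.xml", "a3"]]},
  bbb => {c1 => [["out/bbb1.xml", "c1"], ["out/bbb7.xml", "c1"], ["out/bbb2.xml", "c1"]]},
 );

# Return ([project, label or id, targets count]*) for entries defined more than once
sub multiple_defs_sketch
 {my ($defs) = @_;
  my @multiple;
  for my $project (sort keys %$defs)
   {for my $label (sort keys %{$defs->{$project}})
     {my $count = @{$defs->{$project}{$label}};
      push @multiple, [$project, $label, $count] if $count > 1;
     }
   }
  my @sorted = sort {$b->[2] <=> $a->[2]} @multiple;                            # Most defined first
  return @sorted;
 }

printf "%-8s %-8s %d\n", @$_ for multiple_defs_sketch(\%labelDefs);
```

Entries with exactly one definition, such as a1 above, are the ones a single definition listing would return instead.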

singleLabelDefs($)

Return ([project; label or id]*) of all labels or ids that have a single definition

   Parameter   Description
1  $labelDefs  Label and Id definitions

This is a static method and so should be invoked as:

Data::Edit::Xml::Lint::singleLabelDefs

singleLabelDefsReport($)

Return a report showing labels or ids with just one definition ordered by project, label name

   Parameter   Description
1  $labelDefs  Label and Id definitions

This is a static method and so should be invoked as:

Data::Edit::Xml::Lint::singleLabelDefsReport

Report

Methods for reporting the results of linting several files

report($$)

Analyse the results of prior lints and return a hash reporting various statistics and a printable report

   Parameter         Description
1  $outputDirectory  Directory to search
2  $filter           Optional regular expression to filter files

This is a static method and so should be invoked as:

Data::Edit::Xml::Lint::report

Attributes

compressedErrors :lvalue

Compressed errors over all files

docTypes :lvalue

{docType}++ - Hash of document types encountered

failingFiles :lvalue

Array of [number of errors, project, file] ordered from least to most errors

failingProjects :lvalue

[Projects with xmllint errors]

filter :lvalue

File selection filter

numberOfFiles :lvalue

Number of files encountered

numberOfProjects :lvalue

Number of projects defined - each project can contain zero or more files

passingProjects :lvalue

[Projects with no xmllint errors]

passRatePercent :lvalue

Total number of passes as a percentage of all input files

print :lvalue

A printable report of the above

timestamp :lvalue

Timestamp of report

totalCompressedErrorsFileByFile :lvalue

Total number of errors summed file by file

totalCompressedErrors :lvalue

Number of compressed errors

totalErrors :lvalue

Total number of errors

Private Methods

lintOP($$@)

Store some xml in a file, apply xmllint in parallel or series and update the source file with the lint results in text format so as to be easy to search with grep.

   Parameter    Description
1  $inParallel  In parallel or not
2  $lint        Linter
3  %attributes  Attributes to be recorded as xml comments

compressErrors(@)

Compress the errors so we count the ones that do not look similar. Errors typically occupy three lines with the last line containing ^ at the end to mark the location of the error.

   Parameter  Description
1  @errors    Errors

This is a static method and so should be invoked as:

Data::Edit::Xml::Lint::compressErrors
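
One possible reading of compressing errors is sketched below: the error lines are grouped into errors that each end with a ^ location marker, the details that vary between otherwise similar errors are blanked out, and the distinct messages that remain are counted. The normalisation rules, the sample error text and the name compress_errors_sketch are assumptions made for illustration; the module may judge similarity differently.

```perl
use strict;
use warnings;

# Count distinct error messages after removing file names and numbers
sub compress_errors_sketch
 {my (@lines) = @_;
  my (%seen, @current);
  for my $line (@lines)
   {push @current, $line;
    next unless $line =~ /\^\s*\z/;                                             # The ^ location marker ends one error
    my $message = $current[0];                                                  # Assumption: the first line carries the message
    $message =~ s/^[^:]*:\d+:\s*//;                                             # Drop the leading file:line:
    $message =~ s/\d+/N/g;                                                      # Treat all numbers alike
    $seen{$message}++;
    @current = ();
   }
  return \%seen;
 }

# Illustrative error text, not captured from a real run
my @errors =
 ("out/aaa3.xml:3: element concept: validity error : Element concept does not carry attribute id",
  "<concept >",
  "         ^",
  "out/bbb1.xml:3: element concept: validity error : Element concept does not carry attribute id",
  "<concept >",
  "         ^",
 );

my $compressed = compress_errors_sketch(@errors);
print scalar(keys %$compressed), " distinct error(s) from 2 reported\n";        # 1 distinct error(s) from 2 reported
```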

formatAttributes(%)

Format the attributes section of the output file

   Parameter    Description
1  $attributes  Hash of attributes

waitProcessing($)

Wait for a processor to become available

   Parameter   Description
1  $processes  Maximum number of processes
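
Limiting the number of parallel processes is commonly done by waiting for a child to finish before forking another. The sketch below shows that general technique using only Perl's built-in fork and wait; the names wait_for_slot_sketch and run_limited_sketch are hypothetical and the code is not taken from the module.

```perl
use strict;
use warnings;

$| = 1;                                                                         # Flush output before forking
my $running = 0;                                                                # Number of child processes currently running

# Block until fewer than $max children are running
sub wait_for_slot_sketch
 {my ($max) = @_;
  while ($running >= $max)
   {my $pid = wait;                                                             # Reap one finished child
    if ($pid == -1) {$running = 0; last}                                        # No children left
    $running--;
   }
 }

# Run each job in its own child process, never more than $max at once
sub run_limited_sketch
 {my ($max, @jobs) = @_;
  my $started = 0;
  for my $job (@jobs)
   {wait_for_slot_sketch($max);
    my $pid = fork;
    die "Cannot fork: $!" unless defined $pid;
    if ($pid == 0) {$job->(); exit 0}                                           # Child: do the work then exit
    $running++; $started++;
   }
  wait_for_slot_sketch(1);                                                      # Parent: wait for the remaining children
  return $started;
 }

my $started = run_limited_sketch(2, (sub {1}) x 5);
print "Started $started jobs\n";                                                # Started 5 jobs
```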

reuseInProject($)

Record the reuse of an item in the named project

   Parameter  Description
1  $project   Name of the project in which it is reused

This is a static method and so should be invoked as:

Data::Edit::Xml::Lint::reuseInProject

countLinkTargets($$)

Count the number of targets this link resolves to.

   Parameter  Description
1  $linkMap   Link map
2  $link      Label

This is a static method and so should be invoked as:

Data::Edit::Xml::Lint::countLinkTargets

p4($$)

Format a fraction as a percentage to 4 decimal places

   Parameter  Description
1  $p         Pass
2  $f         Fail

This is a static method and so should be invoked as:

Data::Edit::Xml::Lint::p4
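
The Percent column in the sample report under Synopsis is consistent with computing 100 * pass / (pass + fail) to four decimal places. A minimal sketch of that arithmetic follows; the name p4_sketch and the handling of an empty total are assumptions, not the module's code.

```perl
use strict;
use warnings;

# Format passes as a percentage of passes plus fails to 4 decimal places
sub p4_sketch
 {my ($pass, $fail) = @_;
  my $total = $pass + $fail;
  return "0.0000" unless $total;                                                # Assumption: report zero when there is nothing to count
  return sprintf "%.4f", 100 * $pass / $total;
 }

print p4_sketch(1, 2), "\n";                                                    # 33.3333
print p4_sketch(2, 2), "\n";                                                    # 50.0000
print p4_sketch(2, 1), "\n";                                                    # 66.6667
```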

Index

1 author - Optional author of the xml - only needed if you want to generate an SDL file map.

2 catalog - Optional catalog file containing the locations of the DTDs used to validate the xml or use dtds to supply a DTD instead.

3 clear - Clear the results of a prior run

4 compressedErrors - Compressed errors over all files

5 compressedErrorText - Text of compressed errors.

6 compressErrors - Compress the errors so we count the ones that do not look similar.

7 countLinkTargets - Count the number of targets this link resolves to.

8 ditaType - Optional Dita topic type (concept|task|troubleshooting|reference) of the xml - only needed if you want to generate an SDL file map.

9 docType - The second line: the document type extracted from the source.

10 docTypes - {docType}++ - Hash of document types encountered

11 dtds - Optional directory containing the DTDs used to validate the xml.

12 errors - Total number of uncompressed lint errors detected by xmllint over all files.

13 errorText - Text of uncompressed lint errors detected by xmllint over all files.

14 failingFiles - Array of [number of errors, project, file] ordered from least to most errors

15 failingProjects - [Projects with xmllint errors]

16 file - File that the xml should be written to or read from by lint, read or relint.

17 fileNumber - File number - assigned by the caller to help debugging transformations.

18 filter - File selection filter

19 formatAttributes - Format the attributes section of the output file

20 guid - Guid or id of the outermost tag - if not supplied the first definition encountered in each file will be used on the basis that all Dita topics require an id.

21 header - The first line: the xml header extracted from source.

22 idDefs - {id} = count - the number of times this id is defined in the xml contained in this file.

23 labelDefs - {label or id} = id - the id of the node containing a label defined on the xml.

24 labels - Optional parse tree to supply labels for the current source as the labels are present in the parse tree not in the string representing the parse tree.

25 lint - Store some xml in a file, apply xmllint in parallel and update the source file with the results

26 linted - Date the lint was performed by lint.

27 lintNOP - Store some xml in a file, apply xmllint in series and update the source file with the results

28 lintOP - Store some xml in a file, apply xmllint in parallel or series and update the source file with the lint results in text format so as to be easy to search with grep.

29 multipleLabelDefs - Return ([project; source label or id; targets count]*) of all labels or id that have multiple definitions

30 multipleLabelDefsReport - Return a report showing labels and id with multiple definitions in each project ordered by most defined

31 new - Create a new xml linter - call this method statically as Data::Edit::Xml::Lint::new() and then fill in the relevant Attributes.

32 nolint - Store just the attributes in a file so that they can be retrieved later to process non xml objects referenced in the xml - like images

33 numberOfFiles - Number of files encountered

34 numberOfProjects - Number of projects defined - each project can contain zero or more files

35 p4 - Format a fraction as a percentage to 4 decimal places

36 passingProjects - [Projects with no xmllint errors]

37 passRatePercent - Total number of passes as a percentage of all input files

38 preferredSource - Preferred representation of the xml source, used by relint to supply a preferred representation for the source.

39 print - A printable report of the above

40 processes - Maximum number of xmllint processes to run in parallel - 8 by default if linting in parallel is being used.

41 project - Optional project name to allow error counts to be aggregated by project and to allow id and labels to be scoped to the files contained in each project.

42 read - Reread a linted xml file and extract the attributes associated with the lint

43 relint - Locate all the labels or ids in the specified files, analyze the map of labels and ids with analysisSub, parse each file, process each parse with processSub, then lint the reprocessed xml back to the original file. This allows you to reprocess the contents of each file with knowledge of where labels or ids are located in the other files associated with a project.

44 report - Analyse the results of prior lints and return a hash reporting various statistics and a printable report

45 resolveDitaLink - Return the unique (file, leading id, topic) of the specified link in the link map or () if no such definition exists

46 resolveFileToGuid - Return the unique guid of the specified file in the file to guids map or undef if no such definition exists

47 resolveUniqueLink - Return the unique (file, leading id) of the specified link in the link map or () if no such definition exists

48 reusedInProject - List of projects in which this file is reused, which can be set via reuseFileInProject every time you discover another project in which a file is reused.

49 reuseFileInProject - Record the reuse of the specified file in the specified project

50 reuseInProject - Record the reuse of an item in the named project

51 sha256 - Sha256 hash of the string containing the xml processed by lint or read.

52 singleLabelDefs - Return ([project; label or id]*) of all labels or ids that have a single definition

53 singleLabelDefsReport - Return a report showing labels or ids with just one definition ordered by project, label name

54 source - The source Xml to be written to file and linted.

55 timestamp - Timestamp of report

56 title - Optional title of the xml - only needed if you want to generate an SDL file map.

57 totalCompressedErrors - Number of compressed errors

58 totalCompressedErrorsFileByFile - Total number of errors summed file by file

59 totalErrors - Total number of errors

60 urlEncode - Return a url encoded string

61 waitAllProcesses - Wait for all lints to finish

62 waitProcessing - Wait for a processor to become available

Installation

This module is written in 100% Pure Perl and, thus, it is easy to read, comprehend, use, modify and install via cpan:

sudo cpan Data::Edit::Xml::Lint

Author

philiprbrenan@gmail.com

http://www.appaapps.com

Copyright

Copyright (c) 2016-2018 Philip R Brenan.

This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.