Name

Data::Edit::Xml::Lint - lint xml files in parallel using xmllint and report the failure rate

Synopsis

Linting and reporting

Create some sample xml files, some with errors, lint them in parallel and retrieve the number of errors and failing files:

 for my $n(1..$N)                                                              # Some projects
  {my $x = Data::Edit::Xml::Lint::new();                                       # New xml file linter

   my $catalog = $x->catalog = catalogName;                                    # Use catalog if possible
   my $project = $x->project = projectName($n);                                # Project name
   my $file    = $x->file    =    fileName($n);                                # Target file

   $x->source = <<END;                                                         # Sample source
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//HPE//DTD HPE DITA Concept//EN" "concept.dtd" []>
<concept id="$project">
<title>Project $project</title>
<conbody>
  <p>Body of $project</p>
</conbody>
</concept>
END

   $x->source =~ s/id="\w+?"//gs if addError($n);                              # Introduce an error into some projects

   $x->lint(foo=>1);                                                           # Write the source to the target file, lint it with xmllint and record the attributes as comments at the end of the target file
  }

 Data::Edit::Xml::Lint::wait;                                                  # Wait for lints to complete

 say STDERR Data::Edit::Xml::Lint::report($outDir, "xml")->print;              # Report total pass fail rate

Produces:

50 % success converting 3 projects containing 10 xml files on 2017-07-13 at 17:43:24

ProjectStatistics
   #  Percent   Pass  Fail  Total  Project
   1  33.3333      1     2      3  aaa
   2  50.0000      2     2      4  bbb
   3  66.6667      2     1      3  ccc

FailingFiles
   #  Errors  Project       File
   1       1  ccc           out/ccc5.xml
   2       1  aaa           out/aaa9.xml
   3       1  bbb           out/bbb1.xml
   4       1  bbb           out/bbb7.xml
   5       1  aaa           out/aaa3.xml
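
The percentages in the report follow directly from the pass and fail counts. A minimal sketch in plain Perl, without calling the module, that recomputes them from the figures above; the variable names are illustrative only and are not part of the module:

```perl
use strict;
use warnings;

# Pass and fail counts per project, copied from the sample report above
my %projects = (aaa => [1, 2], bbb => [2, 2], ccc => [2, 1]);

my ($totalPass, $totalFiles) = (0, 0);
my %percent;                                       # Percent pass by project
for my $project (sort keys %projects)
 {my ($pass, $fail) = @{$projects{$project}};
  $percent{$project} = sprintf("%.4f", 100 * $pass / ($pass + $fail));
  $totalPass  += $pass;
  $totalFiles += $pass + $fail;
 }

my $passRatePercent = 100 * $totalPass / $totalFiles;   # Overall pass rate

printf "%s %8s\n", $_, $percent{$_} for sort keys %percent;
printf "%d %% success over %d files\n", $passRatePercent, $totalFiles;
```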

Rereading

Once a file has been linted, it can be reread with read to obtain details about the xml, including the ids defined (see: idDefs below) and any labels that refer to these ids (see: labelDefs below). Such labels provide additional names for a node which cannot be stored in the xml itself.

{catalog    => "/home/phil/hp/dtd/Dtd_2016_07_12/catalog-hpe.xml",
 definition => "bbb",
 docType    => "<!DOCTYPE concept PUBLIC \"-//HPE//DTD HPE DITA Concept//EN\" \"concept.dtd\" []>",
 errors     => 1,
 file       => "out/bbb1.xml",
 foo        => 1,
 header     => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
 idDefs     => { bbb => 1, c1 => 1 },
 labelDefs  => {
                 bbb => "bbb",
                 c1 => "c1",
                 conbody1 => "c1",
                 conbody2 => "c1",
                 concept1 => "bbb",
                 concept2 => "bbb",
               },
 labels     => "bbb concept1 concept2",
 project    => "bbb",
 sha256     => "b00cdebf2e1837fa15140d25315e5558ed59eb735b5fad4bade23969babf9531",
 source     => "..."
}
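
The structure can be queried with ordinary hash operations. The sketch below works on a cut-down literal copy of the structure shown above rather than on the result of a real call to read, so it runs without the module; the variable names are illustrative assumptions:

```perl
use strict;
use warnings;

# A cut-down literal copy of the structure shown above (not produced by calling read)
my $lint =
 {file      => "out/bbb1.xml",
  idDefs    => {bbb => 1, c1 => 1},
  labelDefs => {bbb => "bbb", c1 => "c1", conbody1 => "c1", conbody2 => "c1",
                concept1 => "bbb", concept2 => "bbb"},
 };

# Ids defined more than once in this file
my @duplicateIds = sort grep {$lint->{idDefs}{$_} > 1} keys %{$lint->{idDefs}};

# Invert labelDefs to list every name (label or id) by which each node can be reached
my %namesById;
push @{$namesById{$lint->{labelDefs}{$_}}}, $_ for sort keys %{$lint->{labelDefs}};

print "$_: @{$namesById{$_}}\n" for sort keys %namesById;
print "duplicate ids: ", scalar(@duplicateIds), "\n";
```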

ReLinting

In order to fix references between files, a list of files can be relinted:

  1. the specified files are read

  2. a map is constructed to locate all the ids and labels defined in the specified files

  3. each file is reparsed

  4. the resulting parse tree and id map are handed to a caller provided 𝘀𝘂𝗯 that can then traverse the parse tree, fixing attributes which make references between the files

  5. the modified parse trees are written back to the originating files, thus saving the changes
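
Steps 2 and 4 can be pictured with a small stand-alone sketch. It assumes the link map has the shape {project}{label or id} = [file, id] given in the description of relint below, and it uses hypothetical hand-written records instead of files read by the module. A label defined in more than one file would simply be overwritten here, which a real implementation would need to handle:

```perl
use strict;
use warnings;

# Hypothetical records in the shape returned by read (only the fields used here)
my @lints =
 ({project => "bbb", file => "out/bbb1.xml", labelDefs => {bbb => "bbb", concept1 => "bbb"}},
  {project => "bbb", file => "out/bbb7.xml", labelDefs => {c7  => "c7",  intro    => "c7"}},
 );

# Step 2: map each label or id in each project to the file and id that define it
my %linkMap;
for my $lint (@lints)
 {for my $label (sort keys %{$lint->{labelDefs}})
   {$linkMap{$lint->{project}}{$label} = [$lint->{file}, $lint->{labelDefs}{$label}];
   }
 }

# Step 4: a caller provided sub can consult the map to locate the target of a reference
my ($file, $id) = @{$linkMap{bbb}{intro}};
print "intro is defined in $file with id $id\n";
```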

Description

Constructor

Construct a new linter

new

Create a new xml linter - call this method statically as in Data::Edit::Xml::Lint::new()

Attributes

Attributes describing a lint

file :lvalue

File that the xml will be written to by lint or read from by read

catalog :lvalue

Optional catalog file containing the locations of the DTDs used to validate the xml

docType :lvalue

The second line: the document type extracted from the source

dtds :lvalue

Optional directory containing the DTDs used to validate the xml

errors :lvalue

Number of lint errors detected by xmllint

header :lvalue

The first line: the xml header extracted from source

labels :lvalue

Optional parse tree to supply labels for the current source - the labels are held in the parse tree, not in the string representing the parse tree

linted :lvalue

Date the lint was performed by lint

idDefs :lvalue

{id} = count - the number of times this id is defined in the xml contained in this file

labelDefs :lvalue

{label or id} = id - the id of the node containing a label defined on the xml

project :lvalue

Optional project name to allow error counts to be aggregated by project and to allow id and labels to be scoped to the files contained in each project

processes :lvalue

Maximum number of xmllint processes to run in parallel - 8 by default

sha256 :lvalue

Sha256 hash of the string containing the xml processed by lint or read
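
A recorded digest makes it possible to tell whether the xml has changed since it was linted. The sketch below assumes a hex digest of the kind produced by sha256_hex from the core Digest::SHA module, which matches the 64 hex character value shown in the read example above; whether the module computes its hash in exactly this way is an assumption:

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

my $source = "abc";                                # Stand-in for the xml source
my $stored = sha256_hex($source);                  # Digest as it might be recorded at lint time

my $edited  = $source . "\n";                      # The same text after a small edit
my $changed = sha256_hex($edited) ne $stored;      # True if the text no longer matches

print "$stored\n";
print $changed ? "changed\n" : "unchanged\n";
```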

source :lvalue

The source Xml to be linted

Lint

Lint xml files in parallel

lint

Store some xml in a file and apply xmllint in parallel

   Parameter    Description
1  $lint        Linter
2  %attributes  Attributes to be recorded as xml comments

read

Reread a linted xml file and extract the attributes associated with the lint

   Parameter  Description
1  $file      File containing xml

wait()

Wait for all lints to finish - this is a static method, call as Data::Edit::Xml::Lint::wait
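
The interplay between a limit on parallel processes and a final wait can be sketched with fork and wait from core Perl. This illustrates the general technique only and is not the module's actual implementation; each child here just exits with a made-up status in place of running xmllint, and all names are hypothetical:

```perl
use strict;
use warnings;

my $processes = 2;                                 # Upper limit on concurrent children
my $running   = 0;                                 # Children currently running
my %exitCode;                                      # Exit code by process id

sub reap                                           # Wait for one child and record its exit code
 {my $pid = wait;
  if ($pid == -1) {$running = 0; return}           # No children left
  $exitCode{$pid} = $? >> 8;
  --$running;
 }

for my $job (1..5)
 {reap() while $running >= $processes;             # Block until a slot is free
  my $pid = fork;
  die "Cannot fork: $!" unless defined $pid;
  exit($job % 2) unless $pid;                      # Child: stand-in for linting one file
  ++$running;
 }

reap() while $running > 0;                         # Final wait for all outstanding children

my $failed = grep {$_ != 0} values %exitCode;
print scalar(keys %exitCode), " jobs finished, $failed failed\n";
```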

searchDirectoryTreeForMatchingFiles

Search a directory tree for files that match the specified extensions

   Parameter    Description
1  $folder      Directory to start search in
2  @extensions  Extensions of files to find
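
A search of this kind can be written with the core File::Find module. The sketch below is not the module's own code: findFilesWithExtensions is a hypothetical helper, and it assumes extensions are passed without a leading dot, as in the call to report in the synopsis:

```perl
use strict;
use warnings;
use File::Find qw(find);
use File::Path qw(make_path);
use File::Temp qw(tempdir);

sub findFilesWithExtensions                        # Hypothetical helper, not the module's sub
 {my ($folder, @extensions) = @_;
  my %wanted = map {lc($_) => 1} @extensions;
  my @files;
  find({no_chdir => 1, wanted => sub               # With no_chdir, $_ holds the full path
   {return unless -f $_;
    my ($extension) = m/\.([^.\/]+)\z/;
    push @files, $_ if defined $extension and $wanted{lc $extension};
   }}, $folder);
  return sort @files;
 }

my $dir = tempdir(CLEANUP => 1);                   # Build a small throwaway tree to search
make_path("$dir/a/b");
for my $name ("a/1.xml", "a/b/2.xml", "a/b/3.txt")
 {open(my $out, '>', "$dir/$name") or die "Cannot write $dir/$name: $!";
  close($out);
 }

my @found = findFilesWithExtensions($dir, "xml");
print "$_\n" for @found;
```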

clear

Clear the results of a prior run

   Parameter         Description
1  $outputDirectory  Directory to clear
2  @fileExtensions   Extensions of files to remove

relint

Locate all the labels or ids in the specified files, analyze the map of labels and ids with analysisSub, parse each file, process each parse with processSub, then use lint to write the reprocessed xml back to the original file. This allows you to reprocess the contents of each file with knowledge of where labels or ids are located in the other files associated with a project. The analysisSub(linkmap = {project}{label or id}=[file, id]) should return true if each file is then to be processed. The processSub(parse tree representation of a file, id and label mapping, reloaded linter) should return true if a lint is required to save the results after the file has been processed, else false.

   Parameter     Description
1  $analysisSub  Analysis 𝘀𝘂𝗯
2  $processSub   Process 𝘀𝘂𝗯
3  $folder       Folder containing files to process (recursively)
4  @extensions   Extensions of files to process

resolveUniqueLink

Return the unique definition of the specified link in the link map or undef if no such definition exists

   Parameter  Description
1  $linkMap   Link map
2  $link      Label

multipleLabelDefs

Return ([project; source label or id; targets count]*) of all labels or ids that have multiple definitions

   Parameter   Description
1  $labelDefs  Label and Id definitions

multipleLabelDefsReport

Return a report showing labels and ids with multiple definitions in each project ordered by most defined

   Parameter   Description
1  $labelDefs  Label and Id definitions

singleLabelDefs

Return ([project; label or id]*) of all labels or ids that have a single definition

   Parameter   Description
1  $labelDefs  Label and Id definitions

singleLabelDefsReport

Return a report showing labels or ids with just one definition, ordered by project and label name

   Parameter   Description
1  $labelDefs  Label and Id definitions
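
The split between multiply and singly defined labels can be illustrated as follows. The layout of $labelDefs is not spelled out in this document, so the sketch assumes, purely for illustration, the shape {project}{label or id} = [[file, id], ...]; the real structure may differ. The tuples produced follow the formats given for multipleLabelDefs and singleLabelDefs above:

```perl
use strict;
use warnings;

# Assumed shape for this sketch only: {project}{label or id} = [[file, id], ...]
my $labelDefs =
 {aaa => {intro => [["out/aaa3.xml", "c1"], ["out/aaa9.xml", "c4"]],
          setup => [["out/aaa3.xml", "c2"]]},
  bbb => {intro => [["out/bbb1.xml", "c1"]]},
 };

my (@multiple, @single);
for my $project (sort keys %$labelDefs)
 {for my $label (sort keys %{$labelDefs->{$project}})
   {my $targets = @{$labelDefs->{$project}{$label}};   # Number of definitions
    if ($targets > 1) {push @multiple, [$project, $label, $targets]}
    else              {push @single,   [$project, $label]}
   }
 }

@multiple = sort {$b->[2] <=> $a->[2]} @multiple;       # Most defined first

print "multiple: @$_\n" for @multiple;
print "single:   @$_\n" for @single;
```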

Report

Methods for reporting the results of linting several files

report

Analyse the results of prior lints and return a hash reporting various statistics and a printable report

   Parameter         Description
1  $outputDirectory  Directory containing the linted files to analyze
2  @fileExtensions   Types of files to analyze

Attributes

passRatePercent :lvalue

Total number of passes as a percentage of all input files

timestamp :lvalue

Timestamp of report

numberOfProjects :lvalue

Number of projects defined - each project can contain zero or more files

numberOfFiles :lvalue

Number of files encountered

failingFiles :lvalue

Array of [number of errors, project, file] ordered from least to most errors

projects :lvalue

Hash of "project name"=>[project name, pass, fail, total, percent pass]

print :lvalue

A printable report of the above

Index

catalog

clear

docType

dtds

errors

failingFiles

file

header

idDefs

labelDefs

labels

lint

linted

multipleLabelDefs

multipleLabelDefsReport

new

numberOfFiles

numberOfProjects

passRatePercent

print

processes

project

projects

read

relint

report

resolveUniqueLink

searchDirectoryTreeForMatchingFiles

sha256

singleLabelDefs

singleLabelDefsReport

source

timestamp

wait()

Installation

This module is written in 100% Pure Perl and is thus easy to read, use, modify and install.

Standard Module::Build process for building and installing modules:

perl Build.PL
./Build
./Build test
./Build install

Author

philiprbrenan@gmail.com

http://www.appaapps.com

Copyright

Copyright (c) 2016-2017 Philip R Brenan.

This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.