Name
Data::Edit::Xml::Lint - lint xml files in parallel using xmllint, report the failure rate and reprocess linted files to fix cross references.
Synopsis
Linting and reporting
Create some sample xml files, some with errors, lint them in parallel and retrieve the number of errors and failing files:
for my $n(1..$N) # Some projects
{my $x = Data::Edit::Xml::Lint::new(); # New xml file linter
my $catalog = $x->catalog = catalogName; # Use catalog if possible
my $project = $x->project = projectName($n); # Project name
my $file = $x->file = fileName($n); # Target file
$x->source = <<END; # Sample source
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//HPE//DTD HPE DITA Concept//EN" "concept.dtd" []>
<concept id="$project">
<title>Project $project</title>
<conbody>
<p>Body of $project</p>
</conbody>
</concept>
END
$x->source =~ s/id="\w+?"//gs if addError($n); # Introduce an error into some projects
$x->lint(foo=>1); # Write the source to the target file, lint using xmllint, include some attributes to be included as comments at the end of the target file
}
Data::Edit::Xml::Lint::waitAllProcesses; # Wait for lints to complete
say STDERR Data::Edit::Xml::Lint::report($outDir, "xml")->print; # Report total pass fail rate
Produces:
50 % success converting 3 projects containing 10 xml files on 2017-07-13 at 17:43:24
ProjectStatistics
   #  Percent  Pass  Fail  Total  Project
   1  33.3333     1     2      3  aaa
   2  50.0000     2     2      4  bbb
   3  66.6667     2     1      3  ccc
FailingFiles
   #  Errors  Project  File
   1       1  ccc      out/ccc5.xml
   2       1  aaa      out/aaa9.xml
   3       1  bbb      out/bbb1.xml
   4       1  bbb      out/bbb7.xml
   5       1  aaa      out/aaa3.xml
Rereading
Once a file has been linted, it can be reread with read to obtain details about the xml including any id attributes defined (see: idDefs below) and any labels that refer to these id attributes (see: labelDefs below). Such labels provide additional identities for a node beyond that provided by the id attribute.
{catalog => "/home/phil/hp/dtd/Dtd_2016_07_12/catalog-hpe.xml",
definition => "bbb",
docType => "<!DOCTYPE concept PUBLIC \"-//HPE//DTD HPE DITA Concept//EN\" \"concept.dtd\" []>",
errors => 1,
file => "out/bbb1.xml",
foo => 1,
header => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
idDefs => { bbb => 1, c1 => 1 },
labelDefs => {
bbb => "bbb",
c1 => "c1",
conbody1 => "c1",
conbody2 => "c1",
concept1 => "bbb",
concept2 => "bbb",
},
labels => "bbb concept1 concept2",
project => "bbb",
sha256 => "b00cdebf2e1837fa15140d25315e5558ed59eb735b5fad4bade23969babf9531",
source => "..."
}
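The attributes shown above are recorded as xml comments at the end of the linted file. As a rough illustration of how such comments could be recovered, the following sketch parses them back into a hash. The `<!--name: value -->` layout and the helper name attributesFromComments are assumptions made for this example, not the module's confirmed format; in practice read does this work for you.

```perl
use strict;
use warnings;

# Hypothetical helper: collect "<!--name: value -->" comments into a hash.
# The comment layout is an assumption made for this sketch.
sub attributesFromComments
 {my ($xml) = @_;
  my %attributes;
  while ($xml =~ m/<!--\s*(\w+):\s*(.*?)\s*-->/gs)
   {$attributes{$1} = $2;                     # Later comments overwrite earlier ones
   }
  return \%attributes;
 }

# A small linted file with invented attribute comments
my $linted = <<'END';
<?xml version="1.0" encoding="UTF-8"?>
<concept id="bbb"><title>Project bbb</title></concept>
<!--errors: 1 -->
<!--file: out/bbb1.xml -->
<!--foo: 1 -->
<!--project: bbb -->
END

my $attr = attributesFromComments($linted);
printf "%s has %d error(s) in project %s\n", @$attr{qw(file errors project)};
```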
ReLinting
In order to fix references between files, a list of files can be relinted which performs the following actions:
Constructs an id map to locate ids from the labels defined in the specified files.
Reparses each of the specified files to build a parse tree representing the xml in each file.
Calls a user supplied sub passing it the parse tree for each specified file and the id map. The sub should traverse the parse tree fixing attributes which make references between the files using the supplied id map.
Writes any modified parse trees back to the originating file, thus saving the changes.
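The steps above can be sketched in miniature as follows. This is a simplified illustration only: it works on strings with a regular expression rather than on parse trees, the file names and labels are invented, and fixReferences is a hypothetical stand-in for the user supplied sub.

```perl
use strict;
use warnings;

# Per-file label definitions, shaped like the labelDefs attribute: {label or id} = id
my %files =
 ('out/aaa1.xml' => {project => 'aaa', labelDefs => {aaa => 'aaa', concept1 => 'aaa'}},
  'out/aaa2.xml' => {project => 'aaa', labelDefs => {c1  => 'c1',  conbody1 => 'c1'}},
 );

# Build the id map: {project}{label or id} = [file, id]
my %idMap;
for my $file (sort keys %files)
 {my ($project, $labelDefs) = @{$files{$file}}{qw(project labelDefs)};
  $idMap{$project}{$_} = [$file, $labelDefs->{$_}] for keys %$labelDefs;
 }

# Hypothetical user sub: rewrite href="label" to href="file#id" when the label is known
sub fixReferences
 {my ($xml, $project, $idMap) = @_;
  $xml =~ s{href="(\w+)"}
           {my $t = $idMap->{$project}{$1};
            $t ? qq(href="$t->[0]#$t->[1]") : qq(href="$1")}ge;
  return $xml;
 }

my $fixed = fixReferences('<xref href="conbody1"/>', 'aaa', \%idMap);
print "$fixed\n";
```

Unknown labels are left untouched so that a later lint can still report them.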
Description
Version 20181101.
The following sections describe the methods in each functional area of this module. For an alphabetic listing of all methods by name see Index.
Constructor
Construct a new linter
new()
Create a new xml linter - call this method statically as in Data::Edit::Xml::Lint::new()
This is a static method and so should be invoked as:
Data::Edit::Xml::Lint::new
Attributes
Attributes describing a lint
author :lvalue
Optional author of the xml - only needed if you want to generate an SDL file map
catalog :lvalue
Optional catalog file containing the locations of the DTDs used to validate the xml
compressedErrors :lvalue
Number of compressed errors
compressedErrorText :lvalue
Text of compressed errors
ditaType :lvalue
Optional Dita topic type(concept|task|troubleshooting|reference) of the xml - only needed if you want to generate an SDL file map
docType :lvalue
The second line: the document type extracted from the source
dtds :lvalue
Optional directory containing the DTDs used to validate the xml
errors :lvalue
Number of uncompressed lint errors detected by xmllint
errorText :lvalue
Text of uncompressed lint errors detected by xmllint
file :lvalue
File that the xml will be written to and read from by lint, read or relint
fileNumber :lvalue
File number - assigned early on by the caller to help debugging transformations
guid :lvalue
Guid or id of the outermost tag - if not supplied the first definition encountered will be used on the basis that all Dita topics require an id
header :lvalue
The first line: the xml header extracted from source
idDefs :lvalue
{id} = count - the number of times this id is defined in the xml contained in this file
labelDefs :lvalue
{label or id} = id - the id of the node containing a label defined on the xml
labels :lvalue
Optional parse tree to supply labels for the current source as the labels are present in the parse tree not in the string representing the parse tree
linted :lvalue
Date the lint was performed by lint
preferredSource :lvalue
Preferred representation of the xml source, used by relint to supply a preferred representation for the source
processes :lvalue
Maximum number of xmllint processes to run in parallel - 8 by default
project :lvalue
Optional project name to allow error counts to be aggregated by project and to allow ids and labels to be scoped to the files contained in each project
reusedInProject :lvalue
List of projects in which this file is reused, which can be set via reuseFileInProject every time you discover another project in which a file is reused
sha256 :lvalue
Sha256 hash of the string containing the xml processed by lint or read
source :lvalue
The source Xml to be linted
title :lvalue
Optional title of the xml - only needed if you want to generate an SDL file map
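Each attribute above is marked :lvalue, which is why the synopsis can assign to a method call or apply a substitution to it directly. A minimal sketch of the mechanism follows, using a hypothetical SketchLinter class rather than the module's own code.

```perl
use strict;
use warnings;

package SketchLinter;                         # Hypothetical class for illustration only

sub new {bless {}, shift}

sub source  :lvalue {$_[0]{source}}           # The hash slot itself is returned, so it can be assigned to
sub project :lvalue {$_[0]{project}}

package main;

my $x = SketchLinter->new;
my $project = $x->project = 'aaa';            # Assign through the accessor and keep a copy
$x->source  = qq(<concept id="$project"/>);
$x->source  =~ s/id="\w+?"//;                 # Edit the attribute in place
print $x->source, "\n";
```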
Lint
Lint xml files in parallel
lint($@)
Store some xml in a file, apply xmllint in parallel and update the source file with the results
Parameter Description
1 $lint Linter
2 %attributes Attributes to be recorded as xml comments
lintNOP($@)
Store some xml in a file, apply xmllint in a single process (not in parallel) and update the source file with the results
Parameter Description
1 $lint Linter
2 %attributes Attributes to be recorded as xml comments
nolint($@)
Store just the attributes in a file so that they can be retrieved later to process non xml objects referenced in the xml - like images
Parameter Description
1 $lint Linter
2 %attributes Attributes to be recorded as xml comments
read($)
Reread a linted xml file and extract the attributes associated with the lint
Parameter Description
1 $file File containing xml
This is a static method and so should be invoked as:
Data::Edit::Xml::Lint::read
waitAllProcesses()
Wait for all lints to finish
This is a static method and so should be invoked as:
Data::Edit::Xml::Lint::waitAllProcesses
clear(@)
Clear the results of a prior run
Parameter Description
1 @foldersAndExtensions Directories to clear and extensions of files to remove
This is a static method and so should be invoked as:
Data::Edit::Xml::Lint::clear
relint($$$@)
Locate all the labels or ids in the specified files, analyze the map of labels and ids with analysisSub, parse each file, process each parse with processSub, then lint the reprocessed xml back to the original file. This allows you to reprocess the contents of each file with knowledge of where labels or ids are located in the other files associated with a project. The analysisSub(linkmap = {project}{label or id} = [file, id]) should return true if the processing of each file is to be performed subsequently. The processSub(parse tree representation of a file, id and label mapping, reloaded linter) should return true if a lint is required to save the results after each file has been processed, else false. Optionally, the analysisSub may set the preferredSource attribute to indicate the preferred representation of the xml.
Parameter Description
1 $processes Maximum number of processes to use
2 $analysisSub Analysis sub
3 $processSub Process sub
4 @foldersAndExtensions Folders and extensions of files to process (recursively)
This is a static method and so should be invoked as:
Data::Edit::Xml::Lint::relint
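The callback contract described for relint can be illustrated with a small stand-in. The sketch below is not the module's implementation: relintSketch is a hypothetical function that keeps files as in-memory strings, passes a string reference where relint would pass a parse tree, and passes the file name where relint would pass the reloaded linter.

```perl
use strict;
use warnings;

# Hypothetical stand-in for relint with the same callback contract
sub relintSketch
 {my ($analysisSub, $processSub, $linkMap, $files) = @_;
  my @rewritten;
  return @rewritten unless $analysisSub->($linkMap);  # False: skip processing entirely
  for my $file (sort keys %$files)
   {my $parse = \$files->{$file};                     # Stand-in for a parse tree
    push @rewritten, $file                            # True: the result should be saved
      if $processSub->($parse, $linkMap, $file);
   }
  return @rewritten;
 }

my %files   = ('a.xml' => '<p>TODO</p>', 'b.xml' => '<p>done</p>');
my %linkMap = (aaa => {c1 => ['a.xml', 'c1']});

my @saved = relintSketch
 (sub {my ($linkMap) = @_; scalar keys %$linkMap},    # Proceed if any project has labels
  sub {my ($parse)   = @_; $$parse =~ s/TODO/done/},  # True only if something changed
  \%linkMap, \%files);

print "rewritten: @saved\n";
```

Returning false from the process sub for unchanged files avoids rewriting, and hence relinting, files that need no fixes.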
resolveUniqueLink($$)
Return the unique (file, leading id) of the specified link in the link map or () if no such definition exists
Parameter Description
1 $linkMap Link map
2 $link Label
This is a static method and so should be invoked as:
Data::Edit::Xml::Lint::resolveUniqueLink
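One plausible reading of "unique" is that exactly one project in the link map defines the link. The sketch below follows that reading, which is an assumption, and uses the link map shape given in the relint description; resolveUniqueLinkSketch and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Link map shaped as in the relint description: {project}{label or id} = [file, id]
sub resolveUniqueLinkSketch
 {my ($linkMap, $link) = @_;
  my @targets = map  { $linkMap->{$_}{$link} }
                grep { exists $linkMap->{$_}{$link} }
                sort keys %$linkMap;
  return @targets == 1 ? @{$targets[0]} : ();         # (file, id) or the empty list
 }

my %linkMap =
 (aaa => {c1 => ['out/aaa1.xml', 'c1'], intro => ['out/aaa2.xml', 'i1']},
  bbb => {intro => ['out/bbb1.xml', 'i1']},
 );

my ($file, $id) = resolveUniqueLinkSketch(\%linkMap, 'c1');
print "c1 resolves to $file#$id\n";

my @ambiguous = resolveUniqueLinkSketch(\%linkMap, 'intro');
print "intro is ambiguous\n" unless @ambiguous;
```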
urlEncode($)
Return a url encoded string
Parameter Description
1 $s String
This is a static method and so should be invoked as:
Data::Edit::Xml::Lint::urlEncode
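For readers unfamiliar with url encoding, a generic percent-encoding routine can be sketched as below. Which characters the module's own urlEncode leaves unescaped is not stated here, so the choice of the RFC 3986 unreserved set is an assumption, and urlEncodeSketch is a hypothetical name.

```perl
use strict;
use warnings;

# Percent-encode every byte outside the RFC 3986 unreserved set
sub urlEncodeSketch
 {my ($s) = @_;
  utf8::encode($s);                                   # Work on UTF-8 bytes, not characters
  $s =~ s/([^A-Za-z0-9\-._~])/sprintf("%%%02X", ord($1))/ge;
  return $s;
 }

print urlEncodeSketch("out/my file.xml#c1"), "\n";
```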
resolveDitaLink($$$$)
Return the unique (file, leading id, topic) of the specified link in the link map or () if no such definition exists
Parameter Description
1 $linkMap Link map
2 $fileToGuid File map
3 $link Label
4 $sourceFile File we are resolving from
This is a static method and so should be invoked as:
Data::Edit::Xml::Lint::resolveDitaLink
reuseFileInProject($$)
Record the reuse of the specified file in the specified project
Parameter Description
1 $file Name of file that is being reused
2 $project Name of project in which it is reused
This is a static method and so should be invoked as:
Data::Edit::Xml::Lint::reuseFileInProject
resolveFileToGuid($$)
Return the unique definition of the specified link in the link map or undef if no such definition exists
Parameter Description
1 $fileToGuids File to guids map
2 $file File
This is a static method and so should be invoked as:
Data::Edit::Xml::Lint::resolveFileToGuid
multipleLabelDefs($)
Return ([project; source label or id; targets count]*) of all labels or ids that have multiple definitions
Parameter Description
1 $labelDefs Label definitions
This is a static method and so should be invoked as:
Data::Edit::Xml::Lint::multipleLabelDefs
multipleLabelDefsReport($)
Return a report showing labels and ids with multiple definitions in each project ordered by most defined
Parameter Description
1 $labelDefs Label and Id definitions
This is a static method and so should be invoked as:
Data::Edit::Xml::Lint::multipleLabelDefsReport
singleLabelDefs($)
Return ([project; label or id]*) of all labels or ids that have a single definition
Parameter Description
1 $labelDefs Label and Id definitions
This is a static method and so should be invoked as:
Data::Edit::Xml::Lint::singleLabelDefs
singleLabelDefsReport($)
Return a report showing labels or ids with just one definition ordered by project, label name
Parameter Description
1 $labelDefs Label and Id definitions
This is a static method and so should be invoked as:
Data::Edit::Xml::Lint::singleLabelDefsReport
Report
Methods for reporting the results of linting several files
report($$)
Analyse the results of prior lints and return a hash reporting various statistics and a printable report
Parameter Description
1 $outputDirectory Directory to search
2 $filter Optional regular expression to filter files
This is a static method and so should be invoked as:
Data::Edit::Xml::Lint::report
Attributes
compressedErrors :lvalue
Compressed errors over all files
docTypes :lvalue
{docType}++ - Hash of document types encountered
failingFiles :lvalue
Array of [number of errors, project, files] ordered from least to most errors
failingProjects :lvalue
[Projects with xmllint errors]
filter :lvalue
File selection filter
numberOfFiles :lvalue
Number of files encountered
numberOfProjects :lvalue
Number of projects defined - each project can contain zero or more files
passingProjects :lvalue
[Projects with no xmllint errors]
passRatePercent :lvalue
Total number of passes as a percentage of all input files
print :lvalue
A printable report of the above
timestamp :lvalue
Timestamp of report
totalCompressedErrorsFileByFile :lvalue
Total number of errors summed file by file
totalCompressedErrors :lvalue
Number of compressed errors
totalErrors :lvalue
Total number of errors
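The pass and fail figures in the report can be reproduced from per-file error counts. The following sketch shows the arithmetic only; the sample records are invented and the variable names mirror, but are not, the report attributes.

```perl
use strict;
use warnings;

# [project, file, errors] for each linted file (sample data invented for this sketch)
my @lints =
 ([aaa => 'out/aaa1.xml', 0], [aaa => 'out/aaa3.xml', 1], [aaa => 'out/aaa9.xml', 1],
  [bbb => 'out/bbb1.xml', 1], [bbb => 'out/bbb2.xml', 0],
 );

my (%pass, %fail, %total);
for my $lint (@lints)
 {my ($project, $file, $errors) = @$lint;
  $total{$project}++;
  if ($errors) {$fail{$project}++} else {$pass{$project}++}
 }

my %percent;                                          # Pass rate per project to 4 decimal places
for my $project (sort keys %total)
 {my ($p, $f) = ($pass{$project} // 0, $fail{$project} // 0);
  $percent{$project} = sprintf "%.4f", 100 * $p / $total{$project};
  printf "%8s %4d %4d %5d  %s\n", $percent{$project}, $p, $f, $total{$project}, $project;
 }

my $passes          = grep { !$_->[2] } @lints;       # Files with no errors
my $passRatePercent = 100 * $passes / @lints;         # Passes as a percentage of all files
print "$passRatePercent % success\n";
```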
Private Methods
lintOP($$@)
Store some xml in a file, apply xmllint in parallel or in a single process and update the source file with the results
Parameter Description
1 $inParallel In parallel or not
2 $lint Linter
3 %attributes Attributes to be recorded as xml comments
compressErrors(@)
Compress the errors so we count the ones that do not look similar. Errors typically occupy three lines with the last line containing ^ at the end to mark the location of the error.
Parameter Description
1 @errors Errors
This is a static method and so should be invoked as:
Data::Edit::Xml::Lint::compressErrors
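A rough sketch of this idea follows. It groups lines into errors using the closing ^ line and treats two errors as similar when their first lines match after the leading "file:line:" prefix is removed. That similarity rule and the name compressErrorsSketch are assumptions for illustration; the module may compare errors differently.

```perl
use strict;
use warnings;

# Count errors that do not look similar, assuming each error ends with a ^ line
sub compressErrorsSketch
 {my (@lines) = @_;
  my (@record, %seen);
  for my $line (@lines)
   {push @record, $line;
    next unless $line =~ m/\^\s*\z/;                  # The caret line closes the current error
    (my $key = $record[0]) =~ s/\A.*?:\d+:\s*//;      # Ignore file name and line number
    $seen{$key}++;
    @record = ();
   }
  return scalar keys %seen;
 }

my @errors =
 ('out/aaa3.xml:3: element concept: validity error : Element concept does not carry attribute id',
  '<concept >',
  '         ^',
  'out/bbb1.xml:3: element concept: validity error : Element concept does not carry attribute id',
  '<concept >',
  '         ^',
  'out/bbb7.xml:5: parser error : Opening and ending tag mismatch',
  '</conbody>',
  '          ^',
 );

print compressErrorsSketch(@errors), " distinct error(s)\n";
```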
formatAttributes(%)
Format the attributes section of the output file
Parameter Description
1 $attributes Hash of attributes
waitProcessing($)
Wait for a processor to become available
Parameter Description
1 $processes Maximum number of processes
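Limiting the number of parallel processes is commonly done with fork and wait. The sketch below shows that general technique with child processes that exit immediately; it is not the module's code, and waitProcessingSketch and its counters are hypothetical.

```perl
use strict;
use warnings;

my $processes = 2;                                    # Maximum children in flight
my $running   = 0;
my $finished  = 0;

# Block until fewer than $max children are running
sub waitProcessingSketch
 {my ($max) = @_;
  while ($running >= $max)
   {my $pid = wait;                                   # Returns -1 when there are no children
    last if $pid == -1;
    $running--; $finished++;
   }
 }

for my $job (1..5)
 {waitProcessingSketch($processes);
  my $pid = fork;
  die "Cannot fork: $!" unless defined $pid;
  exit 0 if $pid == 0;                                # Child: a real linter would run xmllint here
  $running++;
 }

while ((my $pid = wait) != -1)                        # Wait for the remaining children
 {$running--; $finished++;
 }
print "finished $finished jobs\n";
```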
reuseInProject($)
Record the reuse of an item in the named project
Parameter Description
1 $project Name of the project in which it is reused
This is a static method and so should be invoked as:
Data::Edit::Xml::Lint::reuseInProject
countLinkTargets($$)
Count the number of targets this link resolves to.
Parameter Description
1 $linkMap Link map
2 $link Label
This is a static method and so should be invoked as:
Data::Edit::Xml::Lint::countLinkTargets
p4($$)
Format a fraction as a percentage to 4 decimal places
Parameter Description
1 $p Pass
2 $f Fail
This is a static method and so should be invoked as:
Data::Edit::Xml::Lint::p4
Index
1 author - Optional author of the xml - only needed if you want to generate an SDL file map
2 catalog - Optional catalog file containing the locations of the DTDs used to validate the xml
3 clear - Clear the results of a prior run
4 compressedErrors - Compressed errors over all files
5 compressedErrorText - Text of compressed errors
6 compressErrors - Compress the errors so we count the ones that do not look similar.
7 countLinkTargets - Count the number of targets this link resolves to.
8 ditaType - Optional Dita topic type(concept|task|troubleshooting|reference) of the xml - only needed if you want to generate an SDL file map
9 docType - The second line: the document type extracted from the source
10 docTypes - {docType}++ - Hash of document types encountered
11 dtds - Optional directory containing the DTDs used to validate the xml
12 errors - Number of uncompressed lint errors detected by xmllint
13 errorText - Text of uncompressed lint errors detected by xmllint
14 failingFiles - Array of [number of errors, project, files] ordered from least to most errors
15 failingProjects - [Projects with xmllint errors]
16 file - File that the xml will be written to and read from by lint, read or relint
17 fileNumber - File number - assigned early on by the caller to help debugging transformations
18 filter - File selection filter
19 formatAttributes - Format the attributes section of the output file
20 guid - Guid or id of the outermost tag - if not supplied the first definition encountered will be used on the basis that all Dita topics require an id
21 header - The first line: the xml header extracted from source
22 idDefs - {id} = count - the number of times this id is defined in the xml contained in this file
23 labelDefs - {label or id} = id - the id of the node containing a label defined on the xml
24 labels - Optional parse tree to supply labels for the current source as the labels are present in the parse tree not in the string representing the parse tree
25 lint - Store some xml in a file, apply xmllint in parallel and update the source file with the results
26 linted - Date the lint was performed by lint
27 lintNOP - Store some xml in a file, apply xmllint in a single process (not in parallel) and update the source file with the results
28 lintOP - Store some xml in a file, apply xmllint in parallel or in a single process and update the source file with the results
29 multipleLabelDefs - Return ([project; source label or id; targets count]*) of all labels or ids that have multiple definitions
30 multipleLabelDefsReport - Return a report showing labels and ids with multiple definitions in each project ordered by most defined
31 new - Create a new xml linter - call this method statically as in Data::Edit::Xml::Lint::new()
32 nolint - Store just the attributes in a file so that they can be retrieved later to process non xml objects referenced in the xml - like images
33 numberOfFiles - Number of files encountered
34 numberOfProjects - Number of projects defined - each project can contain zero or more files
35 p4 - Format a fraction as a percentage to 4 decimal places
36 passingProjects - [Projects with no xmllint errors]
37 passRatePercent - Total number of passes as a percentage of all input files
38 preferredSource - Preferred representation of the xml source, used by relint to supply a preferred representation for the source
39 print - A printable report of the above
40 processes - Maximum number of xmllint processes to run in parallel - 8 by default
41 project - Optional project name to allow error counts to be aggregated by project and to allow ids and labels to be scoped to the files contained in each project
42 read - Reread a linted xml file and extract the attributes associated with the lint
43 relint - Locate all the labels or ids in the specified files, analyze the map of labels and ids with analysisSub, parse each file, process each parse with processSub, then lint the reprocessed xml back to the original file.
44 report - Analyse the results of prior lints and return a hash reporting various statistics and a printable report
45 resolveDitaLink - Return the unique (file, leading id, topic) of the specified link in the link map or () if no such definition exists
46 resolveFileToGuid - Return the unique definition of the specified link in the link map or undef if no such definition exists
47 resolveUniqueLink - Return the unique (file, leading id) of the specified link in the link map or () if no such definition exists
48 reusedInProject - List of projects in which this file is reused, which can be set via reuseFileInProject every time you discover another project in which a file is reused
49 reuseFileInProject - Record the reuse of the specified file in the specified project
50 reuseInProject - Record the reuse of an item in the named project
51 sha256 - Sha256 hash of the string containing the xml processed by lint or read
52 singleLabelDefs - Return ([project; label or id]*) of all labels or ids that have a single definition
53 singleLabelDefsReport - Return a report showing labels or ids with just one definition ordered by project, label name
54 source - The source Xml to be linted
55 timestamp - Timestamp of report
56 title - Optional title of the xml - only needed if you want to generate an SDL file map
57 totalCompressedErrors - Number of compressed errors
58 totalCompressedErrorsFileByFile - Total number of errors summed file by file
59 totalErrors - Total number of errors
60 urlEncode - Return a url encoded string
61 waitAllProcesses - Wait for all lints to finish
62 waitProcessing - Wait for a processor to become available
Installation
This module is written in 100% Pure Perl and, thus, it is easy to read, comprehend, use, modify and install via cpan:
sudo cpan install Data::Edit::Xml::Lint
Author
Copyright
Copyright (c) 2016-2018 Philip R Brenan.
This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.