Name
Data::Edit::Xml::Xref - Cross reference Dita XML, match topics and ameliorate missing references.
Synopsis
Check the references in a large corpus of Dita XML documents held in the folder inputFolder, running processes in parallel wherever possible to take advantage of multi-CPU computers:
use Data::Edit::Xml::Xref;
my $x = xref(inputFolder => q(in),
maximumNumberOfProcesses => 512,
relativePath => q(out),
fixBadRefs => 1,
flattenFolder => q(out2),
matchTopics => 0.9,
);
The cross reference analysis can be requested as a status line:
ok nws($x->statusLine) eq nws(<<END);
Xref: 108 references fixed, 50 bad xrefs, 16 missing image files, 16 missing image references, 13 bad first lines, 13 bad second lines, 9 bad conrefs, 9 duplicate topic ids, 9 files with bad conrefs, 9 files with bad xrefs, 8 duplicate ids, 6 bad topicrefs, 6 files not referenced, 4 invalid guid hrefs, 2 bad book maps, 2 bad tables, 1 External xrefs with no format=html, 1 External xrefs with no scope=external, 1 file failed to parse, 1 href missing
END
Or as a tabular report:
ok nws($x->statusTable) eq nws(<<END);
Xref:
    Count  Condition
 1    108  references fixed
 2     50  bad xrefs
 3     16  missing image files
 4     16  missing image references
 5     13  bad first lines
 6     13  bad second lines
 7      9  files with bad conrefs
 8      9  bad conrefs
 9      9  files with bad xrefs
10      9  duplicate topic ids
11      8  duplicate ids
12      6  bad topicrefs
13      6  files not referenced
14      4  invalid guid hrefs
15      2  bad book maps
16      2  bad tables
17      1  href missing
18      1  file failed to parse
19      1  External xrefs with no format=html
20      1  External xrefs with no scope=external
END
More detailed reports are produced in the reports folder:
$x->reports
and indexed by the reports report:
reports/reports.txt
which contains a list of all the reports generated:
    Rows  Title                                                           File
 1     5  Attributes                                                      reports/count/attributes.txt
 2    13  Bad Xml line 1                                                  reports/bad/xmlLine1.txt
 3    13  Bad Xml line 2                                                  reports/bad/xmlLine2.txt
 4     9  Bad conRefs                                                     reports/bad/ConRefs.txt
 5     2  Bad external xrefs                                              reports/bad/externalXrefs.txt
 6    16  Bad image references                                            reports/bad/imageRefs.txt
 7     9  Bad topicrefs                                                   reports/bad/topicRefs.txt
 8    50  Bad xRefs                                                       reports/bad/XRefs.txt
 9     2  Bookmaps with errors                                            reports/bad/bookMap.txt
10     2  Document types                                                  reports/count/docTypes.txt
11     8  Duplicate id definitions within files                           reports/bad/idDefinitionsDuplicated.txt
12     3  Duplicate topic id definitions                                  reports/bad/topicIdDefinitionsDuplicated.txt
13     3  File extensions                                                 reports/count/fileExtensions.txt
14     1  Files failed to parse                                           reports/bad/parseFailed.txt
15     0  Files types                                                     reports/count/fileTypes.txt
16    16  Files whose short names are bi-jective with their md5 sums      reports/good/shortNameToMd5Sum.txt
17     0  Files whose short names are not bi-jective with their md5 sums  reports/bad/shortNameToMd5Sum.txt
18   108  Fixes Applied To Failing References                             reports/lists/referencesFixed.txt
19     0  Good bookmaps                                                   reports/good/bookMap.txt
20     9  Good conRefs                                                    reports/good/ConRefs.txt
21     5  Good topicrefs                                                  reports/good/topicRefs.txt
22     8  Good xRefs                                                      reports/good/XRefs.txt
23     1  Guid topic definitions                                          reports/lists/guidsToFiles.txt
24     2  Image files                                                     reports/good/imagesFound.txt
25     1  Missing hrefs                                                   reports/bad/missingHrefAttributes.txt
26    16  Missing image references                                        reports/bad/imagesMissing.txt
27     4  Possible improvements                                           reports/improvements.txt
28     2  Resolved GUID hrefs                                             reports/good/guidHrefs.txt
29     2  Tables with errors                                              reports/bad/tables.txt
30    23  Tags                                                            reports/count/tags.txt
31    11  Topic Reuses                                                    reports/lists/topicReuse.txt
32     0  Topic Reuses                                                    reports/lists/similar/byTitle.txt
33    16  Topics                                                          reports/lists/topics.txt
34    15  Topics with similar vocabulary                                  reports/lists/similar/byVocabulary.txt
35     0  Topics with validation errors                                   reports/bad/validationErrors.txt
36     0  Topics without ids                                              reports/bad/topicIdDefinitionsMissing.txt
37     6  Unreferenced files                                              reports/bad/notReferenced.txt
38    11  Unresolved GUID hrefs                                           reports/bad/guidHrefs.txt
File names in reports can be made relative to a specified directory named on the:
relativePath => q(out)
attribute.
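The effect of relativePath can be pictured with a minimal sketch that uses only core Perl. The helper relativeTo is hypothetical and is not part of Xref; it assumes that an undefined path means the file name is reported absolutely, as the description of the attribute suggests.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper, not part of Xref: express $file relative to $base.
sub relativeTo {
  my ($file, $base) = @_;
  return $file unless defined $base;            # no base: report the name unchanged
  return File::Spec->abs2rel($file, $base);     # otherwise strip the common leading path
}

print relativeTo(q(/home/docs/out/c_Notices.dita), q(/home/docs/out)), "\n";
```

How Xref itself shortens the names is not shown in this documentation, so treat the sketch as an illustration of the idea only.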
Add navigation titles to topic references
If requested, Xref will create or update the navigation titles (navtitle attributes) of the topic references appendix, chapter and topicref in maps, both for references made by file name and for references made by GUID:
addNavTitle => 1
Reports of successful updates will be written to:
reports/good/navTitles.txt
Reports of unsuccessful updates will be written to:
reports/bad/navTitles.txt
Fix bad references
It is often desirable to ameliorate unresolved Dita href attributes so that incomplete content can be loaded into a content management system. The:
fixBadRefs => 1
attribute requests that the:
href
attribute be renamed to:
xtrf
if the href attribute specification cannot be resolved in the current corpus.
This feature was designed by mom@cpan.org.
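The renaming of href to xtrf can be sketched as follows. The helper ameliorate and its table of known targets are hypothetical, and a regular expression is used only to keep the example short; it is an assumption that Xref works on the parsed XML rather than on raw text. The sketch covers only the fallback of moving the reference to xtrf, not the replacement of a GUID by a file name.

```perl
use strict;
use warnings;

# Hypothetical helper: rename href to xtrf on every reference whose target
# is not in the set of known targets.
sub ameliorate {
  my ($xml, %known) = @_;
  $xml =~ s{\bhref="([^"]*)"}{$known{$1} ? qq(href="$1") : qq(xtrf="$1")}ge;
  return $xml;
}

my $in  = q(<p><xref href="a.dita"/><xref href="missing.dita"/></p>);
my $out = ameliorate($in, q(a.dita) => 1);      # only a.dita can be resolved
print "$out\n";
```

Because the unresolved target is kept in xtrf, the original reference can still be recovered once the missing topic becomes available.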
File flattening
It is often desirable to flatten the topic files so that they can coexist in a single folder of a content management system without colliding with each other.
The presence of the input attribute:
flattenFolder
causes topic files to be flattened into the named folder.
Xref uses the well-known Gearhart-Brenan Dita Topic File Naming System, in which the file name for a topic consists of the following items separated by underscores:
- The first letter of the root tag of the topic.
- The title of the topic with all runs of characters not in the ranges a-z, A-Z, 0-9 reduced to a single underscore.
- The MD5 sum in hexadecimal of the content of the topic.
This has the effect of sorting files by their root tags and titles while guaranteeing a unique name for the topic that depends only on its content.
If the content of two such files is identical then they will have an identical file name because the generation of the file name depends only on the content of the topic. If two topic files have the same name under this naming system then they have identical content and only one file is needed to hold the topic in a content management system.
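The three naming rules above can be sketched in a few lines of core Perl. The function gbName is hypothetical and is not the code used by Xref; in particular, encoding the content as UTF-8 before computing the MD5 sum is an assumption made so that the sketch also works on text containing wide characters.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Hypothetical helper: build a file name from the root tag, title and content.
sub gbName {
  my ($rootTag, $title, $content) = @_;
  (my $t = $title) =~ s/[^a-zA-Z0-9]+/_/g;      # runs of other characters become one underscore
  return join q(_),
    substr($rootTag, 0, 1),                     # first letter of the root tag
    $t,                                         # normalized title
    md5_hex(encode_utf8($content));             # md5 sum of the content in hexadecimal
}

my $topic = q(<concept id="c1"><title>Notices</title><conbody/></concept>);
print gbName(q(concept), q(Notices), $topic), "\n";
```

Two topics with the same root tag, title and content always receive the same name, which is what allows duplicates to collapse into a single file.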
Topic Matching
Topics can be matched on title and vocabulary to assist authors in finding similar topics by specifying the:
matchTopics => 0.9
attribute where the value of this attribute is the confidence level between 0 and 1.
Topic matching might take some time for large input folders.
Title matching
Title matching sorts topics by their titles so that topics with similar titles can be easily located:
    Similar  Prefix        Source
 1       14  c_Notices__   c_Notices_5614e96c7a3eaf3dfefc4a455398361b
 2           c_Notices__   c_Notices_14a9f467215dea879d417de884c21e6d
 3           c_Notices__   c_Notices_19011759a2f768d76581dc3bba170a44
 4           c_Notices__   c_Notices_aa741e6223e6cf8bc1a5ebdcf0ba867c
 5           c_Notices__   c_Notices_f0009b28c3c273094efded5fac32b83f
 6           c_Notices__   c_Notices_b1480ac1af812da3945239271c579bb1
 7           c_Notices__   c_Notices_5f3aa15d024f0b6068bd8072d4942f6d
 8           c_Notices__   c_Notices_17c1f39e8d70c765e1fbb6c495bedb03
 9           c_Notices__   c_Notices_7ea35477554f979b3045feb369b69359
10           c_Notices__   c_Notices_4f200259663703065d247b35d5500e0e
11           c_Notices__   c_Notices_e3f2eb03c23491c5e96b08424322e423
12           c_Notices__   c_Notices_06b7e9b0329740fc2b50fedfecbc5a94
13           c_Notices__   c_Notices_550a0d84dfc94982343f58f84d1c11c2
14           c_Notices__   c_Notices_fa7e563d8153668db9ed098d0fe6357b
15        3  c_Overview__  c_Overview_f9e554ee9be499368841260344815f58
16           c_Overview__  c_Overview_f234dc10ea3f4229d0e1ab4ad5e8f5fe
17           c_Overview__  c_Overview_96121d7bcd41cf8be318b96da0049e73
Vocabulary matching
Vocabulary matching compares the vocabulary of pairs of topics: topics with similar vocabularies within the confidence level specified are reported together:
    Similar  Topic
 1        8  in/1.dita
 2           in/2.dita
 3           in/3.dita
 4           in/4.dita
 5           in/5.dita
 6           in/6.dita
 7           in/7.dita
 8           in/8.dita
 9
10        2  in/map/bookmap.ditamap
11           in/map/bookmap2.ditamap
12
13        2  in/act4.dita
14           in/act5.dita
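This documentation does not say how the similarity of two vocabularies is measured. As an illustration only, the sketch below assumes a Jaccard index: the number of words the two topics share divided by the number of distinct words in either. The functions vocabulary and similarity are hypothetical names, and the measure used by Xref may well differ.

```perl
use strict;
use warnings;

# Hypothetical helper: the set of lower-cased words in a topic once its
# tags and their attributes have been removed.
sub vocabulary {
  my ($text) = @_;
  $text =~ s/<[^>]*>/ /g;
  my %words = map { (lc($_), 1) } $text =~ /(\w+)/g;
  return \%words;
}

# Hypothetical measure: shared words divided by all words (Jaccard index).
sub similarity {
  my ($p, $q) = (vocabulary($_[0]), vocabulary($_[1]));
  my $shared = grep { $q->{$_} } keys %$p;      # words present in both topics
  my %all    = (%$p, %$q);                      # words present in either topic
  my $total  = keys %all;
  return $total ? $shared / $total : 0;
}

my $s = similarity(q(<p>Install the printer driver</p>),
                   q(<p>Remove the printer driver</p>));
printf "%.2f\n", $s;
print "similar\n" if $s >= 0.5;                 # compare against a chosen confidence level
```

Comparing every pair of topics in this way grows quadratically with the size of the corpus, which is consistent with the warning that matching might take some time on large input folders.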
Description
Cross reference Dita XML, match topics and ameliorate missing references.
Version 20190204.
The following sections describe the methods in each functional area of this module. For an alphabetic listing of all methods by name see Index.
Cross reference
Check the cross references in a set of Dita files and report the results.
xref(%)
Check the cross references in a set of Dita files held in inputFolder and report the results in the reports folder. The possible attributes are defined in Data::Edit::Xml::Xref Definition.
Parameter Description
1 %attributes Attributes
Example:
my $N = 8;
Data::Edit::Xml::Xref Definition
Attributes used by the Xref cross referencer.
Input fields
debugTimes - Write timing information if true
fixBadRefs - Try to fix bad references in these files where possible, either by changing a guid to a file name if the right file is present or, failing that, by moving the failing reference to the "xtrf" attribute.
flattenFolder - Files are optionally renamed to the Gearhart standard and placed in this folder.
inputFolder - A folder containing the dita and ditamap files to be cross referenced.
matchTopics - Match topics by title and by vocabulary to the specified confidence level between 0 and 1. This operation might take some time to complete on a large corpus.
maximumNumberOfProcesses - Maximum number of processes to run in parallel at any one time.
relativePath - Report files relative to this path or absolutely if undefined.
reports - Reports folder: the cross referencer will write reports to files in this folder.
summary - Print the summary line.
Output fields
addNavTitles - If true, add navtitle to topicrefs to show the title of the target
attributeCount - {file}{attribute} == count of the different xml attributes found in the xml files.
author - {file} = author of this file.
badBookMaps - Bad book maps.
badConRefs - {sourceFile} = [file, href] indicating the file has at least one bad conref.
badConRefsList - Bad conrefs - by file.
badGuidHrefs - Bad conrefs - all.
badImageRefs - Consolidated images missing.
badNavTitles - Details of nav titles that were not resolved
badTables - Array of tables that need fixing.
badTopicRefs - [file, href] Invalid href attributes found on topicref tags.
badXRefs - Bad Xrefs - by file
badXRefsList - Bad Xrefs - all
badXml1 - [Files] with a bad xml encoding header on the first line.
badXml2 - [Files] with a bad xml doc type on the second line.
baseTag - Base Tag for each file
conRefs - {file}{href} Count of conref definitions in each file.
docType - {file} == docType: the docType for each xml file.
duplicateIds - [file, id] Duplicate id definitions within each file.
duplicateTopicIds - [topicId, [files]] Files with duplicate topic ids - the id on the outermost tag.
fileExtensions - Default file extensions to load
fixRefs - {file}{ref} where the href or conref target is not present.
fixedRefs - [] hrefs and conrefs from fixRefs
flattenFiles - {old full file name} = file renamed to Gearhart standard
goodBookMaps - Good book maps.
goodConRefs - Good con refs - by file.
goodConRefsList - Good con refs - all.
goodGuidHrefs - {file}{href}{location}++ where a href that starts with GUID- has been correctly resolved.
goodImageRefs - Consolidated images found.
goodNavTitles - Details of nav titles that were resolved
goodTopicRefs - Good topic refs.
goodXRefs - Good xrefs - by file.
goodXRefsList - Good xrefs - all.
guidHrefs - {file}{href} = location where href starts with GUID- and is thus probably a guid.
guidToFile - {topic id which is a guid} = file defining topic id.
hrefUrlEncoding - Hrefs that need url encoding because they contain white space
ids - {file}{id} Id definitions across all files.
images - {file}{href} Count of image references in each file.
improvements - Suggested improvements - a list of improvements that might be made.
inputFiles - Input files from inputFolder.
inputFolderImages - {filename} = full file name which works well for images because the md5 sum in their name is probably unique.
md5Sum - MD5 sum for each input file.
missingImageFiles - [file, href] == Missing images in each file.
missingTopicIds - Missing topic ids.
noHref - Tags that should have an href but do not have one.
notReferenced - Files in input area that are not referenced by a conref, image, topicref or xref tag and are not a bookmap.
olBody - The number of ol under body by file
parseFailed - {file} files that failed to parse.
results - Summary of results table.
sourceFile - The source file from which this structure was generated.
statusLine - Status line summarizing the cross reference.
statusTable - Status table summarizing the cross reference.
tagCount - {file}{tags} == count of the different tag names found in the xml files.
title - {file} = title of file.
topicIds - {file} = topic id - the id on the outermost tag.
topicRefs - {file}{href}++ References from bookmaps to topics via appendix, chapter, topicref.
validationErrors - True means that Lint detected errors in the xml contained in the file.
vocabulary - The text of each topic shorn of attributes for vocabulary comparison.
xRefs - {file}{href}++ Xrefs references.
xrefBadFormat - External xrefs with no format=html.
xrefBadScope - External xrefs with no scope=external.
Attributes
The following is a list of all the attributes in this package. A method coded with the same name in your package will override the method of the same name in this package and thus provide your value for the attribute in place of the default value supplied for this attribute by this package.
Replaceable Attribute List
improvementLength
improvementLength
Improvement length
Private Methods
countLevels($$)
Count hash elements to the specified number of levels
Parameter Description
1 $l Levels
2 $h Hash
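Counting the elements of a nested hash down to a given level might look like the sketch below. The function countToLevel is a hypothetical re-implementation written for illustration; the exact behaviour of countLevels is not described here, so the handling of non-hash values is an assumption.

```perl
use strict;
use warnings;

# Hypothetical re-implementation: count the keys found at the given level
# of a hash of hashes.
sub countToLevel {
  my ($levels, $hash) = @_;
  return scalar keys %$hash if $levels <= 1;    # the requested level has been reached
  my $n = 0;
  $n += countToLevel($levels - 1, $_)           # descend into nested hashes only
    for grep { ref($_) eq q(HASH) } values %$hash;
  return $n;
}

my %ids = (q(a.dita) => {i1 => 1, i2 => 1}, q(b.dita) => {i3 => 1});
print countToLevel(1, \%ids), " files, ", countToLevel(2, \%ids), " ids\n";
```

With a structure such as {file}{id}, level 1 counts the files and level 2 counts the ids across all files.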
relativeFilePath($$)
Format file name for easy use on windows
Parameter Description
1 $xref Xref
2 $file File
unixFile($$)
Format file name for easy use on unix
Parameter Description
1 $xref Xref
2 $file File
formatFileNames($$$)
Format file names for easy use on unix and windows
Parameter Description
1 $xref Xref
2 $array Array of arrays containing file names in unix format
3 $column Column containing file names
loadInputFiles($)
Load the names of the files to be processed
Parameter Description
1 $xref Cross referencer
analyzeOneFile($)
Analyze one input file
Parameter Description
1 $iFile File to analyze
reportGuidsToFiles($)
Map and report guids to files
Parameter Description
1 $xref Xref results
fixOneFile($$)
Fix one file by moving unresolved references to the xtrf attribute
Parameter Description
1 $xref Xref results
2 $file File to fix
fixFiles($)
Fix files by moving unresolved references to the xtrf attribute
Parameter Description
1 $xref Xref results
fixOneFileGB($$)
Fix one file to the Gearhart-Brenan standard
Parameter Description
1 $xref Xref results
2 $file File to fix
fixFilesGB($)
Rename files to the Gearhart-Brenan standard
Parameter Description
1 $xref Xref results
analyze($)
Analyze the input files
Parameter Description
1 $xref Cross referencer
reportDuplicateIds($)
Report duplicate ids
Parameter Description
1 $xref Cross referencer
reportDuplicateTopicIds($)
Report duplicate topic ids
Parameter Description
1 $xref Cross referencer
reportNoHrefs($)
Report locations where an href was expected but not found
Parameter Description
1 $xref Cross referencer
reportRefs($$)
Report bad references found in xrefs or conrefs as they have the same structure
Parameter Description
1 $xref Cross referencer
2 $type Type of reference to be processed
reportGuidHrefs($)
Report on guid hrefs
Parameter Description
1 $xref Cross referencer
reportXrefs($)
Report bad xrefs
Parameter Description
1 $xref Cross referencer
reportTopicRefs($)
Report bad topic refs
Parameter Description
1 $xref Cross referencer
reportConrefs($)
Report bad conrefs
Parameter Description
1 $xref Cross referencer
reportImages($)
Reports on images and references to images
Parameter Description
1 $xref Cross referencer
reportParseFailed($)
Report failed parses
Parameter Description
1 $xref Cross referencer
reportXml1($)
Report bad xml on line 1
Parameter Description
1 $xref Cross referencer
reportXml2($)
Report bad xml on line 2
Parameter Description
1 $xref Cross referencer
reportDocTypeCount($)
Report doc type count
Parameter Description
1 $xref Cross referencer
reportTagCount($)
Report tag counts
Parameter Description
1 $xref Cross referencer
reportAttributeCount($)
Report attribute counts
Parameter Description
1 $xref Cross referencer
reportValidationErrors($)
Report the files known to have validation errors
Parameter Description
1 $xref Cross referencer
checkBookMap($$)
Check whether a bookmap is valid or not
Parameter Description
1 $xref Cross referencer
2 $bookMap Bookmap
reportBookMaps($)
Report on whether each bookmap is good or bad
Parameter Description
1 $xref Cross referencer
reportTables($)
Report on tables that have problems
Parameter Description
1 $xref Cross referencer
reportFileExtensionCount($)
Report file extension counts
Parameter Description
1 $xref Cross referencer
reportFileTypes($)
Report file type counts - takes too long in series
Parameter Description
1 $xref Cross referencer
reportNotReferenced($)
Report files that are not referenced by any conref, image, topicref or xref and are not bookmaps.
Parameter Description
1 $xref Cross referencer
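The selection of unreferenced files amounts to a set difference, sketched below. The function notReferenced and the shape of its arguments are hypothetical; Xref gathers the equivalent information from its own fields.

```perl
use strict;
use warnings;

# Hypothetical helper: files that are neither the target of a reference
# nor a bookmap.
sub notReferenced {
  my ($files, $referenced, $bookMaps) = @_;     # array ref, hash ref, hash ref
  my @orphans = grep { !$referenced->{$_} and !$bookMaps->{$_} } @$files;
  return [sort @orphans];
}

my $orphans = notReferenced(
  [qw(in/1.dita in/2.dita in/map/bookmap.ditamap)],
  {q(in/1.dita) => 1},                          # targets of conref, image, topicref or xref
  {q(in/map/bookmap.ditamap) => 1});            # bookmaps are never reported
print "$_\n" for @$orphans;
```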
reportExternalXrefs($)
Report external xrefs missing other attributes
Parameter Description
1 $xref Cross referencer
reportPossibleImprovements($)
Report improvements possible
Parameter Description
1 $xref Cross referencer
reportTopicDetails($)
Things that occur once in each file
Parameter Description
1 $xref Cross referencer
reportTopicReuse($)
Count how frequently each topic is reused
Parameter Description
1 $xref Cross referencer
reportSimilarTopicsByTitle($)
Report topics likely to be similar on the basis of their titles as expressed in the non Guid part of their file names
Parameter Description
1 $xref Cross referencer
reportSimilarTopicsByVocabulary($)
Report topics likely to be similar on the basis of their vocabulary
Parameter Description
1 $xref Cross referencer
reportMd5Sum($)
Good files have short names which uniquely represent their content and thus can be used instead of their md5sum to generate unique names
Parameter Description
1 $xref Cross referencer
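Whether short names and MD5 sums pair off one to one can be checked as sketched below. The function bijective is hypothetical, and taking the short name to be the file name without its folder is an assumption.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical check: does every short name have exactly one md5 sum and
# every md5 sum exactly one short name?
sub bijective {
  my (%md5ByFile) = @_;                         # full file name => md5 sum
  my (%sumsByName, %namesBySum);
  while (my ($file, $sum) = each %md5ByFile) {
    my $name = basename($file);                 # assumed meaning of "short name"
    $sumsByName{$name}{$sum}++;
    $namesBySum{$sum}{$name}++;
  }
  for my $h (\%sumsByName, \%namesBySum) {
    return 0 if grep { scalar(keys %$_) > 1 } values %$h;
  }
  return 1;
}

print bijective(q(in/a/x.dita) => q(11), q(in/b/y.dita) => q(22)) ? "good\n" : "bad\n";
```

When the check succeeds, the short name alone identifies the content, so it can stand in for the MD5 sum when generating unique names.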
reportOlBody($)
ol under body - indicative of a task
Parameter Description
1 $xref Cross referencer
reportHrefUrlEncoding($)
href needs url encoding
Parameter Description
1 $xref Cross referencer
addNavTitlesToOneMap($$)
Fix navtitles in one map
Parameter Description
1 $xref Xref results
2 $file File to fix
addNavTitlesToMaps($)
Add nav titles to files containing maps.
Parameter Description
1 $xref Xref results
createSampleInputFiles($)
Create sample input files for testing. The attribute inputFolder supplies the name of the folder in which to create the sample files.
Parameter Description
1 $N Number of sample files
Index
1 addNavTitlesToMaps - Add nav titles to files containing maps.
2 addNavTitlesToOneMap - Fix navtitles in one map
3 analyze - Analyze the input files
4 analyzeOneFile - Analyze one input file
5 checkBookMap - Check whether a bookmap is valid or not
6 countLevels - Count hash elements to the specified number of levels
7 createSampleInputFiles - Create sample input files for testing.
8 fixFiles - Fix files by moving unresolved references to the xtrf attribute
9 fixFilesGB - Rename files to the Gearhart-Brenan standard
10 fixOneFile - Fix one file by moving unresolved references to the xtrf attribute
11 fixOneFileGB - Fix one file to the Gearhart-Brenan standard
12 formatFileNames - Format file names for easy use on unix and windows
13 loadInputFiles - Load the names of the files to be processed
14 relativeFilePath - Format file name for easy use on windows
15 reportAttributeCount - Report attribute counts
16 reportBookMaps - Report on whether each bookmap is good or bad
17 reportConrefs - Report bad conrefs
18 reportDocTypeCount - Report doc type count
19 reportDuplicateIds - Report duplicate ids
20 reportDuplicateTopicIds - Report duplicate topic ids
21 reportExternalXrefs - Report external xrefs missing other attributes
22 reportFileExtensionCount - Report file extension counts
23 reportFileTypes - Report file type counts - takes too long in series
24 reportGuidHrefs - Report on guid hrefs
25 reportGuidsToFiles - Map and report guids to files
26 reportHrefUrlEncoding - href needs url encoding
27 reportImages - Reports on images and references to images
28 reportMd5Sum - Good files have short names which uniquely represent their content and thus can be used instead of their md5sum to generate unique names
29 reportNoHrefs - Report locations where an href was expected but not found
30 reportNotReferenced - Report files that are not referenced by any conref, image, topicref or xref and are not bookmaps.
31 reportOlBody - ol under body - indicative of a task
32 reportParseFailed - Report failed parses
33 reportPossibleImprovements - Report improvements possible
34 reportRefs - Report bad references found in xrefs or conrefs as they have the same structure
35 reportSimilarTopicsByTitle - Report topics likely to be similar on the basis of their titles as expressed in the non Guid part of their file names
36 reportSimilarTopicsByVocabulary - Report topics likely to be similar on the basis of their vocabulary
37 reportTables - Report on tables that have problems
38 reportTagCount - Report tag counts
39 reportTopicDetails - Things that occur once in each file
40 reportTopicRefs - Report bad topic refs
41 reportTopicReuse - Count how frequently each topic is reused
42 reportValidationErrors - Report the files known to have validation errors
43 reportXml1 - Report bad xml on line 1
44 reportXml2 - Report bad xml on line 2
45 reportXrefs - Report bad xrefs
46 unixFile - Format file name for easy use on unix
47 xref - Check the cross references in a set of Dita files held in inputFolder and report the results in the reports folder.
Installation
This module is written in 100% Pure Perl and, thus, it is easy to read, comprehend, use, modify and install via cpan:
sudo cpan install Data::Edit::Xml::Xref
Author
Copyright
Copyright (c) 2016-2018 Philip R Brenan.
This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.