NAME
dtatw-sanitize-header.perl - make DDC/DTA-friendly TEI-headers
SYNOPSIS
dtatw-sanitize-header.perl [OPTIONS] XML_HEADER_FILE
General Options:
-help # this help message
-verbose LEVEL # set verbosity level (0<=LEVEL<=2)
-quiet # alias for -verbose=0
-dta , -foreign # do/don't warn about strict DTA header compliance (default=do)
-max-bibl-length LEN # trim bibl fields to maximum length LEN (default=256)
Auxiliary DB Options: # optional BASENAME-keyed JSON-metadata Berkeley DB
-aux-db DBFILE # read auxiliary DB from DBFILE (default=none)
-aux-xpath XPATH # append <idno type="KEY"> elements to XPATH (default='fileDesc[@n="ddc-aux"]')
XPath Options:
-xpath ATTR=XPATH # prepend XPATH for attribute ATTR
-default ATTR=VAL # default values (for textClass* attributes)
I/O Options:
-blanks , -noblanks # do/don't keep 'ignorable' whitespace in XML_HEADER_FILE (default=don't)
-base BASENAME # use BASENAME to auto-compute field names (default=basename(XML_HEADER_FILE))
-output FILE # specify output file (default='-' (STDOUT))
OPTIONS AND ARGUMENTS
General Options
- -h, -help
-
Display a brief usage summary and exit.
- -v, -verbose LEVEL
-
Set verbosity level; values for LEVEL are:
0: silent
1: warnings only
2: warnings and progress messages
- -q, -quiet
-
Alias for -verbose=0
- -b, -basename BASENAME
-
Set basename for generated header fields; default is the basename (non-directory portion) of XML_HEADER_FILE up to but not including the first dot (".") character, if any. In default -dta mode, everything after the first dot character in BASENAME will be truncated even if you specify this option; in -foreign mode, dots in basenames passed in via this option are allowed.
- -dta, -nodta
-
Do/don't run with DTA-specific heuristics and attempt to enforce DTA-header compliance (default: do).
- -foreign
-
Alias for -nodta.
- -l, -max-bibl-len LEN
-
Trim sanitized bibl fields to a maximum length of LEN characters (default=256).
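The BASENAME rules described for the -basename option can be sketched as follows. This is a minimal Python illustration of the documented behavior, not the script's own Perl code; the function name `compute_basename` and the example file names are hypothetical.

```python
import os

def compute_basename(header_file, basename=None, dta_mode=True):
    """Return the BASENAME used to auto-compute header fields (illustrative)."""
    if basename is None:
        # default: non-directory portion of the file name, up to the first dot
        return os.path.basename(header_file).split(".", 1)[0]
    if dta_mode:
        # -dta mode: cut at the first dot even for an explicit -basename
        return basename.split(".", 1)[0]
    # -foreign mode: dots in an explicit basename are kept
    return basename

print(compute_basename("/data/kant_kritik_1781.TEI-P5.xml"))
print(compute_basename("x.xml", basename="corpus.v2", dta_mode=True))
print(compute_basename("x.xml", basename="corpus.v2", dta_mode=False))
```

Under these assumptions the three calls yield "kant_kritik_1781", "corpus", and "corpus.v2" respectively.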
Auxiliary DB Options
You can optionally use a BASENAME-keyed JSON-metadata Berkeley DB file to automatically insert additional metadata fields into an existing header.
- -aux-db DBFILE
-
Apply auxiliary metadata from Berkeley DB file DBFILE (default=none). Keys of DBFILE should be BASENAMEs as parsed from XML_HEADER_FILE or passed in via the -basename option, and the associated values should be flat JSON objects whose keys are the names of metadata attributes for BASENAME and whose values are the values of those metadata attributes.
- -aux-xpath XPATH
-
Append <idno type="KEY">VAL</idno> elements to XPATH (default='fileDesc[@n="ddc-aux"]') for auxiliary metadata attributes.
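To make the auxiliary-DB record layout concrete, here is a rough Python sketch. It uses a plain dict as a stand-in for the Berkeley DB file, and the attribute names "genre" and "collection" are made up for illustration. Creating the target element when it is missing is also an assumption, not documented behavior.

```python
import json
import xml.etree.ElementTree as ET

# Stand-in for the Berkeley DB: BASENAME -> flat JSON object (as a string).
# The keys "genre" and "collection" are hypothetical attribute names.
aux_db = {"kant_kritik_1781": '{"genre": "Wissenschaft", "collection": "core"}'}

def apply_aux(header, basename, db):
    """Append one <idno type="KEY">VAL</idno> per auxiliary attribute."""
    record = db.get(basename)
    if record is None:
        return header  # no auxiliary metadata for this BASENAME
    # default -aux-xpath target; created here if missing (an assumption)
    target = header.find('fileDesc[@n="ddc-aux"]')
    if target is None:
        target = ET.SubElement(header, "fileDesc", {"n": "ddc-aux"})
    for key, val in json.loads(record).items():
        idno = ET.SubElement(target, "idno", {"type": key})
        idno.text = str(val)
    return header

header = ET.fromstring("<teiHeader><fileDesc/></teiHeader>")
apply_aux(header, "kant_kritik_1781", aux_db)
print(ET.tostring(header, encoding="unicode"))
```

Each key of the JSON object becomes the `type` attribute of one `idno` element, and the corresponding value becomes its text content.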
XPath Options
You can optionally specify source XPaths to override the defaults with the -xpath option.
- -xpath ATTR=XPATH
-
Prepend XPATH to the builtin list of source XPaths for the attribute ATTR. Known attributes: author title date bibl shelfmark library dirname dtaid timestamp availability avail textClassDTA textClassDWDS textClassCorpus.
- -default ATTR=VALUE
-
Default value for attribute ATTR. Only used for textClass* attributes.
I/O Options
- -[no]keep-blanks
-
Do/don't retain all whitespace in input file (default=don't).
- -o, -output OUTFILE
-
Write output to OUTFILE; default="-" (standard output).
- -format LEVEL
-
Format output at libxml level LEVEL (default=1).
DESCRIPTION
dtatw-sanitize-header.perl applies some parsing and encoding heuristics to a TEI-XML header file XML_HEADER_FILE in an attempt to ensure compliance with DTA/D* header conventions for subsequent DDC indexing. For each supported metadata attribute, a corresponding header record is first sought by means of a first-match-wins XPath list. If no existing header record is found, a default (possibly empty) value is heuristically assigned, and the resulting value is inserted into the header at a conventional XPath location.
The metadata attributes currently supported are listed below. Source XPaths in the list are specified relative to the root <teiHeader> element, and unless otherwise noted, the first source XPath listed is also the target XPath, which is guaranteed to exist in the output header on successful script completion.
See https://kaskade.dwds.de/dstar/doc/README.html#bibliographic_metadata_attributes for details on D* metadata attribute conventions.
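The first-match-wins lookup and the insertion at the target XPath can be illustrated with a small Python sketch using the standard library's ElementTree. It uses two of the "library" source XPaths listed further down. The helper names, the example repository value, and the empty-string default are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Two of the documented source XPaths for "library"; the first is the target.
LIBRARY_XPATHS = [
    'fileDesc/sourceDesc[@n="ddc"]/msDesc/msIdentifier/repository',
    'fileDesc/sourceDesc/msDesc/msIdentifier/repository',
]
# The target path spelled out as (tag, attributes) steps for element creation.
LIBRARY_TARGET = [
    ("fileDesc", {}), ("sourceDesc", {"n": "ddc"}),
    ("msDesc", {}), ("msIdentifier", {}), ("repository", {}),
]

def first_match(header, xpaths, default=""):
    """Return the text of the first XPath that matches, else the default."""
    for xp in xpaths:
        node = header.find(xp)
        if node is not None:
            return (node.text or "").strip()
    return default

def ensure_path(root, steps):
    """Walk the steps below root, creating missing elements; return the last."""
    node = root
    for tag, attrs in steps:
        match = next((c for c in node.findall(tag)
                      if all(c.get(k) == v for k, v in attrs.items())), None)
        node = match if match is not None else ET.SubElement(node, tag, attrs)
    return node

def sanitize_library(header):
    """Look up the value, then write it to the canonical target location."""
    value = first_match(header, LIBRARY_XPATHS)
    ensure_path(header, LIBRARY_TARGET).text = value
    return value

header = ET.fromstring(
    "<teiHeader><fileDesc><sourceDesc><msDesc><msIdentifier>"
    "<repository>SBB Berlin</repository>"
    "</msIdentifier></msDesc></sourceDesc></fileDesc></teiHeader>"
)
print(sanitize_library(header))
print(header.find(LIBRARY_XPATHS[0]).text)
```

In this example the value is found through the second (generic) XPath and then copied into a newly created `sourceDesc[@n="ddc"]` branch, so that the canonical target exists afterwards.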
author
XPath(s):
fileDesc/titleStmt/author[@n="ddc"] ##-- ddc: canonical target (formatted)
fileDesc/titleStmt/author ##-- new (direct, un-formatted)
fileDesc/sourceDesc/biblFull/titleStmt/author ##-- new (sourceDesc, un-formatted)
fileDesc/titleStmt/editor[string(@corresp)!="#DTACorpusPublisher"] ##-- new (direct, un-formatted)
fileDesc/sourceDesc/biblFull/titleStmt/editor[string(@corresp)!="#DTACorpusPublisher"] ##-- new (sourceDesc, un-formatted)
fileDesc/sourceDesc/listPerson[@type="searchNames"]/person/persName ##-- old
Heuristically parses and formats persName, surname, forename, and genName elements to a human-readable string. In DTA mode, defaults to the first component of the "_"-separated BASENAME.
title
XPath(s):
fileDesc/titleStmt/title[@type="main" or @type="sub" or @type="vol"] ##-- DTA-mode only
fileDesc/titleStmt/title[@type="ddc"] ##-- ddc: canonical target (formatted)
fileDesc/titleStmt/title[not(@type)]
sourceDesc[@id="orig"]/biblFull/titleStmt/title
sourceDesc[@id="scan"]/biblFull/titleStmt/title
sourceDesc[not(@id)]/biblFull/titleStmt/title
In DTA mode, heuristically parses and formats @type="main", @type="sub", @type="vol" elements to a human-readable string, and defaults to the second component of the "_"-separated BASENAME.
date
XPath(s):
fileDesc/sourceDesc[@n="ddc"]/biblFull/publicationStmt/date[@type="pub"] ##-- ddc: canonical target
fileDesc/sourceDesc[@n="scan"]/biblFull/publicationStmt/date ##-- old:publDate
fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="creation"]/supplied
fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="creation"]
fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="publication"]/supplied ##-- new:date (published, supplied)
fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="publication"] ##-- new:date (published)
fileDesc/sourceDesc/biblFull/publicationStmt/date/supplied ##-- new:date (generic, supplied)
fileDesc/sourceDesc/biblFull/publicationStmt/date ##-- new:date (generic)
Heuristically trims everything but digits and hyphens from the extracted date-string. In DTA mode, defaults to the final component of the "_"-separated BASENAME.
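The date trimming and the DTA-mode BASENAME defaults for author, title, and date can be sketched roughly as below. This is an illustrative Python approximation that assumes plain ASCII digits and hyphens; the function names and sample strings are hypothetical, and the real script may apply further heuristics.

```python
import re

def trim_date(raw):
    """Drop every character that is not an ASCII digit or a hyphen."""
    return re.sub(r"[^0-9-]", "", raw)

def dta_defaults(basename):
    """DTA-mode fallbacks taken from the "_"-separated BASENAME."""
    parts = basename.split("_")
    return {
        "author": parts[0],                           # first component
        "title": parts[1] if len(parts) > 1 else "",  # second component
        "date": parts[-1],                            # final component
    }

print(trim_date("[um 1781]"))
print(trim_date("1781-05-01 (ersch.)"))
print(dta_defaults("kant_kritik_1781"))
```

With these inputs, the trimmed dates are "1781" and "1781-05-01", and the BASENAME "kant_kritik_1781" falls back to author "kant", title "kritik", and date "1781".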
firstDate
XPath(s):
fileDesc/sourceDesc[@n="ddc"]/biblFull/publicationStmt/date[@type="first"] ##-- ddc: canonical target
fileDesc/sourceDesc[@n="orig"]/biblFull/publicationStmt/date ##-- old: publDate
fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="creation"]/supplied
fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="creation"]
fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="firstPublication"]/supplied ##-- new:date (first, supplied)
fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="firstPublication"] ##-- new:date (first)
fileDesc/sourceDesc/biblFull/publicationStmt/date/supplied ##-- new:date (generic, supplied)
fileDesc/sourceDesc/biblFull/publicationStmt/date ##-- new:date (generic)
Heuristically trims everything but digits and hyphens from the extracted date-string. Defaults to the publication date (see above).
bibl
XPath(s):
fileDesc/sourceDesc[@n="ddc"]/bibl ##-- ddc:canonical target
fileDesc/sourceDesc[@n="orig"]/bibl ##-- old:firstBibl, target
fileDesc/sourceDesc[@n="scan"]/bibl ##-- old:publBibl
fileDesc/sourceDesc/bibl ##-- new|old:generic
Heuristically generated from author, title, and date if not set. Ensures that the first 2 XPaths are set in the output file.
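As a loose illustration of the fallback above combined with the -max-bibl-len limit, the following Python sketch builds a bibl string from author, title, and date and truncates it. The "AUTHOR: TITLE. DATE" layout and the plain truncation are assumptions; the script's actual format and trimming strategy are not documented here.

```python
def make_bibl(author, title, date, max_len=256):
    """Build a fallback bibl string and trim it to max_len characters."""
    bibl = f"{author}: {title}. {date}"  # hypothetical layout
    return bibl[:max_len]                # plain truncation (assumed)

print(make_bibl("Kant, Immanuel", "Critik der reinen Vernunft", "1781"))
print(len(make_bibl("A", "T" * 300, "1781")))
```

The second call shows that an over-long title leaves a result of exactly 256 characters under the default limit.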
shelfmark
XPath(s):
fileDesc/sourceDesc[@n="ddc"]/msDesc/msIdentifier/idno/idno[@type="shelfmark"] ##-- ddc: canonical target
fileDesc/sourceDesc[@n="ddc"]/msDesc/msIdentifier/idno[@type="shelfmark"] ##-- -2013-08-04
fileDesc/sourceDesc/msDesc/msIdentifier/idno/idno[@type="shelfmark"]
fileDesc/sourceDesc/msDesc/msIdentifier/idno[@type="shelfmark"] ##-- new (>=2012-07)
fileDesc/sourceDesc/biblFull/notesStmt/note[@type="location"]/ident[@type="shelfmark"] ##-- old (<2012-07)
library
XPath(s):
fileDesc/sourceDesc[@n="ddc"]/msDesc/msIdentifier/repository ##-- ddc: canonical target
fileDesc/sourceDesc/msDesc/msIdentifier/repository ##-- new
fileDesc/sourceDesc/biblFull/notesStmt/note[@type="location"]/name[@type="repository"] ##-- old
basename (dtadir)
XPath(s):
fileDesc/publicationStmt[@n="ddc"]/idno[@type="basename"] ##-- new: canonical target
fileDesc/publicationStmt/idno/idno[@type="DTADirName"] ##-- (>=2013-09-04)
fileDesc/publicationStmt/idno[@type="DTADirName"] ##-- (>=2013-09-04)
fileDesc/publicationStmt/idno[@type="DTADIRNAME"] ##-- new (>=2012-07)
fileDesc/publicationStmt/idno[@type="DTADIR"] ##-- old (<2012-07)
Heuristically set to BASENAME if not found.
dtaid
XPath(s):
fileDesc/publicationStmt[@n="ddc"]/idno[@type="dtaid"] ##-- ddc: canonical target
fileDesc/publicationStmt/idno/idno[@type="DTAID"]
fileDesc/publicationStmt/idno[@type="DTAID"]
Defaults to "0" (zero) if unset.
timestamp
XPath(s):
fileDesc/publicationStmt/date[@type="ddc-timestamp"] ##-- ddc: canonical target
fileDesc/publicationStmt/date ##-- DTA mode only
Defaults to last modification time of XML_HEADER_FILE or the current time if not set.
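One possible reading of this default is "use the file's modification time, and fall back to the current time if it cannot be determined". The Python sketch below follows that reading; the ISO 8601 UTC output format and the function name are assumptions, not taken from the script.

```python
import os
import time
from datetime import datetime, timezone

def default_timestamp(header_file):
    """Return the file's mtime, or the current time, as an ISO 8601 UTC string."""
    try:
        mtime = os.path.getmtime(header_file)
    except OSError:
        # file is missing or unreadable (e.g. input from a pipe): use "now"
        mtime = time.time()
    stamp = datetime.fromtimestamp(mtime, tz=timezone.utc)
    return stamp.strftime("%Y-%m-%dT%H:%M:%SZ")  # output format is assumed

print(default_timestamp("/no/such/header.xml"))
```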
availability (human-readable)
XPath(s):
fileDesc/publicationStmt/availability[@type="ddc"]
fileDesc/publicationStmt/availability
Defaults to "-" if unset.
avail (DWDS code)
XPath(s):
fileDesc/publicationStmt/availability[@type="ddc_dwds"]
fileDesc/publicationStmt/availability/@n
Defaults to "-" if unset.
textClass
Source XPath(s):
profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#dwds1main"]
profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#dwds1sub"]
profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#dwds2main"]
profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#dwds2sub"]
profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dwds1main"]
profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dwds1sub"]
profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dwds2main"]
profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dwds2sub"]
profileDesc/textClass/keywords/term ##-- dwds keywords
Target XPath:
profileDesc/textClass/classCode[@scheme="ddcTextClassDWDS"]
textClassDTA
Source XPath(s):
profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#dtamain"]
profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#dtasub"]
profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dtamain"]
profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dtasub"]
Target XPath:
profileDesc/textClass/classCode[@scheme="ddcTextClassDTA"]
DTA corpus
Source XPath(s):
profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#DTACorpus"]
profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#DTACorpus"]
Target XPath:
profileDesc/textClass/classCode[@scheme="ddcTextClassCorpus"]
SEE ALSO
AUTHOR
Bryan Jurish <moocow@cpan.org>