NAME
dtatw-sanitize-header.perl - make DDC/DTA-friendly TEI-headers
SYNOPSIS
dtatw-sanitize-header.perl [OPTIONS] XML_HEADER_FILE
General Options:
-help # this help message
-verbose LEVEL # set verbosity level (0<=LEVEL<=2)
-quiet # alias for -verbose=0
-dta , -foreign # do/don't warn about strict DTA header compliance (default=do)
-max-bibl-length LEN # trim bibl fields to maximum length LEN (default=256)
Auxiliary DB Options: # optional BASENAME-keyed JSON-metadata Berkeley DB
-aux-db DBFILE # read auxiliary DB from DBFILE (default=none)
-aux-xpath XPATH # append <idno type="KEY"> elements to XPATH (default='fileDesc[@n="ddc-aux"]')
XPath Options:
-xpath ATTR=XPATH # prepend XPATH for attribute ATTR
-default ATTR=VAL # default values (for textClass* attributes)
I/O Options:
-blanks , -noblanks # do/don't keep 'ignorable' whitespace in XML_HEADER_FILE file (default=don't)
-base BASENAME # use BASENAME to auto-compute field names (default=basename(XML_HEADER_FILE))
-output FILE # specify output file (default='-' (STDOUT))
OPTIONS AND ARGUMENTS
General Options
- -h, -help
-
Display a brief usage summary and exit.
- -v, -verbose LEVEL
-
Set verbosity level; values for LEVEL are:
0: silent
1: warnings only
2: warnings and progress messages
- -q, -quiet
-
Alias for -verbose=0.
- -b, -basename BASENAME
-
Set basename for generated header fields; default is the basename (non-directory portion) of XML_HEADER_FILE up to but not including the first dot (".") character, if any. In default -dta mode, everything after the first dot character in BASENAME will be truncated even if you specify this option; in -foreign mode, dots in basenames passed in via this option are allowed.
- -dta, -nodta
-
Do/don't run with DTA-specific heuristics and attempt to enforce DTA-header compliance (default: do).
- -foreign
-
Alias for -nodta.
- -l, -max-bibl-len LEN
-
Trim sanitized XPaths to maximum length LEN characters (default=256).
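The BASENAME rules described for the -basename option can be sketched as follows. This is an illustration in Python, not the script's own Perl code; the function name compute_basename and its parameters are hypothetical.

```python
import os

def compute_basename(header_file, basename=None, dta_mode=True):
    """Hypothetical helper mirroring the documented BASENAME rules."""
    if basename is None:
        # Default: non-directory portion of the file name, up to the first dot.
        return os.path.basename(header_file).split(".", 1)[0]
    if dta_mode:
        # -dta mode: truncate at the first dot even for an explicit basename.
        return basename.split(".", 1)[0]
    # -foreign mode: dots in an explicitly passed basename are kept.
    return basename

default_name = compute_basename("/data/kant_kritik_1781.TEI-P5.xml")
dta_name = compute_basename("x.xml", basename="corpus.v2", dta_mode=True)
foreign_name = compute_basename("x.xml", basename="corpus.v2", dta_mode=False)
```

Under these assumptions, an explicit basename such as "corpus.v2" survives intact only in -foreign mode.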
Auxiliary DB Options
You can optionally use a BASENAME-keyed JSON-metadata Berkeley DB file to automatically insert additional metadata fields into an existing header.
- -aux-db DBFILE
-
Apply auxiliary metadata from Berkeley DB file DBFILE (default=none). Keys of DBFILE should be BASENAMEs as parsed from XML_HEADER_FILE or passed in via the -basename option, and the associated values should be flat JSON objects whose keys are the names of metadata attributes for BASENAME and whose values are the values of those metadata attributes.
- -aux-xpath XPATH
-
Append <idno type="KEY">VAL</idno> elements to XPATH (default='fileDesc[@n="ddc-aux"]') for auxiliary metadata attributes.
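A minimal sketch of how a flat JSON object might be turned into <idno> elements, written in Python rather than the script's Perl. A plain dict stands in for the Berkeley DB file, the sample keys and values are invented, and the target element is assumed to exist already; none of this is taken from the script itself.

```python
import json
import xml.etree.ElementTree as ET

def append_aux_idnos(target, json_value):
    """Append one <idno type="KEY">VAL</idno> child per key of a flat JSON object."""
    for key, val in json.loads(json_value).items():
        idno = ET.SubElement(target, "idno", {"type": key})
        idno.text = str(val)
    return target

# Stand-in for the Berkeley DB: BASENAME -> flat JSON object (hypothetical data).
aux_db = {"kant_kritik_1781": '{"genre": "philosophy", "collection": "demo"}'}

# Assumption: the default target fileDesc[@n="ddc-aux"] is already present.
header = ET.fromstring('<teiHeader><fileDesc n="ddc-aux"/></teiHeader>')
target = header.find('fileDesc[@n="ddc-aux"]')
append_aux_idnos(target, aux_db["kant_kritik_1781"])

pairs = [(i.get("type"), i.text) for i in target.findall("idno")]
```

Each key of the JSON object becomes the type attribute of one <idno> element, and the corresponding value becomes its text content.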
XPath Options
You can optionally specify source XPaths to override the defaults with the -xpath option.
- -xpath ATTR=XPATH
-
Prepend XPATH to the builtin list of source XPaths for the attribute ATTR. Known attributes: author title date bibl shelfmark library dirname dtaid timestamp availability avail textClassDTA textClassDWDS textClassCorpus.
- -default ATTR=VALUE
-
Default value for attribute ATTR. Only used for textClass* attributes.
I/O Options
- -[no]keep-blanks
-
Do/don't retain all whitespace in input file (default=don't).
- -o, -output OUTFILE
-
Write output to OUTFILE; default="-" (standard output).
- -format LEVEL
-
Format output at libxml level LEVEL (default=1).
DESCRIPTION
dtatw-sanitize-header.perl applies some parsing and encoding heuristics to a TEI-XML header file XML_HEADER_FILE in an attempt to ensure compliance with DTA/D* header conventions for subsequent DDC indexing. For each supported metadata attribute, a corresponding header record is first sought by means of a first-match-wins XPath list. If no existing header record is found, a default (possibly empty) value is heuristically assigned, and the resulting value is inserted into the header at a conventional XPath location.
The metadata attributes currently supported are listed below. Source XPaths in the list are specified relative to the root <teiHeader> element, and unless otherwise noted, the first source XPath listed is also the target XPath, which is guaranteed to exist in the output header on successful script completion.
See https://kaskade.dwds.de/dstar/doc/README.html#bibliographic_metadata_attributes for details on D* metadata attribute conventions.
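The first-match-wins lookup described above can be sketched as follows. The example uses Python's xml.etree.ElementTree, which supports only a subset of XPath, on a namespace-free toy header; the function first_match is hypothetical, and inserting the value at the target XPath is omitted.

```python
import xml.etree.ElementTree as ET

def first_match(header, xpaths):
    """Return (xpath, text) for the first XPath that selects a node, else (None, None)."""
    for xp in xpaths:
        node = header.find(xp)
        if node is not None:
            return xp, "".join(node.itertext()).strip()
    return None, None

# First three author XPaths from the list below; earlier entries take precedence.
AUTHOR_XPATHS = [
    'fileDesc/titleStmt/author[@n="ddc"]',
    'fileDesc/titleStmt/author',
    'fileDesc/sourceDesc/biblFull/titleStmt/author',
]

header = ET.fromstring(
    "<teiHeader><fileDesc><titleStmt>"
    "<author>Kant, Immanuel</author>"
    "</titleStmt></fileDesc></teiHeader>"
)

matched_xpath, value = first_match(header, AUTHOR_XPATHS)
if matched_xpath is None:
    value = ""  # a heuristic default would be assigned here
```

Because the canonical target is listed first, a header that has already been sanitized resolves on the first XPath tried.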
author
XPath(s):
fileDesc/titleStmt/author[@n="ddc"] ##-- ddc: canonical target (formatted)
fileDesc/titleStmt/author ##-- new (direct, un-formatted)
fileDesc/sourceDesc/biblFull/titleStmt/author ##-- new (sourceDesc, un-formatted)
fileDesc/titleStmt/editor[string(@corresp)!="#DTACorpusPublisher"] ##-- new (direct, un-formatted)
fileDesc/sourceDesc/biblFull/titleStmt/editor[string(@corresp)!="#DTACorpusPublisher"] ##-- new (sourceDesc, un-formatted)
fileDesc/sourceDesc/listPerson[@type="searchNames"]/person/persName ##-- old
Heuristically parses and formats persName, surname, forename, and genName elements to a human-readable string. In DTA mode, defaults to the first component of the "_"-separated BASENAME.
title
XPath(s):
fileDesc/titleStmt/title[@type="main" or @type="sub" or @type="vol"] ##-- DTA-mode only
fileDesc/titleStmt/title[@type="ddc"] ##-- ddc: canonical target (formatted)
fileDesc/titleStmt/title[not(@type)]
sourceDesc[@id="orig"]/biblFull/titleStmt/title
sourceDesc[@id="scan"]/biblFull/titleStmt/title
sourceDesc[not(@id)]/biblFull/titleStmt/title
In DTA mode, heuristically parses and formats @type="main", @type="sub", @type="vol" elements to a human-readable string, and defaults to the second component of the "_"-separated BASENAME.
date
XPath(s):
fileDesc/sourceDesc[@n="ddc"]/biblFull/publicationStmt/date[@type="pub"] ##-- ddc: canonical target
fileDesc/sourceDesc[@n="scan"]/biblFull/publicationStmt/date ##-- old:publDate
fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="creation"]/supplied
fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="creation"]
fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="publication"]/supplied ##-- new:date (published, supplied)
fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="publication"] ##-- new:date (published)
fileDesc/sourceDesc/biblFull/publicationStmt/date/supplied ##-- new:date (generic, supplied)
fileDesc/sourceDesc/biblFull/publicationStmt/date ##-- new:date (generic)
Heuristically trims everything but digits and hyphens from the extracted date-string. In DTA mode, defaults to the final component of the "_"-separated BASENAME.
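The date trimming and the DTA-mode BASENAME defaults for author, title, and date can be illustrated as below. This is a sketch in Python under the assumption that "trimming" simply deletes every character other than digits and hyphens; the function names and the handling of basenames with fewer than two components are hypothetical.

```python
import re

def trim_date(raw):
    """Keep only digits and hyphens from an extracted date string."""
    return re.sub(r"[^0-9-]", "", raw)

def dta_defaults(basename):
    """Derive fallback author, title, and date from a "_"-separated BASENAME."""
    parts = basename.split("_")
    return {
        "author": parts[0],                            # first component
        "title": parts[1] if len(parts) > 1 else "",   # second component (assumed empty if absent)
        "date": parts[-1],                             # final component
    }

trimmed = trim_date("[ca. 1781]")
ranged = trim_date("1781-1787 (2 Bde.)")
defaults = dta_defaults("kant_kritik_1781")
```

Note that a naive character filter like this one also keeps digits from any trailing annotation, as the second call shows.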
firstDate
XPath(s):
fileDesc/sourceDesc[@n="ddc"]/biblFull/publicationStmt/date[@type="first"] ##-- ddc: canonical target
fileDesc/sourceDesc[@n="orig"]/biblFull/publicationStmt/date ##-- old: publDate
fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="creation"]/supplied
fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="creation"]
fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="firstPublication"]/supplied ##-- new:date (first, supplied)
fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="firstPublication"] ##-- new:date (first)
fileDesc/sourceDesc/biblFull/publicationStmt/date/supplied ##-- new:date (generic, supplied)
fileDesc/sourceDesc/biblFull/publicationStmt/date ##-- new:date (generic)
Heuristically trims everything but digits and hyphens from the extracted date-string. Defaults to the publication date (see above).
bibl
XPath(s):
fileDesc/sourceDesc[@n="ddc"]/bibl ##-- ddc:canonical target
fileDesc/sourceDesc[@n="orig"]/bibl ##-- old:firstBibl, target
fileDesc/sourceDesc[@n="scan"]/bibl ##-- old:publBibl
fileDesc/sourceDesc/bibl ##-- new|old:generic
Heuristically generated from author, title, and date if not set. Ensures that the first 2 XPaths are set in the output file.
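A rough sketch of generating a bibl value from author, title, and date and trimming it to the -max-bibl-len limit. The "AUTHOR: TITLE. DATE" layout and the plain truncation are assumptions made for illustration; the format the script actually produces is not documented here, and make_bibl is a hypothetical name.

```python
def make_bibl(author, title, date, existing=None, max_len=256):
    """Return an existing bibl if set, else a generated one, trimmed to max_len."""
    if existing:
        bibl = existing
    else:
        # Assumed layout; the script's real formatting may differ.
        bibl = f"{author}: {title}. {date}"
    return bibl[:max_len]

generated = make_bibl("Kant, Immanuel", "Critik der reinen Vernunft", "1781")
kept = make_bibl("x", "y", "z", existing="Already set")
long_bibl = make_bibl("A", "T" * 500, "1800")
```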
shelfmark
XPath(s):
fileDesc/sourceDesc[@n="ddc"]/msDesc/msIdentifier/idno/idno[@type="shelfmark"] ##-- ddc: canonical target
fileDesc/sourceDesc[@n="ddc"]/msDesc/msIdentifier/idno[@type="shelfmark"] ##-- -2013-08-04
fileDesc/sourceDesc/msDesc/msIdentifier/idno/idno[@type="shelfmark"]
fileDesc/sourceDesc/msDesc/msIdentifier/idno[@type="shelfmark"] ##-- new (>=2012-07)
fileDesc/sourceDesc/biblFull/notesStmt/note[@type="location"]/ident[@type="shelfmark"] ##-- old (<2012-07)
library
XPath(s):
fileDesc/sourceDesc[@n="ddc"]/msDesc/msIdentifier/repository ##-- ddc: canonical target
fileDesc/sourceDesc/msDesc/msIdentifier/repository ##-- new
fileDesc/sourceDesc/biblFull/notesStmt/note[@type="location"]/name[@type="repository"] ##-- old
basename (dtadir)
XPath(s):
fileDesc/publicationStmt[@n="ddc"]/idno[@type="basename"] ##-- new: canonical target
fileDesc/publicationStmt/idno/idno[@type="DTADirName"] ##-- (>=2013-09-04)
fileDesc/publicationStmt/idno[@type="DTADirName"] ##-- (>=2013-09-04)
fileDesc/publicationStmt/idno[@type="DTADIRNAME"] ##-- new (>=2012-07)
fileDesc/publicationStmt/idno[@type="DTADIR"] ##-- old (<2012-07)
Heuristically set to BASENAME if not found.
dtaid
XPath(s):
fileDesc/publicationStmt[@n="ddc"]/idno[@type="dtaid"] ##-- ddc: canonical target
fileDesc/publicationStmt/idno/idno[@type="DTAID"]
fileDesc/publicationStmt/idno[@type="DTAID"]
Defaults to "0" (zero) if unset.
timestamp
XPath(s):
fileDesc/publicationStmt/date[@type="ddc-timestamp"] ##-- ddc: canonical target
fileDesc/publicationStmt/date ##-- DTA mode only
Defaults to last modification time of XML_HEADER_FILE or the current time if not set.
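The timestamp fallback might look like the following sketch, which reads the description as "use the file's modification time, else the current time". The ISO-8601 UTC output format and the function name default_timestamp are assumptions, not taken from the script.

```python
import os
import time
from datetime import datetime, timezone

def default_timestamp(header_file):
    """Modification time of header_file, or the current time if it cannot be read."""
    try:
        secs = os.path.getmtime(header_file)
    except OSError:
        secs = time.time()  # e.g. the file does not exist or input came from a stream
    # Assumed output format: ISO 8601 in UTC.
    return datetime.fromtimestamp(secs, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

ts = default_timestamp("/nonexistent/path/header.xml")
```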
availability (human-readable)
XPath(s):
fileDesc/publicationStmt/availability[@type="ddc"]
fileDesc/publicationStmt/availability
Defaults to "-" if unset.
avail (DWDS code)
XPath(s):
fileDesc/publicationStmt/availability[@type="ddc_dwds"]
fileDesc/publicationStmt/availability/@n
Defaults to "-" if unset.
textClass
Source XPath(s):
profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#dwds1main"]
profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#dwds1sub"]
profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#dwds2main"]
profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#dwds2sub"]
profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dwds1main"]
profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dwds1sub"]
profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dwds2main"]
profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dwds2sub"]
profileDesc/textClass/keywords/term ##-- dwds keywords
Target XPath:
profileDesc/textClass/classCode[@scheme="ddcTextClassDWDS"]
textClassDTA
Source XPath(s):
profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#dtamain"]
profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#dtasub"]
profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dtamain"]
profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dtasub"]
Target XPath:
profileDesc/textClass/classCode[@scheme="ddcTextClassDTA"]
DTA corpus
Source XPath(s):
profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#DTACorpus"]
profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#DTACorpus"]
Target XPath:
profileDesc/textClass/classCode[@scheme="ddcTextClassCorpus"]
SEE ALSO
AUTHOR
Bryan Jurish <moocow@cpan.org>