NAME
dcdb-create.perl - create a DiaColloDB diachronic collocation database
SYNOPSIS
dcdb-create.perl [OPTIONS] [INPUT(s)...]
General Options:
-help ##-- this help message
-version ##-- report version information and exit
-jobs NJOBS ##-- number of threads for corpus compilation (default=-1: all cores)
-xs , -pp ##-- do/don't use fast XS implementations where available (default=if available)
Corpus Options:
-list , -nolist ##-- INPUT(s) are/aren't file-lists (default=no)
-glob , -noglob ##-- do/don't glob INPUT(s) argument(s) (default=do)
-union, -nounion ##-- do/don't treat INPUT(s) as DB directories to be merged (default=don't)
-lazy , -nolazy ##-- do/don't create "lazy" list-client (union mode only; default=don't)
-dclass CLASS ##-- set corpus document class (default=DDCTabs)
-dopt OPT=VAL ##-- set corpus document option, e.g.
## eosre=EOSRE # eos regex (default='^$')
## foreign=BOOL # disable D*-specific heuristics
-bysent ##-- track collocations by sentence (default)
-byparagraph ##-- track collocations by paragraph
-bypage ##-- track collocations by page
-bydoc ##-- track collocations by document
Indexing Options:
-attrs ATTRS ##-- select index attributes (default=l,p)
## known attributes: l, p, w, doc.title, ...
-use-all-the-data ##-- disable default frequency- and regex-filters
-64bit ##-- use 64-bit quads where available
-32bit ##-- use 32-bit integers where available
-dmax DIST ##-- maximum distance for indexed co-occurrences (default=5)
-tfmin TFMIN ##-- minimum global term frequency (default=2)
-lfmin LFMIN ##-- minimum global lemma frequency (default=undef:tfmin)
-cfmin CFMIN ##-- minimum relation co-occurrence frequency (default=2)
-[no]tdf ##-- do/don't create (term x document) index relation (default=if available)
-tdf-dbreak BREAK ##-- set tdf matrix "document" granularity (e.g. s,p,page,file; default=file)
-tdf-fmin VFMIN ##-- set minimum tdf term frequency (default=undef: TFMIN)
-tdf-dfmin VDFMIN ##-- set minimum tdf term "document"-frequency (default=4)
-tdf-nmin VNMIN ##-- set minimum number of content tokens per tdf "document" (default=8)
-tdf-nmax VNMAX ##-- set maximum number of content tokens per tdf "document" (default=inf)
-tdf-option OPT=VAL ##-- set arbitrary tdf matrix option, e.g.
## minFreq=INT # minimum term frequency (default=undef: use TFMIN)
## minDocFreq=INT # minimum term document-"frequency" (default=4)
## minDocSize=INT # minimum document size (#/terms) (default=4)
## maxDocSize=INT # maximum document size (#/terms) (default=inf)
## mgood=REGEX # positive regex for document-level metadata
## mbad=REGEX # negative regex for document-level metadata
-option OPT=VAL ##-- set arbitrary DiaColloDB option, e.g.
## pack_id=PACKFMT # pack-format for IDs
## pack_f=PACKFMT # pack-format for frequencies
## pack_date=PACKFMT # pack-format for dates
## (p|w|l)good=REGEX # positive regex for (postags|words|lemmata)
## (p|w|l)bad=REGEX # negative regex for (postags|words|lemmata)
## (p|w|l)goodfile=FILE # positive list-file for (postags|words|lemmata)
## (p|w|l)badfile=FILE # negative list-file for (postags|words|lemmata)
## ddcServer=HOST:PORT # server for ddc relations
## ddcTimeout=SECONDS # timeout for ddc relations
I/O and Logging Options:
-log-level LEVEL ##-- set log-level (default=TRACE)
-log-option OPT=VAL ##-- set log option (e.g. logdate, logtime, file, syslog, stderr, ...)
-[no]keep ##-- do/don't keep temporary files (default=don't)
-[no]mmap ##-- do/don't use mmap for file access (default=do)
-[no]debug ##-- do/don't enable painful debugging checks (default=don't)
-[no]times ##-- do/don't report operating timing (default=do)
-output OUT ##-- output directory or client configuration file (required)
Environment Variables:
DIACOLLO_SORT ##-- system sort command prefix
SORT ##-- fallback for DIACOLLO_SORT
DESCRIPTION
dcdb-create.perl compiles a DiaColloDB diachronic collocation database from a tokenized and annotated input corpus, or merges multiple existing DiaColloDB databases into a single database directory. The resulting database can be queried with the dcdb-query.perl(1) script, or wrapped into a web-service with the help of the DiaColloDB::WWW utilities; see the DiaColloDB::WWW documentation for details.
OPTIONS AND ARGUMENTS
Arguments
- INPUT(s)
-
File(s), glob(s), file-list(s) to be indexed or existing indices to be merged. Interpretation depends on the -glob, -list, -union, and -lazy options.
General Options
- -help
-
Display a brief help message and exit.
- -version
-
Display version information and exit.
- -jobs NJOBS
-
Run NJOBS parallel compilation threads. If specified as 0, will run only a single thread. The default value (-1) will run as many jobs as there are cores on the (unix/linux) system; see "nJobs" in DiaColloDB::Utils for details. Also sets the environment variable OMP_NUM_THREADS after interpreting the NJOBS request.
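The documented NJOBS rules can be restated as a short shell sketch. The helper name resolve_njobs is hypothetical and only mirrors the behaviour described above; the actual logic lives in "nJobs" in DiaColloDB::Utils and may differ, for example in how it counts cores or treats other values.

```shell
# Illustrative sketch only: restates the documented NJOBS rules in shell.
# resolve_njobs is a hypothetical helper, not part of DiaColloDB.
resolve_njobs() {
  case "$1" in
    0)  echo 1 ;;                                           # 0: run a single thread
    -1) getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1 ;;  # -1: one job per core (falls back to 1)
    *)  echo "$1" ;;                                        # other values: taken as given (assumption)
  esac
}

NJOBS=-1
# The script is documented to set OMP_NUM_THREADS itself; this only imitates that step.
OMP_NUM_THREADS=$(resolve_njobs "$NJOBS")
export OMP_NUM_THREADS
echo "NJOBS=$NJOBS resolves to $OMP_NUM_THREADS job(s)"
```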
Corpus Options
Input corpora can be either "raw" corpora using the default DiaColloDB::Corpus class or a single "pre-compiled" corpus directory using the DiaColloDB::Corpus::Compiled conventions as created by the dcdb-corpus-compile.perl(1) script.
If a pre-compiled input corpus directory is specified, only the corpus content filters pre-compiled into the corpus itself are used, and the corpus content filter options to this script (-Opgood=REGEX etc.) will have no effect. For "raw" input corpora, a temporary DiaColloDB::Corpus::Compiled object will be created and the DiaColloDB::Corpus::Filters options to this script should be honored.
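For a raw corpus, an invocation might look like the sketch below. The paths and the regex are placeholders; only the option names are taken from this manual. The glob is quoted so that the script expands it itself (-glob is the default), and the command is executed only if dcdb-create.perl is found in PATH.

```shell
# Placeholder paths and regex; adjust to your own corpus.
OUT=./mycorpus.dcdb
set -- -dclass DDCTabs -bysent -attrs l,p \
       -option 'pgood=^N' \
       -output "$OUT" './corpus/*.tabs'
cmdline="dcdb-create.perl $*"
echo "$cmdline"                      # show what would be run
if command -v dcdb-create.perl >/dev/null 2>&1; then
  dcdb-create.perl "$@"              # run it only if the script is installed
fi
```

If INPUT were a pre-compiled corpus directory instead, the pgood filter given here would have no effect, as described above.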
- -list
- -nolist
-
Do/don't treat INPUT(s) as file-lists rather than corpus data files or pre-compiled corpus directories. Default=don't.
- -glob
- -noglob
-
Do/don't expand wildcards in INPUT(s). Has no effect for pre-compiled corpus directories. Default=do.
- -union
- -nounion
-
Do/don't treat INPUT(s) as DB directories to be merged. Creates a new physical DB by merging data from the argument INPUT(s). Default=don't.
- -lazy
- -nolazy
-
Enable/disable "lazy union" mode. If enabled, INPUT(s) are treated as DB URLs to be merged "lazily", and only a simple DiaColloDB::Client::list configuration file OUT is created, suitable for passing to dcdb-query.perl as rcfile://OUT. User options specified with -option OPT=VAL will clobber the DiaColloDB::Client::list defaults (e.g. fudge, fork, etc.). Unlike -union mode, no physical DB is created in -lazy mode; queries to the lazy client are deferred to the underlying DB URLs specified in the configuration file. The lazy configuration should behave like a physical DB created with -union, can be created in near constant time, requires only a few bytes of disk space, and may even process queries faster than a physical DB if you have the threads module installed. Default=off.
Aliases: -lazy-union, -list-union, -lu
- -dclass CLASS
-
Set corpus document class (default=DDCTabs) for raw (i.e. not pre-compiled) corpora. See "SUBCLASSES" in DiaColloDB::Document for a list of supported input formats. If you are using the default DDCTabs document class on your own (non-D*) corpus, you may also want to specify -dopt foreign=1. Has no effect for pre-compiled corpus directory INPUT(s).
- -dopt OPT=VAL
-
Set corpus document option for raw (i.e. not pre-compiled) corpora, e.g. -dopt eosre=EOSRE sets the end-of-sentence regex for the default DDCTabs document class, and -dopt foreign=1 disables D*-specific hacks. Potentially dangerous for pre-compiled corpus directory INPUT(s).
Aliases: -document-option, -docoption, -dO
- -bysent
-
Track collocations by sentence (default). Has no effect for pre-compiled corpus directory INPUT(s).
- -byparagraph
-
Track collocations by paragraph. Has no effect for pre-compiled corpus directory INPUT(s).
- -bypage
-
Track collocations by page. Has no effect for pre-compiled corpus directory INPUT(s).
- -bydoc
-
Track collocations by document. Has no effect for pre-compiled corpus directory INPUT(s).
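The two merge modes described under -union and -lazy differ mainly in what OUT is. The sketch below uses placeholder DB directories (db1.dcdb and db2.dcdb are hypothetical) and assumes that -lazy on its own selects lazy-union mode, as its alias -lazy-union suggests. It only prints the command lines for review.

```shell
# Placeholder inputs: two existing DiaColloDB directories.
DB1=./db1.dcdb
DB2=./db2.dcdb

# Physical union: OUT is a new DB directory containing the merged data.
union_cmd="dcdb-create.perl -union -output ./merged.dcdb $DB1 $DB2"

# Lazy union: OUT is only a small client configuration file.
lazy_cmd="dcdb-create.perl -lazy -output ./merged.rc $DB1 $DB2"

# Print both command lines instead of running them.
printf '%s\n' "$union_cmd" "$lazy_cmd"
```

The file written in lazy mode is the one to hand to dcdb-query.perl in the rcfile://OUT form mentioned under -lazy.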
Indexing Options
- -attrs ATTRS
-
Select attributes to be indexed (default=l,p). Known attributes include l, p, w, doc.title, doc.author, etc.
- -use-all-the-data
-
Disables default frequency- and regex-based pruning filter options, inspired by Mark Lauersdorf; equivalent to:
  -tfmin=0 \
  -lfmin=0 \
  -cfmin=0 \
  -tdf-tfmin=0 \
  -tdf-dfmin=0 \
  -tdf-nmin=0 \
  -tdf-nmax=inf \
  -O=pgood='' -O=pgoodfile='' \
  -O=wgood='' -O=wgoodfile='' \
  -O=lgood='' -O=lgoodfile='' \
  -O=pbad='' -O=pbadfile='' \
  -O=wbad='' -O=wbadfile='' \
  -O=lbad='' -O=lbadfile='' \
  -tO=mgood='' \
  -tO=mbad=''
Corpus content filters (pgood, pgoodfile, ..., lbad, lbadfile) have no effect for pre-compiled corpus directory INPUT(s).
Aliases: -all, -noprune, -nofilters, -F
- -64bit
-
Use 64-bit quads to index integer IDs where available.
- -32bit
-
Use 32-bit integers where available (default).
- -dmax DIST
-
Specify maximum distance for indexed co-occurrences (default=5).
- -tfmin TFMIN
-
Specify minimum global term frequency (default=2). A "term" in this sense is an n-tuple of indexed attributes not including the "date" component.
- -lfmin LFMIN
-
Specify minimum global lemma frequency (default=undef:TFMIN).
- -cfmin CFMIN
-
Specify minimum relation co-occurrence frequency (default=2).
- -[no]tdf
-
Do/don't create (term x document) index relation (default=if available).
- -tdf-dbreak BREAK
-
Set tdf matrix "document" granularity (e.g. s,p,page,file; default=file).
- -tdf-fmin VFMIN
-
Set minimum tdf term frequency (default=undef: use TFMIN).
- -tdf-dfmin VDFMIN
-
Set minimum term document-"frequency" (default=4).
- -tdf-nmin VNMIN
-
Set minimum number of content tokens per tdf "document" (default=8).
- -tdf-nmax VNMAX
-
Set maximum number of content tokens per tdf "document" (default=inf).
- -tdf-option OPT=VAL
-
Set arbitrary tdf matrix option, e.g.
  minFreq=INT      # -tdf-fmin: minimum term frequency
  minDocFreq=INT   # -tdf-dfmin: minimum term document-"frequency"
  minDocSize=INT   # -tdf-nmin: minimum document size (#/terms)
  maxDocSize=INT   # -tdf-nmax: maximum document size (#/terms)
  mgood=REGEX      # positive regex for document-level metadata
  mbad=REGEX       # negative regex for document-level metadata
Alias: -tO
- -option OPT=VAL
-
Set arbitrary DiaColloDB index option, e.g.
  pack_id=PACKFMT         # pack-format for IDs
  pack_f=PACKFMT          # pack-format for frequencies
  pack_date=PACKFMT       # pack-format for dates
  (p|w|l)good=REGEX       # (raw input only) positive regex for (postags|words|lemmata)
  (p|w|l)bad=REGEX        # (raw input only) negative regex for (postags|words|lemmata)
  (p|w|l)goodfile=FILE    # (raw input only) positive list-file for (postags|words|lemmata)
  (p|w|l)badfile=FILE     # (raw input only) negative list-file for (postags|words|lemmata)
  ddcServer=HOST:PORT     # server for ddc relations
  ddcTimeout=SECONDS      # timeout for ddc relations
Alias: -O
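To make the -tfmin threshold concrete: a term is a tuple of indexed attribute values, counted across all dates. The toy pipeline below only illustrates that counting rule and is not DiaColloDB's implementation. The tab-separated date/lemma/pos input is invented for the example, and the two attributes correspond to the default -attrs l,p.

```shell
TFMIN=2
# Invented toy input: one token per line as "date<TAB>lemma<TAB>pos".
tokens=$(printf '%s\t%s\t%s\n' \
  1900 Haus NN \
  1950 Haus NN \
  1950 gehen VV \
  1900 klein ADJ \
  2000 klein ADJ \
  2000 klein ADJ)

# Count each (lemma,pos) tuple regardless of date, then keep tuples with f >= TFMIN.
kept=$(printf '%s\n' "$tokens" |
  awk -F'\t' -v tfmin="$TFMIN" '
    { f[$2 "\t" $3]++ }    # the date in $1 is not part of the term
    END { for (t in f) if (f[t] >= tfmin) print t "\t" f[t] }' |
  LC_ALL=C sort)
printf '%s\n' "$kept"
```

Here (Haus, NN) and (klein, ADJ) survive with frequencies 2 and 3, while (gehen, VV) occurs only once and is pruned.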
I/O and Logging Options
- -log-level LEVEL
-
Set DiaColloDB::Logger log-level (default=TRACE).
- -log-option OPT=VAL
-
Set arbitrary DiaColloDB::Logger option (e.g. logdate, logtime, file, syslog, stderr, ...).
- -[no]keep
-
Do/don't keep temporary files (default=don't).
- -[no]mmap
-
Do/don't use mmap() for low-level index file access (default=do).
- -[no]debug
-
Do/don't enable painful debugging checks (default=don't).
- -[no]times
-
Do/don't report operating timing (default=do).
- -output OUT
-
Output directory or filename (required).
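Indexing, tdf and logging options can be combined in one call. The sketch below only assembles and prints such a command line. The paths and threshold values are arbitrary placeholders, and INFO is assumed to be an accepted LEVEL (this manual itself only names TRACE).

```shell
# Placeholder paths; thresholds are arbitrary example values.
set -- -jobs 4 \
       -tfmin 5 -cfmin 3 -dmax 5 \
       -tdf -tdf-dbreak p -tdf-nmin 8 \
       -log-level INFO -nokeep -times \
       -output ./mycorpus.dcdb './corpus/*.tabs'
full_cmd="dcdb-create.perl $*"
echo "$full_cmd"    # print the assembled command line for review
```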
BUGS AND LIMITATIONS
Probably many.
ACKNOWLEDGEMENTS
Perl by Larry Wall.
AUTHOR
Bryan Jurish <moocow@cpan.org>
SEE ALSO
DiaColloDB(3pm), dcdb-corpus-compile.perl(1), dcdb-info.perl(1), dcdb-query.perl(1), dcdb-export.perl(1), perl(1).