NAME

dcdb-create.perl - create a DiaColloDB diachronic collocation database

SYNOPSIS

dcdb-create.perl [OPTIONS] [INPUT(s)...]

General Options:
  -help                ##-- this help message
  -version             ##-- report version information and exit
  -jobs NJOBS          ##-- number of threads for corpus compilation (default=-1: all cores)
  -xs , -pp            ##-- do/don't use fast XS implementations where available (default=if available)

Corpus Options:
  -list , -nolist      ##-- INPUT(s) are/aren't file-lists (default=no)
  -glob , -noglob      ##-- do/don't glob INPUT(s) argument(s) (default=do)
  -union, -nounion     ##-- do/don't treat INPUT(s) as DB directories to be merged (default=don't)
  -lazy , -nolazy      ##-- do/don't create "lazy" list-client (union mode only; default=don't)
  -dclass CLASS        ##-- set corpus document class (default=DDCTabs)
  -dopt OPT=VAL        ##-- set corpus document option, e.g.
                       ##   eosre=EOSRE  # eos regex (default='^$')
                       ##   foreign=BOOL # disable D*-specific heuristics
  -bysent              ##-- track collocations by sentence (default)
  -byparagraph         ##-- track collocations by paragraph
  -bypage              ##-- track collocations by page
  -bydoc               ##-- track collocations by document

Indexing Options:
  -attrs ATTRS         ##-- select index attributes (default=l,p)
                       ##   known attributes: l, p, w, doc.title, ...
  -use-all-the-data    ##-- disable default frequency- and regex-filters
  -64bit               ##-- use 64-bit quads where available
  -32bit               ##-- use 32-bit integers where available
  -dmax DIST           ##-- maximum distance for indexed co-occurrences (default=5)
  -tfmin TFMIN         ##-- minimum global term frequency (default=2)
  -lfmin LFMIN         ##-- minimum global lemma frequency (default=undef:tfmin)
  -cfmin CFMIN         ##-- minimum relation co-occurrence frequency (default=2)
  -[no]tdf             ##-- do/don't create (term x document) index relation (default=if available)
  -tdf-dbreak BREAK    ##-- set tdf matrix "document" granularity (e.g. s,p,page,file; default=file)
  -tdf-fmin VFMIN      ##-- set minimum tdf term frequency (default=undef: TFMIN)
  -tdf-dfmin VDFMIN    ##-- set minimum tdf term "document"-frequency (default=4)
  -tdf-nmin VNMIN      ##-- set minimum number of content tokens per tdf "document" (default=8)
  -tdf-nmax VNMAX      ##-- set maximum number of content tokens per tdf "document" (default=inf)
  -tdf-option OPT=VAL  ##-- set arbitrary tdf matrix option, e.g.
                       ##   minFreq=INT            # minimum term frequency (default=undef: use TFMIN)
                       ##   minDocFreq=INT         # minimum term document-"frequency" (default=4)
                       ##   minDocSize=INT         # minimum document size (#/terms) (default=4)
                       ##   maxDocSize=INT         # maximum document size (#/terms) (default=inf)
                       ##   mgood=REGEX            # positive regex for document-level metadata
                       ##   mbad=REGEX             # negative regex for document-level metadata
  -option OPT=VAL      ##-- set arbitrary DiaColloDB option, e.g.
                       ##   pack_id=PACKFMT        # pack-format for IDs
                       ##   pack_f=PACKFMT         # pack-format for frequencies
                       ##   pack_date=PACKFMT      # pack-format for dates
                       ##   (p|w|l)good=REGEX      # positive regex for (postags|words|lemmata)
                       ##   (p|w|l)bad=REGEX       # negative regex for (postags|words|lemmata)
                       ##   (p|w|l)goodfile=FILE   # positive list-file for (postags|words|lemmata)
                       ##   (p|w|l)badfile=FILE    # negative list-file for (postags|words|lemmata)
                       ##   ddcServer=HOST:PORT    # server for ddc relations
                       ##   ddcTimeout=SECONDS     # timeout for ddc relations

I/O and Logging Options:
  -log-level LEVEL     ##-- set log-level (default=TRACE)
  -log-option OPT=VAL  ##-- set log option (e.g. logdate, logtime, file, syslog, stderr, ...)
  -[no]keep            ##-- do/don't keep temporary files (default=don't)
  -[no]mmap            ##-- do/don't use mmap for file access (default=do)
  -[no]debug           ##-- do/don't enable painful debugging checks (default=don't)
  -[no]times           ##-- do/don't report operating timing (default=do)
  -output OUT          ##-- output directory or client configuration file (required)

Environment Variables:
  DIACOLLO_SORT        ##-- system sort command prefix
  SORT                 ##-- fallback for DIACOLLO_SORT

DESCRIPTION

dcdb-create.perl compiles a DiaColloDB diachronic collocation database from a tokenized and annotated input corpus, or merges multiple existing DiaColloDB databases into a single database directory. The resulting database can be queried with the dcdb-query.perl(1) script, or wrapped into a web-service with the help of the DiaColloDB::WWW utilities; see the DiaColloDB::WWW documentation for details.

OPTIONS AND ARGUMENTS

Arguments

INPUT(s)

File(s), glob(s), file-list(s) to be indexed or existing indices to be merged. Interpretation depends on the -glob, -list, -union, and -lazy options.

General Options

-help

Display a brief help message and exit.

-version

Display version information and exit.

-jobs NJOBS

Run NJOBS parallel compilation threads. If NJOBS is 0, only a single thread is run. The default value (-1) runs as many jobs as there are cores on the (unix/linux) system; see "nJobs" in DiaColloDB::Utils for details. Also sets the environment variable OMP_NUM_THREADS after interpreting the NJOBS request.
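The NJOBS rules above can be sketched as a small shell helper. This is only an illustration of the documented behavior, not the code used by DiaColloDB::Utils; the function name resolve_njobs is hypothetical, and treating every negative value like -1 is an assumption.

```shell
# Hypothetical helper: map an NJOBS request to a concrete thread count.
resolve_njobs() {
  njobs="$1"
  if [ "$njobs" -lt 0 ]; then
    # Assumption: any negative value means "all cores", as for the default -1.
    # getconf reports the number of online cores; fall back to 1 if it fails.
    njobs="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"
  elif [ "$njobs" -eq 0 ]; then
    # 0 is documented as "run only a single thread".
    njobs=1
  fi
  echo "$njobs"
}

resolve_njobs 0    # prints 1
resolve_njobs 4    # prints 4
```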

Corpus Options

Input corpora can be either "raw" corpora using the default DiaColloDB::Corpus class or a single "pre-compiled" corpus directory using the DiaColloDB::Corpus::Compiled conventions as created by the dcdb-corpus-compile.perl(1) script.

If a pre-compiled input corpus directory is specified, only the corpus content filters pre-compiled into the corpus itself are used, and the corpus content filter options to this script (-Opgood=REGEX etc.) will have no effect. For "raw" input corpora, a temporary DiaColloDB::Corpus::Compiled object will be created and the DiaColloDB::Corpus::Filters options to this script should be honored.

-list
-nolist

Do/don't treat INPUT(s) as file-lists rather than corpus data files or pre-compiled corpus directories. Default=don't.
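A minimal sketch of -list usage follows. It assumes that a file-list is a plain text file with one input path per line, which is not spelled out above; the paths corpus/a.tabs, corpus/b.tabs, files.list and mydb.d are made up for illustration, and the command is only printed, not executed.

```shell
# Assumption: a file-list holds one input path per line.
workdir="$(mktemp -d)"
printf '%s\n' corpus/a.tabs corpus/b.tabs > "$workdir/files.list"

# Dry run: build and print the command line; run it with: sh -c "$cmd"
cmd="dcdb-create.perl -list -output mydb.d $workdir/files.list"
printf '%s\n' "$cmd"
```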

-glob
-noglob

Do/don't expand wildcards in INPUT(s). Has no effect for pre-compiled corpus directories. Default=do.

-union
-nounion

Do/don't treat INPUT(s) as DB directories to be merged. Creates a new physical DB by merging data from the argument INPUT(s). Default=don't.

-lazy
-nolazy

Enable/disable "lazy union" mode. If enabled, INPUT(s) are treated as DB URLs to be merged "lazily", and only a simple DiaColloDB::Client::list configuration file OUT is created, suitable for passing to dcdb-query.perl as rcfile://OUT. User options specified with -option OPT=VAL will clobber the DiaColloDB::Client::list defaults (e.g. fudge, fork, etc.). Unlike -union mode, no physical DB is created in -lazy mode; queries to the lazy client are deferred to the underlying DB URLs specified in the configuration file. The lazy configuration should behave like a physical DB created with -union, can be created in near constant time, requires only a few bytes of disk space, and may even process queries faster than a physical DB if you have the threads module installed.

Default=off.

Aliases: -lazy-union, -list-union, -lu
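The difference between the two merge modes might look like this on the command line. The DB directories db1.d and db2.d and the output names union.d and union.rc are hypothetical, and the sketch assumes that -lazy on its own selects lazy union mode, as its -lazy-union alias suggests. The commands are printed rather than executed.

```shell
# Hypothetical existing DB directories to be merged.
dbs="db1.d db2.d"

# Physical union: merges the data into a new DB directory, union.d.
union_cmd="dcdb-create.perl -union -output union.d $dbs"

# Lazy union: writes only a small client configuration file, union.rc.
lazy_cmd="dcdb-create.perl -lazy -output union.rc $dbs"

# Dry run: print both commands; run one with e.g.: sh -c "$lazy_cmd"
printf '%s\n' "$union_cmd" "$lazy_cmd"
```

The resulting union.rc would then be passed to dcdb-query.perl as rcfile://union.rc.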

-dclass CLASS

Set corpus document class (default=DDCTabs) for raw (i.e. not pre-compiled) corpora. See "SUBCLASSES" in DiaColloDB::Document for a list of supported input formats. If you are using the default DDCTabs document class on your own (non-D*) corpus, you may also want to specify -dopt foreign=1.

Has no effect for pre-compiled corpus directory INPUT(s).

-dopt OPT=VAL

Set corpus document option for raw (i.e. not pre-compiled) corpora, e.g. -dopt eosre=EOSRE sets the end-of-sentence regex for the default DDCTabs document class, and -dopt foreign=1 disables D*-specific hacks.

Potentially dangerous for pre-compiled corpus directory INPUT(s).

Aliases: -document-option, -docoption, -dO

-bysent

Track collocations by sentence (default). Has no effect for pre-compiled corpus directory INPUT(s).

-byparagraph

Track collocations by paragraph. Has no effect for pre-compiled corpus directory INPUT(s).

-bypage

Track collocations by page. Has no effect for pre-compiled corpus directory INPUT(s).

-bydoc

Track collocations by document. Has no effect for pre-compiled corpus directory INPUT(s).

Indexing Options

-attrs ATTRS

Select attributes to be indexed (default=l,p). Known attributes include l, p, w, doc.title, doc.author, etc.

-use-all-the-data

Disables default frequency- and regex-based pruning filter options, inspired by Mark Lauersdorf; equivalent to:

-tfmin=0 \
-lfmin=0 \
-cfmin=0 \
-tdf-fmin=0 \
-tdf-dfmin=0 \
-tdf-nmin=0 \
-tdf-nmax=inf \
-O=pgood='' -O=pgoodfile='' \
-O=wgood='' -O=wgoodfile='' \
-O=lgood='' -O=lgoodfile='' \
-O=pbad='' -O=pbadfile='' \
-O=wbad='' -O=wbadfile='' \
-O=lbad='' -O=lbadfile='' \
-tO=mgood='' \
-tO=mbad=''

Corpus content filters (pgood, pgoodfile, ..., lbad, lbadfile) have no effect for pre-compiled corpus directory INPUT(s).

Aliases: -all, -noprune, -nofilters, -F

-64bit

Use 64-bit quads to index integer IDs where available.

-32bit

Use 32-bit integers where available (default).

-dmax DIST

Specify maximum distance for indexed co-occurrences (default=5).

-tfmin TFMIN

Specify minimum global term frequency (default=2). A "term" in this sense is an n-tuple of indexed attributes not including the "date" component.
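To make the notion of a "term" concrete, the following toy sketch counts (lemma, pos) tuples over all dates and keeps those reaching TFMIN. It illustrates the concept only and is not how DiaColloDB computes or stores frequencies; the token table and its values are invented.

```shell
# Toy token table: date, lemma, pos (tab-separated); all values are made up.
tokens="$(printf '1900\thouse\tNN\n1950\thouse\tNN\n1950\trun\tVV\n')"
tfmin=2

# Count each (lemma,pos) tuple regardless of its date component,
# then keep only tuples whose global frequency is at least tfmin.
kept="$(printf '%s\n' "$tokens" | awk -F'\t' -v min="$tfmin" '
  { f[$2 "\t" $3]++ }
  END { for (t in f) if (f[t] >= min) print t "\t" f[t] }')"

printf '%s\n' "$kept"
```

Here (house, NN) occurs under two different dates but counts as one term with frequency 2, so it survives the filter, while (run, VV) with frequency 1 is pruned.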

-lfmin LFMIN

Specify minimum global lemma frequency (default=undef:TFMIN).

-cfmin CFMIN

Specify minimum relation co-occurrence frequency (default=2).

-[no]tdf

Do/don't create (term x document) index relation (default=if available).

-tdf-dbreak BREAK

Set tdf matrix "document" granularity (e.g. s,p,page,file; default=file).

-tdf-fmin VFMIN

Set minimum tdf term frequency (default=undef: use TFMIN).

-tdf-dfmin VDFMIN

Set minimum term document-"frequency" (default=4).

-tdf-nmin VNMIN

Set minimum number of content tokens per tdf "document" (default=8).

-tdf-nmax VNMAX

Set maximum number of content tokens per tdf "document" (default=inf).

-tdf-option OPT=VAL

Set arbitrary tdf matrix option, e.g.

minFreq=INT            # -tdf-fmin: minimum term frequency
minDocFreq=INT         # -tdf-dfmin: minimum term document-"frequency"
minDocSize=INT         # -tdf-nmin: minimum document size (#/terms)
maxDocSize=INT         # -tdf-nmax: maximum document size (#/terms)
mgood=REGEX            # positive regex for document-level metadata
mbad=REGEX             # negative regex for document-level metadata

Alias: -tO

-option OPT=VAL

Set arbitrary DiaColloDB index option, e.g.

pack_id=PACKFMT        # pack-format for IDs
pack_f=PACKFMT         # pack-format for frequencies
pack_date=PACKFMT      # pack-format for dates
(p|w|l)good=REGEX      # (raw input only) positive regex for (postags|words|lemmata)
(p|w|l)bad=REGEX       # (raw input only) negative regex for (postags|words|lemmata)
(p|w|l)goodfile=FILE   # (raw input only) positive list-file for (postags|words|lemmata)
(p|w|l)badfile=FILE    # (raw input only) negative list-file for (postags|words|lemmata)
ddcServer=HOST:PORT    # server for ddc relations
ddcTimeout=SECONDS     # timeout for ddc relations

Alias: -O
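As a rough illustration of a positive content filter, the sketch below passes a hypothetical pgood regex and uses grep to show which part-of-speech tags such a pattern would accept. It assumes the regex is matched against each tag as written; the tags NN, VVFIN and ART and the names mydb.d and corpus/*.tabs are example values only, and the command is printed rather than executed.

```shell
# Hypothetical filter: keep only part-of-speech tags beginning with N or V.
pgood='^(N|V)'

# Dry run: build and print the command line; run it with: sh -c "$cmd"
cmd="dcdb-create.perl -option pgood='$pgood' -output mydb.d 'corpus/*.tabs'"
printf '%s\n' "$cmd"

# Illustration only: which example tags would match the regex?
kept="$(printf 'NN\nVVFIN\nART\n' | grep -E "$pgood")"
printf '%s\n' "$kept"
```

The quoted glob is left for dcdb-create.perl itself to expand, which is the default -glob behavior.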

I/O and Logging Options

-log-level LEVEL

Set DiaColloDB::Logger log-level (default=TRACE).

-log-option OPT=VAL

Set arbitrary DiaColloDB::Logger option (e.g. logdate, logtime, file, syslog, stderr, ...).

-[no]keep

Do/don't keep temporary files (default=don't).

-[no]mmap

Do/don't use mmap() for low-level index file access (default=do).

-[no]debug

Do/don't enable painful debugging checks (default=don't).

-[no]times

Do/don't report operating timing (default=do).

-output OUT

Output directory or filename (required).

BUGS AND LIMITATIONS

Probably many.

ACKNOWLEDGEMENTS

Perl by Larry Wall.

AUTHOR

Bryan Jurish <moocow@cpan.org>

SEE ALSO

DiaColloDB(3pm), dcdb-corpus-compile.perl(1), dcdb-info.perl(1), dcdb-query.perl(1), dcdb-export.perl(1), perl(1).