NAME
dcdb-corpus-compile.perl - pre-compile a DiaColloDB corpus
SYNOPSIS
dcdb-corpus-compile.perl [OPTIONS] [INPUT(s)...]
General Options:
-h, -help # this help message
-V, -version # report version information and exit
-j, -jobs NJOBS # set number of worker threads
Input Corpus Options:
-l, -[no]list # INPUT(s) are/aren't file-lists (default=no)
-g, -[no]glob # do/don't glob INPUT(s) argument(s) (default=don't)
-u, -[no]union # do/don't treat INPUT(s) as pre-compiled corpora to be merged (default=don't)
-C, -dclass CLASS # set corpus document class (default=DDCTabs)
-D, -dopt OPT=VAL # set corpus document option, e.g.
# eosre=EOSRE # eos regex (default='^$')
# foreign=BOOL # disable D*-specific heuristics
-bysent # default split by sentences (default)
-byparagraph # default split by paragraphs
-bypage # default split by page
-bydoc # default split by document
Content Filter Options:
-f, -filter KEY=VAL # set filter option for KEY = (p|w|l)(bad|good)(_file)?
# (p|w|l)good=REGEX # positive regex for (postags|words|lemmata)
# (p|w|l)bad=REGEX # negative regex for (postags|words|lemmata)
# (p|w|l)goodfile=FILE # positive list-file for (postags|words|lemmata)
# (p|w|l)badfile=FILE # negative list-file for (postags|words|lemmata)
-F, -nofilters # clear all filter options
I/O and Logging Options:
-ll, -log-level LVL # set log-level (default=TRACE)
-lo, -log-option K=V # set log option (e.g. logdate, logtime, file, syslog, stderr, ...)
-t, -[no]times # do/don't report operating timing (default=do)
-a, -[no]append # do/don't append to existing output corpus (default=don't)
-o, -output OUTDIR # set output corpus directory (required)
DESCRIPTION
dcdb-corpus-compile.perl pre-compiles a DiaColloDB::Corpus::Compiled from a tokenized and annotated input corpus represented as a DiaColloDB::Corpus object, optionally applying content filters such as stopword lists. The resulting compiled corpus can be used with dcdb-create.perl(1) to compile a DiaColloDB collocation database.
Note that it is not necessary to pre-compile a corpus with this script in order to create a fully functional DiaColloDB database from a source corpus, since the DiaColloDB::create() method as invoked by the dcdb-create.perl(1) script should implicitly create a (temporary) DiaColloDB::Corpus::Compiled object as and when required.
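As a minimal sketch of the pre-compilation step described above, the following shell fragment compiles a directory of input files into a compiled corpus directory. The file pattern and output directory are hypothetical placeholders, and the RUN variable is only an illustrative dry-run switch, not part of the script itself. Passing -glob explicitly avoids relying on the default setting.

```shell
# Dry run by default: prints the command line instead of executing it.
# Set RUN= (empty) in the environment to really invoke the script.
RUN=${RUN-echo}

INPUT_GLOB='corpus/*.tabs'     # hypothetical input files (default DDCTabs format assumed)
OUTDIR='corpus.compiled.d'     # hypothetical output directory

# Quote the pattern so that the script (via -glob), not the shell, expands it.
$RUN dcdb-corpus-compile.perl -glob -bysent -output "$OUTDIR" "$INPUT_GLOB"
```

The resulting output directory is what would then be handed to dcdb-create.perl(1); see that script's own documentation for how compiled corpora are passed in.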
OPTIONS AND ARGUMENTS
Arguments
- INPUT(s)
-
File(s), glob(s), file-list(s), or basename(s) to be compiled. Interpretation depends on the -glob, -list, and -union options.
General Options
- -help
-
Display a brief help message and exit.
- -version
-
Display version information and exit.
- -jobs NJOBS
-
Run NJOBS parallel compilation threads. If NJOBS is 0, only a single thread is run. The default value (-1) runs as many jobs as there are cores on the (unix/linux) system; see "nJobs" in DiaColloDB::Utils for details.
Input Corpus Options
- -list
- -nolist
-
Do/don't treat INPUT(s) as file-lists rather than corpus data files. Default=don't.
- -glob
- -noglob
-
Do/don't expand wildcards in INPUT(s). Default=do.
- -union
- -nounion
-
Do/don't treat INPUT(s) as pre-compiled corpora to be merged. Note that in -union mode, no corpus content filters are applied (they are assumed to have been applied to the INPUT(s) prior to the union call). Default=don't.
- -dclass CLASS
-
Set corpus document class (default=DDCTabs). See "SUBCLASSES" in DiaColloDB::Document for a list of supported input formats. If you are using the default DDCTabs document class on your own (non-D*) corpus, you may also want to specify -dopt foreign=1.
Aliases: -C, -document-class, -dclass, -dc
- -dopt OPT=VAL
-
Set corpus document option, e.g. -dopt eosre=EOSRE sets the end-of-sentence regex for the default DDCTabs document class, and -dopt foreign=1 disables D*-specific hacks.
Aliases: -D, -document-option, -docoption, -dopt, -do, -dO
- -bysent
-
Split corpus (-> track collocations in compiled database) by sentence (default).
- -byparagraph
-
Split corpus (-> track collocations in compiled database) by paragraph.
- -bypage
-
Split corpus (-> track collocations in compiled database) by page.
- -bydoc
-
Split corpus (-> track collocations in compiled database) by document.
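The input options above can be combined as in the following sketch. All file and directory names are hypothetical, and RUN is only an illustrative dry-run switch. The first command reads input files from a file-list and disables the D*-specific heuristics; the second merges two previously compiled corpora.

```shell
# Dry run by default: prints each command line instead of executing it.
# Set RUN= (empty) in the environment to really invoke the script.
RUN=${RUN-echo}

# (1) Compile a non-D* corpus from a file-list, splitting by paragraph.
#     files.lst and own.compiled.d are hypothetical names.
$RUN dcdb-corpus-compile.perl -list -dclass DDCTabs -dopt foreign=1 \
  -byparagraph -output own.compiled.d files.lst

# (2) Merge two pre-compiled corpora (hypothetical directories part1.d, part2.d).
#     Content filters are not applied in -union mode.
$RUN dcdb-corpus-compile.perl -union -output merged.d part1.d part2.d
```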
Filter Options
- -use-all-the-data
-
Disables all content-filter options, inspired by Mark Lauersdorf; equivalent to:
 -f=pgood='' \
 -f=wgood='' \
 -f=lgood='' \
 -f=pbad='' \
 -f=wbad='' \
 -f=lbad=''
Aliases: -F, -nofilters, -A, -all, -noprune
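For contrast, a sketch of setting individual filters with the -filter KEY=VAL form from the synopsis might look as follows. The regular expressions, the tag prefix N, and the file names are illustrative assumptions only; RUN is a hypothetical dry-run switch.

```shell
# Dry run by default: prints the command line instead of executing it.
# Set RUN= (empty) in the environment to really invoke the script.
RUN=${RUN-echo}

# Keep only tokens whose part-of-speech tag starts with "N" (assumed tagset),
# and drop lemmata consisting solely of punctuation. Both regexes are examples.
$RUN dcdb-corpus-compile.perl \
  -filter 'pgood=^N' \
  -filter 'lbad=^[[:punct:]]+$' \
  -output filtered.compiled.d input.tabs
```

Single-quoting each KEY=VAL pair keeps the shell from interpreting regex metacharacters before the script sees them.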
I/O and Logging Options
- -log-level LEVEL
-
Set DiaColloDB::Logger log-level (default=TRACE).
Aliases: -ll, -log-level, -level
- -log-option OPT=VAL
-
Set arbitrary DiaColloDB::Logger option (e.g. logdate, logtime, file, syslog, stderr, ...).
Aliases: -lo, -log-option, -logopt
- -[no]times
-
Do/don't report operation timing (default=do).
Aliases: -t, -timing, -times, -time
- -output OUTDIR
-
Output directory for compiled corpus (required).
Aliases: -o, -output-directory, -outdir, -output, -out, -od
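A quieter invocation could be sketched as below. The level name INFO and the value form logtime=1 are assumptions about what DiaColloDB::Logger accepts (only the default TRACE and the option name logtime are stated above); the file names are hypothetical, and RUN is an illustrative dry-run switch.

```shell
# Dry run by default: prints the command line instead of executing it.
# Set RUN= (empty) in the environment to really invoke the script.
RUN=${RUN-echo}

# Assumed: INFO is a valid log-level, and logtime takes a boolean-like value.
$RUN dcdb-corpus-compile.perl \
  -log-level INFO \
  -log-option logtime=1 \
  -notimes \
  -output corpus.compiled.d input.tabs
```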
BUGS AND LIMITATIONS
Probably many.
ACKNOWLEDGEMENTS
Perl by Larry Wall.
AUTHOR
Bryan Jurish <moocow@cpan.org>
SEE ALSO
DiaColloDB(3pm), DiaColloDB::Corpus(3pm), DiaColloDB::Corpus::Compiled(3pm), DiaColloDB::Corpus::Filters(3pm), dcdb-create.perl(1), perl(1).