NAME

dcdb-corpus-compile.perl - pre-compile a DiaColloDB corpus

SYNOPSIS

dcdb-corpus-compile.perl [OPTIONS] [INPUT(s)...]

General Options:
  -h, -help            # this help message
  -V, -version         # report version information and exit
  -j, -jobs NJOBS      # set number of worker threads

Input Corpus Options:
  -l, -[no]list        # INPUT(s) are/aren't file-lists (default=no)
  -g, -[no]glob        # do/don't glob INPUT(s) argument(s) (default=don't)
  -u, -[no]union       # do/don't treat INPUT(s) as pre-compiled corpora to be merged (default=don't)
  -C, -dclass CLASS    # set corpus document class (default=DDCTabs)
  -D, -dopt OPT=VAL    # set corpus document option, e.g.
                       #   eosre=EOSRE  # eos regex (default='^$')
                       #   foreign=BOOL # disable D*-specific heuristics
      -bysent          # default split by sentences (default)
      -byparagraph     # default split by paragraphs
      -bypage          # default split by page
      -bydoc           # default split by document

Content Filter Options:
  -f, -filter KEY=VAL  # set filter option for KEY = (p|w|l)(bad|good)(file)?
                       #   (p|w|l)good=REGEX      # positive regex for (postags|words|lemmata)
                       #   (p|w|l)bad=REGEX       # negative regex for (postags|words|lemmata)
                       #   (p|w|l)goodfile=FILE   # positive list-file for (postags|words|lemmata)
                       #   (p|w|l)badfile=FILE    # negative list-file for (postags|words|lemmata)
  -F, -nofilters       # clear all filter options

I/O and Logging Options:
  -ll, -log-level LVL  # set log-level (default=TRACE)
  -lo, -log-option K=V # set log option (e.g. logdate, logtime, file, syslog, stderr, ...)
  -t,  -[no]times      # do/don't report operation timing (default=do)
  -a,  -[no]append     # do/don't append to existing output corpus (default=don't)
  -o,  -output OUTDIR  # set output corpus directory (required)

DESCRIPTION

dcdb-corpus-compile.perl pre-compiles a DiaColloDB::Corpus::Compiled from a tokenized and annotated input corpus represented as a DiaColloDB::Corpus object, optionally applying content filters such as stopword lists. The resulting compiled corpus can be used with dcdb-create.perl(1) to compile a DiaColloDB collocation database.

Note that it is not necessary to pre-compile a corpus with this script in order to create a fully functional DiaColloDB database from a source corpus. The DiaColloDB::create() method, as invoked by the dcdb-create.perl(1) script, should implicitly create a (temporary) DiaColloDB::Corpus::Compiled object as and when required.

OPTIONS AND ARGUMENTS

Arguments

INPUT(s)

File(s), glob(s), file-list(s), or basename(s) to be compiled. Interpretation depends on the -glob, -list, and -union options.

General Options

-help

Display a brief help message and exit.

-version

Display version information and exit.

-jobs NJOBS

Run NJOBS parallel compilation threads. If NJOBS is 0, only a single thread is run. The default value (-1) runs as many jobs as there are cores on the (unix/linux) system; see "nJobs" in DiaColloDB::Utils for details.
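
As a rough sketch of choosing NJOBS explicitly rather than relying on the default, the following shell fragment queries the number of online cores with getconf and leaves one core free. The helper name pick_njobs, the policy of reserving a core, and the paths in the usage comment are illustrative assumptions, not part of DiaColloDB.

```shell
# pick_njobs: hypothetical helper; prints (online cores - 1), but never less than 1
pick_njobs() {
  # fall back to 1 if getconf is unavailable or does not know the variable
  ncores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
  if [ "$ncores" -gt 1 ]; then
    echo $((ncores - 1))
  else
    echo 1
  fi
}

# Illustrative invocation (input and output paths are placeholders):
#   dcdb-corpus-compile.perl -jobs "$(pick_njobs)" -output corpus.d input.tabs
```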

Input Corpus Options

-list
-nolist

Do/don't treat INPUT(s) as file-lists rather than corpus data files. Default=don't.
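
A file-list might be prepared as sketched below. This assumes that a file-list is a plain text file naming one corpus data file per line, which is not stated here and should be checked against your installation; the helper name make_filelist, the *.tabs suffix, and all paths are hypothetical.

```shell
# make_filelist DIR LISTFILE: hypothetical helper; writes the path of every
# *.tabs file under DIR to LISTFILE, one per line, in sorted order
make_filelist() {
  find "$1" -type f -name '*.tabs' | sort > "$2"
}

# Illustrative usage (paths are placeholders):
#   make_filelist data files.list
#   dcdb-corpus-compile.perl -list -output corpus.d files.list
```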

-glob
-noglob

Do/don't expand wildcards in INPUT(s). Default=do.

-union
-nounion

Do/don't treat INPUT(s) as pre-compiled corpora to be merged. Note that in -union mode, no corpus content filters are applied (they are assumed to have been applied to the INPUT(s) prior to the union call). Default=don't.
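
One possible two-stage workflow is sketched below: each part is compiled separately (with any content filters applied at that stage), and the compiled parts are then merged with -union. The wrapper name compile_and_merge, the PART.d naming scheme, and the DRYRUN switch are assumptions made for illustration only.

```shell
# compile_and_merge OUTDIR PART...: hypothetical wrapper.
# Compiles each PART into PART.d, then merges the results into OUTDIR.
# Set DRYRUN=1 to print the commands instead of running them.
compile_and_merge() {
  outdir=$1; shift
  compiled=""
  for part in "$@"; do
    ${DRYRUN:+echo} dcdb-corpus-compile.perl -output "$part.d" "$part" || return 1
    compiled="$compiled $part.d"
  done
  # -union: inputs are pre-compiled corpora; content filters are not re-applied.
  # $compiled is deliberately unquoted so that it splits into separate
  # arguments, hence paths must not contain whitespace.
  ${DRYRUN:+echo} dcdb-corpus-compile.perl -union -output "$outdir" $compiled
}

# Illustrative usage (paths are placeholders):
#   compile_and_merge all.d part1.tabs part2.tabs
```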

-dclass CLASS

Set corpus document class (default=DDCTabs). See "SUBCLASSES" in DiaColloDB::Document for a list of supported input formats. If you are using the default DDCTabs document class on your own (non-D*) corpus, you may also want to specify -dopt foreign=1.

Aliases: -C, -document-class, -dclass, -dc

-dopt OPT=VAL

Set corpus document option, e.g. -dopt eosre=EOSRE sets the end-of-sentence regex for the default DDCTabs document class, and -dopt foreign=1 disables D*-specific hacks.

Aliases: -D, -document-option, -docoption, -dopt, -do, -dO
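
For a non-D* corpus in the default DDCTabs format, the -dclass and -dopt options described above might be combined as in the following sketch. The wrapper name compile_foreign, the DRYRUN switch, and the paths are hypothetical; only the option names are taken from this manual.

```shell
# compile_foreign OUTDIR INPUT...: hypothetical wrapper for a non-D* corpus.
# Set DRYRUN=1 to print the command instead of running it.
compile_foreign() {
  outdir=$1; shift
  # foreign=1 disables the D*-specific heuristics of the DDCTabs class
  ${DRYRUN:+echo} dcdb-corpus-compile.perl \
    -dclass DDCTabs -dopt foreign=1 \
    -output "$outdir" "$@"
}

# Illustrative usage (paths are placeholders):
#   compile_foreign corpus.d input.tabs
```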

-bysent

Split the corpus by sentence, and hence track collocations in the compiled database by sentence (default).

-byparagraph

Split the corpus by paragraph, and hence track collocations in the compiled database by paragraph.

-bypage

Split the corpus by page, and hence track collocations in the compiled database by page.

-bydoc

Split the corpus by document, and hence track collocations in the compiled database by document.

Filter Options

-use-all-the-data

Disables all content-filter options, inspired by Mark Lauersdorf; equivalent to:

  -f=pgood='' \
  -f=wgood='' \
  -f=lgood='' \
  -f=pbad='' \
  -f=wbad='' \
  -f=lbad=''

Aliases: -F, -nofilters, -A, -all, -noprune
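
Conversely, individual filters can be set with the -filter KEY=VAL option listed in the synopsis. The sketch below combines a negative word list-file with a positive part-of-speech regex. The wrapper name compile_filtered, the DRYRUN switch, the regex ^N, and the paths are illustrative assumptions; the expected layout of a list-file is not specified here and should be checked in DiaColloDB::Corpus::Filters.

```shell
# compile_filtered STOPFILE OUTDIR INPUT...: hypothetical wrapper.
# Drops words listed in STOPFILE and keeps only tokens whose part-of-speech
# tag matches the (illustrative) regex ^N.
# Set DRYRUN=1 to print the command instead of running it.
compile_filtered() {
  stopfile=$1; outdir=$2; shift 2
  ${DRYRUN:+echo} dcdb-corpus-compile.perl \
    -filter "wbadfile=$stopfile" \
    -filter "pgood=^N" \
    -output "$outdir" "$@"
}

# Illustrative usage (paths are placeholders):
#   compile_filtered stopwords.txt corpus.d input.tabs
```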

I/O and Logging Options

-log-level LEVEL

Set DiaColloDB::Logger log-level (default=TRACE).

Aliases: -ll, -log-level, -level

-log-option OPT=VAL

Set arbitrary DiaColloDB::Logger option (e.g. logdate, logtime, file, syslog, stderr, ...).

Aliases: -lo, -log-option, -logopt

-[no]times

Do/don't report operation timing (default=do).

Aliases: -t, -timing, -times, -time

-output OUTDIR

Output directory for compiled corpus (required).

Aliases: -o, -output-directory, -outdir, -output, -out, -od

BUGS AND LIMITATIONS

Probably many.

ACKNOWLEDGEMENTS

Perl by Larry Wall.

AUTHOR

Bryan Jurish <moocow@cpan.org>

SEE ALSO

DiaColloDB(3pm), DiaColloDB::Corpus(3pm), DiaColloDB::Corpus::Compiled(3pm), DiaColloDB::Corpus::Filters(3pm), dcdb-create.perl(1), perl(1).