NAME

CWB - Perl toolbox for the IMS Open Corpus Workbench

SYNOPSIS

use CWB;

# full pathnames of CQP and the CWB tools
$CWB::CQP;             # cqp
$CWB::Config;          # cwb-config
$CWB::Encode;          # cwb-encode
$CWB::Makeall;         # cwb-makeall
$CWB::Decode;          # cwb-decode
$CWB::Lexdecode;       # cwb-lexdecode
$CWB::DescribeCorpus;  # cwb-describe-corpus
$CWB::Huffcode;        # cwb-huffcode
$CWB::CompressRdx;     # cwb-compress-rdx
$CWB::Itoa;            # cwb-itoa
$CWB::Atoi;            # cwb-atoi
$CWB::SEncode;         # cwb-s-encode
$CWB::SDecode;         # cwb-s-decode
$CWB::ScanCorpus;      # cwb-scan-corpus
$CWB::Align;           # cwb-align
$CWB::AlignEncode;     # cwb-align-encode
$CWB::CQPserver;       # cqpserver

# default registry directory and effective registry setting
$CWB::DefaultRegistry;
@dirs = CWB::RegistryDirectory(); # may return multiple directories

# open filehandle for reading or writing
# automagically compresses/decompresses files and dies on error
$fh = CWB::OpenFile("> my_file.gz");
$fh = CWB::OpenFile(">", "my_file.gz"); # as in 3-argument open() call

# temporary file objects (disk files are automatically removed)
$t1 = new CWB::TempFile;             # picks a unique filename
$t2 = new CWB::TempFile "mytemp";    # extends prefix to unique name
$t3 = new CWB::TempFile "mytemp.gz"; # compressed temporary file
$filename = $t1->name;        # full pathname of temporary file
$t1->write(...);              # works like $fh->print()
$t1->finish;                  # stop writing file
print $t1->status, "\n";      # WRITING/FINISHED/READING/DELETED
# main program can read or overwrite file <$filename> now
$line = $t1->read;            # read one line, like $fh->getline()
$t1->rewind;                  # re-read from beginning of file
$line = $t1->read;            # (reads first line again)
$t1->close;                   # stop reading and delete temporary file
# other files will be deleted when objects $t2 and $t3 are destroyed

# execute shell command with automatic error detection
$cmd = "ls -l";
$errlevel = CWB::Shell::Cmd($cmd);   # dies with error message if not ok
# $errlevel: 0 (ok), 1 (minor problems), ..., 6 (fatal error)
@lines = ();
CWB::Shell::Cmd($cmd, \@lines);      # capture standard output in array
CWB::Shell::Cmd($cmd, "files.txt");  # ... or in file (for large amounts of data)
$CWB::Shell::Paranoid = 1;    # more paranoid checks (-1 for less paranoid)

$quoted = CWB::Shell::Quote($string); # quote arbitrary string as shell argument
CWB::Shell::Cmd([$prog, $arg, ...], \@lines); # auto-quotes individual arguments

# read / modify / write registry files (must be in canonical format)
$reg = new CWB::RegistryFile; # create new registry file
$reg = new CWB::RegistryFile "/corpora/c1/registry/dickens";  # load file
die "failed" unless defined $reg;    # will fail if not in canonical format

$reg = new CWB::RegistryFile "dickens";       # search in standard registry
$filename = $reg->filename;                   # retrieve full pathname

# edit standard fields
$name = $reg->name;           # read NAME field
$reg->name("Charles Dickens");# modify NAME field
$corpus_id = $reg->id;        # same for ID, HOME, INFO
$home_dir = $reg->home;
$info_file = $reg->info;
$reg->delete_info;            # INFO line is optional and may be deleted

# edit corpus properties
@properties = $reg->list_properties;
$value = $reg->property("language");  # get property value
$reg->property("language", "en");     # set / add property
$reg->delete_property("language");

# edit attributes ('p'=positional, 's'=structural, 'a'=alignment)
@attr = $reg->list_attributes;        # list all attributes
@s_attr = $reg->list_attributes('s'); # list structural attributes
$type = $reg->attribute("word");      # 'p'/'s'/'a' or undef
$reg->delete_attribute("np");
$reg->add_attribute("np", 's');       # specify type when adding attribute
$dir = $reg->attribute_path("lemma"); # may be stored in different directory
$reg->attribute_path("lemma", $dir);  # set attribute path
$reg->delete_attribute_path("lemma"); # default location is HOME directory

# comment lines (preceding field/declaration) and inline comments use keys:
#   ":NAME", ":ID", ... "::$property", ... "$attribute", ...
@lines = $reg->comments(":HOME");     # comment lines before HOME field
$reg->set_comments(":INFO", @lines);  # overwrite existing comments
$reg->add_comments("::language", "", "comment for language property", "");
$reg->set_comments("::language");     # delete comments before property
$comment = $reg->line_comment("np");  # inline comment of np attribute
$reg->line_comment("word", "the required word attribute");  # set comment
$reg->delete_line_comment("word");    # delete inline comment

# (over)write registry file (requires full pathname)
$reg->write("/corpora/c1/registry/dickens");

DESCRIPTION

This module offers basic support for using the IMS Open Corpus Workbench (http://cwb.sourceforge.net/) from Perl scripts. Several additional functions are included to perform tasks that are often needed by corpus-related scripts.

CWB PATHNAMES

Package variables give the full pathnames of CQP and the CWB tools, so they can be used in shell commands even when they are not installed in the user's search path. The following variables are available:

$CWB::CQP;             # cqp
$CWB::Config;          # cwb-config
$CWB::Encode;          # cwb-encode
$CWB::Makeall;         # cwb-makeall
$CWB::Decode;          # cwb-decode
$CWB::Lexdecode;       # cwb-lexdecode
$CWB::DescribeCorpus;  # cwb-describe-corpus
$CWB::Huffcode;        # cwb-huffcode
$CWB::CompressRdx;     # cwb-compress-rdx
$CWB::Itoa;            # cwb-itoa
$CWB::Atoi;            # cwb-atoi
$CWB::SEncode;         # cwb-s-encode
$CWB::SDecode;         # cwb-s-decode
$CWB::ScanCorpus;      # cwb-scan-corpus
$CWB::Align;           # cwb-align
$CWB::AlignEncode;     # cwb-align-encode
$CWB::CQPserver;       # cqpserver

Other configuration information includes the general installation prefix, the directory containing CWB binaries (which might be used to install additional software related to the CWB), and the default registry directory. NB: individual install paths may have overridden the general prefix, so the package variable $CWB::Prefix does not have much practical importance. Use the cwb-config program to find out the precise installation paths.

$CWB::Prefix;          # general installation prefix
$CWB::BinDir;          # directory for CWB binaries (executable programs)
$CWB::DefaultRegistry; # compiled-in default registry directory
$CWB::CWBVersion;      # release version of the CWB binaries (Perl-style)

Note that $CWB::CWBVersion refers to the release version of the CWB binaries rather than the Perl module ($CWB::VERSION). All version numbers are encoded in Perl numeric style (e.g. 3.005_001 for CWB v3.5.1), so specific version requirements can easily be checked by numeric comparison.
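The numeric comparison mentioned above might look like the following sketch. The helper perl_style_version is hypothetical and not part of the CWB module; it only converts a dotted release string into the numeric style. In a real script, $CWB::CWBVersion would take the place of $installed.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CWB): convert "major.minor.patch"
# into Perl numeric style, e.g. "3.5.1" -> 3.005001.
sub perl_style_version {
    my ($dotted) = @_;
    my ($major, $minor, $patch) = split /\./, $dotted;
    return 0 + sprintf("%d.%03d%03d", $major, $minor // 0, $patch // 0);
}

# Stand-in for $CWB::CWBVersion, so the sketch runs without CWB installed.
my $installed = perl_style_version("3.5.1");

if ($installed >= perl_style_version("3.5.0")) {
    print "CWB 3.5 or newer\n";
}
else {
    print "CWB release is too old\n";
}
```

Because the versions are ordinary numbers, no version-parsing module is needed for the check itself.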

MISCELLANEOUS FUNCTIONS

@dirs = CWB::RegistryDirectory();

The function CWB::RegistryDirectory can be used to determine the effective registry directory (either the compiled-in default registry or a setting made in the CORPUS_REGISTRY environment variable). It is possible to specify multiple registry directories, so CWB::RegistryDirectory returns a list of strings.

$fh = CWB::OpenFile($name);
$fh = CWB::OpenFile($mode, $name);

Open file $name for reading, writing, or appending. Returns a FileHandle object if successful; otherwise it dies with an error message. It is thus never necessary to check whether $fh is defined.

If CWB::OpenFile is called with two arguments, $mode indicates the file access mode: < for reading, > for writing, >> for appending, |- for a write pipe and -| for a read pipe (see "open" in perlfunc for details). In this form, I/O layers can be appended to the access mode. For example, to read a .gz file in ISO-8859-1 encoding, you can use the command

$fh = CWB::OpenFile("<:encoding(latin1)", $filename);

In the one-argument form, CWB::OpenFile examines the file name for an embedded access mode specifier. If $name starts with > the file is opened for writing (an existing file will be overwritten), if it starts with >> the file is opened for appending. The default is to open the file for reading, which can optionally be made explicit by a leading <. A | at the start or end of $name opens a write or read pipe, respectively.

Files with extension .Z, .gz, .bz2 or .xz are automatically compressed and decompressed, provided that the necessary programs are installed. It is also possible to append to .gz and .bz2 files.
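The rules for the one-argument form can be illustrated with a small stand-alone sketch. The function split_file_spec is hypothetical and is not the module's own code; in particular, the tolerance for whitespace between the mode specifier and the file name is an assumption based on the "> my_file.gz" example in the synopsis.

```perl
use strict;
use warnings;

# Illustrative only: split a one-argument file spec into (mode, name)
# following the rules described above. Not the module's actual code.
sub split_file_spec {
    my ($spec) = @_;
    return (">>", $1) if $spec =~ /^\s*>>\s*(.+)$/;   # append
    return (">",  $1) if $spec =~ /^\s*>\s*(.+)$/;    # write (overwrite)
    return ("|-", $1) if $spec =~ /^\s*\|\s*(.+)$/;   # leading |: write pipe
    return ("-|", $1) if $spec =~ /^(.+?)\s*\|\s*$/;  # trailing |: read pipe
    return ("<",  $1) if $spec =~ /^\s*<\s*(.+)$/;    # explicit read
    return ("<",  $spec);                             # default: read
}

for my $spec ("> my_file.gz", ">> log.txt", "data.txt", "| sort", "ls -l |") {
    my ($mode, $name) = split_file_spec($spec);
    print "$spec => mode '$mode', name '$name'\n";
}
```

The returned pair corresponds to the arguments of the two-argument form, CWB::OpenFile($mode, $name).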

TEMPORARY FILES

Temporary files (implemented by CWB::TempFile objects) are created with a unique name and are automatically deleted when the script exits. The life cycle of a temporary file consists of four stages: create, write, read (possibly re-read), delete. This cycle corresponds to the following method calls:

 $tf = new CWB::TempFile;  # create new temporary file in /tmp dir
 $tf->write(...);     # write cycle (buffered output, like print function)
 $tf->finish;         # complete write cycle (flushes buffer)
 $line = $tf->read;   # read cycle (like getline method for FileHandle)
[$tf->rewind;         # optional: start re-reading temporary file ]
[$line = $tf->read;                                               ]
 $tf->close;          # delete temporary file

Once the temporary file has been read from, it cannot be re-written; a new CWB::TempFile object has to be created for the next cycle. When the write stage is completed (but before reading has started, i.e. after calling the finish method), the temporary file can be accessed and/or overwritten by external programs. Use the name method to obtain its full pathname. If no direct access to the temporary file is required, the finish method is optional. The write cycle will automatically be completed before the first read method call.

$tf = new CWB::TempFile [ $prefix ];

Creates a temporary file in the /tmp directory. If the optional $prefix is specified, the filename will begin with $prefix and be extended to a unique name. If $prefix contains a / character, it is interpreted as an absolute or relative path, and the temporary file will not be created in the /tmp directory. To create a temporary file in the current working directory, use ./MyPrefix.

You can add the extension .Z, .gz, or .bz2 to $prefix in order to create a compressed temporary file. The actual filename (returned by the name method) will have the same extension in this case.

The temporary file is immediately created and opened for writing.

$tf->close;

Closes all open file handles and deletes the temporary file. This will be done automatically when the CWB::TempFile object is destroyed. Use close to free disk space immediately.

$filename = $tf->name;

Returns the real filename of a temporary file. NB: direct access to this file (e.g. by external programs) is only allowed after calling finish, and before the first read.

$status = $tf->status;

Returns the current status of the temporary file, i.e. the stage in its life cycle. The return value is one of the strings WRITING (initial state), FINISHED (immediately after finish, before first read), READING (while reading or after rewind) or DELETED (after close).
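To make the status transitions concrete, here is a minimal stand-in class that mimics the life cycle with File::Temp. MiniTempFile is hypothetical and only approximates the behaviour described in this section; it is not how CWB::TempFile is implemented and it omits prefixes and compression.

```perl
use strict;
use warnings;
use File::Temp ();

package MiniTempFile;   # hypothetical stand-in, not CWB::TempFile itself

sub new {
    my ($class) = @_;
    my ($fh, $name) = File::Temp::tempfile();
    return bless { fh => $fh, name => $name, status => "WRITING" }, $class;
}
sub status { $_[0]{status} }
sub write {
    my $self = shift;
    die "cannot write in state $self->{status}\n" unless $self->{status} eq "WRITING";
    print { $self->{fh} } @_;
}
sub finish {                              # WRITING -> FINISHED
    my $self = shift;
    return unless $self->{status} eq "WRITING";
    CORE::close($self->{fh}) or die "close failed: $!\n";
    $self->{status} = "FINISHED";
}
sub _reopen {                             # FINISHED -> READING
    my $self = shift;
    open my $fh, "<", $self->{name} or die "open failed: $!\n";
    @$self{qw(fh status)} = ($fh, "READING");
}
sub read {
    my $self = shift;
    $self->finish;                        # implicit finish after write cycle
    die "file already deleted\n" if $self->{status} eq "DELETED";
    $self->_reopen if $self->{status} eq "FINISHED";
    return readline $self->{fh};
}
sub rewind {                              # close and re-open the handle
    my $self = shift;
    $self->finish;
    CORE::close($self->{fh}) if $self->{status} eq "READING";
    $self->_reopen;
}
sub close {                               # any state -> DELETED
    my $self = shift;
    return if $self->{status} eq "DELETED";
    CORE::close($self->{fh}) if $self->{status} ne "FINISHED";
    unlink $self->{name};
    $self->{status} = "DELETED";
}
sub DESTROY { $_[0]->close }

package main;

my $tf = MiniTempFile->new;
$tf->write("first\n", "second\n");
my @seen = ($tf->status);
my $line = $tf->read;                     # finishes the write cycle first
push @seen, $tf->status;
$tf->rewind;
my $again = $tf->read;                    # first line again
$tf->close;
push @seen, $tf->status;
print join(" -> ", @seen), "\n";
```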

$tf->write(...);

Write data to the temporary file. All arguments are passed to Perl's built-in print function. Like print, this method does not automatically add newlines.

$tf->finish;

Stop writing to the temporary file, flush the output buffer, and close the associated file handle. After finish has been called, the temporary file can be accessed directly by the script or external programs, and may be overwritten by them. In order to automatically delete a file created by an external program, finish the temporary file immediately after its creation and then allow the external tool to overwrite it:

$tf = new CWB::TempFile;
$tf->finish;  # temporary file has size of 0 bytes now
$filename = $tf->name;
system "$my_shell_command > $filename";

$line = $tf->read;

Read one line from temporary file (same as getline method on FileHandle). Automatically invokes finish if called immediately after write cycle.

$tf->rewind;

Allows the script to re-read a temporary file. The next read call will return the first line of the temporary file. Internally this is achieved by closing and re-opening the associated file handle.

SHELL COMMANDS

The CWB::Shell::Cmd() function provides a convenient replacement for the built-in system command. Standard output and error messages produced by the invoked shell command are captured to avoid screen clutter, and the former is available to the Perl script (similar to the backtick operator `$shell_cmd`). CWB::Shell::Cmd() also checks for a variety of error conditions and returns an error level value ranging from 0 (successful) to 6 (fatal error):

Error Level  Description
  6          command execution failed (system error)
  5          non-zero exit value or error message on STDERR
  4          -- reserved for future use --
  3          warning message on STDERR
  2          any output on STDERR
  1          error message on STDOUT

Depending on the value of $CWB::Shell::Paranoid, a warning message will be issued or the function will die with an error message.

$CWB::Shell::Paranoid = 0;

With the default setting of 0, CWB::Shell::Cmd() will die if the error level is 5 or greater. In the extra paranoid setting (+1), it will almost always die (error level 2 or greater). In the less paranoid setting (-1) only an error level of 6 (i.e. failure to execute the shell command) will cause the script to abort.
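The relationship between the paranoia setting and the error levels can be summarised in a short sketch. The lookup table and the function would_die are illustrative names that restate the thresholds given above; they are not taken from the module's source.

```perl
use strict;
use warnings;

# Illustrative mapping from $CWB::Shell::Paranoid to the lowest error
# level that aborts the script, as described above (not the module's code).
my %die_threshold = (-1 => 6, 0 => 5, 1 => 2);

sub would_die {
    my ($errlevel, $paranoid) = @_;
    return $errlevel >= $die_threshold{$paranoid} ? 1 : 0;
}

for my $paranoid (-1, 0, 1) {
    my @fatal = grep { would_die($_, $paranoid) } 0 .. 6;
    print "Paranoid = $paranoid: dies at levels @fatal\n";
}
```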

$errlvl = CWB::Shell::Cmd($cmd);
$errlvl = CWB::Shell::Cmd($cmd, $filename);
$errlvl = CWB::Shell::Cmd($cmd, \@lines);

The first form executes $cmd as a shell command (through the built-in system function) and returns an error level value. With the default setting of $CWB::Shell::Paranoid, serious errors are usually detected and cause the script to die, so it is not necessary to check $errlvl.

The second form stores the standard output of the shell command in file $filename. It can then be processed with external programs or read in by the Perl script. NB: Compressed files are not supported! It is recommended to use an uncompressed temporary file (CWB::TempFile object).

The third form requires an array reference as its second argument. It splits the standard output of the shell command into chomped lines and stores them in @lines. If there is a large amount of standard output, it is more efficient to use the second form.

$errlvl = CWB::Shell::Cmd([$prog, $arg, ...], ...);

In each form of CWB::Shell::Cmd, the string $cmd can be replaced by an array reference containing the program to be called and its individual arguments. The arguments will automatically be quoted in a way that is safe at least in bash and tcsh shells. Note that simple option flags with values must be passed as two separate arguments in this case, e.g. [$CWB::DescribeCorpus, "-r", $registry, "DICKENS"].

If you want to execute a multi-command pipeline or use other shell metacharacters in your command, you have to use the CWB::Shell::Quote function to quote literal arguments yourself.

$safe = CWB::Shell::Quote($argument);

Safely quote $argument as a command-line argument in bash and tcsh shells. Simple strings that consist only of ASCII letters and digits, _, -, . and / are passed through without quotes. The CWB::Shell::Quote function is vectorised, so multiple argument strings can be passed in a single call.
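As a rough illustration of this kind of quoting, the sketch below passes simple strings through and wraps everything else in single quotes. The function sketch_quote is hypothetical, and the single-quote strategy is an assumption that targets POSIX-style shells only; the real CWB::Shell::Quote may use a different scheme, for example to remain safe in tcsh.

```perl
use strict;
use warnings;

# Sketch of POSIX-shell quoting (assumption: single quotes, with embedded
# single quotes written as '\''). The real CWB::Shell::Quote may differ.
sub sketch_quote {
    my @quoted = map {
        my $arg = $_;
        if ($arg =~ m{\A[A-Za-z0-9_./-]+\z}) {
            $arg;                       # simple string: pass through
        }
        else {
            $arg =~ s/'/'\\''/g;        # close quote, escaped ', reopen
            "'$arg'";
        }
    } @_;
    return wantarray ? @quoted : $quoted[0];
}

# Vectorised call: several arguments quoted at once.
print join(" ", sketch_quote("ls", "-l", "my file.txt", "it's")), "\n";
```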

REGISTRY FILE EDITING

Registry files in canonical format can be loaded into CWB::RegistryFile objects, edited using the various access methods detailed below, and written back to disk. It is also possible to create a registry entry from scratch and save it to a disk file.

Canonical registry files consist of a header and a body. The header begins with a NAME, ID, HOME, and optional INFO field

NAME "long descriptive name"
ID   my-corpus
HOME /path/to/data/directory
INFO /path/to/info/file.txt

followed by optional corpus property definitions

##:: property1 = "value1"
##:: property2 = "value2"

The body declares positional, structural, and alignment attributes in arbitrary order, using the following keywords

ATTRIBUTE  word     # positional attribute
STRUCTURE  np       # structural attribute
ALIGNED    corpus2  # alignment attribute (CORPUS2 is target corpus)

Each attribute declaration may be followed by an alternative directory path on the same line, if the attribute data is not stored in the HOME directory of the corpus:

ATTRIBUTE  lemma  /path/to/other/data/directory

The header fields, corpus properties, and attribute declarations are jointly referred to as content lines. Each content line may be preceded by an arbitrary number of comment lines (starting with a # character) and blank lines. Trailing comments and blank lines (i.e. after the last content line in a registry file) are allowed but will be ignored by CWB::RegistryFile. Besides, each content line may include an in-line comment which extends from the first # character to the end of the line (see examples above). Note that lines starting with the special symbol ##:: are interpreted as corpus property definitions rather than comments.
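The distinction between content lines, comments, and property definitions can be sketched with a simple line classifier. The function classify_line is hypothetical and much more lenient than a real parser; it is not the code used by CWB::RegistryFile and does not validate the canonical format.

```perl
use strict;
use warnings;

# Illustrative classifier for lines of a canonical registry file,
# following the description above (not CWB::RegistryFile's parser).
sub classify_line {
    my ($line) = @_;
    return "blank"     if $line =~ /^\s*$/;
    return "property"  if $line =~ /^##::\s*\S+\s*=/;   # checked before comments
    return "comment"   if $line =~ /^#/;
    return "header"    if $line =~ /^(?:NAME|ID|HOME|INFO)\s/;
    return "attribute" if $line =~ /^(?:ATTRIBUTE|STRUCTURE|ALIGNED)\s+\S+/;
    return "unknown";
}

my @sample = (
    '# registry entry for a small corpus',
    'NAME "long descriptive name"',
    'ID   my-corpus',
    'HOME /path/to/data/directory',
    '',
    '##:: language = "en"',
    'ATTRIBUTE  word     # positional attribute',
    'STRUCTURE  np',
);
printf "%-10s %s\n", classify_line($_), $_ for @sample;
```

Testing for ##:: before the general # rule reflects the note above that property definitions look like comments but are content lines.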

$reg = new CWB::RegistryFile;
$reg = new CWB::RegistryFile $filename;

The first form of the CWB::RegistryFile constructor creates a new, empty registry entry. The mandatory fields have to be filled in by the Perl script before the $reg object can be saved to disk. It is also highly advisable to declare at least the word attribute. :-)

The second form attempts to read and parse the registry file $filename. If successful, a CWB::RegistryFile object storing all relevant information is returned. If $filename does not contain the character / and cannot be found in the current directory, the constructor will automatically search the standard registry directories for it. The full pathname of the registry file can later be determined with the filename method.

If the load operation failed (i.e. the file does not exist or is not in the canonical registry file format), an error message is printed and an undefined value returned (so this module can be used e.g. to write a robust graphical registry editor). Always check the return value of the constructor before proceeding.

$filename = $reg->filename;

Get the full pathname of the registry file represented by $reg. This value is undefined if $reg was created as a new (empty) registry entry.

$name = $reg->name;
$id = $reg->id;
$home = $reg->home;
$info = $reg->info;

Get the values of the NAME, ID, HOME, and INFO fields from the registry file header. Since the INFO field is optional, the info() method may return an undefined value.

$reg->name($value);
$reg->id($value);
$reg->home($value);
$reg->info($value);
$reg->delete_info;

Modify the NAME, ID, HOME, and INFO fields. The INFO field is optional and may be deleted.

@properties = $reg->list_properties;
$value = $reg->property($property);

Corpus properties are key / value pairs. The list_properties() method returns a list of the keys, i.e. the names of defined properties. Use the property() method to obtain the value of a single property $property.

$reg->property($property, $value);
$reg->delete_property($property);

You can also use the property() method to set the value of a property by passing a second argument. This will add a new corpus property if $property isn't already defined. Use delete_property() to remove a corpus property.

@attr = $reg->list_attributes;
@attr_of_type = $reg->list_attributes($type);
$type = $reg->attribute($att_name);

list_attributes() returns the names of all declared attributes. The attribute() method returns the type of the specified attribute, or an undefined value if the attribute is not declared. $type is one of 'p' (positional), 's' (structural), or 'a' (alignment). Passing one of these type codes to list_attributes() will return attributes of the selected type only.

$reg->add_attribute($att_name, $type);
$reg->delete_attribute($att_name);

add_attribute() adds an attribute of type $type (p, s, or a, see above). The duplicate declaration of an attribute with the same type is silently ignored. Re-declaration with a different type is a fatal error. Use delete_attribute() to remove an attribute of the specified name, regardless of its type.

$directory = $reg->attribute_path($att_name);
$reg->attribute_path($att_name, $directory);
$reg->delete_attribute_path($att_name);

Use the attribute_path() method to get and set the alternative data path of attribute $att_name. If no alternative path is specified in the registry entry, an undefined value is returned. When an alternative path is deleted with delete_attribute_path(), the attribute will look for its data files in the HOME directory of the corpus.

@lines = $reg->comments($key);
$reg->add_comments($key, @lines);
$reg->set_comments($key, @lines);
$reg->set_comments($key);

Comment lines in a registry file are associated with the first content line following the comments. They are available through the comments() method as a list of chomped lines with the initial # character removed. Since comment lines may precede any kind of content line, a special key $key is used to identify the desired content line.

$key = ":NAME";       # header field (same for ":ID", ":HOME", ":INFO")
$key = "::$property"; # definition of corpus property $property
$key = $att_name;     # declaration of attribute $att_name

Use add_comments() to add @lines to the existing comments for $key. The new comments are always inserted immediately before the content line. The set_comments() method overwrites existing comments with @lines. The second form deletes all comments for $key (replacing them with zero new comment lines). Note that "" represents a blank line and "#..." a comment line beginning with two sharps ##.

$comment = $reg->line_comment($key);
$reg->line_comment($key, $comment);
$reg->delete_line_comment($key);

Inline comments use the same $key identifiers as comment lines. Just as with the INFO field, the line_comment() method allows you to get and set inline comments, and delete_line_comment() removes an inline comment.

$reg->write($filename);

Write registry file to disk in canonical format. $filename has to be a full absolute or relative path. For safety reasons, the write() method does not automatically save a file in the default registry directory. Make sure that the filename is all lowercase and identical to the corpus ID, or the CWB tools and CQP will not be able to read the registry file.

If $reg was initialised from a registry file, $filename can be omitted. In this case, the original file will automatically be overwritten.
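Since write() does not enforce the naming requirement itself, a script may want to check the target filename before saving. The following sketch uses a hypothetical helper, registry_name_ok, which is not part of the CWB module; in a real script the second argument would be the value returned by $reg->id.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical pre-flight check (not part of CWB): the registry filename
# must be all lowercase and identical to the corpus ID.
sub registry_name_ok {
    my ($path, $id) = @_;
    my $file = basename($path);
    return ($file eq lc($file) && $file eq $id) ? 1 : 0;
}

for my $path ("/corpora/c1/registry/dickens", "/corpora/c1/registry/Dickens") {
    my $verdict = registry_name_ok($path, "dickens") ? "ok" : "not readable by CWB";
    print "$path: $verdict\n";
}
```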

COPYRIGHT

Copyright (C) 1999-2022 Stephanie Evert [https://purl.org/stephanie.evert]

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.