NAME

Toolkit.pod

Toolkit Organization

This file briefly describes the functionality of all the programs in the Toolkit directory of SenseClusters.

The Toolkit is organized into the following directories:

preprocess (preprocessing programs)

Organized as

1. plain (processes input in plain text format)

Contains

  • text2sval.pl - Converts simple raw text into Senseval-2 format.

2. sval2 (processes input in Senseval-2 format)

Contains

  • balance.pl - Balances sense distribution.

  • filter.pl - Removes low frequency sense tags.

  • frequency.pl - Displays frequency distribution of senses.

  • keyconvert.pl - Converts a KEY file from Senseval-2 format to SenseClusters format.

  • maketarget.pl - Creates a Perl regex for the target word by finding all <head> tags in the given file.

  • prepare_sval2.pl - Prepares Senseval-2 data for experiments.

  • sval2plain.pl - Converts a given file in Senseval-2 format to plain text format.

  • windower.pl - Retains only W words around the target word in the given Senseval-2 instance file.

  • preprocess.pl - Tokenizes the data and can also split it. (Borrowed from the SenseTools package)
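The windowing idea behind windower.pl can be illustrated with a minimal sketch. This is not the program's actual implementation: the function name window_around_target is hypothetical, and the sketch assumes that W counts tokens on each side of the target word, which windower.pl may define differently.

```python
def window_around_target(tokens, target_index, w):
    """Keep at most w tokens on each side of the token at target_index.

    Assumption: w is a per-side count; the real windower.pl may differ.
    """
    if w < 0:
        raise ValueError("w must be non-negative")
    start = max(0, target_index - w)
    end = min(len(tokens), target_index + w + 1)
    return tokens[start:end]


# Toy context with "bank" as the target word.
tokens = "the river bank was flooded after heavy rain".split()
target = tokens.index("bank")
windowed = window_around_target(tokens, target, 2)
print(windowed)
```

The slice is clipped at both ends, so a target near the start or end of a context simply yields a shorter window.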

count

Contains only one program so far:

  • reduce-count.pl - Reduces the size of a bigram file created from a large amount of training data.

vector (Vector constructors)

Contains

  • order1vec.pl - Creates first order context vectors.

  • order2vec.pl - Creates second order context vectors.

  • wordvec.pl - Creates word vectors from given NSP bigram output.

  • nsp2regex.pl - Creates regular expressions from given words. (Borrowed from SenseTools package)

svd (SVD interface)

Contains

  • mat2harbo.pl - Converts matrices from SenseClusters format to Harwell-Boeing format.

  • svdpackout.pl - Reconstructs a matrix from its singular vectors created by SVDPack.

matrix (Similarity matrix constructors)

Contains

  • bitsimat.pl - Creates a similarity matrix for given bit vectors.

  • simat.pl - Creates a similarity matrix for given non-binary (integer or real) vectors.
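As a rough illustration of what a similarity matrix over bit vectors looks like, the sketch below computes pairwise cosine similarity for binary vectors. The choice of cosine is an assumption made for this example; the description above does not say which similarity measures bitsimat.pl supports, and the function names binary_cosine and similarity_matrix are hypothetical.

```python
import math


def binary_cosine(a, b):
    """Cosine similarity of two equal-length bit vectors (assumed measure)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    shared = sum(1 for x, y in zip(a, b) if x and y)  # features set in both
    ones_a = sum(1 for x in a if x)
    ones_b = sum(1 for y in b if y)
    if ones_a == 0 or ones_b == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return shared / math.sqrt(ones_a * ones_b)


def similarity_matrix(vectors):
    """Square matrix whose cell (i, j) is the similarity of vectors i and j."""
    return [[binary_cosine(u, v) for v in vectors] for u in vectors]


# Three toy contexts over four binary features.
vectors = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
]
sim = similarity_matrix(vectors)
for row in sim:
    print(" ".join(f"{value:.2f}" for value in row))
```

The resulting matrix is symmetric, with 1.0 on the diagonal for any vector that has at least one bit set.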

evaluate (Evaluation programs)

Contains

  • cluto2label.pl - Converts the clustering output of Cluto to a cluster-by-sense distribution table for evaluation.

  • format_clusters.pl - Displays clusters of text with their assigned sense IDs, or displays Senseval-2 format data with assigned sense IDs.

  • label.pl - Assigns sense tags to the discovered clusters for evaluation.

  • report.pl - Reports the accuracy of discrimination in terms of precision, recall, and a confusion matrix.
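To make the evaluation measures concrete, the sketch below derives precision and recall from a confusion matrix. It is only an illustration under stated assumptions, not report.pl's actual code: it assumes cluster i has already been labeled with sense i (so correct instances lie on the diagonal), that precision is the number of correct instances divided by the number of clustered instances, and that recall divides by all instances, including any left unclustered. The function name precision_recall and the example counts are hypothetical.

```python
def precision_recall(confusion, unclustered=0):
    """Precision and recall from a cluster-by-sense confusion matrix.

    Assumptions: row i is the cluster labeled with sense i, so the
    diagonal holds correctly discriminated instances; unclustered
    instances count against recall only.
    """
    correct = sum(row[i] for i, row in enumerate(confusion) if i < len(row))
    clustered = sum(sum(row) for row in confusion)
    total = clustered + unclustered
    precision = correct / clustered if clustered else 0.0
    recall = correct / total if total else 0.0
    return precision, recall


# Hypothetical counts: rows are clusters, columns are senses.
confusion = [
    [40, 10],
    [5, 35],
]
precision, recall = precision_recall(confusion, unclustered=10)
print(f"precision={precision:.4f} recall={recall:.4f}")
```

With these definitions, precision and recall coincide whenever every instance is assigned to some cluster.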

clusterlabel (Cluster Labeling programs)

Contains

  • clusterlabeling.pl - Selects significant word pairs from the contents (instances) of each cluster and assigns them as labels to that cluster. Also creates a separate file for each cluster.

clusterstopping (Cluster Stopping program)

Contains

  • clusterstopping.pl - Predicts the number of clusters that the given data should be divided into. Provides three cluster stopping measures.

Acknowledgement

This work has been partially supported by a National Science Foundation Faculty Early CAREER Development award (#0092784).