NAME
Toolkit.pod
Toolkit Organization
This file briefly describes the functionality of all the programs in the Toolkit directory of SenseClusters.
The Toolkit is organized into the following directories:
preprocess (preprocessing programs)
Organized as
1. plain (processes input in plain text format)
Contains
text2sval.pl - Converts simple raw text into Senseval-2 format.
2. sval2 (processes input in Senseval-2 format)
Contains
balance.pl - Balances sense distribution.
filter.pl - Removes low frequency sense tags.
frequency.pl - Displays frequency distribution of senses.
keyconvert.pl - Converts a KEY file from Senseval-2 format to SenseClusters format.
maketarget.pl - Creates a Perl regex for the target word by spotting all <head> tags in the given file.
prepare_sval2.pl - Prepares Senseval-2 data for experiments.
sval2plain.pl - Converts a given file in Senseval-2 format to plain text format.
windower.pl - Retains only W words around the target word in the given Senseval-2 instance file.
preprocess.pl - Tokenizes and can split the data. (Borrowed from SenseTools package)
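The windowing step described for windower.pl can be illustrated with a minimal sketch. This is not the windower.pl implementation: it assumes the instance has already been tokenized into a list, that the position of the target word is known, and that W counts words on each side of the target (the description above does not say whether W is per side or total). The function name `window_around_target` is hypothetical.

```python
def window_around_target(tokens, target_index, w):
    """Keep at most w tokens on each side of the target token.

    Assumption: w is a per-side limit. The real windower.pl works on
    Senseval-2 instance files, not on plain Python lists.
    """
    if w < 0:
        raise ValueError("w must be non-negative")
    if not 0 <= target_index < len(tokens):
        raise IndexError("target_index is out of range")
    start = max(0, target_index - w)
    end = min(len(tokens), target_index + w + 1)
    return tokens[start:end]


context = "he sat on the bank of the river and fished".split()
print(window_around_target(context, context.index("bank"), 2))
# prints ['on', 'the', 'bank', 'of', 'the']
```

When the target sits near the start or end of the context, the window is simply truncated at the boundary rather than padded.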
count
Contains only one program so far:
reduce-count.pl - Helps reduce the size of a given bigram file created from a large amount of training data.
vector (Vector constructors)
Contains
order1vec.pl - Creates first-order context vectors.
order2vec.pl - Creates second-order context vectors.
wordvec.pl - Creates word vectors from the given NSP bigram output.
nsp2regex.pl - Creates regular expressions from given words. (Borrowed from SenseTools package)
svd (SVD interface)
Contains
mat2harbo.pl - Converts matrices from SenseClusters format to Harwell-Boeing format.
svdpackout.pl - Reconstructs a matrix from its singular vectors created by SVDPack.
matrix (Similarity matrix constructors)
Contains
bitsimat.pl - Creates a similarity matrix for given bit vectors.
simat.pl - Creates a similarity matrix for given non-binary (integer or real) vectors.
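As an illustration of what a similarity matrix over real-valued vectors looks like, here is a minimal sketch that assumes cosine similarity as the measure. This file does not state which measure simat.pl or bitsimat.pl actually uses, so the choice of cosine and the function name `cosine_similarity_matrix` are assumptions made for the example only.

```python
import math


def cosine_similarity_matrix(vectors):
    """Return an n x n matrix of pairwise cosine similarities.

    Assumption: cosine is used as the similarity measure. A zero vector
    is given similarity 0.0 with everything, including itself.
    """
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            if norms[i] == 0.0 or norms[j] == 0.0:
                value = 0.0
            else:
                dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
                value = dot / (norms[i] * norms[j])
            # The matrix is symmetric, so fill both halves at once.
            sim[i][j] = sim[j][i] = value
    return sim


context_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
for row in cosine_similarity_matrix(context_vectors):
    print(row)
```

Only the upper triangle is computed; the lower triangle is mirrored, since cosine similarity is symmetric.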
evaluate (Evaluation programs)
Contains
cluto2label.pl - Converts Cluto's clustering output to a cluster-by-sense distribution table for evaluation.
format_clusters.pl - Displays clusters of text with assigned sense IDs, or displays Senseval-2 format with assigned sense IDs.
label.pl - Assigns sense tags to the discovered clusters for evaluation.
report.pl - Reports the accuracy of discrimination in terms of precision, recall, and a confusion matrix.
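To make the evaluation quantities concrete, the sketch below computes precision and recall from a small cluster-by-sense confusion matrix. It is not the report.pl implementation. It assumes that clusters have already been mapped to senses so that row i corresponds to sense column i, that precision is the fraction of clustered instances that agree with their sense, and that recall is the same count divided by all instances, including any left unclustered. The function name `precision_recall` and the `unclustered` parameter are hypothetical.

```python
def precision_recall(confusion, unclustered=0):
    """Compute (precision, recall) from a square cluster-by-sense table.

    Assumptions: row i has been aligned with sense column i, so the
    diagonal holds the correctly discriminated instances; 'unclustered'
    counts instances that were not placed in any cluster.
    """
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    clustered = sum(sum(row) for row in confusion)
    total = clustered + unclustered
    precision = correct / clustered if clustered else 0.0
    recall = correct / total if total else 0.0
    return precision, recall


# Two clusters, two senses, five instances left unclustered.
table = [[8, 2],
         [1, 9]]
precision, recall = precision_recall(table, unclustered=5)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints precision=0.85 recall=0.68
```

Under these assumed definitions, precision and recall coincide whenever every instance is assigned to a cluster.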
clusterlabel (Cluster Labeling programs)
Contains
clusterlabeling.pl - Selects significant word pairs from the contents (instances) of the clusters and assigns them as labels to the clusters. Also creates a separate file for each cluster.
clusterstopping (Cluster Stopping program)
Contains
clusterstopping.pl - Predicts the number of clusters into which the given data should be divided. Provides three cluster stopping measures.
Acknowledgement
This work has been partially supported by a National Science Foundation Faculty Early Career Development (CAREER) award (#0092784).