NAME

measure3d.pm

SYNOPSIS

This module can be used as a foundation for building 3-dimensional measures of association that can then be used by statistic.pl. In particular this module provides functions that give convenient access to 3-d (i.e., trigram) frequency counts as created by count.pl, as well as some degree of error handling that verifies the data.

To be used in a measure module that is to be used by statistic.pl, the functions provided in this module must be embedded within other functions that adhere to the standards and naming conventions described in Docs/NewStats.txt, which are briefly summarized below.

DESCRIPTION

The functions in this module retrieve observed trigram frequency counts and marginal totals, and also compute expected values. They also provide support for error checking of the output produced by count.pl. These functions are used in all the trigram (3d) measure modules (e.g., ll3.pm, tmi3.pm, etc.) provided in NSP. If you are writing your own 3d measure, you can use these functions as well.

With trigram or 3d measures we use a 2x2x2 contingency table to store the frequency counts associated with each word in the trigram, as well as the number of times the trigram occurs. The notation we employ is as follows:

Marginal Frequencies:

n1pp = the number of trigrams where the first word is word1.
np1p = the number of trigrams where the second word is word2.
npp1 = the number of trigrams where the third word is word3.
n2pp = the number of trigrams where the first word is not word1.
np2p = the number of trigrams where the second word is not word2.
npp2 = the number of trigrams where the third word is not word3.

Observed Frequencies:

n111 = number of times word1, word2 and word3 occur together in
       their respective positions, joint frequency.
n112 = number of times word1 and word2 occur in their respective
       positions but word3 does not.
n211 = number of times word2 and word3 occur in their respective
       positions but word1 does not.
n212 = number of times word2 occurs in its respective position
       but word1 and word3 do not.
n121 = number of times word1 and word3 occur in their respective 
       positions but word2 does not.
n122 = number of times word1 occurs in its respective position
       but word2 and word3 do not.
n221 = number of times word3 occurs in its respective position
       but word1 and word2 do not.
n222 = number of times none of word1, word2 and word3 occurs in its
       respective position.

Expected Frequencies:

m111 = expected number of times word1, word2 and word3 occur together in
       their respective positions.
m112 = expected number of times word1 and word2 occur in their respective
       positions but word3 does not.
m211 = expected number of times word2 and word3 occur in their respective
       positions but word1 does not.
m212 = expected number of times word2 occurs in its respective position
       but word1 and word3 do not.
m121 = expected number of times word1 and word3 occur in their respective 
       positions but word2 does not.
m122 = expected number of times word1 occurs in its respective position
       but word2 and word3 do not.
m221 = expected number of times word3 occurs in its respective position
       but word1 and word2 do not.
m222 = expected number of times none of word1, word2 and word3 occurs in
       its respective position.

Functions Included in measure3d:

  1. The initializeStatistic function performs initialization before the actual computation of measures of association begins. It also verifies that the input consists of trigrams, and that it has all the frequency combinations needed to do the computations. The frequency combinations that are required for this module are $n111, $n1pp, $np1p, $npp1, $n11p, $np11, and $n1p1.

    initializeStatistic() is passed the following input parameters:

    1) The ngram size. For 3d (trigram) measures this will be 3. 
    2) The total number of trigrams in the corpus (nppp).
    3) The number of frequency combinations. 
    4) A 2-d array containing the frequency combinations.

    Each row of the array in 4) represents a single frequency combination. On a given row, the first element denotes the number of indices on this row, say 'n'. This is followed by the 'n' values that correspond to the indices included in the frequency combination. (For more details on frequency combinations, see README.pod). To use this module, the joint frequency, n111, as well as the marginal frequencies n1pp, np1p, npp1, n11p, np11, and n1p1, are required in order to calculate the expected values.
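
    As an illustration of the row layout just described, the seven required frequency combinations could be written as the rows below. This is only a sketch: the 0-based word positions, the row order, and the helper name combo_key are assumptions made for the example, not something this module defines.

```perl
use strict;
use warnings;

# Assumed layout: each row is [ n, index_1, ..., index_n ], where the
# indices are 0-based word positions within the trigram. The row order
# mirrors n111, n1pp, np1p, npp1, n11p, np11, n1p1.
my @freq_combos = (
    [ 3, 0, 1, 2 ],    # n111: all three positions fixed
    [ 1, 0 ],          # n1pp: first position only
    [ 1, 1 ],          # np1p: second position only
    [ 1, 2 ],          # npp1: third position only
    [ 2, 0, 1 ],       # n11p: first and second positions
    [ 2, 1, 2 ],       # np11: second and third positions
    [ 2, 0, 2 ],       # n1p1: first and third positions
);

# Hypothetical helper: check that a row lists as many indices as its
# first element declares, then return the indices as a readable key.
sub combo_key {
    my ($row) = @_;
    my ( $n, @indices ) = @$row;
    die "row declares $n indices but lists " . scalar(@indices) . "\n"
        unless $n == @indices;
    return join ' ', @indices;
}

print combo_key($_), "\n" for @freq_combos;
```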

    This function does not return any values. If an error occurs, it can be detected by statistic.pl using the errorCode and errorString functions described below.

  2. The getObservedValues function takes as input an array containing the frequency values for a trigram as found by count.pl. This will include seven values: n111, n1pp, np1p, npp1, n11p, np11, and n1p1. The size of this array is guaranteed to be exactly the same as the third parameter passed to the initializeStatistic() function above.

    The getObservedValues function verifies that the marginal frequencies n1pp, np1p, npp1, n11p, np11, and n1p1 are consistent with the value of the joint frequency of the trigram, n111. If they are not consistent, it sets an error code and error message, and the function returns.
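
    The exact checks performed by the module are not listed here. The sketch below shows necessary conditions that any consistent set of trigram counts has to satisfy. The helper name check_trigram_counts, its argument order, and the sample counts are made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an error message, or an empty string if
# the counts pass these necessary (not exhaustive) consistency checks.
sub check_trigram_counts {
    my ( $nppp, $n111, $n1pp, $np1p, $npp1, $n11p, $np11, $n1p1 ) = @_;

    return "joint frequency must not be negative" if $n111 < 0;

    # The joint frequency cannot exceed any pairwise marginal.
    for my $pair ( $n11p, $np11, $n1p1 ) {
        return "n111 exceeds a pairwise marginal" if $n111 > $pair;
    }

    # Each pairwise marginal is bounded by both of its single-word marginals.
    return "n11p exceeds n1pp or np1p" if $n11p > $n1pp || $n11p > $np1p;
    return "np11 exceeds np1p or npp1" if $np11 > $np1p || $np11 > $npp1;
    return "n1p1 exceeds n1pp or npp1" if $n1p1 > $n1pp || $n1p1 > $npp1;

    # No marginal can exceed the total number of trigrams.
    for my $single ( $n1pp, $np1p, $npp1 ) {
        return "a marginal exceeds the total number of trigrams"
            if $single > $nppp;
    }
    return "";
}

# Made-up counts: nppp, n111, n1pp, np1p, npp1, n11p, np11, n1p1
my $problem = check_trigram_counts( 100, 5, 20, 15, 30, 8, 7, 10 );
print $problem eq "" ? "counts look consistent\n" : "error: $problem\n";
```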

    If the marginals are valid, it computes the observed values for the remaining cells in the 3d table ( n112, n211, n212, n121, n122, n221, n222 ) based on these marginal totals and the joint frequency and returns an array containing these totals in the order above.
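
    The remaining cells follow from the definitions in the notation above by inclusion-exclusion. The sketch below works through that arithmetic. The helper name observed_cells and the sample counts are hypothetical, and the cells are returned as a hash keyed by name so that the example does not depend on any particular return order.

```perl
use strict;
use warnings;

# Hypothetical helper: derive all eight observed cells from the total,
# the joint frequency and the six marginal totals.
sub observed_cells {
    my ( $nppp, $n111, $n1pp, $np1p, $npp1, $n11p, $np11, $n1p1 ) = @_;
    my %n = ( n111 => $n111 );

    # Two words in place, the third one not.
    $n{n112} = $n11p - $n111;
    $n{n121} = $n1p1 - $n111;
    $n{n211} = $np11 - $n111;

    # One word in place, the other two not.
    $n{n122} = $n1pp - $n11p - $n1p1 + $n111;
    $n{n212} = $np1p - $n11p - $np11 + $n111;
    $n{n221} = $npp1 - $n1p1 - $np11 + $n111;

    # None of the three words in place.
    $n{n222} = $nppp - $n1pp - $np1p - $npp1
             + $n11p + $np11 + $n1p1 - $n111;
    return %n;
}

# Made-up counts: nppp, n111, n1pp, np1p, npp1, n11p, np11, n1p1
my %cells = observed_cells( 100, 5, 20, 15, 30, 8, 7, 10 );
printf "%s = %d\n", $_, $cells{$_} for sort keys %cells;
```

    A quick sanity check on any such derivation is that the eight cells sum to nppp.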

  3. The function calculateExpectedValues() calculates the expected values of the cells in the contingency table based on the marginal frequencies and the total sample size. The expected values are estimated based on the assumption that the three words in the trigram are independent.

    The function returns these values in an array ordered as follows: m111, m112, m121, m122, m211, m212, m221, m222.
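
    Under the usual full-independence model, each expected cell is the product of the three matching single-word marginals divided by nppp squared, for example m111 = n1pp * np1p * npp1 / nppp^2. The sketch below assumes that model. The helper name expected_cells and the sample counts are hypothetical, and the result is a hash keyed by cell name rather than the ordered array described above.

```perl
use strict;
use warnings;

# Hypothetical helper: expected counts for all eight cells, assuming
# the three word positions are mutually independent.
sub expected_cells {
    my ( $nppp, $n1pp, $np1p, $npp1 ) = @_;
    die "nppp must be positive\n" unless $nppp > 0;

    # Index 0 holds the "word present" marginal, index 1 the complement
    # (n2pp, np2p and npp2 respectively).
    my @first  = ( $n1pp, $nppp - $n1pp );
    my @second = ( $np1p, $nppp - $np1p );
    my @third  = ( $npp1, $nppp - $npp1 );

    my %m;
    for my $i ( 0 .. 1 ) {
        for my $j ( 0 .. 1 ) {
            for my $k ( 0 .. 1 ) {
                my $cell = 'm' . ( $i + 1 ) . ( $j + 1 ) . ( $k + 1 );
                $m{$cell} = $first[$i] * $second[$j] * $third[$k] / $nppp**2;
            }
        }
    }
    return %m;
}

# Made-up counts: nppp, n1pp, np1p, npp1
my %m = expected_cells( 100, 20, 15, 30 );
printf "%s = %.2f\n", $_, $m{$_} for sort keys %m;
```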

  4. The function getMarginalValues will return the marginal frequencies in the order of n1pp, np1p, npp1, n11p, np11, n1p1, n2pp, np2p, npp2.

  5. The function getTotalTrigrams() returns the total number of trigrams in the corpus (nppp).

  6. The function errorCode() returns 0 if the last operation was successful. It will return an integer starting with 1 if the last operation failed. This indicates that statistic.pl should abort. It will return an integer starting with 2 to indicate a warning should be issued. This does not cause statistic.pl to abort. However, a warning after calculateStatistic() will cause the trigram which generated that warning to be ignored by statistic.pl.
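
    A caller could interpret the code along the lines below. This is a sketch of the convention just described, not code taken from statistic.pl. The helper name classify_error_code and the sample codes 100 and 200 are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: map an error code to the action described above.
sub classify_error_code {
    my ($code) = @_;
    return 'ok'      if $code == 0;      # last operation succeeded
    return 'error'   if $code =~ /^1/;   # caller should abort
    return 'warning' if $code =~ /^2/;   # warn and skip this trigram
    return 'unknown';
}

# 100 and 200 are invented sample codes, used only for illustration.
for my $code ( 0, 100, 200 ) {
    printf "%d -> %s\n", $code, classify_error_code($code);
}
```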

  7. The function errorString() returns the text of an error message.

Writing your own statistic module

Any measure module being used by statistic.pl must follow this convention. In order to make it easier to build 3d measures, we provide 3d-specific functions that can be embedded within the measure module in order to carry out common operations in calculating the values of such measures.

  1. The filename should have an extension of .pm. Usually the name of the file should be Statistic.pm, where "Statistic" is the name of the particular statistic you are writing.

  2. Let us say you have named your file Statistic.pm. The first line of the file should declare that it is a package of the same name as the filename. Thus the first line of the file Statistic.pm should be...

    package Statistic;

  3. To use the measure3d.pm module you need to include it in your Statistic.pm module.

    A small code snippet to ensure that it is included is as follows:

     #  Check to see if the Statistic.pm module can see the
     #  measure3d.pm module ... if not see if it can be found.
    
     my $module = "measure3d.pm"; my $modulename = "measure3d.pm";
     if( !( -f $modulename ) ) {
        my $found = 0;
        #  Check each of the PATHS to see if the module is there
        foreach (split(/:/, $ENV{PATH})) {
           $module = $_ . "/" . $modulename;
           if ( -f $module ) { $found = 1; last; }
        }
        # if still not found anywhere, quit!
        if ( ! $found ) { print "Could not find $modulename.\n"; exit; }
     }
    
     # IMPORTANT : now include the module into the current package    
     require $module;

  4. You need to implement at least two functions in your package

    i)   initializeStatistic()
    ii)  calculateStatistic()

    Function initializeStatistic() is passed a set of parameters that include the trigram size, which is 3, the total number of trigrams in the corpus, the number of frequency combinations and an array containing the frequency combinations. These parameters are described in more detail above, in the description of the measure3d::initializeStatistic function.

    These parameters can be passed directly into the measure3d.pm module's function measure3d::initializeStatistic. For example:

    sub initializeStatistic {
    
        measure3d::initializeStatistic(@_);
    
    }

    where @_ contains the input parameters.

    This function is called before any calls to the function calculateStatistic() and can be used by the statistic library to set up any values that may be required for the calculations later. This function is not expected to return anything. If an error occurs, it can be reported through the mechanisms described below.

    The other mandatory function is calculateStatistic(). This is passed an array containing the frequency values for an ngram as found in the input n-gram file.

    Function calculateStatistic() is expected to return a (possibly floating) value as the value of the statistical measure calculated using the frequency values passed to it.

    There are two main functions in the module measure3d.pm that help calculate the trigram statistic.

    1.  measure3d::getObservedValues(@frequencies)
    2.  measure3d::getExpectedValues();

    The function measure3d::getObservedValues returns the list of observed values for the given trigram. If it returns nothing, an error occurred while computing these values, and calculateStatistic should return zero. An example of how this can be used is as follows:

    if( !( ($n111, $n112, $n211, $n212, $n121, $n122, $n221, $n222) = measure3d::getObservedValues(@_) ) ) { return(0); }

    where @_ holds the parameters sent to calculateStatistic from statistic.pl. A more detailed description of this function can be seen above.

    The function measure3d::getExpectedValues returns the list of expected values for the given trigram. If it returns nothing, an error occurred while computing these values, and calculateStatistic should return zero. An example of how this can be used is as follows:

    if( !( ($m111, $m112, $m121, $m122, $m211, $m212, $m221, $m222) = measure3d::getExpectedValues() ) ) { return(0); }
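
    Once both lists are available, a measure combines them cell by cell. The sketch below computes the log-likelihood ratio G^2 = 2 * sum( n * log(n/m) ) as one common example of such a measure. It is not claimed to reproduce ll3.pm. The helper name log_likelihood and the sample values are hypothetical, and hashes keyed by cell suffix are used so that observed and expected cells are paired by name rather than by list position.

```perl
use strict;
use warnings;

# Hypothetical helper: log-likelihood ratio over the eight cells.
# Both arguments are hash references keyed by cell suffix ("111" .. "222").
sub log_likelihood {
    my ( $observed, $expected ) = @_;
    my $sum = 0;
    for my $cell ( sort keys %$observed ) {
        my $n = $observed->{$cell};
        my $m = $expected->{$cell};
        next if $n == 0;       # a cell with n = 0 contributes nothing
        return 0 if $m <= 0;   # cannot be computed; mirror the return(0) convention
        $sum += $n * log( $n / $m );
    }
    return 2 * $sum;
}

# Made-up observed counts and the matching independence-model expectations.
my %n = ( 111 => 5,   112 => 3,   121 => 5,   122 => 7,
          211 => 2,   212 => 5,   221 => 18,  222 => 55 );
my %m = ( 111 => 0.9, 112 => 2.1, 121 => 5.1, 122 => 11.9,
          211 => 3.6, 212 => 8.4, 221 => 20.4, 222 => 47.6 );

printf "G^2 = %.4f\n", log_likelihood( \%n, \%m );
```

    Note that the two snippets above list the cells in different orders, so pairing them by name avoids mixing up, say, n211 with m121.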

    When a library is loaded, statistic.pl checks for the initializeStatistic and calculateStatistic functions: if they are not implemented, then an error is reported and the program quits.

  5. Program statistic.pl also supports three other functions that are not mandatory, but may be implemented by the user. These are:

      i) errorCode()
     ii) errorString()
    iii) getStatisticName()

    Function errorCode, if implemented, is called immediately after the call to function initializeStatistic() and immediately after every call to function calculateStatistic().

    The measure3d.pm module implements both measure3d::errorCode() and measure3d::errorString().

    The errorCode() and errorString() functions that are implemented in your Statistic.pm module can return the values returned by the measure3d::errorCode() and measure3d::errorString() functions.

    An example of this is below:

    sub errorCode { return measure3d::errorCode(); }

    sub errorString { return measure3d::errorString(); }

    The third function that may be implemented is getStatisticName(). If this function is implemented, it is expected to return a string containing the name of the statistic being implemented. This string is used in the formatted output of statistic.pl. If this function is not implemented, then the statistic file name entered on the command line is used in the formatted output.

    Note that all three functions described in this section are first checked for existence before being called. So, if the user elects to not implement these functions, no harm will be done. However, we strongly recommend the implementation of at least the function errorCode() since this is the only way for the statistic library to report errors to the user.

  6. Having implemented the two mandatory functions (in point 4 above) and zero or more of the three non-mandatory functions (in point 5 above), one must make these functions available outside the package. To do so, one has to export them, as follows.

    For this, first include the Exporter package by including the following line in the program

    require Exporter;

    Now include the following line to inherit Exporter's functions:

    @ISA = qw ( Exporter );

    Now export the various functions implemented so that they are accessible outside this package, by adding the following line (assume that you have implemented only the two mandatory functions):

    @EXPORT = qw( initializeStatistic calculateStatistic );

    If you implement say the errorCode() and errorString() functions too, you may export them like so:

    @EXPORT = qw( initializeStatistic calculateStatistic errorCode errorString );

    Note that the user may implement other functions too, and may export them if desired, but since statistic.pl is not expecting anything besides the five functions above, doing so would have no effect on statistic.pl.

  7. Finally, at the end of everything, add the line

    1;
    
    This will ensure that the LAST line of the file returns a 
    true value, and is necessary so that when this package is 
    loaded, it returns a TRUE value.
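
    Putting the steps above together, a minimal module skeleton might look like the sketch below. So that it runs on its own, it does not load measure3d.pm, and calculateStatistic returns a placeholder value (the joint frequency divided by the total) that is not a real measure of association. In an actual module these bodies would forward to the measure3d functions as shown earlier. The "our" declarations are only needed because the sketch enables "use strict".

```perl
package Statistic;

use strict;
use warnings;

require Exporter;
our @ISA    = qw( Exporter );
our @EXPORT = qw( initializeStatistic calculateStatistic
                  errorCode errorString getStatisticName );

my $total_trigrams = 0;    # nppp, remembered for calculateStatistic

# Mandatory: called once before any call to calculateStatistic.
sub initializeStatistic {
    my ( $ngram_size, $nppp ) = @_;    # remaining parameters are ignored here
    $total_trigrams = $nppp;
    return;
}

# Mandatory: called once per trigram with its frequency values.
sub calculateStatistic {
    my ($n111) = @_;
    return 0 unless $total_trigrams > 0;
    return $n111 / $total_trigrams;    # placeholder, not a real measure
}

# Optional: error reporting; this sketch never fails.
sub errorCode   { return 0; }
sub errorString { return ''; }

# Optional: name used in the formatted output.
sub getStatisticName { return 'Placeholder'; }

1;
```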

Errors to look out for:

  1. The filename does not end with a .pm.

  2. The rest of the filename (besides the extension) does not match the package name (declared in the first line of the file). Remember it's case sensitive!

  3. The names of the five functions (2 mandatory, 3 non-mandatory) do not match EXACTLY those shown above. Again, names are all case sensitive.

  4. The last line of the file is not "1;". This is necessary, and easily overlooked!

AUTHORS

Ted Pedersen (tpederse@umn.edu)
Satanjeev Banerjee (banerjee@cs.cmu.edu)
Bridget McInnes (bthomson@d.umn.edu)

BUGS

SEE ALSO

home page:    http://www.d.umn.edu/~tpederse/nsp.html

mailing list: http://groups.yahoo.com/group/ngram/

COPYRIGHT

Copyright (C) 2004 Satanjeev Banerjee, Ted Pedersen and Bridget McInnes

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.

Note: a copy of the GNU Free Documentation License is available on the web at http://www.gnu.org/copyleft/fdl.html and is included in this distribution as FDL.txt.