NAME

Treex::Tool::Parser::MSTperl::TrainerUnlabelled

VERSION

version 0.09407

DESCRIPTION

Trains on correctly parsed sentences, thereby creating and tuning the model. Uses single-best MIRA (McDonald et al., 2005, Proc. HLT/EMNLP).

FIELDS

parser

Reference to an instance of Treex::Tool::Parser::MSTperl::Parser which is used for the training.

model

Reference to an instance of Treex::Tool::Parser::MSTperl::ModelUnlabelled which is being trained.

METHODS

The sumUpdateWeight is the number by which each change of a feature weight is multiplied before it is added to the sum of the weights. At the end of the algorithm the sum then matches its formal definition, which is the sum of the weights as they stand after each of the updates. sumUpdateWeight runs through the sequence from N*T down to 1, where N is the number of iterations ("number_of_iterations" in Treex::Tool::Parser::MSTperl::FeaturesControl, 10 by default) and T is the number of sentences in the training data. N*T is thus the number of inner iterations, i.e. how many times mira_update() is called.
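
The equivalence described above can be checked on a toy example. The following is only a sketch for a single feature: the update values and all variable names except sumUpdateWeight are made up for illustration, and the code is not taken from the module. It compares the naive approach (adding the current weight to the sum after every inner iteration) with the lazy approach (adding the change multiplied by sumUpdateWeight only when the weight changes).

```perl
use strict;
use warnings;

# Hypothetical data: the change applied to one feature weight in each
# inner iteration (0 means the feature was not touched). N*T = 6 here.
my @updates          = ( 0.5, -0.2, 0, 0.3, 0, -0.1 );
my $inner_iterations = scalar @updates;

# Naive variant: add the current weight to the sum after every inner iteration.
my ( $weight_naive, $sum_naive ) = ( 0, 0 );
for my $update (@updates) {
    $weight_naive += $update;
    $sum_naive    += $weight_naive;
}

# Lazy variant: touch the sum only when the weight changes, scaling the
# change by the number of inner iterations still to come (including this one).
my ( $weight_lazy, $sum_lazy ) = ( 0, 0 );
my $sumUpdateWeight = $inner_iterations;    # goes from N*T down to 1
for my $update (@updates) {
    if ( $update != 0 ) {
        $weight_lazy += $update;
        $sum_lazy    += $update * $sumUpdateWeight;
    }
    $sumUpdateWeight--;
}

printf "naive sum: %.2f, lazy sum: %.2f\n", $sum_naive, $sum_lazy;
```

Both variants arrive at the same sum, but the lazy one does work only for features whose weight actually changes in a given inner iteration.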

$trainer->train($training_data);

Trains the model, using the settings from config and the training data in the form of a reference to an array of parsed sentences (Treex::Tool::Parser::MSTperl::Sentence), which can be obtained by the Treex::Tool::Parser::MSTperl::Reader.

$self->mira_update($sentence_correct_parse, $sentence_best_parse, $sumUpdateWeight)

Performs one update of the MIRA (Margin-Infused Relaxed Algorithm) on one sentence from the training data. Its input is the correct parse of the sentence (from the training data) and the best scoring parse created by the parser.
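
For orientation, one textbook form of a single-constraint MIRA step can be sketched as follows. This is an assumption-laden illustration, not the module's implementation: the function name mira_step_sketch, the plain hash of weights, the feature strings and the loss value (imagined here as the number of wrongly attached words) are all hypothetical, and the step size actually used by mira_update() may be computed differently.

```perl
use strict;
use warnings;

# Textbook single-constraint step size (assumed, not taken from the module):
#   tau = max( 0, ( loss - ( score_correct - score_best ) ) / |f_correct - f_best|^2 )
sub mira_step_sketch {
    my ( $weights, $features_correct, $features_best, $loss ) = @_;

    # Signed difference of feature counts: correct parse minus best parse
    my %diff;
    $diff{$_}++ for @$features_correct;
    $diff{$_}-- for @$features_best;

    # Current score margin and squared norm of the difference vector
    my ( $margin, $norm_sq ) = ( 0, 0 );
    for my $feature ( keys %diff ) {
        $margin  += ( $weights->{$feature} // 0 ) * $diff{$feature};
        $norm_sq += $diff{$feature}**2;
    }
    return 0 if $norm_sq == 0;    # identical feature vectors, nothing to update

    my $tau = ( $loss - $margin ) / $norm_sq;
    return 0 if $tau <= 0;        # the margin already covers the loss

    # Move weights towards the correct parse and away from the best-scoring one
    for my $feature ( keys %diff ) {
        $weights->{$feature} = ( $weights->{$feature} // 0 ) + $tau * $diff{$feature};
    }
    return $tau;
}

# Hypothetical feature strings and weights
my %weights = ( 'TAG|tag:DT|NN' => 1.0, 'TAG|tag:NN|NN' => 0.5 );
my $tau     = mira_step_sketch(
    \%weights,
    [ 'TAG|tag:DT|NN', 'TAG|tag:VB|NN' ],    # features of the correct parse
    [ 'TAG|tag:NN|NN', 'TAG|tag:VB|NN' ],    # features of the best-scoring parse
    2,                                       # loss of the best-scoring parse
);
printf "step size: %.2f\n", $tau;
```

After the step, the correct parse outscores the best-scoring parse by exactly the loss, which is the smallest change of the weights that satisfies the margin constraint.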

my ( $features_diff_1, $features_diff_2, $features_diff_count ) = features_diff( $features_1, $features_2 );

Compares the features of two parses of a sentence. Each set of features ($features_1, $features_2) is represented as a reference to an array of strings, one string per feature occurrence. The same feature may be present repeatedly; all occurrences of the same feature are summed together.

Features that appear the same number of times in both parses are disregarded.

The first two returned values ($features_diff_1, $features_diff_2) are array references. $features_diff_1 contains the features that appear in the first parse ($features_1) more often than in the second parse ($features_2), and vice versa for $features_diff_2. Each feature is contained as many times as the difference in its number of occurrences, e.g. if the feature TAG|tag:NN|NN appears 5 times in the first parse and 8 times in the second parse, then $features_diff_2 will contain 'TAG|tag:NN|NN', 'TAG|tag:NN|NN', 'TAG|tag:NN|NN'.

The third returned value ($features_diff_count) is the count of features in which the parses differ, i.e. $features_diff_count = scalar(@$features_diff_1) + scalar(@$features_diff_2).
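
The behaviour described above can be reproduced with a short sketch. The function features_diff_sketch below is a hypothetical re-implementation written from this description only, not the module's own code, and the second feature string is made up for the example.

```perl
use strict;
use warnings;

sub features_diff_sketch {
    my ( $features_1, $features_2 ) = @_;

    # Net count per feature: +1 per occurrence in the first parse,
    # -1 per occurrence in the second parse
    my %net_count;
    $net_count{$_}++ for @$features_1;
    $net_count{$_}-- for @$features_2;

    my ( @diff_1, @diff_2 );
    for my $feature ( sort keys %net_count ) {
        my $times = abs $net_count{$feature};
        if ( $net_count{$feature} > 0 ) {
            push @diff_1, ($feature) x $times;    # more frequent in the first parse
        }
        elsif ( $net_count{$feature} < 0 ) {
            push @diff_2, ($feature) x $times;    # more frequent in the second parse
        }

        # a net count of 0 means equal frequency; the feature is disregarded
    }

    return ( \@diff_1, \@diff_2, scalar(@diff_1) + scalar(@diff_2) );
}

# The example from the text: 5 occurrences against 8, plus one shared feature
my @parse_1 = ( ('TAG|tag:NN|NN') x 5, 'TAG|tag:DT|NN' );
my @parse_2 = ( ('TAG|tag:NN|NN') x 8, 'TAG|tag:DT|NN' );

my ( $diff_1, $diff_2, $diff_count ) = features_diff_sketch( \@parse_1, \@parse_2 );
print "more often in first parse:  @$diff_1\n";
print "more often in second parse: @$diff_2\n";
print "features_diff_count: $diff_count\n";
```

Here the first list is empty, the second holds TAG|tag:NN|NN three times, and the shared TAG|tag:DT|NN feature is dropped because it occurs once in each parse.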

update_feature_weight( $model, $feature, $update, $sumUpdateWeight )

Updates the weight of $feature by $update (which may be positive or negative). Also adds $update, multiplied by $sumUpdateWeight, to the sum of updates of the feature, which is later used to avoid overtraining. $sumUpdateWeight is simply the count of inner iterations yet to be performed, which eliminates the need to update the sum on each inner iteration.

AUTHORS

Rudolf Rosa <rosa@ufal.mff.cuni.cz>

COPYRIGHT AND LICENSE

Copyright © 2011 by Institute of Formal and Applied Linguistics, Charles University in Prague

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.