NAME

Word2vec::Word2phrase - word2vec's word2phrase wrapper module.

SYNOPSIS

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
$w2p->SetMinCount( 12 );
$w2p->SetThreshold( 20 );
$w2p->SetTrainFilePath( "textCorpus.txt" );
$w2p->SetOutputFilePath( "phraseTextCorpus.txt" );
$w2p->ExecuteTraining();
undef( $w2p );

# or

my $w2p = Word2vec::Word2phrase->new();
$w2p->ExecuteTraining( $trainFilePath, $outputFilePath, $minCount, $threshold, $debug, $overwrite );
undef( $w2p );

DESCRIPTION

Word2vec::Word2phrase is a word2vec package tool that "compoundifies" bi-grams in a text corpus based on a minimum and maximum frequency.
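
To make "compoundifying" concrete, here is a minimal sketch. It is not this module's code. It assumes the scoring rule of the original word2vec word2phrase tool, where a bi-gram is scored as ( pairCount - minCount ) / firstCount / secondCount * totalWords and joined with an underscore when that score exceeds the threshold. The function name compoundify and the toy corpus are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: joins adjacent words with "_" when their bi-gram score
# exceeds $threshold. Simplified sketch of the assumed upstream scoring rule.
sub compoundify
{
    my ( $text, $minCount, $threshold ) = @_;
    my @words = split( ' ', $text );
    return "" if !@words;

    # Count single words and adjacent word pairs.
    my ( %unigram, %bigram );
    $unigram{ $_ }++ for @words;
    $bigram{ "$words[$_ - 1] $words[$_]" }++ for 1 .. $#words;

    my $total  = scalar( @words );
    my $output = $words[0];
    my $joined = 0;    # True when the previous word was just merged into a phrase.

    for my $i ( 1 .. $#words )
    {
        my ( $first, $second ) = @words[ $i - 1, $i ];
        my $score = 0;

        # A word that was just merged, or that is rarer than $minCount, scores 0.
        if( !$joined && $unigram{ $first } >= $minCount && $unigram{ $second } >= $minCount )
        {
            my $pairCount = $bigram{ "$first $second" };
            $score = ( $pairCount - $minCount ) / $unigram{ $first } / $unigram{ $second } * $total;
        }

        $joined  = $score > $threshold ? 1 : 0;
        $output .= ( $joined ? "_" : " " ) . $second;
    }

    return $output;
}

my $corpus = join( " ", ( "new york is big" ) x 3, "york new" );
print( compoundify( $corpus, 1, 1 ), "\n" );
```

With a much larger threshold, such as the default of 100, no pair in this toy corpus scores high enough and the text comes back unchanged.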

Main Functions

new

Description:

Returns a new 'Word2vec::Word2phrase' module object.

Note: Specifying no parameters implies default options.

Default Parameters:
   debugLog                    = 0
   writeLog                    = 0
   trainFilePath               = ""
   outputFilePath              = ""
   minCount                    = 5
   threshold                   = 100
   setW2PDebug                 = 2
   workingDir                  = Current Directory
   word2PhraseExeDir           = Word2Phrase Executable Directory
   overwriteOldFile            = 0

Input:

$debugLog                    -> Instructs module to print debug statements to the console. (1 = True / 0 = False)
$writeLog                    -> Instructs module to print debug statements to a log file. (1 = True / 0 = False)
$trainFilePath               -> Specifies the training text corpus for word2phrase training. (String)
$outputFilePath              -> Specifies the output path for post word2phrase training. (String)
$minCount                    -> Specifies the minimum range value for bi-gram 'compoundification'. (Positive Integer)
$threshold                   -> Specifies the maximum range value for bi-gram 'compoundification'. (Positive Integer)
$setW2PDebug                 -> Specifies the word2phrase debug information parameter value to show during training. (Integer)
$workingDir                  -> Specifies the current working directory. (String)
$word2PhraseExeDir           -> Specifies word2phrase executable directory. (String)
$overwriteOldFile            -> Instructs the module to overwrite any existing file with the same output file name and path. (1 = True / 0 = False)

Note: Specifying all new() parameters is not recommended, as this usage has not been thoroughly tested.

Output:

Word2vec::Word2phrase object.

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();

undef( $w2p );

DESTROY

Description:

Removes member variables and file handle from memory.

Input:

None

Output:

None

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();

$w2p->DESTROY();
undef( $w2p );

ExecuteTraining

Description:

Executes word2phrase training based on parameters. Parameters take precedence over member variables:
any parameter specified overrides its respective member variable.

Note: If no parameters are specified, training runs with the preset member variables.
Returns a value indicating training status.

Input:

$trainFilePath  -> Training text corpus file path
$outputFilePath -> Output file path for the phrase text corpus
$minCount       -> Minimum bi-gram frequency (Positive Integer)
$threshold      -> Maximum bi-gram frequency (Positive Integer)
$debug          -> Displays word2phrase debug information during training. (0 = None, 1 = Show Debug Information, 2 = Show Even More Debug Information)
$overwrite      -> Overwrites old training file when executing training. (0 = False / 1 = True)

Output:

$value          -> '0' = Successful / '-1' = Un-successful

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
$w2p->SetMinCount( 12 );
$w2p->SetThreshold( 20 );
$w2p->SetTrainFilePath( "textCorpus.txt" );
$w2p->SetOutputFilePath( "phraseTextCorpus.txt" );
$w2p->ExecuteTraining();
undef( $w2p );

# Or

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
$w2p->ExecuteTraining( "textCorpus.txt", "phraseTextCorpus.txt", 12, 20, 2, 1 );
undef( $w2p );
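
The documented return value (0 or -1) can be checked before relying on the output file. The following is a minimal sketch: DescribeTrainingResult is a hypothetical helper, and the module is loaded at run time so the sketch still runs where Word2vec::Word2phrase is not installed.

```perl
use strict;
use warnings;

# Hypothetical helper: turns the documented status value into a message.
sub DescribeTrainingResult
{
    my ( $result ) = @_;
    return "Training result unknown" if !defined( $result );
    return $result eq "0" ? "Training successful" : "Training failed";
}

# Load the module at run time so a missing installation is reported, not fatal.
if( eval { require Word2vec::Word2phrase; 1 } )
{
    my $w2p    = Word2vec::Word2phrase->new();
    my $result = $w2p->ExecuteTraining( "textCorpus.txt", "phraseTextCorpus.txt", 12, 20, 2, 1 );
    print( DescribeTrainingResult( $result ), "\n" );
    undef( $w2p );
}
else
{
    print( "Word2vec::Word2phrase is not installed\n" );
}
```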

ExecuteStringTraining

Description:

Executes word2phrase training on a string based on parameters. Parameters take precedence over member variables:
any parameter specified overrides its respective member variable.

Note: If no parameters are specified, training runs with the preset member variables.
Returns a value indicating training status.

Input:

$trainingString -> String to train
$outputFilePath -> Output file path for the phrase text corpus
$minCount       -> Minimum bi-gram frequency (Positive Integer)
$threshold      -> Maximum bi-gram frequency (Positive Integer)
$debug          -> Displays word2phrase debug information during training. (0 = None, 1 = Show Debug Information, 2 = Show Even More Debug Information)
$overwrite      -> Overwrites old training file when executing training. (0 = False / 1 = True)

Output:

$value          -> '0' = Successful / '-1' = Un-successful

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
$w2p->SetMinCount( 12 );
$w2p->SetThreshold( 20 );
$w2p->SetOutputFilePath( "phraseTextCorpus.txt" );
$w2p->ExecuteStringTraining( "large string to train here" );
undef( $w2p );

# Or

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
$w2p->ExecuteStringTraining( "large string to train here", "phraseTextCorpus.txt", 12, 20, 2, 1 );
undef( $w2p );

GetOSType

Description:

Returns the operating system type string.

Input:

None

Output:

$string -> Operating system string.

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
my $operatingSystem = $w2p->GetOSType();
print( "Operating System: $operatingSystem\n" ) if defined( $operatingSystem );
undef( $w2p );
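
For comparison, Perl itself reports the operating system name through the built-in variable $^O. Whether GetOSType() returns exactly this value is an assumption; the sketch below only shows what such a string looks like without needing the module.

```perl
use strict;
use warnings;

# $^O is Perl's built-in operating system name, for example "linux",
# "darwin" or "MSWin32".
my $osName = $^O;
print( "Operating System: $osName\n" );
```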

Accessor Functions

GetDebugLog

Description:

Returns the _debugLog member variable set when the Word2vec::Word2phrase object was initialized by the new function.

Input:

None

Output:

$value -> 0 = False, 1 = True

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
my $debugLog = $w2p->GetDebugLog();

print( "Debug Logging Enabled\n" ) if $debugLog == 1;
print( "Debug Logging Disabled\n" ) if $debugLog == 0;

undef( $w2p );

GetWriteLog

Description:

Returns the _writeLog member variable set when the Word2vec::Word2phrase object was initialized by the new function.

Input:

None

Output:

$value -> 0 = False, 1 = True

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
my $writeLog = $w2p->GetWriteLog();

print( "Write Logging Enabled\n" ) if $writeLog == 1;
print( "Write Logging Disabled\n" ) if $writeLog == 0;

undef( $w2p );

GetFileHandle

Description:

Returns file handle used by WriteLog() method.

Input:

None

Output:

$fileHandle -> Returns file handle blob used by 'WriteLog()' function or undefined.

Example:

<This should not be called.>

GetTrainFilePath

Description:

Returns (string) training file path.

Input:

None

Output:

$string -> word2phrase training file path

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
my $filePath = $w2p->GetTrainFilePath();

print( "Training File Path: $filePath\n" ) if defined( $filePath );
undef( $w2p );

GetOutputFilePath

Description:

Returns (string) output file path.

Input:

None

Output:

$string -> word2phrase output file path

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
my $filePath = $w2p->GetOutputFilePath();

print( "Output File Path: $filePath\n" ) if defined( $filePath );
undef( $w2p );

GetMinCount

Description:

Returns (integer) minimum bi-gram range.

Input:

None

Output:

$value ->  Minimum bi-gram frequency (Positive Integer)

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
my $mincount = $w2p->GetMinCount();

print( "MinCount: $mincount\n" ) if defined( $mincount );
undef( $w2p );

GetThreshold

Description:

Returns (integer) maximum bi-gram range.

Input:

None

Output:

$value ->  Maximum bi-gram frequency (Positive Integer)

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
my $threshold = $w2p->GetThreshold();

print( "Threshold: $threshold\n" ) if defined( $threshold );
undef( $w2p );

GetW2PDebug

Description:

Returns word2phrase debug parameter value.

Input:

None

Output:

$value -> 0 = No debugging, 1 = Show debugging, 2 = Show even more debugging

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
my $w2pdebug = $w2p->GetW2PDebug();

print( "Word2Phrase Debug Level: $w2pdebug\n" ) if defined( $w2pdebug );

undef( $w2p );

GetWorkingDir

Description:

Returns (string) working directory path.

Input:

None

Output:

$string -> Current working directory path

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
my $workingDir = $w2p->GetWorkingDir();

print( "Working Directory: $workingDir\n" ) if defined( $workingDir );

undef( $w2p );

GetWord2PhraseExeDir

Description:

Returns (string) word2phrase executable directory path.

Input:

None

Output:

$string -> Word2Phrase executable directory path

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
my $exeDir = $w2p->GetWord2PhraseExeDir();

print( "Word2Phrase Executable Directory: $exeDir\n" ) if defined( $exeDir );

undef( $w2p );

GetOverwriteOldFile

Description:

Returns the current value of the overwrite training file variable.

Input:

None

Output:

$value -> 1 = True/Overwrite or 0 = False/Append to current file

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
my $overwrite = $w2p->GetOverwriteOldFile();

if( defined( $overwrite ) )
{
   print( "Overwrite Old File: " );
   print( "Yes\n" ) if $overwrite == 1;
   print( "No\n" ) if $overwrite == 0;
}

undef( $w2p );

Mutator Functions

SetTrainFilePath

Description:

Sets training file path.

Input:

$string -> Training file path

Output:

None

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
$w2p->SetTrainFilePath( "filePath" );

undef( $w2p );

SetOutputFilePath

Description:

Sets word2phrase output file path.

Input:

$string -> word2phrase output file path

Output:

None

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
$w2p->SetOutputFilePath( "filePath" );

undef( $w2p );

SetMinCount

Description:

Sets minimum range value.

Input:

$value -> Minimum frequency value (Positive integer)

Output:

None

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
$w2p->SetMinCount( 1 );

undef( $w2p );

SetThreshold

Description:

Sets maximum range value.

Input:

$value -> Maximum frequency value (Positive integer)

Output:

None

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
$w2p->SetThreshold( 100 );

undef( $w2p );

SetW2PDebug

Description:

Sets word2phrase debug parameter.

Input:

$value -> word2phrase debug parameter (0 = No debug info, 1 = Show debug info, 2 = Show more debug info.)

Output:

None

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
$w2p->SetW2PDebug( 2 );

undef( $w2p );

SetWorkingDir

Description:

Sets working directory path.

Input:

$string -> Current working directory path.

Output:

None

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
$w2p->SetWorkingDir( "filePath" );

undef( $w2p );

SetWord2PhraseExeDir

Description:

Sets word2phrase executable file directory path.

Input:

$string -> Word2Phrase executable directory path.

Output:

None

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
$w2p->SetWord2PhraseExeDir( "filePath" );

undef( $w2p );

SetOverwriteOldFile

Description:

Enables overwriting word2phrase output file if one already exists with the same output file name.

Input:

$value -> Integer: 1 = Overwrite old file, 0 = Do not overwrite old file.

Output:

None

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
$w2p->SetOverwriteOldFile( 1 );

undef( $w2p );

Debug Functions

GetTime

Description:

Returns current time string in "Hour:Minute:Second" format.

Input:

None

Output:

$string -> XX:XX:XX ("Hour:Minute:Second")

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
my $time = $w2p->GetTime();

print( "Current Time: $time\n" ) if defined( $time );

undef( $w2p );

GetDate

Description:

Returns current month, day and year string in "Month/Day/Year" format.

Input:

None

Output:

$string -> XX/XX/XXXX ("Month/Day/Year")

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
my $date = $w2p->GetDate();

print( "Current Date: $date\n" ) if defined( $date );

undef( $w2p );
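
The "Hour:Minute:Second" and "Month/Day/Year" formats described for GetTime and GetDate can be reproduced with core Perl, which helps when comparing log timestamps. This is a sketch only: zero padding and a 24-hour clock are assumptions, and the module may format its strings differently.

```perl
use strict;
use warnings;
use POSIX qw( strftime );

# Capture the local time once so both strings describe the same moment.
my @now  = localtime();
my $time = strftime( "%H:%M:%S", @now );    # Hour:Minute:Second
my $date = strftime( "%m/%d/%Y", @now );    # Month/Day/Year

print( "Current Time: $time\n" );
print( "Current Date: $date\n" );
```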

WriteLog

Description:

Prints passed string parameter to the console, log file or both depending on user options.

Note: A new line character is printed after the string unless the printNewLine parameter is 0.
An undefined printNewLine parameter also prints the new line character.

Input:

$string -> String to print to the console/log file.
$value  -> 0 = Do not print newline character after string, all else prints new line character including 'undef'.

Output:

None

Example:

use Word2vec::Word2phrase;

my $w2p = Word2vec::Word2phrase->new();
$w2p->WriteLog( "Hello World" );

undef( $w2p );
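
The new line rule described above can be sketched as a small standalone function. FormatLogString is a hypothetical helper that mirrors the documented behavior; it is not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical helper: appends a new line character unless the second
# argument is defined and equal to 0.
sub FormatLogString
{
    my ( $string, $printNewLine ) = @_;
    my $suppress = defined( $printNewLine ) && $printNewLine eq "0";
    return $suppress ? $string : $string . "\n";
}

# The first call keeps the cursor on the same line, the second ends the line.
print( FormatLogString( "Training started ... ", 0 ) );
print( FormatLogString( "done" ) );
```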

AUTHOR

Clint Cuffy, Virginia Commonwealth University

COPYRIGHT

Copyright (c) 2016

Bridget T McInnes, Virginia Commonwealth University
btmcinnes at vcu dot edu

Clint Cuffy, Virginia Commonwealth University
cuffyca at vcu dot edu

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to:

The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA  02111-1307, USA.