NAME
Word2vec::Interface - Interface module for word2vec.pm, word2phrase.pm, xmltow2v.pm modules and associated utilities.
SYNOPSIS
use Word2vec::Interface;
my $result = 0;
# Compile a text corpus, execute word2vec training and compute cosine similarity of two words
my $w2vinterface = Word2vec::Interface->new();
my $xmlconv = $w2vinterface->GetXMLToW2VHandler();
$xmlconv->SetWorkingDir( "Medline/XML/Directory/Here" );
$xmlconv->SetSavePath( "textcorpus.txt" );
$xmlconv->SetStoreTitle( 1 );
$xmlconv->SetStoreAbstract( 1 );
$xmlconv->SetBeginDate( "01/01/2004" );
$xmlconv->SetEndDate( "08/13/2016" );
$xmlconv->SetOverwriteExistingFile( 1 );
# If Compound Word File Exists, Store It In Memory
# And Create Compound Word Binary Search Tree Using The Compound Word Data
$xmlconv->ReadCompoundWordDataFromFile( "compoundword.txt" );
$xmlconv->CreateCompoundWordBST();
# Parse XML Files or Directory Of Files
$result = $xmlconv->ConvertMedlineXMLToW2V( "/xmlDirectory/" );
# Check(s)
print( "Error Parsing Medline XML Files\n" ) if ( $result == -1 );
exit if ( $result == -1 );
# Setup And Execute word2vec Training
my $word2vec = $w2vinterface->GetWord2VecHandler();
$word2vec->SetTrainFilePath( "textcorpus.txt" );
$word2vec->SetOutputFilePath( "vectors.bin" );
$word2vec->SetWordVecSize( 200 );
$word2vec->SetWindowSize( 8 );
$word2vec->SetSample( 0.0001 );
$word2vec->SetNegative( 25 );
$word2vec->SetHSoftMax( 0 );
$word2vec->SetBinaryOutput( 0 );
$word2vec->SetNumOfThreads( 20 );
$word2vec->SetNumOfIterations( 12 );
$word2vec->SetUseCBOW( 1 );
$word2vec->SetOverwriteOldFile( 0 );
# Execute word2vec Training
$result = $word2vec->ExecuteTraining();
# Check(s)
print( "Error Training Word2vec On File: \"textcorpus.txt\"\n" ) if ( $result == -1 );
exit if ( $result == -1 );
# Read word2vec Training Data Into Memory And Store As A Binary Search Tree
$result = $word2vec->ReadTrainedVectorDataFromFile( "vectors.bin" );
# Check(s)
print( "Error Unable To Read Word2vec Trained Vector Data From File\n" ) if ( $result == -1 );
exit if ( $result == -1 );
# Compute Cosine Similarity Between "respiratory" and "arrest"
$result = $word2vec->ComputeCosineSimilarity( "respiratory", "arrest" );
print( "Cosine Similarity Between \"respiratory\" and \"arrest\": $result\n" ) if defined( $result );
print( "Error Computing Cosine Similarity\n" ) if !defined( $result );
# Compute Cosine Similarity Between "respiratory arrest" and "heart attack"
$result = $word2vec->ComputeMultiWordCosineSimilarity( "respiratory arrest", "heart attack" );
print( "Cosine Similarity Between \"respiratory arrest\" and \"heart attack\": $result\n" ) if defined( $result );
print( "Error Computing Cosine Similarity\n" ) if !defined( $result );
undef( $w2vinterface );
# or
use Word2vec::Interface;
my $result = 0;
my $w2vinterface = Word2vec::Interface->new();
$w2vinterface->XTWSetWorkingDir( "Medline/XML/Directory/Here" );
$w2vinterface->XTWSetSavePath( "textcorpus.txt" );
$w2vinterface->XTWSetStoreTitle( 1 );
$w2vinterface->XTWSetStoreAbstract( 1 );
$w2vinterface->XTWSetBeginDate( "01/01/2004" );
$w2vinterface->XTWSetEndDate( "08/13/2016" );
$w2vinterface->XTWSetOverwriteExistingFile( 1 );
# If Compound Word File Exists, Store It In Memory
# And Create Compound Word Binary Search Tree Using The Compound Word Data
$w2vinterface->XTWReadCompoundWordDataFromFile( "compoundword.txt" );
$w2vinterface->XTWCreateCompoundWordBST();
# Parse XML Files or Directory Of Files
$result = $w2vinterface->XTWConvertMedlineXMLToW2V( "/xmlDirectory/" );
$result = $w2vinterface->W2VExecuteTraining( "textcorpus.txt", "vectors.bin", 200, 8, undef, 0.001, 25,
undef, 0, 0, 20, 15, 1, 0, undef, undef, undef, 1 );
# Read word2vec Training Data Into Memory And Store As A Binary Search Tree
$result = $w2vinterface->W2VReadTrainedVectorDataFromFile( "vectors.bin" );
# Check(s)
print( "Error Unable To Read Word2vec Trained Vector Data From File\n" ) if ( $result == -1 );
exit if ( $result == -1 );
# Compute Cosine Similarity Between "respiratory" and "arrest"
$result = $w2vinterface->W2VComputeCosineSimilarity( "respiratory", "arrest" );
print( "Cosine Similarity Between \"respiratory\" and \"arrest\": $result\n" ) if defined( $result );
print( "Error Computing Cosine Similarity\n" ) if !defined( $result );
# Compute Cosine Similarity Between "respiratory arrest" and "heart attack"
$result = $w2vinterface->W2VComputeMultiWordCosineSimilarity( "respiratory arrest", "heart attack" );
print( "Cosine Similarity Between \"respiratory arrest\" and \"heart attack\": $result\n" ) if defined( $result );
print( "Error Computing Cosine Similarity\n" ) if !defined( $result );
undef( $w2vinterface );
DESCRIPTION
Word2vec::Interface is an interface module for word2vec, word2phrase, xmltow2v and their associated functions.
This module houses a set of functions, modules and utilities for use with UMLS Similarity.
XmlToW2v Features:
- Compilation of a text corpus from plain or gzipped Medline XML files.
- Multi-threaded text corpus compilation support.
- Selection of text corpus articles by date range.
- Selection of article content by title, abstract or both.
- On-the-fly compoundifying while building the text corpus, given a compound word file.
Word2vec Features:
- Word2vec training with user specified settings.
- Manipulation of Word2vec word vectors. (Addition/Subtraction/Average)
- Word2vec binary format to plain text file conversion.
- Word2vec plain text to binary format file conversion.
- Multi-word cosine similarity computation. (Pseudo-compound word cosine similarity)
Word2phrase Features:
- Word2phrase training with user specified settings.
Interface Features:
- Word Sense Disambiguation via trained word2vec data.
Interface Main Functions
new
Description:
Returns a new "Word2vec::Interface" module object.
Note: Specifying no parameters implies default options.
Default Parameters:
word2vecDir = "../../External/word2vec"
debugLog = 0
writeLog = 0
ignoreCompileErrors = 0
ignoreFileChecks = 0
exitFlag = 0
workingDir = ""
word2vec = Word2vec::Word2vec->new()
word2phrase = Word2vec::Word2phrase->new()
xmltow2v = Word2vec::Xmltow2v->new()
util = Word2vec::Interface()
instanceAry = ()
senseAry = ()
instanceCount = 0
senseCount = 0
Input:
$word2vecDir -> Specifies word2vec package source/executable directory.
$debugLog -> Instructs module to print debug statements to the console. ('1' = True / '0' = False)
$writeLog -> Instructs module to print debug statements to a log file. ('1' = True / '0' = False)
$ignoreCompileErrors -> Instructs module to ignore source code compilation errors. ('1' = True / '0' = False)
$ignoreFileChecks -> Instructs module to ignore file checks. ('1' = True / '0' = False)
$exitFlag -> In the event of a run-time check error, exitFlag is set to '1' which gracefully terminates the script.
$workingDir -> Specifies the current working directory.
$word2vec -> Word2vec::Word2vec object.
$word2phrase -> Word2vec::Word2phrase object.
$xmltow2v -> Word2vec::Xmltow2v object.
$interface -> Word2vec::Interface object.
$instanceAry -> Word Sense Disambiguation: Array of instances.
$senseAry -> Word Sense Disambiguation: Array of senses.
$instanceCount -> Number of Word Sense Disambiguation instances loaded in memory.
$senseCount -> Number of Word Sense Disambiguation senses loaded in memory.
Note: Specifying all new() parameters is not recommended, as this has not been thoroughly tested. The recommended maximum set of parameters is:
"word2vecDir, debugLog, writeLog, ignoreCompileErrors, ignoreFileChecks"
Output:
Word2vec::Interface object.
Example:
use Word2vec::Interface;
# Parameters: Word2Vec Directory = undef, DebugLog = True, WriteLog = False, IgnoreCompileErrors = False, IgnoreFileChecks = False
my $interface = Word2vec::Interface->new( undef, 1, 0 );
undef( $interface );
# Or
# Parameters: Word2Vec Directory = undef, DebugLog = False, WriteLog = False, IgnoreCompileErrors = False, IgnoreFileChecks = False
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
undef( $interface );
DESTROY
Description:
Removes member variables and file handle from memory.
Input:
None
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->DESTROY();
undef( $interface );
RunFileChecks
Description:
Runs word2vec file checks. Looks for the word2vec executable files; if they are not found,
it looks for the source code and compiles it automatically, placing the executable
files in the same directory. Errors out gracefully when the word2vec executable files
are not present and the source files cannot be located.
Notes : Word2vec Executable File List: word2vec, word2phrase, word-analogy, distance, compute-accuracy.
: This method is called automatically by the Word2vec::Interface new() function. It can be disabled by setting
the _ignoreFileChecks new() parameter to 1.
Input:
$string -> Word2vec source/executable directory.
Output:
$value -> Returns '1' if checks passed and '0' if file checks failed.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new( undef, 1, 0, 1, 1 );
my $result = $interface->RunFileChecks();
print( "Passed Word2Vec File Checks!\n" ) if $result == 1;
print( "Failed Word2Vec File Checks!\n" ) if $result == 0;
undef( $interface );
_CheckIfExecutableFileExists
Description:
Checks whether the specified executable file exists in a given directory.
Input:
$filePath -> Executable file path
$fileName -> Executable file name
Output:
$value -> Returns '1' if file is found and '0' if otherwise.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $result = $interface->_CheckIfExecutableFileExists( "../../External/word2vec", "word2vec" );
print( "Executable File Exists!\n" ) if $result == 1;
print( "Executable File Does Not Exist!\n" ) if $result == 0;
undef( $interface );
_CheckIfSourceFileExists
Description:
Checks the specified directory (string) for the file name (string)
and ensures the file is of type "text/cpp".
Input:
$filePath -> Source file path
$fileName -> Source file name
Output:
$value -> Returns '1' if file is found and of type "text/cpp" and '0' if otherwise.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $result = $interface->_CheckIfSourceFileExists( "../../External/word2vec", "word2vec" );
print( "Source File Exists!\n" ) if $result == 1;
print( "Source File Does Not Exist!\n" ) if $result == 0;
undef( $interface );
_CompileSourceFile
Description:
Compiles C++ source filename in a specified directory.
Input:
$filePath -> Source file path (string)
$fileName -> Source file name (string)
Output:
$value -> Returns '1' if successful and '0' if un-successful.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $result = $interface->_CompileSourceFile( "../../External/word2vec", "word2vec" );
print( "Compiled Source Successfully!\n" ) if $result == 1;
print( "Source Compilation Attempt Unsuccessful!\n" ) if $result == 0;
undef( $interface );
GetFileType
Description:
Checks file in given file path and if it exists, returns the file type.
Input:
$filePath -> File path
Output:
$string -> Returns file type (string).
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $fileType = $interface->GetFileType( "samples/textcorpus.txt" );
print( "File Type: $fileType\n" );
undef( $interface );
GetOSType
Description:
Returns current operating system (string).
Input:
None
Output:
$string -> Operating System Type. (String)
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $os = $interface->GetOSType();
print( "Operating System: $os\n" );
undef( $interface );
_ModifyWord2VecSourceForWindows
Description:
Modifies the "word2vec.c" file for compilation under the Windows operating system.
Input:
None
Output:
$value -> '1' = Successful / '0' = Un-successful
Example:
This is a private function and should not be utilized.
_RemoveWord2VecSourceModification
Description:
Removes modification of "word2vec.c". Returns source file to its original state.
Input:
None
Output:
$value -> '1' = Successful / '0' = Un-successful.
Example:
This is a private function and should not be utilized.
Interface Command-Line Functions
CLComputeCosineSimilarity
Description:
Command-line Method: Computes cosine similarity between 'wordA' and 'wordB' using the specified 'filePath' for
loading trained word2vec word vector data.
Input:
$filePath -> Word2Vec trained word vectors binary file path. (String)
$wordA -> First word for cosine similarity comparison.
$wordB -> Second word for cosine similarity comparison.
Output:
$value -> Cosine similarity value (float) or undefined.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $value = $interface->CLComputeCosineSimilarity( "../../samples/samplevectors.bin", "of", "the" );
print( "Cosine Similarity Between \"of\" and \"the\": $value\n" ) if defined( $value );
print( "Error: Cosine Similarity Could Not Be Computed\n" ) if !defined( $value );
undef( $interface );
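For reference, the cosine similarity that this method reports can be sketched in plain Perl. This is a minimal illustration that does not use the module; the helper name cosine_similarity and the toy three-dimensional vectors are assumptions, not part of Word2vec::Interface.

```perl
use strict;
use warnings;

# Cosine similarity of two equal-length numeric vectors (array references).
# Returns undef when the vectors differ in length or either has zero magnitude,
# mirroring the "float or undefined" output described above.
sub cosine_similarity {
    my ( $vecA, $vecB ) = @_;
    return undef if !defined( $vecA ) || !defined( $vecB ) || @$vecA != @$vecB;

    my ( $dot, $normA, $normB ) = ( 0, 0, 0 );
    for my $i ( 0 .. $#$vecA ) {
        $dot   += $vecA->[$i] * $vecB->[$i];
        $normA += $vecA->[$i] ** 2;
        $normB += $vecB->[$i] ** 2;
    }
    return undef if $normA == 0 || $normB == 0;
    return $dot / ( sqrt( $normA ) * sqrt( $normB ) );
}

# Hypothetical toy vectors; real word2vec vectors have hundreds of dimensions.
my %toyVectors = ( of => [ 0.2, 0.1, 0.4 ], the => [ 0.3, 0.1, 0.5 ] );
my $similarity = cosine_similarity( $toyVectors{of}, $toyVectors{the} );
print( "Cosine Similarity Between \"of\" and \"the\": $similarity\n" ) if defined( $similarity );
```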
CLComputeMultiWordCosineSimilarity
Description:
Command-line Method: Computes cosine similarity between 'phraseA' and 'phraseB' using the specified 'filePath'
for loading trained word2vec word vector data.
Note: Supports multiple words concatenated by ':' for each string.
Input:
$filePath -> Word2Vec trained word vectors binary file path. (String)
$phraseA -> First phrase for cosine similarity comparison. (String)
$phraseB -> Second phrase for cosine similarity comparison. (String)
Output:
$value -> Cosine similarity value (float) or undefined.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $value = $interface->CLComputeMultiWordCosineSimilarity( "../../samples/samplevectors.bin", "heart:attack", "myocardial:infarction" );
print( "Cosine Similarity Between \"heart attack\" and \"myocardial infarction\": $value\n" ) if defined( $value );
print( "Error: Cosine Similarity Could Not Be Computed\n" ) if !defined( $value );
undef( $interface );
CLComputeAvgOfWordsCosineSimilarity
Description:
Command-line Method: Computes the average of the word vectors for all words in 'phraseA' and in 'phraseB',
then computes the cosine similarity between the two averaged vectors, using the
specified 'filePath' for loading trained word2vec word vector data.
Note: Supports multiple words concatenated by ':' for each string.
Input:
$filePath -> Word2Vec trained word vectors binary file path. (String)
$phraseA -> First phrase for cosine similarity comparison.
$phraseB -> Second phrase for cosine similarity comparison.
Output:
$value -> Cosine similarity value (float) or undefined.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $value = $interface->CLComputeAvgOfWordsCosineSimilarity( "../../samples/samplevectors.bin", "heart:attack", "myocardial:infarction" );
print( "Cosine Similarity Between \"heart attack\" and \"myocardial infarction\": $value\n" ) if defined( $value );
print( "Error: Cosine Similarity Could Not Be Computed\n" ) if !defined( $value );
undef( $interface );
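The averaging step described for this method can be sketched as follows. This is only an illustration under stated assumptions: the helper name average_phrase_vector and the word-to-vector hash layout are hypothetical, and the module's internal implementation may differ. The two averaged vectors would then be compared with an ordinary cosine similarity.

```perl
use strict;
use warnings;

# Average the vectors of every ':'-separated word in a phrase.
# $vectors is a hash reference of word => array reference (hypothetical layout).
# Returns an array reference, or undef if the phrase is empty or a word is missing.
sub average_phrase_vector {
    my ( $vectors, $phrase ) = @_;
    my @words = split( /:/, $phrase );
    return undef if !@words;

    my @sum;
    for my $word ( @words ) {
        my $vec = $vectors->{$word};
        return undef if !defined( $vec );
        for my $i ( 0 .. $#$vec ) {
            $sum[$i] = ( defined( $sum[$i] ) ? $sum[$i] : 0 ) + $vec->[$i];
        }
    }
    my $count = scalar( @words );
    return [ map { $_ / $count } @sum ];
}

# Hypothetical two-dimensional toy vectors.
my %toyVectors = ( heart => [ 1, 0 ], attack => [ 0, 1 ] );
my $average = average_phrase_vector( \%toyVectors, "heart:attack" );
print( "Averaged Vector: @$average\n" ) if defined( $average );   # prints "Averaged Vector: 0.5 0.5"
```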
CLMultiWordCosSimWithUserInput
Description:
Command-line Method: Computes cosine similarity for words entered by the user, given a trained vector binary file path (string).
Note: Words can be compounded by the ':' character.
Input:
$filePath -> Word2Vec trained word vectors binary file path. (String)
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->CLMultiWordCosSimWithUserInput( "../../samples/samplevectors.bin" );
undef( $interface );
CLAddTwoWordVectors
Description:
Command-line Method: Loads the specified word2vec trained binary data file, adds word vectors and returns the summed result.
Input:
$filePath -> Word2Vec trained word vectors binary file path. (String)
$wordDataA -> Word2Vec word data (String)
$wordDataB -> Word2Vec word data (String)
Output:
$vectorData -> Summed '$wordDataA' and '$wordDataB' vectors
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $wordVtr = $interface->CLAddTwoWordVectors( "../../samples/samplevectors.bin", "of", "the" );
print( "Word Vector for \"of\" + \"the\": $wordVtr\n" ) if defined( $wordVtr );
print( "Word Vector Cannot Be Computed\n" ) if !defined( $wordVtr );
undef( $interface );
CLSubtractTwoWordVectors
Description:
Command-line Method: Loads the specified word2vec trained binary data file, subtracts word vectors and returns the difference result.
Input:
$filePath -> Word2Vec trained word vectors binary file path. (String)
$wordDataA -> Word2Vec word data (String)
$wordDataB -> Word2Vec word data (String)
Output:
$vectorData -> Difference of '$wordDataA' and '$wordDataB' vectors
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $wordVtr = $interface->CLSubtractTwoWordVectors( "../../samples/samplevectors.bin", "of", "the" );
print( "Word Vector for \"of\" - \"the\": $wordVtr\n" ) if defined( $wordVtr );
print( "Word Vector Cannot Be Computed\n" ) if !defined( $wordVtr );
undef( $interface );
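The addition and subtraction performed by CLAddTwoWordVectors and CLSubtractTwoWordVectors are element-wise operations on two vectors of equal length. A minimal sketch, assuming vectors held as array references; the helper name combine_vectors is hypothetical and not part of the module:

```perl
use strict;
use warnings;

# Combine two equal-length vectors element by element.
# $op is '+' (default) or '-'; returns an array reference, or undef on mismatched input.
sub combine_vectors {
    my ( $vecA, $vecB, $op ) = @_;
    return undef if !defined( $vecA ) || !defined( $vecB ) || @$vecA != @$vecB;
    $op = '+' if !defined( $op );
    my $sign = ( $op eq '-' ) ? -1 : 1;
    return [ map { $vecA->[$_] + $sign * $vecB->[$_] } 0 .. $#$vecA ];
}

my $sum  = combine_vectors( [ 1, 2, 3 ], [ 4, 5, 6 ], '+' );
my $diff = combine_vectors( [ 1, 2, 3 ], [ 4, 5, 6 ], '-' );
print( "Sum: @$sum\n" );           # prints "Sum: 5 7 9"
print( "Difference: @$diff\n" );   # prints "Difference: -3 -3 -3"
```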
CLStartWord2VecTraining
Description:
Command-line Method: Executes word2vec training given the specified options hash.
Input:
$hashRef -> Hash reference of word2vec options
Output:
$value -> Returns '0' = Successful / '-1' = Un-successful.
Example:
use Word2vec::Interface;
my %options;
$options{'-trainfile'} = "../../samples/textcorpus.txt";
$options{'-outputfile'} = "../../samples/tempvectors.bin";
my $interface = Word2vec::Interface->new();
my $result = $interface->CLStartWord2VecTraining( \%options );
print( "Success!\n" ) if $result == 0;
print( "Failed!\n" ) if $result == -1;
undef( $interface );
CLStartWord2PhraseTraining
Description:
Command-line Method: Executes word2phrase training given the specified options hash.
Input:
$hashRef -> Hash reference of word2phrase options.
Output:
$value -> Returns '0' = Successful / '-1' = Un-successful.
Example:
use Word2vec::Interface;
my %options;
$options{'-trainfile'} = "../../samples/textcorpus.txt";
$options{'-outputfile'} = "../../samples/tempvectors.bin";
my $interface = Word2vec::Interface->new();
my $result = $interface->CLStartWord2PhraseTraining( \%options );
print( "Success!\n" ) if $result == 0;
print( "Failed!\n" ) if $result == -1;
undef( $interface );
CLCleanText
Description:
Command-line Method: Reads an input text file, normalizes based on the settings below and prints to a new file.
- All Text Converted To Lowercase
- Duplicate White Spaces Removed
- "'s" (Apostrophe 's') Characters Removed
- Hyphen "-" Replaced With Whitespace
- All Characters Outside Of "a-z" and NewLine Characters Are Removed
- Lastly, Whitespace Before And After Text Is Removed
Input:
$hashRef -> Hash reference of inputfile/outputfile options.
Output:
$value -> Returns '0' = Successful / '-1' = Un-successful.
Example:
use Word2vec::Interface;
my %options;
$options{'-inputfile'} = "../../samples/test.txt";
$options{'-outputfile'} = "../../samples/clean_text.txt";
my $interface = Word2vec::Interface->new();
my $result = $interface->CLCleanText( \%options );
print( "Success!\n" ) if $result == 0;
print( "Failed!\n" ) if $result == -1;
undef( $interface );
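The normalization rules listed above can be approximated with a handful of regular expressions. The sketch below is an assumption about how those rules combine, not the module's actual code: the helper name clean_line is hypothetical, it operates on a single line, and it assumes that spaces are kept alongside "a-z" so that words stay separated.

```perl
use strict;
use warnings;

# Normalize one line of text following the rules listed for CLCleanText.
sub clean_line {
    my ( $text ) = @_;
    $text = lc( $text );          # all text converted to lowercase
    $text =~ s/'s\b//g;           # "'s" characters removed
    $text =~ s/-/ /g;             # hyphen replaced with whitespace
    $text =~ s/[^a-z \n]//g;      # keep only a-z, spaces (assumption) and newlines
    $text =~ s/ {2,}/ /g;         # duplicate white spaces removed
    $text =~ s/^\s+|\s+$//g;      # whitespace before and after text removed
    return $text;
}

print( clean_line( "The Patient's Heart-Rate!" ), "\n" );   # prints "the patient heart rate"
```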
CLCompileTextCorpus
Description:
Command-line Method: Compiles a text corpus given the specified options hash.
Input:
$hashRef -> Hash reference of xmltow2v options.
Output:
$value -> Returns '0' = Successful / '-1' = Un-successful.
Example:
use Word2vec::Interface;
my %options;
$options{'-workdir'} = "../../samples";
$options{'-savedir'} = "../../samples/textcorpus.txt";
my $interface = Word2vec::Interface->new();
my $result = $interface->CLCompileTextCorpus( \%options );
print( "Success!\n" ) if $result == 0;
print( "Failed!\n" ) if $result == -1;
undef( $interface );
CLConvertWord2VecVectorFileToText
Description:
Command-line Method: Converts word2vec binary format to plain text word vector data.
Input:
$filePath -> Word2Vec binary file path
$savePath -> Path to save converted file
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $result = $interface->CLConvertWord2VecVectorFileToText( "../../samples/samplevectors.bin", "../../samples/convertedvectors.bin" );
print( "Success!\n" ) if $result == 0;
print( "Failed!\n" ) if $result == -1;
undef( $interface );
CLConvertWord2VecVectorFileToBinary
Description:
Command-line Method: Converts plain text word vector data to word2vec binary format.
Input:
$filePath -> Word2Vec plain text file path
$savePath -> Path to save converted file
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $result = $interface->CLConvertWord2VecVectorFileToBinary( "../../samples/samplevectors.bin", "../../samples/convertedvectors.bin" );
print( "Success!\n" ) if $result == 0;
print( "Failed!\n" ) if $result == -1;
undef( $interface );
CLConvertWord2VecVectorFileToSparse
Description:
Command-line Method: Converts plain text word vector data to sparse vector data format.
Input:
$filePath -> Vectors file path
$savePath -> Path to save converted file
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $result = $interface->CLConvertWord2VecVectorFileToSparse( "../../samples/samplevectors.bin", "../../samples/convertedvectors.bin" );
print( "Success!\n" ) if $result == 0;
print( "Failed!\n" ) if $result == -1;
undef( $interface );
CLCompoundifyTextInFile
Description:
Command-line Method: Reads the plain text file at 'filePath' and the compound word file at 'compoundWordFile', then compoundifies the text and saves the result to 'savePath'.
Input:
$filePath -> Text file to compoundify
$savePath -> Path to save compoundified file
$compoundWordFile -> Compound word file path
Output:
$value -> Result '0' = Successful / '-1' = Un-successful
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $result = $interface->CLCompoundifyTextInFile( "../../samples/textcorpus.txt", "../../samples/compoundcorpus.txt", "../../samples/compoundword.txt" );
print( "Success!\n" ) if $result == 0;
print( "Failed!\n" ) if $result == -1;
undef( $interface );
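To illustrate what compoundifying means, the sketch below joins known multi-word terms into single tokens. It is a simplified stand-in that uses regular expression substitution rather than the compound word binary search tree mentioned in the synopsis; the helper name compoundify and the underscore joining character are assumptions.

```perl
use strict;
use warnings;

# Replace each known compound phrase in $text with an underscore-joined token.
sub compoundify {
    my ( $text, @compounds ) = @_;
    # Try longer phrases first so a longer compound wins over a shorter one it contains.
    for my $phrase ( sort { length( $b ) <=> length( $a ) } @compounds ) {
        ( my $joined = $phrase ) =~ s/ /_/g;
        $text =~ s/\b\Q$phrase\E\b/$joined/g;
    }
    return $text;
}

my @compoundWords = ( "heart attack", "respiratory arrest" );
print( compoundify( "a heart attack and respiratory arrest", @compoundWords ), "\n" );
# prints "a heart_attack and respiratory_arrest"
```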
CLSortVectorFile
Description:
Reads a specified vector file into memory, sorts it alphanumerically and saves it to a file.
Input:
$hashRef -> Hash reference of parameters. (File path and overwrite parameters)
Output:
$value -> Result '0' = Successful / '-1' = Un-successful
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my %options;
$options{ "-filepath" } = "vectors.bin";
$options{ "-overwrite" } = 1;
my $result = $interface->CLSortVectorFile( \%options );
print( "Success!\n" ) if $result == 0;
print( "Failed!\n" ) if $result == -1;
undef( $interface );
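The sorting step can be pictured on in-memory lines of a plain text vector file. This sketch assumes the common word2vec text layout of an optional "<word count> <dimensions>" header line followed by one "word v1 v2 ..." line per term; the helper name sort_vector_lines is hypothetical and the module's own handling may differ.

```perl
use strict;
use warnings;

# Sort vector lines by their leading word, keeping an optional header line first.
sub sort_vector_lines {
    my ( @lines ) = @_;
    my @header;
    # Assumption: a header line consists of two integers (word count and dimensions).
    push( @header, shift( @lines ) ) if @lines && $lines[0] =~ /^\d+ \d+\s*$/;
    my @sorted = sort { ( split( ' ', $a ) )[0] cmp ( split( ' ', $b ) )[0] } @lines;
    return ( @header, @sorted );
}

my @vectorLines = ( "3 2", "the 0.1 0.2", "arrest 0.3 0.4", "of 0.5 0.6" );
print( "$_\n" ) for sort_vector_lines( @vectorLines );
```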
CLFindSimilarTerms
Description:
Fetches an array containing the nearest n terms, using cosine similarity as the metric for determining similar terms.
Input:
$term -> Comparison term used to find similar terms.
$numberOfSimilarTerms -> Integer value used to limit the number of elements in array returned.
Output:
$value -> 'Array reference' = Successful / 'undef' = Un-successful
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $result = $interface->W2VReadTrainedVectorDataFromFile( "vectors.bin" );
# Only query similar terms when the vector data loaded successfully
$result = ( $result == 0 ) ? $interface->CLFindSimilarTerms( "cookie", 10 ) : undef;
print "Success\n" if defined( $result );
print "Error: No Elements Returned\n" if !defined( $result );
exit if !defined( $result );
for my $element ( @{ $result } )
{
print "$element\n";
}
undef( $interface );
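Conceptually, finding the nearest n terms means scoring every other term by cosine similarity against the comparison term and keeping the top n. A minimal sketch under stated assumptions: the helper names cosine and nearest_terms, the word-to-vector hash layout and the toy vectors are all hypothetical and independent of the module.

```perl
use strict;
use warnings;

# Cosine similarity of two equal-length vectors; undef for zero-magnitude input.
sub cosine {
    my ( $u, $v ) = @_;
    my ( $dot, $nu, $nv ) = ( 0, 0, 0 );
    for my $i ( 0 .. $#$u ) {
        $dot += $u->[$i] * $v->[$i];
        $nu  += $u->[$i] ** 2;
        $nv  += $v->[$i] ** 2;
    }
    return undef if $nu == 0 || $nv == 0;
    return $dot / sqrt( $nu * $nv );
}

# Return an array reference of the $n terms most similar to $term, or undef on error.
sub nearest_terms {
    my ( $vectors, $term, $n ) = @_;
    my $target = $vectors->{$term};
    return undef if !defined( $target ) || !defined( $n ) || $n < 1;

    my %score;
    for my $word ( keys %$vectors ) {
        next if $word eq $term;
        my $sim = cosine( $target, $vectors->{$word} );
        $score{$word} = $sim if defined( $sim );
    }
    # Highest similarity first; ties broken alphabetically for a stable order.
    my @ranked = sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score;
    $#ranked = $n - 1 if @ranked > $n;
    return \@ranked;
}

my %toyVectors = (
    cookie  => [ 1.0, 0.0 ],
    biscuit => [ 0.9, 0.1 ],
    cake    => [ 0.6, 0.8 ],
    engine  => [ 0.0, 1.0 ],
);
my $nearest = nearest_terms( \%toyVectors, "cookie", 2 );
print( "$_\n" ) for @{ $nearest };
```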
CleanWord2VecDirectory
Description:
Cleans up C object and executable files in word2vec directory.
Input:
None
Output:
$value -> Result '0' = Successful / '-1' = Un-successful
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $result = $interface->CleanWord2VecDirectory();
print( "Success!\n" ) if $result == 0;
print( "Failed!\n" ) if $result == -1;
undef( $interface );
CLSimilarityAvg
Description:
Computes cosine similarity of average values for a list of specified word comparisons given a file.
Note: Trained vector data must be loaded in memory before calling this method.
Input:
$filePath -> Text file with list of word comparisons by line.
Output:
$value -> Result '0' = Successful / '-1' = Un-successful
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $result = $interface->W2VReadTrainedVectorDataFromFile( "vectors.bin" );
$result = $interface->CLSimilarityAvg( "MiniMayoSRS.terms" ) if $result == 0;
print( "Success!\n" ) if $result == 0;
print( "Failed!\n" ) if $result == -1;
undef( $interface );
CLSimilarityComp
Description:
Computes cosine similarity values for a list of specified compound word comparisons given a file.
Note: Trained vector data must be loaded in memory before calling this method.
Input:
$filePath -> Text file with list of word comparisons by line.
Output:
$value -> Result '0' = Successful / '-1' = Un-successful
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $result = $interface->W2VReadTrainedVectorDataFromFile( "vectors.bin" );
$result = $interface->CLSimilarityComp( "MiniMayoSRS.terms" ) if $result == 0;
print( "Success!\n" ) if $result == 0;
print( "Failed!\n" ) if $result == -1;
undef( $interface );
CLSimilaritySum
Description:
Computes cosine similarity of summed values for a list of specified word comparisons given a file.
Note: Trained vector data must be loaded in memory before calling this method.
Input:
$filePath -> Text file with list of word comparisons by line.
Output:
$value -> Result '0' = Successful / '-1' = Un-successful
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $result = $interface->W2VReadTrainedVectorDataFromFile( "vectors.bin" );
$result = $interface->CLSimilaritySum( "MiniMayoSRS.terms" ) if $result == 0;
print( "Success!\n" ) if $result == 0;
print( "Failed!\n" ) if $result == -1;
undef( $interface );
CLWordSenseDisambiguation
Description:
Command-line Method: Assigns a particular sense to each instance using word2vec trained word vector data.
Stop words are removed if a stoplist is specified before computing cosine similarity average of each instance
and sense context.
Input:
$instanceFilePath -> WSD instance file path
$senseFilePath -> WSD sense file path
$vectorBinaryFile -> Word2vec trained word vector data file
$stopListFilePath -> Stop list file path
Output:
$value -> Returns '0' = Successful / '-1' = Un-successful
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $result = $interface->CLWordSenseDisambiguation( "ACE.instances.sval", "ACE.senses.sval", "vectors.bin", "stoplist" );
print( "Success!\n" ) if $result == 0;
print( "Failed!\n" ) if $result == -1;
undef( $interface );
_WSDAnalyzeSenseData
Description:
Analyzes sense sval files for identification number mismatch and adjusts accordingly in memory.
Input:
None
Output:
None
Example:
This is a private function and should not be utilized.
_WSDReadList
Description:
Reads a WSD list when the '-list' parameter is specified.
Input:
$listPath -> WSD list file path
Output:
\%listOfFile -> List of files hash reference
Example:
This is a private function and should not be utilized.
_WSDParseList
Description:
Parses the specified list of files for Word Sense Disambiguation computation.
Input:
$listOfFilesHashRef -> Hash reference to a hash of file paths
$vectorBinaryFile -> Word2vec trained word vector data file
$stopListFilePath -> Stop list file path
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
This is a private function and should not be utilized.
WSDParseFile
Description:
Parses a specified file in sval format and stores all context in memory. Utilized for
Word Sense Disambiguation cosine similarity computation.
Input:
$filePath -> WSD instance or sense file path
$stopListRegex -> Stop list regex ( Automatically generated with stop list file )
Output:
$arrayReference -> Array reference of WSD instances or WSD senses in memory.
Example:
This is a private function and should not be utilized.
WSDCalculateCosineAvgSimiliarity
Description:
For each instance stored in memory, this method computes an average cosine similarity for the context
of each instance and sense with stop words removed via stop list regex. After average cosine similarity
values are calculated for each instance and sense, the cosine similarity of each instance and sense is
computed. The highest cosine similarity value of a given instance to a particular sense is assigned and
stored.
Input:
None
Output:
$value -> Returns '0' = Successful / '-1' = Un-successful
Example:
This is a private function and should not be utilized.
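The final assignment step described above, choosing for an instance the sense with the highest similarity, can be sketched as follows. The helper name assign_sense, the sense identifiers and the data layout are hypothetical. The similarity routine is passed in as a code reference, and the toy example uses a plain dot product, which equals cosine similarity only because the toy vectors have unit length.

```perl
use strict;
use warnings;

# Pick the sense whose vector is most similar to the instance vector.
# $similarity is a code reference taking two array references and returning a score or undef.
sub assign_sense {
    my ( $instanceVec, $senseVecs, $similarity ) = @_;
    my ( $bestSense, $bestScore );
    for my $senseId ( sort keys %$senseVecs ) {
        my $score = $similarity->( $instanceVec, $senseVecs->{$senseId} );
        next if !defined( $score );
        ( $bestSense, $bestScore ) = ( $senseId, $score )
            if !defined( $bestScore ) || $score > $bestScore;
    }
    return ( $bestSense, $bestScore );
}

# Stand-in similarity for unit-length toy vectors (assumption for this example only).
my $dotProduct = sub {
    my ( $u, $v ) = @_;
    my $dot = 0;
    $dot += $u->[$_] * $v->[$_] for 0 .. $#$u;
    return $dot;
};

my %senseVectors = ( M1 => [ 1, 0 ], M2 => [ 0, 1 ] );
my ( $sense, $score ) = assign_sense( [ 1, 0 ], \%senseVectors, $dotProduct );
print( "Assigned Sense: $sense ($score)\n" ) if defined( $sense );
```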
_WSDCalculateAccuracy
Description:
Computes accuracy of assigned sense identification for each instance in memory.
Input:
None
Output:
$value -> Returns accuracy percentage (float) or '-1' if un-successful.
Example:
This is a private function and should not be utilized.
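As an illustration of the accuracy figure, the sketch below counts how many instances received their expected sense and reports the share as a percentage, returning '-1' when there is nothing to score. The helper name calculate_accuracy, the 'assigned' and 'expected' hash keys and the 0 to 100 scale are assumptions; the module's internal data layout is not shown in this document.

```perl
use strict;
use warnings;

# Percentage of instances whose assigned sense matches the expected sense.
# Returns -1 when no instances are available.
sub calculate_accuracy {
    my ( $instances ) = @_;
    return -1 if !defined( $instances ) || !@$instances;

    my $correct = 0;
    for my $instance ( @$instances ) {
        next if !defined( $instance->{assigned} );
        $correct++ if $instance->{assigned} eq $instance->{expected};
    }
    return 100 * $correct / scalar( @$instances );
}

my @toyInstances = (
    { assigned => "M1", expected => "M1" },
    { assigned => "M2", expected => "M1" },
    { assigned => "M2", expected => "M2" },
    { assigned => "M1", expected => "M1" },
);
print( "Accuracy: ", calculate_accuracy( \@toyInstances ), "\n" );   # prints "Accuracy: 75"
```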
WSDPrintResults
Description:
For each instance, this method prints standard information to the console window consisting of:
Note: Only prints to console if the '--debuglog' or '--writelog' option is passed.
Input:
None
Output:
None
Example:
This is a private function and should not be utilized.
WSDSaveResults
Description:
Saves WSD results post sense identification assignment in the 'instanceFilePath' (string) location. Saved data consists of:
Input:
$instanceFilePath -> WSD instance file path
Output:
None
Example:
This is a private function and should not be utilized.
_WSDGenerateAccuracyReport
Description:
Fetches saved results for all instance files and stores accuracies for each in a text file.
Input:
$workingDirectory -> Directory of "*.results.txt" files
Output:
None
Example:
This is a private function and should not be utilized.
_WSDStop
Description:
Generates and returns a stop list regex given a 'stopListFilePath' (string). Returns undefined in the event of an error.
Input:
$stopListFilePath -> WSD Stop list file path
Output:
$stopListRegex -> Returns stop list regex of the WSD stop list file.
Example:
This is a private function and should not be utilized.
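A stop list regex of the kind described here can be pictured as a single alternation of the stop words, anchored on word boundaries. The sketch below builds one from a list of words that are already in memory; the helper names build_stop_regex and remove_stop_words are hypothetical, and the on-disk stop list file format expected by the module is not assumed.

```perl
use strict;
use warnings;

# Build one compiled regex matching any of the given stop words as whole words.
# Returns undef when no usable words are supplied.
sub build_stop_regex {
    my ( @words ) = @_;
    my @escaped;
    for my $word ( @words ) {
        $word =~ s/^\s+|\s+$//g;
        push( @escaped, quotemeta( lc( $word ) ) ) if length( $word );
    }
    return undef if !@escaped;
    my $alternation = join( '|', @escaped );
    return qr/\b(?:$alternation)\b/;
}

# Remove stop words from a string and tidy the remaining whitespace.
sub remove_stop_words {
    my ( $text, $regex ) = @_;
    $text =~ s/$regex//g;
    $text =~ s/\s{2,}/ /g;
    $text =~ s/^\s+|\s+$//g;
    return $text;
}

my $stopRegex = build_stop_regex( "the", "of", "and" );
print( remove_stop_words( "the risk of heart attack", $stopRegex ), "\n" );   # prints "risk heart attack"
```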
ConvertStringLineEndingsToTargetOS
Description:
Converts passed string parameter to current OS line ending format.
i.e., DOS/Windows to Unix/Linux or Unix/Linux to DOS/Windows.
Warning: The legacy MacOS format is not supported, and errors may occur if it is used.
Input:
$string -> String to convert
Output:
$string -> Output data with target OS line endings.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $tempStr = "samples text\r\n";
$tempStr = $interface->ConvertStringLineEndingsToTargetOS( $tempStr );
undef( $interface );
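The conversion can be sketched as a two-step normalization: first reduce DOS/Windows line endings to a bare line feed, then expand them again only when the target is Windows. The helper name convert_line_endings and its optional second parameter are hypothetical; the sketch relies on Perl's $^O variable, which holds 'MSWin32' on native Windows builds.

```perl
use strict;
use warnings;

# Convert line endings in $string to those of $targetOS (defaults to the current OS).
sub convert_line_endings {
    my ( $string, $targetOS ) = @_;
    $targetOS = $^O if !defined( $targetOS );
    $string =~ s/\r\n/\n/g;                              # normalize DOS/Windows endings to LF
    $string =~ s/\n/\r\n/g if $targetOS eq 'MSWin32';    # expand to CRLF for Windows targets
    return $string;
}

my $converted = convert_line_endings( "samples text\r\n" );
print( "Length After Conversion: ", length( $converted ), "\n" );
```

Normalizing to a bare line feed first keeps the function idempotent: converting an already converted string does not double the carriage returns.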
Interface Accessor Functions
GetWord2VecDir
Description:
Returns word2vec executable/source directory.
Input:
None
Output:
$string -> Word2vec file path
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $filePath = $interface->GetWord2VecDir();
print( "FilePath: $filePath\n" );
undef( $interface );
GetDebugLog
Description:
Returns the _debugLog member variable set during Word2vec::Interface object initialization in the new function.
Input:
None
Output:
$value -> 0 = False, 1 = True
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $debugLog = $interface->GetDebugLog();
print( "Debug Logging Enabled\n" ) if $debugLog == 1;
print( "Debug Logging Disabled\n" ) if $debugLog == 0;
undef( $interface );
GetWriteLog
Description:
Returns the _writeLog member variable set during Word2vec::Interface object initialization in the new function.
Input:
None
Output:
$value -> 0 = False, 1 = True
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $writeLog = $interface->GetWriteLog();
print( "Write Logging Enabled\n" ) if $writeLog == 1;
print( "Write Logging Disabled\n" ) if $writeLog == 0;
undef( $interface );
GetIgnoreCompileErrors
Description:
Returns the _ignoreCompileErrors member variable set during Word2vec::Interface object initialization by the new() function.
Input:
None
Output:
$value -> 0 = False, 1 = True
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $ignoreCompileErrors = $interface->GetIgnoreCompileErrors();
print( "Ignore Compile Errors Enabled\n" ) if $ignoreCompileErrors == 1;
print( "Ignore Compile Errors Disabled\n" ) if $ignoreCompileErrors == 0;
undef( $interface );
GetIgnoreFileChecks
Description:
Returns the _ignoreFileChecks member variable set during Word2vec::Interface object initialization by the new() function.
Input:
None
Output:
$value -> 0 = False, 1 = True
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $ignoreFileChecks = $interface->GetIgnoreFileChecks();
print( "Ignore File Checks Enabled\n" ) if $ignoreFileChecks == 1;
print( "Ignore File Checks Disabled\n" ) if $ignoreFileChecks == 0;
undef( $interface );
GetExitFlag
Description:
Returns the _exitFlag member variable set during Word2vec::Interface object initialization by the new() function.
Input:
None
Output:
$value -> 0 = False, 1 = True
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $exitFlag = $interface->GetExitFlag();
print( "Exit Flag Set\n" ) if $exitFlag == 1;
print( "Exit Flag Not Set\n" ) if $exitFlag == 0;
undef( $interface );
GetFileHandle
Description:
Returns file handle used by WriteLog() method.
Input:
None
Output:
$fileHandle -> Returns file handle blob used by 'WriteLog()' function or undefined.
Example:
This is a private function and should not be utilized.
GetWorkingDirectory
Description:
Returns the _workingDir member variable set during Word2vec::Interface object initialization by the new() function.
Input:
None
Output:
$string -> Returns working directory
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $dir = $interface->GetWorkingDirectory();
print( "Working Directory: $dir\n" );
undef( $interface );
GetLeskHandler
Description:
Returns the _lesk member variable set during Word2vec::Lesk object initialization of new function.
Note: This returns a new object if not defined with lesk::_debugLog and lesk::_writeLog parameters mirroring interface::_debugLog and interface::_writeLog.
Input:
None
Output:
Word2vec::Lesk -> Returns 'Word2vec::Lesk' object.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $lesk = $interface->GetLeskHandler();
undef( $lesk );
undef( $interface );
GetSpearmansHandler
Description:
Returns the _spearmans member variable set during Word2vec::Spearmans object initialization of new function.
Note: This returns a new object if not defined with spearmans::_debugLog and spearmans::_writeLog parameters mirroring interface::_debugLog and interface::_writeLog.
Input:
None
Output:
Word2vec::Spearmans -> Returns 'Word2vec::Spearmans' object.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $spearmans = $interface->GetSpearmansHandler();
undef( $spearmans );
undef( $interface );
GetWord2VecHandler
Description:
Returns the _word2vec member variable set during Word2vec::Word2vec object initialization of new function.
Note: This returns a new object if not defined with word2vec::_debugLog and word2vec::_writeLog parameters mirroring interface::_debugLog and interface::_writeLog.
Input:
None
Output:
Word2vec::Word2vec -> Returns 'Word2vec::Word2vec' object.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $word2vec = $interface->GetWord2VecHandler();
undef( $word2vec );
undef( $interface );
GetWord2PhraseHandler
Description:
Returns the _word2phrase member variable set during Word2vec::Word2phrase object initialization of new function.
Note: This returns a new object if not defined with word2phrase::_debugLog and word2phrase::_writeLog parameters mirroring interface::_debugLog and interface::_writeLog.
Input:
None
Output:
Word2vec::Word2phrase -> Returns 'Word2vec::Word2phrase' object
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $word2phrase = $interface->GetWord2PhraseHandler();
undef( $word2phrase );
undef( $interface );
GetXMLToW2VHandler
Description:
Returns the _xmltow2v member variable set during Word2vec::Xmltow2v object initialization of new function.
Note: This returns a new object if not defined with xmltow2v::_debugLog and xmltow2v::_writeLog parameters mirroring interface::_debugLog and interface::_writeLog.
Input:
None
Output:
Word2vec::Xmltow2v -> Returns 'Word2vec::Xmltow2v' object
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $xmltow2v = $interface->GetXMLToW2VHandler();
undef( $xmltow2v );
undef( $interface );
GetInstanceAry
Description:
Returns the _instanceAry member variable set during Word2vec::Interface object initialization by the new() function.
Input:
None
Output:
$instance -> Returns array reference of WSD instances.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $aryRef = $interface->GetInstanceAry();
my @instanceAry = @{ $aryRef };
undef( $interface );
GetSensesAry
Description:
Returns the _senseAry member variable set during Word2vec::Interface object initialization by the new() function.
Input:
None
Output:
$senses -> Returns array reference of WSD senses.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $aryRef = $interface->GetSensesAry();
my @sensesAry = @{ $aryRef };
undef( $interface );
GetInstanceCount
Description:
Returns the _instanceCount member variable set during Word2vec::Interface object initialization by the new() function.
Input:
None
Output:
$value -> Returns number of stored WSD instances.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $count = $interface->GetInstanceCount();
print( "Stored WSD instances in memory: $count\n" );
undef( $interface );
GetSenseCount
Description:
Returns the _sensesCount member variable set during Word2vec::Interface object initialization by the new() function.
Input:
None
Output:
$value -> Returns number of stored WSD senses.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $count = $interface->GetSenseCount();
print( "Stored WSD senses in memory: $count\n" );
undef( $interface );
Interface Mutator Functions
SetWord2VecDir
Description:
Sets word2vec executable/source file directory.
Input:
$string -> Word2Vec Directory
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->SetWord2VecDir( "/word2vec" );
undef( $interface );
SetDebugLog
Description:
Instructs module to print debug statements to the console.
Input:
$value -> '1' = Print Debug Statements / '0' = Do Not Print Statements
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->SetDebugLog( 1 );
undef( $interface );
SetWriteLog
Description:
Instructs module to write debug statements to a log file.
Input:
$value -> '1' = Write Statements To Log File / '0' = Do Not Write Log File
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->SetWriteLog( 1 );
undef( $interface );
SetIgnoreCompileErrors
Description:
Instructs module to ignore compile errors when compiling source files.
Input:
$value -> '1' = Ignore warnings/errors, '0' = Display and process warnings/errors.
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->SetIgnoreCompileErrors( 1 );
undef( $interface );
SetIgnoreFileCheckErrors
Description:
Instructs module to ignore file checking errors.
Input:
$value -> '1' = Ignore warnings/errors, '0' = Display and process warnings/errors.
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->SetIgnoreFileCheckErrors( 1 );
undef( $interface );
SetWorkingDirectory
Description:
Sets current working directory.
Input:
$path -> Working directory path (String)
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->SetWorkingDirectory( "my/new/working/directory" );
undef( $interface );
SetInstanceAry
Description:
Sets member instance array variable to de-referenced passed array reference parameter.
Input:
$arrayReference -> Array reference for Word Sense Disambiguation - Array of instances (Word2vec::Wsddata objects).
Output:
None
Example:
use Word2vec::Interface;
# This array would theoretically contain 'Word2vec::Wsddata' objects.
my @instanceAry = ();
my $interface = Word2vec::Interface->new();
$interface->SetInstanceAry( \@instanceAry );
undef( $interface );
ClearInstanceAry
Description:
Clears member instance array.
Input:
None
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->ClearInstanceAry();
undef( $interface );
SetSenseAry
Description:
Sets member sense array variable to de-referenced passed array reference parameter.
Input:
$arrayReference -> Array reference for Word Sense Disambiguation - Array of senses (Word2vec::Wsddata objects).
Output:
None
Example:
use Word2vec::Interface;
# This array would theoretically contain 'Word2vec::Wsddata' objects.
my @senseAry = ();
my $interface = Word2vec::Interface->new();
$interface->SetSenseAry( \@senseAry );
undef( $interface );
ClearSenseAry
Description:
Clears member sense array.
Input:
None
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->ClearSenseAry();
undef( $interface );
SetInstanceCount
Description:
Sets member instance count variable to passed value (integer).
Input:
$value -> Integer (Positive)
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->SetInstanceCount( 12 );
undef( $interface );
SetSenseCount
Description:
Sets member sense count variable to passed value (integer).
Input:
$value -> Integer (Positive)
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->SetSenseCount( 12 );
undef( $interface );
Debug Functions
GetTime
Description:
Returns current time string in "Hour:Minute:Second" format.
Input:
None
Output:
$string -> XX:XX:XX ("Hour:Minute:Second")
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $time = $interface->GetTime();
print( "Current Time: $time\n" ) if defined( $time );
undef( $interface );
GetDate
Description:
Returns current month, day and year string in "Month/Day/Year" format.
Input:
None
Output:
$string -> XX/XX/XXXX ("Month/Day/Year")
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $date = $interface->GetDate();
print( "Current Date: $date\n" ) if defined( $date );
undef( $interface );
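For reference, the two formats described above ("Hour:Minute:Second" and "Month/Day/Year") can be produced with Perl's built-in localtime(). The sketch below is an assumption about formatting only: the helper names time_string and date_string are hypothetical, and zero-padding each field is a choice made here, not something the documentation confirms for GetTime() or GetDate().

```perl
use strict;
use warnings;

# Hypothetical helper: current time as "HH:MM:SS" (zero-padded by assumption).
sub time_string
{
    my ( $sec, $min, $hour ) = localtime();
    return sprintf( "%02d:%02d:%02d", $hour, $min, $sec );
}

# Hypothetical helper: current date as "MM/DD/YYYY".
# localtime() returns a 0-based month and a year offset from 1900.
sub date_string
{
    my ( $mday, $mon, $year ) = ( localtime() )[ 3, 4, 5 ];
    return sprintf( "%02d/%02d/%04d", $mon + 1, $mday, $year + 1900 );
}

print( "Current Time: " . time_string() . "\n" );
print( "Current Date: " . date_string() . "\n" );
```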
WriteLog
Description:
Prints passed string parameter to the console, log file or both depending on user options.
Note: printNewLine parameter prints a new line character following the string if the parameter
is undefined and does not if parameter is 0.
Input:
$string -> String to print to the console/log file.
$value -> 0 = Do not print newline character after string, all else prints new line character including 'undef'.
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->WriteLog( "Hello World" );
undef( $interface );
Lesk Main Functions
GetPhraseOverlapBetweenStrings
Description:
Given two strings, this returns a hash of all overlapping (matching) phrases between both strings and their frequency counts. Longer phrases take priority over shorter ones when matching.
Input:
$string_a -> First comparison string
$string_b -> Second comparison string
Output:
$hash_ref -> Returns a hash table reference with keys being the unique matching phrase between two input string parameters and the value as the frequency count of each unique phrase.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my %phrase_overlaps = %{ $interface->GetPhraseOverlapBetweenStrings( "I like to eat cookies", "Sometimes I like to eat cookies" ) };
for my $phrase ( sort keys %phrase_overlaps )
{
print "$phrase : $phrase_overlaps{ $phrase }\n";
}
undef( %phrase_overlaps );
undef( $interface );
CalculateLeskScore
Description:
Given two strings, this returns a lesk score based on overlapping (matching) features between both strings.
Input:
$string_a -> First comparison string
$string_b -> Second comparison string
Output:
$score -> Lesk Score (Float)
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $lesk_score = $interface->CalculateLeskScore( "I like to eat cookies", "Sometimes I like to eat cookies" );
print "Lesk Score: $lesk_score\n";
undef( $interface );
CalculateLeskCosineScore
Description:
Given two strings, this returns a cosine score based on overlapping (matching) features between both strings.
Input:
$string_a -> First comparison string
$string_b -> Second comparison string
Output:
$score -> Cosine Score (Float)
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $cosine_score = $interface->CalculateLeskCosineScore( "I like to eat cookies", "Sometimes I like to eat cookies" );
print "Cosine Score: $cosine_score\n";
undef( $interface );
CalculateLeskFScore
Description:
Given two strings, this returns an F score based on overlapping (matching) features between both strings.
Input:
$string_a -> First comparison string
$string_b -> Second comparison string
Output:
$score -> F Score (Float)
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $f_score = $interface->CalculateLeskFScore( "I like to eat cookies", "Sometimes I like to eat cookies" );
print "F Score: $f_score\n";
undef( $interface );
CalculateAllLeskScores
Description:
Given two strings, this returns a set of scores (F, Cosine, Lesk, Raw Lesk, Precision, Recall) and frequency counts (features, phrases, string lengths).
Input:
$string_a -> First comparison string
$string_b -> Second comparison string
Output:
$result_hash -> Hash reference containing: Lesk, Raw Lesk, F, Precision, Recall, Cosine, Matching Feature Frequency, Matching Phrase Frequency, String A Length and String B Length.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my %scores = %{ $interface->CalculateAllLeskScores( "I like to eat cookies", "Sometimes I like to eat cookies" ) };
for my $score_name ( sort keys %scores )
{
print "$score_name : $scores{ $score_name }\n";
}
undef( $interface );
Util Main Functions
CleanText
Description:
Normalizes text based on the following.
- Text converted to lowercase
- More than one white space is replaced with a single white space
- Apostrophe "s" ('s) characters are removed
- Hyphen character is replaced with a single white space
- All special characters removed outside of lowercase 'a-z' and compoundified terms retained, joined by '_' (underscore).
- Line-feed/carriage return (LF-CR) endings are cleaned and converted to OS specific LF-CR endings
Input:
$string -> String of text to normalize
Output:
$string -> Cleaned/Normalized text.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $text = '123485clean-text!!@&^#*@';
print( "Original Text: \"$text\"\n" );
$text = $interface->CleanText( $text );
print( "Cleaned Text: \"$text\"\n" );
undef( $interface );
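The normalization rules listed above can be approximated with a handful of substitutions. The following is a rough sketch under stated assumptions, not the module's CleanText() code: the helper name clean_text_sketch is hypothetical, the order of the steps is a guess, and the line-ending conversion step is left out.

```perl
use strict;
use warnings;

# Hypothetical helper approximating the documented normalization rules.
sub clean_text_sketch
{
    my ( $text ) = @_;
    $text = lc( $text );          # convert to lowercase
    $text =~ s/'s\b//g;           # remove apostrophe "s"
    $text =~ s/-/ /g;             # hyphen becomes a single space
    $text =~ s/[^a-z_\s]//g;      # keep only a-z, underscore and whitespace
    $text =~ s/\s+/ /g;           # collapse runs of whitespace
    return $text;
}

my $text = "Patient's  Heart-Attack!!";
print( "Original Text: \"$text\"\n" );
print( "Cleaned Text: \"" . clean_text_sketch( $text ) . "\"\n" );
```

Underscores are kept so that compoundified terms such as "heart_attack" survive the special-character filter.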
RemoveNewLineEndingsFromString
Description:
Removes new line endings from string. Supports MSWin32, Linux and MacOS line endings.
Input:
$string -> String with line-feed/carriage return ending to remove.
Output:
$string -> String without line-feed/carriage return ending.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $text = "this is sample text\n";
print( "Original Text: \"$text\"\n" );
$text = $interface->RemoveNewLineEndingsFromString( $text );
print( "Cleaned Text: \"$text\"\n" );
undef( $interface );
IsFileOrDirectory
Description:
Given a path, returns a string specifying whether this path represents a file or directory.
Input:
$path -> String representing path to check
Output:
$string -> Returns "file", "dir" or "unknown".
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $result = $interface->IsFileOrDirectory( "../samples/stoplist" );
print( "Path Type Is A File\n" ) if $result eq "file";
print( "Path Type Is A Directory\n" ) if $result eq "dir";
print( "Path Type Is Unknown\n" ) if $result eq "unknown";
undef( $interface );
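A path check of this kind is commonly built on Perl's file test operators. The sketch below is only an assumption about how such a check might look; the helper name path_type is hypothetical and is not part of Word2vec::Interface.

```perl
use strict;
use warnings;

# Hypothetical helper: classifies a path using the -f and -d file tests.
sub path_type
{
    my ( $path ) = @_;
    return "unknown" if !defined( $path );
    return "file"    if -f $path;    # plain file
    return "dir"     if -d $path;    # directory
    return "unknown";                # missing path or another file type
}

print( "Current directory is a: " . path_type( "." ) . "\n" );
```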
IsWordOrCUITerm
Description:
Determines if string parameter is a 'word' or 'cui'.
Input:
$string -> String with single term/cui to examine.
Output:
$string -> Returns "word" or "cui".
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $result = $interface->IsWordOrCUITerm( "c12345678" );
print( "String Is Word\n" ) if $result eq "word";
print( "String Is A CUI\n" ) if $result eq "cui";
undef( $interface );
GetFilesInDirectory
Description:
Given a path and file tag string, returns a string of files consisting of the file tag string in the specified path.
Input:
$path -> String representing path
$fileTag -> String consisting of file tag to fetch.
Output:
$string -> Returns string of file names consisting of $fileTag.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
# Looks in specified path for files including ".sval" in their file name.
my $result = $interface->GetFilesInDirectory( "../samples/", ".sval" );
print( "Found File Name(s): $result\n" ) if defined( $result );
undef( $interface );
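As an illustration of the lookup described above, a directory listing can be filtered by substring with opendir() and grep. This is a sketch, not the module's code: the helper name files_matching_tag is hypothetical, and joining the names with a single space is an assumption, since the documentation only says a string of file names is returned.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a space-separated string of plain files in
# $path whose names contain $fileTag, or undef if $path cannot be opened.
sub files_matching_tag
{
    my ( $path, $fileTag ) = @_;
    opendir( my $dirHandle, $path ) or return undef;
    my @names = grep { -f "$path/$_" && index( $_, $fileTag ) != -1 } readdir( $dirHandle );
    closedir( $dirHandle );
    @names = sort @names;
    return join( " ", @names );
}

my $found = files_matching_tag( ".", ".sval" );
print( "Found File Name(s): $found\n" ) if defined( $found ) && $found ne "";
```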
Spearmans Main Functions
SpCalculateSpearmans
Description:
Calculates Spearman's Rank Correlation Score between two data-sets.
Input:
$fileA -> Data set to compare
$fileB -> Data set to compare
$includeCountsInResults -> Specifies whether to return file counts in score. (undef = False / defined = True)
Output:
$value -> "undef" or Spearman's Rank Correlation Score
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $score = $interface->SpCalculateSpearmans( "samples/MiniMayoSRS.term.comp_results", "Similarity/MiniMayoSRS.terms.coders", undef );
print "Spearman's Rank Correlation Score: $score\n" if defined( $score );
print "Spearman's Rank Correlation Score: undef\n" if !defined( $score );
undef( $interface );
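For readers unfamiliar with the statistic, Spearman's rank correlation can be computed as the Pearson correlation of the two rank lists, with tied scores sharing an average rank. The sketch below works on in-memory arrays of scores and is not the module's implementation: the helper names rank_values and spearmans_rho are hypothetical, and nothing here reflects how SpCalculateSpearmans() parses its input files or pairs terms.

```perl
use strict;
use warnings;

# Hypothetical helper: assigns 1-based ranks; tied values share the average rank.
sub rank_values
{
    my @values = @_;
    my @order  = sort { $values[$a] <=> $values[$b] } 0 .. $#values;
    my @ranks  = ( 0 ) x @values;

    my $i = 0;
    while ( $i < @order )
    {
        # Extend $j to the end of the current run of tied values.
        my $j = $i;
        $j++ while ( $j + 1 < @order && $values[ $order[ $j + 1 ] ] == $values[ $order[$i] ] );
        my $avgRank = ( $i + $j ) / 2 + 1;
        $ranks[ $order[$_] ] = $avgRank for ( $i .. $j );
        $i = $j + 1;
    }
    return @ranks;
}

# Hypothetical helper: Spearman's rho as the Pearson correlation of the ranks.
sub spearmans_rho
{
    my ( $aRef, $bRef ) = @_;
    return undef if @$aRef != @$bRef || @$aRef < 2;

    my @ra = rank_values( @$aRef );
    my @rb = rank_values( @$bRef );
    my $n  = scalar( @ra );

    my ( $sumA, $sumB ) = ( 0, 0 );
    $sumA += $_ for @ra;
    $sumB += $_ for @rb;
    my ( $meanA, $meanB ) = ( $sumA / $n, $sumB / $n );

    my ( $cov, $varA, $varB ) = ( 0, 0, 0 );
    for my $k ( 0 .. $n - 1 )
    {
        my ( $da, $db ) = ( $ra[$k] - $meanA, $rb[$k] - $meanB );
        $cov  += $da * $db;
        $varA += $da * $da;
        $varB += $db * $db;
    }
    return undef if $varA == 0 || $varB == 0;    # constant input, rho undefined
    return $cov / sqrt( $varA * $varB );
}

my $rho = spearmans_rho( [ 1, 2, 3, 4, 5 ], [ 2, 4, 6, 8, 10 ] );
print( "Spearman's rho: $rho\n" ) if defined( $rho );
```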
SpIsFileWordOrCUIFile
Description:
Determines if a file is composed of CUI or word terms by checking the first line.
Input:
$string -> File Path
Output:
$string -> "undef" = Unable to determine, "cui" = CUI Term File, "word" = Word Term File
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $isWordOrCuiFile = $interface->SpIsFileWordOrCUIFile( "samples/MiniMayoSRS.terms" );
print( "MiniMayoSRS.terms File Is A \"$isWordOrCuiFile\" File\n" ) if defined( $isWordOrCuiFile );
print( "Unable To Determine Type Of File\n" ) if !defined( $isWordOrCuiFile );
undef( $interface );
SpGetPrecision
Description:
Returns the number of decimal places after the decimal point of the Spearman's Rank Correlation Score to represent.
Input:
None
Output:
$value -> Integer
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
print "Spearman's Precision: " . $interface->SpGetPrecision() . "\n";
undef( $interface );
SpGetIsFileOfWords
Description:
Returns the variable indicating whether the files to be parsed are files consisting of words or CUI terms.
Input:
None
Output:
$value -> "undef" = Auto-Detect, 0 = CUI Terms, 1 = Word Terms
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $isFileOfWords = $interface->SpGetIsFileOfWords();
print "Is File Of Words?: $isFileOfWords\n" if defined( $isFileOfWords );
print "Is File Of Words?: undef\n" if !defined( $isFileOfWords );
undef( $interface );
SpGetPrintN
Description:
Returns the variable indicating whether to print the N value.
Input:
None
Output:
$value -> "undef" = Do not print NValue, "defined" = Print NValue
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $printN = $interface->SpGetPrintN();
print "Print N\n" if defined( $printN );
print "Do Not Print N\n" if !defined( $printN );
undef( $interface );
SpGetACount
Description:
Returns the non-negative count for file A.
Input:
None
Output:
$value -> Integer
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
print "A Count: " . $interface->SpGetACount() . "\n";
undef( $interface );
SpGetBCount
Description:
Returns the non-negative count for file B.
Input:
None
Output:
$value -> Integer
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
print "B Count: " . $interface->SpGetBCount() . "\n";
undef( $interface );
SpGetNValue
Description:
Returns the N value.
Input:
None
Output:
$value -> Integer
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
print "N Value: " . $interface->SpGetNValue() . "\n";
undef( $interface );
SpSetPrecision
Description:
Sets the number of decimal places after the decimal point of the Spearman's Rank Correlation Score to represent.
Input:
$value -> Integer
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->SpSetPrecision( 8 );
my $score = $interface->SpCalculateSpearmans( "samples/MiniMayoSRS.term.comp_results", "Similarity/MiniMayoSRS.terms.coders", undef );
print "Spearman's Rank Correlation Score: $score\n" if defined( $score );
print "Spearman's Rank Correlation Score: undef\n" if !defined( $score );
undef( $interface );
SpSetIsFileOfWords
Description:
Specifies whether the main method auto-detects if a file consists of CUI or word terms, or uses a manual override set by the user.
Input:
$value -> "undef" = Auto-Detect, 0 = CUI Terms, 1 = Word Terms
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->SpSetIsFileOfWords( undef );
my $score = $interface->SpCalculateSpearmans( "samples/MiniMayoSRS.term.comp_results", "Similarity/MiniMayoSRS.terms.coders", undef );
print "Spearman's Rank Correlation Score: $score\n" if defined( $score );
print "Spearman's Rank Correlation Score: undef\n" if !defined( $score );
undef( $interface );
SpSetPrintN
Description:
Specifies whether the main method prints _NValue after the Spearmans::CalculateSpearmans() function completes.
Input:
$value -> "undef" = Do Not Print _NValue, "defined" = Print _NValue
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->SpSetPrintN( 1 );
my $score = $interface->SpCalculateSpearmans( "samples/MiniMayoSRS.term.comp_results", "Similarity/MiniMayoSRS.terms.coders", undef );
print "Spearman's Rank Correlation Score: $score\n" if defined( $score );
print "Spearman's Rank Correlation Score: undef\n" if !defined( $score );
undef( $interface );
Word2Vec Main Functions
W2VExecuteTraining
Description:
Executes word2vec training based on parameters. Parameter variables have higher precedence
than member variables. Any parameter specified will override its respective member variable.
Note: If no parameters are specified, this module executes word2vec training based on preset
member variables. Returns an integer indicating training status (see Output).
Input:
$trainFilePath -> Specifies word2vec text corpus training file in a given path. (String)
$outputFilePath -> Specifies word2vec trained output data file name and save path. (String)
$vectorSize -> Size of word2vec word vectors. (Integer)
$windowSize -> Maximum skip length between words. (Integer)
$minCount -> Disregard words that appear less than $minCount times. (Integer)
$sample -> Threshold for occurrence of words. Those that appear with higher frequency in the training data will be randomly down-sampled. (Float)
$negative -> Number of negative examples. (Integer)
$alpha -> Sets the starting learning rate. (Float)
$hs -> Hierarchical Soft-max (Integer)
$binary -> Save trained data as binary mode. (Integer)
$numOfThreads -> Number of word2vec training threads. (Integer)
$iterations -> Number of training iterations to run prior to completion of training. (Integer)
$useCBOW -> Enable Continuous Bag Of Words model or Skip-Gram model. (Integer)
$classes -> Output word classes rather than word vectors. (Integer)
$readVocab -> Read vocabulary from file path without constructing from training data. (String)
$saveVocab -> Save vocabulary to file path. (String)
$debug -> Set word2vec debug mode. (Integer)
$overwrite -> Instructs the module to either overwrite any existing text corpus files or append to the existing file. ( '1' = True / '0' = False )
Note: It is not recommended to specify all new() parameters, as it has not been thoroughly tested.
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VSetTrainFilePath( "textcorpus.txt" );
$interface->W2VSetOutputFilePath( "vectors.bin" );
$interface->W2VSetWordVecSize( 200 );
$interface->W2VSetWindowSize( 8 );
$interface->W2VSetSample( 0.0001 );
$interface->W2VSetNegative( 25 );
$interface->W2VSetHSoftMax( 0 );
$interface->W2VSetBinaryOutput( 0 );
$interface->W2VSetNumOfThreads( 20 );
$interface->W2VSetNumOfIterations( 15 );
$interface->W2VSetUseCBOW( 1 );
$interface->W2VSetOverwriteOldFile( 0 );
$interface->W2VExecuteTraining();
undef( $interface );
# or
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VExecuteTraining( "textcorpus.txt", "vectors.bin", 200, 8, 5, 0.001, 25, 0.05, 0, 0, 20, 15, 1, 0, "", "", 2, 0 );
undef( $interface );
W2VExecuteStringTraining
Description:
Executes word2vec training on a passed string, based on parameters. Parameter variables have higher precedence
than member variables. Any parameter specified will override its respective member variable.
Note: If no parameters other than the training string are specified, this module executes word2vec training based on preset
member variables. Returns an integer indicating training status (see Output).
Input:
$trainingStr -> String to train with word2vec.
$outputFilePath -> Specifies word2vec trained output data file name and save path. (String)
$vectorSize -> Size of word2vec word vectors. (Integer)
$windowSize -> Maximum skip length between words. (Integer)
$minCount -> Disregard words that appear less than $minCount times. (Integer)
$sample -> Threshold for occurrence of words. Those that appear with higher frequency in the training data will be randomly down-sampled. (Float)
$negative -> Number of negative examples. (Integer)
$alpha -> Sets the starting learning rate. (Float)
$hs -> Hierarchical Soft-max (Integer)
$binary -> Save trained data as binary mode. (Integer)
$numOfThreads -> Number of word2vec training threads. (Integer)
$iterations -> Number of training iterations to run prior to completion of training. (Integer)
$useCBOW -> Enable Continuous Bag Of Words model or Skip-Gram model. (Integer)
$classes -> Output word classes rather than word vectors. (Integer)
$readVocab -> Read vocabulary from file path without constructing from training data. (String)
$saveVocab -> Save vocabulary to file path. (String)
$debug -> Set word2vec debug mode. (Integer)
$overwrite -> Instructs the module to either overwrite any existing text corpus files or append to the existing file. ( '1' = True / '0' = False )
Note: It is not recommended to specify all new() parameters, as it has not been thoroughly tested.
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VSetOutputFilePath( "vectors.bin" );
$interface->W2VSetWordVecSize( 200 );
$interface->W2VSetWindowSize( 8 );
$interface->W2VSetSample( 0.0001 );
$interface->W2VSetNegative( 25 );
$interface->W2VSetHSoftMax( 0 );
$interface->W2VSetBinaryOutput( 0 );
$interface->W2VSetNumOfThreads( 20 );
$interface->W2VSetNumOfIterations( 15 );
$interface->W2VSetUseCBOW( 1 );
$interface->W2VSetOverwriteOldFile( 0 );
$interface->W2VExecuteStringTraining( "string to train here" );
undef( $interface );
# or
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VExecuteStringTraining( "string to train here", "vectors.bin", 200, 8, 5, 0.001, 25, 0.05, 0, 0, 20, 15, 1, 0, "", "", 2, 0 );
undef( $interface );
W2VComputeCosineSimilarity
Description:
Computes cosine similarity between two words using trained word2vec vector data. Returns
float value or undefined if one or more words are not in the dictionary.
Note: Supports single words only and requires vector data to be in memory with W2VReadTrainedVectorDataFromFile() prior to function execution.
Input:
$string -> Single string word
$string -> Single string word
Output:
$value -> Float or Undefined
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
print "Cosine similarity between words: \"of\" and \"the\": " . $interface->W2VComputeCosineSimilarity( "of", "the" ) . "\n";
undef( $interface );
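The score itself is the standard cosine of the angle between two vectors: the dot product divided by the product of the vector lengths. The sketch below shows that arithmetic on plain array references; it is not the module's code, and the helper name cosine_similarity is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: cosine similarity of two equal-length numeric vectors.
# Returns undef for mismatched lengths, empty vectors or zero-length vectors.
sub cosine_similarity
{
    my ( $aRef, $bRef ) = @_;
    return undef if @$aRef != @$bRef || @$aRef == 0;

    my ( $dot, $normA, $normB ) = ( 0, 0, 0 );
    for my $i ( 0 .. $#{ $aRef } )
    {
        $dot   += $aRef->[$i] * $bRef->[$i];
        $normA += $aRef->[$i] ** 2;
        $normB += $bRef->[$i] ** 2;
    }
    return undef if $normA == 0 || $normB == 0;
    return $dot / ( sqrt( $normA ) * sqrt( $normB ) );
}

my $score = cosine_similarity( [ 1, 2, 3 ], [ 2, 4, 6 ] );
print( "Cosine similarity: $score\n" ) if defined( $score );
```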
W2VComputeAvgOfWordsCosineSimilarity
Description:
Computes cosine similarity between two words or compound words using trained word2vec vector data.
Returns float value or undefined.
Note: Supports multiple words concatenated by ' ' and requires vector data to be in memory prior
to method execution. This method will not error out when a word is not located within the dictionary.
It takes the average of the vectors of all found words for each parameter, then computes the cosine similarity of the two averaged vectors.
Input:
$string -> string of single or multiple words separated by ' ' (space).
$string -> string of single or multiple words separated by ' ' (space).
Output:
$value -> Float or Undefined
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
print "Cosine similarity between words: \"heart attack\" and \"acute myocardial infarction\": " .
$interface->W2VComputeAvgOfWordsCosineSimilarity( "heart attack", "acute myocardial infarction" ) . "\n";
undef( $interface );
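The averaging step described in the note above can be illustrated with a toy vocabulary. Everything in this sketch is an assumption made for illustration: the three-dimensional vectors are made up, the helper names average_vector and cosine_of_vectors are hypothetical, and the module's own data structures are not shown.

```perl
use strict;
use warnings;

# Made-up 3-dimensional vectors; real word2vec vectors come from a trained file.
my %toyVectors = (
    "heart"      => [ 1.0, 0.0, 1.0 ],
    "attack"     => [ 0.0, 1.0, 1.0 ],
    "myocardial" => [ 1.0, 1.0, 2.0 ],
);

# Hypothetical helper: averages the vectors of all words found in %toyVectors.
# Words missing from the vocabulary are skipped rather than treated as errors.
sub average_vector
{
    my ( $phrase ) = @_;
    my @sum   = ();
    my $found = 0;
    for my $word ( split( ' ', $phrase ) )
    {
        my $vec = $toyVectors{ $word };
        next if !defined( $vec );
        $sum[$_] = ( $sum[$_] // 0 ) + $vec->[$_] for ( 0 .. $#{ $vec } );
        $found++;
    }
    return undef if $found == 0;
    return [ map { $_ / $found } @sum ];
}

# Hypothetical helper: cosine similarity of two equal-length vectors.
sub cosine_of_vectors
{
    my ( $aRef, $bRef ) = @_;
    my ( $dot, $normA, $normB ) = ( 0, 0, 0 );
    for my $i ( 0 .. $#{ $aRef } )
    {
        $dot   += $aRef->[$i] * $bRef->[$i];
        $normA += $aRef->[$i] ** 2;
        $normB += $bRef->[$i] ** 2;
    }
    return undef if $normA == 0 || $normB == 0;
    return $dot / sqrt( $normA * $normB );
}

my $avgA = average_vector( "heart attack" );
my $avgB = average_vector( "acute myocardial infarction" );    # only "myocardial" is found
print( "Similarity: " . cosine_of_vectors( $avgA, $avgB ) . "\n" ) if defined( $avgA ) && defined( $avgB );
```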
W2VComputeMultiWordCosineSimilarity
Description:
Computes cosine similarity between two words or compound words using trained word2vec vector data.
Note: Supports multiple words concatenated by ' ' (space) and requires vector data to be in memory prior to method execution.
If $allWordsMustExist is set to true, this function will error out when a specified word is not found and return undefined.
Input:
$string -> string of single or multiple words separated by ' ' (space).
$string -> string of single or multiple words separated by ' ' (space).
$allWordsMustExist -> 1 = True, 0 or undef = False
Output:
$value -> Float or Undefined
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
print "Cosine similarity between words: \"heart attack\" and \"acute myocardial infarction\": " .
$interface->W2VComputeMultiWordCosineSimilarity( "heart attack", "acute myocardial infarction" ) . "\n";
undef( $interface );
W2VComputeCosineSimilarityOfWordVectors
Description:
Computes cosine similarity between two word vectors.
Returns float value or undefined if one or more words are not in the dictionary.
Note: Function parameters require actual word vector data with words removed.
Input:
$string -> string of word vector representation data separated by ' ' (space).
$string -> string of word vector representation data separated by ' ' (space).
Output:
$value -> Float or Undefined
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
my $vectorAData = $interface->W2VGetWordVector( "heart" );
my $vectorBData = $interface->W2VGetWordVector( "attack" );
# Remove Words From Data
$vectorAData = $interface->W2VRemoveWordFromWordVectorString( $vectorAData );
$vectorBData = $interface->W2VRemoveWordFromWordVectorString( $vectorBData );
print "Cosine similarity between words: \"heart\" and \"attack\": " .
$interface->W2VComputeCosineSimilarityOfWordVectors( $vectorAData, $vectorBData ) . "\n";
undef( $interface );
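For reference, the computation on two vector data strings can be sketched without the module as follows. The function name cosine_of_vector_strings is hypothetical, and returning undef on a length mismatch is an assumption of this sketch rather than documented module behavior.

```perl
use strict;
use warnings;

# Cosine similarity of two vector data strings (numbers separated by spaces,
# with the leading word already removed). Returns undef when the strings are
# empty, differ in length, or describe a zero vector.
sub cosine_of_vector_strings {
    my ( $strA, $strB ) = @_;
    my @u = split( ' ', $strA );
    my @v = split( ' ', $strB );
    return undef if @u == 0 || @u != @v;

    my ( $dot, $normU, $normV ) = ( 0, 0, 0 );
    for my $i ( 0 .. $#u ) {
        $dot   += $u[$i] * $v[$i];
        $normU += $u[$i] ** 2;
        $normV += $v[$i] ** 2;
    }
    return undef if $normU == 0 || $normV == 0;
    return $dot / ( sqrt( $normU ) * sqrt( $normV ) );
}

my $similarity = cosine_of_vector_strings( "1 0 0", "1 1 0" );
printf( "Cosine similarity: %.4f\n", $similarity );
```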
W2VCosSimWithUserInput
Description:
Computes cosine similarity between two words using trained word2vec vector data based on user input.
Note: No compound word support.
Warning: Requires vector data to be in memory prior to method execution.
Input:
None
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
$interface->W2VCosSimWithUserInput();
undef( $interface );
W2VMultiWordCosSimWithUserInput
Description:
Computes cosine similarity between two words or compound words using trained word2vec vector data based on user input.
Note: Supports multiple words concatenated by ':'.
Warning: Requires vector data to be in memory prior to method execution.
Input:
None
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
$interface->W2VMultiWordCosSimWithUserInput();
undef( $interface );
W2VComputeAverageOfWords
Description:
Computes the average of the word vectors of all found words, given an array reference of
plain text words. Returns the averaged values (string) or undefined.
Warning: Requires vector data to be in memory prior to method execution.
Input:
$arrayReference -> Array reference of words
Output:
$string -> String of word2vec word average values
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
my @wordAry = qw( of the and );
my $data = $interface->W2VComputeAverageOfWords( \@wordAry );
print( "Computed Average Of Words: $data" ) if defined( $data );
undef( $interface );
W2VAddTwoWords
Description:
Adds two word vectors and returns the result.
Warning: Requires vector data to be in memory prior to method execution.
Input:
$string -> Word to add
$string -> Word to add
Output:
$string -> String of word2vec summed word values
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
my $data = $interface->W2VAddTwoWords( "heart", "attack" );
print( "Computed Sum Of Words: $data" ) if defined( $data );
undef( $interface );
W2VSubtractTwoWords
Description:
Subtracts two word vectors and returns the result.
Warning: Requires vector data to be in memory prior to method execution.
Input:
$string -> Word to subtract
$string -> Word to subtract
Output:
$string -> String of word2vec difference between word values
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
my $data = $interface->W2VSubtractTwoWords( "king", "man" );
print( "Computed Difference Of Words: $data" ) if defined( $data );
undef( $interface );
W2VAddTwoWordVectors
Description:
Adds two vector data strings and returns the result.
Warning: Text word must be removed from vector data prior to calling this method. This method
also requires vector data to be in memory prior to method execution.
Input:
$string -> Word2vec word vector data (with string word removed)
$string -> Word2vec word vector data (with string word removed)
Output:
$string -> String of word2vec summed word values
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
my $wordAData = $interface->W2VGetWordVector( "of" );
my $wordBData = $interface->W2VGetWordVector( "the" );
# Removing Words From Vector Data
$wordAData = $interface->W2VRemoveWordFromWordVectorString( $wordAData );
$wordBData = $interface->W2VRemoveWordFromWordVectorString( $wordBData );
my $data = $interface->W2VAddTwoWordVectors( $wordAData, $wordBData );
print( "Computed Sum Of Words: $data" ) if defined( $data );
undef( $interface );
W2VSubtractTwoWordVectors
Description:
Subtracts two vector data strings and returns the result.
Warning: Text word must be removed from vector data prior to calling this method. This method
also requires vector data to be in memory prior to method execution.
Input:
$string -> Word2vec word vector data (with string word removed)
$string -> Word2vec word vector data (with string word removed)
Output:
$string -> String of word2vec difference between word values
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
my $wordAData = $interface->W2VGetWordVector( "of" );
my $wordBData = $interface->W2VGetWordVector( "the" );
# Removing Words From Vector Data
$wordAData = $interface->W2VRemoveWordFromWordVectorString( $wordAData );
$wordBData = $interface->W2VRemoveWordFromWordVectorString( $wordBData );
my $data = $interface->W2VSubtractTwoWordVectors( $wordAData, $wordBData );
print( "Computed Difference Of Words: $data" ) if defined( $data );
undef( $interface );
W2VAverageOfTwoWordVectors
Description:
Computes the average of two vector data strings and returns the result.
Warning: Text word must be removed from vector data prior to calling this method. This method
also requires vector data to be in memory prior to method execution.
Input:
$string -> Word2vec word vector data (with string word removed)
$string -> Word2vec word vector data (with string word removed)
Output:
$string -> String of word2vec average between word values
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
my $wordAData = $interface->W2VGetWordVector( "of" );
my $wordBData = $interface->W2VGetWordVector( "the" );
# Removing Words From Vector Data
$wordAData = $interface->W2VRemoveWordFromWordVectorString( $wordAData );
$wordBData = $interface->W2VRemoveWordFromWordVectorString( $wordBData );
my $data = $interface->W2VAverageOfTwoWordVectors( $wordAData, $wordBData );
print( "Computed Average Of Words: $data" ) if defined( $data );
undef( $interface );
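The add, subtract and average operations on vector data strings are all component-wise, which the following module-independent sketch illustrates. The helper combine_vector_strings is hypothetical, and returning undef for empty or mismatched input is an assumption of the sketch.

```perl
use strict;
use warnings;

# Apply a binary operation component by component to two vector data strings
# (numbers separated by spaces, leading word already removed).
sub combine_vector_strings {
    my ( $strA, $strB, $op ) = @_;
    my @u = split( ' ', $strA );
    my @v = split( ' ', $strB );
    return undef if @u == 0 || @u != @v;
    return join( ' ', map { $op->( $u[$_], $v[$_] ) } 0 .. $#u );
}

my $sum  = combine_vector_strings( "1 2 3", "3 2 1", sub { $_[0] + $_[1] } );
my $diff = combine_vector_strings( "1 2 3", "3 2 1", sub { $_[0] - $_[1] } );
my $avg  = combine_vector_strings( "1 2 3", "3 2 1", sub { ( $_[0] + $_[1] ) / 2 } );
print( "Sum: $sum\nDifference: $diff\nAverage: $avg\n" );
```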
W2VGetWordVector
Description:
Searches dictionary in memory for the specified string argument and returns the vector data.
Returns undefined if not found.
Warning: Requires vector data to be in memory prior to method execution.
Input:
$string -> Word to locate in word2vec vocabulary/dictionary
Output:
$string -> Found word2vec word + word vector data or undefined.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
my $wordData = $interface->W2VGetWordVector( "of" );
print( "Word2vec Word Data: $wordData\n" ) if defined( $wordData );
undef( $interface );
W2VIsVectorDataInMemory
Description:
Checks to see if vector data has been loaded in memory.
Input:
None
Output:
$value -> '1' = True / '0' = False
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $result = $interface->W2VIsVectorDataInMemory();
print( "No vector data in memory\n" ) if $result == 0;
print( "Yes vector data in memory\n" ) if $result == 1;
$interface->W2VReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
$result = $interface->W2VIsVectorDataInMemory();
print( "No vector data in memory\n" ) if $result == 0;
print( "Yes vector data in memory\n" ) if $result == 1;
undef( $interface );
W2VIsWordOrCUIVectorData
Description:
Checks to see if vector data consists of word or CUI terms.
Input:
None
Output:
$string -> 'cui', 'word' or undef
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
my $isWordOrCUIData = $interface->W2VIsWordOrCUIVectorData();
print( "Vector Data Consists Of \"$isWordOrCUIData\" Terms\n" ) if defined( $isWordOrCUIData );
print( "Cannot Determine Type Of Terms\n" ) if !defined( $isWordOrCUIData );
undef( $interface );
W2VIsVectorDataSorted
Description:
Checks to see if vector data header is signed as sorted in memory.
Input:
None
Output:
$value -> '1' = True / '0' = False
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
my $result = $interface->W2VIsVectorDataSorted();
print( "Vector data is not sorted\n" ) if $result == 0;
print( "Vector data is sorted\n" ) if $result == 1;
undef( $interface );
W2VCheckWord2VecDataFileType
Description:
Checks specified file to see if vector data is in binary or plain text format. Returns 'text'
for plain text and 'binary' for binary data.
Input:
$string -> File path
Output:
$string -> File Type ( "text" = Plain text file / "binary" = Binary data file )
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $fileType = $interface->W2VCheckWord2VecDataFileType( "samples/samplevectors.bin" );
print( "FileType: $fileType\n" ) if defined( $fileType );
undef( $interface );
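One possible way to tell the two formats apart is sketched below. It is a heuristic written for illustration, not the module's detection logic: it assumes the file starts with a plain text header line and treats any byte outside printable ASCII and whitespace in the following data as a sign of binary content. The function name guess_vector_file_type is hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Guess whether a vector file is plain text or binary by skipping the
# header line and looking for bytes outside printable ASCII and whitespace.
sub guess_vector_file_type {
    my ( $path ) = @_;
    open( my $fh, '<:raw', $path ) or return undef;
    my $header = <$fh>;                          # "<vocabulary size> <vector size>\n"
    my $read   = read( $fh, my $chunk, 4096 );   # first block of vector data
    close( $fh );
    return undef if !defined( $header ) || !$read;
    return ( $chunk =~ /[^\x09\x0A\x0D\x20-\x7E]/ ) ? "binary" : "text";
}

# Small sample files: one plain text, one with packed 32-bit floats.
my ( $textFh, $textPath ) = tempfile( UNLINK => 1 );
print $textFh "1 3\nof 0.25 -0.5 0.125\n";
close( $textFh );

my ( $binFh, $binPath ) = tempfile( UNLINK => 1 );
binmode( $binFh );
print $binFh "1 3\nof " . pack( "f<3", 0.25, -0.5, 0.125 ) . "\n";
close( $binFh );

print( "Text sample:   " . guess_vector_file_type( $textPath ) . "\n" );
print( "Binary sample: " . guess_vector_file_type( $binPath ) . "\n" );
```

A plain text file whose vocabulary contains non-ASCII words would be misclassified by this check, so it should be read as a starting point only.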
W2VReadTrainedVectorDataFromFile
Description:
Reads trained vector data from the given file path into memory, or searches the file for a single word vector. This function supports and
automatically detects word2vec binary, plain text and sparse vector data formats.
Note: If the search word is undefined, the entire vector file is loaded into memory. If a search word is defined, only that word's vector data is returned, or undef if it is not found.
Input:
$string -> Word2vec trained vector data file path
$searchWord -> Searches trained vector data file for specific word vector
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
# Loading data in memory
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $result = $interface->W2VReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
print( "Success Loading Data\n" ) if $result == 0;
print( "Un-successful, Data Not Loaded\n" ) if $result == -1;
undef( $interface );
# or
# Searching vector data file for a specific word vector
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $result = $interface->W2VReadTrainedVectorDataFromFile( "samples/samplevectors.bin", "medical" );
print( "Found Vector Data In File\n" ) if $result != -1;
print( "Vector Data Not Found\n" ) if $result == -1;
undef( $interface );
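The difference between loading a whole file and searching it for one word can be sketched for the plain text format as follows. This is an illustration under stated assumptions, not the module's reader: the function name read_text_vectors is hypothetical, only the plain text format (a header line followed by one "word v1 v2 ..." line per entry) is handled, and the data is read from an in-memory string so the sketch runs without sample files.

```perl
use strict;
use warnings;

# Read plain text vector data from an open file handle.
# Without $searchWord, returns a hash reference of word => "word v1 v2 ...".
# With $searchWord, returns only that word's line, or undef if it is absent.
sub read_text_vectors {
    my ( $fh, $searchWord ) = @_;
    my $header = <$fh>;                 # "<vocabulary size> <vector size>"
    return undef if !defined( $header );

    my %vectors;
    while ( my $line = <$fh> ) {
        chomp( $line );
        my ( $word ) = split( ' ', $line );
        next if !defined( $word );
        if ( defined( $searchWord ) ) {
            return $line if $word eq $searchWord;   # stop at the first match
            next;
        }
        $vectors{ $word } = $line;
    }
    return defined( $searchWord ) ? undef : \%vectors;
}

my $data = "2 3\nof 0.1 0.2 0.3\nmedical 0.4 0.5 0.6\n";

open( my $fh, '<', \$data ) or die "Cannot open in-memory file: $!";
my $vocab = read_text_vectors( $fh );
close( $fh );
print( "Loaded " . scalar( keys %$vocab ) . " vectors\n" );
```

Searching stops at the first match, so a lookup for a single word does not need to keep the rest of the file in memory.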
W2VSaveTrainedVectorDataToFile
Description:
Saves trained vector data at the location in specified format.
Note: Leaving 'saveFormat' undefined will automatically save as plain text format.
Input:
$string -> Save Path
$saveFormat -> Integer ( '0' = Save as plain text / '1' = Save data in word2vec binary format / '2' = Sparse vector data format )
Note: Leaving $saveFormat as undefined will save the file in plain text format.
Warning: If the vector data is stored as a binary search tree, this method will error out gracefully.
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
$interface->W2VSaveTrainedVectorDataToFile( "samples/newvectors.bin" );
undef( $interface );
W2VStringsAreEqual
Description:
Compares two strings for equality, ignoring case.
Note: This method is not case-sensitive, i.e. "string" equals "StRiNg".
Input:
$string -> String to compare
$string -> String to compare
Output:
$value -> '1' = Strings are equal / '0' = Strings are not equal
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $result = $interface->W2VStringsAreEqual( "hello world", "HeLlO wOrLd" );
print( "Strings are equal!\n" ) if $result == 1;
print( "Strings are not equal!\n" ) if $result == 0;
undef( $interface );
W2VRemoveWordFromWordVectorString
Description:
Given a vector data string as input, removes the leading word from the string and returns only the vector data.
Input:
$string -> Vector word & data string.
Output:
$string -> Vector data string.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $str = "cookie 1 0.234 9 0.0002 13 0.234 17 -0.0023 19 1.0000";
my $vectorData = $interface->W2VRemoveWordFromWordVectorString( $str );
print( "Success!\n" ) if length( $vectorData ) < length( $str );
undef( $interface );
W2VConvertRawSparseTextToVectorDataAry
Description:
Converts sparse vector string to a dense vector format data array.
Input:
$string -> Vector data string.
Output:
$arrayReference -> Reference to array of vector data.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $str = "cookie 1 0.234 9 0.0002 13 0.234 17 -0.0023 19 1.0000";
my @vectorData = @{ $interface->W2VConvertRawSparseTextToVectorDataAry( $str ) };
print( "Data conversion successful!\n" ) if @vectorData > 0;
print( "Data conversion un-successful!\n" ) if @vectorData == 0;
undef( $interface );
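The sparse-to-dense expansion can be pictured with the sketch below. Several points are assumptions made for illustration: the sparse string is read as a word followed by index/value pairs, indices are taken to be zero-based, and the dense vector size is passed in explicitly (the module presumably knows it from the loaded data). The function name sparse_text_to_dense is hypothetical.

```perl
use strict;
use warnings;

# Expand "word index value index value ..." into a dense array of $size
# components, filling positions that are not listed with 0.
# Assumption: indices are zero-based.
sub sparse_text_to_dense {
    my ( $str, $size ) = @_;
    my ( $word, @pairs ) = split( ' ', $str );
    return undef if !defined( $word ) || @pairs % 2 != 0;

    my @dense = ( 0 ) x $size;
    while ( my ( $index, $value ) = splice( @pairs, 0, 2 ) ) {
        return undef if $index < 0 || $index >= $size;
        $dense[$index] = $value;
    }
    return \@dense;
}

my $str   = "cookie 1 0.234 9 0.0002 13 0.234 17 -0.0023 19 1.0000";
my $dense = sparse_text_to_dense( $str, 20 );
print( join( ' ', @$dense ) . "\n" ) if defined( $dense );
```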
W2VConvertRawSparseTextToVectorDataHash
Description:
Converts sparse vector string to a dense vector format data hash.
Input:
$string -> Vector data string.
Output:
$hashReference -> Reference to hash of vector data.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $str = "cookie 1 0.234 9 0.0002 13 0.234 17 -0.0023 19 1.0000";
my %vectorData = %{ $interface->W2VConvertRawSparseTextToVectorDataHash( $str ) };
print( "Data conversion successful!\n" ) if ( keys %vectorData ) > 0;
print( "Data conversion un-successful!\n" ) if ( keys %vectorData ) == 0;
undef( $interface );
Word2Vec Accessor Functions
W2VGetDebugLog
Description:
Returns the _debugLog member variable set during Word2vec::Word2vec object initialization of new function.
Input:
None
Output:
$value -> '0' = False, '1' = True
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $debugLog = $interface->W2VGetDebugLog();
print( "Debug Logging Enabled\n" ) if $debugLog == 1;
print( "Debug Logging Disabled\n" ) if $debugLog == 0;
undef( $interface );
W2VGetWriteLog
Description:
Returns the _writeLog member variable set during Word2vec::Word2vec object initialization of new function.
Input:
None
Output:
$value -> '0' = False, '1' = True
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $writeLog = $interface->W2VGetWriteLog();
print( "Write Logging Enabled\n" ) if $writeLog == 1;
print( "Write Logging Disabled\n" ) if $writeLog == 0;
undef( $interface );
W2VGetFileHandle
Description:
Returns the _fileHandle member variable set during Word2vec::Word2vec object instantiation of new function.
Warning: This is a private function. File handle is used by WriteLog() method. Do not manipulate this file handle as errors can result.
Input:
None
Output:
$fileHandle -> Returns file handle for WriteLog() method or undefined.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $fileHandle = $interface->W2VGetFileHandle();
undef( $interface );
W2VGetTrainFilePath
Description:
Returns the _trainFilePath member variable set during Word2vec::Word2vec object instantiation of new function.
Input:
None
Output:
$string -> Returns word2vec training text corpus file path.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $filePath = $interface->W2VGetTrainFilePath();
print( "Training File Path: $filePath\n" );
undef( $interface );
W2VGetOutputFilePath
Description:
Returns the _outputFilePath member variable set during Word2vec::Word2vec object instantiation of new function.
Input:
None
Output:
$string -> Returns post word2vec training output file path.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $filePath = $interface->W2VGetOutputFilePath();
print( "File Path: $filePath\n" );
undef( $interface );
W2VGetWordVecSize
Description:
Returns the _wordVecSize member variable set during Word2vec::Word2vec object instantiation of new function.
Input:
None
Output:
$value -> Returns (integer) size of word2vec word vectors. Default value = 100
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $value = $interface->W2VGetWordVecSize();
print( "Word Vector Size: $value\n" );
undef( $interface );
W2VGetWindowSize
Description:
Returns the _windowSize member variable set during Word2vec::Word2vec object instantiation of new function.
Input:
None
Output:
$value -> Returns (integer) word2vec window size. Default value = 5
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $value = $interface->W2VGetWindowSize();
print( "Window Size: $value\n" );
undef( $interface );
W2VGetSample
Description:
Returns the _sample member variable set during Word2vec::Word2vec object instantiation of new function.
Input:
None
Output:
$value -> Returns (float) word2vec sample size. Default value = 0.001
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $value = $interface->W2VGetSample();
print( "Sample: $value\n" );
undef( $interface );
W2VGetHSoftMax
Description:
Returns the _hSoftMax member variable set during Word2vec::Word2vec object instantiation of new function.
Input:
None
Output:
$value -> Returns (integer) word2vec HSoftMax value. Default = 0
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $value = $interface->W2VGetHSoftMax();
print( "HSoftMax: $value\n" );
undef( $interface );
W2VGetNegative
Description:
Returns the _negative member variable set during Word2vec::Word2vec object instantiation of new function.
Input:
None
Output:
$value -> Returns (integer) word2vec negative value. Default = 5
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $value = $interface->W2VGetNegative();
print( "Negative: $value\n" );
undef( $interface );
W2VGetNumOfThreads
Description:
Returns the _numOfThreads member variable set during Word2vec::Word2vec object instantiation of new function.
Input:
None
Output:
$value -> Returns (integer) word2vec number of threads to use during training. Default = 12
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $value = $interface->W2VGetNumOfThreads();
print( "Number of threads: $value\n" );
undef( $interface );
W2VGetNumOfIterations
Description:
Returns the _iterations member variable set during Word2vec::Word2vec object instantiation of new function.
Input:
None
Output:
$value -> Returns (integer) word2vec number of word2vec iterations. Default = 5
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $value = $interface->W2VGetNumOfIterations();
print( "Number of iterations: $value\n" );
undef( $interface );
W2VGetMinCount
Description:
Returns the _minCount member variable set during Word2vec::Word2vec object instantiation of new function.
Input:
None
Output:
$value -> Returns (integer) word2vec min-count value. Default = 5
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $value = $interface->W2VGetMinCount();
print( "Min Count: $value\n" );
undef( $interface );
W2VGetAlpha
Description:
Returns the _alpha member variable set during Word2vec::Word2vec object instantiation of new function.
Input:
None
Output:
$value -> Returns (float) word2vec alpha value. Default = 0.05 for CBOW and 0.025 for Skip-Gram.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $value = $interface->W2VGetAlpha();
print( "Alpha: $value\n" );
undef( $interface );
W2VGetClasses
Description:
Returns the _classes member variable set during Word2vec::Word2vec object instantiation of new function.
Input:
None
Output:
$value -> Returns (integer) word2vec classes value. Default = 0
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $value = $interface->W2VGetClasses();
print( "Classes: $value\n" );
undef( $interface );
W2VGetDebugTraining
Description:
Returns the _debug member variable set during Word2vec::Word2vec object instantiation of new function.
Note: 0 = No debug output, 1 = Enable debug output, 2 = Even more debug output
Input:
None
Output:
$value -> Returns (integer) word2vec debug value. Default = 2
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $value = $interface->W2VGetDebugTraining();
print( "Debug: $value\n" );
undef( $interface );
W2VGetBinaryOutput
Description:
Returns the _binaryOutput member variable set during Word2vec::Word2vec object instantiation of new function.
Note: 1 = Save trained vector data in binary format, 0 = Save trained vector data in plain text format.
Input:
None
Output:
$value -> Returns (integer) word2vec binary flag. Default = 0
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $value = $interface->W2VGetBinaryOutput();
print( "Binary Output: $value\n" );
undef( $interface );
W2VGetReadVocabFilePath
Description:
Returns the _readVocab member variable set during Word2vec::Word2vec object instantiation of new function.
Input:
None
Output:
$string -> Returns (string) word2vec read vocabulary file name or empty string if not set.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $str = $interface->W2VGetReadVocabFilePath();
print( "Read Vocab File Path: $str\n" );
undef( $interface );
W2VGetSaveVocabFilePath
Description:
Returns the _saveVocab member variable set during Word2vec::Word2vec object instantiation of new function.
Input:
None
Output:
$string -> Returns (string) word2vec save vocabulary file name or empty string if not set.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $str = $interface->W2VGetSaveVocabFilePath();
print( "Save Vocab File Path: $str\n" );
undef( $interface );
W2VGetUseCBOW
Description:
Returns the _useCBOW member variable set during Word2vec::Word2vec object instantiation of new function.
Note: 0 = Skip-Gram Model, 1 = Continuous Bag Of Words Model.
Input:
None
Output:
$value -> Returns (integer) word2vec Continuous-Bag-Of-Words flag. Default = 1
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $value = $interface->W2VGetUseCBOW();
print( "Use CBOW?: $value\n" );
undef( $interface );
W2VGetWorkingDir
Description:
Returns the _workingDir member variable set during Word2vec::Word2vec object instantiation of new function.
Input:
None
Output:
$value -> Returns (string) working directory path or current directory if not specified.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $str = $interface->W2VGetWorkingDir();
print( "Working Directory: $str\n" );
undef( $interface );
W2VGetWord2VecExeDir
Description:
Returns the _word2VecExeDir member variable set during Word2vec::Word2vec object instantiation of new function.
Input:
None
Output:
$value -> Returns (string) word2vec executable directory path or empty string if not specified.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $str = $interface->W2VGetWord2VecExeDir();
print( "Word2Vec Executable File Directory: $str\n" );
undef( $interface );
W2VGetVocabularyHash
Description:
Returns the _hashRefOfWordVectors member variable set during Word2vec::Word2vec object instantiation of new function.
Input:
None
Output:
$value -> Returns hash reference of vocabulary/dictionary words. (Word2vec trained data in memory)
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $vocabularyHashRef = $interface->W2VGetVocabularyHash();
undef( $interface );
W2VGetOverwriteOldFile
Description:
Returns the _overwriteOldFile member variable set during Word2vec::Word2vec object instantiation of new function.
Input:
None
Output:
$value -> Returns 1 = True or 0 = False.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $value = $interface->W2VGetOverwriteOldFile();
print( "Overwrite Existing File?: $value\n" );
undef( $interface );
Word2Vec Mutator Functions
W2VSetTrainFilePath
Description:
Sets member variable to string parameter. Sets training file path.
Input:
$string -> Text corpus training file path
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VSetTrainFilePath( "samples/textcorpus.txt" );
undef( $interface );
W2VSetOutputFilePath
Description:
Sets member variable to string parameter. Sets output file path.
Input:
$string -> Post word2vec training save file path
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VSetOutputFilePath( "samples/tempvectors.bin" );
undef( $interface );
W2VSetWordVecSize
Description:
Sets member variable to integer parameter. Sets word2vec word vector size.
Input:
$value -> Word2vec word vector size
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VSetWordVecSize( 100 );
undef( $interface );
W2VSetWindowSize
Description:
Sets member variable to integer parameter. Sets word2vec window size.
Input:
$value -> Word2vec window size
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VSetWindowSize( 8 );
undef( $interface );
W2VSetSample
Description:
Sets member variable to float parameter. Sets word2vec sample size.
Input:
$value -> Word2vec sample size
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VSetSample( 3 );
undef( $interface );
W2VSetHSoftMax
Description:
Sets member variable to integer parameter. Sets word2vec HSoftMax value.
Input:
$value -> Word2vec HSoftMax size
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VSetHSoftMax( 12 );
undef( $interface );
W2VSetNegative
Description:
Sets member variable to integer parameter. Sets word2vec negative value.
Input:
$value -> Word2vec negative value
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VSetNegative( 12 );
undef( $interface );
W2VSetNumOfThreads
Description:
Sets member variable to integer parameter. Sets word2vec number of training threads to specified value.
Input:
$value -> Word2vec number of threads value
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VSetNumOfThreads( 12 );
undef( $interface );
W2VSetNumOfIterations
Description:
Sets member variable to integer parameter. Sets word2vec iterations value.
Input:
$value -> Word2vec number of iterations value
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VSetNumOfIterations( 12 );
undef( $interface );
W2VSetMinCount
Description:
Sets member variable to integer parameter. Sets word2vec min-count value.
Input:
$value -> Word2vec min-count value
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VSetMinCount( 7 );
undef( $interface );
W2VSetAlpha
Description:
Sets member variable to float parameter. Sets word2vec alpha value.
Input:
$value -> Word2vec alpha value. (Float)
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VSetAlpha( 0.0012 );
undef( $interface );
W2VSetClasses
Description:
Sets member variable to integer parameter. Sets word2vec classes value.
Input:
$value -> Word2vec classes value.
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VSetClasses( 0 );
undef( $interface );
W2VSetDebugTraining
Description:
Sets member variable to integer parameter. Sets word2vec debug parameter value.
Input:
$value -> Word2vec debug training value.
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VSetDebugTraining( 0 );
undef( $interface );
W2VSetBinaryOutput
Description:
Sets member variable to integer parameter. Sets word2vec binary parameter value.
Input:
$value -> Word2vec binary output mode value. ( '1' = Binary Output / '0' = Plain Text )
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VSetBinaryOutput( 1 );
undef( $interface );
W2VSetSaveVocabFilePath
Description:
Sets member variable to string parameter. Sets word2vec save vocabulary file name.
Input:
$string -> Word2vec save vocabulary file name and path.
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VSetSaveVocabFilePath( "samples/vocab.txt" );
undef( $interface );
W2VSetReadVocabFilePath
Description:
Sets member variable to string parameter. Sets word2vec read vocabulary file name.
Input:
$string -> Word2vec read vocabulary file name and path.
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VSetReadVocabFilePath( "samples/vocab.txt" );
undef( $interface );
W2VSetUseCBOW
Description:
Sets member variable to integer parameter. Sets word2vec CBOW parameter value.
Input:
$value -> Word2vec CBOW mode value.
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VSetUseCBOW( 1 );
undef( $interface );
W2VSetWorkingDir
Description:
Sets member variable to string parameter. Sets working directory.
Input:
$string -> Working directory
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VSetWorkingDir( "/samples" );
undef( $interface );
W2VSetWord2VecExeDir
Description:
Sets member variable to string parameter. Sets word2vec executable file directory.
Input:
$string -> Word2vec directory
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VSetWord2VecExeDir( "/word2vec" );
undef( $interface );
W2VSetVocabularyHash
Description:
Sets vocabulary/dictionary hash reference to hash reference parameter.
Warning: This will overwrite any existing vocabulary/dictionary data in memory.
Input:
$hashReference -> Vocabulary/Dictionary hash reference of word2vec word vectors.
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
my $vocabularyHashReference = $interface->W2VGetVocabularyHash();
$interface->W2VSetVocabularyHash( $vocabularyHashReference );
undef( $interface );
W2VClearVocabularyHash
Description:
Clears vocabulary/dictionary hash.
Input:
None
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VClearVocabularyHash();
undef( $interface );
W2VAddWordVectorToVocabHash
Description:
Adds word vector string to vocabulary/dictionary.
Input:
$string -> Word2vec word vector string
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
# Note: This is representational data of word2vec's word vector format and not actual data.
$interface->W2VAddWordVectorToVocabHash( "of 0.4346 -0.1235 0.5789 0.2347 -0.0056 -0.0001" );
undef( $interface );
W2VSetOverwriteOldFile
Description:
Sets member variable to integer parameter. Enables overwriting output file if one already exists.
Input:
$value -> '1' = Overwrite existing file / '0' = Graceful termination when file with same name exists
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2VSetOverwriteOldFile( 1 );
undef( $interface );
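The settings covered by the mutator functions above correspond to options of the word2vec command-line tool. The sketch below shows one way such settings could be turned into an argument list. The setting names, the helper build_word2vec_command and the mapping table are hypothetical and chosen for illustration; how the module assembles its own command is not shown in this documentation. The flags themselves are those of the original word2vec tool.

```perl
use strict;
use warnings;

# Hypothetical setting names paired with word2vec command-line flags,
# kept in a fixed order so the generated command is predictable.
my @flagMap = (
    [ trainFilePath     => "-train" ],
    [ outputFilePath    => "-output" ],
    [ wordVecSize       => "-size" ],
    [ windowSize        => "-window" ],
    [ sample            => "-sample" ],
    [ hSoftMax          => "-hs" ],
    [ negative          => "-negative" ],
    [ numOfThreads      => "-threads" ],
    [ numOfIterations   => "-iter" ],
    [ minCount          => "-min-count" ],
    [ alpha             => "-alpha" ],
    [ classes           => "-classes" ],
    [ debugTraining     => "-debug" ],
    [ binaryOutput      => "-binary" ],
    [ saveVocabFilePath => "-save-vocab" ],
    [ readVocabFilePath => "-read-vocab" ],
    [ useCBOW           => "-cbow" ],
);

# Build the argument list, skipping settings that were never set.
sub build_word2vec_command {
    my ( $exeDir, $settings ) = @_;
    my @command = ( "$exeDir/word2vec" );
    for my $pair ( @flagMap ) {
        my ( $name, $flag ) = @$pair;
        push( @command, $flag, $settings->{ $name } ) if defined( $settings->{ $name } );
    }
    return @command;
}

my @command = build_word2vec_command( "/word2vec", {
    trainFilePath  => "samples/textcorpus.txt",
    outputFilePath => "samples/tempvectors.bin",
    wordVecSize    => 100,
    windowSize     => 8,
    binaryOutput   => 1,
    useCBOW        => 1,
} );
print( join( ' ', @command ) . "\n" );
```

Returning a list rather than a single string allows the command to be passed to the list form of system(), which avoids shell quoting issues with file paths.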
Word2Phrase Main Functions
W2PExecuteTraining
Description:
Executes word2phrase training based on parameters. Parameter variables have higher precedence than member variables.
Any parameter specified will override its respective member variable.
Note: If no parameters are specified, this module executes word2phrase training based on preset member
variables. Returns string regarding training status.
Input:
$trainFilePath -> Training text corpus file path
$outputFilePath -> Vector binary file path
$minCount -> Minimum bi-gram frequency (Positive Integer)
$threshold -> Maximum bi-gram frequency (Positive Integer)
$debug -> Displays word2phrase debug information during training. (0 = None, 1 = Show Debug Information, 2 = Show Even More Debug Information)
$overwrite -> Overwrites old training file when executing training. (0 = False / 1 = True)
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2PSetMinCount( 12 );
$interface->W2PSetThreshold( 20 );
$interface->W2PSetTrainFilePath( "textCorpus.txt" );
$interface->W2PSetOutputFilePath( "phraseTextCorpus.txt" );
$interface->W2PExecuteTraining();
undef( $interface );
# Or
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2PExecuteTraining( "textCorpus.txt", "phraseTextCorpus.txt", 12, 20, 2, 1 );
undef( $interface );
W2PExecuteStringTraining
Description:
Executes word2phrase training on the string parameter rather than on a training file. Parameter variables have higher precedence than member variables.
Any parameter specified will override its respective member variable.
Note: Any parameter left unspecified falls back to its preset member variable.
Returns 0 if training succeeded or -1 if it failed.
Input:
$trainingString -> String to train
$outputFilePath -> Output (phrase) text corpus file path
$minCount -> Minimum bi-gram frequency (Positive Integer)
$threshold -> Maximum bi-gram frequency (Positive Integer)
$debug -> Displays word2phrase debug information during training. (0 = None, 1 = Show Debug Information, 2 = Show Even More Debug Information)
$overwrite -> Overwrites old training file when executing training. (0 = False / 1 = True)
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2PSetMinCount( 12 );
$interface->W2PSetThreshold( 20 );
$interface->W2PSetOutputFilePath( "phraseTextCorpus.txt" );
$interface->W2PExecuteStringTraining( "large string to train here" );
undef( $interface );
# Or
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2PExecuteStringTraining( "large string to train here", "phraseTextCorpus.txt", 12, 20, 2, 1 );
undef( $interface );
Word2Phrase Accessor Functions
W2PGetDebugLog
Description:
Returns the _debugLog member variable set during Word2vec::Interface object initialization of new function.
Input:
None
Output:
$value -> 0 = False, 1 = True
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $debugLog = $interface->W2PGetDebugLog();
print( "Debug Logging Enabled\n" ) if $debugLog == 1;
print( "Debug Logging Disabled\n" ) if $debugLog == 0;
undef( $interface );
W2PGetWriteLog
Description:
Returns the _writeLog member variable set during Word2vec::Interface object initialization of new function.
Input:
None
Output:
$value -> 0 = False, 1 = True
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $writeLog = $interface->W2PGetWriteLog();
print( "Write Logging Enabled\n" ) if $writeLog == 1;
print( "Write Logging Disabled\n" ) if $writeLog == 0;
undef( $interface );
W2PGetFileHandle
Description:
Returns file handle used by word2phrase::WriteLog() method.
Input:
None
Output:
$fileHandle -> Returns file handle blob used by 'WriteLog()' function or undefined.
Example:
<This should not be called.>
W2PGetTrainFilePath
Description:
Returns (string) training file path.
Input:
None
Output:
$string -> word2phrase training file path
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $filePath = $interface->W2PGetTrainFilePath();
print( "Training File Path: $filePath\n" ) if defined( $filePath );
undef( $interface );
W2PGetOutputFilePath
Description:
Returns (string) output file path.
Input:
None
Output:
$string -> word2phrase output file path
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $filePath = $interface->W2PGetOutputFilePath();
print( "Output File Path: $filePath\n" ) if defined( $filePath );
undef( $interface );
W2PGetMinCount
Description:
Returns (integer) minimum bi-gram range.
Input:
None
Output:
$value -> Minimum bi-gram frequency (Positive Integer)
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $mincount = $interface->W2PGetMinCount();
print( "MinCount: $mincount\n" ) if defined( $mincount );
undef( $interface );
W2PGetThreshold
Description:
Returns (integer) maximum bi-gram range.
Input:
None
Output:
$value -> Maximum bi-gram frequency (Positive Integer)
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $threshold = $interface->W2PGetThreshold();
print( "Threshold: $threshold\n" ) if defined( $threshold );
undef( $interface );
W2PGetW2PDebug
Description:
Returns word2phrase debug parameter value.
Input:
None
Output:
$value -> 0 = No debugging, 1 = Show debugging, 2 = Show even more debugging
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $interfacedebug = $interface->W2PGetW2PDebug();
print( "Word2Phrase Debug Level: $interfacedebug\n" ) if defined( $interfacedebug );
undef( $interface );
W2PGetWorkingDir
Description:
Returns (string) working directory path.
Input:
None
Output:
$string -> Current working directory path
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $workingDir = $interface->W2PGetWorkingDir();
print( "Working Directory: $workingDir\n" ) if defined( $workingDir );
undef( $interface );
W2PGetWord2PhraseExeDir
Description:
Returns (string) word2phrase executable directory path.
Input:
None
Output:
$string -> Word2Phrase executable directory path
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $exeDir = $interface->W2PGetWord2PhraseExeDir();
print( "Word2Phrase Executable Directory: $exeDir\n" ) if defined( $exeDir );
undef( $interface );
W2PGetOverwriteOldFile
Description:
Returns the current value of the overwrite training file variable.
Input:
None
Output:
$value -> 1 = True/Overwrite or 0 = False/Append to current file
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $overwrite = $interface->W2PGetOverwriteOldFile();
if ( defined( $overwrite ) )
{
print( "Overwrite Old File: " );
print( "Yes\n" ) if $overwrite == 1;
print( "No\n" ) if $overwrite == 0;
}
undef( $interface );
Word2Phrase Mutator Functions
W2PSetTrainFilePath
Description:
Sets training file path.
Input:
$string -> Training file path
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2PSetTrainFilePath( "filePath" );
undef( $interface );
W2PSetOutputFilePath
Description:
Sets word2phrase output file path.
Input:
$string -> word2phrase output file path
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2PSetOutputFilePath( "filePath" );
undef( $interface );
W2PSetMinCount
Description:
Sets minimum range value.
Input:
$value -> Minimum frequency value (Positive integer)
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2PSetMinCount( 1 );
undef( $interface );
W2PSetThreshold
Description:
Sets maximum range value.
Input:
$value -> Maximum frequency value (Positive integer)
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2PSetThreshold( 100 );
undef( $interface );
W2PSetW2PDebug
Description:
Sets word2phrase debug parameter.
Input:
$value -> word2phrase debug parameter (0 = No debug info, 1 = Show debug info, 2 = Show more debug info.)
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2PSetW2PDebug( 2 );
undef( $interface );
W2PSetWorkingDir
Description:
Sets working directory path.
Input:
$string -> Current working directory path.
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2PSetWorkingDir( "filePath" );
undef( $interface );
W2PSetWord2PhraseExeDir
Description:
Sets word2phrase executable file directory path.
Input:
$string -> Word2Phrase executable directory path.
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2PSetWord2PhraseExeDir( "filePath" );
undef( $interface );
W2PSetOverwriteOldFile
Description:
Enables overwriting word2phrase output file if one already exists with the same output file name.
Input:
$value -> Integer: 1 = Overwrite old file, 0 = Do not overwrite old file.
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->W2PSetOverwriteOldFile( 1 );
undef( $interface );
XMLToW2V Main Functions
XTWConvertMedlineXMLToW2V
Description:
Parses the specified Medline XML file or directory of files, creating a text corpus. Returns 0 if successful or -1 on error.
Note: Supports plain Medline XML or gzipped XML files.
Input:
$filePath -> XML file path to parse. (This can be a single file or directory of XML/XML.gz files).
Output:
$value -> '0' = Successful / '-1' = Un-Successful
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new(); # Note: Specifying no parameters implies default settings
$interface->XTWSetSavePath( "testCorpus.txt" );
$interface->XTWSetStoreTitle( 1 );
$interface->XTWSetStoreAbstract( 1 );
$interface->XTWSetBeginDate( "01/01/2004" );
$interface->XTWSetEndDate( "08/13/2016" );
$interface->XTWSetOverwriteExistingFile( 1 );
$interface->XTWConvertMedlineXMLToW2V( "/xmlDirectory/" );
undef( $interface );
XTWCreateCompoundWordBST
Description:
Creates a binary search tree from the compound word data in memory and stores its root node.
Warning: Compound word file must be loaded into memory using XTWReadCompoundWordDataFromFile() prior to calling this method. This function
will also delete the compound word array upon completion as it will no longer be necessary.
Input:
None
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWReadCompoundWordDataFromFile( "samples/compoundword.txt" );
$interface->XTWCreateCompoundWordBST();
XTWCompoundifyString
Description:
Compoundifies string parameter based on compound word data in memory using the compound word binary search tree.
Warning: Compound word file must be loaded into memory using XTWReadCompoundWordDataFromFile() prior to calling this method.
Input:
$string -> String to compoundify
Output:
$string -> Compounded string or "(null)" if string parameter is not defined.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWReadCompoundWordDataFromFile( "samples/compoundword.txt" );
$interface->XTWCreateCompoundWordBST();
my $compoundedString = $interface->XTWCompoundifyString( "String to compoundify" );
print( "Compounded String: $compoundedString\n" );
undef( $interface );
XTWReadCompoundWordDataFromFile
Description:
Reads compound word file and stores in memory. $autoSetMaxCompWordLength parameter is not required to be set. This
parameter instructs the method to auto set the maximum compound word length dependent on the longest compound word found.
Note: $autoSetMaxCompWordLength options: defined = True and Undefined = False.
Input:
$filePath -> Compound word file path
$autoSetMaxCompWordLength -> Maximum length of a given compoundified phrase the module's compoundify algorithm will permit.
Note: Calling this method with $autoSetMaxCompWordLength defined will automatically set the maxCompoundWordLength variable to the longest compound phrase.
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWReadCompoundWordDataFromFile( "samples/compoundword.txt", 1 );
undef( $interface );
XTWSaveCompoundWordListToFile
Description:
Saves compound word data in memory to a specified file location.
Input:
$savePath -> Path to save compound word list to file.
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWReadCompoundWordDataFromFile( "samples/compoundword.txt" );
$interface->XTWSaveCompoundWordListToFile( "samples/newcompoundword.txt" );
undef( $interface );
XTWReadTextFromFile
Description:
Reads a plain text file with utf8 encoding into memory. Returns string data if successful and "(null)" if unsuccessful.
Input:
$filePath -> Text file to read into memory
Output:
$string -> String data if successful or "(null)" if un-successful.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $textData = $interface->XTWReadTextFromFile( "samples/textcorpus.txt" );
print( "Text Data: $textData\n" );
undef( $interface );
XTWSaveTextToFile
Description:
Saves a plain text file with utf8 encoding in a specified location.
Input:
$savePath -> Path to save string data.
$string -> String to save
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $result = $interface->XTWSaveTextToFile( "text.txt", "Hello world!" );
print( "File saved\n" ) if $result == 0;
print( "File unable to save\n" ) if $result == -1;
undef( $interface );
XTWReadXMLDataFromFile
Description:
Reads an XML file from a specified location. Returns string in memory if successful and "(null)" if unsuccessful.
Input:
$filePath -> File to read given path
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
Warning: This is a private function and is called by XML::Twig parsing functions. It should not be called outside of xmltow2v module.
XTWSaveTextCorpusToFile
Description:
Saves text corpus data to specified file path. This method will append to any existing file if $appendToFile parameter
is defined or "overwrite" option is disabled. Enabling "overwrite" option will overwrite any existing files.
Input:
$savePath -> Path to save the text corpus
$appendToFile -> Specifies whether the module will overwrite any existing data or append to existing text corpus data.
Note: Leaving this variable undefined will fetch the "Overwrite" member variable and set the value to this parameter.
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
Warning: This is a private function and is called by XML::Twig parsing functions. It should not be called outside of xmltow2v module.
XTWIsDateInSpecifiedRange
Description:
Checks to see if $date is within $beginDate and $endDate range. Returns 1 if true and 0 if false.
Note: Date Format: XX/XX/XXXX (Month/Day/Year)
Input:
$date -> Date to check against minimum and maximum date range. (String)
$beginDate -> Minimum date range (String)
$endDate -> Maximum date range (String)
Output:
$value -> '1' = True/Date is within specified range Or '0' = False/Date is not within specified range.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
print( "Is \"01/01/2004\" within the date range: \"02/21/1985\" to \"08/13/2016\"?\n" );
print( "Yes\n" ) if $interface->XTWIsDateInSpecifiedRange( "01/01/2004", "02/21/1985", "08/13/2016" ) == 1;
print( "No\n" ) if $interface->XTWIsDateInSpecifiedRange( "01/01/2004", "02/21/1985", "08/13/2016" ) == 0;
undef( $interface );
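Note: The range check described above can be pictured with the standalone sketch below. It is not the module's
implementation. The helper names date_to_key and is_date_in_range are hypothetical, and the sketch assumes that both
endpoints are inclusive and that a malformed date counts as out of range.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Word2vec::Interface):
# converts "MM/DD/YYYY" into a sortable integer of the form YYYYMMDD.
sub date_to_key {
    my ( $date ) = @_;
    # Explicit undef keeps one list slot per date when called inside map().
    return undef unless defined( $date ) && $date =~ m{^(\d\d)/(\d\d)/(\d{4})\z};
    return $3 * 10000 + $1 * 100 + $2;
}

# Hypothetical range check: 1 if $date lies within [$beginDate, $endDate], else 0.
# Assumption: endpoints are inclusive and malformed dates are treated as out of range.
sub is_date_in_range {
    my ( $date, $beginDate, $endDate ) = @_;
    my ( $key, $low, $high ) = map { date_to_key( $_ ) } ( $date, $beginDate, $endDate );
    return 0 unless defined( $key ) && defined( $low ) && defined( $high );
    return ( $key >= $low && $key <= $high ) ? 1 : 0;
}

print( is_date_in_range( "01/01/2004", "02/21/1985", "08/13/2016" ), "\n" );   # prints 1
print( is_date_in_range( "12/31/1984", "02/21/1985", "08/13/2016" ), "\n" );   # prints 0
```

Comparing YYYYMMDD integers avoids the trap of comparing Month/Day/Year strings directly, which would order dates by month first.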
XTWIsFileOrDirectory
Description:
Checks to see if specified path is a file or directory.
Input:
$path -> File or directory path. (String)
Output:
$string -> Returns: "file" = file, "dir" = directory and "unknown" if the path is not a file or directory (undefined).
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $path = "path/to/a/directory";
print( "Is \"$path\" a file or directory? " . $interface->XTWIsFileOrDirectory( $path ) . "\n" );
$path = "path/to/a/file.file";
print( "Is \"$path\" a file or directory? " . $interface->XTWIsFileOrDirectory( $path ) . "\n" );
undef( $interface );
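Note: The three return strings documented above map naturally onto Perl's file test operators. The sketch below is a
standalone illustration under that assumption, not the module's implementation. The function name file_or_directory is
hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the documented behaviour:
# "file" for a plain file, "dir" for a directory, "unknown" otherwise.
sub file_or_directory {
    my ( $path ) = @_;
    return "unknown" unless defined( $path );
    return "file" if -f $path;    # plain file test
    return "dir"  if -d $path;    # directory test
    return "unknown";
}

print( file_or_directory( "." ), "\n" );                    # prints "dir"
print( file_or_directory( "no/such/path/here" ), "\n" );    # prints "unknown" unless that path exists
```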
XTWRemoveSpecialCharactersFromString
Description:
Removes special characters from string parameter, removes extra spaces and converts text to lowercase.
Note: This method is called when parsing and compiling Medline title/abstract data.
Input:
$string -> String passed to remove special characters from and convert to lowercase.
Output:
$string -> String with all special characters removed and converted to lowercase.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $str = "Heart Attack is$ an!@ also KNOWN as an Acute MYOCARDIAL inFARCTion!";
print( "Original String: $str\n" );
$str = $interface->XTWRemoveSpecialCharactersFromString( $str );
print( "Modified String: $str\n" );
undef( $interface );
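Note: The clean-up steps described above (remove special characters, collapse extra spaces, convert to lowercase) can be
sketched as follows. This is a standalone illustration, not the module's implementation. The function name
normalize_text is hypothetical, and the sketch assumes that "special characters" means anything other than ASCII
letters, digits and whitespace.

```perl
use strict;
use warnings;

# Hypothetical normalizer mirroring the documented steps.
# Assumption: only a-z, 0-9 and whitespace survive the clean-up.
sub normalize_text {
    my ( $str ) = @_;
    return "" unless defined( $str );
    $str = lc( $str );                # convert to lowercase
    $str =~ s/[^a-z0-9\s]//g;         # drop special characters
    $str =~ s/\s+/ /g;                # collapse runs of whitespace
    $str =~ s/^\s+|\s+$//g;           # trim leading and trailing spaces
    return $str;
}

# Single quotes keep '$' and '@' literal in the sample string.
my $str = 'Heart Attack is$ an!@ also KNOWN as an Acute MYOCARDIAL inFARCTion!';
print( normalize_text( $str ), "\n" );
# prints "heart attack is an also known as an acute myocardial infarction"
```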
XTWGetFileType
Description:
Returns file data type (string).
Input:
$filePath -> File to check located at file path
Output:
$string -> File type
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $fileType = $interface->XTWGetFileType( "samples/textcorpus.txt" );
undef( $interface );
XTWDateCheck
Description:
Checks specified begin and end date strings for formatting and logic errors.
Input:
None
Output:
$value -> "0" = Passed Checks / "-1" = Failed Checks
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
print "Passed Date Checks\n" if ( $interface->XTWDateCheck() == 0 );
print "Failed Date Checks\n" if ( $interface->XTWDateCheck() == -1 );
undef( $interface );
XMLToW2V Accessor Functions
XTWGetDebugLog
Description:
Returns the _debugLog member variable set during Word2vec::Interface object initialization of new function.
Input:
None
Output:
$value -> '0' = False, '1' = True
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $debugLog = $interface->XTWGetDebugLog();
print( "Debug Logging Enabled\n" ) if $debugLog == 1;
print( "Debug Logging Disabled\n" ) if $debugLog == 0;
undef( $interface );
XTWGetWriteLog
Description:
Returns the _writeLog member variable set during Word2vec::Interface object initialization of new function.
Input:
None
Output:
$value -> '0' = False, '1' = True
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $writeLog = $interface->XTWGetWriteLog();
print( "Write Logging Enabled\n" ) if $writeLog == 1;
print( "Write Logging Disabled\n" ) if $writeLog == 0;
undef( $interface );
XTWGetStoreTitle
Description:
Returns the _storeTitle member variable set during Word2vec::Interface object instantiation of new function.
Input:
None
Output:
$value -> '1' = True / '0' = False
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $storeTitle = $interface->XTWGetStoreTitle();
print( "Store Title Option: Enabled\n" ) if $storeTitle == 1;
print( "Store Title Option: Disabled\n" ) if $storeTitle == 0;
undef( $interface );
XTWGetStoreAbstract
Description:
Returns the _storeAbstract member variable set during Word2vec::Interface object instantiation of new function.
Input:
None
Output:
$value -> '1' = True / '0' = False
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $storeAbstract = $interface->XTWGetStoreAbstract();
print( "Store Abstract Option: Enabled\n" ) if $storeAbstract == 1;
print( "Store Abstract Option: Disabled\n" ) if $storeAbstract == 0;
undef( $interface );
XTWGetQuickParse
Description:
Returns the _quickParse member variable set during Word2vec::Interface object instantiation of new function.
Input:
None
Output:
$value -> '1' = True / '0' = False
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $quickParse = $interface->XTWGetQuickParse();
print( "Quick Parse Option: Enabled\n" ) if $quickParse == 1;
print( "Quick Parse Option: Disabled\n" ) if $quickParse == 0;
undef( $interface );
XTWGetCompoundifyText
Description:
Returns the _compoundifyText member variable set during Word2vec::Interface object instantiation of new function.
Input:
None
Output:
$value -> '1' = True / '0' = False
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $compoundify = $interface->XTWGetCompoundifyText();
print( "Compoundify Text Option: Enabled\n" ) if $compoundify == 1;
print( "Compoundify Text Option: Disabled\n" ) if $compoundify == 0;
undef( $interface );
XTWGetStoreAsSentencePerLine
Description:
Returns the _storeAsSentencePerLine member variable set during Word2vec::Xmltow2v object instantiation of new function.
Input:
None
Output:
$value -> '1' = True / '0' = False
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $storeAsSentencePerLine = $interface->XTWGetStoreAsSentencePerLine();
print( "Store As Sentence Per Line: Enabled\n" ) if $storeAsSentencePerLine == 1;
print( "Store As Sentence Per Line: Disabled\n" ) if $storeAsSentencePerLine == 0;
undef( $interface );
XTWGetNumOfThreads
Description:
Returns the _numOfThreads member variable set during Word2vec::Interface object instantiation of new function.
Input:
None
Output:
$value -> Number of threads
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $numOfThreads = $interface->XTWGetNumOfThreads();
print( "Number of threads: $numOfThreads\n" );
undef( $interface );
XTWGetWorkingDir
Description:
Returns the _workingDir member variable set during Word2vec::Interface object instantiation of new function.
Input:
None
Output:
$string -> Working directory string
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $workingDirectory = $interface->XTWGetWorkingDir();
print( "Working Directory: $workingDirectory\n" );
undef( $interface );
XTWGetSavePath
Description:
Returns the _saveDir member variable set during Word2vec::Interface object instantiation of new function.
Input:
None
Output:
$string -> Save directory string
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $savePath = $interface->XTWGetSavePath();
print( "Save Directory: $savePath\n" );
undef( $interface );
XTWGetBeginDate
Description:
Returns the _beginDate member variable set during Word2vec::Interface object instantiation of new function.
Input:
None
Output:
$date -> Beginning date range - Format: XX/XX/XXXX (Mon/Day/Year)
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $date = $interface->XTWGetBeginDate();
print( "Date: $date\n" );
undef( $interface );
XTWGetEndDate
Description:
Returns the _endDate member variable set during Word2vec::Interface object instantiation of new function.
Input:
None
Output:
$date -> End date range - Format: XX/XX/XXXX (Mon/Day/Year).
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $date = $interface->XTWGetEndDate();
print( "Date: $date\n" );
undef( $interface );
XTWGetXMLStringToParse
Description:
Returns the XML data (string) to be parsed, held in the _xmlStringToParse member variable.
Input:
None
Output:
$string -> Medline XML data string
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $xmlStr = $interface->XTWGetXMLStringToParse();
print( "XML String: $xmlStr\n" );
undef( $interface );
XTWGetTextCorpusStr
Description:
Returns the _textCorpusStr member variable set during Word2vec::Interface object instantiation of new function.
Input:
None
Output:
$string -> Text corpus string
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $str = $interface->XTWGetTextCorpusStr();
print( "Text Corpus: $str\n" );
undef( $interface );
XTWGetFileHandle
Description:
Returns the _fileHandle member variable set during Word2vec::Interface object instantiation of new function.
Warning: This is a private function. File handle is used by 'xmltow2v::WriteLog()' method. Do not manipulate this file handle as errors can result.
Input:
None
Output:
$fileHandle -> Returns file handle for WriteLog() method.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $fileHandle = $interface->XTWGetFileHandle();
undef( $interface );
XTWGetTwigHandler
Description:
Returns the XML::Twig handler held in the _twigHandler member variable.
Warning: This is a private function and should not be called or manipulated.
Input:
None
Output:
$twigHandler -> XML::Twig handler.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $xmlHandler = $interface->XTWGetTwigHandler();
undef( $interface );
XTWGetParsedCount
Description:
Returns the _parsedCount member variable set during Word2vec::Interface object instantiation of new function.
Input:
None
Output:
$value -> Number of parsed Medline articles.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $numOfParsed = $interface->XTWGetParsedCount();
print( "Number of parsed Medline articles: $numOfParsed\n" );
undef( $interface );
XTWGetTempStr
Description:
Returns the _tempStr member variable set during Word2vec::Interface object instantiation of new function.
Warning: This is a private function and should not be called or manipulated. Used by module as a temporary storage
location for parsed Medline 'Title' and 'Abstract' flag string data.
Input:
None
Output:
$string -> Temporary string storage location.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $tempStr = $interface->XTWGetTempStr();
print( "Temp String: $tempStr\n" );
undef( $interface );
XTWGetTempDate
Description:
Returns the _tempDate member variable set during Word2vec::Interface object instantiation of new function.
Used by module as a temporary storage location for parsed Medline 'DateCreated' flag string data.
Input:
None
Output:
$date -> Date string - Format: XX/XX/XXXX (Mon/Day/Year).
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $date = $interface->XTWGetTempDate();
print( "Temp Date: $date\n" );
undef( $interface );
XTWGetCompoundWordAry
Description:
Returns the _compoundWordAry member array reference set during Word2vec::Interface object instantiation of new function.
Warning: Compound word data must be loaded in memory first via XTWReadCompoundWordDataFromFile().
Input:
None
Output:
$arrayReference -> Compound word array reference.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $arrayReference = $interface->XTWGetCompoundWordAry();
my @compoundWord = @{ $arrayReference };
print( "Compound Word Array: @compoundWord\n" );
undef( $interface );
XTWGetCompoundWordBST
Description:
Returns the _compoundWordBST member variable set during Word2vec::Interface object instantiation of new function.
Input:
None
Output:
$bst -> Compound word binary search tree.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $bst = $interface->XTWGetCompoundWordBST();
undef( $interface );
XTWGetMaxCompoundWordLength
Description:
Returns the _maxCompoundWordLength member variable set during Word2vec::Interface object instantiation of new function.
Note: If not defined, it is automatically set to and returns 20.
Input:
None
Output:
$value -> Maximum number of compound words in a given phrase.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $compoundWordLength = $interface->XTWGetMaxCompoundWordLength();
print( "Maximum Compound Word Length: $compoundWordLength\n" );
undef( $interface );
XTWGetOverwriteExistingFile
Description:
Returns the _overwriteExistingFile member variable set during Word2vec::Interface object instantiation of new function.
Enables overwriting of existing text corpus if set to '1' or appends to the existing text corpus if set to '0'.
Input:
None
Output:
$value -> '1' = Overwrite existing file / '0' = Append to existing file.
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
my $overwriteExistingFile = $interface->XTWGetOverwriteExistingFile();
print( "Overwrite Existing File? YES\n" ) if ( $overwriteExistingFile == 1 );
print( "Overwrite Existing File? NO\n" ) if ( $overwriteExistingFile == 0 );
undef( $interface );
XMLToW2V Mutator Functions
XTWSetStoreTitle
Description:
Sets member variable to passed integer parameter. Instructs module to store article title if true or omit if false.
Input:
$value -> '1' = Store Titles / '0' = Omit Titles
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWSetStoreTitle( 1 );
undef( $interface );
XTWSetStoreAbstract
Description:
Sets member variable to passed integer parameter. Instructs module to store article abstracts if true or omit if false.
Input:
$value -> '1' = Store Abstracts / '0' = Omit Abstracts
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWSetStoreAbstract( 1 );
undef( $interface );
XTWSetWorkingDir
Description:
Sets member variable to passed string parameter. Represents the working directory.
Input:
$string -> Working directory string
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWSetWorkingDir( "/samples/" );
undef( $interface );
XTWSetSavePath
Description:
Sets member variable to passed string parameter. Represents the text corpus save path.
Input:
$string -> Text corpus save path
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWSetSavePath( "samples/textcorpus.txt" );
undef( $interface );
XTWSetQuickParse
Description:
Sets member variable to passed integer parameter. Instructs module to utilize quick parse
routines to speed up text corpus compilation. This method is somewhat less accurate due to its non-exhaustive nature.
Input:
$value -> '1' = Enable Quick Parse / '0' = Disable Quick Parse
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWSetQuickParse( 1 );
undef( $interface );
XTWSetCompoundifyText
Description:
Sets member variable to passed integer parameter. Instructs module to utilize 'compoundify' option if true.
Warning: This requires compound word data to be loaded into memory with XTWReadCompoundWordDataFromFile() method prior
to executing text corpus compilation.
Input:
$value -> '1' = Compoundify text / '0' = Do not compoundify text
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWSetCompoundifyText( 1 );
undef( $interface );
XTWSetStoreAsSentencePerLine
Description:
Sets member variable to passed integer parameter. Instructs module to utilize 'storeAsSentencePerLine' option if true.
Input:
$value -> '1' = Store as sentence per line / '0' = Do not store as sentence per line
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWSetStoreAsSentencePerLine( 1 );
undef( $interface );
XTWSetNumOfThreads
Description:
Sets member variable to passed integer parameter. Sets the requested number of threads to parse Medline XML files
and compile the text corpus.
Input:
$value -> Integer (Positive value)
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWSetNumOfThreads( 4 );
undef( $interface );
XTWSetBeginDate
Description:
Sets member variable to passed string parameter. Sets beginning date range for earliest articles to store, by
'DateCreated' Medline tag, within the text corpus during compilation.
Note: Expected format - "XX/XX/XXXX" (Mon/Day/Year)
Input:
$string -> Date string - Format: "XX/XX/XXXX"
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWSetBeginDate( "01/01/2004" );
undef( $interface );
XTWSetEndDate
Description:
Sets member variable to passed string parameter. Sets ending date range for latest article to store, by
'DateCreated' Medline tag, within the text corpus during compilation.
Note: Expected format - "XX/XX/XXXX" (Mon/Day/Year)
Input:
$string -> Date string - Format: "XX/XX/XXXX"
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWSetEndDate( "08/13/2016" );
undef( $interface );
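The date range described above can be illustrated with a minimal sketch. This is not the module's implementation: DateToSortKey and IsDateInRange are hypothetical helpers, and treating both ends of the range as inclusive is an assumption. The sketch only shows how an "XX/XX/XXXX" (Mon/Day/Year) string can be validated and compared against a begin and end date.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Word2vec::Interface): converts a
# "MM/DD/YYYY" string into a sortable "YYYYMMDD" key.
# Returns undef when the string does not match the expected format.
sub DateToSortKey
{
    my ( $date ) = @_;
    return undef if !defined( $date );
    return undef if $date !~ m{\A(\d{2})/(\d{2})/(\d{4})\z};
    my ( $month, $day, $year ) = ( $1, $2, $3 );
    return undef if $month < 1 || $month > 12 || $day < 1 || $day > 31;
    return "$year$month$day";
}

# Hypothetical helper: returns 1 when $date lies between $begin and $end.
# Assumption: both ends of the range are inclusive.
sub IsDateInRange
{
    my ( $date, $begin, $end ) = @_;
    my ( $key, $low, $high ) = map { DateToSortKey( $_ ) } ( $date, $begin, $end );
    return 0 if grep { !defined( $_ ) } ( $key, $low, $high );
    return ( $key ge $low && $key le $high ) ? 1 : 0;
}

print IsDateInRange( "06/15/2010", "01/01/2004", "08/13/2016" ), "\n";
print IsDateInRange( "12/31/2003", "01/01/2004", "08/13/2016" ), "\n";
```

Converting to a year-first key lets a plain string comparison order the dates correctly, which a direct comparison of "MM/DD/YYYY" strings would not.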
XTWSetXMLStringToParse
Description:
Sets member variable to passed string parameter. This string normally consists of Medline XML data to be
parsed for text corpus compilation.
Warning: This is a private function and should not be called or manipulated.
Input:
$string -> String
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWSetXMLStringToParse( "Hello World!" );
undef( $interface );
XTWSetTextCorpusStr
Description:
Sets member variable to passed string parameter. Overwrites any text corpus data stored in memory with the string parameter.
Warning: This is a private function and should not be called or manipulated.
Input:
$string -> String
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWSetTextCorpusStr( "Hello World!" );
undef( $interface );
XTWAppendStrToTextCorpus
Description:
Appends the passed string parameter to the text corpus string in memory.
Warning: This is a private function and should not be called or manipulated.
Input:
$string -> String
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWAppendStrToTextCorpus( "Hello World!" );
undef( $interface );
XTWClearTextCorpus
Description:
Clears text corpus data in memory.
Warning: This is a private function and should not be called or manipulated.
Input:
None
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWClearTextCorpus();
undef( $interface );
XTWSetTempStr
Description:
Sets the temporary member string to the passed string parameter.
(Temporary placeholder for Medline Title and Abstract data).
Note: This removes special characters and converts all characters to lowercase.
Warning: This is a private function and should not be called or manipulated.
Input:
$string -> String
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWSetTempStr( "Hello World!" );
undef( $interface );
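The note above says the string is stripped of special characters and lowercased. The following is a minimal sketch of that kind of normalization, not the module's actual code: NormalizeText is a hypothetical helper, and the definition of "special characters" used here (anything other than a-z, 0-9 and whitespace) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Word2vec::Interface): lowercases the text
# and removes special characters.
# Assumption: a "special character" is anything other than a-z, 0-9 or whitespace.
sub NormalizeText
{
    my ( $text ) = @_;
    return "" if !defined( $text );

    $text = lc( $text );
    $text =~ s/[^a-z0-9\s]+/ /g;    # Replace runs of special characters with a space
    $text =~ s/\s+/ /g;             # Collapse repeated whitespace
    $text =~ s/^\s+|\s+$//g;        # Trim leading and trailing whitespace
    return $text;
}

print NormalizeText( "Hello World!" ), "\n";
```

Replacing special characters with a space rather than deleting them keeps hyphenated terms such as "Respiratory-Failure" as two separate words.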
XTWAppendToTempStr
Description:
Appends string parameter to temporary member string in memory.
Note: This removes special characters and converts all characters to lowercase.
Warning: This is a private function and should not be called or manipulated.
Input:
$string -> String
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWAppendToTempStr( "Hello World!" );
undef( $interface );
XTWClearTempStr
Description:
Clears the temporary string storage in memory.
Warning: This is a private function and should not be called or manipulated.
Input:
None
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWClearTempStr();
undef( $interface );
XTWSetTempDate
Description:
Sets member variable to passed string parameter. Sets temporary date string to passed string.
Note: Date Format - "XX/XX/XXXX" (Mon/Day/Year)
Warning: This is a private function and should not be called or manipulated.
Input:
$string -> Date string - Format: "XX/XX/XXXX"
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWSetTempDate( "08/13/2016" );
undef( $interface );
XTWClearTempDate
Description:
Clears the temporary date storage location in memory.
Warning: This is a private function and should not be called or manipulated.
Input:
None
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWClearTempDate();
undef( $interface );
XTWSetCompoundWordAry
Description:
Sets member variable to de-referenced passed array reference parameter. Stores compound word array by
de-referencing array reference parameter.
Note: Clears previous data if existing.
Warning: This is a private function and should not be called or manipulated.
Input:
$arrayReference -> Array reference of compound words
Output:
None
Example:
use Word2vec::Interface;
my @compoundWordAry = ( "big dog", "respiratory failure", "seven large masses" );
my $interface = Word2vec::Interface->new();
$interface->XTWSetCompoundWordAry( \@compoundWordAry );
undef( $interface );
XTWClearCompoundWordAry
Description:
Clears compound word array in memory.
Warning: This is a private function and should not be called or manipulated.
Input:
None
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWClearCompoundWordAry();
undef( $interface );
XTWSetCompoundWordBST
Description:
Sets member variable to passed Word2vec::Bst parameter. Sets compound word binary search tree to passed binary tree parameter.
Note: Un-defines previous binary tree if existing.
Warning: This is a private function and should not be called or manipulated.
Input:
Word2vec::Bst -> Binary Search Tree
Output:
None
Example:
use Word2vec::Interface;
my @compoundWordAry = ( "big dog", "respiratory failure", "seven large masses" );
@compoundWordAry = sort( @compoundWordAry );
my $arySize = @compoundWordAry;
my $bst = Word2vec::Bst->new();
$bst->CreateTree( \@compoundWordAry, 0, $arySize, undef );
my $interface = Word2vec::Interface->new();
$interface->XTWSetCompoundWordBST( $bst );
undef( $interface );
XTWClearCompoundWordBST
Description:
Clears/Un-defines existing compound word binary search tree from memory.
Warning: This is a private function and should not be called or manipulated.
Input:
None
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWClearCompoundWordBST();
undef( $interface );
XTWSetMaxCompoundWordLength
Description:
Sets member variable to passed integer parameter. Sets the maximum number of words in a phrase that is compared against the compound word data.
e.g. "medical campus of Virginia Commonwealth University" can be interpreted as a compound word of 6 words.
Setting this variable to 3 will attempt to compoundify at most three words at a time.
The result would be "medical_campus_of Virginia commonwealth university" even though an exact representation
of this compounded string can exist. Setting this variable to 6 will result in compounding all six words if
they exist in the compound word array/BST.
Warning: This is a private function and should not be called or manipulated.
Input:
$value -> Integer
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWSetMaxCompoundWordLength( 8 );
undef( $interface );
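The effect of the maximum compound word length can be illustrated with a minimal sketch. This is not the module's implementation: CompoundifyText is a hypothetical helper, a plain hash lookup stands in for the compound word array/BST, and matching the longest phrase first is an assumption. The output is lowercased throughout.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Word2vec::Interface): joins known compound
# words with underscores, considering phrases of at most $maxLength words.
# Assumption: a hash lookup replaces the module's compound word BST, and the
# longest candidate phrase is tried first.
sub CompoundifyText
{
    my ( $text, $compoundWordsRef, $maxLength ) = @_;
    my %lookup = map { ( lc( $_ ) => 1 ) } @{ $compoundWordsRef };
    my @words  = split( ' ', lc( $text ) );
    my @result = ();
    my $i      = 0;

    while( $i < @words )
    {
        my $matched   = 0;
        my $remaining = @words - $i;
        my $limit     = ( $maxLength < $remaining ) ? $maxLength : $remaining;

        # Try the longest candidate phrase first, then shrink it word by word
        for( my $len = $limit; $len >= 2; $len-- )
        {
            my $phrase = join( ' ', @words[ $i .. $i + $len - 1 ] );
            next if !exists( $lookup{ $phrase } );

            ( my $joined = $phrase ) =~ tr/ /_/;
            push( @result, $joined );
            $i      += $len;
            $matched = 1;
            last;
        }

        # No compound word starts at this position: keep the single word
        if( !$matched )
        {
            push( @result, $words[$i] );
            $i++;
        }
    }

    return join( ' ', @result );
}

my @compoundWords = ( "medical campus of", "medical campus of virginia commonwealth university" );
my $sentence      = "medical campus of Virginia Commonwealth University";

print CompoundifyText( $sentence, \@compoundWords, 3 ), "\n";
print CompoundifyText( $sentence, \@compoundWords, 6 ), "\n";
```

With a limit of 3, only the three-word entry can match, so the remaining words stay separate. With a limit of 6, the full six-word entry is found and joined as one token.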
XTWSetOverwriteExistingFile
Description:
Sets member variable to passed integer parameter. Sets option to overwrite existing text corpus during compilation
if 1 or append to existing text corpus if 0.
Input:
$value -> '1' = Overwrite existing text corpus / '0' = Append to existing text corpus during compilation.
Output:
None
Example:
use Word2vec::Interface;
my $interface = Word2vec::Interface->new();
$interface->XTWSetOverwriteExistingFile( 1 );
undef( $interface );
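The difference between the overwrite and append settings can be sketched as follows. This is a minimal illustration, not the module's code: SaveTextCorpus is a hypothetical helper, and returning -1 on failure is an assumption chosen to mirror the error checks in the SYNOPSIS.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Hypothetical helper (not part of Word2vec::Interface): writes $text to $path,
# replacing the file when $overwrite is true and appending to it otherwise.
# Assumption: returns 0 on success and -1 on failure.
sub SaveTextCorpus
{
    my ( $path, $text, $overwrite ) = @_;
    my $mode = $overwrite ? '>' : '>>';

    open( my $fileHandle, $mode, $path ) or return -1;
    print $fileHandle $text;
    close( $fileHandle );
    return 0;
}

# Demonstration on a temporary file that is removed when the program exits
my ( $tempHandle, $tempPath ) = tempfile( UNLINK => 1 );
close( $tempHandle );

SaveTextCorpus( $tempPath, "first article\n",  1 );    # Overwrite: file holds one line
SaveTextCorpus( $tempPath, "second article\n", 0 );    # Append: file holds two lines
SaveTextCorpus( $tempPath, "new corpus\n",     1 );    # Overwrite: earlier lines are discarded
```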
AUTHOR
Clint Cuffy, Virginia Commonwealth University
COPYRIGHT
Copyright (c) 2016
Bridget T McInnes, Virginia Commonwealth University
btmcinnes at vcu dot edu
Clint Cuffy, Virginia Commonwealth University
cuffyca at vcu dot edu
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to:
The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA 02111-1307, USA.