NAME

Word2vec::Xmltow2v - Medline XML-To-W2V Module.

SYNOPSIS

use Word2vec::Xmltow2v;

# Parameters: Debug Output = True, Write Log = False, StoreTitle = True, StoreAbstract = True, Quick Parse = True, CompoundifyText = True, Use Multi-Threading (2 Threads)
my $xmlconv = Word2vec::Xmltow2v->new( 1, 0, 1, 1, 1, 1, 2 );      # Note: Specifying no parameters implies default settings.
$xmlconv->SetWorkingDir( "Medline/XML/Directory/Here" );
$xmlconv->SetSavePath( "textcorpus.txt" );
$xmlconv->SetStoreTitle( 1 );
$xmlconv->SetStoreAbstract( 1 );
$xmlconv->SetBeginDate( "01/01/2004" );
$xmlconv->SetEndDate( "08/13/2016" );
$xmlconv->SetOverwriteExistingFile( 1 );

# If Compound Word File Exists, Store It In Memory And Create Compound Word Binary Search Tree
$xmlconv->ReadCompoundWordDataFromFile( "compoundword.txt", 1 );
$xmlconv->CreateCompoundWordBST();

# Parse XML Files or Directory Of Files
$xmlconv->ConvertMedlineXMLToW2V( "/xmlDirectory/" );
undef( $xmlconv );

DESCRIPTION

Word2vec::Xmltow2v is an XML-to-text module which converts Medline XML article title and abstract data, within a given date range, into a plain text corpus for use with Word2vec::Interface. It can also "compoundify" text during text corpus compilation when given a compound word file.

Main Functions

new

Description:

Returns a new 'Word2vec::Xmltow2v' module object.

Note: Specifying no parameters implies default options.

Default Parameters:
   debugLog                    = 0
   writeLog                    = 0
   storeTitle                  = 1
   storeAbstract               = 1
   quickParse                  = 0
   compoundifyText             = 0
   storeAsSentencePerLine      = 0
   numOfThreads                = Number of CPUs/CPU cores (1 thread per core/CPU)
   workingDir                  = Current Directory
   savePath                    = Current Directory
   beginDate                   = "00/00/0000"
   endDate                     = "99/99/9999"
   xmlStringToParse            = "(null)"
   textCorpusString            = ""
   twigHandler                 = 0
   parsedCount                 = 0
   tempDate                    = ""
   tempStr                     = ""
   outputFileName              = "textcorpus.txt"
   compoundWordAry             = ()
   compoundWordBST             = Word2vec::Bst->new()
   maxCompoundWordLength       = 0
   overwriteExistingFile       = 0

Input:

$debugLog                    -> Instructs module to print debug statements to the console. (1 = True / 0 = False)
$writeLog                    -> Instructs module to print debug statements to a log file. (1 = True / 0 = False)
$storeTitle                  -> Instructs module to store Medline article titles during text corpus compilation. (1 = True / 0 = False)
$storeAbstract               -> Instructs module to store Medline article abstracts during text corpus compilation. (1 = True / 0 = False)
$quickParse                  -> Instructs module to utilize quick XML parsing functions for known Medline article title and abstract tags. (1 = True / 0 = False)
$compoundifyText             -> Instructs module to compoundify text on the fly given a compound word file. This is automatically set
                                when reading the compound word file to memory regardless of user setting. (1 = True / 0 = False)
$storeAsSentencePerLine      -> Instructs module to store parsed Medline data as separate sentences on new lines, split on the period character, instead of as one long single line. (1 = True / 0 = False)
$numOfThreads                -> Specifies the number of worker threads which parse Medline XML files simultaneously to create the text corpus.
                                Text corpus generation speeds up with the number of physical cores present on a given machine. (Positive integer value)
                                e.g. Using four threads on an Intel i7 machine makes text corpus generation roughly four times faster than running single threaded.
$workingDir                  -> Specifies the current working directory. (String)
$savePath                    -> Specifies the save path for text corpus generation. (String)
$beginDate                   -> Specifies the beginning date range for Medline article text corpus composition. (Format: XX/XX/XXXX)
$endDate                     -> Specifies the ending date range for Medline article text corpus composition. (Format: XX/XX/XXXX)
$xmlStringToParse            -> Storage location for the current Medline XML file in memory. (String)
$textCorpusString            -> Temporary storage location for text corpus generation in memory. (String)
$twigHandler                 -> XML::Twig object location.
$parsedCount                 -> Number of parsed Medline articles during text corpus generation.
$tempDate                    -> Temporary storage location for current Medline article date during text corpus compilation.
$tempStr                     -> Temporary storage location for current Medline article title/abstract during text corpus compilation.
$outputFileName              -> Output file path/name.
$compoundWordAry             -> Storage location for compound words, used to compoundify text. (Array) <- Deprecated
$compoundWordBST             -> Storage location for compound words, used to compoundify text. (Binary Search Tree) <- Supersedes '$compoundWordAry'
$maxCompoundWordLength       -> Maximum number of words able to be compoundified in one phrase. e.g. "six_sea_snakes_were_sailing" = 5 compoundified words.
                                The compounding algorithm will attempt to compoundify no more than this set value, even though the compound word list could
                                possibly contain longer compounded phrases.
$overwriteExistingFile       -> Instructs the module to either overwrite any existing text corpus files or append to the existing file.

Note: It is not recommended to specify all new() parameters, as it has not been thoroughly tested. Maximum recommended parameters to be specified include:
      "debugLog, writeLog, storeTitle, storeAbstract, quickParse, compoundifyText, numOfThreads, workingDir, savePath, beginDate, endDate"

Output:

Word2vec::Xmltow2v object.

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();  # Note: Specifying no parameters implies default settings as listed above.

undef( $xmlconv );

# Or

use Word2vec::Xmltow2v;

# Parameters: Debug Output = True, Write Log = False, StoreTitle = True, StoreAbstract = True, Quick Parse = True, CompoundifyText = True, Use Multi-Threading (2 Threads)
my $xmlconv = Word2vec::Xmltow2v->new( 1, 0, 1, 1, 1, 1, 2 );

undef( $xmlconv );

DESTROY

Description:

Removes module objects and variables from memory.

Input:

None

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();

$xmlconv->DESTROY();
undef( $xmlconv );

ConvertMedlineXMLToW2V

Description:

Parses the Medline XML file or directory of files given by the parameter, creating a text corpus. Returns 0 if successful or -1 on error.

Note: Supports plain Medline XML or gzipped XML files.

Input:

$filePath -> XML file path to parse. (This can be a single file or directory of XML/XML.gz files).

Output:

$value    -> '0' = Successful / '-1' = Un-Successful

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();      # Note: Specifying no parameters implies default settings
$xmlconv->SetSavePath( "testCorpus.txt" );
$xmlconv->SetStoreTitle( 1 );
$xmlconv->SetStoreAbstract( 1 );
$xmlconv->SetBeginDate( "01/01/2004" );
$xmlconv->SetEndDate( "08/13/2016" );
$xmlconv->SetOverwriteExistingFile( 1 );
$xmlconv->ConvertMedlineXMLToW2V( "/xmlDirectory/" );
undef( $xmlconv );
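
The note above says both plain and gzipped XML files are accepted. As a rough, standalone sketch of how such input could be read (this is not the module's own code), the hypothetical helper read_xml_file() below uses the core IO::Uncompress::Gunzip module for paths ending in ".gz" and a plain UTF-8 read otherwise. The helper name and the "(null)" failure value are assumptions borrowed from the conventions used elsewhere in this document.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw( gunzip );

# Hypothetical helper, not part of Word2vec::Xmltow2v.
# Returns the file contents, or "(null)" if the file cannot be read.
sub read_xml_file {
    my ( $path ) = @_;
    return "(null)" unless defined $path && -f $path;

    my $data;
    if ( $path =~ /\.gz\z/i ) {
        # Decompress the whole file into memory (raw bytes, not decoded).
        gunzip( $path => \$data ) or return "(null)";
    }
    else {
        open( my $fh, '<:encoding(UTF-8)', $path ) or return "(null)";
        local $/;    # slurp mode: read the entire file at once
        $data = <$fh>;
        close( $fh );
    }
    return defined $data ? $data : "(null)";
}
```

Reading an entire Medline file into memory keeps the sketch short; very large files may call for streaming instead.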

_ThreadedConvert

Description:

Multi-Threaded Medline XML to text corpus conversion function.

Input:

$directory -> File or directory of files to parse.

Output:

$value     -> '0' = Successful / '-1' = Un-successful

Example:

Warning: This is a private function called by 'ConvertMedlineXMLToW2V()'. It should not be called outside of xmltow2v module.

_ParseXMLString

Description:

Parses passed string parameter for Medline XML article title and abstract data and appends found data to the text corpus.

Input:

$string -> Medline XML string data to parse.

Output:

None

Example:

Warning: This is a private function called by "ConvertMedlineXMLToW2V()" and "_ThreadedConvert()". It should not be called outside of xmltow2v module.

_CheckParseRequirements

Description:

Checks passed string parameter to see if it contains relevant data and XML::Twig handler is initialized.

Input:

$string -> String data to check

Output:

$value  -> '0' = Successful / '-1' = Un-successful

Example:

Warning: This is a private function called by "_ParseXMLString()". It should not be called outside of xmltow2v module.

_CheckForNullData

Description:

Checks passed string parameter for "(null)" string.

Input:

$string -> String data to be checked.

Output:

$value  -> '1' = True/Null data or '0' = False/Valid data

Example:

Warning: This is a private function called by "new()" and "_ParseXMLString()". It should not be called outside of xmltow2v module.

_RemoveXMLVersion

Description:

Removes the XML Version string prior to parsing the XML string data. (Deprecated)

Input:

$string -> Medline XML string data

Output:

None

Example:

Warning: This is a private function called by "new()" and "_ParseXMLString()". It should not be called outside of xmltow2v module.

_ParseMedlineCitationSet

Description:

Parses 'MedlineCitationSet' tag data in Medline XML file.

Input:

$twigHandler -> XML::Twig handler
$root        -> Beginning of XML directory to parse. ( Directory in Medline XML string data )

Output:

None

Example:

Warning: This is a private function and is called by xmltow2v's XML::Twig handler. It should not be called outside of xmltow2v module.

_ParseMedlineArticle

Description:

Parses 'MedlineArticle' tag data in Medline XML file.

Input:

$medlineArticle -> Current Medline article directory in XML data (XML::Twig directory)

Output:

$value          -> '1' = Finished parsing Medline article.

Example:

Warning: This is a private function and is called by xmltow2v's XML::Twig handler. It should not be called outside of xmltow2v module.

_ParseDateCreated

Description:

Parses 'DateCreated' tag data in Medline XML file.

Input:

$article -> Current Medline article in XML data (XML::Twig directory)

Output:

$date    -> 'XX/XX/XXXX' (Month/Day/Year)

Example:

Warning: This is a private function and is called by xmltow2v's XML::Twig handler. It should not be called outside of xmltow2v module.

_ParseArticle

Description:

Parses 'Article' tag data in Medline XML file. Fetches 'ArticleTitle', 'Journal' and 'Abstract' XML tags.

Input:

$article -> Current Medline article in XML data (XML::Twig directory)

Output:

None

Example:

Warning: This is a private function and is called by xmltow2v's XML::Twig handler. It should not be called outside of xmltow2v module.

_ParseJournal

Description:

Parses 'Journal' tag data in Medline XML file. Fetches 'Title' XML tag.

Input:

$journalRoot -> Current Medline journal directory in XML data (XML::Twig directory)

Output:

None

Example:

Warning: This is a private function and is called by xmltow2v's XML::Twig handler. It should not be called outside of xmltow2v module.

_ParseOtherAbstract

Description:

Parses 'Abstract' tag data in Medline XML file. Fetches 'AbstractText' XML tag.

Input:

$abstractRoot -> Current Medline abstract directory in XML data (XML::Twig directory)

Output:

None

Example:

Warning: This is a private function and is called by xmltow2v's XML::Twig handler. It should not be called outside of xmltow2v module.

_QuickParseDateCreated

Description:

Parses 'DateCreated' tag data in Medline XML file. Used when 'QuickParse' member variable is enabled. Sets $tempDate member variable to parsed 'DateCreated' tag data.

Input:

$twigHandler -> 'XML::Twig' handler
$article     -> Current Medline article directory in XML data (XML::Twig directory)

Output:

None

Example:

Warning: This is a private function and is called by xmltow2v's XML::Twig handler. It should not be called outside of xmltow2v module.

_QuickParseJournal

Description:

Parses 'Journal' tag data in Medline XML file. Fetches 'Title' XML tag. Used when 'QuickParse' member variable is enabled.
Sets $tempStr to parsed data and stores in text corpus.

Input:

$twigHandler -> 'XML::Twig' handler.
$journalRoot -> Current Medline journal directory in XML data (XML::Twig directory)

Output:

None

Example:

Warning: This is a private function and is called by xmltow2v's XML::Twig handler. It should not be called outside of xmltow2v module.

_QuickParseArticle

Description:

Parses 'Article' tag data in Medline XML file. Fetches 'ArticleTitle' and 'Abstract' XML tags. Used when 'QuickParse' member variable is enabled.
Sets $tempStr to parsed data and stores in text corpus.

Input:

$twigHandler -> 'XML::Twig' handler.
$article     -> Current Medline article directory in XML data (XML::Twig directory)

Output:

None

Example:

Warning: This is a private function and is called by xmltow2v's XML::Twig handler. It should not be called outside of xmltow2v module.

_QuickParseOtherAbstract

Description:

Parses 'Abstract' tag data in Medline XML file. Fetches 'AbstractText' XML tag. Used when 'QuickParse' member variable is enabled.
Sets $tempStr to parsed data and stores in text corpus.

Input:

$twigHandler -> 'XML::Twig' handler.
$abstractRoot -> Current Medline abstract directory in XML data (XML::Twig directory)

Output:

None

Example:

Warning: This is a private function and is called by xmltow2v's XML::Twig handler. It should not be called outside of xmltow2v module.

CreateCompoundWordBST

Description:

Creates a binary search tree using compound word data in memory and stores root node. This also clears the compound word array afterwards.

Warning: Compound word file must be loaded into memory using ReadCompoundWordDataFromFile() prior to calling this method. This function
         will also delete the compound word array upon completion as it will no longer be necessary.

Input:

None

Output:

$value -> '0' = Successful / '-1' = Un-successful

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->ReadCompoundWordDataFromFile( "samples/compoundword.txt" );
$xmlconv->CreateCompoundWordBST();

CompoundifyString

Description:

Compoundifies string parameter based on compound word data in memory using the compound word binary search tree.

Warning: Compound word file must be loaded into memory using ReadCompoundWordDataFromFile() prior to calling this method.

Input:

$string -> String to compoundify

Output:

$string -> Compounded string or "(null)" if string parameter is not defined.

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->ReadCompoundWordDataFromFile( "samples/compoundword.txt" );
$xmlconv->CreateCompoundWordBST();
my $compoundedString = $xmlconv->CompoundifyString( "String to compoundify" );
print( "Compounded String: $compoundedString\n" );

undef( $xmlconv );
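
To make the idea of compoundifying concrete, here is a minimal standalone sketch of a greedy longest-match pass. It is not the module's implementation: a plain hash stands in for the Word2vec::Bst search tree, the phrases are assumed to be stored as space-separated words, and the function name compoundify() and its $max_len argument (playing the role of maxCompoundWordLength) are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the compound word BST: a hash keyed by
# space-separated phrases. Matched phrases are joined with underscores.
sub compoundify {
    my ( $text, $phrases, $max_len ) = @_;
    return "(null)" unless defined $text;

    my @words = split( ' ', $text );
    my @out;
    my $i = 0;
    while ( $i < @words ) {
        my $matched = 0;

        # Never look further ahead than the remaining words allow.
        my $limit = $max_len;
        $limit = @words - $i if $limit > @words - $i;

        # Try the longest candidate phrase first, down to two words.
        for ( my $len = $limit; $len >= 2; $len-- ) {
            my @slice = @words[ $i .. $i + $len - 1 ];
            if ( exists $phrases->{ join( ' ', @slice ) } ) {
                push( @out, join( '_', @slice ) );
                $i += $len;
                $matched = 1;
                last;
            }
        }
        next if $matched;

        # No phrase starts here: keep the single word unchanged.
        push( @out, $words[ $i ] );
        $i++;
    }
    return join( ' ', @out );
}

my %phrases = ( "heart attack" => 1, "acute myocardial infarction" => 1 );
print compoundify( "a heart attack is an acute myocardial infarction", \%phrases, 3 ), "\n";
# prints: a heart_attack is an acute_myocardial_infarction
```

With $max_len set to 2, the three-word phrase above would be left untouched, which mirrors the cap described for maxCompoundWordLength.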

_CompoundifySearch

Description:

Recursive method used by CompoundifyString() to fetch compound word data in binary search tree.

Warning: This function requires specific parameters and should not be called outside of CompoundifyString() method.

Input:

$stringArrayRef -> Array reference containing string data
$oldNode        -> Last 'Word2vec::Node' data match was found
$searchStr      -> Search phrase
$index          -> Current string array index

Output:

Word2vec::Node  -> Last node containing positive search phrase match

Example:

Warning: This is a private function and is called by 'CompoundifyString()'. It should not be called outside of xmltow2v module.

ReadCompoundWordDataFromFile

Description:

Reads compound word file and stores in memory. $autoSetMaxCompWordLength parameter is not required to be set. This
parameter instructs the method to auto set the maximum compound word length dependent on the longest compound word found.

Note: $autoSetMaxCompWordLength options: defined = True and Undefined = False.

Input:

$filePath                 -> Compound word file path
$autoSetMaxCompWordLength -> If defined, instructs the method to set the maximum compound word length to that of the longest compound phrase found in the file. (Optional)

Note: Calling this method with $autoSetMaxCompWordLength defined will automatically set the maxCompoundWordLength variable to the longest compound phrase.

Output:

$value                    -> '0' = Successful / '-1' = Un-successful

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->ReadCompoundWordDataFromFile( "samples/compoundword.txt", 1 );

undef( $xmlconv );
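
The automatic setting described above amounts to finding the largest word count among the loaded phrases. The following standalone sketch shows that calculation under the assumption that words within a phrase are separated by whitespace or underscores; longest_phrase_length() is a hypothetical name, not a method of this module.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the word count of the longest phrase,
# or 0 when the list is empty.
sub longest_phrase_length {
    my ( @phrases ) = @_;
    my $max = 0;
    for my $phrase ( @phrases ) {
        # Split on whitespace or underscores and drop empty fields.
        my @words = grep { length } split( /[\s_]+/, $phrase );
        $max = @words if @words > $max;
    }
    return $max;
}

print longest_phrase_length( "heart attack", "six_sea_snakes_were_sailing" ), "\n";
# prints: 5
```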

SaveCompoundWordListToFile

Description:

Saves compound word data in memory to a specified file location.

Input:

$savePath -> Path to save compound word list to file.

Output:

$value    -> '0' = Successful / '-1' = Un-successful

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->ReadCompoundWordDataFromFile( "samples/compoundword.txt" );
$xmlconv->SaveCompoundWordListToFile( "samples/newcompoundword.txt" );
undef( $xmlconv );

ReadTextFromFile

Description:

Reads a plain text file with utf8 encoding into memory. Returns string data if successful and "(null)" if unsuccessful.

Input:

$filePath -> Text file to read into memory

Output:

$string   -> String data if successful or "(null)" if un-successful.

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $textData = $xmlconv->ReadTextFromFile( "samples/textcorpus.txt" );
print( "Text Data: $textData\n" );
undef( $xmlconv );

SaveTextToFile

Description:

Saves a plain text file with utf8 encoding in a specified location.

Input:

$savePath -> Path to save string data.
$string   -> String to save

Output:

$value    -> '0' = Successful / '-1' = Un-successful

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $result = $xmlconv->SaveTextToFile( "text.txt", "Hello world!" );

print( "File saved\n" ) if $result == 0;
print( "File unable to save\n" ) if $result == -1;

undef( $xmlconv );
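
For readers who want to see what UTF-8 file reading and writing with these return conventions might look like, here is a small standalone sketch using only core Perl. The functions read_text() and save_text() are hypothetical and are not the module's code; they only mirror the documented return values ("(null)" on a failed read, 0 or -1 for a save).

```perl
use strict;
use warnings;

# Hypothetical helper: write $string to $path as UTF-8.
# Returns 0 on success and -1 on failure.
sub save_text {
    my ( $path, $string ) = @_;
    return -1 unless defined $path && defined $string;
    open( my $fh, '>:encoding(UTF-8)', $path ) or return -1;
    print {$fh} $string;
    close( $fh ) or return -1;
    return 0;
}

# Hypothetical helper: read a UTF-8 text file into a string.
# Returns "(null)" when the file cannot be opened.
sub read_text {
    my ( $path ) = @_;
    return "(null)" unless defined $path;
    open( my $fh, '<:encoding(UTF-8)', $path ) or return "(null)";
    local $/;    # slurp mode: read the entire file at once
    my $data = <$fh>;
    close( $fh );
    return defined $data ? $data : "(null)";
}
```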

_ReadXMLDataFromFile

Description:

Reads an XML file from a specified location. Returns string in memory if successful and "(null)" if unsuccessful.

Input:

$filePath -> File to read given path

Output:

$value    -> '0' = Successful / '-1' = Un-successful

Example:

Warning: This is a private function and is called by XML::Twig parsing functions. It should not be called outside of xmltow2v module.

_SaveTextCorpusToFile

Description:

Saves text corpus data to specified file path. This method will append to any existing file if $appendToFile parameter
is defined or "overwrite" option is disabled. Enabling "overwrite" option will overwrite any existing files.

Input:

$savePath     -> Path to save the text corpus
$appendToFile -> Specifies whether the module will overwrite any existing data or append to existing text corpus data.

Note: Leaving this variable undefined will fetch the "Overwrite" member variable and set the value to this parameter.

Output:

$value        -> '0' = Successful / '-1' = Un-successful

Example:

Warning: This is a private function and is called by XML::Twig parsing functions. It should not be called outside of xmltow2v module.
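
The append-or-overwrite rule described above can be sketched in a few lines. This is a standalone illustration rather than the module's code: save_corpus() is a hypothetical function, and its $overwrite argument stands in for the "overwrite" member variable.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's implementation.
#   $append_to_file : append when this parameter is defined
#   $overwrite      : stands in for the "overwrite" member variable
# Returns 0 on success and -1 on failure.
sub save_corpus {
    my ( $path, $text, $append_to_file, $overwrite ) = @_;
    return -1 unless defined $path && defined $text;

    # Append if the parameter is defined or the overwrite option is disabled.
    my $append = ( defined $append_to_file || !$overwrite ) ? 1 : 0;
    my $mode   = ( $append ? '>>' : '>' ) . ':encoding(UTF-8)';

    open( my $fh, $mode, $path ) or return -1;
    print {$fh} $text;
    close( $fh ) or return -1;
    return 0;
}
```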

IsDateInSpecifiedRange

Description:

Checks to see if $date is within $beginDate and $endDate range. Returns 1 if true and 0 if false.

Note: Date Format: XX/XX/XXXX (Month/Day/Year)

Input:

$date      -> Date to check against minimum and maximum date range. (String)
$beginDate -> Minimum date range (String)
$endDate   -> Maximum date range (String)

Output:

$value     -> '1' = True/Date is within specified range Or '0' = False/Date is not within specified range.

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
print( "Is \"01/01/2004\" within the date range: \"02/21/1985\" to \"08/13/2016\"?\n" );
print( "Yes\n" ) if $xmlconv->IsDateInSpecifiedRange( "01/01/2004", "02/21/1985", "08/13/2016" ) == 1;
print( "No\n" ) if $xmlconv->IsDateInSpecifiedRange( "01/01/2004", "02/21/1985", "08/13/2016" ) == 0;

undef( $xmlconv );
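
One simple way to compare dates in the Month/Day/Year format is to rearrange each date into a fixed-width Year-Month-Day key and compare the keys as strings. The standalone sketch below takes that approach. It is not the module's implementation; the names date_key() and is_date_in_range() are hypothetical, and treating both ends of the range as inclusive is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "MM/DD/YYYY" into the sortable key "YYYYMMDD".
sub date_key {
    my ( $date ) = @_;
    my ( $month, $day, $year ) = split( m{/}, $date );
    return sprintf( "%04d%02d%02d", $year, $month, $day );
}

# Hypothetical helper: returns 1 if $date lies within the range, else 0.
# Both bounds are treated as inclusive (an assumption).
sub is_date_in_range {
    my ( $date, $begin, $end ) = @_;
    my $key = date_key( $date );
    return ( $key ge date_key( $begin ) && $key le date_key( $end ) ) ? 1 : 0;
}

print is_date_in_range( "01/01/2004", "02/21/1985", "08/13/2016" ), "\n";
# prints: 1
```

The default bounds "00/00/0000" and "99/99/9999" also work with this scheme, since they map to the smallest and largest possible keys.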

IsFileOrDirectory

Description:

Checks to see if specified path is a file or directory.

Input:

$path   -> File or directory path. (String)

Output:

$string -> Returns: "file" = file, "dir" = directory and "unknown" if the path is not a file or directory (undefined).

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $path = "path/to/a/directory";

print( "Is \"$path\" a file or directory? " . $xmlconv->IsFileOrDirectory( $path ) . "\n" );

$path = "path/to/a/file.file";

print( "Is \"$path\" a file or directory? " . $xmlconv->IsFileOrDirectory( $path ) . "\n" );

undef( $xmlconv );
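
Perl's built-in file test operators are enough to produce the three return values listed above. The standalone sketch below shows one way to do so; file_or_directory() is a hypothetical name and the code is not taken from the module.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a path as "file", "dir" or "unknown".
sub file_or_directory {
    my ( $path ) = @_;
    return "unknown" unless defined $path;
    return "file" if -f $path;    # plain file
    return "dir"  if -d $path;    # directory
    return "unknown";             # missing path or some other type
}
```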

RemoveSpecialCharactersFromString

Description:

Normalizes text input based on settings below.
  - All Text Converted To Lowercase
  - Duplicate White Spaces Removed
  - "'s" (Apostrophe 's') Characters Removed
  - Hyphen "-" Replaced With Whitespace
  - All Characters Outside Of "a-z" and NewLine Characters Are Removed
  - Lastly, Whitespace Before And After Text Is Removed

Note: This method is called when parsing and compiling Medline title/abstract data.

Input:

$string -> String passed to remove special characters from and convert to lowercase.

Output:

$string -> String with all special characters removed and converted to lowercase.

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();

my $str = "Heart Attack is$ an!@ also KNOWN as an Acute MYOCARDIAL inFARCTion!";

print( "Original String: $str\n" );

$str = $xmlconv->RemoveSpecialCharactersFromString( $str );

print( "Modified String: $str\n" );

undef( $xmlconv );
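
The normalization rules listed above can be approximated with a handful of regular expressions. The standalone sketch below is an illustration only, not the module's code: normalize_text() is a hypothetical name, spaces are assumed to be kept along with "a-z" and newlines, and the order of the steps is chosen so that hyphen replacement cannot leave double spaces behind.

```perl
use strict;
use warnings;

# Hypothetical helper approximating the normalization steps listed above.
sub normalize_text {
    my ( $str ) = @_;
    $str = lc( $str );             # convert to lowercase
    $str =~ s/'s\b//g;             # drop apostrophe-s
    $str =~ s/-/ /g;               # hyphen becomes a space
    $str =~ s/[^a-z \n]//g;        # keep only a-z, spaces and newlines
    $str =~ s/ {2,}/ /g;           # collapse repeated spaces
    $str =~ s/^\s+|\s+$//g;        # trim both ends
    return $str;
}

print normalize_text( 'Heart Attack is$ an!@ also KNOWN as an Acute MYOCARDIAL inFARCTion!' ), "\n";
# prints: heart attack is an also known as an acute myocardial infarction
```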

GetFileType

Description:

Returns file data type (string).

Input:

$filePath -> File to check located at file path

Output:

$string   -> File type

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $fileType = $xmlconv->GetFileType( "samples/textcorpus.txt" );

undef( $xmlconv );

_DateCheck

Description:

Checks specified begin and end date strings for formatting and logic errors.

Input:

None

Output:

$value   -> "0" = Passed Checks / "-1" = Failed Checks

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
print "Passed Date Checks\n" if ( $xmlconv->_DateCheck() == 0 );
print "Failed Date Checks\n" if ( $xmlconv->_DateCheck() == -1 );

undef( $xmlconv );
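
As an illustration of what a format and logic check on the two dates could involve, the standalone sketch below verifies the XX/XX/XXXX pattern and confirms that the begin date does not fall after the end date. The function date_check() is hypothetical and takes the dates as arguments, unlike the method above. It deliberately skips month and day range checks so that the default values "00/00/0000" and "99/99/9999" still pass.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 0 if both dates pass the checks, -1 otherwise.
sub date_check {
    my ( $begin, $end ) = @_;

    # Formatting check: two digits, two digits, four digits.
    for my $date ( $begin, $end ) {
        return -1 unless defined $date && $date =~ m{\A\d{2}/\d{2}/\d{4}\z};
    }

    # Logic check: compare as "YYYYMMDD" keys so string order is date order.
    my @keys;
    for my $date ( $begin, $end ) {
        my ( $month, $day, $year ) = split( m{/}, $date );
        push( @keys, $year . $month . $day );
    }
    return -1 if $keys[0] gt $keys[1];
    return 0;
}
```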

Accessor Functions

GetDebugLog

Description:

Returns the _debugLog member variable set during Word2vec::Xmltow2v object initialization of new function.

Input:

None

Output:

$value -> '0' = False, '1' = True

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $debugLog = $xmlconv->GetDebugLog();

print( "Debug Logging Enabled\n" ) if $debugLog == 1;
print( "Debug Logging Disabled\n" ) if $debugLog == 0;


undef( $xmlconv );

GetWriteLog

Description:

Returns the _writeLog member variable set during Word2vec::Xmltow2v object initialization of new function.

Input:

None

Output:

$value -> '0' = False, '1' = True

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $writeLog = $xmlconv->GetWriteLog();

print( "Write Logging Enabled\n" ) if $writeLog == 1;
print( "Write Logging Disabled\n" ) if $writeLog == 0;

undef( $xmlconv );

GetStoreTitle

Description:

Returns the _storeTitle member variable set during Word2vec::Xmltow2v object instantiation of new function.

Input:

None

Output:

$value -> '1' = True / '0' = False

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $storeTitle = $xmlconv->GetStoreTitle();

print( "Store Title Option: Enabled\n" ) if $storeTitle == 1;
print( "Store Title Option: Disabled\n" ) if $storeTitle == 0;

undef( $xmlconv );

GetStoreAbstract

Description:

Returns the _storeAbstract member variable set during Word2vec::Xmltow2v object instantiation of new function.

Input:

None

Output:

$value -> '1' = True / '0' = False

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $storeAbstract = $xmlconv->GetStoreAbstract();

print( "Store Abstract Option: Enabled\n" ) if $storeAbstract == 1;
print( "Store Abstract Option: Disabled\n" ) if $storeAbstract == 0;

undef( $xmlconv );

GetQuickParse

Description:

Returns the _quickParse member variable set during Word2vec::Xmltow2v object instantiation of new function.

Input:

None

Output:

$value -> '1' = True / '0' = False

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $quickParse = $xmlconv->GetQuickParse();

print( "Quick Parse Option: Enabled\n" ) if $quickParse == 1;
print( "Quick Parse Option: Disabled\n" ) if $quickParse == 0;

undef( $xmlconv );

GetCompoundifyText

Description:

Returns the _compoundifyText member variable set during Word2vec::Xmltow2v object instantiation of new function.

Input:

None

Output:

$value -> '1' = True / '0' = False

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $compoundify = $xmlconv->GetCompoundifyText();

print( "Compoundify Text Option: Enabled\n" )  if $compoundify == 1;
print( "Compoundify Text Option: Disabled\n" ) if $compoundify == 0;

undef( $xmlconv );

GetStoreAsSentencePerLine

Description:

Returns the _storeAsSentencePerLine member variable set during Word2vec::Xmltow2v object instantiation of new function.

Input:

None

Output:

$value -> '1' = True / '0' = False

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $storeAsSentencePerLine = $xmlconv->GetStoreAsSentencePerLine();

print( "Store As Sentence Per Line: Enabled\n" )  if $storeAsSentencePerLine == 1;
print( "Store As Sentence Per Line: Disabled\n" ) if $storeAsSentencePerLine == 0;

undef( $xmlconv );
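
The option reported by this accessor controls whether text is stored one sentence per line, split on the period character. A minimal standalone sketch of such a split follows. It is not the module's implementation, and sentences_per_line() is a hypothetical name; note that a plain split on periods will also break on abbreviations and decimal numbers.

```perl
use strict;
use warnings;

# Hypothetical helper: put each period-delimited sentence on its own line.
sub sentences_per_line {
    my ( $text ) = @_;
    my @sentences;
    for my $piece ( split( /\./, $text ) ) {
        $piece =~ s/^\s+|\s+$//g;                    # trim surrounding whitespace
        push( @sentences, $piece ) if length $piece; # skip empty pieces
    }
    return join( "\n", @sentences );
}

print sentences_per_line( "Aspirin reduces fever. It also thins blood." ), "\n";
# prints:
# Aspirin reduces fever
# It also thins blood
```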

GetNumOfThreads

Description:

Returns the _numOfThreads member variable set during Word2vec::Xmltow2v object instantiation of new function.

Input:

None

Output:

$value -> Number of threads

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $numOfThreads = $xmlconv->GetNumOfThreads();

print( "Number of threads: $numOfThreads\n" );

undef( $xmlconv );

GetWorkingDir

Description:

Returns the _workingDir member variable set during Word2vec::Xmltow2v object instantiation of new function.

Input:

None

Output:

$string -> Working directory string

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $workingDirectory = $xmlconv->GetWorkingDir();

print( "Working Directory: $workingDirectory\n" );

undef( $xmlconv );

GetSavePath

Description:

Returns the _saveDir member variable set during Word2vec::Xmltow2v object instantiation of new function.

Input:

None

Output:

$string -> Save directory string

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $savePath = $xmlconv->GetSavePath();

print( "Save Directory: $savePath\n" );

undef( $xmlconv );

GetBeginDate

Description:

Returns the _beginDate member variable set during Word2vec::Xmltow2v object instantiation of new function.

Input:

None

Output:

$date -> Beginning date range - Format: XX/XX/XXXX (Mon/Day/Year)

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $date = $xmlconv->GetBeginDate();

print( "Date: $date\n" );

undef( $xmlconv );

GetEndDate

Description:

Returns the _endDate member variable set during Word2vec::Xmltow2v object instantiation of new function.

Input:

None

Output:

$date -> End date range - Format: XX/XX/XXXX (Mon/Day/Year).

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $date = $xmlconv->GetEndDate();

print( "Date: $date\n" );

undef( $xmlconv );

GetXMLStringToParse

Returns the XML data (string) to be parsed.

Description:

Returns the _xmlStringToParse member variable set during Word2vec::Xmltow2v object instantiation of new function.

Input:

None

Output:

$string -> Medline XML data string

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $xmlStr = $xmlconv->GetXMLStringToParse();

print( "XML String: $xmlStr\n" );

undef( $xmlconv );

GetTextCorpusStr

Description:

Returns the _textCorpusStr member variable set during Word2vec::Xmltow2v object instantiation of new function.

Input:

None

Output:

$string -> Text corpus string

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $str = $xmlconv->GetTextCorpusStr();

print( "Text Corpus: $str\n" );

undef( $xmlconv );

GetFileHandle

Description:

Returns the _fileHandle member variable set during Word2vec::Xmltow2v object instantiation of new function.

Warning: This is a private function. File handle is used by WriteLog() method. Do not manipulate this file handle as errors can result.

Input:

None

Output:

$fileHandle -> Returns file handle for WriteLog() method.

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $fileHandle = $xmlconv->GetFileHandle();

undef( $xmlconv );

GetTwigHandler

Returns XML::Twig handler.

Description:

Returns the _twigHandler member variable set during Word2vec::Xmltow2v object instantiation of new function.

Warning: This is a private function and should not be called or manipulated.

Input:

None

Output:

$twigHandler -> XML::Twig handler.

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $xmlHandler = $xmlconv->GetTwigHandler();

undef( $xmlconv );

GetParsedCount

Description:

Returns the _parsedCount member variable set during Word2vec::Xmltow2v object instantiation of new function.

Input:

None

Output:

$value -> Number of parsed Medline articles.

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $numOfParsed = $xmlconv->GetParsedCount();

print( "Number of parsed Medline articles: $numOfParsed\n" );

undef( $xmlconv );

GetTempStr

Description:

Returns the _tempStr member variable set during Word2vec::Xmltow2v object instantiation of new function.

Warning: This is a private function and should not be called or manipulated. Used by module as a temporary storage
         location for parsed Medline 'Title' and 'Abstract' tag string data.

Input:

None

Output:

$string -> Temporary string storage location.

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $tempStr = $xmlconv->GetTempStr();

print( "Temp String: $tempStr\n" );

undef( $xmlconv );

GetTempDate

Description:

Returns the _tempDate member variable set during Word2vec::Xmltow2v object instantiation of new function.
Used by module as a temporary storage location for parsed Medline 'DateCreated' tag string data.

Input:

None

Output:

$date -> Date string - Format: XX/XX/XXXX (Mon/Day/Year).

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $date = $xmlconv->GetTempDate();

print( "Temp Date: $date\n" );

undef( $xmlconv );

GetCompoundWordAry

Description:

Returns the _compoundWordAry member array reference set during Word2vec::Xmltow2v object instantiation of new function.

Warning: Compound word data must be loaded in memory first via ReadCompoundWordDataFromFile().

Input:

None

Output:

$arrayReference -> Compound word array reference.

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $arrayReference = $xmlconv->GetCompoundWordAry();
my @compoundWord = @{ $arrayReference };

print( "Compound Word Array: @compoundWord\n" );

undef( $xmlconv );

GetCompoundWordBST

Description:

Returns the _compoundWordBST member variable set during Word2vec::Xmltow2v object instantiation of new function.

Input:

None

Output:

$bst -> Compound word binary search tree.

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $bst = $xmlconv->GetCompoundWordBST();

undef( $xmlconv );

GetMaxCompoundWordLength

Description:

Returns the _maxCompoundWordLength member variable, which is initialized when the Word2vec::Xmltow2v object is created by the new function.

Note: If not defined, it is automatically set to 20 and that value is returned.

Input:

None

Output:

$value -> Maximum number of compound words in a given phrase.

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $compoundWordLength = $xmlconv->GetMaxCompoundWordLength();

print( "Maximum Compound Word Length: $compoundWordLength\n" );

undef( $xmlconv );

GetOverwriteExistingFile

Description:

Returns the _overwriteExisitingFile member variable, which is initialized when the Word2vec::Xmltow2v object is created by the new function.
Enables overwriting of existing text corpus if set to '1' or appends to the existing text corpus if set to '0'.

Input:

None

Output:

$value -> '1' = Overwrite existing file / '0' = Append to existing file.

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $overwriteExistingFile = $xmlconv->GetOverwriteExistingFile();

print( "Overwrite Existing File? YES\n" ) if ( $overwriteExistingFile == 1 );
print( "Overwrite Existing File? NO\n" ) if ( $overwriteExistingFile == 0 );

undef( $xmlconv );

Mutator Functions

SetStoreTitle

Description:

Sets member variable to passed integer parameter. Instructs module to store article title if true or omit if false.

Input:

$value -> '1' = Store Titles / '0' = Omit Titles

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetStoreTitle( 1 );

undef( $xmlconv );

SetStoreAbstract

Description:

Sets member variable to passed integer parameter. Instructs module to store article abstracts if true or omit if false.

Input:

$value -> '1' = Store Abstracts / '0' = Omit Abstracts

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetStoreAbstract( 1 );

undef( $xmlconv );

SetWorkingDir

Description:

Sets member variable to passed string parameter. Represents the working directory.

Input:

$string -> Working directory string

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetWorkingDir( "/samples/" );

undef( $xmlconv );

SetSavePath

Description:

Sets member variable to passed string parameter. Represents the text corpus save path.

Input:

$string -> Text corpus save path

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetSavePath( "samples/textcorpus.txt" );

undef( $xmlconv );

SetQuickParse

Description:

Sets member variable to passed integer parameter. Instructs module to utilize quick parse
routines to speed up text corpus compilation. This method is somewhat less accurate due to its non-exhaustive nature.

Input:

$value -> '1' = Enable Quick Parse / '0' = Disable Quick Parse

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetQuickParse( 1 );

undef( $xmlconv );

SetCompoundifyText

Description:

Sets member variable to passed integer parameter. Instructs module to utilize 'compoundify' option if true.

Warning: This requires compound word data to be loaded into memory with ReadCompoundWordDataFromFile() method prior
         to executing text corpus compilation.

Input:

$value -> '1' = Compoundify text / '0' = Do not compoundify text

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetCompoundifyText( 1 );

undef( $xmlconv );

SetStoreAsSentencePerLine

Description:

Sets member variable to passed integer parameter. Instructs module to utilize 'storeAsSentencePerLine' option if true.

Input:

$value -> '1' = Store as sentence per line / '0' = Do not store as sentence per line

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetStoreAsSentencePerLine( 1 );

undef( $xmlconv );

SetNumOfThreads

Description:

Sets member variable to passed integer parameter. Sets the requested number of threads to parse Medline XML files
and compile the text corpus.

Input:

$value -> Integer (Positive value)

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetNumOfThreads( 4 );

undef( $xmlconv );

SetBeginDate

Description:

Sets member variable to passed string parameter. Sets beginning date range for earliest articles to store, by
'DateCreated' Medline tag, within the text corpus during compilation.

Note: Expected format - "XX/XX/XXXX" (Mon/Day/Year)

Input:

$string -> Date string - Format: "XX/XX/XXXX"

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetBeginDate( "01/01/2004" );

undef( $xmlconv );

SetEndDate

Description:

Sets member variable to passed string parameter. Sets ending date range for latest article to store, by
'DateCreated' Medline tag, within the text corpus during compilation.

Note: Expected format - "XX/XX/XXXX" (Mon/Day/Year)

Input:

$string -> Date string - Format: "XX/XX/XXXX"

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetEndDate( "08/13/2016" );

undef( $xmlconv );
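
The begin and end dates set by SetBeginDate and SetEndDate share the "XX/XX/XXXX" (Mon/Day/Year) format, which does not sort correctly as a plain string. The following is a minimal sketch of one way such a range check could work, not the module's own code: DateToSortKey and IsDateInRangeSketch are hypothetical helpers, and treating both bounds as inclusive is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Word2vec::Xmltow2v): converts a
# "MM/DD/YYYY" date string into a sortable "YYYYMMDD" string.
# Returns undef if the string does not match the expected format.
sub DateToSortKey
{
    my ( $date ) = @_;
    my ( $month, $day, $year ) = $date =~ m{^(\d{2})/(\d{2})/(\d{4})$}
        or return undef;
    return "$year$month$day";
}

# Hypothetical helper: returns 1 if $date lies within the begin/end range,
# otherwise 0. Both bounds are assumed to be inclusive.
sub IsDateInRangeSketch
{
    my ( $date, $beginDate, $endDate ) = @_;
    my @keys = map { DateToSortKey( $_ ) } ( $date, $beginDate, $endDate );

    # Reject the comparison if any of the three dates is malformed.
    return 0 if grep { !defined( $_ ) } @keys;

    return ( $keys[0] ge $keys[1] && $keys[0] le $keys[2] ) ? 1 : 0;
}

print( IsDateInRangeSketch( "06/15/2010", "01/01/2004", "08/13/2016" ), "\n" );
print( IsDateInRangeSketch( "12/31/2003", "01/01/2004", "08/13/2016" ), "\n" );
```

Under this sketch, the default bounds "00/00/0000" and "99/99/9999" accept every well-formed date, because the comparison is made on digit strings and not on validated calendar dates.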

SetXMLStringToParse

Description:

Sets member variable to passed string parameter. This string normally consists of Medline XML data to be
parsed for text corpus compilation.

Warning: This is a private function and should not be called or manipulated.

Input:

$string -> String

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetXMLStringToParse( "Hello World!" );

undef( $xmlconv );

SetTextCorpusStr

Description:

Sets member variable to passed string parameter. Overwrites any text corpus data stored in memory with the string parameter.

Warning: This is a private function and should not be called or manipulated.

Input:

$string -> String

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetTextCorpusStr( "Hello World!" );

undef( $xmlconv );

AppendStrToTextCorpus

Description:

Appends the passed string parameter to the text corpus string in memory.

Warning: This is a private function and should not be called or manipulated.

Input:

$string -> String

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->AppendStrToTextCorpus( "Hello World!" );

undef( $xmlconv );

ClearTextCorpus

Description:

Clears text corpus data in memory.

Warning: This is a private function and should not be called or manipulated.

Input:

None

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->ClearTextCorpus();

undef( $xmlconv );

SetTempStr

Description:

Sets the temporary member string to the passed string parameter.
(Temporary placeholder for Medline Title and Abstract data).

Note: This removes special characters and converts all characters to lowercase.

Warning: This is a private function and should not be called or manipulated.

Input:

$string -> String

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetTempStr( "Hello World!" );

undef( $xmlconv );

AppendToTempStr

Description:

Appends string parameter to temporary member string in memory.

Note: This removes special characters and converts all characters to lowercase.

Warning: This is a private function and should not be called or manipulated.

Input:

$string -> String

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->AppendToTempStr( "Hello World!" );

undef( $xmlconv );

ClearTempStr

Description:

Clears the temporary string storage in memory.

Warning: This is a private function and should not be called or manipulated.

Input:

None

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->ClearTempStr();

undef( $xmlconv );

SetTempDate

Description:

Sets the temporary date member string to the passed string parameter.

Note: Date Format - "XX/XX/XXXX" (Mon/Day/Year)

Warning: This is a private function and should not be called or manipulated.

Input:

$string -> Date string - Format: "XX/XX/XXXX"

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetTempDate( "08/13/2016" );

undef( $xmlconv );

ClearTempDate

Description:

Clears the temporary date storage location in memory.

Warning: This is a private function and should not be called or manipulated.

Input:

None

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->ClearTempDate();

undef( $xmlconv );

SetCompoundWordAry

Description:

Stores the compound word array by de-referencing the passed array reference parameter.

Note: Clears previous data if existing.

Warning: This is a private function and should not be called or manipulated.

Input:

$arrayReference -> Array reference of compound words

Output:

None

Example:

use Word2vec::Xmltow2v;

my @compoundWordAry = ( "big dog", "respiratory failure", "seven large masses" );

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetCompoundWordAry( \@compoundWordAry );

undef( $xmlconv );

ClearCompoundWordAry

Description:

Clears compound word array in memory.

Warning: This is a private function and should not be called or manipulated.

Input:

None

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->ClearCompoundWordAry();

undef( $xmlconv );

SetCompoundWordBST

Description:

Sets the compound word binary search tree member variable to the passed Word2vec::Bst parameter.

Note: Un-defines previous binary tree if existing.

Warning: This is a private function and should not be called or manipulated.

Input:

Word2vec::Bst -> Binary Search Tree

Output:

None

Example:

use Word2vec::Xmltow2v;
use Word2vec::Bst;

my @compoundWordAry = ( "big dog", "respiratory failure", "seven large masses" );
@compoundWordAry = sort( @compoundWordAry );

my $arySize = @compoundWordAry;

my $bst = Word2vec::Bst->new();
$bst->CreateTree( \@compoundWordAry, 0, $arySize, undef );

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetCompoundWordBST( $bst );

undef( $xmlconv );

ClearCompoundWordBST

Description:

Clears/Un-defines existing compound word binary search tree from memory.

Warning: This is a private function and should not be called or manipulated.

Input:

None

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->ClearCompoundWordBST();

undef( $xmlconv );

SetMaxCompoundWordLength

Description:

Sets member variable to passed integer parameter. Sets maximum number of compound words in a phrase for comparison.

e.g. "medical campus of Virginia Commonwealth University" can be interpreted as a compound word of 6 words.
Setting this variable to 3 will attempt to compoundify at most three words at a time.
The result would be "medical_campus_of Virginia commonwealth university" even though an exact entry for
the full six-word phrase may exist. Setting this variable to 6 will result in compounding all six words if
the phrase exists in the compound word array/BST.

Warning: This is a private function and should not be called or manipulated.

Input:

$value -> Integer

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetMaxCompoundWordLength( 8 );

undef( $xmlconv );
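
The effect of the maximum compound word length described for SetMaxCompoundWordLength can be illustrated with a short standalone sketch. It is not the module's implementation: CompoundifySketch is a hypothetical helper, it uses a plain hash lookup in place of the compound word array/BST, and it assumes a greedy longest-match-first strategy with lowercased text.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Word2vec::Xmltow2v): joins known compound
# words with underscores, testing at most $maxLength words per phrase.
sub CompoundifySketch
{
    my ( $text, $compoundWordsRef, $maxLength ) = @_;

    # Hash lookup stands in for the module's compound word array/BST.
    my %isCompound = map { lc( $_ ) => 1 } @{ $compoundWordsRef };
    my @words      = split( ' ', lc( $text ) );
    my @result     = ();
    my $i          = 0;

    while( $i < @words )
    {
        my $matched = 0;

        # Never look past the end of the text.
        my $limit = $maxLength;
        $limit = @words - $i if $limit > @words - $i;

        # Try the longest candidate phrase first, then shrink it.
        for( my $len = $limit; $len >= 2; $len-- )
        {
            my $phrase = join( ' ', @words[ $i .. $i + $len - 1 ] );
            next if !exists( $isCompound{ $phrase } );

            ( my $joined = $phrase ) =~ s/ /_/g;
            push( @result, $joined );
            $i += $len;
            $matched = 1;
            last;
        }

        # No compound word starts here: keep the single word as-is.
        if( !$matched )
        {
            push( @result, $words[ $i ] );
            $i++;
        }
    }

    return join( ' ', @result );
}

my @compoundWords = ( "medical campus of", "medical campus of virginia commonwealth university" );
my $text          = "medical campus of Virginia Commonwealth University";

print( CompoundifySketch( $text, \@compoundWords, 3 ), "\n" );
print( CompoundifySketch( $text, \@compoundWords, 6 ), "\n" );
```

With a limit of 3, only "medical campus of" is joined and the remaining words are left separate. With a limit of 6, the full six-word entry is found first and the whole phrase is joined.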

SetOverwriteExistingFile

Description:

Sets member variable to passed integer parameter. Sets option to overwrite existing text corpus during compilation
if 1 or append to existing text corpus if 0.

Input:

$value -> '1' = Overwrite existing text corpus / '0' = Append to existing text corpus during compilation.

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmltow2v = Word2vec::Xmltow2v->new();
$xmltow2v->SetOverwriteExistingFile( 1 );

undef( $xmltow2v );

Debug Functions

GetTime

Description:

Returns current time string in "Hour:Minute:Second" format.

Input:

None

Output:

$string -> XX:XX:XX ("Hour:Minute:Second")

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $time = $xmlconv->GetTime();

print( "Current Time: $time\n" ) if defined( $time );

undef( $xmlconv );

GetDate

Description:

Returns current month, day and year string in "Month/Day/Year" format.

Input:

None

Output:

$string -> XX/XX/XXXX ("Month/Day/Year")

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
my $date = $xmlconv->GetDate();

print( "Current Date: $date\n" ) if defined( $date );

undef( $xmlconv );

WriteLog

Description:

Prints passed string parameter to the console, log file or both depending on user options.

Note: A new line character is printed after the string unless the printNewLine parameter is 0.
Leaving the parameter undefined also prints the new line character.

Input:

$string -> String to print to the console/log file.
$value  -> 0 = Do not print newline character after string, all else prints new line character including 'undef'.

Output:

None

Example:

use Word2vec::Xmltow2v;

my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->WriteLog( "Hello World" );

undef( $xmlconv );

Author

Clint Cuffy, Virginia Commonwealth University

COPYRIGHT

Copyright (c) 2016

Bridget T McInnes, Virginia Commonwealth University
btmcinnes at vcu dot edu

Clint Cuffy, Virginia Commonwealth University
cuffyca at vcu dot edu

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to:

The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA  02111-1307, USA.