NAME
Text::Distill - Quick text comparison, plagiarism and common parts detection
VERSION
Version 0.01
SYNOPSIS
use Text::Distill qw(Distill);
Distill($text);
or
use Text::Distill;
Text::Distill::Distill(Text::Distill::ExtractTextFromFB2File($fb2_file_path));
Service functions
ExtractTextFromFB2File($FilePath)
Receives the path to an FB2 file and returns all significant text from the file as a string.
ExtractTextFromTXTFile($FilePath)
Receives the path to a plain-text file and returns all significant text from the file as a string.
ExtractTextFromDocFile($FilePath)
Receives the path to a .doc file and returns all significant text from the file as a string.
ExtractTextFromDOCXFile($FilePath)
Receives the path to a .docx file and returns all significant text from the file as a string.
ExtractTextFromEPUBFile($FilePath)
Receives the path to an EPUB file and returns all significant text from the file as a string.
DetectBookFormat($FilePath, ($Format))
Detects the format of an e-book file and returns it as a string. You may also suggest a format to start with, to speed up the process a bit.
$Format can be one of 'fb2.zip', 'fb2', 'doc.zip', 'doc', 'docx.zip', 'docx', 'epub.zip', 'epub', 'txt.zip', 'txt'.
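One possible way to combine DetectBookFormat with the extractors above is a small lookup table. This is only a sketch under assumptions: extractor_name and extract_text are hypothetical helpers and not part of Text::Distill; only the uncompressed formats are mapped, because nothing here says how the '.zip' variants should be unpacked; and it is assumed that an unrecognized file yields an undefined or unlisted format value.

```perl
use strict;
use warnings;

# Hypothetical mapping from a detected format string to the name of the
# documented extractor for that format. Zipped variants are left out on
# purpose (see the assumptions above).
my %EXTRACTOR_FOR = (
    fb2  => 'ExtractTextFromFB2File',
    txt  => 'ExtractTextFromTXTFile',
    doc  => 'ExtractTextFromDocFile',
    docx => 'ExtractTextFromDOCXFile',
    epub => 'ExtractTextFromEPUBFile',
);

# Returns the extractor name for a format, or undef if none is mapped.
sub extractor_name {
    my ($format) = @_;
    return undef unless defined $format;
    return $EXTRACTOR_FOR{$format};
}

# Detects the format, then calls the matching extractor.
# $hint is the optional format suggestion for DetectBookFormat.
sub extract_text {
    my ($file_path, $hint) = @_;
    require Text::Distill;    # loaded lazily; must be installed to call this
    my $format = defined $hint
        ? Text::Distill::DetectBookFormat($file_path, $hint)
        : Text::Distill::DetectBookFormat($file_path);
    my $name = extractor_name($format)
        or die "No extractor mapped for this file: $file_path\n";
    my $extractor = Text::Distill->can($name)
        or die "Text::Distill does not provide $name\n";
    return $extractor->($file_path);
}
```

A call such as extract_text($path, 'epub') would then try EPUB first and return the extracted text, which can be passed on to Distill or TextToGems.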
Distilling gems from text
TextToGems($UTF8TextString)
What you really need to know is that the gems TextToGems produces from exactly the same texts are equal, and texts with small changes have similar gems as well. If two texts have 3 or more gems in common, they share some parts of the text for sure. This is somewhat close to "edit distance", but it is fast to calculate and indexable, so you can search for citations or plagiarism efficiently. The chosen split method makes the average detection segment about 2 KB of text (1-2 paper pages), so this package will not normally detect a single identical paragraph. If you need a more precise match, extend @SplitChars with some sequences from SeqNumStats.xlsx on GitHub. I guess you can get down to parts of about 300 characters without significant losses (don't forget to lower $MinPartSize as well).
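The "3 or more common gems" rule can be sketched as a set intersection over two gem lists. The helper names common_gem_count and shares_text are hypothetical and not part of Text::Distill; the sketch only assumes that each gem list is an array reference of numbers, which is what TextToGems is documented to return.

```perl
use strict;
use warnings;

# Counts the distinct gems that occur in both lists.
sub common_gem_count {
    my ($gems_a, $gems_b) = @_;
    my %in_a = map { $_ => 1 } @$gems_a;
    my %common;
    for my $gem (@$gems_b) {
        $common{$gem} = 1 if $in_a{$gem};
    }
    return scalar keys %common;
}

# True if the two lists share at least $threshold gems (default 3).
sub shares_text {
    my ($gems_a, $gems_b, $threshold) = @_;
    $threshold = 3 unless defined $threshold;
    return common_gem_count($gems_a, $gems_b) >= $threshold ? 1 : 0;
}

# Usage with the module (requires Text::Distill to be installed):
#   my $gems_a = Text::Distill::TextToGems($text_a);
#   my $gems_b = Text::Distill::TextToGems($text_b);
#   print "common parts found\n" if shares_text($gems_a, $gems_b);
```

Because gems are plain 32-bit numbers, the same lookup can also be done against a database index instead of an in-memory hash.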
Transforms the text (valid UTF-8 expected) into an array of 32-bit hash sums (Jenkins hash). The text is first flattened the hard way (something like soundex), then split into fragments at statistically chosen sequences. The first and last fragments are rejected, and short fragments are rejected as well. Hashes are calculated from the remaining strings, and a reference to the array of hashes is returned.
It should return about one 32-bit jHash per 2 KB of source text (this may vary from text to text, though).
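The split, reject and hash steps described above can be illustrated with a toy version. Everything in it is an assumption made for illustration: the two split sequences, the minimum fragment size and the checksum used in place of the Jenkins hash are stand-ins, and toy_gems is a hypothetical name. It is not the module's implementation and its output will not match real gems.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for @SplitChars and $MinPartSize.
my @split_seqs    = ('121', '343');
my $min_part_size = 4;

# Stand-in for a 32-bit hash: a 32-bit sum of the bytes of the string.
# The module is documented to use a Jenkins hash instead.
sub toy_hash32 {
    my ($string) = @_;
    return unpack('%32C*', $string);
}

# Takes an already flattened string and returns a reference to an array
# of toy hashes, one per surviving fragment.
sub toy_gems {
    my ($flattened) = @_;
    my $splitter = join '|', map { quotemeta } @split_seqs;
    my @parts = split /(?:$splitter)/, $flattened, -1;

    # Reject the first and the last fragment: they may be cut arbitrarily.
    shift @parts;
    pop @parts;

    # Reject short fragments, hash the rest.
    my @gems = map  { toy_hash32($_) }
               grep { length($_) >= $min_part_size } @parts;
    return \@gems;
}
```

Note that in this toy version a change confined to the first or last fragment leaves the result untouched, and a change inside one middle fragment alters only that fragment's hash, which is the property that makes gems comparable between similar texts.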
Distill($UTF8TextString)
Transforms the text (valid UTF-8 expected) into a sequence of numbers 1-8 (a string as well). It is used internally by TextToGems, but you may use its output with a standard "edit distance" algorithm. As this string is shorter, your math will go much faster.
In the end it works somewhat like 'soundex', with the addition of some basic rules for Cyrillic characters, pre- and post-cleanup and UTF normalization. It drops strange sequences and drops short words as well (how are you going to commit your plagiarism without copying the long words, huh?)
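As an illustration of the "edit distance" use mentioned above, here is a minimal pure-Perl Levenshtein distance that can be applied to two distilled strings. The function name levenshtein is not part of Text::Distill, and the sample strings in the comments are made up rather than real Distill output.

```perl
use strict;
use warnings;

# Classic dynamic-programming Levenshtein distance, keeping only two rows.
sub levenshtein {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;

    # Distance from the empty prefix of $s to every prefix of $t.
    my @prev = (0 .. scalar @t);

    for my $i (1 .. scalar @s) {
        my @cur = ($i);
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            my $best = $prev[$j] + 1;                 # deletion
            my $ins  = $cur[$j - 1] + 1;              # insertion
            my $sub  = $prev[$j - 1] + $cost;         # substitution or match
            $best = $ins if $ins < $best;
            $best = $sub if $sub < $best;
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Made-up distilled strings that differ in one position:
#   levenshtein('12345', '12845')    # one substitution
#
# Usage with the module (requires Text::Distill to be installed):
#   my $distance = levenshtein(
#       Text::Distill::Distill($text_a),
#       Text::Distill::Distill($text_b),
#   );
```

The cost is proportional to the product of the two string lengths, so for long texts it may be worth comparing gems first and running the edit distance only on candidate pairs.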
Internal Functions:
Each of these receives a path to a file and checks whether the file is of the given format:
CheckIfDocZip() - MS Word .doc in zip-archive
CheckIfEPubZip() - Electronic Publication .epub in zip-archive
CheckIfDocxZip() - MS Word 2007 .docx in zip-archive
CheckIfFB2Zip() - FictionBook2 (FB2) in zip-archive
CheckIfTXT2Zip() - text-file in zip-archive
CheckIfEPub() - Electronic Publication .epub
CheckIfDocx() - MS Word 2007 .docx
CheckIfDoc() - MS Word .doc
CheckIfFB2() - FictionBook2 (FB2)
CheckIfTXT() - text-file
REQUIRED MODULES
Digest::JHash; XML::LibXML; XML::LibXSLT; Encode::Detect; Text::Extract::Word; HTML::TreeBuilder; OLE::Storage_Lite; Text::Unidecode (v1.27 or later); Unicode::Normalize (v1.25 or later); Archive::Zip; Encode; Carp;
AUTHOR
Litres.ru, <gu at litres.ru>
Get the latest code from https://github.com/Litres/TextDistill
BUGS
Please report any bugs or feature requests to https://github.com/Litres/TextDistill/issues.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc Text::Distill
You can also look for information at:
RT: CPAN's request tracker (report bugs here)
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
ACKNOWLEDGEMENTS
LICENSE AND COPYRIGHT
Copyright (C) 2016 Litres.ru
The GNU Lesser General Public License version 3.0
Text::Distill is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, version 3.0 of the License.
Text::Distill is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
Full text of License <http://www.gnu.org/licenses/lgpl-3.0.en.html>.