NAME

FileArchiveIndexer::Indexing

DESCRIPTION

This module should not be used directly. It is inherited by FileArchiveIndexer. The code and documentation are placed here for organizational purposes only.

INDEXING

Indexing a file can take a long time. Indexing does not always mean something as simple as counting the pages in a document or reading a file's last modification time. FileArchiveIndexer allows for more intensive procedures. The primary use case in mind was running OCR software to turn scans of hard copy paper into text. If you have to do this to 30 gigabytes of PDF data, it will take weeks.

During that time, files can be moved, renamed, or copied. Do you want to re-index a file just because its name changed, because it was moved, or because there is now a duplicate of it? That would be incredibly wasteful, and with a large file archive you would never be able to keep your data current, especially in a multi-user environment.

This is why we do two unusual things in FileArchiveIndexer. First, the authority about what a file is, is the md5sum hex digest of the file. Second, we keep the file location update step and the indexing step totally separate.

This means that a file with the md5sum 'qwerty', once indexed, *is* indexed, and remains so. If the file is copied, moved, or even erased, we still keep the data we indexed. Thus the precious indexed data we collected is valid across filesystems and across whatever users do with the files.

Of course, if the file is modified in any way whatsoever (a character added, a character taken away, anything changed inside the file's data itself), the indexing of that file will be repeated.
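The identity rule above can be tried out with nothing but core modules. The following is a minimal sketch, not code from this package: the helper md5_hex_of_file() and the file names are made up for illustration. It shows that a renamed copy keeps the same digest, while a one-character change produces a different one.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Copy qw(copy);
use File::Temp qw(tempdir);

# Hypothetical helper: return the md5 hex digest of a file's content.
sub md5_hex_of_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "cannot open $path: $!";
    binmode $fh;
    my $hex = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $hex;
}

my $dir = tempdir( CLEANUP => 1 );

# Write a small file, then copy it under a different name.
my $original = "$dir/scan.txt";
open my $out, '>', $original or die "cannot write $original: $!";
print {$out} "page one\n";
close $out;

my $copy = "$dir/renamed_copy.txt";
copy( $original, $copy ) or die "copy failed: $!";

my $digest_original = md5_hex_of_file($original);
my $digest_copy     = md5_hex_of_file($copy);

# Append a single character to the copy and take its digest again.
open my $app, '>>', $copy or die "cannot append to $copy: $!";
print {$app} "x";
close $app;
my $digest_modified = md5_hex_of_file($copy);

print "renamed copy: ",
    ( $digest_original eq $digest_copy ? "same digest" : "different digest" ), "\n";
print "modified copy: ",
    ( $digest_original eq $digest_modified ? "same digest" : "different digest" ), "\n";
```

The path never enters the digest, which is why a move or rename does not invalidate indexed data.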

SYNOPSIS

	use FileArchiveIndexer;

	my $i = FileArchiveIndexer->new({
		DBHOST => $dbhost,
		DBNAME => $dbname,
		DBUSER => $dbuser,
		DBPASSWORD => $dbpassword,
	});

	while( my($abs_path, $md5sum) = $i->get_next_indexpending ){

		# false if already indexed or locked for indexing by another process
		$i->indexing_lock( $md5sum ) or next;

		my $text = your_method_of_getting_text_out_of( $abs_path );

		$i->insert_record( $md5sum, $text );

		$i->indexing_lock_release( $md5sum );

	}

INDEXING STEP

How we index, brief overview:

1 - Ask the queue for the next file pending indexing.

An entry in the files table that does not match an entry in the md5sum table is a file that has not been indexed and is thus pending indexing. To index a file, we receive the file's location and the file's md5sum.

2 - Attempt to lock the file for indexing

We attempt to lock the file for indexing by identifying the file with the md5sum string.

3 - Secure the file data.

If the indexing process is running remotely from the archive's physical location, a copy of the file is stored in a temporary location. Then the temporary file's md5sum is checked against what it should be according to the files table. This is to ensure that a) the file did not change, and b) the file was not corrupted in transit across the network.

4 - Turn the file in question into simple text.

Use your own method of turning the file into text. You may separate pages with pagebreak characters. A full method using tesseract OCR is provided within this package.

5 - Insert the data

The text is inserted into the data table.

6 - Release the indexing lock

The file is released from indexing.
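The "secure the file data" step can be sketched with core modules as follows. This is only an illustration under stated assumptions: secure_file_data() and md5_hex_of_file() are hypothetical names, not methods of this package, and a local copy stands in for a network transfer.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Copy qw(copy);
use File::Temp qw(tempfile);

# Hypothetical helper: md5 hex digest of a file's content.
sub md5_hex_of_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "cannot open $path: $!";
    binmode $fh;
    my $hex = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $hex;
}

# Hypothetical helper: copy $source to a temporary file and confirm that the
# copy's digest matches $expected_md5 (the value recorded in the files table).
# Returns the temporary path on success, nothing on mismatch.
sub secure_file_data {
    my ( $source, $expected_md5 ) = @_;
    my ( $tmp_fh, $tmp_path ) = tempfile( UNLINK => 1 );
    close $tmp_fh;
    copy( $source, $tmp_path ) or die "copy failed: $!";
    return $tmp_path if md5_hex_of_file($tmp_path) eq $expected_md5;
    unlink $tmp_path;    # changed or corrupted: do not index this copy
    return;
}

# Demonstration with a throwaway source file.
my ( $src_fh, $src_path ) = tempfile( UNLINK => 1 );
print {$src_fh} "scanned text\n";
close $src_fh;

my $recorded_md5 = md5_hex_of_file($src_path);
my $good_copy    = secure_file_data( $src_path, $recorded_md5 );
my $bad_copy     = secure_file_data( $src_path, '0' x 32 );

print "matching digest: ",    ( defined $good_copy ? "accepted" : "rejected" ), "\n";
print "mismatching digest: ", ( defined $bad_copy  ? "accepted" : "rejected" ), "\n";
```

On a mismatch the sensible reaction is to skip the file and let the next files table update record its current md5sum.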

ABOUT MD5SUM

The md5sum is the most important thing for the data. We do not care much about the file's absolute path. Every time we run faiupdate, we rebuild the entire files table. This makes sure that the paths to the data are current.

What is most important and interesting to us, the gold, the value of the database, is the data table and the md5sum table.

We can have multiple files with the same md5sum, on different computers, across networks, and so on.

What we build with indexing is the matching of md5sum to text content.

This is why when we start indexing, we "lock" the file (in case other processes are also indexing parallel to us) and acquire the id to the md5sum table entry. This is what we record data against.

So, let's get one of the files pending indexing:

	my $pending = $self->get_indexpending(1);

	# assumes each entry is an array ref holding files_id and abs_path
	my( $files_id, $files_abs_path ) = @{ $pending->[0] };

	my $md5sum_id = $self->indexing_lock( $files_id );

This creates the entry in the md5sum table. The file will not yet be returned in a search result, because it is locked.

Now let's get the content of the file. Let's imagine it is a text file. We slurp the content and give it to insert_record():

	require File::Slurp;

	my $text = File::Slurp::slurp($files_abs_path);

	# maybe we want to clean the text of sensitive data?

	$text =~ s/sensitive data/ /sig;

	$self->insert_record($md5sum_id, $text);

Now we release the file:

	$self->indexing_lock_release( $files_id );

And save what we have done:

	$self->dbh->commit;

This is also done automatically by DESTROY.

What is so fabulous about the authority of the data being the md5sum, instead of a path on disk or an inode number, is that the file may disappear and the data still remains. The file may be on a different computer. It doesn't matter. The authority is the md5sum.

ABOUT FILE LOCATIONS

Of course, the md5sum and the data itself are useless if they cannot point us in the direction of where a file can be found.

The files table holds this information. It holds the location where the file resides, and what the file's md5sum is.

When we search the text, the results point not to a file on disk, but to a unique md5sum! If we can then match the md5sum value to a location in the files table, we consider it a result.
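The two-stage lookup described above can be modelled with plain hashes. This is only a sketch: %text_for and %locations_for are hypothetical stand-ins for the data table and the files table, the digests and paths are invented, and the package's real schema and search code are not shown here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the data table: md5sum => indexed text.
my %text_for = (
    'aaa111' => "invoice for office chairs\fpage two",
    'bbb222' => "meeting notes about chairs",
    'ccc333' => "unrelated memo",
);

# Hypothetical stand-in for the files table: md5sum => known locations.
# 'bbb222' was indexed once, but no copy of it is currently on disk.
my %locations_for = (
    'aaa111' => [ '/archive/2007/invoice.pdf', '/archive/copies/invoice.pdf' ],
    'ccc333' => [ '/archive/memo.pdf' ],
);

# Search the text first, then keep only md5sums that resolve to a location.
sub search {
    my ($term) = @_;
    my %result;
    for my $md5sum ( sort keys %text_for ) {
        next unless index( lc $text_for{$md5sum}, lc $term ) >= 0;
        my $paths = $locations_for{$md5sum};
        next unless $paths && @$paths;    # text matched, but nowhere to point
        $result{$md5sum} = [@$paths];
    }
    return \%result;
}

my $hits = search('chairs');
for my $md5sum ( sort keys %$hits ) {
    print "$md5sum: $_\n" for @{ $hits->{$md5sum} };
}
```

In this model the indexed text for 'bbb222' is kept even though it yields no result, so it becomes searchable again as soon as a file with that md5sum reappears in the files table.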

WHY

Indexing HTML and text files is relatively simple. It's just text. But in the real world, people scan hard copy documents in as PDFs. In my particular office we have about 60k such documents, and the list grows every day. The files can only be found by filename and location. But what if someone misnames a file, or misplaces it? We could lose valuable data!

The FileArchiveIndexer is made with OCR in mind.

Resetting the files table (to detect filename changes, moves, and files that no longer exist) takes 22 seconds for 60k files. But re-indexing the entire archive is not a realistic procedure. Indexing all of the content in the first place would take about 2 weeks. This is a procedure you want to do ONE time and hopefully never again. And you really don't need to, because we index on the md5sum and not on location or inode number.

INDEXING METHODS

get_next_indexpending()

Takes no argument. Returns the abs_path and the md5sum string for the next file in the queue. You should attempt to lock afterwards: because indexing can take a long time and we may be running multiple indexers, attempting to lock is needed.

If none are pending, returns undef.

Every time you call get_next_indexpending, it returns a different file.

	while( my ($abs_path,$md5sum) = $self->get_next_indexpending ){
		# lock or next
	}

The md5sum string is the md5 hex sum of the file data at the time the files table was updated. You should check it again on disk so that you know the file has not changed in the meantime and, if you are indexing remotely, that the data was not corrupted in transit.

This sub returns either those two values or undef.
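If you wrap or replace the queue with your own code, note how a while loop with a list assignment decides when to stop: it tests the number of values returned, not their definedness. The sketch below uses a hypothetical next_pending() over an in-memory queue with invented entries. It says nothing about this package's internals; it only shows the Perl behaviour.

```perl
use strict;
use warnings;

# Hypothetical in-memory queue of [ abs_path, md5sum ] pairs.
my @queue = (
    [ '/archive/a.pdf', 'aaa111' ],
    [ '/archive/b.pdf', 'bbb222' ],
);

# Hands out one (abs_path, md5sum) pair per call. An empty queue is signalled
# with a bare return, which is an empty list in list context.
sub next_pending {
    my $entry = shift @queue or return;
    return @$entry;
}

my @seen;
while ( my ( $abs_path, $md5sum ) = next_pending() ) {
    push @seen, $md5sum;
}
print "processed: @seen\n";

# Why the bare return matters: a list assignment in scalar context yields the
# number of values on its right-hand side.
my $count_empty = () = ();         # 0 values: the loop condition is false
my $count_undef = () = (undef);    # 1 value: the loop condition is true
print "empty list counts as $count_empty, a lone undef counts as $count_undef\n";
```

A loop that must also stop on a lone undef can assign first and then leave with `last unless defined $abs_path`.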

indexing_lock()

Argument is the md5sum that you are going to start indexing. Returns boolean.

If it is already locked, returns false. Otherwise it locks (inserts the md5sum and a timestamp into the indexing_lock table) and returns true.

	while( my($abs_path,$md5sum) = $self->get_next_indexpending ){
		$self->indexing_lock($md5sum) or next;

		# ... index ...
	}

This makes a call to commit the database, so that subsequent requests to lock return false.
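The lock semantics described above (false when the md5sum is already indexed or already locked, true otherwise) can be modelled in memory. This is a sketch only: try_lock(), release_lock(), %lock_table and %indexed are hypothetical stand-ins, not the package's implementation, which keeps its locks in the database.

```perl
use strict;
use warnings;

my %lock_table;    # md5sum => timestamp, stand-in for the indexing_lock table
my %indexed;       # md5sum => 1, stand-in for "present in the md5sum table"

# Returns 1 and records the lock, or 0 if the md5sum is indexed or locked.
sub try_lock {
    my ($md5sum) = @_;
    return 0 if $indexed{$md5sum};              # already indexed
    return 0 if exists $lock_table{$md5sum};    # another worker holds the lock
    $lock_table{$md5sum} = time;
    return 1;
}

# Returns 1 if a lock was held and has now been removed, 0 otherwise.
sub release_lock {
    my ($md5sum) = @_;
    return defined( delete $lock_table{$md5sum} ) ? 1 : 0;
}

my $first    = try_lock('aaa111');    # nobody holds it yet
my $second   = try_lock('aaa111');    # a second indexer asks for the same file
my $released = release_lock('aaa111');

$indexed{'aaa111'} = 1;               # pretend the record has been inserted
my $after_index = try_lock('aaa111');

print "first=$first second=$second released=$released after_index=$after_index\n";
```

In a real multi-process setup the check and the insert have to be atomic, for example by relying on a unique key in the lock table. A plain hash only illustrates the return values.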

indexing_lock_release()

Argument is the md5sum you just finished indexing.

	$self->indexing_lock_release($md5sum);

This is called when you complete indexing of a file. It makes a commit call to the database. Returns boolean.

indexing_lock_by_path()

Argument is the abs path to the file to index. It must be a file. If the file is not already in the files table, returns undef (because this whole thing is meant to index millions of files, not a few, and you want to make sure that the process that selects those files is repeated automatically).

Same as indexing_lock(), but here you provide the abs_path. Returns the md5sum; if it cannot lock, returns undef.

This is for calling via the CLI. Returns files.id and md5sum.id for indexing.

	my ($md5sum) = $self->indexing_lock_by_path('/home/myself/file.pdf');
	defined $md5sum or die("can't lock to index");

md5sum_is_indexed()

Argument is the md5sum. Returns boolean.

Returns true if the md5sum digest string is in the md5sum table. No md5sum digest should be in the md5sum table unless the file was indexed.

insert_record()

Arguments are the md5sum hex digest and a text scalar.

The text should be simple text with page break \f and line feed \n characters. insert_record() does NOT commit; the data is committed when you call indexing_lock_release(). So if the process dies or is interrupted, the file does not end up with half-indexed data.

If the md5sum is already in the md5sum table, its record is overwritten.

You may want to check whether the md5sum is already indexed by calling md5sum_is_indexed().

For example (this is not good usage; it only demonstrates the call):

	require File::Slurp;

	my ($abs_path,$md5sum) = $self->get_next_indexpending;

	my $text = File::Slurp::slurp($abs_path);

	$self->insert_record($md5sum,$text);

Remember that the UPDATE STEP automatically keeps track of paths and md5sums as a separate step, so the abs path does not need to be recorded.

Return value is boolean.
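The text layout that insert_record() expects (pages separated by \f, lines separated by \n) can be built and taken apart with join and split. A minimal sketch with made-up page content:

```perl
use strict;
use warnings;

# One string per page, as an OCR step might produce them.
my @pages = (
    "First page, line one\nFirst page, line two",
    "Second page",
);

# Pages are joined with the form feed character.
my $text = join "\f", @pages;

# Splitting on \f recovers the pages. The -1 limit keeps trailing empty
# pages, which split would otherwise drop.
my @recovered  = split /\f/, $text, -1;
my $page_count = scalar @recovered;

for my $n ( 0 .. $#recovered ) {
    my @lines = split /\n/, $recovered[$n];
    printf "page %d has %d line(s)\n", $n + 1, scalar @lines;
}
```

Keeping the page breaks in the stored text lets a later search report which page a match was found on.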

SEE ALSO

FileArchiveIndexer

AUTHOR

Leo Charre leocharre at cpan dot org

LICENSE

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, i.e., under the terms of the "Artistic License" or the "GNU General Public License".

DISCLAIMER

This package is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See the "GNU General Public License" for more details.