NAME
FileArchiveIndexer::Database
DESCRIPTION
This module should not be used directly. It is inherited by FileArchiveIndexer. The code and documentation are placed herein for organizational purposes only.
THE DATABASE
THE DATABASE LAYOUT EXPLAINED
In this document we assume that your database is called faindex, thus faindex.files refers to the table "files" in the database "faindex".
faindex.files table
This table stores information about where the files are on disk, the actual physical location of the data. It stores the absolute path on disk and the md5sum of each file. The same md5sum may be present multiple times, because the same file may have copies.
faindex.indexing_lock table
This stores a timestamp of when the file was locked for indexing, and the md5sum as the file's identification. There is no sense in normalizing the md5sum string here, because that md5sum may appear two or more times in the files table.
faindex.md5sum and faindex.data tables
The faindex.md5sum table keeps an id and an md5sum string. We take the authoritative identity of a "file" to be its md5sum hex digest, so all indexed data and metadata are recorded against an id, which is the faindex.md5sum.id entry.
The faindex.data table is where we look up text. Matching rows resolve to an id that corresponds to faindex.md5sum.id. This way we can see that the search string "casual" is present in any file whose md5sum is x. Then we can look in faindex.files.md5sum to see whether that file is on disk, and where.
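As a sketch, a search for "casual" could be resolved to on-disk paths with a join like the following. This is illustrative only: the id and md5sum columns are as described above, but the column holding the indexed text (called txt here) is an assumed name, not necessarily what the module uses.

```sql
-- Illustrative only: "txt" is an assumed column name for the indexed text.
SELECT f.abs_path
FROM data d
JOIN md5sum m ON m.id = d.id
JOIN files  f ON f.md5sum = m.md5sum
WHERE d.txt LIKE '%casual%';
```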
Why is the md5sum field not normalized?
If you view the UPDATE PROCESS, you can see that we want the files table to be able to update quickly. We may be updating the files table every hour. The age of your last files table rebuild determines how accurate your data is with regard to the location of the files.
Think about an archive with 80k files. First, we have to get the md5sum for those 80k files; on a 2.4GHz Xeon machine this can take 30 minutes, depending on the size of the archive, of course. Then we have to make those 80k inserts. This process, the UPDATE PROCESS, can take anywhere from 20 minutes to an hour.
In an ideal world, I would like the machines to be fast enough to normalize the md5sum string. But that would require that, for each insert into the files table, we first look up its md5sum string in the faindex.md5sum table, get that id, then come back and insert. Of course, if the md5sum string is not in the faindex.md5sum table, we have to insert that as well. Doing this for tens of thousands of entries slows the whole thing down dramatically. MySQL is not as good at inserts as SQLite, but its select queries are faster; those matter more, because the system must stay responsive to user search requests.
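To make the tradeoff concrete, here is a sketch of the extra per-row work that normalizing the files table would require, versus the single denormalized insert actually used. The md5sum_id column name is hypothetical, for illustration only.

```sql
-- Normalized: two or three statements per file, ~80k times over.
SELECT id FROM md5sum WHERE md5sum = ?;                  -- look up the digest
INSERT INTO md5sum (md5sum) VALUES (?);                  -- only if it was missing
INSERT INTO files (abs_path, md5sum_id) VALUES (?, ?);   -- md5sum_id is hypothetical

-- Denormalized (what this module does): one statement per file.
INSERT INTO files (abs_path, md5sum) VALUES (?, ?);
```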
The downside is that the varchar(32) md5sum value is present in large quantities twice in the database: once in faindex.md5sum.md5sum and once in faindex.files.md5sum. The entries in faindex.indexing_lock.md5sum are negligible, because at most their number should match the number of indexers running.
DATABASE RELATED METHODS
dbh_count()
The argument is a select statement; returns the count. You MUST have a COUNT(*) in the select statement.
my $matches = $self->dbh_count('select count(*) from files');
dbh_sth()
The argument is a statement; returns a statement handle. The handle is cached in the object, so subsequent calls with the same statement are not re-prepared.
my $delete = $self->dbh_sth('DELETE FROM files WHERE id = ?');
$delete->execute(4);
# or..
for (@ids){
    $self->dbh_sth('DELETE FROM files WHERE id = ?')->execute($_);
}
If the prepare fails, confess is called.
DBI's prepare_cached() is also available via the dbh.
dbh()
returns database handle
dbh_is_mysql()
returns boolean
dbh_is_sqlite()
returns boolean
dbh_driver()
Returns the name of the DBI driver: sqlite, mysql, etc. Currently mysql is used in production, and sqlite is used for testing. To test the package, you don't need to have mysqld running.
dbsetup_reset_files()
Drops the files table if it already exists, then recreates it:
CREATE TABLE files (
id INTEGER AUTO_INCREMENT PRIMARY KEY,
abs_path varchar(300) NOT NULL,
md5sum varchar(32) NOT NULL
);
returns true.
dbsetup_reset_indexing_lock()
Drops the indexing_lock table if it already exists, then recreates it.
dbsetup_reset_data()
Will drop the data table and rebuild it. It will also drop the md5sum table and rebuild it. CAREFUL: this deletes ALL your indexed data! Only call this if you really don't want your indexed data, or are starting a fresh setup.
There's almost no reason to call this, ever, unless you've changed the way you want to store the text, for example, to clean or transform it.
dbsetup_reset_files()
Will drop and recreate the files table. This is called when reindexing the whole archive. It leaves the indexed data alone.
dbsetup_reset()
Sets up the database from scratch. This cleans out the entire database!
IMPORTANT NOTE AUTOCOMMIT
Autocommit is set to 0 by default. That means you should commit after calling indexing_lock(), indexing_lock_release(), and delete_record().
DESTROY will finish and commit if there are open handles created by the object.
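A minimal, self-contained sketch of that commit discipline, using plain DBI with an in-memory SQLite database. This assumes DBD::SQLite is installed; the schema here is illustrative, and the module's own indexing_lock() methods are not shown.

```perl
use strict;
use warnings;
use DBI;

# AutoCommit => 0: nothing is written until we commit explicitly.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 0 });

# Illustrative schema; the module's real indexing_lock layout may differ.
$dbh->do('CREATE TABLE indexing_lock (md5sum VARCHAR(32), locked_at INTEGER)');
$dbh->do('INSERT INTO indexing_lock VALUES (?, ?)',
    undef, 'd41d8cd98f00b204e9800998ecf8427e', time);

$dbh->commit;        # without this, the insert would be lost on disconnect
$dbh->disconnect;
```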
SEE ALSO
AUTHOR
Leo Charre leocharre at cpan dot org
LICENSE
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, i.e., under the terms of the "Artistic License" or the "GNU General Public License".
DISCLAIMER
This package is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the "GNU General Public License" for more details.