NAME
Dezi::Tutorial - getting started with the Dezi search platform
Installation
Install the Dezi server from CPAN:
% cpan -i Dezi
Install the Dezi client from CPAN:
% cpan -i Dezi::Client
Beginner - Hello World
Start the Dezi server:
% dezi
In a separate terminal, add a small test document to the index:
% echo '<doc><title>bar</title>hello world</doc>' > test.xml
% dezi-client test.xml
Search the index to confirm your test document was indexed:
% dezi-client -q bar
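For more than one test document, a short Perl script can stand in for the echo one-liner. The following is a minimal sketch and not part of Dezi: the helper names (xml_escape, make_doc), the output file names, and the sample titles are all hypothetical. It writes documents in the same doc/title shape as test.xml, escaping characters that would otherwise produce invalid XML.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Escape the three characters that are unsafe in XML text content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must run first so later entities are not re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Build one document in the same shape as test.xml.
sub make_doc {
    my ( $title, $body ) = @_;
    return sprintf( "<doc><title>%s</title>%s</doc>\n",
        xml_escape($title), xml_escape($body) );
}

# Hypothetical sample data: title => body.
my %docs = (
    bar => 'hello world',
    baz => 'fish & chips',
);

# Write one file per document into the current directory.
for my $title ( sort keys %docs ) {
    my $file = "test-$title.xml";
    open my $fh, '>', $file or die "Cannot write $file: $!";
    print {$fh} make_doc( $title, $docs{$title} );
    close $fh or die "Cannot close $file: $!";
    print "wrote $file\n";
}
```

Each generated file can then be added to the index with dezi-client, in the same way as test.xml.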
Intermediate - The Dezi Demo
This section walks through the setup behind the Dezi demo available at http://dezi.org/demo.
Download the Reuters corpus
The Reuters News Corpus for Text Classification (Reuters-21578) is a document corpus commonly used in information retrieval projects. Other collections, such as Wikipedia database dumps, have become more popular since the Reuters corpus first appeared, but it remains a convenient, medium-sized collection for demonstrating Dezi.
You can find the corpus in many places on the internet. The version used for the demo came from http://svn.peknet.com/search_bench/. The 2xml.pl script at that URL converts the original SGML documents to valid XML and splits them into about 21k individual documents.
Unpack the tar.gz file somewhere and run the 2xml.pl script as described in the script's comments.
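A quick sanity check after the conversion is to count the XML files it produced. The sketch below is not part of the search_bench scripts. It assumes the converted documents end in .xml and live under a directory given on the command line, defaulting to a directory named reuters; the helper name count_xml is hypothetical. A total near 21,000 suggests the split worked.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Find;

# Count regular files ending in .xml below the given directory.
sub count_xml {
    my ($dir) = @_;
    my $count = 0;
    find(
        {
            # with no_chdir, $_ holds the full path of each entry
            wanted   => sub { $count++ if -f $_ && /\.xml\z/ },
            no_chdir => 1,
        },
        $dir
    );
    return $count;
}

# The default directory name is an assumption; pass your own as the first argument.
my $dir = shift(@ARGV) // 'reuters';
if ( -d $dir ) {
    printf "%d XML files under %s\n", count_xml($dir), $dir;
}
else {
    warn "$dir is not a directory\n";
}
```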
Create a Swish3 configuration file
As described in Dezi::Architecture, Dezi is based on Swish3 (http://swish3.dezi.org/). You can index the Reuters corpus with the deziapp command that comes with Dezi::App (one of the Dezi dependencies).
First, you'll need a configuration file. Here's the one used for the Dezi demo:
DefaultContents XML*
StoreDescription XML* <text> 10000
PropertyNameAlias swishtitle title
MetaNames dates topics people places orgs author swishdocpath
PropertyNames dates topics people places orgs author dateline
FuzzyIndexingMode Stemming_en1
Save the file as dezi.conf.
More details on Swish3 configuration can be found at http://swish-e.org/docs/swish-config.html.
Index the XML
If your Reuters docs are in a directory called reuters, you can create an index with a command like:
% deziapp -c dezi.conf -i reuters
You can index all kinds of document types, not just XML, but for the purposes of this tutorial, we'll keep it simple.
Create a Dezi configuration file
Here are the contents of the demo config file, named dezi.config.pl:
{
    engine_config => {
        facets => {
            names => [qw( topics people places orgs author )]
        },
    },
    ui_class => 'Dezi::UI',
    base_uri => 'http://dezi.org/demo',
    username => 'deziuser',
    password => 'a-secret',
}
NOTE that the username and password are there to prevent unwanted modification of the index. Because Dezi supports the POST, PUT and DELETE HTTP methods on an index, it's a good idea to protect the index, particularly if it is exposed to the open internet.
NOTE too that the Dezi::UI class is enabled. It requires a separate installation from CPAN:
% cpan -i Dezi::UI
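Before starting the server it can help to confirm that the config file parses and contains the keys you expect. The sketch below is an illustrative check, not a description of how Dezi itself loads its configuration. It assumes the file is a Perl expression that evaluates to a hash reference, as the demo listing suggests, and evaluates it with do. The helper names load_config and check_config are hypothetical.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Spec;

# Evaluate a Perl config file and return the hash reference it yields.
# do() executes the file as code, so only use this on files you trust.
sub load_config {
    my ($file) = @_;

    # An absolute path keeps do() from searching @INC for the file.
    my $config = do( File::Spec->rel2abs($file) );
    die "Cannot parse $file: $@" if $@;
    die "Cannot read $file: $!" unless defined $config;
    die "$file did not return a hash reference\n"
        unless ref $config eq 'HASH';
    return $config;
}

# Return a list of human-readable problems; an empty list means no complaints.
sub check_config {
    my ($config) = @_;
    my @problems;

    for my $key (qw( username password )) {
        push @problems, "missing $key: the index is not protected"
            unless defined $config->{$key} && length $config->{$key};
    }

    # Walk down to the facet names without assuming each level exists.
    my $engine = $config->{engine_config};
    my $facets = ref $engine eq 'HASH' ? $engine->{facets} : undef;
    my $names  = ref $facets eq 'HASH' ? $facets->{names}  : undef;
    push @problems, 'facets names should be an array reference'
        if defined $names && ref $names ne 'ARRAY';

    return @problems;
}

# Only check a file when one is named on the command line.
if (@ARGV) {
    my @problems = check_config( load_config( $ARGV[0] ) );
    if (@problems) {
        print "WARNING: $_\n" for @problems;
    }
    else {
        print "config looks OK\n";
    }
}
```

Saved under a name of your choosing, the script takes the config file path as its only argument, for example dezi.config.pl.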
Start the Dezi server
% dezi --dezi-config dezi.config.pl
From a separate terminal, you can search the index containing text from the Reuters corpus:
% dezi-client -q 'some words'
Thanks to the Dezi::UI module, you can also search via a web browser. Assuming you are running the demo on a local machine, you can point your browser at http://localhost:5000/ui and explore the index contents graphically.
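The same search can also be issued over plain HTTP from Perl. The sketch below is a guess at the wire format, not documented behavior. It assumes the server answers GET requests at /search with a q parameter and returns JSON containing a total count and a results array whose entries carry uri, title and score keys. Check the Dezi and Search::OpenSearch documentation for the actual interface. HTTP::Tiny and JSON::PP ship with Perl; search_url and summarize are hypothetical helper names.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTTP::Tiny;
use JSON::PP qw( decode_json );

# Build the search URL. The /search path and q parameter are assumptions.
sub search_url {
    my ( $base, $query ) = @_;
    $base =~ s{/+\z}{};    # drop trailing slashes
    return "$base/search?"
        . HTTP::Tiny->new->www_form_urlencode( { q => $query } );
}

# Turn a JSON response body into printable lines.
# The total/results/uri/title/score keys are assumed, not confirmed.
sub summarize {
    my ($json) = @_;
    my $data  = decode_json($json);
    my @lines = ( sprintf 'hits: %d', $data->{total} // 0 );
    for my $hit ( @{ $data->{results} || [] } ) {
        push @lines, sprintf '%s  %s  (score %s)',
            map { $_ // '' } @{$hit}{qw( uri title score )};
    }
    return @lines;
}

# Only contact the server when a query is given on the command line.
if (@ARGV) {
    my $url = search_url( 'http://localhost:5000', "@ARGV" );
    my $res = HTTP::Tiny->new->get($url);
    die "Search failed: $res->{status} $res->{reason}\n"
        unless $res->{success};
    print "$_\n" for summarize( $res->{content} );
}
```

Keeping the URL building and response parsing in separate functions means both can be adjusted independently once you have confirmed the real parameter names and response keys.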
Advanced - Roll Your Own
Write your own client application
% cat indexer.pl
#!/usr/bin/env perl
use strict;
use warnings;
use Dezi::Client;
use File::Find;

# connect to the local Dezi server
my $client = Dezi::Client->new(
    server => 'http://localhost:5000'
);

# walk every path given on the command line
find({
    wanted   => \&add_to_index,
    follow   => 1,
    no_chdir => 1,
}, @ARGV);

# commit any pending changes to the index
my $resp = $client->commit();
print $resp->content;

sub add_to_index {
    my $file = $File::Find::name;

    # we only want .xml files
    return unless $file =~ m/\.xml$/;

    # stop at the first document the server rejects
    my $resp = $client->index($file);
    if (!$resp->is_success) {
        die "Failed to index $file: " . $resp->status_line;
    }
}
Start your Dezi server
% dezi
Run your indexer
In a separate terminal:
% perl indexer.pl path/to/xml/docs
Search with dezi-client
After you're done indexing, look for something:
% dezi-client -q foo
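Searching can be scripted in the same way as indexing. The sketch below follows the usage shown in the Dezi::Client documentation as best I recall it: a search method taking a q parameter, a response with results and total, and results with uri, title and score accessors. Verify these against your installed version. The file name search.pl and the helper names format_hit and run_search are hypothetical.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Format one search result as a single line of text.
# Expects an object with uri, title and score accessors.
sub format_hit {
    my ($hit) = @_;
    return sprintf '%s  %s  (score %s)', $hit->uri, $hit->title, $hit->score;
}

# Run a query against the server, then print every hit and the total.
sub run_search {
    my ( $server, $query ) = @_;

    # Loaded here so format_hit can be reused without Dezi::Client installed.
    require Dezi::Client;
    my $client   = Dezi::Client->new( server => $server );
    my $response = $client->search( q => $query );

    # Assumption: a false return value signals a failed request.
    die "Search for '$query' failed\n" unless $response;

    print format_hit($_), "\n" for @{ $response->results };
    printf "%d hits for '%s'\n", $response->total, $query;
    return;
}

# Only contact the server when a query is given on the command line.
run_search( 'http://localhost:5000', "@ARGV" ) if @ARGV;
```

With the server still running, the script would be invoked as perl search.pl foo.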
AUTHOR
Peter Karman, <karman at cpan.org>
BUGS
Please report any bugs or feature requests to bug-dezi at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Dezi. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find this documentation with the perldoc command.
perldoc Dezi::Tutorial
You can also look for information at:
Website
IRC: #dezisearch at freenode
Mailing list
RT: CPAN's request tracker
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
COPYRIGHT & LICENSE
Copyright 2011-2018 Peter Karman.
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.
SEE ALSO
Dezi::Client, Search::OpenSearch, SWISH::3, Dezi::App, Plack, Lucy