                                    NAME
                                        
  DBIx::KwIndex - create and maintain keyword indices in DBI tables
    _________________________________________________________________
    
                                  SYNOPSIS
                                        
package MyKwIndex;
use DBIx::KwIndex;
use base 'DBIx::KwIndex';
sub document_sub { ... }
package main;
$kw = MyKwIndex->new({dbh => $dbh, index_name => 'myindex'})
  or die "can't create index";
$kw->add_document   ([1,2,3,...]) or die $kw->{ERROR};
$kw->remove_document([1,2,3,...]) or die $kw->{ERROR};
$kw->update_document([1,2,3,...]) or die $kw->{ERROR};
$docs = $kw->search({ words=>'upset stomach' });
$docs = $kw->search({ words=>'upset stomach', boolean=>'AND' });
$docs = $kw->search({ words=>'upset stomach', start=>11, num=>10 });
$docs = $kw->search({ words=>'upset (bite|stomach)', re=>1 });
$kw->add_stop_word(['the','an','am','is','are']) or die $kw->{ERROR};
$words = $kw->common_word(85);
$kw->remove_word(['gingko', 'bibola']) or die $kw->{ERROR};
$ndocs  = $kw->document_count();
$nwords = $kw->word_count();
$kw->remove_index or die $kw->{ERROR};
$kw->empty_index  or die $kw->{ERROR};
    _________________________________________________________________
    
                                 DESCRIPTION
                                        
  DBIx::KwIndex is a keyword indexer. It indexes documents and stores
  the index data in database tables. You can tell DBIx::KwIndex to index
  lots of documents and later on show you which ones contain a certain
  word. The typical application of DBIx::KwIndex is in a search engine.
    
  How to use this module:
   1. Provide a database handle.
use DBI;
my $dbh = DBI->connect(...) or die $DBI::errstr;
   2. Subclass DBIx::KwIndex and provide a `document_sub' method to
      retrieve documents referred to by an integer id. The method should
      accept a list of document ids in an array reference and return the
      documents in an array reference, in the same order. In this way,
      you can index any kind of document that you want: text files, HTML
      files, BLOB columns, etc., as long as you provide a suitable
      document_sub() to retrieve the documents. The one thing to remember
      is that each document must be referred to by a unique integer. Below
      is a sample document_sub() that retrieves documents from the
      'content' field of a database table.
package MyKwIndex;
require DBIx::KwIndex;
use base 'DBIx::KwIndex';
sub document_sub {
   my ($self, $ary_ref) = @_;
   my $dbh = $self->{dbh};
   my $result = $dbh->selectall_arrayref(
   'SELECT id,content FROM documents
    WHERE id IN ('. join(',',@$ary_ref). ')');
   # if retrieval fails, you should return undef
   defined($result) or return undef;
   # now return the content field in the order of the ids
   # requested. remember to return the documents exactly
   # in the order requested!
   my %tmp = map { $_->[0] => $_->[1] } @$result;
   return [ @tmp{ @$ary_ref } ];
}
   3. Create the indexer object.
my $kw = MyKwIndex->new({
         dbh => $dbh,
         index_name => 'article_index',
         # other options...
         });
      dbh is the database handle. index_name is the name of the index;
      DBIx::KwIndex will create several tables, all prefixed with the
      index_name. The default index_name is 'kwindex'. Other options
      include max_word_length (default 32).
   4. Index some documents. You can index one document at a time, e.g.
$kw->add_document([1]) or die $kw->{ERROR};
$kw->add_document([2]) or die $kw->{ERROR};
      or small batches of documents at a time:
$kw->add_document([1..10])  or die $kw->{ERROR};
$kw->add_document([11..20]) or die $kw->{ERROR};
      or large batches of documents at a time:
$kw->add_document([1..300])   or die $kw->{ERROR};
$kw->add_document([301..600]) or die $kw->{ERROR};
      Which one to choose is a memory/speed trade-off. Larger batches
      index faster, but use more memory.
      Note: DBIx::KwIndex ignores single-character words, numbers, and
      words longer than 'max_word_length'.
   5. If you want to search the index, use the search() method.
$docs = $kw->search({ words => 'upset stomach' });
die "can't search" if !defined($docs);
      The search() method returns an ARRAY ref containing the ids of
      the documents that match the criteria. Other parameters include:
      num => maximum number of results to retrieve; start => starting
      position (1 = from the beginning); boolean => 'AND' or 'OR'
      (default is 'OR'); re => use regular expression, 1 or 0.
      Note: num and start use the LIMIT clause, and re uses the REGEXP
      clause. Neither is standard SQL (this module was written against
      MySQL). Do not use these options if your database server does not
      support them.
      Also note: Searching is done entirely from the index. No documents
      are retrieved while searching. A simple 'relevancy' ranking is
      used. Search is case-insensitive, and there is no phrase-search
      support yet.
      Some examples:
# retrieve only the 11th-20th result.
$docs = $kw->search({ words=>'upset stomach', start=>11, num=>10 });
die "can't search" if !defined($docs);
# find documents which contain all the words.
$docs = $kw->search({ words=>'upset stomach', boolean=>'AND' });
die "can't search" if !defined($docs);
   6. Now suppose some documents change, and you need to update the
      index to reflect that. Just use the methods below.
# if you want to remove documents from index
$kw->remove_document([90..100]) or die $kw->{ERROR};
# if you want to update the index
$kw->update_document([90..100]) or die $kw->{ERROR};
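
  The batch-size choice described in step 4 can be wrapped in a small
  helper. The following is only a sketch: batches() and
  index_in_batches() are hypothetical names that are not part of
  DBIx::KwIndex, and the indexing call is passed in as a code reference
  so that the helper makes no assumption about the module's internals.

```perl
use strict;
use warnings;

# Split a list of document ids into batches of at most $size ids each.
# (Hypothetical helper, not provided by DBIx::KwIndex.)
sub batches {
    my ($size, @ids) = @_;
    die "batch size must be positive" unless $size > 0;
    my @out;
    push @out, [ splice(@ids, 0, $size) ] while @ids;
    return @out;
}

# Call $add once per batch and stop at the first failure.
# $add is a code ref that receives an array ref of ids and returns
# true on success, e.g. sub { $kw->add_document($_[0]) }.
sub index_in_batches {
    my ($add, $size, @ids) = @_;
    for my $batch (batches($size, @ids)) {
        $add->($batch) or return 0;
    }
    return 1;
}

# Intended use (assumes $kw was created as in step 3):
#   index_in_batches(sub { $kw->add_document($_[0]) }, 300, 1..600)
#     or die $kw->{ERROR};
```

  Keeping the batch size as a single parameter makes it easy to trade
  memory for speed without touching the rest of the indexing code.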
    _________________________________________________________________
    
                            SOME UTILITY METHODS
                                        
  If you want to exclude some words (usually very common words, or
  ``stop words'') from being indexed, do this before you index any
  document:
    
$kw->add_stop_word(['the','an','am','is','are'])
  or die "can't add stop words";
  Adding stop words is a good thing to do, as stop words are not very
  useful for your index. They occur in a large proportion of documents
  (so they do not help searches differentiate documents) and they
  increase the size of your index (slowing the searches).
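    
  To see the effect of stop words together with the other filtering
  rules stated in this README (single-character words, numbers, and
  words longer than max_word_length are ignored), here is a minimal
  sketch. indexable_words() is a hypothetical function, and splitting
  on non-word characters and lowercasing the text are assumptions; the
  module's actual tokenizer may differ.

```perl
use strict;
use warnings;

# Return an array ref of the words in $text that survive the filtering
# rules described in this README. The tokenization (split on non-word
# characters, lowercase everything) is an assumption for illustration.
sub indexable_words {
    my ($text, $stop_words, $max_word_length) = @_;
    $max_word_length = 32 unless defined $max_word_length; # documented default
    my %stop = map { lc($_) => 1 } @$stop_words;
    my @kept;
    for my $word (split /\W+/, lc $text) {
        next if length($word) < 2;                 # empty or single-character
        next if $word =~ /^\d+$/;                  # numbers
        next if length($word) > $max_word_length;  # too long
        next if $stop{$word};                      # stop words
        push @kept, $word;
    }
    return \@kept;
}
```

  With the stop list ['the','is'], only the words that can actually
  distinguish one document from another are left to be indexed.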
    
  But which words are common in your collection? You can use the
  common_word method:
    
$words = $kw->common_word(85);
  This will return an array reference containing all the words that
  occur in at least 85% of all documents (default is 80%).
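    
  The percentage is a document frequency: the share of documents that
  contain the word at least once. The sketch below computes it over an
  in-memory hash, purely to illustrate the threshold. common_words()
  and its input layout are hypothetical; DBIx::KwIndex works from its
  index tables and may compute this differently.

```perl
use strict;
use warnings;

# $docs maps a document id to an array ref of that document's words.
# Returns an array ref of the words (sorted) that occur in at least
# $percent percent of the documents.
sub common_words {
    my ($docs, $percent) = @_;
    $percent = 80 unless defined $percent;   # default stated above
    my $ndocs = keys %$docs;
    return [] unless $ndocs;
    my %df;                                  # word => number of documents
    for my $words (values %$docs) {
        my %seen;
        $df{$_}++ for grep { !$seen{$_}++ } @$words;  # count once per doc
    }
    my @common = grep { $df{$_} / $ndocs * 100 >= $percent } keys %df;
    return [ sort @common ];
}
```

  A word repeated many times in one document still counts only once for
  that document, which is why the per-document %seen hash is needed.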
    
  If you want to delete some words from the index:
    
$kw->remove_word(['common','cold'])
  or die "can't remove words";
  To get some statistics about your index:
    
# the number of documents
$ndocs = $kw->document_count();
# the number of words
$nwords = $kw->word_count();
  Last, if you got bored with the index and want to delete it:
    
$kw->remove_index or die $kw->{ERROR};
  This will delete the database tables. Or, if you just want to empty
  the index and start all over:
    
$kw->empty_index or die $kw->{ERROR};
    _________________________________________________________________
    
                                   AUTHOR
                                        
  Steven Haryanto <steven@haryan.to>
    _________________________________________________________________
    
                                  COPYRIGHT
                                        
  Copyright (c) 1995-1999 Steven Haryanto. All rights reserved.
    
  You may distribute under the terms of either the GNU General Public
  License or the Artistic License, as specified in the Perl README file.
    _________________________________________________________________
    
                             BUGS/CAVEATS/TODOS
                                        
  Test the module under other database servers (besides MySQL).
    
  Use a more correct search sorting (the current one is kinda bogus :).
    
  Probably implement phrase-searching (but this will require a larger
  vectorlist).
    
  Probably, maybe, implement English/Indonesian stemming.
    
  Any safer, non database-specific way to test existence of tables other
  than $dbh->tables?
    _________________________________________________________________
    
                                    NOTES
                                        
  At least two other Perl extensions exist for creating keyword indices
  and storing them in a database: DBIx::TextIndex and MyConText. As of
  this writing, only DBIx::TextIndex features phrase-searching and
  boolean NOT, and only DBIx::KwIndex offers a way to delete documents
  from the index (but please see their updated versions and
  documentation for details). I personally find DBIx::KwIndex more
  convenient when I need to index documents that change often, because
  one can add/remove some documents without rebuilding the entire index.
    
  Advice, comments, and patches are welcome.
    _________________________________________________________________
    
                                   HISTORY
                                        
  0001xx=first draft,satunet.com. 000320=words->scalar.
  000412=0.01/documentation/cpan.