NAME
SWISH::3 - Perl interface to libswish3
SYNOPSIS
use SWISH::3 qw(:constants);

my $handler = sub {
    my $s3_data   = shift;
    my $props     = $s3_data->properties;
    my $prop_hash = $s3_data->config->get_properties;

    print "Properties\n";
    for my $p ( sort keys %$props ) {
        print " key: $p\n";
        my $prop = $prop_hash->get($p);
        printf( " <%s type='%s'>%s</%s>\n",
            $prop->name, $prop->type, $s3_data->property($p), $prop->name );
    }

    print "Doc\n";
    for my $d (SWISH_DOC_FIELDS) {
        printf( "%15s: %s\n", $d, $s3_data->doc->$d );
    }

    print "TokenList\n";
    my $tokens = $s3_data->tokens;
    while ( my $token = $tokens->next ) {
        print '-' x 50, "\n";
        for my $field (SWISH_TOKEN_FIELDS) {
            printf( "%15s: %s\n", $field, $token->$field );
        }
    }
};

my $swish3 = SWISH::3->new(
    config  => 'path/to/config.xml',
    handler => $handler,
    regex   => qr/\w+(?:'\w+)*/,
);

$swish3->parse( 'path/to/file.xml' )
    or die "failed to parse file: " . $swish3->error;

printf "libxml2 version %s\n", $swish3->xml2_version;
printf "libswish3 version %s\n", $swish3->version;
DESCRIPTION
SWISH::3 is a Perl interface to the libswish3 C library.
CONSTANTS
All the SWISH_* constants defined in libswish3.h are available and can be optionally imported with the :constants keyword.
use SWISH::3 qw(:constants);
See the SWISH::3::Constants section below.
In addition, the SWISH::3 Perl class defines some Perl-only constants:
- SWISH_DOC_FIELDS
An array of method names that can be called on a SWISH::3::Doc object in your handler method.
- SWISH_TOKEN_FIELDS
An array of method names that can be called on a SWISH::3::Token object.
- SWISH_DOC_FIELDS_MAP
A hashref of method names to id integer values. The integer values are assigned in libswish3.h.
- SWISH_DOC_PROP_MAP
A hashref of built-in property names to docinfo attribute names. The values of SWISH_DOC_PROP_MAP are the keys of SWISH_DOC_FIELDS_MAP.
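The relationship between the two maps can be sketched as a two-step lookup: property name to docinfo field name, then field name to integer id. This is only an illustration. The helper name field_id_for_property and the stand-in maps (with made-up names and ids) are assumptions, not part of SWISH::3.

```perl
use strict;
use warnings;

# Resolve a built-in property name to its docinfo field id by chaining the
# two maps: property name -> docinfo field name -> integer id.
# (Hypothetical helper, not part of SWISH::3.)
sub field_id_for_property {
    my ( $prop_map, $fields_map, $prop_name ) = @_;
    my $field = $prop_map->{$prop_name};
    return unless defined $field;
    return $fields_map->{$field};
}

# Stand-in maps with made-up names and ids, for illustration only.
my %prop_map   = ( example_prop_path => 'uri', example_prop_size => 'size' );
my %fields_map = ( uri => 1, size => 2 );

for my $name ( sort keys %prop_map ) {
    printf "%s => %s\n", $name,
        field_id_for_property( \%prop_map, \%fields_map, $name );
}
```

With SWISH::3 loaded, the real hashrefs SWISH_DOC_PROP_MAP and SWISH_DOC_FIELDS_MAP would be passed in place of the stand-ins, assuming both are exported by :constants in the same way as SWISH_DOC_FIELDS.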
FUNCTIONS
default_handler
The handler used if you do not specify one. By default it simply prints the contents of SWISH::3::Data to stderr.
CLASS METHODS
new( args )
args should be a list of key/value pairs. See SYNOPSIS.
Returns a new SWISH::3 instance.
xml2_version
Returns the libxml2 version used by libswish3.
version
Returns the libswish3 version.
refcount( object )
Returns the Perl reference count for object.
wc_report( codepoint )
Prints an isw* summary to stderr for codepoint. codepoint should be a positive integer representing a Unicode codepoint.
This prints a report similar to the swish_isw.c example script.
slurp( filename )
Returns the contents of filename as a scalar string. May also be called as an object method.
OBJECT METHODS
get_file_ext( filename )
Returns file extension for filename.
get_mime( filename )
Returns the configured MIME type for filename based on file extension.
get_real_mime( filename )
Returns the configured MIME type for filename, ignoring any .gz extension. See looks_like_gz.
looks_like_gz( filename )
Returns true if filename has a file extension indicating it is gzip'd. Wraps the swish_fs_looks_like_gz() C function.
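As a rough sketch, the four filename helpers above can be called side by side to compare what each reports for one filename. The helper name describe_filename is an assumption made for illustration. $s3 is expected to be a SWISH::3 object created as in the SYNOPSIS.

```perl
use strict;
use warnings;

# Collect what the four filename helpers report for one filename.
# (Hypothetical helper; $s3 is expected to be a SWISH::3 object.)
sub describe_filename {
    my ( $s3, $filename ) = @_;

    # Assign to scalars first so each call is made in scalar context.
    my $ext       = $s3->get_file_ext($filename);
    my $mime      = $s3->get_mime($filename);
    my $real_mime = $s3->get_real_mime($filename);
    my $gzipped   = $s3->looks_like_gz($filename) ? 1 : 0;

    return {
        ext       => $ext,
        mime      => $mime,
        real_mime => $real_mime,
        gzipped   => $gzipped,
    };
}
```

For a gzip'd file, the mime and real_mime entries are where you would expect the results to differ, since get_real_mime() ignores the .gz extension.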
parse( filename_or_filehandle_or_string )
Wrapper around parse_file(), parse_buffer() and parse_fh() that tries to Do the Right Thing.
parse_file( filename )
Calls the C function of the same name on filename.
parse_buffer( str )
Calls the C function of the same name on str. Note that str should contain the API headers.
parse_fh( filehandle )
Calls the C function of the same name on filehandle. Note that the stream pointed to by filehandle should contain the API headers. See SWISH::3::Headers.
error
Returns the error message from the last call to parse(), parse_file(), parse_buffer() or parse_fh(). If there was no error on the last call to one of those methods, returns undef.
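Because error() returns undef after a successful parse call, it can be checked after each file when parsing a batch. The following is a minimal sketch. The helper name parse_all is hypothetical, and it assumes that parse_file() returns normally on failure rather than throwing an exception.

```perl
use strict;
use warnings;

# Parse each file and collect the error message for any that failed.
# (Hypothetical helper; $s3 is expected to be a SWISH::3 object.)
# Assumes parse_file() returns on failure instead of dying.
sub parse_all {
    my ( $s3, @files ) = @_;
    my %failed;
    for my $file (@files) {
        $s3->parse_file($file);

        # error() is undef when the last parse call succeeded.
        my $err = $s3->error;
        $failed{$file} = $err if defined $err;
    }
    return \%failed;
}
```

The returned hashref maps each failed filename to its error message, and is empty when every file parsed cleanly.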
set_config( swish_3_config )
Set the Config object.
get_config
Returns SWISH::3::Config object.
config
Alias for get_config().
set_analyzer( swish_3_analyzer )
Set the Analyzer object.
get_analyzer
Returns SWISH::3::Analyzer object.
analyzer
Alias for get_analyzer().
set_parser( swish_3_parser )
Set the Parser object.
get_parser
Returns SWISH::3::Parser object.
parser
Alias for get_parser().
set_handler( \&handler )
Set the parser handler CODE ref.
get_handler
Returns a CODE ref for the handler.
set_data_class( class_name )
Default class_name is SWISH::3::Data.
get_data_class
Returns class name.
set_parser_class( class_name )
Default class_name is SWISH::3::Parser.
get_parser_class
Returns class name.
set_config_class( class_name )
Default class_name is SWISH::3::Config.
get_config_class
Returns class name.
set_analyzer_class( class_name )
Default class_name is SWISH::3::Analyzer.
get_analyzer_class
Returns class name.
set_regex( qr/\w+(?:'\w+)*/ )
Set the regex used in tokenize().
get_regex
Returns the regex used in tokenize().
regex
Alias for get_regex().
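To see what the example pattern matches, it can be tried in plain Perl without SWISH::3. This only illustrates the regex itself. It does not reproduce anything else tokenize() may do, such as tracking positions or metanames.

```perl
use strict;
use warnings;

# The same pattern shown for set_regex(): a run of word characters,
# optionally followed by apostrophe-joined runs (e.g. contractions).
my $regex = qr/\w+(?:'\w+)*/;

my $text  = "Don't panic: it's the user's guide";
my @words = $text =~ /($regex)/g;

print join( ' | ', @words ), "\n";
# prints: Don't | panic | it's | the | user's | guide
```

The optional group keeps contractions and possessives together as single words, where a plain \w+ would split them at the apostrophe.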
get_stash
Returns the SWISH::3::Stash object used internally by the SWISH::3 object. You typically do not need to access this object as a user of SWISH::3, but if you are developing code that needs to access other objects within a handler function, you can store them in the Stash object and retrieve them later.
Example:
my $s3 = SWISH::3->new( handler => \&handler );
my $stash = $s3->get_stash();
$stash->set( 'my_indexer' => $indexer );

# later..
sub handler {
    my $data    = shift;
    my $indexer = $data->s3->get_stash->get('my_indexer');
    $indexer->add_doc($data);
}
tokenize( string [, metaname, context ] )
Returns a SWISH::3::TokenIterator object representing string. The tokenizer uses the regex defined in set_regex().
tokenize_native( string [, metaname, context ] )
Returns a SWISH::3::TokenIterator object representing string. The tokenizer uses the built-in libswish3 tokenizer, not a regex.
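A short sketch of walking the iterator returned by tokenize(), using only the SWISH::3::TokenIterator and SWISH::3::Token methods documented below. The helper name token_summary is an assumption for illustration. As in the SYNOPSIS, the loop relies on next returning a false value once the tokens are exhausted.

```perl
use strict;
use warnings;

# Tokenize a string and describe each token on one line.
# (Hypothetical helper; $s3 is expected to be a SWISH::3 object.)
sub token_summary {
    my ( $s3, $string ) = @_;
    my $tokens = $s3->tokenize($string);
    my @rows;
    while ( my $token = $tokens->next ) {
        push @rows,
            sprintf( "%s (pos %d, %d bytes)",
                $token->value, $token->pos, $token->len );
    }
    return @rows;
}
```

The same loop should work with tokenize_native(), since both methods are documented to return a SWISH::3::TokenIterator.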
DEVELOPER METHODS
ref_cnt
Returns the internal reference count for the underlying C struct pointer.
debug([n])
Get/set the internal debugging level.
describe( object )
Like calling Devel::Peek::Dump on object.
mem_debug
Calls the C function swish_memcount_debug().
get_memcount
Returns the global C malloc counter value.
dump
A wrapper around describe() and Data::Dump::dump().
SWISH::3::Analyzer
new( swish_3_config )
Returns a new SWISH::3::Analyzer instance.
set_regex( qr/\w+/ )
Set the regex used in SWISH::3->tokenize().
get_regex
Returns a qr// regex object.
get_tokenize
Get the tokenize flag. Default is true.
set_tokenize( 0|1 )
Toggle the tokenize flag. Default is true (tokenize contents when file is parsed).
SWISH::3::Config
set_default
set_properties
Not yet implemented.
get_properties
Returns SWISH::3::PropertyHash object.
set_metanames
Not yet implemented.
get_metanames
Returns SWISH::3::MetaNameHash object.
set_mimes
Not yet implemented.
get_mimes
Returns SWISH::3::xml2Hash object.
set_parsers
Not yet implemented.
get_parsers
Returns SWISH::3::xml2Hash object.
set_aliases
Not yet implemented.
get_aliases
Returns SWISH::3::xml2Hash object.
set_index
Not yet implemented.
get_index
Returns SWISH::3::xml2Hash object.
set_misc
Not yet implemented.
get_misc
Returns SWISH::3::xml2Hash object.
debug
add(file_or_xml)
An alias for add() is merge().
delete
delete() is NOT YET IMPLEMENTED.
read( filename )
Returns SWISH::3::Config object.
write( filename )
SWISH::3::Data
s3
Get the parent SWISH::3 object.
config
Get the parent SWISH::3::Config object.
property( name )
Returns the string value of PropertyName name.
metaname( name )
Returns the string value of MetaName name.
properties
Returns a hashref of name/value pairs.
metanames
Returns a hashref of name/value pairs.
doc
Returns a SWISH::3::Doc object.
tokens
Returns a SWISH::3::TokenIterator object.
SWISH::3::Doc
mtime
Returns the last modified time as epoch int.
size
Returns the size in bytes.
nwords
Returns the number of tokenized words in the Doc.
encoding
Returns the string encoding of Doc.
uri
Returns the URI value.
ext
Returns the file extension.
mime
Returns the mime type.
parser
Returns the name of the parser used (TXT, HTML, or XML).
action
Returns the intended action (e.g., add, delete, update).
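Inside a handler, the Doc accessors above can be combined into a one-line summary per parsed document. This is only a sketch. The helper name doc_summary is hypothetical, and $data is expected to be the SWISH::3::Data object passed to the handler.

```perl
use strict;
use warnings;

# Build a one-line description of a parsed document from its Doc object.
# (Hypothetical helper; $data is expected to be a SWISH::3::Data object.)
sub doc_summary {
    my ($data) = @_;
    my $doc = $data->doc;
    return sprintf(
        "%s [%s, %s parser] %d bytes, %d words",
        $doc->uri, $doc->mime, $doc->parser, $doc->size, $doc->nwords
    );
}
```

It could be used as handler => sub { print doc_summary(shift), "\n" } when constructing the SWISH::3 object.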
SWISH::3::MetaName
new( name )
Returns a new SWISH::3::MetaName instance.
TODO: there are no set methods so this isn't of much use.
id
Returns the id integer.
name
Returns the name string.
bias
Returns the bias integer.
alias_for
Returns the alias_for string.
SWISH::3::MetaNameHash
get( name )
Get the SWISH::3::MetaName object for name.
set( name, swish_3_metaname )
Set the SWISH::3::MetaName for name.
keys
Returns array of names.
SWISH::3::Property
id
Returns the id integer.
name
Returns the name string.
ignore_case
Returns the ignore_case boolean.
type
Returns the type integer.
verbatim
Returns the verbatim boolean.
max
Returns the max integer.
sort
Returns the sort boolean.
alias_for
Returns the alias_for string.
SWISH::3::PropertyHash
get( name )
Get the SWISH::3::Property object for name.
set( name, swish_3_property )
Set the SWISH::3::Property for name.
keys
Returns array of names.
SWISH::3::Stash
get( key )
set( key, value )
keys
values
SWISH::3::Token
value
Returns the value string.
meta
Returns the SWISH::3::MetaName object for the Token.
meta_id
Returns the id integer for the related MetaName.
context
Returns the context string.
pos
Returns the position integer.
len
Returns the length in bytes of the Token.
SWISH::3::TokenIterator
next
Returns the next SWISH::3::Token.
SWISH::3::xml2Hash
get( key )
set( key, value )
keys
SWISH::3::Constants
The following constants are imported directly from libswish3 and are defined there.
- SWISH_ALIAS
- SWISH_BODY_TAG
- SWISH_BUFFER_CHUNK_SIZE
- SWISH_CASCADE_META_CONTEXT
- SWISH_CLASS_ATTRIBUTES
- SWISH_CONTRACTIONS
- SWISH_DATE_FORMAT_STRING
- SWISH_DEFAULT_ENCODING
- SWISH_DEFAULT_METANAME
- SWISH_DEFAULT_MIME
- SWISH_DEFAULT_PARSER
- SWISH_DEFAULT_PARSER_TYPE
- SWISH_DEFAULT_VALUE
- SWISH_DOM_CHAR
- SWISH_DOM_STR
- SWISH_ENCODING_ERROR
- SWISH_ESTRAIER_FORMAT
- SWISH_EXT_SEP
- SWISH_FALSE
- SWISH_FOLLOW_XINCLUDE
- SWISH_HEADER_FILE
- SWISH_HEADER_ROOT
- SWISH_IGNORE_XMLNS
- SWISH_INCLUDE_FILE
- SWISH_INDEX
- SWISH_INDEX_FILEFORMAT
- SWISH_INDEX_FILENAME
- SWISH_INDEX_FORMAT
- SWISH_INDEX_LOCALE
- SWISH_INDEX_STEMMER_LANG
- SWISH_INDEX_NAME
- SWISH_KINOSEARCH_FORMAT
- SWISH_LATIN1_ENCODING
- SWISH_LOCALE
- SWISH_LUCY_FORMAT
- SWISH_MAXSTRLEN
- SWISH_MAX_FILE_LEN
- SWISH_MAX_HEADERS
- SWISH_MAX_SORT_STRING_LEN
- SWISH_MAX_WORD_LEN
- SWISH_META
- SWISH_MIME
- SWISH_MIN_WORD_LEN
- SWISH_PARSERS
- SWISH_PARSER_HTML
- SWISH_PARSER_TXT
- SWISH_PARSER_XML
- SWISH_PATH_SEP_STR
- SWISH_PREFIX_MTIME
- SWISH_PREFIX_URL
- SWISH_PROP
- SWISH_PROP_DATE
- SWISH_PROP_DBFILE
- SWISH_PROP_DESCRIPTION
- SWISH_PROP_DOCID
- SWISH_PROP_DOCPATH
- SWISH_PROP_ENCODING
- SWISH_PROP_INT
- SWISH_PROP_MIME
- SWISH_PROP_MTIME
- SWISH_PROP_NWORDS
- SWISH_PROP_PARSER
- SWISH_PROP_RANK
- SWISH_PROP_RECCNT
- SWISH_PROP_SIZE
- SWISH_PROP_STRING
- SWISH_PROP_TITLE
- SWISH_RD_BUFFER_SIZE
- SWISH_SPECIAL_ARG
- SWISH_STACK_SIZE
- SWISH_SWISH_FORMAT
- SWISH_TITLE_METANAME
- SWISH_TITLE_TAG
- SWISH_TOKENIZE
- SWISH_TOKENPOS_BUMPER
- SWISH_TOKEN_LIST_SIZE
- SWISH_TRUE
- SWISH_UNDEFINED_METATAGS
- SWISH_UNDEFINED_XML_ATTRIBUTES
- SWISH_URL_LENGTH
- SWISH_VERSION
- SWISH_WORDS
- SWISH_XAPIAN_FORMAT
BUGS AND LIMITATIONS
libswish3 is not yet ported to Windows.
AUTHOR
Peter Karman perl@peknet.com
COPYRIGHT
Copyright 2010 Peter Karman.
This file is part of libswish3.
libswish3 is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
libswish3 is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.