NAME
NexTrieve::HTML - convert HTML to NexTrieve Document objects
SYNOPSIS
use NexTrieve;
$ntv = NexTrieve->new( | {method => value} );
$converter = $ntv->HTML( | {method => value} );
$index = $ntv->Index( $resource )->htmlsimple;
$docseq = $index->Docseq;
foreach my $file (<*.html>) {
$docseq->add( $converter->Document( $file ) );
}
$docseq->done;
DESCRIPTION
The HTML object of the Perl support for NexTrieve. Do not create it directly; use the HTML method of the NexTrieve object instead.
The "html2ntvml" script is basically a directly configurable and executable wrapper for the NexTrieve::HTML module.
CONVERSION PROCESS
The conversion of an HTML-file consists of basically five phases:
- creating the NexTrieve::HTML object
- setting the appropriate parameters
- obtaining the HTML from the indicated source
- setting up the content hash (an internal representation of the HTML)
- serializing the content hash to XML
More specifically, the following steps are performed.
- create NexTrieve::HTML object
You must create the NexTrieve::HTML object by calling the "HTML" method of the NexTrieve object. You can set any parameters directly with the creation of the object by specifying a hash with method names and values. Or you can set the parameters later by calling the methods on the object.
- setting parameters
After the object is created, you will have to decide if a preprocessor should process the HTML before anything else (see preprocessor), which fields of the content hash should appear as what attributes (see field2attribute) and/or texttypes (see field2texttype).
You should also think about extra attributes (see extra_attribute) and/or texttypes (see extra_texttype) that should be added to the XML that are not part of the original HTML. And you should consider if any conversions should be done on the information that is destined to become an attribute (see attribute_processor) or a texttype (see texttype_processor), before they are actually serialized into XML.
Once all of this information is set up, you may find it handy to use the Resource method to create the basic resource-file that NexTrieve could use to index the XML generated by these settings.
- read HTML from the indicated source
The next step is getting a copy of the HTML for which to create XML that can be indexed. This is done when the Document object is created. The HTML can either be specified directly, as a filename or as a URL. A content hash with the "id" field (either the filename or the URL, or whatever was specified directly) and the "date" field (last modified info from the file or the URL, or whatever was specified directly) is initialized.
- optional binary check
If you are unsure whether the input really _is_ HTML, you can specify a binarycheck to be executed. After any preprocessing is done, the binary check is performed if so specified. If the input is considered to be binary, then the entire input is ignored, an error message is issued and no XML is returned.
- pre-process the HTML
If a preprocessor routine was specified, it will be executed now. The preprocessor routine takes a reference to the content hash and the HTML as its input and returns the (possibly adapted) HTML. The preprocessor routine has access to the content hash and is able to add, change or remove fields from the content hash as it sees fit.
No action is performed if no preprocessor routine is specified. Please note however that by calling methods such as asp, php and mhonarc, you are in fact specifying a preprocessor routine.
-
At this point, any HTML-comments of the form <!-- ... --> are removed. Then the HTML-tags that may contain information that you do _not_ want to be indexed are removed, together with their content. You can specify which HTML-tags qualify for this with the removecontainers method. By default, only the <script..>...</script> containers are removed, causing any Javascript to be removed.
- extract "known" information from the HTML
The next step involves searching the HTML for known pieces of information. As of this writing these include:
- title, as found in <TITLE>...</TITLE>
- encoding information, as found in <META HTTP-EQUIV....>
- other information found in <META name="" content=""> tags
At the end of this step, the content hash can be enriched with the following fields (in alphabetical order):
- author
- description
- encoding
- generator
- keywords
- title
If no encoding information is found, the HTML is considered to be encoded in "ISO-8859-1", an encoding that is a superset of "us-ascii" that in practice seems to be the most appropriate to use for HTML.
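As a rough illustration of this extraction step, the sketch below pulls the title, the named META fields and the charset out of a string with regular expressions. It is not the module's actual code: the subroutine name extract_known and its regular expressions are assumptions made for this example, and it only handles double-quoted attributes given in name-then-content order.

```perl
use strict;
use warnings;

# Hypothetical helper: collect "known" fields from HTML into a hash.
sub extract_known {
    my ($html) = @_;
    my %content;

    # title, as found in <TITLE>...</TITLE> (case-insensitive, may span lines)
    if ( $html =~ m{<title[^>]*>(.*?)</title>}is ) {
        ( $content{title} = $1 ) =~ s/\s+/ /g;
    }

    # <META name="..." content="..."> tags; only the fields listed above are kept
    while ( $html =~ m{<meta\s+name="([^"]+)"\s+content="([^"]*)"[^>]*>}ig ) {
        my ( $name, $value ) = ( lc($1), $2 );
        $content{$name} = $value
          if $name =~ /^(?:author|description|generator|keywords)$/;
    }

    # charset from <META HTTP-EQUIV=...>, else the documented default
    ( $content{encoding} ) = $html =~ m{<meta\s+http-equiv=[^>]*charset=([\w-]+)}i;
    $content{encoding} ||= 'ISO-8859-1';

    return \%content;
}

my $info = extract_known(<<'HTML');
<html><head><title>Hello
 World</title>
<meta name="Keywords" content="perl, search">
</head><body>text</body></html>
HTML
print "$_: $info->{$_}\n" for sort keys %$info;
```

Real-world HTML is far less regular than this sample, so a production version would need more forgiving patterns or a proper parser.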
-
Before the HTML can be converted to XML, all of the remaining HTML-tags need to be removed from the HTML. First, anything between <HEAD>...</HEAD> is removed. Then any existing <HTML>, <BODY>, </BODY> and </HTML> tags themselves are removed. Whatever remains is considered to be the basis of the final text of the XML.
Then the HTML-tags that are considered to be "display" containers, such as <B>, <I> and <U>, are removed without being replaced by a space. This is done because it often happens that only a single letter of a word is highlighted with these HTML-tags. If these tags were replaced by a space, such words would be broken up.
If you have any HTML-tags that you would like to be processed the same way, you can specify these with the displaycontainers method.
After this, all other tags are replaced by spaces.
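The difference between the two kinds of tag removal can be sketched in a few lines of Perl. This is only an approximation of the behaviour described above, not the module's implementation; the subroutine name strip_tags and the final whitespace clean-up are assumptions added for the example.

```perl
use strict;
use warnings;

# Display containers named in the documentation's default list.
my @display = qw(a b em font i strike strong tt u);

# Hypothetical helper: reduce body HTML to plain text.
sub strip_tags {
    my ($html) = @_;
    my $names = join '|', @display;

    # Display tags vanish without leaving a space, so words stay intact.
    $html =~ s{</?(?:$names)\b[^>]*>}{}gi;

    # Every other tag becomes a single space.
    $html =~ s{<[^>]*>}{ }g;

    # Collapse whitespace for readability (an extra step for this sketch).
    $html =~ s/\s+/ /g;
    $html =~ s/^ | $//g;
    return $html;
}

print strip_tags('<p><b>1</b>234</p><p>next<br>line</p>'), "\n";
```

The word boundary after the tag name keeps <br> from being mistaken for <b>, so the line break still separates "next" and "line".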
- serializing the XML
When all of this is done, all of the information in the content hash as well as the remaining text (from the original HTML) are fed to a generic serialization routine that is also used by the NexTrieve::RFC822 and NexTrieve::DBI modules.
This serialization routine looks for any extra attributes and/or texttypes and processor routines, executes them in the correct order and generates the XML for the HTML that was provided on input.
If you want to access the XML, you can call the xml method, which is inherited from the NexTrieve module.
OBJECT CREATION METHODS
These methods create objects from the NexTrieve::HTML object.
Docseq
$docseq = $converter->Docseq( @file );
$docseq->write_file( filename );
$index = $ntv->Index( $resource );
$converter->Docseq( $index->Docseq,@file );
The Docseq method allows you to create a NexTrieve document sequence object (or NexTrieve::Docseq object) out of one or more HTML-files. This can either be used to be directly indexed by NexTrieve (through the NexTrieve::Index object) or to create the XML of the document sequence in a file for indexing at a later stage.
The first (optional) input parameter is an (already existing) NexTrieve::Docseq object that should be used. This can either be a special purpose NexTrieve::Docseq object as created by the NexTrieve::Index module, or a NexTrieve::Docseq object that was created earlier and to which a second run of HTML-files needs to be added.
The rest of the input parameters indicate the HTML-sources that should be indexed. These can either be just filenames, or URLs of the form: file://directory/file.html or http://server/filename.html.
For more information, see the NexTrieve::Docseq module.
Document
$document = $converter->Document( file | html | [list] , | '' | 'file' | 'url' | sub {} );
The Document method performs the actual conversion from HTML to XML and returns a NexTrieve::Document object that may become part of a NexTrieve document sequence (see Docseq).
The first input parameter specifies the source of the HTML. It can consist of:
- HTML itself
If the second parameter is specified and is set to '', then the first input parameter is considered to be the HTML to be processed. If the second input parameter is not specified at all, the first input parameter will be considered to be the HTML if a newline character can be found.
- a filename
If the second input parameter is specified as "file", then the first input parameter is considered to be a filename. If no second input parameter is specified, then the first parameter is considered to be a filename if no newline character can be found.
- a URL
If the second input parameter is specified as "url", then the first input parameter is considered to be a URL from which to fetch the HTML. If the second input parameter is not specified, but the first input parameter starts with what looks like a protocol specification (^\w+://), then the first input parameter is considered to be a URL. Two protocols are currently supported: file:// and http://.
- a reference to a list
If the first input parameter is a reference to a list, then that list is supposed to contain:
- the HTML to be processed
- the "id" to be used to identify this HTML
- the epoch time when the HTML was last modified (when available)
- an indication of the source of the HTML (for error messages, if any)
- anything else
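The rules that apply when no second input parameter is given can be summarised in a small sketch. The subroutine guess_source is hypothetical and does not exist in the module, and the order in which the checks are made is an assumption, since the text above does not state which rule wins when more than one applies.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the documented defaults when no second
# parameter is given. The order of the checks is an assumption.
sub guess_source {
    my ($source) = @_;
    return 'list' if ref($source) eq 'ARRAY';   # [ html, id, epoch, source, ... ]
    return 'html' if $source =~ /\n/;           # a newline means literal HTML
    return 'url'  if $source =~ m{^\w+://};     # looks like a protocol
    return 'file';                              # otherwise a filename
}

for my $source ( "index.html", "http://server/filename.html",
                 "<html>\n<body>hi</body></html>\n" ) {
    ( my $shown = $source ) =~ s/\n.*//s;       # first line only, for display
    printf "%-30s => %s\n", $shown, guess_source($source);
}
```

Passing the second parameter explicitly avoids relying on this guesswork, for instance for HTML that happens to fit on a single line.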
If the second input parameter is specified as a reference to an (anonymous) subroutine, then that routine is called. That "fetch" routine should expect to be passed:
- whatever was specified with the first input parameter
- whatever other input parameters were specified
The fetch routine is expected to return in scalar context just the HTML that should be processed. In list context, it is expected to return:
- the HTML to be processed
- the "id" to be assigned to this HTML (usually the first input parameter)
- the epoch time when the HTML was last modified (when available)
- an indication of the source of the HTML (for error messages, if any)
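A fetch routine honouring both calling contexts might look like the sketch below. The name fetch_local is invented for this example, and returning nothing when the file cannot be opened is an assumption, as the text above does not say how a fetch routine should signal failure.

```perl
use strict;
use warnings;

# Hypothetical fetch routine: reads the HTML from a local file.
# It receives whatever was passed as the first parameter, plus any extras.
sub fetch_local {
    my ( $file, @extra ) = @_;

    open my $fh, '<', $file or return;          # assumption: return nothing on failure
    my $html  = do { local $/; <$fh> };         # slurp the whole file
    my $mtime = ( stat $fh )[9];                # last-modified epoch time
    close $fh;

    # Scalar context: just the HTML. List context: html, id, epoch, source.
    return wantarray ? ( $html, $file, $mtime, "file $file" ) : $html;
}

# With NexTrieve installed, the routine would be passed as the second parameter:
#   $document = $converter->Document( 'page.html', \&fetch_local );
```

Perl's wantarray makes it possible to serve both the scalar and the list form from a single routine.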
Resource
$resource = $converter->Resource( | {method => value} );
The "Resource" method allows you to create a NexTrieve::Resource object from the internal structure of the NexTrieve::HTML object. More specifically, it takes the information as specified with the extra_attribute, extra_texttype, field2attribute and field2texttype methods and creates the <indexcreation> section of the NexTrieve resource file as specified on http://www.nextrieve.com/usermanual/2.0.0/ntvresourcefile.stm .
For more information, see the documentation of the NexTrieve::Resource module itself.
OTHER METHODS
These methods change aspects of the NexTrieve::HTML object.
asp
$converter->asp;
The "asp" method makes sure that <%...%> processor tags are removed from the HTML before it is processed. It basically defines a preprocessor routine to do so.
The "asp" method returns the object itself, so that it can be used in one-liners.
attribute_processor
$converter->attribute_processor( 'attribute', key | sub {} );
The "attribute_processor" allows you to specify a subroutine that will process the contents of a specific attribute before it becomes serialized in XML.
The first input parameter specifies the name of the attribute as it will be serialized. Please note that this may not be the same as the name of the content hash field.
The second input parameter specifies the processor routine. See "PROCESSOR ROUTINES" for more information.
binarycheck
$converter->binarycheck( true | false );
$binarycheck = $converter->binarycheck;
The "binarycheck" method sets a flag in the object to indicate whether a check for binary content should be performed. If the flag is set and binary content is assumed to be found, conversion will be aborted and an error will be set.
This method is mainly intended if you are unsure about the cleanliness of the list of files that you want to process. If you are not sure that all files listed are really HTML, it is probably a good idea to set this flag.
DefaultInputEncoding
$encoding = $converter->DefaultInputEncoding;
$converter->DefaultInputEncoding( encoding );
See the NexTrieve.pm module for more information about the "DefaultInputEncoding" method.
displaycontainers
$converter->displaycontainers( qw(a b em font i strike strong tt u) );
@displaycontainer= $converter->displaycontainers;
The "displaycontainers" method specifies which HTML-tags should be considered to deal with the display of HTML, rather than with the structure of HTML. During the conversion from HTML to XML, all HTML-tags that are considered to be display containers are completely removed from the HTML. This causes the HTML "<B>1</B>234" to be converted to the single word "1234" rather than to the two words "1 234".
Please note that all HTML-tags that are not known to be display containers or removable containers (see removecontainers) are replaced by a space during the conversion process.
The default display containers are: a b em font i strike strong tt u .
extra_attribute
$converter->extra_attribute( [\$var | sub {}, attribute spec] | 'reset' );
The "extra_attribute" method specifies one or more attributes that should be added to the serialized XML, created from sources outside of the original HTML.
Each input parameter specifies a single attribute as a reference to a list of parameters. These parameters consist of:
- a reference to a variable or subroutine
- an attribute specification
If the first parameter in the list consists of a reference to a variable, then the value of that variable will be used for that attribute at the moment the XML is serialized. This could e.g. be an external counter variable.
If the first parameter in the list consists of a reference to a subroutine, then that subroutine is called as a processor routine when the XML is serialized. The first input parameter is the contents of the "id" field in the content hash and can e.g. be used by the processor routine for a database lookup. See "PROCESSOR ROUTINES" for more information.
The rest of the list consists of an attribute specification. See "ATTRIBUTE SPECIFICATION" for more information.
As a special function, the string "reset" may also be specified as the first input parameter to the method: it will then remove any extra attribute specifications from the object that were specified previously in the lifetime of the object.
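The two kinds of source reference can be illustrated without the module itself. In the sketch below the attribute names "docnum" and "owner", the lookup table and the subroutine owner_for_id are all invented for the example; the loop at the end merely imitates by hand what is described above as happening at serialization time.

```perl
use strict;
use warnings;

# A counter kept outside the converter; its value is read when XML is serialized.
my $docnum = 0;

# Hypothetical lookup keyed on the "id" field (a stand-in for a database query).
my %owner_of = ( 'index.html' => 'liz' );
sub owner_for_id {
    my ($id) = @_;
    return $owner_of{$id} // 'unknown';
}

# One list reference per extra attribute: source first, then the specification.
my @extra = (
    [ \$docnum,       qw(docnum number key-unique 1) ],
    [ \&owner_for_id, qw(owner string notkey 1) ],
);

# With NexTrieve installed these would be handed to the converter:
#   $converter->extra_attribute( @extra );

# Resolve each source by hand to show what the two kinds of reference yield.
$docnum++;
for my $spec (@extra) {
    my ( $source, $name ) = @$spec;
    my $value = ref($source) eq 'CODE' ? $source->('index.html') : $$source;
    print "$name = $value\n";
}
```

Because the variable is passed by reference, the value used is the one it holds when it is read, not the one it held when the specification was built.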
extra_texttype
$converter->extra_texttype( [\$var | sub {}, texttype spec] | 'reset' );
The "extra_texttype" method specifies one or more texttypes that should be added to the serialized XML, created from sources outside of the original HTML.
Each input parameter specifies a single texttype as a reference to a list of parameters. These parameters consist of:
- a reference to a variable or subroutine
- a texttype specification
If the first parameter in the list consists of a reference to a variable, then the value of that variable will be used for that texttype at the moment the XML is serialized. This could e.g. be an externally stored title.
If the first parameter in the list consists of a reference to a subroutine, then that subroutine is called as a processor routine when the XML is serialized. The first input parameter is the contents of the "id" field in the content hash and can e.g. be used by the processor routine for a database lookup. See "PROCESSOR ROUTINES" for more information.
The rest of the list consists of a texttype specification. See "TEXTTYPE SPECIFICATION" for more information.
As a special function, the string "reset" may also be specified as the first input parameter to the method: it will then remove any extra texttype specifications from the object that were specified previously in the lifetime of the object.
field2attribute
$converter->field2attribute( 'title','date',['id',attribute spec] );
The "field2attribute" specifies how a key in the content hash should be mapped to an attribute in the serialized XML.
Each input parameter specifies a single mapping. It consists either of the key (which causes that key to be serialized as an attribute with the same name) or of a reference to a list.
If a parameter is a reference to a list, then the first element of that list is the key in the content hash. The rest of the list is then considered to be an attribute specification (see "ATTRIBUTE SPECIFICATION" for more information).
So, for example, just the string 'title' would cause the content of the key "title" in the content hash to be serialized as the attribute "title".
As another example, the list "[qw(id filename string key-unique 1)]" would cause the content of the "id" to be serialized as the attribute "filename" and cause a complete resource-specification if the Resource method is called.
field2texttype
$converter->field2texttype( 'title',[qw(description whatitis 200)],'keywords' );
The "field2texttype" specifies how a key in the content hash should be mapped to a texttype in the serialized XML.
Each input parameter specifies a single mapping. It consists either of the key (which causes that key to be serialized as a texttype with the same name) or of a reference to a list.
If a parameter is a reference to a list, then the first element of that list is the key in the content hash. The rest of the list is then considered to be a texttype specification (see "TEXTTYPE SPECIFICATION" for more information).
So, for example, just the string 'title' would cause the content of the key "title" in the content hash to be serialized as the texttype "title".
As another example, the list "[qw(description whatitis 200)]" would cause the content of the "description" key to be serialized as the texttype "whatitis" and cause a complete resource-specification if the Resource method is called.
htmlsimple
$converter->htmlsimple;
The "htmlsimple" method is a convenience method for quickly setting up field2attribute and field2texttype mappings. It is intended to handle simple HTML-pages. Currently, the following mappings are performed:
- key "id" serialized as "filename" attribute
- key "title" serialized as both attribute and texttype with same name
- keys "description" and "keywords" serialized as texttypes with same name
The "htmlsimple" method returns the object itself, so that it can be used in one-liners.
mhonarc
$converter->mhonarc;
The "mhonarc" method is a convenience method for quickly setting up field2attribute, field2texttype, attribute_processor and preprocessor settings. It is intended to handle HTML-pages that are created by MHonArc (see http://www.mhonarc.org for more information).
Currently, the following mappings are performed:
- a preprocessor that fills content hash with "subject", "date" and "title"
- a preprocessor that throws away anything that's not between <pre> and </pre>
- key "id" serialized as "mailbox" attribute
- key "date" serialized as "date" attribute, converted to "datestamp"
- key "subject" serialized as both an attribute and texttype with same name
- key "from" serialized as attribute with same name
The "mhonarc" method returns the object itself, so that it can be used in one-liners.
php
$converter->php;
The "php" method makes sure that <?...?> and <%...%> processor tags are removed from the HTML before it is processed. It basically defines a preprocessor routine to do so. Please note that the third way of embedding PHP, the <script language="php"...>...</script> container, is by default already handled by the removecontainers specification.
The "php" method returns the object itself, so that it can be used in one-liners.
preprocessor
$converter->preprocessor( \&preprocess );
$preprocessor = $converter->preprocessor;
The "preprocessor" method allows you to specify a subroutine that will be executed before any of the other conversions take place on the input HTML.
When specified, the subroutine should be ready to expect the following input parameters:
- reference to content hash
The content hash has been initialized with the "id" and "date" keys when the preprocessor is called. The preprocessor subroutine can make any changes to the content hash that it sees fit. An example of the use of a preprocessor subroutine is the mhonarc convenience method, which extracts subject, from and date information from the HTML and stores it in the content hash.
- HTML to be pre-processed
The second input parameter is the HTML that should be preprocessed. The result of this preprocessing should be returned by the subroutine or directly changed in the parameter passed.
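A preprocessor following this calling convention could be sketched as follows. The subroutine strip_nav, the <!-- nav --> and <!-- section: ... --> markers and the "section" field are all invented for this example; only the parameter order and the return value follow the description above.

```perl
use strict;
use warnings;

# Hypothetical preprocessor: drops a navigation block and records a field.
# The comment markers and the "section" field are invented for this example.
sub strip_nav {
    my ( $content, $html ) = @_;

    # Remember something about the page in the content hash.
    $content->{section} = $1 if $html =~ m{<!-- section: (\w+) -->};

    # Remove everything between the (hypothetical) navigation markers.
    $html =~ s{<!-- nav -->.*?<!-- /nav -->}{}gs;

    return $html;    # the adapted HTML
}

# With NexTrieve installed: $converter->preprocessor( \&strip_nav );

my %content = ( id => 'page.html', date => 0 );
print strip_nav( \%content,
    "<!-- section: news --><!-- nav --><a href='/'>Home</a><!-- /nav --><p>Body</p>" ), "\n";
print "section: $content{section}\n";
```

Since HTML-comments are only removed after preprocessing, a preprocessor can still use them as markers.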
removecontainers
$converter->removecontainers( qw(embed script) );
@removecontainer= $converter->removecontainers;
The "removecontainers" method specifies which HTML-tags, and their content, should be removed from the HTML when converting to XML. The difference from displaycontainers is that in this case, everything between the opening and closing HTML-tag is also removed.
The default HTML-tags are: embed script .
texttype_processor
$converter->texttype_processor( 'texttype', key | sub {} );
The "texttype_processor" allows you to specify a subroutine that will process the contents of a specific texttype before it becomes serialized in XML.
The first input parameter specifies the name of the texttype as it will be serialized. Please note that this may not be the same as the name of the content hash field.
The second input parameter specifies the processor routine. See "PROCESSOR ROUTINES" for more information.
ATTRIBUTE SPECIFICATION
An attribute specification can be very simple: just the name of the attribute, e.g. 'date'. If you would like to use the Resource method to create the <indexcreation> section of the NexTrieve resource-file, then it is wise to add the type of attribute, key and multiplicity information as well, as described in http://www.nextrieve.com/usermanual/2.0.0/ntvresourcefile.stm .
A complete attribute specification would be "'date','number','notkey','1'", which of course can be more easily expressed as "qw(date number notkey 1)".
Please note that future versions of NexTrieve may add more fields to the complete attribute specification. So watch this space for more info in the future.
PROCESSOR ROUTINES
Processor routines can be either a reference to an (anonymous) subroutine or a key to one of the available subroutines for doing standard conversions.
If a processor routine is a reference to an (anonymous) subroutine, then that subroutine should expect the following input parameters:
- the data to be processed
- the name of the attribute it will be serialized to
- the document object for which the XML will be serialized
The subroutine is expected to return the processed data.
If a processor routine is specified as a key, it must be one of the following:
- datestamp
Attempt to convert the input to a datestamp in the form YYYYMMDD. The Date::Parse module must be available for this to work.
- epoch
Attempt to convert the input to a Unix epoch time value (number of seconds since midnight Jan. 1st 1970 GMT). The Date::Parse module must be available for this to work.
- timestamp
Attempt to convert the input to a timestamp in the form YYYYMMDDHHMMSS. The Date::Parse module must be available for this to work.
Other keyed processor routines may be added in the future, so please check this space for additions.
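A custom processor routine, as opposed to one of the keyed routines, might look like the sketch below. The name epoch2datestamp is invented for this example. Unlike the built-in "datestamp" key it does not use Date::Parse: it only accepts input that is already an epoch value and leaves anything else untouched, which is a simplification chosen for the sketch.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical processor: turn an epoch time into a YYYYMMDD datestamp.
# Parameters follow the documented order: data, attribute name, document object.
sub epoch2datestamp {
    my ( $data, $name, $document ) = @_;
    return $data unless defined $data and $data =~ /^\d+$/;   # leave other input alone
    return strftime( '%Y%m%d', gmtime $data );
}

# With NexTrieve installed:
#   $converter->attribute_processor( 'date', \&epoch2datestamp );

print epoch2datestamp( 86400 * 365, 'date' ), "\n";
```

The conversion uses GMT, matching the definition of the epoch given above.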
TEXTTYPE SPECIFICATION
A texttype specification can be very simple: just the name of the texttype, e.g. 'title'. If you would like to use the Resource method to create the <indexcreation> section of the NexTrieve resource-file, then it is wise to add any extra information as well, as described in http://www.nextrieve.com/usermanual/2.0.0/ntvresourcefile.stm .
A complete texttype specification would be "'title','200'", which of course can be more easily expressed as "qw(title 200)". This would make the "title" texttype twice as important, by default, in exact searches as the other texttypes.
Please note that future versions of NexTrieve may add more fields to the complete texttype specification. So watch this space for more info in the future.
AUTHOR
Elizabeth Mattijsen, <liz@dijkmat.nl>.
Please report bugs to <perlbugs@dijkmat.nl>.
SUPPORT
NexTrieve is no longer being supported.
COPYRIGHT
Copyright (c) 1995-2003 Elizabeth Mattijsen <liz@dijkmat.nl>. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
SEE ALSO
The NexTrieve.pm and the other NexTrieve::xxx modules.