Search::Circa::Parser - provides the functions Circa uses to parse HTML pages


use Search::Circa::Indexer;
my $index = Search::Circa::Indexer->new;


This module uses the facilities of HTML::Parser. It is called by Search::Circa::Indexer to index each document. The main method is look_at.

Public Class Interface

new Search::Circa::Indexer object

Create a new Search::Circa::Parser object with the properties of the indexer instance.

look_at url, idc, idr, lastModif, url_local, categorieAuto, niveau, categorie

Index a URL. The work done is:

  • Test whether the URL is valid. If it is not, return -1.

  • Get the page and add each word found, with the weight set in the constructor.

  • If the maximum link level has not been reached, add each link found for the next indexing pass.


Parameters:

  • $url: URL to read

  • $idc: Id of the URL in the table links

  • $idr: Id of the account that owns the URL

  • $lastModif (optional): If this parameter is set, Circa does no work on the page if it is older than this date.

  • $url_local (optional): Local URL used to reach the file

  • $categorieAuto (optional): If $categorieAuto is true, Circa creates and sets the category of the URL from the directory structure found in it. Ex: will create and set the category for this URL to Societe / StValentin

    If $categorieAuto is false, $categorie is used.

  • $niveau (optional): Level of the current link.

  • $categorie (optional): See $categorieAuto.

Return (-1,0) if the URL is not valid; otherwise return the number of words and the number of links found.
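The $categorieAuto behaviour described above can be pictured with a minimal sketch. It is not the module's code: category_from_url is a hypothetical helper, the example.com URL is made up for illustration, and the assumption is that every directory in the URL path becomes one level of the category.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Search::Circa: build a category path
# from the directory components of a URL.
sub category_from_url {
    my ($url) = @_;

    # Strip the scheme and host, keeping only the path.
    (my $path = $url) =~ s{^[a-z][a-z0-9+.-]*://[^/]*}{}i;

    # Drop any query string or fragment.
    $path =~ s{[?#].*$}{};

    # Split the path into its non-empty components.
    my @parts = grep { length } split m{/}, $path;

    # Assumption: when the path does not end in a slash, the last
    # component is a file name and is not part of the category.
    pop @parts if @parts && $path !~ m{/$};

    return join ' / ', @parts;
}

# Made-up URL, for illustration only.
print category_from_url('http://www.example.com/Societe/StValentin/index.html'), "\n";
# prints "Societe / StValentin"
```

The real method may normalise names or look categories up in the database; the sketch only shows the path-to-category idea.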

set_agent local

Set the user agent for the Circa robot. If local is set to 0 or $self->{ConfigMoteur}->{'temporate'}==0, LWP::UserAgent is used. Otherwise LWP::RobotUA is used.
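The selection rule above can be written as a small decision function. This is only a sketch of the rule as documented here, not the body of set_agent: agent_class_for is a hypothetical name, and it returns the class name as a string without loading or instantiating LWP.

```perl
use strict;
use warnings;

# Hypothetical helper: name of the agent class selected by the rule
# described above. $temporate stands for $self->{ConfigMoteur}->{'temporate'}.
sub agent_class_for {
    my ($local, $temporate) = @_;

    # A false value for either flag selects the plain user agent.
    return 'LWP::UserAgent' if !$local || !$temporate;

    # Otherwise the robot agent is selected.
    return 'LWP::RobotUA';
}

print agent_class_for(0, 1), "\n";   # prints "LWP::UserAgent"
print agent_class_for(1, 0), "\n";   # prints "LWP::UserAgent"
print agent_class_for(1, 1), "\n";   # prints "LWP::RobotUA"
```

LWP::RobotUA consults robots.txt and spaces out its requests, which is presumably why it is the choice when 'temporate' is enabled.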

analyse data, facteur, ref_hash

Split data into words and store them in %$ref_hash with their score. The hash structure is ('mots'=>facteur), that is, (word => weight).

  • data: buffer to analyse

  • $facteur: weight to assign to each word found

  • %l: hash in which the result is stored

Return ref_hash
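A minimal sketch of what such a word-scoring step might look like follows. It is not the module's implementation: analyse_sketch is a hypothetical name, and the tokenisation (runs of \w characters, lower-cased) and the accumulation of weights for repeated words are assumptions.

```perl
use strict;
use warnings;

# Hypothetical stand-in for analyse, not the module's actual code.
sub analyse_sketch {
    my ($data, $factor, $ref_hash) = @_;
    $ref_hash ||= {};

    # Assumption: a word is a run of letters, digits or underscores.
    for my $word ($data =~ /(\w+)/g) {
        # Assumption: words are lower-cased and repeated words
        # accumulate their weight.
        $ref_hash->{lc $word} += $factor;
    }
    return $ref_hash;
}

my %score;
analyse_sketch('Circa indexes pages', 2,  \%score);   # e.g. body text
analyse_sketch('Circa',               10, \%score);   # e.g. title text
printf "%s => %d\n", $_, $score{$_} for sort keys %score;
# prints:
#   circa => 12
#   indexes => 2
#   pages => 2
```

Calling the function several times with different weights on the same hash is one way a word in a title could end up scoring higher than the same word in body text.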


Method called for each HTML tag found in HTML pages.


Method called for the content of each tag in HTML pages.

Check whether the URL $links will be added to Circa. The URL must begin with $self->host_indexed, and its extension must not be one of doc, zip, ps, gif, jpg, gz, pdf, eps, png, deb, xls, ppt, class, GIF, css, js, wav, mid.

If $links is accepted, return the URL. Otherwise return 0.
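The acceptance rule can be sketched as follows. This is an illustration under stated assumptions rather than the module's code: link_accepted is a hypothetical name, $host_indexed is passed in as a plain string in place of $self->host_indexed, the extension is taken to be whatever follows the last dot of the path, and the comparison is case-sensitive because the list above names both gif and GIF.

```perl
use strict;
use warnings;

# Extensions listed in the text above; links ending in one are rejected.
my %rejected = map { $_ => 1 }
    qw(doc zip ps gif jpg gz pdf eps png deb xls ppt class GIF css js wav mid);

# Hypothetical stand-in for the link check, not the module's actual code.
sub link_accepted {
    my ($link, $host_indexed) = @_;

    # The URL must begin with the indexed host prefix.
    return 0 unless index($link, $host_indexed) == 0;

    # Assumption: ignore any query string or fragment, then take the text
    # after the last dot of the last path segment as the extension.
    (my $path = $link) =~ s{[?#].*$}{};
    if ($path =~ m{\.([^./]+)$}) {
        return 0 if $rejected{$1};
    }
    return $link;
}

my $host = 'http://www.example.com';   # made-up host, for illustration
print link_accepted("$host/news/index.html", $host), "\n";
# prints "http://www.example.com/news/index.html"
print link_accepted("$host/img/logo.gif", $host), "\n";                # prints "0"
print link_accepted('http://other.example.org/index.html', $host), "\n"; # prints "0"
```

Returning the URL itself on success lets the caller test the result for truth and reuse it in one step.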


