NAME

Xango::Tutorial - Learn How To Write Crawlers With Xango

DECIDE ON BROKER TYPE

There are two types of Xango::Broker -- the "pull" and "push" types. The "pull" crawler is implemented as Xango::Broker::Pull, and the "push" is implemented as Xango::Broker::Push.

Xango::Broker::Pull crawler periodically "pulls" the jobs that needs to be processed. Xango::Broker::Push crawler waits for jobs to be pushed to the queue.

You should decide on the type of crawler to use depending on the application that you want to build. If you have a constant pool of jobs that need to be processed, then usually the pull model is better. However, if your crawler's processing is triggered by an event (for example, a user submits a form that contains the URI to crawl) then the push crawler is the way to go.

IMPLEMENTING THE PULL CRAWLER

A "pull" crawler goes into a loop that periodically checks for new jobs to process from a source. If you have a crawler that should be crawling a possibly infinite list of jobs, then this is probably the type of crawler that you want to use.

In this scenario, the handler component needs to implement the following states:

retrieve_jobs

IMPLEMENTING THE PUSH CRAWLER

Xango::Broker::Push->spawn(
  Alias => 'broker',
  
);

# In your code elsewhere...
POE::Kernel->post('broker', 'enqueue_job', $job);

PERFORMANCE

CUSTOM HTTP COMPONENT

For maximum performance, you will most likely need to re-write your HTTP fetcher component (by default POE::Component::Client::HTTP is used). This is due to the fact that, most crawler are daemon that perpetually retrieves data, and this requires maximum tuning from the code-writer to reduce the memory impact from downloading arbitrary content.

For example, http://xango.razil.jp uses a disk-based variation of POE::Component::Client::HTTP that writes data as soon as it can to minimize the memory usage.

For casual users, though, this may not be a problem.

AUTHOR

Copyright (c) 2005 Daisuke Maki <dmaki@cpan.org>