NAME
WWW::Crawl4AI::DeepCrawlIterator - breadth-first iterator for deep_crawl, separating frontier management from crawl logic
VERSION
version 0.001
SYNOPSIS
my $iter = WWW::Crawl4AI::DeepCrawlIterator->new(
crawler => $crawler,
start_url => 'https://example.com',
max_pages => 50,
max_depth => 3,
same_host => 1,
url_filter => sub { $_[0] !~ m{/login} },
);
while ( my $page = $iter->next ) {
my ( $result, $depth ) = @$page;
print "crawled a page at depth $depth\n";
}
DESCRIPTION
Iterates over the pages produced by "deep_crawl" in WWW::Crawl4AI. It encapsulates the BFS frontier management: deduplication, same-host filtering, and depth capping. Each call to "next" performs one crawl (through the strategy chain) and schedules that page's links for future traversal.
This class replaces the inline BFS loop in WWW::Crawl4AI::deep_crawl, which makes alternative crawl orders possible and lets the frontier logic be tested in isolation.
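The frontier management described above can be illustrated with a small standalone sketch. This is not the module's implementation: the %links_of graph, host_of and bfs_order are hypothetical names invented for the example, and an in-memory link table stands in for real crawl results.

```perl
use strict;
use warnings;

# Hypothetical in-memory link graph standing in for real crawl results.
my %links_of = (
    'https://example.com/'  => [ 'https://example.com/a', 'https://example.com/b', 'https://other.test/x' ],
    'https://example.com/a' => [ 'https://example.com/b', 'https://example.com/c' ],
    'https://example.com/b' => [ 'https://example.com/' ],
    'https://example.com/c' => [ 'https://example.com/d' ],
);

# Extract the host of an http(s) URL (simplified: ignores userinfo and ports).
sub host_of {
    my ($url) = @_;
    return $url =~ m{^https?://([^/:?#]+)}i ? lc $1 : '';
}

# Breadth-first walk honouring max_pages, max_depth and same_host.
sub bfs_order {
    my (%args) = @_;
    my $start = $args{start_url};
    my @queue = ( [ $start, 0 ] );    # frontier of [url, depth] pairs
    my %seen  = ( $start => 1 );      # deduplication set
    my @visited;

    while ( @queue && @visited < $args{max_pages} ) {
        my ( $url, $depth ) = @{ shift @queue };
        push @visited, [ $url, $depth ];

        # Depth cap: pages at max_depth are visited but their links are not scheduled.
        next if $depth >= $args{max_depth};

        for my $link ( @{ $links_of{$url} || [] } ) {
            next if $seen{$link}++;
            next if $args{same_host} && host_of($link) ne host_of($start);
            push @queue, [ $link, $depth + 1 ];
        }
    }
    return @visited;
}

my @order = bfs_order(
    start_url => 'https://example.com/',
    max_pages => 50,
    max_depth => 2,
    same_host => 1,
);
print "$_->[1] $_->[0]\n" for @order;
```

With max_depth set to 2, the page at /c is visited at depth 2 but its link to /d is never scheduled, and the off-host link is dropped by the same-host check.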
ATTRIBUTES
crawler
A WWW::Crawl4AI instance (or any object with a crawl method).
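Because any object with a crawl method is accepted, a stand-in crawler may be convenient for testing. The following is a minimal sketch only: Local::StubCrawler is a hypothetical class, and the hashref it returns is an assumed shape rather than the real WWW::Crawl4AI::Result.

```perl
use strict;
use warnings;

# Hypothetical stand-in crawler; the returned hashref is an assumed shape,
# not the real WWW::Crawl4AI::Result.
package Local::StubCrawler {
    sub new { my ($class) = @_; return bless { calls => [] }, $class }

    sub crawl {
        my ( $self, $url ) = @_;
        push @{ $self->{calls} }, $url;         # record what was requested
        return { url => $url, links => [] };    # assumed result shape
    }
}

my $crawler = Local::StubCrawler->new;

# Duck-typing check: anything that can('crawl') qualifies.
die "crawler must provide a crawl method\n" unless $crawler->can('crawl');

my $result = $crawler->crawl('https://example.com');
print "crawled: $result->{url}\n";
```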
start_url
Starting URL for the crawl.
max_pages
Hard cap on the number of pages crawled.
max_depth
Maximum link-following depth; the start URL is depth 0.
same_host
If true, only follow links on the same host as the start URL.
url_filter
Optional coderef ($url) -> bool; return false to skip a URL.
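Since url_filter is a single coderef, several rules can be combined into one before it is passed in. The sketch below shows one way to do that; all_of, $no_login and $no_assets are hypothetical names used for illustration and are not part of the module.

```perl
use strict;
use warnings;

# Each filter takes a URL and returns true to keep it, false to skip it.
my $no_login  = sub { $_[0] !~ m{/login} };
my $no_assets = sub { $_[0] !~ m{\.(?:png|jpe?g|gif|css|js)(?:[?#]|\z)}i };

# Combine filters: a URL is kept only if every filter accepts it.
sub all_of {
    my @filters = @_;
    return sub {
        my ($url) = @_;
        for my $f (@filters) {
            return 0 unless $f->($url);
        }
        return 1;
    };
}

my $url_filter = all_of( $no_login, $no_assets );

my @candidates = (
    'https://example.com/docs',
    'https://example.com/login?next=/docs',
    'https://example.com/logo.png',
);
my @kept = grep { $url_filter->($_) } @candidates;
print "$_\n" for @kept;
```

The combined coderef can then be supplied as the url_filter argument in place of a single inline sub.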
on_page
Optional coderef ($result, $depth) called as each page completes.
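As an illustration of the ($result, $depth) signature, the callback below tallies pages per depth. It is a sketch under stated assumptions: the callback is invoked by hand here, and plain strings stand in for real result objects.

```perl
use strict;
use warnings;

my %pages_at_depth;

# Callback with the documented ($result, $depth) signature.
my $on_page = sub {
    my ( $result, $depth ) = @_;
    $pages_at_depth{$depth}++;
};

# Simulated invocations; plain strings stand in for real result objects.
$on_page->( 'home',  0 );
$on_page->( 'about', 1 );
$on_page->( 'team',  1 );

printf "depth %d: %d page(s)\n", $_, $pages_at_depth{$_}
    for sort { $a <=> $b } keys %pages_at_depth;
```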
METHODS
next
Returns an arrayref [$result, $depth] for the next page, or undef when the crawl is exhausted or max_pages has been reached.
results
Returns an arrayref of the WWW::Crawl4AI::Result objects accumulated so far.
is_exhausted
True when the queue is empty or max_pages has been reached.
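How next, results and is_exhausted relate to one another can be sketched with a small mock. Local::MockIterator is a hypothetical class that only mirrors the documented contract over a fixed list of pages; it performs no crawling, and plain strings stand in for WWW::Crawl4AI::Result objects.

```perl
use strict;
use warnings;

# Hypothetical mock mirroring the documented method contract.
package Local::MockIterator {
    sub new {
        my ( $class, %args ) = @_;
        return bless {
            queue     => [ @{ $args{pages} } ],    # pending [result, depth] pairs
            max_pages => $args{max_pages},
            results   => [],
        }, $class;
    }

    # True when nothing is queued or the page cap has been hit.
    sub is_exhausted {
        my ($self) = @_;
        return !@{ $self->{queue} }
            || @{ $self->{results} } >= $self->{max_pages};
    }

    # Hand out the next [result, depth] pair, or undef when finished.
    sub next {
        my ($self) = @_;
        return if $self->is_exhausted;
        my $page = shift @{ $self->{queue} };
        push @{ $self->{results} }, $page->[0];
        return $page;
    }

    sub results { return $_[0]{results} }
}

my $iter = Local::MockIterator->new(
    pages     => [ [ 'home', 0 ], [ 'about', 1 ], [ 'team', 2 ] ],
    max_pages => 2,
);

my @depths;
while ( my $page = $iter->next ) {
    my ( $result, $depth ) = @$page;
    push @depths, $depth;
}

printf "pages: %d, exhausted: %s\n",
    scalar @{ $iter->results }, $iter->is_exhausted ? 'yes' : 'no';
```

Here the loop stops after two pages because max_pages is reached, even though a third page is still queued; results then holds exactly the pages that next handed out.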
SUPPORT
Issues
Please report bugs and feature requests on GitHub at https://github.com/Getty/p5-www-crawl4ai/issues.
CONTRIBUTING
Contributions are welcome! Please fork the repository and submit a pull request.
AUTHOR
Torsten Raudssus <torsten@raudssus.de> https://raudss.us/
COPYRIGHT AND LICENSE
This software is copyright (c) 2026 by Torsten Raudssus.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.