NAME
WWW::Crawl4AI - Perl client and fallback orchestrator for Crawl4AI
VERSION
version 0.001
SYNOPSIS
use v5.10; # enables say
use WWW::Crawl4AI;
my $crawler = WWW::Crawl4AI->new(
base_url => 'http://localhost:11235',
cloakbrowser_url => $ENV{CLOAKBROWSER_CDP_URL}, # optional
proxy_url => $ENV{CRAWL4AI_PROXY_URL}, # optional
fallback => 'auto',
);
my $result = $crawler->markdown('https://example.com');
say $result->markdown;
say $result->backend; # crawl4ai_plain / crawl4ai_stealth / crawl4ai_cloakbrowser
say $result->final_url;
say $result->cost_class; # cheap / browser / stealth / paid
say $result->attempts_json;
DESCRIPTION
A Perl interface to a self-hosted Crawl4AI service that escalates crawling through a visible strategy chain and returns a normalized, agent-friendly WWW::Crawl4AI::Result.
Crawl4AI does the fetch and Markdown extraction; WWW::Crawl4AI decides policy. Rather than hiding fallback inside Crawl4AI, every attempt is modelled on the Perl side as a WWW::Crawl4AI::Attempt, so a caller can see which backend won, what it cost, and — on failure — exactly why.
The chain
markdown/crawl walk strategies in cost order and stop at the first result that WWW::Crawl4AI::Detect rates good:
backend                cost class  what it does
crawl4ai_plain         cheap       headless text mode
crawl4ai_browser       browser     full JS render, wait for networkidle
crawl4ai_stealth       stealth     enable_stealth + random user agent
crawl4ai_cloakbrowser  stealth     attach to CloakBrowser via cdp_url (if configured)
crawl4ai_proxy         paid        stealth via proxy_config (if configured)
external_callback      paid        user coderef (if configured)
For raw, single-shot REST access without the chain, use WWW::Crawl4AI::Client directly.
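The escalation described above can be pictured with a minimal, self-contained sketch. It is not the module's implementation: the hashref strategy records, the run_chain helper, and the is_good check (a crude stand-in for WWW::Crawl4AI::Detect) are hypothetical names used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for WWW::Crawl4AI::Detect: a page counts as "good"
# when it has at least $min characters of Markdown. The real classifier
# looks at more signals than length.
sub is_good {
    my ( $page, $min ) = @_;
    return 0 unless defined $page;
    return length( $page->{markdown} // '' ) >= $min ? 1 : 0;
}

# Walk the strategies in cost order, record every attempt, and stop at the
# first good page. Each strategy is a plain hashref with a name, a cost class
# and a fetch coderef (assumed shapes, not the module's objects).
sub run_chain {
    my ( $url, $strategies, $min ) = @_;
    my @attempts;
    for my $s (@$strategies) {
        my $page = $s->{fetch}->($url);
        my $ok   = is_good( $page, $min );
        push @attempts, { backend => $s->{name}, cost_class => $s->{cost}, ok => $ok };
        return { ok => 1, backend => $s->{name}, page => $page, attempts => \@attempts }
            if $ok;
    }
    return { ok => 0, attempts => \@attempts };
}

# Two fake strategies: the cheap one returns thin content, the next succeeds.
my @chain = (
    { name => 'crawl4ai_plain',   cost => 'cheap',   fetch => sub { return +{ markdown => 'thin' } } },
    { name => 'crawl4ai_browser', cost => 'browser', fetch => sub { return +{ markdown => 'x' x 500 } } },
);

my $outcome = run_chain( 'https://example.com', \@chain, 200 );
print "winner: $outcome->{backend} after ", scalar @{ $outcome->{attempts} }, " attempts\n";
```

The point of keeping the attempt list outside the fetchers is that a failed run still returns the full history, which is the property the Perl-side modelling is meant to provide.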
strategy_chain
A WWW::Crawl4AI::StrategyChain object. Holds the strategy registry. Override "_build_strategy_chain" in a subclass to change defaults or pre-build a specific chain.
strategies
Arrayref of instantiated, applicable WWW::Crawl4AI::Strategy objects in execution order. Lazily built via "_strategies_for".
base_url
Crawl4AI server URL. Defaults to $ENV{CRAWL4AI_URL}, then $ENV{CRAWL4AI_BASE_URL}, then http://localhost:11235.
api_token
Optional bearer token ($ENV{CRAWL4AI_API_TOKEN}).
cloakbrowser_url
CloakBrowser CDP endpoint ($ENV{CLOAKBROWSER_CDP_URL}). When set, the crawl4ai_cloakbrowser strategy joins the chain.
proxy_url
Proxy URL ($ENV{CRAWL4AI_PROXY_URL}). When set, the crawl4ai_proxy strategy joins the chain.
callback
Optional coderef, called as ->($url, %opts) as the last resort; should return a page-shaped hashref or undef. When set, the external_callback strategy joins the chain.
fallback
Chain selection: 'auto' (default; all applicable strategies), 'plain' or 'none' (the crawl4ai_plain strategy only), or an arrayref of backend names to run in that explicit order, e.g. ['crawl4ai_plain', 'crawl4ai_stealth'].
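As a rough illustration of the three forms, the following sketch resolves a fallback setting into an ordered backend list. The backends_for helper and the @applicable list are hypothetical; in particular, whether the module drops unconfigured backends from an explicit arrayref is not stated here, so the sketch simply keeps the list as given.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a fallback setting into an ordered backend list.
# @applicable stands for the configured strategies, already in cost order.
sub backends_for {
    my ( $fallback, @applicable ) = @_;
    $fallback = 'auto' unless defined $fallback;

    # An arrayref is taken as an explicit order and kept exactly as given
    # (an assumption of this sketch).
    return @$fallback if ref($fallback) eq 'ARRAY';

    # 'plain' and 'none' both mean: no escalation beyond the plain strategy.
    return ('crawl4ai_plain') if $fallback eq 'plain' || $fallback eq 'none';

    return @applicable if $fallback eq 'auto';
    die "unknown fallback setting: $fallback\n";
}

my @applicable = qw(crawl4ai_plain crawl4ai_browser crawl4ai_stealth);

print join( ',', backends_for( 'auto', @applicable ) ), "\n";
print join( ',', backends_for( 'none', @applicable ) ), "\n";
print join( ',', backends_for( [ 'crawl4ai_plain', 'crawl4ai_stealth' ], @applicable ) ), "\n";
```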
timeout
Per-request timeout in seconds (default 120), passed to the client.
min_markdown
Override the thin-content threshold (characters) for classification. Can also be given per call as markdown($url, min_markdown => N).
client
The underlying WWW::Crawl4AI::Client. Lazily built from base_url / api_token / timeout; inject your own to share a UA or change transport.
classify_signals
Returns the signals hashref for a normalized page. Default calls "signals" in WWW::Crawl4AI::Detect. Override in a subclass to substitute an alternative classifier.
classify_why_failed
Returns the most-specific failure token for a page. Default calls "why_failed" in WWW::Crawl4AI::Detect. Override in a subclass to substitute an alternative classifier.
crawl
markdown
my $result = $crawler->markdown('https://example.com');
my $result = $crawler->crawl( url => 'https://example.com' );
Run the strategy chain against one URL and return a WWW::Crawl4AI::Result. The two methods run the same chain, and the result always carries both markdown and html. Each accepts either a single positional URL or named arguments with a url key.
The returned Result is never undef — on total failure $r->ok is false, $r->why_failed explains the last attempt, and $r->attempts holds the full history.
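Because the Result is always defined, callers can branch on ok rather than test for undef. A small sketch of that pattern follows. The summarize helper, the My::FakeResult stand-in class, and the 'thin_content' failure token are hypothetical, and treating attempts as an arrayref is an assumption; only the accessor names ok, backend, why_failed and attempts come from this document.

```perl
use strict;
use warnings;

# Summarize any object that offers the accessors named above.
# Treating attempts as an arrayref is an assumption of this sketch.
sub summarize {
    my ($r) = @_;
    my $tries = scalar @{ $r->attempts };
    return $r->ok
        ? sprintf( 'ok via %s (%d attempts)', $r->backend, $tries )
        : sprintf( 'failed: %s (%d attempts)', $r->why_failed, $tries );
}

# Tiny stand-in class so the sketch runs without a Crawl4AI server.
{
    package My::FakeResult;
    sub new        { my ( $class, %args ) = @_; return bless { %args }, $class }
    sub ok         { return $_[0]{ok} }
    sub backend    { return $_[0]{backend} }
    sub why_failed { return $_[0]{why_failed} }
    sub attempts   { return $_[0]{attempts} }
}

# 'thin_content' is a made-up failure token used only for this demo.
my $failed = My::FakeResult->new( ok => 0, why_failed => 'thin_content', attempts => [ 1, 2, 3 ] );
my $good   = My::FakeResult->new( ok => 1, backend => 'crawl4ai_stealth', attempts => [ 1, 2, 3 ] );

print summarize($failed), "\n";
print summarize($good),   "\n";
```

With a real crawler, the same helper would be called on the value returned by markdown or crawl, with no undef check in between.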
deep_crawl
my $results = $crawler->deep_crawl('https://example.com');
my $results = $crawler->deep_crawl(
'https://example.com',
max_pages => 50,
max_depth => 3,
same_host => 1,
url_filter => sub { $_[0] !~ m{/login} },
on_page => sub { my ( $result, $depth ) = @_; ... },
min_markdown => 200, # any crawl() option is forwarded
);
Breadth-first crawl that follows the "urls" in WWW::Crawl4AI::Result of each good page. Each page goes through the full strategy chain, so deep_crawl is just "crawl" applied across a frontier — the per-page escalation and attempt history are unchanged.
Returns an arrayref of WWW::Crawl4AI::Result in visit order (the first is the start URL). URLs are deduplicated with the fragment stripped, so map-style #lat/lon anchors don't count as separate pages. Options:
max_pages (default 25): hard cap on pages crawled.
max_depth (default 2): link-following depth; the start URL is depth 0.
same_host (default true): only follow links on the start URL's host.
url_filter: coderef ($url) -> bool; return false to skip a URL.
on_page: coderef ($result, $depth) called as each page completes (for streaming or progress).
Any remaining options (such as min_markdown) are forwarded to each "crawl". Note that the strategy chain is fixed per crawler instance — a per-call fallback is not honoured here; set it on the constructor.
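The frontier handling described in this section (breadth-first order, a depth limit, a page cap, and deduplication with the fragment stripped) can be sketched in plain Perl. This is an illustration under assumptions, not the module's code: dedup_key, bfs_order and the $links_of coderef (which stands in for "crawl the page and return its urls") are hypothetical, and same_host, url_filter and the good-page check are left out for brevity.

```perl
use strict;
use warnings;

# Strip the fragment so "page#a" and "page#b" count as one URL.
sub dedup_key {
    my ($url) = @_;
    ( my $key = $url ) =~ s/#.*\z//s;
    return $key;
}

# Breadth-first walk over a link graph, returning URLs in visit order.
sub bfs_order {
    my ( $start, $links_of, %opt ) = @_;
    my $max_pages = $opt{max_pages} // 25;
    my $max_depth = $opt{max_depth} // 2;

    my %seen  = ( dedup_key($start) => 1 );
    my @queue = ( [ dedup_key($start), 0 ] );
    my @visited;

    while ( @queue && @visited < $max_pages ) {
        my ( $url, $depth ) = @{ shift @queue };
        push @visited, $url;
        next if $depth >= $max_depth;    # do not follow links any deeper
        for my $link ( $links_of->($url) ) {
            my $key = dedup_key($link);
            next if $seen{$key}++;       # already queued or visited
            push @queue, [ $key, $depth + 1 ];
        }
    }
    return \@visited;
}

# A small in-memory link graph used instead of real crawling.
my %graph = (
    'https://example.com/'  => [ 'https://example.com/a', 'https://example.com/a#top', 'https://example.com/b' ],
    'https://example.com/a' => [ 'https://example.com/c' ],
    'https://example.com/b' => [ 'https://example.com/#map' ],
);
my $order = bfs_order( 'https://example.com/', sub { @{ $graph{ $_[0] } || [] } } );
print join( "\n", @$order ), "\n";
```

In this run the links https://example.com/a#top and https://example.com/#map are dropped as duplicates of pages already seen, so four pages are visited, starting with the start URL.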
health
True if the Crawl4AI server answers GET /health.
screenshot
pdf
html
execute_js
llm
token
Single-URL action endpoints, delegated straight to the underlying WWW::Crawl4AI::Client (they do not run the strategy chain). screenshot and pdf return raw bytes; html the preprocessed HTML; execute_js a normalized page with js_result; llm an answer string (needs a server-side LLM provider); token a JWT hash. See WWW::Crawl4AI::Client for arguments.
available_backends
Arrayref of backend names currently in the chain.
detect
Hashref describing what the orchestrator can reach: crawl4ai, cloakbrowser, proxy, callback and the active backends. Used by www-crawl4ai-doctor.
SEE ALSO
WWW::Crawl4AI::Client, WWW::Crawl4AI::Result, WWW::Crawl4AI::Detect, Net::Async::Crawl4AI, https://github.com/unclecode/crawl4ai
SUPPORT
Issues
Please report bugs and feature requests on GitHub at https://github.com/Getty/p5-www-crawl4ai/issues.
CONTRIBUTING
Contributions are welcome! Please fork the repository and submit a pull request.
AUTHOR
Torsten Raudssus <torsten@raudssus.de> https://raudss.us/
COPYRIGHT AND LICENSE
This software is copyright (c) 2026 by Torsten Raudssus.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.