NAME

WWW::Crawl4AI - Perl client and fallback orchestrator for Crawl4AI

VERSION

version 0.001

SYNOPSIS

use WWW::Crawl4AI;

my $crawler = WWW::Crawl4AI->new(
  base_url         => 'http://localhost:11235',
  cloakbrowser_url => $ENV{CLOAKBROWSER_CDP_URL},  # optional
  proxy_url        => $ENV{CRAWL4AI_PROXY_URL},    # optional
  fallback         => 'auto',
);

my $result = $crawler->markdown('https://example.com');

say $result->markdown;
say $result->backend;       # crawl4ai_plain / crawl4ai_stealth / crawl4ai_cloakbrowser
say $result->final_url;
say $result->cost_class;    # cheap / browser / stealth / paid
say $result->attempts_json;

DESCRIPTION

A Perl interface to a self-hosted Crawl4AI service that escalates crawling through a visible strategy chain and returns a normalized, agent-friendly WWW::Crawl4AI::Result.

Crawl4AI does the fetch and Markdown extraction; WWW::Crawl4AI decides policy. Rather than hiding fallback inside Crawl4AI, every attempt is modelled on the Perl side as a WWW::Crawl4AI::Attempt, so a caller can see which backend won, what it cost, and — on failure — exactly why.

The chain

markdown/crawl walk strategies in cost order and stop at the first result that WWW::Crawl4AI::Detect rates good:

crawl4ai_plain        cheap     headless text mode
crawl4ai_browser      browser   full JS render, wait for networkidle
crawl4ai_stealth      stealth   enable_stealth + random user agent
crawl4ai_cloakbrowser stealth   attach to CloakBrowser via cdp_url   (if configured)
crawl4ai_proxy        paid      stealth via proxy_config             (if configured)
external_callback     paid      user coderef                         (if configured)

For raw, single-shot REST access without the chain, use WWW::Crawl4AI::Client directly.

strategy_chain

A WWW::Crawl4AI::StrategyChain object. Holds the strategy registry. Override "_build_strategy_chain" in a subclass to change defaults or pre-build a specific chain.

strategies

Arrayref of instantiated, applicable WWW::Crawl4AI::Strategy objects in execution order. Lazily built via "_strategies_for".

base_url

Crawl4AI server URL. Default $ENV{CRAWL4AI_URL}, then $ENV{CRAWL4AI_BASE_URL}, then http://localhost:11235.

api_token

Optional bearer token ($ENV{CRAWL4AI_API_TOKEN}).

cloakbrowser_url

CloakBrowser CDP endpoint ($ENV{CLOAKBROWSER_CDP_URL}). When set, the crawl4ai_cloakbrowser strategy joins the chain.

proxy_url

Proxy URL ($ENV{CRAWL4AI_PROXY_URL}). When set, the crawl4ai_proxy strategy joins the chain.

callback

Optional coderef, called as ->($url, %opts) as the last resort; should return a page-shaped hashref or undef. When set, the external_callback strategy joins the chain.

fallback

Chain selection: 'auto' (default; all applicable strategies), 'plain'/'none' (Plain only), or an arrayref of backend names to run in that explicit order, e.g. ['crawl4ai_plain', 'crawl4ai_stealth'].

timeout

Per-request timeout in seconds (default 120), passed to the client.

min_markdown

Override the thin-content threshold (characters) for classification. Can also be given per call as markdown($url, min_markdown => N).

client

The underlying WWW::Crawl4AI::Client. Lazily built from base_url / api_token / timeout; inject your own to share a UA or change transport.

classify_signals

Returns the signals hashref for a normalized page. Default calls "signals" in WWW::Crawl4AI::Detect. Override in a subclass to substitute an alternative classifier.

classify_why_failed

Returns the most-specific failure token for a page. Default calls "why_failed" in WWW::Crawl4AI::Detect. Override in a subclass to substitute an alternative classifier.

crawl

markdown

my $result = $crawler->markdown('https://example.com');
my $result = $crawler->crawl( url => 'https://example.com' );

Run the strategy chain against one URL and return a WWW::Crawl4AI::Result. Both are the same chain; the result always carries both markdown and html. Accepts a single positional URL or named arguments with a url key.

The returned Result is never undef — on total failure $r->ok is false, $r->why_failed explains the last attempt, and $r->attempts holds the full history.

deep_crawl

my $results = $crawler->deep_crawl('https://example.com');
my $results = $crawler->deep_crawl(
  'https://example.com',
  max_pages  => 50,
  max_depth  => 3,
  same_host  => 1,
  url_filter => sub { $_[0] !~ m{/login} },
  on_page    => sub { my ( $result, $depth ) = @_; ... },
  min_markdown => 200,            # any crawl() option is forwarded
);

Breadth-first crawl that follows the "urls" in WWW::Crawl4AI::Result of each good page. Each page goes through the full strategy chain, so deep_crawl is just "crawl" applied across a frontier — the per-page escalation and attempt history are unchanged.

Returns an arrayref of WWW::Crawl4AI::Result in visit order (the first is the start URL). URLs are deduplicated with the fragment stripped, so map-style #lat/lon anchors don't count as separate pages. Options:

max_pages (default 25) — hard cap on pages crawled.
url_filter — coderef ($url) -> bool; return false to skip a URL.
on_page — coderef ($result, $depth) called as each page completes (for streaming or progress).

Any remaining options (such as min_markdown) are forwarded to each "crawl". Note that the strategy chain is fixed per crawler instance — a per-call fallback is not honoured here; set it on the constructor.

health

True if the Crawl4AI server answers GET /health.

screenshot

pdf

html

execute_js

llm

token

Single-URL action endpoints, delegated straight to the underlying WWW::Crawl4AI::Client (they do not run the strategy chain). screenshot and pdf return raw bytes; html the preprocessed HTML; execute_js a normalized page with js_result; llm an answer string (needs a server-side LLM provider); token a JWT hash. See WWW::Crawl4AI::Client for arguments.

available_backends

Arrayref of backend names currently in the chain.

detect

Hashref describing what the orchestrator can reach: crawl4ai, cloakbrowser, proxy, callback and the active backends. Used by www-crawl4ai-doctor.

SEE ALSO

WWW::Crawl4AI::Client, WWW::Crawl4AI::Result, WWW::Crawl4AI::Detect, Net::Async::Crawl4AI, https://github.com/unclecode/crawl4ai

SUPPORT

Issues

Please report bugs and feature requests on GitHub at https://github.com/Getty/p5-www-crawl4ai/issues.

CONTRIBUTING

Contributions are welcome! Please fork the repository and submit a pull request.

AUTHOR

Torsten Raudssus <torsten@raudssus.de> https://raudss.us/

COPYRIGHT AND LICENSE

This software is copyright (c) 2026 by Torsten Raudssus.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.