NAME
WWW::Crawl4AI::Detect - service detection and content-quality classification for Crawl4AI
VERSION
version 0.003
SYNOPSIS
use WWW::Crawl4AI::Detect ();
my $sig = WWW::Crawl4AI::Detect::signals($page);
# { js_required => 0, blocked => 1, captcha => 0, thin_html => 0, http_error => 0 }
if ( !WWW::Crawl4AI::Detect::is_good($page) ) {
    my $why = WWW::Crawl4AI::Detect::why_failed($page); # 'bot_wall_detected'
}
WWW::Crawl4AI::Detect::probe_cloakbrowser('http://localhost:9222'); # 0/1
DESCRIPTION
This module provides the classifier that decides whether a normalized page (as produced by WWW::Crawl4AI::Client) is genuinely useful, and the probes that decide which backends go into the strategy chain. All entry points are plain functions; nothing is exported, so call them by their fully qualified names.
signals
Given a normalized page, returns a hashref of boolean signals: js_required, blocked, captcha, thin_html, http_error. Accepts min_markdown => N to override the thin-content threshold ($WWW::Crawl4AI::Detect::MIN_MARKDOWN, default 500).
blocked reflects content fingerprints (Cloudflare / DataDome / "Just a moment" bodies) — not HTTP status, which lives on its own http_error axis. It is also raised when the page's final_url (the post-redirect URL, with fallback to url) matches a known WAF / bot-management challenge endpoint (/cdn-cgi/challenge, __cf_chl, /challenge-platform/, datadome, geo.captcha-delivery.com, /px/captcha, perimeterx) — many gates redirect to a challenge URL rather than embedding a widget. This is OR-ed in; a signal already raised by a body fingerprint is never cleared.
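As a rough illustration of the URL side of blocked, the check can be sketched as a list of patterns tried against the post-redirect URL. This is not the module's actual code: url_is_challenge and merge_blocked are hypothetical helper names, and the regexes are simply transcribed from the endpoint list above, so the real patterns may be stricter or looser.

```perl
use strict;
use warnings;

# Patterns transcribed from the documented endpoint list (assumed form;
# the module's own regexes may differ).
my @CHALLENGE_URL_PATTERNS = (
    qr{/cdn-cgi/challenge}i,
    qr{__cf_chl}i,
    qr{/challenge-platform/}i,
    qr{datadome}i,
    qr{geo\.captcha-delivery\.com}i,
    qr{/px/captcha}i,
    qr{perimeterx}i,
);

# Hypothetical helper: true when the final URL looks like a challenge endpoint.
# Falls back to url when final_url is missing or empty.
sub url_is_challenge {
    my ($page) = @_;
    my $url = $page->{final_url} || $page->{url} || '';
    for my $re (@CHALLENGE_URL_PATTERNS) {
        return 1 if $url =~ $re;
    }
    return 0;
}

# Hypothetical helper: OR the URL check into a body-derived signal.
# A signal that is already set is never cleared.
sub merge_blocked {
    my ( $blocked_by_body, $page ) = @_;
    return ( $blocked_by_body || url_is_challenge($page) ) ? 1 : 0;
}
```

Because the URL check is only ever OR-ed in, a page flagged by a body fingerprint stays flagged even when its final URL looks like an ordinary article URL.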
captcha means the page is captcha-walled, not merely that a captcha widget or the word "reCAPTCHA" appears somewhere. Context decides:
A thin page with a captcha marker anywhere (rendered markdown or HTML/script markup) is walled — a near-empty page that mentions a captcha is a JS-rendered gate.
A content-rich page is walled only when a markdown marker co-occurs with captcha-prompt language ("complete the captcha to continue", "I'm not a robot", "verify you are human", "checking your browser", "security check") — the wording a real captcha gate uses to address the visitor.
A content-rich page whose markdown mentions a captcha without prompt language (a cookie-banner / privacy-policy note that the site uses reCAPTCHA), or that carries the marker only in the HTML/script markup (an embedded comment-form reCAPTCHA, a Turnstile login box), is not walled — the real content is present.
A page whose final_url redirected to a CAPTCHA provider's own verification endpoint (google.com/recaptcha, /recaptcha/api, hcaptcha.com) is walled regardless of body content: the request left the origin and landed on the captcha provider.
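The four rules above can be sketched as one decision function. This is a minimal sketch under several assumptions, not the module's implementation: captcha_walled is a hypothetical name, the page is assumed to carry markdown and html keys, "thin" is assumed to mean markdown shorter than the 500-character default, and the marker regex is a deliberately simple stand-in.

```perl
use strict;
use warnings;

my $MIN_MARKDOWN = 500;    # documented default for the thin-content threshold

# Illustrative stand-ins; the real marker and prompt lists are assumptions here.
my $MARKER   = qr/captcha|turnstile/i;
my $PROMPT   = qr/complete the captcha to continue|I'm not a robot|verify you are human|checking your browser|security check/i;
my $PROVIDER = qr{google\.com/recaptcha|/recaptcha/api|hcaptcha\.com}i;

# Hypothetical helper: 1 when the page is captcha-walled, 0 otherwise.
sub captcha_walled {
    my ($page) = @_;
    my $md   = $page->{markdown} // '';
    my $html = $page->{html}     // '';
    my $url  = $page->{final_url} || $page->{url} || '';

    # Redirected to the captcha provider itself: walled regardless of body.
    return 1 if $url =~ $PROVIDER;

    my $thin = length($md) < $MIN_MARKDOWN;

    # Thin page: a marker anywhere (markdown or markup) is enough.
    return 1 if $thin && ( $md =~ $MARKER || $html =~ $MARKER );

    # Content-rich page: needs a markdown marker plus prompt language.
    return 1 if !$thin && $md =~ $MARKER && $md =~ $PROMPT;

    return 0;
}
```

Under these assumptions, a long article that carries a reCAPTCHA widget only in its HTML falls through every branch and returns 0, which matches the "real content is present" case.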
is_good
True when the page passed all checks: success not explicitly false, no soft/hard HTTP failure, and no negative signal.
why_failed
Returns the most specific failure reason as a short token (captcha, bot_wall_detected, js_required, http_NNN, thin_content) or undef when the page is good.
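One way to picture "most specific reason" is a fixed priority ladder over the signals hashref. The sketch below assumes the priority follows the order in which the tokens are listed above, and that the HTTP status is available as a separate value; both are assumptions, and why_failed_sketch is a hypothetical name rather than the module's function.

```perl
use strict;
use warnings;

# Hypothetical helper: map a signals hashref plus an HTTP status to a token.
# The priority order is an assumption based on the documented token list.
sub why_failed_sketch {
    my ( $sig, $status ) = @_;
    return 'captcha'           if $sig->{captcha};
    return 'bot_wall_detected' if $sig->{blocked};
    return 'js_required'       if $sig->{js_required};
    return "http_$status"      if $sig->{http_error};
    return 'thin_content'      if $sig->{thin_html};
    return;    # no negative signal: the page counts as good
}

my $reason = why_failed_sketch( { blocked => 1, thin_html => 1 }, 200 );
print defined $reason ? "$reason\n" : "good\n";
```

With this ordering a page that is both blocked and thin reports the bot wall, since the thin body is a symptom of the wall rather than the cause.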
probe_cloakbrowser
True if a CloakBrowser CDP endpoint answers GET /json/version. Query params on the URL (e.g. ?fingerprint=...) are stripped before probing. Pass ua => $lwp and/or timeout => $secs to control the probe.
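A probe of this shape can be sketched in two steps: strip the query string, then request /json/version. The sketch below swaps in the core HTTP::Tiny module instead of the LWP user agent the real function accepts, and version_url and probe_sketch are hypothetical names; the 3-second default timeout is likewise an assumption.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical helper: drop any query string or fragment, then point the
# URL at the /json/version path.
sub version_url {
    my ($endpoint) = @_;
    ( my $base = $endpoint ) =~ s/[?#].*\z//s;
    $base =~ s{/+\z}{};    # avoid a double slash before json/version
    return "$base/json/version";
}

# Hypothetical probe: 1 when the endpoint answers with a 2xx status, else 0.
sub probe_sketch {
    my ( $endpoint, %opt ) = @_;
    my $http = HTTP::Tiny->new( timeout => $opt{timeout} // 3 );
    my $res  = $http->get( version_url($endpoint) );
    return $res->{success} ? 1 : 0;
}
```

Calling probe_sketch('http://localhost:9222?fingerprint=abc', timeout => 1) would request http://localhost:9222/json/version, so options encoded in the query string never reach the CDP endpoint.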
detect_proxy_env
Returns $ENV{CRAWL4AI_PROXY_URL} or undef.
SUPPORT
Issues
Please report bugs and feature requests on GitHub at https://github.com/Getty/p5-www-crawl4ai/issues.
CONTRIBUTING
Contributions are welcome! Please fork the repository and submit a pull request.
AUTHOR
Torsten Raudssus <torsten@raudssus.de> https://raudss.us/
COPYRIGHT AND LICENSE
This software is copyright (c) 2026 by Torsten Raudssus.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.