NAME

WWW::Crawl4AI::Detect - service detection and content-quality classification for Crawl4AI

VERSION

version 0.004

SYNOPSIS

use WWW::Crawl4AI::Detect ();

my $sig = WWW::Crawl4AI::Detect::signals($page);
# { js_required => 0, blocked => 1, captcha => 0, thin_html => 0, http_error => 0 }

if ( !WWW::Crawl4AI::Detect::is_good($page) ) {
  my $why = WWW::Crawl4AI::Detect::why_failed($page);  # 'bot_wall_detected'
}

WWW::Crawl4AI::Detect::probe_cloakbrowser('http://localhost:9222');  # 0/1

DESCRIPTION

This module provides the classifier that decides whether a normalized page (as produced by WWW::Crawl4AI::Client) is genuinely useful, and the probes that decide which backends go into the strategy chain. The functions are pure and nothing is exported, so call them by their fully qualified names.

signals

Given a normalized page, returns a hashref of boolean signals: js_required, blocked, captcha, thin_html, http_error. Accepts min_markdown => N to override the thin-content threshold ($WWW::Crawl4AI::Detect::MIN_MARKDOWN, default 500).

Master rule: content volume decides. A bot-wall, a JS shell, and a captcha gate all replace the page content, so they are thin by construction. Every signal derived from the visible rendered text (the markdown) therefore fires only on a thin page. On a content-rich page the same words ("enable JavaScript" in a footer, "unusual traffic" quoted in an article, a privacy note mentioning reCAPTCHA) are incidental and never discard a successful scrape.

Structural fingerprints are exempt and fire regardless of size, because a real content page never carries them. There are three: WAF tokens in the HTML markup (__cf_chl, datadome), a "Just a moment" / "Access denied" <title>, and a redirect whose final_url (the post-redirect URL, falling back to url) is a known WAF or captcha challenge endpoint.
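The master rule can be illustrated with a minimal sketch. This is not the module's implementation: the function name sketch_blocked, the page fields markdown and html, the character-count definition of "thin", and all three regexes are assumptions made for illustration. Only the default threshold of 500 and the example tokens come from the text above.

```perl
use strict;
use warnings;

# Assumed threshold, mirroring the documented default of 500.
my $MIN_MARKDOWN = 500;

# Illustrative patterns only; the module's real regexes may differ.
my $RE_BODY_BLOCK = qr/unusual traffic|access denied/i;
my $RE_HTML_TOKEN = qr/__cf_chl|datadome/i;
my $RE_TITLE_WALL = qr{<title>\s*(?:Just a moment|Access denied)}i;

sub sketch_blocked {
    my ($page) = @_;
    my $md   = $page->{markdown} // '';   # assumed field name
    my $html = $page->{html}     // '';   # assumed field name

    # Structural fingerprints fire regardless of page size.
    return 1 if $html =~ $RE_HTML_TOKEN;
    return 1 if $html =~ $RE_TITLE_WALL;

    # Body phrases only count when the page is thin
    # (assumed here to mean "fewer characters than the threshold").
    my $thin = length($md) < $MIN_MARKDOWN;
    return 1 if $thin && $md =~ $RE_BODY_BLOCK;

    return 0;
}
```

In this sketch a long article that merely quotes "unusual traffic" is not flagged, while a short page with the same phrase is, and a page of any size carrying __cf_chl in its markup is.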

js_required: a thin page whose markdown asks the visitor to enable JavaScript, i.e. a JS shell that never rendered.

blocked: a bot-wall body phrase on a thin page ($RE_BLOCK), OR a WAF token in the HTML, OR a <title> WAF banner, OR a redirect to a WAF challenge URL (/cdn-cgi/challenge, __cf_chl, /challenge-platform/, datadome, geo.captcha-delivery.com, /px/captcha, perimeterx). HTTP status is not part of this signal. It lives on the http_error axis, so a bare 403 reads http_403 while a Cloudflare body reads bot_wall_detected.

captcha: a captcha marker (markdown or HTML markup) on a thin page, OR a redirect to a captcha provider's own verification endpoint (google.com/recaptcha, /recaptcha/api, hcaptcha.com). A captcha marker on a content-rich page (cookie-banner note, embedded comment-form reCAPTCHA, Turnstile login box) is not a wall, because the real content is present.
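The redirect checks for blocked and captcha can be sketched as a lookup on the final URL. The URL fragments are the ones listed above; the function name sketch_redirect_signals, the way the fragments are combined into regexes, and the use of the defined-or operator for the url fallback are assumptions, not the module's code.

```perl
use strict;
use warnings;

# URL fragments from the documentation; the regex form is a guess.
my $RE_WAF_URL = qr{
      /cdn-cgi/challenge | __cf_chl | /challenge-platform/
    | datadome | geo\.captcha-delivery\.com | /px/captcha | perimeterx
}xi;
my $RE_CAPTCHA_URL = qr{google\.com/recaptcha | /recaptcha/api | hcaptcha\.com}xi;

sub sketch_redirect_signals {
    my ($page) = @_;

    # Prefer the post-redirect URL, fall back to the requested one.
    my $url = $page->{final_url} // $page->{url} // '';

    return {
        blocked => ( $url =~ $RE_WAF_URL     ? 1 : 0 ),
        captcha => ( $url =~ $RE_CAPTCHA_URL ? 1 : 0 ),
    };
}

# Usage: a request that ended on a Cloudflare challenge path.
my $sig = sketch_redirect_signals({
    url       => 'https://example.com/',
    final_url => 'https://example.com/cdn-cgi/challenge-platform/h/b',
});
# $sig->{blocked} is 1 and $sig->{captcha} is 0 here.
```

Because these checks look only at the URL, they are independent of page size, which matches the "structural fingerprint" exemption described earlier.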

is_good

Returns true when the page passes all checks: success is not explicitly false, there is no soft or hard HTTP failure, and no negative signal is set.

why_failed

Returns the most specific failure reason as a short token (captcha, bot_wall_detected, js_required, http_NNN, thin_content) or undef when the page is good.
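One way to picture "most specific" is a fixed precedence over the signals hashref. The sketch below assumes the precedence follows the order in which the tokens are listed above, that thin_html maps to thin_content, and that the HTTP status is passed in separately. None of this is confirmed by the module's source, and sketch_why_failed is a hypothetical name.

```perl
use strict;
use warnings;

# Map a signals hashref (keys as documented for signals()) to a
# single reason token. The ordering is an assumption.
sub sketch_why_failed {
    my ( $sig, $status ) = @_;

    return 'captcha'           if $sig->{captcha};
    return 'bot_wall_detected' if $sig->{blocked};
    return 'js_required'       if $sig->{js_required};
    return "http_$status"      if $sig->{http_error};
    return 'thin_content'      if $sig->{thin_html};   # assumed mapping

    return;    # no failure: undef in scalar context
}

# Usage: the signals shown in the SYNOPSIS.
my $why = sketch_why_failed(
    { js_required => 0, blocked => 1, captcha => 0, thin_html => 0, http_error => 0 },
    200,
);
# $why is 'bot_wall_detected'
```

Under this ordering a Cloudflare body served with a 403 reports bot_wall_detected rather than http_403, which agrees with the distinction drawn in the description of blocked.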

probe_cloakbrowser

True if a CloakBrowser CDP endpoint answers GET /json/version. Query params on the URL (e.g. ?fingerprint=...) are stripped before probing. Pass ua => $lwp and/or timeout => $secs to control the probe.
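A probe of this kind can be sketched with the core HTTP::Tiny module. Note that this swaps in HTTP::Tiny for illustration, whereas the documented ua option takes an LWP user agent. The helper names, the 3-second default timeout, and the way the query string is stripped are assumptions rather than the module's behaviour.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Drop any query string or fragment, trim trailing slashes, and
# append the CDP version path.
sub sketch_probe_url {
    my ($endpoint) = @_;
    ( my $base = $endpoint ) =~ s/[?#].*\z//s;
    $base =~ s{/+\z}{};
    return "$base/json/version";
}

# Return 1 if the endpoint answers the version request with a 2xx
# status, 0 otherwise (including connection failures and timeouts).
sub sketch_probe {
    my ( $endpoint, %opt ) = @_;
    my $http = HTTP::Tiny->new( timeout => $opt{timeout} // 3 );
    my $res  = $http->get( sketch_probe_url($endpoint) );
    return $res->{success} ? 1 : 0;
}

# Usage: the query parameter never reaches the probed URL.
my $target = sketch_probe_url('http://localhost:9222?fingerprint=abc');
# $target is 'http://localhost:9222/json/version'
```

HTTP::Tiny reports connection errors through the response hash instead of throwing, so the probe degrades to 0 when nothing is listening.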

detect_proxy_env

Returns $ENV{CRAWL4AI_PROXY_URL} or undef.

SUPPORT

Issues

Please report bugs and feature requests on GitHub at https://github.com/Getty/p5-www-crawl4ai/issues.

CONTRIBUTING

Contributions are welcome! Please fork the repository and submit a pull request.

AUTHOR

Torsten Raudssus <torsten@raudssus.de> https://raudss.us/

COPYRIGHT AND LICENSE

This software is copyright (c) 2026 by Torsten Raudssus.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.