Changes for version 0.004 - 2026-06-11
- Detect: content volume is now the master signal. Every signal derived from a page's VISIBLE rendered text -- js_required, the blocked body-phrase arm, and the captcha marker -- now fires only on a THIN page. A content-rich page that merely mentions JavaScript (an "enable JavaScript" footer), quotes a bot-wall phrase ("unusual traffic", "access denied"), or carries captcha-prompt wording is no longer discarded: once 500+ chars of markdown are in hand, the scrape has succeeded, and words in the body cannot prove otherwise. STRUCTURAL fingerprints are unchanged and still fire regardless of size: WAF tokens in the HTML markup (__cf_chl, datadome), a "Just a moment" / "Access denied" <title>, and redirects to a known WAF / captcha challenge endpoint. Removes the rich-page captcha-PROMPT heuristic (and $RE_CAPTCHA_PROMPT) added in 0.003, which belonged to the same false-positive class: a word on a rendered page treated as evidence of a block.
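The ordering described in this entry can be sketched roughly as follows. This is a minimal illustration, not the distribution's actual code: the `classify` function, its argument names, the regex variable names, the returned labels (including `thin` for a short page with no matching phrase), and the example challenge-endpoint path are all assumptions made for the sketch. Only the 500-character threshold, the markup tokens, the title strings, and the body phrases come from the entry itself.

```perl
use strict;
use warnings;

# Hypothetical sketch: none of these names are taken from WWW::Crawl4AI.
my $THIN_LIMIT = 500;    # chars of markdown; the threshold named in the entry

# Structural fingerprints: checked on markup, <title> and final URL at any size.
my $RE_WAF_MARKUP = qr/__cf_chl|datadome/i;
my $RE_WAF_TITLE  = qr{<title[^>]*>\s*(?:Just a moment|Access denied)}i;
my $RE_CHALLENGE  = qr{/cdn-cgi/challenge-platform/}i;    # assumed example endpoint

# Visible-text signals: only trusted when the page is thin.
my $RE_JS_TEXT      = qr/enable JavaScript/i;
my $RE_BLOCKED_TEXT = qr/unusual traffic|access denied/i;
my $RE_CAPTCHA_TEXT = qr/captcha/i;

sub classify {
    my (%page) = @_;
    my $html = $page{html}      // '';
    my $md   = $page{markdown}  // '';
    my $url  = $page{final_url} // '';

    # 1. Structural evidence wins regardless of how much content came back.
    return 'blocked' if $html =~ $RE_WAF_MARKUP || $html =~ $RE_WAF_TITLE;
    return 'blocked' if $url =~ $RE_CHALLENGE;

    # 2. Content volume is the master signal: a rich page is a success,
    #    whatever words happen to appear in its body.
    return 'ok' if length($md) >= $THIN_LIMIT;

    # 3. Only a thin page is checked for body phrases.
    return 'captcha'     if $md =~ $RE_CAPTCHA_TEXT;
    return 'blocked'     if $md =~ $RE_BLOCKED_TEXT;
    return 'js_required' if $md =~ $RE_JS_TEXT;
    return 'thin';       # assumed label for a short page with no other signal
}

# A long page with a JavaScript footer passes; the same phrase alone does not.
my $rich = ('lorem ipsum ' x 60) . 'Please enable JavaScript for comments.';
print classify(markdown => $rich, html => '<title>Docs</title>'), "\n";    # ok
print classify(markdown => 'Please enable JavaScript.'), "\n";             # js_required
```

Running the structural checks first keeps a challenge page from slipping through just because it happens to carry a lot of text, while the early `ok` return is what stops body wording from discarding a page that already yielded enough markdown.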
Documentation
probe Crawl4AI / CloakBrowser / proxy reachability and print the chain
run the full WWW::Crawl4AI strategy chain against one URL
Modules
Perl client and fallback orchestrator for Crawl4AI
one strategy attempt in a WWW::Crawl4AI fallback chain
UA-agnostic REST client for the Crawl4AI Docker API
breadth-first iterator for deep_crawl, separating frontier management from crawl logic
service detection and content-quality classification for Crawl4AI
structured error class for WWW::Crawl4AI
markdown field resolution across Crawl4AI response shapes
builds Crawl4AI /crawl and /md request payloads
normalized result of a WWW::Crawl4AI strategy chain
role for a single crawl strategy in the WWW::Crawl4AI fallback chain
Crawl4AI strategy with full JS rendering (wait for networkidle)
last-resort Crawl4AI strategy delegating to a user coderef
Crawl4AI strategy attaching to CloakBrowser over CDP
cheapest Crawl4AI strategy — headless text mode, no escalation
Crawl4AI strategy routing through a configured proxy
Crawl4AI strategy with enable_stealth and randomized fingerprint
ordered list of strategy objects, pluggable at construction time