0.004 2026-06-11 14:24:19Z
- Detect: content volume is now the master signal. Every signal derived from a
page's VISIBLE rendered text -- js_required, the blocked body-phrase arm, and
the captcha marker -- only fires on a THIN page. A content-rich page that
merely mentions JavaScript ("enable JavaScript" footer), quotes a bot-wall
phrase ("unusual traffic", "access denied"), or carries captcha-prompt wording
is no longer discarded: once 500+ chars of markdown are in hand the scrape
succeeded, and body words can never prove otherwise. STRUCTURAL fingerprints
are unchanged and still fire regardless of size -- WAF tokens in the HTML
markup (__cf_chl, datadome), a "Just a moment" / "Access denied" <title>, and
redirects to a known WAF / captcha challenge endpoint. Removes the rich-page
captcha-PROMPT heuristic (and $RE_CAPTCHA_PROMPT) added in 0.003: same
word-on-a-rendered-page false-positive class
0.003 2026-06-11 02:22:02Z
- Detect: captcha signal no longer false-positives on cookie-banner /
privacy-policy mentions of reCAPTCHA. A content-rich page now only walls when
a markdown captcha marker co-occurs with captcha-PROMPT language ("complete
the captcha to continue", "I'm not a robot", "verify you are human", …); thin
pages still wall on any marker (JS-rendered gate) and html-only markers on
rich pages still never wall
- Detect: signals now also flags WAF / bot-management gates that REDIRECT to a
challenge URL instead of embedding a widget. When the page's final_url
matches a known challenge endpoint, blocked is raised for Cloudflare /
DataDome / PerimeterX (/cdn-cgi/challenge, __cf_chl, /challenge-platform/,
datadome, geo.captcha-delivery.com, /px/captcha, perimeterx) and captcha for
provider verification endpoints (google.com/recaptcha, /recaptcha/api,
hcaptcha.com). Additive and OR-ed in; cosmetic redirects (http->https, www,
trailing slash) and an absent final_url never trigger
0.002 2026-05-30 22:57:39Z
- Result: expose response_headers (lowercased keys) from the origin HTTP fetch,
round-trips through to_hash/from_hash JSON persistence; empty hash default
0.001 2026-05-29 23:36:25Z
- Initial release
- WWW::Crawl4AI: Perl client and fallback orchestrator for Crawl4AI
- WWW::Crawl4AI::Client: UA-agnostic REST client (/crawl, /md, /crawl/job,
/crawl/job/{task_id}, /health) with request/parse/convenience flavours
- WWW::Crawl4AI::Request: BrowserConfig/CrawlerRunConfig payload builder
- Visible strategy chain: plain, browser, stealth, cloakbrowser (CDP), proxy,
callback — escalated in cost/complexity order
- CloakBrowser strategy: per-domain fingerprint seed is now a deterministic
32-bit FNV-1a hash of the host (CloakBrowser requires a numeric seed and
rejects raw host strings with HTTP 400)
- WWW::Crawl4AI::Result with attempt history, signals, backend and cost_class
- Result link accessors: urls (deduped, absolute, fragment-stripped),
internal_links, external_links, links — no reaching into raw
- deep_crawl: breadth-first crawl following each page's links through the full
strategy chain (max_pages / max_depth / same_host / url_filter / on_page)
- Single-URL action endpoints on the Client (and delegated from WWW::Crawl4AI):
screenshot / pdf (raw bytes), html (preprocessed), execute_js (page +
js_result), llm (LLM Q&A), token (JWT) — each with request/parse/convenience
flavours like the rest
- WWW::Crawl4AI::Detect: service detection + content-quality classification
(js_required / blocked / captcha / thin_html)
- WWW::Crawl4AI::Error structured error model (transport/api/job/content)
- bin/www-crawl4ai-doctor and bin/www-crawl4ai-test-url
- examples/docker-compose.yml (+ proxy escalation variant)