0.005 2026-06-12 15:23:16Z
- Detect: removed every body-text and HTML/title fingerprint from the block
classifier. is_good now discards a page only on thin_html, an HTTP push-back
status, or a final_url that landed on a known WAF/captcha challenge endpoint.
The old arms are gone: the size-independent ones ($RE_WALL matching a bare
`__cf_` substring, and the "Just a moment"/"Access denied" $RE_TITLE) and the
thin-gated $RE_BLOCK / $RE_JS / $RE_CAPTCHA body phrases. On a thin page they
were all redundant (thin_html already fails it), and on a full page the
size-independent ones were pure false positives. Real regression:
www.delphin.de served a full 386 KB 200
page carrying Cloudflare's passive /cdn-cgi/challenge-platform/.../jsd/main.js
beacon (a `__cf_` token), which was thrown away as bot_wall_detected across
all four strategies. A content-rich 200 page is now the scrape, full stop.
Drops the js_required signal/reason (a thin JS shell reads thin_content).
Soft-blocks that serve one identical interstitial for every URL are caught
at the site level by the caller, not per page here.
0.004 2026-06-11 14:24:19Z
- Detect: content volume is now the master signal. Every signal derived from a
page's VISIBLE rendered text -- js_required, the blocked body-phrase arm, and
the captcha marker -- only fires on a THIN page. A content-rich page that
merely mentions JavaScript ("enable JavaScript" footer), quotes a bot-wall
phrase ("unusual traffic", "access denied"), or carries captcha-prompt wording
is no longer discarded: once 500+ chars of markdown are in hand the scrape
succeeded, and body words can never prove otherwise. STRUCTURAL fingerprints
are unchanged and still fire regardless of size -- WAF tokens in the HTML
markup (__cf_chl, datadome), a "Just a moment" / "Access denied" <title>, and
redirects to a known WAF / captcha challenge endpoint. Removes the rich-page
captcha-PROMPT heuristic (and $RE_CAPTCHA_PROMPT) added in 0.003: it belongs
to the same word-on-a-rendered-page false-positive class.
0.003 2026-06-11 02:22:02Z
- Detect: captcha signal no longer false-positives on cookie-banner /
privacy-policy mentions of reCAPTCHA. A content-rich page now only walls when
a markdown captcha marker co-occurs with captcha-PROMPT language ("complete
the captcha to continue", "I'm not a robot", "verify you are human", …); thin
pages still wall on any marker (JS-rendered gate) and html-only markers on
rich pages still never wall
- Detect: signals now also flags WAF / bot-management gates that REDIRECT to a
challenge URL instead of embedding a widget. When the page's final_url
matches a known challenge endpoint, blocked is raised for Cloudflare /
DataDome / PerimeterX (/cdn-cgi/challenge, __cf_chl, /challenge-platform/,
datadome, geo.captcha-delivery.com, /px/captcha, perimeterx) and captcha for
provider verification endpoints (google.com/recaptcha, /recaptcha/api,
hcaptcha.com). Additive and OR-ed in; cosmetic redirects (http->https, www,
trailing slash) and an absent final_url never trigger
0.002 2026-05-30 22:57:39Z
- Result: expose response_headers (lowercased keys) from the origin HTTP fetch;
the field round-trips through to_hash/from_hash JSON persistence and defaults
to an empty hash
0.001 2026-05-29 23:36:25Z
- Initial release
- WWW::Crawl4AI: Perl client and fallback orchestrator for Crawl4AI
- WWW::Crawl4AI::Client: UA-agnostic REST client (/crawl, /md, /crawl/job,
/crawl/job/{task_id}, /health) with request/parse/convenience flavours
- WWW::Crawl4AI::Request: BrowserConfig/CrawlerRunConfig payload builder
- Visible strategy chain: plain, browser, stealth, cloakbrowser (CDP), proxy,
callback — escalated in cost/complexity order
- CloakBrowser strategy: per-domain fingerprint seed is now a deterministic
32-bit FNV-1a hash of the host (CloakBrowser requires a numeric seed and
rejects raw host strings with HTTP 400)
- WWW::Crawl4AI::Result with attempt history, signals, backend and cost_class
- Result link accessors: urls (deduped, absolute, fragment-stripped),
internal_links, external_links, links — no reaching into raw
- deep_crawl: breadth-first crawl following each page's links through the full
strategy chain (max_pages / max_depth / same_host / url_filter / on_page)
- Single-URL action endpoints on the Client (and delegated from WWW::Crawl4AI):
screenshot / pdf (raw bytes), html (preprocessed), execute_js (page +
js_result), llm (LLM Q&A), token (JWT) — each with request/parse/convenience
flavours like the rest
- WWW::Crawl4AI::Detect: service detection + content-quality classification
(js_required / blocked / captcha / thin_html)
- WWW::Crawl4AI::Error structured error model (transport/api/job/content)
- bin/www-crawl4ai-doctor and bin/www-crawl4ai-test-url
- examples/docker-compose.yml (+ proxy escalation variant)