0.005     2026-06-12 15:23:16Z

 - Detect: removed every body-text and HTML/title fingerprint from the block
   classifier. is_good now discards a page only on thin_html, an HTTP push-back
   status, or a final_url that landed on a known WAF/captcha challenge endpoint.
   The old size-independent arms -- $RE_WALL matching a bare `__cf_` substring,
   the "Just a moment"/"Access denied" $RE_TITLE, and the thin-gated $RE_BLOCK /
   $RE_JS / $RE_CAPTCHA body phrases -- are gone: on a thin page they were
   redundant (thin_html already fails the page) and on a full page they were pure
   false positives. Real regression: www.delphin.de served a full 386 KB 200
   page carrying Cloudflare's passive /cdn-cgi/challenge-platform/.../jsd/main.js
   beacon (a `__cf_` token), which was thrown away as bot_wall_detected across
   all four strategies. A content-rich 200 page is now the scrape, full stop.
   Drops the js_required signal/reason (a thin JS shell reads thin_content).
   Soft-blocks that serve one identical interstitial for every URL are caught
   site-level by the caller, not per-page here.
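
   The 0.005 decision order could be sketched roughly as below. This is an
   illustration, not the module's code: the sub name, the set of push-back
   statuses and the two reason strings other than thin_content are assumptions,
   and the URL pattern is only a subset of the endpoints listed under 0.003.

```perl
use strict;
use warnings;

# Assumed push-back statuses; the real set is not stated in this changelog.
my %PUSH_BACK = map { $_ => 1 } (403, 429, 503);

# Subset of the challenge endpoints named in the 0.003 entry.
my $RE_CHALLENGE_URL = qr{
    /cdn-cgi/challenge | __cf_chl | geo\.captcha-delivery\.com
  | /px/captcha | google\.com/recaptcha | hcaptcha\.com
}xi;

# Hypothetical stand-in for is_good: returns (ok, reason).
sub is_good_sketch {
    my (%page) = @_;

    # 1. Too little content: nothing else needs checking.
    return (0, 'thin_content') if $page{thin_html};

    # 2. The server pushed back at the HTTP level.
    return (0, 'http_push_back') if $PUSH_BACK{ $page{status} // 0 };

    # 3. The fetch ended on a WAF/captcha challenge URL.
    return (0, 'challenge_redirect')
        if defined $page{final_url} && $page{final_url} =~ $RE_CHALLENGE_URL;

    # Body text and markup are deliberately never inspected here.
    return (1, undef);
}
```

   With these rules a full 200 page is kept even when its markup contains a
   `__cf_` token, because only the final URL is matched, never the body.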

0.004     2026-06-11 14:24:19Z

 - Detect: content volume is now the master signal. Every signal derived from a
   page's VISIBLE rendered text -- js_required, the blocked body-phrase arm, and
   the captcha marker -- only fires on a THIN page. A content-rich page that
   merely mentions JavaScript ("enable JavaScript" footer), quotes a bot-wall
   phrase ("unusual traffic", "access denied"), or carries captcha-prompt wording
   is no longer discarded: once 500+ chars of markdown are in hand the scrape
   succeeded, and body words can never prove otherwise. STRUCTURAL fingerprints
   are unchanged and still fire regardless of size -- WAF tokens in the HTML
   markup (__cf_chl, datadome), a "Just a moment" / "Access denied" <title>, and
   redirects to a known WAF / captcha challenge endpoint. Removes the rich-page
   captcha-PROMPT heuristic (and $RE_CAPTCHA_PROMPT) added in 0.003: same
   word-on-a-rendered-page false-positive class.
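
   A minimal sketch of the 0.004 gating, assuming thinness is measured as fewer
   than 500 characters of markdown. The sub name and all four patterns are
   illustrative stand-ins; the module's real regexes are not reproduced in this
   changelog.

```perl
use strict;
use warnings;

my $THIN_LIMIT = 500;    # "500+ chars of markdown" from the entry above

# Illustrative patterns only.
my $RE_BODY_BLOCK = qr/unusual traffic|access denied/i;
my $RE_BODY_JS    = qr/enable javascript/i;
my $RE_STRUCT     = qr/__cf_chl|datadome/i;
my $RE_TITLE      = qr{<title>\s*(?:Just a moment|Access denied)}i;

# Hypothetical helper: returns a hashref of raised signals.
sub signals_sketch {
    my ($markdown, $html) = @_;
    my $thin = length($markdown) < $THIN_LIMIT ? 1 : 0;
    my %sig  = (thin_html => $thin);

    # Visible-text signals are only trusted on a thin page.
    if ($thin) {
        $sig{blocked}     = 1 if $markdown =~ $RE_BODY_BLOCK;
        $sig{js_required} = 1 if $markdown =~ $RE_BODY_JS;
    }

    # Structural fingerprints in the markup fire regardless of size.
    $sig{blocked} = 1 if $html =~ $RE_STRUCT || $html =~ $RE_TITLE;

    return \%sig;
}
```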

0.003     2026-06-11 02:22:02Z

 - Detect: captcha signal no longer false-positives on cookie-banner /
   privacy-policy mentions of reCAPTCHA. A content-rich page now only walls when
   a markdown captcha marker co-occurs with captcha-PROMPT language ("complete
   the captcha to continue", "I'm not a robot", "verify you are human", …); thin
   pages still wall on any marker (JS-rendered gate) and html-only markers on
   rich pages still never wall.

 - Detect: signals now also flags WAF / bot-management gates that REDIRECT to a
   challenge URL instead of embedding a widget. When the page's final_url
   matches a known challenge endpoint, blocked is raised for Cloudflare /
   DataDome / PerimeterX (/cdn-cgi/challenge, __cf_chl, /challenge-platform/,
   datadome, geo.captcha-delivery.com, /px/captcha, perimeterx) and captcha for
   provider verification endpoints (google.com/recaptcha, /recaptcha/api,
   hcaptcha.com). Additive and OR-ed in; cosmetic redirects (http->https, www,
   trailing slash) and an absent final_url never trigger.
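
   The redirect check might look like the sketch below. The endpoint patterns
   come from the list above; the sub names and the way cosmetic redirects are
   filtered (comparing both URLs after stripping scheme, a leading www. and
   trailing slashes) are assumptions made for illustration.

```perl
use strict;
use warnings;

# WAF / bot-management challenge endpoints (raise "blocked").
my $RE_WAF_URL = qr{
    /cdn-cgi/challenge | __cf_chl | /challenge-platform/
  | datadome | geo\.captcha-delivery\.com | /px/captcha | perimeterx
}xi;

# Provider verification endpoints (raise "captcha").
my $RE_CAPTCHA_URL = qr{
    google\.com/recaptcha | /recaptcha/api | hcaptcha\.com
}xi;

# Assumed normalisation: drop scheme, leading "www." and trailing slashes.
sub _norm_url {
    my ($u) = @_;
    $u =~ s{^https?://}{}i;
    $u =~ s{^www\.}{}i;
    $u =~ s{/+$}{};
    return $u;
}

# Hypothetical helper: returns 'blocked', 'captcha', or undef.
sub redirect_signal {
    my ($url, $final_url) = @_;
    return unless defined $final_url && length $final_url;   # absent final_url
    return if _norm_url($url) eq _norm_url($final_url);      # cosmetic redirect
    return 'blocked' if $final_url =~ $RE_WAF_URL;
    return 'captcha' if $final_url =~ $RE_CAPTCHA_URL;
    return;
}
```

   The caller would OR the returned signal into whatever the other checks
   raised, which matches the "additive" wording of the entry.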

0.002     2026-05-30 22:57:39Z

 - Result: expose response_headers (lowercased keys) from the origin HTTP fetch;
   the field round-trips through to_hash/from_hash JSON persistence and defaults
   to an empty hash
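
   As a rough illustration of what "lowercased keys" and a JSON round trip
   imply, assuming a plain hash of headers. The helper name is hypothetical and
   JSON::PP is used only because it ships with Perl; this is not the module's
   to_hash/from_hash implementation.

```perl
use strict;
use warnings;
use JSON::PP ();

# Hypothetical helper: copy a header hash with lowercased keys.
# Anything that is not a hashref yields the empty-hash default.
sub lowercase_headers {
    my ($headers) = @_;
    return {} unless ref($headers) eq 'HASH';
    my %out;
    $out{ lc $_ } = $headers->{$_} for keys %$headers;
    return \%out;
}

# Round trip through JSON, as a persistence layer would do.
my $json   = JSON::PP->new->canonical;
my $stored = $json->encode(
    { response_headers => lowercase_headers({ 'Content-Type' => 'text/html' }) }
);
my $restored = $json->decode($stored);
```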

0.001     2026-05-29 23:36:25Z

 - Initial release
 - WWW::Crawl4AI: Perl client and fallback orchestrator for Crawl4AI
 - WWW::Crawl4AI::Client: UA-agnostic REST client (/crawl, /md, /crawl/job,
   /crawl/job/{task_id}, /health) with request/parse/convenience flavours
 - WWW::Crawl4AI::Request: BrowserConfig/CrawlerRunConfig payload builder
 - Visible strategy chain: plain, browser, stealth, cloakbrowser (CDP), proxy,
   callback — escalated in cost/complexity order
 - CloakBrowser strategy: per-domain fingerprint seed is now a deterministic
   32-bit FNV-1a hash of the host (CloakBrowser requires a numeric seed and
   rejects raw host strings with HTTP 400)
 - WWW::Crawl4AI::Result with attempt history, signals, backend and cost_class
 - Result link accessors: urls (deduped, absolute, fragment-stripped),
   internal_links, external_links, links — no reaching into raw
 - deep_crawl: breadth-first crawl following each page's links through the full
   strategy chain (max_pages / max_depth / same_host / url_filter / on_page)
 - Single-URL action endpoints on the Client (and delegated from WWW::Crawl4AI):
   screenshot / pdf (raw bytes), html (preprocessed), execute_js (page +
   js_result), llm (LLM Q&A), token (JWT) — each with request/parse/convenience
   flavours like the rest
 - WWW::Crawl4AI::Detect: service detection + content-quality classification
   (js_required / blocked / captcha / thin_html)
 - WWW::Crawl4AI::Error structured error model (transport/api/job/content)
 - bin/www-crawl4ai-doctor and bin/www-crawl4ai-test-url
 - examples/docker-compose.yml (+ proxy escalation variant)
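
   For reference, a 32-bit FNV-1a hash of a host string, as mentioned for the
   CloakBrowser fingerprint seed, can be computed as in this sketch. The sub
   name is hypothetical, any host normalisation the module applies before
   hashing is not specified here, and the arithmetic assumes a Perl built with
   64-bit integers.

```perl
use strict;
use warnings;

# Standard FNV-1a, 32-bit: XOR each byte in, then multiply by the FNV prime.
sub fnv1a_32 {
    my ($str) = @_;
    my $hash = 0x811c9dc5;                           # 32-bit offset basis
    for my $byte (unpack 'C*', $str) {
        $hash ^= $byte;
        $hash = ($hash * 0x01000193) & 0xFFFFFFFF;   # FNV prime, wrap to 32 bits
    }
    return $hash;
}

# The same host always maps to the same numeric seed.
my $seed = fnv1a_32('www.example.com');
```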