
TIL: The State of the Art in the Scraper vs. Bot-Detection Arms Race


Web scraping has always been fascinating to me – the thrill of harvesting forbidden fruit. The deepest I ever got was ~3 years ago, using Playwright with playwright-stealth. Not that it actually helped me get past Akamai or Cloudflare.

As I've been diving into the space again recently, I've learned just how much things have advanced.

The state of the art of web scraping in 2026:

Offense

Tools

  • Camoufox
  • Residential proxies

Secret Sauce

  • Proxy rotation.
    • IP flagged as a bot? Try another.
  • Session warming (the craziest new concept to me, and one I actually saw make a difference).
    • Search for your target website on Google, Bing, or DuckDuckGo. Browse around, look normal, smile and wave, etcetera, to build "reputation" with Akamai-type detectors. Once you've done this for about a minute, Akamai seems to let their guard down a bit, and you can scrape to your heart's content, as long as your bot's movements aren't too erratic.
    • Learned while searching through X for clues about how pro scrapers do what they do. Src: https://x.com/mathieulevrai1/status/2054543662021820683?s=20
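
To make the two ideas above concrete, here is a minimal, stdlib-only sketch of how proxy rotation and session warming might fit together. Everything in it is hypothetical: `ProxyPool`, `warmup_pauses`, and `scrape_with_rotation` are illustrative names, the `warm_up` and `scrape` callables stand in for whatever browser automation you use, and "returns None" is just an assumed signal for "this IP got flagged".

```python
import itertools
import random
from typing import Callable, Iterable, List, Optional


class ProxyPool:
    """Round-robin pool that skips proxies already marked as flagged."""

    def __init__(self, proxies: Iterable[str]) -> None:
        self._proxies = list(proxies)
        if not self._proxies:
            raise ValueError("need at least one proxy")
        self._flagged = set()
        self._cycle = itertools.cycle(self._proxies)

    def next_proxy(self) -> Optional[str]:
        # Inspect each proxy at most once per call; None means all are burned.
        for _ in range(len(self._proxies)):
            candidate = next(self._cycle)
            if candidate not in self._flagged:
                return candidate
        return None

    def mark_flagged(self, proxy: str) -> None:
        self._flagged.add(proxy)


def warmup_pauses(total_seconds: float = 60.0, steps: int = 6,
                  rng: Optional[random.Random] = None) -> List[float]:
    """Split the warm-up window into uneven pauses that sum to total_seconds."""
    rng = rng or random.Random()
    weights = [rng.uniform(0.5, 1.5) for _ in range(steps)]
    scale = total_seconds / sum(weights)
    return [w * scale for w in weights]


def scrape_with_rotation(pool: ProxyPool,
                         warm_up: Callable[[str], None],
                         scrape: Callable[[str], Optional[str]],
                         max_attempts: int = 5) -> Optional[str]:
    """Warm a session on each proxy, then scrape; rotate when flagged."""
    for _ in range(max_attempts):
        proxy = pool.next_proxy()
        if proxy is None:
            break  # every proxy in the pool has been flagged
        warm_up(proxy)          # e.g. search engine visit + casual browsing
        result = scrape(proxy)  # assumed to return None when blocked
        if result is not None:
            return result
        pool.mark_flagged(proxy)
    return None
```

In practice `warm_up` would sleep for each value from `warmup_pauses()` between page visits, so the roughly one-minute warm-up is spread over irregular gaps rather than a fixed interval.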

Defense

Tools (the web scraper's "final bosses")

  • Akamai
    • Behavioral tracking. Essentially, it injects a script into the browser that watches your actions. Mouse movements, keystroke patterns, and browsing history are all fair game. If any of these look sus, you're flagged and either blocked or served fake data.
  • Cloudflare
    • Fingerprinting. Looks for fishy things like a mismatch between the actual TLS fingerprint of an HTTP request sent by Python's requests library and the fingerprint you'd expect from the browser named in that request's User-Agent.
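
The User-Agent vs. TLS mismatch check can be sketched from the defender's side roughly like this. It is a toy illustration, not how Cloudflare actually implements it: the fingerprint table, its placeholder keys, and the function names are all made up, and real systems use actual TLS fingerprint hashes and far more signals.

```python
# Hypothetical lookup table: TLS fingerprint hash -> client family.
# The keys are made-up placeholders, not real fingerprint values.
KNOWN_TLS_FINGERPRINTS = {
    "placeholder-hash-chrome": "chrome",
    "placeholder-hash-firefox": "firefox",
    "placeholder-hash-python": "python-requests",
}


def family_from_user_agent(user_agent: str) -> str:
    """Rough guess at the client family a User-Agent header claims to be."""
    ua = user_agent.lower()
    if ua.startswith("python-requests/"):
        return "python-requests"
    if "firefox/" in ua:
        return "firefox"
    if "chrome/" in ua:
        return "chrome"
    return "unknown"


def looks_spoofed(user_agent: str, tls_hash: str) -> bool:
    """True when the claimed client and the observed TLS stack disagree."""
    claimed = family_from_user_agent(user_agent)
    observed = KNOWN_TLS_FINGERPRINTS.get(tls_hash, "unknown")
    if "unknown" in (claimed, observed):
        return False  # not enough information to call it a mismatch
    return claimed != observed
```

A request that claims to be Chrome in its User-Agent but arrives with the Python TLS fingerprint would return `True` here, which is the situation you end up in when you only override the header.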

For more specifics on the techniques each company uses, this AI chat is fairly comprehensive.

TL;DR: what works?

The combination of:

  • Camoufox (thanks to its advanced fingerprint injection),
  • rendering in a real display (even Xvfb works and has first-party support),
  • a residential proxy pool, and
  • "session warming", described above,

pretty reliably bypasses Cloudflare and Akamai checks, even from a VPS. For now.
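
A rough sketch of how this setup might be launched with Camoufox's Python package. The option names (`headless="virtual"` for an Xvfb-backed display, `humanize`, `geoip`, and a Playwright-style `proxy` dict) reflect my understanding of the Camoufox docs and should be checked against the version you install; `build_launch_options` and `open_page` are hypothetical helper names, and the proxy address in the usage note is a placeholder.

```python
from typing import Any, Dict, Optional


def build_launch_options(proxy_server: Optional[str] = None,
                         proxy_user: Optional[str] = None,
                         proxy_password: Optional[str] = None) -> Dict[str, Any]:
    """Collect launch keyword arguments (assumed option names, see above)."""
    options: Dict[str, Any] = {
        "headless": "virtual",  # assumption: headful browser inside Xvfb
        "humanize": True,       # assumption: human-like cursor movement
    }
    if proxy_server:
        proxy: Dict[str, str] = {"server": proxy_server}
        if proxy_user:
            proxy["username"] = proxy_user
        if proxy_password:
            proxy["password"] = proxy_password
        options["proxy"] = proxy
        # Assumption: align locale/timezone with the proxy's IP;
        # this may need an optional Camoufox extra to be installed.
        options["geoip"] = True
    return options


def open_page(url: str, **proxy_kwargs: Optional[str]) -> str:
    """Open url through Camoufox and return the page title."""
    # Imported lazily so the option builder works without Camoufox installed.
    from camoufox.sync_api import Camoufox

    with Camoufox(**build_launch_options(**proxy_kwargs)) as browser:
        page = browser.new_page()
        page.goto(url)
        return page.title()
```

Usage would look something like `open_page("https://example.com", proxy_server="http://proxy.example:8000", proxy_user="user", proxy_password="pass")`, with the session-warming browsing done on the same `page` before the real scraping starts.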

Where do I think this all goes in the future?

I don't see how this ends besides a GAN-type war between model apps (e.g. Firecrawl), which rely on sophisticated, dubiously legal scraping pipelines, and security providers (Akamai & Cloudflare), which sell enterprise contracts on the promise of protecting your data from relentless bot armies.

Unlike most other software battles, I see the odds being in favor of offense. Realistically, bots will be able to mimic human browsing behavior perfectly, to the point where someone who really wants to protect their data from scraping needs to put the bulk of it behind a paywall or a paid API. Do you see Amazon.com doing that, dropping conversion rates, and evaporating shareholder value? No chance. But some more boutique websites definitely could. Even then, there will likely still be profit to be had at an aggregation layer that subscribes to expensive data behind paywalls, builds value on top, and either captures that value internally or exposes a new, higher-level platform that charges its own spread.

For these reasons, I'm glad I upped my scraping game. It will be a valuable skill for a long time to come.