The web's longtime signal for crawl behavior, robots.txt, was designed as a voluntary protocol: a simple, machine-readable request that well-behaved crawlers honor. It remains useful for coordinating search engine indexing and avoiding accidental exposure of sensitive paths, but RFC 9309 explicitly notes that robots.txt is not a substitute for content security and depends on voluntary compliance.
Over the last two years operators and publishers have discovered a hard truth: some modern AI agents ignore robots.txt or find ways to circumvent site defenses. That reality has provoked technical, legal, and standards responses as the industry grapples with how to protect sites, enforce publisher preferences, and update norms for an agent-first web.
Why robots.txt was never a perfect shield
The Robots Exclusion Protocol has always been a coordination mechanism rather than a security control. RFC 9309 formalized parsing and behavior, but it also warned that exposing paths in robots.txt can reveal what site owners prefer to hide and that the protocol relies on voluntary crawler compliance.
Because it is advisory, robots.txt works well with reputable search engines and crawlers that identify themselves and respect site wishes. However, it offers no technical enforcement against actors that choose to ignore or actively circumvent it; those actors can fetch content like any other browser unless additional blocks are in place.
As a result, site owners must treat robots.txt as one layer in a larger defensive stack: useful for signaling intent, but insufficient alone to stop determined scraping or unauthorized reuse of content.
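Because the protocol is voluntary, compliance lives in the client, not the server: checking robots.txt is simply a lookup the crawler chooses to perform. A minimal sketch using Python's standard-library parser (the rules and example.com URLs are illustrative):

```python
from urllib.robotparser import RobotFileParser

# A toy robots.txt policy; a real crawler would fetch this from the site.
rules = """
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A well-behaved crawler consults the parsed rules before fetching.
print(rp.can_fetch("MyBot/1.0", "https://example.com/private/page"))  # False
print(rp.can_fetch("MyBot/1.0", "https://example.com/public/page"))   # True
```

Nothing stops a client from skipping the `can_fetch` call entirely, which is the whole gap this article describes.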
Documented cases: AI agents and stealth crawling
Cloudflare's August 4, 2025 technical report highlighted a striking case study: an AI answer engine identified as Perplexity operated both declared user agents and undeclared, stealth crawlers that rotated IPs and ASNs and sometimes ignored or failed to fetch robots.txt files. Cloudflare subsequently de-listed the service from its verified bots list and added blocking heuristics.
Cloudflare also published volumes showing the mix of declared and stealth traffic: the declared Perplexity user agent made roughly 20 to 25 million daily requests, while the undeclared stealth crawler produced about 3 to 6 million daily requests using generic Chrome-like user-agent strings and unlisted IP ranges. Those numbers illustrate how significant stealth fetches can be compared with declared crawling.
This case fits a broader empirical picture. A large-scale study on arXiv (May 27, 2025) found that some scraper categories, including AI search crawlers, rarely check robots.txt. Reuters and industry monitors have similarly reported multiple AI services bypassing the Robots Exclusion Protocol, prompting warnings from publishers and licensing firms.
How AI agents evade robots.txt and web defenses
Common evasion techniques are well documented: user-agent spoofing (impersonating mainstream browsers), rapid IP and ASN rotation, and the use of third-party browser-as-a-service proxies. These tactics make agent traffic look like ordinary human browsing until it is fingerprinted.
Cloudflare's analysis described stealth traffic using Chrome-like user-agent strings and unlisted address blocks, and industry writeups have shown that some agents will fall back to browser-driven fetch mechanisms to blend into normal traffic patterns. That complicates simple defenses that rely solely on user-agent or IP blacklists.
Operators that rely only on robots.txt therefore face a technical gap. Without active bot management, anomaly detection, or policy enforcement at the CDN/WAF layer, stealth agents can harvest content with only modest additional effort.
Publisher harms and market effects
Publishers have raised clear concerns about scraping by AI agents because summaries or AI overviews can reduce direct traffic and monetizable clicks. A Pew Research tracking study (March 2025) showed that AI summaries substantially reduce clickthrough rates: in some contexts rates fell from about 15% to around 8%, and only about 1% of AI overview occurrences led to a click through to the cited source.
Those traffic shifts threaten publishers' ad and subscription economics, motivating the use of robots.txt-based opt-outs and paid licensing approaches. Reuters reported in mid‑2024 that multiple AI companies were bypassing web standards to scrape publisher sites, and industry observers urged publishers to negotiate licenses rather than rely on robots.txt alone.
The seriousness of the issue is reflected in legal responses. In 2025 several plaintiffs, including Reddit and multiple Japanese publishers, filed suits alleging unauthorized scraping and circumvention of anti‑scraping measures. Complaints cite test-post evidence and claim robots.txt and terms were ignored, seeking damages and injunctions.
How infrastructure providers and the industry responded
Infrastructure firms moved quickly to protect customers. Cloudflare reported millions of sites adopting options to disallow AI training via managed robots controls and launched default AI‑crawler blocking for customers along with a pay‑per‑crawl concept. Wired covered Cloudflare's September 2025 policy changes and noted that the Robots Exclusion Protocol remains ineffective against many AI scrapers.
Defensive techniques deployed in practice include managed bot rules, fingerprinting of stealth crawlers, honeypot or labyrinth traps to detect automated agents, and pay‑per‑crawl arrangements that monetize legitimate data access. These measures raise the cost of stealth scraping and provide publishers with remediation options beyond robots.txt.
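Honeypot traps, for example, can be sketched simply: publish an unguessable path, disallow it in robots.txt, link it invisibly in pages, and treat any request for it as evidence the client ignored the protocol. The path scheme below is illustrative:

```python
import secrets

def make_trap_path():
    # Generate a unique, unguessable path. List it under Disallow in
    # robots.txt and embed it as a hidden link; no human or compliant
    # crawler should ever request it.
    return f"/trap-{secrets.token_hex(8)}/"

def is_trap_hit(path, trap_paths):
    # Any request beginning with a trap path indicates a client that
    # either ignored robots.txt or scraped raw HTML indiscriminately.
    return any(path.startswith(trap) for trap in trap_paths)
```

A hit does not by itself prove malice, but it is a strong signal to feed into bot-management scoring or rate limiting.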
Cloudflare's public actions after its Perplexity analysis (de-listing a verified bot entry and adding automated heuristics) demonstrate how CDN and security operators can detect and mitigate stealth fetches, even when those fetches disguise themselves as ordinary browsers.
Standards, law, and the search for durable controls
Recognizing that existing norms fall short, standards and legal systems are adapting. An Internet‑Draft published in April 2025 proposed extending robots.txt with a machine‑readable "AI preferences" vocabulary to allow sites to express AI‑specific opt‑outs in a standardized way. That draft reflects broad interest in updating the Robots Exclusion Protocol for agent-era use cases.
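The draft's exact directive names are still being worked out, so the snippet below uses purely hypothetical syntax to illustrate the idea: robots.txt-style rules extended with machine-readable, AI-specific usage preferences.

```
# Hypothetical syntax for illustration only; the Internet-Draft's
# actual vocabulary and directive names may differ.
User-agent: *
Disallow: /private/
AI-Preferences: no-train, no-summarize
```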
At the same time, courts and litigants are testing whether unauthorized scraping and reuse of content can be restrained by contract, copyright, or other legal theories. The 2025 lawsuits against AI firms argue not only about copying but also about circumvention of technical measures and contractual terms that publishers use to control access.
These parallel tracks (technical standards work, industry controls, and legal challenges) are likely to converge. Either robots.txt will be extended or supplemented with enforceable mechanisms, or market and legal pressure will push AI services toward explicit licensing and technical cooperation.
Practical steps for site owners
Site operators should assume robots.txt alone is insufficient to stop some AI scraping. Useful defensive measures include enabling CDN/WAF bot management, deploying rate limits and anomaly detection, using honeypots to identify stealth crawlers, and logging detailed fetch metadata for later attribution.
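Of these measures, rate limiting is the simplest to sketch. A per-client token bucket (parameters are illustrative, not a production implementation):

```python
import time

class TokenBucket:
    """Per-client token bucket: allow `rate` requests/sec with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens replenished per second
        self.capacity = capacity   # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens based on elapsed time, then spend one if available.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In deployment this would run per client key (IP, fingerprint, or API token) at the edge, with stricter buckets applied to clients flagged by honeypots or anomaly detection.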
Publishers concerned about training and reuse should consider contractual licensing, pay‑per‑crawl models offered by CDNs, and explicit business agreements with AI providers. Cloudflare and other vendors now offer managed opt-outs and paid access controls that convert site preferences into enforceable policy at the network edge.
Finally, keep an eye on standards and legal developments. Adopt new machine‑readable AI preference signals once they stabilize, and consult counsel about potential remedies if you detect systematic circumvention. Combining technical, contractual, and legal tools gives the best chance of protecting content in the near term.
AI agents ignore robots.txt in some real-world cases, and that mismatch between expectation and behavior has real consequences for publishers and the web ecosystem. The Cloudflare/Perplexity episode, empirical studies, and legal filings together make the problem plain: voluntary signals are no longer sufficient when some agents act stealthily.
Going forward, defending the open web will require layered defenses, clearer standards, and stronger commercial and legal agreements. Robots.txt will remain part of the toolbox, but publishers and infrastructure providers must pair it with active enforcement, negotiated access, and participation in standards work so that the web's norms evolve with the agent era.