Cloudflare's managed robots.txt and ad-aware AI bot blocking
Originally from blog.cloudflare.com
My notes
Summary
Cloudflare launched two free tools (July 2025) to defend publishers from AI training crawlers: a managed robots.txt that auto-prepends Disallow rules for major AI bots, and a feature that blocks AI bots only on pages that show ads. The piece shares striking crawl-to-referral data showing how broken the old “crawl in exchange for traffic” deal has become with AI crawlers.
Key Insight
- Crawl-to-referral collapse (June 2025), measured as pages crawled per visitor referred back:
  - Google: ~14:1 (still close to symbiotic)
  - OpenAI: 1,700:1
  - Anthropic: 73,000:1
- AI training traffic up 65% over the prior 6 months.
- robots.txt is massively underused: only 37% of the top 10,000 domains have one. GPTBot is disallowed in just 7.8% of those; Google-Extended, anthropic-ai, PerplexityBot, ClaudeBot, and Bytespider each <5%.
- Most-active AI bots by share of sites accessed: GPTBot 28.97%, Meta-ExternalAgent 22.16%, ClaudeBot 18.80%, Amazonbot 14.56%, Bytespider 9.37%. Bytespider’s traffic dropped 71.45% YoY after Cloudflare’s one-click block went live (1M+ customers enabled it).
- The Googlebot trap: Googlebot crawls for both SEO and AI training. To opt out of AI training without killing SEO, you must specifically Disallow Google-Extended (not Googlebot). The same logic applies to Apple: block Applebot-Extended, not Applebot.
- Cloudflare’s managed robots.txt prepends its directives, preserving the customer’s existing rules. It auto-updates as new AI bots emerge, so it is set-and-forget.
- Block-only-where-ads uses the LOL HTML parser to stream-scan response bodies for ad-unit signatures (e.g. class="ui-advert", googlesyndication.com script tags) plus CSP report data. They distilled EasyList’s 40,000+ filters down to the top 400, which is sufficient for detection (as opposed to blocking).
- Bot detection beyond user-agent: ML fingerprinting across 57M req/sec uses tool/framework signatures to catch bots that spoof their user-agent.
- Strategic shift in 2025: publishers moved from “Partially Disallowed” toward “Fully Disallowed” for top AI crawlers. Trust is collapsing, not just being negotiated.
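
To make the prepend behaviour and the Googlebot trap noted above concrete, here is a minimal Python sketch. It is not Cloudflare's implementation: the bot list, the comment markers, and the helper names (`managed_block`, `prepend_managed`, `can_fetch`) are all assumptions for illustration. It uses the standard library's `urllib.robotparser` to check that blocking Google-Extended and Applebot-Extended leaves Googlebot and Applebot untouched, and that the site's own rules still apply.

```python
from urllib.robotparser import RobotFileParser

# Illustrative list only; the directives Cloudflare actually prepends may differ.
MANAGED_AI_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended",
                   "Applebot-Extended", "Bytespider"]


def managed_block(bots):
    """Build one 'Disallow: /' group per AI bot token."""
    lines = ["# BEGIN managed AI-bot rules (hypothetical marker)"]
    for bot in bots:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    lines.append("# END managed AI-bot rules")
    return "\n".join(lines)


def prepend_managed(existing, bots=MANAGED_AI_BOTS):
    """Put the managed block first and keep the existing rules verbatim."""
    return managed_block(bots) + "\n\n" + existing.lstrip("\n")


def can_fetch(robots_txt, agent, url="https://example.com/article"):
    """Ask the stdlib parser whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# A site's pre-existing robots.txt (made up for this example).
EXISTING = """User-agent: *
Disallow: /admin/
"""

merged = prepend_managed(EXISTING)

for agent in ["Googlebot", "Google-Extended", "Applebot", "Applebot-Extended"]:
    print(agent, "allowed" if can_fetch(merged, agent) else "blocked")
```

Prepending rather than rewriting means the customer's own groups are left byte-for-byte intact, which is why the `/admin/` rule still applies to Googlebot in the merged file.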
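
The stream-scanning idea behind block-only-where-ads can also be sketched in a few lines of Python. This is plain substring matching over byte chunks, not LOL HTML (which matches on parsed HTML, not raw bytes), and the two signatures are just the examples from the notes above, not the real 400-filter list. The function name `has_ad_signature` is hypothetical. The main point it shows is that a streaming scanner has to carry a small tail between chunks so a signature split across a chunk boundary is still found.

```python
# Example signatures taken from the notes; the real detection list is much longer.
AD_SIGNATURES = (b'class="ui-advert"', b"googlesyndication.com")


def has_ad_signature(chunks, signatures=AD_SIGNATURES):
    """Return True as soon as any signature appears in the streamed body.

    `chunks` is any iterable of bytes objects, as a response body would
    arrive. Only the last (longest signature - 1) bytes are buffered, so
    memory use stays constant regardless of body size.
    """
    keep = max(len(sig) for sig in signatures) - 1
    tail = b""
    for chunk in chunks:
        window = tail + chunk
        if any(sig in window for sig in signatures):
            return True
        # Carry over just enough bytes to catch a signature that
        # started in this chunk and finishes in a later one.
        tail = window[-keep:] if keep else b""
    return False


# The signature is split across two chunks here and is still detected.
body = [b'<div class="ui-', b'advert">sponsored</div>']
print(has_ad_signature(body))
```

Because the goal is detection rather than blocking, the scan can stop at the first hit, which is one reason a short signature list can be enough.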