Cloudflare's managed robots.txt and ad-aware AI bot blocking

Tags: cloudflare, ai-bots, robots-txt, content-protection, web-scraping, publisher-economics, gptbot, claudebot
Originally from blog.cloudflare.com

My notes

Summary

Cloudflare launched two free tools (July 2025) to defend publishers from AI training crawlers: a managed robots.txt that auto-prepends Disallow rules for major AI bots, and a feature that blocks AI bots only on pages that show ads. The piece shares striking crawl-to-referral data showing how broken the old “crawl in exchange for traffic” deal has become with AI crawlers.

Key Insights

  • Crawl-to-referral collapse (June 2025):
    • Google: ~14:1 (still close to symbiotic)
    • OpenAI: 1,700:1
    • Anthropic: 73,000:1
    • AI training traffic up 65% over the prior 6 months.
  • robots.txt is massively underused: only 37% of the top 10,000 domains have one. GPTBot is disallowed in just 7.8% of those; Google-Extended, anthropic-ai, PerplexityBot, ClaudeBot, Bytespider each <5%.
  • Most-active AI bots by share of sites accessed: GPTBot 28.97%, Meta-ExternalAgent 22.16%, ClaudeBot 18.80%, Amazonbot 14.56%, Bytespider 9.37%. Bytespider’s traffic dropped 71.45% YoY after Cloudflare’s one-click block went live (1M+ customers enabled it).
  • The Googlebot trap: Googlebot crawls for both SEO AND AI training. To opt out of AI training without killing SEO, you must specifically Disallow Google-Extended, not Googlebot. The same logic applies to Apple: block Applebot-Extended, not Applebot (see the example robots.txt after this list).
  • Cloudflare’s managed robots.txt prepends its directives, preserving the customer’s existing rules, and auto-updates as new AI bots emerge: set-and-forget. The prepend behavior is also shown in the example below.
  • Block-only-where-ads-are-shown uses the LOL HTML parser to stream-scan response bodies for ad-unit signatures (e.g. class="ui-advert", googlesyndication.com script tags), supplemented by CSP report data. Cloudflare distilled EasyList’s 40,000+ filters down to the top 400, which is enough for detection (as opposed to blocking); see the sketch after this list.
  • Bot detection goes beyond the user-agent header: ML fingerprinting over 57M req/sec of traffic catches bots that spoof user-agents, matching them against tool/framework signatures.
  • Strategic shift in 2025: publishers moved from “Partially Disallowed” toward “Fully Disallowed” for top AI crawlers; trust is collapsing, not just being renegotiated.
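
A minimal robots.txt illustrating both points above: managed directives prepended above the customer’s existing rules, and the -Extended agents that opt out of AI training without touching SEO crawlers. The bot list and comments are my own illustration, not Cloudflare’s exact managed output:

```
# Managed directives (prepended automatically; illustrative)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Opts out of Google's AI training; Googlebot (SEO) is unaffected
User-agent: Google-Extended
Disallow: /

# Same idea for Apple: Applebot-Extended, not Applebot
User-agent: Applebot-Extended
Disallow: /

# --- customer's original rules, preserved below ---
User-agent: *
Disallow: /admin/
```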
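And a sketch of what the streaming ad-signature detection could look like, using Cloudflare’s lol_html crate (the Rust library behind LOL HTML). The two selectors and the body_has_ads helper are stand-ins of mine for the 400 distilled EasyList rules; the actual integration is not public in this form:

```rust
use std::cell::Cell;

use lol_html::{element, HtmlRewriter, Settings};

/// Stream-scan HTML chunks for ad-unit signatures. Hypothetical helper;
/// the two selectors stand in for the distilled EasyList rule set.
fn body_has_ads<'a>(chunks: impl IntoIterator<Item = &'a [u8]>) -> bool {
    let found = Cell::new(false);

    let mut rewriter = HtmlRewriter::new(
        Settings {
            element_content_handlers: vec![
                // Ad-unit markup, e.g. class="ui-advert"
                element!("[class*='ui-advert']", |_| {
                    found.set(true);
                    Ok(())
                }),
                // Ad-network script tags, e.g. googlesyndication.com
                element!("script[src*='googlesyndication.com']", |_| {
                    found.set(true);
                    Ok(())
                }),
            ],
            ..Settings::default()
        },
        // Detection only: discard the (unmodified) output stream.
        |_: &[u8]| {},
    );

    for chunk in chunks {
        if rewriter.write(chunk).is_err() || found.get() {
            break; // stop scanning as soon as a signature matches
        }
    }
    let _ = rewriter.end();
    found.get()
}

fn main() {
    let page = b"<html><body><div class=\"ui-advert\">...</div></body></html>";
    println!("has ads: {}", body_has_ads([page.as_slice()]));
}
```

The streaming design matters: because lol_html parses incrementally, the scan can short-circuit on the first matching chunk instead of buffering the full response body.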