# Cloudflare's managed robots.txt and ad-aware AI bot blocking

> Cloudflare's free managed robots.txt auto-blocks AI training crawlers, with an option to block AI bots only on ad-monetized pages.

Published: 2026-05-05
URL: https://daniliants.com/insights/control-content-use-for-ai-training-cloudflare/
Tags: cloudflare, ai-bots, robots-txt, content-protection, web-scraping, publisher-economics, gptbot, claudebot

---

## Summary

Cloudflare launched two free tools (July 2025) to defend publishers from AI training crawlers: a managed robots.txt that auto-prepends Disallow rules for major AI bots, and a feature that blocks AI bots only on pages that show ads. The piece shares striking crawl-to-referral data showing how broken the old "crawl in exchange for traffic" deal has become with AI crawlers.

## Key Insight

- **Crawl-to-referral collapse (June 2025):**
  - Google: ~14:1 (still close to symbiotic)
  - OpenAI: 1,700:1
  - Anthropic: 73,000:1
  - AI training traffic up 65% over the prior 6 months.
- **robots.txt is massively underused:** only 37% of the top 10,000 domains have one. GPTBot is disallowed in just 7.8% of those; `Google-Extended`, `anthropic-ai`, `PerplexityBot`, `ClaudeBot`, `Bytespider` each <5%.
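A quick way to audit this yourself is Python's standard-library robots.txt parser; a minimal sketch (the `bot_allowed` helper and sample rules are mine, not from the article):

```python
from urllib.robotparser import RobotFileParser

def bot_allowed(robots_txt: str, agent: str, path: str = "/") -> bool:
    """Return True if `agent` may fetch `path` under these robots.txt rules."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, path)

# Example rules disallowing only GPTBot:
rules = "User-agent: GPTBot\nDisallow: /\n"
print(bot_allowed(rules, "GPTBot"))     # False: explicitly disallowed
print(bot_allowed(rules, "Googlebot"))  # True: no rule, so allowed by default
```

Running checks like this against the top-10,000 domains' `/robots.txt` files is presumably how the 7.8%-for-GPTBot figure was derived.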
- **Most-active AI bots by share of sites accessed:** GPTBot 28.97%, Meta-ExternalAgent 22.16%, ClaudeBot 18.80%, Amazonbot 14.56%, Bytespider 9.37%. Bytespider's traffic dropped 71.45% YoY after Cloudflare's one-click block went live (1M+ customers enabled it).
- **The Googlebot trap:** Googlebot crawls for both SEO and AI training. To opt out of AI training without killing SEO, you must Disallow `Google-Extended`, not `Googlebot`. The same logic applies to Apple: block `Applebot-Extended`, not `Applebot`.
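In standard robots.txt syntax, the opt-out looks like this (the `*-Extended` tokens are the ones named above; no rule is given for `Googlebot`/`Applebot`, so search crawling continues):

```
# Opt out of AI training, keep search indexing:
User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```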
- **Cloudflare's managed robots.txt** prepends its directives while preserving the customer's existing rules, and auto-updates as new AI bots emerge: set-and-forget.
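Illustratively (this is a sketch of the prepend behavior, not Cloudflare's verbatim output), a served file might look like:

```
# --- Cloudflare-managed block (auto-updated as new AI bots emerge) ---
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# --- Customer's original robots.txt follows, unchanged ---
User-agent: *
Disallow: /admin/
```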
- **Block-only-where-ads** uses the LOL HTML parser to stream-scan response bodies for ad-unit signatures (e.g. `class="ui-advert"`, googlesyndication.com script tags), plus CSP report data. They distilled EasyList's 40,000+ filters down to the top 400, which is sufficient for detection (as opposed to blocking).
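The detection idea can be sketched in a few lines with Python's stdlib streaming HTML parser standing in for LOL HTML (this is not Cloudflare's implementation; the two signatures are the examples from the article):

```python
from html.parser import HTMLParser

AD_CLASS_TOKENS = {"ui-advert"}               # example class-attribute signature
AD_SCRIPT_HOSTS = ("googlesyndication.com",)  # example script-src signature

class AdDetector(HTMLParser):
    """Sets self.has_ads as soon as any known ad signature appears."""
    def __init__(self):
        super().__init__()
        self.has_ads = False

    def handle_starttag(self, tag, attrs):
        if self.has_ads:
            return  # short-circuit: one match suffices for detection
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if any(c in AD_CLASS_TOKENS for c in classes):
            self.has_ads = True
        elif tag == "script" and any(h in (attrs.get("src") or "") for h in AD_SCRIPT_HOSTS):
            self.has_ads = True

def page_shows_ads(html_chunks) -> bool:
    """Feed chunks as they stream in; stop early once an ad is detected."""
    det = AdDetector()
    for chunk in html_chunks:
        det.feed(chunk)
        if det.has_ads:
            break
    return det.has_ads

print(page_shows_ads(['<div class="ui-advert">', '...</div>']))  # True
print(page_shows_ads(['<p>no ads here</p>']))                    # False
```

Detection is cheaper than blocking because the scanner only needs a yes/no per page, which is why ~400 distilled filters suffice where ad blockers need 40,000+.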
- **Bot detection beyond user-agent:** ML fingerprinting on 57M req/sec catches bots that spoof user-agents using tool/framework signatures.
- **Strategic shift in 2025:** publishers moved from "Partially Disallowed" toward "Fully Disallowed" for top AI crawlers; trust is collapsing, not just being negotiated.