Why ChatGPT Cites One Page Over Another (1.4M Prompt Study)
Originally from ahrefs.com
Summary
Ahrefs analyzed 1.4 million ChatGPT prompts to identify what drives citation selection. The dominant factor is semantic similarity between a page’s title and ChatGPT’s internal “fanout queries” (sub-questions generated behind the scenes from a user prompt). Pages in ChatGPT’s general search index have an 88% citation rate, while Reddit content (67.8% of non-cited URLs) is used for context but almost never credited.
Key Insight
The citation pipeline has a gating stage before your content is ever read:
- ChatGPT uses the title, URL snippet, and URL text to decide which pages to open; content quality is irrelevant if the gate rejects you
- Only ~50% of retrieved URLs end up cited; the rest are read but discarded or never opened
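The gating behavior described above can be sketched as a two-stage pipeline. Everything here is hypothetical (the function names, scores, and threshold are illustrative stand-ins, not ChatGPT's actual internals); the point is only that stage 2 never sees a page stage 1 rejects.

```python
def gate_and_cite(candidates, score_title, open_page, score_content, k=2):
    """Two-stage sketch of the gating idea.

    Stage 1 (the gate): rank candidates by title/URL score alone;
    only the top-k are ever opened, so content quality cannot rescue
    a page the gate rejects.
    Stage 2: opened pages are scored on content; only some get cited.
    """
    gated = sorted(candidates, key=lambda c: score_title(c["title"]),
                   reverse=True)[:k]
    cited = []
    for page in gated:
        content = open_page(page["url"])
        if score_content(content) > 0.5:  # illustrative threshold
            cited.append(page["url"])
    return cited

# Toy stand-ins for the scoring and fetching steps:
candidates = [
    {"title": "best running shoes 2024", "url": "a.com/shoes"},
    {"title": "xyz123", "url": "b.com/p?id=9"},
    {"title": "running shoe reviews", "url": "c.com/reviews"},
]
score_title = lambda t: 1.0 if "shoe" in t else 0.0
open_page = lambda url: "" if "b.com" in url else "long article about shoes"
score_content = lambda c: 0.9 if "shoes" in c else 0.0

cited = gate_and_cite(candidates, score_title, open_page, score_content, k=2)
```

With these toys, the opaque-titled page on b.com is dropped at the gate without its content ever being read, which is the pattern the study describes.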
Channel type is the biggest lever:
- ref_type: search -> 88.46% citation rate (25.5M data points)
- ref_type: news -> 12.01%
- ref_type: reddit (dedicated feed) -> 1.93% (cited 16M times for context, credited almost never)
- YouTube and academia: <1%
Fanout query alignment beats raw prompt relevance:
- Cited URL title vs. prompt: cosine similarity 0.602
- Cited URL title vs. fanout query: cosine similarity 0.656
- Non-cited URL title vs. prompt: 0.484
- Optimizing for what ChatGPT is actually asking internally (fanout queries) matters more than the surface-level user query
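To make the similarity numbers above concrete, here is a minimal sketch of cosine similarity between a title and a query. The study presumably uses dense embeddings; this uses a plain bag-of-words vector as a stand-in, and the example prompt, fanout query, and titles are invented for illustration.

```python
import math
from collections import Counter

def tokens(text):
    """Bag-of-words vector: token -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    num = sum(a[t] * b[t] for t in a.keys() & b.keys())
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

# Hypothetical example: the fanout query is more specific than the prompt.
prompt = "what are the best trail running shoes"
fanout = "trail running shoe durability comparison"
title = "trail running shoe durability test"

sim_fanout = cosine(tokens(title), tokens(fanout))  # title vs. fanout query
sim_prompt = cosine(tokens(title), tokens(prompt))  # title vs. raw prompt
```

A title written against the inferred fanout query scores higher than one matched only loosely to the surface prompt, mirroring the 0.656 vs. 0.602 gap reported above.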
URL readability has a measurable effect:
- Natural language URL slugs: 89.78% citation rate vs. 81.11% for opaque URLs
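A natural-language slug is cheap to generate at publish time. A minimal sketch (a basic regex slugifier; real CMSes also handle Unicode transliteration, length limits, and collisions):

```python
import re

def slugify(title):
    """Turn a page title into a readable URL slug:
    lowercase, alphanumeric runs joined by hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

slug = slugify("Why ChatGPT Cites One Page Over Another")
```

The result, "why-chatgpt-cites-one-page-over-another", is the kind of URL text the gate can read, unlike an opaque "/p?id=48213".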
Freshness is nuanced:
- Across the web, ChatGPT skews ~458 days newer than Google organic
- Within a single retrieval set, older/more established pages beat fresh ones; relevance dominates
- For news queries specifically, freshness becomes the tiebreaker when relevance scores are equal
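The freshness behavior above amounts to a lexicographic sort: relevance first, publication date only as a tiebreaker. A sketch with invented pages and scores:

```python
from datetime import date

def rank(pages):
    """Relevance-first ranking; freshness breaks ties only when
    relevance scores are equal (illustrative, not ChatGPT's ranker)."""
    # toordinal() gives a monotone numeric key for recency
    return sorted(pages,
                  key=lambda p: (p["relevance"], p["published"].toordinal()),
                  reverse=True)

pages = [
    {"url": "old-guide",  "relevance": 0.9, "published": date(2022, 1, 1)},
    {"url": "fresh-post", "relevance": 0.9, "published": date(2025, 6, 1)},
    {"url": "weak-match", "relevance": 0.5, "published": date(2025, 7, 1)},
]
order = [p["url"] for p in rank(pages)]
```

The freshest page only wins against the old guide because their relevance scores tie; the even fresher weak match still ranks last.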
Analytical trap to avoid:
- Comparing “cited vs. non-cited” without isolating by ref_type produces misleading results; Reddit’s bulk volume distorts every aggregate metric (snippet rates, publication dates, etc.)
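The distortion above is a classic aggregation trap: a high-volume, low-citation channel drags the pooled rate far below every well-performing channel. A sketch with toy numbers (illustrative only, not Ahrefs' data):

```python
from collections import defaultdict

def rates_by_ref_type(records):
    """Per-channel citation rates from (ref_type, cited) records."""
    totals, cited = defaultdict(int), defaultdict(int)
    for ref_type, was_cited in records:
        totals[ref_type] += 1
        cited[ref_type] += was_cited
    return {rt: cited[rt] / totals[rt] for rt in totals}

# Toy data: reddit's sheer volume swamps the pooled metric.
records = ([("search", 1)] * 9 + [("search", 0)]
           + [("reddit", 0)] * 49 + [("reddit", 1)])

pooled = sum(c for _, c in records) / len(records)   # aggregate rate
per_channel = rates_by_ref_type(records)             # isolated rates
```

Here search cites at 90% and reddit at 2%, yet the pooled rate is about 17%, which describes neither channel. The same pooling error skews any per-URL metric averaged across channels.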