Why ChatGPT Cites One Page Over Another (1.4M Prompt Study)

Originally from ahrefs.com

My notes

Summary

Ahrefs analyzed 1.4 million ChatGPT prompts to identify what drives citation selection. The dominant factor is semantic similarity between a page’s title and ChatGPT’s internal “fanout queries”: sub-questions generated behind the scenes from the user prompt. Pages surfaced through ChatGPT’s general search index have an 88% citation rate, while Reddit content (67.8% of non-cited URLs) is used for context but almost never credited.

Key Insights

The citation pipeline has a gating stage before your content is ever read:

  • ChatGPT uses the page title, the snippet, and the URL text to decide which pages to open; content quality is irrelevant if the gate rejects you
  • Only ~50% of retrieved URLs end up cited; the rest are read but discarded or never opened

Channel type is the biggest lever:

  • ref_type: search -> 88.46% citation rate (25.5M data points)
  • ref_type: news -> 12.01%
  • ref_type: reddit (dedicated feed) -> 1.93% (retrieved 16M times for context, but almost never credited)
  • YouTube and academia: <1%

Fanout query alignment beats raw prompt relevance:

  • Cited URL title vs. prompt: cosine similarity 0.602
  • Cited URL title vs. fanout query: cosine similarity 0.656
  • Non-cited URL title vs. prompt: cosine similarity 0.484
  • Optimizing for what ChatGPT is actually asking internally (fanout queries) matters more than the surface-level user query
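The similarity scores above are plain cosine similarity over title embeddings. A minimal sketch of the measurement, using toy 4-dimensional vectors rather than real model embeddings (which have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented numbers for illustration -- not output of any real embedding model.
title_vec  = [0.9, 0.1, 0.3, 0.0]   # a page title
fanout_vec = [0.8, 0.2, 0.4, 0.1]   # an internal fanout query
prompt_vec = [0.4, 0.6, 0.1, 0.5]   # the surface-level user prompt

print(cosine_similarity(title_vec, fanout_vec))  # higher: title matches the fanout query
print(cosine_similarity(title_vec, prompt_vec))  # lower: weaker match to the raw prompt
```

The study's claim, in these terms: a title that scores well against the fanout vectors gets opened, even if its score against the raw prompt is middling.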

URL readability has a measurable effect:

  • Natural language URL slugs: 89.78% citation rate vs. 81.11% for opaque URLs
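"Natural language slug" can be approximated with a simple heuristic; this is an illustrative guess at the distinction the study draws, not Ahrefs' actual classifier:

```python
import re

def slug_looks_natural(url_path):
    """Rough heuristic: does the last path segment read like words?"""
    slug = url_path.rstrip("/").rsplit("/", 1)[-1]
    slug = re.sub(r"\.\w+$", "", slug)   # drop a trailing file extension
    words = re.split(r"[-_]", slug)
    # Natural slugs tend to be two or more purely alphabetic words.
    return len(words) >= 2 and all(w.isalpha() for w in words)

print(slug_looks_natural("/blog/why-chatgpt-cites-one-page"))  # True
print(slug_looks_natural("/p/93817?ref=x"))                    # False
```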

Freshness is nuanced:

  • Across the web, ChatGPT skews ~458 days newer than Google organic
  • Within a single retrieval set, older, more established pages beat fresh ones; relevance dominates
  • For news queries specifically, freshness becomes the tiebreaker when relevance scores are equal
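That ranking behavior amounts to sorting by relevance first and using publish date only to break ties; a sketch with invented candidates and scores:

```python
from datetime import date

# Toy candidates as (url_slug, relevance_score, publish_date).
candidates = [
    ("old-deep-dive", 0.91, date(2021, 3, 1)),
    ("fresh-news",    0.91, date(2025, 1, 10)),
    ("fresh-shallow", 0.74, date(2025, 1, 12)),
]

# Relevance dominates; date only matters when relevance is exactly tied.
ranked = sorted(candidates,
                key=lambda c: (c[1], c[2].toordinal()),
                reverse=True)

print([url for url, _, _ in ranked])
# fresh-news outranks old-deep-dive only because their relevance is tied;
# fresh-shallow stays last despite being the newest page.
```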

Analytical trap to avoid:

  • Comparing “cited vs. non-cited” without isolating by ref_type produces misleading results; Reddit’s bulk volume distorts every aggregate metric (snippet rates, publication dates, etc.)
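The trap is a Simpson's-paradox-style mixing effect: a low-rate, high-volume channel drags down any pooled average. A toy sketch with invented counts (not the study's data):

```python
from collections import defaultdict

# Toy retrieval log of (ref_type, was_cited). Reddit dominates by volume
# with a near-zero citation rate, so it swamps the pooled number.
log = ([("search", True)] * 88 + [("search", False)] * 12
       + [("reddit", True)] * 2 + [("reddit", False)] * 98)

def citation_rate(rows):
    return sum(cited for _, cited in rows) / len(rows)

pooled = citation_rate(log)  # misleading aggregate across channels

by_type = defaultdict(list)
for ref_type, cited in log:
    by_type[ref_type].append((ref_type, cited))

for ref_type, rows in sorted(by_type.items()):
    print(ref_type, round(citation_rate(rows), 2))  # per-channel rates
print("pooled", round(pooled, 2))                   # far below the search rate
```

Here the pooled rate (0.45) sits nowhere near the search-channel rate (0.88), which is why the cited-vs-non-cited comparison only makes sense after splitting by ref_type.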