# Why ChatGPT Cites One Page Over Another (1.4M Prompt Study)

> Ahrefs analyzed 1.4M ChatGPT prompts: semantic similarity between page title and internal fanout queries drives citations, with search index pages cited 88% of the time.

Published: 2026-04-22
URL: https://daniliants.com/insights/why-chatgpt-cites-one-page-over-another-study-of-1-4m-prompts/
Tags: ai-citations, llm-seo, chatgpt, content-strategy, semantic-search, generative-engine-optimization, ahrefs

---

## Summary

Ahrefs analyzed 1.4 million ChatGPT prompts to identify what drives citation selection. The dominant factor is semantic similarity between a page's title and ChatGPT's internal "fanout queries": the sub-questions it generates behind the scenes from a user prompt. Pages in ChatGPT's general search index have an 88% citation rate, while Reddit content (67.8% of non-cited URLs) is used for context but almost never credited.

## Key Insight

**The citation pipeline has a gating stage before your content is ever read:**

- ChatGPT uses the title, URL snippet, and URL text to decide which pages to open; content quality is irrelevant if the gate rejects you
- Only ~50% of retrieved URLs end up cited; the rest are read but discarded or never opened

**Channel type is the biggest lever:**

- `ref_type: search` -> 88.46% citation rate (25.5M data points)
- `ref_type: news` -> 12.01%
- `ref_type: reddit` (dedicated feed) -> 1.93% (used 16M times for context, almost never credited)
- YouTube and academia: <1%

**Fanout query alignment beats raw prompt relevance:**

- Cited URL title vs. prompt: cosine similarity 0.602
- Cited URL title vs. fanout query: cosine similarity 0.656
- Non-cited URL title vs. prompt: 0.484
- Optimizing for what ChatGPT is actually asking internally (fanout queries) matters more than optimizing for the surface-level user query
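
To make the comparison concrete, here is a minimal sketch of scoring a page title against a set of fanout queries. It is an illustration only: the study's similarity scores presumably come from embeddings, whereas this sketch substitutes a simple bag-of-words cosine so it runs with the standard library alone. The function names (`tokenize`, `cosine_similarity`, `best_fanout_match`) are hypothetical, and the absolute scores will not match the figures above.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (a stand-in for embedding similarity)."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty string has no direction to compare
    return dot / (norm_a * norm_b)


def best_fanout_match(title: str, fanout_queries: list[str]) -> tuple[str, float]:
    """Return the fanout query closest to the title, with its score.

    Assumes fanout_queries is non-empty.
    """
    scored = [(q, cosine_similarity(title, q)) for q in fanout_queries]
    return max(scored, key=lambda pair: pair[1])
```

In practice you would swap `cosine_similarity` for an embedding model of your choice and compare a title against both the user prompt and your best guess at the fanout queries. If the study's pattern holds, the fanout score is the one worth raising.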

**URL readability has a measurable effect:**

- Natural language URL slugs: 89.78% citation rate vs. 81.11% for opaque URLs
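
The summary does not say how the study separated natural-language slugs from opaque ones, so the following is only a rough heuristic for auditing your own URLs. It assumes a slug is "readable" when its last path segment contains at least two alphabetic words and those words make up at least half of the segment. The function name and thresholds are hypothetical.

```python
import re
from urllib.parse import urlparse


def is_readable_slug(url: str, min_words: int = 2) -> bool:
    """Heuristic check: does the last path segment look like natural language?"""
    path = urlparse(url).path.strip("/")
    if not path:
        return False  # a bare domain has no slug to judge
    slug = path.split("/")[-1].lower()
    # Drop a trailing file extension such as ".html" (assumed irrelevant here).
    slug = re.sub(r"\.[a-z0-9]{2,5}$", "", slug)
    words = [w for w in re.split(r"[-_]+", slug) if w]
    # Count only purely alphabetic tokens longer than one character as words.
    alpha_words = [w for w in words if w.isalpha() and len(w) > 1]
    return len(alpha_words) >= min_words and len(alpha_words) >= len(words) / 2
```

Under these assumptions, `https://example.com/blog/why-chatgpt-cites-pages` counts as readable, while `https://example.com/a/8f3c2d91b7` and `https://example.com/p?id=83921` do not.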

**Freshness is nuanced:**

- Across the web, pages ChatGPT cites skew ~458 days newer than Google's organic results
- Within a single retrieval set, older, more established pages beat fresh ones because relevance dominates
- For news queries specifically, freshness becomes the tiebreaker when relevance scores are equal

**Analytical trap to avoid:**

- Comparing "cited vs. non-cited" without isolating by `ref_type` produces misleading results because Reddit's bulk volume distorts every aggregate metric (snippet rates, publication dates, etc.)
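
A small sketch of the stratified comparison, assuming your data is a list of `(ref_type, was_cited)` pairs. The row format and the function name `citation_rates` are assumptions for illustration, not the study's actual schema.

```python
from collections import defaultdict


def citation_rates(rows):
    """Compute the pooled citation rate and the rate per ref_type.

    rows: iterable of (ref_type, was_cited) pairs.
    Returns (overall_rate, {ref_type: rate}).
    """
    totals = defaultdict(int)
    cited = defaultdict(int)
    for ref_type, was_cited in rows:
        totals[ref_type] += 1
        cited[ref_type] += int(bool(was_cited))

    n = sum(totals.values())
    overall = sum(cited.values()) / n if n else 0.0
    by_type = {t: cited[t] / totals[t] for t in totals}
    return overall, by_type
```

With a toy input of ten `search` rows (nine cited) and ten uncited `reddit` rows, the pooled rate is 45% even though `search` alone sits at 90%. This is the same distortion the aggregate metrics suffer from, so report each `ref_type` separately before drawing conclusions.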