Why ChatGPT Cites One Page Over Another (1.4M Prompt Study)

Originally from ahrefs.com

My notes

Summary

Ahrefs analyzed 1.4 million ChatGPT prompts to identify what drives citation selection. The dominant factor is semantic similarity between a page’s title and ChatGPT’s internal “fanout queries”: sub-questions generated behind the scenes from the user prompt. Pages surfaced through ChatGPT’s general search index have an 88% citation rate, while Reddit content (67.8% of non-cited URLs) is used for context but almost never credited.

Key Insights

The citation pipeline has a gating stage before your content is ever read:

  • ChatGPT uses the page title, the snippet, and the URL text to decide which pages to open; content quality is irrelevant if the gate rejects you
  • Only ~50% of retrieved URLs end up cited; the rest are read but discarded or never opened

Channel type is the biggest lever:

  • ref_type: search -> 88.46% citation rate (25.5M data points)
  • ref_type: news -> 12.01%
  • ref_type: reddit (dedicated feed) -> 1.93% (retrieved 16M times for context, but almost never credited)
  • YouTube and academia: <1%

Fanout query alignment beats raw prompt relevance:

  • Cited URL title vs. prompt: cosine similarity 0.602
  • Cited URL title vs. fanout query: cosine similarity 0.656
  • Non-cited URL title vs. prompt: cosine similarity 0.484
  • Optimizing for what ChatGPT is actually asking internally (fanout queries) matters more than the surface-level user query
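The similarity scores above are plain cosine similarity over title embeddings. A minimal sketch of the measurement, using toy 4-dimensional vectors rather than real model embeddings (which have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented numbers for illustration -- not output of any real embedding model.
title_vec  = [0.9, 0.1, 0.3, 0.0]   # a page title
fanout_vec = [0.8, 0.2, 0.4, 0.1]   # an internal fanout query
prompt_vec = [0.4, 0.6, 0.1, 0.5]   # the surface-level user prompt

print(cosine_similarity(title_vec, fanout_vec))  # higher: title matches the fanout query
print(cosine_similarity(title_vec, prompt_vec))  # lower: weaker match to the raw prompt
```

The study's claim, in these terms: a title that scores well against the fanout vectors gets opened, even if its score against the raw prompt is middling.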

URL readability has a measurable effect:

  • Natural language URL slugs: 89.78% citation rate vs. 81.11% for opaque URLs
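"Natural language slug" can be approximated with a simple heuristic; this is an illustrative guess at the distinction the study draws, not Ahrefs' actual classifier:

```python
import re

def slug_looks_natural(url_path):
    """Rough heuristic: does the last path segment read like words?"""
    slug = url_path.rstrip("/").rsplit("/", 1)[-1]
    slug = re.sub(r"\.\w+$", "", slug)   # drop a trailing file extension
    words = re.split(r"[-_]", slug)
    # Natural slugs tend to be two or more purely alphabetic words.
    return len(words) >= 2 and all(w.isalpha() for w in words)

print(slug_looks_natural("/blog/why-chatgpt-cites-one-page"))  # True
print(slug_looks_natural("/p/93817?ref=x"))                    # False
```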

Freshness is nuanced:

  • Across the web, ChatGPT skews ~458 days newer than Google organic
  • Within a single retrieval set, older, more established pages beat fresh ones; relevance dominates
  • For news queries specifically, freshness becomes the tiebreaker when relevance scores are equal
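That ranking behavior amounts to sorting by relevance first and using publish date only to break ties; a sketch with invented candidates and scores:

```python
from datetime import date

# Toy candidates as (url_slug, relevance_score, publish_date).
candidates = [
    ("old-deep-dive", 0.91, date(2021, 3, 1)),
    ("fresh-news",    0.91, date(2025, 1, 10)),
    ("fresh-shallow", 0.74, date(2025, 1, 12)),
]

# Relevance dominates; date only matters when relevance is exactly tied.
ranked = sorted(candidates,
                key=lambda c: (c[1], c[2].toordinal()),
                reverse=True)

print([url for url, _, _ in ranked])
# fresh-news outranks old-deep-dive only because their relevance is tied;
# fresh-shallow stays last despite being the newest page.
```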

Analytical trap to avoid:

  • Comparing “cited vs. non-cited” without isolating by ref_type produces misleading results; Reddit’s bulk volume distorts every aggregate metric (snippet rates, publication dates, etc.)
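The trap is a Simpson's-paradox-style mixing effect: a low-rate, high-volume channel drags down any pooled average. A toy sketch with invented counts (not the study's data):

```python
from collections import defaultdict

# Toy retrieval log of (ref_type, was_cited). Reddit dominates by volume
# with a near-zero citation rate, so it swamps the pooled number.
log = ([("search", True)] * 88 + [("search", False)] * 12
       + [("reddit", True)] * 2 + [("reddit", False)] * 98)

def citation_rate(rows):
    return sum(cited for _, cited in rows) / len(rows)

pooled = citation_rate(log)  # misleading aggregate across channels

by_type = defaultdict(list)
for ref_type, cited in log:
    by_type[ref_type].append((ref_type, cited))

for ref_type, rows in sorted(by_type.items()):
    print(ref_type, round(citation_rate(rows), 2))  # per-channel rates
print("pooled", round(pooled, 2))                   # far below the search rate
```

Here the pooled rate (0.45) sits nowhere near the search-channel rate (0.88), which is why the cited-vs-non-cited comparison only makes sense after splitting by ref_type.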