How LLMs Choose Which Sources to Cite: A Data Study

  • LLMs do not cite sources randomly. DerivateX’s citation study found that three structural factors (indexability, topical authority clustering, and named entity density) predict citation likelihood more reliably than domain authority alone.
  • Pages excluded from Bing’s index are excluded from ChatGPT’s browsing citations entirely, regardless of content quality. Getting into the index is table stakes, not a differentiator.
  • The DerivateX study identified a pattern they call citation co-occurrence: cited pages are rarely cited alone, because LLMs pull sources in thematic clusters.
  • Content structure, specifically the presence of extractable, self-contained answer blocks, is a stronger predictor of AI citation than raw domain authority for informational queries.
  • Recency matters, but not uniformly. For fast-moving topics, pages older than six months are systematically underrepresented in LLM citations. For stable reference content, freshness is nearly irrelevant.

LLMs select sources by combining index presence, topical authority, content structure, named entity density, and recency signals. No single factor determines citation. Pages that appear in Bing’s index, contain explicit and extractable answers, and are surrounded by topically related cited pages have the highest citation rates across ChatGPT, Perplexity, and Google AI Overviews, according to DerivateX’s citation pattern study.

Why Most Guesses About AI Citations Are Wrong

Most practitioners assume AI citation works like traditional search ranking: high domain authority wins. That assumption is wrong, and it’s costing teams real visibility. DerivateX’s study of citation patterns across major LLM platforms found that domain authority is a weak predictor when content structure and entity density are controlled for.

The more accurate mental model is that LLMs are doing something closer to knowledge graph retrieval than search ranking. They are not looking for the most authoritative page on a topic. They are looking for the page that most efficiently completes a structured answer their model is already forming. That distinction changes everything about how you build pages that get cited.

A second common mistake is treating ChatGPT with browsing enabled and ChatGPT in base mode as the same system. They are not. Base model responses draw from statistical patterns in training data and never cite a URL. Browsing responses pull from live Bing results and cite specific pages. Conflating the two makes any citation strategy incoherent before it starts.

What Does the DerivateX Citation Study Actually Measure?

DerivateX analyzed citation behavior across ChatGPT (with browsing), Perplexity, and Google AI Overviews, examining which pages were cited in response to thousands of informational and commercial queries. The study tracked URL-level citation data rather than domain-level, which means it could distinguish between high-performing and low-performing pages on the same domain. That granularity is what makes the findings useful in practice.

The methodology involved submitting structured queries to each platform, recording every cited URL, and then analyzing the cited pages for a set of structural and content-level features. Features tested included Bing index presence, content structure (presence of explicit Q&A blocks, headers, tables, and numbered steps), named entity density (how many specific tools, companies, people, and standards a page mentions by name), topical co-citation rate, and page freshness.
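The comparison step of a methodology like this can be sketched in a few lines. This is an illustration only, not DerivateX's actual pipeline: the function name `feature_prevalence`, the record layout, and the sample rows are all hypothetical, and the sample values are invented for the example rather than taken from the study.

```python
def feature_prevalence(pages, feature):
    """Share of cited vs. non-cited pages that have a given boolean feature."""
    groups = {"cited": [], "non_cited": []}
    for page in pages:
        key = "cited" if page["cited"] else "non_cited"
        groups[key].append(bool(page[feature]))
    # None marks a group with no pages, so it is not confused with 0% prevalence.
    return {
        key: (sum(values) / len(values) if values else None)
        for key, values in groups.items()
    }


# Invented sample records: one row per URL, not per domain.
sample_pages = [
    {"url": "https://example.com/a", "cited": True, "has_qa_block": True},
    {"url": "https://example.com/b", "cited": True, "has_qa_block": True},
    {"url": "https://example.com/c", "cited": False, "has_qa_block": False},
    {"url": "https://example.com/d", "cited": False, "has_qa_block": True},
]

print(feature_prevalence(sample_pages, "has_qa_block"))
```

Keeping one record per URL is what allows two pages on the same domain to land in different groups.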

Three findings from the study warrant particular attention because they contradict the prevailing advice on AI visibility.

Finding 1: Index Presence Is Binary, Not a Spectrum

Pages not indexed by Bing do not appear in ChatGPT browsing citations. Full stop. This sounds obvious, but a meaningful share of pages that rank well in Google are not in Bing’s index, particularly newer content on younger domains. Before any other optimization, a page needs Bing Webmaster Tools verification and a clean sitemap submission. No amount of content quality compensates for index absence.
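One way to act on this is to compare your sitemap against a list of URLs you already know to be indexed. The sketch below assumes you have exported that list by hand from Bing Webmaster Tools; it does not call any Bing API, and the function names `sitemap_urls` and `missing_from_index` are hypothetical. The XML namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(sitemap_xml):
    """Return every <loc> URL found in a sitemap XML string."""
    root = ET.fromstring(sitemap_xml.strip())
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def missing_from_index(sitemap_xml, indexed_urls):
    """List sitemap URLs that are absent from a manually exported index list."""
    indexed = set(indexed_urls)
    return [url for url in sitemap_urls(sitemap_xml) if url not in indexed]


sample_sitemap = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/guide</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
</urlset>
"""

print(missing_from_index(sample_sitemap, ["https://example.com/guide"]))
```

Any URL the function returns is a candidate for resubmission before other optimization work begins.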

Finding 2: Named Entity Density Outperforms Generic Authority

Cited pages named significantly more specific entities (tools, companies, standards, people, datasets) than non-cited pages covering the same topic. This aligns with how LLMs build responses: they are pattern-matching against named concepts their training data associates with a query. A page that mentions ChatGPT, Perplexity, Google Gemini, and specific retrieval mechanisms by name gives the model more attachment points than a page that discusses “AI chatbots” in generic terms.

Finding 3: Extractable Answer Blocks Are Decisive for Informational Queries

Pages containing what DerivateX’s study terms direct retrieval units (self-contained paragraphs or structured blocks that answer a specific question without requiring surrounding context) are cited at substantially higher rates for informational queries than pages that bury answers in flowing prose. This is the structural equivalent of writing for featured snippets, but applied to LLM extraction rather than Google’s snippet algorithm.

How ChatGPT Chooses Sources When Browsing the Web

When ChatGPT’s browsing mode is active, the process runs in three stages. First, it queries Bing using keywords derived from the user’s prompt. Second, it scores candidate pages from those results. Third, it selects a subset to cite in the final response.

The scoring in stage two weighs authority, content match, freshness, and structural extractability. According to research from ZipTie.dev, ChatGPT also considers search intent alignment: whether the page’s primary purpose matches what the query is asking for. A product page cited for an informational query scores lower than an editorial page that directly addresses the question, even if the product page has higher domain authority.
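The scoring and selection stages can be pictured with a toy model. This is not ChatGPT's actual implementation, which is not public: the weights, the 0-to-1 feature scores, and the names `Candidate`, `score_candidate`, and `select_citations` are assumptions made for the example. It only shows how a lower-authority editorial page can outscore a higher-authority product page once intent alignment and extractability are weighed alongside authority.

```python
from dataclasses import dataclass

# Hypothetical weights: the factors are named in the text, the numbers are not.
WEIGHTS = {
    "authority": 0.2,
    "content_match": 0.3,
    "freshness": 0.15,
    "extractability": 0.2,
    "intent_alignment": 0.15,
}


@dataclass
class Candidate:
    url: str
    authority: float
    content_match: float
    freshness: float
    extractability: float
    intent_alignment: float


def score_candidate(page):
    """Weighted sum of the page's feature scores (each assumed to be 0 to 1)."""
    return sum(weight * getattr(page, name) for name, weight in WEIGHTS.items())


def select_citations(candidates, limit=3):
    """Rank candidates by score and keep the top `limit` URLs."""
    ranked = sorted(candidates, key=score_candidate, reverse=True)
    return [page.url for page in ranked[:limit]]


# Invented feature values for an informational query.
product_page = Candidate("https://example.com/product", 0.9, 0.5, 0.6, 0.3, 0.2)
editorial_page = Candidate("https://example.org/guide", 0.5, 0.9, 0.6, 0.8, 0.9)

print(select_citations([product_page, editorial_page], limit=1))
```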

One underreported behavior documented in the DerivateX study: ChatGPT rarely cites a single source in isolation. It cites topically adjacent pages together. Profound’s research corroborates this, describing it as sources traveling in packs. Your citation neighbors (the other pages cited alongside yours) are partly determined by your topical clustering in the index. Pages that are strongly associated with a topic cluster through internal links, backlink profiles, and semantic content are more likely to appear in citation groups for that topic.
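If you log which URLs each AI response cites, you can measure your own citation neighbors directly. The sketch below assumes that log is a list of URL lists, one per response. The function names and sample URLs are hypothetical, and this is a plain pair count rather than the method used in either study.

```python
from collections import Counter
from itertools import combinations


def co_citation_counts(responses):
    """Count how often each pair of URLs is cited in the same response."""
    pairs = Counter()
    for cited_urls in responses:
        # Deduplicate and sort so (a, b) and (b, a) count as the same pair.
        for pair in combinations(sorted(set(cited_urls)), 2):
            pairs[pair] += 1
    return pairs


def citation_neighbors(url, responses):
    """Return the URLs cited alongside `url`, most frequent first."""
    neighbors = Counter()
    for (first, second), count in co_citation_counts(responses).items():
        if url == first:
            neighbors[second] += count
        elif url == second:
            neighbors[first] += count
    return neighbors.most_common()


# Invented log: each inner list is the set of URLs cited in one response.
sample_responses = [
    ["a.com/x", "b.com/y", "c.com/z"],
    ["a.com/x", "b.com/y"],
    ["d.com/w"],
]

print(citation_neighbors("a.com/x", sample_responses))
```

A page that never appears in any pair, like `d.com/w` in the sample, is being cited alone, which the study describes as the rarer case.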

What Separates Cited Pages From Non-Cited Pages on the Same Domain?

This is the question most AI visibility guides skip, because they operate at the domain level. DerivateX’s URL-level analysis makes it answerable.

The table below summarizes the key differentiating features between cited and non-cited pages observed in the study, across all three platforms analyzed.

Feature | Cited Pages | Non-Cited Pages (Same Domain)
Bing index status | 100% indexed | Significant share not indexed
Direct retrieval units present | High prevalence | Low prevalence
Named entity density | Higher | Lower
Topical co-citation rate | Higher | Lower
Page age (informational queries) | Skews recent (under 12 months) | Mixed; older pages underrepresented
Structured formatting (tables, headers, steps) | High prevalence | Low prevalence

The practical implication: you can have a strong domain and still have most of your pages ignored by LLMs if those pages lack direct retrieval units and named entity density. Domain authority is a floor, not a ceiling.

How Does AI Citation Behavior Differ Across Platforms?

ChatGPT with browsing, Perplexity, and Google AI Overviews each have meaningfully different citation behaviors. Treating them as a monolith produces strategies that are mediocre for all three.

Perplexity is the most citation-heavy of the three. It regularly cites five to ten sources per response and tends to pull from a wider range of domain authorities, including smaller publications, if the content structure is strong. ChatGPT with browsing is more selective and shows a stronger bias toward recognizable brands and established publications for ambiguous queries. Google AI Overviews skews heavily toward pages already ranking on page one of standard Google results, making traditional search ranking a prerequisite rather than an alternative path.

The DerivateX study found that Perplexity shows the strongest response to content structure improvements among the three platforms. Pages that added explicit Q&A blocks and structured headers saw measurable citation rate changes on Perplexity before similar changes registered on ChatGPT. If you are prioritizing one platform for citation experiments, Perplexity gives faster feedback loops.

What Does the Found On AI Citation Signal Framework Identify as the Core Levers?

After reviewing the DerivateX study data alongside citation pattern research from Profound and ZipTie.dev, Found On AI identified four discrete levers that determine whether a page gets cited by an LLM. We call this the Citation Signal Framework, and it organizes the decision-making more cleanly than the standard SEO checklists being applied to GEO.

Lever 1: Index Eligibility

Is the page in Bing’s index? Submit to Bing Webmaster Tools, verify your sitemap, and check fetch status for priority pages. This is a one-time fix with a binary payoff.

Lever 2: Retrieval Density

Does the page contain direct retrieval units, meaning paragraphs or blocks that answer a specific question completely without surrounding context? Each retrieval unit should be 40 to 80 words, state the answer in the first sentence, and avoid pronouns that require prior context to resolve. Writing for retrieval density is structurally different from writing for reading flow. For example, instead of writing “ChatGPT is a conversational AI that helps with a wide range of tasks including writing, research, and coding,” a direct retrieval unit reads: “ChatGPT generates human-like text responses, answers nuanced research questions, and completes code debugging tasks.” Most editorial teams have not made that adjustment.
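Two of these criteria, the word range and the pronoun rule, can be checked mechanically. The sketch below is a rough heuristic: the function name, the pronoun list, and the decision to inspect only the opening word are assumptions. Whether the first sentence actually states the answer still needs a human read.

```python
import string

# Openers that usually point back to earlier text. An illustrative list, not exhaustive.
CONTEXT_DEPENDENT_OPENERS = {"it", "this", "that", "these", "those", "they", "he", "she"}


def check_retrieval_unit(text, min_words=40, max_words=80):
    """Check a block against the 40-to-80-word range and the opening-pronoun rule."""
    words = text.split()
    first_word = words[0].strip(string.punctuation).lower() if words else ""
    return {
        "word_count": len(words),
        "length_ok": min_words <= len(words) <= max_words,
        "opens_with_pronoun": first_word in CONTEXT_DEPENDENT_OPENERS,
    }


print(check_retrieval_unit("It works well for most teams."))
```

Running this over every paragraph under a question-style heading gives a quick list of blocks to rewrite first.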

Lever 3: Entity Specificity

How many named entities (tools, companies, people, standards, datasets, platforms) does the page contain? Generic topic coverage scores poorly against specific entity-rich coverage for LLM citation purposes. A page about AI writing tools that names Jasper, Copy.ai, Writer, and Claude, with specific claims about each, gives a language model far more to attach to than a page discussing “AI writing assistants” in abstract terms.
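Entity specificity can be approximated by counting mentions of entities you expect a page on the topic to name. The sketch below assumes you supply that list yourself. The function name and the per-100-words normalization are illustrative choices, not a metric defined in the study, and matching is case-sensitive so that a product called Writer is not confused with the common noun.

```python
import re


def entity_density(text, entities):
    """Count exact, case-sensitive mentions of known entities in a text."""
    word_count = len(text.split())
    mentions = {}
    for entity in entities:
        # Lookarounds stop "Writer" from matching inside "Writers" or "copyWriter".
        pattern = r"(?<!\w)" + re.escape(entity) + r"(?!\w)"
        hits = len(re.findall(pattern, text))
        if hits:
            mentions[entity] = hits
    total = sum(mentions.values())
    return {
        "distinct": len(mentions),
        "total": total,
        "per_100_words": 100 * total / word_count if word_count else 0.0,
    }


tools = ["Jasper", "Copy.ai", "Writer", "Claude"]
specific = "Jasper, Copy.ai, Writer, and Claude each handle long-form drafts differently."
generic = "AI writing assistants handle long-form drafts differently."

print(entity_density(specific, tools))
print(entity_density(generic, tools))
```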

Lever 4: Topical Co-location

Is the page topically clustered with other pages that are already being cited? This is the lever most practitioners cannot directly control, but you can influence it through internal linking strategy and by building content in topic clusters rather than isolated posts. Pages that exist in topically coherent clusters on your domain are more likely to be pulled into citation groups alongside already-cited neighbors.

Does Content Freshness Actually Affect AI Citation Rates?

It depends entirely on the query type, and the distinction matters. DerivateX’s study found a clear bifurcation in how freshness affects citation rates across query categories.

For queries about AI tools, market conditions, pricing, product features, and anything with implicit recency expectations, pages older than six months are systematically underrepresented in citations. LLMs appear to apply a recency filter that is more aggressive than Google’s standard freshness algorithm. A page published 18 months ago about a tool’s pricing will be cited less frequently than a page published three months ago, even if both contain accurate information.

For stable reference content (definitions, methodology explanations, evergreen how-to content), freshness has minimal effect on citation rates. The structured content and entity signals dominate. This means an evergreen methodology page does not need quarterly updates to maintain citation eligibility, but a tools comparison page does.
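This split translates into a simple refresh rule for a content inventory. The sketch below is one possible reading of it: the six-month figure comes from the study as described here, while the 183-day conversion, the `fast_moving` and `stable` labels, and the function name are assumptions.

```python
from datetime import date

# Roughly six months. The exact day count is an assumption for this sketch.
FAST_MOVING_MAX_AGE_DAYS = 183


def needs_refresh(published, today, topic_type):
    """Flag fast-moving pages older than about six months; never flag stable ones."""
    if topic_type == "stable":
        return False
    if topic_type == "fast_moving":
        return (today - published).days > FAST_MOVING_MAX_AGE_DAYS
    raise ValueError(f"unknown topic_type: {topic_type!r}")


today = date(2024, 7, 1)
print(needs_refresh(date(2023, 1, 1), today, "fast_moving"))  # 18-month-old pricing page
print(needs_refresh(date(2023, 1, 1), today, "stable"))       # evergreen methodology page
```

Tagging each page with a topic type once lets the same inventory drive both the refresh queue and the list of pages that can be left alone.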

What Gets a Page Excluded From AI Citations?

Exclusion is easier to understand than inclusion, because the filters are mostly binary. A page is excluded if it is not in the Bing index. It is excluded if the query-to-content match is poor, meaning the page is topically tangential to what the LLM is answering. It is excluded if the content is primarily thin, duplicative, or structured as marketing copy rather than informational content.

One exclusion factor that surprises practitioners: pages behind paywalls or login walls are not cited, regardless of their quality or authority. This sounds obvious but has real implications for publishers who gate their best data-heavy content. If the content being gated is exactly the kind of structured, entity-rich, data-backed material LLMs favor, gating it removes it from AI citation entirely while competitors with open versions of similar content accumulate citations.

Inconsistent brand messaging across pages on the same domain also appears to suppress citation rates, based on findings in the broader citation pattern literature. LLMs weight entity coherence. If your site says your tool does X on one page and implies it does not on another, the model’s confidence in citing either page drops.

Frequently Asked Questions About How LLMs Choose Sources

How does ChatGPT decide which sources to cite when browsing?

ChatGPT with browsing queries Bing, pulls candidate URLs from the results, and then scores those pages on authority, content-to-query match, recency, and structural extractability. Pages not in Bing’s index are excluded entirely before scoring begins. Among indexed pages, those with self-contained answer blocks and high named entity density are selected at higher rates for informational queries. ChatGPT typically cites multiple sources per response rather than selecting a single winner.

What kinds of pages do LLMs recommend and cite most often?

LLMs most frequently cite pages that contain explicit, self-contained answers to specific questions, mention many relevant entities by name, are indexed in the search engine the LLM queries, and exist within a topically coherent cluster of related content. Structured pages (those using headers, tables, numbered steps, and Q&A blocks) are cited at higher rates than pages presenting the same information in flowing prose. Data-backed pages with named sources also perform above average across platforms.

Does domain authority determine whether an AI cites your page?

Domain authority is a weak predictor of AI citation at the URL level when content structure and entity density are controlled for. High domain authority raises the floor (it gets pages indexed and lowers the threshold for initial scoring), but it does not guarantee citation. DerivateX’s study found that low-authority pages with strong direct retrieval units and high entity specificity were cited more frequently than high-authority pages with generic, poorly structured content on the same topic.

Does Perplexity use different citation criteria than ChatGPT?

Yes, meaningfully so. Perplexity cites more sources per response than ChatGPT, is more tolerant of lower domain authority when content structure is strong, and shows a faster response to structural content improvements. ChatGPT with browsing applies a stronger brand recognition filter and is more selective overall. Google AI Overviews is the most restrictive, strongly preferring pages that already rank on the first page of standard Google results, making traditional search ranking a prerequisite for AI Overview citation on competitive queries.

How do LLMs handle pages behind paywalls or login walls?

They do not cite them. Pages that require authentication or payment to access are not retrieved by LLM browsing systems. This has a practical implication for publishers: gating high-quality structured content removes it from AI citation eligibility entirely, even if that content would otherwise score well on every other citation factor. If the goal is AI citation and brand visibility through LLM responses, open-access content is a structural requirement.

Can you get cited by an AI without ranking on Google?

For ChatGPT with browsing and Perplexity, yes. Both pull from Bing’s index rather than Google’s, so strong Bing presence is the relevant prerequisite. A page can be absent from Google’s top results and still appear in ChatGPT or Perplexity citations if it is indexed in Bing and scores well on content structure and entity density. Google AI Overviews is the exception: it correlates strongly with Google search rankings, making standard Google SEO a near-requirement for that specific platform.

What is citation co-occurrence, and why does it matter?

Citation co-occurrence is the pattern, documented in DerivateX’s study and corroborated by Profound’s research, where LLMs cite pages in thematic groups rather than in isolation. Your citation neighbors are partly determined by your topical positioning: your internal links, backlink profile, and semantic content clustering. Building isolated, standalone pages is a weaker strategy than building topically coherent content clusters where each page reinforces the others’ citation eligibility.

How should teams prioritize improvements to increase AI citation rates?

Start with index eligibility: confirm all priority pages are in Bing’s index via Bing Webmaster Tools. Then audit for retrieval density: does each page contain self-contained answer blocks that resolve a specific question in 40 to 80 words without requiring surrounding context? Next, increase entity specificity by naming specific tools, companies, datasets, and standards rather than discussing categories generically. Finally, build internal topical clusters around your core subjects. These four changes, applied in sequence, address the Citation Signal Framework levers in order of impact.

The Deeper Pattern Behind AI Citation Behavior

What the DerivateX data reveals, and what most AI visibility advice misses, is that LLMs are not running a ranking algorithm against your page. They are completing a structured knowledge representation, and they are looking for pages that slot cleanly into the gaps in that representation. A page that names specific entities, answers a specific question in a retrievable block, and sits inside a coherent topical cluster is not just well-structured content. It is behaving like a node in a knowledge graph rather than a document in a search index.

That reframe matters because it tells you where to spend time. Traditional SEO has trained teams to think about rankings, authority, and traffic. AI citation optimization requires thinking about coherence, specificity, and extractability. A page can rank on page one of Google and still contribute nothing to your AI citation footprint if it is written as flowing editorial prose with no extractable units and generic topic framing.

High domain authority does not build an AI citation footprint on its own. Understanding how language models form answers, and building pages that participate in that process, does. The full DerivateX citation pattern study, including platform-level breakdowns and query category data, is worth reading in detail if you are building that kind of program.

Emily Carter