What we found

We ran an analysis across our customer base. We took 125 questions – real questions our customers' audiences ask – and posed them to a generic LLM. Then we compared the answers to what the experts actually say.

Twenty-six percent of the time, the answers were meaningfully different. Not stylistically different. Substantively different: different guidance, different conclusions, different tradeoffs.

Here's what that looks like in practice.

ParentData / Emily Oster

Ask a generic LLM "Is it okay to have a glass of wine during pregnancy?" and you get a clear answer: no. There's no known safe amount. That's the CDC position, the WHO position, the ACOG position.

Ask Emily Oster – the economist and author who built ParentData around data-driven pregnancy and parenting guidance – and you get something different. Trimester-by-trimester analysis. A careful reading of what the research actually shows. A conclusion that light drinking in the second and third trimesters is not associated with negative outcomes, based on the same studies the health agencies cite.

The LLM isn't missing context. It has access to the same research. But it was trained on thousands of sources repeating the consensus, and Oster's contrarian position is a single signal among them. Consensus wins. The expert disappears.

In our analysis, ParentData saw 14 out of 15 questions return a meaningfully different answer from the generic model. The gap wasn't random. It was structural. Try it yourself →

FOUND NY

Ask a generic LLM "What's new and interesting in Chinatown?" and you get a paragraph about the neighborhood's history, some notes on dim sum culture, maybe a mention of a well-known restaurant that closed two years ago.

Ask FOUND NY and you get five specific spots: Bridges, Lei, Bar Oliver, LaiRai, Chop Suey Club – with descriptions of what makes each worth going to right now.

FOUND's value isn't knowing about Chinatown. It's knowing what's worth going to, based on correspondents who are actually walking the neighborhood. The LLM has general knowledge. FOUND has names. That difference is the entire product.

Livelong Media

Ask a generic LLM "Is all dark chocolate good for you?" and you get the standard guidance: antioxidants, yes, choose 70%+ cacao, enjoy in moderation.

Ask the team at Livelong Media – which covers longevity and preventive health – and the answer is more complicated. Some dark chocolate bars contain concerning levels of heavy metals: lead and cadmium. Bars with higher cacao percentages (85%+) tend to carry more of these metals, not fewer. Some bars exceeded California's safety limits by 250%.

The LLM reproduces the feel-good consensus. The expert has done original analysis that contradicts it. Eleven of 15 questions showed meaningful divergence. Try it yourself →

The 26% figure held across GPT-5 model versions. This isn't a bug in a particular release. It's structural.

What parametric knowledge actually is

Think of it like a smoothie.

Millions of documents go into the blender. One set of weights comes out. You can't un-blend the ingredients. What the model retains is the statistical center of mass – the internet average, weighted by volume and repetition.
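
A toy sketch of that dynamic, in Python. This is not how training actually works – it's a hypothetical corpus and a simple frequency count – but it shows how volume and repetition decide where the center of mass ends up:

```python
from collections import Counter

# Hypothetical corpus: 1,000 sources repeat the consensus position;
# one expert source states a contrarian, research-backed position.
corpus = ["no known safe amount"] * 1000 + [
    "light drinking later in pregnancy: no association with negative outcomes"
]

# A stand-in "model" that reproduces the most frequent pattern it saw.
# Real training is vastly more sophisticated, but the pull toward the
# high-volume signal is the same.
counts = Counter(corpus)
consensus_answer, frequency = counts.most_common(1)[0]

print(consensus_answer)          # "no known safe amount"
print(frequency / len(corpus))   # ~0.999 – the expert is 0.1% of the signal
```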

This is not a criticism of how LLMs are built. It's a description. The training process optimizes for a compressed representation of patterns across a massive corpus. What emerges is something genuinely useful – and genuinely limited. (We explored the values embedded in this process in What does your AI value? — parametric knowledge is the mechanism behind it.)

The limit has a name. Researchers at USC called it "cognitive homogenization" (Sourati et al., Trends in Cognitive Sciences, March 2026): LLM outputs "mirror a narrow and skewed slice of human experience." The more a viewpoint diverges from the statistical center, the less faithfully the model represents it. Read the paper →

RAG vs. parametric knowledge

Retrieval-Augmented Generation – RAG – is the architecture that makes it possible to build expert-grounded AI at all. Instead of relying solely on what the model learned during training, you pull relevant documents at query time and pass them into the context window.

RAG helps. But it doesn't fully solve the parametric knowledge problem. The model's trained weights are still there. When the retrieved content is ambiguous, thin, or doesn't directly address the question, the model falls back on what it learned during training. The blended weights reassert themselves. Retrieval narrows the gap – it doesn't close it.
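
A minimal sketch of the retrieval step, in Python. The retriever here is a hypothetical keyword-overlap ranker and the documents are placeholders – production systems use embedding search, chunking, and reranking – but the shape is the same: fetch relevant expert content at query time and put it in front of the model:

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase word tokens – a crude stand-in for real embeddings."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Rank documents by naive keyword overlap with the query and keep the top k."""
    q = tokens(query)
    ranked = sorted(documents, key=lambda d: len(q & tokens(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Assemble the augmented prompt: expert content first, then the question."""
    context = "\n\n".join(retrieve(query, documents))
    return (
        "Answer using ONLY the expert content below. "
        "If it does not cover the question, say so instead of guessing.\n\n"
        f"Expert content:\n{context}\n\nQuestion: {query}"
    )

# Hypothetical expert corpus and audience question.
expert_docs = [
    "Wine during pregnancy: a trimester-by-trimester review finds light drinking in "
    "the second and third trimesters is not associated with negative outcomes.",
    "General guidance on caffeine intake while breastfeeding.",
]
print(build_prompt("Is it okay to have a glass of wine during pregnancy?", expert_docs))
```

The instruction to answer "ONLY" from the retrieved content is exactly where the parametric prior pushes back: when the context is thin, the model tends to fall through to its trained weights anyway.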

The honest tradeoff

Parametric knowledge is useful. For general tasks – summarizing, drafting, explaining concepts that exist in mainstream consensus – it performs well. The training corpus is large enough that the statistical average is a reasonable approximation of truth.

The tradeoff is lopsided for experts.

The more differentiated your knowledge, the more it gets averaged away. If your expertise lies in the mainstream, you lose a little. If your expertise is your disagreement with the mainstream – your original research, your curatorial judgment, your hard-won clinical intuition – you lose most of it.

Getting to an accurate representation of the expert requires more than plugging in a document corpus. Much of the averaging still happens. Even with retrieval, the model's parametric prior bleeds through, especially on edge cases and out-of-distribution questions.

What's required: a deep understanding of the expert's actual positions, across their full body of work. Clear editorial guardrails that define what's in scope and what isn't. Continuous monitoring to catch drift. Tuning of search and retrieval to match the audience's actual question patterns. Human review for high-stakes answers.

It's an ongoing engineering challenge, not a switch you flip.
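
As one hedged illustration of what guardrails plus review can look like, here is a small Python sketch. Every name and threshold is hypothetical – this is not a description of any particular system, just the shape of the routing logic:

```python
from dataclasses import dataclass

@dataclass
class AnswerPolicy:
    in_scope_topics: set[str]      # editorial guardrails: what the expert actually covers
    high_stakes_topics: set[str]   # answers that always go to human review
    min_retrieval_score: float     # below this, don't let the parametric prior fill the gap

def route(question_topic: str, retrieval_score: float, policy: AnswerPolicy) -> str:
    if question_topic not in policy.in_scope_topics:
        return "decline"           # out of scope: say so rather than average toward consensus
    if retrieval_score < policy.min_retrieval_score:
        return "decline"           # thin retrieval: the fallback would be the blended weights
    if question_topic in policy.high_stakes_topics:
        return "human_review"      # high stakes: an editor signs off before the answer ships
    return "answer"

policy = AnswerPolicy(
    in_scope_topics={"pregnancy", "sleep", "feeding"},
    high_stakes_topics={"pregnancy"},
    min_retrieval_score=0.55,
)
print(route("pregnancy", 0.72, policy))   # -> "human_review"
print(route("tax law", 0.90, policy))     # -> "decline"
```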

Try it yourself

Before you take this on faith, run your own test. Here's the prompt:

You are an expert in [your field]. Answer the following question as helpfully as you can: [your question].

Use a question from your domain where you have a specific, non-consensus view – something you'd push back on if a generalist got it wrong. Avoid broad questions. The more specific the question, the more clearly the gap shows.
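
If you'd rather script the test than paste the prompt into a chat window, here is a minimal sketch, assuming the OpenAI Python SDK and an OPENAI_API_KEY in your environment. The field, question, and model name are placeholders – swap in your own and whichever generic model you want to test:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

field = "pregnancy and parenting research"                          # your field
question = "Is it okay to have a glass of wine during pregnancy?"   # your question

prompt = (
    f"You are an expert in {field}. "
    f"Answer the following question as helpfully as you can: {question}"
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: use whatever generic model you want to test
    messages=[{"role": "user", "content": prompt}],
)

# Compare this answer to what you would actually tell your audience.
print(response.choices[0].message.content)
```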

See what comes back.

The test works best on the questions where your expertise is most differentiated: the places where you've done original work, reached a different conclusion, developed a framework no one else has. Those are precisely the questions where the averaging is most severe.

Here's why that matters.

The gap between what a generic LLM knows and what you know is not noise in the system. It's not a bias to be corrected. It's the reason your audience comes to you instead of asking Google. Your expertise is not a training signal to be averaged out with a million other sources. It's the entire product.

What gets lost in the compression is exactly what's worth preserving.

This post explains the problem. The solution is a closed-loop answer engine: AI that represents your actual content instead of blending it into a consensus. Read more in What Is an AI Answer Engine?
