Are there any typos on page 17 of Richard Scarry's What Do People Do All Day?

You already know the answer is going to be wrong. No language model has inspected that page, and no amount of training data about Richard Scarry will help. The question is why it fails, and the mechanism behind it applies to every question you will ever ask an AI about your own expertise.

To understand why, you need to know one thing about how these systems actually work.

Prediction Engines, Not Knowledge Systems

Large language models (LLMs) predict the next word given everything that came before. That's the entire mechanism. They don't store facts, retrieve information, or maintain a filing system of what they have read. When you type a question into ChatGPT or Claude, the model isn't searching a database. It's generating a sequence of words, one at a time, each drawn from a probability distribution over likely continuations of the words before it.

They compress patterns from training data into statistical weights and reproduce the patterns most consistent with the input. This is genuinely useful, and genuinely limited. Useful because statistical patterns capture an enormous amount of practical knowledge. Limited because statistical patterns are all they capture.
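If it helps to see the loop in code, here is a deliberately tiny sketch in Python. It is a word-count toy, not a neural network: real models operate on tokens, compress patterns into billions of learned weights rather than a lookup table, and the two-sentence corpus here is invented for illustration. But the shape of the loop is the same: learn frequencies from text, then emit one word at a time in proportion to them.

```python
import random
from collections import Counter, defaultdict

# A word-count toy, not a real LLM. The "training corpus" is a made-up
# sentence pair; the "weights" are just frequency counts of which word
# follows which.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1  # all that is stored is frequency

def generate(start: str, length: int = 8) -> str:
    words = [start]
    for _ in range(length):
        counts = following[words[-1]]
        if not counts:
            break
        # Sample the next word in proportion to how often each continuation
        # appeared after the current word during "training".
        choices, weights = zip(*counts.items())
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

print(generate("the"))  # e.g. "the cat sat on the rug . the dog"
```

Notice what's missing: there is no step where the system checks a fact. There is only "what tends to come next."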

Think of training as a blender. Millions of documents go in. One set of compressed weights comes out. What remains is the statistical average of everything the model encountered, weighted by volume and repetition. The individual ingredients are gone. You cannot ask the blender to un-blend a single ingredient and hand it back. Those weights are what we call the model's parametric knowledge — the patterns compressed into parameters during training, which happen to sound like knowledge when reproduced but aren't stored as facts. Parametric Knowledge, Explained walks through what that means for expert-grounded work.

The Internet Average

What the model produces when you ask it a question is not consensus. It is the internet average — a statistical center of mass of every piece of text in the training corpus, weighted by volume, not by quality. Consensus implies agreement among informed parties. The internet average weights Reddit threads, academic papers, forum posts, and mainstream articles together and returns whichever pattern dominates the mush.

The corpus the model trains on is where this shape comes from. Later tuning steps adjust it — sharpen tone, push toward certain behaviors, correct specific failure modes — but they start from whatever statistical center of mass the training data already produced. The foundational lens is set by the corpus; the tuning that follows refines it rather than replaces it. What Does Your AI Value? names the layers of value choices that happen before a user ever types a prompt.

The model has no mechanism for weighting a peer-reviewed study differently from a comment thread. A thousand casual mentions of a claim outweigh one rigorous analysis reaching the opposite conclusion. The position that appears most often across the training corpus wins — not the document, the viewpoint.
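A back-of-the-envelope calculation makes the imbalance concrete. The counts are invented for the example; the point is that without a notion of source quality, every document contributes the same weight of signal.

```python
# Illustrative arithmetic only; the counts are invented.
casual_mentions = 1000   # forum posts and comments repeating the popular claim
rigorous_analyses = 1    # one careful study reaching the opposite conclusion

share_popular = casual_mentions / (casual_mentions + rigorous_analyses)
print(f"Training signal behind the popular claim: {share_popular:.1%}")  # 99.9%
```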

Once you see the mechanism as averaging, not agreeing, the failure modes stop looking random. The more a viewpoint diverges from the statistical center, the less faithfully the model represents it. This is the prediction mechanism working exactly as designed.

An analysis across real expert questions found that generic LLMs gave answers that diverged meaningfully from expert answers 26% of the time. Not stylistically different. Substantively different guidance, different conclusions, different tradeoffs. Researchers at USC call this effect cognitive homogenization. The gap held across model versions. It's structural.

The averaging problem runs deeper than factual accuracy. Most of what audiences go to experts for isn't the kind of thing you can fact-check. It's judgment, synthesis, intuition, the benefit of having seen a thousand cases. None of that survives the averaging process. The industry's standard definition of hallucination operates entirely on the narrow slice of claims you can verify against a source. It can't see the judgment layer at all. An expert's most valuable contribution, the part audiences seek them out for specifically, lives above the line of what any model can reliably reproduce.

Reliable Here, Failing There

Researchers call this the jagged frontier of AI capability. A model writes a competent summary of a well-documented topic and then, in the next sentence, generates a confidently stated claim about a niche subtopic that no expert in the field would endorse. There's no warning. No change in tone. The boundary between reliable output and unreliable output is invisible to the reader.

The failure is not that the model gets things wrong. The failure is that it answers confidently from outside the expert's knowledge, and the reader has no way to tell. A generic LLM doesn't flag the boundary. It doesn't refuse when the answer isn't grounded. It fills the gap and moves on.

Now go back to the Richard Scarry question.

The Question You Can Now Ask

It fails because every element of the question pulls away from the statistical center: a specific page, a narrow fact, a task requiring judgment applied to a single instance. The model can reproduce the internet average. It can't retrieve a detail the training data barely covered, or apply reasoning to a unique case.

The generalized version of this test: how often has this been discussed, and how easily retrievable is the specific answer? The more specific the question, and the thinner the available training signal, the less you should trust the output. This isn't a test of any particular model's quality. It's a test of what the prediction mechanism can and cannot do. You can run it on any question, with any model, before you decide whether to trust the answer.
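If you want the test spelled out, here is one way to write it down. The labels and cutoffs are invented, and the two judgments are still yours to make; this is a rough heuristic, not a measurement.

```python
def trust_check(coverage: str, specificity: str) -> str:
    """Rough heuristic, not a measurement. coverage: how widely the topic is
    discussed in public text ('broad' or 'thin'). specificity: how narrow the
    asked-for detail is ('general' or 'narrow')."""
    if coverage == "broad" and specificity == "general":
        return "near the statistical center: plausible, still verify key claims"
    if coverage == "thin" and specificity == "narrow":
        return "far from the center: treat the output as a guess"
    return "mixed: verify anything you intend to rely on"

# The Richard Scarry question: barely discussed topic, maximally narrow detail.
print(trust_check(coverage="thin", specificity="narrow"))
```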

Once you can see the mechanism, the question becomes what to do about it.

Why This Matters for Your Expertise

Every language model hits questions outside its grounded knowledge. That's the gap. What the system does when it hits it is a design choice, not an inevitability.

A closed-loop system answers only from the expert's content. When the answer isn't there, it says so. The gap doesn't disappear. The difference is architectural: a closed-loop system makes the residual catchable, testable, and reducible over time. The architecture has a name and a shape; What Is an AI Answer Engine? explains how it works in practice.
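As a sketch of the shape, not a production design, here is the decisive behavior in miniature. The expert corpus, the keyword-overlap matching, and the threshold are all invented stand-ins for real retrieval over real source material.

```python
# Minimal closed-loop sketch: answer from the expert's content or refuse.
EXPERT_CONTENT = {
    "pricing": "We price fixed-scope engagements after a paid discovery phase.",
    "timeline": "A typical implementation runs eight to twelve weeks.",
}

def answer(question: str, threshold: int = 2) -> str:
    q_words = set(question.lower().split())
    best_passage, best_overlap = None, 0
    for passage in EXPERT_CONTENT.values():
        overlap = len(q_words & set(passage.lower().split()))
        if overlap > best_overlap:
            best_passage, best_overlap = passage, overlap
    if best_overlap < threshold:
        # The gap becomes a visible, loggable event instead of a filled-in guess.
        return "Not covered in the source material."
    return best_passage

print(answer("How long does a typical implementation take?"))  # grounded answer
print(answer("Do you integrate with SAP?"))  # "Not covered in the source material."
```

Real systems typically retrieve with embeddings and use a model to phrase the grounded answer, but the property that matters is the refusal branch: when nothing in the source material clears the bar, the system says so instead of filling the gap.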

The mechanism this post describes isn't going away. It's getting worse. As AI trains increasingly on its own output, the averaging problem compounds. The statistical center of mass drifts further from reality. The response is better source material, not less AI.

If you build on expertise, this mechanism is the reason your source material has never been more valuable.

Keep Reading