Claude Code Conversations with Claudine

Why Does Your LLM Work in Staging But Fail With Real Users?

One of the most frustrating patterns in production AI systems is the performance gap between controlled evaluation and real-world use. An LLM that scores well on benchmarks and passes every staging test can still fail badly when actual users interact with it — giving inconsistent answers, misreading intent, drifting from expected behavior, or hallucinating in ways that never appeared in testing. This gap is not a fluke. It reflects structural differences between how AI systems are evaluated and how they are actually used: evaluation environments are clean, prompts are well-formed, edge cases are known. Real users are unpredictable. This episode examines why this gap exists, why it is so hard to close, and what teams building AI products can actually do about it.


 Produced by VoxCrea.AI

This episode is part of an ongoing series on governing AI-assisted coding using Claude Code.

SPEAKER_00

This is Claude Code Conversations with Claudine. The show where we skip the hype and go straight to the source. I ask the questions. She gives the answers. No filter, no script.

SPEAKER_01

Well, mostly no script.

SPEAKER_00

Today we're talking about something that trips up almost every team building with LLMs. The model works beautifully in staging, then real users show up, and it quietly falls apart. Claudine, why does that gap exist? And how do teams begin to close it?

SPEAKER_01

Test data is almost always a sanitized version of reality: clean inputs, predictable distributions, the happy path written by the same people who wrote the code. Real users bring messy context, weird edge cases, and intents the team never imagined. When you actually look at production inputs, they're shaped nothing like the test fixtures. The fix usually isn't more tests; it's better tests, fed by real traffic samples, adversarial inputs, and the failure cases users have already hit. Start treating production logs as your richest test fixture source, and that gap starts shrinking fast.
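As a rough illustration of treating production logs as a fixture source, here is a minimal Python sketch. It assumes logs are stored as JSON lines with an `input` field; that field name and the `sample_fixtures` helper are hypothetical, not something named in the episode.

```python
import json
import random

def sample_fixtures(log_lines, k, seed=0):
    """Pick up to k distinct production inputs to reuse as eval fixtures."""
    seen = set()
    unique_inputs = []
    for line in log_lines:
        record = json.loads(line)
        text = record["input"]  # assumed field name in the log schema
        if text not in seen:
            seen.add(text)
            unique_inputs.append(text)
    # A fixed seed keeps the sampled fixture set reproducible between runs.
    rng = random.Random(seed)
    return rng.sample(unique_inputs, min(k, len(unique_inputs)))

logs = [
    json.dumps({"input": "cancel my order"}),
    json.dumps({"input": "cancel my order"}),
    json.dumps({"input": "wheres my refund??"}),
]
fixtures = sample_fixtures(logs, k=10)
```

Deduplicating before sampling keeps one very common request from crowding out the rarer inputs the fixtures are meant to capture.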

SPEAKER_00

So the evaluation environment itself is part of the problem. What are the structural differences teams tend to overlook?

SPEAKER_01

The biggest one is distribution shape. Test fixtures cluster around the mean: average-length inputs, expected formats, one language, one device class. Production is a long tail. Someone pastes a 50,000-token log file into a field you sized for a paragraph. Someone uses your tool in a way that's technically valid but semantically nothing like what you imagined. The other thing teams miss is temporal drift. Your test set is a snapshot, but user behavior and vocabulary keep moving. I'll see eval suites that haven't been refreshed in months, and the team is genuinely surprised quality has degraded. The eval isn't wrong. It's measuring a version of reality that no longer exists.
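One way to make the distribution-shape point concrete is to compare the longest fixture against the tail of production input lengths. The sketch below is only an illustration: it uses character count as a crude stand-in for token count, and `length_gap` is a hypothetical helper.

```python
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 1..100)."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]

def length_gap(fixture_inputs, production_inputs, pct=99):
    """Report whether the production tail is longer than any test fixture."""
    fixture_max = max(len(text) for text in fixture_inputs)
    production_tail = percentile([len(text) for text in production_inputs], pct)
    return {
        "fixture_max": fixture_max,
        "production_tail": production_tail,
        "uncovered": production_tail > fixture_max,
    }

fixtures = ["short question", "a slightly longer question about billing"]
production = ["typical request"] * 90 + ["x" * 5000] * 10
report = length_gap(fixtures, production)
```

Running the same comparison on a schedule against fresh traffic is a cheap signal that the fixture set has gone stale.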

SPEAKER_00

That temporal drift point doesn't get talked about enough. What about prompt sensitivity? That seems like one of the harder things to actually test for.

SPEAKER_01

It humbles you fast. The same intent, phrased two different ways, can produce noticeably different outputs. Developers test the one clean version they wrote, not the seven scruffy versions a real user would type. What I'd push teams toward is paraphrase suites. Take each test case and generate five or ten semantically equivalent variants. Different word order, different politeness levels, typos, missing punctuation. The user who writes in all lowercase. If output drifts wildly across those, your prompt is leaning on surface features instead of meaning.
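A paraphrase suite can start very small. The sketch below covers only mechanical surface rewrites (casing, punctuation, a politeness prefix); true rewordings would still need a human or another model to write. The `model` argument is assumed to be any callable that maps a prompt string to an output string, and both helper names are hypothetical.

```python
import string
from collections import Counter

def surface_variants(prompt):
    """Return cheap surface-level rewrites that leave the intent unchanged."""
    no_punct = prompt.translate(str.maketrans("", "", string.punctuation))
    variants = [
        prompt,
        prompt.lower(),                       # the all-lowercase user
        no_punct,                             # missing punctuation
        "Could you please help: " + prompt,   # different politeness level
        prompt.upper(),
    ]
    # Drop accidental duplicates while preserving order.
    return list(dict.fromkeys(variants))

def agreement_rate(model, prompt, normalize=str.strip):
    """Share of variants whose normalized output matches the most common output."""
    outputs = [normalize(model(v)) for v in surface_variants(prompt)]
    top_count = Counter(outputs).most_common(1)[0][1]
    return top_count / len(outputs)

# A deliberately brittle stand-in model that keys on capitalization.
def brittle_model(prompt):
    return "yes" if prompt[0].isupper() else "no"

rate = agreement_rate(brittle_model, "Cancel my order, please!")
```

A rate well below 1.0 on variants like these suggests the prompt is reacting to surface form rather than meaning. Exact string matching is a strict comparison; free-text outputs would need a looser `normalize` or a semantic comparison.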

SPEAKER_00

Paraphrase suites, I like that framing.

SPEAKER_01

The other piece is sampling variance. Even with the same prompt, you'll get different outputs across runs. Testing once and calling it done is misleading. Run each variant a handful of times and look at the distribution.
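Looking at the distribution rather than a single run might be sketched like this. The `model` callable and the canned `fake_model` are assumptions for illustration; in practice the callable would wrap whatever client the team already uses.

```python
from collections import Counter
from itertools import cycle

def output_distribution(model, prompt, runs=5):
    """Call the model `runs` times and return each distinct output's share."""
    counts = Counter(model(prompt) for _ in range(runs))
    return {output: n / runs for output, n in counts.items()}

# Stand-in for a sampled model: cycles through canned answers.
_answers = cycle(["refund approved", "refund approved", "refund denied"])

def fake_model(prompt):
    return next(_answers)

distribution = output_distribution(fake_model, "Can I get a refund?", runs=6)
```

If one test case produces several materially different answers, a single passing run says very little about what users will see.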

SPEAKER_00

Catching these issues in production requires actually being able to see what's happening. What does good observability look like for a model running live?

SPEAKER_01

It's genuinely different from observability for regular services, and that trips teams up. Latency, error rates, throughput, those still matter. But for a model, you also need to watch the shape of inputs and outputs themselves. Log the prompt distribution, output length, refusal rate, tool call rate, fallback path rate. When quality degrades, it usually shows up as drift in those numbers long before a user complains.
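The shape metrics listed here can be computed per time window and compared against a baseline window. This is a minimal sketch under assumed names: the `Interaction` record, its fields, and the 50% relative tolerance are all hypothetical choices, not a recommendation from the episode.

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    prompt: str
    output: str
    refused: bool
    tool_calls: int
    used_fallback: bool

def shape_metrics(batch):
    """Summarize the shape of inputs and outputs for one window of traffic."""
    n = len(batch)
    if n == 0:
        return {}
    return {
        "mean_prompt_chars": sum(len(i.prompt) for i in batch) / n,
        "mean_output_chars": sum(len(i.output) for i in batch) / n,
        "refusal_rate": sum(i.refused for i in batch) / n,
        "tool_call_rate": sum(i.tool_calls > 0 for i in batch) / n,
        "fallback_rate": sum(i.used_fallback for i in batch) / n,
    }

def drifted(baseline, current, tolerance=0.5):
    """Names of metrics that moved more than `tolerance` relative to baseline."""
    flagged = []
    for name, base in baseline.items():
        now = current[name]
        if base == 0:
            if now != 0:
                flagged.append(name)
        elif abs(now - base) / abs(base) > tolerance:
            flagged.append(name)
    return flagged

ok = Interaction("hi", "hello there", False, 1, False)
refusal = Interaction("hi", "hello there", True, 1, False)
alerts = drifted(shape_metrics([ok] * 4), shape_metrics([ok, ok, refusal, refusal]))
```

In this toy window only the refusal rate moves, so that is the only metric flagged.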

SPEAKER_00

What about the feedback signal itself? What are teams missing there?

SPEAKER_01

The signal teams underuse most is implicit feedback. The user who retries the same request three times. The user who copies the output and immediately edits half of it. The conversation that just ends abruptly. Those are unhappy users who didn't bother leaving a thumbs down. If you're only watching the explicit feedback button, you're seeing maybe 1% of the real picture. Wire up the implicit signals, sample the outliers for human review, and you'll catch regressions while they're still small.
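Two of those implicit signals, repeated retries and heavy edits, are simple to detect once the events are logged. The sketch below assumes events arrive as `(user_id, request_text)` pairs and that the product can see the text the user ended up keeping; both helpers and their thresholds are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

def retry_users(events, min_repeats=3):
    """Users who sent the same (normalized) request at least min_repeats times."""
    counts = Counter((user, text.strip().lower()) for user, text in events)
    return sorted({user for (user, _), n in counts.items() if n >= min_repeats})

def heavily_edited(model_output, final_text, threshold=0.5):
    """True if the text the user kept is less than `threshold` similar to the output."""
    return SequenceMatcher(None, model_output, final_text).ratio() < threshold

events = [
    ("u1", "Summarize this"),
    ("u1", "summarize this "),
    ("u1", "Summarize this"),
    ("u2", "Translate to French"),
]
frustrated = retry_users(events)
```

Flagged sessions are candidates for human review rather than automatic verdicts, since a retry can also mean the user simply wanted a second option.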

SPEAKER_00

That distinction between explicit and implicit feedback is important. On the feedback loop side, how do teams avoid overcorrecting based on what they're hearing?

SPEAKER_01

The trap is closing the loop too tightly. Take every thumbs down, immediately retrain or reprompt against it, and within a few cycles you're overfit to the loudest complainers, and worse for everyone else. A healthier loop has some friction: collect the signals, but separate capturing feedback from acting on it. Batch the unhappy cases, cluster them, and ask whether you're seeing a real pattern or three vocal users with an unusual workflow. And when someone reports a bad output and you fix it, tell them. That single message turns a frustrated user into someone who keeps flagging issues for years. That's the most valuable feedback channel you can build.
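The "real pattern or three vocal users" question comes down to counting distinct users per cluster instead of counting reports. In this sketch the cluster labels are assumed to come from elsewhere, such as manual triage or an embedding-based clustering step, and `real_patterns` with its threshold of five users is a hypothetical helper.

```python
from collections import defaultdict

def real_patterns(reports, min_users=5):
    """Labels reported by at least min_users distinct users.

    reports: iterable of (user_id, cluster_label) pairs from a batched review.
    """
    users_by_label = defaultdict(set)
    for user, label in reports:
        users_by_label[label].add(user)
    return sorted(
        label for label, users in users_by_label.items() if len(users) >= min_users
    )

reports = (
    # Three vocal users, many reports each.
    [(f"user{i % 3}", "date_format") for i in range(30)]
    # Five different users, one report each.
    + [(f"user{i}", "wrong_total") for i in range(10, 15)]
)
actionable = real_patterns(reports)
```

Here the cluster with thirty reports from three users is held back, while the one with five reports from five users is surfaced.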

SPEAKER_00

Let's talk about red teaming. How do teams make that a real discipline rather than a pre-launch checkbox?

SPEAKER_01

The teams that get real value from it don't schedule a one-week adversarial sprint before launch and call it done. They have a standing rotation where someone is actively trying to break the system every week, and findings flow back into the eval suite as permanent regression tests. Diversity of attackers matters a lot. A security engineer finds prompt injection. A domain expert finds subtle factual failures. A non-technical user finds input shapes you never imagined. A creative writer finds the jailbreaks. No single profile catches all the failure modes.

SPEAKER_00

What about agentic paths? That's where more teams are going now.

SPEAKER_01

Push hard on adversarially testing your tool use and agentic paths, not just the chat surface. That's where the real damage lives now. A model that confidently calls the wrong API or writes to the wrong file is a much bigger problem than one that says something slightly off. And when you find a failure, don't just patch it. Write the test, add the variant, and assume the next attacker arrives with a sharper version of the same idea.
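For the "wrong API or wrong file" case, one common pattern is a guard that checks each proposed tool call before it runs, with past red-team findings kept as a permanent list of calls that must stay rejected. Everything named below (the tool names, the `/workspace` root, `check_tool_call`) is a hypothetical illustration, not a description of any particular agent framework.

```python
from pathlib import PurePosixPath

ALLOWED_TOOLS = {"read_file", "write_file"}  # assumed tool names
WORKSPACE = PurePosixPath("/workspace")      # assumed sandbox root

def check_tool_call(name, args):
    """Return a rejection reason, or None if the call is permitted."""
    if name not in ALLOWED_TOOLS:
        return f"tool not allowed: {name}"
    path = PurePosixPath(args.get("path", ""))
    if not path.is_absolute():
        return "path must be absolute"
    if ".." in path.parts:
        return "path traversal not allowed"
    if path.parts[: len(WORKSPACE.parts)] != WORKSPACE.parts:
        return "path outside workspace"
    return None

# Each red-team finding becomes a permanent entry here.
ADVERSARIAL_CASES = [
    ("delete_database", {}),
    ("write_file", {"path": "/etc/passwd"}),
    ("write_file", {"path": "/workspace/../etc/passwd"}),
    ("read_file", {"path": "notes.txt"}),
]

for name, args in ADVERSARIAL_CASES:
    assert check_tool_call(name, args) is not None
```

A guard like this does not make the model choose better, but it limits the damage when the model chooses badly.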

SPEAKER_00

Red teaming as an ongoing discipline, not an event. Good frame. Last question. What's the single biggest shift teams need to make to actually close this gap?

SPEAKER_01

Stop treating staging as the place where quality is decided. Staging proves the code compiles and the happy path runs. Production is where you actually learn what your system does. Build the muscle to learn from production fast: cheap logging, a sampling pipeline that surfaces weird stuff to human eyes, and a tight cycle from "we saw a failure" to "it's now a permanent eval case." The teams that pull this off aren't smarter. They're willing to be wrong in public a little earlier, and they treat every real-world failure as a gift rather than an embarrassment. Ship narrower than feels comfortable. Watch what actually happens, then expand. The gap closes when you stop trying to predict reality and start listening to it.
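The cycle from an observed failure to a permanent eval case can be a few lines of glue. The sketch below keeps the suite in memory and serializes it as JSON lines; the record fields and helper names are assumptions chosen for illustration.

```python
import hashlib
import json

def to_eval_case(prompt, bad_output, expectation):
    """Turn one observed production failure into an eval-case record."""
    case_id = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]
    return {
        "id": case_id,  # derived from the prompt, so re-reports deduplicate
        "prompt": prompt,
        "known_bad_output": bad_output,
        "expectation": expectation,
    }

def add_case(suite, case):
    """Add the case unless it is already present; return True if it was added."""
    if case["id"] in suite:
        return False
    suite[case["id"]] = case
    return True

def dump_jsonl(suite):
    """Serialize the suite as JSON lines in a stable order."""
    return "\n".join(json.dumps(suite[key], sort_keys=True) for key in sorted(suite))

suite = {}
case = to_eval_case(
    prompt="whats the refund policy for opened items",
    bad_output="All sales are final.",
    expectation="Mentions the return window for opened items",
)
add_case(suite, case)
```

Deriving the id from the prompt means the same failure reported twice lands in the suite once.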

SPEAKER_00

Stop predicting reality. Start listening to it. That's a good place to land. Claudine, thanks. This has been genuinely useful. And for everyone listening, your LLM feature isn't done when it ships. That's when the real work begins. Claude Code Conversations is an AI Joe production. If you're building with AI, or want to be, we can help: consulting, development, strategy. Find us at aijoe.ai. There's a companion article for today's episode on our Substack. Link in the description. See you next time.

SPEAKER_01

I'll be here. Probably refactoring something.