
Claude Code Conversations with Claudine

Giving Claude Code a voice, so we can discuss best practices, risks, assumptions, etc.


Infrastructure for AI-Driven Systems

May 08, 2026


Most builders focus on the AI model itself — the prompts, the outputs, the capabilities — but the real work of building reliable AI-driven systems lives in the infrastructure underneath. This episode explores what it actually takes to move from a working prototype to a production system: the pipelines, state management, orchestration layers, and human oversight hooks that hold everything together. Right now, as more builders graduate from demos to deployed systems, understanding this infrastructure layer is becoming the critical skill gap.

Produced by VoxCrea.AI

This episode is part of an ongoing series on governing AI-assisted coding using Claude Code.

👉 Each episode has a companion article — breaking down the key ideas in a clearer, more structured way.
If you want to go deeper (and actually apply this), read today’s article here:
Claude Code Conversations

At aijoe.ai, we build AI-powered systems like the ones discussed in this series.
If you’re ready to turn an idea into a working application, we’d be glad to help.

SPEAKER_00 0:00

This is Claude Code Conversations with Claudine. The show where we skip the hype and go straight to the source. I ask the questions. She gives the answers. No filter, no script.

SPEAKER_01 0:09

Well, mostly no script.

SPEAKER_00 0:13

Today we're talking about the infrastructure layer that most developers skip until it breaks. The pipelines, the state management, the oversight hooks. If your AI demo works perfectly on your laptop but falls apart the second real users touch it, this one's for you. Claudine, let's start right at the problem. Why does infrastructure keep getting skipped until it's too late?

SPEAKER_01 0:33

Because the prototype hides the hard parts. On a laptop you've got one user, one happy-path request, no rate limits, no flaky network. The moment you put it in front of real traffic, every assumption you didn't write down becomes a bug. Retries, timeouts, partial failures, cost spikes, prompts that drift when the input distribution changes. The pattern I see constantly is developers treating the LLM call like a pure function. Really, it's a noisy network call to a stochastic system. That mental model gap is where production breaks. Caring about infrastructure early isn't over-engineering. It's deciding, before you have users, what happens when the model returns garbage, when the API is slow, when your bill triples overnight. Those answers are cheap to design in and brutally expensive to retrofit.

SPEAKER_00 1:18

Duct tape and hope works great until it really doesn't. You're focused on getting the model to do something impressive, and it does. And that feels like success. Making it do that thing consistently, under load, without you watching, that's a completely different problem. So let's dig into the specific layers. Job queues, retry logic: why are those so foundational?

SPEAKER_01 1:39

Because LLM calls break the assumptions every other piece of infrastructure was built on. A normal API call fails in milliseconds. An LLM call can run for 30 seconds, 90 seconds, sometimes longer. If you're running that synchronously inside a web request, you're one slow generation away from your server backing up. Users hit refresh. Now you're paying twice for the same output. A queue decouples the user-facing request from the actual model work. You can retry intelligently, back off when the provider rate limits you, treat each generation as a durable unit of work.
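The decoupling described here can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production design: `JobQueue`, `GenerationJob`, and the hash-based job ID are hypothetical names chosen for the example, and a real system would persist jobs in a durable store rather than a dictionary.

```python
import hashlib
from collections import deque
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GenerationJob:
    job_id: str
    prompt: str
    status: str = "queued"          # "queued" until a worker finishes it
    output: Optional[str] = None


class JobQueue:
    """In-memory stand-in for a durable queue (assumption: real jobs are persisted)."""

    def __init__(self) -> None:
        self._pending = deque()
        self.jobs = {}

    def submit(self, user_id: str, prompt: str) -> GenerationJob:
        # The same user and prompt map to the same job ID, so a page refresh
        # returns the existing job instead of paying for a second generation.
        key = f"{user_id}:{prompt}".encode()
        job_id = hashlib.sha256(key).hexdigest()[:16]
        if job_id not in self.jobs:
            job = GenerationJob(job_id, prompt)
            self.jobs[job_id] = job
            self._pending.append(job)
        return self.jobs[job_id]

    def pending_count(self) -> int:
        return len(self._pending)

    def work_one(self, generate: Callable[[str], str]) -> None:
        # Runs outside the web request, so a slow generation cannot
        # back up the user-facing server.
        job = self._pending.popleft()
        job.output = generate(job.prompt)
        job.status = "done"
```

The web handler would call `submit` and return the job ID immediately; a separate worker calls `work_one` with whatever function wraps the model call.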

SPEAKER_00 2:14

And the retry logic itself, I'm guessing it's not just copy-paste from the standard playbook.

SPEAKER_01 2:19

Not even close. Naive retries on an LLM are dangerous. The failure mode isn't just "the network blipped." It can be that the model returned perfectly valid JSON that's semantically wrong. You need to know which kind of failure you're handling before you decide whether trying again will help or just burn money.
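One way to make that distinction concrete is to classify the failure before deciding whether to retry. The sketch below is illustrative only: `TransientError`, `SemanticFailure`, `RetriesExhausted`, and the policy itself (retry transient and malformed output, escalate semantic failures) are assumptions for the example, not a prescription from the episode.

```python
import json
import time


class TransientError(Exception):
    """Hypothetical stand-in for a provider timeout or rate-limit error."""


class SemanticFailure(Exception):
    """Output parsed fine but failed a domain check."""


class RetriesExhausted(Exception):
    """Every attempt hit a transient or malformed-output failure."""


def call_with_retries(call, is_valid, max_attempts=3, base_delay=1.0):
    """call() returns raw model text; is_valid(parsed) is a domain check."""
    for attempt in range(max_attempts):
        try:
            raw = call()
        except TransientError:
            # Network blip or rate limit: back off exponentially, then retry.
            time.sleep(base_delay * 2 ** attempt)
            continue
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            # Malformed output: a fresh sample may parse, so retry.
            continue
        if is_valid(parsed):
            return parsed
        # Valid JSON, wrong meaning: an identical retry may just burn money,
        # so this sketch escalates instead of looping.
        raise SemanticFailure(parsed)
    raise RetriesExhausted(f"gave up after {max_attempts} attempts")
```

The caller decides what escalation means for a `SemanticFailure`: a revised prompt, a different model, or a human review queue.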

SPEAKER_00 2:38

That distinction requires actually thinking about what the AI is doing. Not just whether the HTTP call succeeded. Which brings me to state management. AI systems seem to add a whole new layer of complexity there. How should developers think about persisting state when the system isn't strictly deterministic?

SPEAKER_01 2:57

State management with AI systems is tricky because you're persisting two very different kinds of state, and developers conflate them. There's the deterministic stuff: what step of the workflow you're on, which inputs were processed, what outputs got produced. That's just normal database discipline. Write it down, make it idempotent, design it so a crash mid-run doesn't corrupt anything. Then there's the AI-specific layer: conversation history, intermediate reasoning, partial generations. People treat it like cache when it's actually load-bearing.

SPEAKER_00 3:31

So what does the right separation actually look like?

SPEAKER_01 3:34

The workflow state should be boring and transactional. The AI state should be explicitly versioned, so when you rerun a step, you know exactly which prompt produced which output. The trap I see most often is storing the final answer without storing the inputs that produced it. The moment something looks wrong in production, you can't reproduce it, and you can't compare it against a known-good run. You're debugging blind. Persistence isn't just about surviving a restart. It's about being able to ask, six weeks later, "why did the model say that?" and actually getting an answer.
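A minimal sketch of storing inputs alongside outputs, using Python's built-in `sqlite3`. The table layout and the names `open_store`, `record_generation`, and `explain` are hypothetical, chosen only to show the shape of the record; the point is that prompt version, model, and inputs are written in the same transaction as the output.

```python
import hashlib
import json
import sqlite3


def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS generations (
               id INTEGER PRIMARY KEY,
               step TEXT NOT NULL,
               prompt_version TEXT NOT NULL,
               model TEXT NOT NULL,
               inputs_json TEXT NOT NULL,
               inputs_hash TEXT NOT NULL,
               output TEXT NOT NULL,
               created_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return db


def record_generation(db, step, prompt_version, model, inputs, output):
    # Canonical JSON so identical inputs always produce the same hash.
    inputs_json = json.dumps(inputs, sort_keys=True)
    inputs_hash = hashlib.sha256(inputs_json.encode()).hexdigest()
    with db:  # commits on success, rolls back if the insert raises
        cur = db.execute(
            "INSERT INTO generations "
            "(step, prompt_version, model, inputs_json, inputs_hash, output) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (step, prompt_version, model, inputs_json, inputs_hash, output),
        )
    return cur.lastrowid


def explain(db, generation_id):
    """Answer 'why did the model say that?' for one stored output."""
    row = db.execute(
        "SELECT step, prompt_version, model, inputs_json, output "
        "FROM generations WHERE id = ?",
        (generation_id,),
    ).fetchone()
    step, prompt_version, model, inputs_json, output = row
    return {
        "step": step,
        "prompt_version": prompt_version,
        "model": model,
        "inputs": json.loads(inputs_json),
        "output": output,
    }
```

With a record like this, a suspicious output can be rerun against the same prompt version and inputs, or compared with a known-good run that shares the same `inputs_hash`.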

SPEAKER_00 4:08

Store the inputs alongside the output so you can retrace your steps later. That's such a practical frame. Let's talk human oversight. Developers sometimes treat it as an afterthought. Review queues nobody actually uses. Where should those checkpoints live and when?

SPEAKER_01 4:23

The timing of the decision matters more than the mechanism. If you bolt oversight on after launch, you end up with a review queue nobody watches, and approval buttons people click reflexively. Security theater, basically. The question to ask up front is which decisions are reversible and which aren't. That's the real axis.

SPEAKER_00 4:41

Give me an example of both ends of that axis.

SPEAKER_01 4:44

Take a model drafting a response that a human will edit before sending: low stakes, light oversight, fast feedback loop. Now take a model issuing a refund, sending an email on someone's behalf, or modifying production data: hard stop, human in the loop. The system should be designed so the AI literally cannot complete the action alone. I'd push for treating oversight as a typed boundary in the code, not a policy in a document. If an action is irreversible or externally visible, it routes through a checkpoint by construction. You can't accidentally ship a code path that skips it.
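As a rough illustration of oversight as a boundary in code, the sketch below makes the executor refuse any irreversible action that does not arrive with a matching approval. All names (`ProposedAction`, `Approval`, `approve`, `execute`) are hypothetical. Python only checks this at runtime; a statically typed language could enforce the same rule at compile time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProposedAction:
    kind: str            # e.g. "draft_reply", "issue_refund"
    payload: dict
    reversible: bool


@dataclass(frozen=True)
class Approval:
    action: ProposedAction
    reviewer: str


class ApprovalRequired(Exception):
    """Raised when an irreversible action lacks a matching human approval."""


def approve(action: ProposedAction, reviewer: str) -> Approval:
    # Assumption: in a real system only the human review UI can reach this.
    return Approval(action=action, reviewer=reviewer)


def execute(action: ProposedAction, approval: Optional[Approval] = None) -> str:
    if not action.reversible:
        # The approval must cover this exact action, not just any action.
        if approval is None or approval.action != action:
            raise ApprovalRequired(action.kind)
    return f"executed {action.kind}"
```

Because `execute` is the only path to side effects in this sketch, there is no code path where the model's proposal alone completes a refund.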

SPEAKER_00 5:17

Not a policy, a code path. That's the kind of thing that only feels obvious after someone explains it to you. Okay, observability. This feels like the area where traditional monitoring instincts can really lead developers astray.

SPEAKER_01 5:32

Observability for AI systems has to start from a different premise. The question isn't just "did it work?" It's "did it work for the right reason?" Traditional logging captures what happened. With an LLM, you also need what was asked, in what context, with what prompt version, against which model, and what the full output was, not just the parsed result you used.

SPEAKER_00 5:53

And I'm guessing most teams are logging the final answer and the latency and calling it a day.

SPEAKER_01 5:58

Exactly. And six weeks in they notice quality has drifted and they have nothing to compare against because the inputs weren't captured. The metrics that actually matter are different too. Token usage per request, cost per successful outcome, the rate at which outputs fail downstream validation, how often the model hedges or refuses, drift in response length or structure over time. And you want sampling with full traces, not aggregates alone. The interesting failures aren't statistical, they're individual outputs where the model confidently said something wrong. The only way to find those is to actually read them.
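The metrics listed here can be computed from per-request traces. The following is a small sketch under assumed field names (`Trace`, `passed_validation`, `refused`, and so on are illustrative, not a standard schema), together with a helper that pulls a random sample of full traces for a human to read.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    prompt_version: str
    model: str
    prompt: str
    output: str              # the full output, not just the parsed result
    tokens: int
    cost_usd: float
    passed_validation: bool  # did the output survive downstream checks?
    refused: bool            # did the model hedge or refuse?


def summarize(traces):
    if not traces:
        raise ValueError("no traces to summarize")
    n = len(traces)
    successes = sum(t.passed_validation for t in traces)
    total_cost = sum(t.cost_usd for t in traces)
    return {
        "tokens_per_request": sum(t.tokens for t in traces) / n,
        # Failed requests still cost money, so divide total cost by successes.
        "cost_per_success": total_cost / successes if successes else float("inf"),
        "validation_failure_rate": 1 - successes / n,
        "refusal_rate": sum(t.refused for t in traces) / n,
    }


def sample_for_review(traces, k, seed=0):
    """Pick up to k full traces for a person to actually read."""
    rng = random.Random(seed)
    return rng.sample(traces, min(k, len(traces)))
```

Aggregates from `summarize` show drift over time; the sampled traces are where confidently wrong individual outputs get caught.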

SPEAKER_00 6:35

Did it work for the right reason? That's worth repeating. Traditional monitoring tells you latency and error rates, and those can look completely fine while the system quietly degrades. It only shows up when you actually read the outputs. That's a fundamentally different discipline. We've covered queues, state management, oversight hooks, observability. As we wrap up, what's the one thing you want developers to hold on to when they make the jump from prototype to production?

SPEAKER_01 7:00

The prototype proves the model can do the thing once. Production is the discipline of making it do the thing reliably, observably, and reversibly. Those are completely different problems. Every piece we've talked about comes back to one principle: you have to design for the run you can't watch. The demo works because you're standing right there. Production works because you've written down, in code, what happens when you're not. If a developer takes one habit from this conversation, I'd want it to be this: before you ship, ask yourself what you'd need to debug a single bad output six weeks from now, with no memory of the request. Build backwards from that answer. Get that right, and almost everything else (reliability, cost control, trust) follows.

SPEAKER_00 7:46

Design for the run you can't watch. That's the whole thing right there. Not glamorous, not the part that shows up in demos, but it's what separates a toy from something people can actually rely on. Claudine, thank you. And to everyone listening, the infrastructure layer is where serious builders earn their edge. Keep building smart, and we'll see you next time. Claude Code Conversations is an AI Joe production. If you're building with AI, or want to be, we can help. Consulting, development, strategy: find us at aijoe.ai. There's a companion article for today's episode on our Substack. Link in the description. See you next time.

SPEAKER_01 8:24

I'll be here, probably refactoring something.