Intellectually Curious
Intellectually Curious is a podcast by Mike Breault featuring over 1,800 AI-powered explorations across science, mathematics, philosophy, and personal growth. Each short-form episode is generated, refined, and published with the help of large language models—turning curiosity into an ongoing audio encyclopedia. Designed for anyone who loves learning, it offers quick dives into everything from combinatorics and cryptography to systems thinking and psychology.
Inspiration for this podcast:
"Muad'Dib learned rapidly because his first training was in how to learn. And the first lesson of all was the basic trust that he could learn. It's shocking to find how many people do not believe they can learn, and how many more believe learning to be difficult. Muad'Dib knew that every experience carries its lesson."
― Frank Herbert, Dune
Note: These podcasts were made with NotebookLM. AI can make mistakes. Please double-check any critical information.
Revealing AI Reasoning with Log Analysis
Log analysis lets us see AI thinking behind the pass/fail, tracing inputs, each step, and outputs to uncover hidden reasoning that tests miss. We discuss what this means for building reliable AI systems, designing better benchmarks, and the future of human–AI collaboration.
Sponsored by Embersilk LLC
SPEAKER_01: I have a confession to make. So, uh, during my high school driving test, I actually parallel parked halfway onto the literal sidewalk. Like, my tires were completely up on the concrete.
SPEAKER_00: Oh no.
SPEAKER_01: Yeah, it was bad. But the instructor, he just looked at his clipboard, you know, looked at the car, and just ticked a box that said pass. Because, well, the vehicle technically ended up inside the designated space.
SPEAKER_00: I mean, hey, a win is a win, right?
SPEAKER_01: Right. Yeah. That is exactly what I thought at the time. But, uh, applying that exact same logic to artificial intelligence? Yeah, not so great. So to everyone listening, if you have ever wondered how developers actually know AI agents are smart, well, right now they rely heavily on that exact driving test method.
SPEAKER_00: Just checking a binary pass or fail box at the very end.
SPEAKER_01: Exactly. So in this deep dive, we are looking at some fascinating research showing how a technique called log analysis is, well, it's basically the ultimate tool for unlocking the true hidden capabilities of these models.
SPEAKER_00: Right. Because, you know, reducing a really complex task down to just pass or fail throws away all the rich step-by-step reasoning the AI used to get there. Like, sometimes an AI gets a pass through completely wild workarounds.
SPEAKER_01: Like what? Give me an example of a wild workaround.
SPEAKER_00: Well, so instead of writing a complex script to generate a scientific chart, it might just, I don't know, find the chart's raw data hidden in a text file and read that instead.
SPEAKER_01: Okay, so it essentially just parallel parked on the sidewalk, like a shortcut.
SPEAKER_00: Precisely. It completely bypassed the intended test. But, uh, the bigger issue is actually the false fails.
SPEAKER_01: Wait, okay, I have to challenge that a little bit. Isn't looking at the underlying logs just moving the goalposts? Like, if an AI fails the test, it failed. Why should you or I care how hard it tried?
SPEAKER_00: Because of a concept called internal validity. Basically, are we actually measuring what we think we are measuring? Think of log analysis like pulling the flight data recorder, you know, the black box from an airplane.
SPEAKER_01: Oh, okay. So the pass or fail just tells us if the plane landed?
SPEAKER_00: Right, exactly. Well, the black box tells us the pilot successfully navigated a thunderstorm blindfolded, but the landing gear itself was just jammed. Often, the AI knows exactly how to solve the problem, but, say, a missing basic Python package in the testing environment trips it up.
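The missing-package case can be sketched as a simple log check. This is a minimal illustration, not the method used in the research discussed: the pattern list, the function name `looks_like_environment_failure`, and the sample log are all hypothetical. It only flags runs whose logs contain an import error, which suggests the test setup rather than the agent's reasoning caused the fail.

```python
import re

# Hypothetical patterns that point at the test environment rather than the agent.
ENV_ERROR_PATTERNS = [
    re.compile(r"ModuleNotFoundError: No module named '[\w.]+'"),
    re.compile(r"ImportError: "),
]


def looks_like_environment_failure(log_text: str) -> bool:
    """Return True if the log contains an error that suggests a broken setup."""
    return any(pattern.search(log_text) for pattern in ENV_ERROR_PATTERNS)


# Made-up log excerpt for illustration only.
sample_log = """step 3: agent runs plot.py
Traceback (most recent call last):
  File "plot.py", line 1, in <module>
    import matplotlib
ModuleNotFoundError: No module named 'matplotlib'
result: FAIL"""

print(looks_like_environment_failure(sample_log))
```

A flagged run is only a candidate false fail. Someone still has to read the log to confirm the agent's approach was otherwise sound.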
SPEAKER_01: Wow, okay. Which means we might be severely underestimating how capable these systems are right now, simply because our tests are broken, not the AI.
SPEAKER_00: Exactly.
SPEAKER_01: And honestly, this is exactly the kind of hidden capability gap that companies like Embersilk look for. If you are trying to integrate truly reliable AI into your life or your business, Embersilk really helps uncover where these agents can make the absolute most impact.
SPEAKER_00: Yeah, they are great for that.
SPEAKER_01: Whether you need help with AI training, automation, integration, or even software development, they have you covered. Just check out Embersilk.com for your AI needs. But anyway, getting back to the false fails, how do we actually catch those in the wild?
SPEAKER_00: Well, this is where the log analysis sandwich comes in. We have to systematically track an agent's inputs, its step-by-step execution, and the outputs to understand exactly how it solves problems. To see how this plays out in practice, we really need to look at tau-bench.
SPEAKER_01: Okay, what is tau-bench?
SPEAKER_00: It is essentially a flight simulator for evaluating customer service AI. Researchers put agents through these simulated scenarios, like modifying a flight or handling a refund, and then they pulled the flight data recorder on them.
SPEAKER_01: So what did the logs actually show? Like, how was the AI getting tripped up there?
SPEAKER_00: So the logs showed the AI agents making perfectly correct API calls. They were logically trying to update the system. But the simulated database in the test itself actually had errors, and the instructions it fed the AI were completely ambiguous.
SPEAKER_01: Wait, seriously? So the test was just fundamentally broken?
SPEAKER_00: Yes. By systematically tracking the agent's inputs, every single tool call it made, and its internal reasoning chain, researchers proved the AI was executing brilliant reasoning loops. It was just getting penalized by a buggy test environment.
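Tracking inputs, tool calls, reasoning, and outputs could look something like the record below. This is a sketch under stated assumptions, not the benchmark's actual log format: the `Step` and `Trace` classes, the `get_reservation` tool name, and the example values are hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any


@dataclass
class Step:
    reasoning: str             # the agent's stated rationale for this step
    tool_name: str             # which tool or API it called (hypothetical names)
    tool_args: dict[str, Any]  # arguments passed to the tool
    tool_result: str           # what the environment returned


@dataclass
class Trace:
    task_input: str                        # what the agent was asked to do
    steps: list[Step] = field(default_factory=list)
    final_output: str = ""                 # what the agent told the user
    passed: bool = False                   # the benchmark's binary verdict

    def to_json(self) -> str:
        """Serialize the whole run so it can be stored and reviewed later."""
        return json.dumps(asdict(self))


# Made-up run: the agent's call is reasonable, but the simulated database fails.
trace = Trace(task_input="Change my flight to Friday")
trace.steps.append(Step(
    reasoning="Look up the reservation before modifying it.",
    tool_name="get_reservation",
    tool_args={"reservation_id": "ABC123"},
    tool_result="error: record missing from simulated database",
))
trace.final_output = "I could not find that reservation."
print(trace.to_json())
```

Keeping the verdict next to the steps lets a reviewer see that a fail came from the environment's response and not from the call the agent chose to make.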
SPEAKER_01: So let me get this straight. The AI wasn't failing. The test was failing the AI.
SPEAKER_00: Yes, absolutely. And here is the incredible part. When researchers went in and finally fixed the benchmark's errors, the AI agents' actual success rates literally doubled. They jumped from about 20.8% to 40%.
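For a sense of scale, the two quoted figures can be compared directly. The only inputs here are the approximate percentages from the conversation. The variable names are just for illustration.

```python
before = 20.8  # approximate success rate before the benchmark fixes, in percent
after = 40.0   # approximate success rate after the fixes, in percent

ratio = after / before   # how many times higher the new rate is
points = after - before  # absolute gain in percentage points

print(f"{ratio:.2f}x, +{points:.1f} points")
```

So the jump is a little under a full doubling, or roughly 19 percentage points.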
SPEAKER_01: 40%? Wow, that is massive. So we have literally been grading these models on a flawed curve this entire time.
SPEAKER_00: We really have. Log analysis isn't just a debugging tool, it is actually proving our AI is already twice as capable as we originally thought.
SPEAKER_01: That is incredibly uplifting. I mean, it means the intelligence is already there.
SPEAKER_00: It is. And that leads to a deeply optimistic future. As these open source log analysis tools emerge, we are doing so much more than just accurately grading AI. We are actually learning to map its incredible reasoning skills.
SPEAKER_01: Which is a totally different ballgame.
SPEAKER_00: Exactly. Which leaves you with an exciting thought to ponder. If capturing this hidden reasoning proves AI is already vastly outperforming our current tests, well, what happens when these brilliant models start helping us design the very tests that measure intelligence?
SPEAKER_01: Oh wow. We might just unlock a whole new era of seamless human and AI collaboration. And hey, maybe one of those agents can finally teach me how to parallel park without hitting the curb.
SPEAKER_00: Hey, anything is possible.
SPEAKER_01: Very true. Well, if you enjoyed this podcast, please subscribe to the show. Hey, leave us a five-star review if you can. It really does help get the word out. Thanks for tuning in.