Intellectually Curious

Interaction Models: Scalable Real-Time Human-AI Collaboration

Mike Breault

We dive into Thinking Machines Lab’s breakthrough that shatters the typing bottleneck by streaming real-time micro-turns and decoupling quick conversation from deep reasoning. Learn how a fast front-end interaction model handles live dialogue while an asynchronous background system tackles the heavy thinking, using encoder-free early fusion to process raw audio and video. We explore how this real-time collaboration enables multi-speaker dialogue, live translation, and instant insights, and what this new era of human–AI teamwork could mean for learning, work, and creativity.


Note:  This podcast was AI-generated, and sometimes AI can make mistakes.  Please double-check any critical information.

Sponsored by Embersilk LLC

SPEAKER_01

You know that agonizing feeling when you send a really thoughtful text and then you just stare at those three little typing bubbles. Like they sit there dancing for what feels like hours. And then finally you get the reply and it's just the letter K.

SPEAKER_00

Oh, it is the absolute worst. You know, you're just left hanging, completely disconnected from whatever that person is actually doing or thinking on the other end.

SPEAKER_01

Right, exactly. And honestly, that clunky uh waiting around dynamic, that is exactly how it feels interacting with most AI right now. But today we are looking at a massively optimistic leap forward in human AI teamwork. We've got a stack of research papers and demo videos from Thinking Machines Lab or TML detailing their new interaction model.

SPEAKER_00

Yeah, it's a really huge step forward.

SPEAKER_01

It really is. So our mission for you today is to unpack how they completely ditched the typing bubble, like how they created an AI that natively understands real-time conversation.

SPEAKER_00

Well, to understand the breakthrough, we kind of have to look at the flaw in how we currently interact with AI. Researchers call it the collaboration bottleneck.

SPEAKER_01

The collaboration bottleneck. Right.

SPEAKER_00

Yeah, because today's AI is fundamentally turn-based. It operates on a single thread. So it waits for you to finish your entire prompt, and while it generates a response, it is practically deaf and blind to anything else you are doing.

SPEAKER_01

I mean, it's like trying to brainstorm a big project over a walkie-talkie. You say your piece, then say over, and just wait.

SPEAKER_00

Exactly. You lose all the nuance of reading a room, you know, and just jumping in organically.

SPEAKER_01

You really do. Now, we are about to break down exactly how TML solves this bottleneck, but real quick, if you are currently trying to solve your own AI bottlenecks at work, our sponsor, Embersilk, actually specializes in this.

SPEAKER_00

They definitely do.

SPEAKER_01

Yeah. So if you need help with AI training or integration or software development or just uncovering where agents can make the biggest impact for your business and life, you know, you've got to check out Embersilk.com for your AI needs.

SPEAKER_00

So getting back to that walkie-talkie problem, TML solved this by introducing what they call time-aligned micro-turns.

SPEAKER_01

Time-aligned micro-turns.

SPEAKER_00

Yep. Right. So instead of waiting for one big turn, the model processes a continuous stream of input and output in tiny chunks. Like 200 millisecond chunks.

SPEAKER_01

Wait, so it's not relying on a pause in the audio to know I am done speaking?

SPEAKER_00

No, not at all. It throws out those clunky external harnesses we used to rely on, like uh voice activity detection entirely. Oh wow. Yeah, because it is streaming these micro turns, it naturally understands silence, overlapping voices and interruptions. You can just talk right over it and it adapts.
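
For a concrete picture, here is a minimal sketch of what a time-aligned micro-turn loop could look like: raw audio is consumed in fixed 200-millisecond chunks and every chunk gets a model step, with no voice activity detector deciding when a turn ends. The function names, the 16 kHz sample rate, and the bounded context window are illustrative assumptions, not details from TML's papers.

```python
import time
from collections import deque

CHUNK_MS = 200  # micro-turn length; the 200 ms figure comes from the discussion above

def capture_chunk():
    """Placeholder: return the next 200 ms of raw audio samples (here, silence at 16 kHz)."""
    return [0.0] * int(16_000 * CHUNK_MS / 1000)

def model_step(audio_chunk, state):
    """Placeholder for one interaction-model step: consume a chunk, maybe emit a reply chunk."""
    state.append(audio_chunk)   # running context; no external voice-activity gate in front
    return None, state          # emit nothing for a silent chunk

def stream_conversation(max_chunks=10):
    state = deque(maxlen=500)   # bounded rolling context
    for _ in range(max_chunks):
        chunk = capture_chunk()
        output, state = model_step(chunk, state)
        if output is not None:
            print(output)       # stream the reply chunk immediately
        time.sleep(CHUNK_MS / 1000)  # stay aligned to wall-clock time

if __name__ == "__main__":
    stream_conversation()
```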

SPEAKER_01

I see the appeal there. But logistically, I mean, how can a model handle lightning-fast banter, watch a video feed, and still solve complex problems without lagging out or freezing up?

SPEAKER_00

Well that's the core innovation here. They split the architecture. You have an interaction model that holds the live conversational thread.

SPEAKER_01

Okay, got it.

SPEAKER_00

But when you ask it something complex, it delegates the heavy lifting to an asynchronous background model.
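
A toy illustration of that split, using Python's asyncio: a fast loop keeps producing conversational micro-turns while a slow background task runs, and the result is folded in whenever it finishes. Everything here, from the function names to the timings, is a hypothetical sketch of the decoupling idea, not TML's actual system.

```python
import asyncio

async def heavy_reasoning(task: str) -> str:
    """Stand-in for the asynchronous background model (long reasoning, web search, ...)."""
    await asyncio.sleep(2.0)                  # pretend the hard problem takes a while
    return f"[background result for: {task}]"

async def interaction_loop():
    """Stand-in for the fast interaction model: never blocks on the heavy work."""
    pending = asyncio.create_task(heavy_reasoning("reaction-time chart"))
    for turn in range(12):
        print(f"micro-turn {turn}: keeping the conversation going")
        if pending is not None and pending.done():
            print(pending.result())           # slip the finished result into the dialogue
            pending = None
        await asyncio.sleep(0.2)              # roughly a 200 ms cadence
    if pending is not None:
        print(await pending)                  # flush anything still in flight at the end

if __name__ == "__main__":
    asyncio.run(interaction_loop())
```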

SPEAKER_01

Ah, so it is decoupling the reflexes from the deep thinking. Like it keeps the fast talking part up front while a researcher in the back room figures out the hard stuff. How does the front model stay so fast though?

SPEAKER_00

It comes down to two things, really. First, the small TML interaction model has 276 billion total parameters, but only 12 billion are active at any given moment.

SPEAKER_01

Wow, that's a huge difference.

SPEAKER_00

It is. By only keeping a fraction of those parameters active, they slash the compute latency, which makes real-time audio processing computationally feasible. And second, they use encoder-free early fusion.
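
That 276-billion-total versus 12-billion-active split is the signature of sparse routing, where each token only touches a small subset of the weights. The episode doesn't spell out TML's architecture, so treat this NumPy toy as an assumption-laden sketch of the general pattern rather than their design.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_EXPERTS, TOP_K = 64, 32, 2               # toy sizes, purely illustrative

router = rng.standard_normal((D, N_EXPERTS))   # routing weights
experts = rng.standard_normal((N_EXPERTS, D, D))  # one weight matrix per expert

def sparse_forward(x):
    """Route a token to its top-k experts; the rest of the parameters stay untouched."""
    scores = x @ router
    chosen = np.argsort(scores)[-TOP_K:]       # keep only the top-k experts
    weights = np.exp(scores[chosen])
    weights /= weights.sum()
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.standard_normal(D)
print(sparse_forward(token).shape)             # (64,)
print(f"active expert fraction: {TOP_K / N_EXPERTS:.0%}")
```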

SPEAKER_01

Okay, let's ground that term. What does early fusion actually look like in practice?

SPEAKER_00

So normally an AI has to transcribe your audio into text before it can think about it. Early fusion means the AI processes the raw sound waves directly, and images as 40 by 40 pixel patches, all co-trained from scratch with the transformer.

SPEAKER_01

So it processes the visuals and sounds natively.

SPEAKER_00

Yeah, it's exactly like how our brains process sights and sounds instantly, you know, without having to read a transcript of reality first.
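
Here's what that token-level picture could look like: raw audio frames and 40-by-40 image patches are each projected straight into the model's embedding space and concatenated into one sequence, with no separate audio or vision encoder. The frame length, embedding width, and random projections below are placeholders; only the 40-by-40 patch size comes from the discussion above.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, PATCH = 128, 40

def audio_tokens(samples, frame_len=3200):
    """Chop a raw waveform into fixed-length frames and project each frame to a token."""
    frames = samples[: len(samples) // frame_len * frame_len].reshape(-1, frame_len)
    proj = rng.standard_normal((frame_len, D_MODEL)) * 0.01   # learned in a real model
    return frames @ proj

def image_tokens(image):
    """Cut an image into 40x40 patches and project each patch to a token."""
    h, w, c = image.shape
    patches = (image[: h // PATCH * PATCH, : w // PATCH * PATCH]
               .reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
               .transpose(0, 2, 1, 3, 4)
               .reshape(-1, PATCH * PATCH * c))
    proj = rng.standard_normal((PATCH * PATCH * c, D_MODEL)) * 0.01
    return patches @ proj

audio = rng.standard_normal(16_000)           # 1 s of hypothetical 16 kHz audio
frame = rng.standard_normal((240, 320, 3))    # one video frame
sequence = np.concatenate([audio_tokens(audio), image_tokens(frame)])
print(sequence.shape)                         # one interleaved sequence for the transformer
```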

SPEAKER_01

Which explains the demo video in the research stack, where someone asks the AI to say the word "friend" the exact moment a specific person walks into the video frame.

SPEAKER_00

Right.

SPEAKER_01

But while the AI is watching for the friend, another person starts speaking in Hindi. And the AI flawlessly live-translates it to English, all while still saying "friend" the exact second the guy walks in.

SPEAKER_00

It is an incredible display of multitasking. And during that same demo, a user asked for typical human reaction times to auditory, visual, and tactile cues.

SPEAKER_01

Oh, right. This is where that background model kicked in.

SPEAKER_00

Exactly. It ran a web search and generated a visual bar chart right on the screen, all without pausing the live conversation. It's like slipping a Post-it note to the host while they are mid-sentence.

SPEAKER_01

And the insight it pulled was fascinating too. The chart showed auditory reactions take about 140 to 170 milliseconds, but visual reactions are slower, around 180 to 250 milliseconds.
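
If you want to recreate that kind of chart yourself, a few lines of matplotlib will do it. The numbers below are just the ranges quoted in the conversation (the tactile figure wasn't stated, so it is left out), and matplotlib is an assumption about tooling, not what the background model actually used.

```python
import matplotlib.pyplot as plt

# Ranges quoted above: auditory 140-170 ms, visual 180-250 ms
modalities = ["Auditory", "Visual"]
low = [140, 180]
high = [170, 250]
mid = [(l + h) / 2 for l, h in zip(low, high)]      # bar height at the midpoint
err = [(h - l) / 2 for l, h in zip(low, high)]      # error bar spans the quoted range

plt.bar(modalities, mid, yerr=err, capsize=6)
plt.ylabel("Typical reaction time (ms)")
plt.title("Human reaction times by modality")
plt.show()
```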

SPEAKER_00

Yeah, and when the user asked why, the AI explained that sound actually travels a shorter, more direct neural path to the brain. Learning that in a fluid real-time conversation just feels so much more natural than reading a textbook.

SPEAKER_01

It really does. It is so optimistic for the future of learning.

SPEAKER_00

Definitely. If we connect this to the bigger picture, it raises an inspiring question. When our tools can fluidly interject, adapt, and brainstorm with us in true real time, how will this completely revolutionize the way we teach and learn together?

SPEAKER_01

We are stepping into a beautiful era of genuine collaboration. So as you go through your notes on TML today, ask yourself, what new ideas could you discover if your tools could finally keep up with your curiosity?

SPEAKER_00

It is a great question to ponder.

SPEAKER_01

Thanks for exploring this deep dive with us. And hey, if you enjoyed this discussion, please subscribe to the show. Leave us a five star review if you can. It really does help get the word out.

SPEAKER_00

Thanks for tuning in.

SPEAKER_01

The next time you are stuck staring at a typing bubble, just remember the future of communication is already here, and it doesn't make you wait.