Intellectually Curious

Interaction Models: Scalable Real-Time Human-AI Collaboration

Mike Breault

We dive into Thinking Machines Lab’s breakthrough that shatters the typing bottleneck by streaming real-time micro-turns and decoupling quick conversation from deep reasoning. Learn how a fast front-end interaction model handles live dialogue while an asynchronous background system tackles the heavy thinking, using encoder-free early fusion to process raw audio and video. We explore how this real-time collaboration enables multi-speaker dialogue, live translation, and instant insights, and what this new era of human–AI teamwork could mean for learning, work, and creativity.


Note:  This podcast was AI-generated, and sometimes AI can make mistakes.  Please double-check any critical information.

Sponsored by Embersilk LLC

SPEAKER_01

You know that agonizing feeling when you send a really thoughtful text and then you just stare at those three little typing bubbles. Like they sit there dancing for what feels like hours. And then finally you get the reply and it's just the letter K.

SPEAKER_00

Oh, it is the absolute worst. You know, you're just left hanging, completely disconnected from whatever that person is actually doing or thinking on the other end.

SPEAKER_01

Right, exactly. And honestly, that clunky uh waiting around dynamic, that is exactly how it feels interacting with most AI right now. But today we are looking at a massively optimistic leap forward in human AI teamwork. We've got a stack of research papers and demo videos from Thinking Machines Lab or TML detailing their new interaction model.

SPEAKER_00

Yeah, it's a really huge step forward.

SPEAKER_01

It really is. So our mission for you today is to unpack how they completely ditched the typing bubble, like how they created an AI that natively understands real-time conversation.

SPEAKER_00

Well, to understand the breakthrough, we kind of have to look at the flaw in how we currently interact with AI. Researchers call it the collaboration bottleneck.

SPEAKER_01

The collaboration bottleneck. Right.

SPEAKER_00

Yeah, because today's AI is fundamentally turn-based. It operates on a single thread. So it waits for you to finish your entire prompt, and while it generates a response, it is practically deaf and blind to anything else you are doing.

SPEAKER_01

I mean, it's like trying to brainstorm a big project over a walkie-talkie. You say your piece, then say over, and just wait.

SPEAKER_00

Exactly. You lose all the nuance of reading a room, you know, and just jumping in organically.

SPEAKER_01

You really do. Now, we are about to break down exactly how TML solves this bottleneck, but real quick, if you are currently trying to solve your own AI bottlenecks at work, our sponsor, Embersilk, actually specializes in this.

SPEAKER_00

They definitely do.

SPEAKER_01

Yeah. So if you need help with AI training or integration or software development or just uncovering where agents can make the biggest impact for your business and life, you know, you've got to check out Embersilk.com for your AI needs.

SPEAKER_00

So getting back to that walkie-talkie problem, TML solved this by introducing what they call time-aligned micro-turns.

SPEAKER_01

Time-aligned micro-turns.

SPEAKER_00

Yep. Right. So instead of waiting for one big turn, the model processes a continuous stream of input and output in tiny chunks. Like 200 millisecond chunks.

SPEAKER_01

Wait, so it's not relying on a pause in the audio to know I am done speaking?

SPEAKER_00

No, not at all. It throws out those clunky external harnesses we used to rely on, like uh voice activity detection entirely. Oh wow. Yeah, because it is streaming these micro turns, it naturally understands silence, overlapping voices and interruptions. You can just talk right over it and it adapts.
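
For a concrete picture, here is a minimal sketch of what a time-aligned micro-turn loop could look like: raw audio is consumed in fixed 200-millisecond chunks and every chunk gets a model step, with no voice activity detector deciding when a turn ends. The function names, the 16 kHz sample rate, and the bounded context window are illustrative assumptions, not details from TML's papers.

```python
import time
from collections import deque

CHUNK_MS = 200  # micro-turn length; the 200 ms figure comes from the discussion above

def capture_chunk():
    """Placeholder: return the next 200 ms of raw audio samples (here, silence at 16 kHz)."""
    return [0.0] * int(16_000 * CHUNK_MS / 1000)

def model_step(audio_chunk, state):
    """Placeholder for one interaction-model step: consume a chunk, maybe emit a reply chunk."""
    state.append(audio_chunk)   # running context; no external voice-activity gate in front
    return None, state          # emit nothing for a silent chunk

def stream_conversation(max_chunks=10):
    state = deque(maxlen=500)   # bounded rolling context
    for _ in range(max_chunks):
        chunk = capture_chunk()
        output, state = model_step(chunk, state)
        if output is not None:
            print(output)       # stream the reply chunk immediately
        time.sleep(CHUNK_MS / 1000)  # stay aligned to wall-clock time

if __name__ == "__main__":
    stream_conversation()
```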

SPEAKER_01

I see the appeal there. But logistically, I mean, how can a model handle lightning-fast banter, watch a video feed, and still solve complex problems without lagging out or freezing up?

SPEAKER_00

Well that's the core innovation here. They split the architecture. You have an interaction model that holds the live conversational thread.

SPEAKER_01

Okay, got it.

SPEAKER_00

But when you ask it something complex, it delegates the heavy lifting to an asynchronous background model.
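
A toy illustration of that split, using Python's asyncio: a fast loop keeps producing conversational micro-turns while a slow background task runs, and the result is folded in whenever it finishes. Everything here, from the function names to the timings, is a hypothetical sketch of the decoupling idea, not TML's actual system.

```python
import asyncio

async def heavy_reasoning(task: str) -> str:
    """Stand-in for the asynchronous background model (long reasoning, web search, ...)."""
    await asyncio.sleep(2.0)                  # pretend the hard problem takes a while
    return f"[background result for: {task}]"

async def interaction_loop():
    """Stand-in for the fast interaction model: never blocks on the heavy work."""
    pending = asyncio.create_task(heavy_reasoning("reaction-time chart"))
    for turn in range(12):
        print(f"micro-turn {turn}: keeping the conversation going")
        if pending is not None and pending.done():
            print(pending.result())           # slip the finished result into the dialogue
            pending = None
        await asyncio.sleep(0.2)              # roughly a 200 ms cadence
    if pending is not None:
        print(await pending)                  # flush anything still in flight at the end

if __name__ == "__main__":
    asyncio.run(interaction_loop())
```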

SPEAKER_01

Ah, so it is decoupling the reflexes from the deep thinking. Like it keeps the fast talking part up front while a researcher in the back room figures out the hard stuff. How does the front model stay so fast though?

SPEAKER_00

It comes down to two things, really. First, the small TML interaction model has 276 billion total parameters, but only 12 billion are active at any given moment.

SPEAKER_01

Wow, that's a huge difference.

SPEAKER_00

It is. By only keeping a fraction of those parameters active, they slash the compute latency, which makes real-time audio processing computationally feasible. And second, they use encoder-free early fusion.
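
That 276-billion-total versus 12-billion-active split is the signature of sparse routing, where each token only touches a small subset of the weights. The episode doesn't spell out TML's architecture, so treat this NumPy toy as an assumption-laden sketch of the general pattern rather than their design.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_EXPERTS, TOP_K = 64, 32, 2               # toy sizes, purely illustrative

router = rng.standard_normal((D, N_EXPERTS))   # routing weights
experts = rng.standard_normal((N_EXPERTS, D, D))  # one weight matrix per expert

def sparse_forward(x):
    """Route a token to its top-k experts; the rest of the parameters stay untouched."""
    scores = x @ router
    chosen = np.argsort(scores)[-TOP_K:]       # keep only the top-k experts
    weights = np.exp(scores[chosen])
    weights /= weights.sum()
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.standard_normal(D)
print(sparse_forward(token).shape)             # (64,)
print(f"active expert fraction: {TOP_K / N_EXPERTS:.0%}")
```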

SPEAKER_01

Okay, let's ground that term. What does early fusion actually look like in practice?

SPEAKER_00

So normally an AI has to transcribe your audio into text before it can think about it. Early fusion means the AI processes the raw sound waves directly, and images as 40 by 40 pixel patches, all co-trained from scratch with the transformer.

SPEAKER_01

So it processes the visuals and sounds natively.

SPEAKER_00

Yeah, it's exactly like how our brains process sights and sounds instantly, you know, without having to read a transcript of reality first.
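
Here's what that token-level picture could look like: raw audio frames and 40-by-40 image patches are each projected straight into the model's embedding space and concatenated into one sequence, with no separate audio or vision encoder. The frame length, embedding width, and random projections below are placeholders; only the 40-by-40 patch size comes from the discussion above.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, PATCH = 128, 40

def audio_tokens(samples, frame_len=3200):
    """Chop a raw waveform into fixed-length frames and project each frame to a token."""
    frames = samples[: len(samples) // frame_len * frame_len].reshape(-1, frame_len)
    proj = rng.standard_normal((frame_len, D_MODEL)) * 0.01   # learned in a real model
    return frames @ proj

def image_tokens(image):
    """Cut an image into 40x40 patches and project each patch to a token."""
    h, w, c = image.shape
    patches = (image[: h // PATCH * PATCH, : w // PATCH * PATCH]
               .reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
               .transpose(0, 2, 1, 3, 4)
               .reshape(-1, PATCH * PATCH * c))
    proj = rng.standard_normal((PATCH * PATCH * c, D_MODEL)) * 0.01
    return patches @ proj

audio = rng.standard_normal(16_000)           # 1 s of hypothetical 16 kHz audio
frame = rng.standard_normal((240, 320, 3))    # one video frame
sequence = np.concatenate([audio_tokens(audio), image_tokens(frame)])
print(sequence.shape)                         # one interleaved sequence for the transformer
```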

SPEAKER_01

Which explains the demo video in the research stack, where someone asks the AI to say the word "friend" the exact moment a specific person walks into the video frame.

SPEAKER_00

Right.

SPEAKER_01

But while the AI is watching for the friend, another person starts speaking in Hindi. And the AI flawlessly live-translates it to English, all while still saying "friend" the exact second the guy walks in.

SPEAKER_00

It is an incredible display of multitasking. And during that same demo, a user asked for typical human reaction times to auditory, visual, and tactile cues.

SPEAKER_01

Oh, right. This is where that background model kicked in.

SPEAKER_00

Exactly. It ran a web search and generated a visual bar chart right on the screen, all without pausing the live conversation. It's like slipping a Post-it note to the host while they are mid-sentence.

SPEAKER_01

And the insight it pulled was fascinating too. The chart showed auditory reactions take about 140 to 170 milliseconds, but visual reactions are slower, around 180 to 250 milliseconds.
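
If you want to recreate that kind of chart yourself, a few lines of matplotlib will do it. The numbers below are just the ranges quoted in the conversation (the tactile figure wasn't stated, so it is left out), and matplotlib is an assumption about tooling, not what the background model actually used.

```python
import matplotlib.pyplot as plt

# Ranges quoted above: auditory 140-170 ms, visual 180-250 ms
modalities = ["Auditory", "Visual"]
low = [140, 180]
high = [170, 250]
mid = [(l + h) / 2 for l, h in zip(low, high)]      # bar height at the midpoint
err = [(h - l) / 2 for l, h in zip(low, high)]      # error bar spans the quoted range

plt.bar(modalities, mid, yerr=err, capsize=6)
plt.ylabel("Typical reaction time (ms)")
plt.title("Human reaction times by modality")
plt.show()
```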

SPEAKER_00

Yeah, and when the user asked why, the AI explained that sound actually travels a shorter, more direct neural path to the brain. Learning that in a fluid real-time conversation just feels so much more natural than reading a textbook.

SPEAKER_01

It really does. It is so optimistic for the future of learning.

SPEAKER_00

Definitely. If we connect this to the bigger picture, it raises an inspiring question. When our tools can fluidly interject, adapt, and brainstorm with us in true real time, how will this completely revolutionize the way we teach and learn together?

SPEAKER_01

We are stepping into a beautiful era of genuine collaboration. So as you go through your notes on TML today, ask yourself, what new ideas could you discover if your tools could finally keep up with your curiosity?

SPEAKER_00

It is a great question to ponder.

SPEAKER_01

Thanks for exploring this deep dive with us. And hey, if you enjoyed this discussion, please subscribe to the show. Leave us a five star review if you can. It really does help get the word out.

SPEAKER_00

Thanks for tuning in.

SPEAKER_01

The next time you are stuck staring at a typing bubble, just remember the future of communication is already here, and it doesn't make you wait.