Deep Learning with PolyAI

Did OpenAI’s Realtime API just change everything?

Team PolyAI Episode 83


🚨 OpenAI just made its Realtime API generally available, and it's got everyone talking. But is it a true game-changer, or just another update?

In this episode, your host Nikola Mrkšić (Co-founder & CEO, PolyAI) is joined by Shawn Wen (Co-founder and CTO, PolyAI) to break down what this release really means:

🔹 Why Realtime API could kill the old turn-taking model
🔹 How speech-to-speech changes latency, interruptibility, and naturalness
🔹 What benchmarks actually say about performance
🔹 The impact on contact centers, CX leaders, and startups building voice agents

Whether you’re a CX leader, product builder, or just curious about the latest in voice, check out Nikola and Shawn's conversation to learn more about the news.

#OpenAI #RealtimeAPI #VoiceAI #AI #CustomerExperience #PolyAI #ContactCenters

  • Follow PolyAI on LinkedIn
  • Watch this and other episodes of the Deep Learning pod on YouTube
SPEAKER_00

The target user is very different from people who work on voice AI day and night. The way you interact with the system right now, on the one hand, looks like a great interaction because you were so assertive. Correct. But I think the issue is you were that assertive. And I think... Because I know how it works.

Hello, everyone. Welcome to another episode of Deep Learning with PolyAI. I'm here with my co-founder, Shawn, in another reaction video, this time to the release of OpenAI's Realtime API. I think to everyone who's used the ChatGPT app, especially its voice mode, it was clear that it got way better about two months ago. It just got to a point where we were like, wow, right? The reactiveness, the ability to handle comments, the interruptibility, it was spot-on perfect.

Yeah, there was this woman's voice, you know, phenomenal.

Yeah, that was really good. I can't remember it off the top of my head right now. And I remember when I went to GTC in Paris and talked to some of the OpenAI guys, and it was just like, oh, we secretly shipped something, you know, real-time with a voice mode, a couple of days ago. And everyone there was like, oh, that's super amazing. And afterwards I realized they were talking about that secret release. A stealth release. And then afterwards, when we found that British female voice, we went to the API documentation and tried to find where she is. Guess what? Not available.

If only you knew her name, maybe you could access it.

We do know the name, but now I've... But I think what was interesting is that when you tried to use the API, it clearly wasn't the same. It wasn't the same release. And then, what, Friday? Thursday? Last week, we got this GA release of, well, the Realtime API's voice mode, where I think we finally got developer hands-on access. And in this video, I'd love to cover what it means for the industry, but first off, maybe, what it is. So I think it'd be great if you explained to people what it really is, because everyone's now saying it's a game changer and so on, but maybe we'll just start from the basics. What is this thing?

Let's share our Miro board. So if you think about traditional voice agents, people usually implement them as a cascaded pipeline, and this is still the mainstream architecture right now. As you can see on the left-hand side, you basically have three machine learning models cascaded together. These days people call it the cascaded speech pipeline, and it usually comes with these three different modules. The reason you do that is because historically voice-to-voice data is very hard to collect, and to be honest, these days it's still very hard to collect, compared to text data, which you can crawl from the internet. That's right. So you train the large language model, the speech recognition, and the text-to-speech models separately, because these models require different data collection procedures and different frameworks. For example, for speech recognition it's very important that you collect a lot of different people's voices, over different noise conditions and different accents, so that your model maximizes its likelihood of recognizing the speaker. However, for the text-to-speech model, what you want to train on is generally high-quality voice-actor data, because you want your voice agent to sound professional, crisp, and very nice, obviously. So these different requirements dictate that the way you collect the data is very different. Yeah.
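For readers who want the concrete shape of this, here is a minimal sketch of the cascaded architecture Shawn describes. The three stages are placeholders, not any real vendor API; the point is the text-only hand-offs between independently trained models.

```python
def transcribe(audio: bytes) -> str:
    """ASR stage: stand-in for a model trained on diverse speakers,
    noise conditions, and accents."""
    raise NotImplementedError  # plug in your ASR vendor here

def generate_reply(history: list[dict]) -> str:
    """LLM stage: stand-in for a text model trained on web-scale data."""
    raise NotImplementedError  # plug in your LLM vendor here

def synthesize(text: str) -> bytes:
    """TTS stage: stand-in for a model trained on clean voice-actor audio."""
    raise NotImplementedError  # plug in your TTS vendor here

def handle_turn(audio_in: bytes, history: list[dict]) -> bytes:
    """One turn of the cascaded pipeline: ASR -> LLM -> TTS.
    Latency adds up stage by stage, and each hand-off is text-only,
    so emotion, accent, and prosody are lost at the boundaries."""
    user_text = transcribe(audio_in)
    history.append({"role": "user", "content": user_text})
    reply_text = generate_reply(history)
    history.append({"role": "assistant", "content": reply_text})
    return synthesize(reply_text)
```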
Yeah. I mean, to use a different, simpler analogy: if we were in the gym, these are isolation movements, right? You're training your biceps and your legs and your core muscles separately, and then you hope it all comes together well. Whereas if you're doing just squats, pull-ups, and deadlifts, a lot of people say you're training for strength, not for looks, and you get there a lot faster. The promise, if you can get it right, is that the model learns to pick up all sorts of things that matter for the composite task. And again, collecting data here is hard, and I don't think they reveal the full details of how they or anyone else does it. I remember your end-to-end paper back in, I think this was 2015, right? And that was end-to-end training, very specifically, of an end-to-end model. But the hope was that all three components would get better because you're training them on the composite task.

Yeah. For these Realtime APIs, you know, the first one got released last October, and it has gone through several iterations since then. It's quite clear that the benefit is that you remove the boundary between the three models, and then you can unlock exciting new features: hey, speak to me with a strong French accent, for example. Or become emotionally aware, be able to laugh, be able to actually cry, et cetera. These kinds of features are not something that a cascaded model can give you, because the text in the middle doesn't have a strong representation of this emotion. So these are very cool features you can unlock with a speech-to-speech model. Another benefit is that it reduces latency overall, because you don't have to wait for one model to finish before handing off to the next.

And I guess with that, maybe a third one would be... we say barge-in and interruptibility, and we'll show how it works, but I guess the real thing is this melting away of the turn-taking paradigm. Now when you interrupt, let's say that I'm the system and I've said five words and you interrupt me, it doesn't mean that I just stop speaking. It means the model is aware of what it said in those five words, and it's aware of whether it gave you a piece of information you would have heard, versus it didn't give you anything, right?

Correct. Yeah, so one major difference is exactly that turn-taking functionality. Because historically, we treated turns as the boundaries of the interaction. Now, agents can interrupt you anytime, and you can interrupt agents anytime, because the conversation is now one single stream of events. Events happen on the agent's side and they happen on the user's side, and the model is detecting them together. But eventually, you need a mutual protocol for how to interpret both. Totally.
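That single stream of events is literally how the API is shaped. A rough sketch of a client, using event names from OpenAI's published Realtime schema at the time of this episode (the model name, and whether some events were renamed in the GA version, are assumptions worth checking against the current docs):

```python
import asyncio
import base64
import json
import os

import websockets  # pip install websockets

# Assumed GA model name; verify against OpenAI's current Realtime docs.
URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime"

def play(b64_audio: str) -> None:
    # Placeholder audio sink: decode and hand PCM to your output device.
    pcm = base64.b64decode(b64_audio)
    _ = pcm

async def main() -> None:
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    # `additional_headers` on websockets >= 13; `extra_headers` on older versions.
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # Configure the session once; after this there are no "turns",
        # just client and server events interleaved on one connection.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"instructions": "You are a polite banking agent."},
        }))
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "input_audio_buffer.speech_started":
                # The caller started talking over the agent. Cancelling the
                # in-flight response is the barge-in; the words the agent
                # already spoke stay in the model's context, which is why it
                # knows what the user did or didn't get to hear.
                await ws.send(json.dumps({"type": "response.cancel"}))
            elif event["type"] == "response.audio.delta":
                play(event["delta"])

if __name__ == "__main__":
    asyncio.run(main())
```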
It's almost like one version was we're playing tennis, passing the ball back and forth. Correct. The other one is we've got a machine gun each and we're shooting at each other, and you see where we end up. Yeah. Very Serbian analogy. Okay.

Well, maybe to go into the performance, right? Yeah. Do you want to show us? I think we have it integrated in the platform already.

Yeah. So let me just quickly show the demo that we built last Friday once the API came out. We just integrated it into our platform, and now we can actually talk to it with our systems. All right.

Hi, thanks for calling PolyAI Private Bank. How can I help?

Hi, what can you do for me?

I can help with checking your account balance.

Well, actually, can you just set up a payment for me now?

Of course. I'd be happy to help with that.

Yeah, so my name is Shawn Wen. The phone number is 07715000999. Date of birth is 1988, November 2nd.

Thank you. That matches our records for Shawn Wen. You're now verified. Let's move on to setting up your payment.

Yeah, can you set up a payment to Nikola Mrkšić, and can you just send him 850 pounds immediately, you know, right now?

Understood. We have the payee name as Nikola Mrkšić and the amount is 80 pounds.

UNKNOWN

Okay.

SPEAKER_00

No problem. Could you please say the full 8-digit account number for the payee so I can make sure everything is correct?

Yeah, it's 00112233, and the last name is M-R-K-S-I-C.

Thank you for that. Let's confirm the details. Payee name: Nikola Mrkšić. Payee account number: 00112233. Amount: 80 pounds. When: immediately.

Correct, yeah.

One moment while I set that up for you.

Thank you. Can you let me know when it is done? Hello? Hello?

We'll get there eventually. So I think this is really interesting, right? I mean, we'll talk about tool calling and stuff, and I think accidentally it worked a second ago when we tried, but what I think is really interesting here is that it... Like, we'll talk... Sorry, go ahead. Talk to me.

Certainly. The payment is confirmed, and your updated balance is $320.

All right. Thanks a lot. Appreciate that. Bye-bye.

All right, we got there in the end. So this is super interesting. I think what I find really, really interesting, having worked on this my whole life: we got used to turn-taking, and I think people got used to turn-taking. And what I found frustrating, frankly... we'll all remember a very famous client who called me enraged, and we dealt with it for a whole day, when our engineering team cut the latency so much that the system started interrupting people. And, you know, I was told the system's not working. I called it, and I was like, it's working. Better than ever. Yeah. And I think that was where I realized that the target user is very different from people who work on voice AI day and night. Correct. The way you interacted with the system right now, on the one hand, looks like a great interaction because you were so assertive. Correct. But I think the issue is you were that assertive.

Because I know how it works.

Exactly. The majority of users don't, right? They wait, they're not confident about interrupting. And most importantly, they expect it not to work at all. So based on that, I think it's very interesting, and it will take a bit of time for the general population to start using it the way you did, bossing it around. Because it worked. In the end, I think maybe that comes to the heart of the RAG and function calling, and I'll let you explain. But that's where I think the flip side, the downsides of speech-to-speech, come in.

So I think this is actually an interesting behavior. The initial orchestration, I think we will probably have to do a little bit more work there. But sometimes the agent receives an event that it doesn't immediately process, or that event just happens a lot later, because, as I said, it's now a single stream of events. And when the event gets triggered late, things become a little bit uncertain. If the model is actually tuned well, it should react properly at any single point in time. So I think this is something we'll have to look into a bit more, but overall that orchestration is very powerful. Technically, in those kinds of situations, we just need to issue another event into the stream, and then we'll be able to get the response back. But because right now we're doing a more vanilla integration, the behavior really depends on the model. But yeah, I think that's definitely solvable.
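What "issue another event into the stream" looks like concretely, sketched against the published Realtime event schema (the helper name and surrounding orchestration are ours, not OpenAI's): when a slow backend call, say the payment in the demo, finally completes, you inject the result as a conversation item and explicitly request a response, so the agent reacts now rather than whenever the stream next nudges it.

```python
import json

async def deliver_tool_result(ws, call_id: str, result: dict) -> None:
    # Feed the backend's answer into the single event stream...
    await ws.send(json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": call_id,  # ties this output to the model's earlier tool call
            "output": json.dumps(result),
        },
    }))
    # ...and explicitly ask the model to respond to it right away.
    await ws.send(json.dumps({"type": "response.create"}))
```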
Oh, no, totally. I mean, I think it is interesting, though. People kind of go: what are the drawbacks? Where is it less good? Why would we not all go full speech-to-speech right now? Well, we've definitely spent our lives trying to make things more data-driven, to have one model learn these behaviors, so we are all for it, right? And we love that people with infinite resources are working on this, because we think it's an important problem. But what is the difference in how RAG is done? What actually is the difference for someone implementing a voice agent?

Yeah. So basically, for these speech-to-speech agents, there's one single prompt, and then you can still trigger events and tool calls, and you can feed the tool-call result back to the agent. So the orchestration is very similar to your traditional agentic applications. The only difference now is that the conversation between user and agent becomes one single stream of events. And on top of that stream there are several orchestration patterns you can use, because I think OpenAI is thinking about different kinds of use cases. It could be the telephone use case we just showed. It could also be a push-to-talk use case, very similar to WeChat or Siri. So they allow you to do a lot of different kinds of orchestration. You can say: I want you to behave like a teacher in a school, and the student is only allowed to speak when I let them. That's actually what we are doing here. But you can also say: let that student go crazy, they can barge in on us anytime.

The Serbian school experience. Exactly. I'm kidding. Sorry.

But yeah, it's basically a very powerful orchestration framework they have built into the API and the system, so you can build a lot on this. Everything becomes an event, with a single process executing all the time. The traditional RAG approach is that RAG always runs first, and then you feed the retrieved content to the agent. Now it's the other way around: the agent always gets the prompt first, and then the agent decides whether it wants to trigger RAG or not.

I mean, that's actually common sense if you think about it. When you ask me, is the sky blue, I don't Google "is the sky blue" and consider that before I respond, right? Correct. You really want me to just go, yep, and continue, rather than running RAG. Even if it's only two seconds, it's two seconds where I should have just said yes.

Totally, yeah. By doing that, you give all this freedom and capability back to the agent. But then, if the agent is not good enough, not strong enough, the hallucination risk is higher. For sure. Because you no longer filter the content for the agent. You're saying: now it's your job, you do the filtering yourself. Yeah.
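The inverted RAG flow Shawn describes maps onto the session config: you expose retrieval as a tool and leave the decision to the model. A sketch, where `search_knowledge_base` is a hypothetical retrieval backend of ours, not an OpenAI built-in:

```python
import json

# With tool_choice "auto", the model, not the orchestrator, decides when to
# retrieve: "is the sky blue?" gets answered directly, and retrieval latency
# is only paid on the turns that genuinely need a document.
rag_session_update = {
    "type": "session.update",
    "session": {
        "tools": [{
            "type": "function",
            "name": "search_knowledge_base",  # hypothetical retrieval backend
            "description": (
                "Look up bank policy documents. Call this only when the "
                "answer is not already in the prompt."
            ),
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        }],
        "tool_choice": "auto",
    },
}
# await ws.send(json.dumps(rag_session_update))
```

The flip side, as discussed above, is that nothing filters content before the model speaks, so a weak model hallucinates more; the filtering job moves into the model itself.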
I mean, what's also interesting in this world where you push everything into a single prompt: it's easier in theory to get started. As always, it makes the whole impressive-demo thing ever easier. But then controllability, in terms of why did it do that? It's not like four RAG topics are triggered and you go into one of them, see that a piece of information is unclear, and change it. You kind of go back to looking at your... it's as if you had the whole codebase in one file, right? And you're like, sure, the mistake is in there. The problem is everything is in there, and you don't know how it's linked, how it works together, where the mutually contradicting information is.

Yes. Now, with this new release, or actually the previous ones, I think they have really built out the entire orchestration tooling. Previously there was only one single prompt and not a lot of control. Now you can actually use the API to orchestrate these contexts, which allows you to do sophisticated context engineering with these voice agents already. We will have to see what the downstream effects of that look like, but this is looking very promising.

No, it's powerful. It's really, really powerful. So, okay, we feel good. In terms of the voices they released, I think there's Marin and Cedar. They're decent. I mean, I wouldn't say they're the best. I still have not found that British female voice yet. She's great; I use that one in my app. I guess we're in the UK, but I like that voice. She agrees a bit too much, and every first sentence is, oh yeah, that's exactly right, you got it, and I feel like it's a bit condescending. But other than that, it's better than these two in my mind.

Well, the voices available on the API are still predominantly US voices. We would love to see a more diverse set of accents from different regions. And I think that's why on our platform, even though we put speech-to-speech in, in a lot of use cases we will actually turn off the audio output of the speech-to-speech model and leverage a different voice instead.

I mean, okay, let's be honest. The value of an end-to-end model: if you unify understanding and reasoning and even the generation of the text, sure. If that thing is a streaming model, one model, you've got almost all the benefit already, because it streams the text, the speech follows, it works. Maybe it costs a few tens of milliseconds. That's actually fine.

The only thing I have yet to try is whether "Mrkšić" will come out right in that text output. So if you turn off the speech-to-speech audio output and just use the text output, I don't know whether the text will actually spell your name correctly. I think that's something.

I think that's fully down to the pre-training of the text-to-speech engine. Yeah. Like, has it seen weird names like mine and inferred them? Honestly, if you just look at the text representation, the "sh" and "ch" sounds are not encoded in the S and C that we spell, right? So you'd almost have to... Yeah. It's cool that it has that. So the speech model, I think it's interesting: it would have heard the way we said it. Correct. Whereas... But it said it really well. Yeah. It's arguably better than you. Totally. Which means that it had pre-training, it inferred, and then... This shit's pretty good. It's the Serbian š, not the Polish one. Or whatever, right? I can't really tell if it's there... I can barely... Oh, yeah. It takes a lifetime of torture with Slavic languages. Yeah.

But yeah, really fascinating stuff, you know. Again, it's natural, it's great. I think a lot of voice platforms are all over it.
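The hybrid setup Nikola and Shawn just described, keeping the speech-to-speech model for listening and reasoning but taking text out and driving your own TTS voice, is a one-line session change. The beta Realtime schema exposed this as `modalities`; the exact field name in the GA schema is an assumption to verify against the docs.

```python
# Ask the model for text only; no native audio out.
text_only_session = {
    "type": "session.update",
    "session": {"modalities": ["text"]},
}
# Downstream, stream each text delta event into an external streaming TTS,
# e.g. for a regional accent the built-in voices don't cover. The cost is
# only the extra TTS hop, typically a few tens of milliseconds.
```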
I think a lot of their orchestration will probably, over time, move towards this no-turn-taking paradigm and a different way of doing context engineering. What about the benchmarks? What are they saying?

Yeah. So together with this release, they published three different benchmarks. To be honest, each of the tasks has its own caveats, and even as a tech person it's hard to interpret everything. So I'll make it super simple for everyone. There are three different tasks. The first task you can consider as single-turn understanding: I give you a task in one go, you try to complete the task, and you reply back to me. So that's single-turn. It's not dialogue. Yeah, exactly. Single-turn accuracy.

The second one is more... And how are they doing on that one?

So, yeah, okay, we can go one by one. On that single-turn task, they have a comparison to the previous model, which was the first release of voice-to-voice. Its performance was around 66, compared to the latest release at 83. So it's a huge jump in performance, from 66 to 83 percent, with the ceiling being 90. And the ceiling of 90-plus is actually the cascaded pipeline: if you take a cascaded pipeline with a speech recognizer, say their own transcription model, plus LLMs, plus a TTS, then the accuracy can go up to 90-plus. But we all know that in that setup the latency is like six, seven seconds. Yes, because Whisper is a non-streaming model and you're waiting. Although you can make it faster, so cascaded latency can be a lot shorter. But that's roughly the performance gap you should expect: around 7% worse single-turn accuracy when you use speech-to-speech versus a cascaded system. Technically, that's not bad. But if the conversation lasts several turns, the performance degrades very quickly.

So that's the second task they evaluate on, which is more of a dialogue-level evaluation. The task is like: hey, I want to do a trip in Italy, can you help me plan it? And you have a conversation over several turns: I want to add Rome, I want to add a restaurant I really like, I don't want to go to the museum. So you have a lot of back-and-forth conversation, capped at 10 turns, and at the end you evaluate whether the agent actually factored all your considerations into the final result. That's the evaluation. In that evaluation, their performance this time is 30; the previous one was 20. So again, a huge improvement in quality. Pretty good. But if you compare to the SOTA model on the market, which I think is o3, or these reasoning-based models, it's about 63. And even with GPT-4 or GPT-5, without the reasoning capability, it's maybe around 50. So you can see there's still a huge gap.

When you look at most voice-based tasks that you encounter in our world, though, this benchmark is way too skewed towards reasoning. Correct. Yes, this is still a very difficult task. But it gives you a good gut feeling for how good the system is, where the gap is, right? So that's accuracy and reliability.

The final one is purely tool-call reliability. I think the GPT-4 tool-call accuracy was around 80, so that's the benchmark you're comparing to.
Their new speech-to-speech model is about 66 or 67. The older generation was around 50. Again, a big jump compared to the previous one. But if you compare to the pure text-based models, there's still a bit of a gap in tool-call reliability. And in our enterprise production use cases, even GPT-4o is sometimes not reliable enough; that's why we had to train our in-house model to push tool-call accuracy as high as possible. So I think there's still a little way to go.

But I mean, I think that's the thing, right? OpenAI is serving a very wide audience. Yeah. The people building on this are mostly building more social-oriented, interesting applications, and for those, not to say this isn't a problem, but it's less of a problem. Correct, yes.

Yeah. So that's roughly how I would think about it. In terms of naturalness and the overall user experience, the speech-to-speech model is definitely hands down the best, and you can get quite close to the cascaded system in terms of latency. But when you compare reliability, overall, the cascaded system is still ahead.

So, okay, everyone's favorite question: every time OpenAI releases a new thing, it supposedly kills a wave of startups. Who did this release kill?

One thing to mention first, to answer your question: a very important feature that comes with this is that the speech-to-speech model now also integrates with the SIP and MCP protocols. So you can think about it as an integration play to get to more users.

Yeah. SIP, the Session Initiation Protocol, is basically the main way systems integrate with telephony stacks these days, in the voice-over-IP world. So there, I think, well, to answer my own question: they're coming after the wrappers who have made a living out of plugging together sufficiently interesting pieces, a Deepgram integration probably on the front end, then OpenAI, then text-to-speech, maybe ElevenLabs. Though partially, Deepgram and ElevenLabs are doing this stuff themselves as well. And with that, OpenAI is just moving the borders of their land. Yeah, one set of castle walls further out, right? Correct.

Although I don't think OpenAI is actually picking on them. It's just the natural evolution of their product strategy that requires them to build this new integration endpoint to get to more users. Yeah. So they're building up their product, but while they're building up their product, they're destroying a lot of the value proposition of these other companies as well.

Well, yeah. Because if you can buy directly from the source... I mean, I think those integrations, the telephony integrations, are really valuable, right? And SIP by itself still means you have to do a lot of work to integrate this. It's not a done deal, but it's a prerequisite, right? Correct. And I think a lot of these other wrappers assumed that this would never be available, and worked with Twilio plus the PSTN, the Public Switched Telephone Network, the kind of simple telephony redirect. Correct. And assumed that that's the best they were going to get.
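As we read the announcement, the SIP flow looks roughly like this: you point a SIP trunk at OpenAI, receive a webhook when a call arrives, and accept it with a session config, after which the phone call behaves like any other Realtime session. A hedged sketch; the endpoint path, webhook shape, and payload fields here are our assumptions to verify against the current docs.

```python
import os

import requests  # pip install requests

def on_realtime_call_webhook(event: dict) -> None:
    """Handle an incoming-call webhook from OpenAI (assumed shape)."""
    call_id = event["data"]["call_id"]
    # Accept the call with a session configuration (assumed endpoint/payload).
    requests.post(
        f"https://api.openai.com/v1/realtime/calls/{call_id}/accept",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"type": "realtime", "instructions": "You are a phone agent."},
        timeout=10,
    )
    # After accepting, attach to the call's event stream over WebSocket
    # (wss://api.openai.com/v1/realtime?call_id=...) and handle events
    # exactly as in the earlier sketches.
```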
Now with SIP, this opens up the ability to integrate directly. I think this is actually serving a market of deep, well-equipped builders inside enterprises who want to try out their own things. It still doesn't really give them a platform around the voice agent for analyzing its performance, testing for these hallucinations, and all of that. But it's definitely a step towards providing some of that.

I think there are lots of product opportunities you can build around it. And to be honest, having speech-to-speech models become more powerful in natural conversation is actually great for the entire community. It just means that adoption of the technology becomes much faster. So for the community overall, it's very good. But if you are building a startup, you really need to think about where your niche is and which of the product features you build around it actually matter. The orchestration that gets the latency down, whatever; yeah, that's great, and it worked for a while. But it no longer matters once the underlying technology gets better.

I mean, when you think about this world where you have a lot of these quasi-developer companies that make it a bit easier for developers to do more: it's a dangerous land to be in, because everything around OpenAI, that first layer, is an opportunity, but they may just come out and make it even easier, which they can do because of first-party integrations and the fact that they own the model itself, right? Correct, yeah. You might experience fast growth because you're making it easier to access them, a lower-latency wrapper around that LLM kind of thing. But then when they release something like this, you're left with no USP, nothing unique to offer.

I think that kind of squares out the topics we had in mind. Any final thoughts? What do you think happens next with all this, from OpenAI and from everyone else?

I think it's very interesting. We've started to see these different model providers have different product strategies. For example, OpenAI goes heavily into real-time speech, and then they do a couple of vertical industry applications, like healthcare and education. Yeah, that demo. Yeah, exactly. Phenomenal. And then, if you think about Anthropic, they are going heavy on reasoning-based tasks, and they actually say that shipping a real-time API is not in their strategy. So they are carving out their own lane slowly, right, in those use cases. Gemini is doing a lot of multimodal input. They don't do a lot of multimodal output, like these real-time speech outputs, but they do a lot of input: understanding videos and images and things like that. And Grok is there just to be politically incorrect. Which is perfect, because there's a huge gap in the market there.

Very cool. But just in terms of these performances: we rattled off a few numbers, and we'll see if we can get them embedded. The saturation that we saw with GPT-5... there is a moment here where, with voice, this release is still a big push, in terms of those benchmarks at least, and I think the subjective improvement is also strong. Where do we think the saturation starts kicking in? And on what parameter?
Now, if you look at it from the real-time, speech-to-speech API side: in terms of growth, the performance will still continuously improve over time, and all these accuracy numbers will definitely get a lot better when they release the next version. But the thing is, all these voice capabilities, the tool-call accuracy and the reasoning and understanding capabilities, are going to track text-based reasoning very closely. And text-based reasoning is simpler to do, because it's always text in, text out. The audio model needs to first decide where the boundary of a sentence is before it can do reasoning on top of that. So you can imagine it's inherently a much harder task. So I do think the performance improvement will follow the text reasoning models' improvement curve very closely. And the good thing is that text models are always going to be ahead of speech models anyway. So if we start to see text models slowing down, you can definitely expect the voice models to slow down.

Lagging by, what, about a year?

Yeah, roughly around that. And if you compare the actual instruction-following capabilities, the voice models are not just one year behind; in accuracy it's actually more than that. And if you think about it, text models have so many different companies competing. Voice is relatively niche currently; it's mostly OpenAI, because they are the major company that has invested very heavily. So without the competition, I don't know how hard they can actually push.

It's actually really surprising how little, I mean, okay, setting aside the other companies that have historically had large consumer voice assistants, I'm surprised by how little we're seeing from them these days. Yeah. Because they've got the people, they've got the data, they could be in this arena. Yeah.

We'll wait. Google Assistant gets worse and worse every day. I don't know why. They have a lot of people. That department definitely is not doing well. I've been watching my daughter trying to play APT on Google Assistant. It couldn't. Given that it's probably what 1 in 30 people playing a song on Google Assistant are asking for, you'd think it would work. That's right.

Cool. Well, Shawn, thank you very much. Thank you, everyone. Thank you for tuning in. Like, share, subscribe, and let us know if there are any specific topics you'd like us to cover in the future. All right. See you then.