
EDGE AI POD
Discover the cutting-edge world of energy-efficient machine learning, edge AI, hardware accelerators, software algorithms, and real-world use cases with this podcast feed covering all things edge AI from the world's largest EDGE AI community.
It features shows like EDGE AI TALKS and EDGE AI BLUEPRINTS, as well as EDGE AI FOUNDATION event talks on a range of research, product, and business topics.
Join us to stay informed and inspired!
EDGE AI POD
Audio AI on the Edge with Ceva
Audio processing at the edge is undergoing a revolution as deep learning transforms what's possible on tiny, power-constrained devices. Daniel from Ceva takes us on a fascinating journey through the complete lifecycle of audio AI models, from initial development to real-world deployment on microcontrollers.
We explore two groundbreaking applications that demonstrate the power of audio machine learning on resource-limited hardware. First, Environmental Noise Cancellation (ENC) addresses the critical need for clear communication in noisy environments. Rather than accepting the limitations of traditional approaches that require multiple microphones, Ceva's single-microphone solution leverages deep neural networks to achieve superior noise reduction while preserving speech quality, all with a model eight times smaller than conventional alternatives.
The conversation then shifts to voice interfaces, where Text-to-Model technology is eliminating months of development time by generating keyword spotting models directly from text input. This innovation allows manufacturers to create, modify, or rebrand voice commands instantly without costly data collection and retraining cycles. Each additional keyword requires merely one kilobyte of memory, making sophisticated voice interfaces accessible even on the smallest devices.
Throughout the discussion, Daniel reveals the technical challenges and breakthroughs involved in optimizing these models for production environments. From quantization-aware training and SVD compression to knowledge distillation and framework conversion strategies, we gain practical insights into making AI work effectively within severe computational constraints.
Whether you're developing embedded systems, designing voice-enabled products, or simply curious about the future of human-machine interaction, this episode offers valuable perspective on how audio AI is becoming both more powerful and more accessible. The era of intelligent listening devices is here—and they're smaller, more efficient, and more capable than ever before.
Ready to explore audio AI for your next project? Check out Ceva's YouTube channel for demos of these technologies in action, or join the Edge AI Foundation's Audio Working Group to collaborate with industry experts on advancing this rapidly evolving field.
Learn more about the EDGE AI FOUNDATION - edgeaifoundation.org
Thank you. All righty, we're back. Good morning, good afternoon, good evening, hello. We got a huge crowd from around the world online here, which is awesome. I love...
Speaker 2:I love that Looking sharp Pete.
Speaker 1:Thank you, yeah, I got my hoodie on. There you go, the official host uniform. You want to gear up? Tariff-free hoodie, you know, while supplies last, and it supports our scholarship fund. So actually, if you want the hoodie... I don't know if I have a little banner to show it, but yeah, there you go.
Speaker 1:Go to our hoodie site and get a hoodie. It's cool and it supports our scholarship fund. But I'm getting ahead of myself. I want to just say welcome, Davis, welcome everyone, to EDGE AI Talks, and we're here to talk about audio AI and audio ML, which is going to be pretty hot.
Speaker 2:It's kind of a hot topic. Yeah, always evolving. I think folks who have been in the digital signal processing space for a long time will be talking about it. What's new, of course, is AI, adding deep nets to the processing pipeline, getting better accuracy, better feature extraction. So we're in store for a cool talk.
Speaker 1:Should be good. So a couple of public service announcements, a couple of PSAs. One is that we're going to have a two-day live stream, our third one, around generative AI on the edge, or generative edge AI, I guess, would be the right way to say it, on May 27th and 28th. We have a call for papers out for that. That's a two-day live stream with worldwide presentations, academic and industrial. If you have a good idea and want to present a 25- or 30-minute pitch on our live streams, go to our website, submit your paper for the call, and you too could be famous. But yeah, we really appreciate all submissions on that. So that's May 27th and 28th. Also, registration's free and it's open, so you can reserve your spot right now. So that's the next big hot thing. And then registration for the Milan conference, July 2nd through the 4th, is open, so you can register there. That's going to be a blowout in Italy.
Speaker 2:Still got to make it out there. I mean, the gen AI at the edge one, the virtual one, that's a fantastic event that, you know, Danilo and myself have done, going back just about a year now. It's been about a year since I did the first one.
Speaker 2:I don't know what other landscape is changing faster than gen AI at the edge. It's really sweet to get together. The dialogue, as usual, that's my favorite part of these kinds of interactions, when someone puts some data out there and it becomes contested or interesting.
Speaker 1:It's a live stream, so we've got questions. We had about 900 people in the last one, so hopefully this one's going to top that. But yeah, hot topic. We also have a bunch of partners at InnoVex and Computex coming up in May.
Speaker 2:So if anyone, that's Taiwan, right, yeah, yeah, yeah.
Speaker 1:That's going on, which is pretty cool. So yeah, always stuff going on. Go to our website, sign up for the newsletter, subscribe.
Speaker 2:Blah blah blah. Click links, things like that. So is InnoVex co-located with Computex? It is, it's part of Computex.
Speaker 1:It's more for, like, the startup thing. Okay, cool. We have a lot of startups now in the foundation that are doing lots of cool things, and so we'll have a pavilion of stuff. Awesome.
Speaker 2:I'm always seeing new logos in that intro video. Always wanted to New logos.
Speaker 1:It's hard to keep up. Rosina is in the background here.
Speaker 2:She's not visible but she's the invisible hand of the foundation. The invisible hand exactly.
Speaker 1:Cool. Well, why don't we bring on Daniel from Ceva? He's our guest of honor and the audio ML expert, so let's see if we can get him on. There he is. Hey, Daniel.
Speaker 3:Hi everyone, hi Pete, hi Davis.
Speaker 1:Hey, good to see you. Hi, thanks for joining us.
Speaker 3:Glad to be here with you.
Speaker 1:Let's put you at a proper spot. You get the guest of honor spot here, so you get the big square and we'll get the small squares. But yeah, thanks for joining us, and you're from Ceva. So what do you do at Ceva other than audio AI? Yes, okay, all right, so what's your role at Ceva?
Speaker 3:Thanks, first of all, for letting me be here at this stage in the talk. At Ceva, for the past six years I have been specializing in audio applications for the TinyML domain, which is why we are here. Things like keyword detection, automatic speech recognition, etc. Today we are going to cover two of these features.
Speaker 1:Cool. Your audio is a little choppy and your video is kind of yeah, so it's a little rough. I'm not sure if there's a resolution to that. Just to let you know it's cool.
Speaker 3:Do you hear me? Yes?
Speaker 2:It's better all of a sudden. Yeah, so I mean, TinyML is not a new technology to Ceva, right? You talked about your PhD and how you've gotten involved. How long has Ceva been working with TinyML? I don't know if you caught the question.
Speaker 3:So I have a. I don't hear you. This is the live stream.
Speaker 2:Wait a second. Maybe we try to stop talking to you and let you talk through your slides. I think if you reduce the video, it might help us with the audio lag, but it's up to you. I think the main thing today is we want to make sure the audience gets your slides. Yeah, I have a lot of.
Speaker 1:Wait a second.
Speaker 3:Why don't we shift?
Speaker 1:Yeah, we'll shift over to that mode. So you just need to share your slides on the screen share and then we'll move into that mode and focus the bandwidth on the presentation. Yes, we can do that. And yeah, I like the logo, I like the school logo. Yeah, that's good, that's good. All right, Daniel... waiting for him to share his slides, so he might be having some connectivity issues. I think you might be frozen, yeah.
Speaker 2:It happens, you know.
Speaker 1:I had that yesterday. Yes, you do. Yeah, you got some serious issues. I had to switch to my phone. Doesn't seem better. But I had to switch to my phone yesterday because, for some reason, my Xfinity internet was just crapping out on me. And I was doing a recording with someone and I had to just switch; actually, it was with your boss here, with Ollie.
Speaker 1:Oh no, kidding, okay, okay, and it's like all of a sudden I'm just like, and then I had to just switch to my hotspot on my phone and that was tough, that was tough.
Speaker 2:So, uh, we yes, we can hear you.
Speaker 1:I want to say yes. Yeah, so if you want to share your slides, then we'll... if you want to do camera off, we're cool with that. One minute, and we can show your camera at the end, but I think maybe the camera is eating up some bandwidth. So yeah, put your slides up there, we'll share them and then jump into audio AI. Audio is kind of becoming a hot thing. I mean, we have a number of folks in the foundation now; we have Ceva, Syntiant, you know, who bought the Knowles microphone folks, and a number of other folks doing cool stuff with audio AI, GreenWaves if you remember them. But you can use audio AI not only for higher-quality sound, but also for doing detection, so seeing audio applications, and hopefully Daniel will talk about that.
Speaker 1:So, using audio for detection. There's the classic glass-breaking and gunshot detection, but you can also listen in for, you know, environmental monitoring; you can be listening for wildlife, distinguishing bird calls, how many birds, even insect sounds. So we're starting to see audio used more and more as a kind of substitute for a camera, so instead of visual detection you're... there you go, nice. All right, we're going to back out, but once you get started, Daniel, we'll monitor the chat for questions. Everyone, throw your questions in there and we will gently interrupt Daniel somewhere along the way to get those questions answered.
Speaker 2:But yeah, you have the floor.
Speaker 1:Sir, Take it away Okay.
Speaker 3:So, Pete, now you can hear me all right?
Speaker 1:Everything is okay, yes, sir.
Speaker 3:Go for it. Yes, good to know, good to know. Wow, it was very, very hard, bumpy, right? You're good. Okay, so, hello everyone, and thank you also for joining us. Today we will explore two audio applications within the TinyML domain, and we will cover the entire process from development through to deployment. Specifically, we will discuss environmental noise cancellation and also text-to-model applications. Okay, so we at Ceva have identified three key stages every neural network model goes through.
Speaker 3:The first stage is the development stage, where we design the architecture, gather data, and train the model. Once the model meets our criteria, we can move to the productization stage, where we optimize the model's efficiency in size and complexity using methods like quantization, pruning, compression, knowledge distillation, et cetera. Finally, for deployment on Ceva-based MCUs, we convert the model from PyTorch to LiteRT for Microcontrollers or, alternatively, if needed, we directly implement the model in code. Okay, so we will start with environmental noise cancellation, in short, ENC applications.
Speaker 3:So, as I think everyone knows, human-to-human communication largely occurs through devices such as smartphones, video conferencing, and other digital platforms. To maintain clear conversations, it is essential to reduce the background noise captured from the environment. For example, if you are at home, you can have a baby crying, dogs barking, and other ambient noise.
Speaker 3:Or if you are outside, you can have a car horn, etc., a lot of noises. Additionally, as human-to-machine interaction becomes more prevalent, automatic speech recognition (ASR) systems require high-quality audio signals to function reliably. Therefore, ENC applications play a crucial role in facilitating clear and effective communication, whether human-to-human or human-to-machine.
Speaker 3:Okay, so specifically at Ceva, we chose DeepFilterNet3 to be our architecture for the ENC solution due to its excellent combination of features. First of all, its single-channel support, which is ideal for any microphone-enabled device. Secondly, it uses full-band processing, allowing us to support any sample rate from narrowband up to full-band systems. Also very important is the causal design of the model, achieving a low latency of only about 40 milliseconds, which is good enough for all far-end applications. And finally, the most important one: it also has an optimal balance between quality and efficiency. Okay, so what does the architecture itself look like?
Speaker 3:So the DeepFilterNet architecture consists of three parts: the encoder, the ERB decoder, and the deep-filtering decoder. The encoder accepts two inputs. The first is the complex feature, the complex spectrogram up to 5 kilohertz, generated by a short-time Fourier transform, in short the STFT.
Speaker 3:The second input is the equivalent rectangular bandwidth (ERB) features, approximating the human auditory filters and modeling how humans perceive sound. Next we have the ERB decoder, which predicts the ERB gains that act as a mask for the frequency bins above 5 kHz, and the last one is the deep-filtering decoder, which estimates the deep filter coefficients to reconstruct the periodic structure of the speech signal up to 5 kHz.
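A minimal sketch of the two encoder inputs just described, a complex STFT spectrogram plus ERB-style band energies; the sample rate, FFT sizes, and the crude log-spaced filterbank here are illustrative assumptions, not DeepFilterNet's or Ceva's exact front end:

```python
import math
import torch

def encoder_inputs(wav, n_fft=512, hop=256, n_bands=32):
    # Input 1: complex spectrogram from a short-time Fourier transform (STFT).
    window = torch.hann_window(n_fft)
    spec = torch.stft(wav, n_fft=n_fft, hop_length=hop, window=window,
                      return_complex=True)                    # (n_fft//2 + 1, frames)

    # Input 2: ERB-style band energies -- a crude log-spaced pooling of the
    # power spectrum that only mimics the auditory-filter idea.
    power = spec.abs() ** 2
    n_bins = power.shape[0]
    edges = sorted(set(int(round(10 ** e)) for e in
                       torch.linspace(0, math.log10(n_bins - 1), n_bands + 1).tolist()))
    bands = [power[lo:hi + 1].mean(dim=0) for lo, hi in zip(edges[:-1], edges[1:])]
    erb = 10 * torch.log10(torch.stack(bands) + 1e-10)        # dB-scaled band energies
    return spec, erb

spec, erb = encoder_inputs(torch.randn(16000))                # 1 s of dummy 16 kHz audio
```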
Speaker 3:And as you can see here, looking at the entire architecture, the model employs convolutional neural network and gated recurrent unit layers to capture both local and temporal dependencies, and it also uses grouped linear layers to significantly reduce the model size and complexity. Okay, after extensive evaluation inside Ceva, we identified three scenarios where the original DeepFilterNet3 wasn't sufficiently stable for our needs. The first one was when the signal has low speech levels, around minus 50 dB or below.
Speaker 3:And the third one was rapid transitions between different speech intensities. To address these weaknesses we made several modifications, as seen in the table below. The first change, in the encoder, was the feature combination method: from addition of the two features to concatenation of them. The second one was to increase the convolutional feature maps from 64 to 96. The third modification was to reduce the GLU units, specifically in the ERB decoder, from 256 to 128, because we saw that it doesn't need more than that. Fourth, we apply the ERB gains, which as you recall act as the mask for the frequency bins, to all the frequency bins and not just the ones above 5 kHz.
Speaker 3:And lastly, we also trained our model with multilingual data, whereas originally the model was trained only on English.
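A tiny sketch of the first modification just listed, fusing the two encoder feature streams by concatenation (plus a projection) instead of addition; the channel counts are assumptions, with the 96 feature maps borrowed from the talk:

```python
import torch
import torch.nn as nn

spec_emb = torch.randn(1, 64, 100)   # (batch, channels, frames) from the spectrogram branch
erb_emb = torch.randn(1, 64, 100)    # same shape from the ERB branch

fused_add = spec_emb + erb_emb                      # original-style fusion by addition
fused_cat = torch.cat([spec_emb, erb_emb], dim=1)   # modified fusion: keeps both feature sets
proj = nn.Conv1d(128, 96, kernel_size=1)            # project to 96 feature maps, as in the talk
fused = proj(fused_cat)                             # (1, 96, 100)
```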
Speaker 3:Okay, once we were satisfied with our model's performance, we applied two optimization techniques to further enhance its suitability for microcontrollers, because, as we mentioned at the beginning, we want to deploy it on an MCU at the end. So we first apply the quantization-aware training technique and also singular value decomposition (SVD) compression, both of which we explain in the next slides. In quantization-aware training, we apply fake INT8 quantization to the weights during the forward pass, while we maintain full precision during backpropagation.
Speaker 3:This method allows the model to account for the quantization effects during training, significantly enhancing robustness at inference. Quantizing weights to INT8 reduces memory usage by four times, which is a substantial reduction in memory. On the right you can see an example of this flow, what we explained, on a single fully connected layer, but again, when we have a big model it will be applied to the entire model.
Speaker 3:This is just an example to illustrate how it works.
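A minimal sketch (assumptions, not Ceva's training code) of quantization-aware training on a single fully connected layer: the weights are "fake" INT8-quantized in the forward pass, while gradients flow through in full precision via a straight-through estimator:

```python
import torch
import torch.nn as nn

class FakeQuantLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.05)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # Symmetric per-tensor INT8 quantization of the weights.
        scale = self.weight.abs().max() / 127.0
        w_int8 = torch.clamp(torch.round(self.weight / scale), -128, 127)
        w_fake = w_int8 * scale
        # Straight-through estimator: forward uses quantized weights,
        # backward treats the rounding as identity.
        w = self.weight + (w_fake - self.weight).detach()
        return torch.nn.functional.linear(x, w, self.bias)

layer = FakeQuantLinear(64, 32)
out = layer(torch.randn(8, 64))
out.sum().backward()   # gradients reach layer.weight in full precision
```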
Speaker 3:Okay, in addition to the quantization-aware training, SVD involves mathematically decomposing every 2D weight matrix M into three matrices: U, S, and V. During training we encourage sparsity in the singular value matrix S by adding a regularization term to the loss. After training, we eliminate the zero-valued elements from the S matrix, allowing the corresponding rows and columns of U and V, respectively, to also be removed, which you can see on the left of the slide. This results in a compressed representation of the original matrix M, achieving an optimal balance between compression and performance, which can be easily controlled by the regularization coefficient that we decide on. Okay, following the quantization-aware training and the SVD compression, we obtained an efficient, high-performance model. However, our goal is to run this model on an MCU. Our deployment process involves three main steps. The first one is that we need to develop a streaming version of the model in TensorFlow, converting from PyTorch to TensorFlow, which addresses challenges such as a cyclic buffer for the temporal convolutional parts, and also converting the PyTorch weights to TensorFlow.
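Stepping back for a moment to the SVD compression just described (the PyTorch-to-TensorFlow weight discussion continues right after this), a minimal sketch of the idea: factor a weight matrix, penalize the singular values toward zero during training, then drop the zeroed rows and columns. The shapes, threshold, and regularization weight are assumptions:

```python
import torch
import torch.nn as nn

M = torch.randn(256, 128)
U, S, Vh = torch.linalg.svd(M, full_matrices=False)   # U: 256xk, S: k, Vh: kx128

# Treat the factors as trainable parameters and add an L1 term on S to the loss.
U, S, Vh = nn.Parameter(U), nn.Parameter(S), nn.Parameter(Vh)
reg_loss = 1e-3 * S.abs().sum()     # encourages sparsity in the singular values

# After training: prune near-zero singular values and the matching rows/columns.
keep = S.detach().abs() > 1e-4
U_c, S_c, Vh_c = U[:, keep], S[keep], Vh[keep, :]
M_compressed = U_c @ torch.diag(S_c) @ Vh_c            # low-rank approximation of M
```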
Speaker 3:If some of you have already tried it, you know that, for example, for a fully connected layer you need to transpose the weights before you move them from PyTorch to TensorFlow. Also, we need to find solutions for unsupported layers. Again, I can give an example from DeepFilterNet: you have the depthwise transposed convolution, which exists in PyTorch but isn't supported in TensorFlow, so we need to recreate a new layer in PyTorch to have the same thing in TensorFlow.
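A minimal sketch of the weight-transpose gotcha just mentioned: PyTorch's nn.Linear stores its weight as (out_features, in_features), while a Keras Dense kernel is (in_features, out_features), so the matrix has to be transposed when porting. The layer sizes here are arbitrary:

```python
import torch
import tensorflow as tf

pt_linear = torch.nn.Linear(64, 32)
tf_dense = tf.keras.layers.Dense(32)
tf_dense.build((None, 64))

tf_dense.set_weights([
    pt_linear.weight.detach().numpy().T,   # (64, 32): transpose of PyTorch's (32, 64)
    pt_linear.bias.detach().numpy(),       # biases carry over unchanged
])
```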
Speaker 3:Next, after we have a model in TensorFlow that is actually working, we need to convert this TensorFlow model into a quantized TensorFlow Lite version, which also involves additional challenges, such as unsupported operators, again because it's a quantized version, and also creating calibration data. Each calibration dataset can give you a different kind of model, because it drives the post-training part of the quantization.
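A minimal sketch of that quantized TensorFlow Lite conversion step with calibration ("representative") data, using the standard TFLiteConverter API; the toy model and the random calibration generator are placeholders for the real streaming model and real audio frames:

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Input((64,)),
                             tf.keras.layers.Dense(32, activation="relu")])

def representative_dataset():
    # Yield a few hundred realistic input frames; the converter uses them to
    # pick quantization ranges, so the choice of data changes the final model.
    for _ in range(100):
        yield [np.random.randn(1, 64).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()
```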
Speaker 3:And third, lastly, we need to ensure that all the operators are compatible with Ceva-based MCUs, which involves implementing and optimizing unsupported operators specifically for our hardware, for LiteRT for Microcontrollers. Okay, now for the results. Regarding the performance, we compare our ENC solution with open-source alternatives.
Speaker 3:As you can see, the X-axis represents the model's footprint in megabytes, while the Y-axis indicates the mean opinion score, in short the MOS, which measures the audio quality; the higher the score, the better. The orange rectangle in the upper left corner represents the optimal trade-off between the two metrics. Our model is 8 times smaller than the original DeepFilterNet while maintaining performance, making it highly suitable for memory- and power-constrained devices. Here you can see an example that demonstrates the performance difference using a customer-provided noisy input signal. Initially, the signal contains background noise; this is the first half of the signal that you are seeing in the spectrogram, followed by speech in Chinese, and again, if you remember, the original DeepFilterNet wasn't trained on languages other than English. The original DeepFilterNet, as you can see, effectively removed the noise but also overly attenuated the speech signal, which is not what we desire.
Speaker 3:On the other hand, our version successfully eliminated the background noise while preserving the speech. So, to summarize: we took the DeepFilterNet3 architecture, identified and addressed its limitations, and significantly improved its robustness, while being eight times smaller.
Speaker 3:To achieve this we combined the quantization-aware training and the SVD compression techniques, and finally we demonstrated the comprehensive process from initial model development to successful deployment on MCUs. And now let's move on to our second application, text-to-model. Okay, so I believe that all of you have noticed that devices around us are becoming increasingly smarter, often incorporating voice user interfaces for easier and more intuitive operation. For instance, you can imagine how you can talk to your TV without looking for your remote because you don't know where you put it last night. Or, for example, you want to operate your microwave but your hands are busy or dirty. So on highly resource-constrained MCUs, typical voice user interfaces include a wake-word detection model, to let the device know that you are talking to it, alongside another model designed to recognize a specific set of commands to operate the device.
Speaker 3:So we have two things here: we have wake-word detection and the keyword spotting model.
Speaker 3:So, from the manufacturer's point of view, which is ours, developing a keyword spotting system involves five key steps. First, they need to choose a wake word and a set of voice commands, which is hard in itself. After that, they need to acquire the necessary data, a process that can take weeks or even months. The third step is processing the raw data that was acquired to make it suitable for training, because sometimes the raw data isn't exactly what you're looking for; some of the people who recorded your keywords didn't say them correctly, etc.
Speaker 3:After that we start the training phase, which itself is tedious work and can take several days or up to a month. And finally, after we have a model, we need to test it thoroughly to assess its performance. These traditional steps are expensive and highly time-consuming for companies, and even though this process alone sounds long enough and very expensive, it doesn't end there. Changes to products, like rebranding the name of the product or changing even a couple of the voice commands, often lead to more data collection and retraining, which multiplies the cost and the delays by the number of changes. And this is exactly where our text-to-model solution comes in, eliminating these costly, time-intensive steps and significantly accelerating development cycles.
Speaker 3:So, what actually is our text-to-model? As its name implies, our solution directly generates keyword-spotting models using only text input. Our approach consists of two main phases. The first one is registration, which occurs offline on any standard device; it can be on a server or on your laptop, etc. And the second one is inference, running in real time on resource-constrained MCUs. So let's take an example: if you want your model to recognize the wake words... Daniel? Hello? Yes, yeah, can we...
Speaker 1:can we jump in for a sec, because we got a backlog of questions. So before you dive into this, okay, is that okay? Yeah, we had a lot of good questions. Davis, you want to tease something up here? What's your favorite?
Speaker 2:Well, I think there's a theme that definitely makes sense, which is this PyTorch-to-TensorFlow conversion step. So I think I can understand why you might not want to take this flow, but I think it'd be great for the audience, based on all the questions, if you can talk about how you made that possible, and why you didn't start with TensorFlow in the first place. Have you tried things like AI Edge Torch or LiteRT? So maybe just really focus for a second on why you started with PyTorch and how you were able to bridge that gap from PyTorch to TensorFlow, for the first use case, DeepFilterNet.
Speaker 3:This is a very good question, and I will start with this: PyTorch is a much easier platform for developing a model.
Speaker 3:It's much more agile to changes; you can do whatever you want. And we also tried, by the way, working with TensorFlow. But very, very early in the process of trying to train a model in TensorFlow, we found here at Ceva that it was too complicated; things were very hard to ramp up. And in PyTorch, when we tried it at the beginning, we saw that everything works very smoothly. We wanted to customize the loss function; it was very smooth. Everything works out of the box perfectly, and in TensorFlow that wasn't the case.
Speaker 3:And as I mentioned, for example, one of the problems is that in TensorFlow there are far fewer supported layers. For example, the one that I gave you is the depthwise transposed convolutional layer, which you have in PyTorch and somehow in TensorFlow you don't. So this is the reason why we chose to always continue with PyTorch, because on the first try it was perfect for us. And what was the other question about it?
Speaker 2:Well, there were a few sub-questions. But what about when you go to inference? The PyTorch framework has a lot of overhead and baggage, right, which maybe doesn't make it as edge-friendly as TensorFlow Lite. So there are questions on how you converted. Did you replace certain operations or layer types themselves? If you can talk about that conversion process. So you kind of justified why you used PyTorch in the first place; now it's like, how did you make it edge-friendly?
Speaker 3:Okay, now I understand, so I can continue with the same example, for example, the problem with the depthwise transposed convolution layer. So yes, you are correct that we need to do some modification in the PyTorch model to be compatible with the TF Lite converter. Yes, we did it. For example, we need to change that layer to something called a sub-pixel convolutional layer; I believe that some of the audience will know what I'm talking about, and then it works smoothly.
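A minimal sketch of the swap Daniel describes, replacing a transposed convolution with a "sub-pixel" convolution (an ordinary conv that produces upscale-times the channels, followed by a reshape that interleaves them along time); the shapes are illustrative and this is not Ceva's exact layer:

```python
import torch
import torch.nn as nn

class SubPixelConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, upscale=2, kernel_size=3):
        super().__init__()
        self.upscale = upscale
        self.conv = nn.Conv1d(in_ch, out_ch * upscale, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                      # x: (batch, in_ch, frames)
        y = self.conv(x)                       # (batch, out_ch * r, frames)
        b, cr, t = y.shape
        y = y.view(b, cr // self.upscale, self.upscale, t)
        # Interleave the r sub-samples along time, like a stride-r transposed conv.
        return y.permute(0, 1, 3, 2).reshape(b, cr // self.upscale, t * self.upscale)

up = SubPixelConv1d(64, 32, upscale=2)
out = up(torch.randn(1, 64, 100))              # -> (1, 32, 200)
```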
Speaker 3:And another thing that I want to say here is that, yes, we need to make changes on both ends. We change things in the PyTorch pipeline, I mean we change the model to be more compatible with TF Lite, but we also need to change things under the hood of TF Lite to be compatible with PyTorch. So it's a symbiotic thing and we need to work hard on both of these sides. But again, we chose to always work on the training side of the models in PyTorch because it was much easier.
Speaker 1:It's still much easier.
Speaker 3:I mean, we're still training models in PyTorch, and I think that we're not going to change to TensorFlow any time soon.
Speaker 2:Yeah, oh, super interesting. I think that was a really helpful answer.
Speaker 3:But it's a good question and, again, it's very hard and tedious work, but this is how it works for now.
Speaker 1:There's a couple of other ones; there's one that's kind of involved here. You know, custom layers of DeepFilterNet3, you're familiar with this. Which dataset did you evaluate the DNS MOS on? Blah, blah, blah. So hopefully you want to get into that level of detail.
Speaker 3:But yes, so the MOS that I showed you is the DNS MOS, but these results are not on the, let's call it, academic benchmark. We have our own internal benchmark, which is much harder: as I mentioned, multilingual, with a very dynamic range of speech intensities, because we saw that the original DeepFilterNet3 works on a specific range of speech intensities, and when it becomes too high or too low it just collapses entirely. So what we present here is much more, let's call it, representative of the real world.
Speaker 2:Okay, cool.
Speaker 3:Did I answer your question? I think so.
Speaker 1:Should we carry on, davis?
Speaker 2:I mean, we could do questions all day, but I want to, yeah, I think you're a popular guy, daniel, so I think we'll have to get back to the slides and we'll take your next batch of questions when it makes sense. Go for it. Yep, back to your presentation, so I can continue, please.
Speaker 3:Okay. So what is text-to-model? Oh, okay, this one, I will say it again shortly. So Text2Model is composed of two separate parts. The first part is the registration, which occurs offline on any standard device, and the second one is the inference, which runs in real time on resource-constrained MCUs. Here I'll start with an example of how it works. So assume that you want your keyword spotting model to have three keywords: Alexa, OK Google, and Hello Ceva.
Speaker 3:So we'll just simply input these phrases during the offline registration phase, resulting in an immediately deployable model, as you can see here. And it is also worth mentioning that each additional keyword increases the model size by only about one kilobyte, which is very small.
Speaker 3:The model generated offline is then deployed on the MCU, and when the user says a phrase like, for example, "Hello Ceva," the corresponding keyword's detection probability should spike significantly, while for the others, here in the example Alexa and OK Google, the probability should remain very low, as you can see. Okay, so how does it look under the hood? So again, because we need to separate the offline and online parts, we start with the offline modules. The first module is the grapheme-to-phoneme module, which gets text as input and predicts the corresponding phoneme string in an alphabet representation. The second module is the Query2Filter, which gets the phoneme string from the grapheme-to-phoneme module and generates a filter for a Conv1D layer. As you can see, it goes to the online part. Now for the online section: we first have the CTC-based acoustic model, without its phone projection layer, and it is followed by the Conv1D layer, as sketched below. Okay, but how is it trained?
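Before the training details, a rough sketch of the inference path just described; the query-to-filter and acoustic-model stubs here are made-up placeholders (the real grapheme-to-phoneme, Query2Filter, and CTC acoustic modules are Ceva's), but the mechanics follow the description: one small Conv1D filter per registered keyword, slid over the acoustic model's frame embeddings:

```python
import torch
import torch.nn.functional as F

emb_dim, filt_len = 64, 16

def query_to_filter(phonemes):
    # Placeholder: the real module maps a phoneme string to Conv1D filter weights.
    g = torch.Generator().manual_seed(abs(hash(phonemes)) % (2**31))
    return torch.randn(1, emb_dim, filt_len, generator=g)     # (out_ch=1, emb_dim, taps)

keyword_filters = {kw: query_to_filter(kw) for kw in ["alexa", "ok google", "hello ceva"]}

def detect(acoustic_frames):                                   # (1, emb_dim, n_frames)
    scores = {}
    for kw, filt in keyword_filters.items():
        response = F.conv1d(acoustic_frames, filt)             # correlate filter with frames
        scores[kw] = torch.sigmoid(response).max().item()      # spike => keyword likely present
    return scores

print(detect(torch.randn(1, emb_dim, 200)))                    # dummy acoustic embeddings
```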
Speaker 3:So we train our text-to-model using contrastive learning to project both the text and the audio domains into the same latent space, which is a common approach when you want two domains to share a latent space. In general, each sample in our dataset consists of a positive text-audio pair, and the negative pairs are created by leveraging other pairs from the same training batch. In other words, we want all the positive text-audio embedding pairs to have high similarity, while the negative pairs have very low similarity. You can see an illustration on the right of how the contrastive loss should behave in the ideal scenario.
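A minimal sketch of that contrastive objective (an InfoNCE / CLIP-style loss, assuming that is the flavor used): matched text-audio pairs in a batch are pulled together, and all other pairings in the batch act as negatives. The encoders are replaced by random embeddings here:

```python
import torch
import torch.nn.functional as F

batch, dim = 8, 128
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)    # from the text branch
audio_emb = F.normalize(torch.randn(batch, dim), dim=-1)   # from the audio branch

temperature = 0.07
logits = audio_emb @ text_emb.t() / temperature            # (batch, batch) similarity matrix
targets = torch.arange(batch)                              # the diagonal holds the positive pairs
loss = (F.cross_entropy(logits, targets) +                 # audio -> text direction
        F.cross_entropy(logits.t(), targets)) / 2          # text -> audio direction
```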
Speaker 3:Okay, to optimize the model for production, we apply quantization-aware training and again SVD compression, as previously described for the ENC. But additionally, we implemented the knowledge distillation technique, specifically targeting the acoustic module, as it accounts for 99% of the inference parameters. So in general, knowledge distillation involves transferring insights from a larger, high-performance teacher model to a smaller student model; that's what it's called. In our case, the student model is the compressed acoustic module, using the quantization-aware training and the SVD methods. Instead of training the student directly with the CTC loss, like we did with the teacher, we found that aligning the student blocks closely with the corresponding teacher blocks results in much better performance and stable training. Furthermore, we used the teacher's projection module to ensure the student predicts similar output label distributions, and this additional training loss significantly enhanced the model's accuracy in phone prediction.
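A minimal sketch of the distillation recipe as described: the student's intermediate blocks are pulled toward the corresponding teacher blocks with an MSE term, and the teacher's projection head is reused so the student also matches the teacher's output (phone) distribution. The toy blocks, sizes, and loss weighting are assumptions, not Ceva's models:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher_blocks = nn.ModuleList([nn.Linear(64, 64) for _ in range(4)])
student_blocks = nn.ModuleList([nn.Linear(64, 64) for _ in range(4)])   # much smaller in practice
teacher_projection = nn.Linear(64, 40)                                  # e.g. 40 phone classes
for p in list(teacher_blocks.parameters()) + list(teacher_projection.parameters()):
    p.requires_grad_(False)                                             # teacher stays frozen

def distill_loss(x):
    t, s, block_loss = x, x, 0.0
    for tb, sb in zip(teacher_blocks, student_blocks):
        t = tb(t)                                   # frozen teacher block
        s = sb(s)                                   # trainable student block
        block_loss = block_loss + F.mse_loss(s, t)  # align student block to teacher block
    # Reuse the teacher's projection head so output label distributions are comparable.
    kd_loss = F.kl_div(F.log_softmax(teacher_projection(s), dim=-1),
                       F.softmax(teacher_projection(t), dim=-1),
                       reduction="batchmean")
    return block_loss + kd_loss

loss = distill_loss(torch.randn(8, 64))
loss.backward()                                     # gradients flow only into the student blocks
```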
Speaker 3:Okay, after we have the model, regarding the deployment: again we follow the same steps that we've discussed, but, as you may guess, it is not identical. I mean that the general obstacles for each one of the steps will be the same, but the specific details will be different. But it is more or less the same pipeline. Okay, utilizing the productization techniques we mentioned, the quantization-aware training, SVD, and the knowledge distillation, we developed three model variants, ranging from 500K down to 250K, which are very small variants.
Speaker 3:To assess the performance, we use an open-source benchmark measuring the true positive rate and false accept rate metrics, which are common metrics when you are developing a keyword spotting model. Our 500K model achieves an equivalent true positive rate compared to proprietary alternatives, but with three times fewer false alarms, demonstrating its excellent efficiency and viability. Okay, so to conclude: we designed the system to instantly create a keyword model using only text input. We utilized the quantization-aware training, SVD compression, and knowledge distillation techniques to enhance efficiency, and, finally, we successfully deployed the model on Ceva-based MCUs.
Speaker 3:And if you want to see our text-to-model in action, we have a demo on Ceva's YouTube channel, so I encourage you to visit it. And thank you all for listening again. If you have more questions beyond what we tried to cover here, you can reach out on LinkedIn or by email. Awesome, that's it. Now, I think, for the questions. Yeah, that was a great, great talk, Daniel.
Speaker 2:Daniel. Yeah, thank you, pete. I would go back to a question earlier that came from Sonny Catry, that, if you scroll up a little bit farther to around 11.19 am is the timestamp. I thought this was a good question because it is a very, let's say, pervasive use case nowadays and there's a lot of Bluetooth headsets that have different variations depending on the price point and quality. Daniel, if you can see the question, can you answer? What is the advantage of this technology over traditional noise cancellation nowadays available with most of the headsets?
Speaker 2:Are you there, Daniel?
Speaker 3:I don't see the question, but...
Speaker 3:I hear what you are asking, so it's good enough for me. When we started on environmental noise cancellation, it was with a classic approach, but we understood, again during extensive evaluations, that if we want a single-microphone solution, and this is very important, a single-microphone solution for environmental noise cancellation, all the classical techniques fall short and can't really handle it. They need at least two microphones or even an array of microphones, which is very expensive for low-end devices.
Speaker 3:So we found that using deep learning models with only single-channel information works not just as well as the classical approach, but even much better compared to classical environmental noise cancellation algorithms with a multi-microphone array, so we have only benefits from moving to this approach.
Speaker 1:And we are not the only one.
Speaker 2:Yes, the entire industry is going in this direction. Yeah, deep learning eats everything. Yeah, that's good. But I mean, I think you told the story quite clearly.
Speaker 3:So they grew in the bed and they argued.
Speaker 2:Yeah, yeah, True, true, I mean, there's another one Pete, that I was, I don't think you covered before from Suman Grai, around 1134, says hello, sir, I've written a CRNN code in TensorFlow. Then try to convert it to TF Lite and Daniel, they're facing an issue due to LSTM layer. So do you have a solution for when you're trying to convert a code in TensorFlow that has a CNN or no? Two questions down from that one.
Speaker 1:Oh sorry.
Speaker 2:Similar vein? Yeah, similar. There's a lot of conversion questions. So, Daniel, the question is converting in TensorFlow with LSTM layers. Do you have a solution for that? I've heard this problem before as well.
Speaker 3:He's thinking about it Again. I didn't hear the question. I see the question.
Speaker 1:Yep, yep.
Speaker 3:No, I didn't hear you well because of the lag, oh yeah. So this is the question here below?
Speaker 2:Yep, it should be highlighted on the screen. Lstm layer conversion in TensorFlow.
Speaker 3:So, first of all, I didn't work with the LSTM layer; we are working with GRUs. But to be honest, it should work the same as a GRU, because it's just fully connected layers; in an LSTM it will be, I think, again fully connected layers, so I don't see any problem with that. I don't know why it fails. What I can suggest to him is that he can implement the LSTM by himself. This is what we also did back in the day, and it works well.
Speaker 3:I mean, he can just break down the LSTM himself. This is my suggestion.
Speaker 2:Okay, so you either rewrite it yourself or use a GRU, a gated recurrent unit, a similar building block, honestly, at the end of the day. So, like you said, it should have similar functionality.
Speaker 3:Yeah, and if this does not work for him, he can just re-implement it himself. You know, a custom, let's call it a custom LSTM, which will do the same thing, but it will break down much more cleanly than the native LSTM.
Speaker 2:Nice.
Speaker 3:Instead of the native LSTM, this is what I will do.
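A minimal sketch of the "implement it yourself" route Daniel suggests: an LSTM step written with a plain Dense layer and elementwise ops, which tends to convert to TensorFlow Lite as simple MatMul/Add/activation nodes rather than a fused LSTM op. Sizes are illustrative, and this is not a drop-in replacement for Keras's LSTM API:

```python
import tensorflow as tf

class ManualLSTMCell(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.units = units
        self.gates = tf.keras.layers.Dense(4 * units)   # input, forget, cell, output gates

    def call(self, x, h, c):
        z = self.gates(tf.concat([x, h], axis=-1))
        i, f, g, o = tf.split(z, 4, axis=-1)
        c_new = tf.sigmoid(f) * c + tf.sigmoid(i) * tf.tanh(g)
        h_new = tf.sigmoid(o) * tf.tanh(c_new)
        return h_new, c_new

cell = ManualLSTMCell(32)
x = tf.zeros([1, 16])
h = c = tf.zeros([1, 32])
h, c = cell(x, h, c)        # unroll this step frame by frame in a streaming model
```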
Speaker 1:Nice. There's a couple of questions from Shubham Sharma. We can throw those up there. One was around the Kaldi toolkit. Did you use the Kaldi toolkit?
Speaker 3:Did you use it? No, not at all.
Speaker 2:Okay, okay, easy answer, easy answer.
Speaker 1:Then there was a follow-up: the model size, and which device was used for testing. Yep. What was the model size, and which device?
Speaker 3:What was the model size? So, I don't know if the question is regarding the text-to-model or the DeepFilterNet, but I can tell you about both of them. For the DeepFilterNet, we end up with a model of one megabyte; well, not one megabyte, one million parameters, but it's the same after you quantize it to INT8. And in the text-to-model we have a couple of variants, from 500 kilobytes down to 250 kilobytes. Okay, okay.
Speaker 2:Nice, very edge-friendly, pretty compact.
Speaker 1:Yes.
Speaker 2:In both cases Great.
Speaker 1:What other questions we got here.
Speaker 2:We can go back a little ways. But, Davis, do you see anything that's... I mean, there's one where we can get into the weeds here. There's a question from Xiaoming Azu that asked how the cyclic buffer concept is represented in PyTorch and TensorFlow. I wasn't sure of the answer to this. Maybe, Daniel, you can help us answer this question: how is the cyclic buffer concept represented in PyTorch and TensorFlow?
Speaker 3:This is a good question. So, very, very easy. First of all, in PyTorch we didn't need to care about the cyclic buffer, because the input is the entire signal itself; in PyTorch we don't have a streaming model. The model runs on the entire signal at once.
Speaker 3:But when we moved to TensorFlow, we wanted the model to eat the sequence frame by frame, which is the streaming mode, and the cyclic buffer is just another layer, called CyclicBuffer. It calculates at the beginning what the size of the cyclic buffer is and what the size of the feature frame is, and then you just save it. For example, I will give you an example: if you need a cyclic buffer for a convolutional layer that has a kernel size of three, three temporal frames, then you need to store a matrix of size three by the number of features, and each time you have a new frame you remove the oldest one and concatenate the new one. This is how it works. I hope that answers it.
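A minimal sketch of the cyclic buffer Daniel describes for the streaming TensorFlow model: keep the last kernel_size feature frames in a non-trainable state variable, drop the oldest frame and append the newest on every call. Shapes are assumptions:

```python
import tensorflow as tf

class CyclicBuffer(tf.keras.layers.Layer):
    def __init__(self, kernel_size, num_features):
        super().__init__()
        self.state = tf.Variable(tf.zeros([1, kernel_size, num_features]),
                                 trainable=False)

    def call(self, new_frame):                       # new_frame: (1, 1, num_features)
        # Shift out the oldest frame, shift in the newest.
        updated = tf.concat([self.state[:, 1:, :], new_frame], axis=1)
        self.state.assign(updated)
        return updated                               # (1, kernel_size, num_features), fed to the conv

buf = CyclicBuffer(kernel_size=3, num_features=64)
out = buf(tf.random.normal([1, 1, 64]))
```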
Speaker 2:Do you write that down, pete? Yeah, yeah, I got it. No, thanks, I think for those who are interested in the technical details, that was hopefully sufficient. In the meantime, we got a couple of new ones. Actually, I know we have nine minutes here, so we can go to Alexander's question. Pete, Sure this one's interesting. Oh, this is a long one, yeah. Production-ready.
Speaker 3:RNNs for the.
Speaker 2:Oh, this is a long one, yeah, so I'll read it. Yeah, so they're looking for production-ready RNNs, you know, with advantages over the classic spectrum-based recognition approach. So not exactly the title of your topic, but hey, you're here.
Speaker 3:People are curious what you think yeah, try to understand the question, not to say I think it's not related to the question.
Speaker 2:So I guess the question is also do you guys work with time series data like what's questioned here? Maybe not, actually is one of the answers.
Speaker 3:No, no, we're working on audio and not on virtual sensors. So again, to be honest, I don't know the answer to this question.
Speaker 2:Fair enough. I think there are other types of networks we can use, and there are other speakers we've had in the past talk about time-series data, so, Alexander, you can probably go to those. The next one's actually about speech, because it's in the title. Have you tested models for... it's Shubham's question: have you tested the models just for keyword spotting, or also for ASR? So ASR is automatic speech recognition, and the question is: ASR with one megabyte, is this possible?
Speaker 3:Okay. So of course we tested our model, our text-to-model keyword spotting, against ASR, and it's obvious that our solution is better for these specific needs. Why is that? Because in our model, specifically, you can insert keywords which are not in the language, and an ASR, which is trained for general purposes, doesn't know, for example, the name of the product. You know, for example, EDGE AI Talks, but no, this is not a good one. But Ceva, for example: Ceva is not a word in the English vocabulary, in the lexicon.
Speaker 3:So text-to-model can identify it, but ASR, most of the time, will fail and not catch the keyword. So our solution is better than general ASR for this specific need of keyword spotting. And ASR with one megabyte: is it possible? No, it's not possible for general purposes. For our needs it was possible because we utilized the ASR only as an encoder for the audio, just to capture, let's call it, the features.
Speaker 2:Yep. So it's different applications; you didn't use ASR as ASR.
Speaker 1:Right, got it. Let's see, there's a question here from Bing Huang: are there open-source RNNs for audio classification?
Speaker 3:I believe so. I didn't check, because we have our own, but I'm 99% sure that if you search enough you will find something.
Speaker 1:You've got a couple of. By the way, audio classification can be a lot of things.
Speaker 3:It can be scene detection, not only keywords.
Speaker 2:Right, right. I don't know what exact application or use case, yeah. The challenge is sometimes less about the models and more about what data, what data classes and sources they are trained on, and whether that's something useful to your application. That's often where the mismatch can be, not so much in the model but in the training data. Right, here's a TensorFlow Lite one.
Speaker 1:Yeah, there's a TensorFlow Lite question here, if that's interesting. How do you determine which operations are unsupported in TensorFlow Lite and which alternatives can replace them without degrading performance?
Speaker 3:This is a very good question. So, for those of us who work with TF Lite: after you convert the model to TF Lite in a quantized way, you can see that if there are layers unsupported in INT8, TensorFlow Lite will dequantize.
Speaker 3:It will add, for this specific layer, two layers before and after. Before it there will be a Dequantize, then it will run the unsupported operator in float32, and then there will be another layer that it adds, called Quantize. When you see something like this, you understand that in TF Lite this specific layer doesn't support quantization, INT8 quantization, for example. And then what we did is, again, either we tried to implement it in TF Lite itself, in the code, so we changed the actual code of TF Lite; or alternatively, if we could, and it was easier, we changed the operator itself, for example swapping the activation function for a supported one if we can. And we of course then need to check ourselves that it didn't degrade the performance of the model, but if it didn't, we benefit from it. So this is the way.
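One practical way (a sketch, not Ceva's tooling) to surface the behaviour Daniel describes: restrict the converter to pure INT8 builtins so that conversion fails and names the operators that cannot be quantized, instead of silently wrapping them in Dequantize/Quantize pairs; the toy model stands in for the real one:

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Input((64,)),
                             tf.keras.layers.Dense(32, activation="relu")])

def representative_dataset():
    for _ in range(100):
        yield [np.random.randn(1, 64).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Strict mode: only integer builtins allowed, so unsupported ops raise an error.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

try:
    tflite_model = converter.convert()
except Exception as err:                 # the error message lists the offending ops
    print("Unsupported in INT8:", err)
else:
    # Optional: print the op graph and look for Quantize/Dequantize pairs by eye.
    tf.lite.experimental.Analyzer.analyze(model_content=tflite_model)
```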
Speaker 1:Hey, by the way, there are a couple of questions about follow-ups, so a couple of things. First of all, this is a recorded live stream, so you can always come back to our YouTube channel to watch this again, and you can find us on YouTube, obviously, as a foundation.
Speaker 1:Daniel put some of his contact information in there. We're also going to post a PDF to our Discord server, so if you go to dscgg forward slash edgi, which I put in the comments, you can go there and find all kinds of discussions as well. So, a few different ways. The other thing to note is that the EDGE AI FOUNDATION has a number of working groups, which are kind of collaborations between companies on specific topics, and the Audio Working Group was just started, and Ceva is chairing that for us. So if you're working at a company that is interested in audio AI, contact us; we're always looking for new strategic partners and folks to join working groups, so that could be a good venue for you as well.
Speaker 3:Yeah, it would be awesome. We would love others to join this working group. I think we will create really interesting stuff there, like what you've seen here. Excellent. Let's see, any others?
Speaker 1:We got another... well, we've got about 30 seconds. Maybe one last question, or maybe we should just call it. Okay, why don't we just call it, because... yeah. But, Daniel, thank you so much for your time, appreciate it. We got to see what you looked like before your camera was shut off. Appreciate that.
Speaker 3:Yes, of course I will do it. I'm very sorry for the internet issues.
Speaker 2:We got all the information. Your slides were great and it was a really good talk, really interactive.
Speaker 1:Thanks everybody for joining and we'll see you in some future episodes. Have a great day. All right, sounds good. Thank you, thank you.