
EDGE AI POD
Discover the cutting-edge world of energy-efficient machine learning, edge AI, hardware accelerators, software algorithms, and real-world use cases with this podcast feed covering all things from the world's largest EDGE AI community.
These are shows like EDGE AI TALKS and EDGE AI BLUEPRINTS, as well as EDGE AI FOUNDATION event talks on a range of research, product, and business topics.
Join us to stay informed and inspired!
Bringing Generative AI to Your Pocket: The Future of Edge Computing
A technological revolution is quietly unfolding in your pocket. Imagine your phone creating stunning images, understanding what its camera sees, and responding to complex questions—all without sending a single byte of data to the cloud. This isn't science fiction; it's Generative EDGE AI, and it's already here.
We dive deep into this transformative trend that's bringing AI's creative powers directly to our devices. Building on the foundation laid by the tiny ML movement, Generative EDGE AI represents a fundamental shift in how we'll interact with technology. The benefits are compelling: complete privacy as your data never leaves your device, lightning-fast responses without internet latency, independence from network connections, and significant cost savings from reduced cloud computing needs.
The applications span far beyond convenience. For people with disabilities, it means having image captioning that works anywhere, even without internet. For photographers, it's like having a professional editor built right into your camera. In healthcare, it enables diagnostics while keeping sensitive patient data secure and accessible even in areas with poor connectivity.
The technical achievements making this possible are equally impressive. Researchers have shrunk massive AI models to run efficiently on everyday devices, from visual question answering systems that respond in milliseconds to text-to-speech engines that sound remarkably natural. They're even making progress bringing text-to-image generation and small language models directly to smartphones.
As we explore these breakthroughs, we consider the profound implications of truly intelligent devices that can learn, adapt, and make decisions autonomously. What happens when our technology not only understands but creates and acts independently? The silent AI revolution happening in our hands is set to transform our relationship with technology in ways we're just beginning to comprehend.
Ready to understand the future that's already arriving? Listen now and glimpse the world where intelligence lives at your fingertips, not in distant server farms.
Learn more about the EDGE AI FOUNDATION - edgeaifoundation.org
Speaker 1:Welcome back, Deep Divers. We're diving deep today, folks.
Speaker 2:Into some seriously cutting-edge tech. Buckle up.
Speaker 1:Yeah, today we're talking all about the move to bring generative AI, you know, that AI that can whip up, like, crazy realistic images or audio or even text, and put it directly on our devices, like our phones, our watches, all that stuff. Right at the edge, no cloud required.
Speaker 2:The edge. It's the new frontier.
Speaker 1:So I think it's a good place to start. Like, what exactly do we mean when we say Gen AI?
Speaker 2:Right. So you're probably familiar with regular AI. You know, like when your phone recognizes your face to unlock, or when it predicts the next word you're typing. Right, that's standard AI.
Speaker 1:Doing its thing.
Speaker 2:But generative AI? That's like next level. Instead of just analyzing what's already there, it actually creates completely new, believable data from scratch. Have you ever seen those crazy realistic AI-generated artworks online?
Speaker 1:Oh yeah, those are insane.
Speaker 2:Exactly. It's that kind of creative power that we're talking about squeezing onto something like your smartphone. That's Gen AI.
Speaker 1:Wild. So this isn't just some random, hey, let's try this idea. This actually builds on a movement, right?
Speaker 2:Oh yeah, absolutely, the tinyML movement. It's been doing its thing, showing us what's possible when you run AI models locally, right there on your devices.
Speaker 1:TinyML, okay. Yeah, like my phone being able to translate languages even when I'm on a plane with no Wi-Fi. That's tinyML.
Speaker 2:That's it. Or think about getting instant responses from your smart home stuff. No lag, it's all happening locally. TinyML paved the way, and now edge gen AI is like the souped-up next generation.
Speaker 1:So, okay, I get that it's cool and builds on stuff, but I've got to ask: why bother? I mean, why try to, like, squish these massive gen AI models onto our little devices when we have those huge, powerful servers in the cloud?
Speaker 2:Yeah, that's a fair question. I mean, at first glance it does seem a bit like backwards, right, but listen, when you actually dig into it, there are some seriously compelling reasons.
Speaker 1:Okay, convince me.
Speaker 2:All right. First off, think about independence. Your devices wouldn't need to be tethered to the internet 24-7 to be smart or do all this creative stuff. Off-the-grid gen AI, exactly.
Speaker 1:And then there's the privacy angle. All your personal data stays local. It's not floating around in the cloud somewhere. And let's be real, speed is king. Processing everything right there on the device means no waiting for data to go back and forth to the cloud. That's huge.
Speaker 2:No lag, I like it.
Speaker 1:And there can be some major cost savings too. Plus the potential. Oh man, imagine the possibilities when you can deploy intelligent AI across like billions of edge devices. Right, that's a game changer.
Speaker 2:Yeah, okay, the potential is huge. So this isn't just some fringe research project?
Speaker 1:Oh, absolutely not. This is gaining some serious momentum. Even the tinyML Foundation, you know, the pioneers of this whole small-scale AI thing, they just rebranded themselves. Yeah, just this past November they became the EDGE AI Foundation. That's a statement, a big statement. It highlights the broader focus and this growing belief that edge gen AI can seriously shake things up. I mean, imagine our devices not just, you know, passively doing what we tell them, but actually solving problems and interacting with us in a way more natural way, like they're thinking, almost.
Speaker 2:Almost. It's this idea of local intelligence, right there in your hand.
Speaker 1:So for today's deep dive, our mission is to really understand just how far this whole edge gen AI thing has come. We've dug through a ton of research from 2022 all the way up to this year, 2024, and we specifically focused on models that are actually running on real-world devices. Right, we'll look at how they're doing, what they can achieve. But first let's paint a picture. Where can we actually use edge gen AI? What kind of cool applications could it unlock for us?
Speaker 2:Okay, let's talk real-world impact. One area where it's already making a difference is in assisting people with disabilities. Imagine having like image or video captioning, or even the ability to just ask questions about what your phone's camera is seeing. And all of this is happening right there on the device.
Speaker 1:That's amazing. So it's private and it works even if you don't have an internet connection.
Speaker 2:Exactly. It gives people, especially those who are visually impaired, a whole new level of independence and access to information.
Speaker 1:And speaking of independence, what about personal digital assistants? That's an area where edge gen AI could really shine right.
Speaker 2:Oh, big time. We're talking about assistants that are actually intelligent way beyond what we have now. I mean, imagine your assistant like truly understanding your environment and responding in like milliseconds. It's almost like having another person there.
Speaker 1:Yeah, and the whole interaction could be so much more natural, like imagine real-time translation happening right on your phone, no server needed.
Speaker 2:Exactly. And speaking of our phones, think about the trend of under-display cameras. Cool tech, but sometimes the image quality isn't quite there yet, right?
Speaker 1:Yeah, it can be a little soft.
Speaker 2:Well, edgegen AI could solve that. Real-time image enhancement and super resolution happening right on the phone. You snap a pic, boom instant improvement. What's really wild is that they're starting to build this AI processing directly into the image sensors.
Speaker 1:It's like having a pro photo editor built right into your camera.
Speaker 2:Pretty much.
Speaker 1:Now, beyond our personal devices, what about applications in like video surveillance or health monitoring? I can see edge gen AI being huge there.
Speaker 2:Oh, definitely. Think about smart security systems, right? Not only can they detect something fishy going on, but they can actually explain what's happening and suggest what to do, all locally. And in healthcare, man, the possibilities. Edge gen AI could be helping with diagnoses, you know, making those interactions between patients and devices way more intuitive. And again, all that sensitive data, it stays secure and local.
Speaker 1:Okay, that makes sense. So I'm starting to get the why. It's about giving power back to the users, making things more private and just making AI more accessible and immediate. Right, got it. So how are they actually doing this? How are they shrinking these, like, massive AI models down to fit on our devices? Yeah, our research looked at a bunch of different approaches, and they kind of fall into these neat categories based on the specific tasks the AI is supposed to do.
Speaker 2:Like specific jobs for specific AI right.
Speaker 1:Right. So first up, let's talk about visual question answering, or VQA. This is all about getting a device to understand an image and then answer questions about it. Like it's looking at a picture and having a conversation with you. Yeah, like, hey phone, what color is that car?
Speaker 2:And it tells you.
Speaker 1:Exactly. So one of the early breakthroughs in this area was something called MobiVQA. This came out back in 2022. The researchers were working with these existing, really powerful VQA models, you know, like LXMERT, X-LXMERT, and ViLT. Big names, yeah, big, complex models. And they figured out ways to optimize them so they could run smoothly on mobile devices.
Speaker 2:Like, what kind of optimizations?
Speaker 1:Well, they did some clever things like early exit, where the processing just stops if it's obvious the AI has the answer, or if doing more calculations won't really help.
Speaker 2:So it doesn't waste time.
Speaker 1:Right, and then they used this thing called question-aware pruning, which basically helps the model focus its attention on the most important parts of the image, you know, the stuff that's actually relevant to the question.
Speaker 2:So it's not getting bogged down by unnecessary details.
Speaker 1:Exactly. They even took these really accurate but super demanding models, ones that usually focus on specific regions, and adapted them to be more efficient by working on a grid system instead. So when they tested these tweaked models on devices like the NVIDIA Jetson TX2 and a Pixel 3 XL phone, the results were impressive. The latency, or the time it takes to get an answer, dropped like crazy, from seconds down to just milliseconds, and they used less energy too, all while keeping the accuracy pretty much the same. This MobiVQA, it was like an early win that showed, hey, we can actually do this, we can bring this complex visual understanding to mobile devices.
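For a rough sense of what "early exit" looks like in code, here's a minimal PyTorch-style sketch, not MobiVQA's actual implementation: a model with a small classifier head after every block stops computing as soon as one head is confident enough. The layer sizes, threshold, and class names are illustrative assumptions.

```python
# Minimal early-exit sketch (illustrative only, not MobiVQA's real code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitNet(nn.Module):
    def __init__(self, dim=256, num_classes=10, num_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(num_blocks)]
        )
        # One lightweight classifier ("exit head") after every block.
        self.exits = nn.ModuleList(
            [nn.Linear(dim, num_classes) for _ in range(num_blocks)]
        )

    def forward(self, x, confidence_threshold=0.9):
        for block, exit_head in zip(self.blocks, self.exits):
            x = block(x)
            probs = F.softmax(exit_head(x), dim=-1)
            conf, pred = probs.max(dim=-1)
            # Stop early if this head is already confident enough.
            if conf.item() >= confidence_threshold:
                return pred, probs
        return pred, probs  # otherwise fall through to the final exit

model = EarlyExitNet()
answer, scores = model(torch.randn(1, 256))  # single example, batch size 1
```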
Speaker 2:It's like that moment when you realize something huge is about to happen.
Speaker 1:Right, okay, so then in 2023, researchers came up with something called the Bilaterally Slimmable Transformer, or BST, bst Catchy. It's a mouthful Okay, but basically it's this framework for training, a single VQA model that can be shrunk down into smaller versions, these submodels, without having to retrain the whole thing from scratch.
Speaker 2:Oh, that's smart. So it's like having one master model that you can customize for different devices.
Speaker 1:Exactly, different levels of processing power, no problem. This BST thing, they used it on models like MCAN, UNITER, and CLIP-ViL, and they were able to run some of these smaller MCAN versions, like one that was less than 10 million parameters.
Speaker 2:Tiny.
Speaker 1:Tiny. They got it to run on phones with those Snapdragon 888, Dimensity 1100, and Snapdragon 660 chipsets, and the response times were super fast, like tens to hundreds of milliseconds. This adaptability is key, right? Yeah, because edge devices come in all shapes and sizes.
Speaker 2:Totally. And then what happened in 2024?
Speaker 1:Well, in 2024, there was a thing called TinyVQA, and this one was specifically designed for tinyML hardware, you know, those super resource-constrained devices. These researchers, they were training this model for a really cool application: they wanted to use it to assess damage after, like, a natural disaster, using images from drones.
Speaker 2:Oh wow, that's smart.
Speaker 1:Yeah. So they used this technique called knowledge distillation, or KD, which is basically like a teacher-student thing where a smaller model learns from a bigger, smarter one. And then they did this thing called post-training quantization, basically shrinking those numbers down to save space and speed things up. Clever. And the results? Oh man, this TinyVQA, it was amazingly accurate for its size. It was only 339 kilobytes, kilobytes, compared to a baseline model that was almost 500 megabytes. It's like they shrunk an entire encyclopedia down to, like, a Post-it note.
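If you're curious what that teacher-student setup boils down to, here's a minimal, generic knowledge-distillation loss in PyTorch; a sketch of the standard soft-target recipe, not the TinyVQA authors' code. The temperature and weighting values are illustrative assumptions.

```python
# Generic knowledge-distillation loss sketch (not the TinyVQA implementation).
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: match the student's softened distribution to the teacher's.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```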
Speaker 2:Yeah.
Speaker 1:And they actually got this thing running on a GAP8 processor, on a drone. The latency was crazy fast, only 56 milliseconds, and it used very little power.
Speaker 2:Talk about efficient. Okay, anything else from 2024?.
Speaker 1:Yeah, there's one more. Researchers from Samsung developed a brand new transformer architecture and they paired it with INT8 quantization, specifically for mobile devices. They tested it on a standard VQA data set and, after they applied their quantization magic, the model size shrank to just 58 megabytes.
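As a rough picture of what post-training INT8 quantization does to a weight tensor, here's a tiny NumPy sketch of symmetric per-tensor quantization. Real toolchains do much more (per-channel scales, activation calibration), and this is not the Samsung pipeline; it just shows how 4-byte floats become 1-byte integers plus a scale.

```python
# Symmetric per-tensor INT8 weight quantization, in miniature (illustrative only).
import numpy as np

def quantize_int8(weights: np.ndarray):
    scale = np.abs(weights).max() / 127.0        # map the largest magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale                              # store 1 byte per weight plus one float

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale          # approximate FP32 weights at runtime

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())        # small reconstruction error
```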
Speaker 2:Impressive shrinkage.
Speaker 1:Right, and it was crazy fast on a Galaxy S23 phone, only 2.8 milliseconds to get an answer. So there you have it, lots of cool research happening in VQA, getting AI to understand images and answer questions about them right on our devices.
Speaker 2:It's like magic.
Speaker 1:Now let's switch gears a bit. What about image and video captioning? This is where AI can actually generate a description, like a sentence, for what it sees.
Speaker 2:Oh, like a fluffy cat is napping on a red blanket.
Speaker 1:Exactly. So there's this model called LightCap, came out in 2023. It's a super lightweight image captioning model made specifically for mobile. It has all these different parts, like an image encoder to understand the picture, a concept extractor to pick out the important stuff, a thing to link the visual and the text information, and then this little language model called TinyBERT.
Speaker 2:TinyBERT. It's all about being tiny.
Speaker 1:Right. And even with all those parts, it was still small only 40 million parameters, about 112 megabytes. They tested it on a Huawei P40 phone and it had a latency of only 188 milliseconds. Plus, it was super accurate with its captions.
Speaker 2:Okay, so what about video captioning? That seems way more complex, right?
Speaker 1:Definitely. But in 2024, there was this project called TinyV2A, and they focused on this even harder task of describing future actions in videos, specifically of mice. Mice, yes, mice. They used this thing called a VideoMAE transformer, which is a mouthful, and a language model called OPT, and they used all those techniques we talked about, like knowledge distillation and parameter-efficient fine-tuning, to make it all work. The smallest version of their model, which still had 169 million parameters, ran in under a second on a Raspberry Pi, both a Pi 4 and a Pi 5. So not just saying what's happening in the video, but predicting what might happen next.
Speaker 1:Yeah, pretty impressive, especially on a Raspberry Pi.
Speaker 2:Yeah, for sure.
Speaker 1:Oh, and then there was VILA. This one's a little different. It's what they call a visual language model, or a VLM, and it's designed to run on the NVIDIA Jetson Orin, which is a pretty powerful little computer. These VLMs are basically large language models that can also deal with visual stuff, yeah, so they can do image and video captioning, VQA, all that. This VILA thing outperformed some of the best models out there, and they used 4-bit quantization to make it efficient enough for edge devices.
Speaker 2:So even these big multimodal models, they're finding ways to make them work at the edge.
Speaker 1:Yep, okay. Moving on to another super cool application, let's talk about making our devices actually speak in a more natural way.
Speaker 2:Text-to-speech, or TTS. This is where things get really interesting.
Speaker 1:Because a robot voice is not a good look.
Speaker 2:Not anymore. Back in 2022, there was this project called Nix-TTS. It's a non-autoregressive, end-to-end TTS model that learned from a bigger model called VITS. Nix-TTS is small, only about 5 million parameters and about 21 megabytes. Now, it wasn't quite as natural sounding as VITS, but it was way faster on a Raspberry Pi 3B.
Speaker 1:So sometimes a tiny bit of quality loss is worth it for a big gain in speed.
Speaker 2:Especially for real-time stuff.
Speaker 1:Oh, and there was another one from 2022 called Piper, a neural TTS system optimized for the Raspberry Pi 4, also based on VITS. This one had about 15 million parameters. So more options, more speed. And then, if you need even more efficiency, there's EfficientSpeech from 2023. It's a non-autoregressive, two-stage model. It creates, like, a visual representation of sound and then uses this small vocoder called HiFi-GAN to actually make the audio. This one was super tiny, only 1.2 million parameters for the acoustic model, and it ran on a Raspberry Pi 4B as fast as FastSpeech 2, which is a much bigger model.
Speaker 2:So basically, amazing quality packed into a tiny model.
Speaker 1:Pretty much. And then there was Fast Stream Speech, also from 2023. This one also used a two-stage approach, but added this cool streaming mechanism to make it even faster, especially on like smartphones. It sounded as good as larger models, but they got it running super fast on a MediaTek Helio G35 processor.
Speaker 2:Super fast. So we've got the main TTS models, but don't forget about the vocoders, right? That's the part that takes the sound features and actually creates the audio waves. There are some cool projects out there, like TinyVocos and Bunched LPCNet, both designed for tiny devices, that show you can even get great sound out of these smaller, specialized components.
Speaker 1:So we've talked about understanding what our devices see and making them speak, but what about making what we hear clearer?
Speaker 2:Ah, you're talking about speech enhancement, taking noisy audio and making it better.
Speaker 1:Exactly. So there's this cool model from 2024 called TransFilm. It's an audio super-resolution network for mobile devices. It uses these things called transformer blocks and a U-Net architecture, which is known for doing great stuff with audio and image processing. It takes low-quality audio and makes it sound much better. It's small, about a million parameters, got good scores on audio quality tests, and ran at a decent speed on a Meizu 16S smartphone.
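To make the "U-Net" part concrete, here's a toy 1-D U-Net in PyTorch: just the encoder-decoder-with-skip-connections shape, not TransFilm's actual architecture (which also adds transformer blocks). The channel counts and kernel sizes are made-up assumptions.

```python
# Toy 1-D U-Net for audio-shaped tensors (illustrative, not TransFilm).
import torch
import torch.nn as nn

class TinyUNet1D(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv1d(1, ch, 9, padding=4), nn.ReLU())
        self.down = nn.Conv1d(ch, ch * 2, 4, stride=2, padding=1)            # halve length
        self.bottleneck = nn.Sequential(nn.Conv1d(ch * 2, ch * 2, 9, padding=4), nn.ReLU())
        self.up = nn.ConvTranspose1d(ch * 2, ch, 4, stride=2, padding=1)     # restore length
        self.dec1 = nn.Sequential(nn.Conv1d(ch * 2, ch, 9, padding=4), nn.ReLU())
        self.out = nn.Conv1d(ch, 1, 9, padding=4)

    def forward(self, x):                        # x: (batch, 1, samples)
        e1 = self.enc1(x)
        b = self.bottleneck(self.down(e1))
        u = self.up(b)
        # Skip connection: concatenate encoder features with upsampled ones.
        d = self.dec1(torch.cat([u, e1], dim=1))
        return self.out(d)

audio = torch.randn(1, 1, 16000)                 # one second at 16 kHz
enhanced = TinyUNet1D()(audio)                   # same shape out
```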
Speaker 2:So imagine being able to clean up your voice calls or recordings right on your phone.
Speaker 1:Yep. And then, specifically for those super noisy situations, like, think industrial settings, there's this other model from 2024 that was designed to run on microcontrollers. They took a smart speech enhancement system and adapted it to run on an STM32H735 MCU, which is a tiny computer that could be used in, like, a safety helmet.
Speaker 2:In a helmet.
Speaker 1:Yeah. So this system has a part that detects emergency signals, and then another part that's a DCRN, a deep complex recurrent network, for noise reduction. They made the model really small, tweaked some stuff, and used INT8 quantization to make sure it could actually run on that tiny microcontroller. The result? It actually worked, and it performed as well as bigger models, with low latency and low energy usage. It's amazing what they can do with these small devices these days.
Speaker 2:It's really impressive. Okay, so let's talk about language understanding and translation, neural machine translation, or NMT. This is all about getting AI to translate languages automatically.
Speaker 1:And transformer models have been awesome at this, but they can be a bit resource-heavy, right? Yeah, they can be.
Speaker 2:So back in 2022, researchers developed Transformer DMB, which basically added some dynamic multi-branch layers to the standard transformer architecture. Now, it wasn't the absolute fastest on a Raspberry Pi 4B compared to some even smaller transformer versions, but when they combined knowledge distillation with 8-bit quantization, they found a really good balance between size, performance and efficiency.
Speaker 1:So it was a good compromise. And then, also in 2022, there was Hypoformer, which used a technique called hybrid tensor-train decomposition. It's a mouthful, but it basically helps compress and speed up transformer models, particularly for NMT. And by using knowledge distillation during training, they got it to perform just as well as a regular transformer on a Raspberry Pi 4B, but with way fewer parameters and much faster inference times.
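Hypoformer's hybrid tensor-train decomposition is more involved than this, but the underlying intuition, replacing one big weight matrix with small factors so you store and multiply far fewer numbers, can be shown with a plain low-rank factorization. This sketch is that simpler cousin, with made-up dimensions and rank; it is not the Hypoformer method itself.

```python
# Low-rank factorization sketch: the simple cousin of tensor-train compression.
import torch
import torch.nn as nn

d_in, d_out, rank = 512, 512, 32
dense = nn.Linear(d_in, d_out)                   # 512 * 512 = 262,144 weights

# Factorize W ~= B @ A with A: (rank, d_in) and B: (d_out, rank) via truncated SVD.
U, S, Vh = torch.linalg.svd(dense.weight.detach(), full_matrices=False)
A = torch.diag(S[:rank]) @ Vh[:rank]             # (rank, d_in)
B = U[:, :rank]                                  # (d_out, rank)

low_rank = nn.Sequential(                        # 2 * 512 * 32 = 32,768 weights
    nn.Linear(d_in, rank, bias=False),
    nn.Linear(rank, d_out),
)
low_rank[0].weight.data = A
low_rank[1].weight.data = B
low_rank[1].bias.data = dense.bias.detach()

x = torch.randn(1, d_in)
print((dense(x) - low_rank(x)).abs().max())      # approximation error from truncation
```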
Speaker 2:All those techniques are crucial if we want real-time translation right on our devices.
Speaker 1:Totally Now. How about using AI to modify and transform images for enhancing privacy.
Speaker 2:Yeah, so this is where Neural Style Transfer, or NST, comes in. It's like taking the style of one image and applying it to another, or even changing an image based on a text description For privacy. This is cool because you can use it to blur faces or change features while still keeping the rest of the image understandable.
Speaker 1:So it's like an artistic way to protect people's privacy.
Speaker 2:Exactly. And in 2024, they developed a super lightweight NST model, based on Gynet, specifically for anonymization. This thing was tiny, only 0.6 million parameters, and when they quantized it to 8-bit, it ran beautifully on a Raspberry Pi 4B and an STM32H743 microcontroller. Low latency, minimal energy. It was perfect for on-device privacy protection.
Speaker 1:That's awesome. Also, in 2024, some researchers created a really lightweight video NST model. They distilled knowledge from a larger model and added this efficient feature transformation module. It ended up being a tiny 2.67 megabyte model that could style transfer videos at almost 41 frames per second on those NVIDIA Jetson devices.
Speaker 2:So real-time video anonymization totally possible.
Speaker 1:And then there's text-to-image style transfer.
Speaker 2:Oh yeah, this is cool.
Speaker 1:There are these two projects from 2024, FastCLIPStyler and EdgeCLIPStyler. They basically take a text prompt, like, make this picture look like a Van Gogh, and they apply that style to an image. EdgeCLIPStyler specifically used a super efficient model for understanding the text, to make it better suited for edge devices, though it did take about 15 seconds to do its thing on a Raspberry Pi 3B+. So there's room for improvement there.
Speaker 2:But it shows what's possible.
Speaker 1:And then there's Mosaic from 2023. This one's cool because it can apply different styles to different objects in an image, and it ran at a pretty decent speed on a Snapdragon 8 Gen 1 processor.
Speaker 2:So you could have like one object that looks like a Picasso and another that looks like a Monet all in the same image.
Speaker 1:Exactly.
Speaker 2:Wild.
Speaker 1:Okay, now what about directly swapping faces in videos or pictures? Another potential privacy tool, right.
Speaker 2:Face swapping, yeah, definitely a hot topic, and researchers are looking into how to do it right on our devices. Back in 2022, there was a study that showed you could actually swap faces in real time on a phone with a MediaTek Dimensity 1100 processor, and they didn't even have to specifically optimize the model for that task.
Speaker 1:So the tech was already there.
Speaker 2:Yeah, it's just a matter of making it smaller and faster. And there's FinetGen from 2023 and SimSwap from 2024. These were specifically designed for anonymization using face swapping. FinetGen actually ran pretty well on a Kendryte K210 microcontroller, which is a seriously tiny device, and they got the smallest version of SimSwap running really fast on both a Raspberry Pi 4 and an STM32H743 microcontroller.
Speaker 1:It's crazy how far they've come.
Speaker 2:Right.
Speaker 1:Okay, so we've covered a bunch of specific applications, but our research also found a whole bunch of other visual processing tasks that are making their way to the edge.
Speaker 2:Oh yeah, tons of stuff. We're talking super resolution to make pictures and videos crisper, inpainting to fill in missing parts of images, general image and video restoration, different types of image enhancement and denoising, stuff happening right in the camera sensor with image signal processing, or ISP, and even things like bokeh rendering, which creates those cool blurry backgrounds in photos. There are so many different models being developed, with different sizes and goals, but smartphones are definitely the most popular platform for all of this, and the processing times, they range from milliseconds to seconds. There were even competitions, like the Mobile AI and AIM challenges in 2022, that really pushed things forward, especially in areas like super resolution and making it more energy efficient. There was this one called TinyLUT that could restore images really well, and it was crazy small and fast, even on a Xiaomi 11 smartphone and a Raspberry Pi 4B.
Speaker 1:So many cool things happening in the visual processing world.
Speaker 2:For sure.
Speaker 1:And then there's the even crazier stuff like what about actually generating images? Like creating them from scratch?
Speaker 2:You're talking about text-to-image diffusion models.
Speaker 1:This is where you type in something like, a cat wearing a hat, riding a bicycle, and the AI actually creates an image of that. Exactly, and that's where these big models like Stable Diffusion come in. Now, getting something like that to run on a phone, it's still tough, but there are some promising early results. They've gotten it down to under 15 seconds on some of the more powerful mobile chips. 15 seconds.
Speaker 2:That's not bad.
Speaker 1:Not bad at all, and researchers are coming up with all sorts of optimizations, new architectures, trying to reduce the number of steps it takes to generate the image. Models like BK-SDM, EdgeFusion, SnapGen, SnapFusion, MobileDiffusion, they're all pushing the boundaries here. MobileDiffusion even managed to create images in, like, 0.2 seconds on the latest iPhone 15 Pro. It's all about finding the right balance between model size, complexity, speed, and image quality.
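To get a feel for the "fewer denoising steps" lever, here's roughly how you'd trade steps for speed with the Hugging Face diffusers library on a desktop GPU. The model id, step counts, and need for a CUDA device are assumptions, and the on-device pipelines named above use far more aggressive tricks than just lowering this number.

```python
# Trading denoising steps for speed with diffusers (desktop sketch, not the
# on-device pipelines discussed above; model id is an assumption).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a cat wearing a hat, riding a bicycle"
slow = pipe(prompt, num_inference_steps=50).images[0]   # baseline quality
fast = pipe(prompt, num_inference_steps=20).images[0]   # faster, some quality loss
fast.save("cat.png")
```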
Speaker 2:And what about running language models directly on our devices?
Speaker 1:Ah yes, large language models, or LLMs, have been getting all the attention, but they are huge, right. So the focus has shifted to these smaller, more efficient versions what they call small language models, or SLMs that are much better suited for running on our phones and other devices.
Speaker 2:Yeah, these SLMs have way fewer parameters than the giant LLMs that power, like, ChatGPT, so they're much more realistic for edge devices.
Speaker 1:Exactly. And there are some really cool ones out there already, like Phi-3 Mini, MiniCPM, MobileLLM, MobileLLaMA, PhoneLM, and they're using all sorts of tricks to make them work well. They're optimizing the transformer architecture, doing crazy quantization, sharing parts of the model to save memory, and even sharing entire layers to speed things up. Like, PhoneLM can process 58 tokens per second on a Xiaomi 14 smartphone.
Speaker 2:58 tokens a second. That's pretty fast.
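To ground what a "tokens per second" number actually measures, here's a hedged little benchmarking sketch. The `generate_one_token` callable is a hypothetical stand-in for whatever on-device decoding step your runtime exposes, not a real PhoneLM API.

```python
# Rough tokens-per-second measurement; generate_one_token() is a hypothetical
# placeholder for an on-device autoregressive decode step, not a real PhoneLM API.
import time

def tokens_per_second(generate_one_token, num_tokens=128):
    start = time.perf_counter()
    for _ in range(num_tokens):
        generate_one_token()                 # one decoding step = one new token
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed

# Example with a dummy stand-in that just burns a little CPU time.
print(tokens_per_second(lambda: sum(range(10_000))))
```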
Speaker 1:Right. Now, these SLMs, they might not be as knowledgeable or as good at reasoning as the giant LLMs yet, but they're still a big step towards having natural language understanding and generation right on our devices, without needing the cloud.
Speaker 2:Totally agree.
Speaker 1:So we've looked at a ton of stuff today, from understanding images to generating speech to translating languages. It's a lot to process, but there's some common threads running through all of this right.
Speaker 2:Oh, absolutely. A few key optimizations keep popping up again and again, Like first off, you've got to design efficient models, ones that are naturally smaller and require less computation.
Speaker 1:Starting with a good foundation.
Speaker 2:Exactly. Then there's quantization, which basically means shrinking those numbers down, making them less precise, but way more efficient. That's a huge one. And knowledge distillation you know that teacher-student thing, that helps a lot too.
Speaker 1:Yeah, it's like giving the smaller model a cheat sheet from the bigger one.
Speaker 2:Exactly. And then there's pruning, which is like trimming the fat off a neural network, getting rid of the connections that aren't really that important. It's the combination of all these strategies that's really making edge gen AI possible.
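Pruning, in its simplest "magnitude" form, really is just zeroing out the weights that matter least. Here's a minimal PyTorch sketch of unstructured magnitude pruning on a single layer; real pipelines would prune across the whole network and fine-tune afterwards, so treat this as the bare idea only.

```python
# Unstructured magnitude pruning sketch: zero out the smallest-magnitude weights.
import torch
import torch.nn as nn

def magnitude_prune_(layer: nn.Linear, sparsity: float = 0.5):
    w = layer.weight.data
    k = int(w.numel() * sparsity)                     # how many weights to drop
    threshold = w.abs().flatten().kthvalue(k).values  # k-th smallest magnitude
    mask = (w.abs() > threshold).float()
    w.mul_(mask)                                      # zero the small ones in place
    return mask                                       # keep it to re-apply after fine-tuning

layer = nn.Linear(256, 256)
mask = magnitude_prune_(layer, sparsity=0.8)
print(1.0 - layer.weight.data.count_nonzero().item() / layer.weight.numel())
```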
Speaker 1:Okay, so lots of exciting developments, but what about the big picture? What are the trends? Where's all of this headed?
Speaker 2:Well, one thing is clear Research in edge gen AI is exploding. There are more and more publications every year, with new breakthroughs happening all the time, and what's really important to remember is that we focused specifically on actually deploying gen AI on edge devices. That's different from other research that might talk about edge AI in general.
Speaker 1:Right, because it's one thing to talk about it and another to actually do it.
Speaker 2:Exactly, and all those benefits we talked about earlier independence from the cloud, more privacy, faster speeds, lower costs, the ability to scale up AI massively all of that is driving this intense research, and it's opening up possibilities for applications we haven't even thought of yet.
Speaker 1:Yeah, like the whole field is just exploding.
Speaker 2:It is, and when you look at the research, it's clear that smartphones are leading the way. They're the most popular platform, with over 50 different processors from six different manufacturers being used, and the big chip companies, Qualcomm, MediaTek, Google, they're all pouring money into developing new hardware that can handle these gen AI workloads.
Speaker 1:It's a race.
Speaker 2:It is a race, and while most of the research is happening in visual processing, we're seeing more and more work in natural language processing and audio stuff too, so it's spreading big time.
Speaker 1:Right.
Speaker 2:And you know, one thing that's really needed going forward is more in-depth analysis of how these models perform on devices in the real world. Accuracy is great, but we need to understand latency, model size, storage space and especially energy consumption.
Speaker 1:Because a model that drains your battery in five minutes isn't very useful.
Speaker 2:Exactly. We need to know how much energy it takes to run these things on a device compared to doing it in the cloud. And then, of course, there's the environmental impact, which is something we need to think about too. Good point. And finally, there's the whole idea of agentic AI, giving devices the ability to learn, adapt, and make decisions on their own, without needing the cloud. Edge gen AI could really push that forward.
Speaker 1:Imagine having devices that are truly intelligent and proactive. It's kind of mind-blowing.
Speaker 2:It is.
Speaker 1:So, to wrap this all up, it's pretty obvious that running these complex Gen AI models on our everyday devices it's not science fiction anymore. It's happening right now.
Speaker 2:And the progress in just the last few years has been incredible. We're talking about faster, more private and more accessible AI for everyone.
Speaker 1:And while there are definitely still challenges, the speed of innovation is just insane, both in how they're optimizing these models and in developing new hardware. The future of Edge Gen AI it's looking pretty bright.
Speaker 2:And that brings us to like the ultimate question. Think about what it means to have truly intelligent generative devices all around us. How will that change how we interact with technology? How will it change our lives? What new possibilities will open up when our devices can not only understand, but also create and act independently? These are questions worth thinking about.
Speaker 1:Big questions for sure. Well, we hope this deep dive has given you a lot to think about. Maybe you'll even explore some of this research yourself.
Speaker 2:There's a lot to discover.
Speaker 1:Thanks for joining us. See you next time. Deep divers.