EDGE AI POD
Discover the cutting-edge world of energy-efficient machine learning, edge AI, hardware accelerators, software algorithms, and real-world use cases with this podcast feed covering all things from the world's largest EDGE AI community.
It features shows like EDGE AI Talks and EDGE AI Blueprints, as well as EDGE AI FOUNDATION event talks on a range of research, product, and business topics.
Join us to stay informed and inspired!
Beyond TOPS: A Holistic Framework for Edge AI Metrics
Beyond raw computational power lies the true measure of AI system effectiveness. Austin Lyons, founder of ChipStrat and analyst at Creative Strategies, challenges us to rethink how we evaluate Edge AI technologies in this thought-provoking talk on metrics that truly matter.
For too long, the industry has obsessed over trillions of operations per second (TOPS) as the gold-standard measurement. Lyons expertly deconstructs this limited view, introducing us to a more nuanced framework that considers what users actually experience. As generative AI moves to edge devices, shouldn't we care more about tokens per second (how quickly systems respond to our prompts) than abstract computational capabilities?
But speed alone doesn't tell the whole story. What happens when your lightning-fast AI assistant drains your battery in an hour? Lyons presents "tokens per second per watt" as a crucial metric for practical, everyday AI use. He also introduces the concept of "vibes"—those harder-to-quantify qualities like perceived intelligence and personality that make or break user adoption, drawing a compelling parallel to why people choose Apple products despite comparable technical specs from competitors.
The most valuable insight comes from Lyons' call for cross-functional collaboration in AI system design. When hardware engineers, software developers, designers, and product managers work in isolation, optimizing for their preferred metrics, the end result often disappoints users. By approaching AI development holistically, teams can make informed trade-offs that deliver better overall experiences—sometimes with less powerful but more efficient models.
Ready to transform how you think about AI performance? Subscribe to Austin's newsletter at chipstrat.com where he regularly shares insights on the evolving intersection of semiconductors, AI, and product strategy.
Learn more about the EDGE AI FOUNDATION - edgeaifoundation.org
Okay, everyone, I'm Austin Lyons. I'm not the founder of Creative Strategies; I'm the founder of ChipStrat, which is a Substack newsletter that I write. It's a portmanteau of chips and strategy. So chipstrat.com, if you'd like to read about things like what DeepSeek R1 distillation means for edge AI. Now, I'm also an analyst at Creative Strategies. We're a boutique research firm; we work with large semiconductor companies and help them with their technical marketing, their roadmap, talking to investors, that kind of thing.
Speaker 1:If you've ever heard the podcast The Circuit, that's our CEO, Ben Bajarin. Okay, so let's see, how do we use this bad boy? Okay: Beyond TOPS. So I'm going to talk to you today, and I'm going to try to keep it kind of lighthearted. It's 3:30; you've been here for like six and a half hours or something. I'm going to talk about thinking about metrics holistically: not just as an engineer thinking about TOPS, but thinking beyond that, trying to include the user experience and the product manager. I guess when I walk away from this, I've got to bring the clicker with me.
Speaker 1:So yes, austin Lyons, I'm from Iowa. It's good to be down here in Texas. It's nice and warm. Everyone's very welcoming here. I went to HEB on South Congress and they had this big sign that said, hey, austin. I just thought, man, how nice is that that they would welcome me, put a sign up. I know that's a bad joke, okay, so my name's Austin.
Speaker 1:Okay, I'm going to try to talk in pictures. You know, English characters on slides aren't very efficient. Like, what does the letter A mean? That doesn't have much information. And strings of characters, or words, those don't have much information either; you have to fill the whole slide with words before you convey any information. So I try to use pictures, because a picture is worth a thousand words, right? That must be quite the picture, right? Maybe like 10 words or something. So yes, I'm trying to use icons here.
Speaker 1:So I'm excited about edge AI, and about LLMs specifically, so this is going to have a lens toward generative AI at the edge. Why am I excited about that? With generative AI, now we have the opportunity to have a system that can understand natural language, understand written language; it can hear what you hear, it can see what you see. And that's just the input. From an output perspective, all of a sudden these systems can interact with the world in the way that we interact with the world, right? So written language, drawings, audio (we saw some of that earlier with the Aizip Gizmo demos, which was pretty cool) and even actions, whether it's tool use or tokens for a robot to generate motion control, right.
Speaker 1:So it's a very exciting, very interesting time. On top of that, I'm a former product manager, I've also worked as an engineer, and I get excited about the use cases for AI to work with us and work for us. Talking to Siri, that was exciting 10 years ago. Of course, Siri is not the Siri that we wanted; it didn't understand context, right? So it was kind of just a fun gimmick for a while. But now we're seeing real use cases of AI working with you (think ADAS) or working for you (so, Waymo), right. Driving with you or driving for you. Now we see AI helping you do research with ChatGPT, or doing research for you with ChatGPT's Deep Research. So it's a very exciting time.
Speaker 1:I think a lot of this is coming to the edge. Obviously, there are constraints that we'll talk about. So now, when we're thinking about metrics, historically we talk a lot about TOPS: how many operations per second can this system do?
Speaker 1:And in the world of generative AI at the edge, the new unit of work is really tokens. Now, of course, at the end of the day it's all matrix math, so under the hood it's all the same. But when we're starting to think about metrics, we should be thinking about tokens. The simplest example is ChatGPT: you ask it a question in English or natural language, or many other languages for that matter, and it gets broken down into input tokens; those run through the neural net, which makes predictions and outputs tokens that we can recognize, which are plain English. But it's more than that, right? The multimodal language models are super interesting and super exciting.
Speaker 1:Now, you can imagine these neural networks can understand images, right, so they can understand visual tokens. So you can give it a picture, and you can give it English as well, say, "What city is this?", and then it can output tokens in whatever modality makes sense for the use case. Maybe it's audio, because it's your helmet and you're biking, or maybe it's just written text because you're on your computer. And so when we're thinking about tokens, and thinking about the user experience and metrics that matter, again, maybe TOPS don't really matter; maybe what matters is tokens per second.
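The text half of that token pipeline (prompt in, tokens through the model, tokens out) can be sketched in a few lines. This is a toy whitespace tokenizer for illustration only: real systems like ChatGPT use learned subword (BPE) tokenizers, and the "model" step here is faked with a hard-coded answer.

```python
# Toy illustration of the pipeline: text -> input tokens -> model
# -> output tokens -> text. Real LLMs use subword (BPE) tokenizers;
# whitespace splitting is only a sketch of the idea.

def tokenize(text: str) -> list[str]:
    """Split a prompt into toy 'tokens' (real tokenizers use subwords)."""
    return text.lower().split()

def detokenize(tokens: list[str]) -> str:
    """Join output tokens back into readable text."""
    return " ".join(tokens)

prompt = "What city is this?"
input_tokens = tokenize(prompt)
print(input_tokens)  # ['what', 'city', 'is', 'this?']

# A real model would predict output tokens from the input; we fake it.
output_tokens = ["austin,", "texas"]
print(detokenize(output_tokens))  # austin, texas
```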
Speaker 1:So how quickly does this respond? That was the first thing with ChatGPT: it blew my mind for like five minutes when it first came out, and then it was like, this is so slow. Actually, that's not five tokens per second, I got that wrong; it's 20 tokens per second. Which, by the way, I used AI to generate all this. I just said, hey, I want to have an interactive demo; give me a web app so I can show off what fast tokens per second looks like.
Speaker 1:So this is a hundred. Is a hundred enough? I don't know. That's the conversation we should be having for our specific use cases. Is 20 tokens per second enough? Is 100? And maybe, if you have agents, it's not 100, maybe it's 1,000. Or maybe latency matters, like we talked about. But of course it's not just tokens per second: if it kills your battery, then that doesn't matter, right? So obviously there are a bunch of constraints and metrics we should be taking into account. So maybe it's tokens per second per watt. Okay, if 50 tokens per second is good enough for your use case, how many watts does it take to do that? Does it drain your battery or not? But then, as you start to unpack it, it's even more than that.
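A back-of-the-envelope version of that efficiency question can be written out. The numbers here are hypothetical (50 tokens per second at 2 watts, and a made-up 15 Wh battery), just to show how the two questions, efficiency and battery drain, fall out of the same measurements:

```python
# Sketch of the efficiency metric from the talk, plus the battery-life
# question that motivates it. All numbers are hypothetical examples.

def tok_per_sec_per_watt(tokens_per_second: float, watts: float) -> float:
    """Efficiency: tokens of output each watt buys per second."""
    return tokens_per_second / watts

def battery_hours(battery_wh: float, watts: float) -> float:
    """How long a battery of battery_wh watt-hours lasts at this draw."""
    return battery_wh / watts

print(tok_per_sec_per_watt(50, 2))  # 25.0 tok/s/W
print(battery_hours(15, 2))         # 7.5 hours
```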
Speaker 1:So, I call it vibes up here. That's what all the AI labs are talking about when you see them announce their latest Claude 3.7 release or something: vibes. And this is an interesting one, because engineers ask, well, how do you quantify that? Vibes, by the way, would be things like: how intelligent is this model? But it's more than that. Maybe it's: does this model have a personality? Was it post-trained or fine-tuned to make it funny, or have a sense of humor, or whatever? And I actually think that vibes could be a differentiator for some people. It's like taste, right? Why do people love Macs and iPhones? It's not really something you can quantify; it's a design, it's an experience. And so when we're talking about generative AI, and even AI at the edge: how are people interacting with your system? How does it make them feel? Maybe we need a chief vibes officer thinking about that.
Speaker 1:Of course, tokens per second matter, cost matters, battery life, compute: how much headroom do you have? Maybe you could run really fast so you could do all these agents at the edge, but then you're out of compute headroom and you can't add any features or improve the algorithms. And I lay this out here just to show that there are a lot of trade-offs you have to consider when you're designing a system, or buying hardware to implement a system. And so again, how do we get to the right set of trade-offs for our system?
Speaker 1:What I'm putting forth is that we have to come together holistically as a team (hardware engineers, software engineers, designers, product managers) and say: how do we make decisions? Not just, hey, we bought this system because it had the highest TOPS, now go run your ML on it. But how do we dial the knobs to get the right set? Like, it only needs to be 50 tokens per second, but we really care about battery life, and on vibes, it doesn't have to have a personality, it can just be stock Llama 3B or something. So how do we make decisions holistically as a team, where we're thinking about what hardware we're locking in? Obviously, memory capacity and bandwidth really matter, and that impacts your vibes, right? So it's like, hey, we can run a small model, and it can be quantized to FP4, and it's actually not even that smart, but it's good enough for this use case, and therefore we don't need as much memory and we can have a longer battery life. Okay, great, maybe that's the decision we're going for here.
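The memory side of that FP4 trade-off is easy to estimate. This is a weights-only, back-of-envelope sketch (KV cache and activations add more on top); the 3B parameter count echoes the Llama 3B example, and none of these are measured numbers:

```python
# Rough memory footprint of model weights at different quantization
# levels. Weights only -- KV cache and activations add more; treat
# these as back-of-envelope estimates, not measurements.

def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (using 1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

for bits in (16, 8, 4):  # FP16, FP8, FP4
    print(f"3B model @ {bits}-bit: {weight_memory_gb(3, bits):.1f} GB")
# 3B model @ 16-bit: 6.0 GB
# 3B model @ 8-bit: 3.0 GB
# 3B model @ 4-bit: 1.5 GB
```

Halving the bits per weight halves the memory the model needs, which is exactly the lever the talk describes: a quantized model that is "good enough" frees up memory and battery.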
Speaker 1:Okay, so I say all of that to build up to this slide, which is: how can we, as an industry, get people to think holistically about their AI systems, in such a way that your designer says, yes, people are going to use this, it's responsive enough, it's fast enough, it's fun to use? You'd hate to have the engineer say it's got a bunch of TOPS and great power efficiency, you know, tokens per second per watt, and the product manager say it's affordable, but then no one uses it, because you ask it a question, and then you ask the next question and it forgets the context, and it's not that useful, so it's just another Siri gimmick, right? And of course we can't just care about TOPS. The engineers may be very interested in TOPS and watts, but the product manager might need to say, hey, does it actually solve the problem we're trying to solve for the user? Is it affordable? Will people use this?
Speaker 1:So my final slide here is more of a call to action. A lot of what I've seen in the last year is, you know, Microsoft says, hey, you have to have 40 TOPS to be a Copilot+ PC, and now everyone starts benchmarking against 40 TOPS. We can do 40 TOPS, we can do 45 TOPS, we can do 50 TOPS. How can we help everyone, from marketing and sales to design teams, think holistically and have a conversation that it's more than just TOPS: it's about the user experience, it's about the power, it's about the cost? Maybe it looks like just being transparent and saying, hey, for this use case, we can run Llama 3B distilled with DeepSeek's reasoning, and we run it at FP8, and it takes this much memory, and it's 50 tokens per second at 2 watts, so it's 25 tokens per second per watt.
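That kind of transparent spec could be captured as a single record per configuration, one sketch of the holistic "more than TOPS" summary the talk asks for. The `EdgeAIConfig` type is illustrative; the field values echo the talk's example (distilled Llama 3B at FP8, 50 tokens per second at 2 watts), not real vendor numbers:

```python
# Sketch of a holistic per-configuration spec record, covering more
# than TOPS. The EdgeAIConfig name and values are illustrative only.
from dataclasses import dataclass

@dataclass
class EdgeAIConfig:
    model: str
    precision: str
    memory_gb: float
    tokens_per_second: float
    watts: float

    @property
    def tokens_per_second_per_watt(self) -> float:
        """Derived efficiency metric, rather than a headline TOPS number."""
        return self.tokens_per_second / self.watts

# The example spec from the talk: distilled Llama 3B at FP8.
cfg = EdgeAIConfig("Llama-3B-distilled", "FP8", 3.0, 50.0, 2.0)
print(cfg.tokens_per_second_per_watt)  # 25.0
```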
Speaker 1:Right, I feel like everyone we see is always talking about the one constraint they've prioritized, like in the AI ASIC world. You see this, right: the highest throughput from one ASIC startup, or the lowest latency from another. But how can we help everyone talk more holistically? It's all trade-offs, right? It's all trade-offs. So how can we talk about all the metrics that matter to these edge AI and LLM-based generative AI systems, and how can we think about all the metrics that matter to the user? So that is it. That's it for me. Eleven minutes, so not bad, right?