EDGE AI POD

Revolutionizing Edge Devices with Energy-Efficient Generative AI Techniques

EDGE AI FOUNDATION

Unlock the secrets of energy-efficient AI as we explore the groundbreaking fusion of Edge AI and generative AI in our latest episode. With expert insights from Victor Jung, a trailblazer in the field, discover how foundational models can be deployed on tiny and embedded systems to revolutionize devices like AR glasses and nanodrones. Listen as we unravel the complexities of deploying neural networks on microcontrollers, with a focus on powerful techniques like quantization, graph lowering, and innovative memory management strategies.

Victor guides us through the nuanced process of deploying neural networks, highlighting critical stages like graph lowering and memory allocation. Traverse the intricate front-end and mid-end stages where neural network graphs are optimized, ensuring peak performance on specific hardware platforms. We'll illustrate the importance of efficient memory usage through a fascinating example involving a tiny language model on the Siracusa platform, showcasing the role of quantization and memory management tailored for hardware constraints.

Dive into the future of AI deployment on edge devices with a focus on quantization and hardware support. From exploring the potential of foundation models like DINOv2 to discussing the emerging microscaling (MX) format, we uncover the technologies that are making AI more energy-efficient and versatile. Our conversation underscores the importance of viewing memory as a compute asset and the need for ongoing research to enhance system efficiency for generative AI at the edge. Join us for an enlightening episode that highlights the vital steps needed to optimize memory and computing resources for meaningful applications on small platforms.

Learn more about the EDGE AI FOUNDATION - edgeaifoundation.org

Speaker 1:

Thank you All right, here we are back with Edge AI Talks.

Speaker 2:

Hello, Ciao Pete.

Speaker 1:

Yes, good to see you. So where are you? You're dialing in from Milan, at STMicroelectronics? Yeah, awesome. I'm actually on site. I'm in Santa Clara, California, this morning, if you can see the sign behind me, at Amazon over there, presenting at their Silicon and Hardware Summit about the future of edge AI. So I will be ducking out once we have Victor rolling here, but it's great to see you again and it's great to bring Victor back on. We had him talk at our Generative AI at the Edge forum, I think in October, right? A huge thing. We had some challenges recording the live stream over those two days, so we wanted to have Victor come back and talk more about his stuff.

Speaker 2:

Your talk at Amazon will be exciting. We really need to share the beauty of edge AI with companies like Amazon and others, so yeah, I wish you the best for sure.

Speaker 1:

Yeah, no, I'm looking forward to it. I mean, I'm sure they're working on some cool stuff. Don't know if they're going to let me see anything, but yeah, lots of cool stuff. I mean, most of us have Alexas and things like that, so they really pioneered some of the edge devices and wake-vision-type things. So cool. Well, why don't we bring Victor on? Let's see, here he is.

Speaker 3:

Hi everyone.

Speaker 2:

Hi, Victor Hi.

Speaker 1:

Welcome. Welcome back, I should say yes, and where are you dialing in from?

Speaker 3:

I'm dialing from Zurich right now, and thanks for hosting me a second time here. That's always a pleasure.

Speaker 1:

Awesome, that's fantastic. OK, great. And just a reminder for folks out there who are just joining us: this is a live Q&A live stream show, so you can send in your questions here and Danilo and I will gather them up as Victor talks, and hopefully we'll have a good conversation going for people who have questions about what's going on. Cool. So let's see, why don't we get you into your mode here? Are you ready to get started? Danilo, any opening

Speaker 2:

comments, questions or thoughts for Victor before he dives in? Yeah, I would like to introduce Victor by reading his bio. I promise it's short, but I think we need to know Victor's context a bit better. So Victor received his Bachelor in Computer Science and Engineering Physics from Uniata College and a Master in Computer Science from IZIN LIL in 2022. Then, after three months as a research intern at KU Leuven, supervised by Marian Verhelst, he started his PhD with Professor Luca Benini in the Integrated Systems Laboratory, and his current interests include efficient deployment of machine learning models on microcontrollers, tiny transformers, scheduling and quantization.

Speaker 1:

So that's a great topic to talk about. Awesome. Yes, go ahead, Victor.

Speaker 3:

Okay, great. Thank you, Danilo, for the great introduction. And yes, I can say this work is a bit the quintessence and the perfect intersection of all those research interests that also form the community around edge AI and TinyML. So today the talk is about energy-efficient generative AI deployment on open-source RISC-V heterogeneous SoCs, the new generation of SoCs we see emerging to tackle those very challenging transformer workloads that run generative AI. So let's quickly dive in. What are we first talking about when we say GenAI and foundation models for TinyML? Those models are the very famous ones, such as the ChatGPT models, the ones that basically everyone uses, even if you don't work in tech, or also the Meta Llama models, the ones that are a bit more niche but also very well known in the tech industry. The advantages of those foundation models are definitely that you can use one backbone for many different tasks and you can also support mixed modalities such as images and voice. So that's very important here, and those are groundbreaking capabilities for many applications, and some embedded applications of these models already exist. For instance, with the AR glasses from Meta, the Ray-Ban ones, they have a very small budget in terms of power and also in terms of area, and they need to perform very complex tasks. So that's where you want a really well-compressed, embedded foundation model to perform all those tasks, taking in different modalities like the camera on the glasses, a mic, and other possible sensors that you can plug in. You also have, of course, the application with nanodrones, where you want the most autonomous engine possible that can also fuse different modalities to build a representation of the world around it and be able to navigate. So generative and foundation models at the edge: this is a trend that keeps getting bigger and bigger. There is this new trend of small language models, and the release frequency of these models is definitely increasing. Here is a quick timeline from a recent survey, and you can see that in 2024 alone many small language models were released, and this keeps growing and growing. However, most of those small language models are still too large for TinyML applications, because they range from 125 million to 1 billion parameters. That is still orders of magnitude higher than the requirements of TinyML or small embedded systems. But the tiny language models are coming, and we as a community need to be ready for this. So let's quickly come back and give some context on the generative inference process so that we are all aligned on what it means.

Speaker 3:

So if you take the sentence here, "What does PULP mean?", and you break down this sentence into different tokens, so basically one token is one word, then the sentence here is named the prompt. It's the prompt that you give to your model. So you give this prompt directly to your foundation model, here a Llama model, for instance. Then it will generate, it will predict, basically, which token or which word should come next. Here it's "Parallel" for the P of PULP. Then you can append this new token at the end of your prompt, this becomes your new prompt, you send it again into the language model, you get the next token predicted, and so on. And in the end you end up with the full answer of the language model to the question "What does PULP mean?": it means Parallel Ultra-Low Power, and it's also the name of the platform that we develop here at IIS.
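[Editor's note] To make the token-by-token loop Victor describes concrete, here is a minimal Python sketch of greedy autoregressive decoding. The `model` callable and the argmax decoding rule are placeholders for whatever tiny language model is deployed; this is not Deeploy's generated code.

```python
import numpy as np

def generate(model, prompt_ids, n_new_tokens, eos_id=None):
    """Greedy autoregressive decoding: feed the prompt, then repeatedly
    append the predicted token and run the model again.
    `model(ids)` is a placeholder returning logits of shape (len(ids), vocab)."""
    ids = list(prompt_ids)
    for _ in range(n_new_tokens):
        logits = model(np.array(ids))         # forward pass over the sequence so far
        next_id = int(np.argmax(logits[-1]))  # pick the most likely next token
        ids.append(next_id)                   # the new token extends the prompt
        if eos_id is not None and next_id == eos_id:
            break
    return ids
```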

Speaker 3:

So these generative models have two different modes; they work in two different modes for two very different purposes, so it's important to quickly introduce them before we continue. The first mode is the parallel mode, which is used to process the prompt. When you give the prompt, it processes all the tokens of the prompt in parallel and builds up the context, and this mode is basically composed of matrix-matrix multiplications, so GEMMs. Then there is the autoregressive mode, which is the mode used to generate the answer. Each time your model loops to generate a new token, that's the mode you are in; it stores the context in the KV cache, and it's mostly composed of matrix-vector multiplications, so GEMVs.
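[Editor's note] A hedged numpy sketch of why the two modes stress the hardware differently: prefill attention works on whole matrices (GEMMs), while each cached decode step reduces to matrix-vector products (GEMVs). The single attention head, random weights, and shapes are simplifications for illustration, not the deployed kernels.

```python
import numpy as np

d = 64                                        # head dimension (illustrative)
Wq, Wk, Wv = [np.random.randn(d, d) for _ in range(3)]

# --- parallel (prefill) mode: all prompt tokens at once -> GEMMs
X = np.random.randn(16, d)                    # 16 prompt tokens
Q, K, V = X @ Wq, X @ Wk, X @ Wv              # matrix-matrix products
scores = (Q @ K.T) / np.sqrt(d)
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
ctx = (w / w.sum(axis=-1, keepdims=True)) @ V
kv_cache = (K, V)                             # context kept for decoding

# --- autoregressive mode: one new token -> GEMVs against the cache
x_new = np.random.randn(d)
q, k, v = x_new @ Wq, x_new @ Wk, x_new @ Wv  # matrix-vector products
K, V = np.vstack([kv_cache[0], k]), np.vstack([kv_cache[1], v])
s = (K @ q) / np.sqrt(d)
a = np.exp(s - s.max()); a /= a.sum()
out = a @ V                                   # attention output for the new token
kv_cache = (K, V)                             # the cache grows by one entry
```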

Speaker 3:

Now that we know more or less what the workload looks like, let's quickly look at the constraints of TinyML heterogeneous SoCs. What kind of hardware are we talking about here? We talk about hardware with limited off-chip memory capacity, usually a maximum of around 64 megabytes for the MCU, and since it's off-chip, the read and write energy and latency are quite high. So it's a high cost to access, sometimes necessary, but we try to minimize these transfers. Then we also have many smaller on-chip memories, typically 1 to 4 megabytes, and finally we have different memory hierarchies.

Speaker 3:

They are represented with the green links in the figure on the right-hand side, and they are made of software-managed caches, so basically scratchpads. They don't have any hardware features for prefetching and things like that. This is very suitable for workloads that we know at compile time, because we can predict everything, and it increases the energy efficiency compared to regular caches. However, it offloads a lot of complexity to the compiler, because the compiler then has to perform static memory allocation; it also has to plan all the memory transfers and do some scheduling to decide when each transfer happens and from which memory to which other memory. So that's a lot of added complexity, but it's necessary if you want to reach the best energy efficiency you can get.
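[Editor's note] Since scratchpads have no hardware prefetch, the compiler must emit the transfer plan itself. Below is a minimal sketch of the idea: knowing the tile sequence at compile time, it produces a double-buffered schedule (fetch tile i+1 while computing tile i). The record format is invented purely for illustration; it is not Deeploy's internal representation.

```python
def plan_double_buffered_transfers(n_tiles):
    """Return a compile-time list of (step, action, tile, buffer) records
    for a simple double-buffered L2 -> L1 tiling scheme."""
    schedule = [(-1, "dma_in", 0, "buf0")]           # prologue: fetch the first tile
    for i in range(n_tiles):
        if i + 1 < n_tiles:
            # overlap: start fetching the next tile into the other buffer
            schedule.append((i, "dma_in", i + 1, f"buf{(i + 1) % 2}"))
        schedule.append((i, "compute", i, f"buf{i % 2}"))
        schedule.append((i, "dma_out", i, f"buf{i % 2}"))
    return schedule

# e.g. plan_double_buffered_transfers(3) interleaves prefetch and compute steps
```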

Speaker 3:

And talking about those specific heterogeneous SoCs, there is a real Cambrian explosion in the number of heterogeneous SoCs being proposed, because each of these SoCs has its own quirks, basically, and so many flavors are being developed. And to evaluate how good your hardware is, your new SoC that you developed, you need to run end-to-end workloads; you cannot only do micro-benchmarking, that's not a good enough metric anymore. So those heterogeneous SoCs can contain cores with different kinds of ISAs, and this often changes. They also have different neural processing engines with different capabilities, and additionally, on top of all that, there are different memory systems and hierarchies. If you take into account these three characteristics, where there are many possible choices for each of them, you almost always end up with a unique SoC; each SoC is unique in its own way. And then you need a flexible tool to deploy on those different SoCs without having to build a tool from scratch every time. So one example here is the Siracusa microcontroller.

Speaker 3:

We'll talk again about this specific MCU, so it's good to give everyone a quick glimpse of it first. This one has a convolution/GEMM accelerator, it also has a cluster of eight RISC-V 32-bit cores, and it has a specific neural memory subsystem that is basically a weight memory attached to the NPU. So what are the actual tools to deploy on those heterogeneous SoCs in the state of the art? There are many, but they each focus on one specific SoC. For instance, if you want to deploy on the STM32 MCUs, you will use the STM32Cube.AI tool provided by ST. If you want to deploy on the GAP processors, you will use the tool provided by the company GreenWaves Technologies, and so on and so on. So this makes sense for industry.

Speaker 3:

Of course, if you are proposing one specific class of SoC, you don't really care about supporting deployment for other SoCs, which makes complete sense. But it's very inconvenient for researchers, because in a lab like ours, where you develop many different chips, try different combinations, and open source all the hardware so other people can take some of your chips and make new ones, you don't want everyone to rebuild a full deployment stack every time they need to try a new architecture. Additionally, these vendor tools, to the best of my knowledge, don't support the specific GenAI patterns yet, such as KV caching.

Speaker 3:

So our deployment stack, what does it look like? It always starts with the very famous PyTorch or TensorFlow training frameworks. Then we export the neural network into a graph representation named ONNX, which is a common representation to share neural networks, and this representation is fed into our tool, Deeploy, which then generates code that can be compiled with your favorite compiler for your platform, for instance an Arm or LLVM toolchain. In this talk we will focus on Deeploy. To be clear, it generates bare-metal code that requires minimal runtime support, just to have a faster development cycle: you don't have to build a super fancy runtime for each of your platforms, you just need the bare minimum, and the general structure of Deeploy makes it easy to add new platforms and complex networks. I'm stressing that a lot in this talk because that's really one of the requirements of this tool: to try to unify these deployment tools a bit, especially for researchers doing architecture research. Deeploy was published at ESWEEK 2024, so if, after the talk, you're curious to learn more about the tool, you can of course check the open-source repo, and if you want more technical detail and research insight, you can check the paper.

Speaker 3:

So let's first dive into the front end. What is it doing? What we get as an input to Deeploy is a graph that represents the neural network, and there we first perform a graph lowering that is platform dependent. Depending on which platform you want to deploy on, you will perform different operations here: you will fuse or insert operators, for instance, and the graph still remains fully ONNX compliant. This is very useful, for instance, if you have a requantization that follows a matrix multiplication and you have a specific kernel that fuses those two operators; then in the lowering stage you want to fuse them in the graph in order to have one node per kernel. And that is the next step: the nodes are matched with a specific low-level kernel that supports each operator. For instance, we use libraries that are very well known, like CMSIS-NN or PULP-NN. You can always add your own custom kernel, but the goal in the end is to have one specific kernel, made by experts, to execute each specific operator of the graph.
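[Editor's note] A minimal sketch of the kind of pattern-matching pass the front end performs, here fusing a MatMul followed by a requantization node into one node so it can map onto a single fused kernel. The node representation (plain dicts with op/inputs/outputs) and the "RequantizedMatMul" name are stand-ins for a real ONNX graph, purely for illustration.

```python
def fuse_matmul_requant(nodes):
    """Fuse MatMul -> RequantShift pairs into a single node.
    Each node is a dict: {"op": str, "inputs": [...], "outputs": [...]}."""
    fused, consumed = [], set()
    for i, node in enumerate(nodes):
        if i in consumed:
            continue
        nxt = nodes[i + 1] if i + 1 < len(nodes) else None
        if (node["op"] == "MatMul" and nxt is not None
                and nxt["op"] == "RequantShift"
                and nxt["inputs"][0] == node["outputs"][0]):
            fused.append({"op": "RequantizedMatMul",
                          "inputs": node["inputs"] + nxt["inputs"][1:],
                          "outputs": nxt["outputs"]})
            consumed.add(i + 1)          # the requant node is absorbed
        else:
            fused.append(node)
    return fused
```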

Speaker 3:

And then, once we have this graph where each node is basically bound to one specific kernel, we need to handle the on-chip memory management: how we move data around and so on. This is handled in the mid-end, and that's the deep part of Deeploy, because this mid-end is quite complex; it's divided into two stages. The first stage performs the tiling and the second stage performs the memory allocation. During the first stage we tile by constraining our tensors in order to slice them, and we divide those constraints into two categories. The first category is the operator constraints, which basically make sure that you produce valid outputs for a certain operator. The second category is the target constraints, which make sure the tiles work well with the specific hardware target. For instance, you may have a processor that supports SIMD operations and a kernel that uses them very efficiently; if you're working in 8-bit integer and you pack four int8 values into a word, then you may get a big performance boost if your tiles are multiples of four in certain dimensions. So you can enforce this kind of constraint just to give a performance hint to the tiler. From this constraint-programming representation you get the symbolic buffer sizes that will be used later for the memory allocation. And these symbolic buffer sizes, to give a hint for the next stage, represent the height of the blocks in the Tetris-like memory scheduling that we will see in the next slide.
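[Editor's note] A toy version of the tiling step: choose tile sizes for an M x K x N matmul so that the tile buffers fit an L1 budget (the memory constraint) and the K tile is a multiple of 4 to suit an int8 SIMD kernel (the performance hint). Deeploy expresses this as constraint programming over symbolic sizes; the exhaustive search below is only to make the constraints concrete, and it ignores edge tiles for brevity.

```python
def pick_tile(M, K, N, l1_bytes, simd=4):
    """Return (tm, tk, tn) maximizing tile work under the L1 budget.
    Buffers are two int8 operand tiles plus an int32 accumulator tile."""
    best, best_work = None, -1
    for tm in range(1, M + 1):
        for tk in range(simd, K + 1, simd):          # target constraint: multiple of 4
            for tn in range(1, N + 1):
                footprint = tm * tk + tk * tn + 4 * tm * tn   # A + B + C(int32)
                if footprint <= l1_bytes and tm * tk * tn > best_work:
                    best, best_work = (tm, tk, tn), tm * tk * tn
    return best

# e.g. pick_tile(64, 64, 64, l1_bytes=16 * 1024)
```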

Speaker 3:

So now let's go to the second stage of the mid-end of Deeploy, which is the memory allocation. How do we do the memory allocation in Deeploy? Well, it's not so simple, but also not so complex to understand visually. We start from the graph on the left-hand side, where each node represents one operator. We then convert it into a tensor graph, where each node this time represents a tensor. From this tensor graph we generate an adjacency matrix A, and we use a cost vector c, where each element of this cost vector is a symbolic buffer size coming from the tiling stage. And from this A matrix and c vector, you can get one Tetris-like memory allocation scheme. Remember that this A matrix encodes the order of the block placement when you basically play Tetris to do your memory allocation.

Speaker 3:

So the core idea here is to solve this integer linear programming problem jointly for the tiling and the memory allocation, to get a solution that both gives a good tiling and respects all the memory constraints of our specific hardware. But you can see that if you use the topological order of the graph to do the Tetris scheduling, it leads to a very suboptimal solution. It basically creates a kind of staircase allocation with very high fragmentation, which is very suboptimal, so we really don't want to do that. How can we fix this? It's quite simple: you can use permutation matrices. You just generate a valid permutation matrix that respects the following properties: every element has to be a Boolean, of course, and the sum of each row and each column has to be one. Once you respect those properties, you have a valid permutation matrix P, and by a quick linear algebra trick you can generate a new adjacency matrix A' and a new associated cost vector c'. Basically, it's like shuffling the order in which you place the blocks in this Tetris scheme. You can see in this example that with the coordinate transform we change the order in which we place the blocks, and it leads to a better design point. Here the solver co-optimizes the tiling solution with the memory load schedule in order to find a solution that fits all the requirements: the specific on-chip memory, but also the specific tiling requirements that keep the operations correct.
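[Editor's note] A small numpy sketch of the linear-algebra trick Victor describes: a valid permutation matrix P reorders the cost vector (c' = P c) and the adjacency matrix (A' = P A P^T), which amounts to shuffling the order in which the blocks are placed. The toy A matrix and sizes are invented for illustration; in Deeploy the solver searches over such permutations jointly with the tiling.

```python
import numpy as np

# Toy instance: 4 buffers with symbolic sizes (the cost vector c from tiling)
c = np.array([32, 8, 16, 4])
A = np.triu(np.ones((4, 4)), k=1)     # toy order-encoding adjacency matrix

def permutation_matrix(order):
    """Build P from a placement order, e.g. [2, 0, 3, 1]."""
    P = np.zeros((len(order), len(order)))
    P[np.arange(len(order)), order] = 1
    # valid permutation: Boolean entries, each row and column sums to one
    assert (P.sum(axis=0) == 1).all() and (P.sum(axis=1) == 1).all()
    return P

P = permutation_matrix([2, 0, 3, 1])
c_prime = P @ c                        # shuffled block sizes
A_prime = P @ A @ P.T                  # same graph, blocks placed in a new order
```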

Speaker 3:

Now let's quickly go through one example: the tiny language model on Siracusa. We took a TLM based on llama2.c from Karpathy and trained it on the TinyStories dataset, just to get an example of a tiny language model. It's very small, only 4.6 million parameters, but it uses the Llama 2 architecture and it's a good demonstrator for our case here. We completely quantized this network to 8-bit integer using QuantLab, which is an open-source quantization tool that we use and develop here at the lab at IIS. And for the quantization of the nonlinear operators, such as the softmax or the RMS normalization, we use polynomial approximations.
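[Editor's note] For context on what a polynomial approximation of a nonlinearity means in practice, here is a float sketch in the spirit of the I-BERT-style integer softmax: exp is replaced by a second-order polynomial on (-ln 2, 0] plus a power-of-two range reduction. The deployed kernel does this entirely in integer arithmetic, and the coefficients here are the published I-BERT ones, not necessarily the exact QuantLab recipe.

```python
import numpy as np

LN2 = 0.693147

def poly_exp(x):
    """Approximate exp(x) for x <= 0 via range reduction plus a 2nd-order
    polynomial on (-ln2, 0] (coefficients as in the I-BERT i-exp)."""
    q = np.floor(-x / LN2)                  # x = r - q*ln2, with r in (-ln2, 0]
    r = x + q * LN2
    p = 0.3585 * (r + 1.353) ** 2 + 0.344   # polynomial approximation of exp(r)
    return p * 2.0 ** (-q)                  # exp(x) = exp(r) * 2^(-q)

def approx_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # shift so every input is <= 0
    e = poly_exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# approx_softmax(np.random.randn(8)) stays within about 1e-2 of the exact softmax
```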

Speaker 3:

So let's quickly come back to the specific platform we are targeting here, which is Siracusa, quickly introduced before. In this platform we have again this CNN accelerator named N-EUREKA. We also have a lot of on-chip memory: 4 megabytes of SRAM weight memory, 2 megabytes of L2 SRAM and 256 kilobytes of L1 TCDM. Then we also have a high-bandwidth link to the NPU, both from the neural memory subsystem to stream the weights and from the L1 to stream the activations. Finally, we have an octa-core RISC-V cluster with cores that support DSP extensions, integer SIMD, and also post-increment and hardware-loop extensions. All of these extensions make them very efficient at crunching numbers, such as performing matrix multiplications or other DSP tasks.

Speaker 3:

So, our benchmark setup, before we talk about numbers: Siracusa is silicon-proven, so it has been taped out, and we perform measurements directly on the board. We benchmark three different configurations here. The first configuration is with the octa-core cluster only. The second configuration is with the NPU but without the neural memory subsystem. And the final configuration is the complete SoC with the NPU, the neural memory subsystem and, of course, the octa-core cluster. So let's go to the first benchmark we performed.

Speaker 3:

Basically, the goal was to see whether it was beneficial to use KV caching and, if yes, by how much. So we measured the cumulative latency and energy of a 256-step inference, with and without KV caching. We know that if we don't use KV caching we do a lot of redundant operations, but we wanted to quantify how that translates into latency and energy. We can see that it's definitely non-negligible: not supporting KV caching is a big loss, and KV caching gives a 23x speedup in latency and 26x in energy. So it's definitely super costly to ignore KV caching, and this should not be done. You need a deployment tool that supports this specific paradigm.
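[Editor's note] A back-of-the-envelope sketch of where the redundancy comes from: without a KV cache, every decode step re-runs the linear layers over the whole sequence so far, so the multiply-accumulate count grows roughly quadratically with the number of generated tokens. The measured gap (23x latency, 26x energy) is smaller than this raw MAC ratio, plausibly because the no-cache path runs as higher-utilization GEMMs; the per-token cost below is an arbitrary illustrative number, not a measurement.

```python
def linear_layer_macs(steps, per_token_macs):
    """Rough MAC counts for the linear layers over `steps` decode steps."""
    with_cache = steps * per_token_macs                       # one new token per step
    without_cache = sum(t * per_token_macs for t in range(1, steps + 1))
    return with_cache, without_cache

cached, uncached = linear_layer_macs(256, per_token_macs=1_000_000)  # illustrative cost
print(f"~{uncached / cached:.0f}x more linear-layer MACs without KV caching")  # ~128x
```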

Speaker 3:

Then we looked at the breakdown of where we spend time, depending on the specific category of layer, for the parallel and the autoregressive mode in an end-to-end inference scenario. We looked at this for the three configurations I introduced earlier: the cluster, the cluster plus the accelerator, and the complete SoC. For the softmax, the RMS norm and the activations it's all the same, which makes sense because they are always executed on the cluster, whatever the configuration. But when we look at the GEMMs, we see a big difference between the three configurations. You clearly see that when we use the NPU we can compress the time we spend in the GEMMs; consequently, we gain 59% of the runtime compared to the octa-core-only configuration. It's also very important to clarify that we offload only the linear layers to the NPU, so we offload matrix multiplications between weights and activations, not between activations and activations. Finally, we have a little bit of data movement, and it's quite homogeneous across the different configurations.

Speaker 3:

So from this breakdown we can clearly pinpoint some of the properties of the parallel mode. We clearly see that we are in a compute-bound regime and that we benefit maximally from using the NPU to relieve this compute-boundedness a bit. Then we did the exact same thing, but with the autoregressive mode. Again, same story for the softmax and for the activations: they are the same because they are executed on the cluster.

Speaker 3:

But when we look at the matrix-vector multiplications, the GEMVs, you see a very different trend than with the parallel mode. The NPU here still provides a 16% speedup on the GEMVs, but way less compared to the parallel mode. That's clearly because the GEMV has a much lower computational intensity than the GEMM, so we are much more memory bound, so memory bound that the accelerator is way less beneficial here. Now if we look at the data movement, we see that it's a significant part of the runtime, but when we use the neural memory subsystem, when we add this quirk to the configuration (if you remember the schematic before, the neural memory subsystem basically gives more bandwidth towards the accelerator), it relaxes the NPU's memory-boundedness by giving more bandwidth, but it still stays in a memory-bound regime.
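[Editor's note] The memory-bound behavior of the autoregressive mode follows directly from arithmetic intensity: a GEMM reuses each loaded weight across many rows of activations, while a GEMV touches every weight exactly once. A small counting sketch, assuming int8 operands at one byte per element and generic example shapes (not the model's actual dimensions):

```python
def intensity_gemm(M, K, N):
    flops = 2 * M * K * N                     # multiply-accumulates count as 2 ops
    bytes_moved = M * K + K * N + M * N       # A, B, C tiles, one byte each (int8)
    return flops / bytes_moved

def intensity_gemv(K, N):
    return 2 * K * N / (K * N + K + N)        # the weight matrix dominates traffic

print(intensity_gemm(256, 512, 512))   # prefill-style GEMM: ~256 ops per byte (high reuse)
print(intensity_gemv(512, 512))        # decode-style GEMV: ~2 ops per byte (memory bound)
```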

Speaker 3:

So the insight we get from this benchmark is that the autoregressive mode definitely runs in a memory-bound regime and leverages mostly the octa-core cluster.

Speaker 3:

So you see, we have this new kind of workload that can be run in two different modes, and you need both modes; otherwise you lose a lot, as we saw in the previous benchmark where we tried not using KV caching at all.

Speaker 3:

So you need these two modes. You have a heterogeneous workload, and a heterogeneous workload fundamentally needs heterogeneous hardware if you want to execute all the modes efficiently. So the takeaway of this talk is that we demonstrated end-to-end inference of a tiny language model on a heterogeneous SoC with Deeploy. We also benchmarked different configurations of the SoC for the two modes of the network to isolate different trade-offs and collect insights, and we finally reached a throughput of 340 tokens per second at 490 microjoules per token with this specific configuration, measured on chip. And the best part is that it's all open source: the paper, of course, but also the tool, Deeploy, and additionally QuantLab, the tool we used to quantize the network. With this, that's the end of the talk and I think we can go to the Q&A now.

Speaker 2:

Thanks, Victor. Excellent presentation and a well-chosen topic. In the meanwhile, while we wait for questions from attendees, I would like you to address some of mine. So I took my notes, and clearly heterogeneous was a keyword in your presentation, and you addressed it from the perspective of both hardware and software. So let me try to ask you something around this keyword, in no specific order. Heterogeneous means, as you pointed out, different deep learning frameworks, and different deep learning frameworks mean many operators, hundreds of operators. You know, operators that do convolutions, different flavors of convolutions, and also mathematical operators, tensor operators, with different meanings across the deep learning frameworks, and so on. So how do you address this complexity through your toolchain? What are the principles, or the thoughts, or the advice you would share?

Speaker 3:

Well, that's a very good point, Danilo, because, yes, there are many deep learning frameworks that we take the network from. I think if I quickly come back to this slide describing our stack, people will also have an overview of what I'm talking about. A lot of animations, but here we are. So here, yes, there are many different frameworks upstream.

Speaker 3:

And when we built the framework, the reason why we have the ONNX model export is to have this kind of harmonization step between all the different frameworks. All these frameworks support an export in the ONNX format, which has a specific frozen dialect, and if we support the operators from this dialect, then we support the operators from every deep learning framework that can export to ONNX. So that's why we have this layer here: to harmonize all the operators into a dialect that we know exactly, instead of trying to directly support each different tool's own dialect.
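[Editor's note] For readers who want to reproduce the harmonization step Victor describes, the standard PyTorch-side export looks like this. The model, shapes, file name and opset are placeholders; any opset that covers your operators will do.

```python
import torch

model = torch.nn.Sequential(               # placeholder for your trained network
    torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 10))
model.eval()

dummy_input = torch.randn(1, 64)           # example input fixing the tensor shapes
torch.onnx.export(
    model, dummy_input, "model.onnx",
    input_names=["input"], output_names=["output"],
    opset_version=13)                      # the frozen ONNX dialect the backend consumes
```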

Speaker 2:

OK, thank you. And staying on this slide and on this processing, do you use ONNX as the representation during the manipulations performed by this stack?

Speaker 3:

Yes, exactly. So here I didn't introduce the backend. I have some slides in backup for it, but it is really just code generation, so it's not so interesting. But during the front end especially, and also during the mid-end, we keep this ONNX graph; we modify it heavily, but we keep it ONNX compliant. This allows us to re-export the ONNX graph, or just export or save it, at different points of the flow, to clearly visualize what we are doing within the tool, how we transform the graph across its execution.

Speaker 2:

And this also helps verification, because you can cross-check the outcomes of your transformations exactly against the ONNX Runtime.

Speaker 3:

Yes, yes, exactly. That was one of the big points of why we want to keep the graph ONNX compliant.
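[Editor's note] The cross-check Danilo mentions can be done with ONNX Runtime: run the exported (or transformed, still ONNX-compliant) graph and compare against a reference output. A minimal sketch, reusing the hypothetical file and input shape from the export example above:

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")
x = np.random.randn(1, 64).astype(np.float32)

input_name = sess.get_inputs()[0].name
(onnx_out,) = sess.run(None, {input_name: x})

# compare against the transformed/deployed implementation's output, e.g.:
# np.testing.assert_allclose(onnx_out, deployed_out, rtol=1e-3, atol=1e-3)
```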

Speaker 2:

And yeah, you said you didn't touch the backend, but there is also the possibility to generate C code, which maybe is less interesting, you said. But let me, staying on these slides, ask you another thing. You know that there are tools for automated code generation, for example PyJIT. What do you think about this possibility of automating C code generation with a tool like that? Is it effective, or maybe less optimal, in your experience?

Speaker 3:

Usually it's a bit complicated, because most of those tools assume a traditional architecture that uses caches, so they don't have to do explicit data transfers and they don't have to do static memory allocation. That's why we could not really connect to or extend an existing tool here. But there is still room for improvement, because with Deeploy we still use kernel libraries, and so if you have a tool that, for instance, automatically fine-tunes your kernel based on the specific arguments used to call it, then you could use Deeploy collaboratively with such a tool: once Deeploy has generated the C code, that tool would basically fine-tune the specific kernels based on their arguments.

Speaker 2:

Thanks. When you speak about these low-level kernels, it means that you took on the load of optimizing operators for the hardware architecture, right?

Speaker 3:

Yes, because here is how it goes when we have a new class of cluster, for instance with new ISA extensions. In the lab, or in most labs I think, people usually start by doing micro-benchmarking, so just looking at how efficient a matrix multiplication is, just very simple kernels, and they usually build a simple kernel library to validate the concept. Once that is done, we basically reuse those kernels, because they have been made by experts already, and we just enable the execution of more complex workloads that also take into account all the data transfers and all the movement across the whole memory hierarchy, instead of just the L1, for instance.

Speaker 2:

Sure, sure, good. If you are available, I would continue with some questions. You know, with Keras there is the possibility that not all the kernels are explicitly known ahead of time, because maybe they are the intellectual property of the developer, and Keras custom layers are a great asset to capture specific and proprietary implementations. So how would your methodology, your workflow, deal with these special-purpose kernels, for example the Keras custom layers?

Speaker 3:

The way we usually go is that we always have fallback kernels that are less tailored to a specific hardware but functionally correct, to validate the correctness first. Then, depending on which kernels we have access to, or which specific operator we want to optimize because we see it takes a big chunk of the runtime, we either write it ourselves or we ask the people whether it's possible to use theirs in an open-source manner. For instance, we did this with GreenWaves Technologies: they have a very well-engineered and super specialized library for convolution layers that is way better than ours. So we discussed with them and we are currently integrating it into Deeploy so that we can also use their kernels. That's usually how it goes.

Speaker 3:

But if the kernels are proprietary and we cannot have access to them, we fall back on the less optimized but still functionally correct ones. I think that's a good way to support unknown kernels and at least keep the application running, and it's all thanks to the ONNX dialect being the source of truth: we are both talking the same dialect, so we know exactly which kernel we need to fall back on for a specific operator.
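[Editor's note] The fallback behavior Victor describes can be pictured as a kernel registry keyed by ONNX operator names: use the platform-tuned kernel when one is registered, otherwise fall back to a generic, functionally correct one. The dispatch structure and kernel names below are invented for illustration, not Deeploy's actual code.

```python
GENERIC_KERNELS = {"Conv": "conv_generic_c", "MatMul": "matmul_generic_c"}
TUNED_KERNELS = {"MatMul": "matmul_pulp_nn_int8"}   # platform-specific entries

def select_kernel(op_type):
    """Prefer a tuned kernel; otherwise fall back to a slower but correct one."""
    if op_type in TUNED_KERNELS:
        return TUNED_KERNELS[op_type]
    if op_type in GENERIC_KERNELS:
        return GENERIC_KERNELS[op_type]
    raise NotImplementedError(f"no kernel for ONNX operator {op_type}")

print(select_kernel("MatMul"))  # tuned int8 kernel
print(select_kernel("Conv"))    # generic fallback keeps the network running
```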

Speaker 2:

So, from your point of view, ONNX is a de facto standard in terms of representation of ML models?

Speaker 3:

I think it became a bit of a standard, or at least it's what we use and what I see other researchers using; I see a lot of researchers using ONNX to share and visualize models. I'd say it's less and less true, because I'm more and more tempted to directly support PyTorch operators, since PyTorch is getting a lot of traction in the research world. But other frameworks are still very much used in industry, so I don't want to tailor the tool to a specific framework yet. So yes, I think ONNX is still the standard here.

Speaker 2:

I tend to agree with you, but it happens sometimes that conversion between formats is still needed, because there are some converters on GitHub that help go between ONNX, Keras and TF Lite, that type of converter, and sometimes they mix up the channel location within the pipeline. So you would expect everything channel-first or everything channel-last, but sometimes the channels get messed up in between. So how do you deal with this intricacy? I mean, it really breaks the regularity, you know.

Speaker 3:

Yes, definitely. We saw this happening and we had to hack the converter a bit to make sure it was returning things in the specific layout that we like, because if the layout of the input is not assumed correctly, there is nothing you can do. But what we do in Deeploy is that we say which layout each operator should use by default, and based on this we insert the transpositions to make sure that a specific kernel that uses, for instance, channel-last or channel-first will get data in the correct layout. So that's also one of the transformations performed in the front end.
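[Editor's note] A small numpy sketch of that layout fix-up: if a kernel expects channels-last (NHWC) data but the graph is channels-first (NCHW), transposes are inserted around it. The kernel here is a trivial placeholder.

```python
import numpy as np

def nhwc_kernel(x_nhwc):
    """Placeholder for a kernel written for channels-last data."""
    return x_nhwc * 2.0

def run_with_layout_fix(x_nchw):
    x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))   # NCHW -> NHWC before the kernel
    y_nhwc = nhwc_kernel(x_nhwc)
    return np.transpose(y_nhwc, (0, 3, 1, 2))     # back to NCHW for the next node

x = np.random.randn(1, 3, 8, 8)                   # N, C, H, W
assert run_with_layout_fix(x).shape == (1, 3, 8, 8)
```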

Speaker 2:

Sounds great. Thanks, Victor, and I'm taking advantage of you for a few more minutes. You know, 8-bit integer quantization is well known, but sometimes there are also other possibilities. I'm thinking about the experimental 16x8 scheme provided by TF Lite, or FP16, and sometimes, through QKeras, you have all the possibilities, fractional or integer, between 1 and 32 bits. What's your experience dealing with this heterogeneity, still standing on the keyword that was driving your presentation? Where do you stand in front of these many possibilities in terms of bit depth and representation for quantization?

Speaker 3:

In the deployment part, it's actually not so hard to support. I think the real complexity lies in the hardware and in the quantization tools themselves, because from the point of view of the deployment tool it's just a new operator. The annoying thing is that you will most likely have to build a specific kernel if you want to perform those operations efficiently, for instance, for the GEMM, to fuse the requantization with the GEMM itself. And for less important operators you may have a very default kernel that is not so optimized, but since it is a small part of the runtime, it's not so bad.

Speaker 3:

But I think the real difficulty here is to have a runtime to execute those fancy quantization schemes, so you need hardware that supports the specific format. I'm also thinking of the recently released MX format, the microscaling format, which is very interesting. It's a way to do low-precision floating point where, basically, a small block of values shares the same scale for the quantization, and you need to have a platform that supports this new type if you want to be able to simulate it and to do the quantization. That's currently, I think, one of the bottlenecks for quantization.
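[Editor's note] To make the "small block sharing one scale" idea concrete, here is a simplified sketch of microscaling-style quantization: blocks of 32 values share one power-of-two scale. Real MX formats store the elements as low-precision floats (e.g. FP8/FP4) rather than the signed integers used here, so treat this purely as an illustration of the blocking.

```python
import numpy as np

def block_quantize(x, block=32, bits=8):
    """Quantize 1-D x in blocks that share a power-of-two scale."""
    pad = (-len(x)) % block
    xb = np.pad(x, (0, pad)).reshape(-1, block)
    qmax = 2 ** (bits - 1) - 1
    scales = 2.0 ** np.ceil(np.log2(np.abs(xb).max(axis=1, keepdims=True) / qmax + 1e-12))
    q = np.clip(np.round(xb / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def block_dequantize(q, scales, n):
    return (q * scales).reshape(-1)[:n]

x = np.random.randn(100).astype(np.float32)
q, s = block_quantize(x)
print(np.abs(x - block_dequantize(q, s, len(x))).max())   # small reconstruction error
```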

Speaker 2:

Thank you so much. And you said the small language models are getting smaller, but not small enough, and so, in general, we need generative models at the edge that are smaller and smaller. But how much smaller? What's your ballpark, in your experience or in your vision, even in other cases, not only SLMs, that we could achieve to make this new wave of edge GenAI happen?

Speaker 3:

I think the ballpark would be around a hundred million parameters, hopefully a little bit below, maybe 50 million parameters, with aggressive quantization, so going a bit lower than 8-bit. You could start to fit that into, for instance, the L3 memory of those tiny platforms, and you could start to have very meaningful applications. One application I think about, it's not generative AI but it's still a foundation model, is using the network named DINOv2, which is basically a backbone for any visual task. From this backbone you can do sentiment analysis, you can do image classification and you can do segmentation of images, for instance, with only one backbone. And this backbone is 21 million parameters. So it would clearly fit this kind of use case, where it would be embedded on a very small processor at the very edge, and you could have a lot of applications that benefit from DINOv2 being a foundation model, where you use one backbone to do many different tasks right after.
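[Editor's note] The "one backbone, many heads" pattern Victor describes looks roughly like this in PyTorch. The tiny convolutional backbone and head sizes are placeholders (a real deployment would use a compressed DINOv2-class encoder); the point is only that several tasks reuse one set of backbone parameters.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, embed_dim=256, n_classes=10, n_seg_classes=4):
        super().__init__()
        # shared backbone: one set of weights reused by every task
        self.backbone = nn.Sequential(
            nn.Conv2d(3, embed_dim, kernel_size=16, stride=16), nn.GELU())
        # lightweight task-specific heads
        self.cls_head = nn.Linear(embed_dim, n_classes)
        self.seg_head = nn.Conv2d(embed_dim, n_seg_classes, kernel_size=1)

    def forward(self, x):
        feats = self.backbone(x)                             # (B, D, H/16, W/16)
        cls_logits = self.cls_head(feats.mean(dim=(2, 3)))   # globally pooled features
        seg_logits = self.seg_head(feats)                    # dense per-patch prediction
        return cls_logits, seg_logits

model = MultiTaskModel()
cls_out, seg_out = model(torch.randn(1, 3, 224, 224))
```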

Speaker 2:

So you named it: the backbone. It sounds like the more the technology advances, the more we are able to find, as you said, a common denominator, the backbone. So here there is an opportunity, I mean, to really do a very optimal mapping of it. Do you see this backbone becoming a de facto standard type of machine learning graph as well?

Speaker 3:

Yes, yes, definitely. Clearly, if you really think about a full product, like, for instance, the AR glasses I showed before, the ones from the collaboration between Ray-Ban and Meta, this is the only way to go if you are reasonable, because you cannot afford anymore to have a specific network for every task, considering how much space it takes. Having a common backbone for those tasks means you will be way more efficient in terms of parameters per unit of accuracy for each task, if you prefer.

Speaker 2:

Yes, and starting from what you said about the AR glasses, or even the nanodrones: clearly, these systems must be very energy efficient. You were quoting 490 microjoules per token. So let me elaborate the question a little bit. From another angle, the architectures you spoke about, and the tools that are able to perform the deployment on those architectures, use memory. There is a lot of memory in the device, but it's not used as a compute asset. So the question is: how do you see it, maybe five years from now, or even less?

Speaker 2:

How do you judge this energy efficiency is going to evolve, with or without new opportunities employing memory as a compute asset and not only as a storage asset?

Speaker 3:

Definitely. I mean, I think either we get these new 3D stacking technologies that allow us to drastically improve the number of bytes per square centimeter, or square millimeter, that we can store, which would increase the density of the memory a lot and would allow us to keep up with our current way of doing things. Or, if that is not happening, then of course the way to go is in-memory computing or near-memory computing, where you move some of the compute units very, very close to the memory. And I think this is definitely a trend; it's a huge research field and it has already proved to be efficient. I expect that if those 3D stacking technologies are not coming up to speed soon for small microcontrollers or edge devices, then the compute-in-memory trend will keep growing and growing and start being implemented here.

Speaker 2:

And yes, I also really hope that researchers will put their best energy into helping make these systems more and more energy efficient, because that's what we need for generative AI at the edge. I've played with some off-the-shelf hardware and definitely there is huge room for improvement, which I think is a huge opportunity for researchers, for people like you and others, and I hope this will be addressed in the interest of the community, the edge AI community. So, Victor, I would really like to thank you so much for your passion. You spent 20 minutes on my questions.

Speaker 2:

I hope I didn't torture you; it was not intended on my part. But I really enjoyed this interaction with you and the topics that you presented. So thanks so much for all the passion, and all the answers that you provided were really very helpful, and I hope the community will also take advantage of your talk. Thanks so much, and let's meet at the next possible event. Why not in Austin?

Speaker 3:

Sure, sure, maybe. I'll discuss that with my professor, but thanks for inviting me again. It's always amazing to be here in these talks, so thanks a lot, Danilo.

Speaker 2:

Thanks to you, Victor, and, I think, thanks to everybody for attending, and to Pete and Rosina. Talk to you next time. Thank you, ciao.

Speaker 1:

Thank you.