EDGE AI POD
Discover the cutting-edge world of energy-efficient machine learning, edge AI, hardware accelerators, software algorithms, and real-world use cases with this podcast feed covering all things from the world's largest EDGE AI community.
These are shows like EDGE AI Talks and EDGE AI Blueprints, as well as EDGE AI FOUNDATION event talks on a range of research, product, and business topics.
Join us to stay informed and inspired!
Support for Novel Models for Ahead-of-Time Compiled Edge AI Deployment
The growing gap between rapidly evolving AI models and lagging deployment frameworks creates a significant challenge for edge AI developers. Maurice Sersiff, CEO and co-founder of Germany-based Roofline AI, presents a compelling solution to this problem through innovative compiler technology designed to make edge AI deployment simple and efficient.
At the heart of Roofline's approach is a retargetable AI compiler that acts as the bridge between any AI model and diverse hardware targets. Their SDK supports all major frameworks (PyTorch, TensorFlow, ONNX) and model architectures from traditional CNNs to cutting-edge LLMs. The compiler generates optimized code specifically tailored to the target hardware, whether it's multi-core ARM systems, embedded GPUs, or specialized NPUs.
What truly sets Roofline apart is their unwavering commitment to comprehensive model coverage. They operate with a "day zero support" philosophy—if a model doesn't work, that's considered a bug to be fixed within 24 hours. This approach enables developers to use the latest models immediately without waiting months for support. Performance benchmarks demonstrate the technology delivers 1-3x faster execution speeds compared to alternatives like Torch Inductor while significantly reducing memory footprint.
Maurice provides a fascinating comparison between Roofline's compiler-based approach for running LLMs on edge devices and the popular library-based solution llama.cpp. While hand-optimized kernels currently maintain a slight performance edge, Roofline offers vastly superior flexibility and immediate support for new models. Their ongoing optimization work is rapidly closing the performance gap, particularly on ARM platforms.
Interested in simplifying your edge AI deployment while maintaining performance? Explore how Roofline AI's Python-integrated SDK can help you bring any model to any chip with minimal friction, enabling true innovation at the edge.
Learn more about the EDGE AI FOUNDATION - edgeaifoundation.org
Good afternoon everyone. So nice to see you all after the lunch break, and I'm happy to present today how we are tackling deployment in the edge AI space, especially focusing on rapid support for new AI models coming to the market, to really enable innovation within the space. I'm Maurice Sersiff, CEO and co-founder of Roofline AI. We are a Germany-based startup working on AI compilers, and we are on a mission to make edge AI deployment simple. Our idea here is to enable more innovation within the space by giving developers a tool so that they can bring their AI models to any SoC they desire as fast as possible. So why do we do this? The core motivation lies in the different pace of AI models versus deployment frameworks. What we have seen in the last years is that AI models have developed at a very rapid pace, while a lot of deployment solutions have not kept up. So developers often struggle with bringing their desired model to the chip that they actually want to support. And then, on the other hand, from all the amazing innovations in the hardware space, we see more heterogeneous and diverse platforms entering the market. It was already a fragmented situation, and now we have even more diverse hardware setups, and this calls for a more streamlined and convenient solution to bring your AI models to the different platforms that you actually want to use, and this is what we are doing. We are developing a software development kit that sits between the AI model from the end customer and the AI chip of all the different vendors that you could think of, with, at its heart, a compiler technology able to flexibly bring any AI model onto these chips. We always look at efficiency as a second value that we are contributing, so we are really focusing on reducing the latency to run inference on these devices, which in turn also minimizes the power consumption, and we are also minimizing the memory footprint of the executed models. And this is all integrated into a relatively easy-to-use framework that meets developers where they are, within PyTorch or within TensorFlow, so everything is integrated into a Python library at the end.
Speaker 1: Specifically, what we are doing on the model side is supporting basically any relevant framework, so TensorFlow, PyTorch, ONNX and JAX. Model-wise, and this is also what I'm going to talk more about today, we are really focused on supporting any kind of architecture that you would find. So among the models we support you will find conventional CNN models and RNNs, but also the latest and greatest LLMs and transformer-based architectures. We then read this into Attic. Attic is our AI application layer that takes the AI model and rewrites it so that you get a common abstraction for different target architectures, and this is basically the core DNA of our retargetable AI compiler technology. We do support CPU architectures, we do support GPUs, and we are also developing specific back-ends, or integrations of back-ends, for NPUs; that work is under development at the moment. Today I'm really going to talk about our CPU deployment toolchain, because that's the most mature piece of technology we can offer at this point in time.
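To make that pipeline concrete, here is a minimal sketch of what a compile-based deployment flow can look like from Python. The `torch.export` capture step is standard PyTorch 2.x API; the `roofline_sdk.compile` call and the "arm64" target string are hypothetical placeholders, since the talk does not show the actual SDK interface.

```python
# Hedged sketch: torch.export is real PyTorch 2.x API; the compile/run calls in
# steps 2-3 are hypothetical placeholders for a vendor SDK, not an actual API.
import torch
import torch.nn as nn
from torch.export import export

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv(x))).flatten(1)
        return self.fc(x)

model = TinyCNN().eval()
example = torch.randn(1, 3, 224, 224)

# 1. Capture the model as a framework-independent graph ("any framework in").
exported = export(model, (example,))

# 2. A retargetable compiler would lower this graph to a common abstraction
#    (Attic in the talk) and emit host code plus target kernels. Placeholder:
# artifact = roofline_sdk.compile(exported, target="arm64")

# 3. The resulting artifact would then run like any other callable on the device:
# output = artifact(example)
```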
Speaker 1: So, if you're using our technology, what does it look like? Basically, as I said, we are a compiler technology. At its heart is a piece of compiler software that you're using, and it generates two artifacts for you as a user. On the one hand, it generates host code. Host code refers to the code that triggers the execution inside your device, so basically everything that is not really related to the actual calculation of the AI model but to getting data in and out, invoking kernel calls and things like that, and this is all executed through a runtime, a small piece of C code that can run on basically any CPU inside conventional SoCs.
Speaker 1: And then the second part that we are generating is the target code. The target code, to describe it in a very abstract way, is just generated kernels, but you can think about the target code targeting CPUs. If you have a multi-core ARM system, for instance, we are generating kernels for this multi-core setup. But you can also generate target code for GPUs. We have a backend for SPIR-V, for instance, for a lot of embedded GPUs such as Broadcom's, and we also have a backend available for NVIDIA Jetson boards. And then, as I said, we are also developing specific kernel generation for NPUs with our customers at the moment.
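To illustrate the split between host code and generated kernels, here is a hedged sketch of how a Python host might call into a compiled shared object through ctypes. The library name `libmodel.so`, the symbol `run_inference`, and its signature are hypothetical, chosen only to show the shape of the interaction; they are not Roofline artifacts.

```python
# Hedged illustration of the host-code idea: ctypes and numpy are real libraries;
# "libmodel.so" and "run_inference" are hypothetical names, not Roofline artifacts.
import ctypes
import numpy as np

# Load a compiled model artifact (host code + kernels) as an ordinary shared library.
lib = ctypes.CDLL("./libmodel.so")

# Assumed C signature: void run_inference(const float* in, float* out);
lib.run_inference.argtypes = [ctypes.POINTER(ctypes.c_float),
                              ctypes.POINTER(ctypes.c_float)]
lib.run_inference.restype = None

# The host side only prepares buffers, moves data, and triggers kernel execution.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
y = np.zeros((1, 10), dtype=np.float32)

lib.run_inference(x.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
                  y.ctypes.data_as(ctypes.POINTER(ctypes.c_float)))
```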
Speaker 1: And yeah, as I said, today is really about this coverage story, so enabling users to bring any model that they want to this kind of compiler technology. Our vision is that at some point the edge AI space should be where the cloud space is today: you develop your new model, you download it from Hugging Face, and it will just run on your target device, and this is the core vision that we are contributing to. What you can see here is basically a collection of top 10 models from Hugging Face, as well as specific models like Llama, SmolLM or Qwen, that are supported by Torch Inductor, which is the go-to compilation solution from PyTorch, but are also fully supported by Roofline at any point in time, and this was always day zero support. Our core philosophy is: if a model doesn't work, that's just a bug in our software and we should be able to fix it within 24 hours, and so far we have been really successful in keeping this promise towards our customers. All this coverage, I think, is a nice promise, but it is only valuable if it also comes with a significant performance promise, so really something where you are either on par with existing solutions or even outperforming them. To demonstrate this, we are here comparing again against Torch Inductor, both in terms of latency and memory footprint, the two core metrics that we are optimizing for. These are the same models that you just saw on the previous slide, now compared on performance, and what you can see is that for nearly all models we are around 1 to 3x faster in execution speed for a single inference run, and we have a significantly reduced memory footprint on these systems. This is run on an x86 system, but we have basically the same numbers for ARM as well, so a very similar order of magnitude that you can achieve on ARM. So what we are showing here is that we are able to bring the latest AI models onto your system, and we are also really providing very good performance for these systems.
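For context, the Torch Inductor baseline referenced above is the default torch.compile path in PyTorch 2.x. The timing sketch below uses an arbitrary toy model and loop count purely to show how a single-inference latency measurement is typically set up; it is not the benchmark configuration from the talk.

```python
# Torch Inductor baseline sketch: torch.compile with its default "inductor" backend
# is real PyTorch 2.x API; the toy model and timing loop are illustrative only.
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
x = torch.randn(1, 512)

compiled = torch.compile(model)  # Inductor is the default backend

with torch.no_grad():
    compiled(x)                  # warm-up run triggers compilation
    start = time.perf_counter()
    for _ in range(100):
        compiled(x)
    latency_ms = (time.perf_counter() - start) / 100 * 1e3

print(f"mean latency per inference: {latency_ms:.3f} ms")
```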
Speaker 1: And to further drive this point home, I wanted to dive today into one specific piece of code that we are currently working on, and that is enabling LLMs on edge devices. Most of you might have already heard of llama.cpp, I assume. It's currently the go-to solution to run quantized LLMs on edge AI devices. What llama.cpp does is basically this: you take your model, you export it into a specific data format after you have applied quantization, and then you rebuild your model with the llama.cpp library and just run it. So you are running a library of kernels, basically a library of performance-optimized execution blocks, and this is, in general, very fast and very performant, and I think this is why it is the go-to solution. But on the other hand, if you think about coverage again, if something changes in your model, or if you want to onboard a new piece of hardware, it's a very tedious process, because you have to implement code for the specific layer that you want on the specific hardware and then optimize it. One example of how long this takes: if you look at OpenELM, it took months for the community to support this model. You can take a look at the GitHub thread and you will see that there is just an amazingly long discussion on how to really implement this. So this is not a really scalable solution looking forward.
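For reference, the library-based workflow described here is commonly driven through the llama-cpp-python bindings; the sketch below assumes that package is installed and that a model has already been converted and quantized into llama.cpp's GGUF format offline (the file path is a placeholder).

```python
# Hedged sketch of the library-based flow via the llama-cpp-python bindings
# (assumes `pip install llama-cpp-python`); the GGUF path is a placeholder for a
# model that was converted and quantized ahead of time.
from llama_cpp import Llama

# The runtime executes llama.cpp's hand-optimized kernels for each layer type.
llm = Llama(model_path="./model-q4_0.gguf", n_ctx=2048)

result = llm("Explain edge AI in one sentence.", max_tokens=64)
print(result["choices"][0]["text"])
```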
Speaker 1: And so what we are doing is basically trying to replace this kind of library approach with our compiled solution. With us, you also download the model, you export it from Hugging Face, you match the quantized layers, so it's a very similar approach to before. But then you just compile the model and you get your target executable for either x86 or ARM. You take this target executable and you can run it like any piece of software on the target device that you have. This is much more flexible for most models and allows you to support new kinds of layers much faster as they come into play.
Speaker 1: And yeah, the only risk here is that it obviously might be slower than handwritten kernels, because it is not as heavily hand-optimized as manual kernels would be, since we are actually generating the kernels. To show how far we are with this performance work: these are numbers from two weeks ago, from the development cycle, and here we are comparing ourselves against llama.cpp executing DeepSeek. As I said, we had day zero support for DeepSeek, so the day it came out we were directly able to run it. If you compare against this library on x86, what you will see is that we are slower. So, point taken, it is not yet as fast as the manual kernels that you can write, but we are only 11 and 15 percent slower in terms of prefill and decode throughput. And then again, comparing against Torch Inductor, we have a significant improvement in terms of peak memory footprint. So for x86, we are already nearly where llama.cpp is. And then, as we are in the edge space, you probably also want to know what about ARM, and for ARM we are not that far yet. For prefill we basically match the speed, and decode is what my engineers are currently working hard on; this is the performance gap that we are trying to close. But, as you saw in the x86 numbers, we can get really close in terms of performance, and we are very sure that in the next few weeks we will be able to close this gap as well, again showing a very nice improvement in terms of memory footprint.
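For readers unfamiliar with the two metrics, prefill throughput covers processing the prompt up to the first generated token, while decode throughput covers the tokens generated after that. The sketch below separates the two by timing the first token against the rest; `stream_tokens` is a hypothetical callable that yields tokens one at a time, and this is not the measurement harness behind the numbers above.

```python
# Illustrative prefill/decode throughput split; `stream_tokens` is a hypothetical
# generator-returning callable, not part of any specific runtime.
import time

def measure_throughput(stream_tokens, prompt_tokens: int):
    """Return (prefill tokens/s, decode tokens/s) for one generation run."""
    start = time.perf_counter()
    first_token_time = None
    generated = 0
    for _ in stream_tokens():
        generated += 1
        if first_token_time is None:
            first_token_time = time.perf_counter()
    end = time.perf_counter()

    if first_token_time is None:          # nothing was generated
        return 0.0, 0.0
    prefill_tps = prompt_tokens / (first_token_time - start)
    decode_tps = (generated - 1) / (end - first_token_time) if generated > 1 else 0.0
    return prefill_tps, decode_tps
```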
Speaker 1: So, to wrap it all up, what do we deliver with Roofline? We have a software development kit; at its heart is an AI compiler technology. It's really targeted towards ML developers. Everything is integrated with Python, so you can just run it as you would a conventional PyTorch library. We have a runtime specifically for the edge space, we support any quantization within your models, we can do debugging, tracing and benchmarking of different systems, and all of this for different kinds of target platforms. And with that, I'm open to your questions. Thank you for your kind attention.