EDGE AI POD

Unveiling the Technological Breakthroughs of ExecuTorch with Meta's Chen Lai

EDGE AI FOUNDATION

Unlock the secrets to deploying machine learning models on edge devices with Chen Lai from the PyTorch Edge team at Meta. Discover how ExecuTorch, a brainchild of the PyTorch Edge team, is transforming edge deployment by addressing challenges like memory constraints and hardware diversity. Get an insider's view on the technical collaborations with tech giants like Apple, Arm, Qualcomm, and MediaTek, which are revolutionizing the deployment of advanced language models like Llama on platforms such as iOS and Android. With Chen's expert insights, explore the fascinating process of converting PyTorch models into executable programs optimized for performance, stability, and broad hardware compatibility, ensuring seamless integration from server to edge environments.

Immerse yourself in the world of ExecuTorch within the PyTorch ecosystem, where deploying machine learning models becomes effortless even without extensive hardware knowledge. Learn how key components like torch.export and torchao capture compute graphs and support quantization, elevating edge deployment capabilities. Discover how torchchat facilitates large language model inference on various devices, ensuring compatibility with popular models from Hugging Face. As we wrap up, hear about the community impact of Meta's ExecuTorch initiative, showcasing a commitment to innovation and collaboration. Chen shares his passion and dedication to advancing edge computing, leaving a lasting impression on listeners eager for the next wave of technological breakthroughs.

Learn more about the EDGE AI FOUNDATION - edgeaifoundation.org

Speaker 1:

I kindly ask Chen Lai to show up. Hi, nice to meet you. Chen is going to speak about unified and fast deployment of GenAI on edge with ExecuTorch. The floor is yours, Chen.

Speaker 2:

Great, thank you. Hi everyone, my name is Chen and I'm from the PyTorch Edge team at Meta. My talk today is about fast and unified deployment of GenAI on edge with ExecuTorch. As we know, PyTorch has become the unified framework for working with ML models. Most researchers start from PyTorch, train with PyTorch, and deploy with PyTorch on the server side. However, deploying models on edge is just more difficult because of challenges like constrained memory, limited compute, and the heterogeneity of hardware devices. This often leads to fragmented, ad hoc solutions. So can we apply a unified edge framework to generative AI systems while remaining performant? ExecuTorch is our answer, and I will introduce it shortly. Let's take a step back and see why people are interested in running ML models on edge. There has been an emergence of running models on devices, including mobile, biosensors, IoT, etc., because of opportunities like low latency and privacy. However, there are definitely challenges. Edge devices usually have lower battery life, less memory, and most of them also have thermal requirements. The diversity of the hardware is also much broader than in the server space. Here I would like to introduce ExecuTorch. It's a fast and unified solution from PyTorch to run PyTorch models on edge devices. We aim to support iOS, Android, and embedded devices. It's fully open source, and we are working with Apple, Arm, Qualcomm, and MediaTek to support their hardware and achieve the best performance. ExecuTorch starts from a PyTorch model, runs it through torch.export to get a graph, and after compilation and transformation it generates the ExecuTorch program. On edge devices, the runtime loads the program and executes the instructions. That is pretty much the ExecuTorch workflow. Here is an example running Llama 3.1 8B on Android devices, and the QR code can be used to try it out. We also support multimodality; we support running LLaVA 1.5 7B on Android. For some reason the GIF image is not playing, but the same demo is behind the QR code.
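
[Editor's note] To make the workflow Chen describes concrete, here is a minimal sketch of the ahead-of-time flow: export a PyTorch model, lower it, and serialize an ExecuTorch program. It is based on the public getting-started examples; module paths and API details may differ between ExecuTorch releases, so treat it as illustrative.

```python
import torch
from torch.export import export
from executorch.exir import to_edge


class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 4)

    def forward(self, x):
        return torch.relu(self.linear(x))


model = TinyModel().eval()
example_inputs = (torch.randn(1, 16),)

# 1. Capture the compute graph with torch.export.
exported_program = export(model, example_inputs)

# 2. Compile/transform to the edge dialect, then to an ExecuTorch program.
edge_program = to_edge(exported_program)
executorch_program = edge_program.to_executorch()

# 3. Serialize the program; the on-device runtime loads this .pte file.
with open("tiny_model.pte", "wb") as f:
    f.write(executorch_program.buffer)
```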

Speaker 2:

Why do we want to use ExecuTorch? PyTorch already empowers AI researchers today, but the path to deploy a model on edge is really fragmented and complicated. We would like to support it in a principled way. For example, we want to leverage hardware accelerators for optimal performance, and users can follow a similar workflow to target different devices. Our runtime can target most devices, and we don't need conversion anymore. Here is the ExecuTorch timeline: we went through preview in October 2023, we announced alpha in April this year, and we had our beta release last month at the PyTorch Conference. The official beta release is this week, and it will include the performant Llama examples.

Speaker 2:

In 2025, we will add richer ecosystem integration and robustness. For beta, what is new? We aim to target stability, performance, and coverage. On stability, we target API stability and also backward and forward compatibility. On the performance side, we focus on optimal GenAI performance, for example Llama 3 8B and LLaVA 1.5 7B. Of course, we also support the latest Llama 3.2 1B and 3B models. On the coverage side, we enable more kernels, and we focus on setting up the benchmark infra as well as the on-device debugging and profiling infra. For the stable APIs, we set up a lifecycle from experimental to deprecation, so that when users integrate against the latest code, it won't break their existing integration code. That's one of the most important parts of our beta release. Next, I'm going to walk through the on-device GenAI use cases.

Speaker 2:

CPU plays a crucial role, especially when we want to enable a wide variety of PyTorch models on ExecuTorch and experiment with novel research ideas like LLM quantization. Right now we can run a family of Llama models and a wide range of other LLMs out of the box for GenAI use cases. Our primary focus for CPU performance optimization has been Arm CPUs. We developed optimized GenAI kernels for Arm to achieve peak CPU tokens per second on a variety of edge devices, like Android and iOS phones and also the Raspberry Pi 5. Collaborating with Google, the new 4-bit GEMM kernels are upstreamed to XNNPACK, and working together with Arm, we also feature faster CPU prefill performance through KleidiAI, which now has similar 4-bit GEMM kernels. CPU is our baseline.

Speaker 2:

In addition to CPU, a critical puzzle piece of on-device AI is the neural processing unit, also known as the NPU. It has more TOPS, which means faster prefill, and it is also more power efficient, which is crucial for battery life. Most software frameworks may not have the capability to efficiently enable NPUs, or the stack can be fragmented. So we created a delegate API in ExecuTorch. The NPU can be accessed by delegating part of, or the full, PyTorch program to the corresponding backend. We collaborated with Apple, MediaTek, Qualcomm, etc. to integrate ExecuTorch into their development flows. Note that although the NPU's power is leveraged, we don't move away from PyTorch and the ExecuTorch ecosystem. This keeps consistency when compiling the PyTorch program from server to device and helps users profile and debug on device.
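
[Editor's note] As an illustration of the delegate API mentioned above, here is a sketch that lowers the supported parts of a program to a backend via a partitioner. It uses the XNNPACK (CPU) partitioner as a stand-in because its import path is the most widely documented; the NPU backends (Core ML, Qualcomm, MediaTek) follow the same pattern with their own partitioner classes. Module paths may vary across releases.

```python
import torch
from torch.export import export
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU()).eval()
exported = export(model, (torch.randn(1, 16),))

edge = to_edge(exported)
# Delegate the supported subgraphs to the backend; unsupported ops stay on
# the portable CPU kernels.
edge = edge.to_backend(XnnpackPartitioner())

executorch_program = edge.to_executorch()
with open("delegated_model.pte", "wb") as f:
    f.write(executorch_program.buffer)
```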

Speaker 2:

As I mentioned, Llama 2, 3, and 3.1 and other LLMs can run on phones through these accelerators. We also support multimodal models. Here is an example running LLaVA models with both image and text inputs. The image encoder has 307 million parameters with 8-bit quantization, and the text model has 7 billion parameters with 4-bit quantization. We provide the ahead-of-time export and runner components with CPU optimization, and the QR code links to the instructions to try it out. We are also going to support other modalities like audio and video. To support the use cases above, a number of techniques are used; here we list some of them, like 4-bit and 8-bit quantization for Llama models, and to further improve post-training quantization accuracy, we also integrated the new SpinQuant technique. We are also actively working on other new techniques like long context length, lower-bit quantization, and parallel decoding.
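
[Editor's note] As a rough illustration of the weight-only schemes mentioned here, the sketch below applies 8-bit weight-only post-training quantization with torchao. The names quantize_ and int8_weight_only reflect torchao's public API as I understand it and may change between versions; SpinQuant and the dedicated 4-bit Llama flows are separate paths.

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
).eval()

# Replace the weights of every nn.Linear with int8 weight-only quantized
# versions; activations stay in floating point.
quantize_(model, int8_weight_only())

with torch.no_grad():
    out = model(torch.randn(1, 512))
print(out.shape)
```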

Speaker 2:

Here is a short summary of the on-device LLM status. On CPU, we have the optimized kernels. On accelerators, we work with our partners to lower LLMs to run on the backends, including Core ML, Metal (MPS), Qualcomm, and MediaTek. Right now they are all enabled, and the LLMs can fully run on these accelerators.

Speaker 2:

Let's take a closer look at what is needed, the fundamental features needed to enable LLMs on edge. First, quantization: for a large language model the weights are large, so the first optimization we always do is quantization to compress the model size. Secondly, we need to optimize memory. The bandwidth of different memory regions is really different, and we need to plan them carefully, especially when we are memory-bound. Third, decoding acceleration: users want to get the result right away when they type the prompt. And fourth, fine-tuning: we would like to fine-tune different LoRA adapters for different tasks like summarization and writing emails.
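
[Editor's note] To make the memory-bound point concrete, here is a back-of-envelope sketch of KV cache size, assuming a Llama-3-8B-style configuration (32 layers, grouped-query attention with 8 KV heads, head dimension 128). The numbers are illustrative, not ExecuTorch measurements.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_elem):
    # K and V are cached per layer, per KV head, per position.
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem

ctx = 2048
fp16_cache = kv_cache_bytes(32, 8, 128, ctx, 2)   # 16-bit cache
int8_cache = kv_cache_bytes(32, 8, 128, ctx, 1)   # quantized 8-bit cache

print(f"fp16 KV cache @ {ctx} tokens: {fp16_cache / 2**20:.0f} MiB")   # ~256 MiB
print(f"int8 KV cache @ {ctx} tokens: {int8_cache / 2**20:.0f} MiB")   # ~128 MiB
# On a phone, shrinking the cache (and the weights) directly eases the
# memory-bound decode phase described above.
```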

Speaker 2:

Dev tools: it is a pain to debug accuracy and to profile on device, and that is actually an advantage we provide, with native tools. Here's an example: say there's a prompt asking how the weather is, and the LLM starts hallucinating. That is really hard to debug, especially on device. With our partners, we can help you debug even if it's inside a black-box accelerator. More concretely, what are the exact techniques here? For quantization, there are 4-bit linear kernels and KV cache quantization. For memory, one example is keeping the KV cache state in half precision, and we can also apply further cache optimizations. For decoding acceleration, we can do batched prefill and we can also do speculative decoding.
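
[Editor's note] Since speculative decoding comes up here, the sketch below shows the core idea in plain Python: a cheap draft model proposes a few tokens and the large target model verifies them. draft_next and target_next are placeholder callables (token sequence to next-token id), not ExecuTorch APIs, and the verification loop is written per token for clarity; the real speedup comes from batching the target calls into one forward pass.

```python
def speculative_step(tokens, draft_next, target_next, k=4):
    # 1. Draft: propose k candidate tokens with the cheap model.
    draft = list(tokens)
    proposals = []
    for _ in range(k):
        t = draft_next(draft)
        proposals.append(t)
        draft.append(t)

    # 2. Verify: keep proposals while the target model agrees (greedy).
    accepted = []
    for proposed in proposals:
        expected = target_next(list(tokens) + accepted)
        if expected == proposed:
            accepted.append(proposed)
        else:
            accepted.append(expected)   # take the target's token and stop
            break
    else:
        # All proposals accepted: the target's next token comes for free.
        accepted.append(target_next(list(tokens) + accepted))

    return list(tokens) + accepted


# Toy usage with trivial stand-in "models" that just count tokens.
draft = lambda seq: len(seq) % 5
target = lambda seq: len(seq) % 5
print(speculative_step([1, 2, 3], draft, target))
```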

Speaker 2:

For fine-tuning, as I said, LoRA is quite common, and the dev tools usually come with a profiler and also accuracy analysis. There are two more updates. In addition to GenAI support, we also enabled an OSS reproducible performance benchmarking infra, either for on-demand benchmark requests or scheduled benchmarking to monitor performance trends. These go through GitHub Actions, and then the benchmarks run on AWS Device Farm to generate the performance metrics. We are also working on experimental features for on-device training and fine-tuning. By design it's composable with the existing ExecuTorch flows and backends. On the right-hand side we show the loss converging when fine-tuning Phi-3 with LoRA. Since it's a proof of concept, there is still lots of ongoing work and features planned.
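
[Editor's note] For readers unfamiliar with the LoRA adapters mentioned above, here is a generic sketch of the idea: a frozen base linear layer plus a small trainable low-rank update, so only the rank-r adapter needs to be trained or swapped per task. This is plain PyTorch for illustration, not the torchtune or ExecuTorch training API.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the base weights
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(64, 64))
loss = layer(torch.randn(2, 64)).sum()
loss.backward()                              # only lora_a / lora_b get gradients
```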

Speaker 2:

Check out the ExecuTorch ecosystem and the community adoption. ExecuTorch does not work alone. It starts from torch.export, where the compute graph is captured with dynamism. Quantization is supported by torchao, which hosts the lower-bit quantization kernels and the quantization APIs. We built the LLM vertical stack with torchchat to showcase LLM inference on mobile devices; torchchat is a PyTorch library showing how to run LLMs across laptop, desktop, and mobile. For model sources, the models can be fine-tuned models from torchtune, with PyTorch-native support. We also partner with Hugging Face to make models ExecuTorch-compatible, like Gemma 2B and Phi-3 Mini. The ExecuTorch ecosystem is built with all these parts working together.
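
[Editor's note] On the point that torch.export captures the graph "with dynamism", here is a small sketch marking the sequence dimension as dynamic so a single exported graph serves variable-length inputs. It uses the public torch.export API; the bounds chosen here are arbitrary.

```python
import torch
from torch.export import export, Dim


class Encoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(32, 32)

    def forward(self, x):                  # x: [batch, seq_len, 32]
        return self.proj(x).mean(dim=1)


seq = Dim("seq_len", min=2, max=512)       # arbitrary illustrative bounds
exported = export(
    Encoder(),
    (torch.randn(1, 8, 32),),
    dynamic_shapes={"x": {1: seq}},        # dim 1 (sequence length) is dynamic
)

# The same exported graph now accepts a different sequence length.
out = exported.module()(torch.randn(1, 100, 32))
print(out.shape)                           # torch.Size([1, 32])
```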

Speaker 2:

Even though ExecuTorch just came to beta, we have seen active adoption. It has been adopted by Meta products: ExecuTorch supports the Ray-Ban Meta smart glasses, Oculus Quest 3, and Meta apps like Instagram and WhatsApp. It has been used by developers in the community; examples include an inpainting mobile app based on EfficientSAM and a Swin Transformer that uses the Core ML delegate for acceleration, and ExecuTorch is used in inventory mobile apps like Diksha, running CV and OCR models for medical devices. We see researchers adopting ExecuTorch as well; examples include MobileLLM, AQLM enablement, Windows enablement, etc. Coming back to the first slide, we provide ExecuTorch as a PyTorch platform to help deploy LLMs and many other models to edge devices. This week it reaches formal beta status. You're welcome to visit our GitHub repo and try it out. That is pretty much my talk, and I'm happy to take questions if there are any.

Speaker 1:

Thanks a lot, Chen. It was really a great talk, rich with information. There are questions from the community; let me go through some of them. Chancellaker is asking, for example, if there are other dependencies for ExecuTorch model inference.

Speaker 2:

It depends. ExecuTorch itself targets portability, meaning the ExecuTorch runtime can run pretty much everywhere. When we talk about running an ExecuTorch model for inference, usually the user will have some hardware in mind. If the user wants to target iOS, that totally works; if the user wants to target Android, that will work; and if the user wants to target a Raspberry Pi, that will work as well. The ExecuTorch runtime doesn't really have extra dependencies; it's designed to be portable and is supposed to run anywhere.

Speaker 1:

Okay, and let me complement this question with a comment. How easy or hard would you consider interfacing ExecuTorch to custom-made NPUs? Is there a fair amount of work to be done? Maybe, sorry to interrupt you, some of these NPUs in the future will also feature ad hoc accelerators, optimized memories, maybe in-memory computing. Any comment in this direction?

Speaker 2:

That is a very good question, and it's also part of the topic here. The goal for ExecuTorch is to make it easy to use, and it aims to be both unified and fast. I agree that, for example, an NPU can have a dedicated path to make it really specialized and performant. However, if I'm just an ML researcher, I may not know that much about this specific NPU, I would need a lot of background knowledge to deploy my models on it, and I may need to follow a different set of instructions. So there is a lot of burden there, and the path can be really fragmented.

Speaker 2:

For ExecuTorch, our goal is a unified way. If I'm a researcher, the lines of code to target, say, Apple's NPU versus Qualcomm's NPU or MediaTek's NPU are pretty much the same; the flow is the same. So the user doesn't need to look into how a specific NPU works. That, I think, is the advantage of ExecuTorch. We hope the user doesn't really need to know all the details of the NPU; they can just run it and it will just work.

Speaker 1:

Thanks, Chen. So it means that it's abstracted enough that I don't need to be a super expert in the underlying hardware, because that would require a big investment in knowledge.

Speaker 2:

Yeah, yeah, that's the expectation, and with the default path we also aim to have the best performance. As I said, we want to provide users a unified way; the flow is unified, but it's also performant.

Speaker 1:

Thanks a lot. Let me go through some other questions, for example from Davis. You spoke about 4 bits, but Davis is asking if there is, or will be, support for FP16 or other formats, you know, mixed precision. Mixed precision can be a great configuration for generative models at the edge.

Speaker 2:

That is an excellent question. Yes, ExecuTorch supports FP16, and for the hardware, for the NPUs, as long as they support FP16 it will just work out of the box. Regarding other lower bit widths or mixed precision, yes, they are supported as well. We are bringing up a list of lower-bit kernels, including 2-bit, 3-bit, all the way up to 6-bit; I'm not sure about 7-bit, but we support a list of lower-bit kernels that can be used. For mixed precision, that's part of our quantization framework: we support custom annotation, so essentially a user can decide how to quantize their inputs, activations, and weights, et cetera. I understand mixed precision is very important, especially for LLMs, and probably for other models as well. In short, it's supported.
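
[Editor's note] As a generic illustration of the mixed-precision idea discussed here (not the ExecuTorch quantization-annotation API Chen describes), the sketch below keeps the numerically sensitive LayerNorm in fp32 while running the linear layers in fp16.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(256, 256),
    nn.LayerNorm(256),
    nn.Linear(256, 64),
)

for module in model.modules():
    if isinstance(module, nn.Linear):
        module.half()                      # move weights and bias to fp16

def forward_mixed(model, x):
    for layer in model:
        # Match the activation dtype to each layer's parameter dtype.
        dtype = next(layer.parameters()).dtype
        x = layer(x.to(dtype))
    return x

out = forward_mixed(model, torch.randn(1, 256))
print(out.dtype)                           # torch.float16 from the last linear
```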

Speaker 1:

Thank you. Staying a bit on the quantization topic, there is a question from Chanceok about QAT, quantization-aware training: what kind of support do you have?

Speaker 2:

Yeah, it's supported. Our default backend is the CPU backend and it supports QAT, and pretty much every backend has its own quantizer. So currently the CPU backend, and ExecuTorch itself, supports QAT, and whether a given backend can support QAT depends on that backend's implementation. I think Qualcomm is also investigating QAT; it should be available very shortly.

Speaker 1:

Thank you. Thank you so much. You know, there are also other execution frameworks out there. For example, Javen is asking about TVM. Any comparison? I would generalize also to any other interpreter that you may have already considered.

Speaker 2:

I see. Yeah, that's also a good question. I think TVM probably takes a different approach. I'm not quite sure what the latest status is for TVM, but I can tell you one advantage of ExecuTorch compared to other frameworks: it leverages the latest torch.export. There are mostly two benefits. One benefit is that it carries over all the information from the PyTorch source code all the way to the on-device runtime.

Speaker 2:

Meaning, say, for example, I have a PyTorch model and I author it on the server, and then somehow it errors out inside an NPU. Via another framework, it could be tricky to figure out where it went wrong. But via ExecuTorch, we have native support and we can help you find out, for example, why it's slow and also which line might be wrong; we can point to the exact PyTorch line of code in the source. Essentially, ease of use is probably one of the things that comes to mind. But both TVM and ExecuTorch have their own advantages, and I probably don't have enough knowledge for a full comparison.

Speaker 1:

That's fine. And another question, maybe it's obvious for you: Anise is asking if Meta Llama has ExecuTorch support now.

Speaker 2:

Yeah, for the smaller versions of Llama, in ExecuTorch we support Llama 2 7B, Llama 3 8B, Llama 3.1 8B, and the Llama 3.2 1B and 3B models. So pretty much all the models that can fit into the RAM, we can run.

Speaker 1:

And another question, from Nick: what are the memory footprint requirements for Llama 3 8B? Let me also add, in general, any comment you may share about the capabilities of ExecuTorch to deal with the memory system of the target in order to optimize its usage.

Speaker 2:

Wait, so I guess the question asked here is about the memory footprint requirements for Llama 3 8B, and what was your second part again?

Speaker 1:

Yes, and after you answer this question, maybe you can comment on ExecuTorch's ability to optimize the memory usage of the target. You know, embedded systems have a lot of constraints in terms of embedded memory.

Speaker 2:

I see. Yeah, that's actually a very good question. For the footprint, I think we actually have it in the GIF. I don't have the exact number, but what I can say is that the model size in total, the memory for the weights, is around 4 gigabytes or less. And because we can run it on iOS, where the app will be killed if it goes over 6 gigabytes, I believe the total is less than 6 gigabytes. We did some profiling of the memory footprint before; I just don't quite remember the exact numbers.

Speaker 2:

I can also briefly mention how memory planning works here. For the memory footprint, we actually have the tooling to figure out how much memory is needed as soon as we export a model, because at that point we already have the model weights, which are constant memory, and that's less than 4 gigabytes. Regarding how memory works in ExecuTorch: we pre-allocate a chunk of memory and then plan it carefully so that we can reuse memory, for example when a tensor is no longer needed. So we have the tools to find out how much memory is needed, and based on that we can continue optimizing.
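
[Editor's note] As a quick sanity check on the "weights are constant memory, around 4 gigabytes" point, here is an illustrative back-of-envelope calculation for an 8-billion-parameter model with 4-bit weights. Quantization scales and runtime buffers add overhead on top; these are not measured ExecuTorch numbers.

```python
params = 8_000_000_000          # roughly Llama-3-8B-sized (illustrative)
bits_per_weight = 4             # 4-bit weight-only quantization

weight_bytes = params * bits_per_weight / 8
print(f"~{weight_bytes / 1e9:.1f} GB of constant weight memory")  # ~4.0 GB
# Per-group quantization scales and runtime buffers (activations, KV cache)
# come on top of this, which is why careful memory planning still matters.
```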

Speaker 1:

Thank you. Thank you so much, Chen. I encourage you to share more numbers later on, if you have time, in the comments section, because people are really interested. I think I can also say that the ExecuTorch investment and its impact on the community are really interesting; it's incredible how much Meta is investing in the edge. That's fantastic. Thanks a lot for your time, Chen, for having accepted to stay with us and present ExecuTorch, and for the passion in replying to the many comments, and there will be many more in the future, I'm sure. So thanks a lot for your time and a great talk.

Speaker 2:

Yeah, I will continue to answer the questions in the comments. Thank you.

Speaker 1:

Sure. Thank you, Chen.