EDGE AI POD

Aptos: Creating ML models that fit your edge device like a glove

EDGE AI FOUNDATION


Shipping edge AI shouldn’t feel like a marathon through model zoos, missing ops, and latency ceilings. We lay out a practical path to get from your data and constraints to a hardware-ready model—measured on real boards—without the endless back-and-forth between data science and firmware teams. If you’ve wrestled with quantization loss, unsupported kernels, or picking the “right” NPU, this walkthrough will feel like oxygen.

We start by naming the pain: quick demos that collapse under real device limits, foundation models that fail after export, and feedback loops that burn months. From there, we unpack Aptos, our automation engine that turns edge AI into a data in, model out process. The system explores parameterized architecture recipes and neural architecture search, trains promising candidates, and deploys them to a hardware farm packed with evaluation kits. Every candidate returns hard numbers—latency, per-layer timing, memory, on-device accuracy, and power—so tradeoffs are grounded in measurements, not wishful thinking.

What makes it fast is the learning layer. As Aptos accumulates results, meta models predict runtime, memory fit, and stable hyperparameter ranges before committing compute. That means less time wasted on dead ends and more time converging on models that satisfy your KPIs, whether you care about sub-5 ms inference on an i.MX 8 Plus, battery life in the field, or non-square inputs that match your camera feed. We also fold in research-backed techniques—pruning, quantization, distillation—so you benefit from the latest without chasing papers.

If your team is eyeing a chip migration or evaluating new NPUs, a dropdown swap in Aptos triggers a fresh search tuned to the new hardware, minimizing lock-in and keeping options open. The result is timeline compression: where projects used to take 12–18 months with large teams, we aim to surface strong, deployable candidates in one to two weeks. Subscribe for more deep dives into edge AI deployment, share this episode with your team, and leave a review telling us which device you want to target next.


Learn more about the EDGE AI FOUNDATION - edgeaifoundation.org

Setting The Edge AI Problem

SPEAKER_00

Hi, thanks. Today I'd like to talk about the tool we're developing at Eta Compute. Just one word about the company: we are a startup based in Sunnyvale, California, and we are in the business of building AI models.

I think it goes without saying that it's challenging to develop and deploy successful edge AI applications. The gaps are many. The gap between the embedded world and the AI world is big: they are two different animals that move at different paces and have different histories. One gap is certainly talent. There are not many people with competence in both worlds; probably most of the people who can call themselves edge AI experts are in this room today. There are not all that many of us who understand both the embedded and ML worlds well. Targeting an ML model onto an embedded device with resource constraints is challenging for a number of reasons I'll cover in this talk. Even choosing the silicon to build your application on is not trivial. The pace has accelerated a lot in the last few years; there is probably at least one chip release per month now with some NPU or accelerator. Keeping up with that, choosing a good chip for your project, or moving to a newer chip is challenging enough. All of this leads to long development cycles and a number of problems. So today I would identify three types of design patterns that are problematic.
The first: many people who want to build an edge AI application start with a model zoo, often from their silicon vendor. That's the path typically preferred by more embedded-oriented teams. Deployment is usually very easy and works great on hardware, so you're able to deploy a model within a day or so. But then you start to realize that this model doesn't exactly fit your application. Maybe it runs too slowly; maybe it runs fast but isn't accurate enough; maybe it just doesn't represent your use case, because the model was trained specifically to detect people in rooms, a visual-wake-words-style model, while you're actually looking for something that detects, say, animals. And then the problems start: how do I adjust, retrain, or fine-tune this model? Many of the models in model zoos are essentially binary, already compiled, and not really trainable. There is some activity in the community trying to address these issues, but as of today it's a difficult path.

Another path, typically preferred by data-science-oriented teams with less embedded experience, is to go straight for foundation models. Out of the box, they often get very good accuracy; they're happy, it works, and they say: now let's put the model on our device. And suddenly: oh, we need to export it, and maybe quantize it, because the model doesn't fit or the chip doesn't support non-quantized models. And when we quantize, we lose some accuracy.
Eventually we get through that process and put the model on the board, and then some ops are not supported, the inference time is really long, and it doesn't fit the application after all. This is where the problems start. The third path, probably somewhere in between, applies if you have balanced teams, or two teams working independently across the embedded and data science worlds: you say, let's just craft our own model for the specific application. So you build a model, try to deploy it, it doesn't work, you optimize, you come back, and you keep working in this very iterative mode. With this path the budget often gets overblown, timelines are not met, and the result is maybe not as good as you initially hoped for.

Whichever of those three paths you follow, it's very common to end up in some sort of trial-and-error feedback loop. If your organization is big enough, maybe you have two teams: a data science team on one side and an embedded team on the other. The first team proposes a model, selects candidates, and retrains; you try deployment; you realize some constraints are not met and deployment doesn't work well; you close the feedback loop and keep spinning back and forth. As a company, we went through this process ourselves, and it took more or less 12 to 18 months. Earlier today on the panel, Avjit, if I remember correctly, mentioned a project from his company where some 20 engineers worked for six months to get a model deployed on the board and working.
From my perspective, this is how long it takes in many cases to get something product-ready, and this is how most edge AI projects are developed today. I would say the majority never really pass the experimentation or proof-of-concept phase; they don't end up in a product, simply because of the time and money it takes to get them to work.

Typically, the inputs to this whole system are: your objectives, meaning some KPI or metric of success for your application as a product; your data, which you want to train your model on; and your constraints, for example, I have to stay under this many milliseconds of runtime, or I have to fit in this much memory because my application needs the rest. What you expect as output is, ideally, an optimized model that does the job.

At Eta Compute, we thought it's impossible to keep going through this loop; let's figure out how this process could be automated. We decided on a data-in, model-out approach, with no people running the loop and no teams gluing it together in between. So we built a black-box system, and this is what Aptos basically is: you provide your inputs, objectives, data, and constraints; the black box does the magic; and you eventually get your model, optimized for the hardware. What we aim for is getting this done in one to two weeks.
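As an illustration only (this is not Eta Compute's actual API; all field names here are hypothetical), the data-in, model-out contract described above could be sketched as a job specification that carries the objectives, data, and constraints, and can check whether a candidate model satisfies them:

```python
# Hypothetical sketch of a "data in, model out" job request.
# Field names and units are invented for illustration.
from dataclasses import dataclass

@dataclass
class EdgeJobSpec:
    objective: str          # e.g. "image_classification"
    dataset_path: str       # user-supplied training data
    target_chip: str        # e.g. "nxp-imx8plus"
    max_latency_ms: float   # hard runtime constraint
    max_flash_kb: int       # memory budget left for the model
    max_ram_kb: int

    def violated_constraints(self, latency_ms, flash_kb, ram_kb):
        """Return which constraints a measured candidate model breaks."""
        issues = []
        if latency_ms > self.max_latency_ms:
            issues.append("latency")
        if flash_kb > self.max_flash_kb:
            issues.append("flash")
        if ram_kb > self.max_ram_kb:
            issues.append("ram")
        return issues

spec = EdgeJobSpec("image_classification", "data/vww", "nxp-imx8plus",
                   max_latency_ms=5.0, max_flash_kb=512, max_ram_kb=256)
print(spec.violated_constraints(latency_ms=6.2, flash_kb=400, ram_kb=300))
# → ['latency', 'ram']
```

The point of the sketch is that the constraints are explicit machine-checkable inputs, not something discovered after deployment.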
On top of that, we give your team a few knobs to tune this black-box machinery. Someone more embedded-oriented can tune the embedded parameters of how the machinery should work; someone more data-science-oriented will work more on the data science side. Overall it's quite intelligent by itself, so you don't have to fiddle much to get good results.

So what's in the box? This is our general architecture. The main idea we follow is that we want to build AI that builds AI; I think that phrase has already been said today in one of the panel presentations. We have a system with two feedback loops. The first is the model builder, and it's independent of what any particular user wants: we keep building different architectures and generating different models, through a combination of neural architecture search and our proprietary, edge-friendly architecture recipes. When I say recipe, I don't mean simple fixed architectures: these architectures are heavily parameterized, and we can squeeze and modify how they behave when we instantiate them. So one part of the system runs this loop of building models. The second part selects some of those models from our pool, trains and optimizes them, and deploys them on the target. From the whole knowledge store of all the models that were generated, we select the ones we believe fit the user's request best.
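The "parameterized recipe" idea can be made concrete with a toy sketch (this is not the actual recipe format, just an illustration): one recipe spans a whole family of models, and the search engine instantiates concrete variants by choosing parameters such as width, depth, and input size. Note the non-square input, which comes up again later in the talk:

```python
# Toy illustration of a parameterized architecture "recipe": the same
# recipe function yields many concrete model descriptions depending on
# its parameters. Layer dict keys are invented for this sketch.
def conv_recipe(width_mult, depth, input_hw):
    """Instantiate a tiny CNN description from recipe parameters."""
    layers = []
    channels = max(8, int(16 * width_mult))
    h, w = input_hw
    for _ in range(depth):
        layers.append({"op": "conv3x3", "out_ch": channels,
                       "stride": 2, "out_hw": (h // 2, w // 2)})
        h, w, channels = h // 2, w // 2, channels * 2
    layers.append({"op": "global_pool_fc", "classes": 2})
    return layers

# Two instantiations of the same recipe, e.g. for a latency/accuracy sweep,
# with a non-square 96x128 input:
small = conv_recipe(width_mult=0.5, depth=3, input_hw=(96, 128))
large = conv_recipe(width_mult=1.0, depth=5, input_hw=(96, 128))
print(len(small), len(large))  # → 4 6
```

Squeezing a recipe, in this picture, just means sampling different parameter combinations and letting the downstream loop measure which instantiations hold up on hardware.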
This is where the user comes in, with a certain objective, their data, and their constraints. If, from our pool of models, we can pick the models that best meet those constraints and train them, that's how we get close to an optimal result. Whenever a customer runs a job on our platform, say, build me a model that detects people in pictures, or classifies images, we spin up this loop.

Everything we do eventually runs on the hardware. We have a hardware farm: actual server racks with development kits, evaluation kits from different chip manufacturers, installed in them. Every time we pick a concrete model, a proper instantiation of an architecture with weights, we validate it on this farm, and we get feedback on whether it ran and how well it ran. That feedback gets sent back to our core, and everything is stored in the knowledge store. The knowledge store is where we keep our architecture recipes, how well the models performed after being trained, and how well they performed on different chips: a kind of triad of knowledge.

Now, the interesting part is why there's a brain in the center here. If you do this at scale, you start to collect a lot of data: the more you spin those two loops, the more data points you accumulate.
You have hundreds to thousands of architecture recipes, tens of thousands of instantiated architectures with trained weights, and hundreds of thousands of profilings coming from the boards, because you can test the same model on multiple boards with different runtime parameters, settings, SDKs, and so on. Having all that, we train meta models. The meta models are the intelligence of our platform: machine learning models, sometimes deep learning models, that understand what has worked, how, and why. So the next time a customer asks us to train a model on a given chip with a given data set, we can predict which model best fits the requirements. For example, a meta model, and there are many different types that we use, can predict the runtime on the board before we even have to test it, so we can pick or discard models very quickly based on the customer's specific runtime requirements. Or it can understand where the memory bottleneck in a model is, and reject the model because it won't fit on a specific chip. Or it can suggest which hyperparameters to use, because hyperparameters are very important: with a very low learning rate you lose a lot of time training; with one that's too high you won't converge, or you'll diverge. Every model has a certain range of hyperparameters that is close to optimal for it, and we try to infer what those parameters, and the architecture, should be.
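The runtime-predicting meta model can be sketched in miniature (the real system is surely far more sophisticated; the features, numbers, and linear fit here are invented stand-ins): fit a predictor from cheap architecture features to latencies measured on the farm, then discard candidates predicted to miss the budget before spending any training compute on them.

```python
# Toy meta-model sketch: predict board latency from architecture features
# (here just MACs and parameter count, in millions) using a linear
# least-squares fit. All numbers are made up for illustration.
import numpy as np

# features: [MACs (M), params (M)]; target: measured latency in ms
X = np.array([[10, 0.2], [40, 0.8], [80, 1.5], [120, 2.5], [200, 4.0]], float)
y = np.array([1.1, 3.9, 7.8, 12.1, 19.8])

A = np.hstack([X, np.ones((len(X), 1))])      # add a bias column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares fit

def predict_latency_ms(macs_m, params_m):
    return float(np.array([macs_m, params_m, 1.0]) @ coef)

# Discard candidates predicted to miss a 5 ms budget before training them:
candidates = [(30, 0.6), (90, 1.8), (15, 0.3)]
kept = [c for c in candidates if predict_latency_ms(*c) < 5.0]
print(kept)  # → [(30, 0.6), (15, 0.3)]
```

The design point is the same one made in the talk: a cheap, learned filter in front of the expensive train-and-profile loop saves most of the wasted compute.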
The more jobs come through our system, the more data samples we collect, and the more we can learn. One more important part of the system is the little academic hat here. Even when our customers are fairly big companies, they rarely have really large data science teams dedicated to edge; in my experience it's maybe five to ten people at most. It's not easy for those teams to stay in touch with all the newest research, and there is so much research being published, basically every day, on optimization, on quantization, on new architectures. To get the best outcomes, you would need people following all of that academic work, which is not a realistic scenario, so teams often just say: I've heard about YOLO, let's pick YOLO and go for it. That tracking is what we do: we keep track of what happens, even at a conference like this one, what the new techniques for optimization or pruning are, and we inject that knowledge into our model-building engine. We also do it in training optimization: what can we do to make the models better after they are trained?

Here is an example result of a job we ran on Aptos. The user inputs: for the data set I used a very standard, well-known one, just as an academic example, visual wake words. The device we targeted was the NXP i.MX 8 Plus. We gave an inference-time constraint of less than five milliseconds. We selected an input size range (you can specify an exact size, or just leave it open) and an aspect ratio for the input.
I put the aspect ratio here because, if you look at foundation models or model zoos, almost all vision models take a square input. But that doesn't mean a square input is optimal for your application. Some automotive applications, for example, have a very wide field of view, say a four-to-one or three-to-one aspect ratio. Aptos can give you models with non-square aspect ratios as well, which is simply more optimal: you're not losing information by deforming images to fit them into the model.

Every one of these dots is an actual model with its own weights. The four orange ones are what was available in publicly available model zoos: MobileNetV1 is the original visual wake words benchmark model that everybody uses, and I took a few models from the ARM model zoo. I mapped them so you can see the inference time and the accuracy. First of all, what stands out is that the models Aptos generated over maybe a day of a job dominate the runtime-versus-accuracy frontier, gently dominating the zoo models. We also have many more models at higher accuracy. And as was said today, nobody wants to pay for models with 70 or 80 percent accuracy; you usually want to go higher. With a model zoo, in this specific case (and it's not the only case where this happens), you're probably way under your application limit of five milliseconds, maybe at one or two. If your application can afford five milliseconds, why not have a much more accurate model?
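The selection logic behind that point cloud can be sketched directly (the numbers here are invented for illustration): keep only the non-dominated models, meaning no other model is both faster and more accurate, then pick the most accurate one inside the latency budget.

```python
# Sketch of picking from a latency/accuracy point cloud: compute the
# Pareto front, then choose the best model under the latency budget.
# The model points are made-up illustrative values.
def pareto_front(points):
    """points: list of (latency_ms, accuracy). Lower latency and higher
    accuracy are better; return the sorted non-dominated subset."""
    front = []
    for lat, acc in points:
        dominated = any(l2 <= lat and a2 >= acc and (l2, a2) != (lat, acc)
                        for l2, a2 in points)
        if not dominated:
            front.append((lat, acc))
    return sorted(front)

models = [(1.2, 0.81), (2.5, 0.88), (2.0, 0.84), (4.8, 0.93),
          (6.0, 0.94), (3.0, 0.86)]
front = pareto_front(models)
best = max((m for m in front if m[0] < 5.0), key=lambda m: m[1])
print(front)
print(best)  # → (4.8, 0.93)
```

This is exactly the "why not a much more accurate model" argument in code: with a 5 ms budget, the winner is the slowest non-dominated model that still fits, not the fastest one.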
So you can see it's really a point cloud, and this is where you can pick the model that best meets your application. Accuracy versus time is not the only trade-off you can make; there is also memory usage and so on. All of these models fit within the constraints that were defined, but you can pick whichever best matches your expectations. You simply have many more models to work with than from a model zoo.

As I mentioned, we have this farm, and every model gets fully validated, so you actually get full access to each of the dots you saw before. You see whether it ran on the board (obviously, the ones on the plot were runnable), the inference time and the per-layer timing, and a full memory profile of how much the models actually consume: all under the constraint you defined, but some consume less than others. You also get proper power measurements. We have instrumentation connected to all of the boards, and we measure the actual power consumption of a given board inside the inference window; I think very few people, if anybody, do this. And we also look at on-device accuracy. That part is funny, because most people say that quantization loses some accuracy, and that's true, you typically do lose some. But after deployment, you sometimes lose accuracy that you don't even measure, because some ops on some NPUs introduce rounding errors, and in a very deep model those errors can accumulate, so you can get a fairly bad result after deployment.
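The three-way accuracy check described above reduces to a simple comparison once you have predictions from each stage; the prediction lists below are made-up stand-ins for real inference results, but the structure is the point: evaluate the float model on the host, the quantized file, and the model as it actually executes on the board, and compare.

```python
# Sketch of the on-device accuracy check: the same labelled evaluation
# set scored at three stages of deployment. Predictions are invented
# stand-ins; on real hardware they come from running the board.
def accuracy(preds, labels):
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)

labels       = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
float_preds  = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]  # host-side float model
tflite_preds = [0, 1, 1, 0, 1, 0, 1, 0, 0, 1]  # after quantization
device_preds = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]  # as executed on the NPU

for name, preds in [("float", float_preds), ("quantized", tflite_preds),
                    ("on-device", device_preds)]:
    print(name, accuracy(preds, labels))
```

In this toy run the on-device model scores below the quantized file, which is the talk's warning: the quantized file's accuracy on the host is not the number your product ships with.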
So it's not enough to say my quantized TFLite file has good accuracy; you have to measure it after it's deployed, and Aptos measures those parameters as well.

I'm probably running out of time, so I'll jump quickly through a few points. I think Aptos addresses many of the pain points in today's typical edge AI development flow. Which architecture should I use? You don't have to worry: you provide data, and we create or select architectures based on your requirements. Does my silicon support these ML operations? That's taken care of. Enough memory? Yes, you can define constraints or optimize against memory. Can I get more out of my NPU, CPU, or GPU? Yes, we test a lot of architectures and a lot of model instantiations in order to find the one that fits best. What if I choose different hardware? That's interesting: you provide your data to Aptos, select a different chip from the drop-down, and you get a model built for that specific hardware. So migrating to newer hardware, as long as it is in our system, is a fairly seamless experience. Training hyperparameters? You don't have to worry about those anymore. Training takes too long? We have a fairly large cluster infrastructure that can do the training efficiently. Can I make my model go faster without losing accuracy? Yes, that's another level of optimization Aptos takes care of. There's no coding involved in getting your model done. And as mentioned, we also measure power consumption, so you can estimate your expected battery life if you're building a battery-powered device.
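One of the research-backed techniques the talk mentions folding into the engine, pruning, can be shown in its simplest form. This is only the textbook version, global magnitude pruning, not whatever Eta Compute actually uses: zero out the smallest-magnitude weights, which a deployment toolchain can then exploit for compression or speed.

```python
# Minimal magnitude-pruning sketch: zero the `sparsity` fraction of
# smallest-|w| weights. Real pipelines prune structurally, retrain,
# and iterate; this only shows the core idea.
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of smallest-magnitude entries."""
    flat = np.abs(weights).ravel()
    k = int(len(flat) * sparsity)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest |w|
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))          # stand-in for a trained layer
pruned = magnitude_prune(w, sparsity=0.75)
print(float((pruned == 0).mean()))   # → 0.75
```

After pruning like this, accuracy is normally recovered with a short fine-tuning pass, which is why it belongs in the post-training optimization stage the talk describes.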
So, long story short, we try to bridge the gap between the embedded and ML worlds. I think the automation tools we've built can be a savior for many companies, because you don't have to have a whole data science team and a whole embedded team: one or two engineers can be enough to launch a successful project, with models that are validated and working properly. So that's it, thanks a lot.