EDGE AI POD
Discover the cutting-edge world of energy-efficient machine learning, edge AI, hardware accelerators, software algorithms, and real-world use cases with this podcast feed covering all things edge AI from the world's largest EDGE AI community.
It features shows like EDGE AI Talks and EDGE AI Blueprints, as well as EDGE AI FOUNDATION event talks on a range of research, product, and business topics.
Join us to stay informed and inspired!
Neural-ART: ST’s New NPU Architecture at the Edge
What if the fastest path to efficient edge AI isn’t a bigger CPU, but a smarter stream of data? We pull back the curtain on Neural-ART, the flexible, stream-based accelerator inside the STM32N6, and show how a decade of prototypes led us to rethink how tensors move, how layers are scheduled, and how much work a compiler can save when memory is the real bottleneck. Instead of shuttling activations back and forth, our architecture routes data through specialized units in tightly orchestrated “epochs,” keeping compute hot and bandwidth cool.
From there, we tackle the hard limits of standard‑cell designs on practical MCU nodes. Power efficiency stuck around 1–5 TOPS/W and density near 0.1–2 TOPS/mm² pushed us to explore in‑memory computing. We break down digital versus analog IMC—determinism and integration on one side, approximate but highly efficient compute on the other—and share prototype results that hit roughly 40 TOPS/W and about 10 TOPS/mm² at 1 GHz. Along the way, we dig into why half of system power can vanish into data movement and how weight‑stationary strategies change the game.
We also get candid about trade-offs. Embedded phase change memory (PCM) brings remarkable density and multi-level storage, but demands strict weight-stationary mapping and drift compensation. No single technology wins every metric, so we lay out a heterogeneous 2D mesh that blends digital IMC, analog IMC, and classical stream units. Our compiler assigns each subgraph to the node that fits its accuracy, throughput, and energy needs, and our NeoSoC research effort moves this vision toward silicon with an upcoming 18-nm tapeout.
If you care about edge inference, memory bandwidth, quantization, and real‑world efficiency beyond spec‑sheet peaks, this conversation is for you. Subscribe, share with a teammate who’s wrestling with on‑device AI, and leave a review with the biggest bottleneck you want us to tackle next.
Learn more about the EDGE AI FOUNDATION - edgeaifoundation.org
Setting the Stage: Neural-ART
SPEAKER_00: Thanks a lot for the honor of being able to present Neural-ART, ST's new NPU architecture, here. Maybe some of you had the chance to see the STM32N6 demos downstairs; this type of NPU is integrated there, and maybe some of you are curious what is actually under the hood. I will provide some technical information and give you an idea of what we are trying to do in the future.

Neural-ART was not born in a day. It was a long journey: it took us almost 10 years of prototypes, trial and error, and many measurements before we could bring it to the public in 2024 and release the STM32N6, at a time when we saw a whole wave of new devices coming out with integrated NPUs.

During that journey we learned certain things. One is that our NPU had to be flexible: it was not enough to execute a single network, it needed to be programmable. Another important topic was scalability: we have different products at different performance scales, so we needed an architecture that scales. A third, very important issue is that on edge devices memory bandwidth is a scarce resource. There is no DDR, and you have to be very careful with it. And last but not least, silicon area is an important topic, because it defines how many dies you can sell per wafer, so the pressure was obviously there.

So what does that look like? This is the STM32N6. It has an Arm Cortex-M55 core, the Neural-ART accelerator, quite a decent amount of on-chip SRAM, a GPU, an ISP, and, obviously, a lot of interfaces.

Now, what is this NPU actually doing? It is a stream-based architecture. That means there is a centralized stream switch in the middle that routes data among the different components, all at the same time, with no interaction or blocking between the streams. A set of DMAs translates data tensors stored somewhere in memory into data streams and routes them through the stream switch to the integrated accelerators. These accelerators have different functionalities, such as convolution, pooling, and activation, and they support different precisions: 8-bit, mixed 8-/16-bit, 16-bit, and so on. There are additional components as well, for debugging, on-the-fly compression, encryption, interrupt generation, and temporary storage, plus the block on the lower right, a programmable controller that lets us run all the programming of the system independently from the host. So we offload everything to the NPU.

Now, how do you program that? On the left you see a classical representation, an ONNX graph of a network. We have a specific compiler that translates it into a graph segmented into epochs. An epoch is a set of units that are connected and alive at the same time. These epochs are scheduled one after the other on the NPU itself, handled either by the host or by the epoch controller. Epochs can be quite complex; they can stay alive for several million cycles depending on the complexity of the network.
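To make the epoch mechanism above concrete, here is a minimal Python sketch of a host-side runtime walking a compiled epoch list. All names here (Epoch, configure_stream_switch, the tensor labels) are hypothetical illustrations of the flow the speaker describes, not ST's actual toolchain API.

```python
# Hypothetical sketch of epoch-by-epoch execution on a stream-based NPU.
# Names and structures are illustrative, not ST's actual API.
from dataclasses import dataclass

@dataclass
class Epoch:
    units: list[str]       # processing chain, e.g. ["conv", "act", "pool"]
    dma_in: list[str]      # tensors streamed in from memory
    dma_out: list[str]     # tensors streamed back out
    est_cycles: int        # an epoch may stay alive for millions of cycles

def configure_stream_switch(units: list[str]) -> None:
    # Route each unit's output stream into the next unit's input, so
    # intermediate activations never have to touch memory.
    print("  chain:", " -> ".join(units))

def run(epochs: list[Epoch]) -> None:
    # Epochs run strictly one after the other; inside an epoch, all
    # streams flow concurrently through the configured units.
    for i, e in enumerate(epochs):
        print(f"epoch {i}: in={e.dma_in} out={e.dma_out}")
        configure_stream_switch(e.units)
        # here: start the DMAs, then wait for the completion interrupt

run([
    Epoch(["conv", "activation"], ["input", "w0"], ["t0"], 2_000_000),
    Epoch(["conv", "pooling"], ["t0", "w1"], ["output"], 3_500_000),
])
```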
There can also be multiple of these processing chains running at the same time, and depending on the amount of resources in your NPU configuration, they can be more or less complex.

Now, as I said, memory is a scarce resource, so we have to be careful. The advantage of such a stream-based architecture, with data streams and chains of processing units, is that we do not always have to go back and forth to memory; we route intermediate results directly between the functional units. You can do that in different ways. In example A we use two parallel chains; in example B we use one chain, just combined slightly differently. For rather complex network layers we have to iterate across some parameter, for example input channels or output channels. Depending on how we do that, the effect on memory can be quite significant: the difference between A and B is that one of them creates three times more memory traffic. But be careful, this is only true for this example; there are other examples where it is exactly the opposite. So it is the job of the compiler to find the best configuration, mapping the chains in a way that uses your memory as efficiently as possible.

Then, as I said, it has to be scalable. We have different products, so we are not actually building a single NPU, we are building an NPU generator. It is a tool suite: it takes a description of a few tens of lines and a library of components, and it produces a hardware description for the compiler, an RTL database, documentation, software libraries, and so on. That gives you different NPUs, from the low end, to the middle level in the tera-ops range, to the big guys, which are multi-island; we call a configuration multi-island when it has multiple stream switches. And obviously there is a constraint in terms of size: the more resources you integrate, the bigger this thing gets, and cost is obviously the limiting factor.

So this is the classical NPU design, and there are two things we saw. The important figures of merit, power efficiency and density, are somehow limited. Power efficiency sat somewhere between 1 and 5 TOPS/W for the technologies we can afford for an MCU; it is obviously not a 3-nanometer FinFET process, it is more like 16 nm FinFET or 18 nm FD-SOI. On density the story is similar: somewhere between 0.1 and 2 TOPS/mm², although that obviously depends on what you count as an operation. And the learning is that what we think will be needed in the future is a lot more than that, specifically on power efficiency. So what can we do? We came to the conclusion that we have to find something new, and we are looking at in-memory computing. There are principally two ways to do it, in SRAM or in non-volatile memory, but basically you can separate it into digital in-memory computing and analog in-memory computing. Digital is quite straightforward: it is deterministic, it is easy to integrate in a digital environment, all fine. You can even use it as plain SRAM when you are not using it for computing.
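Returning to the A-versus-B chaining example above: here is a toy Python model of how mapping choices alone change memory traffic. The tensor sizes and the tile count are invented for illustration; only the mechanism, re-streaming a tensor once per tile, comes from the talk.

```python
# Toy model of the A-versus-B chaining trade-off. All sizes and tile
# counts are invented; only the re-streaming mechanism is from the talk.

KB = 1024
acts_in = 300 * KB     # input activation tensor
weights = 100 * KB     # layer weights
acts_out = 300 * KB    # output activation tensor

# Mapping A: iterate over output channels in 3 tiles, so the input
# activations must be streamed from memory once per tile.
out_tiles = 3
traffic_a = out_tiles * acts_in + weights + acts_out

# Mapping B: a single chain pass; every tensor moves exactly once.
traffic_b = acts_in + weights + acts_out

print(f"input activations re-read {out_tiles}x under mapping A")
print(f"A: {traffic_a // KB} KiB  B: {traffic_b // KB} KiB  "
      f"ratio: {traffic_a / traffic_b:.1f}x")
# For another layer shape the inequality can flip, which is why the
# compiler searches over mappings instead of applying a fixed rule.
```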
Analog IMC is a bit more tricky, because it is approximate computing. It is more efficient for certain parameters, but if you integrate it you have to be careful, because it can be sensitive to process variation, temperature, and such things.

Now, if we look at real data, how does it look? This is data we generated from a prototype in 18 nm FD-SOI. On the upper left you see a graph of the energy efficiency of such an implementation. It obviously depends on sparsity and precision, but the black curve is one tile, one memory doing a computation, and the green one is a cluster of eight of these tiles with some of the surrounding logic you need. You are in the range of 40 TOPS/W, which is for sure 10x better than what we had before. And here, the density: at around 1 GHz we are in the range of 10 TOPS/mm², again more or less 10x better. If we compare that with analog, the situation is that analog also supports different widths; the tendency there is to go to narrower number representations, and depending on that we see another factor that could be gained. But as I said, analog has some additional catches. Still, this is more or less what we expect from this approach: a 10x or even greater improvement.

Then there is a second problem. Energy efficiency is not only about how efficient your tiles are, but also about how you integrate them. In an example where we integrate such a tile for a very simple, regular network computation, fully 50% of the power dissipation goes into weight reloading and data movement between these units. The other topic is number representation: the smaller the number representation you choose, the more power efficient you can be, but accuracy is what you spend in return.

Are there ways around it? Well, we could talk about weight-stationary operation: not moving weights around, keeping them local. ST has been working for many years on embedded phase-change memory (PCM). These are very dense memories; we are talking about densities of up to 50 million bits per square millimeter in an 18-nanometer technology, which is very significant. And we have started to integrate analog in-memory computing with cells that can store multiple levels, so multi-level cells storing multiple bits per cell. This is interesting, but it has a catch. PCM memories are very slow to load and not very power efficient when you write to them. That means the mapping has to be absolutely weight-stationary, with really no reloading allowed: either the device has enough memory for what you want to do, or there is no mapping. The second problem is that these memories drift, and you have to compensate for that. Drift compensation can be done, but it is a bit tricky, so you have to be careful with your application. So this is another option, again with advantages and disadvantages.

We know that none of these technologies is actually perfect; there is no one-fits-all technology. Therefore we think the right approach may be to go heterogeneous: use the advantages of the different technologies and exploit them in a unified architecture. There is the problem of number representation and precision, the problem of accuracy, the problem of throughput and power efficiency, and so on.
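The drift compensation mentioned just above can be made concrete. PCM conductance is commonly described by a power-law drift model, G(t) = G0 · (t/t0)^(−ν); the sketch below applies reference-cell compensation under that model. The exponent value and the reference-cell scheme are illustrative assumptions, not figures from the talk.

```python
# Minimal sketch of power-law drift compensation for PCM readout.
# G(t) = G0 * (t/t0) ** -nu is the standard empirical drift model;
# the exponent and reference-cell scheme here are illustrative.

def drifted_conductance(g0: float, t: float, t0: float = 1.0,
                        nu: float = 0.06) -> float:
    """Conductance after drifting from reference time t0 to time t."""
    return g0 * (t / t0) ** -nu

def compensate(g_read: float, g_ref_read: float, g_ref_nominal: float) -> float:
    # A reference cell programmed to a known conductance drifts with
    # (roughly) the same exponent, so its readout yields the correction
    # factor to apply to every weight cell read at the same moment.
    return g_read * (g_ref_nominal / g_ref_read)

g0, g_ref0 = 10.0, 20.0          # programmed conductances (arbitrary units)
t = 1e6                          # seconds since programming
g = drifted_conductance(g0, t)
g_ref = drifted_conductance(g_ref0, t)
print(f"raw: {g:.3f}  compensated: {compensate(g, g_ref, g_ref0):.3f}  true: {g0}")
```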
What we try here is to combine the different units in a 2D mesh, and to map each part of the algorithm onto the node that can best deal with one property or the other. This is a European-funded research project called NeoSoC, where we try to do exactly that, and the idea is that next year we will tape out a chip in 18 nm that should demonstrate it.

And that brings me to my conclusion. Neural-ART is a flexible data-stream processing architecture; it is not a CPU, it is a datapath for data streams. Our focus was on cost, memory bandwidth efficiency, scalability, and reconfigurability. But we have come to understand that there is a limit to the maximum power efficiency and density you can reach with this standard-cell-based approach. So we are looking at in-memory computing, analog or digital. We know that weight reloading is a problem, it is costly, so weight-stationary operation could be a solution to fix that, and as I showed, PCM is an interesting option there. But since in the end there will be no technology that fits all, you may have something that uses a bit of everything: a heterogeneous architecture. Thanks a lot for your attention. Now, if there are any questions.
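To close, here is a minimal sketch of the kind of assignment decision a compiler for such a heterogeneous mesh has to make. The node characteristics, constraints, and subgraph requirements below are all invented for illustration; only the idea of matching subgraphs to digital IMC, analog IMC, or classical stream units comes from the talk.

```python
# Toy cost model for assigning network subgraphs to heterogeneous nodes
# (digital IMC, analog IMC, classical stream unit). All figures invented.

NODES = {
    # name: (TOPS/W, supports_high_accuracy, requires_weight_stationary)
    "stream":      (3.0,  True,  False),
    "digital_imc": (40.0, True,  False),
    "analog_imc":  (80.0, False, True),
}

def assign(subgraphs):
    plan = {}
    for name, needs_accuracy, can_pin_weights in subgraphs:
        best, best_eff = None, -1.0
        for node, (eff, accurate, ws_only) in NODES.items():
            if needs_accuracy and not accurate:
                continue                 # analog IMC is approximate
            if ws_only and not can_pin_weights:
                continue                 # PCM-style nodes forbid reloading
            if eff > best_eff:
                best, best_eff = node, eff
        plan[name] = best                # pick the most efficient legal node
    return plan

subgraphs = [
    ("first_conv", True,  True),   # accuracy-critical layer
    ("backbone",   False, True),   # error-tolerant, weights fit on-node
    ("head",       True,  False),  # weights too large to stay resident
]
print(assign(subgraphs))
# -> first_conv and head land on digital IMC, backbone on analog IMC
```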