EDGE AI POD

An Embedded Transformer-Based Face Recognition System on the STM32N6

EDGE AI FOUNDATION


What if transformer-level face recognition could run on a microcontroller without giving up speed or accuracy? We set out to make that real on the STM32N6 by pairing its neural processing unit with a hybrid model that blends convolutional efficiency and attention-like global context. Along the way, we rewired core assumptions about attention, reworked unsupported operators, and delivered a full on-device pipeline that actually feels instant.

We start with the hardware edge: ARM Cortex-M55, 4 MB of contiguous RAM, and an NPU pushing up to 600 GOPS at remarkable power efficiency. That lets us chain models—RetinaFace-style detection with landmarks, alignment for a stable canonical view, MobileNetV2 anti-spoofing to block print and replay attacks, and a final recognizer that outputs a 512‑dimensional embedding. The recognizer is built on EdgeFace, itself based on EdgeNeXt, chosen for its sweet spot between parameter count and accuracy. It behaves like a transformer where it matters—capturing long-range relationships—yet fits into the tight compute envelope of a microcontroller.
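To make that final matching step concrete, here is a minimal sketch of comparing a fresh 512-dimensional embedding against enrolled ones with cosine similarity. The function names and the 0.5 threshold are illustrative assumptions, not the actual on-device implementation:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match(embedding: np.ndarray, database: dict[str, np.ndarray],
          threshold: float = 0.5) -> str | None:
    """Return the enrolled identity whose embedding is most similar,
    or None if no score clears the (illustrative) threshold."""
    best_name, best_score = None, threshold
    for name, enrolled in database.items():
        score = cosine_similarity(embedding, enrolled)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Enrollment simply stores a 512-d embedding under the user's name:
# database["alice"] = recognizer_output  # shape (512,)
```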

The turning point is attention without the dot product. Because the ST toolchain doesn’t support batch matmul, we replaced it with a convolutional self-attention mechanism. Depthwise and pointwise convolutions encode relationships across pixels and channels, a sigmoid stands in for softmax, and element-wise products reconstruct attention’s weighting behavior. This maps cleanly to the NPU, avoids quadratic costs, and preserves the ability to stabilize identities across pose, lighting, and occlusion.
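As a rough illustration of the idea, here is a PyTorch sketch of such a block, assuming square feature maps; the layer names and kernel sizes are assumptions, and it follows the recipe above (depthwise convolution for values, a channel-to-pixels mapping for queries and keys, Hadamard products, sigmoid gating) rather than the exact block deployed on the N6:

```python
import torch
import torch.nn as nn

class ConvSelfAttention(nn.Module):
    """Sketch of attention without batch matmul: only convolutions,
    reshapes, and element-wise products, so it maps onto NPU-friendly ops."""
    def __init__(self, channels: int, size: int):
        super().__init__()
        hw = size * size
        # Depthwise conv models local relationships and produces V.
        self.to_v = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        # Pointwise conv maps the channel dim from C to H*W, producing Q.
        self.to_q = nn.Conv2d(channels, hw, 1)
        # Pointwise projection of the fused Q*K map back to C channels.
        self.proj = nn.Conv2d(hw, channels, 1)
        self.gate = nn.Sigmoid()  # stands in for softmax

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        v = self.to_v(x)                    # (B, C, H, W)
        q = self.to_q(x)                    # (B, H*W, H, W): each channel sees all pixels
        # K: transpose + reshape of Q, so channel i holds only pixel i's responses.
        k = q.flatten(2).transpose(1, 2).reshape(b, h * w, h, w)
        attn = self.gate(self.proj(q * k))  # Hadamard product -> 1x1 conv -> sigmoid
        return attn * v                     # element-wise weighting of the values

# e.g. ConvSelfAttention(channels=64, size=7)(torch.randn(1, 64, 7, 7)) -> (1, 64, 7, 7)
```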

Benchmarks show roughly 40 ms per frame end to end—about 25 FPS—plus substantial speedups over STM32H7 and higher accuracy than MobileFaceNet across validation sets. That opens doors for privacy-first access control, frictionless enrollment on-device, and personalized experiences where latency matters and data should never leave the edge. If you’re exploring embedded AI, this walkthrough shows how to align model design with silicon capabilities and deliver results that feel both fast and trustworthy.

Enjoy the deep dive? Subscribe, share this episode with a fellow edge AI builder, and leave a quick review to help others find the show.


Learn more about the EDGE AI FOUNDATION - edgeaifoundation.org

Opening And Thanks

Speaker Background And Focus

Agenda And Goals

Who We Are At Reply

Sensor Reply Capabilities

Why Transformers On The Edge

STM32N6 Hardware Tour

Project Context And Challenges

End‑To‑End Pipeline Overview

Detection And Anti‑Spoofing Models

CNNs Versus Transformers

Choosing A Hybrid Architecture

EdgeFace Model And Embeddings

Toolchain Limits And Workarounds

Convolutional Self‑Attention Replacement

Step‑By‑Step Attention Emulation

Performance And Benchmarks

Accuracy Gains And Use Cases

Future Directions On Microcontrollers

SPEAKER_00

Good afternoon everyone. I would like to thank the Edge AI Foundation for having me and for the opportunity to present this work on an embedded transformer-based face recognition system on the STM32N6. First of all, I am David Aiello, a software engineer who graduated in 2024 from Politecnico di Torino. I am currently working on real-time AI development on embedded systems, focusing on hardware-accelerated computer vision at Sensor Reply. These are the topics I will cover today: a brief introduction to Reply and Sensor Reply; the STM32N6 platform, the hardware used for this project; the context and challenges; the system pipeline and model architectures; the performance; and finally the conclusions.

Reply is a digital consulting company with a global presence, especially in Italy, Germany, and the USA. On the bottom part of the slide, we can see how Reply has grown since 2007 in terms of revenue and people. Here we have our competence areas, which cover every digital field: digital humans, robotics, digital marketing, cybersecurity, and AI, along with our partner ecosystem and some of our customers. As I said, I am from Sensor Reply, the Reply group company specialized in embedded AI solutions. Our technology background is built on two main pillars: physical modeling, and embedded AI with real-time computing. Starting from that, we are able to provide several services such as condition monitoring, virtual sensors, real-time image recognition, digital twins, and control systems. Over the years we have built a portfolio of edge solutions spanning multiple domains: face recognition, gesture recognition, image captioning, and so on.

We are here today to present a solution that combines this new hardware with the most promising model of recent years, the transformer, a model that does not typically run on edge platforms. Thanks to our partner, STMicroelectronics, we were able to get early access to the STM32N6, the first microcontroller of the ST family to feature a neural accelerator.

So this is the STM32N6; I suppose you have already seen this board these days. It features a neural processing unit, an ARM Cortex-M55, and 4 MB of contiguous RAM, and it can provide up to 600 GOPS while achieving 3 TOPS per watt. The development kit also comes with a five-inch LCD screen and a camera module. And this is the NPU, an accelerator designed specifically for AI applications, which offers better efficiency than a GPU, is easier to use than an FPGA, and is more flexible than an ASIC. Basically, it is capable of transforming an edge device from a simple sensor into an autonomous and intelligent brain.

The goals of the project were to explore the capabilities and constraints of the N6 platform, to improve and expand our existing face recognition system, and to evaluate the implementation of a transformer-based model on a microcontroller. Starting from this context, several challenges arise: deploying multiple models in sequence; a core face recognition model built on a transformer-based architecture; anti-spoofing detection using only the device's built-in camera; and an internal database that allows users to dynamically enroll into the system. So this is the pipeline.
We capture the frame from the camera; then the face detection step locates faces in the captured frame; the face alignment phase warps each face to a canonical view to improve robustness; the anti-spoofing detection ensures resilience against print or video attacks; and at the end the core model, face recognition, extracts an embedding from the detected face. Starting from this embedding, we can compare it with the embeddings already extracted, that is, the ones present in the database.

These are some of the model architectures. For face detection we use RetinaFace, which features a feature pyramid network; this means the network is able to detect faces of different sizes. It is also a multitask model, because it detects the landmarks on the face as well. For anti-spoofing we use MobileNetV2, which has been trained to distinguish between real and spoofed images.

Before discussing the core model, the one based on transformers, I would like to introduce the main differences between CNNs and transformers. A CNN has a typical hierarchical structure based on the receptive field, while a transformer has a global structure based on the attention mechanism. CNNs are effective at local feature extraction, while transformers can model long-distance relationships. CNNs have a strong inductive bias due to the convolution operator, while transformers have no strong inductive bias, and for this reason they are able to truly understand the patterns inside the data. CNNs can be shrunk without compromising accuracy, which makes them generally less computationally demanding, while transformers typically perform well only with a high number of parameters, and for this reason they are computationally intensive.

By studying the state of the art, we identified a trend in lightweight networks. In this table I put the main networks from different families: pre-CNNs, the classical CNNs that appeared before vision transformers; post-CNNs, which denote CNNs that integrate ideas from ViTs but retain a pure CNN structure; ViTs themselves; and hybrid networks, which combine both convolution operations and the attention mechanism. We found that only this last type of network, the hybrid structure, is able to maintain a low number of parameters while also preserving high accuracy. We can say EdgeNeXt is the best trade-off between number of parameters and accuracy.

So for the final model we chose EdgeFace, the winner of the 2023 efficient face recognition competition, which is based on EdgeNeXt. As we can see, it has a hierarchical structure divided into stages like a classical convolutional neural network, with a convolutional encoder and an SDTA encoder, which is based on the attention mechanism. As output, it provides an embedding of 512 values.

In order to deploy all of these models on the hardware, we need to rely on the STM32 toolchain, in particular STM32Cube.AI, which has recently been renamed ST Edge AI. This tool converts a TFLite, ONNX, or Keras model into C code for edge execution. In this case it was mandatory, given the presence of the NPU, but unfortunately not all deep learning layers are supported, especially for the last model.
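As context for that conversion step, the models first have to be exported to one of the supported formats. A minimal sketch of a PyTorch-to-ONNX export is shown below; the stand-in network, file name, and input size are illustrative assumptions, not the project's actual export script:

```python
import torch
import torch.nn as nn

# Stand-in for the trained recognizer (the real network would be EdgeFace).
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.Hardswish(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 512),
)
model.eval()

# 112x112 is a common input size for face recognition networks (assumption).
dummy = torch.randn(1, 3, 112, 112)
torch.onnx.export(model, dummy, "edgeface.onnx", opset_version=13)
# The resulting .onnx (or a quantized .tflite) is then handed to the ST tool,
# which generates C code scheduling the supported layers onto the N6's NPU.
```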
In this case we had problems with all of these operators and layers: layer normalization and the GELU activation were replaced with batch normalization and hard-swish respectively, while batch matmul, the dot product at the core of the attention mechanism, turned out to be more complex to replace.

This is the classical self-attention mechanism. We have the fully connected layers at the beginning, from which we obtain the query, key, and value matrices; we multiply the matrices Q and K with the inner product, then with the softmax we obtain an attention map, and this map is finally multiplied with the value matrix. The goal here was to find a solution able to emulate this behavior without relying on the inner product. We found one in the convolutional self-attention described on the NVIDIA blog. As I said, it is designed to emulate the relational encoding process of self-attention, and as we can see, the inner product has been replaced with the Hadamard product, which is basically an element-wise multiplication. Thanks to convolution, it enables the modeling of both local and global features while relying exclusively on convolutions. According to the experiments conducted by the NVIDIA authors, it also turns out to be faster and even more accurate than pure transformer architectures.

Let's now see in detail what happens. The structure starts with an initial modeling of the local relationships within the image: with a depthwise convolution we obtain the matrix V. Then we obtain the matrix Q by mapping the channel dimension from C to H times W. Then, by computing the transpose followed by a reshaping of the matrix Q, we obtain the matrix K. The important thing to notice is that each channel of the matrix Q contains the values of all of the pixels, while each channel of the matrix K contains the value of just one pixel. Since they have the same dimensions, we can combine them with an element-wise multiplication, and this enables each pixel to interact with all of the other pixels within a specific channel, obtaining a sort of pixel-wise aggregation like in the classical attention mechanism. We can then make this aggregation stronger by projecting it into a new space with a pointwise convolution, and a sigmoid emulates the softmax, allowing us to obtain a weighted map. At the end, this weighted map is element-wise multiplied with the matrix V, reproducing the second phase of the attention.
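The steps just described can be traced with a few lines of tensor code. This is only a shape walkthrough under assumed dimensions (64 channels on a 7×7 map), with random weights standing in for the learned convolutions:

```python
import torch
import torch.nn.functional as F

B, C, H, W = 1, 64, 7, 7                 # assumed feature-map size
hw = H * W
x = torch.randn(B, C, H, W)

# Step 1: V from a depthwise 3x3 conv (local relationships).
v = F.conv2d(x, torch.randn(C, 1, 3, 3), padding=1, groups=C)   # (B, C, H, W)
# Step 2: Q via a pointwise conv mapping channels C -> H*W.
q = F.conv2d(x, torch.randn(hw, C, 1, 1))                       # (B, H*W, H, W)
# Step 3: K is a transpose + reshape of Q, so channel i holds only pixel i.
k = q.flatten(2).transpose(1, 2).reshape(B, hw, H, W)
# Step 4: the Hadamard product lets each pixel interact with every other pixel.
qk = q * k
# Step 5: pointwise projection plus sigmoid emulates the softmaxed attention map.
attn = torch.sigmoid(F.conv2d(qk, torch.randn(C, hw, 1, 1)))    # (B, C, H, W)
# Step 6: element-wise weighting of V reproduces attention's second phase.
out = attn * v
print(out.shape)                          # torch.Size([1, 64, 7, 7])
```

Every operation here is a convolution, a reshape, or an element-wise product, which is exactly what keeps the block NPU-friendly.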
These are the performances we were able to obtain for the full pipeline: around 40 milliseconds, which corresponds to an execution speed of around 25 fps. Here is the inference time comparison, with the parameters for each model and the inference times on the STM32N6, the STM32H7, and desktop. As we can see, the presence of the NPU plays a crucial role: we obtain a speedup of 78 times for MobileFaceNet compared to the STM32H7, which is the most powerful microcontroller available on the market. We compare against MobileFaceNet because it is the classical network used for face recognition on the edge.

And this is the comparison in terms of accuracy. As we can see, EdgeFace, our model, proves to be more effective than MobileFaceNet on all validation datasets. This type of solution can be embedded in a wide range of scenarios where user identification is required, such as access control to restrict access to sensitive areas, granting permissions or privileges, seamless identification, and personalized user experiences in private or controlled environments. The future perspective is the reuse of this transformer-based architecture for other computer vision tasks on microcontrollers.