EDGE AI POD

Crafting Artistic Images with Embedded AI with Alberto Ancilotto of FBK

EDGE AI FOUNDATION

Unlock the secrets of neural style transfer on microcontrollers with our special guest, Alberto Ancilotto of FBK, as he explores a groundbreaking approach to image generation on low-power devices. Discover how this innovative technique allows us to combine the content of one image with the artistic style of another, transforming simple visuals into unique masterpieces—like turning a regular cat photo into a Van Gogh-inspired work of art. Alberto introduces Xinet, a cutting-edge convolutional neural network designed to perform these creative tasks efficiently on embedded platforms. Gain insight into the process of optimizing performance by evaluating CNN operators for energy efficiency and adapting networks for a variety of devices, from the smallest microcontrollers to advanced TPUs and accelerators.

We dive deep into the collaboration between CLIP and style transfer networks, enhancing the precision of semantic representation in generated images. Witness the impressive capabilities of this technology through real-world examples, such as generating images in just 60 milliseconds on the STM32N6 microcontroller. Experience the advanced applications in video anonymization, where style transfer provides a superior alternative to traditional blurring methods, altering appearances while maintaining action consistency. Alberto also addresses the broader implications of anonymization technology in public spaces, including privacy protection and GDPR compliance, while maintaining artistic integrity. Join us as we tackle audience questions about model parameters, deployment flexibility, and the exciting potential of this technology across various sectors.

Learn more about the EDGE AI FOUNDATION - edgeaifoundation.org

Speaker 1:

Thank you, bye. And welcome to Alberto. Last but not least, Alberto will talk about neural style transfer for microcontrollers. I also see the logo of ST, so please explain your work and let's see what you have to propose. Thank you.

Speaker 2:

Okay.

Speaker 2:

So thank you, Danilo, for the introduction. As you said, this work is titled Neural Style Transfer for Microcontrollers, and it is a collaboration between FBK and ST. We are going to change the topic a little bit, because I'm sure you've all heard a lot about generative AI applied to text in this forum; in this presentation we are going to move to image generation, still with generative AI, but always targeted at very small devices. In particular, this work targets very lightweight, low-power devices such as microcontrollers, and that's why ST is fundamental. If some of you are not familiar with the task, what we are doing is called neural style transfer, which is an image generation task. It has fun applications, like synthetic media generation or artistic renditions, but it also has two very practical uses, which are anonymization and domain adaptation.

Speaker 2:

The problem with the networks that usually do neural style transfer is that they are very, very large and very, very slow, and they are definitely not fit for tiny, low-power embedded devices. If any of you are not familiar with the task, I'm going to give you an example of what we are trying to do with the help of my cat. Okay, this is my cat, her name is Enea and, no, the picture is not upside down, she just sleeps like this. So what neural style transfer is: it's the task of merging the content of one image with the style of another. For example, here in the center we have the content image, which is my cat, and if I go here, we can see my cat, but painted by Van Gogh. You see the content image in the middle, the style image on the left, and the generated image on the right. So we are copying the content of one and the style of the other.

Speaker 2:

These approaches are not new. Actually, they were born around seven or eight years ago. The problem is that the first paper that proposed networks for this task used an iterative approach, so you needed to run the network hundreds of times for a single image, which was definitely not a good fit for embedded devices. Then we moved to feed-forward approaches and to CNNs, but still very, very large convolutional neural networks. We are talking tens of megabytes — 40 megabytes, 80 megabytes — nothing that can actually fit on a microcontroller.

Speaker 2:

So I will explain a little bit how these approaches work. The inference side is very simple: we have our content image, we have our style image, and we have our style transfer network. We just feed content and style into the style transfer network, and what we get out is our styled image. The fun part is actually the training. To train these approaches you need a separate neural network, a separate feature extractor — in this case we are using VGG19. The way it is generally used is that we obtain a style loss and a content loss from the features of this network. In particular, we say that the features in the first layers of the network represent, in some way, the style of our image, because the first layers of VGG19 are trained to understand color, lines, shapes — high-frequency stuff, more related to the style of an image — while the later layers of VGG19 are trained to understand the higher-level content of an image. They understand things like: there is a cat, it is upside down, the eyes are yellow, and so on. So if we can reproduce the first-layer features of our style and the last-layer features of our content, then we are generating an image that is doing style transfer. We want to do this on very small devices, and VGG19 is only needed during training, so for deployment we can actually ignore it.
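
For those who want to see this concretely, here is a minimal sketch of how a VGG19-based content and style loss is typically computed — Gram matrices on early-layer features for style, direct feature matching on a late layer for content. The layer indices and the loss weighting below are illustrative assumptions, not the exact values used in this work.

```python
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

# Frozen VGG19 feature extractor, used only during training.
vgg = vgg19(weights=VGG19_Weights.DEFAULT).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

STYLE_LAYERS = [1, 6, 11, 20]   # early layers: color, lines, texture (illustrative indices)
CONTENT_LAYER = 29              # late layer: high-level content (illustrative index)

def vgg_features(x, layers):
    """Collect activations at the requested layer indices."""
    feats, out = {}, x
    for i, block in enumerate(vgg):
        out = block(out)
        if i in layers:
            feats[i] = out
    return feats

def gram(f):
    """Gram matrix of a feature map: channel-to-channel correlations (style statistics)."""
    b, c, h, w = f.shape
    f = f.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_content_loss(generated, content_img, style_img):
    g = vgg_features(generated, STYLE_LAYERS + [CONTENT_LAYER])
    s = vgg_features(style_img, STYLE_LAYERS)
    c = vgg_features(content_img, [CONTENT_LAYER])
    style_loss = sum(F.mse_loss(gram(g[i]), gram(s[i])) for i in STYLE_LAYERS)
    content_loss = F.mse_loss(g[CONTENT_LAYER], c[CONTENT_LAYER])
    return content_loss + 10.0 * style_loss  # relative weighting is an assumption
```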

Speaker 2:

What we are focusing on is the first part, the style transfer network. If we want to do this on embedded devices, on microcontrollers, we need a style transfer network that is much smaller, much faster and much more fit for microcontrollers, and for this we chose this network here. This is Xinet, a network developed here at FBK, and it is a CNN specifically designed for generative tasks — generative AI on very low-power embedded devices. Its convolutional block is built from the operations that we benchmarked to be maximally efficient across a very large set of devices. We took some commonly used CNN operators, such as Conv2D, depthwise, pointwise and grouped convolutions, gathered all the embedded platforms we could find, and benchmarked each operator for efficiency. We define efficiency as the performance achieved by that operator divided by its energy usage, because some operators are slower and some are less optimized — Conv2D is generally very well optimized across the board. With the most efficient operators that we found, we designed our convolutional block, and we also have downsampling and upsampling blocks so the network can be used for generative tasks.
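
As a toy illustration of the efficiency ranking described here — performance divided by energy for each candidate operator — the sketch below uses made-up numbers and a simple throughput-per-watt definition; the actual benchmarking setup and figures in this work are different.

```python
from dataclasses import dataclass

@dataclass
class OpBenchmark:
    """One measured operator on one embedded platform (numbers are made up)."""
    name: str
    throughput_mops: float  # measured millions of operations per second
    power_mw: float         # average power draw during the benchmark

    @property
    def efficiency(self) -> float:
        # performance divided by energy usage ~= operations per joule
        return self.throughput_mops / (self.power_mw / 1000.0)

measurements = [
    OpBenchmark("conv2d",    throughput_mops=55.0, power_mw=120.0),
    OpBenchmark("depthwise", throughput_mops=30.0, power_mw=100.0),
    OpBenchmark("pointwise", throughput_mops=48.0, power_mw=110.0),
    OpBenchmark("grouped",   throughput_mops=25.0, power_mw=105.0),
]

# Rank operators by efficiency; the best ones become candidates for the Xinet block.
for op in sorted(measurements, key=lambda o: o.efficiency, reverse=True):
    print(f"{op.name:10s} {op.efficiency:8.1f} MOPS/W")
```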

Speaker 2:

Another cool thing about this architecture is that it allows what we call hardware-aware scaling. So what is hardware-aware scaling? With embedded devices you have a great variety of platforms to choose from: you can go from a very tiny microcontroller to a parallel platform, to a TPU, to an accelerator, and so on. So how can you design a network that fits your particular device? What we do is describe each platform with just three numbers: the amount of flash, so the storage space you have; the amount of RAM, so the working memory; and the speed of your CPU, so how many operations you can do in a second. Based on these three numbers, you can change three hyperparameters of the network — the ones here in the slide, alpha, beta and gamma — and with each one you can optimize a separate constraint.
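
A rough, self-contained sketch of the idea (not the exact Xinet scaling rules): the device is summarized by flash, RAM and operations per second, and each knob is lowered just enough to meet one budget. The cost model, base figures and knob-to-budget mapping below are purely illustrative.

```python
# Device described by three numbers: compute (MOPS/s), RAM (kB), flash (kB).
# Base costs of the unscaled network and the knob-to-budget mapping are illustrative.
BASE_MOPS = 450.0        # operations per inference of the unscaled network
BASE_RAM_KB = 21_000.0   # peak activation memory of the unscaled network
BASE_FLASH_KB = 3_600.0  # parameter storage of the unscaled network

def scale_for_device(mops_budget, ram_kb, flash_kb, floor=0.05, target_fps=1.0):
    """Pick the largest knob values (<= 1.0) that still fit the device budgets."""
    alpha = min(1.0, max(floor, mops_budget / (BASE_MOPS * target_fps)))  # fewer operations
    beta  = min(1.0, max(floor, ram_kb / BASE_RAM_KB))                    # smaller activations
    gamma = min(1.0, max(floor, flash_kb / BASE_FLASH_KB))                # fewer parameters
    return round(alpha, 3), round(beta, 3), round(gamma, 3)

# Example: an STM32H7-class target -- 2 MB flash, 1 MB RAM, ~60 million ops/s.
print(scale_for_device(mops_budget=60, ram_kb=1024, flash_kb=2048))
```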

Speaker 2:

OK, let's say that our original network takes too many operations. We just go down in alpha, and we get a network that uses fewer operations — a network that is a very good fit for your platform and that can maximally utilize the hardware it provides. Here I have an example of what happens when we do this with our style transfer network. In particular, here we have a small breakdown of the original style transfer network that we were using, and you can see that it uses a big amount of RAM — 21 megabytes — it has a lot of parameters, and it requires a lot of operations. These are generally not very big numbers, but if you are considering a microcontroller, they are very big numbers. And — I forgot to say our target platform — it is the STM32H7, a microcontroller with two megabytes of flash and one megabyte of RAM, which can run around 60 million operations per second. So the numbers here in the slide are way too large for our use case. What we do is design our Xinet using hardware-aware scaling, and we actually make it a little bit smaller than what our platform could support. You can see here that we manage savings of around 96% in RAM, 90% in the number of parameters and 97% in the number of operations. So the new network is much smaller and much, much faster than our original one. However, we do have a small price to pay: we are approximating a network with something that is 1/100 of its size, so we are losing a little bit of performance.

Speaker 2:

Let me briefly introduce the metrics that we are using. For style transfer we mainly use these two: FID, which measures how close you are to your style reference, and CLIP score, which measures how close you are to your content reference, and ideally you want to do as well as possible on both. As you can see here, we lose around 15-20% on both metrics just by using our approach. So the question is: how can we get back the performance that we lost? One way of doing it is to try to make the task a bit easier for our network to learn.

Speaker 2:

Our network is very small, and if you can make the task a bit easier to understand, it will perform better. To do this, we slightly change the loss function — the one I showed you before with the content and style losses — and we add a couple of modifications. The first one is that we logarithmically scale the features inside VGG. This sounds very complex, but what we are actually doing is just what you can see here in the slide. By doing this logarithmic scaling, we make small features more expressive: you can see them a lot better, the feature map is more expressive and easier for the network to approximate.
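
In code this is very small; a sketch of the kind of logarithmic feature scaling described here, with the caveat that the exact formula used in the work may differ:

```python
import torch

def log_scale(features: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Compress the dynamic range of VGG activations so that small feature values
    contribute more to the loss; the exact formula in the work may differ."""
    return torch.log1p(features.clamp_min(0.0) + eps)

# The scaled features replace the raw activations inside the style/content losses,
# e.g. gram(log_scale(f)) instead of gram(f).
```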

Speaker 2:

The other thing that we are doing is running the style transfer operation in a multi-scale manner. This helps a lot because, if you imagine for example The Starry Night, you have the very small brush strokes, but you also have the very large swirls across the image, and you want to approximate both. By doing this at three different scales, we can learn the task a lot better than with a single scale and a larger network. By doing this, you can see that we get a lot of the lost performance back: we are now only around three to five percent worse than the original approach, while saving around 97% of the network, which is quite a good result already. And I can show you some results obtained with this approach.
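
A minimal sketch of a multi-scale perceptual loss: the same style/content loss is evaluated on downsampled copies of the images and the terms are summed. The scale factors and weights are assumptions, not the values used in this work.

```python
import torch.nn.functional as F

def multiscale_loss(loss_fn, generated, content_img, style_img,
                    scales=(1.0, 0.5, 0.25), weights=(1.0, 1.0, 1.0)):
    """Evaluate a perceptual style/content loss at several resolutions so that both
    fine brush strokes and large structures are captured. `loss_fn` is any
    (generated, content, style) -> scalar loss, e.g. the VGG-based sketch above;
    the scale factors and weights here are illustrative."""
    total = 0.0
    for s, w in zip(scales, weights):
        if s == 1.0:
            g, c, st = generated, content_img, style_img
        else:
            g = F.interpolate(generated, scale_factor=s, mode="bilinear", align_corners=False)
            c = F.interpolate(content_img, scale_factor=s, mode="bilinear", align_corners=False)
            st = F.interpolate(style_img, scale_factor=s, mode="bilinear", align_corners=False)
        total = total + w * loss_fn(g, c, st)
    return total
```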

Speaker 2:

So here is my cat again, and this time it is styled with a stained-glass look. You can see that the image is quite nice and quite consistent — the colors are there, the lines are there. And here instead is a view of Brixen, which is a city here near Trento. It is already quite pretty, but after we run style transfer we can make it a bit more detailed, a bit more colorful, a bit more artistic. However, we still have some problems, particularly related to color. If we try to run this approach as-is, using The Starry Night as our style, what we get is this.

Speaker 2:

Now the image — yes, it is styled, but it is very, very blue, and this makes sense, because our style reference is blue. So what is happening? Well, as I said before, the first layers of VGG are learning stuff about color. So if you provide VGG with a style reference that is blue, your network will learn that the generated image must be blue. But we don't want this, because especially if you want to use this for anonymization, you cannot just have a uniform color across your frame.

Speaker 2:

So how can we solve this? We solve it by adding an additional loss term which forces the generated image to take its color from the content instead of from the style. To do this, we convert both the generated image and the content image to the LAB color space, where you have one channel for brightness, which we completely ignore, and two channels for chrominance, and we try to match the chrominance of the generated image and the original content image as closely as possible. Here you can see the effect this has on our cat: the rendition becomes much, much better, and we carry the colors from the content reference much more faithfully into the generated image. And since this is an additional loss term, you can adjust its weight: you can have a generated image that is closer in color to your style reference, or one that is closer to your content reference, depending on the application.
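
A small sketch of this color term, assuming the kornia package for the RGB-to-LAB conversion (any equivalent conversion would do); the weight and other details are assumptions.

```python
import torch
import torch.nn.functional as F
from kornia.color import rgb_to_lab  # assumes the kornia package; any RGB->LAB conversion works

def chrominance_loss(generated: torch.Tensor, content_img: torch.Tensor) -> torch.Tensor:
    """Match the color (a/b chrominance channels of LAB) of the generated image to
    the content image, ignoring the L (brightness) channel.
    Inputs: RGB tensors in [0, 1] with shape (B, 3, H, W)."""
    gen_ab = rgb_to_lab(generated)[:, 1:]        # channels 1 and 2 are a and b
    content_ab = rgb_to_lab(content_img)[:, 1:]
    return F.mse_loss(gen_ab, content_ab)

# total_loss = perceptual_terms + lambda_color * chrominance_loss(generated, content),
# where lambda_color trades off style-colored vs. content-colored outputs.
```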

Speaker 2:

So then we tried to run this on another picture of my cat, and we found another problem. As you can see here, in the generated image we are actually losing the cat, which is quite a big problem — not really for artistic renditions, but it is a very big problem if you want to use this for anonymization, because if you are removing your targets from the frame, then it doesn't really make much sense as anonymization. So why is this happening? The problem now is the content loss. The content loss, as I was telling you before, is calculated at the final layers of VGG19, but the final layers have a smaller and smaller spatial resolution, so they can miss small targets. So how can we fix this?

Speaker 2:

Well, to fix this, we resort to the CLIP vision transformer, part of the CLIP foundation model. It is not directly related to the CLIP score metric that I was telling you about before, so don't mix them up — keep them as separate things. CLIP is a pair of models, one working on images and one working on text, and they are trained to project images and text into a shared latent space. That is a very complicated way of saying that they obtain the same representation for an image and for the text that describes it. You can imagine, for example, these orange features that you see here: they contain the concepts of orange, of bag, of cat, and they are the same features extracted from both the image and the text. This helps us a lot, because in some way you can think of CLIP as providing a caption for the image — a description of what is in the image.

Speaker 2:

So what we do is run CLIP on the generated and the original image during training, obtain the two semantic representations, and try to align them so that they represent the same things. From the original image we get a representation that says, okay, there is a desk, there is a bag, there is a cat; from our generated image it will be a bit different — it will still say there is a desk and there is a bag, but there is no cat. So they are quite different. We add an additional loss term that forces these two representations to be the same. In some way we are distilling the concepts learned by CLIP during training, and we are forcing the two images to represent the same stuff. Here you can see the effect of this.
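
Here is a hedged sketch of such a semantic alignment term, assuming the open_clip package and a ViT-B/32 checkpoint; the CLIP variant and preprocessing actually used in this work may differ.

```python
import torch
import torch.nn.functional as F
import open_clip  # assumes the open_clip package; the exact CLIP variant is an assumption

clip_model, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
clip_model.eval()
for p in clip_model.parameters():
    p.requires_grad_(False)

def clip_semantic_loss(generated: torch.Tensor, content_img: torch.Tensor) -> torch.Tensor:
    """Keep the CLIP embedding of the styled image close to that of the original
    content image, so that objects (like a small cat) do not disappear.
    Inputs should already be resized and normalized the way CLIP expects."""
    with torch.no_grad():
        target = clip_model.encode_image(content_img)
    pred = clip_model.encode_image(generated)   # gradients flow back into the generator
    return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
```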

Speaker 2:

On the left we have our original image, and at the bottom you can see some text along with the correspondence score between that image and the text. In the first picture it is nearly perfect, while in the second image we are missing the cat and the correspondence score is quite poor. In the third image, our style transfer network has learned to put the cat back into the picture, and so we obtain a better correspondence score. Here is a small breakdown of our complete training approach. You can see that for each image you need five inference passes: three through VGG19 and two through the CLIP image encoder. So, yes, the training is a bit complicated, but because our network is so small, it just takes a couple of hours on a recent GPU, so it's not really much of a burden. Here you can see some examples of this run on the view from outside my office, which is already quite pretty, but we can make it prettier with Van Gogh's Starry Night. And there is my cat again, and this is my cat painted by Van Gogh. So in the end we obtain a style transfer network that works quite well: we almost match the performance of our original network, and the CLIP score actually becomes even better than the original network, because we are directly optimizing this CLIP representation.
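
Putting the pieces together, a hypothetical training step might look like the sketch below, reusing the loss sketches from above; the loss weights and the `style_net` interface are assumptions, not the exact recipe used in this work.

```python
def training_step(style_net, optimizer, content_img, style_img,
                  w_color=5.0, w_clip=2.0):
    """One hypothetical training step combining the loss terms discussed above.
    `style_net` stands in for the Xinet-based style transfer network; the weights
    and helper losses are the illustrative sketches from earlier, not the exact recipe."""
    optimizer.zero_grad()
    generated = style_net(content_img, style_img)
    loss = (multiscale_loss(style_content_loss, generated, content_img, style_img)
            + w_color * chrominance_loss(generated, content_img)    # LAB color term
            + w_clip * clip_semantic_loss(generated, content_img))  # CLIP semantic term
    loss.backward()
    optimizer.step()
    return loss.item()
```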

Speaker 2:

Okay, so we have our style transfer network that works pretty well, and here is where ST comes in, because together with ST we worked on deploying this network on three different embedded devices to test its performance. In particular, we chose three different families of devices. We have our H7 microcontroller, the one I was telling you about before, with two megabytes of flash and one megabyte of RAM. Then we have the STM32MP2, which is actually a microprocessor — much faster, higher power consumption, much more memory — and it also embeds a hardware AI accelerator, so it can run convolutional workloads a lot more efficiently. And then we have the STM32N6, which at the time was an unreleased board from ST: a microcontroller with a low-power AI accelerator, which also helps a lot with this kind of workload because it can compute convolutions quite efficiently.

Speaker 2:

So here we have some results, and if you want, you can get the code from the QR code on the top right. We will start with the N6: we generate one frame in 60 milliseconds, which is very fast, and on the MP2 with the accelerator, 72 milliseconds — also very fast, considering that state-of-the-art approaches reach about one frame per second on a GPU. So having an approach that is this fast on a microcontroller is a very good result, in my opinion. We also tested it on a Raspberry Pi 4, where we can still reach around 10 frames per second, and on the MP2 without the AI accelerator, where we still manage around a couple of frames per second. But the most impressive result, in my opinion, is the last one, because on the H7, on the plain microcontroller, we are actually running the full image generation pipeline. It is quite slow — it takes around eight seconds per image — but as far as I know, this is the first work that can generate images purely on a single microcontroller, without any external help. So, in conclusion, we proposed a lighter network for neural style transfer, and generative tasks in general, for low-power embedded devices, which can run in real time on these devices, and we also proposed a couple of ways to improve existing neural style transfer approaches with the additional loss terms that I explained. And earlier I was also telling you a bit about anonymization.

Speaker 2:

So here I have a very, very preliminary, very small demo of how this works. I hope this works here... wait, how can I start the video? Okay, here we go.

Speaker 2:

This is a video of me walking across the frame, and here you can see the anonymized versions with two different styles. What you can see is that — okay, the video flickers a bit because we are processing it frame by frame — the action is really consistent across the three different videos, and we are actually changing everything about the person: we are changing the hair, we are changing the clothes. So this is much better than plain anonymization, like blurring the whole image, where you cannot recognize the action being taken, or just blurring the face, where you can still recognize the person by looking at the clothes and things like that. So this is a very preliminary version of how we can anonymize video with this approach. That's it. Thank you a lot for your attention, and if you have any questions, I will be happy to try and answer them.

Speaker 1:

Thank you so much, Alberto. A really impressive amount of work and a very nice presentation, almost entirely image-based, so very, very cute. Let me check if there are comments. Yes — "amazing work", from Damien. Maybe I could ask you: what about audio? You spoke about techniques — it was in a previous slide — where you had a different way to match semantically the information between the original image, the text, and the style to be superimposed.

Speaker 1:

So what you achieved is clearly not a simple superimposition. I wonder if these techniques can be applied in a more generalized way to other content as well, for example starting from audio.

Speaker 2:

Yeah, actually this is a very good question, because we have some ongoing work that uses Xinet and the same architecture that I showed you here, applied to spectrograms and to audio generation, so that demonstrates a bit that this approach is quite general. Yes, you need to change the loss function a little, because what I explained here is specific to the image domain, but the general technique and the network itself can work very well for audio, and also in that case we are achieving some very significant savings compared to other audio anonymization techniques.

Speaker 1:

Okay, I was wondering — you know, I've been challenged by a comment from a friend asking me what kind of impact this technology can have on the market and in the community. So let me forward this to you. I know you are not a businessman, clearly, but I wonder at least in which direction this technology can really provide an impact. You spoke about anonymization and concealment.

Speaker 1:

Maybe I don't want to share my nose or some part of my body — you know, I want to preserve some kind of personal information that I don't want to share, at least at the beginning.

Speaker 2:

No, no, that makes a lot of sense. Actually, the way all of this was designed was always with an eye on public-facing scenarios. Let's say you have a camera in a public space. Maybe you want to understand in some way whether there are anomalies, or what actions are being carried out, or more simply to do, I don't know, people counting, things like that. But maybe the people there don't really want their privacy broken by having video of their faces sent away.

Speaker 2:

So in that case it helps to have something that can anonymize the scene while actually guaranteeing that you are keeping all the objects in the frame — that's why I was telling you about the CLIP loss. The problem is that if you have very small targets, because the camera is far away, you run the risk of losing them. This way, you are more or less guaranteeing that you keep all the targets between two different frames, and you can run your high-level image understanding pipelines without worrying about people's privacy, without sending faces around and dealing with all of that. Also in the case of GDPR and privacy laws, you can be quite sure that everything has been anonymized, because, as you saw from the video, there was no way you would have said that the person in that slide was me.

Speaker 1:

Yeah, thanks a lot, Alberto, because I think now I see a useful application for everyone, without falling into the danger of creating fake content — because if I can artistically improve my content and it is still not fake content, I think there is value there, because it helps my image, my way of communicating with others. Now we have a couple of questions. The first is from Chen Liu, who compliments the great job: what is the complexity of the final model in your demo, like how many million parameters and how many GOPs (giga operations) does it require per inference?

Speaker 2:

Okay, so the model at the end — the smallest one, which is the one we benchmarked here in the slides and the one where I showed the FID and the other metrics — is the one designed for the microcontroller, so it has less than 1 million parameters and it takes half a GOP to run. Obviously, the complexity depends on the size of your input image; in this case, the half-GOP figure is for a 160 by 160 image. In the GitHub repo we also have a couple more models targeted at, for example, the Raspberry Pi or the MPU, which are a bit bigger, and in that case the input resolution is 320 by 320. So, yeah, two GOPs for the large ones and half a GOP for the small one, for the microcontroller.

Speaker 1:

Thank you, and thank you for answering Chen's question. Now a question from Brenda: you deployed to multiple STM boards — is your model hardware-aware, or can you achieve similar results with a target-agnostic approach?

Speaker 2:

So, well, when I talk about hardware-aware scaling, I made the example of the network that we developed, which is the one targeted at the STM32 microcontroller, the H7. But we actually ran it on many more devices. Yes, you can make one small network and deploy it anywhere, and it will run faster on a Raspberry Pi than on an H7. But if you want maximum performance, you can also design your network specifically for your Raspberry Pi: you can force it to use however many million operations you want, and in that case you get very high performance at the frame rate you want to guarantee. So it is very, very flexible. But yes, in the GitHub repo you can find a couple of networks that are more or less platform-agnostic — they can fit on a microcontroller, but you can put them wherever you want.

Speaker 1:

Thank you. And another question, from Chris: do you have pointers to references on the hardware-aware scaling? Is that an available tool?

Speaker 2:

Yes. Regarding the hardware-aware scaling, you can check out the paper about our network — you can find it by searching for Xinet; it was published at last year's ICCV — and you can look on GitHub for Micromind, which is the toolkit that we use to develop these networks. There you have a lot of things already pre-programmed, so you can experiment a bit with this hardware-aware scaling approach.

Speaker 1:

Thank you. We can stop here, but Alberto, please take some time to answer Ponam in the chat, who is asking you to share the link to the GitHub. Thanks a lot, Alberto, especially for being the last one, for staying with us and contributing even though it's late — I know you were interested. Thanks a lot for your contribution, it really was different from the others. Thank you, bye.

Speaker 2:

Thank you for having me. Thank you.