EDGE AI POD
Discover the cutting-edge world of energy-efficient machine learning, edge AI, hardware accelerators, software algorithms, and real-world use cases with this podcast feed covering all things from the world's largest EDGE AI community.
It features shows like EDGE AI Talks and EDGE AI Blueprints, as well as EDGE AI FOUNDATION event talks on a range of research, product, and business topics.
Join us to stay informed and inspired!
Stochastic Training for Side-Channel Resilient AI
Protecting valuable AI models from theft is becoming a critical concern as more computation moves to edge devices. This fascinating exploration reveals how sophisticated attackers can extract proprietary neural networks directly from hardware through side-channel attacks - not as theoretical possibilities, but as practical demonstrations on devices from major manufacturers including Nvidia, ARM, NXP, and Google's Coral TPUs.
The speakers present a novel approach to safeguarding existing hardware without requiring new chip designs or access to proprietary compilers. By leveraging the inherent randomness in neural network training, they demonstrate how training multiple versions of the same model and unpredictably switching between them during inference can significantly reduce vulnerability to these attacks.
Most impressively, they overcome the limitations of edge TPUs by cleverly repurposing ReLU activation functions to emulate conditional logic on hardware that lacks native support for control flow. This allows implementation of security measures on devices that would otherwise be impossible to modify. Their technique achieves approximately 50% reduction in side-channel leakage with minimal impact on model accuracy.
The presentation walks through the technical implementation details, showing how layer-wise parameter selection can provide quadratic security improvements compared to whole-model switching approaches. For anyone working with AI deployment on edge devices, this represents a critical advancement in protecting intellectual property and preventing system compromise through model extraction.
Try implementing this stochastic training approach on your edge AI systems today to enhance security against physical attacks. Your valuable AI models deserve protection as they move closer to end users and potentially hostile environments.
Learn more about the EDGE AI FOUNDATION - edgeaifoundation.org
Speaker 1: So yesterday I talked about a specific attack, a side-channel attack, that can leak a design secret, enable reverse engineering of a design, and allow stealing of a trained AI model from an edge device. And these are not theoretical attacks; these are practical attacks that I've shown in my lab on many devices, including Nvidia devices, ARM devices, NXP devices, as well as, most recently, Google Coral Edge TPUs. And I've discussed one solution that requires fabricating a new type of hardware that can mitigate such side-channel attacks when adversaries have hands-on access and can measure those devices. So the question that I got was: what about existing devices that are out there, that are already deployed or that are being produced? Can we improve the security of those devices, even if only slightly, so that we improve the security at least for existing solutions? In this talk I'm going to say yes to that question. I'll show a way to change the training so that devices become more resilient against these kinds of attacks, and I will show you a solution that increases the memory size linearly to achieve a quadratic improvement in side-channel resilience. I've already discussed why machine learning models are valuable and why we may want to keep them proprietary: we pay a lot of money to develop those models and, moreover, once you leak device details and a trained AI model, your system becomes more vulnerable to other types of attacks, which is especially critical for mission-critical or safety-critical systems. So we want to keep machine learning models secure, safe, and private. There are so many attacks published; don't trust my word, trust all these great researchers out there. And I'll be talking not about those attacks, but about a defense that you can deploy on your current solutions, including the devices that you see out there.
Speaker 1: So the problem with previous approaches, including my own work, was that we either do this in hardware, in software, or in a hardware-software co-design. If you do it in a hardware-driven way, you need to tape out a new chip, which you don't want to do just to add security. If you want to do it in software, then you need to be able to support these weird low-level arithmetic functions that you don't have in your native instruction set. So you essentially cannot use these existing solutions on existing edge TPUs; that's why all these developed defenses are meaningless for existing TPUs. So here is what we did. We took this Google Coral Edge TPU, which we've shown to be vulnerable to side-channel attacks. It's quite a good device; it can support various applications, but it doesn't have any side-channel defense, obviously, just like any other device out there.
Speaker 1: The problem with adding defenses here is that you have no control over the hardware; the hardware is proprietary. You have no control over the compiler; the compiler is proprietary. You have no control over the instruction set architecture; the instruction set architecture is proprietary. So the source code you can run on this is limited to TensorFlow or TensorFlow Lite. You have some control over the source code, but once you click a button, you don't know what happens after that. Right, so can we put some level of security into those devices? That's what we did. We developed a practical solution that can work on such devices so that you can improve their security against sophisticated side-channel attacks, and I'll leave it to Dr. Anuj Dubey to present the rest of the slides. Thanks, Aydin.
Speaker 2: Hello everyone, I'm Anuj, and before I begin: because I work for Amazon as a security engineer, whatever I present today reflects only my own opinions, not Amazon's in any way. All right, cool. So now that we've set the context, we have this device over which we have very little control, and we still want to provide some kind of side-channel protection without changing the hardware. So how do we do this? Am I audible?
Speaker 1:We need to speak up, check, check.
Speaker 2: Am I good? All right, cool. Okay, so some preliminaries. I don't think I need to explain this in too much detail. How do we train neural networks? We create some kind of neural network architecture with a certain number of layers and some number of nodes in each layer, and then we initialize the parameters, either randomly or based on some fixed algorithm. Then we simply run feed-forward propagation with those initialized parameters, get the outputs, compare them with the ground truth, and compute the loss. Then we differentiate the loss, or in other words compute the gradients, propagate them back, and tune the parameters until we get a good enough accuracy.
Speaker 2: There are many loss functions available; here I'm just showing one particular loss function. The main point I wanted to make is that the product of training is these model parameters. That is what we are tuning. Think of them like a bunch of knobs that we keep tuning until we get a specific accuracy. So it's these model parameters that we really care about when we are training, and this will come into the picture later. Now, that was the code, or the algorithm.
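For readers who want this spelled out, here is a minimal sketch of the training loop just described, written with TensorFlow/Keras. The architecture, loss, and optimizer are placeholders chosen for illustration, not the models used in the talk.

```python
import tensorflow as tf

# Illustrative architecture: some layers, some nodes per layer (placeholders).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)   # feed-forward propagation
        loss = loss_fn(y, logits)          # compare with the ground truth
    # Differentiate the loss, i.e. compute the gradients, and propagate them back.
    grads = tape.gradient(loss, model.trainable_variables)
    # Tune the parameters (the "knobs") until the accuracy is good enough.
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```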
Speaker 2: But if you want to visualize the process of training, it's essentially like this high-dimensional surface that represents the loss function, and when I'm training, all I'm doing is starting at a particular point on that surface; consider this three-dimensional figure for easy visualization. We can start at any of those points. Is this working? All right, cool. So we can start at a point theta, and then we compute those gradients, and the expectation is that slowly we will, you know, converge into some kind of local minimum. If you do this again, it may end up in a different local minimum, right? And the reason is that this process has some inherent randomness in it. We shuffle the batches, we can initialize these parameters randomly or based on some fixed algorithm, but that would still have some kind of randomness. So, because of all this randomness, we end up with different model parameters each time we train the same machine learning model, and that is crucial. That is important. Why? Because, as a security engineer, I'm always hungry for randomness, right? If I have randomness, I can somehow use it in some kind of countermeasure that adds side-channel protection. You can read the side-channel protection literature and you'll see that most works require some form of random number generator, because randomness makes it difficult for an attacker to guess what is actually happening in the hardware, right? So that was the whole point. That was my initial intuition.
Speaker 2: Maybe try to use this stochasticity in the training. Now how do we do that? Well, now that we have the stochasticity, we need to propose something, right? And here's my proposal. On the left-hand side, I'm just showing how we typically perform neural network training and inference. After training, we get this set of parameters W0, W1, W2, and then we use the same set of parameters during inference. And here's my proposal for secure inference.
Speaker 2: So we don't train the machine learning model once, but many times. In this example, I'm showing that I trained it three times. The superscript represents the training number, and the subscript represents the layer of the parameters. That was the training, so we train it three times. Then, when we move to inference, we create this randomizer function. I'll talk about it in detail, but it is essentially an if condition. It takes as input some random number, r, and based on that value it will choose one of these three models for each inference. So even if you try to infer the same image twice, you may end up using a different set of model parameters. And that's how I'm trying to use that stochasticity: because I trained the machine learning model multiple times, I don't have to do anything special; it will automatically be randomly tuned, and then all I'm doing is picking and choosing randomly during inference at runtime. Great, so we have a solution.
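As a rough sketch of that secure-inference proposal, expressed in ordinary host-side Python (my own illustration with placeholder names, not the authors' implementation), the idea boils down to keeping M independently trained parameter sets and drawing a fresh random index for every inference.

```python
import random
import numpy as np

# Hypothetical setup: `trained_weight_sets` holds the results of training the
# same architecture M times, e.g.
#   trained_weight_sets = [m.get_weights() for m in (model_0, model_1, model_2)]

def secure_inference(model, trained_weight_sets, x):
    # Randomizer function: pick one of the M trained parameter sets at random.
    r = random.randrange(len(trained_weight_sets))
    model.set_weights(trained_weight_sets[r])
    # Even two inferences on the same image may use different parameters.
    return model(np.expand_dims(x, 0), training=False)
```

On a general-purpose CPU this host-side branching would already be the end of the story.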
Speaker 2: What's the problem? Well, the problem is that the edge TPU we're using is built specifically for machine learning, right? So this is not like a regular CPU where you can write if-else conditions and just execute them on the hardware. It's very specific hardware with something called a systolic array, you know, and it completely depends on the schedule that the compiler creates. The compiler creates some kind of schedule and then pushes all that data to the systolic array, and the systolic array crunches those numbers in a very, very deterministic way, because there is no non-determinism in machine learning models, right? When we're inferring, it's the same thing that we just do many, many times.
Speaker 2: So that was the first problem: how do we really create that randomizer function? And after scratching my head for many, many months, this was my second intuition. If you look at the ReLU function, our favorite friend, the most popular activation function that we use, ReLU is essentially an if condition, right? I mean, that's essentially what ReLU is. And my intuition was: can we use this if condition to implement what I was proposing in the previous slide? Well, as it turns out, we can, and we'll do that in the next slide.
Speaker 2: So let's try to use the ReLU function to build an if-else condition. Let's start with some number, r, which can randomly be either minus 1 or plus 1, with 50-50 probability. Then we pass it through a ReLU function. What I'm showing in these curly braces are the possibilities: if r was minus 1, the output of ReLU would be 0; if r was plus 1, the output of ReLU would be 1.
Speaker 2: What do we do next? We multiply this output with one of the weight tensors that I obtained from training. So I'm doing multiple trainings, and from one of them I take a weight tensor and multiply it with this output of the ReLU function. What happens? If r was minus 1, I end up with 0; if r was plus 1, I end up with W1, the actual weight tensor. We do the same thing in a parallel path, but there I flip the minus 1 and plus 1: I multiply r by minus 1, so minus 1 becomes plus 1 and plus 1 becomes minus 1. Then I repeat the same process on the second weight tensor, let's say W2, and, as you can see, the 0 that was in the first position will now be W2, because I flipped the ones and the minus ones, and the second position will be 0. The solution is slowly starting to come together. So now I have these two tensors which, based on the random number r, are either W1 or 0, and either W2 or 0.
Speaker 2: So what's the final step? I just need to add these two tensors. The moment I add them, I get this emulated if-else condition built from ReLU, which, for minus 1, outputs W2, and for plus 1, outputs W1. Okay, that's essentially the function that I created in this circuit. So that was the meat of the talk, the main part of the presentation. Do we still have a lot of time? Perfect, cool.
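Here is a small numerical sketch of that ReLU-emulated if-else (my own illustration; variable names are not from the slides). With r drawn as minus 1 or plus 1, the sum ReLU(r)*W1 + ReLU(-r)*W2 selects W1 when r is plus 1 and W2 when r is minus 1, using only the multiply-adds and ReLUs the TPU natively supports.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def select_with_relu(r, w1, w2):
    # r = +1: relu(+1) = 1, relu(-1) = 0  ->  output is w1
    # r = -1: relu(-1) = 0, relu(+1) = 1  ->  output is w2
    return relu(r) * w1 + relu(-r) * w2

w1 = np.array([1.0, 2.0, 3.0])
w2 = np.array([4.0, 5.0, 6.0])
print(select_with_relu(+1, w1, w2))  # [1. 2. 3.]  -> picked w1
print(select_with_relu(-1, w1, w2))  # [4. 5. 6.]  -> picked w2
```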
Speaker 2: So let's talk about what we did next. Now that we've built this ReLU trick, we know we can execute if-else conditions on a TPU that does not natively support any kind of control-flow instructions, and there are multiple things we can do. The first is what I proposed: we train the machine learning model multiple times, we get a set of weight tensors, and then I pick and choose one based on this emulated if-else condition that I built. That would mean, if I trained the machine learning model M times, I would have M possibilities. If an attacker tries to conduct multiple experiments, by definition one out of M times the same model will repeat. And this is important. Why? Because I want to increase this M: if there are many possibilities, it's difficult for the attacker to guess which were the correct model parameters.
Speaker 2: So this is where we started to see if we can make this better: can we reduce the randomness cost and still provide the same level of security? And that's where we move from choosing between entire models to choosing between layers. I'm essentially making the granularity of the selection finer: instead of choosing all the model parameters together, among the three models I trained, at runtime I'm going to choose the parameters per layer. So for the first layer, I may end up choosing the parameters from the 0th model; in the second layer, I may end up choosing the first one. This increases the number of permutations. Why? Because I still have the same number of trained models, but because I'm now working at layer granularity, it's not M anymore, it becomes M to the power N, with M trained models and N layers, so the attacker will find it much more difficult to attack.
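To make the combinatorics concrete, here is a rough counting sketch (my own illustration with placeholder numbers): with M trained models and N layers, layer-wise selection yields M**N possible parameter combinations instead of M.

```python
import random

M, N = 3, 4  # hypothetical: 3 trained models, 4 layers

def pick_layerwise(layer_params):
    """layer_params[m][l] = weight tensor of layer l from training run m.
    Independently choose, per layer, which training run's parameters to use."""
    return [layer_params[random.randrange(M)][l] for l in range(N)]

print(f"model-wise switching: {M} possible parameter sets")
print(f"layer-wise switching: {M**N} possible parameter sets")  # 81 vs. 3
```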
Speaker 2: Well, but this cannot work in a straightforward fashion, right? Because each machine learning model was trained with all of its layers together, I cannot just mix layer parameters from multiple models randomly. I tried doing that, and it decreased the accuracy. Why? Because the machine learning model was trained assuming a certain set of parameters in each layer. So I cannot just randomly swap parameters, which means I now also have to train the machine learning model to make it aware that I'm going to do this at inference, which is where the secure training comes in.
Speaker 2: The stochastic secure training of my title comes into the picture here, and it was not very difficult. I mean, the code remains the same; it's still the same algorithm. What happens during training now is that, based on this value of r, different layers may receive the propagated gradients and automatically tune themselves to still work. Perfect. And because I used ReLU and not some exotic if-else condition, this remains differentiable, right? So I don't really have to do anything special for backpropagation: because I'm using ReLUs, I can propagate the gradients to any of these multiple trained parameter sets within the same model.
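Below is a minimal sketch, in the spirit of the talk, of how the ReLU-based selection could be dropped into a forward pass so that gradients flow to whichever parameter copy was selected on that step; the layer class and variable names are placeholders of mine, not the actual training code.

```python
import tensorflow as tf

class StochasticDense(tf.keras.layers.Layer):
    """Dense layer holding two parameter copies; each call picks one via the
    ReLU-emulated if-else, so gradients reach whichever copy was used."""

    def __init__(self, units):
        super().__init__()
        self.units = units

    def build(self, input_shape):
        shape = (int(input_shape[-1]), self.units)
        self.w1 = self.add_weight(shape=shape, initializer="glorot_uniform")
        self.w2 = self.add_weight(shape=shape, initializer="glorot_uniform")

    def call(self, x):
        # r is randomly -1 or +1 on every forward pass.
        r = tf.where(tf.random.uniform([]) < 0.5, -1.0, 1.0)
        # ReLU-emulated selection: w1 when r = +1, w2 when r = -1.
        w = tf.nn.relu(r) * self.w1 + tf.nn.relu(-r) * self.w2
        return tf.matmul(x, w)
```

Because the selection is built only from ReLU and multiply-adds, the whole forward pass stays differentiable end to end.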
Speaker 2: And just, you know, some accuracy numbers. I did not use all the bells and whistles to try to pump the accuracy all the way up, but if we take this baseline, the model-wise training, where I'm swapping entire models, is roughly around the same; there's a slight decrease, but it's not much. And the layer-wise models are along the same lines. So there was not a lot of decrease in accuracy after using the secure training method or the model-wise training method, simply because I'm still training, right? It's just that I'm not training one set of parameters, but multiple sets of parameters, so I'm not going to see a large drop in accuracy, and we can also increase it by increasing the number of epochs.
Speaker 2: This was our side-channel setup. This is the XYZ table; that's the oscilloscope. This chip is the Edge TPU; well, this entire board is the development board shipped by Google, and somewhere here is the Edge TPU. And this is the side-channel setup that we used to capture the power traces. Why? Because we want to conduct this test, called test vector leakage assessment, which basically tells you how much side-channel leakage is present in the piece of hardware I'm evaluating.
Speaker 2: And the simple idea is that you run two tests, one with fixed plaintexts and one with random plaintexts, and then you check whether those are statistically different. The plot should look something like this if it's secure, staying between the thresholds of plus or minus 4.5, or it's going to look like this if it's insecure. All the places where the t-scores cross those thresholds are points where the device could be attacked. And I just wanted to show you the results here without any defense. The lower plot is the t-score plot and the upper plot is the power side-channel trace. You can see that the t-score, the leakage, is somewhere around 140 to 150 without any defense, and once I apply the defense I just described, it drops to nearly half, around 70.
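For reference, here is a minimal sketch of the fixed-versus-random Welch t-test behind TVLA (the trace arrays and names are placeholders); any sample point where the absolute t-score exceeds 4.5 is treated as evidence of leakage.

```python
import numpy as np

def tvla_t_scores(fixed_traces, random_traces):
    """Welch's t-test per sample point between power traces captured with
    fixed inputs and with random inputs; each array is (num_traces, num_samples)."""
    m1, m2 = fixed_traces.mean(axis=0), random_traces.mean(axis=0)
    v1, v2 = fixed_traces.var(axis=0, ddof=1), random_traces.var(axis=0, ddof=1)
    n1, n2 = len(fixed_traces), len(random_traces)
    return (m1 - m2) / np.sqrt(v1 / n1 + v2 / n2)

# Sample points with |t| > 4.5 are flagged as leaking, e.g.:
# leaky_points = np.where(np.abs(tvla_t_scores(fixed, rand)) > 4.5)[0]
```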
Speaker 2: So well, there's still leakage, right, because the t-scores still cross 4.5, but it's less than it was before. So if we cannot change the hardware and we cannot change the compiler and we only have the source code to change, this is one way we can achieve some degree of side-channel security. Yeah, this was the final slide. To conclude: machine learning models, as my advisor already explained, are vulnerable to side-channel attacks, and we need to invest our resources into developing more efficient and robust techniques for side-channel defense, because AI is moving toward the edge as the days pass by, and if it's on the edge, it's physically accessible and anybody can run these attacks. So, yeah, that was more or less my talk. Thank you. Thanks very much.