Marco Salgado

And then also, whenever you are at the first tile and you want to copy the first element into the back of the array, you also need to make sure that the last tile has already been read, because you would be overwriting its space. So you basically need to guard against those things.

Bryce

But isn't it just like a scan?

Marco Salgado

In what way?

Bryce

That, like, at least some part of it is: I'm just moving information down. I'm just shifting information down. If you think about it, a shift is definitely just a scan, and this is just a shift.

Conor

It's not just a shift. It's like a shift left for one partition and a shift right for the other partition. So you can't do this with a scan, right? Well, explain that to me. If you're doing a one-rotate, the first element goes to the back. Sure, you can consider that a shift, like a shift right. But then every other element is going left. And, like, by definition, how are you gonna get the second piece of information in the first spot with a scan? Welcome to ADSP the podcast, episode 284, recorded on April 23rd, 2026. My name is Conor, and today with my co-host Bryce, we chat with Marco in part two of this three-to-four part interview. In this episode, we do a deep dive on a potential implementation of a GPU rotate. Over to you, Marco. Now, a mini version of the better code GPU rotate presentation that you gave a week ago.

Marco Salgado

Yeah, it was recent, so I still have everything in my head. So basically the idea of this side project was to write a GPU implementation of rotate, doing it in place. And the reason for doing it in place is that you're gonna get the best theoretical performance. A rotate is just moving memory around, so you're going to be bound by the memory bandwidth of the GPU, and the less memory movement you have to do, the better your theoretical performance. If you think about it, say you have an array and you want to rotate it: you could just allocate a buffer of the same size, do the rotation into that other buffer, and then copy the buffer back into the original. That would be the easy way of doing it. But that way you're moving from one array to another, so if your array is of size n, you would have to do four-n data movement, right? By doing it in place, you only have to do two-times-n data movement, because in a rotate you're gonna have to read each element from the array, and then you're gonna have to write it back at another location. That's an amount of work that you cannot get around. So the minimum theoretical movement that you have to do is two times the size of the array, and by doing it in place, you're able to get to that theoretical limit. That would be what we call speed of light: if the throughput of your algorithm is half the memory bandwidth of the GPU, that means you've achieved the best theoretical performance. So that was basically the goal of the project.
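Marco's four-n versus two-n accounting can be sketched on the CPU. This is a plain C++ stand-in, not the actual GPU code; the byte-counting helpers are ours for illustration:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Out-of-place rotate: read n + write n into a scratch buffer, then
// read n + write n to copy back -> 4n bytes of traffic for byte elements.
std::size_t rotate_out_of_place(std::vector<unsigned char>& a, std::size_t k) {
    std::vector<unsigned char> tmp(a.size());
    std::rotate_copy(a.begin(), a.begin() + k, a.end(), tmp.begin()); // n reads + n writes
    std::copy(tmp.begin(), tmp.end(), a.begin());                     // n reads + n writes
    return 4 * a.size();  // total bytes moved
}

// In-place rotate: each element is read once and written once -> 2n bytes,
// the theoretical minimum ("speed of light" is half the memory bandwidth
// when measured in rotated elements per second).
std::size_t rotate_in_place(std::vector<unsigned char>& a, std::size_t k) {
    std::rotate(a.begin(), a.begin() + k, a.end());
    return 2 * a.size();
}
```

Both paths produce the same rotated array; only the memory traffic differs, which is exactly why the in-place version can run at twice the elements-per-second of the scratch-buffer version on a bandwidth-bound machine.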

Bryce

Wait, what why is it half? I'm not sure I understand.

Marco Salgado

Yeah, because you have to, if your array is of size n, you have to read the whole array and you have to write the whole array back. If you're not doing it in place, you have to do four because you would have to read the array, write it into another array, then read the other array and write it back to the original. That would be four. If you do it in place, you have to read the array and then write it back, which is two.

Bryce

Yeah, but the way that we usually count, for something like memcpy, for example, we count memory movement as reads plus writes, but we don't expect it to be half for something like memcpy. I don't understand the half part.

Marco Salgado

I think the memory bandwidth of the GPU is just telling you how many gigabytes per second you can move through the GPU. And in order to read, let's say, a million elements, you're saturating the GPU, whatever bandwidth that is, and in order to write them back, you're saturating it again. So you're doing two trips. That's why you would divide it by half.

Bryce

But that's not usually the performance model that we use for something like memcpy or transform. We expect memcpy to hit like 90% of peak bandwidth, right?

Marco Salgado

Yeah, it hits that 90%. But what I mean is: let's say you have an array of a million elements and it takes you a second to rotate that array. Basically, the number of elements per second that you can rotate is gonna be half the memory bandwidth of the GPU, if you see it in that way.

Bryce

Oh yeah, yeah, yeah, yeah, yeah. Sure, sure. Okay, I'm on board. Yeah.

Marco Salgado

So you're still saturating the GPU, of course. The goal would be to saturate the memory bandwidth.

Bryce

Right, right. What you're saying is the elements per second is gonna be half. I was thinking about this in terms of number of bytes moved.

Marco Salgado

Exactly. So the throughput of the algorithm itself, not necessarily the throughput of the memory movement.

Bryce

Right, right, right. Okay, yeah, sure. I'm on board. Yeah.

Marco Salgado

Okay.

Bryce

Okay, so we take that. And this is assuming that you never have to touch any sort of global memory, or that's your theoretical best case, assuming that you never have to go out of L2 cache.

Marco Salgado

Exactly. So that's basically it: we cannot get better than that. That's the best we can do, and that's what I'm trying to achieve with the implementation and the algorithm. Now that we have that, we can think about how we would rotate an array by just one element. That's the first thing that I did: it's just one element, how would I do that? And if we think about a huge array, because if we're gonna do a GPU implementation, the use case would be huge arrays. So if we have a huge array and we want to rotate by just one element, and we want to do that in place, what we would do, since we're on a GPU, is tile the array so that each CTA, or each block, gets one tile of the array and does whatever it has to do on it. And once we've tiled the array, we're not gonna tile the whole array; we're gonna tile the array except the first element. The first element is the one we're rotating, and it's just one element, so we don't need to tile it. We can just tile the rest of the array, and then each block is gonna read one of the tiles and write it back one element before, because that's how you would be rotating the array. And then the first tile is also the one that's in charge of moving the first element into the back of the array, because that's what you need to do in order to rotate an array, right? You move the first element into the back and you move the rest of the array all the way to the front. And so there's two problems there. The first one is that whenever you are writing a tile back, you are overwriting the tile before you, and it could be that you overwrite it before it has been read by another block, which would cause a wrong result. So you need to guard against that.
And then also, whenever you are at the first tile and you want to copy the first element into the back of the array, you also need to make sure that the last tile has already been read, because you would be overwriting its space. So you basically need to guard against those things.
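A sequential CPU sketch of the tiled rotate-by-one just described (each loop iteration stands in for one CTA; since this version processes tiles front to back in order, the inter-block guards aren't needed here):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// In-place rotate by one, tiled. On the GPU each tile would go to one
// block, with per-tile flags guarding the overwrites; sequentially,
// front-to-back order makes every overwrite safe by construction.
void rotate_by_one_tiled(std::vector<int>& a, std::size_t tile_size) {
    if (a.empty()) return;
    int first = a[0];  // the element being rotated to the back
    // Tile everything except the first element; each tile is written
    // back one position to the left.
    for (std::size_t start = 1; start < a.size(); start += tile_size) {
        std::size_t end = std::min(start + tile_size, a.size());
        std::vector<int> tile(a.begin() + start, a.begin() + end);   // "cache into shared"
        std::copy(tile.begin(), tile.end(), a.begin() + start - 1);  // shift left by one
    }
    a.back() = first;  // the first tile's extra job: first element to the back
}
```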

Bryce

But isn't it just like a scan?

Marco Salgado

In what way?

Bryce

That, like, at least some part of it is: I'm just moving information down. I'm just shifting information down. If you think about it, a shift is definitely just a scan, and this is just a shift.

Conor

It's not just a shift. It's like a shift left for one partition and a shift right for the other partition. So you can't do this with a scan, right? Well, explain that to me. If you're doing a one-rotate, the first element goes to the back. Sure, you can consider that a shift, like a shift right. But then every other element is going left. And, like, by definition, how are you gonna get the second piece of information in the first spot with a scan?

Bryce

Sure, I agree. But in the case of doing a rotate of one, or a rotate by a small number, then it's just a scan plus some special handling for the small number, maybe.

Marco Salgado

But how do you express the memory movement in terms of the scan? Where does the scan come into play? I don't think this is a scan; I think the communication pattern is similar to a scan.

Bryce

But I don't know.

Marco Salgado

Yeah, I think we can get into that; that's what I was gonna get into. So how do you communicate between blocks whether a tile can be overwritten or not? The way I did that is that you have an array in global memory, in a scratch buffer, where every tile has an atomic flag. The steps that a block takes are: it gets a tile, it caches it into shared memory, and once it has cached it into shared memory, it can release the flag, so that the other blocks know that they are safe to overwrite that memory location. Then it needs to poll the previous tile's flag to be sure that it can overwrite the memory location that it is going to write to. And once that flag has been released, it can write the cached tile to the destination. So you're using those global flags to make sure that you're never overwriting a memory location that has not been read already, and you're avoiding a wrong result. Those flags are basically the communication path between the blocks.
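Here is a CPU simulation of that flag protocol, with `std::thread` standing in for CTAs and `std::atomic` flags standing in for the scratch-buffer flags. It is a hedged sketch, not Marco's kernel: it rotates left by exactly one tile and launches one thread per tile, so every tile is resident at once and the deadlock issue Marco raises later cannot arise.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Per-tile flag protocol, simulated. Each "block":
//   1. caches its tile (stand-in for shared memory),
//   2. releases its own flag (tile has been read, safe to overwrite),
//   3. polls the destination tile's flag,
//   4. writes the cached tile to the destination.
void rotate_left_one_tile(std::vector<int>& a, std::size_t tile) {
    const std::size_t n_tiles = a.size() / tile;  // assumes exact tiling
    std::vector<std::atomic<int>> read_flag(n_tiles);
    for (auto& f : read_flag) f.store(0);

    std::vector<std::thread> blocks;
    for (std::size_t i = 0; i < n_tiles; ++i) {
        blocks.emplace_back([&, i] {
            std::vector<int> cache(a.begin() + i * tile, a.begin() + (i + 1) * tile);
            read_flag[i].store(1, std::memory_order_release);  // my tile is read
            std::size_t dst = (i + n_tiles - 1) % n_tiles;     // one tile to the left
            while (!read_flag[dst].load(std::memory_order_acquire)) {}  // guard
            std::copy(cache.begin(), cache.end(), a.begin() + dst * tile);
        });
    }
    for (auto& t : blocks) t.join();
}
```

The release/acquire pair on each flag is what makes the overwrite safe: a block only writes into a tile after the block that owned it has finished reading it.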

Bryce

How many flags do you need?

Marco Salgado

One. So you have one flag per tile, and a tile is 32 kilobytes. So, I don't know, for an array of a gigabyte, let's see.

Bryce

So it's per block.

Marco Salgado

Yeah, for an array of a gigabyte, you would have around 30,000 tiles and 30,000 flags. So the amount of extra memory that you're using there is very small in comparison to the size of the array.

Bryce

Yeah.

Marco Salgado

Yeah, so that's basically the algorithm. But in the presentation I call this the short algorithm, and that is because, instead of a rotate distance of one... actually, I think now it's better if we switch to thinking in terms of tiles instead of elements, because we're always going to be talking about tiles, so we can just say that an array has a certain number of tiles. So imagine that you are now trying to rotate an array by five tiles, and let's say you only launch three blocks, so you can only process three tiles at the same time. Okay, another thing I forgot to say is that in the implementation, you're actually processing the tiles from the back of the array. Since you're moving them to the left, you want to start processing from the back, so that the first block gets the last tile and the second block gets the second-to-last tile. That way, since you're moving the array to the left, you get the best overlap. So, taking that into account, let's say we want to rotate an array by five tiles and we're only launching three blocks. Then we would have a deadlock: the first block gets the last tile, the second block gets the second-to-last tile, the third block gets the third-to-last tile. So you're processing the three last tiles, but then the first block wants to overwrite the memory location five tiles before it, because you're rotating by five tiles. But that tile, five tiles before it, is not being processed by any block, so the first block is just going to wait indefinitely until it can overwrite it. You're basically going to have a deadlock. That's where that algorithm no longer works, and you need another way of doing it.
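That launch-width condition can be stated as a one-line predicate (the function name is ours, not from the talk): each resident block waits on a tile that lies the rotate distance ahead of it, so that tile must itself be held by a resident block.

```cpp
#include <cstddef>

// The simple "short" algorithm deadlocks when the rotate distance,
// measured in tiles, is not covered by the blocks resident at once:
// a resident block would wait forever on a tile no block is processing.
bool short_algorithm_safe(std::size_t rotate_tiles, std::size_t resident_blocks) {
    return rotate_tiles < resident_blocks;
}
```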

Bryce

Okay, and so you have some other approaches. That's interesting. So, depending on the rotate size, the amount of rotation, and, I guess, the amount of concurrency you have on the GPU, you pick a different algorithm.

Marco Salgado

Exactly.

Bryce

Do you do this as, like, a persistent kernel?

Marco Salgado

Yes, so I am launching... well, we can get into that later. In the first implementation, you have a tile size of 32 kilobytes, and you launch as many CTAs, or as many blocks per SM, as you can to use up the shared memory of the SM, which on H100, I think, is six blocks per SM. And then you also have an atomic counter in the scratch buffer, so that you're doing a work-stealing loop: whenever a block wants a new tile, it decrements the atomic counter to get whatever tile it needs to process, and that way you have load balancing built in as well. I don't know, Conor, you said at the beginning that there were some things that were not clear in the presentation, so I would like you to...
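The work-stealing loop can be sketched with a shared atomic counter handing out tile indices from the back of the array (CPU threads stand in for blocks; names are ours):

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// A global atomic counter hands out tile indices from the back, so
// whichever block is free grabs the next tile; load balancing falls
// out for free. Returns the tiles processed (in whatever order).
std::vector<std::size_t> process_tiles(std::size_t n_tiles, std::size_t n_blocks) {
    std::atomic<long long> next(static_cast<long long>(n_tiles) - 1);
    std::vector<std::size_t> processed;
    std::mutex m;  // protects `processed` in this CPU simulation only
    std::vector<std::thread> blocks;
    for (std::size_t b = 0; b < n_blocks; ++b) {
        blocks.emplace_back([&] {
            for (;;) {
                long long t = next.fetch_sub(1);  // steal the next tile index
                if (t < 0) break;                 // no work left
                std::lock_guard<std::mutex> lock(m);
                processed.push_back(static_cast<std::size_t>(t));
            }
        });
    }
    for (auto& t : blocks) t.join();
    return processed;
}
```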

Conor

I was waiting, because Bryce was staring at the ceiling looking like he was searching for religion or something, and so I was waiting for him to either nod his head in agreement or ask for clarification. This stuff made sense. It was later on, when you started making adjustments to the algorithm based on the ncu results, and I can't remember if it was acronyms or if it was words and stuff, and you started tuning and adjusting things and getting closer to speed of light; that's the stuff where I was kind of like, oh, you know, I don't use ncu that often. And admittedly, I think of the people that work at NVIDIA, at least on CUDA code, because obviously there's a bunch of folks that do different stuff, like some engineers are working on tooling, right? So they're not actually authoring kernels or working with any of our GPU-accelerated libraries. We've got lots of folks at NVIDIA, but of the folks that are doing CUDA and kernel-related things, I would guess that there are different tiers of people that work at different levels. The people that work with ncu and that are considered, you know, CUDA ninjas, which I would put you in that camp because you're designing algorithms and using ncu, I think that is a fraction. I don't know, and maybe, Bryce, you've worked longer at NVIDIA, but I primarily work at a higher level, you know, the Thrust level, the CUB level. The number of times that I have actually coded a raw CUDA kernel from scratch is in the low double digits.

Bryce

And you know, you're sadly in the minority. Am I sadly in the minority? More people should, but don't.

Conor

More people should, but don't.

Bryce

More people should not write their own kernels and should just use our libraries. But I see, I thought you were saying something else.

Conor

But admittedly, writing raw kernels is, in my opinion, approaching infinitely harder than just reaching for a CUB reduce or a Thrust reduce.

Bryce

Yeah, and there's a bunch of hardware and software factors, like the disaggregation of ML inference, the disaggregation of the GPU core itself, that are making this more and more challenging.

Conor

Yeah.

Bryce

The GPU core is now this crazy asynchronous thing, and programming it is kind of a nightmare. I should clarify: programming it is kind of a nightmare if you program it the way that we programmed GPUs 10 years ago, because the GPU today is very different. And so we need new abstractions. That's why things like the CuTe DSL, like CUDA Tile, etc., have all come about: because the GPU itself is just different than it was. And so we're sort of struggling with that problem.

Conor

So wait, why do why does uh why is raw kernel authoring uh complicated by what you just said?

Bryce

Because of the traditional view of what you think a processor core is, right? Like, you know, you issue a load instruction, you do some indexing math, then you do an FMA, then you have a store instruction, right? And those are all things that are executed by a processor core, right?

Conor

Yeah.

Bryce

Yeah, that's not how that works, because there is no processor core. There's a load and store unit, there's an ALU that's doing your indexing math, there's an FMA unit that's doing your fused multiply-add, your actual hard math, and you have a special function unit that's doing things like your cosines, your sines, your exponents, etc. Those are all different pieces of hardware, but the abstraction that we've had in software for the last 50 years has been that they're this monolithic thing. In reality, each of these separate operations is executed by a different piece of hardware. And what's happened over the past nine, ten years is we've slowly exposed more and more of that asynchrony to you, the programmer. So now there's a bunch of different ways to issue asynchronous loads and stores within a thread in CUDA. And the way that you issue instructions to the tensor cores that do the matrix multiplication, those are asynchronous operations. So now, within each thread, you're not just like, oh, I do a load, then I do a matmul, then I do a store. It's: you asynchronously launch a load, you asynchronously launch a matmul, you asynchronously launch a store. And then, if you want to get full utilization, you have to overlap the compute and the communication. So it's like, oh, I start launching this load, then I do some compute, and then I launch the store, and by the time I've done that, the load that I launched has landed and I can start doing the next thing. So you have to do this pipelining, and you have to do this thing called warp specialization, where some of your SM cores are doing the dispatching of work to the tensor cores, some are doing compute, and some are just dispatching the I/O operations, et cetera. So now you have to have this tasking model.
And if you're writing CUDA kernels the way that you wrote CUDA kernels 10 years ago, you're either not taking advantage of this, in which case your performance is bad, or your code's really, really ugly and awful. And so we're building new abstractions, things like the CuTe DSL, things like CUDA Tile, et cetera, that abstract over these things and make CUDA programming simple and easy again, and also performance portable.

Conor

Simple and easy again. That's a funny statement, as if at any point CUDA kernel authoring has been an easy endeavor.

Bryce

CUDA kernel authoring has never been easier than it is today, because you can write something like CUDA Tile, which is a very small diff on how you'd write something in NumPy or PyTorch. You can write that, you can write Triton, there's three or four other DSLs like this, where you can write very high-level code and you can get the fastest performance. You can get better performance than you'd get writing the equivalent low-level code.

Conor

I see.

Bryce

And and you can write that code and you can get performance portability.

Conor

So, is this all to say that the work that Marco's doing is suffering by comparison? Because I didn't hear any of that asynchrony stuff. Or is it simple enough, because we're just copying stuff around, that we're not going to suffer from it?

Bryce

You can't easily write what Marco is writing in something like CUDA Tile, because what Marco is writing is fundamentally more of a communication primitive than pure compute. And by communication primitive, I mean, like, one of the main things we talked about in this algorithm is how you know when you can overwrite the incoming tile. So, just like scan or reduce, this is more about how threads communicate and talk to each other. This is a fundamental building block that you need to be able to expose to a higher-level dialect. So I think this is the rare case, or maybe it's not the rare case, of something that you do still need to write in SIMT.

Marco Salgado

But given that the communication is not at the thread level but at the tile level, would it not be possible to still do it in CUDA Tile?

Bryce

It might be. I mean, maybe, maybe.

Marco Salgado

Because that's actually a question that I wanted to ask you. We haven't talked about this yet, but the challenge of the actual implementation is that you have a lot of unaligned memory movement, because the memory bandwidth of a GPU is very sensitive to the alignment and the size of your loads and stores. And in this implementation, you have a lot of unaligned loads and stores, and you need to work around that in specific ways, which is what I talked about in the presentation. So I would be interested to know if something like that is possible in CUDA Tile, where it takes care of maximizing the performance of the memory movement no matter what the alignment is, or other characteristics.

Bryce

Yeah, so initially, hearing the unaligned thing, my first reaction is: is there some way we could pad? And I think the equivalent of padding here would be some form of over-rotating. Could you rotate to the next alignment and then correct? That's just the idea off the top of my head, and I don't think it helps you, because I think it's just as expensive to over-rotate and then correct. But think about something like a transpose. When you decompose a transpose into tiles, you transpose the tile locally and then you transpose the entire tile when you write it back into global memory. So it's two steps: there's a local operation and then there's a global operation. The thing that would be great with rotate would be if you could have the same thing: if you over-rotated and were able to do aligned accesses when you do the load from global memory, and then you had to do some extra work to do a correction. That correction would be the unaligned part, but if that correction only happened at the L2 level, at worst it would hit the L2, then that might be worth it. But I don't know; that's just off the top of my head.

Marco Salgado

I don't know; I think the problem is that now you're doing extra memory movement, so you're getting further away from the theoretical maximum performance. What I did, basically: imagine the rotate distance of one, and imagine that our array is aligned to a sector boundary, so it's 32-byte aligned, which is basically perfect alignment. Then you're one off that perfect alignment whenever you want to copy your first tile to shared memory, and what you can do is just over-copy. You align the tile to the previous sector boundary, which in this case would be the beginning of the array, and then you copy all of that instead of starting from the unaligned position. That's called over-copying: you copy more than you actually need to, but you copy in an aligned fashion, so you get the best performance there. But then, in shared memory, the tile that you're actually interested in is still unaligned. So what I did in that case is that I copied into registers and aligned it in registers, so to speak, and then you can do aligned stores back into global memory. For that you can use an instruction called a funnel shift. Say you want to copy four bytes into registers: what you do is copy the eight bytes that surround those four bytes, because those four bytes are not entirely in one register or the other; they're split between them. So you copy both, and the funnel shift instruction gives you part of those eight bytes: it gives you the four bytes you want out of the eight bytes that you have loaded. That way you can align the data in registers, and then you can do aligned stores back into global memory.
So there are a couple of tricks that you need to do in order to make the global memory movement as good as possible.
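The funnel-shift trick can be emulated in plain C++: load the two aligned 32-bit words that straddle the four bytes you want, and shift their 64-bit concatenation so those four bytes land in one register. On the GPU this maps to a single instruction such as `__funnelshift_r` on the two words.

```cpp
#include <cstdint>

// Extract 4 unaligned bytes from two aligned 32-bit words (little-endian).
// byte_offset in [0, 3] says where the 4-byte window starts inside `lo`.
uint32_t extract_unaligned(uint32_t lo, uint32_t hi, unsigned byte_offset) {
    uint64_t both = (static_cast<uint64_t>(hi) << 32) | lo;  // concatenate hi:lo
    return static_cast<uint32_t>(both >> (8 * byte_offset)); // shift window down
}
```

For example, with `lo = 0x03020100` and `hi = 0x07060504` (bytes 0..7 in memory), an offset of 1 yields bytes 1..4, i.e. `0x04030201`: two aligned loads plus one shift in place of an unaligned load.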

Bryce

So with the over-copying, that's what I'm interested in knowing. With the over-copying thing, how does that help? Because you're just copying more of a tile than you need, but then don't you end up with the starting address still having to be offset by one?

Marco Salgado

What do you mean by the starting address? The address of your tile, or of the memory that you're copying?

Bryce

The memory that you're copying.

Marco Salgado

No. So let's say our array starts at address zero, which is 32-byte aligned, and the tile that we want to copy starts at address one. We forget about that and we copy from address zero. But then in shared memory, the tile is still unaligned; it's still at address one, and that's what we are actually interested in copying.

Bryce

We just copy extra so that we can do the copies asynchronously. Yeah, I'm having an idea; oh, that could be interesting. You may want to look at the TMA im2col mode, the im2col mode specifically, which is what we use for convolutions. What's it called? It's like I-M, then the number two, then col, C-O-L. We use that for convolutions because for convolutions you sort of have to do a similar thing, where you want to access a bunch of stencil points, and you can think of this rotate as sort of being similar. And in particular for the case of the off-by-one, this might be interesting: it will build you, it's hard to get into here without a visual, but it will build you a matrix where it's got your input tiled into shifted columns, and then each one of those columns is going to be aligned. Although, I guess the downside is that it's wasteful of shared memory, right? It has some redundant shared memory. In the case of a rotate of one, it may be an interesting thing to look at. The thing that's maybe interesting to think about here is that the problem you're facing is similar to the problem that stencils would have.

Marco Salgado

Now, when you see these alignment problems, for what data types is it, usually? In the implementation, I basically assumed an array of bytes, because that's the worst-case scenario. So if I get the best performance on an array of bytes, I should be able to get the best performance on any data type.

Bryce

Right. So this is hard, because for something like a stencil problem you don't often have this: for a stencil, I'm usually dealing with FP32 or something, and so if I move over by one, I'm still aligned to a reasonable number of bytes. Whereas for something like string processing, where I'm dealing with a single-byte type, it's a lot easier to get hosed here. But nobody cares about, like, a stencil on a char. Well, I don't know, maybe; I mean, things like word count. So that's actually interesting: if you do something like a transform-reduce word count on int8 data, does it have similar alignment problems? Maybe not... no, I do think it's similar. I think a lot of this is very similar to these stencil-type problems.

Marco Salgado

But do you ever have to do stencils on FP16 or FP8? I mean, sometimes, I don't know if you're using convolutions on the smaller floating-point types.

Bryce

You might be doing convolutions on the smaller types, but I'm specifically thinking about string processing stuff, like word count or other things that would use a stenciled view of text. Interesting. So you said that there was a second algorithm, right? What's the second algorithm?

Marco Salgado

Yeah, I mean, I don't know; do we leave the second algorithm for part two?

Conor

Well, it depends on when you have to go. I needed to go 18 minutes ago. Well, then probably yes, we should save that for part two. We will do part two of this; I mean, we already talked for an hour, so potentially I will split that into two episodes, and we will come back. And I was thinking this whole time, too: how would I do this in parallel? I was like, I would just use a permutation iterator and a copy, and then I was like, oh yeah, that's the point though: that's not in place, that's out of place. Yeah, so the thing that makes this tricky is the in-place part. If you're just doing an out-of-place rotate, it's very easy. Yeah, we should talk more next week; I do very much want to hear more. We'll set something up for next week, maybe not on Thursday, but maybe Tuesday or Wednesday, and then you'll hear part two of the rotate story. We'll also get to hear Bryce's fifth Infinity Stone, you know, AutoCUDA taking over the world. I don't believe that Bryce is better than me at AI, but I'm excited to hear why he thinks he is, folks, so stay tuned for that.

Marco Salgado

And also, if the listener has some questions on the algorithm, I guess it would be a good time to ask them, so that we can maybe cover them next week.

Conor

Oh yeah, there will be links to the GitHub discussion, or you can also reach us on X; I mean, I'm on all the socials: Bluesky, Mastodon. The best place is probably the GitHub discussion, because that is all in one place, but you can add us on socials as well. And if there are questions for Marco, or even if you've got ideas, like, why didn't you consider this, we can definitely chat about those in part two, because it's rare that we record twice on the same topic back to back. So, a rare opportunity to have your question asked and answered, or your comment read, I guess, if it's a comment. Bryce, you looked like you've got one last thing to say.

Bryce

No, no, no, I've got my last thing to say; it's: I'm gonna be in so much trouble.

Conor

All right, thanks, Marco. Be sure to check the show notes, either in your podcast app or at adspthepodcast.com, for links to anything we mentioned in today's episode, as well as a link to a GitHub discussion where you can leave thoughts, comments, and questions.

Bryce

Thanks for listening. We hope you enjoyed, and have a great day. Low quality, high quality: that is the tagline of our podcast.

Conor

It's not the tagline. Our tagline is: chaos with sprinkles of information.