ADSP: Algorithms + Data Structures = Programs
A programming podcast hosted by three software engineers (two at a time) that focuses on algorithms, data structures, programming languages, the latest news in tech, and more. The podcast was initially inspired by Magic Read Along.
Episode 285: GPU Rotate (Part 2)
In this episode, Conor and Bryce chat with Marco Salgado about a potential GPU implementation of the rotate algorithm (Part 2).
Socials
- ADSP: The Podcast: Twitter
- Conor Hoekstra: LinkTree / Bio
- Bryce Adelstein Lelbach: Twitter
About the Guest:
Marco is a software engineer at NVIDIA, where he works on improving the nvCOMP library, which offers fast GPU implementations of multiple data compression formats. For the past couple of months he has been working on a GPU implementation of the rotate algorithm.
Show Notes
Date Recorded: 2026-05-05
Date Released: 2026-05-08
- ADSP Episode 237: Thrust with Jared Hoberock
- ADSP Episode 284: GPU Rotate
- NVIDIA CCCL
- NVIDIA nvCOMP
- NVIDIA Nsight Systems
- NVIDIA Nsight Compute
- NVIDIA CuTe DSL
- NVIDIA CUDA Tile
Intro Song Info
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons — Attribution 3.0 Unported — CC BY 3.0
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8
Marco Salgado: No, no, I I agree with you. It's incredible also how I think this is one of the interesting things and nice things about GPU programming is that an algorithm that can be so simple to understand and so intuitive as a rotate can get so complicated when it comes to implementing it, if you want to achieve the best performance.
Conor: Yeah.
Marco Salgado: That's basically the beauty of GPU programming, that everything is so intricate.
Conor: I mean, that it's definitely a property of GPU programming. But if you think about the implementation of rotate on the CPU, like it's very, very complex as well. Like before I knew about iterator category tag dispatching, you'd think it's just a bunch of swaps, which I guess it is at the end of the day, but depending on the iterator category, it's dispatching to like a bunch of different implementations. Anyway, so just like rotate visually, very simple. But like not just on the GPU, but also on the CPU. You know, it's like it's like the the algorithm rabbit hole that never ends. Welcome to ADSP: The Podcast, episode 285, recorded on May 5th, 2026. My name is Conor, and today with my co-host Bryce, we continue part three of our five-part chat with Marco, our fellow NVIDIAN. In today's episode, we continue our discussion of a GPU implementation of the algorithm rotate. Alright, so we're gonna we're gonna rotate to rotate, and this is actually gonna be episode 285. We'll lose the listeners in a couple weeks when we release this out of order. And so now we are gonna rotate back to our GPU rotate. Do we wanna I guess we'll add the nvCOMP at the end, we'll save the questions for the end, and we'll just pick up where we left off when we recorded two weeks ago. And so where we left off was that we had covered the first implementation of rotate on the GPU, and then we had to pause because Bryce was late. We never we actually I do know that you managed to pick up Ramona because I saw a video that you sent me. And uh so Ramona is safe and sound at home. Anyways, back over to you, Marco. Tell us about the second GPU rotate implementation. Why do we need it?
Bryce: Not only that, I managed to p I did bring flowers.
Conor: Yeah. Anyways, over to you, Marco.
Marco Salgado: Alright, so do you want to do a recap on why we needed a long algorithm and how the short algorithm works?
Conor: Let's do that, yes.
Marco Salgado: The short algorithm, it's called short because you're assuming that the rotate distance is going to be less than the size of the tile that you use for tiling the array for parallelizing the memory movement. And so in the short algorithm, how it works is that each block of threads is going to get a tile starting from the back of the array and moving forwards, and then they get a tile, they cache it into shared memory, they set a flag, flagging that that tile has been cached and another thread block can overwrite that memory location, and then they look at the flag of the tile that they are going to overwrite to make sure that they are not overwriting memory that has not been read yet, and then they can they can store the tile in its destination and move on to the next tile. And so since you're moving the tiles in a spatial order from the back forwards, in a situation where your rotate distance is large enough that you cannot cover that rotate distance with the amount of tiles that you're processing at the same time, you're gonna have a deadlock, because all of your blocks of threads that are waiting for the flags to be released in order to store something are going to be waiting indefinitely, because there are not enough blocks running in order to make that work. And so that's why it's called the short algorithm.
Bryce: This is a very interesting property of this particular algorithm, you know, that you have this linear parameter, how much you're gonna shift by. And if you go over a certain size, then you have to switch to a completely different implementation strategy. And that shift point is like gonna be some unintuitive number that's just like a property of the underlying hardware and the underlying degree of concurrency. I just find that's like a very fascinating property. I don't think that any of our other core algorithms have a similar property, in that they have some linear parameter like this. It's almost like you tell me that there's these two very different implementations, and one could imagine that they're different algorithms, you know. Like in other circumstances where it's not a linear parameter, we would probably have two different named algorithms. But in this case, there's no reason to surface this in the interface, because it's just this like linear parameter. Like why, if it's, you know, a thousand versus a thousand and one, do I need to call a completely different API? It's just fascinating. I I I I really think this is very unique in our parallel algorithms library design. I mean, maybe there's some other analogy that I um uh that you can think of that I can't.
Marco Salgado: No, no, I I agree with you. It's incredible also how I think this is one of the interesting things and nice things about GPU programming is that an algorithm that can be so simple to understand and so intuitive as a rotate can get so complicated when it comes to implementing it if you want to achieve the best performance.
Conor: Yeah.
Marco Salgado: That's basically the beauty of GPU programming, that everything is so intricate.
Conor: Well, I think it it's you can almost I mean, that it's definitely a property of GPU programming, but if you think about the implementation of rotate on the CPU, like it's very, very complex as well. Like before I knew about iterator category tag dispatching, you'd think it's just a bunch of swaps, which I guess it is at the end of the day, but depending on the iterator category, it's dispatching to like a bunch of different implementations, of which, you know, the uh generalization of GCD, which you know I've gotten comments on videos before, is like GCD isn't used for anything. And it's like, well, it's actually used for a lot of stuff, and it's and it's not just on integers. You can think a bit on like sequence lengths. And anyway, so just like rotate visually, very simple, but like not just on the GPU, but also on the CPU. You know, it's it's like the the algorithm rabbit hole that never ends.
Marco Salgado: Yeah, yeah, yeah, for sure. And so, yeah, so that's basically that, and then you need to switch to the long algorithm. And so actually it's very similar, quite similar in terms of the memory movement and the implementation. It's just that now you no longer process the tiles in spatial order, because if you did that, you would deadlock. And so now, let's say we start at the tile that's at the back. Then we know that the tile that's at the back is going to overwrite the tile that's wherever. And then that tile is going to overwrite another tile. And so you start creating this dependency chain of how you need to process tiles in order to minimize the distance between them whenever processing and minimize the possibility of a deadlock. Yeah, so now, of course, now that our rotate distance is large, we also need to tile the part that we are moving into the back. So in the short algorithm, since we knew that our rotate distance was less than the size of a tile, we wouldn't bother with that. We just said, okay, the last tile in the ones that we're processing is the one that takes care of moving the first elements into the back. But now, since our rotate distance is large, we also need to tile all of the elements that go into the back, because that's now a big part of our array, and we would lose a lot of performance if we didn't. And so now in the implementation, those tiles that are going to be rotated into the back are going to be negative tiles. So the tile that starts at the beginning of the array is tile minus one, and then you go to tile minus two, minus three, minus four, until you get to the boundary where the rotate distance is, and then you start with the positive tiles, which start at zero, and so on. And so how the implementation now works is that you create a dependency map, so to speak. So you calculate.

This is basically a mathematical calculation where, if you know the array size, the tile size, and the rotate distance, for any tile, you can mathematically find out which tiles it is going to overwrite. So which memory location it is going to overwrite. And so you do that for every tile. And once you have that, you basically start at tile zero. And let's say tile zero overwrites tile minus one and tile minus two. Then, since you know that it's going to overwrite those, you start at tile zero and then you go on to minus one. And then you see, okay, which tile does tile minus one overwrite? It overwrites tile 30. And so you jump to tile 30. And then tile 30 overwrites tile 50 or whatever. And you do that, and so you have this dependency chain that you follow until you've covered all of the tiles. And so you do this on the CPU, and then you pass this, you copy this array onto the GPU, and so now instead of processing the tiles in spatial order, you process these tiles along the dependency chain. And so that way you're minimizing the possibility of a deadlock. And so this is basically the only change. In terms of the memory movement, it's the same. It's the same in terms of that you cache into shared memory, you free the overwrite flag, and then you look at your dependencies' overwrite flags in order to write to the destination. What changes is how you process the tiles and in which order you do that. But this is also another interesting thing: it's also not impervious to deadlocking. Because for certain array sizes and rotate distances, you can get a dependency chain where the distance between one tile and its furthest dependency that needs to be freed in order to overwrite is actually larger than the number of thread blocks that you have resident on your GPU. So you would again have a deadlock. And so you actually need a third implementation that is then the naive implementation, where you just do it naively.
You just do the rotate by copying the first elements into a scratch buffer.
Bryce: You you lost me on the third one. Uh I don't I don't understand. I'm not seeing the deadlock. Can you explain this to me again?
Marco Salgado: Yeah, so so you understand how now we're using this dependency chain so that we process them in the order that they are going to be needed to be processed, so to speak. Because if tile zero is overwriting tile minus one and tile minus two, for it to be able to overwrite them and not deadlock, after a thread block grabs tile zero, you want the next thread blocks to grab tiles minus one and minus two, so that you free up tile zero and it can move on and store into its location. And so that's the concept of the dependency graph. You get the concept of that?
Bryce: Do you have a visual? Yes, I know I know we're not I know we're not supposed to ask in the podcast, but yeah, yeah, I know.
Marco Salgado: Yeah, the listener is gonna have to visualize it, but I can I can show you.
Bryce: The listener may not be as as tired as I am, so Conor maybe please can give me an exception.
Conor: I mean, I was thinking that, you know how Wikipedia, for at least the sort algorithms, they have all these really nice animations of like, you know, merge sort, bubble sort. An equivalent animation for the two different implementations here that shows the tiles being, you know, copied in the dependency order would probably be like an order of magnitude easier, if you could show that in a podcast, which you cannot. I mean, we could if we had a video format, but we don't right now. So I mean technically this is a good idea. We've talked about it.
Bryce: We've talked about it.
Conor: We have, yeah. Once once uh both Bryce and I retire and, you know, whenever that is, a couple decades, maybe we'll add video.
Marco Salgado: Okay, so you can see you can see the screen. Okay, so I guess basically I can describe what we're seeing. So we're seeing an array that has 10 tiles, where we want to rotate it by three and a half tiles.
Bryce: Right.
Marco Salgado: And so how we do this is that we first create this dependency table where, if you think of it as a hash map, the tile number is going to be the key and the value is going to be a list of the tiles that that tile overwrites. So for example, since tile zero is always going to be going into the front of the array, because it's the first one after our rotate distance, it is always going to be overwriting tile minus one, which is the first tile at the beginning of the array. And so the key would be tile zero and the value would be tile minus one. And so if we do this for all of our tiles, we now have our dependency map, and we now start the dependency chain in tile zero. And so tile zero, as we've said, overwrites tile minus one. And so we add that one into our dependency chain. And then we look at tile minus one, and then from accessing the hash map, we see that it overwrites tile three in our example, and so we add tile three to our processing order, or to our dependency graph. And then we look at tile three, and tile three overwrites tiles minus four and tile zero. And so we add minus four, but since tile zero has already been processed earlier, we don't need to add it again, so we don't add it. And then we look at tile minus four and we do that on and on.
Bryce: What does processing mean? So if zero depends on negative one, then how do you process zero before negative one?
Marco Salgado: No, no, so well, zero depends on negative one in the sense that, in order for the thread block that has grabbed tile zero to be able to continue with its processing, it needs some other thread block to have grabbed tile minus one and have cached it. Because if not, it's going to overwrite whatever was here.
Bryce: So after you grab tile zero, then uh you're waiting for somebody to grab tile negative one?
Marco Salgado: Exactly.
Bryce: Okay, and so so this is just the processing order is to make this go faster.
Marco Salgado: Well, it's to make this go faster, but also to make it not deadlock.
Unknown: Yeah.
Marco Salgado: Because imagine in our example that we only had three processors. And let's say we started from the front, just to make it simple. Then we would grab tiles minus one, minus two, minus three, but tile minus one would be wanting to overwrite something that no other processor is grabbing. So it would deadlock. So it has that twofold purpose.
Bryce: When you build this processing order, there's a natural pattern to it. So you don't have to actually build a list, right? You you can just encode the pattern.
Marco Salgado: No, I do build a list. It would be nice. This is something that I want to investigate a bit further, yeah, yeah: whether there's a way of mathematically calculating the processing order without actually calculating it.
Bryce: I I wonder if the processing order, if you can make it like a lazy iterator, so that you generate the list on the fly. Conor, con Conor, Conor maybe can can think about the type of lazy iterator that would spawn this sort of sequence. He's cramping on it, folks.
Conor: I don't fully well, I raised an eyebrow when you asked that initially, and I was thinking in my head, I don't fully understand what you're saying.
Bryce: So so so this this sequence, the sequence of the processing order: 0, negative 1, 3, negative 4, 6, 2, negative 3, 5, 1, negative 2, 4. Like come up with some lazy iterator invocation that expands to that.
Conor: Um, I mean, what's the goal of doing this lazily?
Marco Salgado: Well, the the thing is that right now, in the actual implementation, you calculate this processing order on the CPU, and then you allocate an array on the GPU that contains this processing order. Yeah, and that's what you use so that the thread blocks grab tiles. They grab tiles according to this processing order. And it would be ideal if you didn't have to do that, and a thread block could say, okay, right now I want to process the fifth tile along the processing order. And it would be nice if the thread block had a way of calculating which tile is the fifth one along the processing order without having to go through all of this, because then this would be super costly.
Conor: I mean, and and and it's pretty simple the way that this is generated, in that once you have your dependency table of like key value pairs, you're just following it starting at zero, right? You go from the zero key to the zero value, which is then gonna be your next key that you look up.
Bryce: But it does get a little complicated once you get to the end of this dependency graph, because then you have some entries that have two in the table. Like there's negative four, zero; there's zero.
Marco Salgado: There's very few that have two. In real implementations, most tiles would be overwriting two tiles.
Bryce: Why is that?
Marco Salgado: Well, okay, maybe maybe maybe not most.
Bryce: I think probably But wait, what why why are you why are you ever overwriting two tiles? Aren't you just ever overwriting one tile?
Marco Salgado: No, because here, for example, since our rotate distance is three and a half tiles, the last negative tile is going to be partial. And so whoever overwrites it is not going to overwrite it fully, and so it's gonna jump into two tiles.
Bryce: For right now, I think if you limit yourself to the case of no partial tiles, then I think it becomes easier to imagine how you could write a lazy iterator for this.
Marco Salgado: Yeah, and it probably doesn't affect generalizability, I would guess.
Bryce: Yeah. Come on, Conor, crank crank out some crank out some uh APL or something.
Marco Salgado: The other problem is also that you don't want to have to generate the dependency table on the GPU either. Because right now we only have 10 tiles, but you have to imagine that, for an array of a gigabyte, you have tens of thousands of tiles, and so generating this dependency table would take more time than actually doing the rotate. And so that's kind of the problem.
Unknown: So wait, you you said you don't want to do that on the GPU?
Marco Salgado: No. So the processing order generation and the dependency table generation take place on the CPU. And then the array that is the processing order, you copy it to the GPU, and then you do the actual rotation on the GPU.
Conor: And you were saying you want to keep the dependency table generation on the CPU, but the...
Marco Salgado: No, I want to find a way of not having to generate the dependency table. Let's say I want to I I want to find a way where a thread block says it wants to process the fifth tile along the processing order, which in our case is tile six. And the question is, would there be a way of calculating that the fifth tile along the processing order, if we start at zero, is tile six, without generating the full dependency table, or the partial dependency table that we would need up until that point, and walking along the dependency graph.
Conor: That's yeah, I see, I see.
Marco Salgado: It seems like it could be possible, but it's hard to kind of find the math for it right now.
Bryce: I I I I lack brain cells right now to be able to think through this. However, I think we should both think about this. I said earlier I can't think of an algorithm that's sort of similar to this, but I did just think of an algorithm that I don't think is similar, but maybe it's in another family. And the algorithm is transpose. And the similarity that transpose has to rotate is that both algorithms are purely data movement, and both algorithms have like some data reuse pattern, but like at the L2 level, they would have some data reuse pattern. However, both algorithms only ever load each element once and store each element once from like global memory. Now, the other thing about transpose that I think's a little bit interesting is, when you transpose, you sort of do like a local transpose and then you do like a, you know, a global transpose. Like you load a tile, you transpose within the tile, and then you like write back that tile to some new transpose location. I don't know. Something about that pattern sort of reminded me of rotate. I felt like maybe there were some similarities there. We don't have like a transpose algorithm in CUB, just like a pure transpose, but um, can you do the transpose in place?
Marco Salgado: I'm not sure about that. I don't know if the usual implementations do it in place or if they just allocate an output array. Because I think it's hard to imagine that it could be done for large enough matrices.
Bryce: Well, I mean, actually, I think if you took a similar approach, right? Like if, for any given tile, I know where that tile is gonna go, then it's the same thing, the same dependency thing can be done, I think.
Marco Salgado: Yeah, but the problem is, because I I thought about this also for doing a bit-plane transpose, which is very similar to a normal one, where, if you have an array of characters, you're then transposing them so that you now have the most significant bit of the characters one after the other, then the second most significant bit, and so on. And in that case, you basically have an infinite dependency chain where everything depends on everything, and so you always would have a deadlock, because you wouldn't have enough parallelism to cover everything, basically. I don't know if that would be the case with the normal transpose.
Bryce: I don't think so. I think in the case of a 2D matrix transpose, if I break everything up into tiles, the ith jth tile is gonna be written into the jth ith tile. And so if I load tile ij, then I have to wait for somebody to load tile ji. And your dependency order thing may help with that.
Marco Salgado: Yeah, actually it seems like uh it seems like the tiles would be just dependent upon one another. And so that would actually make it very easy, because you would actually never deadlock. You would be able to just load both tiles and then just do it that way.
Bryce: Yeah, yeah. So maybe it's simpler than that. Maybe you can just do it that way.
Marco Salgado: Hmm, nice.
Bryce: Yeah, yeah, that's a good point. Maybe you could just do it that way. Interesting. Conor, you've been very quiet.
Conor: No, no, I've been thinking about the processing order thing, and uh maybe we should have a separate episode on: is there an algorithm to do this? But I I feel like you might need to put more constraints in place, right? And also too, it's unclear, right? Because when there is a partial tile, you're always gonna have the case where you've got a single key pointing to multiple values, right? And that's what makes it tricky, right? Well, it's one of the things that makes it tricky, because, like, if every key only points to one value and all of those values are unique, then you're guaranteed that everything is one to one, and you don't need to worry about, like, has this already been covered before, in the case where, you know, we have a key here that's three pointing to negative four, zero, like the zero already happened at the beginning, and this one has two. Anyways, I I feel like for a simpler version of this problem, it's definitely possible. But anyways, like maybe we'll defer that, because I'm sure we've lost the listener in between bouncing around from GPU rotate, implementation number two, and transpose back to rotate. We're at the, I don't know, almost half hour mark of this part of this recording. Should we should we, I don't know, slice this now? Be sure to check these show notes either in your podcast app or at adspthepodcast.com for links to anything we mentioned in today's episode, as well as a link to a GitHub discussion where you can leave thoughts, comments, and questions. Thanks for listening. We hope you enjoyed and have a great day.
Bryce: Low quality, high quality, that is the tagline of our podcast.
Conor: That's not the tagline. Our tagline is chaos with sprinkles of information.