So, I think starting from Ampere, you have these asynchronous loads that allow you to load from global memory to shared memory without having to go through registers. Your threads just issue those loads and can then move on to doing something else, and later you synchronize to make sure the copy has actually happened. Using that is a very powerful way of making sure that you're saturating the memory bandwidth. And apart from that, it's just making sure that you're not using too many instructions and being too inefficient, because that can also hurt performance.
Conor: My name is Conor, and today, with my co-host Bryce, we continue our chat with Marco, a fellow NVIDIAn. In this episode, we chat about profiling with Nsight Compute, the GPU rotate algorithm, and much more. We'll throw the nvCOMP explanation and the listener questions at the end, but let's start this off by talking about, and I guess it doesn't need to be specifically for GPU rotate, but as I mentioned in episode 284, I believe it was: in part of this presentation you started showing the results of running Nsight Compute on these algorithms, and then you started making adjustments to your algorithm based on what you saw from the profiler. I'm not sure how well that's going to translate to a podcast format, but selfishly, I'm interested to hear that again, and then also to talk about just general tips and techniques for when you're trying to design an algorithm to run efficiently, at the speed of light, on a GPU. What are the low-hanging-fruit things to look at? What are the non-low-hanging-fruit things? And we'll see where we go from there.
Marco Salgado: Yeah, I would say the first thing is measuring, or finding out, what the speed of light is, because that depends on the algorithm. But most algorithms, or most implementations, are going to be memory-throughput bound, and so the thing keeping you from getting faster is whether you saturate the memory throughput of the GPU.

Usually you start with your initial implementation; that's how I started. I had my idea of how I would do this, I implemented that idea in a simple way, and then I added tests to make sure that the simple way worked. Once I had a working implementation, that's when I started optimizing. For me, that's a very good way of doing it: start with something simple, and then go step by step and try to optimize it and make it better. So I had my initial implementation, ran Nsight Compute on it, and I knew that what I wanted to do was saturate the memory bandwidth. That was going to be the speed of light and the limiting factor.

In terms of the GPU, whenever you're doing memory operations, you want to make sure that you're doing wide memory operations, so that every thread is accessing or loading as many bytes as possible in one request. For that, you need to make sure the alignment is correct. You also want to make sure that you have enough threads loading stuff at the same time, because that's what you need in order to have enough memory movement in flight to actually saturate the memory bandwidth. And when you're storing, you also need to do wide stores, and have those stores aligned. So, in terms of memory movement, it's always a game of adapting your algorithm so that you can use the specific alignment and very wide memory movement, so that you can actually saturate the GPU.

For me, it was basically that: just looking at Nsight Compute, which, to be honest, is the best profiler I've used in terms of CPU/GPU profiling. You have a memory workload analysis section where it tells you how wide your accesses are, and how many bytes you're moving compared to how many bytes you actually need. Because it could be the case that it looks like you're saturating the memory bandwidth of the GPU, but, let's say your array is only one gigabyte, and due to how you're accessing the array, the amount of memory that you're moving is actually 10 gigabytes. Then it could look like your algorithm is performing the best it can, but it actually isn't, because it's moving around way more memory than it needs to.
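As a concrete illustration of the wide, aligned accesses Marco describes, here is a minimal sketch (not code from the episode; the kernel names and the `float4` choice are just for illustration):

```cuda
#include <cstddef>

// Narrow: each thread loads and stores 4 bytes per instruction.
__global__ void scale_narrow(const float* in, float* out, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Wide: each thread moves 16 bytes per instruction via float4. This
// compiles to 128-bit loads/stores, but only if the pointers are
// 16-byte aligned and the element count is a multiple of 4, which the
// caller must guarantee (n4 counts float4 elements).
__global__ void scale_wide(const float4* in, float4* out, size_t n4) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];
        v.x *= 2.0f; v.y *= 2.0f; v.z *= 2.0f; v.w *= 2.0f;
        out[i] = v;
    }
}
```

Fewer, wider requests mean fewer instructions issued for the same number of bytes, which makes it easier to keep enough memory traffic in flight to saturate bandwidth.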
Bryce: Do you ever use the Nsight Compute memory workload analysis chart? That's my favorite tool for this.
Marco Salgado: Yeah, 100%. So there, on the GPU, you basically have these things called sectors, which are the granularity at which the GPU moves memory around: a sector is 32 bytes, and a cache line is four sectors. And in this memory workload analysis you get a metric called sectors per request, which tells you, per warp-level request, how many sectors you're moving around and how much of that you actually need. You want to maximize that, as long as you're actually using the bytes in those sectors.
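To put rough numbers on that (a back-of-the-envelope example, not from the episode): a warp of 32 threads each loading a 4-byte word from consecutive addresses covers 32 × 4 = 128 bytes, which is exactly 4 sectors, all fully used. Upgrade each thread to a 16-byte `float4` load and one request covers 512 bytes, 16 sectors, still fully used: wider requests, more sectors per request, same perfect utilization. By contrast, if 32 4-byte loads are strided so each lands in a different cache line, the warp touches 32 sectors but uses only 4 bytes of each 32-byte sector: 1024 bytes moved for 128 useful bytes, eight times the traffic for the same work. So the sectors-per-request number is only meaningful alongside how many of those bytes you actually need.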
Bryce: I have a much simpler approach: I just look at how much memory traffic there is from global memory into L2, and then from L2 to L1, and check whether that's more than I expect, like for a memcpy. If you do an uncoalesced memcpy of, like, two gigs of memory, you'll see two gigs from global memory into L2, and then you'll see something like eight gigs from L2 to L1, and it's like: okay, something weird is happening there. Or actually, sorry, I take that back: if it's doing an uncoalesced access, you'll see more than two gigs of movement even from global memory to L2.
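A minimal sketch of the kind of kernel pair that produces exactly this signature in the profiler (illustrative code, not from the episode):

```cuda
#include <cstddef>

// Coalesced: consecutive threads read consecutive words, so each
// warp-wide request maps onto a few fully used 32-byte sectors, and the
// measured traffic matches the bytes you actually copy.
__global__ void copy_coalesced(const float* __restrict__ in,
                               float* __restrict__ out, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Uncoalesced: a large stride scatters the threads of a warp across
// many cache lines, so the GPU hauls in whole sectors it barely uses,
// and the profiler reports several times more traffic than the bytes
// the kernel logically copies -- the inflated numbers Bryce describes.
__global__ void copy_strided(const float* __restrict__ in,
                             float* __restrict__ out,
                             size_t n, size_t stride) {
    size_t tid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t i = (tid * stride) % n;  // a permutation if stride and n are coprime
    if (tid < n) out[i] = in[i];
}
```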
Marco Salgado: Yeah, exactly. For me, in this specific algorithm, it was looking mostly at the memory workload analysis. You want to make sure that you're loading just as much memory as you need and not more, and that you're loading as many sectors as you can per request, because that's the most efficient way of loading. And then, obviously, if you're on the newer architectures, you want to use these asynchronous loading methods, so that you can overlap the loading of memory with doing other stuff in your threads. I think it's starting from Ampere that you have these asynchronous loads, which allow you to load from global memory to shared memory without having to go through registers. Your threads just issue those loads and can then move on to doing something else, and later you synchronize to make sure the copy has actually happened. Using that is also a very powerful way of making sure that you're saturating the memory bandwidth. Apart from that, it's just making sure that you're not using too many instructions and being too inefficient, because that can also hurt performance.

Another thing I find very useful when profiling is the warp stall analysis. That also gives me a lot of information about what the bottleneck in my implementation is right now. The warp stall analysis tells you at which places in the code your threads and warps are waiting, on some dependency or on some instruction to finish, and what type of wait it is. They could be waiting on a load from global memory; they could be waiting because a certain compute pipeline is being utilized too much, so they're waiting for it to free up; and so on. From that, you can find out what is holding you back from getting better performance and what you need to change in your implementation.

So for me, those are the main things. Again, in terms of memory movement, it's always adapting your algorithm so that you can use these wide memory loads, which also need wide alignment. Sometimes the algorithm, at first sight, is not really adaptable to that, but you can use some tricks to make sure that you can still do these types of memory movement.
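To make that concrete with the profiler's own vocabulary (a general rule of thumb, not from the episode): in Nsight Compute's warp state section, a dominant "Long Scoreboard" stall usually means warps are parked waiting on global-memory loads, while reasons like "MIO Throttle" or "Math Pipe Throttle" point at an oversubscribed instruction pipeline. The first says issue wider loads and overlap them; the second says trim instruction count.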
Conor: That's all probably very, very useful for those that are GPU-curious when it comes to algorithm design. The main question that stood out: you mentioned that certain APIs are architecture-specific. "Starting from Ampere" implies that before Ampere you don't have these APIs. What is the typical solution for people writing these algorithms at this level? Do we have backward compatibility, where if it recognizes the architecture is Ampere or greater it does this asynchronous thing, and if not it does some less efficient thing? Or do people just assume you're on an Ampere or a Blackwell and simplify their lives? Because I imagine, for people doing this, you're probably not running a single workload on a single architecture; you might have different sets of older GPUs and newer GPUs, and ideally your program would run across all of them. What do people do, typically?
Marco Salgado: Yeah, so this is basically why CCCL exists. In my implementation, I'm using a function called cuda::memcpy_async. This is a function where you give it the source and destination pointers of the memory that you want to copy, you give it the amount of memory you want to copy, and so on, and it finds the most efficient way of doing it. On Ampere, it uses an instruction called LDGSTS, load global store shared, which is one of these asynchronous memory operations. And then on Hopper and Blackwell, in case the alignment is correct, it uses something called the TMA, the Tensor Memory Accelerator, which is basically another type of copy engine that allows you to copy from global memory to shared memory. So my tip in that case is: use these libraries, which have taken a lot of effort and a lot of pain to make as efficient as possible, and just read into how you need to structure your data to make the best use of those functions. In most cases, a lot of this stuff has already been implemented, and implemented very efficiently, so you should try to use existing libraries as much as you can.
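A minimal sketch of what this looks like through the cooperative-groups flavor of the API (the kernel itself is illustrative, assuming a block-sized tile and n a multiple of the block size):

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

__global__ void staged_scale(const int* in, int* out, size_t n) {
    extern __shared__ int tile[];       // blockDim.x ints, sized at launch
    auto block = cg::this_thread_block();
    size_t base = (size_t)blockIdx.x * blockDim.x;

    // Issue the copy global -> shared. On Ampere and newer this lowers
    // to the asynchronous LDGSTS (cp.async) path and bypasses registers;
    // on older architectures it falls back to an ordinary staged copy.
    cg::memcpy_async(block, tile, in + base, sizeof(int) * blockDim.x);

    // ...threads could do independent work here while the copy is in flight...

    cg::wait(block);                    // the shared tile is valid after this
    out[base + threadIdx.x] = 2 * tile[threadIdx.x];
}
```

A launch might look like `staged_scale<<<n / 256, 256, 256 * sizeof(int)>>>(in, out, n)`, with the caveat that this sketch assumes n divides evenly by the block size.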
Conor: I see. So if you choose the right API, it'll do the complicated stuff behind the scenes.
Marco Salgado: Yes, but you still have to... for example, let's say you are on Hopper and you're using this cuda::memcpy_async function, but for some reason you're not giving it a pointer that is 16-byte aligned, which is what you would need to use the TMA. Then cuda::memcpy_async cannot do it for you, and it's going to use a less efficient way of moving memory. So you still need to know what the correct alignment is, and so on, in order for the function to actually do the most efficient thing. There is some stuff that you can get wrong.
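One way to make that alignment promise explicit is libcu++'s `cuda::aligned_size_t`. A hedged sketch building on the cooperative-groups example above (the helper and its checks are illustrative, and the header providing `aligned_size_t` may vary by CUDA version):

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cuda/pipeline>   // cuda::aligned_size_t (header may vary by version)
#include <cstdint>

namespace cg = cooperative_groups;

// Promise 16-byte alignment to the library only when you really have it.
// Lying here is undefined behavior, but telling the truth lets
// cuda::memcpy_async pick the widest hardware path available.
__device__ void stage_tile(cg::thread_block block, float* smem_dst,
                           const float* gmem_src, size_t bytes) {
    bool aligned16 =
        reinterpret_cast<uintptr_t>(smem_dst) % 16 == 0 &&
        reinterpret_cast<uintptr_t>(gmem_src) % 16 == 0 &&
        bytes % 16 == 0;
    if (aligned16)
        cg::memcpy_async(block, smem_dst, gmem_src,
                         cuda::aligned_size_t<16>(bytes));
    else
        cg::memcpy_async(block, smem_dst, gmem_src, bytes); // narrower path
    cg::wait(block);
}
```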
Bryce: And this is one of the reasons why we created cuTile: cuTile automates all of this for you and lowers down to the most optimal memory-movement strategy, given the constraints of your hardware and the constraints of the arrays you're dealing with, in terms of alignment, strides, shapes, and dimensionality.
Marco Salgado: So do you actually think, Bryce, that cuTile would get the memory movement in the rotate right? Just the memory movement. Because you need to consider that, for example, for the global-to-shared memory movement, I'm using, well, I guess you're aware of what over-copying is, where you extend your array virtually so that you can copy whole sectors. I use that. And then, for copying from shared memory back to global memory, I need to use some funnel shifts and so on in order to get it to the correct alignment. So do you think the cuTile implementation is flexible enough that it can handle any type of alignment with the best performance?
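For readers who haven't met them: a funnel shift concatenates two 32-bit words and shifts across the seam, which is exactly what realigning data for an aligned store needs. A minimal sketch of the primitive (the helper name is made up; this is the building block, not Marco's rotate code):

```cuda
#include <cstdint>

// __funnelshift_r(lo, hi, s) forms the 64-bit value hi:lo, shifts it
// right by s bits, and returns the low 32 bits. With s = 8 * byte_offset
// this extracts the 32-bit window that straddles two adjacent aligned
// words, i.e., it stitches an unaligned word out of two aligned loads.
__device__ uint32_t unaligned_window(uint32_t lo_word, uint32_t hi_word,
                                     unsigned byte_offset /* 0..3 */) {
    return __funnelshift_r(lo_word, hi_word, 8 * byte_offset);
}
```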
Bryce: I had one part of the company telling me that on Blackwell you basically needed to use TMA for all memory-bandwidth-bound kernels, and I had another part of the company telling me that you only really need to use it for kernels that use the tensor cores. So I wrote this thing called pressure bench, a synthetic benchmark with a configurable amount of register pressure, because one of the advantages of TMA is that it saves you register pressure: it goes straight to shared memory without going through registers. I did some experiments, and the results are a lot more complicated, with no clearer winner, than you might expect. And as for the decision tree for how cuTile decides to lower here... I don't know. I think it'd be worth benchmarking to see. For this particular type of access pattern, I really don't know. I guess what I'm trying to get at is that one thing we've seen a lot with cuTile and Triton is that vectorized loads and stores can often do quite well on Blackwell. cuTile generates really good vector adds, better than CUB vector adds in some cases, with just vectorized loads and stores. The situation is very muddied right now; there's not as clear a winner as I thought there was going to be.
Marco Salgado: Yeah, I've also seen a couple of presentations talking about using the TMA versus using vectorized loads and so on, and the TMA is often actually not better than doing that. To be honest, it's surprising how complicated it is to saturate the memory bandwidth on these modern GPUs. That's also another reason why people should be using libraries, because the amount of effort it takes to get the maximum performance is very, very large.
Conor: This reminds me of an unrelated, well, quasi-related (hence why it reminds me of it) performance YouTube video I made a month or so ago. It was solving a trivial problem that I profiled across several languages, and there were three different solutions: one used a sort, one used a partition, and the other was basically a reduction and then construction of a sequence. And in every single one of the solutions, one of my favorite array languages, BQN, won, including the sort. The sort in BQN, which is an interpreted language, was faster than the O3-compiled Rust and C solutions. A lot of people in the comments and on GitHub were whining, saying: that's not fair, you're not comparing the same thing. Because the reason BQN was faster is that it recognizes you're sorting a Boolean array, and that's why you can go from a sort to a partition: a partition is basically just a predicate sort, so if you know you have a Boolean array, you don't have to do a full sort, you can just do a partition. And then I realized it might be even easier to do a pop count, count the number of set Booleans, and then just construct the final sequence. BQN has bit-packed vector optimizations for whenever you encounter Boolean arrays, and on top of that, when you sort a Boolean array it doesn't even sort: it does this pop count and construction behind the scenes. And on top of that, it has a bunch of SIMD-optimized code, because it's implemented on top of this other language called Singeli that generates all these massive SIMD code paths.

So a lot of people in the comments were just like: that's not fair, you're comparing apples to oranges; you could go write that SIMD code in C and Rust, and you should be comparing that. And I was like: what are you talking about? I'm comparing what I'm capable of writing. The reason this reminds me of it is that when you mentioned all this dispatching to the TMA, or to something else underneath cuda::memcpy_async, and that libraries like cuTile are doing these things too, it makes me think some people will say: oh well, cuTile is unfair, because it's automatically doing that stuff, whereas if I wanted to, I could go hand-optimize this stuff myself. But that just misses the point. The point is that you want to give people either a language or a library that is easy to use, where hopefully the naive way you'd like to express your problem or algorithm gives you the most efficient thing. Saying "I could have spun up some SIMD-optimized C code that would have been the equivalent of this, with bit-packing and whatnot" misses the entire point, right? It's about the easiest path to getting this, either as code on the screen or, in this case, onto the GPU. And yeah, I don't know.
It makes me think that, you know, we're encouraging people to use the libraries, but then there's going to be some CUDA-ninja Andy out there being like: well, I can do something faster than cuTile, because I know how to use TMA the best. And the goal is that we're trying to make it simple for people. I'm not sure if you guys have thoughts on that stuff.
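For the curious, the pop-count trick Conor describes is easy to sketch in C++ (a hedged illustration of the idea, not BQN's actual implementation; assumes C++20 and that padding bits beyond nbits are zero):

```cpp
#include <bit>
#include <cstddef>
#include <cstdint>
#include <vector>

// "Sorting" a bit-packed Boolean array reduces to a popcount: count the
// ones, then emit all the zeros followed by all the ones.
std::vector<uint64_t> sort_packed_bools(const std::vector<uint64_t>& bits,
                                        std::size_t nbits) {
    std::size_t ones = 0;
    for (uint64_t w : bits) ones += std::popcount(w); // one instruction per 64 bools
    std::size_t zeros = nbits - ones;

    std::vector<uint64_t> out(bits.size(), 0);
    for (std::size_t i = zeros; i < nbits; ++i)       // set the top `ones` positions
        out[i / 64] |= uint64_t{1} << (i % 64);
    return out;
}
```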
Marco Salgado: No, it's 100% the case. And I think as the architectures get more and more complex, it becomes more and more relevant to hide all of that stuff under the hood. Because even now, if you wanted to implement something as simple as a vector add and make it reach the speed of light on Blackwell, it would probably take you a good amount of time. Whereas if you use cuTile, and I think Bryce said it was even better than CUB, implementing it would probably take you 10 or 15 minutes. So that's probably the future for those sorts of workloads. But it's not the case for everything: I can think of a lot of things you couldn't express in cuTile, and so you're still going to need that lower-level abstraction.
Bryce: Alright, folks, this is bedtime.
Conor: We've got eight minutes left, and those eight minutes are reserved. Reserved for... yeah, I mean, you could have replied to Marco's Slack message last week, Bryce. Last week. And then I forgot too, and then I was like, oh yeah, we've got to do this on Tuesday or Wednesday. And I realized that at like 8 a.m. this morning, when I was on a short run: wait, it is Tuesday.
Bryce: What do you call a short run, Conor?
Conor: That wasn't what I thought you were going to ask. Today was seven kilometers, which is definitely the shortest run I've gone on in a long time, because my legs hurt from the race that I ran. But anyways, enough about that; we've only got seven minutes left now. So we had the two questions from Alpha Strata, whose name is Jur, and that's all the information we have about this person on GitHub. And there was a comment afterwards that says: I hope this GitHub discussion for asking questions before the guest comes on becomes more of a thing. We will do our best. Maybe we'll even just post on the socials: here's our next guest.
Bryce: Say that again.
Conor: So, because this happened to be split apart, Marco said at the end: if you guys have questions, feel free to ask them, and we're more than happy to answer them. Anyways, this person says: do more of that. We'll do our best. Question number one. And I didn't really understand this one; the second question's obvious, but the first I didn't understand: "Superspeed benchmarks inbound, for a Hadamard / SpinQuant or similar?" Does that mean anything to you? Not to me.
Bryce: I know Hadamard is a type of matrix, right?
Marco Salgado: No... you have, I think, the Hadamard product, which is a sort of inner product. And then I think you also have the Hadamard transform, which is another type of matrix; I think the Hadamard transform has something to do with quantum mechanics. So it is a matrix.
Bryce: It is a type of matrix. A Hadamard matrix is a square matrix whose entries are all either plus one or minus one and whose rows are mutually orthogonal.
Marco Salgado: And then a Hadamard product is sort of a dot product, and maybe you implement that with a rotate? No idea. But then, what's the second thing?
SPEAKER_02: A spin...? SpinQuant: S-P-I-N-Q-U-A-N-T, SpinQuant.
Marco Salgado: LLM quantization with learned rotations. So, since it has the word "rotation" in it, maybe it has something to do with the rotate.
Conor: So I just asked ChatGPT; I would have asked Gemini, folks, but it's been down today. And there's a bottom line at the end that says: SpinQuant-style methods depend directly on what people are calling GPU rotate. So, I guess, maybe. Feedback for Alpha Strata: please ask a clearer question in the future. But I think the question is: are there going to be benchmarks for these two different calculations, or things that could potentially be enhanced by an in-place GPU rotate?
Conor: Which I guess the answer is: maybe, if Marco feels like it.

Marco Salgado: I am working on adding the rotate to CUB, and so once that is in CUB, everyone can go benchmark whatever they want with the rotate. And then the second question is much easier.

Conor: Where's the code? Well, I guess you just answered that: it's inbound for CUB at some point in the next couple of months, I guess. Okay. And stuff shows up in CUB on GitHub way before it gets released, so if you want to go mess around with this, it could be available if you go look at the GitHub repo. Should we do a brief... I know Bryce wants to go to sleep, but we haven't talked about nvCOMP, and that was Bryce's question from like four episodes ago now. Alright, Marco, tell us briefly about nvCOMP, and then we'll call it a day.
Marco Salgado: So nvCOMP is basically a library that implements many compression algorithms and compression formats on the GPU. In case your application has to move data from the GPU to the CPU, or loads data from disk to the GPU in large quantities, nvCOMP is probably something that can help you speed up your end-to-end performance. Usually, when loading memory from disk to the GPU, the large bottleneck is going to be that CPU-GPU interconnect, and so if you can compress your data so that it takes up less space, you spend less time transporting it from the CPU to the GPU, and then you can decompress it very fast on the GPU, and that way you gain some performance. We have general-purpose algorithms implemented: for example, we have Deflate, which is what's underneath gzip and zlib, and we have Zstandard, LZ4, and Snappy, which are the general-purpose algorithms that usually work well for any kind of data. And then we have some other higher-performance, more specialized algorithms, like ANS, which is an entropy encoder that's very fast.
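As a rough back-of-the-envelope for why this wins (illustrative numbers, not from the episode): over a PCIe Gen4 x16 link at an effective ~25 GB/s, a 10 GB payload takes about 0.4 s to transfer. Compressed 2:1, only 5 GB crosses the link, about 0.2 s, and since GPU-side decompression typically runs at tens to hundreds of GB/s, it adds comparatively little, so the end-to-end speedup approaches the compression ratio.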
Marco Salgado: And then there's also Bitcomp, which is for floating-point data specifically, and Cascaded, which is for database-style, columnar, data-table type of data. So yeah, that's basically the library: in case you move a lot of data around between CPU, GPU, and disk, this is something that could help speed up your performance.

Conor: Nice. I mean, we might have to have you back, and then we can talk about compression algorithms for a whole three or four episodes; that is a big rabbit hole, for sure. Well, we are, supposedly, allegedly, the algorithms plus data structures equals programs podcast. Anyways, Bryce has fallen asleep, so we've got to wind this down. Enjoy the rest of Paris. Are you coming back to America, or are you avoiding America these days and just trying to stay abroad?

Bryce: No, I'm coming back to New York as soon as I possibly can.

Marco Salgado: Alright. Also, if I may plug anything: a guest recommendation. One person that I would really like to hear talk, that I don't get to hear from enough, is Duane Merrill, who I think, Conor, you probably know, since he's from NV Research.

Conor: Yeah, yeah, he's on PSA, programming systems and applications; that's the research team that I work on. I mean, technically we had Jared Hoberock on, who is the individual behind Thrust back in the day. Ooh... it's not just Jared; it was Jared and, do you remember his name, Bryce, the guy that went to Google? Nathan Bell is the name I was looking for. So we talked to Jared, and Thrust's twin is CUB, and Duane is the guy behind CUB.
Marco Salgado: So yes, we can, pending he wants to come on, but knowing Duane, I'm sure he would be happy to. And yeah, for me, I think he invented the decoupled look-back algorithm that the prefix scan is based on. He's done a lot of work on interesting algorithms, and he seems to be a guy where I would just be interested to know how his head works and how he comes up with the ideas for these algorithms.
Bryce: So here's the thing to do, if we're going to do that. Sean Baxter and Duane worked together at NV Research back in the day, like back in the early days of NV Research. Sean says that at the time they were looking into sorting algorithms, and Sean investigated the merge sort path and Duane investigated the radix sort path. I think it'd be fun to have both of them on and talk about what it was like in those good old days, back before CUDA had product-market fit.
Conor: And just, you know, because they both walked down the two different paths of sorting. Tree... well, I guess only one of them walked down the tree-based path. That's a sorting joke. Rough, rough. Yeah, we'll have Duane on first, because Sean has already been on. Was it just that one time?

Bryce: We can have Sean on to talk about our respective thoughts on the Moby Dick semi-musical thing that we saw at the Brooklyn Academy of Music, the one that was half in German, half in English.

Conor: Alright, we'll put that on the topic stack. But I think Sean's only been on that one time, and he is responsible for one of our highest-viewed episodes, the one entitled something like C++ vs. Rust vs. Carbon vs. Circle. We've talked about having him on since. Alright: we'll do Duane, then we'll do Sean, and then we'll ask them who they want to have on. Anyways, we free you, Bryce. Thank you, Marco, this has been a blast. I look forward to seeing the in-place GPU rotate in CUB sometime soon. And I guess you're based in Spain, so will we ever meet at some point? It's just funny: I was on a video call with someone the other day, with Asher Mancinelli, and we had invited another guy, Charles I believe his name was, and I was like, oh yeah, nice to e-meet you. And then I realized I've actually never met Asher in person. We've been having these meetings for several years now, I've known Asher for maybe close to half a decade, and we've never met in person, and we work for the same company. But he works in Oregon, and we don't get to go to GTC unless we're speaking (he wasn't at GTC, I wasn't at GTC), and the only time I go on site is for research stuff.

Marco Salgado: Which Asher do you mean? Asher Mancinelli? Oh, we should totally have him on; I'm working with him a lot.

Conor: Oh yeah, he was on ArrayCast, my other podcast. I had no idea you two were buddies.

Marco Salgado: Yeah, he has a YouTube channel with a bunch of BQN plus CUDA language stuff.

Conor: Yeah, there's like a small cohort, and we added Charles, shout out to Charles, who is a K enthusiast, which is another one of the array languages. Charles... I want to get his name right; I believe it's Charles Hall.

Bryce: I thought you meant Modal Charles. We should also have Modal Charles on at some point.

Conor: Add it to the queue. It's just going to be guests from here on out, folks. But yeah, anyways, there's like a small...

Bryce: Better than Bryce hot takes?

Conor: Well, no, we still have to get Bryce hot takes. There's probably 10% of our listeners that only tune in for the occasional Bryce hot take. Anyways, I also have to stay entertained, you know; I've been doing this for what, 290 episodes straight, and I'm about to have a kid. If these things get boring, I'm sorry, ADSP listeners, but I'll just be asking Bryce here: Bryce, you find some content, send it to me, and I'm just going to post that. No intro, no anything. The content you're going to live with for the first year of me being a father is just Bryce rambling to himself while walking down New York with absolutely terrible...

Bryce: I'm going to make an AI Conor. I'm just going to make a distill skill for Conor too.
Bryce: Have you heard about that? Apparently this is a thing in China, where at some companies people are writing these distill skills, to distill what their coworkers do into a skill, so that they can go to their management and be like: hey look, you can fire this person, but you should keep me around, because I'll help you automate away more of my coworkers.

Conor: Coming from the guy that said earlier: yeah, it is hard to be empathetic for the guys and the girls that aren't doing AI.

Bryce: No, no, no. I thought I took a pretty empathetic take just by saying: listen, we've got to fire Joe, we've got a Claude skill that does his whole job, he's out.

Conor: You're going to get me canceled, bro. You're going to get yourself canceled.

Bryce: You know what, maybe I will. Alright, I've got to go.

Conor: We've just got to button this up. Thank you, Marco, this was a blast. I hope we do meet at some point, whether that's in Spain or, hopefully, at GTC.

Marco Salgado: Yeah, I also want to go to GTC.

Conor: I don't know if it's ever going to happen; I've been here for what, six-plus years. But anyways, we'll find an event. There are a couple of conferences in Spain. If we weren't having a kid, I'd be in, I think I mentioned that last time, Malaga. And I was in Cadiz a couple of years ago, and I love Spain.
Conor: I love Spain. Great place to run, and the weather there is fantastic: much better than Canada, much better than America, arguably better than California as well, because California gets a little chilly, but I don't think Spain ever gets chilly, to my knowledge.

Marco Salgado: I don't know... it does. It all depends on the location; Spain is very mountainous, you know.

Conor: Is it? Oh, that's true. That's, what's his name, arguably the best ultra runner in the world: Kilian Jornet.

Bryce: Oh, I thought you were going to mention, like, famously-mountainous-Spain, the problem from Napoleon.

Conor: No, no, we're mentioning Kilian. He's from France? How do you pronounce it: Kilian Jornet, or Jornay?
Marco Salgado: I think it's Kilian Jornet, but he's from Catalonia, and in Catalonia they have a different dialect, basically a different language, called Catalan, so I guess maybe it's pronounced another way. But I mean, he lives in Norway.
Conor: He lives in Norway? Well, he grew up in Spain, though, because I've listened to podcasts and he's like a genetic freak, in that his parents used to take him on hikes and runs in the mountains. They did some study on him once, and it found that he naturally changes his gait during his ultra runs, shifting which muscle in his leg he's using from mile to mile, so that he doesn't tire out as quickly.
And I'm just like: whoa, my guy, that's insane.
He's like a... alright, for the third or fourth time: thank you once again, this has been a blast. How I'm going to edit this, nobody knows. Be sure to check the show notes, either in your podcast app or at adspthepodcast.com, for links to anything we mentioned in today's episode, as well as a link to a GitHub discussion where you can leave thoughts, comments, and questions.
Bryce: Thanks for listening. We hope you enjoyed, and have a great day. Low quality, high quantity: that is the tagline of our podcast.
Conor: That's not the tagline. Our tagline is: chaos with sprinkles of information.