What's Up with Tech?
Tech Transformation with Evan Kirstel: A podcast exploring the latest trends and innovations in the tech industry and how businesses can leverage them for growth. We dive into the world of B2B, discussing strategies and trends and sharing insights from industry leaders!
With over three decades in telecom and IT, I've mastered the art of transforming social media into a dynamic platform for audience engagement, community building, and establishing thought leadership. My approach isn't about personal brand promotion but about delivering educational and informative content to cultivate a sustainable, long-term business presence. I am the leading content creator in areas like Enterprise AI, UCaaS, CPaaS, CCaaS, Cloud, Telecom, 5G and more!
Inside AMD’s AI Strategy From Edge To Data Center
Interested in being a guest? Email us at admin@evankirstel.com
Big leaps in AI rarely come from one breakthrough. They emerge when hardware design, open software, and real workloads click into place. That’s the story we unpack with AMD’s Ramine Roane: how an open, developer-first approach combined with high-bandwidth memory, chiplet packaging, and a re-architected software stack is reshaping performance and cost from the edge to the largest data centers.
We walk through why memory capacity and bandwidth dominate large language model performance, and how MI300X’s 192 GB of HBM and advanced packaging unlock bigger contexts and faster token throughput. Ramine explains how ROCm 7 was rebuilt to be modular, smaller to install, and enterprise-ready, so teams can go from single-node experiments to fully orchestrated clusters using Kubernetes, Slurm, and familiar open tools. The highlight: disaggregated and distributed inference. By splitting prefill from decode and adopting expert parallelism, organizations are slashing cost per token by 10–30x, depending on model and topology.
The conversation ranges from startup-friendly workflows to hyperscaler deployments, with practical insight into vLLM, SGLang, and why open source now outpaces closed stacks. We also look ahead at where inference runs: the edge is rising. With performance per watt doubling on a steady cadence, AI PCs, laptops, and phones will take on more of the work, enabling privacy, responsiveness, and lower costs. Ramine shares a sober view on quantum computing timelines and a bullish take on the broader compute shift: moving once-sequential problems into massively parallel deep learning that changes what’s even possible.
If you care about real performance, total cost of ownership, and developer velocity, this conversation brings a grounded blueprint: open ecosystems, smarter packaging, and inference architectures built for high utilization. Subscribe, share with a colleague who cares about LLM throughput and cost, and leave a quick review to help others find the show.
More at https://linktr.ee/EvanKirstel
Hey everybody, really excited for this chat today as we dive into AMD's AI vision and strategy with a true insider. Ramine, how are you?
SPEAKER_01:I'm good. Thank you. Thanks, Evan.
SPEAKER_00:Well, thanks for being here. I've been really looking forward to this chat. Before we get into the details, perhaps introduce yourself, your role, your team's role within AMD, and the big picture. What is AMD doing in AI right now?
SPEAKER_01:Sure. My name is Ramine Roane, and I lead the group within AMD's AI organization that's basically driving the integration of the open source ecosystem with our GPUs. We're also responsible for DevRel, developer relations. And what is AMD doing in AI right now? Well, we have products that can accelerate AI from the endpoint all the way to data centers. At the endpoint we have AI running in self-driving cars, in satellites, even on Mars on the Mars rover, and in hospitals doing analytics and genome analytics, things like that. Then we have AI PCs on laptops and desktops at the edge, and then the big Instinct GPUs in data centers.
SPEAKER_00:Wow, it's an amazing portfolio, amazing breadth of technologies. And central to that vision is open source, which is very impressive. Maybe talk about that, and how AMD approaches it rather differently from others, perhaps?
SPEAKER_01:Yeah, so for one thing, our entire stack is open source. The ROCm software, which is our software stack to program our GPUs, was first specialized for HPC, since AMD is really big in HPC. We have like the top two or three biggest supercomputers in the world. Then maybe three years ago we adapted that to AI, which uses smaller data types, more FP8 and FP4 than the FP64 you see in HPC. So we did our first AI GPU in 2023 and adapted ROCm to cover the whole AI stack, and everything is open sourced. We also very quickly realized that companies are actually using open source software. If you look at OpenAI, they're using their own branch of vLLM, which is open source software for inference serving, and xAI uses their own branch of SGLang, which is another open source project. So we very quickly started to work with the open source community, with PyTorch, vLLM, SGLang, Unsloth, Hugging Face, and so on. We provide them with machines to do daily CI/CD and work on supporting all the features. It's just super important, not just for enterprises but even for the big AI labs. Everybody is using open source.
SPEAKER_00:Absolutely. And you talk a lot about not just the software innovation, but combining that with hardware leadership, a tremendous synergy. How does that show up in real-world performance?
SPEAKER_01:So as I said, we came out with our very first AI GPU in 2023, at the very end of 2023, like December. We started to sell it in 2024, and we basically went from 0% market share to 5% market share in our first year in AI, against the biggest, baddest GPUs out there, right? So we're extremely competitive. And our customers are not small, they're the OpenAIs and Metas of the world. We're basically competing with the number one, the biggest GPUs out there.
SPEAKER_00:Yeah, it's amazing. And you have so many developers, so many examples of where hardware-software synergy has already delivered results. Can you share any of those stories or anecdotes, examples where you're getting really measurable performance or efficiency gains?
SPEAKER_01:Yeah, sure. So for one thing, our GPUs are not just competitive in compute, they actually go way above and beyond in memory and memory bandwidth. That's because we've accumulated a wealth of IP and technology on the EPYC side, namely chipletizing our devices. It's not just 2.5D, it's actually 2.5D on top of 2.5D. We have the memory hierarchy and the I/Os on the first floor and the compute on the second floor, and that allows us to pack a lot more compute into the same device and put in a lot more HBM. Just to give you an idea, when our first AI GPU came out, MI300X, we had 192 gigabytes of HBM. The biggest, best, number one GPU out there had only 80 gigabytes. So we were almost 2.5x more in terms of memory and more than 1.6x the bandwidth. And obviously LLMs are extremely hungry, not just for memory but also for memory bandwidth; at least the decode side of an LLM is very memory bandwidth intensive. That's why we got adopted very quickly by the Microsofts, OpenAIs, and Metas of the world. So it's a combination of hardware and software, right? Because on the software side things are moving very quickly, but honestly the state of the art in software is no longer proprietary. There is no proprietary moat, it's really open source. What vLLM and SGLang are doing, no proprietary stack can match; they're moving way too fast. So on the software side we're taking advantage of our memory and memory bandwidth while keeping up with the state of the art, because we're directly working and developing with open source.
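(A back-of-the-envelope sketch of why decode is memory-bandwidth bound. The model size, data type, and bandwidth below are illustrative assumptions, not AMD or MI300X specifications.)

```python
# Rough upper bound on single-stream decode throughput: each generated token
# requires streaming roughly all model weights from HBM once, so the ceiling
# is memory bandwidth divided by model size. Illustrative numbers only.

def decode_tokens_per_second(params_billion: float,
                             bytes_per_param: float,
                             hbm_bandwidth_tb_s: float) -> float:
    model_bytes = params_billion * 1e9 * bytes_per_param
    bandwidth_bytes_per_s = hbm_bandwidth_tb_s * 1e12
    return bandwidth_bytes_per_s / model_bytes

# Hypothetical 70B-parameter model in FP8 (1 byte per parameter) on a GPU
# with ~5 TB/s of HBM bandwidth:
print(f"{decode_tokens_per_second(70, 1.0, 5.0):.0f} tokens/s ceiling")
# Doubling bandwidth roughly doubles this ceiling, which is why HBM capacity
# and bandwidth, not just peak FLOPS, dominate decode performance.
```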
SPEAKER_00:Fantastic. And clearly frontier labs and others are betting big on AMD for AI, and a lot of that is through partnerships, really helping shape your future as well. Any lessons or insight into those partnerships and the partner philosophy that you're driving?
SPEAKER_01:Um you mean with our customers, right?
SPEAKER_00:Yeah, customers and partners.
SPEAKER_01:Yeah, yeah. So that's really key, actually, to any product development. You want to work very closely with the users of your technology, whether they're customers or partners, understand the gaps they're seeing out there, and fill those gaps. So we understand the gaps, work with them, work with the internal engineering team, propose solutions, and the solutions that customers really echo well with, we implement. It's really Product Management 101. You want to understand the gaps and fill the gaps.
SPEAKER_00:Yeah, back to the basics. Love it. You mentioned AMD technology is on Mars. I love that; I'm a space tech geek as well. So if AI is the rocket, what makes AMD the launch pad? What's the mission control behind the launch? People, partnerships, platforms. Is that a good analogy?
SPEAKER_01:No, that's a good analogy. I would say the launch pad is the community. It's really the community; nothing would take off without the open source community at this point in time. Again, there was a time when proprietary software was the only thing available and therefore the state of the art. That time is long gone. Without open source, none of this could happen. Yes, we have great technology, yes, we have advantages in chiplets and 3D packaging and all that, but none of it would take off without the open source community.
SPEAKER_00:Yeah, it's fantastic to watch. Let's talk ROCm 7, your developer platform. It has kind of evolved from being an AI software stack to much more. Maybe talk about that evolution and where it is, where it's going.
SPEAKER_01:Sure. So ROCm 7 is a big change in that we completely re-architected ROCm. It's more modularized and much smaller to install. That was actually one of the big pain points from our partners: ROCm was just too big. So we trimmed it down, modularized it, and it's completely open. Right now it's on GitHub in a repo called TheRock, under AMD, and you can install just the pieces you want. At the same time, we've been keeping up with the needs out there. We have a full enterprise AI stack, because when we cater to the Metas and OpenAIs of the world, they have their own stack, right? They don't need us to manage their cluster. But when an enterprise company buys our GPUs, what do they do? They need a way to bring up that stack and start distributing workloads and using the GPUs efficiently every hour of the day. So we have a full-fledged enterprise stack, which is also open source, and we actually announced its wide availability just today. There's a blog on our website with all the links needed to access it. Another big thing we're doing in ROCm 7 is catching up to a new trend, which is distributed inference and disaggregated inference. Now that frontier labs are doing at least as much inference as training, if not more, inference efficiency has become a big thing. One thing the open source community really figured out is that by disaggregating prefill and decode, you get much higher efficiency. Prefill is the part of the LLM that analyzes your prompt, which could be, for example, a whole book to summarize, and decode is the part where it's just generating. By disaggregating those two phases and running them on different pools of GPUs, you get much better results, because they're specialized workloads that you don't want to run on the same GPU, where they would start interfering and slowing each other down. So that's disaggregated inference. We also enable distributed inference in ROCm 7, things like expert parallelism for mixture-of-experts models, where you bundle experts and run them on different GPUs and then have to take care of all the GPU-to-GPU communication. The GPUs actually communicate directly, bypassing the whole networking stack running on the CPU. All of that is coming with ROCm 7.
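(To make the prefill/decode split concrete, here is a minimal, purely hypothetical Python sketch of routing the two phases to separate GPU pools. It is not ROCm or vLLM code; real systems use schedulers such as vLLM or SGLang and hand off the KV cache over direct GPU-to-GPU links.)

```python
# Toy sketch of disaggregated inference: prefill and decode run on separate
# GPU pools so the compute-heavy prefill phase doesn't interfere with the
# bandwidth-heavy decode phase. Hypothetical names and logic for illustration.
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    max_new_tokens: int

class GPUPool:
    def __init__(self, name: str, gpu_ids: list[int]):
        self.name, self.gpu_ids = name, gpu_ids
        self._next = 0

    def pick_gpu(self) -> int:
        # Simple round-robin assignment within the pool.
        gpu = self.gpu_ids[self._next % len(self.gpu_ids)]
        self._next += 1
        return gpu

prefill_pool = GPUPool("prefill", gpu_ids=[0, 1])        # compute-bound phase
decode_pool = GPUPool("decode", gpu_ids=[2, 3, 4, 5])    # bandwidth-bound phase

def serve(request: Request) -> None:
    # 1) Prefill: process the whole prompt once, producing a KV cache.
    prefill_gpu = prefill_pool.pick_gpu()
    print(f"prefill of {len(request.prompt)} chars on GPU {prefill_gpu}")
    # 2) Hand the KV cache to a decode GPU (a direct GPU-to-GPU copy in practice)
    #    and generate tokens there without competing with new prefills.
    decode_gpu = decode_pool.pick_gpu()
    print(f"decode of up to {request.max_new_tokens} tokens on GPU {decode_gpu}")

serve(Request(prompt="Summarize this book ...", max_new_tokens=512))
```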
SPEAKER_00:That's fantastic. Are there any particular industries or workloads that you think will take advantage of this most and kind of stand out here?
SPEAKER_01:Yeah, so definitely LLMs. That's one of the top use cases, although we're also running things like diffusion, search, and ads. We have customers across the spectrum now: video generation, image generation. A lot of those are either attention-based or diffusion-based, or just classic ML for search and recommendation. Those are the workloads we're mostly focusing on. And enterprise AI is really the big new use case we're also enabling in ROCm 7.
SPEAKER_00:Yeah, very, very exciting. And when it comes to enterprise, how do you think about integration with existing AI frameworks, toolsets, and stacks? There's a lot out there in the enterprise that's legacy.
SPEAKER_01:Everything we have in the enterprise stack is based on open source. We're using the Kubernetes, Slurm, VMs, and operating systems that are available out there. Again, it's based on talking to customers and getting information from them on which stack they want to use and what their problem is, and that's exactly what we did. We took the legacy and the state of the art in the open source community and enabled it.
SPEAKER_00:Fantastic. So it's a competitive landscape, obviously, in AI, and reducing deployment costs is crucial; total cost of ownership is being looked at. How do you think about ROCm 7 helping organizations and enterprises achieve those goals?
SPEAKER_01:Yeah, as I said, cost is very, very important to our customers. For inference, just like for training, you need to try to get to maximum utilization of your GPUs. That includes using as many GPUs as possible, but also using as much of the compute in each GPU as possible and generating tokens really, really fast for the compute and the power consumption that you have. So things like distributed inference are basically meant to optimize ROI. It lowers the cost per token, basically.
SPEAKER_00:Fantastic. Any real-world examples or benchmarks delivering cost efficiency? I know I'm not gonna ask you for a price list here, but this must be increasingly an ask from your customers and partners.
SPEAKER_01:Yeah. Without giving dollar-per-token or per-watt kinds of metrics, just to give you an idea: if you infer an LLM the old way, meaning how we were doing it 12 months ago, which is a very long while ago in AI terms, compared to using disaggregated and distributed inference on a cluster of GPUs, you can probably slash the cost per token by somewhere between 10, 20, 30x, depending on the LLM and on how many GPUs you have. Basically, inferring everything brute force on the same set of GPUs, with prefill and decode mixed together, versus doing it cleanly and properly disaggregated and using expert parallelism, that's the kind of improvement you can get. 10 to 30x, I would say, is doable.
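(Some rough cost-per-token arithmetic, with purely hypothetical prices and throughputs, to show how higher utilization translates into an improvement of that order.)

```python
# Rough cost-per-token arithmetic with made-up numbers, to illustrate how a
# large throughput gain from disaggregation and expert parallelism shows up
# as a proportional drop in cost per token.
def cost_per_million_tokens(gpu_hour_cost_usd: float,
                            tokens_per_second_per_gpu: float) -> float:
    tokens_per_hour = tokens_per_second_per_gpu * 3600
    return gpu_hour_cost_usd / tokens_per_hour * 1e6

baseline = cost_per_million_tokens(gpu_hour_cost_usd=2.0,
                                   tokens_per_second_per_gpu=500)     # monolithic serving
optimized = cost_per_million_tokens(gpu_hour_cost_usd=2.0,
                                    tokens_per_second_per_gpu=10_000) # disaggregated + expert parallel
print(f"baseline:  ${baseline:.2f} per 1M tokens")
print(f"optimized: ${optimized:.2f} per 1M tokens")
print(f"improvement: {baseline / optimized:.0f}x")  # 20x with these hypothetical numbers
```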
SPEAKER_00:10 to 30x. Wow, that's blockbuster, amazing. So moving a little bit away from the present, where there's a lot of exciting stuff happening, maybe talk about the future of AI compute. How do you see this evolving over the next couple of years or beyond? Any insights you can share there?
SPEAKER_01:So yeah, you see a lot of headlines on the big, big mega deployments in the data center. I think this will go on for a while, because big training is going to continue to be a thing; models are improving and you have to build and train new models. But on the inference side, honestly, as edge devices become more and more powerful, I think you're gonna see a lot more inference moving from the data center to the edge, to desktops, laptops, phones. I mean, just look at your cell phone today and the amount of compute and the little wattage it's using: typically it can do 35 teraops at four watts. Compare that to one of the first supercomputers, the Cray-1 in the 1970s, which was 160 megaflops, I believe, at more than 100,000 watts. If you do the calculation, you see that the ops per watt have improved by more than 6 billion x. And the funny thing is, if you bring that to an annual rate, it's exactly 2x every year and a half. It kind of resembles Moore's law, but in performance per watt.
SPEAKER_00:Amazing.
SPEAKER_01:And the improvement hasn't come just from hardware, like Moore's law did at the time, but from hardware, software, and architecture, and even data types, right? Moving from FP64 to 32 to 16 to 8 to 4. So we're improving performance per watt by basically 2x per year, and that means in a few years cell phones will have the same power as a data center GPU today. So that's how I see it: inference will move more and more to the edge, to the endpoints. And eventually, on a longer time frame, even training could, with federated training. That's more like 10 years from now, maybe. But I don't think we're gonna continue this unsustainable gigawatt consumption for AI. It's all gonna come down, just like the Cray-1 was using 100,000 watts and now the phone in your pocket uses four watts with six billion times more performance per watt. That's what's going to happen for AI as well.
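(A quick sanity check of the arithmetic quoted above, taking the Cray-1 at roughly 115 kW, which matches the "more than 100,000 watts" figure, and 1976 as its debut year. These are the conversation's numbers, not a formal benchmark.)

```python
# Sanity check of the performance-per-watt comparison quoted above:
# ~35 TOPS at 4 W in a phone today vs. the Cray-1's ~160 MFLOPS at ~115 kW.
import math

phone_ops_per_watt = 35e12 / 4          # ~8.75e12 ops per watt
cray1_ops_per_watt = 160e6 / 115_000    # ~1.4e3 ops per watt

improvement = phone_ops_per_watt / cray1_ops_per_watt
years = 2024 - 1976
annual_rate = improvement ** (1 / years)

print(f"improvement: {improvement:.2e}x")           # roughly 6e9, i.e. ~6 billion x
print(f"annual rate: {annual_rate:.2f}x per year")  # roughly 1.6x per year
print(f"doubling time: {math.log(2) / math.log(annual_rate):.1f} years")  # ~1.5 years
```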
SPEAKER_00:Well, it's gonna be fantastic to watch it unfold. And it's not just the tech and AI giants that you're working with. You're supporting smaller teams and startups entering the AI race for the first time. I see your posts about this every day, amazing innovation happening. That must be very fun and gratifying.
SPEAKER_01:Yeah, absolutely. I was at a hackathon just yesterday, Sunday, in San Francisco at a startup called Liquid AI. And yeah, that's what they do: small models, small LLMs that can run on incredibly small devices without a whole lot of compute.
SPEAKER_00:Oh, fantastic to watch. So if we were to jump ahead to 2028, like in the Back to the Future movie, what sort of milestone or breakthrough would signal that AMD's AI vision has really come to life? What would you hope to see there in 2028?
SPEAKER_01:Wow, 2028 is a long time from now. In AI terms, that's like decades, right? I would love to see AI moving into more industries. Today the industries AI is really transforming are things like coding; coding is definitely one of the top use cases. Obviously chatbots like ChatGPT, that goes without saying, though that's more on the consumer side. But there's also a lot of innovation happening in biology and medicine with AI, with genome analytics, protein folding, and medical imaging. So I would love to see AI have more breakthroughs there. I guess the one single thing I want to see by 2028 is AI proving a theorem that humans have never proved. That would be a major breakthrough. Like one of the Millennium Prize problems or something like that. That, or unifying the theories that Einstein never could, coming up with quantum gravity or something. That would be fantastic.
SPEAKER_00:Well, that would be, and I don't think we even need AGI to get those kinds of breakthroughs. Speaking of the future and other topics that might be relevant: do you think we'll see quantum computing at real scale, real utility, real use in our lifetime? I know you're involved in different areas.
SPEAKER_01:I'm skeptical, honestly. I think we're gonna see a lot of improvement in quantum computing, but one of the biggest problems in quantum is that those qubits are not stable at all, because they're in a quantum state. If they interact with any other particle, they collapse. That's why it has to be in a vacuum, close to zero Kelvin, and you shouldn't have electromagnetic waves passing by, because that's like a photon hitting your particle. It's very, very finicky, very, very sensitive. So when a company announces that they have one qubit, ten qubits, a hundred qubits, it's not actually quite true. A logical qubit is actually physically implemented on something like a thousand quantum particles. They basically spread the state of one qubit across a thousand, and they constantly have to monitor for errors and error correct. This is the biggest problem in quantum computing, and nobody has cracked that nut. I know Microsoft had a couple of publications saying they came up with a topological qubit that's not sensitive to errors and all that. It was never really peer-reviewed, and it turned out it wasn't totally the case. I think we're still pretty far from having stable qubits. And even when that happens, quantum algorithms are not actually a fit for all the algorithms we use today. They're very specific, to things like combinatorial optimization or cryptography, but they can't just be used for the regular algorithms we run every day. So it's gonna be a combination of quantum and CPUs and GPUs.
SPEAKER_00:Oh yeah, really.
SPEAKER_01:But in our lifetime, I don't know if there's gonna be such a big breakthrough.
SPEAKER_00:Well, it'll be fascinating to watch for sure. Speaking of technologies I'm excited about: I just recently got an AI PC, actually powered by AMD, and I do a lot of cool local stuff with editing and video and AI. So I'm excited, but I'm a tech geek, an early adopter. Do you think AI PCs are really gonna take off? Maybe we'll see more at CES this January. Or is this still mostly a data center and cloud story?
SPEAKER_01:Yeah, they will take off. As I said, the rate of improvement means that in a year or two, AI PCs are going to be as powerful as the data center GPUs of maybe three or four years ago. And that trend will accelerate. So yeah, I totally believe they will take off.
SPEAKER_00:Well, good to know that investment was well made. A final question here: do you think we're in an AI bubble? All the media and journalists are super hyped on this topic. Or are we looking at growth that is really backed by fundamental change?
SPEAKER_01:Yeah, I don't think we're in a bubble. I do understand Michael Burry's concern. He's concerned that there's too big a gap between free cash flow and earnings, and the difference between free cash flow and earnings is basically the amortization schedule of all the GPUs you're buying for your data center. The hyperscalers used to amortize that over two to three years, and now they're amortizing it over something like four to six years. So they're saying they're using their GPUs for a longer time, even though nowadays there's a new GPU coming out every year. Are you really using it for six years? I somewhat think they actually are. There are still a lot of old GPUs out there in the cloud, and they're actually used. But in general, taking a step back, there is a big shift happening in compute, and that's basically turning regular algorithms, optimization types of algorithms, and even algorithms we could never write before in regular CPU programming, into deep learning. The reason we're turning them into deep learning is that deep learning is almost 100% parallelizable. The algorithms we had before had, let's say, a 20-ish percent sequential core to them, and that's why with multi-threading we could never improve by more than about 5x. If it's 20% sequential and 80% parallel, the best you can do with maximum multi-threading is get rid of the 80%, but you still have the 20% left, so you accelerate it by 5x. With deep learning algorithms you don't have that; it's almost all parallel, and you can accelerate an algorithm by 100x, a thousand x, a million x, a billion x. You can keep going. That's why we're not just accelerating algorithms like in HPC by turning bits and pieces into deep learning, we're creating new algorithms that we could never have done before: image recognition, video generation, LLMs. All of that requires a shift in compute, and I don't think we're in a bubble. I think we're just shifting to a new type of compute. It will eventually calm down; we're still in the exponential phase, and it's gonna settle down in a few years. Yes, there's a lot of excitement, and maybe some over-excitement in bits and pieces of the market, but in general I don't think we're in a bubble that's gonna burst in a year or two. I don't believe that.
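(A minimal illustration of the Amdahl's law point above, using the 20% sequential fraction from the conversation and, as a contrast, an assumed 0.1% sequential fraction standing in for a highly parallel deep learning workload.)

```python
# Amdahl's law: with a serial fraction s, the speedup from N parallel workers
# is 1 / (s + (1 - s) / N), which approaches 1/s as N grows.
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

# Classic algorithm with ~20% sequential work: capped near 5x no matter how
# many cores you add.
for n in (8, 64, 1024, 1_000_000):
    print(f"s=0.20,  N={n:>9}: {amdahl_speedup(0.20, n):.2f}x")

# A workload that is ~99.9% parallel keeps scaling much further.
for n in (8, 64, 1024, 1_000_000):
    print(f"s=0.001, N={n:>9}: {amdahl_speedup(0.001, n):.2f}x")
```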
SPEAKER_00:Well, I could listen to you for hours, but sadly I have to let you go. Thanks so much for the insights, the insider's perspective, and a brief view of the future at AMD. I really appreciate it. Thank you.
SPEAKER_01:Thank you, Evan. Thanks, everyone.
SPEAKER_00:Thank you, and thanks everyone for listening, watching, and sharing this episode. Talk to you soon.
SPEAKER_01:Bye bye.