
SNIA Experts on Data
Listen to interviews with SNIA experts on data who cover a wide range of topics on both established and emerging technologies. SNIA is an industry organization that develops global standards and delivers vendor-neutral education on technologies related to data.
The Importance of Solid State Drives (SSDs) in the AI Revolution
What if you could optimize AI storage solutions without breaking the bank? Join Cameron Brett and Jonmichael Hands, Co-Chairs of the SNIA Solid State Drive (SSD) Special Interest Group, as they dive into the latest in solid state drive technology advances – including their increasing presence in artificial intelligence and high performance computing. This discussion looks at technological leaps in SSDs, zeroing in on how enterprise and data center class NVMe SSDs can transform AI system performance while balancing cost considerations.
You'll hear about advancements in QLC SSD technology and the introduction of PCIe Gen 6, and the conversation highlights the importance of energy efficiency and total cost of ownership in navigating the future landscape.
About SNIA:
SNIA is an industry organization that develops global standards and delivers vendor-neutral education on technologies related to data. In these interviews, SNIA experts on data cover a wide range of topics on both established and emerging technologies.
Speaker 1:All right, folks, welcome back to the SNIA Experts on Data podcast. My name is Eric Wright. I'm the Chief Content Officer at GTM Delta and co-host of this amazing podcast. I'm so lucky because I'm surrounded by experts on data, experts on so many things, and I've got two fantastic folks joining me today because we're going to talk about what's new in storage.
Speaker 1:As we're recording this, we're wrapping 2024, which has been a big year for so many things. We've seen a lot of innovation. Show season winds down and the tech events start to move into let's-prep-for-next-year mode. But the one thing that's interesting, of course, is that the work going on in SNIA is well ahead of the stuff you're going to be seeing announcements for in the first part of the year, because the stuff that we're talking about is going to have pretty interesting, long-lasting effects. But really the thing about it is the people that you get to hang out with and spend time with, learning about what's happening, what's coming, and really the idea that we can innovate at incredible paces when we come together in great groups like SNIA. So with that, I'm going to open up and, Cam, if you want to do a quick introduction for folks that are brand new to you, then we'll move on to JM next, and then we'll jump in and talk about what's new in storage for 2025.
Speaker 2:Yeah, sure thing. Thanks, Eric. My name is Cameron Brett. I'm co-chair of the SNIA SSD SIG along with Jonmichael, and we try to be an authority and representation for the industry on things about SSDs, such as form factors, classification, types of SSDs, TCO models, things like that. So Jonmichael and I work very closely on that. I am also an employee of Kioxia.
Speaker 1:And Jonmichael, yeah.
Speaker 3:Jonmichael Hands, Senior Director of Product Planning at FADU, and we thought it'd be fun today to get together and chat about some of the new trends in storage. There's a lot of discussion on, of course, AI and storage workloads, QLC, Gen 6, EDSFF, CXL, all kinds of fun stuff, so we thought we'd just kind of go back and forth and chat a little bit about what's going on in the trends for next year.
Speaker 1:Well, there's no better trend than when you can just say let's rub some AI on it. So, Cameron, let's talk about what is AI doing to drive trends and changes in the storage industry.
Speaker 2:Well, I think both Jonmichael and I will have a lot to say about this, but at a very simplistic level, GPUs and DRAM and storage are definitely the big cost drivers for AI, and storage is critical for AI systems in all the various phases, and one of the goals of storage is to help keep the GPUs working so they have less idle time. All types of workloads are being used. Pretty much all the SSD attributes are tested: low latency, read-write throughput, small block, large block, IOPS, performance, et cetera. In some AI phases the lowest latency and highest performance are needed, but in many cases they're not.
Speaker 2:This is where you have a choice for kind of a cost versus performance comparison and you can choose different SSDs. The highest performance, lowest latency SSDs are enterprise class, and the data center class SSDs still provide very good performance and latency, but with a focus on consistent and predictable performance and latency. These are typically required by hyperscale applications, and this is one of the things that Jonmichael and I worked on, making a distinction between enterprise class and data center class NVMe SSDs. SSDs will also likely play a role in the expansion of memory space to augment existing DRAM in places like RAG, specifically with approximate nearest neighbor search, or ANN. Being able to scale by storing both the index and the vectors in SSDs as the database grows will boost performance while keeping costs in check, since the SSDs can, in some cases, replace DRAM. You'll see more technologies like this in 2025. So there's a lot going on. I'll hand it off to Jonmichael to say a few more words on SSDs and AI.
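To make that DRAM-versus-SSD trade-off concrete, here is a rough sizing sketch. The vector counts, dimensions, per-vector overheads, and DRAM budget below are illustrative assumptions, not figures from the conversation:

```python
# Rough sizing sketch: when does a RAG vector index outgrow DRAM?
# All numbers here are illustrative assumptions, not figures from the episode.

def index_footprint_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4,
                          graph_overhead_per_vector: int = 256) -> int:
    """Raw vectors plus a rough per-vector overhead for an ANN graph/index."""
    return num_vectors * (dim * bytes_per_value + graph_overhead_per_vector)

dram_budget = 2 * 2**40  # assume 2 TiB of DRAM is available for the index
for num_vectors in (100_000_000, 1_000_000_000, 10_000_000_000):
    size = index_footprint_bytes(num_vectors, dim=1024)
    placement = "fits in DRAM" if size <= dram_budget else "needs SSD-resident index and vectors"
    print(f"{num_vectors:>14,} vectors -> {size / 2**40:6.1f} TiB ({placement})")
```

Around a billion 1,024-dimensional vectors the footprint already passes a couple of terabytes, which is the point where the approach Cameron describes, keeping both the index and the vectors on SSD, becomes the practical option.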
Speaker 3:Yeah, where I started was Meta, with the Llama 3 training. Obviously it's an open source model, and there's been a tremendous amount of information they've put in those white papers and research studies on the storage configuration for the actual training, which I thought was really fascinating. For instance, in the Llama 3 training there were something like 16,000 GPUs in the cluster that they actually used, and behind that they had a 240 petabyte NFS cluster, and they're doing this with NVMe over Fabrics, and they describe their Tectonic architecture that they used to use for warm storage and blob storage and how they migrated that to flash. But the really interesting part is these ratios, like this checkpointing, where it's very bursty. They said up to a peak of seven terabytes a second of bandwidth was needed to do these checkpoints on the cluster, which is just insane. You have to remember all these GPUs have to be coherent to do the training. So I really enjoyed these videos that Meta put out. There's one I think is called Training Llama: A Storage Perspective, where they talk about some of these challenges and tail latency.
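As a rough back-of-envelope for why checkpointing is so bursty, here is a sketch; the parameter count and bytes-per-parameter figure are rule-of-thumb assumptions, not Meta's published configuration:

```python
# Back-of-envelope checkpoint sizing (hypothetical numbers, not Meta's published figures).
params = 405e9                 # parameters in a large frontier model
bytes_per_param_ckpt = 14      # rough rule of thumb: fp16 weights + fp32 master copy + optimizer state
ckpt_bytes = params * bytes_per_param_ckpt

for agg_bw_tb_s in (1, 7):     # aggregate write bandwidth of the storage cluster, TB/s
    seconds = ckpt_bytes / (agg_bw_tb_s * 1e12)
    print(f"{ckpt_bytes / 1e12:.1f} TB checkpoint at {agg_bw_tb_s} TB/s -> ~{seconds:.0f} s of stalled GPUs")
```

In a simple synchronous scheme the checkpoint has to land before training resumes, so the faster the burst can be absorbed, the less time tens of thousands of GPUs sit idle; that is where peak figures like seven terabytes a second come from.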
Speaker 3:Obviously all these storage vendors are trying to ask the question of, okay, what do we do in storage to make it better for AI? So for one, I think it's clear that high performance is needed: lots of bandwidth for things like checkpointing and sequential writes, and random reads for training. I ran all the MLPerf Storage stuff, and the three MLPerf workloads that are up for training, like UNet3D and ResNet, are all just high block size random read. And unfortunately, in rev 1.0 there wasn't that much you could tune on the drive to make it go faster; all the tuning was at the file system and network storage level to make the training workloads go faster. I am part of the MLPerf Storage working group, and they're targeting 2.0 for the middle of next year, and there's a bunch of stuff they're trying to tackle: RAG, S3, checkpointing, updated training benchmarks. Everybody wants to know what we can do to benchmark the storage for AI.
Speaker 3:The other really interesting thing that happened was NVIDIA has kicked off this Storage Next effort. It's unclear where it's going to live yet; obviously a bunch of storage vendors and companies are part of SNIA, they want to maybe do this in OCP, maybe some stuff in NVMe, they don't really know yet. But there were 110 storage vendors on that kickoff call, and it was really about, where are we today? And that was part of what Cameron was talking about.
Speaker 3:You know, characterizing these workloads: data loading, sequential versus semi-random; checkpointing, sequential, large block size; RAG embedding lookup, small block size, semi-random; and understanding some of these workloads. But they're looking more to the future, and I think it was really interesting to look at that perspective.
Speaker 3:Right now, storage is obviously not a huge spend in the CapEx for AI. I think some of these hyperscalers were over $30 billion in a quarter for CapEx in data center spend. So if you just think about storage as two to five percent or something of these giant numbers, it's very easy to see why the storage market is so interested in understanding this. But it was really nice because they kind of focused on, okay, where are we at today? We understand these.
Speaker 3:As I just mentioned, there's some basic understanding of some of these training workloads: you need lots of bandwidth, you need lots of IOPS. But I thought it was really interesting where they are going and what they were explaining, and I think CJ gave a session on this at OCP as well as at Supercompute, and a lot of those sessions are up on the internet. They were the ones that pioneered GPUDirect Storage, and they wrote the white papers on the big accelerator memory and GPUDirect, and so they're showing us this world where the I/Os come from the GPU directly to the SSD, or over NVMe over Fabrics through some kind of BlueField NIC or something, directly to the SSDs. And this looks a lot different than storage today.
Speaker 3:Right, it's thousands of queues, thousands of threads. And the really interesting use case I thought about was RAG. We talked about CXL a little bit, and they're like, okay, memory solves your couple-of-terabytes problems. But what if you have a database that's 100 terabytes or two petabytes for RAG? It's not going to work in memory, right? You need SSDs. So the thing that just blew my mind was their target for Gen 6 is 200 million IOPS per GPU.
Speaker 1:Good golly.
Speaker 3:Their challenge to the storage industry is, how do we get that? It's very clear. No pressure.
Speaker 3:We can get to like 7 million IOPS on Gen 6 with standard SSDs we have today. But what are we going to do? I thought this was just a lot of fun. A lot of this work is just getting kicked off, but there's so much going on. And I'll do the obvious shameless plug for SNIA SDC: I thought the AI sessions were fantastic. I went to the one from Dell and one from Microsoft. They were so good, and most of them were way over my head, which means that they're good. That's the SDC way.
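To put those two numbers side by side, a quick bit of arithmetic using the figures as stated in the conversation:

```python
# The gap between NVIDIA's stated target and what a single drive can do today.
target_iops_per_gpu = 200_000_000   # Gen 6 target per GPU, as quoted above
iops_per_gen6_ssd = 7_000_000       # roughly what a standard Gen 6 SSD is expected to deliver

drives_per_gpu = target_iops_per_gpu / iops_per_gen6_ssd
print(f"~{drives_per_gpu:.0f} of today's drives per GPU just to hit the IOPS target")
```

That works out to roughly 29 drives per GPU, which is why the ask lands on controller, NAND, and interface innovation rather than just adding more SSDs.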
Speaker 1:Yeah, they're amazing to watch because of how quickly people head to the stage, and the one-hour talk is what precedes the three-hour talk that really happens when we move to the hallway track. There's such a great set of people there who are really sharing what's going on. It's such a beautiful sounding board. And it's funny, you talk about how do we predict what's coming, and there are so many parameters, so many opportunities for tuning and changing, and then the workloads themselves.
Speaker 1:This is a brave new world, although some would say it's a silly new world, but it is brave in what we're trying to accomplish, and we're trying to do it in the most environmentally intelligent way. We know there are many different sacrifices in this equilibrium of what we have to give up in order to achieve these gains, so it's amazing. So, Jonmichael, one of the things you mentioned there was PCIe Gen 6, and obviously we're seeing some neat stuff coming around that. So what does the 2025 outlook look like for Gen 6, and why is it actually important to the innovation coming up?
Speaker 3:Yeah, the funny thing is, every single generation, Cameron and I have had this discussion, going back to SATA and NVMe. When NVMe first came out, it was like, why does anybody need seven times the performance? SATA is just fine for these workloads. And then you go from Gen 4 to Gen 5, you double the bandwidth. I don't think people quite understand that doubling the bandwidth is 2x the performance, and that's why these interface changes are so important. And, as we just talked about with AI, when you go from, say, 200 gigabit networking to 400 gigabit networking, now you need to go from Gen 4 to Gen 5 on the NIC and you need to go to Gen 5 on the drives to basically saturate the network and saturate the GPUs and all that stuff. So we are seeing all the Gen 5 drives shipping in volume now from all the vendors, which is great.
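For a rough sense of why each generation matters so much on the drive side, here is the per-generation link math for an x4 NVMe SSD; these are approximate raw per-direction figures, and shipping drives land somewhat below them:

```python
# Ballpark per-direction bandwidth for an x4 NVMe SSD by PCIe generation.
# Raw link rate before protocol overhead; Gen 6 also changes encoding (PAM4 + FLIT),
# so treat these as rough doubling-per-generation numbers, not exact drive specs.
GT_PER_LANE = {3: 8, 4: 16, 5: 32, 6: 64}   # GT/s per lane
LANES = 4

for gen, gt in GT_PER_LANE.items():
    gb_per_s = gt * LANES / 8 * (128 / 130)  # 128b/130b framing for Gen 3-5
    print(f"PCIe Gen {gen} x4: ~{gb_per_s:.0f} GB/s per direction")
```

That doubling per generation is what tracks the jump from 200 to 400 to 800 gigabit networking that the drives have to keep fed.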
Speaker 3:We saw a couple of announcements for Gen 6 at FMS, like companies working on SSD controllers. I don't expect any of them to ship in 2025; there's a lot of engineering work that goes on. But the interesting new company in town for high-speed computing is NVIDIA. What AMD and Intel have said about Gen 6 is that it'll be in the 2026, 2027 timeframe, right when NVIDIA is saying, we have Blackwell and we have DPUs and SuperNICs from Mellanox that are going to be shipping Gen 6 next year. So they're asking for Gen 6 yesterday, right?
Speaker 1:Yeah, the demand is already there. No market testing required.
Speaker 3:So back to that, we're going to go through this cycle that we go through every time there's an interface transition, which is, why do we need the extra performance? And I talked to a few of the hyperscalers about this. They said, well, if you can do Gen 6 at the same or less cost than Gen 5, then of course. And when they say cost, they don't mean the cost of the controller or the cost of the NAND; that stuff doesn't materially change. It's really the retimers in the system and the low-loss material and all the other stuff.
Speaker 2:Yeah, overall system costs.
Speaker 3:Yeah, and when they say TCO and lower costs, they mean, okay, how do you go from Gen 5 to Gen 6 without increasing the system cost by 30 or 40 percent, because that's not going to work. And so again, every generation we go through this. At first it's not possible, you have to have extra cost to get there, and then people figure out ways: retimers are getting more common, there are more vendors, prices are going down. So yeah, if it wasn't clear from the AI discussion, AI absolutely needs higher bandwidth drives. They are absolutely going to 800 gig networking, they have Gen 6 GPUs for Blackwell, and they want Gen 6 drives. So it's really a race, and the storage vendors are going to try to figure out how they get there.
Speaker 2:Yeah, I mean, when NVIDIA wants something and they're ready for it, people are going to jump as quickly as possible. NVIDIA is basically the bellwether for AI right now.
Speaker 1:So that is definitely true. Jensen is the new name that drips off the tongues of so many pundits, because we see it whenever there are announcements coming up. I know they've got stuff with CES, so we're likely going to see tons of new announcements about what's already on the ground and what is coming, and then this always has this incredible downstream effect. And it's funny, like you said, JM, this idea of why do we need more. I used to have a network guy at one organization, and I'd say, hey, we need to get on the good backbone for this one, we need to move off the 10 gig to the 100 gig. And he would just look at me, in that grisly networking style, through his cigarette smoke out in the parking lot, going, if you can max the bandwidth on that thing, I'll buy you lunch every day for the next year. He said, this isn't a bandwidth problem.
Speaker 3:You cannot possibly stress the bandwidth, because at the time the bottleneck was always way closer to the workload, and now what needs to go over the wire is incredible. You know, I forgot to mention one key trend, which is that obviously all the controller vendors are trying to figure out, okay, can we make a controller that's twice as fast at similar power? You can, but you have to go to fancier TSMC nodes and do lots of tricks to basically be able to scale that, and the NAND also has to get more power efficient. You can't double the bandwidth and double the IOPS and then not have a huge generation-over-generation improvement in NAND IOPS per watt and bandwidth per watt, because if you want to saturate Gen 6, that's 28 gigabytes a second and around 7 million IOPS, and that's a lot of I/O, and you still have the same power envelope per SSD.
Speaker 3:So this is, uh, incredibly important, as you know, kind of moves us into this discussion about energy efficiency and form factors and, like all this, gen 6 just highlights all these issues that were present on Gen 5, but they were definitely magnified compared to Gen 4. But in Gen 6, they're very polarizing.
Speaker 1:We quite literally have been pounding against these walls as we close out the year. Now then, on the hardware side: QLC, neat stuff, we're going there, and it's funny. Over the years there have always been three things, software, firmware and hardware, that give us gains, and now we're baking it into the hardware layer so that we can potentially unlock future efficiencies again with better software that can leverage these capabilities. So, Cameron, do you want to talk about what's going on in QLC and why it may not be the year of VDI, but it will be the year of QLC?
Speaker 2:Well, as Jonmichael was commenting, all the improvements in performance and in storage density have to happen at the same time, and QLC has been talked about for many years, and this is likely the year when QLC will take a major foothold. There were some QLC SSDs out in 2024, maybe a little bit in 2023, but 2025 is likely when it's going to take a big foothold, especially toward high-cap SSDs and possibly even into some cold or archival SSD use cases. One thing about storage density, and we'll probably touch on this a little more when we talk about power efficiency, is that once you grow an SSD in capacity, you still have the same power envelope for an NVMe SSD. There are some cases where you can go up the power envelope to 40 or 70 watts, but generally speaking, a 25 watt cap is kind of the mark that you don't want to go over for an NVMe SSD.
Speaker 2:Yeah, not all QLC is going to be created equally, and it is going to vary by supplier, and a lot of it is going to depend on the chip and packaging architecture. There are CBA architectures and CUA architectures, and some of those will have an effect on the robustness of the cells, and thereby can determine whether it's really fit for enterprise use or data center class SSDs or archival, or relegated to client types of workloads. But QLC is certainly going to be the key to the high-cap SSD space that we're going to see a lot more of in 2025.
Speaker 1:I for one welcome our high-cap QLC overlords.
Speaker 3:You know, it's funny: the more I talk to folks about QLC stuff, the more I realize we have to go back to the basics. I think some people forget the absolute fundamentals. You go from three bits per cell to four bits per cell, you're storing 33 percent more capacity and you have a cost reduction of 25 percent. That's just at the NAND level: if you have one wafer and you can turn it into TLC or QLC, that's the density increase you get. So if you're looking at just bit output from the industry, they will get more bits from the same wafers if they go to QLC, and at a lower cost. This is just fundamental, and it's the whole reason why you want to go to QLC. Now, this cost reduction is a trade-off. The trade-off is endurance, retention and write performance. You're storing more bits per cell, you have more voltage levels, and you have to do different programming; it takes more power to do the NAND program, so typically there are a lot of trade-offs on write performance. Some of this is alleviated when you go to these much higher capacities, where you can at least get reasonable write performance at 25 watts. But just remember this is a massive trade-off, right? You're going typically from something like 10,000 program/erase cycles for enterprise TLC, while QLC ranges from 1,500 cycles at some of the main vendors. I think Solidigm has 3,000 cycles for their hard drive replacement high capacity QLC, and then they also have a TLC replacement that is like 5,000 cycles.
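The bit-per-cell arithmetic Jonmichael walks through is simple enough to spell out as a tiny worked example:

```python
# The TLC -> QLC math at the NAND die level: same cells, one more bit each.
tlc_bits, qlc_bits = 3, 4

capacity_gain = qlc_bits / tlc_bits - 1   # more bits from the same wafer
cost_reduction = 1 - tlc_bits / qlc_bits  # same wafer cost spread over more bits

print(f"Capacity from the same wafer: +{capacity_gain:.0%}")   # ~+33%
print(f"Cost per bit:                 -{cost_reduction:.0%}")   # ~-25%
# The trade-offs are endurance (roughly 10k P/E cycles for enterprise TLC versus
# about 1.5k-5k for QLC, per the discussion), retention, and write performance.
```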
Speaker 3:So what Cameron said is right, it's all across the board; there are a bunch of vendor differences in the QLC. But we do see things kind of standardizing around this two terabit die, which enables these 60 and 120 terabyte SSDs. So it's very clear that these are very desirable for AI workloads, where power is a huge constraint in the data center, or where they want to alleviate network bottlenecks by basically having a lot of capacity local to the storage nodes. And all we can say is people are buying these really high capacity drives as fast as vendors can make them. So now it's a scramble of, okay, we need to look at this high capacity QLC. There's always going to be this TCO story.
Speaker 3:By the way, the origin of the SNIA TCO model was me doing a bunch of analysis on QLC versus hard drive racks and looking at some of the various workloads, and it gets really interesting right in that 3x to 4x dollar-per-gigabyte range for QLC versus hard drives. And you have all these extra IOPS for deduplication and compression and erasure coding and stuff; there's all this fun stuff you can do with TCO. But yeah, just as a reminder, QLC is always a TCO thing. It always will be a TCO thing.
Speaker 1:Versus TLC it is cheaper, but it has a bunch of trade-offs, and whenever you're doing trade-offs it's a TCO discussion. And one thing that you brought up, which is starting to move to the fore, is that when we talk about lifecycle, it means from manufacturing, even pre-manufacturing, the materials that go into the physical drives and memory themselves, all the way through to destruction. So it's beautiful to see more and more discussions where lifecycle means, quite literally, birth to destruction at the physical layer. And we're seeing a lot more centralized use. Before, every enterprise would have bought up all the NVIDIA gear, and now obviously it's the hyperscalers that are doing the bulk of the buying, and in doing so we could probably get more efficient use of it, because by the nature of their sales cycle, they have to make it effective and efficient for them as a business. There's a continuous battle over who should own innovation, centralization versus decentralization, but we're going to be able to do innovation at scale on pieces of hardware that we couldn't have had access to unless it was in that sort of shared model.
Speaker 1:But anyways, that's just my piece as the outsider looking in: I'm excited by what we've got coming ahead and that we're not just talking about speeds and feeds, and TCO is far more than just what it costs me to buy the drive. Now, power efficiency, again, is close to my heart. I'm somebody who has a number of children, and I hope that they have plenty of excitement in the outside world to enjoy, and that comes from things like what we're doing around power efficiency; we're changing the way we do computing in general. So, when it comes to power efficiency, down to the metal, what's new and what should we really be looking forward to in 2025?
Speaker 3:I gave a presentation at FMS this year on power efficiency and SSD controllers, and where we started was in consumer SSDs, where the drive sits there idle 99% of the time. So power efficiency in a laptop is all about battery life, all about how fast the drive can go to sleep, go to zero power, and then how fast it can wake up. That's all about the NVMe power states and PCIe L1.2, very low power states, and none of that works in the data center because you have very tight latency requirements. By the way, consumer drives are really good now; they can go to zero idle power and get back in 5 to 10 milliseconds, which is like the latency of a hard drive. That's pretty wild, but it's still too much. In the data center world we're talking about 50 to 100 microsecond read latency, so 5 milliseconds is way too much; those tricks don't work. And obviously you have battery backup and power loss capacitors and PLI stuff, so some of these tricks to go to sleep don't exactly apply. But the important part is that on the data center side, the assumption historically is that these drives are always being used, and the measure of power efficiency is performance per watt in the active state. You run an active workload, you measure how much active power the drive consumes to run that workload, and then you divide the performance by the power in watts, and now you have performance per watt. This is a really important metric for a ton of reasons. One, we just talked about these form factor power limits: if you are at an interface limit or a form factor limit of 25 watts, the better your power efficiency, the higher the performance you can deliver at a certain TDP. The example I gave was actually going the other way, which is saying, okay, we have a 25 watt drive, what happens if we cap the power at 16 watts? You want to know what your performance per watt is, so that if you have better performance per watt, you won't lose as much performance when you're reducing the power.
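A minimal sketch of how that metric gets used, with made-up IOPS and wattage numbers and a simplifying assumption that delivered performance scales roughly linearly with the power cap:

```python
# Performance per watt, and what it implies when a drive is power-capped.
# The drive figures are hypothetical, purely to illustrate the comparison.

def perf_per_watt(iops: float, active_watts: float) -> float:
    return iops / active_watts

drives = {
    "drive A": perf_per_watt(iops=2_500_000, active_watts=25),  # 100k IOPS/W
    "drive B": perf_per_watt(iops=3_200_000, active_watts=25),  # 128k IOPS/W, more efficient design
}

for name, eff in drives.items():
    # Assume (simplistically) that delivered IOPS scale with the allowed power.
    print(f"{name}: ~{eff * 25:>9,.0f} IOPS at 25 W, ~{eff * 16:>9,.0f} IOPS capped at 16 W")
```

At the same 16 watt cap, the more efficient drive still delivers more IOPS, which is exactly the argument for tracking IOPS per watt rather than peak IOPS alone.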
Speaker 3:The power savings aren't just on the drive. I think people forget that the drives go into a system, and the systems have fans and all this other stuff going on, and that matters when you have a drive that consumes 16 watts instead of 25 watts. And by the way, hyperscalers have been playing this TCO game for a long, long time: running drives close to their operating temperature, toward the upper limit, to make sure they run fan speeds as low as they absolutely can, to basically reduce the power on the servers and reduce the TCO. So yeah, there's also the whole sustainability angle, which we haven't even discussed yet.
Speaker 3:What I just mentioned is the practical SSD architecture application of performance per watt, which is understanding how much performance you can deliver in a certain power envelope, in certain form factors, and modulating at different power limits for certain customer requirements and certain use cases. The other benefit is that if you can lower the active power in the use phase and lower the TCO of the server, you have less wasted power on idle power, on thermal loss and fans. And there's a bunch of work going on in the sustainability space to enhance the PUE metrics, to be able to describe how you actually measure that inefficiency at the server level, not just at the data center level. Traditionally they've been talking about PUE at the data center level, like the cooling at the data center level and the rack level, but there's cooling within a server; it has fans, and you need to be able to quantify that as well.
Speaker 2:Yeah, that's also where EDSFF comes in, with the cases and the heat sinks, so you can run the fans at lower CFM and help keep the power consumption down. Taking that metric, IOPS per watt or performance per watt, that Jonmichael was talking about, you can then factor in a cost, so you can look at IOPS per watt per dollar and extend that to really take a look at the economics in addition to the performance. And as far as power efficiency goes, the power consumption of a high-cap drive is also constrained; it's not double the power if you double or triple the capacity, you still need to keep it at 25 watts or some sort of constrained power envelope. So that's where additional efficiencies through QLC and eventually five bits per cell come into play.
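Cameron's extension of the metric is easy to show in the same style; the prices and performance figures below are placeholders, not real drive data:

```python
# Folding cost into the efficiency metric: IOPS per watt per dollar.
drives = {
    "enterprise TLC": {"iops": 3_000_000, "watts": 25, "price_usd": 3000},
    "high-cap QLC":   {"iops": 1_500_000, "watts": 25, "price_usd": 4500},
}

for name, d in drives.items():
    metric = d["iops"] / d["watts"] / d["price_usd"]
    print(f"{name:>15}: {metric:6.1f} IOPS per watt per dollar")
```

The point is not the absolute numbers but that a single figure of merit lets you weigh a slower, denser drive against a faster, pricier one on the same axis.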
Speaker 3:Yeah, and I touched on it briefly, but when a workload has a drive that's idle for an extended period of time, now you need to optimize idle power, and we've started to see that. I think Micron was the first to announce a data center drive that actually has a decent power reduction for L1: the new high-cap drive goes from 5-point-something watts down to 4 watts in an L1 substate, which is great. It doesn't sound like a lot of savings, a watt and a half or so, but that's a 20 to 30% reduction; it's actually a pretty big chunk.
Speaker 3:And so now, in this world where you're potentially replacing hard drives with QLC or higher bits per cell or something, and you have a drive that can do hundreds of thousands of IOPS but a hard drive workload that's 200 IOPS, the drive is going to be idle a lot of the time. So now you have to figure out how to save power on the drive side. We're not there yet on the data center side as far as those technologies being deployed, but we have hyperscalers actively asking those questions, which is great, because I was asking them four or five years ago in OCP and everybody was like, shut up.
Speaker 3:Yes, right: we're never, ever going to use L1 substates on a data center drive, don't even talk about it.
Speaker 1:Yeah, and it's funny because that becomes this thing that we find like there's an innovation that will create an efficiency in something that we thought was a dead technology and then that ultimately becomes the new technology. You know, it's funny.
Speaker 1:I'm on the straight software side. I'm a Ruby on Rails fan, just because it's super easy and I've been using it for so long, and it comes out of the box with SQLite. The first thing you used to do was get rid of it, but now there's work in SQLite that makes it as performant, if not more so, and as scalable as most Postgres implementations for moderate to large sites. So we're seeing software development that's being enabled by work that's happening closer to the metal, and the impact is so broad; that's what's exciting to me, because it's not just this one thing that we're innovating on, it's the entire ecosystem that sits atop of it. Every day is a bloody wondrous day to be in computing. I'm excited as heck about it every day, but maybe I'm too much of a nerd, I don't know. I think we all are at some level, and that's it.
Speaker 1:Well, and you know, high cap drives. Let me tell you, on a previous podcast I was talking about remembering the days of the excitement when you could get, wait, we've got 64 gig drives? What? Oh my goodness, this is wild, what an unlimited amount of storage. Then 128, then 256, and now you wouldn't even hand someone a free USB stick if it was less than 512.
Speaker 2:You're like what am I gonna do with that?
Speaker 1:Yeah, exactly, what are you going to do, keep one picture on it? So let's talk about the future of high cap, and where we're seeing high capacity work show up in that enterprise type of deployment.
Speaker 2:Well, the most talked about place right now is AI, and AI certainly has a demand for more storage. But for many years, 960 gig and 3.84 terabytes were the sweet spots, and then, not that long ago, it quietly went up to 8 terabytes and 15. And I know that my company was one of the first to offer a 30 terabyte SSD, but those were kind of used in special use cases. Then, once AI became a reality, 30 terabytes became kind of the starting point for high cap drives, and this is also where, talking about EDSFF, the E3.S 2T and E3.L form factors are where the high cap drives are going to be most prevalent. But yeah, AI has definitely spurred on the high cap race, and every company is sprinting in it right now.
Speaker 1:I was going to say it's kind of like when we look at why we choose these sort of moonshot missions and what they actually get us. While AI doesn't necessarily seem like it's doing what we would hope or expect it to do at the moment, what it is doing is creating a fantastic burgeoning of innovation around data center architectures and hardware and software to allow AI to be efficient and performant. So while we're sitting here getting it to generate emails for us while burning off thousands of watts, at least we can hopefully get it better down to the bits and get these drives to where they're getting the most out of that hardware. Sorry, JM, I cut you off there.
Speaker 3:You touched on one of the use cases, which is terabytes per watt, right? You have a 122-terabyte drive, and the NAND vendors have all said, yeah, we're going to go to 1,000 layers or whatever, so it's not going to slow down; there are certainly paths to 256-terabyte drives and above. And even though hard drives are improving too, Seagate's shipping HAMR in production, and they just announced 30-terabyte CMR and 32-terabyte SMR in production, even for the channel, so the trend is going up on hard drives, SSDs are already four times bigger and a fraction of the physical size. So from just a capacity standpoint, SSDs are already far, far ahead. And that's going to be continually stressed in data centers where they have no more power, and this could be a regional data center, a colo, or an AI data center. Yes, SSDs cost a lot more, they're still 10x the price of hard drives, but maybe if the QLC market comes back down to earth, maybe it's going to be like 6x. And if you have a 6x multiplier on the price but you can go from five racks of hard drives to one rack of SSDs at the same per-rack power, man, these are really tough decisions for data center operators to make. Remember, you also don't just get the capacity, you get a ton of performance, and now you can open up a bunch of AI use cases for reading that data.
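Here is what that decision looks like in rough numbers; the per-rack capacities, power, and the 6x price multiplier are illustrative placeholders, loosely following the scenario Jonmichael sketches:

```python
# A data center operator's consolidation math, with illustrative numbers.
capacity_needed_tb = 10_000   # 10 PB to deploy

hdd_rack = {"tb_per_rack": 2_000, "kw_per_rack": 10, "usd_per_tb": 15}
ssd_rack = {"tb_per_rack": 10_000, "kw_per_rack": 10, "usd_per_tb": 15 * 6}  # the "6x" scenario

for name, m in (("HDD racks", hdd_rack), ("QLC SSD racks", ssd_rack)):
    racks = capacity_needed_tb / m["tb_per_rack"]
    cost = capacity_needed_tb * m["usd_per_tb"]
    print(f"{name:>13}: {racks:.0f} rack(s), {racks * m['kw_per_rack']:.0f} kW, ${cost:,.0f}")
```

Five racks collapse into one at a fifth of the power, for a higher acquisition cost, plus far more IOPS, which is the part that opens up the AI read workloads.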
Speaker 3:Some interesting things we've seen: one thing we mentioned on QLC is that, just from a technical standpoint, it's actually physically cheaper; there are more bits per cell. But the other trend we've seen is some of these vendors have used consumer TLC to basically get to market faster on a high-capacity drive, with 16-die stacks or even potentially these crazy 32-die stacks of consumer 3,000 program/erase cycle TLC NAND. So there isn't just one way to build a super high cap drive, but certainly QLC is the way to build the biggest drives right now. And as far as the use cases, there are tons of use cases for high capacity SSDs.
Speaker 3:Object store, power savings, and all this training data: the training data is going multimodal. I just got the update pushed to my Tesla this morning, which I'm excited for, the new self-driving that they trained in the new xAI data center with all this high bandwidth, high resolution footage from the new cameras. It's so awesome, and for these new training sets, this new big data, you want fast SSDs to basically do that type of workload. So yeah, I don't think it's going to slow down. I think we're already at something like 20% of the bits shipped being QLC, I think it's going to continue to ramp, and now all the analysts are finally saying, yep, now it's time, QLC is going to go nuts next year.
Speaker 1:It's such an interesting innovation area, because with general optimizations we always link back to the sort of Goldratt-esque approach of find the constraint, subordinate the constraint, and we look at this as, how do we just tackle the bottleneck? But what's actually happening with these types of innovations is that we're eliminating someone else's bottleneck by adding innovation in an entirely different area. So that's why there are so many moving parts, but we're seeing them all converge together, and that allows, like you said, stuff that's going on in the xAI data center that just would not have been possible five years ago, thanks to 50 different areas of innovation. And now we're actually pushing the technology with stuff like FSD. I love it. I love seeing use cases that are real, not just, hmm, what could we do with this? We're literally already doing it; it's being used every bloody day.
Speaker 3:Now, I forgot the actual plug for the SNIA SSD SIG. So if you are working on SSDs, and big SSDs, come to the SNIA SSD SIG. I believe Solidigm is about to contribute a paper on how you benchmark large capacity drives for things like endurance. If you have an indirection unit size that's not four kilobytes, maybe much bigger, so you can reduce the amount of RAM, the typical generic workloads just don't make sense. So there's a lot of nuance in how you test these big drives, and that's actually a lot of the technical discussion that we're driving through the work groups.
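For a sense of why the indirection unit (IU) size matters so much on these drives, here is a rough sketch of how the flash translation layer's DRAM map scales with capacity; the 4-byte map entry is a common rule of thumb, used here as an assumption:

```python
# FTL mapping-table DRAM versus indirection unit size, sketched for a 122 TB drive.
def ftl_map_dram_gb(capacity_tb: float, iu_bytes: int, entry_bytes: int = 4) -> float:
    """One map entry per indirection unit of user capacity."""
    return capacity_tb * 1e12 / iu_bytes * entry_bytes / 1e9

for iu in (4_096, 16_384, 65_536):
    print(f"122 TB drive, {iu // 1024:>2} KiB IU: ~{ftl_map_dram_gb(122, iu):6.1f} GB of map DRAM")
```

A 4 KiB IU needs on the order of 120 GB of DRAM just for the map, while a 64 KiB IU brings that under 10 GB, which is also why a naive 4 KiB random-write benchmark says very little about how these drives are actually meant to be used.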
Speaker 1:Yeah, I'd say that even as a buyer of enterprise storage gear for SANs, you'd get those performance numbers from the vendor and you're like, are you sure these aren't just linear 4K reads? I think that your IOmeter test is cute, but let me try it out in production. And every time I put it in production I would get told about a month later, well, you see, the problem is your workloads. I'm like, oh, it's my fault, sorry, sorry for me and my silly workload getting in the way of your performance. So I could literally go on for hours on this stuff. You guys are both fantastic, and thank you both for sharing what's coming up. But as a quick closer, what's super exciting to you, and how do we best get a hold of you if we want to chat more on this stuff? Let's start with you, JM, and then we'll close out with Cam.
Speaker 3:Yeah, Cameron and I, that's me. If you're a member of SNIA, you should come to the SSD SIG. Everybody's welcome. Now, I think the 2025 rules are different, where you used to have to be part of CMSI or whatever.
Speaker 2:It's open to any member now.
Speaker 3:It's open to anybody next year, so just come into the work group if you want to talk about SSDs. Yeah, the TCO stuff: there's a ton of interest in AI TCO and, like I mentioned, now you're talking about IOPS per dollar and things like that, where before it was just gigabytes per dollar. So I'm going to be updating all the SNIA TCO models to basically be ready for AI workloads, and that's going to be a lot of fun. I can't wait. NVIDIA is asking me to do this.
Speaker 2:Yeah, it's going to be fun. For me, it's AI, as well as continually monitoring and being the advocate for EDSFF, this form factor transition away from 2.5-inch and M.2. So I think that's some of the exciting stuff that we have going on in the SSD SIG.
Speaker 1:And, as they say in sports, you've got to be in it to win it, and getting joined up with SNIA is super easy. With the new membership options, we had a great discussion with J Metz talking about how membership looks for 2025, so it's much easier to get involved, and it's much easier to get involved in multiple disciplines within the group now, so every barrier to entry is being lowered. And just the fact that you can sit with a peer group who is living these problems: if you're sitting in your chair and you're shaking with excitement like I am, and you want to go and start putting this stuff into action, then get beside JM and Cameron and all the fine folks at SNIA. There literally is no better community of people doing fantastic things, and there's lots of opportunity ahead. So let's see what's coming up in 2025. Thank you both for, number one, creating great content together in 2024. It's been great to spend time with both of you. And for folks that want to get involved, again, head to snia.org.
Speaker 1:Check out the other podcast episodes as well. We're on audio and we also have these on YouTube, so if you'd like to see these beautiful smiling faces, you can do that too. And of course, JM, you mentioned stuff from SDC and even some of the other stuff with OCP; there's tons of great public-facing content, so even if you missed it, you didn't necessarily need to be there. Check out the YouTube channels; there's lots of great content available for recap and review, and I'm a fan. So with that, both of you, thank you very much. And for all the folks that are watching and/or listening, thank you, happy new year, and we'll see you all in 2025, from the SNIA Experts on Data crew. Yeah, thanks, Eric.