Trends from the Trenches

Episode: 42 - Adam Marko on AI-Ready Life Sciences Data

Bio-IT World Season 1 Episode 42

Your AI plan can’t outrun your data. Adam Marko, life science field CTO at Hammerspace, joins the podcast to unpack the problem almost every biotech, pharma, and biomedical research group runs into: unstructured data that are siloed, fragmented, and scattered across storage systems, sites, and clouds. With host Jessica StLouis, they talk through what “data orchestration” means when building an AI-ready data foundation, infrastructure constraints and the tiered storage patterns that help teams keep AI and HPC workloads moving, and why life sciences are in a uniquely tough spot. Plus, Marko shares a preview of his presentation at Bio-IT World Conference & Expo in Boston.  

If you care about faster discovery, smoother AI workflows, and fewer manual file moves, subscribe, share this with a colleague, and rate or review so more researchers can find the conversation. 

Links from this episode:  
From Data Chaos to Discovery: Building the Data Foundation for AI-Ready Scientific Research
Bio-IT World Conference & Expo
Bio-IT World
BioTeam
Hammerspace 

Bio-IT World’s Trends from the Trenches podcast delivers your insider’s look at the science, technology, and executive trends driving the life sciences through conversations with industry leaders. 

Why AI Readiness Breaks Down

Jessica StLouis

Hi, I'm Jessica St. Louis, Senior Scientific Consultant at BioTeam, and I'm your guest host. Welcome to the Trends from the Trenches Podcast. I'm joined by Adam Marko, field CTO of Life Sciences at Hammerspace. Organizations across life sciences are racing to become AI-ready, but one of the biggest challenges we keep hearing about is their data. It is unstructured, siloed, and fragmented across storage systems, sites, and clouds. So transforming that fragmented data into an AI-ready state is a massive challenge. Today we're diving into data orchestration, AI enablement, and what it really takes to build an AI-ready data foundation. Adam, thanks so much for joining us.

Adam’s Path Into Research IT

Jessica StLouis

And before we dive in, let's start with your background. What led to your work at Hammerspace?

Adam Marko

Yeah, thanks, Jessica, and thanks for setting this up. I actually have a fairly diverse background. I started in undergrad as an ecology and evolution major. I've always been interested in the life sciences. And as I continued down that path, I became more and more interested in the computational side of life sciences and what computers and technology could do for the field. And it was still fairly new at the time. Unlike now, most universities didn't even have a bioinformatics major. So as I learned more about that, I ended up taking a computational biology class and interned at the Pittsburgh Supercomputing Center, where I was involved in protein structure prediction. And from there, I just got very interested in the intersection of advanced high-performance computing and life sciences. So I worked at the Pittsburgh Supercomputing Center, or the PSC, for a while. And then I went to graduate school at the University of California. And from there, I moved into the corporate space and I worked at several organizations. I was involved in antibody design at Pfizer, agricultural biotech at a company called Mineral Biotechnology, as well as molecular diagnostics based on genomics for cancer patients at a company called Asturgen. And then I actually was at BioTeam for four years. And BioTeam really ramped up my knowledge and experience in research IT, particularly storage. So I became very interested in the storage aspect and the data aspect of life sciences. And that's what led me to another storage vendor. And now I'm at Hammerspace and I've been here just over two years.

Jessica StLouis

That's great. Thanks, Adam. You definitely have a vast background. For the listeners who may not be familiar, can you give us a quick picture of what Hammerspace does and the problems Hammerspace is built to solve?

What Hammerspace Actually Does

Adam Marko

Sure. Hammerspace is a very high performance, flexible, feature-rich data platform. So what Hammerspace can do at a very high level is unify your data in a file system format. So this isn't an application that interacts with your existing file systems. It is a file system. And it can use existing storage, it can be deployed as brand new storage, it can be all on-prem, it can be all on cloud, it can do hybrid cloud as well. And we have something called data orchestration. So that is a rule-based, transparent way to manage your data. So you could imagine running analysis on high performance storage, setting a rule that if the files haven't been touched for a month, for example, move them to colder storage. And that all happens automatically. And when the user logs in, it doesn't look like their files have moved. So their existing directory structure is preserved. So this can span different storage types, as I mentioned, on-premise and the cloud and different locations. So if you have users in the UK and San Diego, for example, it will still look like a single file system that's all managed transparently behind the scenes.
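To make the "untouched for a month, move to colder storage" rule concrete, here is a minimal Python sketch. It is an illustration only, not Hammerspace's policy syntax or API, which the episode does not describe. The `FileRecord` class, the `apply_age_rule` function, the tier names, and the sample paths and dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileRecord:
    path: str              # user-visible path; unchanged when the data moves
    tier: str              # where the bytes currently live: "hot" or "cold"
    last_access: datetime  # last time the file was read or written


def apply_age_rule(files, now, max_idle=timedelta(days=30)):
    """Demote hot files that have been idle longer than max_idle.

    Only the tier field changes, so the directory structure the user
    sees is preserved. Returns the paths that were demoted.
    """
    moved = []
    for f in files:
        if f.tier == "hot" and now - f.last_access > max_idle:
            f.tier = "cold"
            moved.append(f.path)
    return moved


# Hypothetical catalog: one file idle for two months, one touched last week.
now = datetime(2025, 6, 1)
catalog = [
    FileRecord("/proj/run1/reads.fastq", "hot", datetime(2025, 4, 1)),
    FileRecord("/proj/run2/reads.fastq", "hot", datetime(2025, 5, 25)),
]
moved = apply_age_rule(catalog, now)
print(moved)
```

The point of the sketch is the separation Adam describes: the rule changes where the data lives, while the path users work with stays the same.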

Jessica StLouis

Thanks. That really clarifies why data orchestration is so critical. And although I should probably think of the analogy of a symphony orchestra, I keep thinking more of an air traffic controller helping data land where it's needed. Does that make sense, Adam?

Adam Marko

Yeah, that's probably a little more accurate than the symphony. The symphony might be like the underlying file system part, but yeah, it's designed to make sure your data gets put in the right place at the right time. And it's not just limited to location. So we can do things like permissions and accessibility. Sometimes when people hear Hammerspace, they think it's just a dumping ground for all of your data and everyone can see everything. And that's not true. Any permissions you want to enforce, we can enforce. Anything that you're already restricting or granting access to, we can do that as well. And we've had pharma customers, for example, who have asked us to move data to the analysis location and at the same time make a copy that is read-only somewhere else. So that all happens transparently, so the scientists don't have to worry about whether their raw data was copied to the right location and preserved. They can get started with analysis knowing that their data is protected. And we have all kinds of levels of protection as well. So lots of flexibility in how you orchestrate your data.

GPUs And SSD Shortages Meet Reality

Jessica StLouis

Nice. Okay. So I'm glad that lines up. Could you share how organizations, especially those with existing SSD or cloud infrastructure, struggle to make that data readily accessible for AI?

Adam Marko

Yeah, it's definitely a problem. And I think organizations now more than ever are aware of it, both from AI initiatives and the current SSD availability crisis. So one issue, and these aren't in any particular order, is just access to GPUs. That can change daily or even hourly, depending on your cloud provider, what GPUs you can get access to. And you may or may not have them on premise as well. So being able to seamlessly access GPUs wherever they are, and that might mean two different physical locations for your organization and the cloud, for example, Hammerspace can enable that transparently. And we can integrate with a Slurm job scheduler. So again, your users don't have to really know where those GPUs are. They're just working on their data like it's all right there in their own server room. So that's one of the biggest advantages of Hammerspace: the GPU availability. With respect to SSDs and NVMe, we'll just use the term SSD to refer to all flash storage. There's been a major shortage globally. And I think we're all aware of that. Some vendors are delayed for months or years with shipping. And basically, having flash storage is required for modern AI workloads. You can do some workarounds using spinning disk, but it slows things down. It is technically possible. But if you're going to do AI, you need very fast storage. So what Hammerspace can do, and we've done this for some life science customers, is create, we'll just say, a smaller tier of NVMe and then a larger tier of spinning disk that gets you around the availability issue as well as the cost issue. And using Hammerspace data orchestration, that's transparent to the users. So they're doing their high performance analysis on that relatively small NVMe tier. Then it's automatically moved off to the colder tier, the spinning disk or the cloud. And they don't have to worry about filling that smaller tier up and taking up the space. That's done automatically.
So that really enables a more consistent, steady stream of workloads. And for organizations that have existing storage, whether it's NVMe or an existing NAS, we can use that as well. So you can imagine deploying a relatively small NVMe layer and then leveraging your existing storage that at the time wasn't fast enough for GPU or AI analysis. Now it is, and it looks like one single storage platform. So you've greatly expanded the capacity and performance of your platform at a comparatively low cost and overhead. We remove the manual moving back and forth that's very common in life sciences.
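One common way to keep a small fast tier from filling up is watermark-based eviction: when usage crosses a high watermark, the least recently used files are demoted until usage falls to a low watermark. The Python sketch below illustrates that general idea only. The episode does not say how Hammerspace decides what to move, so the function name `evict_to_watermark`, the watermark values, and the sample data are assumptions.

```python
def evict_to_watermark(hot_files, capacity_tb, high=0.8, low=0.5):
    """Demote least recently used files from a small fast tier.

    hot_files maps path -> (size_tb, last_access_day) and is updated in place.
    Nothing happens until usage exceeds high * capacity_tb; then files are
    demoted, oldest access first, until usage is at or below low * capacity_tb.
    Returns the demoted paths in eviction order.
    """
    used = sum(size for size, _ in hot_files.values())
    if used <= high * capacity_tb:
        return []

    demoted = []
    # Oldest last_access_day first (least recently used).
    for path, (size, _) in sorted(hot_files.items(), key=lambda kv: kv[1][1]):
        if used <= low * capacity_tb:
            break
        used -= size
        demoted.append(path)

    for path in demoted:
        del hot_files[path]  # the data now lives on the larger, colder tier
    return demoted


# Hypothetical 10 TB NVMe tier holding 9 TB of data.
nvme = {
    "/proj/a.bam": (4, 1),  # 4 TB, last used on day 1
    "/proj/b.bam": (3, 5),
    "/proj/c.bam": (2, 9),
}
demoted = evict_to_watermark(nvme, capacity_tb=10)
print(demoted, sorted(nvme))
```

In this example only the oldest file is demoted, which brings usage from 9 TB to 5 TB and leaves room for the next batch of analysis on the fast tier.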

Jessica StLouis

Thank you. I think that definitely hits home for a lot of folks. I was wondering, could you give us a real world example where an organization was trying to become AI ready, but it faced infrastructure challenges?

Real-World Fix For Multi-Cloud Data

Adam Marko

Yeah, one of them is an AI medical healthcare research organization in Europe. And they had multiple clouds and multiple locations for their GPUs. And they're doing several different things, but one that's becoming more common is analysis of pathology images. And they were having to move them to multiple different clouds. And this was all a manual process. And it technically works, but it is very slow and it's error prone. You lose track of things. If somebody's on vacation or whatever, you might not know where those files are. All of those things can compound. So what they're doing with Hammerspace is unifying their file system across two clouds and on-prem. And there are mixed GPUs in all the different locations. And now they don't need to worry about where the files are or where the GPUs are. That's all been streamlined for them.

Jessica StLouis

Thank you. I think that's a great example of how removing the data friction can free the researchers to focus on their science.

Adam Marko

Yeah, I think data friction is actually a really good term. I like it. I might use that. And one other thing I wanted to mention as a plug for Hammerspace with respect to SSDs: we do have petabytes of SSD that we are ready to ship. So if organizations are looking to expand their infrastructure with a Hammerspace SSD tier, we can enable that very quickly. So that's different than a lot of supply chain vendors who are delayed for months or years.

Jessica StLouis

Yeah, it's exciting to see technology directly supporting this life-saving type of research. Are there any additional practical examples or perhaps a different perspective you could share?

Adam Marko

Yeah, Hammerspace is very flexible. We're not only designed for AI GPU workloads, but one of our other areas is net new HPC storage. So a major cancer research organization did a brand new build-out of what they call a collaborative research cluster that's going to be used primarily for genomics in the beginning, but then it's going to expand to different research areas. And they created that tiered architecture that I alluded to earlier. They went with the relatively smaller NVMe layer and then a large layer of spinning disks. And this is all Hammerspace storage. It's all transparently managed by Hammerspace. The admins have already set up data orchestration that's all working for them. And this supports a CPU cluster as well as some GPUs. And long term, they do plan to integrate some existing NAS storage that is on site, as well as the cloud. And this contrasts with the other organization I previously mentioned, which is very AI GPU focused. So we can do that whole spectrum. So whatever an organization's needs are with respect to HPC storage and data, we can handle it.

Data Silos And Performance Bottlenecks

Jessica StLouis

Yeah, that's incredible. These are both great examples.

Announcement

Are you enjoying the conversation? We'd love to hear from you. Please subscribe to the podcast and give us a rating. It helps other people find and join the conversation. If you've got speaker or topic ideas, we'd love to hear those too. You can send them in a podcast review.

Jessica StLouis

With that in mind, Adam, from your experience, could you tell me what tends to hold research teams back?

Adam Marko

Right. I think it's global access to data combined with performance. And I say global, but that could mean several storage systems in one location. It could mean multiple clouds, or anything in that range. As we know, and even at Bio-IT World, I think there's a whole panel on data silos. Data silos are a huge issue in life sciences. I think they're probably worse there than in any other research vertical. And that's driven, of course, by different instrumentation and different research areas. But the top-down AI initiatives are really pushing for that data integration and unifying the data. And that's what slows organizations down. Researchers don't necessarily know where the data is, or even if they do know where it is, they can't get it to where it needs to be. And typically that means GPUs or CPUs. And that's what Hammerspace helps enable. And Hammerspace is a file system, as I mentioned, and we're also very high performance. So not only do we enable the data access, we enable very high performance access to your data. So it's more than just opening up a shared drive and seeing your data. It means you can compute directly against it. And we're actually on the IO500 top 20 production list. So we're very much committed to high performance. And that's all done without proprietary drivers. That's over something called NFS, which is included in all Linux kernels.

Jessica StLouis

Wow, the top 20 performance list, that's pretty impressive. And it's pretty amazing that Hammerspace is the only one that combines both. And it sounds like combining both is the differentiator. The timing is crucial and there is pressure to move fast. So, how are organizations thinking differently about leveraging what they already have versus investing in brand new infrastructure?

Adam Marko

Yeah, it's definitely a new conversation, because most workers haven't thought, "I can really expand what I have and amplify my performance in a scalable way." You don't know what you don't know. So educating about what Hammerspace is truly capable of is something we're very focused on. And just to go to a very high performance level, we have something called tier zero. We can turn your local NVMe storage into a parallel file system. And then you can expand that out using Hammerspace to your existing storage. So suddenly you have a massive high performance unified data platform that basically wouldn't have been possible before. So we're educating organizations that this is possible and that they can unify their data. It doesn't have to be a manual process or a separate software layer that moves things around. This is a faster, scalable data platform and parallel file system.

Jessica StLouis

So are you seeing organizations realize that they may already have the infrastructure that they need, but they just need better orchestration? Do you see that?

Adam Marko

Yes, it's like a light bulb goes off when they realize what they can do with what they already have. And we're very vendor agnostic. So we can work with most commodity hardware and some other commercial vendors in the space. So this enables a whole range from, as I mentioned, tier zero improving the disk bandwidth of your GPU nodes all the way to doing a migration off of an older multi-petabyte NAS system, for example. So we really enable that whole spectrum. And that's a very novel technology. So as I mentioned, it's a light bulb moment when organizations realize they can do this.

Why Life Sciences Data Is Different

Jessica StLouis

Yeah. I'm gonna segue back into life sciences again. And I know you already mentioned silos, but for a life sciences audience, what's unique about AI readiness in the life science space compared with other industries?

Adam Marko

I think the nature of life sciences, whether it's the instruments or research types or even file types, has lent itself to silos that were unintentional but necessary at the time. Now making sense of all that is extremely difficult. I don't see it as much in other verticals, although that's not really my area of expertise from dealing with customers at commercial organizations like Hammerspace. It does seem to be the case that life sciences is unfortunately in the worst position to do this. So given that there are so many different instruments, and given that instruments can be upgraded with software or hardware and then suddenly increase the data volume, this is an additional challenge that life sciences organizations face. It's very difficult to keep up with different instrumentation and the pace at which that instrumentation is advancing. I use the example of media and entertainment. If you ask a movie company how much data a two-hour 4K video is going to take, they can tell you. But if you ask a scientist how much data their cryo-EM microscope is going to be putting out in 12 months, they can only guess. There's all that variability that you don't see in other organizations.

Jessica StLouis

Yeah, that's true. There's definitely a lot of variability. And it makes sense that the complexity of life sciences really does add layers. Building on that complexity, what makes biomedical research data especially challenging?

Adam Marko

To go back to my previous points, definitely the different instrument types, the different file types, the unknowns surrounding data volumes. And then there's sort of a whole compliance standpoint. I'll use the term compliance, but I mean things like data retention. How long do you need to keep your raw data? Do you need to keep the raw data? Organizations dealing with patients, of course, have personally identifiable information. And that is different depending on what country they're in. So there are different policies based on different countries, and different policies based on whether this is a commercial drug application or for research-only purposes. That is something that's not really seen in other organizations, this level of compliance, health protection, and data retention.

Jessica StLouis

Got it. Yeah, patient data certainly brings unique privacy and complexity challenges. I want to talk about AI readiness a bit more. I've heard Hammerspace recently launched the Hammerspace AI data platform. Can you tell us a bit about that solution and how it's helping organizations accelerate AI?

NVIDIA AI Data Platform Partnership

Adam Marko

Sure. This was launched by NVIDIA. They're the AI data platform initiators, and it provides a high-level framework and a recipe. I do think they're even called recipes on their website. They have some examples that help you with AI readiness at your organization. It's very new. It was announced maybe 12 months ago by NVIDIA, but Hammerspace announced their official NVIDIA partnership at GTC this year. And this has drawn immense interest from our current customers as well as net new customers that we're talking to. They really want to help enable AI, and NVIDIA is doing more than just selling them GPUs. So they really want to get the workflows in place. And where Hammerspace is different from the other vendors who are AIDP certified is that we have that orchestration layer. So we get you that unified access to data, which another vendor might not be able to provide, so you would still have to do it manually, even though they have the pieces in place to count as an NVIDIA AI data platform.

Jessica StLouis

Very cool. So what would you say has been the biggest impact from this integration?

Adam Marko

The idea that organizations can finally start solving their problems. So we're dealing with that data layer for them so they can actually put the workflows in place and not have to worry about that manual process that has slowed things down for so long.

Jessica StLouis

Yeah, this sounds very promising and it clearly resonates already.

Healthcare HPC And Real-Time Analysis

Jessica StLouis

Adam, I wanted to switch it up a bit and ask you about emerging trends that you foresee, like next gen AI infrastructure or how you think AI and life sciences will evolve. What's your take?

Adam Marko

Yeah, I wrote an article years ago about something called healthcare HPC that I think is finally becoming a reality. And that's different than research HPC. So research HPC is usually de-identified data. It doesn't necessarily mean it's patient or even human data. We're now seeing research doctors actually want to analyze their patient data in a high performance way. So taking existing decades of image data, running that through AI pipelines, and comparing that to real-time output from a patient who's undergoing a 3D CAT scan, for example. That's finally becoming a reality. And it's taken years, from a data management standpoint, a software standpoint, and an analysis and GPU availability standpoint, to really make that happen. And it's still relatively new because there is an education hump that has to be overcome by these researchers who aren't used to HPC. So I think that's going to be one of the largest areas going forward. And if you research some of the spending and expected size of those markets, they're growing faster than almost anything else if you look five to 10 years out. And that really takes advantage of existing data. I think we'll see that in other fields as well, like agricultural biotech or industrial enzymes, really taking existing data to the next level and making better use of it. The other thing is real-time processing. We've seen this in cryo-EM, for example, with improvements in instrument throughput and output, hardware performance, software performance, and availability of GPUs. We're going to see much more real-time analysis. So running the analysis as it's coming off the instrument. And just a technical detail: genomic sequencers used to only have 1 gigabit Ethernet connections. Now they're moving to 10 and even above 10. And that's really enabling much higher real-time performance. So we're going to be able to get to insights much quicker.
And then when you have that existing data that you're able to train the models on, that creates a whole new paradigm for what your research ends up looking like.
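The jump from 1 to 10 gigabit Ethernet on sequencers is easy to quantify with back-of-the-envelope arithmetic. The Python sketch below computes the ideal time to move a run off an instrument at each link speed. The 1 TB run size is a hypothetical figure chosen for illustration, not one quoted in the episode, and the calculation ignores protocol overhead and disk limits, so real transfers will take longer.

```python
def transfer_hours(size_tb, link_gbps):
    """Ideal time in hours to move size_tb terabytes over a link_gbps link.

    Uses decimal units (1 TB = 1e12 bytes, 1 Gb/s = 1e9 bits/s) and assumes
    the link is fully utilized with no protocol overhead.
    """
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds / 3600


# Hypothetical 1 TB sequencing run at the two link speeds mentioned.
for gbps in (1, 10):
    print(f"{gbps} Gb/s: {transfer_hours(1, gbps):.2f} h")
```

Under these assumptions the same run takes about 2.2 hours at 1 Gb/s and about 13 minutes at 10 Gb/s, which is the kind of difference that makes analysis during acquisition practical.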

Jessica StLouis

Yeah, that's great. We definitely have a lot of breakthroughs ahead of us. And whether that's in faster discoveries, new therapies, or something else, we certainly have a lot to look forward to.

BioIT World Preview And How To Connect

Jessica StLouis

Building on that, as these real-time insights become more critical, Bio-IT World is the perfect place to showcase them, since organizations are looking for practical steps to scale AI. And Bio-IT World actually does start today. And Adam, I know you're going to be speaking with Hitachi on building an AI-ready data factory for life sciences. Can you give us a preview of that session?

Adam Marko

Yeah, this is really about thinking about your data from a data-first standpoint. So abstracting away and hopefully forgetting about the infrastructure and enabling your workflows from a data-first standpoint. So I think it really ties together what we've discussed on this call: going from worrying about infrastructure specifics and manual management to focusing on your workflows and your outcomes.

Jessica StLouis

Very cool. I hope those who are here at Bio-IT World can sit in on this presentation on May 21st. And for those attending Bio-IT World, besides Adam's session, you can catch BioTeam's talk on reproducible AI workflows on May 20th. And that one's at 2:30 p.m. And if you're exploring the floor, we will be at booth 419. We'd love to connect. Adam, what is Hammerspace's booth number?

Adam Marko

Yes, we'll be there in booth number 203.

Jessica StLouis

Excellent. Well, Adam, this was fantastic. Thank you so much for joining us. And for our listeners, if you are here at Bio-IT World, be sure to check out Adam's session and connect with both Hammerspace and BioTeam. Thanks for joining us on Trends from the Trenches. BioTeam is a life sciences IT consulting company with multi-expertise consultants. If you have any questions for Adam or me, please email me at jessica bioteam.net. I will make sure I get you to the right person. Thanks for joining us today on the Trends from the Trenches Podcast.