Real World Serverless with theburningmonk

#64: Revolutionising scientific experiments with Emerald Cloud Lab

July 06, 2022 Yan Cui Season 1 Episode 64

Links from the episode:

For more stories about real-world use of serverless technologies, please follow us on Twitter as @RealWorldSls and subscribe to this podcast.

To learn how to build production-ready serverless applications, check out my upcoming workshops.


Opening theme song:
Cheery Monday by Kevin MacLeod
Link: https://incompetech.filmmusic.io/song/3495-cheery-monday
License: http://creativecommons.org/licenses/by/4.0

Yan Cui: 00:13  

Hi, welcome back to another episode of Real World Serverless, a podcast where I speak with real-world practitioners and get their stories from the trenches. Today, I'm joined by Ben Smith from Emerald Cloud Lab. Hi Ben, welcome to the show.


Ben Smith: 00:27  

Hey Yan, thank you for having me.


Yan Cui: 00:29  

So I read a blog post that you guys wrote on the AWS blog about how Emerald Cloud Lab is revolutionising laboratory testing using AWS, and Lambda functions feature pretty heavily in it. That really piqued my interest, so I got in touch with you guys to see if I could get an inside peek into what's actually going on. So before we start, do you mind maybe giving us a quick introduction to what Emerald Cloud Lab does, and maybe your role inside the company?


Ben Smith: 00:58  

Yeah, absolutely. So the goal of Emerald Cloud Lab is basically to make science more reproducible, make it move faster, and make it less toil-based. And the way we do that is that scientists are able to write code in Mathematica, which is then translated through our system into actual experiments that are run in the lab. And right now we have over 100 instruments that can do many, many different types of experiments, from synthetic chemistry to analysis and biology. And you can really run an entire company, an entire biotech company, solely through running these remote experiments and then getting the data back. And obviously that requires a lot of software engineering. So I'm the VP of Engineering at Emerald, where I oversee the software engineering team and also IT, and then we have another team called Scientific Computing, which is focused on doing a lot of the machine learning analysis.


Yan Cui: 01:58  

Okay, so I'm not coming from a scientific background, so when you're talking about experiments, maybe help me understand what sort of experiments you're talking about here. Could it be something like analyzing the genome sequence for COVID-19, or something like that?


Ben Smith: 02:13  

Yeah, absolutely. If you think of the analogy with AWS, where AWS took all of the pieces it takes to run a modern web application and broke them down into storage, compute and network, we've tried to do the same thing with science. So you can imagine that if you're trying to find a new drug, you first want to be able to synthesize the drug, and that means you may take some different compounds and mix them together. You then want to make sure that you've made something that's very pure, so it's not going to cause any contamination issues, so you may run it through some machines that can check that you've actually produced the substance you expect to produce. And then you want to expose it to either proteins or cells or other things to see if it actually has the intended function. And so we've basically broken that down into various low-level building blocks, like mix these two things together, put this thing into a microscope, grow these cells, that you can then combine to run your actual experiment.


Yan Cui: 03:17  

Okay, so does the lab come up with the experiment, the design of the experiment, and then you guys essentially become the execution arm of the lab? So you would take those designs from the client and then execute the experiments themselves, and then you do some computations, or do you provide the computing side of things as a separate business from running the experiments?


Ben Smith: 03:39  

Yeah, great question. So exactly. The idea is that we want to make it so that our users can just think about science. They have some idea that they think some chemical will do something good, and they can then write code that describes how to synthesize it, how to check that it's been made, and how to check that it works. We then use computing to turn that code they've written, almost through a compiler, into something that's executable in the lab in an efficient way. And then we actually run all of those experiments, which can take anywhere from a few hours to a few days. And then the results are given back to them in a structured way, so that they can then do large-scale analysis or machine learning or whatever they need to do on those results.


Yan Cui: 04:22  

Okay, so you are running those experiments, you produce a whole bunch of data, and then you're allowing your customers to run analysis on those results and machine learning stuff. So in that case, do you provide a computing platform for the machine learning models to run on, or do you actually build and run machine learning models for your clients?


Ben Smith: 04:43  

Yeah, it's a great question. So we do both. So we provide a platform for them to run their code, if that's what they choose to do. But a lot of our customers, you know, are really more interested in biology or chemistry than in the computer science or machine learning side. And so we have prebaked analyses, if you will, that they can run on the data they've generated to give them scientifically interesting results.


Yan Cui: 05:10  

Right, right, because I guess a very common thing you hear about in the machine learning world is MLOps, which is obviously a very specific set of DevOps practices applied to machine learning workloads, and you hear a lot about data scientists who are great at building a model, but not so much on the operational side of things, and who are struggling with that. So I guess you're providing an abstraction over all of those things for your customers. I'm actually quite curious in that case, then. With machine learning, oftentimes when I hear people talk about what they're doing, it feels like a specialized thing, depending on the industry and your particular workload, and it requires understanding a lot about the ins and outs of the data you're working with. So in this case, someone just gives you an experiment with all this data and tells you to give them some common set of machine learning models. How well does that actually work in terms of giving them useful insights, when you're a third party and not really an expert in the particular domain you're doing experiments for? How does that actually work?


Ben Smith: 06:17  

Yeah, great question. And when I say machine learning, and I think this is kind of common when we talk about machine learning, I mean a whole spectrum, from very simple things, like predicting if we're going to be out of a certain type of pipette a week from now and whether we should reorder it, kind of scheduling problems, to things like physical simulation, where we can tell you we think your experiment is going to work or is not going to work before you run it. So those are the more generalized ones, all the way up to the very specialized, like you're talking about, where we've run many, many thousands or tens of thousands of a particular type of experiment. And so we can build kind of a library where we say, we've seen things that look similar to what you're trying to do before, and here are the ways that worked, and here are the ways that didn't work. And where this really comes into play is in a process called design of experiments. And so, sorry, I don't mean to be too technical with it, but you can imagine that you're a scientist, you want to mix two things together, and you don't know how much of thing one or thing two you should mix together. And so oftentimes what you do is you just try 20 different combinations to see what comes out. And that can be very slow and expensive, because it actually has to run in the lab. It uses real reagents, which can be expensive. It uses instruments, which can be expensive. And so if you're able to simulate ahead of time what you think is going to happen, that can help you cut the 20 down to two or three. And then you can actually try those in the lab and see if your model worked well.
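To make the design-of-experiments idea concrete, here is a toy sketch that is not from the episode: a purely hypothetical surrogate model scores candidate mixing ratios in simulation so that only the top few are actually run in the lab.

```python
def surrogate_score(ratio_a: float, ratio_b: float) -> float:
    """Hypothetical surrogate model predicting how well a mixture will work.
    In practice this would be fitted to data from past experiments."""
    return -((ratio_a - 0.3) ** 2 + (ratio_b - 0.7) ** 2)

# Instead of running all 20 candidate mixtures in the lab, rank them with the
# surrogate and only send the top few for real execution.
candidates = [(i / 20, 1 - i / 20) for i in range(1, 21)]
ranked = sorted(candidates, key=lambda c: surrogate_score(*c), reverse=True)
print(ranked[:3])  # the two or three mixtures actually worth running
```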


Yan Cui: 07:53  

Okay, okay. So I guess in a day and age where there's a lot of hype around different materials being used for batteries and things like that, this is exactly the kind of thing that can be really powerful as different science teams try to work out the next breakthrough in battery technology. So in this case, how does serverless come into play? Why did you guys settle on using Lambda for your compute, as opposed to, you know, containers or just running raw virtual machines?


Ben Smith: 08:22  

Yeah, great question. So we do a lot of stuff in the lab that's very bursty. We have three major use cases that we're trying to solve. So the first is just making sure the lab is working. We had a script system that would run on a VM, and every minute or five minutes or whatever it would go through a check to make sure, like, all of the temperatures of all the refrigerators were correct, that nobody left the door open, that we had enough of the supplies we needed and didn't need to buy more, all of that kind of stuff. And that was set on a timer to run. And what we started to see is that, as the lab grew and as we were running more and more experiments, just keeping up with that logistical side of things was starting to take more and more time. And we didn't like the fact that every so often we would have to come back and shard our system again and then move it onto multiple VMs. So we decided we wanted something that can just scale infinitely as we grow, without any work from us. The other thing we were running into is that our unit testing system was growing very, very quickly. So we run a lot of tests. Because this is used for drug discovery, it's very important to us that everything be exactly right, and so we run a lot of quality tests pretty much constantly. And again, as the scale of the lab was growing, we had a VM cluster of about four machines that were running all these tests before; we needed 10 times that, and then, you know, next year we're going to need another 10 times that. And then the last piece is, as you point out, these customer analysis jobs, where they want to run big machine learning projects, or even things like: some of the instruments produce very, very large datasets, hundreds of gigabytes of data, and you need to run downsampling or other kinds of data reduction jobs on them. And they're very, very bursty, because we don't know when a customer is going to come in and want to run a big job like this. So I was a big fan of Lambda, but how we actually ended up architecting this is we went with Fargate. And the reason we went with Fargate is because most of our code, as I mentioned at the very beginning, runs in Mathematica, and we're able to containerize Mathematica and then run that in Fargate. And that gives us the scalability and the versatility that we're looking for in a pretty cool way.
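One common way to replace the timer-on-a-VM pattern Ben describes is an EventBridge schedule that launches the lab-check container as a one-off Fargate task every few minutes. The episode doesn't say this is exactly how Emerald wired it up, so the rule name, ARNs and subnet IDs below are purely illustrative.

```python
import boto3

events = boto3.client("events")

# Run the containerized lab-check every five minutes as a throwaway Fargate
# task instead of a cron loop on a long-lived VM. All ARNs are placeholders.
events.put_rule(
    Name="lab-health-check",
    ScheduleExpression="rate(5 minutes)",
    State="ENABLED",
)
events.put_targets(
    Rule="lab-health-check",
    Targets=[{
        "Id": "lab-health-check-task",
        "Arn": "arn:aws:ecs:us-east-1:123456789012:cluster/lab-ops",
        "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-run-task",
        "EcsParameters": {
            "TaskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/lab-check",
            "TaskCount": 1,
            "LaunchType": "FARGATE",
            "NetworkConfiguration": {
                "awsvpcConfiguration": {
                    "Subnets": ["subnet-0abc1234"],
                    "AssignPublicIp": "DISABLED",
                }
            },
        },
    }],
)
```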


Yan Cui: 10:43  

Right, right, because my next question was going to be that with machine learning on Lambda, you're going to run into a lot of the constraints, like the size of the deployment package and things like that, which means oftentimes people have ended up using container images as the packaging format for their Lambda functions anyway, just so they can load a two-gigabyte machine learning model into the Lambda runtime. So I guess in this case, you're just using Lambda functions as part of the pipeline that ingests data and then pushes it into Fargate, which is doing the actual computation and the machine learning training.


Ben Smith: 11:16  

Yeah, that's exactly right. So what we do is we have an RDS cluster that basically stores all of the data. So every time an experiment finishes, or anything like that, the status of all that is stored in RDS. We feed that into a Kinesis stream, so every time anything changes in the lab, it goes through this Kinesis stream. And then we use Lambda to process that Kinesis stream and figure out if we need to trigger new Fargate jobs off of it. So, for example, if somebody sets up a machine learning model, then anytime anyone finishes running what's called an HPLC, which is an experiment you do a lot, they can set it up so that their machine learning model is triggered to update whenever one of those experiments finishes.
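The episode doesn't include code, but a minimal sketch of the pattern Ben describes, a Lambda function consuming the Kinesis stream and launching a Fargate analysis task when a relevant record comes through, might look like this. The record schema, environment variables and container name are all hypothetical.

```python
import base64
import json
import os

import boto3

ecs = boto3.client("ecs")

def handler(event, context):
    """Triggered by the Kinesis stream of lab events; starts a Fargate job
    for each completed experiment that has downstream analysis registered."""
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload.get("eventType") != "experiment_completed":
            continue  # only completed experiments kick off downstream analysis
        ecs.run_task(
            cluster=os.environ["ANALYSIS_CLUSTER"],
            taskDefinition=os.environ["ANALYSIS_TASK_DEF"],
            launchType="FARGATE",
            networkConfiguration={
                "awsvpcConfiguration": {
                    "subnets": os.environ["SUBNET_IDS"].split(","),
                    "assignPublicIp": "DISABLED",
                }
            },
            overrides={
                "containerOverrides": [{
                    "name": "analysis",
                    "environment": [
                        {"name": "PROTOCOL_ID", "value": payload["protocolId"]},
                    ],
                }]
            },
        )
```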


Yan Cui: 11:57  

Right, got it. And I guess that's where Lambda's event triggers become really powerful, very useful. And that's one of the things that Fargate is still kind of missing right now, that event-driven programming model that you get with Lambda. It's great that you can run all these ad hoc tasks on Fargate, but you still have to have something that causes Fargate to start a task, which I've seen quite a lot of people use Lambda for, where you use Lambda for the event trigger, and all it does is start a Fargate task.


Ben Smith: 12:29  

Yeah, that's exactly the direction we've gone, and I think it's worked pretty well. I agree that there are a lot of pieces to this, and I think that's okay, but it does make debugging a little bit more complicated, because you now have to ask: did the event get into Kinesis? Did the Lambda trigger properly? Did it trigger the Fargate job properly? Did the Fargate job run correctly? And so there is a little bit of a debugging overhead there.


Yan Cui: 12:54  

Okay. What about other challenges? What were some of the biggest challenges you had getting this whole thing up and running?


Ben Smith: 13:03  

Yeah, definitely. So there were two main challenges. The first one was kind of cultural, which is that this was the first real distributed system, I would say, that we built at Emerald. We already had the standard architecture of a front-end JavaScript web application that called into a Go back end that talked to RDS. But this was the first time we had an event bus, a bunch of Lambda jobs, a bunch of Fargate jobs. And getting people to start to think in terms of managing and deploying and testing distributed systems was definitely a cultural shift for us. And then the second piece was just scale. Like I mentioned earlier, the reason we wanted to do this is because the traffic was very bursty, and we were tired of having to add VMs and reshard all the time. Fargate is great in the sense that it gives us all the compute we need, but it still depends on a bunch of other things, and all of those other things are now getting hit much, much harder. We've basically gone, in maybe the last year, from running, like I said, probably four or five of these VMs, to now running something like 70 to 100,000 of these jobs a week. And so they're just hammering all of the downstream dependencies, and it can sometimes be very hard to figure out what the failure modes are there.


Yan Cui: 14:31  

I guess this is where things like visibility tooling become super important. What are you guys using to, I guess, get some visibility into what's going on inside your serverless workloads?


Ben Smith: 14:45  

Yeah, great question. So we use Honeycomb throughout our back end. So basically, for everything that happens on the front end, we generate a trace ID, and we can trace that all the way through. That works quite well. I'm super happy with Honeycomb. I've used X-Ray in the past and was not terribly happy with it, so I like Honeycomb a lot. And then a lot of it, to be honest, and this is going to sound really silly, is that we pull the key information from the CloudWatch logs into our database, into our system that the scientists can see. And the reason we do this is because a lot of the time the scientists are the ones submitting these jobs. They're the ones saying, oh, you know, this material has an expiration date of a week, so if we don't have any new ones, we need to know it's going to expire a week from now and have ordered more. And they're not super comfortable with distributed systems, or with logging into AWS to see CloudWatch logs and things like that. And so we try to pull as much of that information back into our database, into the system they're comfortable with, so that they can debug it themselves.


Yan Cui: 15:47  

Okay, so that's going to be interesting, especially if you've got distributed transactions that span across the Lambda function and Fargate as well. So do you then just ingest all the logs from CloudWatch, aggregate them based on the trace ID manually, and put them into some kind of separate view that the scientists can use?


Ben Smith: 16:09  

No. So what we're trying to do... I mean, that probably would be better than what we do. But if you imagine what's happening, there's a system to actually trigger the job: the event comes through, Lambda runs, and that then triggers the Fargate job, and that piece is owned by the software engineering team. They're comfortable logging in and looking at AWS logs, so anything along the lines of "my job was not triggered" goes to them. What we're trying to give the scientists is this: they've written code that is now running in that Fargate container, and we're trying to get them the output of that so they can debug the code they wrote. And so we're just exporting the logs that come from their code running.
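The mechanics of that export aren't spelled out in the episode. One plausible minimal sketch is to filter CloudWatch Logs for the lines emitted by a specific customer job, keyed by a trace or job ID, and copy them into the system the scientists already use; the log group and filter convention here are assumptions.

```python
import boto3

logs = boto3.client("logs")

def collect_job_logs(log_group: str, trace_id: str, start_ms: int, end_ms: int):
    """Yield (timestamp, message) pairs written by one customer job, so they can
    be stored and shown in the scientists' own UI rather than the AWS console."""
    paginator = logs.get_paginator("filter_log_events")
    pages = paginator.paginate(
        logGroupName=log_group,          # e.g. the customer-job container's log group
        filterPattern=f'"{trace_id}"',   # assumes log lines are tagged with the trace ID
        startTime=start_ms,
        endTime=end_ms,
    )
    for page in pages:
        for event in page["events"]:
            yield event["timestamp"], event["message"]
```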


Yan Cui: 16:48  

Right, right. Okay, gotcha. And I guess this also brings up an interesting question, which is, now you're taking jobs from scientists and just running them in your infrastructure. How do you secure that, and make sure they're not able to steal information about your environment and do something malicious with it?


Ben Smith: 17:11  

Yeah, this is a great question that we've spent a lot of time on. So what we have is three separate EKS clusters that have different IAM roles, different execution IAM roles, and they're in different VPCs, and the VPCs have different network access. So one of the clusters is allowed, for example, to talk to our lab network; it has a VPN set up. And in order to submit jobs to that, you have to have the very highest credentials: you need to be an internal employee of ECL, and you need to be someone on the team who should be doing things like that. On the other end is the cluster that customer jobs are submitted to. This is basically in a separate VPC that has no connectivity to any of our other VPCs or our lab network. And we make sure that the execution role is completely pared down. It has no access to, you know, Secrets Manager or any of our internal resources in AWS. And then the fact that it's a container is actually really nice, because we throw it away at the end of the job, and we never mix customer jobs in the same container. So really, they're kind of locked into their one execution environment.
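The episode doesn't share the actual policy, but as a rough illustration of a "completely pared-down" role for customer jobs, the task role might allow nothing beyond the job's own data, something like this (role, bucket and prefix names are made up):

```python
import json

import boto3

iam = boto3.client("iam")

# Trust policy: only ECS tasks may assume this role.
trust = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ecs-tasks.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Permissions: read/write to a single per-customer prefix and nothing else,
# so no Secrets Manager and no internal resources are reachable.
permissions = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": "arn:aws:s3:::customer-job-data/acme-corp/*",
    }],
}

iam.create_role(RoleName="customer-job-task-role",
                AssumeRolePolicyDocument=json.dumps(trust))
iam.put_role_policy(RoleName="customer-job-task-role",
                    PolicyName="customer-job-minimal",
                    PolicyDocument=json.dumps(permissions))
```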


Yan Cui: 18:24  

Right, gotcha. I guess in that case there's not much they can actually do to your environment. What about, can they do anything like, I don't know, try to run Bitcoin mining or something like that? Something kind of dodgy?


Ben Smith: 18:37  

Yeah, absolutely. So this is the challenge of allowing them to write Mathematica code. Mathematica, if you've ever worked in it, there's a lot of stuff you can do in it. It probably has, like, a mine-Bitcoin function that will do it for you. The good news here is that, A, we're running a B2B business, so we're selling to very large pharmaceutical companies; we don't have free accounts or anything like that, so that's not really an issue. And then our model is that we charge them based on concurrent computation usage. And so if they want to pay us to run their coin mining, you know, I don't think that's a good use of anybody's time or money, but it's not a problem for us.


Yan Cui: 19:16  

Right, right. Gotcha. So if they make a mistake, or something happens maliciously, then at least they're going to be paying for the amount of execution time that you end up running, which is what you end up paying AWS for. I guess you've just got a margin on top of what you pay AWS for the amount of time that the Fargate task runs for, right?


Ben Smith: 19:37  

Yeah, so generally, the way the business model works, the contracts are fairly large, because we're running very big, complicated experiments, and so the cost of Fargate is fairly low by comparison. So as long as someone's not actively trying to abuse it, we haven't had an issue where the cost is a major problem.


Yan Cui: 19:56  

Okay, okay. Gotcha. So in this case, you know, you've been running this for a while. What would you say are some of the biggest wins of using this stack versus running loads of infrastructure yourself? You mentioned earlier that there's less of the "every couple of months we have to split our stack again" and upsizing, just being able to handle those bursts in traffic. Was there anything else that you consider a major win for this architecture?


Ben Smith: 20:24  

Yeah, absolutely. So definitely the scalability, with the caveat, as I mentioned earlier, that your downstream components have to be able to handle it. But the two other things I've really liked are, first, the fact that the container gets thrown away. We always used to have these problems with the VMs, where the disk would fill up, or there'd be some corruption, or, you know, basically the state of the VM over time, especially if you're running customer stuff, is problematic. It's also very nice to get the segmentation: if we were running a cluster of four VMs, it would be much harder to make sure there's a VM dedicated to each customer, so the segmentation of the container is super nice. And then the last piece I really like is the testability. And this may just be that we are now using better practices, but our entire stack, from the database to Kinesis to Lambda to Fargate, we're able to spin up programmatically. And so our integration tests actually spin up an entire copy of that, run the full integration tests, and then tear it down. And I know that's possible with VMs as well; perhaps we've just done it better this time.
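Ben doesn't say which tooling they use to spin the stack up programmatically; a common way to get the same ephemeral-environment behaviour is a test fixture that creates a fresh CloudFormation stack for the run and deletes it afterwards. The template path and stack naming below are hypothetical.

```python
import uuid

import boto3
import pytest

cfn = boto3.client("cloudformation")

@pytest.fixture(scope="session")
def ephemeral_stack():
    """Spin up a throwaway copy of the pipeline (stream, functions, task
    definitions) for this test run, then tear it down afterwards."""
    name = f"integration-{uuid.uuid4().hex[:8]}"
    with open("pipeline-template.yaml") as f:   # hypothetical template file
        template = f.read()
    cfn.create_stack(
        StackName=name,
        TemplateBody=template,
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
    cfn.get_waiter("stack_create_complete").wait(StackName=name)
    yield name
    cfn.delete_stack(StackName=name)
    cfn.get_waiter("stack_delete_complete").wait(StackName=name)
```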


Yan Cui: 21:33  

Yeah, that's quite a common... well, maybe not quite common yet, but I see it becoming more and more common as a practice for serverless teams to bring up the entire stack for running CI/CD pipelines, or when you're doing feature development work, because things like Kinesis streams and DynamoDB tables are just so easy and so quick to provision, and so easy to tear down afterwards. Not like when you've got RDS databases that need a lot of dedicated infrastructure around them, like VPCs and security groups, and take a long time to provision and be ready. So I think that sort of thing really helps to drive home this, I guess, new way of doing things, where the whole environment itself is just ephemeral: we bring it up when we need it, and we just throw it away. So we don't have to worry about, like you said, problems with long-running virtual machines. I remember so many of those cases where you'd spend a week or a month debugging some kind of memory issue, memory fragmentation, when a machine has been running for too long, and those problems are so hard to figure out. And with things like Fargate and Lambda functions being so short-lived, you just don't have those kinds of problems.


Ben Smith: 22:45  

Yeah, absolutely, absolutely. Yeah, big fan of that. So we've now basically containerized everything and have no more VMs. And even for our non-Fargate tasks, we use EKS, and we've been super happy with that, just because of exactly what you say: the ephemerality, and the ability to just throw it away and get a new one when things go wrong.


Yan Cui: 23:03  

Okay, so in that case you're using both the Functions-as-a-Service offering as well as the serverless containers within AWS. Are there any things that you'd like to see from AWS to improve the user experience of these services?


Ben Smith: 23:17  

Yeah, absolutely. The number one is just parity between Fargate and EC2. So I look at the offerings of machines you can get on EC2, and then I compare that with what you can get on Fargate, and it just makes me very sad. In particular, things like lots more RAM. Right now I believe it goes up to about four vCPUs and 32 gigs of RAM, something like that, but there are EC2 instances that go up to, you know, a terabyte of RAM, and we would love that. We would love the ability to access GPUs for some of the machine learning models. Basically, parity with all of the EC2 offerings is number one. And then number two, and this may just be a lack of knowledge on my part, but I sometimes have this mental model in my head where, if I have a web service that's responding to external requests, I want to put it in EKS, and if I have this kind of bursty async load, I want to put it in Fargate. But I sometimes have this kind of middle ground, where I have bursty web services that I want to respond well to external requests, but without a bunch of warmup time and things like that. And I feel like that just hasn't hit its stride in Fargate yet.


Yan Cui: 24:33  

Okay, a lot of people would use Lambda for those kinds of APIs. Obviously, Lambda has got its own cold start issues, but I guess it depends on what your latency requirements are and how warm those Lambda workers end up being in production. I think in production, most of the time, you don't really see a lot of cold starts, just because the traffic keeps the workers warm anyway. But yeah, I'd love to see more being done to improve the cold start time. I know they introduced provisioned concurrency a couple of years ago, which means you can now have long-running Lambda workers, but that kind of defeats the purpose a little bit. I guess they are bound by some constraints around how fast they're able to boot up a new environment, but it's one of those things that has come back quite a few times. The GPU thing is very interesting. I've had quite a few customers asking for the same thing, because they're doing machine learning stuff, and doing that on CPU is just nowhere near the level of efficiency you can get on GPU. I didn't realize they don't have it on Fargate yet, but I'm sure that's something that must be on their roadmap somewhere.
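For reference, provisioned concurrency is a configuration applied to a published version or alias of a function; a minimal sketch, with made-up function and alias names:

```python
import boto3

lam = boto3.client("lambda")

# Keep ten execution environments pre-initialised for the "live" alias, so
# requests routed to it avoid cold starts. It cannot target $LATEST.
lam.put_provisioned_concurrency_config(
    FunctionName="latency-sensitive-api",
    Qualifier="live",
    ProvisionedConcurrentExecutions=10,
)
```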


Ben Smith: 25:37  

Yeah, absolutely. And it's possible that I'm just behind the times on these things. And they've recently released it. But yeah, that would be fantastic.


Yan Cui: 25:46  

Okay, so I think those are all the questions that I've got. Before we go, Ben, do you have anything else that you'd like to share? Maybe whether Emerald Cloud Lab is hiring, or maybe something we can go and read about what you guys are doing?


Ben Smith: 26:00  

Yeah, absolutely. We're definitely hiring. We're hiring across all engineering teams: back end, front end, DevOps, and security as well. And there's a post on the AWS Startups blog about what we're doing with serverless. You're also welcome to check out www.emeraldcloudlab.com to learn more about it.


Yan Cui: 26:21  

Okay, I'll put those links in the show notes. And if there's a job spec, feel free to share it with me and I'll put that in the show notes as well. Once again, thank you so much for taking the time to talk with me today, and I hope to catch up with you in person, hopefully at re:Invent at some point.


Ben Smith: 26:36  

Yeah, my pleasure. Thank you for having me. 


Yan Cui: 26:38  

Cheers. Ok, bye bye. 


Ben Smith: 26:39  

Bye.


Yan Cui: 26:53  

So that's it for another episode of Real World Serverless. To access the show notes, please go to realworldserverless.com. If you want to learn how to build production-ready serverless applications, please check out my upcoming courses at productionreadyserverless.com. And I'll see you guys next time.