Real World Serverless with theburningmonk

#67: High-Performance Computing with serverless at scale with Eoin Shanaghy

September 07, 2022 Yan Cui Season 1 Episode 67

In this episode, I caught up with Eoin Shanaghy to talk about his work at fourTheorem, a boutique consultancy based out of Ireland. We touched on many topics, including the challenges for serverless adoption at enterprises and a super interesting client project he worked on, which involved using serverless at scale in a high-performance computing environment.

Links from the episode:

For more stories about real-world use of serverless technologies, please follow us on Twitter as @RealWorldSls and subscribe to this podcast.

Want to step up your AWS game and learn how to build production-ready serverless applications? Check out my upcoming workshops and I will teach you everything I know.


Opening theme song:
Cheery Monday by Kevin MacLeod
Link: https://incompetech.filmmusic.io/song/3495-cheery-monday
License: http://creativecommons.org/licenses/by/4.0

Transcript


Yan Cui: 00:12  

Hi, welcome back to another episode of Real World Serverless, a podcast where I speak with real world practitioners and get their stories from the trenches. Today, I'm joined by Eoin Shanaghy. Hey, man, welcome to the show.


Eoin Shanaghy: 00:25  

Hi Yan, how are you doing? Great to be here.


Yan Cui: 00:27  

So yeah, we've connected on Twitter for quite a while now. And I know Luciano, who also works with you, and a former colleague of mine, Gurapit, has recently started with fourTheorem as well. So we've got quite a few connections going on there. So I guess before we start, maybe can you just tell the audience about fourTheorem and what you do over there?


Eoin Shanaghy: 00:47  

Yeah, sure. So fourTheorem is, I suppose, a boutique consultancy in the AWS space. We've been around since 2017. We're still quite small, small and lean, I guess, around 20 people. And while we're growing, we're trying to grow deliberately slowly at the moment, and make sure we keep everyone on the same page and make sure everybody has the opportunity to come up to speed with what we're doing. The kind of stuff we do, I suppose most of the projects we do are either greenfield development, and the other side of it would be more like enterprise migrations, where you're taking an existing workload that might be very large, might have a long legacy to it, and moving it to AWS, usually modernising it along the way. So not so much lift and shift, although that's sometimes what has to be done. But often it's modernisation, and getting to teach people, I suppose, and work with people and show them what you can do with serverless architecture these days.


Yan Cui: 01:47  

So I guess with the enterprise migration projects, do you see a lot of people, I guess the large enterprises, going to serverless as part of this migration and learning about the cloud for the first time? Because I've dealt with a few clients who kind of fall into that category. They kind of just skip the whole containers step altogether, and they want to go straight from on-premises to serverless, because they recognise that's where you're going to get the most value. So is that something that you're seeing as well, in your area?


Eoin Shanaghy: 02:17  

There's definitely a bit of a shift happening there. A couple of years ago you probably would be a little bit more cautious about going into a customer and talking about serverless on day one. We're seeing more and more, it's still not the majority, but you're seeing more and more established big companies, even in financial services and banking, coming to you and talking about serverless and talking about managed services. So there has been a bit of a shift there. But you still have to take it on a case-by-case basis. Because if you're talking about a company that's on prem, that's got a lot of existing workloads, there's a lot to bite off when you're moving to AWS, and even a lift and shift for a company like that is a big undertaking, with a lot of complexity involved. So you kind of have to pick your battles when you're talking about adopting new principles and whether you're going to do serverless. If organisations already have to get up to speed with all the technical concepts around AWS, like IAM and accounts and continuous deployment, even infrastructure as code, and you start loading in completely new development patterns from day one, you can really overload people, and it's not necessarily the right thing to do for them. So sometimes it's a question of, okay, are you going to lift and shift first? Or are we going to start off with a proof of concept, maybe take some new workload they want to build or some core piece of business logic and migrate it? If you do that, then you can often say, okay, well, let's see how we would ideally build it as if we were building it from scratch. And maybe what you end up doing is somewhere in between what they have now and what that ideal greenfield vision would be. And when you do that, you tend to end up using a lot of serverless technologies, because you can get started much faster, and you can often reduce the code base by a significant percentage, like sometimes you can take away 70 or 80% of their code just by rearchitecting it using things like Lambda and Step Functions. So yeah, we're seeing that quite a lot. And it's nice to see that you don't have to approach enterprises with that level of trepidation anymore. You can openly talk about serverless these days.


Yan Cui: 04:24  

Yeah, I think that's the one aspect probably a lot of people don't recognise when they think about serverless and think, oh, this is easy. But then think about someone moving from, say, on premises, who's never done anything besides big monoliths. Now they're thinking about how to build things in a more microservices-architecture style, they have to think about the problems that come with that in terms of distributed tracing and other challenges, and how they organise their code and the team, which they've never done in the past. And this is the first time they are adopting a new architectural paradigm as well as a new compute platform, and they're new to the cloud. So it's that cumulative effect of all the things that you have to take on in one go, that's where I think things get a bit scary. For people like you and me who have been in the cloud for a long time, you're doing that incremental learning. But for some of our clients, this becomes like a big bang, which can be quite scary.


Eoin Shanaghy: 05:18  

Yeah, we kind of have a responsibility there. You know, even though we're serverless evangelists, I suppose, you can't be all or nothing on that approach, you have to understand where people are coming from. And you can really do a lot of harm if you go in there pushing a huge technology jump on customers. You have to figure out, okay, maybe we want to get there eventually, but you have to get to know the people and where they're coming from and what they want to do, and bring them along a journey in some way. And you have to do that incrementally. You have to work with them. And you meet them halfway, right? So some of the things you might migrate more in a way that matches their existing patterns. So often that could be, rather than trying to rearchitect everything in a serverless way, taking some pieces of the workload, containerising them, using infrastructure as code, continuous deployment. And then for maybe more data transformation or orchestration, start introducing things like Step Functions, Lambda, and all these lovely event services like Kinesis and SQS. That helps a lot, I think, if you can meet people halfway like that, because they can start to see the benefits without being completely overstimulated by loading them with 100 new services with all the various concepts and nuances you need to understand. I guess what we have to recognise is that there's a huge amount of knowledge, and you can't achieve theburningmonk level of proficiency with AWS in a short period of time. Like you say, it's incremental, so you have to allow people to learn incrementally too.


Yan Cui: 06:57  

Yeah, absolutely. And I've had a few projects where I ended up just telling the client to keep using relational databases, because I didn't want them to learn DynamoDB as well as everything else that I was teaching them already. And the same thing with VPCs and things like that, which, you know, they don't really need. But rather than trying to educate them as well as the InfoSec team and everybody else, let's just tackle one problem at a time, you know, win one battle, and then we can think about the next one.


Eoin Shanaghy: 07:23  

Yep. For real, we've had customers where, you know, they've had an existing platform, like an IoT system, and they built their system with IoT Core and, as a result, naturally ended up using Lambda and DynamoDB as part of that. And then the customer might say, well, we've also got this Django application with Postgres. We'd like to convert this into an API Gateway, DynamoDB and Lambda based stack instead, because then we're using a consistent set of technologies. We've also found ourselves pushing back on requests like that and saying, well, look, if you've got this REST API built in Django, and it's working, and you're not spending a lot of time maintaining it, then you could just leave it as it is. If you don't have a really compelling reason to migrate that now, maybe you're better off spending your time and investment on other things today. Because, like you say, once you start going down the DynamoDB route, you're kind of opening a can of worms. And, you know, I have mixed feelings about DynamoDB, and where it's headed, and what it means for the kind of serverless direction. But it can open up a can of worms and lead you into all sorts of decisions around single table design versus multiple tables, and all the layers you might have to build on top of that in order to map your front end down to your underlying data store.


Yan Cui: 08:39  

So one of the examples that we talked about before the show was really interesting. It was a serverless HPC use case. So maybe we can pivot and talk about that, because I think that's a really interesting use case, and there are quite a few different talking points we can dive into there as well. This was something that you did a while back, so maybe you can do an introduction and tell us about the project.


Eoin Shanaghy: 09:01  

Yeah, so we've been talking a lot about this case study recently, because it's really not your classic serverless project with, you know, front-end APIs and AppSync or API Gateway. Instead, this is high performance computing at scale. So we started working with a client who's one of the leading reinsurance companies globally, called RenaissanceRe. And this is a company that manages a lot of insurance risk. So they insure insurers, that's their function as a reinsurance company. They would write like a billion dollars in premium in a given year, for example. And they also have to manage a lot of capital, they have a lot of capital in place in order to be able to underwrite that risk. So the goal for that company is basically to manage the risk and then manage the money and the capital they have on hand, to try and optimise for both of those things. And as a company, they're kind of known for being the best in terms of technology use, using technology to analyse the risk and understanding the portfolio risks better than anyone else. And they have this process, which we started working with them on, called a risk roll-up, which is essentially a high performance computing problem where they have to perform lots of statistical modelling jobs in order to understand the risk position of their portfolio of all the insurance deals. And this is a complex set of jobs, like it's a graph, if you can imagine, like a tree structure of jobs that have to be run in some compute environment. In terms of the technology, it's largely Python based, so you can theoretically run it anywhere. But when you talk about HPC, there's a lot of long-running HPC technology out there, like MPI, and on AWS you even have things like ParallelCluster. And you also have things like AWS Batch. And I suppose the job for us: they were in a position where they were running on prem, their workload was taking quite a long time to run, but their business was growing. So they knew they had to move to the cloud in order to get the scale they needed to be able to run workloads into the future. They really wanted to increase the scale by an order of magnitude, we're talking about moving from 10 hours to run a workload down to one hour or less. Or data volume increases over the next couple of years of like 15x. So they're really talking about massive scale that their on-prem servers wouldn't really handle. So we started working with them with the goal of reducing that roll-up time to one hour, and the first POC we did was actually using AWS Batch, because AWS Batch is kind of, at least in terms of its labelling, designed for this kind of workload. You can give it a lot of jobs, you give it a container, and it'll run EC2 instances and run that workload for you. And it has a scheduler as well that will feed the cluster with all of the job parameters, and run all the jobs until they're complete. So with that, we were actually able to do the process in an hour. But we noticed with Batch there's still a lot of waste and inefficiency, because the scheduler has quite a lot of overhead, and the jobs it allows you to run are not very granular. So you end up actually wasting a lot of time, we found that we were wasting like 50% of the cluster while you were running a workload end to end. So we started looking at alternatives to that.
When we had done the Batch POC we had used Lambda and Step Functions for the orchestration part of the workload, and we started to move into more serverless technologies as we evolved. We were kind of in a privileged position, I suppose, as a consultant working with the customer, because RenaissanceRe had already recognised that they are a reinsurance company, they don't want to be maintaining lots of infrastructure. So they were already asking us, can you use managed services and avoid us having to maintain a lot of infrastructure, we're trying to move away from that. So we started experimenting with running this HPC workload in Lambda and in Fargate, and figuring out what that would look like. And early on, we had pretty good success with all that, actually. The main concern is that AWS doesn't have anything that will really allow you to schedule this kind of workload at this scale in a serverless way. So if you imagine, you've got a big tree of jobs you want to run, and a lot of them are very different sizes. Some of them might take a minute to run, and some of them might take an hour. So if you want to parallelise that on stateless, serverless compute, what we do is we split the jobs into chunks so that you can run them in a much smaller execution time. So we basically fan the job out from a graph of 1,000 or 5,000 nodes into one that has like a million and a half nodes. So then you end up with like a million and a half things to run, and then you can choose wherever you run them. So we split the whole workload into a planning phase, where you create a plan of all the jobs you want to run, and then you run the workload in your compute infrastructure. And in order to do that, you know, all of the schedulers that are out there for running HPC workloads are really designed to work with stateful, traditional compute, like large instances, typically with large amounts of memory and shared storage, maybe with inter-process communication as well. But we felt that there was definitely a place to be able to do this in a serverless way, where you could just respond to events that were happening on this serverless cluster, and process them using serverless infrastructure. So we essentially built a custom scheduler using things like Step Functions, Lambda, Kinesis, and Redis, actually ElastiCache Redis, for storing the state of all of your jobs. And that scheduler works pretty well. As each job runs, you have a Lambda function whose job is to figure out, as quickly as possible, is there another job I can run, so that you feed that cluster as quickly as possible. And that makes sure that the cluster is running jobs 100% of the time. And it works really well. I guess the downside is that we ended up creating this kind of custom scheduler, and that's not something that we'd really like to have to do. We'd much rather there was something off the shelf we could buy, either an AWS service or a third-party service, that would allow you to take a graph of 2 million jobs and let it run on Lambda, or on whatever container infrastructure.
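
To make the shape of that custom scheduler a bit more concrete, here is a minimal sketch of the event-driven part of the idea, assuming hypothetical resource names: a Lambda function consumes job-status events from Kinesis, tracks dependency state in ElastiCache Redis, and enqueues any jobs that have become runnable onto SQS. It illustrates the pattern described above rather than the actual production implementation.

import base64
import json
import os

import boto3
import redis

sqs = boto3.client("sqs")
state = redis.Redis(host=os.environ["REDIS_HOST"], port=6379)
QUEUE_URL = os.environ["JOB_QUEUE_URL"]  # assumed environment variable

def handler(event, context):
    # Each Kinesis record is a job-status event emitted by a worker.
    for record in event["Records"]:
        status = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if status["state"] != "SUCCEEDED":
            continue  # failure handling omitted from this sketch
        job_id = status["job_id"]
        state.set(f"job:{job_id}:state", "SUCCEEDED")
        # Decrement the unmet-dependency counter of each downstream job;
        # when it hits zero, that job is ready to be fed to the cluster.
        for child in state.smembers(f"job:{job_id}:children"):
            child_id = child.decode()
            if state.decr(f"job:{child_id}:pending_deps") == 0:
                sqs.send_message(
                    QueueUrl=QUEUE_URL,
                    MessageBody=json.dumps({"job_id": child_id}),
                )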


Yan Cui: 15:39  

This is actually very similar to something that one of my clients is actually working on, this notion of how to run some kind of data processing job that has got dependencies on other calculations that need to be triggered, and needs to make these on-the-spot decisions in terms of, okay, I don't have any data for this value I need to calculate, so I need to go and trigger some other job to run that as well. And we've also approached this using Step Functions. But it sounds like you guys are using Fargate or something else to do the actual data crunching, and then using Lambda and Step Functions as the orchestration layer. Did I get that right?


Eoin Shanaghy: 16:14  

Yeah, I mean, there are two layers of orchestration. The first one is the Step Function, and that's more like saying, okay, first we need to get our inputs parsed and create a plan. And then you have a plan, so you don't have to dynamically figure out what has to run based on data you need, you kind of predict what data each job needs, so that you have a completely deterministic plan at the start. And then you start feeding the SQS queue, which is actually something you have to design for, because if you've got 250 jobs and you want to get them into an SQS queue as quickly as possible, you can't just do send message batch, right, you need to parallelise that and get them in there. Once the jobs are in the queue, and workers start picking them up and running them, pulling them off the SQS queue, the Step Function has almost done its job, it waits for the callback when the whole process is finished. But the rest of the orchestration is essentially event-driven choreography, because every job is reporting its success or failure on a Kinesis stream that gets picked up by a Lambda, which processes it and schedules subsequent jobs. So that's all asynchronous and event driven. Each part of the system is very self-contained. The worker itself has very complex modelling logic in it, a proprietary model that the customer has built over many years, but from our perspective we kind of treat that as a black box. And it's a very simple wrapper that just pulls a job from SQS, reads and writes from S3, and emits state on a Kinesis stream. So the fact that you're using all of these separate stateless pieces is actually pretty good from a point of view of development and troubleshooting, because everything can be considered in isolation and replicated. If you want to figure out why a job has failed, or any individual piece has failed, you can just run that Lambda with one input and replay it, or run it locally with the same input, or run the container with the same input, and understand how it works pretty well. Whereas if you're using something like Spark, where it distributes the workload for you and it's kind of hidden from you, when things go wrong it can be much more difficult to actually drill down into the source of the problem, right? Because you can't really isolate it down to a single stateless piece of compute.
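
As an aside, the point about not being able to just call send message batch serially is worth illustrating. Here is a small sketch, with an assumed queue URL and job format, of fanning a plan into SQS from multiple threads; send_message_batch accepts at most 10 messages per call, so the batches themselves are sent in parallel.

import json
from concurrent.futures import ThreadPoolExecutor

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/job-queue"  # placeholder

def send_batch(indexed_batch):
    offset, batch = indexed_batch
    entries = [
        {"Id": str(offset + i), "MessageBody": json.dumps(job)}
        for i, job in enumerate(batch)
    ]
    sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)

def enqueue_jobs(jobs, workers=32):
    # SQS allows at most 10 messages per batch, so chunk first, then send
    # the chunks concurrently to keep the total enqueue time short.
    batches = [(i, jobs[i : i + 10]) for i in range(0, len(jobs), 10)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(send_batch, batches))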


Yan Cui: 18:37  

Okay, and in this case, you're using Kinesis to feed the decision-making Lambda function. Is it because you want to have some kind of ordering in place, and also you want to make one decision for one job at a time, so that you don't have two different Lambda instances making decisions for the same job and then having some kind of a conflict?


Eoin Shanaghy: 18:58  

Yeah, I mean, we use Kinesis there because it gives you pretty good low latency, and it gives you guaranteed ordering. So for a given job, we make sure that the events arrive in order, so you don't get the succeeded event before the started event, for example. And yeah, it also allows us to control the parallelisation of that decision-making Lambda, as you say, because Kinesis gives you very strict, predictable concurrency, because you've got your shard count and your parallelisation factor. So we can size that very predictably, I guess, and we know exactly how it's going to run.
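
For readers who haven't used this knob, the concurrency Eoin describes is set when wiring the Lambda to the stream. A minimal sketch with hypothetical stream and function names: concurrency is bounded by shard count times ParallelizationFactor, and ordering is preserved per partition key (here, per job).

import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:eu-west-1:123456789012:stream/job-status",  # placeholder
    FunctionName="scheduler-decision-fn",  # placeholder
    StartingPosition="LATEST",
    BatchSize=100,
    ParallelizationFactor=1,  # one concurrent batch per shard keeps per-job ordering strict
)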


Yan Cui: 19:37  

Okay, and then in this case, you are sending the jobs to this Fargate cluster. And it sounds like the jobs themselves can be quite spiky, because you mentioned that you can have one really big job, and that can fan out into tens of thousands or hundreds of thousands of different smaller jobs. So in terms of scaling, is it something that you guys are doing on demand based on the amount of CPU you have in the cluster and all of that, or is it something that you do more predictably, because you know that this job is going to require a big cluster of resources to process?


Eoin Shanaghy: 20:11  

Yeah, it's an interesting one. We have this very heterogeneous set of inputs, because it's got really tiny jobs and really massive jobs, like jobs that run in seconds, jobs that take an hour, jobs that produce a megabyte of data, and jobs that produce gigabytes of data. So we try to evenly size them by splitting them into more or less similarly sized jobs, but you still have some variation. The clusters are completely uniform in size, and every container is going to run one job at a time. This makes monitoring and troubleshooting a lot easier, actually. So we try to make sure that every worker is dealing with one job at a time, in series, and they're using a fixed capacity. So it's like four gigabytes of RAM and a single CPU in a container infrastructure, or similar in Lambda, right, so you configure it for four gigabytes and you're using pretty much one CPU. And what we'd like to be able to do, I guess, which you can't really do right now, is, based on knowledge of what the size of the job is, dynamically trigger a Lambda function that has a different size to the other ones running with the same function configuration. That would be pretty cool for this workload, because then you could run some of them with 512 megabytes of RAM, and others with eight gigabytes, right, and then you would be optimising for performance and cost pretty effectively, because we know generally how many resources an individual job uses. But it's something you need to be able to do at runtime, not at deployment time.
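
The planning-time splitting Eoin mentions, turning a heterogeneous set of jobs into roughly uniform chunks so each fixed-size worker does a similar amount of work, can be sketched very simply; the cost estimate and target chunk size below are illustrative assumptions, not values from the project.

TARGET_UNITS_PER_CHUNK = 1_000  # assumed "work units" a single fixed-size worker should handle

def split_job(job_id, estimated_units):
    """Fan one large job out into similarly sized chunks for stateless workers."""
    n_chunks = max(1, -(-estimated_units // TARGET_UNITS_PER_CHUNK))  # ceiling division
    return [
        {"job_id": job_id, "chunk": i, "of": n_chunks}
        for i in range(n_chunks)
    ]

# e.g. a job estimated at 4,500 units becomes 5 roughly equal chunks
plan = split_job("portfolio-rollup-42", 4_500)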


Yan Cui: 21:45  

Okay, so in that case, could you not just deploy different variants of the same function with different memory settings, and then at runtime say, invoke this one instead of that one?


Eoin Shanaghy: 21:56  

You could, but that makes managing your concurrency, your reserved concurrency and stuff, a little bit more challenging, because you don't always know the shape of a given workload, and how many of them are going to be big jobs and how many are small jobs. So that's pretty hard to predict in advance. So I'd rather avoid that kind of custom workaround, and push the request back onto AWS and hope they make that possible in the future. I mean, right now you're talking about probably premature optimisation. You know, cost is an important factor here, but it's more important that the workload runs right, and that you get scale. That's way more important for the customer. So we do some cost optimisation, I mean, that's why we use Fargate. Ideally, we would run everything in Lambda, because the scalability characteristics of Lambda are just the best by far. One of the things we had to do with Fargate, when we started working with this, we wanted to run, you know, 2,000 or 3,000 containers to run this workload. That was taking like an hour and a half by default when you use the ECS service with Fargate. So we talked to the Fargate team, and they gave us a lot of support, and we were able to put in place a custom scheduler with some limit increases, using the RunTask API, and we were able to scale them in about 15 minutes. But it meant we had to put in place this custom scheduler with Step Functions and Lambda to scale up and down when we were running the workloads. This is another piece of code we'd rather get rid of. Actually, now we are in a position where we can get rid of it, because the ECS team has performed a few miracles in the meantime, and now the scalability characteristics of Fargate out of the box are way better than they used to be, and you can scale to thousands of containers in 5 to 10 minutes. But as well as this long-running large workload that runs a couple of times a day, there's also another use case for the whole platform, which is like on-demand, real-time analytics. And this is where you can imagine somebody on the phone trying to make a reinsurance deal, and they want to be able to price that and assess the risk in real time. So we can use the same platform, and the analyst at the end of the phone can run a deal and get results pretty quickly, like in the order of less than 30 seconds. And Fargate's scalability characteristics aren't that suitable for that, and we don't really want to have to pre-warm, because you can't predict when this is going to happen, or how many people are going to be doing this at the same time. So for that, we actually route the modelling to Lambda instead of Fargate. For these near real-time requests, we use the same scheduler, the same orchestration layer, but in terms of compute, it goes to Lambda instead of Fargate. And that means we get way better scalability characteristics. And actually, we think we can scale pretty large, because we've talked to AWS about this workload quite a lot, and they've been pretty accommodating. Now we can actually burst to 10,000 concurrent Lambdas, and we can scale to 10,000 concurrent Fargate containers as well. So that gives you a lot of power in terms of being able to have lots of people in the organisation running concurrent workloads and analytics in a really short space of time.
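
A rough sketch of that compute-routing idea, with placeholder names and a simplified routing condition rather than the production code: near real-time requests are invoked asynchronously on Lambda, which bursts fastest, while bulk roll-up jobs are launched as one-job-per-container Fargate tasks via the ECS RunTask API.

import json

import boto3

lambda_client = boto3.client("lambda")
ecs = boto3.client("ecs")

def dispatch(job, realtime=False):
    if realtime:
        # Interactive pricing requests go to Lambda for the fastest scale-up.
        lambda_client.invoke(
            FunctionName="risk-model-worker",  # placeholder
            InvocationType="Event",            # asynchronous invocation
            Payload=json.dumps(job).encode(),
        )
    else:
        # Bulk roll-up jobs run as one-task-per-job Fargate containers.
        ecs.run_task(
            cluster="rollup-cluster",              # placeholder
            launchType="FARGATE",
            taskDefinition="risk-model-worker:1",  # placeholder
            networkConfiguration={
                "awsvpcConfiguration": {
                    "subnets": ["subnet-0123456789abcdef0"],  # placeholder
                    "assignPublicIp": "DISABLED",
                }
            },
            overrides={
                "containerOverrides": [
                    {
                        "name": "worker",  # container name in the task definition
                        "environment": [{"name": "JOB_ID", "value": job["job_id"]}],
                    }
                ]
            },
        )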


Yan Cui: 25:12  

Yeah, that's one of the things that I guess AWS doesn't really publicise, the fact that even some of the hard limits are negotiable if you've got a really good use case you can demonstrate to AWS. And I remember when I was at DAZN, we had similar conversations about lifting the 3,000 concurrency burst limit on Lambda, because some of our workloads were also very bursty. I guess the way it fell down a bit for us was the fact that our limits would just need to keep being pushed up, because we didn't know where the limit was, like, today maybe 10,000, tomorrow maybe 20,000, depending on how many concurrent users we're gonna get. So I think for that reason, we ended up just running a bunch of containers and pre-scaling them, because DAZN is sports streaming, so we know when the match kicks off, we know when people are going to come in. So it's easier to pre-scale based on a schedule, rather than something that is required on demand, as in your case. So in this case, I guess with some of those really bursty workloads, having every container doing one task at a time makes debugging a lot easier. But if you want to make better use of the containers you already have, I guess you could push to a model where you can process multiple requests with one container simultaneously, so that you don't need to have as many containers in a burst as you would do otherwise, right?


Eoin Shanaghy: 26:35  

You can, yeah, for sure. It's all part of that simplicity versus performance optimisation trade-off, really. And I guess we're pretty stubborn when it comes to resisting going down paths like that, because with every little decision you take to do something like that, and say, okay, we can optimise this, we've got this small amount of CPU and memory over here, let's use that, right? All of a sudden, you end up with a multi-concurrent execution environment in one container. When the container fails, you have to figure out which job succeeded and which job was still running, and respond to that. And everything just becomes a small bit more complex. And if you keep making decisions like that, then these things multiply, right, and then all of a sudden you've got infrastructure that's harder to reason about and harder to troubleshoot. And already, with this custom scheduler, you know, as a piece of engineering, the fact that we built this serverless scheduler, we're kind of proud of it, and we're happy it works. But at the same time, we realised that when you write a piece of code like this for a customer, together with the customer, you've taken on technical debt, because now you've got your own custom scheduler, and we'd rather avoid that. We're looking for opportunities to simplify all the time, not just tinker with code all day, you know, because ultimately they're looking for fast time to insight on their portfolio. They're not looking for delivery of millions of lines of code.


Yan Cui: 28:08  

Yeah, totally understand. And I think the aspect of having one request at a time that you get with Lambda is probably underappreciated. I spent quite a bit of time with Erlang and I'm really a big fan of the actor model. And one of the things you get with the actor model is the fact that it just processes one message at a time. So you never have to think about, like you said, if the actor failed, figuring out which jobs failed and which ones were being processed concurrently. I think that aspect of the whole Lambda runtime is quite alien to people when they first come to Lambda. But it's something that really helps in terms of understanding and debugging problems.


Eoin Shanaghy: 28:44  

Yeah, I agree. I also spent a bit of time, luckily, with Erlang and Elixir and really loved that model. And I was trying to kind of replicate parts of it, I guess, in software architecture. It'll be interesting to see where Lambda is going to go, because I guess AWS, being customer-driven, is often customer-led, right? And that often gets pulled in directions where, you know, customers may want to run multiple things concurrently in Lambda, and have all these niche workloads, and very influential customers want to do certain things, and we'll all feel the effect of that, right? As Lambda grows, and the number of features grows. It's always, I suppose, a little bit of a source of frustration, because Lambda's beauty is in its simplicity, right? And as they add more and more features, it becomes very powerful and flexible for people. But for us as developers evangelising the serverless model because of its simplicity, you don't want that to be taken away, right? You want things to be kept as simple as possible. And sometimes we have to face the fact that a lot of these services, DynamoDB and Lambda, can be very simple if you use them in a simple way. But you have to be very conscious of not overcomplicating how you use them, and boiling it down to your simplest minimal need, because the number of knobs and buttons you can twiddle is growing.


Yan Cui: 30:08  

I do my workshops, and normally we start with an introduction to Lambda session that started off as maybe 10, 15 minutes, and nowadays it's probably closer to an hour just for the Lambda 101 introduction, because of all the different configurations, like you said, that they have. But I think with the design of Lambda, I'm still basically just using the same features that I was maybe four years ago, because all the other additional features that they've introduced are great for specific use cases, which don't impact me, say, 90% of the time, so I don't even think about them. I just do the same thing that I've been doing, which works, it's simple: zipping your content instead of using container images, shipping stuff without custom runtimes or any of that stuff, just using the vanilla Lambda that you had four years ago. But it's great that you've now got all these additional things, provisioned concurrency and all of that, for specific use cases. And I think sometimes it's almost hard to tell people that you don't have to do all this fancy stuff, it doesn't impact you, just do the standard, boring setup that just works.


Eoin Shanaghy: 31:17  

I agree, I agree. I mean, having the ability to run containers with 10 gigs of RAM was a big help for us, particularly in workloads like that. Other things like provisioned concurrency I really try to steer away from, because it's not really true to the original simple promise of Lambda. So it's really, like you say, only if you really, really have to. But I really think that for workloads like this, you know, batch processing workloads, Lambda is really well suited to them. It's not the original intention and not the original set of use cases that came with Lambda. But the fact that you can take a complex, massive running workload and divide it into these small pieces, and then run them in individual stateless functions, once you do this in the simplest possible way in terms of orchestration, it makes it quite simple as a developer to work with, right? Because if something goes wrong, you don't have to worry about it taking an hour to go wrong anymore. It's a matter of minutes before it fails. And this is like fail fast, small, individual pieces. It's a model that really works well for HPC. So I think we're gonna see a lot more of it. And, you know, we can already see that this is kind of being addressed, because they've announced bulk pricing discounts now, which kind of help with these kinds of workloads that aren't API based. So I think there's a huge future in using Lambda for scientific computing and financial modelling.


Yan Cui: 32:38  

Yeah, I think there are still a few hurdles to get over. I don't love the fact that you have to use containers sometimes to get over the fact that you are often using packages that are too big to fit into the package size limit and things like that. Once you have to do that, there's a whole bunch of other things you have to think about in terms of decisions you have to make, you know, where do you put the container images, and all the security and stuff around the container image itself. There's just so much additional work you have to take on once you cross this threshold, and I think that's something they really need to address to make these kinds of HPC workloads more accessible on Lambda. But I appreciate this is coming, it's something that a lot of people have asked for. I remember talking to Denis Bauer from CSIRO, which is the Australian government agency that did a lot of the gene sequencing on COVID back in the early days of the pandemic, and she talked about the same problem back then, just how difficult it is to fight to get the packages into Lambda functions. And the fact that containers are there is great, but it's just so much more work compared to what you would normally do: just zip everything up, upload it, that's it, done.


Eoin Shanaghy: 33:55  

Yeah, it seems arbitrary, you know, that it's 250 meg for a zip package. Why can't you do 10 gigs with a zip as well? Yeah, I remember listening to that episode with Denis Bauer, and it was really awesome, actually, because it resonated a lot with us, because we were also involved in this project at the time. So it's great, and you can see more and more people are trying to do that kind of scientific modelling or financial modelling at scale, and trying to move away from instances. And I think there's a big future there. AWS is probably going to, I don't know whether it's going to be with Step Functions or Batch or some other kind of scheduling infrastructure, make that whole orchestration piece a little bit easier, because it is non-trivial, even trying to keep it as simple as we did. Just writing your own scheduler, accounting for all the failure modes, it's no easy feat. And it'd be nice to have AWS take on some managed services to support that. One of the challenges with building a solution like this for a customer is the whole model around distributed logging, distributed tracing, and understanding how to debug and troubleshoot a system that's event-driven and completely asynchronous. That's something that you really need to account for as well, because it's a mindset shift, and you need the right tooling in place, right, and the right metrics, and whatever you're using, whether it's CloudWatch Logs or some third party for tracing or logging, you need to have something to ensure that it becomes part of people's daily skill set, that they're able to go in and troubleshoot and jump across the system and understand how it all fits together, which, you know, is still something that's a challenge for people when they're adopting this kind of architecture for the first time.


Yan Cui: 35:43  

Yeah, agreed. And I guess, Eoin, that's all the questions I had in mind. Is there something else that you'd like to mention before we go? I know from the fact that Gurapit has joined you guys recently that you're maybe still looking for talent. You've been doing this YouTube series with Luciano as well. Anything else that you guys are working on, new initiatives and things like that?


Eoin Shanaghy: 36:05  

Yeah. In terms of new initiatives for fourTheorem, we're doing a lot more training activities now, for enterprise companies, but other companies as well, trying to understand all of these challenges when you're moving to AWS, not just serverless but other foundational elements of AWS as well. And myself and Luciano have a podcast called AWS Bites, which tries to be as short as possible, an episode every Friday, just to give people a flavour of one aspect of working on AWS. So you can check that out at awsbites.com. And for the rest, you can follow me on Twitter, it's E-O-I-N-S, eoins. And yeah, feel free to reach out. And we are indeed hiring, so we're always interested in people with different skill levels and different backgrounds who are interested in this kind of tech.


Yan Cui: 36:50  

Okay, great. I'll put those in the show notes so that anyone who is interested can go and find out about fourTheorem, as well as check out the AWS Bites podcast. Again, Eoin, thank you so much for taking the time to talk to me today.


Eoin Shanaghy: 37:04  

It's been a pleasure, Yan. Thanks a lot. 


Yan Cui: 37:06  

Hopefully, I’ll see you in person at some point soon.


Eoin Shanaghy: 37:09  

I really hope so. Cheers. 


Yan Cui: 37:10 

Cheers. OK. Bye bye. 


Yan Cui: 37:24  

So that's it for another episode of Real World Serverless. To access the show notes, please go to realworldserverless.com. If you want to learn how to build production-ready serverless applications, please check out my upcoming courses at productionreadyserverless.com. And I'll see you guys next time.