Real World Serverless with theburningmonk

A podcast where we talk about real-world use of Serverless technologies from engineers who work with them day-to-day. We will discuss use cases, why they chose serverless and the pain points and challenges they face. If you want to know what it's REALLY like to work with serverless, this is the show for you.

#55: Serverless at Theodo with Ben Ellerby

June 23, 2021 • Yan Cui • Season 1 • Episode 55

You can find Ben on Twitter as @EllerbyBen and check out the open positions at Theodo.

Links:

sls-dev-tools: think Chrome DevTools but for serverless
sls-test-tools: utilities for testing serverless apps
stackoscope: detect misconfiguration for serverless components

For more stories about real-world use of serverless technologies, please follow us on Twitter as @RealWorldSls and subscribe to this podcast.

To learn how to build production-ready serverless applications, check out my upcoming workshops.

Opening theme song:
Cheery Monday by Kevin MacLeod
Link: https://incompetech.filmmusic.io/song/3495-cheery-monday
License: http://creativecommons.org/licenses/by/4.0

Yan Cui: 00:12

Hi, welcome back to another episode of Real World Serverless, a podcast where I speak with real world practitioners and get their stories from the trenches. Today, I'm joined by Ben Ellerby from Theodo. Hey man.

Ben Ellerby: 00:24

Hey Yan, how are you?

Yan Cui: 00:25

So Ben, a fellow Serverless Hero. Nice to see you again. It's been a while.

Ben Ellerby: 00:30

Yeah, it's good to see you too. Hard to meet up in lockdown, but good that we can do things virtually.

Yan Cui: 00:34

Yeah, yeah. And I guess with all the craziness happening in the world, you guys have been pretty busy from what I hear in terms of client projects and things you've been working on.

Ben Ellerby: 00:44

Yeah, it's been really busy. Cloud adoption was spurred on a bit more by lockdown and people being sort of forced to accelerate their digital plans. And you know, we've had a lot of clients who have platforms that have had increased load. So some video conferencing clients have obviously had a lot more users on the platform, and we've been helping them reach scalability by adding pieces of serverless around their core architecture, as well as new startups trying to meet the demands of the new world. So helping them build quickly with serverless, building scalable platforms as well.

Yan Cui: 01:15

Cool. I want to dive into some of these examples of client projects you have been working on in a bit. But first, maybe just tell us a little bit about yourself, your journey towards serverless, maybe with Theodo and what do you guys do?

Ben Ellerby: 01:29

Yeah, sure. So I'm Ben, VP of Engineering at Theodo. Theodo is a company that helps other companies build digital products, from startups launching their initial MVPs through to large enterprises doing digital transformation. Now, we've been doing full stack web and mobile application development for, I guess, the past decade. But about five years ago, we were starting to use bits of serverless around our architectures, mostly for auxiliary functions. And then maybe three or four years ago, I started pushing more serverless into our architectures and actually started leading our serverless-first practice, really building fully serverless architectures for clients. That ranges from new companies building fully serverless through to large enterprises using serverless as part of their digital transformation. And it's been going really well. One of the great things we've been doing is sharing what we've been up to. So we created the Serverless Transformation blog, and we've run the serverless meetup in London with yet another Serverless Hero, Ant Stanley. And we've been trying to share as much of what we're doing as we can, because we really enjoy learning from the community and we really enjoy sharing that back. And I think because we've been quite vocal about what we've been doing, and there also seem to be more and more people looking to adopt serverless, we're getting people reaching out not only for delivery but also for training, and for help talking companies through how to conduct serverless adoption.

Yan Cui: 02:46

Good, I'd like to dive into that a little bit as well. Because that's also a trend I've been noticing over the last 12 to 18 months: definitely more and more people are adopting serverless, both small and medium businesses, where they go fully serverless because it just makes perfect sense for them from the agility point of view and from the cost point of view, and also large enterprises, where you see small teams here and there starting to adopt serverless, or at least going with a serverless-first mindset. But I want to get you to maybe talk about some of the projects you've worked on and discuss some of the interesting architectural features of those projects. I think you've been quite a vocal advocate for EventBridge. So I'll take it that you've done a lot of work using EventBridge in event-driven architectures. So maybe let's start from there and just talk about some of the projects you've been working on with clients?

Ben Ellerby: 03:38

Sure, yeah. So with serverless, we've started to move to more of an event-driven architecture. When serverless came out, we all started doing some quite complicated architectures, with Lambdas calling other Lambdas, potentially, or putting things into S3, which triggered other Lambdas. And the traces became complicated, and the architectures became a bit messy. As people saw those problems, they started to use things like CloudWatch Events to create event-driven architectures. And obviously AWS formalised that with EventBridge, which has been super helpful. So with a lot of our projects, we actually start with domain-driven design. So we do workshops with their developers, but also with business stakeholders, to understand the actual events of their company, get common terminology, build out some bounded contexts, and then figure out the microservices from there. And EventBridge becomes the key artery for events going across that architecture. Now, this is really great for greenfield, but it's also been really interesting for legacy modernization. So I mentioned earlier about video conferencing platforms. One of the clients we worked with at the start of the pandemic was a company called WorkCast, who provide a virtual conferencing platform, which combines video streaming with the ability to ask questions, conduct polls, do Q&As and these kinds of things. Now, at the start of the pandemic, they had a massive increase in traffic, because a lot of these conferences, university open days and pharmaceutical events were all moving into the cloud, moving from being in person to being virtual. Now, their system hadn't really had scalability issues before, because they had quite a predictable amount of traffic. 
This was a world event that changed everything. So they came to us to help them create even more scalability in their system. And to do this, we used several different serverless services around their core application. So first of all, when an auditorium needs to be rendered, that is, when somebody goes on to the event, sees the page information around it, and then it loads the video player, we shifted that into S3 fronted by CloudFront, very standard, but with the authentication then happening through Lambda@Edge, so doing that authentication check as close to the user as possible, along with any redirects between the existing system and the new system. We then built out the real-time chat, polls and Q&A features, all using AppSync. So WebSockets, completely fully managed. And of course, Yan, you know more about AppSync than anyone, I think, but we used AppSync really well there. It created a nice amount of scalability, but also a lot of developer productivity, using VTL templates a lot to build that out. But as we started to have these different modules of the system, they needed to share events. And we built in a domain-driven way, with these different bounded contexts and microservices split between them, and quite a strict separation, never making synchronous requests between them. That meant that EventBridge was a really good fit. So we had these different microservices, and they communicated over EventBridge using the real-world business events. And that's a really great use case for EventBridge. What we could also do is trigger events from their existing system into this new greenfield architecture. Because their existing system was in a different AWS account, but it was in AWS, we could use the AWS SDK to just trigger an event to go into EventBridge. 
And that would go from the existing system into the new system with cross-account EventBridge, which is really great, because it lowers the barrier to creating integration points. It's a really nice way to have retries, you can have archives of this, and you can have reporting on it. And it really created this concept of the event bus being the core part of their organisation. And as they're building out more of their existing system and more of this new serverless part of their system, they're really buying into EventBridge and using that for almost everything, which is creating a really nice way for developers to work. They had some developers we were training up who were new to serverless. And they could work on just one of those microservices in isolation and just know the events they had to trigger or the events they would be triggered by, which really reduced the cognitive load of having to understand a whole array of different microservices and all the AWS services that we use there. So yeah, a long answer. But yeah, I've been using EventBridge in quite a few contexts.
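
As a rough illustration of the cross-account publishing Ben describes, the sketch below builds a single PutEvents entry aimed at a bus in another account. The bus ARN, source name and event fields are hypothetical and are not taken from the episode; the boto3 call is imported lazily so the entry builder can be read and tested on its own.

```python
import json
from datetime import datetime, timezone

# Hypothetical ARN of the central bus in a separate "bus" account.
CENTRAL_BUS_ARN = "arn:aws:events:eu-west-1:111122223333:event-bus/central-bus"


def build_entry(source: str, detail_type: str, detail: dict) -> dict:
    """Build one entry in the shape the EventBridge PutEvents API expects."""
    return {
        "Time": datetime.now(timezone.utc),
        "Source": source,
        "DetailType": detail_type,
        "Detail": json.dumps(detail),      # Detail is sent as a JSON string
        "EventBusName": CENTRAL_BUS_ARN,   # a full ARN can target a bus in another account
    }


def publish(entry: dict) -> dict:
    """Send the entry with boto3; requires credentials and a bus policy that allows it."""
    import boto3  # imported here so build_entry has no third-party dependency

    client = boto3.client("events")
    return client.put_events(Entries=[entry])
```

A legacy system could then call something like `publish(build_entry("legacy.webinar", "AuditoriumOpened", {"eventId": "e-1"}))`, assuming the central bus grants its account permission to put events.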

Yan Cui: 07:44

So in terms of how you organise your EventBridge buses, there are quite a few different ways you can configure that. From what you described there, it sounds like you guys have a centralised event bus, maybe in its own account, and are then using the new resource policy on the event bus to allow other accounts to subscribe to events and to push events to this centralised bus. So that's something that I guess I've seen more and more people do, now that it's actually easy and feasible to do it at a big scale when you've got lots of different accounts, whereas before you had to do assume role and all of that, before the resource policy stuff came out last re:Invent. So I guess the question for you there is, are there still some pain points in terms of how you set up and manage this kind of EventBridge topology where you've got a centralised account? Things like ownership: who owns the event bus? How do you keep track of and distribute the schemas to the different teams so that they know what events are available in that event bus? Replays is an interesting one: if you want to replay events when you're building a new microservice, how do you go about doing that against a centralised event bus? Because I imagine the archive is configured on the centralised event bus, so that you've got an archive of everything that's been recorded. So are there still some operational challenges, at least at the organisation level, maybe not the technology level? Have you found some of these to still be painful to do? And if so, are there particular workarounds that you've used in your teams?
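
To make the resource policy idea concrete, here is a minimal sketch of a policy document that lets a list of service accounts put events onto a central bus. The account IDs, bus ARN and statement ID are hypothetical, and real setups often add conditions (for example on the AWS Organization) that are left out here.

```python
import json
from typing import Sequence


def bus_policy(bus_arn: str, publisher_account_ids: Sequence[str]) -> dict:
    """Resource-based policy allowing the listed accounts to call PutEvents on the bus."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowServiceAccountsToPutEvents",  # hypothetical statement ID
                "Effect": "Allow",
                "Principal": {
                    "AWS": [f"arn:aws:iam::{acct}:root" for acct in publisher_account_ids]
                },
                "Action": "events:PutEvents",
                "Resource": bus_arn,
            }
        ],
    }


def apply_policy(bus_name: str, policy: dict) -> None:
    """Attach the policy with boto3 (assumes credentials in the bus account)."""
    import boto3  # imported lazily; only needed when actually calling AWS

    boto3.client("events").put_permission(
        EventBusName=bus_name, Policy=json.dumps(policy)
    )
```

In practice a policy like this is more likely to live in infrastructure-as-code than in a script; the Python form is only meant to show the shape of the document.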

Ben Ellerby: 09:18

Sure, so we have been using a centralised event bus strategy. I mean, to start with, we were using event buses in the main AWS accounts of the applications, before it became a lot easier to do cross-account event buses. And we also have multi-region architectures, so multi-region event bus support is also really important for us on some of those large-scale architectures. But yeah, in terms of account strategy, we've really moved to the one-account-per-service style, so not just per environment, but each microservice almost having its own AWS accounts. We're using AWS Organizations to organise that, and we have exactly that, a bus account that has the event bus or the event buses. Now typically we are using one event bus for those business events. There are some cases where it's not quite telemetry data, but not quite business events either, sort of other auxiliary events that have been fired, and those we're putting into a secondary bus. And obviously, we're using Kinesis, typically, for that telemetry data. But there are a few use cases where there's not a massive throughput, and using one tool and training a team on one tool is a bit easier than explaining the three different ways you can do event streaming through AWS. But yeah, we've been having that in one account. Yeah, there's a mindset shift in having multiple AWS accounts. I mean, some people are still keeping all their environments in one AWS account, which, definitely don't do that. So we're educating people on separating environments, but also separating services. And once people understand the concept of one service per account, then having one for the bus is completely natural and makes sense, because there isn't really any account it would belong to. The archive is a really good point. Yeah, it lives in that bus account. 
And does everyone have access to that bus account? That doesn't really make sense. But it's useful to have access to the archives. We did do a proof of concept around... Obviously, the archive is just an S3. So forwarding that S3 on to the other accounts using a Lambda function triggered by the S3 changes, but we haven't really seen the need. And to be honest, those replays are a nice-to-have; we haven't needed to use them that often. And when we do, it's normally quite a few people involved because something went wrong, and we'd have access to it. It's not a regular developer activity to use replay, from what we've found. But in development, we don't share the same event bus. We have an ephemeral stack system, so every developer has their own event bus, and they can play around with everything they want with that. And that's not split into as many different AWS accounts. So the developers have the ability to use the archives and replays to debug during development. But in production, that bus account is pretty split off, which largely hasn't created too many issues yet.

Yan Cui: 11:59

Okay, yeah, that makes sense. Because with a serverless stack it's just so easy to provision a temporary stack for some feature that you're working on, or even just to run some end-to-end tests. Or every developer can just have their own stack. So you get your own EventBridge bus, you get your own API Gateway, Lambda functions, and all of that is super easy to provision. You mentioned you've got multi-region event buses. So how do you go about managing that? Because EventBridge itself is not multi-region; it's multi-AZ, which is already pretty good. But if you have centralised buses that have to span multiple regions, what's your strategy for doing that?

Ben Ellerby: 12:40

Yeah, and it's actually something we're looking at currently. So it's not something we've done yet. More and more of our architectures require it, because serverless makes it so easy to just spin up an environment, as we were saying just before. And actually, just on that point of developers having their own stacks, the workflow we have now is that when a developer starts a new task, whether that's a ticket in Trello or a task in JIRA, they take the number from the ticket or the task and create a draft pull request in GitHub. And then GitHub spins off CI processes that spin up a stack just for that pull request. And then on Slack, they get a message with all the different stack resources they can use, which is a really nice workflow, because it's really a sandbox. As soon as a branch gets deleted, that stack gets torn down. And from a CI point of view, we have a completely sandboxed stack that can just be tested against, which is really nice, and which was a complete tangent from the point I'm trying to answer. But yeah, on the multi-region stuff, because it's easy to spin up these environments, there are actually some use cases where that's really useful. So we have some clients in the healthcare space, one of which is called SharpTx, which is digital therapeutics for disease diagnosis. Now, with healthcare data, you never want that leaving the region that it's from, especially if you're moving between Europe and the US, or even the UK and Europe at this point. So it's important that we have the ability to just have an identical stack and keep all the data inside that region, which is great. 
But there are some operational elements that sometimes you want to have go between all of your stacks, potentially when a feature flag gets changed, when a version gets changed, or some internal-facing stuff that isn't protected data but data that is shared globally. And that applies even outside of a regulated space, when we're building new social media applications. We're currently building one I can't talk about yet, but it's fully serverless, and it has really great scalability. And actually, it's interesting, the investors in that social media startup were very reassured to see AWS and serverless as part of the tech stack. So it's interesting that investors are starting to understand that serverless is a good thing to see from a scalability point of view and from a cost point of view. But with that application, it's multi-region more from a scalability point of view, having the DynamoDB databases close to the users. And obviously with DynamoDB it's very easy to replicate now with global distribution. But for cross-account EventBridge, I think the support is here now, and we're looking into how we can use that to have shared events across all those regions.

Yan Cui: 15:13

Yeah, I guess maybe one thing you could do is to use that: because you can replicate events from one event bus to another, in the same account or a different region, maybe you can use that. The tricky thing with that is you've got this chicken-and-egg thing where I guess all the different regions want to subscribe to events in all the other regions. So you have to provision all of them first, and have some other process create the rules that allow them to fan out to the different regions. We've done something similar, not with EventBridge but with an SNS and SQS kind of event system, where we had a multi-region active/active setup and events were fanned out this way, so that an event captured in one region always gets fanned out to the other regions, and the event exists in all the different regions that we have running. I imagine the same approach would work on EventBridge as well. It's just a matter of working out what gets deployed first, because you need to reference them.

Ben Ellerby: 16:17

And not having it happen recursively, because obviously you don't want events then going into a different region. It gets very expensive.
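
One way to avoid the recursive fan-out mentioned here is to stamp each event with the region where it was first published and only forward events that originated locally. The sketch below assumes a hypothetical `originRegion` field inside the event detail; that field is a convention invented for this example, not something EventBridge adds for you.

```python
import json


def forwarding_pattern(local_region: str) -> str:
    """Event pattern for a forwarding rule: match only events first published locally."""
    return json.dumps({"detail": {"originRegion": [local_region]}})


def should_forward(event: dict, local_region: str) -> bool:
    """The same check in plain code: replicated copies keep their original region, so they are skipped."""
    return event.get("detail", {}).get("originRegion") == local_region
```

With a rule like this in every region, an event published in one region is copied to the others once, and the copies do not match the forwarding rule where they land, so nothing bounces back.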

Yan Cui: 16:27

Yeah. So we were doing something similar with DynamoDB Streams at one point. And there was one small element in the DynamoDB stream events which is not documented, but basically that was the one key that tells you whether this is data that's been replicated or data that has been written in this region, because when you use global tables, the replicated data also generates events on the DynamoDB stream. So if you don't want to process events multiple times, because every time it gets replicated, it gets replicated again, you have to look at that one key to figure out, okay, this is the event that needs to be processed, because it's created in this region. Yeah, there are a couple of things you need to be quite careful about when you do something like that.

Ben Ellerby: 17:15

Yeah, I'm always watching the billing and being very... doing a lot of TDD to make sure I've actually implemented the logic correctly before I ship.

Yan Cui: 17:23

Yeah, I've heard a few stories of people running into infinite event loops. It is quite an easy thing to do, especially with something like S3. You write, you process the incoming file, put it back into a different folder, but it triggers the same function again.

Ben Ellerby: 17:38

You process it again. Yeah, yeah, exactly.
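
The S3 loop described in this exchange can be sketched as a guard at the top of the handler. The prefix names below are hypothetical; the event shape is the standard S3 notification format, where object keys arrive URL-encoded.

```python
from urllib.parse import unquote_plus

INPUT_PREFIX = "incoming/"     # hypothetical folder the function reads from
OUTPUT_PREFIX = "processed/"   # hypothetical folder the function writes to


def keys_to_process(event: dict) -> list:
    """Return only the object keys that should be processed."""
    keys = []
    for record in event.get("Records", []):
        key = unquote_plus(record["s3"]["object"]["key"])  # keys are URL-encoded in the event
        if key.startswith(OUTPUT_PREFIX):
            continue  # our own output: skipping it stops the function re-triggering itself
        if key.startswith(INPUT_PREFIX):
            keys.append(key)
    return keys


def handler(event, context):
    outputs = []
    for key in keys_to_process(event):
        output_key = OUTPUT_PREFIX + key[len(INPUT_PREFIX):]
        # Reading the object and writing the result to output_key would go here.
        outputs.append(output_key)
    return outputs
```

Scoping the bucket notification itself to the input prefix, or writing output to a different bucket, is a stronger defence; the in-code check is a second line of protection.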

Yan Cui: 17:41

Okay, so you've worked for a lot of different companies, in lots of different industries. I guess maybe a good question to ask in that case is, are you seeing any common trends? You mentioned that you are seeing investors starting to notice the value of serverless, and that companies are actively looking for help on projects from people like you and others who specialise in serverless. Are there any other common trends you're noticing from the business side of things, but also from the technical teams as well?

Ben Ellerby: 18:11

Sure, yeah. I mean, everyone wants to hire people with the skill set, and there aren't enough people with the skill set, which isn't a new thing in tech. But I guess one of the things, and it's something I've been pushing for a while, is creating the autonomous capability for teams to do things. It's not just bringing in experts to build something and leave it, it's training people in using those technologies. And that training part is difficult. And I know you've got a lot of initiatives in the training space. Ant Stanley, who I mentioned earlier, of course, does too, and we're doing it through hands-on delivery, but training their teams as we go. So there's definitely a focus on training, not just delivery, which I think makes complete sense. From the technical side of things, there's a lot more use of third parties where you can, which obviously comes with serverless. But the buy-not-build mentality is getting stronger, which I think is really good as well. So people are focusing on that 10% of differentiating logic that they can write in their application. And no one needs to build email sending again from scratch. So that's working nicely. And with Cognito for user accounts and single sign-on, there's more of a buy-not-build mentality. And I think that's because people are starting to appreciate the total cost of ownership side of things, and starting to see, you know, 10 years later when I'm still patching the same thing, it's getting really annoying, and we don't have the capacity to do it. The way you're smiling is making me think you've been in a few of those situations.

Yan Cui: 19:37

Yeah, I've built a few identity management systems in my time, and they are not fun. And when you're looking back, you realise how badly a lot of things were implemented when you compare them to modern standards in terms of security, best practices and whatnot. Obviously, how do you deal with unknown unknowns? If you didn't know better, that's what you end up with, right? And yes, I'm definitely seeing the same thing as well, seeing lots of third party services, not just AWS services, but third party ones like Algolia. I mean, I've been using Algolia quite a lot myself. I'm a big fan of their service, probably the closest we have to serverless Elasticsearch, even more so than Amazon Elasticsearch, where you're still paying for uptime and having to manage indexes. And you have to manage the rollout of new versions of the Elasticsearch engine, all this other stuff, which are things that I don't want to have to deal with. Just give me an API key, give me an API that I can call, and that's it.

Ben Ellerby: 20:36

And same in the ecommerce space, you know, Shopify adoption's going up. But with Shopify, sometimes you still need some custom stuff around that. So, either through webhooks or EventBridge, getting those events in and then reacting to them. We have a current project with an ecommerce company where they're using data and events that are coming from their Zendesk, their Shopify, the different third party services they're using. Inside of AWS, we're aggregating all that data together so they can do correlation across those different providers, and then use that data to help them in their sales process. That pattern of AWS, or maybe serverless on AWS, being the glue between all of the different third parties that they're using is quite common. And another serverless architecture we built, this one's quite well documented on the AWS blog at this point, is for a company called Gamercraft, which is an esports tournament platform that makes it easier to create fairer tournaments for games like League of Legends. So it does the automatic player matching, it matches people against a similar skill set, does prizes, all that kind of thing. We built their initial serverless architecture, which is fully serverless, event-driven, and using EventBridge, Lambda, Step Functions, Dynamo, Cognito, the usual suspects, and that worked really well. But we also built in a Kinesis analytics pipeline. And at the start, they didn't really know what they were using that for, they were just like, we want to start gathering some of the data. Now, with these other applications that were using multiple third parties and aggregating the data together, the question is then what you do with the data. And we're seeing the serverless data lake really taking off in adoption. So with Gamercraft, we're using a serverless data lake. 
So storing the data in Parquet format in S3, using Athena to query that data, and then using machine learning to do anti-smurfing detection. So detecting players who might be cheating about their capabilities in the game to get through the tournaments. And if we take that ecommerce company I talked about, they're gathering the data across these third parties, aggregating it in an S3 data lake, and then using machine learning again to predict how likely leads are to buy the product that they're selling. That serverless data lake is almost becoming like a standardised component that we're dropping in, because it doesn't cost much and the benefits are quite nice.

Yan Cui: 22:53

Yeah, I've built the same thing multiple times myself as well. And the one thing that really annoys me, that I have to do every single time, is write a transformer function for Kinesis Firehose, because it just doesn't insert a newline character at the end of each record. So with the file you aggregate in S3, Athena just sees, oh, there's one record, because I don't see a new line. It's so annoying that they don't just fix it at the service level. And you end up having to write a Lambda transformer function just to add a new line after each record, which is a bit silly.
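
The transformer Yan mentions is small enough to sketch in full. This is a minimal version, assuming the standard Firehose data-transformation contract (base64-encoded `data`, and a `recordId` plus `result` returned for every record); it is not taken from any project discussed in the episode.

```python
import base64


def handler(event, context):
    """Firehose transformation Lambda: make sure every record ends with a newline."""
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"])
        if not payload.endswith(b"\n"):
            payload += b"\n"  # one record per line, so Athena can split the S3 object
        output.append({
            "recordId": record["recordId"],  # must echo the incoming ID
            "result": "Ok",
            "data": base64.b64encode(payload).decode("utf-8"),
        })
    return {"records": output}
```

Because the function runs on every batch, it adds a small Lambda cost to the pipeline, which is the "you pay for it as well" point made in the reply that follows.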

Ben Ellerby: 23:27

Most people are piping their data into S3, and most people are using Athena too. So yeah, it makes complete sense. And yeah, we've written a couple of those, and you pay for it as well, which isn't ideal. Yeah, that's working nicely. And then using QuickSight to visualise that, or actually integrating with other third party BI tools like Tableau and others, the Athena query layer is actually working really nicely. And we haven't had to use Redshift in many cases at this point.

Yan Cui: 23:54

Yeah, I think Redshift makes sense if you're just going to be sitting there all day long looking at data and running different kinds of queries; then it probably makes sense for you to have Redshift. But if your use case is like, I guess, most people's, where they need something they can run ad hoc queries against when they need to, but they don't do it eight or nine hours a day, then Athena makes a lot more economic sense. You don't have to wait for a cluster to spin up, and you don't have to remember to shut it down afterwards, because that can get expensive if you've got a massive Redshift cluster running all the time. But yeah, Athena is... one time we had a project where I think the company was paying something like $3,000 a month for Mixpanel. And, you know, it was just a startup social network, there wasn't that much data. And by the time we moved it to Athena, I think we were paying about eight cents a month. So it was a pretty mega saving, because it's so cost efficient when you don't have a huge amount of data that you have to query all the time. You can just query the most recent, I don't know, five or ten minutes' worth of data for your real-time dashboards and whatnot, or just run ad hoc queries over a couple hundred MB or a couple of GB of data. It's nothing on Athena. And I think Athena can also handle pretty sizable chunks of data that you need to query as well, and do it in, I don't know, tens of seconds.
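
To see how a bill of a few cents a month is plausible, here is a back-of-the-envelope sketch of pay-per-scan pricing. The price of $5 per TB scanned, the 10 MB per-query minimum and the decimal definition of a TB are assumptions based on Athena's published on-demand pricing around the time of the episode, and the workload figures are invented for illustration.

```python
PRICE_PER_TB_USD = 5.00            # assumed on-demand price per TB scanned
MIN_BYTES_PER_QUERY = 10 * 10**6   # assumed 10 MB minimum charge per query
BYTES_PER_TB = 10**12              # assumes decimal terabytes


def query_cost_usd(bytes_scanned: int) -> float:
    """Cost of one query, applying the per-query minimum."""
    billable = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billable / BYTES_PER_TB * PRICE_PER_TB_USD


def monthly_cost_usd(queries_per_day: int, bytes_per_query: int, days: int = 30) -> float:
    """Cost of running the same query a fixed number of times a day."""
    return queries_per_day * days * query_cost_usd(bytes_per_query)
```

Under these assumptions, 20 queries a day that each scan 25 MB come to roughly $0.075 a month, the same order of magnitude as the eight cents Yan quotes. Columnar formats such as Parquet and partitioned data reduce the bytes scanned, and therefore the cost, further.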

Ben Ellerby: 25:20

Definitely, and it's a weird one. It's a really good thing, but it also creates a challenge. When we started projects in the past, you know, five years ago or further back, we'd come up with an architecture for the application people were building, and they would be pretty much very similar architectures, using something like Django for the back end and then React or Angular or whatever on the front end. But now we're able to build these more customizable architectures that can go to much higher scale, and people are starting to realise the scalability of their architectures is super important. A lot of these services are very cheap, but people are now asking at the start, how much is this going to cost? How much is it going to cost with this many users? How much with this many users? We have a few internal tools that we've built out. They're not the most advanced things in the world, but they give an indication: if you increase your users by this much, across the different AWS services, based off past clients, this is how we expect those costs to go, as well as according to the AWS pricing docs. But the numbers are always really low, especially for all of that data lake stuff. So I spend a lot of my time trying to convince people that we have the right numbers, we understand what you want to scale to, and it's just very cheap with serverless. But I think people almost distrust how cheap it can be, which is a double-edged sword, right? Because it's a great thing that it's cheaper, but convincing them that it is cheaper is part of the job.

Yan Cui: 26:45

Yeah, and I guess also there are stories out there that say serverless is super expensive, because when you get to a certain scale, it can be expensive if you have a sustained throughput. I think I've read quite a few of these posts on Reddit, or maybe it's not Reddit, maybe it’s Hacker News that always... I don’t know, Hacker News is a funny place. Negative stories always just fly off the shelf.

Ben Ellerby: 27:11

They're not a fan of API Gateway.

Yan Cui: 27:13

No, API Gateway can be quite expensive at scale. That's true. If you're not careful, you can make those cost mistakes, well, costly mistakes. But API Gateway itself has got nice options now to use HTTP APIs, which are a lot cheaper. And then you can also use ALBs as well. ALB costs are quite complicated to predict because of the way that LCUs are used to work out, you know, how much you actually pay for an ALB, but at scale it's going to be a lot cheaper. And I think people… oftentimes we look at the cost of Lambda, but that's a small part of your overall architecture.
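
As a rough illustration of the REST API versus HTTP API gap, the sketch below compares request charges only. The per-million prices are the first-tier us-east-1 list prices as far as we know (USD 3.50 for REST APIs, USD 1.00 for HTTP APIs) and should be treated as assumptions; volume tiers, data transfer and caching are ignored, and the traffic figure is hypothetical.

```typescript
// Assumed first-tier list prices in us-east-1, in USD per million requests.
const USD_PER_MILLION = { restApi: 3.5, httpApi: 1.0 } as const;

type ApiType = keyof typeof USD_PER_MILLION;

/** Request charges only: ignores volume tiers, data transfer and caching. */
function monthlyRequestCostUsd(apiType: ApiType, requestsPerMonth: number): number {
  return (requestsPerMonth / 1e6) * USD_PER_MILLION[apiType];
}

const requestsPerMonth = 50e6; // hypothetical traffic: 50 million requests
console.log(monthlyRequestCostUsd("restApi", requestsPerMonth)); // 175
console.log(monthlyRequestCostUsd("httpApi", requestsPerMonth)); // 50
```

ALB is left out on purpose: its LCU-based pricing depends on connections, bandwidth and rule evaluations, so it does not reduce to a single per-request figure.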

Ben Ellerby: 27:52

It's never the top line of an architecture, unless you're doing something very interesting. And you know, Step Functions have got cheaper too, because that was another element that was shifting up, you know, on that balance sheet, but, you know, Step Functions Express Workflows, that cut costs massively as well. And this is what makes, you know, the vendor lock-in arguments a bit complicated, because the services keep getting cheaper. So it's interesting.

Yan Cui: 28:16

Yeah, like when Lambda released per-millisecond billing, everyone's bills suddenly got maybe 20 - 30% cheaper, just because the pricing is more granular. Yeah, and also, your system just gets faster, gets more reliable, gets more secure over time without you having to do anything. So the best thing you could do is just to do nothing for a few months, and then it gets better, which is pretty great. But yeah, I think understanding AWS cost is probably what you need to work on, to understand how your decisions are going to impact your cost over time. And there are quite a few good solutions out there. I think CloudZero does a pretty good job of surfacing a lot of the cost for individual components, so that as you monitor changes from one month to another, you can see, okay, suddenly the cost distribution of our architecture has shifted more towards a different service. Which may not be a problem, it may just be that you're using the service more, but at least it helps surface the impact of your decisions, because not everyone knows AWS billing well enough to be able to drill down and work out, at a particular service or account level, why are we seeing sudden spikes in certain costs? Why does the AWS bill just jump up by a certain amount? And I think one mistake a lot of companies make is that they just don't give developers access to billing information.
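
The effect of per-millisecond billing can be sketched in a few lines. Before the change, Lambda rounded each invocation up to the next 100 ms; afterwards it rounds up to the next 1 ms. The durations below are hypothetical, so the resulting percentage is only an illustration of how a saving in the quoted range can arise, not a measurement.

```typescript
/** Billed duration in ms when durations are rounded up to `granularityMs`. */
function billedMs(durationMs: number, granularityMs: number): number {
  return Math.ceil(durationMs / granularityMs) * granularityMs;
}

/** Fraction of billed duration saved by moving from 100 ms to 1 ms granularity. */
function savingFraction(durationsMs: number[]): number {
  const before = durationsMs.reduce((sum, d) => sum + billedMs(d, 100), 0);
  const after = durationsMs.reduce((sum, d) => sum + billedMs(d, 1), 0);
  return (before - after) / before;
}

// Hypothetical invocation durations in milliseconds.
const sampleDurationsMs = [30, 110, 250, 480];
console.log(`${(savingFraction(sampleDurationsMs) * 100).toFixed(0)}% less billed duration`);
```

Short functions benefit most: the 30 ms invocation here was previously billed as 100 ms, while the 480 ms one only drops from 500 ms.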

Ben Ellerby: 29:41

Yeah, I have never understood that.

Yan Cui: 29:43

Yeah, which of course, people are going to make a mistake that's going to cost you money because they can't get feedback on the decisions that they're making and understand the cost implications of those decisions.

Ben Ellerby: 29:53

Mostly they've forgotten about, like, the random server they created three years ago and haven't turned off. You know, the tagging of your resources, it's actually an interesting sort of analytics tool. If you tag those microservices well, the ones that come from a bounded context, you can be like, you know, from a technical perspective, this department costs way more money than this department, and this department doesn’t, you know, generate much revenue. You can attribute your cloud costs to your different, you know, products and your different business units, which can be really interesting as an analytical tool.
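
A minimal sketch of tag-based cost attribution follows. The `CostLineItem` shape, the `bounded-context` tag key and all figures are hypothetical; in practice the line items would come from a billing export or cost tool rather than a hard-coded array.

```typescript
interface CostLineItem {
  service: string;
  costUsd: number;
  tags: Record<string, string>;
}

/** Sums cost per value of `tagKey`; items without the tag go under "untagged". */
function costByTag(items: CostLineItem[], tagKey: string): Map<string, number> {
  const totals = new Map<string, number>();
  for (const item of items) {
    const owner = item.tags[tagKey] ?? "untagged";
    totals.set(owner, (totals.get(owner) ?? 0) + item.costUsd);
  }
  return totals;
}

// Hypothetical month of line items.
const lineItems: CostLineItem[] = [
  { service: "Lambda", costUsd: 12, tags: { "bounded-context": "payments" } },
  { service: "DynamoDB", costUsd: 30, tags: { "bounded-context": "payments" } },
  { service: "S3", costUsd: 8, tags: { "bounded-context": "search" } },
  { service: "EC2", costUsd: 95, tags: {} }, // the forgotten server nobody tagged
];

const totalsByContext = costByTag(lineItems, "bounded-context");
totalsByContext.forEach((cost, context) => console.log(`${context}: $${cost}`));
```

The "untagged" bucket is a deliberate choice in this sketch: it makes forgotten resources show up as a line of their own instead of disappearing from the report.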

Yan Cui: 30:24

Yeah, absolutely. And I guess this is something that I've spoken with a few people about before on this podcast, this whole idea of FinDev. One of the facets of that is that, because you get pay-as-you-go as the pricing model with AWS, you can also translate that to how you price your own services. And that way, you know, for something that would have been much more expensive, you can offer much more competitive pricing to your customers by offering them pay-as-you-go pricing that's based on their usage. And I've worked with a few customers now who have built services that can operate with that kind of pricing model. But you do have to track the cost for individual customers, and you have to do some custom cost allocation and stuff like that. But those are not difficult to do per se, you just have to figure out all the different cost points you have to track based on the services that you're using. Actually, before we go, there's also one more thing I want to get you to talk about, because I've noticed you've done quite a bit of work on the open source side of things. There's a serverless testing library you published. And then there's also the, what’s the command line tool that you published?

Ben Ellerby: 31:32

The sls-dev-tools.

Yan Cui: 31:33

Yeah, that's right. Can you maybe take us through some of these open source projects that you've done? And maybe, you know, why I should use them?

Ben Ellerby: 31:42

Sure, yeah. So you know, open source is a big part of how we build. We generally use only open source tools, and then obviously AWS and third parties for, you know, hosting and dedicated services. And part of what we want to do is, you know, if we're going to use open source, we need to be contributing back to open source. When we're fixing things, it's great to be able to fix that in the open source tool. And that's why we contribute to the Serverless Framework. So Frederick Beausoleil, who's in our French team, he's doing a lot of work with the Serverless Framework at the minute. And he's actually building an open source project with a guy called Matthieu Napoli, another AWS Serverless Hero, called Lift, which is sort of an abstraction over the Serverless Framework. In the UK team, we've been doing some open source more around the developer tooling and testing side. There's sls-dev-tools, which started from, I think, a conversation on a Friday night where someone was complaining about constantly having to open Chrome to do back end development. They were, you know, constantly going to the AWS console, and they wanted an experience more integrated into their IDE. So we created a very basic sort of terminal interface there, enabling them to see their Lambda functions and open the logs. And then gradually, we just added more to it. So you can switch regions on sort of a 2D map, and you can see your logs in real time. Before the extension support came out on Lambda, we actually, only internally, we didn't ship this to the public open source because it probably wasn't polished enough, but we built a custom runtime for Node that would override console.log. And instead of just console.logging and getting it through CloudWatch, it would use a WebSocket connection down to the terminal of the developer.
So you'd get your logs back before your Lambda function had even finished executing, which cut out the, you know, two to three seconds of delay you get going through CloudWatch, which was frustrating people. So you know, we've done a few things, not shipped all of them. And it also enables you to open your Lambda functions directly from the terminal. So you know, if you're in your Lambda function, you're seeing the logs, and you want to actually go to the AWS console, you just press O for open, and it opens the right page in the AWS console, so you're not clicking around constantly. You can test from there, you can see your event buses, and you can also inject events into your event buses, which was something that, you know, we were constantly writing small scripts for, just to inject events into an event bus. Instead of that, we can just see our event bus, use a sort of form that's generated from the schema of the events that have gone before, type in the values, hit Enter to inject, and just repeat that really quickly as we're developing. So you know, some of those things became quite useful. We built our sls-dev-tools Guardian, which is sort of automated checks for certain things like, you know, stars in IAM policies, shipping the same code to multiple Lambda functions, not configuring memory and CPU. And we've used that more internally on projects to build out custom rules for particular clients who have particular, you know, safeguards they want in place. I know you've actually built something in, you know, fairly much the same space with Lumigo, is it Stackoscope? Which looks really great.
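
To make the "stars in IAM policies" check concrete, here is a minimal sketch of such a rule. It is not the sls-dev-tools Guardian implementation; the function name, the finding format and the choice to flag only a bare `*` (rather than partial wildcards such as `s3:*`) are assumptions made for illustration, and the policy and account ID are placeholders.

```typescript
interface IamStatement {
  Effect: "Allow" | "Deny";
  Action: string | string[];
  Resource: string | string[];
}

interface IamPolicy {
  Statement: IamStatement[];
}

// IAM allows Action and Resource to be either a single string or a list.
const toArray = (value: string | string[]): string[] =>
  Array.isArray(value) ? value : [value];

/** Returns one finding per Allow statement that uses a bare "*" action or resource. */
function findWildcardStatements(policy: IamPolicy): string[] {
  const findings: string[] = [];
  policy.Statement.forEach((statement, index) => {
    if (statement.Effect !== "Allow") return;
    if (toArray(statement.Action).some((action) => action === "*")) {
      findings.push(`Statement ${index}: wildcard Action`);
    }
    if (toArray(statement.Resource).some((resource) => resource === "*")) {
      findings.push(`Statement ${index}: wildcard Resource`);
    }
  });
  return findings;
}

// Hypothetical policy with one well-scoped and one overly broad statement.
const examplePolicy: IamPolicy = {
  Statement: [
    {
      Effect: "Allow",
      Action: "dynamodb:GetItem",
      Resource: "arn:aws:dynamodb:eu-west-1:123456789012:table/orders",
    },
    { Effect: "Allow", Action: ["s3:GetObject", "*"], Resource: "*" },
  ],
};

const wildcardFindings = findWildcardStatements(examplePolicy);
console.log(wildcardFindings);
```

A client-specific rule of the kind mentioned above could follow the same pattern: a pure function from a piece of configuration to a list of findings, which keeps each rule easy to test on its own.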

Yan Cui: 34:42

Yes, that is called Stackoscope.

Ben Ellerby: 34:44

Stackoscope? That was it? Yeah. But it's really great. Automating checks is a key way to help developers for sure. And then on testing, as we mentioned earlier, we're big fans of these ephemeral stacks. We wrote about it on our Serverless Transformation blog. At one point we called this concept Serverless Flow, a bit like Git Flow. But basically you open the pull request, a stack gets set up, you develop against it, run automated tests against it, and then merge. But the tests that we wrote, and I think you spoke about this before, Yan, like, we're not focusing on unit testing much anymore. There's a massive focus on integration tests, and for the integration tests that we write, we're actually using Jest. We're generally doing full-stack TypeScript on the back end and the front end, sometimes using Python, but largely full-stack TypeScript. So Jest as a test runner is super familiar to front end developers, and we're using that as our test runner for the back end. And what we found ourselves doing was often writing, you know, how to invoke a Lambda function from a Jest test, or how to assert that an API Gateway exists, or how to assert that a step function has executed, or how to assert the content type of a document in S3. These became quite common things we kept rewriting, just using the AWS SDK. But we realised that, first of all, the tests weren’t that readable. Secondly, we were duplicating this all over the place. And third of all, they're quite intimidating tests to write, especially if you're new to serverless and new to using the AWS SDK. So we tried to create a set of assertions using Jest's expect.extend functionality. So you can just write, you know, call this endpoint and then expect the file in S3 to have content type application/json or whatever it is.
So we started to build these abstract expect statements that actually call the AWS SDK under the hood. We've shipped some of them and we're trying to ship more; we like to test them out in a project and then promote them to the open source version. But Sarah Hamilton, who's one of our Cloud developers and has spoken a few times about testing as a topic, worked with me on a way to create better testing for EventBridge. So often, when we have these, as I mentioned, microservices split by bounded context, we want to be able to test and deploy them independently. Now, we often have some end-to-end tests that run across all the microservices. But just for that one microservice, we want to run integration tests to make sure it does everything we expect, and deploy it independently of every other microservice. To do that, we actually wrote some assertions that assert that an event is fired on EventBridge, and that when an event is fired, a Lambda function is invoked. That actually became a bit more complicated, although it's a one-line expect statement. Under the hood, we're provisioning an SQS queue, creating an EventBridge rule to link that queue to the event bus, and then polling the SQS queue to check the event is there. And you know, it's a bit more complexity, but it's completely abstracted. So the test just says, you know, expect that when this happens, an event is fired in EventBridge. And the fact that the test is readable, and we're not duplicating that code, has been a big advantage for us. So we thought we'd open source it and also try and share that approach. And you know, it's great when you open source things because you get feedback about, you know, things we can improve. And sometimes people fix issues, which is always nice.
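
The last step of that approach, checking the polled queue for the expected event, can be sketched as a Jest-style matcher body. This is not the sls-test-tools API: `toContainEvent` and its parameters are hypothetical names, and the sketch assumes the SQS message bodies have already been fetched and that each body is the EventBridge event JSON with its standard `source` and `detail-type` fields. Provisioning the queue and rule, and the polling itself, are left out.

```typescript
// Shape Jest expects back from a custom matcher.
interface MatcherResult {
  pass: boolean;
  message: () => string;
}

/**
 * Passes if any SQS message body is an EventBridge event with the expected
 * `source` and `detail-type`. Could be registered with
 * `expect.extend({ toContainEvent })` in a Jest setup file.
 */
function toContainEvent(
  messageBodies: string[],
  expected: { source: string; detailType: string }
): MatcherResult {
  const events = messageBodies.map(
    (body) => JSON.parse(body) as Record<string, unknown>
  );
  const pass = events.some(
    (busEvent) =>
      busEvent["source"] === expected.source &&
      busEvent["detail-type"] === expected.detailType
  );
  return {
    pass,
    message: () =>
      `expected ${pass ? "no" : "an"} event from ${expected.source} ` +
      `with detail-type ${expected.detailType} among ${events.length} message(s)`,
  };
}

// Hypothetical message bodies, as if polled from the test queue.
const polledBodies = [
  JSON.stringify({ source: "orders", "detail-type": "OrderCreated", detail: { id: "42" } }),
];

console.log(toContainEvent(polledBodies, { source: "orders", detailType: "OrderCreated" }).pass);
```

Once registered, a test could read along the lines of `expect(bodies).toContainEvent({ source: "orders", detailType: "OrderCreated" })`, which is the readability gain described above.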

Yan Cui: 38:02

Yeah, I was just talking to Paul Swail about serverless testing the other day, and we touched on sls-test-tools as well. I think the one thing that I'm, like, maybe slightly concerned about with the approach that you've described is that if someone just presses Ctrl+C in the middle of a test, because they realise they made a mistake, then the infrastructure that's been created behind the scenes will just be hanging around. So over time, those potentially accumulate. Is that something that you guys have run into?

Ben Ellerby: 38:33

Yeah, no, we thought about that. And, you know, we debated, should we force the user to create the SQS queue in their infrastructure code? Because, you know, it feels weird for your tests to be creating sort of test infrastructure around your application. We decided not to, because we liked the abstraction. And obviously, we put lots of warnings not to use this on your production infrastructure. This is to be used, you know, in your development stacks, in your testing stacks. But, yeah, we actually have functionality so you can keep the SQS queue up. So when you run the test, you can put dash dash keep, and it will keep the SQS queue, because that obviously reduces the latency of running the test. But yeah, the cleanup is part of it. So what we've ended up doing is, well, with our approach of creating the whole stack when you open the pull request, and then deleting it when you delete the branch, stuff actually gets cleaned up pretty well. We have a few Lambda functions that look for any resources that haven't been used in a while, ask why they're there, and then sort of post to a Slack channel so that we clean them up. But yeah, the fact is that if you exit your tests before the SQS queue has been torn down, the SQS queue could stay around. So that's why we'd advise you don't run this on a production stack, and you need good teardown policies in general. Yeah, it's a good point, to be honest. We might change the approach and put it back in the infrastructure code, or give the user the option. Like, you can either use a Serverless Framework plugin that will handle that for you, or you can let the test tools generate it for you.
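
The core of a "find resources nobody has touched in a while" sweep can be sketched as a pure function. Everything here is hypothetical: the `TrackedResource` shape, the 14-day threshold and the placeholder ARNs are assumptions, and how the last-used timestamps are gathered or how the result is posted to Slack is out of scope.

```typescript
interface TrackedResource {
  arn: string;
  lastUsedAt: Date;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

/** Returns the resources that have been idle for longer than `maxIdleDays` as of `now`. */
function findStaleResources(
  resources: TrackedResource[],
  now: Date,
  maxIdleDays: number
): TrackedResource[] {
  return resources.filter(
    (resource) =>
      now.getTime() - resource.lastUsedAt.getTime() > maxIdleDays * MS_PER_DAY
  );
}

// Hypothetical inventory: one queue used recently, one left behind by an aborted test run.
const inventory: TrackedResource[] = [
  {
    arn: "arn:aws:sqs:eu-west-1:123456789012:test-queue-recent",
    lastUsedAt: new Date("2021-06-20T00:00:00Z"),
  },
  {
    arn: "arn:aws:sqs:eu-west-1:123456789012:test-queue-orphaned",
    lastUsedAt: new Date("2021-05-01T00:00:00Z"),
  },
];

const staleResources = findStaleResources(inventory, new Date("2021-06-23T00:00:00Z"), 14);
staleResources.forEach((resource) => console.log(`stale: ${resource.arn}`));
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the check deterministic and easy to test.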

Yan Cui: 40:01

Yeah, and I mean I've normally gone with the approach of having this included as conditional resources in my stack, so then it gets provisioned in some stacks. But the downside is, like you said, developers have to make a conscious decision to do this every single time, and you have to kind of learn the approach, as opposed to just having a library that does it for you, which is much more convenient. I guess it's just about finding a balance between the risk of having resources hanging around that didn't get cleaned up properly versus the convenience you get using the approach you described.

Ben Ellerby: 40:34

Sure, yes, it's sort of how abstracted it is. Obviously we're putting sort of warnings saying it's going to deploy these resources and, you know, you don't use your normal IAM user for this, you have to create an IAM user with a policy that explicitly says what it can do, so you do know this is happening. But yeah, the cleanup is definitely something that could be improved, and I think giving the option would be the best.

Yan Cui: 40:54

Yep, yep. Anything else that you want to... I think that's all the questions I've got. Is there anything else that you would like to mention before we go?

Ben Ellerby: 41:02

No, no, I don't think so. I mean, if anyone's interested in those projects or what we write about, we put that out on the Serverless Transformation blog, which you can find on Medium. And yeah, the whole team's putting stuff out there, including the Lift project that I mentioned, sls-test-tools, sls-dev-tools, and any more sort of theoretical stuff. And yeah, if anyone has any questions for me following on from this, you can always reach me on Twitter, I'm @EllerbyBen, that’s E - double l - e - r - b - y - B - e - n.

Yan Cui: 41:27

And what about in terms of hiring, because you guys have got quite a distributed team. Are there any open positions that people who want to get into serverless may want to look at and apply for?

Ben Ellerby: 41:37

You're much better than me, Yan, you remember to do that. But yes, we are always hiring. We're hiring both serverless developers and serverless architects, and you don't necessarily need Cloud experience beforehand. We really have the ability to train people, and look forward to doing that, and we're also hiring for more experienced roles, generally in London and Paris, and potentially in New York as well. We have an office there. But yeah, if you are interested in roles, especially in the London area, feel free to contact me. I'd be super happy to talk to you, even if it's just to talk about how to get into the serverless space, not necessarily to work with us.

Yan Cui: 42:08

Okay, excellent. I will include links to your careers page in the show notes, as well as the tools that you talked about, so anyone who's interested can go and check them out quickly. And again, thank you Ben, thanks for taking the time to talk to us today.

Ben Ellerby: 42:23

Thanks, Yan, always happy to. And I really love the podcast so please keep it up.

Yan Cui: 42:26

Thank you. Well, best of luck and take care and stay safe.

Ben Ellerby: 42:30

You too. Bye.

Yan Cui: 42:31

Bye, bye.

Yan Cui: 42:45

So that's it for another episode of Real World Serverless. To access the show notes, please go to realworldserverless.com. If you want to learn how to build production ready Serverless Applications, please check out my upcoming courses at productionreadyserverless.com. And I'll see you guys next time.