Real World Serverless with theburningmonk

#35: Serverless at MatHem with Lars Jacobsson

October 28, 2020 Yan Cui Season 1 Episode 35

You can find Lars on Twitter as @lajacobsson and on LinkedIn here.

Here are the tools we talked about in the episode

Check out opportunities at MatHem here.

To learn how to build production-ready Serverless applications, go to productionreadyserverless.com.

For more stories about real-world use of serverless technologies, please follow us on Twitter as @RealWorldSls and subscribe to this podcast.


Opening theme song:
Cheery Monday by Kevin MacLeod
Link: https://incompetech.filmmusic.io/song/3495-cheery-monday
License: http://creativecommons.org/licenses/by/4.0

Yan Cui: 00:12  

Hi, welcome back to another episode of Real World Serverless, a podcast where I speak with real-world practitioners and get their stories from the trenches. Today, I'm joined by Lars Jacobsson. Hi, welcome to the show.


Lars Jacobsson: 00:25  

Hi, thanks for having me.


Yan Cui: 00:27  

So we were talking just before the podcast started, and you guys have a really interesting story about how you went serverless, how you got into SAM, and how you're now thinking about CDK. Can we start with you telling us about yourself and what you're doing at MatHem, which is one of the biggest online grocery stores in Sweden?


Lars Jacobsson: 00:47  

Yeah, exactly. So first, some background about MatHem. It was founded in 2006; I think the first order was placed around 2008. It was built as a .NET monolith from the beginning. I'm not going to talk it down, because it was quite a good piece of software, but it grew very large. I joined in 2013, at the end of the year. And I remember, from the very beginning, we had scalability issues: we reached, I think, a couple of hundred simultaneous shoppers on the site and stuff went down. So then we started ripping out stuff like the shopping cart into MongoDB, and that bought us more buffer to accept more customers. But given the market share we needed, we knew that we couldn't scale to that many people on the platform we had, so we always knew that something had to happen. Time went on, and in 2016 we had a budget to hire more developers. But it was quite hard to hire developers for the tech stack we had, because it was quite aged. People would turn around at the door and say, we love the company, we love your service, but I don't really want to work on this. And there were two aspects of scaling: the site and the number of users we could take, and also scaling the teams. We wanted productivity to increase linearly with the number of developers we had, but we found that the people we did manage to hire just increased the toe-stepping, because we were sharing the same codebase, and we didn't get the benefits of growing. Then, in the autumn of 2016, we sat down and looked at all we had to do. We started ripping out the data access layer into its own service, and then, a bit impulsively, went to the OSCON conference in London, the O'Reilly one. It was all about microservices, and we were sold. We knew we wanted this, but we didn't know exactly how to do it, and because we all came from a .NET background, we had missed this whole Docker era. But then 2016 turned out to be a bit of a perfect storm, a series of events that led to where we are now. Earlier that year, Microsoft released .NET Core, which runs on Linux, and that opened up these Docker possibilities for us. So we came back from London, sat down, brainstormed for a week and thought, “Yeah, let's do Docker”. We already had a site, tasteline.com, a recipe search site that runs on AWS without much maintenance, so we had an AWS account. We started playing with ECS, got some stuff out in production, and learned the hard way the importance of IaC, because we had to recreate the same stuff over and over again. And we just realised that we were spending so much time figuring out how to configure these clusters, and not enough time developing features. And it's the feature development we love here, which is the core of the company. So it felt like it wasn't that much fun; the fun part is in writing code. We didn't go to re:Invent that year, but we went to the re:Invent recap in Copenhagen in January 2017, where we were introduced to the Serverless Application Model. And we were like, hang on, that's exactly what we want: writing code without thinking about configuring clusters. So we went home, things started escalating, and we got really productive. And from a very early start, I think like day two of this, we said, what if we take one step back and talk about the approach. We sat down and defined a small set of rules to follow.
And the first debate was how do we version control this. We came from a monorepo background, because we had one monolith, and we decided to go for a polyrepo approach where each repo is its own microservice, which is set up from one CloudFormation stack, or SAM stack, and follows the same naming convention. So when you create a repository, it will automatically create a CodePipeline named the same way as the repository. It will also tag all your resources with that name, and we let CloudFormation name your resources, so wherever you see, for example, CartService, whether it's in a log group or a metric, you know exactly where to find the code in GitHub. I've found that incredibly valuable. What we also found at another stage was that we were repeating ourselves a lot, so we created internal tooling for templating. You can compare it with CDK, where you create constructs. We create common patterns, like an SQS consumer with dead-letter queuing and a sane retry policy, but this was long before CDK. And then we started growing, we hired more and more people, and we split ourselves up into smaller, domain-specific teams, where each team might be managing 10 to 20 different microservices. At the moment, we have about 300 microservices in production. Not everything is perfect, but a lot of it is really good. We've treated the pandemic as a live load test, and it went very successfully. I think we went up almost tenfold in number of requests in one day, when things started closing down. We had a few hiccups, but nothing we couldn't fix in an hour. So yes, it has been a good journey.
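
For illustration, a minimal sketch of that kind of reusable pattern — an SQS consumer with a dead-letter queue and a bounded retry policy — written as plain SAM, in the spirit of the templating Lars describes. The resource names and values here are hypothetical, not MatHem's actual output.

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Resources:
  OrderQueueDLQ:
    Type: AWS::SQS::Queue

  OrderQueue:
    Type: AWS::SQS::Queue
    Properties:
      VisibilityTimeout: 60            # a common rule of thumb is ~6x the function timeout
      RedrivePolicy:
        deadLetterTargetArn: !GetAtt OrderQueueDLQ.Arn
        maxReceiveCount: 3             # the "sane retry policy": give up after 3 attempts

  OrderConsumerFunction:
    Type: AWS::Serverless::Function
    Properties:
      Runtime: nodejs12.x
      Handler: index.handler
      CodeUri: ./src
      Timeout: 10
      Events:
        Queue:
          Type: SQS
          Properties:
            Queue: !GetAtt OrderQueue.Arn
            BatchSize: 10
```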


Yan Cui: 05:43  

I know it's taken a few years to get there, but it sounds like you guys have come pretty far, from falling over when there were a hundred shoppers online to now being able to handle some of the traffic spikes that came with the pandemic. Do you have any numbers around some of the new spikes you saw during that shopping frenzy people had early on, when they started worrying about supplies and all of that?


Lars Jacobsson: 06:05  

Yeah, I'll pull some numbers out. If we go back to, I think it was the beginning of March, I read the same thing happened in the UK with Ocado and Tesco and all of those grocers: the next available delivery day kept getting pushed forward in time. People would place an order for, like, a week ahead because there were no delivery times. So our bottleneck became the warehouse and the production part of it. And we saw that, yeah, we can't please all customers, because we don't have the physical resources for it. But if you place an order for food ten days before the delivery, you will still be disappointed, because you don't know now what you'll need in ten days. So we, and I think we were the only ones who took this approach, reshuffled things and only opened delivery times two days ahead, so at least you get stuff that was fresh in your mind when you bought it. The consequence of that was that we announced to the customers that at 12 noon we'd release a bunch of delivery times for two days in advance. And of course, people were sitting hanging on the door at 12 to reserve a delivery time. Pre-pandemic, a normal evening when people would do their shopping was maybe 800 to 1,000 simultaneous shoppers. Here, 6,000 delivery slots sold out in two minutes. And the number of requests to the delivery time service that reserves these times went from, like, zero before 12 — people knew there was no point clicking — to, I mean, you do the math, a lot of requests. And when you reserve a delivery time, there are events raised, which fire other Lambda functions. There were two things we learned there. A) we had already increased the concurrent Lambda executions limit to a very high number, but what we learned was that it still scales in that bucket of 500 at a time. We quickly saw that, and then we made use of this brilliant feature, provisioned concurrency. So we scaled up 15 minutes before 12 on some crucial services, crucial Lambda functions. And that handled the load perfectly; you couldn't notice any latency added to the site at all, despite suddenly having 10,000 simultaneous users. It was a load test which we hadn't planned for, on production.


Yan Cui: 08:18  

Yeah, that's a really good use of provisioned concurrency, especially for these kinds of sudden spikes where you do run into those scaling limits you mentioned. And I guess for you guys, because you're still writing a lot of Lambda in .NET, well, .NET Core, provisioned concurrency is also going to help with the cold start aspect, where you don't have to worry about all these new .NET modules having to cold start at the same time, adding a bit of latency to the user experience. So in terms of your architecture, what does it look like? It sounds like there are a lot of APIs involved, but you mentioned that you guys have also started to do more and more event-driven stuff with EventBridge.


Lars Jacobsson: 09:01  

Yeah, exactly. So to go back to the provisioned concurrency for APIs, let me first touch on how we work with API Gateway and Lambda. We have one API Gateway for every stack that requires an API — not all stacks have an API, some just do data processing. We use a Lambda proxy pattern: for Node.js we use the Express.js framework, and we use ASP.NET for .NET. So you get this multi-purpose big Lambda. You give it a set of permissions, but then one endpoint might not use any of those. It's debatable whether you should have one Lambda function per API method, but this works for us. It also enables us to do local debugging. Obviously, those artefacts get quite big, and depending on what you do in the bootstrapping phase, that will affect the cold start. We don't use provisioned concurrency for it yet. When I've been playing with it, just the price of it frightens me a bit; it gets a bit expensive. What we do with these .NET Lambdas, where latency is crucial, is: A) we max the RAM, and B) we have that old hack, an EventBridge rule that fires X number of concurrent requests at the API at a set rate, just to keep them awake. It's a mitigation, not a solution. But what we find is that we typically get about a one-second cold start on average. And the very crucial ones, like the product search, we're in the process of rewriting to Node.js, for cold start reasons. We're not locked into a language; we pick the one that suits the use case best.
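
For readers unfamiliar with the pattern, here is a minimal sketch of such a proxy Lambda in Node.js. serverless-http is one common wrapper for it; the routes and names are illustrative, and this isn't necessarily how MatHem wires theirs up.

```js
// One Express app serves every route of the service's API; a single
// handler is exported for the single API Gateway proxy integration.
const express = require("express");
const serverless = require("serverless-http");

const app = express();
app.use(express.json());

app.get("/cart/:id", (req, res) => {
  // ...load the cart, e.g. from DynamoDB...
  res.json({ id: req.params.id, items: [] });
});

app.post("/cart/:id/items", (req, res) => {
  // ...add an item to the cart...
  res.status(201).end();
});

// serverless-http translates API Gateway proxy events into HTTP
// requests against the Express app, and responses back again.
module.exports.handler = serverless(app);
```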


Yan Cui: 10:40  

Okay, that's actually interesting, because with provisioned concurrency, we're finding people no longer need to use Lambda warmers. But it sounds like you're worried about the provisioned concurrency pricing. Is that something you're seeing in the dev environment? Because in production, if you can get a certain amount of utilisation out of your provisioned concurrency — we did some numbers a while back, and I think if you get past 60% utilisation, you can actually end up being cheaper compared to on-demand. Is it because the traffic is so spiky that a lot of the time you're not using the full...


Lars Jacobsson: 11:15  

Yes. 


Yan Cui: 11:15  

Ah, gotcha.


Lars Jacobsson: 11:16  

I think it's a huge benefit, in a way, to exist in one timezone, because we have very predictable patterns. You could compare one week to another and the traffic lines up, so we can do anomaly detection very well, because if something weird happens, there will be an anomaly. We have very low traffic at night, steady traffic during the day, a small spike at lunch, because people do their shopping in the lunch break, and then it goes up in the evening and dies off around midnight, maybe before midnight. So if we were to use provisioned concurrency, I think we would still need to do some manual scaling, to scale down at night, because otherwise there's a set fee for the provisioned concurrency. That's my understanding of it, anyway.


Yan Cui: 11:56  

So with provisioned concurrency, you can make it work with Application Auto Scaling, so you can set up a schedule. In your case, you can set up a scheduled action to scale up at, I don't know, 11:35 or 11:45, and then scale the provisioned concurrency down from 100 to 10 at one o'clock, if you know your spike comes between 12 and 1. Same in the evening when you release new slots: you can set a scheduled scaling action to be triggered at, say, quarter to 12, and another one to scale back down after the 12 o'clock rush, when everyone tries to get the 6,000 slots in two minutes. So you can control those using a schedule, which is great in your case, because, like you said, it's so predictable; it happens every single day. And in between, during the day, you can just use a very low number of provisioned concurrency, just for those functions that don't get used often, where otherwise every time someone uses one, it's going to be a cold start. So I think that's probably what you should look at in terms of how to use provisioned concurrency and remove some of the manual work you have right now.
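
In CloudFormation terms, the scheduled scaling Yan describes could look something like this minimal sketch. The function name, alias, and numbers are hypothetical, and it assumes the alias points at a published version with provisioned concurrency enabled.

```yaml
Resources:
  SlotServiceScalableTarget:
    Type: AWS::ApplicationAutoScaling::ScalableTarget
    Properties:
      ServiceNamespace: lambda
      ScalableDimension: lambda:function:ProvisionedConcurrency
      ResourceId: function:reserve-delivery-time:live  # illustrative function and alias
      MinCapacity: 10
      MaxCapacity: 100
      # Application Auto Scaling uses its service-linked role if RoleARN is omitted.
      ScheduledActions:
        - ScheduledActionName: scale-up-before-noon-release
          Schedule: cron(45 9 * * ? *)    # schedules run in UTC; 11:45 in Sweden (CEST)
          ScalableTargetAction:
            MinCapacity: 100
        - ScheduledActionName: scale-down-after-rush
          Schedule: cron(0 11 * * ? *)    # one o'clock local time
          ScalableTargetAction:
            MinCapacity: 10
```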


Lars Jacobsson: 13:09  

Yeah, that's interesting. I didn't know it hooked into Application Auto Scaling. It still adds another layer of configuration to the stack. We should definitely look at that for certain APIs. I don't think it's doable for all APIs, but, yeah, the ones where latency is crucial. I'll take a note of that. Thank you.


Yan Cui: 13:26  

Just to use provisioned concurrency, you have to use aliases, which, like I said, adds another layer of configuration and complexity as well. But for the latency-sensitive stuff that you don't want to have to rewrite in order to optimise for cold start performance, it's still a really good option. I'm curious as to what you're doing with EventBridge, and the event-driven architecture that you envision you guys are going into.


Lars Jacobsson: 13:51  

Yeah, we worked from an early start with SNS to SQS. To go back to the very beginning, we wanted a good way to trigger Lambdas based on events, and we often, quite wrongly, used Kinesis for this — this was before there was native SQS-to-Lambda support. We then started using SNS to SQS, the pub-sub method, which works fine. The problem I found there was that it created coupling both ways between the producing side and the consuming side. Because typically, or always, the producer of the event owns the topic, it's created in the stack that produces the event, and the consumer will then reference the topic via a CloudFormation export, creating a coupling. And also, on the consumer end, we want to make use of message filtering, but the producer doesn't know what the consumers want to filter on. So there was a lot of talking between teams, lots of pull requests from the consumer side to the producing side where you just add message attributes, like, I want to filter on this status or whatever. I think one of the goals of my life is to remove friction between... I make it sound like I don't want people to talk to each other at work — we do. But maybe share knowledge instead of asking for favours and interrupting each other. I remember, like a year before EventBridge, I was sitting and talking to our SA about why don't we use CloudWatch Events, because the pricing is similar to SNS, the payload size is similar, and then we'd get the content-based filtering. And we agreed, like, yeah, maybe it's not built for that. It's not how it's promoted, so maybe it's wrong to use it that way. And then a year later — I think they listened in on our conversation — they built EventBridge, which is CloudWatch Events on steroids. We didn't jump on it straight away, I don't know why really, but around December, I think, they released updates to the content-based filtering. And I remember I sat at re:Invent playing with this, with the schema registry, and I was seeing all these opportunities: hang on, can this actually replace our SQS and SNS? Yes, pretty much all of it. And we spoke to the EventBridge team at re:Invent, and they were like, yeah, people do that. So we went back and did that, and we started using the schema registry already when it was in preview. The power of EventBridge is the content-based filtering; the difficult part is writing the patterns. So we wrote a tool called evb-cli, which lets you automatically build event patterns from the schema registry. And what we found, when we removed the whole pattern-composition part and automated it, was that it got incredibly fast to build patterns. And then we started to think, what's an event? I think nine times out of ten, an event is raised upon a data mutation, often in a DynamoDB table for us. So we created a macro which hooks up a DynamoDB stream to EventBridge and just passes the events on to EventBridge. When we do that, we also add some metadata to the event about what CRUD operation caused it and what has changed in it, and we send the full old state and new state. So it's basically a DynamoDB record passed on to EventBridge with added metadata — a fan-out pattern from DynamoDB.
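
The transcript doesn't show MatHem's macro itself, but a minimal hand-written sketch of the fan-out it describes might look like this in Node.js; the bus, source, and entity names are made up.

```js
// A Lambda on the table's DynamoDB stream that republishes each record
// to EventBridge with CRUD metadata plus the full old and new state.
const AWS = require("aws-sdk");

const eventBridge = new AWS.EventBridge();
const OPERATIONS = { INSERT: "create", MODIFY: "update", REMOVE: "delete" };
const unmarshall = AWS.DynamoDB.Converter.unmarshall;

exports.handler = async (event) => {
  const entries = event.Records.map((record) => ({
    EventBusName: "ecommerce",   // illustrative bus name
    Source: "cart-service",      // illustrative service name
    DetailType: "CartItem",      // illustrative entity name
    Detail: JSON.stringify({
      metadata: { operation: OPERATIONS[record.eventName] },
      data: {
        old: record.dynamodb.OldImage ? unmarshall(record.dynamodb.OldImage) : null,
        new: record.dynamodb.NewImage ? unmarshall(record.dynamodb.NewImage) : null,
      },
    }),
  }));

  // PutEvents accepts at most 10 entries per call, so batch accordingly.
  for (let i = 0; i < entries.length; i += 10) {
    await eventBridge.putEvents({ Entries: entries.slice(i, i + 10) }).promise();
  }
};
```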


Yan Cui: 17:12  

Okay, so when you're publishing those custom events to EventBridge, are you making sure that you always publish the new and old image, essentially trying to mimic what DynamoDB Streams does?


Lars Jacobsson: 17:23  

Yeah, so we take the old image and the new image and wrap them up, so our payload looks, I think, similar to what Lego does. We have a standard pattern of metadata and data that go into the detail. The data has new and old — so not NewImage and OldImage — and the metadata has stuff like the CRUD operation, and it's got an array of the JSON paths to the items that changed between new and old, so we can do some matching on whether certain properties have changed or not. We sometimes have the issue of payloads being too large. If there are large items in the table, that often goes back to the table design not being optimal. But I do have a wish that they will increase the size limits on EventBridge and SNS and SQS.
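
Based on that description, a single event in such an envelope might look roughly like this; the exact field names are guesses, not MatHem's actual schema.

```json
{
  "detail-type": "CartItem",
  "source": "cart-service",
  "detail": {
    "metadata": {
      "operation": "update",
      "changedPaths": ["$.quantity", "$.updatedAt"]
    },
    "data": {
      "old": { "cartId": "abc-123", "quantity": 1 },
      "new": { "cartId": "abc-123", "quantity": 2 }
    }
  }
}
```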


Yan Cui: 18:10  

Okay, so in this case, do you have a centralised, let's say, one Event Bus for everything?


Lars Jacobsson: 18:16  

Nope, we don't touch the default Event Bus; that's for the AWS service events. We have split our organisation up into logical areas: we have the eCommerce platform, we have the supply chain, we have the data platform, and so on. Every larger logical domain of the organisation has its own Event Bus, so all eCommerce services would raise events on the commerce Event Bus. We have grouped it logically like this because we think, if we were to move to a more multi-account strategy, each Event Bus would live in the same account as its domain.


Yan Cui: 18:49  

Yeah, because right now, at least as of today, the developer experience of working across accounts with EventBridge is not particularly great. EventBridge does have cross-account delivery, of course, but it always delivers to the default bus, and trying to subscribe across accounts is still quite clumsy right now. I do find it interesting that you're using multiple Event Buses, because one of the reasons people say they want to use EventBridge instead of SNS is that the content-based filtering allows you to use one Event Bus for the entire application. If there's something you want to do, say some kind of journaling of all the events that get put onto the Event Bus, I guess you'd have to subscribe to every single Event Bus in the entire system.


Lars Jacobsson: 19:36  

We have five Event Buses, and I've thought of that scenario as well. You would need to have something in between to consolidate them into one Event Bus, that's true. I don't have a good reason, apart from structure, for why we went for these separate Event Buses. I think it was just to get some structure.
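
That "something in between" could be as simple as a catch-all forwarding rule on each domain bus. Here is a hypothetical sketch in CloudFormation — EventBridge supports another event bus as a rule target, and the empty-prefix pattern matches every event — though this is not how MatHem has set it up.

```yaml
Resources:
  ForwardToJournalRule:
    Type: AWS::Events::Rule
    Properties:
      EventBusName: ecommerce              # illustrative domain bus; one rule per bus
      EventPattern:
        source:
          - prefix: ""                     # matches every event on the bus
      Targets:
        - Id: journal-bus
          Arn: !Sub arn:aws:events:${AWS::Region}:${AWS::AccountId}:event-bus/journal
          RoleArn: !GetAtt ForwardRole.Arn

  ForwardRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal: { Service: events.amazonaws.com }
            Action: sts:AssumeRole
      Policies:
        - PolicyName: put-events
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action: events:PutEvents
                Resource: !Sub arn:aws:events:${AWS::Region}:${AWS::AccountId}:event-bus/journal
```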


Yan Cui: 19:58  

Okay, sure. I don't mean that this is wrong or bad, just that it's, I guess, a little bit different to how I've seen a lot of customers use EventBridge, whereby they tend to have just one bus for all events and then do content-based filtering based on the detail type — similar, I guess, to how the whole of AWS has got one default bus that you can listen in on, even though God knows how many different services are publishing events onto that same bus. You also have some tooling to help with working with, and debugging, events in EventBridge as well, right?


Lars Jacobsson: 20:33  

Yeah. So if we go back to the beginning of the chat, I talked about this internal tool we built, which we call the MHCLI, which does templating. It also does other boilerplate stuff — basically, when a developer does something for the third time and thinks this is getting boring, they add a command to the CLI. I've been talking to other companies, we've done internal demos and stuff like this, and they go, “Oh, could that be open sourced?” And we've spent some time discussing it. But if it had been, it would have been like your CLI, the Lumigo CLI, because that's, again, a multi-purpose, cross-service productivity tool, like ours. It is fast, but it wouldn't fit into any other organisation. But we have discussed doing it. Then, when we started to work with EventBridge, we found that it takes time to compose patterns, especially when they get complex, so we built evb-cli, which, as I said before, hooks into the schema registry, parses the schemas, lets you browse them, and from there builds patterns. And, just as of the other day, it can now inject them straight into your template, where they're supposed to be. It also helps because we had developers getting confused — they didn't know who was subscribing to an event. That's a use case: you want to know, where is this being propagated to? Say I want to make a breaking change; if no one is listening to the events, I can go ahead and do it, but if ten teams are, then they need to know. It also has browsing capability: you can select a schema, see where that schema is actually used in patterns, and see how it's transformed into the target, and what the target is. Then, in parallel to that — and don't ask me why I did this as two different tools, I just did — there is evb-local, which contains two parts. One is a CloudFormation stack you deploy to your AWS account; the other is a CLI. It lets you listen in on a stack's events: you can hook it into a CloudFormation stack, it will parse it and find all the event rules, and it will spit out the payloads to the console. You can also pipe them into the SAM local CLI, so you can set breakpoints in your code based on real-life events in AWS. It also lets you test rules before you deploy your template: you can write a rule in your template, and it will then create that rule in AWS and create a pipe back to your console — that's using API Gateway WebSockets, so you have a WebSocket connection down to your console — and you can then pipe those events into SAM local. We find doing that saves the developers a lot of time; you don't have to do the round trip of actually deploying stuff to test it, because doing integration tests with EventBridge is quite difficult unless you physically deploy stacks.
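
Separate from those tools, EventBridge itself exposes a TestEventPattern API, which is one way to sanity-check a pattern against a sample event without deploying anything. A minimal sketch in Node.js, with a made-up pattern and event:

```js
const AWS = require("aws-sdk");
const eventBridge = new AWS.EventBridge();

// Illustrative pattern: match updates to CartItem from cart-service.
const pattern = {
  source: ["cart-service"],
  "detail-type": ["CartItem"],
  detail: { metadata: { operation: ["update"] } },
};

// TestEventPattern needs a complete event, including the envelope fields.
const sampleEvent = {
  id: "1",
  account: "123456789012",
  source: "cart-service",
  time: new Date().toISOString(),
  region: "eu-west-1",
  resources: [],
  "detail-type": "CartItem",
  detail: { metadata: { operation: "update" } },
};

eventBridge
  .testEventPattern({
    EventPattern: JSON.stringify(pattern),
    Event: JSON.stringify(sampleEvent),
  })
  .promise()
  .then(({ Result }) => console.log(Result ? "matches" : "does not match"));
```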


Yan Cui: 23:16  

Yeah, tell me about it. I actually wrote a blog post a while back about how you can bring EventBridge or SNS into your integration or end-to-end tests and validate that you're sending the right stuff to EventBridge. One of the approaches I discussed is what you just described: using API Gateway WebSockets that I can listen in on from my Jest tests, so that I can invoke a function, either running locally or via the deployed API Gateway, and then listen in on the event being published to the EventBridge bus. That way I know, okay, when this API runs, it does publish the right events, so that other systems are able to tap into them.


Lars Jacobsson: 24:07  

Yeah, I've seen that pattern being used in integration testing — I think in CodeBuild, or maybe it's called an end-to-end test: if I fire something at this API endpoint, I expect this event to be raised within a second. So you can test that flow as part of your continuous integration.
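
As a concrete illustration of that kind of assertion, here is a hypothetical Jest-style sketch. It sidesteps the WebSocket plumbing by assuming a test-only rule that forwards the bus's events to an SQS queue; the endpoint, queue, and event names are invented.

```js
const AWS = require("aws-sdk");
const axios = require("axios");

const sqs = new AWS.SQS();
const QUEUE_URL = process.env.TEST_EVENTS_QUEUE_URL; // the test rule targets this queue

test(
  "reserving a slot raises a SlotReserved event",
  async () => {
    // Fire at the API endpoint (illustrative URL)...
    await axios.post("https://api.example.com/delivery-times/123/reserve");

    // ...then wait for the forwarded event to land on the test queue.
    const { Messages = [] } = await sqs
      .receiveMessage({ QueueUrl: QUEUE_URL, WaitTimeSeconds: 10 })
      .promise();

    const events = Messages.map((m) => JSON.parse(m.Body));
    expect(events.some((e) => e["detail-type"] === "SlotReserved")).toBe(true);
  },
  15000 // allow for the long poll; Jest's default timeout is 5 seconds
);
```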


Yan Cui: 24:28  

So you've built a lot of tooling around your development teams so that you can automate a lot of the boilerplate, the boring stuff that they'd otherwise have to do constantly. Have you looked at something like the CDK? Because some of the things you were talking about earlier, in terms of generating templates, that's something we're seeing a lot of people use the CDK for.


Lars Jacobsson: 24:47  

Yeah, I don't want to write off CDK, because everywhere I go I hear people raving about it, and I think it looks like a great tool. But we've looked at it in the context of how we're working, and the approach we took with our templating engine that creates these... you could call them constructs. For example, a pub-sub pattern where you create a subscription from an SNS topic to an SQS queue and a Lambda function, with a retry policy and a dead-letter queue — that's a typical CDK construct, and we created it before CDK even existed, but our tool writes pure CloudFormation. The benefit of doing that is that the developers who own the stack see exactly what they're deploying in the template. And also, we don't enforce any specific programming language; if we were to work with CDK and we wanted everyone to contribute to these constructs, who decides which language to use? But the use case I'm thinking of for CDK is this: our templating engine does two things, it produces code — in .NET, in Python, or in Node — and the template that goes around it. You could replace the templating part — I mean, our templating engine is not flawless — and integrate it with CDK, so the stuff that's spat out into your template when you use the MHCLI is actually produced by CDK constructs. That's something I have in my short-term pipeline to look at. But I haven't touched CDK that much yet, so I'm not really sure what's possible and what's not. From what I read, I think it should be possible.


Yan Cui: 26:22  

Yeah, that should be in the realm of possibility. And certainly I know companies like Alma Media, who I spoke with a while back, are doing a lot of similar things, using the CDK to create patterns, or high-level constructs — like an SQS queue with all the default alerts configured, and with a dead-letter queue configured as well — as one simple construct that you can distribute across all of your teams as a shared library. You can basically take the best practices you've learned collectively as an organisation and just reuse them, rather than everyone having to constantly remember: right, we need to create a dead-letter queue, and we need to set the alert so that when something gets dropped into the dead-letter queue, we get notified and know to look at it. So I think that's something that's getting very popular with the CDK, and also a very good use case for it. But you also touched on some of the potential challenges with the CDK, especially in a large organisation where you potentially have a centralised Ops team that needs some governance over the resources that get provisioned. If every team is just using their favourite language to write their CDK constructs, then that team is going to have a harder time working with essentially four or five different languages. So how are your teams structured? Do you have a centralised Ops team that manages a lot of these things for you, or do you just let the developers do their own thing?
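
For reference, the kind of shared construct Yan describes might look roughly like this in CDK for Node.js (v1-style imports; the construct and its defaults are hypothetical, not Alma Media's or MatHem's actual code).

```js
const cdk = require("@aws-cdk/core");
const sqs = require("@aws-cdk/aws-sqs");
const cloudwatch = require("@aws-cdk/aws-cloudwatch");

// A queue that always comes with a dead-letter queue and an alarm
// that fires as soon as anything lands in the DLQ.
class QueueWithDlq extends cdk.Construct {
  constructor(scope, id, props = {}) {
    super(scope, id);

    this.deadLetterQueue = new sqs.Queue(this, "Dlq");

    this.queue = new sqs.Queue(this, "Queue", {
      deadLetterQueue: {
        queue: this.deadLetterQueue,
        maxReceiveCount: props.maxReceiveCount || 3, // the baked-in retry policy
      },
    });

    new cloudwatch.Alarm(this, "DlqAlarm", {
      metric: this.deadLetterQueue.metricApproximateNumberOfMessagesVisible(),
      threshold: 1,
      evaluationPeriods: 1,
    });
  }
}

module.exports = { QueueWithDlq };
```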


Lars Jacobsson: 27:56  

We push the responsibility for all the infrastructure creation, through CloudFormation templates, out to the developers. We have this mindset of: you build it, you run it. So no, we don't have a centralised Ops team, and I think that's something we've been working towards from the beginning, because, as I said before, we want to avoid friction between teams. I've worked in places before where there was a development team, a QA team, and an Ops team or sysadmin team, and there was never a steady flow of work between them; there was always someone acting as the bottleneck. So the closest you get to an Ops team is my team — there are three of us; we just hired a guy last week, so it was two until recently. The way I look at setting the guidelines and tooling is: I see my job as constantly trying to make myself redundant. We've gone through short periods where I, or my team, have become a bottleneck, where developers are waiting for us to enable stuff for them. What we do instead is provide them with the guidelines and the tooling they need to do their work without coming to us all the time. It's working really well. We've gone through shorter periods when it's not been very aligned, but I think that goes with anything.


Yan Cui: 29:08  

Okay, yeah, that is, I guess, a sensible approach, where you focus on building tooling and enabling other teams to do their own thing, as opposed to being a gatekeeper — which I do find some of the larger organisations still operating with, under, I guess, more traditional models of how to organise teams. And that does create a lot of friction, even as those teams adopt technologies like serverless, where they can move a lot faster; everything slows down the moment they need to rely on the centralised Ops team to do something, to add resources to an AWS account, because the teams themselves can't do it.


Lars Jacobsson: 29:42  

Yeah, I mean, there are certain things we do look after: the single sign-on stuff, and the Control Tower part of it as well, because we don't want everyone in there. But if you look at something like the single sign-on thing, that was one place where we became a bottleneck. What we quickly did was call a meeting with all the dev leads for all the teams, make them SSO admins, and allow them to manage their own users and new starters. So we're delegating out work like that in a controlled manner.


Yan Cui: 30:12  

Okay, since you've moved to serverless, what would you say are some of the biggest benefits it has brought to your teams, and also to the company as a whole?


Lars Jacobsson: 30:21  

Yeah, I would say a lot. There are different pillars you need to look at, and it also depends on the context you came from before. For us, our two blockers were the scalability of the site and the scalability of the teams — there's also the scalability of the warehouse and the production lines, but what we can control on the tech side is the development of the site and making that scalable, and we have very talented people out in the production lines scaling up the warehouse. As an example, when the pandemic started, we acquired a new warehouse, which we got from empty to up and running in, I think, four weeks. The whole mindset of being agile is in the core of the company. So the first benefit of going serverless is the scalability of the site. I mean, we would never have been able to handle the pandemic with what we had before going serverless, with that massive stampede of customers coming all at once. That was really a token of, yeah, we made the right choice, because people just sat back and watched it, and it worked like a well-tuned machine, with a few hiccups — but let's not talk about those. The second part is that we had a big investment in the middle of this, a huge investment from soon-to-be one of the major tech investors in Sweden, just when we were about ready to launch the new site, so we were subject to a thorough technical review during the due diligence. I think the fact that we landed it shows there's faith in our technology and the scaling of it, and that we have a future-proof solution. And also hiring people: now, when we get new starters in, we can see that productivity is almost linear with the number of people, whereas before, as I mentioned earlier, when we hired more people, each person contributed less, because of merge conflicts and toe-stepping when you work on the same codebase. Also, we now work in a technology sector people want to be in. I saw this when we decided to go serverless: I read up on containers and thought, this is a natural step; it's probably not going to end where we are now, but it's a step in the direction technology is going. So from a developer perspective, it's really good. You have to look at it as: you work eight hours a day; it has to be meaningful; you have to learn stuff. And working in this sector, we teach people the right things, so growing the team works. Then there's releasing: we have a very fast time to market. I have a little equation in my head about what should take time when you build new features: it's the idea part that should take time, working out the new idea and writing the code; what's in between, setting up CI/CD and deploying it, should be minimised. And we have that — we can get a new feature out in minutes, if we're fast enough at writing the code. So it's the agility: we're very dynamic, and we can react fast to changes in the market. I think that sums up the benefits. Then another big pillar is the billing part of it. We're constantly working on cost optimisation; there are some places we could focus on more, which we haven't, because we've been very keen on getting the site out and working. I'm not saying we're running an expensive workload, but there's always work to be done on cost optimisation.


Yan Cui: 33:39  

Okay, it's really good to hear that you guys have reaped a lot of benefits from going serverless. But is there anything you wish AWS would do better? I guess this is where you can give your top three AWS wishlist items, for things like EventBridge, or SAM, or something else.


Lars Jacobsson: 33:57  

Yeah, Yan, you did prewarn me about this question, so I have a list of a few things. I had listened to a few of your episodes before, and yesterday I was going through a few more, and something that's been on my mind — which obviously has been on other people's minds too — is that AWS has a serverless solution for pretty much everything apart from search. We use Elasticsearch, and I'm a bit scared of that cluster. Most of the time it runs flawlessly, but it does require some patches, it requires some manual work, which I find a bit daunting. So I would love some form of serverless Elasticsearch solution, or similar. Then with EventBridge — and also SNS and SQS — I want to be able to handle payloads larger than 256K. I know some people just send a reference to the item that raised the event, and then you fetch it through an API call at the receiving end. I don't like that; I want the whole event together, so it's filterable. It does happen, on rare occasions, that you need large events. When we were back on SNS and SQS, we solved that by passing the payload via S3 and just sending a reference if it was a large item, but with EventBridge you would lose the content-based filtering. And I have another one on EventBridge: I think what's lacking there is better retry handling, a baked-in setting for a retry policy, with exponential backoffs and dead-letter queuing and stuff. I feel there's something missing there. But then we have to remember that EventBridge is still a new service, and it's performing really well; I'm hoping there will be focus from the AWS side on that. I've also just started working with the embedded metrics client library for Node, and I think it's great for creating custom metrics. But that leads to another wishlist item: CloudWatch Logs ingestion is quite expensive, and we want to collect metrics and data. I mean, Lambda itself is quite cheap, but then CloudWatch adds to that, so I have to keep in mind, how verbose can we actually be when logging our stuff? Then I asked the team what they think — this fourth point is coming from the team — and I had good feedback. Both GCP and Azure have this concept of a page where you can list all your resources across accounts, so in one consolidated view I can drill into which are all my S3 buckets, which are my Lambda functions, because right now it's all over the place in the console. So yeah, that's my four items on the three-item wishlist.
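
The embedded metrics library Lars mentions is aws-embedded-metrics for Node, which writes CloudWatch Embedded Metric Format to the logs. A minimal sketch of its use; the namespace, dimensions, and event shape are illustrative.

```js
const { metricScope, Unit } = require("aws-embedded-metrics");

// metricScope flushes the metrics to stdout (as EMF-formatted log
// lines) when the wrapped handler completes.
exports.handler = metricScope((metrics) => async (event) => {
  metrics.setNamespace("MatHem/Cart");            // illustrative namespace
  metrics.putDimensions({ Service: "cart-service" });
  metrics.putMetric("ItemsAdded", event.items.length, Unit.Count); // assumes event.items exists
  metrics.setProperty("CartId", event.cartId);    // searchable in logs, but not a metric

  // ...business logic...
});
```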


Yan Cui: 36:38  

Yeah, excellent. Well, I certainly concur with everything you mentioned, especially the serverless search. I've just spent the last couple of days on a client project setting up an Elasticsearch cluster, and setting it up alone is tricky, not to mention all the maintenance that comes after that. So yeah, I do hear your pain. For a lot of people, that's the one serverful component they have to run in their stack, because there's no equivalent within AWS, at least. Typically, I use Algolia, but using Algolia and getting HIPAA compliance is quite expensive, and Elasticsearch is HIPAA compliant, so for this project it's what we have to use. But yeah, it's a lot of overhead compared to what I normally have to deal with. Lars, thank you so much for taking the time to talk to us today. Before we go, how can people find you on the internet?


Lars Jacobsson: 37:37  

How can they find me on the internet? I'm not that active, actually. I've got a Twitter account, @lajacobsson — I'll send all of this to you, so you can put it in the summary later. That's basically me sending out updates about the tooling we build. In addition to the EventBridge tools I talked about before, we have some other tooling around generating policies, generating diagrams from CloudFormation templates, and stuff like that. That's all under a GitHub org called mhlabs (https://github.com/mhlabs), inspired by awslabs. And I'm also on LinkedIn.


Yan Cui: 38:10  

Okay, excellent. You've already sent me some of the tools that you've published, so I'll make sure those are in the show notes, along with a few other things we talked about in this episode — they'll all be linked in the show notes as well — and I'll add your Twitter and LinkedIn profiles there, so that people can find you. Again, thank you so much for taking the time to talk to us today. I hope you stay safe, and hopefully I'll see you in person sometime soon.


Lars Jacobsson: 38:37  

Yeah, yeah, thank you. Stay safe. 


Yan Cui: 38:39

You too. Bye bye. 


Lars Jacobsson: 38:40  

Bye.


Yan Cui: 38:53  

So that's it for another episode of Real World Serverless. To access the show notes, please go to realworldserverless.com. If you want to learn how to build production-ready serverless applications, please check out my upcoming courses at productionreadyserverless.com. And I'll see you guys next time.