Real World Serverless with theburningmonk

#23: Serverless at Alma Media with Ari Palo

August 05, 2020 Yan Cui Season 1 Episode 23

You can find Ari Palo on Twitter as @aripalo and his LinkedIn profile here.
Also, check out his new blog at

Here are links to the tools we mentioned in the show:

And the previous episodes we touched on:

NOTE: since we recorded the show, AppSync has added caching support for nested resolvers!

For more stories about real-world use of serverless technologies, please follow us on Twitter as @RealWorldSls and subscribe to this podcast.

Opening theme song:
Cheery Monday by Kevin MacLeod

Yan Cui: 00:13  

Hi, welcome back to another episode of Real World Serverless, a podcast where I speak with real world practitioners and get their stories from the trenches. Today, I'm joined by Ari Palo from Alma Media. Hi, welcome to the show.

Ari Palo: 00:26  

Hi. Thanks for having me.

Yan Cui: 00:28  

So, Ari, we met at the AWS Community Summit in Stockholm a while back. How have you been since then?

Ari Palo: 00:36  

Fine. Thanks for asking. Well, all things serverless, working with those. 

Yan Cui: 00:42  

Yeah, so we've been having some discussions about the podcast. And you guys are doing some pretty amazing things with serverless, I would say. For the audience who are not familiar with Alma Media, can you give us a brief introduction? What do you guys do, and what is your role there?

Ari Palo: 00:57  

Sure. Yeah. Well, I'll start with Alma Media. It's a European media and digital services company, and we basically operate in Finland, Sweden, and Eastern Central Europe. We have three different business units that each have their separate focuses, but generally speaking, we do all sorts of stuff: housing, recruitment, cars, news, entertainment, lifestyle, travel, cooking, dating, business media, professional training, and data services. So that's a lot, and it's both business-to-consumer and business-to-business. We have something like 100-plus websites and apps, but usually they support each other quite well. So that's our business in brief. We have about 280 developers in total across multiple teams, and my role in all of that is that I work in the Alma ICT unit as a solution architect. Our unit basically provides the shared services that are used across the company: APIs, the data platform, analytics, and single sign-on. And because we work quite closely with the many different development teams across Alma, I get to share best practices between them. So in some sense, you could say I'm an in-house developer advocate. That's it in brief.

Yan Cui: 02:30  

Yeah, and I remember you mentioned that some of your websites are among the biggest websites in Finland as well, right?

Ari Palo: 02:37  

Yeah, yeah, definitely they are. I mean, we have tight competition — we are sometimes in second place and sometimes in first place among Finnish websites, in terms of how many weekly visitors there are, and we hold other positions there as well. So in Finland they are really high-usage websites, and we reach almost 80% of the Finnish population. That's quite a big reach here in Finland.

Yan Cui: 03:07  

So for some of these more popular websites, are they fairly high traffic? What sort of peak TPS do you guys see on some of your websites?

Ari Palo: 03:16  

Yeah, I don't have a TPS number at the moment, but I can say that a regular month would be something like 750 million page views, somewhere around that. Of course, there are times when there's some really massive news and there are high peaks, but it's around 750 million page views a month.

Yan Cui: 03:36  

Okay. And I think you mentioned that the back end, some of the APIs, is more and more becoming serverless. At Alma Media, you've been on this really long journey of moving things into serverless. Is that right?

Ari Palo: 03:50  

Yeah, we have. Well, we started with AWS in 2012, and I think our first serverless implementations were in 2016 — the usual ones, like infrastructure glue with Lambda, and some small APIs and things like that. But nowadays the web front ends are implemented with serverless quite a lot. For example, one of the projects we're probably most proud of: a couple of years ago, the Alma Talent business unit decided to rewrite their website front end. They have some pretty big websites — some of the biggest get almost 2 million weekly visitors — and they decided to go with the React.js library. Performance played a key role in the project, and they wanted to use React.js server-side rendering: the server builds up the HTML and sends it to the browser, the browser downloads the HTML, and then the React.js client-side framework kicks in and continues from there. So they did what you'd call universal, or server-side, React.js rendering — but with serverless, with AWS Lambda. That was probably one of the biggest moments, and it's still working, and working great.

Yan Cui: 05:23  

So that's quite interesting, because one of the problems a lot of companies run into when using serverless for user-facing features like websites, especially when it comes to server-side rendering, is that you're going to get cold starts from time to time. Do you guys experience that problem? Or do you have such constant, stable traffic that you don't really see cold starts often anyway?

Ari Palo: 05:47  

I'd say in most cases, and in that previous example for instance, the traffic is somewhat constant most of the time — there's always at least some traffic coming. So cold starts weren't such a big issue. Maybe they were at the start, but at least they aren't anymore. Of course, now you have provisioned capacity and things like that. But at least for those websites, it wasn't a problem. There have been some cases where people have used JVM-based languages like Java or Kotlin to build APIs with AWS Lambda, and there the cold starts actually have been a problem. But with Node.js, Python, and Go — the runtime environments we mostly use — it hasn't really been that huge of a problem.

Yan Cui: 06:40  

I guess with JavaScript, cold starts have gotten a lot better, and they weren't that bad to begin with. Especially in your case, if you have pretty stable traffic that's keeping all the containers warm anyway, it's probably not that big a deal, I imagine. But are you guys doing anything special to optimise the performance of your functions? Are you, you know, minimising the packages using webpack and things like that?

Ari Palo: 07:05  

There might be someone doing that, but I don't think we really use that kind of Lambda optimisation with webpack or similar tools very much. One thing I've personally used is the Lambda power tuning library, which deploys a Step Functions state machine that tests your function against various configurations. I've used that to find a sweet spot for the Lambda runtime — how much memory to give it and things like that. Sometimes we've done a different optimisation. We had, let's call them micro functions — multiple really small functions — where only a few of them get called often. The rest are called rarely, and when they are called again, they hit the cold start issue. So in several projects we've combined them and moved a bit towards what some people might call a monolithic Lambda: a function that contains a bit more logic than just one piece of functionality, with multiple handlers inside. We've done that combination a few times to mitigate the cold start problem.

Yan Cui: 08:35  

Yeah, it's interesting you mention that, because I find that approach only works when you don't have a lot of traffic, because you condense the number of cold starts required across multiple endpoints into just one function that handles them all. But the moment you have a fair amount of traffic spread across different endpoints, you kind of lose the benefit of that condensation. That has been my experience so far.

Ari Palo: 08:57  

Yeah, and we haven't really done that on the larger-traffic sites. Usually the sites or endpoints with the larger traffic have the Lambda function optimised for the actual main use case, let's say. But some projects that have several Lambda functions and not that much traffic have used it, and I don't have any numbers, but I've heard that it has helped them out.

Yan Cui: 09:32  

Okay. And I guess nowadays you have provisioned concurrency anyway, so you don't really need it anymore.

Ari Palo: 09:38  

Yeah. Yeah, that was a good addition from AWS, definitely. I haven't personally used it yet, but I see several use cases for it.

Yan Cui: 09:48  

Yeah, and I'm really glad that you mentioned power tuning as well. A shameless plug here: anyone who is listening and wants to try out power tuning can also try the lumigo-cli, which is a CLI tool that I built. It's got a command that deploys the power tuning step function and runs it against your Lambda function. And you can embed it as part of your CI/CD pipeline, so you can optimise your functions by running different configurations and seeing which is the best memory allocation for your function.

Ari Palo: 10:19  

Yeah, I haven't used that tool myself, but I have to say I've seen it done. And I really recommend everyone try tools like those, because you can't really predict in which kind of configuration your Lambda is going to perform best. So actually test it, and figure it out with those kinds of tools.

Yan Cui: 10:41  

Yeah, before the whole power tuning tool, it was pretty much trial and error. Now it's very much data-driven. So going back to Alma Media, and how you got to using serverless. I know you also mentioned that there are a lot of APIs and stuff like that, but there's also a lot of other things as well — you mentioned Step Functions, and a lot of data pipelines too. Are there any other particular highlights and success stories that you want to share?

Ari Palo: 11:09  

Well, as I mentioned, APIs are the big thing, especially for the team I work with. We did those classic REST APIs with API Gateway, Lambda, DynamoDB, and so on. And once you start seeing them serving traffic of millions, or tens or hundreds of millions of requests a month, with no problems, that's a great feeling. It said to me: okay, serverless is the way to go. Then there are some of the not-so-common use cases. I'm beginning to see more and more companies — you mentioned this — using Step Functions to control things like SageMaker training, how the model is used and where the results are put. I don't think Step Functions per se are that widely used in Alma yet, but I'd say usage is increasing. And generally speaking — I don't know who coined the term, but someone called it "functionless" — you don't really use Lambda functions that much; instead, you integrate different AWS services directly, in some cases. I don't believe 100% that we're not going to write any code anymore or anything like that, but I guess using more and more AWS managed services, where you don't really need to write much code to integrate things, is probably going to increase. And to that point, a shout-out to the other episode with the LEGO guys, Sheen and Nicole. I actually heard about EventBridge at that same Stockholm Community Day where we met — Sheen talked about AWS EventBridge there, and I found it really interesting. We're just beginning to look at it, starting the first implementations, and it's looking good.
And I suspect it will probably change quite a lot how we do internal integrations between different services living in different AWS accounts, because most of the time it's just events being passed along. Why send an HTTP request when you can use a somewhat more robust service that is native to AWS and has IAM policy controls and things like that? So that's probably something we're going to see a lot more of in Alma.
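The cross-account event routing Ari describes relies on EventBridge rules matching events against patterns. A much-simplified version of that matching can be sketched as follows — the real service supports nesting, prefix matching, numeric ranges and more, and the source and detail-type names here are invented:

```javascript
// Greatly simplified EventBridge-style pattern matching: a rule matches
// when, for every field the pattern lists, the event's value is one of
// the allowed values. (Real EventBridge patterns are far richer.)

function matchesPattern(event, pattern) {
  return Object.entries(pattern).every(([field, allowed]) =>
    allowed.includes(event[field]));
}

// Hypothetical rule in a consuming account's event bus.
const crossAccountRule = {
  source: ['alma.news-site'],
  'detail-type': ['ArticlePublished'],
};

// Hypothetical event published by a producer service.
const event = {
  source: 'alma.news-site',
  'detail-type': 'ArticlePublished',
  detail: { articleId: '123' },
};
```

The appeal over plain HTTP calls is visible even in this sketch: producers only describe what happened, and each consuming account declares, via rules and IAM, which events it receives.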

Yan Cui: 13:49  

Yeah, that's definitely something I've seen more and more now as well. Quite a few people have told me this phrase: use Lambda to transform data, not to transport it. Certainly within AWS, you've got lots of services that can directly integrate with other services. API Gateway has got service proxies, Step Functions can integrate with SNS and SQS and DynamoDB, and you also have AppSync nowadays. If all you're doing is CRUD against DynamoDB data, then AppSync can do most of that for you without needing to write custom code in Lambda, which is great. And I definitely think there's more of a trend towards that. Another pattern I'm starting to see more and more: last week I spoke with Gojko Adzic, and he talked about how for a while now he's had this pattern where the front end talks directly to DynamoDB. Once a client has authenticated against a Cognito identity pool and got temporary IAM credentials, it bypasses API Gateway and Lambda altogether — because for him, all his API was doing was authentication, which IAM and DynamoDB already do. So there's no point going through another layer of compute and resources and cost just for authentication.
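The client-direct-to-DynamoDB pattern Yan describes typically hinges on a scoped IAM policy attached to the identity pool's authenticated role. A sketch of such a policy — account ID, region, and table name are placeholders — uses DynamoDB's fine-grained access control so each signed-in user can only touch items whose partition key is their own Cognito identity ID:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:Query"],
      "Resource": "arn:aws:dynamodb:eu-west-1:123456789012:table/UserData",
      "Condition": {
        "ForAllValues:StringEquals": {
          "dynamodb:LeadingKeys": ["${cognito-identity.amazonaws.com:sub}"]
        }
      }
    }
  ]
}
```

This is what makes the pattern safe to consider at all: the "API layer" of authentication and row-level authorisation is expressed entirely in IAM, so there is nothing left for an intermediate Lambda to check.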

Ari Palo: 15:03  

Yeah, that's an interesting use case, and I've seen a few examples around the interwebs about that. I don't think we have done exactly that kind of setup, and we're probably not going to go full-on with clients directly calling, let's say, a database. But if you're already doing IAM authentication there, I'd say it's a good idea to do it directly. Coming back to my previous point, though, I do see us using direct integrations between various services without Lambda — as you said, Lambda should not be used for transport, it should be used for transform. That's something I think we're going to be using more and more, and are already using. But that's an interesting approach, clients directly calling, let's say, DynamoDB.

Yan Cui: 16:04  

It's definitely very different to the traditional thinking of "we need an API in front of everything", which I think is definitely food for thought, and something worth exploring in a lot of different cases.

Ari Palo: 16:15  

I feel that might be something that people who are using the Amplify framework are probably doing a lot more than the rest of us. That's just my assumption — I have no real facts about it. But...

Yan Cui: 16:29  

We’ll see. We’ll see. So you mentioned you've got over 100 websites and 280 developers — that's quite a significant team. How are your teams organised and set up? Do you have a centralised Ops team that manages infrastructure and governance? Or is everyone responsible for their own AWS accounts and for their own websites and systems?

Ari Palo: 16:52  

Yeah, as I said, we have multiple teams here, and we pretty much have quite independent teams — generally speaking, not just in terms of AWS governance. Our default cloud solution is to use AWS, and for web front ends, to consider React.js and that kind of thing, but teams are pretty independent in general. They also have responsibility for their own AWS accounts. But we do have some company-wide governance: we use AWS Organizations, we have a certain way of provisioning accounts, and we use StackSets with some automation on top to distribute certain guardrail-type things company-wide, like GuardDuty, or CloudTrail logging centralised to a specific account. We also have, in each unit, at least a few people who are a lot more comfortable with AWS, have strong experience with it, and come more from the maintenance side. So we do collaborate a lot: the people who do more of the governance side bring ideas — hey, we'd like to implement this kind of thing — and we discuss them together. So we do have some centralised governance, but mostly the teams are responsible for their own accounts. One reason for this — I hate to use the word "legacy", because it always has a negative sound, but legacy in a good sense, because we have long-running businesses. As I said, we started moving to AWS in 2012, so there are AWS accounts that already have quite a long history.
And as I said, we have hundreds of accounts. There's one current way of setting up new accounts, but there are different kinds of setups among the accounts we already have. One thing I'm almost jealous of is people who are tasked with setting up an organisation today — setting it up with all the best practices and all the latest tools, where you can use Control Tower, Landing Zone, and every tool out of the box from AWS. Since we have a long history, we had to come up with our own solutions to various things, and that reflects in some places where we can't, let's say, use some of the latest stuff on the AWS Organizations side. But we use what we can, and teams are quite responsible for their own accounts.

Yan Cui: 19:54  

Yeah, I always prefer autonomy too — having autonomous teams, plus some centralised teams, like SRE or security, in a pull model, where teams can ask for help and guidance and work together. As opposed to a gatekeeper saying "oh, you can't deploy unless we say so" — that kind of model just never works, in my experience, in a large company.

Ari Palo: 20:17  

Yeah, that's definitely something we don't want. We think the teams should be — I guess you could say they are DevOps teams — really autonomous, in the sense that they are the ones who develop the features, they are the ones who have to think about how they put the infrastructure together, and they are the ones who monitor and operate the service, because they have the best knowledge of how to do it. If you start separating that — if you have a separate team that manages deployment and production and things like that — it's usually a barrier to development speed: your agility drops and you can't push new features fast enough. So I too definitely prefer autonomous teams, but there should always be some centralised support on the governance side of things and the security aspects, as you said.

Yan Cui: 21:21  

So switching gears slightly: you are also a heavy user of AppSync, which is something I've been getting into myself quite a bit over the last couple of months, and it's really been growing on me. What has been your general experience with AppSync so far? What has worked well for you, and what hasn't?

Ari Palo: 21:41  

Well, when getting started — I think we've been using AppSync for a couple of years now. We made a move to GraphQL — not a full switch, we still do a lot of RESTful APIs — but especially in my team, we considered what kind of APIs we were going to build in the near future, saw that they were probably quite data-heavy, and saw the value of adding GraphQL into the mix. Then we started looking at things, and building a GraphQL server yourself, even using the best possible libraries, felt, at least at the time, like unnecessary work. Taking on AppSync has its own quirks, of course, but you don't have to think that much about things like validation logic, or how you configure resolvers in your code, because it manages that for you: you point out that this field or this query uses this function — a bit simplified, of course, but that's the essence. So that's why we got started with it, and why we keep using it and keep developing new features with it. Some people have described it as the API Gateway of the GraphQL world, and it's a serverless product. The biggest benefit of serverless, I think, is that once you set it up, it usually keeps on working. And it has kept on working: nowadays the biggest GraphQL API we have is serving around 150 million monthly transactions. It has worked fine, and we haven't had any issues with it.
Even though at the time we started using it there were some people saying it's slow or has huge latencies, we haven't noticed that. Of course, I have to say we've only used relatively simple things with AppSync. We haven't used pipeline resolvers or anything complex like that — we usually have a field or a query that is assigned to a certain resolver, which is a Lambda function, and we quite often write those Lambda functions in Go. And it has worked just fine. I think the biggest pain point is the lack of local development, or mocking AppSync locally. I know there are some tools in the Amplify community where you can mock an AppSync environment, but then you have to have the whole project done with Amplify, and we haven't used Amplify, so that's not really a solution. There might be some hacks around it, but at least I haven't figured them out yet. So local development is a bit challenging: you have to think about what the event looks like, and then send that event against your Lambda function locally if you want to test it. That's probably the biggest challenge I've seen with AppSync so far.

Yan Cui: 25:29  

Yeah, that's pretty consistent with my personal experience as well. As for testing, I do the same as you. I don't have that many Lambda functions — most of the time it's just a simple resolver that does a DynamoDB GetItem or PutItem or Query. Those kind of just work: you can take the template, change the parameters, and they work. They don't really go wrong very much. So for those, I mostly just have end-to-end tests. After I deploy, I hit the real GraphQL endpoint and work through the user story — the registration process and all of that — and check that it's working end to end. It's my Lambda functions where I tend to have problems, so for those I have local tests. I invoke each function against the real third-party services I'm talking to — could be Cognito, could be Auth0, could be whatever service I'm using — and those are tested locally before I deploy. Otherwise, I just hit the real thing once it's deployed, and that's most of my testing for GraphQL. That's been fine; it's not something that really gets in the way of my work. Like I said, the problems tend to be in my code, not in AppSync.

Ari Palo: 26:45  

Yeah, one more thing — I can't say it's necessarily a negative thing about AppSync, but something I've been constantly thinking about, because I compare AppSync to API Gateway quite a lot, which is a fair comparison: one is for the RESTful world and one is for GraphQL. I've had this feeling a few times that AppSync is lagging behind API Gateway on features and feature development. Several times I've thought that some use case could have been solved better if I'd had API Gateway in front of AppSync. And yeah, that's a valid thing to do, but then you get into a situation where, if you have a high-traffic website — and both of those are priced per transaction, at around $4 per million requests — you're doubling your transaction price just to get, let's say, one or two features from API Gateway.

Yan Cui: 27:57  

Yeah, sure. I guess there's no custom Lambda authorizer — that's probably one thing I miss with AppSync, which is useful when you're not using Cognito for authentication. But other than that, there are a lot of things AppSync does for me. Speaking from personal experience on a very recent project: I started off with API Gateway, and I found myself spending a whole week implementing group-based authorisation with Cognito and API Gateway, which I get out of the box with AppSync. Then I spent a couple of days working out how to do automated documentation generation and the request models. There's no response validation, which I have to do in my code — again, out of the box with AppSync. So there's a lot I find I don't have to do when I'm working with AppSync. Equally, there are features that I'm still missing. One thing that's really missing is in AppSync's caching: it doesn't support nested resolvers very well. In fact, it just doesn't work, because you can't use the source object in the cache keys. But from talking to Ed Lima on Twitter, he says that's something they are actively working on. So I know it's coming, and when it does, it'll be another big win for us.

Ari Palo: 29:19  

Yeah. Sorry to interrupt, but it's good that you mentioned AppSync caching. Even though it's limited, it's still a great feature to have. Caching was actually one of my biggest complaints before they released it — I felt it was a really obvious thing missing from AppSync. Now we're lucky to have it, we're using it in a couple of projects, and in those cases we're satisfied with how well it works.

Yan Cui: 29:51  

Yeah, in my previous job at DAZN, we also vetted AppSync, I think about 12 months ago, for another very high-throughput scenario where we were looking at AppSync subscriptions, and at AppSync for some basic queries. The lack of caching was a big one for us, because this was a sports streaming app that had one-point-something million concurrent users at peak. With everyone polling every couple of seconds, that's going to be expensive. So we needed caching, right?

Ari Palo: 30:21  

Yeah. And it's also a big philosophical thing, let's say, to have caching in AppSync, because caching was always the point everybody wanted to bring up if you said "okay, we're going to use GraphQL". They always had the misconception that you can't cache GraphQL requests and responses, even though there are multiple places you can cache them. Having a cache in AppSync as well was sort of the final nail — not the final nail in the coffin, that's a really bad expression there — but the final point that lets you say: yes, you can cache GraphQL.

Yan Cui: 31:03  

Yep. And the work they did on scaling the subscriptions was amazing as well, because before they had the native WebSockets, there was a limit of about 10,000 connected devices, which for us was way, way below what we needed. Now they can scale that to tens of millions of connected devices. They've done some amazing work on AppSync in the last 12 months, I have to say, and it has been a real joy to work with, for me personally, on a couple of projects. So, we've talked a lot about the different things you guys are doing. Overall, what would you say are some of the biggest benefits you've found with serverless? You started your journey on AWS in 2012, I guess using EC2, and at some point you decided to switch to serverless. What has been the biggest benefit to Alma Media as an organisation from that switch?

Ari Palo: 31:56  

Yeah. And I have to add that I'd say we're doing serverless first, because we are not 100% serverless. There is, for example, one team that has really long experience with Beanstalk, and they use Beanstalk to power their web applications and front ends — because why change, if it's working for you and you have good knowledge of it? So we do still have all sorts of setups. But quite a lot of us are now thinking in a serverless-first manner: at the beginning of a project, the default solution should be serverless and managed services, and only if there are some really special requirements — and there aren't many of those these days — should we look into EC2-based workloads. The biggest benefits are, of course, the obvious ones. Development speed: using serverless doesn't increase your development speed out of the box, but we've seen that because you don't have to look after your EC2 instances, figure out how you scale them, how you monitor and how you patch them, the time saved is time that can be put into developing new stuff. That's one aspect. Then there's also the total cost of ownership — TCO, as they say. Okay, when you start looking at serverless, it might not actually seem super cheap; it depends on your scale, and because some things are priced by transaction volume, it might look like it could become expensive at some point.
But in those situations you tend to forget that even with EC2-based workloads, as you grow and grow, things still get more expensive, because again you have to think about how you scale things, how you operate, how you monitor, how you patch, and so on. So that's one part of it. But then you also get different environments basically for free. In traditional setups you often have maybe a development environment and a production environment — just a few environments — but with serverless you can have feature environments that your CI creates from, let's say, a Git branch. We use that: if there's a feature/ prefix in the Git branch, the CI will actually create a totally new environment. Not in every project, of course, but in some projects. And because you pay by the transaction, those development and feature environments — and there can be multiple of them — are basically free. So that's a nice thing. Then I always say the best part of serverless, as I've probably said already at the beginning of this podcast, is that it usually keeps on working. Because Lambda functions are really short-lived, there usually isn't any state creep, or bugs that appear over time, because the containers don't really live that long. So usually the problems with the services we've built in a serverless manner are problems that arise almost directly after deployment — there has been some coding error that our tests and QA or something like that didn't catch.
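The branch-to-environment mapping Ari describes can be sketched roughly like this — a minimal, hypothetical example of how a CI step might derive an environment name from a Git branch (the naming scheme and prefixes here are illustrative, not Alma Media's actual convention):

```python
import re

def environment_name(branch: str) -> str:
    """Derive a deployment environment name from a Git branch name.

    Branches with a 'feature/' prefix get their own short-lived
    environment; the main branch maps to production, everything
    else to a shared development environment.
    """
    if branch == "master":
        return "production"
    match = re.match(r"^feature/(.+)$", branch)
    if match:
        # Sanitise the branch name so it is safe to embed in a
        # stack/resource name (alphanumerics and hyphens only).
        slug = re.sub(r"[^a-zA-Z0-9-]", "-", match.group(1)).strip("-")
        return f"feature-{slug.lower()}"
    return "development"
```

A CI pipeline would then pass this name into the deployment tool (for example as a stack name suffix), so each feature branch gets an isolated environment that is torn down when the branch is merged.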
And we only notice it after we deploy. I always prefer errors like that — where the deployment basically breaks something — over the ones that happen suddenly some time later, because those usually land on weekends and holidays. So that's why I like the serverless model: once you set it up, it usually keeps on working. And then I'd also say one thing, and I actually have a good example of this, is that serverless makes the blast radius, as we call it, smaller. If you have some kind of problem, it doesn't necessarily mean the whole system crashes; often it's just that one request or a few requests don't work, they error in some way. A few years back we had this service that ran on EC2, I think in Beanstalk, and it basically allowed our front ends to show something like a weather update. If a user gave their location, it would resolve a city, a country, a state, and things like that from it. We used a third-party service there, but we did some data processing in between. And July is quite a popular vacation month here in Finland — some would say the whole country is closed during July. People tend to go to the countryside, to the coast, to islands — places that don't necessarily have an exact address. So people were using our services that had this geolocation service in the background for weather and things like that. And there was a coding error in the system.
And if the third-party service didn't return a valid address, the system failed totally — and because there were enough of those failing requests, the whole service went down. The learning from that was: okay, that was a coding error on our end and should not have happened in the first place. Yes, but coding errors do happen. If it had been a serverless implementation, only the requests that didn't have a valid address would have been affected. So that's what we took away from it: sometimes serverless can limit the blast radius too. That's one benefit of serverless that doesn't get talked about that much. But then there are the other benefits: it just keeps working, the total cost of ownership, and we feel it allows us to develop stuff faster. Those are the points.

Yan Cui: 39:09  

Yeah, that built-in isolation is, like I said, a bit of an underhyped feature of serverless: if something goes wrong, it just affects one function or one container running your code, and it automatically recovers from that failure with the next request that comes in anyway. So what about things that didn't work quite so well? What are some of the biggest pitfalls or mistakes that you wish you had known about when you started?

Ari Palo: 39:38  

Yeah, I guess it all comes down to the fact that when you're using a serverless mentality in your product development, it all boils down to good planning. And that can sometimes be hard, especially when you're starting. You can't always know all the dependencies — for example, cost-wise there can be some, not hidden costs, but costs that are hard to anticipate if you're inexperienced in serverless. We've been affected by that a couple of times, and I can give two concrete examples. The first one is about CloudWatch, and actually the second one also relates to CloudWatch. The first CloudWatch problem was that we had a Lambda function that was called quite often, and it used the CloudWatch metrics API to publish a metric — so it called the CloudWatch PutMetricData operation. That's actually a somewhat costly operation, and because we were talking about, I don't remember, tens or hundreds of millions of requests per month, and we did that PutMetricData operation on each request, it ended up costing almost 60 percent of our monthly bill, just because of that one operation. So you have to think about it: in serverless, almost everything is priced per transaction, so you have to think about the dependencies of those transactions — if you're calling a Lambda X times, what is the Lambda itself calling? You have to take that into account. The second example, where the culprits were again Lambda and CloudWatch, was that we had a somewhat popular API, with Lambda@Edge functions attached to the CloudFront distribution. Those functions did some validation of the data before sending it to the back end, because the back-end operation was a bit more expensive.
So we thought, okay, let's do it on the edge. And at some point we noticed: hey, why is CloudWatch billing us so much again — what's causing it? We dug through it and found out that we had left debug logging on in the Lambda@Edge functions. So every time the Lambda@Edge function was invoked, it logged the event payload, which can be somewhat massive — especially a CloudFront event — and other things as well. That ended up causing quite a high CloudWatch bill. So always check what you are logging, and use a logging framework or logging library. I always say use a logging library, because it's then easier to define that, okay, if the environment is production, only log errors and maybe warnings or something like that, and in other environments — staging, development, feature branches, or what have you — you can do all the debug logging. And we actually have a little trick there: in production, in some APIs — whether it's a GraphQL API or REST — you can put in a header, a debug flag, so you can get full logs even for a production request, for a single request, without having to log everything at a larger scale. So it's about being aware of these — again, I don't want to say hidden costs, because they are documented on the AWS pages, but they are sometimes quite hard to understand, especially when you're coming to serverless for the first time. Those, I think, have been the challenges.
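The environment-based log level and per-request debug flag Ari describes could be sketched like this — a minimal, hypothetical example using Python's standard logging module (the `ENVIRONMENT` variable and `x-debug` header name are made up for illustration, not Alma Media's actual implementation):

```python
import logging
import os

def resolve_log_level(headers: dict) -> int:
    """Pick a log level for a single invocation.

    In production only warnings and errors are logged, unless the
    caller explicitly sets a debug header for that one request;
    every other environment logs at debug level.
    """
    env = os.environ.get("ENVIRONMENT", "development")
    if env != "production":
        return logging.DEBUG
    if headers.get("x-debug") == "true":
        # Full logs for this single production request only.
        return logging.DEBUG
    return logging.WARNING

# At the start of a handler, configure the logger per request:
logger = logging.getLogger("handler")
logger.setLevel(resolve_log_level({}))
```

The key point is that the decision is made per invocation from the incoming request, so a single production call can be fully traced without paying CloudWatch Logs ingestion costs for every request.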

Yan Cui: 44:06  

Yeah. The CloudWatch cost thing is something that a lot of people definitely get hit by. I've seen people spend 10 times more on CloudWatch Logs alone compared to what they pay for Lambda and API Gateway, because of debug logging getting forgotten and things like that. So nowadays what I typically do is use some libraries I've developed where, in production, you would be logging at error or info or warning or whatever, but we sample the debug logs at, say, 1% of invocations, so you always still have some portion of your debug logs in case something happens, and hopefully they help you. And that works with correlation IDs, so that if you've got a whole call chain — one function calling other functions through SNS, SQS or whatever — that decision gets passed along, and every function in the call chain will respect the decision the first function made about whether to enable debug logging for this invocation. That really helps a lot in terms of keeping your costs in check while still giving you really good visibility. And I've got another horror story about CloudWatch metrics. You know how CloudWatch metrics counts every distinct combination of dimensions as a unique metric, and that's like 20 cents per month per metric? I heard of one company that was, by mistake, including the request ID as a dimension. So every request was a unique metric. That was so bad. Yeah, the bill for that company was horrendous.
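The sampling-plus-propagation pattern Yan describes can be sketched as follows — a hypothetical illustration, not the actual library code; the `debug-log-enabled` key name is invented for the example:

```python
import random

SAMPLE_RATE = 0.01  # enable debug logs for roughly 1% of invocations

def debug_decision(incoming_correlation=None):
    """Decide whether this invocation should emit debug logs.

    If an upstream function already made the decision (carried along
    with the correlation IDs through SNS/SQS/HTTP), respect it so the
    whole call chain logs consistently; otherwise sample at the edge.
    """
    if incoming_correlation and "debug-log-enabled" in incoming_correlation:
        # Downstream function: honour the upstream decision.
        return dict(incoming_correlation)
    # First function in the chain: make the sampling decision once.
    enabled = random.random() < SAMPLE_RATE
    return {"debug-log-enabled": enabled}
```

Because the decision is made once at the start of the chain and then forwarded, you either get a complete debug trace across every function involved in a sampled request, or none at all — which is far more useful than each function sampling independently.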

Ari Palo: 45:43  

Those are really... it's a real shame, I feel bad for that kind of thing. Because, again, mistakes happen — we are all humans. And I can say we have had a case where we had a Lambda function triggered from S3 that then updated the same object. That wasn't cheap either, I can say.
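A common way to break the write → event → write loop Ari describes is to make the function write its output somewhere the trigger doesn't cover, and guard on the key before processing. A minimal sketch — the `processed/` prefix is just an example, not anything from the show:

```python
def should_process(key: str, output_prefix: str = "processed/") -> bool:
    """Guard against an S3-triggered Lambda re-triggering itself.

    If the function writes its results under output_prefix, any event
    for a key under that prefix is the function's own output and must
    be ignored — otherwise each write fires a new event, which fires
    a new write, and so on, indefinitely.
    """
    return not key.startswith(output_prefix)
```

In practice it's better to also scope the S3 event notification itself to an input prefix (so the output prefix never generates events at all), and keep a guard like this as a second line of defence.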

Yan Cui: 46:02  

Oh recursion, infinite recursion.

Ari Palo: 46:03 

Yeah, exactly.

Yan Cui: 46:06  

Yeah, I’ve seen those.

Ari Palo: 46:08  

Yeah. But that's one thing you should really do — that was something that could have been caught with proper code review. So, especially if you're, say, a company that's switching to serverless and your team is still figuring it out, it's a good idea to always do good code reviews. The main point isn't that your teammates screw up or something — that's not the point. The point is to learn from others, and occasionally you might actually spot a mistake, because we're all human, we make mistakes. So that is something I always tell people working with serverless who are new to it: even though in some sense serverless is super easy, it can also be super complex. So when you're learning, don't do it alone — use your teammates and do it together. And there are a lot of articles and resources out there, so read and learn, and don't just jump right into it. I, for example — personally, I can't really wait to get started, so I often start by doing, make a mistake, and then think, okay, I should have probably read the docs properly, this is totally explained there.

Yan Cui: 47:43  

Yeah, Joe Emison, who I actually interviewed on this podcast — he was one of the first people I interviewed — once said that all these articles you read on Medium about serverless lessons learned, where someone made some bad mistake, can pretty much be summarised as: someone spent two days researching and two weeks working, when it would have been a lot better to spend two weeks researching and then two days working.

Ari Palo: 48:07  

Yeah, exactly so.

Yan Cui: 48:11  

Yeah. And for that recursion thing, I actually ran into that at my previous company, DAZN. What we ended up doing was, as part of our open-source suite of tools called DAZN Lambda Powertools, we built a middleware that passes along a count of the number of invocations in the call chain, as one of the correlation IDs, and increments it automatically. So if a call chain ever gets longer than the default limit we set — which I think is ten or something — we just stop the invocation and tell you there is a potential infinite recursion. That's how we nowadays try to catch those accidental infinite recursions early, so that you don't find out only once you have, you know, 5 billion invocations for the month.
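The call-depth guard Yan describes could look roughly like this — a sketch of the idea, not the actual DAZN Lambda Powertools code; the key name and exception are invented for illustration:

```python
MAX_CALL_DEPTH = 10  # abort chains longer than this

class PotentialInfiniteRecursion(Exception):
    """Raised when a call chain grows suspiciously long."""

def check_call_depth(correlation_ids: dict) -> dict:
    """Increment a call-depth counter carried with the correlation IDs.

    Each function in the chain calls this on the incoming IDs and
    forwards the result downstream (via SNS, SQS, HTTP headers, ...).
    If the chain exceeds the limit, the invocation is stopped instead
    of silently fanning out forever.
    """
    depth = int(correlation_ids.get("call-chain-length", 0)) + 1
    if depth > MAX_CALL_DEPTH:
        raise PotentialInfiniteRecursion(
            f"call chain length {depth} exceeds limit {MAX_CALL_DEPTH}"
        )
    return {**correlation_ids, "call-chain-length": depth}
```

Run as middleware at the start of every handler, this turns a runaway recursion into a loud error after ten hops instead of a monstrous bill at the end of the month.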

Ari Palo: 49:00  

Yeah, yeah, that's useful. We'll definitely have to check that out.

Yan Cui: 49:04  

So I think that's everything I wanted to cover for this episode. Is there anything — any personal projects, any books you're working on — that you'd like to tell us about?

Ari Palo: 49:15  

Not at the moment, actually. As I said, summer is coming and my mind is almost on the summer vacation already. But I've been blogging in previous years, and I've actually started thinking about it again — I already have a few blog posts waiting. The only thing preventing me from publishing is that I first need to develop my own website, because I'm a purist in that sense: I don't want to publish anywhere else. So that's been the reason. Maybe after the holidays I will be posting more about serverless and AWS and all sorts of things related to development. It will probably have an address, but there's really nothing there at the moment yet, so don't go looking. But follow me, and I'll probably say something once I have some content online.

Yan Cui: 50:20  

Okay. And how can people find you on social media?

Ari Palo: 50:24  

Yeah, I'm basically everywhere as first name last name — so Ari Palo, basically everywhere. On Twitter I share stuff related to this, and also feel free to connect with me on LinkedIn if you are interested in serverless, infrastructure as code, and things like that. You can reach me on those platforms, and I'll probably be speaking at — well, the situation is what it is, but at least maybe some online conferences later in the year or so.

Yan Cui: 50:53  

Okay, sounds good. I'll make sure to put those links in the show notes when you're ready to share your brand new blog with us. It's been a pleasure having you today. Take care, and start thinking about your vacation — hopefully, when things calm down a little bit, we can all go and travel.

Ari Palo: 51:15  

Yeah, yeah, exactly. And Yan, it's been a pleasure. Thanks for the invite — I always like to talk with other AWS and serverless enthusiasts.

Yan Cui: 51:24  

Cool. Take it easy, man.

Ari Palo: 51:26  

Yeah, you too. 

Yan Cui: 51:27

Bye Bye. 

Ari Palo: 51:27

Thanks, bye.

Yan Cui: 51:41 

That's it for another episode of Real World Serverless. To access the show notes and the transcript, please go to And I'll see you guys next time.