Real World Serverless with theburningmonk

#36: Serverless at New10 with Ricardo Torres

November 04, 2020 Yan Cui Season 1 Episode 36

You can find Ricardo on LinkedIn here.

Here are links to what we discussed in the show:

To learn how to build production-ready Serverless applications, go to productionreadyserverless.com.

For more stories about real-world use of serverless technologies, please follow us on Twitter as @RealWorldSls and subscribe to this podcast.


Opening theme song:
Cheery Monday by Kevin MacLeod
Link: https://incompetech.filmmusic.io/song/3495-cheery-monday
License: http://creativecommons.org/licenses/by/4.0

Yan Cui: 00:13  

Hi, welcome back to another episode of Real World Serverless, a podcast where I speak with real world practitioners and get their stories from the trenches. Today, I'm joined by Ricardo Torres from New10. Hey, man, welcome to the show.


Ricardo Torres: 00:27  

Hey, man, thanks for inviting me. It's so good to be here.


Yan Cui: 00:31  

So before we get into how New10 is using serverless, can you just give us a quick introduction to who New10 is, and your experience at New10 so far?


Ricardo Torres: 00:42  

Yeah, sure. So New10 is a startup initiated by the bank ABN AMRO in 2017. In fact, we actually just celebrated three years since go-live. What we do is provide loans to small and medium enterprises, and also independent contractors, in the Netherlands, and these range from 5,000 to 250,000 euros. This year we also released Corona relief products to the market to aid our clients in these really hard times we are all facing. And for this we use a lot of serverless tooling. But basically, that's what New10 does. As for my personal experience, I come from a background of API development. I worked before for a company called Mycujoo, where we were doing football live streaming, and I was responsible for the API development there. And that was also how I started at New10. Basically, we wanted to create all these APIs for our microservices, because in the beginning we had some monolithic applications that were basically for the MVP, and we just wanted to make the shift to serverless and to APIs with microservices. So that's basically the experience I have nowadays at New10. Recently, I also became a tech evangelist, so my role changed a bit. I still do software engineering, but I'm quite focused on evangelism internally at New10 as well.


Yan Cui: 02:17  

Okay, so in this case, how did you guys settle on using serverless technologies in the first place? I guess finance is not known for being adventurous when it comes to technology choices. So how did New10 settle on using serverless?


Ricardo Torres: 02:35  

Yeah, that's an interesting story, actually. I think the decision was made really in the early days, back in 2017. Basically, we had really high pressure to deliver, a strict budget, and we also had a strong belief in DevOps. This combination of factors made it easier for us to choose serverless, right? Because with serverless, we managed to set up the whole platform in less than nine months, which was really a great achievement. For us to launch a product from zero to live in nine months was quite good. And we didn't have to spend time managing the resources that serverless now makes obsolete, I mean, the resources do exist, we just don't manage them directly. So that's, I think, the real added benefit of the serverless part. Even for personal projects nowadays, this is my go-to stack, because it's such an amazing thing that, as developers, we don't need to take care of a lot of stuff. I think that reduces a lot of the burden we usually have if we choose a different path. And it's common to see nowadays a lot of companies spend a lot of time setting up Kubernetes and the lower-level parts of the infra, where for the MVP they could just go straight to serverless, have something that works for the initial state, and evolve from there. So it's quite interesting, what we did, and so far it still works quite well. We've evolved a lot since we started, and we're more and more into serverless; nowadays, I think the amount of serverless we have just keeps growing.


Yan Cui: 04:19  

Okay, so in that case, can you give us a bird's eye view of your architecture? Sounds like you've got a lot of APIs, a lot of microservices. But I guess you must be doing a lot of background data processing as well, maybe some machine learning stuff. Give us a high-level overview of what your architecture looks like and how the different pieces fit together.


Ricardo Torres: 04:41  

Mm hmm. Yeah, so I'm going to go from, let's say, bottom to top on this one. Basically, what we have nowadays is around 158 API endpoints, around 460 Lambdas running in production, and 35 Fargate containers. So not everything is serverless, we have quite some stuff on containers as well, but still, the serverless part, especially the Lambdas, makes up I think two thirds of the services we have. Then we have one main general-purpose SNS topic, and we have around 60 SQS queues for internal eventing, and on there we have around 350,000 events per month. The persistence layer is backed by either DynamoDB, S3 or RDS. On the grouping part, all these resources are grouped into 45 client-facing microservices, which encapsulate and abstract the logic and data. We also believe a lot in single responsibility for these services, which means that for a serverless service, each Lambda is responsible for a given endpoint and method. So we don't have a single Lambda dealing with three or five different HTTP methods; if I have a POST on a given endpoint, that's the responsibility of a single Lambda. There are some interesting pros and cons to this one, but that's the approach we took. Everything is also grouped into five business domains, and the service-to-service communication goes through, like I mentioned, the SNS and SQS part, and also HTTP; we have a lot of service-to-service communication going internally through our own APIs. On the application landscape, everything is managed on the AWS infra, I mean, for the back end part, and we have the front end and GraphQL also powered by AWS. The GraphQL layer is actually running on a container, so that's not serverless yet. We also have integration with Salesforce, we have Google Cloud Platform for marketing purposes, and we also use BigQuery there. And we have quite some third-party vendors, because, for instance, we believe a lot in cloud native. So we have vendors for signing documents, for verifying identities, basically to tell if a person is who they say they are based on personal documents like a passport, etc. And we also have a powerful vendor that basically provides us with the ledger for balance, interest and transaction management of our loans. So I think that's the big overview of our stack at the moment.
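For illustration only, here is a minimal sketch of the "one Lambda per endpoint and method" layout Ricardo describes, written as a Serverless Framework serverless.js config. The service name, handlers and paths are hypothetical, not New10's actual setup.

```js
// serverless.js -- minimal sketch of one function per endpoint and method
// (service name, handlers and paths are hypothetical)
module.exports = {
  service: 'loan-applications',
  provider: { name: 'aws', runtime: 'nodejs14.x' },
  functions: {
    // POST /applications is owned by exactly one Lambda...
    createApplication: {
      handler: 'src/handlers/create-application.handler',
      events: [{ http: { path: 'applications', method: 'post' } }],
    },
    // ...and GET /applications/{id} by another, rather than one Lambda routing both.
    getApplication: {
      handler: 'src/handlers/get-application.handler',
      events: [{ http: { path: 'applications/{id}', method: 'get' } }],
    },
  },
};
```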


Yan Cui: 07:45  

Okay, that's great. And certainly there's a lot that I find interesting, so I'm going to ask you a bit more about those. But one thing you mentioned, which I found interesting, is when you were talking about believing in cloud native, and therefore using a lot of vendors to do all these other things so you don't have to do them yourself. That is different to, I guess, the popular definition of cloud native, which is to run everything in containers so that you are really portable between clouds, which I find, honestly, ridiculous, because how is something native to the cloud characterised by its portability, or by being able to run in your own data centres? So how do you guys define cloud native?


Ricardo Torres: 08:28  

Basically, what we... maybe I used the wrong term. But what I mean is that we try to outsource resources, especially resources that we would otherwise have to manage ourselves. Imagine that, for a startup as small as us, if we had to implement this identity verification process ourselves, it would be a really big project, right? It would involve a lot of people. To check the veracity of a passport or any other personal document, you need really specialised people, which at the time we didn't have, and we still don't. So that's why we outsource it. So maybe the term is not really cloud native, but whenever we can use a vendor to provide the features or the tooling we need, we're going to rely on them. That's more the definition I was looking for.


Yan Cui: 09:25  

Right, so what you're describing is this mindset that we'd probably call serverless first, or being serviceful, whereby, as much as possible, we want to consume services that do something for us, so that we can focus on the business-differentiating things that our customers actually want from us, rather than just managing infrastructure and all of that, right?


Ricardo Torres: 09:50  

Yeah, exactly. Yes, exactly. Right on. 


Yan Cui: 09:53  

Gotcha. So you're running containers and Lambdas side by side. How do you decide when to use Lambda and when to use containers?


Ricardo Torres: 10:04  

So yeah, we have some limitations, right? Sometimes we choose containers to work around limitations from our cloud provider, in this case AWS. For example, we have services that handle a lot of file uploads, and at the time we decided not to use S3 for those, which meant the file actually needed to go through the service first before reaching any other part of the back end or the persistence layer. That meant we couldn't use API Gateway and Lambda, right? Because of the payload and timeout limits of that integration, which are 6 megabytes and 29 seconds, and that's of course not enough for uploading files. In other cases, we did try to start with Lambda. We had an interesting case where we had a Python service that was backed by RDS, and at the time Lambdas in VPCs were really slow, right? That's something you know quite well. In some cases we were seeing cold starts of up to 10 seconds, and even though we kept most of the Lambdas warm, there were still cases that hit that 10-second cold start. And that's really not something we can have on our customer-facing services.


Yan Cui: 11:18  

That shouldn't be the case anymore since they redid the whole networking layer for Lambda and Fargate. Because of Firecracker, they were able to basically reimagine the whole networking layer around VPCs, so now you shouldn't be seeing those horrific 10-second cold starts.


Ricardo Torres: 11:38  

Exactly. Yeah, exactly. So nowadays we have a few more Lambdas, sometimes not really talking to RDS, but talking to, let's say, some service that must be running in a VPC. So yeah, in those cases we have some Lambdas in a VPC, and they are way faster than they used to be at the time. But we had to work with what we had two years ago, so basically that service had to make the switch; it's running on Fargate with ECS because we just couldn't make it work with those slow cold starts. Also, most of our Java services run on containers. We have a lot of Java knowledge internally and quite some services written in Java, and even though we have some Java Lambdas, most of them are not really customer-facing, just because the cold starts for that runtime are still not there if you compare it to other languages. With JavaScript and Node.js internally, in some services we're reaching cold starts of 800 milliseconds. That's quite amazing, right? And quite acceptable from an SLA point of view.


Yan Cui: 12:56  

Gotcha. Yeah, with Java, the cold starts are still not quite at the same level as with Node or Python or Golang. So in this case, how many teams do you have working on all these 150-plus APIs?


Ricardo Torres: 13:12  

We have, in total, around 10 teams working on these APIs. We have six stream teams, which are directly aligned with our business domains, and we also have the enabling teams. That's what we call, for example, the tooling team, which is the team I'm part of right now. We have a platform team that takes care of the whole platform, the AWS accounts and organisations, and provides quite some infra guidelines and best practices for us developers. We have an automation team and also one data team. So in total we are talking about around 35 engineers, spread across the back end, the front end, Salesforce engineers, the automation part like I mentioned, the cloud and the data.


Yan Cui: 13:56  

Okay, so you have the enabling teams, or enablement teams, and the infra teams. How do you go about organising your code into repositories? Are you doing micro-repos, where you have one repo per microservice? Or are you doing a mono-repo and building some tooling around knowing which services to deploy when they've changed?


Ricardo Torres: 14:22  

We are using micro-repos, so each service has a dedicated repository. We also use mono-repos, but not for the services themselves; we use them for libraries. For instance, we have a lot of NPM packages for Node, and those live in a single repository. We have some other internal tooling, like the framework for Lambdas we had to develop: it's made of several packages and they all live in the same repository, managed by Lerna, actually. So we don't have a lot of experience there. We kind of like the way we do the mono-repos, but I think we just need to experiment a bit more and explore different tooling, because at the moment Lerna is not being the best tool for the job, in our case at least.


Yan Cui: 15:12  

Yeah, Lerna is pretty powerful, but it can also be quite opinionated and fixed in how it does things. I do know it's got an ecosystem of plugins that you can use as well, but I haven't explored them much myself. So in this case, all these different teams are using Lambda and also using containers to build things. Are you doing anything to ensure consistency? Because it sounds like you've got the shared infra team that's responsible for setting up the AWS environments, and I imagine there are guidelines around best practices and security things you guys should be doing. How do you go about making sure that everyone is doing the right thing, and propagating some of those best practices?


Ricardo Torres: 16:01  

So I think the key word here is trust. We really trust our engineers to do the right thing, and we provide the tooling for that to happen. Basically, we give trust and freedom to innovate, but we verify using automated tooling. For instance, we rely a lot on AWS Config to tell us if a resource has been tagged, or if a custom authoriser was attached to a given API Gateway and Lambda. These kinds of things we always verify. Our long-term vision is to make it easy to do the right thing and quite hard to do the wrong thing. But we still allow engineers, like I mentioned, to create different stuff, right? You don't always need to use the tooling. For instance, I had a case where I was working on a service that was really integrated with Cognito, and at the time we didn't have a Terraform module to provide what I needed. So I had to create my own module according to the specs for that service, and basically that service doesn't use any shared module, because it's not something that's shared; it's the only service relying on Cognito, that's the kind of authentication part we have. So there we defined everything from scratch. Of course, the cloud engineers will help us make sure that we are not setting rules that shouldn't be there. A good example is when you have to deploy an S3 bucket, right? There are several ways of doing that: engineers can use the Terraform module we provide, which comes baked with encryption and all the policies we define as a company, meaning the bucket will be private by default; but it's also possible to create your own Terraform module, or to deploy using the Serverless Framework, which is something we also use. By doing that, there is room for you to make a mistake, right? You can deploy an S3 bucket without encryption, or missing some rules, or maybe it's public, which is not intended. What's going to happen is that we're going to be alerted that this was done, and we're going to get in touch with the responsible engineers to understand whether it was a mistake or on purpose. But I think it all revolves around the tooling we provide. Like I mentioned, we have the Terraform modules and serverless plugins. For logging, for instance, we have quite some logging standards, and all the services must comply with them, so we provide libraries for logging, and those libraries have obfuscation built in to prevent sensitive information being leaked. We also have the Lambda framework, which is really based on Koa.js; it's a middleware framework that also relates to Middy.js, the only difference being that this one was built with our use case in mind. So it's really powerful, really small, and provides us with most of the things we do: we have middlewares for SQS, for SNS, for HTTP, for DynamoDB, for S3, a lot of these things. So whenever engineers are starting a new service, in this case a JavaScript or TypeScript service, they can rely on this tooling to make sure they are on the right path. But of course, mistakes can still be made, and we always try to verify, keep an open mind and understand the use case.
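For illustration only, here is a rough sketch of the Middy/Koa-style middleware pattern Ricardo is describing, with a logging middleware that obfuscates sensitive fields. This is not New10's internal framework; the helper names and field names are made up.

```js
// A hand-rolled, minimal version of the middleware pattern (hypothetical names).
const wrap = (handler, middlewares = []) => async (event, context) => {
  const request = { event, context, response: undefined };
  // Run "before" hooks in order, then the handler, then "after" hooks in reverse.
  for (const m of middlewares) if (m.before) await m.before(request);
  request.response = await handler(request.event, request.context);
  for (const m of [...middlewares].reverse()) if (m.after) await m.after(request);
  return request.response;
};

// Example middleware: structured logging with basic obfuscation of sensitive fields,
// mirroring the "logging standards with obfuscation built in" idea.
const redactingLogger = (sensitiveFields = ['password', 'iban']) => ({
  before: async ({ event }) => {
    const body = event.body ? JSON.parse(event.body) : {};
    for (const field of sensitiveFields) if (field in body) body[field] = '***';
    console.log(JSON.stringify({ level: 'info', msg: 'request received', body }));
  },
});

// Usage: wrap a plain Lambda handler with the middleware chain.
exports.handler = wrap(
  async () => ({ statusCode: 200, body: JSON.stringify({ ok: true }) }),
  [redactingLogger()]
);
```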


Yan Cui: 19:34  

All right, great to hear that you guys are using some kind of middleware engine to encapsulate a lot of the cross-cutting concerns. But I want to circle back to what you talked about earlier, the Terraform versus serverless question. How do you decide when to use Terraform versus the Serverless Framework? And how do you mix the two together, like referencing resources that have been created in one from the other? How do you go about deciding which one to use?


Ricardo Torres: 20:08  

So, initially, in the early days, we had quite some resources being created directly from the Serverless Framework, right? But once you get used to CloudFormation, you can easily understand that this is not the best approach, because CloudFormation has all kinds of weird scenarios where the stack gets inconsistent, and sometimes you have to destroy the stack and recreate resources. That's something you cannot do with your databases, with your persistence layer, right? Unless it's a caching table, maybe then you can do that. But more often than not, everything that must be persisted, we have in Terraform. So, for instance, the Cognito part I mentioned, all the databases, the S3 buckets, etc., are defined in Terraform. On the serverless side, the only things defined there are the things that can and should be removed if the stack is gone. For instance, the API gateway: if we ever remove that service, we want the API gateway to be gone with it. So that's more or less the criteria we use for deciding what's defined in Terraform and what's in the Serverless Framework. As for the integration between the two, we have a serverless plugin that integrates with the Terraform outputs, which means that from your serverless.yml definition, or other configuration files, you can just refer to the Terraform outputs, and during deploy time the plugin is going to get the outputs from Terraform and inject them into your configuration. So you don't need to care about ARNs, or referencing resources by name or other identifiers; you can just use the Terraform output, reference it easily, and then you have everything in a single place.
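New10's plugin is private, so purely as a sketch of the general idea: read `terraform output -json` at deploy time and feed the values into the serverless config, so ARNs and names never have to be hard-coded. The directory, output names and service details below are hypothetical.

```js
// serverless.js (fragment) -- sketch of injecting Terraform outputs at deploy time
const { execSync } = require('child_process');

// `terraform output -json` returns { outputName: { value, type, sensitive } }
const tfOutputs = () => {
  const raw = execSync('terraform output -json', { cwd: './infra' }).toString();
  return Object.fromEntries(
    Object.entries(JSON.parse(raw)).map(([name, out]) => [name, out.value])
  );
};

const outputs = tfOutputs();

module.exports = {
  service: 'documents',
  provider: { name: 'aws', runtime: 'nodejs14.x' },
  functions: {
    processUpload: {
      handler: 'src/process-upload.handler',
      environment: {
        // e.g. a DynamoDB table that lives in the Terraform-managed stack
        TABLE_NAME: outputs.documents_table_name,
      },
    },
  },
};
```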


Yan Cui: 21:59  

Is that a custom plugin that you guys have written for yourself? Or is there some other, some public one? 


Ricardo Torres: 22:05  

No, it's a custom plugin. At the moment it is private, for sure. That's something we want to open source, together with the Lambda framework I mentioned. We've been using it for, I think, two years by now, and it's really battle tested, we've made a lot of improvements. So those libraries are, for sure, good candidates for us to open source.


Yan Cui: 22:30  

Okay, I want to go back to something you mentioned at the start of that answer, about CloudFormation not being the best approach for some of these persistence layers, like DynamoDB tables. Can you give me some examples of things that you can change with Terraform but can't change with CloudFormation?


Ricardo Torres: 22:51  

It's not really about changing things, it's more about problems with the stack. We had issues where, in some services, we reached the limits of the stack. So what happens is you need to create an extra stack, right? And sometimes that's not a straightforward process. We use the famous serverless-plugin-split-stacks to do that, but more often than not, if you don't start out using the plugin and you introduce it at a later stage, you can get quite some inconsistency in your stack. So it's advised that you destroy the stack before splitting it into multiple stacks so you can have more than 200 resources in it. We had to balance this a lot, and basically Terraform is, in my opinion, just a better tool for the job in terms of defining infra in a single place where most things can be shared. And the community around Terraform, I must say, is a lot bigger than CloudFormation's. You have providers not only for AWS but for everything: we use Terraform for the Sentry integration, PagerDuty, Datadog. So we rely on Terraform for a lot of stuff not really related to AWS. It's just a common place for us to have the resources that we don't want to be managed by the Serverless Framework, and hence CloudFormation.


Yan Cui: 24:23  

Okay. I think CloudFormation also has support for third-party resources now, but I don't know how widely adopted they are or how big that ecosystem is. I've written custom resources for CloudFormation before for things like Datadog myself, but certainly with Terraform you get that out of the box. But for that 200 resource limit, I have transformed existing CloudFormation stacks using the split-stacks plugin, and that was fine, so I'm quite interested to hear what specific problems you ran into. One thing I've run into in the past is that some of the resources already exist, so you have to do some clever things around the naming of resources and things like that. Is that what you were referring to, that it's hard to go from not using the split-stacks plugin to start using it on a live project?


Ricardo Torres: 25:20  

So that's something we actually hit a while ago, it's been quite a while. But basically, what happened on one of the services I was responsible for is that we reached the 200 resource limit, right? And once we introduced serverless-plugin-split-stacks, since it was not there from the beginning, we started having, on different occasions, the CloudFormation issue of cyclic dependencies, right? Resources from different stacks referencing each other. That was quite hard for us to manage, because how the resources are going to be split is decided by the plugin itself. You have different rules, you can do it per name or per group, and there are different options in there, but the option we chose had this issue that we ended up with cyclic dependencies, and to fix that we had to remove the stack. So that was quite bad. Later on, we discovered that there is a note in the readme of the split-stacks plugin that tells you that if you introduce the plugin later, or if you change the way you are grouping the stacks, unintended things can happen and you may need to start from scratch. So there is a disclaimer in the plugin's readme, and that's something we really faced. It was quite interesting to work around.


Yan Cui: 26:53  

Yeah, so one of the things you could do there, and I guess this veers towards the more advanced use of that plugin, is write your own stacks-map module, so that you can control how the resources are grouped and decide which resources go into which nested stack. For something like this, where you are migrating an existing CloudFormation stack that's not using the split-stacks plugin, you probably should be doing that instead. But like you said, it does add a bit more nuance to this. I was working on a recent project where we did this, and it was an AppSync project, so there were a lot of resources referencing each other. I had to understand my resource graph so that I could pick out all the relevant resources that are part of that graph, basically walk the dependency graph myself, and then pull them as much as possible into their own nested stack, to minimise the number of cross-stack references. So there is a way around it, but it does require you to learn a lot more about how your resources are being provisioned, and understand your graph, so that you are dissecting them in such a way that you minimise those cross-stack references. It's a bit more work than just using one of the built-in groupings that the plugin gives you. So there are ways out, they're just probably not as easy as you would like.


Ricardo Torres: 28:38  

Yeah, exactly. Actually, we are using the stacks map. In one service, for instance, we have the separation where the log groups go to one stack and the versions of the Lambdas go to a different stack. So we have that, but that's something we learned maybe a bit too late in the process. We tried everything else, and then we learned, “Oh, there is this option that gives you a more reliable way”, right? Because then you can avoid whatever criteria the plugin uses to split the stack by default, I don't know, something like round-robin. Basically, it gives you full control over where you want your resources to be. So that's something we use nowadays, but it took us a while to learn it.
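As a rough sketch of the custom grouping hook that serverless-plugin-split-stacks supports (check the plugin's README for the exact contract), the kind of stacks-map described here might look roughly like this; the destination stack names are made up.

```js
// stacks-map.js -- rough sketch of custom resource grouping for split-stacks
module.exports = (resource, logicalId) => {
  // Send log groups to their own nested stack, like the separation Ricardo mentions.
  if (resource.Type === 'AWS::Logs::LogGroup') {
    return { destination: 'LogGroups' };
  }
  // Keep Lambda versions together in another nested stack.
  if (resource.Type === 'AWS::Lambda::Version') {
    return { destination: 'Versions' };
  }
  // Returning nothing leaves the resource to the default mapping.
};
```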


Yan Cui: 29:20  

Yeah, I have also had to do things like hashing the resource name, so that the same function or the same resources always get hashed to the same stack as much as possible, so I don't have to constantly move resources between different stacks when I'm doing something clever. But yeah, there's quite a lot of extra complexity that you'd hope you didn't have to deal with. I do get it, with CloudFormation that 200 resource limit is really annoying, and I think they, as in AWS, have been hinting at maybe lifting some of those limits. A while back I think Chris Munns, or someone else on Twitter, was kind of hinting that something's coming. So let's hope something comes out, because this 200 resource limit is arbitrary, especially when you can just use nested stacks to go up to 200 times 200. What's that? That's 40,000 resources. So, you just...


Ricardo Torres: 30:25  

It feels like just a legacy thing, right? It's still there and a bit hard to remove. But I think it would make sense to remove it, and for us it would be an immense thing to have, because like I mentioned, each single Lambda maps to a different endpoint and HTTP method. So sometimes on the same endpoint you have, I don't know, three or four different methods, and if you multiply that by the number of endpoints a single service can have, it's so easy to reach this limit that it just doesn't make sense anymore.


Yan Cui: 31:05  

Yeah, especially if you follow AWS security best practices and have a tailored IAM role for every single function. So for every endpoint, you've got API Gateway itself, which is not particularly efficient in terms of how it manages resources: you've got all these different paths, the resource and then the method. And then you've got the Lambda function, the Lambda version, the log group, and then the IAM role. All of these things add up pretty quickly, right? So 200 resources is nothing.


Ricardo Torres: 31:37  

Yeah, indeed.


Yan Cui: 31:39  

So okay, let's carry on with the team structure, because I'm also quite curious what your operational model looks like, in terms of being on call and the monitoring side of things. Do the teams themselves take responsibility for their own services and go on call for them? Or do you have a centralised Ops team that takes care of a lot of that?


Ricardo Torres: 32:07  

No, actually, all the software engineers are responsible for the uptime of the system. I mean, we have the different stream teams, so each team is responsible for a given set of services, and those engineers are also responsible for making sure the services are up and running. We have leveraged PagerDuty a lot for our incident management, and the flow is kind of like this: whenever there is an error on a given service or application, or the error level goes above a given threshold, an alert is created in PagerDuty, and there is usually an engineer on call. The engineer on call is not mapped to the teams or the domains; they can come from any domain. So I can be the engineer on call in a given week, and I'm the first line of defence. An issue comes to me, and since we have quite nice logging standards, it's quite easy for me to pinpoint where the problem is, because from the log message I can already tell everything: there is information about the service, the execution status, for instance whether it was cold or warm, there are request IDs and a lot of other contextual information, which helps us understand where the problem comes from. This first line of defence can be the engineer that just goes, “Oh, a message went to the dead letter queue, I just need to put it back into the main queue to retry.” That's something the engineer on call can easily do. But there are other cases where the problem is a bit more complex, so they're going to ask for help from the team responsible for the service, and from that point on the issue can even be delegated to that team, so they can discuss it with the PO and solve it, or just solve it right away if it's really urgent. Apart from that, there is a second and third line of defence, meaning that if the person on call is not available, maybe they went out for lunch or something like that, the second line of defence is going to be pinged, and they can follow the same process or just get in touch with other engineers asking for help. And this actually goes up to the CTO: if nobody is available to respond, the CTO is going to be paged, and the CTO will get in touch with other engineers to see what's happening. But basically, that's the scenario we have: everybody is responsible in the end.
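For illustration only, a minimal sketch of the "put the dead-letter message back on the main queue" step Ricardo mentions, using the aws-sdk v2 SQS client. The queue URLs come from hypothetical environment variables.

```js
// redrive.js -- move up to 10 messages from a DLQ back to the main queue (sketch)
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

const DLQ_URL = process.env.DLQ_URL;         // dead letter queue URL
const MAIN_URL = process.env.MAIN_QUEUE_URL; // main queue URL

async function redrive() {
  const { Messages = [] } = await sqs
    .receiveMessage({ QueueUrl: DLQ_URL, MaxNumberOfMessages: 10, WaitTimeSeconds: 5 })
    .promise();

  for (const msg of Messages) {
    // Re-send to the main queue, then delete from the DLQ so it isn't replayed twice.
    await sqs.sendMessage({ QueueUrl: MAIN_URL, MessageBody: msg.Body }).promise();
    await sqs.deleteMessage({ QueueUrl: DLQ_URL, ReceiptHandle: msg.ReceiptHandle }).promise();
  }
  return Messages.length;
}

redrive().then((n) => console.log(`redrove ${n} messages`));
```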


Yan Cui: 34:40  

Okay, that's great. In my experience, putting developers on call is one of the best things you can do to improve reliability and uptime, because developers don't want to get called in the middle of the night when something breaks, right?


Ricardo Torres: 34:55  

Yeah, I think it is the best approach. It works. I mean, I don't want to introduce bugs because I know I'm going to be alerted. So it keeps you in the mindset, right? That it is your responsibility not to introduce bugs, and to make sure that everything is well tested and well taken care of, so that when you are on call you don't get paged every hour. That's something that really helps us, indeed.


Yan Cui: 35:20  

And I guess the same mindset goes towards improving monitoring and all of that as well, because when you're on call and under pressure to debug something that's happening in production, if you don't have the right tools, monitoring, metrics and logs in place, it's you who is going to be struggling to find the problem. Being on the hook to identify and fix problems quickly really helps improve those operational practices. So for that, what do you guys use for the monitoring side of things, the runtime monitoring and alerting?


Ricardo Torres: 35:58  

We use X-Ray, we use Sentry, we use CloudWatch and Datadog, and ultimately PagerDuty. These are all integrated, and like I mentioned, we provide tooling to make sure developers are doing the right thing. So we have tooling that, whenever a service has been deployed, checks that all these integrations are in place, because of course we don't want to deploy a service that's not hooked into Sentry or PagerDuty; that would mean that when an error happens, we only see it if there is someone looking at the log stream or something like that. So the integration must be there, and that's what we rely on. I think Sentry is a really amazing, battle-tested tool; we use it across our services, from the front end to the back end. We rely a lot on Sentry, and we have the integration from Sentry to PagerDuty, so alerts go out and incidents get created. I think that's the tooling we use nowadays for the monitoring part. A lot of the dashboards we have are in Datadog, so we are relying more and more on Datadog for dashboards, for monitoring and for alerting as well. We even have monitors for external tooling, right? Because like I mentioned, we rely a lot on external vendors. Imagine the service that provides the way for people to sign documents: if that's down, we need to know as soon as possible. So we have a monitor for that service as well; we monitor their API from our side, and we also monitor their status page. Whenever one of those changes, we are alerted, to make sure we're on top of the issue. Sometimes we actually get the alerts even before they update the status page, right? I think you know how it goes, those status pages are sometimes just for show, because they take two hours, or maybe even four hours, to be updated, and that's for sure not something we can have. We cannot wait four hours to understand what's happening with a service we rely on. So yeah, I think that's more or less the tooling we have around this.


Yan Cui: 38:15  

Okay, I'm curious about the choice to go with Datadog, because the pricing model, where they charge you $5 per resource, is really not great for Lambda, when you can easily have lots of Lambda functions and you're going to be paying five bucks per month for each of those. Are you using it just for metrics? Or do you use it for metrics and logs as well?


Ricardo Torres: 38:40  

We use it for metrics and we use it for logging, but we don't have native integration between Datadog and the Lambdas. Basically, everything that gets to Datadog, we inject into Datadog ourselves. For instance, the logs go to CloudWatch, of course, and then they are fed into Datadog. For the metrics, we also have a special way of logging messages, and those messages become metrics inside Datadog, where we can create monitors on them, etc. So that's how we are using Datadog right now; no, we don't have any native integration with our systems in terms of agents, etc. Actually, I don't have a lot of visibility on the container part, I'm not sure if the containers are using the Datadog agent, that's something I would have to verify. But yeah, that's how we rely on Datadog. There are some things where I'd like to see a lot of improvement. For instance, there is no full-text search, so you really need to index the fields you want to search on in advance. Whenever you introduce a new field in your JSON log message, you need to make sure it gets indexed on its first appearance, so that on the next appearance you can search on that field. That's a bit tricky. Since we have standardised logging, it kind of helps, because most of the things we have are already indexed by default. But if you have to introduce something else, then it's kind of a pain in the ass, indeed.


Yan Cui: 40:19  

Yeah, I think you were talking about the DogStatsD format, where if you've got Datadog ingesting your logs from CloudWatch Logs, it will automatically turn them into custom metrics. Nowadays CloudWatch also supports that, though it uses a really verbose format which I don't like: it's got this thing called the Embedded Metric Format, where you can basically write a JSON blob to CloudWatch Logs from Lambda and it gets turned into a custom metric. One of the reasons why I still don't use Datadog for metrics is that the ingestion time adds delay, so my alerts don't fire for another few minutes, which, in an emergency when something's happening, means it takes me a few more minutes just to know that something's happening. That's why at DAZN, my previous company, even though we were using Datadog to ingest all these things from CloudWatch Logs, we were still using CloudWatch metrics and CloudWatch dashboards for alerting and metrics, and Datadog for all the logs. But Datadog is so expensive, especially given the pricing is based on the number of resources you have, and with Lambda you end up with hundreds, maybe thousands, of these things, so it gets expensive really quickly. They actually did a big migration away from Datadog because the contract got too expensive. But that's okay, that's cool. Thank you for walking me through all of the tools that you guys are using. I want to circle back a little bit to the cold starts, because you said that you've got some Java functions as well, and we have seen a lot of people adopt the Provisioned Concurrency setting with Java functions. Is that something that you guys have played around with?
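For illustration only, a minimal sketch of the CloudWatch Embedded Metric Format Yan mentions: a structured JSON blob logged from a Lambda that CloudWatch turns into a custom metric, with no extra API call. The namespace, dimension and metric names are just examples.

```js
// A Lambda handler that emits one custom metric via the Embedded Metric Format.
exports.handler = async () => {
  console.log(JSON.stringify({
    _aws: {
      Timestamp: Date.now(),
      CloudWatchMetrics: [{
        Namespace: 'MyService',                 // example namespace
        Dimensions: [['FunctionName']],         // one dimension set
        Metrics: [{ Name: 'OrderProcessed', Unit: 'Count' }],
      }],
    },
    FunctionName: 'process-order',              // dimension value
    OrderProcessed: 1,                          // metric value
  }));
  return { statusCode: 200 };
};
```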


Ricardo Torres: 42:11  

I played around with it, but that's not something we have in production yet. We are still relying on our own warmers, we call them a central heating system. Each serverless project has this warmer Lambda, and in the serverless.yml all you need to define is the concurrency you want each Lambda to have; the warmer Lambda will make sure they get pinged accordingly. That's the setup we have for the warmers. We haven't tried Provisioned Concurrency in production. That would for sure be a really nice addition for us, but it comes with a bit of a price, because our warmers actually gave us kind of a hidden feature, right? If you ping the Lambda quite often, it means you can refresh secrets in the in-memory cache, and we do that a lot for secrets coming from Parameter Store or other secret management tools. So we sometimes load some stuff in memory, and what the warmer provides, since it keeps pinging, is that you can always keep this in-memory cache up to date. We can generate tokens, we can store tokens in there, we can do a lot of stuff during this refresh that we otherwise wouldn't be able to do with Provisioned Concurrency. Of course, there are other ways to achieve that, but it would be quite a change for us, so I think that's maybe the main reason we are not actively using Provisioned Concurrency yet, even though it is one of the things most people had been waiting for since we started with serverless. But we have a lot of stuff around the cold starts, like I mentioned. The decision was made in the early days to go for serverless, and especially for Lambdas, and in 2018, when we started moving everything to different microservices, we did quite extensive research into whether Lambdas were ready for HTTP APIs at the time. There were still quite some blog posts saying Lambda is good but not ready for customer-facing APIs, and we really wanted to validate that. We knew some people were already using it, like you, I think at the time you were at DAZN, I knew you guys were using it, and I knew some other companies using it as well, so we went to verify this. We actually found really interesting results: like I mentioned, sometimes the cold start in simple services is around 800 milliseconds, and even in complex services running on Python or JavaScript you get 1 to 1.5 seconds of cold start, and that, for us, is really acceptable. From the beginning, our goal was to have really lean Lambdas, less than five megabytes, which is something relatively easy to achieve with JavaScript, but at the same time a bit harder, right? Because none of the tooling and none of the modules we have, for basically any language, were made for serverless, especially with cold starts in mind. Sometimes you have a JavaScript library that does such a simple thing, but it depends on libraries or modules that are, I don't know, 10, 15, 20 megabytes, and all of a sudden you install something quite simple in your service and your bundle goes from 2 megabytes to 15 megabytes, and of course that's going to come with a price. So we did a lot of research; there is even a website called Package Phobia, which was a great resource for us to understand how we were using these node modules, and in some cases we managed to reduce the size.
For instance, we wanted to use the X-Ray SDK, but that library came baked with the AWS SDK inside, which is not needed if you're running on Lambda, and it also came with Lodash and other huge libraries. At one point it was 40 megabytes after install, and we were like, well, we don't want that and we don't need that. So we forked that library. It was at the same time that Node 10 was released, and the X-Ray SDK was not actually working with Node 10 because of async-await, etc. So we forked it, introduced support for Node 10, and at the same time removed a lot of the boilerplate and the stuff we didn't need, like the AWS SDK, Lodash and other things. The majority of our services are now responding in less than two seconds on a cold start, and like I mentioned, we have some services at one second. So in theory everything worked fine, and we proved in practice that everything worked fine, so we were really happy with the results. Nowadays, cold starts are not even a topic anymore among the developers, unless something quite tricky must be introduced. Otherwise, it's just built into our tooling and everything else we provide; it's not something we care about as much nowadays, because we know the Lambdas will have a really lean size, so that's already kind of guaranteed whenever we start a new service. Unless, of course, the service is Java, then extra caution must be taken, but otherwise I think we're pretty happy with the setup we have for JavaScript and Python.
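For illustration only, a sketch of the "central heating" idea: the handler recognises a warm-up ping and uses it to refresh an in-memory secrets cache. The event shape used to mark warm-up pings and the SSM parameter name are hypothetical; it uses the aws-sdk v2 SSM client.

```js
const AWS = require('aws-sdk');
const ssm = new AWS.SSM();

let cachedToken; // survives between invocations of a warm container

async function refreshSecrets() {
  const { Parameter } = await ssm
    .getParameter({ Name: '/my-service/api-token', WithDecryption: true })
    .promise();
  cachedToken = Parameter.Value;
}

exports.handler = async (event) => {
  // Warm-up pings (however your warmer marks them) just refresh the cache and exit.
  if (event && event.source === 'serverless-warmer') {
    await refreshSecrets();
    return 'warmed';
  }

  if (!cachedToken) await refreshSecrets();
  // ... real business logic using cachedToken goes here ...
  return { statusCode: 200, body: JSON.stringify({ ok: true }) };
};
```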


Yan Cui: 47:40  

Have you tried using a bundler like webpack, with the serverless-webpack plugin for example? Because that basically solves most of your problems in terms of reducing the size, but also, importantly, a lot of the time that goes into the cold start for those dependencies is actually just file IO and CPU, doing the module resolution and the file system calls to load those files. So when you've got a bundler that bundles everything together and does tree shaking as well, that makes a massive difference. And if you also do things like not requiring the full AWS SDK, just requiring the specific client that you need, that makes a big difference as well. I did a bunch of testing a while back, and I was able to reduce the module initialisation time from 200-300 milliseconds to about 50-60 milliseconds at the 90th or 95th percentile, just by doing those two things: requiring just the SDK client rather than the whole AWS SDK, and applying the webpack bundling. That's it, just two things for JavaScript functions.
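As a small sketch of the second tweak Yan describes, for a Node.js function on aws-sdk v2: import only the client you need instead of the whole SDK (and let a bundler such as serverless-webpack package each function individually). The table name and event shape are hypothetical.

```js
// Pulls in the entire aws-sdk at init time -- slower to load:
// const AWS = require('aws-sdk');
// const dynamo = new AWS.DynamoDB.DocumentClient();

// Loads just the DynamoDB client, which is much cheaper to initialise:
const { DocumentClient } = require('aws-sdk/clients/dynamodb');
const dynamo = new DocumentClient();

exports.handler = async (event) => {
  const { Item } = await dynamo
    .get({ TableName: process.env.TABLE_NAME, Key: { id: event.pathParameters.id } })
    .promise();
  return { statusCode: 200, body: JSON.stringify(Item) };
};
```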


Ricardo Torres: 48:50  

Yeah, that's something we do. I think everything you mentioned is something we actually do. For instance, webpack is something we started with really from the get-go, so we have kind of a serverless boilerplate setup that has webpack, and the Lambdas are packaged individually to make sure each one only loads the resources it actually needs, to avoid loading, I don't know, some external SDK on a Lambda that doesn't even contact the service behind that SDK. So that's something we are actively doing as well. And we try to always keep track of our response times, because they relate a lot to the cold start, right? If we see an increase in the response times, we need to look into it: hey, what was changed? Is it because of infra issues, like a latency issue on the API Gateway and Lambda integration or something like that, or is it due to our cold start? And this is actually where X-Ray really helps with debugging, because without X-Ray and the instrumentation it provides, I think it would be insanely hard to make sense of all these things: the cold start, the initialisation part, and where your execution is spending time. So yeah, those are the things you need to be mindful of, always, indeed.


Yan Cui: 50:27  

In fact, there are quite a lot of other tools nowadays that offer something similar to what X-Ray offers, but X-Ray is the first-party tool; the other services you have to pay for. I think that's all the questions I had in mind. Thank you so much, Ricardo, for taking the time to talk to us today. Before we go, can you tell people how to find you on the internet?


Ricardo Torres: 50:54  

Yeah, sure. On LinkedIn I'm Ricardo Torres, or if you just search for Ricardo Torres New10, it should pop up. On Twitter I'm Ricardian Torres, which is not the best handle, but you can easily find me by my name. And on GitHub I'm rictorres (https://github.com/rictorres). I try to be quite active in the open source community as well, always trying to give back. I guess without the open source community we wouldn't even exist, we wouldn't even be here talking about this, so I think that's the least we can do, right? It would be nice if anybody wants to get in touch, I'll be here. And thanks a lot for having me, it was an immense pleasure, and such an honour to be talking to you in person. I've learned so much from you regarding serverless. Thanks a lot.


Yan Cui: 51:45  

No worries. Thanks for agreeing to do this. One last question about New10: are you guys hiring? Because right now I see a lot of people looking for a job, and at the same time there are a lot of companies still looking to recruit. Are you guys hiring at New10?


Ricardo Torres: 52:02  

Yes, we are hiring. If you check our page on LinkedIn, there are always open vacancies there in different departments: data, Salesforce, also engineering. I'm not sure if we have any openings on the engineering side right now, especially for working with Lambdas, but just keep an eye on it, because we are always looking for new talent. Yeah.


Yan Cui: 52:23  

Excellent. With that, I guess, stay safe, and hopefully see you in person sometime soon.


Ricardo Torres: 52:30  

You too. Take care. 


Yan Cui: 52:31  

Take care, man. Bye bye. 


Ricardo Torres: 52:32  

Bye bye.


Yan Cui: 52:46  

So that's it for another episode of Real World Serverless. To access the show notes, please go to realworldserverless.com. If you want to learn how to build production-ready serverless applications, please check out my upcoming courses at productionreadyserverless.com. And I'll see you guys next time.