You can find Ricardo on LinkedIn here.
Here are links to what we discussed in the show:
To learn how to build production-ready Serverless applications, go to productionreadyserverless.com.
For more stories about real-world use of serverless technologies, please follow us on Twitter as @RealWorldSls and subscribe to this podcast.
Opening theme song:
Cheery Monday by Kevin MacLeod
Yan Cui: 00:13
Hi, welcome back to another episode of Real World Serverless, a podcast where I speak with real world practitioners and get their stories from the trenches. Today, I'm joined by Ricardo Torres from New10. Hey, man, welcome to the show.
Ricardo Torres: 00:27
Hey, man, thanks for inviting me. It's so good to be here.
Yan Cui: 00:31
So before we get into how New10 is using serverless, can you just give us a quick introduction to New10, and your experience there so far?
Ricardo Torres: 00:42
Yeah, sure. So New10 is a startup initiated by the bank ABN AMRO in 2017. In fact, we actually just celebrated three years since go-live. What we do is provide loans to small and medium enterprises, and also independent contractors, in the Netherlands, ranging from 5k to 250k euros. And this year we also released Corona relief products to the market, to aid our clients in these really hard times that we are all facing. For all of this, we use a lot of serverless tooling. But basically, that's what New10 does. As for my personal experience, I come from a background of API development. I worked before for a company called Mycujoo, where we were doing football live streaming, and I was responsible for the API development there. And that was also how I started at New10. Basically, we wanted to create all these APIs for our microservices, because in the beginning we had some monolithic applications that were basically for the MVP, and we just wanted to make the shift to serverless and to APIs with microservices. So that's basically the experience I have nowadays at New10. Recently I also became a tech evangelist, so my role changed a bit. I still do software engineering, but I'm quite focused on evangelism internally at New10 as well.
Yan Cui: 02:17
Okay, so in this case, how did you guys settle on using serverless technologies in the first place? I guess finance is not known for being adventurous when it comes to technology choices. So how did New10 settle on using serverless?
Ricardo Torres: 02:35
Yeah, that's an interesting story, actually. I think the decision was made really in the early days, back in 2017. Basically, we had really high pressure to deliver, a strict budget, and we also had a strong belief in DevOps. This combination of factors made it easier for us to choose serverless, right? Because with serverless, we managed to set up the whole platform in less than nine months, which was really a great achievement. For us to launch a product from zero to live in nine months was quite good. And we didn't have to spend time managing the resources that are now made obsolete by serverless, right? I mean, the resources do exist, we just don't manage them directly. So that's, I think, the real added benefit of the serverless part. Even for personal projects nowadays, this is my go-to stack. It's such an amazing thing that, as developers, we don't need to take care of a lot of stuff; I think that removes a lot of the burden we usually have if we choose a different path. And it's common to see nowadays a lot of companies spending a lot of time setting up Kubernetes and the lower-level parts of the infra, where for the MVP they could just go straight to serverless, have something that works for the initial state, and evolve from there. So it's quite interesting what we did, and so far it still works quite well. We've evolved a lot since we started, and we're more and more into serverless. Nowadays, I think the amount of serverless we have just keeps growing.
Yan Cui: 04:19
Okay, so in that case, can you give us a bird's eye view of your architecture? Sounds like you've got a lot of APIs, a lot of microservices. But I guess you must be doing a lot of background data processing as well, maybe some machine learning stuff. Give us a high-level overview of what your architecture looks like and how the different things fit together.
Ricardo Torres: 04:41
Mm hmm. Yeah, so I'm gonna go, let's say, from bottom to top on this one. Basically, what we have nowadays is around 158 API endpoints, around 460 Lambdas running in production, and 35 Fargate containers. So not everything is serverless — we have quite some stuff on containers as well — but still, the serverless part, especially with Lambdas, makes up, I think, two thirds of the services we have. Then we have one main general-purpose SNS topic, and we have around 60 SQS queues for internal eventing, and on there we have around 350,000 events per month. The persistence layer is backed by either DynamoDB, S3 or RDS. On the grouping part, all these resources are grouped into 45 client-facing microservices, which encapsulate and abstract the internal logic and data. We also believe a lot in single responsibility for these services. It means that for a serverless service, each Lambda is responsible for a given endpoint or method. So we don't have a single Lambda dealing with three or five different HTTP methods. If I have a POST on a given endpoint, that's the responsibility of a single Lambda. There are some interesting pros and cons on this one, but that's the approach we took. Everything is also grouped into five business domains. And the service-to-service communication goes through, like I mentioned, the SNS and SQS part, and also HTTP — we have a lot of service-to-service communication going internally through our own APIs. On the application landscape, everything is managed on AWS infra, I mean the part for the back end. We have the front end and GraphQL also powered by AWS. The GraphQL layer is actually running on a container, so that's not serverless yet. We also have integration with Salesforce, we have Google Cloud Platform for marketing purposes, and we also use BigQuery in there.
And we have quite some third-party vendors, because, for instance, we believe a lot in cloud native. So we have vendors for signing documents and for verifying identities — basically, to tell if a person is who they say they are based on personal documents like a passport, etc. And we also have a powerful vendor that basically provides us with the ledger for balance, interest and transaction management of our loans. So I think that's the big overview of our stack at the moment.
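A quick sketch of what that one-Lambda-per-endpoint-and-method setup can look like in a serverless.yml — the service name, paths and handlers here are made up for illustration, not New10's actual configuration:

```yaml
# Hypothetical service: each HTTP method on an endpoint gets its own Lambda
service: loan-applications

provider:
  name: aws
  runtime: nodejs14.x

functions:
  createLoan:            # POST /loans handled by one dedicated Lambda
    handler: src/create.handler
    events:
      - http:
          path: /loans
          method: post
  getLoan:               # GET /loans/{id} handled by a separate Lambda
    handler: src/get.handler
    events:
      - http:
          path: /loans/{id}
          method: get
```

The flip side, as comes up later in the conversation, is that each endpoint multiplies the number of CloudFormation resources in the stack.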
Yan Cui: 07:45
Okay, that's great. There's certainly a lot of things I find interesting, so I'm going to ask you a bit more about those. But one thing you mentioned, which I found interesting, is that you believe in cloud native, and therefore you use a lot of vendors to do all these other things, so you don't have to do them yourself. That is different to, I guess, the popular definition of cloud native, which is: run everything in containers so that you are really portable between clouds — which I find, honestly, ridiculous, because how is something native to the cloud characterised by its portability, or by being able to run in your own data centres? So how do you guys define cloud native?
Ricardo Torres: 08:28
Basically, maybe I could have used the wrong term. But what I mean is that we try to outsource resources, especially resources that we would have to manage ourselves. Imagine that, as a start-up as small as us, if we had to implement this verification process for identities ourselves, it would be a really big project, right? It would involve a lot of people. To check the veracity of a passport or any other personal document, you need really specialised people, which, at the time, we didn't have, and we still don't. So that's why we outsource this. Maybe the term is not really cloud native, but whenever we can use a vendor to provide the features or the tooling we need, we're gonna rely on them. So that's more the definition I was looking for.
Yan Cui: 09:25
Right. So what you're describing is this mindset that we'd probably call serverless first, or being serviceful, whereby, as much as possible, we want to consume services that do something for us, so that we can focus on the business-differentiating things that our customers actually want from us, rather than just managing infrastructure and all of that, right?
Ricardo Torres: 09:50
Yeah, exactly. Yes, exactly. Right on.
Yan Cui: 09:53
Gotcha. So you're running containers and Lambdas side by side. How do you decide when to use Lambda and when to use containers?
Ricardo Torres: 10:04
So yeah, we have some limitations, right? Sometimes we choose containers to work around limitations from our cloud provider, in this case AWS. We have services that handle a lot of file uploads, and at the time we decided not to use S3 for those, which meant that the file actually needed to go through the service first before reaching any other parts of the back end or the persistence layer. That meant we couldn't use API Gateway and Lambda, right? Because of the payload and timeout limits of that integration, which are 6 megabytes and 29 seconds — of course not enough for uploading files. In other cases, we did actually start with Lambda. We had an interesting case where we had a Python service that was backed by RDS, and at the time, Lambdas in VPCs were really slow, right? That's something you know quite well. In some cases, we were seeing a cold start of up to 10 seconds. And even though we kept most of the Lambdas warm, there were still some cases that reached this 10-second cold start. And that's really not something we can have on our customer-facing services.
Yan Cui: 11:18
That shouldn't be the case anymore, since they've redone the whole networking layer for Lambda and Fargate. Because of Firecracker, they were able to basically reimagine the whole networking layer around VPCs. So now you shouldn't be seeing those horrific 10-second cold starts.
Ricardo Torres: 11:38
Yan Cui: 12:56
Gotcha. Yeah, with Java, the cold starts are still not quite at the same level as with Node or Python or Golang. So in this case, how many teams do you have working on all these different... all these 150-odd APIs?
Ricardo Torres: 13:12
So we have, in total, around 10 teams working on these APIs. We have 6 stream teams that are directly aligned with our business domains, and we also have the enabling teams. That's what we call, for instance, the tooling team, which is the team that I'm part of right now. We have a platform team that takes care of the whole platform — the AWS accounts, organisations — and provides quite some infra guidelines and best practices for us developers. We have an automation team and also one data team. So in total, we are talking about around 35 engineers spread across the back end, the front end, also Salesforce engineers, the automation part, like I mentioned, the cloud and the data.
Yan Cui: 13:56
Okay, so you have some enabling teams, or enablement teams, and infra teams. So how do you go about organising your code into repositories? Are you doing micro-repos, where you have one repo per microservice? Or are you doing a mono-repo and then building some tooling around knowing which services to deploy when they've changed?
Ricardo Torres: 14:22
We are using micro-repos, so each service has a dedicated repository. We also use mono-repos, but not for the services themselves — we use them for libraries. For instance, we have a lot of NPM packages for Node, and those live in a single repository. We have some other internal tooling too, like the framework for Lambdas we had to develop; it's made of several packages and they all live in the same repository, managed by Lerna, actually. We don't have a lot of experience there. We kind of like the way we do it, the mono-repos, but I think we just need to experiment a bit more and explore different tooling, because at the moment Lerna is not being the best tool for the job — in our case, at least.
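For context, a Lerna setup for a packages mono-repo like the one described usually comes down to a `lerna.json` at the repo root — the values here are illustrative defaults, not New10's actual configuration:

```json
{
  "packages": ["packages/*"],
  "version": "independent",
  "npmClient": "npm"
}
```

`"version": "independent"` lets each package in the repo version on its own; putting a fixed version string there instead versions all packages in lockstep.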
Yan Cui: 15:12
Yeah, Lerna is pretty powerful, but it can also be quite, I guess, opinionated and fixed in how it does things. I do know that it's got an ecosystem of plugins you can use as well, but I haven't explored them much myself. So in this case, all these different teams are using Lambda and also containers to build things. Are you doing anything to ensure consistency? Because it sounds like you've got a shared infra team that's responsible for setting up the AWS environments, and I imagine there are some guidelines around best practices and security things you guys should be doing. How do you go about making sure that everyone is doing the right thing, and propagating some of those best practices?
Ricardo Torres: 16:01
Yan Cui: 19:34
All right, great to hear that you guys are using some kind of middleware engine to encapsulate a lot of the cross-cutting concerns. But I want to circle back to what you talked about earlier, the Terraform versus Serverless Framework question. How do you decide when to use Terraform versus the Serverless Framework? And also, how do you mix the two together — sharing resources that have been created in one with the other, or referencing them, rather? How do you go about deciding which one to use?
Ricardo Torres: 20:08
So, initially, especially in the early days, we had quite some resources being created directly from the Serverless Framework, right? But I think once you get used to CloudFormation, you can easily understand that this is not the best approach, because CloudFormation has all kinds of weird scenarios where the stack gets inconsistent, and sometimes you have to destroy the stack and recreate resources. And that's something you cannot do with a database, with your persistence layer, right? Unless it's a caching table, maybe, you can do that. But more often than not, everything that must be persisted, we define with Terraform. So, for instance, the Cognito part I mentioned, all the databases, S3 buckets, etc., are defined within Terraform. And on the serverless part, I think the only things defined there are the things that can and should be removed if the stack is gone. For instance, the API Gateway: if we ever remove that service, we want the API Gateway to be gone with it. So I think that's more or less the criteria we use for deciding what's defined in Terraform and what's in the Serverless Framework. And for the integration between the two, we have a serverless plugin that integrates with the Terraform outputs, which means that from your serverless.yml definition, or other configuration files, you can just refer to the Terraform outputs. During deploy time, the serverless plugin is going to get the outputs from Terraform and inject them into your configuration. So you don't need to care about ARNs, or referencing resources by name or other identifiers — you can just use the Terraform outputs, reference them easily, and have everything in a single place.
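New10's plugin itself is private, so the exact placeholder syntax isn't public. As a rough sketch of the idea, though, a resolver could take the JSON that `terraform output -json` produces (`{ outputName: { value: ... }, ... }`) and substitute placeholders in serverless configuration values — the `${tf:...}` syntax and the output names below are made up for illustration:

```javascript
// Sketch: replace hypothetical ${tf:outputName} placeholders in a config value
// with the corresponding Terraform output values.
function resolveTerraformOutputs(configValue, tfOutputs) {
  return configValue.replace(/\$\{tf:([^}]+)\}/g, (match, name) => {
    const output = tfOutputs[name];
    if (!output) throw new Error(`Unknown Terraform output: ${name}`);
    return String(output.value);
  });
}

// Example: a serverless.yml environment value referencing a Terraform-managed table
const tfOutputs = {
  loans_table_arn: { value: 'arn:aws:dynamodb:eu-west-1:123456789012:table/loans' },
};
const resolved = resolveTerraformOutputs('${tf:loans_table_arn}', tfOutputs);
// resolved now holds the ARN, with no hard-coded identifier in the serverless config
```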
Yan Cui: 21:59
Is that a custom plugin that you guys have written for yourself? Or is there some other, some public one?
Ricardo Torres: 22:05
It's a custom plugin — at the moment it's private, for sure. That's something we want to open source, together with the Lambda framework I mentioned. We've been using it for, I think, two years by now, and it's really battle-tested; we've made a lot of improvements. So those libraries are for sure good candidates for us to open source.
Yan Cui: 22:30
Okay, I want to go back to something you mentioned at the start of that conversation, about CloudFormation not being the best approach for some of these persistence layers, like DynamoDB tables. Can you give me some examples of things that you can change with Terraform but can't change with CloudFormation?
Ricardo Torres: 22:51
It's not really about the changing, it's more about problems with the stack. For instance, we had issues where some services reached the resource limit on the stack. So what happened? We needed to create an extra stack, right? And sometimes this is not a straightforward process. We use the famous serverless-split-stacks plugin to do that, but more often than not, if you don't start out using the plugin and you introduce it at a later stage, you can get quite some inconsistency in your stack. So it's advised that you destroy the stack before splitting it into multiple stacks, so that you can have more than 200 resources in it. We had to balance this a lot. And basically, Terraform is just, in my opinion, a better tool for the job in terms of defining infra in a single place where most things can be shared. And the community around Terraform, I must say, is a lot bigger than CloudFormation's. You have providers not only for AWS but for everything — we use Terraform for the Sentry integration, PagerDuty, Datadog. So we rely on Terraform for a lot of stuff not really related to AWS. I think it's just a common place for us to have the resources that we don't want to be managed by the Serverless Framework, and hence CloudFormation.
Yan Cui: 24:23
Okay. I think CloudFormation also has support for third-party resources now, but I don't know how widely adopted they are and how big that ecosystem is. I've written custom plugins — sorry, custom resources — for CloudFormation before, for things like Datadog, but certainly with Terraform you get that out of the box. But on that 200-resource limit: I've transformed existing CloudFormation stacks using the split-stacks plugin, and that was fine. So I'm quite interested to hear what specific problems you ran into. One thing I've run into in the past is that, because some of the resources already exist, you have to do some clever things around the naming of resources. Is that what you were referring to — that it's hard to go from not using the split-stacks plugin to starting to use it on a live project?
Ricardo Torres: 25:20
So that's actually something we had a while ago. But basically, what happened on one of the services I was responsible for is that we reached the 200-resource limit, right? And once we introduced serverless-split-stacks — since it wasn't there from the beginning — we started having, on different occasions, the CloudFormation issue of cyclic dependencies, right? Resources from different stacks referencing each other. That was quite hard for us to manage, because how the resources are going to be split is decided by the plugin itself. You have different rules — you can do it per name, or per group, and you have different options in there — but the option we chose had this issue, and we ended up with cyclic dependencies. To actually fix that, we had to remove the stack. So that was quite bad. And later on, we discovered that there is a note in the readme of the split-stacks plugin that actually tells you that once it's in use, if you introduce it or change the way you are grouping the stacks, unintended things can happen, and it's advised to start from scratch. So there is a disclaimer in the readme of the plugin, and that's something we really faced. It was quite interesting to work around.
Yan Cui: 26:53
Yeah, so one of the things that you could do there — I guess this kind of veers towards the more advanced use of that plugin — is that you can write your own stacks-map modules, so that you can control how the resources are grouped, and then you can decide which resources go into which nested stack. For something like this, where you are migrating an existing CloudFormation stack that's not using the split-stacks plugin, you probably should be doing that instead. But like you said, it does add a bit more nuance to this. I was working on a recent project where we did this, and it was with AppSync projects, so there were a lot of resources referencing each other. I had to understand my resource graph so that I could pick up all the relevant resources that are part of that graph — basically walk the dependency graph myself — and then pull them, as much as possible, into their own nested stack, so as to minimise the number of cross-stack references. So basically there's a way around it, but it does require you to learn a lot more about how your resources are being provisioned, and understand your graph, so that you are dissecting them in such a way that you minimise those cross-stack references. It is a bit more work than just using one of the built-in groupings the plugin gives you. So there are ways out, they're just probably not as easy as you would like.
Ricardo Torres: 28:38
Yeah, exactly. Actually, we are using the stacks map. In one service, for instance, we have the separation that the log groups go to one stack and the versions of the Lambdas go to a different stack. So we have that. But that's something we learned maybe a bit too late in the process. We tried everything else, and then we learned, "Oh, there is this option that gives you a more reliable way", right? Because then you can really avoid... I don't know what the default criteria used to split the stack is — something like round-robin or similar. But basically, it gives you really full control over where you want your resources to be. So that's something we use nowadays, but I think it took us a while to learn that.
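For reference, serverless-plugin-split-stacks lets you supply a custom stacks-map module that decides, per resource, which nested stack it lands in. The sketch below shows the shape of the grouping Ricardo describes — log groups in one nested stack, Lambda versions in another — but check the plugin's README for the exact hook signature it expects; this is an assumption-laden sketch, not the plugin's documented API:

```javascript
// Sketch of a custom grouping function: route CloudFormation resources to
// nested stacks by resource type. Returning null leaves the resource where it is.
function stacksMap(resource, logicalId) {
  if (resource.Type === 'AWS::Logs::LogGroup') {
    return { destination: 'LogGroups' };   // all log groups in one nested stack
  }
  if (resource.Type === 'AWS::Lambda::Version') {
    return { destination: 'Versions' };    // Lambda versions in another
  }
  return null;
}

module.exports = stacksMap;
```

Because the function is deterministic per resource, the grouping stays stable between deploys, which is exactly what avoids the "resources shuffling between stacks" problem discussed here.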
Yan Cui: 29:20
Yeah, I have also had to do things like hashing the resource name, so that the same function or the same resources, as much as possible, always get hashed to the same stack, and I don't have to constantly move resources between different stacks when I'm doing something clever. But yeah, there's quite a lot of extra complexity that you hopefully just didn't want to deal with. I do get it — with CloudFormation, that 200-resource limit is really annoying. And I think they've been — they as in AWS have been — hinting at maybe potentially lifting some of those limits. A while back, I think Chris Munns or someone else on Twitter was kind of hinting that something's coming. So let's hope something comes out, because this 200-resource limit is arbitrary, especially when you can just use nested stacks to go up to 200 times 200. What's that? That's 40,000 resources. So, you just...
Ricardo Torres: 30:25
It feels like just a legacy thing, right? That's still there and a bit hard to remove. But yeah, I think it would make sense to remove it, and for us it would be an immense thing to have. Because, like I mentioned, every single Lambda maps to a different endpoint and HTTP method. So sometimes on the same endpoint you have, I don't know, three or four different methods. If you multiply that by the amount of endpoints a single service can have, it's so easy to reach this limit that it feels like it just doesn't make sense anymore.
Yan Cui: 31:05
Yeah, especially if you follow AWS security best practices and have a tailored IAM role for every single function. For every endpoint, API Gateway itself is not particularly efficient in terms of how it manages resources — you've got all these different paths, you've got a resource and then a method. And then you've got the Lambda function, the Lambda version, the log group, and then you've got the IAM role. All of these things add up pretty quickly, right? So 200 resources is nothing.
Ricardo Torres: 31:37
Yan Cui: 31:39
So okay, let's carry on about the team structure, because I'm also quite curious about what your operational model looks like, in terms of being on call and the monitoring side of things. Do you have the actual teams themselves be responsible for their own services and be on call for them? Or do you have a centralised Ops team that takes care of a lot of that?
Ricardo Torres: 32:07
No, actually, all the software engineers are responsible for the uptime of the system. I mean, we have the different stream teams, and each team is responsible for a given set of services; those engineers are also responsible for making sure the services are up and running. We have leveraged PagerDuty a lot for our incident management, and the flow is kinda like this. Whenever there is an error on a given service or application, or the error level goes above a given threshold, an alert is created in PagerDuty, and there is usually an engineer on call. This engineer on call is not mapped to the teams or the domains — they just come from one given domain. So I can be the engineer on call in a given week, and what happens is I'm the first line of defence: an issue comes to me, and since we have quite some nice logging standards, it's quite easy for me to pinpoint where the problem is, because from the log message I can already tell everything. In the log message there is information about the service, the execution status — for instance, whether it was cold or warm — there are request IDs, and a lot of contextual information, which helps us understand where the problem comes from. This first line of defence can be the engineer that just goes, "Oh, there is a message that went to the dead letter queue, I just need to put it back into the main queue to retry." That's something the engineer on call can easily do. But there are other cases where the problem is a bit more complex, so they're going to ask for help from the team responsible for the service, and from that point on, the issue can even be delegated to that team, and they can discuss it with the PO and solve it, or just solve it if it's really urgent. And apart from that, there is a second and third line of defence.
Meaning, if the person on call is not available — maybe they went out for lunch or something like that — there is the second line of defence that is going to be pinged, and they can do the same process, or just get in touch with other engineers asking for help. And this actually goes up to the CTO: if nobody is available to respond, the CTO is going to be paged, and the CTO will get in touch with other engineers to see what's happening. But basically, that's the scenario we have — everybody's responsible in the end.
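The DLQ redrive Ricardo describes — putting messages from a dead-letter queue back onto the main queue — is a few SQS calls in a loop. Here's a sketch with the client injected so the logic is easy to test; with the real AWS SDK you'd pass an SQS client whose `receiveMessage`/`sendMessage`/`deleteMessage` calls return promises (e.g. via `.promise()` on SDK v2 calls — the wrapper below assumes that shape):

```javascript
// Move all messages from a dead-letter queue back to the main queue.
// `sqs` is any object exposing promise-returning receiveMessage,
// sendMessage and deleteMessage methods.
async function redrive(sqs, dlqUrl, mainQueueUrl) {
  let moved = 0;
  while (true) {
    const { Messages = [] } = await sqs.receiveMessage({
      QueueUrl: dlqUrl,
      MaxNumberOfMessages: 10,
    });
    if (Messages.length === 0) return moved;   // DLQ drained
    for (const msg of Messages) {
      // Re-enqueue first, then delete from the DLQ, so a crash in between
      // duplicates a message rather than losing it.
      await sqs.sendMessage({ QueueUrl: mainQueueUrl, MessageBody: msg.Body });
      await sqs.deleteMessage({ QueueUrl: dlqUrl, ReceiptHandle: msg.ReceiptHandle });
      moved += 1;
    }
  }
}
```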
Yan Cui: 34:40
Okay, that's great. From my experience, putting developers on call is one of the best things you can do to improve reliability and uptime, because developers don't want to get called in the middle of the night when something breaks, right?
Ricardo Torres: 34:55
Yeah. I think it is the best approach. It works. I mean, I don't want to introduce bugs, because I know I'm gonna be alerted. So it kind of keeps you in the mindset, right? That it's your responsibility not to introduce bugs, to make sure that everything is well tested and well taken care of, so that when you are on call, you don't get paged every hour. So that's something that really helps us, indeed.
Yan Cui: 35:20
And I guess the same mindset goes towards improving monitoring and all of that as well. Because when you're on call, and you're under pressure to debug something that's happening in production, if you don't have the right tools — the monitoring, the metrics and the logs — in place, it's you who's going to be struggling to find these problems. So it really helps improve those operational practices when you are on the hook to identify and fix problems quickly. So for that, what do you guys use for the monitoring side of things — the runtime monitoring and alerting?
Ricardo Torres: 35:58
So we use X-Ray, we use Sentry, we use CloudWatch, and Datadog, and ultimately PagerDuty. These are all integrated. And like I mentioned, we provide tooling to make sure that developers are doing the right thing. We have tooling that, whenever a service is deployed, is going to check that all these integrations are in place, because of course we don't want to deploy a service that's not hooked into Sentry or PagerDuty — it would mean that when an error happens, we only see it if there is someone looking at the log stream or something like that. So the integration must be there. I think Sentry is a really amazing, battle-tested tool that we use across our services, from the front end to the back end. We rely a lot on Sentry, and we have the integration from Sentry to PagerDuty, so alerts go off and incidents get created. And a lot of the dashboards we have are in Datadog, so we are relying more and more on Datadog — for dashboards, for monitoring, for alerting as well. We even have monitors for external tooling, right? Because, like I mentioned, we rely a lot on external vendors. Imagine that the service that provides a way for people to sign documents is down: we need to know as soon as possible, right? So we have a monitor for that service as well. We monitor their API from our side, and we also monitor their status page. Whenever one of those changes, we are alerted, to make sure we're on top of the issue. Sometimes we actually get the alerts even before they update the status page, right? I think you know how it goes — those status pages are sometimes just for show, because they take some two hours, or maybe even four hours, to be updated. And that's, for sure, not something we can have. We cannot wait four hours to understand what's happening with a service we rely on. So yeah, I think that's more or less the tooling we have around this.
Yan Cui: 38:15
Okay, I'm curious about the choice to go with Datadog, because the pricing model, where they charge you $5 per resource, is really not great for Lambda. You can easily have lots of Lambda functions, and you're going to be paying five bucks per month for each of those. Are you using it just for metrics? Or do you use it for metrics and logs as well?
Ricardo Torres: 38:40
We use it for metrics and we use it for logging, but we don't have native integration between Datadog and the Lambdas. Basically, everything that gets to Datadog, we inject into Datadog. So, for instance, the logs, of course, go to CloudWatch, and then they are fed into Datadog. Then for the metrics, we also have a special way of logging a message, and this message will become a metric inside Datadog, which we can create monitors on, etc. So that's how we are using Datadog right now. So no, we don't have any native integration with our systems in terms of agents, etc. Actually, I don't have a lot of visibility on the container part; I'm not sure if the containers are using the Datadog agent. That's something I would have to verify. But yeah, that's how we rely on Datadog. There are some things where I'd like to see a lot of improvement. For instance, there is no full-text search, so you need to index the fields that you want to search on in advance. Whenever you introduce a new field in your JSON log message, you need to make sure that on its first appearance it gets indexed, so that on the next appearance you can search on that field. So that can be a bit tricky. Since we have standardised logging, it kind of helps, because most of the things we have are already indexed by default. But if you have to introduce something else, then it's kind of a pain in the ass, indeed.
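The "special way of logging a message" is in the spirit of Datadog's documented trick for getting custom metrics out of Lambda logs without an agent: print a line in a `MONITORING|epoch|value|type|name|#tags` shape and have Datadog parse it into a metric. New10's in-house convention may well differ; this helper just sketches the format Datadog's docs describe (the metric name and tags are illustrative):

```javascript
// Build a log line that Datadog can turn into a custom metric when it
// ingests Lambda logs from CloudWatch.
function metricLine(name, value, type = 'count', tags = []) {
  const epoch = Math.floor(Date.now() / 1000);
  const tagPart = tags.length ? `|#${tags.join(',')}` : '';
  return `MONITORING|${epoch}|${value}|${type}|${name}${tagPart}`;
}

// Inside a handler you would simply:
console.log(metricLine('loans.created', 1, 'count', ['env:prod']));
```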
Yan Cui: 40:19
Yeah, so I think you were talking about the DogStatsD format, where if you've got Datadog ingesting your logs from CloudWatch Logs, it will automatically turn them into custom metrics. Nowadays, CloudWatch also supports that, but it uses a really verbose format, which I don't like. It's got this thing called the embedded metric format, where you can basically write a JSON blob to CloudWatch Logs from Lambda and that gets turned into a custom metric. One of the reasons why I still don't use Datadog for metrics is that the ingestion time adds delay, so my alerts don't fire for another few minutes. In an emergency, when something's happening, that means it takes me a few more minutes just to know that something's happening. Which is why at DAZN, my previous company, even though we were using Datadog to ingest all these things from CloudWatch Logs, we still used CloudWatch metrics and CloudWatch dashboards for alerting and metrics, and we used Datadog for all the logs. But Datadog is so expensive, especially given that the pricing is based on the number of resources you have, and with Lambda you end up with hundreds, maybe thousands, of these things. And it gets expensive really quickly. So they actually did a big migration away from Datadog, because the contract had got too expensive. But yes, that's okay. That's cool. So, yeah, thank you for walking me through all of the tools that you guys are using. I want to maybe circle back a little bit to the cold starts, because you said that you've got some Java functions as well. And we have seen a lot of people adopt the Provisioned Concurrency settings with Java functions. Is that something that you guys have played around with?
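The embedded metric format (EMF) mentioned here is a documented CloudWatch feature: a JSON blob with an `_aws` metadata key, printed to CloudWatch Logs, is asynchronously extracted into a custom metric. A minimal sketch of building such a blob by hand; the namespace, metric, and dimension names are made up for illustration (in practice a helper library such as `aws-embedded-metrics` would do this):

```python
import json
import time

def emf_blob(namespace, metric_name, value, unit="Count", dimensions=None):
    """Build a CloudWatch Embedded Metric Format (EMF) JSON document.

    Printed to stdout from a Lambda function (i.e. into CloudWatch Logs),
    CloudWatch turns it into a custom metric with no API call from the
    function, which is why Yan notes the format is verbose but useful.
    """
    dimensions = dimensions or {}
    blob = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        metric_name: value,  # the actual metric value lives at the top level
    }
    blob.update(dimensions)  # dimension values also live at the top level
    return json.dumps(blob)

print(emf_blob("New10/Loans", "ApplicationsCreated", 1,
               dimensions={"Service": "loans-api"}))
```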
Ricardo Torres: 42:11
Yan Cui: 47:40
Ricardo Torres: 48:50
Yeah, that's something we do. I think everything you mentioned is something we actually do. For instance, Webpack is something we started with really from the get-go. So we have kind of a serverless boilerplate setup with Webpack where the Lambdas are packed individually, to make sure that each one only loads the resources it actually needs, to avoid loading, I don't know, an external SDK in a Lambda that doesn't even contact the service behind that SDK. So that's something we are actively doing as well. And we try to always keep track of our response times, because they kind of relate a lot to cold starts, right? Because if we see an increase in the response times, we need to look into, “Hey, what changed?” Is this because of infra issues, like a latency issue on the API Gateway-Lambda integration or something like that, or is it due to our cold starts? And this is actually where X-Ray really helps with debugging. I think without X-Ray and the instrumentation it provides, it would be insanely hard to actually make sense of all these things, right? The cold start, the initialization part, and where your Lambda execution is spending time. So yeah, those are the things that you need to be mindful of always, indeed.
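The two cold-start mitigations discussed in this exchange, per-function bundling with Webpack and Provisioned Concurrency, are both configured in the Serverless Framework. A minimal `serverless.yml` sketch, assuming the `serverless-webpack` plugin; the service and function names are made up for illustration:

```yaml
# serverless.yml (sketch) -- service/function names are hypothetical
service: loans-api

plugins:
  - serverless-webpack      # tree-shakes each handler's bundle

package:
  individually: true        # one minimal artifact per function, so a
                            # handler never ships SDKs it doesn't import

functions:
  createLoan:
    handler: src/createLoan.handler
    # Keeps a pool of pre-initialized instances warm, sidestepping the
    # cold-start penalty that hits JVM runtimes especially hard.
    provisionedConcurrency: 5
```

Note that Provisioned Concurrency is billed per instance-hour whether or not it is used, so it is usually reserved for latency-sensitive, user-facing functions.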
Yan Cui: 50:27
In fact, there are actually quite a lot of other tools nowadays that can offer something similar to what X-Ray offers, but X-Ray is a first-party tool; the other services you have to pay for. So I guess this is... I think this is all the questions I had in mind. Thank you so much, Ricardo, for taking the time to talk to us today. So before we go, can you maybe tell people how to find you on the internet?
Ricardo Torres: 50:54
Yeah, sure. So, on LinkedIn I'm Ricardo Torres, or if you just search for Ricardo Torres New10, it should pop up. On Twitter I'm Ricardian Torres, which is not the best handle, but you can easily find me by my name. And on GitHub I'm rictorres (https://github.com/rictorres). So I try to be quite active in the open source community as well, always trying to give back to the community. I guess without the open source community we wouldn't even exist, we wouldn't even be here talking about this, so I think that's the least we can do, right? So it would be nice if anybody wants to get in touch; I'll be here. And thanks a lot for having me. It was an immense pleasure, and it's such an honour to be talking to you in person. I learned so much regarding serverless. Thanks a lot.
Yan Cui: 51:45
No worries. Thanks for agreeing to do this. So, I guess one last question about New10. Are you guys hiring? Because right now I see a lot of people looking for a job, and at the same time there are a lot of companies that are still looking to recruit. Are you guys hiring at New10?
Ricardo Torres: 52:02
Yes, we are hiring. If you check our page on LinkedIn, there are always open vacancies there in different departments: data, Salesforce, also engineering. I'm not sure if we have any openings on the engineering side right now, especially for working with Lambdas, but just keep an eye on it, because we are always looking for new talent. Yeah.
Yan Cui: 52:23
Excellent. With that, I guess, stay safe, and hopefully see you in person sometime soon.
Ricardo Torres: 52:30
You too. Take care.
Yan Cui: 52:31
Take care, man. Bye bye.
Ricardo Torres: 52:32
Yan Cui: 52:46
So that's it for another episode of Real World Serverless. To access the show notes, please go to realworldserverless.com. If you want to learn how to build production-ready serverless applications, please check out my upcoming courses at productionreadyserverless.com. And I'll see you guys next time.