Real World Serverless with theburningmonk

#68: Event-driven architecture at PostNL with Luc van Donkersgoed

November 02, 2022 Yan Cui Season 1 Episode 68

In this episode, I spoke with AWS Serverless Hero Luc van Donkersgoed about how PostNL is using serverless technologies and discussed the challenges of building event-driven architectures and how PostNL tackles problems such as schema validation and testing.

Links from the episode:

For more stories about real-world use of serverless technologies, please follow us on Twitter as @RealWorldSls and subscribe to this podcast.

Want to step up your AWS game and learn how to build production-ready serverless applications? Check out my upcoming workshops and I will teach you everything I know.


Opening theme song:
Cheery Monday by Kevin MacLeod
Link: https://incompetech.filmmusic.io/song/3495-cheery-monday
License: http://creativecommons.org/licenses/by/4.0


Yan Cui: 00:14  

Hi, welcome back to another episode of Real World Serverless, a podcast where I speak with real world practitioners and get their stories from the trenches. Today, I'm joined by Luc van Donkersgoed from PostNL. Hey, man.


Luc van Donkersgoed: 00:28  

Good to be here. 


Yan Cui: 00:29  

Yep. So we've bumped into each other a few times now in the last couple of months, and most recently at the AWS Community Day in the Netherlands. How have you been?


Luc van Donkersgoed: 00:38  

Yeah, good. Well, there's a lot to do around serverless and around event driven architectures. So it's a good time to be working in that area. So we'll talk a lot more about that today, I guess. And it's kind of funny, because we're both serverless heroes. Are you a Serverless Hero or a Data Hero?


Yan Cui: 00:55  

Serverless Hero. Alex DeBrie is a Data Hero.


Luc van Donkersgoed: 00:59  

Ah, yeah, right. But we only met, I think, for the first time in person about a month ago. 


Yan Cui: 01:07  

Yeah, that's right. That was at the GOTO EDA event, which I think was one of the first event-driven architecture focused conferences I've ever seen. And I thought it went really well, and there were some really good talks. And it was really good for me to catch up with a lot of people that I've known online or from back when I was living in London. And both of us live in the Netherlands as well, in the Amsterdam area, but we didn't see each other face to face until then, which is kind of funny.


Luc van Donkersgoed: 01:34  

Yeah, yeah. And then we're both living in the Netherlands and then we meet in London.


Yan Cui: 01:38  

Exactly, exactly. The community is global. And maybe we'll get back to the EDA aspects a bit later. I guess for those listeners who are not familiar with PostNL or don't live in the Netherlands, can you maybe just give a quick introduction to PostNL, as well as what you do at PostNL?


Luc van Donkersgoed: 01:56  

Yeah, sure. So PostNL is the Dutch postal service, the national postal service. For listeners abroad, it's like USPS but for the Netherlands; we cover what's called the Benelux, so Belgium, the Netherlands, and Luxembourg. And we're one of the largest logistics companies, or at least the largest in the Benelux, and one of the larger ones globally as well. Of course, not as big as players like DHL or UPS, who have a real global footprint, but still pretty high volumes. So for example, last year, on average, we delivered 1.2 million parcels per day. And I say on average because, for example, we don't deliver on Sundays, and in the November-December period it's way busier. So it's actually pretty high volume, and PostNL has always been a pretty innovative company, pretty open to change and open to new concepts and new ways of working. And I think they sort of always had to be. They're a very old company, 220 years old. I think 1799 is the origin of what is now PostNL. And of course, in that entire time, the logistics world changed a lot as well, when you go from horses to trains, from trains to cars, from cars to trucks to aeroplanes, the internet and everything around that. So I think that willingness to change really stems from that history. One of the more recent changes is, I think, in 2012-ish, where PostNL said, well, we're running these data centres, we're running all these servers and these operations, but it's not really our core business, should we be doing this? And they started looking at the Cloud. And they were one of the very first major enterprises in Europe that really embraced the Cloud, closed all their data centres, and moved to AWS. And this is in a time where hardly anyone was really using the Cloud at scale here in Europe. So then they didn't have the data centres. 
And then a few years back, they said, well, the next step for us is to start building our own logistics software, because that's how we are able to compete with the other logistics providers. And if we're going to build our own software, if we're going to build an actual engineering community at PostNL, then what is the technology foundation that we want to build that on? And since they already had this pretty intense partnership with AWS, the answer was serverless. So I think this was around 2018, where they said, we're going to build our software, and all the software that we build will be serverless. And well, there's a lot more to tell, but that's really how we got into serverless. And currently, we have a very large landscape with, well, millions, billions of invocations on serverless technologies.


Yan Cui: 04:47  

Ok. 2018, that's still pretty early days. I remember back then there were still a lot of things that needed to be worked out, a lot of kinks. I remember when I started working with serverless back in 2016, and API Gateway support had only just arrived in 2015. So it had barely become a usable thing that you could build entire systems around. And no one knew how to do a lot of things like observability, how to do config management and event-driven architectures properly; all of that has gradually come into place. And it's very interesting, it's good to hear that you guys, as a really large company, made an early decision to go serverless. So I guess it's been about four years since you made that decision? How would you sort of characterise the state of serverless within PostNL? Would you say most of your workloads are running serverless? And how would you characterise the developer experience when it comes to using serverless at massive scale, both in terms of volume, but also in terms of, I guess, complexity and also at the organisation level?


Luc van Donkersgoed: 05:48  

Yeah, well, there's a lot of organisation and a lot of culture that you need to build around serverless, or becoming an engineering company around serverless. So one thing we have is the Cloud Centre of Excellence, which is a platform team that provides services around AWS, but they also control a number of, let's say, mandates. One of them is the fact that we cannot spin up an EC2 instance, right? They block it, and there are no exceptions to that rule. Another one that I really like is that we cannot create any IAM user; we can only use roles. And that also forces you to apply best practices. There are a number of other constraints that they implement as well, but they don't come up with the constraints, because that would be a sort of top-down approach, maybe a bit iron-fisted, and that's not really the approach that works. So what we did instead is build a community of the engineers, all of the serverless engineers in the organisation, and we define charters. And charters are topics that we want to make decisions on, ones we want to standardise on. One of the examples is CI/CD. So before, everybody used their own kind of CI/CD; some used GitLab, some used Jenkins, some used CodeBuild and CodePipeline, others used GitHub. And there was the sense, like, we need to standardise around this topic, so that everybody uses the same technology and we can share knowledge and share best practices and so on. So we started the charter with a number of engineers, and they came up with a list of requirements, like what should our CI/CD tooling be able to do, came up with a shortlist of tools, compared them, and then came up with the best solution for our company, which happened to be GitHub Actions. But when they got to that result, when they made that decision, then it became mandatory. So now every new team that is building software has to use GitHub Actions. That's a really important part. 
And I think the culture and the organisation make sure that those decisions are made by the engineers, based on the values and the requirements that they have. It gives all the engineers the opportunity to provide input and to make those decisions. But when the decision is made, then, well, nothing is final, but it's pretty final, at least for a couple of years, so that we standardise around it. And one or two additional examples: from an infrastructure-as-code perspective, we standardised on the CDK. So we don't have Pulumi, we don't have Serverless Framework, we don't have raw CloudFormation. Using CDK and infrastructure as code is of course also mandatory. We standardised around a dev, test, acceptance, and production workflow, and currently we're looking at standardising around observability tooling.
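The EC2 and IAM-user mandates Luc describes are typically enforced with Service Control Policies at the AWS Organizations level. As a rough sketch of what such a policy could look like (the statement names and exact action list here are illustrative assumptions, not PostNL's actual policy), an SCP that blocks instances and IAM users might be:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyEc2Instances",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances"],
      "Resource": "*"
    },
    {
      "Sid": "DenyIamUsers",
      "Effect": "Deny",
      "Action": ["iam:CreateUser", "iam:CreateAccessKey"],
      "Resource": "*"
    }
  ]
}
```

Attached to an organizational unit, a deny statement like this overrides any permissions granted inside member accounts, which is why "no exceptions" can actually be enforced.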


Yan Cui: 08:43  

Okay. So I guess in this case, you know, in terms of hiring, you've got all of these decisions, these frameworks in place already, which I think does help with education and onboarding if every team is doing the same thing. But in terms of, I guess, hiring and bringing new joiners up to speed, do you guys have some kind of process in place already? Because obviously, serverless is still not as widely adopted in the wider developer community. And I've heard from many places that, you know, hiring has been a real pain because it's hard to find people with existing experience using Lambda and other serverless technologies at decent scale. How do you guys approach this problem of education, but also recruitment?


Luc van Donkersgoed: 09:29  

There are a few things about that one. From a recruitment perspective, I think reaching out and giving back to the community is a very important part. So for example, the fact that a few of us were in London at the EDA Summit to learn, but also to talk to other people and tell what we're doing at PostNL and the challenges and solutions that we have, but also at meetups and community days and conferences to share our learnings and allow others to grow based on our experiences, is super important. But it also really positions PostNL as, let's say, an interesting company to work for. Because we have interesting challenges and problems, and the scale of processing a billion Lambda invocations in a day, or having, for example, our IoT department track, I think, over 300,000 roll cages, where they are and what their status is and how they are moving. Those are super interesting problems to work with and to solve. But it's important for us to let the world know that these are the kinds of problems that we're solving and how we're solving them, from a recruitment point of view as well. And I think we're actually doing pretty well. So we often hear people saying, well, we want to work at PostNL, because we've heard about the great things you're doing. I myself am an example of that, because I joined PostNL because other people told me that PostNL was an interesting company. And also the fact that we've mandated the use of serverless is very attractive to a lot of engineers. Because if you're saying, well, we're a tech company, we do some serverless, but you might also end up in a team where you're still doing EC2, or maybe you're doing Kubernetes or whatever, then that is not as attractive to some types of engineers as when you're saying we do serverless, and you will only be doing serverless. So that story helps. 
But then there's also, of course, the education part, because, well, finding people who actually have all the experience that we need is one thing, but building that experience is another. And one of the things that we did is we started a training programme. It's called Tech Together, where we have informal learning sessions where we just share our learnings internally, but also formal learning like workshops and organised training. And in that we also have a partnership with AWS called their Skills Guild programme, where, for example, every half year they do a three-day workshop on advanced architecting in serverless. So all the new engineers that we get, we send to that workshop, and they just spend three full days building stuff with Lambda, SQS, SNS, EventBridge, Kinesis, Step Functions and all the other serverless tools.


Yan Cui: 12:09  

Okay, very interesting. And I guess now that DAZN has just dismantled the Amsterdam office, that's quite a nice pool of engineers you can tap into in the Amsterdam area as well.


Luc van Donkersgoed: 12:21  

Yeah, definitely. So yeah, if you're hearing this, and you're interested in working at PostNL, then make sure to reach out.


Yan Cui: 12:27  

Yeah, I'll put the links to your careers page in the show notes as well, so anyone who's interested can just go there. And I guess you're hiring not just in Amsterdam and the Netherlands, you're also hiring, potentially, remotely?


Luc van Donkersgoed: 12:41  

That is a bit of a topic. So we'd like to, but as a very large enterprise we're also bound by some procedural stuff and legal stuff. So I think the current limitation is you have to have Dutch residence to be able to apply.


Yan Cui: 12:54  

Okay, right, right, gotcha.


Luc van Donkersgoed: 12:56  

The residence part is actually important, because we do have a very international culture, especially in the engineering community. So it's not like you have to speak Dutch; you don't have to be Dutch, you don't even have to feel Dutch. The only thing is you have to live in the Netherlands.


Yan Cui: 13:12  

And also, I guess, the Netherlands has got a really good immigration policy, especially for highly skilled workers. So that was one of the reasons why DAZN picked Amsterdam as the location for its next office a few years ago; it's because it is so easy to hire international talent into the Netherlands and give them work permits and so on. So I want to switch to maybe talk about how you're managing your AWS accounts, because I imagine you've got lots and lots of accounts, and I imagine you're using AWS Organizations. So are there any tools that you're using to, I guess, make sure that every account is created the same way, that you've got a consistent baseline infrastructure for, I guess, your own platform on top of the base AWS constructs?


Luc van Donkersgoed: 13:57  

Yeah, so I can't tell you all the details, because this is not a team I am in, but the Cloud Centre of Excellence, the CCOE I mentioned before, is actually an engineering team. And they built the applications and built the integrations that do exactly this. Well, the SCPs that I mentioned before, and the organisations and the organisational units, and AWS SSO, and the integration with our central SSO provider, is all their responsibility. For example, they deploy stacks and stack sets into our accounts. They also, for example, bootstrap all of our accounts for the CDK. And they make sure that we're all following the same standards. But how they actually implemented that, oh, I can get one of their engineers on your podcast as well. They can tell you more about that than I can.


Yan Cui: 14:45  

Okay, sure. Sounds good. And I guess in that case, maybe what we can do is talk about your event-driven architecture, because I've heard you talk about this several times, and obviously doing massively serverless implementations of event-driven architectures at PostNL scale has some really interesting challenges. So maybe let's start by just painting us a picture of what your event-driven architecture looks like.


Luc van Donkersgoed: 15:09  

Yeah, sure. So PostNL as a logistics company is sort of, by default, an event-driven business, because we're only responding to external inputs. So a consumer might want to send a parcel, and they register that parcel with one of our APIs, one of our websites. So that is an event, parcel registered. Then we pick it up with a van, and that's an event, parcel picked up. And then we bring it to a sorting centre, and we offload the parcels, and that's an event like parcel delivered to sorting centre. And then we move it down the sorting lines, we scan the parcels in our automated systems, and whenever a parcel passes a camera, that's an event like parcel detected or observed, and you can think of a million other events in that system. But again, it's all pretty much event driven. And I think, I wasn't at PostNL at that time, but I think that also helped shape the idea that PostNL should build an event-driven application landscape, because it's the best fit for our business. And another part that is taken into consideration there is the fact that we're very seasonal, and seasonal not only on a yearly basis, like December is busier than June, but also on a weekly basis. Tuesdays are busier than Mondays; on Sundays, there's nothing happening. And on a daily basis itself, there's a lot of events at four o'clock and five o'clock, and not so many events at 9pm, right? So everything is seasonal. And that makes it very logical to have, well, a serverless event-driven architecture as well, because it just follows the events and scales with the events, and you only pay for the events that you actually use, and you don't pay when you're not using them. So that's, I think, the rationale of our landscape. 
And then if you look at the landscape itself, we have a large number of applications and application teams, I think 30 or 40 of them, and they all have their own responsibilities, like one is responsible for interaction with the consumer, one is responsible for the sorting lines, one is responsible for planning the trucks. I mean, it's actually kind of obvious or kind of logical if you think about a logistics process; we all see the PostNL trucks driving down the highway, and somebody has had to make a plan on where each one is going and where it needs to be and which parcels need to be on it. So there's an application responsible for that. But since they are all largely event driven, you do need some sort of integration between them. And what doesn't work, and that's not something that we found out ourselves, but something that was found way back in the 70s and 80s, is creating a mesh where all of the applications talk to each other constantly. You get this web of integrations, and nobody knows who's actually talking to whom and who's responsible for what and what those events look like. And also you don't know what will break if you change something. So bad idea, that's not what you want. Instead, you centralise it. And classically, the solution would have been an Enterprise Service Bus, or ESB. There's a whole history on the ESB that we will not go into now. It makes a lot of sense from a historic perspective, but it doesn't make so much sense anymore, because the main issue with an Enterprise Service Bus is the fact that all the integrations on that service bus need to be built. So you have to have a team that actually connects sending applications to consuming applications. And that's one of the issues that we ran into, and one of my main responsibilities is to solve that with a central event broker. 
So the solution we came up with, to make sure that we can build those integrations a lot faster, is to build a central event broker, a self-service event broker, where all of our producing applications can register their events and publish those events onto the event broker, and consumer applications can discover which events are available on that event broker, subscribe to those events, and have them streamed to the applications that need to receive them. And now you don't need a central integration team anymore to manage those integrations, because it's a self-service portal. And all of those applications are really decoupled, because the producers only need to send to the event broker, and you can get a very dynamic landscape of events. And there are a lot of additional features that you can build by having this central event broker pattern as well, because it allows you to, well, centrally monitor and maintain all those integrations, also see what kind of patterns emerge, and to control the reliability and things like replayability and retries in a central place, and really maintain the stability of your landscape that way as well.


Yan Cui: 20:01  

So I guess, in terms of your account topology, you have one account that hosts the central EventBridge bus, and then you also have satellite accounts where different teams and different services have got their own local EventBridge bus, and you're using EventBridge event forwarding so that local events are forwarded to the central bus? Or do you mostly push directly from the application to the central bus? How does that sort of, I guess, connectivity work?


Luc van Donkersgoed: 20:32  

That's a good question. And the main philosophy that we follow is that we want to offer a very low-friction kind of integration. So that means we don't want to tell our users what kind of integration pattern, or at least integration technology, they should use; we don't want to force EventBridge onto them. So what we do instead is, for every producer of events that registers their events with us, we allow them to choose the technology that they want to use to deliver those events. So that might be SQS. That might be SNS. But actually, a very common pattern is also HTTPS, which I personally am not the biggest fan of, because I think, well, HTTPS is of course a very open standard, but it's also not the most efficient or not the best integrated into AWS. But we also have a lot of internal customers that are not on AWS, so they just choose HTTPS as their protocol; then we deploy an endpoint with, of course, all the authentication and authorization configuration on top of that, and then they can publish to that endpoint. But other applications who are in AWS might choose SQS, and then we deploy a queue and allow their AWS account to publish there, and then it's just IAM integration. So it's really open to whatever they need. And we do the same at the consuming end. So if the consumer has, for example, their own EventBridge, then we just publish the messages onto their EventBridge. If they have an HTTPS endpoint, we do an HTTPS POST to that endpoint. And we offer a lot of integrations that way.
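Because the broker meets producers and consumers on whatever protocol they choose, delivery can be modelled as a dispatch on the subscription's target type. A minimal Python sketch of that idea (the target types, field names, and handlers here are illustrative assumptions, not PostNL's code, and the senders are stubs standing in for real HTTPS/SQS/EventBridge calls):

```python
# Minimal sketch of protocol-agnostic delivery: each subscription records
# its target type, and the broker dispatches to the matching sender.
def send_https(subscription, event):
    # In a real broker this would be an authenticated HTTPS POST.
    return f"POST {subscription['endpoint']}: {event['type']}"

def send_sqs(subscription, event):
    # In a real broker this would be an SQS SendMessage call.
    return f"SQS {subscription['queue']}: {event['type']}"

def send_eventbridge(subscription, event):
    # In a real broker this would be an EventBridge PutEvents call.
    return f"EventBridge {subscription['bus']}: {event['type']}"

SENDERS = {"https": send_https, "sqs": send_sqs, "eventbridge": send_eventbridge}

def deliver(subscriptions, event):
    """Fan an event out to every subscription, using each one's protocol."""
    return [SENDERS[sub["protocol"]](sub, event) for sub in subscriptions]

subs = [
    {"protocol": "https", "endpoint": "https://consumer.example/events"},
    {"protocol": "sqs", "queue": "consumer-queue"},
]
results = deliver(subs, {"type": "ParcelDelivered"})
```

The point of the pattern is that producers and consumers never see each other's protocol choice; adding a new delivery technology means adding one sender, not touching any integration.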


Yan Cui: 22:15  

Okay, I see. So now I see what you mean when you say event broker, as opposed to just saying event bus. I see. Okay. So I guess in this case, the fact that you've got so many different options developers can use feels like it's also going to make certain things that you mentioned harder to implement at the central level, things like schemas and schema validation, and just, I guess, having a schema registry that everyone can access, especially for people that are not on AWS already. So how do you go about approaching these kinds of problems?


Luc van Donkersgoed: 22:47  

Yeah, one of the main ways in which we sort of diverge from a standard EventBridge or event broker implementation is the fact that we do a lot of validations, schema validations, on our events. So EventBridge by itself can only detect the events that were published, and maybe detect the schemas that were published on it, or you have to do it yourself. But it cannot stop messages from being published if those messages don't match a specific schema. And that is a feature that we offer. And that's really important to us, because for the consumers that connect to our central event broker, we want to be able to make the promise, to offer the guarantee, that the events they subscribe to will always match the schema that they saw and agreed upon when they created that subscription. And the way we do that is by forcing the producer to tell us what the schema of their event will be, and then validating that all the events published to that endpoint actually match that schema. And this is maybe the biggest part of the work that we do, because there are not that many ready-to-go tools in AWS to build this. So this is actually where there's a lot of Lambda code, a lot of Lambda functions that do those validations, that process the rejected messages and make sure that they get returned to the sender. There's one exception where we can use a lot of AWS services, and that's actually the HTTPS POST that we just discussed, because API Gateway does have built-in schema validation, and we use that extensively.
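PostNL implements this validation in Lambda functions against full JSON schemas; as a much simplified, stdlib-only sketch of the reject-before-accept idea (not their code; the schema format and field names are invented for illustration):

```python
# Simplified publish-time schema validation: required fields must be present
# and every field must have the registered type. Real JSON Schema validation
# is far richer; this only illustrates rejecting events before they enter
# the broker, so consumers never see an event that breaks the contract.
TYPES = {"string": str, "integer": int}

def validate(schema, event):
    errors = []
    for name, spec in schema["fields"].items():
        if name not in event:
            if spec.get("required"):
                errors.append(f"missing required field: {name}")
            continue
        if not isinstance(event[name], TYPES[spec["type"]]):
            errors.append(f"wrong type for field: {name}")
    return errors

def publish(schema, event):
    """Accept the event only if it matches the producer's registered schema."""
    errors = validate(schema, event)
    if errors:
        return {"accepted": False, "errors": errors}  # returned to the sender
    return {"accepted": True, "errors": []}

# Hypothetical parcel-event schema for illustration.
schema = {"fields": {
    "barcode": {"type": "string", "required": True},
    "weight_grams": {"type": "integer", "required": False},
}}
ok = publish(schema, {"barcode": "3SABCD123", "weight_grams": 250})
bad = publish(schema, {"weight_grams": "heavy"})
```

The same check done by API Gateway's request validation for the HTTPS path is done here in code for the SQS/SNS paths, which is why that path needs the extra Lambda work Luc mentions.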


Yan Cui: 24:26  

So I guess that also brings up an interesting question about versioning. In that case, if I was to introduce a breaking change, like renaming an attribute on my event, I guess the schema validation doesn't allow me to break the consumers, which I think is a good thing, something that more people should be doing. What's the process in that case for that team to introduce the change they want to make? For, I guess, a simple renaming, you can just add a new attribute with a new name and keep the old one around. But what about if you want to make a more structural change to the schema of the event?


Luc van Donkersgoed: 25:00  

That's another very good question. Of course, that's also one of the central topics of the event broker. So the way we structured our application is, you could say, a data layer and a control layer, or what we say is we have the back end, which is the system that actually processes and transports messages, and we have a management application where users go to configure things. And the management application is like, well, your basic REST API and business logic setup. So it has Cognito for user management, API Gateway for API routes, DynamoDB to store data, and Lambda functions for logic, and there are some Step Functions and some SQS in there, a pretty basic setup if you're familiar with serverless. And all the business logic resides in that application. So users, well, they also have a front-end website, but that's just a thin front end for this API. So they go to the website, and they register their events there. Well, when it's a new event, they're free to define whatever they want, because nothing can break. But they should be able, as you said, to make updates to those events. And that's what we call non-breaking changes, or minor updates. And those are adding fields, making optional fields required, removing optional fields, maybe reducing the range of values that you can support. And we have a whole list of things that we consider minor, non-breaking changes. And they are allowed to make those and publish them, because you're not breaking the contract, you're extending the contract, and that's fine. Part of what you need to do to make this work is to tell the consumers that they should be pretty lenient in what they accept, right? So you're saying, this is what we're going to send, we might send more, but we're sending at least this, and they should be able to deal with that. But that's generally not a problem. 
But as you said, there are also situations where you have breaking changes, where you're significantly changing the structure of your event. So what we then do is, if you try to apply a breaking change on an existing version, we just give you a 400 error: you're not allowed to do this, you have to create a new major version of the event. So you get, I don't know, state change v1, and then you have to create a state change v2, and the very first version that you make is always v1. So we always apply a major version, even if you only have one. And that new version can then be structured completely differently; there doesn't have to be any relation between v1 and v2. But as a producer, you're now responsible for sending both v1 and v2 as long as there are consumers subscribed to v1. And there's a common saying in event-driven architectures that producers shouldn't know their consumers, or shouldn't have to know their consumers. And I want to add some nuance to that statement, because it's true from a technical standpoint. As a producer, you shouldn't care what kind of application is consuming that event or how that application is built. But from a business and from a procedural perspective, it's super important to know who your consumers are, especially if they are consuming a specific version and you might want to choose to stop sending that version. So what we do in the event broker, and in the Management API on top of it, is visualise who the consumers of this event are. But also, since we're all within PostNL, what is their contact info, what is their email address? So you can reach out to them and say, hey, I introduced v2 and I want you to migrate to v2. Can I help you in that migration? Can we do that together? Because I, at some point, want to stop sending v1, because I don't want to maintain two versions. So there's a lot of business and process thought behind that as well.
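The migration window Luc describes, where a producer keeps emitting v1 alongside v2 until the last v1 subscriber has moved off, can be sketched as a simple rule over the subscription list (illustrative only; the subscription shape and contact fields are assumptions, not PostNL's data model):

```python
# Sketch of the dual-publish rule: a producer must emit every major version
# that still has subscribers, and may retire a version only once none remain.
def versions_to_publish(registered_versions, subscriptions):
    """Return the major versions the producer is still obliged to send."""
    live = {sub["version"] for sub in subscriptions}
    return [v for v in registered_versions if v in live]

# Hypothetical subscriptions, including the contact info the broker exposes
# so producers can reach out and coordinate the migration.
subs = [
    {"consumer": "sorting-app", "version": "v1", "contact": "sorting@example"},
    {"consumer": "planning-app", "version": "v2", "contact": "planning@example"},
]
before = versions_to_publish(["v1", "v2"], subs)

# Once the sorting app migrates, v1 drops out of the producer's obligations.
subs[0]["version"] = "v2"
after = versions_to_publish(["v1", "v2"], subs)
```

This is the procedural side of "producers shouldn't know their consumers": the broker, not the producer's code, tracks who is still on each version.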


Yan Cui: 28:46  

Yeah, I like that caveat. Because what my code doesn't have to know, I may have to know, because I'm a developer; I'm the one that's building these things. So I guess in that case, in terms of enforcement, making sure that I don't introduce things that are considered breaking changes, or maybe things that I didn't think were breaking but are for my consumers. Are you using anything like the AsyncAPI spec or consumer-driven contract testing, that kind of thing, to make sure that when I, as the event publisher, am making changes, I'm always going to be alerted before I push my change out, hey, your change is going to break the consumer? Or do you rely entirely on the broker to tell you at runtime that when you try to send this event, it gets rejected, and then it gets sent back to me with some notification, some note, that this is a breaking change?


Luc van Donkersgoed: 29:37  

So we don't do consumer-driven contracts or AsyncAPI; we simply have business rules on what we consider breaking and non-breaking. But what's important there is to see the distinction between the front end, the Management API, and the back end. So in the Management API, as our control layer, we do validation before we actually roll those events out into our backend systems. And the validation simply rejects the change following those business rules. So we came up with a list; I think, well, it maybe took me a day to think of all the business rules. It's not, in the end, that complex, but it's things like: if a field is marked as required, it can never become optional, because that's breaking, the consumer is expecting it. If a field's type is marked as a string, it can never become an integer, because the consumer is expecting it. And there are a few other rules there. It becomes more interesting when you're looking at things like ranges and lists. For example, in JSON Schema, you can define a field as being either a string or an integer, so a list of supported types that you can add to the schema; there's nothing wrong with that from a technical point of view. So if you have a type list that says it's either a string or an integer, and you change that to it's only an integer, that's not a breaking change, because you're telling the consumers, I'm going to send you either a string or an integer, and then from that moment on you're only sending integers, while your consumer is already capable of dealing with that. So it's been a bit of a search to figure that out, but I think that ruleset is working very well for us.
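The rules Luc lists can be captured as a compatibility check between the old and the proposed schema. A simplified sketch of those business rules (the field model is an assumption for illustration, not PostNL's schema format): a required field may not become optional or disappear, a field's set of allowed types may shrink but never widen, and adding fields or removing optional ones is fine.

```python
# Simplified "is this a breaking change?" check following the rules from
# the episode. Returns the list of violations; an empty list means the
# change is minor and may be published on the existing major version.
def breaking_changes(old, new):
    problems = []
    for name, spec in old.items():
        new_spec = new.get(name)
        if spec["required"] and (new_spec is None or not new_spec["required"]):
            problems.append(f"{name}: required field became optional or was removed")
        if new_spec and not set(new_spec["types"]) <= set(spec["types"]):
            problems.append(f"{name}: allowed types widened")
    return problems

old = {"id": {"required": True, "types": ["string"]},
       "weight": {"required": False, "types": ["string", "integer"]}}

# Narrowing weight to integer-only and adding a field: non-breaking.
minor = {"id": {"required": True, "types": ["string"]},
         "weight": {"required": False, "types": ["integer"]},
         "route": {"required": False, "types": ["string"]}}

# Changing id to integer: breaking, would get a 400 from the Management API
# and require registering a new major version instead.
major = {"id": {"required": True, "types": ["integer"]}}

minor_problems = breaking_changes(old, minor)
major_problems = breaking_changes(old, major)
```

Note the direction of the subset test: narrowing a type union keeps every event valid for existing consumers, while widening it could send them something they have never agreed to handle.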


Yan Cui: 31:16  

Okay. And what about testing the event-driven applications themselves? One of the challenges people often have is that testing EDAs pretty much relies on some kind of polling: you fire events, you expect some side effect to happen, and then you have to poll and see if the row gets written to the database or some API gets called. What kind of approach have you guys developed towards testing event-driven applications more easily?


Luc van Donkersgoed: 31:47  

Well, easily is a good one... well, adding the word easily to the question makes it a very difficult question.


Yan Cui: 31:53  

Okay. Maybe easilier instead of easily.


Luc van Donkersgoed: 31:58  

Because testing serverless environments is not easy. I think it's one of the trade-offs, right? You can get into a discussion where people say serverless is too hard, and this doesn't work, and that doesn't work, and I don't have all of these things that I do have when I run it on EC2, and this is one of those topics. Of course, the favourite answer of any architect to any question is: it depends on what you should use. And so it is with serverless. You get a lot of the benefits that we discussed earlier about operations, about scalability, about decoupling and so on, but you do lose some benefits in how easy it is to test. So we are trying to find our way in standardising our testing, and I've given a talk about this a while back, I think you can find it online, in which I show how we create Lambda functions to test specific parts of our infrastructure. Component testing that, for example, puts a message on a queue and then verifies that a Lambda function has run and that the result of the Lambda function is what you'd expect. But I have to be honest, it is quite a lot of work to write those tests, and it's quite a lot of infrastructure that you still have to maintain, and I haven't found the ideal solution there. What I do want to add, one thing that is really helping us, is thinking about contracts. In an event-driven architecture you can define almost all of your integrations in the form of contracts: this is the literal event that I will send out, or this is the literal event that you will receive. And for APIs it's actually the same. If two applications integrate with each other and they agree on the contract, then you don't need to do an entire chain test in which all of these applications need to be tested together. You only need to test: does the producing application send out events following the contract, and the consuming application only needs to simulate events following the same contract.
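The contract idea at the end can be made concrete: both sides test against the same shared schema, so no chain test is needed. Here's a hypothetical sketch with a hand-rolled validator and an invented `ORDER_SHIPPED` contract; in practice you'd likely use a proper JSON Schema library rather than rolling your own:

```python
# A shared contract, owned by the producer, imported by both test suites.
ORDER_SHIPPED_CONTRACT = {
    "required": ["orderId", "shippedAt"],
    "properties": {
        "orderId": {"type": "string"},
        "shippedAt": {"type": "string"},
        "carrier": {"type": "string"},
    },
}

def validate_event(event: dict, contract: dict) -> list:
    """Return a list of contract violations; an empty list means valid."""
    errors = []
    for field in contract["required"]:
        if field not in event:
            errors.append(f"missing required field '{field}'")
    type_map = {"string": str, "integer": int, "boolean": bool}
    for name, value in event.items():
        spec = contract["properties"].get(name)
        if spec is None:
            errors.append(f"unexpected field '{name}'")
        elif not isinstance(value, type_map[spec["type"]]):
            errors.append(f"field '{name}' is not a {spec['type']}")
    return errors
```

The producer's test suite asserts that every event it emits passes `validate_event`; the consumer's test suite builds synthetic events from the same contract and feeds them into its handler. Neither suite needs the other application deployed.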


Yan Cui: 33:59  

Yeah, I agree with many of the things that you said. The testing of serverless applications in general, not just event-driven applications, is tricky. And it's one of the first things people ask when they come to serverless: how do I test this thing? And there's this, I guess, false dichotomy that you have to test either fully locally or fully remotely, and in between there is this massive drop-off in your feedback loop, which drives people crazy. So it's definitely something that needs to be addressed, and that's actually why I'm putting together a new course just focused on serverless testing, because with all my other materials, a lot of the time people just want an answer on how to do testing. I think this thing about testing just the contracts, the input and output for individual components in that event-driven application chain, is very useful. But I still hear a lot from some of my customers and students that there is value in testing the whole chain, because even though you're testing the individual components and you know the behaviour is correct, you're not really testing the configurations, like, for example, event patterns. Maybe your code is doing the right thing given the right event, but you've got a bug in your event pattern somewhere along the chain, and you won't know until someone realises in production that this thing never fires: why is that? If you can test the whole chain, then you can catch those kinds of problems earlier. I think that's where testing becomes really quite tricky when it comes to event-driven applications: for these long chains, it's really difficult to estimate a sensible timeout for how long you keep polling before you decide, okay, this is not working.
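The polling with a timeout that comes up here usually ends up wrapped in a small test helper. A minimal sketch of the pattern, with a hypothetical name and defaults that aren't from the episode:

```python
import time

def poll_until(condition, timeout=30.0, interval=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Call `condition` every `interval` seconds until it returns a truthy
    value, or raise TimeoutError after `timeout` seconds. In an end-to-end
    test of an event chain, `condition` would e.g. query the database for
    the row the chain should eventually have written."""
    deadline = clock() + timeout
    while True:
        result = condition()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        sleep(interval)
```

The hard part Yan points out isn't the helper, it's choosing `timeout`: too short and a healthy but slow chain fails the test, too long and a genuinely broken chain wastes minutes of every CI run.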


Luc van Donkersgoed: 35:35  

That's absolutely true. So, a few things. I think you're going to need almost all of the testing types that you can think of. For your Lambda functions, you're going to do unit testing. For the internal workings of your application or your service, you're going to do some sort of component testing. For your standalone service, you're going to do some service testing, contract testing, and maybe integration testing. For multiple services together, you're going to do some integration testing. And for an entire landscape, you're going to do some end-to-end testing. And the balance of what you need is, I think, determined by your appetite for errors, your appetite for failures. If you're building a webshop and one in 10,000 users gets a 503 error and because of that something doesn't work, that might be fine. But if you're in, I don't know, a hospital, and one in 10,000 medical devices stops working every now and then, that's a way bigger problem. In that scenario you're going to need much more testing, much more application testing, than when you're just running a webshop or blog or whatever. So yeah, it depends on your industry, it depends on your appetite, how much testing you need and how much time you're going to invest in building those tests.


Yan Cui: 36:54  

Yeah, that's true, that's absolutely true. And certainly those industries that are dealing with human lives tend to have much longer development and testing cycles compared to, you know, just putting out a new blog or a new website or mobile app, that kind of thing. Okay, I think we're coming up to time, and I really want to thank you for taking the time to go into a lot of detail about how PostNL is using serverless, but also how you're approaching event-driven architectures at a pretty big scale. Is there anything else that you want to mention before we go, anything that you're doing personally or PostNL is doing as a company?


Luc van Donkersgoed: 37:24

Yeah, one thing that we didn't touch upon, but I think is a very hot topic right now, is observability. And that's really a thing that we're focusing on: in distributed applications, how do you even understand what's going on? And I think that also relates to testing, because you can use your observability data to actually validate that your test works. And actually, that can give you more insight, and maybe more event-driven insights, than just polling and seeing, hey, did something arrive. Instead, you're pushing your observability data and responding like: oh, I saw this event happening, and that's exactly what I needed to know. Those worlds overlap, and I think that's really valuable. I'm a very big fan of OpenTelemetry, because it's an open format and because it allows you to link new services into it without locking you into a vendor. And what I'm really hoping to see in the next months, and maybe years, is more adoption of OpenTelemetry in AWS services and other Cloud providers.


Yan Cui: 38:26  

Yeah, that's actually something that I completely forgot to ask you about, observability. One of the things that I find really useful, that I use personally, is Lumigo, and one of the great things about Lumigo is that it supports a lot of event triggers for Lambda. So when you have even complex chains, like Lambda going to SNS, SQS, Lambda to EventBridge, Lambda to something else, you can see the whole chain. So when an end-to-end test fails, that's actually a really good time for me to figure out: okay, can I actually debug this problem? Because when you've got a test, you've got a controlled environment. It's a lot easier compared to production, where you've got 10,000 events per second: how do you even figure out which one failed and what's the cause of the failure? So Lumigo has been really good for me in development, but also in production as well. And one of the things that a lot of customers have asked from Lumigo is support for OpenTelemetry, so that they can get the data out of Lumigo into something else, which I think they do support now for containers. I'm not that deep into the conversations there, but I think for Lambda there are some specific challenges with getting the telemetry data into the OpenTelemetry format. But it's something that does seem to be getting a lot of traction. And I believe that AWS has been adding support for OpenTelemetry for CloudWatch, and maybe some of the other services, just that you have to get it through CloudWatch and then get it out into your OpenTelemetry-supporting system. Are you guys right now mostly using the native services, like CloudWatch Logs and X-Ray?


Luc van Donkersgoed: 39:59  

So X-Ray doesn't work for me. I would like it to, but it's too expensive, and it doesn't give me the insights that I need. It really flattens your data way too much, or you have to find one specific trace, but it's not the right solution for my problems. We do use CloudWatch quite extensively still, but mostly for aggregated insights: alarms on the number of Lambda function failures, or messages on an SQS queue, or the number of 400 or 500 status codes on APIs, those kinds of things. But that doesn't give you insight into specific problems that might hit a subset of your requests, or a subset of your customers, or a subset of your applications. So what we currently do is emit OpenTelemetry using the official OpenTelemetry SDK, we use the Python one, and then we emit that to a Kinesis data stream. And from Kinesis we read it to build some insights and analytics that we use internally, but we also forward those traces to Honeycomb, where we do our deep dives and dashboarding and so on.


Yan Cui: 41:13  

Okay, so I guess the trade-off there is that you're adding a bit of latency to invocations so that you can send data from your collecting agents to Kinesis. But I guess if you're mostly doing event-driven architecture, you're not dealing with user-facing latency, so I think 10 or 20 milliseconds is not going to be a big issue, I imagine.


Luc van Donkersgoed: 41:32  

It's single-digit latency. So it's generally five or six milliseconds added for the write to Kinesis. And that's also the reason why we chose Kinesis as the in-between buffer for our OpenTelemetry data. Because AWS does have a sort of out-of-the-box solution for OpenTelemetry, which is called ADOT, the AWS Distro for OpenTelemetry, but that actually runs a local server in your Lambda extension, and you talk to that server, and then the server offloads it to wherever you're sending it. But that offload is part of your Lambda invocation, and because it's a web request, it might actually add 200 or maybe 300 milliseconds, depending on where your observability backend is hosted. And that latency is added to every single Lambda invocation. The processing time is not even a problem, because you do it at the end of your request, but the cost is a problem. Because if we run a billion events, and each of those billion events adds 200 ms of latency, that's a lot of dollars that you pay in invocation time. So that's why we first offload to Kinesis in single-digit latency and then buffer from Kinesis every minute to our backend.
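The pattern Luc describes, a single cheap write on the invocation path and the slow export handled downstream, can be sketched roughly like this. The class and the stub client are hypothetical; PostNL's actual exporter plugs into the OpenTelemetry SDK and a real Kinesis client:

```python
import json

class KinesisSpanBuffer:
    """Batch telemetry records and flush them to Kinesis in one PutRecords
    call, so the invocation path pays one fast regional write instead of
    an HTTP export to a remote observability backend."""

    MAX_BATCH = 500  # PutRecords accepts at most 500 records per request

    def __init__(self, kinesis_client, stream_name):
        self.client = kinesis_client
        self.stream_name = stream_name
        self.records = []

    def add(self, span: dict):
        """Queue one span; flush automatically when the batch is full."""
        self.records.append({
            "Data": json.dumps(span).encode(),
            "PartitionKey": span.get("trace_id", "unknown"),
        })
        if len(self.records) >= self.MAX_BATCH:
            self.flush()

    def flush(self):
        """Send any queued records in a single PutRecords call."""
        if not self.records:
            return
        self.client.put_records(StreamName=self.stream_name,
                                Records=self.records)
        self.records = []
```

A consumer on the stream then forwards the spans to Honeycomb (or wherever) at its own pace, completely off the Lambda invocation's critical path.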


Yan Cui: 42:50  

I guess if they added support for Kinesis as a target in the AWS Distro for OpenTelemetry, maybe that would solve the problem?


Luc van Donkersgoed: 42:59  

Yes and no, because what that thing actually does is run a service called the OpenTelemetry Collector. That's a Go application, which has a startup time and everything, and that's the part that I do not want. So, because OpenTelemetry is an open format, we just wrote our own extension and our own exporter for OpenTelemetry, which does nothing more than batch the telemetry data and forward it to Kinesis. And what I would really love, and that's a discussion I'm having with AWS, is for Lambda to emit OpenTelemetry natively, so that I don't have to run an extension and I don't have to run a Kinesis data stream. But that's for the future.


Yan Cui: 43:41  

Okay, yeah, they have made quite a lot of changes in that particular space, including the ability to use extensions to ship your logs somewhere else and disable CloudWatch altogether, because a lot of people have complained about the cost of CloudWatch Logs. Plus they use something else anyway, so you're kind of double-paying for your logs with Lambda as well as whatever other thing you use. Okay, I think we're coming up to the hour now. So again, I want to thank you so much for your time, Luc. And I think I've found your talk on EDA and testing, so I'll put that in the show notes as well, along with the careers page for PostNL. With that, I want to thank you again, and hopefully maybe see you in person somewhere in Amsterdam soon.


Luc van Donkersgoed: 44:21  

Yeah, definitely. Thanks for having me. It was a great talk. And maybe we'll catch each other at re:Invent.


Yan Cui: 44:28  

I'm not going to re:Invent this year. Not quite ready to do long-distance travel just yet. When I went to EDA, I came back with Covid, so I'm a bit worried about travelling somewhere too far, where I can't just easily get home. But for now, I guess maybe we can catch up in Amsterdam at some point.


Luc van Donkersgoed: 44:43  

Sounds good. 


Yan Cui: 44:45  

Cool. Take it easy, man.


Luc van Donkersgoed: 44:45  

All right. Thanks, Yan. Bye. 


Yan Cui: 45:00  

So that's it for another episode of Real World Serverless. To access the show notes, please go to realworldserverless.com. If you want to learn how to build production-ready serverless applications, please check out my upcoming courses at productionreadyserverless.com. And I'll see you guys next time.