Real World Serverless with theburningmonk

#56: Serverless at TacoBell with Robbie Kohler

September 22, 2021 Yan Cui Season 1 Episode 56

You can find Robbie on LinkedIn here and links to his "This is my architecture" sessions with AWS below.


For more stories about real-world use of serverless technologies, please follow us on Twitter as @RealWorldSls and subscribe to this podcast.

To learn how to build production-ready serverless applications, check out my upcoming workshops.

Opening theme song:
Cheery Monday by Kevin MacLeod

Yan Cui: 00:12  

Hi, welcome back to another episode of Real World Serverless, a podcast where I speak with real world practitioners and get their stories from the trenches. Today, I'm joined by Robbie Kohler from Taco Bell. Hey man.

Robbie Kohler: 00:24 

Hey Yan, thanks for having me. 

Yan Cui: 00:27 

So Robbie, we've known each other for a while, and we've worked together on a few things now. Can you maybe just tell the audience about your role at Taco Bell, and I guess Yum!, your parent brand? And what does your team do?

Robbie Kohler: 00:40  

Yeah, sure. So I am currently the director of software engineering at Taco Bell, focusing on e-commerce engineering. Taco Bell is part of Yum! Brands, which also owns KFC, Pizza Hut, and Habit Burger. Yum! is huge. There are over 50,000 global restaurants. And at Taco Bell domestically, we have over 7,000 restaurants in the United States, and we serve over 42 million customers per week. My team's main focus is on the kind of e-commerce and digital products around that. Personally, I've been in software since 2004, so I've seen a lot of changes. I'm really happy to be kind of on this cloud-native and serverless bandwagon now.

Yan Cui: 01:27  

Um, so I guess that's a really interesting talking point there, in terms of the traffic you guys have and how many customers you're serving per week. Let's maybe go into that in more detail later. Can you maybe tell us how Taco Bell is using serverless technologies as a whole, and how it's been working out for you?

Robbie Kohler: 01:46  

Yeah, so I'm actually pretty proud to say Taco Bell is a serverless-first dev shop. That means anything new we're building, we're really looking at it through a serverless-first lens. How can we use AWS managed services? How do we make our lives easier? You know, it's been a journey to get to where we are now. Obviously, every company starts in a different way. We actually started in, I would say, early 2018. I joined in late 2017, and we had this really huge strategic initiative to build this new system. And it was really important because at the time, you know, we were thinking about doing our kiosk initiative, delivery aggregators were coming up, and we really needed a way to get our menu data out, which is actually very complex. If you think about it, there are over 7,000 restaurants, you know, millions of pricing rules, tax rules, store hours, all this kind of master data that makes up a menu. And we really wanted to build something that we own, that we can iterate on, where we can move very quickly. So we started out building the system. That was my first project here. We started building it out kind of last-gen, I guess you can call it, on EC2 servers, with these traditional, you know, ETL and data integration tools. But there was a really specific part of the system that was kind of worrying me about the scale. Like we said, these files are big, you know, we're connecting all these different APIs. So at that point, I started looking at serverless. I heard a lot about it. Like a lot of people, I didn't really get it. I mean, there are servers. So I had to do a deep dive and understand it, and, you know, I was like, okay, I think I want to try this in at least one part of the system, we'll see how it goes. And funny enough, my background is in .NET and C#, and at the time, they started supporting .NET Core on Lambda.
So I'm like, okay, well, this maybe is not the exact best practice, but you know, this will work. I can take what I built here that works, it generates these menus, but I need it now to scale. So we literally did a PoC, we popped it in, you know, it was kind of the Wild Wild West at the time. So, like a lot of things, we didn't know how it was gonna work. But thankfully the community was great, people like you had articles, and we can kind of dig into that a little bit more. But that was a huge success. We don't have to worry about the scaling. It scales up, you know, it's still talking to an RDS back end, but you know, the actual compute and memory, and being able to spin up to do all these menus very quickly. Because when something changes, like the store hours change or there's a price change, we need to get it out to all the digital channels as quickly as possible. And I really didn't want to worry about an EC2 server crashing, or even containers and stuff like that in the back end. So that was the first dabbling in serverless, and from then it just kind of cascaded to where basically every new system is now serverless. And that first system started out majority not serverless. We're starting to move that system to become fully serverless using Step Functions and things like that, instead of these EC2 servers and kind of monolithic code.

Yan Cui: 04:52  

So I guess your strategy is very much the strangler pattern, where you start with one component that's low-risk, you know, go serverless, try it out, learn how to do things in a serverless way, and then gradually, as you gain more confidence and become more accustomed to the way you do things with serverless, you start to tackle other parts of the system, and eventually, you know, some of the more business-critical components. I remember when we were working together, you were working on some of the middleware stuff, and that was pretty much all fully serverless. Is there anything you can tell us about that system? How it's put together? And I guess, what services do you use to make the whole thing work?

Robbie Kohler: 05:33  

Yeah, so like I said, we started off on that menu middleware system that I described. We got some success there, we built our confidence as a team. And then it came down to building a new system that accepted orders. These can be orders from delivery aggregators, they can be orders from any kind of third-party system or even internal, where an order is basically, you know, a fully paid order that needs to get processed, and the food needs to be made in our stores. So we thought, okay, well, we need to build this new system. Again, we could do it the last-gen way, we could look at containers, EC2. But we already had that success, and we wanted to build upon that. So we decided to make this a fully serverless, 100% no-VPC kind of project. What this system does is it takes those inbound orders from different order sources. And we actually use Step Functions here. Step Functions, again, we're really, really big fans of it for the kind of business-critical workflows. When we used to write Lambdas, we probably made them way too monolithic. We probably made them so there was, you know, too much logic within the Lambdas. So we started to break them up into very fine-grained Lambdas and use Step Functions for the orchestration. What the system does is process these orders, it's running various rules. We even have callback task tokens, where maybe the order kind of pauses for a little bit. And if it's a delivery order, the driver is going to drive close by, and that's going to trigger another event that they're close by. And that goes back in via EventBridge and kind of starts up the step function again. So it's really easy to onboard new integrations with the system, and it's really easy to know what's going on, because we're using Step Functions.
So we've kind of, you know, evolved in how we use serverless and managed services to this new order middleware platform.
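The callback pattern described here, where an order pauses until a "driver nearby" event arrives via EventBridge, maps onto Step Functions' `.waitForTaskToken` service integration. A minimal, illustrative state machine fragment might look like the following; the function and state names are invented for the sketch and are not Taco Bell's actual workflow:

```json
{
  "Comment": "Illustrative fragment only, not the actual order workflow",
  "StartAt": "AwaitDriverNearby",
  "States": {
    "AwaitDriverNearby": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
      "Parameters": {
        "FunctionName": "store-task-token",
        "Payload": { "token.$": "$$.Task.Token", "orderId.$": "$.orderId" }
      },
      "TimeoutSeconds": 3600,
      "Next": "ReleaseOrderToStore"
    },
    "ReleaseOrderToStore": { "Type": "Pass", "End": true }
  }
}
```

The invoked Lambda would store `$$.Task.Token` alongside the order; when the driver-nearby event comes through EventBridge, a consumer calls `SendTaskSuccess` with that token and the paused execution resumes.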

Yan Cui: 07:28  

And I guess, being a middleware platform, you also have to do a lot of integration with third-party services, and maybe internal services as well. So I guess this is where Step Functions retries and things like that are also quite useful. Is that what you guys are using Step Functions for as well, you know, taking the retry out of Lambda and putting it into the state machine itself?

Robbie Kohler: 07:49  

Yeah, so you can imagine in this system there are internal and external services that we're communicating with. Again, we could have done this in Lambda, we could have done it, you know, kind of the older way, where it gets really hard to reason about when you're sending things back and forth asynchronously. So Step Functions really was nice here, because we could put retries on critical parts of the workflow, we can handle different types of errors, and we kind of open it up so we can eventually do things like a circuit breaker pattern if we needed to. But, you know, we needed to get it out pretty quickly, so we started out with just retries, and that helped to smooth out a lot of the issues. We noticed when we started doing load testing, and we started seeing a lot of load, you know, once you run this thing millions of times, you're gonna see things that don't normally happen, you're gonna see issues. So retries actually, for the most part, you know, 99% of the time, it's gonna work after a few retries. So that's been a very big success there. Then, kind of interesting, oddly enough, we ran into some, not necessarily challenges, but we were asking ourselves, how do we test this system? We have APIs now that we don't control on the inbound, and we have APIs we don't really control on the outbound, and we have the system in the middle, the order middleware. But you know, we really wanted to test, does this thing we built scale? And also just kind of end-to-end, you know, happy-path testing. We need to make sure, because it gets pretty complex when you have these different workflows. I remember talking to you about it, and we were kind of thinking about different ways we could do this.
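The per-state retries and error handling mentioned above are declared on the Task states themselves in the state machine definition. A hedged, illustrative fragment, with the resource ARN and state names invented for the sketch:

```json
{
  "CallPartnerApi": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:call-partner-api",
    "Retry": [
      {
        "ErrorEquals": ["States.Timeout", "Lambda.ServiceException"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2.0
      }
    ],
    "Catch": [
      { "ErrorEquals": ["States.ALL"], "Next": "HandleOrderFailure" }
    ],
    "Next": "ConfirmOrder"
  }
}
```

The transient failures that only show up under load are exactly what the exponential-backoff `Retry` block absorbs, while the `Catch` routes anything unrecoverable to a failure-handling state.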
And what we did, it became really, really easy to mock out, you know, the left side and the right side, if you think of the diagram. So with API Gateway, you know, we use Lambda and we just very simply mocked out all the different kinds of inputs and outputs and calls. And when you spin up these temporary stacks in your CI/CD environment, you're actually spinning up these fake mock APIs on both sides. And you can really simulate anything that's actually going to happen. You know, you can take it really far, or you can do it very simple, you know, just a small delay and things like that. You kind of give yourself that ability there. But what's nice is that you can then do your end-to-end testing, you can do your happy-path automation, you can pound simulated load against those mock APIs, and then you can try to run, you know, let's run a million orders and see what happens with Dynamo and API Gateway. And of course, you know, we all thought, well, this is gonna scale, you know, this is what it's made to do. But we still caught errors, we still caught places where we needed retries. So it was still very, very valuable to do that, and we're glad we did it. And like a lot of things with Lambda, serverless, managed services that are new, it's a little Wild Wild West at first. And I'm really thankful the community is great, people like you are there, right? You know, it's a joke around here, every time we're doing something, you kind of write an article around the same time that is exactly what we're working on. Actually, with that first system back in 2018, the menu middleware, I was like, how do I do secrets? Like, how do I get database connections? It was annoying me because I don't want to do environment variables.
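The mock APIs on the "left side and the right side" can be as small as a Lambda-style handler returning canned responses, with an optional delay to simulate a slow partner. A rough Python sketch of the idea; the routes and payloads are hypothetical, not the real order contract:

```python
import json
import time

# Canned responses keyed by request path. Entirely illustrative.
CANNED_RESPONSES = {
    "/orders": {"status": "ACCEPTED", "orderId": "mock-123"},
    "/menu":   {"status": "OK", "items": []},
}

def mock_api_handler(event, context=None, delay_seconds=0.0):
    """Stand-in for a third-party API during end-to-end and load tests.

    Returns a canned response for the requested path, optionally sleeping
    to simulate a slow downstream dependency.
    """
    if delay_seconds:
        time.sleep(delay_seconds)
    path = event.get("path", "/")
    body = CANNED_RESPONSES.get(path)
    if body is None:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}
    return {"statusCode": 200, "body": json.dumps(body)}
```

Deployed behind API Gateway in a temporary CI/CD stack, one of these on each side of the middleware lets you run happy-path automation and load tests against predictable endpoints.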
We had so many of these Lambdas we were writing, and it just seemed off, you know. And the first article I ever read of yours was about using SSM, about how you should write some kind of middleware, because, you know, this is a cross-cutting concern that you're going to use all the time, so figure out a way to do it once. So we wrote our own little middleware, if you want to call it that, like a middy-type thing for C#, that is now used for a lot of those. But, you know, it definitely helped us out a lot in terms of secret management. Like, you kind of guided us to use Parameter Store, which has been really, really great. It's definitely evolved so much using the Serverless Framework, and things like that. So yeah, this system was a culmination of many years of trial and error and learning, and we're pretty proud of it.
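The middleware idea here, loading secrets and config from Parameter Store once per container and caching them rather than baking them into environment variables, can be sketched like this in Python. The fetch function stands in for a real SSM `GetParameter` call (e.g. via boto3); it is injectable so the caching behaviour can be shown without AWS credentials, and all names are illustrative:

```python
import time

class ParameterCache:
    """Cache config values fetched from a parameter store with a TTL."""

    def __init__(self, fetch_fn, ttl_seconds=300):
        self._fetch = fetch_fn          # stand-in for an SSM GetParameter call
        self._ttl = ttl_seconds
        self._cache = {}                # name -> (value, fetched_at)

    def get(self, name):
        hit = self._cache.get(name)
        if hit and (time.time() - hit[1]) < self._ttl:
            return hit[0]               # still fresh: no network round trip
        value = self._fetch(name)
        self._cache[name] = (value, time.time())
        return value
```

Wrapped into a handler middleware (middy-style in Node, or a hand-rolled equivalent in C#), every function gets the same secret-loading behaviour for free, and rotating a parameter doesn't require a redeploy, just a cache expiry.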

Yan Cui: 11:54  

Great, yeah. It's amazing to see how far you guys have come. One thing I do want to ask you as well is that Step Functions is a really powerful service. I love it as well. But it's also one of the more expensive services that AWS offers. I guess, what's your view on when it makes sense to use Step Functions versus putting more stuff into a Lambda function instead? Do you have some kind of rule of thumb that you use to decide, okay, you know, that is a good place where we should be using Step Functions, even if it's more expensive?

Robbie Kohler: 12:24  

Yeah, that's the only negative I would say about Step Functions. They do have Express Step Functions, but it's a little bit limited. And we did actually use Express Step Functions, with a regular step function calling Express Step Functions for certain synchronous parts of the workflow. That actually was a kind of unique way to save a little money, but still have it broken up and more granular. We could obviously put those into Lambda functions, but it's not really our best practice now. We want to make sure... it really depends. I guess, you know, that's the answer everyone always says. If it's something mission-critical for the business, like dealing with orders, maybe payment processing, you know, the first thing I'm gonna think about is Production Services teams going in, you know, different third-party observability tools, what's going to be the easiest way. Because it's not just building it, right? You can build it super fast, you can run the monolithic Lambda if you wanted to, right, it's going to be able to do all your stuff. But we just have to make sure that, you know, for the teams that we're handing off to, whether it's us maintaining it in production or some other team, it's going to be easy for them to know what's happening. Now, that's one of the, you know, negatives they say about, sorry, serverless, is that it can be hard to know what's going on. So Step Functions kind of mitigates that with its visual workflow, and things like that. So if I feel like we need some kind of visual way to see what's happening, that's important, or if it's a mission-critical workflow, you know, we'll consider Step Functions.
Also if it's not going to run a ton of times, if it's like a batch process, you know, you're gonna save a lot of money actually, even though it's an expensive service, versus an EC2 server running all day that runs a job, you know, once or twice a day, right? Those are the old ways to run those batch jobs or ETL jobs, and serverless and Step Functions are so perfect for that. And I actually want to talk about a system we built for that a little bit later, because we really dove in deep on Step Functions there. So, you know, the cost can be an issue at scale. But the nice thing is you can calculate it out. You can really determine, well, what is it worth for an accurate order, right? I can give that to my business sponsors and executives and be like, it costs, you know, a fraction of a cent if we do it this way, or we do it in Lambda, and you kind of just weigh out the positives and negatives. So I would be happy if they, you know, reduced the pricing, we can all wish, but it is a fantastic service. I know there's a lot behind it. And it's a lot better than me building it and me maintaining it. So I'm kind of happy to pay for that.
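The "calculate it out" exercise is straightforward: Standard workflows bill per state transition, so per-order cost is just the number of transitions times the unit price. A back-of-the-envelope sketch; the $25-per-million figure is the published Standard list price at the time of writing and is subject to change, and the 10-state workflow is invented:

```python
# Standard Step Functions bill per state transition.
# $25 per million transitions is an illustrative list price.
STANDARD_PER_TRANSITION = 25.00 / 1_000_000

def standard_workflow_cost(executions, transitions_per_execution):
    """Total transition cost of running a Standard state machine N times."""
    return executions * transitions_per_execution * STANDARD_PER_TRANSITION

per_order = standard_workflow_cost(1, 10)               # one 10-state order
million_orders = standard_workflow_cost(1_000_000, 10)  # a month of load, say
print(f"per order: ${per_order:.6f}, per million orders: ${million_orders:.2f}")
# -> per order: $0.000250, per million orders: $250.00
```

Which is exactly the "fraction of a cent per order" framing you can take to business sponsors, and why a Standard workflow replacing an always-on EC2 batch box is often a dramatic saving despite the per-transition price.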

Yan Cui: 14:53  

Yeah, absolutely. I think it's probably the fact that you don't have to manage any instances. You don't have to pay or spend any engineering time on patching OS or machine images, and all of that has just saved you so much time. I think the visualisation is just really, really useful, not just for the development team, but also for a support team, and maybe for the product owners as well. I've had teams that, you know, work with product people who don't understand code, but they can very happily look at a visual diagram from your state machine definition and figure it out. All right, okay, that looks about right, you're doing that step one first, you're talking to Visa to verify the credit card number, and then the next step. So they can get a sense of what the system is actually doing without having to dive into the code. And equally, when you've got support teams, they can look at an execution and see which path it took, and when there's a problem, see where exactly the problem is without having to dive into lots of logs and whatnot. You just see, okay, that is one failed execution there, going to that, there is a red box right there. And you can see, okay, yep, that's the step it failed on. So yeah, the visualisation stuff is just so powerful, it's amazing.

Robbie Kohler: 16:01  

It is. And I think you've got to think of that as part of the overall total cost of ownership, how much time is it going to save, right? Well, you just saved me from documentation, because I just export it out. Like you said, if you model it correctly, right, with all the different paths, and you know, saga patterns, rollbacks, it's like, that's how it works. Take a look at this diagram. And it's really nice, right, that documentation. And then the time saved. I think we don't think of how much time is spent digging through logs, digging through various accounts, CloudWatch, you know, Insights. It's gotten better, but it was really painful years ago. I remember trying to figure out what's happening when something would break. So Step Functions, again, simplifies a lot of that and gives you that place where you can literally double-click into the issue. And, you know, I think it's definitely totally worth it, especially on stuff that's not running like 100 million times, right? It's a no-brainer for me to use that. And we actually have this system we built that we called Diablo. And I always like saying this name and the acronym, which is the Data Integration and Automation Batch Lambda Orchestration. It's a mouthful. But that is a system we built to replace any of these old ETL tools. I'm not going to name these tools, we've all used them. And, you know, a lot of us, not wanting to use those tools, always end up writing these kind of data integration platforms that handle jobs, and you know, start time, end time, errors and everything like that. So when I came here, I was like, well, I'm gonna have to write another one, or someone's going to have to, because we all need these services. And I was looking at Step Functions, while Airflow was an alternative. I'm like, I'd have to run containers, I'd have to run EC2, because managed Airflow wasn't there yet.
So I was like, can we do this with Step Functions and Lambda? And I was like, well, yeah, it's just executing steps, passing an input. And at the time, Step Functions kept releasing more and more cool stuff, like, you can pass an input in here, and it just became very much more robust. So the timing was right when we built this new platform. And that is a batch process, you know, you're handling import/export jobs, you're handling sending reports out, you're handling anything, and you just write these new Lambda steps. It's kind of a horizontal service, these general Lambda functions. And that's probably been one of our most successful things. I really want to open source it, or at least create a version of it, because I think other people would find it very, very helpful. And, yeah, again, we love serverless. If it sounds like I'm really excited, it's like, I never knew we could fall in love with a certain type of technology. But I think because I've been through the trenches, I've been through the pain of spending three weeks trying to get this monolith to work. And the fact that now we can finish these projects within a three-week time span instead of three months with serverless is amazing.
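The core of a generic batch orchestrator like the one described is small: named steps, each taking the previous step's output as its input, much as Step Functions passes state between Task states. A toy Python sketch of that shape, with step names and payloads invented:

```python
def run_pipeline(steps, initial_input):
    """Run (name, fn) steps in order, threading each output into the next input."""
    payload = initial_input
    for name, step in steps:
        payload = step(payload)  # output of one step becomes input to the next
    return payload

# Example batch job: extract -> transform -> load
result = run_pipeline(
    [
        ("extract",   lambda p: {**p, "rows": [1, 2, 3]}),
        ("transform", lambda p: {**p, "rows": [r * 2 for r in p["rows"]]}),
        ("load",      lambda p: {**p, "loaded": len(p["rows"])}),
    ],
    {"job": "daily-report"},
)
```

In the real platform each step would be its own fine-grained Lambda and the loop would be a state machine, which is what buys the retries, error handling, and visual execution history for free.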

Yan Cui: 18:59  

So, I guess, to zoom in on that a little bit, because that's really interesting in terms of that productivity gain you're referring to there. I mean, zooming out a bit, what would you say has been the biggest success or the biggest benefit you've had by adopting serverless?

Robbie Kohler: 19:17  

Yeah, I mean, it's really hard to quantify. I would love to quantify how much time we've actually saved. You know, I discussed that order middleware, we had something up and running in like two to three weeks, right, like a sprint, because you really break it down. Serverless is taking out some of the stuff that just took a long time to do, like going to another team, setting up the servers, all these things that are not producing the business value, right? What serverless does is, it's like, well, we already have this tool built, we're already super powered. We know we can bypass all of that. We can just start immediately on the business value. I think it does a couple of things. One, I think it gives our development team a lot of confidence when we're talking to, like, a product owner or business partners, that we can build almost anything, you know. We're already thinking in our head how we will do it and which services we can do it with. So it takes out a lot of cognitive load. Before, it was like, well, I have to get a Mongo license, or, you know, this event bus. Now, well, I know I'm gonna use EventBridge here, I know Kinesis makes sense here, right? So there's a lot of that reduced time to get going. Yeah, and just cost, like I said, some of the cost savings have been really dramatic, going from really powerful EC2s running 24/7 to just, you know, a step function calling a Lambda a couple of times a day. You'd see upwards of a 99% reduction in costs, not to mention the time, and the fun that we're having. The developers actually like it. So they want to do more, they work faster, and they work, you know, not to say we want them to work longer, but I've noticed that they're just very excited about it, you know, with the communities of practice we have here. So I've seen a lot of benefits across the board. That's why I'm such a big fan.

Yan Cui: 20:58  

So in terms of building up that community of practice, that's one of the things that I think a lot of companies struggle with, especially when you've got a very large company. There are a lot of different communication channels, and trying to get to consensus on technologies and practices is quite difficult. Do you have any, I guess, tips or hints that you can share with us, so that anyone else who's listening, who's thinking about doing the same thing, can have a better chance of succeeding?

Robbie Kohler: 21:26  

Yeah, I still think we're relatively new to the whole serverless thing and going really all-in on the managed services. So you're gonna have some doubts when you bring it up in a company. I think it's very common. When I talk to you and others globally, it seems like there are a lot of common things people talk about, like, well, how do you do this with Terraform? And how does the Serverless Framework work with that? So what you really need to do first, I think, is have a few evangelists for serverless. You've got to have some people that are already passionate, they've already built something, they have something to show, they have repos in your common repo. Maybe build a, you know, a group or something and start throwing in these already-baked, fully complete serverless starters. I remember, I'm always referring to your tips, but you told me, you should probably have some example repos. So when someone needs to know how to do something in Node, they're not starting from the beginning, there are already some best practices built in, there's like middy built in for the Parameter Store, and a readme file. So we started to do that. You know, you've got to have a Slack channel or Teams or whatever you're using, where that community of practice is constantly sharing. You know, the space moves fast, new services are happening. EventBridge, it's like, now we have, you know, a centralised event bus, that's important, you need to talk about that, you need to bring that up in that group, maybe have architecture meetings specifically about that. And one thing we're doing right now, because it's expanding so much, you almost need, like, T-shaped specialists on certain things. So we have one person that really likes DynamoDB, and I'm like, go read, you know, Alex DeBrie's books, go dive in really deeply. Another person likes the AppSync GraphQL stuff, and I'm like, all right, you sign up for Yan's course. I'm in that course. It's amazing.
You know, I'm plugging you here. But it is really amazing that you can spin up basically Twitter, the back end and front end. I actually started it up again after a few weeks, and I was like, this is awesome. Because with one command, and I don't think people understand that, once they run that serverless deploy command, or whatever it is, you just need that sandbox account. You really are running that full app without any toil whatsoever around setting up the services. So that's usually what I show the new people, and they're like, okay, I'm kind of bought in on this. But you do need to have that community of practice. We have a biweekly meeting where we discuss and demo new things we're working on, and we demo new Step Functions features that recently came out, with the... you can kind of use the UI. And, you know, something you and I wanted years ago, I used to want a UI sometimes just to build out these steps, because you're just kind of working with these huge YAML or JSON files, and it gets harder, you know, anything gets really complicated when it gets huge. So that's something we would demo in that meeting and get everyone excited. And then once you have that, you have evangelists, you have people from other teams join, you have your platform team, you have DevOps, and you realise we can work together. You know, we still want you to own EC2, RDS, VPC, all those components. We're not saying we want to manage them within our serverless repos, because that would be, you know, a nightmare. We just want to manage the managed services, things around microservices. So you do have to make sure you bring everyone aboard, honestly, because it's such a... it's a paradigm shift. So I think, you know, we learned that a little bit the hard way. It took us a little bit longer, but we're in a good place now where pretty much everyone's on board. All the new interns that come on, they've never done anything but serverless.
So they're just like, oh, this is cool, this is the way it's done. And I'm like, no, let me take you back to 2004 and what this was like, and, you know, months and months of waiting to see if you're going to get this MQ set up on some server. You know, it was a different time. And then they're like, oh, okay, well, this is pretty cool then. Yeah, yes, this is amazing. We kind of drill that into the young guys. It's pretty funny.

Yan Cui: 25:11  

Yeah, it's amazing to think that a whole generation of developers is gonna grow up not having to configure NGINX and web servers and all this other stuff, just to get two lines of business logic running, returning some JSON data from a database.

Robbie Kohler: 25:26  

No, I think we'll keep one around just for that reason, to force them to spend the first three months of their career maintaining a monolith, and then we'd be like, just kidding, you know, that's actually not in production, check this out. We may do that, put them through a little bit of stress. Because, yeah, the pain of spending weeks, not just days, weeks, trying to get something to work... You know, I just can't go back. None of us can go back.

Yan Cui: 25:53  

Yeah, same here, same here. And I guess, in that case, let's maybe talk a little bit about some of the challenges. We talked about the need for a community of practice, and I guess you just touched on some of the organisational challenges. But what about any technical challenges that you guys have had to overcome? Let's talk about some of the technical challenges that you've had to deal with.

Robbie Kohler: 26:18  

Yeah, I think, like I said, in early 2018... you know, the serverless community has, like, blown up since then. Lambda was a few years old by then, and it had just got, I think, the five-minute timeout and things like that. It started opening up these new use cases. But again, with the developer, we did it in .NET, and it was monolithic, to say the least. I was, you know, building it the way I used to build apps, with types, you know, type crazy, right? I'd built these really object-oriented systems before, and that's just the way I thought you did software, whether it's Java or C#. And I started really, really quickly. But I noticed you weren't doing that, you were just using Node or whatever, and other people were using Python. And I was like, okay, I'm not saying it's a wrong way or a bad way, because it still works and delivers the business value, but there's a better way to do things. So we started adopting Node and Python just as the general standard, right? The data folks like Python for their Lambdas. Everyone else, especially the full-stack, front-end engineers, they already know JavaScript and Node. Some want TypeScript, and we don't force that on them, but if you want types, you can have some types there. What was nice about that is we started moving away from doing them in Java and C#. Not to say they're bad, sometimes you may need to do something in Java or C# if you need, like, high performance. But everyone knows these languages, they start to share more. So that was a good decision we made, we started standardising there. You know, the challenges are always like, well, how do you make these things secure, right?
So in the beginning, you know, we had one role that could do way too much, and it's like, oh man, this is not best practice. Eventually, we started using the Serverless Framework, we settled on that. In the beginning, I don't think the Serverless Framework supported .NET Core, so I was kind of packaging up zip files, uploading them. Pretty, pretty bad. But now the Serverless Framework is such a nice abstraction for so many things. And using things like the middy middleware, you know, you can just be really productive very quickly, once you get those best practices down. And it's very secure, you know. Our security team, they're always concerned about security, and you tell them, like, we locked down this Lambda, it can't do anything but this, because it's, you know, a role-per-function type thing. And the serverless step functions plugin, you know, it also auto locks down your step function role and stuff like that. That's really cool, right? So we try to make sure we're not doing admin access, and things like that. So, you know, there are always those kinds of technical challenges, not necessarily around the code itself, but more like security, like, how do we worry about maintaining this in production, and things like that. There are great third-party tools, you work with Lumigo, you know, that's a great thing for observability, there's CloudWatch Insights, and there's a lot of training that we've had to do. And I would say actually one of the biggest challenges is to get people to understand what we're building, quality engineers, even the product owners to some extent. It is a paradigm shift. But when you get down to it, and we've talked about it, you're not really writing that much complex code now. Sometimes you may have a complex piece of business logic.
Maybe you write that in its own library and have unit tests on it. But for a lot of things you're just writing to a table, or AppSync is handling things for you. It's less about writing these huge algorithms and reinventing the wheel, and more about understanding all the services and the best way to use them, because there are a lot of ways to do the same thing. So that was probably one thing to work through. We've settled on patterns like EventBridge being always there now, and dead-letter queues: not always having to go to SQS with a Lambda behind it. There are other ways to do that now with EventBridge, and if something fails, you can write it to a dead-letter queue. Some of the less experienced people will say, I'm a little overwhelmed, because there are so many ways to do this. So again, not overly complicated from a technical perspective, but kind of overwhelming, and getting everyone involved. Those are some of the challenges we've seen.
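The EventBridge-plus-dead-letter-queue shape Robbie describes can be sketched in a few lines. This is an in-memory toy, not the AWS SDK; the class and event shapes are made up purely to show the routing and failure behaviour:

```typescript
// Sketch of the pattern: events go to a bus, rules route matching
// events to targets, and a failed delivery lands on a dead-letter
// queue instead of being lost, so it can be inspected and redriven.

type BusEvent = { detailType: string; detail: Record<string, unknown> };
type Target = (event: BusEvent) => void;

class MiniBus {
  private rules: { pattern: string; target: Target }[] = [];
  readonly deadLetterQueue: BusEvent[] = [];

  // a rule binds an event pattern (here: an exact detail-type match) to a target
  addRule(pattern: string, target: Target) {
    this.rules.push({ pattern, target });
  }

  putEvent(event: BusEvent) {
    for (const rule of this.rules) {
      if (rule.pattern !== event.detailType) continue;
      try {
        rule.target(event); // attempt delivery to this target
      } catch {
        // delivery failed: park the event on the DLQ for later redrive
        this.deadLetterQueue.push(event);
      }
    }
  }
}

export { MiniBus };
export type { BusEvent };
```

On AWS, the equivalent is configuring a dead-letter queue on an EventBridge target (or on the Lambda's async invocation config), so that failed deliveries are retained without you standing up an SQS-plus-Lambda pair per consumer.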

Yan Cui: 30:34  

One thing you mentioned earlier is that you are still using RDS for a lot of systems. How do you find that balance of when to use RDS versus DynamoDB? And how do you deal with some of the challenges that come with using RDS from Lambda? Especially if you're not using serverless Aurora, there's all the connection pooling and other things you've got to worry about. How do you guys approach that?

Robbie Kohler: 31:00  

Yeah. I always go back to the first system we built, which definitely has an RDS back end for that menu master data. At the time, there was no way we were comfortable with Dynamo. Getting everyone on board for Lambda was enough; getting them on board for something like Dynamo was just too much. But we've come a long way because of the success. As we build out more systems using Dynamo, the tooling around it has gotten better, and our understanding of Dynamo, through people like Alex and yourself, has gotten a lot better. So like I said, we're serverless-first now. If we don't have to set up a VPC for some new system, we're going to try to do it in Dynamo. Dynamo gives you some great ways to model data for different use cases, and that's nice, right? But on what I'll call our legacy side, our first couple of serverless apps have RDS, and you do open up a bit more: your Lambdas have to be in the VPC now, you have security groups, things that are a little more annoying to deal with, but you just have to deal with them. And RDS is only going to scale to a certain extent, based on the size of the cluster you started with. So make sure you've got Redis or a read replica set up, especially if you're doing heavy reads. And AWS makes that easy to do, which is great. I love Aurora, it's a great product, and we're definitely going to look at serverless Aurora. But if I can get away with not having a VPC, I'm going to try very hard to do it. And if people want to query that data like it's SQL, we can send it to S3 so they can use Athena, or maybe Kinesis it into their own database. We can figure out ways to still make that happen.
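The connection-management headache Yan alludes to is usually mitigated by caching the database client in module scope, so a warm Lambda container reuses the connection instead of opening a new one on every invocation. A minimal sketch, with a stand-in `connect` function instead of a real driver like `pg` or `mysql2`:

```typescript
// Opening a fresh RDS connection per invocation exhausts the database's
// connection limit under load. The classic mitigation: initialise the
// connection outside the handler so only cold starts pay for it.

type Connection = { id: number; query: (sql: string) => string };

let connectionsOpened = 0;

function connect(): Connection {
  connectionsOpened += 1; // stands in for a real TCP + TLS handshake to the database
  const id = connectionsOpened;
  return { id, query: (sql) => `conn ${id} ran: ${sql}` };
}

// Module scope: initialised once per warm container, not once per invocation.
let cached: Connection | undefined;

export async function handler(sql: string): Promise<string> {
  if (!cached) cached = connect(); // only the cold start opens a connection
  return cached.query(sql);
}

export function opened(): number {
  return connectionsOpened;
}
```

Even with this, many concurrent containers each hold a connection, which is why Amazon's RDS Proxy (or the Data API on serverless Aurora) is the managed way to pool connections across Lambdas.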
But we try to be creative about doing Dynamo first. In early 2020, I was talking to my boss, and I said, I think we're ready. He said, ready for what? I said, I think we're ready to be serverless-first. We're using AWS, we're not going to get off it, we're happy with the service, they're the leader in this, and we've had success with three or four of these projects. It's a little difficult when some new architecture is being discussed and I have to think about being cloud-agnostic, or about using some tool that is not really tailor-made for what we're trying to do, when I could just use something that's purpose-built and get it done faster. Amazon's not going away. So instead of keeping one foot out the door so we could be cloud-agnostic, we took that full step towards AWS and serverless. And then Dynamo blew up: everyone started building things with Dynamo, and RDS became the "if we have to" option. It was really fun, actually.

Yan Cui: 33:51  

Okay, so I guess nowadays DynamoDB is pretty firmly established in your stack, as well as EventBridge, by the sound of it.

Robbie Kohler: 34:00  

Oh, yeah. EventBridge is definitely something I think you want to start with for anything. We're looking at some of the patterns you guys have been talking about. We have a central event bus, because those are the events that multiple teams may care about. So, say the store hours update for a particular store, or the pricing or tax changes. There's one system that owns where that event happens, but the e-commerce team cares about it because they've got to update something on their site, right? And some other team dealing with, let's say, location services may need to know it happened too. Instead of figuring out other ways, and maybe we used SNS in the past and it just felt a little awkward, we're looking at the centralised event bus with rules that fire off to the other systems, and that's becoming really powerful. Even with third parties and our partners, we can send them events that are happening and they can build things off of them as well. We're really, really high on that. And we're really high on the potential for AppSync and GraphQL to replace some of the stuff we're doing with API Gateway. There are so many things we're looking into, if you think about it. For front ends, we're looking at doing the Jamstack with Amplify and things like that, and then for the API layer, it's AppSync. Again, your masterclass has been really great; we're following through it just to get ourselves ready for the new stuff we're going to build.
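The central-bus idea Robbie describes, one producing system and many rule-driven consumers, might look like this in miniature. The event names, team names, and shapes are made up for illustration; on AWS these would be EventBridge rules with pattern matching on the event payload:

```typescript
// One owning system emits a "store hours updated" event; independent
// rules fan it out to every team that subscribed. The producer knows
// nothing about its consumers, which is the point of the pattern.

type StoreEvent = { detailType: string; detail: { storeId: string } };

const received: Record<string, StoreEvent[]> = { ecommerce: [], location: [] };

// each rule: a predicate over the event plus a delivery action
const rules = [
  {
    // e-commerce only cares about hours changes, to update the site
    match: (e: StoreEvent) => e.detailType === "store.hours.updated",
    deliver: (e: StoreEvent) => received.ecommerce.push(e),
  },
  {
    // location services subscribes to every store-related event
    match: (e: StoreEvent) => e.detailType.startsWith("store."),
    deliver: (e: StoreEvent) => received.location.push(e),
  },
];

function publish(event: StoreEvent) {
  // routing lives in the rules, not in the producing system
  for (const rule of rules) if (rule.match(event)) rule.deliver(event);
}

publish({ detailType: "store.hours.updated", detail: { storeId: "0042" } });
```

Adding a new consumer is just adding a rule; nothing in the producing system changes, which is what makes this easier to evolve than point-to-point SNS topics.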

Yan Cui: 35:40  

That's great to hear. And EventBridge just announced a new feature as well, a couple of days ago I think: cross-account event schema discovery. So if you've got a centralised event bus, you can have all of the event schemas discovered and, I guess, reported to different accounts so that you can easily access them as well. Not just the events themselves, but the schemas for the events you want to consume from different accounts, which is pretty cool.

Robbie Kohler: 36:05  

It's hard to keep up, they do so many updates, and that's a pretty big one, so I definitely need to share that with our internal group. That's a great example of how it works for us: I'll be reading about serverless on the weekend, because I'm lame and I follow serverless on the weekend, and I'll post something to the chat. And surprisingly often, someone says, hey, I'm working on something like that, thanks for sharing, I'm going to think about that for my project. So yeah, it's hard to keep up, but you kind of need a team to keep up with all the stuff they're coming out with, and to ask, do we use this? Oh, this actually doesn't really work for us. Like, I think they just came out with that managed Redis service, and I looked at it and thought, this is really cool, but we're still serverless-first: we can do this with Dynamo, we can do this with AppSync resolver caching. It wasn't something for us, but it was nice to look at it and talk about it as a group. That's one value of that group: just eliminating services. There are so many, right? Having a group that can say, well, actually, that's not what we're doing. That's cool, maybe we'll look at it later.

Yan Cui: 37:10  

Yeah, and that's Amazon MemoryDB for Redis. There's also a company called Upstash, which does more of a serverless Redis, so you pay for the number of requests you make, and I think for data storage as well. Whereas with MemoryDB you're paying for uptime: for the number of EC2 nodes you're running and how long you run them for. Even though there's less infrastructure overhead, you're still paying for uptime, as opposed to Lambda and Step Functions, where you only pay when you use them. Which is, I guess, more efficient, unless you're doing, like, a million requests a second, that kind of thing.

Robbie Kohler: 37:48  

I'll say one thing too. The first question I get from partners and other people I talk to is, well, does it handle enterprise scale, if we're building a SaaS and things like that? And I say, well, yeah, if you do it right. Even at a tremendously high scale, I'm sure there are parts of the system that can still be serverless, and I'm including all the managed services like EventBridge in that. In my opinion, you don't need to roll your own infrastructure and then put stuff on it; that's what AWS already does. And some of the stuff that requires always-on capacity, like when we were considering DocumentDB versus Dynamo at the beginning... nothing against MongoDB, it's a great service, but the way we're moving, the general consensus is: let's make Dynamo work. There's a way to do that.

Yan Cui: 38:49  

So Robbie, thank you so much for taking the time to share your experience with us today. Is there anything else you'd like to mention before you go? Maybe, is Taco Bell hiring? I'm sure there are serverless enthusiasts listening to this podcast who may be interested in taking up a job with you guys.

Robbie Kohler: 39:07  

Yeah, we're always looking for new serverless developers. As you know, they're a little bit hard to find right now; it's competitive. But I just want to get the word out that Taco Bell, and you guys probably know us from eating the food, has a lot of cool technology we're building. We're going as cutting-edge as possible. We think five years from now this is going to be the norm, and we're kind of already there. We're not afraid to try new technologies, we're not afraid to try anything, and we're not afraid to partner with AWS and work with experts like you. We have a ton of fun. And if you're interested, there are two "This Is My Architecture" videos where I dive in a little more on the order middleware and the menu middleware. Just search YouTube for "Taco Bell AWS" and I think they should pop up. Yeah, that's about it.

Yan Cui: 39:59  

Okay, I'll put those in the show notes. And if you share the careers page with me as well, I'll put that in the show notes too, so people can check it out easily. Again, Robbie, thank you so much for taking the time, and stay safe. Hope to catch you in person, maybe at the next re:Invent.

Robbie Kohler: 40:15  

Oh, yes, re:Invent is happening. We'll meet at the Taco Bell Cantina right across the street. I know you've been there before, late night, so...

Yan Cui: 40:25  

Yeah, I'm not sure I'll make it this year, because of the travel restrictions. Let's see, let's see what happens.

Robbie Kohler: 40:30  

I'm really hoping you do. If not, next year. Thanks, Yan.

Yan Cui: 40:34  

Take it easy man. Okay. Bye bye.

Robbie Kohler: 40:36  


Yan Cui: 40:48  

So that's it for another episode of Real World Serverless. To access the show notes, please go to If you want to learn how to build production-ready serverless applications, please check out my upcoming courses at And I'll see you guys next time.