Real World Serverless with theburningmonk

#32: Serverless chatbot with Michael Wittig

October 07, 2020 Yan Cui Season 1 Episode 32

You can find Michael on Twitter as @hellomichibye and on LinkedIn here.

Here are the links for some of the things we discussed during the show:

To learn how to build production-ready Serverless applications, go to productionreadyserverless.com.

For more stories about real-world use of serverless technologies, please follow us on Twitter as @RealWorldSls and subscribe to this podcast.


Opening theme song:
Cheery Monday by Kevin MacLeod
Link: https://incompetech.filmmusic.io/song/3495-cheery-monday
License: http://creativecommons.org/licenses/by/4.0


Yan Cui: 00:13  

Hi, welcome back to another episode of Real World Serverless, a podcast where I speak with real world practitioners and get their stories from the trenches. Today I'm joined by Michael Wittig, who is here to talk to us about how he's built a serverless chatbot with his brother. Hey, Michael, welcome to the show.


Michael Wittig: 00:30  

Hi, thanks for having me.


Yan Cui: 00:31  

So before we jump straight into it, can you tell us a little bit about yourself and what you've been building?


Michael Wittig: 00:38  

Yeah, sure. So my name is Michael Wittig, and I run a blog called cloudonaut.io. It's very focused on AWS topics, so it's not only about serverless stuff. We also talk about the old stuff, EC2, RDS and things like this. And we've been doing this for a couple of years, I think five or even six years now, and I work together with my brother Andreas. We have published a few books in the meantime, video courses, and a lot of material about AWS. And we also run a serverless product called Marbot (https://marbot.io/), which I think is what we are going to talk a little bit about today. It's a chatbot that connects AWS, and AWS monitoring, with Slack and Microsoft Teams.


Yan Cui: 01:27  

So Marbot has an interesting story as well. You guys built it as part of an AWS hackathon, I remember, quite a few years back, and since then you've added a subscription option. It's been growing pretty steadily and it sounds like it's doing pretty well.


Michael Wittig: 01:45  

Yeah, you're right. So in 2016 AWS launched a hackathon where basically the job was to use Slack and the AWS APIs together. So we thought a little bit about what we could do here. And we always had this problem of getting monitoring data out of AWS, for example alerts or CloudWatch Events and things like this. So if things go wrong, how can we notice that, and also how can we make sure that someone picks up the task? That's why we participated in the hackathon, and we also won a prize. I must admit there was not only one prize, a lot of people won prizes, but we were one of the projects that was awarded one, and since then we've continued to improve the service and the features. Yeah, so that's right.


Yan Cui: 02:33  

So I guess, given that the domain you're dealing with is really event driven, serverless and Lambda are a pretty nice fit for Marbot. Can you maybe just give us a bird's eye view of what your architecture looks like from a very high level?


Michael Wittig: 02:51  

Yeah. So basically we have two different sources that send us data. Source number one is SNS topics, or subscriptions, within the client's AWS account. So if you set up your AWS monitoring, then you usually have an SNS topic where all those sources send their messages, and Marbot basically connects to this with an HTTPS subscription. So that's input number one. And then you have the chat services themselves. So Slack sends us webhooks, and Microsoft Teams also sends us webhooks. Those are all the inputs to the system, basically. And what we do is accept this data on an API Gateway, then we put it on a Kinesis stream, and then we make sure that we process the events as fast as we can and do whatever is needed, for example send a message out, or fetch some additional data, things like this.


Yan Cui: 03:52  

So are you able to share maybe some numbers in terms of the amount of throughput you are dealing with? Are you dealing with, I don't know, hundreds of thousands of events per day or even more?


Michael Wittig: 04:04  

No, I think that's a good size, like hundreds of thousands of events. We sometimes have problems where people connect our Marbot endpoints to very chatty services. For example, we've had people accidentally connect an S3 bucket, so all the S3 events were sent to us. So we sometimes have a lot of requests, and then we are well above 100,000 calls per day. And because we don't really want to send that many messages to you in Slack, I mean, you would go crazy if we did that, and Slack also limits us because they have rate limits on their APIs. So what we do to protect our service, and also protect you from going crazy reading all those messages in Slack, is we use the API key features of API Gateway and rate limit our APIs as well, so you cannot really send us that many events. And it also doesn't really make sense to send that many Slack messages, so that kind of goes hand in hand here.
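Marbot does this with API Gateway usage plans; as a rough illustration of the underlying idea (rate plus burst), here is a hypothetical token-bucket throttle. All names and numbers are made up for illustration, this is not Marbot's actual code:

```javascript
// Hypothetical token-bucket throttle, similar in spirit to what an
// API Gateway usage plan does (steady rate + burst capacity).
class TokenBucket {
  constructor(ratePerSec, burst, now = Date.now()) {
    this.ratePerSec = ratePerSec; // tokens refilled per second
    this.capacity = burst;        // maximum burst size
    this.tokens = burst;
    this.lastRefill = now;
  }

  // Returns true if the request is allowed, false if it should be throttled.
  allow(now = Date.now()) {
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.ratePerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

A chatty source (like a misconfigured S3 bucket) quickly drains the bucket and gets throttled, while normal alert traffic passes through.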


Yan Cui: 05:06  

Okay, and do you do any additional filtering to make sure that if you've sent me an event about something, you don't send me another event about the same thing within, say, a five minute or ten minute or even one hour window?


Michael Wittig: 05:23  

Yeah, so maybe I can take one step back. At the beginning we thought, okay, it seems very simple to actually send a message to Slack, and in practice it is very simple, but there are many, many problems that can occur. For example, what happens if the Slack API is not available? At the beginning, in 2016, this happened more than once a day, so the Slack API was not very reliable. It has improved a lot today, but back in the days this was one of our biggest problems, the Slack API was just not available. And the next problem is, as you mentioned, it doesn't really make sense to send you ten alerts that something is still wrong if you already received a message that it's not working. So we built two things. First, we made sure that we actually retry until we can deliver the alert to your Slack channel, and this is mostly done with the Kinesis strategy: if you cannot reach Slack, you just retry until Slack is up again. And the second thing that we built into our service is what we call alert aggregation. For example, if we see that we processed a similar event within the last couple of minutes, or we go back up to 24 hours if you like, you can configure the window size, then we will just update the message in Slack and say, okay, this is now part of two or three or five or hundreds of events, to avoid sending you too many messages.
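To make the aggregation idea concrete, a minimal sketch of that decision might look like this. The function and field names are illustrative, not Marbot's actual code; only the "update the existing message vs. post a new one" logic is what Michael describes:

```javascript
// Hypothetical sketch of the alert-aggregation decision: if a similar alert
// (same key) was seen within the configured window, update its count instead
// of posting a new Slack message.
const DEFAULT_WINDOW_MS = 5 * 60 * 1000; // window size is configurable, e.g. up to 24h

function aggregateAlert(existingAlerts, incoming, windowMs = DEFAULT_WINDOW_MS) {
  const match = existingAlerts.find(
    (a) => a.key === incoming.key && incoming.at - a.at <= windowMs
  );
  if (match) {
    match.count += 1; // update the existing Slack message ("now part of N events")
    return { action: 'update', alert: match };
  }
  const fresh = { ...incoming, count: 1 };
  existingAlerts.push(fresh); // post a brand-new Slack message
  return { action: 'new', alert: fresh };
}
```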


Yan Cui: 06:51  

So you touched on the fact that some of these API calls you're making, including to the Slack API, are not always reliable, sometimes they fail. And it's interesting that you're using Kinesis's retry-until-success behaviour with Lambda to basically get this retry out of the box. I remember when I spoke with Ben Kehoe from iRobot a while back, they are also taking advantage of the fact that Kinesis has got this retry-until-success behaviour. So you recently wrote a blog post essentially talking about how you deal with the fact that the more dependencies you have in your system, as you integrate with Teams and Slack and other services, the more of a compounding effect it has on your SLA. So can you maybe talk to us a little bit about what the core problem is, and some of the strategies you're using to isolate your dependencies and improve your uptime?


Michael Wittig: 07:50  

Yeah. So the biggest dependency that we have is AWS. But the good news is that they are really reliable, so we don't see many errors with our AWS dependencies, Kinesis, DynamoDB and S3. The other dependency that we have is Slack. And now, since we added support for Microsoft Teams, we have the Teams API as well. It basically integrates into Azure, so there's a chatbot service, and you talk to the chatbot service and the chatbot service forwards the messages to Teams. And when we thought about how to implement this, we didn't like the idea that our stable path, receiving alerts and sending them to Slack, could be affected by new code and our new dependency, Microsoft Teams. So we tried to make those two channels independent: if Slack is not working, we still want to send out alerts to Teams, and if Teams is not working, we still want to send out messages to Slack. What we did is, we had this one Kinesis stream, which at the time only dealt with Slack because we only supported Slack at the beginning of the year. We added a second Kinesis stream, and everything Microsoft Teams related basically sends its messages into this new stream, and we have a completely separate processing unit to take care of the Kinesis messages that are relevant for Microsoft Teams customers. At the beginning, when we started the private beta, we had a few problems, mostly in our code, and we had to fix them, but this had no effect on our Slack users, because we had those two completely independent data paths in our system, ensuring that a problem in one of those streams does not affect the other one. And that's something that I can highly recommend. If you're adding dependencies to your system, you will always decrease the availability of the whole thing unless you can really make sure that they are isolated, and that's kind of the only chance you have to make sure that systems with many dependencies have good availability.
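The routing piece of that isolation is simple enough to sketch. In this hypothetical version (stream names and event shape are made up, not Marbot's), each downstream dependency gets its own stream, so a batch of incoming events fans out without the two paths ever sharing a queue:

```javascript
// Sketch of per-dependency stream isolation: events for each downstream
// target go to their own stream, so a Teams outage cannot block Slack delivery.
function streamFor(event) {
  switch (event.target) {
    case 'slack': return 'alerts-slack'; // illustrative stream names
    case 'teams': return 'alerts-teams';
    default: throw new Error(`unknown target: ${event.target}`);
  }
}

// Group a batch of incoming events by destination stream.
function routeBatch(events) {
  const byStream = {};
  for (const e of events) {
    const name = streamFor(e);
    (byStream[name] ||= []).push(e);
  }
  return byStream;
}
```

Each stream then has its own independent Lambda consumer, which is what keeps a failure in one path from stalling the other.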


Yan Cui: 10:13  

Yeah, that reminds me of a lot of practices that we've used in the past for things like priority queues, and having one queue for every single ingestion point as well as every single outbound target, so very much applicable here. But I guess one of the downsides of this approach is that with something like Kinesis, you are not just paying for the number of requests you're making, you're also paying for uptime. Is that something that you guys are potentially concerned about, that by the time you add, I don't know, 100 different egress targets, you end up having to have 100 different Kinesis streams, each one of them not being used very much?


Michael Wittig: 10:59  

Yeah, I agree. So Kinesis is actually our biggest cost driver, because we also use multiple shards, not because we need them in terms of capacity, but to isolate again within the stream: if we have an error in the messages, it only affects one shard, so if Lambda cannot process that message, all the other shards will continue working. We have, for example, four shards for Slack, and we usually try to have only a small number of customers on each shard. So we pay a lot of money for Kinesis, and a more reasonable pricing model, for example like DynamoDB on-demand, would be a big, big benefit for us. We also thought about using SQS for our use cases, because as you mentioned, SQS is also a good choice to isolate the two dependencies, Slack and Microsoft Teams, but in our case the problem is that we really need ordering. If an alert happens, you can, for example, acknowledge the alert and you can close it, and those things need to happen in order. If you close an alert before it was acknowledged, things become very complicated in the code. So we said, okay, we use Kinesis and we know that it will be a little bit more expensive, but we get the big benefit of the ordering. For example, we have ordering within a channel, so we know that everything that happens in a channel is processed exactly in the order that it was sent to us, and that helps us very much, it simplifies the code base basically. But the downside, as you mentioned, is that with Kinesis you pay for the shard hour. Our Lambda costs are, I don't actually know exactly, I think they are below 10 US dollars, definitely. But for Kinesis I think we pay maybe 500 dollars or something per month. So in comparison it really doesn't make a lot of sense. Also if you compare it to what we pay for DynamoDB: DynamoDB is also very cheap, maybe the same range as Lambda. So yeah, Kinesis is really the biggest part of our bill, and we would like to change that, and we are hoping for something to happen in the Kinesis service in the future.


Yan Cui: 13:25  

Quite a few of the guests on this podcast have asked for a different pricing model for Kinesis in the past as well. I know Ben Kehoe was advocating for that too, to have a pricing model similar to Kinesis Firehose, where you are paying for the amount of data that's passing through your stream rather than paying for uptime. But in your case, could you have used an SQS FIFO queue? Would that have worked as well?


Michael Wittig: 13:51  

So I think the first problem with the ordered queues is that they were not available in 2016. And also, I must admit, I think they only added Lambda support maybe this year or maybe last year, so it's not integrated that well into Lambda either. And I also don't really like it because it has so many restrictions on throughput. So on the spectrum of ease of use, I would say that Kinesis is easier compared to the FIFO queues.


Yan Cui: 14:27  

Yeah, the support for FIFO queues for Lambda was only introduced, I think, late last year, just before re:Invent. I remember it because it was something that I'd always wanted to use but never been able to, so I ended up using Kinesis for a lot of things that could have been done with a FIFO queue instead.


Michael Wittig: 14:46  

Okay, so what are your experiences with those queues? Because I never used them. I always look at them with a little bit of suspicion, so I'm not a big fan of them, but what are your experiences using them? Is it working fine for you?


Yan Cui: 15:01  

Yeah, it works fine. So with SQS FIFO queues, you have to specify a message group ID, which is essentially the same idea as the partition key you send when you publish an event to Kinesis. And additionally, for FIFO queues, you have to specify how you're going to do deduplication as well. That's one of the things that a FIFO queue does that Kinesis doesn't do: it gives you deduplication out of the box. Whereas with Kinesis it's quite likely for you to have duplicate events, especially when you consider things like your application maybe sending the same thing twice, or something as simple as the SDK doing retries. The SDK retries because you didn't get the response the first time, and then you end up sending the same message twice. So in the past I've had to build deduplication logic in my application code when working with Kinesis and Lambda quite a few times. Whereas with an SQS FIFO queue you get the benefit of deduplication by payload, and it does work quite well with Lambda nowadays. I do hear you about the throughput limit, though. I do remember that it is a soft limit, so it's just a case of asking for a higher limit, but you have to know what you need ahead of time, whereas with Kinesis I guess you have a bit more room: I don't need to raise a ticket, I just add another shard and that's it, that's me done. And also, at higher throughputs Kinesis is an order of magnitude cheaper compared to SQS. I've had systems in the past where we were sending maybe tens of thousands of messages per second, and when we did the numbers, Kinesis was, I don't know, a couple of hundred bucks a month, whereas SQS would have been maybe $10,000 a month. So in the higher throughput scenarios Kinesis is much cheaper because of the fact that you're paying for uptime. But when you've got lower throughput, it's, I guess, a bit more expensive when you are paying for uptime, especially when you need to have lots of different streams, one for every egress target.


Michael Wittig: 17:21  

Yeah, I see. And I also had a similar experience with deduplication, the missing feature in Kinesis. All of our code has to be idempotent, for example, and it's really hard to test this and to get it right. In our case it took, I would say, at least one or two years until we fixed all the bugs that we had with that, because it's really very hard to test all the things that can go wrong and where things are retried. So if we could at least avoid inserting the same message twice into Kinesis, that would help a little bit. So I agree that deduplication is a nice feature of the FIFO queues, yeah.
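The idempotency requirement Michael mentions boils down to: remember which events you have already processed, so a Kinesis retry of the same record doesn't run its side effects twice. A minimal sketch of the pattern; in a real setup the "seen" store would be something durable like DynamoDB with a conditional write, not an in-memory Set, and the names here are made up:

```javascript
// Sketch of an idempotent consumer: skip events whose id was already handled.
function makeIdempotent(handler) {
  const seen = new Set(); // stand-in for a durable store (e.g. DynamoDB)
  return (event) => {
    if (seen.has(event.id)) return { skipped: true }; // duplicate delivery
    seen.add(event.id);
    return { skipped: false, result: handler(event) };
  };
}
```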


Yan Cui: 18:00  

So, maybe going back a little bit to what you said about the Slack and other APIs not always being available, not being reliable. Have you done anything around retrying those requests, similar to what the SDK gives you, like the exponential backoff and all of that, on top of what you're doing with the Kinesis and Lambda retries?

 

Michael Wittig: 18:20  

Yeah. So basically there are two different kinds of problems with the APIs. The first problem is that only a very small number of requests fail, and the other problem is that they really have a big problem on their side and the API doesn't work for 10 minutes or so. To fix the first kind, the temporary problems: we implement everything in Node.js, and back in the days the popular HTTP library was request, and there is another library called requestretry that basically gives you the same retry functionality that the AWS SDK has. So if the HTTP call to the Slack API fails, the library itself will do the retries. You have to make sure that you configure it in a way that matches the timeout of the Lambda function. For example, if the Lambda function timeout is 30 seconds, then you should ensure that your retries will not take longer than 30 seconds, because otherwise the function will time out. I think we do three or four retries, the first within a few seconds, then maybe two or three seconds, and then six seconds or something like this, so we stay within the 30-second timeout of the Lambda functions. So that's the first mechanism that we have implemented, and it fixes a lot of problems. It also fixes the problem where sometimes your Lambda function has some issues connecting to the internet. And for the other problem, where an API is really unavailable for hours, the Kinesis retry helps us, because Kinesis will retry the events basically for days. I think we have the retention set to seven days, so we can retry for seven days, if you like, to finally deliver the message to Slack.
So far we've never needed seven days, usually they fix it within a few hours, but those are the two layers of retries that we have implemented for making calls to the Slack API. And we have the same thing for the Teams API as well now.
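The "retries must fit inside the Lambda timeout" constraint can be made concrete with a small pure function: compute an exponential backoff schedule, but drop any attempt whose worst case would run past the time budget. The numbers and function name here are illustrative, not Marbot's actual settings:

```javascript
// Sketch: exponential backoff delays that fit inside a Lambda timeout budget.
// Each attempt may take up to perCallTimeoutMs; delays double each retry.
function backoffScheduleMs(attempts, baseDelayMs, perCallTimeoutMs, budgetMs) {
  const delays = [];
  let elapsed = perCallTimeoutMs; // worst case of the first attempt
  for (let i = 0; i < attempts - 1; i++) {
    const delay = baseDelayMs * 2 ** i; // e.g. 1s, 2s, 4s, ...
    if (elapsed + delay + perCallTimeoutMs > budgetMs) break; // would exceed timeout
    delays.push(delay);
    elapsed += delay + perCallTimeoutMs;
  }
  return delays; // sleep these between attempts; lasting failures go back to Kinesis
}
```

Anything that still fails after the in-function retries is left to the Kinesis retry, the second layer Michael describes.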


Yan Cui: 20:26  

Okay, that makes a lot of sense. Are there any other sort of interesting technical challenges or features that you had to implement as part of building Marbot?


Michael Wittig: 20:35  

Yeah, so maybe I can go a little bit into the details about how we figured out performance problems. One piece of background information: if you click a button in Slack, there are these buttons in messages that you can add, then the endpoint has to respond within three seconds. Slack will only wait three seconds for a result from you. And at the beginning we had lots of issues making sure that we could respond under three seconds, with all the cold start things. I mean, now it's not a big problem anymore, but in 2016 we had to make sure that everything started up soon enough that we could respond within three seconds. So we did a lot of optimisation there, and what helped us is that we added X-Ray. So you can now see, okay, how long does this call to Kinesis take? How long does it take to talk to DynamoDB? But the cool thing about X-Ray is that you can also instrument outgoing calls. So when we make a request to Slack, we also see this in X-Ray, and when we make a call to Microsoft Teams, we also see this in X-Ray. And that's been very powerful, because then you can really see, okay, where is all this time spent? Why does it take so long to process the message? So that was a really big benefit for us when we added X-Ray to our setup, to investigate where all the time is spent in our functions. Not sure what your experiences are with X-Ray.


Yan Cui: 22:07  

Yeah, I find that X-Ray is okay when you've got a relatively simple application. But when you've got something more interesting, especially in the serverless world where you've got a lot of asynchronous processing happening, X-Ray just doesn't quite do the job anymore. For example, when you've got Kinesis, it doesn't trace those invocations through Kinesis. So it becomes, I guess, less useful as you're doing more and more of this asynchronous processing in your system, which is quite normal in the serverless world. So I do find X-Ray to be limited. But some of the third party vendors, like Lumigo, Epsagon and Thundra, do a much better job of tracing other event triggers for Lambda functions and can connect them a lot better. But for simple HTTP to HTTP requests, X-Ray still does the job, I think. Before we jumped on the call, you also mentioned something interesting, which is how to implement timers on AWS, because that's actually a topic that I've written about, quite a bit actually, in the latest chapter we just released of Serverless Architectures on AWS, Second Edition (https://www.manning.com/books/serverless-architectures-on-aws-second-edition), which I have been working on with Peter Sbarski. I'm quite curious about which approach you guys went with, because there are so many different ways to do it: with SQS, with Step Functions, using a cron job and so on. I wonder which one you guys actually used.


Michael Wittig: 23:36  

Yeah, so maybe I'll add the background information on why we actually need it. When you receive an alert in Slack, we start an escalation chain. We basically pick one of the online users in the channel. This user receives a private message where we say, okay, here's something wrong, please take care. And now you have five minutes to acknowledge this alert. If you do not acknowledge within five minutes, then we escalate to the next user, and if no acknowledgement happens again, we escalate to the whole channel. So that's why we need the timer, and it's a five minute timer. And we started with the SQS approach. If you send a message to SQS, you can basically delay the message so it only pops up, in our case, five minutes later. I cannot really remember the maximum delay here, but for five minutes I think it's fine, you can just delay a single message. And remember that in 2016 it was not that easy to connect a Lambda function with an SQS queue, so we had to manage all the polling ourselves as well. We had something that triggered the Lambda function to poll the SQS queue, so we basically never acted exactly five minutes after we started the timer. It was not too much of a big deal for us, but still, it was error prone. So finally we re-architected this piece, and today we are using a Step Function. The Step Function does very accurate waits, and it just invokes our Lambda function after exactly five minutes. That's how we do it nowadays, and we are very happy with it, we don't see many problems. I think it depends a little bit on the number of timers that you have, so it's definitely more expensive to use the Step Function approach. But for us, basically, for each alert we can have up to four timers. That's a reasonable number, and the costs are not really a problem for us here.
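The Step Functions timer Michael describes can be sketched as an Amazon States Language definition: a Wait state for exactly five minutes, then a Task state that invokes a Lambda which escalates if the alert is still unacknowledged. The state names and the function ARN are placeholders, not Marbot's actual resources:

```javascript
// Sketch of the escalation timer as a Step Functions (ASL) definition.
const escalationTimer = {
  StartAt: 'WaitFiveMinutes',
  States: {
    WaitFiveMinutes: {
      Type: 'Wait',
      Seconds: 300, // the five-minute escalation window
      Next: 'Escalate',
    },
    Escalate: {
      Type: 'Task',
      // Placeholder ARN -- would point at the escalation Lambda.
      Resource: 'arn:aws:lambda:REGION:ACCOUNT:function:escalate-alert',
      End: true,
    },
  },
};
```

Starting one execution per pending acknowledgement gives the accurate, per-alert timer, without any polling.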


Yan Cui: 25:48  

Okay, sounds good. Yes, with SQS delayed delivery you can only delay up to 15 minutes, and it's still quite accurate from when I tested it. Whereas Step Functions is very precise: when you set a specific time, it is literally precise to the millisecond, which is pretty amazing. But Step Functions is a bit more expensive, probably one of the most expensive services on AWS, but unless you're using it at massive scale, it probably doesn't matter too much, I imagine. Because even though it's expensive, you are still talking about $25 per million state transitions. So if you just have one state to wait and then trigger a Lambda function, well, that's actually two states, it's not that bad, you can still get a lot done with that $25. But one thing that's actually interesting that I found out recently about Step Functions is that even though the pricing page says it charges you based on state transitions, it actually charges you for the start and the end as well, even though those are system states. It's not something that you write yourself, but you get charged for it. In your example, if you've got one state for waiting and another state for running the execution, then you really have four states: the start, the wait, the execution and then the end, and they all get charged. The example in the official pricing page is actually wrong, and it's been raised to AWS. I think a few people from AWS are already looking into this now to get it updated. But the pricing example itself was wrong, and I had to find out by creating a state machine in a region that I don't normally use, running one execution, and then checking, like, two hours later what shows up in my AWS billing in terms of the number of state transitions that get charged. That's the only way I managed to find out that the start and end actually got included in the charge. So that's just something to keep in mind.
But like I said, if you're not doing it at massive scale, then Step Functions may be expensive per unit, but if you're not using it all that much, then it's probably okay.
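Yan's billing observation is easy to turn into arithmetic. Assuming the $25 per million transitions figure quoted in the conversation (check current pricing for your region), a "Wait then Task" machine bills 4 transitions per execution, not 2, once the implicit start and end are counted:

```javascript
// Rough cost arithmetic for the point above: billed transitions include the
// implicit start and end states, per Yan's billing experiment.
function stepFunctionsCostUSD(executions, userStates, pricePerMillion = 25) {
  const billedTransitions = executions * (userStates + 2); // + start and end
  return (billedTransitions / 1_000_000) * pricePerMillion;
}
```

So a million executions of a two-state machine costs $100, double what the naive two-transition reading of the pricing page would suggest.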


Michael Wittig: 28:07  

Yeah, that's a good point, and I wasn't aware of that. I would have expected that it would charge two state transitions in the case that you mentioned, not four, so that's a good thing to keep in mind if you make a calculation for a project where there's a lot of volume, because this will basically double your costs, and that can be a problem.


Yan Cui: 28:28  

Yeah, I was really surprised when I found out that the pricing example they had in the documentation was wrong. But yeah, it is one of those things that you only find out if you really make a lot of effort, because AWS billing is not real time, it's a couple of hours behind. So I had to wait quite a long time just to confirm the suspicion I had. But I'm glad that I found out, and now I can have a more accurate estimate of what my Step Functions cost is going to be. I think you also mentioned another thing before we jumped on the call, that you had some interesting challenges around keeping secrets secure and how you deal with encryption and all of that.


Michael Wittig: 29:13  

Yeah. So basically the secrets that we have to take care of are: if you talk to the Slack API you need a token, and if you talk to the Microsoft Teams API you need another token as well. And I'm a big fan of keeping all of my configuration in the repository, in the Git repository. I know that a lot of people think that's not a good idea, but I like it. But of course we don't want to have the secrets in plaintext in the Git repository. So what we did is, we have a JSON file for configuration that has, among a bunch of others, two values: one for the Slack API token and one for the Teams API token. And when we check those files in, we encrypt them using the AWS KMS encrypt command. We only check in the encrypted file, so it's basically a binary file, you cannot see any text in there. Then when our Lambda functions start up, they read the file and do the KMS decrypt in memory. The decrypted values are never written to disk, they are only in memory, and then we can access them in our application code. When the Lambda function shuts down, the values are gone. That's the approach that we have taken. I think there are now maybe better alternatives, but depending on the number of values that you have in this configuration file, it might not be a good idea to configure everything through environment variables. So that's why we still use this approach. And the thing that I like is that it's part of the code, so at any time I can see, okay, the configuration three months ago looked like this, if I decrypt the file, and now it looks like this. You can also see how the configuration changed, and that works pretty well for us.
The only problem is, of course, that if two people make a change to this file at the same time, this is not something you can merge in Git, because it's just binary data that cannot be merged line by line. So if a lot of people are working on a project and you only have one of those encrypted files, this might not be a good idea, but in our case only two people are working on the project so far, and this is working fine.
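The cold-start load path can be sketched in a few lines. The decrypt call is injected here so the shape is testable without KMS; in production it would be the AWS SDK's KMS Decrypt operation on the blob read from the deployment package. Names are illustrative, not Marbot's code:

```javascript
// Sketch: the repo holds only the KMS-encrypted blob; at cold start the
// function decrypts it in memory and parses the JSON config. The plaintext
// only ever lives in memory, never in an env var or on disk.
async function loadConfig(encryptedBlob, decrypt) {
  const plaintext = await decrypt(encryptedBlob); // e.g. KMS Decrypt in production
  return JSON.parse(plaintext.toString('utf8'));
}
```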


Yan Cui: 31:37  

Yeah. Well, my personal rule is to never have those secrets in plaintext in environment variables, and as long as you can satisfy that, I think that's good enough. One thing I would probably add is that for secrets that don't change often, this is perfectly fine. One reason why you may want to use something like SSM Parameter Store instead is if you want to rotate those secrets regularly, using a cron job or something, so that you don't have to redeploy every time the secret changes, which can be quite labour intensive if you have a lot of microservices. That actually happened to one of the teams that I worked with in the past, where we updated, I think, the credentials for a MongoDB cluster we were using for dev, and the poor guy had to redeploy something like 50 different microservices. This is where, if you load the secrets at cold start from something like SSM Parameter Store, you change them in one place. You can still keep them in the source code, but when you commit, some pipeline updates SSM Parameter Store in all the different regions you're in, and the Lambda functions themselves will have some kind of cache and invalidation mechanism, so that every couple of minutes they go and check SSM to see whether or not there is a new version. So this allows you to centralise your configuration: you change it in one place, and all the functions just update without having to be redeployed. For your example, it sounds like you've got quite a few different configuration values, and they don't really change, so I think what you're doing makes total sense.
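
The cache-and-refresh pattern described here can be sketched roughly as follows. None of this is from an actual codebase; `fetch` is a placeholder for whatever reads the central store, for example a boto3 call like `ssm.get_parameter(Name=..., WithDecryption=True)`.

```python
import time

class CachedParameter:
    """Cache a value fetched from a central store (e.g. SSM Parameter
    Store) and re-fetch it every `ttl` seconds, so rotated secrets are
    picked up without redeploying the Lambda function.

    `fetch` is any zero-argument callable returning the current value;
    `clock` is injectable to make the TTL behaviour testable.
    """

    def __init__(self, fetch, ttl=120, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._value = None
        self._expires = 0.0  # forces a fetch on first access

    def get(self):
        now = self._clock()
        if now >= self._expires:
            # TTL elapsed (or first call): go back to the store.
            self._value = self._fetch()
            self._expires = now + self._ttl
        return self._value
```

With a TTL of a couple of minutes, every warm container converges on the new secret shortly after the pipeline updates the parameter, which is the "change in one place" behaviour described above.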


Michael Wittig: 33:16  

Yeah, so it doesn't change often, that's right. I have one question, because I haven't really used it in a serverless project: have you used the new AppConfig service? Because I think that could also solve this problem, where you want to deploy the configuration separately from deploying your code. But I'm not sure if it's actually used for serverless projects very often. What's your experience here?


Yan Cui: 33:38  

I haven't used it myself, and I haven't seen many other people use it. The only person I've heard talking about it for serverless projects is probably Ben Kehoe from iRobot. I think he was looking into it; I don't know if he's actually using it yet. But AppConfig is not as tightly integrated into Lambda as you would like. It would be great if, in the future, AWS made it just something that you configure with your Lambda functions. With AppConfig you can do all these things around having a config group, so you can do A/B testing and canarying, all of which is amazing, but it's not tied to your Lambda functions in a tightly integrated way, so all of that has to be done in your application code, and it's just not as easy to read from compared to something like SSM Parameter Store, I think. But there is, I guess, an opportunity there for them to do a tighter integration between Lambda and AppConfig, where you can just point your Lambda function to a config in AppConfig so that your function has access to it at runtime, maybe through an environment variable or through the context object. I'd love to see something like that happen, because AppConfig offers some interesting capabilities that you would have to build yourself if you were to use SSM or Secrets Manager.
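
To illustrate the "all of that has to be done in your application code" point, here is a rough sketch of the polling a function has to do itself. The client is injected; it stands in for `boto3.client("appconfig")`, and the call shape mirrors that client's `get_configuration` operation (which returns content only when the configuration version has changed). The application, environment, and profile names are invented for the example.

```python
import json

class AppConfigPoller:
    """Minimal sketch of client-side AppConfig polling, since Lambda
    has no built-in AppConfig integration. Keeps the last parsed copy
    and only re-parses when the service reports a new version."""

    def __init__(self, client, application, environment, configuration,
                 client_id):
        self._client = client
        self._args = dict(Application=application, Environment=environment,
                          Configuration=configuration, ClientId=client_id)
        self._version = ""
        self._config = None

    def get(self):
        resp = self._client.get_configuration(
            ClientConfigurationVersion=self._version, **self._args)
        content = resp.get("Content")
        if content:  # empty when nothing changed since our version
            # With real boto3 the content is a streaming body; with a
            # plain bytes payload we can parse it directly.
            body = content.read() if hasattr(content, "read") else content
            self._config = json.loads(body)
            self._version = resp["ConfigurationVersion"]
        return self._config
```

Passing the last seen version back to the service is what enables the canary-style rollouts mentioned above, but, as discussed, all of this bookkeeping lives in your code rather than in the platform.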


Michael Wittig: 35:03  

Yeah, I agree. And this was also what I discovered when I just played around with it. I think when it was released, the only source for a configuration was SSM, and now it can also be a file in S3, which makes it a little bit easier if you have larger files. So yeah, I think we have to wait a little bit and see improvements to the integration, and then this can really become a powerful tool. Yeah.


Yan Cui: 35:30  

And I guess one other thing I wanted to ask about Marbot: have you gone multi-region yet? Is that something you're considering, or have in the pipeline?


Michael Wittig: 35:40  

Yeah, so I'm a big fan of doing this, and I'm really looking forward to doing it. The problem for us is that it's not really a requirement. For example, Slack only operates in us-east-1, so if us-east-1 is down, we cannot send messages anyway because Slack will not be available. The thing that drives this requirement for us at the moment is that when we added support for Microsoft Teams, well, Microsoft Teams runs in multiple regions. So, for example, customers in the United States have a different endpoint from customers in Europe, and from customers in Asia. And the reason is basically that this is personally identifiable data, so we should process this data close to the customer, in the same jurisdiction. That's why we are more interested in this feature today: it's not really about downtime, it's about keeping the data close to the users and in the same jurisdiction. The only problem I had when I first looked into it was that with the first version of DynamoDB global tables it was not possible to use an existing table with data, and I didn't want to migrate all of my data, so that's not something I wanted to do. So I waited. And now, luckily, you can use it with existing tables that already have data in them. The only problem, and I'm waiting again, is that there's no support for the new global tables in CloudFormation. So I'm still waiting for CloudFormation support, and as soon as we see it, we are going to plan this feature. I'm really looking forward to doing it because, in our case, it's more of a technical challenge for us, but also, from the data perspective, it makes sense for our customers.
So yeah, it's not yet implemented, but I'm watching out for the changes that AWS releases, and I hope that I can do multi-region DynamoDB in CloudFormation soon.
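
While CloudFormation support is missing, the API route with the newer global tables version (2019.11.21) looks roughly like the sketch below: add replicas to the existing table via UpdateTable. The client is injected and stands in for `boto3.client("dynamodb")`; the table name and regions are made up, and in practice DynamoDB accepts only one replica create per call and the table must return to ACTIVE before the next one.

```python
import time

def add_replicas(dynamodb, table_name, regions):
    """Sketch: replicate an existing DynamoDB table into more regions.

    `dynamodb` is anything shaped like boto3's DynamoDB client, i.e.
    it exposes update_table(...) and describe_table(...).
    """
    for region in regions:
        # One replica per UpdateTable call.
        dynamodb.update_table(
            TableName=table_name,
            ReplicaUpdates=[{"Create": {"RegionName": region}}],
        )
        # Wait until the table is ACTIVE again before adding the next
        # replica; a production script might add a timeout here.
        while (dynamodb.describe_table(TableName=table_name)
               ["Table"]["TableStatus"] != "ACTIVE"):
            time.sleep(5)
```

This is exactly the kind of imperative glue that CloudFormation support would make unnecessary, which is why it sits on the wish list.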


Yan Cui: 37:51  

All right. That sounds like an AWS wish list item.


Michael Wittig: 37:55  

Yes. 


Yan Cui: 37:56  

Anything else that you would like to add to your AWS wish list here?


Michael Wittig: 38:00  

Oh, I think we already added a lot of items. A Kinesis on-demand mode would be great, where you pay per record and not per hour of a shard; then deduplication in Kinesis would be nice; and multi-region support for DynamoDB in CloudFormation. I think those are my three wishes, at least the ones that will help us with Marbot. Yeah.


Yan Cui: 38:26  

Excellent. Yeah, those are things that I've asked for in the past as well. Funny how everyone seems to hit the same problems. Okay, so I think that's all the questions I've got. Is there anything else that you'd like to tell the audience about, your next project, or things you're working on?


Michael Wittig: 38:43  

I think the last thing I might mention is that if you want to try out Marbot, it's free for the first 14 days. You can go to marbot.io, and there are two buttons: add to Slack and add to Microsoft Teams. As soon as you install it, Marbot will start talking to you, and a wizard starts where we ask you, okay, do you want to set up monitoring rules in AWS, and Marbot does all of that for you. So basically, once it's installed, the conversation starts, and we don't only help you with the alerts; Marbot also helps you set up all the configuration in AWS that you need so that you are actually notified if things go wrong. We have over 50 CloudWatch event rules, for things like GuardDuty, Security Hub, but also the Health Dashboard events. So there are a couple of sources that are interesting, and we connect to all of them, and then you will see the results in your Slack or in your Teams window. So yeah, check that out if you're interested, and we would love to see a few of your listeners using Marbot in the future.


Yan Cui: 39:58  

Excellent. And with that, I guess, thank you very much for taking the time to talk to us today. How can people find you on the internet?


Michael Wittig: 40:07  

Oh yeah, that's a good question. They can follow me on Twitter; my Twitter handle is @hellomichibye. I think we added it to the show notes to make sure that no one makes a mistake when typing it. They can also find me on LinkedIn; feel free to connect and send me a message, or mention me on Twitter. I'm always happy to chat about AWS topics, and serverless of course as well. And you can also find us on cloudonaut.io, where we publish AWS content every week. So those are the easiest ways to reach me.


Yan Cui: 40:43  

Yeah, sure, I'll put those in the show notes, and include the links to Marbot, to your blog, to your books and your courses, so that people can go and find you and all the content that you're putting out there. So again, thank you so much, Michael, for joining me today.


Michael Wittig: 41:01  

Yeah, thanks for having me, and have a good day. 


Yan Cui: 41:04  

You too. Take care, stay safe. Bye bye.


Michael Wittig: 41:07  

Bye.


Yan Cui: 41:19  

So that's it for another episode of Real World Serverless. To access the show notes, please go to realworldserverless.com. If you want to learn how to build production-ready serverless applications, please check out my upcoming courses at productionreadyserverless.com. And I'll see you guys next time.