Real World Serverless with theburningmonk
#19: Serverless observability with Lumigo and Uri Parush
You can find Uri on Twitter as @uri82042753.
To listen to the conversation we had with Aleksandar Simovic and Slobodan Stojanovic on FinDev, check out Episode 17.
For more stories about real-world use of serverless technologies, please follow us on Twitter as @RealWorldSls and subscribe to this podcast.
Opening theme song:
Cheery Monday by Kevin MacLeod
Link: https://incompetech.filmmusic.io/song/3495-cheery-monday
License: http://creativecommons.org/licenses/by/4.0
Yan Cui: 00:13
Hi, welcome back to another episode of Real World Serverless, a podcast where I speak with real world practitioners and get their stories from the trenches. Today I'm joined by Uri from Lumigo. Hi, welcome to the show.
Uri Parush: 00:26
Hey Yan, glad to be here.
Yan Cui: 00:27
So we've been working together at Lumigo for a little while now, for the audience who are not familiar with who Lumigo is, can you just quickly tell us about Lumigo, what it does and what is your role there.
Uri Parush: 00:38
Sure. So my name is Uri Parush, and I'm a senior developer with Lumigo. I have more than 10 years of experience in the industry as a developer, system architect and so on. And I joined Lumigo more than a year ago. So, a bit about Lumigo: what Lumigo actually is, is a serverless observability platform. And what does that mean? We help you troubleshoot issues that you have on the serverless platform, stuff like errors in Lambda and in your functions, but with the full power of flow context, in which you can actually see the business impact of those issues. And I will elaborate a bit more later. It's also a visibility tool for all the moving parts of your serverless system. This means it's not just about Lambda; you probably have many AWS services, like S3, Kinesis, DynamoDB, Firehose and so on. And we put them all in the same place, and show the full visibility of all your flows using them. Another great thing about Lumigo is predefined alerts. You get the right focus on your issues with specific alerts, which are predefined by Lumigo, and we use our experience to give you the best insights we can. And one more thing about Lumigo is that it also consolidates all your serverless application data. And what is data? I'm talking about logs, trace data and metrics. And we actually put all of this data in a single source of truth, which is usually hard work that each customer needs to do. And we do it for you. So the Lumigo product has been on an interesting journey since I started to work with Lumigo. The main focus of Lumigo was the operational team, and our real focus was to help the operations team with the day-to-day work of keeping production healthy. 
But as time passed by, we started to see more and more customers using Lumigo not just for the production environment, but for all their environments which includes the developer personal account, the CI/CD environment, staging environment, and so on. So today, we have customers using Lumigo all over the product development lifecycle, which is amazing. And of course in Lumigo we do the same for our environment so Lumigo is not just a production tool. It's a full product lifecycle tool.
Yan Cui: 03:22
Okay, so with Lumigo ingesting a lot of information from the customers. So you must be taking in a lot of data, because every time I invoke my function, you're going to get some data coming into your system. So, how does, well, what does the Lumigo architecture look like from, say, 30,000 feet view?
Uri Parush: 03:44
So in general, our system, like many other systems out there, is a data streaming and data processing system, which results in useful insights for our customers. We collect data from multiple sources, like tracing Lambda, where we use a lightweight agent, which we custom made for Lambda services. And we also collect the customers' logs and AWS services metrics, and we use all three sources to get the full context of the business flows in the system. For example, let's take an example of API Gateway triggering a Lambda which calls DynamoDB. With Lumigo, you get a clear view of the flow, along with the relevant context. And what do I mean by context? Let's say the API Gateway triggers the Lambda. What you will see in Lumigo is the event that triggered it, which includes the path, the query parameters, the headers, everything you need to know about the event that triggered the Lambda. You get the Lambda execution data, which also includes the errors, the return value, environment parameters and so on. And you also get the query for the DynamoDB, for example the table name, the context of the query, the data of the query, and so on. So you get the full context. So our architecture is based purely on serverless, and all our infrastructure uses only serverless technology. We don't have any VM or physical machine. And we use multiple AWS services like Kinesis, S3, Firehose, Lambda, DynamoDB, API Gateway and the list goes on. And we are always using more and more new AWS services. It has worked really well for us. So we usually use Kinesis Firehose to stream the data, we use DynamoDB to store the data, and we use Lambda to process the data. These are the core services in our infrastructure.
Yan Cui: 06:01
Okay, so as a customer that uses Lumigo, one of the criteria that's important for me to consider is how much overhead the Lumigo collection agent, or the whole process, is adding to my function invocation time. Have you guys done any work to minimise the latency for your collection API? So, for example, if you say your services are all serverless, I imagine you're gonna have API Gateway and Lambda, but Lambda has got cold starts, and it's not acceptable for me to finish my invocation and send you the data but then hit a cold start on your end. So are you doing anything on the server side, on the backend side of things, to optimise that latency, maybe moving the collection API to containers, perhaps?
Uri Parush: 06:50
So actually, we went this route. When we started developing the product, we used API Gateway, and we found it a bit problematic for our customers, because we had latency issues. So, our agent is very lightweight, and to keep it lightweight we made some changes in our infrastructure and moved to using containers and Fargate. We chose NGINX to process our trace data. And so our latency is really low now; we're talking about a few milliseconds to tens of milliseconds. It really depends on your Lambda configuration. And this is how we solved this issue for our customers, which is very important to them. Latency is a huge core idea in Lumigo. We don't want to impact the customer environment at all. So it's very lightweight.
Yan Cui: 07:48
Yeah, that's great. And another thing that I've noticed, because I've been using Lumigo myself in some of my client projects, is that you are scraping off some of the data. For example, sometimes the body or sometimes I noticed that some of the API keys are being scraped. Can you tell us a bit about some of the decisions that led to that because that's something that I don't think any of the other platforms are doing. But I think it's quite valuable especially in a time where GDPR and data privacy has become more and more important thing. And one of the sort of blindsides most applications have is, well, it's great that we are storing user data in the places where we can easily delete them if they asked us to. But we have also the same data being logged everywhere and it's going to be a pain to try to get rid of those. Is that why you guys are doing all this data, data scraping so that's, for GDPR compliance reasons, there's one other thing that you don't have to think about?
Uri Parush: 08:49
So, Lumigo is very customer oriented, and this specific request came from a customer for which security is a high value. It's even a security company, so they had to know that our product is aligned with all their security policies. So, it actually came from a customer and, of course, we do the stripping by default, but you can also have custom fields to strip for every customer, so it's really customised for all the customer's requests. But yes, it was a request from a customer. And with GDPR, and any other certification that you have, we want you to know that we are not damaging them.
Yan Cui: 09:42
Okay, that's great. I guess the flipside of that is that sometimes I do want to see the body, and it is scraped off by default. But I guess that's kind of the trade-off you're making here. So you're saying that it's possible for me to, say, ask you to not scrape the HTTP body for some operations, and you can configure that differently per customer, right?
Uri Parush: 10:05
Yeah, of course. Usually what we do is always keep a hook, such as some environment variables, that can be customised for each customer. So, we can have a customised environment for each of our customers, with the specific needs of their environment.
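As a rough illustration of the idea discussed here (default stripping of sensitive fields, with per-customer overrides supplied through environment variables), the following is a minimal sketch. It is not Lumigo's implementation: the default key list, the mask string and the variable names SCRUB_EXTRA_KEYS and SCRUB_KEEP_KEYS are all hypothetical, chosen only for this example.

```python
import os

# Hypothetical defaults for illustration; not Lumigo's actual list.
DEFAULT_SENSITIVE_KEYS = {"authorization", "x-api-key", "password", "body"}
# Hypothetical environment variable names (comma-separated key lists).
EXTRA_KEYS_ENV = "SCRUB_EXTRA_KEYS"   # additional keys to strip
KEEP_KEYS_ENV = "SCRUB_KEEP_KEYS"     # default keys this customer wants to keep
MASK = "****"


def _parse_keys(raw):
    """Turn 'a, B ,c' into {'a', 'b', 'c'}, ignoring empty entries."""
    return {part.strip().lower() for part in raw.split(",") if part.strip()}


def sensitive_keys(environ=None):
    """Build the effective set of keys to strip from defaults plus overrides."""
    environ = os.environ if environ is None else environ
    keys = set(DEFAULT_SENSITIVE_KEYS)
    keys |= _parse_keys(environ.get(EXTRA_KEYS_ENV, ""))
    keys -= _parse_keys(environ.get(KEEP_KEYS_ENV, ""))
    return keys


def scrub(value, keys):
    """Return a copy of value with any dict entry whose key is sensitive masked."""
    if isinstance(value, dict):
        return {
            k: MASK if str(k).lower() in keys else scrub(v, keys)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [scrub(item, keys) for item in value]
    return value
```

With no overrides, a call such as scrub(event, sensitive_keys()) masks the body and API key headers. Setting the hypothetical SCRUB_KEEP_KEYS to "body" would keep HTTP bodies visible for a customer who asks for that.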
Yan Cui: 10:24
Okay, that's useful. That's good to know. I'll probably have to do something with you guys, so that I can get some of my HTTP bodies in my Lumigo view. And so in this case that you are ingesting a huge amount of data from all these different customers. Tell us about the, I guess, your architecture for ingestion, and do you learn anything when you're iterating on this architecture around cost and performance, because again, with serverless it’s great when you are running at a low scale but as your scale goes up your cost can also get quite expensive as well, are you guys doing anything clever to optimise on the cost and performance?
Uri Parush: 11:04
Certainly. I had a nice journey on the cost issue, especially when I joined Lumigo. I had never developed a full serverless application before. And what I noticed here is that I needed a shift in my mind, because cost is very important in the serverless world. Usually I was focused before on performance, and cost was a side issue. Usually it was not a developer's responsibility; it was more the architect's or the VP R&D's decision. And now, each developer has a huge impact on the cost. And one of the things we noticed in Lumigo is that we used Kinesis to stream all of the data, and in some places Kinesis became very expensive for us. And the reason for that is that we couldn't over provision in Kinesis for free. You're paying per shard in Kinesis. And we actually need to do some over provisioning to be prepared for large amounts of data, and the data is not consistent and you never know when you're gonna get a peak, so it became more and more expensive for us to use Kinesis in some parts of our system. So what we did was some cost analysis to compare Kinesis and Firehose. And we found Firehose is a nice solution, a nice alternative, for some of our flows, and we actually switched from Kinesis to Firehose. The big advantage is that with Firehose you can over provision for free. It's not streaming like Kinesis, so you would have some latency issues if you used it all over your system, but in some parts of the system the latency was not an issue. So we made the change and it saved us a lot of money.
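The kind of cost analysis described here can be sketched as a comparison between a per-shard (provisioned) pricing model and a pay-per-usage model. This is only an illustrative sketch: the prices passed in are placeholders rather than real AWS prices, the 1 MB/s per-shard write capacity is used as an assumed default, and the function names are made up for this example.

```python
import math

HOURS_PER_MONTH = 730  # average number of hours in a month


def shards_needed(peak_mb_per_sec, headroom=1.0, mb_per_sec_per_shard=1.0):
    """Shards required to absorb the peak, multiplied by an over-provisioning factor.

    mb_per_sec_per_shard is an assumed per-shard write capacity.
    """
    return math.ceil(peak_mb_per_sec * headroom / mb_per_sec_per_shard)


def provisioned_monthly_cost(shards, price_per_shard_hour):
    """Per-shard model: you pay for every shard-hour, whether it is used or idle."""
    return shards * price_per_shard_hour * HOURS_PER_MONTH


def usage_monthly_cost(gb_per_month, price_per_gb):
    """Pay-per-usage model: you pay only for the data actually ingested."""
    return gb_per_month * price_per_gb


def compare(peak_mb_per_sec, headroom, gb_per_month,
            price_per_shard_hour, price_per_gb):
    """Return both monthly costs so the two models can be compared side by side."""
    shards = shards_needed(peak_mb_per_sec, headroom)
    return {
        "shards": shards,
        "provisioned": provisioned_monthly_cost(shards, price_per_shard_hour),
        "usage": usage_monthly_cost(gb_per_month, price_per_gb),
    }
```

The point the sketch makes is that in the per-shard model the headroom factor multiplies the bill directly, while in the usage-based model spare capacity costs nothing until data actually flows.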
Yan Cui: 13:06
And I guess Firehose has got a default throughput limit as well, right? If I remember correctly, it's like 10,000 records per second. Did you also have to ask AWS to raise that? Is that what you mean by over provision?
Uri Parush: 13:18
Yeah, exactly. So, till now AWS have been very gentle with us and granted all our requests. So yeah, I made a request to AWS, and it is an over provision of 10 times our traffic. And it was all good, and now we are prepared for much more and it doesn't cost us anything. So this was a big advantage of Firehose for us in this use case.
Yan Cui: 13:43
Yeah, the whole thing about managing shards, that's really annoying with Kinesis. And I do see the appeal of Firehose in this particular regard, because it's all paid by usage and you only pay for data that gets transferred, rather than paying for uptime for the shards. And the annoying thing also is that with Kinesis there's no built-in auto scaling, so you end up having to build your own auto scaling as well. I've done that like three different times. And even though they've got an API that you can call to change the shard number, there's no trigger for you to actually do it, you know, you have to do the whole thing yourself, or use the Application Auto Scaling mechanism, which is also not well documented for custom scaling behaviours like this either. But yeah, I do see the appeal, you know, why you guys went with that approach. So, another thing that I think we talked about before, in terms of what you've been doing to optimise for performance, is that you had to switch to a multi-region approach to improve the ingestion latency. Do you remember how much of a difference that made? Was it a case of a few milliseconds or was it tens of milliseconds? Any idea what sort of benefits you were able to derive from going multi-region?
Uri Parush: 14:59
So it was a big change, and it had a big impact on the latency, and we did it for two reasons. One reason is really the latency. The second one is cost, because communication within the same region is much cheaper than cross region. So we do it not just for our trace data, but also for our log data and metrics data. So we actually collect all the customer data within its specific region; we are not using cross region, we are using the same region for every collection. So, the impact varies; it was like a 10% impact. It was more than a few milliseconds. It was like tens to hundreds of milliseconds of difference.
Yan Cui: 15:50
Okay. Do you also look at any other options like Global Accelerator, which I went and looked at is quite expensive but it's also meant to be really good in terms of making sure that the latency is really good, doesn't matter where you are trying to access endpoints from.
Uri Parush: 16:08
So actually we have kind of started looking at it now, because we are in a solid place where our latency is pretty low and we don't have to improve it anymore to keep customers satisfied. But looking to the future, yeah, we might look at other solutions and improve it even more. But we currently have a very solid solution.
Yan Cui: 16:33
Okay, that's great. And one other thing that I remember we talked about before was that I think Lumigo is doing, which is, which I think is quite fun, it’s using Lumigo to monitor Lumigo itself. So what were some of the insights that you that you know you got from this? Surely you know you've probably experienced a lot of the same pain points that your customers do as well in terms of how difficult it is to troubleshoot a serverless application. How do you guys help identify problems for your customers?
Uri Parush: 17:07
So actually this is a really interesting question. I think this is one of the best things about working at Lumigo, is that you also used to be, also, a customer, and this, this gives us a really special view on serverless and development. And it actually means that every developer in Lumigo have a huge impact on the product design. A lot of the ideas that that we are implementing in Lumigo platform are coming internally from Lumigo developers, which is great. So, it's I'm talking about stuff like, which data is really important? Which graph is is useful for us, like do I need all the graph AWS? Do I need autograph? Do I need to correlate? Which data need to be correlate together? Which gives you the best view like, which logs are important? Which views views of timeline views, views of graph of flows? And many of the ideas coming internally. Also we have specific alerts that we invented in Lumigo that we know help us to get over some difficult like. And do I have a Lambda that stop triggering, and it could, it could see like an issue. It could be an issue, it's not an error, but it could be an issue for my system, usually a Lambda triggered every day and now it stops. So those special alerts are really internally designed in Lumigo and what what it does is that we, because we have a huge serverless platform. We many times see problems that our customer is not facing yet. And it saves our customer a lot of time that we see it in Lumigo, we develop the product to support them and then we see our customers using the same capabilities that we develop. So it's kind of amazing that that we always improving our product with our own insights.
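The "Lambda that stopped triggering" alert mentioned here can be thought of as a staleness check: a function that normally runs on some cadence is flagged when it has been silent for noticeably longer than that cadence. The sketch below is an assumption about how such a check could look, not a description of how Lumigo implements it; the function name, the input shapes and the grace factor are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_functions(last_invoked, expected_interval, now=None, grace=1.5):
    """Return the names of functions that have been silent for too long.

    last_invoked: dict of function name -> datetime of the last invocation.
    expected_interval: dict of function name -> timedelta between usual invocations.
    grace: multiplier on the expected interval before a function counts as stale.
    """
    now = now or datetime.now(timezone.utc)
    stale = []
    for name, last in last_invoked.items():
        limit = expected_interval[name] * grace
        if now - last > limit:
            stale.append(name)
    return sorted(stale)
```

No error is ever raised by the silent function itself, which is why this kind of condition does not show up in error-based alerting and needs its own check.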
Yan Cui: 19:15
Yeah, I think that's one of the really interesting things that Lumigo is doing. I'm not sure if the other vendors are doing that, but I definitely think it's really good practice to be dogfooding yourself. And I think Netflix used to talk about this quite... Was it Netflix or Amazon? They used to talk about this practice of dogfooding yourself quite a while back as well. And so you've been with Lumigo for a few years now and you said this is the first time you've had to work on a fully serverless architecture. What are some of the challenges that you have experienced as you transitioned to this different way of building things, this whole fully serverless or at least serverless-first approach to building systems?
Uri Parush: 19:58
So, it is an interesting question, because when I first joined Lumigo and looked at serverless architecture, it really changed my point of view on how you should develop and design a product or system. As I said before, I was focused most of the time on quality and performance issues and stuff like that, and cost was a big change for me because I never had such a huge responsibility for the cost. And I think this changes not just the mindset of the developer, but the whole organisation. So maybe the budget needs to be closer to the developer now, because he's more responsible for it, and stuff like that. So it really changes the point of view of things in your organisation. And so, this is my two cents about serverless development: you have to be prepared to change your thinking when you come into a serverless system, because it really changes the way you build things. So it's amazing that you can scale infinitely, or close to infinitely, but you have other responsibilities that come along with that. So this was a really big change for me.
Yan Cui: 21:27
Yeah, the whole idea of development and finance being, I guess, quite interlinked. That's something that Simon Wardley has talked about a lot. This whole idea of FinDev, where finance and development kind of work together, because performance optimization has a really tangible and measurable cost optimization as well. And this whole idea of FinDev is something that, actually, I did a podcast on recently with Aleksandar and Slobodan, who are also AWS Serverless Heroes. We pretty much did a whole episode centred around FinDev, and there are also some really clever things you can do when it comes to FinDev. For example, you can work out the return on investment on features and bugs and see where you should prioritise work, because a bug may be annoying and you may think it is quite important, but when you look at the actual cost of that bug, it may just be a few dollars. And, you know, you may decide, okay, it's not worth optimising at all. Or you're looking at a Lambda function that is inefficient, but you know you're going to spend a month optimising something that can only save you up to, what, $10 a month. Again, it's just not worth the engineering time you're going to spend on it, because engineering time is also a cost. And then there's also a time like this, especially with COVID-19, and the financial impact it had, the economic impact it had globally. 
This idea of FinDev can also be applied as a business advantage, whereby, you know, we get pay-as-you-go from AWS, and we can build products that are also charged on a variable basis, based on how much you use the platform, rather than always being a fixed fee, so that most customers end up saving a lot of money. And especially at a time like this, where there's so much uncertainty economically, I think that's also going to be a more and more common thing we're going to see in terms of products and services coming out that are built using serverless. I definitely think that's one area which is probably not explored as much as I would like to see, but I definitely think it's something that's going to be much bigger going forward. So in terms of other things that, you know, AWS can do better, are there any platform limitations that you have run into that made your life difficult?
Uri Parush: 23:47
Yes, so I actually talked about it before. I really, really want to see Kinesis become pay-per-usage really soon, because in our system it is currently one of the more expensive components. I don't want to pay for idle Kinesis. This is my pain with Kinesis: when we're not using it, I'm still paying for it. So I really want to see improvement in this area. The second one is multiple subscriptions on a log group. One of the things I can say about logs is that usually you have more than one system that needs to go over your logs. It could be a security system, it could be a visibility system, an observability system like ours, it could be other internal systems, and having just one subscription really limits your usage of logs. So I think it would be a big step forward to have multiple subscriptions. And another pain point that I discovered in AWS is the auto scaling of DynamoDB. I find that it's a bit slow. So it affects your system, because you have slow auto scaling and you get throttled for some time. I'm talking about minutes here, but still, in our line of work, where we are streaming in real time, a few minutes is kind of a lot. So what we have to do is apply some back pressure and wait for the auto scaling to kick in. But I want to see auto scaling work much, much faster. These are the three things that I really want to see improved in AWS, and I'm sure we will see them in the future.
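One common way to apply the kind of back pressure described here is to retry throttled writes with exponential backoff and jitter, so the writer slows down while capacity catches up. The following is a minimal sketch of that general technique, not Lumigo's code: ThrottledError is a hypothetical stand-in for whatever throttling error the data store client raises, and the write callable and all parameter values are assumptions made for the example.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error raised by the data store client."""


def write_with_backpressure(write, batch, max_attempts=6,
                            base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call write(batch), backing off and retrying while the store is throttling.

    The wait before each retry is drawn uniformly from [0, min(max_delay,
    base_delay * 2**attempt)], so retries spread out instead of arriving together.
    The last ThrottledError is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return write(batch)
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

In a stream-processing setup, letting the final error propagate typically causes the batch to be retried by the event source, which is itself a form of back pressure on the upstream stream.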
Yan Cui: 25:42
So you know that multiple subscription filter thing. Apparently, and this is true, it's already available today, but you have to raise a support ticket. It's not a service limit raise. It's a support ticket to say, please enable it on my account, and that's it, and it can be enabled on the whole account, not on specific log groups. It's almost like a secret they haven't talked about, but I only found out because I just happened to be talking to somebody from AWS, and they told me that that's a thing you can raise a support ticket and ask for. I don't know, it's crazy that they don't make that a thing that you can just turn on in the console. But yeah, there you go, that's one off the list. And on the auto scaling... Yeah, go on.
Uri Parush: 26:34
Yeah, I guess. If you're talking about log group subscriptions, we are doing an auto subscription for our customers to collect their logs, so it's not really reasonable that we open a ticket for each customer we have. So it's not an internal problem of ours, it's for all of our customers. So it's really problematic and I totally agree with you, it should be the default, and it should be very easy to do. I'm not sure why it's not like that, but I'm sure it's going to be.
Yan Cui: 27:01
Yes. And the other thing you mentioned was the auto scaling for DynamoDB. I also ran into that quite a few years ago. I think they've improved it a lot, but it still doesn't auto scale fast enough. One of the ways I found to work around it is that DynamoDB auto scaling is still using the same Application Auto Scaling mechanism, and when you enable it, it generates a CloudWatch alarm. So what you can do is hijack that and change the alarms. Because it uses, I think, five... it checks for five minutes, so you can change the alarm configuration to, say, one minute, so that the auto scaling kicks in a lot faster. Again, something you shouldn't have to do yourself, but just one of the things that I found. I guess with on-demand it's just going to be too expensive at the sort of scale that you are running at, and you're probably going to see some of the limits as well in terms of how far you can push with on-demand. And, okay. So those are some really good wishlist items for AWS to improve. So that's everything I wanted to cover. Before we go, is there anything else you'd like to tell the listeners, any personal projects you want to share, maybe Lumigo is hiring, and how do people get in touch with you?
Uri Parush: 28:19
So first of all, Lumigo is always hiring, and we are always looking for passionate people to join our team and be part of the industry leaders, which is great. And in my personal view, I believe that serverless is going to change the way developers build systems. I can really see that this is the future of development going forward, and serverless brings great solutions to many of the architectural challenges we had before. And it's a real refresh from traditional development stacks. So I really encourage everyone, even if you're just starting, even if it's a mini project, to try to play with it. It's mind-blowing. That's what I think about it. But one thing you should know when you start using serverless: you have to come into it with an open mind, because it changes a lot of your conceptions. So you need to think differently when using serverless, and you have to try it to understand it. So I really encourage everyone to try it.
Yan Cui: 29:31
And how do people find you on the internet, maybe, are you on Twitter, LinkedIn?
Uri Parush: 29:37
So yeah, I'm on LinkedIn and Twitter, or email me. It's uri@lumigo.io. You can find me on all three of those, and just ping me whenever you want. You can ask any question about serverless, and we'll be happy to help. Everybody in Lumigo really wants to contribute back to the serverless community.
Yan Cui: 30:06
Yeah, I've been working with Lumigo for a while now as a developer advocate for a couple of days a week, and I have to say the guys are really friendly. And I love interacting with Erez and Aviad and Efi and everybody, so it's a really good team, a good bunch, and they are doing quite a lot of work around the community, including all the stuff that they asked me to do in terms of open source and all that as well. So it's been great having you on the show, Uri. Take care, and stay safe. I hope the lockdown hasn't been quite as extensive over in Israel as it has been here.
Uri Parush: 30:39
It's already a bit better, I think. But yeah, we are all hoping for this to finish and to get it over with, and get on with life and do serverless stuff.
Yan Cui: 30:49
Yeah, hopefully that happens. Take care and stay safe. Bye bye.
Uri Parush: 30:54
Thank you very much, Yan. Bye bye.
Yan Cui: 31:09
That's it for another episode of Real-World Serverless. To access the show notes and the transcript, please go to realworldserverless.com. And I'll see you guys next time.