Real World Serverless with theburningmonk

#8: Serverless at LifeWorks with Scott Smethurst and Diana Ionita (Part 2)

April 22, 2020 Yan Cui Season 1 Episode 8

This is part 2 of my conversation with Scott Smethurst and Diana Ionita about their work at LifeWorks. We discussed the platform limitations and usability issues that sometimes make serverless development difficult, including problems with CloudWatch Logs and X-Ray that push you towards third-party vendors.

In part 1, we talked about the story of LifeWorks, a wellbeing startup that was acquired for $400M, in part thanks to the heroics of Scott and Diana. They implemented business-critical features using serverless technologies, delivered them under severe time constraints, and created significant business value, which played a part in the acquisition.

For more stories about real-world use of serverless technologies, please follow us on Twitter as @RealWorldSls and subscribe to this podcast.

Opening theme song:
Cheery Monday by Kevin MacLeod
Link: https://incompetech.filmmusic.io/song/3495-cheery-monday/
License: http://creativecommons.org/licenses/by/4.0/

spk_0:   0:00
This is part two of my conversation with Scott Smethurst and Diana Ionita about their work at LifeWorks, a wellbeing company that was acquired for $400 million, in part thanks to the work that Scott and Diana did, turning around some business-critical features using serverless technologies very, very quickly. In this episode, we talked about platform limitations and problems with tools such as CloudWatch Logs and X-Ray. As always, I hope you enjoy this conversation, and give us a follow on Twitter at RealWorldSls if you do. And another thing I want to ask about your work at LifeWorks is that you've got, what, 25 different microservices. How do the microservices interact with each other? Is it over synchronous API calls? Or were you using a lot of events through SNS or SQS or EventBridge?

spk_1:   1:06
I would say it was mostly through messaging, so when something happened, a service would publish an event. Another microservice could subscribe to that event if it was of interest and, you know, react in some way to it. I would say occasionally a service would call another service if it was, you know, "get me this" and that service owned that data. I know there's a lot of debate on whether that's the right way of doing things, but yeah, that's what we were doing. And it didn't seem to cause too much bother. Yeah, most of it was through messaging, and there wasn't that much inter-service communication going on, really.
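
Editor's sketch: the publish/subscribe interaction Scott describes can be illustrated in a few lines of Python. This is a minimal in-process example, not LifeWorks' actual code. The `EventBus` class, the `user.registered` event name and the handler are all hypothetical; in a real system the bus would be a managed service such as SNS, SQS or EventBridge.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class EventBus:
    """Tiny in-memory stand-in for a message broker (hypothetical)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        # A service registers interest in one type of event.
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> int:
        # The publisher does not know who is listening; it only emits the event.
        handlers = self._subscribers.get(event_type, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)


bus = EventBus()
welcome_emails: List[str] = []

# One "service" reacts to an event that another "service" publishes.
bus.subscribe("user.registered", lambda event: welcome_emails.append(event["email"]))

delivered = bus.publish("user.registered", {"email": "sam@example.com"})
ignored = bus.publish("order.created", {"id": 1})  # nobody subscribed to this one
```

The publisher never calls the subscriber directly, which is the property that keeps inter-service coupling low.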

spk_0:   1:47
Okay. And Diana, before we started the show, you mentioned that you guys are also using Step Functions. What are you guys doing with Step Functions?

spk_2:   1:57
Yeah. So at LifeWorks, with Step Functions we built a couple of workflows for interacting with Lokalise, which is a tool which translates copy, either automatically or with actual human beings that know the languages. They had good webhooks, and you could do things with the tool out of the box. But we found some use cases that were not covered. For instance, we had multiple repositories, and when we pushed changes to them, we wanted translations to happen automatically. So we built a workflow with Step Functions which would first detect that new keys had been added to Lokalise, and then create translation orders, and tasks to translate internally as well; we had that requirement. And when translations were finished, something would pick them up and create a pull request on a repository. That was built with Step Functions.
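
Editor's sketch: a workflow of the kind Diana describes might be declared roughly as below, written as a Python dictionary in the Amazon States Language shape. The state names, function names and placeholder ARNs are assumptions made for illustration, and the sketch covers only the first half of the flow (detecting keys and creating orders and tasks), not the pull request step.

```python
import json
from typing import Any, Dict, List

# Placeholder ARN prefix; a real definition would use actual Lambda ARNs.
LAMBDA = "arn:aws:lambda:REGION:ACCOUNT_ID:function:"

definition: Dict[str, Any] = {
    "Comment": "Sketch of a translation workflow (hypothetical state names)",
    "StartAt": "DetectNewKeys",
    "States": {
        "DetectNewKeys": {
            "Type": "Task",
            "Resource": LAMBDA + "detect-new-keys",
            "Next": "CreateTranslationOrders",
        },
        "CreateTranslationOrders": {
            "Type": "Task",
            "Resource": LAMBDA + "create-translation-orders",
            "Next": "CreateInternalTasks",
        },
        "CreateInternalTasks": {
            "Type": "Task",
            "Resource": LAMBDA + "create-internal-tasks",
            "End": True,
        },
    },
}


def dangling_targets(machine: Dict[str, Any]) -> List[str]:
    """Return StartAt/Next targets that do not name a defined state."""
    states = machine["States"]
    targets = [machine["StartAt"]]
    targets += [state["Next"] for state in states.values() if "Next" in state]
    return [target for target in targets if target not in states]


problems = dangling_targets(definition)
definition_json = json.dumps(definition, indent=2)  # the form a deployment would submit
```

The small `dangling_targets` check is only there to catch typos in state names before deploying.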

spk_0:   2:51
All right, that sounds like a pretty cool use case for Step Functions. So earlier, Diana, you also mentioned that keeping up with everything that's happening in AWS and learning everything is quite challenging. Are there any other platform limits or lack of tooling that makes life difficult for you when you're working with Lambda on a day-to-day basis?

spk_2:   3:11
Since I mentioned Step Functions just now, there is a use case that I'm currently building a step function for, to get some automated transcriptions. But I'm finding that they can't be triggered by SNS or SQS events, which meant that I had to introduce a proxy Lambda, which picks up that event and starts the step function. That's one limitation. Another thing that's been hitting us recently is the 64-character limit on IAM role names. I don't understand the character limit. There's a plugin that generates the name based on your Lambda function name, region and other parameters. And we tend to name our Lambda functions so that we understand what they're doing just by their names. So, yeah, the 64-character limit is an issue for us.
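
Editor's sketch: the proxy Lambda Diana mentions could look something like the following. It is an illustration under assumptions, not her implementation: the `STATE_MACHINE_ARN` environment variable, the placeholder ARN and the `{"message": ...}` input shape are made up here. The only AWS call used is boto3's Step Functions `start_execution`, and the client can be injected so the handler can be exercised without AWS.

```python
import json
import os
from typing import Any, Dict, List, Optional

# Hypothetical configuration; the default value is a placeholder, not a real ARN.
STATE_MACHINE_ARN = os.environ.get(
    "STATE_MACHINE_ARN",
    "arn:aws:states:REGION:ACCOUNT_ID:stateMachine:transcription",
)


def handler(event: Dict[str, Any], context: Any = None,
            sfn_client: Optional[Any] = None) -> List[str]:
    """Start one state machine execution per SNS record in the event."""
    if sfn_client is None:
        import boto3  # imported lazily so the sketch can run without AWS installed
        sfn_client = boto3.client("stepfunctions")

    execution_arns = []
    for record in event.get("Records", []):
        message = record["Sns"]["Message"]
        response = sfn_client.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            # Execution input must be a JSON string, so wrap the raw SNS message.
            input=json.dumps({"message": message}),
        )
        execution_arns.append(response["executionArn"])
    return execution_arns
```

An SQS-triggered variant would read `record["body"]` instead of `record["Sns"]["Message"]`.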

spk_1:   3:56
So much so. I'd +1 that; the limit's terrible.
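
Editor's sketch: one way to live with the 64-character limit on IAM role names without giving up descriptive function names is to truncate the generated name and append a short hash. The naming pattern below is hypothetical (real plugins use their own), and this is not what Scott and Diana did; they describe shortening function names instead.

```python
import hashlib

IAM_ROLE_NAME_MAX = 64  # the limit on IAM role name length discussed above


def role_name(service: str, stage: str, function: str, region: str) -> str:
    # Hypothetical naming pattern built from the function name, region, etc.
    return f"{service}-{stage}-{function}-{region}-lambdaRole"


def fit_role_name(name: str, limit: int = IAM_ROLE_NAME_MAX) -> str:
    """Return the name unchanged if it fits, else truncate and append a short hash."""
    if len(name) <= limit:
        return name
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()[:8]
    # Keep as much of the readable prefix as possible; the hash keeps names unique.
    return f"{name[: limit - len(digest) - 1]}-{digest}"


long_name = role_name(
    "translations", "production", "create-pull-request-for-finished-orders", "eu-west-1"
)
short_name = fit_role_name(long_name)
```

The trade-off is that the tail of the name (often the region) is lost from the readable part, so the hash is what distinguishes otherwise similar roles.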

spk_2:   4:04
Ah, yeah. One other thing that I ran into in the past few months: we were trying to use Lambda@Edge to modify a response coming out of the API. We needed to cache the response of the API, but still change it; well, basically, shuffle around some elements of arrays to make it appear to the user that it was a different response. So we tried looking into Lambda@Edge to do that, but we found that you can't modify the response body, for some reason. So, yeah, that's a limitation. And I mentioned, I guess, CloudWatch Logs: you can't search for logs across regions.

spk_0:   4:43
Okay. Yeah, so that Lambda@Edge one, that's interesting. I guess there may be a security reason why they don't let you modify the body at the edge. There are potentially other things you could try in this case, like Cloudflare Workers. They are a really good product. One of the amazing things about Cloudflare Workers is that they can update globally within a few seconds, whereas with CloudFront right now, which they have just shortened to a few minutes, it's still quite a big difference compared to the developer experience and feedback you get when you work with Cloudflare Workers. What about you, Scott? Is there anything that you find annoying, besides the 64-character limit on IAM roles?

spk_1:   5:23
That one definitely sprung to mind as Diana was talking. Yeah, I really hate that, because, I mean, the main solution we seem to have for that is just to shorten your function name and make it less descriptive, which is something you don't typically want to do. Another thing that springs to mind: you can set a reserved concurrency on a Lambda function. I guess one thing I don't like about the way that works is I would like to be able to set a maximum number of concurrent instances for a Lambda function without it eating into the account-wide limit. That sounds a bit weird; for anyone who's not sure what I'm talking about, I think the default account-wide limit for concurrent Lambdas is 1,000. It's a soft limit, but if I was then to give a Lambda function a reserved concurrency of 50, it would reduce the account-wide limit to 950. So, yeah, by limiting the concurrency on a Lambda, you eat into that account-wide limit. I'm curious why we can't just make it so you can set a maximum without doing that.

spk_0:   6:35
I don't think there's any technical reason why you can't. And also that naming, "reserved concurrency", is super confusing. Until they introduced provisioned concurrency, a lot of people thought reserved concurrency meant that you reserve 50 instances of containers for this Lambda function, so there are always 50 available, whereas it doesn't mean that at all. It actually means 50 maximum concurrency for this function. And the reason why it's called reserved is because you're reserving 50 out of your 1,000 regional concurrency for this particular function. And it's funny that it works as both a max and a reservation of some concurrency out of your regional limit, which is confusing.

spk_0:   7:17
The fact that it's both things, and the way it's named, is not helpful.
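
Editor's sketch: the arithmetic behind reserved concurrency, as Scott and Yan describe it, fits in one function. The figures (a regional limit of 1,000 and a reservation of 50) come from the conversation; the function and variable names are made up for illustration.

```python
from typing import Dict


def unreserved_concurrency(regional_limit: int, reserved: Dict[str, int]) -> int:
    """Concurrency left for all functions that have no reservation of their own."""
    return regional_limit - sum(reserved.values())


# Reserving 50 for one function caps it at 50 concurrent executions
# AND removes those 50 from the pool every other function shares.
remaining = unreserved_concurrency(1000, {"my-function": 50})
```

Here `remaining` is 950, which is the "eating into the account-wide limit" effect Scott would like to avoid.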

spk_1:   7:22
I agree. It's a horrible name, and I can see why people thought that it did reserve, you know, 50 warm instances or whatever, because it does kind of sound like that from the name. I think my main other gripe day to day is, I mean, Diana already touched on it, but it's CloudWatch, really. You know, it'd be really nice if you didn't have to so quickly consider third-party tools for stuff like log aggregation and monitoring. I mean, CloudWatch is getting better. I think it can aggregate your logs now. But as Diana touched upon, it will only do that within the scope of a region. I just find its interface not that nice to use as well. I mean, in a lot of the other monitoring tools, if you see an error on a chart, you can very easily pivot to the logs from the chart. That's just not easily done in CloudWatch, either. I really would like it if you could just remain within the ecosystem of AWS without having to even consider using stuff like Datadog. I think the same goes for stuff like X-Ray, which is a really cool tool. But it's nowhere near as advanced as some of the other stuff out there, you know, like Lumigo and all these kinds of tools. I'm just curious why AWS haven't invested a bit more in the DevOps side of things, you know, to make it easier to manage this stuff once it's actually out in the wild.

spk_0:   8:49
I think part of that is that AWS has, at least traditionally, this tendency to prioritise cost over developer experience. In fact, that also applies when you take any AWS certifications: you're explicitly told that if there are two answers and both are correct, you are supposed to pick the answer that's got the lower cost. And that also applies to a lot of the decisions that they've made for CloudWatch and for X-Ray. For example, CloudWatch Logs Insights lets you query multiple log groups at the same time, but only up to 20 log groups, so you have to know roughly the ballpark of what you're looking for before you can even hope to find it. Say I've got the string for my error message and I want to find out where it came from. Well, if you've got 135 functions, as in your case, well, good luck. That means you've got to query the groups 20 at a time, unless you know roughly what it is you're looking for and where to find it. Again, that's because they don't want to add costs for the customers, which is great. But then again, it comes at the sacrifice of a really great developer experience. And the same goes for simple things. Like, until recently, they wouldn't even show you the metric for the number of concurrent executions, given that you don't pay for that metric, I think. Also things like, in your logs, you see a report of how long your function ran for: the duration, how long you were billed for, as well as the init duration. So the billed duration and init duration, those are useful information for me to find out. But again, they don't expose those as metrics, for cost reasons. So I think a lot of these tools are not as good as they should be because they prioritise cost over the actual developer experience. And that's something that I guess I appreciate; it helps keep my AWS costs down. At the same time, that often means that everyone gets pushed to some third-party service that you're going to have to pay even more for, just because the built-in tools are not as good as they could have been.
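
Editor's sketch: two of the workarounds implied above can be shown in a few lines of Python. The first splits a list of log groups into batches of 20, the per-query limit Yan mentions; the second pulls the duration figures out of a Lambda REPORT log line so they can be used as metrics. The log group names and the sample REPORT line are illustrative, not taken from a real account.

```python
import re
from typing import Dict, Iterator, List

MAX_LOG_GROUPS_PER_QUERY = 20  # the limit mentioned in the conversation


def batches(log_groups: List[str],
            size: int = MAX_LOG_GROUPS_PER_QUERY) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` log groups."""
    for start in range(0, len(log_groups), size):
        yield log_groups[start:start + size]


# Longer field names come first so "Billed Duration" is not read as "Duration".
REPORT_FIELD = re.compile(r"(Billed Duration|Init Duration|Duration): ([\d.]+) ms")


def parse_report(line: str) -> Dict[str, float]:
    """Extract the millisecond figures from a REPORT line."""
    return {name: float(value) for name, value in REPORT_FIELD.findall(line)}


groups = [f"/aws/lambda/function-{i}" for i in range(135)]  # hypothetical names
query_batches = list(batches(groups))

sample = (
    "REPORT RequestId: 00000000-0000-0000-0000-000000000000\t"
    "Duration: 102.25 ms\tBilled Duration: 200 ms\t"
    "Memory Size: 128 MB\tMax Memory Used: 70 MB\tInit Duration: 154.31 ms"
)
figures = parse_report(sample)
```

With 135 functions, that is seven separate queries (six of 20 log groups and one of 15), which is the "good luck" scenario described above.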

spk_1:   10:46
I was just thinking that, as you said, I mean, it's a nice idea, but in some ways it's a false economy if you then go to another third-party product and pay even more to those guys. So ultimately the customer's still paying.

spk_0:   11:00
Ah, yep, and especially when you have to go to something like Datadog, which is just amazingly expensive. I've seen so many companies that moved away from Datadog because of the way they charge you, based on the number of resources. So if you've got a hundred or a thousand containers, well, you're going to pay five bucks per month for every single one of those, so your costs rack up really, really quickly. And many of the other vendors you find in this particular space are also really expensive, especially when you start using APM solutions. Essentially, they're their version of X-Ray, but they also instrument everything. And the gripe I have with X-Ray is that it takes a lot of work on the developer's part to actually instrument all of your code, whereas things like Lumigo do a lot of the work for you to auto-instrument your system, so that you don't have to instrument every single place where you use the AWS SDK, and every place where you're making an API call to some third-party HTTP endpoint.

spk_2:   11:53
To be fair, if you're using Node.js or something that supports modules, you can write a module that wraps it. And you would just use that instead of having to instrument every HTTP call or every AWS SDK call.

spk_0:   12:06
But for every function where you're requiring the HTTP module, you either have to require a custom one that you have pre-wrapped, or you have to require it and then use the X-Ray SDK to wrap the HTTP module, right?

spk_2:   12:19
Ah, yeah, that's what we did at LifeWorks. Instead of SuperAgent, we were using a wrapper over SuperAgent that we called "HTTP client with tracing", and it would make sure that all the HTTP calls we did were traced.

spk_1:   12:34
Yeah. One nice thing we had as well was that it just used a little environment variable, so you could turn tracing on and off within a service, and it knew about that variable, so we could just knock it off. We wanted to avoid exactly what you described, where we'd have to put this X-Ray code absolutely all over the place. We just wrote it in two places, basically: one for HTTP stuff, and the second place for the AWS SDK.
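
Editor's sketch: the pattern Diana and Scott describe, a single wrapper that adds tracing and an environment variable that switches it on and off, can be illustrated as follows. This is a Python stand-in for their Node.js wrapper, not their code: the `TRACING_ENABLED` variable name is hypothetical, `fake_http_get` stands in for a real HTTP client, and traces are appended to a list instead of being sent to X-Ray.

```python
import os
import time
from typing import Any, Callable, Dict, List

TRACES: List[Dict[str, Any]] = []  # stand-in for a real tracing backend


def tracing_enabled() -> bool:
    # TRACING_ENABLED is a hypothetical variable name, read on every call
    # so tracing can be switched without changing the calling code.
    return os.environ.get("TRACING_ENABLED", "false").lower() == "true"


def with_tracing(name: str, call: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap `call` so each invocation is recorded when tracing is switched on."""

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        if not tracing_enabled():
            return call(*args, **kwargs)
        start = time.perf_counter()
        try:
            return call(*args, **kwargs)
        finally:
            # Record the call even if it raised.
            TRACES.append({"name": name, "seconds": time.perf_counter() - start})

    return wrapper


def fake_http_get(url: str) -> str:
    # Placeholder for a real HTTP client call.
    return f"response from {url}"


# Application code imports this wrapped client instead of the raw one,
# so the tracing logic lives in exactly one place.
http_get = with_tracing("http.get", fake_http_get)
```

A second wrapper of the same shape around the AWS SDK client would give the "two places" Scott mentions.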

spk_0:   13:02
Okay. Yep. And have you monitored your cold start time since you made that change?

spk_1:   13:07
The cold start time? I mean, I believe we were monitoring cold start times, but I'd be lying if I told you we did it before and after we introduced X-Ray. It was controlled by an environment variable, so we weren't tracing absolutely everything. We would knock it on if we thought we needed to. For example, if we'd released, you know, something new and were a little bit concerned about it, and we knew that if there was a problem we would really like to have traces, then we would knock it on.

spk_0:   13:39
Yes. So the moment you require the X-Ray SDK, regardless of whether or not you use it, that's going to add about 150, maybe 200 milliseconds to your cold start. Maybe not as much nowadays, but certainly when I last measured it, it was 100-plus milliseconds just from requiring the X-Ray SDK, and then maybe some more for actually instrumenting the HTTP client or the AWS SDK.
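
Editor's sketch: the kind of measurement Yan refers to, how long it takes just to load a module, can be approximated with a timer around the import. The example below is a Python analogue of timing a Node.js `require`, using the small standard-library module `colorsys` as a stand-in for a heavy SDK. The numbers it produces say nothing about the X-Ray SDK itself.

```python
import importlib
import sys
import time


def measure_import(module_name: str) -> float:
    """Return the seconds taken to import a module that is not yet loaded."""
    # Drop any cached copy so the import really runs again; already-loaded
    # dependencies of the module stay cached, so this is a lower bound.
    sys.modules.pop(module_name, None)
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


elapsed = measure_import("colorsys")
print(f"importing colorsys took {elapsed * 1000:.2f} ms")
```

In a Lambda function, this cost is paid during initialisation, which is why a module that is loaded but never used can still slow down cold starts.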

spk_1:   14:01
Wow, that's a lot. Just the X-Ray one? Yeah.

spk_2:   14:04
Have you tried requiring aws-xray-sdk-core? I think that's what we were doing. It might be smaller. Yep.

spk_0:   14:11
So that's what I was doing as well. It wasn't the full X-Ray SDK, it was just the core. Okay, so that thing is pretty heavy. And also, when you start using it, when you do anything locally, it just breaks. You have to do so much just to switch it totally off, so that when I run the functions locally it doesn't break.

spk_2:   14:27
I remember that was just a matter of setting an environment variable, wasn't it?

spk_0:   14:31
Yes. But then there's also another thing, you see: if you record custom trace segments, then that just doesn't work that way. It just barfs.

spk_2:   14:40
Ah, I see.

spk_1:   14:41
Yeah, I'm not sure we were using it to that level. We were doing very, very basic traces. It's bloody useful, though, to be fair. You know, if you have X-Ray traces of all of your HTTP calls and all of your AWS calls versus having nothing, it's certainly useful. But I didn't realise it was such an insane performance hit on cold starts. That's crazy.

spk_0:   15:08
Yeah. So X-Ray does have built-in sampling. So once you hit, I think, five or something requests per second, it kicks into sampling. So you can also customise your sampling configuration based on the service name, or just change the global default setting. But that performance overhead is more excessive than I would like. So I want to cover one more thing before we go, which is: imagine you can ask AWS to improve something, anything. What would be your number one ask for them to improve, to make life better?
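
Editor's sketch: sampling of the kind Yan describes, where every request is traced up to some per-second threshold and only a fraction beyond it, can be modelled with a small class. The threshold of five comes from his recollection in the conversation and the fixed rate is arbitrary; neither should be read as X-Ray's documented defaults, and the class is not part of any SDK.

```python
import random
from typing import Callable, Optional


class ReservoirSampler:
    """Trace the first N requests each second, then a fixed fraction of the rest."""

    def __init__(self, reservoir_per_second: int, fixed_rate: float,
                 rng: Optional[Callable[[], float]] = None) -> None:
        self.reservoir_per_second = reservoir_per_second
        self.fixed_rate = fixed_rate
        self._rng = rng or random.random
        self._current_second: Optional[int] = None
        self._used = 0

    def should_trace(self, now: float) -> bool:
        second = int(now)
        if second != self._current_second:
            # A new second has started, so the reservoir is refilled.
            self._current_second = second
            self._used = 0
        if self._used < self.reservoir_per_second:
            self._used += 1
            return True
        # Reservoir exhausted for this second: fall back to probabilistic sampling.
        return self._rng() < self.fixed_rate


# With a fixed rate of 0.0 the behaviour is deterministic: only the reservoir traces.
sampler = ReservoirSampler(reservoir_per_second=5, fixed_rate=0.0)
decisions = [sampler.should_trace(10.0 + i * 0.01) for i in range(8)]
next_second = sampler.should_trace(11.0)
```

In this run the first five requests in the second are traced, the next three are not, and the first request of the following second is traced again.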

spk_1:   15:40
I think for me it would be, yeah, some of the gripes we've already talked about. Previously I would have said cold start times within a VPC; that used to drive me nuts, because it meant certain things just couldn't be used. For example, if you wanted to have a Lambda function talking to an ElastiCache cluster, it has to be in a VPC so that it's secured. And it was just a ridiculous time for cold starts. I mean, I must admit I've not personally tried out the difference, but I hear the improvements are very significant. So I guess that's probably solved now. So, yeah, maybe my new biggest gripe would be some of the stuff I've said around the monitoring and X-Ray type stuff. It would be cool if they were, you know, just a little bit more useful. I appreciate they're never going to get to the level of some of the dedicated products, but even if you could easily go to the error charts you get in the Lambda console and pivot straight to the log entry for that error, even that would make me very happy.

spk_2:   16:53
Yeah, I have to agree with Scott. If anything, CloudWatch Logs improvements would be a real win.

spk_0:   16:59
So there's certainly one thing I've learned about CloudWatch Logs which we used to struggle with, which was, you know how you can only have one subscription filter per log group? So it turns out that's not a hard limit per se. But the only way to raise that for the account is by raising a support ticket; not a service limit raise, but a support ticket, asking for an increase on your account. Then you can have more than one subscription filter for every single log group.

spk_2:   17:28
That is not discoverable at all.

spk_0:   17:30
No, no, no. Okay, so we're coming up to the hour mark, and I'd like to thank you guys very much for taking the time to talk to me today. It's always a pleasure talking to you guys. And before we go, is there anything that you'd like to tell the listeners? Maybe tell them where to find you guys on the internet, and if LifeWorks or Uptime is hiring in London.

spk_1:   17:52
I think the only thing I'd like to say is, yeah, the current client is a new startup called Uptime. They're a new company creating the next-generation learning app with very carefully curated, bite-size content from various world-renowned experts. The idea of it is that you'll be able to turn, you know, that 50 minutes of downtime you might have in the day into uptime, you know, where you actually help yourself grow and you learn something. The app will probably launch fairly soon, hopefully even next month. Keep your eye out for it. And if anyone's interested, they can visit the website. It's uptime.app. There are some job openings as well, but not developer ones currently; there are a bunch of roles in the product and content space. So if that's your bag, then please take a look and apply.

spk_0:   18:46
Where can people find you guys?

spk_2:   18:48
The best place to find me is on LinkedIn. Maybe mention that you listened to the podcast; I get messages from people that I don't know all the time.

spk_1:   18:56
Yeah, I don't really use Twitter, to be honest, Yan. I'm the same as Diana: I'm best found on LinkedIn. I need to improve my social media game. Unlike you, I'm rubbish.

spk_0:   19:08
All right. Okay. Again, thank you guys very much for doing this today. Enjoy the rest of the evening, and I look forward to seeing you guys again for ramen next time.

spk_1:   19:17
Look forward to it.

spk_2:   19:20
Thank you.

spk_0:   19:21
See you. Bye, good night. So that's it for my conversation with Scott and Diana about their serverless story at LifeWorks. I want to thank you guys for joining us this week. To access the show notes and transcript, please go to realworldserverless.com, and I'll see you guys next week.

events vs API calls
step functions @ LifeWorks
on platform limitations