Real World Serverless with theburningmonk

#12: Serverless at LEGO with Nicole Yip and Sheen Brisals

May 20, 2020 Yan Cui Season 1 Episode 12
Chapters
00:02:03
How LEGO's teams are organized
00:04:00
On giving developers access to production
00:10:09
How does on-call work at LEGO?
00:17:13
How do you onboard new engineers to serverless?
00:23:59
How do you organize your repos?
00:26:50
What programming languages do you use?
00:28:17
Are you doing anything to optimize cold start performance?
00:32:11
On using webpack and layers
00:38:00
How many users do you serve at peak?
00:38:57
On the business values of Serverless
00:42:55
What other pain points do you experience with AWS?
00:48:13
AWS wish list items

This episode is sponsored by Dynobase, a modern DynamoDB GUI Client.

You can find Nicole on Twitter as @Pelicanpie88 and Sheen as @sheenbrisals.

To hear Sheen's chat with Jeremy Daly over on Serverless Chats, go here. And you can read more about Sheen's thoughts on serverless on his Medium blog here.

To see Nicole's talk at Serverless London, go here.

For more stories about real-world use of serverless technologies, please follow us on Twitter as @RealWorldSls and subscribe to this podcast.

Opening theme song:
Cheery Monday by Kevin MacLeod
Link: https://incompetech.filmmusic.io/song/3495-cheery-monday/
License: http://creativecommons.org/licenses/by/

spk_0:   0:12
Hi, welcome back to another episode of Real World Serverless, a podcast where I speak with real-world practitioners and get their stories from the trenches. Today I'm joined by Sheen and Nicole from LEGO. Welcome to the show, guys.

spk_1:   0:25
Thank you.

spk_2:   0:26
Hey, glad to be here.

spk_0:   0:29
So obviously we don't need to introduce everyone to LEGO, because we all know who LEGO is and we all love LEGO. Maybe just give us a quick introduction to what you do at LEGO and how LEGO got onto the journey of doing serverless.

spk_2:   0:43
Right, so I'll make a start. My name is Sheen Brisals, and I am a senior engineering manager at the LEGO Group. I've been with LEGO for four years now, mainly associated with the serverless operation, so lego.com, and heavily involved in the migration of lego.com from its monolith onto serverless. When we talk about the LEGO Group here, we represent the team behind running lego.com. There are other groups and teams across LEGO doing many things with serverless, with different use cases and different development work going on. For us, that's lego.com: the back end of this site runs on serverless on the AWS cloud. So that's where we are at lego.com, using serverless technologies.

spk_1:   1:38
I'm Nicole Yip, and I'm a senior infrastructure engineer at the LEGO Group. So, like Sheen said, we're working on the shop part of lego.com. I've only been in the team for about 10 months now, but we're in the platform team, which is responsible for the hosting, the DevOps and dev tooling, as well as the site reliability and security. So yes, that's me.

spk_0:   2:02
Okay, so in that case let's maybe talk about the structure of the team and how the teams are divided in terms of responsibilities. So, Nicole, you mentioned that both of you are working in the DevOps space. Do you have a centralised team that looks after all the infrastructure and the deployment, as well as being on call for the feature teams?

spk_1:   2:24
Yes. So we have six feature teams, or feature squads, and then the platform squad, which is where I work. The platform squad is the one that looks after all of the dev tooling, the DevOps infrastructure and how both the back end and front end are hosted, and then the monitoring, on-call and alerting of the site.

spk_0:   2:45
So in that case, do your developers have access to the AWS accounts themselves, or does everything have to go through your team?

spk_1:   2:53
So we have three different AWS accounts, which are our different environments. The one that the developers have the most access to is the one we call Playground, and it's the one where they can deploy the services that they're working on. They can test them out in an AWS account and see how they're going to be deployed. When a service is ready to go to production, they merge their PR and it gets automatically deployed through our deployment pipelines into the development and acceptance AWS account. In this account they have more restricted permissions, where they can't just deploy anything. They can have a look at how it's configured, and they can access it from the front end to make sure it's working as expected and generally shake out the service. Then, once that's all ready, we have the QA team and my team that coordinate releases into production, and this is also automated: we simply click a button, so even we don't do any manual deployments into the production AWS account. But here the developers have little or no access, at least to the AWS account. They can still go through the front end to make sure that their services are responding as expected.

spk_0:   4:00
So I've worked with quite a few companies, mostly large companies or enterprises, that have a similar model where developers have no access to the AWS accounts beyond what you would call the playground environment. This is often a hindrance to actually addressing problems when something happens in production, because the developers are the ones that understand the system best, but they have no easy way to actually go into the environment and see if something is wrong. And if the monitoring setup is not sufficient for the problem they're dealing with, then it adds a lot of time to the mean time to recovery. Is that something that you guys have experienced at all?

spk_1:   4:38
It is, yes, and we've been actively working on it quite recently to ease this. One of the things that we've done very recently is overhaul the IAM roles that we can assume in each AWS account. One of them is called the viewer role. We've gone through an auditing process of looking at all the services that we use and getting the IAM actions that are required for minimal interaction from the AWS console. From that we've built up a policy for a viewer role, so we can now give developers access to our production account and know that they can't do anything destructive. They also can't view anything that's secret or confidential, but they can have a look at how their service has been deployed, they can look at the logs, and they can see if anything has gone wrong from that perspective. So we no longer have that barrier of "Oh no, we can't just give all developers access to production"; we actually overhauled that process. The other benefit is that we use New Relic to pull out insights from our various accounts, so developers can log into New Relic and view all of their metrics, dashboards and the errors that are coming from their functions. They're not prevented from viewing all of that just because they don't have access to the AWS account. It's quite useful that we can build up these dashboards for them to have a very quick view of how this service is performing, or how this part of the site is performing. And so, yeah, we use them quite a bit.
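
To make the viewer role Nicole describes more concrete, here is a minimal sketch of what such a read-only policy could look like, written as a TypeScript object. The action lists, statement IDs and the `buildViewerPolicy` helper are assumptions for illustration, not LEGO's actual policy; a real audit would produce a list specific to the services in use.

```typescript
// Hypothetical sketch of a read-only "viewer" policy. The action lists are
// illustrative assumptions, not the policy used at LEGO.
interface PolicyStatement {
  Sid: string;
  Effect: "Allow" | "Deny";
  Action: string[];
  Resource: string;
}

interface PolicyDocument {
  Version: string;
  Statement: PolicyStatement[];
}

// Read-only actions for inspecting deployed functions, logs and metrics.
const viewerActions: string[] = [
  "lambda:GetFunctionConfiguration",
  "lambda:ListFunctions",
  "logs:DescribeLogGroups",
  "logs:DescribeLogStreams",
  "logs:GetLogEvents",
  "logs:FilterLogEvents",
  "cloudwatch:GetMetricData",
  "cloudwatch:ListMetrics",
];

// Actions that would reveal secret values; these are denied explicitly.
const secretReadActions: string[] = [
  "ssm:GetParameter",
  "ssm:GetParameters",
  "ssm:GetParametersByPath",
  "secretsmanager:GetSecretValue",
];

// True when the action name (after the service prefix) starts with a read-only verb.
function isReadOnlyAction(action: string): boolean {
  const name = action.split(":")[1] ?? "";
  return /^(Get|List|Describe|Filter)/.test(name);
}

function buildViewerPolicy(): PolicyDocument {
  // Guard against a mutating action slipping into the allow list.
  const mutating = viewerActions.filter((a) => !isReadOnlyAction(a));
  if (mutating.length > 0) {
    throw new Error("Not read-only: " + mutating.join(", "));
  }
  return {
    Version: "2012-10-17",
    Statement: [
      { Sid: "ViewerReadOnly", Effect: "Allow", Action: viewerActions, Resource: "*" },
      { Sid: "DenySecretValues", Effect: "Deny", Action: secretReadActions, Resource: "*" },
    ],
  };
}
```

The explicit `Deny` statement is a design choice: in IAM an explicit deny takes precedence over any allow, so secret values stay hidden even if a broader read permission is attached to the role later.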

spk_0:   6:03
Okay, that sounds pretty good because, as you mentioned, that is a common problem people run into. I guess the other thing that you're probably going to need eventually is some kind of break-glass procedure, so that if some emergency happens and someone needs to go and change something quickly, just so that you can recover the site or some business-critical function, there's a procedure that allows someone to get write access to, say, the production environment, but then everything they do is audited. Do you guys have anything like that planned in your pipeline?

spk_1:   6:34
So at the same time that we were auditing our IAM roles, we had a look at dynamic escalation to the admin role if required, using MFA to protect it and things like that, but for our whole setup within LEGO it wasn't that feasible. The way that we operate, and have operated before, is that we always have two engineers on call at a time, 24/7. One of them is from the platform team, and the other is an application engineer. Both of the infrastructure engineers that are on call have administrative privileges to most of the accounts and the sites that we use. So if anything, the break-glass procedure is to contact the infrastructure engineer that's on call, and we make ourselves available in that case.

spk_0:   7:20
Okay, so in this case, how many application teams do you have?

spk_1:   7:24
We've got six squads, and we keep reorganising them. I think they've been reorganised about three or four times since I joined, but at the moment they're organised as end-to-end squads based on the parts of the site that we operate. So we have one squad for checkout that's responsible end to end for the checkout process, from the front end all the way through to the back-end integration with our [unclear] implementation. We have another team that's based on exploration of the site and, again, they're full stack. They own and operate the navigation around the site, the display of various products and the different landing pages that we have, all the way through to the back end that talks to our content providers and all of that. We have some other squads for other projects that we have in flight as well. Oh, and I mustn't forget the VIP squad, who are working with a third party on the rewards site that we have.

spk_2:   8:19
Just to add to what Nicole is saying about the squads and the setup: another thing that we practise is that we try to bring the squads, the engineers, in as part of the end-to-end journey. When a new feature is about to be developed, obviously someone will come up with the original architectural design, and that will be reviewed and discussed with the engineers, irrespective of their level of serverless or AWS knowledge and skill set. Everyone will be involved, and then they slowly bring up the development. Before it goes to production, we now have a kind of checklist, which also includes the AWS Serverless Lens checklist. So we sit together, "we" meaning myself or someone else who is responsible for the architecture, along with Nicole's squad and the feature squad. We go through the different things and make sure they understand, and also make sure that the right things, like the alerts, alarms and monitoring, are all in place before we take it further. That makes Nicole's or her team's job a bit easier, because engineers now have a full understanding of the expectations and know what to do in case of emergency or out-of-hours situations.

spk_0:   9:50
So that's quite a sensible and quite common approach I've seen in companies like DAZN. We had a very similar thing as well, where you have some kind of onboarding process, almost a Well-Architected review, before you go live, to make sure that the teams understand the threshold they need to meet in order to be running in production. But once in production, did you say that there's one application developer on call the whole time, 24/7? Is that one for all six teams, or is it one per team?

spk_1:   10:19
So that's one for all of the teams. The idea is that this engineer has a fairly wide-ranging view of the whole site and of which engineers know the most about different parts of the system, so that if something goes wrong, they're responsible for escalating to the right people while involving as few people as possible, especially if it's out of hours. So both on-call engineers are mostly there to say, "Hey, you need to be aware," and be the first responders to any alerts, but also to know who has the most knowledge of each part of the site so that they can be involved fairly quickly if it's something out of your wheelhouse.

spk_0:   10:57
And how do you make sure that you've got enough people on this on-call rota with that wider knowledge about who is responsible for different parts of the system? Because, again, the more teams you have working on something, the harder it is for any one person to know who to even talk to about a particular problem happening in a particular, say, microservice.

spk_1:   11:18
Yes, so the benefit for us is that we have a structure where we have senior engineers, engineering managers and, in Sheen's case, senior engineering managers. It's the people at that level who have the awareness of how everything fits together within the site, and they've been in the team for way longer than I've been a part of it. I think Sheen has been here four years now, as he said. So we have at least three of those engineers who have been through the entire uplift and development and our move into serverless. They have at least peripheral knowledge of "Oh yeah, that team worked on that service a while ago, so they're the ones I can talk to if I need more context." Their general remit is to be across all the new things that are going into the system and any historical alerts that we've had, so that we can build up that knowledge as well.

spk_2:   12:08
Yes, so another point to add is that these on-call engineers are not necessarily going to fix the problem. They will probably bring in other engineers, like Nicole was saying, who are familiar with that part of the system or that feature, and who can be brought on board in order to find a solution to the problem. The other aspect is an engineer having the overall view or understanding of all things, which is quite impossible. So we slowly encourage engineers to be onboarded into this on-call process. They won't necessarily be fully in charge of the day-to-day on-call duties, but they can shadow a senior engineer to learn it and then slowly grow and, you know, eventually be part of the rota.

spk_0:   13:04
Yeah, those people who have long institutional knowledge about how the company operates and how the different teams break up are super valuable, especially in this kind of on-call situation and with incidents when they actually happen. So I've got a question along similar lines. Have you guys been operating with this sort of setup for a long time, or is it something new that came into place when you moved to serverless?

spk_1:   13:29
It's been in place since we moved to serverless. I believe that's when we became more in control of the infrastructure and the end-to-end of our site, and so we could then effectively be on call and have a lot more within our control to fix. We are in the process of improving this, as Sheen was alluding to, figuring out how to bring in new engineers and not have them too intimidated by having to know every single thing about the entire site. It's an evolving process, and we've been operating in this way for over a year now.

spk_2:   14:02
So we did have a similar setup in the past, when we had the monolith legacy platform in place. But in those days it was mainly the infrastructure engineers specifically looking after the operation of the site, and when something went wrong, they would then initiate a call to another engineer or something like that. The difference now is that more engineers are part of the end-to-end process. That transition happened mainly due to the move to serverless and cloud.

spk_0:   14:37
And along the journey of going to the cloud and to serverless, what are some of the biggest technical and cultural changes that have had to happen for this process to work?

spk_2:   14:49
So, cultural changes, yes. The main cultural shift is the mindset, or the willingness to adopt serverless and to learn and do something. For us, that wasn't an issue, because we started off with a small team, fresh with ideas, so moving to serverless wasn't an issue. But then the struggle was: how do we bring new engineers with little or no AWS and serverless skills onto the team? That was a challenge, and there is no clear solution. One way to look at it is that you bring in the engineers and slowly feed them the knowledge they need, and encourage them to attend AWS talks or online webinars and go through the AWS certification process. Those sorts of learning will slowly bring them up to the level. In terms of technical challenges, these typically come when you move between different environments. Say, for example, the legacy platform heavily utilised SQL databases; when you move to serverless, you start dealing with NoSQL. So obviously there is a kind of cultural difference between the two. The same is true when we talk about programming languages and the different technologies. One thing I often mention in my talks is this sort of mind shift, to see things differently. Here, an engineer is not just doing programming. Even for a small feature, that engineer can come up with a dozen or so AWS services that need to be put in place in order to deliver that particular serverless feature. That is the difference. As long as the team can convey that message in an understandable way and encourage them to be part of the whole movement to serverless, it can become a bit easier. First of all, you need to have the willingness to learn and to adopt serverless. Once you cross that barrier, it can become easier, as long as you guide or coach the engineers to come up to the required level.

spk_0:   17:13
And what are some of the things that you are doing to actively coach people and train them, so that they have an easier time switching over to a new set of paradigms in terms of how they develop software, in terms of the databases they're using, and also in terms of new languages? Because it sounds like quite a lot of different things are all changing at the same time.

spk_2:   17:34
Yes, so I'll start, and Nicole can probably fill in the rest. As I mentioned, there is no clear solution. One thing we do relies on our full-stack squad setup. Certain engineers who are new may initially be doing all front-end work, JavaScript work. But then the squad picks up a new feature that involves front-end as well as serverless service development. Those new engineers won't be heavily involved with the serverless service development, but they will pair with someone who is doing it, or they pick up tickets that touch just the boundaries of the service, so that becomes a sort of stepping stone for them. They become part of the whole journey, part of all the discussions and things that I talked about earlier, and that slowly enriches their knowledge. So that's one way of bringing them on. Then Nicole is heavily involved in getting engineers upskilled with certifications. Nicole can elaborate on the thread that we run.

spk_1:   18:44
Yes, so we have a Teams channel entirely about AWS certifications, actively encouraging our engineers to study towards at least the Cloud Practitioner and then beyond, if there is that interest. We're actively posting resources and results and summarising to each other any new references that we've found helpful. One of our engineers attended the AWS online conference, the [unclear] conference, I'm not sure how you pronounce it, and was sending back interesting things that she had learned there, which the other engineers at the same level could then share. And going to Sheen's point, we're teaching by immersion, essentially. All of the architecture is done in the AWS language, using the AWS icons, so that when engineers pick up new features, they see these diagrams with all of the service names in AWS's language, and so they sort of learn by association that way as well. So we're surrounding our teams with AWS terminology, and pairing is all done within the squad. We're just presenting as many opportunities as we can to be involved in anything AWS-related.

spk_2:   19:58
I'll give you two examples. An engineer from our team attended your serverless course. That's one way we encourage engineers to take opportunities and learn. The other is when EventBridge was announced. I wanted someone to do a simple PoC of an idea I had. I didn't choose an engineer with top serverless skills; I went to an engineer who was just starting out on serverless. I explained to her: okay, this is a new service, this is how it works, and this is what I'm expecting to have as part of the PoC. She had to learn and struggle and come up with something, and that was something she could show to others. These are the sort of examples I can put forward of how we encourage engineers to become skilful in this cloud and serverless space.

spk_0:   20:56
Yes, so that combination of things has worked quite well in quite a few of the companies that I've worked with. I love that phrase you used there, "learn by immersion". That's definitely the best way to do it. And I love the fact that you went to someone who doesn't have that much experience with AWS and gave her an opportunity to actually go and experiment and try something out, which, again, I find is where you learn the most. It sounds like you're doing other things too in terms of encouraging people. I guess that means that you give them a budget and the time off to study and to take the certification exams?

spk_1:   21:31
Yes, certainly. We have a budget for all of our engineers to attend conferences and to, yeah, as you said, take the exams and take study leave. It's all self-driven, so we encourage each of our engineers to actively use up this budget: ask your engineering managers for budget to study for exams, to subscribe to various courses and all of that.

spk_2:   21:52
And also, we encourage them to share with the community as well. We encourage them to write blog posts, and we encourage them to talk at conferences or meetups. Even at the LEGO hub in London we conduct meetups, so that's an opportunity for them to come up and talk about something they've learned. Also, quite a few engineers are taking part in code camps and various other coding opportunities, where they share their cloud, programming and other skills with the wider community as well.

spk_0:   22:28
Yeah, I'm definitely seeing more and more people from LEGO popping up in the community, in terms of speaking at conferences and in terms of webinars and blog posts and all of that, and thank you guys so much for giving back to the community. I think it's great that you're giving people the time off. Quite a few other companies I've worked with in the past do have a budget, but they find that no one takes up the budget because people aren't given the time off to actually use it. So people have to invest their own time, their holidays, in order to go to conferences and things like that, which just isn't the right incentive in terms of encouraging you to go and learn, which ultimately benefits the company a great deal: having people who are more knowledgeable and can make the right decisions. So we've talked a lot about some of the things you're doing at the organisation level to ease the transition to serverless. I want to switch gears a bit and talk about some of the actual technical things you've got to do, in terms of how you organise your repos. We did a webinar recently; well, not a webinar, I guess we did a virtual meetup, me and Nicole. You talked about how you guys are using a monorepo, but your approach to the monorepo is quite different from a lot of other companies doing monorepos, where they have a single CI/CD pipeline that deploys every time anything changes. So can you talk us through how the monorepo works at LEGO?

spk_1:   23:49
Sure. So, like I said in the talk, we have two monorepos. One is for our front end, which is Fargate services that mostly serve up the front end, so our app shop and app checkout. We also have another monorepo that has all of our serverless services, where we use the Serverless Framework to define each of these packages that provide a collective service. In terms of how we do monorepos differently, we implemented something that we call a selective script. We use Lerna, so whenever an engineer raises a PR to any part of the monorepo, the selective script is triggered to detect which of the packages have been changed or affected by that new code change, and we then only run the integration tests for those services that have been affected. This makes the deployment process a lot more efficient; we're not redeploying code that hasn't changed. We also use the same selective script when we merge into our main trunk, or the develop branch, which is what we call it. This triggers the deployment into the development environment, or QA environment, as we call it internally. So only the services that have been impacted by this change, either because you've changed a dependency or you've changed the actual service, are deployed. We don't deploy everything all the time, which is what we did at the start, and that was fine when we had 1 to 10 services. Now we're up to about 40, so it's good that we've been able to limit the number of changes that happen when a single PR is merged.
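
The script itself is not shown in the episode, but the core idea (map changed files to packages, then include every package that depends on them) can be sketched as below. This is a minimal sketch under assumptions: the `packages/<name>/...` layout, the package names and both function names are hypothetical, and in practice Lerna would supply the changed packages and the dependency graph rather than hand-written data.

```typescript
// Package name -> internal packages it depends on (hypothetical shape).
type DependencyGraph = Record<string, string[]>;

// Map changed file paths to the packages that contain them,
// assuming a "packages/<name>/..." directory layout.
function changedPackages(changedFiles: string[], packagesDir: string = "packages"): Set<string> {
  const changed = new Set<string>();
  for (const file of changedFiles) {
    const parts = file.split("/");
    if (parts.length > 2 && parts[0] === packagesDir) {
      changed.add(parts[1]);
    }
  }
  return changed;
}

// Return the changed packages plus everything that depends on them,
// directly or transitively, sorted by name.
function affectedPackages(graph: DependencyGraph, changed: Set<string>): string[] {
  // Invert the graph: dependency -> packages that depend on it.
  const dependents = new Map<string, string[]>();
  for (const pkg of Object.keys(graph)) {
    for (const dep of graph[pkg]) {
      const list = dependents.get(dep) ?? [];
      list.push(pkg);
      dependents.set(dep, list);
    }
  }

  // Breadth-first walk from the changed packages through their dependents.
  const affected = new Set<string>();
  const queue: string[] = [];
  changed.forEach((pkg) => queue.push(pkg));
  while (queue.length > 0) {
    const pkg = queue.shift() as string;
    if (affected.has(pkg)) continue;
    affected.add(pkg);
    for (const dependent of dependents.get(pkg) ?? []) {
      queue.push(dependent);
    }
  }

  const result: string[] = [];
  affected.forEach((pkg) => result.push(pkg));
  return result.sort();
}
```

With a graph where `checkout-service` depends on `shared-utils`, a change under `packages/shared-utils/` would select both packages for testing and deployment, while a service with no link to `shared-utils` would be left alone.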

spk_0:   25:27
Yes, I think that when it comes to detecting changes to shared code, that's where it often gets complicated, and that's why people tend to deploy everything all the time. Even though I haven't touched my service, something that we depend on, some shared business logic code somewhere, could have changed, so we end up having to just be safe and deploy everything all the time. But I think this idea of using Lerna to detect changes is quite clever. Thank you for sharing that with us during the meetup, and I will definitely put a link in the show notes for anyone who wants to hear Nicole's full session at the Serverless London meetup last month. I also want to talk about the language. Sheen, you mentioned that you guys also had to change the language that you use to write your services, and it sounds like now everything is in either Node.js or TypeScript. What did you switch from?

spk_2:   26:18
So the legacy platform is Java, and luckily for us, at that time the front end was in JavaScript. So when we moved to serverless, there was this dilemma over which language to go for, whether to go for Golang or Node or Java. We slowly filtered out the others and ended up with JavaScript (Node) or Go. Looking at the skills that the engineers had at that time, it was kind of a no-brainer to stick with what we had. So that's how we went with Node.js, and that really paid off, in that we didn't have to bring in new skills or struggle to teach a language. Standardising on the programming language gave us the time and the freedom to focus everyone on the cloud technology and that sort of thing. So we have Node.js, TypeScript and, I think, also a bit of Next.js as well. That's the sort of thing we have.

spk_0:   27:29
So I guess with the LEGO shopping site, you must get quite a lot of traffic, and probably quite spiky traffic. I imagine around holiday seasons, people are buying gifts. Are you guys doing anything different, anything that goes against conventional recommendations, in order to improve, I guess, the cold start time for your functions? Or maybe to make sure that you hit some of those latency or scalability requirements?

spk_1:   27:55
I don't think we're doing anything outside of the norm. One thing that we've been trying to optimise recently is the New Relic agent. We used to include it as a dependency for each of our services so that we could get the information back over to our New Relic account. We've been experimenting recently with adding it as a layer, and we spent the last, I think it was, two months working very closely with New Relic to try and get this implementation as performant as possible, because we were observing quite an impact on our cold start times and our Lambda operation because of the size of the layer. I think in the end it wasn't just the size of the layer; there were also some libraries that we were using. We were using webpack, and we were also using Terser as a plugin to webpack to significantly reduce the size of our Lambda functions. But there were some issues there, I think with the Terser plugin, and that whole combination was resulting in our Lambda functions taking six seconds or more to cold start. One of our engineers spent two months of his life trying to debug these issues, with errors happening in, like, a networking context. We've now got it being fairly performant, and we're not having significantly long cold start times. And now we can have this New Relic layer implemented across all of our functions to give us nice distributed tracing. But, yeah, we have had quite a bit of spiky traffic in the past, and there are some limits that we have run into, and it's been fairly reactive on our end. One thing that we ran into was the SSM Parameter Store rate limit. By default you get, I think it's 40 transactions per second to Parameter Store, and in the past that hadn't been a problem. But then we released a set for the Friends TV show, and we had such a big reaction. I think Jennifer Aniston posted on [unclear] around the same time, and we had a massive spike in traffic and started hitting these Parameter Store rate limits. We were getting throttled and we were unable to scale up the Lambdas fast enough. And so we realised: oh, there's a standard throughput and an advanced throughput. So we quickly opted into the higher throughput, which, thankfully, was a click of a button on the console and not a service request. But, yeah, that was just one example of the limits that we've run into in AWS with our new traffic models, which we're still learning about.
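
One common way to stay under a per-second limit like this is to cache parameter values inside the function instead of fetching them on every invocation. The sketch below is an assumption about how that could look, not what the LEGO team did (they raised the throughput limit); `ParameterCache` and the `load` callback are hypothetical names, and `load` stands in for whatever call actually reads from Parameter Store.

```typescript
// Stand-in for the real lookup, e.g. a call to SSM Parameter Store (assumption).
type Loader = (name: string) => Promise<string>;

interface CacheEntry {
  value: Promise<string>;
  expiresAt: number;
}

// Caches lookups per parameter name for a fixed time-to-live, so repeated
// invocations in the same container do not hit the rate-limited API each time.
class ParameterCache {
  private readonly entries = new Map<string, CacheEntry>();
  private readonly load: Loader;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(load: Loader, ttlMs: number, now: () => number = () => Date.now()) {
    this.load = load;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(name: string): Promise<string> {
    const cached = this.entries.get(name);
    if (cached !== undefined && cached.expiresAt > this.now()) {
      return cached.value; // still fresh: no call to the loader
    }
    // Store the promise itself so concurrent callers share one lookup.
    const value = this.load(name);
    this.entries.set(name, { value, expiresAt: this.now() + this.ttlMs });
    // Forget failed lookups so the next call retries instead of reusing the error.
    value.catch(() => {
      const current = this.entries.get(name);
      if (current !== undefined && current.value === value) {
        this.entries.delete(name);
      }
    });
    return value;
  }
}
```

Creating the cache outside the handler function lets it survive across warm invocations of the same container; the TTL is a trade-off between how quickly a changed parameter is picked up and how many calls are made.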

spk_0:   30:31
Yeah, that limit also just wasn't documented anywhere. Quite a few of us ran into it just because we happened to have a traffic spike, and then we realised: wait a minute, why is this thing suddenly getting throttled? Because they didn't really publicise it, I think, until now, now that they've got the paid tier where you can get up to 1,000 ops per second. That one was so sneaky.

spk_1:   30:56
Yeah, I was trying to find it before this recording as well, and you search for "standard throughput SSM" and nothing comes up. I had to go over to the service quotas for the specific calls to find the 40 number.

spk_0:   31:10
And the service quota is also relatively new. Before, we just had to hammer the API call until we hit that rough number. So we kind of guessed: okay, it's about 40, because that's when we started getting those sorts of exceptions back. So you mentioned something there that I wanted to go into, the whole webpack and layers thing. Once you introduce layers, you can't use webpack to uglify and tree-shake those dependencies that you put into layers anymore. Was that where you started to see your functions getting blown up in terms of size and having an impact on performance?

spk_1:   31:44
No. So it was a learning exercise for us, because obviously we started out with Lambda functions with engineers who had switched over to AWS from a Java background, or a front-end background if they were from JavaScript. And so we brought across this webpack and minification from the front-end world, essentially, where it's fairly standard: when your code's going out to production, you minify it so that it's the smallest size possible. And when you come across to AWS and you're talking about Lambdas, and they talk about cold starts and package sizes, it made sense to bring it across. And so, as we were going through that exercise of switching New Relic over from a dependency to a layer, we started observing all of the performance issues, and it uncovered that having Terser, and I believe there was some specific syntax that it was struggling with, it was something to do with how we were making HTTP calls with Axios. When making HTTP calls, it was ignoring timeouts, and it was using up the entire Lambda context. So it's slightly unrelated to the New Relic layer, but it uncovered this issue. And so by removing the Terser plugin, removing the minification, we got back to a more performant state. Following on from this investigation, we talked to our AWS account managers and we posed them the question: do we still need to be so worried about the size of our Lambda functions, and does that still have a significant impact on Lambda cold start time? Because it used to. And I think I still need to get back to them on the actual numbers of how much size webpack was saving us. So we're still in this active investigation of: should we still be using webpack, or can we drop minification and not see a significant impact on performance? And that's kind of where we are now. We don't need to minify, and we still get the same start-up times that we were getting before.

spk_0:   33:47
So I can actually fill in some details here, because we've done quite a lot of experimentation around the various different configurations and file sizes and all of that. The benefit of webpack is not actually reducing the file size itself, which does have an impact, but let's go into that later. The benefit of webpack is removing all the file system calls, because every time you call require, that hits the file system, which is an IO op, and every time you require some dependency, that dependency might bring in another thousand dependencies. So that's a lot of IO ops, and those file ops add to your cold start time. Webpack removes all of that, and then you end up with one file that contains everything you need, and therefore you have no require that needs a file system op at runtime. That's where you get all the benefits from webpack, not the file size itself. The file size itself does have an impact, and this is the kicker: what we found was that that impact only really applies to the first invocation after a code deployment. So imagine you deploy new code. For the first container that gets spun up, the size of the artefact has a pretty big impact; we're talking about potentially hundreds of milliseconds for that first container. But afterwards, for new containers that get spun up, the file size has almost no impact. There's no statistically significant impact on your cold start time. So that means if you only change the environment variables or change some configuration, then you see that the file size has no bearing on the cold start time itself. And then another thing we found was that during cold starts you get full CPU power, regardless of the memory setting you use for your function. So you cold start just as fast at 128 MB as you would at, say, I don't know, one gig or three gig of memory. So some really interesting details there in terms of how the package size affects your cold start time.
Because the main thing is that webpack will still have a big benefit, just because it removes all of the file system ops at runtime.

spk_1:   35:54
Yes, that makes sense. So yeah, I guess we'll likely be keeping webpack on, and then likely removing, or discussing more about, whether we keep the Terser plugin or not.

spk_0:   36:06
Yeah, there are also some tricky aspects to measuring the cold start as well. Lambda tells you the init duration, which is the time it takes to initialise the module itself. But the time it takes to download the file artefact, which impacts only the first container that gets spun up after a code change, is not included in the init duration. You have to measure from something outside of the invocation itself. So maybe, if you're using API Gateway to talk to Lambda, then you can look at the integration latency to see the actual end-to-end invocation time for the function, and that's one of the more reliable ways to measure cold start time. So depending on the metric you look at, you may not be measuring the right thing. I mean, I spent so much time just trying to experiment with things, and Michael Hart from Bustle, we had an interesting conversation around this. He was the one who did some of this experimentation around the file size and its impact on cold starts, and the fact that it only affects the first one was something that he found after a bunch of experiments he did. Yes, fun stuff, but stuff I'd rather I didn't have to do. What about things like the number of users you serve per month?
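
To make the point about init duration concrete, here is a small sketch that pulls the `Init Duration` field out of the `REPORT` lines Lambda writes to CloudWatch Logs. The two sample lines are made up for illustration and assume the usual tab-separated layout; the helper names are hypothetical. As noted above, this number covers module initialisation only and does not include the time to download the code artefact, so it understates the first cold start after a deployment.

```python
from typing import Dict, Optional


def parse_report(line: str) -> Dict[str, str]:
    """Split a tab-separated REPORT line into a {field name: raw value} mapping."""
    fields: Dict[str, str] = {}
    for part in line.strip().split("\t"):
        key, sep, value = part.partition(": ")
        if sep:  # skip fragments that are not "key: value" pairs
            fields[key.strip()] = value.strip()
    return fields


def init_duration_ms(line: str) -> Optional[float]:
    """Return the init duration in milliseconds, or None for a warm invocation."""
    value = parse_report(line).get("Init Duration")
    if value is None:
        return None
    return float(value.split()[0])  # "182.54 ms" -> 182.54


# Made-up sample lines: the first is a cold start, the second a warm invocation.
cold = ("REPORT RequestId: 11111111-aaaa\tDuration: 102.25 ms\tBilled Duration: 103 ms"
        "\tMemory Size: 128 MB\tMax Memory Used: 70 MB\tInit Duration: 182.54 ms")
warm = ("REPORT RequestId: 22222222-bbbb\tDuration: 3.10 ms\tBilled Duration: 4 ms"
        "\tMemory Size: 128 MB\tMax Memory Used: 70 MB")

print(init_duration_ms(cold))
print(init_duration_ms(warm))
```

Splitting on tabs rather than searching for the word "Duration" avoids confusing `Duration`, `Billed Duration` and `Init Duration`, which all appear on the same line.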

spk_1:   37:17
I don't have that number off the top of my head. I know that during normal site operation, a low number of concurrent sessions on the site is around 150 to 200. And then when we're in peak season, or when everyone around the world's awake and on the site, it can go up to about five or six hundred, and that's fairly normal. And then when we're in Black Friday scenarios or first-of-the-month new product releases, I think the highest we've seen is about 2,500 concurrent users on the site, and that's always a little bit scary from an infrastructure perspective, but very impressive when it holds.

spk_0:   37:53
That's one of the benefits of Lambda, right? It just scales and handles all of that for you, so you don't have to be on the edge of your seat all the time. So I guess you talked about the benefits of serverless being the scaling, but also the fact that it allows your teams to have more control of your AWS environment and more control of the application. What about in terms of business benefits? Has there been anything you can measure in terms of velocity for feature development, or maybe even cost for your AWS accounts?

spk_2:   38:24
Yes. So, I mean, I can't talk about costs, but for me the serverless journey is about more than cost. I often talk about this serverless acceleration, which is quite visible in our case. Since we moved to serverless, you can clearly see the benefits it brings in terms of the speed with which a new feature can be brought to production. This has changed from months in the past to, now, days, or a sprint, or weeks. So that, to me, is a real benefit, so that, you know, the stakeholders can come to us asking for a new feature to go to production, and we now have, you know, the sort of tools and services that give us the confidence to quickly take that to production. That, to me, is a really, really big benefit. And the other thing that we talked about is the team culture and how the team accelerates or evolves around the technology. So in that way it's been really beneficial for us, and that's the reason why I say that. Of course, we do look at the cost side of things as well, but to me, what serverless brings is beyond the cost benefits and the other sort of clichéd things that we often talk about.

spk_0:   39:59
I totally agree. I think in terms of the actual savings that serverless gives you, in terms of your total cost of ownership, the AWS cost is just a very, very small part. The fact that you can get so much more done with the same engineers means that you save so much in terms of the engineering cost itself for your application. There's also the market opportunity cost as well: you can hit the market quicker, you can get there before your competitors, and you can out-innovate your competitors in terms of bringing out new ideas and new products and new features to the market.

spk_2:   40:31
Exactly. So I'm just going to give a quick example of something that happened recently. People wanted to experiment with a different search weighting algorithm. We already have something in place. The idea was brought up, there was a quick architectural design and detailing, we talked to the team, and it was so simple because it's all just EventBridge. An event that we already send to EventBridge, we take that to a different consumer and set up a few, you know, DynamoDB tables and a couple of Lambdas on top to bring up that new value as per the new logic, so that can now go into production with the, you know, A/B scenario, where the business can evaluate which one is effective in which market. So that's just an example of how quickly we can bring value to the business with serverless.
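
The pattern described here, adding a second consumer for an event that is already on the bus, can be sketched with a simplified version of EventBridge's event pattern matching. This is only an illustration: real EventBridge patterns support more operators than the exact-value matching shown, and the rule names, the `shop.search` source, the `ProductSearched` detail type and the market codes are all hypothetical, not taken from LEGO's setup.

```python
from typing import Any, Dict, List


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Simplified matcher: each pattern field lists allowed values; dicts nest."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern (for example under "detail"): recurse into the event.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Two rules on the same bus. The existing consumer is untouched; the experiment
# is just one more rule, here restricted to a couple of markets.
rules: Dict[str, Dict[str, Any]] = {
    "existing-search-weighting": {
        "source": ["shop.search"],
        "detail-type": ["ProductSearched"],
    },
    "experimental-search-weighting": {
        "source": ["shop.search"],
        "detail-type": ["ProductSearched"],
        "detail": {"market": ["GB", "DK"]},
    },
}


def targets_for(event: Dict[str, Any]) -> List[str]:
    """Return the names of every rule whose pattern matches the event."""
    return [name for name, pattern in rules.items() if matches(pattern, event)]


gb_event = {"source": "shop.search", "detail-type": "ProductSearched",
            "detail": {"market": "GB", "query": "friends"}}
us_event = {"source": "shop.search", "detail-type": "ProductSearched",
            "detail": {"market": "US", "query": "friends"}}

print(targets_for(gb_event))
print(targets_for(us_event))
```

Because the producer only publishes the event and knows nothing about its consumers, the experiment can be added or removed without touching the service that emits it.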

spk_0:   41:34
Yes, funny that you mention EventBridge there. We could have spent the whole hour just talking about EventBridge, but I know you've done quite a few talks already, including with Jeremy Daly. So I'm going to include some links to your previous talks on EventBridge and microservices communication patterns, and to your chat with Jeremy Daly, so that if any listeners want to find out more about EventBridge and how LEGO is using EventBridge, they can find out from there. So, Nicole, switching back to you: you mentioned earlier that one of the pain points you guys ran into was the low throughput limits with SSM Parameter Store. Are there any other platform limitations or other pain points that are hurting you and your teams?

spk_1:   42:18
So, in terms of serverless and Lambda, one pain point that we are still facing is end-to-end testing. I mentioned that we have a playground AWS account that allows our developers to deploy their function into an AWS environment and actually test it out in a like-for-like context. What we don't have at the moment is a fully connected end-to-end environment that has a front-end instance for them to connect their deployment into, so that they can have a front end from which they can invoke the service that they've just changed, to ensure that the change is actually doing what they intended it to do. So that's one thing that I've put in as a future roadmap item, in terms of a connected feature environment rather than just my Lambda in isolation. Another thing that we've run into is the lack of tracing that we have through the stack, and so that's what we're actively working on now with New Relic: implementing their distributed tracing so that we can track a request from the front end in someone's client. So if you load up the shop in your browser, we can have that request ID come through into New Relic, and we can trace that you hit our back-end layers, which services you actually invoked through that, and any errors that you encountered along the way. So that's what we're actively implementing now. But it is a point of complexity when you go into Lambdas and a lot of the microservices-type architectures.
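
The idea of carrying one request ID from the browser through every service can be sketched as follows. This is a generic illustration of correlation-ID propagation, not New Relic's implementation: the `x-correlation-id` header name and both helper functions are assumptions made for the example.

```python
import uuid
from typing import Dict, Mapping, Optional

CORRELATION_HEADER = "x-correlation-id"  # assumed header name, for illustration only


def incoming_correlation_id(headers: Mapping[str, str]) -> str:
    """Reuse the caller's ID if one was sent, otherwise start a new trace."""
    for name, value in headers.items():
        if name.lower() == CORRELATION_HEADER and value:
            return value             # HTTP header names are case-insensitive
    return str(uuid.uuid4())         # first hop: nothing to reuse, so mint an ID


def outgoing_headers(correlation_id: str,
                     extra: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build headers for a downstream call so the next service sees the same ID."""
    headers = dict(extra or {})
    headers[CORRELATION_HEADER] = correlation_id
    return headers


# A request arriving from the browser with an ID already attached.
request_headers = {"X-Correlation-Id": "req-123", "Accept": "application/json"}
trace_id = incoming_correlation_id(request_headers)
downstream = outgoing_headers(trace_id, {"Accept": "application/json"})
print(trace_id, downstream[CORRELATION_HEADER])

# A request with no ID gets a freshly generated one.
fresh_id = incoming_correlation_id({"Accept": "application/json"})
print(len(fresh_id))
```

Logging the same ID in every service is what lets a single user request be followed across functions; for asynchronous hops the ID would need to travel inside the message or event payload instead of an HTTP header.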

spk_0:   43:50
Have you guys checked out the DAZN Lambda Powertools? It implements a lot of this correlation-ID-based tracing for API Gateway, Lambda, EventBridge and a whole bunch of different services. And I guess the tracing you are implementing now with New Relic is more for the APIs as well, from the client?

spk_1:   44:09
We had a look at many different services that provided distributed tracing and tracing for Lambdas and AWS, and we went with New Relic because we already have an existing relationship with them. So when they were bringing out this new feature, theirs was the easiest one that we could pivot to and implement, because we were already using their agent. It was simply adding on a couple of new flags to say yes, pass through these trace segments. So the tracing that we're getting from New Relic is in the context of their wider application, I think it's called APM, application...

spk_0:   44:47
Performance monitoring.

spk_1:   44:48
Performance, that's the one. Yeah, application performance monitoring. So it's in that entire context of the information that they're already bringing through. We do get those trace segments of all of the services that we call, whether they're internal or third parties. And...

spk_0:   45:04
I think a lot of the problems I've had with APM solutions from a lot of these traditional vendors is that they cater for HTTP tracing really well and give you a lot of detail around which line of code took how long and all of that. But the moment you go over asynchronous sources like EventBridge, the whole thing breaks, because again, they don't have a really good way to cater for those kinds of workloads. Have you guys been testing how it works for EventBridge and all the different background processing that you are doing?

spk_1:   45:34
No, we haven't broken it out into testing with EventBridge yet, but I think we are tracing through, like, API Gateway. So I guess that is a synchronous call. I guess we haven't considered, as a use case, that we need to trace the asynchronous calls, the background processing that happens, and tie that to the user's request because, I guess, by definition it's asynchronous. The main use case that we're trying to solve is how many of our users on the site are currently encountering errors that are being caused by specific services, as they're the ones that are higher priority to fix right now. And then we have alerting through various AWS metrics on: are our background processes working as we expect? Are any of them lagging behind? And all of that. So we kind of break out of that distributed tracing model for those asynchronous background processes.

spk_0:   46:34
Okay, yeah, that makes sense. What I've found is that, with a lot of clients, the user-facing aspect is just a part of the story, because a lot of the processing happens in the background. When something doesn't happen, say a customer places an order and there's no confirmation email, then they try to figure out what happened, because some of that order flow involves a sequence of things happening in the background through EventBridge or some kind of queue. So that's where you have the challenge of how to hook those async invocations in as part of your end-to-end flow for your customer-facing features. Again, it entirely depends on how your system is put together and what it is you're trying to build. So, we're actually coming up to the hour mark. Before we go, I want to just get maybe some AWS wish list items from you guys, based on your experience with AWS and serverless so far. Anything that you'd like AWS to fix or improve?

spk_1:   47:32
I think Sheen has one.

spk_2:   47:34
Yes. So you talked about EventBridge earlier. I mean, we do use EventBridge in a couple of those situations. But this particular point I'm going to mention is kind of a blocker in many cases: event persistence. Often I get challenged with, okay, what if an event is lost? And I don't have an answer for that. That would be something nice to have in place, so that people get the confidence that they can send an event to it and there is a guaranteed store that it can be recreated from. Also, I would love to see the DLQ option available for certain targets, similar to SNS. And the other one, which I think has been on the wish list for many for a long time, is a serverless ElastiCache option. That would really help. You know, I know we have DynamoDB, but ElastiCache would be a really cool thing to have. And one other thing: now that we have a number of standard, common patterns with serverless, it would be really useful going forward if AWS could wrap those up as packaged service solutions, so that we don't need to wire up a Lambda with a trigger and this and that. So those sorts of standard patterns can become readily available out of the box. So yes, that will do for now.

spk_0:   49:10
That's a pretty sweet list. And for EventBridge, another thing I'd also like is for them to add support for more granular IAM permissions, because right now, when you do PutEvents, you can only use a star as your resource, which is not a great thing. Yeah, I'd love to have some kind of serverless Redis as well. Redis is great, there are so many use cases you can do with Redis, but you have to run the cluster, you have to run it in VPCs and all of that. Yeah, that'd be amazing. And serverless Elasticsearch, that'd be awesome as well. Okay, so again, thank you guys so much for taking the time to talk to me today. Just one very last thing: how can people find you guys on the internet, and is LEGO hiring?

spk_1:   49:58
Well, I guess in terms of finding me, you can find me on Twitter. I'm @Pelicanpie88, an old Twitter handle that I made when I was a little younger. And then, yes, we have a LEGO Medium publication, I believe it's called, where we post blogs every now and then, and we're actively encouraging new writers into publishing articles on there. In terms of hiring, we initially planned on doubling the size of our team this year. Unfortunately, with the current situation, we've pushed back on the hiring activities, and so you'll have to check back in on us, maybe later in the year.

spk_2:   50:38
And I am available on Twitter as @sheenbrisals, and I also often publish blogs on the LEGO Engineering Medium channel.

spk_0:   50:50
Cool, I'll make sure that those are included in the show notes so that everyone can go and find them, including a link to the careers page at LEGO, so that when you guys start hiring again, people can go and find out about opportunities at LEGO to work on serverless stuff. So again, thank you guys so much for the time, and stay safe. Hopefully I'll see you guys in person or virtually soon.

spk_2:   51:15
Yeah. Thank you. Thank you so much.

spk_0:   51:33
So that's it for another episode of Real World Serverless. To access the show notes and the transcript, please go to realworldserverless.com. I hope you've enjoyed this conversation with Sheen Brisals and Nicole Yip from LEGO, and I'll see you guys next time. Bye bye.
