I spoke with Daniel Wright and Sara Gerion about the state of serverless adoption at DAZN. We covered a range of topics.
You can find Daniel on Twitter as @daalwr and Sara as @sarutule.
You can find open positions at DAZN on their careers page and check out their engineering blog to see the problems they're solving.
For more stories about real-world use of serverless technologies, please follow us on Twitter as @RealWorldSls and subscribe to this podcast.
Opening theme song:
Cheery Monday by Kevin MacLeod
In this episode of Real World Serverless, I interviewed Daniel Wright and Sara Gerion at DAZN, a sports streaming platform that you can think of as the Netflix of sports. DAZN operates at massive scale, with over 1.2 million concurrent viewers at peak, and needs very high uptime, because the content is streamed live and viewers would hate to miss out on those all-important moments in sporting history. So DAZN faces some pretty unique technical challenges, and we talked about the different ways it's using serverless technologies today, often side by side with containers running in ECS.
Hi, welcome back to another episode of Real World Serverless, a podcast where I speak with real-world practitioners and get their stories from the trenches. Today I'm joined by two of my former colleagues at DAZN, Sara Gerion and Daniel Wright. Welcome to the show, guys.
Thanks for having us.
So we worked together at DAZN before. Well, for people who are not familiar with DAZN, maybe they're not big sports fans, can you quickly introduce yourselves and talk about what DAZN is and what you guys do?
All right, maybe I can start. So DAZN, for those who are not familiar with it, is a global platform that streams sports events. We're known as the Netflix of sport, and it's basically a platform that streams sport-related content in real time and on demand. Of course, one of the greatest challenges that we have is being able to sustain spikes of traffic, going from low traffic to hundreds of thousands of concurrent users signing up, signing in, or streaming the event itself.
Yeah, I guess I can introduce myself first. So, my name's Daniel Wright. I'm a principal backend engineer, based in our Amsterdam office, and I'm in a kind of technical leadership, servant leadership style role, looking across team issues. And yeah, as Sara said, DAZN is an OTT streaming platform, currently in about nine different markets.
We just recently announced that we're going to be going global, so yeah, we really are working in lots of countries, with lots of customers and lots of different sports content, and a lot of it's in real time. So that's where a lot of our challenges come from.
And myself, I'm a backend engineer. I was lucky enough to work with Daniel for a period; we worked in the same team. I joined DAZN a bit more than a year ago. As a backend engineer, I contribute to the design, development, deployment and monitoring of the microservices that power DAZN's core functional logic and operations.
Yeah, the thing that really stood out for me when I was working at DAZN was the sheer scale of the application that we were building, and the fact that we also need to have very high uptime as well, because you have billions of dollars of content rights riding on you being able to deliver the content to the millions of people watching those sporting events live. Given that background, the context of high scale and high reliability, how is DAZN using serverless?
To give a brief answer: across DAZN there has been really wide adoption of various bits of AWS serverless technology. We're heavy users of Lambda functions, API Gateway, DynamoDB, SQS, SNS, S3, Kinesis. You name it, we're probably using it somewhere. And yeah, in our office, our particular domain focuses mostly on payments, subscriptions and user management. So it might be creating an API using Lambda and API Gateway to allow users to create accounts, for example, or to provide payment details, to redeem gift codes, or to change their account details. Any kind of user, subscription, payment and user management style backend systems. Yeah, there's been, I would say, a serverless-by-default mindset going into solving these problems. And yeah, we use serverless for a whole lot of stuff.
Yeah, in general, our first approach is to try to keep it simple and serverless, and that serves us well. If we can build something with a Lambda, or any other service that fits the requirements of the service, normally we will go for it. And yeah, as Daniel already said, mostly we use Lambda, DynamoDB, API Gateway, SNS and SQS. It's really our bread and butter, I would say.
Can you give some examples of systems that you've built recently using some of the serverless technologies?
Yes. An example would be a REST API application, a simple microservice that sits behind an API Gateway, a Network Load Balancer and ECS. So in that case, the gateway was the entry point of the microservice. Another example is a queueing service: a poller would be integrated with SQS, which would in turn be integrated with SNS, listening to published events to make sure that we process each event, do some computing, and interact with other services. That's it.
That highlights another interesting point about how DAZN is using AWS in general, in that it's not just about serverless; it's not just API Gateway and Lambda everywhere. There's quite a good combination, a good mix of ECS containers with Lambda. And yeah, that's definitely out of necessity. Can you talk about some of the decision points, in terms of how you decide which service should be running on ECS versus running on Lambda?
That is a very good question. I think when people talk about serverless, in the cloud we mostly see examples that use only serverless, or mostly serverless. But I think in reality most of the services that we ship in the cloud are hybrids. So you use serverless when it makes sense, and it's not always the only solution that can be applied. As an example, for an HTTP application we sometimes choose ECS because of some scaling requirements. At DAZN we have some huge traffic spikes, and it's not always the case that Lambda satisfies the requirements that we have in that sense. Or, for a queueing system, sometimes we need to perform long-running computing operations, so again, maybe ECS is an easier or faster choice to implement. In general, we will go for a different solution if the requirements don't fit within the hard limits of the serverless solution. But maybe Daniel has something to say about it.
Yeah, like you say, I think there are some particular scenarios where you cannot use Lambda. So if you have a task that's going to run longer than 15 minutes, it might make more sense to put it into a container or on EC2. Now, I guess you could chop it up, and often you could use Lambda functions and sometimes Step Functions if you're maybe processing a file, for example, but there's some complexity involved in that. I think often there's been a preference to just move it into a container instead. And then, like you said around traffic and scaling, I think, yeah, we do have a global audience. And if you're thinking about sports events, live events, maybe users are trying to sign up or sign in. Maybe you're five minutes to kickoff and sign-in is going through the roof. How quickly could Lambda deal with that? And could the soft limits around concurrent executions support our use case? So yeah, I think there are lots of things to consider. What is the maximum required throughput? Will our concurrent execution limit allow for that? And also, how quickly will the burst happen? Will the initial burst that we're given by AWS be enough to handle the increase in customers coming in, or do we need to consider scaling up in advance, or having something that can scale a bit faster?
So, to fill in some context for the audience who are not familiar with some of the limits we're talking about here: the concurrency limit is a soft limit, and that can be raised with a support ticket. But there's a second limit. Once you hit the regional burst capacity, which in most regions is about 3,000 concurrent executions, after that point you can only increase the number of concurrent executions by up to 500 per minute. This is a hard limit, and this is the reason why Daniel and Sara are talking about this for some of these front-facing workloads, where the traffic goes through the roof very, very quickly as people try to log into the system about 10 seconds before the match starts. When I was at DAZN, we were looking at some of the data on the traffic. It was literally going from about 200 users to 1.2 million within a 30-second window. So think about the concurrency you need for that kind of spike, which with Lambda is just very hard to do. That is, until re:Invent last year, when they announced provisioned concurrency, which is kind of your way to get around that limit. But as Daniel mentioned, sometimes the actual complexity involved in using some of these tools just makes serverless not worthwhile, and it's just easier to go with containers. And I think in this particular case, you can just say, OK, we know when the traffic's going to come, so we're just going to change our auto scaling group setting to bump up the number of containers a minute before the match starts, so we have enough capacity when people do start signing in. Using this approach is actually much easier compared to trying to work around the scaling limitations around Lambda with provisioned concurrency.
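The scaling limits quoted above lend themselves to a rough back-of-the-envelope calculation. The sketch below is a simplified model rather than AWS's actual scaling algorithm: it assumes an initial burst of 3,000 concurrent executions plus a further 500 per minute, as quoted in the conversation, and the function names and the 10,000 target are made up for illustration.

```python
import math

# Figures quoted in the conversation; real limits vary by region and over time.
INITIAL_BURST = 3000    # concurrent executions available immediately
RAMP_PER_MINUTE = 500   # extra concurrent executions added each minute


def minutes_to_reach(target_concurrency: int,
                     initial_burst: int = INITIAL_BURST,
                     ramp_per_minute: int = RAMP_PER_MINUTE) -> int:
    """Whole minutes needed before `target_concurrency` becomes available."""
    if target_concurrency <= initial_burst:
        return 0
    shortfall = target_concurrency - initial_burst
    return math.ceil(shortfall / ramp_per_minute)


def available_concurrency(minutes_elapsed: int,
                          account_limit: int,
                          initial_burst: int = INITIAL_BURST,
                          ramp_per_minute: int = RAMP_PER_MINUTE) -> int:
    """Concurrency available after `minutes_elapsed`, capped by the soft account limit."""
    return min(account_limit, initial_burst + ramp_per_minute * minutes_elapsed)


if __name__ == "__main__":
    # Hypothetical target: 10,000 concurrent executions for a sign-in spike.
    print(minutes_to_reach(10_000))
    print(available_concurrency(1, account_limit=20_000))
```

Under these assumed figures, reaching 10,000 concurrent executions would take 14 minutes, which is far too slow for a spike that arrives within a 30-second window.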
Yeah, I completely agree. I think it's really reassuring for somebody thinking with an operational mindset to have a button you can press to just say "scale up", you know, as high as you need to. And yeah, I guess if you're going with a Lambda approach, a lot of that is controlled by AWS, right? You don't have a button that just says "rescue me, scale". So I think that's something really, really reassuring.
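One way to picture the "scale up" button for containers is a one-off scheduled scaling action set shortly before kickoff. The sketch below only builds the request parameters and makes no AWS call. It assumes the service is an ECS service already registered as a scalable target with Application Auto Scaling, which the conversation does not confirm; the cluster name, service name, task counts and kickoff time are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def build_prescale_action(cluster: str, service: str, kickoff: datetime,
                          min_tasks: int, max_tasks: int,
                          lead_time: timedelta = timedelta(minutes=1)) -> dict:
    """Build parameters for a one-off scheduled scaling action ahead of kickoff."""
    # `kickoff` must be timezone-aware; the schedule is expressed in UTC.
    scale_at = (kickoff - lead_time).astimezone(timezone.utc)
    return {
        "ServiceNamespace": "ecs",
        "ScheduledActionName": f"prescale-{service}",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        # One-time schedule expression of the form at(yyyy-mm-ddThh:mm:ss).
        "Schedule": "at(" + scale_at.strftime("%Y-%m-%dT%H:%M:%S") + ")",
        "ScalableTargetAction": {"MinCapacity": min_tasks, "MaxCapacity": max_tasks},
    }


if __name__ == "__main__":
    kickoff = datetime(2020, 5, 16, 15, 0, tzinfo=timezone.utc)  # made-up kickoff time
    params = build_prescale_action("auth-cluster", "sign-in", kickoff, 200, 400)
    print(params)
    # With boto3 installed and credentials configured, this could be applied with:
    #   boto3.client("application-autoscaling").put_scheduled_action(**params)
```

Raising the minimum capacity ahead of a known event is the container equivalent of the button described above: the capacity is in place before the traffic arrives instead of being added reactively.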
Yeah, absolutely. It's just that in some cases you do need the actual control that containers and ECS give you, whereas with Lambda being a black box, you just don't have as much control around some of these scaling behaviours. But I do want to keep talking a little bit about this mix of containers and serverless, because this is again something you see a lot with larger companies, whereas some of the smaller companies are going all-in on serverless. But DAZN still has this serverless-first mindset. And since you guys work with both containers and serverless on a day-to-day basis, can you maybe compare and contrast the two and paint a picture of why this serverless-first mindset makes sense? Where is serverless much better compared to containers?
Yeah, I can start on that one. So I guess the main benefits of using serverless compared to containers: I think you get a lot of bang for your buck. You hear people say you get a lot of stuff for free. So in terms of resiliency, you're automatically in three availability zones, and you don't have to do much of the underlying work to get there. In terms of scalability, if you don't have these immense traffic predictions, then hey, maybe you don't need to handle the 1.2 million concurrent users that we were talking about previously. So in terms of scalability, you do get a lot for free. It might not go to an extremely high level of load, but often we're not having to deal with that much traffic. And then in terms of just getting up and running, I think the speed of deployment is a whole lot quicker. You can just write your function and deploy it. You don't have to worry about a lot of the extra boilerplate surrounding your functions. In terms of operational savings, you don't have to spin up EC2 instances, you don't have to set up or manage a cluster, and you don't have to configure auto scaling groups. And yeah, a lot of the resiliency stuff you get for free, so you don't have to worry about it. Yeah, Sara, what do you think?
Everything that you just said. I have nothing to add.
Okay, cool. I think one last thing to add: I think serverless really shines when you're looking at event-based models. For example, maybe you have a DynamoDB table and you want to process a stream of events coming out of changes to your data. Or maybe you want to trigger an event when something drops into S3. Or maybe there's some other AWS service you want to integrate with. I think Lambda has got really good, tight integration with lots of other AWS services, which would be a lot harder to achieve otherwise; you'd have to work a lot harder if you wanted to use a container to poll for those updates or somehow analyse changes going on elsewhere in AWS. So that's the benefits, I guess. What are some of the challenges when working with serverless? Sara, what do you think?
Yes, so from my perspective, one of the challenges of working with serverless on a daily basis is mostly the failure handling and the default behaviours which, as you already mentioned, can be benefits because they are offloaded to AWS to be taken care of. But sometimes we need more granular settings because of some constraints that we have in our own applications. That has happened to me multiple times. So sometimes the things that were not possible to customise in serverless services did not fit the requirements of our applications. Some of these limitations have been fixed recently. So, for instance, there is some better handling for asynchronous invocations between one Lambda and another, which is very nice. So in general, the lack of possibility to customise certain behaviours, which is by definition part of the benefit of serverless computing, can be a bit of a limit, depending on the application, the complexity, the constraints, or what you have to ship in the cloud. This is my major pain point.
So this is where the lack of control around the underlying behaviours, the retries and all of that, comes in. Okay, right. Yeah, I think what you mentioned about the async controls that were introduced at re:Invent last year, that's really helped, and I really love the changes, especially for Kinesis, where I used to do so much work to get around the fact that you've got this retry-until-success behaviour. So as a developer, Sara, you're working with both containers and serverless. Do you find there's a difference in terms of speed of delivery when you're working on a project that's mostly working with Lambda, versus a project where you have to do a lot of work with ECS and containers?
I think it depends on your company, and it depends how much work you invest as a company in scaffolding and boilerplate. So at DAZN, as backend engineers we are fortunate enough to have a dedicated platform team that does a lot of the provisioning and managing of infrastructure. So, for instance, the cluster, some of the auto scaling policies, some of the security policies and all the services that are built around that are shipped by the platform team. So in that sense, if I were to build everything from scratch, it would absolutely take me way less time to ship something with serverless. But because our company's doing pretty well in that sense, we as backend engineers are empowered to ship containers very fast as well. So I'm very grateful to our platform team for that. That helps, certainly.
And how many people do we have in the platform team nowadays?
That's a good question. I'm not entirely sure, but I wish there were more. I think the support is good from a backend engineer's perspective. We're already getting a lot, but yeah, it's tough. It's a big company.
That makes sense. But I guess from the total cost of ownership perspective, you're paying, what, 10 to 15 people pretty decent salaries in London to look after something that you just wouldn't have to do if you had a fully serverless environment, which, of course, may not meet all of the non-functional requirements that you have at DAZN. We've already covered that. Another thing that I want to touch on, and I don't know if it's something that you guys have exposure to, is the cost side of things. How are the serverless components doing in terms of cost? Are they more cost efficient or less cost efficient compared to some of your containerised workloads?
I think it really varies from team to team. I wouldn't want to go into too many specifics around our costs. But clearly, if you've got some endpoints which aren't getting too much traffic and they're running on Lambda, you're going to get huge cost savings as opposed to having some long-living containers, especially if they're being deployed to multiple regions and multiple availability zones. A lot of the stuff we do at DAZN is active-active across four regions, and in each region we want to have resiliency across availability zones. So suddenly, if you're going to deploy a container, even if you just want to have at least one container, you're actually talking about 12 containers. And then if you're scaling that out to four environments, like dev, test, stage and production, if you're not careful, you've quickly got 48 containers. And if this is just a relatively simple API that doesn't get a whole lot of traffic, obviously having a Lambda function, where you only pay for what you use, is going to save you quite a lot of money. I think the story probably changes as you scale up. If we have a service which requires a whole load more containers and receives a lot more traffic, then I think it gets a bit more complicated. But I think certainly, in the use case where services have a lot less traffic, yeah, it's definitely cheaper with Lambda.
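The container arithmetic above can be written out as a small sketch. The figure of three availability zones per region is an assumption implied by the numbers quoted (12 containers across four regions), and the function name is made up for illustration.

```python
def minimum_container_count(regions: int, availability_zones: int,
                            environments: int = 1, per_zone: int = 1) -> int:
    """Smallest number of always-on containers for one service.

    One container per availability zone, in every region, in every environment.
    """
    return regions * availability_zones * environments * per_zone


if __name__ == "__main__":
    # One environment: 4 regions x 3 availability zones.
    print(minimum_container_count(regions=4, availability_zones=3))
    # Four environments (dev, test, stage, production).
    print(minimum_container_count(regions=4, availability_zones=3, environments=4))
```

The point of the calculation is that this baseline is paid for around the clock regardless of traffic, whereas a Lambda function behind a quiet API only costs money when it is invoked.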
Yeah, not every service at DAZN runs at DAZN scale, right?
This is something that the guys at Netflix also said when I was talking to them. Netflix was very interested in Lambda once it ticks a few more boxes for them, because not every service at Netflix runs at Netflix scale. There are a couple of them, like sign-in and all of that stuff that everybody hits along the way, but a lot of services are either highly cacheable or just don't get that much traffic by comparison. It makes sense for you to have a split and make a conscious decision on where you should start to optimise for costs, based on the amount of traffic those services receive. You touched on multi-region active-active there. So can you describe some examples of how you set up these systems? What do they look like from front to back, in terms of the UI all the way to, say, the databases? How do you make sure that they can be served active-active from different regions?
This is definitely Sara's cup of tea.
Right. Yeah, so active-active is very complex. It's a pain point for us as developers, because I think it's very hard to get it right and it increases the complexity by a lot. But it is a necessity because, from a resiliency perspective, we need to guarantee that our services can be consumed by the user with minimum downtime. Certainly, using serverless, as in the case of DynamoDB, S3 buckets, API Gateway or Lambda, it is easier to do active-active, because there are certain features, like syncing data from one region to another, that you can achieve easily for data storage which is serverless. Otherwise, it's very hard. It's very hard. But yeah, in general, we try to apply not only that but also other resiliency patterns. At the very least, we deploy into two regions, so we apply failover. We have active-passive for anything that is not on the critical path, let's say. So at least we have an active region that gets all the traffic, and otherwise we fail over to the shadow region. But in general, for everything like, say, playback and the payment system, yeah, we deploy to four regions. And yeah, it takes a lot of time and testing.
I'd also add to that and say that we make quite heavy use of DynamoDB global tables in terms of replication across regions. Yeah, a lot of teams are using global tables, and it's been a really big time saver.
And with global tables, I remember a few teams were also using DynamoDB Streams from global tables and ran into some really interesting edge cases, which are not that well documented. Do you guys want to talk about some of the pain points around using global tables with DynamoDB Streams?
I can't remember exactly which edge cases you're referring to. I know there are teams who have global tables set up and have streams taking all the changes and putting them into, like, a history table, for example, where we can then store, effectively, an audit log of all the changes that have occurred to the main data. Yeah, can you enlighten us?
I think it was Tom that was running into this. So when you have an update to your local table that gets replicated to the tables in other regions, you get the same event getting triggered in every single region. So essentially, for one update, you're going to receive four events on a stream. And then you have to look at this very specific key in the payload to figure out whether the update came from the local region or from another region somewhere else, to figure out which one you should process, rather than trying to process the same event four times.
Yeah, I recall that, because you get some extra attributes added to your data, right? And then from those you can derive whether it's an update event, or whether it occurred in the local region. You have to figure out backwards whether you should process it, or leave it to the other region where the actual event occurred, if you want to keep the changes in their own regions and not have the same audit log in every region, if you like.
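The filtering described above can be sketched as follows. The conversation does not name the key in the payload; this sketch assumes the original (2017.11.29) version of global tables, which adds an `aws:rep:updateregion` attribute to replicated items, and it assumes the stream includes item images. The helper names and the sample records are made up for illustration.

```python
from typing import Iterable, Iterator

# Assumption: the 2017.11.29 version of global tables stamps each item with
# the region where the write originally happened.
UPDATE_REGION_ATTRIBUTE = "aws:rep:updateregion"


def originated_locally(record: dict, local_region: str) -> bool:
    """Return True if the change behind this stream record was made in `local_region`."""
    change = record.get("dynamodb", {})
    # Inserts and updates carry NewImage; removals only carry OldImage.
    image = change.get("NewImage") or change.get("OldImage") or {}
    origin = image.get(UPDATE_REGION_ATTRIBUTE, {}).get("S")
    return origin == local_region


def local_records(records: Iterable[dict], local_region: str) -> Iterator[dict]:
    """Yield only the records that should be processed in this region."""
    for record in records:
        if originated_locally(record, local_region):
            yield record


if __name__ == "__main__":
    # Inside a Lambda function the region is available as os.environ["AWS_REGION"].
    sample = [
        {"eventName": "MODIFY", "dynamodb": {"NewImage": {
            "id": {"S": "user-1"}, UPDATE_REGION_ATTRIBUTE: {"S": "eu-west-1"}}}},
        {"eventName": "MODIFY", "dynamodb": {"NewImage": {
            "id": {"S": "user-2"}, UPDATE_REGION_ATTRIBUTE: {"S": "us-east-1"}}}},
    ]
    for record in local_records(sample, "eu-west-1"):
        print(record["dynamodb"]["NewImage"]["id"]["S"])
```

With a filter like this in each region's stream handler, every change is written to the history table once, in the region where it originated, instead of once per replica.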
Yeah, I was talking with Tom about that, and I think it wasn't even documented. It was just something that he figured out by looking at the payload, experimenting a little bit and seeing which key maps to which scenario, and it wasn't even mentioned in the documentation. Or there's, like, one line in the documentation, in, like, 50 or 55 pages of documentation for DynamoDB.
Yeah, sometimes I find the documentation in AWS is not easy to browse. One of the pain points with the documentation is that I think failure scenarios and error handling should be highlighted more, to help developers troubleshoot, because anything could happen and things fail. And we need to be empowered as developers to capture these scenarios and handle them, foresee them if possible, but otherwise troubleshoot them. And I think they can take an extra step there. There is margin for improvement in terms of how AWS shows these scenarios in its documentation.
Yeah, I couldn't agree more. I've spent so many hours just going through documentation trying to find that one line that explains why I'm seeing some weird behaviour. So AWS's documentation needs to be better, and there are some pain points around the error handling scenarios. Is there anything else that you find challenging when you're working with Lambda or with AWS? Anything that AWS can do better?
I think, from my perspective, some of the pages of the Management Console could be a whole lot better. To give AWS some credit, I think the API Gateway one has recently got a bit of a facelift. It was very frustrating up until recently, but they're improving.
Also, there is the one for Parameter Store, which is so, so tough to use. But again, in theory you should be creating your infra with tools such as CloudFormation or Terraform or the CLI. But again, sometimes you just want to get familiar with things, and you need to have the visual feedback to understand what's going on, especially when you troubleshoot stuff, or you're not really sure, or you're just experimenting or playing around. So in that sense, the dashboard is very helpful to get familiar with a specific service or a specific context or scenario. So, yeah, I understand that we shouldn't prioritise the dashboard or the web console. But again, there is margin for improvement.
Going back to the documentation improvements that we were talking about, I think this is one thing we could do a bit better: helping hold people's hands through the adoption of serverless. Because I think, you know, it's very easy to get something out of the door quickly, just spin up a new serverless stack. I think there are just lots of kind of hidden decisions that, if you're not careful, might bite you. There's a bit of a hidden learning curve. So, for example, would you want to use an ALB versus an API Gateway? How much memory do you need to give each of your functions, and what impact will that have on either cost or performance? How should I deploy my functions? Do I need provisioned concurrency? What account limits do I need to watch out for? What kind of alerts should I really have? So I think there's a whole lot of hidden complexity, and yeah, if you aren't careful coming into serverless and just using everything straight away, there are a few gotchas you've got to watch out for. So I wonder if just educating users could be a bit better.
Yeah, I agree with that.
It's funny that you mention that. I was talking to the guys from DVLA in Swansea, and at the end of that episode of this podcast, they said exactly the same thing. I think Matt Lewis, he's the chief architect over there, said that AWS is so unopinionated when it comes to what customers do. They almost just give you this huge menu of different things you can choose from without really telling you which one is good when. It's just, hey, here's a menu, order whatever you like. But if you know you're looking to lose weight, maybe order some salad, don't order a Big Mac. They don't offer any opinion around when you should be using API Gateway versus ALBs. Those decision points are not obvious. I guess they're just giving me a job, because people come to me for this kind of thing.
If you go into AWS, it really is an all-you-can-eat buffet, right? You have so much to choose from, and yeah, I guess it's kind of down to leadership within organisations to set up guardrails and, you know, come up with well-trodden paths to keep people doing the right thing.
Yep, yep. And they shouldn't have to do that themselves. Some of that guidance should be coming from AWS, for sure. And Sara, you mentioned earlier, just briefly in passing, using serverless and Terraform together. How do you guys choose when to use Terraform versus CloudFormation or the Serverless Framework?
I think it's not exactly a choice, as in we widely adopted Terraform eventually as a tool to ship our services in the cloud. Because we work in microservices, each team has space to choose the tooling that they want, within limits. So I think there was just a wide adoption of Terraform, also because it was fostered by the platform team. But Daniel, correct me if I'm wrong on this.
I think it was a recommendation from our platform team, mostly, but I think it's also just the tool that most people are comfortable with. Because if a lot of teams are having to deploy containers anyway, or having to create dashboards or alarms or other bits of infrastructure, maybe outside of AWS, then, you know, using HCL and Terraform and staying within your kind of Terraform bubble has been more comfortable for people, and they don't have to use multiple tools at the same time. That's probably the main reason.
A good point. Yeah, I forgot to mention: certainly Terraform has the added value of having multiple providers. So, for instance, we can build our dashboards relying on the New Relic provider. So everything can be infrastructure as code; we can build our dashboards, alerts, everything related to observability and monitoring using Terraform, so we can also version control it. New Relic is an example, but the same goes for our secrets management in Vault, or it can be Datadog, or it can be PagerDuty. So that's really a great benefit, to be able to control this in your code and be responsible for it.
I guess it's not without issues, because Terraform is very unopinionated, right? It'll let you do whatever you want, because it's just the REST APIs on the other side of it. I think we've run into various issues when working with Terraform, especially around updating multiple Lambdas concurrently. I think there are some limits where you can't update multiple Lambdas at the same time, and we've had to daisy-chain using depends_on in our Terraform so that you deploy one at a time and you sequence the rollout. I think getting more advanced deployment mechanisms, like blue-green deployments or canary deployments, using Terraform has been quite challenging. And yeah, you may wish to look at other tools if you're trying to achieve that, which, you know, we are.
So for the mostly Lambda stuff, a lot of the teams in London, at least, are using the Serverless Framework instead of Terraform, because it just makes life so much easier, especially when you're working with API Gateway. And I do remember that daisy chaining with Terraform and Lambda. When you've got quite a few functions in the project, that is not fun.
No, it's not ideal.
Is there anything that you'd like to tell the listeners before we go? Maybe that DAZN is hiring?
Yeah, definitely. So we are hiring. If you want to help us shape the future of sport, check out our open roles. We have them at jobs.dazn.com. And if you're interested in learning more about what we get up to, go to engineering.dazn.com, where we have a bunch of blog posts going into more technical detail about the challenges that we've overcome.
I'll put those in the show notes for everyone who's looking for information about DAZN. And also, DAZN is spread across multiple development centres. I guess you're hiring in Amsterdam and London, is that right?
Yeah. We have development centres in London, Amsterdam, Leeds and Katowice. I'm not sure exactly where we're hiring right now, but check out the website, with all of our open roles there.
Okay, that's great. And I want to thank you guys for joining us. And before I let you go, how can people find you guys on the Internet?
So if you want to get in touch with me, search for Daniel Wright on LinkedIn, or I'm @daalwr on Twitter.
As for me, you can find me on Twitter under the nickname @sarutule or, yeah, on LinkedIn under the name Sara Gerion.
Okay. I'll make sure to put those into the show notes as well. So again, thank you guys so much for taking the time to talk to me today. I hope to see you around the DAZN office soon.
Cheers. Take care.
Cheers, guys. Thank you.
Thank you.
So that's it for another episode of Real World Serverless. Thank you guys very much for joining us in this conversation with Daniel Wright and Sara Gerion at DAZN. To access the show notes and the transcript, please go to realworldserverless.com, and I'll see you guys same time next week. Take care for now.