The Art of LiveOps

The Operational Side of LiveOps w/ Jason Girard: The Art of LiveOps S3E7

June 06, 2022 James Gwertzman and Crystin Cox Season 3 Episode 7

Jason Girard is the Director of Live Services and Technical Program Management at ProbablyMonsters, building the next generation of Live Operations from the ground up in a developer-focused way. He's been in the games industry for over 10 years building and supporting platform-level technologies at the scale of millions of players across multiple games simultaneously. His focus is guarding the player experience by keeping games online and working.

Jason talks about Games as a Service; Monitoring and Alerting; How a LiveOps team integrates with other teams within an organization; the importance of database management and the dangers of collecting too much or too little personal data from your players.


This episode is brought to you by Azure PlayFab.
Visit https://www.playfab.com for all your LiveOps needs.


00;00;05;00 - 00;00;06;14
James Gwertzman
Hello. I'm James Gwertzman.

00;00;06;14 - 00;00;09;23
Crystin Cox
I'm Crystin Cox. Welcome to The Art of LiveOps podcast.

00;00;14;22 - 00;00;21;14
Crystin Cox
Hey, everybody. Crystin here. It's just me for this one, but I am joined by an awesome guest.

00;00;21;20 - 00;00;38;17
Jason Girard
My name is Jason Girard. I am the Live Services Director at ProbablyMonsters. I've been in the games industry now for about ten years in roles, everything from QA to production to live operations. I spend most of my time these days leading live operations and technical program management.

00;00;38;20 - 00;00;59;11
Crystin Cox
He is going to be talking with us about that side of the fence, very much the actual operational side of LiveOps. In his career, he's spent most of his time on platform technologies, sort of managing all the bits and bytes and services and uptimes and all the things that keep the magic going. So I'm excited to talk with him and explore that space

00;00;59;11 - 00;01;03;05
Crystin Cox
a little bit more fully. So let's dive right into it.

00;01;10;04 - 00;01;21;00
Crystin Cox
So you're over at ProbablyMonsters. Can you give us like just a little bit of a background on ProbablyMonsters because some people might not be familiar with it. I think a lot of people probably know, but some people may not be familiar with what you guys do.

00;01;21;10 - 00;02;02;09
Jason Girard
Absolutely. So ProbablyMonsters is still very much in sort of the startup phase of development. We are a family of studios that is efforting to do things differently than the broader game industry has for the last number of years, as far as things like work-life balance and culture within the studios. We're a family of studios where the idea is that ProbablyMonsters itself basically builds all of the infrastructure from the business side and the technical side to allow game creators to just create the fun that they're best at and do it for a really, really long time.

00;02;02;10 - 00;02;11;19
Jason Girard
Our goal is to hire people that retire here and, you know, we're just in the early stages of that, but so far it's going really well and we're really excited about it.

00;02;12;00 - 00;02;21;21
Crystin Cox
Yeah, I know you guys can't really talk about the specifics on the games you're making. But I do know that some of them may have a LiveOps component.

00;02;22;11 - 00;02;59;00
Jason Girard
Yes. Yes. Well, there's a reason that they brought me on. Right. So it's a fair bet that we're going to be doing games as a service to a certain degree. And they're bringing me on to sort of build that organization. And somewhat different from your typical guest on this podcast, my happy space is less in the games as a service, community management, how do we measure and monetize and do those things. I play in that pool a little bit, but where my bread and butter really is, is the operational aspect of it.

00;02;59;12 - 00;03;10;29
Jason Girard
And so I'm really interested or really excited to talk to you all about that because live operations in this industry is changing, in that it means so many things in so many different studios.

00;03;10;29 - 00;03;47;20
Crystin Cox
Yeah, agreed. You know, you have this background, as I've said, where you've really worked in the nitty gritty of keeping services up and running and making all that infrastructure and technology that players connect to and creators connect to. Actually both sides connect to that. And you've had a lot of experience in your career doing that for multiple teams in multiple studios in a way that I think is actually quite different than a lot of the other people we've talked to or even a lot of younger, smaller teams who might be kind of doing this a little bit on their own.

00;03;47;20 - 00;03;52;23
Crystin Cox
So I'd love to hear you talk a little bit more about like that scale of operations.

00;03;53;09 - 00;04;16;26
Jason Girard
So I've worked in platform technology for nearly the entire time that I've been in the games industry at this point. And I love it. There are folks that really get geeked out about working directly on a game product. You know, they like to be able to say, I made that weapon or I designed that level, and there's no disparagement on that at all.

00;04;16;26 - 00;04;38;13
Jason Girard
It's just I love being able to touch all of the things. One of the things that I like to say about the live operations folks in general is the typical developer launches a game once every two to five years, depending on the product. I launch five in a year. And so it's a really gratifying experience. It never gets old.

00;04;38;13 - 00;04;55;06
Jason Girard
There's new problems to solve with every product, but because of that, there's a lot of that underlying sort of platform level infrastructure that touches so many of the products that people play that is completely blind to a player unless something's broken,

00;04;55;06 - 00;04;55;15
Crystin Cox
Right.

00;04;55;15 - 00;04;59;14
Jason Girard
And, you know, those are the sorts of things that we'll get into a little bit later, I think.

00;04;59;14 - 00;05;05;09
Crystin Cox
You provide that invisible hand that moves me magically from one place to another.

00;05;05;10 - 00;05;21;08
Jason Girard
I mean, I've been the person that comes home after an 80 hour week and just wants to log in and play the game that I'm playing at a particular point and it be down for some reason. Right. And I don't want people that play my games to do that, to have that experience. And that's why I do what I do.

00;05;21;19 - 00;05;43;20
Crystin Cox
Yeah, that's awesome. And like in the time that you've been doing it, it's a unique perspective: you've spent pretty much your entire career in platform technology. I think you sort of came into the industry at a time when the idea of what that meant was a bit different than it is today. So I'd love to hear you talk a little bit about the way that technology has changed over the time you've been working with it.

00;05;43;20 - 00;06;16;26
Jason Girard
Oh, sure. So I had the fortune to start my game industry career at Blizzard Entertainment back about ten years ago now. And it was one of the first platform level technology companies, with Battle.net back in the StarCraft 1 days, right, and the Warcraft 2 and 3 days, where it was really a way to facilitate people playing together. Over the time that I was there, slightly before,

00;06;16;26 - 00;06;48;08
Jason Girard
but over the time that I was there, it morphed into this true centralized product that all of the Blizzard products plugged into. In fact, one of the things that I worked on while I was there was the launch of Battle.net as folks know it today. I was in QA at the time, so a slightly different role, but it's gone from this set of systems that allow people to connect and interact with each other to really the core of any product that gets to people online, right?

00;06;48;08 - 00;07;14;14
Jason Girard
Like everything from the ability for a game, a game client to connect into Xbox's or PlayStation's or Nintendo's or Steam's or whomever's authentication systems or their marketplaces or their multiplayer services, there are these platform technologies that underlie all of these games that have to be there in order for them to work on any platform that you choose to play on.

00;07;14;21 - 00;08;01;29
Jason Girard
And the more the industry changes over time, the more integrated those things become. I mean, games as a service has existed for quite a while, but has really become the bread and butter of the industry over the last five, maybe ten years for some companies. And as a result, these technologies have gotten broader in focus, handle larger loads, faster, and there are just more of them over time. Like, you used to have an identity and a multiplayer service. And now you've got, you know, marketplaces, you've got title storage, you've got the ability for analytics for all of these things, and you're looking at the monitoring and adjusting them in real time, sometimes manually, but hopefully most of the time via automation.

00;08;01;29 - 00;08;30;22
Crystin Cox
Yeah. You know, monitoring live services has always been fascinating to me because I actually had one... somebody asked me at one point, like, why isn't there just a service for this? Like, why isn't there an off-the-shelf service to monitor games? Like why can't you just build that? And, you know, I think my answer was in games when we monitor the health of services, it's rarely as simple as just looking for errors.

00;08;30;22 - 00;08;46;17
Crystin Cox
Like, yes, we have systems that will notify us when errors happen. That's great. Sure. But it can be so much more complex than that to actually understand this very complex thing that's up and running and know whether it is running well.

00;08;46;26 - 00;09;11;17
Jason Girard
Yeah. Yeah. So there's multiple ways to look at this, right? And each team within games is going to care somewhat differently about different measurements. For example, an infrastructure team is going to care about, you know, ALB health (application load balancers) or the health of the stacks that they have up in the cloud.

00;09;11;17 - 00;09;36;26
Jason Girard
Right. Or if they're old school, whether or not their bare metal services are in a good, solid state. A game services team, or a services team that develops something like authentication or multiplayer services, is going to care about things like, you know, response time, throughput. And very rarely are you actually measuring on number of errors; you're measuring on a number of other things.

00;09;37;06 - 00;09;59;18
Jason Girard
A game team, however, is going to care about things like population. You know, is there an event of some sort that causes the population to either spike tremendously or reduce tremendously in a short amount of time? Because those are indicators that there might be a problem to be looked at and even individual game products will want to care about different things entirely.

00;09;59;28 - 00;10;42;22
Jason Girard
Like if you're a multiplayer game, you're going to care an awful lot about latency. If you're not, then you'll care a little bit less about it. You're nearly always going to care whether or not your store is working for any product that has one. The monitoring and alerting is really kind of an art form that's developed in concert with all of these various teams that in many ways interact with each other fundamentally, but don't necessarily interact with each other in an everyday sense. Like, once you've integrated with an API, you're probably not going to work with the team that built the service that you've integrated with often, but you care very much

00;10;42;22 - 00;10;59;12
Jason Girard
about how their things are working. And that's where live operations teams sort of fit right in the middle of all of those things to watch what all of those teams care about. [ad music]

00;10;59;12 - 00;11;40;29
Crystin Cox
The Art of LiveOps is presented by Azure PlayFab. The future of LiveOps games is in the cloud, so why not power your game on the global, secure and reliable Azure Cloud? Microsoft Azure already delivers enterprise level cloud services to thousands of game developers because Azure Cloud Gaming Services are built by game developers for game developers. Find out what makes Azure the best cloud for gaming by visiting Azure.com/gaming. That's Azure.com/gaming. A-Z-U-R-E dot com slash gaming to learn more and get started for free. [ad music]

00;11;40;29 - 00;12;03;24
Crystin Cox
Welcome back to The Art of LiveOps. Today we are talking to Jason Girard, and we are talking about live services and platform operations, the technical side of LiveOps, keeping all of your services up and running, processing all that data and infrastructure and making sure that everything's running smoothly.

00;12;04;13 - 00;12;15;26
Crystin Cox
We're talking about all of his experience doing that largely for these really like big organizations with lots of different game teams touching those platforms. So let's dive back in and keep that conversation going.

00;12;19;11 - 00;12;33;16
Crystin Cox
You know, there's all of these different things the different teams want to monitor. And then the next step, which I've run into so many times in my career, is: great, you've monitored it and now it's alerting. And now what?

00;12;34;19 - 00;13;03;05
Jason Girard
Well, now there's optimization and adjustments and tweaks, and every deployment, every change to a system is going to change how your alerting works. And so you actually have to be really careful with your alerting so as not to alert on everything, right? Not to monitor everything. Which is sort of the baseline of I want to know all the things about the product, right?

00;13;03;08 - 00;13;27;06
Jason Girard
But the reality is the noisier your alerting gets, the less effective it is. And so if you end up spamming alerts at your operations team, eventually they're just going to ignore them because they're not super important or they don't indicate a problem that actually needs to be researched and they get filtered away in email or what have you.

00;13;27;07 - 00;13;34;10
Crystin Cox
The actionable problem, right? Like if you start measuring a lot of things, but then you don't do anything with that information, then.

00;13;34;16 - 00;13;34;25
Jason Girard
Yeah.

00;13;34;29 - 00;13;36;02
Crystin Cox
It becomes noise.

00;13;36;02 - 00;14;03;07
Jason Girard
And so what you really have to do is, per service or per product, figure out what is the best measure of health for that. And alert only on that thing, or that thing and maybe one other thing that could be a secondary indicator. And then, you know, over time, you adjust those alerts to the appropriate thresholds because there's always going to be a variance.

00;14;04;04 - 00;14;20;19
Jason Girard
There is no piece of software that never errors. And in fact, oftentimes player behavior will error software because it's supposed to, like if you're trying to log into an account that isn't yours, it's going to throw an error. You want that kind of error.

00;14;20;19 - 00;14;21;04
Crystin Cox
Right.

00;14;21;04 - 00;14;41;04
Jason Girard
Versus, you know, if you're getting, say, a DDoS that is overwhelming your marketplace service, those are an entirely different set of errors and should be throwing off alarms. Beyond that, you're going to run into problems where something breaks that you didn't have an alert for.

00;14;41;07 - 00;15;02;00
Jason Girard
Right. And the important thing after that point, once you've gone through the incident management process and brought services back up, is to understand where that gap was and understand whether or not you need to alert on it in the future or you solved the problem that maybe you didn't understand before. And if so, how do you want to measure the health of that?

00;15;02;22 - 00;15;22;07
Jason Girard
It's very important to take a look at a game product or service and focus on what you really care about. Otherwise, it's just noise and you end up having to scale an operations team far bigger than it needs to be just to facilitate the business of it.
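
One way to picture the alerting approach Jason describes is the minimal Python sketch below: a single primary health signal (the rate of unexpected errors), a tunable threshold, and a requirement that the breach be sustained before anyone is paged. Everything in it is an assumption made for illustration, including the `HealthAlert` class, the error-kind names, and the numbers; none of it comes from the episode or from any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical error kinds that are *expected* from normal player behavior
# (for example, a failed login). These names are assumptions for illustration.
EXPECTED_ERRORS = {"invalid_credentials", "not_authorized"}


@dataclass
class HealthAlert:
    """Alert on one primary health signal: the rate of unexpected errors."""

    threshold: float        # highest tolerated unexpected-error rate (0..1)
    sustained_windows: int  # consecutive bad windows required before paging
    _bad_streak: int = 0    # internal counter of consecutive bad windows

    def observe(self, requests: int, errors_by_kind: Dict[str, int]) -> bool:
        """Feed one time window of counts; return True when someone should be paged."""
        unexpected = sum(
            count for kind, count in errors_by_kind.items()
            if kind not in EXPECTED_ERRORS
        )
        rate = unexpected / requests if requests else 0.0
        # A single noisy window resets nothing important; only a streak pages.
        self._bad_streak = self._bad_streak + 1 if rate > self.threshold else 0
        return self._bad_streak >= self.sustained_windows
```

The threshold and the streak length are the two knobs that get adjusted over time as normal variance becomes better understood; a gap found after an incident would typically become a new kind of signal rather than a lower threshold on everything.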

00;15;22;14 - 00;15;42;25
Crystin Cox
Right? Yeah, that makes sense. You know, it makes me think, too, you know, how have the requests you get from game teams changed over the years? Like is it mostly the same stuff, but you're using different technology to solve those requests? Or fundamentally, do developers want different things now?

00;15;44;00 - 00;16;18;01
Jason Girard
It's a mix of both. Like, you know, technology is continuously changing. You know, the web sockets of a few years ago are now things like gRPC. So you're always trying to keep up with sort of the change that technology just latently goes through. And as a result, that presents some very interesting problems where you've got older games on older infrastructure and newer games on newer infrastructure and that's an entire can of worms of its own.

00;16;18;09 - 00;16;50;13
Jason Girard
So there's sort of the requests of games in sort of their lifecycle. And then there's the broad requests for platform services and operational services that have really grown over the years. Because realistically speaking, while many games may be different and they have perhaps different segments of requests from an operational team, the types of work can fall into fairly large buckets that just need to be facilitated through either automation or documentation.

00;16;50;22 - 00;17;11;22
Jason Girard
And it's things like, one of the biggest things that happens in GaaS models now is things like we want to grant all the players a Christmas gift kind of thing. Well, the development team doesn't need to do that. They just need to tell an operational team what they want to grant. And the operational team should have the tooling to be able to do that for them.

00;17;12;04 - 00;17;33;26
Jason Girard
That kind of thing wasn't considered five years ago. And so some engineer on the product team was writing custom code to get the thing out there. Whereas now it's just kind of a thing that is expected. That is part of your either operational or player support tooling, right? So the requests have gotten broader as the capabilities have gotten broader.

00;17;34;21 - 00;17;53;15
Jason Girard
And the more we think of things that are good value adds for players the more likely it is that that's going to land in the operational space to facilitate. My goal is literally to make sure that game developers can make fun, can make the game, and leave all the rest of it to me.
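
The "grant every player a gift" request can be sketched as a small piece of operational tooling. The Python below is only an illustration under stated assumptions: `grant_to_all`, the `grant` callback standing in for a real inventory service, and the `already_granted` set standing in for durable storage are all hypothetical names, not anything described in the episode.

```python
from typing import Callable, Iterable, Set


def grant_to_all(
    player_ids: Iterable[str],
    item_id: str,
    grant: Callable[[str, str], None],
    already_granted: Set[str],
) -> int:
    """Grant item_id to every player once; safe to re-run after a partial failure.

    `grant` is a hypothetical call into the game's inventory service, and
    `already_granted` is a stand-in for a persistent record of completed grants.
    """
    granted = 0
    for player_id in player_ids:
        if player_id in already_granted:
            continue  # handled by an earlier run, so do not double-grant
        grant(player_id, item_id)
        already_granted.add(player_id)  # record success so retries stay idempotent
        granted += 1
    return granted
```

Keeping the grant idempotent is the design choice that matters here: a bulk job over millions of players will likely be interrupted at some point, and the operator needs to be able to run it again without anyone receiving the gift twice.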

00;17;53;26 - 00;18;11;28
Crystin Cox
Right? Yeah. That it's going to run smoothly and they can say, I need to do this. It's interesting you mentioned tooling because I think when people ask me like, what's the thing that is going to surprise me when I launch this game? Or like, what am I not thinking of? What's the one piece of advice? I'm always like, You don't have enough tooling.

00;18;12;18 - 00;18;29;23
Crystin Cox
I guarantee I don't care what you've done. I don't care how much you prepared. You don't have enough tooling. And it is stuff exactly like you said, like, someone is going to come and say, well, we just want to give everyone in the game this gift. And if you don't have that tooling, you realize, Oh, well, this is going to be a real pain in the butt.

00;18;30;05 - 00;19;02;16
Jason Girard
Yeah, absolutely. And it's going to take days and days or weeks, right? Like to fulfill that request. And it ends up being a thing where somebody is going to write a quick and dirty script that you're going to build an entire toolset on. I've seen that happen, you know, at this point, dozens of times where somebody writes something to fulfill a need that ends up getting additional things tacked on to it, particularly in the CS space or the NOC space, right?

00;19;02;16 - 00;19;24;18
Jason Girard
Like we have this repetitive work that has been put on us by a product team for whatever reason. I'm going to write a quick automation. It's probably not a fully fledged engineer that writes that initial tooling, and then you build upon it over time, and then it becomes a ball of yarn that you have to figure out how to unravel in order to make it scalable.

00;19;24;23 - 00;19;34;12
Crystin Cox
Yeah, absolutely. I mean, it's so funny. Sometimes I think back to the tooling I had in the early 2000s almost wistfully because it seemed so simple.

00;19;35;09 - 00;19;36;20
Jason Girard
Yes, yes.

00;19;37;07 - 00;20;15;07
Jason Girard
And these days your tooling has to account for multiple different platforms. It's got to account for probably multiple different databases. It's got to account for, in many cases, any scenario that you need to solve a problem for a player en masse, which in today's games can get pretty complicated, right? Like there's the cases where, you know, Final Fantasy 7 teams have run into this recently, where they've had to figure out how to constrain the number of new players getting into the servers because they have that big throughput problem.

00;20;15;07 - 00;20;15;21
Crystin Cox
Yeah.

00;20;16;04 - 00;20;19;26
Jason Girard
That probably wasn't written before they ran into this problem.
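
As a rough illustration of what "constraining the number of new players" could look like, here is a minimal Python sketch of a capacity-limited login queue. It is an assumption-laden toy, not a description of how any shipped game does it: the `LoginQueue` class, its method names, and the single `max_online` cap are all hypothetical.

```python
from collections import deque
from typing import Deque, Optional, Set


class LoginQueue:
    """Admit players up to a fixed capacity; everyone else waits in FIFO order."""

    def __init__(self, max_online: int) -> None:
        self.max_online = max_online
        self.online: Set[str] = set()
        self.waiting: Deque[str] = deque()

    def request_login(self, player_id: str) -> bool:
        """Return True if the player is admitted now, False if they were queued."""
        if len(self.online) < self.max_online:
            self.online.add(player_id)
            return True
        self.waiting.append(player_id)
        return False

    def logout(self, player_id: str) -> Optional[str]:
        """Free a slot and admit the longest-waiting player, if there is one."""
        self.online.discard(player_id)
        if self.waiting and len(self.online) < self.max_online:
            admitted = self.waiting.popleft()
            self.online.add(admitted)
            return admitted
        return None
```

A production version would also need queue positions shown to players, timeouts for abandoned queue entries, and a cap that operators can change live, which is exactly the kind of tooling that tends not to exist until the problem shows up.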

00;20;19;26 - 00;20;34;28
Crystin Cox
Yeah. Yeah. And they have a legacy infrastructure, right, like there's also that double edged sword of LiveOps success, which it's like, great, your game is successful for ten years. You are running on ten year old technology.

00;20;35;09 - 00;20;58;19
Jason Girard
Yep. And you know, eventually, you know, if you're on bare metal, that stuff breaks and you've got to replace it. And that's actually where the NOC tends to shine because it is the answer for the problem of ten years ago where we've got, you know, datacenters, we've got servers that need to be monitored. These are things that require an onsite presence all the time.

00;20;59;19 - 00;21;19;18
Jason Girard
And in the more modern world, you've got cloud infrastructure where you'll hear the term SRE thrown around rather than network operations center. But it's really just an evolution of how you solve that problem in the cloud space instead of the bare metal space.

00;21;20;05 - 00;21;46;14
Crystin Cox
So I'd like to talk a little bit about data. You know, data is one of those things we end up talking to pretty much everybody about on here because it touches everything; it is a cornerstone of LiveOps. And I'm curious to hear you talk a little bit about the way your relationship with data has changed as someone who's primarily touching data in the sense of, you know, managing the flow of it, managing the storage of it, managing, you know, the processing of it.

00;21;46;14 - 00;22;05;05
Jason Girard
So yeah, this is one of the things that I was really interested to talk about as well because like you said, everybody touches data. What I generally care about in the operations space, that is, the how do you manage to make sure that things are up, is whether or not... Well, there's a few things.

00;22;05;05 - 00;22;28;07
Jason Girard
One, logging needs to be a fundamental part of any service because that's... if something goes wrong, you have your monitors, your alerts, and then you go straight to logging to try to figure out where that problem really is. Is it infrastructure, is it code, etc. And so logging solutions need to be robust and easily searchable.

00;22;28;23 - 00;22;57;04
Jason Girard
There's lots of tools out there to facilitate that for you these days. Things like Splunk. The other thing that is of concern is your database, what solution you're using, what version of it you're using, how big is it? How can it scale? Are we going to ever have to migrate it? Because if you've ever had to do a DB migration, you know that you're going to generate downtime and it's rife with possibility for failure.

00;22;57;04 - 00;23;23;19
Jason Girard
Like you have to be very careful in the scenarios where you touch any data because as many of your other guests have talked about, it's extremely important for how you develop your products, how you know what's doing well for players, how you know whether or not the event that you're running is making the progress that you're trying to hit with it, whether or not the sales that you're running are generating additional profits, all that sort of thing.

00;23;23;19 - 00;23;44;29
Jason Girard
So any time you're touching data, you need to be extremely careful with it. And that means that you need to establish a change management process and follow it to a tee and have disaster recovery and rollback ability. And in the case of data, it's quite often replicating the database over and over again in order to make sure that you've got good snapshots of it before you touch it.

00;23;45;06 - 00;23;57;15
Jason Girard
So in my particular world, it's less doing things with the data and more trying to figure out how to protect it so that things can be done with it.
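
The "snapshot before you touch it" habit can be shown in miniature with Python's standard `sqlite3` module, whose `Connection.backup` method copies one database into another. This is only a sketch under assumptions: a real game database is unlikely to be SQLite, and the function name `migrate_with_snapshot` and its parameters are hypothetical.

```python
import sqlite3


def migrate_with_snapshot(db_path: str, snapshot_path: str, migration_sql: str) -> bool:
    """Snapshot the database, apply migration_sql, and restore the snapshot on failure.

    Returns True if the migration was applied, False if it failed and the
    pre-change snapshot was restored.
    """
    live = sqlite3.connect(db_path)
    snap = sqlite3.connect(snapshot_path)
    try:
        live.backup(snap)  # full copy taken before anything is changed
        try:
            live.executescript(migration_sql)
            live.commit()
            return True
        except sqlite3.Error:
            live.rollback()    # discard any open transaction
            snap.backup(live)  # roll back by restoring the pre-change snapshot
            return False
    finally:
        snap.close()
        live.close()
```

The restore path matters because a multi-statement change can fail halfway through, leaving earlier statements already applied; restoring the snapshot returns the data to a known-good state instead of a partially migrated one.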

00;23;57;15 - 00;24;21;07
Crystin Cox
And I imagine in the last couple of years there's been a lot of push on privacy and security and a lot of different rules have come into play for privacy and security, which also affects the way that you store, manage and touch data. And, you know, what has that been like? I think people sometimes outside of the industry don't appreciate... games produce a lot of data.

00;24;21;07 - 00;24;31;08
Crystin Cox
Yes, a lot of very detailed behavioral data from players. So it's a big question for our industry how to handle this thing.

00;24;31;12 - 00;24;59;15
Jason Girard
So there's a couple of ways to approach it. I'll start with collect only what you need. I'll just start there because if you collect everything that you can, one, it's not cost effective. So just let's admit that we can collect all of the things, but we should only collect the things that are helpful. Beyond that, anything that could identify an individual needs to be held or handled with kid gloves.

00;24;59;24 - 00;25;23;02
Jason Girard
You have to have an excellent reason for why you're collecting it. And it has to be a defensible reason. You also have to be able to scrub it at any time, which provides another challenge. Right. Going back to the tooling, you need to have the ability to clear player data when they request it. There are oftentimes good reasons to collect a certain amount of personal information.

00;25;23;02 - 00;25;48;18
Jason Girard
For example, if you're making a mature game, you want to make sure that the people that are playing it at least acknowledge that they're old enough to play that game for legal reasons. So that's an example of, yes, you're collecting personally identifiable information (PII) for a reason, that is, to protect you from a legal standpoint, versus you don't need to collect somebody's Social Security number, right?

00;25;48;18 - 00;25;48;29
Crystin Cox
Right.

00;25;49;00 - 00;26;12;20
Jason Girard
Or whatever information that they have there. Like, you don't need a driver's license number. You don't even in many cases need their actual name unless for some reason there's a requirement for like a payment system that kind of thing. But again, there's a specific reason to collect it for a specific purpose, and it's only used for that purpose.

00;26;12;20 - 00;26;39;01
Jason Girard
Don't use it for anything else. Don't use people's information for anything that they haven't willingly opted in to. Be very careful about that sort of thing, because not only is it probably just the right thing to do, right? Like don't take advantage of the information that you can gather from people just because you can, but it also protects you from a legal standpoint, if there comes to be a GDPR challenge or something like that.

00;26;39;22 - 00;27;08;10
Crystin Cox
Yes, it's always surprising. Like I would argue most of the time you just don't need it. Like, and I've been surprised it's really all over the place in the games industry I'd say like there are some teams who're like we're just collecting everything we can and it makes you go, why, why do you want that information sitting around on your server? It's like, it's yeah, it's not like, you know, you're using it for anything legitimate and then there are other teams, you know, like my team now at Xbox is very much like we collect nothing.

00;27;08;13 - 00;27;25;24
Crystin Cox
Like you want something, you're going to have to jump through a million hoops like if you want to get it. And, and it's, you know, mostly correct. Like we don't need that information. But where the tension comes in to me is like I consider on the developer side and say we almost would never need that information. Like I don't want that information.

00;27;26;08 - 00;27;48;14
Crystin Cox
But when we talk about customer support, there becomes a real tension because while I might be over here going, well, there's no reason for me to know that you specifically Jason bought this item in the store and this item in the store and made this trade with this person. I don't need to know that. You might come to CS and say, I bought this thing in the store and I really need you to do X.

00;27;48;14 - 00;27;52;15
Crystin Cox
And then that tension I think is the one that's the most challenging.

00;27;52;25 - 00;28;18;24
Jason Girard
Oh, sure. And it's a reasonable tension to have. And just a quick aside, I was actually at Bethesda or Zenimax when the acquisition with Xbox happened, as you know. And one of the things as part of that acquisition was going through all of the data we had. And Microsoft Xbox is really particular about what you're carrying or holding on to which I appreciate.

00;28;19;02 - 00;28;50;13
Jason Girard
But to get back to the point of different teams requiring different data. Absolutely the case, and the challenge there is how do you collect the right amount and then obfuscate what you need to. For example, an Xbox player in that scenario could be identified by their XUID instead of, you know, their name, and that XUID connects to all of the different tooling that player support has to be able to make those adjustments.

00;28;51;09 - 00;29;09;13
Jason Girard
So there's a little bit of lateral thinking there. How do we facilitate the need without necessarily putting ourselves in jeopardy of collecting something that we really don't have a use for, or don't require, to facilitate the same outcome, right?
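
The idea of keying support tooling on an opaque identifier, while keeping the ability to scrub personal data on request, might look something like the Python sketch below. It is a toy under explicit assumptions: `PlayerDirectory`, its methods, and the in-memory dictionaries are hypothetical stand-ins, and the random ID is generated locally rather than being any real platform identifier.

```python
import uuid
from typing import Dict, List, Optional


class PlayerDirectory:
    """Keeps PII in one place; every other tool sees only an opaque player ID."""

    def __init__(self) -> None:
        self._pii: Dict[str, Dict[str, str]] = {}  # opaque ID -> personal data
        self.purchases: Dict[str, List[str]] = {}  # opaque ID -> item IDs (no PII)

    def register(self, email: str) -> str:
        """Store the minimum PII needed and hand back an opaque ID."""
        opaque_id = uuid.uuid4().hex
        self._pii[opaque_id] = {"email": email}
        self.purchases[opaque_id] = []
        return opaque_id

    def record_purchase(self, opaque_id: str, item_id: str) -> None:
        """Support and store tooling work purely in terms of the opaque ID."""
        self.purchases[opaque_id].append(item_id)

    def contact_email(self, opaque_id: str) -> Optional[str]:
        """The one narrow path back to PII, for the purpose it was collected for."""
        record = self._pii.get(opaque_id)
        return record["email"] if record else None

    def erase(self, opaque_id: str) -> bool:
        """Scrub the player's personal data on request; returns True if any existed."""
        return self._pii.pop(opaque_id, None) is not None
```

With this split, a support agent can confirm that a given ID bought a given item without ever seeing a name, and an erasure request only has to touch one store. Whether the remaining purchase history also has to be deleted is a legal question this sketch does not try to answer.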

00;29;09;19 - 00;29;23;28
Crystin Cox
And the more you have that data sitting around, just the more risk you're at, not only from bad actors but even from your own team. Like, yeah, the more of the data is around, somebody is going to look at it who shouldn't be looking at it, or do something with it they really shouldn't be doing.

00;29;24;10 - 00;29;36;28
Jason Girard
Yeah. Changing human behavior is a rather Herculean task. You should just probably not give folks the ability to get themselves in trouble if you can avoid it.

00;29;37;06 - 00;29;56;21
Crystin Cox
That is the ideal for sure. So as you're you know, you're starting out this sort of new journey. Now you're building a team, you're looking at the future, you know, you're really looking at the future of games. You guys haven't launched everything yet. So it's all in the future. It's all new stuff. Where's your head at now?

00;29;56;21 - 00;30;02;04
Crystin Cox
What do you see as sort of the stuff that's on the horizon for operations?

00;30;02;05 - 00;30;29;19
Jason Girard
Sure. So, funny enough, I was just whiteboarding out some of these ideas this morning, and I alluded to it a little bit earlier when I talked about sort of the NOC versus SRE versus, you know, whatever you want to call it, and the NOC being the answer for the problem that we were trying to solve a decade ago and a little bit longer, which is, let me be fundamentally clear, it's not a knock on NOCs at all.

00;30;30;27 - 00;30;31;26
Crystin Cox
Don't knock the NOC.

00;30;32;22 - 00;31;01;15
Jason Girard
Never, because they facilitate a fundamental need, and oftentimes they exist in places where that need has existed for as long as their first GaaS product has been out. Like if you're running an MMO, you probably have a NOC of some sort. That's almost a hard requirement for that kind of game. Or if it's not, then you've invested heavily in scenarios that fulfill the same function that a NOC does, just in different ways.

00;31;01;25 - 00;31;29;11
Jason Girard
The way that I'm looking at it is sort of in the today, in the cloud infrastructure space where everything is obfuscated away via services like Google Cloud or AWS or Azure or what have you: how do we solve that problem now and for the next ten years without having to worry about literally scaling people, humans, to do things that computers can do?

00;31;30;02 - 00;31;38;18
Jason Girard
So the buzzword in the last number of years has been SRE, that came from Google, I think it started to come into vogue again about 2018 or so.

00;31;38;29 - 00;31;40;25
Crystin Cox
So for people who don't know that term, can you...?

00;31;40;25 - 00;32;20;06
Jason Girard
Yeah, I was just about to do that. It stands for site reliability engineering, and it's essentially an attempt to solve the problem that operations teams in general have been addressing, but approaching it from an engineering mindset. And there's a number of parts to it, but fundamentally it fulfills all of the same roles, things like monitoring, alerting, and incident management, because something will break at some point and you need to have somebody that can go solve a problem, but it has a focus on how do we engineer our way out of those problems

00;32;20;06 - 00;32;48;26
Jason Girard
over time. And so the hope is that they spend no more than half of their time on the operational tasks, and over time those operational tasks get automated away where automation makes sense. And there's a mix of that. And then who is literally on the wall at 3 a.m. to let players know that there's a problem if one shows up? Well, that can be facilitated in any number of ways.

00;32;48;26 - 00;33;11;22
Jason Girard
It can be facilitated by player support. For example, we've typically got a 24/7 organization there. It could be facilitated by a type of NOC where, you know, say your cloud infrastructure provider has an outage, or there's an Internet provider outage that has a broad impact, that's affecting players, and you start to get tickets saying, hey, your games aren't working.

00;33;12;07 - 00;33;36;21
Jason Girard
You can use numerous tools to go find out where the problem is. And if it's not your actual servers or if it's not your clients or if it's not your services, then it's a simple matter of letting players know that there's a problem in this area and we're monitoring it. Stand by. The other side of it is most problems happen during business hours because you made a change.

00;33;36;21 - 00;33;37;00
Crystin Cox
Right.

00;33;37;01 - 00;34;12;05
Jason Girard
It's combining the operational side of the world, but it's also making sure that you're building in strong change management systems to prevent the need to spend a lot of time on that incident management or the recovery from the problems that existed. You know, there are war stories everywhere about major outages. Facebook actually had one recently, right, where they had multiple automated systems that sort of daisy-chained to the point where it took down a good chunk of their services. No one is immune.

00;34;12;05 - 00;34;12;28
Crystin Cox
Yeah.

00;34;12;28 - 00;34;29;18
Jason Girard
And so but that outage was caused...it was self-harm. And most of the time they are. You can't plan for every scenario. So the best you can do is accept that there's going to be a problem and understand how to fix it when it becomes one.

00;34;29;18 - 00;34;37;29
Crystin Cox
Right. Yeah. Well, that's a great lead-in, because we're just about out of time. But, you know, now I have to ask the favorite question.

00;34;37;29 - 00;34;38;21
Jason Girard
War story.

00;34;38;21 - 00;34;42;13
Crystin Cox
Yes. Give us your LiveOps disaster story.

00;34;43;00 - 00;35;13;12
Jason Girard
Okay. Well, I'll preface with: the names have been changed to protect those who are innocent. And so I'll avoid naming products or studios. But one of my favorites was a new game that was being announced sort of as a mic drop moment at an E3, where we had planned it sort of to a tee, like we had everything covered as far as we knew.

00;35;13;28 - 00;35;34;02
Jason Girard
We had scale tested everything. We had made sure that all of our services were pre-warmed to be able to take a large load. And the mic drop moment was, hey, you can go to this website and sign up for this beta, which is for a big game that everybody's excited for. Well, it turns out that when you ask people to do that, they tend to do that.

00;35;35;00 - 00;36;03;20
Jason Girard
And so half the people watching E3, obviously that number is a little overblown, but half the people watching E3 all went to the site at the exact same time, which, you know, overloads the site. That's one problem to solve. It turned out that when they were overloading the site, it touched a number of the other platform services, and those platform services were effectively hidden behind a piece of technology that was a gateway technology that routed everything.

00;36;04;06 - 00;36;28;01
Jason Girard
And so the website became overwhelmed. Everybody's trying to log into it. It's hitting identity. It's hitting a number of other services that were endemic to those. All of those services were being shoved into the gateway, and the gateway fell over. And when the gateway fell over, every game that used the services that were behind the gateway.

00;36;28;01 - 00;36;28;14
Crystin Cox
Yeah.

00;36;29;00 - 00;36;59;25
Jason Girard
Ended up having major issues to varying degrees. Right. Like if you needed a specific server up and it went down, you were dead in the water. Or if you needed to be able to take purchases through the storefront, that was unavailable for that time. All of which is very bad. And so it was about a 16-hour remediation where we spent all night and much of the next day getting everything back up into a workable state.

00;37;00;02 - 00;37;08;19
Jason Girard
And then we spent the next six months essentially rewriting the entire infrastructure of the platform. Yeah. In order to solve that problem and push it away.

00;37;08;19 - 00;37;09;18
Crystin Cox
That's a rough one.

00;37;09;18 - 00;37;12;24
Jason Girard
So yeah, that was, that was a fun night, you know.

00;37;12;25 - 00;37;49;18
Crystin Cox
Yeah. It's true though. You often find those points of failure by having them fail. Like, gosh, I remember I once spent like two days before Christmas, we had a switch... a bug in a switch. And whew, yeah, so many things would not function, because, you know, you do sometimes have those connective points where it can be hard to know: how much replication do I need? Like, is it going to be okay? Until suddenly your gateway falls over and it takes 16 hours to get your services back.

00;37;49;28 - 00;38;14;11
Jason Girard
So it's, it's a bit of an odd segue I guess. But a quote that I often give out is "everybody's got a plan until they get punched in the mouth". Yeah. It applies in this scenario because if you think about major product launches for pretty much any studio in the last number of years, scale is nearly always a problem at launch and it's the proverbial punch in the mouth.

00;38;14;12 - 00;38;37;05
Jason Girard
Right? Like it's how do you take that punch and stay standing, and then how do you understand how to keep everything up and running when you have that major change in traffic? If I was to give any advice to folks sort of getting into the operational space: understand your infrastructure and how it touches your services. Get a service map.

00;38;37;24 - 00;38;56;26
Jason Girard
They're hard to build. They're ridiculously hard to build. But if you get that and you have people that can understand the connective tissue, you have people that can solve problems a lot faster. You're going to have problems. You will absolutely have problems. The question is what do you do when they happen?
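[Editor's sketch] To make the service map idea concrete, here is a minimal illustration of how a dependency map can answer "if this component falls over, what else is affected?", which is the question the gateway story turns on. Everything here is an assumption for illustration: the service names, the shape of the `DEPENDS_ON` map, and the `blast_radius` helper are hypothetical and do not describe the actual platform in the story.

```python
from collections import defaultdict, deque

# Hypothetical service map: each service lists what it directly depends on.
DEPENDS_ON = {
    "beta_signup_site": ["gateway"],
    "game_a": ["gateway"],
    "game_b": ["gateway"],
    "gateway": ["identity", "storefront"],
    "identity": [],
    "storefront": [],
}


def blast_radius(depends_on: dict, failed: str) -> set:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = defaultdict(set)
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(service)

    # Breadth-first walk outward from the failed service.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents[current]:
            if service not in affected:
                affected.add(service)
                queue.append(service)

    affected.discard(failed)  # guard against cycles leading back to the start
    return affected


print(sorted(blast_radius(DEPENDS_ON, "gateway")))
```

In this toy map, a gateway failure reaches every game and the signup site at once, which is the kind of shared connective point a service map is meant to expose before launch day.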

00;38;56;29 - 00;39;04;09
Crystin Cox
Right. Well, that's a great thing to leave it on. Thank you so much for coming and doing this. It's been great having you on.

00;39;04;15 - 00;39;06;10
Jason Girard
Likewise. This has been a lot of fun.

00;39;06;11 - 00;39;07;10
Crystin Cox
Thank you so much.

00;39;07;11 - 00;39;08;01
Jason Girard
Thanks, Crystin.

00;39;11;21 - 00;39;14;05
Crystin Cox
Thanks for listening to The Art of LiveOps podcast.

00;39;14;05 - 00;39;19;12
James Gwertzman
If you liked what you heard, remember to rate, review and subscribe so others can find us.

00;39;19;12 - 00;39;24;10
Crystin Cox
And visit PlayFab.com for more information on solutions for all your LiveOps needs.

00;39;24;10 - 00;39;25;05
James Gwertzman
Thanks for tuning in.