Real World Serverless with theburningmonk

#20: Serverless at Fender with Michael Garski

July 15, 2020 Yan Cui Season 1 Episode 20

You can find Michael on LinkedIn here.

For more stories about real-world use of serverless technologies, please follow us on Twitter as @RealWorldSls and subscribe to this podcast.

Opening theme song:
Cheery Monday by Kevin MacLeod
Link: https://incompetech.filmmusic.io/song/3495-cheery-monday
License: http://creativecommons.org/licenses/by/4.0


Yan Cui: 00:13  

Hi, welcome back to another episode of Real World Serverless, a podcast where I speak with real world practitioners and get their stories from the trenches. Today I'm joined by Michael Garski from Fender. Hi, welcome to the show.


Michael Garski: 00:27  

Hi, thanks Yan. Thanks for having me on today.


Yan Cui: 00:29  

Yes, nice to have you here. I've heard about Fender quite a few times now from re:Invent. In fact, you guys got mentioned in the keynote a few years ago, about how you're using serverless to revolutionise your e-commerce platform. For the audience who are not familiar with Fender, can you maybe talk to us about what Fender is, and maybe your role there as well?


Michael Garski: 00:51  

Sure. Fender has been a company since 1946, when we launched the Fender Telecaster, with the Stratocaster following after that. We also make amplifiers, both digital and still tube amplifiers as well. Our digital team is based here in Los Angeles, and we're responsible for the digital side of Fender, which currently consists of Fender Tune, a guitar tuning mobile app, and Fender Play, a web and mobile application for instruction on guitar, bass and ukulele. We have another app, Fender Tone, which is a companion desktop or mobile application for our digital amplifiers; that's managed by the amplifier team. We are a little bit separate from the IT group. The IT group uses serverless as well, but they're more focused on B2B, so we're more the B2C side.


Yan Cui: 01:48  

Alright. So in that case, for your B2C business, how are you using serverless and what does your architecture look like from a real high level? 


Michael Garski: 01:57  

So, our architecture is based around API Gateway and Lambda. All of our Lambda functions are written in Go, and we've been using Go since before Lambda had native support for it, using a framework called Apex. It would put in a Node.js shim that would start up the Go binary, feed the event in on standard in, and pick up the reply on standard out. We build our services, or microservices, around a given business domain, and each is comprised of multiple Lambda functions. Generally there's a single function per route, or route-method combination. Some of them may handle the same route: say, for our curriculum system, there's only one Lambda function that handles a given entity, but it handles all the different CRUD operations; it just does a little method inspection inside the function. And we deploy all of those Lambdas as one single unit, so there may be, say, 20 functions in a given service that are triggered by a DynamoDB stream, an SNS message or API Gateway. We build them in such a way that there's common shared business logic, and the functions themselves are very small; they're just validating the incoming request, executing business logic and formatting the response. We use SNS quite a bit for our application events, for say when a user's subscription status changes or a new user signs up, so we can take other actions, such as giving them a discount on physical products if they sign up for an annual plan. For data stores, we make a lot of use of DynamoDB, and Aurora as well for data that's just highly relational, such as our curriculum. And for data infrastructure, we have an S3 data lake set up.
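The method-inspection pattern Michael describes can be sketched in Go roughly like this. This is a minimal illustration, not Fender's actual code: the `Request` type is a trimmed stand-in for `events.APIGatewayProxyRequest` from the aws-lambda-go library, and the lesson routes are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Request mirrors only the fields of an API Gateway proxy event needed for
// dispatch; the real type is events.APIGatewayProxyRequest.
type Request struct {
	HTTPMethod string
	Path       string
	Body       string
}

type Response struct {
	StatusCode int
	Body       string
}

// handleLesson is one Lambda function covering all CRUD operations for the
// "lesson" entity, inspecting the HTTP method to pick the operation.
func handleLesson(req Request) Response {
	switch req.HTTPMethod {
	case "GET":
		return Response{StatusCode: 200, Body: `{"lesson":"example"}`}
	case "POST":
		return Response{StatusCode: 201, Body: req.Body}
	case "PUT":
		return Response{StatusCode: 200, Body: req.Body}
	case "DELETE":
		return Response{StatusCode: 204}
	default:
		return Response{StatusCode: 405, Body: "method not allowed"}
	}
}

func main() {
	for _, m := range strings.Split("GET POST PUT DELETE PATCH", " ") {
		resp := handleLesson(Request{HTTPMethod: m, Path: "/lessons/1"})
		fmt.Printf("%s -> %d\n", m, resp.StatusCode)
	}
}
```

In the real setup the same binary would be registered as the handler for every route of the entity, keeping the CloudFormation resource count down while the business logic stays shared.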


Yan Cui: 04:00  

Okay. So there's that combination of what you call a single-purpose function, where a single function handles just one particular route and method, versus, I forget what people call it now, a Lambdalith, where you've got a function that handles multiple things within an API. Do you guys sometimes make that decision to have a single function handle multiple CRUD operations, by inspecting the path and the method, because you're running into the resource count limit within a CloudFormation stack?


Michael Garski: 04:35  

It's not really about the resource limit; it's more about keeping the number of individual functions down. So for example, if you're managing a lesson, having an individual function to get a lesson, another to create the lesson, another to update it, it seems a bit much. So for things like that, where they're all dealing with the same entity, just with a different HTTP method, that works out very well for us.


Yan Cui: 05:01  

Okay, all right. Fair enough. In this case, how big is your team? It sounds like you've got quite a few different microservices, each an API Gateway and Lambda combination with DynamoDB behind it. How big is the team that's looking after all of this infrastructure?


Michael Garski: 05:17  

So we have three API engineers, who all work on the Lambda-based APIs. We have two data engineers. And we have four dedicated ops engineers, who maintain all the infrastructure. We have a lot of legacy infrastructure, as well as a massive array of redirects accumulated over the years, and they also set up our client deployment pipelines.


Yan Cui: 05:44  

Okay, that's quite interesting that you have more ops engineers than API engineers working on features, especially within a serverless context. But I guess those ops engineers don't just work on the serverless stack; they also work on a lot of your legacy applications as well.


Michael Garski: 06:03 

Correct. Yeah, they run the gamut. I would say the work they do to support the APIs is maybe about half of their total workload.


Yan Cui: 06:13  

So in this case, are your API engineers also responsible for spinning up AWS database resources, using, I don't know, SAM, CDK or the Serverless Framework? Or is it a handoff process, where they do their bit and then give it to the ops engineers to do the release, and the management and running of the application in production?


Michael Garski: 06:37  

It would be the latter case. When we're, say, developing a new service, or making some substantial changes to one, each engineer has their own AWS account, and they can use the Serverless Framework to deploy into that account, using CloudFormation to set up resources such as DynamoDB and SNS. Then, while we're in the process of developing it, we're also letting the ops team know: hey, this is what we need, we've got this function, it can be invoked this way, so that they can get everything set up. Our infrastructure is all managed in Terraform, and they get everything set up for us, so that as soon as we're ready and can merge into master, everything deploys without a hitch.


Yan Cui: 07:15  

So am I understanding you correctly that the API engineers are developing using the Serverless Framework, but then to deploy to actual production, the ops engineers have to translate that into Terraform?


Michael Garski: 07:28  

Correct. 


Yan Cui: 07:30  

Okay. I've just seen that a few times, and usually it creates a lot of friction, and certainly a lot of, I guess, delays in the process. You've got engineers who are not able to spin up lots of different resources, like API Gateway for example, which, using the Serverless Framework, is quite easy; but when you need to translate it to Terraform, it becomes a bit more cumbersome and a bit more laborious, because you just have to write a lot more boilerplate. Is that ever a problem for you guys, that you've been slowed down by this need to translate your infrastructure stack?


Michael Garski: 08:03  

No, it hasn't, because usually when the engineer starts developing, they know what resources they need. So while they're setting things up and making sure their code is working properly, the ops team is getting all the resources defined in Terraform for them. And for API Gateway, we use OpenAPI definitions, so it's easy enough for us to manage and maintain, and even turn those into documentation.


Yan Cui: 08:28  

Okay. So with the Serverless Framework, which I assume is what your API teams are using, there's no built-in support for the OpenAPI spec, at least in terms of defining the API itself; there is support for generating documentation afterwards, once you've deployed something. So how do you guys use the OpenAPI spec in this case?


Michael Garski: 08:48  

So in this case, the OpenAPI spec is used along with the Terraform deployment to keep API Gateway up to date. And then we take those same definitions and run them through the ReDoc CLI to generate documentation.


Yan Cui: 09:03  

Right, right. Okay, gotcha. So maybe bringing it back slightly, how did you guys decide to go serverless in the first place? What were some of the main motivations you had?


Michael Garski: 09:16  

So, we started using serverless very early at Fender Digital. Fender Digital has only been around for about four and a half years or so. Before we even launched our first products, we had to set up some initial services; one was a product service: given an item's product SKU, return metadata about that product. All of that product data is managed by our IT team in a completely separate repository, and they were able to deliver us an export of it by just dropping it into an S3 bucket for us. And we thought, well, why not just use the S3 Lambda trigger to process that data and load it up into DynamoDB? That worked out really well for us. Then we expanded on it: when we launched Fender Tune, there's a feature that allows a user to save custom guitar tunings, so if they want to tune two steps down, they can save that. For that we just used API Gateway, DynamoDB, and mapping templates, so there wasn't any Lambda function whatsoever. We had good luck with that, and we experimented with other small services. And when we started building the infrastructure for Fender Play, we just decided to go all in on serverless. Our early experiments were very successful, and we were fortunate enough to be given the freedom to use it and experiment with it.


Yan Cui: 10:45  

Okay. And when you say you were quite successful with those serverless projects early on, what would you say are some of the big benefits that you guys got from serverless as a business and as a development team? 


Michael Garski: 11:00  

Ah, well, as a development team, I would say we do move a little bit faster, but it certainly keeps the engineering team much more engaged, because it's a newer technology, it's a new approach, it's something interesting to figure out; so just that whole personnel side of it has been fantastic for us. On the business side, the biggest advantage has been cost reduction: we estimated that, compared to doing it with EC2 servers, we've saved about 90%. And given that we were launching a new product and probably weren't going to have really high traffic out of the gate, it seemed like a really good way to keep costs down.


Yan Cui: 11:41  

Did you say 98%?

Michael Garski: 11:44  

90.


Yan Cui: 11:46  

Okay, but that's still really, really significant, which I guess makes sense, especially, like you said, if you don't have a lot of traffic coming through; you'd have EC2 servers sitting there idle, and you still have to pay for Multi-AZ deployment just so that you have some redundancy in place. That's actually quite similar to a few of the other companies that have now gone serverless. At one of my previous companies, we also went serverless with a social network; we had lots of spikes, so we had lots of spare capacity sitting around just in case a spike came, and when we moved to serverless we saved about 80 to 90% of our costs as well. One of the things I want to bring back a little bit is what you were talking about earlier in terms of the split of responsibilities. One of the things that I really enjoy about working with a serverless stack is that as a feature engineer, I can have a lot of autonomy over the infrastructure, what resources are provisioned and things like that. The flipside is that I can also be made more responsible for monitoring and running applications in production. It sounds like some of that runtime monitoring and troubleshooting is handled by the ops team and not the API team. Do you ever find that you have some of the problems that come with having people looking after applications that they didn't build themselves, so troubleshooting is not as easy as it should be?


Michael Garski: 13:14  

That is true, we do encounter that, because our ops team does respond to things. However, one distinct difference with a serverless infrastructure is that when there is an application problem, it's most likely not code related; it's generally that a third-party service is having a problem, or a CRM provider's APIs are returning errors or timing out. It's usually something we can't really do anything about, other than notify people.


Yan Cui: 13:53  

That's actually very true. You don't have those kinds of infrastructure problems that you somehow have to deal with and mitigate, because you just get all of that resiliency out of the box with serverless. But you touch on an interesting thing here: mostly it's a problem with some third-party system, or even other AWS services that you're using. How do you go about your monitoring, and your observability setup, so that you can quickly identify which service is the problem, and therefore decide whether you need to do anything?


Michael Garski: 14:29 

A lot of it depends on which service is having the issue. For example, if our CRM provider is having a few issues, we know we're going to see more errors in our email service, which takes care of all the communication with them; that's where we'd see a higher error rate. And for some APIs and services we actually have some simple monitoring in place to let us know: we ping a simple API call, or we look at the status page of a given service, and if it's not all green, or it says there's an error, we'll alert people via a Slack channel.


Yan Cui: 15:05  

So in this case, are you tracking the status code that is returned by this, I don't know, CRM service, or other services that you're depending on? From your Lambda function, maybe you're talking to DynamoDB and then to, I don't know, Algolia or something like that. So to quickly identify that the elevated error rate you're seeing from your own API is because of problems happening in Algolia, do you have some monitoring in place for all these different API calls that you're making from the Lambda functions?


Michael Garski: 15:38  

We don't monitor every single third-party API call. A lot of the time it's just the status pages, where they'll say whether there's an error or anything like that. We'll keep an eye on that to alert us when it changes.


Yan Cui: 15:49  

Okay. And then, I guess, when something does happen, what do your engineers do? Do they just go into the logs and see where the problem was? Do they use tools like X-Ray, or tracing tools like that, to figure out where the invocation got to? And then, if it made the DynamoDB call but then errored, that means it must have errored when you were trying to talk to Algolia or some other service.


Michael Garski: 16:14  

Yes, we aggregate all of our Lambda logs from CloudWatch into Honeycomb, and Honeycomb is one of my all-time favourite observability tools. It allows us to quickly and easily identify the errors, find any problems and determine what the root cause is, so we can take action from there.


Yan Cui: 16:32  

Oh, right, yeah, Honeycomb is great. I love that whole paradigm where everything's an event, and you don't pre-aggregate into metrics; you can just do ad-hoc slicing and dicing. But one of the things I find with Honeycomb, certainly with serverless, is that it does require you to do a bit more work yourself. I've done a lot of work with a client of mine where we turned a lot of those events, and API Gateway logs, into traces so that we can see them inside Honeycomb as well, but a lot of that instrumentation becomes custom work that you have to do yourself. Is that something that you guys have invested a bit of engineering time into, to make the experience quite smooth and quite easy?


Michael Garski: 17:15  

Yes, we have an internal framework we created that acts as middleware in our functions. It's very similar to, I believe the framework for Node is called middy, which allows you to have Lambda middleware. The event comes in from AWS into our Lambda function and goes through this middleware first, before executing the handler. The middleware does things such as setting up a request-scoped logger and putting it in the Go context. That way we've got all of the necessary fields for good structured logging: all the HTTP headers, the referrer, the HTTP method and route; anything within that incoming event is added as individual log fields. The individual applications don't really have to worry about logging; they can just pull the logger out of the context and log an entry if they need to. And even if they don't log anything, when the function completes its invocation, the middleware will log the final status.


Yan Cui: 18:19  

That's great. I'm actually one of the core team members behind middy, and I really love that middleware approach. It does take out a lot of the boilerplate and the cross-cutting concerns that people have with their Lambda functions. Is that something that you guys are going to open source, potentially? Because I imagine other people may be interested as well.


Michael Garski: 18:36  

Ah, yes, absolutely. It is something we want to do. There's some documentation cleanup we need to do, and there are a few minor inconsistencies, where the same thing may be handled with a different approach, that we want to make sure get cleaned up. And then there's getting sign-off from legal on the appropriate licence to use. But we do intend to open source it.


Yan Cui: 18:57  

That's cool, that's wonderful. Well, let me know when you do, so that I can share it with other people and potentially add it to the show notes for this episode when it comes out. So in that case, I also want to talk a little bit about your journey towards serverless. What were some of the biggest challenges that you faced along the way?


Michael Garski: 19:16  

Oh, the biggest challenge, since it's a new paradigm for application development, was tooling and workflows. We ran into some challenges with Go, due to the fact that for each Lambda function you're building a native static binary; if you've got 20 Lambda functions and you're building them serially, it can take a long time. So what we actually did is we created a build process that uses Lambda to build all of the functions in parallel. Each function is built by a build function that's triggered by CircleCI; it builds all the functions in parallel, and then we pull it all together. That brought our build times down: we had some builds that were exceeding 30 minutes, and we don't have a build time that exceeds three minutes anymore.
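Fender's actual fan-out runs each build inside a Lambda invoked from CircleCI; as a rough local illustration of the same fan-out-and-join idea, here is a hedged Go sketch using goroutines instead. The function names are made up, and a real build step would shell out to `go build` rather than the no-op stand-in shown here.

```go
package main

import (
	"fmt"
	"sync"
)

// buildAll starts every build at once and waits for all of them to finish,
// collecting per-function results; the build callback stands in for
// compiling one Lambda's static binary (or invoking a builder Lambda).
func buildAll(functions []string, build func(name string) error) map[string]error {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]error, len(functions))
	)
	for _, name := range functions {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			err := build(name)
			mu.Lock() // the results map is shared across goroutines
			results[name] = err
			mu.Unlock()
		}(name)
	}
	wg.Wait()
	return results
}

func main() {
	fns := []string{"subscription-create", "subscription-update", "lesson-get"}
	results := buildAll(fns, func(name string) error {
		return nil // a real build step would run `go build` here
	})
	fmt.Println(len(results), "functions built")
}
```

The wall-clock win is the same in both versions: total time drops from the sum of the build times to roughly the slowest single build.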


Yan Cui: 20:13  

Okay. And if you were to do the whole migration to serverless all over again, is there anything that you would have done differently this time?


Michael Garski: 20:21  

Well, one thing that we've thought of is that having a completely separate binary for each individual Lambda function isn't really necessary. With Go, unfortunately, you can't make use of layers, because you're building a static binary. However, for a given service, let's say our subscription service, we could have one single binary that's deployed to multiple functions, where each of those functions has its own permissions so that it's not able to do any more than it's supposed to; if it only needs to read from DynamoDB, or doesn't need to write, then that function shouldn't have write permissions. Then within the Lambda function we'd look at the AWS environment variable to see which function it is: oh, I am the subscription-create function, so this is the handler I should use. That would have solved our build issue. That would be one thing I'd possibly do differently. Another might be to explore a hybrid of Terraform and CloudFormation deployments, where Terraform is used for a lot of the overall infrastructure, and CloudFormation and the Serverless Framework are used within the individual application stack, referring to those shared resources. I've read a lot about that and we've talked about it internally; however, as we're currently fairly well entrenched and have everything running very efficiently in Terraform, it doesn't make a lot of sense for us to go through and change all that up.
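The single-binary idea can be sketched like this. The Lambda runtime sets the `AWS_LAMBDA_FUNCTION_NAME` environment variable automatically; the handler names and return values below are hypothetical, purely for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// handlers maps a function-name suffix to the handler baked into the single
// shared binary (names here are made up, not Fender's actual functions).
var handlers = map[string]func() string{
	"subscription-create": func() string { return "created" },
	"subscription-cancel": func() string { return "cancelled" },
}

// pickHandler inspects the deployed function's name so one binary can serve
// many functions, while each function keeps its own narrow IAM permissions.
func pickHandler(functionName string) (func() string, error) {
	for suffix, h := range handlers {
		if strings.HasSuffix(functionName, suffix) {
			return h, nil
		}
	}
	return nil, fmt.Errorf("no handler for function %q", functionName)
}

func main() {
	// AWS_LAMBDA_FUNCTION_NAME is set by the Lambda runtime itself.
	h, err := pickHandler(os.Getenv("AWS_LAMBDA_FUNCTION_NAME"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(h())
}
```

You build and upload one artifact, point every function in the service at it, and the dispatch happens at cold start; the IAM policies, not the code, keep each function's blast radius small.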


Yan Cui: 21:50  

Yeah, I thought it was a bit strange that you have this translation step, but I figured it might be because you're so well entrenched in having everything in Terraform, and being able to use Terraform to deploy everything, that you're sticking with it for those legacy reasons. What I see most people do is use Terraform for the shared stack, so VPCs and landing-zone resources are configured with Terraform, but the actual application itself, the individual APIs and so on, is done using frameworks like the Serverless Framework, where you can get a lot of productivity without that translation layer. In this case, looking at AWS as a whole, are there any other things you wish AWS would do better? What would be your top three AWS wishlist items?


Michael Garski: 22:51  

I would say one would be some functionality that, from what I've heard, exists within Azure, where you can create a dead letter queue for a function right away: you don't have to define a separate resource, you can define it along with the function. Second, we migrated to Cognito for authentication; we had our own internal system initially and ended up migrating to Cognito so we wouldn't have to manage it anymore. I'd like some more observability into Cognito: there's rate limiting within Cognito, which makes sense, but it's hard to see how close you are getting to those rate limits. And then Sign in with Apple: there's really no native Sign in with Apple support, it kind of has to go through a web view. Also, a generic OAuth2 connector would be really nice; we're dealing with some potential third parties that don't support OpenID Connect but do support generic OAuth, so that would be something nice to have. Maybe even a documentation hub for APIs that could take that OpenAPI definition, so we don't need our own process with ReDoc; it would just be a built-in feature of API Gateway.


Yan Cui: 24:05  

Yeah, I think some of those they are actively working on. I know the HTTP API, the new kind of API you can have in API Gateway, has built-in support for OAuth authentication. I don't know if they're going to bring that back to REST APIs in API Gateway, but it definitely sounds like something that is really useful. I'm quite curious about the first thing you mentioned, having a DLQ that is built-in, without needing a separate resource. What sort of use case are you thinking about for that?


Michael Garski: 24:42  

It's just more about saving time. Rather than having to define a separate resource and set permissions, that this can write to that, it'd be nice just to set a boolean flag, and then it creates the dead letter queue and automatically wires it up for you.


Yan Cui: 24:58  

Okay, gotcha. But with the DLQ in Lambda, it only works for async invocations, so it doesn't work for things like API Gateway invocations, which are synchronous. So do you have a use case where you want to use some kind of DLQ for API Gateway or other synchronous invocations as well?


Michael Garski: 25:19  

No, for synchronous invocations that wouldn't be necessary. It's more for asynchronous invocations: say, when we're updating an external party because a user's subscription status changes, we want to keep our CRM system up to date with the status of that user's subscription, and be able to retry connecting to that third party. Give it three retries, and from there it goes into a dead letter queue, and we can process it later if they're having some sort of outage.


Yan Cui: 25:49  

Okay, gotcha. So I think that's everything that I wanted to cover. Is there anything else that you'd like to tell the listeners about? Maybe personal projects you want to share, or initiatives that Fender is running right now?


Michael Garski: 26:05  

I would say the first thing is the scalability of Lambda. We recently ran a promotion, due to COVID-19, to allow anyone to get a free three-month subscription, no credit card required, and that more than tripled our monthly active users. Apart from being proactive and making sure our concurrent invocation limit was raised, we had absolutely no issue with our serverless infrastructure; it just handled the load without a problem. The only real change we needed to make was around Elasticsearch, which we still use for some functionality: we just had to add a few nodes to the cluster and change some of the replication factors on the indices. But other than that, it was more than triple the traffic, and no problem handling it whatsoever.


Yan Cui: 26:58  

Yeah, that's one of the beauties of serverless, right? It just auto-scales, no problem; all you have to do is keep an eye on your limits. And that thing you mentioned about Elasticsearch is also quite a common thing people have to do, because that's one of the few places where you still have to worry about servers and do some kind of capacity planning. One of the things that quite a few people on this podcast have asked for is some kind of serverless Elasticsearch, and I think so far Algolia is probably the closest thing I've seen to a serverless ELK stack. I do think Algolia is a really nice service, and I hope one day AWS will offer some kind of serverless Elasticsearch that just takes away all these overheads of having to manage, and scale up or scale out, the number of nodes for Elasticsearch.


Michael Garski: 27:48  

It's interesting you mention Algolia, because that's where we migrated all of our text search to, and a lot of our browsing through lessons and filtering; we've migrated all of that into Algolia. It's a lovely service.


Yan Cui: 28:01  

Oh yeah, I'm a big fan as well. I've used it in quite a few projects now. It's really nice and easy to use, and there's no server setup: just open up an account, get a token, and that's it. And you can do really fine-grained access control as well with different API keys, which I thought was great. You can also generate them programmatically, so you can have user API keys tied to a particular session, all of that, which is all really good stuff. And I think that's everything that I have in mind. Thank you so much for joining me today and for sharing your stories with our listeners.


Michael Garski: 28:37  

Oh, you're welcome. Thank you for having me on today.


Yan Cui: 28:40  

Take care and stay safe.


Michael Garski: 28:42  

Will do, you as well.


Yan Cui: 28:57  

That's it for another episode of Real-World Serverless. To access the show notes and the transcript, please go to realworldserverless.com, and I'll see you guys next time.