Real World Serverless with theburningmonk

A podcast where we talk about real-world use of Serverless technologies from engineers who work with them day-to-day. We will discuss use cases, why they chose serverless and the pain points and challenges they face. If you want to know what it's REALLY like to work with serverless, this is the show for you.

#29: SAM 1.0 with Alex Wood

September 16, 2020 • Yan Cui • Season 1 • Episode 29

You can find Alex on Twitter as @alexwwood, and he blogs at alexwood.codes.

We discussed the Amazon Builders' Library (https://aws.amazon.com/builders-library) and the SAM repos below:

SAM GitHub: https://github.com/awslabs/serverless-application-model
SAM CLI GitHub: https://github.com/awslabs/aws-sam-cli

For more stories about real-world use of serverless technologies, please follow us on Twitter as @RealWorldSls and subscribe to this podcast.

This episode is sponsored by ChaosSearch.

Have you heard about ChaosSearch? It’s the fully managed log analytics platform that uses your Amazon S3 storage as the data store! Companies like Armor, HubSpot, Alert Logic and many more are already using ChaosSearch as a critical part of their infrastructure and processing terabytes of log data every day. Because ChaosSearch uses your Amazon S3 storage, there’s no moving data around, no data retention limits and you can save up to 80% vs other methods of log analysis. So if you’re sick and tired of your ELK Stack falling over, or having your data retention squeezed by increasing costs, then visit ChaosSearch.io today and join the log analysis revolution!

Opening theme song:
Cheery Monday by Kevin MacLeod
Link: https://incompetech.filmmusic.io/song/3495-cheery-monday
License: http://creativecommons.org/licenses/by/4.0

Yan Cui: 00:13

Hi, welcome back to another episode of Real World Serverless, a podcast where I speak with real world practitioners and get their stories from the trenches. Today I'm honoured to be joined by Alex Wood from AWS. Hey Alex, welcome to the show.

Alex Wood: 00:27

Thanks for having me on.

Yan Cui: 00:30

So, you've been with Amazon for just over 10 years now. And you just told me before the show that you should be getting your red badge instead of the normal orange badge. How does it feel?

Alex Wood: 00:41

It was definitely a little bit surreal. I joined Amazon out of college about 10 years and 24 days ago, according to our internal tracking. I joined directly out of college on the retail inventory planning side of things, so my running joke is that if you go to Amazon and something is out of stock, there's some chance you could blame me for that in a small way. After that, I spent a little over five years on the AWS SDK for Ruby team, and late in my time with the Ruby SDK I got the opportunity to develop the Ruby runtime for AWS Lambda. After that experience I realised this is really what I want to be working in. I'm really excited about the space, about what we could do for serverless tooling, and about making the most seamless experiences possible, so I moved over to working on open source tools for Lambda and the entire serverless space full time.

Yan Cui: 01:59

So your day to day focus now, I guess, is with the SAM framework, and you guys have recently gone live with 1.0. So what's changed with SAM 1.0 compared to before?

Alex Wood: 02:15

Sure. So over the last few months we've been releasing a number of new features, and we took a step back to ask: what are the big changes that could be breaking to the user experience in some way that we would want to make someday? We got into some interesting ideas. For "sam init", for example, which creates a new project in the SAM CLI, we wanted to go with a default interactive experience. That is a small breaking change, because if you were expecting "sam init" with no parameters to fail and suddenly it's an interactive prompt, that could be a problem in some cases. So we did that ahead of 1.0, and we added new features like guided deployments and enhancements to a number of small details around building. A lot of the overall purpose of moving to 1.0 is to give confidence in the operational stability of SAM and the SAM CLI. We want to tell the story that if you use SAM and the SAM CLI for your critical production CI/CD pipelines and your production services, you can rely on these scripts to work the same way going forward. Generally, the move to GA is a way to signify that sort of operational promise, so we started working through our backlog of behaviour changes that we wanted to make, and once we got to the end of that it was time to go GA.

Yan Cui: 04:07

And with SAM and the CDK both positioned as official first-party tools from AWS, what would be your advice on when a customer should use SAM versus CDK? Do you guys have some clear distinctions in mind in terms of when you should use SAM and when you should be using CDK?

Alex Wood: 04:27

So I think a big part of it comes down to personal preference. I've talked to customers who are very excited about defining their infrastructure in programming languages like TypeScript or Python, and CDK can be a good choice for that. I've even talked to a number of customers who've used CDK to prototype a template but who might prefer to use YAML, and there are definitely a number of customers who prefer to write their CloudFormation directly, one to one, in CloudFormation's YAML or JSON formats. But for SAM versus CDK, I think where SAM can really shine is when you're building a serverless application in particular. We have a lot of operationally stable, proven constructs around making the experience of building Lambda functions, integrating them with API Gateway, and integrating them with event sources as seamless and simple as possible. It's also true that it's not a one-way door. If you get to the point where you feel like you're integrating a lot of other services and you like the CDK constructs for those services, there's always the ability to integrate CDK into your infrastructure as well. So some of it is personal preference, and some of it is that if you're really serverless focused, SAM can be an excellent choice.

Yan Cui: 06:03

So do you have some examples of customers doing that? By that I mean integrating CDK with SAM, because they both offer an abstraction layer on top of CloudFormation, but I haven't seen anyone using them both. It's always either one or the other, based on, like you said, personal preference.

Alex Wood: 06:22

Yeah I have seen cases where people use the CDK with the SAM CLI. So the SAM CLI does have support for example for doing local invokes of Lambda functions that are defined within the CDK. I haven't heard of too many cases of people using SAM and CDK together. I think if people are interested in that that's something I would love to hear about.

Yan Cui: 06:52

Okay. So switching back to just focusing on SAM, are there any other interesting use cases that you've heard about from your customers?

Alex Wood: 07:01

Yeah, sure. So when you start thinking about SAM and serverless, there are a few classic use cases. The most common, and I think this is because it's where a lot of the examples start, is creating some sort of web service or web API. People are using API Gateway, they're using Lambda, and they're often using DynamoDB behind that. There's obviously the other common use case, which sometimes adds to that: I have asynchronous jobs, so I'm using SQS, I'm using SNS. One interesting thing that I've seen is the use of patterns that have come out of SAM and asynchronous workflows, like Event Fork Pipelines. Say you have some sort of event, which could be a purchase of something or a state change, and you publish that event through the Simple Notification Service. One interesting pattern is that you could send that notification to multiple SQS queues or multiple Lambdas to interpret it in different ways: event replays, archiving previous events using things like Kinesis Firehose and S3, or ways to track an event that has been a failure of some kind. You can record it in multiple different places, so the idea of forking events to have independent workflows on them is another very interesting use case that I've seen work really well with SAM.
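
The fan-out idea described here can be sketched without any AWS services at all. The following is a minimal in-memory illustration, not the Event Fork Pipelines implementation: the `Topic` class stands in for an SNS topic, and the three subscriber functions (`archive_event`, `enqueue_for_replay`, `track_failure`) are hypothetical names standing in for independent pipelines such as archiving, replay queues, and failure tracking.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]
Subscriber = Callable[[Event], None]


class Topic:
    """In-memory stand-in for a pub/sub topic (illustrative only, not SNS)."""

    def __init__(self) -> None:
        self._subscribers: List[Subscriber] = []

    def subscribe(self, subscriber: Subscriber) -> None:
        self._subscribers.append(subscriber)

    def publish(self, event: Event) -> None:
        # Each subscriber receives its own shallow copy, so one pipeline
        # cannot change the event that another pipeline sees.
        for subscriber in self._subscribers:
            subscriber(dict(event))


# Hypothetical sinks, one per forked pipeline.
archive: List[Event] = []       # stands in for long-term archiving
replay_queue: List[Event] = []  # stands in for a queue used for replays
failures: List[Event] = []      # stands in for failure tracking


def archive_event(event: Event) -> None:
    archive.append(event)


def enqueue_for_replay(event: Event) -> None:
    replay_queue.append(event)


def track_failure(event: Event) -> None:
    # Only this pipeline cares about failed events.
    if event.get("status") == "failed":
        failures.append(event)


topic = Topic()
topic.subscribe(archive_event)
topic.subscribe(enqueue_for_replay)
topic.subscribe(track_failure)

topic.publish({"type": "purchase", "order_id": "o-1", "status": "ok"})
topic.publish({"type": "purchase", "order_id": "o-2", "status": "failed"})
```

The publisher only knows about the topic, so a new pipeline can be added by subscribing another function without touching the code that emits events.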

Yan Cui: 08:45

And besides signalling this operational promise and the stability of SAM going forward, is there any sort of major vision that you guys have for SAM in terms of features and capabilities?

Alex Wood: 09:10

Yeah, for sure. So the one thing that I also like to point out is that 1.0 is not the completion of all the improvements that we want to make to the user experience. If you look at the long arc of serverless tooling, this is a space that was brand new as of a few years ago, and we're really trying to make everything as seamless as possible and really listen to where customers are struggling, or where they feel like they've come up with an interesting solution that really should be integrated and shared. So, you know, we're not stopping major feature development after 1.0. Among the things in the short-term pipeline are a number of improvements to the build experience that people have been asking for in SAM. In the short term there are a lot of feature requests that didn't necessarily need to happen before we could promise operational stability, but that are still very important and that we want to go out and resolve. And then I think the long arc direction after that is to try to answer the question of how we can make serverless development as seamless as possible, and this is what really got me excited about working in the serverless space. I have come to believe personally that serverless is probably the most approachable way to make really operationally sound, high-scaling, highly available, solidly fault-tolerant applications. So it can kind of be like a superpower for you as a developer to be able to use all these serverless concepts when you're trying to solve a particular technical problem. What I'm really excited about is how we make that as easy to learn as possible, so that you can get all of the best practices in place from day one, and really reduce the cycle of what you have to learn to achieve the operational and performance promise that serverless offers.

Yan Cui: 11:28

Yeah, I can definitely attest to the fact that serverless is a bit of a superpower. I've spent ten, fifteen years now working as a back end engineer, ten of those working with AWS, and I spent so much time tweaking network configurations, setting up servers, right-sizing machines, setting up AMIs and all of that. And all of that is just to write, like, two lines of code that do something very simple. Now with Lambda, that's all I have to write: just two lines of code, put them into a Lambda function, and I'm done. And I've got all the auto scaling, all the logging, monitoring, all of that out of the box. I used to spend two weeks just setting up the load balancers and all that. Nowadays, in two weeks I can write a new back end for a social network with AppSync, Lambda and DynamoDB. It's amazing how much one person can get done compared to how much other stuff I had to do before. It's just crazy.

Alex Wood: 12:27

Yeah, absolutely.

Yan Cui: 12:28

So in terms of, I guess, going back to that same question about vision for SAM going forward. I still use the serverless framework as my personal go to choice when it comes to building serverless applications, mainly because the serverless framework has got this huge plugin ecosystem so anytime I want to do something that the framework doesn't support, I can find a plugin that does that for me. And that's one of the things I find has been missing in SAM, for me. Is that something that you guys are looking at potentially addressing in the future to open up SAM so that other people can extend these functionalities?

Alex Wood: 13:05

Yeah, I think the notion of providing useful ways for people to extend SAM's functionality, which I think becomes an extension of more easily integrating complex applications, is something that we're really trying to look at as the long arc of how we make it as seamless as possible to contribute to SAM, as well as to build increasingly complex applications on top of it. One thing that we're very interested to talk with customers about, and I'm also very approachable on places like Twitter about this type of thing, is what types of extensions to SAM's behaviour people would like to see and would like to contribute. When we work on designing that sort of long-term sustainable model for adding outside functionality, through some sort of plugin model or by making it as easy as possible to contribute complex pieces of behaviour, the key question we would like to learn more about is what types of use cases people are trying to solve with this, so that we make sure we design the right approach.

Yan Cui: 14:26

Okay. And you mentioned that before you joined the SAM team you were working on the Ruby runtime. Can you also explain to us what goes into building an official language runtime for Lambda? The whole custom runtime support makes it seem very trivial, but I know that with the Node runtime, for example, you guys have to do a lot of work to make sure that, security-wise, it's all locked down and very secure. So there are a lot of things you just can't do in Lambda to introspect the actual service itself. What are some of the things that you guys have to design for and be really careful about when you're building official runtime support?

Alex Wood: 15:07

Sure. So going back to my experience, writing the Ruby runtime was very interesting and I enjoyed it quite a bit. It is not dissimilar to writing a custom runtime. A lot of the core functionality is the same, and the execution environment is fairly common across the different runtimes that we provide. So a lot of the decisions that we have to make for an official runtime are around the reliability and predictable behaviour side of things. In writing the Ruby runtime, I spent a lot of time on what all the possible surprise exceptions are that could jump out of the stack, and how we capture them and make sure that the errors a user gets when they're trying to debug why their Lambda function is crashing are helpful. So, you know, for something that would normally crash Ruby immediately: can I catch that and give them a backtrace so that they can figure out what they did that may have caused their function to crash? Another interesting thing that I noticed is that the decisions around which libraries you include become very important. When you're writing a runtime that you want to be used broadly, you want to limit the dependencies you bring in so that you have a deep understanding of the ones you have, and so that you have the highest confidence that you're not going to create dependency conflicts with what the user is going to bring in. Even going back to my time on the Ruby SDK, an interesting note is that for a very long time the Ruby SDK had no required dependencies outside of the standard library, other than libraries that we owned, which in turn only depended on the standard library.
This allows us to avoid the case where, you know, we need to do an emergency upgrade of a dependency because there is a critical bug that needs to be fixed. If you don't have many dependencies, you're not going to have that situation come up, or the situation where a dependency doesn't play well with one that is commonly used by other users and they're just unable to use the SDK or runtime. So a lot of it is just thinking about how you support the broadest number of customers possible. If you're writing a custom runtime for your own purposes, you don't necessarily have to do all of that, because sometimes you just need a runtime that works for your use case, and that's very valid. But I found a lot of my effort was just thinking very creatively about what can go wrong during the runtime execution and how I can make sure that we're not going to have obscure dependency conflicts or errors that create a frustrating user experience. So it's very interesting.
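
To make the error-capturing idea concrete, here is a small sketch in Python rather than Ruby. It is not how the Lambda Ruby runtime is implemented; `invoke_safely` and `broken_handler` are hypothetical names, the field names in the error report are chosen for illustration, and the sketch uses only the standard library, in the spirit of the dependency discipline described above.

```python
import traceback
from typing import Any, Callable, Dict


def invoke_safely(handler: Callable[[Any], Any], event: Any) -> Dict[str, Any]:
    """Call a user handler and turn any exception into a structured report.

    A real runtime would also have to deal with failures that bypass
    ordinary exception handling; this sketch only covers Exception.
    """
    try:
        return {"ok": True, "result": handler(event)}
    except Exception as exc:
        return {
            "ok": False,
            "errorType": type(exc).__name__,
            "errorMessage": str(exc),
            # Keep the backtrace so the user can see where their code failed.
            "stackTrace": traceback.format_exc().splitlines(),
        }


def broken_handler(event: Dict[str, Any]) -> Any:
    # Deliberately fails: the key does not exist in the event.
    return event["missing_key"]


report = invoke_safely(broken_handler, {})
```

The caller gets the exception type, message, and a backtrace that names `broken_handler`, instead of a process that simply dies.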

Yan Cui: 19:14

So as you work on the runtime itself, is there room for you guys to do additional optimization around the cold start time? I know the cold start for Ruby is actually very similar to Python and Node.js. But is there anything that you guys are thinking about, or maybe actively doing, to try to improve the cold start performance of the Ruby runtime?

Alex Wood: 19:35

Sure. So yeah, this was one thing that I was very pleased with as I started to look into the performance of Ruby in general: as far as cold starts go, Ruby actually performs quite well. If you look at Ruby itself, the Ruby executable spins up in solidly less than 100 milliseconds. I think Bundler is probably the majority of Ruby's cold start time, and that's been seeing performance improvements as well. But there is an interesting and very common misconception about Ruby, where some will say, "Oh, Ruby's slow, and that's just how it is." That has actually become not at all true, starting from Ruby 2.5 on through Ruby 2.7 and as it approaches the anticipated Ruby 3 launch, which is probably coming in the next couple of years. There are some rumours that Ruby 3 is going to drop on Christmas this year. We will see; it will be exciting if it happens. But there's a big goal called Ruby 3x3, where from Ruby 2.0 to Ruby 3.0 they wanted to see a 3x improvement in performance, and they're actually quite close to achieving that. I've had a lot of interesting conversations with people working on the Ruby team, and they're very mindful of the cold start performance of Ruby itself in general, and that's had a lot of benefits. But in the context of writing a runtime and really trying to optimise cold starts, a lot of that is just how much work your executable is doing before it takes on customer code, and really trying to minimise that and being mindful of not doing expensive operations that you don't need to be doing.
This is actually much the same as how you minimise cold starts in an application you write. There are a lot of simple best practices, like doing your initialisation outside of your handler, so that you're not repeating work and reducing the performance of your functions by doing work inside the handler that can happen outside of it. Or being mindful of doing a whole bunch of state loading when you're launching an application: can you minimise the amount of state that your application needs to hydrate to function? So I think a lot of the same concepts apply. At the end of the day it's just an executable that you're spinning up inside the execution environment.
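
The "initialise outside of your handler" advice applies across runtimes. A minimal sketch in Python, assuming the usual Lambda-style `handler(event, context)` signature: `load_config` and `INIT_COUNT` are hypothetical names used only to show that module-level work runs once per execution environment rather than once per invocation.

```python
from typing import Any, Dict, Optional

INIT_COUNT = 0  # counts how often the expensive setup runs


def load_config() -> Dict[str, str]:
    """Stand-in for expensive setup such as reading config or creating clients."""
    global INIT_COUNT
    INIT_COUNT += 1
    return {"table": "example-table"}


# Module-level code runs once, when the execution environment loads the module.
CONFIG = load_config()


def handler(event: Dict[str, Any], context: Optional[Any] = None) -> Dict[str, Any]:
    # The handler only reads the already-initialised state on each invocation.
    return {"table": CONFIG["table"], "echo": event}


first = handler({"id": 1})
second = handler({"id": 2})
```

Had `load_config()` been called inside `handler`, the setup cost would be paid on every invocation instead of once.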

Yan Cui: 22:20

So that's quite good advice that you just mentioned, doing initialization outside of handlers so that you don't repeat it and also minimising state loading during initialization. Are there any other tips that you can give to customers who are looking to improve cold start performance for the Ruby runtime, anything that's maybe specific to Ruby? I ask because with Node, when you use a bundler, it improves the cold start a lot because it removes all the runtime file IO you make when you require dependencies. Is there something similar for Ruby?

Alex Wood: 23:05

So I think the biggest thing I've seen for Ruby is power tuning. It's a very common trope, but interestingly, the language is fairly good about being conservative with its CPU and memory usage. I've sometimes seen minimum memory settings actually give you pretty decent performance with the Ruby runtime, which is interesting. But power tuning... for those unfamiliar, there are some tools, which I think Alex Casalboni released, that help you run the same function at different memory and CPU settings to find out what's going to give you the optimum performance and the optimum price efficiency for running your Lambda functions, and that's definitely very valuable. Within Ruby, I think another element is thinking about when autoloading is appropriate, autoloading being lazy loading of different dependencies and files. That can give you performance benefits over the lifetime of an execution environment, especially if you have many dependencies that are quite large. Another interesting thing is that Ruby is actually starting to do a lot of this heavy lifting in recent versions. So I am very interested to see more benchmarks coming in for Ruby 2.7, the language, where they've really started to work on just-in-time compilation optimizations, which really shine when you repeat similar code paths over and over. That is almost the classic definition of how a lot of Lambda functions work: you're repeating a very similar code flow over and over. And so I think that Ruby 2.7 and later versions are actually very well suited to the Lambda compute model.
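
The trade-off that power tuning explores comes down to simple arithmetic: compute cost scales with memory size multiplied by duration, so a larger setting can be cheaper if it shortens the run enough. The sketch below is not the power tuning tool itself. The measurements are hypothetical numbers invented for illustration, the price per GB-second is an assumed placeholder that should be checked against current pricing, and per-request charges are ignored.

```python
from typing import Dict

# Assumed placeholder rate, not an authoritative price.
PRICE_PER_GB_SECOND = 0.0000166667


def cost_per_invocation(memory_mb: int, duration_ms: float,
                        price_per_gb_second: float = PRICE_PER_GB_SECOND) -> float:
    """Compute cost of one invocation: GB of memory x seconds x unit price."""
    return (memory_mb / 1024) * (duration_ms / 1000) * price_per_gb_second


def cheapest_setting(measurements: Dict[int, float]) -> int:
    """Return the memory setting (MB) with the lowest cost per invocation."""
    return min(measurements, key=lambda mb: cost_per_invocation(mb, measurements[mb]))


def fastest_setting(measurements: Dict[int, float]) -> int:
    """Return the memory setting (MB) with the shortest average duration."""
    return min(measurements, key=lambda mb: measurements[mb])


# Hypothetical average durations in milliseconds per memory setting.
measurements = {128: 900.0, 512: 180.0, 1024: 100.0}

cheapest = cheapest_setting(measurements)
fastest = fastest_setting(measurements)
```

With these made-up numbers the cheapest and the fastest settings differ, which is the kind of choice a power tuning run is meant to surface.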

Yan Cui: 24:48

Okay, yeah, I'm a big fan of power tuning, and I'll include a link to it in the show notes for anyone who has not seen it before. But as far as cold starts are concerned, Lambda runs the initialization with full CPU, so I've actually not found much difference in cold start performance from using more memory, except, strangely, for .NET Core. That seems to be the only runtime where, if you give it more memory, it cold starts faster. I haven't quite figured out why that's the case, but for the other runtimes, when I’ve done benchmarks against them before, the memory doesn't seem to affect the actual cold start time very much, because initialization always runs at full CPU. Any idea why that's the case with .NET Core, why that's the one that's different?

Alex Wood: 25:38

I don't have a lot of personal familiarity with .NET Core, but I think the reason that power tuning remains valuable for real-world application cold starts is that a lot of what you're experiencing in the cold start, the first run of your Lambda function, is also all of your initialization code. If you are loading up a large number of libraries or a large number of configuration files, you're essentially doing compute operations as a part of getting your function ready to take traffic. So if you have a lot of code doing that, power tuning can still be very valuable, because you are speeding up the code that you're running before you take traffic. And of course, if you are very sensitive to the impact of cold starts in your application, if you have very high performance requirements at, say, the 99th percentile and upward, options like provisioned concurrency work really well. You can worry less about how much work you're going to do hydrating your function, configuration and state before you can take traffic, and leave that to happen beforehand, at scale, not as the traffic comes in.

Yan Cui: 27:03

Okay. And since you mentioned that you had so much fun building the Ruby runtime. So what led you to switch gears and move to the SAM team instead?

Alex Wood: 27:12

Yeah, for sure. I think it goes back to the thing that has me really excited about serverless. If I were to break up the arc of my career: working in Amazon retail on inventory planning and management algorithms, it's a very high-scale system. There's a lot of traffic that we serve doing that, but what I found very interesting, especially as a very junior engineer at that point, is that there is a lot of standing on the shoulders of giants. There are a lot of very talented engineers at Amazon who have built a whole bunch of very robust systems that you can build on top of. As I transitioned into working on the Ruby SDK and open source, the question that really started to become a passion of mine was: how do I, from scratch, from a blank slate, make applications that are extremely robust, that are highly performant, that scale really well, and that deal with all the types of fault tolerance problems that can happen? That leads you on a journey: I'm learning how load balancers work, I'm learning how auto scaling works, what happens if a bunch of your servers go down, or if you deploy a bug and you need to adjust to that and try to avoid downtime. And personally, I found a lot of the process of making these extremely robust production systems difficult. There's a lot that you have to learn, and the learning curve for me was very steep. It's something that I pressed on with for many years, starting to get the hang of it, starting to think about how I could build tools that make this more reachable and accessible. And then I had the lightbulb, eureka moment when I was working on the Lambda runtime for Ruby: this really just gets rid of whole swaths of problems that I'd been having to deal with before.
The scaling is handled on your behalf. And fault tolerance: an interesting problem with designing load balancers and servers, for example, is how you decide when to take a server out of rotation behind a load balancer. You start to see 500 errors coming back from it, but do you take it out of service right away? Do you wait? If you over-optimise that, you could take all your servers out of service because of a short transient bug, and now you're completely down. Or you wait too long and 10% of your customers are hitting bad servers. There's a lot of work and a lot of creativity that goes into designing that system as well. And when you look at an API Gateway and Lambda model, if something crashes a Lambda execution environment, it'll just be taken down and the next request will get a new one. With that type of model, the number of problems you have to solve to get to a performant, high-scale, fault-tolerant system is not zero. But the space of problems is so much smaller that I feel like a single developer can really start to reason about all the different things you need to understand to make these kinds of production-quality systems. Once I realised, "Oh, I can really do this," that was a really exciting moment for me. I can write a demo application and it can take thousands of concurrent users from the moment I deploy it, which is just incredible. If you had told me 10 years ago that this was going to be possible, it would have sounded like magic to me. I wouldn't have been able to reason about it.

Yan Cui: 31:23

And that's definitely why I love Lambda so much: I can hand all the complex problems of building a highly resilient, highly scalable system to you guys. But what is it like to work as an engineer at AWS and be on the receiving end of all that complexity, building these complex, highly resilient systems and also being on call for them? When things go wrong, you guys are on the hook to fix them very, very quickly, credit where credit is due. What is it like to be on the hook for all these different customers and to deal with all this complexity?

Alex Wood: 32:01

Yeah, it's incredibly interesting, and it almost goes back to the question of why I joined. It was realising that serverless had all this potential, and then thinking there was a bit of a learning curve for me to get the hang of it. What I became very excited about is making the tooling that makes this as easy as possible, and I'm really excited about the potential of some of the things that we can do to take the current state and make it easier and easier. And being on the other side, seeing what goes into making this, I feel like I'm learning all the time. The thing that I have said to my managers, and I hope I'm not stepping out of line by saying this, is that if I stop learning, then I think I'm going to quit and do something else. I'm still around, so I'm still learning every year and, you know, sometimes every day. And I'm also just surrounded by people who have incredible brains. The amount of creativity and intelligence that I see in the engineers I work with is super inspiring and pushes me to be a better engineer.
So I think that's the thing about working at AWS: there's a lot of learning that goes into it. It's very exciting to be able to have that broad impact of making so many developers' lives easier, to learn how we accomplish that, and then to take those best practices and build them into the products that we give people. Any time we solve a new problem, or I see a creative solution to a problem about how we get high performance, how we deal with scaling, or how we deal with interesting failure cases, the question becomes: how do we make that accessible to all of our customers, and how do we externalise the lessons that we've learned? It then gets built into our services and into our tools. There are interesting things like the Amazon Builders' Library, which I think we announced at the last re:Invent, where a lot of our principal engineers share deep lessons that they've learned about building reliable software. So that's the thing that I get excited about: learning all these interesting ways to solve problems and then sharing them.

Yan Cui: 34:35

I bet that's one of the very few places where you get to work on really challenging, really interesting problems like you guys do at AWS. So I guess the final question is: what colour would your badge be if you stayed there for another 10 years?

Alex Wood: 34:51

So I believe that my badge colour would become grey. I do have a teammate who has been at Amazon for over 20 years, and yeah, it leads to a lot of very interesting conversations. Like any company that's been around that long, it's very interesting to see how teams change, how cultures change, and the kind of lessons you learn from being at a company from its early days all the way through to the maturity and scale that we've reached. So yeah, if I'm around another 10 years, it'll be interesting to see what software development even looks like at that point.

Yan Cui: 35:39

Maybe at that point I'll just tell Alexa what I want and she'll just build it for me.

Alex Wood: 35:45

It's going to be interesting. I'm really excited to see the types of things that are possible for making the building of these high-scale systems more and more approachable. It'd be incredible if we got to a point in a few years where you don't even have to think about it. You're just writing your code, and you feel like scale and fault tolerance and all these things are solved problems. It's a difficult dream to reach, but the possibilities are really exciting.

Yan Cui: 36:18

So thank you so much, Alex, for joining us today and sharing so much of your experience.

Alex Wood: 36:24

Yeah, thank you so much for having me. It's great.

Yan Cui: 36:26

So, before we go, how can people find you on the Internet?

Alex Wood: 36:30

Yeah, so you can find me on Twitter @alexwwood, so two w's in the middle, the joys of a common name, and you can also find me on the web at https://alexwood.codes/, where I occasionally blog about things I learn or find interesting. I'm working on getting more consistent about that, but I'm definitely very active on Twitter, and I enjoy hearing from people about the things they're building, the successes people are having, or even things that we could do better and things that you'd want to see us build. It really does have a real impact on what we build to hear feedback from customers about what they'd like to see, what they find frustrating, or even what they find exciting that they'd like us to double down on. So I really do enjoy hearing from people about that kind of thing.

Yan Cui: 37:26

Excellent, and I'll put those in the show notes. And like Alex said, if you've got any interesting things you want to ask Alex, please get in touch with him on social media. Once again, thank you so much, Alex. Hope everything is safe where you are.

Alex Wood: 37:40

Thank you so much.

Yan Cui: 37:41

Take it easy. Bye bye.

Alex Wood: 37:43

Bye.

Yan Cui: 37:56

So that's it for another episode of Real World Serverless. To access the show notes, please go to realworldserverless.com. If you want to learn how to build production-ready serverless applications, please check out my upcoming courses at productionreadyserverless.com. And I'll see you guys next time.