Real World Serverless with theburningmonk

#52: Talking DynamoDB with Alex DeBrie

May 05, 2021 Yan Cui Season 1 Episode 52

You can find Alex on Twitter as @alexbdebrie.

Links:

For more stories about real-world use of serverless technologies, please follow us on Twitter as @RealWorldSls and subscribe to this podcast.

To learn how to build production-ready serverless applications, check out my upcoming workshops.


Opening theme song:
Cheery Monday by Kevin MacLeod
Link: https://incompetech.filmmusic.io/song/3495-cheery-monday
License: http://creativecommons.org/licenses/by/4.0

Yan Cui: 00:12  

Hi, welcome back to another episode of Real World Serverless, a podcast where I speak with real world practitioners and get their stories from the trenches. Today, I'm joined by Alex DeBrie, one of the very first AWS Data Heroes. Hey man, how have you been?


Alex DeBrie: 00:27  

Hey Yan, I'm good. Thanks for having me, excited to be here.


Yan Cui: 00:29  

So we've known each other for quite a while now, I guess, way back from when you were working at Serverless Inc. So you've recently left Stedi and you've gone independent again. I guess once you go independent, it's just hard to go back to full-time work, right?


Alex DeBrie: 00:45  

It's true. It's true. I love the flexibility of being independent. Yeah, it was hard to pass up.


Yan Cui: 00:50  

So maybe, I guess, for the audience who may not be familiar with your work, maybe talk about yourself and what you've been up to?


Alex DeBrie: 00:57  

Yeah, absolutely. So yeah, you say we go way back. But I knew you before you knew me. I remember listening to you on Software Engineering Daily, and all the cool things you did with Yubl and all that. But yeah, my story starts, you know, actually kind of a weird background. I was actually a corporate lawyer initially, and I learned to code in law school, and I was a corporate lawyer for about a year and then went and jumped into the tech world. But about, let's see, four or five years ago now, I joined Serverless Inc, they're the creators of the Serverless Framework. And that's when I really got into the whole serverless thing. I'd sort of played around with Lambda a little bit before that, but that's when I really went deep on that, got into the serverless community, did a few different things there, partly on the growth team, you know, doing blog posts and sort of developer outreach type stuff, and then also on the product team later on. But yeah, I spent two years at Serverless, really loved it, engaging with the community, and folks like you and Jeremy Daly, and just all the other people that are building in the community. So did that for a couple years. As part of that, you know, when I was writing blog posts and things like that about serverless, I saw all these people using DynamoDB, and I was trying to use it and not using it very well. And I saw Rick Houlihan talk at re:Invent about all the kind of mind-blowing stuff that he's doing with Dynamo. So let's see, I think it was 2017, around Christmas time that year, I worked on DynamoDB Guide, which was just a basic, trying-to-be-user-friendly breakdown of how to use Dynamo and some of the principles that Rick's talking about that are very different than relational design, right?
You are using some of these single table principles. So I did DynamoDB Guide and, I don't know, it accidentally sort of propelled me into this spot where I still wasn't very good at Dynamo, didn't know that much. But because I had that out there, I had more people reaching out to me and became sort of a focal point. And I just spent more and more time learning how to do this stuff, which was a lot of fun, and got to do some talks with AWS, things like that. And then, at some point, you know, I really enjoyed what I was doing at Serverless, but wanted to do more with Dynamo. So I took some time off to write The DynamoDB Book, which launched just over a year ago now, having its anniversary last week. So the book has been going well, and I've been independent for most of the time since then as well, just doing DynamoDB training, consulting, AWS stuff, different things like that.


Yan Cui: 03:17  

Yeah, DynamoDB is easily one of the most powerful services in all of AWS. And also, I guess, going back a few years, it was also one of the most, I guess, confusing services, because people just don't know how to use it, they don't understand all the constraints and why they are there, and how to actually do the basic things they are used to doing with relational databases. I think the stuff you've been doing with the DynamoDB Guide and The DynamoDB Book has really been helpful in terms of giving people some reference material so they can go and learn how DynamoDB works, and it also explains things in fewer words than the 4000 pages of official documentation you can go through. I think that's one of the things that is severely lacking for other services like Cognito, which is another prime candidate for a book like this. And so yeah, anyone who wants to learn DynamoDB, and all the different things you could do with it, go check out Alex's book at dynamodbbook.com. I guess, let's go back to the single table designs in a minute. I want to also maybe get your take on the state of the serverless world we are in right now. Because you were working for Serverless Inc, and now, like you've said, you've seen a lot of different companies using serverless. Are there any sort of common patterns and trends that you're seeing in terms of the mistakes people are making, but also success stories where serverless is successfully adopted within a company?


Alex DeBrie: 04:50  

Yeah, that's interesting. I work less closely with sort of pure serverless shops than I used to back at Serverless. But, yeah, it's tough for me to say. I think it's all over the place. You know, I think teams that really take the time to learn serverless and learn how it's different and understand, you know, it's not going to just be the same things they were doing, but now in a Lambda function, right? And really understanding what the differences are there, they're going to do best. But you really have to get on board as a team, understand the different constraints, how to work around those constraints, and especially around observability and monitoring. I know you've written a lot and done a lot around that stuff. I think that's where a lot of people are still struggling. My big bugaboo, I still wish the deployment ecosystem weren't so fragmented. I think it's a little disjointed. And I think it's sort of hampered things in some ways, where you're kind of googling around and you're looking for examples, and maybe it's for, you know, this deployment framework, or that one, or different versions or old things. And I think people just get frustrated because of all that, and not understanding how it's working together. So, you know, it's good to have these different approaches, because then you're also seeing it evolve more quickly, you know, they compete against each other a little bit, but it's also fragmented things a little bit. So, you know, it's six of one, half a dozen of the other, but that's what I'd say. In terms of other stuff, yeah, I guess I don't know. What do you see out there, Yan?


Yan Cui: 06:21  

Yeah, I also see a lot of different tooling, and observability is one of the big questions people ask. In terms of tooling, I do still see mostly the Serverless Framework out there, a fair bit of SAM and more and more CDK nowadays. Occasionally I see shops handwriting CloudFormation and Terraform. So I guess that's where I tend to push them towards a framework instead of handwriting all the infrastructure, because especially with API Gateway it is not fun writing 500 lines of API Gateway code just so that you can have a Hello World endpoint running, especially when the same thing just takes two lines of code in the Serverless Framework or SAM. I think the fragmentation definitely is something that adds a lot of confusion, especially when you consider that Lambda itself is quite a paradigm shift. And I think the point you made about, oh, someone really needs to invest in learning the tool to really get the most out of it, I think that really applies to Lambda and DynamoDB as well. You know, the whole shift from relational database to NoSQL, or, I guess, in this particular case, document data store, is something that you do have to understand, and no two NoSQL databases are exactly alike, different databases come with different constraints and strengths. And once you learn DynamoDB, you can do a lot of amazing things with it. But at first sight it might seem, this is so restrictive, why would anyone do this? Right? But once you learn it, it's actually a really powerful tool, there are a lot of things you could do with it that may not seem obvious at first. I think the same applies to Lambda as well. I guess what's interesting is, I find that there's a lot of adoption from the small to medium enterprises. And that's where you see a lot of companies going serverless first or fully serverless as a tech strategy or direction, fully buying into the mindset and really seeing serverless as the future.
I'm seeing a lot more adoptions in the enterprise world. But that's pretty typical in that particular sector, you see a lot of, like, one team in a really large company going serverless. And then it takes a long time for that to start to propagate within the company. Because lift and shift is just so much easier, at least from a high-level strategy point of view: we've got hundreds of engineers, I can't possibly retrain everyone, training is going to be super expensive, and all of that. But it's much easier for us to just go from our own data centres into the cloud, the whole digital transformation thing. Just lift and shift our way to containers and run stuff in AWS instead of our own data centres. That is a big step forward. But hopefully, more of them will see that you're still doing a lot of operational work, maybe even more so than before, because before you could dedicate that to more specialised, I guess, teams in your company. But now, every developer has to learn about containers, and the schedulers and all of that. So hopefully, they will see more of the benefit that you get from serverless in terms of just the speed of development, and the fact that you don't need as many people to develop the same application, or at least they do it a lot quicker. And hopefully you see that become more of a competitive advantage as budgets are getting cut everywhere. Once COVID is finished eventually, all the money that has been printed will have to be paid back, and hopefully I think you see more thought being put towards efficiency, both in terms of development time, but also in terms of resource use as well.


Alex DeBrie: 10:02  

Yep, absolutely. I totally agree with that. So yeah, I'm hopeful, you know, I think we're still gonna see a lot of adoption. And you see, you know, this industry is growing sort of leaps and bounds, whether it is containers and Kubernetes, or serverless. You know, there's just a lot more coming on. But I'm hopeful the serverless revolution will keep going. Because, yeah, that's what I enjoy working with, and all that. So what about in terms of, like, stuff you see missing from the serverless community, or not community, but maybe features and products and things like that, any big wishlist items for you?


Yan Cui: 10:29  

I think the biggest one by far is the serverless Elasticsearch. That's like the one thing that everyone on this podcast probably asked about, and every time I have to go back and use Elasticsearch, it just pains me to no end. Algolia is probably the closest thing we've got to serverless Elasticsearch, but it's still not, you know, I guess it's not quite as serverless as you'd like in terms of the pay-per-use. They've got this whole unit-based system, but you're still kind of paying for peak throughput, as opposed to just how much you're using. And Elasticsearch itself is just such a pain in the ass to deal with.


Alex DeBrie: 11:05  

It's true. What's your prediction, I guess, on whether we'll get a serverless Elasticsearch in the next couple of years? I'll be honest, I'm kind of bearish on it. I think it's just a really hard problem, especially to do sort of in a serverless way, but I don't know. Are you hopeful for this? What are your short-term expectations?


Yan Cui: 11:22  

Yeah, I'm hopeful for this because that probably by far is the biggest gaping hole in the whole serverless ecosystem we have. We've got some really good databases that are serverless, you've got DynamoDB, you've got... what's the other one, Fauna, yeah, FaunaDB. You've got Cosmos DB in Azure. You've got quite a lot of different options that are really good out there. And you've got APIs, you've got API Gateway, you've got AppSync, all these different things. You've got queues, well, too many of them actually. But the whole search domain is probably the one big missing gap we have, and it's really out there for someone to grab and offer something really good. Yeah, that's the reason I'm hoping that in the next couple of years, someone's going to realise that there's a big opportunity there to offer something, as much as I realise that it is a really tough challenge. But Algolia is a pretty good offering already. So this is, I imagine, technically doable.


Alex DeBrie: 12:29  

Yep, yep. And even, you know, even if they could come out with something that maybe won't do 100% of what you can do with Elasticsearch, if they could, sort of like with Dynamo, right, if they give you the sort of 80/20, right, where they constrict you and say, hey, here are the things that we can offer pretty good guarantees around in this particular domain, you can do that and just easily enable some, you know, text search on certain fields in your application, I think that'd be a huge win.


Yan Cui: 12:54  

With Algolia, it's kind of there already. I mean, minus a few things, Algolia can do pretty much everything Elasticsearch can do, just a lot simpler. Also, I was really taken aback recently when I realised that even doing a basic exact text match in Elasticsearch wasn't quite straightforward. I couldn't use term filters on the text field, I had to use a term.keyword or something like that. It's just, like, weird complexities like that, okay, why is it there? All I want to do is an exact text match. Why is this difficult? Stuff like that, plus all the management you've got to do on the indices, where it infers the wrong type depending on the first bit of data you got for that field, and things like that, it just drives me nuts. I don't know how much Elasticsearch you've done yourself?
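
For readers who hit the same wall, here is a minimal sketch of the kind of exact-match query Yan is describing, written as plain Python dictionaries so it does not depend on any client library. It assumes the index uses Elasticsearch's default dynamic mapping, which indexes a string both as an analysed text field and as a `keyword` sub-field. The field name `title` and the helper name are hypothetical, not from the episode.

```python
def exact_match_query(field, value):
    """Build a term query against the un-analysed `.keyword` sub-field.

    Assumption: the index relies on default dynamic mapping, so a string
    field named `title` also has a `title.keyword` sub-field.
    """
    return {"query": {"term": {f"{field}.keyword": {"value": value}}}}


# Hypothetical field and value, purely for illustration.
body = exact_match_query("title", "Real World Serverless")
```

A term query against the analysed text field itself usually misses, because the stored tokens have been lowercased and split, which is why the query targets the `keyword` sub-field instead.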


Alex DeBrie: 13:45  

I've done enough to be scared of it, you know, and I think that's the point. And, you know, the nice thing about Elastic is, it's really easy to get up and going and sort of play around with it and see, but then I think a lot of the complexity and scaling challenges are hidden from you at first. So like you're saying, the indices management stuff, but even just the templates you're using, and making sure you're indexing the right things in the right way, and doing things that aren't going to blow up later on. And it's hard to get a good sense of that unless you're doing a real full-blown test where you load it with a bunch of data and run some real queries against it to sort of know how it's gonna operate. And then also, at what point it's going to fall over, because at some point it's just gonna fall over rather than sort of start rejecting a few requests. It's gonna, you know, totally fall over and you have to recover it somehow.


Yan Cui: 14:29  

Yeah, I remember talking to Ben Kehoe quite a while back. He was on this podcast. And he said the one thing where they have to do some manual intervention now and then is Elasticsearch, because it seems to fall over on its own from time to time. Everything else just runs, and if there's a problem it just gets replaced automatically, with Lambda, API Gateway and whatnot. Elasticsearch is the only one where they have to manually kick the box and get it back up every now and then.


Alex DeBrie: 14:58  

Yep, exactly. So yeah, that's definitely not fun, especially when you've gotten into this world where you're not used to doing those kinds of operations anymore. You're used to things sort of healing themselves, or being clear about how they're going to fail. And then all of a sudden, you know, you have a mystery, and you have to poke around and figure out what's going on.


Yan Cui: 15:15  

Yeah, yes, exactly. I haven't had to do that sort of thing for a long time now. Elasticsearch kind of brings me back. So I guess, going back to DynamoDB and single table design. And I guess, for the audience who haven't really heard about this concept, can you maybe just give us a quick, brief introduction, what is single table design? What sort of things would you do that's different to how people would probably use DynamoDB, having one table per entity, that kind of thing?


Alex DeBrie: 15:44  

Yep, absolutely. So, you know, DynamoDB is a NoSQL database, which can mean a lot of things. I think, sort of the biggest flavour of NoSQL, if you look at Dynamo, Mongo, Cassandra, they are all pretty similar in certain ways. And the big thing is, they want to be able to horizontally scale your data effortlessly, basically to any size. And the way they do that is they use a partition key or a sort key or something like that to determine which different partition your data is going to go to, right. And that allows you to add more partitions as you go and scale up pretty well. Now, because of that partitioning system, I think there's a couple things there. Number one, most of these databases don't allow joins. Mongo will allow joins, although they're not gonna be super performant. And then they're also going to sort of limit the kinds of queries you can run, different things like that. And so that can be difficult for a lot of people coming from this relational world where they're used to joins, they're used to normalising all their data. And I think what a lot of people do up front is they try and replicate that by putting joins in their application, right? So they might go fetch the parent record, and they might go fetch a bunch of related child records based on that parent record, or some pattern like that. And basically have a different DynamoDB table for each table they would have in their relational database. So single table design is basically saying, hey, think about your access patterns. And if you have those patterns where you need to be getting different types of entities, you know, like you would in a join operation, what you can do is sort of pre-join your data and co-locate them near each other. So they have the same partition key. It ends up looking kind of funky, especially if you're used to what a relational table looks like, you know, which looks like a nice Excel spreadsheet or a couple of them.
This is a little more disordered in some senses, especially on the partition key and sort key stuff. But what you do get is some more efficient reads there. So I think that's like the extreme end of single table design, where you're, you know, pre-joining and co-locating different types of entities using the same partition key, so you can do that efficiently. But even on the less extreme end, you know, if you're not going to be getting those different types of entities in a single request, I think what a lot of people do is put all their entities into a single table, right? Instead of having a users table, an orders table and an items table, they'd all be in a single table, using different partition keys to just sort of segment them as you need them. So that's the high-level gist of single table design.
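
To make the "pre-join and co-locate" idea concrete, here is a minimal in-memory sketch in Python. It is not DynamoDB and not code from the episode: the dictionary stands in for a table, and the key scheme (`CUSTOMER#<id>` as partition key, `#ORDER#<date>` sort keys for orders) is a hypothetical illustration of one common convention. The point it shows is that items sharing a partition key form one sorted item collection, so a single read in descending sort-key order can return the customer followed by the most recent orders.

```python
from collections import defaultdict

# Stand-in for a table: partition key -> {sort key -> item}.
table = defaultdict(dict)


def put_item(pk, sk, attrs):
    """Store an item under its partition key (PK) and sort key (SK)."""
    table[pk][sk] = {"PK": pk, "SK": sk, **attrs}


def query(pk, scan_forward=True, limit=None):
    """Return the items of one partition ordered by sort key, like a key-ordered query."""
    ordered = sorted(table[pk], reverse=not scan_forward)
    items = [table[pk][sk] for sk in ordered]
    return items if limit is None else items[:limit]


# Hypothetical key scheme: "#" sorts before letters, so order items sort
# below the customer item and a descending read starts at the customer.
put_item("CUSTOMER#alice", "CUSTOMER#alice", {"type": "customer", "name": "Alice"})
put_item("CUSTOMER#alice", "#ORDER#2021-01-05", {"type": "order", "total": 30})
put_item("CUSTOMER#alice", "#ORDER#2021-03-10", {"type": "order", "total": 12})
put_item("CUSTOMER#alice", "#ORDER#2021-04-20", {"type": "order", "total": 99})

# One read: the customer plus the two most recent orders.
result = query("CUSTOMER#alice", scan_forward=False, limit=3)
sort_keys = [item["SK"] for item in result]
```

In this sketch `sort_keys` comes back as the customer item first, then the April order, then the March order, which is the "pre-joined" shape described above: two entity types, one partition, one request.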


Yan Cui: 18:13  

So I guess in this case, what would you say are the main benefits of doing colocation of your data? Maybe some examples will be quite useful to help people to visualise what that looks like.


Alex DeBrie: 18:25  

Yep, absolutely. So I think there are sort of two main benefits to single table design. And the thing is, a lot of them don't apply to a lot of people. So let me go through those two benefits, and then we'll sort of talk about, you know, how they may not apply. So number one, like I was saying, you can co-locate and pre-join your data. So if you had a use case that said, hey, give me a customer and all of a customer's orders or something like that, or when you're fetching all the orders for a customer, there's something you want about that customer as well, what you could do is give them the same partition key, and sort of order them in a way where maybe that customer item is at the top of your item collection, and then the most recent orders are going to follow right after that. And it's a very efficient operation to go to DynamoDB, run a query and say, hey, give me a customer and that customer's most recent 10 orders. You're gonna have basically the same performance, you know, when you have a gig of data as when you have a terabyte or more of data. So that's what's really nice about that, you're not going to slow down based on the number of concurrent queries going on, or how much data you have, like you might in a relational database that's hard to horizontally scale. So that's going to be benefit one from single table. I think the second benefit from single table is more operations related, where, when you're provisioning these different DynamoDB tables, you know, you need to have different provisioned throughput for them. So you need to have read capacity units and write capacity units for your DynamoDB table. The more tables you have, the more of those you need to tune right and make sure you're not getting throttled
because you don't have enough capacity, or just overpaying because you have way too much capacity. So if you do that in one table, it's a little easier to manage that and understand what your usage is going to look like over the day. Now, let's look at when that doesn't apply, right? Especially that operations thing, I think that's gone away for the most part, especially now you have on-demand pricing with DynamoDB, where instead of having to sort of provision your throughput upfront, you can just pay per request. It's gonna be a little more expensive if you can get good utilisation out of your provisioned capacity, but a lot of people can't. So I think on-demand ends up saving a lot of people. So you know, even if you have 10 different tables, like you're talking about there, if you throw them on on-demand, you don't have an operational burden like you would have had, you know, maybe a couple years ago. The second one, you know, about fetching multiple heterogeneous item types in a single request, not everyone uses that pattern. You can also sort of structure your table to where you can do that request in parallel, you know, get the customer and get the customer's orders in parallel. You're not gonna have the performance hit, you'll have a tiny cost hit in terms of having to read more items, but it's not going to be a huge one to where I'd really worry about that in most cases. So I think you're not going to get the benefit of that as well. So in that sense, I think a lot of the main reasons for, like, hey, you should absolutely do single table, aren't gonna apply to everyone. That being said, I still think the biggest problem for people coming to DynamoDB is they want to model it like a relational database. And what they do is they move exactly what they would have done in relational into Dynamo.
And they have, you know, maybe auto-incrementing primary keys, they're doing joins in their application code, things like that. They're doing lots of sequential requests rather than parallel requests. And I think that's where you get into trouble. So one thing I do like about single table is, it forces you down this road of starting to have to think about this and realising that modelling in Dynamo is different than modelling in a relational database. It gets you to think about, you know, your primary key or partition key, your sort key, making those meaningful. It gets you to think about making parallel requests rather than sequential requests, different things like that. And then it turns out, once you model it, it's pretty much the same whether you're doing it in a single table using DynamoDB principles or a couple different tables with DynamoDB principles. And I think you can be, you know, sort of use case specific. There are times where I might recommend people split off an entity or two into a different table because of specific needs, maybe around DynamoDB streams, or maybe around scans, you know, if they want to get all of one entity type in a single scan, maybe you put that in a different table, things like that. So I think the context is important. And I'm not totally against separate tables. But I would say, in most circumstances, if you're modelling correctly, it's going to be pretty similar in a single table as in multiple tables. I guess I don't know. But what do you see out there in the wild? Because I know you have some good thoughts on this as well.
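
The "parallel rather than sequential" point can be sketched as follows. This is an illustrative Python example under stated assumptions: `get_customer` and `get_orders` are hypothetical stand-ins for two independent reads (for example against two separate tables), with a short sleep in place of network latency. Because both reads need only the customer id, which is known up front, neither has to wait for the other.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def get_customer(customer_id):
    """Hypothetical stand-in for a read against a customers table."""
    time.sleep(0.05)  # simulated network latency
    return {"id": customer_id, "name": "Alice"}


def get_orders(customer_id):
    """Hypothetical stand-in for a read against an orders table."""
    time.sleep(0.05)  # simulated network latency
    return [{"order_id": "o-1"}, {"order_id": "o-2"}]


def fetch_customer_page(customer_id):
    """Issue both reads at once instead of one after the other."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        customer_future = pool.submit(get_customer, customer_id)
        orders_future = pool.submit(get_orders, customer_id)
        return {
            "customer": customer_future.result(),
            "orders": orders_future.result(),
        }


page = fetch_customer_page("alice")
```

The design choice that makes this possible is in the keys rather than the code: the orders can be looked up from the customer id directly, so the application never has to read the customer first just to discover what to fetch next.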


Yan Cui: 22:55  

I very much agree with the fact that you should learn those modelling techniques. I am probably of the mindset that you should learn them, but you probably shouldn't be using them every single time. Certainly I'm against applying it as a default. Because again, I think it's a tool that's useful in some cases. And I do use it moderately, in cases where I do have those access patterns where we tend to, like you said, access a user and then all the orders for that user in one go. When I've got those use cases I do use single table design techniques. But I guess I don't follow it everywhere. Maybe single table is just a bad name. Because even people that use single table design don't actually have just one table, typically it's one table per microservice or some bounded scope, as opposed to just one table for your entire system, which, amongst other things, is going to force you into a really bad place where different microservices end up sharing the same database and things like that. So yeah, I'm definitely of that mindset where I use as many tables as I need to, and then I will consolidate entities into the same table, using the single table design modelling techniques, where I see benefits of doing that. Because even though sometimes you may access different entities in one query, the vast majority of the time, I find that most of my entities just don't intersect. They don't need to be in the same table because they never get returned in the same API operation. And also, I guess, there's a couple of things that single table design makes really difficult, I guess in some cases impossible. For example, you can't monitor the cost for individual entities and access patterns easily when these are all coming from the same table. And the DynamoDB streams become almost unusable when you've got a single table for lots of different entities.
You end up doing all kinds of filtering in your own code, just so that you get the updates related to one entity, because you're going to get everything in one stream. There's a couple of things like that, and also just in general, I find that a lot of people are struggling with single table designs, especially when it comes to AppSync. The whole doing application joins is painful when you are going to do it yourself. When AppSync does it for you, it's actually quite easy. But when I hear a lot of people struggle with AppSync and DynamoDB, it often goes back to, oh, we are using single table designs, and we don't really understand it. And I think part of that is education. Part of that is some top-down decision making where some architect somewhere said, oh, we are doing single table designs, that's it. No discussion, no training, and then as a developer you're just sort of left to pick up the pieces and try to make do with what you can, but you haven't really been provided the opportunity to actually learn how to do it properly. And you end up writing a lot more custom VTL code, which is not exactly fun to write either. So I think a lot of the time when I find that people are struggling with AppSync and VTL, it's because they're using single table designs where they're not ready to do that yet. So yeah, I think it's definitely something that you should be learning, but having the skills and knowing when to apply them is just as important.
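
On the streams point, the in-code filtering looks roughly like the sketch below: a Lambda-style handler that receives every change from a shared table and has to discard everything that is not the entity it cares about. The record shape (`Records`, `eventName`, `dynamodb.Keys` with typed attribute values) follows the DynamoDB Streams event format; the attribute name `SK` and the `ORDER#` prefix are assumptions made for illustration, not something stated in the episode.

```python
ORDER_PREFIX = "ORDER#"  # hypothetical sort-key prefix marking order items


def handler(event, context=None):
    """Keep only stream records for order items; skip every other entity type."""
    handled = []
    for record in event.get("Records", []):
        keys = record["dynamodb"]["Keys"]
        sort_key = keys["SK"]["S"]  # assumes a string sort key named SK
        if not sort_key.startswith(ORDER_PREFIX):
            continue  # a customer, an item, or some other entity in the same table
        handled.append((record["eventName"], sort_key))
    return handled


# A hand-written sample event with two different entity types in one stream.
sample_event = {
    "Records": [
        {"eventName": "INSERT",
         "dynamodb": {"Keys": {"PK": {"S": "CUSTOMER#alice"}, "SK": {"S": "ORDER#1001"}}}},
        {"eventName": "MODIFY",
         "dynamodb": {"Keys": {"PK": {"S": "CUSTOMER#alice"}, "SK": {"S": "CUSTOMER#alice"}}}},
    ]
}
processed = handler(sample_event)
```

With one table per entity, the prefix check disappears because the stream only ever carries that entity's changes, which is the trade-off being described.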


Alex DeBrie: 26:10  

Yep. I totally agree with that. And a couple of good points there. Number one, I don't think this should be sort of imposed from on high. I think there has to be team buy-in, you know, on whether you want to adopt DynamoDB at all, because there's going to be a learning curve there. And then, you know, if you do adopt Dynamo, what sort of patterns are you going to use in it, because I think, you know, if it becomes one person's pet project, they go off on the side, implement this whole thing, and then it gets dropped on the rest of the team, and no one else knows how to use it, it's just gonna be a mess. That person is gonna be overwhelmed, they're gonna leave, and then everyone's really mad. So I totally agree with that stuff. Yeah, the thing I'd say is, especially with AppSync, you know, I think you're right, AppSync makes those application joins sort of easy for you. The other thing is just, the way GraphQL works, those joins are sort of going to happen, right? It's sort of hard to get away from making multiple sequential requests in GraphQL. You know, there are some ways to do it, with lookaheads and things like that. But you know, some people say don't do that at all, some people do. But the thing is, it's tricky. With the way GraphQL works, I would say I lean even less towards single table design and more towards meaningful separate tables, just because of the way it's going to work. I've had arguments with people on that, some people like to do single table with AppSync. I'm less strict on there, I think it adds some complications. But regardless, you know, I think you have to have buy-in, and that's a big part, and make sure you know what you're doing and understand the trade-offs. The other thing I would say
with table per entity is, I think sometimes you could lose some easy optimizations that you'd get if you thought out your access patterns beforehand, you know, some denormalisation type stuff. When you start to have table per entity, sometimes you go too strongly in it, and you normalise like you would in a relational database, and you miss some easy wins, even if it's not sort of the hardcore single table stuff. So that's another downside. I mean, I'm still going to advocate for single table most of the time, and at least knowing those modelling techniques pretty well. But again, I'm not going to be a zealot on it, because I don't want to force it on anyone, because then they're gonna be mad at me if they don't like it. You know, you've got to make that decision for yourself and for your team.


Yan Cui: 28:35  

I think it's too late for that. You and Rick Houlihan are the faces of single table design. So whenever I get mad about it, it's always you I'm thinking about.


Alex DeBrie: 28:43  

True. Yeah, it's true. 


Yan Cui: 28:44  

I guess one thing that always helps me with single table designs is to visualise the data you're going to have, because that's quite hard to do when you look at a table you have in production. You're gonna see all kinds of things in there, and the schema of the table becomes quite opaque. So are there any tools that you use to help you visualise and understand the data and what structure they have in your table?


Alex DeBrie: 29:10  

Yep, so when I'm modelling, I usually use NoSQL Workbench. That's the tool from AWS to do some of that. You can set up your tables and set up those partition keys and sort keys, your whole primary key and your secondary indexes, to understand how that's going to look, make sure they're ordered in the proper way and you haven't made some sort of mistake. So I do that during the data modelling phase to get those items in my table. I also just use a spreadsheet, usually Google Sheets or something like that, where I'll say, hey, here are the different entities I have. Here are the primary key patterns I'm going to use. Here are the global secondary index patterns I'm going to use. Just to get that in a sheet somewhere so people know what that looks like, rather than having to dig through the code and find these wacky strings that you've constructed there. And then finally, the last thing I like to do is have an access patterns chart. This can be a little different with GraphQL, but if you have a RESTful API or something more like backends for frontends, I at least like to think about, hey, what are the access patterns I'm going to have? And how am I going to solve those? Then it gives you a sense of understanding, like, okay, these are the problems we need to solve. This is why we set up that index that way, it's solving this particular problem here, and these are the requests I need to make to do that. So I do all that. In terms of looking at my data once it's in there, I use Dynobase, which is a pretty good tool for looking through your table and understanding that. Although it does get tricky: if you really do have a denormalized DynamoDB table, it can be tricky to understand even with Dynobase.


Yan Cui: 30:48  

Yeah, it's great that you mentioned Dynobase, because I had Rafal on here a little while back, and he talked about Dynobase as well. So yeah, it is a great tool for actually seeing the data you have in your tables. And NoSQL Workbench is great for modelling, on the other hand. So I actually have a question about that, because you talked about access patterns. One of the things I find with single table design is that it works if you know the access patterns ahead of time. If you're migrating an existing application to DynamoDB, for example, then you know those access patterns quite well. But if you build something new, then you don't really know your access patterns, or at least they haven't crystallised yet. They may be quite malleable, they can change quite drastically in the future. Another thing I have run into with single table design is that when those access patterns change, it sometimes becomes quite difficult to modify your table, especially when you've got to worry about the fact that CloudFormation can only update one global secondary index at a time. That means if you need to make changes to accommodate changes in your access patterns, you've got to be quite careful and strategic about how you update your indices and how you deploy them. You can't rename an index, because that's two operations. You can't add two indices at the same time, and things like that. Have you run into those kinds of problems? And do you know any workarounds?


Alex DeBrie: 32:16  

Yeah, I'd say that last point you made is kind of a frustrating quirk of CloudFormation, where you can't make these multiple operations. Unfortunately, it can be an unknown unknown for people: they don't know that CloudFormation has that limitation until they try and go do it. But basically, what you have to do is, if you need to add that new secondary index, you add that first, let that operation go through, and make sure the backfill is all fine, which could take an hour or could take a day, depending on how big that table is. And then once that's done, start using that index if it's backfilled, or do whatever you need to do. So that's tricky on the deployment aspect. But in terms of adding these new access patterns, I agree that it's more complicated than what a lot of people are used to, if you're talking about a relational database and adding a new index there, or adding a new column with some default value. But I also think you get used to it, right? You figure out how to do it, and you say, okay, I have this new access pattern, it's not gonna work with my primary key or any of my existing indexes. So now I know I have to run this job that basically scans my table, looks for this type of item or that type of item, and adds these new properties, GSI1PK and GSI1SK. How are we going to do that? You're decorating those existing items with those new properties, and then you have that new index on those and you can start accessing them. And I think that ETL or ELT process, whatever you want to call it, gets pretty rote once you've done it once or twice, right? You can do it in a step function that splits it out into parallel scans, or you can just run it in a loop until it gets over the whole table.
But it's a pretty simple operation: you write the scan to find the items you want, you update those items, and then you go to the next batch until it's done. So it's scary. But it also gives you a sense of what's actually happening behind the scenes. With relational databases, you hear horror stories of adding indexes or adding columns, and it locks the whole table until it's copied over. Relational databases have gotten better on that on some fronts, but there are still some edge cases there. I think this at least gives you an understanding of what's happening. I wish there was a better tool. I think you could make a tool with Step Functions and Lambda and basically say, okay, write your filter logic, write your update logic, or whatever you want to do, and just have this thing run. I think that could be out there. I've wanted to do that for a while and just haven't gotten around to it.
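A minimal sketch of the scan, update, next-batch loop Alex describes. The entity layout (`Type`, `OrderId`, `CreatedAt`) and the `ORDER#` key format are hypothetical, and `FakeTable` is an in-memory stand-in written for this sketch so it runs without AWS. Its `scan` and `update_item` methods only mimic the general shape of the boto3 `Table` resource methods of the same names (a `LastEvaluatedKey` in the response, an `ExclusiveStartKey` on the next call).

```python
# Sketch of a GSI backfill: scan in pages, decorate matching items with
# GSI1PK/GSI1SK, repeat until the scan reports no further pages.
# The entity layout (Type, OrderId, CreatedAt) is hypothetical.


class FakeTable:
    """In-memory stand-in mimicking the shape of Table.scan/update_item."""

    def __init__(self, items, page_size=2):
        self.items = items            # list of dicts, each with PK and SK
        self.page_size = page_size

    def scan(self, ExclusiveStartKey=None, **_ignored):
        start = 0
        if ExclusiveStartKey is not None:
            keys = [(i["PK"], i["SK"]) for i in self.items]
            start = keys.index(
                (ExclusiveStartKey["PK"], ExclusiveStartKey["SK"])) + 1
        page = self.items[start:start + self.page_size]
        response = {"Items": [dict(i) for i in page]}
        if start + self.page_size < len(self.items):
            last = page[-1]
            response["LastEvaluatedKey"] = {"PK": last["PK"], "SK": last["SK"]}
        return response

    def update_item(self, Key, UpdateExpression, ExpressionAttributeValues):
        # Only understands the single SET expression used below.
        for item in self.items:
            if item["PK"] == Key["PK"] and item["SK"] == Key["SK"]:
                item["GSI1PK"] = ExpressionAttributeValues[":pk"]
                item["GSI1SK"] = ExpressionAttributeValues[":sk"]


def backfill_gsi1(table):
    """Add GSI1PK/GSI1SK to every Order item; return how many were updated."""
    updated = 0
    scan_kwargs = {}
    while True:
        page = table.scan(**scan_kwargs)
        for item in page["Items"]:
            if item.get("Type") != "Order" or "GSI1PK" in item:
                continue  # skip other entities and already-decorated items
            table.update_item(
                Key={"PK": item["PK"], "SK": item["SK"]},
                UpdateExpression="SET GSI1PK = :pk, GSI1SK = :sk",
                ExpressionAttributeValues={
                    ":pk": f"ORDER#{item['OrderId']}",
                    ":sk": item["CreatedAt"],
                },
            )
            updated += 1
        if "LastEvaluatedKey" not in page:
            return updated
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```

Skipping items that already carry `GSI1PK` makes the job safe to re-run after a partial failure. Against a real table the same loop would still read every item, which is the cost question raised next in the conversation.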


Yan Cui: 34:39  

That's probably more difficult than you make it sound, especially when you've got single table design. Depending on the specific entity that you need to update to accommodate the new access pattern, you could be talking about a small subset of the data you have in a very large table, because of the fact that everything is dumped into the same table. So depending on the amount of data you're talking about, you could be filtering out 99% of the data in the database for that 1%, depending on what entity you're looking for. That's something you should also think about if you've got to do something like that.


Alex DeBrie: 35:12  

Yeah, I totally agree with that. But reading all the data in your database is not generally that expensive, especially if you're doing a scan, where they're gonna group those items together before calculating RCUs. And if you have a large table, you probably already have a large DynamoDB bill as well. Unless you have some sort of situation where you have a large table where a lot of it is historical and not being used. But generally, I think you're right that that is a factor and something you need to consider. I find that it's not a huge bill when I actually calculate it out most of the time, but yeah, it's a good point as well.
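The "calculate it out" step can be sketched as a back-of-the-envelope estimate. A scan is billed on the total size of the data it reads, in 4 KB units, and an eventually consistent read (the default for a scan) costs half a unit. The price of $0.25 per million on-demand read request units is an assumption for illustration only; actual pricing varies by region and over time. The function names are hypothetical, and the estimate ignores per-page rounding.

```python
import math

# Rough cost of reading a whole table once with Scan.
# ASSUMPTION: price_per_million_units is illustrative, not current pricing.


def scan_read_units(table_bytes: int, strongly_consistent: bool = False) -> int:
    """Read units to scan table_bytes: one per 4 KB, halved if eventually consistent."""
    chunks = math.ceil(table_bytes / 4096)
    return chunks if strongly_consistent else math.ceil(chunks / 2)


def scan_cost_usd(table_bytes: int,
                  price_per_million_units: float = 0.25,
                  strongly_consistent: bool = False) -> float:
    """Estimated on-demand cost in USD of one full scan."""
    units = scan_read_units(table_bytes, strongly_consistent)
    return units / 1_000_000 * price_per_million_units


if __name__ == "__main__":
    hundred_gib = 100 * 1024 ** 3
    print(scan_read_units(hundred_gib), "read units")
    print(round(scan_cost_usd(hundred_gib), 2), "USD")
```

Under these assumptions a 100 GiB table comes to about 13.1 million read units, or a little over three dollars per full pass. The writes for the items actually updated are billed separately.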


Yan Cui: 35:47  

Okay, maybe just the, I guess, additional complexity. But I guess that's...


Alex DeBrie: 35:52  

One question I have for you. If you have multiple different tables, one per entity, I imagine you're not using generic key names there. You're probably using something like user ID and order ID for the primary key and secondary indexes. But then also, are you ever combining attributes? You know, some of the single table design principles where maybe you have a status and a timestamp jammed together into a sort key. Do you have anything like that? Or what does that look like when you have these different tables per entity?


Yan Cui: 36:30  

I do have that. In fact, I just did that today on a client project. We had to combine a type name as well as an ID into a secondary field. It's a multi-tenancy scenario, so the primary key is the tenant ID, and I combine multiple things into the secondary key to make this whole system work. So I do that quite regularly, even though that's probably not my default stance. But like I said, when I need to I'll do it. One thing I don't do is use generic index names. I always choose meaningful index names. That's one of the things that drives me nuts as well. I don't mind having SK and PK as attribute names, especially when they can contain different fields in there. But GSI 1, 2, 3, 4, 5? That doesn't cut it for me.
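A small sketch of the two composite keys mentioned here: status plus timestamp, and type name plus ID (what Yan calls the secondary key, which is the sort key in DynamoDB terms). The `#` separator, the upper-casing, and the function names are conventions assumed for illustration, not something either speaker specifies.

```python
# Composite sort keys: several attributes joined into one string so that
# a begins_with condition on the sort key can select a slice of a partition.
# The "#" separator and upper-casing are conventions assumed for this sketch.

SEPARATOR = "#"


def entity_sort_key(type_name: str, entity_id: str) -> str:
    """Type name plus ID, e.g. for one tenant's items of mixed types."""
    return f"{type_name.upper()}{SEPARATOR}{entity_id}"


def status_time_sort_key(status: str, iso_timestamp: str) -> str:
    """Status plus ISO 8601 timestamp; sorts by status, then by time."""
    return f"{status.upper()}{SEPARATOR}{iso_timestamp}"


def type_prefix(type_name: str) -> str:
    """Prefix to use in a begins_with condition for one entity type."""
    return f"{type_name.upper()}{SEPARATOR}"


if __name__ == "__main__":
    keys = sorted([
        entity_sort_key("invoice", "0002"),
        entity_sort_key("customer", "0007"),
        entity_sort_key("invoice", "0001"),
    ])
    # Sorting groups items of the same type together within the partition.
    print(keys)
    print([k for k in keys if k.startswith(type_prefix("invoice"))])
```

Because sort keys are compared as strings, fixed-width IDs and ISO 8601 timestamps keep the lexical order meaningful.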


Alex DeBrie: 37:24  

I agree. It's totally different. I mean, I've bought in enough that I'm down the road, and it seems totally normal to me now. But I get the sense of that, especially if you're moving between relational and DynamoDB. Sometimes it's like, what am I doing here, when I come back to this? But again, that's where the buy-in is important, and deciding how much you want to bite off with some of that stuff.


Yan Cui: 37:49  

The thing is that with a GSI you are explicitly saying which attributes you're using. So it's not like the GSI itself can change in that manner. So I think GSIs should have meaningful names, because they are dedicated to specific access patterns where, say, instead of having a tenant ID and SK, I want to find by the tenant ID and the ID of the specific entity I'm looking for. So I should be able to name a GSI accordingly, so that when I'm writing the code, and more importantly when I'm reading the code, I don't have to look at the code, go find some reference, figure out which of GSI 1, 2, 3, 4, 5 I need, and then figure out, okay, is this query doing the right thing? I shouldn't have to do that. I guess that's my point about the GSI naming. I don't mind PK and SK when the table can use different things for PK and sort key. But a GSI has got a specific target access pattern, so I think they should have more meaningful names in the table.


Alex DeBrie: 38:51  

I think that's true, you know, when you're using your patterns, like where one entity is in a table, but if you have a single table design, or even, you know, two or three entities, only two or three, or maybe you have 10 entities in the same table, those GSIs can be used different ways for different entities, right? So it's like, my users might be indexed, you know, by their email, or, and, or something like that. Whereas my orders might be indexed by the date they were made or something like that. So then it's hard to name that GSI something because they're used for different things across those different items. So if you do have those different item types in a single table, I think you do need to get generic, you know, just like you would with your primary key. Unless you want to make different GSIs for you know, one GSI per access pattern, which you could do there as well. Yeah, I just like the, I like the forcing function of of sort of making these things generic and in thinking about it in like this modelling way of like, Hey, this is how I'm laying it out. This is how I'm splitting these things out. But yeah.
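A sketch of the overloading Alex describes, where users and orders share one index but use its keys for different purposes. The item shapes, key formats, and function names are hypothetical, and `query_gsi1` is an in-memory imitation of an index query rather than a DynamoDB call.

```python
# GSI overloading: two entity types share one index (GSI1) but put
# different things in its generic key attributes.
# Entity shapes and key formats here are hypothetical.


def user_item(user_id: str, email: str) -> dict:
    return {
        "PK": f"USER#{user_id}", "SK": f"USER#{user_id}", "Type": "User",
        "GSI1PK": f"EMAIL#{email.lower()}",   # look up a user by email
        "GSI1SK": f"USER#{user_id}",
    }


def order_item(user_id: str, order_id: str, order_date: str) -> dict:
    return {
        "PK": f"USER#{user_id}", "SK": f"ORDER#{order_id}", "Type": "Order",
        "GSI1PK": f"ORDERDATE#{order_date}",  # list orders made on a date
        "GSI1SK": f"ORDER#{order_id}",
    }


def query_gsi1(items: list, gsi1pk: str) -> list:
    """In-memory imitation of querying GSI1 by its partition key."""
    matches = [i for i in items if i.get("GSI1PK") == gsi1pk]
    return sorted(matches, key=lambda i: i["GSI1SK"])
```

With one index per access pattern, as Yan prefers, the same two lookups would instead live on two indexes with descriptive names. That trades index count for readability.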


Yan Cui: 39:50  

Maybe that hasn't worked for me. At least whenever I have to use GSIs as part of my consolidated tables, I've found them to cater for specific access patterns. I rarely find them spanning multiple access patterns, or rather, multiple use cases that have the same access pattern or at least are looking for things in the same way. That's been my experience with those tables. I guess one other thing I wanted to ask you and get your opinion on is, what are your thoughts on PartiQL for DynamoDB?


Alex DeBrie: 40:23  

This is tricky. I'm not a huge fan of PartiQL. I don't know, I think I'm just sort of old school. I started with Python, right? And I like Python's principle that explicit is better than implicit in most cases. So I worry with PartiQL that it's hiding what's happening behind the scenes. What I really like about Dynamo, and have come to appreciate more and more as I get into it, is just how explicit it makes everything and how easy it is to understand every aspect of scaling. Especially if I didn't write the code and I'm gonna go read someone else's code, I can pretty quickly understand where it's going to scale and where it's not. With PartiQL, it's a little harder because some of that stuff is hidden, and you can accidentally do a scan, or things like that. You can help yourself in some ways by maybe not giving yourself that IAM permission, or just doing a good code review on that stuff. But I think any sort of obscurity there makes me nervous about what is actually happening. So I haven't done a lot with PartiQL, but I am hearing a lot of people loving it, so I probably need to add some stuff to the book and in a few other places, at least to make it more complete and tell people, hey, here are some of the things you need to think about here. But yeah, not for me, partly because I'm already familiar with the other stuff. What do you think? Have you used PartiQL? Do you like it?
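One way to act on the IAM point is a policy that denies PartiQL `SELECT` statements whenever they would turn into a full table scan. The sketch below builds such a policy as a Python dictionary. It relies on the `dynamodb:PartiQLSelect` action and the `dynamodb:FullTableScan` condition key as described in the AWS documentation on IAM policies for PartiQL, so verify both against the current docs before using it. The function name and the table ARN are hypothetical.

```python
import json

# Build an IAM policy document that denies PartiQL SELECTs which would
# result in a full table scan. The ARN used below is a made-up example.


def deny_partiql_scans_policy(table_arn: str) -> dict:
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": ["dynamodb:PartiQLSelect"],
                "Resource": [table_arn],
                # Applies only when the statement would scan the whole table.
                "Condition": {"Bool": {"dynamodb:FullTableScan": ["true"]}},
            }
        ],
    }


if __name__ == "__main__":
    arn = "arn:aws:dynamodb:us-east-1:123456789012:table/ExampleTable"
    print(json.dumps(deny_partiql_scans_policy(arn), indent=2))
```

A deny statement like this only blocks the scan case. The role still needs a separate allow statement for the PartiQL reads it is meant to run.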


Yan Cui: 41:50  

I haven't used it, and I don't like it because, like you said, it's hiding a lot of the underlying... Well, the thing I really don't like is the fact that you can accidentally switch from a query to a scan. Friends don't let friends use scans unless they absolutely know what they're doing, and accidentally turning a query into a scan is going to be really painful for many, many reasons. Beyond that, I actually wrote a library many years ago that gives you some kind of SQL-like syntax for queries. That was a couple of years before the expression syntax was introduced. I wrote something that basically does the same thing, but encapsulates limit and page size and stuff like that as well. But at no point was I trying to hide the underlying mechanisms. It was just expressing them in a different way, whereas I think PartiQL probably goes a bit further towards trying to mimic relational databases. I understand why people love it, because anything that can help you not learn this other thing, and make it look like things you already know, people are going to like. In the same way, I've got issues with tools like CDK, but people love it because it lets them use tools that they are already familiar with. And I totally get it. I just think with PartiQL there's that massive danger that you end up doing the wrong thing quite easily. So that's my feeling towards PartiQL. I also find that the DynamoDB API itself is pretty straightforward. There aren't a lot of different knobs you need to turn to do what you need to do. So I don't really see a big appeal for PartiQL in that particular sense either.


Alex DeBrie: 43:37  

Yep, I totally agree with that, and same thoughts around PartiQL and CDK. One thing that I'm always wrestling with in my head is, I don't want to be, I don't know if you would say gatekeeping, or just an old grumpy guy, right? You see some older programmers that are like, don't use Python or Node because you don't have to think about memory management, you're just making these huge inefficient applications. It's like, well, I think in some places we've gotten to the point where even if Python and Node are less efficient with memory, we've sort of grown out of that, right? Memory is not a major constraint anymore. I don't have a computer science background, I don't know a ton about memory management, but I also think that for what I'm doing I don't need to. So I also want to make sure, for PartiQL and for CDK, that I'm not gatekeeping in some other way, right? And saying, oh, you should actually learn the underlying things. With CDK, I like some things about it, but I do worry that it makes it so easy that you're not understanding the underlying concepts and could get into trouble there. Yeah, it's hard for me to handle that balance.


Yan Cui: 44:46  

Yeah, I feel the same as well. I'm always worried about becoming one of the old grumpy guys. But at the same time I'm constantly excited by new things. So I think I'm okay, there are just some new things that I don't like. Something can be different just for the sake of being different, not necessarily better. Not that I'm saying that's what PartiQL and CDK are, but there are lots of things I don't like about CDK, for reasons probably too long for us to discuss here. I think I've touched on a lot of those in the past already. CDK is really powerful for some use cases, especially when you've got things like a multi-tenant application, where I see a lot of people essentially replicate some parts of their entire stack, heavily templated, so that they bring up a new subset of their architecture for every new tenant they onboard, in such a way that even a heavily templated CloudFormation template is still going to be quite difficult. So CDK is great for that, and it's also great for capturing reusable components, maybe even better than Serverless Components, because it's something that you can actually just share as an npm package in your organisation. But of course that depends on you agreeing as a whole company that that's the language you're going to use, because otherwise you could end up supporting three different languages, or maybe even more depending on what you're going to do in the future, right? So a lot of it depends on you agreeing as a whole company that this is what we're going to do and everyone's going to do that, which in some ways betrays one of the principles of microservices, that every microservice should be free to choose whatever technology they want to use.
And I think taking away your ability to use reusable components just because your team wants to use Python instead of JavaScript brings other problems in terms of organisation, how you manage things, how you make sure everything's consistent. And also, I guess for me, I just don't see the problem with YAML. I got used to it. I guess with you doing Python, you have no problem with whitespace either. I think the whole whitespace thing just gets people when they haven't gotten used to it or been exposed to it. Like you, I'm getting old, but at the same time I'm always worried about becoming one of those grumpy developers who don't want to learn new things.


Alex DeBrie: 47:23  

Yep. And I think you're right on a few points, like where CDK can be useful, right, for templating those things or for reusability. Although I think we're actually a lot worse at reusability than we think. I'm not a huge reusability fan, I think it just ends up not working well in practice. But that aside, I think those might be two benefits of CDK, but they're a small part of why people end up using CDK, which I think is usually just because, now I don't have to learn about IAM, I don't have to learn about how these things work together. Sometimes that can be abstracted well, but sometimes it can't, and you don't know what you're missing there.


Yan Cui: 48:01  

Yeah, and I guess one of the personal rules I've had for a very long time is to always be able to understand at least two layers of abstraction below what I want to operate at. So if I want to be building APIs and stuff, I should really understand how HTTP and TCP work, even though I don't want to be writing and operating at that level. You need to understand at least one or two layers beneath the layer you want to be working at, so that when there are problems, you can actually solve them. You can figure out what the problem is and understand the underlying mechanism of how things work. You don't have to be a specialist or an expert at them, but at least you should understand how things work. You understand how the bacon is made. That's the problem with a lot of the tools: they're just too much of a black box, and if you only know how that tool does something without understanding the underlying fundamentals, then you could put yourself in a really dangerous position. I find that with a lot of the front end tooling as well: people only know how to do something with a particular framework, and if you take that framework away, they have no idea how things are even put together, what CSS is and things like that. I think that's also not a good place to be, especially when those tools get recycled every two years.


Alex DeBrie: 49:22  

Yeah. Yep, now I feel hurt, because I definitely don't know how those things work. I'm just kind of glueing stuff together to make it work. So that really hit me hard, but I totally agree. I love that rule of knowing things two layers below what you're doing. I think that's a good idea.


Yan Cui: 49:40  

Yeah, this is something that I've lived by for a long time. I found myself curious: okay, I'm doing all these things, but how does it actually work under the hood? So I do a lot of reading and learning just to figure out how things work under the hood. It has helped me a lot in my career.


Alex DeBrie: 50:00  

Yep. Yep, absolutely.


Yan Cui: 50:02  

So I guess we're coming up to the top of the hour. Alex, thank you so much for taking the time to talk to us today, how do people find you on the internet?


Alex DeBrie: 50:12  

Awesome. Yeah, thanks for having me, Yan. If you want to find me, Twitter is probably the best way, so I'm @alexbdebrie. You can google me and find me that way. If you want to get the book, it's dynamodbbook.com, if you're interested in learning about Dynamo and the different data modelling principles and the philosophy under that. Otherwise I blog at alexdebrie.com. And I'm generally available if you want to find me.


Yan Cui: 50:35  

Are you still available to do consulting work? I imagine some people hearing this podcast might be interested in getting some help from you.


Alex DeBrie: 50:43  

Yep, absolutely. I do consulting both around DynamoDB and just serverless AWS generally. I have some training workshops if you want to do that, or I can come in and give you advice, do design reviews, or just be general assistance if you're interested in that. So yeah, feel free to reach out on any of that stuff.


Yan Cui: 51:00  

Great, I guess with that take it easy, man. It's been a pleasure talking to you again.


Alex DeBrie: 51:05  

Thanks a lot. Likewise, Yan. See you.


Yan Cui: 51:19  

So that's it for another episode of Real World Serverless. To access the show notes, please go to realworldserverless.com. If you want to learn how to build production ready Serverless Applications, please check out my upcoming courses at productionreadyserverless.com. And I'll see you guys next time.