AWS Insiders

An Inside Look at The Planet-Scale Architecture of DynamoDB with Alex DeBrie, Author of "The DynamoDB Book" (Part One)

Episode Summary

On this episode, Alex dives deep into the intricacies of Amazon DynamoDB, a planet-scale NoSQL database service that supports key-value and document data structures. Alex discusses the consistency and predictability designed into DynamoDB's performance, and how to best utilize it.

Episode Notes

On this episode, Alex dives deep into the intricacies of Amazon DynamoDB, a planet-scale NoSQL database service that supports key-value and document data structures. Alex discusses the consistency and predictability designed into DynamoDB's performance, and how to best utilize it.

--------

“How is Dynamo different than a relational database? I’d say the biggest thing is, it’s designed for consistency and predictability, especially around its performance and what that response time is going to be, and also how high it can scale up on different axes.” Alex DeBrie, Principal at DeBrie Advisory and Author of “The DynamoDB Book”.

--------

Time Stamps

* (02:08) How Alex discovered and got started with Amazon DynamoDB

* (04:34) The value of writing “The DynamoDB Book”

* (07:08) The underlying architecture and the unique qualities of Amazon DynamoDB

* (13:45) The advantages of single table design 

* (17:48) Illustrating examples of single table design 

* (21:45) Doing the math with Amazon DynamoDB for consistency and predictability

 

--------

Sponsor

This podcast is presented by CloudFix

 

CloudFix automates easy, no-risk AWS cost savings so your team can focus on changing the world.

--------

Links

Connect with Alex DeBrie on LinkedIn

Connect with Rahul Subramaniam on LinkedIn

Episode Transcription

Speaker 1:

Hello and welcome to AWS Insiders. On this podcast, we'll uncover how today's tech leaders can stay ahead of the constantly evolving pace of innovation at AWS. You'll hear the secrets and strategies of Amazon's top product managers on how to reduce costs and improve performance. Today's episode features an interview with Alex DeBrie, Principal at DeBrie Advisory and author of The DynamoDB Book. On this episode, Alex dives deep into the intricacies of Amazon DynamoDB, a planet-scale NoSQL database service that supports key-value and document data structures. Alex discusses the consistency and predictability in the design of DynamoDB's performance and how to best utilize it. But before we get into it, here's a brief word from our sponsor.

Speaker 2:

This podcast is brought to you by CloudFix. Ready to save money on your AWS bill? CloudFix finds and implements 100% safe, AWS-recommended account fixes that can save you 10% to 20% on your AWS bill. Visit cloudfix.com for a free savings assessment. And now, here's your host, AWS super fan and CTO of ESW Capital, Rahul Subramaniam.

Rahul Subramaniam:

Hi everyone and welcome to another episode of AWS Insiders. Today I have with me, Alex DeBrie, the author of The DynamoDB Book. Thanks for coming to the show, Alex. I'm really excited to dive deep into the details of DynamoDB with you.

Alex DeBrie:

Absolutely. I'm excited to be here. Thanks for having me Rahul. This is great.

Rahul Subramaniam:

Awesome. I'm surprised so many times that you are the go-to person for any question about DynamoDB. You're the first person to get tagged on questions, even before the AWS team. So I couldn't be more excited to have you come on and give your perspective about DynamoDB. So how did you discover DynamoDB, or how did you get started with it?

Alex DeBrie:

Well, the first thing I would say is probably no one's more surprised than me that I fell into it this way. I don't have a hard database background or even a hard computer science background or anything like that. It happened accidentally over time, and it's something that I really like and am interested in, so I love being in it, but yeah, it was accidental. When I think about how I discovered DynamoDB, I'd put it in three different phases. In the first one, I was working for a company doing internal infrastructure stuff, more like data warehousing, working with Amazon Redshift and different things like that. And our team just had a need for an internal Slack application to integrate with Slack and store some data, different things like that.

Alex DeBrie:

I offered to build the thing, and I didn't want to have an EC2 instance running and a relational database instance running and all that stuff just to handle 5-10 requests a day, not even that much. So I looked into all this serverless stuff and found Lambda, found DynamoDB, and thought, "I'll use this. It's super easy, it's low cost, it's going to fit in the free tier." So I built it, and I used it totally wrong, but it was a small enough scale that it didn't matter; I could just brute-force it. Taking that, I really liked how that serverless application worked. So, six months after that, I took a job with serverless.com. These are the people that created the Serverless Framework and really are at the cutting edge of making this serverless revolution happen in conjunction with AWS and Lambda and those things.

Alex DeBrie:

So I'm working for serverless.com and I see so many people using DynamoDB in their serverless applications because of how well it worked with Lambda, and especially how relational databases didn't work well with Lambda, because you had all these networking requirements, you had connection limits, and then also just the scalability of Lambda as compared to the scalability of relational databases, especially with quick spikes and things like that, just wasn't working well. So I started using Dynamo there, probably poorly most of the time, and then in late 2017, I watched Rick Houlihan's talk at AWS re:Invent. He was this guy that worked for AWS and helped Amazon move a bunch of their internal applications from Oracle to DynamoDB, and he'd translated a lot of these relational database patterns to NoSQL and shared them. I watched his talk like 5, 6, 7 times over that break and it just blew my mind. That's when I actually figured out how to use Dynamo, and I would say that's when I discovered what it means to use it.

Rahul Subramaniam:

That's pretty neat. At what point did you realize that there's tons of value in writing a book about DynamoDB and what really triggered you to go do that?

Alex DeBrie:

Yeah, so just to continue that story, this is late 2017, this is Christmas break. I'm watching Rick Houlihan's talk. I actually listened to it the first time while I was driving to work. And I was like, "Oh, this is really amazing." And then I watched it a few more times, took a bunch of notes, and I was like, "There's a lot of interesting stuff here that I didn't get the first time, and I just want to help share that with people." So during that Christmas break, I created a website called dynamodbguide.com, and it was just 10, 15, 20 pages covering the DynamoDB documentation, but presented in a different way, in a way that made sense to me, of just like, "Hey, this is how DynamoDB is different than other databases and what it looks like."

Alex DeBrie:

So I put that together in early 2018 and it was pretty popular right away. And it started to snowball, where people would be asking me questions. And a lot of times, I didn't know the answers, because I was still pretty new to Dynamo at that point, at least to modeling Dynamo correctly, but people would ask me a question and I'm like, "Well, I've got to go figure it out." So I'm figuring stuff out and building up knowledge and all that. Then I start writing blog posts. I start giving talks. I worked with the AWS team a little bit, things like that throughout 2018 and 2019, and around the summer of 2019, I was like, "You know what? I think there's an opportunity here for a book, where DynamoDB is interesting enough, distinct enough, and it has these unique advantages with serverless applications, also just with predictability, consistency, things like that, but it's so different and people need a comprehensive look at how you model DynamoDB."

Alex DeBrie:

So, I started thinking about that in summer 2019 and was working on it, but I still had a full-time job, and it's really hard to do that with a full-time job. So at the end of 2019, I was like, "Hey, I think I can do this." I put in my notice at my company, and I spent the next four months at the beginning of 2020 writing the book and releasing it. And of course, right when I released it... I set this release date of April 2020, and two weeks before that, the whole United States and the whole world shuts down with COVID, and I'm thinking, who's going to be spending money on books when who knows what's happening with themselves? But I released it, and it went well, and it's just been a fun journey since then as well.

Rahul Subramaniam:

That's a really awesome story. For most of us who come from the SQL world or the relational database world, DynamoDB was this huge paradigm shift and wrapping our heads around all those concepts, because you had to think about everything differently, was really hard. So for those who are new, who are listening to this podcast, how would you describe the underlying architecture or the uniqueness of DynamoDB and how to think about DynamoDB differently, especially when they come from a relational database world?

Alex DeBrie:

Yep. Yeah. Sure thing. I think Dynamo is so interesting and unique, and particularly, I think it has the most explicit trade-offs of any database, so it's easy for me to go to a place where people are saying, "Hey, should we use Dynamo?" And I love Dynamo because it gives you this, this, and this, but it also takes away this, this, and this on the other side, and it's just whether you want to accept those trade-offs. So I like the clarity of the trade-offs. If I'm talking to people and telling them, "Hey, how is Dynamo different than a relational database?" I'd say the biggest thing is it's designed for consistency and predictability, especially around its performance and what that response time is going to be, and also just how high it can scale up on different axes and things like that. To do that, with its underlying architecture, it's very much going to force you to use what's called the primary key to identify your items.

Alex DeBrie:

So, if you're coming from a relational database, you can query based on any column in your table, but in Dynamo, you're almost always going to want to filter on the primary key, which is going to be different than a primary key in a relational database, because that's often going to be an auto-assigned [inaudible 00:08:20], whereas in DynamoDB, it's going to be something meaningful in your application. It might be a username, it might be a customer email, it might be an order ID, but whatever it is, that primary key is really going to have value in your application, because that's what's going to be used for retrieving your items, for writing your items back, and things like that. So I think that's the first thing: it's really going to force you into using that primary key. The other thing I would say that's different from a relational database, or even other NoSQL databases, is a lot of times those databases will have a query planner, right?

Alex DeBrie:

So you'll issue a query against your database. Something internal to that database is going to take apart your query, look at table statistics, and figure out the most efficient way to execute that query. Dynamo doesn't have a query planner at all. It's basically giving you lower-level access to some basic data structures, whether that's a hash map or a B-tree, some pretty basic stuff, and making you plan for that. Basically, you become the query planner, where you have to arrange your data in particular ways that match your access patterns. It requires more time spent up front in thinking through your access patterns and designing for your access patterns, but then you get those great things about Dynamo, the consistency, the predictability, where you know about how long anything is going to take without having to go through a query planner that maybe gives you unexpected results as it hits higher load.
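To make the "you become the query planner" idea concrete, here's a toy Python sketch of the two data structures Alex mentions: a hash map keyed by partition key, with the items under each key kept sorted by sort key, standing in for the B-tree. This is not DynamoDB's actual implementation; all class and method names are illustrative.

```python
from bisect import insort

class TinyTable:
    """Toy model of DynamoDB-style storage: a hash map of partitions,
    each holding a list of (sort_key, item) pairs kept in sorted order."""

    def __init__(self):
        self.partitions = {}  # partition key -> sorted [(sort_key, item), ...]

    def put_item(self, pk, sk, item):
        # Insert while keeping the partition's list sorted by sort key.
        insort(self.partitions.setdefault(pk, []), (sk, item))

    def query(self, pk, sk_prefix=""):
        # The only read path: one partition, one contiguous sort-key range.
        # No planner decides how to run this; you arranged the data for it.
        rows = self.partitions.get(pk, [])
        return [item for sk, item in rows if sk.startswith(sk_prefix)]
```

If an access pattern isn't "fetch by partition key, optionally narrowed by sort key," this structure simply can't serve it efficiently, which is exactly the constraint that pushes the up-front access-pattern design Alex describes.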

Alex DeBrie:

So those would be the first ones. I think three other pretty unique things that fall out of that would be, number one, it's very explicit on what the limits of the service are. So you can't have an item over a particular size, you can't retrieve more than a megabyte of data in a single request, you can't hit a particular key more than 3,000 times per second, which is all useful to know up front, as compared to a relational database, where those limits exist but you just don't know what they are, and it depends a lot on how many other queries are going on, what architecture you're on, what instance size you're on, all sorts of things, whereas with Dynamo, that's pretty explicit for you.
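Because those limits are explicit, you can guard against them in application code before a write ever leaves your process. A rough sketch, with the caveat that real DynamoDB item sizing counts attribute names plus values (JSON length here is only an approximation), and the limit values should be confirmed against current AWS documentation:

```python
import json

MAX_ITEM_BYTES = 400 * 1024   # per-item size limit Alex alludes to (400 KB)
MAX_PAGE_BYTES = 1024 * 1024  # per-Query/Scan page limit (1 MB)

def check_item_size(item: dict) -> int:
    """Approximate an item's stored size and fail fast if it's over the cap."""
    size = len(json.dumps(item).encode("utf-8"))
    if size > MAX_ITEM_BYTES:
        raise ValueError(f"item is {size} bytes, over the {MAX_ITEM_BYTES}-byte limit")
    return size
```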

Alex DeBrie:

I'm rambling a bit here, but two other things I think are super interesting, and why Dynamo is so popular, especially in the serverless world, are how well it can scale up and down, right? It's easy to scale DynamoDB up if you have cyclical traffic during the day or the week or the month, whatever; you can scale that up and scale it back down pretty easily. And then also, it has an HTTP-based API, so you don't have to set up VPCs and do all this private networking stuff to access your DynamoDB table. It's all accessed over HTTP. It uses AWS IAM for authentication, and it just really works well that way. So it works well in serverless architectures [inaudible 00:10:57].

Rahul Subramaniam:

Yeah. AWS DynamoDB is now 10 years old, right? It's got such an amazing set of properties or characteristics that sometimes I feel like AWS is almost underselling the service. When I first looked at DynamoDB years ago, I looked at it and I said, "Here's a new key-value store." And coming from the old-school relational world, when you think of key-value, you literally think of a properties-file kind of setup, or a very simple cache of sorts where you could just query for some very basic data. For quite a while, I had not even realized that DynamoDB could actually allow you to leverage it for so many different use cases. And I think over the last 10 years, the ways in which DynamoDB has been used to create some amazing global-scale products have been absolutely fascinating.

Alex DeBrie:

I totally agree. And I think you're right there that the marketing undersells it a little bit. They have this purpose-built database strategy and they put Dynamo in the key-value bucket, which you absolutely can use it for, but it can do so much more. It can handle all these relationships and interesting things, and all of Amazon, all of AWS internally, is running on DynamoDB. So you can do some very complex things in there. And again, yeah, I think you're right that they aren't selling enough of the story around that because it's more complex, right? You have to tell people, "Hey, it's a totally different way of modeling." You don't want them to bring a relational mindset to DynamoDB, because you're going to end up in a pickle there. So it's an education problem in a lot of ways, and I think we're seeing a lot more of that.

Alex DeBrie:

Rick Cohan did a lot of great work here. A lot of the team is doing great work, and I think also, because Dynamo works so well in that serverless realm, you had to expand to this much larger user base to where, for the first five, six years of its existence, it was just the really high I scale customers that needed and use DynamoDB. And now with that serverless and how well it works there, you're like, "Okay, it works for a lot more, it works really well with service, but it also works for any OLTP application if you put in the work to understand how it works."

Rahul Subramaniam:

With that thought, you just cued me to bring up a conversation about single table design. I was first introduced to the concept around 2018 or 19 and since then, it seems to have catalyzed a whole lot of very fascinating use cases for DynamoDB, especially when it comes to building planet scale applications. What's your take on it?

Alex DeBrie:

Yeah. Absolutely. So let me just, I guess for the listeners, say a little bit about what single-table design is, some of the benefits, and then also some of the other things to think about with that as well. So first of all, why single-table design? I think the first thing you should know about Dynamo is that it doesn't have joins, right? And it does have that primary-key-based access. You're accessing by primary key, and you don't have a join operation, but you still need to fetch related data in DynamoDB, so how can you do that? Dynamo does have an operation called the Query operation, which allows you to retrieve a contiguous set of items. We're talking in a B-tree setup here; you can retrieve a range of items that has a particular partition key.

Alex DeBrie:

So what you can do, and I think the clearest example of single table, is you can pre-join related data in a way that matches your access patterns, right? If you know you're going to fetch a customer and the customer's most recent orders, because you want to populate their order history page, you can model your data so that those are next to each other in a single table. You can do that Query operation. It's a very efficient, predictable request, and you don't have that join operation that gives you that unpredictability. So I think that's one of the big reasons for single-table design: how do I get these disparate related items in a single efficient request rather than doing joins in my application? Another reason you might want to do single-table design, especially before they had on-demand billing or some of the auto scaling, is that it just simplified data management, right?

Alex DeBrie:

So if you have an application that has five different tables, now you have to manage capacity and throughput for all those different tables, scale those up and down, and things like that, whereas if you put them all into a single table, now you only manage one. You only have to scale that up and down, and often, whatever your biggest use case is ends up being a lot of your throughput, and you can hide all your other operations in there and get them almost for free, just in your excess capacity. So that's pretty interesting. One thing I would like to say... I'm not as much of a purist on single table. I think it's useful, especially if you're going to be pre-joining data, especially if you want to lower your management overhead. The biggest thing I like about single table is that it forces you to understand, "Hey, the way I'm modeling my data in Dynamo is not going to match how I was modeling it in a relational database."

Alex DeBrie:

And it forces you to understand those principles. And if you are modeling with single-table design principles, in almost all cases you can put it together in a single-table design if it's going to work. But there also might be reasons to split it out into different tables, and usually those are operational-type reasons, right? So if you're using DynamoDB Streams, which is basically change data capture, or a change log for your DynamoDB table, maybe you have very different use cases for some of the different entities in your application. Well, you could split out those entities into different tables and have different stream consumers for each of those; maybe that simplifies that access, right? Maybe you have different data-exporting needs or different backup needs, or maybe occasionally you need to scan all items of a particular entity, and that's easier if it's in a separate table. So things like that. So I think it's often a balance of, do I want to put this into a single table, or are there reasons to break off and have other tables?

Alex DeBrie:

I think the key point here is: make sure you're modeling it in a DynamoDB-first way, which should be compatible with single-table design if that works for your other considerations.

Rahul Subramaniam:

I've heard a lot of people equate single-table design to hyper-denormalization, but it's really so much more than that. Could we take an example and walk through what it takes to create or design a single-table schema?

Alex DeBrie:

I think you're right, it's not just about hyper-denormalization. I think if you over-denormalize, you can get yourself into trouble there; you need to find the right balance. I think single-table design is more about thinking access patterns first, rather than designing your model first and then writing your queries, right? So if you're working with a relational database, often you'll have your entity relationship diagram, you'll have your different boxes for each entity, and then each box becomes a table in your relational database, and then you'll write the queries after the fact to join those together and filter as you need to, adding indexes if you need to. Whereas with Dynamo, you'll create your entity relationship diagram, but then you'll think, "Hey, what are my actual access patterns here? How am I going to read this data? How am I going to write this data? Update the data?" Different things like that. Once you have those listed out, then you go about actually designing your table to handle those specific access patterns in the way that you want.

Alex DeBrie:

So in terms of examples to have here, one example I talk about a lot would be like a very simplified eCommerce example where you have a customer and orders, right? So a customer's going to make multiple orders. That order is going to be comprised of multiple items, right? Also, maybe the customer has multiple addresses that they've stored on file for you because they want to ship to their house or to a business or to their parent's house or whatever that is. So you have a lot of different relationships here, right? You have customer to address, customer to order, order to order items.

Alex DeBrie:

So those are all one-to-many relationships, but you might model those in different ways, depending on how that works, even if they're all in the same table. So just thinking about the easy one, since we're talking about denormalization: if we think about a customer and their addresses, you're almost always going to be looking at that address within the context of a customer, right? So maybe a customer's checking out and you want to show them their address options; you're doing that within the context of a customer. So in that case, you could probably do some denormalization there and just store the addresses directly on that customer record, because you're not going to be accessing the address separate from the customer in any case. So that can be a particularly easy one. But then you have, "Hey, a customer has multiple orders, and sometimes we want to show the order history, sometimes we want to retrieve a particular order. How do we do that?"

Alex DeBrie:

Well, you probably don't want to use that same denormalization strategy of putting the orders onto that customer item, because as they make more orders over time, it's going to expand that customer item to where it gets really big. So you'll get slower response times, and you also pay more, because in Dynamo, you are paying based on the amount of data you're reading at a time, and at some point, you're going to hit a DynamoDB item size limit. You're actually not going to be able to add any more orders there. So you probably want to split that out separately and have an order item specifically. But then, when we get into single-table design, if we're showing that order history, often we want to get all the orders, but maybe we have some data that's normalized up onto that customer that we want to fetch with that order history page.

Alex DeBrie:

We can co-locate those together, give them the same partition key so they're located together, do a Query operation, and fetch the customer and the customer's orders, all in a single request, and show that summary view. And then you can do that same strategy with an order and the order items, right? If I want to click into an order and see all the different order items, maybe you arrange it so the order itself and the different order items are all located near each other. You can do that in a single Query operation and get those very efficiently.
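The customer-and-orders layout Alex walks through can be sketched with plain Python data. The `CUST#`/`ORDER#` key scheme is a hypothetical convention (a common one in single-table examples), not something prescribed by DynamoDB itself; the function here just simulates what one Query operation on the shared partition key would return.

```python
# Hypothetical single-table items: the customer profile (with addresses
# denormalized onto it) and each of its orders share one partition key.
items = [
    {"PK": "CUST#alexdebrie", "SK": "#PROFILE",
     "addresses": [{"label": "home", "city": "Omaha"}]},
    {"PK": "CUST#alexdebrie", "SK": "ORDER#2020-04-01#a1", "total": 49.99},
    {"PK": "CUST#alexdebrie", "SK": "ORDER#2020-05-12#b7", "total": 19.99},
]

def order_history(table, pk):
    # One Query by partition key returns the profile AND its orders,
    # already ordered by sort key ("#PROFILE" sorts before "ORDER#...").
    return sorted((i for i in table if i["PK"] == pk),
                  key=lambda i: i["SK"])

history = order_history(items, "CUST#alexdebrie")
```

One request yields everything the order-history page needs: the profile item first, then the orders in date order, because the date is embedded in the sort key.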

Rahul Subramaniam:

Yeah, it makes a lot of sense. And I think understanding all of those nuances is really, really valuable, because I see a lot of folks jumping straight into single-table design saying, this is... And there have been a bunch of folks who profess that you can pretty much solve any relational schema situation with single-table design, but it's not really that simple. You have to really think about what your query patterns look like. You have to design bottom-up based on the query patterns and then decide if single-table design actually works for you or not.

Alex DeBrie:

Yep. Absolutely. One thing I always tell people is, do the math. And that's why Dynamo's predictability and consistency are so helpful. They're going to charge you based on how much data you're reading and how much data you're writing, and if you have indexes, you have to pay additional writes for those sorts of things. But like you're saying, you have to figure out your access patterns, really dig deep on that: how often am I writing this item? How often am I updating it? How often am I reading it? What are my different patterns? How big is that item? Knowing all those sorts of things and the specifics of your access patterns is really going to drive how you actually model your data, which I think is different from a relational database, where there's generally one way to model it and then you try to figure out how to optimize your queries on top of that.

Rahul Subramaniam:

You're absolutely right. We find that every time we're trying to create a new DynamoDB schema, the minimum set of requirements we need to collect before we get started involves a list of the entities, of course, then the relative cardinality of those entities in the store, and then, most importantly, the queries or access patterns we're likely to see across those entities, but also the relative volume of queries we're making for each of those access patterns. And I think only once we have that list do we find that we're able to create a reasonably efficient schema for DynamoDB.

Alex DeBrie:

Absolutely. And I think you're right, you can optimize that pretty well. It's also just amazing to me that you can get a pretty good sense of that without having to do a load test, which isn't going to be representative of your actual data anyway. You can get a decent sense of your cost, where you can say, "Hey, we think we'll have 400 transactions per second doing these types of operations. These items are going to be this big. Here's the ballpark range of costs we're going to have." And you can even ask, what if we were off by 2X, 3X, 4X, 5X, whatever, and get a sense of what your costs are going to be, which I just have no idea how to do with a relational database. You can do some testing, but the load testing you're going to do is just not going to be representative of that stuff. So it's just a lot harder, I think, to get a good estimate on a relational database. So then you see people over-provisioning and paying for 8% utilization on their relational database.
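That "do the math" exercise really can live in a few lines. Here's a sketch of the back-of-the-envelope cost model, where the per-unit prices are illustrative placeholders (check current AWS pricing for your region), while the capacity-unit rules (one read capacity unit covers a strongly consistent read of up to 4 KB per second; one write capacity unit covers a write of up to 1 KB per second) are the documented ones:

```python
import math

# Illustrative placeholder prices per provisioned capacity unit per hour --
# substitute the current numbers for your region before relying on this.
RCU_PRICE_PER_HOUR = 0.00013
WCU_PRICE_PER_HOUR = 0.00065
HOURS_PER_MONTH = 730

def monthly_cost(item_kb, reads_per_sec, writes_per_sec):
    """Ballpark monthly cost for steady provisioned-capacity traffic."""
    rcus = math.ceil(item_kb / 4) * reads_per_sec   # 4 KB per RCU (strong read)
    wcus = math.ceil(item_kb / 1) * writes_per_sec  # 1 KB per WCU
    return (rcus * RCU_PRICE_PER_HOUR +
            wcus * WCU_PRICE_PER_HOUR) * HOURS_PER_MONTH

# 400 reads/sec and 40 writes/sec of 4 KB items; "what if we're off by 2X"
# is just a second call with doubled rates.
estimate = monthly_cost(4, 400, 40)
```

The same spreadsheet-simple arithmetic answers the 2X/3X/5X sensitivity questions, which is the predictability Alex keeps pointing at.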

Rahul Subramaniam:

Yeah. I think that's one of the fundamental mind shifts that happens when you come to DynamoDB, because when you start off with a relational database, the first thing you start worrying about is, "Okay, how many cores do I need? How much memory do I have?" Then you start looking at IOPS to see how much is going to get loaded in memory. You're literally focusing on all the wrong stuff. If your job is to design the data model and stuff like that, then having to deal with memory and your IOPS and your network IOPS, and how much network load there's going to be if your storage is on attached storage or whatever, you're spending so much energy trying to solve for stuff that may or may not be in your control. Like tomorrow, if there's a little jitter in the network, or something's wrong with one disk, or any of these parameters go awry, you basically end up in a situation where the performance is completely a function of the infrastructure, not the database itself.

Alex DeBrie:

Yeah, absolutely. Related to that, you're just talking about the depth of things you have to know about when you're modeling with relational databases, and I think it's interesting, the learning-curve difference between relational and Dynamo. I think it's easier to get started and know the basics of a relational database, and that's why it's pretty popular and people like it, but I think it's really hard to get to the point where you understand how all those different things are affecting your query, especially at high scale, when you talk about disk and your kernel and the cache and all these disk buffers and all sorts of things like that you have to know to really know that relational database well, and how it's going to hold up under pressure, network jitter, and all those sorts of things. Whereas with Dynamo, I think it's harder to get started, because you have to learn this completely different way of modeling. It's not the Excel table people think of with a relational database; it's just totally different.

Alex DeBrie:

But I think you can become an expert, where you have pretty good confidence in how this is going to perform at high scale and where you need to press on, and I think you can become an expert in that a lot quicker in Dynamo than you can in a relational database.

Rahul Subramaniam:

Yeah. And also, I feel like the parameters are so much simpler. It's a different paradigm, but in Dynamo, the number of parameters you have to juggle to get to the right solution is so much smaller. There was a time when I was trying to make sure that a relational store, which was basically a MySQL store, kept up once it got to a certain scale, and I literally had to get down to the level of understanding what the difference between the two storage engines was, how the engines were operating and what they were optimized for, and then literally going in and optimizing the engine parameters. That just felt like I was wasting energy in the wrong place. So beyond a certain basic scale, I almost feel like you have to go explore other data stores, with relational stuff falling apart at some point of scale.

Alex DeBrie:

Yeah, you can do some amazing things with relational databases, but again, you really need to get down into the details. You probably need to plan a lot in your application for how you're doing sharding or different things like that, where you're relaxing constraints anyway, whereas again, like you're saying, with Dynamo, you learn five key concepts and then you know the parameters to look for. I go to places all the time where I do a training, and by the end of it, they know 80%, 90% of what they're going to need to know for most of their data models. And if they have a really hard question, again, you just need to point them in the right direction and say, "Hey, here are the factors to consider." These are the same factors that we talked about before; it's just, how do you apply them to your situation? But they're basic factors. It's math you can do in an Excel spreadsheet pretty easily, rather than having to do a load test or really get down into the deep details of your hardware.

Rahul Subramaniam:

Thanks, Alex. This has been such a fun and insightful conversation so far, but unfortunately, we've run out of time for this episode, and I still feel like we have so much more to talk about. I'd love to explore DynamoDB further in an extended conversation. For the audience, if you've liked what you've heard so far, please rate and comment on the episode, and don't forget to join us next week as we continue this conversation with Alex about DynamoDB. Thank you.

Speaker 1:

We hope you enjoyed this episode of AWS Insiders. If so, please take a moment to rate and review the show. For more information on how to implement a hundred percent safe AWS recommended account fixes that can save you 10% to 20% of your AWS bill, visit cloudfix.com. Join us again next time for more secrets and strategies from top Amazon insiders and experts.

Thank you for listening.