Develop Yourself

#265 - The 4 Databases Every Developer Should Know (Including 2 You Probably Don't)

Brian Jenney

When your only tool is a hammer - everything looks like a nail.

2 of these databases I'm sure you've heard of and 2 might be completely new to you.

Let's go past MongoDB and SQL to learn what tool is best for what job and what's the database choice for AI in 2025.

If you're interested in learning SQL, check out this episode: https://open.spotify.com/show/69BHCbRAl6rHT9LlNhFWUy


Shameless Plugs

🧑‍💻 Join Parsity - For career changers who want to pivot into software.

✉️ Got a question you want answered on the pod? Drop it here

Zubin's LinkedIn (ex-lawyer, former Googler, Brian-look-a-like)

Speaker 1:

Welcome to the Develop Yourself podcast, where we teach you everything you need to land your first job as a software developer by learning to develop yourself, your skills, your network and more. I'm Brian, your host. Today we're gonna talk about the four big species of databases out there, the different flavors of how to store your data, when to use them, when not to use them. What are they? A couple you're gonna be familiar with and a couple I can almost guarantee you've never heard of. But it's really important to know this stuff because then you know which one to use for what type of problem you're trying to solve. Too often, especially people that go to coding boot camps or maybe are self-taught they kind of default to a couple really common databases, which are going to be either SQL or Mongo. In fact, if you went to a coding bootcamp, you're almost certainly gonna be using Mongo. At Parsity, we also used to only teach Mongo. Now our curriculum is more customized and we teach beyond just MongoDB. Now you don't have to be an expert in all these databases to be a dangerous developer, but you should at least know what they are and I think, if nothing else, they can be interesting to understand what's out there, because if your only tool is a hammer, then everything looks like a nail. But before we get off into the four species of databases, what even is a database? A lot of developers, especially early career ones, or maybe front-end developers that are into their career, aren't really comfortable with databases, and I felt like this as well. My first job, which I've spoken about on this show, was at a company where I had to learn SQL and IBM DB2, a really old kind of archaic database, and this was basically my trial by fire to learning about databases. 
A database is just a structured way to store and retrieve data quickly and predictably. Whether you're on TikTok or Instagram or watching this on YouTube, whatever, all this data is stored in a database for persistence, meaning that when you go away from a site, away from some social media app, your likes are saved, the things that you've watched are saved, your username is saved, passwords, all sorts of things that tell the website or the app who you are, what you've watched. They may store the videos, the images, whatever is needed for that site to give you the information at that time that is relevant to you.

Speaker 1:

Why does it matter what kind of database you pick? Performance, scalability, cost, developer sanity. If you make the wrong choice, it could potentially destroy your company, your project, whatever. Now, if you're just starting off building stuff, this isn't something you should really worry about, but it's important. The moment you step out of just building a simple CRUD app or something for your project or your boot camp, whatever, and you step out into the real world, it's important to have at least a bit of knowledge about what's out there and when it should be used.

Speaker 1:

Now I've seen the issues that can come with choosing the wrong database. I've been at companies, especially startups, that have chosen MongoDB just because it was really easy to set up and use out of the box. Eventually they realize that was the wrong choice and that the queries become really difficult to maintain, or the data is unnecessarily complex and messy, or doing data migrations becomes a real nightmare, and at this point they want to switch to something like SQL, but they can't because they're way too deep into MongoDB territory. Once you've chosen a database, it becomes really, really difficult, the further along you are in the development process, to then unwind those kinds of choices. So let's start off with the first type of database that I'm sure most people are familiar with.

Speaker 1:

This is the most popular type of database: relational databases. You can basically think of them like a spreadsheet with relationships, right? So in this example we might have users. A user can have an order. Let's say this is an e-commerce website. Users can have orders. What does an order contain? An order might contain products. These are related. You can even think of a classroom. In a classroom you have a teacher. A teacher has students. Those students might have homework. All that's related, right? Each piece of homework is related to a student. A student is related to a teacher. So a teacher will have many students, students will have many pieces of homework and a classroom will have a teacher. These are all related. And the reason why this type of database is so popular is because most data has some sort of structured, clear relationship. In fact, most things in our natural world kind of have some structured, clear relationship.
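The users, orders and products example can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production schema: the table names, column names and sample rows are all made up for the example, and the `order_items` table is one assumed way of letting an order contain several products.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

# Hypothetical schema: every table and column name here is illustrative.
conn.executescript("""
CREATE TABLE users    (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE products (id INTEGER PRIMARY KEY, title TEXT NOT NULL, price_cents INTEGER NOT NULL);
CREATE TABLE orders   (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL REFERENCES users(id));
CREATE TABLE order_items (
    order_id   INTEGER NOT NULL REFERENCES orders(id),
    product_id INTEGER NOT NULL REFERENCES products(id),
    quantity   INTEGER NOT NULL
);
""")

conn.execute("INSERT INTO users VALUES (1, 'Ada')")
conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                 [(1, "Red running shoes", 8999), (2, "Socks", 999)])
conn.execute("INSERT INTO orders VALUES (10, 1)")
conn.executemany("INSERT INTO order_items VALUES (?, ?, ?)", [(10, 1, 1), (10, 2, 3)])

# Follow the relationships: user -> orders -> order items -> products.
rows = conn.execute("""
    SELECT u.name, p.title, oi.quantity
    FROM users u
    JOIN orders o       ON o.user_id = u.id
    JOIN order_items oi ON oi.order_id = o.id
    JOIN products p     ON p.id = oi.product_id
    ORDER BY p.title
""").fetchall()

# The database itself rejects an order that points at a user who does not exist.
try:
    conn.execute("INSERT INTO orders VALUES (11, 999)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

The last step is the point of the relational model: the relationship is a rule the database enforces, not just a convention in application code.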

Speaker 1:

Think of a family tree. You have a mother and a father. They will have multiple children. Those children will have multiple children or cousins or other hierarchies and relationships that are pretty clearly defined. Same thing when it comes to massive organizations, corporations. They'll usually have some sort of product and users. Those users will have some sort of transactions on the site. Things are orderly, which makes a lot of sense. I mean, that's how big organizations run. Imagine if things were all messy and sometimes you had an order, but sometimes you didn't. Sometimes you had a transaction, but sometimes you didn't.

Speaker 1:

Think of your favorite social media app. If you go on that social media app, I can guarantee it has a very, very clear structure. You will typically have an image or a video. That video will belong to a user. That user will have followers. Those followers will give likes and comments and engagement on a post. That's a very clear relationship. Those few things make up the entirety of what you're seeing on that site. Go to something like Airbnb. Look at any site and just look at how organized the data is.

Speaker 1:

This typically lends itself to well-structured data, and for these relationships a SQL database or any relational database, whether it's PostgreSQL, MySQL or one of the many SQL-type databases out there, makes a lot of sense to use off the bat, and people should honestly probably just default to using SQL databases. It's so funny that people actually default to using NoSQL databases, which we'll get into in just a second. The reason why people often avoid these types of databases is because they have very strict rules. You have something like a teacher and a teacher will have a classroom or something, and then something changes. You say, you know what? Maybe teachers can belong to multiple classrooms, or maybe we need to change this hierarchy or this relationship. These changes aren't so simple to do in a SQL database. I gave some pretty contrived examples, but these aren't like super, super simple to do in a SQL database. You'll have to update relationships. You may have to run migrations. Migrations are essentially when you say, okay, let's look at all the data, see what's wrong and let's make the proper relationships and hierarchies out of all the data we may have. That might be billions of rows, millions of rows. So it's not a trivial thing to do. And that leads us to why people often default to NoSQL databases, MongoDB being the most popular choice here.
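To make the teacher and classroom change concrete, here is a small sketch of what such a migration might look like with sqlite3. The schema, the `migrate` helper and the sample rows are all hypothetical. A real migration would normally go through a migration tool and would also retire the old column, which is skipped here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Original (hypothetical) design: each teacher points at exactly one classroom.
conn.executescript("""
CREATE TABLE classrooms (id INTEGER PRIMARY KEY, room TEXT NOT NULL);
CREATE TABLE teachers (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    classroom_id INTEGER REFERENCES classrooms(id)
);
INSERT INTO classrooms VALUES (1, 'Room A'), (2, 'Room B');
INSERT INTO teachers VALUES (1, 'Ms. Rivera', 1), (2, 'Mr. Chen', 2);
""")

def migrate(conn):
    """Move from one classroom per teacher to many, keeping the existing links."""
    with conn:  # commit if everything succeeds, roll back if anything fails
        conn.execute("""
            CREATE TABLE teacher_classrooms (
                teacher_id   INTEGER NOT NULL REFERENCES teachers(id),
                classroom_id INTEGER NOT NULL REFERENCES classrooms(id),
                PRIMARY KEY (teacher_id, classroom_id)
            )""")
        # Copy every existing link into the new join table.
        conn.execute("""
            INSERT INTO teacher_classrooms (teacher_id, classroom_id)
            SELECT id, classroom_id FROM teachers WHERE classroom_id IS NOT NULL""")

migrate(conn)

# After the migration a teacher can belong to a second classroom.
with conn:
    conn.execute("INSERT INTO teacher_classrooms VALUES (1, 2)")

links = conn.execute(
    "SELECT teacher_id, classroom_id FROM teacher_classrooms ORDER BY teacher_id, classroom_id"
).fetchall()
```

With two rows this is instant. The copy step is the part that gets expensive when the table holds millions or billions of rows.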

Speaker 1:

So we just talked about the relationship thing and the hierarchies and trees and social media posts and all this kind of stuff. Now, NoSQL databases. Think about it like this: they're basically like really, really big objects in the cloud. Of course I'm oversimplifying this. They're called documents. So instead of having rows like in SQL databases, like when you think of a spreadsheet, for example, which is a nice mental model for SQL or relational databases, NoSQL databases have documents, basically these big JSON-like documents. There's no strict schema. You can basically shove whatever you want in these things. There are no enforced relationships. So maybe if you have teachers and students or products and orders and things like that, they could just be nested under the person. They have workarounds for this kind of stuff. But there's nothing stopping you from just making these massive objects, and then something changes and you say, you know what, we're gonna call transactions payments or something like that. We'll just change the name. Now that doesn't mean you can't think through how these things should work and the data contracts, basically what keys and what values belong where. But it means that you just have a lot fewer barriers around that kind of thing and it's much easier to step on a footgun.
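The renamed-key problem can be shown without any database at all. The sketch below uses plain Python dictionaries as stand-ins for JSON-like documents. The field names and the `total_spent` helper are invented for illustration, and no real document database is involved.

```python
# Two "documents" in the same hypothetical collection. Nothing forces them to agree.
order_v1 = {
    "user": {"name": "Ada"},
    "transactions": [{"amount": 89.99, "items": ["Red running shoes"]}],
}
order_v2 = {
    "user": {"name": "Grace"},
    "payments": [{"amount": 9.99, "items": ["Socks"]}],  # same idea, renamed key
}
collection = [order_v1, order_v2]

def total_spent(doc):
    """Sum the amounts, tolerating both the old and the new key name."""
    entries = doc.get("transactions") or doc.get("payments") or []
    return sum(entry["amount"] for entry in entries)

# Code that knows about both shapes still works.
totals = {doc["user"]["name"]: total_spent(doc) for doc in collection}

# Code written against the old shape silently sees nothing for the new document.
naive_counts = [len(doc.get("transactions", [])) for doc in collection]
```

Nothing fails loudly here, which is the footgun: the second document simply looks like it has no transactions.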

Speaker 1:

Also, even though these are really easy to get started with, they often become really complicated when it comes to querying the data. When you're using something like SQL, the queries often follow a syntax that feels more common, more natural, to English speakers. Right? You're like: select this from this table. If you've worked with Excel, then you kind of already know a little bit about working with SQL, and you're like, oh cool, I can do some really, really simple select statements really quickly.

Speaker 1:

When it comes to using something like Mongo, these queries are absolutely not intuitive at all. You'll have to write aggregation queries. You'll have to decide whether you wanna nest things under other objects. You don't always know if the data is gonna be returned in the shape that you thought, because there's nothing really stopping people from changing the names. Maybe my bias is showing, because I've seen just how poorly these often fit into organizations that are trying to go fast. They use Mongo when they really should have just used SQL.
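For a side-by-side feel, here is one question asked both ways: total paid amount per user. The SQL version runs against an in-memory SQLite table with made-up data. The MongoDB aggregation pipeline is shown only as a data structure for comparison and is not executed, and the collection and field names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id INTEGER, status TEXT, amount_cents INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "paid", 500), (1, "paid", 700), (2, "paid", 300), (2, "refunded", 9000)],
)

# SQL: reads close to the question being asked.
sql = """
    SELECT user_id, SUM(amount_cents) AS total
    FROM orders
    WHERE status = 'paid'
    GROUP BY user_id
    ORDER BY total DESC
"""
totals = conn.execute(sql).fetchall()

# The same question as a MongoDB aggregation pipeline.
# Shown for comparison only; it is not run in this sketch.
pipeline = [
    {"$match": {"status": "paid"}},
    {"$group": {"_id": "$user_id", "total": {"$sum": "$amount_cents"}}},
    {"$sort": {"total": -1}},
]
```

Both express the same filter, group and sort steps. Which one reads more naturally is a matter of taste, but the pipeline form tends to grow quickly once joins or nested arrays are involved.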

Speaker 1:

But here's the thing: they are really, really useful in certain ways. If you're in the prototyping stage and your data model is changing really, really fast, maybe you don't want to use SQL. That's why a lot of startups just kind of default to using Mongo or some other store like it, basically these big document NoSQL databases, because they don't want to be constrained by having to write more schema changes, write more migrations and having to really think through all the data relationships and hierarchies when they're like, we're just trying to figure out what we're going to do. We're building a prototype. So, in this case, using something like MongoDB or whatever is out there is a really, really good choice. And if you're at a coding boot camp where they want you to move really, really fast and you are building things where you don't necessarily know the relationships, or you're just kind of experimenting and prototyping, this can be a very good choice. Now, obviously I've very much oversimplified how easy it is to mess up with these things. There are schema libraries and other tools that you can use with these types of databases to enforce something that feels more relationship-like, but it's not really their intended use case. So if you ever find yourself working with something that has very clear relationships, very clear hierarchy and is past that stage of just prototyping, you probably wanna move into a database that can support relationships.

Speaker 1:

Now let's move on to a couple of databases I'm almost certain you've never heard of. This one is a search index database: OpenSearch or Elasticsearch. OpenSearch started as a fork of Elasticsearch. These are full-text search databases. These types of databases are great for text-based searching. So let's say you have a large list of products or maybe a massive list of error logs for a massive backend application or something like that. You can type in something like, hey, I want red running shoes that are under $100. Or you're looking for a combination of error logs like line 42, error plus JSON or something like that, and you can look this up really fast and it's ranked by relevance. This is really, really cool. Now, the technology that underlies this is beyond my scope of understanding, right, but I've used these databases before for storing documents, and the cool thing is that you can do these really interesting and complicated text-based searches. You can do fuzzy matching, so you can look for something that is spelled kind of like a certain word, or you can type a sentence and find documents that may match parts of or all of that sentence. You could be looking for a very specific set of keywords or something like that that may have words in between them. This is really, really powerful. And I go back to the idea of product searching. Now, somebody searching for a product online is not always going to spell the right thing, right? They might put like red running shoes with an extra N or something like that in there. They may put like red plus running plus shoes plus Nike, or something like that. This is when these types of search indexes can be really, really helpful to use. So if you're doing lots of text-based searching, this is the kind of database you should think of supporting.
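A toy version of the idea, using only the Python standard library, might look like the sketch below. It is an assumption-heavy illustration and not how Elasticsearch or OpenSearch work internally: the product data is invented, the ranking just counts matched words, and the fuzzy step leans on `difflib.get_close_matches` rather than a real search analyzer.

```python
import difflib
from collections import defaultdict

# Hypothetical product catalogue: id -> searchable text.
PRODUCTS = {
    1: "red running shoes nike",
    2: "blue running shorts",
    3: "red leather wallet",
}

# Inverted index: each word maps to the set of product ids that contain it.
index = defaultdict(set)
for doc_id, text in PRODUCTS.items():
    for word in text.split():
        index[word].add(doc_id)

def search(query):
    """Return product ids ranked by how many query words they match."""
    scores = defaultdict(int)
    vocabulary = list(index)
    for word in query.lower().split():
        # Fuzzy step: map a possibly misspelled word to the closest indexed word.
        close = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.75)
        if not close:
            continue
        for doc_id in index[close[0]]:
            scores[doc_id] += 1
    # Highest score first; ties broken by id so the order is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

results = search("red runing shoes")  # note the misspelled "runing"
```

The misspelled word still lands on "running", and the product matching all three words comes back first. Real search engines use far more sophisticated text analysis and relevance scoring, but the index-then-rank shape is the same.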

Speaker 1:

I worked at a company where we essentially were scraping or ingesting the entirety of Twitter on a daily basis. Every new tweet that came in, we were ingesting it, we were indexing it and we were putting it in an Elasticsearch database. Then our users, who were government agencies, politicians, army, military, could look for certain keywords or sets of keywords to understand what was the risk, what was the sentiment around certain words, hashtags, other things online. This was before the rise of AI, where you could probably do this in a much different way, but this was really helpful for those agencies to see, like, what's trending right now based on these keywords. They could even construct these massive, complex queries where you could say it includes these keywords, but not this keyword and this keyword, or only include this keyword if this keyword is present. They could make these really crazy long queries that we'd run in here and, boom, you'd get all the data that day in that timeframe that matched this crazy query, and you could understand, hey, how is President Trump, for example, doing? Or Kamala Harris, who at the time was running for president, or Biden or whatever. All the hottest topics you could imagine on Twitter and all the terrible things people were saying could come back in an instant for some government agency that we don't know, to understand whatever it is they were trying to understand that day. Kind of creepy stuff we were building, actually, but really, really cool, and that was my experience using Elasticsearch.
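The "includes these keywords, but not this keyword" style of query can be sketched in a few lines of plain Python. The `must` and `must_not` names are borrowed loosely from the clause names in Elasticsearch's boolean queries, but this is a toy filter over an invented list of posts, not the real query DSL or the system described above.

```python
# Hypothetical posts standing in for an indexed stream of tweets.
TWEETS = [
    "election debate tonight was wild",
    "new running shoes for the marathon",
    "debate about tax policy before the election",
    "election results delayed again",
]

def matches(text, query):
    """True if the text has every 'must' word and none of the 'must_not' words."""
    words = set(text.lower().split())
    has_all_required = all(word in words for word in query.get("must", []))
    has_excluded = any(word in words for word in query.get("must_not", []))
    return has_all_required and not has_excluded

# "Mentions the election and the debate, but is not about tax."
query = {"must": ["election", "debate"], "must_not": ["tax"]}
hits = [tweet for tweet in TWEETS if matches(tweet, query)]
```

A search index answers the same kind of question without scanning every document, which is what makes it workable at the scale of millions or billions of posts.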

Speaker 1:

Now, this is overkill for small datasets. If you have like 10,000 documents, then maybe just use another type of database. This is not free either. This is really, really massive. This is for things like what we were doing, ingesting data at Twitter scale. When you have millions or billions of documents and you need super fast retrieval and searching, this is a really good way to do it. Basically, if searching is your product, you should consider using Elasticsearch, OpenSearch or some other type of text-based retrieval.

Speaker 1:

And finally, let's end off with vector databases. These are the databases for AI. This has become the default database for AI, and these are somewhat similar in many ways to the search index ones, at least in my head. So the way these work is they also store lots of text, but it gets stored as vectors. Vectors meaning long lists of numbers. Essentially what they do is take either a large amount of text or a small amount of text and they vectorize it, they embed it, they turn it into a series of numbers that can encode meaning, importance of the words, emotion. It encodes the sentence or the words and transforms them into meaning using a bunch of numbers. These numbers can almost be thought of as lines or coordinates in a really high-dimensional space. So they're embedded.

Speaker 1:

We take a word, we embed it, we put it in a high-dimensional space you can think of like a graph. This is a really simplified version of it, but think of a graph with a bunch of lines, and each of these lines represents a word or a sentence or something, some embedding, something that was taken, made into numbers and put on a graph somewhere. And so when a user types something in, they will embed the query the user typed in and they will match it against all the different vectors, these lines and these plots in a high-dimensional graph, and they will see which ones it is close to. Is it close to many of them, is it close to a few of them? What are the top ones it's close to? And then it will return those back to you.
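The "which ones is it close to" step is commonly done with a similarity measure such as cosine similarity. The sketch below uses hand-written three-dimensional vectors purely for illustration. Real embeddings come from a model and have hundreds or thousands of dimensions, and real vector databases use approximate indexes rather than comparing against every stored vector.

```python
import math

# Hypothetical 3-dimensional "embeddings", invented for this example.
VECTORS = {
    "dog":   [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "car":   [0.0, 0.9, 0.4],
    "truck": [0.1, 0.8, 0.5],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def nearest(query_vec, k=2):
    """Return the k stored names whose vectors are most similar to the query."""
    ranked = sorted(
        VECTORS,
        key=lambda name: cosine(query_vec, VECTORS[name]),
        reverse=True,
    )
    return ranked[:k]

# Pretend this vector is the embedding of what the user typed.
top = nearest([1.0, 0.1, 0.0])
```

The query vector points in nearly the same direction as "dog" and "puppy", so those two come back, while "car" and "truck" sit in a different part of the space.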

Speaker 1:

Now, why is this used in AI? Because this is the default for retrieval augmented generation. This is known as RAG. This is something that I've been really interested in and I've been working on at the last two companies I've been at. It's a pretty new field in many ways, because I don't see a lot of full stack developers talking about this or knowing about it. We even hired some consultants that were also kind of learning as they go along with this. So what is retrieval augmented generation? It's fetching the right chunks or the right vectors of data and then feeding them to a large language model like one of OpenAI's models.

Speaker 1:

To answer a question, think about this. A couple of nights ago, me and my daughter were looking for a movie to watch on Comcast, and the way that we had to do this was like typing in the literal name of a movie. Lame. In 2025, that doesn't make sense, right? [no transcript] That should be the type of search we can do. Now, how a vector database might support this kind of thing is it would embed what I've said, it would encode that to a series of numbers and then would look in a high-dimensional space at all the movies and the descriptions, which may be vectorized along with their start times and actors and the years they were made, and then it would return me the five most relevant ones. It would feed those to a large language model along with my initial query and it would say, well, based on what you want, I think these five movies might be the best for you.
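The retrieve-then-prompt flow can be sketched end to end in plain Python. Everything here is a stand-in: the movie list is invented, `embed` is a crude bag-of-words counter rather than a real embedding model (so it matches shared words, not meaning), and the sketch stops at building the prompt instead of calling a language model.

```python
import math
from collections import Counter

# Hypothetical catalogue: title -> description.
MOVIES = {
    "Finding Nemo": "animated family adventure about a fish searching the ocean for his son",
    "Alien": "horror in space as a crew is hunted by a deadly creature",
    "Paddington": "family comedy about a friendly bear who moves to london",
}

def embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k=2):
    """Return the k titles whose descriptions are most similar to the query."""
    q = embed(query)
    ranked = sorted(MOVIES, key=lambda title: cosine(q, embed(MOVIES[title])), reverse=True)
    return ranked[:k]

def build_prompt(query, titles):
    """Put the retrieved descriptions in front of the question for the model."""
    context = "\n".join(f"- {title}: {MOVIES[title]}" for title in titles)
    return f"Answer using only these movies:\n{context}\n\nQuestion: {query}"

question = "funny family movie about a bear"
top = retrieve(question, k=2)
prompt = build_prompt(question, top)
```

The resulting prompt is what would be sent to a language model. The model then answers from the retrieved descriptions instead of from whatever it happens to remember.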

Speaker 1:

If you're not aware, large language models hallucinate. They just make up things because they are optimized to just give an answer. If you've ever asked a large language model like ChatGPT or Grok or whatever something, it will return you an answer, even if it doesn't know the answer. It's optimized to say something and this causes hallucinations, aka lying. And this is where vector databases and retrieval augmented generation come in handy, because you can reduce hallucinations, aka make it lie less, by saying no, here's some actual information that we have in our database and now you can tell this to the user.

Speaker 1:

This is a really big problem in the law, in medicine, in any kind of field where accuracy is paramount, right? You don't want your lawyer using ChatGPT, which famously has happened, and then citing a case which doesn't exist. You go to court, get laughed out of the courtroom and now you're out of all your money and Judge Judy's kicking you out because your lawyer was using ChatGPT, not using retrieval augmented generation. And you may think, why doesn't ChatGPT do this? Because they don't have a specific domain where they need to have a lot of knowledge. It's too much. This is why companies build this themselves, where they can do this on things like movies or their FAQs or their HR policies. There are a lot of really, really practical use cases for RAG and it's funny because it's just not talked about online as much, and I guess that's because it's not as popular or not as cool. It's like the unsexy stuff that is really, really useful, that companies are honestly paying a lot of money for and that I've been literally working on for the last year now. It's super fun.

Speaker 1:

So if you're using large language models and you need to have high accuracy, you should reach for a vector database. You should consider using RAG. Here's the thing: chunking, embedding and storing this stuff isn't free. You can get started for free with a database like Pinecone, but the act of retrieving the data and embedding it into these vectors is often not free. You'll have to use something like OpenAI or whatever and then you have to upload it in there.
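Chunking is the step that happens before embedding: long text is split into smaller pieces so each piece can be embedded and retrieved on its own. The helper below is a hypothetical sketch that splits on words with a small overlap between neighbouring chunks. The function name, the sizes and the word-based splitting are assumptions, and real pipelines often chunk by tokens, sentences or document structure instead.

```python
def chunk_words(text, size=50, overlap=10):
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    # Stop once the remaining words are already covered by the previous chunk.
    starts = range(0, max(len(words) - overlap, 1), step)
    return [" ".join(words[start:start + size]) for start in starts]

# Tiny example: 12 placeholder words, chunks of 5 with 2 words of overlap.
sample = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(sample, size=5, overlap=2)
```

Each chunk would then be sent to an embedding model and stored with its vector. The overlap is there so a sentence that straddles a boundary still appears whole in at least one chunk.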

Speaker 1:

Vector DBs are great for storing meaning. The search index databases we talked about previously, like Elasticsearch and OpenSearch, are really good for storing literally the text at scale. With a vector database you can use a lot fewer documents. Like, I'm building one that's gonna be a clone of me that's gonna be trained on all of my posts on LinkedIn, so that way, when I type that I wanna write a post that sounds like XYZ, it'll write something that sounds like me, because ChatGPT just doesn't sound like me, and that's what I'm using it for. That probably wouldn't be a great use for a search index, because I don't just need to find the data, I want to find it and then I want to feed it to a large language model so it can write something back. That's a great use case for vector databases, when you want to encode not only the terms and text but the meaning behind it. So just to recap, now that you know the different flavors of databases that are out there, at least a few.

Speaker 1:

We didn't go over every single one, obviously, but those are the most common ones that you're probably going to come into contact with. Now you can match the workload to the tool. So use SQL when it comes to structured data, relationships, transactions. This should honestly just be your default choice, in my opinion. Now, that being said, NoSQL is really good for fast iteration, flexible documents where things change a lot, where your schema is evolving, and if you're just getting started off, it's a fine choice to make. Honestly, it's something that you can back out of if you catch it early enough, when you're like, oh, we do have very clear relationships here, we should probably move to SQL, right? Vector databases are great for semantic similarity and using with AI. Remember, they encode meaning in a high-dimensional space. And then we have search for full text, plus filtering, plus keyword searches, relevancy. This is really good for searching massive amounts of text, like error logs or maybe the entire library of everything that's ever been written since the Bible or something like that. Then OpenSearch would be a really good tool to use.

Speaker 1:

I hope this was helpful and, as a side note, if I was a new developer right now, do you know what I'd be doing? I'd be making something using RAG. I'm going to be teaching this at Parsity, and I know that this is one of those topics that I'm personally really interested in. I don't think I'm going to be making a full course, like I originally planned to, because it's honestly not that difficult to get started with. I am going to do a walkthrough for Parsity students and other people that I've invited. Now, I may make this free or not so free in the future, but if you're interested in at least purchasing or getting some access to the recording after I'm done, I think it'll be around an hour and a half, maybe two hours, which is honestly all you really need to learn how to use vector databases, let me know. I'm happy to either. I don't know, either I'm selling it or I'm not, so one way or the other, I'll have it available for people, but this is going to be something that Parsity students will get access to in a live setting, because I think this is the most bang for your buck when it comes to learning a new database.

Speaker 1:

That being said, SQL would definitely be at the top of my list too, if I didn't know it, because most companies are hiring people that know some SQL. You can learn a little bit of SQL in like a weekend. You can learn enough SQL going to SQLZoo.net or even using something like Scrimba or Codecademy to get dangerous enough. I even have an episode on that which I'll link in the show notes, which walks you through exactly what I'd be doing to set up a SQL database, query it, make some sort of left joins, add some sort of data so I can actually have something to play with.

Speaker 1:

Anyway, I really hope this is helpful and I encourage you to take this a step further by actually using a couple of these databases, specifically SQL and vector databases with Pinecone. If you're using AI, hint hint, you probably should be using AI in your applications nowadays, because that's where the market is going. Anyway, I hope you found this helpful. Reach out to me at brian at parsity.io if you want access to that RAG slash vector database walkthrough and I will see what I'm feeling when it comes to the price for that thing there. See you around. That'll do it for today's episode of the Develop Yourself Podcast. If you're serious about switching careers and becoming a software developer and building complex software and want to work directly with me and my team, go to parsity.io, and if you want more information, feel free to schedule a chat by just clicking the link in the show notes. See you next week.
