Develop Yourself

To change careers and land your first job as a Software Engineer, you need more than just great software development skills - you need to develop yourself.

Welcome to the podcast that helps you develop your skills, your habits, your network and more, all in hopes of becoming a thriving Software Engineer.

#254 - What Do Data Scientists Actually Do? A Candid Conversation with Ryan Varley

July 06, 2025 • Brian Jenney

Have you ever wondered what data scientists and engineers actually do all day? Forget what you've seen in movies – it's not all neural networks and fancy algorithms.

Ryan Varley, Engineering Fellow at Magnite and experienced data leader, pulls back the curtain on the rapidly evolving world of data science and engineering.

The real work often involves what he colorfully describes as "really complex plumbing" – making processes more efficient, reliable, and scalable in ways that directly impact business outcomes.

Whether you're considering a career pivot into data, trying to understand how these roles fit within your organization, or simply curious about the mechanics behind today's AI revolution, this episode provides an accessible window into a complex and increasingly crucial field.

Connect with Ryan on LinkedIn.

Read Ryan's Newsletter for engineering leaders facing the hardest problem in scaling their impact: https://newsletter.ryanvarley.com/

Shameless Plugs

🧑‍💻 Join Parsity - For career changers who want to pivot into software.

✉️ Got a question you want answered on the pod? Drop it here

Zubin's LinkedIn (ex-lawyer, former Googler, Brian-look-a-like)

Speaker 1: 0:00

Welcome to the Develop Yourself podcast, where we teach you everything you need to land your first job as a software developer by learning to develop yourself, your skills, your network and more. I'm Brian, your host. Today on the Develop Yourself podcast, I'm here with Ryan Varley, engineering leader with a deep background in data science and data engineering. Welcome to the pod.

Speaker 2: 0:22

Likewise it's a real pleasure to be here, because I normally speak about data leadership type stuff and I don't always speak about data engineering so I'm really looking forward to it.

Speaker 1: 0:34

Okay, cool. Yeah, I had a dream a long time ago, when I was a full stack developer about 11 years ago, and the guy I was working with became a data scientist. He switched careers completely. He learned like R, he went to get a master's in statistics and I was like, oh, this looks so cool and I didn't really pursue it. He did. Now he works at a big data company and he made this full transition. And since then I've always been a little infatuated and a little you know not a little very impressed with data scientists and data engineers. And I want to know a little bit about you know one, how you get into the industry, what people really do. But before we start off there, can you just tell me a little bit about what you're doing now and what led you to a career in data?

Speaker 2: 1:15

Yeah, well, right now I'm a fellow of engineering at Magnite, which is an ad tech. The fellow is on the IC track, so you probably heard of staff and principal. This is one of those sort of levels, so it's where you have quite a wide impact role so you can move between teams solving hard problems, but you're not on the manager's track, so it's actually quite a fun role. I recently came from a leadership role so I was managing product data and engineering at a startup of about 25 people. So I've made that transition from data engineering or data science to data engineering, to leadership and then back to data engineering. So I think all paths lead back to data engineering. Yeah, so it was kind of the path.

Speaker 2: 1:53

When I started, I did my PhD, probably started it like 13 years ago or something, and at that time there wasn't a lot of data, like, data science was just up and coming. Not a lot of companies had it. So a lot of people I knew were like the first data hire in an org. Even insurance companies and stuff, like, they had people doing analytics and stuff, but data science, machine learning, big analytics databases, a lot of them didn't really exist. So it was quite a different time to now, where I think it's really powering AI and all the next level of things.

Speaker 1: 2:24

That's right. I feel like you're all having a really big moment right now. Is that fair to say? Data science, I think, blew up around that same period, because 11 years ago I first kind of got introduced to the term, and then data engineering. I was introduced to that term a little bit later, and now I feel like you're having a big moment with the rise of AI. Is that true?

Speaker 2: 2:43

Yeah, in some ways, I think we had a moment just before everyone else hit with AI. So when I first started again, we built, we did all our own machine learning. You had to do your own topic modeling. You had to build your own feature pipelines. Data wasn't in the right format, so you basically had to engineer that yourself. And that's why there's this joke about data scientists becoming data engineers or recovering data scientists, because you couldn't do what you needed to do, so you had to become a data engineer to do that, and then you ended up just doing that. But I think what really changed is you were doing all this manual stuff. There wasn't all these tools and stuff to help you and you had to build it all yourself.

Speaker 2: 3:19

Where now, if I want to do some topic modeling, I'm probably not going to start by building my own topic model.

Speaker 2: 3:25

I'm probably going to look on Hugging Face, which is a big directory of transformers, so prebuilt models, and Hugging Face became really big like a few years before ChatGPT launched.

Speaker 2: 3:35

So we were already in this place where you've kind of got a lot of off-the-shelf models, and then this generalized solution came along, where now, instead of even exploring a topic model, you can kind of say, hey, categorize this document into five topics. The other side to that, I will say, though, is they are less reliable. It's a lot quicker to build it with AI, but it's less reliable and they tend to cost a lot more. So, like, recently I saw someone try to swap one of our solutions from an old classifier to an AI-based one, and the AI-based one saved a bit of manual effort, but the cost was a thousand times higher, because, even though the token costs are getting cheaper and cheaper, they're still expensive relative to more basic solutions.

Speaker 1: 4:20

And you've already touched on a few different roles within data science, like data engineering, which is something that I'm kind of familiar with. I feel like I'm starting to do some of that at the company I'm at now, where basically I'm helping feed the data that our data science team needs to run all the black magic that they do to make predictions on, you know, influencers and things like that. But a lot of what I'm doing, and I'm curious if this fits into data engineering, the way I see it, is like scraping data from the web, cleaning it, putting it in store, and then essentially my role kind of ends at that point. Then the data science team does whatever they're doing and then generates some information that I can use in a front-end app to say, hey, here's some influencers, for example, that we think are going to go viral because of X, Y and Z reasons. Am I doing data engineering?

Speaker 2: 5:01

Yeah, I think the role can be super varied. It is specializing, in that you now have, like, ML engineers and MLOps as its own field, but I think it can be very broad, from doing what you're doing to expanding it a bit further: the data scientists are probably going to need to work the data a bit more, and you might productionize that and make it more efficient and make it work. Like, they might be making it work in a batch that takes a day and you might be making it work in real time. You also might be deploying one of their models and connecting it up so that their results go into a database, or it's an API service where people are requesting data and it runs live. So you can be doing all those different sorts of things. I think it's a very broad title. I thought of it, initially, as the job that was doing all the stuff that data scientists struggle to do.

Speaker 1: 5:50

I have a pretty high respect for data scientists. I've worked with them at basically every company I've ever been at, except for really, really small startups, and what impresses me the most about your kind is that you seem to have a blend of deep mathematics and statistics, and I'm like, how do you do that? I want to get into that, but what is your day-to-day like as a data scientist?

Speaker 1: 6:11

So, as a data scientist or a data engineer? Because I can talk to both. Oh, I think data engineer is probably closer to the types of career that people listening to this would likely go into.

Speaker 2: 6:22

Cool. Well, it really depends. I think you can have hugely varied projects, and I think that's why it's quite an interesting role, because you're rarely just adding a new endpoint. I almost think of it as, like, really complex plumbing, and you have to figure out solutions.

Speaker 1: 6:36

Hey, I hope you're enjoying this episode. Now, you know that I own an anti-bootcamp with my buddy Zubin, an ex-Google software engineer. If you're interested in not just learning how to code, and you know it's going to take more than three months, and you're serious about making a transition into a career in software, and you want to work with people that have done it before and are currently working at senior-plus levels, join me and Zubin at parsity.io/inner-circle. You can learn all about our philosophy, how we approach learning how to code and switching careers in a much different way, and how we have so much gosh dang success. If you're interested in being one of the few people that works with us this year, go and apply at parsity.io/inner-circle. And now back to the episode.

Speaker 2: 7:14

So most recently I was looking at this job, and this job takes hours and hours to run. It's actually broken up into probably like 10 steps, believe it or not. And it fails a lot and it's not very scalable. Like, I can see that it's going to struggle to scale much further. So my task is: make this cheaper, make it more reliable and make it scale. Like, it needs to be able to handle it. If it 10x's, can this solution still work? So that involves, for me, grokking what's going on. So what's the data coming in? What's it look like? What's it need to do? What does it currently do? I have a whiteboard that you can't see in front of me, and I literally whiteboarded it all out with all the inputs and outputs and stuff like that. I like the physicality. I have used, like, Miro and stuff before, so I'll do that. And it's not like "just add a new field". The input to that task is: go figure it out.

Speaker 1: 8:12

It's a pretty nebulous task, and it's not something where you can just easily say, "Hey, ChatGPT, tell me how to make this faster and cheaper."

Speaker 2: 8:19

No, because it's running at a high scale. I think ChatGPT works really well when there's other examples on the web. So we both use Stack Overflow a lot. I like never use it now, like, I think, most people, because I would just use ChatGPT. Yeah, exactly, it operates in almost a similar way, like if there's no examples of what you've done on the web, the AI struggles. It works really well when there's similar examples and it can kind of adapt it to your use case.

Speaker 1: 8:45

Yeah, I mean, I think that's the thing a lot of us are coming up against. Building maybe some front-end stuff is easy, you know, getting some generic code that does some generic stuff is super simple and cuts your time down, but as soon as you come across a novel problem, it just gets increasingly less efficient or useful, or even just straight up hallucinates and does things where you're like, why would you do that? It's like having a really not good junior engineer, but one you can cuss at without HR getting on your back.

Speaker 2: 9:12

Yeah, exactly. There are other tasks as well. Like, I've built simpler pipelines, like upgrading things to dbt, where you essentially write different bits of SQL and string them together, and that can be really nice and that can power BI dashboards. There can be data modeling in that. Or it could be like hooking up a new data source: we've got some new data, we need to plug it in, you can go and do that.

Speaker 1: 9:32

And why is this so important? Because I think a lot of people, you know, whether they work in corporate America or in a technology team, the data science team is typically core to the business. Can you explain a little bit of that relationship? Because they seem to have an interesting position within most organizations, you know. And, yeah, why is that role so important?

Speaker 2: 9:53

Yeah, it divides, right? So you have data engineering, which might be moving data around. You have data analysts, who are doing reporting, and you have data scientists that might be trying to optimize things, you know, using sort of ML solutions. So obviously the reporting: which org doesn't have a need for reporting? You know, how many things did we sell this quarter? How many campaigns did we run? This type of thing. Did this feature we deployed lead to any extra, you know, clients?

Speaker 2: 10:21

And all of this data needs to go somewhere, and I think we just gather more and more and more data on everything and we need systems in place to use that. And I think you see a lot that the companies that are leveraging the data properly are the ones that are doing really well. They're not guessing what works, they know what works. That's a big one, yeah. And on the data science side, there are problems that are just really hard to solve with other solutions. You might have rules-based, like heuristic-based stuff to categorize incoming traffic or categorize meetings and stuff like that, but data science often lets you get a higher accuracy or do it more efficiently. It lets you, like, boost it to the next solution, and there's very few other ways of doing that. It's almost like the heuristics don't scale. You can only get so complex with heuristics and then you're about learning all of the gaps.

Speaker 2: 11:12

Data scientists, I think, have the opportunity to drive just a lot of value, and you see that. But they have a hard job, because it's probably not uncommon for nine out of ten things you try not to work. You know, like, when you start a project, you don't necessarily know if it's going to work. It's like, can you do this? And it's like, maybe. Well, I think that's rarer in software engineering, when it's like, can you do this, can you add this to the API, or can you add this feature? You're like, yeah, it might take three times as long as I expected, but I'm pretty confident we can do it.

Speaker 1: 11:42

Oh, that's interesting. Okay, so you can get like a request, for example and it's like I don't know if that's possible because you probably have constraints Like do we have the data? Is it good enough to make some sort of inference upon? Can I extrapolate something out of this data? Is that what you're kind of talking about?

Speaker 2: 11:59

Yeah, it can be multiple things, so you might get a like can you classify? Like? One typical thing that a lot of companies tried to do was classifying customer support requests or something, or measuring sentiment, and you don't just need to like. You can do that, but to what level of accuracy? How many false positives and negatives are there and how much do they matter? So you can easily deploy a solution which is barely adding any value, and it's like oh, in order to add enough value, this needs to be right 95% of the time, or it might be higher, and you might spend three months in it and get to 80%, and at that point you're probably like I'm not sure I'm going to get to 95. But you need to question does that actually matter?
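Editor's sketch: the question of how many false positives and negatives there are, and how much they matter, can be made concrete with a decision threshold. The scores and labels below are invented for illustration, and `evaluate` is a hypothetical helper, not anything described in the episode. A strict threshold flags few items that are almost all right; a loose one flags many, with more mistakes mixed in.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    flagged: int
    precision: float
    recall: float


def evaluate(scores, labels, threshold):
    """Flag every item scoring at or above the threshold, then compare with the truth."""
    flagged = [label for score, label in zip(scores, labels) if score >= threshold]
    true_positives = sum(flagged)
    total_positives = sum(labels)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / total_positives if total_positives else 0.0
    return Metrics(len(flagged), precision, recall)


# Made-up classifier scores and ground-truth labels (1 = genuinely relevant).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30]
labels = [1, 1, 1, 0, 1, 0, 0, 1]

strict = evaluate(scores, labels, threshold=0.85)  # few results, high precision
loose = evaluate(scores, labels, threshold=0.35)   # many results, lower precision
print(strict)
print(loose)
```

Which threshold is acceptable depends on what a wrong answer costs the people receiving the results, which is a business decision rather than a modeling one.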

Speaker 2: 12:44

And so you spend a lot of time, because everyone always expects the solution to be perfect. They're like, every single one you send us needs to be right, and you're like, well, that's kind of impossible. Or I can do that, but I'm going to give you... you're expecting like 100 a day, and I'm going to give you five a day that are definitely right. And it's like, well, that's not enough. And it's like, well, yeah, that's the trade-off. It's like precision and recall. I can either give you lots and the precision goes down, or I can give you less and the precision goes up. So you tend to have these sorts of trade-offs you need to work with.

Speaker 1: 13:12

Oh okay, this is helpful to know. And also, I think there's a lot of people that are saying, why don't you just use OpenAI, just use ChatGPT, use, you know, whatever off-the-shelf models are out there right now. And I think the cost is pretty crazy, and I'm guessing the accuracy is not much better than what you might be able to produce in-house.

Speaker 2: 13:50

Yeah, so what I've tended to find is OpenAI or any sort of LLM has allowed me to, when someone comes with an idea, create a prototype of that idea within the day. You know, like, that's what it's created. A prototype might have taken weeks before; I can now do it in a day, right? But it doesn't always work very well. So I guess a really good example is we were using it to classify videos into standards. So, you know, you have an educational curriculum, so which curriculum standard it comes under. And people put it into ChatGPT, did it, and it spat out like three different standards, and they were like, it looks okay. Well, you've only applied it to like one out of 200,000 videos. When we applied it to all 200,000, what we found was it keeps recommending the same category. So whilst that one category looks good on that single example, it's actually very generic and there was a better category for it, and you didn't notice it until you found that it gave like 60,000 videos the same category, and you're like, well, that's not what we expected. Yes. But you don't see that when you're just working with individual examples. You need to run it over everything.
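Editor's sketch: one simple way to catch the "same category for everything" failure is to look at the distribution of predicted labels across the whole batch instead of eyeballing single examples. The label names and the 25% alert threshold below are invented for illustration, and `dominant_label` is a hypothetical helper.

```python
from collections import Counter


def dominant_label(predictions):
    """Return the most common predicted label and the share of items that received it."""
    counts = Counter(predictions)
    label, count = counts.most_common(1)[0]
    return label, count / len(predictions)


# Hypothetical model output for a batch of videos; the standard names are made up.
predictions = (
    ["STD-GENERIC"] * 6
    + ["STD-FRACTIONS"] * 2
    + ["STD-GEOMETRY", "STD-ALGEBRA"]
)

label, share = dominant_label(predictions)
if share > 0.25:  # arbitrary alert threshold, chosen only for this example
    print(f"{label} was assigned to {share:.0%} of items; check whether it is too generic")
```

A skewed distribution is not proof of a bug, since some categories really are common, but it is a cheap signal that is invisible when checking one output at a time.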

Speaker 2: 14:45

And then it does run into the issue that you said, where if it's costing you, say, I don't know, a penny a time, and you're doing it a lot, that all adds up. Because I bet whatever service API you've built, it doesn't cost a penny per request. It tends to be...

Speaker 1: 15:03

Yeah, absolutely not. That'd be terrible.

Speaker 2: 15:04

Exactly. So although the numbers seem small in an individual example, they don't always add up, and sometimes it could be more than that. If you choose some of the bigger models now and you're doing more complicated things, bigger context windows, you can spend pounds per request. I didn't even know that was possible.

Speaker 1: 15:20

Yeah, you don't notice it when you're using ChatGPT.

Speaker 2: 15:23

But especially when you increase the size of that input context window and the output context window, the larger models, the cost can add up a lot. You can do the maths on it. I've run an LLM query that cost 15 pounds before.
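Editor's sketch of "doing the maths": cost per request is input tokens times the input price plus output tokens times the output price, and the batch cost is that multiplied by volume. The "penny a time" figure and the 200,000 videos come from the conversation; the per-million-token prices and token counts in the second example are made up and are not any provider's real rates.

```python
def request_cost(input_tokens, output_tokens, price_in_per_million, price_out_per_million):
    """Cost of one LLM call, given token counts and per-million-token prices."""
    total = input_tokens * price_in_per_million + output_tokens * price_out_per_million
    return total / 1_000_000


def batch_cost(cost_per_request, n_requests):
    """Total cost of running the same kind of request n_requests times."""
    return cost_per_request * n_requests


# "A penny a time" over the 200,000 videos mentioned earlier, in pounds.
penny_batch = batch_cost(0.01, 200_000)
print(f"Batch at a penny per request: about £{penny_batch:,.0f}")

# Hypothetical large-context call: illustrative prices only.
big_call = request_cost(
    input_tokens=500_000,
    output_tokens=20_000,
    price_in_per_million=10.0,
    price_out_per_million=30.0,
)
print(f"Single large-context request: about £{big_call:.2f}")
```

With these assumed numbers the input side dominates, which matches the point that growing the context window is what pushes a single request into pounds.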

Speaker 1: 15:36

Oh my, I would love to see this. That's nuts, man. That is interesting, because I think that's the issue right now with a lot of people, I think, especially in, like, leadership, or people that are maybe a little non-technical, and they're thinking, just use ChatGPT or OpenAI for everything. There's no need to do anything anymore. And they just assume the output is correct. And I'm like, this is a bit of an interesting slash dangerous game. Also the opportunity to educate, I think, a lot of people about what can and can't be done with the model. Yeah, prototypes: what used to take a week now takes a day. It's amazing. Going beyond the prototype to something useful that actually has real connections and is secure, stable, extensible, it's just not doing that. But I feel like that's not its job. I'm like, okay, it's great at what it's doing, and let's not try to just use it as a Swiss Army knife, or, like, just use it for every single thing.

Speaker 2: 16:28

Yeah, it's really... I want to get into the top. Like, product can have this idea, but now they can literally build the idea and show you and be like, look, it works and it's got all these interactive things. We don't have to, like, pair with a designer to build it in Figma; we can kind of build it and show people. It's the same time to build it as it was before, but that is super valuable that they were able to do that.

Speaker 1: 16:49

That's, yeah, I love that take, because that's exactly what I'm experiencing. I'm like, it's cool, because, no shade on Figma or designers, but it's really cool that here's what's in my head, and then type it into a prompt and spit out something and then show it to the developers. Like, here's what I want. Like, okay, cool, now we can do it. It takes hours now to do that.

Speaker 1: 17:04

I want to switch gears and talk about this career path, because I have some of my own ideas and thoughts. People come to me a lot of times. I own Parsity, which is a coding mentorship program, and we work with people that want to get into full stack software engineering. We have people go to front end, back end. Every once in a while we'll have somebody go into data engineering. But when people say, I want to become a data scientist, I am pretty open. I'm like, for one, I'm not an expert, or I don't know enough about that. My impression is, from all the data scientists I've met, they either have master's degrees or PhDs and have a deep background in mathematics as well. I don't think you can get that at a boot camp style program. Is this correct? Can you get into data science going on kind of the non-traditional path?

Speaker 2: 17:46

Yeah, I think you can, and I know people that have, and I know people that have done very well. So I think when I started, the main path was kind of you've done a PhD, or you are a sort of industry-experienced analyst and you have all the business context, you know the data really well, you have all of those skills, and now you're learning Python and machine learning and stuff. There are a lot of boot camps now. So I do think you can do a data science boot camp, but the biggest thing is you kind of have to care, right?

Speaker 2: 18:20

A lot of people want to do a data science bootcamp because they're like, I just want to build machine learning models all day, and it's like, that's not what you'll be doing. And, like, what even is building a machine learning model? It's often doing analysis on what just happened, trying something, letting it run and then doing analysis again. Most of your time is spent doing analysis and feature engineering and stuff like that. You're not tweaking the system. A lot of people have this sort of false impression that they're hand-coding neural nets or something all the time, and obviously some people are doing that in research, but that's just not the common situation.

Speaker 2: 18:53

Honestly, some of the best data scientists I've hired have been career switchers. Because they've had a job, they've found out what they like about work, they've found out what they like to do, they've found out that it is more data science-y, and then they've put the effort in to go to, like, a boot camp to learn that and grow into that role. And I find that those people, although they're maybe starting a bit lower, have so much potential. They just learn super fast because they're hungry, they're motivated, and they can really learn and progress in the field really quickly.

Speaker 1: 19:24

That's really cool to hear. And that's great to hear from somebody that's knowledgeable, because I'm always a little apprehensive telling people like one I don't know. You should talk to somebody that does know about that particular field. And then just my impression of the data scientists I've met over the years has been like they don't seem to come from.

Speaker 1: 19:40

But then I think about the guy that I met 11 years ago. He had a very interesting story, but he only had a high school degree and he was the business expert at this large grocery chain where we were on the web development team, and he had this incredible business knowledge and he was an expert at SQL, and so that was a natural path for him to go into data science.

Speaker 2: 19:59

Yeah, because some of the makeup of a data scientist is all of these different skills: sort of domain knowledge and appreciation for data and SQL. All these things are skills, and I do find that people tend to exist in different places on the spectrum. So one thing that I often asked sort of more junior and mid-level data scientists when I hired them was, and it's a little bit old now, but: if you imagine a triangle, and in one corner is machine learning and one corner is data engineering and in one corner is analytics, where do you see yourself on that triangle? So, like, are you fully towards machine learning, are you kind of halfway between machine learning and data engineering, but not quite over to analytics? So kind of, like, place yourself in there. And then I asked, where would you like to be in that triangle, to try and get an idea of where they are and where they want to go.

Speaker 2: 20:44

And I think a lot of people end up sitting somewhere near the middle. They tend to shy away from one aspect, so they'll shy away from machine learning or data engineering or analytics, and then that's how you form your team right. So you've got someone that's really into machine learning and data engineering. It's like okay, well, maybe I need to hire someone that's stronger in analytics and these sort of skills can mesh together, but you don't expect someone to be like super strong in all of them.

Speaker 1: 21:06

That's encouraging and I'm sure people that are listening that really want to get into this field. And it's hard because I hear like machine learning, data engineering, data science, and I'm sure that there's some skills that cover all those bases. But I'm sure each one has its own unique set of skills that you also need. But what are some things a person would need to? Let's say, I'm kind of starting from scratch. Maybe I know a little bit of code and I'm like I really want to get into machine learning. That's one of the hottest industries right now that a lot of people are wanting to get into, either from software engineering or that are thinking of switching into tech in general. What do they need? What do they need to know?

Speaker 2: 21:44

Yeah, I mean, machine learning might be one of the harder ones, because there's just not a lot of pure machine learning roles and they're quite competitive, and obviously the people in them are quite knowledgeable. But I think a lot of it is attitude. Like, you've got to be curious and skeptical and you've got to care about it. You've got to be a really good learner, because you're going to be just learning stuff all the time. You're going to have to go out and research things. You're constantly doing stuff you don't know how to do. You're constantly trying to find new ways of doing things. So that attitude is really important. And then, yeah, pick up some of the basics. Like, I don't think I've ever hired anyone that knew all of the basics. They will be really good at Python and they will have done some more sort of traditional machine learning. Or they will be really good at the sort of new stuff, like using transformers and fine-tuning them. Or they'll have done a lot of building APIs and stuff and they've done some analysis, but they've never done any orchestration or distributed computing. So you're going to have gaps.

Speaker 2: 22:39

You can kind of find lists of roughly what you'd need to know. I've mentioned a bunch of it there, and data modeling is another one as well. Just start learning the ones that you care about the most and work from there. You'll find a role that works for you, because some are very generalist, like I've described, and that's kind of what my background's been. Some are very specialized. You might join a team and only do analytics. You might only do machine learning literally on one system. At Google, you can imagine there'll be someone that's literally only working on the search algorithm or the ads algorithm or one of these other things, and you've got this spectrum. So it's a very broad area.

Speaker 1: 23:17

You gotta know Python, I'm assuming, right? Like, that seems like the language of data.

Speaker 2: 23:24

I think for the most part, yes. There are some roles that are quite fond of R. They tend to be more when you're doing statistics and analysis. R can be a very powerful language for that. They're less likely in those roles to be deploying the solutions themselves. They tend to be more specialized into insights and analytics and some machine learning. But that's partly because a lot of people come out of university with those skills, with that language, and that language is kind of built for that use case. But for the most part, yeah, I think Python is probably your best bet.

Speaker 1: 23:57

Yeah, I know enough Python, but I had to actually rewrite a lot of the scripts from the data team into TypeScript, because I was using them in this full-stack web app and I'm like, do I really want to? I just want to do everything in one place. So I felt like I was doing some evil data, some anti-pattern stuff. I'm like, let me just convert all this to TypeScript. Worked pretty well. I mean, it still works. You know, I've definitely felt like a fake data engineer so far at this role, doing everything in TypeScript. And now the data engineering, the other couple of data scientists, like, I kind of see what's going on, but I'm TypeScript. We're working in a bit of an interesting way.

Speaker 2: 24:31

Well, that's why it's hard. You're working at the interfaces to all the other teams and, yeah, you can't just say, no, this needs to be Python, because they're like, that doesn't work here, or it doesn't connect up, or it's not going to be fast enough. You have to pick your battles as well. Like, maybe you could find a solution with that team to get them to use Python in some way, but are they overloaded? Do they have time? Do you have time? So it's always a compromise.

Speaker 1: 24:55

That's the messy thing. Sometimes it's like, yeah, all these factors come into play. If that wasn't the case, then everything would be written in Rust, or something like that. Before I let you go, I've got a couple of questions for you. They came directly from some students at Parsity, one of which I'm interested in as well. What's your favorite database to work with and why?

Speaker 2: 25:12

Yeah, that's such an interesting question. I think the only correct answer to that question is it depends, and the reason being and I will give some examples. But you have different types of database, right? You have sort of relational, sort of classic SQL. You have more columnar SQL, and relational tends to store their data in rows. Columnar can be very much like that. Some of the same databases like Postgres, can have columnar tables and columnar is really good at collecting columns. So if you want sums and averages and stuff of columns, that's going to be more efficient at that than a row-based table, which is really good at individual record retrieval.
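Editor's sketch: the row-versus-column distinction can be mimicked with plain Python structures. This is only an analogy for the storage layouts, with invented data; real databases add compression, indexes and disk layout on top.

```python
# Row-oriented: each record is stored together, so fetching one record is one lookup.
rows = [
    {"id": 1, "customer": "a", "amount": 20.0},
    {"id": 2, "customer": "b", "amount": 35.0},
    {"id": 3, "customer": "a", "amount": 5.0},
]

# Column-oriented: each column is stored together, so an aggregate reads one array.
columns = {
    "id": [1, 2, 3],
    "customer": ["a", "b", "a"],
    "amount": [20.0, 35.0, 5.0],
}

# Aggregation: the row layout walks every whole record; the column layout reads one list.
total_from_rows = sum(row["amount"] for row in rows)
total_from_columns = sum(columns["amount"])

# Single-record retrieval: the row layout returns the record as stored;
# the column layout has to reassemble it from every column.
record_from_rows = rows[1]
record_from_columns = {name: values[1] for name, values in columns.items()}

print(total_from_rows, total_from_columns)
print(record_from_rows == record_from_columns)
```

Both layouts hold the same information; the difference is which access pattern each one makes cheap.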

Speaker 2: 25:47

And then you've got the NoSQL, which is things like Mongo, and they're really good at individual record retrieval as well. They don't tend to have relations. They tend to be heavily optimized for storing unstructured data and searching through that. And then you have data lakes, which are not strictly databases, but you use them in a lot of the same ways. It's just you generally have separated storage and compute. So rather than having a server that's active all the time, that's your database, you'll have all of your data on S3 and you'll use something like Spark or Athena, or there's other technologies as well, that will just scan that data when you need it. So all of these are for different purposes. I will say that I generally don't like NoSQL and Mongo, and that is mainly, yeah.

Speaker 1: 26:27

You know, me neither.

Speaker 2: 26:29

But it's not that it can't be good, right. It's just mainly I've seen it used by people that didn't want to think about data modeling because it was too hard. So they just started with that because they could throw anything they wanted in it, but the data at the time when they first built it wasn't relational. As the product grew, the data became more and more relational. They had to think of the data model more, but it's harder to enforce in that environment, and then, because they needed relational queries in a non-relational environment, they ended up coding all of these relational layers in their application code rather than letting the database handle it. So you end up with this really complicated system that's then hard to move out of.
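A rough illustration of what "coding relational layers in application code" can look like next to letting the database do the join. This is a sketch under assumptions: the `users` and `orders` data, the table names, and both function names are hypothetical, and SQLite from Python's standard library stands in for a relational database such as Postgres.

```python
import sqlite3

# Hypothetical data: users and the orders that reference them.
users = [{"id": 1, "name": "ana"}, {"id": 2, "name": "ben"}]
orders = [
    {"id": 10, "user_id": 1, "amount": 30.0},
    {"id": 11, "user_id": 1, "amount": 7.5},
    {"id": 12, "user_id": 2, "amount": 12.5},
]


def totals_in_application(users, orders):
    """Hand-rolled join: the application must know how the two collections relate."""
    names = {user["id"]: user["name"] for user in users}
    totals = {}
    for order in orders:
        name = names[order["user_id"]]
        totals[name] = totals.get(name, 0.0) + order["amount"]
    return totals


def totals_in_database(users, orders):
    """Same question, but the relationship and the join live in the database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, "
        "user_id INTEGER NOT NULL REFERENCES users(id), "
        "amount REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO users VALUES (:id, :name)", users)
    conn.executemany("INSERT INTO orders VALUES (:id, :user_id, :amount)", orders)
    result = conn.execute(
        "SELECT u.name, SUM(o.amount) FROM users u "
        "JOIN orders o ON o.user_id = u.id GROUP BY u.name"
    ).fetchall()
    conn.close()
    return dict(result)
```

With two collections the hand-rolled version looks harmless. The pain described above comes when every new relationship needs another piece of application code like `totals_in_application`, with nothing in the store enforcing that the references stay valid.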

Speaker 2: 27:08

There are genuine use cases for it, I think. For example, I don't know how OpenAI store their chats for ChatGPT, but if I was going to build that, I would probably look at something like Mongo, or NoSQL at least, because you imagine that each chat is a single record of complex JSON, of different steps and different information, and it's generally not relational. Chats don't relate to other chats. They're quite separate, individual things. So I'd probably look at it for something like that, okay, but I think there's fewer good use cases than there are bad ones. I think it tends to be more specific. Favorite database: if I'm using SQL, I will normally use Postgres. I don't think it matters too much these days anymore. They're all kind of similar, but I'd still use Postgres.

Speaker 2: 27:48

And one of my favorites is probably BigQuery, which is in Google Cloud, and it's not suitable for everything. It's more of a columnar thing, but it's just so easy to use. It's so cost-effective that it tends to be, like, cheap to store, cheap to query. It scales really well. It's really useful to connect into systems like dbt. It has all these other connectors. Like, if you're working with streaming systems, you can easily connect those streaming systems to BigQuery, so it just fills up BigQuery with the rows from the streaming tables. So, yeah, I really like BigQuery. I quite like Postgres. Don't often like Mongo, but there's use cases for all three.

Speaker 1: 28:24

I like that answer a lot. I feel like I've worked at some of the same companies that have come into that problem. I feel like we all have, when, yeah, at some point you have a Mongo database collection and you're thinking, this is basically relational, but we're so screwed now because to unwind out of here is impossible. Yeah, and honestly, SQL query languages are natural and somewhat intuitive in many ways. Mongo does not feel like that at all. I feel like I have to look up everything all the time to do a fairly simple query, and I'm like, it doesn't feel natural to write, honestly.

Speaker 2: 28:56

It's a heavily nested structure that also has not really any schema enforcement, so there's lots of generalization in there. That just makes it not a pleasure to use.

Speaker 1: 29:07

Yeah, it's really funny because I feel like everybody does use it. That's how it always happens: start off with Mongo because you're like, I don't want to deal with this, dump stuff in it, we'll take care of it later, and then at some point it's too late. I was at a company also where we did migrate Mongo into Postgres. That was great. It actually made development easier, made migrations easier. It just made our lives better.

Speaker 2: 29:27

Yes. Good data modeling made your life better, really. Yes, and that's the thing: you can skip that with Mongo.

Speaker 1: 29:38

It lets you skip it and I don't think generally you should. Yeah, it seems like more people should probably start with structured data and then maybe say you know what, maybe let's now put this in. That seems like an easier migration to me. Yep, for example. Um, last question here, this was an interesting one. This person asked how do you spice up machine learning?

Speaker 2: 29:56

Yeah, I mean, how do we spice things up by throwing in some machine learning?

Speaker 1: 29:59

Maybe that's a better way to phrase it, and the answer is probably...

Speaker 2: 30:03

You probably don't. Like, you know, that's not what you're looking for.

Speaker 2: 30:07

It's that when you've got a hammer, everything's a nail.

Speaker 2: 30:10

You've got a solution machine learning and you're looking for places to apply it, and that's generally the wrong way around to do things.

Speaker 2: 30:15

You should generally be looking for problems and then try and figure out the best way to solve them. And what a lot of data scientists find, or at least what I found, especially in sort of startup environments, is the first solution is almost certainly going to be heuristic-based, or it's going to be an off-the-shelf transformer that you've done some cleaning on or some extra rules on top of. You're almost never just going to be going, okay, let's build a model first, because that's not where the value is. You can build a simple heuristic thing in a few days. You can use some transformers and clean it up in a few weeks, maybe. And if you want to build something from scratch, that might take you months and it might not even be better. So you end up kind of having this maturity scale where you probably should do the basic thing first and see if it adds value, and then you get more complex as you sort of move through.
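As a sketch of the "basic thing first" step, here is what a heuristic baseline might look like for a made-up problem: flagging urgent support tickets. The keywords, the labeled examples, and the function names are all hypothetical. The point is only that a rule-based version can be written and measured quickly before any model is trained.

```python
# Hypothetical keyword rules for the first, heuristic version.
URGENT_KEYWORDS = ("outage", "down", "cannot log in", "data loss")


def is_urgent(ticket_text: str) -> bool:
    """Heuristic baseline: a ticket is urgent if it contains any keyword."""
    text = ticket_text.lower()
    return any(keyword in text for keyword in URGENT_KEYWORDS)


def accuracy(predict, labeled_examples) -> float:
    """Share of labeled examples the predictor gets right."""
    correct = sum(1 for text, label in labeled_examples if predict(text) == label)
    return correct / len(labeled_examples)


# Hypothetical hand-labeled tickets: (text, is the ticket really urgent?).
labeled_examples = [
    ("Site is down for all users", True),
    ("Cannot log in since this morning", True),
    ("How do I change my avatar?", False),
    ("Feature request: dark mode", False),
    ("Download button is slow", False),  # "download" contains "down"
]

baseline_accuracy = accuracy(is_urgent, labeled_examples)
```

The last example is deliberately one the rules get wrong. Measuring the baseline this way shows where it fails, and gives any later, more complex approach a number it has to beat.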

Speaker 1: 31:01

Okay, that makes a lot of sense. I mean from a person who doesn't know a lot about machine learning or data science. Yes, that makes sense. Do what makes sense and be open to changing direction and tools that fit the use case.

Speaker 2: 31:16

And that's the problem with AI right now. Right, we're not being told solve this problem. We're being told put AI in the product, build this thing with AI, do AI. And I'm in two minds about it. Right, because in the one mind I'm like, I understand there's pressure from investors. People think it's going to be super valuable. I do think generally the right way is to have a problem and solve that problem, but there is a part of me that is also like, this technology is clearly growing really fast and we want to be on top of it. What I don't want to do is have all of my data science team ignore it. I kind of do think it probably makes sense to almost force them to try it a bit more, so they stay experts and they stay on top of it, because I think that at some point that's going to have a big payoff. I think there's a mix there, but I do think there is some reason to push people into it just a little bit more.

Speaker 1: 32:15

I think that we're all feeling this in most organizations, where it's like this top-down push to use AI. Yeah, then to be cautious, because if you're too cautious you're going to potentially miss a massive opportunity, but it's ham-fisted and people are pushing really shoddy, really hard. Yes, and it's like I'm getting a little bit of fatigue.

Speaker 1: 32:35

It's like, well, now Cursor doesn't work well. Oh, you're not using the right model. You got to use this model for this task, or this model. I'm like, if I'm coding, I don't want to think about all these things, I just want to do the thing and have it kind of work nicely in the background. That's what a good product, I would think, does, and I know that there's gonna be some winners in this game for sure. But it is interesting watching, like, the musical chairs. I mean, there's definitely some. You know, OpenAI kind of feels like the king of the off-the-shelf large language models, but then you have, like, Gemini for visual analysis maybe, or, you know, trying to think of the other, like, big players out here. Yeah, a lot of people like Gemini 2.5 for coding and a lot of people like Claude for coding as well at the minute.

Speaker 1: 33:16

But it changes.

Speaker 2: 33:17

It's changing every few months what's better for different use cases, and that is really hard to stay on top of. I find it hard to stay on top of. I haven't tried all the latest models in all the situations.

Speaker 1: 33:30

Because that's not how people work. I kind of got something I like, and it kind of does what I want to do.

Speaker 2: 33:35

I just want to kind of use that. I think it's kind of a full-time job to do that, like, to actually just keep on top of AI and try all the things. You can't do your normal job and do that.

Speaker 1: 33:41

No. Maybe we'll have new roles in the future, kind of like the AI curator for each company on, you know, what they're doing. Last question: any books you recommend? I'm always curious about this. There's a lot of data science books. Is there a book or a program, a course, something you recommend for people that want to get a foundation in data?

Speaker 2: 34:05

Yeah, I think it can be tricky, in part because when I started it was such a long time ago. Things have changed. One consistent book I always recommend people for Python is Fluent Python. It's an O'Reilly book, it's really thick, and the reason I like that is because you can kind of dip in and out of it. You know, say you don't know anything about decorators in Python. You can read a chapter of that and then grok decorators, and I think that's the better way to do it. It's to almost look through the contents and say, I don't know anything about that, I'm gonna learn that this week. So I always like to recommend Fluent Python. For data itself, there's all sorts of different things I do.

Speaker 2: 34:41

I did Designing Data-Intensive Applications or something. Applications, yeah, yeah, and it talks about, like, how Twitter was built. So I've got that right. We can put it in the notes or something. That one's really interesting, because it's like, if you're going to build a simple Twitter, here's how you could build it, and then you run into the problem where you've got celebrities, and celebrities have 20 million followers. So now this solution that was working, where you now need to send out a message to 20 million people, stops working. So how do we solve that problem? And it kind of walks you through these steps, and in the process it actually gives really good examples of when SQL and NoSQL work and what the advantages and disadvantages are. So I really like that from a data engineer perspective. For data science itself, there was a book called the Data Science Handbook. I think it's available for free on GitHub, like the notebooks, and that is quite good if you want to learn sort of the more traditional machine learning algorithms.
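The celebrity problem mentioned above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not how Twitter or the book implements it: the `Timelines` class, its method names, and the `CELEBRITY_THRESHOLD` cutoff are all hypothetical. Ordinary authors copy each post into every follower's inbox when they post, while authors with many followers skip that step and have their posts merged in when a timeline is read.

```python
from collections import defaultdict

# Hypothetical cutoff; a real system would use a far larger number.
CELEBRITY_THRESHOLD = 3


class Timelines:
    def __init__(self):
        self.followers = defaultdict(set)        # author -> users following them
        self.following = defaultdict(set)        # user -> authors they follow
        self.inbox = defaultdict(list)           # user -> posts pushed at write time
        self.posts_by_author = defaultdict(list)
        self._next_id = 0

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= CELEBRITY_THRESHOLD

    def post(self, author, text):
        self._next_id += 1
        entry = (self._next_id, author, text)
        self.posts_by_author[author].append(entry)
        # Fan-out on write: cheap when the author has few followers.
        if not self.is_celebrity(author):
            for follower in self.followers[author]:
                self.inbox[follower].append(entry)

    def timeline(self, user):
        items = list(self.inbox[user])
        # Fan-out on read: celebrity posts are pulled in only when requested.
        for author in self.following[user]:
            if self.is_celebrity(author):
                items.extend(self.posts_by_author[author])
        # The set removes posts seen through both paths; sorting orders by post id.
        return sorted(set(items))
```

A post from an author with 20 million followers then costs one write instead of 20 million, at the price of a little extra work each time a follower loads their timeline.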

Speaker 2: 35:32

It's got, like, walkthroughs showing you the algorithms, showing how they work. It's got some nice graphs and you can actually run the stuff yourself. So I also like that one. But yeah, I haven't had to learn it from scratch in a while, so there's probably even better resources out there now.

Speaker 1: 35:44

No, that's really good. I'm familiar with at least one of those books. And where can people find you online?

Speaker 2: 35:50

Yes, I'm in two places, for the most part. I'm on LinkedIn, so you can follow me there. It's slash Ryan Varley. And I also write a newsletter that I'm going to write more actively. So it's currently called Data Leadership, but what I've kind of realized is the type of stuff I talk about is, like, how to be more brilliant as an individual, how to create brilliant teams that then build brilliant products, so I think I'm going to rename it to something in that direction and then talk about all those things around how to build good processes, do good hiring, how to learn the first 90 days in a role. So that will change, probably in the next couple of weeks.

Speaker 1: 36:25

Awesome, looking forward to that. More leaders should read things like that, and I think it's also really helpful for people that are in their career, because those are things that apply broadly across different domains within software or even out of software.

Speaker 2: 36:38

Yeah, and I think it can be really valuable. You don't have to be a leader to have change in your team. Like, you can see something's broken, like a retrospective or hiring or something, and you can get involved, and I think that is part of getting more senior as well. And yeah, I think that's what I mean by, like, being more brilliant yourself.

Speaker 1: 36:56

I like that. I love that a lot. Cool, we're going to have links to all those things in the show notes. I appreciate you taking time to speak with me today. Thanks for coming by.

Speaker 2: 37:03

Thanks so much for having me. As I said, I don't often talk about data engineering, and it was a real pleasure to do so.

Speaker 1: 37:09

That'll do it for today's episode of the Develop Yourself podcast. If you're serious about switching careers and becoming a software developer and building complex software and want to work directly with me and my team, go to parsity.io, and if you want more information, feel free to schedule a chat by just clicking the link in the show notes. See you next week.
