Code & Cure

#47 - Depression Screening with Digital Phenotypes

Vasanth Sarathy and Laura Hagopian



What if depression could be monitored with the same continuity as blood pressure or heart rhythms? While physical health is often tracked visit after visit, depression is still commonly measured through a brief PHQ-9 questionnaire—one that depends on memory, mood in the moment, and a person’s willingness to answer honestly.

We explore how digital phenotyping could change that by using signals from smartphones and wearable devices to better understand changes in mood, behavior, and daily functioning over time. From step counts and sleep patterns to broader activity trends, these passive data streams may offer clinicians a more continuous view of mental health. But the promise comes with real-world challenges: device access, syncing problems, missing data, and the risk of widening gaps for people who are already underserved.

We also break down the AI methods behind the research in plain language, including why depression scores often contain many zeros, how hurdle models help account for that pattern, why PCA can reduce overfitting, and how Bayesian multi-level modeling fits the messy reality of longitudinal mental health care. The result is a thoughtful look at where digital tools can support depression monitoring, especially for older adults who may face stigma or underreport symptoms, and what needs to happen before these systems can responsibly become part of clinical practice.

References:

Using Digital Phenotyping for Depression Screening in Community-Dwelling Older Adults: Bayesian Multilevel Hurdle Model Machine Learning Approach
Chung et al.
JMIR AI (2026)

Credits:

Theme music: Nowhere Land, Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 4.0
https://creativecommons.org/licenses/by/4.0/

Why Depression Screening Still Feels Primitive

SPEAKER_00

We screen blood pressure at every office visit. We can monitor heart rhythms remotely. So, why are we still trying to detect and monitor depression with a few questions asked once a year?

SPEAKER_01

Hello and welcome back to Code and Cure, the podcast where we discuss decoding health in the age of AI. My name is Vasanth Sarathy. I'm a cognitive uh scientist and AI researcher, and I'm with Laura Hagopian.

SPEAKER_00

I'm an emergency medicine physician. And today we're gonna be talking about um screening and monitoring for depression. And um, this is something that we do all the time in the clinical setting. And so the question is can digital phenotyping help us out here?

SPEAKER_01

Yeah, and we'll talk about what digital phenotyping is. I think this might be the first time we're talking about it in this podcast, although it's a big topic in the um AI health space.

The PHQ-9 And What It Measures

SPEAKER_00

Yeah, and I think maybe as a background, it's helpful to like talk about what do we traditionally do? Like, how do we actually do this in the medical system? And um, Vasanth, you've probably like if you've had a checkup, you've probably gotten a questionnaire called the PHQ-9 before. Yes. You you may or maybe it sounds like you do remember it, but like people probably a lot of people probably don't even remember it. It's like here, check the boxes on these nine questions. Right. And um, and it is used to both screen for depression and it is used to monitor depression. And those are two different things. So I'll go, I'll go into a little detail about that um in a second. But basically, what happens on the questionnaire is they're like, hey, recall over the last you know couple of weeks or whatever, um, how often have you had any of these symptoms? And it'll go through this list, like, oh, little interest or pleasure in doing things. Uh, you know, how often have you been feeling down, depressed, or hopeless? How often have you been feeling tired? How often have you had poor appetite or overeating? How often have you been feeling bad about yourself? The list goes on. It's not too long, though. It's like a nine question thing. You probably complete it in like two or three minutes. Uh, there is a question at the end about suicidal uh ideation. So, you know, have you had thoughts that you would be better off dead or of hurting yourself in some way? And so all these questions have a scale associated with them. Like, how often is this happening? Is it not at all? That's worth zero points. Is it several days, worth one point, more than half the days, worth two points, or nearly every day, worth three points? And so you just sum up the score and compute, okay, where is this person at?
And for screening for depression, we often use a cutoff score of 10 or so to say, hey, this person may not have been diagnosed before, but now we want to like see if they actually do have depression. So as a screening tool, it's like, hey, let's look further, let's see if we need to make this diagnosis. Right, right. And then for someone who has depression, you can kind of monitor the symptoms over time with this same questionnaire. So, like, say you have a diagnosis of depression, you do this questionnaire uh, you know, at every visit or at many of the visits that you go to, maybe you've been to therapy, maybe you've tried a new med. And so we want to see if that's working. And so we can track your score over time. So perhaps you go from a score of 20 to a score of 14, and that is a clinically significant change where you know you went from a severe to a moderate depression.
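The scoring described in this turn can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, the cutoff of 10 is the "10 or so" mentioned in the episode, and the severity bands are the commonly published PHQ-9 ranges rather than anything stated in the conversation.

```python
# Illustrative sketch only: names are hypothetical, not from the paper.
SCREENING_CUTOFF = 10  # the "10 or so" cutoff mentioned in the episode

# Commonly published PHQ-9 severity ranges (an assumption of this sketch).
SEVERITY_BANDS = [
    (0, 4, "minimal"),
    (5, 9, "mild"),
    (10, 14, "moderate"),
    (15, 19, "moderately severe"),
    (20, 27, "severe"),
]


def phq9_total(answers):
    """Sum nine item answers, each scored 0 (not at all) to 3 (nearly every day)."""
    if len(answers) != 9:
        raise ValueError("PHQ-9 has exactly nine items")
    if any(a not in (0, 1, 2, 3) for a in answers):
        raise ValueError("each item is scored 0, 1, 2, or 3")
    return sum(answers)


def severity(total):
    """Map a total score (0 to 27) to a severity label."""
    for low, high, label in SEVERITY_BANDS:
        if low <= total <= high:
            return label
    raise ValueError("total must be between 0 and 27")


def screens_positive(total):
    """True when the total reaches the screening cutoff."""
    return total >= SCREENING_CUTOFF


first_visit = phq9_total([3, 3, 2, 2, 2, 2, 2, 2, 2])
later_visit = phq9_total([2, 2, 2, 2, 2, 1, 1, 1, 1])
print(first_visit, severity(first_visit))
print(later_visit, severity(later_visit))
```

The two example visits mirror the 20-to-14 change mentioned above, which moves a person from the severe band to the moderate band while still screening positive.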

SPEAKER_01

Got it.

SPEAKER_00

Um, and so we use it both for, hey, could this person have depression? But also, how is this patient doing over time with the interventions that we're trying? Right.

SPEAKER_01

Right.

SPEAKER_00

And honestly, I like I've done the PHQ nine many times myself when I go into the I I I get it on my annual screening every year and I check off the boxes, and I it's not really a big deal to do, but it is only once a year, right?

SPEAKER_01

Yeah, and it's not only just once a year. I think the other issue is it's a one-time. So, like when that happens, I remember the I remember this, and I remember thinking, okay, I'm filling this out, but maybe it's just affected by my mood that moment or that day, or something good or bad happened to me that day or that moment. And that's influencing my answers. And and so I always wonder, you know, I I used to wonder this all the time, which is like, what how is this reflective of a more general condition I may or may not have? And so, you know, that's one issue, it seems like first of all, it's very subjective because I have to kind of report on it. I have to remember from the last time in my head how I'm feeling and how those feelings are averaged out. Um, and I have to, in some ways, I this is as a patient, I think about this as just I feel

Subjectivity Recall And Stigma

SPEAKER_01

like I have to discount weird shit that happens in you know the course of that day that might influence those numbers. So it all of this I feel like affects the the quality of the score that it computes, right?

SPEAKER_00

Yeah, I mean, there's obviously gonna be recall bias, right? We're asking you to remember in the past couple of weeks how have you been feeling. Yes. And it totally can be affected by your mood that day for the better or for the worse, right? Um, some people do fill it out more frequently, like maybe monthly, even to trend how they're doing over time. But again, it it could depend on like how poorly you slept that night, whether or not you did a workout that morning. Yeah. There are so many things, like whether or not you got into a clash with someone on the subway. Like there are so many things that can affect it. Absolutely and you're right, like it is a subjective tool. That doesn't mean it's not validated, it is validated, it works well, it's been studied a lot in the literature as a very great tool. In fact, that's what the paper uses as like as the correct answer. Yeah, exactly. Yeah, yeah, yeah.

SPEAKER_01

So that they're not trying to they're not trying to substitute this tool at all. Um, they're they're focused on an older age population that doesn't do the most regular job in filling out these these sort of questionnaires, and might maybe we might miss a lot of the depressed, uh depressed, we might miss a lot of the screenings, right? That's I think the one of the things that they're targeting.

SPEAKER_00

Yeah. And so in this study, they actually looked at older adults. And part of why they looked at older adults is because one, it's a challenging population when you start to think about tech, where they may not have the devices, the wearables to be able to use, you know, be able to use them, et cetera. But two, like this is a population where there's a lot of stigma about mental health. Maybe they don't want to fill out the forms, or maybe they don't want to say, hey, I've been feeling sad to their provider. That's just not something that is common. And so it's underdiagnosed and therefore it's undertreated. And if someone has depression and it's untreated, then that's making everything else worse. It's gonna make all their health problems worse. Um, and so the whole idea is like, hey, are like, and I think the concept of this is good is are there other ways we could check for depression or screen for depression or track someone who has depression and what their scores are? Like, how can we could we do that in this population without asking them to fill out this questionnaire?

SPEAKER_01

That's

Digital Phenotyping Active Versus Passive

SPEAKER_01

it. And and this is where the concept of digital phenotyping, I think, comes in quite nicely. And the basic idea is that digital phenotyping means inferring aspects of a person's health, behavior, mood, uh, or functioning from data generated by some device. So it's like a digital device, right? So the device could be a phone, it could be a wearable, it could be something else. Uh, but the idea is that you get all of these different pieces of data and you kind of build this profile of this individual using all of the digital data. And that's kind of what a digital phenotype is.

SPEAKER_00

Yeah. And I think the paper goes through two kind of types of digital phenotyping, active and passive. I think the passive one is the more interesting uh because you don't even have to do anything; the device is sensing things on its own. Right. When it's passive, like the user doesn't have to do anything, right? You don't have to fill out that PHQ-9 questionnaire. You don't have to track anything yourself.

SPEAKER_01

You just like, like it's counting steps in your phone, right? Yeah, exactly.

SPEAKER_00

Or it's like looking at how you sleep, or yeah, um, they didn't do this in this paper. They but we have talked about this on the podcast before, which is like, oh, how slowly is someone typing? And could that be related to their mood or their cognition? Or did they um did they make too many spelling errors? Did they have to delete things? Because that could be related to mood or cognition.

SPEAKER_01

Yes, that was a very interesting episode we did. I think that was relating to like neurological health, right?

SPEAKER_00

Yeah, exactly. And also mood, too. Both of those things. Right. And so they chose specific things in this paper to track, and they they could have chosen more, but a lot of what they looked at was um step count and also a bunch of different sleep parameters, like the time that someone spent in light sleep versus deep sleep versus REM sleep and awakening and things like that. And so this is like ideal from a patient perspective because you don't have to fill

Wearables And Real-World Data Problems

SPEAKER_00

anything out. There's no friction, right? You're already tracking it in theory. In practice, what they found was that a lot of these older adults weren't tracking these things and that these things didn't sync well. So, like they found that the smartwatches were not actually syncing the steps to their system. Oh, yeah, yeah.

SPEAKER_01

So there were like practical operational issues, which are actually meaningful for this patient population.

SPEAKER_00

Yeah, and like it, I think it was something like less than half of the people who participated actually had tracking stuff on board and like they couldn't get it to work with iOS. They only got it to work with Android. And so there's a lot of like, you it seems like it would be so easy. It would be so easy for the for the patient, right? If this stuff was being tracked passively without them having to fill out a questionnaire, it's just kind of going into the into the ether and then analyzing it and being like, hey, uh at your next uh doctor's visit, your doctor's gonna talk to you about your mood because there are some flags here, right? Right, right. So from a user standpoint, it's sort of ideal. From a practical standpoint, though, I think there were some limitations here in terms of how many people actually have these devices. Um, what are the financial means for getting a device like this onto people? What type of uh what type of phone do they have and will it sync? What type, you know, where are the smartwatches syncing with into the system that they're using to track the data? Because your your system's only as good as the data that it's able to receive. And then of course, like what type of data are they choosing to look at? Um, and what else could they have looked at that maybe has been studied and has been shown to have an impact um in terms of tracking someone's mood, like the speed at which they type.
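The point that a system is only as good as the data it receives can be made concrete with a small sketch. Everything here is hypothetical: the function name, the use of `None` for a day that failed to sync, and the 50% coverage threshold are assumptions for illustration, not details from the study.

```python
# Hypothetical sketch: summarize daily step counts when some days never synced.
def summarize_steps(daily_steps, min_coverage=0.5):
    """Return coverage and mean steps; None marks a day with no synced data.

    The mean is withheld (None) when too few days were observed, so that a
    downstream model is not fed an average built from almost nothing.
    """
    observed = [s for s in daily_steps if s is not None]
    coverage = len(observed) / len(daily_steps) if daily_steps else 0.0
    if observed and coverage >= min_coverage:
        mean_steps = sum(observed) / len(observed)
    else:
        mean_steps = None
    return {"coverage": coverage, "mean_steps": mean_steps}


well_synced = summarize_steps([4000, None, 6000, None])
poorly_synced = summarize_steps([None, None, None, 3000])
print(well_synced)
print(poorly_synced)
```

Reporting coverage alongside the average is one possible design choice: it keeps missing data visible instead of silently treating an unsynced day as a day with zero steps.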

SPEAKER_01

Yeah, and that's a that that that example is a good one because I think that is something that you know maybe in future studies they will look at. Um, but in addition to the passive data, they also looked at active data, active entries. So yeah, like the PHQ 9, but but more maybe a daily mood rating or um just kind of averaging over um or asking about different stressful events that might have happened.

SPEAKER_00

Lots of different stress things, yeah, that they asked about. And so the the benefit of this active digital phenotyping is that people can self-report it and they can do it all the time, right? They can do it every day. Whereas the PHQ, you're probably not taking that every day. Um, if you have depression, maybe you're taking you could take it every couple of weeks or every month or something like that. A lot of people for screening purposes are only taking it like once a year when they show up at their uh primary care provider's office, or they could be even taking it less often. Right. And so the whole idea of the active digital phenotyping is that, yeah, you can self-report all the time what your stress level is or what your mood is, and they can then go and see if that correlates with the with the PHQ nine that these these people are also filling out. The the problem, of course, is that it is a friction-filled thing to do, right? That you have to go in and track all these things all the time.

SPEAKER_01

It's like, well, it's an app, it's a special app you have to download and and use, right? Right.

SPEAKER_00

And in my mind, I'm like, well, it'll just be easier to fill out the PHQ nine, which is like validated and I don't have to do it that often. So I think there, and again, same some of the same issues that you were bringing up with the PHQ nine happen here. Obviously, you're not gonna have uh so much recall bias if they're filling it out right away, but it is uh it's a very subjective measure. Like, what does it mean?

SPEAKER_01

And the cultural aspects are going to be still at play.

SPEAKER_00

Yeah, yeah. So but I do think this concept in general is interesting, which is like, hey, we have we have tools out there that we know that work, but we know that there may be some stigma associated with depression. We know that maybe not everyone fills them out. So is there a way from a digital side to like actually figure out, you know, who ha who may have this condition, who should we check for it? That's right. And then can we track them over time within the same person to see how they're doing? When might we need to intervene because they're doing worse? Right, you know, can we check what if they're doing better? All of that is like a really interesting concept.

Bayesian Hurdle Models In Plain English

SPEAKER_00

So I and what they did here, I'm gonna mess up the words of this. I was like reading this, I was like, I don't know what any of these words means. They did a Bayesian multi-level hurdle model machine learning approach. How'd I do? Is that good? Pretty good. Pretty good. I don't know what any of those words mean. So I want you to explain to me what those words mean because this is how they kind of took all that information uh they got from these people and and turned it into results.

SPEAKER_01

Yeah, no, absolutely. And you know, we all we have talked about machine learning in general in the podcast where you have a bunch of data and the data has a bunch of what's called features, things that you're putting into the machine to learn, and then the the labels, which are things that you want to classify or find answers to, right? Um, and and sort of just to repeat myself a little bit on that front, you know, you have images, for example. You send in images, it's all got you got pixel data, and that's the input, and the output is it's an image of a cat or a dog or something, right? And so you have data that gives you the right answer, so to speak, and gives you all the things to look at, right? In the image case, it was pixels. Here we have those active and passive items of data that we cared about or features that we care about, and that's mostly um, there's a I think 45, 44, 45 of those that they were tracking. And for what were they predicting? Well, they were predicting PHQ-9 scores. So they were predicting um, you know, cases where people had no symptoms, like zero scores, and then scores from one through um 27, right? Because there were nine questions, worth up to three points each. Um and and what's interesting is um I I think what's one of the most interesting parts about this, one of the challenges here, was that it's not just a zero through twenty-seven scale. Uh-huh. Because there is a substantial and meaningful leap from zero to one. Right. And then from one through twenty-seven. Right. And the so the zero to one, uh, and they have highlighted this in the paper, as is is an important piece because it tells you that there's some depressive symptoms. And there's a distinction there, and that needs to be tracked. And that's and what you end up getting is people who don't have depression, there's a lot of zeros.
So in order to deal with that, they use what's called a hurdle model, machine learning model, to deal with the issue of lots of zeros and this sort of thing which says, hey, let's check if it's zero or something else. And then if it's something else, let's tell you exactly what the value that is.

SPEAKER_00

So I'm like picturing someone doing hurdles at the track meet. Is this is this like an appropriate or an inappropriate analogy?

SPEAKER_01

I don't know. I I haven't thought about that.

SPEAKER_00

Like you have to jump over the hurdle to sort of even be considered to have a score.

SPEAKER_01

Yes, that is okay. So I don't know.

SPEAKER_00

Well, like why do they call it a hurdle model?

SPEAKER_01

But but it is a hurdle in that sense, right? You have to jump over the hurdle.

SPEAKER_00

So it's like binary, like it's like zero or one, and once you're a one, once you've come over the hurdle, then we're like, okay, this is this is a whole different ball game.

SPEAKER_01

Yes, and that's kind of what you have. Then you have a c then you have sort of a um a regression model, is what they call it afterwards, right? You have some something from one through 27. So it's in a sense it's like a hurdle, but it's not like hurdles, right? No, it's a single hurdle. Right.
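The two-part idea behind the hurdle (first ask whether the score is zero, then model the size of non-zero scores) can be sketched without any predictors at all. This is a deliberately stripped-down illustration: the paper's model uses features and a Bayesian multilevel structure, while the function names and the example scores below are invented for this sketch.

```python
# Minimal two-part ("hurdle") summary of scores with many zeros.
# Hypothetical example data; no predictors, just the two parts.
def fit_hurdle(scores):
    """Part 1: share of scores above zero. Part 2: mean of the non-zero scores."""
    positives = [s for s in scores if s > 0]
    p_nonzero = len(positives) / len(scores)
    mean_positive = sum(positives) / len(positives) if positives else 0.0
    return p_nonzero, mean_positive


def expected_score(p_nonzero, mean_positive):
    """Overall expected score: chance of clearing the hurdle times the size beyond it."""
    return p_nonzero * mean_positive


scores = [0, 0, 0, 0, 0, 0, 3, 5, 10, 2]  # six of ten people report no symptoms
p_nonzero, mean_positive = fit_hurdle(scores)
print(p_nonzero, mean_positive, expected_score(p_nonzero, mean_positive))
```

Keeping the two parts separate is the design point: a feature can change the chance of having any symptoms without changing how severe the symptoms are once they are present, and the reverse.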

SPEAKER_00

But so many people scored a zero. And that and in part, that was because they did this in like a very general community population. People didn't have to have a diagnosis that's a lot of people. Most people didn't have any symptoms, right? Most people didn't have any symptoms. They had a lot of zeros in this study.

SPEAKER_01

Right. So that was the hurdle piece of the model. They needed, they needed a model that could handle that. The second piece of this is the fact that they need the number of data items was not that high. They were, they didn't have that many patients. You know, it wasn't like they had 50,000 patients or thousands upon thousands of patients. It was a much smaller number. And what happens when you have that, and if you have um 44 different active and passive features, you get what's called overfitting, which is you have very few data points, you have all of these extra things that are all considered equally important to detect whether or not a PHQ nine score is of some some value. And so what ends up happening is you don't get a model that's actually valid outside of that training population of people. So you have too many features and you have too few people, and it's just going to be it's going to look very carefully at those features and maybe place importance on features that actually are not that important.

SPEAKER_00

Yeah. So the whole idea is like in theory, we want to be able to take it outside of this small population and apply it to others. Yes. And if we overfit it, then that won't work.

SPEAKER_01

That won't work, exactly. And so one of the steps that you take for that is you reduce the number of input features by doing what's called um principal uh component analysis. Uh, and that allows you to identify uh kinds of features that kind of all are similar and are most influential for the results. And they found that um, you know, it's kind of a separate analysis, but they found that they did that and they found uh a subset of I think 12 uh 13 or 14 different um uh new features that are sort of compositions of those other ones. So for example, there's there's one that's called PC one. You can't really name them because they're just principal component one. One, yeah, they're just sort of they're named, but they kind of link up with things like high anxiety, high loneliness, high daily negative mood, high weekly stress, low social support. And so roughly they all sort of fall into the category, loosely speaking, of psychological distress or low social support profile. So, and and if you see another piece um principal component, PC four, is all about sleep. It's shorter light sleep, shorter REM sleep, shorter deep sleep. So you could say this has more to do with you know sleep. Um and so they're able to bring these features together. So now I have a smaller. Yeah. And um, and then you can do um then you can do regular machine learning over it. Now, the other problem is you because you don't have that much data, it's really hard to understand how confident you should be of the data. And so this is where the Bayesian piece comes in. And Bayesian modeling is a way to say, I have a belief right now in my head of something, I have a theory of something, I make an observation, and then I update my theory based on that observation. And then each time I receive a new observation, I'm slightly updating my theory each time. And it's a little different way of thinking. It allows you to work with very small amounts of data and um start somewhere. 
So if you knew up front that you know you were more likely um to be depressed because of some some pieces of some other pieces of information, you could start there, and then every piece of data that comes in is going to sort of push that you know in one direction or the other. And it allows it to learn uh a system to be able to learn um how to read new pieces of data in this manner and in with small amounts of data. So that's the Bayesian part. And I mean we've talked about Bayesian in the conversation.
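The feature-reduction step described in this turn, where several related features are combined into one principal component, can be sketched in plain Python. The data are invented, the three columns (anxiety, loneliness, weekly stress) are stand-ins for the kind of features that loaded together in the discussion of PC one, and power iteration is used here only because it needs no external libraries; it is not claimed to be what the authors ran.

```python
import math

# Hypothetical ratings for six people: [anxiety, loneliness, weekly stress].
rows = [
    [2, 3, 2],
    [4, 5, 5],
    [6, 6, 7],
    [8, 9, 8],
    [3, 2, 3],
    [7, 8, 6],
]


def standardize(data):
    """Center each column and scale it to unit variance."""
    n = len(data)
    cols = list(zip(*data))
    means = [sum(c) / n for c in cols]
    sds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1))
        for c, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in data]


def covariance(data):
    """Covariance matrix of already-centered data."""
    n, d = len(data), len(data[0])
    return [
        [sum(r[i] * r[j] for r in data) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_component(cov, iterations=200):
    """Power iteration: find the direction of greatest variance."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    return v, eigenvalue


z = standardize(rows)
cov = covariance(z)
loadings, eigenvalue = first_component(cov)
share = eigenvalue / len(cov)  # trace of a correlation matrix equals the feature count
pc1_scores = [sum(l * x for l, x in zip(loadings, row)) for row in z]
print([round(l, 2) for l in loadings], round(share, 2))
```

Because the three made-up features rise and fall together, a single component captures most of their variance, so one number per person (`pc1_scores`) can stand in for three inputs.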

SPEAKER_00

Uncertainty. I feel like that's when we've talked about it.

SPEAKER_01

Yes, it helps you build up a belief about something based on evidence, um, and and and assign a level of uncertainty to it. So lesser evidence, more uncertainty. Um, you know, so you're able to uh sort of play that out nicely and you're able to categorize that it's more explainable in some ways. Um and so they use a uh Bayesian model. So there you go. That's Bayesian um, you know, hurdle modeling. And I think there's they also multi-level. Well, the multi-level piece has to do with um I'm just saying words, you know. I know. No, no, but that's a great point because the data set in some sense is longitudinal. Uh because you make this uh because the same person contributes uh to multiple monthly observations over time. Right? So you have that piece, and so you have different groupings of observations inside of the same person's profile. Um, and and so you have this sort of variation that occurs within the person, and then you have variations that occur across different people, across different times.
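Returning to the Bayesian piece: the "start with a belief, update it with each observation" idea, and the link between less evidence and more uncertainty, can be shown with the simplest textbook case, a Beta prior on a proportion. The numbers and names are illustrative assumptions and have nothing to do with the priors the authors chose.

```python
import math


def update_beta(prior_a, prior_b, nonzero_months, total_months):
    """Beta-Binomial update: add observed counts to the prior's pseudo-counts."""
    return prior_a + nonzero_months, prior_b + (total_months - nonzero_months)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def beta_sd(a, b):
    """Standard deviation of a Beta(a, b) distribution."""
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))


# Flat prior Beta(1, 1): no strong belief about how often scores are non-zero.
small = update_beta(1, 1, nonzero_months=3, total_months=10)
large = update_beta(1, 1, nonzero_months=30, total_months=100)

print(small, round(beta_mean(*small), 3), round(beta_sd(*small), 3))
print(large, round(beta_mean(*large), 3), round(beta_sd(*large), 3))
```

Both runs see the same 30% rate of non-zero months, but the posterior built from 100 months is much narrower than the one built from 10, which is the "lesser evidence, more uncertainty" point in code form.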

SPEAKER_00

This is interesting because when you do the PHQ nine, and if you do it like once a year or once every couple of months or whatever, it's kind of infrequent. But for a lot of this passive data, there's so much of it that's being inputted all the time. Every day you're getting information about sleep and step count and the self-reported, you know, stress and mood could be every day too. And so you have to like account for, you know, it's great to have all that data, but then you have to like figure out how you're gonna process all that data.

SPEAKER_01

Right. And a multi-level model helps you account for things like, you know, when a person's sleep is slightly worse than usual for them, or when people with low social support generally are more likely to have depressive symptoms, right? That's across one is of them is within the person because over time they may have different sleep cycles, and the other is across different people because of their own backgrounds. Um, and so you're able to account for that, which is why you have this multiple levels of analysis. And so there you go. That's the Bayesian multi-level hurdle models, right? So, like you have um all of those jargon terms put together for you. There we go.
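The within-person versus across-people distinction can be illustrated by splitting each observation into two parts: how a person differs from everyone else on average, and how a given month differs from that person's own usual. This is a descriptive sketch with invented sleep data and hypothetical names, not a fitted multilevel model.

```python
# Hypothetical monthly average sleep hours for two people.
sleep_hours = {"A": [7.0, 6.0, 8.0], "B": [4.0, 5.0, 6.0]}


def decompose(sleep_by_person):
    """Split each value into a between-person part and a within-person part."""
    all_values = [v for values in sleep_by_person.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    result = {}
    for person, values in sleep_by_person.items():
        person_mean = sum(values) / len(values)
        result[person] = {
            # How this person's usual sleep compares with the whole group.
            "between": person_mean - grand_mean,
            # How each month compares with this person's own usual sleep.
            "within": [v - person_mean for v in values],
        }
    return result


parts = decompose(sleep_hours)
print(parts)
```

In this toy data, person B sleeps less than the group overall (a between-person difference), while a value of -1.0 in either person's "within" list marks a month that was worse than usual for that individual.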

SPEAKER_00

Thank you very much. See, now I can say it like I know what it means, because I do know what it means now.

SPEAKER_01

Yeah. And and so, you know, I I think that they were very careful about all the little design choices made uh for this patient population. Um, so in that sense, it was a it was a nice paper to read and kind of share with people because I think it brought some interesting, for me at least, the most interesting piece was the um the fact that the PHQ-9 data is not just zero to twenty-seven, it's zero to one and then one through twenty-seven. Yeah. And the jump matters.

SPEAKER_00

Yeah. Like binary component at first, or the concept of the hurdle.

What The Study Found And Missed

SPEAKER_01

Yes.

SPEAKER_00

And so I will say that the results here are kind of like a like a no shit Sherlock results. Yeah. Honestly, because if you look, if you look at the results, it's it's what you would expect. It's what's been shown and studied before. That the people who had uh more depressive type symptoms are the people who had more anxiety, more stress, that they tracked worse mood, that they were more lonely, that they had less social support. All of those things are things that we know correlate with depression, depression and depressive type symptoms. Um, and then also, you know, in terms of sleep, it was like, hey, people who had worse sleep. People who had worse sleep had more depressive symptoms. Yeah. People who are male, people who are smokers, that's common and and well studied in this uh older adult population as well. That those are the people who have who tend to have worse, worse moods, worse depressive type symptoms. And so the results here were not really surprising to me. And I don't think this is at the level where it can be sort of like implemented yet, right? In an ideal state, what I'd want to see with something like this is hey, can we can we actually like replace the PHQ9 to make it, you know, so much less friction for uh this population? Um, that's not where we're at, not at all. Like this could maybe be an adjunct, uh, but not everyone has uh a wearable device to track with. They maybe could have expanded what they were looking at, right? Not just uh sleep and stress levels, but maybe typing speed and things like that. Um, and I think this would need to be studied in just like a much larger population, including more people who overcame that hurdle. Yeah. More people who had depression, for example. Um, but this concept of like passively being able to sense things in order to create a digital phenotype, I think that's like a really, really interesting thing that we need to keep studying. 
Um, but like personally, in my in my mind, I was like, well, the PHQ nine isn't so bad to fill out right now. Yeah, yeah, yeah.

SPEAKER_01

But I want but I want to state that there's actually a distinctive technical contribution in this paper, which is that it's not treating the the classification of PHQ-9 as this one-shot cross-sectional task where you just give it a bunch of data, predict the PHQ-9. Instead, it's saying, no, no, no, depression screening has a personalized longitudinal component to it. And it's partly a stable person level vulnerability and partly um a changing month-to-month state, right? Uh uh so so a good model should be able to capture both. And I think that they hadn't had that before. And that's kind of what this I think there's a technical contribution here in this paper outside of the results being unsurprising, uh, the results of the of the you know the main component being unsurprising. Um, I think there is a piece here which I think is a good step forward technologically.

SPEAKER_00

Yeah, and I think it is in general promising to see work like this coming out.

SPEAKER_01

Yeah.

SPEAKER_00

Because I do think this is the future, and it ends up creating potentially for patients uh a better user experience in terms of who gets screened and making it so that maybe everyone can get screened or tracked over time better, especially for things that are you know higher on the stigma spectrum to be able to know, okay, this is something that your primary care provider should bring up with you. You don't have to necessarily bring it up. If it's being flagged in the data, they can start a conversation with you because they're able to see them. Awesome.

Closing Thoughts

SPEAKER_00

Well, this was a really interesting topic, and uh, we're excited to see you next week again on Code and Cure.

SPEAKER_01

Thank you for joining us.