Bio(un)ethical

#14 James Diao: When should race be used in medical algorithms?

with Leah Pierson and Sophie Gibert
Season 2, Episode 14

In this episode, we speak with researcher and physician Dr. James Diao about when and why race should be included or excluded from clinical algorithms. We focus on his work evaluating the implications of including race as a variable in two clinical algorithms: one used to assess lung function, and another used to assess cardiovascular disease risk.

(00:00) Our introduction
(05:10) Interview begins
(09:47) Criteria for the inclusion/exclusion of race in clinical algorithms
(16:23) Inclusion of race in lung function equations
(27:04) Estimated racial disparities in lung disease classification
(31:37) Estimated racial disparities in access to social services and healthcare
(37:44) The PREVENT equations for cardiovascular risk
(47:44) Anticipated impact on statin recommendations
(57:22) Estimated changes in statin eligibility by race
(1:10:03) Whether we should exclude race from clinical algorithms by default
(1:20:36) Common themes and failure modes

Bio(un)ethical is a bioethics podcast written and edited by Leah Pierson and Sophie Gibert, with production support by Audiolift.co. Our music is written by Nina Khoury and performed by Social Skills. We are supported by a grant from Amplify Creative Grants.

Note: All transcripts are automatically generated using Descript and edited with Claude. They likely contain some errors.

Introduction:

Sophie: Hi, and welcome to Bio(un)ethical, the podcast where we question existing norms in medicine, science, and public health. I'm Sophie Gibert, a Bersoff Fellow in the philosophy department at NYU, soon to be an assistant professor at the University of Pennsylvania. 

Leah: And I'm Leah Pierson, a final year MD-PhD candidate at Harvard Medical School.

Sophie: Clinical algorithms, a term we'll use pretty broadly to mean structured decision-making tools that help doctors decide what care patients need, are ubiquitous in medicine. The website MDCalc, which many doctors use daily, includes hundreds of such tools. To cite an example of how these tools can be used, consider patients with atrial fibrillation, or AFib, a kind of abnormal heart rhythm.

These patients are at higher risk of strokes because the irregular heartbeat causes blood clots to form, which can break off and travel to the brain. One solution is to put patients on blood thinners, which keep the clots from forming. But preventing clots comes with risks, too. As you might expect, patients who are on blood thinners are more likely to bleed. So, figuring out who to put on blood thinners involves weighing a patient's risk of having a stroke against their risk of a bleeding event.

Leah: Doctors used to make these kinds of decisions by evaluating the patient in front of them and using their best clinical judgment. But in recent decades, that has changed.

Based on data from large studies, algorithms have been developed to try to standardize these risk assessments. For instance, the widely used CHA₂DS₂-VASc score gives patients a point each for variables like history of diabetes and female sex. These points are then added up and the CHA₂DS₂-VASc calculator tells a doctor a patient's annual risk of having a stroke based on their score.

Doctors can then pair these risk estimates with clinical guidelines, which say what to do in light of this information. For instance, a patient with a score of 6 would be estimated as having a 10 percent annual risk of having a stroke. This is pretty high, so guidelines would recommend starting a blood thinner.
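Note: to make the scoring arithmetic concrete, here is a minimal sketch of a CHA₂DS₂-VASc-style calculation in Python. The point values follow the published score, but the risk interpretation is illustrative, anchored only to the score-of-6 figure mentioned above; this is not a clinical tool.

```python
# Minimal sketch of a CHA2DS2-VASc-style stroke-risk score.
# Point values follow the published score; the risk interpretation is
# illustrative, anchored to the ~10% annual risk at a score of 6
# mentioned above. Not for clinical use.

def cha2ds2_vasc(chf, hypertension, age, diabetes, prior_stroke_tia,
                 vascular_disease, female):
    score = 0
    score += 1 if chf else 0                              # Congestive heart failure
    score += 1 if hypertension else 0                     # Hypertension
    score += 2 if age >= 75 else (1 if age >= 65 else 0)  # Age bands
    score += 1 if diabetes else 0                         # Diabetes mellitus
    score += 2 if prior_stroke_tia else 0                 # Prior stroke/TIA
    score += 1 if vascular_disease else 0                 # Vascular disease
    score += 1 if female else 0                           # Sex category
    return score

score = cha2ds2_vasc(chf=False, hypertension=True, age=76, diabetes=True,
                     prior_stroke_tia=True, vascular_disease=False, female=False)
print(score)  # 6 -> roughly 10% estimated annual stroke risk, high enough
              # that guidelines would recommend starting a blood thinner
```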

Sophie: The variables that go into clinical algorithms thus matter immensely, because they determine what care doctors recommend and what treatments patients receive, including for medical resources that are limited, like organs for transplantation. It probably seems pretty innocuous for a doctor to use a patient's history of prior strokes or their age to determine who should be on blood thinners, but it may not be innocuous to use a patient's race as an input, which many algorithms do, and which can pave the way for recommendations that differ for patients depending on their race.

Examples abound, but just to cite one: The National Cancer Institute's Breast Cancer Risk Assessment Tool is used to predict a patient's risk of developing breast cancer. It uses race as an input and predicts that Black, Hispanic, and Asian American women are at lower risk than white women. These lower risk estimates may, in turn, lead to less aggressive screening for non-white women.

Leah: The general trend in medicine has been to try to remove race from clinical algorithms. Some of the justifications for this are that the inclusion of race in algorithms is often based on stereotypes rather than rigorous science, fosters distrust, can perpetuate health inequities, and risks reinforcing the idea that race is a biological rather than social construct.

However, removing race from algorithms can have downsides too. In particular, the exclusion of race can make some algorithms less accurate, in part because race is correlated with genetic ancestry as well as various social, environmental, and economic factors that affect health. Thus, removing race haphazardly can lead to less accurate predictions. For instance, algorithms that do not consider race have been shown to underpredict colorectal cancer risk in Black patients. 

Sophie: Today we'll be talking with Dr. James Diao about when and why race should be included or excluded from clinical algorithms. James is a resident physician in internal medicine at Brigham and Women's Hospital, as well as a researcher studying the development and evaluation of clinical algorithms.

He was recently named a STAT Wunderkind and became just the 22nd person to graduate summa cum laude from Harvard Medical School. We'll discuss with James his recent work evaluating two clinical algorithms, one that is used to assess patients' lung function and another that is used to assess patients' risk of cardiovascular disease.

In both cases, we'll discuss with James the implications of including or excluding race as a variable in these algorithms. 

Leah: As always, you can access the papers that we reference in the episode notes or at our website, biounethical.com. And you can submit feedback there or email us at biounethical@gmail.com.

Interview:

Host: Hi James. Welcome to the podcast.

James: Thanks so much for having me. I'm very excited to be here.

Host: The general trend in medicine has been to remove race from clinical algorithms. Could you talk a little bit about why race was historically included in many algorithms and what motivates present efforts to remove it?

James: Absolutely. So the history of race in medicine is basically as old as race in American society, and race as an organizing principle in medicine has spanned more than a century. And so part of the motivation in the older algorithms came from this really strong default assumption that there were these underlying biological differences between human beings that were relevant to their physiology and to health and disease.

As time went on, and we started to learn more, for example, after the Human Genome Project, about how the concept of race is a social construct and not a biological construct, and a poor proxy for both biological and social differences between human beings, I think more of the motivation came from convenience, rather than anything else. When you have a set of data, and race is generally in the list of variables that you could use to predict an outcome, researchers often treat it just like they would any other convenience variable, where you have an outcome that you're trying to optimize some accuracy metric for, or even sometimes a fairness metric for.

And then you include all variables that are just easily available. And because race is so commonly found in data and presumably easily collected, it's often included just as something that researchers will try out. And then when they see the number go down, they think this is something that has improved the model and therefore should be included.

There's also, in some more recent efforts, an idea of adding race as a corrective factor. So to account for the effects of racism or the differences in care that different populations receive, this idea that, "Oh, if my model is less accurate for Black individuals, then having an 'if Black' part of that equation automatically makes it such that I can make those numbers equal."

And I think as someone who develops models, it can feel very bad sometimes to feel that your model is performing worse in one group, and this is a very quick fix that I think is very tempting to take because it makes the numbers equal. So these are all different motivations. It's very complex. I think it's very hard to point to any individual one as the motivator that's driven a lot of these algorithms, but I think those are the ones that I would point to.

Host: That's super helpful context and we'll get into your research in a bit, but you are also a clinician, so we wanted to ask: have you seen cases where the inclusion or exclusion of race in clinical algorithms seemed to impact patient care?

James: I've definitely seen the way race has been used to guide clinical decision making. I think the question of whether it was in an algorithm really depends on what you mean by algorithm. Sometimes the algorithm is as simple as "if this race, then use a different age threshold for screening." And other times it's a more explicit input into something that you find on MDCalc that would produce a different output.

I think what I've been seeing more is the former. Just because as prevalent as these algorithms are, they're oftentimes hidden away from the people experiencing the care and even sometimes the people delivering the care. Whereas something like deciding whether or not to screen someone for diabetes or screen someone for prostate cancer is something that is much more at the surface.

What stands out to me is, for example, my dad is in a BMI range that is close to the screening threshold for diabetes, but not quite there. And he sees the same PCP he's been seeing for more than a decade. And his doctor is Chinese as well and sees a lot of Chinese patients in Chinatown and basically presented it as, "Hey, your BMI doesn't necessarily meet this threshold, but we Chinese people, we carry more fat in the abdomen. And some data has shown that we develop diabetes at lower BMIs. So I'm just going to screen you anyway, just because it'll make me feel better."

And so that's one example that I've heard just related to my own family members. But then I think lots of stories also abound, not that I've seen directly, but about people having different decisions made about their care related to cardiovascular disease risk, related to their lung function, their kidney function, and so on.

Host: In a widely cited New England Journal of Medicine article, Vyas et al suggest that, quote, "when developing or applying clinical algorithms, physicians should ask three questions: Is the need for race correction based on robust evidence and statistical analyses? Is there a plausible causal mechanism for the racial difference that justifies the race correction? And would implementing this race correction relieve or exacerbate health inequities?" Do you think these are the right questions to ask, or would you modify them in any way?

James: I certainly think these are great questions, especially in the context of the article by Vyas et al, which just to provide some context for the listeners, goes through and documents maybe 12, 13, 14 examples of algorithms that use race. And then I think this provides a more ex-post kind of framework for evaluating them. So after they've already been developed, thinking about: was it developed based on evidence? Is it something that we should be using still? I think those criteria are really great for doing that type of analysis.

There are criteria everywhere from these different perspective pieces, anywhere from three criteria to nine criteria. I think the ones that I would add to that list are more relevant to the people developing the algorithms. And so there were a couple brought up by Amaka Eneanya and Peter Reese in their JAMA article, which also asked the questions: does the use of race bring substantial benefit? Can this benefit be achieved using any other means? Is the use of race transparent to patients and clinicians? And if patients were to decline the use of race, can they still be treated fairly?

And I think that those additional criteria are really helpful as well, because I think they bring to the fore the direct tension between having a variable that is actually oftentimes a really powerful predictor of outcomes that we do care about, but then brings a lot of other tensions that are difficult to quantify. Things like how much they might promote distrust or entrench race as an organizing principle in medicine.

And so I think those criteria are the ones I tend to cite a little more often, because I think they're just easier to think about for people who are developing and using the equations. But then I think from a policymaker standpoint of "should these be recommended for, for example, national implementation or in our clinical laboratory," I do think the Vyas et al criteria are a really great starting point.

Host: Just to follow up on what you just said about different people proposing different potential criteria. Are people saying the same things in different words or applying different degrees of specificity to the questions they're proposing? Or is there actual genuine disagreement among people who work in this area about what the right criteria are?

James: I think there is definitely genuine disagreement. To cite two examples, I think one would be the necessity of a causal relationship. One important critique has been this idea that race doesn't cause these outcomes. Race is associated with this very dense network of social, environmental, economic, other variables that are then tied to the outcome through other causal links. But then if race is not the causal factor, then this is something that we shouldn't include.

And I think intuition does support some of this idea. For example, one of the things we found was that eating multigrain bread was one of the top predictors of developing COPD down the line, probably not because of any causal relation, but because it is tightly correlated to people's health literacy, to people's willingness to pursue things that advance their health and things like that. But a lot of us would feel uncomfortable, for example, with using multigrain bread to predict whether or not we'll develop COPD.

But on the other hand, I think there are a lot of folks who say, for a variable that's really widely available and really convenient to use, that advances our predictive performance or the performance metrics we care about, does it really need to be causal? And this is a broader issue with sensitive attributes in general - what to do when you have something that is deeply confounded, or just not causally related at all, but happens to improve your accuracy by a significant amount.

There's data showing that this is true for a lot of the different variables that end up really improving the performance of different models. And so there's definitely a group that has this approach of "accuracy above all."

I think also there's tension about the relative weight of concerns like distrust or reifying race as an organizing principle. And again, these are impossible to measure against things like how accurate your risk model is. And so, for those who consider these to be really important problems, they will outweigh almost any substantial benefit added by race; for those who don't think they're big problems, even a small benefit from adding race is sufficient to include it. So I think there's definitely a lot of disagreement about how much to weight these different criteria, or whether certain criteria should even be included at all.

Host: Another follow up about the other criteria you mentioned, one of them was something like, "could this benefit be achieved by any other means?" I was wondering, could you say a little bit more about what is driving the inclusion of that criterion?

James: I think, and I might be putting words into Dr. Reese and Dr. Eneanya's mouths here, but my thought was that this criterion relies on an assumption that race is just ideally a variable that we wouldn't use, and this could be linked to a lot of different things that we've already talked about. This idea of promoting distrust, it has a really long sordid history of being associated with poor treatment. And then this idea of solidifying race as an organizing principle is also something that a lot of people are uncomfortable with.

So these are all just drawbacks that come with race. Others include things like the tendency for clinicians to assign race rather than ask patients to self-report it. Different clinicians might assign different races to the same patient, making it a less objective variable. So there are a lot of reasons why it's very different from a variable like height or a variable like your LDL level.

And so the assumption is that if race were standing in for something that we could actually measure, then we should just measure the thing it's standing in for. But then, oftentimes, I think the reality is more complex, where it's very difficult, in my experience, to find a single variable, or even a set of variables, that adequately captures all the different things that race stands in for.

Host: Okay, that makes a lot of sense. So in your work, you've focused on different kinds of decision-making algorithms. We want to focus on two of the different algorithms that you've looked at because we think that they expose different issues about how these algorithms can affect the distribution of health resources and, correspondingly, patients' health outcomes.

So let's start with algorithms that measure patients' lung function. In recent work, you look at the implications of race adjustment in lung function equations. To just give people some background, and feel free to add any context you think is necessary, one of the ways we measure lung function is by using spirometry, which involves having people inhale and exhale into a sensor that measures how much air someone can move and how quickly.

The results of spirometry testing can then be used to diagnose respiratory diseases like asthma or COPD, as well as assess response to treatment. In your paper, you note that clinicians interpret a patient's values by comparing them with a predicted normal range, which is calculated on the basis of age, sex, height, and often race, with normal values typically falling between 80 and 120 percent of the predicted healthy value.

But recently, race-based equations from the Global Lung Function Initiative, or GLI, which are widely used, were replaced with race-neutral algorithms. Can you talk about why race was originally included in the GLI equations and what motivated its removal?
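Note: before James's answer, a minimal sketch of the percent-of-predicted logic described above. The reference function here is a hypothetical stand-in, not the actual GLI model, which uses far more complex spline-based equations.

```python
# Sketch of percent-of-predicted spirometry interpretation.
# predicted_fev1 is a hypothetical stand-in reference equation;
# the real GLI models are spline-based and far more complex.

def predicted_fev1(age, height_cm, sex):
    # Illustrative coefficients only. Race-based versions of such
    # equations add a race term here; race-neutral versions do not.
    base = 0.037 * height_cm - 0.028 * age - 1.0
    return base - (0.7 if sex == "female" else 0.0)

def percent_predicted(measured_fev1_l, age, height_cm, sex):
    return 100 * measured_fev1_l / predicted_fev1(age, height_cm, sex)

pct = percent_predicted(measured_fev1_l=3.5, age=50, height_cm=170, sex="male")
# Values roughly between 80 and 120 percent of predicted are read as normal.
print(f"{pct:.0f}% of predicted:", "normal" if 80 <= pct <= 120 else "abnormal")
```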

James: Of course. The Global Lung Function Initiative was taking on a fairly challenging task, which was that they wanted to have a single set of reference equations that could be applied around the world. And so to do this, they solicited spirometry data from different research groups around the world, and collected data from a lot of different places.

The problem is that there was substantial regional variation, variation both between the different places that submitted their data, and also within. And it's very hard to have an equation that accurately models this variation without accounting for some of these differences.

And so what they did was they grouped these different data, oftentimes using the geography, but sometimes using self-identified or investigator-assigned race or ethnicity within that data, into groups that tended to cluster together. For example, they grouped the African American patients together. They grouped a lot of North America, Western Europe, Middle East together and called it "Caucasian." They grouped Northeast Asia separately from Southeast Asia because those numbers were different.

Some of the populations didn't fit into any of the buckets. For example, a lot of the Indian subcontinent had data that were separate but also not enough to form their own group; those data were excluded. Data from France that did not have race and ethnicity data, but still a multi-ethnic population, were excluded. And then the goal was to be able to have these different coefficients that would basically define these groupings; the model would then do better within each group.

And so the way they chose to organize that was around race and ethnicity, but it is a general problem that other groups have sought to deal with. For example, for the Europeans dealing with cardiovascular risk, they also found substantial variation across countries. But instead of grouping it by race, they grouped it by low risk, high risk, and very high risk regions. And similarly for fracture risk, they started out by organizing it by race and then just did it separately by country.

So there are a lot of different ways you can choose to account for this type of regional or population level variation. But originally in 2012, the GLI sought to use race to form the taxonomy.

Host: Got it. In your paper, you note that the race-based algorithm assumes that healthy persons of different race groups have different lung function, and essentially normalizes lower lung function among Asian and Black people and higher lung function among Hispanic and white people. Two questions. So first, do you have thoughts on why these racial disparities were observed? And second, whether it's appropriate to normalize them?

James: Yes, I have some hypotheses. Unfortunately, the amount of data supporting them is not terribly strong. So I would say first that what is often taught in medical school and in pulmonology fellowships is that anthropometric measures account for this difference - that Black individuals have a shorter torso and longer legs, and therefore when you're adjusting for height, you're going to overestimate the size of their thoracic cavity. And in order to not over-adjust for height in Black individuals, you have to account for race so that you properly account for that difference.

This was shown in some data, largely in pediatric cohorts, where they found that accounting for the Cormic index, which is sitting height to standing height ratio, which better captures the thoracic cavity versus the leg length, was able to explain somewhere around 40 percent of the differences between Black and white individuals. Subsequent studies found that this proportion is actually lower, especially in adult populations. And so I think, in my head, the evidence supporting this leg length hypothesis is not terribly strong.

The other hypothesis, and the one that I think is more likely in my head, but again, just without data to support it, is this idea that Black Americans in the United States have a really long history of different social, environmental, economic insults that affect lung capacity. So, for example, anyone who grows up in a redlined neighborhood that's close to highways, close to factories, has poor nutrition growing up - all of these things will affect the development of the lungs and their subsequent health.

And one reason this is so difficult to study is because a lot of these insults happen during developmental phases and studying a cross-sectional data set of adults today, it's hard to know for someone who has high income or low income now actually what their circumstances were like in the time periods where lung development was most affected. And so again, there's no data on this, but my personal sense is that the leg length and anthropometric hypothesis is less likely, and then environmental and other factors being more likely.

Host: Just to follow up on that, if I understand correctly, the anthropometric hypothesis would suggest that normalization is appropriate versus if you have a social explanation that injustice may have contributed to them having worse lung function, then normalization is not appropriate. Is that a correct understanding of what you're saying?

James: I think if the goal is primarily to have the most accurate reference, then yes, I think the anthropometric hypothesis is saying that there's an overcorrection happening for height that we can fix with race. And then the other hypothesis would say that, hey, you're actually obscuring these different insults that have occurred over time, and then you're burying it under this label of "normal." And so I think it's true that where this variation comes from is actually really important towards determining the question of whether we should be normalizing, if, again, the primary goal is to have the most accurate reference equation.

Host: So just to clarify, is it that we have these different standards for lung function that we're calling normal for patients of different races? And if it's actually the case that the standard for Black patients is lower because they were exposed to a lot of pollution - so on average the population has lower lung function because of that exposure - then if we encounter a patient of that race whose lung function seems fine because it's similar to what we're regarding as normal for their race group, well, maybe their lung function is not actually fine, because the normal for that race group is worse than it could physiologically be?

Because they were exposed to all these bad things. And so we should actually be viewing the patient in front of us as having lung disease, even though they're in the normal range for their race group, because we're comparing them to a group of patients that maybe had some undiagnosed baseline lung disease or something like that. Is that kind of the idea?

James: I would agree with that.

Host: Okay. So in the paper, you use data from hundreds of thousands of patients from five cohorts to estimate the effects of the adoption of race-neutral equations. How did you do this?

James: Okay. So the rationale for including five cohorts was that there were certain outcomes we wanted to study that were often available in just one or several of these cohorts. And so because lung function is used for so many different outcomes, so many different clinical applications, we felt that it was important to take a holistic view of how it affects all of them.

And so this includes things like clinical outcomes, things like who is classified to have ventilatory impairment, who is classified to have severe versus less severe COPD, things like impairment ratings assigned by the American Medical Association that are used to determine things like disability qualification and compensation, disability ratings by Veterans Affairs, and qualification for work. And so that was one set of outcomes that we estimated using a nationally representative data set called NHANES.

We also used the U.S. transplantation data for lung transplant candidates to assess things like how would people move up or down in priority for lung transplantation with race-based or race-neutral equations, how much would they move, and if you use these equations to try to predict measures that are important, like how long you would survive on the waiting list without a transplant, is one of them more accurate at predicting that.

And then lastly, we sought to assess how accurate these different equations were for predicting cross-sectional or longitudinal outcomes that matter to patients. So things like, what is your likelihood of developing respiratory disease in 10 years? What is your likelihood of dying in the next 30 years from a respiratory cause? What is your risk of, again, dying before the median time, for example, when you're on the lung transplant waiting list?

And these are things that, ideally, a good measure of lung function would be good at rank-ordering people for. And using these different data sets, we're able to try to triangulate the different outcomes and make sure that we had a broader picture of how patients would actually be affected in all these different dimensions when race is either used or removed.
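Note: "rank-ordering" here can be quantified with a concordance statistic - the fraction of (event, non-event) pairs in which the model gave the person who had the event the higher risk score. A minimal sketch with made-up numbers, not the paper's actual analysis:

```python
# Minimal concordance (C) statistic: of all pairs where one person had
# the outcome and one did not, how often did the model score the person
# with the outcome higher? Data below are entirely hypothetical.

def concordance(scores, outcomes):
    pairs = [(s_event, s_none)
             for s_event, o1 in zip(scores, outcomes) if o1
             for s_none, o2 in zip(scores, outcomes) if not o2]
    if not pairs:
        return float("nan")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a, b in pairs)
    return wins / len(pairs)

# Hypothetical risk scores and whether a respiratory event occurred (1 = yes).
scores = [0.8, 0.3, 0.6, 0.2, 0.9]
events = [1, 0, 1, 0, 0]
print(round(concordance(scores, events), 2))  # 0.67; 0.5 is chance, 1.0 perfect
```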

Host: And in general, you found that adopting the race-neutral GLI equations led to more Black patients being classified as having lung disease as well as more severe lung disease. To highlight one example, the race-neutral GLI equations led to increased findings of obstructive impairment, pathology that can be indicative of diseases like asthma and COPD, among Black, Hispanic, and white participants, and decreased findings of obstructive impairment among participants of Asian or other race or ethnic groups.

Scaled to the U.S. population, this amounts to 2.64 million people newly being classified with obstruction and 565,000 no longer being classified with obstruction. How should we think about whether these reclassifications are accurate or not? That is, whether they correctly identify patients as having or not having lung disease.

James: So I think there's two approaches for doing this. The first is the one that's more commonly adopted in the respiratory world, which is you find a distribution of normal, and then you see whether you are outside of the distribution, and then you just see among a healthy population, so people who have no lung disease, no symptoms and are non-smokers, whether or not you can actually predict for those people what their lung function is.

And I think that's been the dominant method by which this has been done, largely because it's agnostic to any outcome variable. So you don't need to have an outcome like COPD in 10 years or mortality in 10 years or things like that in order to do this type of analysis. It's very easy to do this when you might apply it to 10 or 20 different outcomes, as is the case here.

But, what we thought was a better way of trying to get at the question of whether people labeled with disease actually have disease, is to compare it to outcomes that patients care about. And so this is what we talked about earlier, which was things like, did they go to the hospital in the last year for a wheezing episode? Do they report functional limitation from wheezing or from a lung disease? And in the next 10, 20, 30 years, are they going to experience new lung disease or mortality from lung disease or things like that?

And these are things that we think are more specific outcomes that help determine whether or not something is actually a disease or not because disease incorporates not just physiologic classification, but also a sense of values around this. And so things like the actual experiences of patients that they have later and just how bad those are, are things that also should be taken into account.

Host: Yeah, I guess just intuitively, it seems like a potential downside of this approach could be that if you're missing data about patients who were, for whatever reason, unable to seek care, and you might expect that there could be disparities across racial groups in people's abilities to seek care, that you could then replicate that in your analysis. Is that correct? Is that an issue that you were worried about or able to get around in any way?

James: Oh, the idea that, for example, if someone were to develop COPD 10 years down the line, but then were not captured in the data set? Is that the question?

Host: Right. If you're tracking things like the number of times someone presented to the hospital for a wheezing episode, and someone lives in rural Kansas and doesn't have good access to health care, they may not go to the hospital for a wheezing episode. And you might expect those patterns to track different races in different ways.

James: Absolutely. I think there's definitely this concern across a lot of the different outcomes, and so this is something that you can try to improve on just by having higher quality data sets and trying to triangulate from different ways of measuring the same outcome.

For example, things like people not reporting asthma or COPD down the line, that's subject to bias. But things like mortality from lung disease, a little easier to capture in an objective fashion. Things like people going to the hospital for wheezing, again, more subject to barriers and disparities in access and so on. But things like self-reported limitations, maybe a little better.

And so I think this is part of why we had not just one different outcome for each of the buckets that we cared about, but several with the intention of trying to see whether there are overall trends rather than indexing too much to just any one of these outcomes.

Host: That makes sense. Changes in patients' estimated lung function also had implications for their access to social services and healthcare. For instance, you found that the average Black veteran with lung disease might see a several thousand dollar increase in their annual disability compensation.

Similarly, Black lung transplant candidates would move forward in line for lung transplant priority. These seem like positive changes for Black patients. However, as you noted in the paper, other work has found that because Black patients are now classified as having worse lung function, surgeons are now less likely to recommend lung cancer surgery for them.

Black patients may also incur higher insurance premiums if their lung disease is classified as more severe. In short, the new algorithms appear to help Black patients in some ways while hurting them in others. Should people who are designing algorithms or deciding whether to use them factor the downstream effects into their decisions about whether to include race?

James: It's a great question, and I think it's, in my head, a matter of degree. I think there should be some thought towards this in choosing, for example, what outcome measures they're optimizing for. When you're designing a model, there are always going to be trade-offs: you could go for absolute maximum accuracy, or you could accept a slightly less accurate model if the most accurate one has a bigger gap in accuracy between populations, which might be less desirable. You could also be thinking about whether that gap acts in directions that worsen or improve disparities.

For example, if Black Americans are less likely to be appropriately screened or appropriately treated, and this increases the number of Black Americans who are screened or treated, then that might be less bad of an inaccuracy than the other direction.

I also want to be wary about putting too much of this on the shoulders of people designing the algorithms. There's enough that they have to be thinking about already. There's already so much that goes into developing and validating these equations. My sense is that even though it's helpful to be thoughtful about what is being optimized for, most of the burden of thinking about how this will actually affect people should be on the policy makers who are drawing the thresholds at which people act on the algorithm outputs and on the appropriate application of those algorithms.

And so the way I think of it is: the algorithms produce a number, and the number should be informative. But the questions of what to do with that number, and at what threshold you do something, are the places where questions around direct patient impact and implications are most salient, and where policymakers actually have greater latitude to affect these outcomes than the algorithm designers do.

Host: Okay, just to confirm then, you would be less worried about an inaccuracy that counteracted disparities than one that reinforced them?

James: I think development of algorithms involves some politics around the choice of the outcome that you want to optimize for. And I think it's important that those designing algorithms are thoughtful about it, including how it actually will affect people. But I don't think it is entirely on their shoulders to think about how it will be deployed and used and recommended. And I think that the principal goal should be to develop something that is informative and accurate, but then what it means to be informative and accurate, I think they should also be thoughtful about.

Host: And so I guess, as you're saying, there are different ways to cash out what it means for an algorithm to be informative and accurate, and designers of algorithms have some amount of leeway or discretion in determining which of those ways to cash this out. They should err on the side of cashing out "informative and accurate" in ways that are equity promoting.

James: Exactly. For example, if they are developing something where there's a well-known disparity, it would be valuable for them to be thinking about, if I have the choice of several different outcomes that I want to optimize for, or several different - in the machine learning world, we might say cost functions that are being used to evaluate how well a model is working - it might be helpful to not choose the one that pushes things in the wrong direction. The wrong direction meaning, for example, if there's an existing disparity, having an inaccuracy that widens that disparity.
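Note: a minimal sketch of the kind of per-group audit James is describing, with entirely hypothetical numbers. The point is that overall error alone can hide a gap that runs in the direction of an existing disparity.

```python
# Per-group error audit during model selection. All numbers hypothetical.

def mean_abs_error(preds, truth):
    return sum(abs(p - t) for p, t in zip(preds, truth)) / len(preds)

def audit_by_group(preds, truth, groups):
    out = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        out[g] = mean_abs_error([preds[i] for i in idx], [truth[i] for i in idx])
    return out

truth  = [0.10, 0.12, 0.08, 0.20, 0.22, 0.18]   # "true" 10-year risks
groups = ["A", "A", "A", "B", "B", "B"]
model1 = [0.11, 0.12, 0.09, 0.16, 0.18, 0.15]   # underpredicts for group B
model2 = [0.14, 0.16, 0.12, 0.19, 0.21, 0.17]   # worse for A, fairer for B

for name, preds in [("model1", model1), ("model2", model2)]:
    print(name, "overall:", round(mean_abs_error(preds, truth), 3),
          "by group:", audit_by_group(preds, truth, groups))
# model1 has the lower overall error, but its underprediction for group B
# would widen an existing disparity if group B is already under-treated.
```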

Host: Just to make this more concrete, using an example from your research on lung function: You found that the race-based GLI equations and the race-neutral one were similarly accurate for a number of respiratory outcomes. And the differences that did exist were generally minor.

So in this case, the algorithms are in some sense similarly accurate, but the downstream implications are importantly different for different groups of patients. You might think there's an argument here for just using the algorithm that seems to have better or fairer downstream implications. Would that be your view?

James: Absolutely. I think this goes back to the criterion of substantial benefit. So in this case, what I would categorize as a substantial benefit would be an algorithm, for example, maybe one that uses race that is just substantially better at rank-ordering who is more likely to experience adverse outcomes or other outcomes associated with having poor lung function.

And so in this case, what we found was that there was no such substantial benefit from including race. And so, in that case, my sense is the other factors that generally promote the removal of race that we've talked about would favor just going with the more straightforward, simpler equation rather than one that adds it without this improvement in discriminative accuracy.

Host: Okay, and are you more motivated there by the fact that the algorithm that did not include race seemed to have more positive clinical implications for Black patients like the disability payments that we referenced and so on, or are you more motivated by concerns about trust and these other types of concerns that were mentioned previously?

James: Definitely the latter.

Host: Okay, great. So we want to transition now to talking about another paper you recently published. Some of the context is similar. You were interested in studying how the switch from an equation that included race as an input to one that didn't might affect clinical care. However, the context otherwise differs.

In this study, you looked at algorithms that predict a patient's 10-year risk of atherosclerotic events like heart attacks and strokes. These 10-year risk estimates are one of the things that are used to determine who should be on a statin, which is a medication that lowers one's cholesterol and level of inflammation, and in so doing, can greatly reduce people's risk of having heart attacks, strokes, or dying of other cardiovascular causes.

Previously, risk was estimated using an equation that was criticized for considering race, among other things. So the American Heart Association developed a new set of equations called the PREVENT equations, which take some additional things into account, like someone's kidney function, while no longer considering others, like race. And while the PREVENT equations haven't yet replaced the old ones, this change is being considered. Did we get that right? And is there other context you would add?

James: Yeah, no other context. You're very comprehensive.

Host: Excellent. So, in your study, you aimed to estimate how many people would be placed in a different risk category, how many would have their treatment eligibility change, or would experience different clinical outcomes as a result of the new equations. Could you say a bit about how the methods and data used to develop the PREVENT equations differed from those used to develop the old equations and why we would theoretically expect this to lead to different risk assessments?

James: The motivation behind the PREVENT equations was severalfold. I think the first was that there was mounting data that the previous equation, the pooled cohort equation, was overestimating risk in people, largely attributed to the possibility that as clinical risk factors were being better managed, as cardiovascular care has improved in this country, as stents were being developed and improved, as new therapies are being developed and improved, the number of heart attacks and strokes is just going down, and we're over-calling the number of patients who actually develop them. So part of the motivation was to use more contemporary data that capture these more recent trends and are more applicable to today's population.

Another is that there are certain variables that are known to be related to cardiovascular risk that were not included in the previous equation. Things like kidney function, or for some other cardiovascular outcomes, things like BMI, things like blood sugar. And so there was an opportunity to incorporate some of these measures to improve the accuracy of the equation.

And then lastly, there is this thought that the old equations were developed based on research cohorts, and that there are, let's say, peculiarities of real-world data that might make it less applicable. And so using a very large real-world data set from EHR data comprising about 95 percent of the data used to develop the new models was another motivation. And then in that process, they also decided that removing race was going to be a critical part of the new development process as well.

Host: So this is a clarificatory question. One thing I took from what you just said is the old algorithms were overestimating people's risk in part because people are getting stents, taking statins, getting bypass surgery, doing all these clinical interventions that are preventing them from having heart attacks and strokes. In some ways it seems weird to then account for that given that the whole point of these algorithms is just to predict who is at risk independent of whether they get a stent or they start a statin, right? Because the algorithms are partly telling us who ought to be on a statin, who we ought to intervene on. And so if the risk is lower just because at a population level these interventions have occurred and so we're seeing fewer events at a population level, are we sure we want to be baking that into the algorithm or am I misunderstanding?

James: I think I misspoke by evoking stents and some of these other therapies, because they come after the fact of the heart attack or event. I think the reality is we don't exactly know why rates are going down or why the overprediction has grown. There are a lot of different explanations that would justify the recalibration of the equation.

So, for example, if people are just exercising more, which I don't think is likely to be the explanation, but imagine that were the case, then you would be thinking, "Hey, people are exercising more, therefore less likely to get heart attacks, therefore benefit less from these medications," or if there are things like other comorbidities are being better controlled or things like that, they're just less likely to get heart attacks or strokes.

And regardless of what that explanation is, I think just being able to have a risk estimate that aligns with the reality of how many people are actually developing these conditions is, I think, an important goal.

Host: I mean, it does seem odd, though. I take your point with respect to treatments that are used after the cardiovascular event occurs. But with respect to statins, it does seem like if that's a big driver of decreased risk, it seems a little bit strange to factor that into the equation that's predicting who needs things like statins.

James: I think most of the evidence has shown that the equation actually doesn't vary too much for people who are on statins versus not on statins, so that's unlikely to be a really big driver of that effect. And then also the new equations do have statin use as an input. It's not a large coefficient, but it does account for some of that.

But I think the general understanding is that because we're accounting for the risk factors like LDL that statin operates on, that would not itself explain the difference in rates because it's the difference in rates even after accounting for blood pressure, LDL, smoking, and things like that.

Host: I'm just going to try to summarize my understanding. So basically, these calculators are trying to estimate people's risk of a heart attack or stroke by looking at things like their LDL, their blood pressure, their BMI. Is there BMI in there?

James: BMI is not for the heart attack and stroke, but for heart failure is one of the predictors.

Host: Okay. A1c, maybe sex. And it's looking at all these different variables and trying to predict people's risk. And we were doing this with the old equations, but the old equations were over-predicting people's risk, even when you gave the calculators all these correct inputs. And so there's something else that's happening that is not well captured by the inputs you're putting into the calculator that explains why the population is having falling risk. We don't totally know what that thing is, but it's important to build calculators that are getting at it better. Is that right?

James: That's exactly right.

Host: And just as another point of clarification, it seems like the new equations were trying to solve two separate problems. Is that right? So first, the old equations were over-predicting risk, which we can address in certain ways. And second, and separately, they were not accounting for important variables like kidney function. And so we can also fix that by including kidney function. But these are basically separate problems.

James: Exactly.

Host: Amazing. Okay. So to return to your studies, you looked at a cohort of about 8,000 mostly middle-aged participants who had not yet experienced a heart attack, stroke, or heart failure, and who had the lab and other clinical data needed to estimate their risk using the old and new equations. In general, the risk estimates calculated using the new PREVENT equations were lower than the risk estimates calculated using the old equations. This was true across subgroups of age, gender, and race. In fact, the mean 10-year risk was only 4.6 percent using the new equations, whereas it had been twice that, 9 percent, using the old ones.

Or maybe one way to think about this is that if you simulated the next decade of a typical participant's life 20 times, the old equations predict that they'd experience an event in two of those simulations, while the new equations predict that they'd have an event in just one.
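Note: the "simulations" framing is just expected-value arithmetic:

```python
# Expected events across 20 simulated decades at each estimated risk level.
old_risk, new_risk, n_decades = 0.09, 0.046, 20
print(f"old: {old_risk * n_decades:.1f} expected events")   # 1.8  -> about two
print(f"new: {new_risk * n_decades:.2f} expected events")   # 0.92 -> about one
```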

Presumably one of these estimates is more accurate, and that at least feels like the kind of thing we should be able to figure out, for instance, by doing retrospective analyses. Why are these algorithms generating such different risk estimates when presumably there is some ground truth about the actual level of risk?

James: The main problem is that the old equation was developed in, I believe, 2013 using data from the 70s, 80s, 90s. And so they're developed just using older data and the newer data had not yet been available because you have to follow people for 10 years over larger cohorts. And so this type of recalibration to modern populations wasn't really possible until more recently when that data had been collected.

It's generally well understood that the old equations were over-predicting risk. I think that doesn't necessarily mean that the new equations are perfect. They could be under-predicting risk. And so there's still a big push to get more validation data to make sure that the new equation works well and doesn't have its own biases or miscalibrations. But I think in general, my expectation is that I tend to trust the preliminary data from at least the validation that the original authors did for the PREVENT equations, the newer equations, that it is more accurate than the Pooled Cohort equations, which were the older equations.

Host: So just to make sure that I understand. So the idea is that if we were to look at a cohort of typical patients that are similar to the ones that you looked at in your study, and we were to estimate their risk, the PREVENT equations are more accurately estimating their risk in 2024.

James: Yes. That's my expectation.

Host: Okay. And maybe this is just going back to what you said earlier about how we just don't know why the risk has been going down. But is there a particular characteristic or risk factor that was added or removed from the data that you think is primarily responsible for that difference?

James: No, I think it's mostly just the population. I think if you used the exact same variables in the equation, but just refit it on the newer data, it would look pretty similar.

Host: Okay, mysterious. So as mentioned earlier, the guidelines regarding who should receive statins are partly based on a patient's estimated 10-year risk of having a heart attack or stroke. Though the guidelines are more nuanced than this, the threshold for recommending a statin is basically 7.5%, which is the cutoff for moderate risk. People whose risk is above that threshold should generally be on a statin according to the guidelines. Side note for listeners, if you're curious about your risk level, you can estimate it using the MDCalc ASCVD Risk Calculator online, which we will link to in the episode notes.

Since the PREVENT equations generally estimate that people are at lower risk, the number of people above that 7.5 percent threshold would drop substantially if the new equations replaced the old ones. In fact, the new equations would recommend that 14.3 million fewer people be on a statin as compared to the old equations.

Now, of course, recommendations don't translate perfectly to people receiving treatment, so you estimate that the PREVENT equations would cause about 6 million fewer people to receive statins and benefit from the 25 percent relative risk reduction associated with taking statins, which would lead to an additional 77,000 atherosclerotic events.
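Note: a back-of-the-envelope check on that arithmetic. The average baseline risk of the people losing statins is an assumed input here; the paper's estimate works from individual-level risks, not a single average.

```python
# Back-of-the-envelope check on the additional-events figure.
people_losing_statins = 6_000_000   # estimated to actually stop receiving statins
relative_risk_reduction = 0.25      # approximate 10-year RRR from statins
avg_baseline_risk = 0.051           # assumed mean 10-year event risk of this group

events_without_statins = people_losing_statins * avg_baseline_risk
events_prevented = events_without_statins * relative_risk_reduction
print(f"{events_prevented:,.0f} additional events over 10 years")  # 76,500
# Close to the 77,000 figure cited above, given the assumed baseline risk.
```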

From this perspective, the PREVENT equations look pretty bad because they're causing 14 million people to lose their statin eligibility, and 77,000 to have a heart attack or stroke. One way to look at this might be to say, yes, it's unfortunate that the new equations would cause a lot of people who would benefit from statins to no longer receive them, say, anyone whose risk was previously estimated to be 8 percent, but is now estimated to be 5 percent.

But we shouldn't be designing equations that misestimate people's risk just so that they can get therapies that might help them. Instead, we should just estimate people's risk accurately. And if the people who would benefit from statins are being told that at their risk level, they don't need one, the cardiologists just need to lower their threshold for recommending statins.

On the other hand, if we knew that that wasn't going to happen in the near future, maybe the world in which we use the old equations, which predict that people are at higher risk, recommend more people to take statins, and prevent tens of thousands of events is a better world since thousands of lives would be saved in that world. So where do you stand on this?

James: One of the motivations of the paper came from the fact that the equation was released without accompanying guidelines for how to use it. And so the temptation, I think, for many clinicians would be to take the previous guideline, which was the 7.5%, and apply it to the new equation.

And logically, it's not an inconsistent thing to do because the derivation of the 7.5 percent has nothing to do with the equation that's being used. It's just a matter of cost effectiveness and value and trade-off with potential other side effects. And so, the temptation, I think, would be to take this new equation, which is now available online in an app, and then to plug it in and then use the same 7.5%.

And I think the cautionary tale that we were trying to tell was that we think that both of these were in the wrong direction. So not just that the previous equation was overestimating risk, but the previous threshold could also have been set too high. And so if you keep them both there, then maybe the people receiving statins is roughly in an appropriate place. If you change one without the other, then you're going to see pretty substantial changes.

And so this is also something that is maybe influenced by the fact that I spend a lot of time with cardiologists who generally think that statins should be in the water and everyone should be getting them. But more contemporary data show that the threshold for cost effectiveness or for positive net benefit is a lot lower than seven and a half percent, even lower than five percent.

And so this question of when we have a more accurate equation, but without accompanying recommendations for the best threshold, we should be cautious about applying it to a threshold that was originally intended for a different equation, and that possibly was also set in part because of knowledge about overestimation.

Host: What is your sense of how quickly the guidelines would change if the new PREVENT equations were adopted?

James: That's a great question. I wish I knew. I have heard rumors that this has been in the works for a while because the last set of guidelines on preventive therapies was in 2019 and it's been a while, and I think the same folks who are involved with that are likely involved with the development of the new equations, so I am eagerly anticipating something soon, but have no idea whether that's actually grounded in reality.

Host: On this podcast, we also believe that statins should be in the water. And as you note in the paper, there is growing evidence that young adults may benefit from statins as having lower LDL over the long term is associated with having fewer heart attacks and strokes.

Meanwhile, the PREVENT equations predict that very few adults between the ages of 30 and 40 are at greater than a 7.5 percent risk of having a cardiovascular event in the next decade. This is presumably just true since these events are quite rare among young people.

So the PREVENT equations may be accurate for this population, but it's also possible that the PREVENT equations could lead to recommendations that are bad for this population since starting a statin at age 30 will probably lower one's risk to a lesser extent by age 40, but may have a greater effect at age 50 or 60 or 70, ages at which people tend to experience heart attacks and strokes.

To put this a bit differently, it seems like the model may, in some sense, be trained to predict the wrong thing. We want to reduce someone's lifetime risk but are building models that predict 10-year risk and then using that to make recommendations despite the fact that statins likely reduce a patient's lifetime risk even if their 10-year risk is low. Do you share this concern?

James: Absolutely. I think the 10-year horizon is partly a matter of convenience: having data over 10 years is easy to do, it's easy to understand, and collecting data on whether a 30-year-old will get a heart attack at age 70 - that's pretty hard. And it's more subject to shifts, as we've already discussed earlier, in the population that might affect how applicable they are in the present day. And so, I think that having good lifetime risk estimates is helpful.

I do wonder whether a risk-based approach is the best way to decide who gets a therapy. It actually stands out as unusual among the ways we do things in medicine, which are often based on very simple age, sex, or other thresholds.

And given that statins reduce the relative risk of heart attack or stroke by about 25 percent, regardless of one's starting LDL or other factors, there's a strong argument for using criteria other than calculated risk, things that don't require a clinician to calculate and plug in numbers in order to decide who receives this therapy.

And I think this is especially true for young people, where a lot of these conventional risk factors won't light up or produce anything particularly motivating for starting therapy, even though therapy would still be effective for them in the long term. So mathematically, I think it does make sense to index to risk. It's just that for young people, it's so hard to get a good sense of that risk.

And if we index too much to risk, we end up losing opportunities to decrease cardiovascular events, both because of the friction of ordering the necessary tests and plugging a bunch of numbers into a calculator, and because we underestimate someone's true level of benefit simply because they're so young.
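
For a sense of the arithmetic here: a fixed relative risk reduction of roughly 25 percent, the figure cited above, implies an absolute benefit that scales with baseline risk, which is the mathematical case for indexing to risk. A quick sketch with illustrative baseline risks:

```python
# Worked example: a fixed ~25% relative risk reduction translates into very
# different absolute benefit depending on baseline risk. Baselines are illustrative.
RELATIVE_RISK_REDUCTION = 0.25

for baseline_risk in (0.02, 0.075, 0.20):  # hypothetical 10-year baseline risks
    arr = baseline_risk * RELATIVE_RISK_REDUCTION  # absolute risk reduction
    nnt = 1 / arr                                  # number needed to treat
    print(f"baseline {baseline_risk:.1%}: ARR {arr:.2%}, NNT = {nnt:.0f}")
# A 2% baseline yields an ARR of 0.5% (NNT 200); a 20% baseline yields 5% (NNT 20).
```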

Host: So you mentioned earlier that the guidelines that set the threshold at 7.5 percent, which I guess are the current guidelines, may have been a reaction to the old (and still current) equations, which are known to overestimate risk. So it sounds like guidelines are often going to react to the way the equations work. And we were wondering: do you think the way the equations are designed should ever react to how people are likely to use them, or to what the guidelines might be?

James: Donald Lloyd-Jones, who is one of the people involved in both developing the equations and setting the guidelines, mentioned this briefly on Twitter. So it's kind of hard to know whether that was... I mean, I trust that he was very central to the process. On the other hand, the actual documentation of the rationale for the 7.5 percent doesn't mention that; the threshold is derived primarily from balancing the risk of new-onset diabetes, which statins are associated with, against the benefit of reducing heart attacks and strokes. And so it's unclear to what extent the overestimation factored in.

But in terms of whether they should do that type of thing, I really feel that it would be best if developing accurate equations and setting appropriate risk thresholds were done independently, with an eye towards making the equations as accurate as possible and drawing the thresholds where cost effectiveness and positive net benefit indicate they should be.

And then if there ends up being a signal for something like overestimation or underestimation, it's not the job of the other side to account for it, because you end up with all of these two wrongs trying to make a right. And then once you fix one, the other becomes broken. I think decoupling the two is the best approach for long-term achievement of appropriate policy.

Host: Yeah, that makes sense. So in the paper, you found that with the PREVENT equations, Black patients would become ineligible for statins at higher rates than White patients. Why was this?

James: Oh, yeah, so this is because the old equation had an input for African American, which accounted for the fact that Black Americans experience heart attacks and strokes at higher rates than White Americans, even after controlling for everything else. And so this was something that they did to bump up the risk numbers to meet the observed rates.

And then, in the new equation, we don't have that anymore. And so, the decrease in risk, which happened across the board, is going to be a little higher for Black Americans. And so, because their risk drops a little more, it means that they're more likely to cross that threshold from above 7.5 percent to below 7.5 percent or whatever other threshold that we care about.
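
As a toy illustration of that threshold-crossing mechanism: in a logistic-style risk model, removing a positive race coefficient lowers predicted risk only for the group it applied to, and can drop a patient from just above 7.5 percent to just below it. All numbers below are invented; these are not the actual Pooled Cohort or PREVENT coefficients.

```python
import math

# Toy logistic risk model. All coefficients and inputs are invented for
# illustration; these are not the Pooled Cohort or PREVENT coefficients.
def ten_year_risk(age, sbp, black, include_race_term):
    lin = -7.0 + 0.06 * age + 0.01 * sbp
    if include_race_term and black:
        lin += 0.3  # hypothetical upward adjustment, as in the older equation
    return 1 / (1 + math.exp(-lin))

with_term = ten_year_risk(age=55, sbp=110, black=True, include_race_term=True)
without_term = ten_year_risk(age=55, sbp=110, black=True, include_race_term=False)
print(f"with race term:    {with_term:.1%}")     # ~9.1%, above the 7.5% threshold
print(f"without race term: {without_term:.1%}")  # ~6.9%, now below the threshold
```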

One really interesting addition: even though we found that eligibility changes were more prevalent among Black Americans, the projected changes in heart attacks and strokes were actually not different. And the reason is that Black Americans have less access to guideline-recommended care in the first place. If fewer of them are being given statins in line with recommendations, then they're less affected by eligibility changes. If you're no longer recommended for a statin, but your doctor wasn't going to give you a statin anyway, then you're not going to be affected by that change.

And so, again, another case of two wrongs don't make a right where there actually isn't a material projected disparity in the outcome of heart attacks or strokes, but that doesn't mean that we should let it slide that there's going to be eligibility differences or let it slide that there are these differences in access to care. It just means that these are two things that happen to cancel out, but will require renewed attention over the next decades to address both.

Host: Oh, interesting. Okay. So just to summarize what I'm taking from this. If it were true that everyone had the same level of guideline-concordant care, then the bigger eligibility change for Black Americans would lead to increased rates of heart attack and stroke, even relative to White Americans. But because fewer of them receive guideline-concordant care, we actually don't observe that difference in the cardiovascular outcomes. And so this is kind of a case in which the two wrongs are making a right, in a sense. Of course, I mean, we're feeding in one disparity and that just happens to be canceling out another disparity.

James: Yeah, exactly.

Host: So Black patients are much more likely than patients of other races to die of heart disease. This can be explained in part by disparities in other kinds of diseases that put patients at increased risk of heart disease, like high blood pressure or kidney disease, both of which are captured by the PREVENT equations.

But some of the factors that may cause Black patients to disproportionately die of heart disease, as well as experience higher rates of high blood pressure and kidney disease, may not be well captured by the PREVENT equations. For instance, the equations don't include variables like whether a patient has been discriminated against by providers.

It seems possible that for the purposes of this algorithm, race could be thought of as a watershed variable that incorporates elements of multiple measured and unmeasured variables, including ones that are the byproducts of implicit bias or structural racism and that contribute to risk. And this would seem like an argument for including race in the algorithm. Do you think that that's right?

James: It is very controversial, for sure, this idea of if we're using race not to stand for a biological difference, but for the effects of racism, does this suddenly become a very different approach? And I think there are people who would say yes, and there are people who would say no.

I tend to lean away from that question of why we're putting it in there, and more toward the question of substantial benefit. Regardless of whether you're putting it in there because you had racist ideas or because you think it will capture social, environmental, and other changes, the predictions for people are going to be the same when you include it.

And so I put more weight on the question of whether it provides substantial benefit in terms of the accuracy of the equation, weighed against the constant harms of promoting distrust or solidifying race as an organizing principle. These are things that don't change regardless of your motivation for including it.

Host: Do you think that including race in the PREVENT equations would provide substantial benefit?

James: Unfortunately, it's impossible to know, because they didn't include that as a comparison. Ideally I would want to see what the performance metrics look like with and without it. Based on what we saw with the Pooled Cohort equations, the effect wasn't that large. And in the change from the old equations to the new equations, the difference experienced by Black Americans versus White Americans was also not that big.

So for example, the proportion affected would be about 8 percent for White Americans and 10 percent for Black Americans, which is a much, much smaller difference than, for example, the one seen across age groups or across sex. And so my guess would be that, especially after accounting for other things like kidney disease and other factors that are known to vary by race, it wouldn't meet my personal threshold for substantial benefit. But of course, I'm not going to say that with 100 percent certainty, and it'd be best to be able to see data on that. That's my guess.

Host: And just to clarify what you mean by 8 percent and 10 percent: the idea is that if race were included in the PREVENT equations, which it's not, that would change recommendations for 10 percent of Black patients and 8 percent of White patients?

James: Or rather, when you compare the PREVENT equations without race to the old equations with race, the proportion of people with a difference would be 10 percent for Black and 8 percent for White Americans. That is the proxy for what we would see if we were to have or not have race in the PREVENT equation. A very imperfect proxy, but just the best data that we have.

Host: And that does not reflect the sort of background injustice we were talking about earlier, of Black patients being less likely to get guideline-concordant therapy in the first place. This is just, in theory, in the abstract, what the recommendations state.

James: In theory.

Host: So, as we've discussed, if the PREVENT equations are used, statins are less likely to be recommended to Black patients because these equations underestimate atherosclerotic event risk even more among Black adults compared to White adults.

Given that one of the goals of removing race from the PREVENT equations was to improve the care that Black patients receive, this is a striking result because it instead looks like these equations could increase disparities in patients' receipt of care. What can this finding teach us more generally about when and whether race should be used to estimate patients' risk?

James: I think this goes back to a question asked earlier about what matters more: whether Black patients or vulnerable populations benefit or are harmed, or the general concerns, which apply regardless of whether they're benefited or harmed, about promoting distrust or contributing to a culture in which race is treated as an important factor.

And so I tend to lean towards the latter as being stronger arguments for not including race rather than the former. And I think while it does, of course, matter that Black Americans who are already less likely to receive guideline-recommended care are going to have decreased eligibility, and that's very important to consider and should be weighed, it is also true that we found that the material outcomes wouldn't change.

And so what to make of that is kind of tough, because at the end of the day, we care what happens to actual human beings. At the same time, there's something that doesn't feel right about moving the needle in the wrong direction. And I think this is why a lot of these decisions are made by these task forces involving experts across the humanities, social sciences, medicine, and statistics, because I think there's a lot of values that come in there too about how much of a difference is enough to matter and whether the other competing interests that we've already discussed would outweigh that decrease in eligibility.

But my personal leaning is that the observed differences that we see are just a lot smaller than the ones that we already see for gender and age and other attributes. I feel relatively comfortable, especially given the finding of no material changes, with the new equation being recommended more broadly.

Host: You've mentioned, and it intuitively makes sense, that including race in algorithms can have this downside of contributing to mistrust, and can also reinforce misconceptions people have about race being biological. It's intuitive why this would happen, but I'm wondering whether there are data suggesting it's true. When you include race in clinical algorithms, do people actually feel more distrustful? Is there a causal link there, or is it just a theoretical thing that's been hypothesized?

James: It's absolutely theoretical at this point. I think there are a lot of case reports of patients, advocates, and different groups who have told their stories, but large-scale quantitative data are definitely lacking. And that's actually something I've been working on with your sister, Leah: asking a larger group of people, presumably patients who receive healthcare, how they actually feel about the use of race, exactly how many people have strong reactions, discomfort, or preferences against the use of their race, what their rationale is, and what kind of situational or personal factors mediate that.

I think that's something that we've been really interested in and excited to share at some point soon. I think it's definitely going to also depend on broader discussions about how many people having these types of preferences or distrust is sufficient to dispense with a predictive variable. And then, whether the type of discomfort people feel about race is more sensitive or more important to account for than other similar concerns about the use of, for example, age or sex.

Host: I mean, because it does strike me as a little bit tragic if, for example, excluding race from consideration is leading 10 percent of patients to be given recommendations that are not good for them. But we're doing that because it promotes trust. And the thing that is promoting trust is actually providing minority patients worse care.

James: I think this question of "worse care" is a very important one. Because of course, if it actually is worse care, substantially worse care, that, I agree, feels unacceptable. But there are cases where worse care, care that falls short of some threshold of significance, is accepted all the time.

So for example, in medicine, a lot of the things we do are dichotomized. In a lot of these predictive algorithms, we'll say either they have a high heart rate or they don't, without putting in the actual number; either they have a fever or they don't, without putting in the number. And statisticians hate this, right? Because it's absolutely horrible for model performance. If you were to actually put in the numbers, it would do a lot better, but the competing interest there is convenience and whether people actually use it.

And I think there are a lot of times that we do trade some accuracy in these different predictive algorithms or guidelines or models for things like convenience or cost or trust from clinicians that they understand what's going on under the hood. That's partially why most of these models are linear models, even though more expressive models have been around for decades now. No one's using a random forest to predict these things. No one's using a neural network.

And so I think that the question of whether this competing interest is sufficient really depends on whether we can define something as like actually worse care versus just within the bounds that we are normally used to accepting within medicine.
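
A minimal sketch of the dichotomization point, on synthetic data with an invented data-generating process: a logistic regression given the raw heart rate should discriminate better than one given only a cutoff indicator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data: risk rises smoothly with heart rate. The data-generating
# process is made up purely to illustrate the cost of dichotomization.
rng = np.random.default_rng(0)
heart_rate = rng.normal(85, 15, 5000)
p_event = 1 / (1 + np.exp(-(heart_rate - 95) / 8))
y = rng.binomial(1, p_event)

X_cont = heart_rate.reshape(-1, 1)                         # raw value
X_dich = (heart_rate > 100).astype(float).reshape(-1, 1)   # "tachycardic yes/no" cutoff

for name, X in [("continuous", X_cont), ("dichotomized", X_dich)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    probs = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(f"{name}: AUC = {roc_auc_score(y_te, probs):.3f}")
# The continuous version should score noticeably higher: the cutoff throws
# away information the model could have used.
```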

Host: Yeah, that makes sense. And just to be explicit, I'm also assuming in this conversation, as we discussed earlier, that it's good for more people at marginal risk to be on statins. That's also an assumption that some people could contest.

So in a New York Times article about these new equations, the reporter notes, quote, "the scientists who modified the algorithm decided from the start that race itself did not belong in clinical tools used to guide medical decision making." One of the developers of the algorithm justified this by saying that, quote, "race is a social construct," and that, quote, "including race in clinical equations can cause significant harm by implying that it is a biological predictor."

So it sounds like the algorithm builders started with the default assumption that race should not be used in clinical algorithms because it's a social construct. What do you make of this as an argument against including race?

James: So I think it's valuable here to zoom out and just think about what things would lead us to exclude a variable in general. So for example, how about the multigrain bread variable? Is that something we feel comfortable with even if it increases accuracy? There are some people who would say, yes, you know, like give me the best estimate of my COPD risk. I'm very happy to tell you about my multigrain bread consumption. And others would say, why do you need that? And then there would be issues from that.

And then another example would be when we're talking about sensitive attributes. If you ask people who they voted for in the last election, my sense is that that'll be a very powerful predictor of a lot of different health outcomes that we care about, independent of the data that we already collect. But on the other hand, I think a lot of people would be uncomfortable with that.

What we found already in the survey that I discussed of the general population's opinions is that people are oftentimes even more distrustful of surrogates for race, or not surrogates, but replacements, social demographic replacements, like zip code and income, than they are of race itself.

And so I think this question of whether excluding race to begin with is appropriate relies on a debate that has not quite concluded yet about which variables in general are appropriate to consider regardless of how much they improve performance.

My personal sense is that there is a level of substantial benefit, when it cannot be achieved by other means, that would justify the inclusion of almost any variable, including multigrain bread, though maybe not voting patterns, because there the benefit would have to be so high to overcome the other harms to trust and so on.

And for the use of race, I do feel that it's not that it should never be used, but that the harms it incurs on trust and on medical culture have been substantially underestimated in past decades. The bar for substantial benefit in my head is quite high, and maybe people who exclude race from the start are aligning with the idea that nothing would ever meet that bar. I don't think my position is quite there. But I do feel the bar has been set extremely low in past decades, and aligning on where it should be is a really controversial topic that I think will be difficult to resolve for the foreseeable future.

Host: Do you have any theories about why people are more distrustful when asked about zip code and these other variables you mentioned than about race?

James: Part of it is this idea of, does this really represent me? Some people might say something along the lines of, "Hey, I literally just moved here. What does this zip code tell you about me that will not actually apply to me?" And then similarly for income, people might be thinking about, "Are you going to be discriminating against me based on my ability to pay? Or are you assuming that I'm going to be less protective of my health because I'm lower income?"

Similar questions arise with race as well, where people think, "Are there assumptions baked in about my behavior, my health behaviors, my health knowledge, and so on, that don't actually apply to me, but that are being used to inform healthcare decisions for me," based on what is essentially a stereotype. It's not that other variables aren't stereotypes; sex and age are stereotypes too. But there's something uniquely harmful sometimes about having race as a stereotype, and depending on what it's actually standing in for, I can see why there's distrust.

Host: And when you're asking people about how they feel about these things like zip code or income or even race, how do you ask that in a way that... So I imagine a lot of people would just be like, "Yeah, don't include any of it." So how do you elicit those responses while accounting for the fact that they might not understand that it is making their risk predictions more accurate?

James: So one way to do that would be to randomize the addition of text that prompts people to additionally consider something like "this will make the estimate more accurate," and then to see how much people change their answers in response.
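
A minimal sketch of that kind of randomized framing, with hypothetical survey text: respondents are randomly assigned to see, or not see, an added accuracy note, and the two arms' agreement rates can then be compared.

```python
import random

# Sketch of a randomized survey framing. All prompt text is hypothetical;
# the design just assigns each respondent to one of two arms.
BASE_PROMPT = "Would you be comfortable with your zip code being used in your risk estimate?"
ACCURACY_NOTE = " Note: including it makes the estimate more accurate."

def assign_prompt(rng):
    arm = rng.choice(["control", "accuracy_framing"])
    prompt = BASE_PROMPT + (ACCURACY_NOTE if arm == "accuracy_framing" else "")
    return arm, prompt

rng = random.Random(42)
arm, prompt = assign_prompt(rng)
print(arm, "->", prompt)
# Comparing agreement rates between the two arms estimates how much the
# accuracy framing shifts people's stated preferences.
```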

Host: I mean, it seems like there are maybe two different issues related to distrust here. Like one has to do with the inclusion of these variables in the clinical algorithm and the other has to do with how are we going to elicit this information from people. So like presumably a hospital knows my zip code because it bills me for things. And so it has this information and if they know that information and can use clinical algorithms that are going to make better risk predictions for me, that feels like a separate issue than if your doctor opens the encounter by being like, "Before we proceed, let me ask you, what's your zip code? How much money do you make? Who did you vote for in the last election?" Which obviously we can imagine would lead to distrust.

So do you think there's an important difference there? If you can elicit this information in ways that are not going to actively undermine trust, is it more reasonable to then use some of these inputs in clinical algorithms?

James: I think it's actually the opposite. I worry about transparency. I think one thing that generated a lot of the backlash, for example, in the use of race and kidney function, is that a lot of people had no idea what's going on. And then there are people who realized like, "Oh, if they had not accounted for my race, I would have been on the transplant waiting list four years ago."

And in fact, if their doctor had sat down with them and said, "Hey, we're choosing to use this because we think it makes it more accurate and gives us a better sense of how well your kidneys are actually working," maybe that would be something that they would find more acceptable than if it happened without their knowledge. And I think that's something that we've seen as well in the preliminary survey data.

But then there's also this matter of data being collected from the patient versus from other means. It's tricky because I think, again, one thing that's really common is clinician-assigned race, someone just like eyeballing someone and taking their best guess versus where the patient actually provides it themselves.

In general, the movement is towards self-identified demographics, including race, as the correct, more standardized, and more acceptable way of doing things, rather than just assigning race. That's true even though the eyeball approach could potentially avoid some of the trust issues in that initial encounter, because race is never brought up, while causing problems down the road when people find out it was done without their knowledge.

Host: I was just curious because we had a guest on our podcast who mentioned that one thing that could be useful would be having multiple variables that are related to race as opposed to just one in the way that we have, say, sex and gender. And I don't know what they would be, but maybe one of those variables would capture the sort of social model of race and another one would capture something more like genomic ancestry. I was wondering if you thought there might be any helpful sub-variables to have in data collection.

James: I see. So things that work along similar axes as race, but aren't exactly the same. Yeah. There are a large number of things that have been substituted for race to capture population-level variation. Things that have been proposed include genetic ancestry and geographic region, like country of origin or country of birth.

I think there are certainly issues with each of these, independent of race, and some of the same issues apply. They're also definitely implemented inconsistently across different algorithms and different locales. I do feel that some of these are an improvement on race, but at the end of the day, the holy grail, what would really be ideal, would be to identify the more precise underlying causal variables that account for these differences.

Absent that, I do feel that something like what Europe is doing around cardiovascular risk, where baseline risk is assigned at the country level, is more appropriate.

Host: I mean, one thing that seems just hard about the holy grail of identifying these causal variables is that presumably those too differ for different people, right? So it could be that what's driving disparities for one patient is experiences of racism among their providers. It could be for another patient, it's a result of their family having been impacted by redlining and so on.

And so necessarily when you're trying to develop algorithms that work for populations, you are making assumptions about the causal pathways that are just not going to apply to every individual in that population. So the holy grail to some extent, it's not that it's just unattainable, it's also maybe undefined in some sense.

James: Yeah, I think, for example, if it were true that the anthropometric hypothesis were correct and that that accounted for 100 percent of the differences, then that would be a much better way of doing it. And then the way you would replace race is by just measuring people's sitting height instead of standing height or their thoracic cavity size instead of their height. And that would be an example of being able to identify the underlying causal factor and be able to address that appropriately.

But I do agree that in most cases, it's going to be a lot more complex than that. And even if we were able to explain 100 percent of the difference, to what extent would those variables be practical for point-of-care implementation? To what extent would they just be recapitulating race, even without it as a direct input?

And so, I do agree. When I first started working on this, I thought that was going to be the solution for everything. I'm now less confident that it can be achieved in a short time frame, in a way that will actually be implementable at scale. And so I think it's going to be more of a discussion of the other things we talked about that could substitute for race and be a little better, versus this question of: if we remove it altogether, how much of a difference does it make? And does it fall within the boundaries that we normally consider acceptable within medicine?

Host: So we've discussed some of the implications of adding or removing race from clinical algorithms. We're curious whether in doing your research on these algorithms, you've noticed any common themes or failure modes that lead algorithm developers to go awry with respect to their consideration of race.

James: That's a great question. I think the common theme is that race is just such a powerful predictive variable because it's so densely connected with so many of these social, environmental, and economic factors that are known to be important for so many health outcomes that it's so hard to not put it in there when it's helpful.

Because when you look across different racial groups, if you look at, for example, their levels on different blood tests or other measurements, you're going to find statistically significant differences for basically everything. And then the question is going to be: how many of these are actually meaningful? And how many of them are worth the drawbacks that we discussed earlier?

So if you were to look across not race but other major categories that can be used to separate human beings, there are so many ways you could do it where a lot of these differences would be statistically significant, because they're associated with all these other factors. But being able to try it out, compare with and without, and see what else can stand in is an important exercise that I don't know is always pursued.

So for example, one thing that I noticed in one of the original development papers for lung function, the NHANES equation developed for use in the U.S., was that they actually said, "We tried using sitting height, but it was already pretty well explained by race, and so we decided not to include it." That's almost the reverse of the way we think about it now: considering race, finding it already partially explained by sitting height, and deciding race wasn't worth including.

And so having race in as a default, and removing other things because race already accounts for them, is one approach that I've seen in papers and that I think could be avoided.

Host: Looking ahead, how do you anticipate practices around including or excluding race as a variable in clinical algorithms will evolve?

James: I think the movement to reconsider race broadly across medical applications is likely to continue. There are a lot of other areas where it can be reconsidered; probably the most common will be things like diabetes screening and cancer screening.

As we discussed earlier, there is a lot of value in seeing how much of a racial difference could be explained by other things that weren't accounted for because race already explained them pretty well, and also in thinking about how big that difference actually is. For something like diabetes, the difference, the BMI at which different racial groups tend to develop diabetes, is actually quite large.

And so dispensing with race without coming up with a good alternative feels less palatable there than for something like kidney function, where the difference is relatively smaller, or for cardiovascular disease risk, where, once we account for kidney function, the remaining differences tend to be smaller.

And so I think the size of that difference, again alluding to the substantial-benefit criterion for including race, will be an important lens through which these new algorithms are viewed and reconsidered. But I think they are likely to be reconsidered, and ideally to be replaced with something more precise.

Host: So we like to close by asking our guests, what is one rule or norm broadly related to what we've been talking about today that you would change if you could and why?

James: So I think one norm that I've run up against is people's tendency to simplify the question of whether to include race down to "should we never ever use it?" And I think it's more complex than that.

I think there's differences between, for example, in lung function, where we're defining what normal and healthy is differently by populations. That's very different from, for example, kidney function or cardiovascular risk, where there's actually a gold standard ground truth that we're trying to get better at predicting.

And I think there's also this question of how much is it actually helping? For some things, there are actually really large differences between racial groups that we still have trouble accounting for. And then there are others where after accounting for the things we know, the differences are really, really small. And then the cost of removing race is really, really small.

And I think people tend to fixate on "if race is in there, then it's automatically bad." Or "if race is not in there and the number decreases even a little bit, that's automatically bad." And I think it's much more complex than that. There's a lot of factors that go into it that are actually very hard to quantify.

I think it's very tempting to try to rely on how the numbers change. But the more social, systemic effects, whether we're supporting the concept of racist biology, promoting distrust, or doing something that patients have strong preferences against, are things that for a long time have been hard to compare against the numbers.

And I think it's easy to say absolutely one way or the other, but a more nuanced and precise look at these different application areas will generally yield more fruitful discussion.

Host: All right. Thank you so much for coming on the podcast. We really enjoyed this conversation.

James: Thanks so much for having me. It's always a joy to be able to talk for a long time about the things I'm excited about and have a captive audience. So I appreciate it.

Host: Bio(un)ethical is written and edited by me, Sophie Gibert, and Leah Pierson, with production by Audiolift.co. If you want to support the show, please subscribe, rate, and review it wherever you get your podcasts and recommend it to a friend.

You can follow us on Twitter at Biounethical (no parentheses) to be notified about new episodes, and you can sign up on our website, biounethical.com to receive emails when new episodes are released.

We promise we won't spam you, but we may reach out to let you know about upcoming guests and give you the opportunity to submit questions. Our music is written by Nina Khoury and performed by the band Social Skills. We are supported by a grant from Amplify Creative Grants. Links to papers that we reference and other helpful resources are available at our website, biounethical.com. You can also submit feedback there or email us at biounethical@gmail.com. Thanks for listening and for your support.