Mahesh the Geek
What is mission-critical AI, and how is it shaping our future?
Join Motorola Solutions executive vice president and chief technology officer Mahesh Saptharishi as he and AI experts explore the science, the challenges and the incredible potential of AI when it matters most.
Mahesh the Geek
Ep. 5 Part 1: The Fragile Science of Human-AI Teams with Professor Krzysztof Gajos
In the first of a two-part conversation, Mahesh welcomes Professor Krzysztof Gajos, lead of the Intelligent Interactive Systems Group at Harvard, to challenge the common assumption that human + AI is always better than either alone.
Professor Gajos takes us deep into the fascinating, messy problem space of human-AI collaboration, revealing these configurations to be inherently fragile and contingent. The discussion dissects how specific design failures—including over-reliance on incorrect advice, increased cognitive load, poorly conceived delegation models and interface design—can undermine decision quality, de-skill users and create perverse incentive structures that ultimately work against the very goals of the systems themselves.
Across this wide-ranging conversation, Professor Gajos emphasizes the need for worker-centric AI systems that prioritize human competence, learning and autonomy over the clamor for what are all too often superficial efficiency gains. Discover why thoughtful AI design must start with a deep understanding of the cognitive work people actually perform.
Let's geek out.
Follow Mahesh on LinkedIn: linkedin.com/in/maheshsaptharishi
Follow Motorola Solutions on social
LinkedIn: linkedin.com/company/motorolasolutions
Instagram: instagram.com/motorolasolutions
Facebook: facebook.com/MotorolaSolutions
YouTube: youtube.com/MotorolaSolutions
Never miss an episode by subscribing.
Please leave us a review on Spotify or Apple Podcasts if you liked what you heard!
Welcome to episode 5 of the Mahesh the Geek podcast. Today, we're joined by Professor Krzysztof Gajos, a professor of computer science at the Harvard Paulson School of Engineering and Applied Sciences, and one of the most thoughtful voices on how humans and AI actually work together in the real world. As lead of the Intelligent Interactive Systems Group at Harvard, his work sits at the intersection of machine intelligence, cognitive psychology, and interface design, and he has spent years studying how people make decisions when supported by, or sometimes misled by, intelligent systems. Across the industry, there's a common and clearly seductive assumption that human plus AI is always better than either alone. But as we discussed in the previous episode with James Landay, the picture is far more nuanced. This conversation adds a further layer of nuance as Professor Gajos takes us even deeper into this fascinating, messy problem space. Human-AI teams are inherently fragile and contingent. They're shaped by cognitive load, over-reliance, attention bandwidth, policy constraints, the design of the interface itself, and even how people feel about their own competence and autonomy. And as you'll hear, many of our intuitions about how assistance should work don't hold up under scrutiny. In this first part of our conversation, we explore why AI that seems helpful can still make people worse at their jobs, how explanations can create false confidence, and why thoughtful design must start with a deep understanding of the cognitive work people actually do, not just with technical capability. Let's dive in.

And in full transparency, that's where I started as well. Me too. And so, you know, perhaps even some of your work, when I read it... I come from a machine learning background. I got my PhD at Carnegie Mellon, and one of my early focus areas was ensemble classifiers, boosting, AdaBoost, and all that, right? Mixture of experts, et cetera; we were focused very much on that. And the assumption was, hey, with a mixture of experts, I can mathematically show that independent errors combine into something better, right? Yes, exactly. And so when I started reading some of your papers, I was like, okay, this is a more nuanced topic than I expected it to be. And we were also, by the way, seeing some of this in real life with some of the experiments that we had done, probably a bit more productized experiments than true experiments. We were implementing these focus-of-attention mechanisms, which would guide people's attention to focus on the right sets of things. But we were getting feedback that in some of these situations, the focus-of-attention mechanisms weren't really doing what we expected them to do in terms of dealing with the cognitive bandwidth, the attention bandwidth, of the individual appropriately. In some cases, it was actually a distraction. So last episode, we talked to Professor James Landay from Stanford. He also started off his conversation by saying, hey, this idea of the human-plus-AI team being universally better than either one alone, there are some cracks in that argument, and we should be thinking about it differently. Of course, his take and his focus areas are a little bit different, so we went in a different direction with that conversation.
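(A quick editorial aside for readers less familiar with the ensemble intuition Mahesh mentions above: the sketch below is a hypothetical illustration, not anything from the episode. It simulates three classifiers that are each correct 70% of the time with independent errors; a simple majority vote lands near the analytical value 0.7^3 + 3(0.7^2)(0.3) ≈ 0.784, better than any single classifier.)

```python
import random

# Hypothetical illustration of the "independent errors" intuition behind
# ensembles: three classifiers, each correct 70% of the time, with errors
# drawn independently. Majority vote wins whenever at least two are right.
random.seed(0)
TRIALS = 100_000
SINGLE_ACC = 0.70

ensemble_correct = 0
for _ in range(TRIALS):
    # Each classifier's outcome is an independent draw at 70% accuracy.
    votes = sum(random.random() < SINGLE_ACC for _ in range(3))
    if votes >= 2:
        ensemble_correct += 1

print(f"single classifier accuracy: {SINGLE_ACC:.3f}")
print(f"majority-vote accuracy:     {ensemble_correct / TRIALS:.3f}")  # ~0.784
```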
Before that, we talked to Professor Martin Holbraad from UCL, more from an anthropology standpoint, but we also went a little bit into actor-network theory and talked about the network effect and how the unintended consequences of user-centered design don't always scale in the right way when you think of teams working together, including human-AI teams. So that was the other concept we touched on there as well. So maybe I can start off by just asking you to give us a summary, in your own words, of some of the work that you have done, really touching on this notion that the human-plus-AI team concept is more nuanced than people may think it is.
SPEAKER_00:So I've listened to the previous episodes, up to the anthropologist.
SPEAKER_01:James was on my committee. He was actually at the University of Washington when I was a student; he first came as a visitor, then in the middle of my PhD he became a professor there, and he stayed on. When I started at the University of Washington, there was no human-computer interaction as a field in the computer science department. James was the first person of that flavor who joined computer science, and he ended up transforming the entire university. But let me start from the beginning, my story. So I grew up at MIT. I got a very classical CS education. In 1997, I joined, as an undergraduate, the Intelligent Room project at the MIT AI Lab. The idea was to build an intelligent environment that observes us, learns from us, and that we can also interact with very explicitly. I stayed in that project for two years as an undergrad, then for one year as a master's student, and then for two more years as a technical manager of the project. So I spent five years over there. Very interestingly, from the very beginning we would speak to the environment, and it was absolutely awesome. But about three years into the project, we actually built a very elaborate system that would allow us to place a button in the environment and tell the system what to do when we pressed the button, because we realized that we were extremely tired of talking to the system. For some things, we just wanted to press the button: computer, when I press this button, dim the lights, close the shades, and play relaxing music. That was actually the first surprise for my AI optimism, because it made me realize that the natural interaction of speaking is not necessarily always the most natural way to interact with a machine. Sometimes the machine-like interaction is the more natural, more convenient way to communicate with the machine. But anyway, from there I went to the University of Washington. My advisor was an AI person, and we started doing geeky AI things. At some point, we started talking: wouldn't it be nice if software would adapt to our needs, if it observed the specialized tasks that we had and optimized the user interfaces for the things that we want to do? Because at that point, everyone was starting to become aware, this was the early 2000s, of what some called the bloat in our software. Software would support lots and lots of different tasks. Later I learned that Microsoft's internal measurements showed that everybody used about 10% of Microsoft Office's capability, but everyone used a different 10% of that capability. So you needed all of it, but any one person needed only a small subset. Could you create a less bloated interface for any individual that was appropriate to what they were doing? With this interest in mind, I ended up joining the lab of Mary Czerwinski, a cognitive psychologist at Microsoft Research, for a summer internship. She and a colleague of hers inducted me into human subjects research. They showed me how to run carefully designed cognitive science experiments to answer design questions. And this was an eye-opener. Roughly from then on, so it was, I think, 2004, 2005, I became very interested in this intersection of human intelligence and machine intelligence. And this has been my passion since then.
So for a long time, I focused on the cognitive aspects of what happens between one machine and one person. During my PhD, I became very excited about the process of user interface design, automating it and casting it as an optimization problem, and exploring it in the context of accessibility. Working with people with motor impairments, we created computational models of the unique abilities of different individuals, because we realized that your formal diagnosis doesn't really say what it is that you can do and what you need. We would create models of individuals' abilities, and then we would automatically optimize a user interface for what the person could do well, the most effective interface for that person. And this was fascinating. People that we tested it with felt incredibly empowered. The user interfaces that they got looked weird, but they were weird in just the right way for people with those unique abilities. And one thing that we realized working in that space was that a lot of the accessibility conversation at the time was about making access possible, but not about making access efficient or convenient. Some stories that I heard made me realize that this was a big deal. I spoke to some students who would say, look, I can access everything that I need for my studies, but it takes me a very long time. So one of two things can happen: I do sloppy work and my grades suffer, or I take fewer classes, but then my financial aid will run out before I am able to complete my degree. In either case, the person was professionally disadvantaged at the very beginning of their career, not because they couldn't do the work, but because the tools that they had did not allow them to do their work efficiently. So it wasn't just the possibility of access that was critical for these people's success, but the efficiency of access as well. I was very focused on making sure that we make access not just possible, but equitable in ways that mattered. So I spent a lot of time thinking about that. Then, when I came to Harvard, I started looking at other questions related to intelligent interactive systems. And I realized that this is a very interesting intellectual field because it extends classical human-computer interaction to ask new questions. In classical HCI, human-computer interaction, we want to build stuff that is honestly useful to people. But how do you build useful systems using technology that is occasionally wrong? How do you build them in such a way that people have a clear sense of value, even though occasionally there are mistakes? How do you build a system that is predictable, an absolutely core tenet of human-computer interaction, out of a system that is complex, whose inner workings you cannot understand? And how do you give people a meaningful sense of control, yet another thing that human-computer interaction believes is central to effective design, in systems that do their best work when they're allowed to be a little bit proactive? I was fascinated by it. I worked on some small sub-problems. There were relatively few people who were really pursuing these questions. The human-computer interaction community was skeptical of AI as a substrate for quite a while. Meanwhile, people who built intelligent user interfaces were focused on demonstrating novel capabilities and exploring best-case scenarios.
And relatively little work went into the unglamorous area of building the design knowledge that would allow us to build these systems such that they function in real situations for extended periods of time. And then, less than a decade ago, this ginormous technical revolution happened in AI, first with deep learning and then with other developments that showed us technical capabilities that we hadn't previously imagined. Everybody got incredibly excited about looking for applications of these capabilities, but we did it in a situation where we didn't have good design knowledge, and we also invested relatively little in problem understanding. So here's another aside. I believe that really meaningful and impactful innovation happens when you've got technical readiness, so there's technology that is available to use and deploy; good problem understanding, so you know exactly how, say, your safety personnel work; and design knowledge, which helps you decide whether and how to apply technological interventions in a particular human situation, because these applications are often non-obvious, and you often get indirect and unexpected consequences if you are inexperienced. So you need this design knowledge as well. My sense is that over the past decade-ish, we invested heavily in technical capability and very little in problem understanding, and even less in developing robust design knowledge. So we're in a slightly exciting but also exasperating moment in history, but one where there's just a ton of interesting work to be done.
SPEAKER_02:I think that's fascinating. It'd be interesting for you to maybe double-click on this Microsoft example, because I've heard you speak about this before, and it illuminates the problem in a way that's quite clear and quite interesting.
SPEAKER_01:So the starting point is that I think human-AI interaction is an exciting extension of human-computer interaction with entirely new problems. One of them is that AI is occasionally wrong, but we want to build systems that people perceive as useful despite that. We also want people to have predictable interactions despite the complexity of the AI models, and we want people to feel in control, even though AI-powered systems often deliver the most value if they're allowed to be a little bit proactive. So, how do you do it? One example that helped me start thinking about these questions was related to the concept of software bloat. Around 2000, people started documenting the fact that consumer software was becoming so complex that people had trouble navigating it, they had difficulty finding things that they needed, and even if they knew where those things were, it would take several clicks through complicated menu trees before they accessed the thing that they needed. Colleagues at Microsoft shared with me that, as far as they could tell from usage logs, any one person used only about 10% of the capability of the Microsoft software. But the problem was that different people used a different 10% of the software. So anyone would benefit from having a simpler product, but everyone needed a different simpler product. So how do you do it? My advisor and I started thinking about adaptive user interfaces. The question was how to do it in such a way that it really simplifies the interaction, but also so that people accept whatever intelligence we put into the user interface. To explore this further, I had the opportunity to join the lab of Mary Czerwinski, a cognitive psychologist at Microsoft Research, and I spent the summer doing research with her and some of her colleagues. We designed several different redesigns of the menus and toolbars in Microsoft Office. One that we thought would be very useful was one that recognizes the functionality that you use frequently and moves it to a place that is more convenient; clearly, you save time and do not have to go as deep into the user interface. Another design that we experimented with was one where we designated a fraction of the user interface to be the adaptive toolbar, which could hold only six things, and we would copy things that you were using frequently from deep in the menu structure to this adaptive toolbar. As you used the interface, what was copied to the toolbar kept changing. The last design that we tried was one where things that you needed frequently were highlighted so that when you navigated to the right part of the toolbar, they would be easier to spot. Then we ran two user studies: one to measure the speed with which people could access the things that they were told they needed, and another where we gave people high-level tasks like design this or write that, where they had a clear primary task and were supported by this user interface. We did these two studies because we wanted to accurately measure people's efficiency in a constrained task, but we also wanted to get realistic responses about people's preferences when they worked on a task that to them was more natural.
To our surprise, we found that the interface we thought would be the most efficient, the one where things got moved to more convenient places in the interface, was actually slower and less preferred. It was slower because sometimes people did not realize that a move had occurred. They would go to the original location, not find it there, look around, and eventually find it in what was supposed to be a more convenient location. The winner was the one where we copied things, because when people turned off their attention, they could just go on autopilot and access things the way they usually did. But when they had a little bit of attention left, they could grab things from the adaptive toolbar. The design with highlights again proved to be unsuccessful. It turns out that highlights are not good for attention if people do not know that they should be looking for highlights. What happened was that the first time somebody used something, it was the original color. The next time it was highlighted, and therefore it was the last thing the person looked at, because it was different from what they expected. What we concluded from this experience was that the design where we copied elements from the original location, so they were accessible both in the original location and in the adaptive toolbar, was robust to the system's mistakes, but also robust to a person's lack of attention. If we made an incorrect prediction about what a person wanted, or if they weren't paying attention, there was no cost. But if the system was right and they had a little bit of attention left, they could reap the benefit of what we did. We ended up calling this the split user interface. We documented it, but we're not the inventors. You see this design in today's software. On an iPad, at the bottom you will see shortcuts to applications that the iPad thinks you are likely to use. In the font menus of many applications, you will see copies of the fonts you are most likely to use near the top. This paradigm is used a lot. In this paradigm, the AI-generated shortcut or solution is an alternative to how you would normally do things. If you feel like experimenting, if you've got some cognitive resources left, you use it. If you don't, you do things the usual way. As a follow-up, we later did an experiment to see if there were any individual differences in how people used these shortcuts. One thing that we found was that there was a systematic difference in who regularly took advantage of these shortcuts and who didn't. The concept of need for cognition from psychology turned out to be very relevant. Need for cognition describes how much joy we get out of engaging in unneeded cognitive tasks. If we're not required to think, do we do it anyway with joy, or do we avoid it? People with high need for cognition would frequently take advantage of the shortcuts. They paid attention to the shortcuts; for them, it was fun to see if the system did the right thing. People with lower need for cognition did not. This was important for a couple of reasons. First, it demonstrated that attention is work. The fact that people who do not enjoy expending their cognitive resources where they are not needed chose not to take advantage of the shortcuts demonstrated that, for them, it was an effort. So it alerted us that when we design intelligent user interfaces, we need to be very
respectful of the effort that we require of people. Second, it also alerted us to the fact that if we are not careful with how we design intelligent user interfaces, we may produce what people in public health call intervention-generated inequalities. We create a solution that on average appears to make people more efficient and more successful. But people with low need for cognition, who already have a hard time engaging with complex user interfaces, get a smaller benefit than people with high need for cognition, who are already pretty successful and benefit much more. So everybody benefits a little, but the gap between people increases. And we didn't want to do that. These were the core lessons that we drew from that very first experiment.
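(For readers who want a concrete picture of the split interface idea described above, here is a minimal, hypothetical sketch: frequently used commands are copied, never moved, into a small adaptive area, so a wrong prediction or an inattentive user pays no penalty. The class and command names are illustrative, not from the original study.)

```python
from collections import Counter

class SplitToolbar:
    """Hypothetical sketch of a split interface: the adaptive area holds
    copies of the most frequently used commands; the originals stay where
    they always were, so incorrect predictions cost the user nothing."""

    def __init__(self, slots: int = 6):
        self.slots = slots
        self.usage = Counter()

    def record_use(self, command: str) -> None:
        # Track how often each command is invoked, wherever it was clicked.
        self.usage[command] += 1

    def adaptive_slots(self) -> list:
        # Copies of the top commands, up to the fixed number of slots.
        return [cmd for cmd, _ in self.usage.most_common(self.slots)]

toolbar = SplitToolbar()
for cmd in ["bold", "insert_table", "bold", "word_count", "bold", "insert_table"]:
    toolbar.record_use(cmd)
print(toolbar.adaptive_slots())  # ['bold', 'insert_table', 'word_count']
```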
SPEAKER_02:And I think there's another element to it that you've pointed out too, where there's also potentially an introduction of bias, right? An example that you've quoted previously is predictive text used in reviews and such, where it's not just about the need for cognition; how the AI guides the next steps can also quite profoundly bias the intent of the user.
SPEAKER_01:Indeed, it's true. We did that set of studies around 2020, just before the large language models started coming out. This was inspired by our work on split user interfaces, interfaces that offer people an alternative, slightly more efficient way of doing things. A different psychologist colleague alerted us to the possibility that what we make easier to use may affect what people choose to use. We previously had not considered it. We only thought that we were making work more efficient; we did not think we had an impact on the actual content of the work. So with a student colleague, we designed a series of experiments where people were asked to perform various tasks: describe an image, write a restaurant review, or do something else. People did these tasks on mobile keyboards, with or without predictive text, and we also created our own version of predictive text. For one of the studies, we biased the predictive text in two different ways. Specifically, we asked people to come to the lab and list four recent restaurant experiences they had had: two positive experiences and two negative experiences. Before encountering any of our technology, we asked them to assign star ratings to these experiences, and again, two would be positive and two would be negative. So before people did anything, they had already made a commitment: these were good experiences, these were bad experiences. Then we asked them to write a review for each of these experiences on a mobile keyboard. And without people knowing it, we manipulated things in such a way that for one positive and one negative review, the predictive text on the keyboard was trained on positive reviews from Yelp, and for the other two reviews, the predictive text was trained on negative reviews from Yelp. People wrote their reviews, and then we presented these reviews to a different set of people, who didn't see the star ratings; they just read the text of the reviews. We asked them, tell us what you think the star rating of this review is, given its content. And we found that there was roughly a half-star difference between reviews written with positively trained predictive text and reviews written with negatively trained predictive text. So for the same intended star rating, the tone of the review would be different depending on the bias of the predictive text. This was a pretty alarming result for us. It showed that bias in an adaptive user interface that makes things a little bit more convenient can impact not only the efficiency of work, but also the content of work. And obviously, now we see this effect amplified in recent studies that looked at the impact of the use of large language models on people's work.
SPEAKER_02:I think that's a great segue into some of the topics that you've addressed. When we start off with large language models that seem capable of doing so much, you can ascribe a lot of intelligence and perhaps even a lot of capability to them. There's this concept of over-reliance, right? Over-reliance seems almost fundamental to why sometimes an AI alone may actually outperform the human-and-AI combination. Maybe you can double-click on that a little bit.
SPEAKER_01:All right. So I've looked at the concept of over-reliance in the context of AI-supported decision making. I did not really look at it very carefully in the context of LLMs, so let me talk about AI-supported decision making. We were really surprised by the finding that, assuming the AI was better than an average person on the task, a person supported by an AI would do a little bit better than they would have without AI support, but the combined performance of the person and the AI would typically be worse than the performance of the AI system alone. The entire community expected that when you combine the knowledge and skills of people and AIs, given that these are very different types of knowledge, you would get an outcome that is better than either alone. But here we were not getting the synergy. We got some insight into what might be going on when we separately looked at the instances when the AI gave correct advice versus when it gave incorrect advice. What we found was that when the AI gave incorrect advice, people frequently followed it and performed much worse than they would have on their own. This indicated to us that there was some level of over-reliance. People could have done a better job on their own, but somehow they were swayed by the incorrect recommendation. There was research conducted by colleagues at another institution that found that adding explanations doesn't necessarily help, because rather than engaging with the content of the explanation, many people seem to be just excited about the idea that the system is capable of generating explanations. They perceived the system as more capable because it was generating explanations, and they followed its recommendations even more happily because it was perceived as more capable. So people took the presence of explanations as an indicator of competence without actually engaging with the content of the explanations. So we decided to look into this concept of cognitive engagement. In a later study, we explored the concept of cognitive forcing, which we discovered in the literature on medical decision making. Cognitive forcing is the idea that you interrupt a person's decision-making process in the moment in such a way that they are less likely to make a decision heuristically and get pushed, or nudged, toward more analytical decision making. A prominent example of such an intervention is that you let a person make their own decision, they say what their decision is, and then you tell them, well, the AI system actually thinks that this other thing would be a better decision, and here are the AI's reasons. The assumption is that the person has made a commitment: they said, I believe this is right. When the system then says something else is right, they will be curious to find out why there is a discrepancy, and they will cognitively engage with the AI's explanation. Indeed, in that study, we found that the cognitive forcing interventions resulted in much less over-reliance than typical decision recommendations and explanations. They didn't eliminate the over-reliance, but they seemed to reduce it.
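(As a concrete illustration of the cognitive-forcing pattern Professor Gajos describes above, here is a minimal, hypothetical sketch: the person commits to their own answer before the AI's recommendation and explanation are revealed. The function names and flow are assumptions for illustration, not the study's actual interface.)

```python
def cognitively_forced_decision(case, ai_recommendation, ai_explanation, ask_user):
    """Hypothetical sketch of a cognitive-forcing flow: the person commits
    to their own decision first; only then is the (possibly conflicting)
    AI recommendation revealed, together with its reasons."""
    # Step 1: the person decides on their own, with no AI input visible.
    own_answer = ask_user(f"Your decision for: {case}")

    # Step 2: reveal the AI recommendation only after that commitment.
    if own_answer == ai_recommendation:
        print(f"The AI agrees: {ai_recommendation}")
        return own_answer

    # Step 3: on disagreement, surface the AI's reasons and ask for a
    # deliberate final choice rather than a reflexive switch.
    print(f"The AI instead suggests: {ai_recommendation}")
    print(f"AI's reasons: {ai_explanation}")
    return ask_user("Keep your answer or switch? Enter your final decision")

# Example use (interactive): which meal has more fat, with the AI disagreeing.
# cognitively_forced_decision("Meal A vs. Meal B: which has more fat?",
#                             "Meal B", "Meal B contains fried items.", input)
```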
SPEAKER_02:Very interesting. There is this notion today that interacting with AI, especially with all the agentic capabilities that are coming up, is almost like interacting with a colleague. But that interaction is not quite like interacting with a colleague. I think there's a structural difference there that you seem to be pointing out as well: while we may ascribe a notion of competence to an AI because it can explain its answer to you, there's no engagement similar to the way you and I would question each other, so you don't get the benefit of that type of interaction.
SPEAKER_01:Yes. The first time we tried to poke at this metaphor was with a colleague from Columbia, Lena Mamykina. We realized that when we talk about partnership in a workplace, there are numerous benefits to such partnerships among people. One of them is that if I were to ask you for help on a particular decision-making task, with your help I would make a better decision in the moment, but I would also learn something that I could apply the next time I was in a similar situation. This concept is called incidental learning. I briefly looked into the relevant literature, and it suggests that roughly half of the learning that happens in modern workplaces is through this process of incidental or informal learning. In professional situations, we really rely on people learning on the job from doing the tasks and from interacting with colleagues. So we decided to check whether explainable AI is a good colleague. What happens when you get supported in your decision making by an AI? And what we found was that people supported by classical explainable AI, where they are given a decision recommendation and an explanation, learn nothing. Like zero, zilch, nada. It was really, really disconcerting. Another surprising result was what happened when we applied cognitive forcing. There is an assumption that cognitive engagement is a precondition for learning, and that if you do have cognitive engagement, you will probably see learning. Because we interpreted our earlier results as indicating that cognitive forcing results in cognitive engagement, we expected that cognitive forcing would also result in learning. Surprisingly, we did not see this effect. We still do not understand why. Our current thinking is that cognitive forcing perhaps does not really result in deeper cognitive engagement, but maybe shifts our biases: we over-rely on the AI less, but are more committed to our original idea. So that was a surprise. So we tried yet another design. In that design, instead of giving people a decision recommendation and an explanation, we just gave them an explanation. And the explanation very nearly gave away the answer for the task. The task people were working on was to decide which of two meals has more fat or more carbohydrates or something like that, and the explanation, for example, would say beans have much more protein than this other thing. So essentially it gives you the answer right there; all you have to do is perform a tiny cognitive step to translate from the explanation to the answer. And that was sufficient to result in learning. There was substantial learning when people were aided just by an explanation rather than an explanation and an explicit decision recommendation. This little bit of cognitive work helped. Based on that study, we decided to be a little bit more careful about the metaphors that we use to talk about AI. And we ended up formulating a new goal for ourselves: we decided to see if we could build worker-centric AIs. If we thought about things that are important in the workplace, people's sense of competence and therefore support for learning on the job, people's sense of autonomy, support for what people see as the core element of their professional identity, how would we design AIs differently?
In particular, at the beginning we focused on competence and learning, and we looked for different designs that would support that. In all of our studies, we found that designs that supported learning also supported high-quality decision making. But the designs that supported both learning and high-quality decision making were substantially different from the designs we were using when we focused only on decision accuracy. This was a very optimistic result, because it showed that asking questions about worker well-being and worker objectives resulted in very different designs for human-AI interaction, but we did not have to make any sacrifices. We are still making high-quality decisions, so the employer should be happy, but we are also making the workplace more valuable to the people doing the work.
SPEAKER_02:Is the opposite true, in the sense that if we're not helping the human user learn and improve their own performance, there's probably an unlearning or a de-skilling that could also be happening? Is that a fair statement?
SPEAKER_01:This is what we hypothesize. It would require longer-term studies to document. But I do want to point out that we design current AI-powered systems in such a way that they require competent human supervisors. And if we create workplaces that require competent human supervisors yet do not create opportunities for people to develop their competence, in a few years we will run out of pre-AI-trained people. We will not have competent supervisors for our AI systems. So we will need to pay very careful attention to the development of our workforce.
SPEAKER_02:So in our previous episode with Martin Holbraad, we discussed a study we did with him. It's not long-term, it's very short-term, and it was specifically around this topic of what we call officer narratives. Effectively, after an incident that a police officer responds to, there's a report that gets authored. We were contrasting two approaches: one where, based on the audio and video recorded on the body-worn camera that the officer wears, we can automatically author a first draft of that report, versus one where we engage in a conversation with that officer to help them author the report. And a piece of feedback that we got from many of the supervisors responsible for reviewing those reports is that, boy, now you're de-skilling the officer, because you've taken away that key sense-making that happens as a consequence of the officer authoring the report, being able to describe things in their own way while being helped to do it in a manner that is evidence-worthy, that seems complete, and that will sustain the progression of the workflow the report goes through in the investigation and prosecution processes. That's in contrast to the officer saying, hey, I have a first draft, I'm going to quickly review it, and if I like it, I'm just going to say yes. I think that ties in with what you were talking about, this notion of explanations, and even connects a little to your predictive text example, where when we predict what the output should be, the biases of the system may inherently change the true, nuanced perspective that we need from the officer. But perhaps more importantly, there is this notion that we're not helping that person develop their own capacity for sense-making in the task that's ahead. Does that resonate?
SPEAKER_01:It totally resonates. By the way, I listened to that episode and I absolutely loved it. I really appreciate the contrast between having a ghostwriter who writes the reports on behalf of the officers versus a support that walks the officers through a cognitive interview to help them more accurately remember what happened, and then helps them critique the first draft of the report so they can anticipate and spot weaknesses in what they wrote. This really resonated. One of the conclusions from our own work is that designs that automate people's work result in de-skilling, while designs that support human learning leave an interesting part of the job for people to do. Based on the literature and our own experiences, my take right now is that if you want to put AI into a particular cognitive situation, you first have to understand the cognitive work that is being done. In particular, most complex cognitive work has parts where human responsibility is core, where people take pride, where people's analytical work is essential, but it also has parts that are overwhelming and that people have difficulty doing well. The opportunity is to spot these things, describe them, and use AI to support people in those parts. In the example that you discussed, one difficulty was accurately remembering what happened, and the second was anticipating the information needs of different stakeholders, hence the automated critique that would look at the report from different directions and say, this is unclear, there is a gap over here, right? Excellent. What are other kinds of cognitive difficulties? In other places, people may be challenged by premature commitment: the first idea pops up and they stick to it without exploring the broader solution space. In that case, you would want to help those cognitive workers become aware of the larger set of possibilities and help them explore them before they commit. In other places, the challenge is large quantities of data that you need to understand before you make a decision; here you can help people by synthesizing the data. In other places, it's really difficult to identify all of the relevant factors. For example, if there is a patient with multiple comorbidities, typical medical guidelines do not account for all of these extra factors. A system that analyzes millions of patients can point out that, given this unique combination of comorbidities, the typical approach may not be the right one, and this unusual one may be a better choice. In other places, you will know all of the factors that are relevant for your decision, but you will have a hard time making the trade-offs. How do I make the trade-offs? What do the data suggest, and how do I make these trade-offs consistently? An important corollary here is that each of these situations requires a different technical capability. The recent growth in the AI field has been powered by making a small number of technical capabilities more and more powerful. With deep learning, we focused on better and better predictions. But this is just one way to use AI. With LLMs and other generative models, we focused on the probabilistic generative capability. But there's so much more. For example, a colleague at the University of Chicago, Chenhao Tan, pointed out that how you summarize depends on the task that you are trying to support.
So if someone is making a decision, you want to present them with a different summary than if they were doing something else. If you know the particular decision task somebody needs to make, you summarize things so as to highlight the contrasts, let's say. This is a fundamentally different computational task from the ones that current tools support, so it requires new kinds of technical capabilities and innovations. And what I think is unappreciated, underexplored, right now is that there are new kinds of technical capabilities we need that would be better fits for certain situations. But because we put so much energy into adapting tasks to existing computational capabilities rather than understanding what the tasks really need, we are cutting ourselves off from the joy of technical innovation. So I really wish we would rebalance things just a little bit.
SPEAKER_02:I mean, I think what you're also pointing out there, and you used the word partner before, is that the primary goal of a lot of AI assistance today tends to be efficiency, right? Make the task more efficient. So there's more of a delegation model than a partnership model with AI. And I think what you're also pointing out is that delegation comes with a lot of unintended consequences that are not always positive, but there's perhaps an engineering of a partnership model, with more domain understanding built into it, that engages the user cognitively in a far better way than a delegation model potentially does. Is that fair to say?
SPEAKER_01:I think, well, I'm not sure if I would say it this way, but it makes me think of two things. One is the objective. As I said earlier, when I first started working on AI-supported decision making, I focused on decision accuracy; let's say this is the employer's objective. Then I realized that there were some things that were very important for the people doing the decision work. So we needed to support their ability to learn on the job, to have a sense of autonomy, meaning, and so on. That caused us to ask new questions. We reworked our objectives and came up with entirely new designs that we wouldn't have come up with otherwise. It opened up the space of possibilities. Recently, we asked a new set of questions. Some other student colleagues said, wait, wait, wait, how about the welfare of decision subjects? For example, if somebody makes unemployment benefits decisions or housing benefits decisions, we've been focusing on them making accurate decisions and learning on the job. But how about the welfare of the people who seek assistance? So we did some pretty deep qualitative studies, both in the US and in India, looking at very different environments in an attempt to make our findings a little more generalizable. And we found that the opportunities for AI start so much earlier. Many people who are eligible for benefits do not know that they're eligible. They do not know the preconditions for applying. When they apply, one big assumption that technologists make is that filling out an application is an easy and deterministic process: you get asked some questions, you fill them out, and you feed it to the algorithm. It's not the case, because for many of the questions you are asked to explain a situation, and you do not know what is relevant to the decision maker. So people need guidance on describing their situation in a way that is relevant and understandable to the decision maker. Often people get denied benefits not because they are not eligible, but because they did not describe the right things. Then they are asked to present supporting documents. Which documents should I choose out of many? Again, people often make suboptimal choices. So there is a lot of opportunity for guiding people so that they present the most informative applications. Then, when somebody gets denied a benefit, existing technologies focus on presenting a counterfactual: if only your income was a little bit higher, you would have gotten the loan; if only this, you would have gotten the unemployment benefit. But at that moment, when people get an unexpected negative decision, they process information emotionally rather than cognitively. They do not understand it. And sometimes it is appropriate for them to contest the decision, but the response they get does not allow them to make that call. Do I need to change something about myself? Or is it appropriate for me to contest, because my circumstances are unusual, not anticipated by the guidelines, so I require discretion? Or do I believe the system made a mistake in my circumstances that we need to address? Or do I believe there's a systematic bias in how the system works?
So the existing processes do not make it possible for the decision subjects to evaluate how they should respond to a decision. So again, because we switched our question from asking how do we make the most accurate decision or how do we support decision workers to how do we support decision subjects, the opportunities for innovation grew yet again. So we are just starting to work in this space, and it's incredibly exciting. And at first it may look adversarial, but really most institutions have a mission and they want to be successful in their mission. They want to have a particular impact on the world. And I think by paying some attention to the welfare of decision subjects, we can help these institutions have the impact that they really may want to have.
SPEAKER_02:In that example, it almost seems like the intended user, or subject, is the person who's actually filling out that form, and the goal is giving them a better ability to fill out that form and provide the right context.
SPEAKER_01:So we can intervene there. We can design new AI systems that support them, but we can also change a little bit how the decision makers work. For example, we can help them detect situations where a decision is potentially contestable. Rather than issuing a negative response, they can recognize that there is ambiguous or incomplete information and instead reach out and ask for more specific information. Or, if they do issue a negative decision, they provide enough context so the decision subject can make an appropriate decision about whether or not they should contest, based on information that is available to them but not to the decision maker.
SPEAKER_02:That seems consistent with what you said previously: if you can help the user, in either case, learn better or understand how to ask the right questions or follow up appropriately, the judgment element also becomes better. There's a proportionality between the learning and the judgment. Is there a generalization or a framework you can extrapolate from that? Yes, efficiency is important, yes, accuracy is important, but both of those are also directly affected by the user and their capacity to get better in some way. Is there anything we can say about a framework there?
SPEAKER_01:So I think I'm still in the learning process, trying to come up with broader principles. But what you and your colleague in the earlier episode, the anthropologist, pointed out is that this is not a zero-sum game; this is about growing the pie. I think that is a mindset that can be incredibly helpful here.
SPEAKER_02:I think there's definitely something to that, in that across the board there isn't a goal of helping the user get better; there is a goal of making the user's work easier in some way. And if an explanation is given, there's a fascination associated with the fact that the AI is able to explain its output; the fact that an explanation exists is enough, as opposed to there being cognitive engagement with that explanation. To me, again, that's where I was trying to get to with this notion that a partnership is more than a delegation, right? Don't just author this thing for me, help me author it, which is a different conversation. It's sometimes a harder argument to make that the partnership yields a better result, because you can always say, hey, I saved X amount of time by asking this other entity to do my job or perform this task for me. It's harder for me to say that over the long term my efficiency, my performance, my accuracy improve as a consequence of this partnership. That, I think, is where a lot of AI design is stuck right now: the efficiency argument sometimes becomes more powerful than the partnership argument and the learning argument. On the topic of partnership versus delegation, I think the medical emergencies example that you've cited is quite interesting, in the sense that the amount of context a doctor has to come to a conclusion, and then getting a recommendation, is somewhat more interesting than just assisting that person in coming up with a decision. Maybe I'm not explaining this the proper way, but initially it seemed almost like a bit of a contradiction to say, okay, just offering a recommendation isn't usually the best way to go about it, yet in this particular example the timing of that recommendation had a huge part to play in the efficacy or utility of that recommendation.
SPEAKER_01:So I think you're referring to our very recent work that was led by colleagues at Drexel and where I helped out a little bit. They specifically asked, how can we help doctors in pediatric trauma resuscitation? It's a situation where a patient with some very urgent condition gets rolled in, and within seconds or minutes the doctors have to make incredibly important decisions about whether to do a surgical intervention or some other really major intervention. If they make a wrong decision, the consequences are dire. So they designed a study where doctors were presented with vignettes, scenarios, videos where they heard information about a particular patient as it was being discovered and presented, and in less than two minutes they had to make very important decisions. We tried a condition where there was no AI support, one in which the AI kept track of relevant information, displayed it, and showed that this particular bit of information is abnormal in this way or that way, or is perfectly fine, and that it's relevant to this decision-making task. And in the last condition, people had that support plus a decision recommendation at the end that says you should do this. We expected, based on our earlier research, that the best condition would be the one where people just get this growing synthesis of the information, but not the decision recommendation. It turned out that people made the best decisions when they had both the growing synthesis and the decision recommendation. I'm still processing these results, but I think another study, from Germany, may be helpful in understanding what happened. That study was conducted with pilots who had to make emergency landing decisions during flight. One of the conditions was that during normal flight, the AI would constantly give pilots updates about nearby airports, weather conditions, and so on, so they would be developing very good situational awareness of what was going on. In another condition, nothing would happen during normal flight, but they would get a decision recommendation when they had to make a decision. And in another condition, they got both. The best condition was when they had both. When they only got a decision recommendation, they over-relied on it and made mistakes. When they only had the situational awareness support, they also didn't make optimal decisions. But when they got both, it really, really worked. The observation here is that people need to be prepared to evaluate the system's decision. If you get a recommendation that is difficult to verify, where the system tells you to do this but you cannot really retrace all of the steps of the reasoning process, it's very hard for you to say whether it is the right decision or not. The distance from what you know to what the system suggests cannot be too large. In the situation where people had the situational awareness, so they really understood what was happening at different airports, and then got a decision, even a surprising one, they were really able to engage in the process of confronting what they thought was the right thing versus what the system proposed. And the distance from what they knew to what they needed to decide wasn't so big that they couldn't verify what the system suggested. They had enough cognitive preparation for it.
So my current understanding of what happened in the study led by colleagues at Drexel is that when doctors had both the situational awareness supported by AI and the decision recommendation, they could make good decisions. Now, what was really interesting about this study were the doctors' comments about that system. They said, yeah, I felt I made pretty good decisions when I had the decision recommendations. I could confront them with my own and say, oh, this makes sense, this doesn't; this was helpful. But if the AI system's recommendations were part of the permanent record, I would not want it. And this points us to another thing: we keep throwing AI into workplaces without rethinking policies. Doctors feel that if they make a decision that disagrees with an AI recommendation, they are extra liable. If they're correct and they override the AI, that's perfectly fine; they do not get extra kudos, they do not get punishment. But if they're incorrect and they disagreed with the AI, they get doubly punished; they're much more likely to face consequences. On the one hand, we deploy these systems and say, oh, this is just advisory, you make your own decisions, you are entirely responsible for the decision. On the other hand, because of the perceived liability involved in disagreeing with the system, or even because of explicit policies in the company that encourage agreeing with the system, people's autonomy in decision making is curtailed, contrary to the story we tell, in which we say that you are the final decision maker, fully responsible for the decision. One thing that we need to think about very carefully is the fact that introducing these novel technologies requires not just redesigning the task, but also redesigning larger parts of the workplace, including our policies.
SPEAKER_02:As Krzysztof showed, real progress depends on understanding how people think, decide, and learn in the flow of work, and then designing systems that align with those realities. We touched on why people over-rely on AI, why explanations don't always create understanding, why cognitive forcing sometimes works and sometimes doesn't, and how the timing and framing of recommendations can dramatically change decision quality. And importantly, we looked at the risks of de-skilling, including parallels with our work on report writing, where replacing sense-making with automation can undermine both judgment and professional development. We also surfaced a critical point that often gets missed: introducing AI into high-stakes environments isn't just a design problem, it's a policy problem. As Krzysztof noted, in some medical contexts, doctors benefit from AI recommendations but don't want them recorded because of how liability structures work. Without rethinking policies, organizational incentives, and the meaning of autonomy, even well-designed systems can backfire. In part two, we build on this last point and widen the aperture. We'll explore what it means to build socio-technical systems that operate at multiple levels: cognitive, organizational, and societal. We dig deeper into decision-making itself, the relationship between decision workers and decision subjects, and how accuracy, autonomy, learning, and accountability intersect. And we ask how technical performance must be balanced against, and sometimes subordinated to, a deep understanding of the actual problem space and the world we're trying to build. Thanks for listening.