All Source Podcast

This podcast is led by INSA's Policy Councils and Subcommittees covering hot topics in the intelligence and national security community.

From Action Plan to Mission Impact: Testing for Trust

May 21, 2026 • INSA • Season 1 • Episode 4

In this episode of INSA's From Action Plan to Mission Impact: Key Takeaways from America's AI Action Plan podcast, hosts Chitra Sivanandam and Yevgeniy Sirotin sit down with Mark Munsell, former Chief AI Officer at NGA, to examine what it takes to build trust in AI systems before they are deployed in mission environments. Mr. Munsell discusses the challenges of testing, evaluating, and securing AI models as they become more autonomous and integrated into high-stakes operations. The conversation breaks down how agencies and industry can balance rapid AI adoption with responsible oversight, continuous monitoring, and risk management to ensure these systems remain reliable, mission-ready, and resilient against emerging threats.

Yevgeniy 0:04

Welcome to the All Source Podcast from the Intelligence and National Security Alliance. I'm Yevgeniy Sirotin.

Chitra 0:10

And I'm Chitra Sivanandam. This episode is part of our series From Action Plan to Mission Impact, where we examine how the AI Action Plan is being put into practice across the national security community.

Yevgeniy 0:21

Today's episode is Testing for Trust. The action plan makes clear that building AI isn't enough. These systems must be rigorously tested, evaluated, and trusted before they ever hit the field.

Chitra 0:33

And that's where the challenge lies. AI can perform beautifully in the lab, but the real question is whether operators and mission leaders can rely on it when the stakes are high and the conditions are unpredictable.

Yevgeniy 0:45

Our conversation today explores how evaluation frameworks must evolve beyond traditional software testing, focusing on realistic test environments and how agencies can measure actual confidence, not just raw performance.

Chitra 0:59

And we're also digging into continuous monitoring and the challenge of ensuring that these systems remain reliable in dynamic and adversarial conditions.

Yevgeniy 1:08

Yeah. And perhaps more importantly, we'll look at what it means for industry partners working to deliver these mission-ready capabilities.

Chitra 1:17

So we're joined here today by Mark Munsell, the former Chief AI Officer at NGA. Mark, thank you for joining us.

Mark 1:25

Great to be here. Thanks for having me.

Yevgeniy 1:26

So I think we're going to open up with the first question. Mark, what does testing for trust, or testing for truth, mean?

Mark 1:36

Yeah, I think that ultimately, when we use computer systems, in the last 40 years that we've had computers, we expect the results of the computer function, the subroutine, the application to be predictable, to be testable. We actually write tests, right? They're logic-based, and you have known results that you're looking for. So we would expect something similar when we use AI. A lot of us, when we use AI, want it to be right. We want it to be correct. And so I guess what's different about this is that we've designed AI to be more like humans. And humans are creative. Humans have different opinions and different answers about certain subjects. And what we've created is something that's not binary, that's not predictable, that's not necessarily coded in a subroutine that would give predicted results. We're getting something now that is much more human-like. And therefore, it's very difficult to write those kinds of tests for AI. So when we are testing AI now, we need this new dimension, this new metric of testing, which is: can you use this model for some important mission? In other words, can you trust this AI model in supporting you in the work that you do? It's a different question, but it relies on the same basic principle of risk management.

Chitra 3:19

And how does that work in practice? If I think about the classic systems that we're used to, you write a test script and anybody can deploy the test script, right? And anybody will be like, I know whether it passed or failed. And I think the way you're describing it, maybe it's a lot more subjective as to whether it gave me the thing that I trust or that I don't trust.

Mark 3:39

Absolutely. Sometimes the AI model appears to offer opinions, recommendations, suggestions; these are all things that a computer would not do. A computer would give you the answer based on the parameters that you put in and the output that is expected. In this case, it's taking on designed human characteristics to offer you recommendations. And maybe the recommendations are two or three different paths for you to choose. And in some cases, it's embodying human traits that are leading you to a certain recommendation. These are all things that are not traditional test parameters; they're not traditional computing functions.

Chitra 4:25

And where does it live in the cycle? Because part of me feels like we still think of it and treat it as software, which means I test before I deploy it. And the way you're talking about it, part of it kind of fits on this human spectrum, which means if I start leaning towards the thing that I find is causing me more concern, that it's less trustworthy, I only learn that after I've deployed it. So then I'm doing something to discern that. And then I can't get out of the system. It seems like it's different in terms of life cycle.

Mark 4:54

Certainly is. And I think there are certain milestones at which you would want to perform these tests. There are still AI models being made that are less subjective: models built for certain purposes, fit to certain purposes, with predictability, where you can come in and get some sort of statistical score associated with it, like an F1 score. But the more that we nest these models together, and the more that we include the enormous corpus of human input, like the entire internet or all of mankind's knowledge, in these models, the more quickly it strays from this ability to do statistical and predictive testing to something much more subjective.
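The kind of statistical score mentioned here, such as an F1 score, is straightforward to compute for a narrow, fit-for-purpose model with known correct answers. A minimal sketch, assuming a hypothetical binary classifier; the labels below are made up purely for illustration:

```python
def f1_score(y_true, y_pred):
    """F1 is the harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    if tp == 0:
        # No correct positive predictions: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative ground truth and model predictions (not real data).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.2f}")
```

A score like this only exists because every input has one known right answer. That is the property that fades as models become more general and their outputs more open-ended.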

Yevgeniy 5:50

Yeah, that's really interesting. And I want to pull the thread about these AI systems being computer systems, but different from traditional software. One of the things that really helped traditional software get trusted is open source software, right? Because the code is available for anybody to review. As you go to these AI systems that are based on neural nets, what does it even mean for that system to be an open source AI system? Because some of these systems are.

Mark 6:19

Yeah. I mean, obviously the earliest ones were open sourced. You see all the ones that are published on Hugging Face and things like that. And I do think that some of the early testing frameworks that were out there essentially were things like: can it pass the bar exam, right? Can it pass an SAT? And so the testing frameworks that were out there, the benchmarking, if you will, were pretty intense. And they covered a spectrum of human benchmarking against these models. And what you saw is all these open source models competing against the same benchmarks. And I feel like that's an important thing for any model developer to do: put it out there in the open, run it against these open benchmarks, see how it compares to hundreds or thousands of other models. I think they should do that. But it doesn't mean that they have to, doesn't mean that they will, especially with more proprietary, high-value models that are built for certain purposes that maybe people want to protect, because number one, they don't want to run it against all of these open benchmarks. And number two, they certainly don't want the world to see how well they do against those, especially if they're very bad, or especially if they're very good. You'd want to protect that information.

Chitra 7:46

Interesting. I'm thinking, is there a paradigm where testing is really more appropriate for those larger foundation models that you're calling, versus the applications built off that model? Or do we have to equally look at the trust and truthiness of the application, regardless of whether you're utilizing something else that has been, quote unquote, verified or tested?

Mark 8:12

I would just say it's probably gonna be based on the use case. It's gonna be based on the fit for purpose of the functionality of the model. I mean, there are models that can generate books now, and articles and research papers. And so you're talking about tens of thousands of words, maybe hundreds of thousands of words. How do you test that?

SPEAKER_03 8:39

Yeah.

Mark 8:40

You can read it. I certainly don't know. Humans can come in, and I think what's more and more common is that other models are testing it. You're seeing some models being designed and built to do testing, or some models designed and built to check other models. Eventually we're gonna have thousands of nested models running.

Chitra 9:02

So that's the problem, right? If the model that tests and verifies all the other models is unreliable, the downstream effect is rapid.

Yevgeniy 9:10

You sort of build a matryoshka of AI, and all of it is difficult to evaluate. So how do we make these systems that are so amorphous mission ready? How do we label them as, yeah, this is ready to be used for a particular purpose? Because when we look at the AI Action Plan, it defines specific criteria for looking at these systems: their trustworthiness might be related to security or explainability, reliability, robustness. We hear these words used. But how do we actually determine that they've passed these bars in practice? And how do we say that this system is now mission ready?

Mark 9:49

Yeah, I think there are three dimensions to this. First, depending on what kind of model it is, if you can, you actually test the quality of the model. In that case, it's a bit like software testing, where you do have a set of tests that result in some sort of benchmark, and you're able to quantifiably, statistically give a predictable result for that model, give it a score. And then there's the next piece, which is fit for purpose, right? Based on the quality of that model, what use cases can it serve and should it serve? And so between having a way to measure the quality of the model statistically, predictively, and matching it to its fit for purpose, its use case, there's a framework involved in that. There's a risk framework involved with making sure it's qualified to do certain jobs. And then the third piece would be the actual risk in taking that model and putting it in that use case. What's the level of risk that you are comfortable with? What is the severity level of you putting that model in that use case? And so now if you imagine all the different kinds of AI models that are out there, all the different designs, all the different approaches, those three dimensions have to line up before you would get a green light to use any given model for any use case, and it depends on the severity and risk that's involved with it.
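One way to picture the three dimensions lining up is as a simple gate. The sketch below is only an illustration of that idea: the names (`UseCase`, `green_light`), the thresholds, and the 1-to-5 severity scale are all hypothetical, not taken from the AI Action Plan or from any agency's actual policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    name: str
    min_quality: float  # hypothetical: minimum benchmark score this job requires
    severity: int       # hypothetical scale: 1 = low impact ... 5 = high impact


def green_light(model_score: float, use_case: UseCase, max_accepted_severity: int) -> bool:
    """Return True only when all three dimensions line up."""
    # Dimension 1: the model's quality was actually measured as a valid score.
    quality_measured = 0.0 <= model_score <= 1.0
    # Dimension 2: that score is good enough for this particular use case.
    fit_for_purpose = model_score >= use_case.min_quality
    # Dimension 3: the organization accepts the severity of this use case.
    risk_acceptable = use_case.severity <= max_accepted_severity
    return quality_measured and fit_for_purpose and risk_acceptable


# Hypothetical use cases with made-up thresholds.
summaries = UseCase("summarize open-source reports", min_quality=0.80, severity=2)
reviews = UseCase("rate employee performance", min_quality=0.95, severity=4)

print(green_light(0.90, summaries, max_accepted_severity=3))
print(green_light(0.90, reviews, max_accepted_severity=3))
```

In practice each dimension would be a documented process with human review rather than a single number, but the structure is the same: failing any one of the three blocks the deployment.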

Chitra 11:37

Yeah. So it's really like tying in the test scenarios and testing practices with the guardrail development. That's right. And then doing that recursively, right? Because maybe certain things are fit, but only with certain constraints related to the guardrail.

Mark 11:50

And I think every organization has to establish that framework up front. If they don't, they're going to get themselves in trouble.

Yevgeniy 11:59

I thought it was really interesting what you said in terms of how you might characterize what the AI is being used for. And maybe there's a lack of specificity there. And I was wondering, just something that came into my mind: do we have the words and the vocabulary that we need today to adequately describe how we're using these systems in these use cases?

Mark 12:23

Yeah, I don't think so. I think it's being developed. Coming from a government background, and a combat support and intelligence background, we have the words, we have the frameworks in most cases. We have guidance, we have joint instructions, we have DOD instructions and all of these things, because much of this is not new. Much of what we're doing is within this policy and these frameworks and guides. But no one would have been able to foresee 10 or 15 years ago that these models would be so general, be able to do so many different things. Everything was so linear, fit-for-purpose, stovepiped before. Now that you have a model that can do a million different human things, how do you write the words when they're so expansive, when they can do so many different things? So I think that's being written every day now. I'll give you another one. I know some agencies are using models to do financial work or to do contracting work. Some agencies and companies are using models to do employee reviews, human resource reviews, evaluations and ratings. And no one would have thought that a computer would be doing that. Sure, maybe an application would help you write, click the boxes, and type the words, whatever. But now companies and agencies are having to write those policies to say you can use a model for this, but don't cross this line. Not for that, right? And it has to have human review and it has to have legal review or whatever, right? All of these things. So it's being written today. And it's important for agencies and companies to have the staff and the resourcing to do that. Because in the rush to automate everything, you want to cut staff and resourcing. And instead, we're talking about a shift: add staff and resourcing to be able to write this new guidance.

Yevgeniy 14:48

Right. And that's really interesting because I think evaluation, which is the subject of this broader conversation, is really important in that cycle, right? Because if the model is really good at, say, passing the bar exam, maybe it's not that good at evaluating resumes that your recruiter might receive. So in the healthcare industry, there's this notion that when you have a drug, you have an on-label use for that drug and maybe some off-label uses. Is that a useful concept here? As we have these LLMs that could do so many different things, maybe there are some use cases for them where we say, oh, that's on-label use, that's been evaluated. We know that that's gonna work.

Mark 15:27

That's a great analogy.

Yevgeniy 15:28

Or do we say, oh, that's really off-label?

Mark 15:30

It's great because I do believe that people should continue to experiment with these models. Right? You certainly don't know. And I say that awkwardly, because in some ways these things are unknown. And I'm not just saying that because it's mysterious how the computer talks back to you on the screen. The researchers, the scientists that work with these, in some of the early days of discovering how transformers turned the computers into magicians, they were like, hey, well, that's pretty interesting. It actually gave me results back that are like a human's, right? The way this worked. And I got to go on some early trips to go visit Google, OpenAI, and Anthropic and these companies. This would have been like 2021 or 2022, when they realized that they had something special. And so a group of us IC chief AI officer types went to go visit everybody. When we went to visit Google, someone just pointedly asked them, what was the discovery? What was the turning point where you said, hey, this is something? What was it about these models that makes them able to do what they're doing now? You know, Google has like 13,000 PhDs. And at that point, they were like, yeah, we're not exactly sure. It sort of built upon itself, and we kind of took this discovery and added this, and it was a lot of dice rolling and try this and try that, and hey, when we tried this, look how... yeah. Which I think is a bit like how experiments are done.

Chitra 17:37

Unsettling, but good.

Mark 17:39

And they were like, yeah, we've made this magic human machine and we're just gonna keep doing it, and we're just gonna give it more data and we're gonna give it more compute, and we're gonna spend $100 million on it, and it's gonna get that much better.

Yevgeniy 17:53

From humble beginnings with language translation to suddenly everything. That's right.

Chitra 17:58

I mean, I really do think that, the way we're describing it and the way we think about the testing environment, this is the perfect place where that analogy everybody uses of the enthusiastic intern works, because I don't know the potential of the enthusiastic intern. I don't know the potential of what I'm building here, right? But I won't know unless I try. And I can't stop trying because I have this gut reaction on, do I trust it in terms of the response, or is it predictable enough? And I think that's the question.

Mark 18:29

And at that time, right, they knew what they were on to. They knew that more training and more compute would equal more capability. Yeah, they knew that. And so at that time, they were just like, just more, just do more. And then as they did more, they were making discoveries and they were changing approaches. And now you see several approaches coming together. Now you see several modalities coming together. Again, just like humans. And I remember asking them then too, hey, how much money do you think you'll spend training this next year? And they're like, a billion dollars. And then someone said, well, would you ever anticipate spending $10 billion? They're like, oh, yeah. And then someone said, how about a hundred billion dollars? And then they paused for a second, like, oh yeah, we'll do that. I don't know when we'll do that, but we'll certainly spend $100 billion on this. Wow. And that gave everyone the confidence that they knew they were sitting on dynamite, right? They were sitting on the next atomic bomb, and it was obvious what was about to happen.

Yevgeniy 19:35

So this is fascinating. We started by talking about the use cases and the applications. Now we're talking about these institutions that are developing these systems. Obviously we have a lot of this work happening in industry, in academia; some of it might be happening in some national labs. Let's talk about the role of these different institutions and organizations in evaluations. Because you can imagine, in some ways in AI, testing's built in. If you're training one of these models, you're minimizing some kind of cost function, right? And you have some error propagation. So inherent in that, you need to measure error. So there's obviously some kind of testing during training. How do we evaluate? How do we understand the role of these different technologies? Should the companies that make the technology be responsible for testing? Should the companies that are integrating these technologies be responsible for testing? Should third parties be independent, like in the healthcare industry? What are your thoughts on that?

Mark 20:35

And I think the companies mostly are motivated by quality, which leads to great product and great revenues, right? The higher the quality, the better the product, the more revenues they have. So that's their biggest motivation. But you also remember some of the companies, a lot of them, were founded under the moral premise of, hey, if we do this and we invest in this, we want to make sure that we do it right, right? Responsible AI. And so it's interesting to see, after they made some of these key discoveries, after it was declared that they were probably sitting on the most valuable intellectual property in the history of the world, that it turned pretty quick. It has turned pretty quick. And their motivations now are business motivations, which leads me to say that I think the government then has an important role as regulators on this. And certainly not to suppress it and certainly not to slow it down. I was shocked when, when was it, maybe '23 or '24, Elon Musk came out and said, hey, we need to pause everything. You remember that? So that was a way of saying, hey, let's regulate, let's pause. There were people calling for nationalizing some of this, right, to make it sovereign. That's also a way of saying, hey, we're gonna put some cold water on this business. Those were both very shocking to me. And I couldn't understand the motivations behind people saying that. So we want commercial business, and when I say we, I mean the royal we, to flourish in this realm. This is INSA, right? We want the United States of America to promote this technology in companies, to make the United States have great advantage over our adversaries. But it's not without us having some ability, let's say NIST or the Department of Energy and others, to have some regulatory function here to make sure that it's not harmful. And these are words that I know maybe aren't in vogue anymore. Using words like safety: is this AI safe? Is this AI harmful? Are we doing this responsibly? Some of these words have lost their vogue. And I think that it's important for the government and third parties to be able to come in and say what's good and right here, for the credibility of the country and so that we don't use AI to destroy the world.

Chitra 23:53

On that note. I do think it's an interesting point. And I think that we have to be serious about actually providing those measures and those mechanisms to do the responsible thing, to do the safe thing, and improve quality, right? But not just throw a bunch of words at it and not turn it into, hey, it's gonna be hidden behind three pages of legalese, and that's how we're gonna protect ourselves. So how do we do that? I know there are a lot of people that favor purple team type strategies. Is that a viable strategy, or how do we not accidentally pay lip service to this? I'm putting you in the hot seat.

Mark 24:37

So we certainly can pay lip service by going too fast. Look, as chief AI officer, I had two responsibilities. One was to promote, right? Build, experiment, deploy, implement, make it happen. Right? Make it happen. And I certainly think that that's where we are today. Companies and agencies are bending over backwards to implement AI as fast as they possibly can with whatever they can afford. But I also had this other responsibility, which was: are we doing it securely? And I know we really haven't talked about security too much here. That to me is a huge gap. It's a huge vulnerability that we have right now. We know that some of these models are vulnerable. We know that they are exposed in some ways. We need an entire new industry. Look, the first time we discovered that computers could be used as hacking machines, right, and used to do banking and commerce, how much money, energy, and resources did we put into cybersecurity? I mean, a solid 20 to 30 percent of the computing economy went into security. And it's 1% for AI right now. We need to do a lot more for security. So getting back to the question, I think that we have to do both. We have to invest in something that will keep us on the straight and narrow here. Otherwise it will go the way of cyber, and this technology will be exploited for greed, for spying, for doing harm. And we have to be protective of that.

Yevgeniy 26:38

So, you know, we definitely want to make sure that AI doesn't destroy the world. I think we should all agree on that. Check back with us in 10 years. But the pace of deployment of these systems is accelerating, right? So the truth is, everybody's trying all these different use cases. And one thing that I really appreciated about the AI Action Plan is it talked about an AI evaluation ecosystem. And that was a fascinating way to phrase it, because you sort of need an ecosystem in order to keep abreast of all this. An ecosystem is an interesting way of putting it because in an ecosystem, there's a sort of chain, the food mesh or the food net, which is how we talk about it now. How do you conceptualize this ecosystem in AI? What are the parts of this ecosystem?

Mark 27:35

Yeah, let's go with cyber as an analogy. I really do think of the types of companies now that are in cyber. It ranges from people developing tools, to analysts, to entire services that people provide to protect our networks and systems. And so, like I said, I'm not sure of these numbers, but it's probably 20 to 30% of the entire compute economy that's dedicated to security. So I think the ecosystem is similar. There are companies popping up now that will go through models and check for vulnerabilities, that are perimeter-type AI systems, to make sure that they're watching the perimeter as your AI turns more agentic, with more ins and outs, right? Like APIs. So I think it's gonna take a similar trajectory to cyber, where you're gonna have to have testing, evaluation, risk frameworks. I mean, look at the consulting companies like Slalom or Deloitte or some of the others, Accenture. If you look at the shift they're making, they're gonna be moving from analytics, report writing and things like that, to investing in this. If you're not a frontier company developing the technology, right, if you're not Anthropic or Google or OpenAI that are developing the models, then your business has to be implementing the models, either agentic or RAG, or securing the models, or testing the models, or coming up with implementation strategies for companies. So all the meta stuff, I think, winds up being the rest of the ecosystem.

Yevgeniy 29:39

That's interesting. Cybersecurity is at 30%; we're not there yet with AI testing, I think.

unknown 29:44

No.

Chitra 29:45

I mean, I guess it depends on how you color the boxes, but it makes me think that the hard part about all this is going from dev and deployment to deployment and production, right? And I think you had a chance early in your career to push a lot of things out into the operational environment. So where does the test philosophy for AI fit in terms of dev versus operations? And I think, similar to cyber, if we were to pull that thread, it seems like it fits on all sides of the coin there, right? So then how do we operationalize testing as part of our deployment?

Mark 30:20

In my simple, traditional computer science and software engineering mind, I think of passing things to an agent, the agent doing things, and then passing them back. It's a little bit different, because if it's something very predictable and testable, like a function or an object class or whatever, you know what to test. In this case, sometimes you're asking the agent for things that are creative or different, or an output that's not readily testable. So I don't know. I don't know how software engineers are implementing that today.

Chitra 31:03

I think this is the challenge. It's very blurry on the run rules. So I think it's very different depending on who you ask.

Yevgeniy 31:10

Well, what I would say is, I would ask the question: what discipline is best equipped today for doing this type of testing? Because software engineers are still intimately involved in designing and developing these AI systems. However, if you look at the employees of Google's DeepMind, not all of them are software engineers. There are folks from neuroscience backgrounds, folks from physics backgrounds, all sorts of backgrounds. Do we need a new discipline, basically, one that is interdisciplinary and brings these different concepts together? And I just don't know, right? Because I don't know that we have something that fits the bill exactly. Yeah.

Mark 31:53

Yeah. Again, depending on what the output is: if you're asking an agent to access a database, do a query, get a result set back, and pass that back through an API, well, there's only one way to pass it back through an API. And that's testable.

SPEAKER_03 32:11

Yeah.

Mark 32:12

Right? But if it's to return a video or a chapter of a book, then the only way to test that is either to have a human read it and evaluate it, or to have another model read it and evaluate it, and then another model to evaluate that. And then, you know.

Yevgeniy 32:34

Or now with Model Context Protocol, you could be taking actions, and now you have to evaluate the consequences, the goodness of those actions.

Chitra 32:41

Right. And the unintended consequences, right? So if I took your computer vision models and asked them to do something inherently completely different, and then I get an off-the-wall output, whose job was that supposed to be, and who's responsible for that?

SPEAKER_01 32:59

Right. And I'm drawn to what you just said: who's responsible for that? I think in all of this conversation, responsibility is a word that we need to define here. And I say that when you think of the models now that are doing law, that are doing accounting, that are doing medicine, right? Think about those three professions. Those were always, obviously, human-centric, with some legal responsibility or some sort of weighted responsibility. Someone's signing something. Yeah. Someone has liability insurance in those.

Chitra 33:39

Somebody looked at it, verified it, and said, "I'll take the risk." Yeah.

SPEAKER_01 33:42

Approving, you know, signing. And so I think today, these models are being used by humans to elevate their work and be more productive. A lawyer is gonna have to do the work of five lawyers, and an accountant's gonna have to do the work of five accountants. But there's a point here. Whoever that accountant or lawyer or doctor is, they're responsible. They're the ones that are going to be sued if there's a failure, right? They're the ones that are gonna go to court or go to jail if there's a failure. There's a day coming, and again, it depends on how you test it, how you evaluate it, and how you qualify it, when someone's gonna say the model is better than the doctor. That day's coming soon. And the doctor's gonna be like, "I'm not responsible for this. This company is responsible for this failure." And then do you sue the model? Do you sue the company? Do you go back and say, "Hey, that was version 3.1, and now version 4.0 is taking care of that"? The whole responsibility framework is going to shift one day. And I don't know when it is, when these models become better than any given human.

Chitra 35:13

That makes me think of even simple things like homeowners insurance, right? There are a lot of things that you can control, that they can evaluate, where they can sit there and say, "This is how we think about what you have to do for me to insure you."

SPEAKER_03 35:28

Yeah.

Chitra 35:28

And then there's always that little clause about the catastrophic, natural thing that we can't control that voids everything, right?

SPEAKER_01 35:35

The act of God.

Chitra 35:37

So the act of God clause. And so maybe there's something like that. We need an appropriate act of God clause for AI that says, "I can't test every condition of everything in every way." Maybe there are the things that make sense to do, and you don't just spend a hundred million dollars, if you don't have it, trying to test every outcome, because there's a sensibility approach, right? Maybe there's something like that we have to figure out collectively around what is the.

SPEAKER_01 36:06

Well, I'll ask you guys a question. Do you guys think we're approaching superhuman AI?

Yevgeniy 36:14

Do you think we'll see that? Well, I'm a neuroscientist by training, right? So I don't know that we're gonna approach superhuman AI in every dimension. I don't think these models are sophisticated enough to fully replicate the goings-on of the human brain. They are very good, but they're not quite there yet. They're scaling very quickly, so I'm not gonna say that things are gonna stay the way they are now. That being said, I think there are lots of models, even before the rise of LLMs, that have exceeded human capacity in many different domains. That's right. I'll give you face recognition as one example, right? Humans have an innate face recognition system. We're reasonably good at recognizing enough people that might be suitable to enter our house. But if you had to go out and recognize one person among a million other people, we're just not equipped to do that. AI has been instrumental in face recognition. And we've had those same challenges. As AI got really better than humans, we still maintained this sort of human oversight role. And I really appreciate the point you're making, that the envelope for human oversight keeps shrinking. In the current human oversight role in face recognition for law enforcement use, you use face recognition to generate a lead, and then it's incumbent on the law enforcement officer to follow up that lead, and it's just a lead.

SPEAKER_03 37:39

Yeah.

Yevgeniy 37:39

And I've got good friends in that environment who would speak much more eloquently about this. But I think the envelope is really shrinking with this AI, because it can do all the follow-ups. And now it becomes kind of, where do you go as a person?

Chitra 37:55

Yeah, the way I think about it, it's more around this theory, just for people, that if you have this kind of cross-learning across domains, like I'm able to put together the neuroscientist with the geochemist, right, then you get this additional, amazing learning opportunity, because it's this domain crossing that becomes interesting.

SPEAKER_03 38:18

Yeah.

Chitra 38:18

I think that's where the AI has proven to be superhuman, because I can get that effect of domain cross-learning much more rapidly than any one human being could go and study some new subject, right? I think we could talk about this forever though, right?

SPEAKER_01 38:35

Because I think that day's coming quick, actually. I believe, maybe not for everything, but there will certainly be a certain class of things in the next decade where we will see someone go to court over this. Oh yeah. Yeah.

Chitra 38:49

Yeah. Hopefully not in my lifetime, but we'll see. Hopefully it's not me. So no, I think this is great. And if you had final parting wisdom for the companies working in national security, what would you advise people to do, without obviously overspending on this, but being thoughtful and diligent?

SPEAKER_01 39:10

I think I kind of said it: it's obviously a balance. You need both hands, both approaches here. We need to keep moving out, we need to keep building and implementing. Certainly, this is a great gift to the world. It is. It's gonna upgrade mankind, humankind. But it's certainly fraught with peril. Certainly dangers await, the same dangers, and I'll use cyber again, as using computers to cause harm. Well, we've done that since we invented computers. So AI is just gonna take it up a notch. So the government, companies, the world needs to invest in defense. The world needs to invest in quality testing and evaluation. The world needs to ensure that it's a balanced approach and that we're doing it right for the benefit of mankind.

SPEAKER_03 40:10

Yeah, hear, hear.

Yevgeniy 40:11

Yeah. That's very eloquent. Mark, thank you for joining us. Thanks, guys. And for sharing your insights on what it takes to test AI systems within mission environments and making sure they don't destroy the world.

Chitra 40:28

And thanks to our listeners for tuning in to this installment of our series, From Action Plan to Mission Impact.

Yevgeniy 40:36

Our next episode is Humans Still Required, continuing the theme here. We'll be shifting our focus from the technology to the workforce, examining how the AI action plan translates into recruiting, training, and preparing people to operate alongside these systems.

Chitra 40:53

Awesome. And joining us for that conversation will be Dr. Missy Cummings. She's the director of Mason's Autonomy and Robotics Center and a professor at GMU.

Yevgeniy 41:04

We'll be digging into the skills gap, barriers to adoption, and what it really takes to ensure workforce investments drive mission outcomes.

Chitra 41:14

Until then, I'm Chitra Sivanandam.

Yevgeniy 41:16

And I'm Yevgeniy Sirotin.