The Digital Transformation Playbook

The $3 Trillion Question: Can AI Match Human Experts?

Kieran Gilmurray


What happens when AI attempts the same complex work as human experts with 14 years of experience? The answer might reshape our understanding of the economic future.

TL;DR:

  • GDPval tests AI on complex, multimodal tasks requiring handling of CAD designs, spreadsheets, and presentations
  • Tasks are created from actual professional work products that take humans an average of 7 hours to complete
  • Claude Opus performed best with 47.6% of its deliverables rated as good as or better than human experts
  • AI shows potential to make workflows 40% faster and 63% cheaper when paired with human oversight
  • 3% of AI failures were classified as "catastrophic," including incorrect medical diagnoses and suggestions of financial fraud
  • Simple prompt improvements like asking models to self-check their work significantly reduced formatting errors
  • Current models still struggle with ambiguity and tasks requiring tacit knowledge or complex human interaction


GDPval represents a fundamental shift in how we evaluate artificial intelligence. Rather than abstract academic metrics, this new benchmark from OpenAI measures how well frontier AI models handle real-world economic tasks across nine major sectors worth $3 trillion annually. 

The methodology is ruthlessly practical—AI models must complete complex assignments that typically take human experts seven hours, handling everything from CAD designs to financial spreadsheets while synthesizing information from up to 38 reference documents.

The results are both promising and sobering. Claude Opus led the evaluation with 47.6% of its outputs rated equal to or better than work from professionals at organizations like Apple, Goldman Sachs, and Boeing. When integrated into realistic workflows with human oversight, these models demonstrated potential to make knowledge work 40% faster and 63% cheaper. 

Yet failures remain significant—3% were classified as "catastrophic," including incorrect medical diagnoses and recommendations of financial fraud.

Perhaps most valuable is GDPval's illumination of where AI currently excels (document formatting, data analysis) and where it falters (following complex instructions, handling ambiguity). 

This economic lens offers businesses and policymakers unprecedented clarity about AI's near-term impact on knowledge work, while highlighting that the highest-value human skills—tacit knowledge, real-time collaboration, and complex communication—remain beyond current AI capabilities. 

How quickly will that gap close? That's the trillion-dollar question worth pondering.

Listen to an audio version of this report, created using Google NotebookLM.

Link to research: GDPval.pdf 

Support the show


𝗖𝗼𝗻𝘁𝗮𝗰𝘁 my team and me to get business results, not excuses.

☎️ https://calendly.com/kierangilmurray/results-not-excuses
✉️ kieran@gilmurray.co.uk
🌍 www.KieranGilmurray.com
📘 Kieran Gilmurray | LinkedIn
🦉 X / Twitter: https://twitter.com/KieranGilmurray
📽 YouTube: https://www.youtube.com/@KieranGilmurray

📕 Want to learn more about agentic AI? Then read my new book on Agentic AI and the Future of Work: https://tinyurl.com/MyBooksOnAmazonUK


Introducing the GDPval Benchmark

SPEAKER_01

So today we're diving deep into something that uh really shifts the frame on how we think about AI and work.

SPEAKER_02

Definitely. We're looking at a new way to evaluate AI, not just, you know, academic scores, but how well models handle real-world tasks, the kind that actually, well, drive the economy.

SPEAKER_01

Exactly. It's this benchmark, GDPval. Our sources are snippets from the technical paper introducing it, and it comes from OpenAI.

SPEAKER_02

And the stakes here are pretty high. Usually when we talk about AI's impact on jobs, automation, replacement, we look at things like adoption rates or GDP changes.

SPEAKER_00

Right, things that have already happened, lagging indicators.

SPEAKER_02

Precisely. GDPval tries to give us leading indicators. It measures AI capabilities directly against what highly skilled humans produce, giving us a peek into the economic potential before it fully hits.

SPEAKER_01

Okay, so our mission today, let's unpack the methodology here, see how these top AI models stack up against actual human experts, and figure out what this really means for the speed and maybe the cost of knowledge work.

SPEAKER_02

And get this. GDPval isn't asking simple questions. It's giving AI models complex assignments.

SPEAKER_01

Right. Assignments covering most of the work activities, according to the U.S. Bureau of Labor Statistics, for 44 different jobs.

SPEAKER_02

Yeah, 44 high-value occupations across the top nine sectors contributing to U.S. GDP. We're talking finance, healthcare, tech services.

SPEAKER_01

Huge sectors. How much are they worth, roughly?

Methodology and Task Complexity

SPEAKER_02

Collectively, around $3 trillion annually. They were very specific about choosing occupations that are mostly digital already, like at least 60% computer-based work according to O*NET data.

SPEAKER_01

Okay, so focusing where AI might realistically slot in first.

SPEAKER_02

Exactly. Predominantly digital work.

SPEAKER_01

And the realism aspect. This seems key. These aren't simplified textbook problems.

SPEAKER_02

Not at all. The tasks are based on actual work products from expert professionals, people with, on average, 14 years of experience in their field.

SPEAKER_01

14 years. Okay, so that's the human baseline they're comparing against. That's, yeah, that's substantial.

SPEAKER_02

It really is. And it has to be because the tasks are tough. They're multimodal, meaning the AI isn't just reading text.

SPEAKER_01

What kind of files are we talking about?

SPEAKER_02

Oh, all sorts. CAD design files, spreadsheets, complex diagrams, videos, uh, presentation decks. A real mix.

SPEAKER_01

So it has to handle different data types, just like a person would.

SPEAKER_02

Exactly. And each task requires context. Lots of it. For the gold subset, that's the open-source part, models needed to parse up to 17 reference files.

SPEAKER_01

17 files per task.

SPEAKER_02

Up to 17 for the gold set and up to 38 in the full benchmark set. It really forces the model to synthesize information from different places.

SPEAKER_01

That sounds incredibly time consuming, even for a human expert.

SPEAKER_02

Oh, absolutely. These are what they call long horizon tasks. The average human expert took about 404 minutes, that's nearly seven hours to complete just one task in the gold subset.

SPEAKER_01

Seven hours of expert work. Wow. And some took longer.

SPEAKER_02

Some span multiple weeks. These are complex assignments.

SPEAKER_01

So how did they put a price tag on that? How do you value that kind of work?

SPEAKER_02

It was pretty direct. They took the estimated completion time and multiplied it by the median hourly wage for that specific job.

SPEAKER_01

Ah, okay. So every result has a clear economic link. Time saved equals money saved, potentially.

SPEAKER_02

Exactly. It gives a quantifiable measure of potential efficiency gains.
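As a quick sketch of that time-times-wage arithmetic: the ~404-minute average task time comes from the discussion, but the $53.60 hourly wage below is a hypothetical figure chosen for illustration, not a number from the paper.

```python
# Back-of-the-envelope task valuation: completion time x median hourly wage.
# The 404-minute figure is the gold-subset average discussed here; the
# hourly wage is hypothetical.

def task_value(minutes: float, median_hourly_wage: float) -> float:
    """Estimate the economic value of one expert task deliverable."""
    return (minutes / 60.0) * median_hourly_wage

# A ~404-minute task at a hypothetical $53.60/hour lands near the
# average task cost mentioned later in the discussion.
print(round(task_value(404, 53.60)))  # 361
```

Swapping in the real median wage for any given occupation gives the per-task value for that job.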

SPEAKER_01

But you know, the comparison only holds up if the human experts were genuinely top tier. How rigorous was the selection? You mentioned 14 years average experience.

SPEAKER_02

Minimum four years experience required, strong resume needed, plus they had to pass a video interview and a background check.

SPEAKER_00

Okay.

SPEAKER_02

And the paper actually lists some of the kinds of places these experts came from. Think Apple, Goldman Sachs, IBM, Meta, Boeing, even the CDC.

SPEAKER_01

Whoa. Okay, so these aren't just average professionals. This is the high end of the talent pool.

SPEAKER_02

Definitely. They wanted the comparison to be against the best, the industry standard.

SPEAKER_01

Which makes the results even more interesting. Now, how about grading? Comparing AI output to that level of human expertise must be tricky.

SPEAKER_02

Super tricky. They used what's called a blinded head-to-head comparison. Meaning an occupational expert, someone in that field, gets the original task request, all the reference files, and two final deliverables. One is from the AI, one from the human expert baseline.

SPEAKER_01

And they don't know which is which.

SPEAKER_02

Exactly. They just rank them. Yeah. Which one is better or are they comparable?

SPEAKER_01

Is it just about correctness? Like, did it get the answer right?

SPEAKER_02

No. And that's crucial. It's much more subjective. The graders consider structure, writing style, formatting, even aesthetics. You know, the kind of things a real boss or client actually cares about.

SPEAKER_01

That makes sense. Polish matters in professional work.

SPEAKER_02

Absolutely. And this grading process takes time. For the gold subset, it took the human graders over an hour, on average, just to compare one pair of deliverables.

Evaluation Process and AI Grading

SPEAKER_01

An hour per comparison? That's dedication. Okay, this next bit sounds potentially huge for the future of AI evaluation itself. They developed an automated grader.

SPEAKER_02

Yeah, an experimental one based on a high-end GPT-5 model, specifically for the open-source tasks. And get this.

SPEAKER_01

Ah, so the AI grader is only 5% less consistent than two humans judging the same subjective work.

SPEAKER_02

Exactly. It suggests AI is getting remarkably close to making human-like judgments about quality, even on things like style and structure. That's a big step.

SPEAKER_00

Okay, let's get to the headline results then. How did the frontier models actually perform against these elite humans?

SPEAKER_02

The overall trend showed performance improving uh pretty much linearly over time as models get better. And the key finding: the current best models are genuinely approaching parity with these industry experts on deliverable quality.

SPEAKER_01

Approaching parity. Wow. Any specific models stand out?

SPEAKER_02

Yes. On that gold subset, Claude Opus 4.1 came out on top. Nearly half, 47.6%, of its deliverables were rated as either better than or as good as the human expert's output.

SPEAKER_01

Better than or equal to an expert with 14 years experience almost half the time. That's impressive. What were its strengths?

SPEAKER_02

Claude particularly shone in aesthetics, things like document formatting, slide layouts, basically the overall polish. It also did well with file types like PDFs and Excel spreadsheets.
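That head-to-head scoring boils down to a simple win-or-tie rate over blinded comparisons. A minimal sketch, with a hypothetical toy set of grader verdicts (the real gold subset involves far more comparisons):

```python
# Minimal sketch of how blinded head-to-head grading reduces to a win rate.
# Each verdict is "win", "tie", or "loss" from the AI deliverable's side;
# the sample below is invented for illustration.

def win_or_tie_rate(verdicts: list[str]) -> float:
    """Fraction of comparisons where the AI output was rated as good as
    or better than the human expert's deliverable."""
    favorable = sum(1 for v in verdicts if v in ("win", "tie"))
    return favorable / len(verdicts)

sample = ["win", "tie", "loss", "loss", "win", "loss", "tie", "loss"]
print(win_or_tie_rate(sample))  # 0.5 on this toy sample
```

Run over the full gold subset, this is the statistic behind figures like Claude Opus 4.1's 47.6%.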

Model Performance and Economic Impact

SPEAKER_01

Interesting. And what about OpenAI's own model, GPT-5?

SPEAKER_02

GPT-5 was very competitive, but its strength seemed to lie more in accuracy. So carefully following instructions, getting calculations right, especially on tasks that were purely text-based.

SPEAKER_01

Okay, so Claude for polish, GPT-5 for precision. Why does Claude winning on aesthetics matter economically? Isn't accuracy king?

SPEAKER_02

You'd think so, but in a lot of high-value knowledge work, the kind in these three-trillion-dollar sectors, presentation matters. A deliverable that looks professional, that's well structured and easy to understand, often gets accepted and used much faster.

SPEAKER_01

Right. Less back and forth, fewer revisions, maybe.

SPEAKER_02

Exactly. So Claude's strength there could translate to a higher rate of actually usable output, which is a direct economic benefit.

SPEAKER_01

But it wasn't all smooth sailing for the models, was it? Where did they tend to fall short compared to the humans?

SPEAKER_02

Yeah, the analysis of why models lost points is really revealing. Across the board, for all models, the single biggest reason for being ranked lower than the human was failure to fully follow instructions.

SPEAKER_01

Ah, the classic AI challenge, just not quite doing what was asked.

SPEAKER_02

Pretty much. Models like Gemini and Grok often lost because they'd say they were going to provide something, like generate a specific file, but then just didn't. Or they'd ignore critical data from the reference files.

SPEAKER_01

And GPT-5, it had fewer instruction issues, you said.

SPEAKER_02

Fewer, yes, but it lost more points on formatting. The content might be okay, but the output wasn't styled correctly for, say, a PowerPoint slide or a formal document.

SPEAKER_01

Okay, so instruction following and output formatting are the big hurdles right now. Let's tie this back to the economics. We know the average human task cost about $361. Given these quality results, what did the study find about potential cost savings when using AI assistance?

SPEAKER_02

They modeled this. They looked at a scenario where an expert uses the AI, reviews the output, maybe asks the AI to try again, and if it's still not right after a few tries, the expert just fixes it themselves.

SPEAKER_01

A realistic workflow, probably.

SPEAKER_02

Very much so. And in that setup, the efficiency gains were clear. Using GPT-5 as an assistant, the workflow was 1.39 times faster than the human expert working alone.

SPEAKER_01

Almost 40% faster. And the cost.

SPEAKER_02

The cost saving was even better. 1.63 times cheaper.

SPEAKER_01

Wow. So significantly faster and cheaper, even accounting for some potential rework by the human.

SPEAKER_02

That's the potential, yes. For every, say, 10 hours an expert spends, AI assistants could cut that down to maybe seven hours and reduce the cost by close to 40%. That's a huge shift for these expensive professions.
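The arithmetic behind those figures, as a small sketch: the 1.39x and 1.63x factors are the ones just quoted, while the 10-hour baseline is purely illustrative.

```python
# Sketch of the assisted-workflow arithmetic using the reported factors.
# speedup = 1.39x faster, cost_factor = 1.63x cheaper; the 10-hour
# baseline task is hypothetical.

def assisted_time(hours_solo: float, speedup: float = 1.39) -> float:
    """Hours needed with AI assistance, given a solo-expert baseline."""
    return hours_solo / speedup

def cost_saving_pct(cost_factor: float = 1.63) -> float:
    """Percentage cost reduction implied by a 'times cheaper' factor."""
    return (1 - 1 / cost_factor) * 100

print(round(assisted_time(10.0), 1))  # 10 expert hours -> ~7.2 assisted hours
print(round(cost_saving_pct()))       # ~39% cheaper, i.e. "close to 40%"
```

Note this is the gross figure; as discussed next, the expert time spent reviewing and fixing AI output eats into these savings.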

SPEAKER_01

But there's always a but, isn't there? What's the catch?

SPEAKER_02

The catch is the cost of that human oversight. The study emphasizes that when you factor in the experts' time needed to carefully review the AI's work and potentially fix or redo parts of it, the net savings shrink.

SPEAKER_01

Ah, right. The oversight isn't free, it takes expert time too.

SPEAKER_02

Exactly. It proves that human oversight isn't just a good idea, it's a necessary cost. You can't just let the AI run unsupervised, not yet anyway.

SPEAKER_01

Which brings us to the failures. If oversight is essential, how bad can things get when the AI messes up? They rated the severity of GPT-5's errors, right?

SPEAKER_02

It did. And while most failures were categorized as acceptable but subpar, meaning not great but not disastrous, a significant chunk, about 29%, were rated as bad or even catastrophic.

SPEAKER_01

Catastrophic? What does that mean in this context?

Limitations and Catastrophic Failures

SPEAKER_02

That's the really worrying part. Catastrophic failures made up about three percent of the total failures. And these included things like the AI giving a wrong medical diagnosis.

SPEAKER_00

Oh wow.

SPEAKER_02

Recommending financial fraud, or suggesting actions that could lead to actual physical harm.

SPEAKER_01

Wait, three percent of failures involved recommending fraud or potentially causing harm. That seems incredibly high stakes.

SPEAKER_02

It absolutely is.

SPEAKER_01

Does this benchmark then almost prove the opposite point? That AI can't be trusted in fields like medicine or finance without extremely careful, dedicated human validation on every single output?

SPEAKER_02

It certainly underscores that point heavily. The potential efficiency is there, yes. But the risk associated with these high-severity failures in professional services is just too great to ignore. The models are useful assistants, but nowhere near autonomous for critical decisions.

SPEAKER_01

Okay, so there's work to do. Is there good news on the improvement front? Can these models get better easily?

SPEAKER_02

Yes, actually. The study found some relatively easy wins. For instance, just giving the model more thinking time, increasing its reasoning effort, led to predictable performance improvements.

SPEAKER_01

More compute helps. Make sense.

SPEAKER_02

And the power of just prompting better was really striking. Remember GPT-5 losing points on formatting?

SPEAKER_01

Yeah, the PowerPoint issues.

SPEAKER_02

They tried adding a simple instruction to the prompt, basically telling GPT-5 to double-check its own work rigorously. Things like render the file as an image to check the layout before you finish.

SPEAKER_01

Like a self-correction step.

SPEAKER_02

Exactly. And that simple addition dramatically reduced those bad formatting errors in PowerPoint files. They dropped from 86% down to 64%.

SPEAKER_01

Just from one extra instruction, that's a big jump.

SPEAKER_02

It really is. It suggests that better training or even just smarter scaffolding prompts can quickly improve the practical usability and reduce how much human fixing is needed.
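A hypothetical sketch of that kind of scaffolding: the wording of the self-check instruction below is invented for illustration, not the paper's exact prompt.

```python
# Hypothetical sketch of the self-check prompt scaffolding described above:
# append an explicit verification step to the task prompt. The exact
# instruction text is invented, not taken from the GDPval paper.

SELF_CHECK = (
    "Before you finish, render each produced file (e.g. slides) as an "
    "image and verify the layout, formatting, and completeness."
)

def scaffold(task_prompt: str) -> str:
    """Return the task prompt with a self-verification step appended."""
    return f"{task_prompt}\n\n{SELF_CHECK}"

prompt = scaffold("Create a 10-slide quarterly review deck from report.xlsx.")
print(SELF_CHECK in prompt)  # True
```

The point is that the improvement came from a one-line addition to the prompt, no retraining required.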

SPEAKER_01

What about the boundaries? Where do current models still struggle significantly?

SPEAKER_02

Ambiguity seems to be a major one. They tested the models with much shorter prompts, only 42% of the original length, leaving out a lot of context. Performance dropped quite a bit. The models struggled to fill in the blanks, to figure out the missing context and what inputs were needed. It highlights a gap between how humans navigate fuzzy real-world requests and how AI still relies on very specific instructions.

SPEAKER_01

That makes sense. And it's important to remember what this benchmark doesn't cover.

SPEAKER_02

Absolutely critical. Yeah. GDPval is focused on self-contained knowledge work, producing digital outputs. It explicitly leaves out anything involving manual labor or physical tasks.

SPEAKER_00

Okay.

SPEAKER_02

It also excludes tasks that need extensive tacit knowledge, that gut feel or deep experience that's hard to write down. And crucially, it doesn't test tasks requiring real-time communication between people or collaboration or using specialized proprietary software.

SPEAKER_01

So it's a specific slice of knowledge work, albeit a large and valuable one.

SPEAKER_02

A very specific digital self-contained slice. Those boundaries matter.

Future Improvements and Big Picture

SPEAKER_01

Okay, so let's try and synthesize this. What's the big picture takeaway here?

SPEAKER_02

Well, it seems clear that frontier AI models are capable of producing high-quality work that approaches human expert levels, at least in these defined digital tasks. This definitely points to real potential for significant time and cost savings.

SPEAKER_01

But, and it's a big but, those savings really only happen if you have robust human oversight baked into the process. You need experts to catch the instruction errors, the formatting glitches, and especially those rare but potentially catastrophic failures.

SPEAKER_02

Right. The potential in that $3 trillion digital knowledge sector is there, but managing that, say, 3% risk of serious error is absolutely paramount.

SPEAKER_01

So GDPval gives us a much sharper lens to track this progress using real economic metrics like time and cost.

SPEAKER_02

Exactly. It shifts the debate away from just theoretical capabilities towards tangible economic impact. It's a powerful new tool for businesses and policy makers.

SPEAKER_01

Okay, so here's something to leave you, our listener, thinking about.

SPEAKER_02

The source material notes that this benchmark aligns with economic ideas suggesting digital tasks often involve more non-routine cognitive work. However, as we just discussed, GDPval specifically excludes tasks needing deep, tacit knowledge or complex human interaction and communication.

SPEAKER_00

Right. The stuff that's maybe hardest to automate.

SPEAKER_02

Exactly. So if the greatest remaining value in human labor lies in those non-routine interpersonal skills, complex communication, collaboration, dealing with ambiguity, the areas outside GDPval's current scope, how quickly do AI models need to develop those capabilities to keep driving these big efficiency gains across the whole economy?

SPEAKER_01

That's the next frontier, isn't it? Something to mull over.