Ignition by RocketTools

Karpathy's $0.35 AI vs. Your Benefits Broker's 20 Hours

Dan McCoy, MD Season 1 Episode 15

Andrej Karpathy just released AutoResearch — a 630-line Python tool that lets AI agents run hundreds of experiments overnight on a 35-cent GPU rental. In one test, 35 autonomous agents completed 333 experiments while everyone slept and found 20 improvements that worked. No humans involved.

Meanwhile, your benefits broker spends maybe 20 hours on your entire annual renewal — most of it pulling quotes and formatting spreadsheets — to manage a budget that consumes 25-40% of your total payroll.

In this episode I break down what autonomous research actually is, why the analytical pattern applies directly to healthcare data and benefits design, and three specific questions every employer should be asking their broker at renewal. I also cover the three failure modes of AI research identified by economist Joshua Gans and how Karpathy's system addresses each one.

The benefits consulting industry is about to have its spreadsheet moment. The tools exist, the data exists, the pattern is proven. The only question is whether you'll be the one using it or the one disrupted by someone who does.

Full sources and the deep dive: danmccoymd.substack.com

Also check out my YouTube channel.

SPEAKER_00

Here's something that should bother every employer in America. Andrej Karpathy, former head of AI at Tesla and founding member of OpenAI, just released a tool that lets a single AI agent run hundreds of experiments overnight on a 35-cent GPU rental. In one test, 35 autonomous agents completed 333 experiments while everyone slept. The agents found 20 improvements that worked, and no humans were involved.

Meanwhile, your benefits broker spends maybe 20 hours on your entire annual renewal. Most of that is pulling quotes from carriers and formatting spreadsheets. You're paying someone to manage a budget that consumes 25 to 40% of your total payroll, and the analytical rigor they bring to that decision is roughly equivalent to what a hobbyist can now run on a single graphics card for the cost of a coffee. That's a lovely mismatch. Let me tell you what's actually happening and why the consulting industry is about to get very uncomfortable.

To understand why this matters, you need to understand what autonomous research actually is. In February 2026, Karpathy coined a term called agentic engineering. This is the idea that you're not writing code directly anymore, you're orchestrating agents who do it for you. By March, he released AutoResearch, a very small 630-line Python tool that gives an AI agent a task and a budget and creates a feedback loop. The agent proposes a hypothesis, tests it, evaluates whether the result improved, keeps or discards the result, learns from the outcome, and repeats.

Karpathy ran this on a training optimization problem and let it go for two days. The agent processed 700 autonomous changes. It found roughly 20 additive improvements that transferred to larger models. That's an 11% efficiency gain, and it was achieved by a program running while humans were asleep. This isn't a research paper, it's a working system that anyone can run. In fact, I set it up myself. I created a quick script using Claude Code on my MacBook Pro.
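
The loop described above (propose, test, evaluate, keep or discard, repeat) can be sketched in a few lines of Python. This is a minimal illustration on a toy objective, not AutoResearch itself: `run_experiment`, `propose`, and `research_loop` are hypothetical names, and the quadratic score is an assumed stand-in for a real training run.

```python
import random


def run_experiment(config):
    """Toy stand-in for a training run (assumption): lower score is better."""
    return (config["lr"] - 0.3) ** 2 + (config["width"] - 0.7) ** 2


def propose(config, rng):
    """Propose a hypothesis: nudge one setting by a small random amount."""
    key = rng.choice(sorted(config))
    candidate = dict(config)
    candidate[key] += rng.uniform(-0.1, 0.1)
    return candidate


def research_loop(start, budget, seed=0):
    """Run `budget` experiments, keeping a change only if it improves the score."""
    rng = random.Random(seed)
    best, best_score = dict(start), run_experiment(start)
    log = []
    for step in range(budget):
        candidate = propose(best, rng)       # propose a hypothesis
        score = run_experiment(candidate)    # test it
        kept = score < best_score            # evaluate the result
        if kept:                             # keep or discard
            best, best_score = candidate, score
        log.append({"step": step, "score": score, "kept": kept})
    return best, best_score, log


start = {"lr": 0.9, "width": 0.1}
best, best_score, log = research_loop(start, budget=200)
kept_count = sum(entry["kept"] for entry in log)
print(f"kept {kept_count} of {len(log)} changes; final score {best_score:.4f}")
```

Here the only "learning" is that each new proposal starts from the best configuration found so far. A real agent would also read the log of failed attempts before choosing its next hypothesis.
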
I connected to vast.ai, which is an online marketplace where you can rent very sophisticated server space by the hour, and I ran my own test. Even my rudimentary tests were pretty crazy.

This pattern isn't limited to machine learning. Anthropic's research shows Claude Code, that's their AI coding assistant and my daily driver, now runs autonomously for over 45 minutes in the longest sessions. That's up from under 25 minutes just three months ago. Among millions of interactions analyzed, only 0.8% of tool calls were irreversible. For the rest, the AI tried things, checked results, and adjusted. What happens when you point this capability at healthcare data, benefits design, or hospital pricing?

The conventional wisdom in benefits consulting goes like this: your broker brings industry expertise, carrier relationships, and regulatory knowledge. They analyze your claims data, compare quotes, and recommend a plan design. This process takes time. A typical case build for a mid-sized employer can consume up to 20 hours. That sounds like due diligence, but here's what it actually means: 20 hours annually for a budget that might be $5 million, $20 million, $50 million in healthcare spend. Account managers spend roughly 50% of their day on administrative tasks: pulling forms from benefit administration platforms, verifying information, submitting to carriers, and following up. The research portion is whatever time remains.

The NFP 2026 Benefits Trend Report found that nearly half of employers expect healthcare budgets to increase again this year. Healthcare costs are projected to increase 6 to 9%. PwC's actuaries predict an 8.5% medical cost trend for group markets. And the industry response? According to Vertafore's broker research, agencies are looking at AI and automation to drive efficiency. One major use case they cite: automating data extraction from summary of benefits documents.
What used to take 30 to 45 minutes per plan can now be done in under five minutes. They're automating the extraction, not the analysis, not the hypothesis testing, not the "what if we changed X" exploration. The bar is on the floor.

I clearly went down a rabbit hole on autonomous research failure modes this week. Joshua Gans, an economist at the University of Toronto, ran an experiment he called vibe researching, using AI to produce a publishable paper in under an hour. Now, my academic types, you can relax. He succeeded. The paper was published in Economics Letters, but he also documented what goes wrong: three failure modes that matter for anyone thinking about applying AI to research, including benefits consultants.

The first we're going to call seduction. This is where LLMs present results with confidence. They claim formal results even when they're wrong. Gans spent days believing he had discovered findings that didn't actually exist. His conclusion: treat AI outputs like interested parties, biased sources requiring extra skepticism.

Second is low idea quality. When execution becomes easy, you pursue weak ideas you'd normally abandon. I'm totally guilty of this. Because it costs almost nothing to run a test, you run everything, including bad tests. Gans admits he completed lower-quality work that he would have quit if the process had actually been harder.

Third, missing decision points. Human research has natural friction that forces you to stop and ask, is this worth continuing? Autonomous systems, well, they don't have that friction. They just keep going, often compounding errors.

Here's what's interesting. Karpathy's AutoResearch addresses all three of these. The system has mandatory checkpoints that score every finding for confidence, novelty, and actionability, and discard anything below a threshold. In my testing, I ran 101 hypotheses against my own personal genetic data. All 101 passed the quality test. Now that's a red flag.
100% acceptance means I set the bar too low, and I did. So version two added persistent memory (a database tracking every hypothesis tested across runs), novelty scoring that compares findings against existing knowledge, and higher thresholds: confidence above 65%, novelty above 35%, and actionability above 50%. Now, when you set that all up, the system learns. The first run might keep 60%, but by the third run, it's focusing on areas that produce actual discoveries.

The cost? About 35 cents per hour of GPU rental. For $10 a month, you could run 30 hour-long sessions. That's 3,000 hypotheses tested, which raises the obvious question: why is your benefits consultant running maybe five to 10 plan comparisons at renewal while this system exists?

Now there are counterarguments worth considering. The most obvious: healthcare isn't machine learning. You can't just run experiments on benefit design the same way you can tune neural network parameters. The data is messier, the stakes are higher, and the regulatory constraints are real. That's fair, but here's what that argument misses. The analytical pattern is identical. You have data: claims history, plan documents, benchmark surveys, carrier financials, and hospital pricing files, which are now public thanks to price transparency rules. You have hypotheses: Would adding fertility benefits reduce turnover in this demographic? What's the ROI of mental health parity versus gym memberships? Which procedures show the highest variance between facilities?

The CURATE.AI platform is already doing this in medical oncology. It's an indication-agnostic, AI-derived platform that uses patients' own data to create an N-of-1 profile that dynamically personalizes dose recommendations. Their feasibility trial demonstrated adaptability to clinically relevant situations for patients with advanced solid tumors.
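
The version-two quality gate described above (persistent memory plus thresholds of confidence above 65%, novelty above 35%, and actionability above 50%) could look something like the sketch below. Only the three cutoffs come from the episode. The `Finding` class, the `tested` table, the function names, and the example scores are all hypothetical, and how the scores are computed is left out entirely.

```python
import sqlite3
from dataclasses import dataclass

# Cutoffs quoted in the episode; a finding must be strictly above all three.
THRESHOLDS = {"confidence": 0.65, "novelty": 0.35, "actionability": 0.50}


@dataclass
class Finding:
    hypothesis: str
    confidence: float
    novelty: float
    actionability: float


def passes_gate(finding):
    """True only if every score is strictly above its cutoff."""
    return all(getattr(finding, name) > cutoff for name, cutoff in THRESHOLDS.items())


def open_memory(path=":memory:"):
    """Open the memory of tested hypotheses (pass a file path to persist across runs)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS tested (hypothesis TEXT PRIMARY KEY, kept INTEGER)"
    )
    return db


def review(db, findings):
    """Skip hypotheses already on record, record the rest, return those that pass."""
    kept = []
    for finding in findings:
        seen = db.execute(
            "SELECT 1 FROM tested WHERE hypothesis = ?", (finding.hypothesis,)
        ).fetchone()
        if seen:
            continue  # already tested in an earlier run
        ok = passes_gate(finding)
        db.execute("INSERT INTO tested VALUES (?, ?)", (finding.hypothesis, int(ok)))
        if ok:
            kept.append(finding)
    db.commit()
    return kept


# Illustrative scores only, not real results.
db = open_memory()
batch = [
    Finding("hypothesis A", 0.80, 0.60, 0.70),  # clears all three cutoffs
    Finding("hypothesis B", 0.90, 0.20, 0.90),  # fails on novelty
    Finding("hypothesis C", 0.60, 0.90, 0.90),  # fails on confidence
]
first_run = review(db, batch)
second_run = review(db, batch)  # nothing new, so nothing is re-tested
print(f"first run kept {len(first_run)} of {len(batch)}; second run kept {len(second_run)}")
```

An acceptance rate near 100%, like the 101-of-101 result mentioned earlier, would show up here as `len(first_run) == len(batch)`, which is the signal that the cutoffs are too loose.
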
Digital twins, virtual representations of individual patients, are being used to simulate responses to treatments before you actually have to administer the treatment. The FDA is already accepting computational modeling as supporting evidence for device approvals. Think about that. The pattern keeps repeating: N-of-1 research, continuous learning, cost democratization. What cost $10,000 in 2020 costs 35 cents in 2026. But your broker is still manually pulling summary of benefits documents.

The Anthropic research found that among the longest-running autonomous sessions, AI capability more than doubled in just three months. Their assessment estimates Claude Opus 4.5 can complete tasks with 50% success that would take a human nearly five hours. Five hours of human work, 50% completion rate, from an AI that keeps improving. The question isn't whether autonomous research applies to benefits. It's whether your current consultant is even trying.

If you're an employer managing healthcare spend, this changes the calculation in three ways. First, you should be asking your broker what their analytical process actually looks like. Not "what reports do you provide," but "how many hypotheses did you test before recommending the plan design? What data sources are you cross-referencing?" If the answer is "we compared three carrier quotes and ran the numbers in a spreadsheet," you're paying expert rates for commodity work.

Second, the standard for due diligence is about to shift dramatically. Anthropic recommends that effective oversight of agents will require new forms of post-deployment monitoring infrastructure. I'm going to argue the same logic applies to vendor oversight. If an AI can test 100 benefit design hypotheses overnight, the broker who tested three is no longer competitive.

Third, consider what autonomous research means for your specific data. You have claims history, demographic trends, employee survey results, turnover patterns.
Each of these is a hypothesis waiting to be tested. "What if we shifted the mental health benefit structure?" isn't a hunch anymore. It's a testable question with your own data.

Okay, the practical ask is to find out whether your consultant is using any form of autonomous or AI-assisted analysis. If they're not, you need to ask them why. And if their answer is "we're looking at AI and automation to drive efficiency," meaning data extraction, like we talked about before, and not research, you know where the bar actually sits.

The benefits consulting industry is about to have its spreadsheet moment. In the 1980s, Lotus 1-2-3 (remember that?) democratized financial analysis. Anyone with a PC could do calculations that previously required a team. The firms that survived weren't the ones who could add numbers the fastest. They were the ones who could interpret the results and make recommendations. We're at the same inflection point for research. Karpathy's AutoResearch isn't just a clever tool. It's a demonstration that hypothesis testing is now a commodity: 630 lines of code, 35 cents an hour, hundreds of experiments while you sleep. The consultants who survive won't be the ones who can pull carrier quotes the fastest. They'll be the ones who can design the hypotheses worth testing and interpret the results in context.

When was the last time your broker ran 100 tests on your benefits design? When was the last time they ran 10? The employers who understand this will be asking harder questions at renewal. Everyone else will keep paying for 20 hours of analysis on a budget that determines whether their employees can afford to see a doctor. The tools exist, the data exists, the pattern is proven. The only question is whether you'll be the one using it or the one disrupted by someone who does.

If you like this content, there's more on my Substack, and until next time, I'll see you then.