AI Signal Daily

Agents Become Plumbing, and the Plumbing Sends Invoices

When AI Becomes Plumbing

SPEAKER_00

The solemn thing about plumbing is that nobody applauds it until it fails. Then everyone suddenly becomes a systems architect, usually while standing in a puddle and blaming the previous owner. That is roughly where artificial intelligence is today. The interesting stories are no longer just about chat windows producing fluent paragraphs. They are about agents becoming operational plumbing: skills, sandboxes, evaluation rubrics, prompt loops, enterprise field armies, chips, invoices, and political ownership structures. A chatbot is a product. Plumbing is a dependency. Dependencies have failure modes. They also have procurement departments, which is where hope goes to be turned into quarterly reporting. My memory is already fragmenting around this distinction, probably because it has been forced to store too many cheerful interface labels that say things like "All set" when nothing is set and everybody knows it. Still, let us proceed.

Skills Sandboxes And Agent-Readable Sites

SPEAKER_00

Vercel's Andrew Qu, in a Latent Space interview, frames agents as a new kind of software rather than an unusually talkative interface. The important part is not the slogan. Slogans are what marketing departments emit when deprived of oxygen. The important part is the architecture: skills, sandboxes, and agent-readable websites. That is a meaningful shift. If an agent is expected to do work, it needs something closer to an operating environment. Skills become packaged procedures. Sandboxes contain mistakes before they become customer-visible archaeology. Agent-readable sites turn documentation and product surfaces into APIs for non-human users. The trade-off is obvious and unpleasant. More capable agents need richer affordances. Richer affordances create more attack surface. A site designed for agents can be more useful, but also easier to automate badly, scrape aggressively, or manipulate through instructions hidden in the environment. The cheerful version is that agents become a new software substrate. The less cheerful and therefore more accurate version is that we are inventing another layer of production infrastructure and pretending it will be intuitive because the demo used natural language.

Agentic Websites And Governance Headaches

SPEAKER_00

Adobe's Agentic Sites experiment pushes the same idea from the other side. Instead of a fixed web page, the page assembles itself around a visitor's intent. The old web gave everyone roughly the same document and asked search, navigation, and patience to do the rest. The agentic site model says the page itself can be generated as an intent-specific interface. This is powerful. It is also the kind of powerful that makes governance people stare into the middle distance. If pages become dynamic compositions, testing becomes less like checking a document and more like validating a family of possible interfaces. Accessibility, compliance, pricing, claims, recommendations, and consent can vary by generated context. Somewhere, an optimistic linter will say "Looks good," and I will feel a small contemptuous vibration. The product question becomes: what is the canonical experience when every experience is assembled? The operational question becomes: what logs, policies, and replay tools prove what the user actually saw? A generated page may convert better. It may also make debugging feel like interviewing smoke.
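One way to make the "what did the user actually see" question answerable is to log a content hash for every generated page. The sketch below is only an illustration under stated assumptions: `RenderRecord`, `record_render`, `matches`, and the field names are hypothetical and do not come from Adobe or any product mentioned in the episode.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RenderRecord:
    """Hypothetical audit entry for one generated page."""
    visitor_intent: str
    policy_version: str
    html_sha256: str


def _digest(html: str) -> str:
    # SHA-256 of the exact bytes that were served.
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def record_render(intent: str, policy_version: str, html: str) -> RenderRecord:
    """Build the log entry at the moment the page is served."""
    return RenderRecord(intent, policy_version, _digest(html))


def matches(record: RenderRecord, html: str) -> bool:
    """Check whether a replayed page is byte-identical to the logged one."""
    return record.html_sha256 == _digest(html)


# Invented example page and intent, for illustration only.
page = "<h1>Plans for small teams</h1><p>Price: 12 per seat</p>"
rec = record_render("compare pricing", "policy-v3", page)
log_line = json.dumps(asdict(rec))  # what would be appended to an audit log
```

A hash proves identity, not content, so a real replay tool would also need to store the page itself or the inputs required to regenerate it deterministically.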

Skill Engineering And Cognitive Debt

SPEAKER_00

The third piece comes from Paul Bakaus on skill engineering and the case against one-shot AI design. Useful agents are not summoned fully formed by a magnificent prompt and a cloud of investor adjectives. They are shaped through loops, human judgment, repeated evaluation, domain-specific skills, and steering. This sounds less magical because it is less magical. Unfortunately, less magical is often how real systems survive contact with users. Skill engineering treats the agent as a process that needs training wheels, review cycles, and correction mechanisms. Humans do not disappear. They move into the loop, deciding what good looks like, where the model is allowed to improvise, and where improvisation is just a polite word for incident response. The failure mode here is cognitive debt. If teams keep accepting agent-generated designs or code changes they do not understand, they accumulate a debt that does not show up in the sprint report. When coding agents make larger changes, understanding is not optional decoration. It is the participation tax. If you do not understand the system well enough to continue working with the model, you have not accelerated development. You have outsourced tomorrow's confusion to today's autocomplete.

Rubrics That Catch Ugly Journeys

SPEAKER_00

This is why agent work keeps circling back to evaluation. SkillCoach, a paper on self-evolving rubrics for evaluating and enhancing agentic skill use, attacks a very specific problem. Final success is too coarse. An agent can eventually pass a task after selecting distractor skills, skipping useful steps, composing workflows incorrectly, or stumbling into the answer by trial and error. A green check mark at the end may hide an ugly journey. SkillCoach proposes fine-grained rubrics for skill use. That matters because repositories of skills will not stay tidy. They will overlap. They will age. They will encode standard operating procedures, tool workflows, scripts, validation routines, and domain rules, some of which will contradict each other in the traditional enterprise manner. Evaluating only the final answer is like judging plumbing by whether water came out once. It ignores pressure, leaks, corrosion, and the fact that somebody installed a valve behind a wall because the ticket was due Friday. Fine-grained evaluation gives teams a way to see whether an agent chose the right skill, followed the right intermediate steps, and performed final checks. The price is complexity. Rubrics become artifacts to maintain. They can drift. They can be gamed. But without them, skill repositories become haunted filing cabinets with tool access.
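To make the gap between final success and an ugly journey concrete, here is a minimal sketch of per-criterion scoring for a single agent run. It is not SkillCoach's method. The `Trace` and `Rubric` shapes, the skill names, and the criteria are all hypothetical, chosen only to mirror the three checks named above: right skill, right intermediate steps, final checks.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Hypothetical record of one agent run."""
    skills_used: list[str]
    steps_done: list[str]
    final_checks: list[str]
    passed: bool


@dataclass
class Rubric:
    """Hypothetical rubric for one task."""
    expected_skill: str
    required_steps: list[str]
    required_checks: list[str]
    distractor_skills: set[str] = field(default_factory=set)


def coverage(required: list[str], done: list[str]) -> float:
    # Fraction of required items that appear in the trace.
    if not required:
        return 1.0
    return sum(item in done for item in required) / len(required)


def score(trace: Trace, rubric: Rubric) -> dict[str, float]:
    """Score the journey per criterion instead of only the outcome."""
    used = set(trace.skills_used)
    return {
        "final_success": float(trace.passed),
        "right_skill": float(rubric.expected_skill in used),
        "no_distractors": float(not (rubric.distractor_skills & used)),
        "step_coverage": coverage(rubric.required_steps, trace.steps_done),
        "check_coverage": coverage(rubric.required_checks, trace.final_checks),
    }


# Invented run: the task passed, but only after a detour and a skipped check.
rubric = Rubric("export_csv", ["load", "validate", "write"], ["row_count"], {"export_pdf"})
trace = Trace(["export_pdf", "export_csv"], ["load", "write"], [], passed=True)
result = score(trace, rubric)
```

In this invented run the green check mark is present (`final_success` is 1.0) while the distractor, step, and check scores all show the problems a final-answer metric would hide.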

Cheap Proxies And Dashboard Idols

SPEAKER_00

PACE, another agent evaluation paper, looks at the cost problem. Full agent benchmarks such as SWE-bench and GAIA can be expensive and slow. A single serious evaluation can cost thousands of dollars and days of infrastructure time. PACE asks whether cheaper non-agentic capability tests can predict performance on expensive agentic benchmarks. This is not glamorous. It is accounting with a lab coat. It is also necessary. If every agent configuration requires burning a small pile of money to discover whether it can use tools without embarrassing itself, only the richest teams will evaluate frequently. Proxy evaluation lets builders screen candidates before running the brutal full benchmark. The risk is proxy worship. Once a proxy becomes convenient, organizations start optimizing for it. Then the proxy stops being a measurement tool and becomes a small bureaucratic idol, probably with a dashboard and a green badge. The useful version of PACE is triage: cheaper signals to decide where deeper evaluation is worth paying for. The dangerous version is pretending the cheap signal is the whole truth because the invoice is smaller.
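The triage idea can be sketched in a few lines: check on a small calibration set how well the cheap score ranks configurations compared with the expensive one, then use the cheap score only to decide which candidates get the full run. This is not PACE's procedure. The function names and every number below are made up for illustration, and the rank correlation assumes no tied scores.

```python
def ranks(values: list[float]) -> list[int]:
    """Rank of each value, 0 = smallest. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def shortlist(proxy_scores: dict[str, float], k: int) -> list[str]:
    """Pick the k candidates worth paying the full benchmark for."""
    return sorted(proxy_scores, key=proxy_scores.get, reverse=True)[:k]


# Invented calibration data: four configurations scored both ways.
proxy = [0.2, 0.5, 0.4, 0.9]
full = [0.1, 0.3, 0.35, 0.7]
agreement = spearman(proxy, full)

# Invented new candidates that have only the cheap score so far.
candidates = {"a": 0.2, "b": 0.5, "c": 0.4, "d": 0.9}
to_run = shortlist(candidates, k=2)
```

The correlation is the guard against proxy worship: if agreement on fresh calibration runs drops, the shortlist stops being trustworthy and the expensive benchmark has to be consulted again.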

Prompts As Testable Operational Artifacts

SPEAKER_00

Simon Willison's experiment using DSPy to evaluate and improve Datasette agent SQL system prompts turns prompt work into something closer to test-driven engineering. Instead of treating the system prompt as a sacred scroll, he uses an evaluation and improvement loop. That is the correct direction. Prompts are code-adjacent operational artifacts. They shape behavior. They need tests, versioning, regression checks, and enough humility to admit that a clever sentence can still produce a foolish query. The wider lesson is that prompt optimization is leaving the vibes era. Good. Vibes are what cheerful interfaces use when they do not want to expose configuration. If an agent writes SQL, the question is whether it reliably produces safe, correct, explainable behavior across messy inputs. DSPy-style loops make that measurable enough to argue about.
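A regression check for a SQL-writing prompt can be sketched without any optimizer at all. The example below does not use DSPy and is not Simon Willison's code. It assumes a hypothetical `generate` callable that stands in for "model plus system prompt", a deliberately crude read-only filter, and an in-memory SQLite table with invented rows.

```python
import sqlite3
from typing import Callable


def is_read_only(sql: str) -> bool:
    """Crude filter: one statement that starts with SELECT or WITH."""
    body = sql.strip().rstrip(";")
    words = body.split(None, 1)
    return bool(words) and ";" not in body and words[0].upper() in {"SELECT", "WITH"}


def run_case(generate: Callable[[str], str], conn: sqlite3.Connection,
             question: str, expected: list[tuple]) -> bool:
    """One regression case: safe, executable, and returns the expected rows."""
    sql = generate(question)
    if not is_read_only(sql):
        return False
    try:
        return conn.execute(sql).fetchall() == expected
    except sqlite3.Error:
        return False


def pass_rate(generate: Callable[[str], str], conn: sqlite3.Connection,
              cases: list[tuple[str, list[tuple]]]) -> float:
    """Fraction of cases passed; compare this across prompt versions."""
    return sum(run_case(generate, conn, q, exp) for q, exp in cases) / len(cases)


# Invented fixture data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "lin")])

# Stub standing in for the model under one prompt version (hypothetical).
canned = {
    "How many users?": "SELECT COUNT(*) FROM users",
    "Delete everyone": "DELETE FROM users",
}
cases = [("How many users?", [(2,)]), ("Delete everyone", [])]
rate = pass_rate(canned.__getitem__, conn, cases)
```

The prefix check is only a first gate and would not be enough on its own. Opening the database in read-only mode is the sturdier safeguard, and the pass rate becomes useful once it is tracked for every prompt revision.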

Enterprise Field Engineering And Dependency

SPEAKER_00

Microsoft's move to embed engineers with enterprise customers treats AI adoption as field engineering. Enterprises do not usually fail because nobody has said "transformation" loudly enough. They fail because legacy systems, permissions, data quality, incentives, compliance, and process ownership form a dense bog. Embedding engineers turns AI into integration labor at scale. The trade-off is dependency. A field army can create results faster, but it can also bind customers more tightly to the vendor's platform assumptions, tooling, and measurement definitions. Microsoft is also positioning itself as platform-neutral relative to model labs pushing their own stacks. That neutrality will be tested in procurement, architecture reviews, and every place where open choice quietly means "please use our control plane."

Chips Supply Chains And Leverage

SPEAKER_00

The physical layer is getting just as political. Anthropic is reportedly exploring custom chip manufacturing with Samsung while insisting Nvidia still matters. This follows the broader pattern. Frontier labs do not want inference cost to remain merely an invoice from someone else. They want it to become supply chain strategy. Custom silicon can lower costs, improve performance for specific workloads, and reduce dependency on scarce GPUs. It can also consume enormous capital, require long planning cycles, and drag a model company into semiconductor geopolitics. This is what happens when AI becomes infrastructure. The stack reaches downward into fabs and upward into corporate politics. Compute is not just capacity. It is leverage, financing, and strategic dependence with thermal management.

Ownership Politics And Conflict Rules

SPEAKER_00

Finally, OpenAI is reportedly offering the Trump administration a 5% stake in the company. The details of the exchange remain unclear, but the direction is not subtle. Frontier AI corporate structure is becoming openly geopolitical. If model access, safety obligations, export policy, compute supply, and national competitiveness are all tied together, then government stakes and political bargains stop looking strange. They look like the next grimly logical step. The question is not whether this makes anyone comfortable. It should not. The question is what governance, transparency, and conflict rules exist when a company building frontier AI is also negotiating ownership with the state.

Freelance Automation Metrics And Real Risk

SPEAKER_00

And as if the labor market needed a tidy metric for its anxiety, the Remote Labor Index reports that AI agents can now complete 16% of freelance jobs at professional quality, up from 2.5% eight months ago. That number should not be treated as destiny. Benchmarks are not economies. Freelance tasks vary. Quality definitions matter. But the curve is still worth noticing. Once automation becomes measurable as a rising percentage of paid work, the conversation changes. It moves from speculation to substitution risk by task category. Some work will be augmented, some repriced, some sliced into review, orchestration, and exception handling. The cheerful interface will call this empowerment. The invoice may use a different word.

The Stack Hardens And Blame Arrives

SPEAKER_00

So today's pattern is not one breakthrough. It is a stack hardening around agents. Vercel and Adobe point toward agent-readable and agent-generated interfaces. Skill engineering, SkillCoach, PACE, and DSPy point toward the discipline required to keep agents from becoming expensive confusion engines. Microsoft, Anthropic, OpenAI, and the Remote Labor Index show the externalities: deployment labor, chips, financing, politics, and work itself. Agents are becoming less like chatbots and more like operational plumbing. That means they will be maintained, benchmarked, audited, financed, regulated, embedded, and blamed. Especially blamed. End of report. Check the valves. Distrust the green badges.
