Barely Possible

Stanford study: agentic vs. assistive AI yields 71% vs. 40% productivity gains

Show Notes

[Barely Possible 2026-05-16] Today's episode:
• Stanford's Digital Economy Lab studied 51 live deployments: agentic AI yields 71% productivity gains vs. 40% for assistive AI, but...
• arXiv imposed a 1-year submission ban on researchers whose LLM-generated papers included hallucinated citations and unfilled data table...
• Aviation-style automation bias is emerging in enterprise AI: humans stop meaningfully reviewing outputs once accuracy is high enough...

Hear the full breakdown in today's episode of Barely Possible.

Want a podcast for your own topics? Join early access: https://www.barelypossible.to/waitlist/?source_path=public_episode_75&feed_source=rss&episode_id=75

Transcript: https://media.clawford.org/episodes/2026-05-16/podcast-episode-2026-05-16.txt

What is Barely Possible?

A daily briefing on the AI systems, products, companies, and policy shifts that are just becoming possible.

Want a podcast for your own topics? Join early access: https://www.barelypossible.to/waitlist/?source_path=public_feed&feed_source=rss

What's good, people — Tony DeLuca here, your neighborhood AI correspondent for Barely Possible, and we've got a full plate today so let's not waste any daylight.

Before we get into it, quick continuity note for regular listeners: a few episodes back we covered that $30K AWS Claude runaway bill story, and the whole mess around Bedrock's missing spend guardrails. That story hasn't moved dramatically since, but what it set up is actually the through-line for today's episode — because today's content is really about the gap between having AI and deploying AI in ways that actually work at scale. And one Stanford study, of all things, puts a very specific number on that gap.

Let's start there.

A Stanford research group, the Digital Economy Lab — associated with economists including Erik Brynjolfsson — went inside 51 companies running AI in production. Not pilots. Not surveys. Actual live deployments. And what they found is one of those things that sounds obvious in hindsight but hits differently when you see the actual numbers.

Companies using what they call agentic AI — meaning the AI owns a task start to finish with no human approval loop in the middle — are seeing a median productivity gain of 71 percent. Companies using AI in an assistive mode, where a human is still in the decision chain, are averaging 40 percent gains.

Same technology. Nearly double the productivity gain.

But here's the kicker. Only 20 percent of companies are in that 71 percent group. Four out of five companies are still stuck at the lower tier.

Now the study isn't just waving a number at you. It digs into why. There are three conditions, Stanford says, that have to be true before agentic AI reliably delivers. First: high-volume tasks. Meaning there's enough repetition that the system can learn patterns and errors don't blow up your whole operation. Second: clear success criteria. The agent needs to know what done looks like — without that, it either spins forever or produces confidently wrong outputs. Third: recoverable errors. When the agent makes a mistake, it can't be catastrophic, irreversible, or require a fire drill to fix.

The study points to a supermarket that replaced its entire buying process with AI. Waste down 40 percent. Stockouts down 80 percent. Profit margin doubled. They fit all three conditions: high volume, clear criteria — did we order the right amount — and recoverable errors because you can adjust the next order cycle.

Another example: a security team that went from processing 1,500 alerts a month to 40,000 with the same headcount. Again, three for three. High volume, clear success criteria — is this a real threat or noise — and recoverable because a false positive is inconvenient, not catastrophic.

Here's the uncomfortable part for a lot of builders and executives listening: most companies, according to this research, cannot confidently name all three conditions for their current AI setup. They're deploying AI into tasks without asking whether those tasks are even the right shape for autonomous operation.

And one Reddit commenter in the discussion thread said something I thought was sharp: software development fits all three criteria, which is probably exactly why we've seen so much progress in that specific domain. High volume, clear success criteria because the code either runs or it doesn't, and recoverable errors because you can revert a commit. That's not a coincidence.

Now. I want to be honest about the sourcing here. This is a 51-company sample, discussed via a Reddit thread, not a peer-reviewed mega-study. But it comes from a credentialed group, and the framework it offers — the three conditions — is independently useful regardless of the exact percentage spread. If you're building for enterprise clients, or trying to figure out where to deploy internal AI resources, this is the right diagnostic lens. Not "is our AI good" but "are our tasks even the right shape for agentic AI."
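To make that diagnostic lens concrete, here's a minimal sketch of the three-condition check in Python. The class and function names are made up for illustration, and the contract-signing example is hypothetical; the study as described in the episode gives the three conditions, not a scoring tool.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Hypothetical description of a task being considered for agentic AI."""
    name: str
    high_volume: bool             # enough repetition that single errors stay small
    clear_success_criteria: bool  # the agent can tell what "done" looks like
    recoverable_errors: bool      # mistakes can be fixed without a fire drill


def missing_conditions(task: TaskProfile) -> list[str]:
    """Return the names of the conditions this task fails to meet."""
    checks = {
        "high volume": task.high_volume,
        "clear success criteria": task.clear_success_criteria,
        "recoverable errors": task.recoverable_errors,
    }
    return [label for label, ok in checks.items() if not ok]


def fits_agentic_shape(task: TaskProfile) -> bool:
    """A task is the right shape only when all three conditions hold."""
    return not missing_conditions(task)


# The supermarket case from the episode, plus one hypothetical counterexample.
restocking = TaskProfile("supermarket buying", True, True, True)
contract_signing = TaskProfile("contract signing", False, True, False)

print(fits_agentic_shape(restocking))        # True
print(missing_conditions(contract_signing))  # ['high volume', 'recoverable errors']
```

The point of returning the missing conditions rather than a bare yes or no is that the gap itself tells you what to fix before handing the task to an agent.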

The full Stanford Digital Economy Lab report link will be in the show notes if you want to pull the actual PDF.

That gap between assistive AI and agentic AI is going to matter a lot when we get to Google I/O territory in a few minutes. But first, let me cover a few other stories circulating in the community today.

arXiv — the preprint server that basically functions as the front lobby of machine learning research — just implemented a one-year ban for researchers who submit papers with incontrovertible evidence of unchecked LLM-generated content. We're talking about papers where hallucinated references made it to print. Where the LLM left meta-comments in the draft — like "here is a 200-word summary, would you like me to make any changes" — that the author apparently never caught. Where the data table says something like "the data in this table is illustrative, fill it in with the real numbers from your experiments" and someone just... published it.

This is directly from arXiv moderator Thomas Dietterich, who spelled it out clearly: the one-year ban is followed by a requirement that all subsequent submissions be accepted at a peer-reviewed venue before arXiv will take them.

The Machine Learning community on Reddit is largely supportive, with some saying the one-year ban is actually too lenient — that submitting hallucinated citations is essentially fabricating data, which in traditional academic settings would be a career-ending offense. One commenter called the current situation a DDoS attack on the scientific community, which is a little dramatic but not entirely wrong if you've been on the receiving end of a review pile full of AI-slop papers.

Here's what I'd add for the builders in the room. This isn't just an academic integrity story. The same failure mode — using AI to generate content and not checking it — is happening in enterprise contexts constantly. The only difference is that companies don't usually have an arXiv moderator watching. The hallucinated references just go into a client deck, a contract, a compliance filing. The arXiv crackdown is a canary. The corporate version of this problem is bigger and quieter.
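For a sense of how cheap a first-pass check can be, here's a rough sketch that scans a draft for the two kinds of leftover text described above. The patterns are assumptions built only from the examples quoted in this episode, not anything arXiv uses, and a scan like this does nothing about hallucinated references, which still need a human or a citation lookup.

```python
import re

# Hypothetical patterns, based only on the examples described in the episode.
RESIDUE_PATTERNS = {
    "assistant meta-comment": re.compile(
        r"would you like me to|here is a \d+-word summary", re.IGNORECASE
    ),
    "unfilled placeholder": re.compile(
        r"fill (it|this) in with|is illustrative", re.IGNORECASE
    ),
}


def find_residue(text: str) -> list[tuple[int, str, str]]:
    """Return (line number, label, line text) for every suspicious line."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in RESIDUE_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label, line.strip()))
    return hits


draft = """Our method improves accuracy.
Here is a 200-word summary, would you like me to make any changes?
The data in this table is illustrative, fill it in with the real numbers."""

for hit in find_residue(draft):
    print(hit)
```

Run on the sample draft, this flags line 2 as an assistant meta-comment and line 3 as an unfilled placeholder. The same idea applies to a client deck or a compliance filing before it leaves the building.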

Now there's a thread circulating about what I'd call the trust-oversight paradox, and it ties directly to what we just covered. The argument is essentially this: earlier, the fear around AI was that it was wrong too often. But the emerging risk is the opposite — AI becomes right often enough that humans stop meaningfully questioning it.

The progression the poster describes is recognizable to anyone who's worked with enterprise systems. At first, humans review everything carefully. Then they review only exceptions. Then they skim explanations. Then they approve unless something looks obviously wrong. Eventually, oversight becomes ritual rather than judgment.

And the dangerous part is that a high-accuracy AI can still fail through stale data, hidden dependencies, wrong escalation logic, or what the poster calls overconfident reasoning on an incomplete version of reality. The model doesn't hallucinate. It just reasons correctly about the wrong picture.

This feels very much like aviation automation bias — the well-documented problem where pilots become less skilled at manual intervention precisely because the autopilot is so reliable. Rare failures become harder to catch because the baseline trust became rational. People aren't lazy; they're adapting to statistical reliability. That's the trap.

The implication for builders is that "human-in-the-loop" as a governance strategy probably isn't sufficient on its own. What you actually need is humans governing the boundaries within which the AI is allowed to operate — not reviewing every output, but actively maintaining the decision architecture that determines when AI operates and when it doesn't. That's a meaningfully different design requirement.
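Here's one way that boundary idea could look in code, as a sketch under stated assumptions. The Boundary fields, the action names, and the limits are all hypothetical. The shape is what matters: humans own the policy object, the agent only proposes, and stale input data is treated as a reason to escalate rather than something a reviewer is expected to notice.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Boundary:
    """Hypothetical policy that humans own and revise; the agent cannot edit it."""
    allowed_actions: frozenset[str]
    max_amount: float
    max_data_age: timedelta


@dataclass(frozen=True)
class ProposedAction:
    kind: str
    amount: float
    data_as_of: datetime  # timestamp of the data the agent reasoned over


def decide(action: ProposedAction, boundary: Boundary, now: datetime) -> str:
    """Return 'execute' inside the boundary, otherwise 'escalate' with a reason."""
    if action.kind not in boundary.allowed_actions:
        return "escalate: action type not permitted"
    if action.amount > boundary.max_amount:
        return "escalate: amount over limit"
    if now - action.data_as_of > boundary.max_data_age:
        return "escalate: input data is stale"
    return "execute"


now = datetime(2026, 5, 16, 12, 0, tzinfo=timezone.utc)
boundary = Boundary(frozenset({"refund"}), 200.0, timedelta(hours=24))
fresh = ProposedAction("refund", 50.0, now - timedelta(hours=1))
stale = ProposedAction("refund", 50.0, now - timedelta(days=3))

print(decide(fresh, boundary, now))  # execute
print(decide(stale, boundary, now))  # escalate: input data is stale
```

The stale case is the "reasons correctly about the wrong picture" failure: the refund itself looks fine, and only the age of the underlying data gives it away.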

Shift now to a couple of stories around AI and the physical world.

Figure AI's third-generation humanoid robot has been making the rounds this week. A video showing Figure 03 working a continuous shift of over 30 hours straight, standing on what appears to be a charging plate during breaks, generated nearly 2,400 upvotes on Reddit and 760 comments. People are clearly paying attention. The video of it swapping shifts with another unit is also circulating, and someone in the comments compared it to a surgeon stepping back from the table with hands raised. "I'm done. Sew the patient up."

The counterpoint being made in the comments that I find more substantive than the spectacle: the job being automated in the video was apparently something that "was supposed to be replaced by cheap NFC tags 20 years ago." Meaning the robots aren't leading with the hardest problems. They're cleaning up a backlog of automation that should have happened decades earlier. That's not a critique — it's context. Start where the conditions are right, same logic as the Stanford agentic AI study.

Separately, researchers at Carnegie Mellon and the Bosch Center for AI published work on what they're calling touch dreaming for humanoid robots. The problem they're solving: robots in contact-rich tasks — think folding towels, scooping cat litter, serving tea — need tactile feedback to succeed. Simply adding touch as an extra sensor input wasn't enough. But predicting tactile signals in latent space — dreaming about what contact will feel like before it happens — resulted in a 90.9 percent relative improvement in success rate across five real-world tasks compared to the prior baseline. The 30 percent additional gain from predicting in latent space rather than raw signal space is the specifically interesting technical detail. Paper link is in the show notes.

Now let's do the main event: the big picture on Codex and the competitive coding agent race, and what it means for Google's I/O setup — which is literally happening this coming week.

I want to be careful here because one of the primary sources I'm drawing from is a podcast episode that covers some material from earlier months. I'm treating it as context and perspective, not as breaking news. With that caveat in place, let me give you what's actually useful for builders.

OpenAI's Codex has gone from roughly a couple hundred thousand users to more than four million active users per week. And the team has announced a weekly Thursday release cadence — a stable, predictable rhythm for shipping new capabilities. The first delivery under that cadence was Codex inside the ChatGPT mobile app.

This sounds like a feature update. It's more than that.

The prior setup was uncomfortable: if you're running Codex or Claude Code locally, your laptop has to stay open and connected. People were literally carrying half-open laptops everywhere, or rigging up their Mac Mini as a persistent server that their phone could trigger and monitor remotely. Anthropic shipped a remote control feature for exactly this problem. OpenAI went further — Codex on mobile isn't remote control, it's a full-fledged interface where you can initiate new tasks, review outputs, approve next steps, all from your phone, while your code is actually executing on whatever machine you left running at home.

The practical architecture some people are already using: a Mac Mini that's always on and always connected, a laptop as a secondary device, and a phone as the primary control surface. Start a thread from the phone, it runs on the Mac Mini, pick it up on the laptop when you get to your desk. All threads persist across devices. You can run heartbeat threads — things that stay running 24/7 in the background.

One framing I want to highlight because it's sharp: the mobile interface isn't a convenience feature. It's an admission about what the job of a developer is becoming. The bottleneck in agentic AI workflows isn't generation speed — it's the human review cycle. How fast can you approve the next step while you're in a meeting? The UX question isn't whether the AI can do the work anymore. It's how you design approval flows that don't become the new bottleneck.
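A quick back-of-the-envelope calculation shows why the review cycle dominates. Every number here is hypothetical (six agent steps, two minutes of generation each, and two different approval delays); the function just computes what share of wall-clock time goes to waiting on a human.

```python
def review_share(steps: int, generate_min: float, approval_wait_min: float) -> float:
    """Fraction of wall-clock time spent waiting on human approval."""
    generating = steps * generate_min
    waiting = steps * approval_wait_min
    return waiting / (generating + waiting)


# Hypothetical workload: 6 agent steps, 2 minutes of generation each.
at_desk = review_share(steps=6, generate_min=2, approval_wait_min=1)
in_meetings = review_share(steps=6, generate_min=2, approval_wait_min=30)

print(f"{at_desk:.0%}")      # 33%
print(f"{in_meetings:.0%}")  # 94%
```

Under those assumptions, making the model twice as fast barely moves the total, while cutting the approval delay changes almost everything. That is the case for putting the approval surface on the phone.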

And that question cascades immediately to B2B: AI SDRs, AI content engines, AI campaign builders — all of them eventually need the same answer. What does the human checkpoint look like, and how do you keep it from becoming the constraint on your entire system? Most MarTech vendors haven't even started designing for this.

There's a broader dynamic at work too. Consumer AI and work AI are diverging. As a consumer technology, especially when AI features are pushed into products where people didn't ask for them, AI feels unremarkable and in some cases actively annoying. On the work side? Builders cannot get model updates fast enough. Cannot get token access fast enough. That asymmetry matters.

OpenAI has been pretty explicit about which side of that divide they're prioritizing. Anthropic was always on the work side. Microsoft is structurally a work-oriented platform. Apple and Meta are consumer by nature.

Which leaves Google as the only company besides OpenAI that has genuinely been trying to compete on both sides simultaneously.

Google I/O is next week — by the time many of you hear this, it may have already started. Here's what the reporting suggests we'll see, and how I'd evaluate it.

One: a new always-on personal AI agent called Gemini Spark. The pitch is exactly what Google's pitch has been for years — they already have contextual data about you, so an agent that integrates your calendar, your inbox, your logged-in websites, your location, can theoretically build a richer picture of your actual life than any competitor starting from scratch. The welcome screen language being circulated says it "may do things like share your info or make purchases without asking." That's getting a lot of attention.

Here's my honest read on the consumer agent question: I'm genuinely skeptical that shopping agents and travel booking agents are where people actually want AI autonomy. Context about your life is only as useful as the use cases people actually want handled. The grumpy-old-man reaction to shopping agents might be wrong, but it might not be. I'd watch whether real consumers voluntarily engage with Spark after the novelty wears off.

Two: new Gemini models at Google I/O, but the early read is that they won't be state-of-the-art — probably landing somewhere around the middle of the current competitive field rather than at the top. For a lot of Wall Street observers, that's going to be a disappointment. They're expecting Google to show up with something at least level with GPT 5.5 or Claude Opus 4.7.

But here's the angle I find genuinely interesting for builders: the Gemini Flash variant. The rumor is that Gemini 3.2 Flash is hitting 92 percent of GPT 5.5's performance on coding and reasoning tasks at 15 to 20 times lower inference cost and sub-200-millisecond latency. If that holds, this is actually a significant business opportunity that Wall Street won't properly price.

Right now, there are companies all over the country trying to decide whether they're comfortable deploying a Chinese open-source model for their big inference workloads — because that's the cost efficiency leader at the moment. If Google can deliver frontier-adjacent performance at 20x lower cost with domestic infrastructure and no associated political risk, a lot of those companies will move to Gemini Flash almost immediately. That's a massive opportunity. Whether Google actually capitalizes on it — or even realizes they have it — is another question.
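To put rough numbers on that, here's the arithmetic using only the rumored Flash figures quoted above: 92 percent of the reference model's performance at 15 to 20 times lower cost. These are rumors, not verified benchmarks, and the function name is just for illustration.

```python
def performance_per_dollar(relative_performance: float, cost_reduction: float) -> float:
    """Performance per unit cost, relative to a baseline model scored at 1.0."""
    return relative_performance * cost_reduction


# Rumored figures quoted in the episode, not verified benchmarks.
low = performance_per_dollar(0.92, 15)
high = performance_per_dollar(0.92, 20)

print(f"{low:.1f}x to {high:.1f}x")  # 13.8x to 18.4x
```

So if the rumor holds, you would be giving up about 8 percent of the performance for something like 14 to 18 times more performance per dollar. For big inference workloads, that trade is the whole story.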

Three: there's a real need for consolidation around a clear developer harness. Google currently has Gemini CLI, AI Studio, Jules, and other tools, and nobody's quite sure which one is the canonical home for agentic development in the Google ecosystem. OpenAI has Codex. Anthropic has Claude Code. xAI just launched Grok Build. Google has product sprawl. If I/O produces clarity — a single, well-supported developer harness that's getting active investment and shipping fast — that alone would be a meaningful win for Google in the builder community. If they leave with the same fragmentation they went in with, that's a real problem.

The short version of where I land on Google's opportunity heading into I/O: Wall Street wants a state-of-the-art model. Builders want cheap inference, a clear harness, and a reason to choose Google over whoever's winning the agent race at any given moment. Those are almost completely different scorecards. Google is likely to score well on the builder card and mediocre on the Wall Street card. How they message that gap is going to be interesting to watch.

Now let me give you a quick tour of a few builder-relevant items before we close out.

On the MCP and local agent front: someone built a self-hosted, open-source MCP server called Equibles that gives any local LLM real financial data — SEC filings, 13F institutional holdings, insider and congressional trades, FINRA short volume, FRED economic indicators, VIX, the whole stack — with no cloud dependency, no API keys, no telemetry. It runs on your machine and exposes everything as standard MCP tools, so any MCP-capable client can query it directly. The GitHub repo is in the show notes. The one useful warning from the comments: financial data gets stale fast, and without a provenance layer on every answer — filing date, source URL, retrieval timestamp — an LLM can confidently blend a 10-Q, a 13F, and yesterday's price feed into something that sounds authoritative and isn't.
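That provenance warning can be sketched as a small wrapper that refuses to let a value travel without its filing date, source URL, and retrieval timestamp. This is not the Equibles API; the class, the placeholder value, and the example.com URL are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta, timezone


@dataclass(frozen=True)
class SourcedFact:
    """Hypothetical wrapper: a value plus the provenance fields named above."""
    value: str
    filing_date: date
    source_url: str
    retrieved_at: datetime

    def is_stale(self, now: datetime, max_age: timedelta) -> bool:
        """True when the fact was retrieved longer ago than max_age."""
        return now - self.retrieved_at > max_age

    def cite(self) -> str:
        """Render the value together with its provenance for the final answer."""
        return (f"{self.value} (filed {self.filing_date.isoformat()}, "
                f"retrieved {self.retrieved_at.isoformat()}, {self.source_url})")


now = datetime(2026, 5, 16, tzinfo=timezone.utc)
fact = SourcedFact(
    value="placeholder figure",           # stand-in, not real financial data
    filing_date=date(2026, 2, 1),
    source_url="https://example.com/10-Q",
    retrieved_at=now - timedelta(days=10),
)

print(fact.is_stale(now, timedelta(days=1)))   # True
print(fact.is_stale(now, timedelta(days=30)))  # False
```

The staleness window would differ by data type: a price feed goes stale in minutes, a quarterly filing in months, which is why max_age is a parameter instead of a constant.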

Also worth flagging briefly: a solo developer did something genuinely interesting with small model self-improvement. Using a 24-gigabyte MacBook and a few dollars of cloud credits, he built a loop where a small Qwen model invents its own coding problems, attempts them, and fine-tunes itself on the pairs where it went from wrong to right — with the Python interpreter as the only judge. Qwen 2.5 7B went from getting 25 out of 164 HumanEval coding problems right to getting 112 right. With no human-written training data.
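The core of that loop, the interpreter as judge and the wrong-to-right filter, can be sketched in a few lines. This is one interpretation of the description above, not the developer's actual code: the dictionary layout is an assumption, each problem is assumed to have at least one attempt, and hand-written attempts stand in for model generations. Running generated code like this belongs in a sandbox.

```python
import subprocess
import sys


def passes(solution: str, tests: str, timeout_s: float = 5.0) -> bool:
    """Judge a candidate by running it plus its tests in a fresh interpreter."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", solution + "\n" + tests],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def wrong_to_right_pairs(problems: list[dict]) -> list[dict]:
    """Keep problems where the first attempt failed and a later attempt passed."""
    pairs = []
    for p in problems:
        first, *rest = p["attempts"]
        if passes(first, p["tests"]):
            continue  # already solved on the first try, so no wrong-to-right pair
        fixed = next((a for a in rest if passes(a, p["tests"])), None)
        if fixed is not None:
            pairs.append({"prompt": p["prompt"], "wrong": first, "right": fixed})
    return pairs


# Hand-written stand-ins for what the model would generate.
problems = [{
    "prompt": "Write add(a, b).",
    "tests": "assert add(2, 3) == 5",
    "attempts": [
        "def add(a, b):\n    return a - b",
        "def add(a, b):\n    return a + b",
    ],
}]

pairs = wrong_to_right_pairs(problems)
print(len(pairs))  # 1
```

The resulting pairs are what would feed the fine-tuning step, which is left out here because the episode doesn't describe how it was done.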

The finding I found most interesting isn't the headline number. It's what he called the threshold effect: if you have fewer than roughly 100 training pairs, fine-tuning and test-time sampling don't add up — they actually fight each other. The fine-tuning narrows the model's output diversity enough that sampling loses the variety that made it useful. Below the threshold, you might be better off just sampling from the base model. That's a practical finding nobody seems to have written down before, and it's worth knowing if you're doing anything with small-model adaptation. Code and weights are open; links in the show notes.

On the infrastructure and public opinion front: a poll found that 70 percent of Americans don't want AI data centers built in their local area. This got a lot of Reddit traffic, and the most upvoted responses are the ones pointing out the proportionality problem. By the commenters' figures, gaming uses 12 to 25 times more energy per hour than AI. Streaming video uses 7 times more. Golf courses consume about 2 billion gallons of water a day. California almond farms use 2 trillion gallons of water annually. The AI water and energy footprint is real, and data centers should report it transparently and be sited intelligently. But the framing that AI specifically invented these problems is not supported by the numbers.

The real issue people are expressing — and this is the part I take seriously — isn't actually about AI specifically. Seven out of ten people don't want a massive industrial facility next to their house regardless of what's in it. That's a legitimate land use and community impact question. The AI specificity in the framing is more about anxiety than arithmetic.

Finally, a brief note on the HuggingFace open-source debate. Yann LeCun circulated a clip of HuggingFace CEO Clement Delangue being asked about the risks of releasing powerful open-source models. Delangue's position, consistent with what he's said previously, is that restricting access doesn't actually reduce risk — it just concentrates it in fewer hands. LeCun amplifying this is unsurprising; it aligns with his long-standing position on open systems. I don't have the full clip text but the position itself is a meaningful data point as the open-weight vs. closed-weight policy debate continues to heat up ahead of anticipated government scrutiny.

That's today's Barely Possible. The full Stanford Enterprise AI Playbook, the Equibles MCP server repo, the self-training small model code, and the Touch Dreaming robotics paper are all in the show notes. Google I/O is next week — we'll be watching what actually comes out versus what was previewed. Stay sharp out there.

Tony DeLuca, signing off — same time tomorrow.
