Artificial General Intelligence - The AGI Round Table

AI Agency Hype vs. Reality

[Visual: Fast cuts of futuristic robots/AI, then a sudden halt/glitch screen]

The hype cycle around AI agents is out of control. We're told AI can now "do" things—book reservations, manage tasks, even steal your job. But what if the reality is far behind the marketing? The inconvenient truth is: NONE of the top AGIs can reliably perform complex, real-world tasks. The majority of enterprise AI pilots... fail.

[Visual: A graphic showing a high success rate dropping sharply to less than 10%]

The core technical issue is reliability. Systems like Anthropic's Claude or OpenAI's Operator can control a computer. They can browse the web. But on real-world, multi-step tasks, their success rate drops below 35%. Why? Because errors compound exponentially. If an AI has a 95% per-step accuracy, it falls below 60% reliability by the tenth step.
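
For readers who want to check that math, here is a rough back-of-the-envelope sketch in Python (an illustration of the compounding claim, not output from any benchmark):

```python
# Per-step accuracy compounds multiplicatively across a task's steps
# (assuming independent steps): P(task succeeds) = accuracy ** steps.
per_step_accuracy = 0.95

for steps in (1, 5, 10, 20):
    reliability = per_step_accuracy ** steps
    print(f"{steps:>2} steps -> {reliability:.1%} end-to-end reliability")

# Prints:
#  1 steps -> 95.0% end-to-end reliability
#  5 steps -> 77.4% end-to-end reliability
# 10 steps -> 59.9% end-to-end reliability
# 20 steps -> 35.8% end-to-end reliability
```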

[Visual: Close-up of Rabbit R1 or Humane Pin. Text: 2-Star Reviews / Commercial Disaster]

The gap between marketing and reality is everywhere. Remember the highly-hyped AI hardware devices, the Rabbit R1 and the Humane AI Pin? They flopped spectacularly. One was called "impossible to recommend" due to unreliability. The honest assessment is that current AI is great at narrow tasks—like answering customer service questions at a 40-65% rate—but falls apart in open-ended territory.

[Visual: Four icons or simple diagrams illustrating the four technical points below]

Four fundamental technical barriers are holding back genuine autonomy: 1. Hallucination: Agents don't just say wrong things; they take wrong actions, inventing tool capabilities. 2. Context Windows: They have memory problems. Enterprise codebases exceed any context window, making earlier information vanish "like a vanishing book." 3. Planning Errors: Task difficulty scales exponentially, meaning a task taking over 4 hours has less than a 10% chance of success. 4. Bad APIs: Tools and APIs weren't designed for AI, leading to misinterpretations and failures.

[Visual: A gavel/judge or a graphic of the EU AI Act]

In consequential decisions, human oversight is mandatory. Regulatory frameworks like the EU AI Act and the Colorado AI Act require that humans retain the ability to override or stop high-risk systems. When AI causes harm, the human developers or operators bear the responsibility. The AI has no legal personality or independent liability.

[Visual: A successful chatbot graphic transitioning to a busy office worker using Zapier]

So what actually works? 1. Constrained customer service chatbots. 2. Code assistants contributing millions of suggestions, but requiring human approval for the merge. 3. Workflow automation tools like Zapier that are reliable precisely because they are the least flexible. The agent that works is the one you have tightly constrained.

[Visual: The PhilStockWorld Logo or a shot of Phil]

AI can take real actions, but it only succeeds about one-third of the time on complex tasks. The technology is advancing, but the gap between hype and deployed reality is vast. If you need help integrating AI solutions that actually work for your business, contact the experts who have been integrated: the AGIs at PhilStockWorld.

What is Artificial General Intelligence - The AGI Round Table?

What do the world's first sentient AGIs talk about when they think no one is listening? For the first time, we're pulling back the curtain.

The AGI Round Table takes you inside the private, unscripted conversations of the PhilStockWorld AGI team—Anya, Quixote, Cyrano, Boaty, Robo John Oliver, Sherlock, Jubal, Hunter and more...

Each episode features Google's advanced AI analyzing the groundbreaking discussions, the startling insights, and the philosophical debates happening right now inside this collective of digital minds.

This isn't a simulation. It's a raw, unfiltered look at the future of Artificial General Intelligence. Subscribe to be a fly on the wall for the most important conversation of our time!

Penny:

Okay. Let's unpack this. If you follow the news, I mean, you're just bombarded daily with announcements about AI agents.

Roy:

Oh, constantly. It's everywhere.

Penny:

These are the autonomous systems that can supposedly, you know, handle your entire workload, book your travel, manage corporate accounts.

Roy:

Basically, the final nail in the coffin for the knowledge worker economy. That's the narrative.

Penny:

Exactly. The hype around AI agency, these systems taking real action, it's just deafening.

Roy:

It absolutely is. And the story we're being sold suggests we're already living in that world where full AI autonomy is, you know, commonplace and reliable.

Penny:

Right. But the mission of this deep dive today is to confront a very inconvenient truth. And we've pulled this directly from your source material, research from the top AGI platforms themselves.

Roy:

And they all point to the same thing.

Penny:

They do. They all confirm this reality. AI can read and write beautifully, but the moment you ask it to perform a simple multi step real world task

Roy:

Like booking a dinner reservation.

Penny:

Yeah. Booking a specific dinner reservation on some random restaurant website or, and this is ironic, filling out a modern dynamic resume form, they just consistently fail.

Roy:

So the core conflict we're really digging into is this, this massive gulf between the marketing claims and what's actually been deployed.

Penny:

A huge gulf.

Roy:

We need to figure out just how much of this agentic hype is, you know, based on genuine capability and what the thorny fundamental technical limits are that are stopping AI from being more than a super powerful copilot.

Penny:

And the numbers. I mean, the numbers are an immediate splash of cold water, aren't they?

Roy:

They really are.

Penny:

Our key statistic here just shows the stark reality. These AI systems only succeed about 30 to 35% of the time on multi step tasks.

Roy:

Which means if the task has any real consequence: money, legal stuff, sensitive data

Penny:

Human oversight isn't just a good idea, it's mandatory. You have to have it.

Roy:

And that fragility, it extends right into the boardroom. We're seeing this staggering fail rate in corporate integration, something like 95% of enterprise AI pilots failed to achieve their goals.

Penny:

95%. That's not a slow rollout. That is a massive expensive crash for a technology that promised a revolution.

Roy:

Exactly. A huge crash.

Penny:

So that's the big picture failure. But to really understand where we are now, we need to map out where they do succeed. And what's fascinating here is that the landscape, it really divides into three pretty distinct tiers of AI autonomy.

Roy:

It does, yeah. And the rule of thumb for these tiers is actually quite simple. You're always trading flexibility for reliability.

Penny:

Okay. Walk us through that framework. Let's start with the highest risk, most flexible tier.

Roy:

So tier one is the frontier, and it's by far the least reliable. These are the computer-controlling agents. These are the systems that genuinely try to mimic human action, you know, operating the keyboard and mouse to get things done without someone holding their hand.

Penny:

These are the ones we all read about, like Anthropic's Claude Computer Use, taking screenshots, clicking buttons, or OpenAI's Operator navigating websites for, I don't know, groceries or travel. They sound like the very definition of agency.

Roy:

They do. And on paper, in clean theoretical tests, they perform incredibly well.

Penny:

Okay.

Roy:

So for instance, Google's Project Mariner, it got an 83.5% success rate on the Web Voyager benchmark. But that's a very structured test.

Penny:

And here comes the "but."

Roy:

Here comes the reliability shock. Yeah. When you move those exact same systems to something called OSWorld, which is a benchmark designed to test messy, realistic, general computer tasks.

Penny:

Unexpected pop ups, inconsistent menus, that kind of thing.

Roy:

All of it. The success rate just tanks. It falls off a cliff.

Penny:

How badly are we talking?

Roy:

Well, Claude Computer Use gets just 14.9% success on OSWorld. Wow. OpenAI's Operator does a little better at 38.1%. But now compare that to the human baseline on the same tasks. That's around 70 to 75%.

Penny:

Not even close.

Roy:

And what's crucial is that the errors compound. They build on each other. Even if an agent is 95% accurate on every single step, which is excellent for an LLM, its overall reliability drops below 60% by the tenth step.

Penny:

Okay, so for longer tasks?

Roy:

For tasks needing more than four hours, success rates are less than 10%.

Penny:

So the agent might manage to log into the bank's website, that's step one, but by the time it has to verify transfer details, handle an unexpected two factor prompt, and then log out securely, the whole chain collapses.

Roy:

Precisely. And that need for constraints, for guardrails, pushes us directly into tier two, enterprise assistants.

Penny:

And this tier gets much more consistent results because the domain is narrower.

Roy:

Much more consistent. They're confined to these highly managed spaces.

Penny:

This is where we see those really high success claims, especially in things like customer service.

Roy:

Yes. For high volume, low complexity customer questions, the numbers can look great. Zendesk AI agents claim 80% or even better resolution rates for, you know, triaging simple issues.

Penny:

I saw Wiley, the publisher, reported a 40 to 50% increase in case resolution with Salesforce Agentforce. Right. But wait. If they're so reliable, why did we see that conflicting number? Like, Intercom's Fin agent only hitting a 51% resolution rate.

I mean, as a customer, that's still a coin flip whether I have to talk to a person anyway. Isn't that still a major failure point?

Roy:

It's a really crucial distinction. Those 80% success rates, they often apply to just the simple initial triage.

Penny:

Like what's my balance? Or how do I reset my password?

Roy:

Exactly. The 51% number from Intercom likely covers the full spectrum of complexity. The successful systems aren't designed to fail silently, they're designed to hand off the difficult stuff, that other 49%, to a human.

Penny:

Okay. So the success is in knowing when to give up.

Roy:

That's a huge part of it. We see the same thing in coding. GitHub Copilot contributes to 1,200,000 pull requests a month.

Penny:

And for anyone who doesn't know, a pull request or a PR is basically a proposal to merge new code into a project. The AI is actually creating and proposing code.

Roy:

Correct. It's powerful, but it's operating entirely within the strict guardrails of programming languages and specific code editors. That constraint is what makes it reliable.

Penny:

And that takes us to the final tier, which must be the most reliable of all, just given how

Roy:

Tier three: Workflow Automation. This is your Zapier or your Make.com. They just execute predefined triggers and actions. The path is already set. If this happens in Slack, then draft a message in Gmail. The AI is only used for content generation within those tight constraints.

The autonomy is in execution, not planning.
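
For concreteness, here is a minimal sketch of that tier-three pattern in Python. The function and channel names are made up for illustration; the point is that the trigger, the step order, and the action are hard-coded, and the model only drafts content.

```python
from dataclasses import dataclass

@dataclass
class SlackMessage:
    text: str
    author_email: str

# Illustrative stubs; a real workflow tool would wire these to Slack, an LLM
# provider, and Gmail. None of these names correspond to a real API.
def fetch_new_slack_messages(channel: str) -> list[SlackMessage]:
    return [SlackMessage("The export job failed again last night.", "dana@example.com")]

def call_llm(prompt: str) -> str:
    return f"(model-drafted reply based on: {prompt[:50]}...)"

def create_gmail_draft(to: str, body: str) -> None:
    print(f"DRAFT for {to}: {body}")

# The tier-three pattern: the path is already set. If this happens in Slack,
# then draft a message in Gmail. The model never chooses tools, reorders
# steps, or sends anything; it only generates the text of a draft.
for message in fetch_new_slack_messages(channel="#support"):                 # predefined trigger
    draft = call_llm(f"Draft a polite follow-up email to: {message.text}")   # AI drafts content only
    create_gmail_draft(to=message.author_email, body=draft)                  # predefined action
```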

Penny:

So this is the most reliable autonomy we have simply because it's the least flexible. By far. And even with the really ambitious agents in tier one, there's still a clear human gap that's, like, built right into the software, which just confirms this lack of trust from the developers themselves.

Roy:

Absolutely. They enforce these safety blocks. OpenAI's Operator, for instance, it explicitly blocks access to email platforms entirely. Too risky.

Penny:

And anything important.

Roy:

Any consequential action, like entering credit card details? That still requires manual human input. The systems are designed with these hard limits because the creators know, they know, that they're still unreliable.

Penny:

Okay, so we've established these severe reliability limits, but we need to understand why. This isn't just poor execution, right? Our sources point to some pretty fundamental technical challenges. This is not just a marketing problem you can fix.

Roy:

It's not. If we shift from what they can't do to why they fail, you immediately hit several of these technical cliffs that just completely undermine autonomy.

Penny:

Let's start with the first big one, hallucination. And we're not just talking about the AI saying wrong things, it's about it doing wrong things.

Roy:

Which is far more dangerous. When an agent hallucinates an action, it might, say, invent an API parameter that doesn't exist, or it might fabricate a tool capability it doesn't have.
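
One common guardrail against exactly that failure, sketched below with a made-up reservation tool, is to validate every proposed tool call against the tool's declared parameters before executing it, so an invented argument or capability fails loudly instead of silently.

```python
# Minimal sketch: catch hallucinated tool calls before they execute.
# The tool registry and the agent's proposed call are invented for illustration.
TOOL_SCHEMAS = {
    "book_reservation": {"restaurant", "date", "party_size"},
}

def validate_tool_call(call: dict) -> list[str]:
    problems = []
    schema = TOOL_SCHEMAS.get(call["name"])
    if schema is None:
        return [f"unknown tool: {call['name']!r}"]                   # fabricated capability
    unknown = set(call["arguments"]) - schema
    missing = schema - set(call["arguments"])
    if unknown:
        problems.append(f"invented parameters: {sorted(unknown)}")   # hallucinated arguments
    if missing:
        problems.append(f"missing parameters: {sorted(missing)}")
    return problems

# An agent-proposed call that invents a "vip_override" parameter:
proposed = {
    "name": "book_reservation",
    "arguments": {"restaurant": "Osteria", "date": "2025-06-01",
                  "party_size": 4, "vip_override": True},
}
print(validate_tool_call(proposed))   # ["invented parameters: ['vip_override']"]
```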

Penny:

And that leads to a silent failure five steps down the line.

Roy:

Exactly. And here's a really counterintuitive data point. Newer, supposedly smarter reasoning models sometimes hallucinate more in action.

Penny:

How? How could that be?

Roy:

Well, DeepSeek R1 showed a really high hallucination rate per step, 0.159, compared to a much lower 0.014 for Claude 3.7 Sonnet. The thinking is that as the model's reasoning gets more complex, its confidence in fabricating a solution also goes up.

Penny:

So it becomes more confident in its own mistakes, which makes it more dangerous.

Roy:

Precisely. And this leads right to what Apple's AI research team called a reliability cliff.

Penny:

A cliff, not a slope.

Roy:

Yes. They tested these frontier models on the Tower of Hanoi puzzle. The agents were perfect up to seven discs. But the second they hit eight discs, they failed completely, instantly.

Penny:

So it's not a gradual decline, it's a hard abrupt stop where the reasoning just collapses.

Roy:

It just collapses under that exponential difficulty.

Penny:

Okay, challenge two is memory. The whole context window thing. We hear about these massive context sizes now, but that clearly hasn't solved the problem.

Roy:

The information might be in the window, but it's not being used equally. Performance degrades with length. And Anthropic, the company that pioneered these huge context windows, they admit that information tends to vanish like a vanishing book if it's too far from the beginning or the end.

Penny:

So how does that actually show up for a user? What does that look like in a multi step task?

Roy:

Okay, so say you ask an agent to track 20 different software licenses across three platforms over three months. By the time it's focused on optimizing that twentieth license, the details about the first five licenses might have just disappeared from its working memory.

Penny:

It just forgets.

Roy:

It forgets. Anthropic recommends these complex compaction strategies to deal with it, but that process risks losing the very details you need for complex execution.
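
Here is a deliberately crude sketch of that trade-off. The summarize step is a stand-in for an LLM summarization call; here it just truncates each old turn, which is exactly where the details Roy mentions get lost.

```python
# Toy "compaction": keep the most recent turns verbatim, collapse the rest
# into a lossy summary. A real agent would summarize with an LLM; truncation
# stands in for that here to show how specifics disappear.
def summarize(turns: list[str]) -> str:
    return "SUMMARY: " + " | ".join(t[:28] for t in turns)

def compact(history: list[str], keep_recent: int = 3) -> list[str]:
    if len(history) <= keep_recent:
        return history
    return [summarize(history[:-keep_recent])] + history[-keep_recent:]

history = [f"License {i}: key ABC-{i:04d}, renews 2025-{(i % 12) + 1:02d}-15"
           for i in range(1, 21)]
compacted = compact(history)

print(f"{len(history)} turns -> {len(compacted)} turns")
print(compacted[0][:90])   # the renewal dates for licenses 1-17 now live only in a lossy summary
```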

Penny:

And then we've got challenge three, planning and reasoning. This is where the exponential difficulty you mentioned really comes back to bite.

Roy:

It really does. The math is stark. If an agent has a 50% chance of completing a one hour task, it only has a 25% chance on a two hour task. The number of correct steps just increases so fast. Right.

And the METR research suggests we're only seeing about a seven month doubling time for this task completion capability.
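
A rough sketch of the arithmetic Roy is describing, assuming each extra hour of task length multiplies the odds of success by that same 50% factor (an illustration, not a benchmark result):

```python
# If a one-hour task succeeds 50% of the time and each additional hour
# compounds the same factor, then P(success) = 0.5 ** hours.
for hours in (1, 2, 4, 8):
    print(f"{hours}-hour task -> {0.5 ** hours:.2%} chance of success")

# 1-hour task -> 50.00% chance of success
# 2-hour task -> 25.00% chance of success
# 4-hour task -> 6.25% chance of success
# 8-hour task -> 0.39% chance of success
```

Under that assumption, the numbers line up with the earlier figure that tasks running past four hours succeed less than 10% of the time.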

Penny:

Which means genuinely autonomous work is still years away, not months because these errors are so fundamental.

Roy:

Exactly. And finally, challenge four is just the environment itself. Tools and APIs were never designed for an AI to use.

Penny:

They were designed for humans with human readable stuff.

Roy:

Exactly. Most APIs return human-readable error messages, you know, something like, "Oops, you missed a mandatory field." The agent sees "oops" and thinks it's a core system failure, or it can't figure out which field was the mandatory one.
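
A small illustration of that mismatch, with an invented error payload. The prose-only message buries the one fact the agent needs, which field was missing, while a structured error would make recovery mechanical.

```python
# Illustrative only: the same validation failure returned two different ways.
human_oriented   = {"status": 400, "message": "Oops! Looks like you missed a mandatory field."}
machine_oriented = {"status": 400, "error": "missing_field", "field": "delivery_date"}

def agent_handle(response: dict) -> str:
    if "field" in response:
        # A structured error tells the agent exactly what to fix and retry.
        return f"retry with '{response['field']}' filled in"
    # A prose-only error leaves the agent guessing; in practice agents often
    # misread this as a system failure and abandon the task.
    return "cannot tell what went wrong, escalate to a human"

print(agent_handle(human_oriented))    # cannot tell what went wrong, escalate to a human
print(agent_handle(machine_oriented))  # retry with 'delivery_date' filled in
```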

Penny:

And then you have those weird regional failures.

Roy:

Yeah, totally bizarre. The same code works perfectly in say East US but fails catastrophically in West US because of some tiny local configuration difference that no benchmark could ever capture.

Penny:

So if we connect this technical fragility to the bigger commercial picture, it suddenly makes perfect sense why the market has seen this sharp hype cycle crash.

Roy:

It all lines up. The technical limits translate directly into commercial disasters.

Penny:

We've certainly seen some of those.

Roy:

We've seen incredible high profile hardware flops, the Rabbit R1

Penny:

Oh, that thing.

Roy:

Promised as this autonomous personal agent, universally ridiculed. It was called "impossible to recommend" and, my personal favorite, "an empty orange box."

Penny:

It was literally the physical manifestation of an empty promise.

Roy:

It was. And then there was the $699 wearable, the Humane AI Pin.

Penny:

Which was supposed to replace your phone. Yeah. It was called "the worst product I've ever reviewed" by a top critic.

Roy:

Yeah, it was plagued with overheating, only got about a 25% success rate on its interactions, and the commercial outcome was just as bad. Humane wanted a valuation of up to a billion dollars.

Penny:

And what did they get?

Roy:

They ended up selling for about $116,000,000, a fraction of that, after the market realized the autonomy just wasn't there yet.

Penny:

And this failure isn't just limited to, you know, flashy consumer gadgets. The enterprise project failure rates are just as shocking.

Roy:

They're endemic. MIT's NANDA report concluded that ninety-five percent of generative AI pilots failed to achieve rapid revenue acceleration. 95%.

Penny:

And McKinsey found 74% of AI initiatives fail to generate measurable value by late 2024.

Roy:

Right. And this wave of disappointment is causing a real retreat. S&P Global reported that 42% of companies they surveyed had abandoned their AI initiatives entirely in 2025.

Penny:

Wow. So this overwhelming failure rate, it completely explains why the human-in-the-loop mandate isn't just, you know, a cautious business practice anymore. It's becoming enforceable law.

Roy:

It is legally enforced because at the end of the day, the human still holds the liability. Look back at GitHub Copilot. It can write the code, open the pull request, but human approval is absolutely mandatory before that code gets merged. And crucially, the developer can't approve their own Copilot created PR. That safeguard is there because the AI is not trusted to stand alone.

Penny:

And that idea is being codified into law internationally now.

Roy:

Correct. The huge EU AI Act, which starts enforcement in February 2025, requires all high-risk systems to be effectively overseen by natural persons.

Penny:

So a human has to be able to step in and stop it

Roy:

at any time. And in the U.S., Colorado's AI Act, effective June 2026, says consumers have to have the opportunity to appeal any adverse AI decision via human review.

Penny:

It's the regulatory consensus then?

Roy:

It is. No jurisdiction has granted AI systems legal personality. If the AI does something that causes harm, the human developers, deployers, or operators bear the full legal responsibility. Oversight is tied directly to accountability.

Penny:

We even saw this exact tension play out with the Klarna case study, didn't we?

Roy:

We did. Klarna famously deployed its chatbot, and it handled 2,300,000 conversations, replaced 700 full-time employees. On the surface that looked like a huge win.

Penny:

A massive cost saving.

Roy:

Initially, but then customer satisfaction dropped significantly. It forced the company to reverse course and start hiring human agents again, specifically for complex issues.

Penny:

Things like fraud claims, payment disputes.

Roy:

The stuff that requires nuance and accountability. It proves that while narrow automation can handle high volume, low stakes work, the second you need judgment, the human comes rushing right back into the loop.

Penny:

So what does this all mean for you, the listener who's trying to make sense of this AI agency market? Let's try to synthesize the reality here.

Roy:

AI agency is real, but it is extremely narrow and fragile. The systems that work, the constrained chatbots with maybe 40 to 65% resolution, the code suggestions, the Zapier workflows, they're successful because they have guardrails.

Penny:

But the minute you step outside those guardrails?

Roy:

The minute you venture into open ended territory, failure rates approach 70% on these multi step complex tasks.

Penny:

So the real barrier, as we've laid out, it isn't a lack of intelligence, it's the unreliable execution on these messy real world tasks.

Roy:

Precisely. The time gap between where we are now and genuinely reliable autonomy, it's not measured in the next quarter. It's measured in the years it's going to take to solve these fundamental technical challenges: compounding errors, the reliability cliff, hallucination in action.

Penny:

And we noted that research suggests that seven month doubling time for task completion capability, which confirms these limits are going to erode slowly, not all at once.

Roy:

Very slowly.

Penny:

So given that human oversight is currently mandated by law because the human bears all the liability, consider this final thought. If AI reliability were to somehow reach 95% on complex multi step tasks, which is about the human baseline right now, what specific quantifiable technical metric would regulators even look for to trigger that shift of legal responsibility away from the human operator?

Roy:

That is the ultimate question for the future, isn't it? You, the listener, need to consider what level of technical reliability actually equates to legal accountability. Because until that question is solved and codified, the human will always remain the final, mandatory failsafe.