Everyday AI Made Simple - AI in the News

In this episode of Everyday AI Made Simple, we dive into one of the hottest and most misunderstood topics in tech — Artificial General Intelligence (AGI) — and explore the new framework that claims to finally measure it.
Join us as we unpack Dan Hendrycks’ “AGI Scorecard”, a groundbreaking approach that evaluates AI systems like GPT-4 and GPT-5 against the full spectrum of human cognitive abilities. You’ll learn why GPT-5’s 58% AGI score represents a major leap forward, how psychology-based models like CHC theory are reshaping AI measurement, and why memory and dependability remain the biggest barriers between today’s AI and true AGI.
We also break down the competing definitions and what these reveal about the global AI race. Whether you’re a developer, investor, or just AI-curious, this episode will give you a clear, actionable understanding of where we actually stand on the road to general intelligence.
👉 Tune in to discover:
  • How the AGI Scorecard quantifies progress toward human-level cognition
  • Why GPT-5’s 58% score matters for AI research, policy, and markets
  • The critical memory bottleneck slowing AI’s path to reliability
  • What the next frontier of AGI development could look like

What is Everyday AI Made Simple - AI in the News?

Want AI news without the eye-glaze? Everyday AI Made Simple – AI in the News is your plain-English briefing on what’s happening in artificial intelligence. We cut through the hype to explain the headline, the context, and the stakes—from policy and platforms to products and market moves. No hot takes, no how-to segments—just concise reporting, sourced summaries, and balanced perspective so you can stay informed without drowning in tabs.

Blog: https://everydayaimadesimple.ai/blog
Free custom GPTs: https://everydayaimadesimple.ai

Some research and production steps may use AI tools. All content is reviewed and approved by humans before publishing.

00:00:00
Hey, everyone. Welcome back to Everyday AI Made Simple. Grab your drink of choice and let's dive in to what's going on in AI news.
00:00:06
Welcome to the deep dive. We take all the complex research, the debates that seem to go nowhere, this fire hose of information, and we try to boil it down, get to the essential actionable insights you actually need.
00:00:23
And today we're wrestling with maybe the most loaded term in tech right now, AGI, Artificial General Intelligence.
00:00:29
Right. AGI. It's been this like fuzzy goalpost for years, hasn't it? Always somewhere off in the future.
00:00:34
Exactly. It's been so nebulous, so vaguely defined that you get these endless and frankly, pretty unproductive debates about timelines. You know, is it a year away? Five years? Ten? Longer. And that lack of clarity isn't just academic. It causes real confusion, right? It affects research funding, policy decisions, everything. Totally. The vagueness.
00:00:56
is, well, it's the core problem. So what if we stop thinking about it as binary? Like you either have AGI or you don't.
00:01:02
Okay, so what's the alternative?
00:01:03
What if we started treating it more like a spectrum, something you could actually measure, put a number on it?
00:01:07
A quantifiable score for AGI. Okay, I'm intrigued. That's our mission for this deep dive then. That's it. We're diving into this comprehensive new framework. And it's interesting because it's grounded not really in computer science theory, but in cognitive psychology. It's an attempt to finally give AGI a clear measurable score.
00:01:29
Cognitive psychology, huh? So measuring it against human intelligence.
00:01:34
Precisely. We're going deep into the methodology here. It comes from Dan Hendrycks and the Center for AI Safety. We'll look at how they built this scorecard using human cognition as the benchmark.
00:01:45
And how is it different from, you know, all the other ways people have tried to define or measure AI progress?
00:01:50
Well, that's key. It's structurally very different. And we'll definitely get into how the current top models actually stack up against this new scorecard.
00:01:57
Okay, let's not bury the lead then. Give us the key nugget. What's the score for the state-of-the-art AI right now, according to this framework?
00:02:03
Right. So according to this standardized approach, GPT-4, which, let's be honest, felt like absolute magic just, what, a year or so ago.
00:02:10
Yeah, seemed incredible.
00:02:12
It scores 27% AGI on this scale. 27%.
00:02:15
Wow. Okay. That sounds low, given the hype.
00:02:19
It does, doesn't it? But wait. The current frontier model, let's call it GPT-5 for simplicity, based on the research paper's naming, that one currently stands at 58% AGI.
00:02:29
58%. So more than double GPT-4's score. That's a huge jump.
00:02:34
A massive jump. And understanding that jump, what that 58% actually means in terms of capabilities, and maybe more importantly, what pieces are still missing. That's really the heart of our conversation today.
00:02:46
Okay. Let's start with the big picture, the stakes. Why does this 58% score, or even having a score at all, actually matter? I mean, if you're just using AI day-to-day, summarizing emails, writing some code, maybe handling customer service chats, does the philosophical definition of AGI or this score really change anything for that practical application? It often feels kind of useless, honestly. And that's a perfectly valid point.
00:03:11
If you're focused on a specific task, you don't need it to understand philosophy or, you know, have amazing auditory perception if it's just generating marketing copy. In that context, yeah, AGI definitions can feel like irrelevant buzzwords, academic navel gazing almost. But, I sense a but coming. There's a big but. The second you zoom out from the individual user to the system level, things change dramatically. We're talking about markets, investment strategy.
00:03:37
geopolitical competition. Ah, okay. So the definition suddenly becomes incredibly important there. We're talking like trillions of dollars important. Easily. The financial.
00:03:46
stakes are just staggering. Yeah. Think about the valuations of the big tech players, NVIDIA, Microsoft, Google, the whole ecosystem. A huge chunk of their market cap is tied, fundamentally, to the perceived progress towards AGI.
00:04:00
Perception being the key word there because it's been so vague.
00:04:03
Exactly. Nebulous definitions breed uncertainty, speculation, volatility. Investment funds, venture capitalists, even governments, they're scrutinizing every little hint of progress to figure out where to place their bets.
00:04:15
So a clearer, measurable metric like this 58% score, it could potentially stabilize things or maybe destabilize if progress seems to slow down.
00:04:26
Both are possible. If this framework shows steady progress, it might calm nerves. But if a future model plateaus or if this metric reveals fundamental roadblocks, yeah, that could trigger major market shifts. We're talking hundreds of billions potentially.
00:04:41
So the goal here isn't just academic tidiness. It's about trying to replace maybe faith and hype with a trackable metric, something that reduces that massive financial uncertainty.
00:04:52
That's a big part of the motivation, yes. And that uncertainty ties directly into this whole timeline debate that's been absolutely raging recently.
00:04:59
Oh, yeah. You hear everything, from, you know, AGI is basically here, maybe 18 months away, to nope, we're still a decade or more.
00:05:06
out from anything truly general. And those wild variations in timelines, they're almost entirely a function of where you set the definitional bar, how high is high enough. Right. If your definition.
00:05:14
is just, say, an AI that can write a decent college essay, then yeah, we're pretty much there.
00:05:19
But if you set the bar really high, the timeline stretches way, way out, maybe indefinitely.
00:05:26
Speaking of high bars, let's talk about Andrej Karpathy. He's obviously a major figure in AI development, and his definition seems extremely demanding.
00:05:33
It is very demanding. He doesn't define AGI purely by cognitive skills like thinking or writing. He defines it by economic utility.
00:05:41
Okay, how so?
00:05:43
Karpathy's definition, roughly, is a system that can do any economically valuable task at human performance or better.
00:05:49
So that includes what? Manufacturing, driving trucks, performing surgery?
00:05:56
Exactly. Physical labor, intricate hands-on work, high-level strategic decision-making in complex, unpredictable environments. Not just knowledge work done on a computer. His definition is incredibly broad.
00:06:08
And he's been critical of how the term AGI is often used now, right?
00:06:11
Yeah, he feels the definition has been significantly, well, he used the term watered down. People are increasingly using AGI to talk primarily about intelligence for knowledge work. You know, the stuff you can do with a computer.
00:06:22
Things like writing, coding, analyzing data.
00:06:25
Right, which is obviously hugely important. But Karpathy estimates that knowledge work only accounts for maybe 10% to 20% of all global human labor, economically speaking.
00:06:36
Wow, only 10%, 20%. That's a crucial distinction if AGI only tackles that slice.
00:06:41
Then it leaves the other 80%, 90% of the global economy, the physical stuff, the infrastructure, the hands-on services, largely untouched by what he would consider truly general intelligence.
00:06:51
And that distinction matters enormously for investment, for national strategy, for workforce planning.
00:06:56
Absolutely. If you're, say, a logistics company or a construction firm looking for AI to revolutionize your physical operations, a system that only aces reading comprehension tests isn't the AGI you're looking for. So Karpathy's argument is if the AI isn't.
00:07:10
general enough to tackle that other massive chunk of economically valuable tasks, it's not truly general intelligence. It might be amazing cognitive intelligence, but it's specialized.
00:07:20
That's the core of his critique. So whether you lean towards Karpathy's super high economic bar or the cognitive definition we're about to explore, the central point holds. The current vagueness is a massive risk, a multi-trillion dollar risk.
00:07:37
Which brings us back to quantification. Having a number, a score, is maybe the only way to really track progress and assess the potential impact accurately.
00:07:45
Okay, so before we dive headfirst into this new quantifiable framework and celebrate its potential elegance, we really need to appreciate the, well, the chaos it's trying to sort out.
00:07:56
The battlefield. The world of definitions you mean.
00:07:57
Exactly. The landscape of AGI definitions is incredibly crowded. And worse, they're often contradictory or fuzzy or just serve the interests of the definer. This ambiguity has plagued the field for years. Let's start with.
00:08:10
the big players, the labs actually building these things, OpenAI, for instance. They've had to sort of adjust their own goals publicly, haven't they? They have. If you go back to early 2023.
00:08:19
their initial public statement was pretty bold, maybe even provocative. They aim for AI systems.
00:08:25
that are generally smarter than humans. Smarter than humans. That sounds incredibly hard to.
00:08:31
measure. How do you even prove that? You don't, really. It's more of a philosophical north star than a practical engineering target. How do you satisfy an auditor or an investor or even just yourselves that you've achieved smarter than humans? It's just too vague. So they walked it back a bit. Yeah. By early 2024, Sam Altman was publicly acknowledging that AGI is a weakly defined term. The working definition shifted, became more about a system that can tackle increasingly complex problems at a human level, but crucially across many fields.
00:09:04
That shift is subtle, but important. Moving away from this idea of general superiority towards versatile proficiency, being good at lots of different things like a human.
00:09:14
Precisely. And that idea of versatile proficiency, of stages of capability, is reflected in that well-known levels of AGI framework. It originally came from Google DeepMind, I believe. Right. The five levels. Let's quickly.
00:09:25
recap those just to situate where current models supposedly are. Level one is pretty basic, right? Chat bots, conversational AI. Yeah. Your standard chat bots.
00:09:34
Level two, they call reasoners. These are systems that can do more sophisticated, maybe human level problem solving, but usually within pretty defined domains like good at math.
00:09:43
puzzles or logic games. Okay. Level three is where it starts getting more agentic.
00:09:48
That's the term, agents. Level three implies systems that can actually take actions, plan, use tools, browse the web, execute multi-step tasks autonomously to achieve a goal. This is a big step up. And level four? Innovators. AI that doesn't just execute tasks, but can genuinely contribute to new discoveries. Maybe designing novel molecules, proving new theorems, writing original research papers, creative and inventive capability. And the final goal, level five, is organizations. That sounds ambitious. Extremely. Level five envisions AI systems capable of.
00:10:22
autonomously managing complex operations, like running a whole company division or coordinating vast logistical networks. Essentially, AI as a C-suite executive or a strategic planner.
00:10:33
And where do today's best models like GPT-4 or Gemini fit on that scale?
00:10:37
The general consensus seems to be somewhere between level three and level four. They're becoming quite capable agents, definitely showing strong reasoning, but the innovator capability is still maybe nascent, glimpses perhaps, but not consistently.
00:10:53
demonstrated. Okay, so that's one way to frame progress, these capability levels. But there are other definitions that focus less on the level of capability.
00:11:00
Right. Contrast that with, say, the definition used for the ARC-AGI prize. This one is fascinating because it explicitly pushes back against using economic value or task automation as the main measure of intelligence.
00:11:16
Wait, why reject economic value? That seems like the most practical measure, given the stakes we just talked about.
00:11:21
Their argument is quite philosophical, actually. They argue that skill can be bought, essentially. If you have near-infinite prior knowledge, like you train a model on every math textbook ever written, and practically unlimited training data and compute.
00:11:34
You can brute force amazing performance on math tests.
00:11:37
Exactly. You can achieve incredibly high skill in that specific domain. But does that demonstrate true, generalized intelligence? Or does it just demonstrate mastery of the training set? They argue it masks the underlying generalization ability.
00:11:52
So for the ARC prize, intelligence isn't about how good you are at math. It's about how efficiently you can learn something completely new.
00:12:00
Something you weren't trained on. Precisely. Their definition centers on the ability to generalize and the efficiency of skill acquisition for tasks well outside the training distribution. It's about the power to learn, not just the knowledge already possessed, which means you need really novel tests the AI couldn't possibly have seen before. Okay, that's a very different perspective.
00:12:20
What about the big tech companies and consulting firms? How do they tend to define AGI.
00:12:24
more functionally? Generally, yes. They tend to focus on capability breadth. Gartner, for instance, defines it as accomplishing any intellectual task a human can perform. Google talks about it potentially possessing the ability to understand or learn any intellectual task. Amazon's is interesting, focusing on novelty. Able to perform tasks it was not necessarily.
00:12:46
trained or developed for. These all sound quite similar, focusing on broad intellectual capability, but then we get to probably the most jarring definition of all, the one that came out of that OpenAI and Microsoft contract dispute. Ah, yes.
00:13:00
the financial benchmark. This is where academic debate slams right into corporate reality and legal liability. It's quite the case study. Remind us what happened there. The core issue was.
00:13:09
Microsoft's access to OpenAI's tech. Exactly. The original contract apparently.
00:13:14
stipulated that Microsoft would lose certain preferential access or rights once OpenAI achieved AGI. But the definition of AGI in that contract was problematic. It was something like highly autonomous systems that outperform humans at most economically valuable work, which sounds okay. But the killer clause was that OpenAI's board had sole non-reviewable.
00:13:37
discretion to declare AGI achieved. So basically, a few people on the board could subjectively decide, yep, we're AGI now, and potentially cost Microsoft billions overnight with no recourse.
00:13:50
That was the reported concern. An unfalsifiable claim with massive financial consequences. So to make it concrete and objective, they changed the definition entirely. They did. The definition reportedly landed on a purely financial metric. AGI would be deemed achieved when OpenAI developed software capable of generating, I think it was $100 billion in profits.
00:14:09
$100 billion. So AGI isn't about intelligence anymore. It's about hitting a profit target.
00:14:15
It's blunt.
00:14:15
It's the ultimate economic proxy definition. It sidesteps the philosophical debate entirely and just says, if it makes us much money, it's AGI enough for our contract. It measures market impact, not cognitive ability.
00:14:26
Just to round this out, Elon Musk also has his own take, right? Something more functional. Yeah, his definition is more about functional completeness. He's described AGI as being capable of doing anything a human with a computer can do. So it matches human capability in the digital realm, but not necessarily surpassing all humans combined or anything like that.
00:14:46
And his timeline is pretty aggressive, three to five years maybe, though he put low odds on his own Grok 5 hitting it soon.
00:14:53
Right. So you have the $100 billion profit target, Karpathy's bar of all economically valuable tasks, the ARC Prize's focus on generalization efficiency, the functional definitions from tech firms, the capability levels.
00:15:06
It's a complete mess. No wonder there's so much confusion and debate. Everyone's using the same term, AGI, but meaning wildly different things.
00:15:13
Exactly. It's defined by the eye of the beholder, often serving their specific purpose, whether that's research, marketing, or legal contracts. This creates intense instability and makes tracking real progress almost impossible.
00:15:25
Okay. So we've thoroughly established the problem, this multi-trillion dollar ambiguity. Now let's finally get to the proposed solution, this attempt by Dan Hendrycks and his team to cut through all that noise with cold, hard, standardized numbers. Right.
00:15:41
Let's get into the paper itself. It's titled simply A Definition of AGI, and it's authored by Dan Hendrycks, along with a pretty heavyweight group of researchers from places like Stanford, MIT, NYU. And their core approach sounds simple, but it's actually quite revolutionary in this context.
00:15:57
How so? What's the big shift?
00:15:59
They decided not to measure AI against other AI or against abstract benchmarks. They decided to measure AI directly against us, against humans.
00:16:09
Okay, so grounding it in human capabilities. And they didn't just invent their own model of human intelligence, right? They leaned on existing psychology.
00:16:16
Exactly. That's critical. They didn't want to create yet another arbitrary definition. They anchored their definition in established science. Their core definition is, AGI is an AI that can match or exceed the cognitive versatility and proficiency of a well-educated adult.
00:16:30
Well-educated adult. That's the benchmark. And the foundation they used. You mentioned it was from psychology.
00:16:36
Yes. And this is really the theoretical bedrock of the whole framework. They based it on the Cattell-Horn-Carroll theory, usually just called CHC theory. If you're building a measurement tool, you need a standard, stable, reliable ruler.
00:16:47
Makes sense.
00:16:48
And the CHC theory is described in their work and generally regarded in psychometrics as the most empirically validated, most scientifically robust model of human cognitive abilities that we currently have.
00:17:00
Huh. So using a model of human intelligence, potentially decades old, to test cutting-edge AI, why is that the right approach? Why not use modern AI benchmarks?
00:17:10
Because the goal isn't just to see if AI is good at AI tasks. The goal is to see if it's developing general intelligence, comparable to humans. The CHC model provides a comprehensive, structured, and balanced view of what human intelligence actually is. It prevents over-focusing on narrow skills.
00:17:27
Okay, tell me more about CHC. What does it actually involve? You mentioned it splits intelligence.
00:17:30
It does. At its core, CHC theory breaks human intelligence down into several broad abilities, but two are particularly key here. Crystallized intelligence, often labeled Gc, and fluid intelligence, or Gf.
00:17:44
Crystallized and fluid. Okay, what's the difference, and how does that relate to something like an LLM?
00:17:50
Great question. Crystallized intelligence, Gc, is essentially everything you've learned. Your accumulated knowledge, your vocabulary, your skills, facts you've memorized, everything acquired through experience.
00:18:01
Okay, so that sounds like something an LLM would be really good at, given its training data.
00:18:06
Exactly. LLMs excel at Gc. Their massive training data sets give them access to a vast ocean of crystallized knowledge. That's why they can ace trivia quizzes or summarize historical events. They have effectively infinite Gc, far more than any single human.
00:18:20
Okay, so if Gc is the learned stuff, then fluid intelligence, Gf, must be the thinking part.
00:18:26
Precisely. Gf is your ability to reason, to solve novel problems you've never encountered before, to spot patterns, to think logically and abstractly, independent of your existing knowledge. It's the raw processing power, the ability to adapt and figure things out on the fly, the ability to learn how to learn.
00:18:44
And that's the hard part for AI.
00:18:46
Historically, yes. And critically, by grounding their framework in CHC theory, Hendrycks and team forced their AGI test to give significant weight to these Gf abilities. Yeah. A model can't just ace the test by having read the entire internet, that's Gc. It has to demonstrate versatile, novel thinking, Gf.
00:19:04
So it balances learned knowledge with raw reasoning ability. How did they translate these broad CHC concepts into specific testable areas for an AI?
00:19:13
They focused on breadth. They identified 10 core cognitive domains derived from the broad abilities outlined in CHC theory, and crucially, they decided to weight each of these 10 domains equally. Each one accounts for exactly 10% of the final AGI score.
00:19:27
10% each. That's really interesting. So you can't just be a super genius in one area and get a high score. You have to be decent across the board.
00:19:33
That's the structural genius of it. It mandates cognitive versatility. To hit 100% AGI by this definition, a model needs to demonstrate proficiency roughly equivalent to a well-educated human adult across the full spectrum of these 10 cognitive areas.
00:19:49
Okay, let's list them out. What are these 10 domains, and what do they mean in practice for an AI model? Maybe group them a bit.
00:19:54
Sure. Let's cluster them. First, you have acquired knowledge. This is basically Gc, testing the breadth and depth of the model's knowledge base. How much does it know? Like we said, relatively easy for LLMs to score high here due to training data.
00:20:08
Okay, knowledge is one. What's next?
00:20:10
Then there's perception. How does the AI take in information from the world? This is split into two 10% domains, visual processing and auditory processing.
00:20:17
Ah, multimodality. So visual means understanding images, charts, videos.
00:20:24
Exactly. Interpreting complex scenes, understanding spatial layouts, reading text in images, processing diagrams. Auditory means understanding spoken language, recognizing tone, distinguishing different speakers, handling background noise, maybe even interpreting music or environmental sounds. Notably, GPT-4 scored zero here. It was text only.
00:20:44
Got it. Knowledge, visual, auditory. What's the third cluster?
00:20:48
Let's call it the central executive. This is like the high level cognitive control center. It's also split into two 10% domains, reasoning and working memory.
00:20:57
Reasoning. That sounds like classic fluid intelligence, Gf.
00:21:00
It is. Logical deduction, inference, problem solving, abstract thinking. Working memory is a bit different. Functionally, for an AI, it relates to its ability to hold and manipulate information actively during a task. Think of it like the model's active mental workspace, maintaining coherence, tracking context, avoiding contradictions within a single ongoing interaction.
00:21:20
Okay, so reasoning and working memory, that's five domains down. Next cluster.
00:21:24
Next is output. How the model expresses itself or performs core learned skills. This cluster actually contains three 10% domains. Speech, meaning quality of synthesized voice and fluency; reading and writing, meaning comprehension, summarization, and complex text generation; and math, meaning calculation and mathematical reasoning.
00:21:45
Speech, reading, writing, math. These seem like areas where current models are already pretty strong, right?
00:21:50
Very strong in reading, writing, and increasingly math. Speech synthesis is also quite advanced. These are often where we see headline-grabbing benchmarks being surpassed.
00:21:59
So that's eight domains. Knowledge, visual, auditory, reasoning, working memory, speech, reading and writing, math. What are the last two?
00:22:06
The final category is crucial and stands a bit apart. Memory. This is also split into two 10% domains. Memory storage and memory retrieval.
00:22:13
Okay. Memory storage and retrieval. How is this different from working memory or acquired knowledge?
00:22:18
Great question. Acquired knowledge is the pre-trained database. Working memory is about holding context during a single session. This memory category is about long-term learning and recall. Storage is the ability to reliably learn new facts or user preferences during interactions and retain them durably over time across sessions. Retrieval is the ability to access that newly stored information accurately later on without errors or hallucinations.
00:22:44
Ah, so this is about the model actually learning from its interactions and remembering things long-term.
00:22:49
Exactly. It's fundamental for personalization and true continual learning.
00:22:53
Wow. Okay, that's a really comprehensive list. Knowledge, visual, auditory, reasoning, working memory, speech, reading and writing, math, memory storage, memory retrieval. Ten domains, 10% each.
00:23:05
It's an incredibly rigorous cognitive checklist grounded in decades of human psychology research. And the power of framing it this way as a scorecard is that it creates a very clear development roadmap.
00:23:16
Right. If you're an AI lab and your model scores 58% like GPT-5 did, you don't just vaguely try to make it smarter. You look at the scorecard.
00:23:24
You look at the scorecard and say, okay, we're still weak in auditory processing. We barely scored in memory storage. Let's focus resources there. It identifies the specific bottlenecks to achieving that versatile human-like cognition.
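For listeners who like to see the mechanics, here's a minimal sketch of how an equal-weighted, ten-domain scorecard like the one just described could be tallied. It's an illustration, not the paper's actual rubric: it assumes each domain is scored from 0 to 1, where 1 means roughly well-educated-adult proficiency and anything beyond that earns no extra credit, and the example numbers are made-up placeholders shaped like a deliberately jagged profile, not figures from the paper.

```python
# Minimal sketch of an equal-weighted, ten-domain AGI scorecard (illustrative only).
# Assumption: each domain is scored 0.0-1.0, where 1.0 = well-educated-adult proficiency.

DOMAINS = [
    "acquired_knowledge", "visual_processing", "auditory_processing",
    "reasoning", "working_memory", "speech", "reading_writing",
    "math", "memory_storage", "memory_retrieval",
]

def agi_score(domain_scores: dict[str, float]) -> float:
    """Average the ten equally weighted domains; each contributes at most 10 points."""
    total = 0.0
    for domain in DOMAINS:
        proficiency = domain_scores.get(domain, 0.0)   # missing domain counts as zero
        total += min(max(proficiency, 0.0), 1.0)       # superhuman spikes earn no extra credit
    return 100 * total / len(DOMAINS)

# Made-up profile: strong knowledge and output skills, weak perception and memory.
example_profile = {
    "acquired_knowledge": 1.0, "reading_writing": 0.9, "math": 1.0, "speech": 0.8,
    "reasoning": 0.6, "working_memory": 0.5, "visual_processing": 0.4,
    "auditory_processing": 0.3, "memory_storage": 0.1, "memory_retrieval": 0.2,
}
print(f"Overall score: {agi_score(example_profile):.0f}%")  # prints 58%
```

The cap on each domain is the structural point made above: a gold-medal math spike can't add more than its 10%, while a near-zero memory score drags the whole average down.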
00:23:37
Okay, let's get back to those scores. GPT-4 at 27% and then this huge leap to GPT-5 hitting 58%. That 31-point jump. It tells us something significant about how AI is progressing, doesn't it? It's not just slow, steady improvement.
00:23:53
Not at all. It suggests progress is happening in these big, somewhat discontinuous leaps, probably driven by major architectural changes or new training techniques, not just throwing more data and compute at the.
00:24:04
same old structure. And hitting 58%, I mean, psychologically, that feels like a big milestone, crossing the halfway mark towards this very rigorous human-centric definition of AGI.
00:24:14
It really does. It shifts the narrative. Instead of endlessly debating, are we even close? This framework says, okay, we're demonstrably more than halfway there on cognitive versatility. Now let's talk specifically about closing the remaining 42% gap.
00:24:27
And the way the framework revealed this progress is also really insightful. The paper uses this term, a highly jagged cognitive profile. What does that mean?
00:24:37
It means that these models aren't developing evenly across all 10 domains. They don't go from 20% everywhere to 30% everywhere. Instead, they make dramatic leaps in certain areas while remaining quite deficient in others. Picture a bar chart of the scores across the 10 domains. It wouldn't be flat. It would have really high peaks and deep valleys, jagged.
00:24:57
And that jagged pattern tells us something about, maybe, the priorities or the inherent difficulties in developing these different cognitive functions.
00:25:04
Both, probably. It shows where the research focus has been and also where the low-hanging fruit might be versus the really hard fundamental challenges.
00:25:13
Okay, let's break down that jagged profile for GPT-5. Where are the peaks? Where did the model score so highly that, according to this framework, further improvement there won't actually boost.
00:25:23
the overall AGI score much more? The high points, perhaps unsurprisingly, are in those knowledge-heavy domains and the core output skills. So for acquired knowledge, the Gc part, GPT-5 showed only pretty minor gains over GPT-4.
00:25:39
Why is that?
00:25:40
Because GPT-4 was likely already near the ceiling for the knowledge required by the tests used in the framework. Its training data was vast enough. You can't get much more than 10% for knowledge, even if you memorize more facts.
00:25:52
Okay, so knowledge was already high. Where did the big gains come from then, contributing to that jump from 27% to 58%?
00:25:59
The significant progress seems to have been in those operational output categories. Specifically, reading and writing saw improvements in comprehension, nuance, and generating complex text. And critically.
00:26:10
there were major gains reported in math. Ah, math. That's where we hear about these AI models achieving truly superhuman feats, right? Like winning gold medals in competitions.
00:26:19
Exactly. The sources backing this research mentioned that leading models, represented by GPT-5, and also potentially Google's Gemini 2.5 Pro, have reached performance levels equivalent to gold medals in really tough contests. They specifically cited the International Mathematical.
00:26:35
Olympiad, IMO, for math. Wow, the IMO. That's incredibly difficult, even for the brightest.
00:26:42
humans. And also the International Collegiate Programming Contest, ICPC, for coding, which falls under that math reasoning umbrella in this framework. So yes, truly exceptional peak performance in these specific skill-intensive domains. Beating the world's best students.
00:26:56
that sounds like it should max out the score for math, right? Get the full 10%. Yep.
00:27:00
It likely does get close to or achieve the full 10% for the math component, yes. But here's the crucial insight from this framework structure. Achieving even higher peak skill in math or coding, going from gold medal level to, say, inventing entirely new branches of mathematics, that actually doesn't move the overall AGI score much further, if at all.
00:27:20
Well, wait, why not? If it's getting even smarter at math, shouldn't the score keep going up?
00:27:24
Because the framework isn't designed to measure superhuman spikes. It's designed to measure broad, human-level versatility. The definition is benchmarked against a well-educated adult. Once a model reliably performs math tasks at that level, which IMO gold certainly implies, it has effectively ticked the box for that 10% cognitive domain.
00:27:47
So the framework caps the contribution of each domain at human proficiency.
00:27:51
Essentially, yes. The researchers argue that for general intelligence, demonstrating solid, reliable human-level competence across all 10 areas is the goal. Further optimization within one single domain, while impressive, becomes strategically less important for achieving general intelligence. It doesn't make you more versatile.
00:28:10
Okay, that's a really key point. The goal isn't to create a godlike mathematician. It's to create.
00:28:14
a well-rounded cognitive entity. Precisely. And that explains why GPT-5 jumped so dramatically from 27% to 58%. It wasn't primarily because it got massively better at the things GPT-4 was already okay at, like reading and writing. It was because it started scoring.
00:28:29
points in areas where GPT-4 scored zero. Exactly. That's where the bulk of the 31-point.
00:28:34
gain came from. GPT-5 achieved nascent capabilities, not necessarily expert proficiency yet, but demonstrating functional competence for the first time in multiple critical domains where GPT-4 was entirely deficient. Which domains were these? Where did GPT-5.
00:28:50
show these new emerging capabilities? According to the analysis based on the paper.
00:28:54
GPT-5 was the first model in this lineage to score meaningfully in several of the harder, more fluid-intelligence or perception areas. This includes reasoning, working memory, memory retrieval, distinct from storage, we'll come back to that, visual processing, and auditory processing.
00:29:10
Wow. So five brand new categories lit up on the scorecard, even if only dimly.
00:29:14
Right. The paper describes these emerging capabilities as still nascent or maybe rudimentary compared to robust human function. But simply having them, scoring even a few points in each of these five previously zero-scoring 10% domains, is what provided that massive jump.
00:29:29
in overall cognitive versatility. That structural insight really changes how you think about progress. Developing five basic new senses or reasoning skills worth potentially 50% total is far more valuable for reaching general intelligence than perfecting one existing skill.
00:29:46
worth only 10% total. That's the core argument, yes. It suggests the frontier of AI development, at least as pursued by labs like OpenAI, is strategically shifting, moving beyond just brute-forcing better text generation and coding, the high peaks on the jagged profile, towards deliberately filling in the cognitive valleys.
00:30:06
Establishing those foundational abilities to perceive, reason about, and remember the world more like a human generalist, even if those new abilities are still rough.
00:30:14
Exactly. So that 58% score, it's fundamentally a measure of increasing breadth and versatility, much more than it is a measure of peak performance in any one skill.
00:30:23
Okay, so the scorecard gives us a much clearer picture of what AI can do, the peaks it has conquered. But its real power might be in highlighting what's still missing, the valleys. And the paper apparently points to one area in particular as being, and I think this is a direct quote they used, the biggest hole by a mile.
00:30:39
That's the one. Memory. Specifically, long-term memory storage and reliable retrieval. The analysis stemming from the paper describes this as potentially the single most significant bottleneck remaining to achieve the cognitive definition of AGI laid out in the framework.
00:30:55
This seems to be the bedrock of the skeptical case against super short AGI timelines, then.
00:31:00
It's certainly a huge part of it. If these incredibly powerful models scoring 58% on cognitive versatility still have fundamental flaws in learning and remembering, that's a major hurdle.
00:31:11
But hang on, if they're already so good at acquired knowledge, Gc, and their working memory is improving for in-session context, why is this other kind of memory, this long-term storage and retrieval, still such a profound foundational problem?
00:31:25
Because it's a different kind of deficit. It's not about accessing the petabytes of information they were trained on. It's about deficits in the core cognitive machinery needed for new, durable learning based on ongoing experience. Current systems are generally very poor at reliably forming lasting memories of novel facts, or specific user preferences, or context from previous conversations over extended periods.
00:31:49
Okay, but wait a second. When I use a modern chatbot like ChatGPT or Claude, it often does seem to remember what we were just talking about even minutes ago, sometimes even across short breaks. How? How are they managing that if their underlying memory is so flawed?
00:32:03
That's a great point, and it leads to how these systems currently... Well, critics like Rohan Paul, who analyzed this paper, essentially say they fake memory. They simulate short-term recall using a couple of clever but ultimately limited tricks.
00:32:16
Faking memory? How?
00:32:18
The primary way is through massive context windows. When you interact with the model, it's often feeding a huge chunk of your recent conversation history back into itself with every single turn. So it remembers the last, say, 100 pages of dialogue because it just re-read them.
00:32:32
Ah, so it's not true memory. It's like constantly refreshing a very large short-term cache.
00:32:38
Exactly. It mimics retention within a session, but it's computationally expensive. It limits the complexity and length of the interaction eventually. And crucially, that information isn't typically stored permanently or integrated deeply into the model's core knowledge. It's just temporary context.
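As a toy illustration of that context-window trick, here's a minimal, purely hypothetical sketch of a chat loop that simulates memory by re-feeding recent history on every turn and forgets everything when the session ends. The call_model function is a stand-in, not any real API.

```python
# Toy sketch: "memory" via context-window stuffing. Nothing persists across sessions;
# the model only "remembers" what gets re-sent with each turn.

MAX_CONTEXT_CHARS = 8_000  # stand-in for a token limit

def call_model(prompt: str) -> str:
    """Placeholder for an LLM call, not a real API."""
    return f"(reply based on {len(prompt)} characters of re-sent context)"

def chat_session(user_turns: list[str]) -> list[str]:
    history: list[str] = []                    # lives only as long as this session
    replies = []
    for turn in user_turns:
        history.append(f"User: {turn}")
        context = "\n".join(history)[-MAX_CONTEXT_CHARS:]   # oldest turns silently fall off
        reply = call_model(context)
        history.append(f"Assistant: {reply}")
        replies.append(reply)
    return replies                             # once this returns, the "memory" is gone

chat_session(["My name is Sam.", "What's my name?"])  # works: the name is still in the window
chat_session(["What's my name?"])                     # new session: no trace of the earlier fact
```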
00:32:54
Okay, so the context window is like the model's RAM, its active processing, but not its long-term hard...
00:33:00
drive storage. That's a good analogy. The second main way they fake memory or knowledge recall is by heavily using external tools, often through something called retrieval augmented generation.
00:33:12
or RAG. RAG, right. That's where the model can quickly look things up in an external database.
00:33:17
or on the web. Yes. It can retrieve specific up-to-date facts from outside its internal parameters. This makes it seem incredibly knowledgeable and current, but again, it's retrieving information, not demonstrating that it has learned and stored that information internally in a reliable way. It cleverly hides the gaps in its own intrinsic memory.
00:33:36
And the practical consequence of relying on these faked memory approaches, well, it's something most users have probably run into. Oh, absolutely. Think about trying to.
00:33:46
actually personalize an LLM assistant. You spend time teaching it about your specific work.
00:33:52
your writing style, your preferences. Yeah, I've definitely done that. Poured hours into custom instructions or sample documents. And then what happens the next day?
00:34:00
Or even later in the same complex session, poof, it's gone back to its generic default. It asks you basic questions about the very things you thought you taught it. That frustrating experience of the model forgetting core context or personalization, that's the real-world impact of this fundamental memory storage bottleneck.
00:34:18
It makes it really hard to rely on them as truly consistent, knowledgeable partners or agents over time if they can't build a stable memory of your interactions and preferences across sessions.
00:34:28
It fundamentally limits dependable learning and true personalization that lasts days, weeks, or longer. It can't build a persistent model of you or the ongoing project.
00:34:39
And the research underpinning this 58% score confirmed this is still a major issue, even for the latest models like GPT-5.
00:34:46
Yes. The assessment indicated that both GPT-4 and the model representing GPT-5 struggle significantly with forming lasting memories across sessions. And crucially, even when they try to retrieve information, whether internally or from those external stores, the recall often isn't reliable.
00:35:05
So the retrieval isn't reliable either. It's not just storage. It's accessing it cleanly.
00:35:11
Correct. If the system can't dependably learn new things and recall them accurately without making stuff up, it fundamentally undermines its potential as a trustworthy autonomous agent for complex, long-term tasks.
00:35:23
Okay, memory is clearly the biggest hole. But it's likely not the only significant deficit holding back the score from reaching 100%, right? What other basic cognitive flaws did the researchers or critics like Hendrycks point to?
00:35:35
Hendrycks himself has highlighted a suite of remaining cognitive shortcomings, even in these advanced models. The hallucination problem is linked to memory and retrieval, but it's still a major issue in its own right.
00:35:46
Right, just making things up.
00:35:47
There's also still limited inductive reasoning, the ability to generalize robustly from specific examples to broader rules or principles. It can sometimes do it, but not as reliably or flexibly as humans. They also have quite limited world models, that intuitive understanding of physics, causality, and common sense that humans use constantly.
00:36:08
And the lack of true continual learning is tied into the memory problem too, isn't it?
00:36:13
Absolutely. Because they lack reliable, efficient, long-term memory storage mechanisms, these models generally can't learn incrementally from new data streams in real time to truly update their core knowledge or capabilities in a deep way.
00:36:27
They need these massive, expensive retraining cycles from scratch or close to it.
00:36:33
Exactly. They don't learn continuously and adaptively in the same way a human does throughout their life. This structural rigidity, this inability to truly learn on the fly and integrate new knowledge permanently without catastrophic forgetting or needing a full reboot, is another major architectural barrier.
00:36:47
So, while areas like reading, writing, and maybe math might seem largely solved within this framework's definition of human proficiency, these deeper architectural issues around memory, reliable reasoning, and continual learning clearly aren't. Okay, so we've journeyed through the messy landscape of AGI definitions, explored this new quantifiable framework based on human cognition, analyzed the jagged 58% score for GPT-5, and.
00:37:23
zeroed in on the critical deficits, especially memory. What's the bottom line here? What does this 58% score and the framework behind it ultimately mean for us, for the listener?
00:37:33
Well, I think the biggest value here isn't just the specific number of 58%. It might be 62% next year. Who knows? The real value is in establishing this rigorous, trackable, numeric scorecard in the first place.
00:37:44
Moving beyond guesswork and hype.
00:37:46
Exactly. It transforms AGI development from this vague quest for smarter into something more like a checklist, a very complex, demanding checklist based on human cognition, but a checklist nonetheless.
00:37:57
So the era of just kind of hitting these fuzzy philosophical walls, wondering if we're close, that might be ending.
00:38:05
That seems to be the hope. Developers using this kind of framework know much more precisely which cognitive ground still needs to be conquered to reach that par score of 100% AGI, according to the specific definition. The focus shifts. It's less, are we smart enough yet? And more, okay, we know we're still weak on auditory perception. Visual understanding needs work. And crucially, we have to solve the memory storage and retrieval bottleneck.
00:38:29
It provides clear targets for research and engineering. However, we need to bring back that important caveat, right? The limitation that critics like Rohan Paul pointed out.
00:38:38
We absolutely do. It's crucial context. This framework, by its own design, focuses purely on cognitive ability. It defines and measures intelligence as a disembodied mind, benchmarked against human thinking skills. But it doesn't directly measure things like motor control, physical dexterity, the ability to interact with the real world through robotics. And importantly, it doesn't directly measure the ability to generate economic output or business value.
00:39:04
Ah, okay. So a system could theoretically score 99% on this cognitive AGI scale.
00:39:10
And still be unable to assemble a product on a factory line, or drive a car safely, or even reliably perform complex business tasks that require unwavering consistency in memory over weeks. A high score here guarantees a sophisticated mind, but not necessarily a useful or profitable worker in the way Karpathy or that Microsoft contract defined it.
00:39:30
That really brings us to the crux of the ongoing debate, doesn't it? We have these two diverging paths or definitions emerging. On one hand, the academic cognitive definition from Hendrycks and the Center for AI Safety, aiming for a system with human-level cognitive versatility, a sophisticated thinker.
00:39:46
And on the other hand, we have the economic proxy definitions, whether it's Karpathy's incredibly high bar of all valuable human tasks, or the blunt $100 billion profit target from the Microsoft contract. These define AGI not by its internal cognitive profile, but by its external impact, a highly effective, potentially highly profitable.
00:40:05
worker or system. The cognitive gap, according to this framework, is closing faster than many thought. Jumping from 27% to 58% is a huge leap towards that 100% human cognitive baseline.
00:40:16
It is. Progress on the mind front seems rapid, but the ultimate economic value, the real world disruption, might hinge less on hitting 100% cognitive ability and more on achieving true dependability. Can these systems become reliable, autonomous agents that don't hallucinate and actually remember what they're supposed to do? So, the final provocative thought.
00:40:34
for you, the listener, whether you're using these tools, developing them, or investing in the companies building them, is this. As these AI models continue to climb that cognitive ladder, maybe towards 70%, 80%, even 100% on this CHC-based scale, will their real impact on the economy, on jobs, on society, be driven primarily by that raw cognitive score?
00:40:58
Or will it be determined by whether they finally overcome that fundamental memory bottleneck, that critical inability to learn reliably and retrieve dependably, which currently prevents them from becoming truly trustworthy, autonomous, and ultimately, perhaps, profitable workers in the broadest sense?
00:41:15
It feels like the next phase of the race isn't just for raw intelligence points anymore.
00:41:19
It seems increasingly like a focused, maybe even desperate battle for foundational dependability. That might be the missing piece of cognitive ground that separates 58% cognitive potential from true economic AGI, however you choose to define it.
00:41:33
Dependability. That's a powerful note to end on. Thank you for guiding us through this incredibly complex but crucial deep dive.
00:41:39
My pleasure. It's certainly a fascinating time to be watching this unfold.
00:41:42
Yeah. And thank you for joining us on the Deep Dive. We'll catch you next time.
00:41:45
Thanks for sticking with me through this whirlwind tour. If you found this helpful, hit subscribe, and I'll catch you next time on Everyday AI Made Simple.