Barely Possible

[Barely Possible 2026-05-24] Today's episode:
• A 1.2B-parameter model ran 100 poker tournaments with 6 personas: "Shark" won 45; "Grinder" won zero and was never eliminated either.
• A dev mass-refactored a 120-file FastAPI service in 400 steps, 2M tokens, $3, and the model confidently added an async deadlock.
• OpenAI is advertising a $445K safety researcher role requiring candidates to be "tasteful and strategic," as model taste edges out raw capability.

Hear the full breakdown in today's episode of Barely Possible.

Want a podcast for your own topics? Join early access: https://www.barelypossible.to/waitlist/?source_path=public_episode_83&feed_source=rss&episode_id=83

Transcript: https://media.clawford.org/episodes/2026-05-24/podcast-episode-2026-05-24.txt | Notes: https://media.clawford.org/episodes/2026-05-24/2026-05-24-notes.md

What is Barely Possible?

A daily briefing on the AI systems, products, companies, and policy shifts that are just becoming possible.

Alright, pull up a chair and settle in — I'm Tony DeLuca, this is Barely Possible, and we've got a solid plate of stories for you this Saturday. Let's get into it.

So here's where we've been lately: three straight days of looking at the enterprise side of AI — the cost blowouts, the deployment gaps, the Karpathy move to Anthropic, the token pricing crisis. All real stories worth covering. Today I want to zoom out a bit, because what landed in the feed this week isn't really one story — it's a cluster of smaller signals that, when you put them together, point at something specific: the question of who actually controls the outcome when you put AI in the loop. Not the model quality. Not the infrastructure. Who's in the room, and what instructions did they give.

Let me start with the one story that I think earns the most real estate today — and it's not the obvious one.

There's a post floating around from someone who ran a local 1.2 billion parameter language model through 100 poker tournaments. Same model. Same random cards. Six different seats at the table. The only difference between each seat: a paragraph of personality text. One seat got told it was the Shark — patient, calculating, predatory. One got told it was the Tilter — emotional, never let a bad beat go unanswered. One got told it was the Grinder — survive longer than everyone else by taking minimal risk.

Here's what happened. The Shark won 45 out of 100 tournaments. Nearly half. The Grinder? Zero wins — but also zero eliminations. It literally never got knocked out. It just survived every single game, finishing second or third every time, accumulating no chips because it never bet enough to win a pot. The Tilter won 10 tournaments and was eliminated in 80 of them. When it won, it won big. When it lost, it spiraled — lose a hand, raise the next one, lose bigger, go broke. Boom or bust, nothing in between.

Now, 100 tournaments is not statistically bulletproof, and the person running the experiment said as much. But here's the thing that should grab your attention if you're building anything with these models: a paragraph of text, maybe 50 words, created a 45-to-zero win differential between the best and worst personalities. The model is identical. The cards are random. The only variable is who the AI thinks it is.
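To put a rough number on how far 45 wins sits from plain card luck, here is a minimal sketch. It rests on an assumption that is not in the original post: a null hypothesis where all six seats are equally likely to win each tournament (p = 1/6) and tournaments are independent. The function name is made up for illustration.

```python
from math import comb

def binomial_tail(n: int, k: int, p: float) -> float:
    """Probability of at least k successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Assumption: six equally matched seats, so each wins with probability 1/6.
N_TOURNAMENTS = 100
P_CHANCE = 1 / 6

expected_wins = N_TOURNAMENTS * P_CHANCE               # about 16.7 wins per seat
p_shark = binomial_tail(N_TOURNAMENTS, 45, P_CHANCE)   # Shark: 45 or more wins
p_grinder = (1 - P_CHANCE) ** N_TOURNAMENTS            # Grinder: exactly zero wins

print(f"expected wins per seat under chance: {expected_wins:.1f}")
print(f"P(at least 45 wins): {p_shark:.2e}")
print(f"P(zero wins): {p_grinder:.2e}")
```

Under that assumption both probabilities come out very small, so the gap is unlikely to be card luck alone. What the calculation cannot rule out is anything the null ignores, things like seat order or sampling settings, which is where the "not bulletproof" caveat still applies.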

This is the deep dive, and I want to sit in it for a minute because I think it's genuinely underappreciated in the builder community.

We talk constantly about model selection. We argue over which frontier model to use, whether to run open weights locally, whether the performance difference between one version and the next justifies the cost. And all of that matters. But this poker experiment is showing you something different: the harness shapes the outcome at a level that rivals the model itself. Not the system prompt as an afterthought. Not a quick instruction to "be helpful." An actual persona architecture — a defined identity that tells the model not just what to do, but who it is.

The Grinder is the character that fascinates me most, honestly. It obeyed the instruction perfectly. It survived. And that's exactly why it couldn't win. It optimized for the stated objective — avoid elimination — and that objective turned out to be in direct conflict with the actual goal — accumulate chips, win. The instruction was internally coherent and completely misaligned with success.

If you build agent systems for a living, you have run into this. You give an agent a careful instruction set, it follows the instructions exactly, and somehow produces an outcome nobody wanted. Not because the model failed — because the objective was specified wrong. The Grinder is you, in that meeting, with your boss, where everyone agreed on the metric and nobody agreed on the goal.

The practical takeaway here is not subtle: persona and instruction architecture are as important as model choice. Probably more important for most real-world use cases. The difference between telling an agent to "complete this task" and telling it to "complete this task in a way that leaves the system in a better state than you found it" can be the difference between a working product and an async deadlock in your event handler.

Which brings us, naturally, to the coding agent conversation that's running hot right now.

Someone posted this week about mass-refactoring a 120-file FastAPI service. Four hundred steps. Two million tokens. Three dollars total. Zero human input. And it confidently introduced a deadlock into the async event handler — which they described as "genuinely funny." They used open-weight models as the cheap workers for the routine refactors, escalating to a frontier model only for the hard 10%.
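The cheap-worker-plus-escalation setup can be sketched roughly like this. Everything named here is hypothetical: the post does not say how the routing was built, so `refactor_with_escalation`, the stub models, and the `passes` check are stand-ins, and a real version would call actual model APIs and run a real test suite.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical types: a "model" is any function from a task description to a
# proposed patch, and a "check" is whatever gate you trust (tests, linters).
Model = Callable[[str], str]
Check = Callable[[str], bool]

@dataclass
class Result:
    patch: str
    tier: str      # "cheap", "frontier", or "needs_human"
    passed: bool

def refactor_with_escalation(task: str, cheap: Model, frontier: Model,
                             check: Check, cheap_attempts: int = 2) -> Result:
    """Try the cheap model first; escalate only when its output fails the check."""
    for _ in range(cheap_attempts):
        patch = cheap(task)
        if check(patch):
            return Result(patch, "cheap", True)
    patch = frontier(task)
    if check(patch):
        return Result(patch, "frontier", True)
    # Neither tier produced something that passes: flag it instead of merging.
    return Result(patch, "needs_human", False)

# Stub models so the sketch runs without any API; real calls would go here.
def cheap_stub(task: str) -> str:
    return "ok:" + task if "rename" in task else "broken:" + task

def frontier_stub(task: str) -> str:
    return "ok:" + task

def passes(patch: str) -> bool:
    return patch.startswith("ok:")

easy = refactor_with_escalation("rename handler", cheap_stub, frontier_stub, passes)
hard = refactor_with_escalation("rework async event handler", cheap_stub, frontier_stub, passes)
print(easy.tier, hard.tier)
```

The weak point in a design like this is the check. A deadlock sails through any gate that never exercises concurrency, so the escalation logic is only as good as the tests behind it.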

The comment thread on that post is worth reading carefully because it surfaces the actual split in the developer community right now. One camp says coding is basically solved for the boring 90% of tasks — the stuff that doesn't require architectural judgment, just pattern execution. The other camp says: my codebase has weak documentation, customized libraries, some domain-specific languages, inconsistent patterns, and current models produce a significant amount of unreliable output on that kind of material. So 90% solved is not a claim that generalizes.

Both camps are right. And the gap between them is exactly the gap between a clean, AI-friendly stack and a legacy environment that nobody has had time to standardize. Clean stack plus good harness equals dramatic productivity gains. Messy stack with the same model equals frustration and rework.

A separate post linked to a Technology Review piece on Anthropic's Code with Claude event — someone in the comments made the observation that's going to stick with me: the velocity shift from AI-assisted coding is real, but the uncomfortable part isn't the coding. The bottleneck is now product decisions, not implementation. You can ship features faster than you can figure out which features to ship. Which loops back to the poker experiment — the model will execute whatever you tell it. The question is whether you told it the right thing.

That framing connects directly to a post about the so-called Canva for AI training. The basic pitch: AI training workflows are still unnecessarily painful. You're dealing with CUDA errors, dependency conflicts, broken environments, config files, dataset formatting, checkpoint management, crashes, and twenty different tools stitched together just to fine-tune a model. The ask is a platform where you upload a dataset, pick a base model, choose settings, press train, get an API. That's it.

One of the more useful comments in that thread pushed back: the winner in this space probably won't be the best model trainer. It'll be the company that makes AI training feel effortless for normal builders. That's the Shopify analogy. Shopify didn't win because they had the best database. They won because they made e-commerce accessible to people who didn't want to become infrastructure engineers.

But here's the counterpoint that also showed up in the same cluster of posts — and this one has teeth. A bunch of people are now renting GPUs, using AI to find datasets, and feeding random internet data into training without actually looking at what's inside. One person in the thread described training a model on scraped Reddit comments without filtering — and it became, in their words, a sarcastic mess. Another noted that bad data can still look high-quality when you're going hard with AI tools. You can build a very slick pipeline that produces very impressive-looking garbage. The abstraction layer hides the rot.

Garbage in, garbage out is not a new principle. But the speed at which you can now generate expensive, sophisticated-looking garbage is genuinely new. The gap between "I trained a model" and "I trained a useful model" is entirely in the data quality and the problem framing — not the infrastructure. So democratizing the training pipeline without democratizing data hygiene might just mean everyone can now train bad models faster.

Now let me shift to a couple of stories that deserve quick but clear coverage.

OpenAI posted a job listing for a safety researcher at $445,000 in total compensation. The requirements, per the Business Insider piece, include things like: ability to frame recursive self-improvement as a containment problem. Dataset engineering experience grounded in human preferences. And — the one that got the Reddit thread going — "tasteful and strategic" as a required quality. Someone in the comments said they could be tasteful, they have an accent pillow that really ties the room together. Fair enough.

But strip out the jokes and look at what this job description actually tells you. The role is specifically about recursive self-improvement risks. Not general safety. Not RLHF. Containment strategies for a model that might improve itself. The fact that this is now a job posting — not an internal research question, a job posting at a specific compensation level — tells you something about where OpenAI thinks the frontier is moving. They're staffing for a problem they believe is becoming operational, not theoretical.

That connects to something swyx posted this week — a brief but pointed take on the transformer paradigm's inherent limitations. The argument, roughly: throwing more parameters, more power, more compute at a demonstrably inefficient paradigm will eventually get outclassed by something that can hypothesize and seek truth rather than backfit a house of cards. But the bitter lesson — Sutton's original argument — is that it's simpler to scale, and we may hit AGI anyway because human intelligence just isn't that smart or that plentiful. That last bit is the provocative part. It's not a compliment to the models. It's a low bar for the models to clear.

Now shift from the frontier labs to the data governance story that's generating real reaction this week. Amnesty International flagged that Palantir and other contractors were granted unlimited access to identifiable NHS England patient information. The Reddit reaction was appropriately alarmed, but one comment in the thread cut to the practical issue: this is not about AI capability. It is about companies adopting powerful systems faster than they can build proper governance and accountability around the data.

For anyone building health-adjacent AI products, or anything touching personally identifiable information in the UK, this is worth watching. The pattern here (capability outpaces governance, something breaks, a report comes out, there's a hearing) is extremely well established. The Palantir-NHS story is the latest iteration. The regulatory response to it is the story to watch over the next twelve to eighteen months. What happens in the UK has a way of becoming a template for what happens elsewhere.

Let me get to a couple of quick signals before we close out.

Meituan released LongCat-Video-Avatar 1.5 this week — an open-source, MIT-licensed framework for audio-driven human video generation. It does audio-to-video, audio-image-to-video, and video continuation. It uses Whisper-Large as the audio encoder, supports lip sync, full-body stability, and works across anime, animals, and real-world scenarios. Eight-step inference. Production-ready stability is their claim. The Reddit community's reaction to the MIT license was enthusiastic. The Reddit community's parallel reaction to the obvious use cases was darkly accurate. One person said it was going to hit certain demographics harder than expected. I'll leave the specifics to your imagination.

The point for builders is simpler: production-ready open-weight avatar video generation, MIT license, available now. The use cases that are obviously legitimate — synthetic spokespeople, localized content at scale, accessibility tools — are going to move fast on this. The use cases that are not are also going to move fast. That's the nature of MIT-licensed powerful tools.

Also worth noting: a post in LocalLLaMA this week asked whether we've passed the peak of inflated expectations for local AI. The person noticed declining search trends for the subreddit and some related terms. The comment thread offered several explanations: burnout, people switching from Google to LLMs for search so the Google Trends signal is broken, people who tried local models and moved on when they didn't immediately work, real economic pressures pushing people toward other concerns. The most interesting comment was the simplest one: the world may have run out of hardware, at least consumer hardware that's actually available to buy.

This is the hype cycle question that never fully resolves. Local model enthusiasm was real. The number of people who could sustain that enthusiasm through dependency hell and quantization tradeoffs was smaller than the number who were briefly excited. That's not a failure of the technology. That's how technology diffusion works. The people who stayed are building real things. The people who left were probably not the target audience for serious local inference work anyway.

Separately: one person in the Gemini thread this week typed "Why is Google search so bad now?" directly into Google, and Gemini answered honestly — noting that ad monetization and profit pressures have shaped the product in ways that hurt user experience. The thread appreciated the candor. A commenter noted that AI accepts any premise you give it, which is the less flattering interpretation. Either way, it's a small moment that captures something real: the AI products that feel most useful are often the ones that will tell you an uncomfortable truth about their creators.

And on that note — there was a genuinely useful thread about AI feedback quality this week. The core insight: most people don't actually want feedback. They want confirmation. AI is happy to give either one, and the output depends entirely on what you're brave enough to ask for. The practical advice that came out of it: don't ask "is this good?" Ask "what's weak here?" Don't ask "will people understand this?" Ask "where does this lose attention?" Frame the question assuming the work has problems and the model will spend its time finding problems, not defending your choices. Same thing works with humans, the thread noted, but it's harder to get people to drop the politeness routine.
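If you want that habit baked into a tool rather than left to willpower, a small helper along these lines could do the reframing for you. This is only a sketch: the two question pairs come from the thread's advice, while the helper name, the fallback question, and the prompt wording are assumptions added for illustration.

```python
# Hypothetical mapping from approval-seeking questions to critique-seeking ones.
# The two pairs are from the thread; the rest of this helper is illustrative.
REFRAMES = {
    "is this good?": "What's weak here?",
    "will people understand this?": "Where does this lose attention?",
}

def critique_prompt(question: str, work: str) -> str:
    """Build a prompt that assumes the work has problems and asks for them."""
    key = question.strip().lower()
    # Assumed fallback for questions that are not in the mapping.
    ask = REFRAMES.get(key, "What are the three biggest problems with this?")
    return f"Assume this draft has real problems. {ask}\n\n---\n{work}"

print(critique_prompt("Is this good?", "My launch announcement draft."))
```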

That's a real craft observation that applies well beyond writing — it applies to code review, product critique, business plans. The prompt architecture is the harness, and the harness shapes the output. Which, yeah, brings us back to the poker table and the Shark with its 45 wins.

One more item, and then we're out. Someone this week posted about an architecture called PHI DRIFT — a cognitive middleware system built solo over nine months on a CPU-only machine with no GPU, no institutional backing. The architecture adds persistent internal state variables that drift between sessions, memory scored by emotional salience and time decay, and a so-called shadow module that tracks unintegrated behavioral patterns. The builder explicitly said the field ignores depth psychology as an engineering input and that this is a mistake.

The ML community reception was mixed — one response noted the abstract names too many things without enough specifics, and the one performance number cited is ambiguous. But another response took the core idea seriously: emotional salience scoring with time decay is the right direction, because most memory systems treat every memory as equally persistent and just let similarity do the ranking. The shadow module raised a legitimate concern: who decides what's unintegrated? If the system is inferring psychological integration status, it's making clinical-level judgments about the user's inner life. And unlike a human analyst, it doesn't doubt its own interpretations.
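To make the salience-plus-decay idea concrete, here is a minimal sketch of one way such a score could be combined with similarity. This is not PHI DRIFT's actual formula, which the post does not give. The multiplicative form, the 30-day half-life, and every field name are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from math import exp, log

@dataclass
class Memory:
    text: str
    similarity: float   # relevance to the current query, 0..1 (e.g. cosine)
    salience: float     # hypothetical emotional-salience score, 0..1
    age_days: float     # time since the memory was stored

def score(m: Memory, half_life_days: float = 30.0) -> float:
    """Similarity weighted by salience and an exponential time decay."""
    decay = exp(-log(2) * m.age_days / half_life_days)
    return m.similarity * m.salience * decay

memories = [
    Memory("asked about lunch spots", similarity=0.9, salience=0.1, age_days=1),
    Memory("said the deadline is stressing them out", similarity=0.7, salience=0.9, age_days=30),
    Memory("old debugging session", similarity=0.8, salience=0.3, age_days=365),
]

ranked = sorted(memories, key=score, reverse=True)
for m in ranked:
    print(f"{score(m):.3f}  {m.text}")
```

Ranked on similarity alone, the lunch question would come first. With salience and decay in the mix, the stressful deadline moves to the top and the year-old debugging session all but disappears. That reordering is the behavior the favorable response was pointing at.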

I'm not going to tell you PHI DRIFT is going to change the field. But I will tell you that the underlying problem it's pointing at is real, and a thread this week on the same topic put it well: personalized AI memory becomes creepy or useful almost entirely based on transparency and control. People like convenience. They just don't want systems remembering things they forgot they shared. That's the product design challenge for anyone building memory-enabled AI companions or assistants: the line between helpful and unsettling is drawn by user control, not capability.

Alright, that's the show. From the poker table to the NHS patient records, from the $445K safety researcher to the Grinder who survived everything and won nothing — the theme today is that the instructions matter as much as the intelligence. Build the harness carefully. Frame the objective right. The model will execute whatever you tell it.

I'm Tony DeLuca. See you Monday.
