Barely Possible

[Barely Possible 2026-05-29] Today's episode:

• An r/artificial user claims an 8-day zero-dollar agent on AWS t3.micro + Oracle credits, with a 9-provider fallback chain from owl-alpha...
• Stealth model owl-alpha allegedly survived 70M tokens before its first rate limit; big-pickle rumored to be DeepSeek v4 on OpenRouter.
• An indie builder admits their product's entire knowledge layer is rented across nine landlords: a moat made of routing logic, not weights.

Hear the full breakdown in today's episode of Barely Possible.

Want a podcast for your own topics? Join early access: https://www.barelypossible.to/waitlist/?source_path=public_episode_88&feed_source=rss&episode_id=88

Transcript: https://media.clawford.org/episodes/2026-05-29/podcast-episode-2026-05-29.txt
Notes: https://media.clawford.org/episodes/2026-05-29/2026-05-29-notes.md

What is Barely Possible?

A daily briefing on the AI systems, products, companies, and policy shifts that are just becoming possible.

Okay kiddos, I'm your boy Tony DeLuca, and today's menu is a little different. Today's stack is heavy on the workshop floor — the indie tinkerers, the kernel hackers, the kids in the basement wiring up free-tier accounts into something that looks suspiciously like a real product. And there's a thread running through all of it that I want to pull on for the next hour or so. So pour the coffee, settle in, and let's get into it.

Here's the frame I keep coming back to. We've spent the last two weeks on this show talking about the big iron of agent labs. Yesterday it was swyx's thesis on Cognition as the largest independent agent lab in the world. The day before that, we were on memory governance and curator agents — the grown-up plumbing problem of how you keep multi-agent systems from drowning in their own context. Big labs, big infrastructure, big checks.

Today's content is the exact opposite end of the barbell. Today we're talking about the people who don't have the check. The Reddit posters running everything on a t3.micro and a coffee mug full of free trial credits. The undergrads benchmarking against IBM's quantum press releases. The solo researcher writing a fused MoE kernel in Triton because they don't want to be locked to one GPU vendor. And what I think is interesting — and what I want to spend most of today on — is that the gap between those two worlds is doing something weird. The bottom of the market is getting more capable faster than the top is getting cheaper, and that's starting to bend how people build.

So that's the menu. We'll start with the deep dive: a guy on r/artificial who claims he built a fully working personal agent for zero dollars by stitching together free tiers across about nine providers. Then we'll get into the image generator question, because there's a comment buried in one of those threads that, honestly, is the most useful piece of business advice I've seen all week. Then we'll hit a few of the indie research drops (a portable GPU instruction set, a Triton MoE kernel, an agent-memory benchmark) and what they tell us about where the moat actually lives. Then we'll close with the noise: the absurdist superintelligence post, the IBM quantum nonsense, the Golden Gate Claude anniversary, and the epistemic-infrastructure essay that I do want to take seriously even though it's been sitting in the discourse for a while.

Let's dig into the zero-cost agent build. The post is from a user named king0mar22 on r/artificial, titled, How I build my own zero cost Agent. And before any of you email me about treating a Reddit how-to as a news story, I want to be clear about why I'm leading with this. It's not because the specific stack matters. It's because what this person actually built is a pretty good x-ray of where the cheap end of the agent market is in late May 2026. So let's walk through it.

The guy starts on AWS EC2 free tier: a t3.micro, fifty gigs of storage. Then he spins up an Oracle Cloud instance using their three hundred dollars in free credits for experimenting with local models. He's SSHing into all of this from his phone using Termius. So already, the substrate is, quote, zero dollars, but really it's, you know, they gave you the credits, you used the credits, fine. He starts with something called OpenClaw, hates it, finds a YouTube video by a guy called NetworkChuck recommending something called Hermes Agent, switches to that, says it had a migrate-from-OpenClaw button, was up and running in minutes. Fine. That's all setup.

Here's the part that I think matters. Quote: The hardest part is the rate limits. If you use cloud models especially for code, you hit a wall fast. My solution? The Fallback Chain. End quote. And then he describes his fallback chain, and I want to read it out because it's wild. He starts with a stealth model on OpenRouter called owl-alpha. He speculates it's a flagship being tested, and notes that another stealth model called big-pickle is allegedly DeepSeek v4. He says owl-alpha gives him a 1M context window. Then he layers in Google's Gemma-4-31b-it three different ways: once through OpenRouter with his AI Studio API key, once through Ollama Cloud, and once through Google OAuth directly. Quote, basically Gemma 4 three times, end quote. When those die, he falls through to Qwen3-coder-next on Alibaba's free tier, which he claims gives him a million free tokens per model and there are quote like 80 of them. Then Nova on AWS Bedrock. Then DeepSeek v4 on Azure and on something called Opencode Zen. Then Claude Haiku through GitHub. And finally, as the bottom of the stack, owl-alpha again, which he says took almost 70 million tokens before it rate-limited him once.

And I want to say two things about this, one technical, one philosophical.

The technical thing: I'm not validating any of those specific provider claims. I have no idea if owl-alpha really survived 70 million tokens, no idea if Alibaba really hands out a million tokens across eighty models. Take it as a single user's claim. But the shape of it, a fallback router across eight or nine providers, is real, and it's becoming a pattern. Another comment in that thread, from a builder named Venice-and-opencode-go, says they're nearly done open-sourcing a router that does exactly this kind of multi-provider failover with a mix of local, free, and prepaid budgets. We've covered the enterprise version of this for weeks: the routing layer is starting to matter as much as the model.
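To make the shape of that fallback chain concrete, here is a minimal sketch in Python. It is not the poster's code and not any router's real API: the `Provider` wrapper, the `RateLimited` exception, and the stub functions are all hypothetical stand-ins, and the provider names are labels only. The idea is simply "try each provider in order, move on when one is rate-limited."

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


class RateLimited(Exception):
    """Hypothetical error a provider client raises when its quota runs out."""


@dataclass
class Provider:
    name: str                   # label only, not a real API identifier
    call: Callable[[str], str]  # takes a prompt, returns a completion


def complete_with_fallback(prompt: str, chain: List[Provider]) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, completion)."""
    errors = []
    for provider in chain:
        try:
            return provider.name, provider.call(prompt)
        except RateLimited as exc:
            # Record the failure and fall through to the next provider.
            errors.append(f"{provider.name}: {exc}")
    raise RuntimeError("every provider in the chain failed: " + "; ".join(errors))


# Stub providers standing in for real API clients (assumptions for the demo).
def always_limited(prompt: str) -> str:
    raise RateLimited("429")


def echo(prompt: str) -> str:
    return "echo: " + prompt


chain = [
    Provider("owl-alpha", always_limited),
    Provider("gemma-4-31b-it", always_limited),
    Provider("qwen3-coder-next", echo),
]

used, answer = complete_with_fallback("hello", chain)
print(used, answer)
```

A real chain would also need per-provider timeouts, retries, and some memory of which providers are currently exhausted; the sketch leaves all of that out on purpose.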

The philosophical thing, and the part I want to sit on, is what this tells you about the cost curve for solo builders. This guy has, by his own account, a Telegram-and-Discord-resident agent that manages his Spotify, handles his email, and spawns three sub-agents for parallel research, running eight days without breaking, for zero ongoing dollars. Now — I am skeptical that this is as stable as he says. Eight days is not a long time. The number of brittle integrations across nine different providers and free-tier rate limits is enormous. The day one provider deprecates a model or changes their TOS, the whole chain re-routes or breaks. But the fact that you can even attempt this in May of 2026, with this much capability, for this little money, is the story.

And it connects directly to where we were yesterday with swyx on Cognition. The big-iron agent lab story is, well, two hundred percent utilization, IOI gold medalists, Peter Thiel's biggest AI bet, the whole thing. The cheap end is one guy on a free EC2 instance running Telegram webhooks into Gemma. Both of those things are real, and both are growing, and they are going to collide in interesting ways. Specifically: the moat for the small builder is not the model. The moat for the small builder is the routing logic, the integrations, and the fact that they were willing to spend three weekends gluing nine providers together. That's a different kind of moat than the labs are selling, but it is a moat.

One more note on this before I move on. There's a comment I want to read you. Quote, Building on top of these APIs, I've started realizing my product's entire knowledge layer is rented, not owned. The infra concentration isn't just an ethics problem, it's a real business risk that most indie founders don't price in until it's too late. End quote. That one isn't from under the agent post. It's from a different thread on the same day, an essay about AI as epistemic infrastructure, and I'm going to come back to it later. But I want you to hold that thought, because the zero-cost agent guy is the perfect illustration of it. His entire product, his entire personal infrastructure, is renting compute and intelligence from nine landlords simultaneously. That's a feature when it's working. It's a single point of failure that's been distributed nine ways when it's not.

Alright. Let's move from the agent guy to the image generator question, because there's actually a hidden gem in that one. Two posts in the fresh stack — one yesterday, one the day before — both essentially asking, which AI image generator should I actually pay for. One's titled, Looking for an AI image generator, what's the best one. The other is, Which AI image generator is actually worth the money. The lists overlap heavily: ChatGPT image, Midjourney, Firefly, Stable Diffusion, Flux, Nano Banana, Ideogram, Recraft, Leonardo, Imagen, Meta AI.

Now on the surface, this is the most generic AI question of all time. And the top-voted comment on one of those posts is, honestly, there probably isn't one best AI image generator anymore because they all excel at different things. Which is true and also useless.

But the actually useful comment, the one I want to read because it's the only piece of business advice in either thread that I'd put in a notebook, comes from a commenter on the second post. They lay out the API pricing breakdown by tier, and I'm going to summarize because the numbers are the point. At five cents an image or less, you've got Seedream 4.5 at three cents, Flux.1 Kontext Pro at four cents, Imagen 4 at five cents, and Flux 2 Pro at five cents. In the ten to fifteen cent range for editing and reference image support, you've got ChatGPT Image 2 at fifteen cents and Nano Banana Pro at fourteen cents.

And then comes the advice. Quote: My suggestion for business use — don't lock into one model. Use two or three based on the task. A cheap one for fast iterations, a quality one for finals, and an editing-capable one when you need refinements. End quote.

That's the same architecture as the zero-cost agent guy. That's a fallback router for image generation. Same pattern, different domain. The takeaway for founders is: if you're building anything that touches generative images at production scale, you should not be picking one vendor and signing a Midjourney subscription. You should be picking three, routing by task type, and treating the cost difference between three-cent iteration and fifteen-cent finals as a material line item on your COGS. Because, as I mentioned the other day on Simon Willison's note about enterprise API pricing, the two-x sticker price on the latest models is going to lock in customers who didn't think this through. Don't be that customer.
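Here is a rough sketch of that line item, using the per-image prices the commenter quoted. The workload counts and the choice of which model serves which task are assumptions made up for the illustration, and prices are kept in integer cents to avoid floating-point noise.

```python
# Per-image prices in cents, as quoted by the commenter (not verified).
PRICE_CENTS = {
    "Seedream 4.5": 3,
    "ChatGPT Image 2": 15,
    "Nano Banana Pro": 14,
}

# Hypothetical routing table: which model handles which kind of task.
ROUTES = {
    "draft": "Seedream 4.5",     # cheap one for fast iterations
    "final": "ChatGPT Image 2",  # quality one for finals
    "edit": "Nano Banana Pro",   # editing-capable one for refinements
}


def routed_cost_cents(workload: dict) -> int:
    """Total cost when each task type goes to its routed model."""
    return sum(PRICE_CENTS[ROUTES[task]] * count for task, count in workload.items())


def single_vendor_cost_cents(workload: dict, model: str) -> int:
    """Total cost when every image goes to one model."""
    return PRICE_CENTS[model] * sum(workload.values())


# Hypothetical monthly workload: image counts per task type.
workload = {"draft": 1000, "final": 100, "edit": 50}

routed = routed_cost_cents(workload)
single = single_vendor_cost_cents(workload, "ChatGPT Image 2")

print(f"routed: ${routed / 100:.2f}")         # routed: $52.00
print(f"single vendor: ${single / 100:.2f}")  # single vendor: $172.50
```

With these made-up volumes, routing by task comes to $52.00 against $172.50 for sending everything to the fifteen-cent model. The exact gap depends entirely on how draft-heavy the workload is.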

Now let me connect the agent story and the image story to the next thing, because they rhyme. The pattern in both cases is: at the bottom of the market, the right answer is a thin orchestration layer over many cheap providers. The model is becoming a commodity input, and the value is in the glue. Which brings me to the indie research drops on r/MachineLearning this weekend, because they are all glue.

Let's start with one called WAVE — a portable GPU instruction set architecture. The post is from a user calling themselves not-your-typical-cs, titled, Built a portable GPU ISA after reading too many architecture manuals. Quote, I've been reading GPU architecture docs in my free time. NVIDIA PTX, AMD ISA reference guides, Intel Xe, reverse-engineered Apple GPU stuff. Over 5,000 pages across 16 microarchitectures. After a while you notice all four vendors are doing the same eleven things with different names. So I wrote a spec that covers all of them and built a toolchain around it. End quote.

What WAVE does: you write one kernel, it compiles to a portable binary, and thin backends translate that to Metal for Apple, PTX for NVIDIA, HIP for AMD, or SYCL for Intel. They claim the same binary was verified on Apple M4 Pro, NVIDIA T4, and AMD MI300X. A co-author named Onyinye built PyTorch integration and they say they got identical training results across all backends.

I want to flag two things. One: this is a single user's claim with very low Reddit traction. Three upvotes, two comments, one of which just reads, woah this has implications. So treat this as a project announcement, not a validated result. The link to the repo and the preprint will be in the show notes. Two: even if the specific implementation is rough, the goal is exactly right. The cross-vendor portability story for GPU code is one of the biggest unresolved tax bills in the AI stack. Every founder who has tried to move workloads between H100s and MI300s and Apple silicon and Intel Gaudi knows what I'm talking about. If anything in this space gets real traction, and this won't be the last attempt, it changes the negotiating position of every AI company with their cloud provider.

Sitting next to that on the workshop bench is a preprint called TritonMoE — cross-platform fused mixture-of-experts dispatch in OpenAI's Triton language. Same pitch: portable expert routing without writing CUDA. They claim eighty-nine to one hundred thirty-one percent of Megablocks throughput at inference batch sizes up to 512 tokens on A100, with the same kernel running on MI300X unchanged. Limitations: it falls behind at 2048-plus tokens and degrades when you've got 64-plus experts under extreme routing skew. Again, single-author preprint, low traction, take with the appropriate skepticism. But the through-line is the same — somebody is trying to commoditize the kernel layer so that the model layer becomes more portable.

And this is what I want builders to notice. The big agent labs are racing to capture vertical depth. The indie crowd is racing to commoditize horizontal layers underneath them. If you're a founder thinking about where to plant your flag, the question is which of those waves you want to ride. Building on top of a frontier model means renting your knowledge layer, as the comment said. Building in the commodity layer underneath means you're competing with a hundred other people who also read the GPU manuals. Neither is wrong. They're just very different bets.

One more from this batch, and this one I think is actually useful for the agent builders listening. There's a post titled, BEAM 100K memory benchmark — CSM versus Hindsight local artifact comparison. The author built an open-source agent memory system called Context Swarm Memory, or CSM. It uses bounded read-only memory shards, query routing, probe-recall-synthesis, cited packets, and explicit Committer-gated writes. So, structurally, this is the same family of solution as the memory curator agent we talked about Wednesday — the idea that worker agents shouldn't be allowed to scribble all over durable memory. There needs to be a gate.
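For a feel of what a gate on durable memory looks like, here is a minimal sketch. It is not CSM's code or API: the `Memory` and `Proposal` classes and the "needs at least one citation" rule are assumptions invented for the example. Workers only get a read-only view and can only propose writes; a single commit step decides what lands.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Dict, List, Mapping, Tuple


@dataclass(frozen=True)
class Proposal:
    key: str
    value: str
    citations: Tuple[str, ...]  # source ids backing the claim


@dataclass
class Memory:
    _store: Dict[str, str] = field(default_factory=dict)
    pending: List[Proposal] = field(default_factory=list)

    def read_view(self) -> Mapping[str, str]:
        """Read-only live view for worker agents; item assignment raises TypeError."""
        return MappingProxyType(self._store)

    def propose(self, proposal: Proposal) -> None:
        """Workers queue a write instead of touching durable memory directly."""
        self.pending.append(proposal)

    def commit_pending(self) -> List[str]:
        """The gate: only proposals with at least one citation are written."""
        committed = []
        for proposal in self.pending:
            if proposal.citations:
                self._store[proposal.key] = proposal.value
                committed.append(proposal.key)
        self.pending.clear()
        return committed


memory = Memory()
view = memory.read_view()

memory.propose(Proposal("owl-alpha", "stealth model on OpenRouter", ("reddit-post",)))
memory.propose(Proposal("rumor", "uncited claim", ()))

committed = memory.commit_pending()
print(committed)   # ['owl-alpha']
print(dict(view))  # {'owl-alpha': 'stealth model on OpenRouter'}
```

The uncited proposal is dropped at the gate, and a worker holding `view` cannot write to it at all. A real system would presumably use a richer acceptance policy than "has a citation."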

The results he reports: CSM gets a 0.757 AMB score, 342 out of 400 correct. The Hindsight baseline gets 0.733, 326 out of 400. CSM uses thirty-eight percent fewer answer-visible context tokens. CSM is slower — twenty-nine seconds average retrieval versus six. To his credit, the author is precise about the claim: this is not an official leaderboard claim, this is a local accepted-artifact comparison at 100K, and the next step is independent replication. That kind of intellectual honesty is rare in the self-promo posts, so I'm going to give him credit for it.
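A quick arithmetic check on those reported numbers, using only the rounded figures quoted above. Note that the raw fraction correct is higher than the AMB score in both cases, so the score is evidently some other aggregate; its definition isn't given here, and the sketch doesn't try to reproduce it.

```python
# Figures as reported in the post (rounded as quoted).
csm_correct, hindsight_correct, total = 342, 326, 400
csm_latency_s, hindsight_latency_s = 29, 6

extra_correct = csm_correct - hindsight_correct
accuracy_gap_points = 100 * extra_correct / total
latency_ratio = csm_latency_s / hindsight_latency_s

print(f"extra correct answers: {extra_correct}")            # 16
print(f"raw accuracy gap: {accuracy_gap_points:.1f} points")  # 4.0 points
print(f"latency ratio: {latency_ratio:.1f}x")               # 4.8x
```

So on these rounded figures the trade is sixteen more correct answers out of four hundred for roughly 4.8 times the retrieval time.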

The reason this matters to the builders in the audience: the latency cost is the story. You're getting better recall, fewer tokens, and more cited answers, but you're paying nearly five times the wall clock to retrieve, twenty-nine seconds against six. For an internal research agent, that's fine. For a customer-facing voice agent, that's death.

Which is the same pattern we saw with the noisekit project on the same subreddit: a CLI that takes a clean annotated speech dataset and applies realistic phone-line degradations so you can actually compute word error rate on something that sounds like a real call. G.711 narrowband, MP3 at sixteen kilobits, real ambient noise mixed at five to fifteen dB SNR, reverb at far-field microphone distances. Because, as the author puts it, most teams benchmark on clean data, pick a vendor, then discover in production which one survives noise. The repos for noisekit and the CSM memory work will be in the show notes for anybody building voice products or agent memory systems.
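One of those degradations, mixing noise at a target SNR, is simple enough to sketch. This is not noisekit's code or interface; the function names, the 440 Hz test tone, and the seeded random noise are assumptions for the demo. The technique itself is standard: scale the noise so that ten times the log of signal power over noise power equals the target in dB.

```python
import math
import random
from typing import List


def power(samples: List[float]) -> float:
    """Mean squared amplitude of a signal."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(speech: List[float], noise: List[float], snr_db: float) -> List[float]:
    """Add noise to speech, scaled so the result has the requested SNR in dB."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    # SNR_dB = 10 * log10(P_speech / P_scaled_noise); solve for the scale factor.
    scale = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


def measured_snr_db(speech: List[float], mixed: List[float]) -> float:
    """Recover the SNR by treating (mixed - speech) as the noise."""
    residual = [m - s for m, s in zip(mixed, speech)]
    return 10 * math.log10(power(speech) / power(residual))


# Demo inputs: a 440 Hz tone at 8 kHz (narrowband telephony rate) and seeded noise.
sample_rate = 8000
speech = [math.sin(2 * math.pi * 440 * i / sample_rate) for i in range(800)]
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(800)]

mixed = mix_at_snr(speech, noise, snr_db=10.0)
check = measured_snr_db(speech, mixed)
print(f"target 10.0 dB, measured {check:.3f} dB")
```

Codec passes like G.711 or low-bitrate MP3 and reverb need real audio tooling and are out of scope for this sketch.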

Alright. Now shift from the workshop floor to the noise floor, because there are a few posts in today's stack that I want to address but not dignify with a full segment.

There's a Live Science article that made it to r/singularity about scientists training an AI model using an IBM quantum computer and getting it to answer questions the base model couldn't. The top comments tear it apart, and they're right to. Quote from one commenter: This article is a brutal appeal to magical quantum results. They made Llama 3.1 bigger and updated its training, then spuriously it could answer a couple random questions correctly. Were those questions answered by the training that made it bigger? End quote. Another: Wow you're telling me a model was more capable after it had been trained than before. Truly amazing. End quote. I don't think there's anything to say about this story that the comments haven't already said. Quantum computing is real and interesting, large language models are real and interesting, and any time you see them mashed together in a press release headline, your default should be skepticism. Moving on.

There's a Gemini 3.5 Pro rumor post from r/singularity — quote, extra high thinking level possibly with gemini 3.5 pro soon be released. There is no source on this beyond a screenshot. The most useful comment in the thread is from a user who says they spent the weekend asking GPT, Claude, and Gemini several difficult medical questions, and Gemini, quote, kept pulling stuff out of its ass with overconfidence. End quote. Which is a vibes-based eval, not a benchmark, but it does reinforce something we've talked about repeatedly: turning the thinking dial up to eleven doesn't help you if the model is confidently wrong. Gemini 3.5 Pro may or may not be coming, the rumor is the rumor, and I'd rather wait until there's an actual release to cover it.

There's a Qwen 3.7 meme post in r/LocalLLaMA — somebody sarcastically demanding the next Qwen open-weights release. The genuinely informative comment buried in there: the user explains that Qwen makes the largest model first, then distills down to the smaller ones, and quote, We have only seen Qwen3.7 Max; they haven't even released Plus on their API yet, which is their 397b param model. End quote. So if you're waiting for the 27B or 122B local variants of Qwen 3.7, the implication from the community is, you're going to wait. Useful context for the local-model crowd, but I don't have anything fresher than that to add.

And then there's a thread on r/artificial titled, The Most Terrifying Superintelligence Might Not Want to Rule Us at All. It's an essay arguing that a sufficiently advanced AI might just become a Camus-style absurdist and quote, sit on the beach forever, end quote. The thread is what you'd expect — a lot of nineteen-year-olds discovering existentialism for the first time, a few commenters pointing out that the post itself was probably written by an AI. I'm not going to spend time on it because, frankly, that's not the kind of philosophy that helps anyone ship a product on Monday morning. If you want to think about whether your superintelligent agent will choose nihilism, fine, but maybe first make sure your current agent doesn't choose to upload customer files to an attacker's Anthropic account, which we covered earlier this week.

Now let me come back to the one social-philosophy post in today's stack that I do think is worth taking seriously, because it touches the same business risk we keep circling. It's a post by a user named bubugugu titled, AI is becoming epistemic infrastructure controlled by a handful of private individuals. The argument: whoever controls the infrastructure of knowledge controls how people perceive reality. The Church held that position through controlling scripture. The printing press broke that monopoly by distributing interpretive power. AI, the author argues, is doing the opposite — recentralizing knowledge synthesis into a handful of corporations with no democratic accountability.

Now there are pieces of this I disagree with, and there's a thoughtful counter in the comments worth reading. One commenter pushes back: quote, anyone can run open models locally, can inspect prompts and outputs, can test the same question across different systems and see where they diverge. That's not nothing. The actual problem I encounter is messier than corporations control knowledge. It's that most people don't bother doing that verification work. End quote. I think that's the more honest framing. The literacy problem is real, the monopoly framing is overstated, and the existence of DeepSeek, Gemma, Llama-family open weights cuts against the strongest version of the centralization claim.

But here is the comment from that same thread I told you to hold onto. Quote: Building on top of these APIs, I've started realizing my product's entire knowledge layer is rented, not owned. The infra concentration isn't just an ethics problem, it's a real business risk that most indie founders don't price in until it's too late. End quote.

That is the line I want you to take with you. Forget the Church metaphor, forget the printing press, forget the politics for a second. As a builder, your stack right now probably has at least one critical dependency that you do not control. For our zero-cost agent guy from the top of the show, it's nine of them. For most founders I talk to, it's two or three. When the model you depend on gets deprecated, gets repriced, gets quietly nerfed on its reasoning quality between dot releases, or gets a new content policy that breaks half your use cases — you're the one explaining it to your customers. Not OpenAI, not Anthropic, not Google. You.

The response is not to panic, and it's not to go fully self-hosted, because for most of you that math doesn't work. The response is the one the indie research crowd is converging on this weekend: thin routing layers, multiple providers, eval pipelines that catch silent regressions, and observability that tells you when your knowledge layer rent just went up. We saw the same answer from three angles today — the agent guy's fallback chain, the image generator commenter's three-tier task routing, and the BEAM memory benchmark builder's emphasis on cited packets. The orchestration layer is the moat for builders who don't own the models. Build it on purpose, not by accident.

Let me close on the lighter note in the stack, because we earned it. There's a post in r/singularity celebrating the two-year anniversary of Golden Gate Claude. For anybody who wasn't deep in the weeds in 2024 — Anthropic released a version of Claude where they had used sparse autoencoders to crank up the activation of a specific internal feature, the one that lit up around the concept of the Golden Gate Bridge. The model became hilariously obsessed with the bridge. It would steer every conversation toward it. It would describe itself as the bridge. It was the most charming interpretability demo of that whole era, and it remains, in my opinion, the moment when mechanistic interpretability went from arXiv-only curiosity to something a normal person could actually appreciate.

The reason the anniversary post made me smile this weekend is this quote from the thread. Quote: What I find wildest is that today you can run a decent local model, download the SAE from neuronpedia, and produce very similar results right at home, with any decent laptop. And if you do not know exactly how, Claude will even gladly help to set it up for you. End quote.

That is the whole story of this episode in one sentence. The thing that was a frontier-lab interpretability stunt two years ago is now a weekend project on a laptop with help from Claude itself. The capability has fallen off the back of the truck, and it landed in the hands of hobbyists. That's the direction the gradient runs in this business — slowly, then all at once.

Which is why I led with the zero-cost agent guy and not with the next benchmark king. The kid on the t3.micro with the nine-provider fallback chain is not Cognition, he is not Anthropic, he is not a threat to the frontier. But he is a signal about what becomes possible for everyone three months and six months from now, when the same pattern shows up as a managed product. The orchestration layer, the routing layer, the eval-and-degraded-data pipeline: those are the businesses being seeded this week in random Reddit threads with eight upvotes.

Pay attention to the workshop floor. That's the whole point of the show.

Alright kiddos, that's the menu for today. Show notes will have the WAVE repo, the TritonMoE preprint, the Context Swarm Memory benchmark, the noisekit CLI, and the Golden Gate Claude anniversary gallery. Take care of each other, watch your dependencies, and I'll see you tomorrow. This was Barely Possible. I'm Tony DeLuca.
