Embodied AI 101

Comprehensive open-source agentic robotics framework treating VLMs/LLMs as code-generating APIs for perception (SAM3, Molmo) and control (IK, grasping), with CaP-Gym benchmark of 187 diverse manipulation tasks (tabletop, bimanual, mobile; sim/real) and CaP-Bench evaluating 12 frontier models; demonstrates rapid RL gains (7B model from 20% to 72% success) with strong sim-to-real transfer.

CaP-X: A Code-as-Policy Framework for Robot Manipulation

In the latest wave of embodied AI, researchers are exploring an alternative to the end-to-end, “black box” robot controllers that have dominated in recent years. Instead of training ever-larger neural policies to map pixels directly to torques, CaP-X (Coding Agents for Physical Interaction) asks: can we give robots a “brain” that writes code? In CaP-X, a large language or vision-language model generates Python programs on the fly, calling out to perception and control modules to execute a manipulation task. In effect, these models act as programmers rather than replaying fixed behaviors.

This idea builds on a growing realization that language models may excel at symbolic reasoning, which we can harness via modular interfaces. Prior to CaP-X, robots were typically driven either by hand-coded logic (classical TAMP planners) or by fully neural policies (Vision-Language-Action models) that depend on large amounts of training data. Both extremes have drawbacks: hand-crafted rules break under new scenes (“change one cup on the table and you have to rewrite code”), and monolithic learned policies are brittle and opaque once deployed. The CaP-X team instead treats the LLM/VLM as a controller that emits actual robot commands in code. Importantly, even existing learned models (vision-language models) are demoted to tools in this framework. Where a VLM like GPT or Gemini might once have been the robot “brain,” in CaP-X it becomes a callable function inside the program. Traditional vision-language policies are not thrown away; they become API calls that a code-writing agent may invoke when needed. In this sense, CaP-X “transforms large models from ‘command-issuing commanders’ into ‘code-writing programmers’”.

The CaP-X framework, released as open source by its authors, has four main pieces: CaP-Gym, CaP-Bench, CaP-Agent0, and CaP-RL. At its core is CaP-Gym, an interactive robotic sandbox. Built on a standard Gym interface, CaP-Gym connects the virtual (or real) robot to the language model via code. Whenever the agent emits code, the physical world (simulator or hardware) steps and returns feedback. Crucially, the allowed commands are high-level perception and control primitives, not raw joint torques. For vision, CaP-Gym provides built-in sensing modules: for example, it includes SAM3 (Segment Anything Model 3) and Molmo 2, an open vision-language model used here for point selection. These tools digest raw camera images and return structured semantics, e.g. “there is an apple on the table” or “the red cube is at pixel (x,y)”. For motion, the code need not specify robot joint angles by hand. Instead, the agent can call a motion planner or inverse-kinematics routine (the codebase uses a library called PyRoki) that automatically computes collision-free trajectories or grasping actions. Whether the task is single-arm pick-and-place, coordinated two-arm assembly, or mobile-base manipulation, the agent issues Cartesian commands (e.g. “move gripper to this pose” or “execute a top-down grasp at this point”) and the underlying planner handles the details.
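To make the "world steps when code arrives" idea concrete, here is a minimal sketch in Python. It is a toy and not the CaP-Gym API: the class `ToyCodeEnv`, its one-dimensional world, the goal position, and the primitive names `locate`, `move_to`, `grasp`, and `release` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyCodeEnv:
    """Toy code-driven environment: one block on a 1-D table (hypothetical)."""
    block_x: float = 0.30
    gripper_x: float = 0.0
    holding: bool = False
    goal_x: float = 0.80

    # --- high-level primitives exposed to generated code ---
    def locate(self, name: str) -> float:
        """Perception primitive: return the block's position (name is ignored here)."""
        return self.block_x

    def move_to(self, x: float) -> None:
        """Control primitive: move the gripper; a held block moves with it."""
        self.gripper_x = x
        if self.holding:
            self.block_x = x

    def grasp(self) -> None:
        """Close the gripper; succeeds only when it is at the block."""
        self.holding = abs(self.gripper_x - self.block_x) < 0.01

    def release(self) -> None:
        self.holding = False

    def step(self, code: str):
        """Run one generated snippet, then return (obs, reward, done, info)."""
        namespace = {
            "__builtins__": {},  # NOT a real sandbox, only keeps the toy tidy
            "locate": self.locate, "move_to": self.move_to,
            "grasp": self.grasp, "release": self.release,
        }
        info = {"error": None}
        try:
            exec(code, namespace)
        except Exception as exc:  # execution errors become feedback
            info["error"] = repr(exc)
        done = (not self.holding) and abs(self.block_x - self.goal_x) < 0.01
        obs = {"block_x": self.block_x, "gripper_x": self.gripper_x}
        return obs, float(done), done, info


program = """
x = locate('block')
move_to(x)
grasp()
move_to(0.80)
release()
"""
obs, reward, done, info = ToyCodeEnv().step(program)
```

The point of the design is that the action handed to `step` is a program, and a failed program (for instance one calling an undefined primitive) comes back as an error string in `info` rather than crashing the episode.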

In this way, CaP-Gym provides a unified, modular action space that is rich enough for real tasks but still tractable for code generation. The developers collected a diverse suite of 187 manipulation tasks covering tabletop assembly, bimanual coordination, and mobile manipulation, with scenarios both in simulation and on real robots. This lets researchers test LLM-driven controllers across many task types through a single interface.

Complementing CaP-Gym is CaP-Bench, a benchmark suite designed to stress-test coding agents under varying conditions. CaP-Bench evaluates whether an agent can “harness” the robot effectively by generating correct, robust code. It does this along three axes. First is abstraction level: some tasks are specified by high-level macros (e.g. ), while more challenging versions force the agent to use only atomic primitives (e.g. primitive move-and-grasp commands). Second is temporal interaction: a model might write a program in one shot (zero-shot) or interact over multiple turns, using feedback from prior attempts to refine its code. Third is perceptual grounding: CaP-Bench varies how visual feedback is presented to the agent (for example, raw images vs. semantic descriptors) to test whether the model can interpret percepts and incorporate them into its code. In short, CaP-Bench moves models from the comfortable “training wheels” setting (lots of human-defined scaffolding) down to unaided reasoning over raw primitives.
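The three axes can be read as a small grid of evaluation conditions. The sketch below enumerates such a grid in Python; the enum names and the choice of two values per axis are assumptions drawn from the examples in the text, not CaP-Bench's actual configuration.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Abstraction(Enum):
    MACROS = "high-level macros"
    PRIMITIVES = "atomic primitives only"


class Interaction(Enum):
    SINGLE_TURN = "one program, no feedback"
    MULTI_TURN = "revise using execution feedback"


class Grounding(Enum):
    RAW_IMAGES = "raw images"
    SEMANTIC = "semantic descriptors"


@dataclass(frozen=True)
class EvalCondition:
    """One cell of the (hypothetical) evaluation grid."""
    abstraction: Abstraction
    interaction: Interaction
    grounding: Grounding


def all_conditions():
    """Cross the three axes to list every evaluation condition."""
    return [EvalCondition(a, i, g)
            for a, i, g in product(Abstraction, Interaction, Grounding)]


conditions = all_conditions()
# The least-scaffolded cell: primitives only, one attempt, raw images.
hardest = EvalCondition(Abstraction.PRIMITIVES,
                        Interaction.SINGLE_TURN,
                        Grounding.RAW_IMAGES)
```

Running each model on every cell, rather than on a single setting, is what lets a benchmark of this shape attribute a drop in success to a specific kind of removed scaffolding.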

Using CaP-Bench, the authors evaluated 12 state-of-the-art models (including large multi-modal models akin to GPT-4o and Gemini 3 Pro) in a blind, zero-shot setting. The results were striking: every model’s performance plunged as human scaffolding was removed from the tasks. With only the lowest-level primitives available, even the largest public models achieved success rates far below those of human experts. As the paper notes, “as human priors (scaffolding) are removed, the performance of all cutting-edge models drops precipitously, and none can achieve the zero-shot success rate of human experts at the low-level primitives”. In practical terms, even powerful LLMs got “lost” when they could no longer rely on pre-built, task-specific subroutines. This confirmed the intuition that, without good interfaces and multi-step planning, code-as-policy by itself is fragile.

The CaP-X team used these findings to engineer a more capable agent, dubbed CaP-Agent0. This is a test-time system (it starts from an existing 7B-parameter LLM) that augments raw code generation with iterative reasoning and memory. First, it embeds a multi-turn loop: instead of writing one program and hoping for success, the agent executes its code in the environment, observes the outcome, and then tries again. A key innovation is Visual Difference Matching (VDM): after each attempt, the agent compares the before-and-after images and asks itself (in natural language) what changed. These visual differences are translated into structured feedback (e.g. “the red block moved by 2 cm, but the green block stayed”), which the model then uses to debug its code on the next try. In effect, the agent is self-interrogating: “I tried this code, what was wrong, and how do I fix it?”
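The execute-observe-retry loop described above can be sketched in a few lines of Python. This is a toy under stated assumptions: a scene is a dictionary of object positions standing in for an image, `describe_changes` stands in for the visual differencing step, and `propose`, `execute`, and `succeeded` are hypothetical callbacks rather than anything from the CaP-X codebase.

```python
from typing import Callable, Dict, List, Optional, Tuple

Scene = Dict[str, float]      # object name -> position (stand-in for an image)
Program = Tuple[str, float]   # (object to push, distance): stand-in for code


def describe_changes(before: Scene, after: Scene, tol: float = 1e-3) -> List[str]:
    """Turn a before/after comparison into short natural-language notes."""
    notes = []
    for name in sorted(before):
        delta = after[name] - before[name]
        if abs(delta) > tol:
            notes.append(f"{name} moved by {delta:+.2f} m")
        else:
            notes.append(f"{name} did not move")
    return notes


def multi_turn(propose: Callable[[List[str]], Program],
               execute: Callable[[Program, Scene], Scene],
               succeeded: Callable[[Scene], bool],
               scene: Scene,
               max_turns: int = 3) -> Tuple[Optional[int], List[str]]:
    """Retry until success; feed the observed differences into the next proposal."""
    feedback: List[str] = []
    for turn in range(1, max_turns + 1):
        program = propose(feedback)
        after = execute(program, scene)  # every attempt starts from the same scene
        if succeeded(after):
            return turn, feedback
        feedback = describe_changes(scene, after)
    return None, feedback


# Toy usage: the first attempt pushes the wrong block, the second one corrects it.
scene = {"red_block": 0.10, "green_block": 0.30}

def propose(feedback: List[str]) -> Program:
    return ("green_block", 0.02) if not feedback else ("red_block", 0.02)

def execute(program: Program, s: Scene) -> Scene:
    name, distance = program
    out = dict(s)
    out[name] += distance
    return out

def succeeded(s: Scene) -> bool:
    return abs(s["red_block"] - 0.12) < 1e-6

turn, last_feedback = multi_turn(propose, execute, succeeded, scene)
```

In a real system `propose` would be a model call whose prompt includes the feedback text; the toy version only checks whether any feedback exists.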

Meanwhile, CaP-Agent0 builds its own skill library on the fly. Whenever the agent finds a code snippet that achieves a sub-task, it automatically extracts and stores that snippet as a reusable skill. Over many trials, this library grows into a set of macros that simplify future problems: complex tasks can be assembled by recalling these learned routines. Finally, the agent uses parallel proposals: for a hard task, it generates multiple candidate programs and tests them in parallel, increasing the chance that one will succeed quickly.
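Both mechanisms are easy to picture in code. The following Python sketch is illustrative only: `SkillLibrary`, `first_success`, and the string-based "programs" are hypothetical, and the real system's storage format and selection rule may differ.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


class SkillLibrary:
    """Keeps source snippets that worked, keyed by a skill name."""

    def __init__(self) -> None:
        self._skills: Dict[str, str] = {}

    def add_if_successful(self, name: str, source: str, succeeded: bool) -> None:
        """Store a snippet only when it achieved its sub-task; keep the first version."""
        if succeeded:
            self._skills.setdefault(name, source)

    def as_prompt(self) -> str:
        """Render stored skills so they can be prepended to a later prompt."""
        return "\n\n".join(self._skills[name] for name in sorted(self._skills))

    def __len__(self) -> int:
        return len(self._skills)


def first_success(candidates: List[str],
                  run: Callable[[str], bool],
                  max_workers: int = 4) -> Optional[str]:
    """Evaluate all candidates concurrently; return the first one (in list order) that works."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outcomes = list(pool.map(run, candidates))  # map preserves input order
    for program, ok in zip(candidates, outcomes):
        if ok:
            return program
    return None


# Toy usage: "programs" are strings and the runner is a stub.
library = SkillLibrary()
candidates = ["bad_plan", "good_plan_a", "good_plan_b"]
winner = first_success(candidates, run=lambda program: program.startswith("good"))
library.add_if_successful("stack_two_blocks", winner or "", succeeded=winner is not None)
library.add_if_successful("open_drawer", "bad_plan", succeeded=False)
```

Threads are a reasonable fit when each candidate is evaluated in its own simulator instance or remote worker; the selection rule here (first success in list order) is the simplest possible choice.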

These tricks turn the model from a “write-once” oracle into a search-based planner that gradually improves. In the authors’ experiments, the training-free CaP-Agent0 (no gradient updates) achieved surprisingly strong performance. On 7 core manipulation tasks in the benchmark, even when restricted to raw primitives it “matched the success rate of human experts in 4 tasks and even surpassed the reference programs” on some of them. Under noisy and perturbed conditions (the so-called LIBERO-PRO settings), CaP-Agent0 proved more robust than end-to-end vision-language-action policies like OpenVLA. In short, a code-generating agent with no additional training, when allowed to iterate and build skills, reached human-expert success rates on 4 of the 7 core tasks.

Beyond the training-free agent, the team also explores learning. CaP-RL applies reinforcement learning at the code level. Here, task success or failure in the environment provides a clear reward signal (“task done or not done”), which is used to fine-tune the language model’s code generation. They used a policy-gradient approach (GRPO) to adjust the model so that its writing (its “intuition”) more often produces working programs. Remarkably, because this learning operates over symbolic code rather than raw pixels, the authors report that what the agent learns transfers directly from simulation to hardware without any additional tuning (i.e. zero-shot sim-to-real, reportedly with no performance drop).
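GRPO (Group Relative Policy Optimization) scores each sampled output relative to the other samples drawn for the same prompt, so a binary task reward is enough to produce a learning signal. The sketch below shows only that group-relative advantage step, in plain Python. The function name, the use of the population standard deviation, and the `eps` value are assumptions; the exact settings used in CaP-RL are not given in this article.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and spread of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four programs sampled for one task; reward 1.0 means the task was completed.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every sampled program fails, all advantages are zero and the group
# contributes no gradient signal.
no_signal = group_relative_advantages([0.0, 0.0, 0.0])
```

Programs that succeeded get a positive advantage and failed ones a negative advantage, which is what pushes the model toward code that completes the task. The zero-signal case hints at why a model that starts with a nonzero success rate is a useful precondition for this kind of training.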

The impact of CaP-RL was dramatic. In one experiment, a 7B agent’s success rate on a key task rose from roughly 20% zero-shot to 72% after RL fine-tuning. These gains held on the real robot as well, demonstrating strong sim-to-real transfer. In testbed hardware trials, CaP-Agent0 was able to solve tasks in the real world nearly as well as in simulation, significantly outperforming raw LLM-based controllers.

The authors are candid about limitations. Purely code-driven control still struggles on tasks demanding high-frequency reactions and delicate sensing: pouring liquids, precision insertion, or tactile maneuvers remain shaky if done only by successive code commands. In these cases, they propose a hybrid approach: let the agent handle high-level planning and error recovery in code, but hand off truly low-level execution to a learned controller (a VLA model) that can react continuously to sensor streams. In other words, CaP-X points toward a hybrid future: use code for logic and structure, and learned policies for muscle.

Overall, CaP-X is a comprehensive, open-access platform for this new “code-as-policy” paradigm. It comes with 187 multi-step manipulation tasks in its gym, a benchmark covering 12 frontier models, and example agents (training-free and RL-tuned). The key message is that we can leverage LLM/VLM power by making language models write the program for a robot rather than outputting actions directly. When the robot’s abilities (perception, grasping, motion planning, etc.) are wrapped as code-callable primitives, modern LMs can start to solve tasks with minimal new training.

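In practice, this means an agent might generate a short program that segments the target object, picks a grasp point on it, solves for a reachable arm configuration, and executes the grasp.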
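As a rough illustration of what such a program could look like, here is a self-contained Python sketch. Every function in it is a hypothetical stub: the names and signatures are invented for this example and are not the APIs of SAM3, Molmo, or PyRoki. The stubs only mimic the roles those tools play.

```python
# Hypothetical stand-ins for the real tools (roles only, not real APIs).
def segment(image, prompt):
    """Segmentation role: return a summary of the mask for the prompted object."""
    return {"label": prompt, "center_px": (320, 240)}


def pixel_to_pose(center_px, table_height_m=0.02):
    """Pointing role: map a pixel to a top-down grasp pose (toy conversion)."""
    u, v = center_px
    return {"xyz": (u / 1000.0, v / 1000.0, table_height_m), "approach": "top_down"}


def solve_ik(pose, n_joints=7):
    """Inverse-kinematics role: return a joint configuration for the pose."""
    return [0.0] * n_joints


def execute(joints, gripper):
    """Control role: send the command and report what was commanded."""
    return {"joints": joints, "gripper": gripper}


# --- the kind of program an agent might write against such primitives ---
def pick(image, target):
    mask = segment(image, target)              # 1. find the object
    pose = pixel_to_pose(mask["center_px"])    # 2. choose a grasp pose
    joints = solve_ik(pose)                    # 3. compute how to reach it
    execute(joints, gripper="open")            # 4. approach with the gripper open
    return execute(joints, gripper="closed")   # 5. close on the object


result = pick(image=None, target="red cube")
```

The agent-written part is only the `pick` function; everything above it belongs to the toolkit the environment would expose.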

A short Python program like that (invoking SAM3, Molmo, PyRoki) encodes the entire stack of perception and control in human-readable form. The CaP-X experiments show that even relatively small models can do very well at writing such programs once we feed back execution results.

For roboticists, CaP-X is exciting for several reasons. It provides a toolkit: an environment where you can take an open-source LLM (say, 7B parameters), hook it to robot APIs, and try “prompting by programming,” with immediate feedback from executing and debugging the generated code. It also gives us quantitative insights: we now know how much in the way of “training wheels” (hand-engineered macros, multi-turn fixes, etc.) these models still need under the hood. The CaP-Bench results make clear that raw LLMs still lack intrinsic competence at fine-grained tasks, but can be brought up to speed with clever prompting and RL.

Looking forward, CaP-X points to new research directions. One is building better skill libraries and memory systems so that agents adapt faster to new robots. Another is richer perception interfaces: future VLMs could offer more than segmentation masks and points (narrative descriptions of scenes, 3D understanding, etc.) to the code-writer. And there is the CaP–VLA hybrid the authors mention, in which a planner (LLM) and an executor (neural controller) work together as a dual system.

In sum, CaP-X marks a shift toward making robots that “think in code.” Rather than learning end-to-end policies from data, these agents reason symbolically by writing programs. The initial results are promising: with enough abstraction provided as primitives, even today’s LLMs can match expert-written controllers in many tasks. CaP-X thus opens a new avenue in embodied AI: leveraging the strengths of large language models as programmers who orchestrate a rich toolkit of robotic skills.

Sources: The above is based on the CaP-X 2026 paper and related coverage. The CaP-X code and benchmarks are released open-source by the authors. Key findings, agent designs, and metrics come directly from their results.