Embodied AI 101

CaP-X is an open-source agentic robotics framework where LLMs/VLMs generate code to call perception and control APIs for execution across diverse simulated and real robots in CaP-Gym's 187 manipulation tasks. The framework includes CaP-Bench for evaluating frontier models and CaP-RL, which boosts a 7B model's success from 20% to 72% with minimal sim-to-real gap.

CaP-X: Coding Agents for Physical eXecution.

Roboticists typically teach manipulators by collecting huge datasets or hand-engineering pipelines. In contrast, the Code-as-Policy approach treats a large language model as a robot “programmer.” Rather than learning end-to-end action policies from pixels, the LLM writes actual code that calls perception and control primitives to accomplish tasks. The new CaP-X framework (short for Coding Agents for Physical eXecution) explores this idea systematically. Developed by NVIDIA’s GEAR Lab along with Berkeley AI and others, CaP-X provides a shared environment (CaP-Gym) and benchmark suite (CaP-Bench) to test how well LLMs and VLMs can control robots via programs. The key insight is that today’s foundation models, given the right scaffolding, can decompose high-level instructions into code for vision and motion modules – effectively acting as “programmer brains” for robots. Put simply: instead of training a specialized robot net on millions of images, we ask a general-purpose LLM to write the robot’s code on the fly.

In CaP-X, an LLM-based agent observes a task goal (and possibly an image) and generates a Python script that calls low-level APIs (perception and motor primitives) in sequence. These primitives might include functions like “detect_block”, “plan_grasp(point)”, “open_gripper”, etc. The agent interacts turn by turn in a REPL-like loop: in each cycle it executes the generated code on the simulator or hardware, observes the outcome, and then refines its program if needed. The CaP-Gym environment provides over a hundred manipulation scenarios – ranging from stacking blocks and opening drawers to cleaning spills, turning stove knobs, or coordinating two arms. The benchmark uses a variety of simulated robots, cameras, and cluttered scenes to test the agents’ generality.
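To make the generate, execute, observe, refine cycle concrete, here is a minimal sketch of such a loop. It is not CaP-X's actual interface: `run_agent_loop`, `fake_llm`, and the toy `world` are hypothetical stand-ins, and only the primitive names are borrowed from the examples above.

```python
from typing import Callable, Dict, List

def run_agent_loop(
    generate_code: Callable[[str, List[str]], str],
    api: Dict[str, Callable],
    task: str,
    is_done: Callable[[], bool],
    max_turns: int = 3,
) -> int:
    """Generate a program, execute it, record the outcome, and retry.

    Returns the number of turns used, or -1 if the task was never completed.
    """
    history: List[str] = []
    for turn in range(1, max_turns + 1):
        program = generate_code(task, history)
        try:
            # The generated program may only call the primitives in `api`.
            exec(program, dict(api))
            history.append("executed without error")
        except Exception as err:
            # Errors are fed back so the next attempt can correct them.
            history.append(f"error: {err!r}")
        if is_done():
            return turn
    return -1

# Toy stand-ins for the simulator and its primitives (all hypothetical).
world = {"gripper_open": False, "holding": None}

def open_gripper():
    world["gripper_open"] = True

def detect_block():
    return (0.4, 0.1, 0.02)

def plan_grasp(point):
    if not world["gripper_open"]:
        raise RuntimeError("gripper is closed")
    world["holding"] = "block"
    world["gripper_open"] = False

def fake_llm(task: str, history: List[str]) -> str:
    # The first attempt forgets to open the gripper; the retry fixes it.
    if not history:
        return "plan_grasp(detect_block())"
    return "open_gripper()\nplan_grasp(detect_block())"

API = {"open_gripper": open_gripper, "detect_block": detect_block, "plan_grasp": plan_grasp}
turns_used = run_agent_loop(
    fake_llm, API, "pick up the block", lambda: world["holding"] == "block"
)
```

In this toy run the first program fails because the gripper is closed, and the second succeeds once the error message is part of the history the model sees.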

The developers evaluated a dozen recent language and vision-language models as coding agents. These included closed-source giants (Gemini/Pro, GPT-4/GPT-5.2, Claude 4.5) and open models (Qwen-3B/235B Coder, DeepSeek, etc.). Importantly, none of these models were trained on the specific tasks. Instead, they were asked zero-shot to produce correct robot control code. CaP-Bench measures each model’s success under different conditions of abstraction, feedback modality, and test-time computation. For example, at the highest abstraction level the agent can call a single one-step primitive. At the lowest level it must piece together many steps (image segmentation, coordinate transforms, inverse kinematics) entirely within its generated code. By evaluating models from high-level scripted APIs down to raw primitive APIs, CaP-Bench reveals exactly how much the LLMs rely on human-crafted scaffolding.
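The difference between abstraction levels is easier to see side by side. The sketch below is a toy, not CaP-Bench's API: the 5x5 "image", the two-link planar arm, and every function name (`segment`, `pixel_to_world`, `inverse_kinematics`, `reach_block`) are assumptions made for illustration. At the high level the generated program is one call; at the low level the agent has to chain segmentation, a coordinate transform, and inverse kinematics itself.

```python
import math

# Toy scene: a 5x5 "image" in which 1 marks the target block.
IMAGE = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
METERS_PER_PIXEL = 0.1
LINK_1, LINK_2 = 0.5, 0.5  # link lengths of a toy two-link planar arm

def segment(image):
    """Return (row, col) of the first pixel labelled 1."""
    for row, values in enumerate(image):
        for col, value in enumerate(values):
            if value == 1:
                return row, col
    raise ValueError("no block found")

def pixel_to_world(row, col):
    """Map pixel indices to (x, y) in meters."""
    return col * METERS_PER_PIXEL, row * METERS_PER_PIXEL

def inverse_kinematics(x, y):
    """Closed-form joint angles for a two-link planar arm."""
    cos_q2 = (x * x + y * y - LINK_1**2 - LINK_2**2) / (2 * LINK_1 * LINK_2)
    if abs(cos_q2) > 1:
        raise ValueError("target out of reach")
    q2 = math.acos(cos_q2)
    q1 = math.atan2(y, x) - math.atan2(
        LINK_2 * math.sin(q2), LINK_1 + LINK_2 * math.cos(q2)
    )
    return q1, q2

def forward_kinematics(q1, q2):
    """End-effector position for the given joint angles (used to check IK)."""
    x = LINK_1 * math.cos(q1) + LINK_2 * math.cos(q1 + q2)
    y = LINK_1 * math.sin(q1) + LINK_2 * math.sin(q1 + q2)
    return x, y

def reach_block(image):
    """High-level, human-written helper that hides all three sub-steps."""
    return inverse_kinematics(*pixel_to_world(*segment(image)))

# What the agent would have to write at each level.
high_level_program = "joints = reach_block(IMAGE)"
low_level_program = """
row, col = segment(IMAGE)
x, y = pixel_to_world(row, col)
joints = inverse_kinematics(x, y)
"""

API = {
    "IMAGE": IMAGE,
    "segment": segment,
    "pixel_to_world": pixel_to_world,
    "inverse_kinematics": inverse_kinematics,
    "reach_block": reach_block,
}

def run(program):
    namespace = dict(API)
    exec(program, namespace)
    return namespace["joints"]

high = run(high_level_program)
low = run(low_level_program)
```

Both programs produce the same joint angles here, but the low-level one offers three places to make a mistake (axis order, scaling, the IK formula) instead of none, which is the kind of exposure the lower benchmark tiers are meant to measure.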

The initial results are striking. Even without any task-specific training, the top frontier models do achieve nontrivial success – typically on the order of 30–40% on these tasks. In other words, simply prompting GPT-like models to write robot programs challenges the traditional view that only highly specialized Vision-Language-Action (VLA) systems could manipulate objects. However, there remains a huge gap to what a human programmer would code. The best model trails human-written programs by roughly 50–60 percentage points in success rate. In practice, the LLM can solve some simple tasks easily (e.g. lifting a detached cube) but repeatedly fails on longer-horizon or precision tasks unless given extra help.

Notably, the agents’ success strongly depends on the level of abstraction and feedback. When models are provided helpful, human-created functions like “pick up object X” or “navigate to region”, they do okay: the LLM just needs to sequence those high-level calls. But when CaP-Bench replaces each high-level function with the low-level sub-steps (pixel processing, grasp-planning, gripper commands), the LLMs’ success rates plummet. Without the “training wheels” of human abstractions, the models struggle to combine dozens of lines of code correctly. In other words, cutting out all but primitive APIs exposes that today’s LLMs still rely heavily on designer scaffolding. Among different models, larger closed models did better than smaller ones (as you’d expect), but no model got close to perfect if it had to write fully low-level control from scratch.

Another surprising finding was that feeding raw camera images into the agent’s prompt often hurt performance. One might think a VLM or multi-modal model could directly interpret pixels, but in this study adding raw image patches to the context actually confused the code generator. The team hypothesized that the base vision-language models aren’t really trained to jointly reason about camera images and write code. In fact, the best approach turned out to be an intermediate “Visual Differencing Module.” Here a separate vision-language pipeline first observes the scene and the effect of the robot’s action, and converts the result into a concise text “status report.” For example, it might say “The red block was lifted but the green block remained stationary.” This textual diff of the scene change is then appended to the LLM’s prompt for the next code iteration. Surprisingly, this structured text feedback outperforms both giving no feedback and giving raw images. In practice, the researchers found that allowing the agent to write code, execute it, see what changed (in words), and then write more code dramatically boosts success. Simply dumping raw pixels into GPT-style context yielded markedly worse results, confirming a “cross-modal alignment” gap in current foundation models. This means that bridging perception to code still benefits from an intermediate grounding step.
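The text-feedback idea can be sketched in a few lines. One loud assumption: the real module uses a vision-language model on camera images, whereas this sketch substitutes symbolic object positions so that it runs without any model. The function names (`describe_changes`, `build_next_prompt`) and the prompt layout are hypothetical.

```python
def describe_changes(before, after, tol=0.01):
    """Summarize in words how each object moved between two scene snapshots.

    `before` and `after` map object names to (x, y, z) positions in meters.
    """
    lines = []
    for name in sorted(before):
        rise = after[name][2] - before[name][2]
        moved = max(abs(a - b) for a, b in zip(after[name], before[name])) > tol
        if rise > tol:
            lines.append(f"The {name} was lifted.")
        elif moved:
            lines.append(f"The {name} moved.")
        else:
            lines.append(f"The {name} remained stationary.")
    return " ".join(lines)

def build_next_prompt(task, previous_code, report):
    """Append the textual diff to the context for the next code iteration."""
    return (
        f"Task: {task}\n"
        f"Previous code:\n{previous_code}\n"
        f"Observed change: {report}\n"
        "Write the next program."
    )

before = {"red block": (0.4, 0.1, 0.02), "green block": (0.2, 0.3, 0.02)}
after = {"red block": (0.4, 0.1, 0.15), "green block": (0.2, 0.3, 0.02)}

report = describe_changes(before, after)
prompt = build_next_prompt("stack red on green", "plan_grasp(detect_block())", report)
```

The point of the design is that the code-writing model only ever sees a sentence or two of task-relevant state, never the pixels themselves.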

Building on these insights, the CaP-X team designed a complete training-free agent called CaP-Agent0. This system combines several of the above ideas into one pipeline so that no new model training is needed. Its core features are:

Visual diff grounding: After each action, a VLM (like a captioner) generates a simple textual summary of what changed in the scene. This keeps the LLM focused on task-relevant state, rather than raw pixels.

Auto-generated skill library: Whenever the agent finds a useful helper function (e.g. a coordinate transform, image thresholding hack, or composite grasp routine) during one task, that function is saved. On future tasks, the agent can reload its library of proven code snippets. In effect the agent “learns” useful subroutines on the fly without additional learning.

Ensemble code generation: Instead of relying on one model output, CaP-Agent0 prompts multiple candidates in parallel. For instance, it might sample nine different code responses either by using different temperature settings or even different LLMs. A lightweight “judge” agent then compares these attempts and stitches together the best pieces. Think of it as nine programmers drafting solutions and one senior engineer merging them.
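The skill library and the ensemble step can be illustrated together with a small sketch. Everything here is hypothetical and simplified: `SkillLibrary`, `judge`, and `score` are invented names, the judge picks the single best candidate rather than stitching pieces together as described above, and the scoring function checks against a known target, whereas the real judge is itself a model with no such ground truth.

```python
import math
from typing import Callable, Dict, List

class SkillLibrary:
    """Stores the source of helper functions that proved useful on earlier tasks."""

    def __init__(self) -> None:
        self._skills: Dict[str, str] = {}

    def save(self, name: str, source: str) -> None:
        self._skills[name] = source

    def load_into(self, namespace: dict) -> None:
        # Define every saved helper inside the namespace a new program will run in.
        for source in self._skills.values():
            exec(source, namespace)

def judge(candidates: List[str], score: Callable[[str], float]) -> str:
    """Best-of-N selection: return the highest-scoring candidate program."""
    return max(candidates, key=score)

library = SkillLibrary()
library.save(
    "pixel_to_world",
    "def pixel_to_world(row, col, scale=0.1):\n"
    "    return (col * scale, row * scale)\n",
)

# Three sampled programs for "compute the block's world position".
candidates = [
    "target = pixel_to_world(2, 3)[::-1]",  # swaps the axes
    "target = pixel_to_world(2, 3)",        # reuses the saved skill correctly
    "target = (3, 2)",                      # forgets the pixel-to-meter scaling
]

def score(program: str) -> float:
    """Toy scorer: negative distance to a known target position."""
    namespace: dict = {}
    library.load_into(namespace)
    try:
        exec(program, namespace)
    except Exception:
        return float("-inf")
    x, y = namespace["target"]
    return -math.hypot(x - 0.3, y - 0.2)

best = judge(candidates, score)
```

Because saved skills are plain source code, reloading them costs nothing in training; they are simply made available to the next generated program.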

Using these tricks, CaP-Agent0 far outperforms a vanilla LLM. In simulation, it reached human-comparable success rates on 4 out of 7 core benchmark tasks. These tasks operate with only low-level primitives (no high-level shortcuts), so it’s notable that a zero-shot agent matched the human baseline. On fully perturbed tests (in LIBERO-Pro, which shuffles object positions and rephrases goals), CaP-Agent0 even beat state-of-the-art VLA approaches. For example, in a suite of 30 manipulation tasks under perturbations, end-to-end VLAs like OpenVLA and the π0 models scored essentially zero success, whereas CaP-Agent0 managed roughly 18% success without any fine-tuning. In short, the coding agent generalized where learned policies failed. The authors also tested CaP-Agent0 on real robots (a Franka Panda and an AgiBot G1) on tasks like finding a needle in a haystack or lifting cups from clutter. Even there, without any real-world training trials, the agent successfully completed complex tasks and even solved small “block math” puzzles by moving blocks appropriately. As one NVIDIA researcher put it, CaP-Agent0’s logic-based solution often matched or beat pre-trained, end-to-end VLAs on these tasks.

The last piece of CaP-X is CaP-RL, an approach that adds learning on top of code generation. The idea is simple: treat the weights of the language model that generates the code as a policy to be fine-tuned with reinforcement learning. Since each program’s outcome is fully observable in simulation, the team used Group Relative Policy Optimization (GRPO), a variant of PPO, to adjust the 7-billion-parameter Qwen 2.5 Coder model based on task rewards. After just a few dozen training iterations on simulation tasks, performance rose sharply. In their experiments, the post-trained agent’s success rate jumped from around 20% to 72% on average. Unlike end-to-end vision models, the policy here reasons over semantic routines, so transferring it to the real robot was straightforward. Indeed, the RL-tuned agent ran on a real Franka arm and achieved roughly 84% success at cube lifting and 76% at cube stacking – all without any additional real-world training. In effect, CaP-RL bypasses the usual “sim-to-real” vision gap by never touching raw pixels during policy learning. Because the agent’s code calls the same high-level functions regardless of simulator or reality, the learned strategy generalizes immediately.
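For readers unfamiliar with Group Relative Policy Optimization, its defining step is how it scores each sampled program: several programs are generated for the same task, and each one's advantage is its reward relative to the rest of its group, so no separate value network is needed. The sketch below shows only that step in its commonly described form. The binary success rewards and the choice of population standard deviation are assumptions, not details taken from the CaP-RL setup.

```python
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four programs sampled for one task: two succeeded (reward 1), two failed (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every program fails or every program succeeds, the group carries no learning signal.
no_signal = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Programs that succeed where their siblings fail get a positive advantage and are made more likely; a task on which all samples score the same contributes nothing to the update.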

In summary, the CaP-X work demonstrates that programming agents are a viable new paradigm for robot manipulation. By open-sourcing the entire platform, including the CaP-Gym environment, 187 manipulation scenarios, and the CaP-Bench evaluation suite, the authors give the community a fine-grained way to measure “agentic” capabilities. Their experiments show that LLMs need considerable scaffolding – abstract APIs, iteration, feedback – but with these, they can achieve human-like skill. The training-free CaP-Agent0 matches expert baselines on several tasks, and the RL-enhanced CaP-RL reaches state-of-the-art results with no sim-to-real penalty. As NVIDIA’s Jim Fan enthused, this heralds “the era of agentic robotics”: robots that don’t just follow a fixed policy, but actually write and refine their own control programs.

This work suggests a broader future where robots are controlled by generalist models reasoning over modular code. Rather than baking the solution entirely into learned weights, we disentangle planning (the LLM) from execution (perception and motion primitives). One could imagine hybrid systems: a small on-device LLM handles high-level strategy while a learned motor policy executes primitive actions. Crucially, by requiring agents to work with low-level primitives, CaP-X pushes future models to genuinely understand robot feedback and adapt. It also opens research directions: improving visual-differencing modules, growing the online skill libraries, and integrating newer foundation models.

For now, CaP-X provides the robotics community with a concrete benchmark and a proof of concept. It shows that “code as policy” isn’t just a thought experiment; with the right toolchain it can control real arms as well as – or better than – the traditional ways. The full code and environments are available to experiment with, meaning any lab can try their favorite LLM in this setting. Given the dramatic gains seen by simply leveraging more computation (ensemble reasoning, RL fine-tuning, etc.), it’s likely we’ll see even stronger agentic controllers soon. In the meantime, roboticists interested in embodied AI should take note: the robot’s next trick might be to write its own program, and CaP-X is what hands it the pen and paper.

Sources: The CaP-X framework and experiments are described in the 2026 preprint by Fu et al. Key results and insights are summarized in public posts and the official project site. These include the observed abstraction gap, the visual-differencing method, and the CaP-Agent0/CaP-RL performance.