Embodied AI 101

Presents a neural network-based world model for model-based reinforcement learning in robotics, focusing on sim-to-real transfer for quadrupedal and humanoid robots. Enables robust policy optimization through learned environment simulation.

What is Embodied AI 101?

Stay in the loop on research in AI and physical intelligence.

Robotic World Model: Learning to Simulate for Robust Robot Control.

Imagine teaching a robot to dream of its next moves. Instead of endless trial-and-error on real hardware, a robot could train its controller inside a learned simulator — a world model — then directly execute the learned policy on the real machine. This is the promise of model-based reinforcement learning for robotics. A new approach from ETH Zürich’s legged robotics group, called the Robotic World Model (RWM), takes an ambitious swing at this idea. The key insight is to learn a general, black-box neural simulator of the robot’s environment that can roll out long trajectories reliably, even when dynamics are complex, partially observable, and stochastic. By training policies “in imagination” inside this learned simulator and then transferring them zero-shot to real robots, RWM offers a path toward scalable, robust robot control with minimal sim-to-real loss.

At its core, RWM replaces handcrafted simulators or analytic models with a recurrent neural network. This network is trained on recorded interaction data, predicting future observations and hidden signals (like contact forces) from a window of recent history. Critically, RWM is trained autoregressively: during training it feeds its own predictions back as inputs, rather than always using the true data stream. This so-called dual-autoregressive mechanism ensures that the model learns to correct for its small prediction errors before they spiral out of control. In the authors’ words, it “learns to stay stable over long rollouts and mitigate compounding errors” even when the dynamics are only partially observable and noisy. In practice RWM uses a simple GRU-based architecture with two loops: one loop updates the GRU state from the recent real history (“inner” autoregression), and a second loop feeds back its own predicted observations to roll forward into the future (“outer” autoregression). This entire model is trained in a self-supervised way on trajectory data, without any task-specific engineering. The result is a general-purpose dynamics model that can predict outcomes for very long sequences of actions, with minimal domain-specific tuning.

Why go to such lengths? On-policy algorithms like PPO typically need rollouts spanning many seconds of simulated interaction, which means hundreds of control steps, to estimate returns and learn stable behavior. If a learned world model accumulates even small “hallucinated” errors over those hundreds of steps, the policy will happily optimize against a fantasy world and fail on the real robot. Rather than fight this with hand-engineered physical priors for each robot and task, the RWM approach doubles down on data-driven learning. By forcing the model to correct itself over long chains of predictions, the learned simulator remains grounded, and keeping the model on the rails keeps downstream policy learning on track too. This black-box strategy has paid off: across a host of vastly different tasks (from manipulating objects to legged locomotion) the same RWM architecture and training recipe delivers far lower long-horizon prediction errors than the baselines. In experiments the authors show RWM beating basic MLP dynamics models, RSSM-style latent models, and Transformer sequence models at future-state prediction, a testament to its stability and generality.

Learning the World Model.

The RWM is trained purely on sequences of raw sensor readings and control actions, alongside any easily-obtained “privileged” simulator information. Inputs include things like the robot’s base velocity, gravity vector, joint positions, velocities, and torques. Since the model is meant to be robot-agnostic, these inputs are simply treated as vectors with no special structure. Crucially, the network also has auxiliary output heads that predict privileged simulation quantities like contact forces or foot heights. These privileged signals aren’t available to the robot at runtime, but during training they serve as extra consistency targets. In effect, the model implicitly learns the unseen physics of the system by forcing itself to predict them correctly. This clever trick helps the network internalize constraints such as contact dynamics and kinematic limits, boosting its long-term prediction accuracy.
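As a rough illustration of the auxiliary-head idea above, the training objective can be pictured as an observation-prediction error plus a second error term on the privileged targets. The sketch below is a minimal version under stated assumptions: the function names, the use of mean squared error, and the `priv_weight` knob are all hypothetical and are not taken from the authors' code.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def world_model_loss(obs_pred, obs_true, priv_pred, priv_true, priv_weight=1.0):
    """Observation loss plus a weighted auxiliary loss on privileged targets.

    priv_weight is a hypothetical knob; the source does not state a weighting.
    """
    return mse(obs_pred, obs_true) + priv_weight * mse(priv_pred, priv_true)


# Toy example: 3 observation channels and 2 privileged channels
# (for instance, per-foot contact indicators available only in simulation).
loss = world_model_loss(
    obs_pred=[0.1, 0.0, -0.2], obs_true=[0.0, 0.0, 0.0],
    priv_pred=[1.0, 0.0], priv_true=[1.0, 1.0],
)
print(round(loss, 4))
```

The privileged term only shapes training. At deployment the policy never reads those channels, so the extra head costs nothing at runtime.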

Architecturally, the RWM uses a gated recurrent unit (GRU) at its core. During training, at each timestep the GRU ingests the recent history of observations and actions (the “inner” loop) to update its hidden state. Then it passes through output layers (“heads”) that predict the next sensor observations and privileged variables. Here comes the unusual part: instead of immediately resetting the hidden state with the actual next observation, the model feeds its own prediction back into the GRU for the next step (the “outer” loop). By repeating this process over many steps, the GRU effectively simulates future trajectories. Training the network with this loop closed means it never sees a disembodied forecast; it has to cope with its own output at each step. This dual-autoregressive scheme is the heart of RWM’s stability. As the authors note, it avoids the classic “train/test mismatch” of world models – at test time a model must consume its own predictions, so RWM trains it to do exactly that. The result is a simulator that “stays accurate under autoregressive rollouts across very different robots and tasks”.
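The two loops described above can be sketched in a few dozen lines. What follows is an illustrative toy, not the authors' implementation: `TinyGRUWorldModel` is a hypothetical name, the weights are random and untrained, there are no bias terms or privileged heads, and the cell is a plain GRU-style update written in standard-library Python so the control flow is easy to follow.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.3):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class TinyGRUWorldModel:
    """GRU-style cell with a linear observation head (random, untrained weights)."""

    def __init__(self, obs_dim, act_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        in_dim = obs_dim + act_dim + hidden_dim  # each gate sees [obs, act, h]
        self.hidden_dim = hidden_dim
        self.W_z = rand_matrix(hidden_dim, in_dim, rng)   # update gate
        self.W_r = rand_matrix(hidden_dim, in_dim, rng)   # reset gate
        self.W_n = rand_matrix(hidden_dim, in_dim, rng)   # candidate state
        self.W_o = rand_matrix(obs_dim, hidden_dim, rng)  # observation head

    def step(self, obs, act, h):
        """One GRU update; returns (predicted next observation, new hidden state)."""
        x = obs + act  # list concatenation
        z = [sigmoid(v) for v in matvec(self.W_z, x + h)]
        r = [sigmoid(v) for v in matvec(self.W_r, x + h)]
        gated_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(v) for v in matvec(self.W_n, x + gated_h)]
        h_new = [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]
        return matvec(self.W_o, h_new), h_new

    def rollout(self, obs_history, act_history, future_actions):
        if not obs_history or len(obs_history) != len(act_history):
            raise ValueError("need a non-empty history of matching length")
        h = [0.0] * self.hidden_dim
        pred = None
        # Inner loop: warm up the hidden state on the recorded history.
        for obs, act in zip(obs_history, act_history):
            pred, h = self.step(obs, act, h)
        # Outer loop: feed each prediction back in as the next observation.
        preds = [pred]
        for act in future_actions:
            pred, h = self.step(pred, act, h)
            preds.append(pred)
        return preds


model = TinyGRUWorldModel(obs_dim=3, act_dim=2, hidden_dim=4)
history_obs = [[0.1 * t, 0.0, -0.1 * t] for t in range(5)]
history_act = [[0.5, -0.5]] * 5
future_act = [[0.2, 0.1]] * 10
preds = model.rollout(history_obs, history_act, future_act)
print(len(preds), len(preds[0]))
```

The rollout returns one prediction from the end of the history plus one per future action. Training with this loop closed would mean computing the loss on every element of `preds`, so the gradients see the model's own feedback rather than only ground-truth inputs.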

Behind the scenes, training is entirely self-supervised: the losses are simply prediction errors on observation variables and privileged features. There is no reward signal or task-specific objective at this stage. One can think of it as an unsupervised model of dynamics. However, the choice of loss and training regime is carefully engineered for the long-horizon goal. For example, the authors experiment with the length of the prediction horizon during training. Training the model to predict only one step ahead (pure teacher-forcing) is fast but fails to instill robustness. By contrast, training it to roll out many steps at once (serially) greatly improves its long-horizon fidelity – albeit with increased training time. The team performed ablations showing that longer training horizons (more autoregressive steps) yield qualitatively better predictions over time, validating that this self-critical training is needed for the model’s eventual stability.
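A toy calculation makes the difference between the two training regimes concrete. The sketch below assumes a scalar linear system x' = a·x and a model with a slightly wrong coefficient. This setup is invented for illustration and is not one of the paper's experiments. The one-step (teacher-forced) loss barely notices the bias, while the autoregressive rollout loss exposes how it compounds.

```python
def true_trajectory(x0, a, steps):
    """Ground-truth states of the scalar system x' = a * x."""
    xs = [x0]
    for _ in range(steps):
        xs.append(a * xs[-1])
    return xs


def one_step_loss(a_hat, xs):
    """Teacher forcing: every prediction starts from the true previous state."""
    errs = [(a_hat * xs[t] - xs[t + 1]) ** 2 for t in range(len(xs) - 1)]
    return sum(errs) / len(errs)


def rollout_loss(a_hat, xs):
    """Autoregressive: each prediction starts from the model's own last prediction."""
    pred, errs = xs[0], []
    for t in range(len(xs) - 1):
        pred = a_hat * pred
        errs.append((pred - xs[t + 1]) ** 2)
    return sum(errs) / len(errs)


xs = true_trajectory(x0=1.0, a=1.0, steps=50)
a_hat = 1.02  # model coefficient with a 2% bias
short = one_step_loss(a_hat, xs)
long_horizon = rollout_loss(a_hat, xs)
print(short, long_horizon)
```

With the same 2% bias, the rollout loss over 50 steps is orders of magnitude larger than the one-step loss, so a model trained on the rollout objective gets a far stronger signal to remove exactly the errors that matter for long imagined trajectories.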

Policy Optimization in the Learned Simulator.

Once the RWM is trained, it becomes a drop-in simulator for policy learning. The authors implement a loop very much in the spirit of Model-Based Policy Optimization (MBPO), but using on-policy PPO instead of an off-policy optimizer. Concretely, they proceed as follows. First, gather some interaction data in the ground-truth simulator (or on the real robot, if that were safe), for example a few rollouts of a random or baseline policy. Second, train the RWM on this initial dataset. Third, train the control policy inside the RWM, treating the world model as the environment; in practice this means running PPO rollouts where each step’s transition is generated by RWM’s predictions. Because the model can generate as much imagined data as needed, PPO can improve the policy rapidly. Every so often, the updated policy is run in the ground-truth simulator to collect new data, and the RWM is retrained. This loop repeats until convergence.
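The collect, fit, improve, repeat structure of this loop can be shown on a deliberately tiny problem. Everything in the sketch below is an assumption made for illustration: the "true" environment is a scalar system x' = x + 0.5·u, the world model is a one-parameter least-squares fit, and policy improvement is a grid search over a feedback gain, which is swapped in for PPO to keep the example short.

```python
import random


def true_step(x, u):
    """Stand-in for the ground-truth simulator: x' = x + 0.5 * u."""
    return x + 0.5 * u


def collect(policy_gain, n, rng):
    """Run the current policy (plus exploration noise) in the true environment."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        u = -policy_gain * x + rng.gauss(0.0, 0.3)
        data.append((x, u, true_step(x, u)))
    return data


def fit_model(data):
    """Least-squares fit of b in the learned model x' = x + b * u."""
    num = sum(u * (x_next - x) for x, u, x_next in data)
    den = sum(u * u for _, u, _ in data)
    return num / den


def imagined_return(policy_gain, b_hat, horizon=20, x0=1.0):
    """Roll the policy out inside the learned model; reward is -x^2 per step."""
    x, total = x0, 0.0
    for _ in range(horizon):
        x = x + b_hat * (-policy_gain * x)
        total -= x * x
    return total


def improve_policy(b_hat, candidates):
    """Grid search in imagination (a simple stand-in for PPO updates)."""
    return max(candidates, key=lambda k: imagined_return(k, b_hat))


rng = random.Random(0)
gain, dataset = 0.0, []
candidates = [i * 0.1 for i in range(41)]  # feedback gains 0.0 .. 4.0
for iteration in range(3):
    dataset += collect(gain, 50, rng)          # periodic reality check
    b_hat = fit_model(dataset)                 # retrain the world model
    gain = improve_policy(b_hat, candidates)   # policy learning in imagination
print(b_hat, gain)
```

The true environment is only touched inside `collect`. All policy evaluation runs through `imagined_return`, which is the property that makes the real pipeline cheap once the world model is accurate.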

We can think of this as simply “PPO in imagination” with periodic reality checks. Importantly, PPO was chosen because of its robustness in robotic control tasks. PPO is known to require long episodes for good performance, which underscores why the stability of RWM is so critical. As the authors point out, “PPO is still the empirically reliable workhorse for robot control, yet PPO needs long-horizon trajectories … If a learned model starts hallucinating during autoregressive rollouts, those errors compound … and PPO will happily optimize against a fantasy world”. Thus, RWM’s training procedure is directly geared towards making PPO training safe.

The beauty of this pipeline is that, aside from the initial data gathering and the periodic data-collection rounds, no ground-truth simulation or hardware experiments are needed during policy learning. Once the model is fit, all policy improvement happens “offline” in the learned simulator. This makes learning much safer and more scalable. The framework is quite plug-and-play: any PPO implementation can be wrapped to query the RWM for next observations, and the rest of RL proceeds as normal. The authors provide an open-source Isaac Gym extension that automates this process, along with scripts for handling the data loops and comparisons. In effect, one trains exactly as if one had created a high-fidelity custom simulator, except that the world model is a neural net trained purely from data.
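One way to picture the "wrap the model as an environment" idea is a thin adapter that gives a learned dynamics function the usual reset/step interface. The class and function names below are hypothetical and do not come from the authors' extension. The toy dynamics function stands in for a trained world model, and a real wrapper would also carry the recurrent hidden state between steps.

```python
class ImaginedEnv:
    """Exposes a learned dynamics function through a reset/step interface."""

    def __init__(self, predict_next, reward_fn, initial_obs, horizon):
        self.predict_next = predict_next  # (obs, action) -> next obs
        self.reward_fn = reward_fn        # (obs, action, next_obs) -> float
        self.initial_obs = list(initial_obs)
        self.horizon = horizon
        self.obs, self.t = None, 0

    def reset(self):
        self.obs, self.t = list(self.initial_obs), 0
        return self.obs

    def step(self, action):
        next_obs = self.predict_next(self.obs, action)
        reward = self.reward_fn(self.obs, action, next_obs)
        self.obs, self.t = next_obs, self.t + 1
        done = self.t >= self.horizon  # imagined episodes end on a step budget
        return next_obs, reward, done


def toy_model(obs, action):
    """Hypothetical stand-in for a trained world model: a damped integrator."""
    return [0.9 * o + 0.1 * a for o, a in zip(obs, action)]


def tracking_reward(obs, action, next_obs, target=1.0):
    """Negative squared distance of the predicted observation from a target."""
    return -sum((o - target) ** 2 for o in next_obs)


env = ImaginedEnv(toy_model, tracking_reward, initial_obs=[0.0], horizon=5)
obs, done, steps = env.reset(), False, 0
while not done:
    obs, reward, done = env.step([1.0])
    steps += 1
print(steps, obs, reward)
```

Because the RL algorithm only sees `reset` and `step`, it does not need to know whether transitions come from a physics engine or from a neural network.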

Empirical Results and Sim-to-Real Performance.

To test RWM, the authors assembled a broad suite of robotic tasks spanning both manipulation and locomotion. For manipulation, they include tasks like reaching with a UR10 arm, pick-and-place with a Franka arm and Allegro hand (e.g. lifting and repositioning a cube, opening a drawer), and changing the orientation of an object. For locomotion, they have velocity-tracking tasks on many different robots: Unitree A1, Go1, and Go2; ANYmal B, C, and D; the Boston Dynamics Spot; the Cassie biped; and the Unitree H1 and G1 humanoids. Remarkably, the same RWM network architecture and training setup is used for all these tasks, with no task-specific tweaks. The network simply sees different input dimensions or action spaces corresponding to each robot.

The learned models are evaluated on how well they predict long sequences of observations. Across the board, RWM outperforms the baselines. In one benchmark, the authors compare on test-rollout prediction error for sequences of up to dozens of steps. The RWM’s error stays low, while that of a one-step-trained MLP or a naive RSSM and even a black-box Transformer explodes quickly. Quantitatively, they report that “RWM trained autoregressively achieves the lowest prediction error relative to MLP, RSSM, and transformer baselines” on every task. It also generalizes robustly: adding sensor noise or slight domain shifts does not unduly degrade its forecasts. This superior predictive accuracy gives more reliable imagined rollouts for policy training.

Thanks to its stability, the world-model-based PPO scheme finds strong control policies sample-efficiently. In comparisons on locomotion tasks, their MBPO-PPO (model-based PPO) consistently outperforms competitors. For instance, they compare against a short-horizon gradient method called SHAC (which suffers badly from model bias) and DreamerV3 (a recent state-of-the-art world-model approach). SHAC performed poorly because it backpropagates through the model, which makes it very sensitive to model imperfections. DreamerV3 learned reasonably well but struggled on these demanding long-horizon tasks. The MBPO-PPO loop using RWM needed far fewer ground-truth simulation samples to reach high performance, and it achieved higher final rewards. On the ANYmal D quadruped’s forward/turning velocity-tracking task, MBPO-PPO reached a stable high reward (around 30) in only ~2000 model-based iterations, with the model error dropping below 5 by then. The more complex Unitree G1 humanoid needed longer (~10000 iterations) but still converged to a deployable policy. (These numbers were obtained on a single NVIDIA RTX 4090 GPU: training the ANYmal world model took about 12 hours, and policy training another 6 hours on that hardware.)

Perhaps the most striking result is actual hardware validation. The authors were able to take a policy trained entirely in RWM’s imagined environment and run it on the real robot without any further fine-tuning (zero-shot transfer). In their demonstrations, the 50 kg, 12-DOF ANYmal D could accurately follow velocity commands (forward and turning speeds) just as it had in simulation. The predicted motion trajectories from the RWM were nearly indistinguishable from a high-fidelity simulator’s output – effectively the robot was executing its own “black-box simulator” in real time. The results were so accurate that, as the team notes, the sim-to-real performance degradation was minimal. For the speed-tracking tasks shown on video, the ANYmal followed the velocity commands with only tiny steady-state errors and no catastrophic failures. This experiment strongly supports the claim that RWM can learn robust dynamics and close the sim-to-real gap: by learning the environment’s nuances from (simulated) data, the policy need not know about any model bias when controlling the real robot.

Although the real-hardware demonstration focused on ANYmal D, the same method also showed promise on the Unitree G1 humanoid in similar speed-control tests. (The G1 is a 29-DOF biped, making the problem considerably harder.) The fact that zero-shot transfer worked on such different platforms speaks volumes for the model’s generality. In practice one might consider adding some domain randomization or a small amount of real data fine-tuning in a deployed system, but the core result here is that RWM yields policies that are already close enough to reality.

Discussion: Toward Adaptive Robot Intelligence.

The RWM framework represents a significant advance in embodied AI. By focusing on the hardest part of model-based RL – learning a flexible, stable world model – it sidesteps the reliance on expensive, hand-modeled simulators. There’s an elegance to the solution: a single generic GRU model, trained end-to-end by making noisy dreams, supports the learning of multiple robot controllers. This contrasts with prior approaches that often bake in physics equations or task constraints to stabilize predictions. The authors argue persuasively that such hand-crafting restricts applicability. Instead, RWM bets on data: given enough interaction data upfront, the network can implicitly capture contact events, inertia, and all that messy physics. The successful zero-shot transfers suggest this bet paid off.

From an engineering standpoint, adopting RWM is mainly a matter of plugging in the pre-trained network and running policy learning on it. The researchers have provided code (in an Isaac Gym extension) that handles the details: initializing the model, collecting data, rolling out the imagined trajectories, and running PPO. One need only specify the observation/action space and reward for the task. Because RWM is domain-agnostic, it can in principle work for any robot or environment you throw at it (in practice, one must gather data for each new morphology or task). The key requirement is access to some source of interaction data – for example, running some initial policy in a nominal simulator or slowly on the real robot – so that the world model has enough to learn from.

However, there are caveats. The authors are clear that training fully on real hardware is not yet solved. In the lab, their online training loop stayed inside simulation (with domain shifts between “training” and “testing” sims) for safety reasons. On a real robot, a partially trained policy that exploits model errors could crash the machine. They report that during online experiments, the policy did indeed sometimes try risky maneuvers that would have damaged the robot if unchecked. Resetting to the initial state on a big robot like ANYmal is also nontrivial without a human helper. For now, the team mitigated this by carefully restricting training scenarios and by planning to add uncertainty estimates in future work (see their related “Uncertainty-Aware RWM” paper). In short, RWM solves much of the sim-to-real puzzle, but one must still be cautious about policies that exploit model bias when exploring in the real world.

Looking ahead, RWM opens exciting avenues. It suggests robots could spend most of their “thinking time” in a learned imagination, endlessly refining skills at low cost. The combination of GRU architectures with autoregressive training might inspire other domains – for instance, manipulation with deformable objects or even multi-agent control. Moreover, this neural simulator is itself a compact artifact of the environment; one could imagine using RWM for planning or reasoning (e.g. what-if queries) in addition to policy learning. The ETH team is already extending the idea to offline RL on real robot datasets (“RWM-U”), which addresses scenarios where real-time interaction is scarce.

In sum, the Robotic World Model demonstrates that a simple, generic learned simulator can be remarkably powerful. By embracing end-to-end learning and explicitly training for long-term stability, it delivers “imagination” that is almost indistinguishable from reality during control. For robotics researchers, this work is a concrete template: collect some data, train a GRU-based world model with dual autoregression, and then run standard RL inside it. The result is high-performance policies that generalize across tasks and even survive the leap to hardware. As the authors conclude, RWM “paves the way for adaptive and efficient robotic systems in real-world applications”.

References: The Robotic World Model paper and project page provide full details; see especially their experiments and videos demonstrating ANYmal’s zero-shot transfer. The code and further explanations are available in their GitHub repository. Other related works include Dreamer-style world models and MBPO (Janner et al. 2019) for context. The RWM concept was recognized with an award at the NeurIPS 2025 Embodied World Models workshop, underscoring its impact.
