Embodied AI 101

Vision-language model that adds generative depth prediction during pre-training for physical grounding; achieves SOTA on embodied benchmarks and transfers directly to real-robot tasks.

What is Embodied AI 101?

Stay in the loop on research in AI and physical intelligence.

Generative Depth Supervision for Embodied Vision-Language Models.

Large vision-language models (VLMs) have recently become a cornerstone of embodied AI: they can parse rich textual instructions and visual inputs, guiding robots through complex manipulation tasks. However, standard VLM pre-training focuses on high-level semantics – aligning images to captions or answering static visual questions – and often neglects the low-level spatial and physical structure critical for real-world action. In other words, there is a gap between semantic understanding and physical grounding. For example, a model may correctly identify that “the mug is on the table” from an image, but it still needs a precise 3D sense of where the mug is relative to the robot’s arm. Bridging this gap is essential for tasks like picking-and-placing objects, folding clothes, or navigating cluttered environments.

The new paper “GEM: Generative Supervision Helps Embodied Intelligence” by researchers at Tencent Hunyuan and Tsinghua University directly addresses this shortfall. GEM introduces a novel pre-training objective: alongside the usual language and vision tasks, the model learns to generate depth maps of the scene as it pre-trains. In practice, this means augmenting a vision-language backbone with a generative depth-prediction module, and training them jointly. By forcing the model to reconstruct 3D geometry (via a graphics-inspired diffusion process), GEM encourages the internal representations to encode fine-grained spatial cues. As the authors put it, they “integrate a depth map generation task directly into the VLM pre-training phase”. The result is a model (“GEM”) whose learned features carry explicit structural information about the scene, not just high-level object labels.

This combination of language grounding and generative depth understanding leads to substantial gains on embodied benchmarks. In simulation environments like LIBERO and SIMPLER-WidowX, GEM achieves state-of-the-art performance (e.g. ~96% success on LIBERO tasks), far exceeding conventional Vision-Language-Action (VLA) models. Even more strikingly, when deployed on a real robot (a UR5 arm) in physically challenging tasks (table cleaning, cloth folding, zipper opening), the GEM-based agent attains an average success rate about 14 points higher than the prior best (43% vs 28.7%). In short, by embedding a generative depth task into pre-training, GEM significantly improves both the semantic and spatial/physical capabilities of embodied agents.

In the sections below, we dive deep into GEM’s approach and results. First, we explain how the model is built, focusing on the generative depth-prediction module and its training regimen. We then examine the new GEM-4M dataset curated to support this method, which contains millions of paired vision-depth-language examples for training. Finally, we survey GEM’s performance on a range of benchmarks — from simulated manipulation suites to real robot experiments — and discuss why generative depth supervision yields such dramatic gains in physical grounding. Our goal is to give a thorough, nuanced picture of how GEM works and why it matters for the future of embodied AI.

The Generative Depth-Predictive VLM.

At the heart of GEM is a multi-task training objective that combines the usual language modeling losses with a generative depth reconstruction loss. Concretely, GEM starts with a large pretrained vision-language backbone (in this case, a Qwen3-VL model) that takes an instruction and an image observation and produces a multimodal embedding. Typically a VLM would use these features to predict the next text token or answer a question. GEM augments this backbone with a lightweight “connector” and a Diffusion Transformer (DiT)-based depth generator (Figure 1).

The process is as follows (an overview paraphrased from the authors’ description). The VLM backbone encodes the image into visual tokens \(h_o\). A small 2-layer MLP (the connector) then projects \(h_o\) into a learned conditional embedding \(c\). This embedding is fed into the generative head \(G_{\psi}\), a DiT model adapted for depth. The DiT head takes \(c\) as a conditioning input and generatively reconstructs the scene’s depth map \(d\) (the 2D grid of distances from the camera to the surface seen at each pixel). In training, this generative depth head is optimized with a flow-matching objective, similar to a diffusion model: starting from Gaussian noise, the network learns a vector field that gradually transforms the noise distribution into the true depth distribution. (Intuitively, the model injects noise into the depth and then learns how to remove it step by step, guided by the features \(c\) from the real image.) In parallel, the main VLM branch is trained with its usual cross-entropy loss on language tokens or question-answer pairs.
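To make the connector and the flow-matching objective concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the ReLU activation, the straight-line path from noise (t = 0) to depth (t = 1), the tensor shapes, and the names `connector`, `flow_matching_loss`, and `velocity_fn` are all assumptions chosen for illustration, and a plain function stands in for the DiT head.

```python
import numpy as np

rng = np.random.default_rng(0)

def connector(h_o, w1, b1, w2, b2):
    """Two-layer MLP mapping visual tokens h_o to the conditioning embedding c."""
    hidden = np.maximum(h_o @ w1 + b1, 0.0)  # ReLU is an assumption, not from the paper
    return hidden @ w2 + b2

def flow_matching_loss(velocity_fn, depth, c, rng):
    """Flow-matching loss for a batch of depth maps of shape (batch, H, W).

    velocity_fn(x_t, t, c) is a stand-in for the DiT head and must return
    an array with the same shape as depth.
    """
    noise = rng.standard_normal(depth.shape)       # x_0 ~ N(0, I)
    t = rng.uniform(size=(depth.shape[0], 1, 1))   # one time value in [0, 1) per sample
    x_t = (1.0 - t) * noise + t * depth            # point on the straight path noise -> depth
    target_velocity = depth - noise                # time derivative of x_t along that path
    predicted = velocity_fn(x_t, t, c)
    return float(np.mean((predicted - target_velocity) ** 2))

def zero_velocity(x_t, t, c):
    """Untrained placeholder head that always predicts zero velocity."""
    return np.zeros_like(x_t)

# Toy usage with random stand-ins for the backbone output and connector weights.
h_o = rng.standard_normal((2, 16, 32))                     # (batch, tokens, hidden)
w1, b1 = rng.standard_normal((32, 64)) * 0.1, np.zeros(64)
w2, b2 = rng.standard_normal((64, 8)) * 0.1, np.zeros(8)
c = connector(h_o, w1, b1, w2, b2)                         # (batch, tokens, 8)
depth = rng.uniform(0.5, 3.0, size=(2, 4, 4))              # fake depth maps in metres
loss = flow_matching_loss(zero_velocity, depth, c, rng)
```

The placeholder head ignores its inputs, so the loss here only measures how far a zero prediction is from the target velocity. A trained head would use \(c\) to push that loss toward zero, which is what forces geometric information into the visual tokens.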

In mathematical terms, the joint loss for GEM is a weighted sum of the usual language-modeling loss \(\mathcal{L}_{CE}\) and the depth-generation loss \(\mathcal{L}_{flow}\) (flow matching). Concretely, during pre-training they alternate between typical VLM masked-token or next-token training and a generative depth-training step. The net effect is that the visual tokens \(h_o\) from the image must serve two masters: they must enable the text predictions (as usual) and they must encode enough geometric information to allow recovering the depth map. This “multi-headed” training forces the model’s internal vision features to capture 3D structure in addition to semantics.

To ensure stability and effective learning, GEM uses a progressive training schedule. First, they initialize the connector MLP with a short phase (keeping the main VLM and generative head frozen) so that the learned visual features can be properly mapped into the DiT space. Next, they warm up the depth head itself with the connector, gradually learning basic depth generation. Finally, all components – the VLM backbone, connector, and DiT head – are unfrozen and trained jointly with the combined loss \(\mathcal{L}_{total} = \mathcal{L}_{CE} + \lambda \mathcal{L}_{flow}\). The authors found this three-stage strategy was crucial: “the three-stage progressive training strategy is crucial for stable convergence and effective fusion of semantic and structural features”. (In practice, after this joint pre-training, they add an action-prediction head and train a final stage where the model generates robot actions autoregressively, but that detail is beyond our present focus.)
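The three-stage schedule can be written down as a small configuration table. The sketch below is only a reading of the description above. The stage names, the `Stage` fields, the default value of `lam`, the assumption that the backbone stays frozen in stage two, and the assumption that only the flow loss is used before the joint stage are ours, not details confirmed by the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str                # hypothetical label for the stage
    train_backbone: bool     # is the VLM backbone unfrozen?
    train_connector: bool    # is the 2-layer MLP connector unfrozen?
    train_depth_head: bool   # is the DiT depth head unfrozen?
    use_ce_loss: bool        # is the language cross-entropy loss included?

STAGES = [
    Stage("connector_init", False, True, False, False),    # only the connector learns
    Stage("depth_head_warmup", False, True, True, False),  # connector + depth head learn
    Stage("joint", True, True, True, True),                 # everything learns
]

def total_loss(stage, ce_loss, flow_loss, lam=1.0):
    """Return L_CE + lam * L_flow in the joint stage, and L_flow alone before it."""
    if stage.use_ce_loss:
        return ce_loss + lam * flow_loss
    return flow_loss

for stage in STAGES:
    print(stage.name, total_loss(stage, ce_loss=2.0, flow_loss=0.5, lam=0.1))
```

Keeping the backbone frozen until the last stage means the randomly initialized connector and depth head cannot disturb the pretrained language features while they are still producing noisy gradients.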

Why a generative head (DiT) for depth? One might ask why GEM uses an image-generative model rather than a simpler regression network for depth. The key is that generative diffusion models have shown remarkable power in modeling complex distributions, and here they push the model to capture the distribution of plausible depths. By training with a flow-matching loss, GEM’s depth head does more than minimize an L2 regression error on depth: it learns a rich implicit representation of geometry. In fact, ablations in the paper show that generative depth supervision is more effective than image reconstruction: models trained to reconstruct RGB or masked pixels did not gain the same spatial insight as those trained on depth. Depth maps explicitly encode distances, so they provide a direct cue for spatial relationships (e.g. one object is twice as far away as another), which plain RGB cannot do. By embedding 3D cues in the loss, the model’s vision features acquire “low-level structural information” that they otherwise lack.
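To see why a depth map is such a direct spatial cue, recall that with a pinhole camera model each depth pixel can be lifted to a 3D point in the camera frame. The sketch below shows that standard back-projection. It is not part of GEM. The intrinsics `fx`, `fy`, `cx`, `cy` are made-up values, and depth is assumed to be measured along the optical axis.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to (H, W, 3) camera-frame points (pinhole model)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u = column index, v = row index
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

# Toy scene: the right half of the image is twice as far away as the left half.
depth = np.full((4, 4), 1.0)
depth[:, 2:] = 2.0
points = backproject(depth, fx=2.0, fy=2.0, cx=1.5, cy=1.5)
print(points[0, 0], points[0, 3])
```

The "twice as far" relationship is readable directly from the third coordinate of each point, whereas an RGB image of the same scene carries it only implicitly through cues such as apparent size and occlusion.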

In summary, GEM’s architecture extends a standard vision-language model with a diffusion-based depth-prediction module. The training objective ties together language understanding and 3D perception. This novel “generative depth-prediction” approach forces the model to encode geometry at a fundamental level, effectively bridging the semantic-physical gap that plagues conventional VLMs. Next, we turn to the data that makes this training possible.

GEM-4M: A Large Depth-Enhanced Embodied Dataset.

To train a joint language-vision-depth model at scale, GEM’s authors collected a massive dataset called GEM-4M. This dataset contains roughly 4 million multimodal examples, each pairing an image with a natural-language instruction or question-answer pair and a depth map of the scene. The examples are organized into three high-level categories of embodied reasoning:

Grounding (Affordance) Data: About 1 million examples focus on where and how an object can be manipulated. For instance, given an image of a scene, a question might ask "Where is the bread that the robot can pick up?" or "Can this door be slid to the side?" These tasks require localizing objects and understanding affordances (which part of a mug is for grasping, etc.). The GEM paper draws on existing resources like PACO-LVIS, RoboPoint, and annotations from robot logs, to generate open-vocabulary descriptions and object queries. Importantly, each example in this set comes with a high-quality depth map for the scene – either from 3D scans of the environment or synthetically generated (the authors even used a tool called DepthAnythingv3 to produce depth when it didn’t exist).

Spatial Reasoning Data: Another ~2.2 million examples probe 3D relationships and properties in a scene. These might be questions like “Which object is closer to the camera, the green block or the blue block?” or “Is the red cup taller than the white plate?” For this, GEM combines or generates from datasets like MindCube, ViCA, SPAR, VSI (spatial video QA), and others. The examples involve numerical or comparative queries about distance, direction, size, and 3D arrangements of objects. Because these questions hinge on accurate spatial perception, having the depth map is crucial. In fact, many of these datasets already come from 3D environments (e.g. simulated indoor scenes), so the depth ground truth is readily available.

Spatio-Temporal Planning Data: Finally, ~0.8 million examples address longer-term tasks and planning. Each sample might ask if a goal has been achieved (e.g. “Is the table clean after these steps?”) or what the next correct action should be in a pick-and-place or assembly sequence. The GEM team drew on robot demonstration logs (for instance, sequences of steps annotated with subgoals) to craft question-answer pairs about task completion and next-step prediction. Again, each frame has an associated depth map – sometimes ground truth from simulators, or pseudo-depth from their depth model when needed.

Altogether, GEM-4M provides millions of depth-augmented vision-language training examples. Crucially, this is tailor-made for the pre-training tasks of GEM: the model learns from both the textual aspect (instructions, Q/A) and the visual 3D aspect (via the depth). The authors also release a smaller subset, GEM-250K, to spur community research. GEM-250K includes 250,000 examples sampled from the full set: roughly 100k grounding (affordance) cases, 100k spatial reasoning, and 50k planning. Even this subset covers all three categories, enabling researchers to reproduce core results without needing the full scale of data.
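A quick tally of the category sizes quoted above shows how GEM-250K relates to the full set. The counts are the approximate figures from this article, and the dictionary keys and the `shares` helper are illustrative names.

```python
# Approximate example counts per category, as quoted in the text.
full = {"grounding": 1_000_000, "spatial_reasoning": 2_200_000, "planning": 800_000}
subset = {"grounding": 100_000, "spatial_reasoning": 100_000, "planning": 50_000}

def shares(counts):
    """Fraction of the total contributed by each category."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

print(sum(full.values()), shares(full))
print(sum(subset.values()), shares(subset))
```

On these numbers the full set splits 25% / 55% / 20% across grounding, spatial reasoning, and planning, while the subset splits 40% / 40% / 20%. GEM-250K is therefore not a proportional sample: it gives grounding more weight and spatial reasoning less than the full dataset does.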

By training on GEM-4M, the model sees a rich variety of scenarios where language, vision, and geometry intertwine. For example, one can imagine the model reading an instruction like “Put the cereal box next to the bowl” and simultaneously learning to generate the depth map so it knows exactly how far away and at what angle to move. The depth pairs ensure that the learned representations capture subtle 3D cues that language alone cannot convey. In short, GEM-4M equips the model with both abstract reasoning practice (through question-answer pairs) and ground-truth 3D perception (through depths) for each scenario.

Results on Embodied AI Benchmarks.

With the generative depth-enhanced model and GEM-4M data in hand, the authors evaluated GEM across a battery of challenging embodied intelligence benchmarks. Their goal was to test not only symbolic reasoning (like question-answering about space) but actual robot manipulation tasks in simulation and the real world. Across the board, GEM delivered state-of-the-art performance, confirming that depth supervision indeed bolsters the agent’s physical grounding.

Simulation Benchmarks (Vision-Language-Action tasks): The authors first tested GEM-VLA (GEM extended to Vision-Language-Action) in common simulated manipulation suites:

LIBERO: This is a comprehensive benchmark for lifelong robot learning, containing hundreds of tasks grouped by spatial, object-centric, and goal-oriented skills. In GEM’s evaluation, they report results on four representative task suites. The GEM-VLA agent achieved an average success rate of 96.1% across these suites. This far exceeds the prior best results from unmodified VLM-based agents or spatial baselines. For perspective, non-depth-supervised VLAs typically struggle in the 70–80% range on these tasks. The near-perfect 96.1% suggests that GEM’s implicit geometry understanding makes learned policies highly robust.

SIMPLER-WidowX: Another set of benchmarks involves manipulation tasks on a WidowX robot platform (e.g. putting objects in baskets, stacking blocks). In this suite, GEM-VLA achieved a 67.0% average success rate across tasks. This too was state-of-the-art; previous models without explicit depth cues were notably lower. (For example, a typical baseline might score 50–60%.)

Vision-Reasoning Tests: Beyond pure action tasks, the model was evaluated on spatial reasoning questions. On VSI-Bench (the Visual Spatial Intelligence benchmark), which probes video-based spatial Q&A, GEM showed large improvements. The paper reports that when using the same VLM size (2B parameters), GEM’s score jumped from roughly 50.4 to 62.8; with an 8B model, from 57.9 to 70.6. In other words, GEM’s depth pre-training boosted spatial reasoning accuracy by more than 12 points at both model sizes. This indicates that the learned vision features now carry richer metric information, aiding tasks like distance estimation or fine-grained spatial queries.

Other Reasoning Benchmarks: GEM also outperformed or matched baselines on a variety of VQA and spatial benchmarks (CV-Bench, EmbSpatial, Where2Place, RoboSpatial, etc.). The review summarizing the paper notes that GEM even beat Gemini-3-Pro (a strong commercial system) by about 10% on fine-grained spatial grounding tasks, highlighting how much the model has improved in understanding geometry.

In summary, across simulated environments requiring both high-level planning and low-level precision, GEM’s generative-depth pre-training consistently pushed the state of the art. By training it on depth-rich question-answer pairs, the model became substantially better at tasks that depend on knowing where things are and how to move in 3D space.

Real-World Transfer and Robot Tasks.

Perhaps most impressively, the benefits of GEM’s depth supervision were not limited to simulations. The authors directly transferred GEM-VLA to a real robotic system (a UR5 manipulator) without extra finetuning and tested it on challenging physical tasks. These real-world experiments underscore the model’s enhanced physical grounding:

Task Suite: They evaluated GEM on three long-horizon tasks: table cleaning (bussing), cloth folding, and backpack zipper opening. Each of these tasks is difficult for robots: table cleaning involves planning a sequence of actions to tidy up many objects, cloth folding requires manipulating deformable geometry, and zipper operation demands fine control.

Performance: GEM-VLA achieved an average success rate of 43% across these tasks, dramatically surpassing the prior best (28.7%). In fact, GEM outperformed the baseline in every task and even in per-subtask completion rate (i.e. partial successes). Breaking it down, each individual task saw significantly higher success – for instance, cloth folding and zipper tasks went from near zero up to meaningful success. A chart in the Japanese summary shows GEM above baseline on all tasks. These results indicate that the depth-informed model generalized well from simulation data to real sensors and dynamics.

The fact that GEM transfers “out of the box” to real hardware is notable. It suggests that generating depth maps during training conferred a form of geometric robustness: the model’s perception no longer hinges on just the appearance of objects, but on spatial structure that is more invariant between sim and real. The Japanese blog puts it this way: GEM “exhibits vastly superior task execution abilities in both simulation environments and real-world evaluations”.

In practical terms, this means that with a camera mounted on the robot’s end-effector and a language command (“fold the linens”, say), GEM can reason through the scene’s 3D layout to plan and execute motions. The results on real tasks (43% vs ~29% success) make clear that adding depth prediction to the pre-training was not just an academic exercise: it substantially improves a robot’s ability to physically ground its decisions.

Why Depth-Driven Pre-training Works.

The strong results raise the question: why does this particular setup work so well? From a technical standpoint, the use of generative depth prediction accomplishes two things:

Encouraging Geometry-Aware Features. By forcing the vision-language model to generate the scene’s depth, we ensure that its internal features encode 3D structure. In contrast, with ordinary masked token or caption objectives, a model can succeed by focusing mainly on object labels and coarse spatial hints. Depth generation demands that every part of the feature map relate to actual distances. As the reviewers note, depth supervision leads to “high-fidelity depth map generations” from the model’s visual features, whereas standard finetuned models do not. In other words, GEM learns strong structural priors: the model’s visual encoding \(h_o\) must capture the geometry to succeed. Indeed, ablation studies found that if you replace the depth loss with an RGB reconstruction loss (asking the model to regenerate the input image instead of the depth), the model loses these benefits for spatial reasoning. Depth is a much more direct cue for “where” things are, so it teaches the model to know distances and surfaces.

Joint Semantic-Geometric Learning. Language instructions often only partially specify the scene; there are ambiguities about unseen geometry that text alone can’t resolve (e.g., a command might say “Pick up the object next to the red box,” but which object that is depends on depth ordering). By combining language and depth in training, GEM implicitly learns to reconcile those modalities. The visual features become multi-faceted: they serve semantic tasks and also reconstruct geometry. This “joint supervision” leads to better embodied reasoning: when the agent is later asked to execute an instruction, it now has an internal model of the scene that includes numeric depth estimates.

Put another way, traditional VLM pretraining teaches what is in the scene (labels, attributes) but not where exactly. GEM adds that missing piece. The Moonlight review of the paper emphasizes this: GEM’s approach “forces \(h_o\) to encode both structural (for depth generation) and semantic information”. This dual focus is what translates to improved performance on embodied tasks.

Importantly, GEM does this with a relatively simple addition: a small connector and a DiT-based head. Many other approaches to physical grounding add geometric inputs or special modules. For example, XEmbodied (April 2026) uses 3D occupancy grids and explicit 3D adapters, and runs costly RL fine-tuning. GEM does not require explicit 3D sensors at runtime: it needs depth only at training time. By making depth a training target rather than an input, GEM leverages millions of cheap synthetic depth maps in pre-training, yet still operates like a normal RGB-language model at inference time.

The empirical gains back up the intuition. The improvement on tasks like LIBERO (96% success) shows that GEM’s policies generalize across many manipulation skills. The VSI-Bench gains (50→63 points) quantitatively confirm that depth training improved spatial language understanding. And the leap in real-robot success rate (from 28.7% to 43%) speaks directly to “physical operation capabilities” being enhanced. In summary, the results suggest that depth-generation serves as a form of very strong auxiliary supervision: it aligns the model’s visuals with the physics of space.
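The headline gains can be checked with a few lines of arithmetic. The before/after figures are the ones quoted in this article, while the `gain` helper and the labels in `results` are illustrative names.

```python
def gain(before, after):
    """Return (absolute gain in points, relative gain as a fraction of the baseline)."""
    absolute = after - before
    return absolute, absolute / before

results = {
    "VSI-Bench, 2B model": (50.4, 62.8),
    "VSI-Bench, 8B model": (57.9, 70.6),
    "Real-robot average success (%)": (28.7, 43.0),
}

for name, (before, after) in results.items():
    points, relative = gain(before, after)
    print(f"{name}: +{points:.1f} points ({relative:.0%} relative)")
```

Run as written, this reports gains of 12.4 and 12.7 points on VSI-Bench and 14.3 points on the real-robot average. The last of these is roughly a 50% relative improvement over the 28.7% baseline.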

Discussion and Outlook.

The GEM paper represents an important step in embodied AI: it shows that generative multi-modal pretraining can pay dividends for robotics tasks. While generative objectives (like diffusion) have been wildly successful in image and video synthesis, GEM applies them to a different purpose — to improve inference in embodied agents. The fact that the model can produce accurate depth maps as a byproduct is impressive, but the key point is that this skill was only needed in training to shape its representation.

Several broader implications and questions arise:

Data and Computation: Training a multi-billion parameter VLM with millions of depth maps is expensive. Indeed, GEM uses a cutting-edge Qwen3-VL backbone and an added diffusion head, along with a custom 4M dataset. This means replicating GEM’s exact setup requires substantial computational resources. However, the authors mitigate this somewhat by releasing GEM-250K, and the reported results suggest that even the smaller backbones (2B and 8B) show big wins. For the research community, the core idea of adding depth prediction is transferable: future work might apply it to different VLM architectures or smaller models.

Realism of Depth: GEM relies on having depth maps available as training targets. For simulated and CAD-generated scenes, ground-truth depth is easy to obtain. For real-world images, one could use multi-view stereo or lidar data. The review notes they even generated pseudo-depth with a model for some data. The success of GEM suggests that even imperfect depth (from synthetic rendering or depth-estimation models) can be useful, as long as it’s reasonably aligned with the visuals. This opens up possibilities: one could imagine scaling up by web-scraping RGB-D data or running simulators in photorealistic environments.

Relation to Other Work: GEM is part of a wave of research trying to inject geometric or physical biases into VLMs. Besides XEmbodied (using 3D inputs) and VLM3 (arguing that VLMs implicitly learn 3D), others have proposed adding 3D adapters or training on egomotion. What sets GEM apart is the generative-loss formulation, which is relatively architecture-agnostic. GEM also trains an action head using a similar diffusion-style loss (as noted in the Moonlight review), closing the loop so that even the policy network has a generative flavor. This unified perspective, in which both perception and action generation are framed as diffusion-like tasks, is a fascinating direction.

Physical Grounding: Conceptually, GEM strengthens the notion that embodied AI benefits from tying language models to the physics of the world. The model sees language paired with exactly how that scene exists in 3D. Over time, this could allow even more elaborate grounding: for example, one might imagine using generative models not just for static depth, but for predicting whole 3D scene reconstructions or future frames. GEM opens the door to such possibilities.

Limitations: Though GEM’s results are strong, performance is not perfect. A 43% success rate on real tasks, while impressive, still means more failures than successes. There is room to improve robustness and planning in the real world. Some failures likely come from dynamics, contact physics, or perception noise that were not fully captured by the simulated or dataset-based training. Additionally, embedding a diffusion module does add compute; training took special stages just to align the models. In practice, running real-time diffusion might be too slow, but fortunately at inference GEM likely discards the heavy depth generation head (using it only in training), so runtime overhead is minimal.

In summary, GEM shows that “we can significantly enhance both abstract semantic reasoning and actionable physical grounding” by integrating a depth-generation objective into VLM pre-training. It is a clear example of how ideas from generative modeling can enrich the capabilities of embodied agents. For practitioners, the takeaway is that if you’re building a vision-language agent for a robot, it might pay to train it with depth-supervised tasks. The GEM-250K dataset and code (once released) will let others experiment with these ideas. We can expect to see follow-ups exploring different forms of geometric supervision, scaling laws for depth, and more real-world tests. But for now, GEM sets a new standard: by taking vision-language models beyond the flat 2D image-text space into the 3D world, it unlocks a new level of embodied intelligence.
