Embodied AI 101

Explores data modalities and co-training strategies to enhance large behavior models (foundation models) for improved performance in robot manipulation tasks, supporting end-to-end learning and cross-embodiment generalization.

What is Embodied AI 101?

Stay in the loop on research in AI and physical intelligence.

Co-Training Large Behavior Models: Multimodal Data for Robot Manipulation.

In recent years, robotics has seen a wave of optimism about large “foundation” models that can tackle a wide variety of manipulation tasks. These models are typically trained end-to-end on massive multi-task datasets of robot demonstrations and sensor inputs, in the spirit of GPT-like scaling. For example, Google’s PaLM-E model feeds image inputs, continuous state estimates, and language into one large transformer that generates text, including step-by-step plans that lower-level robot policies then execute, and it shows impressive abilities on tasks ranging from manipulation planning to visual question answering. The Toyota Research Institute (TRI) and others have similarly scaled up imitation learning to train “large behavior models” that aim to be general-purpose robot policies. However, a persistent problem has been data scarcity: real robot demonstrations are expensive and slow to collect, so even a “large” model can overfit to the data it has and fail to generalize to new situations. In short, building truly generalist robot policies requires clever ways to augment or diversify the training data beyond the limited robot demos we have on hand.

One promising idea is co-training with auxiliary data: in addition to the core robot trajectories, train the model on other related datasets (images, video, language, etc.) that might teach the model more general knowledge about objects, tasks, and language. This is analogous to how large language models train on vast text corpora, or how vision-language models like CLIP train on hundreds of millions of image-caption pairs. In the robotics context, we might co-train on things like generic vision-and-language image sets, videos of humans doing tasks, simulated data, or even text descriptions of robot motions. Intuitively, these are different “views” or modalities of the world that could help the robot policy develop broader understanding.

Yet, until now it wasn’t very clear which co-training data sources actually help a robot policy, and how to use them effectively. The recent study “A Systematic Study of Data Modalities and Strategies for Co-training Large Behavior Models for Robot Manipulation” by Lin et al. (2026) takes a rigorous look at exactly this question. The authors perform a massive empirical evaluation to compare five different data modalities and different training schemes for fusing them. Their aim is practical: to identify what kinds of extra data truly boost generalization of a large end-to-end policy, and how best to mix them in. In this episode we’ll walk through the motivation, the modalities examined, the training strategies they tried, and the key findings that can guide anyone building scalable, generalist manipulation policies.

The Promise of Large “Behavior” Models.

To set the stage, let’s recall why large end-to-end policies are appealing. In conventional robotics we might hand-design perception pipelines, motion planners, or specialized controllers. But imitation-based “behavior cloning” promises simplicity: take a neural network that maps from images (and maybe language instructions) directly to actions, and train it on as many demonstrations as you can. If we could gather millions of demos of every conceivable task, a single network might “work as a foundation” for many tasks. This is the vision of robot foundation models, analogous to how GPT-4 or large vision-language models serve as foundations in AI. In fact, early work along these lines has already shown that scaling up leads to qualitatively new capabilities. For example, PaLM-E (2023) augments a text language model with vision and state inputs, and the combined model can perform sequential manipulation planning from pixel observations. VIMA (from NVIDIA, Stanford, and collaborators) and RT-1 (Google) similarly condition a single transformer architecture on images and language to handle a wide range of manipulation tasks. Models of this kind are often called Large Behavior Models (LBMs) in robotics research.

But even these successes depend critically on data. Large language models work partly because we have vast text corpora to pretrain them. In robotics, we lack web-scale data of robot executions. Demonstrations collected by humans or teleoperation are valuable but costly; even large datasets like CC-SEED (100k demos) or BridgeData (tens of thousands of trajectories) are still limited in variety. Moreover, real-world robotic tasks can be very sensitive to even small distribution shifts in objects, viewpoints, or robot calibration.

Lin et al. point out that “their generalization remains limited by insufficient robot data coverage”. In other words, a policy trained on, say, 100 tasks in one lab may fail on task 101 or in a slightly different environment. What can we do? One solution is to augment the training data creatively, without collecting more hard-to-gather demos. This is the impetus behind co-training: to supplement the robot demonstrations with heterogeneous data that might carry useful signal.

A few recent works have tried similar ideas. For instance, Maddukuri et al. (2025) mixed simulated data with real-world data during training, and reported that including synthetic robot trajectories boosted performance on real tasks by up to 38% (we’ll revisit this later). Other groups have used human video data – viewing everyday videos of humans manipulating objects and converting them (via inverse kinematics or keypoint tracking) into approximate robot actions [38,41]. Still others have taken vision-language corpora (large image-caption datasets) and tried to tie them in, hoping that the language and visual understanding transfers. But these schemes have typically been tested piecemeal, without a systematic comparison.

The new study by Lin et al. is the first comprehensive comparison. They ask: if you had to pick a data modality to co-train with your robot demos, which ones help the most? And how should you schedule the training (mixing everything at once vs. staged training)? To answer this, the authors assembled enormous training sets and ran dozens of model variants – on simulation and on a real robot – to quantify the impact.

Five Modalities of Co-Training Data.

Lin et al. categorize the extra data sources into five broad modalities. Here’s what each means in practice:

Vision-Language Image Data: This is standard image-caption or image-description data, like what image-language models train on. It could include datasets such as COCO, LAION, or any collection of pictures paired with text. The idea is that these image-text pairs teach the model about object categories, spatial arrangements, and language. When co-trained, the model sees things like “A cat on a chair” or “A person holding a cup” (though most of it won’t be robot-specific). The hope is that the shared vision encoder (often a convolutional or transformer backbone) and language understanding get better, which could indirectly help manipulation (via better object recognition, for example).

Dense Trajectory Language Annotations: This refers to taking robot trajectories (sequences of states and actions) and having them richly annotated in language. Instead of just a sparse linguistic command (“pick up the red block”), you might describe every step: “The arm moves above the red block, extends to grasp, lifts it off the table…” and so on. Such annotations could come from crowd-workers, in-house labeling, or automated descriptions. Training on trajectories paired with these step-by-step walk-through descriptions could teach the model fine-grained task structure and how language corresponds to actions. (This is related to works on learning from narrated demonstrations or converting trajectories to text.)

Cross-Embodiment Robot Data: Here “cross-embodiment” means data from different robots or bodies than the target. For example, if your main robot is a 7-DoF arm, you might also include data from a humanoid, a quadcopter manipulator, or even a simulated hand. The idea is that even if the kinematics differ, the underlying tasks (like picking or placing) share patterns. Training on diverse robot platforms can force the model to learn higher-level abstractions (e.g. “approach object, grasp, lift”) that transcend a specific arm. This is like meta-learning across embodiments, or multi-robot knowledge transfer. Indeed, prior work has shown that policies trained on multiple robots (arms, drones, wheels) can sometimes benefit each other.

Human Videos: This includes video footage of humans performing manipulation tasks (e.g. egocentric recordings of people cooking, assembling, etc.). These videos can be either annotated or not, but typically the goal is to let the model learn from human-hand motions. Approaches like EgoVLA train a vision-language-action model on thousands of hours of human hand videos, then use inverse-kinematics to map the predicted human hand actions onto the robot controls. Similarly, other recent efforts (e.g. Li et al. 2025) preprocess egocentric video to detect hand keypoints, segment out atomic actions, and auto-generate textual descriptions, creating a giant “hand manipulation” dataset. Co-training on these shows promise because human activity videos are very diverse and abundant (think YouTube cooking videos).

Discrete Robot Action Tokens: This is a more experimental trick. The idea is to take continuous robot actions and quantize them into a vocabulary of discrete tokens (e.g. by running k-means over action vectors or by using a VQ-VAE to tokenize motions). Then you treat actions as “words” and train the model as if it were doing language modeling. The hope is that this discrete representation fits more naturally with the language-model head or with chain-of-thought style reasoning. For instance, one could concatenate an image, a language instruction, and a sequence of action tokens into a single token sequence, and train a transformer to predict the next token. Lin et al. tested such an approach to see if transforming actions into a language-like form could improve learning. (Spoiler: we’ll see that this particular modality didn’t pan out.)
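To make the action-token idea concrete, here is a minimal sketch of the simplest possible quantizer: uniform binning of each action dimension. This is a stand-in chosen for brevity, not the k-means or VQ-VAE tokenizers mentioned above and not necessarily what Lin et al. used; the function names, the [-1, 1] action range, and the 256 bins are all assumptions made for illustration.

```python
from typing import List, Sequence


def tokenize_action(action: Sequence[float], low: float, high: float, n_bins: int) -> List[int]:
    """Map each action dimension to an integer bin index in [0, n_bins - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)      # keep the value inside the range
        frac = (clipped - low) / (high - low)     # normalise to [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def detokenize_action(tokens: Sequence[int], low: float, high: float, n_bins: int) -> List[float]:
    """Map bin indices back to bin-centre values (a lossy inverse)."""
    width = (high - low) / n_bins
    return [low + (t + 0.5) * width for t in tokens]


# Hypothetical 3-dimensional action, assumed to be normalised to [-1, 1].
action = [0.12, -0.85, 0.40]
tokens = tokenize_action(action, low=-1.0, high=1.0, n_bins=256)
recovered = detokenize_action(tokens, low=-1.0, high=1.0, n_bins=256)

# The round trip loses at most half a bin width per dimension.
max_error = max(abs(a - r) for a, r in zip(action, recovered))
print(tokens, max_error)
```

The round-trip error is the point to notice: whatever tokenizer is used, the policy's output resolution is bounded by the bin width, which is one candidate explanation for why discretization may not help fine-grained control.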

Crucially, each of these modalities comes from very different sources – natural images, textual labels, videos, other robots – and can be mixed or matched. Lin et al. incorporate huge amounts of each. Their experiments use roughly 4,000 hours of robot and human manipulation data (across the modalities) plus 50 million vision-language image-text pairs. These datasets feed into training vision-language-action (VLA) policies that input camera images + language instructions and output robot actions.

Training Strategies: How and When to Mix Data.

In addition to the data types, Lin et al. explore training strategies for fusing these sources. Broadly, they compare:

Single-Phase (Joint) Training: Here the model is trained from scratch on a single combined dataset that mixes all chosen modalities together. For example, one might train on a union of robot demos + image-caption pairs + human video examples simultaneously. In consecutive batches the model might see an image of a kitchen with its text caption, then a robot trajectory with an English description, and so on. The hope is that end-to-end mixing will let the model share features between tasks.

Multi-Phase (Sequential) Training: In this regime, training is broken into stages. For instance, one could first pretrain the model on a large vision-language corpus, then fine-tune on robot data; or alternatively, first train on robot demos and then continue training on the extra modalities. The paper experiments with different stage orders (e.g. robot-first vs. language-first) to see what works best. Multi-phase is akin to a two-step pipeline: learn broad visual-language skills first, then adapt to robot control (or vice versa).

The distinction matters because if you throw all data at once, gradient signals from very different tasks might interfere. On the other hand, sequential stages might allow one modality to shape the model and then refine with the other. Lin et al. evaluate both approaches in detail.
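The difference between the two regimes can be sketched as a function from training step to sampling weights over data sources. The snippet below is an illustration under assumed names and numbers (two sources, a 50/50 joint mix, a language-first two-phase schedule); it is not the schedule used in the paper.

```python
import random
from typing import Dict

ROBOT, VL = "robot", "vision_language"


def joint_weights(step: int) -> Dict[str, float]:
    """Single-phase training: the same mixture at every step (50/50 is an assumption)."""
    return {ROBOT: 0.5, VL: 0.5}


def two_phase_weights(step: int, switch_step: int) -> Dict[str, float]:
    """Multi-phase training: vision-language data first, robot data afterwards."""
    if step < switch_step:
        return {ROBOT: 0.0, VL: 1.0}
    return {ROBOT: 1.0, VL: 0.0}


def sample_source(weights: Dict[str, float], rng: random.Random) -> str:
    """Pick which dataset the next batch is drawn from."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(0)
joint_counts = {ROBOT: 0, VL: 0}
phased_order = []
for step in range(200):
    joint_counts[sample_source(joint_weights(step), rng)] += 1
    phased_order.append(sample_source(two_phase_weights(step, switch_step=100), rng))

print(joint_counts)                         # both sources appear throughout
print(phased_order[0], phased_order[-1])    # phase one source, then phase two source
```

Swapping the two branches of `two_phase_weights` gives the robot-first ordering, so the same scaffold covers both stage orders discussed above.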

Experimental Setup and Evaluation.

To put these ideas to the test, the authors trained a total of 89 distinct policies across various combinations of data and strategy, and evaluated their performance on a suite of manipulation tasks. The scale is impressive: in simulation, they ran 58,000 rollout episodes to measure success on both seen and unseen tasks; on hardware, they executed 2,835 real rollouts to validate real-world generalization.

The tasks themselves span diverse manipulation challenges (the paper uses both multi-phase assembly tasks and shorter pick-place tasks, but the key is that there are multiple “skills” and long sequences). They also test distribution shifts – for example, altering object positions, adding visual distractors, or introducing new language instructions at test time.

The baseline is a model trained only on the native robot demonstration data for those tasks. All other policies are initialized with the same model architecture (a vision-language transformer), but are co-trained with one or more of the extra modalities.

For vision-language data, they use off-the-shelf image-text corpora (e.g. internet photos with captions). For trajectory annotations, they either generate or use existing language captions of their robot demos. The cross-embodiment data comes from a different robot arm (a humanoid arm platform doing analogous tasks) or even simulated analogues. Human video comes from egocentric datasets (the EgoVLA and hand-VLA style pretraining pipelines). And the discrete token variant is created by quantizing actions into a 1000-token vocabulary and training it like language.

Weights and hyperparameters are tuned carefully for each condition to ensure a fair comparison. In each case, performance is measured by final task success rate, as well as the model’s ability to follow language instructions and handle unseen scenarios.
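Since everything hinges on comparing success rates across policies, it helps to see how such a comparison might be computed from raw rollout outcomes. The sketch below adds a Wilson score interval to each rate; that choice of interval, and the rollout counts in the example, are assumptions for illustration and not the paper's reported statistics or numbers.

```python
import math
from typing import Sequence, Tuple


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of rollouts that ended in success."""
    return sum(outcomes) / len(outcomes)


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%)."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# Hypothetical outcomes: 50 rollouts per policy on the same task.
baseline = [True] * 20 + [False] * 30
cotrained = [True] * 31 + [False] * 19

for name, outcomes in [("robot-only", baseline), ("co-trained", cotrained)]:
    low, high = wilson_interval(sum(outcomes), len(outcomes))
    print(f"{name}: {success_rate(outcomes):.2f} (interval {low:.2f} to {high:.2f})")
```

With only tens of rollouts per condition the intervals are wide, which is one reason a study like this needs thousands of rollouts before differences between policies become trustworthy.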

Results: What Helps and What Doesn’t.

The key findings from Lin et al.’s study provide clear guidance on co-training choices. In summary, not all data help equally, and some combinations yield huge gains in generalization. The main takeaways are:

Vision-Language Data Helps a Lot: Co-training with standard vision-and-language corpora consistently improved the policy’s capabilities. Policies exposed to image-caption data learned a richer visual understanding and did better on tasks with novel objects or scenes. Crucially, Lin et al. found that vision-language co-training substantially boosted generalization to distribution shifts and completely unseen tasks. For example, a model pretrained on millions of generic image-text pairs could identify and act on objects outside the original training set. In fact, the paper reports that mixing in vision-language data made the model much better at following natural language instructions as well, because it retained stronger language grounding.

Cross-Embodiment Data is Highly Beneficial: Mixing in data from another robot embodiment also yielded significant gains. In the experiments, training jointly on demos from two different robot arms (or an arm and a humanoid) improved performance on each individual task. This suggests the model learned more abstract motor skills that transfer across kinematics. The improvement was on par with the vision-language boost. In practice, combining one’s limited dataset with publicly available demos from other robots could be a powerful data-source hack.

Combining Modalities is Cumulative: Importantly, the benefits stack. The authors observe that co-training on both vision-language data and cross-embodiment robotics data yields even larger improvements than either alone. In other words, there is no severe interference between these two sources; rather, each adds complementary knowledge. A joint model trained on both types of data achieved the best results, with cumulative gains in success rates. This means that if you have multiple kinds of extra data around, it’s worth using them together.

Robot-only Training Leads to Forgetting: A notable (and well-known) effect was that if the model is fine-tuned only on the limited robot data, it tends to lose its earlier vision-language knowledge. In the experiments, a baseline model pretrained on images and text would become worse at language understanding after fine-tuning solely on robot demos, a classic case of catastrophic forgetting. Lin et al. show that without co-training, the vision-language backbone’s capabilities “degrade.” By contrast, including vision-language and cross-embodiment batches during training preserves that multi-modal understanding. In short, the co-trained policies remained good at grounding language because they were always being reminded of the visual-linguistic tasks.

Discrete Action Tokens Don’t Help: The surprising negative result is that turning actions into tokenized “words” did not improve performance. The authors experimented with representing robot actions as a sequence of discrete tokens and training the policy to generate them (analogous to language modeling). This modification made little to no difference in final performance. It seems that treating continuous control as a string of words didn’t confer the hypothesized benefits (perhaps because the discretization was lossy or because the policy still had to recover continuous torques at the end). In any case, the tokenization trick offered no significant advantage in this setting.

Chain-of-Thought Conditioning Doesn’t Help Much: Motivated by recent results in large language models, the authors also tried an experiment where the model was explicitly conditioned on a textual “chain of thought” explaining its planned actions. That is, they gave the model a pseudo-reasoning trace (also derived from the co-training data) and then let it predict actions. Despite the success of chain-of-thought in LLMs for reasoning tasks, here it made no noticeable improvement on the manipulation benchmark. The tasks didn’t seem to benefit from this extra structure: the policies performed roughly the same with or without the “thought trace.”

These results are backed up by their extensive evaluations. For instance, in simulation the best co-trained model (vision-language + cross-embodiment) solved 38% more tasks on average than the robot-only baseline, and showed much higher success rates on newly introduced tasks. On real hardware as well, the co-trained policies completed more long-horizon tasks correctly than models trained only on target-robot demos. The paper provides quantitative breakdowns on dozens of scenarios, all pointing in the same direction: mixing in vision-language images and cross-embodiment robot data consistently raises performance.

Insights and Intuition.

What can we learn from so many runs? First, the success of vision-language data underlines that a lot of relevant knowledge about objects and language is hiding in generic photo-text datasets. By tapping into this reservoir, the robot policy gets indirect supervision on semantics. For example, if the caption dataset teaches the model that “citrus” covers oranges and lemons, then even if no robot demo ever mentioned citrus, the model might still be able to parse “place the citrus fruit in the bowl” when a lemon is on the table. Or if it learned from captions what “cup,” “handle,” and “pouring” mean, it can better generalize to novel cups. The result is a model that is less brittle to new objects and phrasing.

Second, the cross-embodiment benefit suggests that many basic manipulation strategies are robot-agnostic. When the policy trains on two different arms doing similar tasks, it must focus on the higher-level structure (“align gripper over object, close fingers, lift up”) that works regardless of exact joint geometry. This abstraction strengthens the learned policy. It’s related to meta-learning: seeing multiple embodiments acts like an implicit curriculum that teaches the network what aspects of the task are universal. Intuitively, it’s analogous to a multilingual language model learning that the same meaning can be expressed in different surface forms; here the shared meaning is “grasp object,” and each robot’s control commands are the surface form.
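One practical question the paragraph above leaves open is how a single network can emit actions for robots with different numbers of joints. A common workaround, shown here as an assumption and not as a detail taken from Lin et al., is to pad every action to a shared width and mask the padded dimensions out of the loss. The width of 8 and the helper names are hypothetical.

```python
from typing import List, Sequence, Tuple

MAX_ACTION_DIM = 8  # hypothetical shared action width across all robots


def pad_action(action: Sequence[float], max_dim: int = MAX_ACTION_DIM) -> Tuple[List[float], List[int]]:
    """Pad an action with zeros to max_dim and return (padded action, validity mask)."""
    if len(action) > max_dim:
        raise ValueError("action has more dimensions than the shared action space")
    pad = max_dim - len(action)
    return list(action) + [0.0] * pad, [1] * len(action) + [0] * pad


def masked_mse(pred: Sequence[float], target: Sequence[float], mask: Sequence[int]) -> float:
    """Mean squared error over the dimensions that exist on this robot only."""
    total = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    return total / max(sum(mask), 1)


# A 7-DoF arm and a hypothetical 3-DoF gripper share one output head.
arm_target, arm_mask = pad_action([0.1, -0.2, 0.3, 0.0, 0.5, -0.1, 0.2])
grip_target, grip_mask = pad_action([0.1, 0.2, 0.3])

# Whatever the network predicts in padded slots does not affect the loss.
prediction = [0.1, 0.2, 0.3, 9.0, 9.0, 9.0, 9.0, 9.0]
print(masked_mse(prediction, grip_target, grip_mask))
```

The mask is what lets the shared parameters learn from both robots without being penalised for dimensions one of them does not have.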

Third, the fact that combining modalities helps implies the gains are complementary: vision-language data broadens perceptual and semantic grounding, while cross-embodiment builds robust control primitives. Since the network has shared parameters, these improvements propagate mutually: better vision helps all tasks, and more robust action routines help interpret visual goals.

On the other hand, the failure of discrete actions and chain-of-thought tells us something useful too. It suggests that not every LLM trick translates to robotics. The action-token idea seemed plausible if one hoped to recast policy learning in a text-like format, but apparently continuous control does not benefit from that discretization. Similarly, chain-of-thought, which shines on multi-step reasoning problems in text, might simply not add new information for the network on these tasks, or it may introduce noise. The takeaway is that adding complexity to the training sequence (like extra text tokens) doesn’t automatically ease learning. The core issue is still having sufficient and relevant examples of behavior, more than fancy prompting strategies.

Finally, the forgetting phenomenon underscores a key practical tip: don’t just pretrain then fine-tune solely on your small dataset, or you’ll lose the broad knowledge. Instead, interleave training so that the model continually sees examples of the original vision-language or video domains alongside the robot tasks. The language-modeling community does something similar when it replays samples of the pretraining corpus during fine-tuning so that the model doesn’t catastrophically forget what it learned earlier. In practice, it means always mixing in some image-text minibatches during robot training, or vice versa.
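Interleaving can be as simple as reserving a fixed share of every batch for the auxiliary data. Here is a minimal sketch, assuming in-memory lists of examples and an illustrative 10% auxiliary share; the function name and the toy datasets are hypothetical, and a real pipeline would draw from data loaders instead.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mixed_batch(robot: Sequence[T], aux: Sequence[T], batch_size: int,
                aux_fraction: float, rng: random.Random) -> List[T]:
    """Build one batch with a fixed share of auxiliary (e.g. image-text) examples."""
    n_aux = round(batch_size * aux_fraction)
    n_robot = batch_size - n_aux
    batch = [rng.choice(robot) for _ in range(n_robot)]
    batch += [rng.choice(aux) for _ in range(n_aux)]
    rng.shuffle(batch)  # avoid a fixed position for each source inside the batch
    return batch


# Toy stand-ins for the two datasets, tagged by source.
robot_demos = [("robot", i) for i in range(50)]
image_text = [("image_text", i) for i in range(500)]

rng = random.Random(0)
batch = mixed_batch(robot_demos, image_text, batch_size=20, aux_fraction=0.1, rng=rng)
n_image_text = sum(1 for source, _ in batch if source == "image_text")
print(n_image_text, "of", len(batch), "examples come from image-text data")
```

Fixing the count per batch, instead of sampling the source at random, guarantees that every gradient step sees at least some vision-language data.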

How It Compares to Other Approaches.

Let’s place these findings in context. As mentioned, Maddukuri et al. (2025) showed the power of one modality: simulation data. They co-trained a policy on simulated demonstrations (from a simulator with its own diverse object models) together with real-world data, and reported gains of up to 38% on real tasks from this simple recipe. Lin et al. build on that idea by adding more modalities. In effect, Maddukuri et al.’s setup is a special case of co-training (simulated plus real robot data). Lin et al. add vision-language data, data from other robots, and more, and confirm that several of those can similarly plug gaps.

Other recent work has explored modalities we mentioned. The EgoVLA (Yang et al. 2025) approach trains on human videos and then fine-tunes on robots, finding that even a handful of robot demos can leverage the human data to greatly speed up learning. Similarly, Li et al. (2025) convert unlabelled human hand videos into millions of “episodes” of hand-object interactions, then pretrain a model on those. They also see large gains in downstream robot tasks. The new study from Lin et al. is consistent with these: human videos do work as a co-training modality. (Lin et al.’s results suggest that mixing in human-video data was helpful, though perhaps not as impactful as vision-language or cross-embodiment data, and it did not hurt.)

What's unique here is the systematic comparison. Survey papers and analyses of robot foundation models (e.g. a recent broad survey) have pointed out that integrating vision and language is important, but until now there wasn’t a clear roadmap on which data to prioritize. Lin et al. provide that roadmap. For instance, if a lab has only limited robot demos and some extra data to spare, these results say: focus on gathering/using more vision-language image captions and find any alternate robot’s logs; don’t bother investing effort into discretizing actions or generating chain-of-thought text for now.

Compared to simpler transfer learning baselines (like just pretraining on images then fine-tuning), this co-training is often better. Pretraining on images alone might improve perception, but Lin et al. show you also need to keep seeing image-text data while training on robot demos to maintain that edge. Likewise, just collecting more demonstrations on the target robot is usually too expensive to scale, whereas adding other modalities is a cheaper alternative.

Practical Guidance for Generalist Policies.

What does this mean if you’re, say, an engineer building a new general-purpose robot controller? Here are some practical lessons drawn from the study:

Leverage Image-Text Corpora: If your policy has a vision-language backbone (which it likely does, to interpret instructions), then co-training on generic image-caption datasets is low-hanging fruit. Even if the captions are unrelated to your tasks, they help keep the visual features and language aligned. In practice, this could mean starting from a pretrained vision-language encoder such as CLIP or BLIP, or simply mixing in some image-caption batches during training. As Lin et al. show, this boosts robustness to novel settings.

Include Other Robot Data If Available: If you can get data from a different robot (even a different kinematic chain or a simulator), include it. For example, if you have simulations of the same tasks, or if you have recordings from a colleague using a different arm model, that counts. Co-training on “cross-embodiment” data gave a large performance bump. You don’t have to drastically alter your setup; just ensure the model occasionally trains on the other data.

Combine Efficiently: The gains were cumulative and even provided faster adaptation to new tasks with fine-tuning. In the paper, they found you could fine-tune the co-trained model on a few examples of a new task and quickly master it, much faster than a model trained only on robot data. So a recipe is: pretrain on all modalities you have, then few-shot fine-tune to specific long-horizon tasks.

Beware of Forgetting: If you do two-phase training (e.g. first on image captions, then on robot demos), make sure to keep some mix in the second phase. The study highlights that exclusively training on the final robot objective caused the model to forget its visual-language knowledge. A simple fix is to interleave modalities within the same phase (e.g. 90% robot data + 10% image data in each epoch) or to alternate between short phases of each.

Skip the Unhelpful Tricks: Don’t spend time engineering a discrete action vocabulary or complicated prompting chains unless you have clear evidence it helps for your domain. Lin et al.’s experiments showed no benefit from those. Focus your effort on gathering or generating the raw data modalities that worked: images/text and diverse robot demos.

Finally, remember that co-training isn’t a magic bullet – it requires computing resources to handle the extra data, and careful tuning. But the payoff in robustness and generalization can be substantial. As foundation models in natural language taught us, “more data” often beats “more compute” alone, especially when data covers new ground. Here, “more data” means data of the right kind.

Beyond the Study: Limitations and Future Directions.

It’s worth noting some caveats. The experiments, while large-scale, are still on a finite set of tasks and data sources. There may be other valuable modalities not tested – for example, simulated video (photo-realistic simulation of humans), or language instructions without vision (pure instruction tuning). Also, the study used specific robot platforms and assumptions; results could vary with different robots or with richer sensors (like depth or touch).

Moreover, mixing data can complicate training dynamics. Lin et al. had to carefully balance learning rates and batch compositions. A naive mix could sometimes reduce performance if one modality dominates the gradients. Learning to properly schedule (which they studied in terms of phased training) remains an art.
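One standard way to keep a huge dataset from drowning out a small one is temperature-scaled sampling, where each source is sampled with probability proportional to its size raised to the power 1/T. This is offered as a general technique, not as the balancing scheme Lin et al. used; the 50 million image-text pairs echo the figure quoted earlier, while the robot example count and the temperature of 4 are made up for illustration.

```python
from typing import Dict


def sampling_weights(sizes: Dict[str, int], temperature: float = 1.0) -> Dict[str, float]:
    """Sampling probability per source, proportional to size ** (1 / temperature).

    temperature = 1 samples in proportion to dataset size;
    larger temperatures flatten the distribution toward uniform.
    """
    scaled = {name: n ** (1.0 / temperature) for name, n in sizes.items()}
    total = sum(scaled.values())
    return {name: s / total for name, s in scaled.items()}


# Hypothetical example counts per source.
sizes = {"robot": 100_000, "image_text": 50_000_000}

proportional = sampling_weights(sizes, temperature=1.0)
flattened = sampling_weights(sizes, temperature=4.0)

print(proportional)  # the image-text corpus dominates almost completely
print(flattened)     # robot data gets a far larger share of batches
```

The temperature then becomes one more hyperparameter to tune, which is part of why balancing batch composition remains something of an art.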

Another point is that all these models remain end-to-end imitation learners. There are still scenarios where extra reward shaping or reinforcement learning might be needed. Co-training with demonstrations is powerful, but if the dataset has blind spots, the policy might still fail in those holes. Future work might combine these co-training insights with active data gathering or curriculum learning (for example, use the vision-language knowledge to define new useful tasks to collect data on).

Finally, one should consider the quality of the extra data. Vision-language corpora are incredibly diverse but also noisy. The paper used generic corpora, which worked, but focused datasets (e.g. object-centric images, or language specifically about tasks) might be even better. Similarly, human videos cover many activities, not all relevant to robot tasks – better filtering or annotation could improve signal-to-noise. In practice, engineers might curate or annotate these datasets to align them more closely with the robot domain (for example, ensure that language descriptions use the same vocabulary as robot instructions).

Conclusion.

The systematic study by Lin et al. delivers clear, actionable guidance: when training a large vision-language robot policy, adding heterogeneous data can significantly boost performance, but you should pick the right modalities. Generic image-text pairs and extra robot demonstrations (even from other robots) were the star helpers, while more fanciful ideas like discretizing actions did not move the needle. By combining vision and embodiment data in training, the resulting policy became more robust to new objects, new tasks, and new instructions. In the quest to scale up robot learning, this work suggests that smart data curation can go a long way. For a robotics practitioner, the message is: don’t limit your model to just the demonstrations you have. Leverage any relevant video, image, or cross-robot data you can find, and train on all of it together. This kind of co-training strategy could be key to unlocking truly generalist, foundation-like capabilities in robotic manipulation.
