Embodied AI 101

First any-step video diffusion framework using flow maps, allowing a single model to adapt to arbitrary inference budgets for scalable high-quality video generation relevant to predictive world modeling.

What is Embodied AI 101?

Stay in the loop on research in AI and physical intelligence.

AnyFlow: Any-Step Video Diffusion for Predictive World Modeling.

Video diffusion models have recently surged forward, yielding remarkably realistic and temporally coherent video synthesis. Such generative video models are increasingly seen as potential world simulators: tools that can imagine future frames with complex dynamics over long horizons. A recent survey notes that modern video generation models can capture “complex physical dynamics and long-horizon causalities,” positioning them as “potential world simulators” for applications like autonomous driving and embodied AI. The catch is efficiency: realistic video generation typically requires dozens to hundreds of denoising steps, making real-time or large-scale use impractical. To bridge this gap, AnyFlow, an any-step video diffusion framework from NVIDIA, introduces a distillation strategy that lets a single video model adapt to any inference-time budget. In plain terms, AnyFlow can trade speed against quality within one set of weights: it can run in very few steps when needed, yet keep improving generation quality if more steps are allowed. This flexibility suits predictive world modeling, where available compute and required fidelity can vary widely.

AnyFlow’s key innovation is to base distillation on flow maps instead of the usual one-shot endpoint mapping. Put simply, traditional “consistency-distilled” video models are trained to take a noisy frame sequence (at time $t$) and map it directly to a clean output (time 0), so that sampling needs only a handful of steps. This approach works for very fast generation, but it sacrifices the diffusion model’s inherent test-time scaling, the property that more steps yield finer detail. In fact, as NVlabs points out, current consistency-distilled video models tend to plateau or even degrade in quality if you give them more sampling steps than they were trained for. AnyFlow fixes this by changing what the student model learns. Instead of learning only the final mapping $z_t \to z_0$, it learns the transition between any two times on the trajectory, $z_t \to z_r$; these are the flow-map transitions. By composing these learned short hops, the model follows the original diffusion (ODE) trajectory more faithfully. By analogy: rather than leaping straight to the end result (which may “lock in” coarse behaviors), the model learns to travel along the path in as many steps as you allow. The result is that more steps really do help again, just as in standard diffusion. AnyFlow is thus presented as the first “any-step” video diffusion method: a single model handles arbitrary inference budgets, producing high-quality video in, say, 4 steps and then boosting quality further if run with 16 or 32 steps.
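To make the "more hops help" intuition concrete, here is a minimal toy sketch, not AnyFlow itself. It assumes a scalar ODE $dz/dt = A z$ in place of a video model, and it uses a single Euler step as a stand-in for an imperfect learned flow map; the names `exact_flow_map`, `approx_flow_map`, and `sample` are hypothetical and chosen only for illustration.

```python
import math

A = 1.5  # toy ODE dz/dt = A * z, an assumed stand-in for the teacher's dynamics


def exact_flow_map(z, t, r):
    """Exact jump from time t to time r along the toy ODE trajectory."""
    return z * math.exp(A * (r - t))


def approx_flow_map(z, t, r):
    """Stand-in for an imperfect learned flow map: one Euler step from t to r."""
    return z + (r - t) * A * z


def sample(flow_map, z_start, num_steps, t_start=1.0, t_end=0.0):
    """Chain num_steps flow-map hops on a uniform time grid from t_start to t_end."""
    times = [t_start + (t_end - t_start) * i / num_steps for i in range(num_steps + 1)]
    z = z_start
    for t, r in zip(times[:-1], times[1:]):
        z = flow_map(z, t, r)
    return z


z_noise = 2.0
target = exact_flow_map(z_noise, 1.0, 0.0)

# Error of the approximate map shrinks as the same map is chained over more hops.
errors = {n: abs(sample(approx_flow_map, z_noise, n) - target) for n in (1, 4, 16, 32)}
for n, err in errors.items():
    print(f"{n:>2} hops: |error| = {err:.4f}")
```

In this toy, the exact flow map reaches the same endpoint no matter how the interval is split, while the approximate map gets closer to it as the number of hops grows. That is the scaling behavior the paragraph above attributes to flow-map students, shown here only on a one-dimensional example.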

Under the hood, AnyFlow builds on recent ideas in consistency and flow-based diffusion. Recall that flow-based generative models (flow matching, or diffusion models viewed through their probability-flow ODE) express image/video generation as solving a differential equation. Solving that ODE accurately gives high-quality samples, but it is expensive. Consistency models (e.g. Song et al.) learn to shortcut this by directly mapping noise to data in one shot, at the cost of iterative refinement. Flow-map approaches, as contemporary works have shown, offer a middle ground: they learn a function that can jump the latent between any two points along the trajectory. In the image domain, Boffi et al. introduced a unified framework where a “flow map” model learns to jump from an arbitrary time $s$ to an arbitrary time $t$ in the latent diffusion process. In one simple description, “flow maps … learn to jump directly between points on flow trajectories, enabling one or few-step generation”. Because these maps work for any two points in time, they naturally allow quality to improve as you chain more jumps. AnyFlow extends these ideas to video, where temporal coherence adds complexity.
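For reference, the flow-map idea can be written compactly. The symbols below ($v$ for the teacher's velocity field, $\Phi_{t \to r}$ for the flow map) are notation assumed here for illustration and may differ from the notation used in the cited papers.

$$
\frac{dz_\tau}{d\tau} = v(z_\tau, \tau), \qquad \Phi_{t \to r}(z_t) = z_t + \int_t^r v(z_\tau, \tau)\, d\tau = z_r
$$

$$
\Phi_{t \to t} = \mathrm{id}, \qquad \Phi_{s \to r} \circ \Phi_{t \to s} = \Phi_{t \to r}
$$

A consistency model corresponds to fixing the target time at $r = 0$. The composition rule in the second line is what lets a chain of shorter jumps reach the same endpoint as one long jump.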

Concretely, AnyFlow’s distillation has two stages: a forward initialization where the student learns to predict short flow transitions, and a flow-map backward simulation stage for on-policy refinement. In the first stage, rather than just learning the clean endpoint, the model is trained on random intervals. You sample a noisy frame representation at time $t$ (partway through diffusion) and a second time $r$ further along the trajectory, and train the neural network to map the state at $t$ to the state at $r$. Over many such random $(t, r)$ intervals, the model learns the full family of flow-map transitions. This is in contrast to consistency distillation, where one might only train the network to go directly from $t$ to $0$. By targeting intermediate $r$ values, the network learns the flow of the ODE rather than a single shortcut. As the NVIDIA team notes, this design “shifts the distillation target from endpoint consistency mapping ($z_t \rightarrow z_0$) to flow-map transition learning ($z_t \rightarrow z_r$) over arbitrary time intervals,” which effectively recovers the diffusion model’s desirable scaling behavior.
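The interval-sampling idea in the first stage can be sketched on a toy problem. This is a simplified illustration under stated assumptions, not the paper's training recipe: the teacher is a scalar ODE $dz/dt = A z$, the student is a one-parameter map $z \cdot e^{\theta (r - t)}$, the loss is plain squared error against a teacher-integrated target, and all names (`teacher_transition`, `student_flow_map`, `train_student`) are hypothetical.

```python
import math
import random

A = 1.5  # toy teacher ODE dz/dt = A * z (assumed stand-in for the teacher video model)


def teacher_velocity(z, t):
    """Velocity field of the toy teacher."""
    return A * z


def teacher_transition(z, t, r, substeps=100):
    """Integrate the teacher ODE from time t to time r with small Euler steps."""
    dt = (r - t) / substeps
    for i in range(substeps):
        z = z + dt * teacher_velocity(z, t + i * dt)
    return z


def student_flow_map(z, t, r, theta):
    """One-parameter student that jumps from time t to time r in a single call."""
    return z * math.exp(theta * (r - t))


def train_student(num_iters=3000, lr=0.1, seed=0):
    """Fit theta by SGD on squared error over randomly sampled (t, r) intervals."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(num_iters):
        t = rng.uniform(0.0, 1.0)      # start time of the interval
        r = rng.uniform(0.0, t)        # target time, closer to the clean end (time 0)
        z_t = rng.uniform(-2.0, 2.0)   # toy state at time t
        target = teacher_transition(z_t, t, r)
        pred = student_flow_map(z_t, t, r, theta)
        # d/dtheta of (pred - target)^2, using d(pred)/d(theta) = pred * (r - t)
        grad = 2.0 * (pred - target) * pred * (r - t)
        theta -= lr * grad
    return theta


theta = train_student()
print(f"learned theta = {theta:.3f} (teacher A = {A})")
```

Because the student is trained on intervals of every length, the same fitted parameter serves a single long jump or many short ones. A consistency-style variant of this toy would fix `r = 0.0` for every sample and so would only ever see endpoint targets.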

The second stage, Flow Map Backward Simulation, refines this by making training more on-policy. In autoregressive video generation, an “exposure bias” can occur: the model during inference sees its own previous predictions as input, which differ from the clean training trajectories. Flow Map Backward Simulation combats this by using the student’s predictions in the loop during training. The idea is to decompose a full teacher rollout (an Euler ODE integration from start to finish) into shorter jumps, and train the student using its own outputs as intermediate states. In practice, one simulates the teacher’s diffusion from $t$ to $r$, then uses the model's own guess at $r$ to continue simulation. Each small student-predicted step is followed by a tiny correction via the teacher’s viewpoint. This way, the student experiences input distributions closer to test conditions, mitigating exposure bias. Simultaneously, breaking the long trajectory into short hops reduces discretization error. In short, AnyFlow’s backward simulation trains the model “on-policy” with respect to its own outputs, while still anchoring it to the high-quality teacher trajectory in small increments. The bottom line of these innovations is that AnyFlow-distilled models behave nicely at test time: they match or beat conventional few-step baselines when run in 4–8 steps, and they continue to improve as you give them 16, 32, or more steps.
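The on-policy idea can be illustrated with a small toy loop. This is a loose sketch of the description above, not the paper's algorithm: it assumes a scalar teacher ODE $dz/dt = A z$, a one-parameter student, and a squared-error loss, and the names (`teacher_hop`, `student_hop`, `on_policy_rollout_update`) are hypothetical. The one point it tries to show is that training states come from the student's own rollout, while each short hop is still supervised by the teacher.

```python
import math
import random

A = 1.5  # toy teacher ODE dz/dt = A * z (assumed stand-in for the teacher model)


def teacher_hop(z, t, r, substeps=50):
    """Teacher target for a short hop: Euler integration from time t to time r."""
    dt = (r - t) / substeps
    for _ in range(substeps):
        z = z + dt * A * z
    return z


def student_hop(z, t, r, theta):
    """One-parameter student flow map from time t to time r."""
    return z * math.exp(theta * (r - t))


def on_policy_rollout_update(theta, z_noise, times, lr):
    """Roll the student out along `times`, supervising each hop from its own state."""
    z = z_noise
    for t, r in zip(times[:-1], times[1:]):
        target = teacher_hop(z, t, r)        # teacher continues from the student's state
        pred = student_hop(z, t, r, theta)
        grad = 2.0 * (pred - target) * pred * (r - t)
        theta -= lr * grad
        z = pred                             # next input is the student's own prediction
    return theta


rng = random.Random(0)
times = [1.0, 0.75, 0.5, 0.25, 0.0]          # 4 short hops from noise (1) to data (0)
theta = 0.0
for _ in range(500):
    theta = on_policy_rollout_update(theta, rng.uniform(-2.0, 2.0), times, lr=1.0)

print(f"learned theta = {theta:.3f} (teacher A = {A})")
```

In this linear toy the on-policy and off-policy inputs barely differ, so the benefit is not visible numerically. The sketch only shows the data flow: student outputs become the next inputs, and the teacher supplies a short-horizon target at each one.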

The empirical results bear out these claims. The NVlabs team built AnyFlow versions of both causal (autoregressive) and bidirectional video diffusion architectures, across sizes from ~1.3B to 14B parameters. In each case, the any-step distilled model performed on par with or slightly better than existing fast models at low step counts. Crucially, unlike consistency-distilled counterparts, it continues to gain fidelity when allotted more steps. For example, on a 1.3B-parameter bidirectional video model, AnyFlow exceeded the original flow-matching teacher’s quality at every step setting (4, 16, 32 NFEs), with especially large gains in the few-step regime. Likewise, AnyFlow’s 4-step output is comparable to or better than that of a 4-step consistency-distilled model, while its 16-step output surpasses it by a wide margin. Qualitative examples (Figure 2 in the paper) show that long-range structure and motion improve noticeably with extra steps under AnyFlow, whereas they stagnate under a conventional consistency model. The code release even includes Hugging Face pipelines demonstrating this: for instance, the FARWanAnyFlowPipeline (a 1.3B causal model) and WanAnyFlowPipeline (a 14B bidirectional model) each let you specify the number of sampling steps freely. In the 1.3B case, generating a dynamic scene with 4 steps yields clear video, and bumping to 16 or 32 steps makes shapes sharper and motion smoother. These pipelines cover text-to-video, image-to-video, and even video-to-video tasks, all with one model, again exploiting the flexibility of the underlying flow-map training.

The AnyFlow release is not just a proof-of-concept: it’s a scalable framework. The authors demonstrate it on 14B-parameter models (a full generation system comparable to the largest open models today). The Hugging Face card notes that AnyFlow has been validated at 1.3B through 14B scale. Moreover, because the flow-map distillation retains a fine-grained notion of instantaneous motion, the distilled models can be fine-tuned on new data while still preserving fast sampling. In other words, a robotics lab could take the AnyFlow base model and continue training it on domain-specific videos, and still enjoy one-step or few-step sampling. This adaptability is valuable: if your world model needs to specialize (say, on driving videos or indoor tasks), you can fine-tune it without losing the ability to simulate efficiently. The project page even shows a pipeline diagram where, after distillation, model variants continue training on separate datasets while “keeping few-step sampling” intact.

What does all this mean for predictive world modeling? It’s a significant step forward. World models in robotics need to envision possible futures given actions or context, often under tight latency constraints. AnyFlow lets a single generative model serve both the fast and the high-fidelity regime. In a time-critical loop, it can run in very few steps and give you a rough future; if there is time (or in offline batch processing), it can refine predictions with more steps. Importantly, because quality scales with steps, the system gains a true anytime property: any extra compute yields benefit. This contrasts with prior distilled models that were tied to a fixed step budget and gained little or nothing from additional time. As the NVIDIA team emphasizes, their framework recovers “the testing-time scaling behavior of flow matching”. In practical terms, this flexibility could allow embodied AI agents or simulators to allocate compute dynamically. For instance, an autonomous vehicle might run the world model at 4 steps under normal driving, but boost to 16 steps in a tricky scenario, without changing network weights.
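As a small illustration of dynamic compute allocation, the sketch below picks a step count from a latency budget. It is a hypothetical helper, not part of the AnyFlow release: the function name `choose_num_steps`, the candidate step counts, and the assumption that latency grows linearly with the number of steps are all made up for this example.

```python
def choose_num_steps(latency_budget_ms, per_step_ms, allowed_steps=(4, 8, 16, 32)):
    """Return the largest allowed step count whose estimated latency fits the budget.

    Assumes latency is roughly per_step_ms * num_steps. If even the smallest
    setting does not fit, fall back to the smallest one rather than skip prediction.
    """
    fitting = [n for n in allowed_steps if n * per_step_ms <= latency_budget_ms]
    return max(fitting) if fitting else min(allowed_steps)


# Tight control loop: only the cheapest setting fits.
print(choose_num_steps(latency_budget_ms=100, per_step_ms=20))
# Relaxed budget (e.g. offline replay): use the highest-quality setting.
print(choose_num_steps(latency_budget_ms=700, per_step_ms=20))
```

With an any-step model, the value returned here would simply be passed as the sampling step count of the same model, so no separate checkpoints are needed for different latency regimes.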

Finally, from an engineering standpoint, AnyFlow is remarkably well-supported. The authors open-sourced all code, model weights, and a web demo on the day the paper was released. They even provide ready-to-run Diffusers pipelines (as seen in their Hugging Face release) so that practitioners can plug in the model right away. This accessibility means researchers in robotics can experiment with AnyFlow as a drop-in component for scene prediction or planning. The pip install instructions and examples (Figure above) show how easy it is to generate a bird chase scene or animate a robot action with just a few lines of code.

In summary, AnyFlow introduces a conceptually elegant and practically useful fix to a thorny problem in video diffusion. By distilling onto flow-map trajectories rather than end-point shortcuts, it delivers an “any-step” model: one that works well fast and even better when you slow it down. This restores the diffusion model’s natural advantage of iterative refinement, without sacrificing few-step speed. For anyone building predictive world models, this means smoother control of speed/quality trade-offs. The approach is simple to state but powerful in effect. It will be interesting to see how these distilled flow models fare in downstream tasks and whether even fewer steps (1–2) can be made robust. For now, AnyFlow gives us the best of both worlds in video generation.

References: This discussion is based on AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation (NVIDIA/ShowLab) and related work. For broader context, see the recent survey on video world models and the flow-map generative modeling literature. The AnyFlow code and demos are available via NVIDIA’s GitHub and Hugging Face.
