Jiggly Jar

Introduces AnyFlow, the first any-step video diffusion framework using flow maps that allows a single model to adapt to arbitrary inference steps for high-quality video generation, enabling efficient predictive video modeling for world models.

What is Jiggly Jar?

Testing

AnyFlow: Any-Step Video Diffusion with Flow Map Distillation.

Video diffusion models are becoming powerful world models for robotics and embodied agents, able to generate realistic future frames from text prompts, images, or prior video. However, high-quality diffusion sampling usually requires many timesteps, conflicting with the need for speed. Techniques like consistency distillation have dramatically sped up video diffusion – for example, the VideoLCM model achieved high-fidelity video with only four denoising steps – but these fast models lack flexibility. A consistency-distilled model is essentially trained to produce a final sample in one shot, so giving it more sampling steps does not improve quality. As NVIDIA’s AnyFlow authors explain, “few-step video generation has been significantly advanced by consistency distillation. However, the performance of consistency-distilled models often degrades as more sampling steps are allocated at test time”. In effect, the usual diffusion advantage of test-time scaling (higher fidelity when using more iterations) is lost.

AnyFlow addresses this by changing the distillation target to a flow map between noise levels, which restores that scaling behavior. In conventional consistency distillation, the model learns a direct mapping from a noisy latent $z_t$ all the way to the clean output $z_0$ (the “endpoint” mapping). In contrast, AnyFlow’s flow-map distillation trains the model to connect any two noise levels: essentially learning a transformation $z_t \to z_r$ for arbitrary times $t>r$. In other words, the distilled model becomes a multi-step generator that knows how to make intermediate leaps along the diffusion trajectory. This subtle shift – from “how to jump directly to the end” to “how to transition between arbitrary points” – has profound effects. According to the authors, this allows the model to “optimize the full ODE sampling trajectory” instead of “only a few fixed sampling steps”. Consequently, as more sampling steps are used at inference, the model can follow the diffusion path more precisely and quality continues to improve, recuperating the test-time scaling that consistency models discarded.
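In symbols, the contrast can be sketched as follows. The names $f_\theta$ and $F_\theta$ are illustrative notation chosen here, not taken from the AnyFlow paper, and the composition identity is the standard property of an exact ODE flow map rather than something the source states explicitly:

$$
\begin{aligned}
\text{consistency map:}\quad & f_\theta(z_t, t) \approx z_0, \\
\text{flow map:}\quad & F_\theta(z_t, t, r) \approx z_r, \qquad 0 \le r < t, \\
\text{composition:}\quad & F\big(F(z_t, t, s),\, s,\, r\big) = F(z_t, t, r), \qquad r < s < t.
\end{aligned}
$$

Setting $r = 0$ in the flow map recovers the endpoint mapping, so the consistency map is the special case $f_\theta(z_t, t) = F_\theta(z_t, t, 0)$. The composition line is what makes chaining several shorter jumps meaningful.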

Technically, AnyFlow’s training proceeds in two phases. First, forward distillation teaches the model the flow-map transitions: given a latent $z_t$ at time $t$, the model predicts the latent at some earlier time $r$, not necessarily zero. During this phase AnyFlow “shifts the distillation target … from endpoint consistency mapping ($z_t\to z_0$) to flow-map transition learning ($z_t\to z_r$) over arbitrary time intervals”. In practice this means sampling random source and target times and minimizing the error between predicted and teacher latents. This objective generalizes the ideas behind flow matching and consistency distillation. (Indeed, recent work on Align Your Flow for images formalized flow maps as a generalization: they “connect any two noise levels in a single step” and “remain effective across all step counts”. AnyFlow adapts this philosophy to video diffusion.)
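The forward phase can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the one-dimensional ODE $dz/dt = z$ stands in for a teacher's sampling ODE, the teacher target is computed by fine-grained Euler integration, the student has a single parameter `theta`, and the loss is a plain mean squared error. The actual AnyFlow loss and architecture are not specified in this text.

```python
import numpy as np

def teacher_velocity(z, t):
    # Hypothetical stand-in for the teacher's velocity field dz/dt.
    return z

def teacher_flow(z_t, t, r, n_sub=200):
    # Teacher target: integrate from time t down to r (< t) with many small Euler steps.
    z = z_t.copy()
    dt = (r - t) / n_sub  # negative, since we move toward earlier times
    s = t
    for _ in range(n_sub):
        z = z + dt * teacher_velocity(z, s)
        s = s + dt
    return z

def student_flow(z_t, t, r, theta):
    # One-parameter student flow map; it matches the exact flow of this toy ODE at theta = 1.
    return z_t * np.exp(theta * (r - t))

def flow_map_loss(theta, rng, batch=256):
    # Sample random source times t and target times r < t, then compare
    # the student's single jump z_t -> z_r with the teacher's integrated latent.
    z_t = rng.normal(size=batch)
    t = rng.uniform(0.0, 1.0, size=batch)
    r = t * rng.uniform(0.0, 1.0, size=batch)
    target = teacher_flow(z_t, t, r)
    pred = student_flow(z_t, t, r, theta)
    return float(np.mean((pred - target) ** 2))

# Crude "training": grid search over theta, reusing the same random batch for each candidate.
thetas = np.linspace(0.0, 2.0, 21)
losses = [flow_map_loss(th, np.random.default_rng(0)) for th in thetas]
best_theta = float(thetas[int(np.argmin(losses))])
print(best_theta)
```

The point of the sketch is only the shape of the objective: pairs $(t, r)$ are drawn at random, so the student is supervised on jumps of every length rather than on the single jump to $z_0$.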

The second phase is backward, on-policy distillation using these flow maps. Here the goal is to correct the model’s behavior on the states it actually visits when sampling. In classic consistency distillation, one effectively samples a “static” one-step trajectory; AnyFlow instead plugs the model into the original multi-step Euler trajectory. Concretely, the authors introduce Flow Map Backward Simulation: they take the full sampling rollout (an Euler integration of the sampling ODE from $t=T$ down to $0$) and break it into smaller segments corresponding to the learned flow-map jumps. Crucially, this is done on-policy: the model’s own outputs are used as inputs for the next segment, rather than always using ground-truth latents. This training technique “preserves the original Euler sampling trajectory and decomposes the long trajectory into shortcut segments to reduce rollout cost”. By training on its own generated samples, the model sees during training the same distribution it will encounter at inference, reducing the exposure bias that plagues one-shot distillation. At the same time, breaking the path into shorter jumps reduces discretization error from large time steps. In effect, AnyFlow performs a kind of “multi-step self-play” that both respects the diffusion dynamics and keeps the computation manageable. The result is a distilled video model that can be sampled flexibly: in the few-step regime it keeps the quality of a distilled model, and it regains the ability to refine outputs by running more steps, just as an undistilled diffusion model would.
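The on-policy aspect of Flow Map Backward Simulation can be sketched as a rollout that records the states the student itself produces. This is a minimal sketch under stated assumptions: `student_flow` is a hypothetical, deliberately imperfect flow map on a toy problem, `on_policy_rollout` is a name invented here, and the loss that would be applied at the recorded states is not shown because the text does not specify it.

```python
import numpy as np

def student_flow(z_t, t, r, theta=0.9):
    # Hypothetical, slightly imperfect student flow map (toy stand-in for the video model).
    return z_t * np.exp(theta * (r - t))

def on_policy_rollout(flow_map, z_T, times):
    # Walk the time grid from times[0] down to times[-1] in shortcut segments.
    # Each segment starts from the student's own previous output, so the recorded
    # (state, t, r) triples follow the distribution the student sees at inference.
    z = z_T
    visited = []
    for t, r in zip(times[:-1], times[1:]):
        visited.append((z, float(t), float(r)))
        z = flow_map(z, t, r)
    return z, visited

rng = np.random.default_rng(0)
z_T = rng.normal(size=4)            # pure-noise start at t = T (here T = 1)
times = np.linspace(1.0, 0.0, 5)    # four shortcut segments: 1.0 -> 0.75 -> 0.5 -> 0.25 -> 0.0
z_0, visited = on_policy_rollout(student_flow, z_T, times)
print(len(visited))
```

In a real training loop the triples in `visited` would be the inputs on which the student is corrected; using ground-truth latents instead would train the model on states it never reaches by itself.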

In practice, NVIDIA’s AnyFlow implements this idea in both causal and bidirectional video diffusion architectures, at scales from 1.3B-parameter “small” models up to 14B-parameter giants. (Causal models generate frames sequentially like an autoregressive transformer, while bidirectional models can access future context; both appear in modern video diffusion literature.) A single AnyFlow model can even handle multiple generation tasks: for example, one causal model can do text-to-video (T2V), image-to-video (I2V), and video-to-video (V2V) generation. According to NVIDIA’s release notes, this “multiple tasks” capability is built into the model, so you can switch the conditioning modality on the fly without retraining. The authors report that AnyFlow’s advantages hold across model sizes and architectures: in all cases, the flow-map distillation leads to robust performance at low step counts while allowing quality to keep rising with additional compute.

Quantitatively, AnyFlow matches or exceeds the leading consistency-based video models at low budgets, and it benefits from extra steps that consistency models cannot use. For example, NVIDIA compared a 14B causal video model distilled with AnyFlow against several community baselines (e.g. Wan2.1-based consistency models such as KREA, LightX2V, and FastVideo). At just 4 neural function evaluations (NFEs, essentially 4 denoising steps), the AnyFlow model “achieves better dynamics and quality” than these consistency-distilled competitors. In another test, the same 14B model was used for image-to-video generation: at only 4 steps it produced videos comparable to those obtained by the original Wan2.1 model using 100 NFEs (50x2). In short, AnyFlow distillation allowed a 4-step model to match a 100-NFE baseline. Moreover, AnyFlow is not tied to that small budget: if you allow it to run, say, 16 or 32 steps, its video quality continues to improve, whereas consistency models plateau or even degrade. The demo site’s results show a clear “test-time scaling” advantage: AnyFlow’s performance curve rises with more steps, unlike the flat curve of a consistency model. With enough steps, the AnyFlow generator can reportedly even surpass the original teacher diffusion model in sample quality, since it learned to approximate the flow ODE across all segments.
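As a quick check on the budget arithmetic, assume that “50x2” means 50 sampler steps with two network evaluations per step (as is typical when classifier-free guidance is used; the text does not spell this out) and that the distilled model spends one evaluation per step:

$$
\text{NFE}_{\text{baseline}} = 50 \times 2 = 100, \qquad \frac{\text{NFE}_{\text{baseline}}}{\text{NFE}_{\text{AnyFlow}}} = \frac{100}{4} = 25.
$$

Under those assumptions the 4-step model uses about 25 times fewer network evaluations than the baseline, provided each evaluation costs roughly the same.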

Another striking advantage is that the distilled AnyFlow model still retains a fine-grained representation of motion. Because it learns the instantaneous flow field between states, it can be further fine-tuned or adapted to new data without losing its fast-sampling property. As NVIDIA notes, “the flow map formulation preserves a fine-grained instantaneous flow field, so the distilled model can be continued on downstream data while keeping few-step sampling”. In practice this means you can take the 4-step AnyFlow model and train it (with a low learning rate) on a specialized dataset, say robotic manipulation videos, and it will improve in detail while still sampling in 4 steps. Indeed, their experiments show that after such fine-tuning the model better preserves object identity and motion dynamics (e.g. retaining a robot arm’s appearance and correctly following pedestrian paths) even at the same fast step count. This flexibility might be especially useful in robotics, where one might want to adapt a generic video model to a particular robot or domain on the fly.

In sum, AnyFlow delivers a trifecta of benefits: it matches the speed of distilled video models in the few-step regime, it restores the quality gains from using more steps, and it maintains rich motion information for fine-tuning. The key is adopting flow-map distillation “instead of a consistency mapping” and using an on-policy backward pass. This approach was inspired in part by similar ideas in image diffusion: past work (e.g. Align Your Flow) showed that flow maps – objectives that match any two noise levels – can unify consistency and score-based distillation, avoiding the pitfalls of one-step models. AnyFlow is essentially bringing that continuous-flow perspective to video.

For the robotics and embodied AI community, AnyFlow is potentially quite attractive. In NVIDIA’s own “Cosmos” platform for physical AI and world models, video generation is a central capability. For example, the Cosmos Predict model is described as a learned world foundation model that “produces up to 30 seconds of high-fidelity video from multimodal prompts” for planning purposes. One can imagine integrating AnyFlow into such systems to allow a planner to adjust fidelity on demand. Need a quick preview? Run AnyFlow for 4–8 steps. Want a more polished imagined future? Dial up 32 or 64 steps. The AnyFlow model intrinsically supports that tradeoff, whereas a fixed-4-step model does not.
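The "dial the step count" idea can be sketched as a sampler whose only knob is `num_steps`. This is an illustration under loud assumptions: `approx_flow` is a hypothetical stand-in that is only first-order accurate per jump (a single Euler step on the toy ODE $dz/dt = z$), not a trained AnyFlow model, and `sample` is a name invented here rather than part of any released API.

```python
import numpy as np

def velocity(z, t):
    # Toy ODE dz/dt = z, whose exact flow is z_r = z_t * exp(r - t).
    return z

def approx_flow(z_t, t, r):
    # Hypothetical stand-in flow map: accurate for short jumps, cruder for long ones.
    return z_t + (r - t) * velocity(z_t, t)

def sample(flow_map, z_T, num_steps, T=1.0):
    # Same model, any step budget: split [T, 0] into num_steps equal jumps.
    times = np.linspace(T, 0.0, num_steps + 1)
    z = z_T
    for t, r in zip(times[:-1], times[1:]):
        z = flow_map(z, t, r)
    return z

z_T = np.array([1.0, -2.0, 0.5])
exact = z_T * np.exp(-1.0)  # exact endpoint of the toy ODE

# Error against the exact endpoint for a quick preview versus a more careful rollout.
errors = {n: float(np.max(np.abs(sample(approx_flow, z_T, n) - exact)))
          for n in (4, 16, 64)}
print(errors)
```

In this toy the error shrinks as the budget grows, which is the qualitative behavior the text attributes to AnyFlow. A planner could call the same function with a small `num_steps` for a preview and a larger one for a final rollout, without swapping models.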

NVIDIA has released code, model weights, and demos to the community, making it easy to experiment. The Hugging Face repository lists AnyFlow pipelines for 1.3B and 14B models (both task-agnostic “FAR” models and T2V-specific variants), and the official site provides videos of sample outputs. According to the team, the AnyFlow code and pretrained weights were open-sourced upon paper release (May 2026). This means researchers can immediately test AnyFlow as a predictive video model in their own contexts, for example by plugging it into a model-predictive control loop or world-model-based planning algorithm and comparing 4-step vs 16-step rollouts.

Of course, AnyFlow is not magic: it still faces the fundamental challenge of ultra-fast generation. The authors note that while their approach greatly improves sampling at, say, 4–8 steps, quality at the extreme low end (1–2 steps) remains limited and may require further research. Training the flow-map transitions themselves can also be somewhat more complex than one-shot distillation. But in the moderate range (8 or 16 steps), adding steps now buys quality instead of costing it.

In conclusion, AnyFlow presents a conceptually simple but powerful twist on video diffusion distillation: train not for one-shot, but for any-step, effectively embedding the diffusion ODE into the network. As NVIDIA’s blog summary puts it, this recovers “the desirable test-time scaling behavior of ODE sampling” that consistency models gave up. For anyone building robot world models or video predictors, AnyFlow’s release is timely: it lets a single diffusion network flexibly span the whole spectrum from fast preview to high-fidelity rollout. We expect it will be a valuable component in physical AI pipelines, where balancing speed and accuracy in learned simulations is key.

References: NVIDIA’s AnyFlow paper and website; the VideoLCM consistency-distillation approach; theory of flow maps in diffusion; NVIDIA Cosmos world models.
