τ0-WM: A Unified Video-Action World Model for Robotic Manipulation

When we reach for a heavy, fragile pitcher of water, our brain does not blindly send movement instructions and hope for the best. Subconsciously, we run a mental micro-simulation: if I grip this too loosely, it will slip; if I knock the rim, it will spill. We anticipate the future, evaluate the consequences, and adjust our trajectory before we ever make physical contact, because we know that a careless movement means a shattered glass or a failed task.

Robotic manipulation likewise demands models that can both forecast the consequences of their own actions and produce robot-executable controls. Learning both requires broad experience with object interactions as well as observations paired with continuous actions, and these forms of supervision are rarely available together at scale. Egocentric videos and human interaction trajectories provide rich evidence of object motion, contact, spatial structure, and task-level temporal organization, but they do not specify actions in a robot’s control space. Robot demonstration data provides this grounding by coupling observations to actions collected with a specific embodiment, controller, sensor stack, and action representation, yet it covers only a narrow range of scenes and tasks because collection is costly.

This motivates a unified video-action formulation in which heterogeneous data sources supervise the signals they contain—future observations when only video is available, actions when robot controls are available, and task progress when progress or failure labels are available—while sharing a predictive representation that connects visual dynamics to executable actions.

We present τ0-World Model (τ0-WM), a framework for robotic manipulation that unifies action generation, video prediction, and action-conditioned future evaluation. Rather than treating policy learning and dynamics modeling as separate objectives, τ0-WM builds them around a shared predictive representation. This representation enables two complementary interfaces—a policy that predicts executable action chunks, and an action-conditioned video simulator that imagines future observations and task-level consequences—which are combined at test time into a proposal–evaluation–revision procedure that allocates extra computation to selecting and refining actions before execution.

The core of τ0-WM is a Video Action Model (VAM) built on a shared video diffusion backbone. It takes multi-view observations, a language instruction, and the robot state as input, and jointly predicts future visual latents and a continuous action chunk. The video branch captures temporally structured scene dynamics, while the action branch predicts executable controls by attending to intermediate video representations through layer-wise cross-attention. This coupling makes future prediction a control-relevant training objective: the action branch is encouraged to use representations that encode how the scene is likely to evolve under manipulation. Beyond VAM, the shared representation also enables τ0-WM to act as an action-conditioned simulator: given the current observation, instruction, and a candidate action chunk, it can predict multi-view future outcomes along with a dense task-progress trajectory learned from subtask progress labels and failure data, thereby evaluating actions by both visual plausibility and task advancement.
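To make the coupling between the two branches concrete, the following is a minimal sketch of layer-wise cross-attention in plain Python. It is not the τ0-WM implementation: the function names (`cross_attend`, `action_branch`), the single attention head, the absence of learned projections, and the residual update are all simplifying assumptions chosen for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(action_tokens, video_tokens):
    """Single-head scaled dot-product cross-attention (no learned projections).

    Queries come from the action branch; keys and values come from the
    video branch. Both inputs are lists of equal-length float vectors.
    """
    dim = len(video_tokens[0])
    updated = []
    for q in action_tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in video_tokens]
        weights = softmax(scores)
        attended = [sum(w * v[i] for w, v in zip(weights, video_tokens))
                    for i in range(dim)]
        # Residual update (an assumption): the action token is refined, not replaced.
        updated.append([qi + ai for qi, ai in zip(q, attended)])
    return updated

def action_branch(action_tokens, video_features_per_layer):
    """Attend to the intermediate video features of each backbone layer in turn."""
    for video_tokens in video_features_per_layer:
        action_tokens = cross_attend(action_tokens, video_tokens)
    return action_tokens
```

In this sketch the action tokens read from a different set of video features at every layer, which is the sense in which the attention is layer-wise; a real model would add learned query, key, and value projections and multiple heads.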

Pre-training on Multi-source Heterogeneous Dataset

Training a model that can both imagine the future and execute actions requires more than a single data source. τ0-WM, a 5B world model, is trained on a heterogeneous data corpus of approximately 27,300 hours, comprising real-robot teleoperation data, UMI-style data, and egocentric human videos.

Real-robot teleoperation data (17,800 hours, dual-arm, multi-view) offers the strongest and most deployment-aligned action supervision, but is expensive and limited in diversity.
UMI data (6,500 hours)
Egocentric human interaction data (3,000 hours, dual-arm, multi-view)
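As a quick consistency check on the figures above, the three source sizes can be summed and converted to shares of the corpus. The dictionary keys are hypothetical labels introduced only for this example; the hour counts are the ones listed in the text.

```python
# Hours per data source, as listed above (keys are illustrative labels).
hours = {
    "real_robot_teleop": 17_800,
    "umi": 6_500,
    "egocentric_human": 3_000,
}

total_hours = sum(hours.values())

# Share of the corpus contributed by each source, in percent.
shares = {name: round(100 * h / total_hours, 1) for name, h in hours.items()}

print(total_hours)  # 27300
print(shares)       # {'real_robot_teleop': 65.2, 'umi': 23.8, 'egocentric_human': 11.0}
```

The sum matches the stated total of approximately 27,300 hours, with roughly two thirds of the corpus coming from real-robot teleoperation.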

These three data sources provide complementary supervision. Real-robot teleoperation data grounds the model in executable robot actions. UMI data expands the diversity of manipulation behaviors and environments with action-related signals. Egocentric human videos further scale up visual interaction learning by providing broad coverage of real-world object dynamics.

This lets the model learn from broad interaction data without pretending that every dataset contains the same kind of supervision. The key idea is simple: use every data source for the signal it actually contains. τ0-WM combines these sources through modality-specific supervision masks. Each sample supervises only what it can validly provide. Robot demonstrations supervise both video prediction and action generation. Egocentric videos supervise future visual prediction but not robot actions. Rollout and failure trajectories can supervise task-progress prediction. Missing camera views or unavailable modalities are masked out.
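The masking idea can be sketched as a loss that averages each term only over the samples that actually carry that signal. This is a minimal illustration under stated assumptions, not the training code of τ0-WM: the `Sample` fields, the equal default loss weights, and the choice of a per-term masked mean are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    video_loss: float     # future-prediction loss for this sample
    action_loss: float    # action-chunk loss for this sample
    progress_loss: float  # task-progress loss for this sample
    has_video: bool       # True if future frames are valid supervision
    has_action: bool      # True if robot actions are available
    has_progress: bool    # True if progress or failure labels are available

def masked_mean(losses, mask):
    """Average only over samples whose mask is True; 0.0 if none are valid."""
    valid = [loss for loss, keep in zip(losses, mask) if keep]
    return sum(valid) / len(valid) if valid else 0.0

def total_loss(batch, w_video=1.0, w_action=1.0, w_progress=1.0):
    """Combine the three objectives, ignoring terms a sample cannot supervise."""
    video = masked_mean([s.video_loss for s in batch], [s.has_video for s in batch])
    action = masked_mean([s.action_loss for s in batch], [s.has_action for s in batch])
    progress = masked_mean([s.progress_loss for s in batch], [s.has_progress for s in batch])
    return w_video * video + w_action * action + w_progress * progress

# Illustrative batch: a robot demonstration, an egocentric video, a failure rollout.
# The 9.9 values stand in for meaningless losses that the masks must ignore.
batch = [
    Sample(0.2, 0.4, 9.9, has_video=True, has_action=True, has_progress=False),
    Sample(0.6, 9.9, 9.9, has_video=True, has_action=False, has_progress=False),
    Sample(9.9, 9.9, 0.5, has_video=False, has_action=False, has_progress=True),
]
loss = total_loss(batch)
```

Because the placeholder values never enter any average, an egocentric clip contributes nothing to the action term, which is the behavior the paragraph above describes.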

Action-Conditioned Video Simulator

The action-conditioned video simulator serves as an executable proxy for real-world interaction. Instead of directly executing every candidate action on the robot, the simulator predicts the visual consequences of an action sequence and produces a reward signal for action selection. This is useful in contact-rich manipulation, where repeated trial-and-error execution on the physical system is expensive and slow.

Given the current observation, instruction, and a candidate action chunk, the simulator predicts multi-view future outcomes along with a dense task-progress trajectory learned from subtask progress labels and failure data, so actions are evaluated by both visual plausibility and task advancement. It is therefore not just a “visual predictor” but also an “action consequence evaluator” that helps the robot judge whether an action is worth trying before actually executing it.
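One way to turn a predicted progress trajectory into a selection signal is sketched below. The scoring rule (final progress minus a penalty for the largest regression) and the names `score_rollout`, `select_action`, and `simulate` are assumptions made for this example; the source does not specify how the reward is computed, and this sketch covers only task advancement, not visual plausibility.

```python
def score_rollout(progress, regression_penalty=1.0):
    """Score a dense progress trajectory whose values lie in [0, 1].

    Hypothetical heuristic: reward the final progress and penalize the
    largest single-step drop, which may indicate a predicted failure.
    """
    final = progress[-1]
    drops = [earlier - later for earlier, later in zip(progress, progress[1:])]
    worst_drop = max([0.0] + drops)
    return final - regression_penalty * worst_drop

def select_action(candidates, simulate):
    """Return the candidate whose simulated rollout scores highest.

    `simulate` is a stand-in for the action-conditioned simulator: it maps
    a candidate action chunk to a predicted progress trajectory.
    """
    scored = [(score_rollout(simulate(c)), i) for i, c in enumerate(candidates)]
    best_score, best_index = max(scored)
    return candidates[best_index], best_score

# Toy stand-in for the simulator: fixed trajectories for two named candidates.
toy_rollouts = {
    "steady_grasp": [0.1, 0.3, 0.5],  # monotone progress
    "fast_grasp": [0.1, 0.6, 0.4],    # progress regresses at the end
}
best_action, best_score = select_action(list(toy_rollouts), toy_rollouts.get)
```

Here the steadier candidate wins even though the faster one reaches a higher peak, because its predicted progress never regresses.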

Test-Time Computation: Propose, Evaluate, Revise

At inference time, this unified video-action interface enables τ0-WM to allocate additional test-time computation to action selection and refinement, rather than relying on a single feed-forward action prediction.

The policy first samples multiple action chunks and ranks them by a Re-denoising Consistency Score, which measures agreement with the learned action distribution. When no candidate scores well, τ0-WM further simulates candidate futures, selects the most promising rollout, and conditions a second action prediction on that future. This forms a test-time proposal–evaluation–revision loop that uses the learned world model to improve action selection before execution.
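The control flow of this procedure can be sketched with the model components passed in as callables. Everything named here is hypothetical: `sample_chunk`, `consistency_score`, `simulate`, and `revise` stand in for the policy, the Re-denoising Consistency Score, the simulator, and the future-conditioned second prediction, and the candidate count and acceptance threshold are illustrative values, not ones reported for τ0-WM.

```python
def propose_evaluate_revise(sample_chunk, consistency_score, simulate, revise,
                            num_candidates=4, threshold=0.5):
    """Return (action_chunk, stage), where stage is 'proposal' or 'revised'.

    sample_chunk()        -> a candidate action chunk
    consistency_score(c)  -> float, higher means better agreement
    simulate(c)           -> (predicted_future, value) for candidate c
    revise(future)        -> an action chunk conditioned on that future
    """
    # Propose: sample several chunks and rank them by consistency.
    candidates = [sample_chunk() for _ in range(num_candidates)]
    scores = [consistency_score(c) for c in candidates]
    best = max(range(num_candidates), key=lambda i: scores[i])
    if scores[best] >= threshold:
        return candidates[best], "proposal"

    # Evaluate: no candidate scored well, so simulate each candidate's future.
    rollouts = [simulate(c) for c in candidates]
    chosen = max(range(num_candidates), key=lambda i: rollouts[i][1])
    chosen_future = rollouts[chosen][0]

    # Revise: predict a second action chunk conditioned on the chosen future.
    return revise(chosen_future), "revised"

def run_toy_example(threshold):
    """Exercise the loop with deterministic toy stand-ins for every component."""
    chunks = iter([1, 2, 3])
    return propose_evaluate_revise(
        sample_chunk=lambda: next(chunks),
        consistency_score=lambda c: 0.1 * c,
        simulate=lambda c: ("future-%d" % c, -abs(c - 2)),
        revise=lambda future: future.upper(),
        num_candidates=3,
        threshold=threshold,
    )
```

With the toy components, `run_toy_example(0.5)` falls through to simulation and returns the revised chunk, while `run_toy_example(0.2)` accepts the top-ranked proposal directly. Passing the components as callables keeps the sketch independent of how the score and simulator are actually implemented.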

Real-World Manipulation Results

On the four tasks unseen in the pretraining data, τ0-WM obtains the best average success rate and is strongest on most precision-sensitive tasks. In particular, Faucet remains difficult for all methods, indicating that the task is far from saturated; nevertheless, τ0-WM is more robust under these strict alignment constraints. These results support the benefit of joint video-action modeling for fine-grained real-world manipulation.

Toward Robots That Imagine Before They Act

τ0-WM points to a broader direction for robotic foundation models: robots should not only learn from more data, but also use computation at inference time to reason about the consequences of their actions.

By unifying action generation, video prediction, and action-conditioned future evaluation, τ0-WM turns a world model into an execution-time decision mechanism. It learns from heterogeneous interaction data, grounds visual prediction in executable controls, and allows the robot to propose, simulate, evaluate, and revise actions before acting.

The long-term promise is a shift from reactive manipulation to predictive manipulation. Future robots may not simply execute the first action a policy samples; they may imagine several possible futures, evaluate which one best advances the task, and act with that future in mind. τ0-WM takes a step toward that vision: robots that do not just react to the present, but reason about what their actions will make possible next.
