NVIDIA Open-Sources DreamDojo: A 44,000-Hour "Dream" to Solve the Robotics Data Gap

NVIDIA has officially entered the 2026 "World Model" arms race with the release of DreamDojo, an open-source foundation world model designed to simulate complex robotics tasks and environmental interactions directly from pixels. Described by NVIDIA’s Dr. Jim Fan as "Simulation 2.0," the model attempts to bypass the traditional "data bottleneck" of robotics by learning intuitive physics from 44,000 hours of human video.
The release marks a significant milestone in the industry’s shift toward generative simulation, joining recent efforts like 1X Technologies’ 1XWM and Google DeepMind’s Genie 3. However, NVIDIA is taking a distinct approach by open-sourcing the model weights, code, and datasets, inviting the broader research community to build upon its "World Action Model" framework.
Scaling the "Bitter Lesson"
At the heart of DreamDojo is the DreamDojo-HV (Human Videos) dataset, which NVIDIA claims is the largest and most diverse video corpus for world model pretraining to date. While traditional robot datasets like RT-1 or BridgeData V2 plateau at hundreds of hours, DreamDojo-HV comprises 44,711 hours of egocentric experience spanning 6,015 unique tasks and 1,135,000 trajectories.
The diversity is stark: the dataset includes 96x more skills and 2,000x more scenes than the most diverse public robot learning datasets. By training on humans performing daily activities—such as folding laundry, assembling objects, and handling tools—the model acquires a generalized understanding of physics that can be transferred to varied robotic embodiments.
This strategy mirrors the "900-hour bridge" approach used by 1X Technologies, which relies on first-person human video to teach the "intuitive physics" that motor-command regression alone often misses.
Solving the "Actionless" Video Problem
The primary challenge of training on passive human videos is the lack of action labels; a video of a person picking up a cup does not inherently tell a robot which joint torques were required. To bridge this gap, NVIDIA introduced continuous latent actions.
The researchers trained a 700-million-parameter spatiotemporal Transformer to extract semantically meaningful "proxy actions" directly from the visual changes between frames. This allows the model to treat any human video as if it came with motor commands attached, enabling zero-shot generalization to objects and environments never seen in the robot’s specific training set.
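To make the idea concrete, here is a minimal, illustrative sketch of relabeling actionless video with latent "proxy actions." The class, dimensions, and dummy data below are hypothetical stand-ins; DreamDojo’s actual extractor is the 700-million-parameter spatiotemporal Transformer described above, not this small MLP.

```python
import torch
import torch.nn as nn

class LatentActionEncoder(nn.Module):
    """Infer a continuous 'proxy action' from the visual change between
    two consecutive frames. Hypothetical toy module, for illustration only."""

    def __init__(self, frame_dim: int = 512, action_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * frame_dim, 1024),
            nn.GELU(),
            nn.Linear(1024, action_dim),
        )

    def forward(self, feat_t: torch.Tensor, feat_t1: torch.Tensor) -> torch.Tensor:
        # The latent action summarizes "what changed" between frame t and t+1,
        # letting any actionless human clip be relabeled with proxy actions.
        return self.net(torch.cat([feat_t, feat_t1], dim=-1))

# Relabel a clip of pre-extracted frame features: each consecutive pair of
# frames yields one latent action the world model can be conditioned on.
encoder = LatentActionEncoder()
frames = torch.randn(8, 512)                        # 8 frame features (dummy data)
latent_actions = encoder(frames[:-1], frames[1:])   # shape: (7, 32)
```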
Real-Time Control and "In-Dream" Planning
A world model’s utility is often limited by inference speed. To unlock downstream applications, NVIDIA developed a distillation pipeline that accelerates DreamDojo to a real-time speed of 10.81 FPS. This enables several high-value applications:
- Live Teleoperation: Users can connect VR controllers (such as a PICO headset) to teleoperate a virtual robot inside the "dream" in real time.
- Policy Evaluation: Success rates measured inside the DreamDojo simulation correlate near-linearly (by Pearson coefficient) with real-world results, allowing developers to rank robot checkpoints without physical deployment.
- Model-Based Planning: By simulating multiple action proposals in parallel and selecting the best "future," NVIDIA reported a 17% increase in success rates for a fruit-packing task (see the planning sketch below).
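The planning loop itself is conceptually simple: imagine several candidate futures, score them, and act on the best one. The sketch below illustrates that loop under assumed interfaces; `world_model.rollout`, `score_fn`, and the proposal sampler are hypothetical placeholders, not DreamDojo’s published API.

```python
import numpy as np

def sample_action_sequence(horizon: int, action_dim: int = 7) -> np.ndarray:
    # Placeholder proposal distribution; a real planner might perturb a base
    # policy's actions or refine proposals iteratively (e.g., CEM-style).
    return np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))

def plan_in_dream(world_model, score_fn, state, num_proposals: int = 16, horizon: int = 8):
    """Roll candidate action sequences through a learned world model and
    return the one whose imagined future scores best. All interfaces here
    (rollout, score_fn) are assumptions made for illustration."""
    proposals = [sample_action_sequence(horizon) for _ in range(num_proposals)]
    # Imagine one future per proposal; in practice these rollouts run in
    # parallel, which is where real-time inference speed matters.
    futures = [world_model.rollout(state, actions) for actions in proposals]
    scores = [score_fn(future) for future in futures]
    return proposals[int(np.argmax(scores))]
```

The same imagined rollouts underpin the policy-evaluation use case above: score candidate checkpoints inside the "dream" and rank them before any hardware deployment.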
Industry Context: The World Model Rebellion
The launch of DreamDojo arrives amidst a foundational debate regarding the "brain" of humanoid robots. While many firms have focused on Vision-Language-Action (VLA) models, critics like Yann LeCun have argued that these systems are too "LLM-pilled" and lack common sense.
NVIDIA’s approach aligns with the philosophy of LeCun's AMI Labs, prioritizing visual imagination and intuitive physics over text-based reasoning. Built on the open-weight Cosmos-Predict2.5 latent video diffusion model, DreamDojo represents NVIDIA’s bid to provide a foundational platform for the next generation of "Physical AI."
NVIDIA has released two variants of the model—a 2B model and a 14B model—both pretrained on 256 NVIDIA H100 GPUs. By making these assets public, NVIDIA aims to accelerate the development of general-purpose robots that can "think" and "imagine" their way through the messy reality of the physical world.