NVIDIA Open-Sources DreamDojo: A 44,000-Hour "Dream" to Solve the Robotics Data Gap

NVIDIA has officially entered the 2026 "World Model" arms race with the release of DreamDojo, an open-source foundation world model designed to simulate complex robotics tasks and environmental interactions directly from pixels. Described by NVIDIA’s Dr. Jim Fan as "Simulation 2.0," the model attempts to bypass the traditional "data bottleneck" of robotics by learning intuitive physics from 44,000 hours of human video.
The release marks a significant milestone in the industry’s shift toward generative simulation, joining recent efforts like 1X Technologies’ 1XWM and Google DeepMind’s Genie 3. However, NVIDIA is taking a distinct approach by open-sourcing the model weights, code, and datasets, inviting the broader research community to build upon its "World Action Model" framework.
Scaling the "Bitter Lesson"
At the heart of DreamDojo is the DreamDojo-HV (Human Videos) dataset, which NVIDIA claims is the largest and most diverse video corpus for world model pretraining to date. While traditional robot datasets like RT-1 or BridgeData V2 plateau at hundreds of hours, DreamDojo-HV comprises 44,711 hours of egocentric experience spanning 6,015 unique tasks and 1,135,000 trajectories.
The diversity is stark: the dataset includes 96x more skills and 2,000x more scenes than the most diverse public robot learning datasets. By training on humans performing daily activities—such as folding laundry, assembling objects, and handling tools—the model acquires a generalized understanding of physics that can be transferred to varied robotic embodiments.
This strategy mirrors the "900-hour bridge" approach used by 1X Technologies, which relies on first-person human video to teach the "intuitive physics" that motor-command regression alone often misses.
Solving the "Actionless" Video Problem
The primary challenge of training on passive human videos is the lack of action labels; a video of a person picking up a cup does not inherently tell a robot which joint torques were required. To bridge this gap, NVIDIA introduced continuous latent actions.
The researchers trained a 700-million-parameter spatiotemporal Transformer to extract semantically meaningful "proxy actions" directly from the visual changes between frames. This allows the model to treat any human video as if it came with motor commands attached, enabling zero-shot generalization to objects and environments never seen in the robot’s specific training set.
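To make the idea concrete, here is a minimal, illustrative sketch of relabeling actionless video with latent "proxy actions." The class, dimensions, and dummy data below are hypothetical stand-ins; DreamDojo’s actual extractor is the 700-million-parameter spatiotemporal Transformer described above, not this small MLP.

```python
import torch
import torch.nn as nn

class LatentActionEncoder(nn.Module):
    """Infer a continuous 'proxy action' from the visual change between
    two consecutive frames. Hypothetical toy module, for illustration only."""

    def __init__(self, frame_dim: int = 512, action_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * frame_dim, 1024),
            nn.GELU(),
            nn.Linear(1024, action_dim),
        )

    def forward(self, feat_t: torch.Tensor, feat_t1: torch.Tensor) -> torch.Tensor:
        # The latent action summarizes "what changed" between frame t and t+1,
        # letting any actionless human clip be relabeled with proxy actions.
        return self.net(torch.cat([feat_t, feat_t1], dim=-1))

# Relabel a clip of pre-extracted frame features: each consecutive pair of
# frames yields one latent action the world model can be conditioned on.
encoder = LatentActionEncoder()
frames = torch.randn(8, 512)                        # 8 frame features (dummy data)
latent_actions = encoder(frames[:-1], frames[1:])   # shape: (7, 32)
```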
Real-Time Control and "In-Dream" Planning
A world model’s utility is often limited by inference speed. To unlock downstream applications, NVIDIA developed a distillation pipeline that accelerates DreamDojo to a real-time speed of 10.81 FPS. This enables several high-value applications:
- Live Teleoperation: Users can connect VR controllers (such as a PICO headset) to teleoperate a virtual robot inside the "dream" in real time.
- Policy Evaluation: Success rates measured inside the DreamDojo simulation correlate near-linearly (by Pearson coefficient) with real-world results, allowing developers to rank robot checkpoints without physical deployment.
- Model-Based Planning: By simulating multiple action proposals in parallel and selecting the best "future," NVIDIA reported a 17% increase in success rates for a fruit-packing task (see the planning sketch below).
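The planning loop itself is conceptually simple: imagine several candidate futures, score them, and act on the best one. The sketch below illustrates that loop under assumed interfaces; `world_model.rollout`, `score_fn`, and the proposal sampler are hypothetical placeholders, not DreamDojo’s published API.

```python
import numpy as np

def sample_action_sequence(horizon: int, action_dim: int = 7) -> np.ndarray:
    # Placeholder proposal distribution; a real planner might perturb a base
    # policy's actions or refine proposals iteratively (e.g., CEM-style).
    return np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))

def plan_in_dream(world_model, score_fn, state, num_proposals: int = 16, horizon: int = 8):
    """Roll candidate action sequences through a learned world model and
    return the one whose imagined future scores best. All interfaces here
    (rollout, score_fn) are assumptions made for illustration."""
    proposals = [sample_action_sequence(horizon) for _ in range(num_proposals)]
    # Imagine one future per proposal; in practice these rollouts run in
    # parallel, which is where real-time inference speed matters.
    futures = [world_model.rollout(state, actions) for actions in proposals]
    scores = [score_fn(future) for future in futures]
    return proposals[int(np.argmax(scores))]
```

The same imagined rollouts underpin the policy-evaluation use case above: score candidate checkpoints inside the "dream" and rank them before any hardware deployment.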
Industry Context: The World Model Rebellion
The launch of DreamDojo arrives amidst a foundational debate regarding the "brain" of humanoid robots. While many firms have focused on Vision-Language-Action (VLA) models, critics like Yann LeCun have argued that these systems are too "LLM-pilled" and lack common sense.
NVIDIA’s approach aligns with the philosophy of LeCun's AMI Labs, prioritizing visual imagination and intuitive physics over text-based reasoning. Built on the open-weight Cosmos-Predict2.5 latent video diffusion model, DreamDojo represents NVIDIA’s bid to provide a foundational platform for the next generation of "Physical AI."
NVIDIA has released two variants of the model—a 2B model and a 14B model—both pretrained on 256 NVIDIA H100 GPUs. By making these assets public, NVIDIA aims to accelerate the development of general-purpose robots that can "think" and "imagine" their way through the messy reality of the physical world.