The Great Parallel: NVIDIA’s Jim Fan Outlines the Robotics "End Game" Strategy

- Jim Fan describes a "Great Parallel" where robotics follows the LLM evolutionary path: pre-training, alignment, reasoning, and autonomous research.
- The industry is pivoting from "Language-first" Vision-Language-Action (VLA) models to video-first World Action Models (WAMs) that prioritize physical laws over linguistic nouns.
- Human teleoperation is being replaced by "sensorized human data," with EgoScale demonstrating that 21,000 hours of egocentric video can predict robotic success.
- NVIDIA predicts a "Physical Turing Test" within 2–3 years and the completion of the robotic "technology tree" by 2040.
- The new scaling mantra, "compute equals environment equals data," drives the development of neural simulators like DreamDojo.
In a landmark talk at Sequoia Capital’s AI Ascent 2026, Jim Fan, Lead of the Embodied Autonomous Research group at NVIDIA, declared that the "end game" for robotics is no longer a distant vision, but a clearly defined technical roadmap. Fan argued that the field is currently undergoing a "Great Parallel," replicating the rapid evolution of Large Language Models (LLMs) by mapping robotic development onto the four-stage GPT playbook: pre-training, alignment, reasoning, and autonomous research.

Moving Beyond "LLM-Pilled" Robotics
For the past several years, the industry has leaned heavily on Vision-Language-Action (VLA) models, such as GR00T N1.5 and Physical Intelligence’s π0.7. Fan characterized these as "head-heavy" architectures—essentially language models with an action head "grafted" on top. While excellent at recognizing "nouns" (e.g., identifying a Coke can near a photo of Taylor Swift), Fan noted they often struggle with the "verbs" of physics.
The new paradigm, according to Fan, is the World Action Model (WAM). Exemplified by NVIDIA’s DreamZero, these models treat vision and action as first-class citizens. Rather than predicting the next word, WAMs predict the next physical state by "dreaming" future pixels and joint torques simultaneously. Fan pointed to video models like Sora and Veo as early evidence of "physics slop"—emergent understandings of gravity, buoyancy, and reflection that arise purely from scale.
The Death of Teleoperation
Fan issued a "moment of silence" for teleoperation, the historical gold standard of robot data collection. Capped by the 24-hour physical day and at the "mercy of robot gods" (hardware that frequently malfunctions), teleoperation cannot scale to the millions of hours required for generalist intelligence.
Instead, NVIDIA is betting on sensorized human data. This involves:
- Universal Manipulation Interfaces (UMI): deceptively simple hand-held gripper devices that let humans collect manipulation data with no robot in the loop.
- Egocentric Scaling: Using thousands of hours of first-person human video to build a foundational motor prior.
The most prominent example of this shift is EgoScale, which pre-trained on 20,854 hours of human video. Fan highlighted a "near-perfect" log-linear scaling law discovered in the research: as human data increases, a robot's zero-shot dexterity improves monotonically. Under this new paradigm, teleoperation accounts for less than 0.1% of the training mix.
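A log-linear scaling law of this kind means zero-shot success improves roughly linearly in the logarithm of human-video hours. The toy fit below illustrates the shape of such a law; the data points and the `predict` helper are invented for illustration and are not EgoScale's actual numbers.

```python
import math

# Hypothetical (invented) data: hours of egocentric human video vs.
# a robot's zero-shot task success rate.
hours = [100, 500, 2000, 8000, 21000]
success = [0.12, 0.21, 0.29, 0.37, 0.43]

# Fit success ≈ a * log10(hours) + b by ordinary least squares.
xs = [math.log10(h) for h in hours]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(success) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, success)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

def predict(h: float) -> float:
    """Predicted zero-shot success rate after h hours of human video."""
    return a * math.log10(h) + b

print(f"slope per decade of data: {a:.3f}")
```

Under such a law, each tenfold increase in human data buys a roughly constant bump in success rate, which is what makes passive egocentric video so much more attractive than teleoperation as a data source.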
Compute Equals Environment Equals Data
To bypass the "data bottleneck" of the physical world, NVIDIA is aggressively pursuing generative simulation. Fan introduced the concept of "Digital Cousins"—procedurally generated environments derived from iPhone scans. However, the ultimate goal is "Simulation 2.0" via neural simulators like DreamDojo.
DreamDojo replaces classical physics equations with data-driven video generation, outputting real-time sensor states at over 10 FPS. This allows robots to run reinforcement learning in the "dream space" of the model. "Compute now equals environment equals data," Fan stated, echoing the scaling sentiment that has defined the Blackwell GPU era.
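DreamDojo's internals are not public, but the "RL in dream space" idea can be sketched generically in the style of world-model RL methods: a learned model stands in for physics, and the policy is rolled out and evaluated entirely inside its predictions. All names below (`ToyWorldModel`, `dream_rollout`) are hypothetical, and the one-float "state" stands in for what would really be video frames and sensor readings.

```python
class ToyWorldModel:
    """Stand-in for a learned neural simulator: maps (state, action)
    to (next_state, reward). A real system would predict pixels and
    joint states; here the state is a single float for clarity."""
    def step(self, state: float, action: float) -> tuple[float, float]:
        next_state = 0.9 * state + action   # toy stand-in "physics"
        reward = -abs(next_state)           # reward for driving state to 0
        return next_state, reward

def dream_rollout(model: ToyWorldModel, policy, start: float,
                  horizon: int = 20) -> float:
    """Roll the policy forward entirely inside the model's 'dream space'
    (no real robot involved) and return the total imagined reward."""
    state, total = start, 0.0
    for _ in range(horizon):
        action = policy(state)
        state, reward = model.step(state, action)
        total += reward
    return total

# Trivial "policy improvement": rank candidate policies by imagined return.
model = ToyWorldModel()
candidates = [lambda s: -0.5 * s, lambda s: 0.1 * s]
best = max(candidates, key=lambda p: dream_rollout(model, p, start=1.0))
```

The point of the sketch is the economics: once the simulator is a neural network, generating more training environments is just a matter of spending more inference compute, which is the substance of "compute equals environment equals data."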
The 2040 Horizon
Fan concluded with a timeline for the "Robotics Technology Tree." He predicted that machines will pass the Physical Turing Test—performing tasks with a grace indistinguishable from humans—within the next 2–3 years. By 2040, he expects the realization of "Physical Auto Research," where robots begin to autonomously design and improve the next generation of themselves.
"Our generation was born too late to explore the earth and too early to explore the stars," Fan remarked, "but we are born just in time to solve robotics."