NVIDIA Keynote: Cosmos 3 Realizes the "World Action Model" End Game

- NVIDIA has launched Cosmos 3, an open-weights physical AI foundation model that unites vision reasoning, multimodal world generation, and native robotic action prediction in a single architecture.
- The platform shifts the industry from "head-heavy" Vision-Language-Action (VLA) models to a World Action Model (WAM) paradigm, prioritizing physical laws, forces, and trajectories over linguistic structures.
- Built on a mixture-of-transformers architecture, Cosmos 3 utilizes a reasoning block to interpret spatial-temporal scene dynamics and a generation block to synthesize highly accurate, physically grounded data.
- Operating as an omnimodel, Cosmos 3 natively outputs numerical action data—including joint angles and gripper trajectories—allowing developers to fine-tune it for specific robotic embodiments.
- NVIDIA announced the Cosmos Coalition with founding members like Agile Robots, Skild AI, and Generalist AI to standardize open-world models under the Linux Foundation’s OpenMDW 1.1 license.
TAIPEI — In the robotics industry, the taxonomy of "world models" has rapidly devolved into a crowded, hyper-capitalized battlefield. While some laboratories pour billions into predictive architectures, prominent industry voices have openly questioned whether grafting robotic actions onto brains trained primarily on internet text is holding the field back.
At NVIDIA GTC Taipei, held during COMPUTEX, NVIDIA delivered its architectural countermove. The company launched NVIDIA Cosmos 3, an open world foundation model built to unify vision reasoning, physical simulation, and action prediction in a single system.
The release marks the literal realization of the "end game" strategy outlined just weeks ago by Jim Fan, NVIDIA’s Lead of Embodied Autonomous Research. In his landmark talk, Fan detailed a necessary industry transition from "head-heavy," language-first frameworks toward video-first World Action Models (WAMs) that treat physics and actions as first-class citizens. Cosmos 3 is that blueprint brought to production.

The Architecture: Physics as a First-Class Citizen
Cosmos 3 tackles the core bottleneck of physical AI: enabling robots, autonomous vehicles (AVs), and vision agents to generalize in unstructured real-world settings with limited physical training data.
To move past the limitations of models that struggle with the "verbs" of physics, NVIDIA designed a mixture-of-transformers architecture. The model pairs a dedicated reasoning transformer block with an expert generation transformer block. The reasoning block interprets moving scenes, object interactions, and spatiotemporal relationships. The generation block then uses that understanding to produce physically accurate outputs, from synthetic video sequences to text, images, and ambient sound.
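The pairing of two transformer blocks can be sketched in a few lines. The toy layer below follows one common reading of "mixture-of-transformers": each token is handled by the weights of its own expert, while attention runs jointly over all tokens so the generation tokens can read what the reasoning tokens encode. This is an assumption for illustration only; the article does not describe Cosmos 3's internals at this level, and every name here (`ExpertWeights`, `mixture_layer`, the `"reason"` and `"generate"` roles) is hypothetical.

```python
import math
import random


def matvec(weights, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ExpertWeights:
    """Hypothetical per-expert projections: query, key, value, output."""

    def __init__(self, rng, dim):
        def matrix():
            return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

        self.wq, self.wk, self.wv, self.wo = matrix(), matrix(), matrix(), matrix()


def mixture_layer(tokens, roles, experts):
    """One toy layer: per-role projections, attention shared across all tokens."""
    dim = len(tokens[0])
    # Each token is projected with the weights of its own expert.
    queries = [matvec(experts[r].wq, t) for t, r in zip(tokens, roles)]
    keys = [matvec(experts[r].wk, t) for t, r in zip(tokens, roles)]
    values = [matvec(experts[r].wv, t) for t, r in zip(tokens, roles)]
    outputs = []
    for i, token in enumerate(tokens):
        # Attention spans every token, regardless of which expert owns it.
        scores = [sum(q * k for q, k in zip(queries[i], key)) / math.sqrt(dim)
                  for key in keys]
        attn = softmax(scores)
        mixed = [sum(a * value[d] for a, value in zip(attn, values))
                 for d in range(dim)]
        projected = matvec(experts[roles[i]].wo, mixed)
        # Residual connection around the attention output.
        outputs.append([t + p for t, p in zip(token, projected)])
    return outputs


rng = random.Random(0)
experts = {"reason": ExpertWeights(rng, 4), "generate": ExpertWeights(rng, 4)}
tokens = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(5)]
roles = ["reason", "reason", "reason", "generate", "generate"]
outputs = mixture_layer(tokens, roles, experts)
```

In this sketch the shared attention step is what lets the generation tokens condition on the scene understanding held in the reasoning tokens, while the separate weights let each block specialize.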
Trained on a massive multimodal dataset comprising billions of physical AI samples, Cosmos 3 has swept the open leaderboards. It ranks first across Artificial Analysis, Physics-IQ, PAI-Bench, and R-Bench for world generation accuracy.
NVIDIA is releasing the model family across three tiers:
- Cosmos 3 Super: Designed for post-training robotics and AV models requiring the highest physics accuracy and generation quality.
- Cosmos 3 Nano: A lightweight variant optimized for high-quality video and action reasoning in fractions of a second.
- Cosmos 3 Edge: A forthcoming version tailored for real-time inference directly on physical hardware at the edge.

Generating Action Data for Diverse Embodiments
Crucially, Cosmos 3 operates as an omnimodel featuring native action generation. Instead of utilizing vision-language training as a conceptual crutch, the system directly outputs numerical action data, including joint angles, gripper positions, and spatial trajectory points.
For complex tasks like bimanual manipulation, robots require immediate, reactive guidance on how to reach, grasp, and correct for forces mid-action. Developers can fine-tune Cosmos 3 to adapt its underlying physical commonsense to specific camera layouts, custom workspaces, or unique mechanical forms. In simulated and real-world benchmarks, policies post-trained with Cosmos 3 Nano secured the top spots on the RoboLab and RoboArena leaderboards.
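To make the idea of numerical action output concrete, here is a minimal sketch of what one action step might look like once it is mapped onto a specific robot. It assumes the model emits joint commands normalized to [-1, 1], which the article does not state. The `Embodiment` and `ActionStep` types and the `denormalize` helper are hypothetical names for illustration, not part of any NVIDIA API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Embodiment:
    """Hypothetical description of one robot's joints."""
    name: str
    joint_limits: List[Tuple[float, float]]  # (min, max) in radians, per joint


@dataclass(frozen=True)
class ActionStep:
    """One numerical action: joint angles, gripper opening, end-effector target."""
    joint_angles: List[float]              # radians
    gripper_position: float                # 0.0 = closed, 1.0 = open
    waypoint: Tuple[float, float, float]   # metres, in the robot's base frame


def denormalize(normalized: List[float], gripper: float,
                waypoint: Tuple[float, float, float],
                robot: Embodiment) -> ActionStep:
    """Map joint values assumed to lie in [-1, 1] onto this robot's joint ranges."""
    if len(normalized) != len(robot.joint_limits):
        raise ValueError("action size does not match the robot's joint count")
    angles = []
    for value, (low, high) in zip(normalized, robot.joint_limits):
        clipped = max(-1.0, min(1.0, value))  # never command past a joint limit
        angles.append(low + (clipped + 1.0) * 0.5 * (high - low))
    return ActionStep(angles, max(0.0, min(1.0, gripper)), tuple(waypoint))


# Usage: a made-up two-joint arm.
arm = Embodiment("toy-arm", [(-1.0, 1.0), (0.0, 2.0)])
step = denormalize([0.0, -1.0], gripper=0.3, waypoint=(0.4, 0.0, 0.2), robot=arm)
```

Keeping the joint limits in a separate `Embodiment` record reflects the adaptation described above: the same normalized output can be retargeted to a different mechanical form by swapping that record, though real fine-tuning changes the model itself rather than only a post-processing step.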
The release integrates deeply with NVIDIA's hardware ecosystem. Also at GTC Taipei, the company unveiled the NVIDIA Isaac GR00T Reference Humanoid Robot, an open reference architecture leveraging next-generation Jetson AGX Thor T5000 compute. Cosmos 3 functions as the predictive baseline driving these setups, compressing research validation cycles from months to days.

The Cosmos Coalition: Scaling the Digital Flywheel
To cement this architecture as the global standard for physical intelligence, NVIDIA announced the Cosmos Coalition. The global collaboration unites world model builders, AI developers, and robotics pioneers to advance open-source physical AI by sharing models, evaluation metrics, and large-scale training workflows over NVIDIA DGX Cloud infrastructure.
The founding coalition members span a broad cross-section of the current robotics landscape:
- Agile Robots: The Munich-based automation standout is already using Cosmos 3 to generate action-conditioned trajectories at scale for its policy development, including its industrial-grade Agile ONE humanoid platform. This deepens their existing effort to scale factory-floor intelligence following a research partnership with Google DeepMind.
- Skild AI: Backed by a historic $1.4 billion Series C war chest, Skild is leveraging Cosmos to supercharge its "omni-bodied" foundation model. The integration will feed directly into their fleet orchestration software, recently expanded through the acquisition of Zebra Technologies’ Fetch Robotics division.
- Generalist AI: Known for its strict conviction in training large parameter counts entirely from scratch, Generalist AI has joined the coalition. Access to Cosmos 3's synthetic data engine provides a massive pipeline to supplement their proprietary dataset, continuing the push for "intelligent improvisation" seen in their recent GEN-1 model rollout.

"Compute Equals Environment Equals Data"
The open-weights release of Cosmos 3 under the Linux Foundation's OpenMDW 1.1 license represents a calculated blow against closed-source universal "intelligence layer" models. The license provides a unified framework allowing developers to train, modify, redistribute, and deploy weights, documentation, and source code across enterprise pipelines.
By open-sourcing the core world model, NVIDIA is executing on Jim Fan's core scaling mantra: compute now equals environment equals data. Rather than forcing labs to rely on slow, dangerous, and fragile physical teleoperation, Cosmos 3 allows robots to run reinforcement learning inside a generative neural simulator—essentially "dreaming" thousands of hours of flawless physical interactions in parallel.
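The "dreaming" loop can be illustrated with a deliberately tiny example. In the sketch below, a one-line function stands in for the generative simulator, and a simple search over controller gains stands in for reinforcement learning; both are simplifications chosen for brevity, not a description of how Cosmos 3 is used in practice. All names (`world_model`, `imagined_return`, `search_policy`) are hypothetical.

```python
import random


def world_model(position, action):
    """Stand-in for a learned simulator: predicts the next 1-D position."""
    return position + 0.1 * action


def imagined_return(gain, start, target, steps=20):
    """Roll a proportional policy forward inside the model and score it."""
    position = start
    total = 0.0
    for _ in range(steps):
        # Policy: push toward the target, with the action clipped to [-1, 1].
        action = max(-1.0, min(1.0, gain * (target - position)))
        position = world_model(position, action)
        total -= abs(target - position)  # reward is negative distance to target
    return total


def search_policy(candidates, starts, target):
    """Pick the gain with the best average return over many imagined rollouts."""
    def score(gain):
        return sum(imagined_return(gain, s, target) for s in starts) / len(starts)

    return max(candidates, key=score)


rng = random.Random(0)
starts = [rng.uniform(-1.0, 1.0) for _ in range(32)]  # 32 imagined episodes
best_gain = search_policy([0.0, 0.5, 2.0, 8.0], starts, target=0.5)
```

No physical robot moves during this search: every rollout happens inside `world_model`, which is the sense in which compute substitutes for environment and data. The quality of the resulting policy is bounded by how faithfully the model predicts real dynamics.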
"The big bang of physical AI is just around the corner thanks to breakthroughs in multimodal reasoning language, vision and world models," NVIDIA CEO Jensen Huang declared during his keynote. By democratizing the predictive engine of physics, NVIDIA isn't just selling silicon; it is positioning its ecosystem as the absolute bedrock for the upcoming Physical Turing Test.
Cosmos 3 Super and Nano are available today on Hugging Face, GitHub, and via NVIDIA NIM microservices.