Rhoda AI Hits $1.7B Valuation, Unveils "Direct Video-Action" Model to Bridge the Real-World Gap

After months of operating as one of the robotics industry's most high-profile "ghosts," Rhoda AI has officially stepped into the light. On March 10, 2026, the Palo Alto-based startup announced a massive $450 million Series B funding round led by Premji Invest, propelling the company to a $1.7 billion valuation.
The launch marks a significant pivot for founder and CEO Jagdeep Singh, the former head of solid-state battery pioneer QuantumScape, who is now betting that the key to "physical AI" lies not in specialized robot data, but in the vast, untapped archive of the internet. Alongside the funding, Rhoda unveiled its Direct Video-Action (DVA) model, a foundation model designed to solve the robustness gap that has long kept intelligent robots confined to controlled laboratory settings.
The DVA Paradigm: Video as Policy
While many competitors utilize Vision-Language-Action (VLA) models—which typically learn by mimicking human teleoperators—Rhoda is pursuing a "video-first" strategy. The DVA model operates by predicting the future visually before translating those frames into physical movement.
The system consists of two primary components:
- Causal Video Model: Pre-trained on hundreds of millions of publicly available internet videos, this model learns a deep "prior" on motion, 3D structure, and intuitive physics.
- Inverse Dynamics Model: A smaller translator that converts the predicted video frames into specific motor torques and joint angles for the robot.
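The two-stage flow can be sketched as a control loop: imagine the next frame, then act to realize it. The sketch below is purely illustrative, with toy linear stand-ins for both networks; the class names, dimensions, and `dva_step` helper are invented here and are not Rhoda's actual architecture or API.

```python
import numpy as np

# Hypothetical sketch of a "video-as-policy" pipeline, loosely following
# the DVA description: a causal video model predicts the next frame, and
# an inverse dynamics model maps the (current, predicted) frame pair to
# motor commands. Both models are random linear stand-ins, not real nets.

rng = np.random.default_rng(0)

FRAME_DIM = 64    # flattened visual features per frame (toy size)
ACTION_DIM = 7    # e.g. joint torques for a 7-DoF arm

class CausalVideoModel:
    """Stand-in for the pre-trained video model: frame -> next frame."""
    def __init__(self):
        self.w = rng.normal(scale=0.1, size=(FRAME_DIM, FRAME_DIM))

    def predict_next(self, frame: np.ndarray) -> np.ndarray:
        return np.tanh(frame @ self.w)

class InverseDynamicsModel:
    """Stand-in translator: (current, predicted) frames -> action vector."""
    def __init__(self):
        self.w = rng.normal(scale=0.1, size=(2 * FRAME_DIM, ACTION_DIM))

    def infer_action(self, frame: np.ndarray, next_frame: np.ndarray) -> np.ndarray:
        return np.concatenate([frame, next_frame]) @ self.w

def dva_step(video_model, idm, observation):
    """One control step: predict the future visually, then translate to motion."""
    predicted = video_model.predict_next(observation)
    return idm.infer_action(observation, predicted)

obs = rng.normal(size=FRAME_DIM)
action = dva_step(CausalVideoModel(), InverseDynamicsModel(), obs)
print(action.shape)  # (7,)
```

The point of the split is that only the small inverse dynamics model needs robot-specific data, which is consistent with the 10-to-20-hour adaptation claim below.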
"The industry's intelligent robots... work well in a laboratory setting," Singh noted in a launch video. "But when you take those same models into the real world, they don’t perform so well. Their entire understanding of physics comes from a relatively small robot teleoperation data set."
By contrast, Rhoda claims its model can learn complex, long-horizon industrial tasks with as little as 10 to 20 hours of specific robot data, relying on its pre-trained "world model" to handle the messy variables of reality.

From "Lab Demos" to Production Lines
To prove the model’s efficacy, Rhoda demonstrated its hardware operating within one of the world’s largest automotive factories. Unlike the precision-controlled unboxing tasks teased during its stealth phase, the new data shows robots handling heavy industrial workloads.
One "Decanting" task required the system to unpack 10kg boxes, pull small tabs, and sort deformable plastic bags—a process their industrial partner previously considered "infeasible to automate." Another demonstration featured the breakdown of 50-pound "Contico" containers, requiring the robot to manage partial observability and high-force interactions.
This move toward "factory-ready" tools puts Rhoda in direct competition with Physical Intelligence (Pi), which recently unveiled its model for e-commerce packaging, and Generalist AI, which has focused on physical commonsense through large-scale real-world interaction data.

Solving for Memory and Ambiguity
A standout feature of the DVA model is its Long-Context Visual Memory. While standard VLA models may only process a few frames of history, Rhoda’s architecture natively handles hundreds of frames. This allows the robot to resolve visual ambiguity without the need for hand-engineered "scaffolding" or subtask indicators.
In a technical blog post, the company demonstrated this memory through a "Shell Game" challenge. The robot successfully tracked an object hidden beneath three shuffling shells, a task that requires persistent tracking of an object it can no longer see. This approach to memory offers a fascinating contrast to Pi’s Multi-Scale Embodied Memory (MEM), which combines short-term video with long-term text summaries to maintain context for up to 15 minutes.
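Why context length matters in the shell game can be shown with a toy replay: if the single frame that reveals the ball's starting shell has fallen out of a short history buffer, no amount of reasoning over the remaining frames recovers it. Everything below — the event encoding, function name, and swap log — is invented for illustration, not Rhoda's method.

```python
from collections import deque

def track_hidden_object(events, context_len):
    """Replay shell swaps from a bounded frame buffer.

    events: sequence of frames, each ("reveal", shell) or ("swap", i, j).
    Returns the shell holding the object, or None if the reveal frame
    has already scrolled out of the context window.
    """
    buffer = deque(maxlen=context_len)  # rolling visual history
    for frame in events:
        buffer.append(frame)

    position = None
    for frame in buffer:
        if frame[0] == "reveal":
            position = frame[1]
        elif frame[0] == "swap" and position in frame[1:]:
            _, i, j = frame
            position = j if position == i else i
    return position

# One reveal frame, then 201 swaps of shells 0 and 1.
events = [("reveal", 0)] + [("swap", 0, 1)] * 201
long_ctx = track_hidden_object(events, context_len=400)   # sees the reveal
short_ctx = track_hidden_object(events, context_len=8)    # reveal lost
print(long_ctx, short_ctx)  # 1 None
```

With a buffer of hundreds of frames the reveal survives and the object is tracked through every swap; with only a few frames of history the task is unsolvable, which is the gap Rhoda's long-context architecture targets.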
The Data Arms Race
Rhoda’s emergence reinforces a growing divide in how the industry handles the "data bottleneck." The current approaches split roughly three ways:
- Google DeepMind continues to argue that one more big breakthrough is needed to bridge the gap between seeing the world and handling it.
- Sunday Robotics is doubling down on human-to-robot transfer via its Skill Capture Glove.
- Rhoda AI is betting that the "physics of everything" is already recorded on YouTube.
Investors seem to agree with Singh’s vision. The $450 million round included participation from Khosla Ventures, Temasek, and John Doerr. Vinod Khosla, who incubated the company within his firm, noted that "the real world is messy... being able to actually work on production lines is much, much harder than doing the demo."
As Rhoda moves from stealth to scale, the company plans to not only license its software but also develop its own hardware to act as a data-collection engine. For an industry hungry for reliable, general-purpose agents, Rhoda’s "Direct Video-Action" may represent the next major evolution in the quest for physical AGI.