Google DeepMind Robotics Director: We Need "One More Big Breakthrough" to Solve General-Purpose Robots

The Apptronik Apollo humanoid robot holding a white piece of clothing while performing a laundry sorting task.
Hardware Agnostic: DeepMind demonstrates its 'cross-embodiment' strategy by deploying its software onto the Apptronik Apollo humanoid, shown here assisting with laundry tasks. Image: Google DeepMind/YouTube

In the race to build general-purpose robots, companies often project an image of imminent victory. However, in a candid new video podcast released by Google DeepMind, the lab’s leadership offered a refreshing dose of realism alongside their latest demonstrations.

While showcasing robots that can "think" before they act and sort trash based on vague instructions, Kanishka Rao, DeepMind’s Director of Robotics, admitted that the industry hasn't quite cracked the code yet.

"I think we need at least one more big breakthrough," Rao told mathematician and host Hannah Fry during a tour of the company's California facilities.

The "Inner Monologue" of a Robot

The episode highlights the capabilities of Gemini Robotics 1.5, a framework that splits the robotic "brain" into two distinct parts: an Embodied Reasoning (ER) model that plans high-level strategy, and a Vision-Language-Action (VLA) model that executes movements.

The video demonstrates this "orchestration" in real-time. In one demo, Fry tells a robot, "I'm in San Francisco and I don't know the rules about sorting trash. Can you look it up for me and then tidy up?"

The robot doesn't just move; it first accesses Google Search to learn local recycling laws (compost vs. recycling vs. landfill), creates a plan, and then executes the sort.

Crucially, the system displays an "internal monologue"—a stream of text reasoning—before it moves. "Reds are all in the black box," the robot "thinks" while sorting laundry in a separate demo, confirming its semantic understanding of the scene before committing to a physical action.
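
The podcast does not show code, but the orchestration it describes maps onto a simple loop: a reasoning model reads the instruction and scene, may call tools such as search, and emits a plan whose steps are handed one at a time to a separate action model. The sketch below is a hypothetical illustration of that pattern in Python; the class and method names (EmbodiedReasoner, VisionLanguageAction, web_search) are assumptions for clarity, not DeepMind's actual API.

```python
# Hypothetical sketch of the two-model "orchestration" described above.
# Names (EmbodiedReasoner, VisionLanguageAction, web_search) are illustrative,
# not Google DeepMind's actual API.

from dataclasses import dataclass

@dataclass
class Step:
    thought: str   # the "internal monologue" text shown before acting
    action: str    # a short natural-language command for the VLA model

class EmbodiedReasoner:
    """High-level planner: reads the scene, may call tools, emits steps."""
    def plan(self, instruction: str, scene: dict) -> list[Step]:
        rules = self.web_search("San Francisco waste sorting rules")  # optional tool call
        # In the demo, the plan is derived from the scene and the looked-up rules;
        # here we just return a fixed example plan.
        return [
            Step("Banana peel is organic, so it goes to compost", "put the banana peel in the green bin"),
            Step("Plastic cup is recyclable", "put the cup in the blue bin"),
        ]

    def web_search(self, query: str) -> str:
        ...  # e.g. a search tool the reasoning model can invoke

class VisionLanguageAction:
    """Low-level controller: turns a short command plus camera images into motor actions."""
    def execute(self, command: str, observation) -> None:
        ...  # closed-loop motor control on the robot

def run(instruction: str, reasoner: EmbodiedReasoner, vla: VisionLanguageAction, robot) -> None:
    steps = reasoner.plan(instruction, robot.observe())
    for step in steps:
        print("thinking:", step.thought)           # surfaced as the robot's monologue
        vla.execute(step.action, robot.observe())  # physical execution of one step
```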

A robotic arm grasping a small potted plant from a table cluttered with various items like a football and a bag of chips.
Testing the limits of generalization: The robot is challenged to manipulate objects it has never encountered before, such as this potted plant. Image: Google DeepMind/YouTube

The Data Bottleneck: Teleop vs. The World

Despite these advances, the conversation highlighted the industry's most persistent bottleneck: data.

When asked if the current architecture is sufficient to "pack it up" and declare robotics solved, Rao was skeptical. He noted that while large language models (LLMs) had the entire internet to learn from, robotics suffers from a scarcity of "physical interaction data."

"It's not as big as the internet," Rao explained. "We have a breakthrough where they can learn more efficiently... but the core of the problem is still the robot data."

This highlights diverging philosophies in Silicon Valley about how to obtain that data.

  • DeepMind's Approach: As revealed in the footage, the lab uses specialized mechanical "leader arms" (physical rigs that operators manipulate directly) to teleoperate the robots, rather than VR headsets. This 1:1 "puppet" matching yields high-quality manipulation data, and DeepMind emphasizes that the resulting skills are transferable: behaviors learned on these training rigs can be deployed onto entirely different robot bodies, such as the Apptronik Apollo, validating the "cross-embodiment" strategy. (A rough sketch of this recording setup follows the list below.)
  • Sunday Robotics' Approach: Interestingly, Sunday Robotics—founded by former DeepMind researcher Tony Zhao—is explicitly trying to bypass this bottleneck. By distributing "Skill Capture Gloves" to users in homes, Sunday aims to collect millions of trajectories without needing a robot present at all.
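
As a rough illustration of what the leader-arm setup produces, here is a hedged sketch of a teleoperation recording loop: the operator's leader-arm joint angles are mirrored onto the follower robot and logged together with camera frames as one demonstration trajectory. The interfaces shown (LeaderArm, FollowerRobot, Camera) are assumptions for illustration, not DeepMind's internal tooling; datasets of such trajectories are what imitation-learning policies are typically trained on.

```python
# Hypothetical sketch of leader-arm teleoperation data capture.
# The interfaces (leader, follower, camera) are assumptions for illustration;
# they are not DeepMind's internal tooling.

import time

def record_episode(leader, follower, camera, hz=30, max_steps=3000):
    """Mirror the operator's leader-arm motion onto the robot and log the trajectory."""
    trajectory = []
    for _ in range(max_steps):
        joints = leader.read_joint_angles()     # operator moves the physical rig
        follower.command_joint_angles(joints)   # 1:1 "puppet" matching on the robot
        trajectory.append({
            "timestamp": time.time(),
            "joint_angles": joints,
            "image": camera.capture(),          # paired visual observation
        })
        time.sleep(1.0 / hz)
    return trajectory  # one demonstration for an imitation-learning dataset
```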

While DeepMind’s teleoperation allows for high precision—demonstrated in the video by a robot packing a sandwich into a Ziploc bag with millimeter accuracy—it remains labor-intensive.

A researcher sitting at a desk using a mechanical 'leader arm' rig to teleoperate a robot, teaching it how to pack a suitcase.
The 'Teacher': DeepMind uses these physical teleoperation rigs to capture high-fidelity manipulation data. This human motion is recorded and used to train the neural networks that control the robots. Image: Google DeepMind/YouTube

Learning from YouTube

If teleoperation is too slow and simulation is too "clean," where will the data come from? Keerthana Gopalakrishnan, a Research Scientist at DeepMind, pointed to a massive, untapped resource: video.

"There is a lot of manipulation data that is collected by humans posting videos about how to do anything," Gopalakrishnan said, referencing platforms like YouTube. "We should be able to learn from that at some point."

This aligns with DeepMind's broader "hardware agnostic" strategy. The video features the software running not just on stationary arms, but on the Apptronik Apollo humanoid, further cementing the company's goal to build the "Android operating system" for robotics rather than just the hardware.

Apptronik celebrated the feature, noting on social media that Apollo was "responding to complex instructions and adapting to changing contexts."

"A Long Tail of Problems"

The video serves as a progress report for DeepMind's "physical AI" ambitions. Visual generalization (robots' ability to ignore changes in lighting or background) is "much more solved" than it was four years ago, according to Rao.

However, the "final picture" of general-purpose robotics still requires bridging the gap between seeing the world and physically handling it with the ease of a human.

"There’s one hypothesis that [data] is all you need," Rao concluded. "If you can collect that much robot data, then we're done... but there is still a long tail of problems to solve."

Watch the full episode of the Google DeepMind podcast below
