Humanoids Daily

Physical Intelligence Claims ‘RL is Back’ With New Model That Learns From Its Own Mistakes

Two robotic arms interacting with a commercial espresso machine
The π*0.6 model prepares espresso drinks. Physical Intelligence claims the robot ran continuously from 5:30 AM to 11:30 PM, using reinforcement learning to handle the long-horizon task without interruption.

In the race to solve the "physical AI bottleneck," most leading robotics companies have placed their bets on imitation learning—teaching robots by showing them massive amounts of human data, whether through video or teleoperation. Today, Physical Intelligence (Pi) argued for a different approach: letting the robot practice.

In a technical update released today, the company unveiled π*0.6 (pi-star-zero-point-six), a new vision-language-action (VLA) model trained using a method it calls Recap. Unlike standard models that rely solely on copying human behavior, π*0.6 is designed to learn from its own successes and failures, a process the company says has allowed its robots to perform complex tasks for hours on end without interruption.

"RL is back," declared Karol Hausman, a co-founder at Physical Intelligence, in a post on X (formerly Twitter) accompanying the release. To prove the point, Hausman shared a timelapse video of two robotic arms operating an espresso machine, grinding beans, pulling shots, and cleaning up, continuously for 13 hours.

Beyond Imitation: The ‘Recap’ Method

The core of Pi's announcement is a critique of the current industry standard: behavior cloning. While imitation learning—training a robot on expert demonstrations—can get a system to work "half of the time," the company argues it fails to deliver the reliability needed for real-world deployment.

The problem, according to the company's blog post, is compounding errors. When a robot trained only on perfect human demonstrations makes a small mistake (like grasping a handle slightly off-center), it enters a state it hasn't seen before. Confused, it often makes a larger mistake, leading to failure.
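
The compounding-error argument is easy to illustrate with a toy simulation (an invented example, not an analysis from Pi's post): while the robot's state stays close to what the demonstrations covered, its per-step error stays small, but once it drifts outside that familiar region the errors grow and push the state even further out.

```python
import numpy as np

rng = np.random.default_rng(0)

def imitation_rollout(horizon=50, base_error=0.01, familiar_radius=0.05, penalty=5.0):
    """Toy model of compounding errors in pure imitation learning.
    'deviation' tracks how far the state has drifted from the demonstrated
    trajectory; outside the familiar region, per-step errors are larger."""
    deviation = 0.0
    for _ in range(horizon):
        in_distribution = abs(deviation) < familiar_radius
        step_error = base_error if in_distribution else penalty * base_error
        deviation += rng.normal(0.0, step_error)
    return abs(deviation)

# Average final drift over many episodes: small early mistakes snowball
# once the policy leaves the states it was trained on.
print(np.mean([imitation_rollout() for _ in range(1_000)]))
```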

Pi’s solution, Recap (RL with Experience & Corrections via Advantage-conditioned Policies), attempts to mimic how humans master physical skills:

  1. Instruction: The robot watches human demonstrations (imitation).
  2. Coaching: A human teleoperator watches the robot and intervenes to correct mistakes in real-time, showing the robot how to recover from errors.
  3. Practice: The robot attempts the task autonomously thousands of times, using Reinforcement Learning (RL) to "score" its own actions, keeping what works and discarding what doesn't, as sketched below.
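
Physical Intelligence has not published implementation details for Recap, but the three phases roughly amount to merging three sources of data (demonstrations, human corrections, and autonomous practice) into one training set, with the autonomous data filtered by how well it went. The sketch below is a hypothetical illustration of that data flow; the names (`Transition`, `recap_training_batch`) and the binary `episode_succeeded` label are invented for this example, and the actual system reportedly scores actions with a learned value function rather than a coarse per-episode flag.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

class Source(Enum):
    DEMONSTRATION = "demo"        # phase 1: human teleoperated demonstrations
    CORRECTION = "correction"     # phase 2: human interventions during robot execution
    AUTONOMOUS = "autonomous"     # phase 3: the robot's own practice rollouts

@dataclass
class Transition:
    observation: dict             # e.g. camera images, proprioception, task instruction
    action: list                  # commanded joint or end-effector targets
    source: Source
    episode_succeeded: bool       # coarse outcome label (placeholder for a learned value)

def recap_training_batch(transitions: List[Transition]) -> List[Transition]:
    """Assemble a training set in the spirit of the three Recap phases:
    always keep human data, keep autonomous practice only when it is
    judged useful (here, simply whether the episode succeeded)."""
    batch = []
    for t in transitions:
        if t.source in (Source.DEMONSTRATION, Source.CORRECTION):
            batch.append(t)       # human demos and corrections are always kept
        elif t.episode_succeeded:
            batch.append(t)       # autonomous rollouts are filtered by outcome
    return batch
```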
Robot arms folding a garment on a table.
Demonstrating generalization with deformable objects. The robot reportedly folded 50 different 'novel' laundry items in a new environment, adjusting to the specific dynamics of different fabrics.

The company claims this "practice" phase allows the model to refine its technique far beyond what is possible with human data alone. By training a "value function"—essentially a software critic that predicts the probability of success—the system can filter out bad behaviors and prioritize high-value actions.
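
Under that framing, the critic's per-step success predictions can be turned into a step-level score: an action looks good if it raised the predicted probability of success. The snippet below is a minimal, hypothetical sketch of such an advantage computation; the function name and numbers are illustrative, and Pi's actual value-function training and advantage conditioning have not been published.

```python
import numpy as np

def advantages_from_value_estimates(values: np.ndarray, final_outcome: float) -> np.ndarray:
    """values[t] is the critic's estimate of P(success) from state t;
    final_outcome is 1.0 for a successful episode, 0.0 otherwise.
    Each action is scored by how much it changed the predicted success
    probability: positive means it moved the episode toward success."""
    next_values = np.append(values[1:], final_outcome)
    return next_values - values

# Illustrative usage on one made-up rollout:
values = np.array([0.40, 0.55, 0.35, 0.70, 0.90])
adv = advantages_from_value_estimates(values, final_outcome=1.0)
helpful = adv > 0.0   # e.g. up-weight these steps, or condition the policy on the score
```

An advantage-conditioned policy, as the Recap acronym suggests, would presumably take a score like this as an extra input during training, so that at deployment it can be prompted to reproduce only the high-advantage behavior.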

13 Hours of Coffee and ‘Factory-Ready’ Boxes

To validate the method, Pi tested the π*0.6 model on three distinct tasks: making espresso, folding diverse laundry items, and assembling cardboard boxes.

The results reportedly show a significant leap in reliability. The company claims that adding the autonomous practice phase "more than doubles the throughput on some of the hardest tasks" compared to its previous models trained with supervised learning alone.

  • Espresso: The model ran from 5:30 AM to 11:30 PM, handling the long-horizon task of grinding, tamping, extracting, and cleaning.
  • Laundry: The robot folded 50 distinct, novel clothing items in a new environment.
  • Logistics: In a factory setting, the robot assembled and labeled 59 packaging boxes used for chocolates, handling material inconsistencies like boxes sticking together.
A robot arm separating two stuck-together pieces of cardboard packaging.
A visualization of autonomous error recovery. The model detects that it has accidentally grabbed two flattened boxes—a common 'edge case' where materials stick together—and separates them to continue assembly without human intervention.

The Strategic Landscape: RL vs. Big Data

This development sharpens the philosophical divide currently splitting the humanoid robotics sector.

On one side are companies like Figure AI and Tesla. Figure is betting on "Project Go-Big," aiming to achieve autonomy by training on massive datasets of human video. Tesla relies on its "World Simulator," hoping to transfer physics understanding from virtual environments to the real world.

On the other side are proponents of active, embodied learning. 1X Technologies uses a "human-in-the-loop" strategy, where teleoperation is the engine that drives autonomy, allowing robots to "live and learn" in the real world.

Physical Intelligence, co-founded by academic heavyweights Sergey Levine and Chelsea Finn, appears to be carving out a hybrid niche. Their approach combines the "human-in-the-loop" corrections used by 1X with the rigorous, autonomous reinforcement learning that defined early research breakthroughs in labs like Google DeepMind.

This focus on "embodied intelligence" aligns with Pi's recent strategic moves, including a partnership with Chinese robotics firm AgiBot to deploy their "brains" into capable humanoid bodies. That collaboration specifically targeted "complex, long-duration tasks," a goal that the new Recap method seems directly engineered to solve.

While competitors race to ingest the internet or record millions of human hours, Pi is betting that the final mile of robotic intelligence won't come from watching more videos—it will come from doing the work.

