Published on April 16, 2026

Physical Intelligence Unveils π0.7: The Rise of Compositional Generalization in Robotics

Written by P.A.
  • Physical Intelligence (Pi) announced π0.7, a general-purpose foundation model that exhibits compositional generalization, allowing it to recombine known skills to solve novel tasks.
  • The model demonstrated the ability to control a UR5e bimanual system for laundry folding despite having zero training data for that specific task on that hardware.
  • Pi introduced a multimodal prompting framework that uses language coaching, metadata, and visual subgoals generated by a lightweight world model to steer robot behavior.
  • In benchmarks, the single π0.7 model matched or exceeded the performance of the RL-tuned π*0.6 specialist models in tasks like espresso making and box assembly.

Just weeks after reports surfaced of a potential $11 billion valuation, San Francisco-based startup Physical Intelligence (Pi) has released π0.7, a model that may represent the "GPT-3 moment" for robotic dexterity. Released today, April 16, 2026, the new architecture moves beyond simple imitation to demonstrate compositional generalization—the ability to "mix and match" learned concepts to solve problems it has never seen before.

[Image: a high-angle view of two robotic arms with orange-tipped grippers at a wooden counter, one slicing a green zucchini with a knife while the other holds it steady, next to a white bowl of slices]
Zero-shot dexterity: π0.7 demonstrates emergent compositional generalization by performing complex, multi-step tasks like food preparation without specific training data for the task.

The "LLM Moment" for Physical Actions

The central breakthrough of π0.7 is its ability to treat robotic skills like words in a sentence. Much like a Large Language Model (LLM) can combine the concept of "JSON formatting" with "French translation" without being explicitly trained on that specific combination, π0.7 can now combine motor skills to use new tools.

"Vision-language-action models have not yet been shown to combine skills in new ways, like using a new tool or kitchen appliance," the company noted in its technical release. To prove this hurdle has been cleared, Pi demonstrated the model using an air fryer to cook a sweet potato—a task for which it had nearly zero direct training data. Instead, the model relied on a few disparate episodes of closing drawers and data from the open-source DROID dataset to "reason" its way through the new appliance's interface.

Zero-Shot Cross-Embodiment Transfer

Perhaps the most startling result for industry observers is π0.7’s performance on hardware for which it was never trained. Pi successfully tasked the model with controlling a UR5e bimanual industrial system to fold laundry.

While the company has previously shown advanced laundry folding with π0.6, that data was collected on much smaller, more precise robotic arms. The UR5e arms are heavier, have more inertia, and use different grippers. Despite this, π0.7 achieved a success rate on the UR5e that matched expert human teleoperators attempting the task for the first time on the same hardware.

"Any robot hardware maker will be able to buy physical intelligence, collect some data on their embodiment, and see our many capabilities transfer," noted Pi researcher Kyle Vedder. This reinforces Pi’s strategy of becoming the universal "intelligence layer" for any robot chassis.

Steerable Intelligence via Multimodal Prompts

Pi attributes this step-change in generalization to a new way of "talking" to the robot. Rather than relying on simple text commands, π0.7 uses a multimodal prompting framework (sketched in code below) that includes:

  • Language Coaching: Step-by-step verbal instructions that guide the robot through "false starts" in real-time.
  • Visual Subgoals: Images generated by a lightweight world model that show the robot what the next stage of a task (like an open air fryer basket) should look like.
  • Strategy Metadata: Tags that tell the model whether to prioritize speed, quality, or a specific control modality.

By annotating diverse data—including "suboptimal" autonomous failures—with metadata, Pi has found a way to ingest vast amounts of data without "poisoning" the model with bad habits.
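
To make the framework concrete, here is a minimal, hypothetical sketch of how such a prompt bundle might look as a data structure. The class and field names (MultimodalPrompt, StrategyMetadata, data_quality, and so on) are illustrative assumptions, not Pi's published API; they simply show how language coaching, a visual subgoal, and strategy or quality tags could travel together to the policy.

```python
# Hypothetical sketch only: these classes are NOT Pi's API. They illustrate
# how language coaching, a visual subgoal, and metadata tags might be
# bundled into one conditioning payload for a pi-0.7-style policy.
from dataclasses import dataclass, field
from typing import Optional

import numpy as np


@dataclass
class StrategyMetadata:
    """Assumed tag schema steering the policy's trade-offs."""
    priority: str = "quality"            # e.g. "speed" or "quality"
    control_modality: str = "bimanual"   # which control mode to prefer
    data_quality: str = "expert"         # training-time tag; lets suboptimal
                                         # episodes be ingested without being
                                         # imitated unconditionally


@dataclass
class MultimodalPrompt:
    """One conditioning bundle sent to the policy per episode."""
    task: str                                          # top-level instruction
    coaching: list[str] = field(default_factory=list)  # step-by-step language
    visual_subgoal: Optional[np.ndarray] = None        # HxWx3 subgoal image
    metadata: StrategyMetadata = field(default_factory=StrategyMetadata)


# Example: steering the air-fryer task with a subgoal image of the open
# basket (a placeholder array here, standing in for a world-model render).
prompt = MultimodalPrompt(
    task="cook the sweet potato in the air fryer",
    coaching=[
        "pull the basket handle straight out",
        "place the sweet potato in the center of the basket",
        "slide the basket back in until it clicks",
    ],
    visual_subgoal=np.zeros((224, 224, 3), dtype=np.uint8),
)
print(f"{prompt.task!r} with {len(prompt.coaching)} coaching steps")
```

One plausible reading of the metadata trick: because the model learns the association between a quality tag and the behavior it saw, "suboptimal" episodes can still teach dynamics, while deployment always conditions on the expert tag.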

Consolidating the Specialist Models

Historically, the highest levels of robotic performance required "specialist" models tuned for a single task. In late 2025, Pi’s π*0.6 model used reinforcement learning to master espresso making over 13-hour shifts.

With π0.7, Pi claims it has distilled the performance of those specialists into a single, general-purpose model. Benchmarks show π0.7 matching or exceeding the throughput of the RL-trained specialists in espresso making, box assembly, and laundry folding. This suggests the industry is moving away from bespoke fine-tuning for every new household or industrial chore.

As competitors like Generalist AI continue to push for scaling from scratch, Pi’s success with π0.7 signals that the combination of diverse data and "steerable" multimodal prompts may be the fastest path to a truly general-purpose robot.


The "Cloud-Brain" Strategy and the Cambrian Explosion

In a concurrent discussion with Y Combinator, Pi co-founder Quan Vuong detailed a strategic shift in how these models are deployed. To combat the high "Bill of Materials" (BOM) costs that plague the industry, Pi is hosting its models in the cloud rather than on-device.

Pipelining for Real-Time Control

To solve the latency issues inherent in cloud-based robotics, Pi uses a method called real-time action chunking (see the sketch after this list):

  • The robot queries an API endpoint for a "chunk" of sequential actions (e.g., 100 milliseconds of movement).
  • While executing the current chunk, the robot pre-computes and fetches the next sequence.
  • Algorithmic smoothing ensures the transition between chunks remains consistent, allowing "dumb" hardware to be powered by massive, data-center-scale intelligence.
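
Here is a minimal sketch of that pipelining pattern. The endpoint name query_policy, the chunk length, and the linear blend at chunk seams are all assumptions for illustration; Pi has not published these details.

```python
# Illustrative sketch of real-time action chunking. `query_policy`, the
# chunk length, and the blending rule are assumptions, not Pi's published
# design: the point is overlapping execution with the next network call.
import threading

import numpy as np

CHUNK_LEN = 10  # actions per chunk (assumed; ~100 ms at a 100 Hz control rate)
DOF = 7         # degrees of freedom (assumed)


def query_policy(obs: np.ndarray) -> np.ndarray:
    """Stand-in for the cloud API call; returns a (CHUNK_LEN, DOF) chunk."""
    return np.random.uniform(-1.0, 1.0, size=(CHUNK_LEN, DOF))


def blend(last_action: np.ndarray, new_chunk: np.ndarray,
          alpha: float = 0.5) -> np.ndarray:
    """Smooth the seam by easing the first new action toward the last old one."""
    new_chunk = new_chunk.copy()
    new_chunk[0] = alpha * last_action + (1.0 - alpha) * new_chunk[0]
    return new_chunk


def control_loop(get_obs, send_action, n_chunks: int = 5) -> None:
    chunk = query_policy(get_obs())
    for _ in range(n_chunks):
        # Prefetch the next chunk in a background thread while the robot
        # executes the current one, hiding the network round-trip.
        result: dict = {}
        fetcher = threading.Thread(
            target=lambda: result.update(next=query_policy(get_obs())))
        fetcher.start()
        for action in chunk:
            send_action(action)  # e.g. streamed to the controller at 100 Hz
        fetcher.join()
        chunk = blend(chunk[-1], result["next"])


# Dummy wiring so the sketch runs standalone:
control_loop(get_obs=lambda: np.zeros(8), send_action=lambda a: None)
```

Note that the prefetch observation is captured while the previous chunk is still executing, so it is slightly stale by the time the new chunk starts; that staleness is precisely why some seam-smoothing step is needed.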

A Playbook for Vertical Robotics

Vuong suggested that the industry is entering a "Cambrian explosion" where the barrier to entry for robotics startups has collapsed. By decoupling the "brain" (software) from the "body" (hardware), founders can now focus on specific industrial workflows—like the e-commerce packaging tasks handled by Ultra—without needing to build a proprietary autonomy stack from scratch.

"The upfront cost is not that high anymore," Vuong noted. "It requires someone that is really scrappy... who can do the system integration and understand what customers want."

What’s Next: Toward Autonomous Research

As Pi continues to scale, the team is exploring the creation of an automated robotic research scientist. This agent would ingest multimodal evaluation data, identify why a robot failed (e.g., "was it the data or the gripper?"), and suggest hypotheses to improve the model.

While the industry remains wary of Moravec’s Paradox, the emergence of compositional skills in π0.7 suggests that the "dark matter" of robotic intuition is finally being codified into a steerable, scalable foundation.
