Don’t Forget the Salt: Physical Intelligence Equips Robots with 15-Minute "Multi-Scale" Memory


For robots, "remembering" has historically been a binary choice between a few seconds of high-fidelity video and a complete loss of context. This "goldfish" effect has limited even the most advanced Vision-Language-Action (VLA) models to short, atomic tasks.
On March 3, 2026, Physical Intelligence (Pi) announced a significant architectural shift to solve this: Multi-Scale Embodied Memory (MEM). By combining short-term visual tracking with a long-term "narrative" in natural language, Pi’s latest models can now maintain focus for up to fifteen minutes—long enough to clean an entire kitchen or prepare a meal from scratch.
The Hybrid Approach: Video for Detail, Text for Context
The core challenge of robot memory is tractability. Cramming minutes of high-frequency video into a model’s context window is computationally expensive and introduces "causal confusion," where a robot erroneously repeats past actions just because they are in the history.
Pi’s MEM architecture bifurcates memory into two distinct modalities:
- Short-Term Video Memory: Using an efficient video encoder based on Vision Transformers (ViTs), the model captures dense, image-based memory of the last few seconds. This allows the robot to handle "partial observability"—remembering where an object is even when its own arm occludes its view.
- Long-Term Text Memory: For the "big picture," the model summarizes semantic events in natural language. Instead of remembering every frame of a door opening, it simply stores a note: "I opened the fridge door."
This textual memory is updated via a "chain-of-thought" process. As the robot completes a subtask, it predicts an updated summary of its progress, which informs the next high-level decision.
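To make the two tiers concrete, here is a minimal sketch of what such a hybrid memory could look like. This is an illustration of the idea only, not Pi's actual implementation; the class name, buffer size, and string stand-ins for ViT embeddings are all assumptions.

```python
from collections import deque

class MultiScaleMemory:
    """Toy two-tier memory: a short rolling buffer of encoded frames
    plus a long-term list of natural-language event notes.
    (Hypothetical sketch; not Pi's MEM implementation.)"""

    def __init__(self, frame_horizon=30):
        # Short-term tier: dense visual memory covering the last few
        # seconds; oldest frames are evicted automatically.
        self.frames = deque(maxlen=frame_horizon)
        # Long-term tier: one compact text note per semantic event,
        # covering minutes of task progress at negligible cost.
        self.narrative = []

    def observe(self, frame_embedding):
        self.frames.append(frame_embedding)

    def note(self, summary):
        self.narrative.append(summary)

    def context(self):
        # What a policy would condition on at each decision step.
        return list(self.frames), " ".join(self.narrative)

mem = MultiScaleMemory()
for t in range(100):
    mem.observe(f"frame_{t}")        # stand-in for a ViT frame embedding
mem.note("I opened the fridge door")
frames, story = mem.context()
print(len(frames), story)            # only the most recent 30 frames survive
```

The point of the split is visible in the last lines: the visual buffer stays bounded regardless of task length, while the text narrative grows by only a few tokens per completed subtask.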
In-Context Adaptation: Learning from Mistakes
The most immediate benefit of MEM isn't just longer tasks, but more resilient ones. Previously, if Pi’s foundation model failed to grasp an object, it might try the exact same failed strategy repeatedly.
With MEM, the robot exhibits in-context adaptation. In one demonstration, a robot attempting to pick up a chopstick from an unusually low table failed its first grasp. Because it "remembered" the failure in its short-term video buffer, it adjusted its approach on the fly—changing its grasp height and succeeding on the second attempt.
Similarly, when faced with a refrigerator with no clear visual cues on which side the hinge was located, the MEM-equipped robot tried one direction, realized it was stuck, and immediately switched to pulling from the other side.
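The fridge-hinge behavior amounts to a simple loop: record which strategies have already failed, and never resample them. A toy version, with hypothetical names and a fake `pull` action standing in for the robot's real controller:

```python
def attempt_with_memory(try_action, options, memory):
    """Toy in-context adaptation: skip strategies that memory says
    already failed, and record new failures as they happen.
    (Illustrative only; not Pi's actual control logic.)"""
    for option in options:
        if option in memory["failed"]:
            continue                      # don't repeat a remembered failure
        if try_action(option):
            return option                 # success: commit to this strategy
        memory["failed"].append(option)   # remember that this one got stuck
    return None

# Example: a fridge whose handle side is not visually obvious.
handle_side = "right"

def pull(side):
    return side == handle_side            # pulling the hinge side gets stuck

memory = {"failed": []}
result = attempt_with_memory(pull, ["left", "right"], memory)
print(result, memory["failed"])           # right ['left']
```

Without the `memory["failed"]` record, a reactive policy has no way to distinguish "I have not tried the left side" from "I tried the left side and it was stuck," which is exactly the repeated-failure loop described above.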
Mastering the "Robot Olympics" Long Tail
This memory system appears to be the engine behind Pi's recent success in the Robot Olympics benchmarks. While earlier iterations excelled at motor skills, they struggled with the "logic" of chores like making a grilled cheese sandwich, which requires precise timing and tracking of multiple stages.
In new tests, Pi demonstrated its MEM-equipped model performing:
- Kitchen Cleanup: A fifteen-minute task involving wiping counters, drying them with paper towels, washing dishes with running water, and stowing food in the fridge.
- Recipe Setup: Correcting for "partial observability" by retrieving items from drawers and cabinets the robot can no longer see.
- Logistics: Unpacking groceries and "counting" items to ensure no object is left in the bag.
Contextualizing the Breakthrough
This update follows Pi’s massive $600 million funding round in late 2025, which was predicated on the idea that a "universal brain" is more important than specialized hardware. While competitors like Google DeepMind have looked for "one big breakthrough," Pi’s Recap method—which combines imitation learning with autonomous reinforcement learning—seems to be the primary beneficiary of this new memory layer.
By allowing robots to "practice" and remember their mistakes, Pi is closing the gap between lab-bound prototypes and "factory-ready" tools. However, the company notes that scaling memory beyond the horizon of a single episode—into weeks or months—remains the next major frontier for the industry.