Physical Intelligence’s New AI Model Delivers Early Signs of Long-Sought Generalization in Robotics
Two-year-old San Francisco-based robotics startup Physical Intelligence has quietly grown into one of the most closely watched artificial intelligence firms in the Bay Area. On Thursday, the company published new research showing its latest model can guide robots to complete tasks it was never explicitly trained to perform, a capability that caught even the company’s own research team off guard.
The new model, dubbed π0.7, marks what the company calls an early but meaningful leap toward the decades-long goal of building a general-purpose robot brain: a system that can tackle an unfamiliar task, be coached through it in plain everyday language, and complete it successfully. If the findings hold up to independent scrutiny, they suggest robotic AI may be approaching an inflection point similar to the one that transformed large language models, where capabilities begin compounding faster than the underlying training data alone would predict.
The paper’s key claim centers on compositional generalization: the ability to combine skills learned in separate contexts to solve entirely new problems the model has never encountered. Until now, the standard approach to robot training has relied largely on rote memorization: collect data for one specific task, train a task-specific specialist model on that data, and repeat the entire process for every new task. Physical Intelligence says π0.7 breaks this long-standing pattern entirely.
“Once the model crosses that threshold from only doing exactly the work you collected training data for to actually remixing existing skills in entirely new ways, its capabilities grow faster than linearly with the amount of training data,” explained Sergey Levine, co-founder of Physical Intelligence and a UC Berkeley professor specializing in AI for robotics. “We’ve already seen this far more favorable scaling property play out in other AI domains, like natural language and computer vision.”
The paper’s most striking demonstration involves an air fryer the model had effectively never been trained on. When the research team audited their full training dataset, they found only two loosely relevant entries: one where a different robot simply pushed an air fryer door closed, and a second from an open-source dataset where another robot placed a plastic bottle inside an air fryer following a user instruction. Somehow, π0.7 synthesized those fragmented clues, plus broader web-based pre-training data, to build a working understanding of how the appliance operates.
“It’s very hard to track where the knowledge comes from, or predict when the model will succeed or fail,” said Lucy Shi, a Physical Intelligence researcher and Stanford computer science PhD student. Even with zero coaching, the model pulled off a passable attempt at cooking a sweet potato in the appliance. When given step-by-step verbal guidance — essentially a human walking the robot through the task the same way you’d explain a new job to a first-time employee — it completed the work successfully.
This on-the-fly coaching capability is critical because it suggests robots could be deployed to new environments and improved in real time, no extra data collection or full model retraining required.
That said, the research team is upfront about π0.7’s limitations and careful not to overstate its progress. In many cases, Shi notes, failures stem from the team rather than the model itself. Early air fryer tests posted only a 5% success rate, she explained. After the team spent roughly half an hour refining how they framed the task for the model, that success rate jumped to 95%. “Sometimes the failure mode isn’t the robot or the model — it’s us, for not being good at prompt engineering,” she said.
The model also cannot yet execute complex multi-step tasks autonomously from a single high-level command. “You can’t just tell it, ‘Hey, go make me some toast,’” Levine said. “But if you walk it through — ‘open the toaster compartment, push this button, do this next’ — it actually tends to work pretty well.”
The team also acknowledges that standardized, widely accepted benchmarks for robotic AI do not yet exist, which makes independent validation of its claims difficult. Instead, the company compared π0.7 against its own earlier task-specific specialist models, purpose-built systems each trained for a single job, and found the generalist matched their performance across a range of complex tasks, including brewing coffee, folding laundry, and assembling cardboard boxes.
If you take the team at its word, what may be most notable about the research is not any single demonstration but how often the results surprised the very people whose job is to know exactly what is in the training data, and therefore what the model should and shouldn’t be able to do.
“My experience has always been that when I deeply know what’s in the training data, I can pretty easily guess what the model will be capable of,” said Ashwin Balakrishna, a research scientist at Physical Intelligence. “I’m rarely surprised. But the last few months have been the first time I’ve been genuinely caught off guard. I just randomly bought a gear set and asked the robot, ‘Hey, can you rotate this gear?’ And it just worked.”
Levine compared the moment to the early days of large language models, when researchers first watched GPT-2 generate a coherent story about unicorns in the Andes. “We all thought, where on earth did it learn about unicorns in Peru? That’s such a weird combination,” he said. “And I think seeing that same kind of emergent unexpected capability in robotics is really special.”
Critics often point to a clear asymmetry between robotic AI and large language models: LLMs had the entire public internet to learn from, while robotics training data is far scarcer, and even the cleverest prompting cannot fully close that gap. But when asked where he expects the most skepticism, Levine points to an entirely different critique.
“The criticism that can always be leveled at any robotic generalization demo is that the tasks are kind of boring,” he said. “The robot is not doing a backflip.” He pushes back on that framing, arguing that the entire point of the work is the gap between a flashy, carefully choreographed robot stunt and a robotic system that actually generalizes to new tasks. Generalization will always look less dramatic than a rehearsed stunt, he notes, but it is far more useful in the real world.
The research paper uses cautious, hedging language throughout, describing π0.7 as showing “early signs” of generalization and offering “initial demonstrations” of new capabilities. These are preliminary research results, not a finished product ready for commercial deployment.
When asked directly when a system built on these findings might be ready for real-world deployment, Levine declined to speculate. “I think there’s good reason to be optimistic, and certainly it’s progressing faster than I expected it would a couple of years ago,” he said. “But it’s very hard for me to answer that question.”
To date, Physical Intelligence has raised more than $1 billion in funding and was most recently valued at $5.6 billion. Much of the investor enthusiasm around the company traces to co-founder Lachy Groom, who spent years as one of Silicon Valley’s most well-regarded angel investors, backing breakout startups including Figma, Notion, and Ramp, before launching Physical Intelligence, which he says is the company he had long been looking to build. That track record has helped the startup attract major institutional investment even as it has refused to share a commercialization timeline with backers.
Industry reports indicate the company is currently in discussions for a new funding round that would nearly double its valuation to $11 billion. The Physical Intelligence team declined to comment on the ongoing fundraising talks.