Researchers at Physical Intelligence, an AI robotics company, have developed a system called the Hierarchical Interactive Robot (Hi Robot). This system enables robots to process complex instructions and feedback using vision-language models (VLMs) in a hierarchical structure.
Vision-language models can control robots, but what if the prompt is too complex for the robot to follow directly? We developed a way to get robots to “think through” complex instructions, feedback, and interjections. We call it the Hierarchical Interactive Robot (Hi Robot). pic.twitter.com/KdL5myyybT
— Physical Intelligence (@physical_int) February 26, 2025
The system allows robots to break down intricate tasks into simpler steps, much as humans reason through complex problems by pairing fast, intuitive responses with slower deliberation, the ‘System 1’ and ‘System 2’ modes of thinking described by Daniel Kahneman.
In this setup, Hi Robot uses a high-level VLM to reason through complex prompts and a low-level vision-language-action (VLA) policy to execute the resulting atomic commands.
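This division of labor can be pictured as a simple control loop. The sketch below is a minimal illustration only, assuming hypothetical `HighLevelVLM` and `LowLevelPolicy` wrappers; Physical Intelligence has not published this interface.

```python
# Minimal sketch of a two-level "System 2 / System 1" control loop.
# All class and method names are assumptions for illustration.

from dataclasses import dataclass

@dataclass
class Observation:
    image: bytes          # latest camera frame from the robot
    user_utterance: str   # latest prompt or interjection ("" if none)

class HighLevelVLM:
    """Slow, deliberate 'System 2': turns a complex prompt into one atomic step."""
    def next_subtask(self, obs: Observation, task: str) -> str:
        raise NotImplementedError  # e.g. returns "pick up the mustard bottle"

class LowLevelPolicy:
    """Fast, reactive 'System 1': maps an atomic command to motor actions."""
    def act(self, obs: Observation, subtask: str) -> None:
        raise NotImplementedError  # e.g. emits a short chunk of joint commands

def run(task: str, high: HighLevelVLM, low: LowLevelPolicy, get_obs, steps: int = 1000):
    subtask = None
    for t in range(steps):
        obs = get_obs()
        # Re-plan on a slow cadence, or immediately when the user interjects;
        # the low level acts at every step.
        if subtask is None or t % 50 == 0 or obs.user_utterance:
            subtask = high.next_subtask(obs, task)
        low.act(obs, subtask)
```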
Testing and Training Using Synthetic Data
Researchers used synthetic data to train robots to follow complex instructions. Relying solely on real-life examples and atomic commands wasn’t enough to teach robots to handle multi-step tasks.
To address this, they created synthetic datasets by pairing robot observations with hypothetical scenarios and human feedback. This approach helps the model learn how to interpret and respond to complex commands.
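The recipe can be illustrated with a short sketch. Everything here is assumed for illustration, including the stub labeler, the field names, and the prompt wording; the paper’s actual annotation pipeline may differ.

```python
# Hedged sketch: wrap real (observation, atomic command) pairs in imagined
# dialogues so the model learns to connect feedback to the right step.

import random

HYPOTHETICAL_INTERJECTIONS = [
    "that's not trash",
    "skip the pickles",
    "leave the red plate where it is",
]

def stub_labeler(request: str) -> dict:
    # Stand-in for a real VLM annotator call, so the sketch runs end to end.
    return {"prompt": "clean up the table",
            "reply": "Got it, I'll leave that where it is."}

def make_synthetic_example(observation, atomic_command, labeler=stub_labeler):
    """Pair a real robot step with a hypothetical scenario and feedback."""
    interjection = random.choice(HYPOTHETICAL_INTERJECTIONS)
    context = labeler(
        f"Given this scene and the robot step '{atomic_command}', write a "
        f"user instruction plus the feedback '{interjection}' that this "
        f"step would be a correct response to."
    )
    return {
        "observation": observation,          # real camera frames
        "user_prompt": context["prompt"],    # synthetic instruction
        "user_feedback": interjection,       # synthetic interjection
        "robot_utterance": context["reply"], # synthetic verbal confirmation
        "atomic_command": atomic_command,    # real label for the low level
    }

print(make_synthetic_example(b"<frame>", "put the fork in the bin"))
```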
In the researchers’ evaluations, Hi Robot outperformed other methods, including GPT-4o and a flat vision-language-action (VLA) policy, following instructions more faithfully and adapting better to real-time corrections. It achieved 40% higher instruction-following accuracy than GPT-4o, demonstrating closer alignment with user prompts and real-time observations.

In real-world tests, Hi Robot performed tasks like clearing tables, making sandwiches, and grocery shopping. It effectively handled multi-stage instructions, adapted to real-time corrections, and respected constraints.
These results highlight the potential of synthetic data in robotics: it can efficiently cover diverse scenarios, reducing the need for extensive real-world data collection.
Hi Robot ‘Talks to Itself’
In one example, a robot trained to clean a table by disposing of trash and placing dishes in a bin can be directed to follow more intricate commands through Hi Robot.
The system lets the robot reason through modified commands given in natural language, effectively “talking to itself” as it works. It also interprets contextual comments from the user, so when someone says “that’s not trash”, the robot incorporates the feedback and adjusts its actions accordingly.
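One plausible way to structure this, sketched below under assumed names (`HighLevelOutput`, `respond`), is for each high-level step to emit both an utterance and an atomic command, with the high level re-queried the moment the user speaks. It reuses the `Observation` type from the loop above.

```python
# Hedged sketch of pairing speech with action at the high level.
# The `respond` method and field names are assumptions, not a published API.

from dataclasses import dataclass

@dataclass
class HighLevelOutput:
    utterance: str        # e.g. "Oh, sorry, I'll leave that on the table."
    atomic_command: str   # e.g. "place the fork back on the table"

def on_feedback(high, obs, task: str, current: str) -> str:
    """Re-query the high level as soon as the user interjects."""
    if not obs.user_utterance:        # no interjection: keep the current plan
        return current
    out = high.respond(obs, task)     # robot "talks to itself", then re-plans
    print(out.utterance)              # verbal acknowledgement to the user
    return out.atomic_command         # corrected step for the low-level policy
```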
The system has been tested on various robotic platforms, including single-arm, dual-arm, and mobile robots, performing tasks like cleaning tables and making sandwiches.
“Can we get our robots to ‘think’ the same way, with a little ‘voice’ that tells them what to do when presented with a complex task?” the researchers said in the company’s official blog. This advancement could lead to more intuitive and flexible robot capabilities in real-world applications.
Researchers plan to refine the system in the future by combining the high-level and low-level models, allowing for more adaptive processing of complex tasks.