Inspired by advanced language models, researchers have created a new training method that utilizes diverse datasets to help robots acquire various skills.
In the beloved cartoon “The Jetsons,” the robotic maid Rosie effortlessly transitions from vacuuming the floors to preparing dinner and taking out the trash. However, in reality, creating a versatile robot is still a significant hurdle.
Engineers usually gather data tailored to a specific robot and task, which they then use for training in a controlled setting. This process can be both costly and time-consuming, and robots often struggle to adapt to new tasks or environments they haven’t encountered before.
To improve the training of general-purpose robots, MIT researchers have developed a flexible approach that aggregates vast amounts of varied data from multiple sources, enabling any robot to learn a wide array of tasks.
This technique aligns data from different domains, such as simulations and real robots, and from different modalities, including vision sensors and robotic-arm position encoders, into a unified “language” that a generative AI model can process.
By pooling such a large amount of data, this method allows robots to learn different tasks without starting from scratch each time.
What sets this method apart is that it can be faster and less expensive than traditional approaches, because it requires significantly less task-specific data. It also achieves more than a 20 percent improvement in both simulation and real-world tests compared to training from scratch.
“In robotics, it’s often said that there’s not enough training data available. However, I believe a bigger challenge is that the data come from so many different domains, modalities, and robot hardware. Our research shows how to train a robot effectively by integrating all of this data,” explains Lirui Wang, a graduate student in electrical engineering and computer science (EECS) and lead author of the related paper.
Wang collaborated with fellow EECS graduate student Jialiang Zhao; Xinlei Chen, a research scientist at Meta; and senior author Kaiming He, an associate professor in EECS and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL). Their work will be presented at the Conference on Neural Information Processing Systems.
Inspired by Large Language Models
A robotic “policy” interprets sensor data, such as camera images or measurements that track a robotic arm’s speed and position, guiding the robot on how to move.
These policies are typically learned through imitation learning, where a human demonstrates actions or remotely operates the robot to generate data that feeds into an AI model. As this method uses a limited amount of specialized data, robots frequently fail when faced with new tasks or altered environments.
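To make the idea concrete, here is a minimal behavior-cloning sketch in PyTorch: a small policy network is fit to observation–action pairs recorded from demonstrations. The network sizes, dimensions, and names are illustrative assumptions, not code from the paper.

```python
# Minimal behavior-cloning sketch (illustrative only, not the authors' code).
# A "policy" maps sensor observations to motor actions; imitation learning
# fits it to (observation, action) pairs collected from demonstrations.
import torch
import torch.nn as nn

class SmallPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim),   # predicted arm command
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

def behavior_cloning_step(policy, optimizer, obs, expert_action):
    """One gradient step: push the policy's output toward the demonstrated action."""
    loss = nn.functional.mse_loss(policy(obs), expert_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random stand-in "demonstration" data.
policy = SmallPolicy(obs_dim=32, act_dim=7)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
obs = torch.randn(64, 32)    # e.g., image features plus joint angles
act = torch.randn(64, 7)     # e.g., 7-DoF arm commands from a teleoperator
print(behavior_cloning_step(policy, opt, obs, act))
```

A policy trained this way is only as broad as its demonstrations, which is why the narrow, task-specific datasets described above generalize poorly to new situations.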
To create a more effective solution, Wang and his team drew inspiration from large language models like GPT-4.
These models undergo pretraining with vast amounts of diverse language data and are subsequently fine-tuned using a small set of task-specific data, enabling them to adapt efficiently to various tasks.
“In the language domain, the data consist solely of sentences. In robotics, due to the diversity of data types, a different architecture is necessary for a similar pretraining approach,” he notes.
Robotic data comes in many forms, from camera images to verbal instructions and depth maps. Additionally, every robot has its unique mechanical structure, with different configurations of arms, grippers, and sensors, not to mention the wide variety of environments where data is collected.
The MIT researchers introduced a novel architecture known as Heterogeneous Pretrained Transformers (HPT), which brings together data from different modalities and domains.
At the core of their architecture, they placed a machine-learning model called a transformer, which processes both visual and proprioceptive inputs. This transformer is the same type that underlies large language models.
The researchers convert data from vision and proprioception into a unified input type, called a token, which can be processed by the transformer. Each input is represented by a consistent number of tokens.
Next, the transformer integrates all inputs into a shared space, becoming a vast, pretrained model as it continues to learn from more data. The larger the transformer grows, the more effectively it performs.
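The sketch below shows one way that design can look in PyTorch, under stated assumptions: each modality has its own small tokenizer (a “stem”) that emits a fixed number of tokens of a shared width, and a single transformer trunk processes the concatenated sequence. The dimensions, module names, and the cross-attention tokenizer are illustrative choices, not the released HPT implementation.

```python
# Simplified sketch of the shared-token idea (illustrative sizes and names).
# Each modality is mapped to a fixed number of tokens of the same width, then
# one shared transformer processes the concatenated sequence regardless of
# which robot or sensor the data came from.
import torch
import torch.nn as nn

D, N_TOKENS = 256, 16   # shared token width and tokens per modality (assumed)

class VisionStem(nn.Module):
    """Turns a batch of image patch features into N_TOKENS tokens of width D."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, D)
        self.queries = nn.Parameter(torch.randn(N_TOKENS, D))
        self.attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)

    def forward(self, feats):                      # feats: (B, patches, feat_dim)
        kv = self.proj(feats)
        q = self.queries.expand(feats.size(0), -1, -1)
        tokens, _ = self.attn(q, kv, kv)           # cross-attend: (B, N_TOKENS, D)
        return tokens

class ProprioStem(nn.Module):
    """Turns a joint-state vector into the same number of tokens."""
    def __init__(self, state_dim: int):
        super().__init__()
        self.proj = nn.Linear(state_dim, N_TOKENS * D)

    def forward(self, state):                      # state: (B, state_dim)
        return self.proj(state).view(-1, N_TOKENS, D)

trunk = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=6,
)

# One robot's data: 196 patch features of dim 512, plus a 14-D joint state.
vision, proprio = VisionStem(512), ProprioStem(14)
img_feats = torch.randn(2, 196, 512)
joint_state = torch.randn(2, 14)
tokens = torch.cat([vision(img_feats), proprio(joint_state)], dim=1)
shared = trunk(tokens)                             # (2, 32, 256) shared representation
print(shared.shape)
```

Because every modality and every robot arrives as the same number of same-width tokens, the trunk never needs to know which hardware or sensor produced them, which is what allows data from so many sources to be pooled.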
A user need only provide HPT with a small amount of information about their robot’s design, configuration, and the desired task. Then HPT applies the knowledge acquired during pretraining to learn the new task effectively.
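Continuing the sketch above, the adaptation step might look roughly like this: the new robot supplies only a small input stem and an action head, while the pretrained trunk is reused. Freezing the trunk and the specific sizes here are assumptions made for illustration, not the authors’ exact recipe.

```python
# Hedged sketch of adapting the pretrained trunk to a new robot and task.
# Reuses ProprioStem, trunk, and D from the sketch above; values are illustrative.
new_stem = ProprioStem(state_dim=9)   # hypothetical arm reporting 9 joint angles
action_head = nn.Linear(D, 6)         # hypothetical 6-DoF action for the new task

for p in trunk.parameters():          # keep the pretrained weights fixed (assumption)
    p.requires_grad = False

opt = torch.optim.Adam(
    list(new_stem.parameters()) + list(action_head.parameters()), lr=1e-4
)

state = torch.randn(8, 9)             # small batch of task-specific demonstrations
expert_action = torch.randn(8, 6)

opt.zero_grad()
pred = action_head(trunk(new_stem(state)).mean(dim=1))  # pool tokens, predict action
loss = nn.functional.mse_loss(pred, expert_action)
loss.backward()
opt.step()
print(loss.item())
```

Only the small, robot-specific modules are trained in this sketch, which is consistent with why far less task-specific data is needed than when training a policy from scratch.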
Facilitating Precise Movements
One of the major challenges in creating HPT was assembling the extensive dataset necessary for pretraining, which comprised 52 datasets with over 200,000 robot movements across four categories, including videos of human demonstrations and simulations.
The researchers also had to devise an efficient method to convert raw proprioceptive signals from various sensors into data that the transformer could work with.
“Proprioception plays a crucial role in enabling intricate movements. Since the number of tokens is consistent in our architecture, we give equal weight to proprioception and vision,” Wang states.
Testing revealed that HPT enhanced robot performance by more than 20 percent in both simulated and real-world tasks, compared to starting from scratch for each training session. Remarkably, even when the task was significantly different from the pretraining data, HPT still showed improved performance.
In the future, the researchers plan to explore how diversifying data further could enhance HPT’s performance. They also aim to refine HPT to process unlabeled data, similar to how GPT-4 and other large language models operate.
“Our ultimate goal is to develop a universal robot brain that can be downloaded and utilized for any robot without the need for training. Although we’re still in the early phases, we are determined to push forward and hope scaling will lead to breakthroughs in robotic policies, similar to the developments in large language models,” he concludes.