Engineers have created a system that helps robots quickly map their surroundings and identify the objects they need to complete a given task.
Picture trying to tidy up a disorderly kitchen, starting with a counter covered in different sauce packets. If your aim is to clean the counter efficiently, you might gather all the packets together. On the other hand, if you specifically want to pick out the mustard packets first before discarding the rest, you’d sort through them more carefully, focusing on the type of sauce. And if you were looking for a specific brand of mustard, like Grey Poupon, you’d need to search meticulously to find that particular one.
Engineers at MIT have developed a technique that allows robots to make intuitive, task-oriented decisions similar to this.
The new system, called Clio, enables a robot to identify the parts of a scene that matter, given the tasks it has to carry out. With Clio, a robot takes in a list of tasks described in everyday language and, from those tasks, determines the level of detail required to interpret its surroundings, retaining only the relevant parts of the scene in its memory.
In practical experiments conducted in various locations—from a cramped cubicle to a five-story building on MIT’s campus—the team utilized Clio to automatically segment scenes at varying levels of detail, guided by a set of tasks communicated through natural-language prompts like “move rack of magazines” and “retrieve first aid kit.”
The researchers also ran Clio in real-time on a four-legged robot. As the robot explored an office building, Clio recognized and mapped only those parts of the scene that related to the robot’s tasks (such as retrieving a dog toy while ignoring piles of office supplies), allowing the robot to home in on the objects of interest.
Named after the Greek muse of history, Clio is capable of identifying and retaining only the components that matter for specific tasks. The researchers believe that Clio could be beneficial in various scenarios where robots must quickly assess and interpret their environment in relation to their assigned tasks.
“Our primary goal is its application in search and rescue missions, but Clio could also enhance household robots and those operating on factory floors in conjunction with human workers,” states Luca Carlone, an associate professor in MIT’s Department of Aeronautics and Astronautics (AeroAstro), principal investigator at the Laboratory for Information and Decision Systems (LIDS), and director of the MIT SPARK Laboratory. “The emphasis is on helping the robot understand its environment and identify what it needs to remember to fulfill its mission.”
The team presents their findings in a study published today in the journal IEEE Robotics and Automation Letters. Co-authors of the paper from the SPARK Lab include Dominic Maggio, Yun Chang, Nathan Hughes, and Lukas Schmid, along with contributors from MIT Lincoln Laboratory: Matthew Trang, Dan Griffith, Carlyn Dougherty, and Eric Cristofalo.
Open fields
Significant advances in computer vision and natural language processing have made it possible for robots to recognize objects in their surroundings. Until recently, however, robots operated mainly in “closed-set” environments: carefully controlled settings containing a limited number of objects the robot had been trained to recognize.
More recently, researchers have taken a more “open” approach, enabling robots to recognize objects in more realistic settings. In open-set recognition, researchers have used deep-learning tools to build neural networks that can process billions of images from the internet, along with each image’s associated text (such as a Facebook photo of a dog captioned “Meet my new puppy!”).
From millions of image-text pairs, a neural network learns to pick out the segments of a scene that are characteristic of particular terms, such as a dog. A robot can then apply that network to spot a dog in an entirely new scene.
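To make the idea concrete, here is a minimal sketch of that kind of open-vocabulary matching, using the publicly available CLIP model via Hugging Face’s transformers library. The article does not say which networks Clio builds on, so the model choice, the file name scene.jpg, and the label list are assumptions for illustration only:

```python
# Sketch of open-set recognition: score one image against free-form text
# labels that the system was never explicitly programmed to detect.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")  # hypothetical photo of a scene
labels = ["a dog", "a dog toy", "a first aid kit", "office supplies"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher scores mean the image better matches that text description.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for label, p in zip(labels, probs):
    print(f"{label}: {p:.2f}")
```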
However, a significant challenge remains in efficiently parsing a scene in a manner relevant to specific tasks.
“Standard methods typically select an arbitrary, fixed level of detail to determine how to consolidate scene segments into what might be considered a singular ‘object,’” explains Maggio. “Yet, the definition of what constitutes an ‘object’ is directly linked to the robot’s objectives. If this level of detail is fixed without considering the tasks at hand, the robot may end up creating a map that isn’t particularly useful for its mission.”
Information bottleneck
With Clio, the MIT team aimed to enable robots to interpret their environments with a level of detail that can adjust according to the tasks at hand.
For example, if the task is to move a stack of books to a shelf, the robot should recognize that the entire stack is the relevant object for the task. Conversely, if the objective is to move just the green book from that stack, the robot should identify the green book as a distinct target, ignoring the rest of the scene, including the other books.
The team’s method combines state-of-the-art computer vision with large language models, comprising neural networks that connect millions of open-source images with their associated text. The researchers also employ mapping tools that automatically split an image into many small segments, which can be fed into the neural network to determine whether certain segments are semantically similar. They then apply an idea from classical information theory, known as the “information bottleneck,” to compress the set of image segments in a way that picks out and retains the segments most relevant to a given task.
“For instance, if there’s a stack of books in a scene and my task is solely to retrieve the green book, we process all the scene information through this bottleneck, which results in a collection of segments that represent the green book,” explains Maggio. “All unrelated segments can simply be grouped together and disregarded, leaving us with an object at the appropriate level of detail for the task at hand.”
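The article does not give the paper’s exact formulation, so the following is only a rough sketch of how such a task-driven bottleneck could work: primitive segments are greedily merged as long as the merge loses little information about which task each segment supports, in the spirit of the classical agglomerative information bottleneck. The function names (cluster_segments, merge_cost), the stop_cost threshold, and the softmax relevance model are all invented for illustration:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def merge_cost(p1, p2, w1, w2):
    """Weighted Jensen-Shannon divergence: the task-relevant information
    lost by merging two clusters (the agglomerative-IB merge criterion)."""
    pi1, pi2 = w1 / (w1 + w2), w2 / (w1 + w2)
    m = pi1 * p1 + pi2 * p2
    return (w1 + w2) * (pi1 * kl(p1, m) + pi2 * kl(p2, m))

def cluster_segments(seg_embs, task_embs, stop_cost=0.05):
    """Greedily merge scene segments while little task-relevant information
    is lost -- a toy, task-driven information bottleneck.

    seg_embs:  (n_segments, d) unit-norm embeddings of primitive segments
    task_embs: (n_tasks, d)    unit-norm embeddings of the task prompts
    Returns a list of clusters, each a list of segment indices.
    """
    n = len(seg_embs)
    sims = seg_embs @ task_embs.T                # cosine similarities
    p_y = np.exp(sims) / np.exp(sims).sum(axis=1, keepdims=True)  # p(task|seg)
    p_c = np.full(n, 1.0 / n)                    # uniform prior over segments
    clusters = [[i] for i in range(n)]

    while len(clusters) > 1:
        # Find the cheapest merge over all cluster pairs.
        best, best_cost = None, np.inf
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                c = merge_cost(p_y[i], p_y[j], p_c[i], p_c[j])
                if c < best_cost:
                    best, best_cost = (i, j), c
        if best_cost > stop_cost:    # merging further would blur the tasks
            break
        i, j = best
        w = p_c[i] + p_c[j]
        p_y[i] = (p_c[i] * p_y[i] + p_c[j] * p_y[j]) / w
        p_c[i] = w
        clusters[i] += clusters[j]
        p_y = np.delete(p_y, j, axis=0)
        p_c = np.delete(p_c, j)
        clusters.pop(j)
    return clusters

# Tiny demo with random embeddings standing in for real segment features.
rng = np.random.default_rng(0)
segs = rng.normal(size=(5, 512)); segs /= np.linalg.norm(segs, axis=1, keepdims=True)
tasks = rng.normal(size=(2, 512)); tasks /= np.linalg.norm(tasks, axis=1, keepdims=True)
print(cluster_segments(segs, tasks))
```

In this toy version, segments whose task distributions look alike merge together cheaply, which mirrors Maggio’s description: the unrelated segments collapse into one disposable group, while segments matching a prompt such as the green book resist merging and survive as a distinct object.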
The researchers have demonstrated Clio across different real-world settings.
“We decided to perform an experiment in my messy apartment with no prior cleaning to see how Clio would function,” Maggio shares.
The team drew up a list of natural-language tasks, such as “move pile of clothes,” and applied Clio to images of the cluttered apartment. In these cases, Clio quickly segmented the scenes and fed the segments through the information bottleneck algorithm to identify the segments that made up the pile of clothes.
They also tested Clio on Boston Dynamics’ Spot robot. After assigning a set of tasks, as the robot explored and mapped the interior of an office building, Clio operated in real-time on an onboard computer mounted on Spot, identifying segments in the mapped scenes relevant to the designated tasks. This method produced an overlay map highlighting only the target objects, which the robot used to navigate and physically accomplish the tasks.
“Achieving real-time operation with Clio was a significant milestone for the team,” Maggio notes. “Much earlier work required several hours to yield results.”
Moving ahead, the team intends to adapt Clio to manage more complex tasks and build on advancements in photorealistic scene representations.
“At the moment, we are still assigning fairly specific tasks like ‘find a deck of cards,’” Maggio explains. “However, for search and rescue operations, we want to direct it with more abstract tasks, like ‘locate survivors’ or ‘restore power.’ Therefore, we aspire to achieve a more human-like understanding of how to tackle more intricate tasks.”
This research received partial support from the U.S. National Science Foundation, the Swiss National Science Foundation, MIT Lincoln Laboratory, the U.S. Office of Naval Research, and the U.S. Army Research Lab on Distributed and Collaborative Intelligent Systems and Technology Collaborative Research Alliance.