New computational approach enhances multi-modal manipulation

5/21/2026 Jeni Bushman

Researchers from The Grainger College of Engineering have presented a new method for combining multiple sensory modalities in robotic learning.

Like humans, robots learn to manipulate tools using sensory input such as vision and touch. This process is called multimodal manipulation, and it’s a lot like baking a cake—combining various ingredients (or sensory inputs) to produce something new. While the recipes for baking cakes and teaching robots might look a little different, they have something in common: their end result can change depending on how and when each component is mixed.

Photo: A robot picking up a puzzle piece.

Seeking to optimize this process, researchers from the lab of electrical and computer engineering professor Katie Driggs-Campbell proposed a new method for combining multiple sensory inputs in robotic manipulation tasks. Presented at the 2026 IEEE International Conference on Robotics and Automation, the Illinois Grainger engineers’ work introduces a framework for sparsity-resilient, incremental and adaptive multimodal manipulation with seamless sensor integration that outperforms existing methods.

Currently, most robots are trained on manipulation tasks using a data-fusion technique called feature concatenation. In this aggregated approach, data from every sensor is fed into an encoder and merged from the start, like a baker dumping every ingredient into the same bowl at the same time. While this all-in method works (for cakes and for robots), it doesn’t always produce the best results.

ECE graduate student Haonan Chen, now a postdoc at Harvard, wanted to know why. He began by identifying two problems with the existing technique: it struggles to incorporate sparse signals and must be retrained any time a sensor is added or altered. In multimodal manipulation, each sensory signal is issued a weight, or significance. In feature concatenation, these weights reflect the proportion of time each signal is used throughout a given task.

“If we want a robot to grasp a tool, it will spend maybe 95% of the time using vision signals to locate that tool,” Chen said. “The other 5% is relying on touch to exercise caution and avoid breaking the tool. (Conventional approaches) will most likely overweight the vision signal and treat the touch signal as noise, even though touch is very important to the overall task.”

In baking terms, the volume of an ingredient is not a good indicator of its importance. Most cinnamon roll recipes require several cups of flour and only a teaspoon or two of cinnamon: a feature concatenation approach would interpret the comparatively small amount of cinnamon as noise, omitting it from the recipe entirely. It’s hard to imagine a cinnamon roll with no cinnamon!

Similarly, neural networks must take all their so-called ingredients into account, even the sparsest signals. Bakers contend with this by processing their recipes in steps: first creaming the butter and sugar, then adding eggs and vanilla, and finally stirring in the dry ingredients. This stepwise process extracts the highest potential from each ingredient before moving on, resulting in higher-quality baked goods. In the same way, instead of aggregating all robot sensor data at the beginning of the policy, the Illinois Grainger engineers’ approach trains sub-policies (or “experts”) on each specific signal.

The resulting formula is a product of multiple distributions that can preserve even the sparsest signals while allowing the network to easily adapt. In practice, the group’s model showed strong robustness to real-world human perturbation and sensor corruption.
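To make the "product of multiple distributions" idea concrete, here is a minimal Python sketch. It is not the researchers' implementation, and the article does not give their formula. The sketch assumes each expert is a one-dimensional Gaussian over a single action value (say, grip force), and every name and number in it is hypothetical. Multiplying Gaussian densities gives another Gaussian whose mean is weighted by each expert's confidence, so a rarely used but confident signal such as touch is not drowned out by a more frequently used one such as vision.

```python
from dataclasses import dataclass


@dataclass
class GaussianExpert:
    """One modality's opinion about a 1-D action: a mean and a variance."""
    name: str
    mean: float
    var: float


def product_of_experts(experts, weights=None):
    """Fuse Gaussian experts by multiplying their densities.

    The product of N(mean_i, var_i) ** w_i is again Gaussian, with
    precision = sum(w_i / var_i) and a precision-weighted mean.
    """
    if weights is None:
        weights = [1.0] * len(experts)
    precision = sum(w / e.var for w, e in zip(weights, experts))
    mean = sum(w * e.mean / e.var for w, e in zip(weights, experts)) / precision
    return mean, 1.0 / precision


# Hypothetical numbers: vision is unsure about grip force, touch is confident.
vision = GaussianExpert("vision", mean=1.0, var=4.0)
touch = GaussianExpert("touch", mean=0.2, var=0.04)

fused_mean, fused_var = product_of_experts([vision, touch])

# Adding a sensor later only appends an expert; the existing experts are unchanged.
audio = GaussianExpert("audio", mean=0.3, var=1.0)
fused_mean_3, fused_var_3 = product_of_experts([vision, touch, audio])

print(f"vision + touch:         mean={fused_mean:.3f}, var={fused_var:.4f}")
print(f"vision + touch + audio: mean={fused_mean_3:.3f}, var={fused_var_3:.4f}")
```

In this toy setup the fused estimate lands close to the touch expert's value because touch is the more certain of the two. Appending a third expert changes only the list passed to the fusion function, which loosely mirrors the article's point that the network can adapt when a sensor is added.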

“The robot is truly understanding how to utilize the type of signal,” Chen said. “If the marker is missing from its hand, it knows to go back and re-grasp.”

Chen and his colleagues view their recent work as a case study on incremental learning.

“We are rethinking how to use multimodal signals,” he said. “This method can save a lot of computing power and utilize data more efficiently, potentially giving the policy better performance.”

The study, “Multi-Modal Manipulation via Multi-Modal Policy Consensus,” is available online. DOI: 10.48550/arXiv.2509.23468

Illinois Grainger Engineering Affiliations
Katie Driggs-Campbell is an Illinois Grainger Engineering assistant professor in the Department of Electrical and Computer Engineering. Driggs-Campbell is affiliated with the Coordinated Science Laboratory.

This story was published May 21, 2026.