Multimodal Search

What is Multimodal Search?

Multimodal search is a type of search that combines different modalities (often text, image, audio, and video) to search a diverse dataset and retrieve results in a way that feels more natural and intuitive to how people think. You'll often see multimodal search implemented as semantic text search combined with one other modality, such as images.

Most of us can understand multimodal search in the context of product and ecommerce search. Each product in a catalog may have a title, a description, and an image. As humans, we naturally search in ways that cut across the information contained in those three fields: “heavy coat with big buttons”. “Heavy coat” may match some text in the title or description of the item, but it’s unlikely that “with big buttons” will reliably appear in the title or description of every coat in the catalog that has big buttons. Image search systems (like Objective Image Indexes) can ‘see’ inside images and recognize “big buttons” from a mile away.
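To make the coat example concrete, here is a minimal sketch of one way a text score and an image score could be blended into a single ranking. Everything in it is an assumption for illustration: the catalog, the three-dimensional "image vectors" (hand-made stand-ins for what an image embedding model would produce), the word-overlap text score, and the `text_weight` parameter. It is not how Objective Image Indexes or any particular product works.

```python
import math

# Toy catalog. "image_vec" is a hand-made stand-in for an image embedding.
# Hypothetical dimensions: [big buttons, hood, zipper].
CATALOG = [
    {"title": "Heavy wool coat", "description": "Warm winter coat",
     "image_vec": [0.9, 0.1, 0.0]},
    {"title": "Heavy parka coat", "description": "Insulated coat with hood",
     "image_vec": [0.0, 0.9, 0.8]},
    {"title": "Light rain jacket", "description": "Packable shell",
     "image_vec": [0.1, 0.8, 0.9]},
]


def text_score(query, product):
    """Fraction of query words that appear in the title or description."""
    words = query.lower().split()
    text = (product["title"] + " " + product["description"]).lower().split()
    hits = sum(1 for word in words if word in text)
    return hits / len(words) if words else 0.0


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def multimodal_search(query, query_image_vec, text_weight=0.5):
    """Rank products by a weighted blend of text and image similarity."""
    scored = []
    for product in CATALOG:
        score = (text_weight * text_score(query, product)
                 + (1 - text_weight) * cosine(query_image_vec, product["image_vec"]))
        scored.append((score, product["title"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored]


# The query vector stands in for an embedding of "big buttons".
query = "heavy coat with big buttons"
print(multimodal_search(query, [1.0, 0.0, 0.0]))                   # blended
print(multimodal_search(query, [1.0, 0.0, 0.0], text_weight=1.0))  # text only
```

In this toy data, text alone ranks the parka first because its description happens to contain “with”, while the blended score puts the wool coat first because its image vector matches “big buttons”. A real system would likely use learned embeddings for both text and images rather than word overlap and hand-made vectors.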

When multimodal search works well, it lets you worry less about whether your users search with exactly the right keywords, and focus more on building search UI that encourages users to include details like “big buttons”. For decades, old, broken search systems have trained people to believe that details like these won’t be understood.
