Martin Keen, a Master Inventor at IBM, breaks down the intricacies of multimodal AI in a recent presentation. Keen, a recognized expert in artificial intelligence and its applications, clarifies how AI models are evolving to process and understand a wider array of data types beyond traditional text. This shift represents a significant advancement in AI capabilities, moving towards systems that can interpret the world more holistically.
Understanding Multimodal AI
Keen begins by defining multimodal AI as systems that utilize multiple data modalities. While AI models have long been adept at processing text, the current frontier involves integrating other forms of data such as images, audio, video, and even sensor readings. This expansion allows AI to gain a richer, more nuanced understanding of complex information.
A standard large language model (LLM) typically takes text as input and generates text as output. Incorporating other data types requires new architectures. Keen illustrates this by showing how an LLM can be augmented to process images through an approach called feature-level fusion. In this approach, a separate model for each modality, such as a vision encoder for images, processes its input independently and extracts features represented as numerical vectors. These feature vectors are then fed into the LLM alongside the text, allowing it to reason over information from different sources simultaneously.
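The mechanics of feature-level fusion can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration of the general idea, not Keen's or IBM's actual architecture: the dimensions, the toy vision encoder, and the `FeatureLevelFusion` class are all hypothetical stand-ins chosen for readability.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, chosen for illustration only.
IMAGE_FEAT_DIM = 512   # output size of the vision encoder
LLM_EMBED_DIM = 768    # embedding size the language model expects
VOCAB_SIZE = 32000

class ToyVisionEncoder(nn.Module):
    """Stand-in for a pretrained vision encoder (e.g. a ViT or CNN)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(16, IMAGE_FEAT_DIM)

    def forward(self, images):                 # images: (batch, 3, H, W)
        x = self.pool(torch.relu(self.conv(images))).flatten(1)
        return self.fc(x)                       # (batch, IMAGE_FEAT_DIM)

class FeatureLevelFusion(nn.Module):
    """Projects image features into the LLM's embedding space and
    prepends them to the text token embeddings."""
    def __init__(self):
        super().__init__()
        self.vision_encoder = ToyVisionEncoder()
        self.image_proj = nn.Linear(IMAGE_FEAT_DIM, LLM_EMBED_DIM)
        self.token_embed = nn.Embedding(VOCAB_SIZE, LLM_EMBED_DIM)

    def forward(self, images, token_ids):
        # Encode the image and project its feature vector to the LLM dimension.
        img_feats = self.image_proj(self.vision_encoder(images))   # (batch, LLM_EMBED_DIM)
        img_feats = img_feats.unsqueeze(1)                          # one "image token" per example
        txt_embeds = self.token_embed(token_ids)                    # (batch, seq_len, LLM_EMBED_DIM)
        # The fused sequence is what the LLM's transformer layers would consume.
        return torch.cat([img_feats, txt_embeds], dim=1)

fused = FeatureLevelFusion()(torch.randn(2, 3, 64, 64),
                             torch.randint(0, VOCAB_SIZE, (2, 10)))
print(fused.shape)  # torch.Size([2, 11, 768]) -- 1 image token + 10 text tokens
```

The key design point the sketch captures is the projection layer: the vision encoder and the language model were trained separately and produce vectors of different sizes, so a learned linear mapping aligns the image features with the LLM's embedding space before the two streams are combined.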
The full discussion can be found on IBM's YouTube channel.
