IBM Master Inventor Explains Multimodal AI

IBM Master Inventor Martin Keen explains the evolution of multimodal AI, contrasting feature-level fusion with native multimodality and showing why temporal reasoning matters for video.

Martin Keen, a Master Inventor at IBM, breaks down the intricacies of multimodal AI in a recent presentation. Keen, a recognized expert in artificial intelligence and its applications, clarifies how AI models are evolving to process and understand a wider array of data types beyond traditional text. This shift represents a significant advancement in AI capabilities, moving towards systems that can interpret the world more holistically.

Understanding Multimodal AI

Keen begins by defining multimodal AI as systems that work across multiple data modalities. While AI models have long been adept at processing text, the current frontier involves integrating other forms of data such as images, audio, video, and even sensor readings. This expansion allows AI to build a richer, more nuanced understanding of complex information.

A standard large language model (LLM) takes text as input and generates text as output. Incorporating other data types therefore requires new architectures. Keen illustrates this by showing how an LLM can be augmented to process images through a technique called feature-level fusion. In this approach, a separate model for each modality, such as a vision encoder for images, processes its input independently and extracts features represented as numerical vectors. These feature vectors are then fed into the LLM, allowing it to process information from different sources simultaneously.

The full discussion, "What is Multimodal AI? How LLMs Process Text, Images, and More," can be found on IBM's YouTube channel.

Keen explains, "The vision encoder extracts features from whatever images we provide. And then that feature vector is passed into my LLM." This method allows for a more integrated understanding, where the model can correlate information from text and images. For example, a prompt with text and an image could lead to a response that is informed by both inputs.
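
To make the mechanics concrete, here is a minimal sketch of feature-level fusion in PyTorch. The dimensions, the stand-in vision encoder, and the projection layer are illustrative assumptions rather than the specific architecture Keen describes; the point is simply that image features are projected into the LLM's embedding space and concatenated with the text token embeddings.

```python
import torch
import torch.nn as nn

# Minimal sketch of feature-level fusion (illustrative shapes, not a real model).
# A separate vision encoder maps an image to a feature vector, a projection
# layer aligns it with the LLM's embedding width, and the projected image
# feature is concatenated with the text token embeddings before the LLM.

LLM_DIM = 768       # hypothetical LLM embedding width
VISION_DIM = 1024   # hypothetical vision-encoder feature width

vision_encoder = nn.Sequential(            # stand-in for a pretrained vision encoder
    nn.Flatten(), nn.Linear(3 * 224 * 224, VISION_DIM)
)
projector = nn.Linear(VISION_DIM, LLM_DIM)      # maps image features into LLM space
text_embedding = nn.Embedding(32000, LLM_DIM)   # stand-in for the LLM's token embeddings

image = torch.randn(1, 3, 224, 224)             # one RGB image
text_ids = torch.randint(0, 32000, (1, 12))     # twelve text tokens

image_feats = projector(vision_encoder(image)).unsqueeze(1)  # (1, 1, LLM_DIM)
text_feats = text_embedding(text_ids)                        # (1, 12, LLM_DIM)

# The fused sequence is what the LLM's transformer layers would consume.
fused = torch.cat([image_feats, text_feats], dim=1)          # (1, 13, LLM_DIM)
print(fused.shape)
```

In practice the vision encoder is typically a pretrained model, and much of the adaptation work goes into training the projection layer so that its outputs land in a region of the embedding space the LLM can interpret.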

Feature-Level Fusion vs. Native Multimodality

While feature-level fusion has been a common approach, Keen introduces native multimodality as a more advanced and efficient method. In native multimodality, the AI model is designed from the ground up to handle multiple data types directly. Instead of relying on separate encoders to convert data into a common format, native multimodal models are trained to process diverse inputs within a unified framework.

Keen visualizes this by showing how text, images, and audio can all be tokenized and embedded into a single, shared vector space. "All these different modalities live in the same vector space," Keen states. This shared space allows the model to directly compare and relate information across different data types. For instance, the representation of the word "cat" in this space would be close to the visual representation of a cat image and the auditory representation of a cat's meow.
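
The shared space can be illustrated in a few lines of code. The sketch below uses random linear projections as stand-ins for trained encoders, with all dimensions assumed for the example; it shows only the mechanics of embedding different modalities into one space and comparing them with cosine similarity.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a shared vector space (random weights, illustrative only).
# In a native multimodal model, the text, image, and audio encoders are
# trained so that related content lands nearby; here the encoders are faked
# with fixed linear maps just to show how cross-modal similarity is computed.

torch.manual_seed(0)
DIM = 512  # hypothetical shared embedding width

def embed(x, proj):
    """Project raw features into the shared space and L2-normalize."""
    return F.normalize(proj(x), dim=-1)

text_proj = torch.nn.Linear(300, DIM)    # stand-in text encoder
image_proj = torch.nn.Linear(2048, DIM)  # stand-in image encoder
audio_proj = torch.nn.Linear(128, DIM)   # stand-in audio encoder

text_cat = embed(torch.randn(1, 300), text_proj)     # the word "cat"
image_cat = embed(torch.randn(1, 2048), image_proj)  # a photo of a cat
audio_meow = embed(torch.randn(1, 128), audio_proj)  # a recorded meow

# After training, these cosine similarities would be high for matching
# concepts; with random weights they are just placeholders.
print((text_cat @ image_cat.T).item())   # text-image similarity
print((text_cat @ audio_meow.T).item())  # text-audio similarity
```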

This approach offers significant advantages. "The model doesn't have to translate between different modalities," Keen notes. Instead, it can directly process and generate outputs across these modalities. This capability enables what is termed "any-to-any generation." A model with native multimodality could, for example, take an image and text as input and generate a video, or take audio and generate text, all within a coherent and contextually relevant output.
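
One way to picture any-to-any generation is as a single token stream in which modality markers tell the model what kind of content each span holds and what kind to produce next. The sketch below is purely illustrative; the marker tokens, IDs, and placeholder decoder are invented for the example.

```python
# Minimal sketch of an "any-to-any" token stream (hypothetical token IDs).
# In a native multimodal model, inputs and outputs from every modality are
# just tokens in one sequence; special markers delimit which modality each
# span belongs to, so the same decoder can emit text, image, or audio tokens.

BOS, TXT, IMG, AUD, VID, EOS = range(6)  # hypothetical modality markers

# Input: an image plus a text instruction; requested output: video tokens.
prompt = [BOS, IMG, 101, 102, 103,  # tokenized input image
          TXT, 201, 202,            # tokenized instruction text
          VID]                      # marker requesting video output

def generate(sequence, n_tokens=4):
    """Placeholder decoder: a trained model would autoregressively sample
    video tokens conditioned on the whole mixed-modality prefix."""
    return sequence + [300 + i for i in range(n_tokens)] + [EOS]

print(generate(prompt))
```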

Temporal Reasoning in Video Processing

Keen also highlights the importance of temporal reasoning, particularly for video. A video is a sequence of frames, and understanding motion, change, and narrative flow requires processing that temporal dimension effectively. Traditional approaches that sample a few frames and feed each one into a vision encoder can lose exactly this information: Keen illustrates the problem with a grid of pixel patches representing 8 frames of video, explaining that simply comparing two static frames can miss crucial transitional detail.

"The motion in the video isn't something the model has to guess," Keen emphasizes. By processing video as a sequence of temporal chunks, models can capture the dynamic aspects of the data. This allows for more accurate understanding and generation of video content. For example, a model could be asked to describe the action in a video clip or even generate a new video clip based on a textual description, leveraging its understanding of temporal relationships.

This sophisticated handling of temporal data is crucial for AI to perform complex tasks involving dynamic information. Native multimodal models, with their ability to process sequences and relate them across modalities, are at the forefront of this advancement.

Key Takeaways for Multimodal AI

Keen's presentation underscores the evolution of AI from single-modality processing to sophisticated multimodal understanding. The core advancements include:

  • The integration of diverse data types like text, images, audio, and video.
  • The distinction between feature-level fusion, which uses separate models for each modality, and native multimodality, which processes all data within a unified framework.
  • The concept of a shared vector space, where different data types are represented in a way that allows for direct comparison and correlation.
  • The critical role of temporal reasoning in processing sequential data like video, enabling AI to understand motion and change.
  • The ultimate goal of "any-to-any generation," where AI can fluidly process and generate content across multiple modalities, leading to more contextually aware and comprehensive outputs.

This progression in multimodal AI signifies a move towards AI systems that can interact with and understand the world in a manner much closer to human perception.
