The quest for unified AI models that can understand the world through multiple modalities, such as vision and language, is a central theme in current research. The Platonic Representation Hypothesis suggests that neural networks trained on different modalities converge toward a shared underlying model of reality. Prior work has explored aligning pre-trained vision and language models, but it often requires vast amounts of paired data and complex contrastive losses. This paper investigates a crucial question: can robust cross-modal alignment be achieved with substantially fewer paired samples?
To tackle this, the authors introduce a semi-supervised setting and propose SOTAlign, a two-stage framework for data-efficient cross-modal alignment. In the first stage, a linear teacher model is fit on a limited set of paired image-text samples to establish a coarse shared geometry between the two modalities. In the second stage, this alignment is refined using large quantities of unpaired data: an optimal-transport-based divergence transfers the relational structure captured by the teacher to the target representation space without imposing overly rigid constraints on that space. The result is a joint embedding learned mostly from unpaired data, with paired supervision needed only to seed the teacher.
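The two stages can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not specify the teacher's objective or the exact divergence, so the sketch assumes a ridge-regression teacher, a cosine cost, an entropic (Sinkhorn) transport plan, and a KL divergence between teacher and student plans. All function names, the synthetic features, and the random student heads are hypothetical.

```python
import numpy as np


def fit_linear_teacher(txt_paired, img_paired, ridge=1e-3):
    """Stage 1 (assumed form): ridge regression from text to image features."""
    d_txt = txt_paired.shape[1]
    gram = txt_paired.T @ txt_paired + ridge * np.eye(d_txt)
    return np.linalg.solve(gram, txt_paired.T @ img_paired)  # (d_txt, d_img)


def cosine_cost(a, b):
    """Pairwise cosine distance between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations."""
    n, m = cost.shape
    mu, nu = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    kernel = np.exp(-cost / eps)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = mu / (kernel @ v)
        v = nu / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


def plan_kl(teacher_plan, student_plan):
    """Stage 2 loss (assumed form): KL between teacher and student plans.

    Only the soft matching between samples is compared, so the student's
    coordinates themselves are left unconstrained.
    """
    return float(np.sum(teacher_plan * (np.log(teacher_plan) - np.log(student_plan))))


# Synthetic stand-ins for frozen image/text encoder outputs (hypothetical data).
rng = np.random.default_rng(0)
latent = rng.normal(size=(264, 8))
img = latent @ rng.normal(size=(8, 32)) + 0.1 * rng.normal(size=(264, 32))
txt = latent @ rng.normal(size=(8, 16)) + 0.1 * rng.normal(size=(264, 16))

# Stage 1: fit the teacher on the paired subset only.
W = fit_linear_teacher(txt[:200], img[:200])

# Stage 2: an unpaired batch (text rows shuffled, so the pairing is unknown).
img_u, txt_u = img[200:], txt[200:][rng.permutation(64)]
teacher_plan = sinkhorn_plan(cosine_cost(img_u, txt_u @ W))

# Hypothetical student heads; random here, they would be trained to lower the loss.
head_img = rng.normal(size=(32, 12))
head_txt = rng.normal(size=(16, 12))
student_plan = sinkhorn_plan(cosine_cost(img_u @ head_img, txt_u @ head_txt))

loss = plan_kl(teacher_plan, student_plan)
print(f"stage-2 divergence: {loss:.4f}")
```

In a real training loop the student heads would be updated by gradient descent on this loss in an autodiff framework; NumPy is used here only to keep the sketch self-contained. The divergence is zero whenever the student reproduces the teacher's plan, regardless of how its embedding space is rotated or scaled, which is one way to read "without imposing overly rigid constraints".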