djoldman
6 days ago
This is a key observation that is simple and intuitive:
>All CLIP-like models perform poorly on mixed-modality search due to a phenomenon known as the modality gap. As illustrated in the figure below, the closest vector to the snippet “I address you, members of the Seventy-Seventh Congress…” is not its screenshot, but other texts. This leads to search results that are skewed towards items of the same modality; in other words, text vectors will be closer to irrelevant texts than relevant images in the embedding space.
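You can see the gap directly with an off-the-shelf CLIP. Rough sketch below (the checkpoint, captions, and image paths are placeholders I made up): embed a couple of texts and their screenshots, normalize everything, and compare text-to-text vs matched text-to-image cosine similarities. The text-text numbers tend to come out higher, which is exactly the skew the quote describes.

    # Minimal sketch: eyeball CLIP's modality gap on a handful of examples.
    # Assumes torch, transformers, and PIL are installed; captions and image paths are placeholders.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    captions = [
        "I address you, members of the Seventy-Seventh Congress...",
        "Quarterly revenue grew 12% year over year.",
    ]
    images = [Image.open(p) for p in ["speech_screenshot.png", "report_screenshot.png"]]

    with torch.no_grad():
        inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

    # Cosine similarity = dot product of L2-normalized vectors.
    text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
    image_emb = torch.nn.functional.normalize(image_emb, dim=-1)

    text_to_text = text_emb @ text_emb.T    # similarities among texts
    text_to_image = text_emb @ image_emb.T  # similarities across modalities

    n = len(captions)
    # Average similarity to *other* texts vs similarity to the matching screenshot.
    print("avg text-text (different items):", text_to_text.fill_diagonal_(0).sum() / (n * (n - 1)))
    print("avg text-image (matched pairs): ", text_to_image.diag().mean())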
evrydayhustling
6 days ago
This quote is important, but in isolation it's not clear that they are claiming to have beaten this problem: they are saying the new model, voyage-multimodal-3, instead identifies linked concepts across modalities. That would indeed be pretty cool -- if there is a latent space that can cluster the same idea, whether it's represented visually or in text.
> ... the vectors truly capture the semantic content contained in the screenshots. This robustness is due to the model’s unique approach of processing all input modalities through the same backbone.
With that said, I think this benchmark is a pretty narrow way of thinking about multi-modal embedding. Having text embed close to images of related text is cool and convenient, but it doesn't necessarily extend to other notions of related visual expression (e.g. "rabbit" vs a photo of a rabbit). And for the narrow goal of indexing document images, I suspect there are other techniques that could also work quite well.
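For example (just a sketch, with placeholder model names and file paths): OCR each page image and index the extracted text with an ordinary text embedding model, so the media-of-text case never has to cross a modality gap at all.

    # Sketch of one alternative for indexing document images: OCR + text embeddings.
    # Assumes pytesseract (plus the Tesseract binary) and sentence-transformers are installed;
    # the model name and file paths are placeholders.
    import pytesseract
    from PIL import Image
    from sentence_transformers import SentenceTransformer, util

    pages = ["page_001.png", "page_002.png"]  # screenshots of documents
    texts = [pytesseract.image_to_string(Image.open(p)) for p in pages]

    model = SentenceTransformer("all-MiniLM-L6-v2")  # any text embedding model works here
    page_emb = model.encode(texts, normalize_embeddings=True)

    query = "address to the Seventy-Seventh Congress"
    query_emb = model.encode([query], normalize_embeddings=True)

    # Cosine similarity ranks pages for the query; everything stays in one modality.
    scores = util.cos_sim(query_emb, page_emb)[0]
    best = scores.argmax().item()
    print(f"best match: {pages[best]} (score {scores[best].item():.3f})")

Whether that beats a true multimodal embedding obviously depends on how much of the page's meaning is in the text vs the layout and figures.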
This seems like a great opportunity for a new benchmark dataset with multi-modal concept representations beyond media-of-text.
tugdual
5 days ago
They could be solving it with multimodal mixup, a technique that ensures there's no big latent gap between the two modalities: https://arxiv.org/abs/2203.03897
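If I'm reading the paper right, the core trick is to mix L2-normalized image and text embeddings along the sphere (a slerp) and use the mixtures as hard negatives in the contrastive loss, which pulls the two modalities' regions toward each other. Very rough sketch of just the mixing step; the shapes, the Beta(alpha, alpha) draw, and how it plugs into the loss are my assumptions:

    # Rough sketch of geodesic (spherical) mixup between image and text embeddings,
    # as I understand it from the paper; shapes and hyperparameters are made up.
    import torch

    def geodesic_mixup(text_emb: torch.Tensor, image_emb: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
        """Mix L2-normalized text/image embeddings along the great circle between them."""
        lam = torch.distributions.Beta(alpha, alpha).sample()  # mixing coefficient in [0, 1]
        t = torch.nn.functional.normalize(text_emb, dim=-1)
        v = torch.nn.functional.normalize(image_emb, dim=-1)

        # Slerp: interpolate on the sphere rather than the chord, so the result stays unit-norm.
        cos_theta = (t * v).sum(dim=-1, keepdim=True).clamp(-1 + 1e-6, 1 - 1e-6)
        theta = torch.acos(cos_theta)
        mixed = (torch.sin((1 - lam) * theta) * t + torch.sin(lam * theta) * v) / torch.sin(theta)
        return mixed  # lam -> 0 recovers the text embedding, lam -> 1 the image embedding

    # Toy usage with random "embeddings" standing in for a real encoder's output.
    text_emb = torch.randn(8, 512)
    image_emb = torch.randn(8, 512)
    hard_negatives = geodesic_mixup(text_emb, image_emb)
    print(hard_negatives.shape, hard_negatives.norm(dim=-1))  # norms stay ~1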