CLIP to SigLIP > Migrating Our 200M+ Multimodal Embeddings

1 point, posted 11 hours ago
by teocalin37

1 comment

teocalin37

11 hours ago

We work a lot with multimodal embeddings for semantic search and image-to-image retrieval over massive CCTV datasets. We have 200M+ CLIP vectors indexed in vector DBs.

Meanwhile, SigLIP smokes CLIP: roughly 5-10% better recall@1 on our test datasets. But re-embedding all of this data would mean weeks of GPU time, which is hugely expensive.

So we built Vector Rosetta: a 50M-param adapter that translates CLIP embeddings to SigLIP embeddings purely in embedding space. It's 41x faster than re-embedding and never touches the original images.
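For anyone curious what "translating in embedding space" looks like, here's a minimal sketch: a small MLP that maps CLIP vectors into SigLIP's space. The dimensions (768 → 1152), architecture, and weights here are all my own placeholder assumptions, not the actual Vector Rosetta model.

```python
import numpy as np

# Hypothetical sketch of an embedding-space translator: a two-layer MLP
# mapping CLIP vectors (assumed 768-d) into SigLIP space (assumed 1152-d).
# Randomly initialized weights stand in for a trained adapter.

rng = np.random.default_rng(0)

CLIP_DIM, HIDDEN, SIGLIP_DIM = 768, 2048, 1152

W1 = rng.normal(0, 0.02, (CLIP_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0, 0.02, (HIDDEN, SIGLIP_DIM))
b2 = np.zeros(SIGLIP_DIM)

def translate(clip_vecs: np.ndarray) -> np.ndarray:
    """Map a batch of CLIP embeddings into (approximate) SigLIP space."""
    h = np.maximum(clip_vecs @ W1 + b1, 0.0)  # ReLU hidden layer
    out = h @ W2 + b2
    # L2-normalize so cosine similarity downstream is just a dot product.
    return out / np.linalg.norm(out, axis=1, keepdims=True)

batch = rng.normal(size=(4, CLIP_DIM))
translated = translate(batch)
print(translated.shape)  # (4, 1152)
```

The appeal of this setup is that it runs on stored vectors only: no decoding video frames, no vision encoder forward passes, just a cheap MLP over data already sitting in the vector DB.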

Numbers:

- 90.9% cosine similarity preserved
- 94.3% Rank@1 in a 10K pool, 84.4% in a 100K pool
- COCO photos: 90.1%; WikiArt: 85.7%
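For context, Rank@1 numbers like these are typically measured by checking, for each translated vector, whether its nearest neighbor among the true SigLIP vectors is the matching item. A small self-contained sketch of that evaluation (the pool size, noise model, and dimensions are illustrative, not the post's actual setup):

```python
import numpy as np

# Sketch of a Rank@1 evaluation: translated vectors are "correct" when their
# nearest neighbor (by cosine similarity) in the true SigLIP pool is the
# matching item. We simulate translated vectors as noisy copies of the truth.

rng = np.random.default_rng(1)
POOL, DIM = 1000, 64  # small stand-in; the post uses 10K / 100K pools

true_siglip = rng.normal(size=(POOL, DIM))
true_siglip /= np.linalg.norm(true_siglip, axis=1, keepdims=True)

translated = true_siglip + 0.1 * rng.normal(size=(POOL, DIM))
translated /= np.linalg.norm(translated, axis=1, keepdims=True)

def rank_at_1(queries: np.ndarray, pool: np.ndarray) -> float:
    sims = queries @ pool.T                # cosine sims (unit vectors)
    nearest = sims.argmax(axis=1)          # best match per query
    return float((nearest == np.arange(len(queries))).mean())

print(f"Rank@1: {rank_at_1(translated, true_siglip):.3f}")
```

Note the 94.3% → 84.4% drop from the 10K to the 100K pool: larger pools mean more near-duplicate distractors, so Rank@1 naturally degrades with pool size.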

Added the link to the model; we thought it might be useful to other people.