Embedding Models like CLIP
CLIP bridges the gap between vision and language by mapping images and text into a shared vector space.
Contrastive Language-Image Pre-training (CLIP) redefined multimodal AI by training a dual-encoder model on 400 million image-text pairs. Unlike legacy classifiers restricted to a fixed set of labels, CLIP encodes images and text separately and scores how well they match by the cosine similarity between their embeddings. This enables zero-shot classification on diverse benchmarks such as ImageNet and ObjectNet without task-specific fine-tuning. CLIP also serves as a foundational backbone for modern generative tools (DALL-E 3) and semantic search engines (Pinecone integrations), letting machines relate visual concepts to natural language descriptions.
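To make the dual-encoder idea concrete, here is a minimal sketch of zero-shot classification with CLIP. It assumes the Hugging Face `transformers` library, the public "openai/clip-vit-base-patch32" checkpoint, and a hypothetical local image file `example.jpg`; the candidate labels are illustrative prompts, not part of any fixed label set.

```python
# Sketch: zero-shot classification with CLIP's dual encoders.
# Assumes `transformers`, `torch`, and Pillow are installed, and that
# "example.jpg" is a local image (hypothetical input for illustration).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate classes expressed as natural-language prompts.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image = Image.open("example.jpg")

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    # Encode image and text into the shared embedding space.
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# Normalize so cosine similarity reduces to a dot product.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Scale by CLIP's learned temperature and softmax over the candidate labels.
logits = model.logit_scale.exp() * (image_emb @ text_emb.T)  # shape: (1, num_labels)
probs = logits.softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

The key design point this illustrates is that the label set is just text: adding a new class means adding a new prompt, with no retraining of the classifier.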