CLAP
CLAP (Contrastive Language-Audio Pre-training) is a dual-encoder AI model that aligns audio and text embeddings for zero-shot classification and retrieval.
CLAP pairs an audio encoder with a text encoder and uses contrastive learning to project both modalities into a shared multimodal embedding space. Accepted at ICASSP 2023, the model was evaluated on 26 downstream audio tasks and achieved state-of-the-art (SoTA) results on several, including classification and retrieval. It is pretrained on large-scale data such as LAION-Audio-630K and builds on architectures such as HTSAT with feature fusion. Because class labels are embedded as free-form text and matched against audio in the shared space, CLAP supports zero-shot inference: open-vocabulary class prediction at inference time without task-specific labeled audio data.
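The zero-shot workflow can be illustrated with a short sketch. The example below is an assumption-laden illustration, not the project's reference code: it assumes the Hugging Face transformers port of CLAP (ClapModel/ClapProcessor) and the laion/clap-htsat-unfused checkpoint, and the audio path and candidate labels are placeholders. Each label is embedded as text, the clip is embedded as audio, and the scaled cosine similarities between the two act as classification logits.

# Zero-shot audio classification sketch. Assumes the Hugging Face
# transformers CLAP port and the laion/clap-htsat-unfused checkpoint.
import torch
import librosa
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")
model.eval()

# Candidate classes written as free-form text: no labeled audio is needed,
# and the label set can be changed at inference time (open vocabulary).
labels = ["a dog barking", "rain falling on a roof", "an ambulance siren"]

# "clip.wav" is a placeholder path; this checkpoint expects 48 kHz audio.
waveform, sr = librosa.load("clip.wav", sr=48000)

inputs = processor(text=labels, audios=waveform, sampling_rate=sr,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_audio holds the scaled cosine similarities between the clip's
# audio embedding and each label's text embedding in the shared space.
probs = outputs.logits_per_audio.softmax(dim=-1).squeeze()
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")

These cross-modal similarities are meaningful because pretraining optimizes a symmetric contrastive objective over matched audio-text pairs, pulling each clip toward its caption and away from the other captions in the batch.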