CLAP
CLAP (Contrastive Language-Audio Pre-training) is a dual-encoder AI model that aligns audio and text embeddings for zero-shot classification and retrieval.
CLAP pairs an audio encoder with a text encoder and uses contrastive learning to project both modalities into a shared multimodal embedding space. Accepted at ICASSP 2023, the model was evaluated on 26 downstream audio tasks and achieved state-of-the-art (SoTA) results on several, including classification and retrieval. It is pretrained on large-scale data such as LAION-Audio-630K and builds on architectures such as HTSAT with feature fusion. Because class labels are embedded as free-form text and matched against audio in the shared space, CLAP supports zero-shot inference: open-vocabulary class prediction at inference time without task-specific labeled audio data.
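The zero-shot workflow can be illustrated with a short sketch. The example below is an assumption-laden illustration, not the project's reference code: it assumes the Hugging Face transformers port of CLAP (ClapModel/ClapProcessor) and the laion/clap-htsat-unfused checkpoint, and the audio path and candidate labels are placeholders. Each label is embedded as text, the clip is embedded as audio, and the scaled cosine similarities between the two act as classification logits.

# Zero-shot audio classification sketch. Assumes the Hugging Face
# transformers CLAP port and the laion/clap-htsat-unfused checkpoint.
import torch
import librosa
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")
model.eval()

# Candidate classes written as free-form text: no labeled audio is needed,
# and the label set can be changed at inference time (open vocabulary).
labels = ["a dog barking", "rain falling on a roof", "an ambulance siren"]

# "clip.wav" is a placeholder path; this checkpoint expects 48 kHz audio.
waveform, sr = librosa.load("clip.wav", sr=48000)

inputs = processor(text=labels, audios=waveform, sampling_rate=sr,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_audio holds the scaled cosine similarities between the clip's
# audio embedding and each label's text embedding in the shared space.
probs = outputs.logits_per_audio.softmax(dim=-1).squeeze()
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")

These cross-modal similarities are meaningful because pretraining optimizes a symmetric contrastive objective over matched audio-text pairs, pulling each clip toward its caption and away from the other captions in the batch.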