VideoCLIP
VideoCLIP uses contrastive pre-training of transformer encoders to align video and text, with zero-shot video-text retrieval as a primary target.
Developed by researchers at Facebook AI (now Meta), VideoCLIP is pre-trained on the HowTo100M dataset (about 1.2 million narrated videos) and reported state-of-the-art zero-shot results when it was published in 2021. It employs a dual-encoder architecture, with one transformer for video and one for text, and builds its positive pairs from temporally overlapping video and text clips to learn fine-grained temporal associations. Its contrastive objective targets both video-to-text and text-to-video alignment, which supports zero-shot transfer to tasks such as text-video retrieval and action localization. In this way it bridges the gap between static image-text models (like CLIP) and dynamic video sequences without requiring manual labels for downstream tasks.
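The two-direction objective described above can be illustrated with a small symmetric contrastive (InfoNCE-style) loss. This is a minimal sketch and not VideoCLIP's actual implementation: the function names, the use of cosine similarity, the temperature of 0.07, and the choice to sum the two directions are all assumptions made for illustration. Row i of each input is assumed to hold the embedding of the i-th matched video-text pair, and the other rows in the batch act as negatives.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Hypothetical helper: video-to-text plus text-to-video InfoNCE loss.

    video_emb[i] and text_emb[i] are assumed to be a matched pair;
    every other row in the batch serves as a negative.
    """
    v = [l2_normalize(x) for x in video_emb]
    t = [l2_normalize(x) for x in text_emb]
    n = len(v)
    # Similarity matrix: sim[i][j] compares video i with text j.
    sim = [[sum(a * b for a, b in zip(v[i], t[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Video-to-text: each video should rank its own text highest.
    v2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-video: transpose and repeat from the text side.
    sim_t = [list(col) for col in zip(*sim)]
    t2v = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return v2t + t2v

# Toy batch of two pairs with 2-dimensional embeddings (made-up values).
videos = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
texts_shuffled = [[0.0, 1.0], [1.0, 0.0]]
print(symmetric_contrastive_loss(videos, texts_aligned))
print(symmetric_contrastive_loss(videos, texts_shuffled))
```

In this toy batch the correctly paired embeddings yield a much smaller loss than the shuffled ones, which is the signal that pulls matching video and text clips together during pre-training.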