Technology
VideoBERT
Google Research's joint visual-linguistic model that applies the BERT architecture to learn unlabeled video representations for action forecasting and captioning.
VideoBERT leverages the transformer architecture to master cross-modal relationships by treating video frames as visual tokens alongside text. Developed by researchers at Google (Sun et al., 2019), the model was trained on 311,000 cooking videos (the YouCook2 dataset) to predict future actions and generate descriptions without manual labels. By quantizing video features into a discrete vocabulary, VideoBERT achieves state-of-the-art performance in zero-shot action classification and video-to-text translation tasks. This framework proves that large-scale self-supervised learning effectively bridges the gap between raw pixels and semantic language.
Related technologies
Recent Talks & Demos
Showing 1-1 of 1