VideoBERT

Google Research's joint visual-linguistic model, which applies the BERT architecture to learn representations from unlabeled video for action forecasting and captioning.

VideoBERT applies the transformer architecture to cross-modal learning by representing short video clips as discrete visual tokens alongside text tokens. Developed by researchers at Google (Sun et al., 2019), the model was pretrained on roughly 312,000 cooking videos from YouTube, with text taken from automatic speech recognition rather than manual labels, and evaluated on the YouCook2 dataset. Video features are quantized into a discrete vocabulary, so the masked-token objective used by BERT applies to both modalities. The paper reports captioning results that outperformed the prior state of the art on YouCook2 and shows that the model can be used for zero-shot action classification and for predicting future actions. These results suggest that large-scale self-supervised learning can help bridge the gap between raw pixels and semantic language.
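The quantization step can be illustrated with a minimal sketch. It assumes the visual vocabulary is a small set of centroids and maps each clip feature to its nearest one. The paper builds its vocabulary with hierarchical k-means over features from a pretrained video network, whereas the 2-D centroids, the helper names, and the `[VIS_k]` token format here are hypothetical stand-ins chosen for illustration. The `[>]` separator between text and video tokens follows the input format shown in the paper.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: List[Vector], centroids: List[Vector]) -> List[int]:
    """Map each clip feature vector to the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(f, centroids[k]))
        for f in features
    ]


def build_sequence(text_tokens: List[str], visual_ids: List[int]) -> List[str]:
    """Join text tokens and visual tokens into one BERT-style input sequence."""
    # Hypothetical naming scheme for visual tokens.
    visual_tokens = [f"[VIS_{i}]" for i in visual_ids]
    return ["[CLS]"] + text_tokens + ["[>]"] + visual_tokens + ["[SEP]"]


# Toy 2-D centroids standing in for a learned visual vocabulary (assumption).
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
# Toy clip features; real features would come from a video network.
clip_features = [(0.1, -0.2), (4.8, 5.1), (0.9, 1.2)]

visual_ids = quantize(clip_features, centroids)
sequence = build_sequence(["stir", "the", "sauce"], visual_ids)
print(visual_ids)  # [0, 2, 1]
print(sequence)
```

Once clips are reduced to vocabulary indices in this way, video and text share a single token sequence, which is what lets a masked-token objective be applied across both modalities.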

https://arxiv.org/abs/1904.01766

1 project · 1 city

Related technologies

Generate API (1) · Imagen Video (2) · Make-A-Video (2) · Oscar (1) · Runway Gen-2 (2) · Search API (2) · UniVL (1) · VideoCLIP (1) · Video embeddings (2)

Recent Talks & Demos

Twelve Labs: Chatting with Video

New York City Oct 26

Generate API VideoBERT