UniVL

UniVL is a unified video-and-language pre-training model designed to handle both multimodal understanding and generation tasks within a single framework.

UniVL was developed by researchers at Microsoft to close the gap between video-text understanding tasks (such as retrieval) and generation tasks (such as captioning), which earlier pre-training approaches tended to treat separately. The model is built from four Transformer-based components: two single-modal encoders (one for text, one for video), a cross encoder, and a decoder. It is pre-trained on the large-scale HowTo100M instructional-video dataset with five objectives, including video-text alignment and language reconstruction, and its authors report state-of-the-art results on five downstream tasks at the time of publication. The modular design lets the same pre-trained model be fine-tuned for either understanding or generation, which makes it a practical starting point for video-to-text applications.
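
To make the four-component layout concrete, here is a minimal, purely illustrative sketch of how data might flow through two single-modal encoders, a cross encoder, and a decoder. It is not taken from the UniVL repository: every function name and all of the toy feature logic are hypothetical placeholders, and real components would be Transformer stacks. The assumption that a retrieval-style score can be computed from the two single-modal encoders alone, while captioning goes through the cross encoder and decoder, is a simplification made for illustration.

```python
import math
from typing import List

Vector = List[float]


def text_encoder(tokens: List[str]) -> List[Vector]:
    # Hypothetical stand-in for the text encoder: one toy 2-d vector per token.
    return [[float(len(t)), float(ord(t[0]) % 7)] for t in tokens]


def video_encoder(frames: List[Vector]) -> List[Vector]:
    # Hypothetical stand-in for the video encoder: passes frame features through.
    return [list(f) for f in frames]


def cross_encoder(text_feats: List[Vector], video_feats: List[Vector]) -> List[Vector]:
    # Placeholder for joint encoding: simply concatenates the two sequences.
    return text_feats + video_feats


def decoder(fused: List[Vector], max_len: int = 3) -> List[str]:
    # Placeholder for generation: emits one dummy token per fused position,
    # up to max_len tokens.
    return [f"tok{i}" for i in range(min(max_len, len(fused)))]


def mean_pool(feats: List[Vector]) -> Vector:
    # Average a sequence of vectors into a single vector.
    dims = len(feats[0])
    return [sum(f[i] for f in feats) / len(feats) for i in range(dims)]


def cosine(a: Vector, b: Vector) -> float:
    # Cosine similarity, returning 0.0 if either vector has zero norm.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieval_score(tokens: List[str], frames: List[Vector]) -> float:
    # Understanding-style path (assumed): compare pooled single-modal features.
    return cosine(mean_pool(text_encoder(tokens)), mean_pool(video_encoder(frames)))


def caption(tokens: List[str], frames: List[Vector]) -> List[str]:
    # Generation-style path (assumed): encoders -> cross encoder -> decoder.
    fused = cross_encoder(text_encoder(tokens), video_encoder(frames))
    return decoder(fused)
```

Calling `retrieval_score(["cut", "onion"], [[1.0, 0.0], [0.0, 1.0]])` exercises only the two encoders, while `caption(...)` with the same inputs runs all four components. The point is only that one set of components can serve both kinds of task.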

https://github.com/microsoft/UniVL

Related technologies

Generate API, Imagen Video, Make-A-Video, Oscar, Runway Gen-2, Search API, VideoBERT, VideoCLIP, Video embeddings

Recent Talks & Demos

Twelve Labs: Chatting with Video

New York City, Oct 26

Generate API, VideoBERT