UniVL
UniVL is a unified video-and-language pre-training model designed to handle both multimodal understanding and generation tasks within a single framework.
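Developed by researchers at Microsoft and collaborating universities, UniVL addresses the gap between video-text understanding tasks (such as retrieval) and generation tasks (such as captioning), which earlier pre-training models typically handled separately. The model is built on a Transformer backbone with four components: two single-modal encoders (one for text, one for video), a cross-encoder that fuses the two modalities, and a decoder that generates text. It is pre-trained on the large-scale HowTo100M instructional-video dataset with five objectives: video-text joint, conditioned masked language model (CMLM), conditioned masked frame model (CMFM), video-text alignment, and language reconstruction. After fine-tuning, the authors report state-of-the-art results at the time of publication on five downstream tasks: text-based video retrieval, multimodal video captioning, action segmentation, action step localization, and multimodal sentiment analysis. Because the same pre-trained weights serve both understanding and generation, developers building video-to-text applications can start from a single model rather than maintaining separate ones.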
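The split between understanding and generation can be illustrated with a small sketch. This is not UniVL's code or API: the component labels, the `TASK_PATHS` mapping, and the helpers `modules_for`, `cosine`, and `rank_videos` are all hypothetical names, and both the task-to-component routing and the use of cosine similarity for retrieval are assumptions made for illustration. The embeddings are plain Python lists standing in for encoder outputs.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical labels for the four components named above.
TEXT_ENCODER, VIDEO_ENCODER = "text_encoder", "video_encoder"
CROSS_ENCODER, DECODER = "cross_encoder", "decoder"

# Assumed routing: retrieval compares the two single-modal outputs,
# while captioning additionally fuses them and decodes text.
TASK_PATHS: Dict[str, Tuple[str, ...]] = {
    "retrieval": (TEXT_ENCODER, VIDEO_ENCODER),
    "captioning": (TEXT_ENCODER, VIDEO_ENCODER, CROSS_ENCODER, DECODER),
}


def modules_for(task: str) -> Tuple[str, ...]:
    """Return the components assumed to be active for a task type."""
    if task not in TASK_PATHS:
        raise ValueError(f"unknown task: {task!r}")
    return TASK_PATHS[task]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def rank_videos(text_vec: Sequence[float],
                video_vecs: Dict[str, Sequence[float]]) -> List[str]:
    """Order video ids by similarity to the query embedding, best first."""
    return sorted(video_vecs,
                  key=lambda vid: cosine(text_vec, video_vecs[vid]),
                  reverse=True)
```

Under these assumptions, `modules_for("retrieval")` involves only the two single-modal encoders, whereas `modules_for("captioning")` runs through all four components. `rank_videos` shows the retrieval side in miniature: given a query embedding and a dictionary of video embeddings, it returns the video ids sorted from most to least similar.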