UNITER
Microsoft's large-scale transformer model for universal image-text representations: a unified framework for visual reasoning and cross-modal retrieval.
UNITER (UNiversal Image-TExt Representation) is pre-trained on about 9.6 million image-text pairs drawn from four datasets: COCO, Visual Genome, Conceptual Captions, and SBU Captions. At the time of its publication it reported state-of-the-art results across a range of vision-language benchmarks. The architecture feeds image regions and text tokens into a single transformer encoder and learns joint embeddings through four pre-training tasks: Masked Language Modeling, Masked Region Modeling, Image-Text Matching, and Word-Region Alignment. This unified approach lets the model perform well on visual reasoning (NLVR2), visual question answering (VQA), and image-text retrieval (Flickr30K). Because it captures fine-grained alignments between visual regions and textual tokens, UNITER serves as a general-purpose representation that can be fine-tuned for diverse vision-language applications.
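To make the Masked Language Modeling task more concrete, here is a minimal sketch in plain Python of how one training example might be prepared: some text tokens are replaced with a mask symbol while the image regions are left untouched, so the model has to recover the hidden words from the remaining text and the full visual context. The function names (`mask_text_tokens`, `build_mlm_example`), the `[MASK]` string, the 15% default rate, and the region labels are illustrative assumptions, not UNITER's actual code; real implementations work on token IDs and region feature vectors and usually apply a more elaborate replacement scheme.

```python
import random

MASK = "[MASK]"  # placeholder symbol; an assumption for this sketch


def mask_text_tokens(tokens, mask_prob=0.15, rng=None):
    """Randomly hide text tokens.

    Returns (masked_tokens, labels), where labels[i] holds the original
    token at every masked position and None everywhere else.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # the model is trained to predict this
        else:
            masked.append(tok)
            labels.append(None)  # no loss at unmasked positions
    return masked, labels


def build_mlm_example(tokens, regions, mask_prob=0.15, rng=None):
    """Build one joint image-text example for masked language modeling.

    Only the text side is masked; the regions are passed through intact
    so they can serve as context for recovering the hidden words.
    """
    masked, labels = mask_text_tokens(tokens, mask_prob, rng)
    return {"regions": list(regions), "tokens": masked, "labels": labels}


if __name__ == "__main__":
    # Hypothetical inputs: region names stand in for detector features.
    tokens = ["a", "dog", "chasing", "a", "red", "ball"]
    regions = ["region_dog", "region_ball", "region_grass"]
    example = build_mlm_example(tokens, regions, rng=random.Random(0))
    print(example)
```

Masked Region Modeling can be pictured as the mirror image of this step: the text stays intact and a subset of regions is hidden instead.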