Technology

UNITER

Microsoft's large-scale transformer model for universal image-text representations: a unified framework for visual reasoning and cross-modal retrieval.

UNITER (UNiversal Image-TExt Representation) is pre-trained on roughly 9.6 million image-text pairs drawn from datasets such as COCO and Visual Genome, and it reported state-of-the-art results on several vision-language benchmarks when it was released. The transformer learns joint image-text embeddings through four pre-training tasks: Masked Language Modeling, Masked Region Modeling, Image-Text Matching, and Word-Region Alignment. After fine-tuning, the same model handles visual reasoning (benchmarked on NLVR2 and VQA) and image-text retrieval (benchmarked on Flickr30K). Because it captures fine-grained alignments between image regions and text tokens, UNITER serves as a general-purpose foundation for vision-language applications.
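
To make the two masking tasks concrete, here is a minimal sketch of how inputs for Masked Language Modeling and Masked Region Modeling could be prepared. It is not code from the UNITER repository: the function names are hypothetical, and the 15% masking rate, the zeroing of masked region features, and the choice to mask only one modality per example are assumptions for illustration.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder symbol for a masked word (assumed)


def mask_tokens(tokens, prob, rng):
    """Replace each token with MASK_TOKEN with probability `prob`.

    Returns the masked token list and the positions that were masked.
    """
    masked, positions = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < prob:
            masked.append(MASK_TOKEN)
            positions.append(i)
        else:
            masked.append(tok)
    return masked, positions


def mask_regions(regions, prob, rng):
    """Replace each region feature vector with zeros with probability `prob`.

    Returns the masked region list and the positions that were masked.
    """
    masked, positions = [], []
    for i, feat in enumerate(regions):
        if rng.random() < prob:
            masked.append([0.0] * len(feat))
            positions.append(i)
        else:
            masked.append(list(feat))
    return masked, positions


def make_pretraining_example(tokens, regions, task, prob=0.15, rng=None):
    """Build one masked example; only one modality is masked per example.

    task == "mlm": mask words, keep every image region intact.
    task == "mrm": mask image regions, keep every word intact.
    """
    rng = rng or random.Random()
    if task == "mlm":
        masked_tokens, positions = mask_tokens(tokens, prob, rng)
        return masked_tokens, [list(r) for r in regions], positions
    if task == "mrm":
        masked_regions, positions = mask_regions(regions, prob, rng)
        return list(tokens), masked_regions, positions
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    words = ["a", "dog", "on", "a", "couch"]
    feats = [[0.2, 0.9], [0.5, 0.1], [0.7, 0.3]]  # toy 2-d region features
    print(make_pretraining_example(words, feats, "mlm", rng=random.Random(0)))
    print(make_pretraining_example(words, feats, "mrm", rng=random.Random(0)))
```

In this sketch the model would be trained to recover the masked words (or region information) from the remaining words and the full set of regions, which is what pushes it toward the joint embeddings described above.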

https://github.com/ChenRocks/UNITER