

LXMERT

LXMERT is a cross-modality transformer framework that learns vision-and-language connections through large-scale pre-training.

Developed by Hao Tan and Mohit Bansal at UNC Chapel Hill, LXMERT (Learning Cross-Modality Encoder Representations from Transformers) uses three transformer encoders: a language encoder, an object-relationship encoder, and a cross-modality encoder that fuses the two streams. Visual input is a set of object regions detected by Faster R-CNN (typically 36 per image), and text is tokenized into WordPiece embeddings. The model was pre-trained on 9.18 million image-sentence pairs drawn from five datasets (including MS COCO and Visual Genome) with five tasks: masked cross-modality language modeling, masked object prediction via feature regression, masked object prediction via label classification, cross-modality matching, and image question answering. This pre-training yielded state-of-the-art results at release on benchmarks such as VQA v2.0 and GQA, making the model a strong foundation for multimodal reasoning.
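
As a concrete illustration, here is a minimal sketch of running the pre-trained model through the Hugging Face Transformers port (LxmertModel with the unc-nlp/lxmert-base-uncased checkpoint). Note that LXMERT consumes pre-extracted detector output rather than raw pixels; the random tensors below are placeholders for the 36 Faster R-CNN object features and boxes you would extract in practice.

```python
# Minimal sketch: forward pass through LXMERT via Hugging Face Transformers.
# The visual tensors are random placeholders standing in for real
# Faster R-CNN RoI features (LXMERT does not include the detector).
import torch
from transformers import LxmertModel, LxmertTokenizer

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

# Text stream: WordPiece-tokenized sentence.
inputs = tokenizer("A cat sitting on a couch", return_tensors="pt")

# Vision stream: 36 objects per image, each a 2048-dim RoI feature
# plus a normalized (x1, y1, x2, y2) bounding box.
visual_feats = torch.randn(1, 36, 2048)  # placeholder detector features
visual_pos = torch.rand(1, 36, 4)        # placeholder normalized boxes

outputs = model(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    visual_feats=visual_feats,
    visual_pos=visual_pos,
)

# The cross-modality encoder returns per-token language states,
# per-object vision states, and a pooled representation for
# downstream heads (e.g., VQA classification).
print(outputs.language_output.shape)  # (1, seq_len, 768)
print(outputs.vision_output.shape)    # (1, 36, 768)
print(outputs.pooled_output.shape)    # (1, 768)
```

For fine-tuning on answer classification, the same library exposes a LxmertForQuestionAnswering head; the original PyTorch code and feature-extraction pipeline live in the repository linked below.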

https://github.com/airsplay/lxmert
