LXMERT
LXMERT is a cross-modality transformer framework designed to learn the relationship between vision and language through large-scale pre-training.
Developed by Hao Tan and Mohit Bansal at UNC Chapel Hill, LXMERT (Learning Cross-Modality Encoder Representations from Transformers) employs a triple-encoder architecture: a language encoder, an object-relationship encoder, and a cross-modality fusion encoder. It represents visual inputs as Faster R-CNN region features (typically 36 objects per image) and text as WordPiece embeddings. The model was pre-trained on 9.18 million image-sentence pairs drawn from five datasets (including MS COCO and Visual Genome) using five pre-training tasks, such as masked language modeling and cross-modality matching. This approach yields strong results on benchmarks such as VQA v2.0 and GQA, providing a robust foundation for multimodal reasoning.
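The core idea of the cross-modality fusion encoder is cross-attention: token representations attend over the image's object features, and vice versa. The following is a minimal NumPy sketch of one such cross-attention exchange, with illustrative dimensions (the hidden size and function names here are assumptions for demonstration; LXMERT itself uses multi-head attention with a 768-dimensional hidden state):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d_k):
    # Scaled dot-product attention: each query vector attends over
    # the other modality's context vectors and returns their weighted sum.
    scores = queries @ context.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ context

rng = np.random.default_rng(0)
d = 64                              # hidden size (illustrative; not LXMERT's 768)
lang = rng.normal(size=(12, d))     # 12 WordPiece token embeddings
vis = rng.normal(size=(36, d))      # 36 Faster R-CNN object features, as in LXMERT

# One cross-modality exchange: language attends to vision, vision to language.
lang_out = cross_attention(lang, vis, d)
vis_out = cross_attention(vis, lang, d)

print(lang_out.shape, vis_out.shape)  # → (12, 64) (36, 64)
```

Each output sequence keeps its own length (12 tokens, 36 objects) while mixing in information from the other modality; stacking such layers, interleaved with self-attention and feed-forward sublayers, is what the fusion encoder does.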