ViLBERT
ViLBERT adapts the BERT architecture to process visual and textual data simultaneously through a dual-stream transformer.
Researchers from Georgia Tech and Facebook AI Research (FAIR) built ViLBERT to learn joint visual-linguistic representations. The system employs a two-stream architecture: image regions and text tokens are first processed in separate transformer streams, which then exchange information through co-attentional transformer layers. This design lets the model learn fine-grained relationships between images and language. Upon its release, ViLBERT established state-of-the-art results on the VQA 2.0 and Visual Commonsense Reasoning (VCR) benchmarks. It is pre-trained on the Conceptual Captions dataset (3.3 million image-caption pairs) to develop robust, task-agnostic representations.
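The co-attention exchange described above can be illustrated with a minimal sketch: each stream's queries attend over the other stream's keys and values via scaled dot-product attention. This is a toy, dependency-free illustration, not ViLBERT's actual implementation (which uses multi-head attention, learned projections, and residual connections); the vector values here are invented for demonstration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query vector attends over the
    other stream's keys and returns a weighted average of its values."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        out.append([sum(wj * v[i] for wj, v in zip(w, values))
                    for i in range(len(values[0]))])
    return out

# Toy dual-stream exchange: text queries attend to visual keys/values,
# while visual queries attend to text keys/values (the "co-" in co-attention).
text = [[1.0, 0.0], [0.0, 1.0]]      # two text-token states (hypothetical)
visual = [[0.5, 0.5], [1.0, -1.0]]   # two image-region states (hypothetical)

text_attended = cross_attention(text, visual, visual)
visual_attended = cross_attention(visual, text, text)
```

Because attention weights form a convex combination, each output vector stays within the coordinate-wise range of the value vectors it attends over; stacking such layers (with projections and residuals, as in the real model) lets each modality progressively condition on the other.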