VisualBERT
A single-stream Transformer architecture that aligns visual regions and text tokens through a unified self-attention mechanism.
VisualBERT, developed by researchers at UCLA, streamlines vision-language modeling by treating image regions and text as a single input sequence. The architecture uses a BERT backbone to process 36 regional features per image (extracted via Faster R-CNN) alongside word embeddings, so a unified stack of self-attention layers aligns the two modalities without any cross-modal wiring. Pre-trained on MS COCO captions with two objectives, masked language modeling and image-text alignment prediction, the model transfers well to complex reasoning tasks such as Visual Question Answering (VQA) and Natural Language for Visual Reasoning (NLVR2). The design demonstrates that a simple, single-stream joint-attention approach can match or outperform more complex multi-stream alternatives.
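The single-stream idea above can be sketched numerically: project region features into the text embedding space, add modality (segment) embeddings, concatenate everything into one sequence, and run ordinary self-attention over it so every token can attend to every region and vice versa. This is a minimal NumPy sketch, not the released implementation; the dimensions (768 hidden units, 2048-d region features, 16 text tokens) and random weights are illustrative assumptions, with 36 regions taken from the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 768        # BERT-base hidden size
NUM_REGIONS = 36    # region features per image, as described above
REGION_DIM = 2048   # typical Faster R-CNN feature dimension (assumed)
SEQ_TEXT = 16       # number of text tokens (illustrative)

# Text side: token embeddings as a BERT embedding layer would produce.
text_embeds = rng.standard_normal((SEQ_TEXT, HIDDEN))

# Visual side: detector region features, linearly projected to HIDDEN.
region_feats = rng.standard_normal((NUM_REGIONS, REGION_DIM))
W_proj = rng.standard_normal((REGION_DIM, HIDDEN)) * 0.01
visual_embeds = region_feats @ W_proj

# Segment embeddings mark which modality each position belongs to.
seg_text = rng.standard_normal(HIDDEN)
seg_visual = rng.standard_normal(HIDDEN)

# Single stream: one joint sequence of text tokens followed by regions.
sequence = np.concatenate(
    [text_embeds + seg_text, visual_embeds + seg_visual], axis=0
)

# One single-head, unmasked self-attention step over the joint sequence:
# text positions attend to regions and regions attend to text in the
# same softmax, which is what "joint attention" means here.
Wq, Wk, Wv = (rng.standard_normal((HIDDEN, HIDDEN)) * 0.01 for _ in range(3))
Q, K, V = sequence @ Wq, sequence @ Wk, sequence @ Wv
scores = Q @ K.T / np.sqrt(HIDDEN)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ V

print(sequence.shape)  # (52, 768): 16 text tokens + 36 regions
print(out.shape)
```

A two-stream model would instead run separate encoders per modality and exchange information through dedicated cross-attention layers; the single concatenated sequence above removes that extra machinery.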