BLIP-2
Salesforce Research's BLIP-2 uses a Q-Former to bridge frozen image encoders and large language models for high-efficiency multimodal reasoning.
Salesforce Research built BLIP-2 to cut the high compute cost of end-to-end vision-language pre-training. The architecture uses a Q-Former (Querying Transformer) to bridge a frozen image encoder (e.g., CLIP ViT-L/14 or EVA-CLIP ViT-g/14) with a frozen LLM (OPT or Flan-T5), so only the lightweight Q-Former is trained. This approach achieves state-of-the-art zero-shot results on visual question answering (VQA) and image captioning. With only 188 million trainable parameters, BLIP-2 outperforms the 80-billion-parameter Flamingo model on the zero-shot VQAv2 benchmark, a 54x reduction in trainable parameters while maintaining top-tier multimodal performance.
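As a concrete illustration of the frozen-backbone design described above, here is a minimal sketch of running BLIP-2 for captioning and zero-shot VQA through the Hugging Face Transformers port. The Salesforce/blip2-flan-t5-xl checkpoint, the sample COCO image URL, and the availability of a CUDA device are assumptions for the example, not details from this page:

```python
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the processor and model; fp16 keeps the frozen ViT + LLM weights
# small enough for a single consumer GPU (assumed available here).
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl", torch_dtype=torch.float16
).to("cuda")

# Example image (hypothetical choice): a COCO validation photo.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Image captioning: with no text prompt, the model generates a caption.
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))

# Zero-shot VQA: pass a question as the text prompt.
inputs = processor(
    images=image,
    text="Question: how many cats are there? Answer:",
    return_tensors="pt",
).to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=10)
print(processor.decode(out[0], skip_special_tokens=True))
```

Note that only the prompt changes between the two tasks; the Q-Former's learned queries condition the frozen LLM on the image in both cases, which is what lets one small trained module serve captioning and VQA alike.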