
BLIP-2

Salesforce Research's BLIP-2 uses a Q-Former to bridge frozen image encoders and large language models for high-efficiency multimodal reasoning.

Salesforce Research built BLIP-2 to cut the high compute cost of vision-language pre-training. The architecture uses a Q-Former (Querying Transformer) to bridge a frozen image encoder (ViT-L/14) and a frozen LLM (Flan-T5 or OPT), so only the lightweight bridge is trained. This design achieves state-of-the-art zero-shot results on visual question answering (VQA) and image captioning. With only 188 million trainable parameters, BLIP-2 beats the 80-billion-parameter Flamingo model on the zero-shot VQAv2 benchmark — a 54x reduction in trainable parameters while maintaining top-tier multimodal performance.
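The core bridging idea can be sketched in a few lines: a small set of learned query vectors cross-attends over the frozen encoder's patch features, and the result is projected into the LLM's embedding space as a "soft prompt". This is a minimal single-head, single-layer NumPy sketch, not the real Q-Former (which stacks transformer blocks with self-attention and trained weights); all dimensions except the 32 learned queries, and all random weights, are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def qformer_bridge(image_feats, queries, W_q, W_k, W_v, W_proj):
    """One cross-attention step: learned queries attend over frozen
    image patch features, then the output is projected into the LLM's
    embedding space. Only these small weights would be trained."""
    Q = queries @ W_q            # (num_queries, d)
    K = image_feats @ W_k        # (num_patches, d)
    V = image_feats @ W_v        # (num_patches, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (num_queries, num_patches)
    return (attn @ V) @ W_proj   # (num_queries, d_llm) "soft prompt" for the LLM

rng = np.random.default_rng(0)
d_img, d, d_llm = 1024, 768, 2560    # illustrative feature sizes
num_patches, num_queries = 257, 32   # BLIP-2 uses 32 learned queries
image_feats = rng.normal(size=(num_patches, d_img))  # stand-in for frozen ViT output
queries = rng.normal(size=(num_queries, d))          # learned query embeddings
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d_img, d))
W_v = rng.normal(size=(d_img, d))
W_proj = rng.normal(size=(d, d_llm))

soft_prompt = qformer_bridge(image_feats, queries, W_q, W_k, W_v, W_proj)
print(soft_prompt.shape)  # (32, 2560)
```

Because the image encoder and LLM stay frozen, gradients flow only through the bridge — which is why the trainable-parameter count stays at 188M.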

https://huggingface.co/docs/transformers/model_doc/blip-2