BLIP
Salesforce's unified vision-language framework that uses synthetic data bootstrapping to outperform models trained on noisy web data.
BLIP (Bootstrapping Language-Image Pre-training) addresses the noise inherent in large-scale web datasets through its CapFilt mechanism: a Captioner generates synthetic captions for web images, and a Filter prunes noisy image-text pairs from both the original web captions and the synthetic ones, leaving a higher-fidelity training corpus. Built on a multimodal mixture of encoder-decoder architecture that unifies vision-language understanding and generation in a single model, BLIP achieves state-of-the-art results on benchmarks such as COCO and VQA, including a +2.7% gain in average recall@1 on COCO image-text retrieval. It provides a streamlined solution for image captioning, visual search, and zero-shot reasoning.
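The CapFilt dataflow described above can be sketched in a few lines of Python. This is a toy illustration, not BLIP's implementation: the real Captioner is an image-grounded text decoder and the real Filter is an image-grounded text encoder scoring image-text match, so the `captioner` and `filter_score` stubs below are hypothetical stand-ins.

```python
from dataclasses import dataclass

@dataclass
class Pair:
    image_id: str
    caption: str
    source: str  # "web" or "synthetic"

def captioner(image_id: str) -> str:
    # Stand-in for BLIP's Captioner (an image-grounded text decoder).
    return f"a synthetic caption for {image_id}"

def filter_score(image_id: str, caption: str) -> float:
    # Stand-in for BLIP's Filter, which scores image-text match;
    # here we just flag an obviously spammy web caption.
    return 0.2 if "click here" in caption else 0.9

def capfilt(web_pairs, threshold=0.5):
    """Bootstrap a cleaner dataset: synthetically caption each image,
    then keep only the pairs the filter judges as well-matched."""
    bootstrapped = []
    for p in web_pairs:
        candidates = [p, Pair(p.image_id, captioner(p.image_id), "synthetic")]
        for c in candidates:
            if filter_score(c.image_id, c.caption) >= threshold:
                bootstrapped.append(c)
    return bootstrapped

web = [
    Pair("img1", "a dog running on the beach", "web"),
    Pair("img2", "click here for deals!!", "web"),
]
clean = capfilt(web)
# img2's noisy web caption is dropped, but its synthetic caption survives,
# so the bootstrapped set mixes filtered web and synthetic pairs.
```

The key design point mirrored here is that filtering is applied to both the original web captions and the newly generated ones, so low-quality synthetic captions are pruned just like noisy web text.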