
CogVLM

CogVLM is a powerful, open-source Visual Language Model (VLM) that achieves state-of-the-art performance across 17 cross-modal benchmarks.

CogVLM is an open-source visual language foundation model engineered for deep vision-language feature fusion. It bridges a frozen pretrained language model (such as Vicuna-7B-v1.5) and an image encoder (such as EVA2-CLIP-E) through a trainable 'visual expert' module inserted into the attention and FFN layers, which adds multimodal capability without sacrificing the base model's NLP performance. The CogVLM-17B variant (10B vision parameters, 7B language parameters) delivers state-of-the-art results on 17 classic cross-modal tasks, including VQA, image captioning, and visual grounding, and supports multi-turn image dialogue as well as precise visual grounding.
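To make the 'visual expert' idea concrete, here is a minimal, illustrative PyTorch sketch (not CogVLM's actual implementation; names like VisualExpertAttention and is_image are hypothetical). Image tokens are routed through trainable QKV projections while text tokens keep the frozen language-model projections, and attention is then computed jointly over the whole sequence:

```python
# Illustrative sketch of a "visual expert" attention layer: image tokens get
# their own trainable weights, text tokens keep the frozen LM weights, and
# both attend over the full sequence so features fuse in every layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualExpertAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Frozen QKV projection from the pretrained language model.
        self.qkv_text = nn.Linear(d_model, 3 * d_model)
        self.qkv_text.requires_grad_(False)
        # Trainable "visual expert" QKV projection for image tokens.
        self.qkv_image = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); is_image: (batch, seq) boolean mask.
        b, s, d = x.shape
        # Route each token to its expert. (Computing both projections for all
        # tokens is wasteful but keeps the sketch simple.)
        qkv = torch.where(
            is_image.unsqueeze(-1),
            self.qkv_image(x),
            self.qkv_text(x),
        )
        q, k, v = qkv.chunk(3, dim=-1)
        q, k, v = (t.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        # Joint causal attention over text and image tokens together.
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(attn.transpose(1, 2).reshape(b, s, d))
```

A fuller version would also give image tokens their own FFN expert while keeping the text FFN frozen, mirroring the per-layer split described above.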

https://github.com/THUDM/CogVLM
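For reference, inference through Hugging Face transformers follows roughly this pattern. This is a sketch based on the THUDM/cogvlm-chat-hf model card; build_conversation_input_ids is custom model code loaded via trust_remote_code=True, so verify names against the current repo before relying on it:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

# Tokenizer comes from the Vicuna base model; weights from the CogVLM chat checkpoint.
tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm-chat-hf",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda").eval()

image = Image.open("example.jpg").convert("RGB")
inputs = model.build_conversation_input_ids(
    tokenizer, query="Describe this image.", history=[], images=[image]
)
inputs = {
    "input_ids": inputs["input_ids"].unsqueeze(0).to("cuda"),
    "token_type_ids": inputs["token_type_ids"].unsqueeze(0).to("cuda"),
    "attention_mask": inputs["attention_mask"].unsqueeze(0).to("cuda"),
    "images": [[inputs["images"][0].to("cuda").to(torch.bfloat16)]],
}
with torch.no_grad():
    outputs = model.generate(**inputs, max_length=2048, do_sample=False)
    outputs = outputs[:, inputs["input_ids"].shape[1]:]  # strip the prompt tokens
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```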