
CogVLM

CogVLM is a powerful, open-source Visual Language Model (VLM) that achieves state-of-the-art performance across 17 cross-modal benchmarks.

CogVLM is an open-source visual language foundation model engineered for deep vision-language feature fusion. It bridges a frozen pretrained language model (such as Vicuna-7B-v1.5) and an image encoder (such as EVA2-CLIP-E) through a trainable 'visual expert' module inserted into the attention and FFN layers, which adds multimodal capability without sacrificing the base model's NLP performance. The CogVLM-17B variant (10B vision parameters, 7B language parameters) delivers state-of-the-art results on 17 classic cross-modal tasks, including VQA, image captioning, and visual grounding, and supports multi-turn image dialogue as well as precise visual grounding.
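To make the 'visual expert' idea concrete, here is a minimal, illustrative PyTorch sketch (not CogVLM's actual implementation; names like VisualExpertAttention and is_image are hypothetical). Image tokens are routed through trainable QKV projections while text tokens keep the frozen language-model projections, and attention is then computed jointly over the whole sequence:

```python
# Illustrative sketch of a "visual expert" attention layer: image tokens get
# their own trainable weights, text tokens keep the frozen LM weights, and
# both attend over the full sequence so features fuse in every layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualExpertAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Frozen QKV projection from the pretrained language model.
        self.qkv_text = nn.Linear(d_model, 3 * d_model)
        self.qkv_text.requires_grad_(False)
        # Trainable "visual expert" QKV projection for image tokens.
        self.qkv_image = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); is_image: (batch, seq) boolean mask.
        b, s, d = x.shape
        # Route each token to its expert. (Computing both projections for all
        # tokens is wasteful but keeps the sketch simple.)
        qkv = torch.where(
            is_image.unsqueeze(-1),
            self.qkv_image(x),
            self.qkv_text(x),
        )
        q, k, v = qkv.chunk(3, dim=-1)
        q, k, v = (t.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        # Joint causal attention over text and image tokens together.
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(attn.transpose(1, 2).reshape(b, s, d))
```

A fuller version would also give image tokens their own FFN expert while keeping the text FFN frozen, mirroring the per-layer split described above.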

https://github.com/THUDM/CogVLM
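For reference, inference through Hugging Face transformers follows roughly this pattern. This is a sketch based on the THUDM/cogvlm-chat-hf model card; build_conversation_input_ids is custom model code loaded via trust_remote_code=True, so verify names against the current repo before relying on it:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

# Tokenizer comes from the Vicuna base model; weights from the CogVLM chat checkpoint.
tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm-chat-hf",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda").eval()

image = Image.open("example.jpg").convert("RGB")
inputs = model.build_conversation_input_ids(
    tokenizer, query="Describe this image.", history=[], images=[image]
)
inputs = {
    "input_ids": inputs["input_ids"].unsqueeze(0).to("cuda"),
    "token_type_ids": inputs["token_type_ids"].unsqueeze(0).to("cuda"),
    "attention_mask": inputs["attention_mask"].unsqueeze(0).to("cuda"),
    "images": [[inputs["images"][0].to("cuda").to(torch.bfloat16)]],
}
with torch.no_grad():
    outputs = model.generate(**inputs, max_length=2048, do_sample=False)
    outputs = outputs[:, inputs["input_ids"].shape[1]:]  # strip the prompt tokens
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```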