
Technology

LLaVA

LLaVA (Large Language-and-Vision Assistant) is an open-source, end-to-end trained Large Multimodal Model (LMM) connecting a CLIP vision encoder with an LLM (like Vicuna or Llama-2) for visual and language comprehension.

LLaVA is a pioneering open-source LMM designed for general-purpose visual and language understanding, targeting GPT-4V-level capabilities. The architecture connects a pre-trained vision encoder (CLIP ViT-L/14) to an LLM (e.g., Vicuna, Llama-2) through a simple projection layer. Its strength comes from a two-stage visual instruction tuning process that uses a high-quality synthetic multimodal instruction dataset generated by GPT-4. This minimalist, data-efficient approach allows LLaVA to achieve an 85.1% relative score compared to GPT-4 on a synthetic instruction-following dataset, making it a powerful, accessible alternative for visual chat and reasoning.
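The sketch below illustrates the core idea described above: a pre-trained CLIP vision encoder, a learnable projection into the LLM's embedding space, and an LLM that consumes the projected image tokens alongside text embeddings. This is a minimal illustration, not the actual LLaVA code; the specific model checkpoints and the single-linear projection are assumptions (LLaVA-1.5, for instance, uses a two-layer MLP projector).

import torch
import torch.nn as nn
from transformers import CLIPVisionModel, AutoModelForCausalLM

class TinyLLaVASketch(nn.Module):
    """Illustrative vision-encoder + projection + LLM wiring (not LLaVA's real code)."""

    def __init__(self,
                 vision_name="openai/clip-vit-large-patch14",   # assumed checkpoint
                 llm_name="lmsys/vicuna-7b-v1.5"):              # assumed checkpoint
        super().__init__()
        self.vision = CLIPVisionModel.from_pretrained(vision_name)
        self.vision.requires_grad_(False)            # vision encoder stays frozen
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        # Projection maps CLIP patch features into the LLM's embedding space.
        self.proj = nn.Linear(self.vision.config.hidden_size,
                              self.llm.config.hidden_size)

    def forward(self, pixel_values, input_ids):
        # Patch-level visual features; drop the [CLS] token.
        vis = self.vision(pixel_values=pixel_values).last_hidden_state[:, 1:, :]
        img_tokens = self.proj(vis)                              # (B, N_patches, d_llm)
        txt_tokens = self.llm.get_input_embeddings()(input_ids)  # (B, T, d_llm)
        # Image tokens are prepended to the text sequence; the LLM then
        # predicts the response autoregressively over the combined sequence.
        inputs_embeds = torch.cat([img_tokens, txt_tokens], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)

In the actual LLaVA recipe, stage 1 trains only the projection on image-caption pairs for feature alignment, and stage 2 fine-tunes both the projection and the LLM on the GPT-4-generated instruction-following data.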

https://github.com/haotian-liu/LLaVA
7 projects · 6 cities
