
Technology

Multimodal LLM

Multimodal LLMs (MLLMs) process and reason across diverse data types (text, images, audio, and video), unifying human-like understanding in models such as GPT-4V and Gemini.

MLLMs are state-of-the-art large language models that move beyond text-only processing, integrating multiple modalities (data types) for richer context and reasoning. They encode inputs such as image pixels, audio waveforms, and text tokens into a shared embedding space, enabling cross-modal analysis. This capability supports complex tasks such as visual question answering (VQA), image captioning, and chart interpretation. Key models, including GPT-4V and Google’s Gemini, exemplify this shift: they can take a photo of a product and a spoken description of it simultaneously and deliver a single coherent, human-like response.
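The shared embedding space described above can be illustrated with a toy sketch. The encoders below are hypothetical stand-ins (real models use learned neural encoders and hundreds or thousands of dimensions), but the core idea is the same: each modality is mapped into vectors of the same dimensionality, so a single similarity measure can compare across modalities.

```python
import math

EMBED_DIM = 4  # toy dimensionality; real models use far larger spaces


def _normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def encode_text(tokens):
    """Hypothetical text encoder: buckets tokens into the shared space."""
    vec = [0.0] * EMBED_DIM
    for i, token in enumerate(tokens):
        vec[(len(token) + i) % EMBED_DIM] += 1.0
    return _normalize(vec)


def encode_image(pixels):
    """Hypothetical image encoder: pools pixel intensities into the same space."""
    vec = [0.0] * EMBED_DIM
    for i, p in enumerate(pixels):
        vec[i % EMBED_DIM] += p / 255.0
    return _normalize(vec)


def cross_modal_similarity(a, b):
    """Cosine similarity between two unit-normalized embeddings, in [-1, 1]."""
    return sum(x * y for x, y in zip(a, b))


# Both modalities land in the same 4-dimensional space and are directly comparable:
text_emb = encode_text(["red", "running", "shoe"])
image_emb = encode_image([200, 30, 40, 180, 25, 35, 190, 20])
score = cross_modal_similarity(text_emb, image_emb)
```

In production systems this comparison is done with learned encoders trained so that matching text-image pairs score higher than mismatched ones (the approach popularized by contrastive models such as CLIP); the toy hashing and pooling here only demonstrate the shared-space mechanics.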

https://openai.com/research/gpt-4v-system-card
