Set-of-Mark (SoM)

Set-of-Mark (SoM) is a visual prompting technique that segments an image into regions and overlays a mark on each one to improve the visual grounding of Large Multimodal Models (LMMs).

SoM is a visual prompting method designed to bring out the visual grounding ability of LMMs, specifically GPT-4V. The process has three steps. First, an off-the-shelf segmentation model (e.g., SEEM or SAM) partitions an image into distinct regions. Second, each region is overlaid with a mark drawn from a set of speakable marks (alphanumerics, masks, or boxes). Third, the marked image is given to the LMM as input, so the model can refer to regions by their marks. This gives it explicit spatial and object-relationship context, an area where it was previously weak. Empirical studies by the SoM authors report that, in a zero-shot setting, GPT-4V with SoM outperforms state-of-the-art, fully-finetuned models on fine-grained vision tasks such as RefCOCOg.
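The marking step can be illustrated with a minimal pure-Python sketch. It assumes the segmentation model has already produced one binary mask per region, here as nested lists of 0/1. The function names (`mark_anchor`, `assign_marks`, `build_prompt`) are hypothetical and are not part of the SoM repository. Placing each mark at the in-region pixel nearest the centroid is a simplification chosen for illustration, not necessarily the placement rule SoM uses. Rendering the marks onto the image and calling the LMM are left out.

```python
from typing import List, Tuple

# A mask is a 2D grid: 1 = pixel belongs to the region, 0 = background.
Mask = List[List[int]]
Point = Tuple[int, int]  # (row, col)


def mask_pixels(mask: Mask) -> List[Point]:
    """Return the (row, col) coordinates of every pixel inside the region."""
    return [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]


def mark_anchor(mask: Mask) -> Point:
    """Pick the in-region pixel closest to the region's centroid.

    Snapping to a pixel that is actually inside the mask keeps the mark on
    the region even when the region is concave (assumed heuristic).
    """
    pixels = mask_pixels(mask)
    if not pixels:
        raise ValueError("mask contains no region pixels")
    centre_r = sum(r for r, _ in pixels) / len(pixels)
    centre_c = sum(c for _, c in pixels) / len(pixels)
    return min(pixels, key=lambda p: (p[0] - centre_r) ** 2 + (p[1] - centre_c) ** 2)


def assign_marks(masks: List[Mask]) -> List[Tuple[str, Point]]:
    """Number regions from largest to smallest and place one mark per region."""
    ordered = sorted(masks, key=lambda m: len(mask_pixels(m)), reverse=True)
    return [(str(i), mark_anchor(m)) for i, m in enumerate(ordered, start=1)]


def build_prompt(marks: List[Tuple[str, Point]], question: str) -> str:
    """Compose a text prompt that tells the model which marks are on the image."""
    labels = ", ".join(label for label, _ in marks)
    return (
        f"The image is annotated with numeric marks {labels}. "
        f"{question} Answer with the mark number."
    )


if __name__ == "__main__":
    # Two toy regions on a 5x5 image: a 3x3 block and a single pixel.
    block = [[1 if r < 3 and c < 3 else 0 for c in range(5)] for r in range(5)]
    dot = [[1 if (r, c) == (4, 4) else 0 for c in range(5)] for r in range(5)]
    marks = assign_marks([dot, block])
    print(marks)
    print(build_prompt(marks, "Which region is in the top-left corner?"))
```

In a real pipeline the masks would come from a segmenter such as SAM or SEEM, the anchors would be passed to an image library to draw the labels, and the marked image plus the prompt would be sent to the LMM.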

https://github.com/microsoft/SoM

Related technologies

CogVLM (1) · GPT-4 Vision (2) · LLaVA (5) · Segment Anything Model (5)

Recent Talks & Demos

GPT-4 Vision: Set-of-Marks Grounding

San Francisco Dec 3

GPT-4 Vision · Segment Anything Model