Technology
Set of Marks
Set-of-Mark (SoM) is a visual prompting technique that segments an image into regions and overlays an alphanumeric mark on each one, improving the visual grounding of Large Multimodal Models (LMMs).
SoM is a visual prompting method designed to bring out the visual grounding ability of LMMs, specifically GPT-4V. The process has two steps. First, an off-the-shelf segmentation model (e.g., SEEM or SAM) partitions the image into distinct regions. Second, each region is overlaid with a speakable mark: an alphanumeric label, a mask, or a box. Given the marked image as input, the LMM can refer to regions by their marks, which gives it explicit anchors for questions about spatial layout and object relationships that it otherwise tends to handle poorly. Empirical studies indicate that SoM enables GPT-4V, in a zero-shot setting, to outperform state-of-the-art fully fine-tuned models on fine-grained vision benchmarks such as RefCOCOg.
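The two steps above can be illustrated with a small, dependency-free sketch. It assumes the segmentation has already been done (by SEEM, SAM, or any other model) and that each region arrives as a boolean mask. All function names here are hypothetical, and the centroid-snapping rule for placing a mark is a simple stand-in, not the mark placement used by the SoM authors. Drawing the labels onto pixels and calling the LMM are left out.

```python
from typing import Dict, List, Tuple

# A mask is a grid of booleans; True means the pixel belongs to the region.
Mask = List[List[bool]]


def rect_mask(height: int, width: int, r0: int, r1: int, c0: int, c1: int) -> Mask:
    """Toy stand-in for a segmentation result: a filled rectangle (bounds inclusive)."""
    return [[r0 <= r <= r1 and c0 <= c <= c1 for c in range(width)]
            for r in range(height)]


def mark_position(mask: Mask) -> Tuple[int, int]:
    """Pick the region pixel closest to the region's centroid.

    Snapping to a pixel inside the mask keeps the mark on the region
    even when the region is concave. This rule is an assumption made
    for illustration only.
    """
    pixels = [(r, c) for r, row in enumerate(mask)
              for c, inside in enumerate(row) if inside]
    if not pixels:
        raise ValueError("mask contains no region pixels")
    centre_r = sum(r for r, _ in pixels) / len(pixels)
    centre_c = sum(c for _, c in pixels) / len(pixels)
    return min(pixels, key=lambda p: (p[0] - centre_r) ** 2 + (p[1] - centre_c) ** 2)


def assign_marks(masks: List[Mask]) -> Dict[int, Tuple[int, int]]:
    """Number the regions 1..N and record where each numeric mark would be drawn."""
    return {mark: mark_position(mask) for mark, mask in enumerate(masks, start=1)}


def build_prompt(question: str, marks: Dict[int, Tuple[int, int]]) -> str:
    """Compose a text prompt that tells the model which marks appear in the image."""
    mark_list = ", ".join(str(mark) for mark in marks)
    return (f"The image is annotated with numeric marks {mark_list}. "
            f"{question} Answer with the mark number.")


# Usage: two toy regions on a 5x5 image.
masks = [rect_mask(5, 5, 0, 2, 0, 2), rect_mask(5, 5, 4, 4, 2, 4)]
marks = assign_marks(masks)
prompt = build_prompt("Which region is the dog?", marks)
print(marks)
print(prompt)
```

In a real pipeline the positions in `marks` would be used to render the labels onto the image before sending it, together with the prompt, to the LMM. The model's answer can then be mapped back to a region through its mark number.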