Vision Transformer

Vision Transformer (ViT) applies a standard Transformer encoder directly to sequences of image patches and, when pre-trained on large-scale datasets, matches or outperforms convolutional neural networks on image classification.

Introduced by Google Research in 2020 in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ViT splits an image into a sequence of fixed-size patches instead of processing it with convolutions. Each patch is flattened, linearly projected, and treated as a token, much like a word in NLP, so global self-attention can relate any patch to any other and capture long-range dependencies across the whole image. ViT lacks the built-in inductive biases of CNNs and needs large-scale pre-training on datasets such as JFT-300M to beat ResNet baselines. Given that data, it scales well, and the paper reports that it needs substantially less compute to pre-train than comparable ResNets. It has since become a widely used backbone for state-of-the-art computer vision.
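The patch-to-token step can be illustrated with a minimal, dependency-free sketch. It assumes an image stored as nested lists of shape H x W x C and a hypothetical helper named `patchify`. It covers only the splitting and flattening. The actual model additionally applies a learned linear projection to each flattened patch, prepends a class token, and adds position embeddings, none of which is shown here.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patches.

    Returns a list of (H // P) * (W // P) tokens in row-major patch order,
    each a flat list of P * P * C values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = []
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    token.extend(pixel)  # append all C channel values
            tokens.append(token)
    return tokens


# Toy 224 x 224 RGB image filled with zeros (illustrative input only).
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image, 16)
print(len(tokens), len(tokens[0]))  # 196 768
```

With 16x16 patches, a 224x224 RGB image becomes a sequence of (224 / 16)^2 = 196 tokens of 16 * 16 * 3 = 768 values each. Self-attention then operates over this sequence, so its cost grows with the square of the number of patches, not the number of pixels.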

https://arxiv.org/abs/2010.11929
