Vision Transformer
Vision Transformer (ViT) applies the standard Transformer architecture directly to sequences of image patches and, when pre-trained on large-scale datasets, can outperform convolutional neural networks.
Introduced by Google Research in 2020 in the paper "An Image is Worth 16x16 Words", ViT splits an image into a sequence of fixed-size patches (for example, 16x16 pixels) instead of processing it with stacked local convolutions. Each patch is treated as a token, similar to a word in NLP, so the model can use global self-attention to capture long-range dependencies across the entire image. ViT needs substantial pre-training on large datasets such as JFT-300M to beat ResNet baselines, but it scales well with data and compute and trains efficiently on TPU hardware, which has made it a standard backbone for state-of-the-art computer vision.
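The patch-to-token idea can be illustrated with a minimal pure-Python sketch. It is not the reference implementation: the function names `patchify` and `self_attention` are hypothetical, the attention uses identity query/key/value projections as a simplifying assumption, and the learned linear patch projection, class token, and position embeddings of the actual model are omitted.

```python
import math


def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    token.extend(image[row][col])  # append every channel of this pixel
            tokens.append(token)
    return tokens


def self_attention(tokens):
    """Single-head scaled dot-product attention.

    Assumption for illustration: queries, keys, and values are the tokens
    themselves (identity projections); a trained model learns these projections.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in tokens:
        # Every token is scored against every other token (global attention).
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append(
            [sum(w * value[i] for w, value in zip(weights, tokens)) for i in range(dim)]
        )
    return outputs


# Toy 4x4 single-channel image, split into 2x2 patches -> 4 tokens of length 4.
image = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
tokens = patchify(image, patch_size=2)
attended = self_attention(tokens)
```

With the same arithmetic, a 224x224 RGB image cut into 16x16 patches yields 14 x 14 = 196 tokens, each holding 16 x 16 x 3 = 768 values before projection.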