Flash Attention
Flash Attention is an IO-aware attention algorithm: it dramatically accelerates Transformer training and inference by fusing the attention operations and keeping intermediate results in fast on-chip SRAM rather than slower HBM.
Flash Attention addresses a critical bottleneck in large-scale AI: the "memory wall" of standard self-attention. Developed by Tri Dao and colleagues, the technique uses tiling and kernel fusion to compute the attention mechanism within fast on-chip SRAM (Static Random-Access Memory), avoiding slow, repeated reads and writes to the larger HBM (High Bandwidth Memory). This IO-awareness yields significant performance gains: up to a 3x speedup on models like GPT-2 and 10x memory savings at sequence length 2K. Because the full attention matrix is never materialized, the memory footprint grows linearly rather than quadratically with sequence length, enabling Large Language Models (LLMs) to handle massive context windows and Vision Transformers (ViT) to handle higher resolutions without running out of GPU memory.
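The core idea behind the tiling can be illustrated without any GPU code: process the key/value matrix in blocks while maintaining a running softmax maximum and normalizer, so the full N×N score matrix never exists in memory. The NumPy sketch below is a hypothetical illustration of that online-softmax recurrence (the real Flash Attention is a fused CUDA kernel; function names and the `block` parameter here are illustrative assumptions, not the actual API).

```python
import numpy as np

def naive_attention(Q, K, V):
    # Standard attention: materializes the full N x N score matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=64):
    # Flash-Attention-style tiling sketch: visit K/V in blocks,
    # carrying a running row-wise max (m) and normalizer (l) so only
    # one N x block score tile is live at a time.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)             # unnormalized output accumulator
    m = np.full(N, -np.inf)          # running row-wise max of scores
    l = np.zeros(N)                  # running softmax normalizer
    for j in range(0, N, block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = (Q @ Kj.T) * scale               # scores for this tile only
        m_new = np.maximum(m, S.max(axis=-1))
        alpha = np.exp(m - m_new)            # rescale old partial sums
        p = np.exp(S - m_new[:, None])
        l = l * alpha + p.sum(axis=-1)
        O = O * alpha[:, None] + p @ Vj
        m = m_new
    return O / l[:, None]
```

Numerically, the tiled version matches the naive one exactly (up to floating-point error); on a GPU, the payoff is that each tile fits in SRAM, which is what eliminates the repeated HBM traffic.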