Flash Attention Projects

Flash Attention

Flash Attention is an IO-aware attention algorithm: it dramatically accelerates Transformer training and inference by minimizing data movement between GPU memory tiers (keeping work in fast on-chip SRAM rather than slower HBM) and by fusing the attention operations into a single kernel.

This is a critical optimization for large-scale AI: Flash Attention removes the "memory wall" bottleneck of standard self-attention. Developed by Tri Dao and colleagues, the technique uses tiling and kernel fusion to compute the attention mechanism block by block within fast on-chip SRAM (Static Random-Access Memory), avoiding repeated reads and writes of the full attention matrix to the larger but slower HBM (High Bandwidth Memory). This IO-awareness yields significant performance gains: up to a 3x speedup on models like GPT-2 and 10x memory savings at sequence length 2K. Because the full attention matrix is never materialized, the memory footprint grows linearly rather than quadratically with sequence length, letting Large Language Models (LLMs) handle much longer context windows and Vision Transformers (ViT) handle higher resolutions without running out of GPU memory.
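The trick that makes tiling possible is the "online softmax": the softmax normalizer and the output are accumulated one key/value block at a time and rescaled whenever a new running maximum appears, so no block's scores ever need to be stored. Below is a minimal pure-Python sketch of this idea for a single query vector (illustrative only; the function names are hypothetical, and the real implementation is a fused CUDA kernel that processes whole blocks of queries at once):

```python
import math

def tiled_attention(q, keys, values, block_size=2):
    """Attention output for one query, computed one key/value block at a
    time with online-softmax rescaling (the Flash Attention tiling idea).
    q: list of floats; keys/values: lists of equal-length float lists."""
    d = len(q)
    scale = 1.0 / math.sqrt(d)
    m = float("-inf")   # running max of scores (numerical stability)
    l = 0.0             # running softmax denominator
    acc = [0.0] * d     # running, unnormalized weighted sum of values
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        m_new = max(m, max(scores))
        corr = math.exp(m - m_new)      # rescale previous accumulators
        l *= corr
        acc = [a * corr for a in acc]
        for s, v in zip(scores, v_blk):
            p = math.exp(s - m_new)
            l += p
            acc = [a + p * vi for a, vi in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]         # normalize once at the end

def full_attention(q, keys, values):
    """Reference: materialize all scores, softmax them, weight the values."""
    d = len(q)
    scale = 1.0 / math.sqrt(d)
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z for j in range(d)]
```

The tiled version holds only one block of scores at a time yet matches the reference to floating-point precision, which is exactly why the full quadratic attention matrix never needs to touch HBM.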

https://github.com/Dao-AILab/flash-attention

