Flash Attention
Flash Attention is an IO-aware attention algorithm: it dramatically accelerates Transformer training and inference by fusing the attention operations and keeping intermediate results in fast on-chip SRAM rather than slower HBM.
Flash Attention addresses a critical bottleneck in large-scale AI: the "memory wall" of standard self-attention. Developed by Tri Dao and colleagues, the technique uses tiling and kernel fusion to compute the attention mechanism within fast on-chip SRAM (Static Random-Access Memory), avoiding slow, repeated reads and writes to the larger HBM (High Bandwidth Memory). This IO-awareness yields significant performance gains: up to a 3x speedup on models like GPT-2 and 10x memory savings at sequence length 2K. Because the full attention matrix is never materialized, the memory footprint grows linearly rather than quadratically with sequence length, enabling Large Language Models (LLMs) to handle massive context windows and Vision Transformers (ViT) to handle higher resolutions without running out of GPU memory.
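The core idea behind the tiling can be illustrated without any GPU code: process the key/value matrix in blocks while maintaining a running softmax maximum and normalizer, so the full N×N score matrix never exists in memory. The NumPy sketch below is a hypothetical illustration of that online-softmax recurrence (the real Flash Attention is a fused CUDA kernel; function names and the `block` parameter here are illustrative assumptions, not the actual API).

```python
import numpy as np

def naive_attention(Q, K, V):
    # Standard attention: materializes the full N x N score matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=64):
    # Flash-Attention-style tiling sketch: visit K/V in blocks,
    # carrying a running row-wise max (m) and normalizer (l) so only
    # one N x block score tile is live at a time.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)             # unnormalized output accumulator
    m = np.full(N, -np.inf)          # running row-wise max of scores
    l = np.zeros(N)                  # running softmax normalizer
    for j in range(0, N, block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = (Q @ Kj.T) * scale               # scores for this tile only
        m_new = np.maximum(m, S.max(axis=-1))
        alpha = np.exp(m - m_new)            # rescale old partial sums
        p = np.exp(S - m_new[:, None])
        l = l * alpha + p.sum(axis=-1)
        O = O * alpha[:, None] + p @ Vj
        m = m_new
    return O / l[:, None]
```

Numerically, the tiled version matches the naive one exactly (up to floating-point error); on a GPU, the payoff is that each tile fits in SRAM, which is what eliminates the repeated HBM traffic.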