Diffusion Transformers
Diffusion Transformers (DiT) replace the conventional U-Net backbone in latent diffusion models with a pure Vision Transformer (ViT) architecture, improving the scalability and sample quality of image generation.
The Diffusion Transformer (DiT) is a scalable generative model architecture introduced by William Peebles and Saining Xie (2022). Rather than changing the diffusion process itself, it replaces the standard convolutional U-Net backbone with a Transformer operating on latent image patches: the latent produced by a VAE is split into fixed-size patches, each flattened into a token, and the resulting sequence is processed by standard Transformer blocks. This design leverages the Transformer's global self-attention, which proved critical for scaling. The largest configuration, DiT-XL/2 (675M parameters), achieved a state-of-the-art FID of 2.27 on the ImageNet 256x256 benchmark, demonstrating that the Transformer is a highly effective, scalable backbone for high-fidelity image synthesis.
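The patchify step described above can be sketched as follows. This is a minimal illustration, not the reference implementation; the function name `patchify` and the use of NumPy are assumptions for clarity. The "/2" in DiT-XL/2 denotes a patch size of 2, so a 32x32x4 latent (the shape produced by a typical latent-diffusion VAE at 256x256 resolution) becomes a sequence of 256 tokens.

```python
import numpy as np

def patchify(latent, patch_size):
    """Split a latent feature map of shape (C, H, W) into a sequence of
    flattened patch tokens of shape (num_patches, patch_size*patch_size*C),
    mirroring the ViT-style tokenization that DiT applies to VAE latents."""
    c, h, w = latent.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (C, gh, p, gw, p) -> (gh, gw, p, p, C) -> (gh*gw, p*p*C)
    x = latent.reshape(c, gh, patch_size, gw, patch_size)
    x = x.transpose(1, 3, 2, 4, 0)
    return x.reshape(gh * gw, patch_size * patch_size * c)

# Example: a 4-channel 32x32 latent with patch size 2 (as in DiT-XL/2)
tokens = patchify(np.zeros((4, 32, 32)), 2)
print(tokens.shape)  # (256, 16): 256 tokens, each 2*2*4 = 16 dims
```

Each token is then linearly projected to the model's hidden dimension and combined with positional embeddings before entering the Transformer blocks.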