

QLoRA

QLoRA (Quantized Low-Rank Adaptation) is an efficient fine-tuning approach: it combines 4-bit quantization with Low-Rank Adapters (LoRA) to drastically reduce the memory footprint of Large Language Model (LLM) fine-tuning without sacrificing performance.

QLoRA makes LLM fine-tuning feasible in settings where it previously was not. The core mechanism backpropagates gradients through a frozen, 4-bit quantized pretrained model into small, trainable 16-bit LoRA adapters. This cuts memory requirements enough to fine-tune a 65B-parameter model on a single 48GB GPU while preserving full 16-bit fine-tuning performance. Key innovations include 4-bit NormalFloat (NF4) quantization, Double Quantization (quantizing the quantization constants themselves), and Paged Optimizers for absorbing memory spikes. The resulting Guanaco model family, for example, reached 99.3% of ChatGPT's performance on the Vicuna benchmark, showing that near state-of-the-art results are attainable on widely accessible hardware.
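The mechanism above can be illustrated with a minimal sketch: the pretrained weight matrix is stored in quantized 4-bit form with one scale constant per block, and only the low-rank adapter matrices A and B remain in higher precision and receive gradients. This is a simplified illustration, not the paper's implementation — it uses plain blockwise absmax quantization in place of NF4, NumPy in place of a deep-learning framework, and all names (`quantize_4bit`, `forward`, etc.) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_4bit(w, block=64):
    """Blockwise 4-bit absmax quantization (simplified stand-in for NF4)."""
    flat = w.ravel()
    pad = (-len(flat)) % block
    flat = np.concatenate([flat, np.zeros(pad, dtype=flat.dtype)])
    blocks = flat.reshape(-1, block)
    # One floating-point scale per block; Double Quantization would
    # additionally quantize these scales to save more memory.
    scales = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12
    q = np.round(blocks / scales * 7).astype(np.int8)  # 15 levels in -7..7 (~4 bits)
    return q, scales, w.shape, pad

def dequantize_4bit(q, scales, shape, pad):
    flat = ((q.astype(np.float32) / 7) * scales).ravel()
    if pad:
        flat = flat[:-pad]
    return flat.reshape(shape)

# Frozen pretrained weight, stored only in 4-bit form.
d_out, d_in, r, alpha = 32, 32, 4, 8
W = rng.normal(size=(d_out, d_in)).astype(np.float32)
qW = quantize_4bit(W)

# Trainable 16-bit LoRA adapters. B starts at zero, so the adapted
# layer initially computes exactly the (dequantized) base layer.
A = rng.normal(scale=0.01, size=(r, d_in)).astype(np.float16)
B = np.zeros((d_out, r), dtype=np.float16)

def forward(x):
    W_deq = dequantize_4bit(*qW)  # dequantize on the fly for the matmul
    return W_deq @ x + (alpha / r) * (B @ (A @ x)).astype(np.float32)

x = rng.normal(size=d_in).astype(np.float32)
y = forward(x)
# During fine-tuning, gradients flow through W_deq into A and B only;
# the 4-bit qW is never updated.
```

The key design point is that the base weights are touched only through dequantize-and-multiply, so their storage cost is roughly 4 bits per parameter plus the per-block scales, while the trainable state (A, B, optimizer moments) is tiny by comparison.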

https://arxiv.org/abs/2305.14314
