QLoRA
QLoRA (Quantized Low-Rank Adaptation) is an efficient finetuning approach: it uses 4-bit quantization and Low-Rank Adapters (LoRA) to drastically reduce Large Language Model (LLM) memory usage without sacrificing performance.
QLoRA makes LLM fine-tuning feasible at scales that were previously out of reach. The core mechanism backpropagates gradients through a frozen, 4-bit quantized pretrained model into small, trainable 16-bit LoRA adapters. This cuts memory requirements dramatically: a 65B-parameter model can be fine-tuned on a single 48GB GPU while preserving full 16-bit fine-tuning performance. Key innovations include 4-bit NormalFloat (NF4) quantization, Double Quantization, and Paged Optimizers for managing memory spikes. The resulting Guanaco model family, for example, reached 99.3% of ChatGPT's performance on the Vicuna benchmark, demonstrating state-of-the-art results on a single GPU.
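To illustrate the adapter mechanism described above, here is a minimal, hypothetical sketch of the LoRA side of QLoRA in plain Python. The base weight W stands in for the frozen, 4-bit quantized pretrained weight (quantization itself is omitted); only the two small low-rank matrices A and B would receive gradients. The dimensions, the `alpha` value, and the helper names are illustrative assumptions, not part of any specific library API.

```python
import random

random.seed(0)

def matmul(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

d_in, d_out, r = 8, 8, 2   # rank r << d, so A and B hold far fewer parameters
alpha = 16                 # LoRA scaling hyperparameter (assumed value)
scaling = alpha / r

# Frozen pretrained weight: stand-in for the dequantized 4-bit NF4 base weight.
W = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]

# Trainable 16-bit adapters: A is randomly initialized, B starts at zero so
# the adapter contributes nothing at initialization (standard LoRA init).
A = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]

def forward(x):
    base = matmul(W, x)              # frozen path: no gradients flow into W
    delta = matmul(B, matmul(A, x))  # low-rank trainable path
    return [b + scaling * d for b, d in zip(base, delta)]

x = [1.0] * d_in
# Because B is zero-initialized, the adapted forward pass initially matches
# the frozen base model exactly.
print(forward(x) == matmul(W, x))  # True
```

During training, only A and B are updated, so optimizer state is kept for a tiny fraction of the total parameters, which is where most of the memory savings beyond quantization come from.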