
Technology

LoRA

LoRA (Low-Rank Adaptation) is a Parameter-Efficient Fine-Tuning (PEFT) technique that adapts LLMs by freezing the pre-trained weights and injecting small, trainable rank-decomposition matrices into the Transformer layers.

LoRA is a widely used method for adapting Large Language Models (LLMs) without the prohibitive cost of full fine-tuning. The method freezes the original model weights and injects lightweight, low-rank matrices (A and B) into the Transformer layers, typically targeting the attention projection weights (Wq and Wv). This delivers large resource savings: for a model like GPT-3 175B, LoRA reduces the number of trainable parameters by up to 10,000 times and cuts GPU memory requirements by a factor of 3. Crucially, because the update is linear, the new matrices can be merged into the original weights at deployment time, adding zero inference latency.
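The update and the merge step can be sketched in a few lines. This is a minimal illustrative example, not the paper's implementation: it models one linear layer with a frozen weight W and trainable factors A and B, and checks that merging B·A into W reproduces the adapter-path output exactly (all dimensions and names below are assumptions for illustration; the paper initializes B to zero, while random values are used here so the equivalence check is non-trivial).

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 4, 8  # illustrative sizes; r is the LoRA rank

W = rng.normal(size=(d_in, d_out))           # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(d_in, r))   # trainable down-projection
B = rng.normal(scale=0.01, size=(r, d_out))  # trainable up-projection
scale = alpha / r                            # LoRA scaling factor

x = rng.normal(size=(1, d_in))

# Training-time forward pass: frozen base path plus low-rank adapter path.
h_adapter = x @ W + scale * (x @ A @ B)

# Deployment: fold the adapter into W, so inference needs no extra matmuls.
W_merged = W + scale * (A @ B)
h_merged = x @ W_merged

assert np.allclose(h_adapter, h_merged)  # identical outputs, zero added latency
print(f"trainable params: {A.size + B.size} vs full layer: {W.size}")
```

Note how only A and B (d_in·r + r·d_out parameters) would receive gradients, while W (d_in·d_out parameters) stays frozen; this gap is the source of the parameter savings, and it widens as the layers grow relative to the rank r.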

https://arxiv.org/abs/2106.09685