Quantization
Quantization is a model optimization technique that converts high-precision parameters (e.g., 32-bit floats) to lower-bit formats (e.g., 8-bit integers), drastically reducing memory footprint and boosting inference speed.
Quantization directly addresses the significant resource demands of modern Deep Neural Networks (DNNs) and Large Language Models (LLMs). The core process maps a model’s high-precision weights and activations, typically FP32, down to lower-precision integers such as INT8 or INT4, yielding 4x compression for INT8 and 8x for INT4. This compression is critical: it slashes memory usage, reduces computational cost, and enables high-speed, low-latency inference on resource-constrained hardware (e.g., mobile devices, IoT). Engineers typically employ one of two main strategies: Post-Training Quantization (PTQ), which converts the model after training, or Quantization-Aware Training (QAT), which simulates the low-precision environment during training for superior accuracy retention.
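The FP32-to-INT8 mapping described above can be sketched as a simple affine (scale and zero-point) quantizer. This is a minimal illustration, not any particular framework's implementation; the function names and per-tensor scheme are assumptions for the example.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Affine quantization of an FP32 tensor to INT8.

    Uses a single (per-tensor) scale and zero-point, an assumption
    made for simplicity; real toolchains often quantize per-channel.
    """
    qmin, qmax = -128, 127
    # Scale maps the observed FP32 range onto the INT8 range.
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover an FP32 approximation from the INT8 tensor."""
    return (q.astype(np.float32) - zero_point) * scale

# Demo: a deterministic FP32 weight tensor in [-1, 1].
w = np.linspace(-1.0, 1.0, 16, dtype=np.float32).reshape(4, 4)
q, scale, zero_point = quantize_int8(w)
w_hat = dequantize(q, scale, zero_point)
# Storage drops from 4 bytes to 1 byte per parameter (4x compression);
# the reconstruction error is bounded by roughly one quantization step.
```

PTQ amounts to running a calibration pass to pick `scale` and `zero_point` after training, while QAT inserts this quantize/dequantize round trip into the forward pass during training so the model learns to tolerate the rounding error.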