Quantization
Quantization is a model optimization technique that converts high-precision parameters (e.g., 32-bit floats) to lower-bit formats (e.g., 8-bit integers), drastically reducing memory footprint and boosting inference speed.
Quantization directly addresses the significant resource demands of modern Deep Neural Networks (DNNs) and Large Language Models (LLMs). The core process maps a model’s high-precision weights and activations, typically FP32, down to lower-precision integers such as INT8 or INT4, yielding 4x compression for INT8 and 8x for INT4. This compression is critical: it slashes memory usage, reduces computational cost, and enables high-speed, low-latency inference on resource-constrained hardware (e.g., mobile devices, IoT). Engineers typically employ one of two main strategies: Post-Training Quantization (PTQ), which converts the model after training, or Quantization-Aware Training (QAT), which simulates the low-precision environment during training for superior accuracy retention.
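The FP32-to-INT8 mapping described above can be sketched as a simple affine (scale and zero-point) quantizer. This is a minimal illustration, not any particular framework's implementation; the function names and per-tensor scheme are assumptions for the example.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Affine quantization of an FP32 tensor to INT8.

    Uses a single (per-tensor) scale and zero-point, an assumption
    made for simplicity; real toolchains often quantize per-channel.
    """
    qmin, qmax = -128, 127
    # Scale maps the observed FP32 range onto the INT8 range.
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover an FP32 approximation from the INT8 tensor."""
    return (q.astype(np.float32) - zero_point) * scale

# Demo: a deterministic FP32 weight tensor in [-1, 1].
w = np.linspace(-1.0, 1.0, 16, dtype=np.float32).reshape(4, 4)
q, scale, zero_point = quantize_int8(w)
w_hat = dequantize(q, scale, zero_point)
# Storage drops from 4 bytes to 1 byte per parameter (4x compression);
# the reconstruction error is bounded by roughly one quantization step.
```

PTQ amounts to running a calibration pass to pick `scale` and `zero_point` after training, while QAT inserts this quantize/dequantize round trip into the forward pass during training so the model learns to tolerate the rounding error.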