llama.cpp
The definitive C/C++ library for high-performance LLM inference: run quantized GGUF models on commodity hardware, CPU-first.
llama.cpp is a C/C++ engine for efficient Large Language Model (LLM) inference, built on the GGML tensor library. It follows a CPU-first philosophy, enabling models up to 70B parameters to run locally on diverse hardware (laptops, Raspberry Pi) with minimal dependencies. This efficiency is driven by the GGUF file format and quantization at bit widths from 1.5-bit to 8-bit, which lets 7B-parameter models run comfortably in just 4–8 GB of RAM. The project provides a fast, hackable framework, including command-line tools (`llama-cli`) and an OpenAI-compatible web server (`llama-server`), with state-of-the-art performance across backends, including robust GPU support.
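The 4–8 GB figure follows from simple arithmetic on weight storage. A rough sketch (the helper below is hypothetical, not part of llama.cpp, and counts weights only, ignoring KV-cache and activation overhead):

```python
# Back-of-envelope estimate of weight memory for a quantized model.
# Hypothetical helper for illustration; not a llama.cpp API.
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate storage in GB for n_params weights at a given bit width."""
    return n_params * bits_per_weight / 8 / 1e9

# A 7B-parameter model at a few of the supported bit widths:
for bits in (1.5, 4.0, 8.0, 16.0):
    print(f"{bits:>4} bits/weight -> {quantized_size_gb(7e9, bits):.2f} GB")
# 4-bit lands at 3.50 GB and 8-bit at 7.00 GB, which is why a 7B model
# fits in the 4-8 GB range, versus 14 GB for unquantized 16-bit weights.
```

Actual resident memory is somewhat higher once the KV cache and runtime buffers are included, but the weight term dominates for small context sizes.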