

vLLM

vLLM is a high-throughput, memory-efficient inference engine for LLMs: it uses PagedAttention to maximize GPU memory utilization and cut serving costs.

vLLM is an open-source library engineered for high-throughput, low-latency LLM serving. Its core innovation is PagedAttention, a memory-management technique inspired by virtual memory and paging in operating systems, which stores the Key-Value (KV) cache in fixed-size blocks rather than one contiguous buffer per request. This sharply reduces memory fragmentation and waste (reportedly by up to 90% in some workloads) and enables continuous batching of incoming requests. The result: significantly higher request capacity on the same hardware, fewer GPUs for the same load, and a production-ready, cost-effective serving system that supports popular model families like Llama and Mistral, complete with an OpenAI-compatible API server.
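To make the paging analogy concrete, here is a toy sketch of the block-table idea behind PagedAttention. This is an illustration only, not vLLM's internals: the class names, the 16-token block size, and the allocator interface are all assumptions for the example.

```python
# Toy sketch of paged KV-cache allocation (illustration only; names and
# sizes are assumptions, not vLLM internals).
BLOCK_SIZE = 16  # tokens stored per physical KV-cache block (assumed)

class BlockAllocator:
    """Hands out fixed-size cache blocks from a shared free pool,
    so many sequences can share one GPU memory region."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)

class Sequence:
    """One request's logical KV cache: a block table maps logical
    positions to whatever physical blocks happened to be free."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # physical block ids, in logical order
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full,
        # so at most BLOCK_SIZE - 1 slots are ever wasted per sequence,
        # instead of pre-reserving space for the maximum output length.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(40):  # generate 40 tokens
    seq.append_token()
print(len(seq.block_table))  # 40 tokens in 16-token blocks -> 3 blocks
```

The key point of the sketch: memory is committed one small block at a time as tokens are generated, which is what lets the real engine pack many concurrent requests into the same GPU memory.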

https://vllm.ai/
33 projects · 25 cities
