

vLLM

vLLM is a high-throughput, memory-efficient LLM inference engine: it uses PagedAttention to maximize GPU utilization and cut serving costs.

vLLM (Virtual Large Language Model) is an open-source library engineered for high-throughput, low-latency LLM serving. Its core innovation is PagedAttention, a memory-management technique inspired by OS virtual memory that stores the Key-Value (KV) cache in fixed-size blocks instead of one contiguous allocation. This drastically reduces memory waste (up to 90% in some reported cases) and enables continuous batching of requests. The result: significantly higher request capacity on the same hardware, lower GPU cost per token, and a production-ready, cost-effective serving system that supports popular models such as Llama and Mistral, complete with an OpenAI-compatible API server.
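As a sketch of what "OpenAI-compatible" means in practice: once a server is running (e.g. `vllm serve <model>`, which listens on port 8000 by default), any OpenAI-style client can talk to it. The snippet below builds and sends a chat-completions request with only the standard library; the model name, port, and `max_tokens` value are illustrative assumptions, not values prescribed by vLLM.

```python
import json
from urllib import request


def build_payload(prompt: str,
                  model: str = "meta-llama/Llama-3.1-8B-Instruct") -> dict:
    """Assemble an OpenAI-style chat-completions payload.

    The model name is illustrative; use whatever model the vLLM
    server was actually launched with.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,  # illustrative cap on generated tokens
    }


def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    """POST the payload to a running vLLM OpenAI-compatible server
    and return the generated message text."""
    req = request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```

Because the endpoint mirrors the OpenAI REST API, the official `openai` client library also works unchanged by pointing its `base_url` at the vLLM server.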

https://vllm.ai/
33 projects · 25 cities


Recent Talks & Demos

Showing 1-24 of 33


MARSYS: Multi-Agent Workflows · Lausanne, Apr 30 · Python, OpenAI API
Nixus: Orchestrating Agentic Infrastructure · Hong Kong, Apr 29 · Nix, NixOS
Scaling 780k Page Hybrid Search · Poland, Apr 23 · FastAPI, vLLM
Virtual Model Endpoints: Unlimited Context · Seattle, Apr 13 · vLLM, Rust
AitherOS: Agent Operating System · Los Angeles, Mar 19 · Python, FastAPI
VLLM and Qdrant: GPU Benchmarking · Manchester, NH, Mar 18 · vLLM, Qdrant
Your Brand Translator · Paris, Mar 17 · OpenClaw, Qwen3
Words to World: AI Models · San Diego, Feb 26 · Unreal Engine 5, PyTorch
Transformer Lab: Local to Distributed ML · Toronto, Jan 29 · Kubernetes, SLURM
LLM Probing for Immediate Inference · Montreal, Jan 21 · LLaMA-8B, vLLM
Quiet Local AI Inferencing · Hong Kong, Jan 20 · llama, vLLM
Qwen-3-VL Sovereign Document Analytics · Berlin, Nov 12 · Qwen3-VL-4B-Instruct, vLLM
vLLM: Guided Recommendations · Toronto, Oct 30 · vLLM, OpenWeight
Strix Halo Unified Memory AI · Tokyo, Oct 10 · GPT-4, Llama-2
Confidential LLMs on Multi-GPU · San Francisco, Sep 24 · NVIDIA H200, vLLM
Hexagone: Anonymize Data for AI · Paris, Sep 18 · vLLM, Transformers
Claude Coach · San Francisco, Aug 21 · Synth, Next
Production LLM Cost Optimization · Orange County, Jul 31 · Transformers, vLLM
Assembly of Experts: Chimera LLMs · Munich, Jul 25 · vLLM, AMD MI325X
Mistral 7B On-Premise Wi-Fi Agent · Medellín, Jun 26 · Gemini, Mistral 7B
Kalavai: AI Cloud from Idle Devices · Liverpool, Jun 26 · vLLM, llama
LLM Organ Lesion Prediction · Milan, Jun 10 · vLLM, Qwen
GRPO Image Model Fine-tuning · Los Angeles, Jun 9 · Oxen, Flux
DSPy: Self-Programming Meta-Agents · New York City, Jun 3 · DSPY, vLLM