

vLLM

vLLM is a high-throughput, memory-efficient LLM inference engine: it uses PagedAttention to maximize GPU utilization and cut serving costs.

vLLM (Virtual Large Language Model) is an open-source library engineered for high-throughput, low-latency LLM serving. Its core innovation is PagedAttention, a memory-management technique inspired by OS virtual memory that stores the Key-Value (KV) cache in fixed-size blocks instead of one contiguous allocation. This drastically reduces memory waste (up to 90% in some reported cases) and enables continuous batching of requests. The result: significantly higher request capacity on the same hardware, lower GPU cost per token, and a production-ready, cost-effective serving system that supports popular models such as Llama and Mistral, complete with an OpenAI-compatible API server.
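As a sketch of what "OpenAI-compatible" means in practice: once a server is running (e.g. `vllm serve <model>`, which listens on port 8000 by default), any OpenAI-style client can talk to it. The snippet below builds and sends a chat-completions request with only the standard library; the model name, port, and `max_tokens` value are illustrative assumptions, not values prescribed by vLLM.

```python
import json
from urllib import request


def build_payload(prompt: str,
                  model: str = "meta-llama/Llama-3.1-8B-Instruct") -> dict:
    """Assemble an OpenAI-style chat-completions payload.

    The model name is illustrative; use whatever model the vLLM
    server was actually launched with.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,  # illustrative cap on generated tokens
    }


def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    """POST the payload to a running vLLM OpenAI-compatible server
    and return the generated message text."""
    req = request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```

Because the endpoint mirrors the OpenAI REST API, the official `openai` client library also works unchanged by pointing its `base_url` at the vLLM server.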

https://vllm.ai/
33 projects · 25 cities


Recent Talks & Demos

Showing 1-24 of 33


MARSYS: Multi-Agent Workflows · Lausanne, Apr 30 · Python, OpenAI API
Nixus: Orchestrating Agentic Infrastructure · Hong Kong, Apr 29 · Nix, NixOS
Scaling 780k Page Hybrid Search · Poland, Apr 23 · FastAPI, vLLM
Virtual Model Endpoints: Unlimited Context · Seattle, Apr 13 · vLLM, Rust
AitherOS: Agent Operating System · Los Angeles, Mar 19 · Python, FastAPI
VLLM and Qdrant: GPU Benchmarking · Manchester, NH, Mar 18 · vLLM, Qdrant
Your Brand Translator · Paris, Mar 17 · OpenClaw, Qwen3
Words to World: AI Models · San Diego, Feb 26 · Unreal Engine 5, PyTorch
Transformer Lab: Local to Distributed ML · Toronto, Jan 29 · Kubernetes, SLURM
LLM Probing for Immediate Inference · Montreal, Jan 21 · LLaMA-8B, vLLM
Quiet Local AI Inferencing · Hong Kong, Jan 20 · llama, vLLM
Qwen-3-VL Sovereign Document Analytics · Berlin, Nov 12 · Qwen3-VL-4B-Instruct, vLLM
vLLM: Guided Recommendations · Toronto, Oct 30 · vLLM, OpenWeight
Strix Halo Unified Memory AI · Tokyo, Oct 10 · GPT-4, Llama-2
Confidential LLMs on Multi-GPU · San Francisco, Sep 24 · NVIDIA H200, vLLM
Hexagone: Anonymize Data for AI · Paris, Sep 18 · vLLM, Transformers
Claude Coach · San Francisco, Aug 21 · Synth, Next
Production LLM Cost Optimization · Orange County, Jul 31 · Transformers, vLLM
Assembly of Experts: Chimera LLMs · Munich, Jul 25 · vLLM, AMD MI325X
Mistral 7B On-Premise Wi-Fi Agent · Medellín, Jun 26 · Gemini, Mistral 7B
Kalavai: AI Cloud from Idle Devices · Liverpool, Jun 26 · vLLM, llama
LLM Organ Lesion Prediction · Milan, Jun 10 · vLLM, Qwen
GRPO Image Model Fine-tuning · Los Angeles, Jun 9 · Oxen, Flux
DSPy: Self-Programming Meta-Agents · New York City, Jun 3 · DSPY, vLLM