Technology
vLLM
vLLM: The high-performance inference engine, leveraging PagedAttention to deliver up to 24x higher throughput for LLM serving.
vLLM (Virtual Large Language Model) is a leading open-source library for high-throughput LLM inference and serving. Its core innovation is PagedAttention, an algorithm that manages the Key-Value (KV) cache the way an operating system manages virtual memory, dramatically boosting throughput and reducing latency. Originating in the Sky Computing Lab at UC Berkeley, the project supports continuous batching and integrates seamlessly with popular models (e.g., Llama, Mixtral) and diverse hardware (NVIDIA, AMD). Its OpenAI-compatible API server lets you deploy production-grade models with high efficiency and scalability.
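The idea behind PagedAttention can be illustrated with a toy block-table allocator. This is a conceptual sketch only, not vLLM's actual implementation: the names (`BlockAllocator`, `Sequence`, `BLOCK_SIZE`) and the block size of 16 are illustrative assumptions. It shows the key point that KV-cache memory is handed out in fixed-size blocks on demand, like OS page frames, rather than pre-reserved contiguously per sequence.

```python
# Conceptual sketch of PagedAttention-style KV-cache paging.
# NOT vLLM's real code; all names and values here are illustrative.

BLOCK_SIZE = 16  # tokens stored per KV-cache block (assumed value)

class BlockAllocator:
    """Hands out fixed-size cache blocks from a shared pool, like OS page frames."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, block: int) -> None:
        self.free.append(block)

class Sequence:
    """Maps a sequence's logical token positions to physical blocks via a block table."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the current one fills up,
        # so at most BLOCK_SIZE - 1 slots are ever wasted per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def free(self) -> None:
        # Return all blocks to the shared pool when the sequence finishes,
        # making them immediately reusable by other requests in the batch.
        for b in self.block_table:
            self.allocator.release(b)
        self.block_table.clear()
        self.num_tokens = 0

allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator)
for _ in range(40):           # 40 tokens -> ceil(40/16) = 3 blocks
    seq.append_token()
print(len(seq.block_table))   # 3
seq.free()
print(len(allocator.free))    # 8 (all blocks back in the pool)
```

Because blocks are allocated lazily and returned to a shared pool, many concurrent sequences can pack into the same GPU memory, which is what enables continuous batching at high throughput.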
Related technologies
Recent Talks & Demos