
Triton Inference Server

Deploy AI models from any major framework (TensorFlow, PyTorch, ONNX) with optimized performance: Triton handles concurrent model execution and dynamic batching to maximize throughput on NVIDIA GPUs and CPUs.

Triton Inference Server is a dedicated, open-source engine for high-performance AI deployment. It streamlines production inference by supporting major frameworks (TensorRT, PyTorch, TensorFlow, and ONNX) across diverse hardware, including NVIDIA GPUs and x86 and ARM CPUs. The server's core strengths are its optimization features: dynamic batching, concurrent model execution, and ensemble pipelines. For instance, teams have used dynamic batching to go from 2 RPS to roughly 15 RPS per GPU on large computer vision models, turning previously impractical projects into viable ones. Use Triton to manage a model repository and serve real-time, batched, or streaming requests over HTTP/REST or gRPC with consistently low latency and high hardware utilization.
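
To make the batching and concurrency features concrete, here is a minimal sketch of how they are typically enabled in a model's config.pbtxt. The model name (resnet50_onnx), tensor names, and shapes are illustrative assumptions, not taken from any specific project; instance_group and dynamic_batching are the standard Triton settings for concurrent model execution and dynamic batching.

```protobuf
# Hypothetical model repository layout (names are assumptions):
# model_repository/
#   resnet50_onnx/
#     config.pbtxt        <- this file
#     1/
#       model.onnx
name: "resnet50_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 32

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

# Concurrent model execution: run two copies of the model per GPU.
instance_group [ { count: 2, kind: KIND_GPU } ]

# Dynamic batching: let Triton merge individual requests into larger
# batches, waiting at most 100 microseconds to fill one.
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 100
}
```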
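
On the client side, the sketch below sends one inference request over HTTP/REST using the official tritonclient Python package; the model name, tensor names, and shapes assume the illustrative config above, and the server is assumed to be running locally on Triton's default HTTP port 8000.

```python
# Minimal HTTP inference client sketch; assumes Triton is serving the
# illustrative "resnet50_onnx" model above at localhost:8000.
# Install with: pip install numpy tritonclient[http]
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build a single-image batch; name, shape, and dtype must match config.pbtxt.
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)

# Under dynamic batching, Triton may merge this request with others
# arriving within the same queue-delay window before running the model.
result = client.infer(
    model_name="resnet50_onnx",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("output")],
)
print(result.as_numpy("output").shape)  # e.g. (1, 1000)
```

The tritonclient.grpc module exposes the same client calls over gRPC (default port 8001) for lower-overhead transport.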

https://github.com/triton-inference-server/server
