TensorRT-LLM: High-Throughput Embeddings
Using an optimized embedding runtime based on TensorRT-LLM, I'll demonstrate high-throughput backfill and low-latency retrieval that benchmarks at up to twice the performance of other embedding runtimes (TEI, vLLM), with code, benchmarks, and architecture insights.
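To make the comparison concrete, here is a minimal sketch of the kind of throughput benchmark such a claim rests on. It assumes the runtime exposes an OpenAI-compatible /v1/embeddings endpoint (as TEI and vLLM both can); the endpoint URL, model name, batch size, and concurrency below are placeholder assumptions, not details from the talk.

```python
# Hypothetical throughput benchmark for an embedding runtime.
# Assumes an OpenAI-compatible /v1/embeddings endpoint; the URL,
# model name, and workload parameters are placeholders.
import time
import concurrent.futures
import requests

ENDPOINT = "http://localhost:8000/v1/embeddings"  # placeholder URL
MODEL = "my-embedding-model"                      # placeholder model name
BATCH_SIZE = 32     # texts per request
NUM_BATCHES = 100   # total requests to send
CONCURRENCY = 8     # parallel in-flight requests

def embed_batch(texts):
    """Send one batch of texts; return the number of embeddings produced."""
    resp = requests.post(
        ENDPOINT,
        json={"model": MODEL, "input": texts},
        timeout=60,
    )
    resp.raise_for_status()
    return len(resp.json()["data"])

def main():
    batch = [f"document {i}" for i in range(BATCH_SIZE)]
    start = time.perf_counter()
    # Fire NUM_BATCHES identical requests with CONCURRENCY worker threads.
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        counts = list(pool.map(lambda _: embed_batch(batch), range(NUM_BATCHES)))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    print(f"{total} embeddings in {elapsed:.2f}s -> {total / elapsed:.1f} emb/s")

if __name__ == "__main__":
    main()
```

Running the same script against each runtime with identical batch sizes and concurrency gives a like-for-like embeddings-per-second comparison; a latency-oriented run would instead send single-item inputs and report per-request timings.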
Related projects
Vector + Graph Friends
San Francisco
Shows a hybrid vector‑graph RAG system that creates personalized event emails using a knowledge graph and vector search,…
Artecon - A hotspot for AI
Seattle
Learn how to run CPU‑based ML models with low latency, using small public models and post‑processing, then bundle…
Open source Anthropic's Artifacts on steroids
San Francisco
This talk demonstrates an open source template for building customizable AI Artifacts UIs like Anthropic’s Claude, including deployable…
Vector Data Exploration
Mumbai
Learn how to turn any dataset into interactive visual maps using vector embeddings, clustering, labeling, and LLMs, with…
Vector Search - Vibe Coded, deployed and running
San Francisco
Learn how to build, test, and deploy a production‑grade vector search service by using Claude‑generated Vibe code, conductor.build,…
Measuring embedding API latency: the meh, the slow and the slowest
Berlin
An empirical comparison of embedding API latency from OpenAI, Cohere, Google Vertex AI, and Jina, examining temporal and…