TensorRT-LLM: High-Throughput Embeddings
Using an optimized embedding runtime based on TensorRT-LLM, I'll demonstrate high-throughput backfill and low-latency retrieval that benchmarks at up to twice the performance of other embedding runtimes (TEI, vLLM), with code, benchmarks, and architecture insights.
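To make the comparison concrete, here is a minimal sketch of the kind of throughput benchmark such a claim rests on. It assumes the runtime exposes an OpenAI-compatible /v1/embeddings endpoint (as TEI and vLLM both can); the endpoint URL, model name, batch size, and concurrency below are placeholder assumptions, not details from the talk.

```python
# Hypothetical throughput benchmark for an embedding runtime.
# Assumes an OpenAI-compatible /v1/embeddings endpoint; the URL,
# model name, and workload parameters are placeholders.
import time
import concurrent.futures
import requests

ENDPOINT = "http://localhost:8000/v1/embeddings"  # placeholder URL
MODEL = "my-embedding-model"                      # placeholder model name
BATCH_SIZE = 32     # texts per request
NUM_BATCHES = 100   # total requests to send
CONCURRENCY = 8     # parallel in-flight requests

def embed_batch(texts):
    """Send one batch of texts; return the number of embeddings produced."""
    resp = requests.post(
        ENDPOINT,
        json={"model": MODEL, "input": texts},
        timeout=60,
    )
    resp.raise_for_status()
    return len(resp.json()["data"])

def main():
    batch = [f"document {i}" for i in range(BATCH_SIZE)]
    start = time.perf_counter()
    # Fire NUM_BATCHES identical requests with CONCURRENCY worker threads.
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        counts = list(pool.map(lambda _: embed_batch(batch), range(NUM_BATCHES)))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    print(f"{total} embeddings in {elapsed:.2f}s -> {total / elapsed:.1f} emb/s")

if __name__ == "__main__":
    main()
```

Running the same script against each runtime with identical batch sizes and concurrency gives a like-for-like embeddings-per-second comparison; a latency-oriented run would instead send single-item inputs and report per-request timings.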
Related projects
Vector + Graph Friends
San Francisco
Shows a hybrid vector‑graph RAG system that creates personalized event emails using a knowledge graph and vector search,…
Artecon - A hotspot for AI
Seattle
Learn how to run CPU‑based ML models with low latency, using small public models and post‑processing, then bundle…
Open source Anthropic's Artifacts on steroids
San Francisco
This talk demonstrates an open source template for building customizable AI Artifacts UIs like Anthropic’s Claude, including deployable…
Vector Data Exploration
Mumbai
Learn how to turn any dataset into interactive visual maps using vector embeddings, clustering, labeling, and LLMs, with…
Vector Search - Vibe Coded, deployed and running
San Francisco
Learn how to build, test, and deploy a production‑grade vector search service by using Claude‑generated Vibe code, conductor.build,…
Measuring embedding API latency: the meh, the slow and the slowest
Berlin
An empirical comparison of embedding API latency from OpenAI, Cohere, Google Vertex AI, and Jina, examining temporal and…