Comet Opik LLM Observability
Explore an LLM-powered recipe generator, its real‑time observability pipeline, and how Comet Opik tracing with custom evaluation metrics detects hallucinations and quality issues.
I built a recipe generator app with LLMs. It helps me a lot in the kitchen and has been amazing for meal planning, but how do I make sure my LLM returns reasonable recipes? And if I have a lot of users, how do I automatically detect common issues like hallucinations? In this talk, I’ll show you my app in action and demonstrate the observability strategy I set up to detect issues live. Monitoring LLMs is hard because their output is nondeterministic. I’ll show how I used Comet Opik to implement tracing and custom eval metrics, which you can use for your own projects as well.
Opik enables LLM observability: monitoring cost, quality, and custom outputs.
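To give a rough idea of what tracing plus a custom eval metric can look like, here is a minimal, library-free sketch. It is not the Opik API and not the code from the talk: the `traced` decorator, the in-memory `TRACES` list, the canned `generate_recipe` stand-in, the `KNOWN_INGREDIENTS` vocabulary, and the `unlisted_ingredient_score` heuristic are all hypothetical names made up for illustration. The metric flags one simple kind of hallucination, where a step mentions an ingredient that the recipe never lists.

```python
import functools
import time

# In-memory stand-in for an observability backend (hypothetical, not Opik).
TRACES = []


def traced(fn):
    """Record each call's input, output, and latency as a trace dict."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        output = fn(*args, **kwargs)
        TRACES.append({
            "name": fn.__name__,
            "input": {"args": args, "kwargs": kwargs},
            "output": output,
            "latency_s": time.perf_counter() - start,
        })
        return output
    return wrapper


@traced
def generate_recipe(request):
    # Stand-in for the LLM call: returns a canned recipe so the sketch runs offline.
    return {
        "title": "Tomato pasta",
        "ingredients": ["pasta", "tomato", "garlic"],
        "steps": ["Boil the pasta.", "Fry garlic and tomato.", "Stir in the basil."],
    }


# Assumed ingredient vocabulary used to spot mentions inside the steps.
KNOWN_INGREDIENTS = {"pasta", "tomato", "garlic", "basil", "onion"}


def unlisted_ingredient_score(recipe):
    """Heuristic metric: share of ingredients mentioned in the steps that are also listed."""
    listed = set(recipe["ingredients"])
    text = " ".join(recipe["steps"]).lower()
    mentioned = {name for name in KNOWN_INGREDIENTS if name in text}
    if not mentioned:
        return {"value": 1.0, "unlisted": []}
    unlisted = sorted(mentioned - listed)
    return {"value": 1 - len(unlisted) / len(mentioned), "unlisted": unlisted}


recipe = generate_recipe("quick vegetarian dinner")
score = unlisted_ingredient_score(recipe)
print(score)            # the steps mention basil, which is never listed
print(TRACES[0]["name"])
```

In a real setup, the trace would be sent to an observability tool rather than a list, and the score would be attached to that trace so low-scoring outputs can be found and reviewed.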
Related projects
Alignment Platform for LLM-as-a-Judge
London
The talk demonstrates a beta UI that captures human corrections to LLM judges, creates few‑shot examples, and continuously…
LLM Evaluations in Practice
Amsterdam
Learn about a practical setup for LLM evaluation in production, sharing hard-earned lessons for guiding prompt and code…
Using LLMs to automate content moderation
Dublin
A practical overview of using retrieval‑augmented generation, fine‑tuning, and prompt engineering for content moderation, focusing on accuracy, consistency,…
Automating LLM as a judge with EvalForge and Weave
Seattle
This talk explores automating custom LLM evaluation criteria using EvalForge and Weave, enabling users to create and run…
Benchmarking 100 LLM Inference Engine Configurations
New York City
This talk demonstrates Stopwatch, an open-source tool for quickly benchmarking LLM inference engines like vLLM, SGLang, and TRT-LLM,…
Keeping an "AI" on LLMs with Langfuse
London
Learn how to self‑host Langfuse for LLM observability, covering setup, tracking user queries, inputs, retrieved data, and practical…