Noa Notes: Medical AI Evaluation

Learn how we built a two‑tier evaluation framework for a medical transcription AI, using LLM‑driven factual checks, style analysis, prompt engineering, and MLflow tracking.

Overview

How do you evaluate an AI system that assists doctors with medical documentation? In this talk, we’ll share practical insights from building an evaluation framework for Noa Notes at Docplanner, a system that transcribes and summarizes doctor-patient conversations. We will describe our two-tier evaluation approach, which combines detailed factual assessment with style analysis, show how we use LLMs in the evaluation pipeline, and share specific examples of how prompt engineering improved our metrics. We’ll also discuss challenges unique to the medical domain and how we addressed them.
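
To make the two-tier idea concrete, here is a minimal Python sketch. It is not the Noa Notes implementation: the prompt layout, the `NoteEvaluation` fields, the yes/no scoring scheme, and the `keyword_judge` stand-in are all assumptions made for illustration. In a real pipeline the `judge` callable would wrap an LLM call; here it is a plain substring check so the example runs without any external service.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical judge interface: takes a prompt, returns a free-text answer.
# In practice this would wrap an LLM call; that wiring is out of scope here.
Judge = Callable[[str], str]


@dataclass
class NoteEvaluation:
    factual_score: float  # tier 1: share of reference facts found in the note
    style_score: float    # tier 2: share of style criteria judged as met


def build_prompt(note: str, item: str, question: str) -> str:
    # Assumed prompt layout; a real prompt would be tuned via prompt engineering.
    return f"NOTE: {note}\nITEM: {item}\nQUESTION: {question} Answer yes or no."


def ask_yes_no(judge: Judge, prompt: str) -> bool:
    return judge(prompt).strip().lower().startswith("yes")


def fraction(hits: Sequence[bool]) -> float:
    # Empty checklists score 0.0 rather than raising a ZeroDivisionError.
    return sum(hits) / len(hits) if hits else 0.0


def evaluate_note(
    note: str,
    reference_facts: Sequence[str],
    style_criteria: Sequence[str],
    judge: Judge,
) -> NoteEvaluation:
    fact_hits = [
        ask_yes_no(judge, build_prompt(note, fact, "Does the note state this fact?"))
        for fact in reference_facts
    ]
    style_hits = [
        ask_yes_no(judge, build_prompt(note, rule, "Does the note meet this style criterion?"))
        for rule in style_criteria
    ]
    return NoteEvaluation(fraction(fact_hits), fraction(style_hits))


def keyword_judge(prompt: str) -> str:
    """Stand-in for an LLM: says 'yes' when the item appears verbatim in the note."""
    head, _, rest = prompt.partition("\nITEM: ")
    note = head[len("NOTE: "):]
    item = rest.partition("\nQUESTION: ")[0]
    return "yes" if item.strip().lower() in note.lower() else "no"


# Made-up example data, for illustration only.
note = "Patient reports headache for three days. Prescribed ibuprofen 400 mg."
facts = ["headache for three days", "ibuprofen 400 mg", "no known allergies"]
style = ["Patient reports"]

result = evaluate_note(note, facts, style, keyword_judge)
print(result)
```

Keeping the two scores separate means a note that is fluent but misses a fact can be told apart from one that is complete but poorly phrased. Each score could then be recorded per run with a tracker such as MLflow, for example via `mlflow.log_metric("factual_score", result.factual_score)`.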

Links

https://noa.ai/pl/
Noa provides AI healthcare assistance, automating clinical note generation and 24/7 booking via AWS/Azure.

Related projects

Practical demo challenges in creating LLM-based consumer products

Poland

Explore real-world obstacles and solutions when integrating large language models into consumer products, covering design, deployment, testing, and…

Efficient data extraction from documents for data analytics and process automation

Poland

Explore Arctic‑TILT, a 0.8 B‑parameter model that outperforms GPT‑4 in document processing, enabling efficient data extraction for analytics and…

How Not to Kill Anyone: Safety Layers in Medical Reasoning

Poland

Methods for extracting body composition and lab data from unstructured sources, building real‑time digital health twins, and ensuring…

Genaicode - programming on steroids

Poland

Live demo of Genaicode, an AI code generator, modifying a personal game in real time and covering latency,…

Unleash Your voice Unleash Your Agents

Poland

The talk demonstrates a locally run AI system that provides real‑time speech transcription, lets you control applications, and…

LLM Evaluations in Practice

Amsterdam

Learn about a practical setup for LLM evaluation in production, sharing hard-earned lessons for guiding prompt and code…