Noa Notes: Medical AI Evaluation
Learn how we built a two‑tier evaluation framework for a medical transcription AI, using LLM‑driven factual checks, style analysis, prompt engineering, and MLflow tracking.
How do you evaluate an AI system that assists doctors with medical documentation? In this talk, we’ll share practical insights from building an evaluation framework for Noa Notes @ Docplanner, a system that transcribes and summarizes doctor-patient conversations. We will discuss our two-tier evaluation approach combining detailed factual assessment with style analysis, show how we leverage LLMs in the evaluation pipeline, and share specific examples of how prompt engineering improved our metrics. We’ll also discuss challenges unique to the medical domain and how we addressed them.
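A two-tier scheme like the one described can be sketched in a few lines of Python. The class names, scoring functions, reference facts, and weights below are illustrative placeholders, not the actual Noa Notes implementation; in particular, the style score would in practice come from an LLM judge rather than a constant.

```python
# Hypothetical sketch of a two-tier note evaluation:
# tier 1 checks factual coverage, tier 2 scores style.
from dataclasses import dataclass

@dataclass
class EvalResult:
    factual_score: float  # fraction of reference facts found in the note
    style_score: float    # 0-1 style rating (e.g. from an LLM judge)

    @property
    def overall(self) -> float:
        # Illustrative weighting: factual accuracy counts more than style.
        return 0.7 * self.factual_score + 0.3 * self.style_score

def factual_recall(reference_facts: list[str], note: str) -> float:
    """Tier 1: fraction of expected facts mentioned in the generated note."""
    if not reference_facts:
        return 1.0
    found = sum(1 for fact in reference_facts if fact.lower() in note.lower())
    return found / len(reference_facts)

note = "Patient reports headache for 3 days. Prescribed ibuprofen 400 mg."
facts = ["headache", "3 days", "ibuprofen"]
result = EvalResult(factual_score=factual_recall(facts, note), style_score=0.8)
```

A real pipeline would replace the substring check with an LLM-driven fact extraction and comparison step, but the shape of the aggregation stays the same.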
Noa provides AI healthcare assistance, automating clinical note generation and 24/7 booking via AWS/Azure.
- **MLflow**: The open-source platform for managing the complete machine learning lifecycle: tracking, reproducibility, and deployment. MLflow standardizes the ML workflow using four core components. **MLflow Tracking** logs experiment details, recording parameters, metrics (like accuracy), and artifacts for every run. **MLflow Projects** package code in a reusable format, ensuring reproducibility across environments via a simple `mlflow run` command. **MLflow Models** provide a consistent model format ('flavor') for deployment, supporting frameworks like scikit-learn, PyTorch, and TensorFlow. Finally, the **MLflow Model Registry** centralizes model management, handling versioning and stage transitions (e.g., Staging to Production) for governance and collaboration.
- **Noa Notes**: AI-powered medical assistant that transcribes doctor-patient appointments and generates structured, EHR-ready clinical summaries automatically. Noa Notes is Docplanner's AI solution, purpose-built for clinical documentation: it eliminates manual note-taking for physicians. The system integrates directly with Practice Management Systems (PMSes) via an embeddable widget and API, capturing the full patient-doctor conversation. It processes the audio, generates a transcription, and then produces a structured summary based on pre-defined templates. This summary is ready for doctor review and immediate population of the patient's Electronic Health Record (EHR), streamlining workflows in markets like Poland, Germany, Spain, and Portugal.
- **Docplanner**: Global healthtech platform connecting patients and providers via online appointment booking and full-suite practice management software. Docplanner is the world's largest healthcare platform, focused on making the patient experience more human. The technology provides an integrated, end-to-end solution: patients find doctors and book visits, while healthcare professionals gain tools to manage practices (scheduling, payments, communication). Operating in 13 countries (including Poland's ZnanyLekarz and Spain's Doctoralia), the platform handles 25 million appointments and 100 million patient visits monthly. Core tech utilizes Java, PHP, Kubernetes, MySQL, and Redis, supporting sophisticated products like TuoTempo (optimization for large institutions).
- **GPT-4**: OpenAI’s large multimodal model: it processes both text and image inputs, delivering human-level performance on complex professional and academic benchmarks. It demonstrates a significant capability leap over its predecessor, scoring in the top 10% on a simulated bar exam (GPT-3.5 scored in the bottom 10%). The model handles nuanced instructions and long-form content, supporting context windows up to 32,768 tokens (32K model). This capacity allows processing up to 25,000 words in a single, complex prompt. GPT-4 is engineered for enhanced reliability, steerability, and advanced reasoning across diverse tasks.
- **Prompt Engineering**: The discipline of structuring inputs (prompts) to Large Language Models (LLMs) to reliably and efficiently elicit a desired, high-quality output. This is the core skill for maximizing performance from models like GPT-4 and Claude 3: the art and science of guiding an AI. The process involves systematic iteration and applying specific techniques to control the model's behavior and reduce 'hallucination.' Key advanced methods include Chain-of-Thought (CoT) prompting, which forces the LLM to work through complex problems step by step, and Few-Shot prompting (providing 2-3 examples) to establish a clear output format or style. Mastery of these methods directly translates to tangible gains: improved accuracy, reduced API costs from fewer retries, and production-ready outputs for applications like customer service bots or code generation.
- **GPT-3**: A 175-billion parameter autoregressive language model that masters complex tasks through few-shot learning. OpenAI debuted GPT-3 in 2020: a transformer-based engine trained on 570GB of filtered text. It utilizes 175 billion parameters to execute diverse functions (including Python scripting and logical reasoning) using only natural language prompts. This architecture removed the requirement for task-specific fine-tuning, establishing the foundation for modern tools like GitHub Copilot and the initial ChatGPT release.
- **Llama 2**: Meta AI's powerful, openly accessible family of large language models (LLMs), released for free research and commercial use. The collection includes both pre-trained foundation models and instruction-tuned 'Chat' variants, scaling from 7 billion (7B) up to 70 billion (70B) parameters. Key technical upgrades over Llama 1 involve training on 2 trillion tokens (40% more data) and doubling the context length to 4096 tokens. The Llama-2-chat models were rigorously aligned using Reinforcement Learning from Human Feedback (RLHF), positioning them as a top-tier, openly available option for developers building advanced generative AI solutions.
- **PaLM 2**: Google's versatile large language model optimized for advanced reasoning, multilingual translation, and coding across four distinct scales. PaLM 2 powers 25+ Google products (including Gemini and Workspace) using a Transformer-based architecture trained on a massive corpus of 100+ languages. It excels in specialized tasks: solving complex math problems, generating high-quality code, and passing professional-level exams. Developers deploy the model via the PaLM API in four sizes: Gecko, Otter, Bison, and Unicorn. Gecko is lightweight enough to run locally on mobile devices (offline), while Unicorn handles the most complex, data-heavy reasoning tasks at scale.
- **BLOOM**: A 176-billion parameter open-access multilingual language model built by the BigScience research collective. BLOOM is the result of a year-long collaboration involving 1,000+ researchers from 70+ countries. It supports 46 natural languages and 13 programming languages, providing a high-performance alternative to proprietary models. The model was trained on the Jean Zay supercomputer in France using the 1.6-terabyte ROOTS dataset (a massive collection of diverse text sources). By providing full access to its weights and training process, BLOOM enables global developers to build and audit AI tools without the restrictions of closed-door APIs.
- **BERT**: BERT (Bidirectional Encoder Representations from Transformers) is a foundational, pre-trained NLP model that uses a Transformer encoder to process text bidirectionally, capturing full word context for superior language understanding. Introduced by Google AI Language in 2018, it is built on the Transformer architecture and distinguishes itself by being deeply bidirectional: it processes the entire sequence of words (left and right context) simultaneously, unlike previous unidirectional models. This capability is achieved through a Masked Language Model (MLM) pre-training objective. The model, released in sizes like BERT-Base (110 million parameters) and BERT-Large (340 million parameters), dramatically improved the state of the art across 11+ Natural Language Processing tasks, including question answering (SQuAD) and sentiment analysis, establishing a new baseline for the field.
- **RoBERTa**: RoBERTa (Robustly Optimized BERT Pretraining Approach) is a high-performance language model from Facebook AI that significantly outperforms BERT by optimizing the pretraining strategy, not the core architecture. Developed in 2019, it grew out of a replication study proving BERT was undertrained and could achieve state-of-the-art results with a refined recipe: the team removed the Next Sentence Prediction (NSP) objective, implemented dynamic masking, and scaled up training dramatically. Specifically, RoBERTa trained for 500K steps (up from 100K) on a massive 160GB of text data (ten times BERT’s data) using much larger batch sizes (up to 8K). This optimized approach yielded superior performance on major benchmarks like GLUE, RACE, and SQuAD, establishing RoBERTa as a benchmark for subsequent language model development.
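The few-shot technique mentioned under Prompt Engineering above can be sketched as a simple prompt-assembly function. The task description, examples, and input/output labels below are invented for demonstration; they are not taken from any specific product or API.

```python
# Illustrative few-shot prompt assembly: a task description followed by
# 2-3 worked examples, then the new query in the same format.
def build_few_shot_prompt(task: str,
                          examples: list[tuple[str, str]],
                          query: str) -> str:
    parts = [task, ""]
    for inp, out in examples:
        parts.append(f"Input: {inp}")
        parts.append(f"Output: {out}")
        parts.append("")
    # The trailing "Output:" cue invites the model to complete the pattern.
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)

prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [("Great service!", "positive"), ("Never coming back.", "negative")],
    "The food was amazing.",
)
```

The resulting string would be sent as-is to an LLM; the examples establish the output format so the model's completion stays parseable.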
Related projects
Practical demo challenges in creating LLM-based consumer products
Poland
Explore real-world obstacles and solutions when integrating large language models into consumer products, covering design, deployment, testing, and…
Efficient data extraction from documents for data analytics and process automation
Poland
Explore Arctic‑TILT, a 0.8B‑parameter model that outperforms GPT‑4 in document processing, enabling efficient data extraction for analytics and…
How Not to Kill Anyone: Safety Layers in Medical Reasoning
Poland
Methods for extracting body composition and lab data from unstructured sources, building real‑time digital health twins, and ensuring…
Genaicode - programming on steroids
Poland
Live demo of Genaicode, an AI code generator, modifying a personal game in real time and covering latency,…
Unleash Your Voice, Unleash Your Agents
Poland
The talk demonstrates a locally run AI system that provides real‑time speech transcription, lets you control applications, and…
LLM Evaluations in Practice
Amsterdam
Learn about a practical setup for LLM evaluation in production, sharing hard-earned lessons for guiding prompt and code…