

March 27, 2025 · Toronto

LLM Fingerprinting: Model Classification

This talk demonstrates a system to identify and classify large language models by analyzing their responses to benchmark prompts, using live API classification and code walkthroughs.
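The classification idea described above can be sketched as follows: send the same benchmark prompts to an unknown model, then score its responses against stored responses from known models. This is a minimal illustrative sketch of that approach (the similarity metric, fingerprint data, and function names here are hypothetical, not the talk's actual system):

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two responses."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def classify(responses: list[str], fingerprints: dict[str, list[str]]) -> str:
    """Pick the known model whose stored benchmark responses are,
    on average, most similar to the observed responses."""
    scores = {
        model: sum(jaccard(r, f) for r, f in zip(responses, refs)) / len(refs)
        for model, refs in fingerprints.items()
    }
    return max(scores, key=scores.get)

# Hypothetical stored responses to the same two benchmark prompts.
fingerprints = {
    "model-a": ["the capital of france is paris", "2 plus 2 equals 4"],
    "model-b": ["paris is the capital city of france", "the answer is four"],
}
observed = ["the capital of france is paris", "2 plus 2 is 4"]
print(classify(observed, fingerprints))  # model-a
```

A production system would use stronger signals (embeddings, token-level log probabilities, or behavioral quirks on adversarial prompts), but the pipeline shape — probe, compare, rank — is the same.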

Tech stack
  • PromptFoo
    PromptFoo is the open-source CLI for systematic LLM evaluation, red teaming, and side-by-side model comparison (e.g., GPT-4 vs. Claude 3) in CI/CD.
    PromptFoo enables test-driven LLM development, moving past trial-and-error with a powerful CLI and library. Developers use declarative YAML configs to define test cases, running evaluations across multiple providers (OpenAI, Anthropic, Llama) simultaneously. It provides automated scoring via assertions (e.g., `is-valid-json`, `context-faithfulness`) and matrix views for side-by-side output comparison. Crucially, it secures LLM applications with built-in red teaming and vulnerability scanning, integrating cleanly into CI/CD pipelines to ensure non-regression and quality for production systems.
  • Ollama
    Deploy and run open-source Large Language Models (LLMs) like Llama 3 and Mistral locally on your machine: achieve private, cost-effective AI via a simple command-line interface.
    Ollama is the essential tool for running LLMs locally: consider it the Docker for AI models. It packages complex models and dependencies into a single, easy-to-use application for macOS, Linux, and Windows systems. You get immediate access to models like Gemma 2 and DeepSeek-R1 via a straightforward CLI or REST API. This local-first approach guarantees data privacy and security, eliminating cloud dependency and high API costs. Ollama also optimizes performance on consumer hardware using techniques like quantization, ensuring efficient execution even on standard desktops.
  • Niagara
    Niagara is an open software framework that unifies diverse devices and protocols into a single, manageable network.
    Developed by Tridium, the Niagara Framework serves as the operating system for the Internet of Things (IoT). It normalizes data from disparate systems—including HVAC, lighting, and security—using a common model that bridges manufacturers and communication protocols like BACnet, Modbus, and LonWorks. With over one million active instances globally, Niagara 4 provides a cyber-secure environment for real-time monitoring and automation across smart buildings, data centers, and industrial plants. It transforms raw operational data into actionable intelligence through a centralized web interface, allowing operators to manage complex infrastructures from a single console.
  • MMLU-Pro Evals
    MMLU-Pro Evals is the advanced, high-stakes LLM benchmark: 12,000+ reasoning-focused questions with ten answer options, engineered to break model saturation and precisely differentiate top-tier performance.
    This is the MMLU-Pro benchmark: the definitive, next-generation evaluation for Large Language Models. It directly addresses the original MMLU's saturation by significantly increasing difficulty and robustness. The suite features over 12,000 rigorously curated questions across 14 diverse domains, shifting the focus from simple knowledge recall to complex, multi-step reasoning. Crucially, MMLU-Pro expands the multiple-choice options from four to ten, drastically reducing the chance of random guessing and enhancing discriminative power; for example, it showed a 9% performance gap between GPT-4o and GPT-4-Turbo, compared to just 1% on the old MMLU. Top models like Gemini 3 Pro (11/25) currently score around 90.1%, proving the benchmark remains a high bar for true general intelligence.
  • OpenAI Evals
    OpenAI Evals is the framework for systematically testing and measuring LLM performance against specific, user-defined criteria.
    OpenAI Evals is the essential framework for systematic LLM evaluation, ensuring accuracy, consistency, and reliability in production. Use the Evals API or the OpenAI dashboard to define clear testing criteria (classification, fact-checking, safety) and run them at scale. The process is direct: describe the task, run your eval against a test dataset (e.g., up to 500 responses at once), and analyze the results to quickly iterate on prompts or models. This capability is critical for adopting an eval-driven development cycle, allowing you to track performance over time and, for example, improve chatbot resolution rates from 68% to 89% in just three weeks.
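PromptFoo's declarative test cases can be pictured as data plus assertion functions. This is a hedged Python sketch of that idea — the case structure loosely mirrors a promptfoo-style YAML config, but it is an illustration, not promptfoo's actual code or schema:

```python
import json

# Declarative test cases, loosely mirroring the shape of a
# promptfoo-style YAML config (hypothetical structure, for illustration).
test_cases = [
    {"vars": {"topic": "weather"}, "assert": [{"type": "is-valid-json"}]},
    {"vars": {"topic": "news"}, "assert": [{"type": "contains", "value": "news"}]},
]

def check(output: str, assertion: dict) -> bool:
    """Apply one promptfoo-style assertion to a model output."""
    if assertion["type"] == "is-valid-json":
        try:
            json.loads(output)
            return True
        except ValueError:
            return False
    if assertion["type"] == "contains":
        return assertion["value"] in output
    raise ValueError(f"unknown assertion type: {assertion['type']}")

# Canned outputs stand in for real provider calls (OpenAI, Anthropic, Llama).
outputs = ['{"forecast": "sunny"}', "today in the news ..."]
results = [
    all(check(out, a) for a in case["assert"])
    for case, out in zip(test_cases, outputs)
]
print(results)  # [True, True]
```

Running the same cases against multiple providers and diffing the pass/fail matrix is what gives the side-by-side comparison the description mentions.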
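Ollama's REST API mentioned above serves completions from a local server (by default on port 11434). A minimal sketch of calling its `/api/generate` endpoint, assuming `ollama serve` is running and the model has been pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model: str, prompt: str) -> dict:
    """Payload for Ollama's /api/generate; stream=False asks for
    one complete JSON response instead of a stream of chunks."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Call a locally running Ollama server and return the completion text."""
    data = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

if __name__ == "__main__":
    # Requires a running server and a pulled model, e.g. `ollama pull llama3`.
    print(generate("llama3", "Why is the sky blue?"))
```

Because everything stays on localhost, no prompt or response data leaves the machine — the privacy property the description highlights.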
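The "common model" idea behind Niagara — normalizing readings from different protocols into one schema — can be illustrated with a small sketch. The field names and adapter functions below are hypothetical, not Tridium's API:

```python
from dataclasses import dataclass

@dataclass
class Point:
    """A protocol-agnostic data point, in the spirit of a common model
    (hypothetical fields, for illustration only)."""
    name: str
    value: float
    unit: str
    protocol: str

def from_bacnet(obj: dict) -> Point:
    # BACnet-shaped reading, e.g. from an HVAC controller.
    return Point(obj["object-name"], obj["present-value"], obj["units"], "bacnet")

def from_modbus(name: str, register: int, scale: float, unit: str) -> Point:
    # Modbus exposes raw register values that need scaling.
    return Point(name, register * scale, unit, "modbus")

points = [
    from_bacnet({"object-name": "AHU-1 SupplyTemp", "present-value": 18.5, "units": "degC"}),
    from_modbus("Meter-3 Power", register=4210, scale=0.01, unit="kW"),
]
for p in points:
    print(f"{p.name}: {p.value:.1f} {p.unit} (via {p.protocol})")
```

Once every device speaks the same `Point` shape, monitoring and automation logic can be written once, regardless of the underlying protocol.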
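The arithmetic behind MMLU-Pro's ten-option design is simple: random guessing on an n-option question scores 1/n in expectation, so widening from four to ten options lowers the guessing floor from 25% to 10% and leaves more headroom to separate models:

```python
# Expected accuracy from pure random guessing on an n-option
# multiple-choice benchmark is 1/n: the noise floor a model
# must clear before its score is meaningful.
def guess_floor(n_options: int) -> float:
    return 1 / n_options

mmlu_floor = guess_floor(4)       # original MMLU: 25% for free
mmlu_pro_floor = guess_floor(10)  # MMLU-Pro: only 10% for free
print(f"MMLU floor: {mmlu_floor:.0%}, MMLU-Pro floor: {mmlu_pro_floor:.0%}")
# MMLU floor: 25%, MMLU-Pro floor: 10%
```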
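The eval-driven cycle described above — define criteria, run against a test set, track the pass rate — can be sketched in a few lines. This is an illustrative loop with an exact-match grader, not the actual OpenAI Evals API:

```python
# Minimal sketch of an eval-driven loop: grade model outputs
# against a labeled test set and track the pass rate over time.
# (Illustrative only -- not the actual OpenAI Evals API.)

def grade(output: str, expected: str) -> bool:
    """Exact-match grader; real evals also use model-graded rubrics."""
    return output.strip().lower() == expected.strip().lower()

def run_eval(outputs: list[str], dataset: list[dict]) -> float:
    passed = sum(grade(o, row["expected"]) for o, row in zip(outputs, dataset))
    return passed / len(dataset)

dataset = [
    {"input": "Classify: 'refund not received'", "expected": "billing"},
    {"input": "Classify: 'app crashes on launch'", "expected": "bug"},
    {"input": "Classify: 'how do I export data?'", "expected": "how-to"},
]
# Canned outputs standing in for a real model run.
outputs_v1 = ["billing", "feature", "how-to"]
print(f"pass rate: {run_eval(outputs_v1, dataset):.0%}")  # pass rate: 67%
```

Re-running the same eval after each prompt or model change is what turns a one-off test into the over-time tracking the description refers to.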

Related projects