

March 27, 2025 · Toronto

LLM Fingerprinting: Model Classification

This talk demonstrates a system to identify and classify large language models by analyzing their responses to benchmark prompts, using live API classification and code walkthroughs.
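The classification idea described above can be sketched as follows: send the same benchmark prompts to an unknown model, then score its responses against stored responses from known models. This is a minimal illustrative sketch of that approach (the similarity metric, fingerprint data, and function names here are hypothetical, not the talk's actual system):

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two responses."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def classify(responses: list[str], fingerprints: dict[str, list[str]]) -> str:
    """Pick the known model whose stored benchmark responses are,
    on average, most similar to the observed responses."""
    scores = {
        model: sum(jaccard(r, f) for r, f in zip(responses, refs)) / len(refs)
        for model, refs in fingerprints.items()
    }
    return max(scores, key=scores.get)

# Hypothetical stored responses to the same two benchmark prompts.
fingerprints = {
    "model-a": ["the capital of france is paris", "2 plus 2 equals 4"],
    "model-b": ["paris is the capital city of france", "the answer is four"],
}
observed = ["the capital of france is paris", "2 plus 2 is 4"]
print(classify(observed, fingerprints))  # model-a
```

A production system would use stronger signals (embeddings, token-level log probabilities, or behavioral quirks on adversarial prompts), but the pipeline shape — probe, compare, rank — is the same.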

Tech stack
  • PromptFoo
    PromptFoo is the open-source CLI for systematic LLM evaluation, red teaming, and side-by-side model comparison (e.g., GPT-4 vs. Claude 3) in CI/CD.
    PromptFoo enables test-driven LLM development, moving past trial-and-error with a powerful CLI and library. Developers use declarative YAML configs to define test cases, running evaluations across multiple providers (OpenAI, Anthropic, Llama) simultaneously. It provides automated scoring via assertions (e.g., `is-valid-json`, `context-faithfulness`) and matrix views for side-by-side output comparison. Crucially, it secures LLM applications with built-in red teaming and vulnerability scanning, integrating cleanly into CI/CD pipelines to ensure non-regression and quality for production systems.
  • Ollama
    Deploy and run open-source Large Language Models (LLMs) like Llama 3 and Mistral locally on your machine: achieve private, cost-effective AI via a simple command-line interface.
    Ollama is the essential tool for running LLMs locally: consider it the Docker for AI models. It packages complex models and dependencies into a single, easy-to-use application for macOS, Linux, and Windows systems. You get immediate access to models like Gemma 2 and DeepSeek-R1 via a straightforward CLI or REST API. This local-first approach guarantees data privacy and security, eliminating cloud dependency and high API costs. Ollama also optimizes performance on consumer hardware using techniques like quantization, ensuring efficient execution even on standard desktops.
  • Niagara
    Niagara is an open software framework that unifies diverse devices and protocols into a single, manageable network.
    Developed by Tridium, the Niagara Framework serves as the operating system for the Internet of Things (IoT). It normalizes data from disparate systems—including HVAC, lighting, and security—using a common model that bridges manufacturers and communication protocols like BACnet, Modbus, and LonWorks. With over one million active instances globally, Niagara 4 provides a cyber-secure environment for real-time monitoring and automation across smart buildings, data centers, and industrial plants. It transforms raw operational data into actionable intelligence through a centralized web interface, allowing operators to manage complex infrastructures from a single console.
  • MMLU-Pro Evals
    MMLU-Pro Evals is the advanced, high-stakes LLM benchmark: 12,000+ reasoning-focused questions with ten answer options, engineered to break model saturation and precisely differentiate top-tier performance.
    This is the MMLU-Pro benchmark: the definitive, next-generation evaluation for Large Language Models. It directly addresses the original MMLU's saturation by significantly increasing difficulty and robustness. The suite features over 12,000 rigorously curated questions across 14 diverse domains, shifting the focus from simple knowledge recall to complex, multi-step reasoning. Crucially, MMLU-Pro expands the multiple-choice options from four to ten, drastically reducing the chance of random guessing and enhancing discriminative power; for example, it showed a 9% performance gap between GPT-4o and GPT-4-Turbo, compared to just 1% on the old MMLU. Top models like Gemini 3 Pro (11/25) currently score around 90.1%, proving the benchmark remains a high bar for true general intelligence.
  • OpenAI Evals
    OpenAI Evals is the framework for systematically testing and measuring LLM performance against specific, user-defined criteria.
    OpenAI Evals is the essential framework for systematic LLM evaluation, ensuring accuracy, consistency, and reliability in production. Use the Evals API or the OpenAI dashboard to define clear testing criteria (classification, fact-checking, safety) and run them at scale. The process is direct: describe the task, run your eval against a test dataset (e.g., up to 500 responses at once), and analyze the results to quickly iterate on prompts or models. This capability is critical for adopting an eval-driven development cycle, allowing you to track performance over time and, for example, improve chatbot resolution rates from 68% to 89% in just three weeks.
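PromptFoo's declarative test cases can be pictured as data plus assertion functions. This is a hedged Python sketch of that idea — the case structure loosely mirrors a promptfoo-style YAML config, but it is an illustration, not promptfoo's actual code or schema:

```python
import json

# Declarative test cases, loosely mirroring the shape of a
# promptfoo-style YAML config (hypothetical structure, for illustration).
test_cases = [
    {"vars": {"topic": "weather"}, "assert": [{"type": "is-valid-json"}]},
    {"vars": {"topic": "news"}, "assert": [{"type": "contains", "value": "news"}]},
]

def check(output: str, assertion: dict) -> bool:
    """Apply one promptfoo-style assertion to a model output."""
    if assertion["type"] == "is-valid-json":
        try:
            json.loads(output)
            return True
        except ValueError:
            return False
    if assertion["type"] == "contains":
        return assertion["value"] in output
    raise ValueError(f"unknown assertion type: {assertion['type']}")

# Canned outputs stand in for real provider calls (OpenAI, Anthropic, Llama).
outputs = ['{"forecast": "sunny"}', "today in the news ..."]
results = [
    all(check(out, a) for a in case["assert"])
    for case, out in zip(test_cases, outputs)
]
print(results)  # [True, True]
```

Running the same cases against multiple providers and diffing the pass/fail matrix is what gives the side-by-side comparison the description mentions.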
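Ollama's REST API mentioned above serves completions from a local server (by default on port 11434). A minimal sketch of calling its `/api/generate` endpoint, assuming `ollama serve` is running and the model has been pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model: str, prompt: str) -> dict:
    """Payload for Ollama's /api/generate; stream=False asks for
    one complete JSON response instead of a stream of chunks."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Call a locally running Ollama server and return the completion text."""
    data = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

if __name__ == "__main__":
    # Requires a running server and a pulled model, e.g. `ollama pull llama3`.
    print(generate("llama3", "Why is the sky blue?"))
```

Because everything stays on localhost, no prompt or response data leaves the machine — the privacy property the description highlights.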
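The "common model" idea behind Niagara — normalizing readings from different protocols into one schema — can be illustrated with a small sketch. The field names and adapter functions below are hypothetical, not Tridium's API:

```python
from dataclasses import dataclass

@dataclass
class Point:
    """A protocol-agnostic data point, in the spirit of a common model
    (hypothetical fields, for illustration only)."""
    name: str
    value: float
    unit: str
    protocol: str

def from_bacnet(obj: dict) -> Point:
    # BACnet-shaped reading, e.g. from an HVAC controller.
    return Point(obj["object-name"], obj["present-value"], obj["units"], "bacnet")

def from_modbus(name: str, register: int, scale: float, unit: str) -> Point:
    # Modbus exposes raw register values that need scaling.
    return Point(name, register * scale, unit, "modbus")

points = [
    from_bacnet({"object-name": "AHU-1 SupplyTemp", "present-value": 18.5, "units": "degC"}),
    from_modbus("Meter-3 Power", register=4210, scale=0.01, unit="kW"),
]
for p in points:
    print(f"{p.name}: {p.value:.1f} {p.unit} (via {p.protocol})")
```

Once every device speaks the same `Point` shape, monitoring and automation logic can be written once, regardless of the underlying protocol.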
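The arithmetic behind MMLU-Pro's ten-option design is simple: random guessing on an n-option question scores 1/n in expectation, so widening from four to ten options lowers the guessing floor from 25% to 10% and leaves more headroom to separate models:

```python
# Expected accuracy from pure random guessing on an n-option
# multiple-choice benchmark is 1/n: the noise floor a model
# must clear before its score is meaningful.
def guess_floor(n_options: int) -> float:
    return 1 / n_options

mmlu_floor = guess_floor(4)       # original MMLU: 25% for free
mmlu_pro_floor = guess_floor(10)  # MMLU-Pro: only 10% for free
print(f"MMLU floor: {mmlu_floor:.0%}, MMLU-Pro floor: {mmlu_pro_floor:.0%}")
# MMLU floor: 25%, MMLU-Pro floor: 10%
```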
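The eval-driven cycle described above — define criteria, run against a test set, track the pass rate — can be sketched in a few lines. This is an illustrative loop with an exact-match grader, not the actual OpenAI Evals API:

```python
# Minimal sketch of an eval-driven loop: grade model outputs
# against a labeled test set and track the pass rate over time.
# (Illustrative only -- not the actual OpenAI Evals API.)

def grade(output: str, expected: str) -> bool:
    """Exact-match grader; real evals also use model-graded rubrics."""
    return output.strip().lower() == expected.strip().lower()

def run_eval(outputs: list[str], dataset: list[dict]) -> float:
    passed = sum(grade(o, row["expected"]) for o, row in zip(outputs, dataset))
    return passed / len(dataset)

dataset = [
    {"input": "Classify: 'refund not received'", "expected": "billing"},
    {"input": "Classify: 'app crashes on launch'", "expected": "bug"},
    {"input": "Classify: 'how do I export data?'", "expected": "how-to"},
]
# Canned outputs standing in for a real model run.
outputs_v1 = ["billing", "feature", "how-to"]
print(f"pass rate: {run_eval(outputs_v1, dataset):.0%}")  # pass rate: 67%
```

Re-running the same eval after each prompt or model change is what turns a one-off test into the over-time tracking the description refers to.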

Related projects