DeepEval: The LLM Evaluation Framework that integrates unit testing directly into your CI/CD pipeline for production-grade AI applications.
DeepEval is a framework for rigorously testing and validating Large Language Model (LLM) applications before deployment. It applies the familiar unit-testing paradigm of Pytest to AI, making quality measurable and repeatable for generative AI applications. The framework provides over 50 research-backed metrics, including advanced techniques such as G-Eval, which score subjective criteria through objective, criteria-based reasoning. Because evaluations run as ordinary unit tests, engineering teams can embed model performance checks directly into their existing continuous integration workflows, so every prompt tweak or model update is held to production-grade standards.
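To make the pytest-style workflow concrete, here is a minimal sketch of such a test, following the public API shown in DeepEval's documentation (GEval, LLMTestCase, assert_test). The my_llm_app function is a hypothetical stand-in for the application under test, and an evaluation model for the LLM judge (for example, an OpenAI API key) is assumed to be configured.

    from deepeval import assert_test
    from deepeval.metrics import GEval
    from deepeval.test_case import LLMTestCase, LLMTestCaseParams

    def my_llm_app(prompt: str) -> str:
        # Hypothetical placeholder for the application under test.
        return "Tides are driven primarily by the Moon's gravity acting on Earth's oceans."

    def test_answer_correctness():
        # G-Eval: an LLM judge scores the output against natural-language
        # criteria, producing a 0-1 score compared against the threshold.
        correctness = GEval(
            name="Correctness",
            criteria="Determine whether the actual output is factually consistent with the expected output.",
            evaluation_params=[
                LLMTestCaseParams.ACTUAL_OUTPUT,
                LLMTestCaseParams.EXPECTED_OUTPUT,
            ],
            threshold=0.7,
        )
        test_case = LLMTestCase(
            input="What causes tides?",
            actual_output=my_llm_app("What causes tides?"),
            expected_output="Tides are caused mainly by the Moon's gravitational pull on the oceans.",
        )
        # Fails the test (and therefore the CI job) if the metric
        # score falls below the threshold.
        assert_test(test_case, [correctness])

In a CI/CD pipeline, the same file can be collected by pytest or executed with DeepEval's runner (deepeval test run test_llm.py), so a metric score below threshold fails the build and blocks the change.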