

Evaluation

DeepEval: The LLM Evaluation Framework that integrates unit testing directly into your CI/CD pipeline for production-grade AI applications.

DeepEval is a framework for rigorously testing and validating Large Language Model (LLM) applications before deployment. It applies the familiar unit-testing paradigm, via Pytest integration, to AI, making quality measurable and repeatable for generative AI applications. The framework ships more than 50 research-backed metrics, including G-Eval, which scores subjective criteria using objective, criteria-based reasoning. Engineering teams can therefore embed model-performance checks directly into their existing continuous-integration workflows, so every prompt tweak or model update is verified against production-grade standards.
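The unit-testing pattern described above can be sketched without the library itself. The snippet below is a minimal, hypothetical stand-in, not DeepEval's actual API (the class and metric names here are illustrative): a test case bundles the prompt with the model's output, a metric scores the output against a threshold, and a plain assertion makes a low score fail the CI run like any other unit test.

```python
from dataclasses import dataclass, field

@dataclass
class LLMTestCase:
    """Bundles one prompt with the output the model produced for it."""
    input: str
    actual_output: str

@dataclass
class KeywordCoverageMetric:
    """Toy metric: fraction of required keywords present in the output."""
    keywords: list
    threshold: float = 0.7
    score: float = field(default=0.0, init=False)

    def measure(self, case: LLMTestCase) -> float:
        text = case.actual_output.lower()
        hits = sum(1 for kw in self.keywords if kw.lower() in text)
        self.score = hits / len(self.keywords)
        return self.score

def assert_test(case: LLMTestCase, metrics) -> None:
    """Fail the test run if any metric scores below its threshold."""
    for metric in metrics:
        score = metric.measure(case)
        assert score >= metric.threshold, (
            f"{type(metric).__name__} scored {score:.2f} "
            f"< threshold {metric.threshold}"
        )

# Run inside a pytest suite, this assertion passes or fails the build.
case = LLMTestCase(
    input="Summarize our refund policy.",
    actual_output="Refunds are issued within 14 days of purchase.",
)
assert_test(case, [KeywordCoverageMetric(keywords=["refund", "14 days"])])
```

A real G-Eval metric would replace the keyword check with an LLM judge that applies written evaluation criteria, but the test harness shape, case in, score out, threshold asserted, stays the same.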

https://deepeval.com
26 projects · 27 cities


Recent Talks & Demos



Training AI Like a Dog · Chicago, Apr 14 · LLM, Tiny Llama
AO: Deconstructionist AI Engineering · DC, Apr 9 · React Native, TypeScript
UofT: Reliable Policy RAG · Toronto, Mar 25 · Python, RAG
UofT: Intelligent Document Search · Toronto, Mar 25 · Python, FastAPI
ai-flow.eu: Systematic LLM Testing · Cologne, Mar 5 · ai-flow, Node
Guardrailed AI for Constrained Environments · Milan, Feb 24 · Kubernetes, TrustyAI
Gemini: Production ESG KPI Extraction · Vienna, Feb 19 · Google Gemini
Email Writer · Nashville, Jan 29 · OpenAI API, Python
DARIA: Multi-modal Assessment Pipeline · Raleigh, Dec 10 · React, Python
Policy-as-Code Regulation Engine · Bremen, Dec 10 · Python, YAML
Number Theory: AI, Crypto, Optimization · Boston, Dec 2 · Python, Apache Kafka
Google Cloud GenAI and Gemini · Austin, Nov 10 · ADK, Agent Engine
Scalable Production RAG Architecture · Toronto, Nov 10 · FAISS, OpenAI API
Instruct Lab LLM Evaluation Playbook · Toronto, Nov 10 · Merlinite-7B-Lab, Mistral, Mixtral
NEXT Agents Fix User Errors · Amsterdam, Oct 10 · OpenAI API, TypeScript
Agentic AI: Coherent Long Fiction · Tokyo, Oct 10 · RAG, OpenAI API
LLM-Judge: Reliable Immigration AI · New York City, Oct 2 · OpenAI API, Pinecone
AutoRAG: Specialized AI Datasets · Brisbane, Sep 11 · Llama-3-8B-Instruct, FAISS
Benchmarking LLMs for Fraud Detection · Minneapolis Saint Paul, Sep 10 · AWS Bedrock, LangChain
LangGraph Multi-Step AI Agent · New York City, Aug 26 · LangGraph, Ollama
LLM Safety: Model vs Prompt · Dubai, Aug 23 · GPT-4, GPT-3
Evals: KPIs to CI/CD · Pune, Aug 23 · Claude, GPT
Mining opportunities · Santiago, Jun 26 · GPT-4, Claude-3
HandIt.ai: Self-Improving AI Systems · Medellín, Apr 1 · GPT-4, Claude-3