RLHF
RLHF (Reinforcement Learning from Human Feedback) is a machine learning technique that aligns an AI agent's behavior with complex human preferences, using direct human judgments as the source of the reward signal.
RLHF is the industry-standard method for fine-tuning Large Language Models (LLMs) to be helpful, harmless, and accurate. The process involves three steps: first, human evaluators rank several candidate outputs for a prompt; second, this preference data trains a separate 'reward model' to predict human-like scores; third, the original LLM (the policy) is optimized with a reinforcement learning algorithm (e.g., Proximal Policy Optimization, or PPO) guided by the reward model's scores. This technique was crucial to the development of models such as OpenAI's InstructGPT and ChatGPT, effectively bridging the gap between raw model capability and user-expected behavior.
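The second step above, training a reward model from ranked pairs, can be sketched with a toy example. The snippet below is a minimal illustration, not a production implementation: it assumes each "response" is just a small feature vector, simulates human rankings with a hidden preference direction (`true_w`, an invented stand-in for a human judge), and fits a linear reward model with the standard pairwise (Bradley-Terry) loss, -log sigmoid(r(chosen) - r(rejected)).

```python
import math
import random

random.seed(0)
dim = 4
# Hidden "human preference" direction used only to simulate rankings;
# a real pipeline would collect these labels from human evaluators.
true_w = [1.0, -0.5, 0.8, 0.2]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def make_pair():
    """Simulate one human judgment: return (chosen, rejected)."""
    a = [random.gauss(0, 1) for _ in range(dim)]
    b = [random.gauss(0, 1) for _ in range(dim)]
    return (a, b) if dot(a, true_w) > dot(b, true_w) else (b, a)

pairs = [make_pair() for _ in range(500)]

# Fit a linear reward model r(x) = w . x by gradient descent on the
# pairwise loss  L = -log sigmoid(r(chosen) - r(rejected)).
w = [0.0] * dim
lr = 0.1
for _ in range(200):
    grad = [0.0] * dim
    for chosen, rejected in pairs:
        diff = [c - r for c, r in zip(chosen, rejected)]
        p = 1.0 / (1.0 + math.exp(-dot(diff, w)))   # sigmoid of the margin
        for i in range(dim):
            grad[i] += (p - 1.0) * diff[i]          # dL/dw for this pair
    w = [wi - lr * g / len(pairs) for wi, g in zip(w, grad)]

# The learned reward model should rank fresh pairs the way the "human" does.
acc = sum(dot([c - r for c, r in zip(ch, rj)], w) > 0
          for ch, rj in (make_pair() for _ in range(200))) / 200
print(f"pairwise accuracy on fresh pairs: {acc:.2f}")
```

In the third step, scores from this trained reward model would replace direct human judgments as the reward when fine-tuning the policy with PPO.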