RLHF
RLHF (Reinforcement Learning from Human Feedback) is a machine learning technique that aligns an AI agent's behavior with complex human preferences, using direct human judgments as the source of the reward signal.
RLHF is the industry-standard method for fine-tuning Large Language Models (LLMs) to be helpful, harmless, and accurate. The process involves three steps: first, human evaluators rank several candidate outputs for a prompt; second, this preference data trains a separate 'reward model' to predict human-like scores; third, the original LLM (the policy) is optimized with a reinforcement learning algorithm (e.g., Proximal Policy Optimization, or PPO) guided by the reward model's scores. This technique was crucial to the development of models such as OpenAI's InstructGPT and ChatGPT, effectively bridging the gap between raw model capability and user-expected behavior.
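The second step above, training a reward model from ranked pairs, can be sketched with a toy example. The snippet below is a minimal illustration, not a production implementation: it assumes each "response" is just a small feature vector, simulates human rankings with a hidden preference direction (`true_w`, an invented stand-in for a human judge), and fits a linear reward model with the standard pairwise (Bradley-Terry) loss, -log sigmoid(r(chosen) - r(rejected)).

```python
import math
import random

random.seed(0)
dim = 4
# Hidden "human preference" direction used only to simulate rankings;
# a real pipeline would collect these labels from human evaluators.
true_w = [1.0, -0.5, 0.8, 0.2]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def make_pair():
    """Simulate one human judgment: return (chosen, rejected)."""
    a = [random.gauss(0, 1) for _ in range(dim)]
    b = [random.gauss(0, 1) for _ in range(dim)]
    return (a, b) if dot(a, true_w) > dot(b, true_w) else (b, a)

pairs = [make_pair() for _ in range(500)]

# Fit a linear reward model r(x) = w . x by gradient descent on the
# pairwise loss  L = -log sigmoid(r(chosen) - r(rejected)).
w = [0.0] * dim
lr = 0.1
for _ in range(200):
    grad = [0.0] * dim
    for chosen, rejected in pairs:
        diff = [c - r for c, r in zip(chosen, rejected)]
        p = 1.0 / (1.0 + math.exp(-dot(diff, w)))   # sigmoid of the margin
        for i in range(dim):
            grad[i] += (p - 1.0) * diff[i]          # dL/dw for this pair
    w = [wi - lr * g / len(pairs) for wi, g in zip(w, grad)]

# The learned reward model should rank fresh pairs the way the "human" does.
acc = sum(dot([c - r for c, r in zip(ch, rj)], w) > 0
          for ch, rj in (make_pair() for _ in range(200))) / 200
print(f"pairwise accuracy on fresh pairs: {acc:.2f}")
```

In the third step, scores from this trained reward model would replace direct human judgments as the reward when fine-tuning the policy with PPO.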