Reinforcement learning from human feedback

Summary
In machine learning, reinforcement learning from human feedback (RLHF), also called reinforcement learning from human preferences, is a technique that trains a "reward model" directly from human feedback and then uses that model as the reward function when optimizing an agent's policy with reinforcement learning (RL), using an optimization algorithm such as Proximal Policy Optimization. The reward model is trained before the policy is optimized, and it learns to predict whether a given output is good (high reward) or bad (low reward). RLHF can improve the robustness and exploration of RL agents, especially when the reward function is sparse or noisy.
Human feedback is most commonly collected by asking humans to rank instances of the agent's behavior. These rankings can then be used to score outputs, for example with the Elo rating system. Although preference judgements are the most widely adopted form of feedback, other types of human feedback provide richer information, such as numerical feedback, natural language feedback, and edit rate.
RLHF is used in tasks where it is difficult to define a clear, algorithmic solution but where humans can easily judge the quality of the model's output. For example, if the task is to generate a compelling story, humans can rate different AI-generated stories on their quality, and the model can use that feedback to improve its story generation. RLHF has been applied to various domains of natural language processing, such as conversational agents, text summarization, and natural language understanding.
Ordinary reinforcement learning, in which agents learn from their own actions based on a "reward function", is difficult to apply to natural language processing tasks because the rewards are often hard to define or measure, especially for complex tasks that involve human values or preferences.
RLHF can enable language models to provide answers that align with these complex values, to generate more verbose responses, and to reject questions that are either inappropriate or outside the knowledge space of the model.
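As an illustration of how human rankings might be turned into scores with the Elo rating system mentioned above, here is a minimal sketch. It assumes each ranking is an ordered list from best to worst, that every ranked pair is treated as one win for the higher-ranked output, and that ratings start at 1000 with a K-factor of 32. The function names, the story labels, and these parameter values are hypothetical choices for the example, not part of any particular RLHF system.

```python
from itertools import combinations


def expected_score(r_a, r_b):
    """Elo-predicted probability that the output rated r_a beats the one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_from_rankings(rankings, k=32.0, initial=1000.0):
    """Turn human rankings (each ordered best to worst) into Elo ratings.

    Assumption for this sketch: every pair inside a ranking counts as one
    "match" won by the higher-ranked output.
    """
    ratings = {}
    for ranking in rankings:
        for item in ranking:
            ratings.setdefault(item, initial)
        # combinations() preserves input order, so `winner` is always
        # the output that the human ranked above `loser`.
        for winner, loser in combinations(ranking, 2):
            surprise = 1.0 - expected_score(ratings[winner], ratings[loser])
            ratings[winner] += k * surprise
            ratings[loser] -= k * surprise
    return ratings


# Hypothetical feedback: three annotators each rank three generated stories.
rankings = [
    ["story_a", "story_b", "story_c"],
    ["story_a", "story_c", "story_b"],
    ["story_b", "story_a", "story_c"],
]

scores = elo_from_rankings(rankings)
for name, rating in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

The resulting ratings could serve as scalar targets when fitting a reward model. Elo updates depend on the order in which comparisons are processed, so this is only one of several reasonable ways to aggregate rankings.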