This lecture explores a theoretical framework for Reinforcement Learning with Human Feedback (RLHF) that handles ordinal (comparison) data, focusing on the convergence of reward estimators under different models. It discusses the challenges that arise when training a policy against a learned reward model and introduces a pessimistic maximum likelihood estimator (MLE) that improves performance. The analysis validates the empirical success of existing RLHF algorithms and provides insights for algorithm design, unifying RLHF and maximum-entropy Inverse Reinforcement Learning. The lecture also covers the formulation of RLHF, the Plackett-Luce model, and the connection with Inverse RL, along with experiments comparing different estimators and policies.
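For reference, here is a brief sketch of the comparison models named above, using standard notation (reward model $r_\theta$, prompt $x$, completions $y_i$) that may differ from the lecture's own. Under the Bradley-Terry model (the $K=2$ case of Plackett-Luce), a pairwise preference is generated as

$$P(y_1 \succ y_2 \mid x) = \frac{\exp\big(r_\theta(x, y_1)\big)}{\exp\big(r_\theta(x, y_1)\big) + \exp\big(r_\theta(x, y_2)\big)},$$

and the Plackett-Luce model extends this to a ranking $\sigma$ over $K$ completions:

$$P(\sigma \mid x) = \prod_{k=1}^{K} \frac{\exp\big(r_\theta(x, y_{\sigma(k)})\big)}{\sum_{j=k}^{K} \exp\big(r_\theta(x, y_{\sigma(j)})\big)}.$$

The reward model is fit by maximizing the log-likelihood of the observed comparisons; roughly speaking, the pessimistic variant discussed in the lecture then trains the policy against a conservative (lower-confidence) estimate of the reward within a confidence set around the MLE, rather than against the point estimate itself.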