This lecture develops a theoretical framework for Reinforcement Learning from Human Feedback (RLHF) with ordinal (comparison-based) data, focusing on the convergence of reward estimators under different preference models. It discusses the challenges that arise when a policy is trained against a learned reward model and introduces a pessimistic MLE that improves performance. The analysis validates the empirical success of existing RLHF algorithms, provides insights for algorithm design, and unifies RLHF with maximum-entropy Inverse Reinforcement Learning. The lecture also covers the formulation of RLHF, the Plackett-Luce model, and the connection to Inverse RL, along with experiments comparing the different estimators and the resulting policies.
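For reference, here is a minimal statement of the Plackett-Luce model mentioned above; the notation ($K$ items with underlying rewards $r_1, \dots, r_K$) is illustrative and not taken from the lecture. The probability of observing a ranking $\sigma$ is

$$
\mathbb{P}(\sigma) \;=\; \prod_{k=1}^{K} \frac{\exp\!\big(r_{\sigma(k)}\big)}{\sum_{j=k}^{K} \exp\!\big(r_{\sigma(j)}\big)},
$$

which, for $K = 2$, reduces to the Bradley-Terry pairwise-comparison model $\mathbb{P}(i \succ j) = \exp(r_i)\,/\,\big(\exp(r_i) + \exp(r_j)\big)$ commonly used for reward learning from pairwise human preferences.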