This lecture covers finite-horizon reinforcement learning (RL) and introduces non-stationary policies, in which the optimal action depends on the timestep as well as the state; the instructor uses basketball as an analogy, since end-of-game strategy differs from early-game strategy. The lecture then turns to Optimistic Proximal Policy Optimization (OPPO), which uses optimistic estimates of the value functions to guide policy updates. The instructor walks through the algorithm's structure, emphasizing how transitions are estimated from empirical observations and augmented with exploration bonuses, and explains why this optimism-driven exploration can outperform methods that do not explore deliberately. The lecture concludes by comparing OPPO with Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), highlighting their theoretical underpinnings and practical implications for reinforcement learning.
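The two core ideas above, non-stationary (time-indexed) policies and optimism via count-based bonuses on empirically estimated transitions, can be illustrated with a short sketch. This is not the lecture's OPPO algorithm itself (OPPO performs policy-optimization updates); it is a minimal finite-horizon backward induction with an assumed Hoeffding-style bonus `b = c / sqrt(n)`, and the toy MDP below is invented for illustration:

```python
import numpy as np

def optimistic_backward_induction(P_hat, R, counts, H, c=1.0):
    """Finite-horizon backward induction with an optimistic bonus.

    Returns a *non-stationary* policy: one action map per timestep
    h = 0..H-1, since the best action can change as the horizon nears.

    P_hat:  (S, A, S) empirical transition estimates
    R:      (S, A)    known rewards in [0, 1]
    counts: (S, A)    visit counts; bonus b = c / sqrt(n) (assumed form)
    """
    S, A, _ = P_hat.shape
    V = np.zeros(S)                       # terminal value V_H = 0
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        bonus = c / np.sqrt(np.maximum(counts, 1))
        Q = R + bonus + P_hat @ V         # optimistic Q at step h
        Q = np.minimum(Q, H - h)          # clip to max achievable return
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy, V

# Toy 2-state MDP (hypothetical): in state 0, action 0 pays 0.1 and
# stays; action 1 pays 0 but reaches state 1, where every action pays 1.
P_hat = np.zeros((2, 2, 2))
P_hat[0, 0, 0] = 1.0   # state 0, action 0 -> stay in 0
P_hat[0, 1, 1] = 1.0   # state 0, action 1 -> move to 1
P_hat[1, :, 1] = 1.0   # state 1 is absorbing
R = np.array([[0.1, 0.0], [1.0, 1.0]])
counts = np.full((2, 2), 10**6)          # well-explored: bonus ~ 0

policy, V = optimistic_backward_induction(P_hat, R, counts, H=5)
print(policy[:, 0])   # action chosen in state 0 at each timestep
```

With a long remaining horizon the policy "invests" (action 1, moving toward the high-reward state), but at the final step it takes the immediate reward (action 0), which is exactly the basketball-style time dependence the lecture describes. Shrinking `counts` for an under-visited pair inflates its bonus, steering the agent to explore it.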