When humans or animals perform an action that leads to a desired outcome, they tend to repeat it. The mechanisms underlying learning from past experience and adapting future behavior are still not fully understood. In this thesis, I study how humans learn from sparse and delayed reward during multi-step tasks. Learning a sequence of multiple decisions from a reward obtained only at the end of the sequence requires a mechanism that links earlier actions to later reward. The theory of reinforcement learning suggests an algorithmic solution to this problem, namely, to keep a decaying memory of the state-action history. Such memories are called eligibility traces. They bridge the temporal delay between the moment an action is taken and a subsequent reward. We ask whether humans make use of eligibility traces when learning a sequential decision-making task. The difficulty in answering this question is that different competing algorithmic solutions make similar predictions about behavior. Only during a few initial trials is learning with eligibility traces qualitatively different from learning with other algorithms. Here, I implemented a novel learning task with an experimental manipulation that allowed us to guide participants through a controlled sequence of states. With this hidden manipulation, we were able to isolate the specific trials in which the competing models are distinguishable. Behavioral data as well as simultaneously recorded pupil dilation revealed effects compatible with eligibility traces, but not with simpler models. Furthermore, the trial-by-trial reward prediction errors were correlated with pupil dilation and EEG measurements. Our experimental data show effects of eligibility traces in behavior and pupil data after a single experience of state-action associations, which has not been studied before in a multi-step task. We view our results in the light of one-shot learning and as a signature of a learning mechanism present in both temporal difference learning and one-shot learning.
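To make the eligibility-trace mechanism described above concrete, the following is a minimal sketch of tabular learning with accumulating traces (SARSA(lambda)-style). It is only an illustration of the general algorithmic idea of a decaying state-action memory, not the specific model fitted in the thesis; the state and action space sizes and the parameters alpha (learning rate), gamma (discount), and lam (trace decay) are hypothetical choices.

import numpy as np

n_states, n_actions = 6, 2
alpha, gamma, lam = 0.1, 0.95, 0.8

Q = np.zeros((n_states, n_actions))   # estimated action values
e = np.zeros((n_states, n_actions))   # eligibility traces over state-action pairs

def update(s, a, r, s_next, a_next, done):
    """One learning step: the trace tags recently visited state-action pairs
    so that a delayed reward can be credited back to earlier decisions."""
    global Q, e
    td_error = r + (0.0 if done else gamma * Q[s_next, a_next]) - Q[s, a]
    e[s, a] += 1.0              # mark the current state-action pair as eligible
    Q += alpha * td_error * e   # credit all eligible pairs in proportion to their trace
    e *= gamma * lam            # traces decay with temporal distance from the reward
    if done:
        e[:] = 0.0              # reset traces at the end of an episode

The key point is the line Q += alpha * td_error * e: when a reward arrives only at the end of a multi-step sequence, the decaying trace e still assigns part of the credit to earlier actions, which is exactly what a one-step update without traces cannot do.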