In reinforcement learning (RL), an agent makes sequential decisions to maximise the reward it can obtain from an environment. During learning, the actual and expected outcomes are compared to tell whether a decision was good or bad. The difference between the actual outcome and the expected outcome is the prediction error. Prediction errors fall into two types, the reward prediction error (RPE) and the state prediction error (SPE), both of which can serve as teaching signals in reinforcement learning models. Electroencephalogram (EEG) studies have shown that the RPE is reflected in an EEG waveform called the feedback-related negativity (FRN), which occurs over the frontal-central brain region between 250 and 400 ms after a reward signal is shown. Most FRN studies use one-step decision-making tasks to study the relationship between FRN amplitude and the RPE. However, everyday reinforcement learning situations usually involve many non-rewarded states and actions before a reward is obtained. The first part of this thesis uses a truly sequential decision-making paradigm and asks whether the FRN still reflects the RPE in complex multi-step tasks. The SPE measures how much the agent's expectation about state transitions differs before and after an observation. Novelty and surprise are two types of SPE signals that drive learning when the external reward is not yet provided. However, how novelty and surprise interact and contribute to learning has remained unaddressed. In this thesis, I proposed a model combining novelty and surprise to explain human learning when reward is delayed and sparse. I used a two-block experimental design to distinguish the effects of novelty and surprise on learning, and studied the neural correlates of novelty and surprise in a sequential decision-making task.
I implemented different sequential decision-making tasks to study four RL signals: the eligibility trace, the RPE, novelty and surprise. Using pupil dilation measurements, I showed evidence of an eligibility trace in human learning. With EEG recordings, I confirmed that the RPE is reflected in the amplitude of the FRN (time window of 280 to 390 ms after state onset), for both directly rewarded and non-directly rewarded states. I proposed a new RL model, called SurNoR, which uses novelty as the intrinsic reward and surprise as the learning modulator, to explain human learning when no external reward is provided. The novelty signal is reflected in the EEG waveform between 80 and 130 ms after state onset. The surprise signal occurs later than the novelty signal and is reflected between 150 and 210 ms after state onset. By using the sequential decision-making paradigm, this thesis extends the EEG observations of RPE and SPE signals from simple one-step tasks to complex multi-step decision-making tasks.
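To make two of the signals named above concrete, the following is a minimal sketch of a textbook tabular TD(lambda) learner. It is not the SurNoR model or any model from the thesis. The function name, the four-state chain and the values of `alpha`, `gamma` and `lam` are assumptions chosen only for illustration. The sketch shows how an eligibility trace passes a terminal reward back to earlier states, so that a non-rewarded state can produce a non-zero RPE on a later episode.

```python
import numpy as np

def td_lambda_episode(states, rewards, values, alpha=0.1, gamma=0.9, lam=0.8):
    """Run one episode of tabular TD(lambda); return the RPE at each step.

    states  : visited state indices, length T + 1 (the last one is terminal)
    rewards : reward received after each transition, length T
    values  : state-value estimates, updated in place
    All parameter values are illustrative assumptions.
    """
    trace = np.zeros_like(values)  # eligibility trace, one entry per state
    rpes = []
    for t, reward in enumerate(rewards):
        s, s_next = states[t], states[t + 1]
        terminal = t == len(rewards) - 1
        v_next = 0.0 if terminal else values[s_next]
        # Reward prediction error: actual minus expected outcome.
        rpe = reward + gamma * v_next - values[s]
        # Decay all traces, then mark the current state as eligible.
        trace *= gamma * lam
        trace[s] += 1.0
        # Every recently visited state learns from the same RPE.
        values += alpha * rpe * trace
        rpes.append(rpe)
    return rpes

# Hypothetical chain 0 -> 1 -> 2 -> 3 with a reward only at the end.
values = np.zeros(4)
path, rewards = [0, 1, 2, 3], [0.0, 0.0, 1.0]

rpes_first = td_lambda_episode(path, rewards, values)
values_after_first = values.copy()
rpes_second = td_lambda_episode(path, rewards, values)

print(rpes_first)      # RPE only at the rewarded step
print(rpes_second[0])  # small positive RPE at the first, non-rewarded state
```

In the first episode only the final transition yields an RPE, but the trace lets states 0 and 1 acquire value from it. In the second episode the step out of state 0 already carries a positive RPE although no reward is delivered there.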
Wulfram Gerstner, Johanni Michael Brea, Alireza Modirshanechi, Kerstin Preuschoff, Marco Philipp Lehmann, Vasiliki Liakoni