State–action–reward–state–action

State–action–reward–state–action (SARSA) is an algorithm for learning a Markov decision process policy, used in the reinforcement learning area of machine learning. It was proposed by Rummery and Niranjan in a technical note under the name "Modified Connectionist Q-Learning" (MCQ-L). The alternative name SARSA, proposed by Rich Sutton, was mentioned only in a footnote. This name reflects the fact that the main function for updating the Q-value depends on the current state of the agent "S1", the action the agent chooses "A1", the reward "R" the agent gets for choosing this action, the state "S2" that the agent enters after taking that action, and finally the next action "A2" the agent chooses in its new state. The acronym for the quintuple (s_t, a_t, r_t, s_{t+1}, a_{t+1}) is SARSA. Some authors use a slightly different convention and write the quintuple (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}), depending on which time step the reward is formally assigned. The rest of the article uses the former convention.
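As a minimal sketch of how the quintuple enters the update, the standard tabular SARSA rule Q(s_t, a_t) ← Q(s_t, a_t) + α [r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)] can be written in Python as follows. The function name `sarsa_update`, the dictionary-based table `q`, and the values chosen for the learning rate `alpha` and discount factor `gamma` are illustrative assumptions, not part of the original note.

```python
from collections import defaultdict


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply one tabular SARSA update to q in place and return the new Q(s, a).

    alpha (learning rate) and gamma (discount factor) are illustrative defaults.
    """
    # Target uses the action actually chosen in the next state (on-policy).
    td_target = r + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (td_target - q[(s, a)])
    return q[(s, a)]


# Hypothetical example: one transition (S1, A1, R, S2, A2).
q = defaultdict(float)       # all Q-values start at 0.0
q[("S2", "A2")] = 1.0        # assumed existing estimate for the next pair
new_value = sarsa_update(q, "S1", "A1", 0.5, "S2", "A2")
```

Because the target uses the next action A2 that the agent actually selects, rather than the maximum over all actions in S2, the update is on-policy; this is the point at which SARSA differs from Q-learning.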

One-shot learning and eligibility traces in sequential decision making

Marco Philipp Lehmann

When humans or animals perform an action that leads to a desired outcome, they tend to repeat it. The mechanisms underlying learning from past experience and adapting future behavior are still not fully understood. In this thesis, I study how humans learn from sparse and delayed reward during multi-step tasks. Learning a sequence of multiple decisions from a reward obtained only at the end of the sequence requires a mechanism to link earlier actions to later reward. The theory of reinforcement learning suggests an algorithmic solution to this problem, namely, to keep a decaying memory of the state-action history. Such memories are called eligibility traces. They bridge the temporal delay between the moment an action is taken and a subsequent reward. We ask whether humans make use of eligibility traces when learning a sequential decision-making task. The difficulty in answering this question is that different competing algorithmic solutions make similar predictions about behavior. Only during the first few trials is learning with eligibility traces qualitatively different from other algorithms. Here, I implemented a novel learning task with an experimental manipulation that allowed us to guide participants through a controlled sequence of states. With this hidden manipulation, we were able to isolate the specific trials in which the competing models are distinguishable. Behavioral data as well as simultaneously recorded pupil dilation revealed effects compatible with eligibility traces, but not with simpler models. Furthermore, the trial-by-trial reward prediction errors were correlated with pupil dilation and EEG measurements. Our experimental data show effects of eligibility traces in behavior and pupil data after a single experience of state-action associations, which has not been studied before in a multi-step task.
We view our results in the light of one-shot learning and as a signature of a learning mechanism present both in temporal difference and one-shot learning.
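To illustrate the idea of a decaying memory of the state-action history, the following is a minimal sketch of tabular SARSA(λ) with accumulating eligibility traces. It is a generic textbook-style formulation, not the model fitted in the thesis; the function name `sarsa_lambda_step`, the state and action labels, and the values of `alpha`, `gamma` and `lam` are all assumptions made for illustration.

```python
from collections import defaultdict


def sarsa_lambda_step(q, traces, s, a, r, s_next, a_next,
                      alpha=0.1, gamma=0.9, lam=0.8):
    """One SARSA(lambda) step with accumulating traces; returns the TD error."""
    # Reward prediction error for the current transition.
    delta = r + gamma * q[(s_next, a_next)] - q[(s, a)]
    # Mark the pair just visited as eligible for credit.
    traces[(s, a)] += 1.0
    for key in list(traces):
        # Every recently visited pair is updated in proportion to its trace.
        q[key] += alpha * delta * traces[key]
        # Traces decay, so older pairs receive less credit later on.
        traces[key] *= gamma * lam
    return delta


# Hypothetical two-step episode with reward only at the end.
q = defaultdict(float)
traces = defaultdict(float)
sarsa_lambda_step(q, traces, "s0", "a0", 0.0, "s1", "a1")
sarsa_lambda_step(q, traces, "s1", "a1", 1.0, "end", None)
first_action_value = q[("s0", "a0")]
last_action_value = q[("s1", "a1")]
```

In this sketch the first action acquires a positive value after a single rewarded episode, because its trace is still non-zero when the reward arrives. With `lam=0.0` the trace of the first pair would have decayed to zero by then, and only the final action would be updated on that first episode.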

EPFL, 2018

Models of Reward-Modulated Spike-Timing-Dependent Plasticity

Nicolas Frémaux

How do animals learn to repeat behaviors that lead to obtaining food or other “rewarding” objects? As a biologically plausible paradigm for learning in spiking neural networks, spike-timing-dependent plasticity (STDP) has been shown to perform well in unsupervised learning tasks such as receptive field development. However, STDP fails to take behavioral relevance into account, and as such is inadequate to explain a vast range of learning tasks in which the final outcome, conditioned on the prior execution of a series of actions, is signaled to an animal through sparse rewards. In this thesis, I show that the addition of a third, global, reward-based factor to the pre- and postsynaptic factors of STDP is a promising solution to this problem, consistent with experimental findings. On the one hand, dopamine is a neuromodulator which has been shown to encode reward signals in the brain. On the other hand, STDP has been shown to be affected by dopamine, even though the precise nature of the interaction is unclear. Moreover, the theoretical framework of reinforcement learning provides a strong foundation for the analysis of these learning rules. After studying existing examples of such rules in a navigation task, I derive simple functional requirements for reward-modulated learning rules, and illustrate these in a motor learning task. One of those functional requirements is the existence of a “critic” structure, constantly evaluating the potential for rewarding events. The implications of the existence of such a critic for the interpretation of psychophysical experiments are also discussed. Finally, I propose a biologically plausible implementation of such a structure that performs motor or navigational tasks. This is based on a generalization of temporal difference learning, a well-known reinforcement learning framework, to continuous time, which is well suited to an implementation with spiking neurons.
These results provide a unified picture of reward-modulated learning rules: even though different rules have been proposed, they can be reduced to a single model at the synaptic level, with variations in the computation of the neuromodulatory signal enabling switching between different learning rules.

EPFL, 2013

Stress, noradrenaline, and realistic prediction of mouse behaviour using reinforcement learning

Wulfram Gerstner, Gediminas Luksys, Maria del Carmen Sandi Perez

Suppose we train an animal in a conditioning experiment. Can one predict how a given animal, under given experimental conditions, would perform the task? Since various factors such as stress, motivation, genetic background, and previous errors in task performance can influence animal behavior, this appears to be a very challenging aim. Reinforcement learning (RL) models have been successful in modeling animal (and human) behavior, but their success has been limited because of uncertainty as to how to set meta-parameters (such as learning rate, exploitation-exploration balance and future reward discount factor) that strongly influence model performance. We show that a simple RL model whose meta-parameters are controlled by an artificial neural network, fed with inputs such as stress, affective phenotype, previous task performance, and even neuromodulatory manipulations, can successfully predict mouse behavior in the "hole-box", a simple conditioning task. Our results also provide important insights into how stress and anxiety affect animal learning, performance accuracy, and discounting of future rewards, and into how noradrenergic systems can interact with these processes.