We apply diffusion strategies to develop a cooperative reinforcement learning algorithm in which agents in a network communicate with their neighbors to improve predictions about their environment. The algorithm is suitable for off-policy learning, even in large state spaces. We provide a mean-square-error performance analysis under constant step-sizes. The benefits of cooperation, in the form of improved stability and reduced bias and variance in the prediction error, are illustrated in the context of a classical model. We show that the improvement in performance is especially significant when the behavior policy of the agents differs from the target policy under evaluation.
Ali H. Sayed, Kun Yuan, Lucas Cesar Eduardo Cassano
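To make the diffusion mechanism concrete, below is a minimal sketch of a generic adapt-then-combine (ATC) diffusion TD(0) update with linear function approximation and importance-sampling ratios for off-policy evaluation. The toy chain environment, the ring combination matrix, and all names (`PHI`, `behavior`, `target`, `MU`) are illustrative assumptions for this sketch, not the paper's exact recursion or notation.

```python
# Illustrative sketch: adapt-then-combine diffusion TD(0) with linear
# features and importance-sampling ratios for off-policy evaluation.
# The environment, network, and constants are assumptions, not the
# paper's exact algorithm.
import numpy as np

rng = np.random.default_rng(0)

N_AGENTS, N_STATES, N_FEATURES = 4, 10, 4
GAMMA, MU = 0.9, 0.05            # discount factor, constant step-size

# Random linear features, one row per state (illustrative).
PHI = rng.standard_normal((N_STATES, N_FEATURES))

# Doubly stochastic combination matrix over a ring network.
A = np.zeros((N_AGENTS, N_AGENTS))
for k in range(N_AGENTS):
    A[k, k] = 0.5
    A[k, (k - 1) % N_AGENTS] = 0.25
    A[k, (k + 1) % N_AGENTS] = 0.25

# Behavior and target policies over two actions; the mismatch makes the
# problem off-policy and yields nontrivial importance ratios.
behavior = np.array([0.5, 0.5])
target = np.array([0.9, 0.1])

def step(s, a):
    """Toy chain dynamics: action 0 moves right, action 1 moves left."""
    s_next = (s + 1) % N_STATES if a == 0 else (s - 1) % N_STATES
    reward = 1.0 if s_next == 0 else 0.0
    return s_next, reward

W = np.zeros((N_AGENTS, N_FEATURES))   # one weight vector per agent
states = rng.integers(N_STATES, size=N_AGENTS)

for t in range(5000):
    psi = np.empty_like(W)             # intermediate (adapted) iterates
    for k in range(N_AGENTS):
        s = states[k]
        a = rng.choice(2, p=behavior)      # act with the behavior policy
        s_next, r = step(s, a)
        rho = target[a] / behavior[a]      # importance-sampling ratio
        td_err = r + GAMMA * PHI[s_next] @ W[k] - PHI[s] @ W[k]
        # Adapt: local off-policy TD(0) step with a constant step-size.
        psi[k] = W[k] + MU * rho * td_err * PHI[s]
        states[k] = s_next
    # Combine: each agent averages the intermediate iterates of its
    # neighbors; this is the only communication in the network.
    W = A @ psi

print("per-agent weights after training:\n", W)
```

The adapt step is purely local, while the combine step `W = A @ psi` captures the neighbor-to-neighbor averaging described in the abstract; a doubly stochastic choice of `A` is one common way to ensure the agents' iterates agree in the limit.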