Concept: Multi-armed bandit

Summary

In probability theory and machine learning, the multi-armed bandit problem (sometimes called the K- or N-armed bandit problem) is a problem in which a fixed limited set of resources must be allocated between competing (alternative) choices in a way that maximizes their expected gain, when each choice's properties are only partially known at the time of allocation, and may become better understood as time passes or by allocating resources to the choice. This is a classic reinforcement learning problem that exemplifies the exploration–exploitation tradeoff dilemma. The name comes from imagining a gambler at a row of slot machines (sometimes known as "one-armed bandits"), who has to decide which machines to play, how many times to play each machine and in which order to play them, and whether to continue with the current machine or try a different machine. The multi-armed bandit problem also falls into the broad category of stochastic scheduling.
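The exploration–exploitation tradeoff described above can be illustrated with a minimal simulation. The sketch below uses the epsilon-greedy rule on a Bernoulli bandit: with probability ε a random arm is explored, otherwise the empirically best arm is exploited. The arm payoff probabilities are made up for illustration.

```python
import random

def epsilon_greedy(true_means, epsilon=0.1, rounds=10_000, seed=0):
    """Play a K-armed Bernoulli bandit with the epsilon-greedy rule."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k      # number of pulls per arm
    values = [0.0] * k    # empirical mean reward per arm
    total = 0.0
    for _ in range(rounds):
        if rng.random() < epsilon:                       # explore
            arm = rng.randrange(k)
        else:                                            # exploit
            arm = max(range(k), key=lambda a: values[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # incremental update of the running mean for this arm
        values[arm] += (reward - values[arm]) / counts[arm]
        total += reward
    return values, total

values, total = epsilon_greedy([0.2, 0.5, 0.8])
```

With enough rounds the empirical means converge and the learner concentrates its pulls on the best arm, while the ε fraction of exploratory pulls keeps the estimates of the other arms from going stale.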


Related people (4)

Related concepts (1)

Machine learning (ML) is an umbrella term for solving problems for which development of algorithms by human programmers would be cost-prohibitive, and instead the problems are solved by helping machines "discover" their own algorithms.

Related publications (42)

Related courses (10)

Related lectures (13)

CS-430: Intelligent agents

Software agents are widely used to control physical, economic and financial processes. The course presents practical methods for implementing software agents and multi-agent systems, supported by programming exercises, and the theoretical underpinnings including computational game theory.

COM-406: Foundations of Data Science

We discuss a set of topics that are important for the understanding of modern data science but that are typically not taught in an introductory ML course. In particular, we discuss fundamental ideas and techniques that come from probability, information theory, as well as signal processing.

EE-556: Mathematics of data: from theory to computation

This course provides an overview of key advances in continuous optimization and statistical analysis for machine learning. We review recent learning formulations and models as well as their guarantees, describe scalable solution techniques and algorithms, and illustrate the trade-offs involved.


Related units (2)

In this thesis the Support Vector Machine (SVM) is applied to classification of high-resolution satellite images. Several different measures for classification, including texture measures, first-order statistics, and simple contextual information, were evaluated. Additionally, the image was segmented, using an enhanced watershed method, in order to improve the classification accuracy.

2003

The Distributed Constraint Optimization (DCOP) framework can be used to model a wide range of optimization problems that are inherently distributed. A distributed optimization problem can be viewed as a problem distributed over a set of agents, where agents are responsible for finding a solution for their part of the problem. In the DCOP framework, the decisions the agents need to take are modeled using decision variables, and both the local problems and the inter-dependencies are modeled using constraints. In this thesis, we attack two problems related to the DCOP framework. In the first part of this thesis, we look at coordination problems with complex, non-trivial preferences. In the DCOP literature, the notion of complex is used for problems where agents need to take multiple decisions and thus own multiple decision variables. To express the fact that agents' preferences are hard to obtain, we introduce the notion of a non-trivial local problem. The present DCOP literature has largely ignored the origin of preferences, i.e. algorithms assume that the constraints are known a priori. In coordination problems with complex, non-trivial preferences this assumption no longer holds. In order to investigate the behavior of existing DCOP algorithms on such coordination problems, we first introduce a new DCOP model for such coordination problems, where we make an explicit distinction between the coordination part of the problem and the local preferences of the agents. We then introduce two new benchmarks with complex, non-trivial local subproblems, and perform an evaluation of several complete and non-complete DCOP algorithms. Our results show that the O-DPOP algorithm, which elicits preferences from agents in a best-first order, outperforms both the other complete approaches and local search approaches.
We show that, despite the fact that a best-first order cannot always be guaranteed when dealing with non-trivial local problems, O-DPOP still finds solutions that are either better than, or on par with, local search algorithms. In the second part of this thesis, we introduce a new sampling-based approach to solving DCOPs. Existing DCOP algorithms are either complete, but do not scale well, or scale well but are not guaranteed to find a solution. Our approach, inspired by the application of the UCB algorithm for the multi-armed bandit problem on trees, aims to provide a method that scales well. Furthermore, given certain assumptions on the problem structure, our method is guaranteed to find good solutions. We introduce the Distributed UCT (DUCT) algorithm, which uses bounds similar to those used in the UCT algorithm, an adaptation of the UCB approach to trees. We introduce and evaluate four different variations of the DUCT algorithm, DUCT-A, DUCT-B, DUCT-C and DUCT-D, that differ in the bounds that they use. To evaluate our algorithms, we compare them with existing DCOP approaches on three different types of problems, both with and without hard constraints. Our results show that, compared with other DCOP algorithms, the DUCT-D variant consistently performs well on the problems used, with respect to both solution quality and runtime. Furthermore, it consistently sends less information than the state-of-the-art DCOP algorithms, and is thus shown to be a suitable and practical algorithm for solving DCOPs.
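The UCB rule that the abstract above builds on can be sketched in its simplest single-agent form, UCB1, which pulls the arm maximizing the empirical mean plus a confidence radius. This is only an illustration of the underlying bandit rule, not the distributed DUCT algorithm itself; the arm probabilities are made up.

```python
import math
import random

def ucb1(true_means, rounds=5000, seed=1):
    """UCB1: pull the arm maximizing mean + sqrt(2 ln t / n_arm)."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    values = [0.0] * k

    def pull(arm):
        return 1.0 if rng.random() < true_means[arm] else 0.0

    # initialise by pulling each arm once
    for a in range(k):
        counts[a] = 1
        values[a] = pull(a)
    for t in range(k, rounds):
        # optimism in the face of uncertainty: empirical mean + confidence radius
        arm = max(range(k),
                  key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]
    return counts

counts = ucb1([0.3, 0.6, 0.9])
```

The confidence radius shrinks as an arm is pulled more often, so suboptimal arms receive only logarithmically many pulls while the best arm dominates.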

We propose an online algorithm for sequential learning in the contextual multi-armed bandit setting. Our approach is to partition the context space and then optimally combine all of the possible mappings between the partition regions and the set of bandit arms in a data-driven manner. We show that in our approach, the best mapping is able to approximate the best arm selection policy to any desired degree under mild Lipschitz conditions. Therefore, we design our algorithm based on the optimal adaptive combination and asymptotically achieve the performance of the best mapping as well as the best arm selection policy. This optimality is also guaranteed to hold even in adversarial environments, since we do not rely on any statistical assumptions regarding the contexts or the loss of the bandit arms. Moreover, we design an efficient implementation for our algorithm using various hierarchical partitioning structures, such as lexicographical or arbitrary position splitting and binary trees (BTs) (and several other partitioning examples). For instance, in the case of BT partitioning, the computational complexity is only log-linear in the number of regions in the finest partition. In conclusion, we provide significant performance improvements by introducing upper bounds (with respect to the best arm selection policy) that are mathematically proven to vanish in the average loss per round sense at a faster rate compared to the state of the art. Our experimental work extensively covers various scenarios ranging from bandit settings to multiclass classification with real and synthetic data. In these experiments, we show that our algorithm is highly superior to the state-of-the-art techniques while maintaining the introduced mathematical guarantees and reasonable computational scalability.

2019
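A drastically simplified version of the partition-based idea in the abstract above runs one independent epsilon-greedy learner per region of a fixed context partition. This is a hypothetical toy sketch, not the authors' adaptive-combination algorithm; the context distribution and reward probabilities are made up for illustration.

```python
import random

def contextual_bandit(rounds=20_000, n_bins=4, epsilon=0.1, seed=2):
    """Toy contextual bandit: split the context space [0, 1) into bins
    and run an independent epsilon-greedy learner in each bin."""
    rng = random.Random(seed)
    k = 2  # two arms: arm 0 pays off for low contexts, arm 1 for high
    counts = [[0] * k for _ in range(n_bins)]
    values = [[0.0] * k for _ in range(n_bins)]

    def reward(ctx, arm):
        # arm 0 is best when ctx < 0.5, arm 1 otherwise (illustrative)
        p = 0.8 if (arm == 0) == (ctx < 0.5) else 0.2
        return 1.0 if rng.random() < p else 0.0

    for _ in range(rounds):
        ctx = rng.random()
        b = min(int(ctx * n_bins), n_bins - 1)   # which partition region
        if rng.random() < epsilon:               # explore
            arm = rng.randrange(k)
        else:                                    # exploit within this region
            arm = max(range(k), key=lambda a: values[b][a])
        r = reward(ctx, arm)
        counts[b][arm] += 1
        values[b][arm] += (r - values[b][arm]) / counts[b][arm]

    # best arm learned in each region
    return [max(range(k), key=lambda a: values[b][a]) for b in range(n_bins)]

best = contextual_bandit()
```

Because each region maintains its own estimates, the learned mapping from regions to arms approximates a context-dependent policy, which is the intuition the paper's adaptive combination refines.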