MS Thesis

General Technical Committee

General Technical Committee Names
Chairperson: Dr. P Sreenivasa Kumar (HoD’s Nominee)
  Professor
  Computer Science & Engineering
  Indian Institute of Technology Madras
Research Advisor: Dr. Balaraman Ravindran
  Professor
  Computer Science & Engineering
  Indian Institute of Technology Madras
Research Co-Advisor: Dr. Nandan Sudarsanam
  Assistant Professor
  Department of Management Studies
  Indian Institute of Technology Madras
Member 1: Dr. Sutanu Chakraborti
  Associate Professor
  Computer Science & Engineering
  Indian Institute of Technology Madras
Member 2: Dr. Nirav P. Bhatt
  DST-INSPIRE Faculty
  Department of Chemical Engineering
  Indian Institute of Technology Madras
Thesis Reviewer: Dr. Krishna Jagannathan
  Assistant Professor
  Department of Electrical Engineering
  Indian Institute of Technology Madras

Finite-time Analysis of Frequentist Strategies for Multi-armed Bandits

Abstract

KEYWORDS: Reinforcement Learning, Stochastic Bandits, UCB, UCBV, EUCBV, Thresholding Bandits, APT, AugUCB

Show details

This thesis studies the following topics in the area of Reinforcement Learning: classic Multi-armed bandits in stationary distribution with the goal of cumulative regret mini- mization using variance estimates and Thresholding bandits in pure exploration fixed- budget setting with the goal of instantaneous regret minimization also using variance estimates. The common underlying theme is the study of bandit theory and its applica- tion in various types of environments. In the first part of the thesis, we study the classic multi-armed bandit problem with a stationary distribution, one of the first settings stud- ied by the bandit community and which successively gave rise to several new directions in bandit theory. We propose a novel algorithm in this setting and compare both theoret- ically and empirically its performance against the available algorithms. Our proposed algorithm termed as Efficient-UCB-Variance (EUCBV) is the first arm-elimination al- gorithm which uses variance estimation to eliminate arms as well as achieve an order optimal regret bound. Empirically, we show that EUCBV outperforms most of the state-of-the-art algorithms in the considered environments. In the next part, we study a specific type of stochastic multi-armed bandit setup called the thresholding bandit problem and discuss its usage, available state-of-the-art algorithms on this setting and our solution to this problem. We propose the Augmented-UCB (AugUCB) algorithm which again uses variance and mean estimation along with arm elimination technique to conduct exploration. We give theoretical guarantees on the expected loss of our algo- rithm and also analyze its performance against state-of-the-art algorithms in numerical simulations in multiple synthetic environments.

MS Thesis

You can find my MS thesis here. Download M.S thesis