Subhojyoti Mukherjee
I am a Ph.D. candidate in the Department of Electrical and Computer Engineering (ECE), University of Wisconsin-Madison.
I am looking for full-time positions in industry.
Download CV
Email: smukherjee27 [at] wisc [dot] edu
Statement and Vision
My broader vision is to build large-scale, trustworthy language, vision, and machine learning models. Toward this goal, I have worked on incorporating adaptive data collection strategies into LLM training and on aligning LLMs with human feedback by collecting informative data. Building large-scale, trustworthy machine learning models is a challenging task, and my past work has examined several aspects of data collection for training models:
- Adaptive data collection in Reinforcement Learning
- Understanding in-context learning for Decision Transformers
- Adaptive prompt design for LLMs, and aligning LLMs with human preferences through fine-tuning
- Safety in Machine Learning
My expertise spans algorithm research and development, training machine learning models, Reinforcement Learning, fine-tuning LLMs, and prompt design for LLMs. This expertise is crucial for building large-scale, real-world systems that interact with users and learn their preferences from data.
Education
Ph.D. candidate (Fall 2019 to Fall 2024, expected)
ECE, University of Wisconsin-Madison, advised by Dr. Robert Nowak, Dr. Josiah Hanna, and Dr. Qiaomin Xie. Areas of research: Reinforcement Learning, Active Learning, incorporating deep active learning strategies for Large Language Models (LLMs), aligning LLMs with human feedback (RLHF), and understanding sequential decision-making using transformers (DT). (Joint) Master's Thesis: Active Sequential Hypothesis Testing with Extension to Active Regression and Multi-armed Bandits. pdf
M.S. by Research (2015 to 2018)
CSE, Indian Institute of Technology (IIT) Madras, advised by Dr. Balaraman Ravindran and Dr. Nandan Sudarsanam (RISE Lab). Areas of research: Reinforcement Learning, stochastic and non-stochastic Multi-Armed Bandit settings. Master's Thesis: Finite-time Analysis of Frequentist Strategies for Multi-armed Bandits. pdf
Bachelor of Technology (2009 to 2013)
Dept. of Computer Science and Engineering, Meghnad Saha Institute of Technology, Kolkata, under West Bengal University of Technology, India
Research Internships
Amazon AWS AI, Santa Clara, USA, Summer 2024 (full-time)
Hosted by Branislav Kveton and Anusha Lalitha, with Sailik Sengupta, Yifei Ma, Aniket Deshmukh, and Gaurush Hiranandani. Area of research: Multi-objective alignment for LLMs.
Amazon AWS AI, Santa Clara, USA, Fall 2023 (part-time)
Hosted by Branislav Kveton, with Yifei Ma, Anusha Lalitha, Kousha Kalantiri, Ge Liu, Aniket Deshmukh, and Anoop Deoras. Area of research: RLHF with LLMs.
Amazon AWS AI, Santa Clara, USA, Summer 2023 (full-time)
Hosted by Branislav Kveton, with Yifei Ma, Anusha Lalitha, Ge Liu, Aniket Deshmukh, and Anoop Deoras. Area of research: Active In-Context Learning with LLMs.
CMU, ECE Dept., Pittsburgh, USA, Summer 2019
Hosted by Prof. Gauri Joshi. Area of research: Structured Bandits.
Adobe Research, San Jose, USA, Spring 2018
Hosted by Branislav Kveton. Area of research: Item recommendation with ranking and bandits.
INRIA, SequeL Lab, Lille, France, Fall 2017
Hosted by Odalric Maillard. Area of research: Non-stationary Bandits.
Research Focus and Selected Works
LLMs, RLHF, and Prompt Design
Optimal Design for Human Feedback for Training Large Language Models (NeurIPS 2024, main conference). We study the problem of data collection for learning preference models. The key idea in our work is to generalize optimal design, a method for computing information-gathering policies, to ranked lists. We design efficient algorithms and experiment with several synthetic and real-world datasets to show the statistical efficiency of our algorithms. pdf
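As a rough, hedged illustration of the optimal-design idea (not the exact algorithm from the paper), the sketch below greedily picks which candidate lists to send to annotators by maximizing the log-determinant of a design matrix built from within-list item-feature differences, assuming a linear preference model. The feature dimension, candidate lists, and the surrogate information matrix are illustrative assumptions.

import numpy as np

def list_design_matrix(lists, features):
    """Map each candidate list to the sum of outer products of within-list
    item-feature differences, a simple surrogate for the information a
    ranking of that list carries about a linear preference model."""
    mats = []
    for items in lists:
        X = features[items]                    # (K, d) features of items in the list
        diffs = X[:, None, :] - X[None, :, :]  # all pairwise feature differences
        diffs = diffs.reshape(-1, X.shape[1])
        mats.append(diffs.T @ diffs)
    return mats

def greedy_d_optimal(lists, features, budget, reg=1e-3):
    """Greedily choose `budget` lists to annotate, each time adding the list
    that most increases log det of the accumulated design matrix (repeats
    allowed, i.e. the same list may be annotated more than once)."""
    d = features.shape[1]
    mats = list_design_matrix(lists, features)
    V = reg * np.eye(d)
    chosen = []
    for _ in range(budget):
        gains = [np.linalg.slogdet(V + M)[1] for M in mats]
        best = int(np.argmax(gains))
        chosen.append(best)
        V = V + mats[best]
    return chosen

# toy usage: 20 items with 5-dim features, 10 candidate lists of length 4
rng = np.random.default_rng(0)
features = rng.normal(size=(20, 5))
lists = [rng.choice(20, size=4, replace=False) for _ in range(10)]
print(greedy_d_optimal(lists, features, budget=3))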
Off-Policy Evaluation from Logged Human Feedback using Large Language Models (ICML 2024 Workshop). We study off-policy evaluation from logged human feedback. We formalize the problem, propose both model-based and model-free estimators for policy values, and show how to optimize them. We analyze the unbiasedness of our estimators and evaluate them empirically with Large Language Models. Our estimators can predict the absolute values of evaluated policies, rank them, and be optimized. pdf
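To give the flavor of the model-free direction, here is a minimal inverse-propensity sketch, under the simplifying assumption of logged per-response feedback; the paper itself handles richer human feedback (e.g., pairwise comparisons), and all names below are illustrative.

def ips_policy_value(logs, target_prob, logging_prob):
    """Model-free (inverse-propensity) sketch of off-policy evaluation from
    logged feedback. Each log entry is (prompt, response, feedback), with
    feedback collected under a logging policy; target_prob / logging_prob
    return each policy's probability of generating `response` for `prompt`.
    The per-response feedback format is an assumption for illustration."""
    value = 0.0
    for prompt, response, feedback in logs:
        w = target_prob(prompt, response) / logging_prob(prompt, response)
        value += w * feedback
    return value / len(logs)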
Optimal Design for Adaptive In-Context Prompt Selection in Large Language Models. We use active learning for adaptive prompt design and call our approach Active In-Context Prompt Design (AIPD). We design the LLM prompt by adaptively choosing informative few-shot examples from a training set using optimal design, so as to optimize performance on a test set. We experiment on different tasks with small-, medium-, and large-sized LLMs, and show that our proposed algorithms GO and SAL outperform other methods for choosing few-shot examples in the LLM prompt at inference. pdf
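A hedged sketch of the general idea (not GO or SAL exactly): choose few-shot examples whose embeddings best "cover" the test input in the optimal-design sense, by greedily minimizing the predictive variance of a linear model fit on the chosen examples. The embedding source and all names are assumptions for illustration.

import numpy as np

def select_few_shot(pool_emb, test_emb, k, reg=1e-3):
    """Pick k examples from a training pool by greedily minimizing the
    G-optimal-style predictive variance  x_test^T V^{-1} x_test, where V is
    the design matrix of the examples chosen so far. Embeddings are assumed
    to come from any sentence encoder."""
    d = pool_emb.shape[1]
    V = reg * np.eye(d)
    chosen = []
    for _ in range(k):
        best, best_var = None, np.inf
        for i in range(len(pool_emb)):
            if i in chosen:
                continue
            Vi = V + np.outer(pool_emb[i], pool_emb[i])
            var = float(test_emb @ np.linalg.solve(Vi, test_emb))
            if var < best_var:
                best, best_var = i, var
        chosen.append(best)
        V += np.outer(pool_emb[best], pool_emb[best])
    return chosen

# toy usage with random embeddings
rng = np.random.default_rng(1)
pool = rng.normal(size=(50, 8))
test = rng.normal(size=8)
print(select_few_shot(pool, test, k=4))  # indices of examples to put in the prompt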
Transformers, Multi-task Learning and In-Context Learning
Pretraining Decision Transformers with Reward Prediction for In-Context Multi-task Learning. We study the multi-task RL problem, where the goal is to learn a near-optimal algorithm that minimizes cumulative regret. The tasks share a common structure, and the algorithm exploits this shared structure to minimize the cumulative regret on an unseen but related test task. We use a transformer (a CausalLM GPT-2 model) as a decision-making algorithm that learns this shared structure implicitly so as to generalize to the test task. Our model outperforms other SOTA methods such as DPT and imitation-learning algorithms such as Algorithmic Distillation (AD) across a series of experiments on several structured bandit problems. pdf
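A minimal sketch of the pretraining objective only, under assumed tensor shapes and an assumed weighting: a causal transformer over in-context trajectories is trained to predict the next action (as in a standard decision transformer) plus an auxiliary reward-prediction target. This is illustrative, not the paper's exact setup.

import torch
import torch.nn.functional as F

def pretraining_loss(action_logits, reward_pred, actions, rewards, alpha=1.0):
    """Combined objective sketch: next-action cross-entropy plus an auxiliary
    reward-prediction head. Assumed shapes: action_logits (batch, seq, A),
    reward_pred (batch, seq), actions (batch, seq) long, rewards (batch, seq).
    `alpha` is an assumed weighting between the two terms."""
    action_loss = F.cross_entropy(
        action_logits.reshape(-1, action_logits.shape[-1]),
        actions.reshape(-1),
    )
    reward_loss = F.mse_loss(reward_pred.reshape(-1), rewards.reshape(-1))
    return action_loss + alpha * reward_loss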
Multi-task Representation Learning for Pure Exploration in Bilinear Bandits (NeurIPS 2023). We study multi-task representation learning for the problem of pure exploration in bilinear bandits. We aim to find optimal items for multiple tasks that share a common low-dimensional linear representation. We propose and analyze the algorithm GOBLIN, which uses an optimal-design approach to optimize sample allocations for learning the global representation and to minimize the number of samples needed to identify the optimal pair of items in individual tasks. pdf
Reinforcement Learning
SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits (AISTATS 2024). In this paper, we study the problem of optimal data collection for policy evaluation in linear bandits. In policy evaluation, we are given a target policy and asked to estimate the expected reward it will obtain when executed in a multi-armed bandit environment. Our work is the first to focus on such an optimal data collection strategy for policy evaluation with heteroscedastic reward noise in the linear bandit setting. pdf
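As a small hedged sketch of the estimation step only (the data-collection strategy itself is the paper's contribution and is not shown): with heteroscedastic noise, the policy value can be estimated by weighted least squares, down-weighting noisy observations. All names and shapes are illustrative.

import numpy as np

def wls_policy_value(X, y, variances, pi_features):
    """Estimate a target policy's value in a linear bandit with
    heteroscedastic noise. X: (n, d) features of pulled arms, y: (n,)
    observed rewards, variances: (n,) known or estimated noise variances,
    pi_features: (d,) expected feature vector under the target policy."""
    w = 1.0 / variances
    A = (X * w[:, None]).T @ X
    b = (X * w[:, None]).T @ y
    theta_hat = np.linalg.solve(A, b)   # weighted least-squares fit
    return pi_features @ theta_hat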
ReVar: Strengthening Policy Evaluation via Reduced Variance Sampling (UAI 2022). We study the problem of data collection for policy evaluation in Markov decision processes (MDPs). In policy evaluation, we are given a target policy and asked to estimate the expected cumulative reward it will obtain in an environment formalized as an MDP. We develop and analyze the Reduced Variance Sampling (ReVar) algorithm, which approximates the oracle strategy when the reward variances are unknown a priori, and we bound its sub-optimality compared to the oracle strategy. Finally, we empirically validate that ReVar leads to policy evaluation with mean squared error comparable to the oracle strategy and significantly lower than simply running the target policy. pdf
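In the simplest one-step (bandit) case, the oracle allocates samples to each action in proportion to pi(a) * sigma(a); the sketch below implements that allocation with plug-in variance estimates, which conveys the flavor of ReVar rather than the full MDP algorithm. Names and constants are illustrative.

import numpy as np

def variance_aware_allocation(pi, sigma_hat, budget):
    """Allocate a sampling budget across actions proportionally to
    pi(a) * sigma(a), the allocation that minimizes the variance of the
    plug-in policy-value estimate in a one-step problem. sigma_hat can be a
    running estimate when the true variances are unknown."""
    weights = pi * sigma_hat
    weights = weights / weights.sum()
    counts = np.floor(weights * budget).astype(int)
    # hand out any remaining pulls to the highest-weight actions
    for i in np.argsort(-weights)[: budget - counts.sum()]:
        counts[i] += 1
    return counts

# toy usage: a three-action target policy with one very noisy action
pi = np.array([0.5, 0.3, 0.2])
sigma_hat = np.array([1.0, 4.0, 0.5])
print(variance_aware_allocation(pi, sigma_hat, budget=100))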
Chernoff Sampling for Active Testing and Extension to Active Regression (AISTATS 2022). Active learning can reduce the number of samples needed to perform a hypothesis test and to estimate the parameters of a model. We revisit the work of Chernoff, which described an asymptotically optimal algorithm for performing a hypothesis test. We obtain a novel sample complexity bound for Chernoff's algorithm, with a non-asymptotic term that characterizes its performance at a fixed confidence level. We also develop an extension of Chernoff sampling that can be used to estimate the parameters of a wide variety of models, and we obtain a non-asymptotic bound on the estimation error. We apply our extension of Chernoff sampling to actively learn neural network models and to estimate parameters in real-data linear and non-linear regression problems, where our approach compares favorably with state-of-the-art methods. pdf
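A simplified, hedged version of Chernoff's sampling rule for hypothesis testing (assuming unit-variance Gaussian observations and at least two hypotheses, both assumptions made only for this sketch): repeatedly pick the experiment that best separates the current maximum-likelihood hypothesis from its closest alternative.

import numpy as np

def chernoff_sample(means, counts, sums):
    """One step of a simplified Chernoff sampling rule. `means` is a
    (hypotheses x experiments) table of hypothesized means; counts/sums hold
    per-experiment sample counts and running sums of observations."""
    n_hyp, n_exp = means.shape
    # relative log-likelihood of each hypothesis given the data so far
    ll = np.zeros(n_hyp)
    for e in range(n_exp):
        if counts[e] > 0:
            mean_e = sums[e] / counts[e]
            ll += -0.5 * counts[e] * (mean_e - means[:, e]) ** 2
    ml = int(np.argmax(ll))
    alt = [h for h in range(n_hyp) if h != ml]
    # pick the experiment that best separates the ML hypothesis from its
    # closest alternative (KL between unit-variance Gaussians = diff^2 / 2)
    kl = 0.5 * (means[ml] - means[alt]) ** 2
    return int(np.argmax(kl.min(axis=0)))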
Safety in RL
SaVeR: Optimal Data Collection Strategy for Safe Policy Evaluation in Tabular MDP (ICML 2024). We study safe data collection for the purpose of policy evaluation in tabular Markov decision processes (MDPs). While prior work has considered behavior policy selection, in this paper we additionally impose a safety constraint on the behavior policy. We introduce the algorithm SaVeR for this problem, which approximates the (best possible) safe oracle algorithm, and we bound its finite-sample mean squared error while ensuring that it satisfies the safety constraint. Finally, we show in simulations that SaVeR produces low-MSE policy evaluation while satisfying the safety constraint. pdf
Safety Aware Changepoint Detection for Piecewise i.i.d. Bandits (UAI 2022). We consider the setting of piecewise i.i.d. bandits under a safety constraint. In this setting, there exists a finite number of changepoints where the mean scores of some or all actions (items) change simultaneously. We propose two actively adaptive algorithms for this setting that satisfy the safety constraint, detect changepoints, and restart without knowledge of the number of changepoints or their locations. Empirically, we show that our safety-aware algorithms perform comparably to SOTA adaptive algorithms that do not satisfy the safety constraint. pdf
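A minimal sketch of the restart mechanism only: compare an action's recent rewards against its older rewards and signal a restart when their means differ by more than a Hoeffding-style threshold. The window size, threshold constants, and [0, 1] reward range are illustrative assumptions, not the paper's exact statistics.

import numpy as np

def changepoint_detected(rewards, window=50, delta=0.05):
    """Return True if the last `window` rewards of an action differ from the
    preceding `window` rewards by more than a Hoeffding-style threshold,
    suggesting the piecewise-i.i.d. mean has shifted and the bandit
    algorithm should restart. Rewards are assumed to lie in [0, 1]."""
    if len(rewards) < 2 * window:
        return False
    recent = np.mean(rewards[-window:])
    past = np.mean(rewards[-2 * window:-window])
    threshold = 2 * np.sqrt(np.log(2.0 / delta) / (2 * window))
    return abs(recent - past) > threshold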
News
2024
- Our paper Optimal Design for Human Feedback was accepted at NeurIPS 2024 (main conference).
- Our paper Optimal Design for Human Feedback was accepted at the Models of Human Feedback for AI Alignment workshop at ICML 2024.
- Our paper Off-Policy Evaluation from Logged Human Feedback was accepted at the Models of Human Feedback for AI Alignment workshop at ICML 2024.
- Our paper SaVeR: Optimal Data Collection Strategy for Safe Policy Evaluation in Tabular MDP was accepted at ICML 2024 (main conference).
- I will be returning as an Applied Scientist intern to Amazon AWS AI in the summer of 2024.
- Our paper SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits was accepted at AISTATS 2024.
- My internship at Amazon AWS AI has been extended (part-time) until February 2024.
2023
- I won the Neural Information Processing Systems (NeurIPS) 2023 top reviewer award.
- I passed my preliminary exam for the doctoral degree.
- Our paper Multi-task Representation Learning for Pure Exploration in Bilinear Bandits was accepted at NeurIPS 2023.
- Our paper SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits was accepted at the ICML 2023 workshop The Many Facets of Preference-Based Learning.
- I won the top reviewer award at Uncertainty in Artificial Intelligence (UAI) 2023.
- I worked at the intersection of Active Learning and Large Language Models (LLMs) during my internship at Amazon AWS AI in the summer of 2023. My internship has been extended (part-time) until the end of Fall 2023.
2022
- Our paper ReVar: Strengthening Policy Evaluation via Reduced Variance Sampling was accepted at Uncertainty in Artificial Intelligence (UAI) 2022.
- Our paper Safety Aware Changepoint Detection for Piecewise i.i.d. Bandits was accepted at Uncertainty in Artificial Intelligence (UAI) 2022.
- I passed my qualifying exam for the doctoral degree.
- Our paper Chernoff Sampling for Active Testing and Extension to Active Regression was accepted at Artificial Intelligence and Statistics (AISTATS) 2022.
- Our paper Nearly Optimal Algorithms for Level Set Estimation was accepted at Artificial Intelligence and Statistics (AISTATS) 2022.
2021
- I received my Master's degree in Electrical Engineering from UW-Madison and am now moving on to finish my doctoral degree.
- Our paper A Unified Approach to Translate Classical Bandit Algorithms to the Structured Bandit Setting was accepted at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021).
2020
- Our paper Generalized Chernoff Sampling: A New Perspective on Structured Bandit Algorithms was accepted at the Theoretical Foundations of Reinforcement Learning workshop at ICML 2020.
- Our paper A Unified Approach to Translate Classical Bandit Algorithms to the Structured Bandit Setting was accepted in the IEEE Journal on Selected Areas in Information Theory (2020).
2019
- Our paper Distribution-dependent and Time-uniform Bounds for Piecewise i.i.d Bandits was accepted at the Reinforcement Learning for Real Life workshop at ICML 2019.
- I will spend the summer of 2019 as a Research Associate in the Department of Electrical and Computer Engineering (ECE) at Carnegie Mellon University (CMU), working with Professor Gauri Joshi and Osman Yagan.
- I received the 2019 Chancellor's Opportunity Fellowship award at the University of Wisconsin-Madison.