Thompson Sampling: A Bayesian approach to balancing exploration and exploitation in online learning tasks.
Thompson Sampling is a popular Bayesian method used in online learning tasks, particularly in multi-armed bandit problems, to balance exploration and exploitation. It works by allocating new observations to different options (arms) based on the posterior probability that an option is optimal. This approach has been proven to achieve sub-linear regret under various probabilistic settings and has shown strong empirical performance across different domains.
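To make the mechanism concrete, here is a minimal sketch for a Bernoulli bandit, assuming Beta(1, 1) priors and a simulated environment (the arm probabilities, horizon, and seed below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = [0.3, 0.5, 0.7]      # illustrative; unknown to the learner
n_arms = len(true_probs)
successes = np.ones(n_arms)       # Beta(1, 1) priors on each arm's success rate
failures = np.ones(n_arms)

for t in range(1000):
    # Draw one value per arm from its Beta posterior ...
    samples = rng.beta(successes, failures)
    # ... and play the arm whose sampled value is largest.
    arm = int(np.argmax(samples))
    reward = rng.random() < true_probs[arm]
    # Conjugate update of the chosen arm's posterior.
    successes[arm] += reward
    failures[arm] += 1 - reward

print(successes / (successes + failures))  # posterior means concentrate on the best arm
```

Because each arm is played in proportion to the posterior probability that it is optimal, promising but under-explored arms still get pulled, which is exactly the exploration-exploitation balance described above.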
Recent research on Thompson Sampling has focused on its practical challenges, such as the computational cost of posterior sampling in large-scale problems and its sensitivity to model misspecification. One notable development is Bootstrap Thompson Sampling (BTS), which replaces the posterior distribution used in Thompson Sampling with a bootstrap distribution over point estimates, making the method easier to scale and more robust to misspecified error distributions. Another is Regenerative Particle Thompson Sampling (RPTS), which counters the particle depletion that degrades Particle Thompson Sampling by deleting low-weight particles and regenerating new ones in the vicinity of the surviving, well-fitting particles; this yields consistent improvements over Particle Thompson Sampling across a range of bandit problems.
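The following is a rough sketch of the BTS idea, assuming an online "double-or-nothing" bootstrap with a fixed number of replicates per arm; the replicate count, the initialization, and the simulated environment are illustrative choices rather than the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
n_arms, n_boot = 3, 50                  # n_boot bootstrap replicates stand in for a posterior
true_probs = [0.3, 0.5, 0.7]            # illustrative environment
sums = np.zeros((n_arms, n_boot))       # per-replicate reward sums
counts = np.ones((n_arms, n_boot))      # per-replicate observation counts (avoids /0)

for t in range(1000):
    # Draw one bootstrap replicate per arm and act greedily on the drawn estimates.
    idx = rng.integers(n_boot, size=n_arms)
    estimates = sums[np.arange(n_arms), idx] / counts[np.arange(n_arms), idx]
    arm = int(np.argmax(estimates))
    reward = float(rng.random() < true_probs[arm])
    # Online "double-or-nothing" bootstrap: each replicate counts the new
    # observation with weight 2 with probability 1/2, otherwise ignores it.
    weights = 2.0 * (rng.random(n_boot) < 0.5)
    sums[arm] += weights * reward
    counts[arm] += weights
```

Replacing posterior updates with cheap bootstrap-replicate updates is what makes this approach attractive at scale, since no conjugate model or MCMC step is required.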
Practical applications of Thompson Sampling include adaptive experimentation, where it has been compared to other methods such as Tempered Thompson Sampling and Exploration Sampling. In these comparisons, Thompson Sampling often performs comparably to random assignment, with its relative performance depending on the number of experimental waves. Another application is 5G network slicing, where RPTS has been used to allocate resources effectively. Furthermore, Thompson Sampling has been extended to handle noncompliant bandits, where the agent's chosen action may not be the action actually implemented; this extension has been shown to match or outperform standard Thompson Sampling in both compliant and noncompliant environments.
In conclusion, Thompson Sampling is a powerful and flexible method for addressing online learning tasks, with ongoing research aimed at improving its scalability, robustness, and applicability to various problem domains. Its connection to broader theories, such as Bayesian modeling of policy uncertainty and game-theoretic analysis, further highlights its potential as a principled approach to adaptive sequential decision-making and causal inference.

Thompson Sampling Further Reading
1. A Note on Information-Directed Sampling and Thompson Sampling. Li Zhou. http://arxiv.org/abs/1503.06902v1
2. Asymptotic Convergence of Thompson Sampling. Cem Kalkanli, Ayfer Ozgur. http://arxiv.org/abs/2011.03917v1
3. Thompson sampling with the online bootstrap. Dean Eckles, Maurits Kaptein. http://arxiv.org/abs/1410.4009v1
4. Regenerative Particle Thompson Sampling. Zeyu Zhou, Bruce Hajek, Nakjung Choi, Anwar Walid. http://arxiv.org/abs/2203.08082v2
5. Thompson Sampling for Noncompliant Bandits. Andrew Stirn, Tony Jebara. http://arxiv.org/abs/1812.00856v1
6. A Comparison of Methods for Adaptive Experimentation. Samantha Horn, Sabina J. Sloman. http://arxiv.org/abs/2207.00683v1
7. Thompson Sampling with Unrestricted Delays. Han Wu, Stefan Wager. http://arxiv.org/abs/2202.12431v2
8. Generalized Thompson Sampling for Sequential Decision-Making and Causal Inference. Pedro A. Ortega, Daniel A. Braun. http://arxiv.org/abs/1303.4431v1
9. MOTS: Minimax Optimal Thompson Sampling. Tianyuan Jin, Pan Xu, Jieming Shi, Xiaokui Xiao, Quanquan Gu. http://arxiv.org/abs/2003.01803v3
10. Feel-Good Thompson Sampling for Contextual Bandits and Reinforcement Learning. Tong Zhang. http://arxiv.org/abs/2110.00871v1

Thompson Sampling Frequently Asked Questions
What is the Thompson sampling method?
Thompson Sampling is a Bayesian approach used in online learning tasks, particularly in multi-armed bandit problems, to balance exploration and exploitation. It works by allocating new observations to different options (arms) based on the posterior probability that an option is optimal. This method has been proven to achieve sub-linear regret under various probabilistic settings and has shown strong empirical performance across different domains.
What is Thompson sampling reinforcement learning?
Thompson Sampling in reinforcement learning is an algorithm used to solve the exploration-exploitation dilemma in sequential decision-making tasks. It uses Bayesian methods to estimate the value of each action and selects actions based on their posterior probability of being optimal. This approach allows the agent to balance the need to explore new actions to gain information and exploit known actions to maximize rewards.
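As a bandit-style simplification of this idea (full reinforcement learning methods such as posterior sampling for RL apply it to value functions or environment models, which this sketch does not attempt), the code below assumes Gaussian priors over action values and known observation noise; all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
true_values = np.array([1.0, 1.5, 2.0])   # illustrative action values
noise_sd = 1.0                            # assumed known observation noise
prior_var = 10.0 ** 2                     # broad N(0, 10^2) prior on each action value
post_mean = np.zeros(3)
post_var = np.full(3, prior_var)

for t in range(500):
    # Sample a value estimate for each action from its posterior,
    # then act greedily with respect to the sampled values.
    sampled = rng.normal(post_mean, np.sqrt(post_var))
    a = int(np.argmax(sampled))
    reward = rng.normal(true_values[a], noise_sd)
    # Conjugate Gaussian update for the chosen action's value posterior.
    precision = 1.0 / post_var[a] + 1.0 / noise_sd ** 2
    post_mean[a] = (post_mean[a] / post_var[a] + reward / noise_sd ** 2) / precision
    post_var[a] = 1.0 / precision
```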
What is the difference between UCB and Thompson sampling?
UCB (Upper Confidence Bound) and Thompson Sampling are both algorithms used to address the exploration-exploitation trade-off in multi-armed bandit problems. The main difference between them is their approach to selecting actions. UCB selects actions based on upper confidence bounds, which are calculated using the mean reward and a confidence interval. In contrast, Thompson Sampling selects actions based on their posterior probability of being optimal, using Bayesian methods to update these probabilities as new observations are collected.
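A brief side-by-side sketch of the two selection rules on the same Bernoulli-bandit statistics may help; the counts, the UCB1 bonus term, and the Beta(1, 1) priors are standard but still illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
pulls = np.array([50, 20, 5])   # illustrative number of plays per arm
wins = np.array([25, 12, 4])    # illustrative observed successes per arm
t = pulls.sum()

# UCB1: deterministic; picks the arm with the largest mean plus confidence bonus.
ucb_scores = wins / pulls + np.sqrt(2.0 * np.log(t) / pulls)
ucb_arm = int(np.argmax(ucb_scores))

# Thompson Sampling: stochastic; picks the arm with the largest posterior draw.
ts_samples = rng.beta(1 + wins, 1 + pulls - wins)
ts_arm = int(np.argmax(ts_samples))

print(ucb_scores, ucb_arm)   # the same counts always yield the same UCB choice
print(ts_samples, ts_arm)    # the TS choice varies from draw to draw
```

Note that UCB is deterministic given the observed counts, whereas Thompson Sampling randomizes over actions according to the posterior, which is often an advantage under delayed or batched feedback.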
What are the benefits of Thompson sampling?
Thompson Sampling offers several benefits:
1. Balancing exploration and exploitation: it effectively trades off exploring new options against exploiting known ones to maximize rewards in online learning tasks.
2. Strong theoretical and empirical performance: it achieves sub-linear regret under various probabilistic settings and has demonstrated strong performance across different domains.
3. Flexibility: it can be applied to many problem domains and extended to handle noncompliant bandits and other complex scenarios.
4. Connection to broader theories: it relates to Bayesian modeling of policy uncertainty and game-theoretic analysis, highlighting its potential as a principled approach to adaptive sequential decision-making and causal inference.
How does Thompson sampling handle large-scale problems?
Recent research in Thompson Sampling has focused on addressing its challenges, such as computational demands in large-scale problems and the need for accurate model fitting. One notable development is Bootstrap Thompson Sampling (BTS), which replaces the posterior distribution used in Thompson Sampling with a bootstrap distribution, making it more scalable and robust to misspecified error distributions.
What are some practical applications of Thompson Sampling?
Practical applications of Thompson Sampling include adaptive experimentation, where it has been compared to other methods such as Tempered Thompson Sampling and Exploration Sampling; in these comparisons, its performance relative to random assignment depends on the number of experimental waves. Another application is 5G network slicing, where Regenerative Particle Thompson Sampling (RPTS) has been used to allocate resources effectively. Furthermore, Thompson Sampling has been extended to handle noncompliant bandits, where the agent's chosen action may not be the action actually implemented; this extension has been shown to match or outperform standard Thompson Sampling in both compliant and noncompliant environments.