== Probability matching ==
Probability matching is a decision strategy in which predictions of class membership are proportional to the class base rates. Thus, if in the training set positive examples are observed 60% of the time, and negative examples are observed 40% of the time, an observer using a probability-matching strategy will predict (for unlabeled examples) a class label of "positive" on 60% of instances, and a class label of "negative" on 40% of instances.
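To make the strategy concrete, the following minimal Python sketch (illustrative, not from the source; function and variable names are assumptions) estimates the class base rates from a set of training labels and then predicts by sampling labels in proportion to those rates:

<syntaxhighlight lang="python">
import random

def probability_matching_predictor(train_labels):
    """Return a zero-argument predictor that emits each class label
    with probability equal to its base rate in train_labels."""
    classes = sorted(set(train_labels))
    weights = [train_labels.count(c) for c in classes]

    def predict():
        # Sample a label in proportion to the observed base rates.
        return random.choices(classes, weights=weights)[0]

    return predict

# Hypothetical training set: 60% "positive", 40% "negative".
train = ["positive"] * 60 + ["negative"] * 40
predict = probability_matching_predictor(train)
print(sum(predict() == "positive" for _ in range(10000)) / 10000)  # ~0.6
</syntaxhighlight>

Note that if test labels are drawn independently at the same base rates, such a predictor is correct on about 0.6·0.6 + 0.4·0.4 = 52% of instances, whereas always predicting the majority class would be correct on 60%.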
== Bayesian control rule ==
A generalization of Thompson sampling to arbitrary dynamical environments and causal structures, known as the Bayesian control rule, has been shown to be the optimal solution to the adaptive coding problem with actions and observations. In this formulation, an agent is conceptualized as a mixture over a set of behaviours. As the agent interacts with its environment, it learns the causal properties and adopts the behaviour that minimizes the relative entropy to the behaviour with the best prediction of the environment's behaviour. If these behaviours have been chosen according to the maximum expected utility principle, then the asymptotic behaviour of the Bayesian control rule matches the asymptotic behaviour of the perfectly rational agent.

The setup is as follows. Let a_1, a_2, \ldots, a_T be the actions issued by an agent up to time T, and let o_1, o_2, \ldots, o_T be the observations gathered by the agent up to time T. Then, the agent issues the action a_{T+1} with probability

:P(a_{T+1}|\hat{a}_{1:T}, o_{1:T}),

where the "hat"-notation \hat{a}_t denotes the fact that a_t is a causal intervention (see Causality) and not an ordinary observation. If the agent holds beliefs \theta \in \Theta over its behaviours, then the Bayesian control rule becomes

:P(a_{T+1}|\hat{a}_{1:T}, o_{1:T}) = \int_{\Theta} P(a_{T+1}|\theta, \hat{a}_{1:T}, o_{1:T}) P(\theta|\hat{a}_{1:T}, o_{1:T}) \, d\theta,

where P(\theta|\hat{a}_{1:T}, o_{1:T}) is the posterior distribution over the parameter \theta given actions a_{1:T} and observations o_{1:T}.

In practice, the Bayesian control rule amounts to sampling, at each time step, a parameter \theta^\ast from the posterior distribution P(\theta|\hat{a}_{1:T}, o_{1:T}), where the posterior is computed using Bayes' rule by considering only the (causal) likelihoods of the observations o_1, o_2, \ldots, o_T and ignoring the (causal) likelihoods of the actions a_1, a_2, \ldots, a_T, and then sampling the action a^\ast_{T+1} from the action distribution P(a_{T+1}|\theta^\ast,\hat{a}_{1:T},o_{1:T}).
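As an illustration of this two-step sampling scheme, the following sketch (illustrative names, not from the source) specializes it to a Bernoulli bandit, where the Bayesian control rule reduces to ordinary Thompson sampling: a parameter \theta^\ast is drawn from a Beta posterior that is updated only by the observed rewards, and the behaviour indexed by \theta^\ast then plays the arm that is optimal under that parameter.

<syntaxhighlight lang="python">
import random

def bayesian_control_bandit(n_arms, pull, horizon):
    """Sketch of the Bayesian control rule specialized to a Bernoulli
    bandit (where it reduces to Thompson sampling). Each arm i keeps a
    Beta(alpha[i], beta[i]) posterior over its success probability."""
    alpha = [1.0] * n_arms  # Beta pseudo-counts: observed successes + 1
    beta = [1.0] * n_arms   # Beta pseudo-counts: observed failures + 1
    for _ in range(horizon):
        # Step 1: sample theta* from the posterior. Past actions enter
        # the history as interventions: they select which arm's posterior
        # was updated but contribute no likelihood term of their own.
        theta = [random.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
        # Step 2: sample the action from P(a | theta*, history); here the
        # behaviour indexed by theta* deterministically plays the arm
        # that is optimal under the sampled parameter.
        a = max(range(n_arms), key=lambda i: theta[i])
        o = pull(a)  # observation: reward in {0, 1}
        alpha[a] += o
        beta[a] += 1 - o
    return alpha, beta

# Hypothetical environment: arm 1 succeeds more often than arm 0.
true_means = [0.4, 0.6]
print(bayesian_control_bandit(2, lambda a: int(random.random() < true_means[a]), 1000))
</syntaxhighlight>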
== Upper-confidence-bound (UCB) algorithms ==
Thompson sampling and upper-confidence-bound (UCB) algorithms share a fundamental property that underlies many of their theoretical guarantees. Roughly speaking, both algorithms allocate exploratory effort to actions that might be optimal and are in this sense "optimistic". Leveraging this property, one can translate regret bounds established for UCB algorithms into Bayesian regret bounds for Thompson sampling, or unify the regret analysis across both algorithms and many classes of problems.
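To make the shared "optimism" concrete, here is a minimal sketch of UCB1, used here as a representative UCB algorithm (an assumption; names are illustrative). Each arm's index is its empirical mean plus a confidence bonus, so arms that might still be optimal keep receiving exploratory pulls:

<syntaxhighlight lang="python">
import math
import random

def ucb1(n_arms, pull, horizon):
    """Minimal UCB1 sketch: play the arm maximizing the empirical mean
    plus the optimism bonus sqrt(2 ln t / n_i)."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            a = t - 1  # initialization: play each arm once
        else:
            # Optimistic index: arms with few pulls get a large bonus,
            # so any arm that might still be optimal is revisited.
            a = max(range(n_arms),
                    key=lambda i: sums[i] / counts[i]
                                  + math.sqrt(2.0 * math.log(t) / counts[i]))
        r = pull(a)  # reward in [0, 1]
        counts[a] += 1
        sums[a] += r
    return counts

# Same hypothetical Bernoulli environment as in the sketch above.
true_means = [0.4, 0.6]
print(ucb1(2, lambda a: int(random.random() < true_means[a]), 2000))
</syntaxhighlight>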