=== Bayes optimal classifier ===
The Bayes optimal classifier is a classification technique. It is an ensemble of all the hypotheses in the hypothesis space. On average, no other ensemble can outperform it. The Naive Bayes classifier is a version of this that assumes that the data is conditionally independent given the class, which makes the computation more feasible. Each hypothesis is given a vote proportional to the likelihood that the training dataset would be sampled from a system if that hypothesis were true. To account for the finite size of the training data, the vote of each hypothesis is also multiplied by the prior probability of that hypothesis. The Bayes optimal classifier can be expressed with the following equation:
:y=\underset{c_j \in C}{\mathrm{argmax}} \sum_{h_i \in H}{P(c_j|h_i)P(T|h_i)P(h_i)}
where y is the predicted class, C is the set of all possible classes, H is the hypothesis space, P refers to a probability, and T is the training data. As an ensemble, the Bayes optimal classifier represents a hypothesis that is not necessarily in H. The hypothesis represented by the Bayes optimal classifier, however, is the optimal hypothesis in ensemble space (the space of all possible ensembles consisting only of hypotheses in H). This formula can be restated using Bayes' theorem, which says that the posterior is proportional to the likelihood times the prior:
:P(h_i|T) \propto P(T|h_i)P(h_i)
hence,
:y=\underset{c_j \in C}{\mathrm{argmax}} \sum_{h_i \in H}{P(c_j|h_i)P(h_i|T)}
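As an illustration, the following minimal Python sketch evaluates the argmax above over a tiny, explicitly enumerated hypothesis space. The three hypotheses, their likelihoods P(T|h_i), priors P(h_i), and class-conditional probabilities P(c_j|h_i) are invented numbers for the example, not part of the definition.

 # Minimal sketch of the Bayes optimal classifier over a small, explicitly
 # enumerated hypothesis space; all numbers below are illustrative.
 classes = ["positive", "negative"]
 hypotheses = [
     # P(c|h) for each class, plus the likelihood P(T|h) and the prior P(h)
     {"P_c_given_h": {"positive": 0.9, "negative": 0.1}, "P_T_given_h": 0.30, "P_h": 0.5},
     {"P_c_given_h": {"positive": 0.2, "negative": 0.8}, "P_T_given_h": 0.05, "P_h": 0.3},
     {"P_c_given_h": {"positive": 0.6, "negative": 0.4}, "P_T_given_h": 0.10, "P_h": 0.2},
 ]

 def bayes_optimal_class(classes, hypotheses):
     # y = argmax_c sum_h P(c|h) * P(T|h) * P(h)
     score = {c: sum(h["P_c_given_h"][c] * h["P_T_given_h"] * h["P_h"]
                     for h in hypotheses)
              for c in classes}
     return max(score, key=score.get)

 print(bayes_optimal_class(classes, hypotheses))  # prints "positive"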
=== Bootstrap aggregating (bagging) ===
Bootstrap aggregation (bagging) involves training an ensemble on bootstrapped data sets. A bootstrapped set is created by selecting from the original training data set with replacement. Thus, a bootstrap set may contain a given example zero, one, or multiple times. Ensemble members can also have limits on the features they may use (e.g., at the nodes of a decision tree), to encourage exploration of diverse features. The variation among the bootstrap sets, together with these restrictions on features, promotes diversity among the ensemble members and can strengthen the ensemble. To reduce overfitting, a member can be validated using the out-of-bag set (the examples that are not in its bootstrap set). Inference is done by voting on the predictions of the ensemble members, a step called aggregation. As an illustration, consider an ensemble of four decision trees: a query example is classified by each tree, and because three of the four trees predict the positive class, the ensemble's overall classification is positive. Random forests are a common application of bagging.
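As a rough sketch of these steps, and assuming scikit-learn and NumPy are available, bagging decision trees might look like the following; the function names, the number of trees, and the assumption of integer class labels are illustrative choices rather than a prescribed implementation.

 # Sketch of bagging: bootstrap sampling, per-tree training, majority-vote
 # aggregation. Assumes X is array-like and y holds integer class labels.
 import numpy as np
 from sklearn.tree import DecisionTreeClassifier

 def fit_bagged_trees(X, y, n_trees=10, seed=0):
     rng = np.random.default_rng(seed)
     X, y = np.asarray(X), np.asarray(y)
     n = len(X)
     trees = []
     for _ in range(n_trees):
         idx = rng.integers(0, n, size=n)   # bootstrap set: sample n with replacement
         trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
     return trees

 def predict_bagged(trees, X):
     votes = np.stack([t.predict(X) for t in trees])   # one row of votes per tree
     # Aggregation: for each example, return the most common predicted class.
     return np.array([np.bincount(col).argmax() for col in votes.T])

Random forests additionally restrict which features each tree may split on, which is the feature limitation mentioned above.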
=== Boosting ===
Boosting involves training successive models by emphasizing training data misclassified by previously learned models. Initially, all data (D1) has equal weight and is used to learn a base model M1. The examples misclassified by M1 are assigned a weight greater than that of correctly classified examples. This boosted data (D2) is used to train a second base model M2, and so on. Inference is done by voting. In some cases, boosting has yielded better accuracy than bagging, but it also tends to overfit more. The most common implementation of boosting is AdaBoost, but some newer algorithms are reported to achieve better results.
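The reweighting loop can be sketched as follows, here in the style of AdaBoost with decision stumps. It assumes scikit-learn and NumPy, and labels encoded as -1/+1; it is illustrative rather than a tuned implementation.

 # Sketch of boosting (AdaBoost-style): misclassified examples gain weight,
 # each new base model is trained on the reweighted data, and inference is a
 # weighted vote. Assumes labels are -1/+1.
 import numpy as np
 from sklearn.tree import DecisionTreeClassifier

 def fit_adaboost(X, y, n_rounds=10):
     X, y = np.asarray(X), np.asarray(y)
     n = len(X)
     w = np.full(n, 1.0 / n)                 # D1: equal weights
     stumps, alphas = [], []
     for _ in range(n_rounds):
         stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
         pred = stump.predict(X)
         err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
         alpha = 0.5 * np.log((1 - err) / err)   # this model's share of the vote
         w *= np.exp(-alpha * y * pred)          # boost misclassified examples
         w /= w.sum()                            # D2, D3, ... after normalizing
         stumps.append(stump)
         alphas.append(alpha)
     return stumps, alphas

 def predict_adaboost(stumps, alphas, X):
     return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))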
=== Bayesian model averaging ===
Bayesian model averaging (BMA) makes predictions by averaging the predictions of models weighted by their posterior probabilities given the data. BMA is known to generally give better answers than a single model, obtained, e.g., via stepwise regression, especially where very different models have nearly identical performance in the training set but may otherwise perform quite differently. The question with any use of Bayes' theorem is the prior, i.e., the probability (perhaps subjective) that each model is the best to use for a given purpose. Conceptually, BMA can be used with any prior.

The R packages ensembleBMA and BMA use the prior implied by the Bayesian information criterion (BIC), following Raftery (1995). The R package BAS supports the use of the priors implied by the Akaike information criterion (AIC) and other criteria over the alternative models, as well as priors over the coefficients. The difference between BIC and AIC is the strength of the preference for parsimony: BIC's penalty for model complexity is \ln(n) k, while AIC's is 2k. Large-sample asymptotic theory establishes that if there is a best model, then with increasing sample sizes, BIC is strongly consistent, i.e., it will almost certainly find it, while AIC may not, because AIC may continue to place excessive posterior probability on models that are more complicated than they need to be. On the other hand, AIC and AICc are asymptotically "efficient" (i.e., they achieve minimum mean square prediction error), while BIC is not.

Haussler et al. (1994) showed that when BMA is used for classification, its expected error is at most twice the expected error of the Bayes optimal classifier. Burnham and Anderson (1998, 2002) contributed greatly to introducing a wider audience to the basic ideas of Bayesian model averaging and popularizing the methodology. The availability of software, including other free open-source packages for R beyond those mentioned above, helped make the methods accessible to a wider audience.
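The BIC-based weighting can be sketched numerically as follows; the sample size, the candidate models, and their log-likelihoods, parameter counts, and predictions are made-up values for illustration.

 # Sketch of BMA with BIC-implied weights: each candidate model reports its
 # maximized log-likelihood, its parameter count k, and a prediction; models
 # are weighted in proportion to exp(-BIC/2). All numbers are illustrative.
 import numpy as np

 n = 100  # sample size
 models = [
     # (log-likelihood, number of parameters k, prediction for a new point)
     (-120.0, 2, 3.1),
     (-118.5, 4, 3.4),
     (-118.4, 6, 3.5),
 ]

 bic = np.array([k * np.log(n) - 2 * ll for ll, k, _ in models])   # penalty ln(n)*k
 aic = np.array([2 * k - 2 * ll for ll, k, _ in models])           # penalty 2k, for comparison

 weights = np.exp(-(bic - bic.min()) / 2)   # posterior model weights implied by BIC
 weights /= weights.sum()

 predictions = np.array([p for _, _, p in models])
 bma_prediction = float(np.dot(weights, predictions))   # weighted average prediction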
=== Bayesian model combination ===
Bayesian model combination (BMC) is an algorithmic correction to Bayesian model averaging (BMA). Instead of sampling each model in the ensemble individually, it samples from the space of possible ensembles (with model weights drawn randomly from a Dirichlet distribution having uniform parameters). This modification overcomes the tendency of BMA to converge toward giving all the weight to a single model. Although BMC is somewhat more computationally expensive than BMA, it tends to yield dramatically better results. BMC has been shown to be better on average (with statistical significance) than BMA and bagging.

Use of Bayes' law to compute model weights requires computing the probability of the data given each model. Typically, none of the models in the ensemble are exactly the distribution from which the training data were generated, so all of them correctly receive a value close to zero for this term. This would work well if the ensemble were big enough to sample the entire model space, but this is rarely possible. Consequently, each pattern in the training data will cause the ensemble weight to shift toward the model in the ensemble that is closest to the distribution of the training data, so BMA essentially reduces to an unnecessarily complex method for doing model selection.

The possible weightings for an ensemble can be visualized as lying on a simplex. At each vertex of the simplex, all of the weight is given to a single model in the ensemble. BMA converges toward the vertex that is closest to the distribution of the training data. By contrast, BMC converges toward the point where this distribution projects onto the simplex. In other words, instead of selecting the one model that is closest to the generating distribution, it seeks the combination of models that is closest to the generating distribution. The results from BMA can often be approximated by using cross-validation to select the best model from a bucket of models. Likewise, the results from BMC may be approximated by using cross-validation to select the best ensemble combination from a random sampling of possible weightings.
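The last point suggests a simple approximation that can be sketched in code: draw candidate weightings from a uniform Dirichlet distribution and keep the combination that scores best on held-out data. The function and variable names below are hypothetical, and the sketch captures only this approximation, not a full posterior average over combinations.

 # Sketch: approximate BMC by sampling ensemble weightings from a uniform
 # Dirichlet and selecting the combination that best explains held-out data.
 import numpy as np

 def best_dirichlet_combination(model_probs, y, n_samples=1000, seed=0):
     # model_probs: shape (n_models, n_examples, n_classes), each model's
     #              predicted class probabilities on validation examples
     # y: integer labels of those examples
     rng = np.random.default_rng(seed)
     y = np.asarray(y)
     n_models = model_probs.shape[0]
     best_w, best_score = None, -np.inf
     for _ in range(n_samples):
         w = rng.dirichlet(np.ones(n_models))          # a point on the simplex
         mix = np.tensordot(w, model_probs, axes=1)    # weighted mixture of predictions
         score = np.log(mix[np.arange(len(y)), y] + 1e-12).sum()
         if score > best_score:
             best_w, best_score = w, score
     return best_w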
=== Bucket of models ===
A "bucket of models" is an ensemble technique in which a model selection algorithm is used to choose the best model for each problem. When tested on only one problem, a bucket of models can produce no better results than the best model in the set, but when evaluated across many problems, it will typically produce much better results, on average, than any model in the set.

The most common approach used for model selection is cross-validation selection (sometimes called a "bake-off contest"). It is described with the following pseudo-code:

 For each model m in the bucket:
   Do c times (where 'c' is some constant):
     Randomly divide the training dataset into two sets, A and B
     Train m with A
     Test m with B
 Select the model that obtains the highest average score

Cross-validation selection can be summed up as "try them all with the training set, and pick the one that works best"; a Python sketch of this procedure appears at the end of this section.

Gating is a generalization of cross-validation selection. It involves training another learning model to decide which of the models in the bucket is best suited to solve the problem. Often, a perceptron is used for the gating model. It can be used to pick the "best" model, or it can be used to give a linear weight to the predictions from each model in the bucket.

When a bucket of models is used with a large set of problems, it may be desirable to avoid training some of the models that take a long time to train. Landmark learning is a meta-learning approach that seeks to solve this problem. It involves training only the fast (but imprecise) algorithms in the bucket, and then using the performance of these algorithms to help determine which slow (but accurate) algorithm is most likely to do best.
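A minimal Python rendering of the cross-validation selection pseudo-code, assuming scikit-learn is available, might look like the following; the particular models placed in the bucket are illustrative.

 # Sketch of cross-validation selection over a bucket of models.
 import numpy as np
 from sklearn.model_selection import train_test_split
 from sklearn.tree import DecisionTreeClassifier
 from sklearn.linear_model import LogisticRegression
 from sklearn.naive_bayes import GaussianNB

 def cross_validation_selection(bucket, X, y, c=5, seed=0):
     scores = {name: [] for name in bucket}
     for i in range(c):                          # do c times
         # Randomly divide the training dataset into two sets, A and B
         X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.3,
                                               random_state=seed + i)
         for name, model in bucket.items():
             model.fit(X_a, y_a)                             # train m with A
             scores[name].append(model.score(X_b, y_b))      # test m with B
     # Select the model that obtains the highest average score
     return max(scores, key=lambda name: np.mean(scores[name]))

 bucket = {
     "tree": DecisionTreeClassifier(),
     "logistic": LogisticRegression(max_iter=1000),
     "naive_bayes": GaussianNB(),
 }
 # best = cross_validation_selection(bucket, X_train, y_train)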
=== Amended Cross-Entropy Cost: An Approach for Encouraging Diversity in Classification Ensembles ===
The most common approach for training a classifier is to use the cross-entropy cost function. However, one would like to train an ensemble of models that have diversity, so that combining them provides the best results. Assuming a simple ensemble that averages K classifiers, the amended cross-entropy cost is
:e^k = H(p,q^k)-\frac{\lambda}{K}\sum_{j\neq k}H(q^j,q^k)
where e^k is the cost function of the k^{th} classifier, q^k is the output probability of the k^{th} classifier, p is the true probability that we need to estimate, and \lambda is a parameter between 0 and 1 that defines the amount of diversity that we would like to establish. When \lambda=0 we want each classifier to do its best regardless of the ensemble, and when \lambda=1 we would like the classifiers to be as diverse as possible.
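The cost can be computed directly from the classifiers' output distributions, as in the following sketch; the probability vectors, the value of \lambda, and the helper names are illustrative.

 # Sketch of the amended cross-entropy cost e^k for an ensemble of K
 # classifiers; all probability vectors below are illustrative.
 import numpy as np

 def cross_entropy(p, q, eps=1e-12):
     return -np.sum(p * np.log(q + eps))     # H(p, q)

 def amended_cost(k, p, q_all, lam):
     # e^k = H(p, q^k) - (lambda / K) * sum_{j != k} H(q^j, q^k)
     K = len(q_all)
     penalty = sum(cross_entropy(q_all[j], q_all[k])
                   for j in range(K) if j != k)
     return cross_entropy(p, q_all[k]) - (lam / K) * penalty

 p = np.array([1.0, 0.0, 0.0])                 # true (one-hot) distribution
 q_all = [np.array([0.7, 0.2, 0.1]),           # outputs of K = 3 classifiers
          np.array([0.6, 0.3, 0.1]),
          np.array([0.5, 0.1, 0.4])]
 costs = [amended_cost(k, p, q_all, lam=0.5) for k in range(len(q_all))]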
=== Stacking ===
Stacking (sometimes called stacked generalization) involves training a model to combine the predictions of several other learning algorithms. First, all of the other algorithms are trained using the available data; then a combiner algorithm (final estimator) is trained to make a final prediction using all the predictions of the other algorithms (base estimators) as additional inputs, or using cross-validated predictions from the base estimators, which can prevent overfitting. If an arbitrary combiner algorithm is used, then stacking can theoretically represent any of the ensemble techniques described in this article, although, in practice, a logistic regression model is often used as the combiner.

Stacking typically yields performance better than any single one of the trained models. It has been successfully used on both supervised learning tasks (regression, classification, and distance learning) and unsupervised learning (density estimation). It has also been used to estimate bagging's error rate. It has been reported to outperform Bayesian model averaging. The two top performers in the Netflix competition utilized blending, which may be considered a form of stacking. Bayesian predictive stacking generalizes the idea of stacking from the statistical estimation literature to the combination of posterior predictive distributions. These ideas have also been developed and investigated for Gaussian process models, especially for spatial data analysis, and can be used to construct transfer learning frameworks for massive spatial data sets.
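A minimal stacking sketch, assuming scikit-learn and a binary classification task, is given below: base estimators produce cross-validated predictions on the training data, and a logistic regression combiner is trained on those predictions. The particular base estimators and helper names are illustrative; scikit-learn's StackingClassifier offers a ready-made version of the same pattern.

 # Sketch of stacking with cross-validated base-estimator predictions and a
 # logistic-regression combiner. Assumes a binary classification task.
 import numpy as np
 from sklearn.model_selection import cross_val_predict
 from sklearn.tree import DecisionTreeClassifier
 from sklearn.naive_bayes import GaussianNB
 from sklearn.linear_model import LogisticRegression

 def fit_stack(X, y, base_estimators, combiner, cv=5):
     # Cross-validated predictions keep the combiner from being trained on
     # outputs the base estimators produced for their own training examples.
     meta = np.column_stack([
         cross_val_predict(est, X, y, cv=cv, method="predict_proba")[:, 1]
         for est in base_estimators])
     combiner.fit(meta, y)
     for est in base_estimators:      # refit bases on the full training data
         est.fit(X, y)
     return base_estimators, combiner

 def predict_stack(base_estimators, combiner, X):
     meta = np.column_stack([est.predict_proba(X)[:, 1] for est in base_estimators])
     return combiner.predict(meta)

 # bases, combiner = fit_stack(X_train, y_train,
 #                             [DecisionTreeClassifier(), GaussianNB()],
 #                             LogisticRegression())
 # y_pred = predict_stack(bases, combiner, X_test)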
=== Voting ===
Voting is another form of ensembling. See e.g. Weighted majority algorithm (machine learning).

== Implementations in statistics packages ==