Several problem transformation methods exist for multi-label classification; they can be roughly broken down into the following groups.

===Transformation into binary classification problems===
The baseline approach, called the
binary relevance method, amounts to independently training one binary classifier for each label. Given an unseen sample, the combined model then predicts all labels for which the respective classifiers return a positive result. Although this way of dividing the task into multiple binary tasks may superficially resemble the one-vs.-all (OvA) and one-vs.-rest (OvR) methods for multiclass classification, it is essentially different from both: a classifier under binary relevance deals with a single label, without any regard to the other labels.
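A minimal sketch of binary relevance, assuming NumPy arrays X (samples × features) and Y (samples × labels, multi-hot encoded) and scikit-learn's LogisticRegression as an arbitrary choice of base learner:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_binary_relevance(X, Y):
    # One independent binary classifier per label column; each is trained
    # without any regard to the other labels.
    return [LogisticRegression().fit(X, Y[:, k]) for k in range(Y.shape[1])]

def predict_binary_relevance(classifiers, X):
    # Each classifier decides only whether its own label is present.
    return np.column_stack([clf.predict(X) for clf in classifiers])
</syntaxhighlight>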
A classifier chain is an alternative method for transforming a multi-label classification problem into several binary classification problems. It differs from binary relevance in that the labels are predicted sequentially, and the outputs of all previous classifiers (i.e. positive or negative for a particular label) are fed as additional features to the subsequent classifiers.
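A minimal classifier chain sketch under the same assumptions (LogisticRegression as base learner, a fixed label order): during training the true previous labels are appended as features, while at prediction time the chain's own earlier predictions are fed forward.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_chain(X, Y):
    chain, X_aug = [], X
    for k in range(Y.shape[1]):
        chain.append(LogisticRegression().fit(X_aug, Y[:, k]))
        # Append the true label k as an extra feature for later classifiers.
        X_aug = np.column_stack([X_aug, Y[:, k]])
    return chain

def predict_chain(chain, X):
    X_aug, preds = X, []
    for clf in chain:
        y_k = clf.predict(X_aug)
        preds.append(y_k)
        # Feed the predicted label forward as an extra feature.
        X_aug = np.column_stack([X_aug, y_k])
    return np.column_stack(preds)
</syntaxhighlight>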
Bayesian networks have also been applied to optimally order the classifiers in classifier chains.

When the problem is transformed into multiple binary classifications, the likelihood function reads

L = \prod_{i=1}^{n} \prod_{k} \prod_{j_k} p_{k,j_k}(x_i)^{\delta_{y_{i,k},\, j_k}}

where the index i runs over the samples, the index k runs over the labels, j_k ranges over the binary outcomes 0 and 1, \delta_{a,b} denotes the Kronecker delta, and y_{i,k} \in \{0,1\} is the multi-hot encoding of the labels of sample i.
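Since each j_k ranges over only 0 and 1, the Kronecker delta selects exactly one factor per sample and label, so the log-likelihood decomposes into a sum of independent per-label terms. A sketch evaluating it, assuming P[i, k] estimates p_{k,1}(x_i) (e.g. from a classifier's predict_proba) and Y is the multi-hot label matrix:

<syntaxhighlight lang="python">
import numpy as np

def log_likelihood(P, Y, eps=1e-12):
    # The delta selects p_{k,1}(x_i) when y_{i,k} = 1 and
    # p_{k,0}(x_i) = 1 - p_{k,1}(x_i) when y_{i,k} = 0;
    # eps guards against log(0).
    return np.sum(Y * np.log(P + eps) + (1 - Y) * np.log(1 - P + eps))
</syntaxhighlight>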
===Transformation into multi-class classification problem===
The label powerset (LP) transformation treats each distinct label combination present in the training set as a single class, turning the task into a multi-class classification problem. For example, if the possible labels for an example were A, B, and C, the label powerset representation of this problem is a multi-class classification problem with the classes [0 0 0], [1 0 0], [0 1 0], [0 0 1], [1 1 0], [1 0 1], [0 1 1], and [1 1 1], where for example [1 0 1] denotes an example in which labels A and C are present and label B is absent.
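A minimal label powerset sketch under the same assumptions (NumPy arrays, LogisticRegression as an arbitrary multi-class learner); each distinct row of Y becomes one class:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_label_powerset(X, Y):
    # Each distinct label combination in the training set becomes one class.
    combos, class_ids = np.unique(Y, axis=0, return_inverse=True)
    return LogisticRegression().fit(X, class_ids), combos

def predict_label_powerset(clf, combos, X):
    # Map each predicted class back to its label combination, e.g. [1 0 1].
    return combos[clf.predict(X)]
</syntaxhighlight>

Note that, by construction, only label combinations observed during training can ever be predicted.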
===Ensemble methods===
A set of multi-class classifiers can be used to create a multi-label ensemble classifier. For a given example, each classifier outputs a single class (corresponding to a single label in the multi-label problem). These predictions are then combined by an ensemble method, usually a voting scheme in which every class that receives a requisite percentage of votes from the individual classifiers (often referred to as the discrimination threshold) is predicted as a present label in the multi-label output. However, more complex ensemble methods exist, such as committee machines.
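A sketch of such a voting scheme, assuming Y_votes stacks the multi-hot predictions of several classifiers (shape: classifiers × samples × labels) and a discrimination threshold of 0.5, an arbitrary choice:

<syntaxhighlight lang="python">
import numpy as np

def vote(Y_votes, threshold=0.5):
    # A label is predicted when its share of votes reaches the threshold.
    return (np.mean(Y_votes, axis=0) >= threshold).astype(int)
</syntaxhighlight>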
Another variation is the random k-labelsets (RAKEL) algorithm, which uses multiple LP classifiers, each trained on a random subset of the actual labels; label prediction is then carried out by a voting scheme. A set of multi-label classifiers can be used in a similar way to create a multi-label ensemble classifier. In this case, each classifier votes once for each label it predicts rather than for a single label.

==Adapted algorithms==