There are different feature selection mechanisms that use mutual information for scoring the different features. They usually all follow the same algorithm:

1. Calculate the mutual information as a score between all features ( f_{i} \in F ) and the target class ( c )
2. Select the feature with the largest score (e.g. \underset{f_{i} \in F}{\operatorname{argmax}}(I(f_{i};c))) and add it to the set of selected features ( S )
3. Calculate the score, which might be derived from the mutual information
4. Select the feature with the largest score and add it to the set of selected features (e.g. \underset{f_{i} \in F}{\operatorname{argmax}}(I_{derived}(f_{i};c)))
5. Repeat 3. and 4. until a certain number of features is selected (e.g. |S|=l)

The simplest approach uses the mutual information itself as the "derived" score (see the sketch below). However, there are different approaches that try to reduce the redundancy between features.
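As an illustration, the generic greedy loop above can be sketched in Python. This is a minimal sketch only: it assumes integer-coded discrete features, uses scikit-learn's mutual_info_score as a plug-in estimator of the mutual information, and the names greedy_mi_selection and plain_mi are illustrative rather than part of any library.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.metrics import mutual_info_score  # plug-in estimate of I(x; y) for discrete vectors


def greedy_mi_selection(X, c, l, derived_score):
    """Generic greedy forward selection over integer-coded features.

    X: (n_samples, n_features) array, c: target class vector, l: desired |S|,
    derived_score(X, c, i, S): score of candidate feature i given the selected set S.
    """
    n_features = X.shape[1]
    # Steps 1-2: plain mutual information with the class picks the first feature.
    relevance = [mutual_info_score(X[:, i], c) for i in range(n_features)]
    S = [int(np.argmax(relevance))]
    # Steps 3-5: repeatedly add the remaining feature with the largest derived score.
    while len(S) < l:
        remaining = [i for i in range(n_features) if i not in S]
        scores = [derived_score(X, c, i, S) for i in remaining]
        S.append(remaining[int(np.argmax(scores))])
    return S


def plain_mi(X, c, i, S):
    """Simplest derived score: the plain mutual information I(f_i; c)."""
    return mutual_info_score(X[:, i], c)
</syntaxhighlight>

With plain_mi as the derived score the loop reduces to ranking features by I(f_{i};c); the redundancy-aware scores discussed below plug into the same loop.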
==Minimum-redundancy-maximum-relevance (mRMR) feature selection==

Peng et al. proposed a feature selection method that can use either mutual information, correlation, or distance/similarity scores to select features. The aim is to penalise a feature's relevancy by its redundancy in the presence of the other selected features. The relevance of a feature set S for the class c is defined by the average value of all mutual information values between the individual feature f_{i} and the class c as follows:

: D(S,c) = \frac{1}{|S|}\sum_{f_{i}\in S}I(f_{i};c).

The redundancy of all features in the set S is the average value of all mutual information values between the feature f_{i} and the feature f_{j}:

: R(S) = \frac{1}{|S|^{2}}\sum_{f_{i},f_{j}\in S}I(f_{i};f_{j}).

The mRMR criterion combines the two measures above:

: \mathrm{mRMR} = \max_{S}\left[\frac{1}{|S|}\sum_{f_{i}\in S}I(f_{i};c) - \frac{1}{|S|^{2}}\sum_{f_{i},f_{j}\in S}I(f_{i};f_{j})\right].

Suppose that there are n full-set features. Let x_{i} be the set membership indicator function for feature f_{i}, so that x_{i}=1 indicates presence and x_{i}=0 indicates absence of the feature f_{i} in the globally optimal feature set. Let c_i=I(f_i;c) and a_{ij}=I(f_i;f_j). The above may then be written as an optimization problem:

: \mathrm{mRMR} = \max_{x\in \{0,1\}^{n}} \left[\frac{\sum^{n}_{i=1}c_{i}x_{i}}{\sum^{n}_{i=1}x_{i}} - \frac{\sum^{n}_{i,j=1}a_{ij}x_{i}x_{j}}{(\sum^{n}_{i=1}x_{i})^{2}}\right].

The mRMR algorithm is an approximation of the theoretically optimal maximum-dependency feature selection algorithm, which maximizes the mutual information between the joint distribution of the selected features and the classification variable. As mRMR approximates the combinatorial estimation problem with a series of much smaller problems, each of which involves only two variables, it uses pairwise joint probabilities, which are more robust. In certain situations the algorithm may underestimate the usefulness of features, as it has no way to measure interactions between features that can increase relevancy. This can lead to poor performance when features are individually uninformative but useful in combination.
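In its common incremental (greedy) form, the mRMR step can be expressed as a derived score for the generic loop sketched earlier. The following is a minimal sketch of the difference form of the criterion, in which a candidate's relevance is penalised by its average redundancy with the already selected features; the name mrmr_score is illustrative.

<syntaxhighlight lang="python">
from sklearn.metrics import mutual_info_score


def mrmr_score(X, c, i, S):
    """mRMR derived score (difference form): relevance minus average redundancy."""
    relevance = mutual_info_score(X[:, i], c)
    redundancy = sum(mutual_info_score(X[:, i], X[:, j]) for j in S) / len(S)
    return relevance - redundancy


# Example usage with the greedy loop sketched earlier:
# selected = greedy_mi_selection(X, c, l=10, derived_score=mrmr_score)
</syntaxhighlight>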
==Quadratic programming feature selection==

mRMR is a typical example of an incremental greedy strategy for feature selection: once a feature has been selected, it cannot be deselected at a later stage. While mRMR could be optimized using floating search to reduce some features, it might also be reformulated as a global quadratic programming optimization problem as follows:

: \mathrm{QPFS}: \min_\mathbf{x} \left\{ \alpha \mathbf{x}^T H \mathbf{x} - \mathbf{x}^T F\right\} \quad \mbox{s.t.} \ \sum_{i=1}^n x_i=1, x_i\geq 0

where F_{n\times1}=[I(f_1;c),\ldots, I(f_n;c)]^T is the vector of feature relevancy assuming there are n features in total, H_{n\times n}=[I(f_i;f_j)]_{i,j=1\ldots n} is the matrix of feature pairwise redundancy, and \mathbf{x}_{n\times 1} represents relative feature weights. QPFS is solved via quadratic programming. It has recently been shown that QPFS is biased towards features with smaller entropy. An alternative criterion based on the conditional mutual information is:

: \mathrm{SPEC_{CMI}}: \max_{\mathbf{x}} \left\{\mathbf{x}^T Q \mathbf{x}\right\} \quad \mbox{s.t.}\ \|\mathbf{x}\|=1, x_i\geq 0

where Q_{ii}=I(f_i;c) and Q_{ij}=(I(f_i;c|f_j)+I(f_j;c|f_i))/2 for i\ne j. An advantage of \mathrm{SPEC_{CMI}} is that it can be solved simply by finding the dominant eigenvector of Q and is therefore very scalable. \mathrm{SPEC_{CMI}} also handles second-order feature interaction.
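A minimal sketch of the \mathrm{SPEC_{CMI}} computation is given below. It assumes integer-coded discrete features, a simple plug-in estimator of the conditional mutual information, and scikit-learn's mutual_info_score; the helper names cond_mi and spec_cmi_weights are illustrative.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.metrics import mutual_info_score


def cond_mi(a, c, b):
    """Plug-in estimate of I(a; c | b): MI of a and c within each stratum of b, weighted by p(b)."""
    return sum((b == v).mean() * mutual_info_score(a[b == v], c[b == v])
               for v in np.unique(b))


def spec_cmi_weights(X, c):
    """Feature weights from the dominant eigenvector of Q, where
    Q_ii = I(f_i;c) and Q_ij = (I(f_i;c|f_j) + I(f_j;c|f_i)) / 2."""
    n = X.shape[1]
    Q = np.zeros((n, n))
    for i in range(n):
        Q[i, i] = mutual_info_score(X[:, i], c)
        for j in range(i + 1, n):
            Q[i, j] = Q[j, i] = 0.5 * (cond_mi(X[:, i], c, X[:, j]) +
                                       cond_mi(X[:, j], c, X[:, i]))
    eigvals, eigvecs = np.linalg.eigh(Q)   # Q is symmetric and entrywise non-negative
    x = np.abs(eigvecs[:, -1])             # dominant eigenvector, sign fixed to be non-negative
    return x / np.linalg.norm(x)           # feature weights with ||x|| = 1
</syntaxhighlight>

Selecting the features with the largest weights then gives the ranked subset.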
==Joint mutual information==

In a study of different scores, Brown et al. recommended the joint mutual information as a good score for feature selection. The score tries to find the feature that adds the most new information to the already selected features, in order to avoid redundancy. The score is formulated as follows:

: \begin{align} JMI(f_i) &= \sum_{f_j \in S} \bigl(I(f_j;c) + I(f_i;c|f_j)\bigr) \\ &= \sum_{f_j \in S} \bigl[ I(f_j;c) + I(f_i;c) - \bigl(I(f_i;f_j) - I(f_i;f_j|c)\bigr)\bigr] \end{align}

The score uses the conditional mutual information and the mutual information to estimate the redundancy between the already selected features ( f_j \in S ) and the feature under investigation ( f_i ).
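A minimal sketch of the JMI score, again written as a derived score for the greedy loop sketched earlier; it uses the same plug-in estimators as above, and the name jmi_score is illustrative.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.metrics import mutual_info_score


def cond_mi(a, c, b):
    """Plug-in estimate of I(a; c | b), as in the SPEC_CMI sketch above."""
    return sum((b == v).mean() * mutual_info_score(a[b == v], c[b == v])
               for v in np.unique(b))


def jmi_score(X, c, i, S):
    """JMI derived score: sum over selected f_j of I(f_j; c) + I(f_i; c | f_j)."""
    return sum(mutual_info_score(X[:, j], c) + cond_mi(X[:, i], c, X[:, j])
               for j in S)
</syntaxhighlight>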
==Hilbert-Schmidt Independence Criterion Lasso based feature selection==