Machine learning on high-dimensional data faces the curse of dimensionality, a set of phenomena that limit the performance of learning tools. Many of these limitations stem directly from the representation of the data rather than from the analysis tool. It is therefore necessary to reduce the dimensionality of the data. There are essentially two ways to do this: selecting features among the original variables, or projecting the original variables onto new ones. Although projection is more general and thus more powerful in theory, it induces a loss of interpretability. In contrast, selecting original features makes it possible to return to the application and interpret which factors are relevant for the analysis; this is an important advantage in many applications. This paper shows how to use Mutual Information (MI) for feature selection. In practice, the MI criterion has to be estimated, and the search over possible feature subsets has to be restricted for reasons of computation time. It is shown how to use resampling and permutation tests to select optimal parameters for the estimator and to stop the search procedure in a sound way. It is also shown how to design an estimator of feature subset relevance inspired by the mutual information criterion, with the additional advantage of restricting the estimation to a two-dimensional problem.
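As a rough illustration of the general idea (greedy forward search on an estimated MI criterion, stopped by a permutation test), the following minimal sketch may help. It is not the procedure of the paper: the abstract does not specify the estimator, so a simple plug-in estimator for discrete variables is substituted here, and the names `mutual_information`, `forward_select`, `make_toy_data`, as well as the parameters `alpha` and `n_perm`, are hypothetical choices made for this example. The two-dimensional relevance estimator mentioned in the abstract is not reproduced.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in MI estimate (in nats) between two discrete sequences.

    Assumption: a stand-in estimator for discrete data, not the paper's.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # Sum over observed pairs of p(x, y) * log(p(x, y) / (p(x) * p(y))).
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def subset_column(X, subset):
    """Collapse the chosen columns of X into one tuple-valued variable."""
    return [tuple(row[j] for j in subset) for row in X]


def forward_select(X, y, alpha=0.05, n_perm=200, seed=0):
    """Greedy forward search; stop when the best candidate is not significant.

    alpha and n_perm are hypothetical defaults chosen for illustration.
    """
    rng = random.Random(seed)
    n_features = len(X[0])
    selected = []
    while len(selected) < n_features:
        remaining = [j for j in range(n_features) if j not in selected]
        # Greedy step: candidate maximising the estimated MI of the subset.
        best = max(
            remaining,
            key=lambda j: mutual_information(subset_column(X, selected + [j]), y),
        )
        observed = mutual_information(subset_column(X, selected + [best]), y)
        # Null distribution: shuffle only the candidate column, which breaks
        # its link with y while keeping the already selected features fixed.
        base = subset_column(X, selected)
        candidate = [row[best] for row in X]
        hits = 0
        for _ in range(n_perm):
            rng.shuffle(candidate)
            permuted = [b + (c,) for b, c in zip(base, candidate)]
            if mutual_information(permuted, y) >= observed:
                hits += 1
        p_value = (hits + 1) / (n_perm + 1)
        if p_value > alpha:
            break  # the MI gain is indistinguishable from chance: stop
        selected.append(best)
    return selected


def make_toy_data(n=400, seed=1):
    """Toy data: y depends on features 0 and 1; feature 2 is pure noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        row = [rng.randint(0, 1) for _ in range(3)]
        label = row[0] + 2 * row[1]
        if rng.random() < 0.1:  # 10% of labels are replaced at random
            label = rng.randint(0, 3)
        X.append(row)
        y.append(label)
    return X, y


X, y = make_toy_data()
selected = forward_select(X, y)
print("selected feature indices:", selected)
```

Shuffling only the candidate column means the permuted MI values carry the same finite-sample bias as the observed value, so the comparison asks whether the candidate adds information beyond what the estimator would report by chance. On continuous data the plug-in estimator would have to be replaced by one suited to real-valued variables, whose parameters would then also need to be chosen.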
Verleysen, M., & François, D. (2008). Parameter-free feature selection with mutual information. Proceedings of the first workshop of the ERCIM Working Group on Computing and Statistics, p. 13. https://hdl.handle.net/2078.5/254139