One of the earliest challenges a practitioner faces when using distance-based tools is the choice of the distance, for which there is often very little information to rely on. This chapter proposes a compromise between an a priori, unoptimized choice (e.g. the Euclidean distance) and a fully optimized but computationally expensive choice made by means of some resampling method. The compromise consists in choosing the distance definition according to the results obtained with a very simple regression model (that is, one with few or no meta-parameters) and then using that distance in some other, more elaborate regression model. The rationale behind this heuristic is that the similarity measure which best reflects the notion of similarity relevant to the application should be the optimal one whatever model is used for classification or regression. This idea is tested on nine datasets and five prediction models. The results show that this approach is a reasonable compromise between the default choice and a fully optimized choice of the metric.
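The heuristic can be illustrated with a minimal sketch. The abstract does not say which simple model or which candidate distances the chapter uses, so everything concrete here is an assumption made for illustration: the simple model is taken to be a 1-nearest-neighbour regressor scored by leave-one-out error, the candidates are Minkowski distances of a few orders, and the more elaborate model is a Gaussian kernel regressor whose bandwidth would still need tuning. The function names and the toy data are hypothetical.

```python
import math
import random

def minkowski(p):
    """Return a Minkowski distance function of order p (p=2 is Euclidean)."""
    def dist(a, b):
        return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)
    return dist

def loo_1nn_error(X, y, dist):
    """Leave-one-out mean squared error of a 1-nearest-neighbour regressor.

    Assumed stand-in for the 'very simple model': it has no meta-parameter
    besides the distance itself.
    """
    total = 0.0
    n = len(X)
    for i in range(n):
        # Nearest neighbour of X[i] among all other points.
        j = min((k for k in range(n) if k != i),
                key=lambda k: dist(X[i], X[k]))
        total += (y[i] - y[j]) ** 2
    return total / n

def choose_metric(X, y, candidates):
    """Return the candidate with the lowest LOO 1-NN error, plus all errors."""
    errors = {name: loo_1nn_error(X, y, d) for name, d in candidates.items()}
    return min(errors, key=errors.get), errors

def kernel_predict(x, X, y, dist, bandwidth):
    """Gaussian kernel regression using the chosen distance.

    Assumed stand-in for the 'more elaborate model'; the bandwidth is a
    meta-parameter that would still be tuned, but the distance is fixed.
    """
    w = [math.exp(-(dist(x, xi) / bandwidth) ** 2) for xi in X]
    s = sum(w)
    if s == 0.0:  # every weight underflowed: fall back to the mean target
        return sum(y) / len(y)
    return sum(wi * yi for wi, yi in zip(w, y)) / s

# Hypothetical toy data, for illustration only.
rng = random.Random(0)
X = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(60)]
y = [abs(a) + abs(b) for a, b in X]

candidates = {"p=0.5": minkowski(0.5), "p=1": minkowski(1), "p=2": minkowski(2)}
best, errors = choose_metric(X, y, candidates)
prediction = kernel_predict((0.2, -0.3), X, y, candidates[best], bandwidth=0.3)
```

In this sketch the distance is selected once with the cheap model, so only the bandwidth remains to be tuned for the kernel regressor. A fully optimized choice would instead search over distances and bandwidths jointly.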
François, D., Wertz, V., & Verleysen, M. (2011). Choosing the Metric: A Simple Model Approach. In Norbert Jankowski (ed.), Meta-Learning in Computational Intelligence (pp. 97-115). Springer. https://doi.org/10.1007/978-3-642-20980-2_3