This article proposes a methodology that accounts for the uncertainty intrinsic to data-based models when comparing their performance. The goal is to quantify the variability of such models that arises from the random nature of the calibration process, and to enable a statistical comparison of the models' performance when attempting to identify the best one. The proposed methodology does not provide an alternative metric for model performance; rather, it extends the traditional deterministic comparison to a stochastic one. It builds on the current standard approach for developing data-based models, and its application is demonstrated by modelling sewer condition using data from 4 trunk sewers of the SANEST (Saneamento da Costa do Estoril) sewer system, corresponding to 25 km of sewer pipes. The data-based models were developed using artificial neural networks, support vector machines, bootstrap aggregation and least squares support vector machines. For the case study, the best and the average misclassification rates are similar across all models (23% to 24% and 31% to 33%, respectively), but the worst misclassification rate varies much more (39% to 62%). This demonstrates that selecting a model based solely on the performance of its best single realisation may be misleading.
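The idea of comparing a distribution of realisations rather than a single best run can be illustrated with a minimal sketch. This is not the article's implementation: the synthetic data, the nearest-centroid classifier and the names `make_toy_data`, `misclassification_rate` and `summarise` are all hypothetical stand-ins, chosen only so the example runs with the Python standard library. The 70/30 split and the 100 realisations are likewise assumptions.

```python
import random
import statistics


def make_toy_data(n, rng):
    """Synthetic two-class data with one noisy feature (hypothetical stand-in for real records)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(1.0 if label else 0.0, 1.0)
        data.append((x, label))
    return data


def misclassification_rate(data, seed):
    """One realisation: random 70/30 split, calibrate on the first part, score on the rest."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(0.7 * len(shuffled))
    train, test = shuffled[:cut], shuffled[cut:]
    # Calibration: class centroids estimated from the training part only.
    mean0 = statistics.mean(x for x, y in train if y == 0)
    mean1 = statistics.mean(x for x, y in train if y == 1)
    errors = 0
    for x, y in test:
        predicted = 1 if abs(x - mean1) < abs(x - mean0) else 0
        errors += predicted != y
    return errors / len(test)


def summarise(rates):
    """Best, mean and worst misclassification rate over all realisations."""
    return {
        "best": min(rates),
        "mean": statistics.mean(rates),
        "worst": max(rates),
    }


data = make_toy_data(200, random.Random(0))
# Repeat the random calibration many times instead of keeping one run.
rates = [misclassification_rate(data, seed) for seed in range(100)]
summary = summarise(rates)
print(summary)
```

Reporting the best, mean and worst rates for each candidate model side by side shows whether a model that wins on its best run is also competitive on average and in its worst case, which is the kind of comparison the abstract argues for.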
Carmela González Troncoso, Bogdan Kulynych
Anastasia Ailamaki, Viktor Sanca