Data Science Seminar

Unraveling uncertainty in benchmarking: Methods and open-source toolkit for analyzing and visualizing challenge results

Although the scientific impact of biomedical image analysis challenges is steadily increasing, there is a surprising discrepancy between the impact of challenges and their quality control. In particular, challenge rankings are sensitive to a range of design parameters: the rankings, and thus the identified challenge winner, may strongly depend on the chosen ranking method or on a small number of test cases. The validity and transferability of challenge results may therefore be questioned due to potentially considerable instabilities in the rankings. Yet most challenge publications ignore the uncertainty associated with rankings, and result presentations are often limited to a ranking list and simple visualizations of the metric values for each algorithm.

The purpose of this work is therefore to propose methodology, along with an open-source framework, for systematically analyzing and visualizing challenge results. It is intended to help challenge organizers and participants gain deeper insights into both the algorithms' performance and the assessment data set itself in an intuitive manner.

Visualization approaches are presented both for challenges designed around a single task and for challenges comprising multiple tasks. The proposed tools involve bootstrapping, significance testing and unsupervised learning. They make it possible to investigate questions such as whether there are influential test cases, whether the winner is consistently superior to the other algorithms across test cases, whether the winner is significantly superior, which range of ranks for a specific algorithm is supported by the data, which tasks yield a clear separation of algorithms and a stable ranking, and which tasks are similar with respect to their rankings.
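To give a concrete flavor of the bootstrapping idea, the following is a minimal sketch in Python. It is not the speaker's toolkit; the algorithm names, metric values and ranking scheme (rank by mean metric, higher is better) are purely illustrative assumptions. Test cases are resampled with replacement, algorithms are re-ranked in each bootstrap sample, and the spread of the resulting ranks indicates how stable the original ranking is.

# Sketch: bootstrap-based assessment of ranking stability (hypothetical data).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical metric values: rows = test cases, columns = algorithms.
algorithms = ["A", "B", "C"]
metrics = rng.normal(loc=[0.80, 0.78, 0.75], scale=0.05, size=(50, 3))

def rank_by_mean(values):
    # Rank algorithms by mean metric value (higher is better); rank 1 is best.
    order = np.argsort(-values.mean(axis=0))
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks

n_boot = 1000
boot_ranks = np.empty((n_boot, len(algorithms)), dtype=int)
for b in range(n_boot):
    # Resample test cases with replacement and re-compute the ranking.
    sample = metrics[rng.integers(0, len(metrics), size=len(metrics))]
    boot_ranks[b] = rank_by_mean(sample)

# Rank distribution per algorithm: a concentrated distribution suggests a
# stable rank, a spread-out one suggests an unstable rank.
for i, name in enumerate(algorithms):
    counts = np.bincount(boot_ranks[:, i], minlength=len(algorithms) + 1)[1:]
    print(name, "rank distribution:", dict(enumerate(counts, start=1)))

In the same spirit, the range of ranks observed across bootstrap samples can be visualized, for example as a blob plot or stacked frequency plot per algorithm, to show which ranks are supported by the data.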

All techniques are illustrated by synthetic and real-world assessment data.

Dr. Manuel Wiesenfarth
