
Best practices in scientific benchmarking and validation

The importance of data science techniques in almost all fields of medicine is increasing at an enormous pace. This holds particularly true for radiology and image-guided interventions, where the automatic analysis of medical images (e.g. for tumor detection, classification, staging and progression modeling) plays a crucial role. While clinical trials are the state-of-the-art method for comparatively assessing the effect of new medication, benchmarking in the field of image analysis is governed by so-called challenges. Challenges are international competitions, hosted for example by individual researchers, institutes or societies, that assess the performance of competing algorithms on identical data sets for benchmarking. They are often published in prestigious journals, such as Nature Methods, and receive a huge amount of attention, with hundreds of citations and thousands of views. Moreover, on platforms like Kaggle, awarding the winner a significant amount of prize money (up to €1 million) is becoming increasingly common.

Given that validation of algorithms has traditionally been performed on individual researchers' data sets, this development was a great step forward. On the other hand, the increasing scientific impact of challenges now puts huge responsibility on the shoulders of the challenge hosts who take care of the organization and design of such competitions. The performance of an algorithm on challenge data is essential not only for the acceptance of a paper and its impact on the community, but also for the scientific careers of the individuals involved (e.g. through awards and paper acceptance) and for the potential translation of algorithms into clinical practice.

In the scope of our research, we hypothesized that there is a huge discrepancy between the importance of biomedical image analysis challenges and their quality (control). In response, we formed an international multidisciplinary initiative with partners from about 30 institutes worldwide to bring international challenges to the next level. In an article in Nature Communications (Maier-Hein et al., 2018), we present the first comprehensive evaluation of biomedical image analysis challenges. Our analysis of more than 500 sub-competitions (tasks) demonstrates the high importance of challenges in the field of biomedical image analysis, but also reveals major issues:

  1. Challenge reporting: Common practice related to challenge reporting is poor and does not allow for adequate interpretation and reproducibility of results.
  2. Challenge design: Challenge design is very heterogeneous and lacks common standards, although these are requested by the community.
  3. Robustness of rankings: Rankings are sensitive to a range of challenge design parameters such as the metric variant applied, the type of test case aggregation performed and the observer annotating the data (see figure below). The choice of metric and aggregation scheme has a significant influence on the ranking’s stability.
  4. Exploitation of common practice: Security holes in challenge design can potentially be exploited by both challenge organizers and participants to tune rankings (e.g. by selective test case submission (participants) or retrospective tuning of the ranking scheme (organizers)).
  5. Best practice recommendations: Based on the findings of our analysis and an international survey, we present a list of best practice recommendations and open research challenges.
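For context, the Dice similarity coefficient (DSC) referenced above quantifies the overlap between a predicted and a reference segmentation. A minimal sketch on toy binary masks (illustrative data only, not taken from any challenge):

```python
def dice(pred, ref):
    """Dice similarity coefficient for binary masks given as sets of pixels.
    DSC = 2|P ∩ R| / (|P| + |R|); 1.0 means perfect overlap, 0.0 none."""
    if not pred and not ref:
        return 1.0  # convention: two empty masks agree perfectly
    return 2 * len(pred & ref) / (len(pred) + len(ref))

# Toy masks as sets of (row, col) pixel coordinates
pred = {(0, 0), (0, 1), (1, 0)}
ref = {(0, 1), (1, 0), (1, 1)}
print(dice(pred, ref))  # 2*2 / (3+3) ≈ 0.667
```

Overlap-based metrics like the DSC and distance-based metrics like the Hausdorff distance capture different error characteristics, which is one reason the choice of metric can reshuffle a ranking.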

Figure: Effect of different ranking schemes (RS) applied to one example MICCAI 2015 segmentation task. Design choices are indicated in the header: RS xy defines the different ranking schemes. The following three rows indicate the metric used (Dice similarity coefficient (DSC), Hausdorff distance (HD) or the 95% variant of the HD (HD95)), the aggregation method (metric-based (aggregate, then rank) or case-based (rank, then aggregate)) and the aggregation operator (mean or median). RS 00 (single-metric ranking with DSC; aggregate with mean, then rank) is considered the default ranking scheme. For each RS, the resulting ranking is shown for algorithms A1 to A13. To illustrate the effect of different RS on single algorithms, A1, A6 and A11 are highlighted.
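The difference between the two aggregation schemes described above ("aggregate, then rank" vs. "rank, then aggregate") can be sketched in a few lines of Python. The scores below are invented for illustration and are not taken from the MICCAI 2015 task; they merely show that mean aggregation, median aggregation and case-based rank aggregation can each produce a different ordering:

```python
# Per-case Dice scores for three hypothetical algorithms on four test cases
# (illustrative values only; higher Dice is better).
scores = {
    "A1": [0.95, 0.90, 0.40, 0.92],  # strong overall, one severe failure case
    "A2": [0.88, 0.87, 0.86, 0.85],  # consistently mediocre
    "A3": [0.91, 0.60, 0.89, 0.90],  # strong with one weak case
}

def aggregate_then_rank(scores, agg):
    """Metric-based ranking: aggregate per-case scores first, then rank."""
    agg_scores = {algo: agg(vals) for algo, vals in scores.items()}
    return sorted(agg_scores, key=agg_scores.get, reverse=True)

def rank_then_aggregate(scores):
    """Case-based ranking: rank algorithms per test case, then sum the ranks.
    (Ties are not handled here; real toolkits need a tie-breaking rule.)"""
    n_cases = len(next(iter(scores.values())))
    rank_sums = {algo: 0 for algo in scores}
    for i in range(n_cases):
        case_order = sorted(scores, key=lambda a: scores[a][i], reverse=True)
        for rank, algo in enumerate(case_order, start=1):
            rank_sums[algo] += rank
    return sorted(rank_sums, key=rank_sums.get)  # lower rank sum is better

mean = lambda v: sum(v) / len(v)
median = lambda v: sorted(v)[len(v) // 2]  # simple upper median

print(aggregate_then_rank(scores, mean))    # ['A2', 'A3', 'A1']
print(aggregate_then_rank(scores, median))  # ['A1', 'A3', 'A2']
print(rank_then_aggregate(scores))          # ['A1', 'A3', 'A2']
```

With mean aggregation, A1's single failure case drags it to last place; with the median or case-based rank aggregation, the same failure is largely ignored and A1 wins. This is exactly the kind of design sensitivity the analysis quantifies.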

To address the discrepancy between the impact of challenges and their quality (control), the Biomedical Image Analysis ChallengeS (BIAS) initiative, founded by the MICCAI board challenge working group led by Prof. Dr. Lena Maier-Hein, developed a set of recommendations for the reporting of challenges. The BIAS statement aims to improve the transparency of the reporting of a biomedical image analysis challenge regardless of the field of application, image modality or task category assessed. A first step towards this goal was the submission of a guideline paper on how to report biomedical challenges, which is currently under review. The document further includes checklists for challenge organizers and journal reviewers covering all relevant items to be reported. The guideline itself was registered with the EQUATOR (Enhancing the QUAlity and Transparency Of health Research) network.

Further information can be found here.


  • Keno März (Developer)
  • Patrick Scholz (Student Assistant)
  • Marko Stankovic (Master's Student)
  • Sebastian Pirmann (Student Assistant)

Key collaborators

Metrics Reloaded Consortium

Lena Maier-Hein
Annika Reinke
Patrick Godau
Minu Dietlinde Tizabi
Evangelia Christodoulou
Ben Glocker
Fabian Isensee
Jens Kleesiek
Michal Kozubek
Mauricio Reyes
Michael Riegler
Manuel Wiesenfarth
Michael Baumgartner
Matthias Eisenmann
Doreen Heckmann-Nötzel
Emre Kavur
Tim Rädsch
Laura Acion
Michela Antonelli
Tal Arbel
Spyridon Bakas
Arriel Benis
Matthew Blaschko
Florian Büttner
M. Jorge Cardoso
Veronika Cheplygina
Beth A. Cimini
Gary S. Collins
Keyvan Farahani
Luciana Ferrer
Adrian Galdran
Bram van Ginneken
Robert Haase
Daniel Hashimoto
Michael Hoffman
Merel Huisman
Pierre Jannin
Charles E. Kahn
Dagmar Kainmueller
Bernhard Kainz
Alexandros Karargyris
Alan Karthikesalingam
Hannes Kenngott
Florian Kofler
Annette Kopp-Schneider
Anna Kreshuk
Tahsin Kurc
Bennett Landman
Geert Litjens
Amin Madani
Klaus Maier-Hein
Anne Martel
Peter Mattson
Erik Meijering
Bjoern Menze
Karel G.M. Moons
Henning Müller
Brennan Nichyporuk
Felix Nickel
Jens Petersen
Nasir Rajpoot
Nicola Rieke
Julio Saez-Rodriguez
Clarisa Sánchez Gutiérrez
Shravya Shetty
Maarten van Smeden
Carole Sudre
Ronald Summers
Aziz A. Taha
Aleksei Tiulpin
Sotirios A Tsaftaris
Ben Van Calster
Gael Varoquaux
Paul Jäger


Maier-Hein, L., Eisenmann, M., Reinke, A., Onogur, S., Stankovic, M., Scholz, P., Arbel, T., Bogunovic, H., Bradley, A. P., Carass, A., Feldmann, C., Frangi, A. F., Full, P. M., van Ginneken, B., Hanbury, A., Honauer, K., Kozubek, M., Landman, B. A., März, K., ... Kopp-Schneider, A. (2018). Why rankings of biomedical image analysis competitions should be interpreted with care. Nature Communications, 9(1), 5217.

Maier-Hein, L., Reinke, A., Kozubek, M., Martel, A. L., Arbel, T., Eisenmann, M., Hanbury, A., Jannin, P., Müller, H., Onogur, S., Saez-Rodriguez, J., van Ginneken, B., Kopp-Schneider, A., & Landman, B. (2020). BIAS: Transparent reporting of biomedical image analysis challenges. Medical Image Analysis, 101796.

Maier-Hein, L., Wagner, M., Ross, T., Reinke, A., Bodenstedt, S., Full, P. M., Hempe, H., Mindroc-Filimon, D., Scholz, P., Tran, T. N., Bruno, P., Kisilenko, A., Müller, B., Davitashvili, T., Capek, M., Tizabi, M., Eisenmann, M., Adler, T. J., Gröhl, J., ... Müller-Stich, B. P. (2020). Heidelberg Colorectal Data Set for Surgical Data Science in the Sensor Operating Room. ArXiv:2005.03501 [Cs].

Reinke, A., Eisenmann, M., Onogur, S., Stankovic, M., Scholz, P., Full, P. M., Bogunovic, H., Landman, B. A., Maier, O., Menze, B., Sharp, G. C., Sirinukunwattana, K., Speidel, S., van der Sommen, F., Zheng, G., Müller, H., Kozubek, M., Arbel, T., Bradley, A. P., ... Maier-Hein, L. (2018). How to Exploit Weaknesses in Biomedical Challenge Design and Organization. In A. F. Frangi, J. A. Schnabel, C. Davatzikos, C. Alberola-López, & G. Fichtinger (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2018 (pp. 388–395). Springer International Publishing.

Ross, T., Reinke, A., Full, P. M., Wagner, M., Kenngott, H., Apitz, M., Hempe, H., Filimon, D. M., Scholz, P., Tran, T. N., Bruno, P., Arbeláez, P., Bian, G.-B., Bodenstedt, S., Bolmgren, J. L., Bravo-Sánchez, L., Chen, H.-B., González, C., Guo, D., ... Maier-Hein, L. (2020). Comparative validation of multi-instance instrument segmentation in endoscopy: Results of the ROBUST-MIS 2019 challenge. Medical Image Analysis, 101920.

Wiesenfarth, M., Reinke, A., Landman, B. A., Eisenmann, M., Saiz, L. A., Cardoso, M. J., Maier-Hein, L., & Kopp-Schneider, A. (2021). Methods and open-source toolkit for analyzing and visualizing challenge results. Scientific Reports, 11(1), 2369.

Reinke, A., Eisenmann, M., Tizabi, M. D., Sudre, C. H., Rädsch, T., Antonelli, M., Arbel, T., Bakas, S., Cardoso, M. J., Cheplygina, V., Farahani, K., Glocker, B., Heckmann-Nötzel, D., Isensee, F., Jannin, P., Kahn, C. E., Kleesiek, J., Kurc, T., Kozubek, M., ... Maier-Hein, L. (2021). Common Limitations of Image Processing Metrics: A Picture Story. ArXiv:2104.05642.




This project is funded by the Helmholtz Imaging Platform (HIP).
