Computer Assisted Medical Interventions

Best practices in scientific benchmarking and validation

The importance of data science techniques in almost all fields of medicine is increasing at an enormous pace. This holds particularly true for radiology and image-guided interventions, where the automatic analysis of medical images (e.g. for tumor detection, classification, staging and progression modeling) plays a crucial role. While clinical trials are the state-of-the-art method for comparatively assessing the effect of new medication, benchmarking in the field of image analysis is governed by so-called challenges. Challenges are international competitions, hosted for example by individual researchers, institutes or societies, that aim to assess the performance of competing algorithms on identical data sets. Their results are often published in prestigious journals, such as Nature Methods, and receive a huge amount of attention, with hundreds of citations and thousands of views. Moreover, on platforms like Kaggle, awarding the winner a significant amount of prize money (up to €1 million) is becoming increasingly common.

Given that validation of algorithms has traditionally been performed on the individual researchers' data sets, this development was a great step forward. On the other hand, the increasing scientific impact of challenges now puts a huge responsibility on the shoulders of the challenge hosts, who take care of the organization and design of such competitions. The performance of an algorithm on challenge data is essential not only for the acceptance of a paper and its impact on the community, but also for the individuals' scientific careers (e.g. due to awards or paper (non-)acceptance) and for the potential of algorithms to be translated into clinical practice.

In the scope of our research, we developed the hypothesis that there is a huge discrepancy between the importance of biomedical image analysis challenges and their quality (control). In response to this, we formed an international multidisciplinary initiative with partners from about 30 institutes worldwide to bring international challenges to the next level. In an article in Nature Communications (Maier-Hein et al., 2018), we present the first comprehensive evaluation of biomedical image analysis challenges. Our analysis of more than 500 sub-competitions (tasks) demonstrates the high importance of challenges in the field of biomedical image analysis, but also reveals major issues:

  1. Challenge reporting: Common practice related to challenge reporting is poor and does not allow for adequate interpretation and reproducibility of results.
  2. Challenge design: Challenge design is very heterogeneous and lacks common standards, although these are requested by the community.
  3. Robustness of rankings: Rankings are sensitive to a range of challenge design parameters such as the metric variant applied, the type of test case aggregation performed and the observer annotating the data (see figure below). The choice of metric and aggregation scheme has a significant influence on the ranking’s stability.
  4. Exploitation of common practice: Security holes in challenge design can potentially be exploited by both challenge organizers and participants to tune rankings (e.g. by selective test case submission (participants) or retrospective tuning of the ranking scheme (organizers)).
  5. Best practice recommendations: Based on the findings of our analysis and an international survey, we present a list of best practice recommendations and open research challenges.
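
The sensitivity to the metric variant named in item 3 can be illustrated with a minimal Python sketch. It compares the Hausdorff distance (HD) with a 95th-percentile variant (HD95) on two hypothetical contours given as 2D point sets. The point sets, the function names and the percentile definition (linear interpolation between sorted distances) are assumptions made for illustration and are not taken from the study; other HD95 definitions exist and can yield different values.

```python
import math

def directed_distances(a, b):
    """For each point in a, the distance to its nearest point in b."""
    return [min(math.dist(p, q) for q in b) for p in a]

def percentile(values, q):
    """q-th percentile (0-100), linear interpolation between sorted values."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def hausdorff(a, b, q=100.0):
    """Symmetric distance: q=100 gives the HD, q=95 one possible HD95."""
    return max(percentile(directed_distances(a, b), q),
               percentile(directed_distances(b, a), q))

# Hypothetical reference contour: 20 points on a straight line.
ref = [(i, 0) for i in range(20)]
# Prediction 1: every point shifted by one unit.
pred_shifted = [(i, 1) for i in range(20)]
# Prediction 2: perfect contour plus a single far-away outlier.
pred_outlier = ref + [(10, 10)]

for name, pred in [("shifted", pred_shifted), ("outlier", pred_outlier)]:
    print(name, "HD:", hausdorff(pred, ref), "HD95:", hausdorff(pred, ref, q=95))
```

With these toy contours, the uniformly shifted prediction is better under the HD, while the prediction with a single outlier is better under HD95, so the order of the two predictions depends on the metric variant alone.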

Figure: Effect of different ranking schemes (RS) applied to one example MICCAI 2015 segmentation task. Design choices are indicated in the header: RS xy denotes the individual ranking schemes, and the three rows below indicate the metric used (Dice similarity coefficient (DSC), Hausdorff distance (HD) or the 95% variant of the HD (HD95)), the aggregation method (metric-based (aggregate, then rank) or case-based (rank, then aggregate)) and the aggregation operator (mean or median). RS 00 (single-metric ranking with DSC; aggregate with mean, then rank) is considered the default ranking scheme. For each RS, the resulting ranking is shown for algorithms A1 to A13. To illustrate the effect of different RS on single algorithms, A1, A6 and A11 are highlighted.
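
The two aggregation methods and the two aggregation operators described in the caption can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in the study: the per-case DSC values, the algorithm labels and the function names are made up, and ties are not handled explicitly (the toy data contain none).

```python
from statistics import mean, median

# Hypothetical per-case DSC values (higher is better); not from the study.
scores = {
    "A1": [0.90, 0.90, 0.90, 0.90, 0.10],  # strong, but fails on one case
    "A2": [0.80, 0.80, 0.80, 0.80, 0.80],  # consistently moderate
    "A3": [0.85, 0.85, 0.85, 0.70, 0.70],
}

def aggregate_then_rank(scores, op):
    """Metric-based: aggregate each algorithm's case scores, then rank."""
    agg = {alg: op(vals) for alg, vals in scores.items()}
    return sorted(agg, key=agg.get, reverse=True)  # higher DSC is better

def rank_then_aggregate(scores, op):
    """Case-based: rank the algorithms per case, then aggregate the ranks."""
    algs = list(scores)
    n_cases = len(next(iter(scores.values())))
    ranks = {alg: [] for alg in algs}
    for i in range(n_cases):
        ordered = sorted(algs, key=lambda a: scores[a][i], reverse=True)
        for rank, alg in enumerate(ordered, start=1):
            ranks[alg].append(rank)
    agg = {alg: op(r) for alg, r in ranks.items()}
    return sorted(agg, key=agg.get)  # lower aggregated rank is better

rankings = {
    "aggregate (mean), then rank": aggregate_then_rank(scores, mean),
    "aggregate (median), then rank": aggregate_then_rank(scores, median),
    "rank, then aggregate (mean)": rank_then_aggregate(scores, mean),
    "rank, then aggregate (median)": rank_then_aggregate(scores, median),
}
for scheme, ranking in rankings.items():
    print(f"{scheme}: {ranking}")
```

With these made-up values, A1 is ranked last when the mean DSC is aggregated first and ranked first under the other three schemes, because its single failed case dominates the mean but not the median or the per-case ranks.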

To address the discrepancy between the impact of challenges and their quality (control), the Biomedical Image Analysis ChallengeS (BIAS) initiative, founded by the MICCAI board challenge working group and led by Prof. Dr. Lena Maier-Hein, developed a set of recommendations for the reporting of challenges. The BIAS statement aims to improve the transparency of the reporting of a biomedical image analysis challenge regardless of field of application, image modality or task category assessed. A first step to achieve this goal was the submission of a guideline paper on how to report biomedical challenges, which is currently under review. The document further includes checklists for challenge organizers and journal reviewers with all relevant items to be reported. The guideline itself was registered at the EQUATOR (Enhancing the QUAlity and Transparency Of health Research) network.



  • Keno März (Developer)
  • Patrick Scholz (Student Assistant)
  • Marko Stankovic (Master's Student)
  • Sebastian Pirmann (Student Assistant)



Maier-Hein, L., Eisenmann, M., Reinke, A., Onogur, S., Stankovic, M., Scholz, P., Arbel, T., Bogunovic, H., Bradley, A. P., Carass, A., Feldmann, C., Frangi, A. F., Full, P. M., van Ginneken, B., Hanbury, A., Honauer, K., Kozubek, M., Landman, B. A., März, K., ... Kopp-Schneider, A. (2018). Why rankings of biomedical image analysis competitions should be interpreted with care. Nature Communications, 9(1), 5217.

Maier-Hein, L., Reinke, A., Kozubek, M., Martel, A. L., Arbel, T., Eisenmann, M., Hanbury, A., Jannin, P., Müller, H., Onogur, S., Saez-Rodriguez, J., van Ginneken, B., Kopp-Schneider, A., & Landman, B. (2020). BIAS: Transparent reporting of biomedical image analysis challenges. Medical Image Analysis, 101796.

Maier-Hein, L., Wagner, M., Ross, T., Reinke, A., Bodenstedt, S., Full, P. M., Hempe, H., Mindroc-Filimon, D., Scholz, P., Tran, T. N., Bruno, P., Kisilenko, A., Müller, B., Davitashvili, T., Capek, M., Tizabi, M., Eisenmann, M., Adler, T. J., Gröhl, J., ... Müller-Stich, B. P. (2020). Heidelberg Colorectal Data Set for Surgical Data Science in the Sensor Operating Room. arXiv:2005.03501.

Reinke, A., Eisenmann, M., Onogur, S., Stankovic, M., Scholz, P., Full, P. M., Bogunovic, H., Landman, B. A., Maier, O., Menze, B., Sharp, G. C., Sirinukunwattana, K., Speidel, S., van der Sommen, F., Zheng, G., Müller, H., Kozubek, M., Arbel, T., Bradley, A. P., ... Maier-Hein, L. (2018). How to Exploit Weaknesses in Biomedical Challenge Design and Organization. In A. F. Frangi, J. A. Schnabel, C. Davatzikos, C. Alberola-López, & G. Fichtinger (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2018 (pp. 388–395). Springer International Publishing.

Ross, T., Reinke, A., Full, P. M., Wagner, M., Kenngott, H., Apitz, M., Hempe, H., Filimon, D. M., Scholz, P., Tran, T. N., Bruno, P., Arbeláez, P., Bian, G.-B., Bodenstedt, S., Bolmgren, J. L., Bravo-Sánchez, L., Chen, H.-B., González, C., Guo, D., ... Maier-Hein, L. (2020). Comparative validation of multi-instance instrument segmentation in endoscopy: Results of the ROBUST-MIS 2019 challenge. Medical Image Analysis, 101920.

Wiesenfarth, M., Reinke, A., Landman, B. A., Eisenmann, M., Saiz, L. A., Cardoso, M. J., Maier-Hein, L., & Kopp-Schneider, A. (2021). Methods and open-source toolkit for analyzing and visualizing challenge results. Scientific Reports, 11(1), 2369.

Reinke, A., Eisenmann, M., Tizabi, M. D., Sudre, C. H., Rädsch, T., Antonelli, M., Arbel, T., Bakas, S., Cardoso, M. J., Cheplygina, V., Farahani, K., Glocker, B., Heckmann-Nötzel, D., Isensee, F., Jannin, P., Kahn, C. E., Kleesiek, J., Kurc, T., Kozubek, M., ... Maier-Hein, L. (2021). Common Limitations of Image Processing Metrics: A Picture Story. arXiv:2104.05642.




This project is funded by the Helmholtz Imaging Platform (HIP).
