Computer Assisted Medical Interventions

Good scientific practice initiative: Critical review of current practices related to grand challenges


The importance of data science techniques is growing rapidly in almost all fields of medicine. This holds particularly true for radiology, where the automatic analysis of medical images (e.g., for tumor detection, classification, staging and progression modeling) plays a crucial role. While clinical trials are the state-of-the-art method for comparatively assessing the effect of new medication, benchmarking in image analysis is governed by so-called challenges. Challenges are international competitions, hosted for example by individual researchers, institutes, or societies, that assess the performance of competing algorithms on identical datasets. Their results are often published in prestigious journals, such as Nature Methods, and receive a great deal of attention, with hundreds of citations and thousands of views. Awarding the winner a substantial amount of prize money (up to €1 million) on platforms like Kaggle is also becoming increasingly common.

Given that algorithms have traditionally been validated on individual researchers’ own data sets, this development was a great step forward. On the other hand, the growing scientific impact of challenges now places great responsibility on the challenge hosts who organize and design such competitions. The performance of an algorithm on challenge data matters not only for the acceptance of a paper and its impact on the community, but also for the scientific careers of the individuals involved and for the potential of algorithms to be translated into clinical practice.

In the scope of our research, we hypothesized that there is a large discrepancy between the importance of biomedical image analysis challenges and their quality (control). In response, we formed an international multidisciplinary initiative with partners from about 30 institutes worldwide to bring international challenges to the next level. In an article that has recently been accepted for publication in Nature Communications (subject to final minor revisions), we present the first comprehensive evaluation of biomedical image analysis challenges. Our analysis of more than 500 competitions (tasks) demonstrates the high importance of challenges in the field of biomedical image analysis, but also reveals major issues:


  1. Challenge reporting: Common practice related to challenge reporting is poor and does not allow for adequate interpretation and reproducibility of results.
  2. Challenge design: Challenge design is very heterogeneous and lacks common standards, although these are requested by the community.
  3. Robustness of rankings: Rankings are sensitive to a range of challenge design parameters, such as the metric variant applied, the type of test case aggregation performed and the observer annotating the data (see figure). The choice of metric and aggregation scheme has a significant influence on the ranking’s stability.
  4. Exploitation of common practice: Security holes in challenge design can potentially be exploited by both challenge organizers and participants to tune rankings, for example through selective test case submission by participants or retrospective tuning of the ranking scheme by organizers.
  5. Best practice recommendations: Based on the findings of our analysis and an international survey, we present a list of best practice recommendations and open research challenges.
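
The sensitivity of rankings to the aggregation scheme (item 3 above) can be illustrated with a minimal Python sketch. It is not the analysis code used in the article. The algorithm names, the per-case scores, and the helper functions `aggregate_then_rank` and `rank_then_aggregate` are hypothetical and chosen only to show how the same results can produce different rankings.

```python
from statistics import mean, median

# Hypothetical per-case scores for three algorithms on four test cases
# (higher is better, e.g. an overlap metric in [0, 1]).
# All numbers are invented for illustration only.
scores = {
    "A": [0.90, 0.90, 0.90, 0.10],  # strong on most cases, one failure
    "B": [0.80, 0.80, 0.80, 0.80],  # consistent
    "C": [0.85, 0.85, 0.85, 0.50],  # in between
}


def aggregate_then_rank(scores, aggregate):
    """Aggregate each algorithm's scores over all cases, then order by the aggregate."""
    aggregated = {name: aggregate(values) for name, values in scores.items()}
    return sorted(aggregated, key=lambda name: aggregated[name], reverse=True)


def rank_then_aggregate(scores):
    """Rank the algorithms on every test case, then order by mean rank (lower is better)."""
    names = list(scores)
    n_cases = len(next(iter(scores.values())))
    ranks = {name: [] for name in names}
    for case in range(n_cases):
        ordered = sorted(names, key=lambda name: scores[name][case], reverse=True)
        for position, name in enumerate(ordered, start=1):
            ranks[name].append(position)
    mean_rank = {name: mean(values) for name, values in ranks.items()}
    return sorted(mean_rank, key=lambda name: mean_rank[name])


if __name__ == "__main__":
    print("aggregate (mean) then rank:  ", aggregate_then_rank(scores, mean))
    print("aggregate (median) then rank:", aggregate_then_rank(scores, median))
    print("rank then aggregate (mean):  ", rank_then_aggregate(scores))
```

With these toy numbers, aggregating with the mean places algorithm A last because of its single failed case, whereas aggregating with the median or ranking per case first places A first. The winner therefore depends on a design choice that has to be fixed and reported before the results are known.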

Key collaborators



  • Maier-Hein, L., Eisenmann, M., Reinke, A., Onogur, S., Stankovic, M., Scholz, P., Arbel, T., Bogunovic, H., Bradley, A. P., Carass, A., Feldmann, C., Frangi, A. F., Full, P. M., van Ginneken, B., Hanbury, A., Honauer, K., Kozubek, M., Landman, B. A., März, K., Maier, O., Maier-Hein, K., Menze, B. H., Müller, H., Neher, P. F., Niessen, W., Rajpoot, N., Sharp, G. C., Sirinukunwattana, K., Speidel, S., Stock, C., Stoyanov, D., Aziz Taha, A., van der Sommen, F., Wang, C.-W., Weber, M.-A., Zheng, G., Jannin, P., Kopp-Schneider, A.: Is the winner really the best? A critical analysis of common research practice in biomedical image analysis competitions. arXiv preprint arXiv:1806.02051 (2018) (in press; accepted by Nature Communications).

  • Reinke, A., Eisenmann, M., Onogur, S., Stankovic, M., Scholz, P., Full, P. M., Bogunovic, H., Landman, B. A., Maier, O., Menze, B., Sharp, G. C., Sirinukunwattana, K., Speidel, S., van der Sommen, F., Zheng, G., Müller, H., Kozubek, M., Arbel, T., Bradley, A. P., Jannin, P., Kopp-Schneider, A., Maier-Hein, L.: How to Exploit Weaknesses in Biomedical Challenge Design and Organization. International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 388-395. Springer, Cham. (2018)

Keynote talks

09/2017  MICCAI 2017 Tutorial: "Designing Benchmarks and Challenges for Measuring Algorithm Performance in Biomedical Image Analysis"

09/2017  MICCAI endoscopic vision challenge: "10 years of MICCAI grand-challenges - successes and failures"

