Predictive modeling

Team: Natalia Becker, Axel Benner, Thomas Hielscher, Annette Kopp-Schneider, Christina Kunz, Maral Saadati, Diana Tichy (former members: Martin Sill, Alla Slynko, Manuela Zucknick)

Prediction models

By associating patient characteristics with treatment response, a statistical prediction model may find evidence for new medical hypotheses. The clinical endpoints that are most relevant in cancer research are binary or time-to-event, e.g., survival times. The goal of prediction analysis is to understand the dependency of clinical endpoints on covariates (prognostic factors, treatment, treatment/factor interactions). An important task is the selection of the appropriate prediction model.

The Cox proportional hazards regression model is the most popular approach to modeling covariate information for time-to-event data. The distinguishing feature of time-to-event data is that, at the end of the follow-up period, the event will probably not have occurred for all patients. For these patients the time-to-event is censored, meaning that the observation period ended before the event occurred. Competing risks need separate treatment: a competing event prevents the event of interest from being observed, so treating it as ordinary censoring would make the censoring informative. Any other event that changes the risk of the event under investigation may be considered a separate state in a competing risks model. For example, if the event of interest is death from cancer, the competing risk is death from any other cause. The effect of nonfatal events can also be studied in multi-state models. We are currently working on a project on variable selection in multi-state models.
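The difference between censoring and a competing risk can be illustrated with a small numerical sketch. The code below is not the group's software; it is a minimal NumPy implementation of two standard nonparametric estimators, the Kaplan-Meier estimator and the Aalen-Johansen cumulative incidence estimator. The function names and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np


def kaplan_meier(time, event):
    """Kaplan-Meier survival estimate at each distinct event time."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    event_times = np.unique(time[event])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(time >= t)            # still under observation at t
        d = np.sum((time == t) & event)        # events exactly at t
        s *= 1.0 - d / at_risk
        surv.append(s)
    return event_times, np.array(surv)


def cumulative_incidence(time, cause, cause_of_interest):
    """Aalen-Johansen cumulative incidence for one cause.

    cause: 0 = censored, 1, 2, ... = type of event.
    """
    time = np.asarray(time, dtype=float)
    cause = np.asarray(cause, dtype=int)
    event_times = np.unique(time[cause > 0])
    cif, total = [], 0.0
    overall_surv = 1.0                         # all-cause survival just before t
    for t in event_times:
        at_risk = np.sum(time >= t)
        d_interest = np.sum((time == t) & (cause == cause_of_interest))
        d_any = np.sum((time == t) & (cause > 0))
        total += overall_surv * d_interest / at_risk
        overall_surv *= 1.0 - d_any / at_risk
        cif.append(total)
    return event_times, np.array(cif)


# Hypothetical toy data: 0 = censored, 1 = death from cancer, 2 = death from another cause.
time = np.array([1, 2, 3, 4, 5, 6])
cause = np.array([1, 2, 1, 0, 2, 1])

_, cif_cancer = cumulative_incidence(time, cause, cause_of_interest=1)
_, km_naive = kaplan_meier(time, cause == 1)   # competing deaths treated as censored

print("cumulative incidence of cancer death:", cif_cancer[-1])
print("naive 1 - Kaplan-Meier:              ", 1.0 - km_naive[-1])
```

On this toy data the naive estimate, which treats deaths from other causes as censored, reaches 1, whereas the cumulative incidence of cancer death ends at 7/12. The gap is the probability mass that belongs to the competing cause.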

No treatment works the same for every patient. Few therapies benefit all patients, and some may even cause harm. Hence, biological markers ("biomarkers") are needed to guide patient-tailored therapy.

High dimensional data

One important topic of current research on prognostic factor studies is the development of methods for analyzing high-dimensional data, where the number of explanatory variables is much larger than the number of observations. We aim to develop prognostic biomarker signatures on the basis of multiple types of molecular data, such as mRNA and miRNA expression, methylation, or copy number alterations. The major problem in analyzing such data is the risk of overfitting. Methods to address this issue include penalized regression as well as boosting, random forests, and other machine-learning approaches.
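For time-to-event endpoints, penalized regression is typically applied to the Cox partial likelihood. As an illustration, one common formulation is written out below. The notation is an assumption introduced here: x_i is the covariate vector, t_i the observed time and \delta_i the event indicator of patient i, R(t_i) is the set of patients still at risk at t_i, and event times are assumed to be untied. The elastic net penalty with tuning parameters \lambda and \alpha is shown as one example; other penalties can take its place.

```latex
\begin{align*}
  h(t \mid x_i) &= h_0(t)\,\exp\!\left(x_i^\top \beta\right), \\
  \ell(\beta) &= \sum_{i:\,\delta_i = 1}
      \Bigl[\, x_i^\top \beta
      - \log \sum_{j \in R(t_i)} \exp\!\left(x_j^\top \beta\right) \Bigr], \\
  \hat{\beta} &= \arg\max_{\beta}\;
      \Bigl\{ \ell(\beta)
      - \lambda \Bigl[ \alpha \lVert \beta \rVert_1
      + \tfrac{1-\alpha}{2} \lVert \beta \rVert_2^2 \Bigr] \Bigr\}.
\end{align*}
```

Setting \alpha = 1 gives the lasso, which sets some coefficients exactly to zero, and \alpha = 0 gives ridge regression, which shrinks all coefficients without removing any.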

Control of model complexity is achieved by: (1) restriction methods, where the class of functions of the input vectors is limited; (2) selection methods, which include only those basis functions of the input vectors that contribute ‘significantly’ to the fit of the model; or (3) regularization methods that restrict the coefficients of the model.
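Option (3) can be made concrete with ridge regression, the simplest regularization method. The sketch below uses simulated data with far more variables than observations; the function name, the dimensions and the penalty values are hypothetical. It uses a continuous outcome rather than a survival endpoint to keep the example short.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge coefficients via the dual form, which is cheap when p >> n.

    Solves beta = X^T (X X^T + lam * I)^{-1} y, an n x n system
    instead of a p x p one.
    """
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)


# Simulated data (assumption): 30 observations, 500 variables, 5 true signals.
rng = np.random.default_rng(0)
n, p = 30, 500
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0
y = X @ beta_true + rng.standard_normal(n)

beta_small = ridge_fit(X, y, lam=1e-8)    # almost unpenalized
beta_large = ridge_fit(X, y, lam=100.0)   # strongly penalized

print("training residual, lam=1e-8:", np.linalg.norm(y - X @ beta_small))
print("training residual, lam=100: ", np.linalg.norm(y - X @ beta_large))
print("coefficient norm,  lam=1e-8:", np.linalg.norm(beta_small))
print("coefficient norm,  lam=100: ", np.linalg.norm(beta_large))
```

With a negligible penalty the model reproduces the training outcomes almost exactly, noise included, which is the overfitting problem in its plainest form. A larger penalty shrinks the coefficients and accepts some training error in exchange for a more stable model.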

One specific application is to combine data from multiple sources, available for the same set of patients, in order to improve tumor classification or the prediction of disease outcome. This could be done by adapting existing risk-prediction methods so that several molecular data sets are modeled jointly. Another important task is to represent intrinsic data structures; for example, gene expression may be reduced for genes in deleted genomic regions, and genes in the same pathway may tend to be co-regulated.

Validation and calibration

The prediction performance of a statistical risk prediction model describes how well the model will work for future patients. If no independent validation cohort is available, applying the model to new patients can be simulated by bootstrapping or repeated cross-validation. Prediction methods are then compared with respect to prediction error and interpretability of the results. Predictive accuracy is commonly compared using (time-dependent) Brier scores, (time-dependent) ROC curves and the corresponding area under the curve, or derived measures of explained variation.
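The idea of simulating new patients by repeated cross-validation can be sketched as follows. The example uses a binary endpoint and the plain Brier score; the time-dependent Brier score for censored data additionally needs censoring weights, which are omitted here. All function names, the simulated data and the simple ridge-penalized logistic model are assumptions made for illustration.

```python
import numpy as np


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return float(np.mean((p_pred - y_true) ** 2))


def fit_logistic(X, y, lam=1.0, n_iter=500, lr=0.1):
    """Ridge-penalized logistic regression fitted by plain gradient descent."""
    n, p = X.shape
    intercept, beta = 0.0, np.zeros(p)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(intercept + X @ beta)))
        intercept -= lr * np.mean(prob - y)               # intercept is not penalized
        beta -= lr * (X.T @ (prob - y) + lam * beta) / n
    return intercept, beta


def predict_proba(model, X):
    intercept, beta = model
    return 1.0 / (1.0 + np.exp(-(intercept + X @ beta)))


def repeated_cv_brier(X, y, k=5, repeats=3, seed=0):
    """Average Brier score over `repeats` random k-fold splits."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(repeats):
        order = rng.permutation(len(y))
        for test_idx in np.array_split(order, k):
            train_idx = np.setdiff1d(order, test_idx)
            model = fit_logistic(X[train_idx], y[train_idx])
            scores.append(brier_score(y[test_idx], predict_proba(model, X[test_idx])))
    return float(np.mean(scores))


# Simulated cohort (assumption): 200 patients, 3 covariates, one of them prognostic.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))).astype(float)

apparent = brier_score(y, predict_proba(fit_logistic(X, y), X))
cv_estimate = repeated_cv_brier(X, y)
print("apparent Brier score (training data):", apparent)
print("repeated 5-fold CV Brier score:      ", cv_estimate)
```

The apparent score is computed on the same patients used for fitting and therefore tends to be optimistic. The cross-validated score evaluates each patient with a model that was fitted without them, which is closer to the situation of a future patient.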


  • Becker N, Toedt G, Lichter P, Benner A. Elastic SCAD as a novel penalization method for SVM classification tasks in high-dimensional data. BMC Bioinformatics 2011; 12: 138.
  • Bender R, Benner A. Calculating ordinal regression models in SAS and S-Plus. Biom J 2000; 42: 677-699.
  • Benner A, Zucknick M, Hielscher T, Ittrich C, Mansmann U. High-dimensional Cox models: the choice of penalty as part of the model building process. Biom J 2010; 52: 50-69.
  • Binder H, Benner A, Bullinger L, Schumacher M. Tailoring sparse multivariable regression techniques for prognostic single-nucleotide polymorphism signatures. Stat Med 2013; 32: 1778-1791.
  • Hielscher T, Zucknick M, Werft W, Benner A. On the prognostic value of survival models with application to gene expression signatures. Stat Med 2010; 29: 818-829.
  • Werft W, Benner A, Kopp-Schneider A. On the identification of predictive biomarkers: Detecting treatment-by-gene interaction in high-dimensional data. Comp Stat and Data Analysis 2012; 56: 1275-1286.
