Exploring Characteristics of Omics Data

Team: Axel Benner, Natalia Becker, Manuela Hummel, Maral Saadati, Manuel Wiesenfarth (former members: Martin Sill, Manuela Zucknick)

In our research we encounter omics data from various sources, such as mRNA and miRNA expression, DNA methylation, copy number alterations and many more. Our main focus areas include preprocessing, integrative analysis, epigenomics and proteomics.

An important topic of current research on diagnostic and prognostic factor studies is the development of methods that can handle high-dimensional omics data, where the number of explanatory variables is much larger than the number of observations. In this situation there is a high risk that statistical models overfit the data. A better understanding of the data characteristics may improve model building.

Data from diverse genomic readouts are measured on different scales, and their distributions have distinct properties. Consequently, statistical methods originally developed for one data source must be adapted before they can be applied to other data types. For example, array-based gene expression data are frequently assumed to be log-normally distributed, methylation data generated with Illumina arrays require methods for proportions taking values in [0,1], and mutations can often be described as ternary (i.e., silent, activating or inactivating). Before using these data in statistical models for diagnostics and prediction, we must understand their specific characteristics, in particular their level of measurement and distribution.
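The following Python sketch illustrates how the three data types mentioned above might be brought onto scales that standard models can handle. It is only an illustration under stated assumptions: the function names, the offset, and the clipping constant are hypothetical, and the logit transform of methylation proportions (often called an M-value) is one common choice, not necessarily the one used in our projects.

```python
import math

def log2_expression(intensity, offset=1.0):
    """Log2-transform a non-negative array intensity.

    The offset (an assumption of this sketch) avoids taking the log of zero.
    """
    return math.log2(intensity + offset)

def beta_to_m_value(beta, eps=1e-6):
    """Map a methylation proportion in [0, 1] to the real line (base-2 logit).

    Values are clipped to [eps, 1 - eps] so that 0 and 1 stay finite.
    """
    clipped = min(max(beta, eps), 1.0 - eps)
    return math.log2(clipped / (1.0 - clipped))

# Ternary mutation status, with "silent" used as the reference level.
MUTATION_LEVELS = ("silent", "activating", "inactivating")

def encode_mutation(status):
    """Dummy-code a ternary mutation status as (activating, inactivating)."""
    if status not in MUTATION_LEVELS:
        raise ValueError(f"unknown mutation status: {status!r}")
    return (int(status == "activating"), int(status == "inactivating"))
```

For instance, a methylation proportion of 0.5 maps to 0 on the logit scale, while proportions near 0 or 1 map to large negative or positive values.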

An omics-based analysis consists of both low-level analysis specific to the data-generating assay and high-level analysis fitting the fully specified statistical model.



There are different platform-specific and generic preprocessing strategies.

We often make use of established procedures combining all relevant preprocessing steps. For specific data types, such as antibody microarrays, we examine the performance of normalization methods.  

Integrative analysis

We are currently working on the establishment of statistical methods that are capable of integrating data from multiple omics data sets. In a joint project between the DKFZ and the Bayer HealthCare Alliance we aim to develop models for the early detection of potential biomarkers for risk prediction using data from in vitro drug-response experiments. In other projects, we are using patient data from clinical trials to identify prognostic and predictive markers as well as potential therapeutic targets in chronic lymphocytic leukemia (funded by the Else Kröner-Fresenius foundation) or factors contributing to chemotherapy resistance in leukemia (Virtual Helmholtz Institute).  


Infinium HumanMethylation450 BeadChip

DNA methylation is of particular interest because it is a modification that affects gene transcription without changing the DNA sequence. Due to the critical role of DNA methylation in disease biology, one goal of the analysis of methylation data is to identify “differentially methylated CpG sites”, namely CpG dinucleotides that show a significant change in methylation between groups of samples (e.g., healthy vs. tumor cells or disease subtypes). Challenges arise from the fact that methylation levels are proportions between 0 and 1, often from an asymmetric, bimodal distribution with peaks close to 0 and 1. Several parametric and nonparametric methods, such as beta regression and rank-based estimation, can be applied in the analysis of such data.
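As a rough illustration of a rank-based comparison for a single CpG site, the Python sketch below computes a Mann-Whitney U statistic with a two-sided p-value from the normal approximation (with tie correction, without continuity correction). This is a generic textbook procedure chosen for illustration, not the specific rank-based estimation method used in our analyses; the function names and example values are hypothetical.

```python
import math

def midranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

def rank_sum_test(group_a, group_b):
    """Return (U statistic of group_a, two-sided approximate p-value)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled = list(group_a) + list(group_b)
    ranks = midranks(pooled)
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2.0

    # Variance of U under the null hypothesis, corrected for ties.
    n = n_a + n_b
    tie_counts = {}
    for value in pooled:
        tie_counts[value] = tie_counts.get(value, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values())
    var_u = n_a * n_b / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return u_a, 1.0  # all values identical: no evidence of a difference

    z = (u_a - n_a * n_b / 2.0) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return u_a, p_value

# Hypothetical methylation proportions at one CpG site.
healthy = [0.05, 0.08, 0.10, 0.07, 0.06, 0.09]
tumor = [0.85, 0.90, 0.80, 0.88, 0.92, 0.83]
u_stat, p_value = rank_sum_test(healthy, tumor)
```

Because the test uses only ranks, it makes no assumption about the bimodal shape of the methylation distribution. In a genome-wide analysis the resulting p-values would still need to be adjusted for multiple testing.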

Methyl-CpG immunoprecipitation (MCIp)

MCIp can be used to enrich DNA sequences that are hypermethylated in tumor compared to control tissue. In our projects, enriched methylated DNA of tumor and control tissue samples was analyzed either by dual-color microarrays (MCIp-chip) or by next-generation sequencing (MCIp-seq). After preprocessing, MCIp-chip data are typically analyzed as log2(tumor/control) ratios, which can be described as a mixture of normal and gamma distributions. More specifically, the mixture components correspond to probes that are enriched in the tumor, enriched in the control tissue, or not enriched. MCIp-seq data are overdispersed count data and can be modeled by a negative binomial distribution. In general, while the MCIp method can be used to identify regions of hypermethylation (enrichment), it does not provide quantitative measurements of the degree of CpG methylation.
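To make the notion of overdispersion concrete, the Python sketch below estimates the mean and dispersion of a set of read counts by the method of moments, assuming the common negative binomial parametrization variance = mean + dispersion * mean^2. A dispersion of zero corresponds to Poisson-like counts. The function name and the example counts are hypothetical, and in practice dedicated count-data packages use more refined estimators.

```python
from statistics import mean, variance

def nb_moment_estimates(counts):
    """Return (mean, dispersion) assuming var = mean + dispersion * mean**2.

    The dispersion is truncated at zero when the sample variance does not
    exceed the mean, i.e. when the counts show no overdispersion.
    """
    mu = mean(counts)
    if mu <= 0:
        raise ValueError("mean count must be positive")
    sample_var = variance(counts)  # unbiased sample variance
    dispersion = max((sample_var - mu) / mu ** 2, 0.0)
    return mu, dispersion

# Hypothetical read counts for one genomic region across six samples.
region_counts = [2, 10, 0, 25, 7, 16]
mu_hat, dispersion_hat = nb_moment_estimates(region_counts)
```

In this example the sample variance (86.8) is far larger than the mean (10), which a Poisson model could not accommodate.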


Antibody microarrays

Antibody microarray technology allows the simultaneous measurement of the expression of hundreds of proteins in a competitive dual-color approach similar to dual-color gene expression microarrays. Established normalization methods for gene expression microarrays, e.g., loess regression, can in principle be applied to protein microarrays. However, their typical assumptions, such as most features not being differentially expressed, may be violated because the proteins to be measured are not selected at random. Due to high costs and the limited availability of high-quality antibodies, current arrays usually contain a high proportion of regulated targets. We propose to select invariant features for normalization from the features already represented on available arrays and have developed a dedicated selection algorithm.
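The general idea of normalizing against invariant features can be sketched in Python as follows. This is a deliberately simplified stand-in and not our selection algorithm: it assumes that features with the smallest variance of log2 ratios across arrays are invariant, and it then shifts each array so that the median log2 ratio of those features is zero. All names and data values are hypothetical.

```python
from statistics import median, pvariance

def select_invariant(log_ratios, n_keep):
    """Pick the n_keep features with the smallest variance across arrays.

    log_ratios maps a feature name to its log2 ratios, one per array.
    Low variance is used here as a crude proxy for invariance.
    """
    ranked = sorted(log_ratios, key=lambda f: pvariance(log_ratios[f]))
    return ranked[:n_keep]

def normalize_by_invariant(log_ratios, invariant):
    """Center each array on the median log2 ratio of the invariant features."""
    n_arrays = len(next(iter(log_ratios.values())))
    shifts = [
        median(log_ratios[f][a] for f in invariant) for a in range(n_arrays)
    ]
    return {
        feature: [value - shifts[a] for a, value in enumerate(values)]
        for feature, values in log_ratios.items()
    }

# Hypothetical log2 ratios for four proteins on three arrays.
raw = {
    "P1": [0.5, -0.3, 0.2],
    "P2": [0.6, -0.2, 0.1],
    "P3": [0.4, -0.4, 0.3],
    "P4": [2.5, -2.0, 0.2],  # strongly regulated target
}
invariant = select_invariant(raw, n_keep=3)
normalized = normalize_by_invariant(raw, invariant)
```

Because the shift is estimated from the selected features only, a strongly regulated target such as P4 does not influence the normalization, which is the motivation for restricting normalization to invariant features.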


  • Baer C, Claus R, Frenzel LP, Zucknick M, Park YJ, Gu L, Weichenhan D, Fischer M, Pallasch CP, Herpel E, Rehli M, Byrd JC, Wendtner CM, Plass C. Extensive promoter DNA hypermethylation and hypomethylation is associated with aberrant microRNA expression in chronic lymphocytic leukemia. Cancer Res. 2012 Aug 1;72(15):3775-85.
  • Dutruel C, Bergmann F, Rooman I, Zucknick M, Weichenhan D, Geiselhart L, Kaffenberger T, Rachakonda PS, Bauer A, Giese N, Hong C, Xie H, Costello JF, Hoheisel J, Kumar R, Rehli M, Schirmacher P, Werner J, Plass C, Popanda O, Schmezer P. Early epigenetic downregulation of WNK2 kinase during pancreatic ductal adenocarcinoma development. Oncogene. 2013 Aug 5. doi: 10.1038/onc.2013.312. [Epub ahead of print]
  • Pfister S, Schlaeger C, Mendrzyk F, Wittmann A, Benner A, Kulozik A, Scheurlen W, Radlwimmer B, Lichter P. Array-based profiling of reference-independent methylation status (aPRIMES) identifies frequent promoter methylation and consecutive downregulation of ZIC2 in pediatric medulloblastoma. Nucleic Acids Res. 2007;35(7):e51.
  • Schröder C, Jacob A, Tonack S, Radon TP, Sill M, Zucknick M, Rüffer S, Costello E, Neoptolemos JP, Crnogorac-Jurcevic T, Bauer A, Fellenberg K, Hoheisel JD. Dual-color proteomic profiling of complex samples with a microarray of 810 cancer-related antibodies. Mol Cell Proteomics. 2010 Jun;9(6):1271-80.
  • Sill M, Schröder C, Hoheisel JD, Benner A, Zucknick M. Assessment and optimisation of normalisation methods for dual-colour antibody microarrays. BMC Bioinformatics. 2010 Nov 12;11:556. doi: 10.1186/1471-2105-11-556.
  • Solinas-Toldo S, Lampel S, Stilgenbauer S, Nickolenko J, Benner A, Döhner H, Cremer T, Lichter P. Matrix-based comparative genomic hybridization: biochips to screen for genomic imbalances. Genes Chromosomes Cancer. 1997 Dec;20(4):399-407.
