High-throughput Genomic Data Analysis

Machine Learning Pipelines for Personalized Medicine

Source: David Capper, David T. W. Jones, Martin Sill et al., "DNA methylation-based classification of central nervous system tumours", Nature, 2018. Extended Data Figure 4a: Development of the random forest classifier.

Development of Machine Learning Pipelines for Personalized Medicine in a highly multiclass setting of high-throughput genomic microarray data

This is a personal project of Dr. Maros, conducted in close cooperation with and supported by the Department of Biostatistics and the Pediatric Neurooncology Group, German Cancer Research Center (DKFZ), Heidelberg, and the Institute for Medical Biostatistics and Informatics (IMBI), Heidelberg University.

The aim of this project is to develop machine learning pipelines for high-throughput DNA methylation and genomic array data in a highly multiclass diagnostic setting, and to compare them by how well calibrated their probability estimates are, so that they can support personalized medical decision making.
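To make "well-calibrated probability estimates" concrete, the sketch below computes two commonly used scores for multiclass probability outputs, the Brier score and the log loss. It is a minimal illustration in plain Python, not the project's R code; the function names and the toy data are assumptions made for this example.

```python
import math


def brier_score(probs, labels):
    """Multiclass Brier score: mean squared distance between each predicted
    probability vector and the one-hot vector of the true class (lower is better)."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2 for k, p_k in enumerate(p))
    return total / len(probs)


def log_loss(probs, labels, eps=1e-15):
    """Multiclass log loss: mean negative log probability assigned to the
    true class (lower is better). Probabilities are clipped to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += -math.log(min(max(p[y], eps), 1.0))
    return total / len(probs)


# Toy data (hypothetical): three samples, three classes, true classes 0, 1, 2.
labels = [0, 1, 2]
confident = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.05, 0.05, 0.90]]
hedged = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]

print(round(brier_score(confident, labels), 3), round(brier_score(hedged, labels), 3))
print(round(log_loss(confident, labels), 3), round(log_loss(hedged, labels), 3))
```

Both toy predictors pick the correct class for every sample, so their accuracy is identical, yet their Brier scores (about 0.015 versus 0.54) and log losses differ. This is why accuracy alone cannot rank pipelines when the probability itself is reported to a clinician.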

Check out our methodological paper on ML workflows for precision cancer diagnostics:

Máté E. Maros, David Capper, David T. W. Jones, Volker Hovestadt, Andreas von Deimling, Stefan M. Pfister, Axel Benner, Manuela Zucknick & Martin Sill

This work compares several ML and calibration algorithms for classifying tumor DNA methylation profiles. The resulting protocol provides workflows for selecting, training and calibrating ML algorithms to generate well-calibrated multiclass probability estimates.
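As a rough illustration of what a calibration algorithm does, the following sketch fits a Platt-style sigmoid to raw classifier scores for a single binary (for example one-vs-rest) problem using plain gradient descent. It is a simplified stand-in written for this page, not the published R workflow: the function names, learning rate, iteration count and toy scores are all assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.1, n_iter=2000):
    """Fit p = sigmoid(a * score + b) by minimizing the mean log loss
    with batch gradient descent. Returns the slope a and intercept b."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of the log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score, a, b):
    """Map a raw classifier score to a calibrated probability."""
    return sigmoid(a * score + b)


# Toy held-out scores (hypothetical) and their true binary labels.
scores = [-2.0, -1.0, -0.5, 0.3, 0.5, 1.0, 2.0, -0.2]
labels = [0, 0, 1, 0, 1, 1, 1, 0]

a, b = fit_platt(scores, labels)
print([round(calibrate(s, a, b), 2) for s in (-2.0, 0.0, 2.0)])
```

The calibrator should be fitted on scores the primary classifier produced for samples it was not trained on; otherwise the sigmoid is fitted to overconfident scores.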


DNA methylation data-based precision cancer diagnostics is emerging as the state of the art for molecular tumor classification. Standards for choosing statistical methods with regard to well-calibrated probability estimates for these typically highly multiclass classification tasks are still lacking. To support this choice, we evaluated well-established machine learning (ML) classifiers including random forests (RFs), elastic net (ELNET), support vector machines (SVMs) and boosted trees in combination with post-processing algorithms and developed ML workflows that allow for unbiased class probability (CP) estimation. Calibrators included ridge-penalized multinomial logistic regression (MR) and Platt scaling by fitting logistic regression (LR) and Firth’s penalized LR. We compared these workflows on a recently published brain tumor 450k DNA methylation cohort of 2,801 samples with 91 diagnostic categories using a 5 × 5-fold nested cross-validation scheme and demonstrated their generalizability on external data from The Cancer Genome Atlas. ELNET was the top stand-alone classifier with the best calibration profiles. The best overall two-stage workflow was MR-calibrated SVM with linear kernels closely followed by ridge-calibrated tuned RF. For calibration, MR was the most effective regardless of the primary classifier. The protocols developed as a result of these comparisons provide valuable guidance on choosing ML workflows and their tuning to generate well-calibrated CP estimates for precision diagnostics using DNA methylation data. Computation times vary depending on the ML algorithm from <15 min to 5 d using multi-core desktop PCs. Detailed scripts in the open-source R language are freely available on GitHub, targeting users with intermediate experience in bioinformatics and statistics and using R with Bioconductor extensions.
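The 5 × 5-fold nested cross-validation scheme mentioned above can be sketched as index bookkeeping: each outer fold is held out for evaluation, and the remaining samples are split again into inner folds, which can serve to tune the classifier or to generate out-of-fold scores for fitting a calibrator. The code below is a minimal plain-Python sketch under simplifying assumptions: the splits are random rather than stratified by class, and the function names are hypothetical rather than taken from the published R scripts.

```python
import random


def k_fold_indices(indices, k, rng):
    """Shuffle the given indices and deal them into k near-equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_samples, k_outer=5, k_inner=5, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for each outer fold.
    inner_splits is a list of (inner_train, inner_test) pairs drawn only
    from outer_train, so the outer test fold is never seen inside the inner loop."""
    rng = random.Random(seed)
    outer_folds = k_fold_indices(range(n_samples), k_outer, rng)
    for i, outer_test in enumerate(outer_folds):
        outer_train = [idx for j, fold in enumerate(outer_folds) if j != i for idx in fold]
        inner_folds = k_fold_indices(outer_train, k_inner, rng)
        inner_splits = []
        for m, inner_test in enumerate(inner_folds):
            inner_train = [idx for j, fold in enumerate(inner_folds) if j != m for idx in fold]
            inner_splits.append((inner_train, inner_test))
        yield outer_train, outer_test, inner_splits


# 2,801 samples as in the cohort described above; only indices are generated here.
splits = list(nested_cv_splits(2801))
for outer_train, outer_test, inner_splits in splits:
    print(len(outer_train), len(outer_test), len(inner_splits))
```

With 91 diagnostic categories, some of them rare, a real workflow would normally stratify the folds by class so that every category appears in each training set.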

Cite this article

Maros, M.E., Capper, D., Jones, D.T.W. et al. Machine learning workflows to estimate class probabilities for precision cancer diagnostics on DNA methylation microarray data. Nat Protoc (2020) doi:10.1038/s41596-019-0251-6

Check out the companion repository for the article on GitHub