Data splitting to avoid information leakage with DataSAIL

Joeres R, Blumenthal DB, Kalinina OV (2025)


Publication Language: English

Publication Status: Published

Publication Type: Journal article, Original article

Publication year: 2025

Journal: Nature Communications

Publisher: Nature Research

Volume: 16

Article Number: 3337

Journal Issue: 1

DOI: 10.1038/s41467-025-58606-8

Open Access Link: https://doi.org/10.1038/s41467-025-58606-8

Abstract

Information leakage is an increasingly important topic in machine learning research for biomedical applications. When information leakage happens during a model’s training, the model risks memorizing the training data instead of learning generalizable properties. This can lead to inflated performance metrics that do not reflect the actual performance at inference time. We present DataSAIL, a versatile Python package for leakage-reduced data splitting that enables realistic evaluation of machine learning models for biological data that are intended to be applied in out-of-distribution scenarios. DataSAIL formulates the search for leakage-reduced data splits as a combinatorial optimization problem. We prove that this problem is NP-hard and provide a scalable heuristic based on clustering and integer linear programming. Finally, we empirically demonstrate DataSAIL’s impact on evaluating biomedical machine learning models.
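
As a rough illustration of the idea in the abstract, the sketch below assigns whole clusters of similar samples to splits, so that no cluster is shared between training and test data. It is not DataSAIL's API or algorithm: the function name `split_by_cluster`, the toy cluster labels, and the split fractions are hypothetical, the clusters are assumed to be precomputed, and a simple greedy rule stands in for the integer linear programming step.

```python
from collections import defaultdict


def split_by_cluster(cluster_of, fractions):
    """Assign whole clusters to splits so that no cluster spans two splits.

    cluster_of: dict mapping sample id -> cluster id (assumed precomputed).
    fractions:  dict mapping split name -> target fraction of all samples.
    Hypothetical helper for illustration; a greedy stand-in for an ILP.
    """
    # Group samples by their cluster.
    members = defaultdict(list)
    for sample, cluster in cluster_of.items():
        members[cluster].append(sample)

    total = len(cluster_of)
    targets = {name: frac * total for name, frac in fractions.items()}
    splits = {name: [] for name in fractions}

    # Largest clusters first; each goes to the split that is currently
    # furthest below its target size.
    for cluster in sorted(members, key=lambda c: len(members[c]), reverse=True):
        name = max(splits, key=lambda n: targets[n] - len(splits[n]))
        splits[name].extend(members[cluster])
    return splits


# Toy data: ten samples in four clusters of similar items (made up).
cluster_of = {
    "a1": 0, "a2": 0, "a3": 0, "a4": 0,
    "b1": 1, "b2": 1, "b3": 1,
    "c1": 2, "c2": 2,
    "d1": 3,
}
splits = split_by_cluster(cluster_of, {"train": 0.7, "test": 0.3})
print({name: sorted(samples) for name, samples in splits.items()})
```

Because similar samples stay together, a model evaluated on the test split sees only clusters it never trained on, which is the out-of-distribution setting the abstract describes. A greedy rule like this can miss the target sizes when clusters are large or uneven.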

How to cite

APA:

Joeres, R., Blumenthal, D. B., & Kalinina, O. V. (2025). Data splitting to avoid information leakage with DataSAIL. Nature Communications, 16(1), Article 3337. https://doi.org/10.1038/s41467-025-58606-8

MLA:

Joeres, Roman, David B. Blumenthal, and Olga V. Kalinina. "Data splitting to avoid information leakage with DataSAIL." Nature Communications 16.1 (2025): 3337.