Joeres R, Blumenthal DB, Kalinina OV (2025)
Publication Language: English
Publication Status: Published
Publication Type: Journal article, Original article
Publication year: 2025
Publisher: Nature Research
Book Volume: 16
Article Number: 3337
Journal Issue: 1
DOI: 10.1038/s41467-025-58606-8
Open Access Link: https://doi.org/10.1038/s41467-025-58606-8
Information leakage is an increasingly important topic in machine learning research for biomedical applications. When information leaks during training, a model risks memorizing the training data instead of learning generalizable properties. This can inflate performance metrics so that they no longer reflect the actual performance at inference time. We present DataSAIL, a versatile Python package that facilitates leakage-reduced data splitting and thereby enables realistic evaluation of machine learning models for biological data that are intended to be applied in out-of-distribution scenarios. DataSAIL formulates the search for leakage-reduced data splits as a combinatorial optimization problem. We prove that this problem is NP-hard and provide a scalable heuristic based on clustering and integer linear programming. Finally, we empirically demonstrate DataSAIL’s impact on evaluating biomedical machine learning models.
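The idea behind cluster-based, leakage-reduced splitting can be illustrated with a minimal sketch. It is not DataSAIL's API or its integer linear program: the function name `split_by_cluster`, its parameters, and the greedy assignment rule are all assumptions made for illustration. The sketch assumes each item already carries a cluster label (similar items share a cluster) and assigns whole clusters to splits, so that similar items never end up on both sides of a split.

```python
from collections import Counter, defaultdict


def split_by_cluster(cluster_of, fractions):
    """Hypothetical helper: assign whole clusters to splits.

    cluster_of: dict mapping item id -> cluster label (assumed precomputed).
    fractions:  dict mapping split name -> desired share of items.
    Returns a dict mapping item id -> split name.
    """
    sizes = Counter(cluster_of.values())
    total = sum(sizes.values())
    targets = {name: frac * total for name, frac in fractions.items()}
    filled = defaultdict(int)
    split_of_cluster = {}

    # Greedy stand-in for an exact optimizer: place the largest clusters
    # first, each into the split that is currently furthest below its target.
    for cluster, size in sorted(sizes.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        best = max(targets, key=lambda s: targets[s] - filled[s])
        split_of_cluster[cluster] = best
        filled[best] += size

    return {item: split_of_cluster[c] for item, c in cluster_of.items()}


if __name__ == "__main__":
    # Toy data: ten items in four clusters of sizes 4, 3, 2 and 1.
    toy = {"a1": 0, "a2": 0, "a3": 0, "a4": 0,
           "b1": 1, "b2": 1, "b3": 1,
           "c1": 2, "c2": 2, "d1": 3}
    print(split_by_cluster(toy, {"train": 0.7, "test": 0.3}))
```

Because clusters are never divided, the achieved split sizes only approximate the requested fractions. An exact formulation would instead optimize the assignment, which is where integer linear programming comes in.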
APA:
Joeres, R., Blumenthal, D.B., & Kalinina, O.V. (2025). Data splitting to avoid information leakage with DataSAIL. Nature Communications, 16(1). https://doi.org/10.1038/s41467-025-58606-8
MLA:
Joeres, Roman, David B. Blumenthal, and Olga V. Kalinina. "Data splitting to avoid information leakage with DataSAIL." Nature Communications 16.1 (2025).