HISTAI: a valuable dataset with a valuable lesson

Hewitt KJ, Reitsam NG, Foersch S, Erber R, Zeng Q, Kather JN (2026)

Publication Type: Journal article

Publication year: 2026

Journal

Journal of Pathology: Clinical Research (Wiley)

Volume: 12

Article Number: e70089

Journal Issue: 3

DOI: 10.1002/2056-4538.70089

Abstract

The application of artificial intelligence in computational pathology depends on both robust algorithms and high-quality, clinically reliable data. Progress in this field has been limited by the scarcity of large, diverse, and well-validated whole slide image (WSI) datasets. To address this gap, HISTAI introduced an open-source resource comprising over 112,000 WSIs across multiple organ systems with associated clinical metadata. Here, we present a pathologist-led evaluation of label accuracy, metadata completeness, and dataset composition across 328 selected cases from this resource. Although HISTAI reports 47,279 cases, we identified only 44,564 unique cases after accounting for missing entries and duplicate records. Basic demographic information, including age and sex, was available for only 55% of cases. Dataset composition was uneven, with dermatopathology accounting for 47.1% of cases and gastrointestinal pathology for 24.0%; however, primary specialty was explicitly reported for only 39.6% of cases, obscuring this imbalance within the provided metadata. Notably, clinical ground truth is recorded in the Conclusion column. Concordance between the dataset's Conclusion and Diagnosis fields was observed in only 20.7% of cases, while 27.1% contained conflicting diagnoses. In a focused review of 198 cases, 30.3% were found to contain unclear or ambiguous diagnostic conclusions, including eight cases in which the diagnosis was incorrect. Assessment of molecular annotation revealed that only 18.9% of analyzed lung and colorectal cancer cases included molecular information. Furthermore, among adult-type diffuse gliomas, none of the 55 cases met current World Health Organisation Classification of Tumors of the Central Nervous System 5th Edition (WHO CNS5) diagnostic criteria, with IDH mutation status reported in only 15 cases. 
Together, these findings highlight substantial ambiguities in ground-truth labeling, incomplete molecular annotation, and limited documentation of dataset provenance and ethical oversight. While HISTAI represents a valuable open-source resource, its effective and responsible use requires careful clinical validation and close collaboration between computational researchers and pathologists.

Involved external institutions

Universitätsklinikum Carl Gustav Carus Dresden, Germany (DE)

Universitätsmedizin der Johannes Gutenberg-Universität Mainz, Germany (DE)

Universität Regensburg, Germany (DE)

How to cite

APA:

Hewitt, K.J., Reitsam, N.G., Foersch, S., Erber, R., Zeng, Q., & Kather, J.N. (2026). HISTAI: a valuable dataset with a valuable lesson. Journal of Pathology: Clinical Research, 12(3). https://doi.org/10.1002/2056-4538.70089

MLA:

Hewitt, Katherine J., et al. "HISTAI: a valuable dataset with a valuable lesson." Journal of Pathology: Clinical Research 12.3 (2026).
