The Impact of Data Quality on ML Diagnostic Models: Ensuring Reliability in Medical AI

Bartlomiej Cieszynski, Joao Gregorio
Abstract:
The application of artificial intelligence and machine learning algorithms in healthcare has grown rapidly in recent years, offering benefits such as improved diagnostic accuracy and efficiency, and helping to address challenges such as interpretative bias and the increasing volume of patient data. However, ensuring the reliability of such models requires robust evaluation of their sensitivity to data quality, that is, the extent to which data meets required standards. This study explores how variations in data accuracy, precision, and completeness affect an algorithm’s ability to correctly classify electrocardiogram results. The models analysed, namely K-Nearest Neighbours, Random Forest, Artificial Neural Networks, and Convolutional Neural Networks, were selected for their prevalence in electrocardiogram classification tasks. Using PhysioNet’s MIT-BIH Arrhythmia Dataset, the study classifies the five beat types defined by AAMI EC57. To simulate varying data quality, the dataset was systematically degraded by adding noise, rounding numerical values, and removing data features. Re-evaluating model performance, quantified by model accuracy, precision, and recall, highlighted the effects of data quality on diagnostic outcomes and provided insight into the robustness of the models under suboptimal conditions. By examining these factors, the study aims to inform the development of more reliable machine learning diagnostic systems and to raise awareness of the importance of data integrity in medical applications. The results show that model performance is most sensitive to data accuracy, followed by completeness and then precision, with degraded data accuracy causing the greatest reduction in predictive performance. They also show that model sensitivity is correlated with class population: under-represented classes showed greater deviation when the data were degraded.
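The three degradation steps named in the abstract (adding noise, rounding values, removing features) can be illustrated with a minimal Python sketch. The abstract does not state the noise distribution, the rounding levels, or how features were removed, so the Gaussian noise, the decimal-place rounding, and the zeroing of randomly chosen columns below are assumptions, and all function names are hypothetical. The input is a synthetic array, not the MIT-BIH data.

```python
import numpy as np

def degrade_accuracy(X, noise_std, rng):
    """Lower data accuracy by adding zero-mean Gaussian noise (assumed noise model)."""
    return X + rng.normal(0.0, noise_std, size=X.shape)

def degrade_precision(X, decimals):
    """Lower data precision by rounding every value to a fixed number of decimals."""
    return np.round(X, decimals)

def degrade_completeness(X, drop_fraction, rng):
    """Lower completeness by zeroing randomly chosen feature columns (assumed strategy)."""
    n_features = X.shape[1]
    n_drop = int(round(drop_fraction * n_features))
    dropped = np.sort(rng.choice(n_features, size=n_drop, replace=False))
    X_out = X.copy()
    X_out[:, dropped] = 0.0
    return X_out, dropped

rng = np.random.default_rng(0)
# Synthetic stand-in for beat segments: 4 samples with 10 features each.
X = rng.uniform(0.0, 1.0, size=(4, 10))

X_noisy = degrade_accuracy(X, noise_std=0.05, rng=rng)
X_rounded = degrade_precision(X, decimals=1)
X_sparse, dropped = degrade_completeness(X, drop_fraction=0.3, rng=rng)
```

Each degraded copy would then be passed to an already trained classifier, and the resulting accuracy, precision, and recall compared per class against the scores obtained on the undegraded data.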
Download:
IMEKO-TC8-11-24-2025-010.pdf
DOI:
10.21014/tc8-2025.010
Event details
IMEKO TC:
TC8
Event name:
IMEKO TC8, TC11 and TC24 Conference
Title:

Joint conference of the TCs ‘Traceability in Metrology’ (IMEKO TC8), ‘Measurement in Testing, Inspection and Certification’ (IMEKO TC11), and ‘Chemical Measurements’ (IMEKO TC24).

Place:
Torino, ITALY
Time:
14 September 2025 - 17 September 2025