Abstract
Healthcare data quality is a critical factor in clinical decision-making, diagnostic accuracy, and the overall efficacy of healthcare systems. This study addresses key challenges such as missing values and anomalies in healthcare datasets, which can result in misdiagnoses and inefficient resource use. The objective is to develop and evaluate a machine learning-based strategy to improve healthcare data quality, with a focus on three core dimensions: accuracy, completeness, and reusability. A publicly available diabetes dataset comprising 768 records and 9 variables was used. The methodology involved a comprehensive data preprocessing workflow, including data acquisition, cleaning, and exploratory analysis using established Python tools. Missing values were addressed using K-nearest neighbors imputation, while anomaly detection was performed using ensemble techniques. Principal Component Analysis (PCA) and correlation analysis were applied to identify key predictors of diabetes, such as Glucose, BMI, and Age. The results showed significant improvements in data completeness (from 90.57% to nearly 100%), better accuracy by mitigating anomalies, and enhanced reusability for downstream machine learning tasks. In predictive modeling, Random Forest outperformed LightGBM, achieving an accuracy of 75.3% and an AUC of 0.83. The process was fully documented, and reproducibility tools were integrated to ensure the methodology could be replicated and extended. These findings demonstrate the potential of machine learning to support robust data quality improvement frameworks in healthcare, ultimately contributing to better clinical outcomes and predictive capabilities.
| Original language | English |
|---|---|
| Article number | 1621514 |
| Number of pages | 14 |
| Journal | Frontiers in Artificial Intelligence |
| Volume | 8 |
| DOIs | |
| Publication status | Published - 2025 |
| Externally published | Yes |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
-
SDG 3 Good Health and Well-being
Keywords*
- accuracy
- completeness
- data quality
- healthcare data analysis
- machine learning
- reusability
Field of Science*
- 1.2 Computer and information sciences
- 3.3 Health sciences
Publication Type*
- 1.1. Scientific article indexed in Web of Science and/or Scopus database
Fingerprint
Dive into the research topics of 'Machine learning-based strategies for improving healthcare data quality: an evaluation of accuracy, completeness, and reusability'. Together they form a unique fingerprint.Cite this
- APA
- Author
- BIBTEX
- Harvard
- Standard
- RIS
- Vancouver