Abstract
Machine learning (ML) techniques are increasingly used in biomedical research to improve diagnostic and prognostic accuracy when deployed alongside a clinician as a decision support system. However, datasets used in biomedical research often suffer from severe class imbalance due to small population sizes, which biases machine learning models toward majority-class samples. Current oversampling methods primarily focus on balancing datasets without adequately validating the biological relevance of the synthetic data, which puts the clinical applicability of downstream model predictions at risk. To address these shortcomings, we propose the Biological Evaluation Framework for Oversampling (BEFO), designed to ensure that synthetic gene expression samples accurately reflect the biological patterns present in the original datasets. This innovation not only mitigates bias but also enhances the trustworthiness of predictive models in clinical scenarios. Within this framework, we have developed a ranking method for synthetic samples and evaluate each sample’s inclusion based on its rank. The ranking method uses weighted gene co-expression network analysis (WGCNA) to compute gene co-expression clusters on the original dataset. Several random forests are then constructed to assess the alignment of each synthetic sample with each cluster. Only synthetic samples ranked as more important than real samples are included in a study. The experimental results demonstrate that our proposed ML oversampling framework can improve the biological feasibility of oversampled datasets by an average of 11%, leading to an average improvement of 9% in classification performance when compared against five state-of-the-art (SOTA) oversampling methods and ten classification algorithms across six real-world gene expression datasets, thereby establishing a new standard for synthetic data evaluation in biomedical ML applications.
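The general idea of scoring synthetic samples against co-expression clusters derived from real data, and keeping only those that pass a real-data threshold, can be illustrated with a minimal sketch. This is not the BEFO implementation: the clusters are assumed to be precomputed (standing in for WGCNA), a simple deviation score replaces the random forests, and the median-of-real-scores threshold is an assumption made purely for illustration. All function names are hypothetical.

```python
from statistics import mean, median

def cluster_profile(sample, cluster):
    """Mean expression of one sample over the gene indices in a cluster."""
    return mean(sample[g] for g in cluster)

def alignment_score(sample, real_samples, clusters):
    """Higher is better: the negated mean absolute deviation between the
    sample's per-cluster profile and the real data's per-cluster average.
    (Illustrative stand-in for the random-forest ranking in the abstract.)"""
    deviations = []
    for cluster in clusters:
        real_avg = mean(cluster_profile(r, cluster) for r in real_samples)
        deviations.append(abs(cluster_profile(sample, cluster) - real_avg))
    return -mean(deviations)

def filter_synthetic(real_samples, synthetic_samples, clusters):
    """Keep synthetic samples scoring above the median real-sample score.
    The median threshold is an assumption, not taken from the paper."""
    threshold = median(
        alignment_score(r, real_samples, clusters) for r in real_samples
    )
    return [
        s for s in synthetic_samples
        if alignment_score(s, real_samples, clusters) > threshold
    ]

# Toy data: 4 real samples, 4 genes, 2 assumed co-expression clusters.
real = [
    [1.0, 1.0, 5.0, 5.0],
    [1.0, 1.0, 5.0, 5.0],
    [0.5, 0.5, 4.5, 4.5],
    [1.5, 1.5, 5.5, 5.5],
]
clusters = [[0, 1], [2, 3]]
synthetic = [
    [1.0, 1.0, 5.0, 5.25],  # close to the real cluster profiles
    [3.0, 3.0, 5.0, 5.0],   # first cluster far from the real average
]
kept = filter_synthetic(real, synthetic, clusters)
```

In this toy run only the first synthetic sample is retained, because the second deviates strongly from the real data in the first cluster. A faithful version would replace both the clustering input and the scoring function with the WGCNA and random-forest steps named in the abstract.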
| Original language | English |
|---|---|
| Article number | 104932 |
| Number of pages | 12 |
| Journal | Journal of Biomedical Informatics |
| Volume | 171 |
| Early online date | 21 Oct 2025 |
| DOIs | |
| Publication status | Published - Nov 2025 |
Keywords
- BEFO
- Biological Evaluation Framework for Oversampling
- gene expression