





















Abstract: Class imbalance remains a practical obstacle in the development of clinical prediction models for conditions such as diabetes mellitus, where the number of confirmed cases is often much smaller than the number of controls. The Synthetic Minority Over-sampling Technique (SMOTE) and its variants are widely used to address this imbalance, but they generate synthetic observations through local interpolation in feature space and do not explicitly model the joint dependence structure of the minority class. To address this limitation, our study introduces a copula-based data augmentation approach that estimates the minority-class dependence structure when generating synthetic samples and integrates with standard machine learning techniques. Specifically, we employ truncated vine copulas to represent multivariate dependence through a sequence of bivariate building blocks. We evaluate the proposed approach on three public diabetes datasets, namely the Pima Indians Diabetes dataset, the Iraqi Diabetes dataset, and the CDC BRFSS 2015 Diabetes Health Indicators dataset, which together cover a range of sample sizes, dimensionalities, and imbalance regimes. For each dataset, five resampling strategies are compared across five classifiers using a 5×2 cross-validation protocol with Dietterich's paired t-test. Our findings suggest that CopulaSMOTE can improve minority-class recovery in larger tabular diabetes datasets, particularly the CDC BRFSS dataset, but its advantages depend on the classifier and evaluation metric.
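As an illustration of the evaluation protocol named in the abstract, the statistic of Dietterich's 5×2 cross-validation paired t-test can be sketched as follows. This is a minimal sketch and not the authors' code: the function name `five_by_two_cv_t` and the input layout (five pairs of per-fold score differences between two methods) are assumptions made for the example. Under the null hypothesis, the statistic is compared against a t distribution with 5 degrees of freedom.

```python
import math


def five_by_two_cv_t(diffs):
    """Dietterich's 5x2cv paired t statistic.

    diffs: five (d1, d2) pairs, one per replication of 2-fold CV, where
    d1 and d2 are the score differences (method A minus method B) on the
    two folds. The name and input layout are hypothetical, chosen only
    for this illustration.
    """
    if len(diffs) != 5:
        raise ValueError("expected exactly 5 replications of 2-fold CV")

    variances = []
    for d1, d2 in diffs:
        mean = (d1 + d2) / 2.0
        # Variance estimate for one replication of 2-fold CV.
        variances.append((d1 - mean) ** 2 + (d2 - mean) ** 2)

    pooled = sum(variances) / 5.0
    if pooled == 0.0:
        raise ValueError("zero variance: the statistic is undefined")

    # The numerator is the difference from the first fold of the first
    # replication, as in Dietterich's original formulation.
    return diffs[0][0] / math.sqrt(pooled)
```

For a two-sided test at the 5% level, the absolute value of the statistic would be compared with the critical value of the t distribution with 5 degrees of freedom (about 2.571).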
Subjects: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
Cite as: arXiv:2506.17326 [cs.LG] (or arXiv:2506.17326v3 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2506.17326 (arXiv-issued DOI via DataCite)
From: Agnideep Aich
[v1] Wed, 18 Jun 2025 22:21:40 UTC (197 KB)
[v2] Thu, 25 Sep 2025 00:52:54 UTC (210 KB)
[v3] Mon, 25 May 2026 02:18:55 UTC (425 KB)