A comparative study of handling imbalanced data using generative adversarial networks for machine learning based software fault prediction

Ha, Thi Minh Phuong; Pham, Vu Thu Nguyet; Nguyen, Huu Nhat Minh; Le, Thi My Hanh; Nguyen, Thanh Binh

Please use this identifier to cite or link to this item: https://elib.vku.udn.vn/handle/123456789/5785

Title:	A comparative study of handling imbalanced data using generative adversarial networks for machine learning based software fault prediction
Authors:	Ha, Thi Minh Phuong; Pham, Vu Thu Nguyet; Nguyen, Huu Nhat Minh; Le, Thi My Hanh; Nguyen, Thanh Binh
Keywords:	Data imbalance; Data sampling; Fault prediction; GANs
Issue Date:	Jan-2025
Publisher:	Springer Nature
Abstract:	Software fault prediction (SFP) is the process of identifying potentially defect-prone modules before the testing stage of a software development process. By identifying faults early, software engineers can focus their effort on the components most likely to contain defects, thereby improving the overall quality and reliability of the software. However, data imbalance and feature redundancy are challenging issues in SFP that can degrade the performance of fault prediction models. Imbalanced software fault datasets, in which normal modules (the majority class) far outnumber faulty modules (the minority class), may lead to many false negatives. In this work, we empirically assess variants of Generative Adversarial Networks (GANs), an emerging synthetic data generation method, for resolving the data imbalance issue in common software fault prediction datasets. Five GAN variants (CopulaGAN, VanillaGAN, CTGAN, TGAN and WGANGP) are used to generate synthetic faulty samples that balance the majority and minority classes in the datasets. We then present an extensive evaluation of prediction models that combine Recursive Feature Elimination (RFE) for feature selection with GAN oversampling methods, as well as models that pair Autoencoders for feature extraction with GANs. In experiments on five fault datasets from the PROMISE repository, we evaluate six machine learning approaches using precision, recall, F1-score, Area Under Curve (AUC) and Matthews Correlation Coefficient (MCC) as performance metrics. The experimental results demonstrate that CTGAN combined with RFE and CTGAN paired with Autoencoders outperform the other baselines on all datasets, followed by WGANGP and VanillaGAN. According to the comparative analysis, GAN-based oversampling methods exhibit significant improvement in dealing with data imbalance for software fault prediction.
Description:	The International Journal of Research on Intelligent Systems for Real Life Complex Problems; Volume 55, article number 280
URI:	https://doi.org/10.1007/s10489-024-05930-z https://elib.vku.udn.vn/handle/123456789/5785
ISSN:	1573-7497
Appears in Collections:	NĂM 2025 (Year 2025)
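
The evaluation pipeline outlined in the abstract (balance the training data, select features with RFE, train a classifier, report precision, recall, F1-score, AUC and MCC) might look roughly like the sketch below. It is not the authors' implementation: the dataset is synthetic rather than taken from PROMISE, `oversample_minority` is a hypothetical stand-in that merely resamples faulty rows with replacement instead of generating them with a GAN, and the ordering (oversample first, then RFE), the choice of classifiers and the number of selected features are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (f1_score, matthews_corrcoef, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split


def oversample_minority(X, y, rng):
    """Hypothetical placeholder for a GAN generator (NOT a GAN).

    Resamples faulty rows (label 1) with replacement until both
    classes have the same number of rows.
    """
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    n_extra = len(majority) - len(minority)
    if n_extra <= 0:
        return X, y
    extra = rng.choice(minority, size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])


def evaluate(seed=0):
    rng = np.random.default_rng(seed)

    # Synthetic imbalanced data standing in for a fault dataset:
    # roughly 85% normal modules (0) and 15% faulty modules (1).
    X, y = make_classification(n_samples=600, n_features=20, n_informative=6,
                               weights=[0.85, 0.15], random_state=seed)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)

    # Balance only the training split; the test split keeps its
    # original class ratio.
    X_bal, y_bal = oversample_minority(X_tr, y_tr, rng)

    # Recursive Feature Elimination: keep 10 of the 20 features (assumed).
    selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
    selector.fit(X_bal, y_bal)

    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(selector.transform(X_bal), y_bal)

    X_te_sel = selector.transform(X_te)
    pred = clf.predict(X_te_sel)
    prob = clf.predict_proba(X_te_sel)[:, 1]  # probability of "faulty"

    return {
        "precision": precision_score(y_te, pred, zero_division=0),
        "recall": recall_score(y_te, pred, zero_division=0),
        "f1": f1_score(y_te, pred, zero_division=0),
        "auc": roc_auc_score(y_te, prob),
        "mcc": matthews_corrcoef(y_te, pred),
    }


if __name__ == "__main__":
    for name, value in evaluate().items():
        print(f"{name}: {value:.3f}")
```

To compare generators in the way the study describes, one would replace `oversample_minority` with a function that returns synthetic faulty rows from a trained GAN while keeping the rest of the pipeline unchanged.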
