A combination of feature selection and data sampling techniques for software fault prediction

Ha, Thi Minh Phuong; Nguyen, Thanh Long; Nguyen, Thanh Binh

Please use this identifier to cite or link to this item: https://elib.vku.udn.vn/handle/123456789/3956

Full metadata record

DC Field	Value	Language
dc.contributor.author	Ha, Thi Minh Phuong	-
dc.contributor.author	Nguyen, Thanh Long	-
dc.contributor.author	Nguyen, Thanh Binh	-
dc.date.accessioned	2024-07-29T03:04:35Z	-
dc.date.available	2024-07-29T03:04:35Z	-
dc.date.issued	2023-09	-
dc.identifier.isbn	978-604-357-201-8	-
dc.identifier.uri	http://vap.ac.vn/Portals/0/TuyenTap/2024/2/21/64e13532907845ed9f5a2547dfec276f/33B_FAIR2023_paper_6739.pdf	-
dc.identifier.uri	https://elib.vku.udn.vn/handle/123456789/3956	-
dc.description	Proceedings of the 16th National Scientific Conference on Fundamental and Applied IT Research (FAIR-2023); pp. 258-265.	vi_VN
dc.description.abstract	Software fault prediction (SFP) is the process of building models to predict faults in the early stages of software development. Predicting fault-prone software modules can help developers allocate testing effort more effectively and optimize maintenance cost. However, the performance of SFP models depends on the quality of software fault datasets. Irrelevant and redundant features in these datasets can reduce the speed and accuracy of the trained models. In addition, class imbalance, where faulty modules are significantly outnumbered by non-faulty modules, is a challenge in SFP. This study applies three generative adversarial network (GAN) models (VanillaGAN, CTGAN and WGANGP) together with four feature selection ranking methods (Chi-Squared, Information Gain, Fisher and Relief) to four software fault datasets. A comparative analysis is performed using four different classifiers to predict software faults. Precision, recall, F1-score and area under the receiver operating characteristic curve (AUC) are used as performance evaluation metrics. The experimental results show that combinations of CTGAN and VanillaGAN with feature selection approaches outperformed SFP models built without data sampling and feature selection. The combination of CTGAN and Relief performed best among all combinations, with the highest average precision, recall, F1-score and AUC values of 0.857, 0.873, 0.856 and 0.767, respectively, on the Extra Tree classifier.	vi_VN
dc.language.iso	en	vi_VN
dc.publisher	Publishing House for Science and Technology	vi_VN
dc.subject	Software fault prediction	vi_VN
dc.subject	Feature selection	vi_VN
dc.subject	Data sampling	vi_VN
dc.subject	Promise	vi_VN
dc.title	A combination of feature selection and data sampling techniques for software fault prediction	vi_VN
dc.type	Working Paper	vi_VN
Appears in Collections:	NĂM 2023
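
The abstract names precision, recall, F1-score and AUC as the evaluation metrics. As a rough illustration only, the sketch below computes them for a small imbalanced toy sample. The labels, scores, 0.5 threshold and function names are all hypothetical and are not taken from the paper, and the paper's reported averages may be aggregated differently (for example, weighted across classes).

```python
from typing import List, Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for the positive class (1 = faulty module)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # faulty, predicted faulty
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-faulty, predicted faulty
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # faulty, predicted non-faulty
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen faulty module receives a
    higher score than a randomly chosen non-faulty one (ties count as half)."""
    pos: List[float] = [s for t, s in zip(y_true, scores) if t == 1]
    neg: List[float] = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 6 non-faulty modules and 2 faulty ones (imbalanced).
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.2, 0.7, 0.35]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 decision threshold

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc_value = auc(y_true, scores)
print(precision, recall, f1, auc_value)
```

The pairwise AUC used here is quadratic in the number of samples, which is acceptable for a toy example; a rank-based computation would be preferable on full datasets.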
