ViSWAP: Vietnamese Voice Conversion System with Diffusion Model

Ma, Thanh; Tran, Viet Chau; Tran, Nguyen Minh Thu; Pham, Xuan Hien; Nguyen, Van Nguyen; Do, Thanh Nghi

Please use this identifier to cite or link to this item: https://elib.vku.udn.vn/handle/123456789/6234

Full metadata record

DC Field	Value	Language
dc.contributor.author	Ma, Thanh	-
dc.contributor.author	Tran, Viet Chau	-
dc.contributor.author	Tran, Nguyen Minh Thu	-
dc.contributor.author	Pham, Xuan Hien	-
dc.contributor.author	Nguyen, Van Nguyen	-
dc.contributor.author	Do, Thanh Nghi	-
dc.date.accessioned	2026-01-20T07:26:42Z	-
dc.date.available	2026-01-20T07:26:42Z	-
dc.date.issued	2026-01	-
dc.identifier.isbn	978-3-032-00971-5 (p)	-
dc.identifier.isbn	978-3-032-00972-2 (e)	-
dc.identifier.uri	https://doi.org/10.1007/978-3-032-00972-2_8	-
dc.identifier.uri	https://elib.vku.udn.vn/handle/123456789/6234	-
dc.description	Lecture Notes in Networks and Systems (LNNS, volume 1581); The 14th Conference on Information Technology and Its Applications (CITA 2025); pp:	vi_VN
dc.description.abstract	This paper presents an advanced Vietnamese voice conversion system, called ViSWAP, that utilizes a diffusion model to achieve highly natural and intelligible speech synthesis. By incorporating cutting-edge techniques such as HiFi-GAN, Real-Time Voice Cloning, and speaker diarization, ViSWAP effectively converts voices in both single- and multi-speaker contexts with precision and speed. The system processes audio through a structured pipeline, from pre-processing with mel-spectrogram generation and TextGrid alignment in Vietnamese, to encoding and decoding within the diffusion framework. The adoption of the diffusion model is crucial, as it excels in maintaining high-quality voice conversion by handling complex transformations with superior fidelity. Experimental evaluations across multiple audio frequencies demonstrate the system’s strength in minimizing key metrics such as DTW, Euclidean, and cosine distances and MSE, showcasing significant improvements in timbre accuracy and harmonic preservation. We have also published the dataset and implementation on GitHub (https://github.com/Nguyen-Van-Nguyen-github/DiffusionVoiceVietNam).	vi_VN
dc.language.iso	en	vi_VN
dc.publisher	Springer Nature	vi_VN
dc.subject	Diffusion model	vi_VN
dc.subject	Voice conversion	vi_VN
dc.subject	Vietnamese speech	vi_VN
dc.subject	HiFi-GAN	vi_VN
dc.subject	Real-Time Voice Cloning	vi_VN
dc.title	ViSWAP: Vietnamese Voice Conversion System with Diffusion Model	vi_VN
dc.type	Working Paper	vi_VN
Appears in Collections:	CITA 2025 (International)
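
The abstract names four evaluation metrics (DTW, Euclidean distance, cosine distance, and MSE) without saying how they are computed. The following is a minimal sketch of their textbook definitions, assuming each utterance is represented as a sequence of feature frames (for example mel-spectrogram frames). The function names and the toy frames are hypothetical and are not taken from the ViSWAP implementation.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def dtw(seq_a, seq_b, dist=euclidean):
    """Classic dynamic time warping cost between two frame sequences.

    The sequences may differ in length; each frame is a feature vector.
    Uses the standard O(n*m) dynamic programme with no window constraint.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost aligning the first i and j frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a frame in seq_b
                                 cost[i][j - 1],      # skip a frame in seq_a
                                 cost[i - 1][j - 1])  # match both frames
    return cost[n][m]

# Toy one-dimensional "frames" (hypothetical data, for illustration only).
target = [[0.0], [1.0], [2.0]]
converted = [[0.0], [1.0], [1.0], [2.0]]
alignment_cost = dtw(target, converted)
```

Lower values mean the converted features are closer to the target on all four metrics. DTW is the only one of the four that tolerates sequences of different lengths, which is presumably why it appears alongside the frame-wise distances.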
