Please use this identifier to cite or link to this item: https://elib.vku.udn.vn/handle/123456789/6223
Full metadata record
DC Field: Value (Language)
dc.contributor.author: Ha, Minh Tan
dc.contributor.author: Le, Dinh Nguyen
dc.contributor.author: Dang, An
dc.date.accessioned: 2026-01-20T03:11:34Z
dc.date.available: 2026-01-20T03:11:34Z
dc.date.issued: 2026-01
dc.identifier.isbn: 978-3-032-00971-5 (p)
dc.identifier.isbn: 978-3-032-00972-2 (e)
dc.identifier.uri: https://doi.org/10.1007/978-3-032-00972-2_12
dc.identifier.uri: https://elib.vku.udn.vn/handle/123456789/6223
dc.description: Lecture Notes in Networks and Systems (LNNS, volume 1581); The 14th Conference on Information Technology and Its Applications (CITA 2025); pp. 147-158 (vi_VN)
dc.description.abstract: Speaker extraction is the task of isolating a specific speaker's voice from a blend of other speakers using supplementary information. This paper proposes time-domain speaker extraction with a parallel intra- and inter-framework (TSEPII). An efficient intra- and inter-architecture converts the mixed utterance into multi-scale embedding coefficients, and the parallel design achieves greater stability than previous single architectures. The system comprises four main components: the auxiliary encoder (the talker encoding block), the extraction encoder (the utterance encoding block), the talker extraction block, and the extraction decoder (the utterance decoding block). In particular, processing the raw voice in the time domain preserves important signal information. The utterance encoding block transforms the mixed voice into multi-scale embedding values, while the talker encoding block learns the target talker via a talker embedding feature. The talker extraction block plays the central role: it takes the multi-scale embedding values and the talker embedding feature as inputs and estimates the time-domain mask for the system. Finally, the utterance decoding block reconstructs the target talker's utterance. Experiments show that TSEPII achieves state-of-the-art performance and is competitive with current methods. (vi_VN)
dc.language.iso: en (vi_VN)
dc.publisher: Springer Nature (vi_VN)
dc.subject: Target speaker extraction (vi_VN)
dc.subject: Informed talker extraction (vi_VN)
dc.subject: Time-domain talker extraction (vi_VN)
dc.subject: Parallel intra- and inter-framework (vi_VN)
dc.subject: Deep learning (vi_VN)
dc.subject: End-to-end deep neural network (vi_VN)
dc.title: Time-Domain Target Speaker Extraction with Parallel Intra and Inter-framework (vi_VN)
dc.type: Working Paper (vi_VN)
Appears in Collections:CITA 2025 (International)
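The abstract above describes a four-block pipeline: an utterance encoder maps the mixed waveform to embeddings, a talker encoder distills the reference utterance into a speaker vector, an extraction block estimates a time-domain mask from both, and a decoder reconstructs the target utterance. Below is a minimal, illustrative NumPy sketch of that data flow only, not the paper's implementation: the shared random projection standing in for the learned encoder basis, the mean-pooled speaker vector, and the per-frame sigmoid mask are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
WIN, HOP, DIM = 16, 8, 32
# Stand-in for a learned encoder basis, shared by both encoding blocks.
W = rng.standard_normal((WIN, DIM)) * 0.1

def encode(signal):
    """Encoding block (sketch): overlapping frames projected to embeddings."""
    n = (len(signal) - WIN) // HOP + 1
    frames = np.stack([signal[i * HOP:i * HOP + WIN] for i in range(n)])
    return frames @ W, frames            # (n_frames, DIM), (n_frames, WIN)

def speaker_vector(reference):
    """Talker encoding block (sketch): mean-pool reference embeddings."""
    emb, _ = encode(reference)
    return emb.mean(axis=0)              # (DIM,)

def extraction_mask(mix_emb, spk_vec):
    """Talker extraction block (sketch): score mixture frames against the
    speaker vector and squash to a (0, 1) per-frame mask."""
    return 1.0 / (1.0 + np.exp(-(mix_emb @ spk_vec)))

def extract(mixture, reference):
    """Apply the mask to mixture frames and overlap-add back to a waveform
    (the role of the utterance decoding block)."""
    mix_emb, mix_frames = encode(mixture)
    mask = extraction_mask(mix_emb, speaker_vector(reference))
    masked = mix_frames * mask[:, None]
    out = np.zeros_like(mixture)
    norm = np.zeros_like(mixture)
    for i, f in enumerate(masked):
        out[i * HOP:i * HOP + WIN] += f
        norm[i * HOP:i * HOP + WIN] += 1.0
    return out / np.maximum(norm, 1.0)

mixture = rng.standard_normal(128)
reference = rng.standard_normal(128)
est = extract(mixture, reference)
print(est.shape)  # → (128,)
```

In the actual system these blocks are trained end-to-end on raw waveforms; the sketch only shows how the speaker embedding conditions the mask that selects the target talker's frames.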
