Please use this identifier to cite or link to this item: https://elib.vku.udn.vn/handle/123456789/6204

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Nguyen, Ngoc Thanh Thanh | - |
| dc.contributor.author | Tran, Manh Son | - |
| dc.contributor.author | Mai, Duc Tho | - |
| dc.date.accessioned | 2026-01-20T01:51:20Z | - |
| dc.date.available | 2026-01-20T01:51:20Z | - |
| dc.date.issued | 2026-01 | - |
| dc.identifier.isbn | 978-3-032-00971-5 (print) | - |
| dc.identifier.isbn | 978-3-032-00972-2 (electronic) | - |
| dc.identifier.uri | https://doi.org/10.1007/978-3-032-00972-2_31 | - |
| dc.identifier.uri | https://elib.vku.udn.vn/handle/123456789/6204 | - |
| dc.description | Lecture Notes in Networks and Systems (LNNS, volume 1581); The 14th Conference on Information Technology and Its Applications (CITA 2025); pp. 415-427 | vi_VN |
| dc.description.abstract | Automatic speech recognition (ASR) based on deep learning has advanced considerably, driven by the growing need for speech-to-text transcription in multimedia applications such as video subtitling and accessibility tools. While OpenAI’s Whisper model has demonstrated state-of-the-art transcription performance across multiple languages and accents, its real-world deployment remains challenging due to linguistic variations, background noise, and formatting inconsistencies. This study investigates the application of Whisper for automated video captioning, evaluating its baseline transcription accuracy and exploring its limitations under diverse linguistic and environmental conditions. The research focuses on assessing Whisper’s weaknesses, particularly in accented speech recognition and robustness in noisy environments, and proposes a series of post-processing techniques to enhance its usability. The methodology consists of dataset pre-processing, model evaluation, real-world robustness testing, and the application of phonetic normalization, punctuation restoration, and spelling correction. The results demonstrate that Whisper achieves a Word Error Rate (WER) of 4.75% after post-processing but struggles with Scottish-accented speech (WER 18.18%) and noisy environments. By introducing post-processing strategies, including phonetic adaptation and transcript enhancement techniques, the study significantly improves transcription accuracy, readability, and usability for real-world applications. | vi_VN |
| dc.language.iso | en | vi_VN |
| dc.publisher | Springer Nature | vi_VN |
| dc.subject | Speech recognition | vi_VN |
| dc.subject | OpenAI Whisper | vi_VN |
| dc.subject | Automatic captioning | vi_VN |
| dc.subject | Error correction | vi_VN |
| dc.subject | Phonetic normalization | vi_VN |
| dc.title | Automated Speech-To-Text Captioning for Videos and Noise Robustness Analysis Using OpenAI Whisper: A Performance and Enhancement Study | vi_VN |
| dc.type | Working Paper | vi_VN |
Appears in Collections: CITA 2025 (International)
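The abstract above evaluates a Whisper-based captioning pipeline using Word Error Rate (WER). The two sketches below illustrate those pieces for readers of this record; they are minimal illustrations based on the abstract, not the authors' code from the paper.

First, a baseline transcription call, assuming the open-source `openai-whisper` package and `ffmpeg` are installed. The model size (`base`) and the input file name are placeholder assumptions; the paper does not state which configuration was evaluated:

```python
import whisper  # pip install openai-whisper (requires ffmpeg on PATH)

# Load a Whisper checkpoint; "base" is a placeholder choice, not
# necessarily the model size evaluated in the paper.
model = whisper.load_model("base")

# Transcribe a video/audio file; "lecture.mp4" is a hypothetical input.
result = model.transcribe("lecture.mp4")
print(result["text"])
```

Second, a self-contained WER computation following the standard definition WER = (S + D + I) / N, where S, D, and I are word-level substitutions, deletions, and insertions against a reference transcript of N words. The example strings are invented for illustration and are not from the paper's dataset:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions to reach an empty hypothesis
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions from an empty reference
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

if __name__ == "__main__":
    ref = "the quick brown fox jumps over the lazy dog"
    hyp = "the quick brown fox jumped over a lazy dog"
    print(f"WER = {wer(ref, hyp):.2%}")  # 2 substitutions / 9 words ≈ 22.22%
```

Post-processing steps such as phonetic normalization, punctuation restoration, and spelling correction (as described in the abstract) would be applied to the hypothesis text before computing WER.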