Please use this identifier to cite or link to this item: https://elib.vku.udn.vn/handle/123456789/4041
Title: DaNangNLP Toolkit for Vietnamese Text Preprocessing and Word Segmentation
Authors: Nguyen, Ket Doan
Nguyen, Tran Tien
Nguyen, Duc Bao
Ton, That Ron
Vo, Van Nam
Pham, Van Nam
Phung, Anh Sang
Huynh, Cong Phap
Nguyen, Huu Nhat Minh
Keywords: Sentence Segmentation
Regular Expression
Word Segmentation
Word Normalization
Vietnamese Language Processing
Issue Date: 2024
Publisher: Vietnam-Korea University of Information and Communication Technology
Series/Report no.: CITA;
Abstract: Recent research has focused on Vietnamese large language models; however, preprocessing steps play an important complementary role in the future success of Vietnamese language processing. In this paper, we design and develop DaNangNLP, a novel toolkit that handles the key Vietnamese language preprocessing steps. Although many successful modules for Vietnamese language processing exist, existing toolkits still exhibit certain shortcomings, especially in word segmentation of complex Vietnamese sentences. We have therefore developed a practical and robust natural language processing pipeline tailored to the Vietnamese language to address the challenging issues present in previous Vietnamese processing toolkits. The DaNangNLP pipeline, built on novel built-in word dictionaries, is designed to handle typical preprocessing steps for Vietnamese text: sentence segmentation, word regex, word normalization, and word segmentation. In the evaluation, the proposed semantic-based word segmentation outperformed frequency-based word segmentation and existing toolkits on complex sentences.
Description: Proceedings of the 13th International Conference on Information Technology and Its Applications (CITA 2024); pp. 296-307
URI: https://elib.vku.udn.vn/handle/123456789/4041
ISBN: 978-604-80-9774-5
Appears in Collections: CITA 2024 (Proceeding - Vol 2)

Use of materials in the Digital Library must comply with copyright law.