Please use this identifier to cite or link to this item:
https://elib.vku.udn.vn/handle/123456789/4293
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Vo, Xuan Thuy | - |
dc.contributor.author | Nguyen, Duy Linh | - |
dc.contributor.author | Priadana, Adri | - |
dc.contributor.author | Choi, Jehwan | - |
dc.contributor.author | Jo, Kang Hyun | - |
dc.date.accessioned | 2024-12-06T08:46:22Z | - |
dc.date.available | 2024-12-06T08:46:22Z | - |
dc.date.issued | 2024-11 | - |
dc.identifier.isbn | 978-3-031-74126-5 | - |
dc.identifier.uri | https://elib.vku.udn.vn/handle/123456789/4293 | - |
dc.identifier.uri | https://doi.org/10.1007/978-3-031-74127-2_30 | - |
dc.description | Lecture Notes in Networks and Systems (LNNS, volume 882); The 13th Conference on Information Technology and Its Applications (CITA 2024); pp. 360-371. | vi_VN |
dc.description.abstract | Self-attention can capture long-range dependencies from input sequences without inductive biases, but at quadratic complexity. When transferring Vision Transformers to dense prediction tasks, the models therefore incur huge computational costs. Recent methods have adopted sparse attention to approximate attention regions and injected convolution into self-attention layers. Motivated by this line of research, this paper introduces group attention, which has linear complexity with input resolution while modeling global context features. Group attention shares information across channels, while convolution shares information spatially. The two operations are complementary, and multi-scale convolution can capture multiple views of the input. Merging multi-scale convolution into group attention layers helps improve feature representation and modeling abilities. To verify the effectiveness of the proposed method, extensive experiments are conducted on benchmark datasets for various vision tasks. On ImageNet-1K image classification, the proposed method achieves 77.6% Top-1 accuracy at 0.7 GFLOPs, surpassing other methods under similar computational costs. When transferring the model pre-trained on ImageNet-1K to dense prediction tasks, the proposed method attains consistent improvements across visual tasks. | vi_VN |
dc.language.iso | en | vi_VN |
dc.publisher | Springer Nature | vi_VN |
dc.subject | Multi-Scale Convolutions Meet Group Attention for Dense Prediction Tasks | vi_VN |
dc.title | Multi-Scale Convolutions Meet Group Attention for Dense Prediction Tasks | vi_VN |
dc.type | Working Paper | vi_VN |
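
The abstract above describes, at a high level, a group attention whose cost is linear in the number of input tokens, combined with multi-scale convolutions inside the attention layer. The PyTorch sketch below is only an illustration of that idea and is not the authors' implementation: the module name, group count, kernel sizes, and the choice of channel-wise (cross-covariance style) attention within groups are all assumptions made for the example.

```python
# Illustrative sketch only: NOT the authors' code. It combines a channel-group
# attention (cost linear in the number of tokens) with parallel multi-scale
# depthwise convolutions, as one possible reading of the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleGroupAttention(nn.Module):
    def __init__(self, dim, num_groups=4, kernel_sizes=(3, 5, 7)):
        super().__init__()
        assert dim % num_groups == 0, "dim must be divisible by num_groups"
        self.num_groups = num_groups
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Multi-scale depthwise convolutions: spatial sharing, one branch per kernel size.
        self.convs = nn.ModuleList(
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim) for k in kernel_sizes
        )
        self.temperature = nn.Parameter(torch.ones(num_groups, 1, 1))

    def forward(self, x, H, W):
        # x: (B, N, C) tokens from an H x W feature map, N = H * W.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_groups, C // self.num_groups)
        q, k, v = qkv.permute(2, 0, 3, 4, 1).unbind(0)  # each: (B, groups, C/g, N)

        # Attention over channels within each group: the (C/g x C/g) attention map
        # costs O(N * (C/g)^2), i.e. linear in the number of tokens N.
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # (B, groups, C/g, C/g)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).reshape(B, C, N).transpose(1, 2)  # back to (B, N, C)

        # Multi-scale convolution branch on the spatial layout, added to the
        # attention output so channel sharing and spatial sharing complement each other.
        spatial = x.transpose(1, 2).reshape(B, C, H, W)
        conv_out = sum(conv(spatial) for conv in self.convs)
        out = out + conv_out.flatten(2).transpose(1, 2)
        return self.proj(out)


if __name__ == "__main__":
    block = MultiScaleGroupAttention(dim=64, num_groups=4)
    tokens = torch.randn(2, 14 * 14, 64)  # batch of 2, 14x14 feature map
    print(block(tokens, H=14, W=14).shape)  # torch.Size([2, 196, 64])
```

Because the attention map is formed over channels within each group rather than over token pairs, its size does not grow with the spatial resolution, which is what makes this kind of design attractive for dense prediction inputs.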
Appears in Collections: CITA 2024 (International)
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.