CNN architectures, including ResNet-50, and proposed a novel architecture integrating all models into one with highly competitive results [2].
Transformers have recently gained significant popularity in deep learning, with applications in both natural language processing and computer vision tasks.
Transformer-based models such as BERT, GPT-3, and the recently released GPT-4
have demonstrated state-of-the-art performance in various language modeling tasks [3].
Likewise, there is a growing trend of employing Vision Transformer (ViT) and its
variants in various computer vision tasks, such as image classification [4] and object
detection [5]. Dosovitskiy et al. (2021) demonstrated that ViT achieves remarkable performance compared with state-of-the-art CNNs while requiring significantly fewer computational resources to train [4].
Given the absence of transformer-based models for this task and the potential of ViT, especially when pre-trained on very large datasets such as ImageNet [6], this paper develops a ViT model that is pre-trained on a large-scale dataset and then fine-tuned and evaluated on benchmark datasets for Facial Beauty Prediction such as SCUT-FBP5500 [1]. For comparison, CNN models such as ResNet-50 and VGG-16, both pre-trained on ImageNet, are also re-implemented and evaluated against the Vision Transformer.
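To make this setup concrete, the sketch below shows one way such a pipeline could be wired up in PyTorch with torchvision; the model variant (ViT-B/16), regression head, loss, optimizer, and training loop are illustrative assumptions rather than the exact configuration used in this paper.

# A minimal sketch (not this paper's exact code) of fine-tuning an
# ImageNet-pre-trained Vision Transformer for facial beauty score regression.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load ViT-B/16 with ImageNet pre-trained weights
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
# Swap the 1000-class classification head for a single-output regression head
model.heads.head = nn.Linear(model.heads.head.in_features, 1)

criterion = nn.MSELoss()  # regress the mean human-rated beauty score
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_one_epoch(model, loader, device="cuda"):
    # loader yields (images, scores): 224x224 RGB face crops and their
    # ground-truth attractiveness ratings (e.g., from SCUT-FBP5500)
    model.train().to(device)
    for images, scores in loader:
        images, scores = images.to(device), scores.to(device).float()
        preds = model(images).squeeze(1)
        loss = criterion(preds, scores)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

The same fine-tuning loop applies to the CNN baselines by replacing the final fully connected layer of ResNet-50 or VGG-16 with a single-output head, which keeps the comparison between CNNs and ViT consistent.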
This paper reviews related work on Facial Beauty Prediction in Section 2. It then proposes the application of the ViT architecture to facial beauty prediction in Section 3, before conducting several experiments with ViT and other benchmark models for performance comparison and analysis. Based on the experimental results, Section 5 draws conclusions about the ViT architecture and suggests future work to improve the performance of the proposed approach.
2 Related work
Many approaches to facial beauty prediction have been proposed. Iyer et al. (2021) explored machine learning-based facial beauty prediction using facial landmarks and traditional image descriptors, such as the nose symmetry ratio. These attributes are then fed into traditional machine learning algorithms such as Linear Regression, Random Forest, and K-Nearest Neighbours to output a score for facial attractiveness [7].
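As an illustration of this kind of pipeline (not Iyer et al.'s actual implementation), the sketch below scores faces from precomputed landmark descriptors with the classical regressors mentioned above; the feature matrix and rating scale are placeholder assumptions.

# Illustrative sketch of the handcrafted-feature approach: classical
# regressors applied to precomputed geometric descriptors.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((500, 20))    # placeholder: 500 faces x 20 descriptors (e.g., symmetry ratios)
y = rng.uniform(1, 5, 500)   # placeholder: human-rated attractiveness on a 1-5 scale

regressors = [
    ("Linear Regression", LinearRegression()),
    ("Random Forest", RandomForestRegressor(n_estimators=200, random_state=0)),
    ("K-Nearest Neighbours", KNeighborsRegressor(n_neighbors=5)),
]
for name, reg in regressors:
    mae = -cross_val_score(reg, X, y, cv=5,
                           scoring="neg_mean_absolute_error").mean()
    print(f"{name}: MAE = {mae:.3f}")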
However, one disadvantage of this approach is its heavy dependence on handcrafted features, which requires extensive domain-specific knowledge and subjective design choices about facial beauty. Furthermore, since preferences and opinions about facial beauty vary across races and generations, redesigning the feature engineering to match the preferences of a new target audience can be challenging.
Moving away from explicitly measured facial characteristics, Xiao et al. proposed Beauty3DFaceNet, a deep CNN that predicts attractiveness on 3D faces using both geometry and texture information [8]. The model utilizes a fusion module to combine geometric and texture features and introduces a novel sampling strategy based on facial landmarks to better learn aesthetic features. Nevertheless, this