

CNN architectures, including ResNet-50, and proposed a novel architecture integrating all models into one with highly competitive results [2].
The use of Transformers in deep learning models has gained significant popularity recently, with applications in both natural language processing and computer vision. Transformer-based models such as BERT, GPT-3, and the recently released GPT-4 have demonstrated state-of-the-art performance on various language modeling tasks [3]. Likewise, there is a growing trend of employing the Vision Transformer (ViT) and its variants in computer vision tasks such as image classification [4] and object detection [5]. Dosovitskiy et al. (2021) demonstrated that ViT achieves remarkable performance compared to state-of-the-art CNNs while requiring significantly fewer computational resources during training [4].
Given the absence of transformer-based models for this task and the potential of ViT, especially when pre-trained on very large datasets such as ImageNet [6], this paper develops a ViT that is pre-trained on a large-scale dataset and then fine-tuned and evaluated on benchmark datasets for Facial Beauty Prediction such as SCUT-FBP5500 [1]. For comparison, CNN models such as ResNet-50 and VGG-16, both pre-trained on ImageNet, are also re-implemented and evaluated against the Vision Transformer.
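For concreteness, the following is a minimal sketch of this transfer-learning setup, not the exact training code of this work. It assumes the PyTorch and timm libraries and a ViT-Base backbone (one common choice; the variant used in the experiments may differ), and it treats beauty prediction as single-output regression:

    import torch
    import torch.nn as nn
    import timm

    # Load an ImageNet-pre-trained ViT and replace its classification
    # head with a single regression output for the beauty score
    # (assumed setup, illustrative only).
    model = timm.create_model("vit_base_patch16_224",
                              pretrained=True, num_classes=1)

    criterion = nn.MSELoss()  # regression loss on beauty scores
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    def train_step(images, scores):
        """One fine-tuning step: images -> predicted scores -> MSE loss."""
        optimizer.zero_grad()
        preds = model(images).squeeze(1)  # shape: (batch,)
        loss = criterion(preds, scores)
        loss.backward()
        optimizer.step()
        return loss.item()

The same loop applies to the ResNet-50 and VGG-16 baselines by swapping the backbone name passed to timm.create_model.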
This paper reviews related work in the field of Facial Beauty Prediction in Section 2. Section 3 then proposes the application of the ViT architecture to facial beauty prediction, and Section 4 presents several experiments with ViT alongside other benchmarks for performance comparison and analysis. Based on the experimental results, Section 5 draws conclusions about the ViT architecture and suggests future work to improve the performance of the proposed architecture.


2      Related work


In the context of beauty prediction, many research papers have suggested approaches to this task. Iyer et al. (2021) explored machine-learning-based facial beauty prediction using facial landmarks and traditional image descriptors, such as the nose symmetry ratio. These attributes are then fed into various traditional machine learning algorithms, such as Linear Regression, Random Forest, and K-Nearest Neighbours, to output a score for facial attractiveness [7].
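A minimal sketch of this classical pipeline, with hypothetical feature files standing in for the handcrafted landmark descriptors (illustrative only, not Iyer et al.'s code), could look as follows:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.model_selection import cross_val_score

    # Hypothetical precomputed descriptors (e.g., symmetry ratios) and
    # ground-truth attractiveness ratings; file names are placeholders.
    X = np.load("landmark_features.npy")
    y = np.load("beauty_scores.npy")

    regressors = [
        ("Linear Regression", LinearRegression()),
        ("Random Forest", RandomForestRegressor(n_estimators=100)),
        ("K-Nearest Neighbours", KNeighborsRegressor(n_neighbors=5)),
    ]
    for name, reg in regressors:
        # 5-fold cross-validated mean absolute error per regressor.
        mae = -cross_val_score(reg, X, y,
                               scoring="neg_mean_absolute_error", cv=5)
        print(f"{name}: MAE = {mae.mean():.3f}")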
However, one disadvantage of this approach is its heavy dependence on handcrafted features, which requires extensive domain-specific knowledge and subjective design choices about facial beauty. Furthermore, since preferences and opinions about facial beauty may vary among different races and generations, re-designing the feature engineering to align with the interests of a new target audience could be challenging.
Moving away from explicitly engineered facial characteristics, Xiao et al. proposed Beauty3DFaceNet, a deep CNN that predicts attractiveness from 3D faces using both geometry and texture information [8]. The model utilizes a fusion module to combine geometric and texture features and designs a novel sampling strategy based on facial landmarks for improved performance in learning aesthetic features. Nevertheless, this




