
                     3.2    Model Architecture


                     3.2.1  Resnet-50 and VGG-16

ResNet-50, as its name indicates, is a 50-layer deep convolutional neural network composed of
convolutional layers, batch normalization layers, ReLU activation functions, and skip
connections. The convolutional layers extract features from input images, while batch
normalization layers normalize the inputs to each layer for improved training stability.
ReLU activation functions introduce non-linearity for learning complex patterns. The
skip connections allow the network to bypass one or more layers, enabling direct
gradient propagation backward through the network and mitigating the vanishing gradient problem.
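
A minimal sketch of a residual block with such a skip connection is shown below (PyTorch is assumed; this is the basic two-convolution block for illustration, not the exact bottleneck block used in ResNet-50):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Illustrative residual block: conv -> BN -> ReLU -> conv -> BN, plus a skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x                      # skip (shortcut) path bypassing the two layers
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity              # skip connection: gradients flow directly through this sum
        return self.relu(out)
```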


VGG-16, designed for large-scale image recognition, consists of 13 convolutional layers
and 3 fully connected layers. It utilizes small 3x3 convolutional filters stacked on top
of each other, with occasional max-pooling layers for downsampling.
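
This stacking pattern can be sketched as a reusable stage (an illustrative PyTorch helper, not the full 16-layer configuration):

```python
import torch.nn as nn

# One VGG-style stage: repeated 3x3 convolutions followed by 2x2 max pooling for downsampling.
def vgg_block(in_channels: int, out_channels: int, num_convs: int = 2) -> nn.Sequential:
    layers = []
    for i in range(num_convs):
        layers.append(nn.Conv2d(in_channels if i == 0 else out_channels,
                                out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU(inplace=True))
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))  # halves the spatial resolution
    return nn.Sequential(*layers)
```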
These two CNN models were initially pre-trained on the ImageNet dataset and
subsequently fine-tuned on the SCUT-FBP5500 dataset for model evaluation and
performance comparison against the proposed Beauty ViT model.
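
A hedged sketch of this transfer learning setup, assuming PyTorch/torchvision and a single-output regression head for the SCUT-FBP5500 beauty score (the authors' exact head design and training hyperparameters are not restated here):

```python
import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained backbones and replace their classification heads
# with a single-output regression layer for the facial beauty score.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
resnet.fc = nn.Linear(resnet.fc.in_features, 1)

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.classifier[6] = nn.Linear(vgg.classifier[6].in_features, 1)

# Both backbones can then be fine-tuned end-to-end on SCUT-FBP5500 with, e.g., an MSE loss.
criterion = nn.MSELoss()
```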

                     3.2.2   ViT Architecture


The Facial Beauty Prediction Transformer follows the architecture of ViT, depicted in Fig. 1.



                         Fig. 1. Facial Beauty Prediction ViT architecture and Transformer Encoder component


The ViT model for Facial Beauty Prediction is an effective transformer-based model
tailored to the facial beauty prediction task. The architecture comprises a patch embedding
layer, a transformer encoder, and an MLP (Multilayer Perceptron) head for output. The
patch embedding layer dissects the input image into a multitude of patches, which are
subsequently flattened and fed to the transformer encoder; a minimal sketch of this step is given below.
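
The sketch assumes PyTorch and illustrative values (224x224 inputs, 16x16 patches, embedding dimension 768, as in the standard ViT-Base); the exact patch size and embedding dimension used by Beauty ViT are assumptions here, not taken from the paper:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each patch to an embedding vector."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch and applying a linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                  # (B, embed_dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim): the sequence fed to the encoder
        return x
```

The transformer encoder,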




