3.2 Model Architecture
3.2.1 Resnet-50 and VGG-16
ResNet-50 is a 50-layer deep residual network composed of convolutional layers, batch normalization layers, ReLU activation functions, and skip connections. The convolutional layers extract features from input images, while the batch normalization layers normalize the inputs to each layer for improved training stability. ReLU activation functions introduce non-linearity for learning complex patterns. The skip connections allow the network to bypass one or more layers, enabling gradients to propagate directly across the network and mitigating the vanishing gradient problem.
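To make the role of the skip connection concrete, the sketch below shows a minimal residual block. It assumes PyTorch (the section does not name a framework), and the channel count is illustrative rather than the paper's exact configuration.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A basic residual block: two 3x3 conv + batch-norm layers with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                          # the skip connection keeps the original input
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                  # gradients flow directly through this addition
        return self.relu(out)

# Example: the block preserves the input shape, so it can be stacked freely.
x = torch.randn(1, 64, 56, 56)
y = ResidualBlock(64)(x)                      # y.shape == (1, 64, 56, 56)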
VGG-16, originally designed for large-scale image recognition, consists of 13 convolutional layers and 3 fully connected layers. It utilizes small 3x3 convolutional filters stacked on top of each other, with occasional max-pooling layers for downsampling.
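As an illustration of this stacking pattern, the following sketch (again assuming PyTorch) builds one VGG-style stage of 3x3 convolutions followed by 2x2 max pooling; the example stages correspond to the first two stages of VGG-16.

import torch.nn as nn

def vgg_block(in_channels, out_channels, num_convs):
    """A VGG-style stage: num_convs 3x3 convolutions with ReLU, then 2x2 max pooling."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))  # halves the spatial resolution
    return nn.Sequential(*layers)

# The first two stages of VGG-16: two 64-channel convs, then two 128-channel convs.
features = nn.Sequential(vgg_block(3, 64, 2), vgg_block(64, 128, 2))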
These two CNN models were initially pre-trained on the ImageNet dataset and subsequently fine-tuned on the SCUT-FBP5500 dataset for evaluation and performance comparison against the proposed Beauty ViT model.
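A minimal sketch of this transfer-learning setup is given below. It assumes the torchvision implementations of both backbones (version 0.13 or later for the weights API) and a single-output regression head for the SCUT-FBP5500 beauty scores; the actual training hyperparameters are not specified in this section.

import torch.nn as nn
from torchvision import models

# Load the ImageNet-pretrained backbones.
resnet50 = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# Replace the 1000-class ImageNet heads with a single-output regression head
# for predicting a beauty score on SCUT-FBP5500.
resnet50.fc = nn.Linear(resnet50.fc.in_features, 1)
vgg16.classifier[6] = nn.Linear(vgg16.classifier[6].in_features, 1)

# Both models can then be fine-tuned end-to-end, e.g. with an MSE loss:
# criterion = nn.MSELoss()
# optimizer = torch.optim.Adam(resnet50.parameters(), lr=1e-4)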
3.2.2 ViT Architecture
The Facial Beauty Prediction Transformer follows the ViT architecture depicted in Fig. 1.
Fig. 1. Facial Beauty Prediction ViT architecture and Transformer Encoder component
The ViT model for Face Beauty Prediction is an effective transformer-based model
tailored for facial recognition tasks. The architecture comprises a patch embedding
layer, a transformer encoder, and an MLP (Multilayer Perceptron) head for output. The
patch embedding layer divides the input image into a sequence of patches, which are
subsequently flattened and fed to the transformer encoder. The transformer encoder,