stacked with self-attention layers, enables the model to capture long-range
dependencies among different regions of the face image. The output feature vector from
the transformer encoder, representing the holistic face image, is fed into the MLP head,
which produces the predicted beauty score used for evaluation and backpropagation.
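As an illustration, the following sketch shows how such an MLP head can map the encoder's holistic feature vector to a single beauty score. PyTorch is assumed here; the framework, layer sizes, and the name BeautyScoreHead are illustrative choices, not details taken from the paper.

import torch
import torch.nn as nn

class BeautyScoreHead(nn.Module):
    """MLP head mapping the encoder's holistic feature to one beauty score."""
    def __init__(self, embed_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),  # single regression output
        )

    def forward(self, cls_feature: torch.Tensor) -> torch.Tensor:
        # cls_feature: (batch, embed_dim) holistic face representation
        return self.mlp(cls_feature).squeeze(-1)  # (batch,) predicted scores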
3.2.3 Loss function
The loss function employed to assess the performance of the models on the SCUT-
FBP5500 dataset integrates two critical metrics: Mean Squared Error (MSE) and Mean
Absolute Error (MAE). These metrics offer a quantitative measure of the disparity
between the predicted facial beauty score ($\hat{y}_i$) from the MLP and the corresponding
ground-truth score ($y_i$) for each image $i$ in the dataset.
The MSE is computed by averaging the squared differences between the predicted
and ground-truth beauty scores:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2,$$

where $n$ is the total number of instances in the dataset. The squared term in MSE
amplifies larger prediction errors, making it more sensitive to significant discrepancies
between predicted and ground-truth scores.
Conversely, the MAE is the average of the absolute differences between the
predicted and ground-truth beauty scores:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |\hat{y}_i - y_i|.$$

Unlike MSE, MAE weights all differences between the predicted and actual scores
equally, irrespective of their magnitude.
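A minimal sketch of how the two metrics can be computed over a batch of predictions, assuming PyTorch tensors; how the two terms are weighted in the combined loss is not specified in the text, so only the individual metrics are shown.

import torch

def mse_mae(pred: torch.Tensor, target: torch.Tensor):
    """Return (MSE, MAE) between predicted and ground-truth beauty scores."""
    diff = pred - target
    mse = torch.mean(diff ** 2)        # squaring amplifies large errors
    mae = torch.mean(torch.abs(diff))  # all errors weighted equally
    return mse, mae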
Employing both MSE and MAE in the loss function offers a comprehensive
evaluation of the model's performance from complementary perspectives. Smaller MSE
and MAE values indicate superior model performance, denoting smaller deviations
between the predicted and actual beauty scores and thus higher model precision and
accuracy. Additionally, the Pearson correlation coefficient is employed to offer
further insight into the model's performance.
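For reference, the Pearson correlation coefficient between predictions and ground truth can be computed as follows; this is again a sketch assuming PyTorch, and pearson_corr is an illustrative name.

import torch

def pearson_corr(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Pearson correlation coefficient between predictions and ground truth."""
    pred_c = pred - pred.mean()
    target_c = target - target.mean()
    # Covariance divided by the product of standard deviations;
    # the 1/n factors cancel, so plain sums suffice.
    return (pred_c * target_c).sum() / (pred_c.norm() * target_c.norm())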
4 Experiments and Results
Although the ViT requires more computational resources to train, its superior
performance and faster convergence make it a highly effective and efficient
choice. Table 1 presents the results of our experiments with the three models.