

The dataset contains a total of 110 samples, 50 of which were acquired from real-life recordings; the remaining samples could not be collected from real life and were therefore obtained from internet sources such as Kaggle [11].
After acquiring the desired dataset, the next step is preprocessing. With the help of the Pydub library, every collected sample was split into chunks of one-second duration. Using Python, the samples were converted to a bit rate of 256 kbps with a sampling frequency of 16 kHz. Since the system has no sound-direction detection function, the whole dataset was converted to a mono channel. Moreover, we removed the silent parts before splitting to get rid of unnecessary segments: every audio portion whose amplitude falls below a threshold of -45 dB was cut. In the end, the dataset contains a total of 423 samples.
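
A minimal sketch of this Pydub-based pipeline is given below; the folder layout, file names, and the min_silence_len value are illustrative assumptions, not values taken from the paper:

import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

SILENCE_THRESH_DB = -45   # amplitude threshold from the paper
CHUNK_MS = 1000           # one-second chunks
TARGET_RATE = 16000       # 16 kHz sampling frequency

def preprocess(in_path: str, out_dir: str, base_name: str) -> None:
    audio = AudioSegment.from_file(in_path)
    # Mono channel (no direction detection) at 16 kHz, 16-bit PCM;
    # 16 kHz * 16 bit * 1 channel = 256 kbps.
    audio = audio.set_channels(1).set_frame_rate(TARGET_RATE).set_sample_width(2)
    # Drop every part quieter than -45 dB, then rejoin the loud parts.
    # min_silence_len=200 ms is an assumed value.
    loud = split_on_silence(audio, min_silence_len=200,
                            silence_thresh=SILENCE_THRESH_DB)
    audio = sum(loud, AudioSegment.empty())
    # Split the remaining audio into one-second chunks and export them.
    os.makedirs(out_dir, exist_ok=True)
    for i in range(0, len(audio) - CHUNK_MS + 1, CHUNK_MS):
        chunk = audio[i:i + CHUNK_MS]   # Pydub slices in milliseconds
        chunk.export(os.path.join(out_dir, f"{base_name}_{i // CHUNK_MS:03d}.wav"),
                     format="wav")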
Fig. 5. A sample before and after processing

Labeling the data: To train the neural network with the dataset, the input data had to be labelled. This was done by evaluating the name of each file. We used Python to rename all the split audio files in order and saved them to a folder. After identifying the type of each audio clip, we labelled it in Excel, based on the file name from the previous step, and saved the result to a comma-separated values (.csv) file.
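
A minimal sketch of the renaming and CSV step could look as follows; the folder name, file-naming scheme, and the idea of deriving the label from the original file name are illustrative assumptions (in the paper, the labels themselves were assigned manually in Excel):

import csv
import os

CHUNK_DIR = "chunks"       # assumed folder holding the split one-second clips
LABEL_FILE = "labels.csv"

# Rename the split files in order and record filename -> label rows.
# Each clip is assumed to carry its class in its original name,
# e.g. "dog_bark_012.wav" -> label "dog_bark".
with open(LABEL_FILE, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "label"])
    for i, name in enumerate(sorted(os.listdir(CHUNK_DIR))):
        label = name.rsplit("_", 1)[0]            # strip the chunk index
        new_name = f"{i:04d}_{label}.wav"
        os.rename(os.path.join(CHUNK_DIR, name),
                  os.path.join(CHUNK_DIR, new_name))
        writer.writerow([new_name, label])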


Training Process: Because the input of the model is a 2D spectrogram image, a Conv2D layer was used to train this neural network. This is a common type of convolution in which the kernel slides over the input layer in two dimensions, vertically and horizontally, computing the product of the weights and the input and then adding a bias term.
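
A minimal Keras sketch of such a network is given below; the number of classes, the input spectrogram shape, and the layer sizes are illustrative assumptions, not the paper's exact architecture:

from tensorflow.keras import layers, models

NUM_CLASSES = 4             # assumed number of sound classes
INPUT_SHAPE = (128, 32, 1)  # assumed spectrogram size: (frequency, time, 1)

model = models.Sequential([
    layers.Input(shape=INPUT_SHAPE),
    # The Conv2D kernel slides vertically and horizontally over the
    # spectrogram, computing weight-input products plus a bias term.
    layers.Conv2D(16, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    # Softmax turns the final logits into class probabilities.
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Categorical cross-entropy (log loss): a lower loss means a better fit.
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])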
Cross-entropy is the loss function used together with the Softmax output for this classification task. This log-loss function measures how well or how poorly the model predicts; categorical cross-entropy generalizes cross-entropy to multiple classes and yields the probability that the model assigns to the true label versus the other classes. A lower loss indicates better performance.
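
For reference, the standard categorical cross-entropy for a single sample with one-hot true label $y$ and Softmax output probabilities $\hat{y}$ over $C$ classes is

$$L = -\sum_{c=1}^{C} y_c \log \hat{y}_c .$$

Because $y$ is one-hot, this reduces to $-\log \hat{y}_t$ for the true class $t$, so minimizing the loss pushes the predicted probability of the true label toward 1.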




