Page 30 - Proceedings of the 12th Scientific Conference - Information Technology and Its Applications in Various Fields (CITA 2023)
Imputation methods can be classified into two categories: single imputation [17] and
multiple imputation [18]. Single imputation creates one complete dataset from an
incomplete one, relying on specific assumptions about the missing values rather than on
the type of missing data. These assumptions are not always applicable or accurate and
can lead to biased results. Multiple imputation, on the other hand, is a robust approach
for reducing imputation bias. It generates multiple complete datasets from an incomplete
dataset, introducing random values to restore the randomness lost through imputation.
With that randomness restored, statistical analyses based on the distribution of the
imputed values become more reliable. Multiple imputation methods also offer greater
flexibility and can be applied in a wide range of scenarios.
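The contrast between the two categories can be illustrated with a minimal sketch. It assumes mean filling as the single-imputation rule and draws from a normal distribution fitted to the observed values as the multiple-imputation rule; both choices, and the names `mean_impute` and `multiple_impute`, are illustrative assumptions and are not taken from [17] or [18].

```python
import random
import statistics

def mean_impute(values):
    """Single imputation: replace each missing entry (None) with the observed mean."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

def multiple_impute(values, n_datasets, seed=0):
    """Multiple imputation sketch: each completed dataset draws its missing
    entries from a normal distribution fitted to the observed values."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mu = statistics.fmean(observed)
    sigma = statistics.stdev(observed)
    return [
        [rng.gauss(mu, sigma) if v is None else v for v in values]
        for _ in range(n_datasets)
    ]

data = [1.0, 2.0, None, 4.0, None, 6.0]
single = mean_impute(data)
multiple = multiple_impute(data, n_datasets=5)

# Pool the per-dataset means and measure how much they vary between
# imputations; a single imputed dataset cannot provide this spread.
estimates = [statistics.fmean(d) for d in multiple]
pooled_mean = statistics.fmean(estimates)
between_variance = statistics.variance(estimates)
```

The single-imputed dataset carries no trace of the uncertainty about the missing entries, whereas the spread of `estimates` across the completed datasets makes that uncertainty visible to downstream analyses.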
2.2 Generative Adversarial Networks for Data Imputation
Generative adversarial networks (GANs) have attracted considerable attention in recent
years due to their potential in data synthesis. A GAN comprises a generator and a
discriminator, both implemented as neural networks and trained in an adversarial manner.
GANs have been successfully applied in fields such as image processing and computer
vision, natural language processing, and medicine [7].
While GANs have primarily been used for data synthesis, researchers have also
explored their application to data imputation. Initially, GAN-based imputation
methods were proposed for image completion tasks [23, 24], but these models apply to
image inpainting only. More recent publications address data imputation in general,
such as GAIN [21], MisGAN [15], and GAMIN [22].
The GAIN (Generative Adversarial Imputation Nets) model [21] treats the generator as
an imputer and employs the discriminator to determine whether each component of an
input has been imputed or not. This algorithm performs well on low-dimensional
datasets with a low missing rate and even shows promise on the MNIST dataset with a
50% missing rate. However, its performance diminishes at higher missing rates, where
its output tends to converge towards zero or mean imputation.
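The component-wise roles of the two networks can be sketched as follows. This is a simplified illustration, not the GAIN implementation: the neural networks are replaced by hard-coded hypothetical outputs (`generated`, `prob_observed`), the helper names are assumptions, and training details are left out. It shows only how observed values are kept while generated values fill the gaps, and how the discriminator is scored against the mask.

```python
import math

def combine(x, mask, generated):
    """Keep observed components (mask == 1); use the generator output elsewhere."""
    return [m * xi + (1 - m) * gi for xi, m, gi in zip(x, mask, generated)]

def discriminator_loss(mask, prob_observed):
    """Per-component binary cross-entropy: the discriminator's target is the mask."""
    eps = 1e-8  # guards against log(0)
    terms = [
        m * math.log(p + eps) + (1 - m) * math.log(1 - p + eps)
        for m, p in zip(mask, prob_observed)
    ]
    return -sum(terms) / len(terms)

x = [0.2, 0.0, 0.9, 0.0]               # missing entries stored as 0.0 placeholders
mask = [1, 0, 1, 0]                    # 1 = observed, 0 = missing
generated = [0.25, 0.6, 0.8, 0.1]      # hypothetical generator output
prob_observed = [0.9, 0.3, 0.8, 0.4]   # hypothetical discriminator output

imputed = combine(x, mask, generated)
loss = discriminator_loss(mask, prob_observed)
```

In an adversarial setup of this kind, the generator is rewarded when the discriminator cannot tell which components of `imputed` came from the data and which were filled in.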
The MisGAN model [15], on the other hand, demonstrates better performance on
datasets with a high missing rate. This approach involves a GAN architecture
specifically designed for incomplete datasets, built around two generators. One
generates a mask that indicates the missing components, while the other generates
synthetic complete data. The synthetic complete data is then combined with the mask to
create synthetic missing data. A data discriminator is used to distinguish between real
and synthetic missing data. In addition, a separate generator and discriminator pair is
used for data imputation. The imputation generator aims to fool the corresponding
discriminator by generating imputed data that is indistinguishable from real complete
data.
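The step that combines synthetic complete data with a mask can be sketched as follows. The two generators are replaced by random stand-ins (`sample_data`, `sample_mask`), and the fill value of 0.0 for masked components is an assumption; none of these names come from [15].

```python
import random

def sample_data(n, rng):
    """Stand-in for the data generator: returns a synthetic complete vector."""
    return [rng.random() for _ in range(n)]

def sample_mask(n, missing_rate, rng):
    """Stand-in for the mask generator: 1 = observed, 0 = missing."""
    return [0 if rng.random() < missing_rate else 1 for _ in range(n)]

def apply_mask(complete, mask, fill=0.0):
    """Combine complete data with a mask to obtain synthetic missing data."""
    return [xi if m == 1 else fill for xi, m in zip(complete, mask)]

rng = random.Random(0)
complete = sample_data(6, rng)
mask = sample_mask(6, missing_rate=0.5, rng=rng)
synthetic_missing = apply_mask(complete, mask)
```

Because real samples are only ever seen in masked form, applying the same masking operation to generated samples lets the data discriminator compare like with like.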
GAMIN (Generative Adversarial Multiple Imputation Network) [22] was proposed as a
solution for multiple imputation in highly missing data scenarios. Inspired by MisGAN,
GAMIN introduces several modifications. First, the imputation architecture is altered
to incorporate the data generator directly into the imputation process. Second, a
novel confidence prediction and top-k imputation strategy is introduced. Finally,
GAMIN employs new loss functions that take confidence into account during training.
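The idea behind a top-k selection over candidate imputations can be sketched as follows. How GAMIN predicts confidences and what it does with the selected candidates are not described here, so the function `top_k_impute`, the candidate vectors, and the confidence scores are all hypothetical; the sketch shows only a generic "keep the k most confident candidates" rule.

```python
def top_k_impute(candidates, confidences, k):
    """Return the k candidate imputations with the highest confidence scores."""
    ranked = sorted(
        range(len(candidates)),
        key=lambda i: confidences[i],
        reverse=True,
    )
    return [candidates[i] for i in ranked[:k]]

# Hypothetical candidate imputations for one sample, each with a confidence.
candidates = [[0.1, 0.5], [0.3, 0.4], [0.9, 0.2], [0.2, 0.6]]
confidences = [0.7, 0.95, 0.1, 0.8]

selected = top_k_impute(candidates, confidences, k=2)
```

Keeping several candidates rather than one preserves the multiple-imputation character of the output, while the confidence ranking discards the least plausible draws.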
CITA 2023 ISBN: 978-604-80-8083-9