Label-Noise Robust Deep Generative Model for Semi-Supervised Learning

Abstract Deep generative models have demonstrated an excellent ability to generate data by learning their distribution. Despite their unsupervised nature, these models can be implemented in semi-supervised learning scenarios by treating the class labels as additional latent variables. In this article, we propose a deep generative model for semi-supervised learning that offsets label noise, which is a ubiquitous feature in large-scale datasets owing to the high cost of annotation. We assume that noisy labels are generated from true labels and employ a noise transition matrix to describe the transition. We estimate this matrix by adjusting its entries to minimize its difference from the true transition matrix and use the estimated matrix to formulate the objective function for inference, which consists of an evidence lower bound and a classification risk. However, because directly minimizing the latter with noisy labels may result in an inaccurate classifier, we propose a statistically consistent estimator for computing the classification risk solely with noisy data. Empirical results on benchmark datasets demonstrate that the proposed model improves the classification performance over that of the baseline algorithms. We also present a case study on semiconductor manufacturing. Additionally, we empirically show that the proposed model, as a generative model, is capable of reconstructing data even with noisy labels.


Introduction
Recent advancements in machine learning, especially deep learning, have produced successful results in various tasks such as classification and object detection. However, these tasks generally require high-quality, comprehensively labeled data, and the labeling process is both expensive and time-consuming. Although crowdsourcing services have been employed at low cost to overcome these issues, the generated labels are often noisy, as the labelers may not be experts in the particular field. Directly training high-capacity models on such data can deteriorate their accuracy owing to memorization effects (Zhang et al. 2016; Arpit et al. 2017). Hence, it is important to develop methods to robustly train models with noisily labeled data.
Several researchers have proposed methods to address noisy labels in the training dataset. These methods fall into one of two categories: statistically inconsistent or consistent algorithms (Xia et al. 2019). Statistically inconsistent algorithms rely on heuristic methods to offset noisy labels, for example, correcting labels (Reed et al. 2014; Tanaka et al. 2018) and loss functions (Ma et al. 2018; Hendrycks et al. 2018; Arazo et al. 2019; Han, Luo, and Wang 2019), selecting reliable samples (Malach and Shalev-Shwartz 2017; Han et al. 2018; Yu et al. 2019), and weighting samples (Jiang et al. 2017; Ren et al. 2018). However, despite their high empirical performance, classifiers trained using such algorithms are not guaranteed to converge to the optimal classifiers trained using clean data.
CONTACT Heeyoung Kim heeyoungkim@kaist.ac.kr Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea.
Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/TECH.
In contrast, statistically consistent algorithms (risk-consistent algorithms in particular) employ risk-consistent estimators to achieve optimality. In this case, an estimator is risk-consistent if the empirical risk computed using noisy samples converges to the clean risk, that is, the expected classification loss with respect to the clean data, as the number of noisy samples increases. These methods typically involve a noise transition matrix, wherein the entries represent the probabilities of true labels being flipped into noisy labels. This matrix can be used to modify loss functions to obtain risk consistency (Natarajan et al. 2013; Liu and Tao 2015; Patrini et al. 2017; Kremer, Sha, and Igel 2018; Zhang and Sabuncu 2018; Xia et al. 2019) or as an additional noise-adaptation layer for learning consistent classifiers (Sukhbaatar et al. 2014; Goldberger and Ben-Reuven 2017; Patrini et al. 2017).
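For concreteness, the loss-correction idea can be sketched in a few lines of PyTorch. The following is a minimal illustration of forward correction in the style of Patrini et al. (2017), not the implementation used in any of the cited works; the function name and interface are ours:

```python
import torch

def forward_corrected_loss(clean_probs, noisy_labels, T):
    """Cross-entropy on noisy labels after pushing the classifier's
    clean-class probabilities through the transition matrix T, where
    T[i, j] = P(noisy label = j | true label = i)."""
    noisy_probs = clean_probs @ T  # predicted noisy-label probabilities
    picked = noisy_probs[torch.arange(len(noisy_labels)), noisy_labels]
    return -torch.log(picked).mean()
```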
Although the aforementioned algorithms exhibit outstanding classification performance, they assume the existence of fully labeled datasets. However, as the amount of accessible data increases, it becomes impractical to obtain labels for all the data points. To exploit partially labeled data, several researchers have studied label noise in a semi-supervised learning scenario. For example, crowdclustering was combined with semi-supervised learning to extract a pairwise similarity measure from crowdsourced labels (Yi et al. 2012). A disadvantage of this approach is that it computes the similarity function based on raw features of the data instances, which are not suitable for expressing their semantic similarity. Another approach employs ensemble learning with multiple weak annotators to generate pseudo-labels for the unlabeled samples, whose true labels are then approximated by aggregating the pseudo-labels (Yan et al. 2016). This method also does not perform well for data with a high level of label noise because the different levels of expertise among the annotators may lead to label uncertainty. A recent study used a deep generative model for semi-supervised learning with crowds (Atarashi, Oyama, and Kurihara 2018). This method attempts to improve upon previous models, but it assumes that each instance has multiple labels annotated by different labelers and models the labeling process using a simple multi-class logistic regression, which does not reflect how noisy labels are generated from true labels in the raw data. Moreover, although there exists an approach that attempts to reflect this labeling process by applying importance weighting (Zhang et al. 2019), it assumes the existence of a small amount of clean data and is designed only for binary classification.
In this study, we address the challenges of semi-supervised learning when a limited amount of labeled data contains noisy labels by implementing deep generative models. Compared with deep discriminative models, which are commonly used for classification problems, deep generative models can model complex high-dimensional data by exploring their latent representation, which better captures the semantic similarity among the data points than raw features. When the data are partially labeled, deep generative models can leverage the unlabeled data by treating classification as a missing-label imputation task.
Specifically, in this study, we adopt the M2 model (Kingma et al. 2014) as a generative model for semi-supervised learning because it allows for scalable inference and can be trained in an end-to-end fashion. Although much effort has been made to improve this model, such as normalizing flow (Rezende and Mohamed 2015) and inverse autoregressive flow (Kingma et al. 2016), these flow-based generative models generally aim for explicit density estimation and likelihood computation. Moreover, unlike the M2 model, flow-based generative models require a bijective mapping between the observation space and the latent space, and are thus incapable of dimension reduction. This issue may become critical when dealing with the large-scale real-world datasets on which such deep generative models are designed to excel.
In addition to the classifier in the M2 model, we add a transition module that models the transition between true and noisy labels. The transition module is composed of an input layer with true labels, an output layer with noisy labels, and the weight between the two layers specified by the noise transition matrix. It accepts the true labels as inputs and converts them into noisy labels based on the flipping probabilities in the transition matrix. The overall structure of the proposed model is illustrated in Figure 1.
To learn the noise transition matrix in the transition module, we apply the T-Revision method (Xia et al. 2019) because it does not rely on clean validation data. First, we initialize the transition matrix based on a confusion matrix built by directly fitting the noisy data. Subsequently, we adjust its entries to minimize its difference from the true transition matrix. A detailed explanation of this learning process is provided in Section 2.2.
The loss function of the proposed model consists of an evidence lower bound (ELBO) and a classification risk, as in the M2 model. While the relationship between the true and noisy labels is described in the ELBO through the estimated transition matrix, it is not factored into the computation of the classification risk. To specify this relationship in the classification risk, we construct and minimize a risk-consistent estimator for the classification risk that involves the estimated transition matrix based on an earlier study (Xia et al. 2019). The entire model is then trained in an end-to-end fashion.
Thus, our contributions can be summarized as follows: (i) We present a new deep generative model for semi-supervised learning that simultaneously offsets noisy labels using a transition module that captures the relationship between true and noisy labels. (ii) We show that the noise transition matrix arises naturally in the derivation of the variational lower bound and classification risk, and we minimize the latter through a risk-consistent estimator constructed based on importance reweighting. (iii) We validate the performance of our model on various datasets and compare it with several baselines that have achieved exemplary results. (iv) We quantitatively show that the learned importance weights differ between clean and noisy samples and that the proposed model can generate clean data even in the presence of label noise.
The remainder of this article is organized as follows. In Section 2, we briefly review deep generative models for semi-supervised learning and label-noise robust classification algorithms that use the noise transition matrix. In Section 3, we describe the proposed model and its inference. In Section 4, we provide experimental results on benchmark datasets and a real-world dataset. Finally, we conclude the article in Section 5.

Deep Generative Models for Semi-supervised Learning
The variational auto-encoder (VAE), a well-known deep generative model, was introduced by Kingma and Welling (2013) to model the distribution of complex data, and extended by Kingma et al. (2014) to use labels in semi-supervised learning. The extended model, called the M2 model, contains a continuous latent variable z and a latent class variable y. The generative process of this model is specified as follows:

p(y) = Cat(y|π),   (1)
p(z) = N(z|0, I),   (2)
p_θ(x|y, z) = f(x; y, z, θ),   (3)

where π denotes the relative probabilities of the multinomial distribution Cat(y|π), and 0 and I denote the zero mean and the identity covariance matrix of a centered isotropic multivariate Gaussian distribution N(z|0, I), respectively. Moreover, f(x; y, z, θ) is a likelihood function such as a Gaussian distribution, that is, p_θ(x|y, z) = N(x|μ_θ(y, z), σ²_θ(y, z)), or a Bernoulli distribution, that is, p_θ(x|y, z) = Bernoulli(p_θ(y, z)), where θ denotes the parameters of a deep neural network that models the likelihood function f(x; y, z, θ). Similar to the implementation of the VAE, an inference model is introduced for each of the latent variables y and z. The approximate joint posterior can be factorized as q_φ(y, z|x) = q_φ(z|x, y) q_φ(y|x), where each factor is defined by q_φ(z|x, y) = N(z|μ_φ(x, y), diag(σ²_φ(x, y))) and q_φ(y|x) = Cat(y|π_φ(x)). Here, φ denotes the parameters of a deep neural network that models the approximate joint posterior q_φ(y, z|x). The latent variables y and z of a data point x are also interpreted as its codes; hence, the inference model q_φ(y, z|x) is called a probabilistic encoder because it produces a distribution over the codes y and z of a data point x. Similarly, the generative model p_θ(x|y, z) is called a probabilistic decoder because it produces a distribution over x given the codes y and z.
For this model, the variational lower bounds are derived for the marginal likelihoods of the labeled and unlabeled data separately. For the labeled data, the variational lower bound is a simple extension of that of the original VAE:

log p_θ(x, y) ≥ E_{q_φ(z|x,y)}[log p_θ(x|y, z)] + log p(y) − KL(q_φ(z|x, y) ∥ p(z)) = −L(x, y).   (4)

For the unlabeled data, however, the label y is treated as a latent variable, and hence, the variational lower bound can be obtained as follows:

log p_θ(x) ≥ Σ_y q_φ(y|x)(−L(x, y)) + H(q_φ(y|x)) = −U(x),   (5)

where H denotes the entropy. Here, the approximate posterior q_φ(y|x) acts as a classifier that fills in an unobserved label y for an unlabeled data point x, and hence, only appears in Equation (5). However, this implies that the predictive distribution q_φ(y|x) only learns from the unlabeled data. As a solution, a classification loss is added to have the distribution q_φ(y|x) also learn from the labeled data. Therefore, the objective for the entire dataset is

J^α = Σ_{(x,y)∈L} L(x, y) + Σ_{x∈U} U(x) + α · E_{(x,y)∼D_L}[−log q_φ(y|x)],   (6)

where L denotes the set of labeled data points, U denotes the set of unlabeled data points, D_L denotes the distribution of labeled data, and α is a constant. Note that the last term is the added classification loss. For further details on the VAE and M2 model, refer to Kingma and Welling (2013) and Kingma et al. (2014).
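As a concrete illustration of Equation (4), the labeled-data bound can be computed with the reparameterization trick as follows. This is a schematic PyTorch sketch in which the encoder and decoder interfaces (an encoder returning the Gaussian parameters of q_φ(z|x, y) and a decoder returning Bernoulli logits) are hypothetical stand-ins, not the authors' code:

```python
import torch
import torch.nn.functional as F

def labeled_lower_bound(x, y_onehot, encoder, decoder, log_prior_y):
    """Per-sample -L(x, y) from Equation (4): reconstruction term
    plus log p(y) minus KL(q(z|x, y) || N(0, I))."""
    mu, log_var = encoder(x, y_onehot)                        # q_phi(z | x, y)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterization
    x_logits = decoder(y_onehot, z)                           # p_theta(x | y, z)
    recon = -F.binary_cross_entropy_with_logits(
        x_logits, x, reduction="none").sum(dim=-1)            # log-likelihood of x
    kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(dim=-1)
    return recon + log_prior_y - kl
```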

Robust Classification with Noise Transition Matrix
Noise transition matrices are often employed in learning with label noise in order to obtain the true class posterior p(y|x) from the noisy class posterior p(ỹ|x), where x denotes an instance, y denotes its true label, and ỹ denotes its observed (potentially noisy) label. The idea follows from the following equation:

P(ỹ = j|x) = Σ_{i=1}^{k} T_ij(x) P(y = i|x),   (7)

where T_ij(x) = P(ỹ = j|y = i, x) represents the ijth entry of the noise transition matrix T(x), and k denotes the number of classes. As Equation (7) suggests, the noise transition matrix generally depends on the instance x. This dependency is realistic in practice because confusing observations are more likely to be mislabeled. However, given only noisy data, the instance-dependent transition matrix T(x) is unidentifiable (Xia et al. 2019; Cheng et al. 2020a; Yao et al. 2020) without additional knowledge of the clean labels. The most common assumption to make it identifiable is that the label noise is instance-independent; that is,

T_ij(x) = T_ij = P(ỹ = j|y = i).

Although a few studies have attempted to approximate instance-dependent label noise (Xia et al. 2020; Cheng et al. 2020b; Berthon et al. 2021; Yao et al. 2021), the majority of current state-of-the-art methods still assume that label noise is instance-independent owing to the identifiability of the transition matrix. Conventionally, T is estimated using datasets that contain anchor points, or in other words, datasets that contain clean examples for validation (Patrini et al. 2017; Hendrycks et al. 2018). This implies that an accurate estimation of T without anchor points remains a challenging problem. Recently, Xia et al. (2019) proposed the T-Revision method to address this issue. This method uses only noisy data to construct a risk-consistent estimator for the clean risk

R(f) = E_{(x,y)∼D}[ℓ(f(x), y)],

where D denotes the distribution of clean data, and ℓ denotes the loss function. The constructed estimator is minimized to learn a classifier f that can predict y from x.
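A small numeric example may make Equation (7) concrete; the matrix and posterior values below are arbitrary choices for illustration:

```python
import numpy as np

# T[i, j] = P(noisy = j | true = i) for k = 3 classes (arbitrary values)
T = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
clean_posterior = np.array([0.7, 0.2, 0.1])  # P(y = i | x)
noisy_posterior = T.T @ clean_posterior      # Equation (7)
print(noisy_posterior)                       # [0.59 0.24 0.17]
```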
To estimate R(f) with noisy data, Xia et al. (2019) first employ importance reweighting to represent R(f) as the expectation of a weighted loss with respect to the noisy data:

R(f) = E_{(x,ỹ)∼D̃}[(P_D(y = ỹ|x) / P_D̃(ỹ|x)) ℓ(f(x), ỹ)] = E_{(x,ỹ)∼D̃}[ℓ̄(f(x), ỹ)],   (8)

where ℓ̄ is the weighted loss, and D̃ denotes the distribution of noisy data. They then build the empirical counterpart of R(f). In this case, if g_i(x) denotes a neural network that approximates P_D(y = i|x), then f(x) = argmax_{i∈{1,...,k}} g_i(x), and Equation (7) implies that (T^⊤g)_i(x) is an approximation of P_D̃(ỹ = i|x). These estimates yield the following risk-consistent estimator for R(f):

R̄_n(T, f) = (1/n) Σ_{i=1}^{n} (g_{ỹ_i}(x_i) / (T^⊤g)_{ỹ_i}(x_i)) ℓ(f(x_i), ỹ_i),   (9)

where n denotes the number of data points. When the true transition matrix T is not available, it must be learned, and hence, the estimator in Equation (9) becomes a function of both the matrix and the classifier. Here, T is first estimated using the noisy data, and the classifier f is initialized with the estimator R̄_n(T̂, f), where T̂ denotes the estimate of T. Then, the difference ΔT = T − T̂ is defined to be an additional learnable parameter, and the loss function R̄_n(T̂ + ΔT, f) is used to approximate R(f). This fine-tuning method works because the parameter ΔT is learned by minimizing Equation (9), which is risk-consistent, that is, asymptotically equal to R(f). Thus, R̄_n(T̂ + ΔT, f), which approximates R̄_n(T, f), is asymptotically equal to R(f), and therefore, T̂ + ΔT = T at optimality. Throughout this process, ΔT is adjusted such that the entries of T̂ + ΔT become as similar to those of T as possible.
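A minimal PyTorch sketch of the estimator in Equation (9) with the T-Revision slack variable might look as follows; the names are ours, and how gradients flow through the weights is a design choice we make explicit in a comment:

```python
import torch

def reweighted_risk(logits, noisy_labels, T_hat, delta_T):
    """Empirical risk-consistent estimator of Equation (9) with the
    revised transition matrix T_hat + delta_T (delta_T is learnable)."""
    g = torch.softmax(logits, dim=1)          # approximates P(y = i | x)
    T = T_hat + delta_T
    noisy_probs = g @ T                       # approximates P(noisy = i | x)
    idx = torch.arange(len(noisy_labels))
    # Importance weights g_y~ / (T^T g)_y~; detached so they act as
    # per-sample constants within the batch (one possible choice).
    w = (g[idx, noisy_labels] / noisy_probs[idx, noisy_labels]).detach()
    return (w * -torch.log(noisy_probs[idx, noisy_labels])).mean()
```

Here, delta_T would be declared as a torch.nn.Parameter initialized to zero and optimized jointly with the classifier, mirroring the fine-tuning stage described above.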

Methodology
In this section, we combine the concepts of the M2 model and T-Revision method described in Sections 2.1 and 2.2 to propose a new deep generative model for semi-supervised learning that considers label noise.

Problem Scenario
Our dataset consists of both labeled and unlabeled data points.
The labeled data points appear as pairs L̃ = {(x_1, ỹ_1), . . . , (x_n, ỹ_n)}, where x_i denotes an instance and ỹ_i ∈ {1, . . . , k} is the corresponding noisy label. The remaining data points U = {x_{n+1}, . . . , x_{n+m}} are unlabeled. We assume that each instance x_i, either labeled or unlabeled, has an unknown true label y_i ∈ {1, . . . , k}. For simplicity, we omit the index i when it is apparent that we are referring to a single data point. The purpose of our model is to accurately estimate the true label y of an unknown instance x using both the noisy dataset L̃ and the unlabeled dataset U.

Generative Model with Transition Module
We assume that each data point x has a latent feature z and a latent true label y. We also assume that the label noise is instance-independent in order to simplify the labeling process. With the above assumptions, we extend the generative process of the M2 model (Kingma et al. 2014) in Equations (1), (2), and (3) by incorporating the relationship between y and ỹ. This relationship is described by the noise transition matrix T in Equation (7). A graphical representation of the generative model is shown in Figure 2.
To explicitly model the relationship between y and ỹ, we add a transition module that bridges y and ỹ over the generative model. This module contains an input layer of y, an output layer of ỹ, and a noise transition matrix T that serves as the weight between the two layers. Figure 3 depicts the structure of the transition module. This module accepts y, an output of the classifier in the M2 model, as its input, computes its weighted sum, where the weight is determined by the noise transition matrix T, and outputs ỹ. The output ỹ is then compared with the given noisy label to optimize the parameters of the transition module and the M2 model.
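A minimal sketch of the transition module as a PyTorch layer, assuming the classifier output is a probability vector over the k true classes (class and attribute names are ours):

```python
import torch
import torch.nn as nn

class TransitionModule(nn.Module):
    """Converts clean-label probabilities into noisy-label probabilities
    via the noise transition matrix, with a T-Revision slack term."""
    def __init__(self, T_hat):
        super().__init__()
        self.register_buffer("T_hat", T_hat)                   # initial estimate
        self.delta_T = nn.Parameter(torch.zeros_like(T_hat))   # learnable revision

    def forward(self, clean_probs):
        T = self.T_hat + self.delta_T
        # Row i of T spreads the mass of true class i over the noisy labels.
        return clean_probs @ T
```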

Inference
The difficulty with the proposed generative model is that the posterior distribution of the latent variables y and z is intractable. To resolve this issue, we borrow from an earlier study (Kingma et al. 2014) the idea of introducing an approximate posterior distribution q_φ(·) as an inference model for the latent variables. We further assume that the approximate posterior has the factorized form q_φ(y, z|x) = q_φ(z|x, y) q_φ(y|x) and model the two factors with a Gaussian and a multinomial distribution, respectively, each parameterized by a deep neural network. Upon estimating the parameter φ, the distribution q_φ(y|x) is applied to predict the labels of unseen data in the test phase. Similar to previous studies (Kingma and Welling 2013; Kingma et al. 2014), we update the parameters θ and φ by maximizing the variational lower bound of the marginal log-likelihood.
While the variational lower bound is the same as in Equation (5) when the labels are missing, the variational lower bound for a data point with a noisy label is computed as follows:

log p_θ(x, ỹ) ≥ Σ_y q_φ(y|x)(−L(x, y) + log p_θ(ỹ|y)) + H(q_φ(y|x)) = −L̃(x, ỹ),   (10)

where p_θ(ỹ|y) represents the probability of transition from y to ỹ and is modeled by the noise transition matrix T. For the distribution q_φ(y|x) to correctly learn from the labeled data, we add a classification risk, as implemented in Kingma et al. (2014). Therefore, our objective function becomes

J^α = Σ_{(x,ỹ)∈L̃} L̃(x, ỹ) + Σ_{x∈U} U(x) + α · E_{(x,y)∼D}[−log q_φ(y|x)].   (11)

However, the classification risk cannot be directly minimized owing to the unknown y; thus, we need an indirect method to estimate the classification risk with noisy data. As prescribed in Xia et al. (2019), we use the derivation in Equation (8) to obtain the following equation:

E_{(x,y)∼D}[−log q_{φ,D}(y|x)] = E_{(x,ỹ)∼D̃}[(q_{φ,D}(y = ỹ|x) / q_{φ,D̃}(ỹ|x)) (−log q_{φ,D̃}(ỹ|x))],   (12)

where D and D̃ denote the distributions of clean and noisy data, respectively. We then use this equation to replace the classification risk in Equation (11), which, in turn, yields our final objective function.
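To illustrate how the transition term enters the bound in Equation (10), the following sketch evaluates the bound for one data point given per-class ELBO values; per_class_elbo is a hypothetical length-k vector of −L(x, y) values, one per candidate true label y, and log_T is the elementwise log of the transition matrix:

```python
import torch

def noisy_labeled_bound(q_y, per_class_elbo, log_T, noisy_label):
    """Equation (10): E_{q(y|x)}[-L(x, y) + log p(noisy_label | y)]
    plus the entropy of q(y|x)."""
    entropy = -(q_y * torch.log(q_y + 1e-8)).sum()
    return (q_y * (per_class_elbo + log_T[:, noisy_label])).sum() + entropy
```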
Note that the variational lower bounds in Equations (5) and (10) result in the first two terms in Equation (11) and that the classification risk is added separately. Alternatively, Equation (11), and hence our final objective function with Equation (12), can be derived directly by performing inference over y, z, and π, where the additional latent variable π denotes the parameters of the categorical distribution. The complete derivation is provided in the supplementary materials.
For training our model with this objective function, we construct a risk-consistent estimator for the classification risk similar to Equation (9) and combine it with the stochastic gradient variational Bayes (SGVB) estimator (Kingma and Welling 2013) for the ELBO. First, we derive the SGVB estimators for the ELBOs in Equations (4) and (5). In Section 2.1, we defined the approximate joint posterior as the product of a multinomial and a multivariate Gaussian with a diagonal covariance structure. This implies that given a data point x_s, we can sample from the posteriors y_s ∼ q_φ(y|x_s) = Cat(y|π_s) and z_s ∼ q_φ(z|x_s, y_s) = N(z|μ_s, diag(σ²_s)), where π_s = π_φ(x_s), μ_s = μ_φ(x_s, y_s), and σ²_s = σ²_φ(x_s, y_s) are outputs of the encoder. Then, the reparameterization trick is applied to yield the following SGVB estimators for the ELBOs in Equations (4) and (5):

−L̂(x_s, y_s) = (1/L) Σ_{l=1}^{L} log p_θ(x_s|y_s, z_{s,l}) + log p(y_s) − KL(q_φ(z|x_s, y_s) ∥ p(z)),   (13)

−Û(x_s) = Σ_y q_φ(y|x_s)(−L̂(x_s, y)) + H(q_φ(y|x_s)),   (14)

where z_{s,l} = μ_s + σ_s ⊙ ε_l with ε_l ∼ N(0, I). The estimator for the noisy-label bound in Equation (10) is obtained analogously by adding the transition term log p_θ(ỹ_s|y) inside the expectation over q_φ(y|x_s); we denote it by L̂̃(x_s, ỹ_s). Combining these estimators with the empirical counterpart of Equation (12) yields the estimator for our final objective function:

Ĵ^α = Σ_{(x_s,ỹ_s)∈L̃} L̂̃(x_s, ỹ_s) + Σ_{x_s∈U} Û(x_s) + (α/n) Σ_{i=1}^{n} (h_{ỹ_i}(x_i) / ((T̂ + ΔT)^⊤h)_{ỹ_i}(x_i)) (−log ((T̂ + ΔT)^⊤h)_{ỹ_i}(x_i)),   (15)

where Û(x_s) is defined in Equation (14). Here, h_i(x) denotes a neural network that parameterizes the approximate posterior q_{φ,D}(y = i|x), and the loss function −log q_{φ,D̃}(ỹ|x) = −log ((T̂ + ΔT)^⊤h)_{ỹ}(x) is used to train the distribution q_{φ,D̃}(ỹ|x).
In the experiments presented below, we did not constrain T̂ + ΔT to be a valid transition matrix, that is, Σ_j (T̂ + ΔT)_ij = 1, when fine-tuning ΔT, because T̂ is normalized at the first estimation stage and ΔT is initialized to be a zero matrix. However, during training, we can always make T̂ + ΔT a valid transition matrix by clipping all negative entries to zero and normalizing each row.
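The clipping-and-normalization step described above is straightforward; a one-function sketch:

```python
import torch

def project_to_valid_transition(T):
    """Clip negative entries to zero and renormalize each row to sum to one."""
    T = T.clamp(min=0.0)
    return T / T.sum(dim=1, keepdim=True)
```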

Experiments
In this section, we verify the effectiveness of our methodology through experiments on two benchmark datasets, MNIST and Fashion-MNIST, and a case study on semiconductor manufacturing. We describe the experimental results on the MNIST dataset in Section 4.1 and the case study on semiconductor manufacturing in Section 4.2. The experimental results on the Fashion-MNIST dataset are provided in the supplementary materials.

Experiments on MNIST Dataset
We first conducted a series of experiments on the MNIST dataset. The dataset contains 10 classes of hand-written digits and is composed of 60,000 training images and 10,000 test images with dimensions of 28 × 28. The datasets for semi-supervised learning were created by randomly dividing the dataset into labeled and unlabeled images; while doing so, we ensured that each class had the same number of labeled images.
In addition, because the dataset is clean, we manually corrupted it using the noise transition matrix T. To achieve this, we assumed that T has one of two representations depending on the noise type: (i) symmetric noise, in which each label has an equal probability of being flipped into any other label; and (ii) asymmetric noise, in which each label can only be flipped into one other label, that is, each row of T has only two nonzero entries. The noise transition matrices T for the two noise types are as follows:

Symmetric: T_ij = 1 − ε if j = i, and ε/(k − 1) otherwise;
Asymmetric: T_ij = 1 − ε if j = i, ε if j = i + 1 (with class k flipped to class 1), and 0 otherwise.

Here, ε denotes the noise level, and k denotes the number of classes. In this study, we varied the noise level ε from 0.2 to 0.4 for symmetric noise to observe the robustness of our algorithm. For asymmetric noise, we set ε to 0.4, because the model cannot learn anything if ε ≥ 0.5 in this case. We also considered the case in which the labels are not contaminated, that is, ε = 0, to observe whether there is some tradeoff in obtaining robustness.

Baselines. We compared the proposed model with the following approaches for semi-supervised learning with label noise: (i) ROSSEL (Yan et al. 2016), which combines pseudo-label generation and label aggregation to infer true labels; and (ii) Semi-LFC (Atarashi, Oyama, and Kurihara 2018), a generative model with multi-class logistic regression to model label noise. In ROSSEL, we replaced the multi-class support vector machine (SVM) with a multilayer perceptron (MLP) containing two hidden layers for better results. In Semi-LFC, which is designed for learning from crowds, we assumed that there is only one annotator and that the labels are generated from one classifier. We also compared the proposed model with three representative models for supervised learning with label noise: (i) S-model (Goldberger and Ben-Reuven 2017), which adds a softmax layer on top of the classifier to predict noisy labels; (ii) F-correction (Patrini et al. 2017), which uses noise transition matrices to correct loss functions; and (iii) Co-teaching+ (Yu et al. 2019), which extends Co-teaching (Han et al. 2018) by adopting the "Update by Disagreement" strategy (Malach and Shalev-Shwartz 2017). The purpose of using the S-model, F-correction, and Co-teaching+ as baselines was not to directly compare classification performance, but to verify the benefit of using unlabeled images; hence, we used only labeled images for training these three models. We also compared the proposed model with the standard M2 model (Kingma et al. 2014) to determine the robustness of our method.
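The label-corruption procedure can be sketched as follows; the pair-flipping structure of the asymmetric matrix (each class flipped to the next) is one common convention consistent with the two-nonzero-entries description above, and the function names are ours:

```python
import numpy as np

def build_transition_matrix(k, eps, symmetric=True):
    """Symmetric: off-diagonal mass eps split evenly over k-1 classes.
    Asymmetric: each class flips only to the next class w.p. eps."""
    if symmetric:
        T = np.full((k, k), eps / (k - 1))
        np.fill_diagonal(T, 1.0 - eps)
    else:
        T = np.eye(k) * (1.0 - eps)
        for i in range(k):
            T[i, (i + 1) % k] = eps
    return T

def corrupt_labels(labels, T, seed=0):
    """Sample a noisy label for each clean label from the row T[y]."""
    rng = np.random.default_rng(seed)
    k = len(T)
    return np.array([rng.choice(k, p=T[y]) for y in labels])
```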
Network structure and optimization. For our model, we used a 50-dimensional latent variable z and ReLU activation functions. Both the encoder and decoder networks were modeled using MLPs with two hidden layers, each with 500-dimensional hidden units. During training, the latent variable y is one-hot encoded and simply concatenated with the latent variable z, which is then passed to the decoder for reconstruction. The initial weight parameters were randomly sampled from N(0, 0.001²I), and the initial bias parameters were set to 0. To estimate the noise transition matrix, we used stochastic gradient descent (SGD) with a batch size of 100 and a learning rate of 0.05. The objective function for the proposed model was then optimized using Adam with a batch size of 100, decay parameters for the first and second moments of the gradients (β₁, β₂) = (0.9, 0.999), and a learning rate of 10⁻⁴. To update the noise transition matrix, the learning rate of Adam was changed to 5 × 10⁻⁷.
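In PyTorch, these settings translate roughly into the following; the exact module layout is a sketch based on the stated hyperparameters, not the authors' released code, and the classifier network q_φ(y|x) is omitted for brevity:

```python
import torch
import torch.nn as nn

k, z_dim, x_dim = 10, 50, 784
encoder = nn.Sequential(nn.Linear(x_dim + k, 500), nn.ReLU(),
                        nn.Linear(500, 500), nn.ReLU(),
                        nn.Linear(500, 2 * z_dim))  # mu and log-variance of z
decoder = nn.Sequential(nn.Linear(z_dim + k, 500), nn.ReLU(),
                        nn.Linear(500, 500), nn.ReLU(),
                        nn.Linear(500, x_dim))      # Bernoulli logits for x

for module in (encoder, decoder):
    for p in module.parameters():
        if p.dim() > 1:
            nn.init.normal_(p, std=0.001)   # weights ~ N(0, 0.001^2)
        else:
            nn.init.zeros_(p)               # zero biases

delta_T = nn.Parameter(torch.zeros(k, k))   # T-Revision slack variable
main_opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()),
                            lr=1e-4, betas=(0.9, 0.999))
revision_opt = torch.optim.Adam([delta_T], lr=5e-7)  # slower updates for delta_T
```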

Classification Results on MNIST Dataset
We used various ratios of labeled and unlabeled images to verify the benefit of using unlabeled images. Specifically, we varied the number of labeled images from 1000 to 3000 and unlabeled images from 10,000 to 55,000. We repeated the experiment 10 times for each dataset; for each experiment, we randomly selected the required number of labeled and unlabeled images from the entire dataset without replacement. The classification accuracies with standard errors are shown in Table 1.
From the table, it can be observed that our model is more robust than the other baseline models for semi-supervised learning, especially when a large number of unlabeled images are considered. For example, when the noise is asymmetric, our model with 3000 labeled and 55,000 unlabeled images improved upon ROSSEL and Semi-LFC with the same number of images by 10% and 11%, respectively. We also observed that the proposed model is a significant improvement over the standard M2 model: with settings identical to those in the previous example, the accuracy of our model was 23% higher than that of the standard M2 model. In Table 1, the abbreviations Sym-20, Sym-40, and Asym-40 for the noise category stand for 20% symmetric noise, 40% symmetric noise, and 40% asymmetric noise, respectively.
Additionally, as observed in the table, with a fixed number of labeled images, the accuracy of our model increases with the number of unlabeled images, and the amount of increase is greater than that of the other models. For example, when there are 3000 labeled images and the noise is 40% symmetric, our model with 55,000 unlabeled images improved upon the same model with 10,000 unlabeled images by 3.5%, whereas Semi-LFC only improved by 0.6%, and the accuracy of ROSSEL actually decreased by 0.9%. This demonstrates that the larger the amount of unlabeled data, the higher the accuracy of our model, implying that our model has the potential to enhance performance on real-world datasets with vast amounts of unlabeled data.
From the table, we can also observe that our model performed better than the baseline models for supervised learning, S-model, F-correction, and Co-teaching+, when there were insufficient labeled images. In particular, when the noise was 20% symmetric, the accuracy of our model with 3000 labeled images improved upon the accuracy of Co-teaching+ with the same number of labeled images by approximately 2%. The difference in accuracy was even larger when the noise was asymmetric and there were 55,000 unlabeled images, reaching 17%. However, this does not imply that our model is superior to the three models for supervised learning. Rather, we can infer that while S-model, F-correction, and Co-teaching+ were unable to use unlabeled images, our model could effectively do so to generalize to unseen images and boost classification accuracy. We can, however, confidently say that our model is preferable to the three models when the number of labeled images is limited and a large number of unlabeled images is accessible.

Importance Weights for Clean Sample Detection
The proposed model learns the clean approximate posterior q_{φ,D}(y|x), the noisy approximate posterior q_{φ,D̃}(ỹ|x), and the importance weight q_{φ,D}(y = ỹ|x)/q_{φ,D̃}(ỹ|x) for each sample. The importance weights not only allow for the computation of the loss function with ỹ, but also determine which samples are clean and should be reflected more in the loss function. If a sample is noisy, the clean approximate posterior in the numerator will be small, resulting in a small importance weight. If a sample is clean, however, the noisy approximate posterior in the denominator will be small, resulting in a large importance weight.
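Computing these weights at evaluation time is a one-liner given the classifier outputs; a sketch (names ours):

```python
import torch

def importance_weights(logits, noisy_labels, T):
    """w = q(y = noisy | x) / q(noisy | x); close to zero for mislabeled
    samples and well above zero for clean ones."""
    clean = torch.softmax(logits, dim=1)
    noisy = clean @ T
    idx = torch.arange(len(noisy_labels))
    return clean[idx, noisy_labels] / noisy[idx, noisy_labels]
```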
To verify that samples with large importance weights are indeed clean, we randomly selected samples and compared their importance weights. We also plotted the distribution of the learned importance weights to see how the weights differ between clean and noisy samples. Figure 4 shows the selected MNIST digits with their importance weights for 40% symmetric label noise, and Figure 5 shows the empirical PDFs for 20% and 40% symmetric label noise in the MNIST dataset. From the figures, we can observe that when the true label is the same as the observed label, that is, the corresponding sample is clean, its importance weight is significantly greater than zero; otherwise, its importance weight is close to zero.

Conditional Generation and Latent Visualization
The proposed model is also capable of generating data, which enables us to explore the hidden structure of the data by manipulating the latent variables z and y. This differentiates our model from the baseline models.
One way to do this is to fix the latent variable z, vary the class variable y, and use the trained generative model to construct images corresponding to the combination of the two latent variables. Figure 6 shows the simulated images for the MNIST dataset when the noise level is 40%. Here, the representative samples in the leftmost column are selected, and their latent representations z are obtained from the encoder. The obtained z is then combined with different instances of y and passed to the decoder to generate the rest of the samples. In the figure, the representative samples are the same as those in an earlier study (Kingma et al. 2014) for ease of comparison. The results demonstrate that our model can separate styles from classes even when the labels are noisy.
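This style-transfer procedure amounts to reusing one inferred z with every one-hot class vector; a sketch reusing the hypothetical two-argument encoder/decoder interfaces from the Section 2.1 sketch:

```python
import torch

def generate_class_grid(encoder, decoder, x_seed, y_seed, k=10):
    """Fix the style code z inferred from a seed image and sweep the
    class variable y to generate one image per class."""
    with torch.no_grad():
        mu, _ = encoder(x_seed, y_seed)   # posterior mean as the style code z
        ys = torch.eye(k)                 # one-hot vectors for all k classes
        zs = mu.expand(k, -1)             # the same z paired with every class
        return torch.sigmoid(decoder(ys, zs))  # k generated images
```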
Another way involves fixing the class variable y and varying the latent variable z. Figure 7 shows four MNIST digits generated over a region of the latent space under 40% symmetric label noise. Because we assume that z follows a Gaussian distribution, the digits in the figure change continuously, and digits that are close to each other have similar styles for each y. This figure hence demonstrates that our model is robust and can generate data without being hindered by label noise.

Application to Semiconductor Manufacturing
In semiconductor manufacturing, wafer fabrication involves a number of complicated processes to produce integrated circuits, or semiconductor chips, on a semiconductor wafer. After the wafer fabrication, multiple electrical tests are performed to check the functionality of each chip. Then, each chip on a wafer is assigned 0 or 1 depending on the results of wafer tests, where 0 indicates that the chip has passed all the tests and 1 indicates that the chip has failed some of the tests. The resulting binary values on a wafer form a spatial map called a wafer bin map (WBM).
Defective chips in WBMs often form systematic patterns. These defects are called systematic defects, which are different from random defects that are randomly distributed over a wafer. Figure 8 illustrates four typical patterns of systematic defects, circle, ring, zone, and scratch, together with random defects. While random defects are usually caused by random particles in the manufacturing environment and hence cannot be easily eliminated, systematic defects occur due to assignable causes. It is known that different patterns of systematic defects are related to different root causes of process failures; thus, it is crucial to detect and classify systematic defect patterns in WBMs to identify and fix the root causes of process failures (Kim, Lee, and Kim 2018).
Recently, many studies have used convolutional neural networks (CNNs) to perform WBM defect pattern classification (Kyeong and Kim 2018; Nakazawa and Kulkarni 2018; Hyun and Kim 2020; Lee and Kim 2020). Although these studies have shown outstanding classification performance owing to the high computational power of CNNs, they still make use of a large number of labeled WBMs, which are implicitly assumed to be clean. Moreover, they do not consider the case in which the WBMs are mislabeled, which is highly likely to occur due to manual annotation. Figure 9 illustrates two sets of confusing examples that are susceptible to label noise. The first two images in the figure are WBMs with a scratch pattern, both of which can be mistakenly labeled as no pattern. The last two depict WBMs with ring and zone patterns, respectively, and one may be mislabeled as the other due to their similarity in length, thickness, and position. If such samples are indeed mislabeled and used to train the existing models suggested by the aforementioned studies, the accuracy of the classifiers will clearly decline; thus, it is necessary to consider label noise in WBM defect pattern classification.
To classify defect patterns, we first followed the WBM data generation procedure in Kyeong and Kim (2018) to simulate WBMs with systematic defects. The simulated WBMs mimic real WBMs well and can be generated in large numbers. We considered four typical classes of systematic defect patterns, circle, ring, zone, and scratch, and added one more class containing WBMs with only random defects, marked as no pattern. We generated 5000 WBMs for each class, each of size 56 × 53, for a total of 25,000 WBMs. We then divided the dataset into 20,000 training and 5000 test WBMs, ensuring that each class contained the same number of training and test images.
To inject label noise into the simulated WBMs, we again followed the procedure in Section 4.1 and manually corrupted the labels using the noise transition matrix. As mentioned in Section 4.1, we considered both symmetric and asymmetric noise at various noise levels. However, because the WBM dataset differs from the benchmark datasets, we injected asymmetric noise by manipulating the transition matrix based on an inspection of the confusing WBM examples in Figure 9. Specifically, we generated the transition matrix by assuming that the ring and zone patterns can only be mislabeled as each other, the circle and scratch patterns can only be mislabeled as no pattern, and no pattern can only be mislabeled as scratch. With the classes ordered as (circle, ring, zone, scratch, no pattern), the corresponding noise transition matrix is

T =
[1−ε   0    0    0    ε ]
[ 0   1−ε   ε    0    0 ]
[ 0    ε   1−ε   0    0 ]
[ 0    0    0   1−ε   ε ]
[ 0    0    0    ε   1−ε],

where ε again denotes the noise level. We selected 1000 labeled WBMs and varied the number of unlabeled WBMs from 5000 to 19,000. We repeated the experiment 10 times and, for each experiment, randomly sampled the labeled and unlabeled images without replacement from the entire training dataset of 20,000 WBMs. The performance of the proposed model was compared with the baseline methods considered in Section 4.1. The classification accuracies with standard errors are shown in Table 2, which shows that our model is superior to the baseline models. Specifically, when the noise is 20% symmetric, our model with 19,000 unlabeled WBMs improved upon the standard M2 model, ROSSEL, and Semi-LFC with the same number of unlabeled WBMs by 11%, 8%, and 47%, respectively. Moreover, when the noise is 40% symmetric, the accuracy of our model with 19,000 unlabeled WBMs was higher than that of S-model by 17%, F-correction by 15%, and Co-teaching+ by 17%. The difference is even larger when the noise is 40% asymmetric, reaching a maximum of 30%.
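The WBM transition matrix above can be written directly; a short NumPy snippet building the 5 × 5 matrix under the mislabeling assumptions stated above (the class ordering is ours):

```python
import numpy as np

classes = ["circle", "ring", "zone", "scratch", "no pattern"]
eps = 0.4
T = np.eye(5) * (1.0 - eps)
T[0, 4] = eps   # circle     -> no pattern
T[1, 2] = eps   # ring       -> zone
T[2, 1] = eps   # zone       -> ring
T[3, 4] = eps   # scratch    -> no pattern
T[4, 3] = eps   # no pattern -> scratch
```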
To verify that our model, as a generative model, can also generate WBM images, we followed the procedure in Section 4.1.3 to generate three WBM samples for each defect pattern. Because the last hidden layer of the decoder outputs values between 0 and 1, which makes the generated WBM images blurry, we set a threshold to binarize the output values and hence produce clean WBM images. Based on empirical results, we found that 0.15 is a reasonable threshold. Figure 10 depicts the generated WBM samples after binarization for 20% symmetric noise. From the figure, we can observe that our model is able to generate WBM images even with corrupted labels. Moreover, by controlling the latent variable y, we can obtain WBM images with a specific desired defect pattern. While the simulated WBM images used as training data in this study are generated using specific formulas and hence are highly idealized, our model can generate more realistic WBM images in large numbers. This property can be regarded as a positive side effect of our model: the ability to generate a large number of correctly labeled WBM images may greatly help studies that require WBM images with accurate supervision.

Conclusion
In this article, we discussed a solution for implementing semi-supervised learning when labels are noisy. We designed a generative model that employs a noise transition matrix and trained the model to learn the transition matrix and classifier. The key idea of our model is to add a transition module over the classifier in the M2 model and to construct a risk-consistent estimator for the clean risk that can be computed using only noisy data. We conducted experiments with benchmark datasets and a simulated WBM dataset to demonstrate that the proposed model is capable of both robust classification and data generation with noisy data. In the future, we aim to extend this model to incorporate sample selection approaches in an end-to-end fashion. Furthermore, we plan to extend this model to handle situations where the label noise is instance-dependent or where both instances and labels are corrupted.

Supplementary Materials
Supplementary document: The pdf file contains: (i) the derivation of the objective function used for the proposed model; and (ii) the experimental results for the Fashion-MNIST dataset. Python code: "Code.zip" contains the code and datasets used in this article.