Improving Synthetic to Realistic Semantic Segmentation With Parallel Generative Ensembles for Autonomous Urban Driving

Semantic segmentation is paramount for autonomous vehicles to gain a deeper understanding of the surrounding traffic environment and enhance safety. Deep neural networks (DNNs) have achieved remarkable performance in semantic segmentation. However, training such a DNN requires a large amount of labeled data at the pixel level, and manually annotating dense pixel-level labels is a labor-intensive task. To tackle the problem of scarce labeled data, deep domain adaptation (DDA) methods have recently been developed that exploit synthetic driving scenes so as to significantly reduce the manual annotation cost. Despite remarkable advances, these methods unfortunately suffer from a generalizability problem: they fail to provide a holistic representation of the mapping from the source image domain to the target image domain. In this article, we therefore develop a novel ensembled DDA that trains models with different upsampling strategies, discrepancy loss functions, and segmentation loss functions. The models therefore complement each other to achieve better generalization in the target image domain. Such a design not only improves the adapted semantic segmentation performance but also strengthens the model reliability and robustness. Extensive experimental results demonstrate the superiority of our approach over several state-of-the-art methods.

I. INTRODUCTION

A DEEP neural network (DNN) is powerful for extracting rich hierarchical feature representations [1], [2]. This strength in feature extraction has allowed DNN-based approaches to achieve compelling results in semantic segmentation. Deep-learning-based image segmentation models [3], [4] have been utilized to understand the traffic environment surrounding an autonomous vehicle and thereby enhance its driving safety. When such a model is deployed, each pixel of an image is assigned to one of the semantic classes, such as car, truck, tree, or pedestrian. Since the fully convolutional network (FCN) [5] was proposed, deep models have significantly outperformed traditional computer vision methods. Recently, many studies, including U-Net [6] and SegNet [7], have extended the idea of the FCN and achieved top performance in semantic segmentation. However, these methods require a vast amount of labor-intensive work to densely label images at the pixel level. For instance, it takes about one and a half hours to annotate a single image from the Cityscapes data set, which is unaffordable for most real-world applications.
Deep domain adaptation (DDA) [8], [9] is one of the most promising paradigms for obtaining a generalized model without committing to intensive manual labeling. The underlying idea is to minimize the discrepancy between two domains, i.e., the source and target domains. It is assumed that a huge amount of freely annotated data exists in the source domain (e.g., synthetic driving scenes), while no labeled data is available in the target domain (e.g., realistic driving scenes). DDA approaches seek domain-invariant feature representations or domain transformation functions, so that generalized models can be trained on data from the source domain and deployed in the target domain.
Many DDA methods [9]–[11] have been proposed to narrow the domain gap, and they can be broadly categorized into two types [12]. The first category performs feature distribution alignment between the source and target domains: in these methods [13], [14], the similarity of the feature distributions of the two domains is maximized with respect to certain predefined distance metrics. The second category improves the quality of domain alignment via adversarial learning [9], [12], [15], where a generative adversarial network (GAN) is used at the pixel level, the feature level, or the output level to ensure that the source and target domains share common characteristics across the deep-learning-based segmentation pipeline.
Despite the popularity of generative adversarial networks, a common failure pattern observed while training GANs is the collapse of large volumes of probability mass onto a few modes, as highlighted in [16]: the network can model one part of the data distribution well but fails to represent the entire distribution in the target domain. For a single generative network, it is difficult to guarantee generalizability over all cases in the target domain due to insufficient learning of diversity. To deal with this problem, the concept of ensembles can be introduced to better represent the data distribution, so that the generative networks can explore diverse alignments between the source and target domains at the global level (adversarial learning), the category level (co-training), and the local level (ensemble scheme).
In this article, we develop parallel generative ensembles of GAN (PGE-GAN) to improve the performance and reliability of traditional DDA algorithms in semantic segmentation applications. In particular, several GANs are trained in parallel with different discrepancy loss and segmentation loss functions under different upsampling strategies. The idea behind this design is that different discrepancy loss functions, segmentation loss functions, and upsampling strategies have their own strengths in recognizing specific semantic classes, so that their ensemble is likely to provide a more holistic distribution in the target domain. In light of this, these ensembles, namely, ensembles for discrepancy loss, ensembles for segmentation loss, and ensembles for upsampling, are incorporated to enhance the model generalizability. The main contributions of this article are summarized as follows.
1) We develop a novel ensemble-based DDA method by integrating multiple GAN networks. The ensemble scheme achieves remarkable performance in adapted semantic image segmentation compared with other advanced DDA methods.
2) When training the parallel models in our framework, we add new optimization targets to the loss functions: a) the generalized dice loss term in the segmentation loss function and b) the Pearson similarity in the discrepancy loss function. The generalized dice loss reduces overfitting for classes with few training samples, and the Pearson similarity alleviates the effect of scaling and shifting when the dimensions of variables differ significantly and their values may be noisy or random.
3) For each GAN model, a mixture co-training framework is adopted to learn multiple views of the same inputs by maximizing the divergence of different classifiers. Both Pearson and cosine similarities are mixed into the co-training framework to derive various views, introducing more diversity and thus better generalization in the target domain.
4) A comprehensive comparison and an ablation study demonstrate the superiority of our proposed method over state-of-the-art domain adaptation methods in transferring from GTA5 and SYNTHIA synthetic images to Cityscapes realistic images.

II. RELATED WORK

A. Semantic Segmentation
Semantic segmentation predicts pixel-level labels for an image to distinguish objects. In previous decades, handcrafted features, defined with the help of domain experts, were commonly used for semantic segmentation. An alternative is to extract features with DNNs, which can learn efficient features automatically. Since deep-learning-based methods have shown outstanding feature extraction performance, recent work on semantic segmentation is mainly conducted with DNNs, such as FCN [17], U-Net [6], and SegNet [18]. To extract efficient features, these advanced networks must be trained with a substantial amount of dense pixel annotations, while it is difficult to obtain a large number of pixel-level labels in real-world applications. To deal with this problem, annotated data can be collected from a simulator, where pixel-level annotations are produced automatically. Although some advanced simulators can synthesize high-fidelity data, a gap still exists between synthetic and real-world data; this gap can be bridged through a technique known as domain adaptation.

B. Domain Adaptation
It is noted that most existing machine learning algorithms assume that the training and testing data are drawn from the same underlying distribution [19]. However, this assumption does not always hold in practice [20]. The issue often arises when transferring knowledge from synthetic images to real images [10], [21], [22], because a domain shift exists between the training and testing data [9], [11]. Domain adaptation, which is intended to solve this issue, learns a transformation to align cross-domain data with class regularity so as to achieve better generalization in the target domain [12]. Some approaches minimize the discrepancy between domain distributions by learning domain-invariant representations, where the discrepancy can be calculated by the maximum mean discrepancy or by the mean and covariance of the feature distributions [13], [14]. Unfortunately, it is usually not sufficient to match source and target data by aligning only the mean and covariance (i.e., low-order moments) of the distributions. In recent years, adversarial learning has therefore been adopted in domain adaptation, as it is insensitive to the form of the feature distribution.
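For concreteness, the following is a minimal PyTorch sketch of one such discrepancy measure, a biased RBF-kernel estimate of the squared maximum mean discrepancy between batches of source and target features; the kernel choice and bandwidth are illustrative assumptions rather than the exact formulation of the cited methods.

```python
import torch

def rbf_mmd(source_feats: torch.Tensor, target_feats: torch.Tensor,
            bandwidth: float = 1.0) -> torch.Tensor:
    """Biased estimate of the squared MMD between two (batch, dim)
    feature batches under a Gaussian (RBF) kernel."""
    def k(a, b):
        d2 = torch.cdist(a, b) ** 2  # pairwise squared Euclidean distances
        return torch.exp(-d2 / (2.0 * bandwidth ** 2))

    return (k(source_feats, source_feats).mean()
            + k(target_feats, target_feats).mean()
            - 2.0 * k(source_feats, target_feats).mean())
```

Minimizing such an estimate over the feature extractor pulls the two feature distributions together, which is precisely the alignment these methods seek.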

C. Adversarial Learning
Adversarial learning can minimize the discrepancy between domains by using an adversarial objective with respect to a domain discriminator [12]. GAN is one of the most popular adversarial learning methods; it consists of a generative network G and a discriminative network D. The two networks are pitted against each other during the training phase, where G is dedicated to generating more realistic synthetic data and D aims to distinguish the synthetic data from the real data. In [23], the adaptation loss function is designed with the expected loss in the source domain, the domain divergence compared with the target domain, and the shared error of the ideal joint hypothesis on the two domains. At the category level, [15] improves semantic consistency in the target domain by aligning the distribution shift in the latent feature space. At the pixel level, style transfer is widely adopted as it makes data indistinguishable across domains [24], [25]. Different from the above studies, [9] and [11] consider the alignment at the category level and the pixel level simultaneously to enable joint optimization of the representation and the prediction.

III. PARALLEL GENERATIVE ENSEMBLES OF GAN
Driving scene translation can be formulated as follows: given a source domain image $x_s$ with the corresponding ground truth $y_s$ drawn from the source set $\{X_S, Y_S\}$, and a target domain image $x_t$ from the target set $X_T$ without labels, the objective is to learn a generative model $G$ for transferring knowledge from the source domain to the target domain so that $G$ can correctly predict labels (e.g., road, building, sign) at the pixel level in the target domain.
This section presents our proposed PGE-GAN method, which learns more transferable knowledge across the two domains. We enhance the co-training framework with an ensemble scheme: each model in PGE-GAN is trained with a different upsampling strategy and different discrepancy and segmentation loss functions to obtain diverse predictions.

A. Architecture of PGE-GAN
Our network architecture for each model consists of a generative network $G$ and a discriminator $D$, where $G$ is a fully convolutional segmentation network and $D$ is a convolutional classification network. As illustrated in Fig. 1, $G$ is separated into a feature extractor $F$ and two classifiers $C_1$ and $C_2$. $F$ extracts features from input images, and $C_1$ and $C_2$ then predict pixel-level labels from the extracted features. To derive the divergence of the co-training classifiers, we enforce weight diversity between $C_1$ and $C_2$ by maximizing the cosine or Pearson distance loss for different ensembles during the training phase. The distinct views of a feature provided by $C_1$ and $C_2$ thus lead to more reliable semantic predictions. In our work, the $e$th model has two classifiers $C_1^e$ and $C_2^e$ with corresponding predictions $p_1^e$ and $p_2^e$. The final prediction map $p^e$ of the $e$th model is obtained by adding up $p_1^e$ and $p_2^e$. The networks $G$ and $D$ are trained alternately until the maximum epoch is reached. Given a source domain image $x_s \in X_S$, the feature extractor $F$ provides a feature map to the classifiers $C_1^e$ and $C_2^e$, which derive the semantic prediction map $p^e$. The map $p^e$ is not only the input to $D$ for computing the adversarial loss but is also compared with the ground-truth label $y_s \in Y_S$ to derive the segmentation loss. Given a target domain image $x_t \in X_T$, $G$ takes it as input and generates a semantic prediction map $p^e$. For the target data flow, we use the discrepancy between the two predictions $p_1^e$ and $p_2^e$ of ensemble $p^e$ as an indicator to weight the adversarial loss.
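As a rough sketch of this structure (with assumed layer shapes, not the exact configuration of Fig. 1), one PGE-GAN generator with a shared feature extractor and two co-training classifier heads might be written in PyTorch as follows; the two-layer backbone is a placeholder for the pretrained ResNet-101 used in our implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class PGEGenerator(nn.Module):
    """Shared feature extractor F followed by two classifier heads C1, C2."""

    def __init__(self, num_classes: int, upsample_mode: str = "bicubic"):
        super().__init__()
        # Placeholder backbone standing in for a pretrained ResNet-101.
        self.extractor = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Two heads whose weights are pushed apart by the discrepancy loss.
        self.classifier1 = nn.Conv2d(256, num_classes, kernel_size=1)
        self.classifier2 = nn.Conv2d(256, num_classes, kernel_size=1)
        self.upsample_mode = upsample_mode

    def forward(self, x):
        size = x.shape[-2:]
        feats = self.extractor(x)
        # align_corners is only defined for the linear/cubic modes.
        kwargs = {} if self.upsample_mode == "nearest" else {"align_corners": False}
        p1 = F.interpolate(self.classifier1(feats), size=size,
                           mode=self.upsample_mode, **kwargs)
        p2 = F.interpolate(self.classifier2(feats), size=size,
                           mode=self.upsample_mode, **kwargs)
        return p1, p2, p1 + p2  # the model's prediction map is p1 + p2
```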

B. Loss Function
The loss function of the proposed network consists of three losses: 1) discrepancy loss; 2) segmentation loss; and 3) adversarial loss.
Discrepancy Loss: As explained in [12], the co-training framework can provide different views of the same feature. To increase diversity, two similarity metrics are introduced: 1) cosine [26] and 2) Pearson [27]. The two classifiers of the co-training framework need to have diverse parameters; for a DNN, this diversity comes from the weights of specific layers. One popular way to measure the difference between co-training classifiers is their cosine similarity [26]. The discrepancy loss measured by cosine similarity is

$$L_{\cos}(G) = \frac{\vec{w}_1 \cdot \vec{w}_2}{\|\vec{w}_1\|\,\|\vec{w}_2\|} \quad (1)$$

where $\vec{w}_1$ and $\vec{w}_2$ are the flattened and concatenated weights of the convolutional kernels belonging to the two co-training classifiers.
Taking into consideration variable weight scales and randomness, we also measure the divergence of the two classifiers by the Pearson similarity, which is a scale-free metric [27]. The Pearson similarity is defined as

$$L_{\text{Pearson}}(G) = \frac{(\vec{w}_1 - \bar{w}_1) \cdot (\vec{w}_2 - \bar{w}_2)}{\|\vec{w}_1 - \bar{w}_1\|\,\|\vec{w}_2 - \bar{w}_2\|} \quad (2)$$

where $\bar{w}_1$ and $\bar{w}_2$ are the means of $\vec{w}_1$ and $\vec{w}_2$, respectively.

Segmentation Loss: A source domain image $x_s$ of size $H \times W$ and a label map $y_s$ are given, where the shape of $y_s$ is $C \times H \times W$ and $C$ is the number of semantic classes. Three kinds of segmentation loss are considered: multiclass cross-entropy loss, generalized dice loss, and their combination. The multiclass cross-entropy loss is

$$L_{ce} = -\sum_{i=1}^{H \times W}\sum_{c=1}^{C} g_{ic}\,\log p_{ic} \quad (3)$$

where $p_{ic}$ is the predicted probability of class $c$ at pixel $i$ and $g_{ic}$ is the ground truth of pixel $i$; that is, $g_{ic} = 1$ when pixel $i$ belongs to class $c$ and $g_{ic} = 0$ otherwise. The dice loss measures the overlap of labeled image regions and is used to evaluate segmentation performance; it outperforms other loss functions under severe class imbalance. However, the traditional dice loss can only handle binary problems. To overcome this issue, we use the generalized dice loss, which assesses multiclass segmentation with a single score

$$L_{gdl} = 1 - 2\,\frac{\sum_{c=1}^{C} w_c \sum_{i} r_{ic}\,p_{ic}}{\sum_{c=1}^{C} w_c \sum_{i}\left(r_{ic} + p_{ic}\right)} \quad (4)$$

where $w_c$ is the invariance weight for semantic class $c$ and $r_{ic}$ is the ground truth. To address small object detection and class imbalance together, the two losses are combined [28]. According to (3) and (4), the combined segmentation loss is

$$L_{ce\text{-}gdl} = w_{ce}\,L_{ce} + w_{gdl}\,L_{gdl} \quad (5)$$

where $w_{ce}$ is the weight of the cross-entropy loss and $w_{gdl}$ is the weight of the generalized dice loss for multiclass segmentation.

Adversarial Loss: Adversarial learning trains a generative model $G$ to generate target domain samples that confuse the domain discriminator $D$, where $D$ distinguishes between samples of the source (synthetic images) and target (realistic images) domains [12]. To learn domain-invariant features, $G$ minimizes the divergence between the source and target domains, while $D$ maximizes its classification performance. This property is achieved by minimaxing the adversarial loss

$$L_{adv}(G, D) = \mathbb{E}_{x_s \sim X_S}\!\left[\log D(p^e)\right] + \mathbb{E}_{x_t \sim X_T}\!\left[\left(\lambda_{local}\,S(p_1^e, p_2^e) + \epsilon\right)\log\!\left(1 - D(p^e)\right)\right] \quad (6)$$

where $p_1^e$ and $p_2^e$ are the predictions of the co-training framework for ensemble $e$, and $S(p_1^e, p_2^e)$ denotes the similarity of the two predictions. The similarity can be computed by cosine or Pearson, which is determined by the ensemble type.
The term $\left[\lambda_{local}\,S(p_1^e, p_2^e) + \epsilon\right]$ represents the adaptive weight of the adversarial loss. The impact of the adversarial loss is controlled through $\lambda_{local}$ in the overall training objective, and $\epsilon$ is used to improve the stability of the training process.
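A minimal PyTorch sketch of these three ingredients is given below: the weight-level cosine/Pearson similarity of (1) and (2), the generalized dice loss of (4), and the adaptive adversarial weight $\lambda_{local}\,S(p_1^e, p_2^e) + \epsilon$. The default values of $\lambda_{local}$ and $\epsilon$ are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def weight_discrepancy(c1: torch.nn.Module, c2: torch.nn.Module,
                       metric: str = "pearson") -> torch.Tensor:
    """Similarity of the flattened, concatenated weights of two co-training
    classifiers, as in (1) and (2); low similarity means high diversity."""
    w1 = torch.cat([p.flatten() for p in c1.parameters()])
    w2 = torch.cat([p.flatten() for p in c2.parameters()])
    if metric == "pearson":
        # Pearson = cosine similarity of mean-centered weights (scale-free).
        w1, w2 = w1 - w1.mean(), w2 - w2.mean()
    return F.cosine_similarity(w1, w2, dim=0)

def generalized_dice_loss(probs: torch.Tensor, onehot: torch.Tensor,
                          eps: float = 1e-6) -> torch.Tensor:
    """Generalized dice loss (4); probs and onehot are (N, C, H, W), and the
    class weights are inverse squared label volumes."""
    p, g = probs.flatten(2), onehot.flatten(2)   # (N, C, H*W)
    w = 1.0 / (g.sum(dim=2) ** 2 + eps)          # (N, C)
    intersect = (w * (p * g).sum(dim=2)).sum(dim=1)
    union = (w * (p + g).sum(dim=2)).sum(dim=1)
    return (1.0 - 2.0 * intersect / (union + eps)).mean()

def adaptive_adv_weight(p1: torch.Tensor, p2: torch.Tensor,
                        lam_local: float = 0.4,
                        eps: float = 1e-3) -> torch.Tensor:
    """Per-pixel adversarial weight lam_local * S(p1, p2) + eps on target
    images; S is the channel-wise cosine similarity of the two views."""
    s = F.cosine_similarity(p1, p2, dim=1)       # (N, H, W)
    return lam_local * s + eps
```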

C. Description of Ensembles
Ensembles for Discrepancy Loss: A mixture co-training framework is used to formulate distinct views of features, as suggested in [12]. Our proposed method trains two co-training classifiers for each individual model, and the divergence of the two classifiers is measured through the discrepancy loss. Cosine and Pearson serve as the two similarity metrics in the loss terms; their main difference is that one is scale-dependent and the other is scale-free. To guarantee the diversity of models in our PGE-GAN, both metrics are used to compute the discrepancy loss of the co-training classifiers. Two ensembles are therefore derived by considering diverse discrepancy losses. One of the corresponding training objectives, with bicubic upsampling, is

$$\min_{G}\; L_{ce}(G) + \lambda_{dl}\,L_{\cos}(G) + \lambda_{adv}\,L_{adv}(G, D) \quad (7)$$

and the other is

$$\min_{G}\; L_{ce}(G) + \lambda_{dl}\,L_{\text{Pearson}}(G) + \lambda_{adv}\,L_{adv}(G, D) \quad (8)$$

where $L_{ce}(G)$ is the multiclass cross-entropy loss, $L_{\cos}(G)$ and $L_{\text{Pearson}}(G)$ are the two types of discrepancy loss, and $L_{adv}(G, D)$ is the adversarial loss. Moreover, $\lambda_{dl}$ and $\lambda_{adv}$ are the weights of the discrepancy loss and the adversarial loss, respectively, which control the relative importance of the different losses.

Ensembles for Segmentation Loss: Cross entropy is commonly used to compute the segmentation loss; it evaluates the class predictions for each pixel individually and then averages over all pixels. However, cross entropy underperforms on imbalanced data. To address this issue, the combination of cross entropy and generalized dice loss is adopted to achieve better results for imbalanced semantic classes. To keep diversity in the ensembles for both balanced and imbalanced semantic classes, two segmentation loss functions are designed in our method: one uses cross entropy alone and the other uses the combination of cross entropy and generalized dice loss. The overall training objectives of the two ensembles are

$$\min_{G}\; L_{ce}(G) + \lambda_{dl}\,L_{dl}(G) + \lambda_{adv}\,L_{adv}(G, D) \quad (9)$$

and

$$\min_{G}\; L_{ce\text{-}gdl}(G) + \lambda_{dl}\,L_{dl}(G) + \lambda_{adv}\,L_{adv}(G, D) \quad (10)$$

where $L_{ce\text{-}gdl}(G)$ is the segmentation loss measured by the combination of cross entropy and generalized dice loss, and $L_{dl}(G)$ denotes the discrepancy loss of the ensemble.

Ensembles for Upsampling: In previous work, only one upsampling strategy is utilized to recover the original size of images. There are three popular upsampling strategies: nearest neighbor, bilinear, and bicubic. However, it is difficult to determine which strategy should be used, since an interpolated pixel may rely on the nearest pixel, on several surrounding pixels, or on even more surrounding pixels. To this end, three ensembles are trained corresponding to the nearest neighbor, bilinear, and bicubic upsampling strategies. Such an ensemble decision scheme improves the performance and reliability of upsampling.
The nearest neighbor upsampling strategy interpolates a new pixel from the nearest existing pixel. The bilinear and bicubic upsampling strategies mentioned in [29] are

$$G_{bil}(x, y) = (1 - s)(1 - t)\,g(x_0, y_0) + s(1 - t)\,g(x_0 + 1, y_0) + (1 - s)\,t\,g(x_0, y_0 + 1) + s\,t\,g(x_0 + 1, y_0 + 1) \quad (11)$$

where $g$ is the raw output without upsampling, $G(x, y)$ is the upsampling function for the pixel at position $(x, y)$ in the upsampled 2-D image, $G_{bil}(\cdot, \cdot)$ is the bilinear upsampling function, $(x, y)$ is an interpolated pixel, $x_0 = \lfloor x \rfloor$, $y_0 = \lfloor y \rfloor$, $s = x - x_0$, and $t = y - y_0$; and

$$G_{bic}(x, y) = \sum_{m=0}^{3}\sum_{n=0}^{3} p_m(s)\,p_n(t)\,g(x_0 - 1 + m,\; y_0 - 1 + n) \quad (12)$$

where $G_{bic}(\cdot, \cdot)$ is the bicubic upsampling function with the cubic convolution kernels $p_0(z) = (-z^3 + 2z^2 - z)/2$, $p_1(z) = (3z^3 - 5z^2 + 2)/2$, $p_2(z) = (-3z^3 + 4z^2 + z)/2$, and $p_3(z) = (z^3 - z^2)/2$.
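In PyTorch, the three strategies map directly onto the interpolation modes of F.interpolate; the helper below (its name is illustrative) is all that distinguishes the upsampling ensembles.

```python
import torch.nn.functional as F

def upsample_logits(logits, size, mode):
    """Upsample a raw score map g to the target resolution, as in G(x, y);
    mode is one of "nearest", "bilinear", or "bicubic"."""
    if mode == "nearest":
        return F.interpolate(logits, size=size, mode="nearest")
    return F.interpolate(logits, size=size, mode=mode, align_corners=False)
```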

D. Weights of Ensembles
Different upsampling strategies rely on different numbers of existing pixels to interpolate a new one; the nearest neighbor, bilinear, and bicubic strategies use 1, 4, and 16 existing pixels, respectively. As suggested by [29], using more existing pixels results in better upsampling. To this end, we determine the weight of each ensemble with regard to its upsampling strategy, and the final prediction of a pixel is derived from the set of weighted ensembles as

$$P_f = \sum_{e=1}^{E} w_e \sum_{i=1}^{n} p_i^e \quad (13)$$

where $P_f$ is the final pixel-level prediction over semantic classes, $n$ is the number of co-training classifiers in each ensemble, and $E$ is the number of ensembles. $w_e$ is the weight of the $e$th ensemble and $p_i^e$ is the prediction vector of the $i$th co-training classifier of the $e$th ensemble; the vector spans all semantic classes, and each of its elements is the prediction value for the corresponding class. In our model, the weights are chosen based on the upsampling methods, so different upsampling methods lead to different weights. Specifically, the setting of the GAN networks combines one nearest neighbor, one bilinear, and four bicubic upsampling models, which rely on 1, 4, and 16 pixels, respectively. Thus, one interpolated pixel in our proposed method is determined by $1 + 4 + 4 \times 16 = 69$ existing pixels. The weight of a prediction model with the bicubic upsampling method is 4 times that of the bilinear one and 16 times that of the nearest neighbor one.
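A small sketch of the fusion rule (13) under this 1:4:16 weighting is shown below; the helper name and list layout are illustrative assumptions.

```python
def fuse_ensembles(predictions, weights):
    """Fusion rule (13): predictions is a list of E lists, each holding the
    n per-classifier score maps p_i^e of shape (N, C, H, W); weights holds
    one scalar w_e per ensemble. Returns per-pixel class labels."""
    fused = sum(w * sum(p_list) for w, p_list in zip(weights, predictions))
    return fused.argmax(dim=1)  # (N, H, W)

# Weights proportional to the pixels each strategy consults (1:4:16):
# one nearest neighbor model, one bilinear model, and four bicubic models.
ensemble_weights = [1.0, 4.0, 16.0, 16.0, 16.0, 16.0]
```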

IV. EXPERIMENTAL EVALUATION
In this section, the proposed method is evaluated on synthetic-to-realistic translation. The synthetic and realistic data sets used in the experiments are described in Section IV-A. The implementation details are discussed in Section IV-B, where the settings of the generative and discriminative networks are clarified along with the experimental platform. To quantitatively evaluate the performance of synthetic-to-realistic translation, our proposed method is compared against other methods in Section IV-C.

A. Data Sets
In the experiments, three benchmark data sets are used. The GTA5 synthetic data set [30] and SYNTHIA [31] are chosen as the source domains, and the Cityscapes realistic data set [32] is chosen as the target domain. In particular, GTA5 contains 24,966 high-resolution vehicle-egocentric images, produced with the photorealistic open-world computer game "Grand Theft Auto V." SYNTHIA is another synthetic collection of imagery and annotations: a large-scale set of photorealistic frames rendered from virtual cities. It comprises two image data sets and seven video sequences with a resolution of 1280 × 760 and contains 9400 images over 13 semantic categories. In contrast, Cityscapes is a real-life data set of 5000 street-scene images from Germany and neighboring countries. Examples from the GTA5, SYNTHIA, and Cityscapes data sets are presented in Fig. 2. All three data sets provide dense pixel-level labels, and their annotations are compatible with each other. Following the settings in [9], [12], and [30], all 24,966 images from the GTA5 data set and the 9400 synthetic images from the SYNTHIA data set are used for training the generative networks, respectively. The validation set of the Cityscapes data set is used for assessing the performance.

B. Implementation Details
The proposed algorithm is implemented under the PyTorch framework. In particular, the pretrained ResNet-101 [33] is chosen as the backbone of the source-only generative network G. For each GAN model, the last classification module is duplicated for co-training. The discriminative network structure in [12] is adopted to build the discriminative network D, which consists of five convolution layers with a kernel size of 4 × 4 and a stride of 2, where the channel numbers are $n_c \in \{64, 128, 256, 512, 1\}$. A parametric ReLU activation with parameter α = 0.2 follows each convolutional layer. After the last layer, an upsampling layer is added so that the size of the output matches the size of the local alignment score map. Stochastic gradient descent (SGD) and Adam are used to optimize the generative network G and the discriminative network D, respectively. More details of the network configuration are provided in Table I. In the training phase, an original input image is resized to a resolution of 512 × 1024. In the evaluation phase, a prediction map is upsampled by a factor of 2 for assessing the mean intersection over union (mIoU). All experiments are conducted on a PC with an Intel Xeon E5-2630 CPU at 2.20 GHz, a GeForce RTX 2080 Ti GPU, and 16 GB of RAM.
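Based on this description, the discriminator D might be sketched as follows; the input channel count (the number of semantic classes) and the final upsampling factor of $2^5 = 32$ are our reading of the configuration rather than a verbatim reproduction of [12].

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Five 4 x 4 stride-2 convolutions with PReLU (init alpha = 0.2) and a
    final upsampling layer, following the description above."""

    def __init__(self, num_classes: int):
        super().__init__()
        channels = [num_classes, 64, 128, 256, 512, 1]
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers.append(nn.Conv2d(c_in, c_out, kernel_size=4,
                                    stride=2, padding=1))
            if c_out != 1:  # no activation after the last layer
                layers.append(nn.PReLU(num_parameters=c_out, init=0.2))
        self.net = nn.Sequential(*layers)
        # Five stride-2 layers shrink the map by 2**5 = 32; upsample it back
        # so the output matches the local alignment score map.
        self.up = nn.Upsample(scale_factor=32, mode="bilinear",
                              align_corners=False)

    def forward(self, x):
        return self.up(self.net(x))
```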

C. Performance of Semantic Segmentation
This section provides the adapted semantic segmentation results of driving scenes for various domain adaptation methods. All experiments are evaluated on the synthetic data sets GTA5 and SYNTHIA and the realistic data set Cityscapes. To quantitatively evaluate the results of semantic segmentation, the intersection over union (IoU) is used to assess the performance for each semantic class, as it is not affected by class imbalance. The IoU is defined by

$$IoU = \frac{|Y \cap \hat{Y}|}{|Y \cup \hat{Y}|} = \frac{t_p}{t_p + f_n + f_p}$$

where $Y$ denotes the ground-truth pixel labels and $\hat{Y}$ the predicted pixel labels, and $t_p$, $f_n$, and $f_p$ represent the true positives, false negatives, and false positives, respectively. The overall performance of different methods is measured through the mean IoU, calculated by averaging the IoU over the semantic classes. Here, we first show the performance of the individual GAN models in PGE-GAN and then compare our proposed method against a number of recently developed methods.
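The sketch below computes per-class IoU and mIoU from integer label maps following this definition; it is a plain reference implementation rather than the exact evaluation script.

```python
import numpy as np

def iou_scores(pred: np.ndarray, gt: np.ndarray, num_classes: int):
    """Per-class IoU = t_p / (t_p + f_n + f_p) and the mIoU; classes absent
    from both the prediction and the ground truth are skipped."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return ious, float(np.mean(ious))
```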

1) Individual Model Performance in PGE-GAN:
The settings of the GAN networks in our PGE-GAN are provided in Table II. They are formed by different combinations of upsampling strategies, discrepancy loss functions, and segmentation loss functions; the combination of cross entropy and generalized dice loss in the segmentation loss functions is denoted CE-GDL. Each GAN network is found to outperform the others on certain specific semantic classes, so diverse ensembles can help avoid overfitting and improve generalization. The quantitative results for individual semantic classes and the mIoU (overall) on the GTA5 and Cityscapes data sets are summarized in Table III. In addition, a qualitative comparison of different ensembles is provided in Fig. 3. To identify the strengths of different ensembles more clearly, these results are obtained before image translation. The variations across individual models indicate that the ensembling scheme provides a mechanism to increase the reliability of the final outputs.
2) Comparative Results: We compare the adapted semantic segmentation results of our proposed method with those of other state-of-the-art methods on the domain adaptation task of transferring the source domain data (GTA5 and SYNTHIA) to the target domain data (Cityscapes). The qualitative results of adapted segmentation are illustrated in Fig. 4, and the quantitative comparison is summarized in Tables IV and V with regard to the IoU of individual semantic classes and the mIoU over all semantic classes (the top 3 performances are highlighted in blue). The following observations can be drawn.
1) By introducing generative ensembles in parallel, our developed method significantly improves the generalization ability from the source domain to the target domain. As a result, our method outperforms other advanced methods in terms of mIoU, reaching 46.9% and 47.1% on the GTA5-to-Cityscapes and SYNTHIA-to-Cityscapes adaptations, respectively.
2) The baseline [12] is implemented with automatic mixed precision (AMP) to save graphics memory; to keep the comparison fair, our method is also implemented with AMP.

3) Ablation Study:
In the ablation study, we analyze the contributions of the various components of our method through extensive experiments. The improvement in mIoU from adding one more component at each stage is presented in Table VI, where adversarial adaptation is denoted AA, the co-training framework CT, parallel generative ensembles PGE, and image translation IT. A poor mIoU of only 38.6% is obtained when simply training on the GTA5 data set and evaluating on the Cityscapes data set. Introducing AA, CT, PGE, and IT into the domain adaptation brings performance gains of 3.0%, 1.6%, 2.2%, and 1.5%, respectively. When all of them are adopted, our method achieves the best performance of 46.9%.

V. CONCLUSION
This article develops a parallel generative ensemble method to improve the generalization of semantic segmentation, so that a perception model trained on data generated by a simulator can generalize reliably to real-world scenarios. Given the high cost of collecting and annotating real-world traffic-scene data, this study can facilitate the development of autonomous on-road vehicles through the use of synthetic traffic-scene data. The developed method translates synthetic data into realistic data so as to bridge the gap between the two domains.
In the proposed method, multiple GAN models are trained with various discrepancy and segmentation loss functions under different upsampling strategies to obtain diverse predictions, and the ensemble scheme is applied to semantic segmentation in unsupervised domain adaptation. The developed method retains the advantage of generative adversarial learning, learning domain-invariant features through a minimax game, while overcoming the low segmentation accuracy caused by imbalanced data. Such a design makes the parallel models complement each other, yielding a synergistic improvement of segmentation performance in the target domain. Moreover, the final prediction of each pixel is determined by fusing the predictions from different ensembles, where the weight of each ensemble is set according to its upsampling strategy.
The developed method is evaluated on transfer learning tasks from the synthetic data sets GTA5 and SYNTHIA to the realistic data set Cityscapes. Our proposed method outperforms other competing methods on semantic segmentation in the target domain, with regard to both overall performance and individual class performance. We also conduct an ablation study to investigate the contributions of the various components of our method, and a comparison of different ensembles is performed to identify their respective strengths on various semantic classes. To make full use of the advantages of the different ensembles, the developed method combines them to derive the final predictions.
Furthermore, within autonomous urban driving, although several studies have attempted to apply emerging deep learning models such as convolutional neural networks, more work on optimizing the deep learning framework is required to better adapt to real-world applications. In this study, we have made pioneering efforts in introducing an ensemble scheme to handle imbalanced data and improve the reliability of unsupervised domain adaptation for semantic segmentation. Since DNNs are "black-box" models, they lack interpretability. In the future, the interpretability of the developed model should be further explored to provide human-understandable explanations of how domain-invariant information can be learned from a simulator so as to generalize reliably in real-world driving scenes.