Smart manufacturing under limited and heterogeneous data: a sim-to-real transfer learning with convolutional variational autoencoder in thermoforming

ABSTRACT Data in advanced manufacturing are often sparse and collected from various sensory devices in a heterogeneous and multi-modal fashion. For such intricate input spaces, learning robust and reliable predictive models for product quality assessment entails complex nonlinear models such as deep learning. However, these 'data-greedy' models require massive datasets for training and otherwise tend to exhibit poor generalization performance. To address data paucity and data heterogeneity in smart manufacturing applications, this paper introduces a sim-to-real transfer-learning framework. Specifically, using a unified wide-and-deep learning approach, the model preprocesses structured sensory data (wide) and high-dimensional thermal images (deep) separately, concatenates the respective features, and passes them to a regressor for predicting product quality metrics. A convolutional variational autoencoder (ConvVAE) is utilized to learn concise representations of the thermal images in an unsupervised fashion. The ConvVAE is trained via a sim-to-real transfer learning approach, backed by theory-based heat transfer simulations. The proposed metamodeling framework was evaluated in an industrial thermoforming process case study. The results suggest that the ConvVAE outperforms conventional dimensionality reduction methods despite limited data. A model explainability analysis was conducted, and the resulting SHAP values demonstrated the agreement between the model's predictions, theoretical expectations, and data correlation statistics.


Introduction
The emerging modern-era manufacturing paradigm, known as smart manufacturing, leverages cutting-edge technologies such as Machine Learning (ML) and Cyber-Physical Systems (CPS) to extract meaningful insights about the underlying behavior of complex manufacturing processes, as needed for prognostic and diagnostic decision-making tasks (Li et al. 2018). In the presence of large datasets, ML models have shown superior performance in uncovering the mappings between manufacturing systems' inputs and outputs and in learning robust predictive functions (Romero et al. 2020). Nonetheless, in cases where generating sufficient data is temporally and financially infeasible, e.g. large-scale advanced manufacturing procedures, ML approaches often fail to learn the true behavior of the data (the signal) and are prone to overfitting (to the noise) (Ramezankhani et al. 2021). Such problems can be rapidly exacerbated when the size of the problem's feature space grows exponentially. This phenomenon can manifest itself even more in smart manufacturing applications, where advances in sensory technology have enabled capturing high-dimensional and well-descriptive online records of the process/manufactured part (e.g. using infrared (IR) cameras instead of thermocouples to capture the part's thermal distribution) (Khan et al. 2020).
Besides, the existing data streams from various sensory devices (e.g. tabular and visual modalities) in a given factory can result in an intricate heterogeneous feature space. This necessitates an even more exhaustive data preprocessing procedure, as it requires learning an effective approach to fuse such mixed features and map them onto a concise representation, in order to provide useful information for the subsequent ML task at hand (Bayoudh et al. 2021). Additionally, the collected historical data from the factory can contain missing entries and noisy features, both of which may significantly degrade the ML model's performance (Xu et al. 2020). Thus, for a predictive model to be both reliable and robust in such data-sparse manufacturing environments, it must address the following shortcomings. Firstly, it needs to tackle the high dimensionality of the feature space to avoid the 'curse of dimensionality' pitfall (Murphy 2012). Secondly, the model should be able to learn the highly nonlinear mapping between process settings and product quality metrics using only a limited set of available annotated data.
To overcome some of the above challenges, state-of-the-art works have been performed in recent years, a synopsis of which is provided below as related to the present work.
Dimensionality reduction (DR) has been established as a popular method for projecting the high-dimensional output of factory sensors to a concise representation with minimum possible information loss. If properly learned, such a low-level representation can help avoid overfitting when training the model with limited data. Despite the successful implementation of conventional DR methods such as Principal Component Analysis (PCA) and manifold learning approaches (Wei et al. 2017), deep learning-based DR methods such as Autoencoders (AE) have recently exhibited promising potential in more effectively extracting useful features from complex manufacturing data structures. Jung et al. (2021) developed an AE-based ML model for quality prediction, outperforming conventional ML methods in injection molding manufacturing. Feng et al. (2017) proposed a denoising AE framework for knowledge transfer across different optimization problems using an evolutionary search for resin transfer molding in composite manufacturing. In the investigation by Zhang et al. (2020), a manufacturing digital twin framework is integrated with a stacked AE to monitor product quality while remaining robust to environmental shifts. Training such deep learning models (AE and its variants), however, is a data-intensive task that entails the presence of large manufacturing datasets. Alternative approaches such as Transfer Learning (TL) (Pan and Yang 2010) and multi-fidelity learning (Meng and Karniadakis 2020) have been investigated in the literature to compensate for insufficient manufacturing data.
TL essentially aims to enhance a predictive model's performance on a task of interest by transferring learned knowledge from different but related tasks. Ramezankhani et al. (2021) proposed an active TL approach for an aerospace-grade composite manufacturing process. They trained a TL model that predicts the cured composite part's quality for the target resin material of interest under limited data, by transferring the learned knowledge from an inexpensive auxiliary data source, e.g. another composite with a different resin system. Within the smart manufacturing framework, Tercan et al. (2019) used TL for knowledge transfer across different mold designs to reduce the necessity for generating large datasets when re-training neural networks after a design adjustment. Multi-fidelity learning, similar to sim-to-real TL (Peng et al. 2018), aims to overcome the limited-data obstacle by incorporating computationally inexpensive low-fidelity simulation data into the learning process. It bridges the reality gap between the abundant low-fidelity (simulation) and limited high-fidelity (real) data by learning the mapping function between the two (Meng and Karniadakis 2020). De et al. (2020) developed a bi-fidelity learning framework in which the model learned on low-fidelity data is fine-tuned by soft-labeled samples generated from a Gaussian Process model trained on the limited high-fidelity data. The prediction uncertainty of the model was incorporated as sample weights into the fine-tuning step. For an accurate and computationally efficient prediction of composites' structural response, Guo et al. (2021) proposed a hierarchical kriging method incorporating coarse and fine finite element analyses mimicking the low- and high-fidelity data. The framework reduced the necessity to generate costly high-fidelity data while achieving high performance. In a composite autoclave processing case study, Ramezankhani et al. (2022) proposed a multi-fidelity physics-informed neural network that incorporated the physical laws of heat transfer towards data-efficient learning of the relationship between the thermal behaviours of two different composite materials.

Objective and novelty of the present study
Despite the significant efforts reviewed above, developing an end-to-end learning framework that can effectively handle limited and heterogeneous manufacturing data, while yielding robust and explainable predictions, remains rather unexplored in the literature. Accordingly, this paper aims at developing and assessing a sim-to-real TL-based deep learning framework in the presence of both limited and heterogeneous data, with the ultimate task of predicting a given manufactured product's quality metric (i.e. the model output) based on the controlled process configurations (i.e. the model input). It is assumed that the latter would typically contain different data formats and modalities (e.g. thermal images next to the tabular process control data).
Accordingly, the novelty of the work is two-fold. First, it is demonstrated that for manufacturing scenarios with high-dimensional and severely limited image-based data, the relevance of the source and target distributions is of paramount importance for achieving successful TL. In this case study, to extract informative features from thermal images in the manufacturing process, we propose a sim-to-real TL-based framework for training a convolutional unsupervised variational autoencoder (ConvVAE) as a highly nonlinear feature extractor. Specifically, the framework leverages a physics-aware source domain with abundant data, namely thermal images generated from a heat transfer simulation tool mimicking the real-world process. This is as opposed to common practices in TL where the base model is trained on open-source datasets such as ImageNet. Our results on the performed industrial example reveal that employing such a relevant source domain (heat transfer simulations) for training the base model is vital and yields better prediction performance in limited-data manufacturing scenarios. Similar notions of the significance of task similarity in TL have been reported in other application fields, such as robotic control (Zhao, Queralta, and Westerlund 2020) and computer vision (Dwivedi and Roig 2019). However, it has not been thoroughly explored in the context of advanced manufacturing.
Second, a wide-and-deep approach is implemented and assessed in the present work as a data integration strategy in advanced manufacturing; namely, to fuse the structured data (numerical readings such as pressure, deformation, etc. from sensory points) with the unstructured data (e.g. thermal camera images). We will show that combining both image (deep) and tabular (wide) data and feeding them into a single feature extractor (e.g. autoencoder variants) can cause poor prediction performance. Instead, our framework processes the deep data separately via the ConvVAE, and then its features can be fitted along with the wide data using a common regressor such as a Support Vector Machine (SVM).
Accordingly, the main contributions of the work may be summarized as follows:

• Introducing a unified TL-based solution to address the issue of concurrent limited and heterogeneous data in developing reliable predictive models for smart manufacturing settings.

The remainder of this manuscript is organized as follows. Section 2 discusses the details of the proposed modeling framework. Section 3 presents an industrial case study, followed by model development and training specifications. Section 4 provides the results and discussion on the evaluation of the framework's performance. Finally, Section 5 outlines the key takeaways and future work.

Proposed ML framework
As outlined above, a data-efficient and multimodal learning framework is aimed here to address data heterogeneity and data paucity in product quality prediction in advanced manufacturing settings, with the ultimate objective of more accurately learning and explaining the underlying behaviours of the given process and material. To this end, an overview of the proposed sim-to-real TL-based convolutional variational autoencoder framework is illustrated in Figure 1. As discussed in more detail in the subsequent sub-sections, module (a) preprocesses the structured raw data (e.g. pressure, ambient temperature and fan speed). Modules (b-d) are designed for training a ConvVAE via a theory-guided sim-to-real TL approach in order to learn a low-dimensional and informative representation of the thermal images, describing the temperature distribution on the part. The learned features from both the structured and thermal image streams are then combined in module (e) as the input of a regression model for predicting the final part quality metrics (here, the thickness at critical locations on the part, namely the corner and bottom locations).

In-situ data collection and preprocessing
In an intelligent manufacturing setting, various sensory devices with different output formats are needed to capture the complex relationship between the process settings and the quality metrics of the manufactured part. The collected data is often attained from various resources in a factory setting, each with its own unique statistical properties and formats, such as structured (e.g. tabular) and unstructured (e.g. images and sounds) data (Huang 2020). Such a heterogeneous and high-dimensional input space requires a robust data pipeline that can properly preprocess the data before passing it to a learning model. In this notion, the proposed framework initially separates and passes the raw sensor data features into a wide-and-deep preprocessing pipeline for handling the structured and image data streams (similar to (Cheng et al. 2016)) and then applies format-specific transformations. The structured data stream implements an imputation method to handle missing entries and then applies z-score normalization to the raw numerical sensory measurements. This leads to a faster convergence rate and better performance of the ML models, especially those sensitive to relative feature scales. The sensory image data (e.g. IR thermal images), on the other hand, are initially cropped, resized, and normalized, and then input to an image processing unit for dimensionality reduction.
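As a concrete illustration, a minimal sketch of the two preprocessing streams is given below, assuming Python with scikit-learn and Pillow; the mean-imputation strategy and the crop box are illustrative assumptions, not values reported in the paper.

```python
import numpy as np
from PIL import Image
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

def preprocess_tabular(X_raw):
    """Wide stream: impute missing entries, then z-score normalize."""
    X = SimpleImputer(strategy="mean").fit_transform(X_raw)  # imputation strategy assumed
    return StandardScaler().fit_transform(X)                 # z-score normalization

def preprocess_image(path, crop_box=(50, 30, 550, 430)):
    """Deep stream: crop, resize to 64x64, scale pixels to [0, 1]."""
    img = Image.open(path).convert("RGB").crop(crop_box).resize((64, 64))
    return np.asarray(img, dtype=np.float32) / 255.0
```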

Conventional dimensionality reduction
The data sparsity in advanced manufacturing necessitates a low-dimensional input feature space for model training; otherwise, the model will be prone to overfitting and poor generalization performance (Ramezankhani et al. 2021). DR approaches can be used to transform sensory images into concise and informative representations. In this study, four conventional DR approaches are selected and compared: PCA, kernel PCA (kPCA) (Jolliffe and Cadima 2016), t-distributed stochastic neighbour embedding (t-SNE), and Locally Linear Embedding (LLE) (Géron 2019).
Remark: Although the selected DR methods are commonly used in ML practice for visualizing feature spaces, only PCA, kPCA and LLE are deemed suitable for subsequent integration into a regression pipeline for reducing the dimensions of the input data. More specifically, t-SNE is often suggested to be used merely for visualizing high-dimensional and complex data, but not for generating low-dimensional representations for prediction tasks (Van Der Maaten 2009). Although it is capable of conserving the local distances when projecting the data to the low-dimensional space, by minimizing the KL-divergence between the probability distributions of the high- and low-dimensional spaces, it does not necessarily preserve the global data structure. Besides, since t-SNE does not learn the mapping function between the high- and low-dimensional spaces, it cannot be used as part of the preprocessing pipeline for projecting new (unseen) data into the condensed feature space, which is required for prediction purposes (Van Der Maaten 2009).

Convolutional variational autoencoder
AEs, as an unsupervised deep learning approach, are widely used for learning a condensed feature representation of high-dimensional data (Tschannen, Bachem, and Lucic 2018). Unlike traditional DR approaches such as PCA, and owing to the nonlinearity in their activation functions, AEs are capable of learning highly complex and nonlinear projections to the latent space. This has been shown to be a key factor for effectively extracting useful features for training complex models (J. Wang et al. 2018). The variational AE (VAE) is a variant of AEs in which the latent space is replaced by the parameters of a probability distribution, namely the Gaussian mean and variance. In other words, instead of a deterministic representation $z$, the input $x$ is mapped to a posterior distribution $p(z|x)$. However, $p(z|x)$ is intractable and needs to be approximated through variational inference by finding an approximate posterior $q(z|x)$, often parameterized by a Gaussian $\mathcal{N}(\mu_z, \sigma_z I)$, whose parameters are obtained by training the encoder. The loss function of the VAE, $J_{VAE}$, consisting of two components, is used to optimize the VAE's parameters. The first term aims to reconstruct $x$ by minimizing the negative log-likelihood $-\log p(x|z)$ with sampling from the approximated posterior $q(z|x)$. This ensures that the learned latent space is informative enough to be used as the input of the decoder responsible for reconstructing the original data (the input of the VAE). The second term is a Kullback-Leibler (KL) divergence regularizer, which minimizes the distance between the approximated posterior $q(z|x)$ and a standard Gaussian prior distribution $p(z)$.
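Putting the two terms together, the VAE objective can be written in its standard form (reconstructed here from the description above; the closed-form KL term assumes the Gaussian posterior and standard normal prior stated earlier):

$$
J_{VAE} = -\,\mathbb{E}_{z \sim q(z|x)}\big[\log p(x|z)\big] + D_{KL}\big(q(z|x)\,\|\,p(z)\big),
\qquad
D_{KL} = -\frac{1}{2}\sum_{j=1}^{N_z}\big(1 + \log\sigma_{z,j}^{2} - \mu_{z,j}^{2} - \sigma_{z,j}^{2}\big).
$$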
The KL regularizer can help alleviate model overfitting during training and establish a well-formed latent space. Incorporating the Gaussian prior, in essence, provides additional control over the learning process of the low-level latent space.
The convolutional VAE (ConvVAE) has a similar structure and loss function to the vanilla VAE mentioned above. However, instead of fully connected encoder and decoder networks, it is parametrized by convolutional and transposed convolutional layers. This makes the ConvVAE architecture suitable for training on high-dimensional image datasets. While the convolutional layers in the encoder network facilitate learning the latent space of the raw image data, the transposed convolutional layers of the decoder network increase the spatial size of the latent representation, conducting an inverse convolution with the goal of reconstructing the original image. Figure 2 demonstrates the general architecture of a ConvVAE network.

Model architecture and training
The architecture of the ConvVAE used in this work was adopted from (Ha and Schmidhuber 2018). Table 1 summarizes the specifications of the ConvVAE's layers. The input data, i.e. the cropped, resized and normalized thermal images (see Section 3.1 for details) with three RGB channels, pass through the convolutional layers (encoder), which map them to the vectors $\mu_z$ and $\sigma_z$ of size $N_z$. The low-level latent representation $z$ is then sampled from $\mathcal{N}(\mu_z, \sigma_z I)$. The decoder's four transposed convolutional layers decode the latent vector $z$ to reconstruct the original thermal image. $N_z$ is a hyper-parameter of the model and determines the size of the latent space. A stride of 2 is utilized for both the convolutional and transposed convolutional layers. All layers use the ReLU activation function except the decoder's last layer, for which a sigmoid activation is implemented, as it outputs a normalized value between 0 and 1 for each reconstructed pixel.
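A minimal TensorFlow/Keras sketch of such a ConvVAE is given below. The filter counts and kernel sizes follow the World Models architecture of Ha and Schmidhuber (2018) for 64×64×3 inputs; they are stated here as assumptions, since Table 1 carries the exact specifications used in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

N_Z = 32  # latent size; a tuned hyper-parameter (32 or 64 in this study)

def build_encoder(n_z=N_Z):
    x_in = layers.Input(shape=(64, 64, 3))
    h = layers.Conv2D(32, 4, strides=2, activation="relu")(x_in)   # 31x31
    h = layers.Conv2D(64, 4, strides=2, activation="relu")(h)      # 14x14
    h = layers.Conv2D(128, 4, strides=2, activation="relu")(h)     # 6x6
    h = layers.Conv2D(256, 4, strides=2, activation="relu")(h)     # 2x2
    h = layers.Flatten()(h)
    mu = layers.Dense(n_z)(h)        # posterior mean
    log_var = layers.Dense(n_z)(h)   # posterior log-variance
    return Model(x_in, [mu, log_var], name="encoder")

def build_decoder(n_z=N_Z):
    z_in = layers.Input(shape=(n_z,))
    h = layers.Dense(1024)(z_in)
    h = layers.Reshape((1, 1, 1024))(h)
    h = layers.Conv2DTranspose(128, 5, strides=2, activation="relu")(h)       # 5x5
    h = layers.Conv2DTranspose(64, 5, strides=2, activation="relu")(h)        # 13x13
    h = layers.Conv2DTranspose(32, 6, strides=2, activation="relu")(h)        # 30x30
    x_out = layers.Conv2DTranspose(3, 6, strides=2, activation="sigmoid")(h)  # 64x64
    return Model(z_in, x_out, name="decoder")

class ConvVAE(Model):
    """VAE with reparameterization; total loss = reconstruction + KL."""
    def __init__(self, n_z=N_Z):
        super().__init__()
        self.encoder = build_encoder(n_z)
        self.decoder = build_decoder(n_z)

    def call(self, x):
        mu, log_var = self.encoder(x)
        eps = tf.random.normal(tf.shape(mu))   # reparameterization: z = mu + sigma * eps
        z = mu + tf.exp(0.5 * log_var) * eps
        x_rec = self.decoder(z)
        # Reconstruction term (per-pixel squared error) plus the closed-form KL term.
        rec = tf.reduce_mean(tf.reduce_sum(tf.square(x - x_rec), axis=[1, 2, 3]))
        kl = tf.reduce_mean(-0.5 * tf.reduce_sum(
            1.0 + log_var - tf.square(mu) - tf.exp(log_var), axis=1))
        self.add_loss(rec + kl)
        return x_rec
```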

Sim-to-real transfer learning
Training a ConvVAE network from scratch requires a large amount of data. Considering the data limitations in manufacturing applications, a widely used remedy in such scenarios is using a pretrained network (the source) as a base, followed by fine-tuning the network's weights towards learning the task of interest (the target) using the limited available data. This approach is called TL, and it is an effective tool for training large and complex networks while avoiding overfitting when the data at hand are not diverse and abundant (Pan and Yang 2010). It has been shown that convolutional neural network (CNN) layers tend to learn generic feature representations (e.g. lines and edges) across various tasks with unrelated image datasets (Yosinski et al. 2014). Although a CNN trained on natural images might offer generic learned features for even an unrelated new task, in the limited-data regime, where the ability to sufficiently fine-tune the source network is impractical, the distance/similarity between the source and target data distributions becomes of paramount importance. A source model trained on abundant natural images (e.g. via ImageNet) might require further fine-tuning with a larger dataset for learning a totally unrelated task like detecting defects/cracks in manufacturing products/tools (Zoph et al. 2020). On the contrary, if the source model is trained on a task fairly similar to that of the target, the fine-tuning process can be less data-intensive, as the learned features are expected to share more resemblance. In advanced manufacturing settings, this can be manifested as transferring knowledge learned from a simulation model to the real (experimental) domain on the same task (Xiao et al. 2022).
In light of the above idea, a sim-to-real procedure for training the ConvVAE is adopted here. For training the source network, instead of using readily available unrelated datasets (e.g. from ImageNet), an interactive heat transfer (finite element based) simulation tool, Energy2D (Xie 2012), was used to generate a large set of related data; namely, simulated thermal distribution images of a thermoplastic sheet (see Section 3.1 for more details). The images, conveying information regarding the governing laws and constitutive equations of heat transfer, are used to pre-train the ConvVAE in a theory-guided fashion. The target network, i.e. the ConvVAE on the real IR thermal images, is then initialized with the pre-trained network. Ideally, this is expected to help the target model remain in the same basin of the weights' optimization landscape, enabling it to reach the optimal state by taking only a few gradient steps during fine-tuning (Neyshabur, Sedghi, and Zhang 2020). Thus, by utilizing TL, the ConvVAE can be fine-tuned using the limited real-world thermal images to bridge the reality gap between the simulation and the in-situ thermoforming heating process. Figure 3 depicts examples of the simulation and real-world thermal images used in this study. Finally, the encoder of the trained ConvVAE is used as the feature extractor to generate a low-level feature representation for training the regression models.
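The sim-to-real schedule can be sketched as follows, re-using the ConvVAE class above. The array names (`sim_images`, `factory_images`) and the batch size are placeholders; the epoch counts and learning rates follow Section 3.2 and Table 4.

```python
vae = ConvVAE(n_z=32)

# Source: pre-train on abundant Energy2D simulation renders (100 epochs).
vae.compile(optimizer=tf.keras.optimizers.Adam(1e-3))
vae.fit(sim_images, epochs=100, batch_size=32)

# Target: fine-tune on the scarce factory IR images (20 epochs);
# lr is 1e-3 or 1e-4, per the levels compared in Table 4.
vae.compile(optimizer=tf.keras.optimizers.Adam(1e-3))
vae.fit(factory_images, epochs=20, batch_size=32)

# The trained encoder then serves as the feature extractor: mu is used
# as the low-dimensional representation of each thermal image.
mu, _ = vae.encoder.predict(factory_images)
```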

Regression model
Once the thermal images' latent representation is obtained, the new low-dimensional features can be combined with the preprocessed structured data to generate the new feature space. Specifically, the $N_z$-dimensional latent representation is concatenated with the 4-dimensional tabular data (refer to Section 3.1 for more details). The resulting $(N_z + 4)$-dimensional feature space is then used as the input of the regression model. Next, the data is split into training (75%) and test (25%) sets. The training set is used to train the ML models, and the test set is kept for measuring the models' generalization performance. Here, four regression models are investigated; namely, Random Forest (RF), Extremely Randomized Trees (ERT), Gradient Boosting (GB), and SVM. RF is an ensemble learning method that employs n decision trees and trains each with a subsample of the training set using bootstrap aggregation (Ho 1998). The randomness in RF helps reduce the high variance of decision trees and thus mitigate the effect of overfitting. ERT is simply an RF with more randomness, induced by randomly selecting the criteria/thresholds for the data splits. Adopting this strategy can further reduce the model variance (trading for higher bias), a feature often useful for cases dealing with limited high-dimensional data (Géron 2019). GB is a boosting method in which a sequence of weak predictive models (here, decision trees) is trained to achieve a strong predictor. In particular, in GB, each predictor is trained on the prediction residual error of its predecessor (Friedman 2001). SVM is a powerful supervised ML model with robust performance on highly nonlinear datasets. It leverages kernel functions to simplify the complex feature space by mapping the input to a high-dimensional space (Vapnik, Golowich, and Smola 1997). In particular, an SVM regression model is trained using an ε-insensitive loss function, which penalizes the training points outside its allowable margin (the width of which is determined by ε) with respect to the predicted regression line. C is the SVM's regularization parameter, and its value is inversely proportional to the regularization strength. γ is a kernel-specific hyperparameter that partly determines the behavior of the kernel mapping. This study investigates three kernel functions: linear, polynomial, and radial basis function (rbf). Table 2 summarizes the hyperparameters of each model, tuned by 5-fold cross-validation with R-squared (R²) as the evaluation metric.
The R-squared metric is defined as

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},$$

where $\hat{y}_i$ is the model's prediction for the i-th sample, $y_i$ represents the corresponding observation, and $\bar{y}$ denotes the average of all observations. While staying computationally sound, a wide range of levels is selected to explore the hyper-parameter space comprehensively. The combination that yields the highest performance score is then evaluated against the unseen test set to measure the model's generalization score.
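A minimal scikit-learn sketch of this stage follows; `latent_feats`, `tabular_feats`, and `thickness` are placeholder arrays, and the grid levels shown are illustrative stand-ins for the actual levels listed in Table 2.

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVR

# Concatenate the N_z latent features with the 4 tabular features.
X = np.concatenate([latent_feats, tabular_feats], axis=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, thickness, test_size=0.25)

param_grid = {                       # illustrative; actual levels per Table 2
    "kernel": ["linear", "poly", "rbf"],
    "C": [0.1, 1, 10, 100],          # inverse regularization strength
    "gamma": ["scale", 0.01, 0.1, 1],
    "epsilon": [0.01, 0.1, 0.5],     # width of the insensitive margin
}
search = GridSearchCV(SVR(), param_grid, cv=5, scoring="r2")
search.fit(X_tr, y_tr)
print("test R^2:", search.best_estimator_.score(X_te, y_te))
```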

Model explainability assessment
To better understand the underlying behavior of the proposed black-box ML framework, a post-hoc explainability approach is utilized. Here, Shapley additive explanations (SHAP) (Lundberg and Lee 2017), an additive feature attribution approach based on game theory, is implemented. SHAP identifies the contribution of each feature to the model's prediction in an attempt to understand the underlying decision rules learned by the model. In other words, it interprets the model by assigning an importance (SHAP) value to each feature in the prediction of any particular instance. To interpret the model's decision rules, two statistics, namely the Pearson correlation and Mutual Information (MI), are measured and compared with each feature's mean absolute SHAP value. These metrics are utilized to measure the linear and nonlinear relationships between the input feature space (process setting variables) and the output (product quality metric). In particular, the Pearson correlation is the ratio of the covariance to the product of the standard deviations of two variables and essentially measures linear correlation. MI, on the other hand, measures the similarity (using the KL divergence) between the joint distribution $p(X,Y)$ and the factored distribution $p(X)p(Y)$ of two random variables, X and Y (Murphy 2012):

$$I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}.$$

From the information theory perspective, MI measures how much the uncertainty about one variable reduces after observing the other. MI is non-negative, with 0 indicating that the variables are independent. While the above definition is suitable for discrete random variables, for continuous variables MI can be estimated directly by discretizing with different bin sizes and boundary locations and taking the maximum MI obtained. Introduced by Reshef et al. (2011), this statistic is called the Maximal Information Coefficient (MIC), and it is defined as

$$\mathrm{MIC}(X;Y) = \max_{x \cdot y \,<\, B} \; \frac{\max_{G \in G(x,y)} I\big(X(G); Y(G)\big)}{\log \min(x, y)},$$

where B is the bound on the bin count, which depends on the sample size N (e.g. $B = N^{0.6}$, as suggested in (Reshef et al. 2011)). $G(x,y)$ denotes the set of 2D grids of size $x \times y$, and $X(G)$ and $Y(G)$ are the discretizations of the variables onto $G(x,y)$. MIC takes values in [0, 1], and values close to 1 indicate a strong relationship between the two variables, in linear, nonlinear, and even non-functional forms. The study and comparison of such statistics with the model's feature importance allocations can provide more in-depth insight, and consequently more trust, for manufacturing decision-making experts regarding the prediction behavior of black-box ML models.
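As an illustration, a minimal sketch of the SHAP computation is given below, assuming the fitted SVM from the regression sketch above; the use of KernelExplainer and the background sample size are assumptions, since the paper does not state the exact SHAP configuration.

```python
import shap

# Model-agnostic Kernel SHAP on the fitted SVM; a small background sample
# keeps the computation tractable.
explainer = shap.KernelExplainer(search.best_estimator_.predict,
                                 shap.sample(X_tr, 50))
shap_values = explainer.shap_values(X_te)
shap.summary_plot(shap_values, X_te)   # beeswarm summary akin to Figure 8
```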

Experiment
As elaborated in Section 1, data paucity is a common challenge towards Industry 4.0 implementation in many advanced manufacturing processes. It often leads to poor AI model predictions, as the available data is not sufficient for deeply learning the input-output mappings. The following sections describe the selected industrial thermoforming process example and its in-situ data collection procedure, along with an in-depth explanation of the input space (process configurations and raw material properties) and the present limitations on data collection from such a process.

Industrial thermoforming process
Thermoforming is known as a cost-effective and yet robust polymer manufacturing process in which a raw thermoplastic sheet is initially heated well above its glass transition temperature until it becomes soft and deformable (Figure 4(d)). Then, using a specific type of force (e.g. vacuum or physical pressure), the pre-heated sheet is stretched against the desired mold (Figure 4(a)) (Throne 2008). The quality aspects of the final product, e.g. a uniform thickness distribution, highly depend on the process settings, raw material properties and ambient conditions (Figure 4(e)). Above all, nuisance factors and non-idealities during the thermoforming process, such as the operator's poor material handling and seasonal shifts in ambient temperature, can play a significant role in determining the quality of the finished part. Moreover, it has been shown that the underlying relationship between such variables is highly nonlinear, making the optimization of process control variables a complex, multi-variate and multi-objective problem (Leite et al. 2018). Currently, the common practice in such a manufacturing process is primarily limited to trial and error, expert knowledge and prior experience. Considering the needs of the current dynamic markets, such traditional approaches may fail to provide practical solutions to shifts and changes in a timely and cost-effective manner. This entails developing a more systematic, data-driven and data-efficient modeling framework that can adaptively respond to customers' ever-changing needs by providing optimal agile designs and optimization solutions.
The thermal distribution of the thermoplastic sheet during the thermoforming process is one of the key factors directly influencing the final part quality and its thickness uniformity. Particularly, in heavy-gauge thermoforming with large and complex mold geometries, different sections of the heated sheet experience different thermomechanical behaviors depending on their location and the timing of their contact with the mold (and hence the subsequent heat transfer). Minimizing the thickness variation (i.e. avoiding very thin and thick spots) on the fabricated parts under such conditions requires an optimal non-uniform thermal distribution on the sheet, which is often achieved by utilizing a differential heating procedure (using zonal heat banks) (Li et al. 2015). Here, we aim to learn the mapping between the thermoforming process settings and the end part's thickness distribution by training the proposed ML model, which may then be used further as a surrogate model for identifying the optimal process variables (Ramezankhani et al. 2019).
In the studied industrial vacuum thermoforming process, in an attempt to uncover the causes of defects and thickness non-uniformity in the thermoformed parts, sensory data was collected under different combinations of process parameters (Table 3) during the regular operation of the factory. A FLIR IR camera was used at the facility to capture the thermal distribution images of each sheet right after the completion of the heating process and just before placing it onto the thermoforming mold. This was done to ensure that the information regarding the effects of thermal ambient nuisance factors is also incorporated in the thermal image, and hence taken into account during model training. Once the part was manufactured and de-molded, an ultrasonic thickness gauge was used to measure the thickness of the part's critical locations (e.g. corners, bottom, and sides). Since the data collection procedure at the industrial facility was time-consuming and intrusive to the normal production flow (e.g. taking thermal images requires the operator to stop the process for a few seconds), only a total of 218 samples from eight different mold geometries were collected. In this paper, two molds, mold-1 and mold-2, with sample sizes of 53 and 26 are investigated (Figure 4(b,c)). The thickness measurements from the corner and bottom of the manufactured parts are selected as the target variables estimated by the ML models. The raw material used in the thermoforming process was Polymethyl methacrylate (PMMA).
During the factory data collection, four batches of raw material were used. To account for the known but uncontrollable effect (if any) of different batches influencing the part's final thickness, a categorical variable (raw thermoplastic sheet batch number) was added to the input feature space of the ML model. In addition, for a fair comparison of the DR methods, identical thermal image preprocessing steps were implemented. All images were cropped and resized to 64 × 64 × 3, and the pixel values were normalized to [0, 1].

Simulation model development
The heat transfer simulations were performed in Energy2D using 42 heat sources. This created a 42-zone heat bank (one heater per zone) with operating powers varying between 500 W and 1000 W to capture the non-uniform thermal distribution of the in-situ samples. A thermoplastic sheet with a thermal conductivity of 0.18 W/(m·K), a specific heat capacity of 1465 J/(kg·K), and a density of 1380 kg/m³ was used. By randomly adjusting the power rate of the heaters for each simulation, various thermal distributions were obtained and used for generating 10,000 simulation images. Each simulation was continued until the sheet's maximum temperature exceeded 200°C. The images were generated by taking screenshots during the real-time heat transfer simulations. The rainbow colormap with an identical range (min of 80°C and max of 180°C) was chosen for both the actual and simulation thermal images.
For training the ConvVAE network, the Adam optimizer with learning rates of 0.001 and 0.0001 was used (see Section 4.3). Two latent space sizes, 32 and 64, were chosen empirically. The source models were trained for 100 epochs, while 20 epochs were used for fine-tuning the target models. Data augmentation was performed on the real and simulation images to enhance the dataset size and achieve a more descriptive feature space (Shorten and Khoshgoftaar 2019). The augmented data was obtained by applying horizontal flipping, cropping, rotation, shifting and noise injection. Incorporating such translation invariances into the simulation dataset also helps resemble uncontrollable real-world noises such as viewpoint, lighting, and background, which further assists with a more informative transfer of knowledge from the source to the target. For fine-tuning the pretrained ConvVAE network, 180 unlabeled factory images were used, while 38 images were set aside for evaluating the test performance. In addition, to assess the performance of the proposed sim-to-real approach, two pretrained ConvVAE models were trained on 10,000 images of 'Car racing' ($N_z = 32$) and 'Doom' ($N_z = 64$) images adapted from (Wellmer 2020) and used for comparison in Section 4.3 (please refer to (Wellmer 2020) for more information regarding the datasets).
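A sketch of such an augmentation pipeline is shown below; the specific magnitudes (rotation range, shift fraction, noise level) are illustrative assumptions rather than the values used in the study.

```python
import tensorflow as tf

rotate = tf.keras.layers.RandomRotation(0.05)        # small random rotations
shift = tf.keras.layers.RandomTranslation(0.1, 0.1)  # random shifts

def augment(img):
    """img: float32 tensor of shape (64, 64, 3) with values in [0, 1]."""
    img = tf.image.random_flip_left_right(img)                 # horizontal flip
    img = tf.image.resize(img, (72, 72))                       # enlarge, then...
    img = tf.image.random_crop(img, (64, 64, 3))               # ...random crop
    img = shift(rotate(img))
    img = img + tf.random.normal(tf.shape(img), stddev=0.02)   # noise injection
    return tf.clip_by_value(img, 0.0, 1.0)
```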
The regression models were trained with five repetitions, and the average performance was reported. Python's TensorFlow and scikit-learn libraries were used to develop and train the framework. The MICtools open-source pipeline was adopted for the MIC calculations (Albanese et al. 2018). All model development and training were performed on UBC's Advanced Research Computing (ARC) platform.

Results and discussion
In the following sections, the performance of the proposed framework in the industrial case study is investigated.

Dimensionality reduction on real thermal images collected from the factory
Four common DR approaches (PCA, kPCA, LLE and t-SNE) were implemented to project the real thermal images into a low-dimensional representation. Figure 5 illustrates the performance of these methods for two different thermoforming molds (and thus different thermal distributions). The first and second coordinates (C) of each method are plotted as the x and y axes. The sequential color bars represent the corresponding thickness at the corner of the thermoformed part (the thinnest location). In addition, a dashed gray line is plotted for better visualization of the high-thickness and low-thickness regions. It can be observed that, by only considering the first two principal components, PCA, kPCA and LLE were able to divide the data points into distinct clusters to some extent; namely, (mostly) thick (dark red/dark blue) and (mostly) thin (light red/light blue) regions. Although a fully accurate classification cannot be achieved by only using the reduced data presented in this figure, considering the information loss due to the dimensionality reduction and the absence of other processing factors (e.g. pressure and time), it may be concluded from Figure 5 that the thermal images carry basic but crucial information for learning the mapping between the thermoforming process configurations and the part quality metric (here, the thickness distribution). Furthermore, comparing the results of t-SNE with the other DR methods in Figure 5, it appears that t-SNE has reduced all points into one large cluster consisting of areas with high densities of high- and low-thickness points. Regardless, the results of all DR methods agree that the thermal images are defining factors in predicting the final part thickness. It is worth adding that the primary goal of the proposed framework is to estimate the final thickness of the part (regression) rather than a classification task. Nonetheless, the clear separation of thin and thick samples is a valid indicator that the sheet's thermal distribution is a defining factor of the final part's quality metric. The plots also suggest that in order to obtain an accurate regression model, more than the first two components of the DR methods need to be considered.
The bottom subplot of Figure 6 shows the rate of total explained variance (i.e. the proportion of data variance that lies on the selected principal components) as a function of the number of principal components (PCs). Due to the high-dimensional input space of the thermal images, it takes up to 32 principal components for PCA to reach above 90% explained variance. On the other hand, training a model on a large number of features makes it prone to overfitting. Thus, the number of PCs used during training should be selected in a way that ensures conveying enough information from the images while keeping the feature space as small as possible. The top subplot of Figure 6 demonstrates the performance of an SVM regressor trained on different numbers of PCs. The best R-squared is achieved when 16 PCs are used (82% explained variance). Adding any more features would cause the model to overfit and yield poor performance.
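This sweep can be reproduced with a few lines of scikit-learn, sketched below; `images_flat` (the matrix of flattened thermal images) and `y` (the corner thickness) are placeholder names.

```python
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Cross-validated R^2 of an SVM regressor versus the number of PCs retained.
for k in [2, 4, 8, 16, 32]:
    Z = PCA(n_components=k).fit_transform(images_flat)
    r2 = cross_val_score(SVR(), Z, y, cv=5, scoring="r2").mean()
    print(f"{k:>2} PCs: R^2 = {r2:.3f}")

# Cumulative explained variance, as in the bottom subplot of Figure 6.
print(PCA(n_components=32).fit(images_flat).explained_variance_ratio_.cumsum())
```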

Conventional ML performance analysis
The performance of the regression models with features extracted from the conventional DR methods is investigated and summarized in Figure 7. It is worth noting that, for a fair comparison of the models in the following sections, the wide stream of the proposed approach, which is responsible for processing the structured sensory data, remains unchanged for both the conventional and the proposed DR methods (the deep stream). For mold-1 and mold-2, two regression tasks (corner and bottom thickness prediction) are considered. SVM achieved the highest R-squared when trained on the principal components of kPCA and PCA. Among the tree-based models, on average, ERT outperformed RF and GB. Under SVM, kPCA has a slight overall edge over PCA, which can be explained by kPCA's kernel advantage for nonlinear projections. Another observation is that, unlike SVM, which favors the features of PCA and its variant kPCA for achieving the highest R-squared in all tasks (different molds and thickness locations), the tree-based models can benefit from different DR methods for their best performance. ERT, for instance, leverages both the LLE and kPCA features to obtain its highest R-squared on different tasks.

The new TL framework performance
This section evaluates the effect of the pre-trained models on the TL performance using the ConvVAE. Three ConvVAE networks (with the same architecture) were trained on three different datasets: Doom, Car racing, and Energy2D. The target ConvVAE was then initialized with the weights of each of these three networks and fine-tuned for 20 epochs using the limited factory thermal images. The effect of the learning rate (lr) during the fine-tuning was also investigated by incorporating two levels, 0.001 and 0.0001. Finally, two latent space sizes (32 and 64) were used and evaluated. The results are summarized in Table 4. The test total loss and the breakdown of its components, the reconstruction loss and the KL loss, are also reported. A larger latent space size seems to put equal effort into minimizing the reconstruction and the KL losses. The 32-dimensional latent space, instead, is able to further reduce the KL divergence term while incurring a higher reconstruction loss. It also yields a higher total loss in comparison to its 64-dimensional variant. This was expected: with its larger latent space, the latter has a more straightforward job condensing the information from the high-dimensional thermal images. Under equal latent space sizes, the Energy2D pretrained model demonstrates a better reconstruction loss in comparison to the theory-agnostic models (Car racing and Doom). This aligns with the hypothesis that transferring knowledge from a more related task can result in training a more accurate target network under the TL framework. Moreover, under the same number of epochs, the larger lr tends to yield lower total loss values without incurring divergence problems.
The effectiveness of the present ConvVAE encoder in extracting meaningful features within the proposed learning framework was further evaluated for the prediction of the thermoformed part's thickness distribution. The R-squared against the test set is reported in Table 5. Although by fine margins, the SVM trained on the Energy2D variants of the ConvVAE achieved an overall better generalization compared to the conventional methods. This indicates that deep learning encoders powered by convolutional layers and guided by theory (TL) can be more effective in extracting useful features from complex and high-dimensional feature spaces. Despite a lower reconstruction error, the 64-dimensional ConvVAE (Energy2D-64) underperformed its 32-dimensional counterpart, Energy2D-32. This can be explained by the limited available labeled data (thickness measurements) restricting the regression model's ability to learn a robust mapping from a high-dimensional feature space (e.g. 64 + 5) to the output space without overfitting. The fact that Energy2D-32 represents the data in a more concise space positively affects the model's generalization performance.
To further highlight the effect of knowledge transfer from physics-based simulations, the regression performance of three additional encoders was investigated: 1) no pre-training: randomly initializing the network and training the ConvVAE using the limited factory data; 2) initializing the ConvVAE with the Car racing network's weights; and 3) initializing the ConvVAE with the Doom network's weights. With no pre-training ('Random initialization'), the model exhibits the poorest regression performance. In fact, PCA and kPCA both outperform this setting. This is expected, as the large number of trainable weights in the ConvVAE and the limited high-dimensional factory data are the ingredients for an overfitting-prone model. The lack of sufficient data simply hinders the training of a reliable encoder that can be used for regression tasks. When the ConvVAE is initialized with the Car racing and Doom networks' weights, the model experiences an enhancement in its generalization R-squared. Though better than the 'Random initialization' case, their performance remains at the same level as the conventional DR methods. The takeaway from the above analysis of the ConvVAE behaviour is twofold. On the one hand, it shows that pre-training the network using unrelated images (Car racing and Doom) improved the baseline performance (Random initialization). This can be explained by the fact that, regardless of how (un)relatable the source and target images are, pre-training might still be useful, as it helps the target model learn the generic features (e.g. curves and edges) of images from the source model. This has been observed in previous studies as well (Hanneke and Kpotufe 2019; Yosinski et al. 2014). On the other hand, the results demonstrate that if a strong relationship exists between the source and target domains, the regression performance can be considerably enhanced. In this case study, the use of the Energy2D simulation images offers a reduction in the domain divergence between the source (simulation) and target (real) domains, as the images are closely related visually (i.e. both representing thermal heatmaps) and physically (i.e. both following the governing laws of heat transfer).

Comparison to other modeling approaches
Next, the performance of the developed learning framework was compared against other existing approaches in the literature on ML applications in manufacturing under data heterogeneity and data paucity. In (Liu et al. 2021), the authors leveraged models pre-trained on large-scale datasets such as ImageNet to cope with data paucity on the shop floor. Their proposed framework fixed the convolutional layers and fine-tuned the fully connected layers of the transferred CNN model using the limited available data. For a fair comparison with the current work, for all three cases, the architecture of the ConvVAE's encoder (Table 1) was utilized. Per the results in Table 5, ConvVAE-SVM achieved an overall better generalization compared to the earlier methods tested. Specifically, it was observed that the MMWA approach underperforms the present ConvVAE-Energy2D-32. This can be due to the fact that combining image and tabular data and feeding them into a highly nonlinear feature extractor such as a ConvAE can result in information loss in the tabular data (as such data typically do not need extensive feature extraction). Similarly, data augmentation alone seems to fall short in enhancing the performance of the encoder enough for regression tasks, leading to very poor R-squared values (even worse than conventional PCA). This further underscores the crucial role of knowledge transfer in cultivating the feature extraction capabilities of the ConvVAE. Similar to the Car racing and Doom pretrained networks, it was observed that using the optimized weights from a CNN trained on ImageNet improves the performance compared to random initialization. However, they result in a lower R-squared compared to the model trained on heat transfer simulation data (Energy2D-32). Overall, these results demonstrate the importance of carefully considering the design and implementation of multimodal deep learning models, particularly when dealing with complex and limited data structures such as images and tabular data. By leveraging the power of knowledge transfer and exploring methods for successful data integration, the performance of deep learning models can be improved in manufacturing applications, unlocking new opportunities for data-driven insights and decision-making.
Remark: It must be added that, unlike the conventional methods, the performance of the ConvVAE heavily depends on the availability of highly descriptive data. High-dimensional data such as images aggravate the problem by introducing large feature spaces to the model. TL proved to be an effective method for learning an informative latent space. However, more labeled factory data is necessary for learning the complex mapping between the ConvVAE feature space and the output variables (see Section 4.4). Besides, the factory measurement data in this study is subject to various known and unknown sources of uncontrolled noise (e.g. sensor error, noisy images, and human/operator error), which negatively affect the performance of the learned model. Generating more descriptive data in a more controlled and noise-free environment (e.g. a fully controlled lab-scale thermoforming process) can unleash the maximum capability of the ConvVAE in feature extraction tasks and thus lead the proposed framework to yield better generalization performance (i.e. a higher test R-squared). Regardless, the presented results demonstrate a promising step towards implementing deep learning image processing methods in advanced manufacturing applications under limited, high-dimensional, heterogeneous data.

Model explainability
For every prediction of the ConvVAE-32-SVM model (the best-performing DR-regressor combination in this case study, per Table 5), the features' SHAP values were calculated and plotted (Figure 8). The color bar corresponds to the value of each feature (low and high are shown in blue and red, respectively). The features on the y-axis are sorted based on the total sum magnitude of their SHAP values (i.e. overall importance). It can be perceived that a significant amount of the model's decisions is attributable to the latent features (LF) representing the sheet's thermal distribution, as they appear in the top portion of the feature list. This is aligned with theory, as the sheet's temperature is known to be a key factor in determining the part's final thickness. Another important takeaway is that the tabular data, particularly the raw sheet characteristics, play a vital role in the model's decision-making. This can be further understood as evidence of the effectiveness of the proposed 'wide-and-deep' architecture, in which image and tabular data are routed through different pipelines to circumvent information loss and the excessive feature engineering (caused by the convolutional layers) of the tabular data.
Figure 9 presents the SHAP, Pearson correlation and MIC values of the top 10 features based on their SHAP importance measures (Figure 8). The Pearson correlation and MIC were calculated for each feature versus the sheet's thickness (the model output). Both the Pearson correlation and MIC follow the same general trend as SHAP. This indicates that the mapping learned by the trained model is based on the true underlying relationships of the inputs and output (the signal) and is not a result of overfitting or fitting to noise. MIC, however, exhibits a stronger agreement with SHAP. This unveils the presence of a nonlinear and complex mapping. Learning an accurate and robust predictive model thus requires a fully descriptive and data-rich feature space. However, since labeled data paucity is commonplace in most advanced manufacturing applications and data generation is often financially prohibitive, alternative approaches must be adopted to compensate for the limited data availability. Auxiliary data sources (e.g. finite element simulations) can be used as a low-fidelity source of knowledge for more effective learning of such nonlinear relationships. This approach will be investigated as a next step by the authors.

Conclusion
The data paucity in smart manufacturing applications, often aggravated by the high-dimensional and heterogeneous nature of the data, entails developing a deep learning framework that can effectively learn from complex and multi-modal input data distributions (e.g. both thermal camera images and tabular data from sensory points) for predicting the manufactured product's quality metric. In this work, a TL-based deep learning framework with a wide-and-deep architecture was designed to extract useful features from various sources of collected data. In particular, in an industrial thermoforming process, a ConvVAE was trained using sim-to-real TL to map the thermoplastic sheets' high-dimensional thermal images (deep data) into an informative latent space, alongside the tabular (wide) data coming from the controlled process parameters. The ConvVAE latent space was then used, along with the structured data, as the input of an SVM regression model to predict the part quality output (here, the thickness distribution). The model evaluation results suggested that, despite the limited and heterogeneous data, the proposed framework could successfully learn the mapping between the input and output spaces, with the ConvVAE outperforming the conventional DR methods. Since the performance of such a deep learning feature extractor heavily depends on the size of the dataset it is trained on, the availability of more factory data (e.g. through establishing an online learning pipeline) can further unleash the full capacity of the ConvVAE and widen its performance gap against the conventional methods. Finally, the performed explainability analysis revealed that the model's prediction sensitivities are in strong agreement with the theory and the correlation analysis. Such insights may be crucial for manufacturing industry experts relying on ML models for their high-risk decision-making tasks.
As future work, other deep learning DR methods (e.g. beta-VAE) may be implemented and compared with the presented framework. To improve the generalization performance of the regression model, an FE simulation of the vacuum thermoforming process can also be developed and used to generate auxiliary sources of data, to compensate for the labeled data sparsity via a TL framework. In addition, the predictive model may be used as a surrogate in an optimization approach (reverse engineering) in order to identify the optimum process settings that minimize defect occurrence.

Figure 1 .
Figure 1. Proposed ML framework for limited heterogeneous data in advanced manufacturing applications (exemplified here for a thermoforming process); (a, b) in-situ data collection (image and structured) from sensory devices and data preprocessing steps for training predictive models; (c, d) training of the image feature extractor (ConvVAE) through physics-based transfer learning; (e) training regression models using the extracted features and performing explainability analysis. The light green dash-dotted arrow represents the path in which the ConvVAE is replaced by conventional DR approaches for comparison.

Figure 3 .
Figure 3. Examples of a part's thermal distribution in the thermoforming process: (a) in-situ IR image; (b) cropped IR image used for model training; (c) simulated thermal distribution in Energy2D.

Figure 4 .
Figure 4. 3D prototype of the thermoforming mold with the green and red squares indicating the bottom and corner thickness measurement locations (a); thermoformed bathtubs from mold-1 (b) and mold-2 (c); vacuum thermoforming station (d); and examples of common defects, namely cracks, webbing, and thickness variation ($T_1 > T_2$), in thermoformed parts (e).

Figure 5 .
Figure 5. 2D visualization of the latent spaces of the DR methods; (a) mold-1, (b) mold-2. Though reduced to two dimensions, the DR methods were able to separate, to some degree, the thin and thick samples. The colorbar represents the level of thickness for each sample (thermoformed product). The dashed gray lines visually divide the DR spaces into thin (light red) and thick (dark red) regions.

Figure 6 .
Figure 6. PCA explained variance and R-squared as a function of the number of PCs. Adding more PCs increases the level of explained variance of the PCA model. However, at some point (here, after 16 PCs), adding any more components (dimensions) negatively affects the regression performance due to overfitting and the curse of dimensionality.

Figure 7 .
Figure 7. Comparison of the performance of the ML models trained using different DR approaches, for mold-1 (a, b) and mold-2 (c, d).

Figure 9 .
Figure 9. SHAP, Pearson correlation and MIC comparison of top 10 features.

Table 1 .
Specifications of the employed ConvVAE architecture.

Table 2 .
Regression models employed and the corresponding hyper-parameters.

Table 3 .
Summary of the thermoforming process settings.

Table 4 .
Comparison of the ConvVAE test performance with different pretrained models, latent space sizes and learning rates.

Table 5 .
Performance of the SVM regressor trained on latent representations produced by PCA, kPCA, the ConvVAE variants, and existing approaches in the literature (Liu et al. 2021). R-squared is calculated and shown as the performance metric. The output variables are the parts' corner and bottom thicknesses.