FADN: Fully Connected Attitude Detection Network Based on Industrial Video

In 3-D attitude angle estimation, monocular vision-based methods are often adopted for their speed and efficiency. However, these methods are limited by algorithmic complexity and scene specificity: each algorithm must be matched to the characteristics of the cooperative object and the scene. In this article, we propose a fully connected attitude detection network (FADN), which combines a neural network with traditional algorithms for 3-D attitude angle estimation. FADN covers the whole process from the input of a single frame in an industrial video stream to the output of the corresponding 3-D attitude angle estimate. Benefiting from the end-to-end estimation framework, FADN avoids tedious matching algorithms and is therefore readily portable. A series of comparative experiments based on the rendering software 3-D Studio Max (3d Max) has been carried out to evaluate the performance of FADN. The experimental results show that FADN achieves high estimation accuracy and fast running speed. At the same time, the simulation results reliably demonstrate the feasibility of FADN and motivate further research in real scenarios.


I. INTRODUCTION
As a prerequisite for many application scenarios such as intelligent robotics, aerospace, and industry, real-time and accurate 3-D attitude angle estimation has attracted tremendous attention [1], [2]. Recent years have witnessed the flourishing of attitude angle estimation, which can be divided into monocular vision based, binocular vision based, and multivision based methods, according to the number of cameras [3], [4]. The monocular vision based methods are widely used for their advantages of simple equipment, low cost, and real-time operation.
At the same time, attitude angle estimation can also be divided into visual correspondence based algorithms [5] and deep learning based methods [6]. Currently, the former is the mainstream method in industrial applications. However, its inevitable disadvantage is that a corresponding algorithm must be designed for each application scenario. Typically, the following processes are needed: feature design, such as point features and line features [7]; image processing for feature matching; and algorithm solving. Moreover, accurate feature matching mostly relies on additional auxiliary equipment such as lasers, which puts higher demands on the hardware. The deep learning based methods can achieve end-to-end detection, automatically fitting the camera imaging model and outputting the 3-D attitude angle directly. Therefore, a series of cumbersome tasks such as manual calibration, feature extraction, and matching are avoided. However, most deep learning based methods remain at the stage of synthetic simulation and theory. Although good results have been achieved on benchmark data sets such as LINEMOD [8], OCCLUSION [9], and T-LESS [10], they have not been widely adopted in industry.
Many deep learning based methods utilize a convolutional neural network (CNN) to estimate the 3-D attitude angle. Thus, through large-scale deep learning, an end-to-end estimation can be implemented directly from the input RGB image to the output attitude angle. However, the CNN frameworks they rely on are mainly designed for target detection and image segmentation. They fail to fully analyze the characteristics of 3-D attitude angle estimation and to design a specialized, suitable estimation network accordingly.
In this article, we propose a fully connected attitude detection network (FADN) to estimate the 3-D attitude angle. To verify its validity, we also build a data set with the rendering software 3-D Studio Max (3d Max), and subsequent experiments are conducted on this data set. The promising results show that FADN can achieve fast and accurate estimation on the one hand, and on the other hand promote the industrialization of FADN and lay the foundation for future work. The main contributions of this article can be summarized as follows.
1) For the first time, we use a fully connected neural network architecture to estimate the 3-D attitude angle. Different from the popular CNN-based methods, it combines the characteristics of estimation with deep learning. The relationship between the 2-D image and the 3-D attitude angle can be learned automatically, which greatly reduces the manual workload of monocular visual estimation.
2) We construct a complete feature point extraction and matching algorithm, which provides accurate pixel coordinates as the input of FADN.
3) We change the joint output mode to an independent output mode, which better adapts to the differences among the three attitude angles and further improves the output accuracy.
4) For industrialization, we build a specialized data set using rendering software. It contains one million training samples and 1000 test samples.
The rest of this article is organized as follows. Section II introduces the related work. Section III describes the methodology of our method. Section IV explains our data set and the experiments in detail. Finally, Section V concludes this article.

II. RELATED WORK
The method based on monocular vision has the advantage of simple equipment, utilizing only RGB images. And the method based on deep learning is fast and convenient, realizing an end-to-end estimation. Therefore, in order to inherit the advantages of both, we mainly focus on the deep learning based estimation method with RGB images. It is necessary to review the basic flow of monocular visual attitude angle estimation, the deep learning methods based on CNN frameworks, and the related properties of the fully connected network.

A. Monocular Vision System
The monocular vision system uses RGB images to inversely solve the 3-D attitude angle, which is a complex task due to the lack of depth information. Generally, additional information is needed to assist the estimation, and many monocular methods are currently available, such as stereoscopic monocular odometry [11], [12]. The essence of attitude estimation based on monocular vision is to establish a camera imaging model, which reflects the 2-D/3-D mapping relationship. The model building process involves camera calibration, lens distortion correction, etc. [13]. In addition, the selection of features on the cooperative object also involves many aspects; standard algorithms typically utilize point features, line features, and surface features [7]. Researchers have long studied individual links in these processes to improve the accuracy of estimation. For example, Tian et al. [14] proposed a new calibration approach to relocate the camera's pose. Konigseder et al. [15] proposed an attitude estimation strategy based on the extended Kalman filter and the unscented Kalman filter.

B. Convolutional Neural Network
Recently, CNN based deep learning methods have been widely used and have achieved excellent results in detection, classification, and regression [16]-[18]. CNN frameworks with superior performance have been proposed continuously, such as GoogLeNet, R-CNN, and VGG [19]-[21]. At the same time, research on the role of each network layer has also been carried out in depth. Zeiler et al. [22] provided a deep understanding of the functions of the middle feature layers and the operation of the classifier. Bojarski et al. [23] proposed a new method for visualizing CNNs.
Deep learning also provides a new possibility for attitude angle estimation, and many methods using CNN have emerged. Brachmann et al. [24] developed a framework to reduce the uncertainty in the estimation. Tekin et al. [25] proposed a seamless single-shot approach to predict the pose of an object. Kehl et al. [26], [27] presented a method based on SSD to detect 3-D model instances and estimate their attitudes. Most of the deep learning based methods are evaluated on existing data sets such as LINEMOD [8], OCCLUSION [9], and T-LESS [10]. However, due to limited conditions, the current deep learning based methods have not been widely adopted in practical industrial applications. Therefore, we pursue a method that can be applied in real industrial scenarios.

C. Fully Connected Network
The fully connected network is based on the dense layer, which is completely different from CNN. In essence, it connects every input to every output, with high throughput, high reliability, and low latency. What a CNN learns is a generalized feature [22]: the shallow convolutional layers extract intuitive features that are sensitive to local regions of the image, and the higher layers extract more abstract features. In contrast, the dense layer flattens the learned features into a 1-D vector for integration and adjusts its weights for the specific task. Therefore, the dense layer is more targeted, more stable, and more resistant to interference. For example, transfer learning is an emerging technique that typically takes advantage of dense layers: it replaces the original dense layers at the end of a previously trained CNN model and fine-tunes them for other application tasks [28], [29].
According to the excellent characteristics of the fully connected network, we creatively construct FADN for attitude angle estimation. The task of the dense layer is to establish a parameter mapping from low-dimensional vectors to 3-D attitude angles. Therefore, FADN can directly fit the camera imaging model after a large-scale deep learning and achieve a fast and end-to-end estimation.

III. METHODOLOGY
Our goal is to accurately estimate the 3-D attitude angle of the cooperative object from an RGB image, subsequently providing possibilities for real scenarios. Fig. 1 shows an overview of our pipeline, which consists of three stages: marker process, FADN structure, and attitude output. In the first stage, the RGB image of the marker is processed to obtain feature point coordinates as the input of the next stage. In the second stage, a fully connected network based on the dense layer is constructed to fit the 2-D/3-D mapping relationship. And the network output, which matches the dimension of the 3-D attitude angle, provides an accurate estimation in the last stage.

A. Marker Process
To extract the feature points effectively and stably, we design a feature marker, which is attached to the cooperative object. The marker has eight graphics of different sizes, including circles and rectangles. At the same time, we design a stable algorithm to extract the feature points, which are the centroids of the graphics. First, we preprocess the input image by grayscale conversion and filtering. Then, we use the Canny operator to obtain the edge contours of the eight graphics. Finally, the centroid of each graphic is obtained by calculating the moments of its contour. After marker processing, we get eight stable centroid coordinates and integrate them into a 16-D array as the input of FADN.
The graphics with different sizes and shapes in the feature marker can improve the accuracy of the coordinates extraction and avoid the extraction offset effect caused by the similarity of graphics.
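The centroid-by-moments step above can be sketched in a few lines. This is an illustrative stand-in, assuming a synthetic filled rectangle instead of a real Canny contour, and hypothetical image and graphic sizes; it is not the exact production pipeline:

```python
import numpy as np

def centroid(mask):
    """Centroid of a binary region via image moments: (M10/M00, M01/M00)."""
    ys, xs = np.nonzero(mask)
    return float(xs.mean()), float(ys.mean())

# Hypothetical stand-in for one marker graphic: a filled 100x100 rectangle
# in a 640x480 frame (the real pipeline extracts contours with the Canny
# operator first; here we work on the filled region directly).
img = np.zeros((480, 640), dtype=np.uint8)
img[100:200, 300:400] = 1
cx, cy = centroid(img)

# Eight centroids, flattened into the 16-D input array of FADN.
coords = [(cx, cy)] * 8
x_in = np.asarray(coords, dtype=np.float32).reshape(-1)
```

In practice, an OpenCV pipeline (`cv2.Canny`, `cv2.findContours`, `cv2.moments`) would replace the synthetic mask, but the centroid arithmetic is the same.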

B. Feasibility Analysis of FADN

1) 2-D/3-D Mapping:
RGB images are utilized to estimate attitude angles. Therefore, it is necessary to clarify the basic mapping between 2-D images and 3-D information [30], [31]. We briefly summarize the process by the matrix multiplication

    s [u, v, 1]^T = M [R | T] [X_W, Y_W, Z_W, 1]^T    (1)

where all coordinates are written in homogeneous form for the convenience of calculation and s is a scale factor. (u, v, 1) represents the pixel coordinate, and (X_W, Y_W, Z_W, 1) is in the world coordinate system. M represents the relevant parameters of the camera, R is a 3 × 3 rotation matrix, and T is a 3 × 1 translation vector.
However, in real scenarios, there are factors that cause nonlinear effects, such as camera distortion and lighting conditions. Some of them can be represented by mathematical models, while others cannot even be quantified. In order to reflect these nonlinear effects concretely, we quantify one of the camera distortions with a mathematical model as an example. In (2), (x, y) represents the physical image coordinate, and k, p_1, p_2, and s represent the parameters of radial distortion, centrifugal distortion, and lens distortion, respectively. Obviously, these factors introduce estimation errors and cannot be ignored.
Due to the presence of nonlinear effects, the 2-D/3-D mapping is no longer a simple linear matrix regression relationship. Therefore, the network needs excellent nonlinear expression ability to learn and generalize the mapping relationship, so as to further meet the requirements of practical applications.
2) Expression Ability of FADN: FADN is a fully connected network architecture. The forward parameter propagation of each layer can be represented by the mathematical model

    ϑ^{k+1} = f(Θ^k ϑ^k + ς^k)    (3)

where ϑ^k = (ϑ^k_1, ϑ^k_2, ..., ϑ^k_{n_k})^T denotes the n_k neuronal nodes of layer k, and ϑ^{k+1} = (ϑ^{k+1}_1, ϑ^{k+1}_2, ..., ϑ^{k+1}_{n_{k+1}})^T denotes the n_{k+1} neuronal nodes of layer k+1. Accordingly, Θ^k is an n_{k+1} × n_k weight matrix, ς^k = (ς^k_1, ς^k_2, ..., ς^k_{n_{k+1}})^T is the weighting bias term of the neuron nodes, and f is a nonlinear activation. Every layer carries such a nonlinear factor, and all of them are different; together, the activation and bias give layer k nonlinear expression ability. After multiple layers of iteration, FADN has sufficient nonlinear expressive ability.
In summary, there are two main reasons why we confirm that FADN can achieve attitude angle estimation. First, in practical application scenarios, the 2-D/3-D mapping is no longer a simple linear matrix regression relationship. Accordingly, the network needs adequate nonlinear expression ability, which is precisely the characteristic of FADN. Second, (1) and (3) are similar in expression form, both describing the relationship by matrix multiplication. Naturally, FADN can regress the mapping relationship after learning.
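The per-layer model in (3) can be sketched with NumPy. The sizes and random weights below are toy placeholders, not the trained FADN, and the tanh activation is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_forward(theta, W, b, f=np.tanh):
    """One layer of (3): theta^{k+1} = f(Theta^k theta^k + bias),
    where W has shape (n_{k+1}, n_k)."""
    return f(W @ theta + b)

# Toy sizes only; the real FADN is much wider and 15 layers deep.
x = rng.standard_normal(16)                          # 16-D centroid input
W1, b1 = rng.standard_normal((64, 16)), rng.standard_normal(64)
W2, b2 = rng.standard_normal((3, 64)), rng.standard_normal(3)

h = dense_forward(x, W1, b1)                         # nonlinear hidden layer
angles = dense_forward(h, W2, b2, f=lambda z: z)     # linear 3-D angle output
```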

C. Analysis of FADN and CNN
CNN shows good performance in extracting image features and information. And in the field of attitude detection, the checkerboard is widely used as a marker for camera calibration because of its excellent topological characteristics. However, CNN has some deficiencies in checkerboard based detection, which result in a decrease in detection accuracy and an increase in training time.
Next, we further explain the causes of these problems and give theoretical support.
The rotation of an object is reflected in its feature points. For simplicity, we analyze the rotation of an object through a single feature point. The rotation can be decomposed into three directions, around the X-axis, the Y-axis, and the Z-axis, and projected onto three 2-D planes. In Fig. 2(a), after rotation the point (x_0, y_0, z_0) moves to (x, y, z). Without loss of generality, we assume that the object rotates by θ around the X-axis, which is represented by Rot(X, θ). Projecting onto the Y-O-Z coordinate plane, as shown in Fig. 2(b), x_0 remains unchanged, and (y_0, z_0) turns into (y, z).
‖(y_0, z_0)‖_2 is rewritten as ξ_(y_0, z_0), which represents the two-norm of the vector. Compared with the imaging distance, Δz is much smaller than z_0. Besides, according to the camera imaging model, the distance remains unchanged by default. Therefore, we ignore Δz hereafter. This rotating motion can thus be decomposed into the effect of Δy, which is related only to y_0 and θ.
In the same way, if the object rotates around the Y-axis and is projected onto the X-O-Z coordinate plane, as shown in Fig. 2(c), the rotating motion can be decomposed into the effect of Δx, which is related only to x_0 and ψ.
Points on the same horizontal line have the same y_0 and Δy, so they reflect the same characteristics under rotation around the X-axis. Points on the same vertical line have the same x_0 and Δx, so they likewise reflect the same rotation around the Y-axis. In summary, the pixel points on the checkerboard have equivalent characteristics, so a CNN learns a large amount of redundant information.
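The decomposition above can be checked numerically. A minimal sketch, assuming an arbitrary feature point and a small rotation about the X-axis (the point coordinates and angle are hypothetical):

```python
import numpy as np

def rot_x(theta):
    """Rotation matrix Rot(X, theta); theta in radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0],
                     [0, c, -s],
                     [0, s,  c]])

p0 = np.array([10.0, 5.0, 600.0])        # hypothetical point (x0, y0, z0), z0 on the optical axis
p = rot_x(np.deg2rad(0.5)) @ p0          # rotate 0.5 degrees about X

dx, dy, dz = p - p0
# x is unchanged by Rot(X, theta); the motion shows up in (dy, dz),
# and dz is negligible next to the imaging distance z0, as argued above.
```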
Based on the above analysis, we conclude that CNN has two main defects in 3-D attitude angle estimation. First, the shallow layers of CNN are sensitive to local regions, while pixel points in different areas of the checkerboard represent the same attitude angle; moreover, illumination, lens distortion, and noise have regional differences that affect the same attitude angle differently. As a result, the features learned by CNN are unstable, leading to an unreliable estimation. Second, CNN extracts a large amount of redundant information, which increases the number of parameters in the network and slows down the convergence speed.
Therefore, fully considering the characteristics of attitude angle estimation, a fully connected network is more suitable. It implicitly constructs the mapping relationship between the pixel coordinate input of the 2-D image and the output of 3-D attitude angle. The avoidance of redundant information guarantees a significant performance improvement of FADN.

D. Analysis of the Number of Feature Points
At the end of the first stage of our method, we integrate the coordinates of the feature points into an array as the input of the network. This involves the Perspective-n-Point (PnP) problem in the field of computer vision, which was first proposed by Fischler and Bolles [32]. Through the image coordinates of n feature points, we can calculate the pose of a camera.
F_o represents the object coordinate system, while F_c represents the camera coordinate system. P = {P^o_1, P^o_2, ..., P^o_n} is a set of n feature points on the object, expressed in F_o; Q = {P^c_1, P^c_2, ..., P^c_n} is the set of the same points in the camera coordinate system; and q = {q_1, q_2, ..., q_n} is the set of corresponding points on the imaging plane. The attitude of F_o relative to F_c can be expressed by a rotation matrix R and a translation vector t.
For simplicity, we write R = [e_1, e_2, e_3], where e_i ∈ R^3, i = 1, 2, 3, are its columns. Thus, we can get S = [e_1, e_2, e_3, t], which is a 3 × 4 matrix.
Substituting the third equation into the first two eliminates the unknown scale factor.
Each feature point therefore provides two equations. Matrix S includes 12 parameters, which describe a rotation and a translation; since R is constrained to be a rotation, the pose has only six degrees of freedom. In an ideal situation, three points can solve the 2-D/3-D pose problem.
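The equation count can be made concrete with a standard direct-linear-transform (DLT) layout. The correspondences below are hypothetical; only the bookkeeping matters here: two rows per point over the 12 unknowns of S:

```python
import numpy as np

def dlt_rows(P_obj, uv):
    """Two linear equations per feature point in the 12 entries of
    S = [e1 e2 e3 t] (row-major), obtained by substituting the third
    projective equation into the first two."""
    X = np.append(P_obj, 1.0)            # homogeneous object point
    u, v = uv
    r1 = np.concatenate([X, np.zeros(4), -u * X])
    r2 = np.concatenate([np.zeros(4), X, -v * X])
    return r1, r2

# Hypothetical 3-D/2-D correspondences (object point, pixel coordinate).
pts = [([0.1, 0.2, 0.0], (320.0, 240.0)),
       ([0.3, 0.1, 0.0], (350.0, 250.0)),
       ([0.2, 0.4, 0.0], (330.0, 280.0))]

A = []
for P, q in pts:
    A.extend(dlt_rows(np.array(P), q))
A = np.array(A)                          # 2 equations x 3 points = 6 rows, 12 columns
```

In a real pipeline one would solve the resulting homogeneous system (or call a PnP solver such as OpenCV's `cv2.solvePnP`) rather than build the matrix by hand.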
However, factors including illumination, lens distortion, and noise cause interference, so more feature points are required to enhance robustness. On the other hand, increasing the number of feature points has negative impacts, such as longer training time and instability of the test results.

E. Analysis of Joint Output and Independent Output
FADN largely overcomes the deficiencies of CNN and achieves the desired performance, but there is still room for improvement. Without loss of generality, the rotation of the cooperative object can be decomposed into the pitch angle θ, the rotation angle ψ, and the roll angle ϕ. From (4) to (11), we know the impacts of θ and ψ. The object's rotation around the Z-axis affects both Δx and Δy, as shown in Fig. 2(d).
The changes of pixel coordinates caused by these three angles are not equivalent. Based on the analysis, we can show that θ and ψ are equivalent to each other but different from ϕ. The ϕ has more characteristics, so it converges more easily and quickly. As a result, this inequality leads to mutual interference and a degradation of network performance. Therefore, we adjust the network architecture to output independently, as shown in Fig. 3. The model of the network should be consistent with the application characteristics. In terms of network architecture, we output a 3-D vector in the joint mode and three 1-D vectors in the independent output mode. The three separate output vectors and their independent loss values during training ensure that the attitude angles do not interfere with each other.
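The difference between the two output modes can be sketched with plain matrix heads. Shapes and random weights are illustrative placeholders, not the trained network:

```python
import numpy as np

rng = np.random.default_rng(1)
h = rng.standard_normal((32, 64))           # shared features for a batch of 32

# Joint mode: one 3-D head, a single loss over all three angles together.
W_joint = rng.standard_normal((64, 3))
y_joint = h @ W_joint                        # shape (32, 3)

# Independent mode: three 1-D heads, each with its own MSE loss, so the
# gradients for theta, psi, and phi never mix inside the output heads.
heads = [rng.standard_normal((64, 1)) for _ in range(3)]
y_indep = [h @ W for W in heads]             # three (32, 1) outputs

labels = rng.standard_normal((32, 3))        # placeholder angle labels
losses = [float(np.mean((y_indep[i][:, 0] - labels[:, i]) ** 2))
          for i in range(3)]                 # one loss per attitude angle
```

In Keras this corresponds to a functional model with three output tensors and a separate loss assigned to each.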

IV. EXPERIMENTS
To evaluate the performance of FADN, we build a data set using the rendering software 3d Max, and a series of experiments is carried out on this basis. We implement our method using the Python deep learning library Keras with a TensorFlow backend and run it on an Intel Xeon E5-2620 v4 @ 2.10 GHz and two NVIDIA TITAN Xp GPUs.

A. Data Set
At present, deep learning based methods are not widely utilized in industrial applications, and the existing data sets are not suitable for real industrial scenarios. Therefore, for further industrial implementation, we build a data set that is closer to the application background. The cooperative object is a regular cylinder rendered by 3d Max, with a diameter of 100 mm and a height of 200 mm, as shown in Fig. 4. The marker is attached to this cooperative object. And we use the target camera in 3d Max to capture images of the marker, which are exactly the same as the ones taken in real scenarios. The resolution of the images is 640 × 480. The focal length of the target camera is 228.712 mm, the viewing angle is 9°, and the other parameters are defaults. The cooperative object is initially placed 600 mm from the target camera on the optical axis, and its distance can change within a certain range.
Obviously, the smaller the range of angle changes, the smaller the difference between images, and the more difficult the estimation is. To reflect the superiority of FADN and lay a foundation for subsequent high-precision industrial practice, our data set covers a small range with high accuracy. The range of rotation for each attitude angle is from −1° to +1° at a resolution of 0.001°. Thus, traversing all combinations of the three angles would produce 2001^3 images, which places very high demands on hardware. So, we randomly generate one million images using 3d Max as the training set. The corresponding 3-D attitude angle values are set as labels for supervised training. In the same way, 1000 images are generated as the test set.
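Label generation for such a data set might look like the following sketch. The sampling resolution matches the description above, while the sample count and function name are ours, scaled down for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_labels(n):
    """Random 3-D attitude labels in [-1, 1] degrees at 0.001-degree
    resolution, mirroring the data-set construction described above."""
    milli = rng.integers(-1000, 1001, size=(n, 3))   # integer milli-degrees
    return milli / 1000.0                            # (theta, psi, phi) in degrees

labels = sample_labels(10_000)     # small stand-in for the 1M-image training set
full_grid = 2001 ** 3              # size of an exhaustive traversal of all angles
```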

B. Evaluation Metrics
To evaluate the performance of the network comprehensively, we adopt two types of evaluation metrics: mean square error (MSE) and mean absolute error (MAE). In the process of training, MSE is used as the loss function to characterize the degree of network fitting. K is a batch in the network training; M is a single image in the batch; ŷ_(θ,ψ,ϕ) is the predicted attitude angle value of M, and y_(θ,ψ,ϕ) is the corresponding label.
During testing, MAE is the absolute difference between the output result and the corresponding label, including three single attitude angle errors and the Global MAE. V is the test set; N is an image from it; ŷ_(*) is the predicted value, and y_(*) is the corresponding label value, where * represents the three attitude angles θ, ψ, and ϕ. The Global MAE is the average of these three errors.
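The two metrics can be written down directly. The prediction and label values below are made up purely for illustration:

```python
import numpy as np

def mse(pred, label):
    """Training loss: mean squared error over a batch of angle triples."""
    return float(np.mean((pred - label) ** 2))

def mae_per_angle(pred, label):
    """Test metrics: MAE for each of theta, psi, phi, plus the Global MAE."""
    per_angle = np.mean(np.abs(pred - label), axis=0)   # shape (3,)
    return per_angle, float(per_angle.mean())

# Hypothetical predictions/labels for two test images (degrees).
pred  = np.array([[0.10, -0.20, 0.30], [0.00, 0.50, -0.10]])
label = np.array([[0.12, -0.18, 0.25], [0.02, 0.48, -0.15]])
angle_mae, global_mae = mae_per_angle(pred, label)
```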

C. Experiments of End-to-End Estimation
In Section III, we conduct a set of complete and systematic visual theory analyses. The related theory lays a solid foundation for using FADN to estimate the cooperative object's 3-D attitude angle. Here, we implement FADN to achieve a complete end-to-end training and test on our data set.
[Table I lists representative systems: the Orbital Express (OE) program [33], which undertakes the engineering tasks of automatic on-orbit rendezvous and docking between ASTRO and NEXTSat and a variety of servicing operations; the AVGS [34], a sensor mounted on the OE to provide the relative position and attitude between ASTRO and NEXTSat; the NGAVGS [35], which further improves the technical indicators of the AVGS; and the ETS-VII [36], a two-way remote-operation test platform built by the Japanese space agency that has undergone engineering rendezvous and docking tests.]
For training, we design a fully connected neural network with 15 hidden layers. At the same time, the output is designed as a 3-D vector to match the dimension of the attitude angle. We use stochastic gradient descent (SGD) to train the network, setting the learning rate to 0.0003 and the momentum to 0.9. Then 100 images from the test set are randomly selected to evaluate the performance of FADN. Fig. 5 shows the experimental results.
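A single SGD-with-momentum step with the stated hyperparameters can be sketched on a toy linear layer. The real FADN stacks 15 hidden dense layers; the data here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

lr, mom = 3e-4, 0.9                        # learning rate and momentum from the text
x = rng.standard_normal((32, 16))          # batch of 16-D centroid inputs
y = rng.standard_normal((32, 3))           # attitude-angle labels

W = rng.standard_normal((3, 16)) * 0.1     # toy single linear layer
vel = np.zeros_like(W)

def mse_loss(W):
    return float(np.mean((x @ W.T - y) ** 2))

loss_before = mse_loss(W)
grad = (2.0 / y.size) * ((x @ W.T - y).T @ x)   # exact d(MSE)/dW
vel = mom * vel - lr * grad                     # momentum update
W = W + vel
loss_after = mse_loss(W)
```

With Keras, the equivalent configuration is `keras.optimizers.SGD(learning_rate=3e-4, momentum=0.9)` applied to the full 15-layer model.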
It is well known that in the field of aeronautics and astronautics, attitude angle estimation has higher accuracy requirements. To reflect these requirements concretely, we list several technical indicators and accuracies from aerospace organizations and engineering projects in Table I [37]. In actual scenarios, their accuracy reaches the arc-minute level. In our simulation experiments, the accuracy of FADN reaches the arc-second level. Despite the gap in application background, this initially illustrates the feasibility and high performance of FADN and lays the foundation for subsequent research, which will be discussed in Section V.
Further, we also conduct a comparative experiment between FADN and CNN. Table II shows the details. During training, FADN has a smaller MSE, easier convergence, and better network fitting. The same conclusion can be obtained during testing, and the estimations of FADN are more accurate.

D. Number of Feature Points
The number of feature points determines the input array dimension of the network, thus affecting the estimation accuracy.
To determine the most efficient form of the input array, we explore the effect of the number of feature points. Fig. 6 shows the effect of the number of feature points on the MSE and the Global MAE. In terms of MSE, the network has a higher error when solving with three points, and the error is significantly reduced with four points. When the number of points is more than three, the MSE tends to stabilize at a low value, which indicates that the network achieves an effective fit. At the same time, the Global MAE becomes smaller and more stable once the number of feature points exceeds six, which indicates that FADN achieves higher estimation precision. Fig. 7 shows the relationship between the number of feature points and the MAE of the three attitude angles in detail. The MAEs of the three attitude angles differ significantly, and the MAE of the roll angle is much larger than the other two. But as the number of feature points increases, their trends are the same: when the feature points exceed three, the MAE drops drastically, and when they exceed six, the MAE stabilizes at a low value.
According to the previous theoretical analysis, we know that FADN can fit the 2-D/3-D mapping relationship when the number of feature points is more than three and less than six, but this is a multisolution case, and the experimental results also show that it is not stable. When the number of feature points reaches six, FADN can fully achieve a stable solution. Considering the interference of nonlinear factors such as illumination and camera distortion, we should use more than six points to increase robustness. However, some negative effects follow as the number of points grows further.

E. Network Structure Experiments

1) Network Depth: We compare fully connected network models with different depths; the results are listed in Table III. As the depth of the network increases, the MAE of the three angles and the global attitude decreases first and then increases. According to the theoretical analysis and experimental results, a shallow network's expression ability is not enough to fit the mapping relationship, while a deep network is too complex to train effectively. Considering both the MSE and the MAE, the performance of FADN-D-25 is better than that of the other two depth models.
2) Network Width: The width of the network is critical to the dimension of the weight matrix, which affects the number of the parameters and the training time. We compare six fully connected network models with different network width distributions.
The width of the network is designed according to the following principle. All network layers are divided into three parts: the bottom part, the middle part, and the upper part. The width of the bottom part is n, the middle part is 2n, and the upper part is n/2. For example, in a 15-layer network, FADN-W-2048 represents that the width of the bottom five layers is 2048, the middle five layers is 4096, and the upper five layers is 1024.
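The width rule can be expressed as a small helper. The function name is ours, and the depth defaults to the 15-layer example in the text:

```python
def fadn_widths(n, depth=15):
    """Layer widths under the paper's rule: bottom third n, middle third 2n,
    upper third n // 2 (e.g., FADN-W-2048 -> 2048, 4096, 1024)."""
    third = depth // 3
    return [n] * third + [2 * n] * third + [n // 2] * third

widths = fadn_widths(2048)   # the FADN-W-2048 configuration
```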
In Table IV, as the network width increases, the training time increases dramatically, because the increase of nodes in each layer leads to a surge in the total number of network parameters. Fig. 8 shows the results on the MAE of the three attitude angles and the Global MAE. At an accuracy of 0.001°, the width of the network has a certain influence on the estimation results. Weighing accuracy against time, FADN-W-256 achieves the best performance.
3) Network Output Structure: According to the analysis, the three attitude angles are not equivalent in the mapping relationship. The experimental results in Figs. 5, 7, and 8 and Tables II and III all verify this analysis: the MAE of the roll angle is higher than the other two angles. To solve this problem effectively and optimize the performance of FADN, we design an independent output mode. First and foremost, it is necessary to ensure the dimensional matching of the output, so in the joint output mode we output a 3-D vector, and in the independent output mode three 1-D vectors. At the network construction level, three dense nodes are set up for output, each with its own separate loss function to ensure training independence. Table V shows the experimental results, which are consistent with the theoretical analysis. The independent output mode performs better, effectively reducing the MAE of all three attitude angles. The reason is that the independent output mode helps to avoid the mutual interference among the three attitude angles in the network fitting process.

V. CONCLUSION
In this article, the prosperity of deep learning inspired us to rethink attitude angle estimation, a technique that overcomes the tedious manual workload of traditional methods. It is therefore of great significance to apply deep learning to practical industrial scenarios. We take the first step by constructing a fast and high-precision neural network, FADN. This is a whole new network based on the dense layer that fully considers the characteristics of attitude angle estimation. We propose an adaptive feature point extraction algorithm in the marker process stage to obtain precise pixel coordinates as the input of FADN. Furthermore, we make innovative adjustments to the network output: the independent output mode effectively avoids mutual interference among the three attitude angles and achieves higher precision.
We construct our own data set, which is close to the actual application background. The experimental results fully illustrate the feasibility and superiority of FADN, which is the premise of industrialization. Although our simulation experiment settings are as close to real scenarios as possible, there is still a background gap. The key to this problem is to ensure the consistency of the rendered images with reality, so as to transfer our method to actual applications. Therefore, we discuss several possible strategies for future work.
For consistency, FADN can be trained on photos taken by a real camera. The problem that ensues is the generation and storage of large-scale training samples; therefore, reducing the number of required samples is the key. A convolution-deconvolution network helps to solve this problem. First, benchmark photos of each step are taken in a single dimension; for example, the cooperative object rotates only around the X-axis, so the real camera captures 2001 photos. Similarly, another 2001 × 2 images for the other two attitude angles can be generated. Then, the convolutional-deconvolutional networks can synthesize training samples from the benchmark images. Finally, during the test, the image needs to be decoded accordingly to ensure consistency.
If FADN is trained with rendered images, the problem that follows is the inconsistency of parameters between the target camera and the real camera. Therefore, the point is to enhance the adaptability of FADN and the robustness to camera parameters. More types of images will be added to the training data set, such as taken by different target cameras, having different noises and lighting conditions.
In addition, there is another way to train FADN using the data set, which proportionally blends the rendered and real images. Thus, the diversity of training samples helps to improve the adaptability of the network. At the same time, more specific issues need to be further determined, such as the fusion ratio and network parameters.
In summary, there are many ways to apply FADN to actual scenarios, each with its own challenges. In future work, we will further compare the feasibility of the above strategies and work toward the industrialization of FADN.