Unsupervised Saliency Detection of Rail Surface Defects Using Stereoscopic Images

Visual information is increasingly recognized as a useful means of detecting rail surface defects because of its high efficiency and stability. However, it cannot reliably detect a complete defect against complex background information. Adding surface profiles can effectively improve this by including the 3-D information of defects. However, in high-speed detection, traditional 3-D profile acquisition is difficult and separate from image acquisition, so it cannot satisfy these requirements effectively. Therefore, an unsupervised stereoscopic saliency detection method based on a binocular line-scanning system is proposed in this article. This method can simultaneously obtain a highly precise image and profile information while also avoiding the decoding distortion of the structured light reconstruction method. In our method, a global low-rank nonnegative reconstruction algorithm with a background constraint is proposed. Unlike the low-rank recovery model, the algorithm has more comprehensive low-rank and background clustering properties. Furthermore, outlier detection based on the geometric properties of the rail surface is also proposed in this method. Finally, the image saliency results and depth outlier detection results are combined through collaborative fusion, and a dataset (RSDDS-113) containing rail surface defects is established for experimental verification. The experimental results demonstrate that our method obtains a mean absolute error of 0.09 and an area under the ROC curve of 0.94, better than 15 state-of-the-art algorithms.


I. INTRODUCTION
SURFACE inspection plays a pivotal role in improving the rail manufacturing process. Different from the online rail surface inspection system [1], [2], the manufacturing process inspection of the rail products faces the following problems.
1) The working environment is harsh.
2) The appearance of the rolled rail surface is complicated.
3) The generation and distribution of defects are random.
4) The rail transmission has a high speed and vibration.
5) The surface detection system needs to work continuously.
Manual inspection is inefficient and has a low sensitivity [2], [32], which slows down the entire manufacturing process. Compared with manual inspection, nondestructive testing methods, such as ultrasonic testing, acoustic emission testing, and magnetic flux leakage [3], offer higher sensitivity and better data interaction in industrial inspection. However, most of them are time intensive [2] with respect to the target motion, and the detected defects lack intuitive descriptions of their shapes. Machine vision has more advantages in surface defect testing. Especially with the development of hardware and software, the application of machine vision technology in railway measurement has attracted more attention.
Machine vision detection systems usually rely on the analysis of image content [3], [4] and of defect characteristics [33], [34]. At present, most of them are based on grayscale images because of the high detection speed and the low cost of the equipment. Among recent studies, in [3], the improved completed local binary patterns (ICLBP) and generalized completed local binary patterns (GCLBP) features of grayscale images are constructed for detection. In [4], the texture features of defect-free grayscale images are analyzed to judge steel strip defects. However, methods that use grayscale images may misjudge defects due to the lack of color information.
In some surface contour inspection fields, 3-D profile reconstruction is one of the commonly used methods. Enzberg and Al-Hamadi [5] perform 3-D surface quality testing by establishing a surface approximation model. Lilienblum and Al-Hamadi [6] use structured light and line-scanning stereo matching to obtain the 3-D information. However, these methods face the difficulty of encoding and decoding structured light fringes, and the reconstruction accuracy depends on the resolution of the stripes. Besides, the 3-D information and the image information cannot be used uniformly in the detection process because the two processes are independent. The abovementioned image-based and 3-D profile-based detection methods can obtain impressive detection results, but they still face the following problems.
1) Image information is susceptible to illumination and shooting angles. Low-texture defects are difficult to detect in a background area with a similar texture, as shown in Fig. 1(a). In particular, grayscale images may yield more false defects due to the lack of color information.
2) In profile detection, local distortion or loss of depth information leads to false detection, as shown in the yellow box area of Fig. 1(b). Besides, defects with a small depth change are easily submerged in a background with a large curvature, as shown in Fig. 1(c).
In summary, a single source of information as described above is not enough to detect defects. A multimodal fusion detection method that combines the image information and the depth information is necessary.
In this article, a new stereoscopic saliency detection method based on a stereoscopic visual system is proposed. The visual system can quickly acquire the 2-D image and 3-D profile of a rail surface.
The main contributions of this article are as follows.
1) The application of a binocular line-scanning system in surface defect detection is a pioneering work that can serve as a reference for other industrial fields. With the support of the system, an unsupervised saliency detection method based on stereoscopic images is proposed. The method integrates a variety of information (2-D image and 3-D profile), and the rail data are used as a case for the experimental verification.
2) A global low-rank and nonnegative reconstruction (GLRNNR) saliency algorithm is constructed for image defect detection. Compared with the traditional low-rank recovery (LRR) model, the algorithm incorporates background information constraints and nonnegative constraints, which give our model clustering properties under the constraint of global low rank.
3) According to the geometrical and depth information of the rail surface, a method for detecting rail surface outliers by constructing the indirect plane is proposed. Then, the image saliency results and the depth saliency results are combined to get the final defect saliency map.
4) A total of 113 stereoscopic image pairs are collected and organized into the rail surface defects dataset (RSDDS-113), which includes the left-camera images and the corresponding depth maps.
The rest of this article is organized as follows. Section II introduces some related work. Section III describes the hardware acquisition system. In Section IV, the 2-D image saliency detection method is explained, the 3-D profile outlier detection algorithm based on the indirect plane and geometric properties is elaborated, and the two detection results are then fused. In Section V, 15 methods are used for experimental comparison and analysis. Finally, Section VI concludes this article.

II. RELATED WORK
Saliency detection is inspired by the human visual system (HVS), which detects the region of interest with visual uniqueness in complex scenes [2], [7]-[9]. By selectively acquiring the relevant regional information, HVS can greatly avoid computational waste and reduce the difficulty of the image analysis.
In recent years, saliency models based on machine learning, the low-rank sparse principle [8], and graph theory [9] have achieved many good results. Especially, the LRR model is often used to distinguish the background and foreground, due to its not requiring a large number of labeled samples for training.
The LRR model (1) treats the background feature as a low-rank matrix $L$. The decomposition error is treated as the foreground information $S$ with sparse noise. $F$ denotes the original image feature matrix, and $\lambda$ is the balance factor. The model is defined as follows:

$$\min_{L,S} \ \operatorname{rank}(L) + \lambda \|S\|_0 \quad \text{s.t.} \quad F = L + S. \tag{1}$$

Since the original LRR model [8], [11] is a nondeterministic polynomial time (NP)-hard problem, the nuclear norm $\|\cdot\|_*$ and the 1-norm $\|\cdot\|_1$ are used to solve (1). The formula can be rewritten as follows:

$$\min_{Z,S} \ \|Z\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad F = DZ + S \tag{2}$$

where $Z$ denotes the reconstruction coefficients of $F$ with respect to the dictionary $D$. In some studies, $D$ is an over-complete dictionary or is directly replaced by $F$, namely $F = FZ + S$. The following problems usually exist in LRR models.
1) The construction of the over-complete dictionary $D$ is tedious and difficult, especially when the number of samples is insufficient.
2) When the difference between the foreground and background is tiny, the foreground area also has a low-rank characteristic.
3) Finally, the negative values of $Z$ lack a clear and reasonable interpretation, e.g., as a cluster indication for the saliency analysis of the image.
In order to effectively solve the abovementioned problems, a GLRNNR saliency detection algorithm is used for image defect detection. The algorithm is based on the background prior constraint and the nonnegative low-rank sparse constraint.

III. HARDWARE ACQUISITION SYSTEM
In this article, a binocular line-scanning system is implemented for surface detection. Unlike existing visual inspection equipment, this system is based on a binocular stereo camera (BSC) produced by Chromasens. It can obtain 2-D and 3-D information at a high speed and a high resolution. Meanwhile, it effectively avoids the inconsistency between 2-D and 3-D information acquisition during the detection process and provides more effective information for the detection of surface defects.
The optical resolution of the BSC can reach 70 μm/pixel; the maximum acquisition speed is 1.4 m/s; the maximum line rate is 21 kHz; and the line-scanning camera has 7142 pixels and a red, green, and blue (RGB) three-channel sensor.
According to the principles of triangulation and epipolar line correction, as shown in Fig. 2, the disparity map of the left and right cameras can be obtained by stereo matching [35], [36]. Then, the corresponding depth map can be calculated from the disparity map by combining the internal and external parameters of the cameras of the BSC, as shown in Fig. 3.
In order to accelerate the calculation of the depth information, a simple and fast stereo matching algorithm called the semiglobal matching (SGM) [12] is used. The left and right consistency detection error of SGM is "10" and the matching disparity range is between −165 and +165. Based on the abovementioned conditions, the BSC can provide the high-precision depth information with a resolution of 14 μm in a range of 52 mm.
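The left-right consistency check mentioned above can be illustrated with a minimal sketch on a single scanline. The function name, the sign convention (a left disparity d maps left column x to right column x − d), the use of `None` for rejected pixels, and the reading of the value 10 as a maximum difference in disparity units are all assumptions made for illustration; they are not taken from SGM [12] or from the BSC software.

```python
def left_right_consistency(disp_left, disp_right, max_error=10,
                           d_min=-165, d_max=165):
    """Keep a left disparity only if the right scanline agrees with it.

    disp_left and disp_right are lists of integer disparities (or None)
    for one scanline. Rejected pixels are returned as None.
    """
    width = len(disp_left)
    checked = []
    for x, d in enumerate(disp_left):
        # Reject missing values and values outside the search range.
        valid = d is not None and d_min <= d <= d_max
        if valid:
            xr = x - d  # assumed matching column in the right scanline
            valid = 0 <= xr < width and disp_right[xr] is not None
        if valid:
            # The two views must report (almost) the same disparity.
            valid = abs(d - disp_right[xr]) <= max_error
        checked.append(d if valid else None)
    return checked
```

Pixels rejected by such a check would typically be treated as missing depth values in later processing, which is one source of the local depth loss discussed in the Introduction.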

A. Saliency Detection With GLRNNR
In general, a homogeneous region of the image has similar saliency information. In order to decrease the computational complexity, simple linear iterative clustering [13] is used to segment the original image into $n$ homogeneous regions. The $m$-dimensional feature information of each region is utilized to construct an overall image feature matrix $F$.
As previously stated, the negative values in the coefficient matrix $Z$ lack a reasonable explanation in terms of the actual clusters. Therefore, following the nonnegative low-rank and sparse graph (NNLRS) [14] model, the LRR model can be changed to a representation with a nonnegative coefficient constraint $Z \ge 0$.
In addition, to improve the detection effect, some prior information [10], [23] is used, such as the boundary background prior, the center prior, and the color prior. However, in some saliency models, the use of this prior information is separate from the solving process of LRR, so it cannot improve the results of LRR itself.
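The construction of the region feature matrix described at the start of this subsection can be sketched as follows. The sketch assumes that a superpixel label map is already available (for example, from an SLIC implementation), that labels run from 0 to n − 1 with no empty region, and that each region is summarized by the mean of its pixel features; the mean pooling and the function name are illustrative assumptions, not details given in the article.

```python
import numpy as np

def region_feature_matrix(features, labels):
    """Average per-pixel features inside each superpixel.

    features: array of shape (H, W, m) with m features per pixel.
    labels:   integer array of shape (H, W) with values 0..n-1.
    Returns an (m, n) matrix whose i-th column describes region i.
    """
    m = features.shape[-1]
    flat_feat = features.reshape(-1, m)
    flat_lab = labels.ravel()
    n = int(flat_lab.max()) + 1
    sums = np.zeros((n, m))
    # Unbuffered accumulation: add each pixel's features to its region row.
    np.add.at(sums, flat_lab, flat_feat)
    counts = np.bincount(flat_lab, minlength=n).astype(float)
    return (sums / counts[:, None]).T
```

Storing one region per column matches the later use of the ith column vectors of the sparse terms as the evidence for the ith superpixel.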
To solve the above-mentioned problems, GLRNNR is formulated as an optimization problem. In the low-rank decomposition, the image boundary information is used as the dictionary $B$ to impose the background prior constraint, and the coefficient matrix $Z$ is constrained to be nonnegative. Meanwhile, $L$ is used to ensure the low-rank property of the global background of the image. Here, $L$ denotes the global low-rank term and $S_1$ the global sparse term; $B$ stands for the background dictionary; $L = BZ + S_2$ is the boundary background constraint reconstruction; $Z$ is the reconstruction coefficient; $S_2$ is the reconstruction error; $H_2 \ge 0$ expresses the nonnegative constraint on $Z$; and $\alpha$, $\beta$, $\lambda$, and $\eta$ are the balance factors, as shown in Fig. 4.
The abovementioned problem can be converted into an augmented Lagrangian function, where $Y_1$, $Y_2$, $Y_3$, and $Y_4$ are the Lagrange multipliers, and $\mu_1$ and $\mu_2$ are the penalty parameters for violating the linear constraints. $L$, $Z$, $S_1$, and $S_2$ can be solved by alternating iterations with LADMAP [14], as given in Table I.
The detailed iterative update steps are as follows.
Step 1. Updating $L$: According to Zhuang et al. [14], $\Psi_{(\tau)}(\cdot)$ denotes the singular-value shrinkage operator with soft threshold $\tau$, which is used to approximately minimize the nuclear norm. It yields the value of $L$ at the $(k+1)$th iteration.
Step 2. Updating $H_1$.
Step 3. Updating $S_1$.
Step 4. Updating $Z$.
Step 5. Updating $H_2$: The nonnegative constraint must be enforced during this update.
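The building blocks behind these update steps can be sketched with three standard operators. Only the singular-value shrinkage operator is named in the text; the element-wise soft threshold assumes that the sparse terms are penalized with a 1-norm, and the projection is one simple way to enforce a nonnegativity constraint. The function names are illustrative, and the sketch does not reproduce the exact update formulas of the article.

```python
import numpy as np

def singular_value_shrinkage(M, tau):
    """Shrink the singular values of M by tau (proximal step of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    # Scale the columns of U by the shrunken singular values, then recombine.
    return (U * s_shrunk) @ Vt

def soft_threshold(M, tau):
    """Element-wise shrinkage toward zero (proximal step of the 1-norm)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def project_nonnegative(M):
    """Clip negative entries to zero to satisfy a constraint such as H >= 0."""
    return np.maximum(M, 0.0)
```

In an alternating scheme of this kind, each variable is updated with one of these operators while the others are held fixed, and the multipliers and penalty parameters are updated afterwards.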
Based on $S_1$ and $S_2$, the saliency Sal of the $i$th superpixel region in the image is calculated from $(S_1)_i$ and $(S_2)_i$, the $i$th column vectors of $S_1$ and $S_2$, respectively, with $Z_{Sal}$ as the normalization coefficient. In addition, a multiscale fusion method is used to obtain the final saliency map: the fusion is performed using $N$ scales of superpixel segmentation, and $w_n$ and $Sal^{(n)}$ are the corresponding weight and saliency result at the $n$th scale, respectively.

B. Outlier Saliency Model With Depth
The ideal rail surface can be viewed as a ruled surface generated by a moving straight line

$$S(t, u) = p(t) + u \, r(t)$$

where $S(t, u)$ is a point on the ruled surface. The directrix $p(t)$ is the curved path of the moving line. The moving line is called a generator. $r(t)$ is the unit vector of the generator passing through $p(t)$. If the ruled surface is a general cylinder, $r(t)$ is fixed, i.e., $r(t) \equiv r_0$. It follows that any rail cross section perpendicular to $r(t)$ should have a similar profile. This property is the key to detecting the outliers of the profile.
Limited by the depth of field of the camera, the indicative feature of $r(t)$ may be missing in the image. This is particularly severe when the imaging plane is not perpendicular to the direction of the rail movement.
In order to discover r(t), a method based on an indirect fitting plane is proposed in this article. The indirect plane (I-plane) is the approximate fitting plane of a spatially symmetric region on a ruled surface with the least square method (LSM), as shown in Fig. 5(c) and in Fig. 6(b).
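A least-squares plane fit of this kind can be sketched as follows, using the plane form $a_0 x + a_1 y - z + a_2 = 0$ that the article adopts for the I-plane. The function names and the (N, 3) point layout are assumptions for illustration; the selection of the symmetric region and the subsequent recovery of the generator direction are not covered here.

```python
import numpy as np

def fit_plane_lsm(points):
    """Fit z = a0*x + a1*y + a2 to an (N, 3) point array by least squares."""
    pts = np.asarray(points, dtype=float)
    A = np.column_stack([pts[:, 0], pts[:, 1], np.ones(len(pts))])
    coeffs, *_ = np.linalg.lstsq(A, pts[:, 2], rcond=None)
    return coeffs  # a0, a1, a2

def plane_normal(coeffs):
    """Unit normal of the plane a0*x + a1*y - z + a2 = 0."""
    a0, a1, _ = coeffs
    n = np.array([a0, a1, -1.0])
    return n / np.linalg.norm(n)

def point_plane_distance(points, coeffs):
    """Unsigned distance of each point to the fitted plane."""
    pts = np.asarray(points, dtype=float)
    a0, a1, a2 = coeffs
    residual = a0 * pts[:, 0] + a1 * pts[:, 1] - pts[:, 2] + a2
    return np.abs(residual) / np.sqrt(a0 ** 2 + a1 ** 2 + 1.0)
```

The per-point distances returned by the last function play the role of the point-to-plane distances that the method uses as weights when locating the distance-weighted center of the region.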
In this method, it is assumed that there is a spatially symmetric area $\Omega_S$ on a general cylindrical ruled surface and that $\Omega_S$ is converted by a rigid transformation (translation and rotation), as shown in Fig. 5 and Fig. 6(a). The projection region of $\Omega_S$ on the X_Y plane is $\Omega_{X\_Y} = \{(x, y) \mid x^2 + y^2 = r^2\}$. The directrix of $\Omega_S$ on the X_Z plane is $p(t)$. The direction vector of the generator is $r(t)$, where $r(t) \equiv \vec{V}_g$ and $\vec{V}_g = (0, 1, \kappa_0)$. In addition, $l$ is one of the generators on $\Omega_S$. Here, $(x, y, z)$ are the coordinates of the points on $\Omega_S$, and $z_{pl}$ is the intersection of $l$ and the X_Y plane. With discrete sampling on the surface $\Omega_S$, the sample point set is $P_S = \{(x_{P_S}, y_{P_S}, z_{P_S}) \mid (x_{P_S}, y_{P_S}, z_{P_S}) \in \Omega_S\}$, and the relationships on $\Omega_S$ can be written in a discrete form. This article uses the LSM to find the I-plane $a_0 x + a_1 y - z + a_2 = 0$.
$\vec{V}_g$ can be obtained, up to a normalization constant $\varepsilon$, by crossing the normal vector $\vec{V}_{lsp} = [a_0, \kappa_0, -1]$ of the I-plane with the vector $\vec{V}_{cc}$, as shown in Fig. 6(c).
Before rotation, $\vec{V}_{cc}$ can be obtained from the distance-weighted center of mass $P_m$ and the geometric center $P_g$, where $P_m = \left(\sum x_{P_S} d_{P_S}, \sum y_{P_S} d_{P_S}, \sum z_{P_S} d_{P_S}\right) / \sum d_{P_S}$ and $d_{P_S}$ denotes the distance between $\Omega_S$ and the I-plane. $P_g = (x_p, y_p, z_p)$ is the geometric center of $\Omega_S$, as shown in Fig. 7(a).
Outlier detection is then performed on the new Z-values of the point data along $\vec{V}_g$, as shown in Fig. 7(d). It is worth noting that the actual surface is nonsmooth and the depth map contains noise and errors, as shown in Fig. 7(c). Here, the RANSAC [15] algorithm is used to fit a line to the Z-values along $\vec{V}_g$. $z_{lsp}$ represents the ideal Z-value on the fitted line.
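A robust line fit of the kind provided by RANSAC can be sketched in a few lines. This is a generic, dependency-free illustration and not the authors' implementation: the function name, the fixed iteration count, the inlier threshold, and the final least-squares refinement on the consensus set are all assumptions.

```python
import random

def ransac_line(ts, zs, threshold, iterations=200, seed=0):
    """Fit z = k*t + b while ignoring outliers; return (k, b, inlier_mask)."""
    rng = random.Random(seed)
    n = len(ts)
    best_mask, best_count = None, -1
    for _ in range(iterations):
        i, j = rng.sample(range(n), 2)
        if ts[i] == ts[j]:
            continue  # this pair cannot define z as a function of t
        k = (zs[j] - zs[i]) / (ts[j] - ts[i])
        b = zs[i] - k * ts[i]
        mask = [abs(z - (k * t + b)) <= threshold for t, z in zip(ts, zs)]
        count = sum(mask)
        if count > best_count:
            best_mask, best_count = mask, count
    # Refine with ordinary least squares on the largest consensus set.
    tin = [t for t, m in zip(ts, best_mask) if m]
    zin = [z for z, m in zip(zs, best_mask) if m]
    t_mean = sum(tin) / len(tin)
    z_mean = sum(zin) / len(zin)
    sxx = sum((t - t_mean) ** 2 for t in tin)
    sxz = sum((t - t_mean) * (z - z_mean) for t, z in zip(tin, zin))
    k = sxz / sxx
    b = z_mean - k * t_mean
    return k, b, best_mask
```

With the fitted line standing in for the ideal Z-value, the absolute residual of each sample from the line gives a simple per-point measure of how strongly it deviates from the expected profile.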
In order to weaken the influence of the noise and errors, the $\vec{V}_g$ direction is fine-tuned by setting the angle fine-tuning interval $\phi$ and minimizing the overall variance. Then, the saliency of the outliers of the rail surface is calculated from the distance of the points on the ruled surface to the I-plane and from $z_{lsp}$, the ideal value along the direction $\vec{V}_g$.
It is worth noting that $\Omega_S$ should be selected, as far as possible, to contain no defects or only microdefects. As shown in Fig. 8, the 2-D saliency map is segmented into the foreground and background regions by Otsu [16]. Then, the largest inscribed circle of the background region is selected as $\Omega_{X\_Y}$. The main defect area is highlighted by the centroid reconstraint, where $Z_{Dep}$ denotes the normalization constant, $S_D(x_{S_D}, y_{S_D})$ is the pixel saliency value at $(x_{S_D}, y_{S_D})$, and $\mathrm{Var}(\cdot)$ represents the variance.

C. Final Saliency Fusion
In order to effectively fuse the 2-D-based and 3-D-based saliency detection results, the collaborative fusion detection process, as shown in Fig. 8, is employed.
First, the initial 2-D saliency result S C1 is obtained by the algorithm of GLRNNR. Similar to [25], a 53-D feature vector is employed in GLRNNR.
Second, S C1 is subjected to the threshold segmentation by Otsu [16] to obtain Ω X_Y in the background. The 3-D detection result S Dep is obtained by the outlier detection, which is the pixel level.
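The Otsu segmentation used in this step can be sketched without external dependencies. The sketch assumes that the saliency values are flattened into a list and normalized to [0, 1]; the function name, the number of histogram bins, and the convention that values below the returned threshold are background are illustrative choices.

```python
def otsu_threshold(values, bins=256):
    """Return the threshold in (0, 1] that maximizes the between-class variance."""
    hist = [0] * bins
    for v in values:
        idx = min(max(int(v * bins), 0), bins - 1)
        hist[idx] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(bins - 1):
        w0 += hist[t]          # number of samples in bins 0..t
        sum0 += t * hist[t]    # sum of bin indices in bins 0..t
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    # Values strictly below the returned threshold fall into bins 0..best_t.
    return (best_t + 1) / bins
```

Pixels below the threshold form the background region from which the defect-free reference area is then chosen.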
According to the outlier saliency model, the detected defect may be incomplete and may include the background noise of a rough surface. Especially for scar defects, the detection results are more pronounced at the edge of the defect because the interior of the defect is consistent with the background profile. With superpixel segmentation and the clustering properties of GLRNNR, the range of the defects can be expanded and the impact of the background noise reduced.
Therefore, $S_{Dep}$ is treated as the 54th feature dimension to recalculate the new 2-D result $S_C$ using GLRNNR.
Finally, S C and S Dep results are nonlinearly combined as follows:

V. EXPERIMENT
A. Dataset
1) RSDDS-113: The dataset samples are taken from an actual industrial production line of a section-steel factory. Twenty rail segments containing defects are collected and employed to construct the rail surface defects dataset (RSDDS-113). The data acquisition process, carried out under laboratory conditions, is shown in Fig. 9.
The samples cover all positions of the rails, such as the waist surface, the tread surface, and the bottom surface. The types and locations of their surface defects are random. For the RSDDS-113 dataset, 113 pairs with typical defects are selected. Every pair consists of the left-camera image and the corresponding depth image.
In the dataset, the main types of defects are rolling scars, corrosion, scratches, holes, pits, and so on, as shown in Fig. 10. The RSDDS-113 dataset and our codes are available at the GitHub homepage (https://github.com/neu-rail-rsdds/rsdds).
2) Rail Surface Discrete Defect (RSDD) Dataset [2]: The RSDD is a public railway image dataset, which is mainly composed of 2-D grayscale images captured from the express rails and heavy haul rails, including two subdatasets: Type-I and Type-II. The dataset is used to verify the applicability of GLRNNR for online railway images.
The Type-II dataset has a narrower and more consistent background than the Type-I dataset but more sophisticated defects are included, which are shown in Fig. 15.

B. Evaluation Metrics
For a comprehensive assessment, five evaluation metrics [17] are used for RSDDS-113, including the precision-recall (PR) curve, the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), the mean absolute error (MAE), and the F-measure. Two evaluation metrics (pixel-level index and defect-level index) [2] are employed for the RSDD dataset.
... Table I. In addition, the adjustment angle of the direction vector of the ruled surface is $\phi = (0, \pm 2\sin(\pi/120), 0)$.
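Two of the metrics named in this subsection, the MAE and the AUC, can be sketched as follows. The sketch assumes flattened lists, saliency values in [0, 1], and a binary ground-truth mask; the AUC is computed through its rank interpretation (the probability that a defect pixel scores higher than a background pixel), which is simple but quadratic in the number of pixels and therefore only suited to small examples. The function names are illustrative.

```python
def mean_absolute_error(saliency, ground_truth):
    """Average absolute difference between a saliency map and a binary mask."""
    diffs = [abs(s - g) for s, g in zip(saliency, ground_truth)]
    return sum(diffs) / len(diffs)

def roc_auc(saliency, ground_truth):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, g in zip(saliency, ground_truth) if g == 1]
    neg = [s for s, g in zip(saliency, ground_truth) if g == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))
```

A lower MAE and a higher AUC both indicate a saliency map that agrees better with the annotated defect mask, but the MAE is averaged over all pixels and is therefore dominated by the background when defects are small.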

C. Comparison With State-of-the-Art Methods
The results of each step are evaluated by the above-mentioned indicators, as illustrated in Fig. 11 and Table II. The results demonstrate that each step is effective for generating the final saliency map.
As shown in Fig. 11, "depth saliency" denotes the result $S_{Dep}$, which is produced by the outlier detection. "GLRNNR" represents the saliency result $S_{C1}$ generated using only a single color image. "GLRNNR-D" represents the saliency result $S_C$ regenerated using $S_{Dep}$ and the color image. "Fusion result" indicates the final merger $S_F$ of $S_C$ and $S_{Dep}$.
As illustrated in Table III and Fig. 12, the GLRNNR algorithm using only the color information is clearly superior to algorithms such as LRR [8], DSR [20], and NNLRS [14]. In addition, it is also verified that GLRNNR with the nonnegative sparse constraint can mitigate the influence of the negative coefficients of DSR.
In order to verify the effectiveness of the proposed algorithm, "GLRNNR," "GLRNNR-D," and the final fusion result "Ours" are divided into groups for the experimental comparison, as shown in Figs. 13 and 14, and the MAE and AUC are also listed in Tables IV and V. As shown in Fig. 13, the GLRNNR algorithm in this article obtains a better result than the other methods that do not use the depth information. The detection effect of "GLRNNR-D" is clearly better than that of ACSD, CAIP-MB, DCMC, DES, and LBE. The MAE of "GLRNNR-D" is slightly worse than that of CAIP-MB in Fig. 14 and Table V, but the AUC of "GLRNNR-D" is clearly higher than that of CAIP-MB. It is possible that CAIP-MB strengthens the requirement of accuracy, which leads to a decrease in the recall rate. The MAE is mainly intended for regression problems, and it is susceptible to the background noise. In Tables IV and V, the MAE of our proposed methods is comparable to that of other excellent algorithms. To evaluate the effectiveness of defect detection, it is necessary to use additional evaluation metrics that are suitable for segmentation and classification problems, such as the PR curve and AUC. It can be seen that the PR curve and AUC of our method are better than those of the other algorithms.
In the experimental comparisons, the final fusion results of our method are significantly better than those of the other 15 methods. In addition, when the resolution of the image is set to 256 × 512 and the number of superpixels is set to 200, "GLRNNR" takes 3.9 s and the depth outlier detection takes 4.9 s, without graphics processing unit (GPU) acceleration.
In practical applications, offline and parallel processing methods are utilized. First, GLRNNR and 3-D outlier detection algorithms are, respectively, used for preprocessing detection. Then, if the defect requirements are not satisfied, the two pieces of information are combined using our fusion detection method in this article.
2) RSDD Dataset: For the RSDD dataset, the values of the parameters in (4) and (14) are the same as the RSDDS-113.
For comparison with other known methods (LN+DLBP, MLC+PEME, PM, CTFM) in [2], Otsu [16] and active contours [31] are used to segment the saliency results into the foreground and background. As described in [2], the evaluation metrics of the RSDD dataset are the pixel-level index (precision, recall, and F-measure) and the defect-level index (precision', recall', and F-measure'). The results on the RSDD dataset are shown in Fig. 15, and the evaluation metrics of the comparison are given in Tables VI and VII. It can be seen that GLRNNR is significantly better than the other algorithms on Type-I. On Type-II, GLRNNR is equal to CTFM in the precision and recall of the pixel-level index, but it is better than the other algorithms in the defect-level index.
TABLE VI: EXPERIMENTAL RESULTS FOR THE TYPE-I RSDD DATASET
TABLE VII: EXPERIMENTAL RESULTS FOR THE TYPE-II RSDD DATASET
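The pixel-level index can be illustrated with a short sketch that compares a binarized saliency map with the ground-truth mask. It assumes flattened binary lists and uses the general F-measure with a weight parameter `beta2`; the default of 1 gives the harmonic mean of precision and recall, while saliency studies often use 0.3. Which weight is used in [2] is not stated here, so the default is an assumption, as is the function name.

```python
def pixel_level_index(prediction, ground_truth, beta2=1.0):
    """Return (precision, recall, F-measure) for two binary pixel lists."""
    tp = sum(1 for p, g in zip(prediction, ground_truth) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(prediction, ground_truth) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(prediction, ground_truth) if p == 0 and g == 1)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    denom = beta2 * precision + recall
    f_measure = (1.0 + beta2) * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f_measure
```

The defect-level index follows the same pattern but counts whole defect regions instead of individual pixels, which requires a region matching rule that is defined in [2] and not reproduced here.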

VI. CONCLUSION
In this article, a novel unsupervised stereoscopic saliency detection method for rail surface defects was proposed. It was based on a binocular line-scanning system, GLRNNR saliency algorithm, and depth outlier detection. First, a 2-D image and 3-D profile information of the rail surface were obtained by our developed binocular color line-scanning system. Second, utilizing LADMAP and GLRNNR, the algorithm could quickly obtain the saliency map of a 2-D image. Next, the outlier region of the 3-D profile was detected based on a 2-D saliency map and the surface characteristics. Meanwhile, the 3-D saliency map was also used as a feature to enhance the 2-D saliency map. Finally, the last 2-D saliency result and the 3-D result were nonlinearly fused. Our experimental results on the RSDDS-113 dataset outperformed the 15 state-of-the-art methods in the literature.
Furthermore, to verify the applicability to online railway images, the grayscale image dataset RSDD was also used for comparison with other known methods. The experimental results showed that the GLRNNR method proposed in this article is also suitable for general grayscale rail images and can obtain better detection results.
It is worth noting that the method in this article is only suitable for locating rail surface defects. More research is needed to determine the defect attributes. In addition, the limitations of the imaging hardware hinder this method from being effectively applied to detect shallow internal defects. Therefore, in the future, we will look at multimodal information fusion and defect classification.

Between 2018 and 2019, he was an Academic Visitor in the Department of Computer Science, Loughborough University, Loughborough, U.K. He is currently an Associate Professor in the School of Mechanical Engineering and Automation, Northeastern University. His research interests cover vision-based inspection systems for steel surface defects, surface topography, image processing, and pattern recognition.
Liming Huang received the B.S. degree in mechatronic engineering from the Shandong University of Science and Technology, Qingdao, China, in 2017.
He is currently working toward the M.S. degree in mechanical engineering with the School of Mechanical Engineering and Automation, Northeastern University, Shenyang, China.