CNN-Based Salient Object Detection on Hyperspectral Images Using Extended Morphology

Salient object detection in hyperspectral images (HSIs) is of interest in various image processing and computer vision applications. Many studies considering spectral information have been reported, but they extract only low-level features from an HSI. This letter proposes a convolutional neural network (CNN) based salient object detection method for hyperspectral imagery that utilizes spatial and spectral information simultaneously. The proposed methodology incorporates an extended morphological profile (EMP) followed by a CNN to exploit information from nearby pixels together with high-level features. We have evaluated the performance of the proposed approach on two independent datasets to verify its generalization ability, viz.: 1) the hyperspectral salient object detection (HS-SOD) dataset and 2) the Pavia University (PU) dataset. An extensive quantitative analysis revealed that the proposed method significantly outperforms other state-of-the-art methods, improving the area under the receiver operating characteristic (ROC) curve (AUC) and the F-measure by at least 2% and yielding a lower mean absolute error on both datasets.


I. INTRODUCTION
Salient object detection aims at identifying the objects or regions in an image that are visibly distinctive from their vicinity (i.e., the background). Accordingly, salient object detection has been used in many computer vision tasks such as object recognition [1], visual tracking [2], image retrieval [3], military target detection [4], and so on. Initially, salient object detection methods were based on the cognitive ability of humans and limited to RGB colors; features were extracted using the contrast difference between objects and the image background [5]. With the advancement in technology, however, many applications such as military intelligence, weather forecasting, and monitoring crop photosynthesis require broad-spectrum measurements with more than the three channels of conventional color spaces (e.g., RGB or YCbCr).
Hyperspectral image (HSI) data provide spectral information from both the visible and infrared portions of the electromagnetic spectrum. The work in [6] first proposed the theoretical foundation for saliency maps on images. Following this, Borji and Itti [5] developed a salient object detection method constrained to RGB images using distance measures and local characteristics, such as color, intensity, and so on, for feature extraction. Subsequently, many methods were introduced and modified to improve salient object detection accuracy. Cheng et al. [7] proposed a region-based contrast (RC) method that segments images based on spatial relations and thereafter extracts features. Earlier salient object detection methods used local features such as texture, color, intensity, and orientation to generate saliency maps. These methods were able to detect and extract only low-level, hand-crafted features based on prior knowledge of the existing datasets [8]. Learning-based processes such as [9] were proposed to overcome this limitation, efficiently integrating different features and improving detection accuracy for salient objects. With the emergence of advanced learning-based algorithms, the convolutional neural network (CNN) has been widely adopted for automated feature extraction and for learning filters for multiple objects. Owing to its end-to-end learning ability, the CNN has been employed in saliency detection techniques [10], [11], [12]. A multiscale deep feature (MDF) method [13] using a CNN was proposed for salient object detection, in which a multilevel image segmentation method was adopted. This leads to a large number of parameters and heavy computation, which is not ideal for HSI data with a large number of features. Wang et al. [14] proposed a recurrent fully convolutional network (RFCN) for salient object detection, which iteratively feeds the output, along with any errors, back to the network to rectify them.
This yields a refined saliency map with more accurate detection. Similarly, a deeply supervised network (DSN) [15] was introduced in which skip-layer connections were used between the loss layer and the last layer of each CNN. With its multiscale and multilevel features, the CNN enhances salient regions and preserves object boundaries.
Imamoglu et al. [16] proposed a salient object detection technique using CNN-based unsupervised segmentation for HSIs. Since the CNN does not need labels to extract high-level features from an HSI, this process is justified. However, this technique lacks spatial resolution, which plays a crucial role in detecting objects in HSIs. Object detection methods that lack spatial resolution and focus only on spectral information cannot cope with the complexity of objects relative to their background. Appice et al. [17] proposed an autoencoder-based data reconstruction method called autoencoding of hyperspectral imagery for saliency analysis (AISA). It incorporates spectral-spatial information using a distance measure to compute the saliency map. However, this method does not utilize spatial information adequately. To overcome this limitation, Huang et al. [18] proposed a CNN method called HSIs and saliency optimization (HSISO) for object detection.
In this letter, we focus on extracting high-level features by incorporating both spectral and spatial information for salient object detection on HSI datasets. Instead of individually stacking spectral and spatial features, which leads to information loss, the two are assimilated simultaneously. However, automated extraction of high-level features comprising spectral-spatial information is complex. This motivates us to propose a salient object detection method that integrates a CNN with an EMP (as discussed in Section II). By considering nearby pixel information through the EMP, our method offers two advantages: 1) spectral and spatial information are combined simultaneously for extracting high-level features and 2) objects of multiple spatial scales (sizes and shapes) and aspect ratios are detected in HSIs. Further, a comparative performance analysis of the proposed approach against existing salient object detection methods is carried out using standard evaluation metrics.

II. PROPOSED METHODOLOGY

A. EMP Feature Extraction
Mathematical morphology [19] is one of the most widely used techniques for analyzing interpixel dependency. Many morphological operators are available for extracting geometrical information; the two fundamental ones are erosion and dilation. These operators extract structural information from the image using a set of structuring elements (SEs) of known shape and size. The SE determines which geometrical structures are detected and controls the smoothness of the output image. The morphological profile (MP) consists of the opening profile (OP) and the closing profile (CP), which are obtained by applying erosion and dilation in the two different orders. The OP segregates bright features, while the CP separates dark features from the image.
The OP and the CP at pixel p of the image I can be defined as n-dimensional vectors

OP(p) = {γ_R^i(p), ∀i ∈ [1, n]}
CP(p) = {φ_R^i(p), ∀i ∈ [1, n]}

where γ_R^i is the opening-by-reconstruction operator of the image with an SE of size i, φ_R^i is the closing-by-reconstruction operator of the image with an SE of size i, and n is the total number of opening and closing operations. The MP operator combines both opening and closing operators into a (2n + 1)-dimensional vector

MP(p) = {CP(p), I(p), OP(p)}.

An MP is usually formulated on a single wavelength-band image. In hyperspectral data, spectral information is spread over multiple bands, so an MP is computed for each individual band to represent the spatial information in different spectral regions. This approach is called the extended MP (EMP) [20]. Since HSIs have a large number of feature channels, it is challenging to apply the MP directly to the HSI data. Principal component analysis (PCA) is therefore used first to reduce the dimensionality, and MP features are then extracted from each principal-component image. The EMP is represented as

EMP(x) = {MP(x_1), MP(x_2), …, MP(x_m)}

where x_i is the ith channel of the image x after PCA processing and m is the total number of principal components.
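The MP construction above can be sketched in a few lines of numpy. This is a minimal illustration only: it uses square SEs and plain grayscale opening/closing rather than the disk-shaped, reconstruction-based operators used in the letter.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def _filter(img, size, func):
    """Min/max filter with a flat square SE of side `size` (edge padding)."""
    pad = size // 2
    padded = np.pad(img, pad, mode="edge")
    windows = sliding_window_view(padded, (size, size))
    return func(windows, axis=(-2, -1))

def erode(img, size):   return _filter(img, size, np.min)
def dilate(img, size):  return _filter(img, size, np.max)
def opening(img, size): return dilate(erode(img, size), size)   # removes bright details
def closing(img, size): return erode(dilate(img, size), size)   # removes dark details

def morphological_profile(band, sizes=(3, 5, 7)):
    """Simplified MP for one band: closings (largest SE first), the band, openings."""
    cp = [closing(band, s) for s in reversed(sizes)]
    op = [opening(band, s) for s in sizes]
    return np.stack(cp + [band] + op, axis=0)   # shape: (2n + 1, H, W)
```

Stacking such profiles over each principal-component band then gives the EMP feature cube described above.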

B. EMP With CNN
In order to avoid data redundancy and reduce the number of parameters used for the feature map, EMP feature extraction and a CNN are combined. For the proposed methodology (Fig. 1), the principal components that account for ≥96% of the total explained variance in the data are used as the reduced feature set. Extended morphology with the reconstruction approach is then computed to incorporate spatial information into the PCA bands that provide spectral information.
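The variance-threshold band reduction can be sketched as follows; this is a plain-numpy PCA, since the letter does not specify the implementation used.

```python
import numpy as np

def pca_reduce(hsi, var_threshold=0.96):
    """Project an (H, W, B) hyperspectral cube onto the fewest principal
    components whose cumulative explained variance is >= var_threshold."""
    H, W, B = hsi.shape
    X = hsi.reshape(-1, B).astype(np.float64)
    X -= X.mean(axis=0)
    # Eigendecomposition of the band-covariance matrix.
    cov = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1]          # sort by decreasing variance
    evals, evecs = evals[order], evecs[:, order]
    ratio = np.cumsum(evals) / evals.sum()
    k = int(np.searchsorted(ratio, var_threshold) + 1)
    return (X @ evecs[:, :k]).reshape(H, W, k)
```

The EMP is then computed per retained component image, as in Section II-A.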
A disk-shaped SE with a size that increases at each iteration is used to capture spatial information. The OP eliminates small bright objects from the foreground, keeping them in the background, while the CP discards small dark ones. The reconstruction method restores the objects through consecutive opening and closing operations. Each MP enlarges or reduces dark and bright parts, resulting in a more homogeneous version of the original image. The EMP output is binarized and compared with the provided binary ground-truth data. Precision, recall, and F-measure are used as performance metrics for evaluating the proposed approach. A three-layer CNN model is integrated with the EMP to obtain rich semantic information with high-level features from hyperspectral imagery for salient object detection. The three convolution layers consist of convolution kernels of size 3 × 3 with stride 1.
Generally, a CNN model with a 3 × 3 convolution kernel alone fails to preserve spectral information. Thus, in this research, the EMP output, which carries both spectral and spatial information, is used as the input to the CNN model. A rectified linear unit (ReLU) is used as the nonlinear activation function, followed by batch normalization (BN) and a 40% dropout layer to speed up training while avoiding overfitting (Table I). The feature maps are then fed to fully connected layers, and the softmax cross-entropy loss function is applied to achieve convergence while training. The parameters of the CNN model are updated by back-propagation using stochastic gradient descent with momentum on the loss.
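One such 3 × 3, stride-1 convolution layer followed by ReLU can be written directly in numpy. This is an illustrative sketch only; the 'same' zero padding and channel counts are assumptions, as the letter specifies only the kernel size and stride.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3x3_relu(x, kernels, biases):
    """One 'same'-padded 3x3 convolution with stride 1, followed by ReLU.

    x:       (H, W, C_in) feature map (e.g., the stacked EMP output)
    kernels: (3, 3, C_in, C_out) filter bank
    biases:  (C_out,)
    """
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))             # zero padding keeps H, W
    win = sliding_window_view(xp, (3, 3), axis=(0, 1))   # (H, W, C_in, 3, 3)
    y = np.einsum("hwcij,ijco->hwo", win, kernels) + biases
    return np.maximum(y, 0.0)                            # ReLU
```

Three such layers, with BN and dropout between them, form the feature extractor described above.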

III. EXPERIMENTAL RESULTS AND DISCUSSION
This section compares the proposed methodology with prominent salient object detection methods. Two hyperspectral datasets, namely hyperspectral salient object detection (HS-SOD) and Pavia University (PU), are considered. The experimental parameters used in this research are described in Section III-A. The obtained results are described in Section III-B and discussed in Section III-C.

A. Parameter Setting
The hyperparameters used in this research include a learning rate of 0.001 for stochastic gradient descent with a momentum of 0.7 and a weight decay of 0.0005. Full-resolution images are used to train the network with a mini-batch size of 10. A total of 50 iterations was selected, with 89,798 trainable parameters (Table I). The labeled dataset is split into 25% for training and 75% for testing. The kernel weights were randomly initialized. The proposed framework is implemented using TensorFlow on Google Colab [21]. Google Colab is a Jupyter-notebook environment providing a graphics processing unit (GPU), 12 GB of random access memory (RAM), and 358.27 GB of disk storage for computation.

B. Experimental Results

1) Dataset Details:
The performance of this research work was evaluated on two independent datasets to verify the generalization ability, viz., HS-SOD and PU datasets.
The HS-SOD dataset [22] consists of 60 HSIs collected from different scenes in public parks in Japan between August and September 2017. The images were collected in 151 spectral bands at a size of 1024 × 768 pixels, of which 81 bands covering the visible spectral range from 380 to 780 nm were considered. The ground truth and a rendered color image (sRGB) are provided with the dataset, which can be downloaded from GitHub: https://github.com/gistairc/HS-SOD.
The PU dataset, acquired in 2001 by the reflective optics system imaging spectrometer (ROSIS), covers a site in Northern Italy. It has a spatial resolution of 1.3 m and 115 spectral bands, each of size 610 × 340 pixels. Of the 115 spectral bands, 12 noisy bands were removed. The ground-truth data, comprising nine land-use/land-cover classes, were split into two halves, one for training and the other for testing.
2) Results: Of the 60 HSIs in the HS-SOD dataset, 15 images were used for training and the remaining images for performance evaluation. This section analyzes the output of the proposed methodology and compares it with other methods. For the comparison, two conventional methods, RC [7] and MDF [13], and four recent CNN-based methods, RFCN [14], DSN [15], AISA [17], and HSISO [18], were considered. A total of 20 PCA bands, accounting for ≥96% of the total explained variance in the data, were extracted. Fig. 2 illustrates the proposed methodology at different iterations for the HS-SOD dataset. As the number of iterations increases, the boundary edges are refined, thus reducing blurriness. Images of the HS-SOD dataset were categorized by condition, such as large objects, small objects, simple center-bias scenes, and complex and multiple-object scenes, for visual comparison. From Fig. 3, it can be observed that the salient object detection results generated by the proposed method are much closer to the ground truth than those of the other methods. This is because incorporating spatial information yields coherent boundaries and higher contrast between objects and their background. The saliency maps computed by the CNN-based methods are better than those of the conventional methods, as shown in Fig. 3.
The quantitative comparison for the selected datasets and the different methods is shown in Table II. The proposed method yields higher accuracy for all the metrics calculated. It increases the F-measure and AUC score by 2% and gives the minimum mean absolute error (MAE). Thus, the number of false predictions is significantly lower than for the other methods. The precision-recall (PR) and receiver operating characteristic (ROC) curves were plotted for a comprehensive analysis over different binary thresholds, as shown in Fig. 4(a) and (b). The ROC curve of the proposed method is closest to the top-left corner [Fig. 4(b)], indicating better performance. Similarly, the precision of the proposed method in the PR curve remains much higher even as recall increases, indicating a lower false-positive rate. To compare the PU results with the HS-SOD results, the first 81 spectral bands were selected. Training was run for 50 epochs with the same learning rate. The ground truth in Fig. 5 contains two significant salient objects. The quantitative evaluation for the different metrics is shown in Table II. Among all the compared methods, DSN and HSISO perform comparatively well; however, the proposed method gives better results in terms of both the AUC score and the F-measure. Fig. 5 shows the visual comparison of the results obtained for the PU dataset with the different methods. The proposed methodology also extracted nearby object boundaries that were not annotated in the ground truth. Exact boundaries can also be seen with the AISA method, but its output contains more noise.
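The metrics used in the comparison above can be computed as sketched below. The β² = 0.3 weighting for the F-measure is the convention common in saliency work, and the fixed 0.5 threshold is for illustration; both are assumptions, as the letter does not state them.

```python
import numpy as np

def saliency_metrics(pred, gt, beta2=0.3):
    """Precision, recall, F-measure, and MAE for a saliency map.

    pred: predicted saliency map in [0, 1]; gt: binary ground truth {0, 1}.
    beta2 and the 0.5 binarization threshold are illustrative assumptions.
    """
    mae = np.abs(pred - gt).mean()
    binary = pred >= 0.5
    tp = np.logical_and(binary, gt == 1).sum()
    precision = tp / max(binary.sum(), 1)
    recall = tp / max((gt == 1).sum(), 1)
    f = ((1 + beta2) * precision * recall) / max(beta2 * precision + recall, 1e-12)
    return precision, recall, f, mae
```

In practice, PR and ROC curves such as those in Fig. 4 are obtained by sweeping the binarization threshold rather than fixing it.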

C. Discussion
In this section, the behavior of the proposed method is analyzed and interpreted to provide insight for future work. The quantitative evaluation and the saliency maps were generated for comparison. The saliency maps in Fig. 2 show refined boundary edges as the number of iterations increases, indicating better optimization of the model. Figs. 3 and 5 show the visual comparison of the salient object detection methods on the two datasets to study the efficacy of the proposed approach. They demonstrate that combining spatial and spectral information through the combination of CNN and EMP performs better in terms of accuracy and saliency-map quality. The performance of the proposed method was also compared with published methods by computing standard evaluation metrics on both datasets, as shown in Table II. The EMP, which considers nearby pixel information, together with the CNN is likely the primary reason that the proposed method yields improved accuracy with refined boundaries. However, there were a few images in the HS-SOD dataset for which the proposed methodology failed to generate a saliency map (Fig. 6). These failures occur for objects with extremely low contrast with respect to their background [Fig. 6(a) and (b)]. Another reason may be the inability to infer the depth between the ground and the object [Fig. 6(c)]. The presence of multiple objects with complex shapes that are very close to each other [Fig. 6(d)] may also be responsible for the less-than-optimal performance. A possible solution to these problems could be to provide the model with more prior information based on object-feature similarity, which can help update the learnable weights of the CNN model by providing segment-level information. Apart from this, training the model with more complex scenes could eventually benefit different levels of scenes.

IV. CONCLUSION
In this letter, we proposed a CNN-based salient object detection method for HSI data. The proposed technique simultaneously utilizes spectral and spatial information through the EMP and nearby pixel information. Further, a three-layer CNN is integrated with the EMP to extract high-level features and preserve object boundaries. The proposed methodology can detect salient objects of multiple spatial scales and aspect ratios. The experimental results on HSIs show that the proposed method improves accuracy by ≥2% in terms of F-measure and AUC score with a lower MAE, thus outperforming the other methods. The generated saliency maps show better results with enhanced boundary edges in the visual comparison with other methods. On the other hand, we observed that detecting the edges of objects with very low contrast and complex features was comparatively less effective. This could be because of the fixed convolution kernel size and insufficient parameter tuning and optimization. Notably, the proposed CNN-based EMP method for saliency detection is significant in comparison with state-of-the-art techniques and can be further used to study different salient objects. Further work is underway in which spectral-spatial characteristics and a CNN model with more tunable parameters will be integrated to detect complex salient objects from the image.