A transformer-based cloud detection approach using Sentinel-2 imagery

ABSTRACT The presence of clouds blocks the view of objects on the Earth's surface in optical imagery, compromising its application and usability. Identifying and removing clouds is therefore a crucial task during image preprocessing. Recently, deep learning (DL)-based cloud detection methods have shown improved performance, but capturing global semantic features and long-range dependencies necessitates a careful selection of DL classifiers to further enhance their effectiveness. Keeping this in view, the present study proposes a novel spatial-spectral attention transformer for cloud detection (SSATR-CD) with a spatial-spectral attention module that generates an enhanced feature map to replace convolution by using the image patches directly. To implement the proposed approach, a new Sentinel-2 data set with various types of cloud cover over India (IndiaS2) was created and tested with the proposed method. Alongside this, an additional benchmarked data set (WHUS2-CD) was also considered to assess the transferability of the proposed model to other regions of the world by applying model-based transfer learning. The results highlight the effectiveness and efficiency of the SSATR-CD approach in both cases.


Introduction
The presence of cloud and cloud shadow often limits the usage of freely available optical satellite imagery in different Earth monitoring programmes by hiding objects on the Earth's surface. Thus, cloud and cloud shadow detection becomes a vital preprocessing step to improve the utilization of contaminated imagery (Zhiwei et al. 2022). Automated cloud detection is the most effective mechanism to create cloud masks in comparison to manual detection, which is time-consuming and requires a large team of experts. Therefore, automated cloud detection has been explored over the past decade, in which threshold-based, machine learning-based (ML-based), and deep learning-based (DL-based) approaches are the three significant categories. Threshold-based cloud detection approaches, such as Fmask (Function of Mask) (Qiu, Zhu, and Binbin 2019) and Sen2Cor (Sentinel-2 Correction) ('Sen2Cor v2.11'; Main-Knorn et al. 2017; Richter, Louis, and Berthelot 2012), apply several rules to determine the probable boundary of cloud and cloud shadow. In contrast, ML- and DL-based approaches use a classifier that automatically handles threshold selection to generate cloud masks (Liu et al. 2023). ML-based cloud detection requires additional manual feature extraction to achieve better results (Caraballo-Vega et al. 2022; W. Zhang et al. 2022; Liu et al. 2023; Singh, Biswas, and Pal 2022). Feature extraction in satellite imagery is a complex task, as inter-class variance between objects is relatively low (Gawlikowski et al. 2022); e.g. clouds and snow cover have similar surface reflectance. Thus, ML-based methods often fail to extract diverse spatial characteristics from training samples. On the other hand, DL-based cloud detection methods automatically learn discriminative spectral and spatial features from input samples to achieve better results. Approaches such as RS-Net (Jeppesen et al. 2019), MSCFF (Zhiwei et al. 2019), DeepMask (Xu et al. 
2019), WDCD (Yansheng et al. 2020), CD-SLCNN (Nan et al. 2021), Boundary-net (Kang et al. 2022), GCDB-UNet (Xian et al. 2022), etc., demonstrate the potential of DL-based methods by achieving promising results on different satellite imagery. Given the high computational cost and large number of parameters of DL-based methods, a few studies have also proposed lightweight DL models (Jun et al. 2021; Kai et al. 2022).
Lightweight networks such as the Efficient Cloud Detection Network (ECDNeT) and Cloud Detection-fusing multiscale spectral and spatial features (CD-FM3SF) achieved comparable performance for binary cloud detection (cloud only). However, cloud detection is a multiclass problem in which classes such as cloud shadow and thin cloud are of vital importance (Xuemei, Huping, and Qiu 2022), and existing DL-based cloud detection methods fall short in capturing the global semantic features and long-range dependencies needed to attain better performance. In computer vision, a similar problem has recently been tackled using the vision transformer (ViT), where the self-attention mechanism automatically discovers the relationships between image patches. ViT has exhibited good performance for remote-sensing applications (Papoutsis et al. 2023; Aleissaee et al. 2022), but several limitations, such as long training periods, high computation cost, and unstable training (Han et al. 2022), have also been highlighted. These problems stem from extracting image patches directly from satellite imagery or from the use of convolution layers (Z. Roy et al. 2021).
In view of the above limitations, a novel cloud detection method based on a vision transformer, named the spatial-spectral attention transformer for cloud detection (SSATR-CD), is proposed for Sentinel-2 imagery. A spatial-spectral attention (SSA) module is created to extract enhanced feature maps, replacing the convolution layer. A new data set named 'IndiaS2' is also created over the Indian region to evaluate the proposed model's performance in detecting multiple cloud classes, including thin cloud, thick cloud, and cloud shadow, separately. The performance of the proposed approach is compared with different state-of-the-art cloud detection methods. To allow the work to be replicated, the code and data set are provided at: https://github.com/rohitsingh-nit/SSATR-CD-using-IndiaS2-dataset. The proposed method's transfer learning capabilities are also evaluated using the benchmarked WHUS2-CD data set created over the mainland China region. Here, the model-based transfer learning technique is utilized (Fang et al. 2019), which is found to reduce operational time and achieve comparable results.

Methodology
The self-attention module in transformers (Vaswani et al. 2017) allows them to learn long-range dependencies globally by shortening the path length between any input and output sequence positions. This mechanism highlights the relevant features of the input, originally for machine translation. Vision Transformers (ViT) have been used effectively in various computer vision tasks, including on remote-sensing data sets, due to their ability to use self-attention and derive global information from images as a sequence of 1D embedded patches (Dosovitskiy et al. 2020; Carion et al. 2020). In spite of their effectiveness for image classification, ViTs face challenges, including (a) long training periods, (b) high computation cost, (c) difficulty in model convergence, and (d) unstable training.
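The global dependency mechanism described above can be illustrated with a minimal NumPy sketch of scaled dot-product self-attention (names and dimensions are ours for illustration; this is not the paper's implementation):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of patch embeddings.

    X: (n_patches, d_model); Wq, Wk, Wv: (d_model, d_k) projection matrices.
    Every patch attends to every other patch, so the path length between any
    two positions is one step, regardless of their spatial distance.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise patch affinities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over patches
    return weights @ V                               # attention-weighted values

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 32))                    # 16 patches, 32-dim each
Wq, Wk, Wv = (rng.standard_normal((32, 32)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                  # shape (16, 32)
```

Multi-head attention repeats this computation with several independent projection triples and concatenates the results.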

SSATR-CD
To deal with the above challenges, we propose SSATR-CD, in which the SSA module is introduced in place of the convolution layer to extract image patches from the satellite imagery (Figure 1).
The SSA module combines spatial attention and spectral attention to extract long-range contextual features along the spatial distribution and channel dimension, respectively. The SSA module generates an enhanced feature map via a skip connection to preserve the gradient. The enhanced feature map helps in fast convergence of the transformer with stabilized training. The feature maps obtained this way are split into fixed-size 2D patches that are flattened and projected linearly. Position encoding is applied to the linearly projected patches, which generates a 1D embedded image sequence that becomes the input of the standard transformer encoder. The transformer encoder comprises a multi-head self-attention module that applies the attention mechanism multiple times in parallel, capturing a broader range of relationships by learning a unique attention pattern in each head. Finally, a multilayer perceptron head is applied to generate cloud masks (see Appendix 1 for the algorithm).
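The patch-splitting, flattening, linear projection, and position-encoding steps above can be sketched as follows (a NumPy sketch of the generic ViT embedding pipeline; the random weight initialization and names are illustrative assumptions, not the authors' code):

```python
import numpy as np

def embed_patches(feature_map, patch, d_model, rng):
    """Split an H x W x B enhanced feature map into non-overlapping 2D patches,
    flatten each, project linearly, and add position embeddings, yielding the
    1D embedded sequence fed to the transformer encoder."""
    H, W, B = feature_map.shape
    patches = [feature_map[i:i + patch, j:j + patch].ravel()
               for i in range(0, H, patch) for j in range(0, W, patch)]
    patches = np.stack(patches)                       # (n_patches, patch*patch*B)
    W_proj = 0.02 * rng.standard_normal((patches.shape[1], d_model))
    pos = 0.02 * rng.standard_normal((patches.shape[0], d_model))
    return patches @ W_proj + pos                     # position-encoded sequence

rng = np.random.default_rng(1)
fmap = rng.standard_normal((16, 16, 4))               # e.g. four Sentinel-2 bands
seq = embed_patches(fmap, patch=4, d_model=64, rng=rng)  # shape (16, 64)
```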

Spatial attention module
The spatial attention module comprises parallel convolution layers with different dilation rates to obtain enhanced feature maps by utilizing small to large receptive fields for each pixel. Image patches of size h × w (height × width) are used as input, where each patch helps determine whether the class corresponding to its middle pixel is cloudy or clear. Each convolution layer's output is normalized to handle unstable gradients, and the Rectified Linear Unit activation function transforms the received input by setting all negative values to zero. Matrix multiplication is applied to each convolution layer's output to obtain optimized features at a reduced computational requirement: the output of the adjacent convolution layer is transposed and then matrix-multiplied using Einstein summation (einsum) notation. A further convolution layer with an activation layer is applied to smooth the feature map. As large filters usually overlook small but essential local features, a small filter size of 3 × 3 is used throughout. A skip connection is added to the output of the convolution layer to recover lost spatial information (Figure 2).
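One plausible reading of the transpose-and-multiply einsum step is sketched below (our own reconstruction under stated assumptions, not the authors' exact code): two branch outputs are reshaped, one is transposed and multiplied against the other to score every spatial position against every other, and the weights re-mix a third branch, with a skip connection at the end.

```python
import numpy as np

def spatial_attention(a, b, v):
    """a, b, v: (H, W, C) outputs of parallel dilated-convolution branches
    (here just random arrays; the real module produces them with 3x3 convs).
    A @ B.T via einsum gives position-to-position affinities; the softmaxed
    weights re-mix v, and a skip connection restores lost spatial detail."""
    H, W, C = a.shape
    A, B, V = (x.reshape(H * W, C) for x in (a, b, v))
    scores = np.einsum('ic,jc->ij', A, B)            # transpose + matmul in einsum
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return (weights @ V).reshape(H, W, C) + v        # skip connection

rng = np.random.default_rng(2)
a, b, v = (rng.standard_normal((8, 8, 16)) for _ in range(3))
out = spatial_attention(a, b, v)                     # shape (8, 8, 16)
```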

Spectral attention module
The spectral attention module emphasizes the inter-channel feature map by examining the corresponding bands or channel maps. The spatial attention output is utilized to calculate the spectral attention map, where a skip connection with element-wise multiplication provides an enhanced SSA feature map. A global average pooling operation is applied to the spatial feature map (height (h) × width (w) × band (b)) to generate one feature map (1 × 1 × b), which reduces the total number of parameters to minimize overfitting, followed by two convolution and activation layers that amplify the extracted spectral attention features, as shown in Figure 3 (see Appendix 1 for the algorithm).
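A minimal sketch of this channel re-weighting, with two small dense layers standing in for the paper's two convolution + activation layers (the weights, bottleneck width, and sigmoid gate are illustrative assumptions):

```python
import numpy as np

def spectral_attention(spatial_feat, rng):
    """Channel re-weighting of an (H, W, B) spatial feature map: global average
    pooling collapses it to a 1 x 1 x B descriptor, two placeholder layers
    amplify the per-band responses, and element-wise multiplication with the
    input acts as the skip connection yielding the enhanced SSA feature map."""
    B = spatial_feat.shape[-1]
    desc = spatial_feat.mean(axis=(0, 1))                      # GAP -> (B,)
    w1 = rng.standard_normal((B, max(B // 2, 1)))
    w2 = rng.standard_normal((max(B // 2, 1), B))
    hidden = np.maximum(desc @ w1, 0)                          # ReLU
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2)))                # sigmoid in (0, 1)
    return spatial_feat * gate                                 # skip multiply

rng = np.random.default_rng(3)
feat = rng.standard_normal((8, 8, 4))                          # h x w x b
enhanced = spectral_attention(feat, rng)                       # shape (8, 8, 4)
```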

Data set used
For this study, freely available Sentinel-2 images were used to create a new cloud detection data set over the Indian region. Altogether, 11 tiles were selected in the form of Level-1C scenes (L1C scenes) covering different parts of India. All L1C image products are available at the Copernicus Open Access Hub (European Space Agency ESA 2022). This data set is named 'IndiaS2' and contains imagery covering rich diversity (Table 1) with less than 5% cloud cover. Imagery with less than 5% cloud cover is of particular importance because even sparse, scattered clouds can corrupt an entire image, and cloud detection in such cases is strongly affected by the resulting imbalanced classification problem. India, being the second most populous country, has a high proportion of urban area in the form of cities, towns, and suburbs. Therefore, almost every image used to create this data set contains some amount of urban land cover.
Cloud reference masks were generated manually using Adobe Photoshop software; the magic wand and lasso tools were used to mark cloud boundaries. A cloud is labelled thick if it obscures the Earth's surface with an opacity of more than 50%, and thin if the opacity is less than 50%. The generated cloud mask consists of four classes: thick cloud, thin cloud, cloud shadow, and ground (Table 2). The generated cloud mask has a 10 m pixel resolution, which can easily be down-sampled to 20 m and 60 m pixel resolution using the nearest-neighbour algorithm. Threshold-based cloud detection methods such as Fmask and Sen2Cor create their own cloud masks comprising different classes denoted by particular values or labels. For a fair performance evaluation, these output mask values need to be converted into the manual reference mask values considered here. Table 3 provides the cloud mask value conversion to be used with the Fmask and Sen2Cor algorithms for the IndiaS2 data set.
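The down-sampling and label-conversion mechanics can be sketched as follows. The label codes below are hypothetical placeholders chosen for illustration; the actual values appear in Tables 2 and 3.

```python
import numpy as np

# Hypothetical label codes for illustration only; the values actually used by
# the reference masks and by Fmask/Sen2Cor are given in Tables 2 and 3.
GROUND, THICK, THIN, SHADOW = 0, 1, 2, 3

def downsample_nn(mask, factor):
    """Nearest-neighbour down-sampling of a 10 m mask (factor 2 -> 20 m,
    factor 6 -> 60 m): keep every `factor`-th pixel in each direction."""
    return mask[::factor, ::factor]

def remap(mask, table):
    """Convert a method's output label values to the reference-mask values."""
    out = np.full_like(mask, GROUND)
    for src, dst in table.items():
        out[mask == src] = dst
    return out

mask10m = np.array([[THICK, THICK,  GROUND, GROUND],
                    [THICK, THIN,   GROUND, SHADOW],
                    [GROUND, GROUND, SHADOW, SHADOW],
                    [GROUND, GROUND, GROUND, GROUND]])
mask20m = downsample_nn(mask10m, 2)       # 2 x 2 mask at 20 m resolution
```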
Additionally, the WHUS2-CD (Jun et al. 2021) data set is considered to check the proposed methodology's transfer ability and to compare its performance with different state-of-the-art cloud detection methods. The WHUS2-CD data set also consists of Sentinel-2 imagery, over mainland China, where each image is paired with a manually labelled binary cloud mask. This cloud mask identifies each pixel as either cloud or non-cloud only.

Results and discussion
Several performance measures, including overall accuracy (OA), user accuracy (UA), producer accuracy (PA), F1-score, precision, recall, and mean intersection over union (mIoU), as in Tables 4-7 and Figure 4, were used to compare the performance of the proposed approach. Colour-coded masks were generated to visually analyse individual Sentinel-2 images having different land-cover conditions over the Indian region (Figures 5-9). Four optical spectral bands of the multispectral imagery were used for training, validation, and testing: Band 2 (blue), Band 3 (green), Band 4 (red), and Band 8 (near infrared). During training, the imagery was divided such that 75% of the data was used for training, while the remaining 25% was used to validate the trained model. All training and testing used a high-performance computing system with an Intel(R) Xeon(R) Gold 6132 CPU @ 2.60 GHz, 96 GB of RAM, and a 16 GB NVIDIA graphics card, in a Python environment using the TensorFlow framework.
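The 75/25 split described above can be sketched as a simple shuffled hold-out (function and variable names are our own; the authors' splitting code may differ):

```python
import numpy as np

def train_val_split(patches, labels, val_frac=0.25, seed=0):
    """Shuffle the extracted patch samples and hold out `val_frac` of them
    for validation, mirroring the 75/25 split described above."""
    idx = np.random.default_rng(seed).permutation(len(patches))
    cut = int(round(len(patches) * (1 - val_frac)))
    return (patches[idx[:cut]], labels[idx[:cut]],
            patches[idx[cut:]], labels[idx[cut:]])

patches = np.zeros((100, 16, 16, 4))    # 100 samples, four spectral bands
labels = np.zeros(100, dtype=np.int64)
x_tr, y_tr, x_va, y_va = train_val_split(patches, labels)  # 75 / 25 samples
```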

Evaluation metrics
The confusion matrices (Figure 4) are used to evaluate the multiclass cloud classification performance metrics (Table 4) and compare the results of the considered cloud masking methods, where n is the number of pixels and c is the class label. The main performance metrics are micro-averaged to aggregate the contribution of each class.
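For reference, the standard way these quantities are derived from a c × c confusion matrix is sketched below (a generic illustration, not the authors' evaluation code):

```python
import numpy as np

def evaluate(cm):
    """Metrics from a c x c confusion matrix (rows = reference, cols =
    prediction). UA is per-class precision, PA is per-class recall; with each
    pixel assigned exactly one class, micro-averaged precision and recall both
    equal OA, and mIoU averages the per-class intersection over union."""
    diag = np.diag(cm).astype(float)
    oa = diag.sum() / cm.sum()
    ua = diag / cm.sum(axis=0)                    # user accuracy (precision)
    pa = diag / cm.sum(axis=1)                    # producer accuracy (recall)
    miou = np.mean(diag / (cm.sum(axis=0) + cm.sum(axis=1) - diag))
    return oa, ua, pa, miou

cm = np.array([[50, 2],
               [3, 45]])
oa, ua, pa, miou = evaluate(cm)                   # oa = 0.95
```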

Cloud detection with IndiaS2 data set
For the IndiaS2 data set, Table 5 provides the results of the considered cloud detection methods for multiclass cloud detection, covering thick and thin clouds, cloud shadow, and ground. The results suggest that the proposed approach outperformed Fmask, Sen2Cor, and the standard ViT in almost all performance metrics. Further, Sen2Cor and the proposed method performed well compared to ViT for the IndiaS2 data set. The results are promising for the Indian region, as all methods achieve an overall accuracy of more than 99%, while SSATR-CD outperformed the others in terms of F1-score, recall, and mIoU. The prediction time of Fmask and Sen2Cor was about 4 and 12 min per image, respectively. In contrast, ViT and SSATR-CD generated cloud masks in a comparable time of about 15 and 18 min per image, which highlights the need to further optimize the designed layers and modules of SSATR-CD for the TensorFlow framework. Table 6 reports UA and PA for thick, thin, and overall clouds, along with the cloud shadow class, where SSATR-CD achieved the best UA and PA for thin clouds and balanced results for the other classes. Although Fmask attained the highest UA for both cloud and shadow detection, its lowest PA indicates a higher omission error rate for both cloud and shadow. Sen2Cor achieved the highest PA for cloud and shadow, which suggests its superiority in dealing with overall cloud and cloud shadow omission errors. However, the individual omission error rates for the thick and thin cloud classes are much higher for Sen2Cor than for SSATR-CD: the PA of Sen2Cor for thick and thin clouds is 65.93% and 56.13%, respectively, whereas the PA of SSATR-CD is 97.07% and 68.22%, respectively.

Qualitative results
In Figures 5-9, the true-colour imagery and colour-coded cloud mask are displayed, where thick cloud is represented as red, thin cloud as yellow, cloud shadow as black, and ground as green.
Figures 5(a,g) provide true-colour imagery over Thiruvananthapuram (Kerala), tile-id T43PFK, acquired on 16 August 2022, with some land cover, a shoreline, and a large area of sea, where the land portion is covered by small cumulus clouds. Figures 5(b,h) are the colour-coded reference mask, whereas Figures 5(c-f) and 5(i-l) are the colour-coded masks obtained using the considered cloud detection methods. The selected square portion in Figures 5(a-f) is zoomed and provided in Figures 5(g-l). Both ViT and SSATR-CD performed equally well in detecting thick and thin clouds, whereas SSATR-CD estimated the cloud shadow cover area better than ViT. Both SSATR-CD and ViT failed to detect small portions of thin cloud and thin cloud over bright Earth surfaces. Sen2Cor and Fmask misclassified the shoreline as thick cloud and cloud shadow, while Sen2Cor failed to detect thin clouds, as shown in Figures 5(i,j), respectively. SSATR-CD performed better than the other considered methods, but its main weakness is the commission error of labelling cloud-surrounded seawater pixels as cloud shadow (Figures 5(a-f)).
Figures 6(a,g) depict the true-colour image over Bathinda (Punjab), tile-id T43RDP, captured on 23 November 2022, with small thin clouds sparsely scattered over barren land and farmland. Figures 6(b,h) are the colour-coded reference mask, whereas Figures 6(c-f) and 6(i-l) are the colour-coded masks obtained by the considered cloud detection methods. Figures 6(g-l) are a magnified version of the selected square part in Figures 6(a-f). Both the ViT and SSATR-CD methods were able to detect thin clouds (Figures 6(k,l)), but SSATR-CD estimated the boundary of the thin cloud cover correctly. Sen2Cor misclassified thin clouds as thick clouds while overestimating cloud boundaries, as shown in Figure 6(i). Fmask tends to overestimate cloud shadow by generating it for each detected cloud (Figures 6(d,j)). Figures 7(g-l) represent the zoomed selected part of Figures 7(a-f). Sen2Cor, Fmask, ViT, and SSATR-CD were all able to detect the clouds displayed in the zoomed Figures 7(i-l), respectively, although Sen2Cor was found to confuse thin clouds with thick clouds. Both transformer-based methods (ViT and SSATR-CD) estimated the thin cloud cover correctly. In comparison, Sen2Cor was found to overestimate cloud shadow by confusing terrain shadow with cloud shadow, while SSATR-CD reduced this confusion to some extent. Figures 8(a,g) display the true-colour image from Pithora (Chhattisgarh), tile-id T44QPJ, captured on 21 December 2022, with less than 1% small thin clouds scattered over hills and plain land. Figures 8(b,h) are the colour-coded reference mask, whereas Figures 8(c-f) and 8(i-l) are the colour-coded masks obtained by the considered detection methods. Figures 8(g-l) are the zoomed selected square part of Figures 8(a-f). A comparison of these figures indicates that, among all considered methods, only SSATR-CD was able to detect the small scattered thin clouds.
Figures 9(a,g) represent true-colour imagery captured from Haldwani (Uttarakhand), tile-id T44RLT, on 28 October 2022, with clouds over a mountain, a river bed, and plain ground. Figures 9(b,h) are the colour-coded reference mask, and Figures 9(c-f) and 9(i-l) were obtained using the considered cloud detection methods. Figures 9(g-l) represent the zoomed selected square part of Figures 9(a-f). A comparison of Figures 9(g-l) suggests that SSATR-CD was better able to detect thick and thin clouds (Figure 9(l)), while ViT was found to confuse the dried riverbed area with thin clouds, thus overestimating the thin cloud cover. Fmask exhibited the highest commission error of clouds over dried riverbed areas of valleys and river deltas (Figures 9(d,j)). Cloud shadow detection near the bottom of the mountain region is relatively poor for all considered detection methods (Figures 9(g-l)). Table 7 reports cloud-only detection results of the considered cloud detection methods, including the proposed SSATR-CD, standard ViT, CloudViT, CD-FM3SF (Jun et al. 2021), Fmask, and Sen2Cor, using the WHUS2-CD data set. Here, eight test images of WHUS2-CD were considered for performance evaluation in terms of transfer learning. Models trained on the Indian data set using standard ViT and SSATR-CD were used to perform cloud detection on the test imagery of WHUS2-CD, where thick and thin clouds were re-labelled as cloud, and cloud shadow and ground as non-cloud. The results (Table 7) suggest that SSATR-CD outperformed both the threshold-based and the transformer-based cloud detection methods (CloudViT and standard ViT) in all considered performance metrics except the recall value of the Fmask method. Despite being a case of transfer learning, SSATR-CD achieves results comparable to those obtained by CD-FM3SF.
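The four-class to binary re-labelling used for the WHUS2-CD evaluation can be sketched as follows (the numeric label codes are hypothetical placeholders for illustration; the actual values are given in Table 2):

```python
import numpy as np

# Hypothetical IndiaS2 label codes for illustration:
# 0 = ground, 1 = thick cloud, 2 = thin cloud, 3 = cloud shadow.
def to_binary_cloud(mask):
    """Collapse the four-class mask to the WHUS2-CD binary scheme described
    above: thick and thin cloud -> cloud (1); shadow and ground -> non-cloud (0)."""
    return np.isin(mask, (1, 2)).astype(np.uint8)

four_class = np.array([[0, 1],
                       [2, 3]])
binary = to_binary_cloud(four_class)      # [[0, 1], [1, 0]]
```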

Discussion
Sentinel-2 satellite data are affected by the presence of different types of clouds (nimbus, cirrus, cumulus, stratus, etc.) and their shadows. Based on visual interpretation, these clouds can be divided into thick and thin clouds. For the Sentinel-2 data set created over the Indian region, where images are usually available at a gap of 5-10 days for each tile, a data set with less than 5% cloud cover was created. Fmask, which has provided good results on other benchmarked Sentinel-2 data sets, surprisingly performed poorly on the IndiaS2 data set. Fmask cannot detect thin clouds separately; modifications incorporating specific rules and new thresholds are therefore necessary to treat thin cloud as a separate class. Sen2Cor performs well for ground detection but often confuses thin cloud with bright surfaces and mislabels bright objects as thick cloud (Figures 5-9). The proposed DL-based methods perform well compared to Sen2Cor because they combine spatial information with spectral information, which helps discriminate clouds from classes with similar spectral reflectance (Table 7). ViT tends to overestimate thin cloud over bright surfaces, which Sen2Cor ignores, leading to Sen2Cor's better performance in those areas. The proposed SSATR-CD is designed to extract additional information by applying the SSA module, which helps better identify small thin clouds over different land covers; it thus outperforms the other considered methods, including the threshold-based methods (Fmask and Sen2Cor) and standard ViT (Table 5).
The detailed visual comparison in Figures 5-9 highlights that the proposed SSATR-CD method achieves consistent results over different conditions compared to the other considered methods. SSATR-CD is found to detect small thin clouds over different land covers while estimating the proper cloud boundary. When applied to the (binary) WHUS2-CD data set, the results indicate good transfer capabilities of the trained SSATR-CD model to other regions with different cloud cover percentages. SSATR-CD achieves results comparable to those of CD-FM3SF, and its ability to detect cloud shadow is an additional advantage; the performance of CD-FM3SF is found to degrade when considered for multiclass cloud detection. SSATR-CD also outperforms CloudViT, which takes a multiscale dark channel feature as input. Despite these encouraging results, SSATR-CD is found to overestimate cloud shadow over seawater, to struggle with cloud shadow and thin cloud detection over large cover areas, and to miss cloud shadow near the bottom of mountains, problems shared by the other cloud detection methods.

Conclusion and future scope
In this study, a transformer-based cloud detection method (SSATR-CD) was proposed, adding an SSA module that extracts an enhanced feature map to overcome the limitations of ViT. Keeping in view India's conditions and the availability of Sentinel-2 imagery, a new data set named IndiaS2 was also created by manually labelling each image at the maximum provided pixel resolution (10 m). The SSATR-CD results demonstrated the efficiency and effectiveness of the proposed method over other traditional and state-of-the-art cloud detection methods. The proposed approach was found to be very effective when the model trained on the IndiaS2 data set was applied to another benchmark data set (WHUS2-CD). Future work will explore possibilities for reducing the computational cost of SSATR-CD without sacrificing performance, and for considering additional global data sets such as KappaSet (Shtym et al. 2022) and CloudSEN12 (Aybar et al. 2022). A mechanism to extend the usability of SSATR-CD to other satellites will also be considered.

Disclosure statement
No potential conflict of interest was reported by the authors.