Generating Views Using Atmospheric Correction for Contrastive Self-Supervised Learning of Multispectral Images

Ankit Patnala, Scarlet Stadtler, Martin G. Schultz, and Juergen Gall, Member, IEEE

Abstract-In remote sensing, plenty of multispectral images are publicly available from various landcover satellite missions. Contrastive self-supervised learning is commonly applied to unlabeled data but relies on domain-specific transformations used for learning. When focusing on vegetation, standard transformations from image processing cannot be applied to the near-infrared (NIR) channel, which carries valuable information about the vegetation state. Therefore, we use contrastive learning, relying on different views of unlabeled, multispectral images to obtain a pretrained model to improve the accuracy scores on small-sized remote sensing datasets. This study presents the generation of additional views tailored to remote sensing images using atmospheric correction as an alternative transformation to color jittering. The purpose of the atmospheric transformation is to provide a physically consistent transformation. The proposed transformation can be easily integrated with multiple channels to exploit spectral signatures of objects. Our approach can be applied to other remote sensing tasks. Using this transformation leads to improved classification accuracy of up to 6%.

I. INTRODUCTION
Computer vision methods play a pivotal role in the field of remote sensing, which includes a wide range of applications [10]. Specifically, landcover classification aims at labeling contiguous regions of the Earth's surface based on characteristic surface reflectance patterns. Most landcover classification algorithms exploit information from various channels of the visible spectrum measured by satellite or airborne instruments. For several years, machine learning has been explored in the remote sensing community for this task. The community has developed a small number of benchmark datasets for landcover classification, for example, small-sized labeled landcover datasets such as sat-4 and sat-6 [3] and eurosat [8], and medium-sized datasets such as bigearthnet [13] and sen12ms [6]. While these have proven their use, they are rather small compared with the state-of-the-art benchmark datasets for image recognition or the petabytes of data publicly available from satellite missions.
Most previous machine learning studies on landcover classification used supervised training approaches where a neural network is trained on labeled data and then tested on another portion of the labeled dataset that was withheld from the model during training. Newer machine learning techniques allow the extraction of useful information from unlabeled data and thus open new possibilities to train machine learning models on much larger amounts of data than before. Because multispectral images differ in nature from red, green, and blue (RGB) images, in this study we focus on contrastive learning as one fundamental approach to train on these unlabeled multispectral data.
In particular, multispectral remote sensing images provide different surface reflection properties useful for landcover classification, and it would be a waste of information if their analysis were limited to the RGB channels commonly found in natural images. Sentinel2 [5], for example, also has bands such as red edge, near-infrared (NIR), and short-wave infrared. The NIR band is particularly useful for the classification of vegetation and the assessment of vegetation health, as it relates to the leaf area index [14]. Fig. 1 shows a box plot of NIR reflectance for different classes of the eurosat dataset. Landcover classes such as pasture lands, crop fields, and forests reflect significantly more NIR radiation than classes such as industrial buildings, residential places, vegetation-lacking urban land, and water body landcover types such as rivers, oceans, and lakes. Thus, adding an NIR channel to the input data stream of a deep learning model allows the model to learn more distinguishable features of landcover types and improves scores on landcover classification tasks, especially with respect to vegetated land surfaces.
Fig. 1. Boxplot of NIR reflectance of different landcover types randomly sampled from the eurosat dataset. The plot shows the differences between densely vegetated and nonvegetated landcover types. The plot is drawn from 500 images from each landcover type, of which 30 pixels were randomly selected.

To classify landcover, and specifically vegetated landcover, from petabytes of unlabeled Sentinel2 remote sensing data, we use a self-supervised contrastive deep learning model. Such contrastive learning approaches rely on the type of transformation used in creating augmented data, i.e., the controlled creation of variants of the original data samples. Augmentations are task-specific and must be curated accordingly [11], [15]. While remote sensing data can in principle be considered as images, they usually come with more spectral channels.
Therefore, the transformations applied to remote sensing data should be able to take these into account. So far, the remote sensing community has adopted the same transformations that are used in natural image processing. However, especially the widely used color jittering can only be applied to RGB images. Naively extending color jittering to other spectral channels such as NIR leads to a deterioration of physical information. For example, brighter signals in the NIR channel imply healthier plants. Therefore, we set out to develop a new transformation pipeline based on atmospheric correction, which can be applied to all the channels of remote sensing images and preserves physical consistency.
Atmospheric correction is a process to remove atmospheric effects on the spectral signature of the reflected light [2]. Sentinel2 measures the solar radiation that is reflected at the Earth's surface. In the atmosphere, light is absorbed and scattered, and this must be corrected to retrieve the true spectral signature of the objects on the Earth's surface. Absorption reduces the intensity of pixels, causing haziness, whereas scattering affects the readings at neighboring pixels. Atmospheric correction affects the spectral reading of all the bands and hence plays an important role in landcover classification [12]. Currently, there are many atmospheric correction algorithms in use to convert top-of-atmosphere (TOA) images into bottom-of-atmosphere (BOA) reflectance. For our contrastive learning approach, we exploit the fact that atmospheric correction produces pairs of uncorrected and corrected images. Furthermore, atmospheric correction is applied to all the image channels, so that this "augmentation" can be used for multichannel data and not only for visible RGB images as done in most prior methods [4]. The pretrained model obtained using our atmospheric-correction-based transformation as an alternative to color jittering yields better scores on two different landcover classification tasks. In this letter, we also show experimentally that adding one channel, i.e., NIR, to the RGB bands improves accuracy on the landcover classification tasks without increasing the number of trainable parameters. In summary, our contributions are as follows.
1) We show that adding NIR information to landcover classification improves accuracy.
2) We devise a new transformation using pairs of images before and after atmospheric correction that is applicable to all spectral channels.
3) The combination of these two points leads to accuracy improvements of up to 6% on standard remote sensing benchmark datasets.

II. MOTIVATION AND PROBLEM STATEMENT
Contrastive learning algorithms learn representations for visual inputs by minimizing the distance between pairs of differently augmented views of the same sample in the latent space, while other instances (negatives) are pushed as far apart as possible. Here, distance is typically measured as angular distance in a multidimensional vector space. Mathematically, minimizing this distance is equivalent to maximizing the cosine similarity, i.e., the normalized dot product, of the two vectors. Special loss functions such as InfoNCE [4], [7] are used to avoid collapsing of the features (which happens easily with mean squared error) and to maintain homogeneity in the representation among the images available in the dataset.
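For reference, the standard form of the InfoNCE loss as used in moco [7], for a query embedding q with one positive key k_+ and K negative keys k_i, is

\mathcal{L}_q = -\log \frac{\exp(q \cdot k_{+}/\tau)}{\sum_{i=0}^{K} \exp(q \cdot k_{i}/\tau)},

where \tau is a temperature hyperparameter and the sum runs over the positive and the K negative keys; with L2-normalized embeddings, the dot products correspond to cosine similarities.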
In terms of image transformations used to generate the augmented image pairs, random cropping and resizing, color jittering, grayscaling, and Gaussian blurring are commonly used for natural images [7]. These transformations have also been applied successfully to remote sensing images, although the results are not on par with ImageNet pretrained models [10]. In addition to these transformations, [1], [10] used spatially aligned pairs of the same landcover image at different times. The idea behind this is that such image pairs should have the same semantics, because there is a season-invariant part such as land surface characteristics. This has improved the results when compared with the ImageNet pretrained model.
As demonstrated by [10], time provides an additional signal to learn landcover semantics, but this approach can only be applied to static image analysis. For tasks aiming at characterizing the dynamics of surface changes, the temporal contrast cannot be used to form image pairs. Moreover, as discussed in Section I, the NIR channel contains vital information about the vegetation state of a surface.
The opportunity to access the previously unused information motivated us to develop a novel transformation which exploits the multiple channels of multispectral images, i.e., NIR + RGB, to provide additional general surface reflectance information to the pretrained models.
This letter presents a first step toward such dynamic vegetation mapping by exploring the new transformation on static images for landcover classification. We designed a contrastive self-supervised learning setup for multispectral images, focusing on vegetation and using atmospheric correction to create the views.

III. DATASETS
In contrastive learning, two types of datasets are used: one for pretraining to obtain a pretrained model and another to evaluate the pretrained model. The tasks on which the pretrained model is evaluated are called the downstream tasks, in our case landcover classification. The experiments in this study are benchmarked with the seco dataset [10], and we use the dataset released by [10] for pretraining. For landcover classification, we use bigearthnet [13] and eurosat [8].

A. Seco Dataset
This is an unlabeled dataset created by [10]. They used Google Earth Engine (GEE) to obtain Sentinel2 images of the region of interest directly from the archive. They proposed that varying landcover types can be obtained by uniformly sampling geo-locations in and around cities. The authors claim that at the city center one finds mostly buildings, whereas moving away from the city, landcover types such as croplands appear.
As temporal similarity was one of the key factors in their method, they obtained five samples for each site within a span of three months. Sentinel2 provides a native cloud detection algorithm; images with more than 10% cloud cover were discarded from the Seco dataset.

B. Atmospheric-Uncorrected Seco Dataset
We use atmospheric-uncorrected and -corrected images as a similar pair for our contrastive learning setup. Sentinel2 has multiple preprocessing levels, of which Level-1C represents an atmospheric-uncorrected image and Level-2A represents the corrected version. The Sen2Cor [9] algorithm, which converts an uncorrected image into a corrected one, is noninvertible. Sen2Cor uses a scene classification module and an atmospheric correction module along with measurements such as aerosol optical thickness and water vapor to apply corrections and output an atmospherically corrected image. To obtain the complementary uncorrected image for each corrected image in the Seco dataset, we query GEE for the corresponding uncorrected image using metadata from its corrected counterpart. The process is shown graphically in Fig. 2.
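As an illustration of this pairing step (not the authors' exact query), a Level-2A patch can be matched to its Level-1C counterpart on GEE by tile and sensing date. The collection ID 'COPERNICUS/S2' and the 'MGRS_TILE' property are the standard GEE names for Sentinel-2 Level-1C data; the function name and example values below are hypothetical.

```python
# Hypothetical sketch: pair a Level-2A (corrected) Seco image with its
# Level-1C (uncorrected) counterpart via the earthengine-api.
import ee

ee.Initialize()

def find_l1c_counterpart(mgrs_tile, date_start, date_end):
    """Query the Sentinel-2 Level-1C (TOA) collection for the granule that
    matches a given Level-2A image by MGRS tile and sensing date."""
    l1c = (ee.ImageCollection('COPERNICUS/S2')            # Level-1C, top of atmosphere
           .filter(ee.Filter.eq('MGRS_TILE', mgrs_tile))  # same tile as the L2A image
           .filterDate(date_start, date_end))             # same sensing day
    return l1c.first()                                     # matching uncorrected granule, if any

# Example usage with metadata taken from a corrected Seco patch (values illustrative).
uncorrected = find_l1c_counterpart('32UMC', '2019-06-01', '2019-06-02')
```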

C. Bigearthnet Dataset
This is one of the first large-scale labeled datasets in the field of remote sensing [13]. The data are also from Sentinel2 and consist of around 600 K image patches. Bigearthnet contains all the spectral bands of Sentinel2. Each image patch contains several land use classes; hence, the dataset is a multilabel classification dataset. This dataset covers ten European countries, in contrast to the Seco dataset, which covers many regions across the world.

D. Eurosat Dataset
Eurosat is a labeled, balanced, single-label landcover classification dataset [8] with ten classes. Helber et al. [8] claim that there is only a low positive correlation between the classes. Unlike bigearthnet, it is a small-sized dataset consisting of 27 K Sentinel2 images. It contains all the bands of Sentinel2 and is spread across 34 European countries. Images with a large cloud percentage were discarded.

IV. METHODS
For contrastive learning, the choice of contrastive learning setup and of the transformations used to obtain augmented views plays an important role. For our experiments, we use variants of momentum contrastive learning (moco) [7], and to obtain augmented views specific to multispectral images, we use an atmospheric transformation instead of color jittering. Moco and augmented view generation using the atmospheric transformation are explained in Sections IV-A and IV-B, respectively.

A. Moco Experiment Setup
As common in contrastive learning, moco consists of two networks, but in moco these networks are asymmetric: one is the trainable base network and the other is the nontrainable momentum network. The idea behind the momentum network is to obtain more negative samples to increase the generalizability of the pretrained model. During moco training, the embeddings of the data obtained in previous iterations are appended to a queue to augment the number of negative samples. The negatives in the queue are updated dynamically, i.e., by flushing the oldest embeddings and filling the queue with new ones. This leads to the challenge of ensuring consistency between the embeddings of previous iterations and the current training step. Therefore, the momentum network with parameters \theta_k is updated slowly using the weights \theta_q of the base network as shown in the following equation:

\theta_k \leftarrow m \theta_k + (1 - m) \theta_q,

where m \in [0, 1) is the momentum coefficient. For details, refer to [7].
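As an illustration (not the authors' code), the momentum update above can be written in a few lines of PyTorch; `base_net` and `momentum_net` are assumed to be two architecturally identical encoders, and m = 0.999 is the coefficient suggested in [7].

```python
# Minimal sketch of the moco momentum update.
import torch

@torch.no_grad()
def momentum_update(base_net: torch.nn.Module,
                    momentum_net: torch.nn.Module,
                    m: float = 0.999) -> None:
    # theta_k <- m * theta_k + (1 - m) * theta_q
    for p_q, p_k in zip(base_net.parameters(), momentum_net.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)
```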

B. View Generation Using Atmospheric Transformation
In self-supervised learning on natural images, color jittering is applied. In this transformation, the user defines the range of color distortion to produce new views of the same image. During each epoch, the image is distorted with different levels of brightness, saturation, contrast, and hue. Extending the algorithm to jitter other channels would distort the valuable physical information in those channels. Therefore, we use atmospheric correction to perturb the channels of multispectral images without distorting the physical meaning of the landcover information. In this study, we use four channels: RGB and NIR. The approach is summarized in Algorithm 1 (see also the illustrative sketch after the experiments overview below). Fig. 3 shows a visualization of images obtained via Algorithm 1 for one sample image of the Seco dataset. Fig. 4 shows a schematic comparison of the transformation pipeline that we use to integrate four channels into the existing baseline transformation pipeline.

V. EXPERIMENTS

We conducted two sets of experiments. The first set was designed to show the added value of the NIR channel in landcover classification. The second set includes the pretraining setup using our proposed transformation pipeline to compare it with the transformations used in moco-v2 [7]. Sections V-A and V-B give detailed descriptions of the experiments.
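For concreteness, the following is a minimal, hypothetical sketch of the view-generation idea of Section IV-B: each training sample provides a co-registered uncorrected (Level-1C) and corrected (Level-2A) four-channel image, the two views are drawn from this pair instead of applying color jittering, and the remaining standard augmentations are applied as usual. All function and variable names are illustrative and not taken from the paper or Algorithm 1.

```python
# Hypothetical sketch of view generation with the atmospheric transformation.
# `toa` and `boa` are assumed to be co-registered 4-channel (RGB + NIR) arrays
# of shape (H, W, C) for the same scene before (Level-1C) and after (Level-2A)
# atmospheric correction.
import random
import numpy as np

def spatial_augment(img: np.ndarray) -> np.ndarray:
    """Stand-in for the remaining standard augmentations (e.g., random crop,
    resize, flips, blur) that can safely be applied to all spectral channels."""
    if random.random() < 0.5:
        img = img[:, ::-1, :]  # horizontal flip
    return img

def make_views(toa: np.ndarray, boa: np.ndarray):
    """Replace color jittering: the two views come from the uncorrected and
    corrected versions of the same scene, in random order."""
    first, second = (toa, boa) if random.random() < 0.5 else (boa, toa)
    return spatial_augment(first), spatial_augment(second)
```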

A. Randomly Initialized Linear Classifier Experiments
The experiments show the relevance of the NIR channel for landcover classification. We conducted the experiments with both the eurosat and bigearthnet datasets to evaluate the model's performance on single-label and multilabel classification tasks, respectively. Both of them are multispectral datasets and contain the NIR channel and the three RGB channels. The experimental setup and hyperparameters used for both eurosat and bigearthnet are kept similar to the randomly initialized setup used in [10]. In our four-channel setup, the first convolution layer is changed from three to four input channels. This change increases the number of nontrainable parameters from 3 × 64 × 7 × 7 to 4 × 64 × 7 × 7. As we trained a linear classifier by freezing the weights of the backbone, the number of trainable parameters remains the same. Hence, the comparison of results is a fair evaluation.
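A minimal sketch of this modification, assuming a torchvision ResNet18 backbone with a frozen body and a trainable linear classifier (the exact setup of [10] may differ; names and the class count are illustrative):

```python
# Sketch: extend the first convolution of a ResNet18 from 3 to 4 input channels
# (RGB + NIR) and train only a linear classifier on top of the frozen backbone.
import torch
import torch.nn as nn
from torchvision.models import resnet18

backbone = resnet18(weights=None)                 # randomly initialized backbone
backbone.conv1 = nn.Conv2d(4, 64, kernel_size=7,  # 3 -> 4 input channels
                           stride=2, padding=3, bias=False)
num_classes = 10                                  # e.g., eurosat
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

# Freeze everything except the linear classifier.
for name, p in backbone.named_parameters():
    p.requires_grad = name.startswith("fc.")
```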
1) Quantitative Results: Table I shows the difference in the results with three and four channels for eurosat (single-label classification) and bigearthnet (multilabel classification). We found an increase of 3.35% and 3.1% for eurosat and bigearthnet, respectively. Since only the linear classifier is trained, these experiments show the inherent relevance of the NIR channel.

B. Self-Supervised Learning With Atmospheric Correction
The pretraining experiments are conducted to show the quality of the representation learned by the pretrained model. We conducted experiments on all the pretraining setups mentioned in [10] on the Seco dataset and compared them with our implementation of the baseline experiments. For the evaluation of the pretraining experiments, we used the same hyperparameter settings on the downstream tasks. Sections V-B1 and V-B2 give a detailed description of the experimental settings for the pretraining and downstream tasks, respectively.
1) Pretraining Experiments: We conducted three sets of pretraining experiments. In all the experimental settings, we used the same hyperparameters adopted from [10]. We trained the network for 1000 epochs.
The moco-v2, moco-v2 + TP, and Seco algorithms described in [10] are used as baselines to compare with our transformation pipeline. The complexity of the experiments increases as follows: moco-v2 uses the de facto standard transformation pipeline; moco-v2 + TP is extended using images with different time stamps as similar pairs; and Seco is based on leave-one-out contrastive learning (LOOC) [15], i.e., it uses three embeddings (baseline transformation, different time stamp transformation, and different time stamp + baseline transformation), and the backbone networks share their weights.
2) Downstream Experiments: The experiments were conducted on two datasets. For eurosat, a linear classifier is used on the pretrained Resnet18, whereas for bigearthnet, we evaluated our approach with both a linear classifier and fine-tuning, using two pretrained models, i.e., Resnet18 and Resnet50. The training-validation split and hyperparameters for the downstream experiments are adopted from [10].

a) Quantitative Results: Table II shows the accuracy obtained by each pretrained model on the eurosat dataset. The accuracy for all three experiments shows that the new transformation pipeline surpasses the baseline transformation. We achieved an improvement of 1.0% for moco-v2, 2.9% for moco-v2 + TP, and 0.3% for seco. Table III shows the accuracy obtained on bigearthnet. We found our novel transformation to perform better than the baseline transformation irrespective of the backbone network and of the task type. For moco-v2, the increase in score ranges from 2.39% to 5.94%. For moco-v2 + TP, the accuracy gain lies between 2.4% and 4.31%. For Seco, it is between 1.52% and 6.38%. With our transformation, we even found an improvement in performance using the simclr loss [4]. If we remove the atmospheric transformation from our proposed transformation pipeline, the mean average accuracy for moco-v2 on bigearthnet (linear classifier) degrades to 66.12% for 10% of the data and 66.39% for 100% of the data with the resnet-18 backbone. In similar experiments with resnet-50, the mean average accuracy degrades to 69.41% for 10% and 71.39% for 100% of the bigearthnet data.

TABLE II: CLASSIFICATION ACCURACY ON THE EUROSAT DATASET. THE BACKBONE IS KEPT FROZEN AND WE ONLY TRAIN THE LINEAR CLASSIFIER.

TABLE III: MEAN AVERAGE ACCURACY SCORES ON THE BIGEARTHNET LANDCOVER CLASSIFICATION TASK.

VI. CONCLUSION
We presented a new transformation based on atmospheric correction which can be used as an alternative to color jittering in a moco-v2-style transformation pipeline for landcover classification based on Sentinel2 images. With this new transformation, it is possible to exploit multispectral bands of images and to generate physically consistent augmented samples. As we do not explicitly use information on temporal change, this novel transformation pipeline can be extended to spatio-temporal landcover self-supervised learning applications.
From our experiments, we found that our new transformation pipeline outperforms the baseline in all the above test cases. Thus, we conclude that using atmospheric correction as a substitute for color jittering is more effective for self-supervised pretraining on landcover images.