Enhanced Feature Pyramid Network With Deep Semantic Embedding for Remote Sensing Scene Classification

Recent progress on remote sensing (RS) scene classification is substantial, benefiting mostly from the explosive development of convolutional neural networks (CNNs). However, different from natural images, in which the objects occupy most of the space, objects in RS images are usually small and scattered. Therefore, there is still much room for improvement over vanilla CNNs, which extract global image-level features for RS scene classification while ignoring local object-level features. In this article, we propose a novel RS scene classification method via an enhanced feature pyramid network (EFPN) with deep semantic embedding (DSE). First, our framework extracts multiscale multilevel features using an EFPN. Second, to leverage the complementary advantages of the multilevel and multiscale features, we design a DSE module to generate discriminative features. Third, a feature fusion module, called two-branch deep feature fusion (TDFF), is introduced to aggregate the features at different levels in an effective way. Our method produces state-of-the-art results on two widely used RS scene classification benchmarks, with better effectiveness and accuracy than the existing algorithms. Beyond that, we conduct an exhaustive analysis of the role of each module in the proposed architecture, and the experimental results further verify the merits of the proposed method.


I. INTRODUCTION
SCENE classification in remote sensing (RS) images, referred to as the task of assigning a specific semantic label to an RS scene, has received wide interest in recent years, since it can be used in a wide range of practical applications, such as urban planning, environment prospecting, natural disaster detection, and land-use classification [1]-[3]. Over the past decades, an extremely rich set of RS scene classification algorithms has been developed. Earlier methods were mainly based on various hand-crafted features and classical classifiers, such as support vector machine (SVM) [4], random forest [5], and boosting [6]. In general, these methods are divided into two categories: methods relying on low-level features and methods using mid-level representations. The representative low-level features include the histogram of oriented gradients (HOG) [7], scale-invariant feature transform (SIFT) [8], local binary pattern (LBP) [9], and gray-level co-occurrence matrix [10]. They perform well on images with simple objects and high contrast between objects and the background but fail to depict the characteristics of complex RS scenes. Compared with the low-level methods, mid-level approaches attempt to develop a holistic scene representation by coding the low-level local features. The popular mid-level methods include bag-of-visual-words (BoVW) [11], locality-constrained linear coding (LLC) [12], spatial pyramid matching (SPM) [13], and improved Fisher kernel (IFK) [14]. As the most popular mid-level approach, BoVW represents the image by using a histogram of visual word occurrences [11]. LLC uses the locality constraint to project each descriptor into its local coordinate system and integrates the projected coordinates by max-pooling to produce the final representation [12]. SPM builds a spatial pyramid coding of local image descriptors by using a sequence of increasingly coarser grids [13].
IFK applies Gaussian mixture model-based probability densities to encode local image features [14]. Although mid-level methods produce more impressive representations for RS scenes, their performance essentially relies on low-level features. Furthermore, lacking the flexibility to discover highly intricate structures, these methods also carry little semantic meaning [15]-[17].
Recently, convolutional neural networks (CNNs) have successfully broken the limits of traditional hand-crafted features in a variety of computer vision tasks, such as object detection [18], semantic segmentation [19], edge detection [20], and image classification [21]. AlexNet [22], VGGNet [23], GoogLeNet [24], and ResNet [25] are four of the most commonly used backbones. For instance, in [16], an end-to-end learning system was proposed to learn a feature representation with the aid of convolution layers so as to shift the burden of feature determination from hand-engineering knowledge to a deep CNN. In [23], very deep convolutional networks were investigated to extract very deep features for large-scale image recognition. In [25], a residual learning framework was presented to extract feature maps from input data for image recognition. In [26], an architecture using stacked autoencoders was proposed to extract high-level features for hyperspectral data classification. In [27], a pretrained deep CNN model was selected as a feature extractor, and then the initial feature maps were fed into the CapsNet to obtain the final classification result. In [28], different global features using different CNN-based models were reported for aerial scene classification. In [29], an end-to-end CNN was adopted to extract global-context features for RS scene classification.

Fig. 1. RS scenes contain a diversity of objects. The highly complex spatial patterns and geometric structures in RS scene images bring smaller interclass dissimilarity.
Although progress has been made in feature extraction by CNNs, there is still much room for improvement over the generic CNN models that extract global image-level features for RS scene classification while ignoring local object-level features [30]-[34]. Different from natural images, in which the objects occupy most of the space, RS scene images generally contain a diversity of objects that are small and scattered against the background, as shown in Fig. 1. Due to the highly complex spatial patterns and geometric structures in RS scene images, they have larger intraclass variations and smaller interclass dissimilarity. For instance, the left subfigure of Fig. 1 shows a "Commercial" scene, while the right subfigure illustrates a "Dense Residential" scene. Both of these two categories of scenes contain houses, roads, trees, cars, as well as other kinds of objects. The differences between them are merely reflected in the spatial layouts and the density distributions of the objects. Hence, accurate RS scene classification needs to extract not only the global image-level features but also the local object-level ones.
To overcome the drawbacks of the vanilla CNNs for RS scene classification, in this article, we propose a new RS scene classification method via an enhanced feature pyramid network (EFPN) with deep semantic embedding (DSE). By introducing the EFPN, DSE module, and two-branch deep feature fusion (TDFF) module into the unified framework, the performance of the RS scene classification can be intrinsically improved.
We summarize our contributions as follows.
1) To address the problem that many previous CNN-based algorithms only capture global image-level features but ignore local object-level features for RS scene classification, a novel pyramid-like network called EFPN is proposed to extract multiscale multilevel features simultaneously.
2) To leverage the complementary advantages of multiscale multilevel features, a DSE module is proposed. By mapping the semantics of higher-level but coarser-resolution features into lower-level features with finer resolutions, both the stronger semantics and the higher spatial resolutions can be preserved, so that more reliable features are generated.
3) A TDFF module is proposed. With this module, the features at different levels can be aggregated to obtain complete and accurate descriptions of complex scenes.
4) We evaluate our method and compare it against a number of state-of-the-art methods on two well-known benchmark data sets. Results show that our method performs favorably against all the others. Also, we provide a comprehensive ablation study to demonstrate the effectiveness of each module in our method.
The rest of this article is organized as follows. Section II introduces the details of the proposed method. In Section III, the experimental results are reported. Section IV discusses the effectiveness of each module in the proposed method. Finally, conclusions are drawn in Section V.

II. PROPOSED METHOD
The overall architecture of our proposed method is illustrated in Fig. 2. It contains four main modules. The first module is the EFPN, which is used to produce initial feature maps at multilevels and multiscales. The second one, i.e., the DSE, is designed for boosting the ability to generate features with rich semantics and high spatial resolutions. The third one is the TDFF, in which two branches, namely the top branch and down branch, are designed to process and fuse different levels of features. The fused deep features are fed into the last module for RS scene classification.
A. Enhanced Feature Pyramid Network
1) Motivation: Current CNN-based methods prefer to cast RS scene classification as an end-to-end problem and learn a global image-level representation from the raw image data [30], [35]-[37]. Nevertheless, an insightful consensus has pointed out that neurons in high layers respond to the whole image, while neurons in low layers are more likely to be activated by local patterns [38]. This manifests that it is necessary to utilize local object-level features extracted from low layers to further enhance the performance of RS scene classification.
To this end, we propose a pyramid-like network, which can capture both global image-level features and local object-level features for scene reasoning. Our architecture is based on the well-known feature pyramid network (FPN) [31], which was originally proposed for object detection tasks. FPN can produce a feature pyramid at multiple scales and multiple levels. However, FPN adopts nearest-neighbor or bilinear interpolation to generate higher resolution feature maps, leading to the loss of high-frequency components in the higher resolution features, the production of discontinuous phenomena, and blurred object edges [32], which may affect the generation of precise feature maps for complex RS scenes. This motivates us to utilize a more effective technique, namely deconvolution, for upsampling. Compared with nearest-neighbor or bilinear interpolation, deconvolution, as a vital tool in super-resolution, motion deblurring, and semantic segmentation [33], can effectively complement the details lost in the convolutional layers of FPN and, at the same time, suppress blurry edges and noise. We call our proposed pyramid-like network EFPN.
2) Enhanced Feature Pyramid Network: Fig. 3 illustrates the architecture of our EFPN. It consists of a bottom-up pathway, a top-down pathway, and lateral connections. It is worth noting that, in the top-down pathway, as shown in Fig. 3, the spatial resolution is increased by deconvolution. The specific description of our EFPN is given as follows.
In the bottom-up pathway, layers of the backbone that generate feature maps of the same resolution are defined as a stage. Considering that feature maps generated by different stages should be multiscale and multilevel, we choose ResNet34, which has five hierarchies [39], as the basic backbone. Let Q = {(I_n, L_n), n = 1, 2, . . . , N} denote the RS scene image data set for training, where N represents the number of training images, I_n denotes the input image, and L_n is the class label for I_n. For each image I_n, we feed it into ResNet34 [25] and calculate the outputs of each stage's last residual block. Formally, let X, Y ∈ R^{H×W×C} denote the input and output tensors of the last convolutional layer for a certain stage of ResNet34, where H and W denote the spatial dimensions and C is the number of feature maps or channels. Let ω ∈ R^{K×K×C} denote a K × K convolution kernel with C channels. Each feature vector Y_{p,q} ∈ R^C can be calculated by

Y_{p,q} = F(X_{N(p,q)}, ω)    (1)

where F denotes a convolution layer, (p, q) represents the location coordinate, and N(p, q) defines the K × K local neighborhood centered at (p, q). For simplicity, here we assume that K is an odd number. Based on ResNet34, the outputs of conv2_3, conv3_4, conv4_6, and conv5_3 are used as the initial bottom-up feature maps for I_n, which are denoted by F_i ∈ R^{H_i×W_i×C_i}, i = 2, 3, 4, 5, where i represents the i-th stage of ResNet34.
In the lateral connections, to reduce the channel dimensions, we apply a 1 × 1 convolutional layer to each bottom-up map F_i as below

F̃_i = F_1(F_i, ω_1), i = 2, 3, 4, 5

where F_1(·, ω_1) denotes a 1 × 1 convolution with parameters ω_1. Then, based on the lateral connections, more precise locations of features can be passed from the finer levels of the bottom-up maps to the top-down ones.
In the top-down pathway, considering that the semantically stronger feature maps are spatially coarser, a deconvolution processing block is designed (dashed box in Fig. 3), in which a deconvolutional layer is followed by a batch normalization (BN) and a rectified linear unit (ReLU), that is, Deconv-BN-ReLU. The deconvolution processing block aims to upsample the spatial resolution of a coarser-resolution feature map by a factor of 2. The deconvolution process can be simply expressed as

T_i = G(P_{i+1}, ϕ_i)

where G(·, ϕ_i) refers to a deconvolutional layer with a kernel size of 3 × 3 and parameters ϕ_i. The upsampled map T_i is then merged with the corresponding bottom-up map by element-wise addition. Besides, to reduce the aliasing effect of upsampling, a 3 × 3 convolution is subsequently applied to the output of the element-wise addition operation. Finally, the EFPN feature maps P_i, i = 2, 3, 4, can be generated as

P_i = F_2(T_i ⊕ F̃_i, ω_2)

where ⊕ represents the element-wise addition operation, F̃_i is the output of the lateral 1 × 1 convolution on F_i, and F_2(·, ω_2) denotes a 3 × 3 convolution with parameters ω_2.
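One top-down merge step (Deconv-BN-ReLU upsampling, lateral 1 × 1 connection, element-wise addition, and 3 × 3 anti-aliasing convolution) can be sketched as below. The channel width of 256 and the module names are our own illustrative assumptions, not values stated in the text:

```python
import torch
import torch.nn as nn

# Hedged sketch of one EFPN top-down merge step.
class TopDownMerge(nn.Module):
    def __init__(self, c_bottom, c_out=256):
        super().__init__()
        self.lateral = nn.Conv2d(c_bottom, c_out, kernel_size=1)
        # Deconv-BN-ReLU block: upsample the coarser map by a factor of 2.
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(c_out, c_out, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )
        # 3x3 convolution to reduce the aliasing effect of upsampling.
        self.smooth = nn.Conv2d(c_out, c_out, kernel_size=3, padding=1)

    def forward(self, p_top, f_bottom):
        return self.smooth(self.upsample(p_top) + self.lateral(f_bottom))

merge = TopDownMerge(c_bottom=128)
p4 = torch.randn(1, 256, 14, 14)   # coarser top-down map
f3 = torch.randn(1, 128, 28, 28)   # finer bottom-up map
p3 = merge(p4, f3)
print(tuple(p3.shape))
```

With kernel size 3, stride 2, padding 1, and output padding 1, the deconvolution doubles the spatial size exactly, so the upsampled map aligns with the lateral map for element-wise addition.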
Note that, in the top-down pathway, the coarsest resolution map P_5 is directly generated from F_5 through a 1 × 1 convolution operation. The final outputs of EFPN are referred to as P = {P_2, P_3, P_4, P_5}.

B. Deep Semantic Embedding
1) Motivation: Due to repeated downsampling and pooling operations in the bottom-up pathway, the resolutions of the top feature maps are reduced. The loss of spatial details makes them unable to extract clear boundaries of small-scale objects. Our framework aggregates the semantics of features by incorporating high responses of bottom features and strong activations of top features, based on the fact that high responses to instances are helpful for accurately localizing objects and strong activations to semantics are an indicator for exactly understanding scenes. For this reason, we build a light-weight and simple module, called DSE, to aggregate features from different levels. Through this module, spatial information can be directly propagated into the target map without crossing dozens of layers. By integrating the fine details of lower-level but finer-resolution features with the semantics of higher-level but coarser-resolution features, DSE can make full use of the complementary information and learn more reliable features.
2) Deep Semantic Embedding: Fig. 4 shows the architecture of DSE. The two-stream inputs of a DSE module are P = {P_j, P_{j−1}}, j ∈ {3, 5}, where P_j ∈ R^{H_j×W_j×C_j} corresponds to the j-th level feature maps of EFPN. First, to capture more accurate representations, we apply two convolutional layers, with kernel sizes of 3 × 3 and 1 × 1, respectively, to the adjacent-level features for cross-channel information interaction and integration. Then, by applying a deconvolution operation, we upsample the higher-level feature maps to the scale of the lower-level ones. The process can be represented by

S_{j−1} = F_2(P_{j−1}, ω_2)

S_j = G(F_1(P_j, ω_1), ϕ_j)

where F_2(·, ω_2) denotes the 3 × 3 convolution with parameters ω_2, F_1(·, ω_1) denotes the 1 × 1 convolution with parameters ω_1, and G(·, ϕ_j) refers to the deconvolutional layer in DSE. Subsequently, we embed the strong semantic information of the upsampled features into the lower-level features by using element-wise addition. Similarly, to alleviate the aliasing effect, a 3 × 3 convolution is applied to the merged maps to get the final feature maps D_{j−1}, j ∈ {3, 5}, of DSE. The final outputs of DSE can be computed by

D_{j−1} = F_3(S_{j−1} ⊕ S_j, ω_3)

where F_3(·, ω_3) denotes the 3 × 3 convolution with parameters ω_3.
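The DSE data flow above can be sketched as follows. The shared channel width of 256 and the class/attribute names are our own assumptions for illustration:

```python
import torch
import torch.nn as nn

# Hedged sketch of a DSE module: embed the semantics of a higher-level map
# P_j into the finer-resolution adjacent map P_{j-1}.
class DSE(nn.Module):
    def __init__(self, c=256):
        super().__init__()
        self.low_conv = nn.Conv2d(c, c, 3, padding=1)   # 3x3 on lower level
        self.high_conv = nn.Conv2d(c, c, 1)             # 1x1 on higher level
        # Deconvolution upsamples the higher level by a factor of 2.
        self.up = nn.ConvTranspose2d(c, c, 3, stride=2,
                                     padding=1, output_padding=1)
        self.fuse = nn.Conv2d(c, c, 3, padding=1)       # anti-aliasing 3x3

    def forward(self, p_high, p_low):
        s_high = self.up(self.high_conv(p_high))
        s_low = self.low_conv(p_low)
        return self.fuse(s_high + s_low)                # element-wise addition

dse = DSE()
p_high = torch.randn(1, 256, 14, 14)  # coarser, semantically stronger
p_low = torch.randn(1, 256, 28, 28)   # finer, spatially richer
d = dse(p_high, p_low)
print(tuple(d.shape))
```

The output keeps the finer level's resolution, which matches the stated goal of preserving spatial detail while absorbing higher-level semantics.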

C. Two-Branch Deep Feature Fusion
1) Motivation:
In the generic FPN, a proposal on a specific level is chosen for recognition according to the size of objects, since object detection only needs to assign a specific category to an individual object. Albeit simple and efficient, it cannot meet the demand of RS scene classification, because we infer the scene label via recognizing the combined characteristics of multiple discriminative objects rather than a single object.
Our motivation stems from the fact that, when manually classifying RS scene images, specialists always set the semantic labels based on the global characteristics of scenes as well as the local features of objects [2], [18]. Therefore, we believe that global image-level and local object-level features are two important representations for distinguishing RS scenes. To be specific, the higher-level feature maps generated by global receptive fields give strongly semantic features. Making lower-level features access them will better absorb meaningful contextual information for prediction. On the contrary, the lower-level feature maps generated by local receptive fields reflect refined details for locating objects. Such features can help higher-level features to complement their loss of spatial information, which is beneficial for classification. Therefore, based on the above analysis, we present a TDFF module to fuse the features at different levels in a more effective way.
2) Two-Branch Deep Feature Fusion: This architecture consists of two branches that deal with the higher-level and lower-level feature maps, respectively. To process the various levels and, at the same time, enlarge the receptive fields so as to incorporate multilevel contextual information without increasing the computational cost, we advocate the combination of convolution and atrous convolution. Among them, atrous convolution, also known as dilated convolution, has been verified to be a powerful tool for dense prediction tasks [40], [41]. In addition, to avoid the network degradation that may be caused by excessive depth, we also introduce skip connections into the architecture. Furthermore, the last layers in the architecture are two global average pooling (GAP) layers, which are used to generate image-level representation features. A high-level illustration of the presented TDFF module is shown in Fig. 5.
Top Branch: This branch receives the higher-level feature maps D_4 produced by DSE. It is equipped with two residual blocks and one GAP layer. Table I shows the details of the residual blocks in this top branch, in which the 1 × 1, 3 × 3, 3 × 3, and 1 × 1 convolutional layers are arranged in order to learn deep features efficiently. Note that each layer is also followed by a batch normalization layer and a ReLU layer for nonlinear transformation, and the outputs of the internal paths of a residual block are combined by element-wise addition.
The output of one residual block can be represented by

Y = X ⊕ R(X)    (9)

with

R(X) = F_4(F_3(F_2(F_1(X, ω_1), ω_2), ω_3), ω_4)

where X and Y denote the input and output of the residual block, respectively; F_1, F_2, F_3, and F_4 denote the 1 × 1, 3 × 3, 3 × 3, and 1 × 1 convolutional layers in TDFF; ω_1, ω_2, ω_3, and ω_4 are the corresponding parameters; and ⊕ represents the element-wise addition operation. In (9), the BN and ReLU are omitted for simplifying notations. We can now write the outputs of the two residual blocks as

Y_t^1 = D_4 ⊕ R(D_4)    (12)

Y_t^2 = Y_t^1 ⊕ R(Y_t^1)    (13)

where Y_t^1 and Y_t^2 denote the outputs of the first and second residual blocks in the top branch, respectively. Note that, in (12), the input is the higher-level feature maps D_4, and in (13), the input is the output of the first residual block Y_t^1. After the two residual blocks, GAP [42] is introduced to strengthen the correspondences between categories and feature maps and to generate deep features. Ultimately, the output features of the top branch can be described as

Branch_t = σ(Y_t^2), [σ(Y_t^2)]_l = (1 / (H × W)) Σ_{p=1}^{H} Σ_{q=1}^{W} Y^l_{p,q}

where σ denotes the GAP operation and Y^l ∈ R^{H×W} is the feature map with height H and width W for the l-th channel of the input Y_t^2.

Down Branch: Compared with the top branch, the down branch replaces the convolutional layers in each residual block with atrous convolutional ones, because of the different scales of the inputs. By virtue of atrous convolution, our down branch is able to enlarge the receptive fields, thus capturing objects and useful image contextual information for classifying complex RS scenes. The details of the residual blocks in the down branch are given in Table II. Compared with standard convolution, atrous convolution can enlarge the kernel by inserting holes between the pixels of the kernel [43]. In the down branch, we utilize atrous convolution to increase the receptive field of the output units without increasing the kernel size.
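The residual block and GAP of the top branch can be sketched as below. The channel width of 256, the 7 × 7 input size, and the helper names are our own assumptions; the down branch follows the same skeleton with atrous convolutions:

```python
import torch
import torch.nn as nn

# Conv-BN-ReLU helper used by every layer of a TDFF residual block.
def cbr(c_in, c_out, k):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

# Sketch of one top-branch residual block: Y = X + F4(F3(F2(F1(X)))).
class TopResidualBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            cbr(c, c, 1), cbr(c, c, 3), cbr(c, c, 3), cbr(c, c, 1))

    def forward(self, x):
        return x + self.body(x)     # skip connection

blocks = nn.Sequential(TopResidualBlock(256), TopResidualBlock(256))
gap = nn.AdaptiveAvgPool2d(1)       # global average pooling

x = torch.randn(2, 256, 7, 7)       # stand-in for D_4
y = gap(blocks(x)).flatten(1)       # two blocks, then GAP -> Branch_t
print(tuple(y.shape))
```

GAP collapses each channel's H × W feature map to its mean, so the branch outputs one value per channel regardless of the input resolution.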
Generally, an atrous convolution with a kernel size of K × K and atrous rate r has a receptive field of

M = K + (K − 1)(r − 1)

As a result, the output of an M × M convolution, which can be calculated by (1), may be used as the result of the K × K atrous convolution. In detail, suppose U, V ∈ R^{H×W×C} are the input and output tensors of an atrous convolutional layer, where H and W denote the spatial dimensions and C is the number of channels. Each feature vector V_{p,q} ∈ R^C with the location coordinate (p, q) can be computed by

V_{p,q} = H(U_{N_r(p,q)}, μ)

where N_r(p, q) denotes the local neighborhood sampled with dilation rate r, H denotes an atrous convolution layer, and μ are the parameters of the atrous convolution. Based on atrous convolution, the output of one residual block in the down branch can be calculated by

V = U ⊕ R_d(U)

with

R_d(U) = H_4(H_3(H_2(H_1(U, μ_1), μ_2), μ_3), μ_4)

where U and V denote the input and output of the residual block in the down branch, respectively; H_1, H_2, H_3, and H_4 denote the 1 × 1, 3 × 3, 3 × 3, and 1 × 1 atrous convolutional layers in TDFF; and μ_1, μ_2, μ_3, and μ_4 are the corresponding parameters. Similarly, we can obtain the outputs of the two residual blocks as

V_d^1 = D_2 ⊕ R_d(D_2)    (20)

V_d^2 = V_d^1 ⊕ R_d(V_d^1)    (21)

where V_d^1 and V_d^2 denote the outputs of the first and second residual blocks in the down branch, respectively. Note that, in (20), the input is the lower-level feature maps D_2, and in (21), the input is the output of the first residual block V_d^1. Subsequently, GAP is applied to V_d^2 to acquire the deep features of the down branch. As a result, the output features of the down branch can be described as

Branch_d = σ(V_d^2)

where σ denotes the GAP operation and V^l ∈ R^{H×W} is the feature map with height H and width W for the l-th channel of the input V_d^2. Ultimately, TDFF with atrous convolutions effectively captures the two-branch deep features, i.e., Branch_t and Branch_d. A serial feature fusion scheme is adopted to concatenate these two types of features, so as to obtain more significant and informative features to represent the RS scene images.
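The effective kernel size formula above can be checked numerically; the helper function name is ours, for illustration only:

```python
# Effective kernel size (receptive field) of a K x K atrous convolution
# with rate r: M = K + (K - 1)(r - 1). Plain convolution is the r = 1 case.
def effective_kernel(k: int, r: int) -> int:
    return k + (k - 1) * (r - 1)

# The dilation rates 1, 3, 5 used in a TDFF block enlarge a 3 x 3 kernel
# to effective sizes 3, 7, and 11 without adding any parameters.
sizes = [effective_kernel(3, r) for r in (1, 3, 5)]
print(sizes)  # [3, 7, 11]
```

This is why the down branch can cover large scene context on the high-resolution map D_2 at the cost of a standard 3 × 3 kernel.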

D. Scene Classification
The fused deep features are subsequently fed into the scene classification module. This module, composed of a fully connected layer and a softmax layer, is utilized to predict the class label for the input image.
Suppose the output of the fully connected layer is Z = {z_i, i = 1, 2, . . . , m}, where m is the total number of class labels. The softmax function is defined as

θ_i = exp(z_i) / Σ_{j=1}^{m} exp(z_j)

where θ_i represents the probability that the input image belongs to the i-th class. The final predicted label is determined by the largest θ_i. Besides, during the process of classification, our loss function is the cross-entropy loss [44], which is given by

Loss = −(1/N) Σ_{n=1}^{N} Σ_{i=1}^{m} 1{y_n = i} log θ_i^(n)

where y_n is the real scene label of the n-th sample, m is the number of scene categories, N denotes the size of the mini-batch, θ_i^(n) is the predicted probability of the i-th class for the n-th sample, and 1{·} represents an indicator function. Mathematically, if y_n is equal to i, then 1{y_n = i} = 1; otherwise, 1{y_n = i} = 0. The proposed method is summarized in Algorithm 1.
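The softmax and the indicator-form cross-entropy loss above can be sketched in NumPy (function names are illustrative, not from the text):

```python
import numpy as np

# Softmax over one logit vector; the max shift is for numerical stability.
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Mini-batch cross-entropy: the indicator 1{y_n = i} selects theta_{y_n},
# so the loss is the mean of -log(theta_{y_n}) over the batch.
def cross_entropy(logits, labels):
    probs = np.apply_along_axis(softmax, 1, logits)
    n = len(labels)
    return -np.log(probs[np.arange(n), labels]).mean()

theta = softmax(np.array([2.0, 1.0, 0.1]))
loss = cross_entropy(np.array([[2.0, 1.0, 0.1]]), [0])
print(theta.sum(), theta.argmax(), round(loss, 3))
```

The predicted label is simply the argmax of θ, and the loss penalizes only the probability assigned to the true class of each sample.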

III. EXPERIMENTS
In this section, we evaluate our proposed method on two publicly available data sets for RS scene classification. First, the data sets used in the experiments are described. Second, we give an introduction to the experimental setup. Finally, the proposed architecture is compared with a number of state-of-the-art algorithms.

A. Data Sets
We test the proposed method on two different RS scene data sets. One is the well-known UCMerced Land-Use data set (referred to as UCM) [45], and the other is the Aerial Image data set (referred to as AID) [28].
1) UCM: This data set contains 2100 RS scene images, each of which belongs to a certain class. These RGB scene images, which originate from 20 diverse regions, are all provided by the United States Geological Survey (USGS) National Map. There are 21 scene labels in total, including Agricultural, Airplane, and Baseball Diamond. Some example images of all the categories in UCM are illustrated in Fig. 6. Each class consists of 100 images with a size of 256 × 256 and a spatial resolution of 30 cm per pixel.
2) AID: This data set, released by Wuhan University, is also available online. All images are categorized into 30 classes, such as Airport, BareLand, and Baseball Field. A total of 10 000 images are included in the data set; however, the number of images per category varies from 220 to 420. Table III gives the detailed information of AID, and some example images of all categories are shown in Fig. 7. Each image has a size of 600 × 600, but the spatial resolution ranges from 1 to 8 m.

Algorithm 1 The Proposed Method
Step 1 Enhanced Feature Pyramid Network
Input: Image I
Output: Enhanced feature maps P
1: Feed I into the pretrained ResNet34 and reserve the convolutional feature maps of different stages as F_2, F_3, F_4, F_5;
2: Apply 1 × 1 convolutions to F_i as lateral connections;
3: In the top-down pathway, upsample the coarser maps by deconvolution, merge them with the lateral maps by element-wise addition, and apply 3 × 3 convolutions to generate P = {P_2, P_3, P_4, P_5}.
Step 2 Deep Semantic Embedding
Input: Enhanced feature maps P
Output: Deep semantic embedding features D
4: for j = 3; j ≤ 5; j ← j + 2 do
5:    Apply the 3 × 3 and 1 × 1 convolutions to P_{j−1} and P_j, respectively;
6:    Upsample the higher-level maps by deconvolution and merge them with the lower-level maps by element-wise addition;
7:    Apply a 3 × 3 convolution to obtain D_{j−1};
8: end for
Step 3 Two-branch Deep Feature Fusion
Input: Deep semantic embedding features D
Output: Fused deep feature I_out
9: Feed D_4 into the top branch and D_2 into the down branch;
10: Concatenate Branch_t and Branch_d to obtain I_out.

B. Experimental Setup
1) Training/Testing: To make a comprehensive evaluation, the training-testing ratios for UCM are set to 80%-20% and 50%-50%, and the training-testing ratios for AID are set to 50%-50% and 20%-80%. We randomly select the samples from each scene category for training and leave the remaining images for testing. All images are resized to 224 × 224. Besides, to enhance the generalization ability of our method, some data augmentation techniques, including random horizontal flipping and random rotation, are adopted.
2) Implementation Details: The proposed method is built on the PyTorch library on Google Colaboratory, a cloud platform with an NVIDIA Tesla T4 GPU and 16 GB of memory. The parameters of the backbone (ResNet34) are initialized from the official model pretrained on ImageNet. In our framework, we discard the last GAP and fully connected layers of ResNet34 and introduce several lateral connections and DSE. Among them, the weights of the 1 × 1 convolutional layers are initialized as 0.1, and the biases are initialized as 0. The kernel size, stride, padding, and output padding of the deconvolutional layers are set to 3, 2, 1, and 1, respectively. In TDFF, the dilation rates of the atrous convolutional layers in a block are set to 1, 3, and 5 in order, and the other parameters are initialized with the PyTorch default settings. We use the stochastic gradient descent (SGD) approach to optimize the proposed model. The initial learning rate is set to 1e-2 and is divided by 10 every 80 epochs, while the momentum is set to 0.9. The mini-batch size of SGD depends on the data set. In the testing phase, the batch sizes are 10 and 36 for the UCM and AID data sets, respectively, while in the training phase, the batch size for both data sets is set to 72.
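The stated optimization schedule can be sketched as follows. The stand-in linear model and the use of StepLR are our assumptions about how "divided by 10 every 80 epochs" would be realized in PyTorch:

```python
import torch

# Sketch: SGD with momentum 0.9, initial learning rate 1e-2 divided by 10
# every 80 epochs. The linear model is only a placeholder for the network.
model = torch.nn.Linear(512, 21)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=80, gamma=0.1)

lrs = []
for epoch in range(160):
    optimizer.step()        # one (dummy) epoch of parameter updates
    scheduler.step()        # decay the learning rate on schedule
    lrs.append(optimizer.param_groups[0]["lr"])
print(lrs[78], lrs[79], lrs[159])  # rate drops by 10x at epochs 80 and 160
```

The recorded rates confirm the 1e-2 → 1e-3 → 1e-4 staircase over 160 epochs.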
3) Evaluation Metrics: Two widely used metrics, i.e., overall accuracy and confusion matrix [46]- [48], are selected here to evaluate the performance of our proposed algorithm. Overall accuracy is the proportion of the number of correct predictions to the total number of samples, reflecting the overall classification performance of a classification method. Furthermore, to analyze the detailed classification errors between different classes, a specific table layout named confusion matrix is used, each column/row of which denotes the instances in a predicted/actual class. It is worth pointing out that, to achieve reasonable results of evaluation metrics as well as reduce the influence of randomness, all experiments are repeated ten times.
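The two evaluation metrics can be sketched as below; the helper names are ours, and in practice library implementations (e.g., from scikit-learn) would be used instead:

```python
import numpy as np

# Overall accuracy: fraction of samples whose prediction matches the label.
def overall_accuracy(y_true, y_pred):
    return float((np.asarray(y_true) == np.asarray(y_pred)).mean())

# Confusion matrix: rows index the actual class, columns the predicted one.
def confusion_matrix(y_true, y_pred, num_classes):
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]
oa = overall_accuracy(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, 3)
print(round(oa, 3))  # 4 of 6 correct
print(cm)
```

The diagonal of the confusion matrix holds the correctly classified counts, so the overall accuracy equals the trace of the matrix divided by its total sum.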
C. Comparison With State-of-the-Art Methods
Among the comparison methods, SCK, BoVW, Dense-SIFT, IFK, and salM3LBP-CLM are RS scene classification methods based on midlevel scene features, while the other approaches are based on high-level deep features. For the models with available code, we train and test them using their default settings. For the models without released code, we use the results reported in their original works. Also, for a fair comparison, the same ratios are applied in the following experiments according to the experimental settings of the related works. For the UCM data set, the training ratios are set to 80% and 50%, respectively, while for the AID data set, the ratios are fixed at 50% and 20%, respectively.

1) Experimental Results on UCM Data Set:
In this section, we compare our proposed method with a number of state-of-the-art methods on the widely used UCM data set. The experimental results and analysis consist of four parts: the overall accuracy results under the training ratios of 80% and 50%, and the confusion matrix results under the training ratios of 80% and 50%. We randomly select a fixed percentage of the images to construct the training set, repeating the process ten times on the UCM data set, and then compute the means and standard deviations of the overall accuracy. The results are given in Table IV. From this table, we can find that, among the four kinds of midlevel methods, i.e., SCK, BoVW, Dense-SIFT, and salM3LBP-CLM, salM3LBP-CLM performs much better than the others. However, its performance is still inferior to most of the high-level deep feature methods shown in Table IV. This indicates that the midlevel methods have limited abilities for RS scene classification. On the contrary, benefiting from the superiority of deep NNs, significant improvement in scene classification performance has been achieved by deep feature methods for RS images.
Comparing the results of various deep feature methods, our method, EFPN-DSE-TDFF, achieves the best performance with the ratios of 80% and 50%. VGG-16-CapsNet and AlexNet+VGG16 obtain the second best results with the training ratio of 80%, while under the training ratio of 50%, VGG-16-CapsNet, TEX-Net-LF, VGG-VD16-DCF, and CaffeNet-DCF achieve competitive performances. In addition, our method also has much smaller standard deviations with both ratios.
All the above phenomena indicate that our proposed strategy of combining EFPN, DSE, and TDFF in a unified framework is effective to improve the classification performance for RS scenes.
Besides the overall accuracy, we also compute the confusion matrix for the proposed method. For each fixed training ratio, we choose to show the best results. Fig. 8 shows the confusion matrix with the training ratio of 80%. From Fig. 8, we can see that most scene categories obtain a classification accuracy equal to 1. The categories with classification accuracy lower than 1 only include "Medium Residential" (0.95) and "Storage Tanks" (0.95). As is known, for the UCM data set, the most confused scene types are "Dense Residential," "Medium Residential," and "Sparse Residential," due to their similar spatial patterns and the common objects shared by them, such as buildings and trees. In our confusion matrix, 5% of the images from "Medium Residential" are mistakenly classified as "Dense Residential." Besides, 5% of the images from "Storage Tanks" are mistakenly classified as "Parking Lot," which may be attributed to their similar land cover types. Fig. 9 presents the confusion matrix with a training ratio of 50%. From this confusion matrix, we can observe that 15 of the 21 categories achieve a classification accuracy of over 95%. Apart from these, the scene categories with a classification accuracy of more than 90% include "Building" (0.90), "Golf Course" (0.92), "Medium Residential" (0.92), "Overpass" (0.92), and "Storage Tanks" (0.92). The most obvious confusion is still between "Dense Residential" and "Medium Residential." As can be seen, 10% of the images from "Dense Residential" are mistakenly classified as "Medium Residential," while 8% of the images from "Medium Residential" are classified as "Dense Residential" by mistake. Besides, 6% of the images from "Overpass" are mistakenly classified as "Intersection," due to their similar appearances. And 6% of the images from "Golf Course" are mistakenly classified as "River," since both of them contain large areas of trees.
2) Experimental Results on AID Data Set: In this experiment, we use the publicly available RS scene data set, i.e., the AID data set, to evaluate the effectiveness of the proposed method.
The comparative results of our method against a number of state-of-the-art scene classification algorithms over the AID 30-class scenes are shown in Table V. The overall accuracies of the midlevel methods, i.e., BoVW, IFK, and salM³LBP-CLM, are 67.65%, 77.33%, and 89.76%, respectively, under the training ratio of 50%, and 61.40%, 70.60%, and 86.92%, respectively, with the training ratio of 20%. Compared with the midlevel methods, the methods based on deep features achieve far better performance, which indicates that deep features are more informative and discriminative than hand-crafted descriptors. Moreover, among all the deep feature methods, our method, i.e., EFPN-DSE-TDFF, and VGG16+CapsNet achieve comparable results (94.50% and 94.74%) with a training ratio of 50%. When the training ratio drops to 20%, our method still performs well (94.03%), while the overall accuracy of VGG16+CapsNet drops to 91.63%. This indicates the stronger discriminative power of our EFPN-DSE-TDFF compared with VGG16+CapsNet, providing a more robust representation for RS scenes from AID.
Due to limited space, we report only the confusion matrices of our method under the training ratios of 50% and 20% in Figs. 10 and 11, respectively. As shown in Fig. 10, 25 of the 30 categories achieve a classification accuracy of more than 90%. Categories with classification accuracy lower than 0.9 include "Center" (0.88), "Industrial" (0.88), "Resort" (0.69), "School" (0.79), and "Square" (0.89). Over the AID data set, the major confusions occur between "Resort" and "Park," "School" and "Commercial," "Stadium" and "Playground," or "BareLand" and "Desert." These results are explained by the fact that these categories have similar ground object distributions or geometrical structures. Nevertheless, some types with large interclass similarity, such as "Medium Residential" (0.93) and "Sparse Residential" (0.99), can be accurately classified. In addition, "Bridge," "Port," and "River," which have analogous objects and image textures, also achieve high accuracies of 0.97, 0.96, and 1.00, respectively. Fig. 11 demonstrates the confusion matrix with the training ratio of 20%, in which the rows reflect the producer's accuracy, while the columns reflect the user's accuracy. From this figure, we can see that most of the classes obtain a satisfactory classification result of over 90%, and only five categories, including "Center," "Park," "Resort," "School," and "Square," suffer noticeable misclassification. Although the number of training images has decreased dramatically, some categories that are easily confused can still be effectively classified, such as "Medium Residential" (0.92) and "Sparse Residential" (0.99), or "Bridge" (0.98), "Port" (0.98), and "River" (0.99). The effective scene classification results obtained by our proposed method can be attributed to the different modules, including EFPN, DSE, and TDFF, in the unified framework. From the above experimental results, we can summarize some interesting observations as follows.
By comparing the proposed method with different scene classification algorithms, it can be seen that midlevel methods achieve worse performance, while deep feature methods perform better on all the data sets. Moreover, among the deep feature methods, our proposed framework achieves the best performance, indicating progress in RS scene classification.

IV. DISCUSSION
To comprehensively evaluate the effectiveness of our proposed method, various ablation experiments are performed by using different connection patterns or different design options.

A. Impact of Data Augmentation
Data augmentation has been proven to be very useful in many learning-based vision tasks [44]. Therefore, in our experiments, we also employ data augmentation to generate enhanced data to train an effective model. The input images are augmented by random horizontal flipping and random rotation during training, which results in an augmented image set richer than the original one. We compare the methods with and without data augmentation so as to verify the effectiveness of data augmentation. The comparison results are shown in Table VI. In this table, ResNet34+ represents our proposed method with data augmentation, while ResNet34* represents the corresponding method without data augmentation. From the experimental results, we can find that adopting data augmentation improves the overall accuracy by more than 0.6%.
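The flip-and-rotate augmentation described above can be sketched in a few lines of NumPy. This is an illustrative simplification, not the paper's exact pipeline: rotations are restricted here to multiples of 90 degrees, whereas the actual training code may rotate by arbitrary angles.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    # Random horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        img = np.fliplr(img)
    # Random rotation; restricting to multiples of 90 degrees is a
    # simplification -- the actual pipeline may use arbitrary angles.
    return np.rot90(img, k=int(rng.integers(4)))
```

Both operations are label-preserving for scene classification: they change the spatial layout of the image while keeping its semantic category, which is what makes the augmented set "richer than the original one" without requiring new annotations.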

B. Scalability
Two popular backbone networks (i.e., VGG-VD-16 and ResNet34) are commonly utilized in scene classification. To further validate the scalability of our proposed method, we conduct comparison experiments using these different backbones. For both backbones, we choose the last outputs of each stage (except stage 1) as the initial inputs of the top-down pathway and keep all other settings the same. The comparison results are shown in Table VI. As can be seen, with the same training set, the performance of our ResNet34-based architecture (ResNet34+) is much better than that of the VGG-VD-16-based architecture.

C. Effects of Different Modules
Our framework contains three main modules, i.e., EFPN, DSE, and TDFF. To analyze the importance of each main module, a series of ablation experiments is conducted, with the results summarized in Table VII. The following can be seen from the results.
Effects of EFPN: Results from Scheme 1 are the worst, because the EFPN, whose function is to initially strengthen the semantics of the feature maps at all levels, is omitted from this architecture. Compared with Scheme 1, since Schemes 2, 3, and 4 contain the EFPN, their overall accuracy increases by 4.98%, 5.64%, and 7.76% under the 80% training ratio, respectively, and by 4.87%, 6.1%, and 6.38% under the 50% training ratio, respectively. These results demonstrate that the EFPN we propose is indeed beneficial for RS scene classification.
Effects of DSE: Scheme 2 directly links the TDFF to the outputs of the EFPN without DSE. Compared with it, the schemes equipped with the DSE module achieve higher overall accuracy, confirming that embedding higher-level semantics into the lower-level features yields more discriminative representations.
Besides the overall accuracy, we also report the per-class classification accuracies of these four different architectures in Fig. 13. As can be seen from Fig. 13(a) and (b), no matter whether the training ratio is 80% or 50%, our method achieves the best performance, which indicates the effectiveness and superiority of our proposed architecture. Moreover, we report the accuracy comparison achieved after convergence of the different architectures on the UCM data set, as shown in Fig. 14. As can be seen, the accuracy of our method is much more stable and higher than those of the other three architectures under both the 80% and 50% training ratios.

D. Different Upsampling Strategies
In our framework, we propose to use the deconvolution technique (referred to as Deconv) instead of the commonly used upsampling strategies, such as nearest neighbor (referred to as Nearest) and bilinear interpolation (referred to as Bilinear). To verify this improvement, we conduct corresponding experiments in which we change the upsampling method in either the top-down pathway or the DSE module. For simplicity, we only report the experimental results on the UCM data set with the 80% training ratio.
Table VIII details the various combinations of upsampling techniques used in the top-down pathway and the DSE module. There are nine different modes in total; for instance, the first mode means that both the top-down pathway and the DSE module use the nearest neighbor method for upsampling.
We first take Modes 1, 4, and 7 as an example to illustrate the importance of the upsampling strategy in the top-down pathway. In these modes, all the DSE modules use "Nearest" for upsampling, while the upsampling schemes in the top-down pathways are set to "Nearest," "Bilinear," and "Deconv," respectively; the overall accuracy increases from 96.35% to 97.88% and further to 98.57%. Second, we take Modes 1, 2, and 3 to show the importance of the upsampling strategy in the DSE module. In these three modes, the top-down pathways adopt "Nearest" for upsampling, while the DSE modules use "Nearest," "Bilinear," and "Deconv," respectively. The increasing overall accuracy, from 96.35% to 96.83% and 97.02%, illustrates the effectiveness of our proposed upsampling scheme. A similar conclusion can also be drawn by comparing Modes 4, 5, and 6.
Third, by comparing Modes 7, 8, and 9, in which all the top-down pathways use "Deconv" for upsampling, while the upsampling schemes in the DSE module are set to "Nearest," "Bilinear," and "Deconv," respectively, we can see that our final method, i.e., Mode 9, which uses "Deconv" in both the top-down pathway and the DSE module, achieves the best overall accuracy. In addition, in light of the better performance of Modes 7, 8, and 9, we further select them for comparison to illustrate the superiority of our "Deconv" upsampling technique via the loss curves. Figs. 15-17 display the loss curves of the different modes on the UCM data set under various training-testing ratios. From these figures, we note that the losses of our proposed method (Mode 9) converge faster than those of Modes 7 and 8 for both training-testing ratios. Moreover, the loss values of our proposed method are much lower than those of Modes 7 and 8. In summary, these results suggest that our upsampling scheme is beneficial to the whole architecture.
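The key difference between "Nearest" and "Deconv" can be made concrete with a minimal single-channel NumPy sketch (illustrative only; the actual framework operates on multichannel feature maps with learned kernels). Nearest-neighbor upsampling repeats pixels with fixed weights, whereas a transposed convolution stamps a trainable kernel onto a strided output grid:

```python
import numpy as np

def nearest_upsample(x, scale=2):
    # "Nearest": each pixel is simply repeated along both spatial axes;
    # the interpolation weights are fixed and cannot be learned.
    return x.repeat(scale, axis=0).repeat(scale, axis=1)

def deconv_upsample(x, kernel, stride=2):
    # "Deconv" (transposed convolution): every input pixel stamps a
    # kernel onto a strided output grid, and overlapping stamps are
    # summed. Because the kernel is a learnable parameter, the network
    # can adapt the interpolation to the data.
    kh, kw = kernel.shape
    h, w = x.shape
    out = np.zeros((h * stride + kh - stride, w * stride + kw - stride))
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + kh,
                j * stride:j * stride + kw] += x[i, j] * kernel
    return out
```

With a 2x2 all-ones kernel and stride 2, the transposed convolution reduces exactly to nearest-neighbor upsampling; training can then move the kernel away from this fixed scheme, which is one way to understand why "Deconv" outperforms "Nearest" and "Bilinear" in Table VIII.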

E. Benefits of Atrous Convolution
We further investigate the contribution of our atrous convolution. The different configurations of atrous residual units studied are shown in Fig. 18, and the corresponding classification results are listed in Table IX. It can be seen that the Shallow-Atrous Unit achieves the worst results. The performance of the Parallel-Atrous Unit is a bit better, but its increased number of channels leads to more learnable parameters. The other two units, Deep-Atrous and Deeper-Atrous, obtain suboptimal results. Our Standard-Atrous Unit achieves the best results on the RS scene classification task.
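The benefit of atrous convolution is that it enlarges the receptive field without adding parameters, by spacing the kernel taps apart. A minimal single-channel sketch (illustrative; real atrous residual units use multichannel convolutions with padding and nonlinearities):

```python
import numpy as np

def atrous_conv2d(x, kernel, rate=1):
    # Atrous (dilated) convolution: kernel taps are spaced `rate` pixels
    # apart, so a k x k kernel covers an effective receptive field of
    # (k-1)*rate + 1 pixels in each dimension with no extra parameters.
    kh, kw = kernel.shape
    eh, ew = (kh - 1) * rate + 1, (kw - 1) * rate + 1
    h, w = x.shape
    out = np.zeros((h - eh + 1, w - ew + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + eh:rate, j:j + ew:rate] * kernel)
    return out
```

With `rate=1` this reduces to an ordinary convolution; increasing the rate widens the context each output pixel sees, which is useful for the small, scattered objects typical of RS scenes.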

V. CONCLUSION
In this article, we first detailed the challenges of the scene classification task in RS images. To overcome these problems, we constructed a unified deep learning framework, called EFPN-DSE-TDFF, in which three main contributions were made. Considering the distinct characteristics of features at different levels, we designed an EFPN to extract multiscale multilevel feature maps and initially strengthen their semantics. Besides, a DSE module was introduced to map the semantics of higher-level but coarser-resolution features into lower-level but finer-resolution ones so as to learn more reliable features. A TDFF module was also employed to process and aggregate the features at different levels. We compared the proposed method with many representative RS scene classification approaches on several well-known RS scene data sets.