Multifeature Collaborative Fusion Network With Deep Supervision for SAR Ship Classification

Multifeature synthetic aperture radar (SAR) ship classification aims to build models that can process, correlate, and fuse information from both handcrafted and deep features. Although handcrafted features provide rich expert knowledge, current fusion methods inadequately explore the relatively significant role of handcrafted features in conjunction with deep features, the imbalances in feature contributions, and the cooperative ways in which features learn. In this article, we propose a novel multifeature collaborative fusion network with deep supervision (MFCFNet) to effectively fuse handcrafted features and deep features for SAR ship classification tasks. Specifically, our framework mainly includes two types of feature extraction branches, a knowledge supervision and collaboration module (KSCM) and a feature fusion and contribution assignment module (FFCA). The former module improves the quality of the feature maps learned by each branch through auxiliary feature supervision and introduces a synergy loss to facilitate the interaction of information between deep features and handcrafted features. The latter module utilizes an attention mechanism to adaptively balance the importance among various features and assign the corresponding feature contributions to the total loss function based on the generated feature weights. We conducted extensive experimental and ablation studies on two public datasets, OpenSARShip-1.0 and FUSAR-Ship, and the results show that MFCFNet is effective and outperforms single deep feature and multifeature models based on previous internal FC layer and terminal FC layer fusion. Furthermore, our proposed MFCFNet exhibits better performance than the current state-of-the-art methods.

I. INTRODUCTION
Synthetic aperture radar (SAR) is a high-resolution radar system that can operate from either spaceborne or airborne platforms, offering all-day, all-weather, and cloud-penetrating imaging. Unlike optoelectronic sensors, SAR images mainly reflect the backscattered information of the target, so the image signal-to-noise ratio is low. Moreover, the signal-to-noise ratio decreases with increasing radar distance, and the amplitude fluctuates randomly as the target observation angle changes, which makes SAR target identification far more complex than in optical images. Ships are the main means of transportation for maritime trade, and SAR is a primary means of marine detection. Developing SAR ship classification is important for marine fisheries management, maritime traffic management, combating illegal activities at sea, maritime search and rescue, and so on. Therefore, SAR image interpretation and ship target information extraction are critical issues for SAR ship classification.
Features are the key to SAR ship classification, since their quality largely determines classification accuracy. By extraction method, features fall into two categories: traditional handcrafted features and deep features based on modern convolutional neural networks (CNNs). Handcrafted features describe images from different perspectives using mature and explicable mathematical theories, such as grayscale, texture, edge, or shape [1], [2], [3]. However, they are typically tailored to a specific environment, so they generalize poorly to unknown environments. Because multisensor and multiscene variations require ship images to be highly descriptive and distinguishable, handcrafted features alone are not sufficient.
Different from shallow learning methods that rely on handcrafted features, deep learning methods, supported by powerful computing platforms and big data, can extract features directly from raw data through self-driven learning. Deep features can be seen as multilevel representations of the essence of objects, so they are more descriptive than handcrafted features, but they have low interpretability. Existing CNN-based SAR ship models rely excessively on abstract deep networks, trapping research in a cycle of network structure modification, training-skill optimization, and loss function improvement.
As most CNNs behave as black boxes, improving model performance by optimizing the CNN architecture has become increasingly challenging. Some SAR experts have therefore begun to study explainable artificial intelligence, exploring the importance of features or neurons in image analysis [4]. Other experts have incorporated the prior knowledge of handcrafted features, exploring efficient ways to combine them with deep features. Extensive experiments have shown that handcrafted features can provide supplementary information to deep features, thereby enhancing the classification performance of CNNs [5], [6]. However, existing feature fusion methods simply concatenate deep features with handcrafted features and directly input the high-dimensional fused feature vector to the fully connected (FC) layer, leading to a very complex optimization plane. This direct concatenation causes the computation of the FC layer to grow exponentially and introduces considerable noise, which ultimately fails to provide satisfactory results. Moreover, this concatenation treats all features as equally important, ignoring the different contributions of each feature; the features then interfere with one another, and the ultimate decision ability is diminished. We recognize that if handcrafted features can inject their information into the network training process, they can provide a more rigorous mathematical interpretation of the deep feature extraction process, achieve mutual supervision and collaborative learning during feature extraction, and thereby enhance the robustness of the CNN. In addition, weighting the final decision according to the contribution of different features can make applications in highly sensitive fields, such as the military, more reliable.
To solve the above issues, a multifeature collaborative fusion network with deep supervision (MFCFNet) is proposed for SAR ship classification. In MFCFNet, inspired by supervised learning, handcrafted feature auxiliary branches are added to the deep backbone network for the first time to improve model accuracy through feature fusion. The relative importance of deep and handcrafted features is also considered, and an attention mechanism is used to adaptively balance the contribution of different features to model performance. We introduce a new synergy loss to achieve knowledge interaction between all supervised branches; it regularizes network training based on the knowledge dynamically learned by all classifiers, achieving dynamic knowledge extraction and fusion. We perform a comprehensive evaluation on two public datasets (OpenSARShip and FUSAR-Ship) and carefully study the performance of each module in MFCFNet. The experimental results demonstrate the effectiveness and robustness of MFCFNet, which achieves advanced SAR ship classification accuracy compared to modern CNN-based methods and other handcrafted feature fusion methods.
The main contributions of this article are specified as follows.
1) A novel MFCFNet is proposed, which uses handcrafted feature maps as an auxiliary branch and deeply mines the expert knowledge they contain, avoiding the complex optimization hyperplane caused by concatenating features in front of the FC layer.
2) In the knowledge supervision and collaboration module (KSCM), high-quality feature maps are extracted from each branch, and a synergy loss is employed to foster dynamic knowledge matching and mutual learning between deep knowledge and handcrafted knowledge.
3) In the feature fusion and contribution assignment module (FFCA), an improved channel attention mechanism addresses the difference in importance between deep and handcrafted features and the imbalance in feature contributions.
4) Combined with four handcrafted features on the OpenSARShip and FUSAR-Ship datasets, MFCFNet improves the classification performance of the base models and demonstrates superior classification accuracy compared to traditional handcrafted feature fusion methods.

The remainder of the article is organized as follows. Section II describes related work on SAR ship classification based on handcrafted and deep features. Section III presents a detailed introduction to the proposed MFCFNet. Section IV shows the experimental settings and a comparative analysis of results. Ablation studies are presented in Section V. Finally, Section VI provides the limitations and conclusion.

II. RELATED WORK
In this section, we review previous research on three main categories of methods: traditional handcrafted feature methods, modern deep feature methods, and feature fusion methods.

A. Traditional Handcrafted Feature Methods
Traditional handcrafted visual features are used to express low-level information, which amplifies some visual features of an image, such as color, texture, and shape. These features are often accompanied by some interpretable theories.
Karvonen and Hallikainen [7] pointed out that, in addition to the areal backscattering, information in SAR images also resides in the edges; the Canny edge detection algorithm can effectively improve the SAR image classification task. Similarly, some local features, such as the mast position, were found to have more substantial discriminatory power in ship classification [8]. In addition, various feature frameworks have shown better performance. Li et al. [9] viewed the Gabor filter as a global operator to capture global texture features (e.g., orientation and scale) and the local binary pattern (LBP) as a local operator to characterize local spatial textures (e.g., edges, corners, and nodes); classification was improved by combining Gabor and LBP features from these different perspectives. Wu et al. [10] analyzed the reflectivity histogram and estimated macroscopic features such as the length, width, and radar cross-sectional profile of the ship, which were evaluated using a fuzzy logic module. Lin et al. [11] designed an MSHOG feature describing the ship structure and used a task-driven dictionary learning algorithm to increase ship separability. Although these methods achieved excellent performance in some specific settings, they were highly dependent on handcrafted features, which are time-consuming and labor-intensive to extract manually and do not describe image content comprehensively, limiting classification accuracy in complex tasks.

B. Modern Deep Feature Methods
Compared with traditional handcrafted features, modern deep feature methods can automatically extract robust and adaptive deep features from labeled data. These methods have been widely used in SAR ship classification tasks and have achieved excellent performance owing to the powerful multilevel characterization capability of deep features. For example, Shi et al. [12] applied the 2-D discrete fractional Fourier transform (2D-DFrFT) and a two-branch CNN to obtain features. Wang et al. [13] developed a semisupervised learning framework based on ResNet50, in which a self-consistent augmentation rule enables the network to efficiently utilize unlabeled data. Dong et al. [14] designed a deeper SAR ship classification model by introducing a residual module. Zheng et al. [15] proposed an ensemble network to improve the robustness and accuracy of classification by fusing multiple heterogeneous deep CNNs. Huang et al. [16] presented a novel method for CNNs, called group squeeze excitation sparsely connected convolutional networks (GSESCNNs), which made the concatenation of feature maps from different layers more efficient through sparse connection operations.
With the rise of artificial intelligence, deep feature-based SAR ship classifiers have achieved higher accuracy than traditional handcrafted feature classifiers, leading many models to uncritically discard handcrafted features. To further improve the characterization ability of deep features, CNN structures have become increasingly complex, deep, and uninterpretable. This cramming enhancement will soon face a bottleneck. In addition, in modern information warfare, uninterpretable abstract features pose great risks in applications such as precision strikes.

C. Feature Fusion Methods
The latest methods in feature fusion primarily focus on two areas: multiscale deep feature fusion and fusion of deep features with handcrafted features. Multiscale deep feature fusion methods facilitate an organic integration of high-level semantic information with low-level spatial features, often adopting a top-down approach to extract representative information from each layer. Bai et al. [17] employed a spatial pyramid attention mechanism and expanded the receptive field of convolutions to extract fine-grained feature information. Chen et al. [18] aimed to diminish differences between multiscale features, promoting a smooth transition of these features, by enhancing feature correlation and encoding spatial feature information. In order to effectively distinguish and utilize features of different scales, Wang et al. [19] used multiscale feature attention and an adaptive weighting classifier to measure features effectively. These approaches allow for a more comprehensive expression of features in the input image, but fusing with handcrafted features can not only diversify the features but also incorporate expert knowledge into the deep learning model. Similarly, Li et al. [20] adopted feature alignment and adaptive weights to achieve multiscale feature fusion. The low-scale images contained precise locations and contours, while the high-scale images provided complete contextual and structural information. Li et al. [21] used multihead encoders to extract complementary features of optical, SAR, and terrain modalities separately and implemented multimodal knowledge fusion using an indicator-guided decoder.
In order to enhance model interpretability and further improve CNN classification performance, some recent studies have combined handcrafted features with deep features to achieve complementary effects. Tang et al. [22] utilized a 3-D fuzzy gradient histogram descriptor to fully capture spatial-spectral characteristics; coupled with the multidimensional features extracted by a CNN, this significantly improved the robustness of their model. Zhang and Zhang [23] thoroughly investigated the effect of fusing handcrafted features with deep features at the internal FC layer and the terminal FC layer. The results showed that the best classification accuracy is achieved by injecting handcrafted features into the terminal FC layer, owing to the rich expert experience they carry. They also pointed out that different CNNs differ in their sensitivity to handcrafted features: the worse a CNN's original performance, the more significant its accuracy improvement. Zhang et al. [24] integrated handcrafted features into CNNs, demonstrating that mature handcrafted features can play an important role. They studied the fusion of 2-D handcrafted features with deep features by first flattening the 2-D handcrafted features to one dimension, then using principal component analysis (PCA) to reduce their dimensionality, and finally combining them in the FC layer. Zhang et al. [25] proposed HOG-ShipCLSNet, which combined the HOG feature with multiscale CNN-based features at the FC layer to improve classification accuracy. HOG-ShipCLSNet used a multiscale mechanism to enrich the deep features, then flattened the multiscale features together with HOG into 1-D and fused them in the terminal FC layer to enhance the global representation.
In summary, previous methods generally involved simply concatenating handcrafted and deep features, treating these two types of features equally, and then feeding them into the FC layer. However, this approach can result in a complex optimization hyperplane in the fused feature map, without digging deeply into the relationship between handcrafted and deep features, thus not fully enhancing the network's feature learning capacity.

III. METHODOLOGY
We propose a novel multifeature collaborative fusion network framework with deep supervision, as shown in Fig. 1, containing two branches (the DEEP and HAND branches) and two modules (KSCM and FFCA). In the HAND branch, we design a new location for feature injection: the handcrafted feature map is treated as input, and a backbone network is used to deeply explore the expert knowledge it contains. In this way, the framework avoids the optimization hyperplane problem caused by traditional feature fusion directly in front of the FC layer. We then design the KSCM module to improve the quality of the feature maps through auxiliary supervision units and adopt a synergy loss to promote dynamic information interaction between DEEP knowledge and HAND knowledge. Next, to reduce the overfitting caused by channel feature redundancy, we use the spatial dropout mechanism [26] to randomly zero out 50% of the feature maps in channel units. Finally, the feature map is input to the FFCA module, where the difference in importance between deep and handcrafted features is weighed by a channel attention mechanism, and the total weights of the deep and handcrafted features are output separately to solve the feature contribution imbalance problem. To the best of our knowledge, this is the first work to achieve multifeature collaborative fusion using handcrafted feature maps as input, and the experimental results demonstrate the effectiveness of MFCFNet. The modules are described in detail in Sections III-A-III-C.
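The spatial dropout step described above maps directly onto PyTorch's `nn.Dropout2d`, which zeroes whole channels at a time; a minimal sketch (tensor sizes are illustrative, not the model's actual dimensions):

```python
import torch
import torch.nn as nn

# Spatial dropout zeroes entire feature-map channels rather than individual
# activations; p = 0.5 matches the 50% channel drop rate described in the text.
spatial_dropout = nn.Dropout2d(p=0.5)

feats = torch.randn(4, 64, 7, 7)   # (batch, channels, H, W); sizes are illustrative
spatial_dropout.train()            # dropout is only active in training mode
out = spatial_dropout(feats)       # roughly half the channels are zeroed per sample
```

At inference time (`.eval()`) the layer is an identity, so the fused feature map passes through unchanged.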

A. Handcrafted Feature Extraction
Traditional handcrafted features enhance some of the visual information of an image, such as edges, corners, and textures, and are often accompanied by interpretable theories. Based on their previous use in the field of SAR ship classification and the requirement that handcrafted features be 2-D, we selected four handcrafted features: the Canny edge, Harris corner, Gabor filter, and LBP histogram. As shown in Fig. 2, these handcrafted feature maps all have the same dimensions as the original image. All methods are well known, and each is briefly explained below.
The Canny edge feature is used to extract the edge information of SAR ships [27]; it offers high localization accuracy and effective suppression of false edge points. As in the traditional edge detection pipeline, the original image f(x, y) is first smoothed and denoised with a Gaussian filter G(x, y):

H(x, y) = f(x, y) * G(x, y),  G(x, y) = (1 / (2πσ²)) exp(−(x² + y²) / (2σ²))

where H(x, y) is the smoothed image and * denotes the convolution operator. σ is the standard deviation of G(x, y), which controls the quality of the Gaussian filtering. Then, the gradient amplitude and direction of each pixel are calculated by computing the first-order partial derivatives in both directions and transforming the coordinate system.

The Harris corner feature characterizes the ship's corner information [28] and is effective for ship positioning recognition. The Harris feature is defined by

E(u, v) = Σ_{x,y} w(x, y) [I(x + u, y + v) − I(x, y)]²

where I is the image intensity and w(x, y) is a window function, which can also be a Gaussian function G(x, y). When the window w is shifted by (u, v) in the x and y directions, E(u, v) measures the resulting intensity change.
The Gabor filter feature [29], [30] is also widely used for ship classification because it represents spatial structure at different scales and orientations, enhancing global rotation invariance. The 2-D Gabor function takes the following form:

g(x, y; λ, θ, ψ, σ, γ) = exp(−(x′² + γ²y′²) / (2σ²)) exp(i(2πx′/λ + ψ))

with x′ = x cos θ + y sin θ and y′ = −x sin θ + y cos θ, where (x, y) is the spatial-domain coordinate, λ is the wavelength, θ is the directional separation angle of the Gabor kernel, γ is the spatial aspect ratio, and ψ is the phase shift.

The LBP descriptor is a simple and effective pixel-based texture descriptor for extracting spatial texture features of ship images [31]. It encodes each neighborhood pixel using the center pixel's gray value as a threshold:

LBP(x_c, y_c) = Σ_{p=0}^{P−1} s(i_p − i_c) 2^p,  s(z) = 1 if z ≥ 0, 0 otherwise

where (x_c, y_c) is the center pixel coordinate, p indexes the pth of P neighborhood pixels, i_p is the pth neighborhood pixel value, and i_c is the center pixel value. The whole LBP feature map is then summarized with a histogram to obtain the final LBP feature vector.
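The four 2-D feature maps above can be sketched with the skimage library that Section IV mentions for handcrafted feature extraction. This is a hedged sketch only: the parameter values (σ, the single Gabor frequency/orientation, and the LBP radius) are illustrative assumptions, not the paper's settings, and the paper would typically use a bank of Gabor filters rather than one.

```python
import numpy as np
from skimage import feature, filters

def handcrafted_maps(img, sigma=1.0, lbp_radius=1):
    """Return four 2-D feature maps, each the same size as the input image.

    Sketch of the descriptors named in the text; parameters are illustrative.
    """
    maps = {}
    # Canny: binary edge map after Gaussian smoothing with std sigma.
    maps["canny"] = feature.canny(img, sigma=sigma).astype(np.float32)
    # Harris: corner response computed from the local structure tensor.
    maps["harris"] = feature.corner_harris(img)
    # Gabor: real response of a single-orientation Gabor filter
    # (the paper's filter bank would cover several scales/orientations).
    real, _imag = filters.gabor(img, frequency=0.2, theta=0.0)
    maps["gabor"] = real
    # LBP: per-pixel local binary pattern codes (histogrammed afterward).
    maps["lbp"] = feature.local_binary_pattern(img, P=8 * lbp_radius, R=lbp_radius)
    return maps

img = np.random.rand(64, 64)
fm = handcrafted_maps(img)
```

Each returned map shares the input's spatial dimensions, which is the property that lets the HAND branch consume it as an image-like input.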

B. Knowledge Supervision and Collaboration Module
KSCM consists of an auxiliary feature supervision unit and a knowledge collaboration learning unit, as shown in Fig. 3. Briefly, the auxiliary feature supervision unit provides supervision on the output features of each branch and introduces the accompanying objective functions L_d and L_h to improve the convergence rate of the model. The knowledge collaboration learning unit uses a knowledge synergy loss L_k to facilitate the information interaction between deep features and handcrafted features.
In the auxiliary feature supervision unit, we add auxiliary classifiers after the two feature extraction branches: the HAND branch handles low-level visual features, and the DEEP branch handles high-level semantic features. Let D = {(x_1, y_1), ..., (x_N, y_N)} be an annotated SAR dataset with N training samples collected from K ship classes, where each member (x_i, y_i) contains x_i ∈ R^d and the corresponding ship category y_i. Let W = {W_d, W_h, W_g} be the weights of the DEEP branch, the HAND branch, and the global network to be learned, so that f(W, x_i) is the K-dimensional output vector of the corresponding branch for a training sample x_i. Following deeply supervised networks that fuse the losses of each branch, the global optimization objective can be expressed as

L(W) = L_g(W_g) + α L_d(W_d) + β L_h(W_h)

where each term is calculated with the cross-entropy cost function, L_g is the default loss, and the auxiliary losses L_d and L_h are evaluated by the corresponding DEEP and HAND auxiliary classifiers on the training set, making the learned deep and handcrafted features more discriminative and robust.
In deeply supervised networks, Sun et al. [32] stated that setting a fixed value of 1.0 for α and β gives the same performance as the best CNN trained by the ZERO-ing strategy [33]. However, we found in our experiments that when deep features are added, the two branches contribute differently to the final classification, and using the same weights leads to poor fusion. How to set the weights α and β is therefore introduced in Section III-C.
In the knowledge collaboration learning unit, the knowledge synergy strategy facilitates the aggregation of deep and handcrafted features to improve the information consistency between them. Specifically, the class probability outputs of the two auxiliary classifiers on the training data are utilized as learned knowledge to regularize the network's training. The knowledge matching between the DEEP auxiliary classifier and the HAND auxiliary classifier is measured with a Kullback-Leibler (KL) divergence

L_k = μ [D_KL(f_d ‖ f_h) + D_KL(f_h ‖ f_d)]

where f_d and f_h are the class probability outputs of the DEEP and HAND classifiers obtained with the softmax function, and μ weights the information loss of knowledge matching between them. In this study, to make the knowledge learned by the classifiers transferable to each other, we set μ = 1 and keep it fixed as in [32].
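Putting the KSCM objective together, a hedged PyTorch sketch follows. The symmetric (two-way) form of the KL synergy term is my reading of "transferable to each other", and the fixed α and β values here are placeholders: in the full model they come from the FFCA module.

```python
import torch
import torch.nn.functional as F

def kscm_losses(logits_g, logits_d, logits_h, targets, alpha=1.0, beta=1.0, mu=1.0):
    """Sketch of the KSCM objective (symmetric KL synergy term is an assumption).

    logits_g / logits_d / logits_h: global, DEEP-branch, and HAND-branch
    classifier outputs; alpha / beta are the feature-contribution weights.
    """
    L_g = F.cross_entropy(logits_g, targets)   # default (global) loss
    L_d = F.cross_entropy(logits_d, targets)   # DEEP auxiliary loss
    L_h = F.cross_entropy(logits_h, targets)   # HAND auxiliary loss

    # Synergy loss: KL divergence between the two auxiliary class
    # distributions, applied in both directions for mutual learning.
    log_pd = F.log_softmax(logits_d, dim=1)
    log_ph = F.log_softmax(logits_h, dim=1)
    L_k = mu * (F.kl_div(log_pd, log_ph.exp(), reduction="batchmean")
                + F.kl_div(log_ph, log_pd.exp(), reduction="batchmean"))

    return L_g + alpha * L_d + beta * L_h + L_k

logits = torch.randn(8, 3)
targets = torch.randint(0, 3, (8,))
loss = kscm_losses(logits, logits.clone(), logits.clone(), targets)
```

With identical DEEP and HAND logits, as in this toy call, the synergy term vanishes and only the weighted cross-entropy losses remain.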

C. Feature Fusion and Contribution Assignment Module
Traditional multifeature fusion methods usually concatenate feature maps directly, and this concatenation implicitly assumes that deep and handcrafted features carry equally important information. To more clearly characterize the features of the different channels after concatenation, we design the FFCA module, as shown in Fig. 4. Similar to widely applied attention mechanisms [34], [35], the FFCA module uses global average pooling (GAP) to aggregate the spatial information of the multichannel feature map F ∈ R^{2C×H×W}, compressing it into a 1 × 1 × 2C sequence of real numbers. The feature sequence is then fed into a shared multilayer perceptron (MLP) to learn the relationship between channels and generate a more representative feature vector, and a sigmoid function produces the feature channel weights. The channel attention mechanism can be written as

W(F) = σ(MLP(GAP(F)))

where F is the concatenated feature map, σ is the sigmoid function, and W(F) gives the weight of each channel. Summing the weights over the deep feature channels and the handcrafted feature channels, respectively, yields the deep feature weight α and the handcrafted feature weight β:

α = Σ_{c=1}^{C} W(F)_c,  β = Σ_{c=C+1}^{2C} W(F)_c.

Finally, the channel weights are multiplied with the input feature map to obtain the final channel attention map. We use α and β to weight the corresponding supervised loss functions, thus balancing the contribution of different features to the model classification. Combining the loss function of Section III-B with these contribution weights, the total loss function of the whole framework is

L_total = L_g + α L_d + β L_h + L_k

where L_g is the default loss, L_d and L_h judge the quality of the corresponding feature maps, and L_k promotes mutual learning between the auxiliary classifiers.
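A minimal PyTorch sketch of the FFCA computation. The MLP reduction ratio, and the reading of α and β as sums of the per-branch channel weights averaged over the batch, are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class FFCA(nn.Module):
    """Sketch of FFCA: channel attention over the concatenated
    deep/handcrafted feature map plus per-branch contribution weights.
    Layer sizes and the reduction ratio are illustrative assumptions."""

    def __init__(self, channels_per_branch, reduction=4):
        super().__init__()
        c2 = 2 * channels_per_branch
        self.c = channels_per_branch
        self.mlp = nn.Sequential(
            nn.Linear(c2, c2 // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(c2 // reduction, c2),
        )

    def forward(self, feats):                      # feats: (B, 2C, H, W)
        squeezed = feats.mean(dim=(2, 3))          # global average pooling -> (B, 2C)
        w = torch.sigmoid(self.mlp(squeezed))      # per-channel weights W(F)
        attended = feats * w[:, :, None, None]     # channel attention map
        # Branch contributions: sum the weights over each branch's channels.
        alpha = w[:, : self.c].sum(dim=1).mean()   # deep-feature weight
        beta = w[:, self.c :].sum(dim=1).mean()    # handcrafted-feature weight
        return attended, alpha, beta

ffca = FFCA(channels_per_branch=8)
x = torch.randn(2, 16, 5, 5)   # first 8 channels deep, last 8 handcrafted
y, a, b = ffca(x)
```

The returned α and β would then scale L_d and L_h in the total loss, so the branch whose channels receive larger attention weights also contributes more to training.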

IV. EXPERIMENT AND RESULT ANALYSIS
All programs are implemented in Python, and the CNNs are implemented using the open-source PyTorch framework, with the handcrafted feature extraction methods partially derived from the skimage library. Model training and inference are accelerated on the GPU via the CUDA 11.6 platform.

A. Data Description
To evaluate the feasibility and effectiveness of MFCFNet in fusing handcrafted features, we perform an extensive experimental analysis on two popular SAR ship datasets, as in other studies [40], [41], [42]. The distribution ratio and preprocessing of the datasets are the same as in our previous work [15]. Table I shows the distribution of the two datasets, including categories, totals, and allocations. The validation set is randomly divided from the training set using threefold cross-validation and is used to select hyperparameters. Using the hyperparameters with the minimum average error, the training set and validation set are combined to retrain the final model, whose generalization ability is then tested on the test set.
1) OpenSARShip Dataset: The OpenSARShip images were derived from dual-polarization SAR data acquired by the European Space Agency's Sentinel-1 satellite, including both VH and VV polarization channels. Combining the coordinates and categories provided by Huang and the experimental setup of earlier research [13], [25], three main types of ships are extracted, and the same training-test ratio is set to address the sample imbalance problem. In addition, the resolution of this dataset is lower than that of FUSAR-Ship. As shown in Fig. 5, there are three types of ships: bulk, container, and tanker.
2) FUSAR-Ship Dataset: The FUSAR-Ship dataset was extracted from 126 high-resolution scenes acquired by the quad-polarization Gaofen-3 satellite and has a greater variety of ships than the OpenSARShip dataset. As shown in the first row of Fig. 2, seven types of ships are used in the experiment, i.e., bulk, container, fishing, tanker, general cargo, other cargo, and others. We use the same data preprocessing method and training-testing ratio as in [15]. Specifically, each image is first padded by 5 pixels on each side, and then 224 × 224 crops are randomly sampled from the padded image or its horizontal flip.
B. Experiment Settings

1) Backbone and Implementation Details on OpenSARShip: We use the four most representative CNN architectures for evaluation, namely, AlexNet [36], VGG-16 [37], ResNet-18 [38], and DenseNet-121 [39]. We employ the open-source model code in PyTorch and train each backbone network following standard settings. For ResNet-18, we use a stochastic gradient descent (SGD) optimizer with a momentum of 0.9 and a learning rate of 0.001. The remaining models use the Adam optimizer with a learning rate of 0.0001 and a weight decay of 5 × 10^-4. All models are trained for 100 epochs with a batch size of 16.
2) Backbone and Implementation Details on FUSAR-Ship: Because the FUSAR-Ship dataset is larger, we add two deeper models to test the validity of MFCFNet, namely ResNet-101 [38] and DenseNet-201 [39]. For ResNet-18 and ResNet-101, we use an SGD optimizer with a momentum of 0.9 and a learning rate of 0.01. The remaining models use the Adam optimizer with a learning rate of 0.001 and a weight decay of 5 × 10^-4. All models are trained for 100 epochs, the learning rate is decayed by 10% at the 60th epoch, and the batch size is set to 32.
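A sketch of the FUSAR-Ship optimizer setup for the ResNet backbones. Reading "decayed by 10% at the 60th epoch" as multiplying the learning rate by 0.1 at epoch 60 is an assumption (it could also mean multiplying by 0.9); the stand-in model is hypothetical.

```python
import torch

model = torch.nn.Linear(10, 7)   # stand-in for a ResNet backbone (7 FUSAR classes)

# SGD settings stated in the text for ResNet-18 / ResNet-101 on FUSAR-Ship.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# "Decayed by 10% at the 60th epoch" is interpreted here as lr *= 0.1 at
# epoch 60; this interpretation is an assumption.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60], gamma=0.1)

for epoch in range(100):
    # ... one training pass over the dataset would go here ...
    scheduler.step()
```

After the loop, the learning rate has been reduced once, from 0.01 to 0.001.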
3) Auxiliary Classifier Implementation Details: The auxiliary classifiers on both branches have the same structure as the classifiers in the original backbone network.

C. Metric Index
For the SAR ship classification task, we use the Accuracy, F1, Precision, and Recall metrics to measure classification performance and compare it with the state of the art. The formula for accuracy is presented here as an example:

Accuracy = (TP + TN) / (TP + TN + FP + FN)  (13)

where TP, TN, FP, and FN denote the number of correctly classified ships of the target class, the number of correctly classified ships of the other classes, the number of other-class ships incorrectly assigned to the target class, and the number of target-class ships that are misclassified, respectively.

Table II shows the SAR ship classification results of MFCFNet on OpenSARShip and FUSAR-Ship with and without handcrafted features. In Table II, Backbone refers to the deep features, Baseline denotes the standard training scheme, and Canny, Harris, Gabor, and LBP indicate the corresponding handcrafted feature fusion schemes. We run each combination five times and report the "mean ± std" accuracy. For better comparison, we also present two average gains, ABG and AFG: ABG refers to the average gain of the same backbone combined with different handcrafted features, and AFG refers to the average gain of the same handcrafted feature combined with different backbones.
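The four metrics follow directly from the confusion counts defined above; a minimal sketch (the example counts are hypothetical):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from per-class confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)          # fraction of predicted positives that are correct
    recall = tp / (tp + fn)             # fraction of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for one ship class.
acc, p, r, f1 = classification_metrics(tp=80, tn=10, fp=5, fn=5)
# acc = 0.9; p = r = f1 = 80/85
```

In the multiclass setting these are computed per class and then averaged, matching the per-class definitions of TP, TN, FP, and FN above.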

D. SAR Ship Classification Results
Results on OpenSARShip are summarized in Table II, where our method MFCFNet consistently improves the performance of all backbones. Among them, DenseNet-121 + Harris achieves the highest accuracy of 78.60%. From the perspective of the average backbone gain, we find that the accuracy improvement is more significant for original models with poorer performance. For example, the original VGG-16 model has 69.90% classification accuracy, and the average gain after adding handcrafted features is 6.69%, whereas the original DenseNet-121 model has 73.59% accuracy and an average gain of only 3.96%. The average gain of the ResNet-18 model is only 1.36%, partly because of the high accuracy of the original network, but mainly because the sparse residual summation operation in the network disrupts the feature information flow to some extent. Meanwhile, the same model shows different sensitivities to different handcrafted features. For example, the accuracy improvement of the VGG-16 model is 7.35% after fusing Gabor features but only 3.81% after fusing Harris. Therefore, we need to further consider the intrinsic relationship between deep and handcrafted features. From the average feature gain perspective, the texture features described by Gabor and LBP achieve the top-2 gains of 5.30% and 4.75%, respectively, showing that texture features provide a substantial gain for deep features on OpenSARShip. These experimental results clearly demonstrate the effectiveness of MFCFNet for fusing handcrafted features with deep features. Results on FUSAR-Ship are similar to those on OpenSARShip, and MFCFNet achieves an effective accuracy improvement even on a larger dataset and deeper networks. The accuracy results of all backbone networks are consistent with those reported in [25].
Benefiting from the proposed handcrafted feature fusion, MFCFNet achieves average accuracy gains of 1.94%, 3.25%, 3.45%, 1.1%, 2.32%, and 1.73% for AlexNet, VGG-16, ResNet-18, ResNet-152, DenseNet-121, and DenseNet-201, respectively. The accuracy improvement of the deeper ResNet-152 and DenseNet-201 is lower than that of the corresponding shallow networks, indicating that the deeper networks already contain richer semantic information. In contrast to OpenSARShip, ResNet-18 achieves the best average backbone gain of 3.45% on this larger dataset, suggesting that the capacity of residual blocks can be exploited when the dataset is sufficiently complex and diverse. The Gabor feature again achieves the best average feature gain, 3.40%, on FUSAR-Ship.
Since MFCFNet can fuse different deep features and handcrafted features, Fig. 6 illustrates the role of handcrafted features on the backbone during the training process. From Fig. 6, we find that a backbone based on MFCFNet combined with handcrafted features converges faster and achieves higher accuracy, but each network shows different sensitivities to the various handcrafted features. For example, DenseNet-121 combined with Canny or Gabor features converges noticeably faster than the original network; however, the Gabor combination also causes oscillations during training. The internal mechanism of this phenomenon needs to be investigated further in the future. In conclusion, as the backbones become deeper (e.g., ResNet-152 and DenseNet-201) and the datasets become larger (e.g., FUSAR-Ship), our method MFCFNet yields the same significant accuracy improvement for all backbones. Tables III and IV show the confusion matrices of the top-1 combinations, DenseNet-121 + Harris and DenseNet-201 + Gabor, on the two datasets, illustrating the classification performance for each ship category. Both tables contain many misclassifications due to the significant interference of background noise in the images of the two datasets. However, Table IV performs better than Table III because the FUSAR-Ship dataset has a higher resolution, so the model can learn more ship features, reaching an accuracy of 86.92%, higher than the 78.60% on OpenSARShip. Clearly, the confusion in category prediction on the FUSAR-Ship dataset mainly occurs between Fishing and Others, as these two ship types have similar geometric shapes. The various cargo types, such as containers, general cargo, tankers, and bulk carriers, achieve better classification performance.

E. Comparison Results
In the comparison experiments, the best feature combinations, DenseNet-121 + Harris on OpenSARShip and DenseNet-201 + Gabor on FUSAR-Ship, are used as benchmarks and compared with handcrafted feature-based, deep feature-based, and state-of-the-art feature fusion methods, respectively.
3) Comparison With Feature Fusion Methods: Among the feature fusion methods for SAR ship classification in Table V, the state-of-the-art methods are DUW-Cat-FN [23] and HOG-ShipCLSNet [25] proposed by Zhang et al. Both methods fuse handcrafted features in the terminal FC layer and achieve best classification accuracies of 78.15% and 86.86% on the two datasets. Since we study 2-D handcrafted features, the dimensionality gap between the deep features and the handcrafted features flattened to 1-D is large, and direct concatenation to the FC layer leads to feature confusion and overfitting. Therefore, for better comparison, we also use the regular internal FC layer and the terminal FC layer to fuse handcrafted features separately; both perform significantly worse than our MFCFNet. From Table V, MFCFNet achieves state-of-the-art classification accuracies of 78.60% and 87.23%. We find that, for all experimental results on the OpenSARShip dataset, the precision value is significantly lower than the other three metrics, which is due to the unbalanced class distribution. As shown in Fig. 7, the standard deviation produced by MFCFNet is much lower than that of the deep feature methods and feature fusion methods, at 0.21 and 0.26 on the two datasets, respectively.
The results show that, in each random experiment, the KSCM and FFCA modules enable the fused features to play a maximal and stable role.

F. Generalization Experiment
To test the generalization performance of MFCFNet, the OpenSARShip dataset was used for training, and the model was then tested directly on the FUSAR-Ship dataset. Since FUSAR-Ship contains more categories than OpenSARShip, only the three common categories, namely Bulk, Container, and Tanker, are retained in the generalization experiments. Table VI shows the experimental results of different methods. Compared with the classical deep models, MFCFNet improves the four metrics by 20.79% on average. Compared with handcrafted feature fusion at different locations, MFCFNet also achieves the best results, with an accuracy of 65.54%.
To further explain the generalization performance of MFCFNet, we present the confusion matrices of seven models. From Fig. 8(a)-(d), it can be seen that the classical deep models all perform poorly on the Tanker class, and Bulk and Tanker are easily misclassified as Container. As shown in Fig. 8(e)-(g), the models that fuse handcrafted features can better distinguish each category, because the handcrafted features provide prior knowledge that prevents the models from overfitting toward irrelevant target features. All methods show degraded classification performance in the generalization experiments, owing to the differences between the two data domains, such as different image resolutions and sea conditions. In general, MFCFNet has better generalization performance, but further generalization research is still needed.

V. ABLATION STUDY
In MFCFNet, the feature fusion unit is the core of the FFCA module, and the auxiliary feature supervision unit and the knowledge collaborative learning unit are the core of the KSCM module. All ablation experiments were also run five times to report the "mean ± std" accuracy.

A. Ablation Study on Feature Fusion Unit
We conduct several ablation studies to investigate the effectiveness of attention mechanisms in the feature fusion unit and the effect of different attention mechanisms on classification accuracy, including two commonly used types of attention: the spatial attention module (SAM) [52] and the convolutional block attention module (CBAM) [53]. From Table VII, the performance of the network on OpenSARShip can be improved by any attention mechanism, and the lower the Attention-Removed value, the more significant the improvement. However, because the ship categories in FUSAR-Ship are more diverse and complex, and the handcrafted features and deep features characterize ship information from two different perspectives, the SAM, which focuses more on spatial pixel relationships, can cause feature representation confusion and thus has a negative impact. The CBAM performs slightly better, as it is a mixed spatial and channel attention mechanism. Our method, based on the channel attention mechanism, performs best because it pays more attention to the per-channel relationship between deep features and handcrafted features, eliminating the effect of feature confusion.
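To make the channel-attention fusion concrete, the sketch below reweights the channels of the concatenated deep and handcrafted feature maps with a squeeze-and-excitation-style gate. This is a minimal NumPy illustration under assumed shapes and weight matrices, not the paper's exact unit; all names are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention_fuse(deep_feat, hand_feat, w1, w2):
    """SE-style channel attention over concatenated (C, H, W) feature maps.

    deep_feat, hand_feat: (C, H, W); w1: (C2 // r, C2), w2: (C2, C2 // r),
    where C2 = 2 * C and r is the channel-reduction ratio.
    """
    x = np.concatenate([deep_feat, hand_feat], axis=0)    # (2C, H, W)
    squeeze = x.mean(axis=(1, 2))                         # global average pool
    excite = sigmoid(w2 @ np.maximum(w1 @ squeeze, 0.0))  # FC -> ReLU -> FC -> sigmoid
    return x * excite[:, None, None]                      # per-channel reweighting
```

The gate output lies in (0, 1) per channel, so weak channels from either branch are attenuated rather than discarded.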

B. Ablation Study on Auxiliary Feature Supervision Unit
We conduct ablation experiments to verify the effectiveness of the auxiliary feature supervision loss and the feature contribution weights. Here, we set the loss weight values to 0, 0.2, 0.8, and 1 for the experiments, where α represents the DEEP branch loss weight and β represents the HAND branch loss weight. When α, β = 0, no auxiliary loss is used, and the model gain is attributed solely to the attention mechanism and the knowledge synergy loss. From Table VIII, we make the following observations. 1) When using the weaker handcrafted Canny feature, setting β greater than or equal to α makes the model pay more attention to the HAND branch during backpropagation, causing the model to oscillate violently and converge with difficulty during training, which eventually leads to performance lower than the Baseline. 2) When using the stronger Gabor feature, a larger gain can be produced for the deep features, and the model performance improves as α increases. 3) Our method uses the feature contribution degree to set α and β, which adaptively measures the relative importance of deep features and handcrafted features and finally achieves the maximum accuracy gain. In conclusion, the results in Table VIII demonstrate that the auxiliary feature supervision loss and the feature contribution degree, which balance the importance between features, make the handcrafted features and deep features complement each other.
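The weighting scheme above can be written as a simple weighted sum. The function name is hypothetical; in MFCFNet, α and β are derived adaptively from the feature contribution degrees, whereas here they are passed explicitly as in the ablation grid:

```python
def total_loss(main_loss, deep_aux_loss, hand_aux_loss, alpha, beta):
    """Fused-branch loss plus the alpha/beta-weighted auxiliary branch losses.

    alpha weights the DEEP branch auxiliary loss, beta the HAND branch one;
    alpha = beta = 0 disables auxiliary supervision entirely.
    """
    return main_loss + alpha * deep_aux_loss + beta * hand_aux_loss
```

For example, with `alpha=0.2, beta=0.8` the backpropagated gradient emphasizes the HAND branch, matching the regime described in observation 1).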

C. Ablation Study on Knowledge Collaboration Learning Unit
We conduct ablation experiments to verify the effectiveness of the synergy loss in the knowledge collaborative learning unit. Here, we remove the knowledge synergy loss and keep the rest of MFCFNet unchanged in Table IX. Specifically, the best accuracy gain of 3.6% is achieved on OpenSARShip using AlexNet + Canny, and effective results are also achieved on FUSAR-Ship. These results demonstrate the significance of the knowledge synergy loss and the effectiveness of our approach, which enables deep knowledge and handcrafted knowledge to learn from each other, achieving a dynamic collaborative process for the same task.
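This section does not restate the synergy loss formula; one common instantiation of such branch-to-branch collaboration (in the style of deep mutual learning) is a symmetric KL divergence between the two branches' predicted distributions. The sketch below is under that assumption and may differ from the paper's exact definition:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors, with eps for stability."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def synergy_loss(deep_logits, hand_logits):
    """Symmetric KL between the DEEP and HAND branch predictions, so that
    each branch's posterior is pulled toward the other's."""
    p, q = softmax(deep_logits), softmax(hand_logits)
    return 0.5 * (kl(p, q) + kl(q, p))
```

The loss is zero when the two branches agree exactly and grows as their predictions diverge, which is the behavior the collaborative unit relies on.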

VI. LIMITATION AND CONCLUSION
In this article, we proposed a novel multifeature collaborative fusion network with deep supervision to better enable handcrafted features to provide complementary information to deep features. An auxiliary feature supervision unit and a knowledge collaborative learning unit are designed; the former realizes high-quality extraction of feature maps for each branch, and the latter achieves collaborative learning of deep and handcrafted knowledge. In addition, an FFCA module based on the channel attention mechanism is designed, which addresses the importance difference between deep features and handcrafted features and the unbalanced contributions of different features. Extensive experiments show that our proposed MFCFNet outperforms single deep feature models and multifeature models based on internal FC layer and terminal FC layer fusion, and exhibits better performance than the current state-of-the-art related methods. Therefore, MFCFNet reliably achieves superior ship classification results.
Our current study has some limitations. First, MFCFNet cannot achieve effective classification in multiobject scenarios. Second, the model performance improves when fusing one handcrafted feature with a deep feature, but sustained improvement is not achieved when two or more handcrafted features are fused with a deep feature at the same time; we attribute this phenomenon mainly to feature redundancy. Finally, we observe that the size of MFCFNet is twice that of the backbone, which is mainly related to the number and complexity of the feature auxiliary branches. Weighing the model size against the expected increase in accuracy, we believe the current increase in model size is justified. More importantly, all auxiliary classifiers are discarded during inference, so there is no additional computational overhead.
Our future work is as follows.
1) Study the intrinsic relationship between deep and handcrafted features so as to recommend the best handcrafted features for different networks.
2) Solve the feature redundancy problem and extract the features common to deep and handcrafted features, thus improving model robustness and extending the MFCFNet framework to any number of fused features.
3) Study and evaluate the various representational capabilities of deep and handcrafted features and build a feature capability matrix.
4) Existing feature fusion methods all use classic models as the backbone; in future work, we will demonstrate the feasibility of our method on the latest models.