Hyperspectral image classification based on multiple kernel learning using SWT and KMNF with few training samples

ABSTRACT In this paper, a new technique based on multiple kernel learning (MKL) with just a few training samples is proposed for HSI classification, utilising the stationary wavelet transform (SWT) and kernel minimum noise fraction (KMNF). 2D-SWT is applied to each spectral band to discriminate spatial information, and feature sets are created by concatenating wavelet bands. The base kernels associated with each feature set are constructed, and the optimum kernel for maximum separability is learned. The experimental results indicate that the suggested approach provides high accuracy with a low number of training samples and outperforms state-of-the-art MKL-based classifiers with no increase in computational complexity.


Introduction
Hyperspectral images (HSI) are a type of remote-sensing data with a large number of narrow spectral bands, each captured at a different frequency; the bands typically number more than a hundred. This dense spectral sampling in hyperspectral imagery results in a continuous spectrum for each pure material (Xu et al. 2019). As a useful source of information, HSI is used in a wide range of applications such as urban development, precision farming, the food industry (product quality recognition), environmental monitoring, mineral or gas identification, and military and security applications (Prasad et al. 2018, Tu et al. 2020). Different processes such as compression, noise reduction (Hassanzadeh and Karami 2016), detection (Jiang et al. 2020), and classification (Zhong et al. 2014) are required for different applications. Since classification results play a key role in the overall performance of many applications, classification methods have grown dramatically in recent years. A system for HSI classification generally involves the following steps: training data selection, image pre-processing, feature extraction and feature selection, selection of the most appropriate classifier, and accuracy evaluation (Chutia et al. 2016).
Recently, many efficient HSI classification methods have been introduced. Among them is sparse representation classification (SRC) (Li and Du 2016), in which a testing sample is represented by a sparse linear combination of training samples. Although the performance of SRC was enhanced in (Su et al. 2020), it has a huge complexity cost and does not provide significant accuracy when the number of training samples is low. Other classification methods that have recently attracted a lot of attention are based on deep learning (Hong et al. 2020a). These methods have a large number of parameters to be trained, which increases the computational cost. On the other hand, they do not perform well with a small number of training samples, which is our main concern in this paper. Some state-of-the-art deep-learning-based HSI classification methods have been investigated in (Jia et al. 2021), and the results show that by combining deep learning methods and related techniques, such as transfer learning, the issue of small training sample sets can be alleviated.
The support vector machine (SVM) approach is one of the most widely used remote sensing classification algorithms (Melgani and Bruzzone 2004, Tarabalka et al. 2010). Since classification performance is highly dependent on the extracted features, employing efficient spectral and spatial features can greatly enhance classification accuracy. Several classification methods have been presented that improve SVM efficiency through spectral and spatial feature fusion (Dalla Mura et al. 2010, Kang et al. 2013, Jia, Shen, and Li 2014). A classification method based on SVM was proposed (Kordi Ghasrodashti et al. 2017) in which features were extracted using the wavelet transform and spatial-spectral Schrodinger eigenmaps. Another method was introduced (Li et al. 2014) using multiple feature learning (MFL) and a multinomial logistic regression (MLR) classifier. In the MFL strategy, a feature vector is constructed from stacked linear or non-linear features generated from morphological and kernel analysis. Another classification method based on random forest (RF) was suggested (Hong, Wu, et al. 2020) in which a new technique based on attribute profiles, i.e. invariant attributes (IAs), was proposed for feature extraction. These studies show that richer features result in higher classification accuracy without an increase in computational complexity.
Identification of each pixel in HSI for labelling data is often costly or, in some cases, impossible. Therefore, there is a limited amount of training data. Although increasing the number of features would theoretically lead to improved performance, as a result of the Hughes effect, if the number of features (dimensionality) is greater than the number of training samples, the final accuracy will decrease (Bioucas-Dias et al. 2013).
Dimensionality reduction can be an effective strategy to minimise the impact of the Hughes phenomenon. By mapping the data into a lower-dimensional space, considerable improvements can be achieved. Principal component analysis (PCA) (Rodarmel and Shan 2002) and linear discriminant analysis (LDA) (Tharwat et al. 2017) are conventional dimensionality reduction methods for hyperspectral images. By computing the eigenvectors of the original dataset's covariance matrix, PCA projects the original high-dimensional space to a lower-dimensional one in the direction of maximum variance. LDA is a supervised dimensionality reduction technique that seeks a projection matrix to maximise class separability. However, neither PCA nor LDA is able to handle the nonlinear nature of HSI. Another method is the minimum noise fraction (MNF) (Green et al. 1988), which is based on the maximisation of the signal-to-noise ratio (SNR); the transformed principal components are ordered by SNR rather than variance. To handle nonlinearities in the original data, a kernel version of MNF (KMNF) was introduced (Islam et al. 2020) and has shown superior performance over linear PCA and its kernel version.
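To make the projection concrete, the following is a minimal PCA sketch in NumPy (illustrative only, not the implementation used in any of the cited works):

```python
import numpy as np

def pca(X, n_components):
    """Project data onto the directions of maximum variance.

    X: (n_samples, n_bands). Columns are centred, the covariance matrix
    is eigen-decomposed, and the data are projected onto the leading
    eigenvectors.
    """
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)                 # ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return Xc @ top
```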
Multiple kernel learning (MKL), which has been proposed recently, is another strategy to overcome the issue of limited training samples and to attain efficient spectral and spatial feature fusion (Zhan et al. 2018). By using MKL, the features that are more effective for class separability are retained, while redundant features that may degrade classifier performance are removed. In MKL-based methods, a linear or non-linear combination of basic kernels is created (Gönen and Alpaydın 2011), in which each basic kernel is devoted to a specific feature set. Indeed, MKL's major role is preserving key kernels in such a way that the weight of each base kernel is determined according to the kernel's importance. A simple MKL method was presented in (Tuia et al. 2010) in which just spectral features were exploited and the optimal kernel was produced via a linear combination of a few base kernels. The weights of these base kernels were computed by solving an optimisation problem using an iterative gradient descent algorithm.
A representative multiple kernel learning (RMKL) method (Gu et al. 2012) was developed for HSI classification in which kernel weights are calculated based on eigenvalue decomposition (PCA) without any optimisation process, so the computational load for finding an optimal kernel is reduced. A sparse MKL classifier (Gu et al. 2014) improved RMKL performance: an optimally sparse combination of kernels is learned from the base kernels, and then the standard SVM procedure is performed with the optimal kernel. In (Gu et al. 2016), a method for non-linear multiple kernel learning (NMKL) was proposed in which the kernel matrices are computed using a Hadamard product of two base kernels. The results of NMKL demonstrated that the accuracy of classification by a nonlinear combination of kernels was inadequate and had a high computational cost. To simplify the HSI kernel learning procedure, an algorithm was presented (Liu et al. 2016) where just the elements of kernels corresponding to each pair of classes to be classified are employed, and only these kernel elements are used in the kernel learning phase. In (Cui et al. 2019), multiple classifiers based on multiple basic kernels train their models utilising different training sample sets generated by a bootstrapping strategy. Then, the test samples are pseudo-labelled by mutual learning and added to the training sample set for retraining. Finally, the prediction results of all classifiers are combined by a voting mechanism. Previous MKL-based techniques mainly focused on discovering novel optimisation algorithms for merging base kernels and did not take into account various spatial and spectral feature extraction methods.
In this paper, a new MKL-based hyperspectral image classification method is proposed for few training samples, extracting informative and rich spatial and spectral features. The MKL strategy generates kernels that adapt to the data characteristics by utilising different modalities, leading to more efficient feature learning. Although extracting significant features is a crucial component of MKL, it has not been taken into account in previous MKL methods. To do this, each spectral band is subjected to a stationary wavelet transform, which yields more detailed spatial information in different directions. Then, KMNF is applied to each feature set (the wavelet coefficients and the original HSI) for dimension reduction and the removal of noisy features. Finally, the optimum combined kernel is incorporated into the SVM classifier. Since MKL performance is highly dependent on the features, using KMNF in conjunction with SWT leads to high classification accuracy even with few training samples.
The main contributions of the proposed method are as follows:
• Constructing spatial-spectral features jointly by employing SWT spatial information combined with KMNF spectral feature reduction to attain efficient features for classification.
• Providing an efficient classification method for HSI, especially for few training samples, using MKL.
The rest of this paper is organised as follows. Details of the proposed method are presented in Section 2. In Section 3, real datasets are introduced and the evaluation metrics are clarified. All obtained results are reported in Section 4. Finally, Section 5 concludes the paper.

The proposed method
Figure 1 illustrates a block diagram of the proposed method. In the proposed method, called SWT-KMNF-MKL, spatial and spectral features are extracted using the stationary wavelet transform and the kernel minimum noise fraction respectively; then an optimum feature fusion is achieved by MKL. The SWT-KMNF-MKL steps are as follows:
Step 1: Implement two levels of two-dimensional (2D) SWT in the spatial domain.
Step 2: Construct five feature sets from the original dataset and the approximation, horizontal, vertical and diagonal sub-bands.
Step 3: Apply KMNF to each feature set in the spectral direction to reduce its dimension.
Step 4: Generate base kernels for the five feature sets, each base kernel associated with one feature set (RBF kernels with multiple scales are used).
Step 5: Implement the kernel learning process based on SimpleMKL (Tuia et al. 2010).
Step 6: Finally, incorporate the optimum learned kernel into the SVM classifier to obtain the classification map.

Spatial feature extraction using stationary wavelet transform (SWT)
Although the two-dimensional discrete wavelet transform (2D-DWT) is a powerful tool for image processing and pattern recognition, it has some limitations. The dyadic decompositions in 2D-DWT generate sub-bands with half the rows and columns of the original image, which reduces spatial resolution and is not appropriate for pixel-wise classification. The two-dimensional stationary wavelet transform (2D-SWT) solves this issue by removing the down-sampling step of the standard DWT, resulting in sub-bands of the same size as the original image. 2D-SWT decomposes an image into four sub-bands (one low-resolution sub-band called the approximation (A) and three high-resolution sub-bands called the horizontal (H), vertical (V) and diagonal (D) details) by employing a successive set of lowpass and highpass filters in the horizontal and vertical directions.
In the SWT, the output after applying the filters to an N-sample signal has N highpass and N lowpass coefficients, whereas in the DWT the output has N/2 coefficients in each sub-band. This redundancy facilitates the identification of salient features in the image (Fowler 2005, Gharbia et al. 2018).
As previously stated, using spatial features in conjunction with spectral ones is common in the literature for improving classification performance. Since in hyperspectral images the spectral correlation is higher than the spatial correlation, using 3D transformations that treat all dimensions similarly is not efficient. The proposed method therefore utilises a 2D spatial transform first and then a spectral transform; the spectral transform removes noisy features and reduces the dimension. We use two levels of 2D-SWT because of its powerful ability to decrease spatial correlation and extract significant spatial information at various scales, frequencies, and orientations.
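As a concrete illustration, the following minimal sketch applies a two-level 2D-SWT to every band using PyWavelets; the mother wavelet (db1) and the divisibility/padding requirement are assumptions, since the paper does not specify them:

```python
import numpy as np
import pywt  # PyWavelets

def swt_feature_sets(hsi, wavelet="db1", level=2):
    """Two-level 2D-SWT per spectral band; returns four same-size cubes.

    hsi: (rows, cols, bands) array whose spatial sizes are divisible by
    2**level (pad beforehand otherwise, a pywt.swt2 requirement).
    Sub-bands of all bands and levels are concatenated along the third
    axis, mirroring the A, H, V and D feature sets of the paper.
    """
    A, H, V, D = [], [], [], []
    for b in range(hsi.shape[2]):
        for cA, (cH, cV, cD) in pywt.swt2(hsi[:, :, b], wavelet, level=level):
            A.append(cA); H.append(cH); V.append(cV); D.append(cD)
    # No down-sampling: every sub-band keeps the original rows x cols size
    return tuple(np.stack(s, axis=-1) for s in (A, H, V, D))
```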

Spectral feature extraction using kernel minimum noise fraction (KMNF)
PCA is the most commonly used spectral feature reduction approach in the current literature; however, with fewer components, the contribution of noise does not necessarily reduce (Kordi Ghasrodashti et al. 2017). The minimum noise fraction (MNF) transform was introduced to tackle this problem. MNF is similar to PCA except that, instead of choosing new components based on maximum variance, it considers the maximum signal-to-noise ratio. Although MNF is useful in a variety of domains, it is inefficient in nonlinear problems where signal and noise cannot be assumed to be linearly separable. Kernel MNF (KMNF) was developed to overcome this issue (Gao et al. 2017). In KMNF, the original data is mapped to a higher-dimensional space known as a reproducing kernel Hilbert space (RKHS). Then, by estimating the noise variance in the new space and removing noisy samples, the signal-to-noise ratio is maximised. Hence, the bands with higher SNR can be retained using KMNF.
Suppose a set of n observations (image pixels) of p variables (spectral bands) is represented as a matrix X with n rows and p columns; without loss of generality, assume that each column of X has zero mean.
A set of linear combinations of the variables, $a^T x_i$, is considered. Then, the noise fraction (NF) of this set is minimised or, equivalently, its signal-to-noise ratio (SNR) is maximised. The details of the MNF algorithm are explained in the following.
In the MNF method, $x_i$ is the sum of a signal part ($x_{iS}$) and a noise part ($x_{iN}$) that are presumed to be uncorrelated:
$$x_i = x_{iS} + x_{iN}.$$
As a result, the variance-covariance matrix S is the sum of the signal and noise components:
$$S = S_S + S_N.$$
The noise fraction is the ratio of the noise variance to the total data variance; so, for a linear combination $a^T x_i$ of the zero-mean $x_i$, the NF is obtained as
$$NF = \frac{a^T S_N a}{a^T S a}.$$
Similarly, the SNR is expressed as the ratio of the variance of the signal to the variance of the noise:
$$SNR = \frac{a^T S_S a}{a^T S_N a}.$$
From the above equations we obtain $NF = 1/(SNR + 1)$, or $SNR = (1/NF) - 1$, which implies that we minimise the NF by maximising the SNR.
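For illustration, a linear MNF reduces to a generalised eigenproblem and can be sketched in a few lines; the shift-difference noise estimate below is a common heuristic and an assumption of ours, as the paper does not prescribe a noise estimator:

```python
import numpy as np
from scipy.linalg import eigh

def mnf(X, n_components):
    """Linear MNF sketch: maximise a^T S a / a^T S_N a.

    X: (n_pixels, n_bands). Noise is estimated from differences of
    consecutive pixels; eigh solves the generalised eigenproblem
    S a = w S_N a, where w = SNR + 1.
    """
    Xc = X - X.mean(axis=0)
    N = Xc[1:] - Xc[:-1]                      # crude noise estimate
    S = Xc.T @ Xc / (Xc.shape[0] - 1)         # total covariance
    S_N = N.T @ N / (2 * (N.shape[0] - 1))    # noise covariance
    w, A = eigh(S, S_N)                       # ascending eigenvalues
    order = np.argsort(w)[::-1][:n_components]
    return Xc @ A[:, order]                   # components ordered by SNR
```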
The MNF can thus be formulated as a maximisation problem:
$$\max_{a} \frac{a^T X^T X a}{a^T X_N^T X_N a}, \qquad (6)$$
where $X_N$ is the noise matrix, so that $X^T X$ and $X_N^T X_N$ estimate the total and noise covariances respectively. Because of the nonlinear structure of the scattering distribution function, multiple scattering within a pixel, and the variability of subpixel constituents in hyperspectral data (Mohan et al. 2007), the MNF approach cannot handle this nonlinearity. To overcome this issue, the kernel MNF (KMNF) approach was developed. In KMNF, the original data is mapped into a higher-dimensional feature space using a mapping function (Ψ), and then a linear analysis is carried out in that space. The higher dimensionality of the feature space makes it more likely that the transformed samples are linearly separable (Cover 1965). Because the sample coordinates would have to be computed in that high-dimensional space, the computational load would be significantly increased. To avoid such a computation, the kernel trick is utilised, in which the dot products between mapped samples are collected in a kernel matrix. Without knowledge of the transform functions, the kernel matrices provide all of the essential information for executing linear algorithms in the high-dimensional space.
The dual formulation of Equation (6) is obtained by setting $a = cX^T b$ (c is a constant):
$$\max_{b} \frac{b^T X X^T X X^T b}{b^T X X_N^T X_N X^T b}. \qquad (7)$$
Maximisation of Equation (6) is a generalised eigenvalue problem, and it can be shown that Equations (6) and (7) have the same nonzero eigenvalues. It is worth noting that setting $a = cX^T b$ makes the problem easier to tackle in the kernel version. By considering Ψ as the transformation applied to the data matrix X, the kernelized version of (7) becomes
$$\max_{b} \frac{b^T K K b}{b^T K_N K_N^T b},$$
where $K = \Psi\Psi^T$, $\Psi_N$ is the mapping of the noise matrix $X_N$ of size $n \times q$, and $K_N = \Psi\Psi_N^T$.
In this paper, five feature sets are generated from the original HSI and the four concatenated wavelet bands of the approximation and details. First, KMNF is applied to the original dataset in the direction of the bands. In the kernel version, the transformation maps nonlinear relationships between features to a linear form, which leads to the effective removal of noisy features. As a result, the bands reconstructed from the new features have the highest SNR. After that, applying KMNF to the four wavelet feature sets reduces the feature dimension while also transferring all features to the same space.

Multiple kernel learning classification
As mentioned before, after extracting the feature vector and reducing its dimension, these reduced features are employed to classify the data. The SVM is the most popular classification method used for hyperspectral image classification. A training dataset X with samples $x_i, i = 1, 2, \ldots, N$ is available, where $\{x_i, y_i\}$ are the training pairs and $y_i$ are the class labels.
The SVM problem is defined as a minimisation problem as follows:
$$\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i$$
subject to $y_i(\langle w, \Phi(x_i)\rangle + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$, where w is the vector of parameters determining the optimal decision hyperplane $\langle w, \Phi(x_i)\rangle + b = 0$, b and $\xi_i$ are the bias and slack variables respectively, C is a regularisation parameter that controls the penalty for misclassified samples, and Φ is a nonlinear mapping function.
Based on Mercer's theorem, the inner product has an equivalent kernel representation:
$$\langle \Phi(x_i), \Phi(x_j)\rangle = K(x_i, x_j),$$
where $K(x_i, x_j)$ is a kernel function satisfying certain conditions: it must be continuous, symmetric and positive semidefinite (PSD).
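In practice, any Mercer kernel can be supplied to an SVM as a precomputed Gram matrix. A minimal scikit-learn sketch with a toy Gaussian kernel (an illustration of ours; the paper's experiments were run in Matlab, and the value of C is arbitrary):

```python
import numpy as np
from sklearn.svm import SVC

def rbf(A, B, sigma=1.0):
    """Gaussian kernel K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(40, 5)), rng.integers(0, 2, size=40)
X_te = rng.normal(size=(10, 5))

clf = SVC(C=10.0, kernel="precomputed")    # C = 10 is an illustrative value
clf.fit(rbf(X_tr, X_tr), y_tr)             # (n_train, n_train) Gram matrix
labels = clf.predict(rbf(X_te, X_tr))      # rows: test, columns: train
```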
Because the kernel may be regarded as a measure of similarity between two samples in the SVM, kernel selection is a key challenge. MKL is an effective way to incorporate various features or information sources without the limitations of a single-kernel model.
Some typical examples of kernels used in classification are the linear kernel, $K(x_i, x_j) = x_i^T x_j$, and the Gaussian (RBF) kernel, $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$. The Gaussian (RBF) kernel has been employed in most MKL algorithms because it is more efficient in terms of classification accuracy and is also the most representative kernel function, with characteristics such as translation invariance (Gu et al. 2012, 2015). There is no specified rule for choosing the bandwidth of the Gaussian kernel. A small-bandwidth kernel is sensitive to changes in similarities, whereas a high-bandwidth kernel is immune to minor variations in similarities. As a result, the bandwidth (σ) may be thought of as the scale at which the kernel compares samples, indicating the kernel's discriminative ability.
In MKL methods, a linear or nonlinear combination of kernels is created. Since a linear combination of Gaussian kernels is more effective for extracting similarity (Wang et al. 2016), our proposed method is based on SimpleMKL, in which a linear combination of base kernels is considered:
$$K(x_i, x_j) = \sum_{m=1}^{M} d_m K_m(x_i, x_j), \quad d_m \geq 0, \ \sum_{m=1}^{M} d_m = 1.$$
The optimal kernel is learned from the data by constructing a weighted linear combination of M base kernels. Each kernel in the mixture may correspond to a different feature or set of features.
The main reasons why we have proposed an MKL-based method are:
(1) MKL is capable of efficiently fusing heterogeneous spectral and spatial features.
(2) The difficulty of kernel selection and the limitation of using a fixed kernel are alleviated.
(3) Learning from multiple kernels provides a better similarity-measuring ability.
In the proposed method, Gaussian kernels with σ = [0.1, 1, 1.5, 2] are constructed from the original, approximation and detail feature sets (the kernel scales are chosen as in the reference methods for a fair comparison). In the following, the optimisation of the linear weighted combination in the kernel learning procedure is explained.
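A sketch of this kernel construction: one RBF kernel per (feature set, scale) pair gives 5 × 4 = 20 base kernels, which the learned MKL weights then combine linearly (the function names below are ours, not from the paper):

```python
import numpy as np

def rbf_kernel(F, sigma):
    """Gaussian Gram matrix over the rows (pixels) of feature matrix F."""
    sq = np.sum(F**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * F @ F.T
    return np.exp(-np.maximum(d2, 0) / (2 * sigma**2))

def base_kernels(feature_sets, sigmas=(0.1, 1.0, 1.5, 2.0)):
    """20 base kernels: each of the 5 feature sets at each of the 4 scales."""
    return [rbf_kernel(F, s) for F in feature_sets for s in sigmas]

def combined_kernel(kernels, d):
    """Weighted sum sum_m d_m K_m with d_m >= 0 and sum_m d_m = 1."""
    return sum(w * K for w, K in zip(d, kernels))
```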
The goal of MKL is to simultaneously optimise the SVM dual variables $\alpha_i$ and the kernel weights $d_m$ in the combination above. The SimpleMKL method utilises a gradient descent algorithm to solve the MKL problem. Substituting the combined kernel into the SVM dual yields the objective
$$J(d) = \max_{\alpha} \ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i \alpha_j y_i y_j \sum_{m=1}^{M} d_m K_m(x_i, x_j),$$
subject to the usual SVM dual constraints ($0 \leq \alpha_i \leq C$, $\sum_i \alpha_i y_i = 0$), and the MKL problem becomes
$$\min_{d} J(d) \quad \text{subject to} \quad \sum_{m=1}^{M} d_m = 1, \ d_m \geq 0,$$
where, in the underlying primal, $w_m$ is the weight of the decision function associated with kernel $k_m$. The whole proposed algorithm, SWT-KMNF-MKL, is summarised as follows:

The Proposed Algorithm (SWT-KMNF-MKL)
Input: Hyperspectral image and training labels
1: Apply two levels of 2D-SWT to each spectral band and build the X, A, H, V and D feature sets.
2: Apply KMNF to reduce the dimension of the X, A, H, V and D feature sets.
3: Generate RBF base kernels ($k_m$) from each feature set with σ = [0.1, 1, 1.5, 2] (20 base kernels in total).
4: Calculate a linear combination of the base kernels: $K_t(x_i, x_j) = \sum_{m=1}^{20} d_m k_m(x_i, x_j)$.
5: Compute the optimum weights ($d_m$) based on SimpleMKL and generate the combined kernel $K_t$.
6: Apply the SVM with $K_t$ to predict the classification labels.
Output: Predicted labels
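For reference, the gradient used in the SimpleMKL descent step (step 5 above) has a simple closed form once the dual variables α are known for the current combined kernel (Rakotomamonjy et al. 2008); a sketch:

```python
import numpy as np

def mkl_gradient(alpha, y, kernels):
    """dJ/dd_m = -0.5 * (alpha*y)^T K_m (alpha*y) for each base kernel K_m.

    alpha: SVM dual variables for the current combined kernel; y: labels
    in {-1, +1}. SimpleMKL takes a descent step on d along this gradient,
    projected back onto the simplex {d_m >= 0, sum_m d_m = 1}.
    """
    v = alpha * y
    return np.array([-0.5 * v @ K @ v for K in kernels])
```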

Datasets
In our experiments, two HSI datasets were used. The first dataset, Pavia University, was collected by the Reflective Optics System Imaging Spectrometer (ROSIS-03) sensor over an urban area surrounding the University of Pavia, Italy. It has 103 spectral bands after removing 12 water absorption bands, with 42,776 labelled samples from 9 classes of interest. The size of each band is 610 × 340 pixels with a spatial resolution of 1.3 m per pixel (Gamba 2004). In Figure S2 a false-colour image of this dataset is illustrated. The other HSI dataset is Indian Pines. It was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over an agricultural test site in north-western Indiana. It has 145 × 145 pixels with a spatial resolution of 20 m per pixel and 200 spectral bands after removing 20 water absorption bands. It consists of 10,366 labelled pixels and 16 classes, and is available online. Figure S3 shows a false-colour composite image of bands 50, 27, and 17 and the reference map. These two datasets were captured with different spatial and spectral resolutions in agricultural and urban regions.

Evaluation metrics
The classification methods are evaluated in terms of two quality metrics: the overall accuracy (OA) and Kappa coefficient (κ).

Overall accuracy
Overall accuracy is the percentage of correctly classified samples among the tested samples, calculated from the confusion matrix as follows:
$$OA = \frac{\sum_{i=1}^{\rho} C_{ii}}{C_T} \times 100,$$
where $C_{ii}$ is the number of correctly classified pixels in the ith class, $C_T$ is the total number of test samples and ρ represents the number of classes in the reference map.

Kappa coefficient
The Kappa coefficient represents the agreement between the classification results and the reference beyond that expected by chance, and is defined as
$$\kappa = \frac{p_0 - p_e}{1 - p_e},$$
where $p_0$ is the observed agreement and $p_e$ is the expected chance agreement, both computed from the confusion matrix. Here ρ is the number of classes in the ground truth and $C_{ij}$ is the number of samples assigned to the jth class while belonging to the ith class.
The Kappa coefficient lies between 0 and 1. If it equals 0, there is no agreement between the classified image and the reference image; if it equals 1, the classified image and the ground-truth image are identical. So, the higher the Kappa coefficient, the more accurate the classification.
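Both metrics follow directly from the confusion matrix; a small sketch under the definitions above:

```python
import numpy as np

def oa_and_kappa(C):
    """Overall accuracy (in percent) and Kappa from confusion matrix C.

    C[i, j]: number of samples of reference class i predicted as class j.
    p0 is the observed agreement, pe the agreement expected by chance.
    """
    total = C.sum()
    p0 = np.trace(C) / total
    pe = (C.sum(axis=0) @ C.sum(axis=1)) / total**2
    return 100.0 * p0, (p0 - pe) / (1.0 - pe)
```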

Simulation details and experimental results
As shown in the block diagram of the proposed method (Figure 1), kernels of different scales are constructed from different spatial features. After applying two levels of 2D-SWT to the original images, feature sets are created by concatenating the original spectral bands and the approximation, horizontal, vertical and diagonal sub-bands separately. The high number of spectral bands degrades the classification performance because of the Hughes phenomenon, especially with a low number of training samples; this issue is addressed using the KMNF dimension reduction technique. Since the Indian Pines dataset is noisier than the Pavia University image, the number of bands retained by KMNF is smaller for the former than for the latter. In the proposed method, four Gaussian kernels were chosen with σ in the range of 0.05 to 2, as in other MKL-based methods, by experimenting with different scales to balance complexity and classification accuracy (Gu et al. 2017). The results confirm that the best performance is achieved for the scales [0.1, 1, 1.5, 2], so for the 5 feature sets, a total of 20 base kernels are generated. Finally, using the SimpleMKL method, the base kernels are linearly combined to obtain the optimal kernel combination. The computational complexity of the SimpleMKL algorithm is O(m × n^3), where m denotes the number of base kernels and n denotes the number of training samples (Rakotomamonjy et al. 2008). In all experiments, only a few pixels (i.e. 1%, 3% and 5%) from each class are selected randomly as training data, and the remaining pixels are considered test samples. The mean of the overall accuracy and the class-specific accuracy are reported with the standard deviation over 10 repetitions. The whole process was run on a computer with an Intel Core i7 (3.6 GHz) and 32 GB of RAM in Matlab 2017b. The computational times of the testing and training procedures are also reported.
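The random per-class sampling protocol just described can be sketched as follows (a hypothetical helper of ours; label 0 is assumed to mark unlabelled background pixels):

```python
import numpy as np

def stratified_split(labels, rate=0.01, seed=0):
    """Randomly pick `rate` of the labelled pixels of each class for
    training; the remaining labelled pixels become the test set."""
    rng = np.random.default_rng(seed)
    train = []
    for c in np.unique(labels[labels > 0]):
        idx = np.flatnonzero(labels == c)
        n = max(1, int(round(rate * idx.size)))
        train.append(rng.choice(idx, size=n, replace=False))
    train = np.concatenate(train)
    test = np.setdiff1d(np.flatnonzero(labels > 0), train)
    return train, test
```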
To evaluate the proposed method, classification accuracy based on OA and Kappa is provided for the two aforementioned datasets, and the results of three different experiments are reported. In the first experiment, the classification accuracy of the proposed method was compared to a conventional sparse method, SRC (Li and Du 2016), in which spectral features are used and the regularisation parameter is set to its optimal value. Moreover, our findings were compared to a random-forest-based classifier (Hong et al. 2020b) in which spatial features are extracted via invariant attribute profiles (IAPs). MFL (Li et al. 2014) is another reference method, based on the MLR classifier, in which a stacked feature vector is constructed from morphological attribute profiles (MAPs) and spectral features. The MAPs were generated using the standard deviation and area attributes. Threshold values for the standard deviation attribute are set in the range of 2.5% to 10% of the mean of the individual features, with a step of 2.5%, and thresholds of 200, 500, and 1000 are chosen for the area attribute.
In the second experiment, MKL-based methods, namely Sparse-MKL (Gu et al. 2014), CS-MKL (Liu et al. 2016), and NMKL (Gu et al. 2016), are used for a fair comparison. In this experiment, PCA is applied to the datasets and the first principal components accounting for 99% of the total variation are retained. Ten closings and openings are applied to each PC with ten sizes [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] of a diamond structuring element. Therefore, 21 base kernels are created, corresponding to the 10 closings, the 10 openings and the PCs respectively. For Sparse-MKL, the optimum kernel is acquired by robust PCA. In the CS-MKL method, base kernels are calculated for each pair of classes and the kernel learning process is performed for every two classes separately. In NMKL, the kernel matrix in the learning model is constructed based on the Hadamard product of base kernels. The base kernels for NMKL(G) and NMKL(L) are Gaussian and linear respectively.
In the third experiment, the performance of two approaches with different feature extraction, KMNF-MKL and SWT-PCA-MKL, is presented to show the effectiveness of our proposed method. In KMNF-MKL, KMNF is first applied in the spectral domain without SWT, and the bands with maximum SNR are separated; then the base kernels (with four Gaussian scales) are obtained from the original spectral features. In the SWT-PCA-MKL method, all procedures are similar to the proposed SWT-KMNF-MKL, except that KMNF is replaced by PCA.
In the proposed method, KMNF is applied to the original HSI and the approximation and detail coefficient sets to extract spectral features. The number of KMNF bands for the spectral and wavelet features was determined experimentally in terms of maximum accuracy.

Experiment 1: comparison with non-MKL methods
The overall accuracy of some popular classification methods for 1% training samples is shown in Table 1. In this trial, our approach is compared to SRC (Li and Du 2016), IAP (Hong et al. 2020b) and MFL (Li et al. 2014). An overall accuracy higher than 96% for Pavia University demonstrates the superior performance of SWT-KMNF-MKL. It can be seen that the proposed method has higher computational costs, while its accuracy is greater than that of all of these methods. The numerical results in Table 2 show that the proposed method outperforms all other common classifiers; SWT-KMNF-MKL improves on the SRC, IAP, and MFL overall accuracy by about 23.6%, 7.6%, and 7.8%, respectively. We infer from these findings that by weighting the base kernels, the similarity between observed samples may be measured more accurately. Figure 2 depicts the classification maps of the referenced methods for Pavia University. This figure clearly shows that with the proposed technique, the number of incorrectly classified pixels in the bare-soil class is much lower, while the classification results for bitumen and asphalt are much better. The Kappa coefficients versus the number of training samples per class in Figure S5 also confirm the superiority of the proposed method for the Pavia dataset. The classification maps for Indian Pines are shown in Figure 3. As can be seen, the map produced by the proposed method discriminates well between the soybeans-notill and soybeans-mintill classes and is less noisy than the others. Also, the results in Figure S7 indicate the better performance of our method in terms of Kappa coefficients, especially in the case of 10 and 20 training samples.
For better evaluation, the accuracy of the proposed method is compared to some deep learning methods in Table S3. Since the common dataset between our method and the referenced paper (Jia et al. 2021) was Pavia University, we have only presented the results for this dataset. At training rates of 10 and 50 samples per class, the proposed method outperforms all the deep learning methods. Further testing demonstrated that the accuracy of the proposed method is nearly the same as that of the deep learning approaches when there are 100 training samples per class. Our method has also been compared with a classification method based on 3D-SWT and SVM (Qian, Ye, and Zhou 2012), in which two levels of decomposition were applied to the hyperspectral cube and texture features were constructed by concatenating all wavelet sub-cubes. Comparing the overall accuracy of SWT-KMNF-MKL for the Pavia dataset (96.32%) and Indian Pines (94.05%) to the 3D-SWT results (88.78% and 77.03% respectively) shows the outstanding performance of the proposed method.

Experiment 2: comparison with MKL-based methods
For further evaluation, our method is compared with different MKL-based methods. Table 3 presents the results for Pavia University with 1% training samples. In detail, the proposed method improves the overall accuracy by 2.76% to 6.9% compared with all the MKL-based approaches. According to the results for Indian Pines in Table 4, the proposed classifier has superior performance among all competing methods for the same training sample size. Compared with other MKL-based approaches, the major advantage of the proposed method is that it achieves high overall accuracy with a low number of training samples (less than 5%) without an increase in computational cost. The classification maps of the MKL-based approaches in Figures 4 and 5 also illustrate that the output maps of the proposed method for the two aforementioned datasets are very close to the ground truth. The classes that are easily misclassified in the Pavia dataset, such as gravel, bare soil, and bitumen, are better differentiated by the proposed SWT-KMNF-MKL. Figure S9 shows the bar plot of Kappa coefficients versus the number of training samples per class for Pavia; the highest Kappa coefficients of SWT-KMNF-MKL are consistent with the findings of Table 3.
Also, it can be seen in Figure S10 that the misclassified pixels from the proposed method are negligible for Indian Pines. The proposed SWT-KMNF-MKL surpasses the other classifiers, especially for classes that are easily mislabelled such as woods, soybeans no-till and min-till. Based on the results in Figure S11 for Indian Pines, Kappa improves for all five methods in a similar way as the training sample size increases. When the size of the training samples is fixed, the performance of SWT-KMNF-MKL is always outstanding.

Experiment 3: other feature extraction methods
The main advantage of employing KMNF is shown in Tables S6 and S7, which compare the proposed SWT-KMNF-MKL technique with the SWT-PCA-MKL and KMNF-MKL methods in order to investigate the impact of different feature extraction strategies. The results show that SWT-KMNF-MKL outperforms the other two methods in terms of overall accuracy for both datasets. Also, as the number of training samples increases, the standard deviation decreases. SWT-PCA-MKL employs PCA to extract features from the original dataset and the wavelet bands; the remaining stages are identical to the SWT-KMNF-MKL method. The base kernels in KMNF-MKL are obtained from spectral features extracted by KMNF. The efficiency of the proposed method is due to the fact that noisy features are eliminated by using KMNF.
As indicated in Table S7, SWT-PCA-MKL achieves an accuracy of 64.86% for the Indian Pines dataset, which KMNF improves to 78.64%. This means that the MKL classification performance based on SWT and KMNF is greater than that based on SWT and PCA. Furthermore, the superior performance of SWT-KMNF-MKL over KMNF-MKL demonstrates that incorporating spatial information with spectral features leads to a significant improvement in classification accuracy.

Conclusions
In this paper, a spatial-spectral classification method based on the MKL technique has been presented. In the proposed method, 2D-SWT is first applied to each spectral band to extract significant spatial information. Then, KMNF is utilised to reduce the dimension of the features and remove noisy ones. After that, base kernels are constructed from the features to implement the MKL procedure and obtain the optimum kernel. By utilising MKL, the features leading to the most class separability are retained and redundant features are omitted. The proposed method achieves maximum accuracy for very small training sample sets without increasing the complexity cost. The most significant advantage of the proposed method is its superior classification results compared to state-of-the-art MKL-based methods for very low (1%, 3%, 5%) training rates.

Figure 1. Illustration of the proposed method.

Figure 2. Classification maps of Pavia University for different approaches with 1% training samples.

Figure 3. Classification maps of Indian Pines for different approaches with 1% training samples.

Figure 4. Classification maps of Pavia University for the different MKL-based approaches with 1% training samples per class.

Figure 5. Classification maps of Indian Pines for the different MKL-based approaches with 1% training samples per class.

Table 1. Classification results (in percent) for Pavia University compared with non-MKL methods.

Table 2. Classification results (in percent) for the Indian Pines dataset compared with non-MKL methods.

Table 3. Classification results (in percent) for the Pavia University dataset compared with MKL-based methods.

Table 4. Classification results (in percent) for the Indian Pines dataset compared with MKL-based methods.