Late fusion of deep learning and handcrafted visual features for biomedical image modality classification

Abstract: Much of medical knowledge is stored in the biomedical literature, collected in archives such as PubMed Central that continue to grow rapidly. A significant part of this knowledge is contained in images with only limited metadata available, which makes it difficult to explore the visual knowledge in the biomedical literature. Extraction of metadata from visual content is therefore important. One important piece of metadata is the type of the image, which could be one of the various medical imaging modalities, such as X-ray, computed tomography or magnetic resonance images, or one of the general graph types that are frequent in the literature. This study explores a late, score-based fusion of several deep convolutional neural networks with a traditional hand-crafted bag of visual words classifier to classify images from the biomedical literature into image types or modalities. It achieved a classification accuracy of 85.51% on the ImageCLEF 2013 modality classification task, which is better than the best visual methods in the challenge that the data were produced for, and comparable to mixed methods that make use of both visual and textual information. It achieved similarly good results of 84.23% and 87.04% classification accuracy before and after training-set augmentation, respectively, on the related ImageCLEF 2016 subfigure classification task.


Introduction
The advent of the Internet has turned many problems of information scarcity into problems of abundance. The sheer volume of information available online has made information retrieval increasingly important in daily life to avoid wasting time on irrelevant information, as evidenced by the popularity of search engines such as Google and Bing. The medical field is no exception to this trend, with searchable online archives such as PubMed that continue to grow rapidly as more articles are submitted and indexed for use by medical practitioners and researchers.
The PubMed Central archive of the biomedical open access literature is a large collection of articles containing text and images that represent an important part of the biomedical knowledge. Visual information plays an important role in representing the stored knowledge, but with only little metadata available it is hard to exploit this information directly. Biomedical image modality classification is the problem of labelling biomedical images with their modality or, in a broader sense, the image type of the figure.
In medical imaging, a modality is a method or technique for creating images, such as X-rays, magnetic resonance imaging (MRI), computed tomography (CT) or electron microscopy [1]. The biomedical literature also contains many other image types, such as nonclinical graphs, flowcharts and illustrations. There are also compound figures made up of two or more sub-images, each with the same or different modalities. Such compound figures usually need to be separated before classifying the subfigures into their image types [2], but extracting image types without a separation is also possible.
Biomedical image type classification is important to improve medical image retrieval by filtering or reordering results based on modality information. Automatic prediction of the modality in queries may be used to improve retrieval results [3]. Alternatively, users of medical image retrieval systems have suggested that explicit querying by modality would be useful to them [4]. Thus, modality classification is important for the retrieval of images lacking explicit modality information, as found in the biomedical literature.
Biomedical image type classification is made difficult by low intra-class similarity, imbalanced class distribution and data that are scarce relative to the wide variety of images in the different classes. Some modalities contain images that are visually dissimilar or look very different from one another, as illustrated in Fig. 1. This low intra-class similarity and the semantic nature of biomedical image modalities make purely unsupervised methods difficult. Supervised methods are challenged by the imbalanced class distribution and the scarcity of some of the classes in the available datasets, as illustrated in Fig. 2.
The classification model proposed in this paper makes use of deep convolutional neural networks (CNNs) for their state-of-the-art performance in other image classification problems [6-10]. Transfer learning was applied to overcome the limited amount of available training data, and score-based fusion was used for the classifier combination of multiple CNNs as well as traditional hand-crafted features to further improve performance. Different CNN architectures, transfer learning methods and score-based fusion operators were explored. Only CNNs supported by MATLAB without third-party extensions were used.

Background
The original impetus for most work on biomedical image modality classification was the ImageCLEF modality classification or detection sub-task running from 2010 to 2013 [5, 11-13]. The key differences between the datasets are summarised in Table 1. A similar problem was later reintroduced as the ImageCLEF subfigure classification sub-task in 2015 and 2016, which focused on modality classification of individual subfigures of compound figures, in effect removing the 'COMP' or compound figure modality [14,15].
Traditionally, low-level visual features were extracted and used with different classifiers. Support vector machines (SVMs) were a popular classifier choice [16-19, 22, 23, 25] in biomedical image modality classification and still remain in use [26-30]. Other classifiers such as k-nearest neighbours [31], random forests [32], genetic programming [33] and linear regression [34] were attempted but did not perform significantly better than SVMs in many cases. Different low-level feature selection, combination or fusion criteria were also examined in biomedical image modality classification. Early fusion, or feature-level fusion using concatenation, was the most common, followed by late score-level fusion using the average score [16-23]. Several papers also found late fusion with rank-based combination to be more stable than score-based fusion. More complex fusion criteria such as multiple kernel learning (MKL) or kernel-level fusion [16] and covariance descriptors [35] were also explored.

Deep representation learning
Deep learning approaches learn multiple levels of representations or features from input images, with each level transforming the input into a higher, more abstract representation until the final output as class labels [36]. CNNs are the current state-of-the-art deep learning approach for image classification, as evidenced by results in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [8].
Deep learning approaches require large training datasets to achieve state-of-the-art classification performance, but this requirement can be mitigated using transfer learning. Transfer learning takes representations learned from an extensive training dataset and applies them to a different target dataset or problem that is usually smaller or more limited [37,38]. There are two main ways to approach transfer learning with CNNs (both are sketched below):

i. Extract features from pre-trained CNNs without training them on the target dataset, for use in other classifiers such as SVMs [39,40].

ii. Fine-tune pre-trained CNNs by adapting the last CNN layers and then training on the target dataset, so that the trained CNN is used as the classifier [37,41].
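The following sketch illustrates the two strategies in Python with PyTorch/torchvision. It is illustrative only, since the experiments in this paper were run in MATLAB, and the 31-class output size is an assumption matching the ImageCLEF 2013 class count.

```python
# Illustrative sketch of the two transfer-learning strategies (the
# paper's implementation uses MATLAB; this PyTorch code is an analogy).
import torch.nn as nn
from torchvision import models

num_classes = 31  # e.g. the ImageCLEF 2013 modality classes

# (i) Feature extraction: freeze the pre-trained backbone and expose the
# penultimate 4096-dim activations as fixed features for an external SVM.
backbone = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
for p in backbone.parameters():
    p.requires_grad = False
feature_extractor = nn.Sequential(
    backbone.features, nn.AdaptiveAvgPool2d((6, 6)), nn.Flatten(),
    *list(backbone.classifier.children())[:-1],  # drop the 1000-way layer
)

# (ii) Fine-tuning: replace the ImageNet-specific head with a layer sized
# for the target classes, then continue training the whole network.
finetune_net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
finetune_net.classifier[6] = nn.Linear(4096, num_classes)
```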
Early deep learning approaches to biomedical image modality classification did not take optimal advantage of transfer learning and were limited by the relatively small datasets [42-44]. Later attempts used transfer learning (mostly from ImageNet), sometimes with data augmentation to expand the available dataset [26,27,45]. Fusion was also applied to combinations of different pre-trained CNN architectures to further improve classification performance [29,45,46].

Methods and materials
All the datasets used for evaluation were from the ImageCLEF 2013 modality classification and ImageCLEF 2016 subfigure classification sub-tasks, the most recent and extensive datasets for their respective sub-tasks [14,17]. All datasets are still available to researchers, so the results are reproducible. Additionally, the training set of the '2016' dataset was augmented with all non-compound images from the '2013' dataset for comparison. It is denoted henceforth as '2016augtrn' to distinguish it from the original '2016' dataset without the augmented training set.
A high-level overview of the proposed method is shown in Fig. 3. We use the AlexNet [47], VGG-16 and VGG-19 [48] architectures for transfer learning in the deep learning models due to their image classification performance and compatibility with MATLAB. The pre-trained CNNs were originally trained to compete in ILSVRC on a source dataset of over 1,000,000 images in 1000 classes [8]. We use a bag of visual words (BoVW) or bag of keypoints [49] model using SIFT descriptors [24] as a common exemplar for hand-crafted visual features due to their good image classification performance and wide use in biomedical image modality classification [16-19, 23].

Fine-tuning pre-trained CNNs for softmax classification
The pre-trained CNN is adapted by replacing the last three layers, which are specific to the source ImageNet dataset, with layers created for the target ImageCLEF dataset. The global learning rate of all the layers in the adapted CNN is lowered, whereas the newly created layers are given a multiplier to increase their learning rate. We empirically selected a global learning rate of 0.00025 and a multiplier of 10 for the newly created layers.
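One way to realise this layer replacement and the per-layer learning rates is sketched below in PyTorch; the parameter grouping is an assumption for illustration, as the actual implementation uses MATLAB's layer-wise learning rate factors.

```python
# Sketch: replace the classification head and give the new layer a x10
# learning rate relative to the lowered global rate of 0.00025.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 31  # illustrative target class count
net = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
net.classifier[6] = nn.Linear(4096, num_classes)  # new target-specific head

base_lr = 2.5e-4
new_ids = {id(p) for p in net.classifier[6].parameters()}
old_params = [p for p in net.parameters() if id(p) not in new_ids]
optimizer = torch.optim.SGD(
    [{"params": old_params},                                   # global rate
     {"params": net.classifier[6].parameters(),
      "lr": base_lr * 10}],                                    # new head x10
    lr=base_lr, momentum=0.9,
)
```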
The adapted CNN is then trained using stochastic gradient descent with momentum (SGDM) with mini-batches of size 32, shuffled every epoch, and a momentum of 0.9. We use early stopping in addition to the existing dropout and weight decay in the pre-trained CNNs to avoid overfitting. All these parameters were chosen as standard parameters that have shown good performance in other image classification tasks. We hold back 10% per class of the training images as a validation set and train the adapted CNN on the remainder. Training continues until the minimum validation loss, evaluated every 50 iterations, has not decreased for 7 consecutive evaluations. Once training stops, the fine-tuned CNN is evaluated on the test images using the built-in softmax layer for classification.
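A minimal sketch of this early-stopping schedule follows, again in PyTorch for illustration; `train_loader` and `val_loader` stand for mini-batch loaders over the 90/10 training/validation split and are assumed names, not part of the original implementation.

```python
# Early stopping: evaluate the validation loss every 50 iterations and
# stop once it has not improved for 7 consecutive evaluations.
import copy
import torch

def validation_loss(net, loss_fn, val_loader):
    net.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for x, y in val_loader:
            total += loss_fn(net(x), y).item() * len(y)
            n += len(y)
    net.train()
    return total / n

def train_early_stopping(net, optimizer, loss_fn, train_loader, val_loader,
                         eval_every=50, patience=7):
    best, best_state, stale, it = float("inf"), None, 0, 0
    while stale < patience:
        for x, y in train_loader:  # loader reshuffles every epoch
            optimizer.zero_grad()
            loss_fn(net(x), y).backward()
            optimizer.step()
            it += 1
            if it % eval_every == 0:
                val = validation_loss(net, loss_fn, val_loader)
                if val < best:  # keep the best weights seen so far
                    best, best_state = val, copy.deepcopy(net.state_dict())
                    stale = 0
                else:
                    stale += 1
                    if stale >= patience:
                        break
    net.load_state_dict(best_state)
    return net
```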

Hand-crafted visual feature extraction for SVM classification
The traditional hand-crafted visual feature model is a BoVW model adapted from work done by Zare and Müller [50] on a similar dataset for the ImageCLEF 2015 compound figure detection sub-task [14]. It was successfully applied to X-ray image classification [51,52], a similar problem with similar data.
A BoVW model begins by detecting or sampling keypoints from the image. We use a difference-of-Gaussians detector to find local interest points and then extract SIFT [24] features around these interest points or keypoints. Next, the SIFT feature vectors are quantised into K groups or partitions using k-means clustering [49]. Each SIFT feature vector is then assigned to the closest cluster centre using nearest neighbours with the Euclidean distance. The image is represented by a feature vector or histogram constructed from the frequency with which each cluster centre or codeword occurs in the image. The parameter K or codebook size was set to K = 500 after empirically testing the values K = 400, 500 and 600.
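The pipeline up to this point can be sketched as follows in Python with OpenCV and scikit-learn; this is an illustrative reimplementation, not the MATLAB code used in the experiments.

```python
# BoVW sketch: DoG keypoints + SIFT descriptors, a k-means codebook of
# K = 500 visual words, and a codeword-frequency histogram per image.
import cv2
import numpy as np
from sklearn.cluster import KMeans

K = 500  # codebook size chosen empirically in the text

sift = cv2.SIFT_create()  # DoG detector with SIFT descriptors

def sift_descriptors(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    return desc if desc is not None else np.zeros((0, 128), np.float32)

def build_codebook(train_paths):
    all_desc = np.vstack([sift_descriptors(p) for p in train_paths])
    return KMeans(n_clusters=K, random_state=0).fit(all_desc)

def bovw_histogram(path, codebook):
    desc = sift_descriptors(path)
    hist = np.zeros(K)
    if len(desc):
        words = codebook.predict(desc)  # nearest centre (Euclidean)
        np.add.at(hist, words, 1)       # codeword frequencies
    return hist
```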
However, spatial information is ignored at this stage, as all keypoints are assigned equal weight. Spatial information is incorporated into the BoVW model using spatial pyramids [53]. The image is subdivided into L grids such that the level-l grid has 2^l cells along each direction for l = 0, 1, …, L − 1. The codeword frequency histograms are constructed for each cell, including the frequencies of the cells that subdivide them. The parameter L or pyramid level was set to L = 2, resulting in 2^0 × 2^0 + 2^1 × 2^1 = 5 cells or subdivisions. The final image feature vector is of dimension K × 5 = 2500.

The adapted model uses the LibSVM library [54] for classification. The Gaussian or radial basis function kernel is used in case the features are not linearly separable. Hyperparameter optimisation is done using grid search over C = 2^−5, 2^−3, …, 2^15 and γ = 2^−15, 2^−13, …, 2^3 to minimise the 10-fold cross-validation error on the training images. The ranges of γ (gamma) and C (cost), which determine the extent of influence of single training examples and the simplicity of the decision surface, respectively, are based on recommendations by the LibSVM authors [55]. The trained SVM classifier is then evaluated on the feature vectors extracted from the test images and the posterior probabilities are estimated.
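A sketch of the spatial pyramid and the grid search follows; scikit-learn's SVC wraps LibSVM, so the search below mirrors the described setup, while the pyramid indexing is an assumed straightforward implementation.

```python
# Spatial pyramid (L = 2: a 1x1 and a 2x2 grid, 5 cells, K*5 = 2500 dims)
# plus an RBF-SVM grid search over the LibSVM-recommended ranges.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def pyramid_histogram(keypoints_xy, words, img_w, img_h, K=500, levels=2):
    """keypoints_xy: (N, 2) keypoint coordinates; words: (N,) codeword ids."""
    hists = []
    for l in range(levels):
        cells = 2 ** l  # 2^l cells along each direction
        level_h = np.zeros((cells, cells, K))
        for (x, y), w in zip(keypoints_xy, words):
            cx = min(int(x * cells / img_w), cells - 1)
            cy = min(int(y * cells / img_h), cells - 1)
            level_h[cy, cx, w] += 1
        hists.append(level_h.reshape(-1))
    return np.concatenate(hists)  # dimension K * (1 + 4) = 2500

# Grid search over C = 2^-5, 2^-3, ..., 2^15 and gamma = 2^-15, ..., 2^3,
# minimising the 10-fold cross-validation error on the training images.
param_grid = {"C": 2.0 ** np.arange(-5, 16, 2),
              "gamma": 2.0 ** np.arange(-15, 4, 2)}
svm = GridSearchCV(SVC(kernel="rbf", probability=True), param_grid, cv=10)
# svm.fit(train_vectors, train_labels); svm.predict_proba(test_vectors)
```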

Classifier combination using score-based late fusion
Late fusion is quite common in biomedical image modality classification [16-18, 20, 43, 45]. However, usually only the average score is taken, which behaves similarly to the 'combSUM' fusion operator. We also explore other fusion operators, such as 'combMAX', 'combMIN' and 'combPROD' [56], as well as 'combMED', which takes the median score. The input scores are the estimated posterior probabilities from the SVM and softmax classifiers.
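These operators reduce the stacked per-classifier scores to one score matrix, as in this minimal NumPy sketch (an illustration of the standard definitions, not the MATLAB code used here):

```python
# Score-based late fusion over per-classifier posterior probabilities,
# each of shape (n_samples, n_classes).
import numpy as np

def fuse_and_predict(score_matrices, op="combSUM"):
    s = np.stack(score_matrices)  # (n_classifiers, n_samples, n_classes)
    ops = {"combSUM": s.sum(axis=0),
           "combPROD": s.prod(axis=0),
           "combMAX": s.max(axis=0),
           "combMIN": s.min(axis=0),
           "combMED": np.median(s, axis=0)}
    return ops[op].argmax(axis=1)  # fused class prediction per sample
```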

Results and discussion
The trained individual models are evaluated on the three datasets '2013', '2016' and '2016augtrn', with the results recorded in Table 2. Then, the individual models are combined with score-based late fusion on the same three datasets. The possible combinations of individual models are not explored exhaustively; only selected combinations are examined for clarity and conciseness.

Individual models
All individual models exhibit a general increase in performance as the size of the dataset increases from 5483 images in '2013' to 10,942 images in '2016' and finally to 14,306 images in '2016augtrn', which combines '2016' with the non-compound images of '2013'. This is the expected behaviour for machine learning algorithms.
Deep learning outperforms the traditional hand-crafted feature exemplar of BoVW using SIFT features and spatial pyramids. This is consistent with the relative performance of deep learning models in other image classification problems. However, it remains possible that the gap is partly due to implementation problems or insufficient optimisation given the limited resources.

Combination models
The estimated posterior probabilities from the SVM and softmax classifiers are used to perform late score-based fusion with the 'combSUM', 'combPROD', 'combMAX', 'combMED' and 'combMIN' fusion operators. Classifier combinations across different CNN architectures are recorded in Table 3, as this was found to be effective in similar articles [29,46]. Table 4 investigates the effects of combining deep learning and traditional hand-crafted models. Fig. 4 compares the classification performance of several score-based fusion operators. The aggregation-based fusion operators 'combPROD' and 'combSUM' seem to outperform the selection-based fusion operators 'combMAX' and 'combMIN', while 'combMED' is at a similar level. Fig. 4 also compares the classification performance of combinations with and without hand-crafted visual features. Combinations with hand-crafted visual features perform better on the '2013' dataset but worse on the '2016' dataset. The '2016augtrn' dataset is similarly affected, as most of its images come from the '2016' dataset, resulting in slightly worse performance.

Proposed model
The combination with the best test set accuracy for the '2013' dataset is a combination of the fine-tuned AlexNet, VGG-16 and VGG-19 models and the hand-crafted BoVW model using the 'combPROD' fusion operator. However, it does not obtain the best classification performance on the '2016' dataset, as previously noted. We prioritise the '2013' dataset, as it more closely resembles the distribution of images in the biomedical literature by including compound images [5]. Table 5 compares the accuracy of our proposed model with other models as baselines. We achieve good performance compared with other visual methods using only input images. Mixed methods utilising text input, such as captions, in addition to input images still perform better. The models with the best classification performance for each category and dataset are highlighted in bold.

Classifier analysis
After selecting the proposed model, we examine the classification results in detail to see what was classified correctly and what was misclassified. Table 6 in the Appendix records the detailed classification results for each modality or class for the '2013', '2016' and '2016augtrn' datasets. It presents the precision, recall and F1 score or F-measure, which are the preferred performance measures for imbalanced datasets.

'COMP' or compound figures are no longer a source of misclassifications in the '2016' and '2016augtrn' datasets. Fig. 6 shows that many classes are misclassified as the new dominant class 'GFIG', i.e. statistical figures, graphs and charts. Microscopy images with modality codes beginning with 'DM' are often misclassified as 'DMEL' or electron microscopy due to the visual dissimilarity within the class, as illustrated in Fig. 1. Fig. 7 shows the same patterns as Fig. 6 with a smaller proportion of misclassified images due to the augmented or expanded dataset.
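For reference, the per-class measures in Table 6 follow the standard definitions in terms of true positives (TP), false positives (FP) and false negatives (FN):

```latex
\mathrm{precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
```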

Combination analysis
After examining the proposed model in detail, we examine how each individual classification model in the late fusion contributes to the proposed model. Table 7 in the Appendix records the proportion of each modality class that was misclassified for each individual model as well as for the proposed late fusion model evaluated on the '2013' dataset. The best technique with the lowest proportion of misclassified images is highlighted in bold per modality class. Table 7 shows that each individual model has classes or modalities at which it is best, i.e. for which it has the lowest proportion of misclassified images. This suggests that the individual models differ in which modalities they are good at classifying. Hence, the individual models are good candidates for a combination with late fusion.
From the misclassifications of the proposed model, it can be observed that there are modalities where the proposed model has a lower misclassification rate than even the best individual model. The decrease in the misclassification rate is mostly due to the late fusion selecting the best prediction from the individual model predictions.
However, there are also images that the proposed late fusion model classified correctly but that were misclassified by all the individual models, as illustrated in Fig. 8. This suggests that there is a synergistic effect from the late fusion of the individual models.

Computational time analysis
The experiments were run with MATLAB r2017b on a machine equipped with an AMD Ryzen 1700 CPU and a single Nvidia GTX 1060 6 GB graphics card. The hardware configuration is relatively low-end, consumer-grade hardware compared with typical research machines, which is reflected in the computational times recorded in Tables 8 and 9.
Only the individual model computational times are shown, as the time spent on late score-based fusion is negligible. Thus, the computational time for a combined model is simply the sum of the individual models if run sequentially, or that of the slowest model if run in parallel. Tables 8 and 9 suggest that fine-tuned CNNs are faster than the BoVW implementation. There are several plausible reasons. Foremost, fine-tuning CNNs makes use of transfer learning, which provides a head start, whereas the BoVW model must be trained from scratch. The graphics card may also perform more computations per time unit than the CPU. Finally, the implementation of the BoVW is not very optimised, as it uses grid search instead of more advanced hyperparameter optimisation.
Besides this, Tables 8 and 9 also show some interesting patterns for CNN fine-tuning. The computational time increases with the complexity or number of layers of the CNN, as expected. Unexpectedly, the '2016' dataset, which is larger than the '2013' dataset, sometimes requires less computational time. This is due to the randomness inherent in SGDM, which affects how soon the early stopping criterion terminates training.

Conclusion
A late fusion model is proposed in this paper that combines deep learning models and traditional hand-crafted visual features for biomedical image modality classification. Transfer learning was used to mitigate the limited and imbalanced datasets. Specifically, fine-tuning the pre-trained CNNs AlexNet, VGG-16 and VGG-19 with early stopping was found to be effective relative to extracting CNN features for SVM classifiers. A traditional hand-crafted BoVW model using SIFT features was found to improve the performance of the combined classifier although it had worse individual performance. A relatively simple late fusion method with the score-based fusion operator 'combPROD' was found to be sufficient compared with more complex methods like stacked SVMs.
The proposed model outperforms or is similar to other visual methods on two separate but similar datasets, the ImageCLEF 2013 modality classification sub-task and the ImageCLEF 2016 subfigure classification sub-task as shown in Table 5. However, it still falls short of mixed methods that use both visual and text input.
Future work or improvements may include combining more CNN architectures or even training from scratch, but this also increases the computational time required and hence decreases efficiency. Other traditional hand-crafted features could be included in the combination as well. More complex combination methods could also be investigated. Finally, mixed methods incorporating textual input such as image captions can be explored.