MSA-GCN: A Multi-information Selection Aggregation Graph Convolutional Network for Breast Tumor Grading

Physicians typically combine multi-modal data to make a graded diagnosis of breast tumors. However, most existing breast tumor grading methods rely solely on image information, resulting in limited grading accuracy. This paper proposes a Multi-information Selection Aggregation Graph Convolutional Network (MSA-GCN) for breast tumor grading. Firstly, to fully utilize phenotypic data reflecting the clinical and pathological characteristics of tumors, an automatic combination screening and weight encoder is proposed for phenotypic data, which constructs a population graph with improved structural information. Then, a graph structure is designed through similarity learning to reflect the correlation between patient image features. Finally, a multi-information selection aggregation mechanism is employed in the graph convolution model to extract the effective features of multi-modal data and enhance the classification performance of the model. The proposed method is evaluated on different clinical datasets from the Digital Database for Screening Mammography (DDSM) and INbreast. The average classification accuracies are 90.74% and 85.35%, respectively, surpassing the performance of existing methods. In conclusion, our method effectively fuses image and non-image information, leading to a significant improvement in the accuracy of breast tumor grading.


Kang Li, Suya Han, Lei Yang, Zizhao Sun, Zhan Yu, Hongwei Xu, Ling Ma, Jianbo Gao, and Huiqin Jiang

I. INTRODUCTION
BREAST cancer is one of the most common malignancies in women, and its incidence is increasing year by year, with a trend toward younger patients [1]. Mammography is an important imaging technique for the early detection of breast cancer [2]. Radiologists classify mammograms according to the Breast Imaging Reporting and Data System (BI-RADS) [3]. To help radiologists and clinicians distinguish the fine-grained cases (BI-RADS 2, 3, 4, and 5) in mammography, this study provides an intelligent BI-RADS grading prediction method that can reduce doctors' workload and improve diagnostic efficiency.
BI-RADS grading is a generalized breast image assessment modality proposed by the American College of Radiology [4] that provides an estimate of the probability of malignancy of breast tumors. In this grading, grade 2 represents benign, grade 3 less malignant, grade 4 more malignant, and grade 5 highly malignant. The BI-RADS grading criteria are clinically important for the early diagnosis and treatment of breast cancer [5], [6], [7].
Artificial intelligence-aided medical imaging diagnosis has been a hot research topic in recent years, and breakthroughs have been made in the intelligent diagnosis of common tumors such as breast cancer [8]. For example, Li et al. improved the DenseNet network by deepening the network while reducing the model parameters to improve the performance of benign and malignant classification of mammogram images [9]. The ResNet-GAP designed by Ding et al. coupled the tumor localization task with the classification task to improve the classification of malignant breast tumors [10]. The MCUA developed by Senousy et al. used a context-aware module to capture the spatial dependencies between image patches and achieved high accuracy in the classification of benign and malignant breast histological images [11]. These methods take tumor images as input and take full advantage of the hierarchical pattern of Convolutional Neural Networks (CNNs) in processing the Euclidean structure of image data to obtain image features effectively [12]. However, existing diagnostic models are mainly designed for the benign and malignant classification of breast tumors and lack generality, which makes it difficult to meet the actual clinical demand for BI-RADS grading of breast tumors.
In actual clinical practice, clinicians grading breast tumors with the BI-RADS system not only focus on the breast's imaging information but also consider non-imaging phenotypic data. Common phenotypic data include patient age, tumor calcifications, tumor density, and more. These non-imaging factors are crucial for assessing the nature of the breast condition. For instance, breast masses in older women are more likely to be malignant [13], and high breast density is associated with breast cancer [14]. Integrating these non-imaging factors often enables clinicians to make more accurate BI-RADS grading diagnoses. Traditional deep learning methods can only handle single-modal information such as medical imaging and fail to meet the high accuracy requirements of BI-RADS grading [15]. Graphs enable effective end-to-end feature learning [16], and Graph Convolutional Networks (GCNs) can collect information from neighboring nodes in a convolutional manner based on the graph structure, which has been widely used for multimodal data processing [17], [18]. Considering the powerful expressive ability of GCNs in fusing multimodal data, we choose GCNs to fuse imaging data with non-imaging data to achieve BI-RADS grading of breast tumors. However, two issues need further consideration when applying GCNs to medical diagnosis: i) The construction of the population graph is flawed. GCNs are models for processing graph-structured data, and an input graph that accurately represents the correlations between subjects is crucial for GCNs to achieve strong classification performance [19], [20], [21], [22].
In medical multimodal data, there is no readily available graph structure that can be directly used, so existing studies construct population graphs from phenotypic data using hand-crafted rules [25]. These methods have validated the feasibility of using GCNs for disease prediction. However, their rules for defining population graphs are relatively simplistic, without delving into the impact of population graphs generated from different phenotypic data on model performance. It is necessary to define a novel approach for generating population graphs, enabling GCNs to be applied more effectively in the field of medical diagnosis. ii) Feature information extracted by GCNs is not fully utilized. GCN classification algorithms face the over-smoothing problem [26] in practical medical applications; that is, as the number of graph convolution layers increases, the performance of GCNs declines. Traditional CNNs can achieve better performance by introducing residual structures and continuously deepening the network to extract deeper features [27]. However, to avoid over-smoothing, GCNs typically use only 1 or 2 layers, which makes it difficult to extract deeper effective features. The JK-GCN proposed by Xu et al. introduces the jumping knowledge connection structure, which alleviates over-smoothing to some extent [28]. Although GCNs have achieved some performance improvements in disease classification and diagnosis after introducing the aggregation mechanism of jumping knowledge connections [29], it is still necessary to study appropriate aggregation strategies to obtain more structural information when facing practical medical problems. To address the challenges described above, we propose a Multi-information Selection Aggregation Graph Convolutional Network (MSA-GCN) that fuses multi-modal medical information to predict the BI-RADS grade of breast tumors. BI-RADS grading of breast tumors is a multi-classification problem, and we model the grading diagnosis as a semi-supervised node classification task on a graph [30], with medical image information and non-imaging phenotypic information represented as nodes and edges of the graph, respectively.
The main contributions and innovations of our work can be summarized as follows: 1) New population graph construction scheme: we propose an automatic combination screening and weight encoder, which employs statistical methods to perform combinatorial filtering of phenotypic data and subsequently assigns distinct weights to each combination.

II. RELATED WORK
With the recent attention on graph-structured data, many researchers have tried to extend convolution operations [31], which have been successful in computer vision, to graph-structured data. The graph convolution in the spatial domain can be transformed into a multiplication in the graph Fourier domain by the graph Fourier transform (GFT), analogous to the conversion of time-domain signals to frequency-domain signals in signal processing [32]. To reduce the computational cost, Chebyshev Graph Convolution (ChebyGConv) uses Chebyshev polynomials to approximate the spectral graph convolution [33] and implement local filtering. Furthermore, the GCN realizes layer-by-layer propagation by simplifying the graph convolution filter through a localized first-order approximation of spectral graph convolutions [34]. In addition, many non-spectral methods focus on the local topology of nodes and deal directly with graphs, such as graph attention networks (GAT) [35] and GraphSAGE [36]. Based on these methods, GCNs have been introduced into the medical field to process medical multi-modal data represented as graph-structured data.
Parisot et al. argue that graph-based approaches using semi-supervised learning should focus on both the individual characteristics of subjects and the pairwise similarities between patients in disease prediction [23]. They used GCNs that represent populations as sparse graphs, with graph nodes associated with imaging-based feature vectors and phenotypic information set as edge weights, to fully integrate multi-modal information for predicting autism spectrum disorder and Alzheimer's disease. The excellent performance proves the feasibility of graph neural networks for multi-modal medical data. The Edge-Variational GCN (EV-GCN) [29] automatically combines image data and non-image data into the population graph by introducing a pairwise association encoder (PAE), and is able to obtain associations between subjects based on different phenotypic data. Therefore, EV-GCN is superior to the model proposed in [23] in classification performance.
The above two networks assume that the selected phenotypic data contribute equally to the edge weights, which does not conform to the actual situation of clinical diagnosis. Therefore, in our previous research, we proposed an adaptive multi-layer aggregation GCN (AMA-GCN) to calculate the correlation between subjects based on the different contributions of different phenotypic data [37]. The AMA-GCN constructs the population graph by introducing a phenotypic measure selection and weighting encoder (PSWE). The designed encoder can automatically remove redundant phenotypic data based on their statistical properties. Although AMA-GCN has demonstrated its effectiveness in predicting the benign and malignant classification of breast tumors from multimodal data, its classification performance still falls short of the high accuracy requirements of the BI-RADS grading task. Our proposed MSA-GCN is specifically designed for accurate prediction of BI-RADS grading of breast tumors.

III. PROBLEM STATEMENT
In our study, breast tumor grading is modeled as a semi-supervised node classification task on a graph, with medical image data and phenotypic data from patients' clinical records as inputs to the model. All input information is collectively defined as an undirected population graph G = (V, E, A), where V denotes the set of nodes corresponding to the patient samples, with |V| patients in total; E denotes the set of edges, with |E| the number of edges in the population graph; and A denotes the adjacency matrix of the population graph G. The node feature matrix X ∈ R^{n×d} is extracted from the patients' imaging data, where n and d denote the number of samples and the dimensionality of the node features, respectively. Following the definition of the graph node classification task, the goal of our method is to infer the category labels of all nodes on the graph from the given labels of a subset of nodes.

IV. METHODOLOGY
Fig. 1 depicts the overall structure of the proposed MSA-GCN. Firstly, the node feature matrix X is extracted from the imaging data. The adjacency matrices A_p and A_s, characterizing patient relevance from phenotypic data and imaging data, are obtained by the CSWE and similarity learning, respectively; together with the graph node features they form two population graphs, G_P and G_S. Then, the two population graphs are fed into the graph convolution layers, which fully fuse the phenotypic and imaging information through the multi-information selection aggregation mechanism to achieve multi-class classification. The whole framework consists of three main parts. CSWE: calculates measurement scores for automatically recombined phenotypic data and encodes them into the corresponding edge weights. Similarity learning: obtains a graph structure reflecting the correlation between patient image features. Multi-information selection aggregation mechanism: fully extracts and fuses the deep information.
The next three subsections describe these three parts in detail.

A. Combined Phenotypic Measure Selection and Weight Encoder (CSWE)
The edges of a graph characterize the interconnection between a pair of nodes, and the different weights of the edges indicate the different degrees of correlation between pairs of nodes. The edges of the population graph can be obtained and encoded from phenotypic information. In references [21] and [27], the researchers employed simplistic definition rules to generate population graphs. According to these rules, if two patients share the same phenotypic value, it is assumed that an edge with a value of 1 connects them in the population graph. However, this approach leads to a substantial inclusion of irrelevant information. When phenotypic measurements are blindly selected for generating population graphs, connected edges arise between nodes of different categories within the population graph.
In our previous study, the PSWE proposed in AMA-GCN [37] introduced different screening rules for quantitative and non-quantitative phenotypic measures and achieved excellent performance in dichotomous disease diagnosis. For quantitative phenotypic measures (e.g., age, height), the PSWE completes the screening of phenotypic data by finding a threshold that separates the continuous values into two categories. However, it is difficult for the PSWE to design suitable rules that find thresholds for multiple categories simultaneously when extending dichotomous classification to multi-class classification. In such cases, it does not provide enough information to characterize the inter-subject connections for constructing a graph, which may even introduce too much noise and deteriorate the classification performance. Therefore, we improve on the PSWE and design the new CSWE to consider only non-quantitative phenotypic measures (e.g., margins or density), as shown in Fig. 2.
The designed CSWE consists of three steps, as follows. i) Combination: We assume that patients of the same type are associated with multiple phenotypic measures simultaneously, so we first combine the different phenotypic measures. Let there be a total of N node objects in the overall sample, classified into P categories, with an initial set K of h original phenotypic measures (e.g., margins, density, or view). After selecting e phenotypic measures from the set K for non-repetitive permutations, a set M of combined phenotypic measures can be obtained. For example, from Fig. 2, it can be observed that the combined phenotypic measure M_1 is derived from the combination of K_1 to K_e. ii) Screening: We then filter the combined phenotypic measures; in the following formulation, the letter n denotes the number of samples. The number of samples satisfying the requirements in each combined phenotypic measure M_c is defined as:

$$n_{M_c} = \sum_{u} \begin{cases} n_{M_c}^{u}, & \text{if } \max_{p} \dfrac{n_{M_c}^{p\_u}}{n_{M_c}^{u}} \geq \delta \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

where n_{M_c}^{p_u} represents the number of samples with the value u and category p in the combined phenotypic measure M_c; n_{M_c}^{u} represents the number of samples with the value u in the combined phenotypic measure M_c; and the threshold δ is a custom variable that measures the proportion of target samples meeting the requirements relative to the total samples when using the combined phenotypic measure M_c. If the number of target samples is too small, it indicates that the combined phenotypic measure M_c cannot effectively represent the differences between patients. Meanwhile, the values u in the value domain of the combined phenotypic measure M_c that satisfy (1) are stored in the set U_c. After the screening is completed, each combined phenotypic measure M_c has a corresponding sample count n_{M_c} and value-domain set U_c.
iii) Weighing: The weight score α_c corresponding to each combined phenotypic measure is obtained by normalizing n_{M_c}:

$$\alpha_c = \frac{n_{M_c}}{\sum_{c'} n_{M_{c'}}} \tag{2}$$

Next, we define γ to measure the distance between the values of the combined phenotypic measure M_c of two graph nodes:

$$\gamma\big(M_c(v_i), M_c(v_j)\big) = \begin{cases} 1, & \text{if } M_c(v_i) = M_c(v_j) \in U_c \\ \theta, & \text{otherwise} \end{cases} \tag{3}$$

where θ is a relatively small value that ensures the constructed population graph is not too sparse.
By screening and combining phenotypic information and assigning different weights, the population graph adjacency matrix A_p is finally defined as:

$$A_p(v_i, v_j) = \sum_{c} \alpha_c \, \gamma\big(M_c(v_i), M_c(v_j)\big) \tag{4}$$

In the subsequent experimental section, we design relevant experiments to validate the effectiveness of the proposed CSWE in constructing the population graph.
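To make the three CSWE steps concrete, the following is a minimal NumPy sketch of Eqs. (1)-(4). The function name cswe_adjacency, the data layout (one column per non-quantitative phenotypic measure), and the use of the dominant category in the screening test are our own illustrative assumptions, not the authors' released code.

```python
import itertools
import numpy as np

def cswe_adjacency(pheno, labels, e=2, delta=0.6, theta=0.1):
    """Sketch of CSWE: combine, screen, and weigh phenotypic measures.

    pheno:  (N, h) array of non-quantitative phenotypic values
    labels: (N,) array of category labels (used only for screening)
    e:      number of original measures per combination
    delta:  screening threshold on the dominant-category proportion
    theta:  small edge weight for non-matching node pairs (Eq. (3))
    """
    N, h = pheno.shape
    combos, counts, valid_values = [], [], []

    # i) Combination: every non-repeating choice of e measures forms an M_c
    #    whose value for a sample is the tuple of the e original values.
    for cols in itertools.combinations(range(h), e):
        values = [tuple(row) for row in pheno[:, list(cols)]]
        # ii) Screening (Eq. (1)): keep a value u only if one category
        #     accounts for at least a proportion delta of the samples with u.
        n_mc, U_c = 0, set()
        for u in set(values):
            idx = [i for i, v in enumerate(values) if v == u]
            _, cat_counts = np.unique(labels[idx], return_counts=True)
            if cat_counts.max() / len(idx) >= delta:
                n_mc += len(idx)
                U_c.add(u)
        if n_mc > 0:
            combos.append(values)
            counts.append(n_mc)
            valid_values.append(U_c)

    # iii) Weighing (Eq. (2)): normalize the per-combination sample counts.
    alpha = np.asarray(counts, dtype=float)
    alpha /= alpha.sum()

    # Eq. (4): accumulate alpha_c * gamma over all retained combinations.
    A_p = np.zeros((N, N))
    for values, a, U_c in zip(combos, alpha, valid_values):
        for i in range(N):
            for j in range(N):
                same = values[i] == values[j] and values[i] in U_c
                A_p[i, j] += a * (1.0 if same else theta)
    return A_p
```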

B. Similarity Learning
The inter-subject connections derived from imaging data should not be ignored, although the adjacency matrix A_p of the GCN relies mainly on the inter-subject connections obtained from phenotypic data. The topology obtained from the phenotypic data characterizes the similarity between graph nodes, but the similarity between the node features generated from the imaging data should not be ignored either [38]. The similarities obtained from the phenotypic data and the imaging data can complement each other to improve the representational power of the graph convolution model. As shown in Fig. 1, we exploit the inter-subject connections of imaging data through similarity learning. The goal of similarity learning is to obtain a k-nearest neighbor (KNN) graph G_s = (A_s, X), where X ∈ R^{n×d} is the node feature matrix and A_s is the adjacency matrix. The vector x_n represents the feature vector of size d for the n-th node, which is extracted from the breast tumor images using a standard ResNet-50 [27]. The feature extraction procedure is the same as that used in EV-GCN [29], where an EfficientNet [39] with the classification layer removed is employed to extract feature vectors from eye disease images. After obtaining the feature vectors, we employ recursive feature elimination [40] to further reduce their dimensionality. The adjacency matrix A_s can then be obtained adaptively from the feature matrix X.
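As a sketch of this feature-extraction step, the snippet below pairs a torchvision ResNet-50 with its classification layer removed and scikit-learn's recursive feature elimination; the estimator driving the RFE ranking (logistic regression) and the output dimension are assumptions, since the paper does not specify them.

```python
import torch
import torchvision.models as models
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

def extract_node_features(images, labels, out_dim=128):
    """images: (n, 3, 224, 224) tensor of preprocessed tumor ROIs;
    labels: (n,) array of grades, used only to fit the RFE ranking."""
    backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    backbone.fc = torch.nn.Identity()        # drop the classification layer
    backbone.eval()
    with torch.no_grad():
        feats = backbone(images).numpy()     # (n, 2048) feature vectors
    # Recursive feature elimination down to the node-feature dimension d.
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=out_dim)
    return rfe.fit_transform(feats, labels)  # (n, out_dim) node features X
```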
Firstly, the similarity matrix S ∈ R^{n×n} between the n nodes is calculated by the cosine similarity:

$$S_{ij} = \cos\big(Q^T x_i, \, Q^T x_j\big) \tag{5}$$

where cos(·) is the cosine similarity function; x_i and x_j denote the feature vectors of the i-th and j-th nodes; and Q ∈ R^{d×b} is a learnable weight matrix that performs dimensionality reduction on the node features to alleviate the excessive parameter cost of computing cosine similarity on high-dimensional features. Similar to GLCN [41], a dynamic graph structure can be obtained through similarity learning. The generated similarity matrix S is symmetric with elements in the range [-1, 1]. Only the values greater than zero in S are considered, since the adjacency matrix is non-negative, and the ε-neighborhood of each node is chosen, since the adjacency matrix is sparse. The asymmetric similarity matrix is defined as:

$$S^{knn} = \mathrm{KNN}\big(\mathrm{relu}(S)\big) \tag{6}$$

where relu(·) = max(0, ·) guarantees the non-negativity of S^{knn}, and KNN is the k-nearest neighbor algorithm, indicating that only the largest ε values in each row of the similarity matrix S are retained, which guarantees the sparsity of S^{knn}. The sparsification leads to an asymmetric matrix, and the final symmetric adjacency matrix A_s is defined as:

$$A_s = \frac{1}{2}\big(S^{knn} + (S^{knn})^T\big) \tag{7}$$

where (S^{knn})^T is the transpose of S^{knn}. The principle behind this symmetrization is that if v_i is one of the most similar nodes to v_j and v_j is also one of the most similar nodes to v_i, averaging the similarities maintains the initial degree of association between the two nodes. However, if v_i is among the most similar nodes to v_j but v_j is not among the most similar nodes to v_i, taking the average decreases the degree of correlation between the two nodes, and vice versa [42].
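A minimal PyTorch sketch of Eqs. (5)-(7) follows; the module name and the choice of ε (the number of neighbors kept per row) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityLearning(nn.Module):
    """Cosine similarity over projected features (Eq. (5)), relu plus
    per-row top-eps sparsification (Eq. (6)), symmetrization (Eq. (7))."""

    def __init__(self, d, b, eps=10):
        super().__init__()
        self.Q = nn.Parameter(torch.empty(d, b))   # learnable Q in R^{d x b}
        nn.init.xavier_uniform_(self.Q)
        self.eps = eps                             # neighbors kept per row

    def forward(self, X):                          # X: (n, d) node features
        Z = X @ self.Q                             # project before cosine
        S = F.cosine_similarity(Z.unsqueeze(1), Z.unsqueeze(0), dim=-1)
        S = F.relu(S)                              # drop negative similarities
        topv, topi = S.topk(self.eps, dim=1)       # eps-neighborhood per node
        S_knn = torch.zeros_like(S).scatter_(1, topi, topv)
        return 0.5 * (S_knn + S_knn.T)             # symmetric A_s
```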

C. MSA-GCN
The corresponding adjacency matrices A_p and A_s are obtained from phenotypic data and imaging data by the CSWE and similarity learning, respectively. Together with the node feature matrix X, they are used as input to the GCN. In order to make full use of the information aggregated by each graph convolutional layer while suppressing over-smoothing, the output of each graph convolutional layer is first Hop-Normalized [43], and then the multi-information selection aggregation mechanism is used to fully extract the deep effective information. The graph convolution structure is divided into a phenotypic-data part and an imaging-data part, with A_p and A_s as the input adjacency matrices, respectively.

Algorithm 1: MSA-GCN.
Input: X: node feature matrix; Y: node label matrix; L: set of labeled nodes; K: phenotypic data; β_i: trainable weighting coefficients; Q, W: trainable matrices.
Output: Z: logits.
1: A_p ← K via Eq. (4)
2: A_s ← {X, Q} via Eq. (7)
3: h^{(i)}_{A_p} ← {X, A_p}, i = 1, ..., 4, via Eqs. (8)-(9)
4: h^{(1)}_{A_s} ← {X, A_s} via Eqs. (10)-(11)
5: h^{(final)} ← aggregation of the normalized layer outputs via Eq. (13)
6: Z ← {h^{(final)}, W} via Eq. (14)
7: L ← {Y, L, Z} via Eq. (15)
8: Back-propagate L to update the model weights
1) Part of Population Graph G_P: This part uses a four-layer GCN, and the intermediate node representation output by each graph convolution layer is defined as:

$$h^{(l+1)} = \sigma\left(\sum_{k=0}^{K-1} \theta_{k/A_p} \, T_k(\tilde{L}) \, h^{(l)}\right), \quad h^{(0)} = X \tag{8}$$

where θ_{k/A_p} is the learnable filter parameter and L̃ = I − D^{−1/2} A_p D^{−1/2} is the symmetric normalized Laplacian of A_p, in which I is an identity matrix and D is a diagonal matrix whose diagonal elements are the row sums of the adjacency matrix A_p. T_k(L̃) = 2 L̃ T_{k−1}(L̃) − T_{k−2}(L̃) is the Chebyshev polynomial defined recursively, with T_0(L̃) = I and T_1(L̃) = L̃.
The intermediate representations from each layer are normalized by the Frobenius norm and multiplied by a learnable weight factor β_i for each layer:

$$\hat{h}^{(i)}_{A_p} = \beta_i \frac{h^{(i)}_{A_p}}{\|h^{(i)}_{A_p}\|_F}, \quad i = 1, \ldots, 4 \tag{9}$$

2) Part of Population Graph G_S: Analogous to the dominant population graph G_P part, the supplementary population graph G_S part uses a one-layer GCN whose output is defined as:

$$h^{(1)}_{A_s} = \sigma\left(\sum_{k=0}^{K-1} \theta_{k/A_s} \, T_k(\tilde{L}_{A_s}) \, h^{(0)}\right), \quad h^{(0)} = X \tag{10}$$

where L̃_{A_s} is the symmetric normalized Laplacian of A_s and θ_{k/A_s} is the learnable filter parameter. The output is also normalized by the Frobenius norm and multiplied by a learnable weight factor:

$$\hat{h}^{(1)}_{A_s} = \beta_5 \frac{h^{(1)}_{A_s}}{\|h^{(1)}_{A_s}\|_F} \tag{11}$$

The whole graph convolution structure uses five GCN layers in total, requiring five learnable weighting factors β_i, which satisfy:

$$\sum_{i=1}^{5} \beta_i = 1 \tag{12}$$

3) Aggregation Layer: The purpose of the aggregation layer is to aggregate the multiple intermediate node representations obtained from the population graphs G_P and G_S; the final node representation h^{(final)} is defined as:

$$h^{(final)} = g\left(\hat{h}^{(1)}_{A_p}, \ldots, \hat{h}^{(4)}_{A_p}, \hat{h}^{(1)}_{A_s}\right) \tag{13}$$

where g(·) denotes the corresponding aggregation, which is average pooling in this paper.
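The sketch below assembles the pieces of Eqs. (8)-(13) in PyTorch: a dense Chebyshev layer, four layers on G_P and one on G_S, Frobenius normalization, learnable β coefficients, and average pooling. The Chebyshev order K = 3, the hidden width, and the use of a softmax to realize the sum-to-one constraint of Eq. (12) are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def norm_laplacian(A):
    """Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2} (dense)."""
    d = A.sum(1).clamp(min=1e-8).pow(-0.5)
    return torch.eye(A.size(0), device=A.device) - d[:, None] * A * d[None, :]

class DenseChebLayer(nn.Module):
    """One Chebyshev graph convolution, T_k recursion as in Eq. (8)."""
    def __init__(self, c_in, c_out, K=3):
        super().__init__()
        self.theta = nn.ModuleList(nn.Linear(c_in, c_out, bias=False)
                                   for _ in range(K))

    def forward(self, L, h):
        Tk_prev, Tk = h, L @ h                        # T_0 h and T_1 h
        out = self.theta[0](Tk_prev) + self.theta[1](Tk)
        for k in range(2, len(self.theta)):
            Tk_prev, Tk = Tk, 2 * (L @ Tk) - Tk_prev  # T_k = 2 L T_{k-1} - T_{k-2}
            out = out + self.theta[k](Tk)
        return F.relu(out)

class MSAggregator(nn.Module):
    """Four layers on G_P, one on G_S; Frobenius-normalized outputs weighted
    by learnable beta and averaged, i.e. the g(.) of Eq. (13)."""
    def __init__(self, d, hidden):
        super().__init__()
        dims = [d] + [hidden] * 4
        self.gp_layers = nn.ModuleList(DenseChebLayer(dims[i], dims[i + 1])
                                       for i in range(4))
        self.gs_layer = DenseChebLayer(d, hidden)
        self.beta = nn.Parameter(torch.ones(5))       # selection coefficients

    def forward(self, X, A_p, A_s):
        beta = torch.softmax(self.beta, dim=0)        # enforce sum_i beta_i = 1
        L_p, L_s = norm_laplacian(A_p), norm_laplacian(A_s)
        outs, h = [], X
        for layer in self.gp_layers:
            h = layer(L_p, h)
            outs.append(h / h.norm(p='fro'))          # Eq. (9) normalization
        h_s = self.gs_layer(L_s, X)
        outs.append(h_s / h_s.norm(p='fro'))          # Eq. (11) normalization
        return sum(b * o for b, o in zip(beta, outs)) / len(outs)  # avg pool
```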
4) Loss Computation: The final node representation h^{(final)} is fed into a Multi-Layer Perceptron (MLP), and the output Z is calculated as:

$$Z = \mathrm{softmax}\big(h^{(final)} W\big) \tag{14}$$

where W is the trainable weight matrix of the MLP. Z ∈ R^{n×p}, obtained from the softmax activation function, represents the diagnostic prediction matrix for all patients, and each row of Z corresponds to the predicted outcome of one patient. We optimize the learnable parameters of the entire model by gradient back-propagation, minimizing the cross-entropy loss:

$$\mathcal{L} = -\sum_{i \in L} \sum_{j=1}^{p} Y_{ij} \ln Z_{ij} \tag{15}$$

where L indicates the set of labeled training nodes and Y_{ij} represents the labels of these training nodes. The process of each part of the proposed MSA-GCN has been described above, and its pseudo-code is shown in Algorithm 1. The MSA-GCN can fully extract and fuse the deep effective information from imaging and non-imaging data and improve the accuracy of multi-class classification.
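A hedged sketch of one training step under Eqs. (14)-(15), assuming labels are stored as class indices; note that F.cross_entropy applies the softmax of Eq. (14) internally, so the MLP emits raw logits:

```python
import torch.nn.functional as F

def train_step(model, mlp, optimizer, X, A_p, A_s, Y, labeled_idx):
    """One step: final representation -> MLP logits -> cross-entropy over
    the labeled nodes only (Eq. (15))."""
    model.train()
    optimizer.zero_grad()
    h_final = model(X, A_p, A_s)           # MSAggregator from the sketch above
    Z = mlp(h_final)                       # raw logits, shape (n, p)
    loss = F.cross_entropy(Z[labeled_idx], Y[labeled_idx])
    loss.backward()
    optimizer.step()
    return loss.item()
```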

V. EXPERIMENTS

A. Dataset
To validate the effectiveness of the proposed MSA-GCN for breast tumor grading prediction, we evaluate it on the DDSM [44] and INbreast [45] databases. The DDSM is the largest known public database of mammography data, containing the results of 2470 patients together with some clinical information for studying the Breast Imaging Reporting and Data System (BI-RADS) classification of breast tumors. Since the original medical images in the DDSM database suffer from problems such as blurred imaging and inconsistent labeling standards, 842 high-quality mammography images were screened and organized into 512×512-pixel PNG format by professional software. These medical images were obtained from 276 cases, including 26 cases of B-2, 64 cases of B-3, 97 cases of B-4, and 89 cases of B-5, to investigate the more detailed issue of breast tumor grading. The collected phenotypic metrics include margins, density, shape, and five others. The INbreast mammography database collected a total of 115 patients. Ninety of these patients had two views of each breast (MLO and CC), and the remaining 25 patients included only two views of one breast due to mastectomy. Based on the ROIs, we filtered 211 images, removing the B-1 and B-6 cases, leaving 100 cases of B-2, 21 cases of B-3, 43 cases of B-4, and 47 cases of B-5. The collected phenotypic metrics include acr, view, and laterality. The phenotypic information in the two datasets is derived from physicians' clinical annotations of patients, and the names of the phenotypic measures follow the respective datasets. For example, "density" in the DDSM dataset and "acr" in the INbreast dataset both represent the patient's breast density annotation. Explanations of the phenotypic measures in these two datasets can be found in [44] and [45], respectively.

B. Baseline Methods
We conduct a fair comparison experiment between the proposed MSA-GCN model and the following baseline methods.These baseline methods include both methods that use only imaging data and methods that use both imaging data and non-imaging data.
ResNet-50 [27]: A single-modality classification approach that introduces residual structures into CNNs and uses only image information.
GCN [23]: A model that first introduced graph neural networks into the field of disease prediction, which is capable of processing non-Euclidean diagnostic information.
JK-GCN [28]: A model that applies the jumping knowledge connection structure to graph neural networks, using an aggregation layer to integrate information.
GLCN [41]: A model that integrates graph structure learning and graph convolution into a unified network, capable of dynamically adapting the graph structure.
EV-GCN [29]: A model that uses a learnable encoder to automatically obtain associations between subjects from non-imaging phenotypic data.
AM-GCN [38]: A model that uses a multi-channel graph neural network to simultaneously fuse node features and topological structure information.
AMA-GCN [37]: A model that uses a PSWE to filter non-imaging data and encode the corresponding edge weight scores, and that fully fuses imaging and non-imaging information through a multi-layer aggregation mechanism.

C. Implementation Details
We implement the proposed architecture using PyTorch [46] and Keras [47]. Keras is used to build the ResNet-50 feature extraction network that extracts the corresponding feature vectors from the medical image data of the DDSM and INbreast datasets, respectively, as node inputs to the GCN. PyTorch is used to build the proposed MSA-GCN and the other comparative GCNs. All models use Adam [48] as the optimizer to update the network parameters during training, minimizing the multi-class cross-entropy loss function. Some of the baseline methods do not process phenotypic data, but all of these models require manual construction of the population graph from phenotypic data, and graph structures generated from different phenotypic data affect the classification performance of the model differently. To be fair, we chose the best phenotypic metric, "margins" for the DDSM dataset and "acr" for the INbreast dataset, while both AMA-GCN and our proposed MSA-GCN use all phenotypic data. The hyperparameters of the experiments are shown in Table I. We evaluate the classification performance of the model using ten-fold cross-validation, where 90% of the samples are used as the training set and the remaining 10% as the test set. The accuracy (ACC), sensitivity, specificity, and F1-score are used as evaluation indicators. The accuracy is calculated as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + FN + FP + TN} \tag{16}$$

where TP indicates a true positive prediction, TN a true negative, FN a false negative, and FP a false positive.
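As a sketch, the four metrics can be computed per BI-RADS grade in a one-vs-rest fashion and macro-averaged; the macro-averaging choice is our assumption, since the paper only names the four indicators.

```python
import numpy as np

def grading_metrics(y_true, y_pred, num_classes=4):
    """Eq. (16) plus one-vs-rest sensitivity/specificity/F1 per BI-RADS
    grade, macro-averaged over the four grades."""
    sens, spec, f1 = [], [], []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        tn = np.sum((y_pred != c) & (y_true != c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        sens.append(tp / max(tp + fn, 1))
        spec.append(tn / max(tn + fp, 1))
        prec = tp / max(tp + fp, 1)
        f1.append(2 * prec * sens[-1] / max(prec + sens[-1], 1e-8))
    acc = np.mean(y_true == y_pred)          # Eq. (16) over all samples
    return acc, np.mean(sens), np.mean(spec), np.mean(f1)
```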

D. Analysis of Comparison Results
We compare the performance of the proposed MSA-GCN with the other baseline methods on the multi-grade diagnostic task for breast tumors using the DDSM and INbreast databases.
1) Quantitative Results: The results of the quantitative comparison experiments on the two benchmark datasets are shown in Table II. It is clearly observed that the single-modality classification approach ResNet-50, which utilizes only medical imaging data, achieves the lowest classification accuracies of 62.44% and 57.38% on the DDSM and INbreast databases, respectively.

TABLE II QUANTITATIVE COMPARISONS OVER TWO BENCHMARK DATASETS
This is due to the inherent challenge of BI-RADS grading for breast tumors. In contrast, the graph-based approaches that fuse multi-modal information (imaging and non-imaging data) demonstrate improved performance. Taking the basic GCN as an example, its accuracies on the DDSM and INbreast databases are 72.21% and 83.42%, respectively, which are 9.77% and 26.04% higher than those achieved by ResNet-50. This clearly highlights the effectiveness of incorporating both imaging and non-imaging clinical information for disease diagnosis. Although GCN, JK-GCN, GLCN, EV-GCN, and AM-GCN are all graph neural network-based models, they treat all phenotypic data equally when constructing the input population graph, without considering that different phenotypic data have different effects on classification performance. Therefore, the results of these methods are improved but still inferior to those of AMA-GCN and MSA-GCN. AMA-GCN takes into account the differing validity of phenotypic metrics for classification; it filters the phenotypic data and encodes different weight scores to construct the population graph. However, AMA-GCN does not consider the relationships between phenotypic metrics, and when there are fewer types of phenotypic metrics it cannot fully capture the complementary information between them, which limits its performance improvement on multi-classification tasks. The proposed MSA-GCN model fully considers the complexity of multi-classification by combining different phenotypic metrics before filtering them and encoding them into weight scores. The MSA-GCN achieved the best results on both the DDSM and INbreast databases, with accuracies of 90.74% and 85.35%, respectively, and its results on the other three evaluation indicators (sensitivity, specificity, and F1-score) are also significantly better than those of the baseline models. For the INbreast database, due to the small sample size and the fact that only three phenotypic data are provided, the screening of phenotypic data by AMA-GCN did not achieve a significant improvement. In contrast, the proposed MSA-GCN model is more adaptive because it combines phenotypic data and can achieve effective screening even when phenotypic data are few. In the experiments on the INbreast dataset, the accuracy of MSA-GCN is 1.93% higher than that of AMA-GCN. The validity of the combined phenotypic data is described in detail in a later subsection.
2) Visualization Results: To visually demonstrate the superior classification performance of the proposed MSA-GCN compared to other baseline methods, we employ t-SNE [49], an efficient nonlinear dimensionality reduction algorithm, to visualize the classification results by projecting the high-dimensional node representations learned by the neural network into a two-dimensional space. In a fair evaluation setting, we use the DDSM dataset as an example and present the t-SNE visualizations of GCN, GLCN, AMA-GCN, and the proposed MSA-GCN in Fig. 3(a)-(d), respectively. Points of different colors represent different categories, and it is evident that MSA-GCN exhibits the best visual representation, with similar points clustering together and dissimilar points separating from each other. This visual experiment validates that the proposed approach generates more discriminative node representations, leading to improved classification performance.
3) Analysis of Training Set Ratio: Generally speaking, when training deep learning models, the more training samples there are, the stronger the performance of the learned model will be [50]. However, in the field of medical deep learning, obtaining a large amount of labeled training data is challenging [51], so it is necessary to evaluate the performance of the model with a small training-set percentage. As shown in Fig. 4, we evaluate the performance difference between the proposed MSA-GCN and the baseline methods GCN and AMA-GCN under different training set sample ratios on the DDSM and INbreast datasets, respectively. Specifically, the sample ratio of the training set is increased from 10% to 80% of the total number of samples, and the remaining samples are divided into the validation and test sets, following the semi-supervised learning training approach. The results in Fig. 4 show that as the training set sample ratio decreases, all models experience varying degrees of performance degradation, but the proposed MSA-GCN still clearly achieves better classification results than the other baseline methods. Even with limited labeled training samples, our method exhibits more stable performance.

E. Influence of The Combined Phenotypic Data
To illustrate the effect of different phenotypic data on classification performance and demonstrate the effectiveness of combining phenotypic data, we assess the model's performance using single-phenotype graph configurations and combined-phenotype graph configurations on the DDSM and INbreast databases, respectively. For the DDSM database, the original seven non-quantitative phenotypic data (excluding age) are directly employed to construct population graphs independently; we then construct population graphs from the combined phenotypic data. Only the seven combined phenotypic data with the highest classification accuracy are shown in Fig. 5(a). The graph built from the combined phenotypic data "margins+pathology" achieves the highest accuracy of 78.98%. The accuracy achieved by almost all combined phenotypic data is greater than that of the original phenotype "margins", except for "margins+left_right". In Fig. 5(b), the INbreast database exhibits similar outcomes. The graph constructed from the combined phenotypic data "acr+laterality" achieves a classification accuracy of 84.39%, surpassing the highest accuracy of 83.42% attained by the original "acr" phenotype. These results directly substantiate the effectiveness of combining the original phenotypic data.

A class of diseases may be associated with multiple phenotypic measures simultaneously. The combination of different phenotypic data can enrich the value of the original phenotypic data on the one hand, and strengthen the correlation between the associated phenotypic measures and the disease on the other, thus better fitting the complexity of multi-classification. The proposed CSWE combines multiple different phenotypic measures into new combined phenotypic data, where the number of phenotypic measures selected for combination affects classification performance. The effect of the number of combined phenotypic measures in the MSA-GCN method on classification accuracy is evaluated on the DDSM dataset, which has more phenotypic data categories. The classification results of selecting 3-6 phenotypic measures for non-repeated permutations are shown in Fig. 6. As can be seen from the figure, the accuracy of MSA-GCN gradually increases as the number of phenotypic measures increases from 3 to 5. This indicates that the larger the number of phenotypic metrics used for combination, the more valid phenotypic information is extracted. The results for the combination of six phenotypic measures indicate that combining too many phenotypes may introduce redundant information that interferes with classification.

F. Analysis of The Effectiveness of Feature Aggregation Strategy
We propose a selective aggregation strategy to dynamically fuse the outputs of distinct graph convolutional layers. To validate its efficacy, we conduct a comparative analysis between the introduced aggregation strategy and four prevalent feature fusion techniques (a minimal sketch of all four appears after this list):
Add: Directly adds the output features from different graph convolutional layers.
Concat: Concatenates the output features from diverse graph convolutional layers.
Maxpool: This strategy retains the highest values from the output features of various graph convolutional layers.
Avgpool: Applies average pooling to the output features from different graph convolutional layers.
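A minimal sketch of the four baseline fusion operations listed above; outs is assumed to be the list of per-layer output matrices.

```python
import torch

def fuse(outs, mode):
    """Baseline fusion of per-layer outputs (list of (n, hidden) tensors)."""
    stacked = torch.stack(outs)              # (num_layers, n, hidden)
    if mode == "add":
        return stacked.sum(0)
    if mode == "concat":
        return torch.cat(outs, dim=-1)       # widens the feature dimension
    if mode == "maxpool":
        return stacked.max(0).values
    if mode == "avgpool":
        return stacked.mean(0)
    raise ValueError(mode)
```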
The outcomes of the comparison, as presented in Table III, clearly indicate the efficacy of our designed selective aggregation strategy in enhancing classification performance.By introducing adaptive selection coefficients to the output features of distinct graph convolutional layers, we bolster the feature extraction capability of graph convolutional neural networks.This, in turn, yields more discerning aggregated features, thus facilitating subsequent classification tasks.

G. Ablation Study and Analysis
To investigate the impact of the CSWE, similarity learning, and the multi-information selection aggregation layer on the performance of the proposed MSA-GCN model, we conduct an ablation study by creating variants of the MSA-GCN. In each variant, we selectively remove one of the three aforementioned components from the original MSA-GCN, resulting in the following models:

MSA-GCN(noC): This model omits the CSWE responsible for combining and filtering phenotypic data. The remaining components of the MSA-GCN are retained, and the input population graph structure is the same as in the baseline method [23].

VI. DISCUSSION
The method based on graph convolutional networks (GCNs) effectively integrates multimodal medical information, resulting in significantly improved diagnostic prediction outcomes compared to traditional convolutional neural networks that rely solely on single-image information. Constructing a population graph and effectively utilizing the outputs of different graph convolutional layers are two major challenges when applying graph convolutional networks to medical diagnosis. To address these challenges, we propose a novel approach based on a non-imaging information fusion mechanism for generating population graphs, along with an adaptive strategy to aggregate outputs from different graph convolutional layers. Extensive experimental results validate the effectiveness of our proposed innovative modules, and our model proves effective for breast tumor grading tasks.
However, our study also exhibits limitations. Firstly, similar to most methods [23], [29], [37] that predefine population graphs, our proposed model is based on a static graph setting. Under such circumstances, our approach excels in semi-supervised node classification tasks but cannot directly extend to samples not encountered during training, in contrast to graph convolutional networks based on graph structure learning, which do not face this issue [41]. Secondly, throughout the model design, we primarily focus on enhancing the graph convolutional network structure. We simply employ a deep convolutional neural network to extract node features from image data, without considering the potential impact of different feature extraction networks on the classification performance of the graph convolutional network. In subsequent research, we will delve into designing dynamic graph-based graph convolutional networks and also examine how various image feature extraction networks affect diagnostic accuracy.

VII. CONCLUSION
In this study, we introduce an innovative approach known as MSA-GCN, which integrates diverse medical data modalities to enhance the precision of breast tumor BI-RADS grading. To address the complexities of multi-classification, we design a combined screening mechanism for phenotypic data that reflects the clinicopathological characteristics of tumors. This mechanism calculates measurement scores for the combined phenotypic data and encodes them into corresponding edge weights, thereby establishing a graph structure that characterizes patient correlations. Subsequently, a graph structure reflecting inter-node correlations is established through similarity learning. Finally, we propose a multi-information selection aggregation mechanism to enhance the feature fusion ability of the graph convolutional network model. Empirical evaluations on both the DDSM and INbreast datasets affirm the efficacy of the proposed MSA-GCN in adeptly amalgamating multi-modal medical information and elevating the precision of breast tumor multi-categorization. Currently, the MSA-GCN model, capable of efficiently integrating multimodal information, has only been validated on breast tumor datasets. Our future plans encompass expanding its applicability to other disease classifications involving a combination of imaging and non-imaging phenotypic data, aiming to validate its generalizability. Moreover, the model's utility will be verified on real-world clinical data.

Fig. 1. Overview of the proposed method. CSWE: combined phenotypic measure selection and weight encoder. Ap: adjacency matrix obtained from phenotypic data. As: adjacency matrix obtained from node similarity. GC: graph convolution. AL: aggregation layer. MLP: multilayer perceptron.

Fig. 2. Overview of the proposed CSWE. K: the set of non-imaging phenotypic measures. M: the set of combined non-imaging phenotypic measures.

Fig. 4. Performance comparison for different ratios of training set samples.

Fig. 5. Effects of population graphs constructed from different phenotypic measures and combined phenotypic measures on the classification accuracy for different datasets. The boxplots report results computed from 10-fold cross-validation.

Fig. 6. Effect of the number of original phenotypic data contained in the new combined phenotypic data on classification accuracy in the DDSM database. Results are computed from 10-fold cross-validation.
MSA-GCN(noS): This model removes the part of MSA-GCN related to the similarity learning of node features. MSA-GCN(noM): This model eliminates the multi-information selection aggregation part of MSA-GCN and replaces it with a two-layer GCN. The results of the ablation experiments on the DDSM and INbreast databases are shown in Fig. 7(a) and (b), respectively. It can be observed that the classification performance of MSA-GCN(noC), MSA-GCN(noS), and MSA-GCN(noM) all exhibit a decline compared with the complete MSA-GCN.

TABLE III COMPARISON RESULTS OF FEATURE FUSION METHODS