Minimum Entropy Principle Guided Graph Neural Networks

Graph neural networks (GNNs) are now the mainstream method for mining graph-structured data and learning low-dimensional node- and graph-level embeddings to serve downstream tasks. However, limited by the interpretability bottleneck of deep neural networks, existing GNNs have ignored the issue of estimating the appropriate number of dimensions for these embeddings. Hence, we propose a novel framework called Minimum Graph Entropy principle-guided Dimension Estimation, i.e., MGEDE, that learns the appropriate embedding dimensions for both node and graph representations. For node-level estimation, a minimum entropy function that accounts for both structure and attribute entropy appraises the appropriate number of dimensions. For graph-level estimation, each graph is assigned a customized embedding dimension from a candidate set based on the dimensions estimated for the node-level embeddings. Comprehensive experiments with node and graph classification tasks on nine benchmark datasets verify the effectiveness and generalizability of MGEDE.


INTRODUCTION
GNNs are currently the most popular graph mining methods for learning low-dimensional node or graph embeddings to serve downstream machine learning tasks, such as classification [14,17,44], clustering [4,36,51], and regression [12,33]. The relevant theories have already been applied to a range of real-world applications [16,43,49]. Here, the general rule is that the dimensionality of the embeddings affects the quality of the encoded semantics and the ultimate performance of the GNN. A small number of dimensions will typically result in semantic loss, while a large number of dimensions will lead to overfitting and issues with computational inefficiency [23,28,47]. Hence, estimating the proper dimensionality of the embeddings produced is a crucial part of harnessing the power of a GNN, yet one that has seldom been studied.
Estimating this critical parameter must currently be done manually, and two challenges confront such a guess. First, current theoretical research [22,26] on GNNs focuses on how to embed structural and attribute information; the issue of how to estimate an appropriate embedding dimension has not been addressed. In practice, practitioners tend to select the dimensionality through a manual enumeration search, but such grid searches are time-consuming and computationally expensive. The whole process is inefficient, and black-box explainability issues can be confounding. Fig. 1 (A) illustrates the problem. That said, a few studies on word embeddings have been published where a suitable dimensionality is estimated via bias-, variance-, or entropy-based metrics [10,35,41,47]. Further, Luo et al. [20] used these metrics to estimate a suitable dimensionality for the embeddings of the nodes in a graph. The downside of this technique, however, is that it overlooks the rich structural information in graphs. Second, different graphs have different proper embedding dimensions, yet current graph-level GNNs ignore this diversity and encode all graphs with a unified embedding dimension (see Fig. 1 (B)). Hence, enabling GNNs to encode graphs with a range of different embedding dimensions for graph-level tasks is the second challenge.

Figure 1: In (A), a grid search method runs a GNN once on each of seven embedding dimensions and then selects the dimensionality with the highest accuracy, which may differ from the real optimal dimension. By contrast, MGEDE estimates the proper dimensionality without needing to run a GNN. In (B), a GNN for graph classification generates d-dimensional representations for all N graphs with a unified dimensionality d, which ignores the differences between the graphs.

Motivated by the minimum entropy principle [11], which favors systems with minimum levels of uncertainty, we propose a novel framework called Minimum Graph Entropy Principle-guided Dimension Estimation, or MGEDE for short. MGEDE is designed to estimate the appropriate embedding dimensionalities for graph-structured data and is highly applicable to GNNs. The framework comprises a node-level embedding dimension estimator (NDE) and a graph-level embedding dimension estimator (GDE). The NDE includes a minimum graph entropy function that simultaneously models attribute and structure entropy. Multi-order topological structures are captured to solve for the appropriate number of dimensions for all node embeddings. The uncertainty within the node embeddings under different dimensionalities is approximately measured as attribute entropy based on the distributional hypothesis [29]. Meanwhile, an encoding tree [15], which naturally forms a hierarchical partition of the graph, is used to calculate the structure entropy. The GDE module includes a new assignment mechanism that assigns each graph a customized embedding dimension from a candidate set. The candidate set is assembled from the node-level embedding dimensions estimated by the NDE. Further, to embed graphs into diversely customized dimensional spaces, we built a new training framework targeting graph-level embeddings from GNNs. For those who wish to reproduce our work, MGEDE is available at https://github.com/MGEDE.
The main contributions of this article include:
• A novel embedding dimension estimation framework called MGEDE, which is based on the theoretical minimum entropy principle. MGEDE estimates a suitable number of dimensions for node- and graph-level embeddings, supporting GNNs in delivering competitive performance on downstream tasks.
• A new structure entropy that measures the complexity of a graph's structure by capturing the graph's multi-order topological information.
• A new assignment mechanism that assigns each graph embedding a customized number of dimensions from a candidate set.
• Extensive experiments demonstrating that MGEDE supports GNNs and network embedding algorithms in delivering promising performance on node and graph classification tasks.

DEFINITIONS
Definition 1. (Graph). A graph G = (V, E) comprises a node set V and an edge set E. For G, the adjacency matrix is A ∈ R^{n×n}, and X ∈ R^{n×f} is the node attribute matrix. If A_{i,j} = 1 implies A_{j,i} = 1 for every pair (i, j), the graph is an undirected graph; otherwise, it is directed.

Definition 2. (Minimum Entropy). Given a variable containing multiple states, codewords are used to depict each state. In this case, Shannon entropy [31] denotes the lower bound of the average length of the codewords, which is H = − Σ_i p_i log p_i, where p_i is the probability that the i-th state occurs. Hence, the graph entropy equals the average length of the codewords used to describe the graph. Minimizing the graph's entropy is equivalent to seeking the minimum length of the graph's description. Correspondingly, a graph that can be described by briefer and shorter codewords has less uncertainty.

Definition 3. (Structure Entropy). Structure entropy is a metric for measuring the complexity of a graph's topology [15]. It represents the average length of the codewords, where each codeword depicts a random walk on the graph. Under a specific encoding scheme, structure entropy assigns a prefix codeword to each hierarchy (e.g., community) in the graph to shorten the average length of the codewords. For example, the codeword of a random walk visiting any node in a community can be shortened through the prefix codeword of that community. Using prefix codewords, we only need to encode the in-community nodes; the out-of-community nodes can be ignored. The graph G = (V, E) is then encoded into a three-layer tree T (as shown in step 4 of Fig. 2), in which the root node λ denotes the whole graph, the tree nodes in the second layer denote the communities of the graph, and the nodes in each community comprise the third layer (i.e., the leaf nodes of the tree). With the adjacency matrix A representing the structure of G, the structure entropy is defined as:

H(A) = − Σ_{α∈T, α≠λ} (g_α / V(λ)) · log( V(α) / V(α⁺) ),   (1)

where α is a non-root tree node with father tree node α⁺, and V(α) is the degree summation of all leaf nodes in the subtree rooted at α. For the root node λ, V(λ) is the degree summation of all leaf nodes in T. g_α represents the number of edges that have only one end node in the subtree rooted at α. The probability of a random walk visiting tree node α is V(α)/V(λ), and −log( V(α)/V(α⁺) ) is equal to the length of the codeword that encodes α within α⁺.
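For illustration, Eq. (1) can be sketched in a few lines of Python for a three-layer encoding tree. The toy graph (two triangles joined by a single bridge edge) and all function names below are ours, not part of the original implementation:

```python
import math

def structure_entropy(adj, communities):
    """Structure entropy of an undirected graph under a three-layer
    encoding tree (Definition 3): root -> communities -> nodes.

    adj: symmetric 0/1 adjacency matrix (list of lists).
    communities: lists of node indices partitioning the graph.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    vol_root = sum(deg)  # V(lambda): degree sum of all leaves = 2|E|
    h = 0.0
    for comm in communities:
        members = set(comm)
        vol = sum(deg[u] for u in comm)  # V(alpha): degree sum of subtree
        # g: edges with exactly one end node inside the community
        g = sum(adj[u][v] for u in comm for v in range(n) if v not in members)
        # community term: -(g / V(lambda)) * log2(V(alpha) / V(lambda))
        h -= g / vol_root * math.log2(vol / vol_root)
        # leaf terms: -(deg(u) / V(lambda)) * log2(deg(u) / V(alpha))
        for u in comm:
            if deg[u] > 0:
                h -= deg[u] / vol_root * math.log2(deg[u] / vol)
    return h

# Two triangles joined by one edge: partitioning into the two natural
# communities yields a shorter description than one flat community.
adj = [[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
h_split = structure_entropy(adj, [[0, 1, 2], [3, 4, 5]])
h_flat = structure_entropy(adj, [[0, 1, 2, 3, 4, 5]])
```

Note that with a single community the expression collapses to the Shannon entropy of the degree distribution, and a good community partition strictly shortens the average codeword length (h_split < h_flat).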

METHODOLOGY
MGEDE comprises two methods: NDE, which estimates the dimensionality of the node-level embeddings, and GDE, which estimates the dimensionality of the graph-level embeddings. NDE can estimate appropriate dimensionalities for both undirected and directed graphs (see section 3.1 and section 3.2, respectively). The GDE method estimates the candidate set of dimensionalities for the graph-level embeddings and selects the best fit for each graph (see section 3.3). A time complexity analysis of each method can be found in section 3.4.

Node Representation Dimension Estimator
Motivated by the minimum entropy principle (see Definition 2), NDE estimates the appropriate dimensionality of the node-level embeddings by minimizing the graph entropy H, which is defined as:

H = H_attr + H_str,   (2)

where H_attr denotes the attribute entropy and H_str denotes the structure entropy. Our explanation of how H is computed begins with a detailed account of how H_attr and H_str are calculated. A function over the dimensionality d is then used to approximate H. The appropriate dimensionality is the value of d that results in the minimum H.

Attribute entropy.
Treating all nodes in the graph as isolated units, attribute entropy measures the amount of uncertainty present in the collection of all node attributes. We begin by assuming that each node has a d-dimensional vectorized embedding that encodes the node's attributes. Referring to previous studies on computing the entropy of a set of word embeddings [35,47], we take the pair-wise inner product among the assumed node embeddings as the basic unit for calculating H_attr, formally:

z_ij = ⟨v_i · v_j⟩,   (3)

where v_i and v_j represent the embeddings of nodes i and j, and ⟨ · ⟩ is the dot product. Denoting the probability of the basic unit z_ij as p_ij, the attribute entropy is calculated as:

H_attr = − Σ_{i≠j} p_ij log p_ij.   (4)

Using n as the variable that represents the number of nodes in the graph, p_ij and H_attr can be expressed as functions of the n(n−1) pairwise inner products (Eq. (5)). Yet the value of ⟨v_i · v_j⟩ cannot be obtained directly, since v_i and v_j are assumed embeddings, not node attributes. To tackle this issue, an approximate calculation is made. From empirical observations of node embedding models in past experiments, we find that the absolute values of the elements of vectorized node embeddings are uniformly distributed. According to the distributional hypothesis [29], we therefore assume that each element of a d-dimensional node embedding has an absolute value of one. That is, in d-dimensional space, each node embedding maps to a vector that lies on the surface of a hyper-sphere with radius √d. Subsequently, the inner product can be approximated as:

⟨v_i · v_j⟩ ≈ d cos θ_ij,   (6)

where θ_ij is the angle between the vectors v_i and v_j. Combining Eqs. (5) and (6) establishes a function that calculates the attribute entropy from the dimensionality d and the angles θ (Eq. (7)). Referring to previous research [9,34], the probability density of the angle θ between two arbitrary vectors on the surface of a hyper-sphere with radius √d is:

p(θ) = ( Γ(d/2) / ( Γ((d−1)/2) · √π ) ) · sin^{d−2} θ.   (8)

According to the Laplace approximation [1], the small difference between d and (d−2) can be ignored when d is large enough. Formally:

log( sin^{d−2}θ · cos θ ) = (d−2) log sin θ + log cos θ ≈ d log sin θ + log cos θ.   (9)
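The hypersphere assumption behind this derivation (every embedding element has absolute value one, so each vector has norm √d) can be probed with a small Monte-Carlo sketch. This is an illustration of the distributional assumption only, not the paper's closed-form attribute entropy; the function name and binning scheme are ours:

```python
import numpy as np

def empirical_pairwise_entropy(n, d, bins=32, seed=0):
    """Entropy (in bits) of a histogram of pairwise inner products among
    n random d-dimensional vectors whose entries all have absolute value
    one, i.e., points on a hyper-sphere of radius sqrt(d)."""
    rng = np.random.default_rng(seed)
    v = rng.choice([-1.0, 1.0], size=(n, d))  # the |element| = 1 assumption
    z = (v @ v.T)[np.triu_indices(n, k=1)]    # z_ij = d * cos(theta_ij)
    hist, _ = np.histogram(z, bins=bins)      # discretize the inner products
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())     # Shannon entropy of the bins
```

With this setup the inner products concentrate around d cos θ for angles drawn from the sin^{d−2}θ density, giving a finite empirical entropy that varies with d.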
Maximizing the right-hand side of Eq. (9) over θ closes the derivation, yielding the attribute entropy H_attr(n, d) as a function of the number of nodes n and the dimensionality d.

Structure entropy.
The structure entropy H_str reflects the complexity of the topological information contained in A. The NDE method includes a novel form of structure entropy, H(A_{1−2−3}), that is specifically designed for GNNs, where the H(·) defined in Eq. (1) is applied to a normalized adjacency matrix A_{1−2−3} containing multi-order link information. There are four steps to calculating H(A_{1−2−3}) (see Fig. 2). Each step is explained in detail next, along with why we use A_{1−2−3} instead of A.
Step 1. Computing the Multi-order Adjacency Matrices. GNNs capture multi-order link structures separately in their multiple convolutional layers. Thus, multi-order adjacency matrices are used to calculate the structure entropy instead of only the first-order adjacency matrix A. Given an undirected graph, the first-order adjacency matrix is A. As in GNNs, a self-loop is added to A to yield Ã, i.e., Ã = A + I. The second-order adjacency matrix is Ã² = ÃÃ, and the third-order adjacency matrix is Ã³ = ÃÃÃ.
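Step 1 amounts to one matrix addition and two matrix products. A minimal NumPy sketch (function name is ours) on a 3-node path graph, where Ã² reveals the 2-hop link between the two end nodes:

```python
import numpy as np

def multi_order_adjacency(A):
    """Step 1: first-, second-, and third-order adjacency matrices,
    each built from A with self-loops added."""
    A_t = np.asarray(A, dtype=float) + np.eye(len(A))  # A~ = A + I
    return A_t, A_t @ A_t, A_t @ A_t @ A_t

A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]          # path graph 0 - 1 - 2
A1, A2, A3 = multi_order_adjacency(A)
```

Here A1[0][2] is zero (no direct edge), while A2[0][2] is nonzero because nodes 0 and 2 are connected through node 1.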
Step 2. Applying Graph Laplacian Normalization to the Multi-order Adjacency Matrices. Inspired by the graph Laplacian [14], we capture the information transmission rate between two nodes in place of the explicit weight of the edge. The implicit information transmission rate establishes a more stable random walk probability distribution on the graph, allowing the structure entropy to be calculated more precisely. Therefore, we employ graph Laplacian normalization on the multi-order adjacency matrices Ã^k, k = 1, 2, 3. The normalized multi-order adjacency matrices are defined as A_k = D̃_k^{−1/2} Ã^k D̃_k^{−1/2}, where D̃_k is the diagonal degree matrix of Ã^k.

Step 3. Fusing the Normalized Multi-order Adjacency Matrices. In most cases, a GNN has only a few layers so as to avoid over-smoothing. Therefore, in this setting, only the normalized first-, second-, and third-order adjacency matrices A_1, A_2, and A_3 are used to compute the structure entropy. However, there is much redundant structural information across A_1, A_2, and A_3; fusing them should yield an adjacency matrix A_{1−2−3} with reduced redundancy. A_2 and A_3 reflect the second and third graph convolution layers in GNNs. In our experiments, we found that the structure entropies of A_2 and A_3 tended to be larger than that of A_1, since they represent more complex structural information. Meanwhile, A_2 and A_3 bring more redundancy and over-smoothing. Hence, for a better fusion, we assign different probabilities p_k to A_1, A_2, and A_3, where the probability of the normalized adjacency matrix of order k is calculated from the gap between its structure entropy and that of A_1 (Eq. (13)).
Since A_1 contains the minimum amount of over-smoothing information, it is regarded as the basis of the probability calculation. The smoothing degree of a higher-order normalized adjacency matrix (k > 1) is reflected in the difference between its structure entropy and that of A_1. Therefore, Eq. (13) assigns a lower probability to the higher-order normalized adjacency matrices (k > 1), which are the most smooth. This approach to probability assignment is analogous to Inverse Document Frequency (IDF) [13]. The fusion adjacency matrix is then defined as:

A_{1−2−3} = Σ_{k=1}^{3} p_k · A_k,   (14)

with the corresponding degree set given by the row sums of A_{1−2−3}.
Step 4. Calculating High-level Structure Entropy. According to Eq. (14), the structure entropy is calculated by applying Eq. (1) to the weighted matrix A_{1−2−3} (Eq. (15)), where g_α sums the weights A_{1−2−3}[u, v] over pairs in which u is a node not in the subtree rooted at α and v is a node in the partition of the subtree rooted at α. We used the Python toolkit Louvain [2] to obtain the community partition and the encoding tree for A_{1−2−3}. We chose the Louvain algorithm because it is fast and performs well [18].
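Steps 2 and 3 can be sketched as follows. The normalization is the standard symmetric Laplacian form; for the fusion weights we use a simple inverse-entropy weighting as a stand-in, since it reproduces the IDF-style intuition (higher-entropy, smoother matrices get lower probability) but is not the paper's exact Eq. (13). All names are ours:

```python
import numpy as np

def laplacian_normalize(M):
    """Step 2: symmetric normalization D^{-1/2} M D^{-1/2}, with D the
    diagonal degree matrix (row sums) of M."""
    d_inv_sqrt = 1.0 / np.sqrt(M.sum(axis=1))
    return M * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def fuse_multi_order(mats, entropies):
    """Step 3 (sketch): fuse normalized matrices A_1..A_K with
    probabilities that shrink as a matrix's structure entropy exceeds
    A_1's. NOTE: an inverse-entropy stand-in, not the exact Eq. (13)."""
    h = np.asarray(entropies, dtype=float)
    w = h[0] / h                      # A_1, the basis, gets weight 1
    p = w / w.sum()                   # normalize into probabilities
    return sum(pk * M for pk, M in zip(p, mats)), p
```

With this weighting, a second matrix whose structure entropy is twice A_1's receives half of A_1's weight before normalization.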

Figure 4: The GDE training framework. All graph convolution layers except the last encode the graphs into d_0-dimensional hidden representations h_1 and h_2; the final layer then represents Graphs 1 and 2 as c_1-dimensional and Graphs 3 and 4 as c_2-dimensional graph representations, and outputs the classification results.

Minimum entropy & Appropriate Dimension.
Based on Eqs. (2), (12), and (15), the graph entropy of the graph G is:

H(G) = H_attr(n, d) + H(A_{1−2−3}),   (16)

where d denotes the dimensionality of the node embedding and n is the number of nodes in the graph G. The appropriate node representation dimension is the value of d that makes ∂H(G)/∂d = 0.
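In practice the stationarity condition can be replaced by a scan over a discrete candidate range. The sketch below uses a toy entropy with a falling attribute-like term and a rising penalty (our illustration, not the paper's Eq. (16)):

```python
def select_dimension(graph_entropy, candidates):
    """Pick the node-embedding dimensionality d that minimizes H(G; d)
    over a discrete candidate range -- a numerical stand-in for solving
    dH/dd = 0."""
    return min(candidates, key=graph_entropy)

# Toy entropy: 1000/d falls with d, 0.1*d rises with d; the minimum
# sits where the two effects balance (d = 100 for these constants).
toy_entropy = lambda d: 1000.0 / d + 0.1 * d
best_d = select_dimension(toy_entropy, range(16, 513))
```

Because `min` with a key scans every candidate, this works for any entropy function, convex or not, at the cost of one evaluation per candidate dimension.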

A Variant: Node Representation Dimension Estimator for Directed Graphs.
We also built a variant of NDE that can estimate the appropriate dimensionality of node embeddings for directed graphs. This is needed because the idea of using a fused adjacency matrix cannot be applied directly to directed graphs: graph Laplacian operations are, in theory, only applicable to symmetric matrices, but the adjacency matrix of a directed graph is asymmetric. Hence, in this variant of NDE, we must build symmetric adjacency matrices for the directed graph, that is:

A_S = A + Aᵀ,   (17)

where A is the asymmetric adjacency matrix of the directed graph. Eq. (17) defines how to build a symmetric adjacency matrix that ignores edge direction, as shown in Fig. 3(a). To build symmetric adjacency matrices that encode the direction information in the directed graph, we also define A_O in Eq. (18): A_O connects every two nodes that both have an out-degree edge to the same node. The process is illustrated in Fig. 3(b). Similarly, in Eq. (19), we connect every two nodes that both have in-degree edges from the same node to obtain A_I. The process is shown in Fig. 3(c). Here, N_I(u, v) denotes the set of nodes that have in-degree edges from the nodes u, v, while N_O(u, v) denotes the set of nodes that have out-degree edges to the nodes u, v.
The graph Laplacian can then be used to normalize A_S, A_O, and A_I, which are fused with probabilities to obtain the fusion adjacency matrix of the directed graph; A_S is the basis of the fusion process. Then, in the same way as we estimate the embedding dimensionality for undirected graphs, we can derive the appropriate dimensionality for a directed graph.
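The three symmetric views can be sketched with matrix products: two nodes share a successor exactly when the corresponding entry of A·Aᵀ is positive, and share a predecessor exactly when Aᵀ·A is positive. This is our illustration of the construction (the paper defines the latter two via the neighbor sets N_O / N_I), with names of our choosing:

```python
import numpy as np

def directed_views(A):
    """Three symmetric views of a directed adjacency matrix
    (A[u, v] = 1 for an edge u -> v), sketching Eqs. (17)-(19):
    A_s: direction ignored;
    A_o: u, v linked when they point to a common successor;
    A_i: u, v linked when a common predecessor points to both."""
    A = np.asarray(A, dtype=float)
    A_s = np.clip(A + A.T, 0, 1)
    A_o = np.clip(A @ A.T, 0, 1)   # (A A^T)[u,v] counts common successors
    A_i = np.clip(A.T @ A, 0, 1)   # (A^T A)[u,v] counts common predecessors
    np.fill_diagonal(A_o, 0)       # drop trivial self-links
    np.fill_diagonal(A_i, 0)
    return A_s, A_o, A_i
```

For example, with edges 0→2, 1→2, and 0→3, nodes 0 and 1 become linked in A_o (shared successor 2), and nodes 2 and 3 become linked in A_i (shared predecessor 0).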

Graph Representation Dimension Estimator
Most GNNs for graph-level tasks generate graph-level embeddings by aggregating the embeddings of all the nodes. That is to say, in these GNNs, node and graph representations share the same dimensional space. Hence, the appropriate dimensionalities of the node-level embeddings as estimated by the NDE should also be valuable for determining the proper dimensionality of a graph-level embedding. Thus, the GDE assembles a candidate set of dimensions from the NDE. Then, for each graph, the GDE selects the best-fit dimensionality from the candidate set.
To assemble the candidate set of dimensions, the NDE method is applied to each of the graphs {G_1, ..., G_m} in the given graph database G. A vector [d_1, ..., d_m] containing the proper dimensionality of each graph is then fed into a 1-D K-means [21] clustering algorithm, from which K clustering centroids c_1, ..., c_K are derived. The value of K is a predefined hyper-parameter, and {c_1, ..., c_K} is the candidate set of proper dimensionalities for the graph-level embeddings.
The GDE method also includes a new training framework, specifically designed to support GNNs on graph-level tasks; it is outlined in Fig. 4. In this training framework, all graph convolution layers except the last encode the graphs into hidden embeddings with a unified dimensionality d_0, obtained by applying another 1-D K-means to the vector [d_1, ..., d_m] with K = 1. In the last convolution layer, each graph is assigned its best-fit dimensionality c_k, k ≥ 1.
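The candidate-set construction is plain 1-D Lloyd's K-means over the per-graph dimensionalities. A self-contained sketch (our implementation, shown instead of a library call to make the mechanics explicit):

```python
import numpy as np

def candidate_dimensions(node_dims, K, iters=50, seed=0):
    """1-D K-means over the per-graph dimensionalities [d_1, ..., d_m]
    estimated by NDE; the K centroids form the candidate set.
    With K = 1 this degenerates to the mean, i.e., the hidden
    dimensionality d_0 used by all but the last convolution layer."""
    x = np.asarray(node_dims, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = rng.choice(x, size=K, replace=False)  # init from the data
    for _ in range(iters):
        # assign each dimensionality to its nearest centroid
        assign = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
        for k in range(K):
            if np.any(assign == k):                   # skip empty clusters
                centroids[k] = x[assign == k].mean()
    return np.sort(centroids)
```

For instance, per-graph dimensionalities [60, 62, 64, 120, 124, 128] with K = 2 yield the candidate set {62, 124}, while K = 1 yields the single hidden dimensionality 93.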

Time Complexity Analysis
The time complexity of NDE is mainly composed of the Louvain algorithm (O(n log n)), matrix normalization (O(n²)), and the structure entropy calculation (O(n + m)), where n and m are the numbers of nodes and edges, respectively. Overall, the time complexity of NDE is O(n²). The time complexity of GDE is O(g · n²), where g is the number of graphs in the graph database.

EXPERIMENTS
In this section, we report the outcomes of an extensive set of experiments with nine benchmark datasets. The experiments were designed to test the effectiveness of the dimensionalities estimated by MGEDE with both node and graph classification tasks.

Experiment Setup
Datasets. For the graph classification task, we used three biological datasets (i.e., ENZYMES [3], PROTEINS [3], and D&D [6]) and a social network dataset (i.e., COLLAB [45]). There were 6, 2, 2, and 3 classes in the four datasets, respectively. For the node classification tasks, we used three undirected graphs (i.e., Cora [30], Citeseer [30], and Pubmed [27]) and three directed graphs (i.e., Cora-ML [32], Directed Citeseer (Di-Citeseer) [30], and AM-Computer [24]), with 7, 6, 3, 7, 6, and 10 classes, respectively. Models. We evaluated how well the dimensionalities estimated by MGEDE perform with four types of popular GNNs and network embedding algorithms: a) GNNs for supervised graph classification (i.e., GIN [44], SOPOOL [42], DGCNN [50], GraphSAGE [8], and DIFFP [48]); b) GNNs for semi-supervised node classification on undirected graphs (GEN [40], GCN [14], GAT [39], and DAGNN [19]); c) GNNs for semi-supervised node classification on directed graphs (DiGCN [38] and DiGCL [37]); and d) unsupervised network embedding algorithms (DANE [7] and CAN [25]). To the best of our knowledge, MinGE [20] is the only existing framework that can estimate a suitable number of dimensions for node-level embeddings on undirected graphs. Therefore, we take the grid-search-based heuristic method and MinGE as the node-level comparison methods for undirected graphs. No current methods estimate embedding dimensions for node-level embeddings on directed graphs or for graph-level embeddings; for these two tasks, we could only use the grid-search-based heuristic method as the comparison method. Evaluation Metrics. We used accuracy as the evaluation metric for supervised graph classification and semi-supervised node classification, because these two tasks are relatively class-balanced in our experimental settings and accuracy is a widely used evaluation metric [8,42,44,48,50].
For the unsupervised network embedding, following previous studies [7,25], we adopted Micro-F1 and Macro-F1 as the evaluation metrics. All the experimental results reported are from 10-fold cross-validation. Experimental Setting. MGEDE has only one hyper-parameter, K, and it applies only to graph classification tasks. K controls the size of the candidate set of proper dimensionalities for graph embeddings. In our experiments, we set K = 2.

Performance of GNNs on Supervised Graph Classification
Tab. 1 reports the performance of the GNNs on supervised graph classification tasks, comparing the heuristic grid-search method with MGEDE in terms of maximum (max), minimum (min), and average (avg) performance gain. From the results for minimum performance gain, we can see that MGEDE performed competitively on all datasets. These results verify that MGEDE can support practitioners in their use of GNNs for graph classification tasks.

Performance of GNNs on Semi-supervised Node Classification
Tab. 2 provides the results with a semi-supervised node classification task on undirected graphs. MGEDE helps the GNNs to deliver promising performance on all undirected graphs with an average of 0.4% greater accuracy. Tab. 3 reports the results with directed graphs. Likewise, on all directed graphs, the GNNs were more accurate when using the number of dimensions estimated by MGEDE.

Performance on Unsupervised Network Embedding
Tab. 4 shows the performance of the two network embedding models DANE and CAN. With a node classification task, in this experiment, MGEDE helped the two models reach competitive performance in terms of Micro-F1 and Macro-F1 scores.

Time Efficiency Analysis
Beyond improving the classification accuracy of GNNs, one of MGEDE's benefits is that it saves practitioners time. First, it estimates the proper number of dimensions for graph embeddings in a much shorter time than a grid search method would take. Second, the GNNs themselves run more efficiently because the embedding dimensions chosen are appropriate. As shown in Tab. 5 and Fig. 5, MGEDE took 1347 seconds (22.45 minutes) to estimate the proper dimensionalities of the graph-level embeddings for the COLLAB dataset; the grid search method took far longer.

Hyper-parameter Analysis
Using SOPOOL on a graph classification task, we conducted a sensitivity study of the hyper-parameter K, as shown in Tab. 6. SOPOOL delivered the best performance on all graph datasets when K = 2, and its performance decreased steadily as K increased. One possible reason for this is that a larger K may lead the GNNs to overfit, especially on small datasets such as ENZYMES.

RELATED WORK
This literature review surveys work related to entropy and dimension estimation. Entropy. A representative way to measure a system's uncertainty is Shannon's entropy [31], which regards entropy as a function of the distribution of basic unit events. Therefore, defining the basic unit events is critical for calculating Shannon's entropy, and numerous researchers have tried to find a reasonable basic unit for events on graphs. Dehmer [5] defined the basic unit event in a graph by transforming each node into a positive integer. The structure encoding tree [15,46] captures the entropy of the graph's structural information. Recently, there has also been some work on defining entropy over text data. Term Frequency-Inverse Document Frequency [13] takes the frequency of a word in a corpus and the inverse document frequency of that word as the basic unit. Su's [35] idea is that the inner product of two word embeddings can be regarded as the basic unit. Dimension Estimation. Some algorithms estimate the proper dimensionality through a metric, for example, a loss function that evaluates different numbers of dimensions. Yin and Shen [47] defined the pairwise inner product as a loss function for dimension estimation. Their work focuses on measuring the change in bias and variance resulting from different dimensionalities; the dimension option that best balances bias and variance is then selected as the best choice. Inspired by this work, Wang [41] presented a score function to gauge performance with different numbers of dimensions, ranging from 2 to a predefined maximum. Su [35] built an association between entropy and the dimensionality of the embeddings based on the semantic distribution hypothesis [29]. Also motivated by the semantic distribution hypothesis, Luo et al. [20] proposed a dimension estimation method for node-level embeddings.

CONCLUSION
In this paper, we proposed a novel node and graph representation dimension estimation framework called MGEDE, designed to support GNNs in producing embeddings for downstream tasks. Based on the minimum entropy principle, we presented a new method of calculating graph entropy. Comprising attribute entropy and structure entropy, this novel measure can be used to estimate the uncertainty in a graph. The framework then automatically chooses the appropriate dimensionality for node- and graph-level embeddings as the one that minimizes the graph's uncertainty. Moreover, we devised a new GNN training architecture that can encode graphs into different dimensional spaces. Extensive experiments demonstrate the effectiveness of MGEDE in guiding GNNs on both node- and graph-level tasks.