Aspect-Aware Graph Attention Network for Heterogeneous Information Networks

Qidong Liu, Cheng Long, Senior Member, IEEE, Jie Zhang, Mingliang Xu, and Dacheng Tao, Fellow, IEEE

Abstract—Graph Convolutional Networks (GCNs) derive inspiration from recent advances in computer vision, stacking layers of first-order filters followed by a nonlinear activation function to learn entity or graph embeddings. Although GCNs have been shown to boost the performance of many network analysis tasks, they still face tremendous challenges in learning from Heterogeneous Information Networks (HINs), where relations play a decisive role in knowledge reasoning. Moreover, entities in HINs have multiaspect representations, and a filter learned in one aspect does not necessarily apply to another. We address these challenges by proposing the Aspect-Aware Graph Attention Network (AGAT), a model that extends GCNs with alternative learnable filters to incorporate entity and relational information. Instead of focusing on learning general entity embeddings, AGAT learns adaptive entity embeddings based on the prediction scenario. Experiments on link prediction and semi-supervised classification verify the effectiveness of our algorithm.


I. INTRODUCTION
Heterogeneous Information Networks (HINs) are widely used for many practical tasks and have become crucial resources for several intelligent applications, including web search [1] and question answering [2]. HINs usually describe knowledge as triple facts $(v_i, r_m, v_j)$, where $v_i$ represents a head entity and $r_m$ is a relation that connects $v_i$ to a tail entity $v_j$ [3]. The triples that constitute HINs contain a wealth of knowledge that can provide effective structured information. However, existing methods are not good at handling large-scale HINs due to computational inefficiency and data sparsity. A feasible solution is to embed HINs, including entities and edges, into a continuous low-dimensional vector space, which can significantly promote knowledge acquisition and inference [4], [5], [6].
As an efficient variant of Convolutional Neural Networks, Graph Convolutional Networks (GCNs) stack layers of first-order filters followed by a nonlinear activation function to learn entity embeddings of graphs, and have shown promising results [7], [8]. Specifically, the feature vectors of the central entity and its neighbors are summed up and then normalized by nontrainable weights depending on the entity degrees. However, not all neighbors of an entity are equally important. A strategy sharing equal weights among all neighboring entities in the receptive field would hinder GCNs from automatically deciding what kind of features to extract [9]. Many works have been proposed to solve this problem. For example, the graph attention network (GAT) [10] enables learnable weights when aggregating neighboring feature vectors by employing the multihead attention mechanism. While the multihead attention aggregator can explore multiple representation subspaces between the central entity and its neighbors, some of these subspaces are less important. In view of this, Zhang et al. [11] proposed a network architecture named Gated Attention Networks that uses a convolutional subnetwork to control each attention head's importance. The gating mechanism is essential for recurrent neural networks to achieve state-of-the-art performance [12]. To aggregate the results of multiple layers and avoid the performance degradation of cascaded graph neural networks, Lu et al. [13] proposed an adaptive gating mechanism, using an adaptive adjacency matrix for the gating layer, to selectively update and forget high-dimensional features. However, the methods mentioned above focus on homogeneous networks. There is still room for improvement of existing methods in HINs, where relations play a decisive role in knowledge reasoning.

Fig. 1. Example of multiaspect in HINs. For one aspect, whether $v_i$ and $v_j$ are alumni, we pay attention to where they graduated from; but for another aspect, whether $v_i$ and $v_j$ are colleagues, we care more about where they are living.
In addition, most GCNs learn entity embeddings for one single aspect from the complex and multityped HINs [14]. "Aspect" here refers to the role that an entity plays in the current scenario; see Fig. 1 for an example, where we take each relation type to be predicted as an aspect. More specifically, in conventional algorithms such as the relational graph convolutional network (R-GCN) [15], every prediction scenario $r_m$ is associated with a diagonal matrix $R_m \in \mathbb{R}^{d \times d}$, and a triple $(v_i, r_m, v_j)$ is scored as $\mathbf{v}_i^\top R_m \mathbf{v}_j$, where $\mathbf{v}_i$ and $\mathbf{v}_j$ denote the embeddings of entities $v_i$ and $v_j$, respectively. However, each entity may have multiple roles (or aspects) in HINs, so embedding the entities of HINs into a single low-dimensional space would lead to information loss.
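To make the single-aspect limitation concrete, the following minimal sketch shows the diagonal (DistMult-style) scoring used by R-GCN-type decoders as described above; the function name and toy dimensions are illustrative, not taken from the original implementation.

```python
import torch

def rgcn_score(v_i: torch.Tensor, v_j: torch.Tensor, r_diag: torch.Tensor) -> torch.Tensor:
    """Score a triple (v_i, r_m, v_j) as v_i^T R_m v_j, where the relation
    matrix R_m is diagonal and stored as the vector r_diag."""
    # Equivalent to v_i @ torch.diag(r_diag) @ v_j, but cheaper.
    return torch.sum(v_i * r_diag * v_j)

# Toy usage: one shared embedding per entity, one diagonal filter per relation.
d = 8
v_i, v_j = torch.randn(d), torch.randn(d)
r_diag = torch.randn(d)  # learned per prediction scenario r_m
print(rgcn_score(v_i, v_j, r_diag).item())
```

Because $\mathbf{v}_i$ and $\mathbf{v}_j$ stay fixed across all relations and only the diagonal filter changes per scenario, this is exactly the single-aspect restriction that AGAT relaxes.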
The above analysis shows that HINs are complicated, not only because HINs contain various entities and relations, but also because the entities in HINs may have different representations in different aspects. This motivates us to propose the Aspect-Aware GAT (AGAT), a model that extends GCNs with alternative learnable filters to incorporate entity and relational information. Instead of focusing on learning general entity embeddings, we seek to learn adaptive entity embeddings that can change with the prediction scenario. More specifically, we propose an aspect-aware attention mechanism with multiple alternative learnable filters. Each filter covers not only entities but also relations. We can generate the specified entity embeddings given a prediction scenario (herein denoted as aspect $a$) via $\mathbf{v}_i = \mathrm{AGAT}(v_i, a)$. Besides, we introduce the gating mechanism into the convolutional layer, where the relation type acts as a gate to filter the neighboring features during the aggregation operation.
The contributions in this brief can be summarized as follows.
1) We propose an aspect-aware attention mechanism, which aims to generate adaptive entity embeddings according to the given aspect, avoiding the problem that a filter trained in one aspect performs poorly when applied to another aspect.
2) We propose a gated aggregator, which can automatically extract meaningful features. In alliance with the aspect-aware attention mechanism, the information sent from neighbors is scaled up or down through the relation types macroscopically and microscopically.
3) We evaluate the effectiveness of the proposed method on two common tasks: link prediction and semi-supervised classification. Experimental results on public datasets show that our model significantly outperforms other baselines.

II. PROPOSED METHOD

A. Problem Definition
Given a graph denoted by $G = (\mathcal{V}, \mathcal{R})$, where $\mathcal{V} = \{v_1, v_2, \ldots, v_N\}$ and $\mathcal{R} = \{r_1, r_2, \ldots, r_M\}$ represent the sets of $N$ entities and $M$ relation types, respectively. The set of entity indices that have labels is recorded as $I$. Each entity $v_i$ with $i \in I$ belongs to one of the categories in $C = \{c_1, c_2, \ldots, c_S\}$. Our method aims to learn a function of features on the graph which takes the following as inputs.
1) Two feature matrices $X \in \mathbb{R}^{N \times d_v}$ and $Y \in \mathbb{R}^{M \times d_r}$, where each row of $X$ is a feature description $\mathbf{x}_i$ for entity $v_i$, and each row of $Y$ is a feature description $\mathbf{y}_i$ for relation type $r_i$. Here, $d_v$ and $d_r$ denote the number of input features for entities and relation types, respectively.
2) A representative description of the graph structure $G$, typically in the form of triple facts $(v_i, r_m, v_j)$, where $v_i \in \mathcal{V}$ represents a head entity and $r_m \in \mathcal{R}$ is the relation type that connects $v_i$ to the tail entity $v_j$.
Then, it produces an entity-level output $V^a \in \mathbb{R}^{N \times d_o}$ given an aspect $a$ specified by the downstream task. For semi-supervised classification, we regard the category to be predicted as an aspect, and for link prediction, we regard the relation type to be predicted as an aspect. Here, $d_o$ is the number of output features per entity. The input of the first layer can be a unique one-hot vector for each entity (or relation) in HINs if no features are present, as sketched below.
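As a concrete illustration of these inputs, the toy snippet below (hypothetical sizes and indices) builds the one-hot feature matrices and the triple-fact list in PyTorch.

```python
import torch

N, M = 5, 2                     # toy numbers of entities and relation types

# With no given features, fall back to one-hot identifiers (d_v = N, d_r = M).
X = torch.eye(N)                # entity features,   shape (N, d_v)
Y = torch.eye(M)                # relation features, shape (M, d_r)

# Graph structure as triple facts (head index, relation index, tail index).
triples = torch.tensor([[0, 0, 1],
                        [1, 1, 2],
                        [3, 0, 4]])
```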

B. Overview
In this brief, we propose AGAT, which can be applied to link prediction and semi-supervised classification.
1) Link Prediction: The framework for link prediction is shown in Fig. 2. For each triple fact $(v_i, r_a, v_j)$, we first hide the relation $r_a$ and then input the remaining part into our proposed aggregator to get an embedding $\mathbf{v}_j^m \in \mathbb{R}^{d_o}$ for each $r_m \in \mathcal{R}$, i.e.,

$$\mathbf{v}_j^m = \Phi(v_j, r_m, \mathcal{G}'), \quad \forall m \in \{0, 1, \ldots, M\}. \quad (1)$$
Fig. 2. AGAT for link prediction.

Here, $r_0$ denotes the unconnected relation from head entity $v_i$ to tail entity $v_j$, $\mathcal{G}'$ denotes the remaining part of $G$ after masking $r_a$, and $\Phi$ denotes the proposed aggregator, which will be discussed in detail in Sections II-C and II-D. To train $\Phi$, we introduce two loss functions, $L_1$ and $L_2$, which are explained below. The goal of $L_1$ is to maximize the likelihood of observing the head entity $v_i$ given the tail entity $v_j$ and relation type $r_a$, i.e.,

$$\max \log \Pr\big(v_i \mid (v_j, r_a, \mathcal{G}')\big). \quad (2)$$

Following the successful experience of LINE [16], we model the conditional likelihood of every triple fact as a softmax unit parametrized by a dot product of $\mathbf{v}_j^a$ and $\mathbf{u}_i$, i.e.,

$$\Pr\big(v_i \mid (v_j, r_a, \mathcal{G}')\big) = \frac{\exp\big(\mathbf{u}_i \cdot \mathbf{v}_j^a\big)}{\sum_{k=1}^{N} \exp\big(\mathbf{u}_k \cdot \mathbf{v}_j^a\big)}. \quad (3)$$

Here, $\mathbf{v}_j^a = \Phi(v_j, r_a, \mathcal{G}')$ and $\mathbf{u}_i \in \mathbb{R}^{d_o}$ is the "context" embedding of entity $v_i$, which is a set of trainable parameters in our experiments. In this way, our method is applicable to both directed and undirected HINs. Since optimizing (3) is expensive for large datasets, we approximate it using hierarchical softmax or negative sampling [17]. Without loss of generality, hereinafter we only discuss the negative sampling method used in our experiments. Thus, the objective is to maximize the following function for each triple fact $(v_i, r_a, v_j)$:

$$\log \sigma\big(\mathbf{u}_i \cdot \mathbf{v}_j^a\big) + \sum_{k=1}^{K} \mathbb{E}_{v_k \sim P_n}\big[\log \sigma\big(-\mathbf{u}_k \cdot \mathbf{v}_j^a\big)\big] \quad (4)$$

where $\sigma$ is the sigmoid function. The first term models the observed samples, and the second term models the negative samples drawn from the noise distribution $P_n$.
For the second loss function $L_2$, we seek to optimize the following objective, which maximizes the log-probability of observing a relation type $r_a$ conditioned on its head entity $v_i$ and tail entity $v_j$:

$$\max \log \Pr\big(r_a \mid (v_i, v_j, \mathcal{G}')\big). \quad (5)$$

Let $\hat{\mathbf{r}}_a \in \mathbb{R}^{M+1}$ denote the one-hot vector that consists of 0s in all cells except for a single 1 in the cell used uniquely to identify $r_a$. Thus, $L_2$ is the log-likelihood cost function shown as follows:

$$L_2 = -\sum_{m=0}^{M} \hat{\mathbf{r}}_a[m] \log \Pr\big(r_m \mid (v_i, v_j, \mathcal{G}')\big) \quad (6)$$

where $\hat{\mathbf{r}}_a[m] \in \mathbb{R}$ is the $m$th cell of $\hat{\mathbf{r}}_a$. Similar to (3), we model the conditional likelihood of every triple fact as a softmax unit parametrized by a dot product of $\mathbf{v}_j^a$ and $\mathbf{u}_i$, except that the denominator is changed into the summation over the units for all relation types. Thus, we have

$$\Pr\big(r_m \mid (v_i, v_j, \mathcal{G}')\big) = \frac{\exp\big(\mathbf{u}_i \cdot \mathbf{v}_j^m\big)}{\sum_{m'=0}^{M} \exp\big(\mathbf{u}_i \cdot \mathbf{v}_j^{m'}\big)} \quad (7)$$

and the overall objective

$$L = L_1 + \lambda L_2 \quad (8)$$

where $\lambda$ is the trade-off coefficient, and we train $\Phi$ by optimizing $L$ with stochastic gradient descent. Unlike approaches limited to predicting the head entity or the tail entity, AGAT can predict the relation type between them.
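A minimal sketch of the combined objective follows, assuming the aggregator $\Phi$ has already produced the per-relation embeddings $\mathbf{v}_j^m$; the function name, tensor layout, and single-triple granularity are our assumptions, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def triple_loss(v_j_rel, u, i, a, neg_ids, lam=1.0):
    """Combined loss L = L1 + lambda * L2 for one triple (v_i, r_a, v_j).

    v_j_rel: (M + 1, d_o) embeddings of v_j under every relation type,
             i.e., v_j_rel[m] = Phi(v_j, r_m, G'); row 0 is the
             "unconnected" relation r_0.
    u:       (N, d_o) trainable context embeddings.
    i, a:    head-entity index and ground-truth relation index.
    neg_ids: indices of negative head entities drawn from P_n.
    """
    v_a = v_j_rel[a]
    # L1 via negative sampling: observed head entity vs. noise entities.
    pos = F.logsigmoid(u[i] @ v_a)
    neg = F.logsigmoid(-(u[neg_ids] @ v_a)).sum()
    l1 = -(pos + neg)
    # L2: softmax over relation types, reusing the dot products u_i . v_j^m.
    scores = v_j_rel @ u[i]                                   # (M + 1,)
    l2 = F.cross_entropy(scores.unsqueeze(0), torch.tensor([a]))
    return l1 + lam * l2
```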
2) Semi-Supervised Classification: The framework for semi-supervised classification is shown in Fig. 3, in which we input the whole graph $G$ into our aggregator to output multiple embeddings. Here, we denote the embedding of $v_j$ in aspect $c_s$ as $\mathbf{v}_j^{*s} \in \mathbb{R}^{d_o}$ to differentiate it from the entity embeddings in link prediction. Then, we evaluate the cross-entropy error over all labeled examples by the following function:

$$L_3 = -\sum_{j \in I} \sum_{b=1}^{S} \hat{\mathbf{c}}_b[b] \log \Pr\big(c_b \mid (v_j, G)\big) \quad (9)$$

where

$$\Pr\big(c_b \mid (v_j, G)\big) = \frac{\exp\big(\mathbf{u}_b \cdot \mathbf{v}_j^{*b}\big)}{\sum_{b'=1}^{S} \exp\big(\mathbf{u}_{b'} \cdot \mathbf{v}_j^{*b'}\big)}. \quad (10)$$

Here, $I$ is the set of entity indices that have labels and $\hat{\mathbf{c}}_b \in \mathbb{R}^{S}$ denotes the one-hot encoding of the ground-truth label $c_b$. In addition, $\mathbf{u}_b \in \mathbb{R}^{d_o}$ is a set of trainable parameters for the category $c_b$. We can use the framework in Fig. 3 for semi-supervised learning alone or in combination with the framework shown in Fig. 2. When used jointly, $L_1$ and $L_2$ act as a pretraining model that learns higher-order relations with the self-supervised prediction task, followed by fine-tuning with the semi-supervised task using entity labels.
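The classification head then reduces to a softmax over per-aspect dot products; the sketch below assumes a tensor holding one embedding per entity and aspect, with hypothetical names.

```python
import torch
import torch.nn.functional as F

def classification_loss(v_star, U, labels, labeled_ids):
    """Cross-entropy L3 over labeled entities.

    v_star:      (N, S, d_o) embeddings, v_star[j, s] = Phi(v_j, c_s, G).
    U:           (S, d_o) trainable category parameters u_b.
    labels:      (N,) ground-truth category index per entity (long tensor).
    labeled_ids: indices of the labeled entities (the set I).
    """
    # Score category c_b for entity v_j by the dot product u_b . v_j^{*b}.
    scores = (v_star[labeled_ids] * U.unsqueeze(0)).sum(dim=-1)   # (|I|, S)
    return F.cross_entropy(scores, labels[labeled_ids])
```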

C. Gated Aggregator
In this section, we present the building-block layer used to construct AGAT. First, one needs a linear feature transformation to map the input features into higher-level features. Since different relation types may contribute differently to the central entities, Schlichtkrull et al. [15] introduced relation-specific transformations. A central issue with their method is the rapid growth in the number of parameters with the number of relation types in the graph. In practice, this can easily lead to overfitting on rare relation types.
Gating mechanisms control the path through which information flows in the network and have proven useful for recurrent neural networks [12]. Inspired by this, we model HINs with gated linear units [18] to control what information should propagate through each edge. Specifically, given an entity $v_i$, we define its neighbors $N(i)$ as the one-hop neighbors of entity $v_i$ together with itself. Then, the architecture of the $l$th propagation layer is as follows:

$$h_i^{(l+1)} = f\Big(\sum_{j \in N(i)} \sigma\big(g_{i \to j}^{(l)} W_r^{(l)}\big) \odot h_j^{(l)} W_v^{(l)}\Big). \quad (11)$$

Here, $h_i^{(l)}$ and $g_{i \to j}^{(l)}$ are, respectively, the hidden state of entity $v_i$ and the embedding of the relation type between $v_i$ and $v_j$ in the $l$th layer of the neural network, and $W_r^{(l)}$ is a trainable parameter matrix whose purpose is to map $g_{i \to j}^{(l)}$ into the same channels as $h_j^{(l)} W_v^{(l)}$. Besides, $\odot$ denotes the Hadamard product, $\sigma$ denotes the sigmoid function, and $f$ is a nonlinear activation function such as ReLU. Since not all features in $h_j^{(l)} W_v^{(l)}$ are equally important to $h_i^{(l)}$, the information sent from $h_j^{(l)} W_v^{(l)}$ is scaled up or down through $\sigma(g_{i \to j}^{(l)} W_r^{(l)})$ microscopically. It is worth noting that $W_r^{(l)}$ and $W_v^{(l)}$ are shared across all triples. Therefore, the number of trainable parameters is greatly reduced, which alleviates the overfitting on rare relation types.
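The sketch below implements one gated propagation layer of (11) over an edge list, with self-loops assumed to be included; the class and argument names are ours.

```python
import torch
import torch.nn as nn

class GatedAggregator(nn.Module):
    """One propagation layer: neighbor features h_j W_v are gated
    elementwise by sigmoid(g_{i->j} W_r) before being summed over N(i)."""

    def __init__(self, d_v, d_r, d_o):
        super().__init__()
        self.W_v = nn.Linear(d_v, d_o, bias=False)  # shared entity transform
        self.W_r = nn.Linear(d_r, d_o, bias=False)  # maps the gate to d_o channels

    def forward(self, h, g, edges):
        # h: (N, d_v) entity states; g: (E, d_r) relation embedding per edge;
        # edges: (E, 2) long tensor of (i, j) pairs, self-loops included.
        gate = torch.sigmoid(self.W_r(g))                   # (E, d_o)
        msg = gate * self.W_v(h[edges[:, 1]])               # Hadamard product
        out = torch.zeros(h.size(0), msg.size(1), dtype=h.dtype)
        out.index_add_(0, edges[:, 0], msg)                 # sum over j in N(i)
        return torch.relu(out)                              # f = ReLU
```

Since `W_r` and `W_v` are shared across all triples, the parameter count is independent of the number of relation types, matching the overfitting argument above.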
To obtain sufficient expressive power for the relation embeddings, $g_{i \to j}$ is also updated between layers by its own layer-wise update rule.

D. Aspect-Aware Attention Mechanism
For conventional GCNs [7], the feature vectors of the central entity and its neighbors are summed up and normalized by nontrainable weights depending on the entity degrees. This convolution-like operation is intrinsically different from regular convolution, which can automatically select features by learning the weights in trainable filters [9]. GAT [10] enables learnable weights when aggregating neighboring features by employing the attention mechanism. There is still room for improvement of GAT when applied to HINs, where relations play a decisive role in knowledge reasoning. Besides, we may pay different amounts of attention to the entities and relations for different aspects in HINs. For example, in Fig. 1, regarding the aspect whether $v_i$ and $v_j$ are alumni, we pay more attention to the entity "Nanyang Technological University (NTU)" and the relation "Study at"; but for another aspect, whether $v_i$ and $v_j$ are colleagues, the entity "Singapore" and the relation "Live in" should have higher coefficients.
Motivated by the above, we propose an alternative learnable filter parameterized by a weight vector $\boldsymbol{\theta}_a^{(l)} \in \mathbb{R}^{d_r^{(l)} + 2 d_v^{(l)}}$ for each aspect $a$ in the $l$th layer of the neural network. Here, $d_v^{(l)}$ and $d_r^{(l)}$ are the input channels of the entity embeddings and relation embeddings, respectively. Since the propagation operations of different aspects are independent and identical, we discuss the graph attention layer of one aspect as an example.
An attention function can be described as mapping a query and a set of key-value pairs to an output. In this case, for each entity $v_i$, we define the "query," "key," and "value" as follows:

query: $\boldsymbol{\theta}_a^{(l)}$
key: $g_{i \to j}^{(l)} \,\|\, h_{a:j}^{(l)} \,\|\, h_{a:i}^{(l)}, \quad \forall v_j \in N(i)$
value: $h_{a:j}^{(l)} W_e^{(l)}, \quad \forall v_j \in N(i)$

Here, $h_{a:i}^{(l)}$ denotes the hidden state of entity $v_i$ for aspect $a$ in the $l$th layer, $\sigma$ is the sigmoid function, $\|$ is the concatenation operation, and $\odot$ denotes the Hadamard product. Mathematically, we calculate the coefficient that indicates the importance of entity $v_j$ to entity $v_i$ by

$$\gamma_{a:i \to j}^{(l)} = \boldsymbol{\theta}_a^{(l)} \cdot \big(g_{i \to j}^{(l)} \,\|\, h_{a:j}^{(l)} \,\|\, h_{a:i}^{(l)}\big). \quad (14)$$

Here, $\gamma_{a:i \to j}^{(l)} \in \mathbb{R}$ is a scalar and $\mathbf{x} \cdot \mathbf{y}$ denotes the dot product between $\mathbf{x}$ and $\mathbf{y}$.
Each entity may have a different number of neighbors. To avoid changing the scale of the feature vectors, we need to normalize $\gamma_{a:i \to j}^{(l)}$. Thus, we have

$$\alpha_{a:i \to j}^{(l)} = \frac{\exp\big(\gamma_{a:i \to j}^{(l)}\big)}{\sum_{k \in N(i)} \exp\big(\gamma_{a:i \to k}^{(l)}\big)}. \quad (15)$$

Here, $N(i)$ consists of entity $v_i$ and its one-hop neighbors.
Once obtained, the normalized attention coefficients are used to compute a linear combination of the corresponding features. We essentially arrive at the propagation rule

$$h_{a:i}^{(l+1)} = f\Big(\sum_{j \in N(i)} \alpha_{a:i \to j}^{(l)} \, h_{a:j}^{(l)} W_e^{(l)}\Big). \quad (16)$$

Here, the information sent from $h_{a:j}^{(l)} W_e^{(l)}$ is scaled up or down through $\alpha_{a:i \to j}^{(l)}$ macroscopically. The entity embedding of $v_i$ in aspect $a$ is obtained as $\mathbf{v}_i^a = h_{a:i}^{(L)}$, where $L$ denotes the final layer. For each aspect, we perform the attention function in parallel, yielding multiple output values which are used for downstream tasks, as depicted in Figs. 2 and 3.
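Putting (14)-(16) together, one aspect's attention layer can be sketched as follows; the edge-list layout and the per-neighborhood softmax implementation are our assumptions.

```python
import torch
import torch.nn as nn

class AspectAttentionLayer(nn.Module):
    """One aspect's layer: gamma = theta . (g || h_j || h_i) as in (14),
    softmax-normalized over N(i) as in (15), then aggregated as in (16)."""

    def __init__(self, d_v, d_r, d_o):
        super().__init__()
        self.theta = nn.Parameter(torch.randn(d_r + 2 * d_v))  # aspect filter
        self.W_e = nn.Linear(d_v, d_o, bias=False)

    def forward(self, h, g, edges):
        # h: (N, d_v); g: (E, d_r); edges: (E, 2) pairs (i, j), self-loops included.
        i, j = edges[:, 0], edges[:, 1]
        key = torch.cat([g, h[j], h[i]], dim=-1)        # (E, d_r + 2 d_v)
        gamma = key @ self.theta                        # (E,) raw coefficients
        # Softmax over each neighborhood N(i); the global shift keeps the
        # per-group softmax unchanged while avoiding overflow.
        w = torch.exp(gamma - gamma.max())
        denom = torch.zeros(h.size(0)).index_add_(0, i, w)
        alpha = w / denom[i]                            # normalized coefficients
        msg = alpha.unsqueeze(-1) * self.W_e(h[j])      # scale h_j W_e by alpha
        out = torch.zeros(h.size(0), msg.size(1))
        out.index_add_(0, i, msg)
        return torch.relu(out)
```

Running one such layer per aspect in parallel yields the multiple outputs consumed by the downstream tasks.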

E. Complexity Analysis
The computation complexities of the proposed aggregator come from the vector-matrix product operations of entity and relation type in (14), which are $O(d_r \times (d_o)^2)$ and $O(d_v \times (d_o)^2)$, respectively. Here, $d_v$ and $d_r$ are the numbers of input features for entities and relation types, and $d_o$ is the number of output features per entity. Let $|\mathcal{V}|$ and $|\mathcal{E}|$ denote the numbers of entities and relations, respectively. The time complexity of the method with a single aspect (which we denote by SGAT) can then be expressed as $O(|\mathcal{V}| \times d_v \times (d_o)^2 + |\mathcal{E}| \times d_r \times (d_o)^2)$. In AGAT, applying aspect-aware attention increases the storage and parameter requirements of SGAT by a factor of $|A|$ (the number of aspects), but we note that the individual aspects' computations are fully independent and can be parallelized.

III. EXPERIMENTS
In this section, we first evaluate the performance of AGAT on two tasks: link prediction and semi-supervised classification. Further, we conduct ablation studies to examine the different components of the proposed algorithm. State-of-the-art graph neural networks, such as R-GCN [15], the heterogeneous graph attention network (HAN) [19], the heterogeneous graph transformer (HGT) [20], and the general attributed multiplex heterogeneous network embedding (GATNE) [21], are compared as baselines.
1) R-GCN: R-GCN is the first GCN framework explicitly developed to deal with the highly multirelational data characteristic of realistic knowledge bases.
2) HAN: HAN is a novel heterogeneous graph neural network based on hierarchical attention, including entity-level and semantic-level attention. This work involves the design of meta paths for each type of heterogeneous graph, requiring specific domain knowledge.
3) HGT: HGT is inspired by the architecture design of the Transformer [22]. Entity- and relation-type dependent attention mechanisms are used to handle graph heterogeneity.
4) GATNE: GATNE is a novel approach that can utilize multiplex topological structures from different entity types, and it can capture rich attributed information. GATNE supports transductive and inductive embedding learning.
In addition, the Single-Aspect GAT (SGAT) is used as a baseline. SGAT is a variant of AGAT in which the aspect-aware filters are replaced by a single-aspect filter.
All the experiments are conducted on a Linux (Ubuntu 18.04.3 LTS) server with two GPUs (GeForce RTX 2080 Ti) and two CPUs (Intel Xeon Gold 5218). Our implementation is based on PyTorch, and the code is publicly available.

A. Link Prediction
Link prediction is a common task in academia and industry, and it has been widely used to evaluate the quality of network analysis techniques. We work on three public datasets, Amazon, YouTube, and Twitter, which have been used in [21] for the link prediction task. The statistics of these datasets are summarized in Table I.
1) Amazon: There are two relation types in Amazon, namely the co-viewing and co-purchasing links between products of the electronics category.
2) YouTube: YouTube consists of five types of interactions between 15 088 entities. The relation types include contact, shared friends, shared subscriptions, shared subscribers, and shared favorite videos between users.
3) Twitter: Twitter comprises four directional relationships between more than 450 000 Twitter users. The relation types are re-tweet, reply, mention, and friendship/follower relationships between Twitter users.
We use the train/valid/test splits from previous work [21]. Popular evaluation criteria, such as the area under the ROC curve (ROC-AUC), the area under the precision-recall curve (PR-AUC), and the F1 score, are used for evaluation in our experiments. We summarize the experimental results of the proposed methods and baselines in Table II.
From Table II, we can see that on the Amazon and Twitter datasets, the proposed methods consistently outperform the baselines in terms of all metrics, which verifies the effectiveness of the proposed aggregator. On the YouTube dataset, which is denser and has more relation types, GATNE passes SGAT as the second-ranked algorithm on all metrics; only AGAT has better results. In GATNE, the overall embedding of entity $v_i$ in aspect $a$ is obtained through a linear weighted summation of a base embedding and an edge embedding, where the base embedding is shared between different relation types, and the edge embedding is aggregated from neighbors with a specific relation type. Compared with GATNE, AGAT automatically generates adaptive entity representations by changing the filter of the aggregator based on the prediction scenario. Therefore, we draw the following two conclusions: 1) each entity in HINs has multiaspect representations, and adaptive entity embedding is more beneficial than general embedding to downstream tasks and 2) different scenarios pay different attention to neighbors. Therefore, a method that can generate adaptive entity embeddings by alternating filters based on the scenario is more powerful.

B. Semi-Supervised Classification
We conduct semi-supervised classification experiments on two public datasets: The Institute of Applied Informatics and Formal Description Methods (AIFB) and PubMed. The AIFB dataset describes the AIFB research institute in terms of its staff, research groups, and publications. It contains 178 members of five research groups. Following [23], we delete the smallest group with only four people, leaving four classes. Entities in AIFB are not associated with any features, and each entity has only one label. PubMed is a network of genes, diseases, chemicals, and species. We use the version collected from [24], in which Yang et al. label a small portion of the diseases into eight categories, and each disease has at most one label. All entities have 200-dimensional features computed by Word2vec [17]. The statistics of the two datasets are summarized in Table III.
The experimental results in Table IV show that the proposed models achieve competitive results, and SGAT outperforms AGAT on the PubMed dataset. The PubMed dataset consists of 63 109 entities, of which only 454 have labels (368 for training and 86 for testing). Considering that there are eight categories (aspects) in the PubMed dataset, only 46 entities on average per aspect are used for training, which is already fewer than the number of parameters in AGAT. Given the limited amount of data, it is acceptable for AGAT to obtain the second-best results. Furthermore, we noticed that AGAT occasionally outperformed SGAT (e.g., yielding Micro-F1 = 60.46% in one of ten experiments).

C. Case Study
We apply AGAT to a knowledge graph to examine its performance on the FB15k-237 dataset, which contains abundant relation types. Specifically, we adopt the FB15k-237-v1 dataset collected from [25]. Given a test triplet $(v_i, ?, v_j)$ to be predicted, we rank it against all candidate triplets, filtering out all the valid triplets appearing in the train/valid/test sets. We choose Mean Reciprocal Rank (MRR), Hits@1, Hits@3, and Hits@10 as evaluation metrics to compare our model with the composition-based multi-relational graph convolutional network (COMPGCN) [26].
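For reference, the filtered ranking protocol and metrics described above can be sketched as follows; the function signature is illustrative and assumes precomputed candidate scores.

```python
import torch

def filtered_metrics(scores, true_idx, known_mask):
    """Filtered ranking for one test triple (v_i, ?, v_j).

    scores:     (C,) model scores over all candidate triples.
    true_idx:   index of the ground-truth candidate.
    known_mask: boolean mask of candidates that are valid triples
                appearing in the train/valid/test sets.
    """
    s = scores.clone()
    mask = known_mask.clone()
    mask[true_idx] = False          # never filter out the test triple itself
    s[mask] = float("-inf")         # discard other valid triples
    rank = int((s > s[true_idx]).sum()) + 1
    return {"MRR": 1.0 / rank,
            "Hits@1": float(rank <= 1),
            "Hits@3": float(rank <= 3),
            "Hits@10": float(rank <= 10)}
```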
From the results in Table V, we can see that AGAT outperforms COMPGCN by a large margin.The reason is that AGAT can generate adaptive embeddings according to the scenario, which is more useful for downstream tasks.
We further investigate how the aspect-aware attention mechanism works and what it learns from the dataset. For clarity, we choose 11 relation types to give an intuitive impression of AGAT. Fig. 4(b) is an example demonstrating the meaning of each relation type. We score the coefficient of aspect $a_i$ and relation $r_j$ via $\boldsymbol{\theta}_i \cdot \mathbf{r}_j$. Here, $\mathbf{r}_j$ is the relation embedding of $r_j$, and $\boldsymbol{\theta}_i \in \mathbb{R}^{d_r}$ is the weight vector of the relation embedding in aspect $a_i$. We obtain the following findings from Fig. 4(a).
1) The coefficients vary greatly from row to row, validating our hypothesis that different prediction scenarios have different attention to relation types.
2) The coefficients for most diagonal positions are significant. This suggests that the same relation type as the current prediction scenario is important. For example, a person's "native language" can often be inferred from his friends' native languages.
3) Some off-diagonal coefficients are significant. For example, in the eighth row, when we predict the genre of a film, we pay much

D. Ablation Study
To systematically analyze the effectiveness of the proposed aggregator, we conduct extensive ablation studies on AGAT (or SGAT) to examine the contribution of each component.
1) Impact of Gating Mechanisms: Like GAT, the proposed aggregator aims to enable learnable weights when aggregating neighboring feature vectors by employing an attention mechanism. The difference is that SGAT incorporates the edge embeddings via (14), which makes it applicable to HINs. Next, we test the effectiveness of the proposed aggregator by comparing SGAT with GAT∼, in which (14) is replaced with the following expression:

$$\gamma_{i \to j}^{(l)} = \boldsymbol{\theta}^{(l)} \cdot \big(h_j^{(l)} \,\|\, h_i^{(l)}\big).$$

We record the above method as GAT∼ because it is essentially GAT [10]. Besides, the hierarchical dual-level attention mechanism proposed in NENN (incorporating node and edge features in graph neural networks) [27] is adopted as a baseline. The x-axis in Fig. 5 represents the number of training epochs, and the y-axis indicates the Micro-F1 (or ROC-AUC) of the classification (or link prediction) results. It can be seen that SGAT and NENN consistently outperform GAT∼ in both tasks, which verifies that edge embeddings play a vital role in aggregating neighboring features. In addition, NENN is inferior to SGAT in most cases. Although both methods incorporate edge embeddings into the attention mechanism, NENN ignores the heterogeneity of relations, which results in information loss on HINs. Specifically, SGAT assigns a vector representation to each relation type, whereas NENN assigns a vector representation to each edge.
2) Parameter Sensitivity: We investigate the sensitivity of the hyperparameters in AGAT, including the trade-off parameter λ and the number of layers l, by performing link prediction experiments. Previous work generally maximizes the likelihood of observing the head entity given the tail entity and relation type [21], which corresponds to the loss function $L_1$ in this brief. We extend previous works by introducing the second loss function $L_2$, with the goal of maximizing the probability of observing the relation type conditioned on the head entity and tail entity. Here, λ is the trade-off parameter that balances the contributions of the two loss functions. We search over {0, 0.001, 0.01, 0.1, 1, 10, 100} to examine AGAT's performance in terms of ROC-AUC. From Fig. 6(a), we can see that the performance of AGAT rises rapidly with the increase of λ and reaches its peak at λ = 10. At this point, the ROC-AUC is 93.93%, which is 10.75% higher than the performance obtained when considering only the loss function $L_1$. This proves that the loss function $L_2$ is helpful in enhancing the effectiveness of AGAT.
We also examine how the number of layers l affects the performance. While fixing λ = 1, we vary l from 1 to 6 and observe the link prediction performance of SGAT in terms of ROC-AUC on the YouTube dataset; see Fig. 6(b). The performance continues to grow as l increases, without the performance degradation described in [28], even when l > 3. We attribute this to the gating mechanism in the proposed aggregator, which helps improve the long-term propagation of information across the graph structure.

IV. RELATED WORK
With the advances in representation learning technology, heterogeneous network embedding has attracted intense research focus [29].
Its premise is that the intrinsic structural and semantic properties of the input graph can be encoded into latent embedding vectors and thus benefit downstream tasks [14]. Conventional heterogeneous network embedding algorithms can be roughly divided into translational models [30], [31], [32], [33] and semantic matching models [4], [34], [35], where the former interprets relation vectors as translations in vector space, and the latter measures plausibility by matching latent semantics of entities and relations.
More recently, we have also witnessed the emerging success of convolutional models in heterogeneous network embedding. One of the pioneering works is R-GCN [15], which keeps a distinct linear projection weight for each relation type. Another notable effort is the ConvE model [3], which uses 2-D convolutions over embeddings to predict missing links in HINs. In addition, there are other state-of-the-art heterogeneous network embedding algorithms, such as [36] and [14]. However, most of them require specific domain knowledge [19], since they rely on the manual exploration of heterogeneous structure, i.e., the selection of meta paths or their variants, for capturing the structural and semantic dependencies.
Multiview (multiplex or multidimensional) network embedding algorithms are a related line of research. HINs usually have multiple types of proximities between entities, yielding networks with multiple views. To learn robust entity representations that account for the multiple views of HINs, various multiview network embedding methods have been proposed, such as multiplex network embedding (MNE) [37], multi-view network embedding (MVE) [38], MultiKE [39], deep multi-graph embedding (DMGE) [40], mGCN [41], NeuACF [42], and GATNE [21]. DisenGCN [43] is the first work that studies the problem of disentangling the factors behind the formation of a homogeneous graph. DisenHAN [44] extends this work to heterogeneous networks for top-N recommendation.
The above methods focus on learning general entity embeddings, which do not change when the scenario switches dynamically. Therefore, they are prone to overreacting to irrelevant neighbors that do not fit the current scenario. The most similar concurrent work is the "relation-aware attentive fusion" in DisenKGAT [45]. The core difference is that AGAT automatically generates adaptive entity embeddings according to the given scenario, rather than adjusting the importance of each aspect on the final prediction.

V. CONCLUSION
In this brief, we propose an aspect-aware attention mechanism that can generate adaptive entity embeddings according to the prediction scenario. To apply the aspect-aware attention mechanism to HINs, we propose two frameworks, one for semi-supervised classification and one for link prediction, that are customized for the attention mechanism. Besides, we propose a gated aggregator that can automatically extract meaningful features. In alliance with the aspect-aware attention mechanism, the information sent from neighbors is scaled up or down through the relation types macroscopically and microscopically. Therefore, AGAT inherently provides more modeling power. Experiments on link prediction and semi-supervised classification, as well as the subsequent ablation studies, have verified all our hypotheses. Although AGAT has many advantages, it still has challenges in handling large datasets due to the necessity of loading the entire graph into memory for training. We leave the task of further improving the efficiency of AGAT for future research.

Fig. 3. AGAT for semi-supervised classification. For each entity $v_j$, $L_3$ aims to maximize the probability $\Pr(c_b \mid (v_j, G))$.

Fig. 6. Parameter sensitivity. (a) The performance of AGAT rises rapidly with the increase of λ and reaches its peak at λ = 10. (b) The performance of AGAT continues to grow as l increases, without performance degradation.

TABLE II. LINK PREDICTION (%): THE BEST SCORES ARE SHOWN IN BOLD, WHEREAS THE SECOND-BEST SCORES ARE UNDERLINED. RESULTS MARKED (†) ARE TAKEN FROM THE ORIGINAL PAPER.

TABLE III. STATISTICS OF DATASETS FOR SEMI-SUPERVISED CLASSIFICATION.

TABLE IV. SEMI-SUPERVISED CLASSIFICATION (%): THE BEST SCORES ARE SHOWN IN BOLD, WHEREAS THE SECOND-BEST SCORES ARE UNDERLINED.