Prediction models with graph kernel regularization for network data

Traditional regression methods typically consider only covariate information and assume that the observations are mutually independent samples. In many modern applications, however, samples come from individuals connected by a network. We present a risk minimization formulation for learning from both covariates and network structure in the context of graph kernel regularization. The formulation involves a loss function with a penalty term. This penalty can be used not only to encourage similarity between linked nodes but also to improve over traditional regression models. Furthermore, the penalty can be combined with many loss-based predictive methods, such as linear regression with squared loss and logistic regression with log-likelihood loss. Simulations evaluating the model in both low-dimensional and high-dimensional settings show that our proposed approach outperforms all other benchmarks; we verify this for uniform-graph, nonuniform-graph, balanced-sample, and unbalanced-sample datasets. The approach was applied to predicting response values on a ‘follow’ social network of Tencent Weibo users and on two citation networks (Cora and CiteSeer). Each application verifies that combining covariate information and link structure through graph kernel regularization can improve predictive performance.


Introduction
Recent decades have seen significant advances in data collection, and graph data are now gathered in many modern applications, with graphs recording the relationships among the individuals in a system. This link information is often collected along with covariate information on each unit of the system [17]; the graph data we focus on in this paper are social and citation networks. A large body of work on traditional regression methods takes into account only covariate information and assumes that samples are mutually independent, yet samples usually come from individuals connected by a network. Much of the literature also analyzes the graph structure of the underlying network itself, for example in community detection [1,6]; clearly, if the covariates of each node are used alongside the network itself, communities can be found more effectively [2,15,25]. In addition, there is no shortage of methods that consider node attributes and graph links simultaneously. For example, Zhang et al. [24] put forward a method based on collective inference, which belongs to the semi-supervised learning field and is generally applicable to classification problems. Li et al. [13] proposed the RNC (Regression model with Network Cohesion) model, which uses the idea of network cohesion to solve network regression problems. They assume that the nodes in a network have different node effects (varying intercepts) and that the node effects of connected nodes tend to be similar; under these assumptions, they obtained the RNC model by constructing a regularization term on the node effect vector through the network Laplacian matrix. However, the RNC model is effective only in low-dimensional situations, because its penalty acts only on the node effects without constraining the regression parameters. Tibshirani and Land [20] also proposed models based on fusion penalties, which have been used to select variables under network constraints [8,10-12,16], but this line of work is not directly relevant to ours, because we are interested in networks consisting of observations rather than networks of variables.
On the other hand, in many studies the dimension is much larger than the number of samples; that is, the common high-dimensional, low-sample-size problem. In this case, an ordinary linear regression model cannot be used directly to estimate the regression parameters. To address this, researchers have proposed many regularization methods for identifying key variables in a regression framework, such as the Lasso [19], SCAD [5], Elastic-Net [27], fused Lasso [20], LARS [4], adaptive Lasso [26], group Lasso [22] and the $L_{1/2+2}$ hybrid regularization method [7,9,14]. All of these methods exhibit a grouping effect to some extent: strongly correlated covariates are selected or eliminated together. However, they share a common limitation: they operate purely from the computational or algorithmic perspective and do not use prior knowledge such as graph structure information. In recent years, a network penalty, usually defined as a quadratic form in the Laplacian matrix, has been used in a large number of practical applications. For example, Li and Li [11], Chen and Zhang [3] and Wang et al. [21] used a network-based $L_1$ penalty in regression analyses to select key variables in genomic data. We therefore expect that a fused penalty on both covariate and network link information will improve regression models.
Our task in this paper is to predict the response value using both covariate and graph link information, without modeling why linked nodes behave similarly. Unlike classical regression models, we adopt the following interpretation: if nodes $u$ and $v$ are linked, they are more likely to have similar predicted values, that is, $p(u) \approx p(v)$. This assumption helps construct the regularization term in our proposed model; it violates the independence assumptions that underpin the guarantees of traditional regression models, but we expect it to help in our predictive tasks, since it suggests pooling information from neighbors. The objective function of the proposed method can be formulated as a loss function plus a penalty acting on both the covariates and the graph Laplacian, and its explicit solution can be derived under certain conditions. In contrast to previous work, no information about potential groups in the underlying network is required, and we verify that the proposed model improves on state-of-the-art locally regularized regression models on graph data across various cases (uniform graph, nonuniform graph, balanced-sample, and unbalanced-sample) in low-dimensional and, especially, high-dimensional situations. Furthermore, the proposed model is effective for both sparse and dense networks, the latter via an extra graph sparsification step that preserves good predictive performance.
The rest of the paper is organized as follows. In Section 2, we derive our model in the linear regression setting, framing it as a penalized least squares problem with a closed-form solution; the methodology then extends to predictive models with log-likelihood loss, although we explicitly investigate only logistic regression. Section 3 discusses the relevant theoretical properties of the estimator. Section 4 gives a brief comparison of simulated numerical performance in various cases. Section 5 presents a detailed comparison of our approach with several other methods on a representative 'follow' network (Tencent Weibo) and on two citation networks (Cora and CiteSeer); we predict user activity, calculated from known statistics, on the 'follow' network and predict labels on the citation networks. Section 6 concludes the paper with a concise summary and directions for future research.

Regression with graph kernel regularization
We start by introducing the notation. All vectors are treated as column vectors by default. The dataset we study consists of $n$ observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^p$ is the vector of covariates and $y_i \in \mathbb{R}$ is the response variable for the $i$th observation. We write $Y = (y_1, y_2, \ldots, y_n)^T$ for the response vector and $X = (x_1, x_2, \ldots, x_n)^T$ for the $n \times p$ design matrix, which is treated as fixed and whose columns have been standardized to have mean 0 and variance 1. The graph connecting observations is $G = (V, E)$, where $V = \{1, 2, \ldots, n\}$ is the node set and $E \subseteq V \times V$ is the edge set. The graph structure can be represented by its adjacency matrix $A \in \mathbb{R}^{n \times n}$, whose element $A_{uv} = 1$ if nodes $u$ and $v$ are linked by an edge, i.e. $(u, v) \in E$, and 0 otherwise. We assume that there are no self-loops, so $A_{vv} = 0$ for every node $v$, and that the underlying graph is undirected, so $A_{uv} = A_{vu}$. The Laplacian of $G$ is $L = D - A$, where $D = \mathrm{diag}(d_1, d_2, \ldots, d_n)$ is the diagonal degree matrix with node degrees $d_u = \sum_{v \in V} A_{uv}$.
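For concreteness, these objects are straightforward to assemble in code; a minimal sketch (the edge-list input format and the numpy representation are our own choices):

```python
import numpy as np

def graph_laplacian(n, edges):
    """Build the adjacency matrix A and the Laplacian L = D - A defined above."""
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0        # undirected graph, no self-loops
    D = np.diag(A.sum(axis=1))         # diagonal degree matrix
    return A, D - A

A, L = graph_laplacian(4, [(0, 1), (1, 2), (2, 3)])
assert np.allclose(L.sum(axis=1), 0)   # each row of a Laplacian sums to zero
```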

Linear regression with graph kernel regularization
We define our problem more abstractly as follows in the context of linear regression. We wish to predict a real-valued output $y_i$ from the corresponding covariates $x_i$ and the graph link structure. Assume that the graph-based linear regression model can be constructed as

$$y_i = \sqrt{\mu}\,\alpha_i + x_i^T\beta + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$

and its matrix form can be written as

$$Y = \tilde{X}\theta + \varepsilon, \qquad (2)$$

where $\tilde{X} = (\sqrt{\mu}\,I_n, X)$, $\theta = (\alpha, \beta)^T$ and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$. In more detail, $\beta = (\beta_1, \ldots, \beta_p)^T$ is the $p$-dimensional vector of regression coefficients of the covariates, and $\alpha = (\alpha_1, \ldots, \alpha_n)^T$, a column vector containing information that a fixed intercept term cannot explain, represents the individual node effects. In other words, we can treat the $\alpha_i$s as a special attribute of each node, which behaves similarly whenever two nodes are linked. As for $\mu$, it is a small tuning parameter greater than zero that keeps the system strictly positive definite and thus numerically stable. Moreover, no assumption is made about the distribution of the error $\varepsilon$, although we do assume that $\mathbb{E}(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2 I_n$, where $I_n$ is the $n \times n$ identity matrix. Including individual node effects $\alpha$ rather than a shared intercept is essential for capturing network similarity. In general, $\alpha$ and $\beta$, a total of $n + p$ unknown parameters, cannot be estimated from $n$ observations without additional assumptions. We therefore construct a regularization on $\theta$, which includes $\alpha$ and $\beta$, and obtain the general objective function

$$L(\theta) = l(Y, \tilde{X}\theta) + R(\theta), \qquad (3)$$

where $l(\cdot)$ is a general loss function and $R(\theta)$ is the regularization term on $\theta$. In particular, if $l(\cdot)$ is the squared loss and $R(\theta)$ is the (scaled) squared $L_2$ norm of $\theta$, Equation (3) is ridge regression. Also, by defining a proper kernel function, ridge regression can be rewritten as

$$\hat{f} = \arg\min_{f}\; \sum_{i=1}^n (y_i - f_i)^2 + \lambda f^T K^{-1} f, \qquad (4)$$

where $\lambda$ is a tuning parameter that can be selected by cross-validation, $f = (f_1, f_2, \ldots, f_n)$ is the vector of predictive values, and $K = [k(x_i, x_j)]$ is the kernel Gram matrix with $k(x_i, x_j) = x_i^T x_j$ for $i, j = 1, 2, \ldots, n$. Moreover, we replace $K$ by $\mu I + K$ to maintain numerical stability, since $K$ may not be invertible.
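As a small illustration of Equation (4), the kernel ridge solution has the closed form $\hat{f} = K(K + \lambda I)^{-1}Y$; a hedged sketch with the linear kernel and the $\mu I + K$ stabilization mentioned above (the function name and default parameter values are illustrative):

```python
import numpy as np

def kernel_ridge_fit(X, Y, lam=1.0, mu=0.01):
    """Fitted values f = K (K + lam*I)^{-1} Y for the stabilized linear kernel."""
    n = X.shape[0]
    K = X @ X.T + mu * np.eye(n)       # k(x_i, x_j) = x_i^T x_j, plus mu*I for stability
    return K @ np.linalg.solve(K + lam * np.eye(n), Y)
```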
In graph-based learning, we can obtain the predictive values $f_i$ by constructing a regularization term from $L$ and solving the following optimization problem:

$$\hat{f} = \arg\min_{f}\; \sum_{i=1}^n (y_i - f_i)^2 + \lambda f^T L f, \qquad (5)$$

where $\lambda$ is a hyperparameter. If we have a weighted graph with edges $E$ and weights $w_{ij}$ associated with $(i, j) \in E$, then by a standard property of the Laplacian,

$$f^T L f = \sum_{(i,j) \in E} w_{ij}(f_i - f_j)^2. \qquad (6)$$

We simply set $w_{ij} = 1$ in this paper, since the identification of weights is not our main topic. It should be noted, however, that from the kernel point of view the regularization operator $L$ can be treated as the inverse of a suitably defined kernel matrix: Equation (6) shows that $L$ is positive semi-definite whenever the network has edges, and in practice a regularized version such as $L + \mu I$ is used where strict invertibility is required, analogous to the substitution $\mu I + K$ above. It is then natural to combine $K$ and $L$ through the following two lemmas, based on Zhang et al. [24].

Lemma 2.1: Let $K = K_1 + K_2$, where $K_1$ and $K_2$ are positive definite. Then

$$f^T (K_1 + K_2)^{-1} f = \min_{f_1 + f_2 = f}\; \big\{ f_1^T K_1^{-1} f_1 + f_2^T K_2^{-1} f_2 \big\}.$$

The proofs of Lemmas 2.1 and 2.2 are given in Appendices A and B. By applying them to the combination of the linear kernel and the graph kernel associated with $L$, we obtain the Regression model with Graph Kernel regularization (RGK):

$$\hat{\theta} = \arg\min_{\theta}\; l(Y, \tilde{X}\theta) + \lambda\,\theta^T\theta + \lambda'\mu\,\alpha^T L \alpha, \qquad (7)$$

where $\lambda$ and $\lambda'$ are tuning parameters. The detailed derivation is given in Appendix C. In particular, if $l(\cdot)$ in Equation (7) is the squared loss, we have

$$\hat{\theta} = \arg\min_{\theta}\; \|Y - \tilde{X}\theta\|^2 + \lambda\,\theta^T\theta + \lambda'\mu\,\alpha^T L \alpha, \qquad (8)$$

where $\|\cdot\|^2$ denotes the squared $L_2$ norm and the penalty term $\lambda\theta^T\theta + \lambda'\mu\alpha^T L\alpha$ can be regarded as a fusion of the ridge and network penalties, which not only shrinks the covariate coefficients but also controls the node effects. Furthermore, note that: (1) we avoid any computation involving $K$ itself, which greatly reduces the computational cost; and (2) Equation (8) is convex in $\theta$, so there is a unique global solution, namely

$$\hat{\theta} = \big(\tilde{X}^T\tilde{X} + \lambda I_{n+p} + \lambda'\mu M\big)^{-1}\tilde{X}^T Y, \qquad (9)$$

where $M = \begin{pmatrix} L & 0_{n\times p} \\ 0_{p\times n} & 0_{p\times p} \end{pmatrix}$.
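Since Equation (9) is an explicit linear solve, the linear RGK estimator takes only a few lines of code; a minimal sketch, assuming a graph with Laplacian L and the notation of this section (the function and variable names are ours):

```python
import numpy as np

def rgk_fit(X, Y, L, lam, lam2, mu):
    """Closed-form linear RGK estimator of Equation (9); lam2 plays the role of lambda'."""
    n, p = X.shape
    X_tilde = np.hstack([np.sqrt(mu) * np.eye(n), X])   # X~ = (sqrt(mu) I_n, X)
    M = np.zeros((n + p, n + p))
    M[:n, :n] = L                                       # M = blockdiag(L, 0)
    P = X_tilde.T @ X_tilde + lam * np.eye(n + p) + lam2 * mu * M
    theta = np.linalg.solve(P, X_tilde.T @ Y)
    return theta[:n], theta[n:]                         # node effects alpha, coefficients beta
```

For large networks, a sparse representation of $L$ and an iterative solver would replace the dense solve above.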
In summary, our proposed RGK model combines covariates and graph information through a linear combination of kernels. The objective function of RGK involves a loss function with a regularization term; this regularization not only shrinks the covariate coefficients but also encourages similarity between linked nodes. We expect it to yield an improvement over traditional regression models.

Prediction model and tuning parameters selection
To obtain the fitted values of the training samples, we simply use $X\hat{\beta} + \sqrt{\mu}\,\hat{\alpha}$. However, our task is to predict the response values of out-of-training samples from their available covariates and connections. As mentioned above, each node has its own effect $\alpha_i$, so predicted node effects are needed before the response values of the testing samples can be obtained. Assume there are $n$ training samples and $n'$ testing samples, giving a network with a total of $n + n'$ nodes. The corresponding Laplacian can then be partitioned as

$$L = \begin{pmatrix} L_{11} & L_{12} \\ L_{21} & L_{22} \end{pmatrix},$$

where $L_{11}$ corresponds to the $n$ training samples and $L_{22}$ to the $n'$ testing samples. Similarly, write the node effect vector as $(\alpha_1, \alpha_2)$, where $\alpha_1 = \hat{\alpha}$ is estimated from the training samples and $\alpha_2$ must be predicted from it. Under the assumption that connected nodes have similar node effects, we obtain the node effects of the new samples by solving

$$\hat{\alpha}_2 = \arg\min_{\alpha_2}\; (\hat{\alpha}^T, \alpha_2^T)\, L\, (\hat{\alpha}^T, \alpha_2^T)^T.$$

Setting the derivative to zero gives $\hat{\alpha}_2 = -L_{22}^{-1} L_{21}\hat{\alpha}$. The optimal tuning parameters $\lambda$, $\lambda'$ and $\mu$ can be selected by 10-fold cross-validation (CV) on the training data. In the CV, the training samples are randomly split into 10 folds; nine folds of data and the corresponding induced subgraph are used to train a model that is validated on the remaining fold, cycling through each of the ten folds in turn. The cross-validation error is computed as the average of the prediction errors on the held-out folds, and the $(\lambda, \lambda', \mu)$ grid point that minimizes it is selected as optimal. However, a 10-fold cross-validated grid search is time-consuming when there are several tuning parameters. Therefore, following Zhang et al. [23], we can optimize only $\lambda$ by 10-fold CV, simply take $\mu = 0.01$, and let $\lambda'$ satisfy $\lambda' n = 0.01$ without additional optimization. Furthermore, Zhang et al. [23] also show that this parameter setting does not greatly affect the numerical results, although we believe that slightly better results could be obtained with optimized $\mu$ and $\lambda'$.
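A sketch of this prediction step, assuming the combined Laplacian is ordered with training nodes first (as in the partition above) and that $L_{22}$ is invertible, i.e. every test component touches the training set:

```python
import numpy as np

def rgk_predict(X_test, L_full, n, alpha, beta, mu):
    """Predict responses for test nodes using alpha_2 = -L22^{-1} L21 alpha_hat."""
    L21 = L_full[n:, :n]                         # test-to-train block of the Laplacian
    L22 = L_full[n:, n:]                         # test-to-test block
    alpha2 = -np.linalg.solve(L22, L21 @ alpha)  # predicted node effects
    return X_test @ beta + np.sqrt(mu) * alpha2  # predicted responses
```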

Model generalization
We now extend the RGK methodology from the linear regression model to generalized linear models and other regression models. Taking logistic regression as an example, we use the logit as the link function; in general, any generalized linear model can be written as $g(\mathbb{E}(y_i)) = \sqrt{\mu}\,\alpha_i + x_i^T\beta$ for a link function $g$. Suppose the log-likelihood function is $L(\alpha, \beta, X, Y)$.
Then, whenever the network has edges, we can fit the model by minimizing

$$-L(\alpha, \beta, X, Y) + \lambda\,\theta^T\theta + \lambda'\mu\,\alpha^T L \alpha. \qquad (10)$$

When $L(\cdot)$ is concave in $\alpha$ and $\beta$, which is the case for exponential families, we can solve this optimization problem with a convex optimization algorithm such as Newton-Raphson. Note that a quadratic approximation to Equation (10) is a quadratic approximation to the log-likelihood plus the penalty, which means the problem can be solved efficiently by iteratively reweighted linear regression with network node effects, just as a generalized linear model is fitted by iteratively reweighted least squares.
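A hedged sketch of the logistic RGK fit in Equation (10), using a generic quasi-Newton solver from scipy in place of a hand-coded Newton-Raphson or IRLS loop (the function name and solver choice are ours):

```python
import numpy as np
from scipy.optimize import minimize

def logistic_rgk_fit(X, y, L, lam, lam2, mu):
    """Minimize the penalized negative log-likelihood of Equation (10)."""
    n, p = X.shape
    X_tilde = np.hstack([np.sqrt(mu) * np.eye(n), X])

    def objective(theta):
        eta = X_tilde @ theta
        nll = np.sum(np.logaddexp(0.0, eta) - y * eta)  # -log-likelihood, numerically stable
        alpha = theta[:n]
        return nll + lam * theta @ theta + lam2 * mu * alpha @ (L @ alpha)

    res = minimize(objective, np.zeros(n + p), method="L-BFGS-B")
    return res.x[:n], res.x[n:]                         # node effects, coefficients
```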

Theoretical properties of the RGK estimator
Given the RGK estimator in Equation (9), it is easy to see that $\hat{\theta}$ exists if and only if $P = \tilde{X}^T\tilde{X} + \lambda I_{n+p} + \lambda'\mu M$ is invertible. We also know from Equation (6) that $L$ is positive semi-definite. Proposition 3.1 gives an explicit condition for the existence of $\hat{\theta}$.
The proof is given in Appendix D. Theorem 3.2 gives the bias, variance, and mean squared error of the linear RGK estimator, based on Proposition 3.1.

Theorem 3.2:
The linear RGK estimator $\hat{\theta} = (\hat{\alpha}, \hat{\beta})$ defined by Equation (9) satisfies the following, where $b(\cdot)$ denotes the bias of an estimator and $P = \tilde{X}^T\tilde{X} + \lambda I_{n+p} + \lambda'\mu M$:

(1) The bias of the estimator is

$$b(\hat{\theta}) = \mathbb{E}(\hat{\theta}) - \theta = -P^{-1}\big(\lambda I_{n+p} + \lambda'\mu M\big)\theta.$$

(2) The variance of the estimator is

$$\mathrm{Var}(\hat{\theta}) = \sigma^2\, P^{-1}\tilde{X}^T\tilde{X}\, P^{-1}.$$

(3) The mean squared error of the estimator is

$$\mathrm{MSE}(\hat{\theta}) = \sigma^2\,\mathrm{tr}\big(P^{-1}\tilde{X}^T\tilde{X}P^{-1}\big) + \big\|P^{-1}\big(\lambda I_{n+p} + \lambda'\mu M\big)\theta\big\|^2.$$

(4) There is an upper bound on the mean squared error of the predicted values; the explicit bound is given in Appendix E together with the proof.

Theorem 3.2 applies to any fixed $n$, and its proof is given in Appendix E. Finally, we explore the effect of graph sparsification when the given graph is dense, i.e. when both its adjacency matrix and its Laplacian are dense. We can leverage the sparsification method of Spielman and Teng [18] to reduce the computational cost. For any $\varepsilon \ge 0$, which controls the degree of sparsity of the given graph, let $L^*$ be the Laplacian of a network on the same nodes satisfying

$$(1 - \varepsilon)\, x^T L x \;\le\; x^T L^* x \;\le\; (1 + \varepsilon)\, x^T L x \quad \text{for all } x \in \mathbb{R}^n. \qquad (11)$$

Besides, let $\hat{\theta}$ be the minimizer of $g(\theta) = \ell(\theta) + \theta^T W\theta$, where $\ell(\theta)$ denotes the loss function and $W = \lambda I_{n+p} + \lambda'\mu M$. Similarly, let $\hat{\theta}^*$ be the minimizer of $g^*(\theta) = \ell(\theta) + \theta^T W^*\theta$, where $W^* = \lambda I_{n+p} + \lambda'\mu M^*$ and $M^*$ is built from $L^*$ as $M$ is from $L$. Theorem 3.3 provides theoretical guarantees for the closeness of the linear RGK estimators computed with $L^*$ and $L$.

Theorem 3.3:
For $0 \le \varepsilon \le \frac{1}{2}$, given two Laplacian matrices $L$ and $L^*$ satisfying Equation (11), assume that $\ell(\theta)$ is twice differentiable and that $g$ is strongly convex with parameter $m > 0$, i.e. for any $\theta, \theta' \in \mathbb{R}^{n+p}$,

$$g(\theta') \;\ge\; g(\theta) + \nabla g(\theta)^T(\theta' - \theta) + \frac{m}{2}\,\|\theta' - \theta\|^2.$$

Then $\hat{\theta}$ and $\hat{\theta}^*$ satisfy the two error bounds of Equation (12), whose proof is given in Appendix F. Note that the term $\hat{\theta}^T M\hat{\theta} = \hat{\alpha}^T L\hat{\alpha}$ is expected to be small, and that the terms $|\hat{\theta}^{*T} M\hat{\theta}^* - \hat{\theta}^T M\hat{\theta}|$, $\hat{\alpha}^{*T} L\hat{\alpha}^*$ and $\hat{\theta}^{*T}\hat{\theta}^*$ are much smaller than this first term. Moreover, the second bound in Equation (12) is typically much larger than the first. Thus we can simplify the equation and obtain a theoretical upper bound on the approximation error $\|\hat{\theta}^* - \hat{\theta}\|^2$, namely

$$\|\hat{\theta}^* - \hat{\theta}\|^2 \;\le\; \frac{2\lambda'\mu\,\varepsilon}{m}\,\hat{\alpha}^T L\,\hat{\alpha}.$$

Clearly, the squared-error bound of the estimator is linear in $\varepsilon$. In general, $\varepsilon$ ranges from zero to one, since the graph may become disconnected when $\varepsilon$ is greater than one.

Numerical analysis
Simulations are conducted to investigate the performance of the RGK model on graph data in the context of linear and logistic regression. We illustrate it in low-dimensional and high-dimensional situations respectively, subdividing each into uniform and nonuniform graph cases. We end the section with a graph sparsification example.

Low dimension
We conducted experiments on a simulated graph with $n$ nodes assigned to $K$ blocks, in the case $n > p$. The nodes were allocated to blocks independently by sampling from a multinomial distribution with probabilities $(\pi_1, \pi_2, \ldots, \pi_K)$, and each block was labeled $c_i$ ($i = 1, 2, \ldots, K$). Edges $e_{ij}$ were generated as independent Bernoulli random variables with $P(e_{ij} = 1) = p_{ij}$, forming a symmetric probability matrix of within-block and between-block connection probabilities. We considered two cases: the uniform graph (equal block probabilities, i.e. $\pi_1 = \pi_2 = \cdots = \pi_K$) and the nonuniform graph (block probabilities generated from the uniform distribution). We set the within-block connection probability $p_{ii}$ to 0.8 and the between-block probability $p_{ij}$, $i \neq j$, to 0.2. The generated networks are visualized in Figure 1. In addition, the covariates were generated from Pois(1), the coefficients $\beta$ from $N(0, 1)$, and the $\alpha_i$s from $N(\tau_i, s^2)$, where $\tau_i$ denotes the mean of the $i$th block and $s^2$ its variance; a data-generation sketch is given below. Our experimental settings are almost identical to those of the RNC model of Li et al. [13], except that the latter considers only the low-dimensional uniform network. The objective function of RNC is

$$\min_{\alpha, \beta}\; \|Y - \alpha - X\beta\|^2 + \lambda\,\alpha^T L \alpha,$$

where $\alpha$ is the vector of node effects and $\lambda$ is a tuning parameter identified by cross-validation. Therefore, to highlight the novelty, a brief comparison between the RNC and RGK models is in order: (1) the RNC model is effective only in the low-dimensional situation, which limits its scope of application; and (2) the penalty of the RNC model acts only on the node effects, without constraining the regression parameters. Our proposed RGK model avoids both limitations. We compared the performance of our proposed linear RGK with RNC, ridge regression (RI_REG), Lasso and ordinary linear regression (OLS) on each simulated graph in terms of predictive mean absolute error (MAE). For each model, the simulated data consisted of a training set and a test set of 100 samples. Models were fitted on the training set only, and predictive mean absolute errors were computed on the test data. We repeated the simulations 100 times independently for each model. The hyperparameter $\lambda$ was identified by 10-fold cross-validation, while $\mu$ was simply fixed at 0.01 and $\lambda'$ satisfied $\lambda' n = 0.01$, as mentioned earlier. Table 1 shows the predictive mean absolute errors of the above methods.
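For reproducibility, the simulated graph and covariates can be generated along the following lines; a sketch with illustrative values of $n$, $K$, $p$ and $s$, and with the block means $\tau_i$ taken as the block indices (all of these specific choices are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, p = 200, 3, 5                                  # illustrative sizes
pi = np.full(K, 1.0 / K)                             # uniform case; for the nonuniform case,
                                                     # draw pi from U(0, 1) and normalize
z = rng.choice(K, size=n, p=pi)                      # block labels from the multinomial
Pmat = np.full((K, K), 0.2)                          # between-block probability 0.2
np.fill_diagonal(Pmat, 0.8)                          # within-block probability 0.8
U = rng.random((n, n))
A = np.triu((U < Pmat[z][:, z]).astype(float), k=1)  # independent Bernoulli edges
A = A + A.T                                          # undirected, no self-loops

X = rng.poisson(1.0, size=(n, p)).astype(float)      # covariates ~ Pois(1)
beta = rng.normal(size=p)                            # coefficients ~ N(0, 1)
alpha = rng.normal(loc=z.astype(float), scale=0.1)   # node effects ~ N(tau_i, s^2)
Y = np.sqrt(0.01) * alpha + X @ beta + rng.normal(size=n)  # model (1) with mu = 0.01
```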
The results reveal that our proposed linear RGK method gave the smallest MAE among the compared procedures in every case. In particular, both the linear RGK and RNC models beat the other methods, which indicates that considering both covariates and link information is appropriate. Furthermore, the MAE of linear RGK is indeed smaller than that of RNC with almost equal variance, though the margin is not remarkable; this may be because (1) the effect of the RGK penalty on the covariates is limited when the number of covariates is small, and (2) the tuning parameters $\mu$ and $\lambda'$ were not optimized. We expect a more pronounced improvement if these two points are addressed.

High dimension
In the case $p > n$, we generated graphs, illustrated in Figure 2, as in the low-dimensional case, except that each node was accompanied by 1000 covariates. We compared the performance of our linear RGK with ridge regression and Lasso in terms of MAE. Table 2 gives the average results of 50 simulations for each model. The linear RGK still gives the smallest MAE, about a third less than the other procedures. Ridge regression performed slightly better than Lasso with almost equal standard deviations, which may reflect multicollinearity in the covariates. This high-dimensional case confirms that incorporating both covariate information and network link structure into graph-based linear predictive models can lead to considerable improvement, and that the penalty imposed on these two information sources contributes to better accuracy. We conclude that the linear RGK method performed better than the other procedures in both low and high dimensions, especially the latter.

Low dimension
We conducted simulations to evaluate our proposed logistic RGK model, comparing it with RNC, logistic regression (LR), ridge regression, Lasso and OLS in terms of predictive accuracy. We simulated the graphs shown in Figure 3 exactly as in the low-dimensional linear RGK case, except that the covariates were generated from $N(0, 1)$ and the responses $y_i$ from the Bernoulli distribution with success probabilities given by the inverse logit of $x_i^T\beta + \alpha_i$, so the predicted values are naturally scaled between 0 and 1. All models were fitted on the same training set, and we computed the predictive recall scores, $F_1$ scores, and AUC values on test data with a sample size of 100. For each model, we repeated the simulation 50 times. It is important to note that AUC does not depend on the classification threshold, while the recall and $F_1$ scores do. We set the threshold to 0.5, meaning a sample is classified as positive if its predicted probability exceeds 0.5 and as negative otherwise. Table 3 shows the predictive recall and $F_1$ scores, and the corresponding ROC curves and AUC values for the above models are shown in Figure 4.
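The binary responses can be generated analogously to the linear case; a small sketch (the sizes and the node-effect distribution are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5
X = rng.normal(size=(n, p))                        # covariates ~ N(0, 1)
beta = rng.normal(size=p)
alpha = rng.normal(size=n)                         # node effects (block means omitted for brevity)
prob = 1.0 / (1.0 + np.exp(-(X @ beta + alpha)))   # inverse logit of the linear predictor
y = rng.binomial(1, prob)                          # y_i ~ Bernoulli(prob_i)
```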
The comparison in Table 3 and Figure 4 shows that our proposed logistic RGK performs better than the other benchmarks. The logistic RGK gave the largest $F_1$ score and AUC value, indicating the best classification capability at threshold 0.5 on both the uniform and nonuniform graph data when $n > p$. We also find that although the $F_1$ scores of ridge regression, Lasso and OLS are not as good, their recall scores are significantly higher than those of the other models, almost reaching 1: they capture nearly all of the minority class, but at the cost of many misclassified cases.

High dimension
Simulations were conducted to evaluate the performance of the logistic RGK when $p > n$ and to compare it with logistic regression, ridge regression and Lasso in terms of predictive classification accuracy. Graphs were generated as in Section 4.2.1, while the 1000 covariates were sampled independently from $N(1, 5)$. We repeated the simulations 50 times for each model; the generated graphs are shown in Figure 5. Table 4 shows the predictive recall and $F_1$ scores, and the corresponding ROC curves and AUC values are shown in Figure 6. In terms of $F_1$ score and AUC value, logistic RGK outperformed all benchmark models on both uniform and nonuniform networks, and the recall scores were very similar to those in the low-dimensional case. All in all, the logistic RGK approach beat all benchmarks on the simulated graph datasets in both the low-dimensional and high-dimensional situations, especially the latter.

Linear RGK on the sparsified graph
This subsection uses a simple example to illustrate the graph sparsification approach in linear regression. Following Spielman and Teng [18], we can find a sparsified adjacency matrix $A^*$ whose Laplacian $L^*$ satisfies Equation (11) for a nonnegative constant $\varepsilon$; we refer readers to [18] for the details of the sparsification process. In this experiment, we generated a network with 1000 nodes, set $\varepsilon$ to 1, and kept the other settings as in the uniform-graph case of linear RGK in the low-dimensional situation. After computation, the average degrees of the original network and the corresponding sparsified network were 167.498 and 55.114, respectively, and the approximation quality value, a measurable index proposed in [18], was 0.93.
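The nearly-linear-time algorithm of [18] is beyond the scope of a short example, but a small-scale stand-in with the same flavor samples edges proportionally to effective resistance and reweights them unbiasedly (this is Spielman-Srivastava-style sparsification, not the exact procedure of [18]):

```python
import numpy as np

def sparsify_by_effective_resistance(A, q, rng):
    """Sample q edges with probability proportional to effective resistance and
    reweight them unbiasedly; a small-scale stand-in for the sparsifier of [18]."""
    L = np.diag(A.sum(axis=1)) - A
    Lp = np.linalg.pinv(L)                              # pseudoinverse of the Laplacian
    edges = np.transpose(np.nonzero(np.triu(A)))
    # effective resistance R_uv = (e_u - e_v)^T L^+ (e_u - e_v)
    R = np.array([Lp[u, u] + Lp[v, v] - 2 * Lp[u, v] for u, v in edges])
    prob = R / R.sum()
    A_star = np.zeros_like(A)
    for idx in rng.choice(len(edges), size=q, p=prob):  # sample with replacement
        u, v = edges[idx]
        w = 1.0 / (q * prob[idx])                       # unbiased reweighting
        A_star[u, v] += w
        A_star[v, u] += w
    return A_star
```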
Next, we compared the performance of linear RGK with the original $L$ and the sparsified $L^*$ in two respects. First, the running time with the original adjacency matrix is twice that with the sparsified version. Second, the predictive mean absolute error with $L^*$ is almost identical to that with $L$, which implies that the sparsified $L^*$ captures most of the structural information of the original graph. Therefore, given a dense graph with covariates in a predictive task, we can replace the original $L$ by the sparsified $L^*$, reducing the computational cost while maintaining good performance.

Real data analysis
We investigated the performance of our proposed RGK on real graph data: linear RGK on a 'follow' network of Tencent Weibo users, and logistic RGK on two citation networks (Cora and CiteSeer).

Linear RGK on social networks
The 'follow' network contains millions of users, each provided with rich information (e.g. demographics, profile keywords, follow history, action data) from which to build a good prediction model. In this application, the follow history is used to form an undirected social network. The action data, which include statistics on the 'at' (@) actions between users and the numbers of tweets, retweets, and comments over a period of 90 days, describe user behavior. The profile keywords, extracted from a user's tweets, retweets, and comments, serve as that user's features in our prediction model.

Low dimension
Our task is to predict users' activity from their covariates and the current network. The response value is calculated from the action data of each individual, namely the sum of the 'at' (@) actions between users and the numbers of tweets, retweets, and comments, each divided by the length of the time period. Since the original data are too large, we selected a subgraph consisting of 1652 nodes with 8546 covariates. Because this subsection works in the low-dimensional setting, a dimension reduction step was needed: we used a random forest (RF) and kept the 30 covariates ranked most important by the RF. Our analysis then focused on this reduced network.
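A sketch of this screening step with scikit-learn (the forest size and random seed are assumptions; the cutoff of 30 covariates follows the text):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_screen(X, y, n_keep=30, seed=0):
    """Keep the n_keep covariates ranked most important by a random forest."""
    rf = RandomForestRegressor(n_estimators=500, random_state=seed).fit(X, y)
    keep = np.argsort(rf.feature_importances_)[::-1][:n_keep]  # top covariates
    return X[:, keep], keep
```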
We compared the performance of linear RGK with RNC, ridge regression and OLS. We performed five runs and report the averages and standard deviations of the MAE. Specifically, we held out 451 samples for testing and fitted all models on the remaining data; the tuning parameters were determined exactly as before. Table 5 shows the predictive MAE of these methods: our linear RGK outperformed all benchmarks. Note that both ridge regression and OLS performed better than RNC, which may be due to two facts: (1) the penalty of RNC is imposed only on the node effects, without constraining the regression parameters; and (2) the 'follow' network is so sparse that the node attributes may be more informative than the graph links. Moreover, ridge regression slightly outperformed OLS, suggesting multicollinearity in the covariates. All of this indicates that imposing the penalty on both the covariates and the graph structure is reasonable.

High dimension
In this section, all 8546 covariates were taken into consideration, with the other settings as in Section 5.1.1. We compared the performance of the linear RGK with Lasso and ridge regression. For each method, we performed five runs and report the results on the test data. Table 6 shows the results: the linear RGK gave the smallest MAE with the smallest standard deviation. It is not surprising that the differences between these models are not dramatic, which could be because (1) we did not optimize the tuning parameters $\mu$ and $\lambda'$ by cross-validation, and (2) the node attributes often carry more information than the graph link structure, especially when the network is extremely sparse. So far, we have verified that our proposed linear RGK consistently outperforms the benchmarks on both simulated and real graph data.

Logistic RGK on citation networks
Cora consists of machine learning papers tagged with one of seven labels: case-based, genetic algorithms, neural networks, probabilistic methods, reinforcement learning, rule learning, or theory. Similarly, papers in CiteSeer are tagged with Agents, AI, DB, IR, ML, or HCI. The papers were selected so that every paper in the final corpus cites or is cited by another, yielding 2708 papers in Cora and 3312 in CiteSeer. After stemming and removing stopwords, we were left with a vocabulary of 500 unique words for the former and 3703 for the latter; all words with document frequency less than 10 were removed in both networks.
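The bag-of-words covariates implied by this preprocessing can be reproduced roughly as follows; a sketch with scikit-learn (stemming is omitted and binary word indicators are assumed, so this only approximates the pipeline described above):

```python
from sklearn.feature_extraction.text import CountVectorizer

def build_word_features(docs, min_count=10):
    """Stopword removal plus a document-frequency cutoff, as described above."""
    vec = CountVectorizer(stop_words="english", min_df=min_count, binary=True)
    X_words = vec.fit_transform(docs)          # n_docs x vocabulary matrix
    return X_words, vec.get_feature_names_out()
```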

Low dimension
Since the logistic RGK is designed for binary classification and the labels in Cora are unbalanced, we selected binary subnetworks for the balanced-sample and unbalanced-sample cases: nodes labeled theory and case-based for the former, and nodes labeled rule learning and probabilistic methods for the latter.
We compared the performance of our proposed logistic RGK with RNC and logistic regression in terms of predictive error and accuracy. Each of the two subnetworks was split into a training set and a test set, with 30% of the samples held out for testing. Models were fitted on the same training set, and we computed the log losses, recall scores and $F_\beta$ ($\beta = 1, 2$) scores of these models on the same test set. For each model, we performed five trials and report the averaged predictive results; Table 7 shows the detailed results for the balanced-sample and unbalanced-sample cases. The logistic RGK gave a smaller log loss than RNC and logistic regression in both cases. Comparing the predicted probabilities of each model, we also found that the logistic RGK was under-confident while the other two methods were over-confident: the former's probabilities mostly fell between 0.5 and 0.7, whereas the latter's were extremely close to 0 or 1. Moreover, the logistic RGK also performed better in terms of the $F_\beta$ and recall scores, showing it to be the better classifier.
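For reference, the reported metrics can be computed with scikit-learn as follows; a sketch assuming predicted probabilities `prob` and the 0.5 threshold used in Section 4:

```python
from sklearn.metrics import log_loss, recall_score, fbeta_score

def evaluate(y_true, prob):
    """Log loss, recall, and F_beta (beta = 1, 2) at a 0.5 threshold."""
    y_pred = (prob > 0.5).astype(int)
    return {
        "log_loss": log_loss(y_true, prob),
        "recall": recall_score(y_true, y_pred),
        "F1": fbeta_score(y_true, y_pred, beta=1),
        "F2": fbeta_score(y_true, y_pred, beta=2),
    }
```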

High dimension
As with Cora, we selected nodes labeled Agents and ML as the final dataset in the balanced-sample case, and nodes labeled AI and IR for the unbalanced-sample case. The experimental setting is identical to that for Cora, except that here we compared the logistic RGK only with ordinary logistic regression, since RNC is not suitable for high-dimensional data. Table 8 gives the detailed results for the balanced-sample and unbalanced-sample cases: the logistic RGK outperformed logistic regression markedly in terms of predictive log loss, recall score, and $F_\beta$ score in both cases. In sum, we conclude that our proposed logistic RGK model outperforms the other methods on these two citation networks.

Conclusion and future research
This paper introduces a novel method, based on a graph kernel regularization formulation, for learning from both covariates and graph link structure. The method is applicable to regression problems on graph data. The objective function of our model is a well-posed convex optimization problem, so a globally optimal solution can be computed efficiently. In addition, imposing a penalty on both the covariates and the graph Laplacian not only shrinks the covariate coefficients but also provides a clear trade-off between bias and variance. Moreover, the proposed RGK model can be viewed as a standard kernel method with an appropriately defined kernel based on the underlying graph. Experimental results on both simulated graph data and real networks in various situations indicate that the RGK model leads to better performance in graph regression tasks.
In future research, we plan to extend the RGK methodology to other settings such as the Cox model, and to investigate the performance of RGK on other kinds of networks, such as random networks, which differs substantially from the present work, where the underlying graph is treated as fixed. We expect that applying the combined graph kernel regularization to random networks could lead to significant improvement over previous methods.