Variable Selection for High-Dimensional Nodal Attributes in Social Networks with Degree Heterogeneity

Abstract We consider a class of network models in which the connection probability depends on ultrahigh-dimensional nodal covariates (homophily) and node-specific popularity (degree heterogeneity). A Bayesian method is proposed to select nodal features in both dense and sparse networks under a mild assumption on the popularity parameters. The proposed approach is implemented via Gibbs sampling. To alleviate the computational burden for large sparse networks, we further develop a working model in which parameters are updated based on a dense subgraph at each step. Model selection consistency is established for both models, in the sense that the probability of the true model being selected converges to one asymptotically, even when the dimension grows with the network size at an exponential rate. The performance of the proposed models and estimation procedures is illustrated through Monte Carlo studies and three real-world examples. Supplementary materials for this article are available online.


Introduction
Statistical modeling of networks has attracted tremendous attention as it provides a principled approach to characterizing complex relational systems, with a broad spectrum of applications to fields such as sociology, finance, and genomics. Network analysis is key to understanding the generative mechanism of networks and to making link predictions. This article is concerned with binary and undirected networks, where the network consists of a set of n nodes V = {1, . . ., n} and pairwise binary ties {y_ij}_{i,j∈V} indicating the presence or absence of some relation between nodes i and j. The n × n adjacency matrix is denoted by Y, with entries y_ij = y_ji = 1 representing that nodes i and j are connected. In addition to the network itself, we often observe node-level features X_i ∈ R^d for each node i. Homophily, a commonly observed network phenomenon, means that nodes with similar features are more likely to connect to each other than dissimilar ones. Moreover, most social networks exhibit degree heterogeneity: there exist some nodes with high degrees, known as hubs, along with many other nodes having only a few connections. With the advent and development of data storage technology, researchers can easily collect a large number of node-level features. However, not all of them will contribute to the formation of connections between nodes. In this article, we propose Bayesian approaches to selecting nodal attributes in the presence of degree heterogeneity and study their model selection consistency in the setting with high-dimensional node-level covariates.
There are numerous existing methods to model network data. Among them, it is common to assume dyadic independence of the links given the probability, denoted by π_ij, of a connection between nodes i and j. That is, given the underlying probability matrix π, {y_ij}_{i,j∈V} are independent Bernoulli random variables with probabilities {π_ij}_{i,j∈V}. While it is impossible to estimate π from a single realization of the network without any assumptions on the structure of π, many statistical models for π_ij have been developed in the literature. As an Exponential Random Graph Model (ERGM) for undirected networks, the β-model (Chatterjee, Diaconis, and Sly 2011) is the simplest one accounting for degree heterogeneity; it uses degrees as sufficient statistics and models the connection probability by node-specific popularity parameters. To account for homophily, it can be further extended by incorporating nodal covariates. In this article, we consider the following model, which accommodates both homophily and degree heterogeneity simultaneously,

logit(π_ij) = Z_ij^T β + δ_i + δ_j.    (1)

In the model, Z_ij is a p-dimensional pairwise covariate constructed by a known function of nodal attributes, Z_ij = g(X_i, X_j) = g(X_j, X_i), with g(·, ·) : R^d × R^d → R^p. For example, as a measure of homophily, Z_ij can be the negative element-wise Euclidean distance Z_ij = −|X_i − X_j| or the element-wise product Z_ij = X_i ∘ X_j. In the model, the degree heterogeneity of the ith node is parameterized by δ_i; a larger value of δ_i represents a higher activity level of node i. The vector β ∈ R^p collects the regression coefficients for the nodal covariates.

CONTACT Xizhen Cai, xc2@williams.edu, Department of Mathematics and Statistics, Williams College, Williamstown, MA 01267. Supplementary materials for this article are available online; please go to www.tandfonline.com/r/JASA.
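As a concrete illustration of model (1), the following sketch simulates a small network under the logistic link; the values of n, d, β, and δ here are hypothetical placeholders (not taken from the paper), and Z_ij is taken as the element-wise product of nodal features.

```python
import numpy as np

rng = np.random.default_rng(0)

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

n, d = 50, 3
X = rng.normal(size=(n, d))          # nodal features X_i
beta = np.array([1.0, -1.0, 0.5])    # homophily coefficients (illustrative)
delta = rng.uniform(-1.0, 1.0, n)    # node-specific popularity delta_i

Y = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(i + 1, n):
        z_ij = X[i] * X[j]                               # Z_ij = g(X_i, X_j)
        p_ij = expit(z_ij @ beta + delta[i] + delta[j])  # model (1)
        Y[i, j] = Y[j, i] = rng.binomial(1, p_ij)
```

Nodes with larger δ_i tend to end up with systematically higher degrees, reproducing the degree heterogeneity described above.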
Model (1) falls within the class of ERGMs. However, deriving its asymptotic properties as the network grows is nontrivial. Shalizi and Rinaldo (2013) gave conditions for the consistency of maximum likelihood estimation (MLE) in ERGMs, and showed that consistency of the MLE is problematic in ERGMs with dependent structures, such as triangle statistics. Even for the β-model under the dyadic independence assumption, since the number of parameters n diverges with the network size, it is a high-dimensional problem prohibiting direct application of the classical large-sample properties of the MLE. Recently, the consistency of the MLE for the β-model has been studied in Chatterjee, Diaconis, and Sly (2011) and Yan, Leng, and Zhu (2016). Graham (2017) extended the β-model by including a fixed number of nodal attributes and studied the consistency of the MLE under the assumption that all parameters are bounded. However, in real-world networks, such an assumption is typically violated. For example, if the true node-specific popularity parameter for the ith node, δ^0_i, is bounded, then the expected node degree increases at the order of O(n). This contradicts reality, as most social networks exhibit sparsity, with a group of nodes having only a few connections. Therefore, it is reasonable to assume that as the network size increases, a subset of nodes has less influence on the network and their expected degrees cannot increase arbitrarily, that is, δ^0_i → −∞. Based on this observation, Yan et al. (2019) relaxed the bounded ||δ^0||_∞ assumption in Graham (2017) to ||δ^0||_∞ ≤ τ_0 log n and discussed consistency and asymptotic normality of the MLE under a fixed number of nodal attributes. On the other hand, to the best of our knowledge, there is little literature on model selection consistency when the dimension of nodal covariates increases at an exponential rate with the network size.
Our approach to model selection falls within a Bayesian framework, where we apply the well-known spike-and-slab priors to the coefficients of nodal attributes. Since George and McCulloch (1993) used a latent binary vector to indicate the presence of covariates and placed spike-and-slab priors on the coefficients conditional on the latent vector, many selection procedures with similar structures have been proposed (see Ishwaran and Rao 2005 and references therein). While spike-and-slab priors have been widely used in practice for their simple interpretability, the development of their theory has been relatively slow, especially in high-dimensional settings. Only in recent years has such work appeared, studying the theoretical properties in linear, generalized linear, or partially linear models; see Johnson and Rossell (2012), Liang, Song, and Yu (2013), Narisetty and He (2014), Narisetty, Shen, and He (2019), and Wang, Cai, and Li (2021). However, there is little work in the setting of high-dimensional network models.
In this article, we first propose a Bayesian approach to selecting nodal attributes in the presence of degree heterogeneity in (1). We develop an estimation procedure based on Gibbs sampling, and further establish model selection consistency under high dimensionality when log p = o(n). Since the theoretical properties are derived under the relaxed assumption e^{||δ^0||_∞} ≲ n^{τ_0} on the nodal popularity parameters, the results are applicable to both dense and sparse networks. Here sparsity refers to the general sub-linear definition (Newman 2018), under which the expected degree is allowed to grow more slowly than linearly in n, rather than the one commonly used for block recovery in stochastic block models. To reduce the computational cost required by large-scale networks, we develop a working model within the Gibbs sampling framework. The working model only uses the current collection of active or influential nodes, those with bounded δ^0_i, to update parameters, and plugs in a proxy probability as the likelihood for all other pairs. Model selection consistency is also established for this computationally efficient alternative.
The rest of the article is organized as follows. In Section 2, we introduce the newly proposed Bayesian approach to selecting nodal attributes in the presence of degree heterogeneity. A modified model for large sparse networks is further studied in Section 3. Numerical studies are presented in Section 4, followed by a brief summary of the paper and some discussion in Section 5. We include some technical proofs of the theoretical results in the Appendix. Other technical proofs and additional numerical results are provided in the supplementary material of this article.

Model and Notation
We first introduce the notation used in the paper. For simplicity of presentation, hereafter all densities are conditional on X.
Consider an undirected network with n actors V = {1, . . ., n} and binary ties {y_ij}_{i,j∈V} generated by (1). The total number of possible connections is denoted by N = n(n − 1)/2. The covariates Z_ij = g(X_i, X_j) between nodes i and j have dimension p_n, which can grow with the network size n.
A length-p_n binary latent random vector γ is introduced to indicate the inclusion of covariates; that is, the jth element γ_j represents whether the jth covariate is included in the model (1 = present, 0 = not present). Therefore, the model space is fully specified by γ, and we use γ and M interchangeably as notation for models. The cardinality of model M, denoted by |M|, is the size of the model. We denote θ_M = (β_M, δ_n) ∈ R^{|M|+n} as the parameters under model M. Note that here δ_n is defined as δ_n := n^{−1/2} δ, to ensure that the design matrices of nodal popularity and nodal covariates are at the same scale. Following these notations, β_M and Z_{M,ij} are the subvectors of β and Z_ij, respectively, corresponding to model M. Let e_ij ∈ R^{1×n} be a constant matrix with ith and jth entries equal to 1 and all other entries equal to 0. Denote by T the stacked design matrix, in which we write the part involving the nodal features X as L and the remaining constant matrix as E. Thus, T_M ∈ R^{N×(|M|+n)} is the submatrix of T under model M. We propose a Bayesian subset modeling approach to select nodal attributes in social networks (BSM-net) as follows.
To achieve model selection consistency, we need to choose priors that depend on the network size, that is, with parameters diverging or converging with n (Narisetty and He 2014; Narisetty, Shen, and He 2019; Wang, Cai, and Li 2021). We use an independent Bernoulli distribution with probability q_n as the prior for each γ_j. With a diverging p_n, the preliminary inclusion probability q_n for each covariate should converge toward 0 at some rate. Each β_j has a mixture normal distribution. Conditional on γ_j = 1, β_j has a normal distribution with a relatively large variance σ²_{1,n}. This corresponds to a very wide and flat distribution, usually referred to as a slab prior. Its variance σ²_{1,n} depends on the network size and diverges at a certain rate as n goes to infinity. Conditional on γ_j = 0, β_j has a normal distribution with variance σ²_0. Since the choice of σ²_0 does not influence the asymptotic results, it can be chosen as a fixed value or allowed to depend on the network size. A normal distribution is proposed for each node-specific popularity parameter δ_{ni}. As the number of parameters in δ_n diverges with n as well, its variance σ²_{d,n} should also diverge to infinity at some rate. In practice, q_n, σ²_0, and σ²_{1,n} control the size of the selected model: a lower inclusion probability and a larger difference between the spike and slab variances result in sparser models.
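A minimal sketch of drawing from this prior hierarchy; the specific numeric rates below are illustrative placeholders, not the paper's recommended choices.

```python
import numpy as np

rng = np.random.default_rng(1)
p_n, n = 200, 100

q_n = 5.0 / p_n               # prior inclusion probability, shrinking as p_n grows
sigma2_1n = float(np.log(n))  # slab variance, diverging with n (illustrative rate)
sigma2_0 = 0.25               # spike variance, fixed

gamma = rng.binomial(1, q_n, size=p_n)        # gamma_j ~ Bernoulli(q_n)
sd = np.where(gamma == 1, np.sqrt(sigma2_1n), np.sqrt(sigma2_0))
beta = rng.normal(0.0, sd)                    # beta_j | gamma_j ~ N(0, sd_j^2)

sigma2_dn = float(np.log(n))                  # variance for delta_n, also diverging
delta_n = rng.normal(0.0, np.sqrt(sigma2_dn), size=n)
```

Coefficients drawn from the slab component are much more dispersed than those from the spike, which is what allows the posterior to separate active from inactive covariates.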

Estimation Procedure
Markov chain Monte Carlo (MCMC) is used to estimate the parameters. In each step we aim to update parameters via their full conditional distributions. In linear models, we can draw samples directly from the full conditionals owing to the conjugacy of the priors. Yet the conditionals in the model framework studied in this article are not standard distributions. Even though rejection sampling could be used, by generating samples from a proposal distribution and computing a likelihood ratio, the cost of evaluating the likelihood in networks is high, making this approach infeasible for large-scale problems. With the help of the scale-mixture-of-normals representation of the noise process in logistic regression (Andrews and Mallows 1974), we establish conditional conjugacy for updating β by introducing auxiliary variables (Holmes and Held 2006). The standard logistic regression can be decomposed into the hierarchical model (3). Sampling from the KS distribution is nontrivial, but the standard logistic distribution is well approximated by a scaled t-distribution w t_v with w = π√((v − 2)/(3v)) and v = 7.3 (Albert and Chib 1993; Pingel 2013). Moreover, t ∼ w t_v is equivalent to t | ψ ∼ N(0, ψ), ψ ∼ w² IG(v/2, v/2). Thus, the last several steps of (3) can be further approximated accordingly. The MCMC algorithm for (2), obtained by introducing the auxiliary variables {g_ij}_{i<j} and {ψ_ij}_{i<j}, is as follows: 1. Jointly update β and δ_n from a multivariate normal distribution. In each iteration, we divide β = (β_M, β_{M^c}) into the active group and the inactive group based on γ, where M is the collection of indices j with γ_j = 1 and M^c is the collection of indices j with γ_j = 0.
i. Update the active group β_M along with δ_n.
2. Update {g_ij}_{i<j} from truncated normal distributions.
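The scaled-t approximation to the logistic distribution used above can be checked numerically. This sketch verifies that w t_v with v = 7.3 and w = π√((v − 2)/(3v)) matches the logistic variance π²/3 exactly, and draws from the equivalent normal scale-mixture representation.

```python
import numpy as np

v = 7.3
w = np.pi * np.sqrt((v - 2.0) / (3.0 * v))

# Var(w * t_v) = w^2 * v / (v - 2), which equals the logistic variance pi^2 / 3.
var_scaled_t = w**2 * v / (v - 2.0)
var_logistic = np.pi**2 / 3.0

# Normal scale-mixture form: t | psi ~ N(0, psi), psi ~ w^2 * IG(v/2, v/2).
# An IG(a, b) draw is the reciprocal of a Gamma(shape=a, rate=b) draw.
rng = np.random.default_rng(2)
psi = w**2 / rng.gamma(shape=v / 2.0, scale=2.0 / v, size=100_000)
t = rng.normal(0.0, np.sqrt(psi))
```

The empirical variance of t is close to π²/3 ≈ 3.29, so the mixture reproduces logistic-like noise while keeping normal conditionals available for Gibbs updates.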
Remark 1 (Selection procedure). In the typical Bayesian variable selection approach, the model with the highest posterior probability is selected as the final model, referred to as the maximum a posteriori (MAP) model: M̂ = argmax_M Pr(M | Y). With the spike-and-slab prior, the posterior over the model space is usually reflected by the posterior probability of the latent variable γ. Alternatively, a more convenient way is to consider the marginal probability Pr(γ_j = 1 | Y). Specifically, one selects the jth covariate if Pr(γ_j = 1 | Y) is greater than a certain threshold. A threshold of 0.5 is a natural choice, known as the median probability model (MPM). It has been shown that the MPM has good predictive power (Barbieri and Berger 2004). Other data-driven criteria could also be used to determine the threshold, for example, AIC, BIC, and EBIC (Chen and Chen 2012). We explore the performance of the proposed procedures with different thresholds via a simulation study in Remark 7. Although the two approaches may produce different results in practice, they are asymptotically the same under strong selection consistency (Narisetty and He 2014).
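The MPM rule above can be sketched on hypothetical post-burn-in draws of γ; the inclusion probabilities below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
p_n, n_iter = 10, 2000

# Stand-in Gibbs output: rows are iterations, columns are covariates.
hypothetical_probs = np.array([0.95, 0.9, 0.1, 0.05, 0.02, 0.9, 0.03, 0.01, 0.8, 0.2])
gamma_draws = rng.binomial(1, hypothetical_probs, size=(n_iter, p_n))

# Marginal posterior inclusion probabilities Pr(gamma_j = 1 | Y),
# estimated by the proportion of iterations with gamma_j = 1.
incl_prob = gamma_draws.mean(axis=0)

# Median probability model: keep covariates whose probability exceeds 0.5.
mpm = np.flatnonzero(incl_prob > 0.5)
```

Replacing 0.5 with another threshold implements the alternative data-driven rules mentioned above.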

Model Selection Consistency
In this section, we study the theoretical properties of the proposed model in (2). The convergence rate of the maximum likelihood estimator (MLE) for a given overfitted model is shown in Lemma 2.1. In Lemma 2.2, we highlight the connection between the proposed Bayesian method and penalized methods. Finally, in Theorem 2.3, we show that the proposed model enjoys strong selection consistency. If the true model is denoted by M_0, strong selection consistency refers to Pr(γ = M_0 | Y) → 1 in probability, which means the posterior probability of the true model converges to 1 asymptotically. Such a property is also discussed in Johnson and Rossell (2012), Narisetty and He (2014), Narisetty, Shen, and He (2019), and Wang, Cai, and Li (2021) in different statistical settings.
Based on model (1), the log-likelihood under model M is given by ℓ_M(θ_M) = Σ_{i<j} [ y_ij T_{M,ij} θ_M − log(1 + exp(T_{M,ij} θ_M)) ], where T_{M,ij} denotes the row of T_M for the pair (i, j). The score function and the negative Hessian matrix are obtained and denoted by S_M(θ_M) and H_M(θ_M), respectively. Before we present regularity conditions for the theoretical results, we need to introduce more notation. The notation θ̂_M is used for the MLE of θ under model M. We denote M_0 as the true model and θ^0 as the true parameter vector. If a matrix M is positive definite, we use λ_min(M) and λ_max(M) to denote its minimum and maximum eigenvalues. The following conditions are used to establish the theoretical results.
Condition A (On the dimension and priors). log p_n = o(n) and ...

Condition C (On the regularity of the design).
(C-1) The support of Z is a compact subset of R^{p_n} and ..., where P_{T_M}(θ^0_M) is the projection matrix; and for any α ∈ span(P_{T_M}(θ^0_M)), there exist C > 0 and N_C such that for any n ≥ N_C, we have ...

Condition A states the convergence and divergence rates of the parameters in the priors and of the dimension of the variables. Condition B-1 guarantees that the minimal signal of the active attributes does not vanish. The compactness assumption on the supports of β and Z is also made in Graham (2017) and Yan et al. (2019).
As stated in Condition B-2, the true popularity parameter δ^0 does not have to be bounded. A bounded δ^0_i implies that the expected degree of the ith node grows at the order O(n). Under Condition B-2, the growth rate of the degree can be relaxed to O(n^{1−τ_0}); thus, the proposed model is also applicable to sparse networks. Condition B-3 restricts the size of the active covariate set.
Conditions C-1 and C-2 place constraints on the design matrix. Condition C-2 also imposes requirements on the eigenvalues, which guarantee the good behavior of the negative Hessian matrix around the true values. However, this condition is weaker than the bounded-eigenvalue assumption required by lasso-type L1 penalization methods. Condition C-3 is also assumed in Narisetty, Shen, and He (2019). This is a mild assumption, as such a C > 0 always exists due to the sub-Gaussianity of U.
Remark 2. Note that the number of possible edges is of order n². With bounded δ^0, this rate agrees with the common root-n consistency for fixed dimension |M|, up to a logarithmic term.
Allowing an unbounded true parameter δ^0 slows the convergence rate. To make the parameter estimation consistent under a fixed dimension |M|, that is, ||θ̂_M − θ^0_M||_∞ = o_P(1), δ^0_i can diverge to −∞ at most at the rate −log(n). For nodes with δ^0_i = −c log(n), the expected degree grows at the rate n^{1−c} for 0 < c < 1. This rate is faster than the logarithmic degrees in stochastic block models (Abbe 2017) because the task here is parameter estimation and model selection, not community detection.
Lemma 2.2 (Connection with penalized methods). Denote the likelihood ratio between any candidate model M and the true model M_0 by PR(M, M_0). For any overfitted model, ... for some h > 0, where m_n = c_2 log n is defined in Condition B-3.
Theorem 2.3 states the strong selection consistency of the proposed model. Note that this property is stronger than the Bayes factor consistency and the model selection consistency considered for penalized methods, which show Pr(γ = M_0 | Y) > sup_{M ≠ M_0} Pr(γ = M | Y). Since the posterior probability of the true model may still be small under ordinary model selection consistency, it can be hard to identify the true model within finitely many MCMC updates. With strong selection consistency, however, the MCMC is guaranteed to stay on the true model with high probability. The detailed proof is provided in the supplementary material.
Remark 4 (Restriction on the candidate model size). Note that |γ| ≤ m_n + |M_0|, where m_n = c_2 log n, is the condition on the candidate model size for model selection consistency. Models of large size can be ruled out by adjusting the prior of γ to assign zero probability when the model size |γ| exceeds some threshold; that is, Pr(γ) = 0 when |γ| is larger than c m_n, with a constant c large enough. This is commonly assumed in the Bayesian variable selection literature (Narisetty, Shen, and He 2019; Liang, Song, and Yu 2013).

Nodal Attributes Selection for Large Networks with Degree Heterogeneity
The model specified in Section 2 is applicable to large-scale social networks; however, its computational cost becomes very high as n grows. On the other hand, large social networks are usually sparse, with a subset of nodes having large numbers of connections while the majority of nodes have only very few connections. In this section, we take advantage of this sparsity and propose an alternative algorithm to reduce the computational cost.
We assume the network consists of two kinds of nodes. As the network size increases, one group of nodes tends to have less influence on the network; these are referred to as inactive nodes. For a node i in this group, the true nodal popularity δ^0_i is assumed to be of the order −log n. Thus, the expected degree of an inactive node cannot increase arbitrarily as n increases. For a node i in the active set, the true nodal popularity δ^0_i is assumed to be bounded, that is, O(1). We identify the set of active nodes within the MCMC iterations, and use only these nodes to update the parameters. A toy example is shown in Figure 1, where we start with the full graph and gradually trim it down to a smaller, denser subgraph on which the parameters are updated.
Remark 5 (Comparison with estimation on a randomly sampled subgraph). By updating parameters based on the active node set, we may lose some statistical power from using a smaller network. However, this has advantages over an estimation procedure based on a randomly sampled subgraph of the same size. Our motivations are: (i) in the generative model (1), for pairs involving inactive nodes, δ^0_i + δ^0_j dominates the part involving the nodal covariates, so such nodes do not carry much information for the estimation of β; (ii) by Lemma 2.1, conditional on the active node set with bounded nodal popularity parameters, the MLE has a faster convergence rate, and thus the minimal signal required to identify the true nodal covariates is smaller than for a model based on a randomly sampled subgraph of the same size.
Another binary latent n-dimensional vector η is introduced to indicate the status of the nodes, with η_i showing whether the ith node is active (1) or not (0). For nodes i and j in the active set, that is, η_i = η_j = 1, we apply the original likelihood Pr(Y_ij = 1 | γ, η, β, δ). For all other cases, we plug in a proxy probability expit(−τ log n) as the likelihood for the connection of nodes i and j. This working model is denoted BSM-net.sp, and the details are given below.
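The working likelihood can be sketched as follows, with a hypothetical helper `pair_probability` (the function name, τ, and all inputs are illustrative): active pairs use model (1), while any pair touching an inactive node receives the proxy expit(−τ log n) = 1/(1 + n^τ).

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_probability(i, j, eta, z_ij, beta, delta, n, tau=1.0):
    """Connection probability under the BSM-net.sp working model (sketch)."""
    if eta[i] == 1 and eta[j] == 1:
        # Both nodes active: use the original model (1) likelihood.
        return expit(z_ij @ beta + delta[i] + delta[j])
    # Otherwise: proxy probability 1 / (1 + n^tau) = expit(-tau * log n).
    return 1.0 / (1.0 + n**tau)

# The algebraic identity behind the proxy:
n, tau = 500, 1.2
proxy = 1.0 / (1.0 + n**tau)
```

Because the proxy does not depend on β or δ, pairs involving inactive nodes drop out of the parameter updates, which is the source of the computational savings.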
Similar to BSM-net in (2), the priors are network-size dependent to ensure selection consistency. Conditional on γ, we place a spike-and-slab prior as described in (2). The marginal probability q_{dn} of being an active node converges to 0 as n increases. Conditional on node i being active, η_i = 1, we place a normal prior with a relatively large variance σ²_{d,n}, so the prior is diffuse and almost flat. For an inactive node i with η_i = 0, a fixed prior is given; here we simply use the logarithm of a truncated normal distribution with a small fixed variance. Lastly, conditional on the subgraph consisting of active nodes, we employ the original likelihood for the edges, while for all other pairs a proxy connection probability 1/(1 + n^τ) is plugged in, where τ > 0 is a positive constant. As discussed for (2), these parameters control the size of the selected model and of the selected active set in empirical studies: a lower inclusion probability or a larger difference in the spike-and-slab variances yields a smaller model and a smaller active node set. See Section 4 for practical choices of the parameters. The estimation procedure is similar to that of BSM-net; to save space, we give the detailed steps in Section S.5 of the supplementary materials.

Selection Consistency
We use V as the notation for a node set, and V_0 for the true active node set. The notation θ_{M,V} is used for the vector of parameters on node set V given model M. Accordingly, the negative Hessian matrix based on node set V and model M, of dimension (|V|+|M|) × (|V|+|M|), is denoted H_{M,V}. We introduce additional regularity conditions as follows.
Condition D (On the dimension and priors). log p_n = o(n^s) and ...

The conditions assumed in this section are similar to those in Section 2; here we highlight some differences. Condition E-1 on the minimal signal of β is weaker than Condition B-1, simply replacing n with n^s. This is due to the fact that the convergence rate of the MLE on active nodes is faster than that on a randomly selected subset of the same size. Condition E-2 gives the identifiability condition for the active nodes. Compared with Condition C-2, we add more conditions on the eigenvalues of the negative Hessian matrix evaluated on subsets of active nodes. These conditions are not strong, as the true parameters on those subsets are bounded.

Theorem 3.1. Suppose |V_0| ≍ n^s. For any active node i ∈ V_0, there exists a constant c_5 such that c_5 ≤ δ^0_i; for any inactive node i ∉ V_0, there exists τ_min > 0 such that n^{−τ_0} ≤ e^{δ^0_i} ≤ n^{−τ_min}. Assume the size of the true set of nodal covariates satisfies |M_0| ≤ M and log p_n = o(n^s) for the constants M and s given in E-3. Then we have ..., where c_8 is a positive constant and m_{2n} ≍ n^t for s < t < (s + τ_min)/2 is defined in Condition E-3.

Theorem 3.1 gives strong selection consistency for the computationally efficient modification. Meanwhile, it shows that, out of |V_0| ≍ n^s influential nodes, the number of missed influential nodes is at most c_8 log n. Note that super-active nodes with δ^0_i → ∞ are usually rare or negligible in real-world networks. In cases where they are not negligible, our proposed framework can be generalized by changing the binary latent indicator η_i into a multinomial variable with η_i = 0 (inactive), η_i = 1 (active), and η_i = 2 (super-active), and updating the parameter β on the active nodes with η_i = 1. The proof of Theorem 3.1 is given in the Appendix.
Remark 6 (Trade-off between BSM-net.sp and BSM-net). By identifying active nodes and updating the parameter β based on the active node set, the computational burden for a large network is greatly alleviated at some cost of statistical power. It can be shown that the computational complexity of BSM-net.sp per iteration is O(n^{2c} p_n ∨ n²) for a small positive constant c ∈ (0, 1), which is smaller than the complexity O(n² p_n ∨ n²) of BSM-net. Figure 2 shows a computational time comparison with n varying from 50 to 200 under a fixed p = 1000. However, the convergence rate of the individual posterior ratios summed over subsets of the model space for BSM-net.sp is slower than that of BSM-net for under-fitted models (see Section S.4 of the supplementary materials); thus BSM-net is less likely to miss active variables in practice. This is also reflected in the identifiability conditions for active variables (Conditions B-1 and E-1), as the minimal signals required by BSM-net are weaker. Moreover, for BSM-net.sp, we need the existence of a densely connected sub-network, a stronger condition on the dimensionality, and a slightly stronger condition on the eigenvalues of the negative Hessian matrix evaluated on subsets of active nodes.

Simulation Study
In this section we study the performance of the proposed models in (2) and (7) under different settings, based on simulated networks with p = 1000 and n = 50. We compare the proposed algorithms with existing methods, including LASSO (Tibshirani 1996), SCAD (Fan and Li 2001), and MCP (Zhang 2010), with the regularization parameter selected by the BIC criterion (Schwarz 1978). We generate two types of X: Case 1, X ∼ N(0, Σ) with autoregressive covariance matrix Σ_ij = ρ^{|i−j|}; Case 2, X ∼ N(0, Σ) with compound symmetric covariance matrix Σ_ij = 1 for i = j and Σ_ij = ρ for i ≠ j. The correlation ρ is taken as 0.25 or 0.75, indicating low or high correlation. The nodal attribute Z_ij is taken to be the element-wise product Z_ij = X_i ∘ X_j. The true model is set to M_0 = (1, 2, 5, 8) with coefficients β_{M_0} = (2.1, −2.4, 2.7, −3.0). In our setting, we generate δ in a way that makes δ and X correlated. We select the top |V_0| = 30 nodes with the highest values of X_1 + ε as active nodes, where ε ∼ N(0, 3). For an inactive node i, δ_i ∼ Uniform(−1, 1), while for an active node j, δ_j ∼ Uniform(−1, 1) + s × Uniform(−8, −4), where s ∈ {0, 1}. Different values of s lead to networks with different sparsity (s = 0 for dense networks, s = 1 for sparse networks). See Figure 3 for examples of simulated networks under different values of s. In each case, we run 5000 iterations and treat the first 3000 as burn-in samples. We report simulation results based on the MPM for our proposed methods; models selected by MAP or MPM yield very similar results (see the comparison in Table S.1 of the supplementary materials).
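The two covariance structures and the pairwise covariates above can be sketched as follows (with a smaller d than the paper's p = 1000, purely for speed; all other values follow the stated design).

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, rho = 50, 100, 0.25

# Case 1: autoregressive covariance, Sigma_ij = rho ** |i - j|.
idx = np.arange(d)
Sigma_ar = rho ** np.abs(idx[:, None] - idx[None, :])

# Case 2: compound symmetry, Sigma_ij = rho off the diagonal, 1 on it.
Sigma_cs = np.full((d, d), rho)
np.fill_diagonal(Sigma_cs, 1.0)

X = rng.multivariate_normal(np.zeros(d), Sigma_ar, size=n)

# Pairwise covariates as element-wise products, Z_ij = X_i * X_j.
iu, ju = np.triu_indices(n, k=1)
Z = X[iu] * X[ju]    # one row per node pair (i < j)
```

Swapping `Sigma_ar` for `Sigma_cs` switches between Case 1 and Case 2.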
Here we specify several hyper-parameters. We use σ_0 = 0.5 for our proposed method and choose q_n = q_{1n} = Pr(γ_j = 1) such that Pr(Σ_{j=1}^{p_n} γ_j > K) = 0.1 for a prespecified value of K. Its value can be set as a preliminary guess of the size of the active covariate set or derived by data-driven methods; we set K = 10 in our simulation studies. Results for other choices of K are given in Table S.1. Similarly, q_{dn} = Pr(η_i = 1) is chosen such that Pr(Σ_{i=1}^n η_i > 50) = 0.1. The variance of the diffuse prior on δ is set to max{log n/20, 1}. For BSM-net, the variance of the diffuse prior on β is set to σ_{1,n} = max{exp(1.01 n (log n)^{1.01}/600)/n, log n}, while for BSM-net.sp it is σ_{1,n} = max{exp((2.01 + 2ŝ) n^ŝ (log n)^{1.01}/600)/n, log n}, where ŝ is our preliminary guess of s such that n^ŝ is the number of active nodes. In our study, ŝ is chosen so that n^ŝ = 50. The variance of the truncated normal is σ_{d0} = 1.96 exp(−(τ̂/2) log n), where τ̂ = logit(p̂) and p̂ is the 80th percentile of the degree sequence divided by the network size n.
Models are evaluated by several criteria over 50 replications. p^max_{M^c_0} and p^min_{M_0} are, respectively, the average of the maximal marginal posterior probabilities over the truly inactive covariates and the average of the minimal marginal posterior probabilities over the truly active covariates. The proportion of replications in which exactly the true model is selected is denoted p_{M_0=M}; p_{M_0∈M} gives the proportion of replications in which all true active variables are selected. TPR_M and TPR_V are the true positive rates for selected variables and nodes, while FDR_M and FDR_V give the corresponding false discovery rates over replications. TPR^{|M|=4}_M measures the TPR given that the selected model size is 4, which reflects the ability to rank active variables above inactive ones. Model error (ME) is given by ||β̂ − β||_2.
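The selection-quality criteria TPR and FDR can be computed as in this sketch; the helper name and the example selected set are illustrative.

```python
def selection_metrics(selected, truth):
    """True positive rate and false discovery rate for a selected model."""
    selected, truth = set(selected), set(truth)
    tp = len(selected & truth)
    tpr = tp / len(truth)                               # fraction of truth recovered
    fdr = (len(selected) - tp) / max(len(selected), 1)  # fraction of spurious picks
    return tpr, fdr

# True model M_0 = {1, 2, 5, 8} as in the simulation design.
tpr, fdr = selection_metrics(selected=[1, 2, 5, 8, 9], truth=[1, 2, 5, 8])
```

Averaging these quantities over the 50 replications yields the TPR_M and FDR_M columns of the tables.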
The simulation results are summarized in Tables 1 and 2. The comparisons suggest that our proposals outperform the existing approaches, with much higher exact-fit rates regardless of the correlation structure in X and the level of correlation. The existing methods tend to overfit, with much larger false discovery rates (FDR), especially under Case 2 with the compound symmetric covariance matrix. The difference is more pronounced under the setting with a higher correlation of 0.75 for sparse networks (s = 1): the exact-fit rates of the existing methods are near 0, while those of our proposed methods are above 60%. Although the existing methods have a higher probability of selecting models that include the true model, they give lower TPR^{|M|=4}_M for sparse networks (s = 1), failing to rank important variables above inactive ones.
In terms of the comparison between BSM-net and BSM-net.sp, BSM-net works consistently well overall, while BSM-net.sp performs well in most cases except when the sparsity assumption is violated and the correlation is high. In Case 2, both models work well when ρ = 0.25, but differences appear when ρ = 0.75 and s = 0. Specifically, BSM-net.sp gives a larger model error, and its exact-fit rate is 62%, below the 86% given by BSM-net. However, the trade-off is in computational time (see Figure 2 for a comparison under various network sizes), where the differences become more significant for larger and sparser networks. It is also worth noting that in Figure 3, this setting corresponds to networks with s = 0 and density 0.50, which is quite unrealistic for real-world social networks. Overall, we recommend using BSM-net.sp for large-scale social networks. In our real data examples, with network density varying from 0.021 to 0.067, we apply only BSM-net.sp for its fast computation and good performance on sparse or moderately sparse networks.
To save space, we include additional numerical results in the supplementary materials. Simulation results for different choices of K in Table S.1 imply that the performance of the proposed algorithms is not very sensitive to the choice of K. A comparison with larger values of n and p (including low-dimensional cases) is given in Tables S.2-S.4, which depict simulation results for BSM-net and BSM-net.sp for Cases 1-2 under (n, p) = (100, 1000), where X and δ are generated independently. Results in Tables S.2-S.4 all show that the proposed algorithms perform well.
Remark 7 (Impact of using thresholds other than 0.5). With the MPM, the ith variable is selected if its posterior inclusion probability Pr(γ_i = 1 | Y) ≥ 0.5. We further investigate the impact of using different thresholds in the simulation setting of Case 2 with ρ = 0.75 and s = 1. Figure 4 plots the exact-fit rate, true positive rate and false discovery rate at threshold values from 0 to 1. It shows that more spurious variables are likely to enter the model at smaller threshold values (i.e., a high false discovery rate), while important variables are easily missed at larger threshold values (a slight drop in the true positive rate). In balancing the two, as well as maintaining a stable exact-fit rate, 0.5 seems a reasonable and good choice.
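The MPM selection rule above amounts to thresholding estimated inclusion probabilities. A minimal sketch (our illustration; `mpm_select` is a hypothetical name):

```python
import numpy as np

def mpm_select(inclusion_probs, threshold=0.5):
    """Median probability model: keep variables whose estimated marginal
    posterior inclusion probability Pr(gamma_i = 1 | Y) meets the threshold."""
    probs = np.asarray(inclusion_probs)
    return np.flatnonzero(probs >= threshold)

# In practice inclusion_probs would be the average of the gamma indicator
# draws over (post burn-in) Gibbs iterations, e.g. gamma_draws.mean(axis=0)
# for a (n_iter, p) array of 0/1 indicators.
```

Lowering the threshold admits more spurious variables (higher FDR), while raising it risks dropping active ones, consistent with the pattern in Figure 4.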

Real Data Analysis
Example 1 (Facebook Friendship Network). We first analyze a publicly accessible friendship network dataset downloaded from the Stanford Large Network Dataset Collection and considered in McAuley and Leskovec (2012). The dataset was collected by surveying participants using a Facebook app. It includes anonymized users' profile features and an undirected network of friendship connections between users. The full dataset consists of multiple networks of different sizes. We focus on the first network, with 347 individuals and 224 nodal features. These 224 nodal features are 0-1 dummy-coded features derived from 21 survey questions about users' profiles, including birthday, education, first and last names, language, hometown, gender and work status. This network has several notable features. First, it exhibits substantial degree heterogeneity, with degrees varying from 0 to 77. Second, it is sparse: the graph density, the ratio of the number of existing links to the number of possible links, is 0.042. This motivates us to apply the working model in Section 3 to identify nodal covariates contributing to friendship connections. We construct the kth entry of the nodal attributes {Z_ij}_{i<j} to be 1 if individuals i and j have the same answer for feature k, and 0 otherwise.
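The match-indicator construction of Z_ij can be sketched as follows (our illustration; `pairwise_match_attributes` is a hypothetical name):

```python
import numpy as np

def pairwise_match_attributes(X):
    """Given an (n, p) 0-1 feature matrix X, build the pairwise covariates
    Z_ij whose kth entry is 1 when nodes i and j give the same answer to
    survey feature k, stacked over all pairs i < j."""
    n, p = X.shape
    iu = np.triu_indices(n, k=1)               # index pairs (i, j) with i < j
    Z = (X[iu[0]] == X[iu[1]]).astype(int)     # (n*(n-1)/2, p) match indicators
    return Z, iu
```

For the Facebook network this yields a 347·346/2 by 224 design matrix of pairwise covariates.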
The choice of hyper-parameters is the same as in the simulation study. Table 3 reports details of the 8 nodal attributes selected by the MPM. The selected variables can all be reasonably explained: people are more likely to make friends with classmates (education;classes and education;year) and to connect with others holding the same degree (education;degree); people generally make friends of the same gender (gender); hometown is obviously important; last name indicates families and relatives, and it makes sense that first name is not selected; location is selected, as individuals are prone to socialize with people who are physically accessible; work;employer is chosen, reflecting coworker relationships. It turns out that 242 of the 347 nodes are detected as active, marked in orange in Figure 5. The histograms in Figure 6 show the degree distributions for detected active and inactive nodes; active nodes generally have larger numbers of connected edges. Example 2 (Paper citation network). As the second example, we consider a citation network, where each paper serves as a node and an edge is declared between papers i and j if either paper cites the other. We apply the proposed method to a publicly available citation network from AMiner (Tang et al. 2008). The citation data are extracted from bibliography sources in the field of computer science, including DBLP, ACM, and MAG. We use the first version of the citation dataset, consisting of 629,814 papers and 632,752 citations. Extra information about each paper is also available, including abstract, authors, year, publication venue and title.
Papers that share common topics are likely to cite each other, so we construct nodal attributes using the latent topics of the papers. We apply a topic modeling analysis to the paper abstracts. We remove papers with missing abstracts and low citations and end up with an undirected network with n = 476. To conduct a reliable topic modeling analysis, we randomly sample 4524 additional papers beyond those n = 476 papers and fit a latent Dirichlet allocation (LDA) (Blei, Ng, and Jordan 2003) model on the 5000 paper abstracts. The number of latent topics is prespecified to be p = 500; this number should increase with the network size, as more papers in the corpus bring richer and more diversified content. The jth entry of the node-level attribute vector X_i for the ith node is taken to be the posterior probability that paper i belongs to topic j. Nodal covariates are constructed as Z_ij = X_i X_j based on standardized X. Details of the selected leading topics are presented in Table 4. The selected topics are excellent representatives of mainstream topics in computer science: security and privacy, animation and robotics, and geometric computing. See Figure 7 for a visualization of the paper citation network, where 126 influential papers are detected and marked in orange.
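Reading Z_ij = X_i X_j as the elementwise product of standardized topic-proportion vectors (our interpretation of the construction above), the covariates could be formed as follows; `topic_interaction_covariates` is a hypothetical name, and the topic proportions X would come from a fitted LDA model (e.g., the document-topic posterior probabilities).

```python
import numpy as np

def topic_interaction_covariates(X):
    """Standardize topic proportions column-wise, then form pairwise
    covariates Z_ij as the elementwise product X_i * X_j over pairs i < j.
    A large positive entry indicates both papers load heavily on that topic."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # column-wise standardization
    iu = np.triu_indices(X.shape[0], k=1)
    return Xs[iu[0]] * Xs[iu[1]]                # (n*(n-1)/2, p) covariates
```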
A third example, on the international mutual trade network, is presented, with details and results, in Section S.7 of the supplementary materials.

Discussion and Conclusion
In this article, we propose a variable selection procedure for high-dimensional nodal attributes of network data in the presence of degree heterogeneity. We study a class of models in which the probability of a connection between nodes is determined by their nodal popularity parameters and high-dimensional nodal attributes. Since the boundedness assumption on the true nodal popularity parameters δ_0 is unrealistic for large sparse social networks in practice, we construct a Bayesian subset modeling approach to selecting nodal attributes and prove selection consistency when log p = o(n) under the relaxed assumption that e^{||δ_0||_∞} ≲ n^{τ_0}. Moreover, by incorporating network sparsity and degree heterogeneity, we further develop a new procedure that reduces the computational cost for large social networks and establish its selection consistency as well. Both procedures are implemented via Gibbs sampling. Specifically, in the computationally efficient version, active nodes are detected in each iteration, and the parameters are updated based only on the dense sub-graph consisting of active nodes. A common challenge of MCMC algorithms is that convergence to the global optimum is not guaranteed when the target posterior is multi-modal. Despite this concern, our simulation results have shown satisfactory performance in identifying the true model. Additionally, we have validated the simulation results and the real data analysis using multiple well-separated starting points and observed very similar values at convergence. In practice, researchers may consider aggregating results from multiple initializations. A full investigation of this issue is beyond the scope of this paper and would be an interesting topic for future research.
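The model class described above can be made concrete with a small simulation sketch. This is our own illustration under the reading that the connection probability follows a logistic link, logit P(y_ij = 1) = Z_ij' β + δ_i + δ_j; `simulate_network` is a hypothetical name, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_network(Z, beta, delta):
    """Draw an undirected adjacency matrix with homophily and degree
    heterogeneity: logit P(y_ij = 1) = Z_ij' beta + delta_i + delta_j.
    Z is (n*(n-1)/2, p), one row per pair i < j in triu order."""
    n = len(delta)
    iu = np.triu_indices(n, k=1)
    eta = Z @ beta + delta[iu[0]] + delta[iu[1]]   # linear predictor per pair
    prob = 1.0 / (1.0 + np.exp(-eta))              # logistic link
    y = rng.binomial(1, prob)                      # Bernoulli ties
    Y = np.zeros((n, n), dtype=int)
    Y[iu] = y
    return Y + Y.T                                 # symmetric, zero diagonal
```

Large negative popularity parameters δ_i produce sparse networks, in line with the growth condition e^{||δ_0||_∞} ≲ n^{τ_0} discussed above.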
In this article, we focus on examining the performance of the proposed procedures for networks of moderate size n with possibly high-dimensional p. It is of interest to consider large-scale networks. As discussed in Remark 6, the computational complexity may increase dramatically with n, since the number of possible edges of a size-n network is N = n(n − 1)/2. To apply the proposed procedures to large-scale networks, one may consider network partitioning by imposing a certain network structure, such as a stochastic block model, and applying the proposed method to smaller, relatively self-contained sub-graphs. On the other hand, the sparsity condition imposed in Section 3 implies that the number of active edges within each block or sub-graph cannot be too small in order to achieve consistency of the resulting parameter estimates. Further study of applications to very large-scale networks is beyond the scope of this work and would be a good topic for future research.

Appendix
In the appendix, we include the proofs of Lemma 2.1 and Theorem 3.1. To save space, the remaining proofs of Lemma 2.2 and Theorem 2.3, as well as additional numerical results, are provided in the supplementary file.
Proof of Lemma 2.1. We prove this lemma using techniques related to the proof of Theorem 1 in Yan et al. (2019). First, for a given model M_0 ⊂ M, by replacing κ with M_z M_β |M| in the proof of Theorem 1 in Yan et al. (2019), we have

where the third equality follows by Condition C-2. It follows by Condition C-2 that

Proof of Theorem 3.1. We divide the model space into three disjoint parts: overfitted models, under-fitted models, and exact-fitted models missing a large set of active nodes. Notice that the whole node set V can be split into three disjoint parts V_0 ∩ V, V_0/(V_0 ∩ V) and V/(V_0 ∩ V). Denote the size of each part as s_1 = |V_0 ∩ V|, s_2 = |V_0/(V_0 ∩ V)| and s_3 = |V/(V_0 ∩ V)|, so that s_1 + s_2 = |V_0| and s_1 + s_2 + s_3 = O(m_{2n}). It is worth noting that, since s_1 + s_2 = |V_0|, the maximum max{s_1, s_2} is of the same rate as |V_0|. In each of the following cases, we discuss the difference in likelihood over these three node sets.
As for V/(V_0 ∩ V), l^(1) We next deal with under-fitted models P_2, where there exists j ∈ M_0 such that j ∉ M.
And we have On the node set V ∩ V_0, similar to the proof of Theorem 2.3, denote M* = M ∪ M_0. Let θ̃_{M*, V∩V_0} be the vector having θ̂_{M,V} for (β_M, δ_{V∩V_0}) and 0 for all remaining entries; then by Condition E-1 on the minimal signal, we have For V/(V_0 ∩ V), similar to (A.2), we have l^(1) ≤ l^(1)_{V/(V_0∩V)}(θ^0_{M*, V∩V_0}) + O_P(1) s_3(s_3 + 2s_1 − 1) n^{−τ_min} log n.
Consider V_0/(V_0 ∩ V); similar to (A.1), it follows that l^(2) Lastly, we consider exact-fitted models missing a large set of active nodes, P_3. Under P_3, we have s_2 ≥ c log n. On the node set V/(V_0 ∩ V), by (A.2), we have l^(1) O_P(1) s_3(s_3 + 2s_1 − 1) n^{−τ_min} log n the true parameters under network size n, where δ_0^n = n^{−1/2} δ_0. The notation a_n ≲ b_n or b_n ≳ a_n means a_n = O(b_n), and a_n ≺ b_n or b_n ≻ a_n means a_n = o(b_n). Similarly, a_n ≍ b_n represents a_n ≲ b_n and a_n ≳ b_n, while a_n ∼ b_n means a_n/b_n → 1. The n × n identity matrix is denoted I_n. The spectral norm of a matrix M, that is, its largest singular value, is denoted ||M||.
Lemma 2.1 (Convergence rate of the MLE). Assume that Condition B-2 holds for any overfitted model M (i.e., M_0 ⊆ M) with size |M| ≤ m_n + |M_0|, where m_n = c_2 log n is defined in Condition B-3. Suppose that the supports of β and Z are compact sets with ||β||_∞ ≤ M_β and ||Z||_∞ ≤ M_z. Then, as n → ∞, the MLE θ̂_M uniquely exists and satisfies

Figure 1. A toy example showing the process of trimming the network within the iterations of MCMC. Active nodes are shown in orange; inactive nodes in blue.
Condition E (On the true model and identifiability).
(E-1) On β: c_4 n^{−s/2} log n ≤ min_{j∈M_0} |β_0^j| for a large enough c_4 > 0. The support of β is a compact subset of R^{p_n} and ||β||_∞ ≤ M_β.
(E-2) On δ: For active nodes i ∈ V_0, there exists a constant c_5 such that c_5 ≤ δ_0^i. For inactive nodes i ∉ V_0, there exists τ_min > 0 such that n^{−τ_0} ≲ e^{δ_0^i} ≲ n^{−τ_min}. The proxy probability in the working model satisfies τ ≥ τ_min.
(E-3) On the size of the support: |M_0| ≤ M for a large enough M, and |V_0| = n^s = o(m_{2n}) for some s > 0, where m_{2n} ≍ n^t for s < t < (s + τ_min)/2.
Condition F (On the regularity of the design). For any overfitted model M_0 ⊆ M with size |M| ≤ M + |M_0| and node set V with size |V| = v ≤ m_{2n} + |V_0|, we have max
Theorem 3.1 (Strong selection consistency). Suppose Conditions C-1, C-3, D, E, and F hold for model (1) with the size of the true active nodes |V_0|

Figure 2. Comparison of computational time (in minutes) using BSM-net (solid red line) and BSM-net.sp (dotted blue line), with network size n increasing from 50 to 200, for Case 1 of the simulation setting with ρ = 0.75 and s = 1, fixing dimension p = 1000 and true active node size |V_0| = 30. Each point is the median of 10 replications.

Figure 4. Comparison of using different thresholds for model selection: exact-fit rate (solid blue line), false discovery rate (dotted red line) and true positive rate (dashed green line).

Figure 5. Facebook friendship network with 347 individuals. The graph density is 0.042, and node degree varies from 0 to 77. Among the 347 nodes, 242 are detected as active (orange) and 105 as inactive (blue).

Figure 6. Histograms showing degree distributions for detected active (orange) and inactive (blue) nodes in the Facebook network.

Figure A.1. Visualization of the node sets V_0/(V_0 ∩ V), V_0 ∩ V and V/(V_0 ∩ V), where the red area stands for the true active node set and the blue area for an arbitrary node set.
Theorem 2.3 (Strong selection consistency). Assume that e^{||δ_0||_∞} ≲ n^{τ_0} for some constant 0 < τ_0 < 1/24 holds for the model in (1). Suppose log p_n = o(n) and the size of the true model is bounded by |M_0| = o(log n). Then, under regularity Conditions A, B, and C, we have Pr
D_M is a diagonal matrix with the first |M| diagonal values equal to σ_{1,n}^{−2} and the remaining n values equal to σ_{d,n}^{−2}.
Remark 3. In (6), the term l_n(θ̂_M) − l_n(θ̂_{M_0}) represents the difference in goodness of fit between the two models, and the term θ̂_M' D_M θ̂_M acts like an L_2-regularizer. The first term, involving |M| − |M_0|, can be interpreted as an L_0-penalty on the model size, which encourages sparser models.

Table 1. Summarized simulation results for Case 1: p = 1000, n = 50, and X has an autoregressive correlation matrix. Reported values are means of different performance measures averaged over 50 replications. Details of the measures are provided in Section 4.1.

Table 2. Summarized simulation results for Case 2: p = 1000, n = 50, and X has a compound symmetric correlation matrix. Reported values are means of different performance measures averaged over 50 replications. Details of the measures are provided in Section 4.1.

Table 3. Eight nodal attributes selected by the MPM for the Facebook network. NOTE: For each attribute, the feature name, posterior marginal inclusion probability, estimated coefficient and a 95% credible interval are reported.

Table 4. Selected topics for the paper citation network. Three leading topics with positive coefficients are selected by the MPM. For each selected topic, key words, posterior marginal inclusion probability, estimated coefficient and a 95% credible interval are reported.

Figure 7. Citation network with n = 476 papers. The graph density is 0.021 and node degree varies from 4 to 36. Among the 476 papers, 126 are detected as active (orange) and 350 as inactive (blue).