Directed Community Detection With Network Embedding

Abstract Community detection in network data aims at grouping similar nodes sharing certain characteristics together. Most existing methods focus on detecting communities in undirected networks, where similarity between nodes is measured by their node features and whether they are connected. In this article, we propose a novel method to conduct network embedding and community detection simultaneously in a directed network. The network embedding model introduces two sets of vectors to represent the out- and in-nodes separately, and thus allows the same node to belong to different out- and in-communities. The community detection formulation equips the negative log-likelihood with a novel regularization term to encourage community structure among the node representations, and thus achieves better performance by jointly estimating the node embeddings and their community structures. To tackle the resultant optimization task, an efficient alternating updating scheme is developed. More importantly, the asymptotic properties of the proposed method are established in terms of both network embedding and community detection, which are also supported by numerical experiments on simulated and real examples.


Introduction
Network data have attracted long-running and continuing interest in the literature (Fienberg 2012; Zhang, Levina, and Zhu 2017; Li, Levina, and Zhu 2020), and have a wide spectrum of applications in various domains, ranging from social networks (Girvan and Newman 2002; Kossinets and Watts 2006) and information networks (Stanton-Salazar and Dornbusch 1995; Leskovec et al. 2008) to biological networks (Bader and Hogue 2003; Nepusz, Yu, and Paccanaro 2012). One of the key interests in network data is its community structure (Leicht and Newman 2008; Bickel and Sarkar 2016; Rohe, Qin, and Yu 2016), where nodes within the same community tend to share similar characteristics.
Most existing methods in the literature consider undirected networks, where nodes are linked with undirected edges and nodes in the same community are more tightly linked than those in different communities. These methods include the modularity method (Newman 2006), the spectral clustering method (Chung and Graham 1997; Von Luxburg 2007), the hierarchical clustering method (Joe and Ward 1963; Kaufman and Rousseeuw 2009), the stochastic block model (Holland, Laskey, and Leinhardt 1983; Zhao, Levina, and Zhu 2012; Sengupta and Chen 2018), and so on. Yet many real network data come with directed edges, indicating asymmetric information flow among nodes (Leicht and Newman 2008). For instance, in the Twitter network, a directed edge means one user is following the other but not the other way around, and celebrities and opinion leaders usually have a large number of followers but only follow a few other users. This asymmetry property calls for new community detection methods for directed networks, and it is of great interest to identify communities of nodes sharing similar patterns of sending and receiving edges (Rohe, Qin, and Yu 2016).

CONTACT Junhui Wang j.h.wang@cityu.edu.hk School of Data Science, City University of Hong Kong, Tat Chee Avenue, Kowloon Tong, Kowloon, Hong Kong. *The first two authors contributed equally to this work. Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/JASA.
To accommodate the asymmetry property, most existing methods convert directed networks into undirected ones and then apply the existing community detection methods for undirected networks, such as the symmetrization method (Satuluri and Parthasarathy 2011) and the bipartite network method (Guimerà, Sales-Pardo, and Amaral 2007). The latent space model (LSM; Hoff, Raftery, and Handcock 2002) is another popular route to model directed networks, which assumes that each node is represented by one latent position vector, and the directed structure can be measured by the weighted projection between latent vectors. Yet all these methods completely ignore the difference between sending and receiving edges in directed networks, and thus often lead to suboptimal performance. Only recently have co-clustering methods (Rohe, Qin, and Yu 2016; Guo et al. 2020) been developed to detect two different sets of communities for the same nodes in directed networks, one with similar sending patterns and the other with similar receiving patterns. However, these co-clustering methods focus solely on community structure without estimating link probabilities, and thus cannot extract the rich information contained in the network.
In this article, we propose a novel method to simultaneously construct network embeddings and detect out- and in-communities in directed networks. Network embedding (Tang et al. 2015; Cui et al. 2018) has become a popular tool to represent a network in a low-dimensional vector space so that the network structure is largely preserved (Perozzi, Al-Rfou, and Skiena 2014; Wang, Cui, and Zhu 2016; Wang et al. 2017). Different from existing network embeddings, the proposed method decomposes each node into an out-node that sends edges and an in-node that receives edges, and represents each node with two different embeddings to facilitate the detection of different out- and in-communities. In essence, the out-communities contain out-nodes with similar sending patterns, and the in-communities contain in-nodes with similar receiving patterns. This is of particular interest in practice. For example, in social networks, it is extremely important to identify users with similar interests, who thus have similar following patterns, as well as users with similar profiles, who thus have similar followers.
The main contribution of this article is the development of a novel framework for jointly estimating network embeddings and detecting communities in directed networks. The framework integrates the network embedding model with community detection via a regularization formulation, together with theoretical justification of its asymptotic consistency in both network embedding and community detection. Specifically, the network embedding model introduces two sets of vectors to represent the out- and in-nodes in a low-dimensional space, and thus allows for detection of different out- and in-communities of the same nodes. The regularization formulation equips the log-likelihood with a novel regularization term (Tang, Xue, and Qu 2020) to encourage community structure among the node representations. An efficient alternating updating scheme is developed to tackle the resultant optimization task, and numerical experiments on both simulated and real examples indicate that the proposed method is superior to most existing methods in terms of both network estimation and community detection in directed networks. Asymptotic consistencies of the proposed method are also established in terms of network estimation and community detection under the stochastic co-block model (ScBM). Most interestingly, for community detection in a directed network with $n$ nodes, the fast convergence rate of $O_p(n^{-1} \log n)$ is attained by the proposed method under mild conditions, compared with the existing literature (Rohe, Qin, and Yu 2016).
The article is organized as follows. Section 2 presents the proposed community detection method in directed networks with network embedding, and develops an alternating updating scheme to tackle the computational challenge. Section 3 establishes some theoretical results on the asymptotic consistencies of the proposed method. In Section 4, we illustrate the performance of the proposed method on a number of simulated examples, and compare it against some popular competitors in the literature. Real applications on an E-mail network and a statistician citation network are provided in Section 5 to demonstrate the strength of the proposed method. The online Appendix contains all of the technical proofs.

Community Detection in Directed Networks
Consider a directed network $G$ with $n$ nodes labeled by $1, \ldots, n$ and an adjacency matrix $A = (a_{ij})_{i,j=1}^n$ with $a_{ij} \in \{0, 1\}$, where $a_{ij} = 1$ if there is a directed edge pointing from node $i$ to node $j$, and $a_{ij} = 0$ otherwise. Note that $A$ is usually asymmetric, and the rows and columns of $A$ can have different meanings although they index the same nodes. Specifically, each node can be split into an out-node and an in-node, with the rows of $A$ indexing the out-nodes and the columns indexing the in-nodes. The out-nodes and in-nodes often have different structures, indicating that nodes may behave differently in terms of sending and receiving edges.

Proposed Method
To model the different roles of out-nodes and in-nodes in $G$, we consider the network embedding model
$$p_{ij} = P(a_{ij} = 1) = \frac{\exp(\alpha_i^\top \beta_j)}{1 + \exp(\alpha_i^\top \beta_j)}, \qquad (1)$$
where $\alpha_i$ and $\beta_j$ are the embedding vectors of out-node $i$ and in-node $j$ in an $r$-dimensional space with $r \ll n$. In this dimension-reduced space, the structure and inherent properties of the directed network shall be preserved, so that the embedding vectors of similar nodes are close to each other. Consequently, the community structures of the out-nodes and in-nodes can be modeled through some distance measures among $\{\alpha_i\}_{i=1}^n$ and $\{\beta_j\}_{j=1}^n$. With the network embedding model in (1), the likelihood function of $G$ becomes
$$L(\alpha, \beta) = \prod_{i,j=1}^n p_{ij}^{a_{ij}} (1 - p_{ij})^{1 - a_{ij}}. \qquad (2)$$
Note that the embedding vectors $\alpha_i$'s and $\beta_j$'s may not be identifiable in the likelihood function, which can be largely overcome by introducing some regularization terms on the embedding vectors. This leads to the regularized negative log-likelihood function
$$L_\lambda(\alpha, \beta) = -\frac{1}{n^2} \sum_{i,j=1}^n \big\{ a_{ij} \log p_{ij} + (1 - a_{ij}) \log(1 - p_{ij}) \big\} + \lambda_{0n} J_0(\alpha, \beta) + \lambda_{1n} J_1(\alpha, \beta), \qquad (3)$$
where $J_0(\alpha, \beta)$ and $J_1(\alpha, \beta)$ are two regularization terms, and $\lambda_{0n}$ and $\lambda_{1n}$ are two tuning parameters that may depend on $n$.
Here, $J_0(\alpha, \beta) = \|\alpha\|_F^2 + \|\beta\|_F^2$ is a standard regularization term on $\alpha$ and $\beta$, where $\|\cdot\|_F$ denotes the Frobenius norm. The Frobenius-norm penalty assures that the estimated $\alpha$ and $\beta$ can be determined up to a rotation, which does not affect their community structures. Similar penalty terms have been widely used in the literature, such as regularized collaborative filtering for recommender systems (Zhu, Shen, and Ye 2016) and the regularized LSM for networks (Ma, Ma, and Yuan 2020). To facilitate community detection, we propose to add another novel penalty term $J_1(\alpha, \beta)$ to encourage subgrouping of out-nodes and in-nodes with similar representations in the embedded space. Specifically, for a given value $\gamma > 0$, we let
$$J_1(\alpha, \beta) = \sum_{i=1}^n \min_{1 \le l \le k_1} \|\alpha_i - c_l^{\text{out}}\|_2^\gamma + \sum_{j=1}^n \min_{1 \le m \le k_2} \|\beta_j - c_m^{\text{in}}\|_2^\gamma, \qquad (4)$$
where $\{c_l^{\text{out}}\}_{l=1}^{k_1}$ and $\{c_m^{\text{in}}\}_{m=1}^{k_2}$ are community centers for out-nodes and in-nodes, respectively, and $k_1$ and $k_2$ are the corresponding numbers of communities. It resembles the classical idea of K-means clustering, and attains the effect of pushing $\alpha_i$ and $\beta_j$ toward their respective community centers. A similar idea has been employed in various models, including high-dimensional regression (Witten, Shojaie, and Zhang 2014) and individualized variable selection (Tang, Xue, and Qu 2020). Note that $k_1$ and $k_2$ are usually unknown, and can be treated as tuning parameters determined by some data-adaptive tuning scheme. It is clear that as $\lambda_{1n}$ tends to infinity, $J_1(\alpha, \beta)$ pushes the node representations toward some community centers, and thus attains the purpose of community detection. Also, the out- and in-communities can be different, as $\{c_l^{\text{out}}\}_{l=1}^{k_1}$ and $\{c_m^{\text{in}}\}_{m=1}^{k_2}$ are generally different. This is particularly important in directed networks, where nodes can have different patterns in sending and receiving edges.
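To make the formulation concrete, the regularized objective can be evaluated directly for given embeddings and centers. The NumPy sketch below is ours, not the authors' implementation: it assumes the logistic link $p_{ij} = \sigma(\alpha_i^\top \beta_j)$ and an averaged negative log-likelihood, and all names (`objective`, `c_out`, `c_in`) are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def objective(A, alpha, beta, c_out, c_in, lam0, lam1, gamma=2):
    """Sketch of the regularized negative log-likelihood L_lambda(alpha, beta).

    A      : (n, n) binary adjacency matrix
    alpha  : (n, r) out-node embeddings; beta : (n, r) in-node embeddings
    c_out  : (k1, r) out-community centers; c_in : (k2, r) in-community centers
    """
    theta = alpha @ beta.T                      # theta_ij = alpha_i^T beta_j
    p = sigmoid(theta)                          # p_ij = P(a_ij = 1)
    eps = 1e-12                                 # numerical guard for log(0)
    nll = -np.mean(A * np.log(p + eps) + (1 - A) * np.log(1 - p + eps))
    J0 = np.sum(alpha**2) + np.sum(beta**2)     # Frobenius-norm penalty
    # J1: distance of each embedding to its nearest community center
    d_out = np.min(np.linalg.norm(alpha[:, None, :] - c_out[None], axis=2), axis=1)
    d_in = np.min(np.linalg.norm(beta[:, None, :] - c_in[None], axis=2), axis=1)
    J1 = np.sum(d_out**gamma) + np.sum(d_in**gamma)
    return nll + lam0 * J0 + lam1 * J1
```

Since the penalty terms are nonnegative, increasing either tuning parameter can only increase the objective for fixed embeddings.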
Note that the proposed model differs from the LSM (Hoff, Raftery, and Handcock 2002) in that two embedding vectors are introduced for each node in (1) to capture the different sending and receiving patterns of each node, whereas the LSM only allows one latent position vector for each node and thus completely ignores the difference between sending and receiving edges. Another important feature of the proposed model is that the network embedding and its community structure are jointly estimated in the unified formulation (3), which allows these two counterparts to inform each other and thus helps improve the accuracy of both network embedding and community detection. Although not explicitly stated, the LSM only estimates the latent vectors for each node, and its application to community detection requires a separate clustering step on the latent vectors, leading to a two-step procedure. Furthermore, the proposed method is very flexible, and can be extended to undirected networks by setting $\alpha_i = \beta_i$ for each node $i$, and to directed networks with identical out- and in-community structures by requiring the out- and in-nodes to share a common set of community centers.

Computing Algorithm
We now develop an efficient algorithm for minimizing the regularized negative log-likelihood function $L_\lambda(\alpha, \beta)$. Particularly, the algorithm proceeds by alternately updating $(\alpha, \beta)$ and $(c^{\text{out}}, c^{\text{in}})$. When $(c^{\text{out}}, c^{\text{in}})$ is given, the minimization task becomes a bi-convex problem, and can be solved by alternately updating $\alpha$ and $\beta$. When $(\alpha, \beta)$ is given, the minimization task becomes a standard clustering problem, and different clustering algorithms can be employed for different values of $\gamma$.
We now illustrate the computing algorithm for $\gamma = 2$; the algorithm can be adapted to other choices of $\gamma$ with slight modification. Given $(\alpha^{(k)}, \beta^{(k)})$ and $(c^{(\text{out},k)}, c^{(\text{in},k)})$ at the $k$th step, we update $\alpha$ and $\beta$ separately to obtain $\alpha^{(k+1)}$ and $\beta^{(k+1)}$. With $\beta^{(k)}$ and the community centers fixed, the suboptimization task for each $\alpha_i$ is to minimize
$$\frac{1}{n^2} \sum_{j=1}^n L\big(\alpha_i^\top \beta_j^{(k)}; a_{ij}\big) + \lambda_{0n} \|\alpha_i\|_2^2 + \lambda_{1n} \min_{1 \le l \le k_1} \big\|\alpha_i - c_l^{(\text{out},k)}\big\|_2^2. \qquad (5)$$
Clearly, the formulation in (5) is the same as a regularized logistic regression with some $L_2$-type penalty, which can be efficiently solved by a gradient descent algorithm. Here, as opposed to solving (5) exactly, we just update $\alpha_i^{(k)}$ along the gradient for one step to further expedite the computation. Specifically, for each $i \in \{1, 2, \ldots, n\}$,
$$\alpha_i^{(k+1)} = \alpha_i^{(k)} - \eta \Big\{ \frac{1}{n^2} \sum_{j=1}^n \big(p_{ij}^{(k)} - a_{ij}\big) \beta_j^{(k)} + 2\lambda_{0n} \alpha_i^{(k)} + 2\lambda_{1n} \big(\alpha_i^{(k)} - c_{l_i}^{(\text{out},k)}\big) \Big\},$$
where $\eta$ is a step size, $p_{ij}^{(k)} = \{1 + \exp(-(\alpha_i^{(k)})^\top \beta_j^{(k)})\}^{-1}$, and $l_i = \arg\min_{1 \le l \le k_1} \|\alpha_i^{(k)} - c_l^{(\text{out},k)}\|_2$. Similarly, the suboptimization task related to $\beta^{(k+1)}$ is also solved by one step along the gradient, so that for each $j \in \{1, 2, \ldots, n\}$,
$$\beta_j^{(k+1)} = \beta_j^{(k)} - \eta \Big\{ \frac{1}{n^2} \sum_{i=1}^n \big(p_{ij}^{(k)} - a_{ij}\big) \alpha_i^{(k+1)} + 2\lambda_{0n} \beta_j^{(k)} + 2\lambda_{1n} \big(\beta_j^{(k)} - c_{m_j}^{(\text{in},k)}\big) \Big\},$$
with $m_j = \arg\min_{1 \le m \le k_2} \|\beta_j^{(k)} - c_m^{(\text{in},k)}\|_2$. Next, given $\alpha^{(k+1)}$ and $\beta^{(k+1)}$, the suboptimization task related to $c^{(\text{out},k+1)}$ and $c^{(\text{in},k+1)}$ becomes a standard clustering problem, which can be solved by applying the standard K-means algorithm to $\alpha^{(k+1)}$ and $\beta^{(k+1)}$, respectively. The details of the computing algorithm are summarized in Algorithm 1. Algorithm 1 is guaranteed to converge (Gorski, Pfeuffer, and Klamroth 2007), and the final estimates are denoted as $(\hat\alpha, \hat c^{\text{out}}, \hat\beta, \hat c^{\text{in}})$. It also leads to two community assignment functions $\hat\psi_{\text{out}}$ and $\hat\psi_{\text{in}}$, which assign out-node $i$ with $\hat\alpha_i$ to the closest $\hat c_l^{\text{out}}$ and in-node $j$ with $\hat\beta_j$ to the closest $\hat c_m^{\text{in}}$. More precisely, for any $i, j \in \{1, 2, \ldots, n\}$,
$$\hat\psi_{\text{out}}(i) = \arg\min_{1 \le l \le k_1} \|\hat\alpha_i - \hat c_l^{\text{out}}\|_2 \quad \text{and} \quad \hat\psi_{\text{in}}(j) = \arg\min_{1 \le m \le k_2} \|\hat\beta_j - \hat c_m^{\text{in}}\|_2.$$
It is worth pointing out that the objective function in (3) is nonconvex, and finding its global minimum requires some computationally expensive nonconvex optimization algorithms, such as the branch-and-bound algorithm (Morrison et al. 2016).
Algorithm 1 provides an efficient way to find a reasonable solution; it is guaranteed to converge to a stationary point of (3), and yields satisfactory numerical performance in both the simulated and real examples in Sections 4 and 5. Its numerical performance may be further improved if multiple random starts or an educated initialization is employed, though at the cost of increased computation.
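One pass of the alternating scheme can be sketched as follows. This is our simplification of Algorithm 1 for $\gamma = 2$, assuming a logistic link, a fixed step size `eta`, and an in-place K-means-style center update; the gradient scaling and all names are our own choices, not the authors' exact implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nearest_center(X, C):
    """Index of the closest center in C for each row of X."""
    d = np.linalg.norm(X[:, None, :] - C[None], axis=2)
    return np.argmin(d, axis=1)

def alternate_step(A, alpha, beta, c_out, c_in, lam0, lam1, eta=0.1):
    """One pass: a single gradient step on the embeddings, then a
    K-means-style update of the community centers."""
    p = sigmoid(alpha @ beta.T)
    resid = p - A                            # derivative of the logistic loss
    lo = nearest_center(alpha, c_out)        # closest out-centers
    li = nearest_center(beta, c_in)          # closest in-centers
    grad_a = resid @ beta + 2 * lam0 * alpha + 2 * lam1 * (alpha - c_out[lo])
    grad_b = resid.T @ alpha + 2 * lam0 * beta + 2 * lam1 * (beta - c_in[li])
    alpha, beta = alpha - eta * grad_a, beta - eta * grad_b
    # center update: mean of current members, as in the K-means step
    lo, li = nearest_center(alpha, c_out), nearest_center(beta, c_in)
    for l in range(len(c_out)):
        if np.any(lo == l):
            c_out[l] = alpha[lo == l].mean(axis=0)
    for m in range(len(c_in)):
        if np.any(li == m):
            c_in[m] = beta[li == m].mean(axis=0)
    return alpha, beta, c_out, c_in
```

In practice the pass would be repeated until the objective stabilizes, mirroring the convergence guarantee of the alternating scheme.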

Tuning Procedure
Note that the performance of the proposed method relies on two tuning parameters $\lambda_{0n}$ and $\lambda_{1n}$. Based on our limited numerical experience, the performance of the proposed method is satisfactory as long as $\lambda_{0n}$ is sufficiently small, and thus we set $\lambda_{0n} = 10^{-10}$ in all the numerical experiments in Section 4.
To tune $\lambda_{1n}$, we adopt a stability-based tuning scheme to select the optimal parameter via a grid search over the grid points $\{10^{-9 + 0.3m} : m = 0, 1, \ldots, 39\}$. It proceeds in a bootstrap fashion, randomly generating two sub-networks, each with 80% of randomly selected edges from the training network. The stability is defined as the disagreement between the two estimated community structures based on these two sub-networks. Specifically, we randomly choose 80% of the entries of $\{a_{ij}\}_{i,j=1}^n$, and apply the proposed method with $\lambda_{0n} = 10^{-10}$ and a given $\lambda_{1n}$ to get $\hat\psi^1 = (\hat\psi^1_{\text{out}}, \hat\psi^1_{\text{in}})$. Repeating this process on another randomly chosen 80% of the entries, the proposed method yields another $\hat\psi^2 = (\hat\psi^2_{\text{out}}, \hat\psi^2_{\text{in}})$. Finally, we set the optimal $\lambda_{1n}$ as
$$\hat\lambda_{1n} = \arg\min_{\lambda_{1n}} \big\{ \text{clust.err}(\hat\psi^1_{\text{out}}, \hat\psi^2_{\text{out}}) + \text{clust.err}(\hat\psi^1_{\text{in}}, \hat\psi^2_{\text{in}}) \big\}.$$
The whole subsampling scheme can be repeated multiple times, which is analogous to the stability criterion for clustering in Wang (2010).
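The stability-based selection can be sketched as below. Here `fit_fn` is a hypothetical stand-in for fitting the proposed method on a sub-network and returning the two label vectors, the entry-wise 80% subsampling is our reading of the scheme, and the pairwise disagreement follows the clustering error of Wang (2010).

```python
import numpy as np

def stability_select(A, lam_grid, fit_fn, frac=0.8, seed=0):
    """Pick the lam1 whose community estimates on two random 80% edge
    subsamples disagree least.  fit_fn(A_sub, lam1) -> (psi_out, psi_in)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]

    def clust_err(psi1, psi2):
        # fraction of node pairs on which the two partitions disagree
        same1 = psi1[:, None] == psi1[None, :]
        same2 = psi2[:, None] == psi2[None, :]
        iu = np.triu_indices(n, k=1)
        return np.mean(same1[iu] != same2[iu])

    def subsample():
        mask = rng.random((n, n)) < frac      # keep ~80% of the entries
        return A * mask

    scores = []
    for lam in lam_grid:
        o1, i1 = fit_fn(subsample(), lam)
        o2, i2 = fit_fn(subsample(), lam)
        scores.append(clust_err(o1, o2) + clust_err(i1, i2))
    return lam_grid[int(np.argmin(scores))]
```

Repeating the subsampling several times and averaging the scores would mirror the multiple-repetition variant mentioned above.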

Theory
In this section, we establish some theoretical results to quantify the asymptotic behavior of the proposed method. Consider the network embedding model (1) with the true parameter $\theta^* = Z^* B^* (Y^*)^\top$ following the ScBM (Rohe, Qin, and Yu 2016), where $Z^* \in \{0, 1\}^{n \times k_1}$, $B^* \in \mathbb{R}^{k_1 \times k_2}$, and $Y^* \in \{0, 1\}^{n \times k_2}$, and each row of $Z^*$ and $Y^*$ has exactly one entry equal to 1, indicating the community membership of the corresponding out-node and in-node, respectively. Let $\psi^*_{\text{out}}$ and $\psi^*_{\text{in}}$ be the true community assignment functions for the out-nodes and in-nodes; then $Z^*_{i \psi^*_{\text{out}}(i)} = 1$ for any out-node $i$ and $Y^*_{j \psi^*_{\text{in}}(j)} = 1$ for any in-node $j$.
Let $\text{rank}(\theta^*) = r \le \min\{k_1, k_2\}$, where $r$ is allowed to diverge with $n$. Let $p(\theta_{ij}; a_{ij})$ be the density of $a_{ij}$ with parameter $\theta_{ij}$, and assume that $c_l e^{-T_n} \le \min_{1 \le i,j \le n} p(\theta_{ij}; 1) \le \max_{1 \le i,j \le n} p(\theta_{ij}; 1) \le c_u e^{-T_n}$ for any $\theta \in \Theta$ to accommodate network sparsity, where $e^{-T_n} > \frac{\log n}{n}$ and $0 < c_l \le c_u$ are two positive constants. Furthermore, let $L(\theta_{ij}; a_{ij}) = -\log p(\theta_{ij}; a_{ij})$ be the negative log-likelihood function, $L_\lambda(\theta) = \frac{1}{n^2} \sum_{i,j=1}^n L(\theta_{ij}; a_{ij}) + \lambda_{0n} J_0(\alpha, \beta) + \lambda_{1n} J_1(\alpha, \beta)$, and $e_L(\theta, \theta^*) = \frac{1}{n^2} \sum_{i,j=1}^n E\{L(\theta_{ij}; a_{ij}) - L(\theta^*_{ij}; a_{ij})\}$. Then $e_L(\theta, \theta^*) \ge 0$ for any $\theta \in \Theta$, and thus $L$ is Fisher consistent (Lin 2004) in the sense that $\theta^*$ is a minimizer of $\sum_{i,j=1}^n E\{L(\theta_{ij}; a_{ij})\}$. Furthermore, let the estimated parameter be $\hat\theta = (\hat\theta_{ij})_{i,j=1}^n = (\hat\alpha_i^\top \hat\beta_j)_{i,j=1}^n$, where $\hat\alpha_i$ and $\hat\beta_j$ are estimated by minimizing (3) with $\gamma = 2$ in (4). The estimation error of $\hat\theta$ is measured by the averaged Hellinger distance $h^2(\hat\theta, \theta^*) = \frac{1}{n^2} \sum_{i,j=1}^n h^2(\hat\theta_{ij}, \theta^*_{ij})$, where $h(\theta_{ij}, \theta^*_{ij})$ is the discrete Hellinger distance between $p^{1/2}(\theta_{ij}; a_{ij})$ and $p^{1/2}(\theta^*_{ij}; a_{ij})$, defined as
$$h^2(\theta_{ij}, \theta^*_{ij}) = \frac{1}{2} \sum_{a \in \{0, 1\}} \big\{ p^{1/2}(\theta_{ij}; a) - p^{1/2}(\theta^*_{ij}; a) \big\}^2.$$
To proceed, we first establish the relationship between $h^2(\theta_{ij}, \theta^*_{ij})$ and the first and second moments of $L(\theta_{ij}; a_{ij}) - L(\theta^*_{ij}; a_{ij})$.

Lemma 1. For any
Lemma 1 builds a bridge between the first and second moments of $L(\theta_{ij}; a_{ij}) - L(\theta^*_{ij}; a_{ij})$, which serves as an important building block for bounding the empirical process when establishing the large deviation inequalities in Proposition 1 and Theorem 1.
Proposition 1. Suppose $J_\lambda(\theta^*) = c_1 \epsilon_n^2$ with $0 < c_1 \le \frac{1}{2}$ and $\epsilon_n^2 = c_2 n^{-1} r \log n$ for some $c_2 > 0$. Then there exists some constant $c_3 > 0$ such that, when $n$ is sufficiently large,
$$P^*\Big( \sup_{\{\theta \in \Theta:\, e_L(\theta, \theta^*) \ge \epsilon_n^2\}} \big\{ L_\lambda(\theta^*) - L_\lambda(\theta) \big\} \ge 0 \Big) \le \exp\big(-c_3 n^2 \epsilon_n^2\big),$$
where $P^*$ denotes the outer probability measure.
Proposition 1 quantifies the behavior of $L_\lambda(\theta)$ in the neighborhood of $\theta^*$ defined by $e_L(\theta, \theta^*)$, which is established through a large deviation inequality for independent but not identically distributed random variables. Proposition 1 provides an important intermediate result for the estimation consistency of $\hat\theta$ in Theorem 1, once a connection between $e_L(\theta, \theta^*)$ and $h(\theta, \theta^*)$ is established. Theorem 1 shows that $\hat\theta$ is a consistent estimate of $\theta^*$ with a convergence rate depending on $T_n$ and $r$. It is interesting to note that the assumption on $J_\lambda(\theta^*)$ implies that $\lambda_{0n}$ shall converge to 0 at some appropriate order, and both Proposition 1 and Theorem 1 hold true regardless of the value of $\lambda_{1n}$, due to the fact that $J_1(\alpha^*, \beta^*) = 0$ for any $\alpha^*, \beta^* \in \mathbb{R}^{n \times r}$ satisfying $\theta^* = \alpha^* (\beta^*)^\top$.

Theorem 1. Suppose all assumptions in Proposition 1 hold. Then it holds true that
$$h^2(\hat\theta, \theta^*) = O_p\big(n^{-1} r e^{T_n} \log n\big).$$
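The averaged Hellinger distance that measures the estimation error of $\hat\theta$ can be computed directly. The sketch below is ours, assuming Bernoulli densities with the logistic link $p = \sigma(\theta)$:

```python
import numpy as np

def avg_hellinger_sq(theta_hat, theta_star):
    """Averaged squared Hellinger distance between the Bernoulli laws implied
    by two parameter matrices under a logistic link."""
    p = 1.0 / (1.0 + np.exp(-theta_hat))
    q = 1.0 / (1.0 + np.exp(-theta_star))
    # h^2 = (1/2) * sum over a in {0,1} of (sqrt p(a) - sqrt q(a))^2
    h2 = 0.5 * ((np.sqrt(p) - np.sqrt(q))**2
                + (np.sqrt(1 - p) - np.sqrt(1 - q))**2)
    return h2.mean()
```

For identical parameter matrices the distance is exactly zero, and it is always bounded by 1, which makes it a convenient scale-free error measure for sparse networks.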
Next, we turn to establish the consistency in terms of community detection. Let $N^*_k = \{i : \psi^*_{\text{out}}(i) = k\}$ be the $k$th true out-community for $1 \le k \le k_1$, and $R^*_s = \{j : \psi^*_{\text{in}}(j) = s\}$ be the $s$th true in-community for $1 \le s \le k_2$. Similarly, let $\hat N_k = \{i : \hat\psi_{\text{out}}(i) = k\}$ and $\hat R_s = \{j : \hat\psi_{\text{in}}(j) = s\}$ be the $k$th estimated out-community and the $s$th estimated in-community, respectively. Note that the community structure is invariant up to a permutation of the community indices, so we define the estimation errors of $\hat\psi_{\text{out}}$ and $\hat\psi_{\text{in}}$ as
$$\text{err}(\hat\psi_{\text{out}}, \psi^*_{\text{out}}) = \min_{\{p_1, \ldots, p_{k_1}\}} \frac{1}{n} \sum_{k=1}^{k_1} |N^*_k \setminus \hat N_{p_k}|, \qquad (11)$$
$$\text{err}(\hat\psi_{\text{in}}, \psi^*_{\text{in}}) = \min_{\{q_1, \ldots, q_{k_2}\}} \frac{1}{n} \sum_{s=1}^{k_2} |R^*_s \setminus \hat R_{q_s}|, \qquad (12)$$
where $|\cdot|$ denotes the set cardinality, $N^*_k \setminus \hat N_{p_k}$ contains nodes in $N^*_k$ but not in $\hat N_{p_k}$, $\{p_1, \ldots, p_{k_1}\}$ is a permutation of $\{1, \ldots, k_1\}$, and $\{q_1, \ldots, q_{k_2}\}$ is another permutation of $\{1, \ldots, k_2\}$. It is clear that the community detection errors in (11) and (12) measure the smallest possible fraction of nodes in $N^*_k$ and $R^*_s$ that are assigned by $\hat\psi_{\text{out}}$ and $\hat\psi_{\text{in}}$ to wrong out-communities or in-communities.
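For small $k_1$ and $k_2$, the permutation-minimized error in (11) and (12) can be evaluated by brute force. The sketch below is ours: it enumerates all label permutations, so it is feasible only for a handful of communities (the function name and signature are illustrative).

```python
import numpy as np
from itertools import permutations

def community_err(psi_hat, psi_star, k):
    """Smallest fraction of misassigned nodes over all permutations of the
    k community labels, matching the definition in (11)-(12)."""
    n = len(psi_star)
    best = 1.0
    for perm in permutations(range(k)):
        # node i is correctly assigned if its relabeled hat-label matches psi_star
        mis = sum(perm[psi_hat[i]] != psi_star[i] for i in range(n))
        best = min(best, mis / n)
    return best
```

Since the error is computed after the best relabeling, any bijective renaming of the estimated communities leaves it unchanged, as the invariance discussion above requires.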
Assumption 1. There exist a positive constant $c_4$ and a positive sequence $\zeta_n$ such that, when $n$ is large enough, $\min_{k \ne k'} \|B^*_{k \cdot} - B^*_{k' \cdot}\|_2 \ge c_4 \zeta_n$ and $\min_{s \ne s'} \|B^*_{\cdot s} - B^*_{\cdot s'}\|_2 \ge c_4 \zeta_n$.

Assumption 2. There exists a positive constant $c_5$ such that, when $n$ is large enough, $n^{-1} \min\big\{ \min_{1 \le k \le k_1} k_1 |N^*_k|, \ \min_{1 \le s \le k_2} k_2 |R^*_s| \big\} \ge c_5$.
Assumption 1 is a necessary identifiability assumption: the rows and columns of $B^*$ shall be substantially different from one another, so that the true out- and in-communities are well separated. The separation is governed by the positive sequence $\zeta_n$, which may converge to 0 at some slow rate. Note that the entries of $B^*$ can be very negative, leading to small link probabilities and thus sparse networks. Assumption 2 assumes that all the true communities are well defined and do not degenerate too fast asymptotically. A similar assumption can be found in Ke, Shi, and Xia (2020) for community detection in undirected hypergraphs, and a relatively stronger assumption of equal community sizes is made in Chien, Lin, and Wang (2019).
Theorem 2. Suppose all assumptions in Theorem 1 and Assumptions 1 and 2 hold, and that $\lim_{n \to \infty} \lambda_{1n} (n^{-1} r e^{T_n} \log n)^{1/2} = \infty$. Then it holds true that
$$\text{err}(\hat\psi_{\text{out}}, \psi^*_{\text{out}}) = O_p\Big( \frac{k_1 r e^{T_n} \log n}{n \zeta_n^2} \Big) \quad \text{and} \quad \text{err}(\hat\psi_{\text{in}}, \psi^*_{\text{in}}) = O_p\Big( \frac{k_2 r e^{T_n} \log n}{n \zeta_n^2} \Big),$$
provided that $\max\{k_1, k_2\} r n^{-1} \zeta_n^{-2} e^{T_n} \log n$ converges to 0 as $n$ diverges.
Theorem 2 assures that the proposed method can consistently estimate the true out- and in-communities at a fast rate. Note that although Theorem 1 holds true for any value of $\lambda_{1n}$, $\lambda_{1n}$ needs to diverge with $n$ sufficiently fast to attain the fast rate of community detection in Theorem 2. More importantly, when $\zeta_n = O(1)$, the convergence rate in Theorem 2 matches that achieved by the state-of-the-art method of Rohe, Qin, and Yu (2016). Precisely, under the four-parameter ScBM model discussed in Rohe, Qin, and Yu (2016), the clustering error rate of the proposed method is $O(K^2 \log n / n)$, which is exactly the same as that in Rohe, Qin, and Yu (2016). Note that Guo et al. (2020) further improved the convergence rate of the community detection error to $O(K^2 / n)$ by using the technical tools in Lei and Rinaldo (2015), at the cost of a compromised diverging rate of $K$. Furthermore, both Theorems 1 and 2 hold true for sparse networks with link probabilities tending to 0 at an order no faster than $O(\frac{\log n}{n})$, or equivalently with the expected node degree growing at the order of $O(\log n)$.

Numerical Experiments
In this section, we examine the numerical performance of the proposed method, and compare it against some existing co-clustering methods in the literature (Dhillon 2001; Kluger et al. 2003; Mariadassou, Robin, and Vacher 2010; Rohe, Qin, and Yu 2016). For simplicity, we denote the proposed network embedding model with $\gamma = 2$ as NEM, and the existing co-clustering methods as RQY (Rohe, Qin, and Yu 2016), KBCG (Kluger et al. 2003), ISD (Dhillon 2001), and VEM (Mariadassou, Robin, and Vacher 2010), respectively. RQY and KBCG employ the spectral clustering algorithm in slightly different fashions, ISD uses a bipartite spectral graph partitioning method, and VEM is a variational EM algorithm for fitting the stochastic co-block model. All the methods are compared in terms of their accuracy in community detection and network estimation.
To evaluate community detection accuracy, note that the error measures in both (11) and (12) require enumeration of all possible permutations, and thus can be computationally infeasible. Here, we instead employ the clustering error in Wang (2010). Particularly, let $\hat\psi$ be the estimated out- or in-community assignment function and $\psi^*$ be the corresponding true community assignment function; then the error of $\hat\psi$ in estimating $\psi^*$ is defined as
$$\text{clust.err}(\hat\psi, \psi^*) = \binom{n}{2}^{-1} \sum_{i < j} \big| I\big(\hat\psi(i) = \hat\psi(j)\big) - I\big(\psi^*(i) = \psi^*(j)\big) \big|. \qquad (13)$$
It is clear that $\text{clust.err}(\hat\psi, \psi^*)$ measures the probability that $\hat\psi$ and $\psi^*$ assign nodes $i$ and $j$ differently. The community detection accuracy of each method is then evaluated by the average of $\text{clust.err}(\hat\psi_{\text{out}}, \psi^*_{\text{out}})$ and $\text{clust.err}(\hat\psi_{\text{in}}, \psi^*_{\text{in}})$. Further, we define an error measure for the estimated probability matrix $\hat p = (\hat p_{ij})_{i,j=1}^n$ in estimating the true probability matrix $p^* = (p^*_{ij})_{i,j=1}^n$ as
$$\text{est.err}(\hat p, p^*) = \frac{\|\hat p - p^*\|_F}{\|p^*\|_F}, \qquad (14)$$
which measures the difference between $\hat p$ and $p^*$ relative to the magnitude of $p^*$.
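Both error measures are straightforward to compute. In the sketch below, the pairwise form of the clustering error and the Frobenius-norm normalization of est.err are our reading of the definitions, and all names are illustrative.

```python
import numpy as np

def clust_err(psi_hat, psi_star):
    """Pairwise clustering error of Wang (2010): fraction of node pairs on
    which the two partitions disagree about co-membership."""
    same_hat = psi_hat[:, None] == psi_hat[None, :]
    same_star = psi_star[:, None] == psi_star[None, :]
    iu = np.triu_indices(len(psi_hat), k=1)
    return np.mean(same_hat[iu] != same_star[iu])

def est_err(p_hat, p_star):
    """Relative Frobenius error of the estimated link-probability matrix
    (the exact normalization in the paper is assumed)."""
    return np.linalg.norm(p_hat - p_star) / np.linalg.norm(p_star)
```

Note that the pairwise error is invariant to relabeling of the communities, which is why no permutation search is needed.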

Simulated Examples
We consider four simulated examples: two of them are generated from the network embedding model in (1), and the other two from the ScBM model in (7).
In all the simulated examples, the averaged community detection errors and probability estimation errors of all the methods, together with their standard deviations over 50 independent replications, are reported in Tables 1 and 2. From Tables 1 and 2, it is evident that NEM outperforms the other four competitors in terms of both community detection and probability estimation in all scenarios. Furthermore, RQY, KBCG, and ISD all employ spectral methods with slight differences and deliver competitive performance in terms of community detection, whereas VEM employs a variational method to directly optimize the likelihood and is less competitive. The performance of VEM in Examples 1 and 3, where the data-generating scheme deviates from the ScBM model, is much worse than that in Example 2. Note that ISD is only applicable to Example 2 with equal numbers of out- and in-communities, and KBCG is not applicable to the sparse networks in Example 3; hence the blanks marked "-" in Table 1. Moreover, RQY, KBCG, and ISD do not produce probability estimates directly, unless some additional steps are conducted as in Chen and Lei (2018); hence the blanks marked "-" in Table 2.

Table 1. The averaged community detection errors and their standard deviations of various methods over 50 independent replications. The best performer in each scenario is boldfaced.
We also examine the robustness of the proposed method against the misspecification of the community numbers in the following simulated example.
Note that the community detection error in (13) is inappropriate for measuring the community detection accuracy of methods with different community numbers, and thus it can be modified as
$$\text{clust.err.num}(\hat\psi, \psi^*) = \binom{n}{2}^{-1} \sum_{k} \sum_{i < j;\, i, j \in \hat C_k} I\big(\psi^*(i) \ne \psi^*(j)\big),$$
where $\hat C_k = \{i : \hat\psi(i) = k\}$. It measures the probability that $\hat\psi$ assigns nodes $i$ and $j$ to the same community while $\psi^*$ does not, which accounts for the fact that $\hat\psi$ may split true communities into sub-communities when $k$ is set larger than the true community number. The averaged performance of the proposed method with different community numbers over 50 replications is plotted in Figure 1. It is clear that as $k$ increases, the modified clustering error first decreases and then becomes stable once $k$ exceeds the true community number, implying the robustness of the proposed method with respect to the choice of $k$.
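The modified error can be sketched in the same pairwise fashion; the exact normalization over all node pairs is our assumption.

```python
import numpy as np

def clust_err_num(psi_hat, psi_star):
    """Fraction of node pairs that psi_hat puts in the same community while
    psi_star does not; over-splitting true communities is not penalized."""
    same_hat = psi_hat[:, None] == psi_hat[None, :]
    same_star = psi_star[:, None] == psi_star[None, :]
    iu = np.triu_indices(len(psi_hat), k=1)
    return np.mean(same_hat[iu] & ~same_star[iu])
```

In particular, any refinement of the true partition (splitting true communities into sub-communities) incurs zero error, which is exactly the behavior the modified measure is designed to allow.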

Real Applications
We apply the proposed NEM method to analyze two real-life networks: the E-mail-Eu-core network (Jure and Andrej 2014) and the statistician citation network (Ji and Jin 2016). Note that the true probability matrix and the true community structures are no longer available, so the error measures in (13) and (14) are not applicable for assessing the numerical performance. For simplicity, we follow the treatment in Rohe, Qin, and Yu (2016) and Guo et al. (2020) to determine the community numbers in both networks. Note that some data-adaptive procedure, such as network cross-validation (Li, Levina, and Zhu 2020), may also be employed, though at the cost of additional computation.

E-mail-Eu-Core Network
The E-mail-Eu-core network is generated from the E-mail exchanges among 1005 employees from 42 different departments in a large European research institute. Specifically, there is a directed edge u → v if person u has sent at least one E-mail to person v, which amounts to a total of 25,571 directed edges in the network. First, we examine the first 50 singular values of the adjacency matrix of the E-mail network in Figure 2. As there is a gap between the fifth and sixth singular values, we set the number of both out- and in-communities to be 5.
We then apply NEM with $k_1 = k_2 = 5$ and $r = 3$ to the E-mail-Eu-core network, and obtain the out- and in-community assignment functions $\hat\psi_{\text{out}}$ and $\hat\psi_{\text{in}}$ as well as the embedding vector of each node in the network. The estimated three-dimensional embedding vectors of the out-nodes and in-nodes are shown in Figure 3, with different colors representing their assigned communities. It is clear that the estimated out- and in-communities are well separated in the three-dimensional embedding space, and the out- and in-community memberships are substantially different.
Further, we scrutinize the sending and receiving patterns of the largest out- and in-communities in Figure 4. In the left panel, 13 out-nodes are selected from the largest out-community, and the numbers of edges received by the nodes in the right column range from 3 to 8, with an average of 5.4 out of the 13 out-nodes, implying that the out-nodes in the largest out-community share very similar sending patterns. In the right panel, 14 in-nodes are selected from the largest in-community, and the numbers of edges sent from the nodes in the left column also range from 3 to 8, with an average of 6.1 out of the 14 in-nodes, implying that the in-nodes in the largest in-community share similar receiving patterns. Moreover, it is interesting to point out that the nodes in the largest out- and in-communities come from different departments, suggesting that department may not be a good indicator for communities with similar sending and receiving patterns.

Statistician Citation Network
The statistician citation network consists of statisticians who published at least one article in four major statistical journals from 2003 to the first half of 2012. Specifically, each node represents an author, and a directed edge u → v indicates that author u cites at least one article of author v. The citation network is very sparse, and there are many small connected sub-networks that are isolated from the rest of the network. For illustration, we remove all these small isolated sub-networks in the subsequent analysis, which results in a processed citation network with 2654 nodes and 22,195 directed edges.
As shown in the right panel of Figure 2, there is a gap between the third and fourth singular values of the adjacency matrix, and thus we set the number of both out- and in-communities to be 3. We then apply NEM with $k_1 = k_2 = 3$ and $r = 3$ to the citation network, and obtain the estimated out- and in-community assignment functions $\hat\psi_{\text{out}}$ and $\hat\psi_{\text{in}}$ as well as the embedding vector of each node in the network. The estimated three-dimensional embedding vectors of the out-nodes and in-nodes are shown in Figure 5, with different colors representing their assigned communities.
It is clear that the estimated out- and in-communities are well separated in the three-dimensional embedding space. The community membership also appears reasonable, in that it captures some common perception of different statisticians' research interests. For example, the first out-community consists of Emmanuel J. Candès, Jianqing Fan, Trevor Hastie, and many other authors working on machine learning and variable selection. Moreover, the in-communities appear to be more dispersed than the out-communities, which agrees with the observation that authors citing similar articles tend to share more similar research interests than those who are cited by similar authors.
Although the statistician dataset has been widely analyzed in the literature, most existing results focus on undirected networks (Li, Levina, and Zhu 2020) or directed networks with the same out- and in-community structures (Ji and Jin 2016), except for the recent work on arXiv (Guo et al. 2020). Notably, the community structures estimated by the proposed method are much more clearly separated than those in Guo et al. (2020), where different communities lie in distinct directions but severely overlap around the center.

Discussion
In this article, we propose a novel method for detecting directed communities with network embedding. The proposed method builds upon the idea of network embedding, which introduces two sets of numeric vectors to represent the out- and in-nodes separately. It thus allows detection of different out- and in-communities of the same nodes based on their patterns of sending and receiving edges. The proposed method is formulated as a regularized negative log-likelihood, where a novel regularization term is enforced to encourage community structure among the node representations. Asymptotic analysis has also been conducted, assuring the desirable theoretical properties of the proposed method, which are also supported by numerical experiments on a number of simulated and real examples.
One possible extension of the proposed model is to incorporate node degree heterogeneity. Particularly, let $\tilde\alpha_i = (\alpha_i^\top, a_i, 1)^\top$ and $\tilde\beta_j = (\beta_j^\top, 1, b_j)^\top$; then the degree-corrected version of (1) becomes $\theta_{ij} = \tilde\alpha_i^\top \tilde\beta_j = \alpha_i^\top \beta_j + a_i + b_j$, where $a_i$ and $b_j$ are the degree-correction terms. A large value of $a_i$ implies that node $i$ sends more edges, and a large value of $b_j$ implies that node $j$ receives more edges. It is clear that the degree-corrected model is different from most existing degree-corrected treatments in the literature, but it is still capable of incorporating the degree heterogeneity. The computing algorithm can be modified accordingly, but the theoretical derivation is more involved.
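The degree-corrected parameterization amounts to augmenting the embedding vectors; a minimal sketch, with all names our own:

```python
import numpy as np

def degree_corrected_theta(alpha, beta, a, b):
    """Degree-corrected logits: augment each out-embedding with (a_i, 1) and
    each in-embedding with (1, b_j), so theta_ij = alpha_i^T beta_j + a_i + b_j."""
    alpha_t = np.hstack([alpha, a[:, None], np.ones((len(a), 1))])
    beta_t = np.hstack([beta, np.ones((len(b), 1)), b[:, None]])
    return alpha_t @ beta_t.T
```

Because the correction terms enter through the augmented inner product, the alternating updating scheme above applies essentially unchanged, with two extra coordinates per node.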

Supplementary Materials
Supplementary materials contain all Python codes to run the algorithms and simulated examples.