Hierarchical Network Models for Exchangeable Structured Interaction Processes

Abstract Network data often arises via a series of structured interactions among a population of constituent elements. E-mail exchanges, for example, have a single sender followed by potentially multiple receivers. Scientific articles, on the other hand, may have multiple subject areas and multiple authors. We introduce a statistical model, termed the Pitman-Yor hierarchical vertex components model (PY-HVCM), that is well suited for structured interaction data. The proposed PY-HVCM effectively models complex relational data by partial pooling of local information via a latent, shared population-level distribution. The PY-HVCM is a canonical example of hierarchical vertex components models, a subfamily of models for exchangeable structured interaction-labeled networks, that is, networks invariant to interaction relabeling. Theoretical analysis and supporting simulations provide clear model interpretation and establish global sparsity and power law degree distributions. A computationally tractable Gibbs sampling algorithm is derived for inferring sparsity and power law properties of complex networks. We demonstrate the model on both the Enron e-mail dataset and an arXiv dataset, showing goodness of fit of the model via posterior predictive validation.


Introduction
Modern statistical network analysis focuses on the study of large, complex networks that emerge in diverse fields, including social, biological, and physical systems (Barabási 2016; Newman 2010; Estrada 2012; Latora, Nicosia, and Russo 2017; Goldenberg et al. 2009). The expanding scope of network analysis requires statistical models and inferential tools that can handle the increasing complexity of network data structures. In this paper, we focus on network data originating from sequences of interactions. Network data arising in this manner benefit from a framework built upon the interaction as the statistical unit (McCullagh 2002) rather than upon the constituent elements within each interaction. Edge exchangeable models, as described in Crane and Dempsey (2017, 2018b), are well adapted to the analysis of datasets containing these complex interactions. A similar nonparametric Bayesian treatment of edge exchangeability was described in Cai and Campbell (2015) and Cai, Campbell, and Broderick (2016).
While edge exchangeability provides a framework for the statistical analysis of interaction data, the model proposed in Crane and Dempsey (2017, 2018b), named the Hollywood model, captures only basic global features. Specifically, the Hollywood model's asymptotic behavior reflects the empirical properties of sparsity and power law degree distributions observed in real-world network data, which are not as well reflected in classic statistical network models such as exponential random graph models (ERGMs; Wasserman and Pattison 1996), graphon models (Airoldi, Costa, and Chan 2013), and stochastic blockmodels (SBMs; Holland, Laskey, and Leinhardt 1983a). While edge exchangeability is attractive as a theoretical framework, the set of current edge exchangeable models is inadequate for modern network data with high structural complexity.
The edge exchangeable model proposed in this paper is motivated by an important fact: most common complex networks constructed from interaction data are structured. A phone-call interaction, for example, takes the form of a sender and receiver pair. E-mail correspondence generalizes this type of interaction to one sender but potentially multiple receivers with different attributes like "To," "Cc," and "Bcc." This paper makes a substantial push forward by constructing hierarchical models that reflect this common structure of structured interaction data. In particular, we describe the Pitman-Yor hierarchical vertex components model (PY-HVCM). This model overlays local behavior (e.g., per sender) with global information by partial pooling through a shared global, latent distribution. Simulation and theoretical analysis confirm that the PY-HVCM achieves two important properties: varying local power law degree distributions per sender and a global power law degree distribution. By explicitly modeling hierarchies of interactions, the proposed framework goes beyond previous work in modeling interaction data and network-valued data.

Relevant Prior Work on Interaction Data
Interaction data often arises in settings where communications amongst a set of constituent elements over a specific time period are recorded (Tyler, Wilkinson, and Huberman 2005; Eagle and Pentland 2006). Examples are numerous and include authorship and co-sponsoring of legislation (Fowler 2006; Signorelli and Wit 2018), sending and receiving e-mails (Tyler, Wilkinson, and Huberman 2005; Cohen 2009), posting and responding on a community forum (Rossi and Gnawali 2014), and traceroute (Luckie, Hyun, and Huffaker 2008). In each case, the interaction (edge) is the statistical unit to be modeled, as contrasted with the subjects (nodes) of interactions considered in other work (Goldenberg et al. 2010). See Crane and Dempsey (2017, 2018b) for further discussion of the advantages of defining interactions as the statistical units.
The literature contains several papers focused on statistical modeling of interaction data. Perry and Wolfe (2013) constructed a Cox proportional intensity model (Cox 1972). Butts (2008) considered likelihood-based inference using a variant of the proportional intensity model to capture interaction behavior in social settings. Crane and Dempsey (2017) considered nonhierarchical models for interaction data; they describe the notion of edge exchangeability and explore its basic statistical properties. In particular, they show that edge exchangeable models allow for sparse structure and power law degree distributions, widely observed empirical behaviors that cannot be handled by conventional approaches. Williamson (2016) studies edge exchangeable models from the perspective of link prediction, showing strong predictive performance in empirical studies.
An alternative approach emerges from the recent work of Caron and Fox (2017), who construct random graphs from point processes on $\mathbb{R}_+^2 = [0, \infty) \times [0, \infty)$. The random graph is characterized by an object called a graphex (Veitch and Roy 2015). Random graph models generated by this procedure can incorporate sparse, power law behavior into a well-defined population model. Finite random graphs can be obtained via a thresholding operation, termed p-sampling (Veitch and Roy 2019; Borgs et al. 2019). Such random graphs are exchangeable in that they are built from exchangeable point processes. Note that the graphex takes the vertex as the natural unit of observation. Development of edge exchangeable models was instead prompted by the observation that the interaction is often a more appropriate statistical unit when the observed data is a sequence of interactions. The PY-HVCM proposed in this paper takes an additional step forward by nesting interactions within a hierarchical structure that better fits much of the sparse, real-world network data encountered in practice.

Relevant Prior Work for Network-valued Data
Interaction data is closely related to network-valued data. The most common approach to statistical network modeling in the literature (Goldenberg et al. 2010) is to model the adjacency matrix of a graph; this includes the stochastic block model (Holland, Laskey, and Leinhardt 1983b) and its many variants (Airoldi et al. 2008; Latouche et al. 2011; Han, Xu, and Airoldi 2015), latent space models (Hoff, Raftery, and Handcock 2002; Athreya et al. 2017), and exponential random graph models (Robins et al. 2007). Most often, the observed graph is a random subgraph from a larger population graph (Orbanz 2017). The graph is usually constructed from the observed interaction data rather than being directly observed itself, which can result in a loss of information contained in the original data. While network-valued models have generalizations to directed, nonbinary-valued graphs, none of them fully respect the fundamental structure of interaction data. It has been observed that projection of interaction data with simple structure onto the space of adjacency matrices can fundamentally alter the characteristics of the network, such as sparsity; see Crane and Dempsey (2018a, Theorem 4.4). In contrast to Crane and Dempsey (2018a), however, our proposed framework aims to respect the complete structure of the data, including both the directed nature of the interaction data and its hierarchical set structure.
There has also been previous work on Bayesian modeling for network data (Latouche et al. 2014; van der Pas and van der Vaart 2018; Durante, Dunson, and Vogelstein 2017). The model proposed here follows this paradigm, utilizing the posterior predictive density for model behavior exploration and validation (Gelman, Meng, and Stern 1996), which is an often overlooked aspect of network models.

Motivation for Extended Edge Exchangeable Models
The critical point is that while there are network models built from interaction data, none have the following three important properties: (i) a probabilistic symmetry faithful to the statistical units, (ii) provable empirical properties of sparsity and power law degree structure, and (iii) hierarchical/multi-level structure to account for local variability in network connectivity. Our paper addresses the following fundamental question of network modeling: Is there a general framework for modeling probabilistic symmetry that provides better fit to sparse graphs with global network properties such as power law degree distribution and also accounts for variations in local network behavior? This expands upon a fundamental network modeling question raised first in Orbanz and Roy (2014), updated here to reflect the need for models that account for local variability in network connectivity (e.g., differences in receiver distributions per sender). This paper answers in the affirmative. Indeed, we start from the observation made by Crane and Dempsey (2018a) that the interaction is the fundamental statistical unit in many network settings; this simple observation leads to models with interaction exchangeability and global network properties. However, no prior work on interaction and network-valued data has focused on accounting simultaneously for interaction exchangeability, power law behavior, and local variation (e.g., across callers, senders, and article topics).
A key benefit of our Bayesian approach is that it allows us to specify posterior predictive checks (PPC) in order to perform model criticism (Blei 2014; Box and Hunter 1962), an often overlooked issue in the analysis of network data. Indeed, as written by George Box, "since all models are wrong the scientist must be alerted to what is importantly wrong" (Box 1976). Consider, for example, the Enron e-mail network presented as a case study in Section 6. While PPC confirm model fit to both global and many local summary statistics of interest, they also help identify anomalous local behavior using, as summary statistic, the total number of unique receivers in each local sender distribution.
An additional benefit of our approach is that the model class can be used to study network interdependency. Consider the Enron e-mail network. Figure 7 shows how the model captures interdependency across distributions per e-mail sender, while Figures 5 and 6 show our model simultaneously captures the global power law degree distribution and local behavior, respectively. In our second case study of the ArXiv dataset, we show how the model output can be used to infer interdependency. Figure 11 visualizes a spectral clustering that captures the interdisciplinary nature of science better than simple application of normalized spectral clustering to the raw data; see Figure 8 in the supplementary materials.

Outline and Main Contributions
The main contributions of this article are as follows:
1. We start by formally defining structured interaction data (Definition 2.1) and exchangeable structured interaction-labeled networks (Definition 2.3).
2. A computationally tractable family of such exchangeable networks, termed the PY-HVCM, is introduced in Section 3.
3. An efficient Gibbs sampling inferential algorithm for this family is derived in Section 4.
4. We establish basic statistical properties of the PY-HVCM in Section 5. In particular, we provide theoretical guarantees of sparsity and power law behavior, two important empirical properties of network data. We place the model within a larger class of hierarchical vertex components models (HVCM).
5. The model is applied to both the Enron e-mail dataset and the arXiv dataset in Section 6. Goodness-of-fit checks are performed via PPC. We demonstrate superior performance relative to nonhierarchical edge exchangeable and graphex models.
6. We end by showing HVCMs are a subfamily of exchangeable structured interaction processes, providing a representation theorem in Theorem 8.1.

Structured Interaction Data
Consider network data that arises by sampling telephone calls at random out of a call log. Each randomly sampled call can be represented as an ordered pair $(\{s\}, \{r\})$ corresponding to a sender $s$ and receiver $r$. Next consider network data that arises by sampling e-mails at random out of an e-mail database. Each randomly sampled e-mail can again be represented as an ordered pair $(\{s\}, \bar r = \{r_1, \ldots, r_k\})$ corresponding to a sender $s$ and multiple receivers $\bar r$. The same structure applies to a post on social media, where there is a poster $s$ and a set of reactors $\bar r = \{r_1, \ldots, r_k\}$. Figure 1 visualizes a randomly sampled social media post out of a database with this structure. Finally, consider network data that arises by sampling articles at random out of the ArXiv database. Each article can again be represented as an ordered pair $(\bar s = \{s_1, \ldots, s_{k_1}\}, \bar r = \{r_1, \ldots, r_{k_2}\})$ corresponding to multiple subjects $\bar s$ and multiple authors $\bar r$. All of the above are examples of structured interaction data, in which observations are sequences of sets from one or more populations. We next formally define structured interaction data and more rigorously describe the above sequence of concrete examples. To do so, we require some additional notation. Let $\mathcal P$ denote a set of constituent elements. For a set $\mathcal P$, we write $\mathrm{fin}(\mathcal P)$ to denote the set of all finite multisets of $\mathcal P$, and $\mathrm{fin}_k(\mathcal P)$ to denote the multisets of size $k$. Therefore, $\mathrm{fin}(\mathcal P)$ is the disjoint union $\cup_{k=1}^\infty \mathrm{fin}_k(\mathcal P)$.

Definition 2.1 (Structured interaction data).
A structured interaction process for an ordered sequence of sets $(\mathcal P_1, \ldots, \mathcal P_k)$ is a correspondence $\mathcal I : I \to \mathrm{fin}(\mathcal P_1) \times \cdots \times \mathrm{fin}(\mathcal P_k)$ between a set $I$ indexing interactions and ordered sequences of finite multisets from $(\mathcal P_1, \ldots, \mathcal P_k)$.
Remark 2.1 (Difference from interaction data). In Crane and Dempsey (2017), an interaction process is defined as a correspondence $\mathcal I : I \to \mathrm{fin}(\mathcal P)$, where $\mathcal P$ is a single population. Structured interaction data, instead, consists of a series of finite multisets and does not require the sets to be the same (e.g., subjects and authors in the article example). Note that both structured interaction and interaction data are defined in terms of multisets to allow for multiplicity (e.g., one actor can play multiple roles in a movie, as seen in Example 2.4).
Example 2.1 (Phone-calls). Assume the $\mathcal P_k$ are the same and let $\mathcal P_k = \mathbb N$ be a countably infinite population. A phone-call can be represented as an ordered pair of "sender" and "receiver" drawn from $\mathbb N$. Therefore, a phone-call interaction process is a correspondence $\mathcal I : I \to \mathrm{fin}_1(\mathbb N) \times \mathrm{fin}_1(\mathbb N)$. For instance, $\mathcal I(1) = (\{a\}, \{b\})$ is a phone-call from sender $a$ to receiver $b$, both in population $\mathbb N$. This is distinct from $(\{b\}, \{a\})$, where the sender and receiver roles are reversed.
Example 2.2 (E-mails). Assume the $\mathcal P_k$ are the same and let $\mathcal P_k = \mathbb N$ be a countably infinite population. An e-mail can be represented as an ordered sequence of sets: sender, receivers. Then an e-mail interaction process is a correspondence $\mathcal I : I \to \mathrm{fin}_1(\mathbb N) \times \mathrm{fin}(\mathbb N)$. For instance, $\mathcal I(1) = (\{a\}, \{b, c\})$ is an e-mail from sender $a$ to receivers $b$ and $c$. This is distinct from $(\{b\}, \{a, c\})$ and $(\{c\}, \{a, b\})$. Figure 1 is a visualization of a similar structured interaction dataset formed from Facebook posts (i.e., a poster followed by a finite multiset of responders).

Example 2.3 (Scientific articles). Consider summarizing a scientific article by its (i) list of subject areas and (ii) list of authors. Then the scientific article process is a correspondence $\mathcal I : I \to \mathrm{fin}(\mathcal P_1) \times \mathrm{fin}(\mathcal P_2)$. For instance, $\mathcal I(1) = (\{a, b\}, \{c, d\})$ is an article with subject areas $a$ and $b$ and authors $c$ and $d$. Here, $\mathcal P_1$ and $\mathcal P_2$ are distinct populations.
Example 2.4 (Movies). Consider summarizing a movie by its (i) genre, (ii) list of producers, (iii) director, and (iv) list of roles. Of course, there is overlap in certain populations, as producers can be directors and directors can be actors, but none are a genre (unless Scorsese, Spielberg, or Tarantino are considered genres unto themselves). Then the movie process is a correspondence $\mathcal I : I \to \mathrm{fin}_1(\mathbb N) \times \mathrm{fin}(\mathbb N) \times \mathrm{fin}_1(\mathbb N) \times \mathrm{fin}(\mathbb N)$. For instance, $\mathcal I(1) = (\{a\}, \{b, c\}, \{d\}, \{d, e, f, e\})$ is a movie with genre $a$, producers $b$ and $c$, director $d$, and actors $d$, $e$, and $f$. Note that one actor ($e$) plays two roles.
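To fix ideas, the examples above can be encoded directly as ordered tuples of multisets; the representation via collections.Counter below is our own illustrative choice, not notation from the paper.

```python
from collections import Counter

# Example 2.1: phone-call ({a}, {b})
call = (Counter({"a": 1}), Counter({"b": 1}))

# Example 2.2: e-mail ({a}, {b, c})
email = (Counter({"a": 1}), Counter({"b": 1, "c": 1}))

# Example 2.3: article ({a, b}, {c, d}); subjects and authors are distinct populations
article = (Counter({"a": 1, "b": 1}), Counter({"c": 1, "d": 1}))

# Example 2.4: movie ({a}, {b, c}, {d}, {d, e, f, e}); multiplicity 2 records
# that actor e plays two roles
movie = (Counter({"a": 1}), Counter({"b": 1, "c": 1}),
         Counter({"d": 1}), Counter({"d": 1, "e": 2, "f": 1}))

# The components are ordered: a call from a to b differs from one from b to a.
assert call != (Counter({"b": 1}), Counter({"a": 1}))
```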
The above shows that Definition 2.1 covers a wide variety of examples from network science. Next, we formally define (exchangeable) structured interaction-labeled networks.

Remark 2.2 (Networks constructed from interaction examples).
Often interaction data are converted into a network object. Phone-calls (Example 2.1) can be used to construct a directed graph among callers and receivers. E-mails (Example 2.2) can be used to construct a directed multigraph among senders and receivers. Scientific articles (Example 2.3) can be used to construct a k-partite graph from subjects to authors. Movies (Example 2.4) do not fall naturally within one of these categories but can be viewed as an extension of the k-partite graph. Projecting interaction data onto simple graphs can fundamentally alter the characteristics of the process, such as sparsity; see Crane and Dempsey (2018a, Theorem 4.4). Therefore, this manuscript focuses on building models for the observed structured interaction data rather than models for graphs generated from the interaction data.
Remark 2.3 (Covariates). In this article, we focus on the study of structured interaction processes in Definition 2.1 with no additional covariate information. Incorporating such covariate information is quite difficult; see (Tallberg 2004; Mariadassou, Robin, and Vacher 2010; Airoldi, Choi, and Wolfe 2011; Latouche, Robin, and Ouadah 2015; Sweet 2015; Zhang, Levina, and Zhu 2016) for examples of incorporating covariates into network analysis. There are two types of covariate information: (i) covariate information on the interaction; and (ii) covariate information on constituent elements. Examples of (i) include the subject line or body text of an e-mail, or gross movie sales for a movie. Examples of (ii) include gender, age, job title, or university affiliation of authors of a scientific article. Certain interaction covariates can be incorporated into the models considered in this paper. For example, in the ArXiv dataset, the article's subject can be viewed as covariate information on the interaction. We show that this information can be incorporated as part of the structured interaction data structure.

Interaction-Labeled Networks
For the remainder of this article, we focus on structured interaction processes of the form $\mathcal I : I \to \mathrm{fin}(\mathcal P_1) \times \mathrm{fin}(\mathcal P_2)$. This type of structured interaction process captures the phone-call, e-mail, and scientific article examples. The arguments presented naturally extend to more general structured interaction processes as given in Definition 2.1. When the two populations of constituent elements are the same (i.e., the sender and receiver populations in the phone-call and e-mail examples), we write $\mathcal P_1 \equiv \mathcal P_2$. The interaction-labeled network is an equivalence class constructed from the structured interaction process by quotienting out the labeling of the constituent elements. Let $\rho_j : \mathcal P_j \to \mathcal P_j$ be a bijection for $j = 1, 2$. We write $\rho : \mathcal P_1 \times \mathcal P_2 \to \mathcal P_1 \times \mathcal P_2$ for the composite bijection obtained by applying $\{\rho_j\}_{j=1,2}$ componentwise. If $\mathcal P_1 \equiv \mathcal P_2$, then $\rho_1 \equiv \rho_2$; that is, bijections among equivalent populations (e.g., the sender and receiver populations in an e-mail network) must agree. The bijection $\rho$ induces an action on the product space $\mathrm{fin}(\mathcal P_1) \times \mathrm{fin}(\mathcal P_2)$ by applying $\rho_1$ and $\rho_2$ elementwise to the respective multisets; therefore, $\rho$ acts on the structured interaction process via composition, $(\rho \mathcal I)(i) = \rho(\mathcal I(i))$, $i \in \mathbb N$. Using this action, we define a structured interaction-labeled network.

Definition 2.2 (Structured interaction-labeled network). A structured interaction-labeled network is the equivalence class constructed from the structured interaction process by quotienting out over bijections $\rho$:

$$y_{\mathcal I} := \{\rho \mathcal I \mid \rho_j : [\#\mathcal P_j] \to [\#\mathcal P_j] \text{ a bijection}, \; j = 1, 2\}, \qquad (1)$$

where $\#\mathcal P_j$ is the cardinality of the population $\mathcal P_j$ and $[\#\mathcal P_j]$ identifies its elements with $\{1, \ldots, \#\mathcal P_j\}$.
Note that we have only quotiented out labels for constituent elements, so the object $y_{\mathcal I}$ still has uniquely labeled interactions. For simplicity, we write $y$ and leave the subscript $\mathcal I$ implicit.
In the remainder of the article, we assume the index set $I$ is countably infinite and replace it by $\mathbb N$. For any $S \subset \mathbb N$, we define the restriction of $\mathcal I : \mathbb N \to \mathrm{fin}(\mathcal P_1) \times \mathrm{fin}(\mathcal P_2)$ to the subset $S \subset \mathbb N$ by $\mathcal I|_S$. This restricted interaction process induces a restriction to $S$ of the interaction-labeled network. We write $y_S$ to denote the interaction-labeled network associated with the restricted process $\mathcal I|_S$. For $S = [n] := \{1, \ldots, n\}$, we simply write $\mathcal I_n$ to denote the restricted structured interaction process and $y_n$ to denote the corresponding structured interaction-labeled network.

Exchangeable Structured Interaction-Labeled Network
Let $y$ denote the interaction-labeled network constructed from the structured interaction process $\mathcal I : \mathbb N \to \mathrm{fin}(\mathcal P_1) \times \mathrm{fin}(\mathcal P_2)$. For any finite permutation $\sigma : \mathbb N \to \mathbb N$, let $\mathcal I^\sigma$ denote the relabeled structured interaction process defined by $\mathcal I^\sigma(i) = \mathcal I(\sigma^{-1}(i))$, $i \in \mathbb N$. Then $y^\sigma$ denotes the corresponding interaction-labeled network constructed from $\mathcal I^\sigma$. Note that the choice of representative from the equivalence class does not matter. The above relabeling by the permutation $\sigma$ is not to be confused with the relabeling in the previous section by the bijection $\rho$. The bijection $\rho$ relabels the constituent elements and is used to construct the equivalence class in (1). The permutation $\sigma$ reorders the interaction process and therefore relabels the interaction-labeled network.
In the remainder of this article, we write $Y$ to denote a random interaction-labeled network. We assume the interactions are labeled in the countably infinite set $\mathbb N$. Interaction exchangeability is characterized by the property that the labeling of the interactions (not the constituent elements) is arbitrary. We now define exchangeable structured interaction-labeled networks.

Definition 2.3 (Exchangeable structured interaction-labeled network). A random structured interaction-labeled network $Y$ is exchangeable if $Y^\sigma =_d Y$ for all finite permutations $\sigma : \mathbb N \to \mathbb N$, where $=_d$ denotes equality in distribution. We say a structured interaction process is exchangeable if its corresponding structured interaction-labeled network is exchangeable.

Sequential Description of the PY-HVCM
Here we provide a sequential description of a particular family of exchangeable structured interaction processes, termed the PY-HVCM; see Remark 3.1 and Section 5.1 for why this particular name was chosen. For ease of comprehension, we start with the setting of a single sender, where the size of the first component is one. In this setting, the sequential description is presented in the context of e-mails. Let $(\bar\alpha, \bar\theta)$ satisfy either (1) $0 \le \bar\alpha < 1$ and $\bar\theta > 0$, or (2) $\bar\alpha < 0$ and $\bar\theta = -K\bar\alpha$ for some $K \in \mathbb N$. In setting (1), the population $\mathcal P_1$ is infinite, while in setting (2) the population is finite and equal to $K$. In Section 3.2, we show how to extend this model to the general setting of multiple senders.
For ease of comprehension, we let $\mathcal S = \mathcal P_1$ (senders) and $\mathcal R = \mathcal P_2$ (receivers) denote the two sets of constituent elements. We introduce some additional notation. For each $n = 1, 2, \ldots$, the $n$th e-mail $E_n$ is given by the structured interaction $(\bar S_n, \bar R_n) = (\{S_{n,1}\}, \{R_{n,1}, \ldots, R_{n,k_n}\})$, where $S_{n,1} \in \mathcal S$ is the sender and $R_{n,j} \in \mathcal R$ is the $j$th receiver of the $n$th e-mail. Suppose $n$ e-mails have been observed and define $\mathcal H_n = \{E_1, \ldots, E_n\}$ to be the observed history of the first $n$ e-mails. For the $(n+1)$st e-mail, choose the sender according to

$$\mathrm{pr}(S_{n+1,1} = s \mid \mathcal H_n) = \begin{cases} (D^{\mathrm{out}}_n(s) - \bar\alpha)/(n + \bar\theta), & s \in \mathcal S_n, \\ (\bar\theta + \bar\alpha\,|\mathcal S_n|)/(n + \bar\theta), & s \notin \mathcal S_n, \end{cases} \qquad (2)$$

where $D^{\mathrm{out}}_n(s)$ is the outdegree of sender $s$, $\mathcal S_n$ is the set of unique senders in $(\bar S_1, \ldots, \bar S_n)$, and $|\mathcal S_n|$ is that set's cardinality.
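In code, the sender step (2) is the familiar two-parameter urn; the sketch below is our own minimal version, with the sentinel value "NEW" standing in for a previously unseen sender.

```python
import random

def draw_sender(outdeg, n, alpha_bar, theta_bar):
    """Draw the (n+1)st sender according to Equation (2).

    outdeg maps each observed sender to its outdegree D_n_out(s);
    n is the number of e-mails observed so far (the sum of outdegrees)."""
    senders = list(outdeg)
    weights = [(outdeg[s] - alpha_bar) / (n + theta_bar) for s in senders]
    w_new = (theta_bar + alpha_bar * len(senders)) / (n + theta_bar)
    return random.choices(senders + ["NEW"], weights=weights + [w_new])[0]
```

A return value of "NEW" signals that a new sender enters the population, at which point it would be assigned a fresh label and an outdegree of one.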
Given $S_{n+1,1} = s$, we choose the number of recipients according to the discrete probability distribution $\{\nu^{(s)}_{k}\}_{k \ge 1}$. Let $D_{n,j}(s, r)$ denote the indegree of receiver $r$ when restricted to e-mails from sender $s$ after the first $n - 1$ e-mails and the first $j - 1$ recipients of the $n$th e-mail; that is,

$$D_{n,j}(s, r) = \#\{(m, l) : S_{m,1} = s \text{ and } R_{m,l} = r, \text{ for } m < n \text{ and } l \le k_m, \text{ or } m = n \text{ and } l < j\}.$$

Let $m_{n,j}(s) = \sum_{r \in \mathcal R} D_{n,j}(s, r)$ be the number of receivers (accounting for multiplicity) of e-mails from sender $s$. Each statistic is a measurable function of the observed history and is local (i.e., specific to the sender $s$).
Here we describe a procedure for sharing information across senders. To do this, we define a partially observable global set of information. Let $\mathcal R_{n,j}$ be the set of observed receivers after the first $n - 1$ e-mails and the first $j - 1$ recipients of the $n$th e-mail; that is,

$$\mathcal R_{n,j} = \{r \in \mathcal R \mid \exists\, R_{m,l} = r \text{ for } m < n \text{ and } l \le k_m, \text{ or } m = n \text{ and } l < j\}.$$
Additionally, let $K_{n,j} = |\mathcal R_{n,j}|$ be the cardinality of this set. For each $r \in \mathcal R_{n,j}$ we posit the existence of a latent degree per sender $s \in \mathcal S_n$ and receiver $r$, denoted $V_{n,j}(s, r)$. We then define $V_{n,j}(\cdot, r) = \sum_{s \in \mathcal S_n} V_{n,j}(s, r)$, $V_{n,j}(s, \cdot) = \sum_{r \in \mathcal R_{n,j}} V_{n,j}(s, r)$, and $\bar m_{n,j} = \sum_{s \in \mathcal S_n} \sum_{r \in \mathcal R_{n,j}} V_{n,j}(s, r)$. The global latent degrees $\{V_{n,j}(\cdot, r)\}_{r \in \mathcal R}$ will serve as the primary means of sharing receiver information among senders; see Section 3.1 for a discussion.
Define $\mathcal R_{n,j}(s)$ to be the corresponding set of receivers when restricting to e-mails from sender $s \in \mathcal S$, and $\mathcal H_{n,j}$ to be the observable history $\mathcal H_{n-1} \cup \{S_{n,1}, R_{n,1}, \ldots, R_{n,j}\}$. That is, $\mathcal H_{n,j}$ is the observed history up to and including the $j$th receiver of the $n$th e-mail, where $\mathcal H_{n,0}$ contains only the sender information for the $n$th e-mail. For each $s \in \mathcal S$, let $(\alpha_s, \theta_s)$ satisfy either (1) $0 \le \alpha_s < 1$ and $\theta_s > 0$, or (2) $\alpha_s < 0$ and $\theta_s = -K\alpha_s$ for some $K \in \mathbb N$. Under (1) the receiver population $\mathcal P_2$ is infinite; under (2) the population is finite and equal to $K$. For the remainder of this paper, we assume setting (1).
Given the indegree distribution $\{D_{n,j}(s', r')\}_{r' \in \mathcal R_{n,j}, s' \in \mathcal S_n}$, the latent degree distribution $\{V_{n,j}(s', r')\}_{r' \in \mathcal R_{n,j}, s' \in \mathcal S_n}$, the current sender $s$, and the observable history $\mathcal H_{n,j-1}$, the probability of choosing receiver $r \in \mathcal R_{n,j}$ is proportional to

$$D_{n,j}(s, r) - \alpha_s V_{n,j}(s, r) + \frac{\theta_s + \alpha_s V_{n,j}(s, \cdot)}{\bar m_{n,j} + \bar\theta}\,\big(V_{n,j}(\cdot, r) - \bar\alpha\big), \qquad (3)$$

and the probability of choosing a previously unobserved receiver is proportional to

$$\frac{\theta_s + \alpha_s V_{n,j}(s, \cdot)}{\bar m_{n,j} + \bar\theta}\,\big(\bar\theta + \bar\alpha K_{n,j}\big). \qquad (4)$$

Note the difference between the discount of the indegree in Equation (3) and of the outdegree in Equation (2). For the sender distribution (2), the outdegree discount is $\bar\alpha$; for Equation (3), the indegree discount is $\alpha_s V_{n,j}(s, r)$. This reflects that in Equation (2) sender $s$ is chosen from a single distribution, whereas in Equation (3) receiver $r$ can be chosen either locally or globally. The remaining question is how to update the degree distributions. In Equations (3) and (4), we either observe $r$ "locally," or we escape the local model and observe $r$ due to the latent global information. Given $R_{n,j} = r$, we update both local and global degrees. If $r \notin \mathcal R_{n,j}(s)$, then the local degree $D_{n,j}(s, r)$ and the global degree $V_{n,j}(s, r)$ both increase from zero to one. If $r \in \mathcal R_{n,j}(s)$, then the local degree $D_{n,j}(s, r)$ increases by one and the latent degree is increased by one with probability $(\theta_s + \alpha_s V_{n,j}(s, \cdot))/(m_{n,j}(s) + \theta_s)$. The exact procedure for incrementing $V_{n,j}$ is discussed in Section 4.
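To make the mechanism concrete, the following sketch implements the receiver step, Equations (3) and (4), together with the degree updates just described. It is a minimal illustration under simplifying assumptions of ours (a single shared $(\alpha_s, \theta_s)$ pair for all senders, integer receiver labels, aggregates recomputed on every draw); the class and method names are not the paper's.

```python
import random
from collections import defaultdict

class PYHVCMReceivers:
    """Sketch of the receiver mechanism in Equations (3)-(4).

    Tracks local indegrees D(s, r) and latent degrees V(s, r); aggregates
    are recomputed on each draw for clarity, not efficiency."""

    def __init__(self, alpha_s, theta_s, alpha_bar, theta_bar):
        self.a_s, self.t_s = alpha_s, theta_s    # shared local discount/concentration
        self.a, self.t = alpha_bar, theta_bar    # global discount/concentration
        self.D = defaultdict(lambda: defaultdict(int))   # D[s][r]: local indegree
        self.V = defaultdict(lambda: defaultdict(int))   # V[s][r]: latent degree
        self.next_label = 0                      # integer labels for new receivers

    def draw(self, s):
        V_s = sum(self.V[s].values())            # V(s, .)
        V_r = defaultdict(int)                   # V(., r)
        for sender in self.V:
            for r, v in self.V[sender].items():
                V_r[r] += v
        m_bar, K = sum(V_r.values()), len(V_r)
        escape = self.t_s + self.a_s * V_s       # numerator of the escape probability
        labels = list(V_r)
        # Equation (3): previously observed receivers.
        w = [self.D[s][r] - self.a_s * self.V[s][r]
             + escape * (V_r[r] - self.a) / (m_bar + self.t) for r in labels]
        # Equation (4): a brand-new receiver.
        w.append(escape * (self.t + self.a * K) / (m_bar + self.t))
        r = random.choices(labels + [self.next_label], weights=w)[0]
        # Degree updates, as described after Equation (4).
        if self.D[s][r] == 0:                    # receiver new to sender s
            self.V[s][r] += 1
            if r == self.next_label:
                self.next_label += 1
        else:                                    # latent degree grows stochastically
            m_s = sum(self.D[s].values())        # m(s) before this draw
            if random.random() < escape / (m_s + self.t_s):
                self.V[s][r] += 1
        self.D[s][r] += 1
        return r
```

For instance, `[model.draw("alice") for _ in range(100)]` simulates a stream of single-receiver e-mails from one sender while sharing receiver labels globally across senders.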

Partial Pooling
The importance of the latent global degree distribution is that it allows information to be shared across the conditional receiver distributions. The above model formalizes the partial pooling of information. The degree of pooling is controlled by the probability of escaping the local distribution, $(\theta_s + \alpha_s V_{n,j}(s, \cdot))/(m_{n,j}(s) + \theta_s)$. Note that this quantity decreases as the number of e-mails from sender $s$ increases whenever $\alpha_s < 1$. Therefore, the impact of the latent global degree information on the local distribution becomes negligible as more samples from the particular sender are collected. However, the first time a sender-receiver pair is observed, it must occur via the shared global set of information. The global latent degrees $\{V_{n,j}(s, r)\}_{r \in \mathcal R_{n,j}, s \in \mathcal S_n}$ therefore contribute to the behavior of new and/or rarely seen senders.
Partial pooling is only possible due to Definition 2.1 which extends the prior notion of the interaction process (see Remark 2.1). Our sharing mechanism exploits the natural structure of many interaction datasets. The next question is whether partial pooling as defined above breaks interaction exchangeability. Theorem 5.1 will show that our proposed model satisfies this natural invariance property. Moreover, with the added flexibility compared to the Hollywood model, one may be concerned that theoretical investigation of limiting properties may be difficult; yet we establish in Theorems 5.2 and 5.3 that global sparsity and power law are preserved. Finally, the latent global degree information ensures a consistent labeling across local models. Without this, global summary statistics such as the number of unique receivers and global degree distribution are not well-defined; see Section 1 in the supplementary materials for further discussion.
We demonstrate partial pooling via simulation, which shows that the level of heterogeneity among local distributions (i.e., per sender) depends on the parameter $\alpha_s$. In the sequential description, $\alpha_s$ controls the number of times that the latent global distribution is used to choose the next receiver. When $\alpha_s \equiv 1$, the model reverts to the Hollywood model (Crane and Dempsey 2017). For the simulation, the following priors are used: $\bar\alpha \sim \mathrm{Beta}(5, 5)$, $\bar\theta \sim \mathrm{Gamma}(10, 10)$, $\theta_s \sim \mathrm{Gamma}(10, 10)$; $\alpha_s$ is set constant across $s \in \mathcal S$. For $\alpha_s \in \{0, 0.25, 0.5, 0.75, 1\}$, we simulate a dataset of 100 senders and 10,000 interactions per sender, each of size 1 (i.e., a single receiver per interaction). For each sender, the $L_1$ distance between the local distribution and the empirical global distribution is computed. We then repeat this experiment 50 times with different random interactions. Figure 2 visualizes the results, showing that the $L_1$ distance, and thus the degree of heterogeneity, decreases as $\alpha_s$ increases. A plate diagram visualizing the partial pooling within the proposed model is provided in Section 1 of the supplementary materials.
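A scaled-down version of this experiment can be run with the PYHVCMReceivers sketch above. The parameter values below are fixed arbitrarily rather than drawn from the stated priors, so this illustrates the computation rather than reproducing Figure 2.

```python
import random
import numpy as np

def l1_heterogeneity(alpha_s, n_senders=20, n_draws=500, seed=0):
    """Average L1 distance between local and global empirical receiver
    distributions (a scaled-down version of the Figure 2 experiment)."""
    random.seed(seed)
    model = PYHVCMReceivers(alpha_s=alpha_s, theta_s=1.0,
                            alpha_bar=0.5, theta_bar=10.0)
    local = {s: [model.draw(s) for _ in range(n_draws)]
             for s in range(n_senders)}
    labels = sorted({r for draws in local.values() for r in draws})
    glob = np.array([sum(d.count(r) for d in local.values()) for r in labels],
                    dtype=float)
    glob /= glob.sum()
    dists = [np.abs(np.array([draws.count(r) for r in labels]) / n_draws
                    - glob).sum()
             for draws in local.values()]
    return float(np.mean(dists))

for a in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(a, l1_heterogeneity(a))   # distance shrinks as alpha_s grows
```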
Remark 3.1 (Connections to the CRFP). Note that the PY-HVCM is closely related to the Chinese restaurant franchise process (CRFP), a well-known process in the machine learning literature (Pitman 2006; Blei, Ng, and Jordan 2003) that is almost exclusively used to model latent clusters in data. Here, we use these ideas in the construction of the interaction process. Thus, the objectives are quite different; for instance, there is almost no focus on inference for the model parameters in the ML community, whereas in our setting these parameters are crucial to understanding the overall interaction process behavior. It is also important to note that, in contrast to the CRFP, the proposed model does not have a convenient stick-breaking representation. The model is most similar to that of Teh (2006), where it is used for language modeling. The above construction is also related to the Pitman-Yor process (hence the model's name) as well as the GEM distributions (Pitman 2006).

Accounting for Multiple Elements in First Component
In the general setting, the first component, $\bar S_n$, is a random element of $\mathrm{fin}(\mathcal P_1)$ (i.e., a random finite multiset of elements from $\mathcal P_1$). In the sequential description, we assumed the size of this multiset was one. We now consider $\bar S_n = \{S_{n,1}, \ldots, S_{n,k_{n,1}}\}$ for general $k_{n,1} \ge 1$. First, let $\mathcal H^{(s)}_{n,j} = \mathcal H_n \cup \{S_{n+1,1}, \ldots, S_{n+1,j}\}$ denote the history of the first $n$ e-mails and the first $j$ senders of the $(n+1)$st e-mail. Extension of Equation (2) to handle multiple senders is straightforward: replace $\mathcal H_n$ by $\mathcal H^{(s)}_{n,j}$ and define all other terms analogously.
In the sequential description, the sender $S_{n,1}$ is used to specify which local statistics (i.e., $V_{n,j}(s, r)$, $D_{n,j}(s, r)$, and $m_{n,j}(s)$) to consider. However, when there are multiple senders, this choice is no longer straightforward. To address this, we introduce a random variable $Z_n$ with domain $\bar S_n$. This variable indicates which local statistics will be used in the receiver distributions (i.e., Equations (3) and (4)). Define $\mathcal S^{(z)}_n$ to be the unique elements in $\mathcal H^{(z)}_n := (Z_1, \ldots, Z_n)$. Then

$$\mathrm{pr}(Z_n = z \mid \mathcal H^{(z)}_{n-1}, \bar S_n) \propto \begin{cases} D^{(z)}_{n-1}(z) - \bar\alpha_z, & z \in \bar S_n \cap \mathcal S^{(z)}_{n-1}, \\ \bar\theta_z + \bar\alpha_z\,|\mathcal S^{(z)}_{n-1}|, & z \in \bar S_n \setminus \mathcal S^{(z)}_{n-1}, \end{cases}$$

where $D^{(z)}_{n-1}(z)$ counts the occurrences of $z$ in $(Z_1, \ldots, Z_{n-1})$, and where (1) $0 < \bar\alpha_z < 1$ and $\bar\theta_z > 0$ if the population $\mathcal S$ is considered infinite, or (2) $\bar\alpha_z < 0$ and $\bar\theta_z = -K\bar\alpha_z$ if the population is finite and $|\mathcal S| = K$. This is equivalent to restricting Equation (2) to be nonzero only on the domain $\bar S_n$; moreover, it is conditional on the history $\mathcal H^{(z)}_{n-1}$ rather than $\mathcal H_n$.

Posterior Inference
We now consider performing posterior inference for the proposed model given an observed interaction network $Y_n$. As in Section 3, we start with the setting where the size of the first component is one (i.e., $\nu_{k_1} = \mathbf 1[k_1 = 1]$). The parameters $(\nu^{(s)}_{k_2})_{k_2 \in \mathbb N}$ for all $s \in \mathcal S$ are estimated nonparametrically and are not important for the remainder of the article; the details for these parameters are therefore omitted.
We start by reparameterizing the model in a form more useful for inference, one which gives an explicit structure for updating the latent degree $V_{n,j}$; we call this the extended PY-HVCM. In this representation, every "escape" from the local distribution and choice of receiver $r$ leads to an auxiliary vertex $v$ being introduced locally for sender $s$; auxiliary vertices are not shared between senders. The label $l_s(v)$ of the auxiliary vertex is $r$; the auxiliary vertex accounts for the fact that the global distribution can select receiver $r$ multiple times. Finally, the observed receiver is assigned to the auxiliary vertex, and we write that assignment as $\phi_{n,j} = v$. The number of auxiliary vertices with label $r$ and sender $s$ is equal to the number of times the local distribution for sender $s$ escapes and chooses the global set of information (i.e., $V_{n,j}(s, r)$). The sum of the degrees across auxiliary vertices with label $r$ and sender $s$ is equal to the indegree of receiver $r$ (i.e., $D_{n,j}(s, r)$). Finally, we write $d_{srv}$ to denote the degree of auxiliary vertex $v$ in sender $s$ that also has label $r$. Note that for $r \neq l_s(v)$, $d_{srv} = 0$.
Given $\mathcal H_n$ and $S_{n+1,1} = s$, the probability that $R_{n+1,j}$ is assigned to auxiliary vertex $\phi_{n+1,j} = v$ is

$$\mathrm{pr}(\phi_{n+1,j} = v \mid \cdot) \propto \begin{cases} d_{srv} - \alpha_s, & v \le V_{n+1,j}(s, \cdot) \text{ with } r = l_s(v), \\ \theta_s + \alpha_s V_{n+1,j}(s, \cdot), & v = V_{n+1,j}(s, \cdot) + 1, \end{cases} \qquad (5)$$

and the observed receiver is then $R_{n+1,j} = l_s(\phi_{n+1,j})$. Further, if $\phi_{n+1,j} = V_{n+1,j}(s, \cdot) + 1$, then we add auxiliary vertex $V_{n+1,j}(s, \cdot) + 1$ with its label chosen with probability

$$\mathrm{pr}(l_s(V_{n+1,j}(s, \cdot) + 1) = r \mid \cdot) \propto \begin{cases} V_{n+1,j}(\cdot, r) - \bar\alpha, & r \in \mathcal R_{n+1,j}, \\ \bar\theta + \bar\alpha K_{n+1,j}, & r \notin \mathcal R_{n+1,j}. \end{cases}$$

The joint density of an auxiliary-vertex configuration can be written in terms of rising factorials $(a)_{c \uparrow b} := \prod_{i=0}^{c-1}(a + ib)$ for $c \in \mathbb N$ and $a, b \in \mathbb R_+$:

$$p(\text{configuration}) = \frac{(\bar\theta)_{K_N \uparrow \bar\alpha}}{(\bar\theta)_{\bar m_N \uparrow 1}} \prod_{r} (1 - \bar\alpha)_{(V_N(\cdot, r) - 1) \uparrow 1} \prod_{s} \frac{(\theta_s)_{V_N(s, \cdot) \uparrow \alpha_s}}{(\theta_s)_{m_N(s) \uparrow 1}} \prod_{r, v} (1 - \alpha_s)_{(d_{srv} - 1) \uparrow 1}. \qquad (6)$$

The joint density as written in (6) is exchangeable with respect to reordering of the interactions.
Lemma 4.1 establishes that the proposed model in Section 3 is recovered by marginalizing over configurations of auxiliary vertex labels and assignments, which leaves only the observed degrees $D_{n,j}$ and latent degrees $V_{n,j}$. The proof is given in Section 3 of the supplementary materials.
Lemma 4.1. Marginalizing the extended PY-HVCM over configurations of auxiliary vertex assignments and labels recovers the PY-HVCM.
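In code, the auxiliary-vertex assignment (5) for a single draw reduces to a weighted choice over the existing vertices plus one "new vertex" option; this sketch (ours) mirrors the table step of a Chinese restaurant franchise.

```python
import random

def assign_table(d_tables, alpha_s, theta_s):
    """Assign a receiver draw to an auxiliary vertex per Equation (5).

    d_tables: current degrees d_srv of sender s's auxiliary vertices.
    Returns an index into d_tables, or len(d_tables) for a new vertex."""
    V_s = len(d_tables)                          # V(s, .)
    weights = [d - alpha_s for d in d_tables]    # existing: d_srv - alpha_s
    weights.append(theta_s + alpha_s * V_s)      # new: theta_s + alpha_s V(s, .)
    return random.choices(range(V_s + 1), weights=weights)[0]
```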

Choice of Priors
Here, we describe two approaches to specifying priors for the global parameters $\bar\theta, \bar\alpha$ and the local parameters $\theta_s, \alpha_s$, $s \in \mathcal S$.

Conjugate Bayesian Parameters
The first approach is to set the priors for the $\theta$ parameters to a high-variance Gamma distribution and the priors for the $\alpha$ parameters to a Beta distribution. In general, the global $\bar\theta$ will be much larger than the local parameters, and the appropriate values will depend on the sparsity of the overall network; for instance, the global $\bar\theta$ for the arXiv data is an order of magnitude greater than the global $\bar\theta$ for the Enron dataset. An appropriate prior is one under which the PPC on sparsity match the observed degree of sparsity in the interaction data. See Section 6.3 for details on PPC and model comparison.
For datasets of reasonable size, we have found that the prior for the global parameters does not significantly affect the resulting posterior density. In the subsequent examples, the size of the datasets was more than sufficient for the posterior not to be strongly affected by the choice of global priors. For the $\bar\alpha$ parameter, this suggests using the Beta(1, 1) distribution, that is, the Uniform distribution. With $\bar\theta$, different datasets can have posterior means that differ by two or three orders of magnitude. Although the posterior density is mostly unchanged, attempting inference with a mismatched $\bar\theta$ prior will require more Gibbs samples before mixing occurs. We have found that $\bar\theta \sim \mathrm{Gamma}(1, 10{,}000)$ is an appropriately diffuse prior that allows for fast mixing. The lower-level parameter $\theta_s$ is generally much smaller than the global $\bar\theta$, so $\theta_s \sim \mathrm{Gamma}(1, 1000)$ is an appropriate prior that allows for variety in distribution but also has a prior mean lower than that of the global $\bar\theta$. For the local $\alpha_s$, we again use the Uniform distribution. Section 6 of the supplementary materials provides a simulation study that supports the claim of fast mixing of our sampling algorithm.
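As a sketch, the defaults above can be bundled as scipy distributions. We read Gamma(a, b) throughout in shape-scale form (mean ab), which is our interpretation of the stated parameterizations.

```python
from scipy import stats

# Shape-scale reading of the Gamma priors (mean = shape * scale).
priors = {
    "alpha_bar": stats.beta(1, 1),             # Uniform(0, 1) global discount
    "theta_bar": stats.gamma(a=1, scale=1e4),  # diffuse, large-mean global theta
    "alpha_s":   stats.beta(1, 1),             # Uniform(0, 1) local discounts
    "theta_s":   stats.gamma(a=1, scale=1e3),  # smaller-mean local thetas
}

init = {name: dist.rvs() for name, dist in priors.items()}  # Gibbs initialization
```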

Priors Based on Hollywood Model Fits
The second approach, which is used in Section 7 for the arXiv dataset, is to fit the Hollywood model (Crane and Dempsey 2018a) to each of the local datasets and then use a $\mathrm{Gamma}(\hat\theta/100, 100)$ prior for the $\theta$'s, where $\hat\theta$ is the Hollywood model estimate of $\theta$. The priors for the $\alpha$'s are again set to Beta(1, 1).

Gibbs Sampling Algorithm
Here we introduce a Gibbs sampling algorithm for sampling from the posterior distribution of the parameters $(\bar\alpha, \bar\theta, \{\alpha_s, \theta_s\}_{s \in \mathcal S})$ given an observed interaction-labeled network $Y_n$. To do this, we use auxiliary variable methods (Escobar and West 1995) to perform conjugate updates for all parameters. First, define the binary auxiliary variables $z_{rj}$ for $r \in \mathcal R$, $j = 1, \ldots, V_N(\cdot, r) - 1$, and $z_{srvu}$ for $s \in \mathcal S$, $r \in \mathcal R$, $v = 1, \ldots, V_N(s, \cdot) - 1$, $u = 1, \ldots, d_{srv} - 1$. Next define auxiliary variables $y_i$ for $i = 1, \ldots, v(Y_n) - 1$ and $y_{si}$ for $s \in \mathcal S$ and $i = 1, \ldots, d_{s\cdot\cdot} - 1$. Finally, define auxiliary variables $x, \{x_s\}_{s \in \mathcal S} \in [0, 1]$. We formally derive these updates in Section 2 of the supplementary materials; the algorithm is similar to the one described in Teh (2006), except for the modifications required for our model. While sampling each auxiliary vertex for the receivers, we also update the set of auxiliary vertices $[V_N(s, r)]$ and their degrees $d_{srv}$.
For instance, the update for the global concentration parameter draws $x \sim \mathrm{Beta}(\bar\theta + 1, \bar m_N - 1)$ and then updates $\bar\theta$ via a conjugate Gamma draw; the remaining parameters are updated analogously. There are two important differences between this algorithm and Teh (2006). First, in the case of multiple elements in the first component, we perform the approximate sampling procedure of Section 4.3 to find the latent $Z_i$. Second, the language model in Teh (2006) has multiple levels of hierarchical parameters, whereas we have only two levels of components. Convergence can be checked via traceplots and, in our experiments, occurs within the first hundred or so iterations; see Figure 4 for traceplots in the e-mail network example.
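A sketch of the conjugate update for $\bar\theta$, following the auxiliary-variable scheme of Teh (2006) referenced above. We write the Gamma prior in shape-rate form here, so $(a, b) = (1, 10^{-4})$ matches the Gamma(1, 10,000) shape-scale prior discussed earlier; this is our own minimal rendering, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def update_theta_bar(theta, alpha, K, m, a=1.0, b=1e-4):
    """One auxiliary-variable update for the global concentration parameter.

    K: number of unique receivers (assumed >= 2); m: total latent mass (m-bar);
    (a, b): shape and rate of the Gamma prior on theta."""
    x = rng.beta(theta + 1.0, m - 1.0)                        # x ~ Beta(theta+1, m-1)
    y = rng.random(K - 1) < theta / (theta + alpha * np.arange(1, K))
    return rng.gamma(shape=a + y.sum(), scale=1.0 / (b - np.log(x)))
```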

Approximate Sampling in the Case of Multiple Elements in the First Component
In the case where $\bar S_n$ may contain multiple elements, one can sample the latent indicator $Z_i$ from its posterior, which is proportional to the prior probability of $Z_i = s$ times the likelihood $\mathrm{pr}(\bar R_i = \bar r_i \mid Z_i = s)$. Note that, in general, the joint likelihood $\mathrm{pr}(\bar R_i = \bar r_i \mid Z_i = s)$ is difficult to calculate due to the marginalization over all possible vertex label configurations for $\bar R_i$. Instead, we propose a sampling procedure to approximate this quantity by sequentially sampling the vertex labels $\bar V_i$ using the given counts, where $\bar V_i$ denotes the multiset $\{V_{i,1}, \ldots, V_{i,k_{i,2}}\}$. After sampling $\bar V_i$ for a number of runs, we average the likelihoods to obtain an estimate of $\mathrm{pr}(\bar R_i = \bar r_i \mid Z_i)$.
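Schematically, the approximation is a plain Monte Carlo average. In the sketch below, the two hooks stand in for the sequential label-sampling and configuration-likelihood computations (detailed in the supplementary materials); they are parameters of the sketch, not functions the paper defines.

```python
import numpy as np

def approx_receiver_likelihood(receivers, s, sample_config, config_lik,
                               n_runs=25, rng=None):
    """Monte Carlo estimate of pr(R_i = receivers | Z_i = s).

    sample_config(receivers, s, rng) draws one vertex-label configuration;
    config_lik(receivers, s, config) evaluates the likelihood given it.
    Both are caller-supplied hooks standing in for the sequential steps."""
    rng = rng or np.random.default_rng()
    vals = [config_lik(receivers, s, sample_config(receivers, s, rng))
            for _ in range(n_runs)]
    return float(np.mean(vals))
```

The latent indicator $Z_i$ is then drawn with probability proportional to its prior weight times this estimate.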

Statistical Properties
We now state several theoretical results for the PY-HVCM proposed in Section 3.

Theorem 5.1. The structured interaction process generated by the sequential description in Section 3 is exchangeable.

Theorem 5.1 is not immediate from the sequential construction in Section 3, but is clear from the reparameterization of the model presented in Section 4 and its connection to the PY-HVCM given in Lemma 4.1. This connection is formalized in Section 3 of the supplementary materials.
The remainder of this section focuses on the setting where the size of the first component is one (i.e., $\nu_{k_1} = \mathbf 1[k_1 = 1]$). Moreover, we make certain additional assumptions concerning the sender distributions. These constraints allow sufficient complexity to be interesting but impose sufficient regularity to push through the theoretical analysis. First, we turn to the growth rates of the expected number of unique receivers. Unlike in the Hollywood model, this rate depends on the distribution over senders, the global parameter $\bar\alpha$, and the local parameters $\{\alpha_s\}_{s \in \mathcal P_1}$. Before stating the theorem, we require a formal definition of sparsity. For clarity, we define quantities in terms of receivers to distinguish vertices observed as senders from those observed as receivers (i.e., in $\mathcal P_1$ and $\mathcal P_2$, respectively).
For a structured interaction-labeled network $Y$, let $v(Y)$ denote the number of non-isolated receivers; $e(Y)$ the number of interactions; $M_k(Y)$ the number of interactions with $k$ receivers; $N_k(Y)$ the number of receivers that appear exactly $k$ times; and $d(Y) = (N_1(Y), N_2(Y), \ldots)$ the receiver degree distribution of $Y$. Note that these are global statistics that do not depend on the interaction labels. We define local versions by superscripting each statistic by $s \in \mathcal P_1$. For instance, $v^{(s)}(Y)$ is the number of non-isolated receivers when restricting $Y$ to only those interactions involving sender $s$.
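These statistics are straightforward to compute from a list of interactions; the helper below is our own encoding, with interactions represented as (sender, receiver-list) pairs.

```python
from collections import Counter

def receiver_statistics(interactions):
    """Global statistics of an interaction-labeled network Y.

    interactions: list of (sender, receivers) pairs, receivers a list.
    Returns v(Y), e(Y), M_k(Y), and N_k(Y)."""
    degree = Counter(r for _, receivers in interactions for r in receivers)
    M = Counter(len(receivers) for _, receivers in interactions)
    N = Counter(degree.values())          # N_k: receivers appearing exactly k times
    return {"v": len(degree),             # non-isolated receivers
            "e": len(interactions),       # interactions
            "M": dict(M), "N": dict(N)}
```

The local versions, e.g. $v^{(s)}(Y)$, follow by first filtering the interactions on the sender.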
Definition 5.1 (Sparsity). A sequence $(Y_n)_{n \ge 1}$ of structured interaction-labeled networks is sparse if

$$\limsup_{n \to \infty} \frac{e(Y_n)}{v(Y_n)^{\bar a(Y_n)}} = 0,$$

where $\bar a(Y_n) = e(Y_n)^{-1} \sum_{k \ge 1} k\, M_k(Y_n)$ is the average arity (i.e., number of receivers) of the interactions in $Y_n$. A non-sparse network is dense. We say the sequence is $s$-locally sparse if the analogous condition holds for the local statistics $e^{(s)}(Y_n)$, $v^{(s)}(Y_n)$, and $\bar a^{(s)}(Y_n)$, where $\bar a^{(s)}(Y_n)$ is the average arity of the interactions in $Y_n$ from sender $s \in \mathcal P_1$. A network that is not $s$-locally sparse is $s$-locally dense.
For $(X_n)_{n \ge 1}$ a sequence of positive random variables and $(y_n)_{n \ge 1}$ a sequence of positive nonrandom variables, let $X_n \asymp y_n$ indicate that $\lim_{n \to \infty} X_n/y_n$ exists almost surely and equals a finite and positive random variable. Theorem 5.2 shows the PY-HVCM may be either globally sparse and/or dense. The theorem assumes a finite population of senders with the number of e-mails per sender drawn from a multinomial distribution.
Theorem 5.2 establishes that the PY-HVCM can capture degrees of sparsity. If $\mu_s = \mu$ for all $s \in \mathcal P_1$, then a dense network must be $s$-locally dense (i.e., $\bar\alpha \alpha_s < \mu^{-1}$) for all $s \in \mathcal P_1$. However, a sparse network can be $s$-locally dense for some, but not all, $s \in \mathcal P_1$. We turn now to considerations of power law degree distributions for interaction-labeled networks. We start with a definition.
Definition 5.2 (Global power law degree distribution). A sequence $(Y_n)_{n \ge 1}$ exhibits a power law degree distribution (Crane and Dempsey 2017; Caron and Fox 2017; Veitch and Roy 2015) if, for some $\gamma > 1$, the degree distributions $(d(Y_n))_{n \ge 1}$ satisfy $d_k(Y_n) \sim l(k) k^{-\gamma}$ as $n \to \infty$ for all large $k$, for some slowly varying function $l(x)$; that is, $\lim_{x \to \infty} l(tx)/l(x) = 1$ for all $t > 0$, where $a_n \sim b_n$ indicates that $a_n/b_n \to 1$ as $n \to \infty$. More precisely, $(Y_n)_{n \ge 1}$ has a power law degree distribution with index $\gamma$ if $N_k(Y_n)/v(Y_n) \to p(k)$ almost surely as $n \to \infty$, with $p(k) \sim l(k) k^{-\gamma}$ as $k \to \infty$. Theorem 5.3 establishes the power law degree distribution for the PY-HVCM in the case $\alpha_s = 1$, $\forall s \in \mathcal S$.

Theorem 5.3. Let $(Y_n)_{n \in \mathbb N}$ obey the sequential description in Section 3 with parameters $(\bar\alpha, \bar\theta)$ and let $\alpha_s = 1$ for all $s \in \mathcal P_1$. For each $n \ge 1$, let $p_n(k) = N_k(Y_n)/v(Y_n)$ for $k \ge 1$ be the empirical receiver degree distribution, where $N_k(Y_n)$ is the number of receivers of degree $k \ge 1$ and $v(Y_n)$ is the number of unique receivers in $Y_n$. Then, for every $k \ge 1$,

$$p_n(k) \to \frac{\bar\alpha\,\Gamma(k - \bar\alpha)}{\Gamma(1 - \bar\alpha)\,\Gamma(k + 1)} \quad \text{almost surely as } n \to \infty,$$

where $\Gamma(t) = \int_0^\infty x^{t-1} e^{-x}\,dx$ is the gamma function. That is, $(Y_n)_{n \ge 1}$ has a power law degree distribution with exponent $\gamma = 1 + \bar\alpha \in (1, 2)$.
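The limiting probabilities in Theorem 5.3 are easy to evaluate through log-gamma functions; the check below (our own, at the arbitrary value $\bar\alpha = 0.6$) illustrates the $k^{-(1+\bar\alpha)}$ tail.

```python
from math import exp, lgamma

def p_limit(k, alpha_bar):
    """Limiting receiver degree probability in Theorem 5.3."""
    return alpha_bar * exp(lgamma(k - alpha_bar)
                           - lgamma(1 - alpha_bar) - lgamma(k + 1))

# The ratio to k**-(1 + alpha_bar) stabilizes, illustrating the tail index.
for k in (10, 100, 1000):
    print(k, p_limit(k, 0.6) * k ** 1.6)
```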

Hierarchical Vertex Components Model (HVCM)
Here we construct a larger family of exchangeable structured interaction processes. As in Section 3, for ease of comprehension, we focus on the setting of a single sender. First, choose a distribution over senders, $f = (f_s)_{s \in \mathcal P_1}$, in the simplex

$$\mathcal F_1 = \Big\{ (f_s)_{s \in \mathcal P_1} : f_s \ge 0, \; \textstyle\sum_{s \in \mathcal P_1} f_s = 1 \Big\}.$$

Next, for each $s \in \mathcal P_1$, construct a conditional distribution over the receivers, that is, the second component $\mathrm{fin}(\mathcal P_2)$. That is, for every $s \in \mathcal P_1$, we choose $f_s = (f_{r|s})_{r \in \mathcal P_2}$ in the simplex

$$\mathcal F_2 = \Big\{ (f_{r|s})_{r \in \mathcal P_2} : f_{r|s} \ge 0, \; \textstyle\sum_{r \in \mathcal P_2} f_{r|s} = 1 \Big\}.$$

We combine these distributions to form $f \in \mathcal F_1 \times (\otimes_{s \in \mathcal P_1} \mathcal F_2)$, which determines a distribution on the space $\mathrm{fin}_1(\mathcal P_1) \times \mathrm{fin}(\mathcal P_2)$ by

$$\mathrm{pr}\big(\mathcal I(i) = (\{s\}, \{r_1, \ldots, r_l\}) \mid f\big) = f_s\, \nu^{(s)}_l \prod_{j=1}^{l} f_{r_j | s}, \qquad (19)$$

where $\nu^{(s)}_l \ge 0$ and $\sum_{l=1}^\infty \nu^{(s)}_l = 1$ for each $s \in \mathcal P_1$. This determines an exchangeable structured interaction process, which we call the hierarchical vertex components model (HVCM). Given $f$, $\mathcal I(1), \mathcal I(2), \ldots$ are independent, identically distributed (iid) random structured interactions drawn from (19). Note that the recipients are chosen with replacement. An extension to allow for sampling without replacement is considered important future work.

Connection Between Sequential Description and Hierarchical Vertex Components Models
The sequential description in Section 3 is equivalent to a particular HVCM, which is why it is termed the PY-HVCM. The HVCM representation given by (19) is helpful in providing important intuition regarding assumptions underlying the PY-HVCM. Indeed, Equation (19) shows HVCMs assume conditional independence among the receiver propensities given the sender information. When $\alpha_s = 0$ for all $s \in \mathcal P_1$, an analytic stick-breaking representation can be derived. This connects the sequential process directly to Equation (19). To do so, we start by constructing the sender distribution. Here, we assume $\mathcal P_1 \equiv \mathcal P_2 \equiv \mathbb N$. For $s \in \mathbb N$, define independent random variables $\beta_s \sim \mathrm{Beta}(1 - \bar\alpha, \bar\theta + s\bar\alpha)$. Then, conditional on $\{\beta_s\}_{s=1}^\infty$, the probability of choosing sender $s \in \mathbb N$ is given by

$$f_s = \beta_s \prod_{s' < s} (1 - \beta_{s'}),$$

where the product is set equal to one for $s = 1$, and $f = \{f_s\}_{s=1}^\infty$. We still consider only a single sender per interaction (i.e., $\nu_{k_1} = \mathbf 1[k_1 = 1]$).
We now construct, for each $s \in \mathbb N$, the probabilities $\{f_{r|s}\}_{r=1}^\infty$ via a hierarchical model given $\bar\alpha > 0$ and $\bar\theta > -\bar\alpha$, and we set $f = \{\{f_{r|s}\}_{r=1}^\infty\}_{s=1}^\infty$. To do this, we first define global independent random variables $\bar\beta_r \sim \mathrm{Beta}(1 - \bar\alpha, \bar\theta + r\bar\alpha)$ for $r \in \mathbb N$. Then, conditional on $\{\bar\beta_r\}_{r=1}^\infty$, for $r \in \mathbb N$ we define the associated stick-breaking probabilities

$$\bar\pi_r = \bar\beta_r \prod_{r' < r} (1 - \bar\beta_{r'}).$$

These are the probabilities of choosing receiver $r$ based on the global random variables $\{\bar\beta_r\}_{r=1}^\infty$. The local stick-breaking distributions are then defined via a perturbation of these global probabilities. That is, for $\theta_s > 0$, define independent random variables $\beta_{r|s} \sim \mathrm{Beta}\big(\theta_s \bar\pi_r,\, \theta_s (1 - \sum_{r' \le r} \bar\pi_{r'})\big)$ and set

$$f_{r|s} = \beta_{r|s} \prod_{r' < r} (1 - \beta_{r'|s}),$$

where the product is defined equal to one when $r = 1$. This yields a stick-breaking representation of $f$ for a particular hierarchical vertex components model. The above construction yields a more natural mechanism for understanding partial pooling. Indeed, partial pooling occurs via the shared global probabilities $\bar\pi_r$. The local distributions can be thought of as a perturbation of the global distribution $\{\bar\pi_r\}_{r=1}^\infty$, where $\theta_s$ controls the amount of perturbation. In particular, $f_{r|s} \to \bar\pi_r$ with probability one as $\theta_s \to \infty$. By construction, $f$ is a random variable over the space $\mathcal F_1 \times \otimes_{s=1}^\infty \mathcal F_2$. Lemma 5.1 establishes the connection between these random variables and the PY-HVCM for $\alpha_s = 0$. Section 3 of the supplementary materials describes a specific probabilistic construction of these frequencies.
Lemma 5.1. The PY-HVCM model for $\alpha_s = 0$ for $s \in \mathbb N$ is equivalent in distribution to (19), where $f$ is distributed according to the stick-breaking construction described above.
The proof can be found in Section 3 of the supplementary materials; see Pitman (2006) and Ishwaran and James (2001) for further details on the stick-breaking representation. Note that when $\alpha_s \neq 0$, there is no longer a stick-breaking representation, but the PY-HVCM remains in the class of HVCMs and admits asymptotic frequencies $f_{r|s}$ for each $(s, r) \in \mathcal S \times \mathcal R$. It is simply that the associated measure does not admit a simple probabilistic mechanism for generating the propensities directly.
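A truncated simulation of this stick-breaking construction makes the perturbation interpretation tangible; the truncation level and parameter values below are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def sticks(beta):
    """Turn stick-breaking fractions into probabilities."""
    return beta * np.cumprod(np.concatenate(([1.0], 1 - beta[:-1])))

def global_sticks(alpha_bar, theta_bar, R=200):
    """Truncated global weights: beta_r ~ Beta(1 - a, t + r a)."""
    r = np.arange(1, R + 1)
    return sticks(rng.beta(1 - alpha_bar, theta_bar + alpha_bar * r))

def local_sticks(pi, theta_s):
    """Local perturbation: beta_{r|s} ~ Beta(t_s pi_r, t_s (1 - cumsum(pi)))."""
    tail = np.clip(1 - np.cumsum(pi), 1e-12, None)   # remaining global mass
    return sticks(rng.beta(theta_s * pi, theta_s * tail))

pi_bar = global_sticks(0.5, 10.0)
f_s = local_sticks(pi_bar, theta_s=5.0)   # approaches pi_bar as theta_s grows
```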

Application to Enron Email Network
In this section, the PY-HVCM is applied to the Enron e-mail dataset. Further, techniques to demonstrate the goodness of fit of the model are discussed, and the model is compared with the previously published Hollywood model (Crane and Dempsey 2017) and the generalized gamma process (GGP) model (Caron and Fox 2017); in particular, the PY-HVCM is shown to have better model fit at the local level than the others. There is variation in the shapes of the local receiver degree distributions; the PY-HVCM accounts for and parameterizes this difference in behavior relative to the global degree distribution.

Dataset Overview
The Enron e-mail dataset consists of approximately 500,000 e-mails collected from 1998 to 2002 and was originally collected by the Federal Energy Regulatory Commission during its investigation into the company (Cohen 2009). The dataset originates from an e-mail dump of 150 users. In total, there are 19,752 unique senders and 70,572 unique receivers, for a total of 79,735 unique entities. The dataset has been used as a testbed for classification (Klimt and Yang 2004), topic modeling (McCallum, Wang, and Corrada-Emmanuel 2007), and graph-based anomaly detection (Priebe et al. 2005; Silva and Willett 2009), among other tasks. While the underlying data-generating process evolves forward in time, the current modeling framework envisions a database of Enron e-mails from which the observed data are a random sample. See Section 7 of Crane and Dempsey (2018b) for further discussion of this perspective. The estimated local distributions are therefore marginal over time, building receiver propensities that average over potential temporal variation. Figure 3 shows the global receiver degree distribution, as well as the local receiver distributions for the six senders with the largest number of e-mails. There is significant variation in the behavior of the local degree distributions, both in comparison to one another and in comparison to the global degree distribution. This suggests that a modeling approach that allows for these differences is critical to accurately capturing the behavior of the entities, thereby allowing for superior data summarization, sound inference, and strong prediction performance. While the Hollywood and GGP models are unable to account for this variation, the PY-HVCM is equipped to capture this behavior.

Fit to the Data
The Gibbs sampling algorithm introduced in Section 4 is applied to the dataset for 1000 iterations, discarding the first 500 as burn-in. For this dataset, the following priors were used: $\bar\theta \sim \mathrm{Gamma}(2, 1000)$, $\bar\alpha \sim \mathrm{Beta}(1, 1)$, $\theta_s \sim \mathrm{Gamma}(1, 20)$, and $\alpha_s \sim \mathrm{Beta}(1, 0.9)$. Trace plots and histograms of the posterior samples of the global parameters $\bar\alpha$ and $\bar\theta$, and histograms of the estimated local parameters $\alpha_s$ and $\theta_s$, are displayed in Section 4.1 of the supplementary materials.

PPC and Model Comparison
In this section, examples of posterior predictive model checks are shown to demonstrate the goodness of fit of the PY-HVCM. PPC are often used to verify that the proposed fitted model generates reasonable values of statistics of interest; these checks can also be used to diagnose where the model fails to perform well (Gelman, Meng, and Stern 1996).
Multiple synthetic datasets are generated according to the posterior predictive distribution as prescribed in Gelman, Meng, and Stern (1996), and statistics of interest are calculated and compared with the statistics of the real data. The synthetic data is generated from the model with the parameters set to a posterior sample generated from the inference procedure. Since we are interested in the ability of the model to account for variation in local behavior, we take the sender sequence and number of receivers for each email as given, in order to directly compare the local receiver distributions of the posterior predictive data with the real data.
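Schematically, the PPC loop conditions on the observed sender sequence and e-mail sizes and regenerates only the receivers. The sketch below reuses the PYHVCMReceivers class from Section 3 and therefore collapses the per-sender parameters to a single shared pair; a faithful implementation would instead plug in each posterior draw of the full parameter set.

```python
def ppc_datasets(sender_seq, sizes, posterior_samples, n_rep=50):
    """Generate PPC replicates conditional on senders and e-mail sizes.

    sender_seq[i]: sender of e-mail i; sizes[i]: its number of receivers;
    posterior_samples: list of parameter dicts from the Gibbs output."""
    reps = []
    for params in posterior_samples[:n_rep]:
        model = PYHVCMReceivers(**params)    # sketch class from Section 3
        reps.append([(s, [model.draw(s) for _ in range(k)])
                     for s, k in zip(sender_seq, sizes)])
    return reps
```

Each replicate is then summarized by the statistics of interest and compared against the observed values.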
In addition to generating PPC for the fitted PY-HVCM, we also generate them for the Hollywood (Crane and Dempsey 2017) and GGP (Caron and Fox 2017) models for comparison. In the following subsections, a variety of posterior predictive statistics are described, both for the global dataset and for the local data per sender. These checks show that the PY-HVCM provides a good global fit to the data, in addition to significantly improving the fit to local distributions compared to the Hollywood model. Table 1 details the 95% posterior predictive intervals for the global statistics, and Table 2 summarizes the posterior predictive coverage rate of the local distributions for the proposed model and the Hollywood model. The statistics compared are the number of unique receivers in the dataset and the number of receivers with degree 1, 10, and 100.

Number of Unique Receivers
The first statistic we consider is the number of unique receivers, both in the global dataset and in each local sender dataset. The number of unique receivers can be thought of as a surrogate for sparsity, and it is thus an important statistic for a candidate model to replicate. Figure 4 displays the results.
In the left plot, the PPC statistics are shown for the number of unique receivers in the global dataset. Both the Hollywood model and the PY-HVCM perform well on the global statistic. The right plots show four examples of the PPC statistics for the number of unique receivers on the local sender datasets with the most e-mails. Only the results from the PY-HVCM are shown, because neither the GGP model nor the Hollywood model is able to take into account variation among the local distributions; if the sender labels are attached post hoc to the synthetic data generated from the GGP or Hollywood model, they are completely unable to replicate any local behavior statistics. The PY-HVCM clearly accounts for the varying local behavior, even when that local behavior is unusual (as in the case of sender 58,397). The superiority of the model compared to the Hollywood model is clearly shown in Table 2: the proposed model's posterior predictive intervals for the local distributions cover the true values 99% of the time, as opposed to the Hollywood model's coverage rate of 39%. In Section 4 of the supplementary materials, PPC statistics for the number of unique receivers for the 25 senders with the most e-mails are presented under the PY-HVCM and a local mean-field approximation, demonstrating the model's superiority over local approximations as well; see Section 1 of the supplementary materials for details on the approximation.

Degree Distribution
An important global behavior to capture is the global degree distribution. To evaluate this fit, posterior predictive intervals for the number of nodes with degree 1, 10, and 100 are shown in Table 1. Note that the HVCM performs best: the true number of receivers with degree 10 falls within the posterior predictive interval. Figure 5 shows this result. When comparing the degree distributions, it is also clear that the Enron data does not perfectly align with the posterior predictive example, as the synthetic data overestimates the number of receivers with degree 1 and underestimates the number of receivers with degree 100. However, this model fit is still superior to both the Hollywood and GGP models (Table 1). Further, Table 2 demonstrates that the coverage of the posterior predictive intervals is much more robust under the proposed model for each of the degree statistics. Figure 6 also compares local degree distributions between the PY-HVCM and the real data. In both the global and local cases, the PY-HVCM better replicates the degree distribution.

Node Sharing Across Local Distributions
In order to visualize how effectively the PY-HVCM is capturing the varying dependencies between the local and global distributions, we count the number of receivers that are seen in a particular number of local sender distributions. This allows for direct comparison of the effectiveness of the models to capture the interdependency and interaction among the local datasets. Figure 7 shows the results.
It is quite clear that the PY-HVCM replicates the observed behavior in the real data, while both the GGP and Hollywood models fail to capture the degree of pooling across the local datasets. Specifically, the other models seem to overestimate the rate at which receivers are shared across the local distributions.

L1 Distance From Degree Distribution
With our posterior predictive samples, we can also directly examine the difference in distribution between the synthetic data and the real data. Figure 8 shows histograms of the total variation (TV) distance between the global degree distributions of the synthetic data generated from the posterior predictive distribution and of the real dataset. Again, the PY-HVCM improves over the Hollywood and GGP models.
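The distance in Figure 8 can be computed directly from degree frequencies; below is a minimal version of ours, with total variation taken as half the L1 distance between the normalized degree distributions.

```python
from collections import Counter

def tv_degree_distance(real, synth):
    """TV distance between empirical receiver degree distributions.

    real, synth: lists of (sender, receivers) interactions."""
    def degree_dist(data):
        deg = Counter(r for _, receivers in data for r in receivers)
        freq = Counter(deg.values())
        total = sum(freq.values())
        return {k: c / total for k, c in freq.items()}
    p, q = degree_dist(real), degree_dist(synth)
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in set(p) | set(q))
```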

Using PPC to Identify Irregular Interaction Behavior
Not every local distribution is fit perfectly by a power law; sender 58,397, displayed in Figure 6, is one example that has very few receivers and clearly does not follow a power law. Other outliers are also present. An additional benefit of PPC is the ability to identify such anomalous local distributions.
To demonstrate this technique, the number of unique vertices is used as the local statistic. After generating posterior predictive data, an 80% posterior predictive interval is constructed for each local distribution. Senders whose unique-receiver statistic falls outside the associated interval are then identified, and the level of deviancy is calculated as the absolute distance from the nearest edge of the interval to the observed value, divided by the observed number of unique receivers. The degree distributions of the senders with the top three deviancy scores are shown in Figure 9. As with sender 58,397, which had the highest deviancy score, the other two senders in Figure 9 exhibit non-power-law behavior. All three are characterized by sending a large number of e-mails to themselves. This procedure can use alternative statistics to discover other types of anomalous behavior, making it a flexible method for exploring complex relational data.
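A sketch of the deviancy calculation, reusing the `unique_receivers_per_sender` helper from the earlier PPC sketch:

```python
import numpy as np

def deviancy_scores(observed, replicates, level=0.80):
    """Identify senders outside the level-probability posterior predictive
    interval; score each by the distance from the nearest interval edge to
    the observed value, relative to the observed value."""
    obs = unique_receivers_per_sender(observed)
    reps = [unique_receivers_per_sender(rep) for rep in replicates]
    alpha = (1.0 - level) / 2.0
    scores = {}
    for sender, value in obs.items():
        draws = np.array([rep.get(sender, 0) for rep in reps])
        lo, hi = np.quantile(draws, [alpha, 1.0 - alpha])
        if not (lo <= value <= hi):
            edge = lo if value < lo else hi
            scores[sender] = abs(value - edge) / value
    return scores  # sort by score, descending, to rank anomalous senders
```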

Prediction of Future Interaction Dynamics
In addition to PPC, it is reasonable to compare the predictive power of the models in question. To do so, we train both the PY-HVCM and the Hollywood model on the observed interaction data up to 6/30/2000 (n = 65,143), 12/31/2000 (n = 204,285), and 6/30/2001 (n = 362,899), respectively. We then evaluate the predictive log-likelihood of both models over several horizons (the next 6 months, the next year, and all remaining data). For the PY-HVCM, this is computed as an expectation over the posterior distribution, approximated using the Gibbs samples. The results, presented in Table 3, demonstrate that the PY-HVCM has greater predictive power than the Hollywood model on the Enron dataset across all training splits and time horizons.
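The posterior expectation can be estimated by Monte Carlo over the retained Gibbs samples. The sketch below assumes a user-supplied `loglik_fn(theta, holdout)` returning the log-likelihood of the held-out interactions, given the training history, under parameters `theta`; this function is a stand-in for the model-specific sequential likelihood and is not part of the paper's notation.

```python
import numpy as np
from scipy.special import logsumexp

def predictive_loglik(loglik_fn, gibbs_samples, holdout):
    """Estimate log E_posterior[p(holdout | theta)] as a log-mean-exp
    over posterior draws theta_1, ..., theta_S."""
    lls = np.array([loglik_fn(theta, holdout) for theta in gibbs_samples])
    return logsumexp(lls) - np.log(len(lls))
```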

ArXiv Dataset
In this section, a larger and more complex dataset is used to demonstrate the flexibility of the PY-HVCM. The model is applied to the arXiv dataset (https://archive.org/details/arxivbulk-metadata), which contains nearly all arXiv articles from 1986 to 2017. Our proposed model treats the observed data as a simple random sample from the complete database of arXiv articles (see Section 9 of Crane and Dempsey (2018b) for details). Although articles are generated progressively over time, this sampling perspective implies that the estimated local distributions are marginal over time, yielding receiver propensities that average over potential temporal variation.
Like the Enron dataset, the arXiv data has a hierarchical structure: each article is required to have at least one associated subject. However, unlike the e-mail dataset, in which each e-mail has exactly one sender, an article may have more than one subject. Our proposed model is well suited to this case of multiple entities, and the data can still be appropriately represented by Equation (19). Recall from Section 3.2 that if article $n$ has subjects $\{S_{n,1}, \ldots, S_{n,k_{n,1}}\}$, then our proposed model selects one of these observed subjects via a latent variable $Z_n$, which indicates which local statistics are used for the author distribution. This allows direct study of interdisciplinarity among authors and of the overlap among subject classes on arXiv.
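A sketch of a single update for the latent indicator is shown below. It assumes a user-supplied `author_loglik(s)` that evaluates the log-probability of the article's author list under subject $s$'s local distribution, and it omits, for brevity, any prior weighting of the subjects; both names are illustrative.

```python
import numpy as np

def sample_subject_indicator(rng, subjects, author_loglik):
    """Draw Z_n among the article's observed subjects, with probability
    proportional to how well each subject's local distribution explains
    the author list."""
    logw = np.array([author_loglik(s) for s in subjects])
    w = np.exp(logw - logw.max())  # subtract the max for numerical stability
    return rng.choice(subjects, p=w / w.sum())
```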
The arXiv subjects have been divided into 11 main classes; the full list can be found at https://arxiv.org/help/prep. To reduce the effect of author name ambiguity, we restrict attention to articles with at least one subject from the math, cs, stat, and physics subject classes. A full description of the subjects of interest is given in Section 5 of the supplementary materials. Figure 10 shows the degree distribution of the subjects, along with a histogram of the number of subjects per article. In total, there are 510,812 scientific articles with 413,029 unique authors and 130 unique subjects, and the subject frequencies span a broad range. We apply the posterior sampling methods of Section 4, in particular the approximate method of Section 4.3 for computing the posteriors of the indicator variables $Z_i$. Trace plots of posterior estimates of selected parameters, PPC for the data, and other details of the inference can be found in Section 5 of the supplementary materials.

Subject Overlap
The fitted model allows us to explore the amount of overlap between arXiv subjects. Two subjects are considered overlapping if the model has difficulty distinguishing between them when both are listed on the same article. This difficulty can be measured by the Shannon entropy, defined for a discrete probability distribution $p = (p_1, p_2, \ldots, p_k)$ as
$$H(p_1, p_2, \ldots, p_k) = -\sum_{j=1}^{k} p_j \log_2 p_j.$$
Entropy is maximized when $p$ is the uniform distribution, that is, when all outcomes are equally likely. To estimate the overlap between subjects $s_1$ and $s_2$, we find every article that lists both $s_1$ and $s_2$ among its subjects, compute the entropy of the posterior mean of the $Z_i$ distribution conditional on $Z_i \in \{s_1, s_2\}$, and average this entropy over the articles. The resulting score is
$$SO(s_1, s_2) = \frac{1}{|\{S_i : s_1, s_2 \in S_i\}|} \sum_{i : s_1, s_2 \in S_i} H\big(\mathrm{pr}(Z_i = s \mid \{S_i, R_i\}, Z_i \in \{s_1, s_2\})\big). \quad (20)$$
Figure 11 shows a heatmap of the subject overlap scores for pairs of subjects that appear together in the same article at least 100 times. The subjects are ordered by a normalized spectral clustering (Ng, Jordan, and Weiss 2002), using the subject overlap matrix $SO$ as the affinity matrix and setting the number of clusters to 6.
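A sketch of the overlap computation and the subsequent clustering step follows. Here `z_posteriors` is an assumed structure mapping each article to the posterior means $\mathrm{pr}(Z_i = s)$ over its listed subjects, and the clustering call uses scikit-learn's precomputed-affinity spectral clustering as a stand-in for the Ng, Jordan, and Weiss (2002) procedure.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def entropy_bits(p):
    """Shannon entropy (base 2) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def subject_overlap(z_posteriors, s1, s2):
    """Average entropy of the Z_i posterior restricted to {s1, s2},
    over articles that list both subjects."""
    vals = []
    for post in z_posteriors:  # post: dict subject -> posterior mean pr(Z_i = s)
        if s1 in post and s2 in post:
            pair = np.array([post[s1], post[s2]])
            vals.append(entropy_bits(pair / pair.sum()))
    return float(np.mean(vals)) if vals else float("nan")

# Ordering subjects by spectral clustering on the overlap matrix SO:
# labels = SpectralClustering(n_clusters=6, affinity="precomputed").fit_predict(SO)
```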
From this analysis, we conclude the following. Cluster 1, which includes cs.AI (Artificial Intelligence) and cs.IR (Information Retrieval), groups subjects pertaining to algorithmic approaches to artificial intelligence. This cluster is differentiated from cluster 5, which tends to represent more theoretical work that uses statistics and includes math.ST (Statistical Theory), stat.ML (Machine Learning), and stat.ME (Methods). Cluster 2 can be considered the core mathematics cluster, encapsulating many pure and applied mathematics subjects. Similarly, cluster 3 is the core computer science cluster, containing the computer science subjects that generally do not use statistics, such as cs.SE (Software Engineering) and cs.CE (Computational Engineering). Cluster 4 is the core physics cluster, comprising the physics subjects that tend not to be as interdisciplinary as other physics subjects. Finally, cluster 6 consists of subjects involving the philosophy, teaching, or history of physics. Perhaps unsurprisingly, as AI is a fast-moving field, it spans two clusters (clusters 1 and 5) despite not having a designated arXiv category.
In Section 5 of the supplementary materials, we compare these results with those of a standard nonprobabilistic spectral clustering algorithm applied to the co-authorship network; spectral clustering alone is unable to recover the meaningful groupings that the PY-HVCM produces.

Structured Interaction Exchangeable Models
A final question is whether the family of HVCMs represents all exchangeable structured interaction processes. Here, we show that this is not the case by providing a representation theorem. We focus on the setting where each interaction $(\mathbf{s}, \mathbf{r})$ is either never observed or observed infinitely often. This is commonly referred to as the "blip-free" setting (Crane and Dempsey 2019), where blips refer to interactions $(\mathbf{s}, \mathbf{r})$ that are observed only once. We first define the $\mathrm{fin}(\mathcal{P}_1) \times \mathrm{fin}(\mathcal{P}_2)$-simplex
$$\mathcal{F} = \Big\{ (f_{(\mathbf{s},\mathbf{r})})_{(\mathbf{s},\mathbf{r}) \in \mathrm{fin}(\mathcal{P}_1) \times \mathrm{fin}(\mathcal{P}_2)} : f_{(\mathbf{s},\mathbf{r})} \geq 0, \ \sum_{(\mathbf{s},\mathbf{r})} f_{(\mathbf{s},\mathbf{r})} = 1 \Big\},$$
where $(\mathbf{s}, \mathbf{r}) := (\{s_1, \ldots, s_{k_1}\}, \{r_1, \ldots, r_{k_2}\})$ for $s_1, \ldots, s_{k_1} \in \mathcal{P}_1$ and $r_1, \ldots, r_{k_2} \in \mathcal{P}_2$. Let $\phi$ be a probability measure on this simplex, and let $f \sim \phi$ be a random draw from it. Then, given $f \in \mathcal{F}$, the sequence of interactions $I(1), I(2), \ldots$ is generated according to
$$\mathrm{pr}\big( I(i) = (\{s_1, \ldots, s_{k_1}\}, \{r_1, \ldots, r_{k_2}\}) \mid f \big) = f_{(\mathbf{s},\mathbf{r})}.$$
Given $I$, the associated network $Y := y_I$ is obtained through Equation (1); we denote its distribution by $\epsilon_f$. Theorem 8.1 states that all blip-free structured interaction exchangeable networks can be generated in this manner. The proof is given in Section 7 of the supplementary materials.
Theorem 8.1 (Blip-free representation theorem). Let $Y$ be an exchangeable structured interaction-labeled network that is blip-free with probability 1. Then there exists a probability measure $\phi$ on $\mathcal{F}$ such that $Y \sim \epsilon_\phi$, where
$$\epsilon_\phi(\cdot) = \int_{\mathcal{F}} \epsilon_f(\cdot)\, \phi(df). \quad (21)$$
Theorem 8.1 and Equation (21) show that both senders and receivers arrive in size-biased order in structured interaction processes. The arrivals of senders and receivers are weighted by the relative frequencies of their occurrence, $f_{s \bullet} = \sum_{(\mathbf{s},\mathbf{r}) : s \in \mathbf{s}} f_{(\mathbf{s},\mathbf{r})}$ and $f_{\bullet r} = \sum_{(\mathbf{s},\mathbf{r}) : r \in \mathbf{r}} f_{(\mathbf{s},\mathbf{r})}$, respectively. That is, an individual who sends many e-mails but receives few will likely be observed early in the interaction process as a sender but much later as a receiver, and vice versa.
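To illustrate the representation, the following sketch simulates a blip-free process over a finite set of structured interactions, with a Dirichlet distribution standing in for the general measure $\phi$; the support and sampler are illustrative stand-ins for the measure-theoretic objects above.

```python
import numpy as np

def simulate_blip_free(rng, support, phi_sampler, n):
    """Draw f ~ phi once, then generate n interactions i.i.d. with
    pr(I(i) = (s, r)) = f_(s,r), as in Theorem 8.1."""
    f = phi_sampler(rng)
    idx = rng.choice(len(support), size=n, p=f)
    return [support[i] for i in idx]

# Example: three possible (sender-set, receiver-set) interactions,
# with phi a symmetric Dirichlet over the support.
rng = np.random.default_rng(0)
support = [(("u1",), ("u2",)), (("u1",), ("u2", "u3")), (("u4",), ("u3",))]
sample = simulate_blip_free(rng, support, lambda r: r.dirichlet(np.ones(3)), 5)
```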

Concluding Remarks
This paper has presented the class of exchangeable structured interaction models. By exploiting the common hierarchical nature of structured network data, complex models with both appropriate invariance and empirical properties are introduced.
The PY-HVCM captures partial pooling of information and can model complex local behavior alongside global power law degree behavior. A Gibbs sampling algorithm is proposed and applied to the Enron e-mail and arXiv datasets. The focus of this paper has been on e-mail and similarly structured interaction datasets; extensions to more complex examples will be considered in future work. This article lays the foundation for how the interaction exchangeability framework can account for complex behavior. Of course, many interaction networks come with timestamps; extensions that account for temporal dependence are therefore required and will be an important next step.