Conditional Functional Graphical Models

Abstract Graphical modeling of multivariate functional data is becoming increasingly important in a wide variety of applications. Changes in the graph structure can often be attributed to external variables, such as the diagnosis status or time, the latter of which gives rise to the problem of dynamic graphical modeling. Most existing methods focus on estimating the graph by aggregating samples, but largely ignore the subject-level heterogeneity due to the external variables. In this article, we introduce a conditional graphical model for multivariate random functions, where we treat the external variables as the conditioning set, and allow the graph structure to vary with the external variables. Our method is built on two new linear operators, the conditional precision operator and the conditional partial correlation operator, which extend the precision matrix and the partial correlation matrix to both the conditional and functional settings. We show that their nonzero elements can be used to characterize the conditional graphs, and develop the corresponding estimators. We establish the uniform convergence of the proposed estimators and the consistency of the estimated graph, while allowing the graph size to grow with the sample size, and accommodating both completely and partially observed data. We demonstrate the efficacy of the method through both simulations and a study of brain functional connectivity networks.


Introduction
Functional graphical modeling has been gaining increasing attention in recent years, where the central goal is to investigate the interdependence among multivariate random functions. Applications include time-course gene expression data in genomics (Wei and Li 2008), multivariate time series data in finance (Tsay and Pourahmadi 2017), and electrocorticography and functional magnetic resonance imaging data in neuroimaging, among many others.
Our motivation is brain functional connectivity analysis based on functional magnetic resonance imaging (fMRI). Functional MRI measures brain neural activities via blood oxygen-level-dependent signals. It depicts the brain functional connectivity network, which has been shown to alter under different disorders and during different brain developmental stages. Such alterations contain crucial insights into both disorder pathology and the development of the brain (Fox and Greicius 2010). The fMRI data are often summarized in the form of a location-by-time matrix for each individual subject. The rows correspond to a set of brain regions, and the columns correspond to time points that are usually 500 msec to 2 sec apart and span a few minutes in total. From the fMRI scans, a graph is constructed, where nodes represent brain regions, and links represent interactions and dependencies among the regions (Fornito, Zalesky, and Breakspear 2013). Numerous statistical methods have been developed to estimate the functional connectivity network. Most of these methods treat the fMRI data as multivariate random variables with repeated observations, where each region is represented by a random variable and the time-course data are taken as repeated measures of that variable (e.g., Bullmore and Sporns 2009; Wang et al. 2016b). There are recent proposals to model the fMRI data as multivariate functions, where the time-course data of each region are taken as a function (e.g., Li and Solea 2018). Given the continuous nature and the short time interval between adjacent sampling points of fMRI, we treat the data as multivariate functions, and formulate the connectivity network estimation as a functional graphical modeling problem in this article.
Nearly all existing graph estimation methods tackle the problem by aggregating samples, sometimes according to the diagnostic groups. However, there is considerable subject-level heterogeneity, which may contain crucial information for our understanding of the network, but has been largely untapped or ignored by existing methods (Fornito, Zalesky, and Breakspear 2013). Heterogeneity could arise from a subject's phenotype profile; for example, in our study in Section 7, the brain connectivity network may vary with an individual's intelligence score. It could also arise from a time variable, for example, the subject's age, where the connectivity network varies with age; this leads to dynamic graphical modeling. In this article, we introduce a conditional functional graphical model for a set of random functions, by modeling the external variables, such as the phenotype or age, as the conditioning set. Our proposal thus extends two lines of existing and relevant research: from conditional and dynamic graphical modeling of random variables to that of random functions, and from unconditional functional graphical modeling to conditional functional graphical modeling.
The first line of relevant research is the class of graphical models of random variables. Within this class, there is a rich literature on unconditional graphical models (Yuan and Lin 2007; Friedman et al. 2008; Peng et al. 2009, among others). There are extensions to joint estimation of multiple graphs, which arise from a small number of groups, typically two or three, according to an external variable such as the diagnostic status (Danaher, Wang, and Witten 2011; Chun, Zhang, and Zhao 2015; Lee and Liu 2015; Zhu and Li 2018). There have also been some recent proposals of dynamic graphical models (Kolar et al. 2010; Xu and Hero 2014; Zhang and Cao 2017). However, they only considered a discrete-time setting, in which the network is estimated at a small number of discrete time points. The most relevant work to ours is Qiu et al. (2016), who also targeted estimation of the individual graph according to an external variable, for example, age. Although both Qiu et al. (2016) and our method are designed for graph estimation at the individual level, the two solutions differ in many ways. In particular, Qiu et al. (2016) assumed that the repeated observations of each individual come from the same Gaussian distribution, whose dependency was required to follow a certain stationary time series structure. By contrast, we treat the repeated measurements as realizations from a random function, and do not impose any structural or relational assumption on the entire function. More importantly, compared to the random variable setting of Qiu et al. (2016), our functional setting involves an entirely different and new set of modeling techniques and theoretical tools.
The second line of relevant research is the class of unconditional graphical models of random functions, which appeared only recently. Qiao, Guo, and James (2019) introduced a functional Gaussian graphical model, by assuming that the random functions follow a multivariate Gaussian distribution. Li and Solea (2018) relaxed the Gaussian assumption, developed a precision operator that generalizes the concept of precision matrix to the functional setting, and used it to estimate the unconditional functional graph. However, their precision operator is nonrandom, and their graph dimension is fixed. By contrast, we introduce a conditional precision operator (CPO), which is a function of the conditioning variable and is thus random, and we allow the graph dimension to diverge with the sample size. These differences bring extra challenges in analyzing the operator-based statistics. Moreover, because the relation between the CPO and the conditioning variable can be nonlinear, its estimation requires the construction of a reproducing kernel Hilbert space (RKHS) of the conditioning variable, which leads to a more complex asymptotic analysis than that of Li and Solea (2018). We also derive a number of concentration bounds and uniform convergence for our proposed estimators, while such results are not available in Li and Solea (2018).
To address the problem of conditional graphical modeling of random functions, we introduce two new linear operators: the CPO, and the conditional partial correlation operator (CPCO), which extend the precision matrix and the partial correlation matrix from the random variable setting to both the conditional and functional settings. We show that, when the conditional distribution is Gaussian, the conditional graph can be fully captured by the nonzero elements of the CPO or the CPCO. This property echoes the classic result that a static graph can be inferred from the precision matrix or the partial correlation matrix under the Gaussian assumption. We note that some early works, such as Lee, Li, and Zhao (2016a,b), also estimated the parameters of interest through linear operators. However, the problems studied are utterly different: Lee, Li, and Zhao (2016b) targeted variable selection in classical regressions, Lee, Li, and Zhao (2016a) targeted unconditional graph estimation for random variables, while we target conditional graph estimation for random functions. Both the methodology and the theory involved are thus substantially different.
Our proposal makes useful contributions on multiple fronts. On the method side, it offers a new class of statistical models to study conditional graph estimation for multivariate functional data, a problem that remains largely unaddressed. We investigate the parallels between the random variable-based and random function-based graphs, and between the unconditional and conditional graphs. On the theory side, our work develops new tools for operator-based functional data analysis. We establish the conditional graph estimation consistency, along with a set of concentration inequalities and error bounds, for our proposed method. To our knowledge, very little work has investigated function-on-function dependency at such a level of complexity that involves estimating linear operators under a conditional framework. The tools we develop are general, and can be applied to other settings in high-dimensional functional data analysis. On the computation side, under a properly defined coordinate system, the proposed operators are functions of the sample covariance operator of dimension n × n, with n being the sample size. They are relatively easy to calculate, and the accompanying estimation algorithm can be scaled up to large graphs.
The rest of the article is organized as follows. We begin with a formal definition of conditional functional graphical model in Section 2. We introduce a series of linear operators in Section 3, develop their estimators in Section 4, and study their asymptotic properties in Section 5. We report the simulations in Section 6, and an analysis of an fMRI dataset in Section 7. We relegate all proofs and some additional results to the supplementary appendix.

Model
In this section, we formally define the conditional functional graphical model. Let ( X , F X ) be a measurable space. Suppose X i is a Hilbert space of R-valued functions on an interval T ⊂ R, for i = 1, . . . , p, and X is the Cartesian product X 1 × · · · × X p . Suppose X = (X 1 , . . . , X p ) T is a p-dimensional random element on X . Let G = (V, E) be an undirected graph, where V = {1, . . . , p} represents the set of vertices corresponding to the p random functions, and E ⊆ {(i, j) ∈ V × V : i ≠ j} represents the set of undirected edges. A common approach to modeling an undirected graph is to associate separation on the graph with conditional independence; in other words, nodes i and j are separated in G if and only if X i and X j are independent given the rest of X, that is, X i ⊥⊥ X j | X −(i,j) , (1) where X −(i,j) represents X with its ith and jth components removed, and ⊥⊥ represents statistical independence. Based on Equation (1), Qiao, Guo, and James (2019) proposed a functional graphical model, which assumed that X follows a multivariate Gaussian distribution. Next, we introduce a conditional functional graphical model that allows the graph links to vary with the external variables. We focus on the univariate external variable case, but our method can be generalized to multivariate external variables, or external functions. Let Y be a random element defined on Y , and F Y be the Borel σ-field generated by the open sets in Y . Let P Y : F Y → R and P X|Y (·|·) : F X × Y → R be the distribution of Y and the conditional distribution of X | Y, respectively. We next give our formal definition.
Definition 1. Suppose a random graph E y for each y ∈ Y is defined via the mapping Y → 2 V×V , y → E y , where 2 V×V is the power set of V × V. We say X follows a conditional functional graphical model with respect to E y if and only if, for y ∈ Y , (i, j) ∉ E y ⟺ X i ⊥⊥ X j | (X −(i,j) , Y = y). (2) We note that Li, Chun, and Zhao (2012) introduced the notion of conditional graphical model. However, our notion of conditional functional graphical model is considerably different, in that our model extends theirs not only from the setting of random variables to random functions, but also from the setting of static graphs to random graphs. Specifically, letting X = (X 1 , . . . , X p ) T and Y = (Y 1 , . . . , Y q ) T denote two random vectors, Li, Chun, and Zhao (2012) considered the model (i, j) ∉ E 0 ⟺ X i ⊥⊥ X j | (X −(i,j) , Y = y), for all y ∈ R q . In this model, E 0 ⊆ 2 V×V is a fixed graph, and does not change with the value of Y. In comparison, our model in Equation (2) allows X to be a p-variate random function, and the graph E y to vary with Y.

Linear Operators
In this section, we first introduce a series of linear operators, based on which we then formally define the CPO and the CPCO. Finally, we study the relation between these two operators and the conditional functional graph. We adopt the following notation throughout this article. For two generic Hilbert spaces H 1 and H 2 , let B(H 1 , H 2 ) and B 2 (H 1 , H 2 ) denote the class of all bounded operators and the class of all Hilbert-Schmidt operators from H 1 to H 2 , respectively. We abbreviate B(H 1 , H 1 ) and B 2 (H 1 , H 1 ) as B(H 1 ) and B 2 (H 1 ) whenever appropriate. Let ‖·‖ and ‖·‖ HS be the operator norm in B(H 1 ) and the Hilbert-Schmidt norm in B 2 (H 1 ). Moreover, let ker(A), range(A), and the closure of range(A) denote the null space, the range, and the closure of the range of an operator A, respectively.

Conditional Covariance and Correlation Operators
We first define three covariance operators, V YY , V X i X j , and V YX ij . We then define the conditional covariance operator V y X i X j and the conditional correlation operator C y X i X j , which are the building blocks of the CPO and the CPCO.
Let κ Y : Y × Y → R be a positive-definite kernel, H Y be its corresponding RKHS, and L 2 (P Y ) be the collection of all square-integrable functions of Y under P Y . The next assumption ensures the square-integrability of X i under P X|Y , and that H Y is a subset of L 2 (P Y ).
We comment that we choose the RKHS as the modeling space so that the relation of X on Y can be very flexible, and the kernel matrix of Y is of dimension n × n, an attractive feature when the dimension of Y is large compared with n. In addition, a good number of asymptotic tools for RKHS operators have been developed (see, e.g., Bach 2009; Lee, Li, and Zhao 2016a). That being said, our theoretical development can also be easily extended to spaces beyond the RKHS. In fact, our population-level development only requires that H Y be a proper subset of L 2 (P Y ), which can be ensured by the square-integrability condition. The expectations of the corresponding bilinear forms uniquely define the covariance operators in Equation (3), for all f ∈ X i , g ∈ X j , and h, h 1 , h 2 ∈ H Y . The next proposition justifies the existence of V YY , V X i X j , and V YX ij . Proposition 1. If Assumption 1 holds, then there exist linear operators V YY , V X i X j , and V YX ij satisfying the relations in Equation (3).
We next introduce a regression operator, M X ij |Y , defined through the relation M X ij |Y = (V YY ) † V YX ij , where † denotes the Moore-Penrose inverse. We first need an assumption on the ranges of V YY and V YX ij , which ensures that H Y is sufficiently rich in L 2 (P Y ).

By Assumption 2, the regression operator M X ij |Y is well defined, and this assumption is generally satisfied. For instance, it holds when the corresponding operator in B 2 ( X j , X i ) has finite rank, which is reasonable, because in practice X i can often be approximated by the span of a few leading eigenfunctions of V X i X i .
The next proposition shows that M X ij |Y maps each (f , g) to the corresponding conditional expectation, so this operator can be viewed as a regression operator.
Proposition 2. If Assumptions 1 and 2 hold, then for all f ∈ X i and g ∈ X j , M X ij |Y (f , g) equals the conditional expectation E[⟨f , X i ⟩⟨g, X j ⟩ | Y]. Now we are ready to define the conditional covariance operator, whose existence is justified by Proposition 2 and the Riesz representation theorem.
Definition 2 introduces the conditional covariance operator V y X i X j . When X i and X j are random vectors, Fukumizu, Bach, and Jordan (2009) introduced the homoscedastic conditional covariance operator, which induces E[cov(⟨f , X i ⟩, ⟨g, X j ⟩ | Y)]. Our conditional covariance operator is different from that of Fukumizu, Bach, and Jordan (2009), as it extends the classical conditional covariance to the functional setting, and it deals directly with cov(⟨f , X i ⟩, ⟨g, X j ⟩ | y), instead of its expectation. We write the joint operator V y XX : X → X as the block matrix of operators whose (i, j)th entry is V y X i X j . Given the conditional covariance operator V y X i X j , we next define the conditional correlation operator C y X i X j , whose existence and uniqueness are ensured by Baker (1973). Similar to the construction of V y XX , we write the joint operator C y XX : X → X as the block matrix of operators whose (i, j)th entry is C y X i X j . Next, we impose the distributional assumption on X | y.
Assumption 3. Suppose X | y follows the conditional functional graphical model as defined via (2), and the conditional distribution of X = (X 1 , . . . , X p ) T given Y = y follows a centered Gaussian distribution.
The zero-mean condition is imposed to simplify both methodological and theoretical development, and can be relaxed with some modifications. Moreover, we require the conditional distribution X | y to follow a Gaussian distribution. It is possible to relax this Gaussian assumption, by extending the notion of functional additive conditional independence (Li and Solea 2018), or the copula graphical model (Liu et al. 2012). However, we feel that the Gaussian case itself is worthy of a full investigation, and we leave the non-Gaussian extension as future research.
Assumption 3, together with Definition 2, implies that, for

Conditional Precision and Partial Correlation Operators
We next formally define the operators CPO and CPCO, then establish their relations with the conditional functional graph. We introduce two assumptions, which ensure that C y X i X j is Hilbert-Schmidt and C y XX is invertible. We present some intuitions here, but relegate the detailed technical discussion to Section S.3 in the appendix.
Assumption 4 characterizes the level of smoothness of the underlying distributions of the random functions. Assumption 5 prevents a constant function from arising as a linear combination of nonconstant functions. It can be viewed as a generalization of the absence of collinearity in linear models, or of the empty concurvity space in generalized additive models (Hastie and Tibshirani 1990). Assumption 4 ensures that C y X i X j is Hilbert-Schmidt, and thus compact. Meanwhile, Assumptions 4 and 5 together ensure that C y XX is lower bounded by a strictly positive constant. This implies that C y XX is invertible, and that P y = (C y XX ) −1 is bounded, which justifies the following definition.
Definition 3. Define the CPO as the inverse of the joint conditional correlation operator, P y = (C y XX ) −1 . The operator P y generalizes the precision matrix to the functional and conditional settings. We should clarify that, unlike the standard definition where the precision matrix is defined as the inverse of the covariance matrix (Cai, Liu, and Luo 2011), our CPO is defined as the inverse of the conditional correlation operator. This is to avoid inverting the covariance operator, which is usually not invertible because of its compactness. Next, we develop an operator that generalizes the partial correlation matrix to the functional and conditional setting. Similar to the definition of V y XX , we define V y X A X B for any subsets A, B ⊆ V, and, for any subset A ⊆ V\{i, j}, we define an intermediate operator through the conditional covariance of X i and X j given X A . Its existence is established in Proposition 3, whose proof is similar to that of Lee, Li, and Zhao (2016a, theo. 1) and is thus omitted. It justifies the definition of the following operator.
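To build intuition for Definition 3, the following minimal sketch works in a finite-dimensional analogue, where the conditional correlation operator reduces to a correlation matrix and the CPO to its ridge-regularized inverse. The data, the dimension, and the ridge value are all illustrative assumptions, not part of the method.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 4))           # illustrative 4-dimensional data
Sigma = np.cov(Z, rowvar=False)             # sample covariance matrix

d = np.sqrt(np.diag(Sigma))
C = Sigma / np.outer(d, d)                  # correlation matrix; C[i, i] = 1

eps = 1e-2                                  # ridge parameter, for numerical stability
P = np.linalg.inv(C + eps * np.eye(4))      # finite-dim analogue of P y = (C y XX)^(-1)
```

The ridge term plays the same role as the regularization used in the estimation section: the population operator is invertible, but its sample version need not be well conditioned.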
Definition 4. We call the operator R y X i X j |X −(i,j) in Proposition 3 the CPCO between X i and X j given X −(i,j) and Y.

Relation With Conditional Functional Graph
We first show that the conditional covariance operator can be constructed from the conditional covariances between the Karhunen-Loève coefficients, together with the associated eigenfunctions. This simple form provides a convenient way to estimate the conditional covariance operator later. Let {(λ a i , η a i )} a∈N denote the collection of eigenvalue and eigenfunction pairs of V X i X i . Then X i admits the expansion X i = Σ a∈N α a i η a i , where α a i = ⟨X i , η a i ⟩ are the Karhunen-Loève coefficients. This expression is known as the Karhunen-Loève (K-L) expansion (Bosq 2000).
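As an illustration of the K-L expansion on a discrete time grid, the following sketch estimates the eigenpairs of a sample covariance operator and the resulting coefficients. The simulated Brownian-like paths, the grid size, and the truncation level d are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 100, 50                              # sample size, time grid size
X = np.cumsum(rng.standard_normal((n, m)), axis=1) / np.sqrt(m)  # rough Brownian paths

Xc = X - X.mean(axis=0)                     # center the curves
V = Xc.T @ Xc / n                           # sample covariance operator on the grid
lam, eta = np.linalg.eigh(V)
lam, eta = lam[::-1], eta[:, ::-1]          # sort eigenpairs in decreasing order

d = 3                                       # truncation level (illustrative)
alpha = Xc @ eta[:, :d]                     # sample K-L coefficients <X^k, eta^a>
X_trunc = alpha @ eta[:, :d].T              # truncated K-L expansion
```

For Brownian-type paths the eigenvalues decay quickly, so a few leading components capture most of the variation; the truncated expansion `X_trunc` is then a close reconstruction of the centered curves.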
Proposition 4. Suppose the same conditions as in Proposition 2 hold. Then the conditional covariance operator admits a representation through the conditional covariances of the K-L coefficients. We next show that the CPO and the CPCO fully characterize the conditional independence between the random functions, and are thus crucial for our estimation of the conditional functional graph.
Theorem 1. If Assumptions 1-5 hold, then, for any y ∈ Y , (i, j) ∉ E y if and only if [P y ] i,j = 0. Under the Gaussian distribution, the equivalence between conditional independence and a zero element of a nonrandom precision matrix is well known in the classical random variable setting. Theorem 1 extends this equivalence to the setting of random functions, and also allows the precision operator to vary with Y.
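The classical random variable version of this equivalence can be verified numerically: for a Gaussian vector, a zero off-diagonal entry of the precision matrix forces a zero conditional covariance, obtained here via the Schur complement. The 3 × 3 precision matrix below is an arbitrary illustrative choice.

```python
import numpy as np

# Precision matrix with Omega[0, 2] = 0: under a centered Gaussian,
# X1 and X3 are then conditionally independent given X2.
Omega = np.array([[2., -1., 0.],
                  [-1., 2., -1.],
                  [0., -1., 2.]])
Sigma = np.linalg.inv(Omega)

# Conditional covariance of (X1, X3) given X2 via the Schur complement
A, B = [0, 2], [1]
cond = (Sigma[np.ix_(A, A)]
        - Sigma[np.ix_(A, B)] @ np.linalg.inv(Sigma[np.ix_(B, B)]) @ Sigma[np.ix_(B, A)])
# cond equals inv(Omega[A, A]), so its off-diagonal entry is 0
```

Theorem 1 is the operator-valued, conditional analogue of this fact, with the matrix inverse replaced by the CPO.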
Theorem 2. If Assumptions 1-3 hold, then, for any y ∈ Y , (i, j) ∉ E y if and only if R y X i X j |X −(i,j) = 0. Theorems 1 and 2 suggest that one can estimate the conditional functional graph E y in Equation (2) through the proposed operators, CPO or CPCO. In the following, we primarily focus on the graph estimation based on CPO, and investigate the corresponding asymptotics. The results based on CPCO can be derived in a parallel fashion, which we briefly discuss in Section S.4 in the appendix.

Estimation
In this section, we first derive the sample estimate of CPO and the conditional graph at the operator level. We then construct empirical bases and develop coordinate representations for the functions observed at a finite set of time points. Using these coordinate representations, we are able to compute our estimated linear operators. Last, we provide a step-by-step summary of our proposed estimation procedure.

Operator-Level Estimation
We first derive the sample Karhunen-Loève expansion. We then sequentially develop the estimators of V y XX , C y XX , P y , and finally E y .
Let (Y 1 , . . . , Y n ) denote iid samples of Y, and (X 1 , . . . , X n ) denote iid samples of X, with X k = (X k 1 , . . . , X k p ) T , for k = 1, . . . , n. Let E n denote the sample mean operator; that is, for a sample (ω 1 , . . . , ω n ), E n (ω) = Σ n k=1 ω k /n. We estimate the covariance operators V X i X i , V YX ij , and V YY by their sample counterparts. By its definition, we estimate the regression operator M X ij |Y by a ridge-regularized sample version, where ε Y > 0 is a prespecified ridge parameter that imposes a level of smoothness on the regression structure. Next, by Propositions 2 and 4, given y, d, i, j, we estimate the conditional covariance operator V y X i X j . Then, by Equation (4), we estimate the conditional correlation operator, where the inverse involves the identity mapping from X i to X i , and ε 1 is a ridge regularization parameter imposed on the inverses of V y X i X i . The next result shows that the norm of the resulting estimator Ĉ y X i X j (d, ε Y , ε 1 ) is bounded by one, which resembles the property of the correlation in the classical setting.
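A minimal sketch of the kernel ridge step, in a simplified scalar setting: the conditional cross-moment of two K-L coefficients given Y = y is estimated by smoothing their product against y. The RBF kernel, its bandwidth, the ridge value, and the simulated coefficients are all assumptions for illustration; under the zero-mean condition this cross-moment plays the role of the conditional covariance.

```python
import numpy as np

def rbf_gram(y, gamma=1.0):
    # RBF kernel Gram matrix of the conditioning variable (assumed kernel)
    return np.exp(-gamma * (y[:, None] - y[None, :]) ** 2)

rng = np.random.default_rng(2)
n = 200
y = rng.uniform(-1.0, 1.0, n)
a1 = rng.standard_normal(n)                 # hypothetical K-L coefficients:
a2 = y * a1 + 0.5 * rng.standard_normal(n)  # here E[a1 * a2 | y] = y

K = rbf_gram(y)
eps_Y = 1e-2                                # ridge parameter
W = np.linalg.solve(K + n * eps_Y * np.eye(n), K)  # kernel ridge smoother

# Smoothed conditional cross-moment E[a1 * a2 | y] at the observed points;
# under the zero-mean assumption this is the conditional covariance.
cond_cov = W.T @ (a1 * a2)
```

The n × n Gram matrix is the only object whose size grows with the sample, which is the computational attraction of the RKHS formulation noted in Section 3.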
Finally, we estimate the conditional precision operator P y by P̂ y (d, ε Y , ε 1 , ε 2 ), where ε 2 > 0 is another ridge parameter. Using P̂ y (d, ε Y , ε 1 , ε 2 ), for each y ∈ Y , we estimate the graph E y by hard thresholding: Ê y = {(i, j) ∈ V × V : i ≠ j, ‖[P̂ y ] i,j ‖ HS ≥ ρ CPO }, where ρ CPO > 0 is a thresholding parameter. That is, we obtain an estimator of the conditional graph by hard thresholding. For notational simplicity, hereafter we abbreviate these estimators by suppressing the tuning parameters.
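The hard-thresholding step can be sketched as follows, treating the estimated CPO as a p × p array of d × d coordinate blocks whose Hilbert-Schmidt norms reduce to Frobenius norms. The toy blocks and threshold are illustrative values, not output of the full procedure.

```python
import numpy as np

def estimate_graph(P_blocks, rho):
    # Keep edge (i, j) when the Hilbert-Schmidt (here: Frobenius) norm of
    # the (i, j) coordinate block of the estimated CPO is at least rho.
    p = P_blocks.shape[0]
    return {(i, j) for i in range(p) for j in range(p)
            if i != j and np.linalg.norm(P_blocks[i, j], 'fro') >= rho}

# Toy example: p = 3 nodes, d = 2 coefficients per node (illustrative values)
P = np.zeros((3, 3, 2, 2))
P[0, 1] = P[1, 0] = np.eye(2)       # one strong off-diagonal block -> one edge
E_hat = estimate_graph(P, rho=0.5)  # {(0, 1), (1, 0)}
```

Symmetric blocks yield a symmetric edge set, matching the undirected graph in Definition 1.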

Empirical Bases and Coordinate Representation
We next introduce a set of empirical bases and the corresponding coordinate representations. We adopt the following notation. Let H be a Hilbert space of functions on T, spanned by a set of bases B = {b 1 , . . . , b m }. For any ω ∈ H, let [ω] B = ([ω] B,1 , . . . , [ω] B,m ) T denote its coordinate with respect to B, through which ω can be represented. Let K B = (⟨b s , b t ⟩) m s,t=1 denote the Gram matrix of the bases, which implies that, for any (ω 1 , ω 2 ) ∈ H, the inner product can be computed as ⟨ω 1 , ω 2 ⟩ = [ω 1 ] B T K B [ω 2 ] B . Note that the random function X k i can only be observed at a finite set of points. To enable computation, we need to approximate the random functions using the partially observed data. We adopt the construction used in Li and Solea (2018), which assumes the sample path of X k lies in an RKHS of the time variable T with a finite basis. Specifically, suppose X k i is observed on a finite set of time points T k = {T k1 , . . . , T kn k }, k = 1, . . . , n. Let T = (T 11 , . . . , T 1,n 1 , . . . , T n1 , . . . , T nn n ) T = (T 1 , . . . , T N ) T , which pools together all the unique time points across all subjects, where N is the length of T . Let κ T be a positive-definite kernel of the time variable, K T be the N × N Gram matrix of κ T , and consider its eigen-decomposition. Here we require m ≤ N. We can then construct an orthonormal basis via (11), where B 0 (·) = [B 0 1 (·), . . . , B 0 N (·)] T , and X k i can be represented in this basis. Note that for the kth individual, the function is observed at n k time points. Therefore, for a given ridge parameter ε T , the coordinate [X k i ] B can be estimated by ridge-regularized least squares. We next derive the coordinate representations of V̂ X i X i , V̂ YX ij , and V̂ YY , which then lead to the coordinates of the estimated conditional operators. Moreover, for each i ∈ V, let {(λ̂ a i , η̂ a i )} m a=1 denote the collection of eigenvalue and eigenvector pairs from the sample Karhunen-Loève expansion of X k i , whose form is given in (13). Finally, we compute the squared Hilbert-Schmidt norm of [P̂ y ] i,j through the Frobenius norm of its coordinate matrix, for (i, j) ∈ V × V with i ≠ j.
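One plausible reading of this construction, sketched on simulated data: build the reduced basis from the leading eigenpairs of the Gram matrix (a Nystrom-style evaluation, which is our assumption about the form of Equation (11)), then recover a curve's coordinates by ridge-regularized least squares. The Brownian kernel, m, ε T , and the test curve are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
T = np.sort(rng.uniform(0.05, 1.0, 40))     # pooled observation times, length N
K = np.minimum(T[:, None], T[None, :])      # Brownian-motion Gram matrix of kappa_T

m = 5                                       # reduced basis size, m <= N
lam, U = np.linalg.eigh(K)
lam, U = lam[::-1][:m], U[:, ::-1][:, :m]   # m leading eigenpairs of K

def basis(t):
    # Nystrom-style evaluation of the m reduced basis functions at times t
    Kt = np.minimum(t[:, None], T[None, :])
    return Kt @ U / np.sqrt(lam)

B = basis(T)                                # N x m design matrix
x_obs = np.sin(2 * np.pi * T)               # a fully observed test curve
eps_T = 1e-6                                # ridge parameter
coef = np.linalg.solve(B.T @ B + eps_T * np.eye(m), B.T @ x_obs)
mse = np.mean((B @ coef - x_obs) ** 2)      # small: the curve is well represented
```

In practice each subject contributes only its own n k time points, so the design matrix for subject k is `basis(T_k)` rather than the full N × m matrix used here.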

Algorithm
We now summarize our conditional functional graph estimation algorithm based on the CPO. The algorithm based on the CPCO is similar and is thus omitted. Let λ i (A) be the ith largest eigenvalue of an a × a matrix A, for i = 1, . . . , a.

1.
Choose the kernel function κ T . Some commonly used kernel functions include the Brownian motion kernel κ T (s, t) = min(s, t), and the radial basis function (RBF) kernel. Compute the N × N Gram matrix K T of κ T , and let U 1 T D 1 T (U 1 T ) T be its eigen-decomposition associated with its m leading eigenvalues. Then use (11) to construct the reduced basis B(·), and the matrix B k . We suggest choosing m as the smallest value such that the cumulative percentage of total variation of K T explained exceeds 99%.

4.
Perform the eigen-decomposition of V̂ X i X i , for i ∈ V, and obtain the ath eigenvector η̂ a i and the ath K-L coefficient α̂ k,a i using (13), a = 1, . . . , d. Stack α̂ 1,a i , . . . , α̂ n,a i to form the vector a a i , a = 1, . . . , d, i = 1, . . . , p. We choose d, analogously to m, by the cumulative percentage of variation explained. Next, choose the kernel function κ Y , and compute the corresponding n × n Gram matrix K Y . Compute the coordinate representations in (14), with the ridge parameter ε Y . Then compute the representation of P̂ y using (15), for a given y ∈ Y and the ridge parameters ε 1 = n −1/5 max{λ 1 ( V y X i X i ) : i ∈ V} and ε 2 = n −1/5 λ 1 ( P y ).

7.
Estimate the conditional functional graph for a given y ∈ Y with a given threshold ρ CPO . We determine ρ CPO by minimizing a generalized cross-validation (GCV) criterion over a set of grid points; the criterion and its coordinate representation are derived in Section S.4 of the appendix, where DF i (ρ) = d 2 card[N y i (ρ)], with card(·) being the cardinality and N y i (ρ) = {j ∈ V : (i, j) ∈ Ê y (ρ)} the neighborhood of the ith node in Ê y (ρ).
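The degrees-of-freedom term DF i (ρ) in the GCV criterion can be sketched as follows; the residual part of the criterion is derived in Section S.4 of the appendix and is not reproduced here. The toy block norms, threshold, and d are illustrative values.

```python
import numpy as np

def degrees_of_freedom(P_norms, rho, d):
    # DF_i(rho) = d^2 * card(N_i(rho)), where N_i(rho) is the set of
    # neighbors of node i in the graph thresholded at rho.
    E = (P_norms >= rho)
    np.fill_diagonal(E, False)      # no self-loops
    return d ** 2 * E.sum(axis=1)

# Hypothetical HS norms of the CPO blocks for p = 3 nodes
P_norms = np.array([[0.0, 0.8, 0.1],
                    [0.8, 0.0, 0.6],
                    [0.1, 0.6, 0.0]])
df = degrees_of_freedom(P_norms, rho=0.5, d=3)  # per-node DF: [9, 18, 9]
```

Larger thresholds prune more edges and so shrink DF i (ρ), which is the trade-off the GCV criterion balances against fit.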
Our graph estimation algorithm involves multiple tuning parameters, and their choices are given in the above algorithm. We further study their effects in Section S.6 of the appendix. In general, we have found that the estimated graph is not overly sensitive to the tuning parameters as long as they are within a reasonable range.
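The cumulative-variation rule used above for choosing m (and analogously d) can be sketched as follows; the eigenvalues are hypothetical numbers for illustration.

```python
import numpy as np

def choose_m(gram_eigvals, threshold=0.99):
    # Smallest m whose leading eigenvalues explain at least `threshold`
    # of the total variation of the Gram matrix.
    lam = np.sort(np.asarray(gram_eigvals, dtype=float))[::-1]
    ratio = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(ratio, threshold) + 1)

# Hypothetical eigenvalues of a Gram matrix K_T
lam = [70.0, 20.0, 9.5, 0.3, 0.2]
m = choose_m(lam)  # the three leading eigenvalues explain 99.5%, so m = 3
```

The same function applies to the eigenvalues of V̂ X i X i when choosing the K-L truncation level d.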
We also remark that the above algorithm assumes the zero-mean condition. In practice, if this does not hold, we can easily modify the algorithm. Specifically, from the coordinates of the estimated conditional covariance operator, we can estimate the conditional mean E(α a i | y) by (a a i ) T c(y)1 n , where 1 n is the n-dimensional vector with all ones. By recentering the K-L coefficients and redefining the conditional covariance operator accordingly, we can then proceed with Step 5 of the algorithm and all subsequent procedures.

Asymptotic Theory
We begin with some useful concentration inequalities and uniform convergence results for several relevant operators. We next establish the uniform convergence of the CPO, and the consistency of the estimated graph. For simplicity, we first assume that the trajectory of the random functions X(t) = [X 1 (t), . . . , X p (t)] T is fully observed for all t ∈ T. We then discuss the setting where X i is only partially observed in Section 5.3. We also remark that all our theoretical results allow the dimension of the graph to diverge with the sample size.

Concentration Inequalities and Uniform Convergence
We first derive the concentration bound and uniform convergence rate for the sample estimators V̂ X i X j , V̂ YX ij , and V̂ YY . For two positive sequences {a n } and {b n }, write a n ≺ b n if a n = o(b n ), a n ⪯ b n if a n = O(b n ), and a n ≍ b n if a n ⪯ b n and b n ⪯ a n ; moreover, let a n ∨ b n = b n if a n ⪯ b n . Similarly, if c n is a third positive sequence, then we let a n ∨ b n ∨ c n = (a n ∨ b n ) ∨ c n .
Theorem 3. If Assumptions 1 and 3 hold, then there exist positive constants C 1 to C 6 such that the corresponding concentration bounds hold in the Hilbert-Schmidt norm. We define the exponent of a compact and self-adjoint operator A as A β = Σ a∈N λ β a (η a ⊗ η a ), for any β > 0, where {(λ a , η a )} a∈N is the collection of eigenvalue and eigenfunction pairs of A. We need another assumption.

Assumption 6. For any
This assumption regulates the complexity of (X i , X j ) T given Y. To see this, note that, as β increases, the regression relation of (X i , X j ) T on Y is more concentrated on those eigenfunctions corresponding to larger eigenvalues of V YY . Also, we define κ d ≡ min{|λ a i − λ b i | : 1 ≤ a < b ≤ d + 1, i ∈ V} as the minimal isolation distance among all d + 1 leading eigenvalues of V X i X i , for all i ∈ V. Note that we allow d to grow with n. Theorem 4. If Assumptions 1-3 and 6 hold, and ε Y ≺ 1, then the stated uniform convergence holds. Next, we establish the uniform convergence of the estimated conditional correlation operator Ĉ y XX . We need an assumption on the tail behavior of the random functions.
Assumption 7 concerns the decay rate of the tail eigenvalues of V y X i X i , which, in a sense, characterizes the smoothness of the distribution of X i given Y.

Uniform Convergence of CPO and Graph Consistency
We next derive the convergence of the estimated CPO,P y . We need an assumption to regulate the relation between X i and X j when conditioning on X −(i,j) and Y.
The following proposition provides a condition under which Assumption 8 is satisfied. Its proof is similar to that of Proposition S2 and is thus omitted.
Proposition 8. Suppose Assumptions 1 and 3-5 hold. Then, for any y ∈ Y , there exists c 3 > 0 such that Equation (16) holds. Both conditions (5) and (16) introduce certain levels of smoothness on the conditional distribution of X given Y. Nevertheless, they concern different relations: Equation (5) is about the relation between X i and X j given Y, which is required for the consistency of C y X i X j , whereas Equation (16) is about the relation between X i and X j given X −(i,j) and Y, which is required for the consistency of [P y ] i,j .
Theorem 6. If Assumptions 1-8 hold, and ε Y , ε 1 ≺ 1, then the stated uniform convergence and graph estimation consistency hold. We make a few remarks. First, Li and Solea (2018) established the consistency of their precision operator, as well as the unconditional graph estimation consistency. Note that their operator is nonrandom, and their results were derived with the graph size p fixed. By contrast, Theorem 6 establishes the uniform convergence of the operator [P y ] i,j , which is random, and the graph estimation consistency is obtained with a diverging p. For instance, if ε 2 = O(n −π/2 ) with π > 0, then Theorem 6 says that the uniform convergence rate of the estimated CPO depends on p(log p/n 1−π ) 1/2 . This implies that we allow p to grow at a polynomial rate of n.
Second, a careful inspection of our proof reveals that the convergence rate of the empirical CPO depends on the difference ‖Ĉ^y_{XX} − C^y_{XX}‖_HS, whose order, by Theorem 5, can be no faster than p(log p/n^{1−π})^{1/2}. We note that, under the classical random variable setting, the convergence rate of the hard-thresholded sample covariance matrix in Frobenius norm is [p c_0(p) log p/n]^{1/2}, where the term c_0(p) imposes a sparsity structure on the covariance matrix and satisfies ∑_{j=1}^p 1[cov(X_i, X_j) ≠ 0] ≤ c_0(p), for i = 1, ..., p, with 1(·) the indicator function (Bickel and Levina 2008, Theorem 2). We feel our rate of p(log p/n^{1−π})^{1/2} is reasonable and comparable to this classical result. We also note the difference between the factors p and p^{1/2} in our rate and the rate of Bickel and Levina (2008, Theorem 2). This difference is mainly due to the different sparsity settings imposed by our method and by Bickel and Levina (2008). We do not impose any sparsity structure on the conditional correlation operator C^y_{XX}, so we need to estimate all the off-diagonal elements of C^y_{XX}, whose cardinality grows at the order of p^2. In comparison, Bickel and Levina (2008) imposed a sparsity structure on the covariance matrix, under which the number of nonzero covariances grows only at the order of p c_0(p). Moreover, we are dealing with a more complicated setting of random functions and random operators. On the other hand, we show in Section S.5 of the appendix that, if we introduce additional sparsity and regularization, we can further improve the rates in Lemma S8 of the appendix and Theorem 6, so that p can grow at an exponential rate of n. In fact, the theoretical platform we have developed is not limited to the present setting.
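As an informal summary of the rate comparison above (the left-hand sides are schematic, with constants and O_p qualifiers suppressed):

```latex
% Our CPO rate, with no sparsity imposed on C^y_{XX}:
\bigl\| \hat{P}^y - P^y \bigr\|
  \;\lesssim\; p\,\Bigl(\frac{\log p}{n^{1-\pi}}\Bigr)^{1/2},
\qquad
% Bickel and Levina (2008, Theorem 2), sparse covariance:
\bigl\| \hat{\Sigma} - \Sigma \bigr\|_F
  \;\lesssim\; \Bigl(\frac{p\,c_0(p)\,\log p}{n}\Bigr)^{1/2}.
```

The extra factor of p^{1/2} on the left reflects estimating all O(p^2) off-diagonal elements, versus O(p c_0(p)) nonzero covariances under the sparsity of Bickel and Levina (2008).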
The idea of using a vanishing CPO to identify the conditional independence between random functions in a conditional graph can extend beyond the scenarios studied in this article; for example, when there are additional sparsity or banded structures.
Finally, in our theory development, we have imposed a series of technical conditions, which are commonly imposed in the literature, and are usually easy to satisfy.

Consistency for Partially Observed Random Functions
We next derive the consistency for the scenario where the random functions are only partially observed. Partially observed functions are collected via either a dense or a non-dense measurement schedule; see Wang, Chiou, and Müller (2016a) for more discussion of measurement schedules. To avoid digressing from the main context, in this article we do not pursue any specific measurement schedule or regularity setting for the partially observed random functions. For partially observed functions X(t), suppose X̃(t) = [X̃_1(t), ..., X̃_p(t)]^T is the estimate of X(t) using the empirical bases developed in Section 4.2. We then estimate the series of operators and the graph accordingly, where φ̃_i is the eigenfunction of Ṽ_{X_i X_i}, and C̃^y_{XX} is the block matrix of operators with (i, j)th element C̃^y_{X_i X_j}. Theorems 5 and 6 show that the convergence of the estimated CCO and CPO depends on the uniform convergence of the sample covariance operators in Theorem 3. In particular, it relies on the convergence rates of max_{i,j∈V} ‖V̂_{X_i X_j} − V_{X_i X_j}‖_HS and max_{i,j∈V} ‖V̂_{YX_{ij}} − V_{YX_{ij}}‖_HS. When the random functions are completely observed, both rates, by Theorem 3, are (log p/n)^{1/2}. When the random functions are only partially observed, we allow the convergence rates of max_{i,j∈V} ‖Ṽ_{X_i X_j} − V_{X_i X_j}‖_HS and max_{i,j∈V} ‖Ṽ_{YX_{ij}} − V_{YX_{ij}}‖_HS to be slower than (log p/n)^{1/2}. More specifically, as specified in Equation (17), we introduce a quantity a that reflects how densely the time points are observed in the random functions. The denser the time points, the closer a is to zero. Correspondingly, the next theorem extends Theorem 6 to partially observed functions. Its proof is similar to that of Theorem 6, and is thus omitted.
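For intuition, a presmoothing step that maps irregular observations of a curve to an estimate X̃(t) on a common grid might look as follows. The paper uses the empirical bases of Section 4.2, so the Nadaraya-Watson smoother below is only a hypothetical stand-in:

```python
import numpy as np

def smooth_curve(t_obs, x_obs, t_grid, bw=0.1):
    """Nadaraya-Watson presmoother with a Gaussian kernel: a stand-in for
    the basis-expansion estimate X~ of a partially observed curve."""
    # weight of each observed point for each grid point
    w = np.exp(-0.5 * ((t_grid[:, None] - t_obs[None, :]) / bw) ** 2)
    return (w @ x_obs) / w.sum(axis=1)

# a curve observed at 15 irregular time points, evaluated on a grid of 50
rng = np.random.default_rng(0)
t_obs = np.sort(rng.uniform(0, 1, 15))
x_obs = np.sin(2 * np.pi * t_obs) + 0.05 * rng.standard_normal(15)
x_tilde = smooth_curve(t_obs, x_obs, np.linspace(0, 1, 50))
```

The denser the observed time points, the closer the presmoothed curve tracks the truth, matching the role of the quantity a in Equation (17).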
The first condition in Equation (17) is satisfied if the tail of ‖Ṽ_{X_i X_j} − V_{X_i X_j}‖_HS behaves like a sub-exponential distribution; that is, when there exist c_1, c_2 > 0 and a ∈ [0, 1), such that P(‖Ṽ_{X_i X_j} − V_{X_i X_j}‖_HS > t) ≤ c_1 exp(−c_2 n^{(1−a)/2} t), for any t ≥ 0. A similar statement holds for the second condition in Equation (17). Recall that in Theorem 3 we have shown that, for completely observed functions, ‖V̂_{X_i X_j} − V_{X_i X_j}‖_HS and ‖V̂_{YX_{ij}} − V_{YX_{ij}}‖_HS have sub-Gaussian tails for small t. As such, Equation (17) is more appropriate for partially observed functions.

Simulations
Next, we study the finite-sample performance of our method through simulations. Specifically, we consider three graph structures for E: a hub, a tree, and a random graph, as shown in Table 1. We first consider p = 10 nodes with sample size n = 100, and consider larger graphs later. Given the graph structure E and the conditioning variable y, we generate X_j(t) from its parent nodes based on the following model, where X(t) | y = [X_1(t) | y, ..., X_p(t) | y]^T is constructed sequentially via the given graph, P_j is the set of parent nodes of j, and c_j is the scale parameter specified in Table 1. We generate the error function ε_i(t) as ∑_{u=1}^J ξ_u κ_T(t, t_u), where ξ_1, ..., ξ_J are iid standard normal variables, t_1, ..., t_J are equally spaced points in [0, 1] with J = 50, and κ_T is an RBF or a Brownian motion covariance function. We then generate the conditioning variables Y_1, ..., Y_n as iid Uniform(0, 1). We generate each X^k_i, k = 1, ..., n, i = 1, ..., p, at n_k = 50 time points. In this model, there are two types of edge patterns: when ν_{ij} = 1, the strength of the edge grows with y, and when ν_{ij} = 0, the strength of the edge decays with y. The tuning parameters are determined by the rules suggested in Section 4.3. In Section S.6 of the appendix, we discuss in detail the effect of the tuning parameters, and show that our method is relatively robust over a range of tuning parameters.
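A sketch of this data-generating scheme is given below. The exact generating equation appears in a display we do not reproduce, so the chain-shaped parent sets and the way edge strength enters are hypothetical stand-ins for Table 1; the error function and the conditioning variable follow the description above:

```python
import numpy as np

rng = np.random.default_rng(0)

def kappa_T(s, t, scale=0.1):
    # RBF covariance kernel
    return np.exp(-(s - t) ** 2 / (2 * scale ** 2))

def error_function(t_grid, J=50):
    # eps(t) = sum_{u=1}^J xi_u * kappa_T(t, t_u), xi_u iid N(0, 1)
    t_u = np.linspace(0, 1, J)
    xi = rng.standard_normal(J)
    return kappa_T(t_grid[:, None], t_u[None, :]).dot(xi)

n, p, n_k = 100, 10, 50
t_grid = np.linspace(0, 1, n_k)
Y = rng.uniform(0, 1, n)                      # conditioning variable

# hypothetical chain-shaped parent sets P_j standing in for Table 1;
# nu[j] = 1: edge strength grows with y; nu[j] = 0: it decays with y
parents = {j: ([j - 1] if j > 0 else []) for j in range(p)}
nu = rng.integers(0, 2, p)

X = np.zeros((n, p, n_k))
for k in range(n):
    for j in range(p):
        strength = Y[k] if nu[j] == 1 else 1 - Y[k]
        signal = sum(strength * X[k, i] for i in parents[j])
        X[k, j] = signal + error_function(t_grid)
```

Each subject k thus yields p curves observed on a common grid of n_k = 50 time points, with edge strengths that grow or decay in y.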
We compare our method with three alternative solutions, all of which are variants of the graphical lasso (gLASSO; Friedman et al. 2008). The first solution, which we refer to as "Average," is similar to Kolar et al. (2010). It first estimates the conditional covariance matrix, where X^k(T_j) = [X^k_1(T_j), ..., X^k_p(T_j)]^T, j = 1, ..., 50, k = 1, ..., n, then estimates the conditional precision matrix by applying gLASSO to the estimated conditional covariance matrix. The second solution, which we refer to as "Majority," is similar to a procedure in Qiao, Guo, and James (2019). It first estimates the covariance matrix at each time point, then estimates the conditional precision matrix at each time point by applying gLASSO. It selects those edges that are detected by the majority of all the estimated graphs. The third solution, referred to as "Unconditional," is similar to the naive procedure in Qiu et al. (2016), which, without using the information in Y, estimates the covariance matrix by ∑_{j=1}^{50} X^k(T_j)[X^k(T_j)]^T / 50 for each subject k, then estimates the precision matrix by applying gLASSO. For the penalty parameter in gLASSO, we adopt the empirical rule suggested by Rothman et al. (2008), and set it to (log p/n)^{1/2}. We use the RBF kernel for both κ_T and κ_Y in all simulations.
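To make the "Unconditional" baseline concrete, the sketch below builds its covariance input and the penalty from the empirical rule of Rothman et al. (2008); the data matrix X here is synthetic, and the gLASSO solve itself is left to a standard package:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, m = 100, 10, 50                     # subjects, nodes, time points
X = rng.standard_normal((n, p, m))        # X[k, i, j] = X^k_i(T_j)

# "Unconditional" input: per-subject covariance averaged over time points,
# sum_j X^k(T_j) X^k(T_j)^T / 50, ignoring the conditioning variable Y
Sigma_k = np.einsum('kim,kjm->kij', X, X) / m

# gLASSO penalty via the empirical rule (log p / n)^{1/2}
lam = np.sqrt(np.log(p) / n)
# Each Sigma_k[k] would then be passed to a graphical-lasso solver,
# e.g. sklearn.covariance.graphical_lasso(Sigma_k[k], alpha=lam).
```

"Average" and "Majority" differ only in how the covariance input is formed (kernel-weighted in y, or computed separately at each time point).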
We evaluate the accuracy of the estimated graph using the area under the ROC curve (AUC). We first compute the false positive (FP) and true positive (TP) rates for a given threshold ρ and y ∈ Y, where E_0 denotes the true graph. We then compute the pairs (TP_{y,ρ}, FP_{y,ρ}) over a set of values of ρ, which we take to be the sorted norms of the empirical CPO. Figure 1 reports the AUC of the estimated graph with respect to the external variable Y, under the three graph structures and for the four methods. It is seen that "Majority" and "Unconditional" consistently perform the worst, while "Average" and our proposed CPO method perform similarly in this example.
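The AUC computation above can be sketched as follows, where scores[i, j] plays the role of the empirical CPO norm for edge (i, j) and truth encodes E_0; both inputs in the toy check are hypothetical:

```python
import numpy as np

def edge_auc(scores, truth):
    """AUC for edge recovery: sweep the threshold rho over the sorted
    scores and trace out the (FP, TP) pairs."""
    iu = np.triu_indices_from(scores, k=1)
    s, t = scores[iu], truth[iu].astype(bool)
    order = np.argsort(-s)                  # decreasing score = decreasing rho
    t = t[order]
    tp = np.concatenate([[0.0], np.cumsum(t) / max(t.sum(), 1)])
    fp = np.concatenate([[0.0], np.cumsum(~t) / max((~t).sum(), 1)])
    # trapezoidal area under the (FP, TP) curve
    return float(np.sum(0.5 * (tp[1:] + tp[:-1]) * np.diff(fp)))

# toy check: scores that perfectly separate the true edges give AUC = 1
truth = np.zeros((4, 4))
truth[0, 1] = truth[2, 3] = 1
truth = truth + truth.T
auc = edge_auc(truth.astype(float), truth)
```

Averaging such AUC values over a grid of y values gives curves like those in Figure 1.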
We next extend our simulations to larger graphs, where we set the graph size p ∈ {30, 50, 100}, with sample size n = 30 and n_k = 30 time points, k = 1, ..., n. We then generate the graph in a similar way as before. Specifically, for the hub structure, we generate p/5 independent hubs, and within each hub, two edges have strength growing with y and two edges have strength decaying with y. For the tree structure, we expand the tree until the graph reaches the designated size, and in each subsequent layer of the tree, one edge has strength growing and the other decaying with y. For the random structure, we select the edges randomly following a Bernoulli distribution with probability 1/(p−1), and randomly assign half of the selected edges growing strength and the other half decaying strength. Due to the poor performance of "Majority" and "Unconditional" in the previous setting, we only compare our CPO method with "Average" in the large-graph setting. Figure 2 reports the AUC. It is seen that our CPO method clearly outperforms "Average" in some settings, and is comparable in the others. In particular, for a large graph with p = 100, the improvement of CPO over "Average" is substantial for both the hub and random structures.
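The random-structure generator described above can be sketched directly; the seed and the tie-breaking for an odd number of edges are implementation choices, not part of the paper's specification:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 30

# random structure: include each possible edge with probability 1/(p - 1)
i_idx, j_idx = np.triu_indices(p, k=1)
keep = rng.random(i_idx.size) < 1.0 / (p - 1)
edges = list(zip(i_idx[keep].tolist(), j_idx[keep].tolist()))

# randomly split the selected edges: roughly half grow with y (nu = 1)
# and half decay with y (nu = 0)
nu = (rng.permutation(len(edges)) < len(edges) / 2).astype(int)
```

With inclusion probability 1/(p−1), the expected number of edges is p/2, keeping the random graphs sparse as p grows.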

Application
In this section, we illustrate our conditional functional graph estimation method with a brain functional connectivity analysis. We analyze a dataset from the Human Connectome Project (HCP), which consists of resting-state fMRI scans of 549 individuals. Each fMRI scan has been processed and summarized as a spatial-temporal matrix, with rows corresponding to 268 brain regions of interest, and columns corresponding to 1200 time points (Greene et al. 2018). In addition, each subject has a score on the Penn Progressive Matrices, a nonverbal group test typically used in educational settings that is generally taken as a surrogate for fluid intelligence. It is of scientific interest to unravel the connectivity patterns of the brain regions conditioning on this intelligence measure.
We apply our conditional functional graphical modeling approach to the whole-brain 268×268 connectivity network. Applying the hard thresholding approach yields a sparse estimate of the conditional graph at each value of Y = y. To identify the edges that vary with the conditioning variable Y, we further employ a permutation test. Specifically, we permute the observations of Y 500 times. For each permuted sample, we compute the Hilbert-Schmidt (H-S) norm of the sample CPO for each edge. We then compute the variance of the H-S norms over the 500 permutations. We treat those edges whose corresponding variances are above the 95th percentile as significant, since in this context a nonzero variance implies that the CPO varies with the conditioning variable. In addition, the 268 brain regions have been partitioned into eight functional modules: medial frontal, frontoparietal, default mode, motor, visual, limbic, basal ganglia, and cerebellum (Finn et al. 2015). We count the number of significant edges from the permutation test within each functional module, then use Fisher's exact test to determine whether a module is significantly enriched.
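The permutation logic can be sketched as follows. The function cpo_hs_norm below is a placeholder: the actual H-S norm of the sample CPO requires the full estimator of the paper, so a simple y-weighted correlation magnitude stands in for it, and the percentile rule is interpreted as a quantile across edges:

```python
import numpy as np

rng = np.random.default_rng(2)

def cpo_hs_norm(X, Y):
    """Placeholder edge score: stands in for the Hilbert-Schmidt norm of
    the sample CPO at each edge; here a y-weighted correlation magnitude."""
    feat = X.mean(axis=2) * Y[:, None]       # hypothetical summary feature
    return np.abs(np.corrcoef(feat.T))       # (p, p) matrix of edge scores

def varying_edges(X, Y, n_perm=500, level=0.95):
    # permute Y, recompute the edge scores, and flag edges whose score
    # variance across permutations exceeds the chosen percentile
    norms = np.stack([cpo_hs_norm(X, rng.permutation(Y))
                      for _ in range(n_perm)])
    v = norms.var(axis=0)
    return v > np.quantile(v, level)

X = rng.standard_normal((30, 8, 20))         # 30 subjects, 8 regions
Y = rng.uniform(size=30)                     # intelligence-score surrogate
sig = varying_edges(X, Y, n_perm=50)         # 50 permutations for speed
```

Counting flagged edges within each module and testing the counts (e.g., with Fisher's exact test) then gives the enrichment analysis described above.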
We find that the medial frontal module is significantly enriched with edges that vary with the intelligence score. Figure 3 shows the identified significant edges. This finding agrees with the literature, as this module has been reported to contribute to high-level cognitive functions such as emotional responses (Smith et al. 2018) and learning (Zanto and Gazzaley 2013). Moreover, this module was found to have more impact on fluid intelligence than the other modules (Finn et al. 2015). We also observe more cross-lobe and inter-hemisphere interactions, which suggests the importance of these edges in cognitive functions. This again agrees with the existing neuroscience literature, where altered interhemispheric interactions of the prefrontal lobe have been found to be closely related to higher-level cognitive traits such as Internet gaming disorder (Wang et al. 2015). Figure 4 reports the changes of the identified significant edges for the medial frontal module at five different values of the intelligence score. We see from the plot that both increasing and decreasing patterns exist in the strength of the significantly varying edges.
As an independent validation, we replicate the analysis using another resting-state fMRI dataset of 828 individuals in HCP. We report the changes of the identified significant edges for the medial frontal module from the new dataset in Figure S6 of the appendix. We again find that the medial frontal module is enriched. Hearne, Mattingley, and Cocchi (2016) identified positive correlations between functional connectivity and intelligence in general. Song et al. (2008) reported that most of the brain regions whose connectivity patterns are negatively correlated with intelligence are around the medial frontal gyrus or are part of the motor regions, both of which belong to the medial frontal module. Combined with our findings, this suggests that the medial frontal module may play a unique role in intelligence compared to other brain modules.