Nonparametric Functional Graphical Modeling Through Functional Additive Regression Operator

Abstract In this article, we develop a nonparametric graphical model for multivariate random functions. Most existing graphical models are restricted by the assumptions of multivariate Gaussian or copula Gaussian distributions, which also imply linear relations among the random variables or functions on different nodes. We relax those assumptions by building our graphical model on a new statistical object—the functional additive regression operator. By carrying out regression and neighborhood selection at the operator level, our method can capture nonlinear relations without requiring any distributional assumptions. Moreover, the method is built up using only one-dimensional kernels; it thus avoids the curse of dimensionality from which a fully nonparametric approach often suffers, and enables us to work with large-scale networks. We derive error bounds for the estimated regression operator and establish graph estimation consistency, while allowing the number of functions to diverge at an exponential rate of the sample size. We demonstrate the efficacy of our method by both simulations and an analysis of an electroencephalography dataset. Supplementary materials for this article are available online.


Motivation and Proposal
Multivariate functional data, where continuous observations are sampled from a vector of stochastic processes, are emerging in a wide range of scientific applications. Examples include time-course gene expression data in genomics studies (Wei and Li 2008), multivariate time series data in finance (Tsay and Pourahmadi 2017), and electrocorticography and functional magnetic resonance data in neuroimaging (Zhang et al. 2015), among many others. Functional data analysis has received enormous interest recently; see, for instance, Ramsay and Silverman (2005) and Hsing and Eubank (2015) for reviews of contemporary developments. A central problem in multivariate functional data analysis is to investigate the interdependence among the multivariate random functions. This can be formulated as graphical modeling of multivariate functional data, and is the problem to be tackled in this article.
Our motivation is brain functional connectivity analysis, which is currently at the forefront of neuroscience research. Brain functional connectivity reveals the intrinsic functional architecture of the brain (Varoquaux and Craddock 2013). Accumulated evidence has shown that the brain connectivity network alters under different pathological conditions. Such alterations are associated with cognitive and behavioral functions, and hold crucial insights into the pathologies of neurological disorders (Fox and Greicius 2010). One of the common imaging tools to study functional connectivity is electroencephalography (EEG), which measures brain activities through voltage values of electrodes placed at various scalp locations over a period of recording times. It results in multivariate functional data that take the form of a location by time data matrix for each individual subject. Based on these functional data, an undirected graph is constructed to characterize brain connectivity, where nodes represent neurological elements such as different locations and regions of the brain, and links represent interactions and dependencies among the nodes (Fornito, Zalesky, and Breakspear 2013).
In this article, we propose a new nonparametric functional graphical modeling approach. Built on a recently proposed notion of functional additive conditional independence (Li and Solea 2018), we formulate functional graphical estimation as a neighborhood selection problem in a nonlinear regression framework. To do so, we first introduce a new statistical object called the functional additive regression operator (FARO), which is capable of capturing nonlinear relations without requiring distributional or linear structural assumptions, nor any conditional mean model. We next introduce an objective function that follows the spirit of least squares but whose deployment is at the Hilbert-Schmidt operator level. This is set forth from a broad synthesis of regression with linear operators as parameters. We show that FARO is the minimizer of the proposed objective function, and thus offers a versatile nonlinear measure of function-to-function dependency.
We estimate FARO by minimizing the objective function subject to a mix of L1- and L2-type penalties. The estimation algorithm is implemented efficiently via convex solvers. The functional graphical model is then recovered from the zero operators among the coefficients of the neighborhood selection.

Related Work
Our proposal is related to, but also clearly distinct from, several existing lines of work, including graphical modeling for random variables, functional graphical modeling, and linear operator-based methods. Next we discuss the connections and differences in detail.
There have been a large number of proposals for graphical modeling of random variables, most notably, sparse graph estimation under an L1 penalty or its variants (Yuan and Lin 2007; Friedman, Hastie, and Tibshirani 2008; Ravikumar et al. 2011; Cai, Liu, and Luo 2011; Fan and Lv 2016). It is also noteworthy that some solutions reformulated the problem as neighborhood selection (Meinshausen and Bühlmann 2006; Peng et al. 2009). However, most of those methods imposed a Gaussian structure, whereas some later proposals relaxed the Gaussian assumption (Liu, Lafferty, and Wasserman 2009; Liu et al. 2012; Xue and Zou 2012; Voorman, Shojaie, and Witten 2014). Besides, they all focused on graphical modeling of random variables. Moving from random variables to random functions involves a different set of techniques for both the method and the theory. Moreover, when the functions are only partially observed, as we discuss in Section 4.4, we need to estimate the random functions and account for the estimation error, which in turn leads to a slower rate of convergence, as shown in Theorem 7.

Qiao, Guo, and James (2019) recently proposed a graphical model for functional data, assuming that the random functions follow a multivariate Gaussian distribution. Specifically, let X = (X_1, ..., X_p)^T be a p-dimensional random function on an interval of time T in R. Let V = {1, ..., p} and E = {(i, j) ∈ V × V, i > j} denote the sets of nodes and edges, and G = (V, E) the corresponding undirected graph. A natural way to describe the interrelations among X is via conditional independence (CI). That is, nodes i and j are not connected in G if and only if X_i and X_j are independent conditioning on the rest of the random functions:

X_i ⊥⊥ X_j | X_{-(i,j)},    (1)

where X_{-(i,j)} denotes the set of random functions {X_k : k ∈ V\{i, j}}. The relation embedded in (1) represents a functional graphical model. As shown in Pearl, Geiger, and Verma (1989), the three-way statistical relation of conditional independence satisfies the semi-graphoid axioms, which are the key properties of the notion of separation that underpins any graph structure. For this reason, CI is commonly used as a criterion to define a graph, and built on this notion, Qiao, Guo, and James (2019) developed a functional graphical model under the Gaussian assumption.
Although an intuitive and appealing idea, using CI to characterize separation among the nodes requires the Gaussian assumption, which is parametric and can be unrealistically strong in numerous applications. Alternatively, one can resort to a fully fledged nonparametric approach. However, it inevitably involves high-dimensional kernels, and thus often suffers from the curse of dimensionality, a problem that is more severe for large networks. To strike a balance, Li, Chun, and Zhao (2014) proposed a new three-way statistical relation to characterize node separation: additive conditional independence (ACI). ACI parallels many canonical principles of CI. Like CI, ACI also satisfies the semi-graphoid axioms. But unlike CI, the estimation of ACI requires neither parametric assumptions nor high-dimensional kernels. It thus avoids the curse of dimensionality, and is able to scale to large networks. Li and Solea (2018) extended the notion of ACI to functional additive conditional independence (FACI) to construct a nonparametric graphical model for multivariate random functions. Our proposal is similar to Li and Solea (2018) in that our method is built upon FACI as well. However, our proposal is also considerably different from theirs in both methodology and theory. Methodologically, Li and Solea (2018) used the pairwise Markov property induced by FACI as the criterion to construct the graph, whereas we use the local Markov property induced by FACI as such a criterion. More importantly, they used hard thresholding to achieve sparsity, whereas we use penalized minimization to achieve this purpose, which can be carried out via a range of penalty functions. Such a difference is analogous to sparsifying a partial correlation matrix for random variables by hard thresholding versus by penalized regularization (Zhu, Shen, and Pan 2014). Theoretically, Li and Solea (2018) only considered the scenario where the number of graph nodes is fixed, and did not derive any concentration bounds. By contrast, we allow the graph size to diverge at an exponential rate, establish graph estimation consistency, and derive a set of concentration inequalities and error bounds. The new asymptotic development involves considerable challenges, as little theoretical work has been done previously to investigate function-on-function dependency involving both high dimensionality and nonlinearity. In fact, even under the setting of random variables, the concentration inequalities and error bounds obtained here appear to be the first of their kind. Furthermore, the tools and techniques we develop are sufficiently general to be applied to broader settings involving sparse estimation at the operator level.
More recently, Solea and Li (2020) proposed a copula method, and Solea and Dette (2020) proposed a nonparametric method, both for functional graphical modeling. Although we target similar problems, our solutions are completely different. Solea and Li (2020) extended the copula Gaussian idea of Liu et al. (2012) to the functional setting. Our model is more general, and can capture structures beyond the copula Gaussian distribution when using nonlinear kernels such as the radial basis function kernel. Actually, if we choose the second-layer kernel to be the inner product of the copula transformation functions, our model includes that of Solea and Li (2020) as a special case. Solea and Dette (2020) extended the joint additive model of Voorman, Shojaie, and Witten (2014). Specifically, let {α_i^k}_{k=1}^∞ be the collection of all functional principal components (fPC) of the random function X_i. The model of Solea and Dette (2020) takes the form E(α_i^k | {X_j : j ≠ i}) = Σ_{j≠i} Σ_l f_{jl}^k(α_j^l), where each f_{jl}^k : R → R is an arbitrary function. That is, the conditional mean of every fPC of X_i is an additive function of all the fPCs of the rest of the random functions, for each i ∈ V. Our model, on the other hand, is of the form E(f(X_i) | {X_j : j ≠ i}) = Σ_{j≠i} f_j(X_j), for all f ∈ H_{X_i}, where each f_j is an arbitrary function of X_j. Our model is clearly more general, and it reduces to the model of Solea and Dette (2020) when we choose the kernel of X_i to be the linear kernel and the kernel of each component in X_{-i} to be the additive kernel on the fPCs. Another crucial difference is that both Solea and Li (2020) and Solea and Dette (2020) were built on classical conditional independence, whereas our method is built on functional additive conditional independence. These differences lead to a substantially different solution.
There have also been developments of linear operator-based methods; see Li (2018) for a survey. In particular, Lee, Li, and Zhao (2016a) introduced an additive partial correlation operator to characterize ACI, which extends the partial correlation to the nonlinear setting. Our method differs from Lee, Li, and Zhao (2016a) in that we replace the hard thresholding with penalized, operator-level neighborhood selection, allow the dimension to go to infinity, and replace random variables with random functions. Lee, Li, and Zhao (2016b) applied ACI to perform variable selection in a classical regression setting, which shares a similar idea with this article, that is, regression with operators as coefficients. On the other hand, we aim at a completely different problem: the graphical modeling of multivariate random functions. At the individual regression level, Lee, Li, and Zhao (2016b) only considered the setting where p is fixed, whereas our theory allows p to diverge at an exponential rate with the sample size. At the graph estimation level, we need to consider p regressions simultaneously with a diverging p. In addition, we establish the concentration bound of our functional additive regression operator, and this type of result is not available in Lee, Li, and Zhao (2016b). There has been some work studying the concentration bounds of the empirical covariance operator, and some on the concentration bounds in a classical regression setting where the response space is the Euclidean space; see Bosq (2000) and Yao, Rosasco, and Caponnetto (2007). Our regression setting is more general than those existing results in that our response space is a reproducing kernel Hilbert space.

Organization
The rest of the article is organized as follows. We set up the nonparametric functional graphical model based on ACI and neighborhood selection in Section 2. We develop an estimation procedure at the operator level in Section 3, and derive the asymptotic theory in Section 4. We implement the estimation procedure and present an algorithm using a coordinate representation in Section 5. We conduct simulations and illustrate our method with an EEG data analysis in Section 6. We conclude the article in Section 7, and relegate all proofs and additional results to the supplementary material.

Model
In this section, we first introduce the functional additive regression operator (FARO), then develop the notions of functional additive conditional independence and functional neighborhood selection, from which we define our version of the functional graphical model. Finally, we connect neighborhood selection with FARO, which lays the foundation for the subsequent development of estimating the functional graphical model through sparse estimation of FARO.

Functional Additive Regression Operator
Let (Ω, F, P) be a probability space. For each i = 1, ..., p, let X_i denote a Hilbert space of R-valued functions defined on an interval T ⊆ R, and X the Cartesian product X_1 × ··· × X_p. Let X = (X_1, ..., X_p)^T be a p-dimensional random function defined on Ω and taking values in X.
Definition 1. Let ⟨·,·⟩_{X_i} denote the inner product in X_i. Then, for any i = 1, ..., p, a positive definite kernel κ_i : X_i × X_i → R is a symmetric function such that, for any f_1, ..., f_m ∈ X_i and c_1, ..., c_m ∈ R, Σ_{u,v=1}^m c_u c_v κ_i(f_u, f_v) ≥ 0.

The same construction was used in Li and Solea (2018). This definition is inspired by the kernel mapping in the multivariate random variable setting, except that the domain of the kernel has been changed from a Euclidean space to an infinite-dimensional functional space. There are many types of kernels, for instance, the radial basis kernel κ_i(f, g) = exp(−γ‖f − g‖²_{X_i}), and the polynomial kernel κ_i(f, g) = (⟨f, g⟩_{X_i} + 1)^d. Given the kernel κ_i, we build up a second-level Hilbert space H_{X_i}, which is the reproducing kernel Hilbert space generated by κ_i. Let H_X be the direct sum ⊕_{i=1}^p H_{X_i}, which is the linear space {f_1 + ··· + f_p : f_i ∈ H_{X_i}, i = 1, ..., p}. For any subset A of {1, ..., p}, we define H_{X_A} to be the direct sum ⊕_{i∈A} H_{X_i}. Suppose, for each i = 1, ..., p, every member of H_{X_i} is square-integrable. Then by Theorem 2.2 of Conway (1990), for each pair (i, j), there exists a unique bounded linear operator Σ_{X_iX_j} : H_{X_j} → H_{X_i} such that ⟨f, Σ_{X_iX_j} g⟩_{H_{X_i}} = cov[f(X_i), g(X_j)] for all f ∈ H_{X_i} and g ∈ H_{X_j}. We then define an operator Σ_{XX} on H_X blockwise from these entries. In other words, Σ_{XX} is a matrix of operators whose (i, j)th entry is Σ_{X_iX_j}. This operator is called the functional additive variance operator in Li and Solea (2018). Similarly, for any subvectors U, V of X, we define the functional additive covariance operator Σ_{UV} as the matrix of operators whose entries are Σ_{U_iV_j}.
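To fix ideas, the following is a minimal numerical sketch of this two-layer construction, not the authors' code: the first-level inner product ⟨f, g⟩_{X_i} is approximated by trapezoidal quadrature on a common time grid, and the second-level kernels are then evaluated from it. The grid, the bandwidth gamma, and the degree are illustrative choices.

```python
import numpy as np

def first_level_inner(f_vals, g_vals, t_grid):
    # Approximate <f, g>_{X_i} = integral of f(t) g(t) dt by the trapezoidal rule.
    prod = f_vals * g_vals
    h = np.diff(t_grid)
    return float(np.sum(h * (prod[:-1] + prod[1:]) / 2.0))

def rbf_kernel(f_vals, g_vals, t_grid, gamma=1.0):
    # Radial basis kernel: kappa_i(f, g) = exp(-gamma * ||f - g||^2_{X_i}).
    diff = f_vals - g_vals
    return np.exp(-gamma * first_level_inner(diff, diff, t_grid))

def poly_kernel(f_vals, g_vals, t_grid, degree=2):
    # Polynomial kernel: kappa_i(f, g) = (<f, g>_{X_i} + 1)^degree.
    return (first_level_inner(f_vals, g_vals, t_grid) + 1.0) ** degree

# Two sample paths on [0, 1], evaluated on a common grid.
t = np.linspace(0.0, 1.0, 50)
f, g = np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)
print(rbf_kernel(f, g, t, gamma=0.5), poly_kernel(f, g, t))
```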
We next define the Moore-Penrose inverse of an operator. For a linear operator T, let ker(T) denote the kernel space (or null space) of T; that is, ker(T) = T^{-1}({0}) = {f : Tf = 0}. Let ran(T) denote the range of T, and cl ran(T) the closure of ran(T). Note that Σ_{XX} is not invertible if ker(Σ_{XX}) ≠ {0}. However, if we let Σ_{XX}|_{ker(Σ_{XX})^⊥} be Σ_{XX} restricted to ker(Σ_{XX})^⊥, then the restricted operator is invertible. We call the inverse of this restricted operator the Moore-Penrose inverse of Σ_{XX}, and denote it by Σ†_{XX}. Since Σ_{XX} is a self-adjoint operator, we have ker(Σ_{XX})^⊥ = cl ran(Σ_{XX}). Thus, Σ†_{XX} is a mapping from ran(Σ_{XX}) to cl ran(Σ_{XX}) that maps a member y of ran(Σ_{XX}) to the unique member x ∈ cl ran(Σ_{XX}) that satisfies Σ_{XX} x = y.
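As a finite-dimensional sketch (again illustrative rather than the authors' code), the Moore-Penrose inverse of a self-adjoint positive semidefinite matrix can be formed by inverting the eigenvalues on ker(Σ)^⊥ and setting the rest to zero; eigenvalues below a tolerance are treated as part of the null space.

```python
import numpy as np

def moore_penrose(sigma, tol=1e-10):
    # Invert sigma on ker(sigma)^perp; act as zero on ker(sigma).
    evals, evecs = np.linalg.eigh(sigma)                 # sigma symmetric PSD
    safe = np.where(evals > tol, evals, 1.0)             # avoid division by ~0
    inv = np.where(evals > tol, 1.0 / safe, 0.0)
    return (evecs * inv) @ evecs.T

# Sanity check: for y in ran(sigma), x = pinv(sigma) y solves sigma x = y.
rng = np.random.default_rng(0)
a = rng.standard_normal((5, 3))
sigma = a @ a.T                                          # rank 3: nontrivial kernel
y = sigma @ rng.standard_normal(5)                       # y lies in ran(sigma)
x = moore_penrose(sigma) @ y
print(np.allclose(sigma @ x, y))                         # True
```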
Assumption 1. Suppose ran(Σ_{UV}) ⊆ ran(Σ_{UU}) for all disjoint subvectors U, V of X. Moreover, the linear operator B⁰_{UV} = Σ†_{UU} Σ_{UV} is bounded.

The operator B⁰_{UV} is well-defined when ran(Σ_{UV}) ⊆ ran(Σ_{UU}), which is ensured by Assumption 1. Moreover, the condition ran(Σ_{UV}) ⊆ ran(Σ_{UU}) means the operator Σ_{UV} sends the functions in H_V to the low-frequency part of Σ_{UU}, which is a type of collective smoothness in the relation between U and V. This condition is not directly related to the dimension of V, as H_V consists of scalar-valued functions of V. Nevertheless, when the dimension of V increases, we expect the class H_V to be more complex, and the condition ran(Σ_{UV}) ⊆ ran(Σ_{UU}) would impose a stronger smoothness on the relation between U and V. This agrees with the spirit of typical nonparametric estimation: the more complex the function is, the stronger the penalty should be. Later in Section 2.2, we discuss a concrete setting under which Assumption 1 is satisfied.
Definition 2. Let U, V be any subvectors of X, and suppose Assumption 1 holds. Then we call the operator B⁰_{UV} = Σ†_{UU} Σ_{UV} the functional additive regression operator (FARO) from V to U.

This concept of FARO plays a central role in our proposal. It can be viewed as the functional extension of the additive regression operator, which was developed in Lee, Li, and Zhao (2016b) for nonparametric variable selection. The term "regression operator" is motivated by the similarity of Σ†_{UU} Σ_{UV} to the regression coefficient matrix in multivariate linear regression.

Functional Additive Conditional Independence
We first give an alternative, but equivalent, definition of FACI to that in Li and Solea (2018).
Definition 3. Let U, V, W be subvectors of X, and suppose Assumption 1 holds. We say that the random elements U and V are additively conditionally independent given the random element W, denoted by U ⊥⊥ V | W, if the U-component of the FARO from V to (U, W) is the zero operator; that is, B⁰_{(U,W)V} = (0, B⁰_{WV}).

Li and Solea (2018) defined FACI using orthogonality between subspaces. The above alternative definition directly involves FARO, which helps greatly to simplify the subsequent theoretical and computational developments.
Next, we discuss how different choices of kernels can lead to different FACI relations. We first show that a stronger FACI is implied by larger RKHSs.
Proposition 1. Suppose U, V are subvectors of X with the corresponding RKHSs H¹ and H², where H¹ ⊆ H². Then FACI with respect to H² implies FACI with respect to H¹.

The next result shows different relations between FACI and CI under the functional copula Gaussian model. Specifically, suppose X_i = E(X_i) + Σ_{k=1}^∞ α_{ik} φ_{ik} is the Karhunen-Loève expansion of X_i, for each i ∈ V. We say X follows a copula Gaussian distribution if there exists a collection of monotone functions {g_{ik}} such that {g_{ik}(α_{ik}) : i ∈ V, k ≥ 1} is jointly Gaussian; see Li and Solea (2018) for such definitions.
Proposition 2. Suppose X follows a copula Gaussian distribution with the sequences of copula transformation functions {g_{ik}}.

The proof of this proposition follows Li and Solea (2018, Theorem 1) and is omitted. Proposition 2(a) shows that FACI implies CI, but not vice versa, when the kernel is characteristic, for example, the radial basis kernel, while Proposition 2(b) shows that FACI and CI are equivalent when the kernel induces the same space as that spanned by the copula transformation functions. Both Propositions 1 and 2 suggest that it is beneficial to choose a kernel that is dense in the ambient L²-space. In practice, we suggest choosing a kernel function that satisfies this condition, such as the radial basis kernel function.
Next, to further understand the probabilistic mechanism underlying FACI, as well as its relation with CI, we introduce a new distribution, the additive Gaussian distribution, under which FACI and CI are equivalent. This distribution is much more general than the Gaussian distribution. More importantly, it provides a concrete probability model that generates the FACI relation for multivariate random functions.
Suppose the p-variate random function X = (X_1, ..., X_p)^T satisfies: (a) X_i is a finite-dimensional Hilbert space spanned by an orthonormal basis {η_1, ..., η_m} with inner product ⟨·,·⟩_{X_i}, and (b) U = (U_1^T, ..., U_p^T)^T, where U_i = (⟨X_i, η_1⟩_{X_i}, ..., ⟨X_i, η_m⟩_{X_i})^T is the coordinate vector of X_i. We say that U follows an additive Gaussian distribution if there exists c > 0 such that the density of U satisfies (2). We write it as U ∼ AN(μ_α, Σ_α). Note that AN(μ_1, Σ_1) is the p-variate Gaussian distribution; that is, the Gaussian distribution is the first-order additive Gaussian distribution. If U ∼ AN(μ_α, Σ_α), then we call X an additive Gaussian random function in X. Since there is a one-to-one correspondence between X and U, we also write X ∼ AN(μ_α, Σ_α).
Next, we give a concrete example to justify the existence of the density in (2).
In Section S4 of the supplementary material, we show the existence of the density f in Example 1. We can also extend the same construction to the multivariate case of X = (X_1, ..., X_p)^T.
The construction of X is in line with the original motivation of ACI in Li, Chun, and Zhao (2014), and reveals the probabilistic mechanism that generalizes partial correlation and Gaussian-like behavior to nonlinear features. When X ∼ AN(μ_α, Σ_α), FACI and CI are equivalent, as shown in the next theorem.

Theorem 1. Suppose X ∼ AN(μ_α, Σ_α) and Assumption 1 holds. Then, for any subvectors U, V, W of X, U and V are additively conditionally independent given W if and only if U and V are conditionally independent given W.

We note that, when X is an additive Gaussian random function, Assumption 1 holds, because all operators involved are finite-rank operators.

Functional Neighborhood Selection
Based on the definition of FACI, we now define the neighborhood N_i of node i = 1, ..., p.

Definition 4. Suppose Assumption 1 holds. A subset N_i of V is called the neighborhood of node i with respect to FACI, if (a) X_i ⊥⊥ {X_j : j ∈ V\({i} ∪ N_i)} | X_{N_i}, and (b) no proper subset of N_i satisfies (a).

By definition, N_i is the smallest subset of V\{i} such that, conditioning on X_{N_i}, X_i is additively independent of the rest of the random functions. Definition 4 essentially gives a formal definition of our version of the functional graphical model. That is, we aim to find the graph G, a functional additive semi-graphoid, such that X is associated with G with respect to FACI. Li and Solea (2018) introduced their version of the functional additive semi-graphoid model based on the following equivalence:

(i, j) ∉ E if and only if X_i ⊥⊥ X_j | X_{-(i,j)}.    (3)

It is interesting to note that the relations characterized in Definition 4 and Equation (3) are closely related but not identical. Specifically, they correspond to different Markov properties for undirected graphs. Equation (3) requires the pairwise additive Markov property to hold, meaning that if two nodes are not connected, they are additively conditionally independent given the rest of the nodes. Definition 4 relies on the local additive Markov property, which indicates that, given its neighbors, a node is additively conditionally independent of every nonadjacent node. Lauritzen (1996, chap. 3) has shown that the local Markov condition is generally stronger than the pairwise Markov condition. The next proposition provides a parallel relation between the local and pairwise additive Markov conditions.

Proposition 3. If X satisfies the local additive Markov property with respect to G, then X satisfies the pairwise additive Markov property with respect to G.
We next give a counterexample to show that the local additive Markov property is not implied by the pairwise additive Markov property.
Example 2. Let X = (X_1, X_2, X_3)^T ∈ X be a 3-variate random function satisfying P(X_1 = X_2 = X_3) = 1, and let G = (V, E), where V = {1, 2, 3} and E is an empty set. Then X satisfies Equation (3) with respect to G: for example, X_1 ⊥⊥ X_2 | X_3 because X_1 and X_2 are both fixed given X_3. However, X does not satisfy Definition 4 with respect to G: the neighborhood of node 1 cannot be empty, because X_1 is fully determined by X_2 and X_3.

Although the pairwise additive Markov property does not always imply the local additive Markov property, we next show that this relation can still hold under some conditions.

Proposition 4. Suppose ker(Σ_{XX}) = {0}, and the correlation operator C_{XX} is bounded below by cI for some constant c > 0. Then the pairwise additive Markov property implies the local additive Markov property.

Next we show that the interdependency defined by functional neighborhood selection in Definition 4 can be fully captured by FARO. The next theorem involves the regression operator B⁰_{X_{-i}X_i}. Since this is an operator from H_{X_i} to H_{X_{-i}}, it can be written as a vector of operators (B_1, ..., B_{p-1}).
Theorem 2. Suppose Assumption 1 holds. Then, for any i ∈ V and j ∈ V\{i}, j ∈ N_i if and only if B_j ≠ 0.

Theorem 2 implies that one can recover the graphical model through functional neighborhood selection, by estimating the set of active predictors in regressions with X_i as the response and X_{V\{i}} as the predictors. Also, by the rule of operator inversion, B⁰_{X_{-i}X_i} admits an explicit blockwise expression in terms of the covariance operators among X_{-i} and X_i.

Estimation
In this section, we first develop a population-level objective function whose minimizer is FARO. We then add a mixture of L1 and L2 penalties to the sample-level objective function to obtain a sparse estimate of FARO, and then the functional graphical model.

The Objective Function
We first introduce some notation. Given two Hilbert spaces H and K, let B(H, K) denote the collection of all bounded linear operators from H to K, B_1(H, K) the collection of all trace-class operators from H to K, and B_2(H, K) the collection of all Hilbert-Schmidt operators from H to K. It can be shown that B_1(H, K) ⊆ B_2(H, K) ⊆ B(H, K). The space B_2(H, K) is a Hilbert space, and we use ⟨·,·⟩_HS and ‖·‖_HS to denote its inner product and norm; B(H, K) is a Banach space with respect to the operator norm, denoted by ‖·‖. For convenience, B_1(H), B_2(H), and B(H) are used whenever H = K.
In the classical setting, the neighborhood of each node can be determined by the nonzero linear regression coefficients. To capture nonlinear relations, we extend linear regression to regression at the operator level. Toward that end, we define the following objective function, for B ∈ B_2(H_{X_i}, H_{X_{-i}}):

L_i(B) = tr(Σ_{X_iX_i} − Σ_{X_iX_{-i}}B − B*Σ_{X_{-i}X_i} + B*Σ_{X_{-i}X_{-i}}B),    (4)

where B* denotes the adjoint of B. This objective function is motivated by the least-squares idea, which minimizes var(Y − β^T X) over β. As we show in Theorem 3 below, minimizing (4) is equivalent to minimizing

Σ_{a=1}^∞ var[f_a(X_i) − (Bf_a)(X_{-i})],

where {f_a}_{a=1}^∞ is an orthonormal basis of H_{X_i}. As such, (4) is a generalization of least squares by regressing a class of functions of X_i on a class of functions of X_{-i}. We next show that B⁰_{X_{-i}X_i} is the minimizer of L_i(B). We need an additional condition.
Assumption 2. There exists a constant m > 0 such that κ_i(x, x) ≤ m for all x ∈ X_i and i = 1, ..., p.

This condition requires the kernel κ_i to be uniformly bounded, which is satisfied by many commonly used kernels such as the radial basis kernel and the Laplacian kernel. Without loss of generality, we take m = 1 in Assumption 2 for all subsequent analyses.

Theorem 3. Suppose Assumptions 1 and 2 hold. Then B⁰_{X_{-i}X_i} is the minimizer of L_i(B) over B ∈ B_2(H_{X_i}, H_{X_{-i}}).

Lee, Li, and Zhao (2016b, Theorem 2) showed a similar result. However, we only require bounded kernels, while Lee, Li, and Zhao (2016b) required the covariance operator to be trace-class.
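As a sketch of why this holds (assuming the trace form of (4) above, Assumption 1, and writing B⁰ = B⁰_{X_{-i}X_i} so that Σ_{X_{-i}X_{-i}}B⁰ = Σ_{X_{-i}X_i}), one can complete the square at the operator level:

```latex
\begin{aligned}
L_i(B) &= \operatorname{tr}\bigl(\Sigma_{X_iX_i} - \Sigma_{X_iX_{-i}}B
          - B^{*}\Sigma_{X_{-i}X_i} + B^{*}\Sigma_{X_{-i}X_{-i}}B\bigr) \\
       &= \operatorname{tr}\bigl(\Sigma_{X_iX_i} - \Sigma_{X_iX_{-i}}B^{0}\bigr)
          + \bigl\lVert \Sigma_{X_{-i}X_{-i}}^{1/2}\,(B - B^{0}) \bigr\rVert_{\mathrm{HS}}^{2}.
\end{aligned}
```

The first term does not involve B, and the second term vanishes at B = B⁰, so B⁰ minimizes L_i.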

Sample-Level Regularized Estimation
We next derive the sample version of L_i(B), and further introduce a mix of L1 and L2 penalties. Let {(X_1^k, ..., X_p^k)^T : k = 1, ..., n} be independent copies of (X_1, ..., X_p)^T. We estimate the mean element μ_{X_i} = E[κ_i(·, X_i)] by μ̂_{X_i} = n^{-1} Σ_{k=1}^n κ_i(·, X_i^k), and the covariance operator by

Σ̂_{X_iX_j} = n^{-1} Σ_{k=1}^n [κ_i(·, X_i^k) − μ̂_{X_i}] ⊗ [κ_j(·, X_j^k) − μ̂_{X_j}],

where ⊗ denotes the tensor product. We then use Σ̂_{X_iX_j} to build up Σ̂_{UV} for any subvectors U, V of X. With the covariance operators in (4) substituted by their empirical counterparts, the sample version of L_i(B) becomes

L̂_i(B) = tr(Σ̂_{X_iX_i} − Σ̂_{X_iX_{-i}}B − B*Σ̂_{X_{-i}X_i} + B*Σ̂_{X_{-i}X_{-i}}B).    (5)

To encourage sparsity and enhance prediction, we introduce two penalty terms on L̂_i(B):

PL_i(B) = L̂_i(B) + λ_n ‖B‖²_HS + λ_n Σ_{j∈V\{i}} ‖(B)_j‖_HS.    (6)

The first penalty term ‖B‖²_HS is the squared Hilbert-Schmidt norm and is similar to an L2 penalty, while the second, Σ_{j∈V\{i}} ‖(B)_j‖_HS, resembles an L1 penalty. This type of mixture of L1 and L2 penalties is often employed in high-dimensional regressions (Zou and Hastie 2005; Lee, Li, and Zhao 2016b). We use it here to achieve the desired asymptotic properties. Also, to simplify tuning and theoretical development, we impose the same parameter λ_n for both penalty terms, but in principle they can be different. Let B̂_i = argmin{PL_i(B) : B ∈ B_2(H_{X_i}, H_{X_{-i}})}. We then use N̂_i = {j ∈ V\{i} : ‖(B̂_i)_j‖_HS ≠ 0} to estimate the neighborhood of i. Subsequently, we use N̂_i to estimate E via

Ê_OR = {(i, j) : j ∈ N̂_i or i ∈ N̂_j},  Ê_AND = {(i, j) : j ∈ N̂_i and i ∈ N̂_j}.    (7)

These are two slightly different ways to construct an estimate of E in (7), because it may happen that j ∈ N̂_i but i ∉ N̂_j, and vice versa. However, as we show later in Section 4.3, this type of discrepancy, and thus the difference between Ê_OR and Ê_AND, is asymptotically negligible. In our implementation in Section 5.2, we choose Ê_OR as our final estimate of the functional graphical model.
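In finite coordinates (see Section 5), the criterion (6) becomes a least-squares problem over blocks of matrices with a ridge plus group-lasso penalty, which a disciplined convex programming solver handles directly. The sketch below, with illustrative dimensions and a hypothetical data layout (each node reduced to d basis coordinates per subject), conveys the structure; it is not the authors' implementation.

```python
import numpy as np
import cvxpy as cp

n, p, d, lam = 100, 5, 3, 0.1
rng = np.random.default_rng(1)
Psi = [rng.standard_normal((n, d)) for _ in range(p)]   # node-wise coordinates

i = 0                                     # regress node i on all other nodes
Y = Psi[i]
others = [j for j in range(p) if j != i]
A = {j: cp.Variable((d, d)) for j in others}            # blocks playing (B)_j

resid = Y - sum(Psi[j] @ A[j] for j in others)
loss = cp.sum_squares(resid) / n
pen = lam * (sum(cp.sum_squares(A[j]) for j in others)     # L2: squared HS norm
             + sum(cp.norm(A[j], "fro") for j in others))  # group L1 over nodes
cp.Problem(cp.Minimize(loss + pen)).solve()

# Estimated neighborhood: nodes whose block is (numerically) nonzero.
N_hat = [j for j in others if np.linalg.norm(A[j].value) > 1e-6]
print(N_hat)
```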

Theory
In this section, we derive the asymptotic properties of the sparse FARO estimator B̂_i. We then derive the consistency of the subsequent neighborhood estimator N̂_i and the graph estimators Ê_OR and Ê_AND based on B̂_i. We achieve our goal in four steps. First, we derive a concentration bound on the sample covariance operator. Second, we introduce an intermediate estimator, B̂⁰_i, and derive the concentration bound for the distance between the intermediate estimator and the true FARO. Third, we show that we can construct a minimizer of the original objective function (6) based on B̂⁰_i with high probability. Combined with the second step, this in effect establishes the consistency and convergence rate of the minimizer B̂_i of (6). Finally, we establish the desired neighborhood selection and graph estimation consistency built on B̂_i. We mostly assume that the trajectory of X_i is fully observed on t ∈ T, for i = 1, ..., p. We briefly discuss the scenario where X_i is partially observed in Section 4.4. In the interest of space, we relegate some technical results to the supplementary material.

A Key Lemma and an Intermediate Estimator
We begin with a lemma that establishes a key concentration inequality for the norm ˆ X A X B − X A X B HS .Let |A| denote the cardinality of a set A.
Lemma 1. Suppose Assumption 2 holds. Then there exist constants c_1, c_2 > 0 such that, for any δ > 0, P(‖Σ̂_{X_AX_B} − Σ_{X_AX_B}‖_HS > δ) ≤ c_1|A||B| exp(−c_2nδ²).

Since both |A| and |B| can be arbitrary, this lemma extends the bound for a high-dimensional covariance matrix to the operator level. Since the estimation of covariance plays a crucial role in many topics in high-dimensional statistics, we expect this result to be useful in other contexts involving high-dimensional matrices of linear operators. We note that Bosq (2000) has also studied the concentration bound of the empirical covariance operator for functional data. However, our result extends his to both nonlinear and high-dimensional settings.
We next derive the concentration bound for the FARO estimator B̂_i. Toward that end, we introduce an intermediate estimator. Consider the following objective function, which is the analogue of (6) restricted to the true neighborhood:

B̂⁰_i = argmin{PL_i(B) : B ∈ B_2(H_{X_i}, H_{X_{N_i}})}.    (8)

Note that the objective function (8) differs from (6) in that (8) treats N_1, ..., N_p as known. Besides, the dimension of the minimizer B̂⁰_i of (8) is |N_i|, whereas the dimension of the minimizer B̂_i of (6) is (p − 1). We first establish the concentration bound for B̂⁰_i. Let C_{N_iN_i} be the correlation operator from H_{X_{N_i}} to itself. The existence and uniqueness of the correlation operator C_{N_iN_i} was established by Baker (1973). We require another two assumptions.
Assumption 3. There exists a constant c > 0 such that, for all i = 1, ..., p, C_{N_iN_i} ⪰ cI.

Assumption 3 holds for all i ∈ V if there exists c > 0 such that the joint correlation operator C_{VV} is bounded below by cI. This is a fairly general condition, and it holds when C_{VV} is invertible and all its off-diagonal elements C_{ij}, (i, j) ∈ V × V with i ≠ j, are compact; see also Solea and Li (2020, Proposition 2). Fukumizu, Bach, and Gretton (2007) studied the conditions for compactness of these operators, and showed that the correlation operator is compact when the mean square contingency of the associated random elements is finite, which in general requires that there cannot be too strong a dependency between the random elements. We also note that Zhao and Yu (2006), Wainwright (2009), and Ravikumar et al. (2009) all imposed a similar condition in the linear model or the generalized additive model settings to derive the consistency of the LASSO. Moreover, under the additive Gaussian distribution setting as discussed in Section 2.2, Assumption 3 holds, because all the pairwise correlation operators have finite ranks.

Assumption 4. For i = 1, ..., p, there exists an operator T_i ∈ B(H_{X_i}, H_{X_{N_i}}) such that B⁰_{X_{N_i}X_i} = Σ_{X_{N_i}X_{N_i}} T_i.

Under the additive Gaussian distribution setting, this is equivalent to a condition, (9), stated in terms of the blocks of Σ_α, where ‖·‖_F is the Frobenius norm. Condition (9) is essentially a form of smoothness in the relation between X_i and its neighborhood X_{N_i}. To see this, note that (9) implies that ‖{(Σ_α)_{j,k} : j, k ∈ N_i}^{-1}{(Σ_α)_{j,i} : j ∈ N_i}‖_F is uniformly bounded, which means the Frobenius norm of the regression coefficient for regressing Q_α(U_i) on {Q_α(U_j) : j ∈ N_i} is uniformly bounded. Li and Song (2017) used a similar condition and referred to it as "collective smoothness" in the context of nonlinear dimension reduction. The next proposition shows that, if X follows an additive Gaussian distribution with some additional conditions, then Assumption 4 is satisfied for each i ∈ V.

Now we are ready to derive the concentration bound and the convergence rate of the intermediate estimator B̂⁰_i. Hereafter, for two positive sequences {a_n} and {b_n}, let a_n ⪯ b_n represent a_n = O(b_n); let a_n ≺ b_n represent a_n = o(b_n); and let a_n ∧ b_n = a_n and a_n ∨ b_n = b_n if a_n ⪯ b_n. Analogous notation is used when a third sequence {c_n} is ordered with {a_n} and {b_n}.

Consistency and Convergence Rate of the FARO Estimator
We next construct an estimator based on the intermediate estimator B̂⁰_i, and show that, with high probability, it minimizes the objective function PL_i(B) in (6). Coupled with Theorem 4, this in effect establishes the concentration bound and convergence rate of the FARO estimator B̂_i of (6). Specifically, we construct an operator in B_2(H_{X_i}, H_{X_{-i}}) whose N_i-component is B̂⁰_i, and the rest is a (p − 1 − |N_i|)-dimensional zero operator 0. To avoid overly complicated notation, we denote this operator by (B̂⁰_i, 0), keeping in mind that B̂⁰_i does not have to occupy the first |N_i| positions. We derive a series of inequalities under which (B̂⁰_i, 0) satisfies the Karush-Kuhn-Tucker (KKT) conditions with high probability. This is equivalent to saying that (B̂⁰_i, 0) and the FARO estimator B̂_i are asymptotically equivalent. In the interest of space, we give the full details in Section S3 of the supplementary material. We also introduce another assumption.

Assumption 5. There exists 0 < η ≤ 1 such that, for any j ∈ V\(N_i ∪ {i}) and i = 1, ..., p, ‖B⁰_{X_{N_i}X_j}‖_HS ≤ 1 − η.

Assumption 5 can be viewed as a generalized version of the irrepresentable condition usually imposed in the classical regression setting to establish the consistency of the LASSO (Zhao and Yu 2006; Wainwright 2009). Under the additive Gaussian distribution, Assumption 5 simply means that the Frobenius norm of the regression coefficient for regressing Q_α(X_j), the random elements at the nonneighboring nodes, on {Q_α(X_k) : k ∈ N_i}, the random elements at the neighboring nodes, is uniformly bounded by 1 − η. Since the Q_α(X_j) are random vectors, this condition is similar in spirit to its counterpart in the random variable case; that is, it rules out strong dependency between the nonneighboring and neighboring nodes. Besides, Qiao, Guo, and James (2019, Condition 5) used a similar condition for the Gaussian functional graphical model. Our Assumption 5 can be seen as a nonlinear extension of their condition. We also give an example under which Assumption 5 is satisfied for each i ∈ V.
Example 3. Suppose X = (X_1, X_2, X_3)^T follows an additive Gaussian distribution, where each X_i is the linear span of an orthonormal basis {η_1, ..., η_m}. The proof for this example is given in the supplementary material.

Similar to the classical neighborhood selection, when Assumption 5 does not hold, we still expect the support of our functional neighborhood selection to recover the true graph to a certain extent, in the sense that the probability of erroneous selection converges to a small positive constant instead of zero. Moreover, in the usual regression setting, alternative regularization methods, such as the adaptive LASSO and SCAD, may be employed to relax the irrepresentable condition. We expect that similar modifications can be made to FARO, so that Assumption 5 can be removed.
The next corollary provides the convergence rate of B̂_i, and the connection between this rate and λ_n and p.
Corollary 1. Suppose Assumptions 1-5 hold, |N_i| does not depend on n, and (log p/n)^{1/3} ≺ λ_n ≺ 1. Then ‖B̂_i − B⁰_{X_{-i}X_i}‖_HS = O_P(λ_n^{1/2}) for each i ∈ V.

We remark that the convergence rate of FARO in terms of the Hilbert-Schmidt norm depends on p through λ_n, whose order of magnitude can be arbitrarily close to (log p/n)^{1/3}. In classical linear regression, the convergence rate of the estimated regression coefficient from the LASSO and Dantzig estimators in terms of the L2 norm depends on (log p/n)^{1/2} (Bickel, Ritov, and Tsybakov 2009; Fan and Lv 2010). This discrepancy is somewhat expected, though, as our setting is more general and involves both nonlinearity and high-dimensional functional data.

Neighborhood Selection and Graph Estimation Consistency
We now establish the asymptotics of neighborhood selection.
The next theorem provides an upper bound for the probability of incorrectly selecting the neighbors.
Theorem 5. Suppose Assumptions 1-5 hold and (log p/n)^{1/3} ≺ λ_n ≺ 1. Then max_{i∈V} P(N̂_i ≠ N_i) → 0 as n → ∞.

In the classical regression setting, the sparsistency of the LASSO has been studied in Zhao and Yu (2006) and Wainwright (2009). Theorem 5 establishes sparsistency in a much more general setting where both the response and the predictors are random functions, and no structural assumptions such as linearity are imposed on their relationships.
The next theorem shows that, using (7), E can be correctly recovered with probability tending to one. Moreover, the difference between Ê_OR and Ê_AND is asymptotically negligible. Its proof follows immediately from Theorem 5, and is omitted.

Theorem 6. Suppose the same conditions of Theorem 5 hold. Moreover, suppose |N_1|, ..., |N_p| do not depend on n, and Ê can be either Ê_OR or Ê_AND defined in (7). Then P(Ê = E) → 1 as n → ∞.

We make some remarks regarding Corollary 1 and Theorem 6. First, by Corollary 1, the convergence rate of our FARO estimator depends on λ_n, whose order can be made arbitrarily close to (log p/n)^{1/3}. Li and Solea (2018, Theorem 4) showed that the rate of convergence of their FACI estimator was n^{-1/6}. However, they treated the number of functions p as fixed, while we allow p to grow with n at an exponential order. If we also treat p as fixed, then our rate can be made arbitrarily close to n^{-1/6}. Second, in the same vein, our consistency of graph estimation in Theorem 6 holds while allowing the graph size to diverge at an exponential order, whereas Li and Solea (2018) treated p as fixed. In fact, their estimator did not take advantage of the sparsity of the graph, and needed to estimate all the off-diagonal elements of their precision operator. Since the cardinality of the off-diagonal elements grows in the order of p², this means p can grow only at a polynomial rate of n, not at an exponential rate as in our result. Finally, we note that, in both Corollary 1 and Theorem 6, we assume the number of neighbors of each node to be fixed. This condition can be relaxed by carefully choosing the rates of p, |N_i|, ‖B⁰_i‖_HS, and b⁰_i, which we leave as potential future research.

Consistency for Partially Observed Random Functions
We have so far assumed that X_i is fully observed. Next, we briefly study the scenario where the function is only partially observed. See Wang, Chiou, and Muller (2016) for discussions on the different schedules by which functional data are collected. Note that Theorems 5 and 6 only rely on the concentration bound of the sample covariance operator in Lemma 1. In the following, we allow the convergence rate under a partially observed schedule to be slower than that under a fully observed schedule. However, we do not pursue any specific measurement schedule or smoothing setting, to avoid digressing too much from the main theme.
Suppose X̃ = (X̃_1, ..., X̃_p)^T is an estimator of X. We then estimate the sample covariance operator by Σ̃_{X_jX_i}, obtained by replacing X with X̃ in Σ̂_{X_jX_i}, for i, j = 1, ..., p. For any subvectors U, V of X, let Σ̃_{UV} be the matrix of operators whose elements are composed of Σ̃_{X_jX_i} with X_j ∈ U and X_i ∈ V. Then we compute the new penalized estimator of B by B̃_i = argmin{P̃L_i(B) : B ∈ B_2(H_{X_i}, H_{X_{-i}})}, modifying (5) and (6) accordingly. Finally, we estimate the neighborhood and the graph by Ñ_i and Ẽ, defined as in Section 3.2 with B̃_i in place of B̂_i. The next theorem shows the consistency of our method under a partially observed schedule, which is a generalization of Theorems 5 and 6. Its proof follows immediately from that of Theorem 5 and is thus omitted.

Implementation
In this section, we introduce a coordinate system to implement the estimator developed at the operator level in Section 3.

Coordinate Representation
We first develop the coordinate systems for X_i, X, H_{X_i}, and H_X. For a generic finite-dimensional Hilbert space H spanned by B = {b_1, ..., b_m}, any x ∈ H can be written as Σ_{u=1}^m α_u b_u. We call the vector (α_1, ..., α_m)^T the coordinate of x relative to the spanning system B, and write it as [x]_B. For any x, y ∈ H, we then have ⟨x, y⟩ = [x]_B^T G_B [y]_B, where G_B = [⟨b_u, b_v⟩]_{u,v=1}^m is the Gram matrix of B.
Let (X^1, ..., X^n) denote iid samples from X of size n, and suppose each X_i^k is observed on a finite subset T_k = {t_{k1}, ..., t_{km_k}} of T, where m_k is the number of time points observed for subject k. Let (τ_1, ..., τ_M) = ∪_{k=1}^n T_k denote all the unique time points ordered from the smallest to the largest, where M is its cardinality. Let κ_T : T × T → R be a positive definite kernel. We consider the reproducing kernel Hilbert space 𝒯_n = span{κ_T(·, τ_1), ..., κ_T(·, τ_M)} ≡ span{B_u^n : u = 1, ..., M}, with the inner product determined by κ_T, and represent each observed curve X_i^k in 𝒯_n by the coordinate [X_i^k] obtained from a ridge-regression-type fit with tuning parameter ε_{k_T}. Note that the coordinates are not unique when the spanning system is linearly dependent. Nevertheless, both the inner product and the norm it induces are unique. This is because the inner product, like eigenvalues and eigenfunctions, is coordinate-free. We then compute the inner product between X_i^k and X_i^ℓ by

⟨X_i^k, X_i^ℓ⟩_n = [X_i^k]^T K_T^{k,ℓ} [X_i^ℓ],    (11)

where K_T^{k,ℓ} = [κ_T(t_{ku}, t_{ℓv})]_{u=1,...,m_k; v=1,...,m_ℓ}, for k, ℓ ∈ {1, ..., n}. Having constructed the X_i^k, we next proceed to the construction of the sample version of H_{X_i}, which we denote by H_{X_i}^n. Letting κ_i(·,·) be the second-level kernel, which can be computed via Definition 1 and Equation (11), we define φ_i^k = κ_i(·, X_i^k), for k = 1, ..., n. Let H_{X_i}^n be the RKHS spanned by {φ_i^k : k = 1, ..., n}, let K_i be the Gram matrix [κ_i(X_i^k, X_i^ℓ)]_{k,ℓ=1}^n, and let G_i = QK_iQ be its centered version, where Q = I_n − n^{-1}1_n1_n^T, with I_n and 1_n being the identity matrix and the n-dimensional vector (1, ..., 1)^T.
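The following sketch illustrates this first-level step in the spirit of Equation (11): each curve is fitted into the span of {κ_T(·, t_u)} by a ridge-type solve, and inner products between two curves, possibly observed on different grids, are computed from the cross-kernel matrix. The Brownian-motion kernel, the ridge rule, and all names are illustrative assumptions.

```python
import numpy as np

def kernel_T(s, t):
    # Brownian-motion covariance kernel: kappa_T(s, t) = min(s, t).
    return np.minimum.outer(s, t)

def curve_coordinates(x_vals, t_obs, c_T=0.04):
    # Coordinates c solving (K + eps * I) c = x, a ridge-regression-type fit;
    # eps is scaled by the largest eigenvalue of K (cf. Step 3 below).
    K = kernel_T(t_obs, t_obs)
    eps = c_T * np.linalg.eigvalsh(K).max()
    return np.linalg.solve(K + eps * np.eye(len(t_obs)), x_vals)

def inner_product(c_k, t_k, c_l, t_l):
    # <X^k, X^l>_n = c_k' K^{k,l} c_l with K^{k,l} = [kappa_T(t_ku, t_lv)].
    return float(c_k @ kernel_T(t_k, t_l) @ c_l)

# Two curves observed on different random grids (the unbalanced case).
rng = np.random.default_rng(2)
t1, t2 = np.sort(rng.uniform(size=(2, 10)), axis=1)
x1, x2 = np.sin(2 * np.pi * t1), np.cos(2 * np.pi * t2)
c1, c2 = curve_coordinates(x1, t1), curve_coordinates(x2, t2)
print(inner_product(c1, t1, c2, t2))
```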
It is often the case that the important features of a kernel are concentrated on the leading eigenfunctions (Lee and Huang 2007; Chen et al. 2010). So, without losing much efficiency, we may use the leading eigenfunctions to construct the empirical RKHS, which can bring substantial savings in computing time. Suppose G_i has the eigen-decomposition

G_i = V_i D_i V_i^T + Ṽ_i D̃_i Ṽ_i^T,    (12)

where V_iD_iV_i^T and Ṽ_iD̃_iṼ_i^T correspond to the first n_i and the last n − n_i eigenvalues of G_i. Let (ψ_i^1, ..., ψ_i^{n_i})^T = D_i^{-1/2}V_i^T(φ_i^1, ..., φ_i^n)^T. We then use {ψ_i^1, ..., ψ_i^{n_i}} as a basis of the reduced space H_{X_i}^{n_i}, by which we replace H_{X_i}^n to save computing time. We lose no information as long as ran(Σ̂_{X_iX_{-i}}) ⊆ span{ψ_i^1, ..., ψ_i^{n_i}}. In the following, we denote (ψ_i^1, ..., ψ_i^{n_i})^T by ψ_i. Using the coordinate representations and the reduced space derived above, we next provide the numerical procedure to implement the constrained optimization problem in Section 3.2. Let B_i be an operator in B_2(H_{X_i}^{n_i}, ⊕_{j≠i} H_{X_j}^{n_j}).
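A short sketch of this reduction step, with illustrative names: center the Gram matrix and keep the n_i leading eigenpairs, with n_i chosen of order n^{1/5} as in Step 5 below.

```python
import numpy as np

def reduced_basis(K, n_keep):
    # Center the Gram matrix K, then return the top-n_keep eigenpairs of G.
    n = K.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n
    G = Q @ K @ Q                                  # centered Gram matrix G_i
    evals, evecs = np.linalg.eigh(G)
    idx = np.argsort(evals)[::-1][:n_keep]         # leading eigenvalues first
    return evals[idx], evecs[:, idx]

n = 100
rng = np.random.default_rng(3)
feat = rng.standard_normal((n, 5))
K = feat @ feat.T                                  # a toy n x n Gram matrix
n_i = max(1, int(round(n ** 0.2)))                 # n_i = O(n^{1/5})
d_i, V_i = reduced_basis(K, n_i)
```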
By the definition of the empirical covariance operator, the penalized objective function PL_i(B) in (6) can be written in coordinates as a matrix criterion PL_i(A_i), given in (13), where A_i collects the blocks (A_i)_j for j ∈ V\{i}, with (A_i)_j being the coordinate expression of (B_i)_j with respect to ψ_i and ψ_j.

Algorithm and Tuning
We next summarize our estimation algorithm, followed by a discussion on parameter tuning and computational complexity.
Step 3: Determine the ridge parameter ε_{k_T} via ε_{k_T} = c_T × σ_max(K_T^{k,k}), k = 1, ..., n, where σ_max(·) denotes the largest eigenvalue of the associated matrix, and c_T controls the level of smoothness, which we fix at c_T = 0.04. Then use (11) to calculate the inner products ⟨X_i^k, X_i^ℓ⟩_n, for k, ℓ = 1, ..., n and i = 1, ..., p.
Step 4: Select the second-level kernel function, and compute the second-level Gram matrix K_i and its centered version G_i for i = 1, ..., p. If the RBF kernel is used, compute the width parameter γ from the pairwise distances ‖X_i^k − X_i^ℓ‖_n, where ‖·‖_n is the norm induced by the inner product in (11).
Step 5: Conduct the eigen-decomposition of G_i in (12). For the selection of n_i, we follow the rule in Ravikumar et al. (2009) and choose it adaptively based on the sample size as n_i = O(n^{1/5}), i = 1, ..., p.
Step 6: For a given λ_n and each i = 1, ..., p, minimize PL_i(A_i) in (13) over A_i ∈ R^{n_{-i}×n_i}, where n_{-i} = Σ_{j≠i} n_j, using, for example, the disciplined convex programming method of Boyd and Vandenberghe (2004).

Step 7: Let Â_i^{λ_n} denote the resulting minimizer in the previous step. Estimate the neighbors by N̂_i^{λ_n} = {j ∈ V\{i} : tr[(Â_i^{λ_n})_j^T (Â_i^{λ_n})_j] ≠ 0}, then estimate the graph Ê^{λ_n} by (7). We use Ê^{λ_n}_OR as the final estimate for all the numerical analyses.
We next discuss the tuning of the penalty parameter λ_n in Step 6. Specifically, we consider a BIC-type criterion, BIC(λ), given in (14). Our idea is to regress each random function on its neighboring functions, and thereby calculate the sum of squared errors RSS_i(λ). We then select the λ that minimizes BIC(λ) over a grid of candidate values.
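Schematically, and under the assumption that BIC(λ) combines the residual sums of squares RSS_i(λ) with a complexity term proportional to the number of selected edges (the exact form of (14) is in the article), the tuning loop looks as follows; fit_fn is a hypothetical wrapper around Steps 6-7.

```python
import numpy as np

def bic_select(lambdas, fit_fn, n):
    # fit_fn(lam) -> (rss_list, n_edges): per-node RSS_i(lam) and edge count.
    scores = []
    for lam in lambdas:
        rss_list, n_edges = fit_fn(lam)
        bic = sum(n * np.log(rss / n) for rss in rss_list) \
              + np.log(n) * n_edges              # assumed complexity penalty
        scores.append(bic)
    return lambdas[int(np.argmin(scores))]
```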
In Section S5.3 of the supplementary material, we further investigate the effect of the choice of the tuning parameters and the kernel functions. In general, we have found that our algorithm is relatively robust, as long as the choices are within reasonable ranges.
Finally, we discuss the computational complexity of our algorithm. This complexity can be divided into two parts: the pre-convex-optimization part and the convex optimization part. The first part involves the construction of both the first-layer and second-layer kernels, and the eigen-decomposition of the Gram matrix G_i. Its complexity grows in the order of p[Σ_{k,ℓ=1}^n (m_k³ + m_k m_ℓ) + n³] for the unbalanced setting, and p(n²m³ + n³) for the balanced setting, assuming that m_k = m for all k = 1, ..., n. For the convex optimization part, for graphs of small to moderate sizes, we recommend CVX, a Matlab toolbox for constrained minimization, whose default solver is based on the semidefinite-quadratic-linear program (Tutuncu, Toh, and Todd 2003). For each iteration and each regression, the complexity of this part is of the order (pn*)³ (Sra, Nowozin, and Wright 2011, chap. 3), where n* = max(n_1, ..., n_p). For large graphs, we can reduce the computation by keeping only the L1 penalty in the optimization. This simplified algorithm is equivalent to the group LASSO, and can be easily implemented by the iterative shrinkage-thresholding algorithm (ISTA, Beck and Teboulle 2009), whose complexity grows in the order of p(n*)²(n + n*). In our numerical experiments, we have found this simplification loses little estimation accuracy, but brings substantial gains in computation. Moreover, in our simulation with p = 100 and n = 100, the average number of iterations for our algorithm to converge was 33, and the average running time of a single optimization was 2.48 sec. The implementation was done on a 2 x E5-2630 v4 workstation.
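A minimal sketch of this simplified large-graph solver, with illustrative dimensions: ISTA applied to the group-lasso form of the neighborhood regression in coordinates, where each block is shrunk by group soft-thresholding.

```python
import numpy as np

def group_soft_threshold(A, tau):
    # Shrink the whole block toward zero: A * max(0, 1 - tau / ||A||_F).
    nrm = np.linalg.norm(A)
    return np.zeros_like(A) if nrm <= tau else (1.0 - tau / nrm) * A

def ista_group_lasso(Y, Xs, lam, n_iter=500):
    # Minimize (1/2n) ||Y - sum_j X_j A_j||_F^2 + lam * sum_j ||A_j||_F.
    n = Y.shape[0]
    X = np.hstack(Xs)
    step = n / np.linalg.norm(X.T @ X, 2)          # 1/L for the smooth part
    A = [np.zeros((x.shape[1], Y.shape[1])) for x in Xs]
    for _ in range(n_iter):
        R = Y - sum(x @ a for x, a in zip(Xs, A))  # residual at current iterate
        for j, x in enumerate(Xs):
            grad = -x.T @ R / n                    # gradient for block j
            A[j] = group_soft_threshold(A[j] - step * grad, step * lam)
    return A
```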

Numerical Studies
In this section, we first investigate the finite-sample performance of our proposed method by simulations. We then illustrate our method with an EEG data analysis.

Simulations
We generate multivariate random functions using the structural equation model of Pearl (2009). We consider both the Gaussian case (Model I) and the non-Gaussian case with nonlinear relations among the nodes (Models II and III). Specifically, given a directed edge set D with the ordering 1 → ··· → p, we generate (X_1(t), ..., X_p(t))^T sequentially via

X_i(t) = Σ_{j : (j→i)∈D} f_i(X_j(t)) + ε_i(t), i = 1, ..., p,

for some functions f_1, ..., f_p we specify later. We use the Brownian motion covariance as the kernel to construct the error functions ε_i(t), i = 1, ..., p; that is, ε_i(t) is generated by Σ_{u=1}^J ξ_u κ_T(t, t_u), where the t_u and ξ_u are independently generated from Uniform(0, 1) and Normal(0, 1). For the observed time points {t_{k1}, ..., t_{km_k} : k = 1, ..., n}, we consider two different scenarios: the balanced case with m_k = 10 equally spaced points in [0, 1], and the unbalanced case with m_k = 10 time points independently drawn from the discrete uniform distribution on {0.01, 0.02, ..., 1}. We consider the following choices of f_i for Models I to III. The edge set D is generated from a hub structure. We then use the moral graph of D as the undirected graph E (Lauritzen 1996). Given a graph of size p, ten independent hubs are generated so that the module in each hub is of degree p/10 − 1. We set the sample size as n = 100, and the number of nodes as p = 100. In Section S5.1 of the supplementary material, we further consider additional network structures, including the tree and the chain structures, and additional network sizes, including p = 50 and 200.
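As an illustration, a sketch of the error-function generator just described: ε_i(t) = Σ_{u=1}^J ξ_u κ_T(t, t_u) with the Brownian-motion covariance κ_T(s, t) = min(s, t); the truncation level J is an assumed choice.

```python
import numpy as np

def brownian_error(t_grid, J=10, rng=None):
    # epsilon(t) = sum_u xi_u * min(t, t_u), with t_u ~ U(0,1), xi_u ~ N(0,1).
    rng = rng or np.random.default_rng()
    t_u = rng.uniform(size=J)
    xi = rng.standard_normal(J)
    return np.minimum.outer(t_grid, t_u) @ xi

# Balanced design: m_k = 10 equally spaced points in [0, 1].
t_grid = np.linspace(0.1, 1.0, 10)
eps = brownian_error(t_grid, rng=np.random.default_rng(4))
print(eps)
```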
For each model, the proposed method is applied with the second-layer kernel being a Gaussian kernel (denoted as FARO) and a linear kernel (denoted as Linear). The latter shares the same spirit as the Gaussian graphical model method of Qiao, Guo, and James (2019), and we further compare with Qiao, Guo, and James (2019) in Section S5.2 of the supplementary material. We also compare with the functional additive precision operator of Li and Solea (2018) (denoted as Li and Solea). We first calculate the false positive (FP) rate and the true positive (TP) rate for a given parameter λ, by counting, via the indicator function I(·), the edges of the estimated graph Ê(λ) under λ against those of the true graph E_0.
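For concreteness, a sketch of this edge-counting computation on boolean adjacency matrices (illustrative names; the article's formulas use the indicator function I(·) in the same way):

```python
import numpy as np

def fp_tp_rates(E_hat, E0):
    # E_hat, E0: symmetric boolean adjacency matrices without self-loops.
    off = ~np.eye(E0.shape[0], dtype=bool)         # ignore the diagonal
    fp = np.sum(E_hat & ~E0 & off) / max(int(np.sum(~E0 & off)), 1)
    tp = np.sum(E_hat & E0) / max(int(np.sum(E0)), 1)
    return fp, tp
```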
We then compute the receiver operating characteristic (ROC) curve over a grid of values of λ. Figure 1 shows the average ROC curves for the three methods under the different models. Each curve is averaged over 40 replications. It is seen that our method performs about the same as the Gaussian estimation method when the true model is indeed Gaussian (Model I), but performs much better when the underlying model is non-Gaussian (Models II and III). Moreover, FARO performs significantly better than Li and Solea (2018) in Model III for both the balanced and unbalanced settings, indicating the benefit of penalized optimization over hard thresholding.

EEG Data Analysis
We next apply our method to an EEG data analysis. The data are available at https://kdd.ics.uci.edu/databases/eeg/eeg.data.html. The goal of the study is to investigate the EEG relatedness of genetic predisposition to alcoholism. The data were collected from two groups of individuals: 77 subjects from the alcoholic group and 45 from the control group. Each subject was asked to wear a 64-channel electrode cap and was shown one of three types of stimuli, while the voltage value from each electrode was recorded at 256 Hz for one second, yielding 256 time points. Each subject completed 120 trials for each of the three stimuli.
The same dataset has been analyzed before (Li, Kim, and Altman 2010; Xia and Li 2017; Qiao, Guo, and James 2019). Following Li, Kim, and Altman (2010), a preprocessing step was carried out by taking the average of the measurements from the single-stimulus trials. This results in a 64 × 256 data matrix for each individual. Our goal is to estimate the brain connectivity network with p = 64 for the alcoholic group and the control group, respectively. We employ a Gaussian kernel for this data analysis, and use the BIC criterion in (14) for parameter tuning.
Figure 2 shows the estimated graphs, Ê_ALC and Ê_CTL, for the alcoholic group and the control group, as well as the difference graphs, Ê_ALC \ Ê_CTL and Ê_CTL \ Ê_ALC. It is seen that both Ê_ALC and Ê_CTL are relatively sparse, with a 10.2% and 9.9% sparsity rate, respectively, which also indicates a decrease of connectivity in the alcoholic group. Moreover, compared to the control individuals, the alcoholic individuals reveal asymmetric patterns, in which the left frontal, central, and parietal regions have more connections than their right counterparts. We also note that the electrodes in regions other than the frontal and parietal regions are connected only sparsely. These findings in general agree with the literature (Hayden et al. 2006).

Discussion
In this article, we have proposed a new nonparametric functional graphical model. A key and novel feature of this work is the estimation of a large number of regressions at the operator level, where both the number of predictors in each regression and the number of regressions increase with the sample size at an exponential rate. This versatile framework yields a flexible, accurate, and computationally feasible estimate of the non-Gaussian and nonlinear functional graphical model with a large number of nodes. To the best of our knowledge, our consistency result is the first of its kind for neighborhood selection where both the response and the predictors are functions, the relation between them can be nonlinear, and the dimension of the functions can outgrow the sample size. Additional novel features include the use of ACI as the selection criterion in the functional setting, the new concept of the local additive Markov property and its relation with the pairwise additive Markov property, as well as the introduction of the additive Gaussian distribution that puts ACI on a solid footing.
This new framework involves considerable asymptotic developments. Specifically, we need to extend the individual convergence in Lee, Li, and Zhao (2016b) to uniform convergence. Toward that end, we derive a series of concentration bounds for the sample covariance operator and the sample mean elements in the RKHS, as shown in Lemma 1 and Lemma S3. From these, we show that the tail probability of the estimation error in terms of the Hilbert-Schmidt norm behaves like that of a sub-Gaussian variable. These concentration bounds and tail probabilities are the key elements for developing the uniform consistency in the high-dimensional setting. Li and Solea (2018) and Lee, Li, and Zhao (2016b) did not have such results.
Moreover, in order to make these concentration bounds applicable to our setting, we need to develop several relations to link the covariance operator with various key quantities in our method. For example, in Proposition S2, we derive an inequality between the estimated FARO and an intermediate operator, based on which we obtain the concentration bound of the FARO estimator. In Proposition S3, we derive a series of inequalities under which the KKT conditions are satisfied, which leads to an upper bound on the probability of erroneous neighborhood selection. These developments are far beyond routine. Besides, they are sufficiently general to be useful for other problems in functional data analysis.
Although the current form of the additive Gaussian distribution is of a finite dimension, we expect that it can be extended to the infinite-dimensional setting. In addition, our experience suggests that, often in practice, there exist only a few dominating eigenfunctions in the empirical covariance operator. For the truly infinite-dimensional setting, we expect that some additional regularization is needed, such as a fast decay of the tail eigen-structure. We leave such an extension to future research.

Figure 1. ROC curves for Models I to III, for the balanced case (top panels) and the unbalanced case (bottom panels).