Sparse Functional Dynamical Models—A Big Data Approach

ABSTRACT Nonlinear dynamical systems are encountered in many areas of social science, natural science, and engineering, and are of particular interest for complex biological processes like the spiking activity of neural ensembles in the brain. To describe such spiking activity, we adapt the Volterra series expansion of an analytic function to account for the point-process nature of multiple inputs and a single output (MISO) in a neural ensemble. Our model describes the transformed spiking probability for the output as the sum of kernel-weighted integrals of the inputs. The kernel functions need to be identified and estimated, and both local sparsity (kernel functions may be zero on part of their support) and global sparsity (some kernel functions may be identically zero) are of interest. The kernel functions are approximated by B-splines and a penalized likelihood-based approach is proposed for estimation. Even for moderately complex brain functionality, the identification and estimation of this sparse functional dynamical model poses major computational challenges, which we address with big data techniques that can be implemented on a single, multi-core server. The performance of the proposed method is demonstrated using neural recordings from the hippocampus of a rat during open field tasks. Supplementary materials for this article are available online.


Introduction
A dynamical system is a process evolving in time, with the current output of the system functionally dependent on current and past inputs. Nonlinear structure in the functional dependence can result in tremendous complexity in the system. The first documented study of a nonlinear dynamical system involved two interacting species, predators and prey, and was conducted independently by Volterra and Lotka in the 1910s and 1920s. The interaction was modeled through nonlinear differential equations (Volterra 1930). A general solution to such nonlinear differential equations is the Volterra series (Volterra 1930; Flake 1963; Karmakar 1979), which represents the output of the system as a series of convolutions of its inputs. The Volterra series is often referred to as a Taylor series with memory.
For a system with a single input x(t) and a single output z(t), the Volterra series is
$$z(t) = \kappa_0 + \sum_{p=1}^{\infty} \int_0^{\infty} \cdots \int_0^{\infty} \kappa_p(\tau_1, \ldots, \tau_p)\, x(t-\tau_1) \cdots x(t-\tau_p)\, d\tau_1 \cdots d\tau_p, \qquad (1)$$
where $\kappa_p$ is known as the $p$th-order Volterra kernel. In practice, the Volterra kernels are unknown and have to be estimated in the process of system identification.
In addition to early applications in the physical sciences (voltage, current, velocity, stress, temperature; Wiener 1958; Schetzen 1981), nonlinear dynamical systems have been identified in areas as diverse as economics and psychology (human behavior and interactions; Vallacher and Nowak 1994; Schenk-Hoppé 2001), computer and network science (system failures, network intrusions, resource misuse, etc.; Debar, Becker, and Siboni 1992; Cannady 1998), and physiology (cardiovascular, respiratory, renal; Marmarelis 1993), among many others. Biological processes are difficult to model, since the rules governing their behavior are unknown or only partially understood. Particularly challenging examples come from neuroscience, with the human brain considered to be the most complex nonlinear dynamical system of all. The brain is also a good illustration of a system composed of subcomponents, in which understanding the functionality of the simplest building blocks is not enough to understand and model the entire system. The behavior of neurons, each one consisting of a soma, an axon, and dendrites, has been fairly well understood at the individual level. Existing parametric models (e.g., Hodgkin-Huxley, FitzHugh-Nagumo, Hindmarsh-Rose), consisting of sets of differential equations, can successfully describe the nonlinear dynamics of the membrane voltage at the level of a single neuron (Fuchs 2013). However, these models cannot easily scale up to the level of neural ensembles, which are groups of interconnected and functionally similar neurons, due to numerous unknown biological mechanisms and processes within the neuronal networks.
On the other hand, a nonlinear dynamical model in the form of the Volterra series models the neural ensembles directly from their input/output data. It does not rely on complete knowledge of the system and thus provides an appealing approach for modeling large-scale, complex nervous systems.
In this article, we model the nonlinear dynamics of neural ensembles using spiking activities. Spikes, also termed action potentials, are brief electrical pulses generated in the soma and propagating along axons. Neurons receive spike inputs from other neurons through synapses and then transform these inputs into their spike outputs. Since spikes have stereotypical shapes, they can be simplified as point-process signals carrying information in the timings of the pulses. It is widely accepted that information in the brain is encoded in the spatio-temporal patterns of spikes. Thus, characterization of the point-process input-output properties of neurons using spiking activities is a crucial step for understanding how the neurons, as well as the neuronal networks, transmit and process information; see Song et al. (2007, 2013). Within a neural network, this procedure is essentially the identification of the functional connectivity between neurons. If the neural spikes are represented as binary signals (e.g., Song et al. 2007, 2009), then Equation (1) is not sufficient to describe the output of the system in terms of its inputs, and a generalized linear model (GLM) is often introduced. In Song et al. (2007, 2009), Volterra kernel models were combined with generalized linear models to characterize the input-output properties of neurons. To facilitate model estimation, Volterra kernels were expanded with basis functions, for example, a Laguerre basis or a B-spline basis, and model coefficients were estimated with a penalized likelihood method (Song et al. 2013).
However, this estimation methodology imposes major computational challenges, chiefly because the first-order Volterra series has been shown to be inadequate to capture the system dynamics (Song et al. 2007). The second-order series, with even a moderate number of inputs and basis functions, requires estimating thousands of parameters. A large number of observations is critical to account for sparsity in the inputs and to avoid over-smoothing; see Marmarelis (1993) and Korenberg and Hunter (1996).
With a large sample size and large number of parameters, system identification faces a big data challenge, which is characterized by both hardware and software issues; see Fan, Han, and Liu (2014) for a recent review. The data may not fit in the memory of a single computer, and some statistical methods, if they can be adjusted to accommodate distributed data and computing, just take too long to be computationally feasible. As a result, none of the cited articles was successful in providing a viable and reproducible method of regularized estimation of Volterra kernels of higher orders. Regularization is key to achieving functional sparsity at both local and global levels, necessitated by the sparsity in neural connectivity (Tu et al. 2012). Song et al. (2013) provided a comprehensive study of sparsity by comparing the effects of Laguerre basis and B-spline basis. Both types are adequate in representing the global sparsity, but only the B-spline basis successfully modeled the local sparsity. To achieve sparsity, Song et al. (2009) used the forward step-wise model selection and Song et al. (2013) implemented a regularization-based method, that is, LASSO (Tibshirani 1996).
Aside from nonparametric modeling using Volterra kernels, other approaches include cascade or block structure modeling (Korenberg and Hunter 1986), basis representation of orthogonalized Wiener kernels (Marmarelis 1993), and artificial neural networks, which have a correspondence with Volterra kernel models (Wray and Green 1994). These alternative methods are not considered further in this article, though some of the computational techniques we describe can be applied in these other settings.
The contributions of this article are threefold. First, we tackle the problem of function estimation using a classical likelihood method in the analysis of neural spikes. Second, for better system identification and estimation, we present a novel regularization approach for function estimation. In particular, we use a sparse basis expansion of second-order or higher-order Volterra kernels and regularize to preserve both local and global sparsity. Third, we combine various existing statistical and optimization techniques in a novel way and apply big-data techniques for fast and reliable computations. Specifically, we show that the standard big-data approach of distributing data does not lead to a feasible computational strategy in our problem, so we show instead how to perform all the necessary computations on a single multi-core server.
The rest of this article is organized as follows. Section 2 describes the data and the research goals. Section 3.1 presents our proposed dynamical multiple-input single-output (MISO) model based on the generalized second-order Volterra series. In the remaining subsections of Section 3, we discuss the likelihood-based and penalized estimation of the parameters of the model. In Section 4, we present computational details in the context of big data. We describe the challenges encountered in fitting the proposed model when both the number of parameters and the number of observations are large, and we suggest methods to overcome those challenges. In Section 4.2, we describe the design matrix and show how to take advantage of redundancy in its structure. In Sections 4.3 and 4.4, we detail the methods of likelihood-based and penalized estimation, using only matrix-vector operations. We show in Section 4.7 how matrix-vector operations can be executed in the map-reduce environment. In Section 5, we conclude with model development and validation for neural recordings from the hippocampus of a rat during open field tasks.

Data Description and Problem Formulation
The neural spike data that we consider are available from the public repository crcns.org; for details see Mizuseki et al. (2009a,b). The neural recordings come from layer CA1 of the dorsal hippocampus, an area of the brain associated with learning and forming memories, in particular spatial memories. The recordings are performed with eight electrodes on a Long-Evans rat during open field tasks, within a 90-min time frame. Each electrode registers neural spikes from a cluster of neurons. Though we cannot reliably differentiate cells within a cluster, we will for simplicity refer to the cluster as a neuron. A portion of the dataset, recorded over 3 sec, is shown in Figure 1.
In Figure 1, each horizontal line contains a sequence of spikes from one neuron. Here, each short vertical line indicates the presence of a spike. The data are recorded at a sampling rate of 20 MHz. Although the timings of neural spikes are exact, it is practical to encode each spike train as a binary process: a 1 is recorded when a spike is observed in some predefined time interval, and a 0 if it is not. For an interval of 1 msec, this transformation leads to 5.4 million observations for each neuron. More details are provided in Section 3.1. Our primary goal is to model the functional relationship between the output neuron and input neurons. In particular, our interest centers on the probability of an output spike at a specific time given the history of input and output neuron firings. This relationship is known to be nonlinear and time-dependent (Song et al. 2009; Fuchs 2013), so we adapt the Volterra series (1) to the generalized linear model setting, with discrete inputs. Our second goal is the identification of functional connectivity between neurons. The connectivity is known to be sparse (He 2005; Song et al. 2007). Here, we formulate this problem as a model selection problem. In particular, we take a regularization approach to establish a set of relevant input neurons as well as intervals of relevance.
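The encoding described above can be sketched in a few lines; this is an illustrative sketch (the function name and the spike-time format are our own assumptions, not the paper's code), assuming spike times are given in seconds:

```python
import numpy as np

def bin_spike_train(spike_times_sec, duration_sec, delta=0.001):
    """Encode exact spike times as a binary sequence: entry t (0-based) is 1
    if at least one spike falls in the interval (t*delta, (t+1)*delta]."""
    n_bins = int(round(duration_sec / delta))
    y = np.zeros(n_bins, dtype=np.int8)
    idx = np.ceil(np.asarray(spike_times_sec) / delta).astype(int) - 1
    idx = np.clip(idx, 0, n_bins - 1)
    y[idx] = 1
    return y

# Three spikes within a 10 msec window; the first two fall in the same 1 msec bin.
y = bin_spike_train([0.0012, 0.0014, 0.0071], duration_sec=0.010)
```

Note that two spikes in the same bin collapse to a single 1, so the binary encoding discards a small amount of information when the bin width is not much shorter than the refractory period.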

Generalized Second-Order Volterra Model
The output of a neuron or neural ensemble consists of a series of spikes, and it is convenient to model it in terms of a probability or a rate of action potentials. Let y(t) be a binary output signal, and let x_1(t), . . . , x_d(t) (0 ≤ t ≤ T) be d binary input signals indicating the times at which the spikes occurred in a T-second interval. Both input and output signals are recorded at a fixed sampling frequency, that is, for each signal, observations are obtained at t = δ, 2δ, . . . , nδ and T = nδ, where δ is the length of a sampling time interval. A 1 is recorded if a spike occurs in the time interval ((t − 1)δ, tδ], and 0 otherwise. Let x(t) = (x_1(t), . . . , x_d(t))^T be a vector of input signals. The conditional probability of the action potential of the output signal, given the history of the input and output signals, can be written as
$$\theta(t) = \mathrm{Prob}\{y(t) = 1 \mid x(s),\, y(s),\; t - M \le s < t\}.$$
Here M = mδ is the length of the system memory.
The generalized functional additive model with an appropriate link function is a natural starting point for estimation of θ(t). To capture system dynamics in such a model, we adapt (1) to our context and propose a generalized second-order Volterra model
$$\Phi^{-1}(\theta(t)) = \kappa_0 + \sum_{i=1}^{d} \int_0^{M} \varphi^{(1)}_i(\tau)\, dN_{x_i}(t-\tau) + \sum_{i_1=1}^{d} \sum_{i_2=1}^{i_1} \int_0^{M}\!\!\int_0^{M} \varphi^{(2)}_{i_1,i_2}(\tau_1,\tau_2)\, dN_{x_{i_1}}(t-\tau_1)\, dN_{x_{i_2}}(t-\tau_2) + \int_0^{M} \varphi^{(1)}_{d+1}(\tau)\, dN_{y}(t-\tau), \qquad (2)$$
where $\Phi^{-1}$ is the probit link function, the inverse of the standard normal distribution function, and $\kappa_0$, $\varphi^{(1)}$, $\varphi^{(2)}$ are Volterra kernels of zero-, first-, and second-order, respectively. In addition, $N_{x_i}$, $N_y$ are counting measures with respect to input and output signals. Our choice of the probit link function, over other options including the commonly used logit, is due to the broader use of the probit in neural modeling (Song et al. 2013).
Like the system output z(t) in the Volterra series (1), $\Phi^{-1}(\theta(t))$ in (2) is an analytic function of t. Our proposed model is a causal model with a finite memory, so the integrals in (2) extend only over the interval [0, M]. This choice is determined by the nature of the brain dynamics: a neural firing is influenced by only the recent history of input firings. Our model (2) also contains an autoregressive component as its final term, which is essential to model the refractory period between neural firings. Finally, the integrals in (2) are with respect to counting measures, to account for the binary point-process inputs.
Model (2) is of second order. It can be further extended by including higher-order terms. The zeroth-order kernel $\kappa_0$ captures the baseline probability of an output spike, when no inputs are present. Each first-order kernel $\varphi^{(1)}_i(\tau)$ measures the contribution of the corresponding input signal. The second-order kernels $\varphi^{(2)}_{i_1,i_2}(\tau_1, \tau_2)$ measure the interaction effects between the same input at different time lags, and between different inputs. The kernels are unknown and need to be estimated in the process of system identification.
A discretized version of this model has been proposed in Song et al. (2007), where each kernel function is discretized and represented as a vector of coefficients. The estimation procedure for these coefficients is likelihood-based. The drawback is that the numbers of parameters and observations are very large, which makes computations intractable and unstable. For instance, in our motivating example, the total number of parameters in the discretized model is on the order of $10^6$. Our proposed model, based on continuous Volterra kernels, provides a more feasible computational strategy; see Section 4 for details.

Likelihood and Parameter Estimation
In this section, we consider a likelihood-based approach for estimation of the functionals ϕ in (2). For the binary output signal y(t), observations are recorded at t = δ, 2δ, . . . , nδ. Assuming y(t) has the Markov property of order m, the likelihood function of y = (y((m + 1)δ), . . . , y(nδ))^T is
$$L(\mathbf{y}) = \prod_{t=(m+1)\delta}^{n\delta} \theta(t)^{y(t)}\,\{1-\theta(t)\}^{1-y(t)}. \qquad (3)$$
The likelihood is a function of θ(t) and therefore, by (2), a function of the unknown $\kappa_0$, $\varphi^{(1)}$, and $\varphi^{(2)}$. The log-likelihood is
$$l = \sum_{t=(m+1)\delta}^{n\delta} \Big[ y(t)\log\theta(t) + \{1-y(t)\}\log\{1-\theta(t)\} \Big]. \qquad (4)$$
To reduce the dimension of the problem, we approximate the functionals ϕ using B-spline basis functions,
$$\varphi^{(1)}_i(\tau) \approx \sum_{j=1}^{J} \alpha^{(i)}_j B_j(\tau), \qquad (5)$$
$$\varphi^{(2)}_{i_1,i_2}(\tau_1,\tau_2) \approx \sum_{j_1=1}^{J}\sum_{j_2=1}^{J} \alpha^{(i_1,i_2)}_{j_1,j_2} B_{j_1}(\tau_1)\, B_{j_2}(\tau_2), \qquad (6)$$
where J depends on the number of knots and the degree of the spline functions (both selected by the user) and $\alpha^{(i)}_j$, $\alpha^{(i_1,i_2)}_{j_1,j_2}$ are unknown coefficients. Let
$$\xi^{(i)}_j(t) = \int_0^M B_j(\tau)\, dN_{x_i}(t-\tau), \qquad \xi^{(i_1,i_2)}_{j_1,j_2}(t) = \int_0^M\!\!\int_0^M B_{j_1}(\tau_1)\, B_{j_2}(\tau_2)\, dN_{x_{i_1}}(t-\tau_1)\, dN_{x_{i_2}}(t-\tau_2),$$
with $N_{x_{d+1}} \equiv N_y$ for the autoregressive term. These quantities are fully known, since they are derived from the observed signals x_i. Next, for each t = (m + 1)δ, . . . , nδ, we define the row vector
$$X(t) = \big(\xi^{(1)}_1(t), \ldots, \xi^{(d+1)}_J(t),\; \xi^{(1,1)}_{1,1}(t), \ldots, \xi^{(d,d)}_{J,J}(t)\big). \qquad (7)$$
Finally, we derive the approximate log-likelihood $l_\alpha$ as follows:
$$l_\alpha(\alpha_0, \alpha) = \sum_{t=(m+1)\delta}^{n\delta} \Big[ y(t)\log\Phi\{\alpha_0 + X(t)\alpha\} + \{1-y(t)\}\log\big[1-\Phi\{\alpha_0 + X(t)\alpha\}\big] \Big]. \qquad (8)$$
The maximizers of (8) serve as estimates of the intercept and the B-spline coefficients. Maximizing (8) with respect to the unknown parameters is a well-studied convex optimization problem with an iterative solution based on the Taylor series approximation of (8) or, equivalently, the Newton-Raphson method or Fisher scoring (McCullagh and Nelder 1983; Green 1984). These two methods are not equivalent for the probit link, unlike the case of logistic regression. Newton-Raphson provides faster convergence but requires the calculation of the Hessian matrix at each step (compared to the computationally simpler alternative of the Fisher information). More details are given in Section 4.3.
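For concreteness, the approximate log-likelihood (8) can be evaluated as in the following sketch, assuming the rows X(t) have already been assembled into a matrix (the function and variable names are ours, not from the paper's code):

```python
import numpy as np
from scipy.stats import norm

def probit_loglik(alpha0, alpha, X, y):
    """Bernoulli log-likelihood under the probit link:
    theta(t) = Phi(alpha0 + X(t) alpha), as in Equation (8)."""
    theta = norm.cdf(alpha0 + X @ alpha)
    # Clip away exact 0/1 so the logarithms stay finite.
    theta = np.clip(theta, 1e-12, 1.0 - 1e-12)
    return np.sum(y * np.log(theta) + (1.0 - y) * np.log(1.0 - theta))
```

With all coefficients zero, each θ(t) equals 1/2 and the log-likelihood reduces to n log(1/2), a convenient sanity check.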

Functional Sparsity and Neural Connectivity
In a neural network, neural connectivity is believed to be sparse in nature (He 2005;Song et al. 2007). Two different types of sparsity were defined by Tu et al. (2012). In particular, given multiple input neurons, only a small subset of these may have any relevance to the signal recorded from the output neuron. This property is referred to as global sparsity, and it should lead to the estimation of functionals ϕ i corresponding to irrelevant inputs as zero. Among the inputs that show relevance to the output signal, the impact will be limited in time, and the duration of the effect is often of interest. Outside of the interval in question, the functional will be zero, which is referred to as local sparsity.
Maximizing (8) does not guarantee either type of sparsity, so a penalized approach enforcing sparsity is preferable. For parametric regression models, penalized approaches have been widely studied and have gained popularity due to simultaneous parameter estimation and variable selection. Commonly used penalization methods include the LASSO (Tibshirani 1996), SCAD (Fan and Li 2001), the group LASSO (Yuan and Lin 2006), and the group bridge (Huang et al. 2009). In Wang and Kai (2015), these penalization methods were evaluated for the problem of detecting functional sparsity in univariate, nonparametric regression models. In addition, these authors incorporated the group bridge penalty with an intuitive grouping method for estimating functions with global or local sparsity.
For demonstration purposes, consider the function shown as a thick, solid line in the bottom panel of Figure 2. Note that this function is zero on [0, 0.2]. In the top panel, 13 cubic B-spline basis functions with nine equally spaced interior knots are depicted. To approximate the bottom-panel function with these B-splines and achieve sparsity on each subinterval of [0, 0.2], zero coefficients for the first four B-splines (shown with dashed lines in the top panel) are required. Hence, it is natural to treat these four coefficients as a group. This grouping generalizes beyond cubic splines: for B-spline basis functions of degree b, each group consists of b + 1 coefficients. In this article, we adopt the grouping idea suggested by Wang and Kai (2015). For the ith first-order kernel $\varphi^{(1)}_i(\tau)$ in (5), its coefficients for the B-spline approximation in (5) can be assigned to J − b overlapping groups,
$$A_g = \big\{\alpha^{(i)}_g, \alpha^{(i)}_{g+1}, \ldots, \alpha^{(i)}_{g+b}\big\}, \qquad g = 1, \ldots, J - b.$$
For the second-order kernel $\varphi^{(2)}_{i_1,i_2}(\tau_1, \tau_2)$, the grouping becomes rather complicated. Over each small rectangle spanned by two adjacent knots on $\tau_1$ and $\tau_2$, the sparsity can be determined by $(b + 1)^2$ coefficients. To be more specific, the coefficients of the B-spline approximation in (6) can be assigned to $(J - b)^2$ overlapping groups,
$$A_{g_1,g_2} = \big\{\alpha^{(i_1,i_2)}_{j_1,j_2} : g_1 \le j_1 \le g_1 + b,\; g_2 \le j_2 \le g_2 + b\big\}, \qquad g_1, g_2 = 1, \ldots, J - b.$$
For convenience, we write the group penalty as
$$P^{(\gamma)}_\lambda(\alpha) = \lambda \sum_{g=1}^{G} \big\|\alpha_{A_g}\big\|_1^{\gamma},$$
where G is the total number of groups and γ ∈ (0, 1) is a constant. Here $\alpha_{A_1}, \ldots, \alpha_{A_G}$ represent the groups of coefficients as defined above.
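The overlapping groups can be enumerated directly from J and b; a small sketch (helper names are ours) reproducing the Figure 2 setting of J = 13 cubic B-splines (b = 3):

```python
def first_order_groups(J, b):
    """Overlapping coefficient groups for a first-order kernel expanded in
    J B-splines of degree b: group g (0-based indices) collects the b + 1
    coefficients whose basis functions overlap one knot span."""
    return [list(range(g, g + b + 1)) for g in range(J - b)]

def second_order_groups(J, b):
    """(J - b)^2 groups of (b + 1)^2 coefficient index pairs for a
    tensor-product second-order kernel."""
    g1 = first_order_groups(J, b)
    return [[(j1, j2) for j1 in a for j2 in c] for a in g1 for c in g1]

groups = first_order_groups(13, 3)   # 10 groups of 4 coefficients each
```

Setting a whole group to zero forces the fitted kernel to vanish on the corresponding knot span, which is exactly the local sparsity targeted by the group penalty.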
Finally, the penalized log-likelihood criterion can be expressed as
$$l_\alpha(\alpha_0, \alpha) - P^{(\gamma)}_\lambda(\alpha). \qquad (9)$$
The penalized maximum likelihood estimate (MLE) is defined as the maximizer of this criterion. The computational aspects are covered in Section 4.5.

Overview of the Algorithm
In this section, we focus on computational aspects of fitting model (2), beginning with maximization of the approximate log-likelihood (8). We proceed by selecting statistical and optimization methods that can be adapted to big data computations because they operate on input data only through matrix-vector operations that can be tailored to a parallel environment; see Section 4.2 for initial results. The methods include iteratively reweighted least squares (IRLS), which maximizes a quadratic approximation to the log-likelihood function. Each iteration of IRLS relies on an L2-minimization step, which can be executed using the conjugate gradient method. The computational details are available in Section 4.3. Next, we consider maximization of (9), a penalized version of the approximate log-likelihood. Some penalties lead to nonconvex optimization problems. Yet, for penalties like SCAD or the group bridge, a solution can be found iteratively by solving a sequence of L1 problems. Maximization of the log-likelihood function with an L1 (or LASSO) penalty can be solved using the coordinate descent algorithm, which requires the maximum likelihood estimate as an input. Details of this method, and the derivation of the correspondence between the group bridge and a sequence of LASSO steps, are described in Sections 4.4 and 4.5.
Pseudo-code and the code location are available in the supplementary materials.

Design Matrix
Under the B-spline approximations (5) and (6), the conditional mean of y is
$$E(\mathbf{y}) = \Phi(\alpha_0 \mathbf{1} + X\alpha), \qquad (10)$$
where the standard normal cdf Φ(·) is applied element-wise. Write X = [X_1, X_2], where X_1 and X_2 correspond, respectively, to the first-order kernels and the second-order kernels in X(t), as defined in (7). The elements of the matrix X_1, $\xi^{(i)}_j(t)$, are computed as
$$\xi^{(i)}_j(t) = \sum_{k} B_j(t - t_{ik}), \qquad (11)$$
where $\{t_{ik}\}$ denote the set of time stamps in (t − M, t] such that $x_i(t_{ik}) = 1$, for t = (m + 1)δ, . . . , nδ, j = 1, . . . , J, and i = 1, . . . , d + 1. The element $\xi^{(i)}_j(t)$ is in row r, such that t = (m + r)δ, and column (i − 1)J + j. The elements of the matrix X_2, each row r containing $(\xi^{(1,1)}_{1,1}(t), \xi^{(1,1)}_{1,2}(t), \ldots, \xi^{(d,d)}_{J,J}(t))^T$ for t = (m + r)δ, are derived using the formula
$$\xi^{(i_1,i_2)}_{j_1,j_2}(t) = \xi^{(i_1)}_{j_1}(t)\, \xi^{(i_2)}_{j_2}(t). \qquad (12)$$
The two-dimensional B-spline basis functions are tensor products of the one-dimensional B-spline basis functions. Consequently, the rows of X_2 can be efficiently calculated as the Kronecker products of the rows of X_1. This observation is rather important for big data computation; in fact, the matrix X_2 does not have to be generated in its entirety: its rows can be dynamically recreated during matrix-vector operations. This largely alleviates storage and computational challenges in further calculations. More details are provided in Section 4.7.
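The construction of the rows of X_1 via (11), and of the rows of X_2 as products of first-order features, can be sketched with SciPy's B-spline utilities. This is a simplified illustration (helper names are ours): for clarity, the second-order row below keeps duplicate Kronecker entries rather than the deduplicated "modified" product used later, and the autoregressive block is handled like any other input.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(knots, degree, J):
    """J one-dimensional B-spline basis elements on the given knot sequence."""
    return [BSpline.basis_element(knots[j:j + degree + 2], extrapolate=False)
            for j in range(J)]

def x1_row(t, spike_times_by_input, basis, memory):
    """Row of X1 at time t: xi_j^(i)(t) = sum_k B_j(t - t_ik), summing over
    spikes of input i that fall in the window (t - M, t]."""
    row = []
    for times in spike_times_by_input:
        lags = t - np.asarray(times)
        lags = lags[(lags > 0) & (lags <= memory)]
        for B in basis:
            # basis_element returns nan outside its support; nansum treats it as 0
            row.append(np.nansum(B(lags)))
    return np.array(row)

def x2_row(row1, J, d):
    """Row of X2 as the Kronecker product of the input part of the X1 row,
    using the tensor-product identity (12)."""
    v = row1[:d * J]        # drop any trailing autoregressive block
    return np.kron(v, v)
```

Because (12) factorizes each second-order feature, x2_row never evaluates a two-dimensional basis function; this is the identity that later allows X_2 to be recreated on the fly.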

Maximum Likelihood Estimation With Conjugate Gradient
Iteratively reweighted least squares (IRLS) is a well-established algorithm for finding a maximum likelihood estimate of the generalized linear model (10); see McCullagh and Nelder (1983) for more details. This iterative approach is based on the Taylor expansion of the log-likelihood function using either the observed or the expected information. To simplify the implementation, we employ the expected information (i.e., Fisher scoring). At the (k + 1)st step, one finds the estimate of α, say $\alpha^{(k+1)}$, based on the previously calculated $\alpha^{(k)}$ by minimizing
$$\big(z^{(k)} - Z\alpha\big)^T W_k \big(z^{(k)} - Z\alpha\big),$$
where Z = [1, X] is the design matrix augmented with a column of ones for the intercept. Here, $W_k$ is a diagonal matrix with elements that depend on the previously calculated $\alpha^{(k)}$,
$$w^{(k)}_j = \frac{\phi\big(\alpha_0 + x_j^T \alpha^{(k)}\big)^2}{\theta^{(k)}_j \big(1 - \theta^{(k)}_j\big)}. \qquad (13)$$
The vector $z^{(k)}$ has elements $z^{(k)}_j$ that depend on the current responses via
$$z^{(k)}_j = \alpha_0 + x_j^T \alpha^{(k)} + \frac{y_j - \theta^{(k)}_j}{\phi\big(\alpha_0 + x_j^T \alpha^{(k)}\big)}, \qquad (14)$$
where $x_j^T$ is the jth row of X, $\theta^{(k)}_j = \Phi(\alpha_0 + x_j^T \alpha^{(k)})$, and φ, Φ are the density and distribution functions of the standard normal, respectively. Moreover, the minimization problem above has a closed-form solution,
$$\big(\alpha_0^{(k+1)}, \alpha^{(k+1)T}\big)^T = \big(Z^T W_k Z\big)^{-1} Z^T W_k z^{(k)}.$$
In practice, for a moderately large matrix X, the closed-form solution at every iteration can be calculated using a map-reduce framework as described in Section 4.7. However, the operation of matrix inversion is not feasible with a massively large matrix X. An alternative solution is to use the conjugate gradient method.
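One Fisher-scoring step, forming the weights (13) and the working response (14) and then solving the weighted least-squares problem in closed form, can be sketched as follows; a minimal illustration (names are ours) that is feasible only when p is moderate:

```python
import numpy as np
from scipy.stats import norm

def irls_step(alpha, Z, y):
    """One Fisher-scoring (IRLS) step for probit regression. Z is the design
    matrix augmented with the intercept column; alpha holds (alpha_0, alpha)."""
    eta = Z @ alpha
    theta = np.clip(norm.cdf(eta), 1e-10, 1.0 - 1e-10)
    dens = np.maximum(norm.pdf(eta), 1e-10)
    w = dens**2 / (theta * (1.0 - theta))   # diagonal of W_k, Eq. (13)
    z = eta + (y - theta) / dens            # working response, Eq. (14)
    ZtW = Z.T * w                           # Z^T W_k without forming W_k
    return np.linalg.solve(ZtW @ Z, ZtW @ z)
```

Iterating irls_step until the change in the coefficients is negligible yields the unpenalized MLE; for large p the linear solve is replaced by conjugate gradient.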
Conjugate gradient is an effective optimization method developed for solving quadratic optimization problems of the type
$$\min_{x}\; \tfrac{1}{2} x^T H x - b^T x,$$
where H is a p × p matrix, but it can also be used for more general nonlinear optimization. It is based on conjugate vectors $d_1, \ldots, d_p$ such that $d_i^T H d_j = 0$ for $i \neq j$. The conjugate vectors are linearly independent and span the Euclidean space $\mathbb{R}^p$, so the solution of the quadratic optimization problem can be expressed as a linear combination of these vectors. The coefficients of the conjugate vectors can be obtained iteratively in p steps. The conjugate gradient method has been successfully applied to classification in the big data context due to its scalable computational complexity; see Komarek (2004). In fact, the number of iterations needed for a convex quadratic problem is O(p), and the implementation requires approximately p matrix-vector multiplications; see, for example, Antoniou and Lu (2007) for details. To carry out the IRLS algorithm, the minimization problem at the (k + 1)st step can be expressed in terms of $H = 2Z^T W_k Z$ and $b = 2Z^T W_k z^{(k)}$, where $W_k$ and $z^{(k)}$ are defined as in (13) and (14). In big data computation, it may not be desirable to compute the matrix H in every iteration of IRLS, since matrix-matrix operations are expensive. But H is only used in matrix-vector multiplications, so one can compute $2Z^T W_k (Z x)$ instead, for some vector x of size p × 1. This operation is computable in a map-reduce framework; see Section 4.7. The vector b is also of size p × 1, and it can be computed with map-reduce and stored in memory.
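The conjugate gradient solver needs H only through products Hv, which is what makes the matrix-free strategy possible; a minimal sketch (names are ours):

```python
import numpy as np

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=None):
    """Solve H x = b for symmetric positive-definite H, accessing H only
    through matvec(v) = H v; exact in at most p = len(b) iterations."""
    x = np.zeros_like(b, dtype=float)
    r = b - matvec(x)            # residual
    d = r.copy()                 # search direction, H-conjugate to previous ones
    rs = r @ r
    for _ in range(max_iter or len(b)):
        Hd = matvec(d)
        step = rs / (d @ Hd)
        x += step * d
        r -= step * Hd
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return x
```

In the IRLS context one would pass, for example, matvec = lambda v: 2.0 * (Z.T @ (w * (Z @ v))), so that H = 2 Z^T W_k Z is never formed explicitly.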

Penalized Maximum Likelihood Estimation
The MLE obtained using IRLS in Section 4.3 does not provide any sparsity. One way to enforce sparsity of the identified model is to maximize the penalized log-likelihood in (9). This optimization problem poses computational challenges since it is nonconvex. A popular approach is to approximate the log-likelihood function by a quadratic function. Let $l_0(\alpha) = l_\alpha(\hat\alpha_0^{\mathrm{MLE}}, \alpha)$ be the log-likelihood function after plugging in $\hat\alpha_0^{\mathrm{MLE}}$. An approximation of $l_0(\alpha)$ can be obtained via a Taylor series expansion at $\hat\alpha^{\mathrm{MLE}}$, that is,
$$l_0(\alpha) \approx -\tfrac{1}{2}\, (z - X\alpha)^T W (z - X\alpha) + \text{const}, \qquad (15)$$
where $z = X\hat\alpha^{\mathrm{MLE}}$ and W is the weight matrix evaluated at the MLE. Thus, the penalized criterion (9) can be expressed as
$$-\tfrac{1}{2}\, (z - X\alpha)^T W (z - X\alpha) - \lambda \sum_{g=1}^{G} \big\|\alpha_{A_g}\big\|_1^{\gamma}.$$
In the special case γ = 1, the penalty is essentially the LASSO penalty (Tibshirani 1996). For γ ∈ (0, 1), the optimization problem is nonconvex, and Huang et al. (2009) provided an iterative solution:
1. For a given $\lambda_n$, calculate $\tau_n = \lambda_n^{1/(1-\gamma)}\, \gamma^{\gamma/(1-\gamma)}\, (1 - \gamma)$.
2. For s = 1, 2, . . ., until convergence, compute
$$\theta_g^{(s)} = \Big(\frac{1-\gamma}{\tau_n\, \gamma}\Big)^{\gamma} \big\|\alpha_{A_g}^{(s-1)}\big\|_1^{\gamma}$$
for g = 1, . . . , G, and update $\alpha^{(s)}$ by maximizing the quadratic approximation penalized by $\sum_{g} \big(\theta_g^{(s)}\big)^{1-1/\gamma} \|\alpha_{A_g}\|_1$.
Each stage of the iterative procedure can be reduced to the minimization of a quadratic form with an L1 penalty applied to the coefficients α. For a small dataset, it can be efficiently solved by the LARS algorithm (Efron et al. 2004). However, in the big data context, this algorithm may not be feasible. Solving this problem with a large dataset is the subject of the next section.
The group bridge penalty described above can encounter stability issues in calculations (Percival 2012). An alternative approach is to use the SCAD penalty (Fan and Li 2001) $p_\lambda(\cdot)$, which satisfies $p_\lambda(0) = 0$ and
$$p'_\lambda(t) = \lambda \Big\{ I(t \le \lambda) + \frac{(a\lambda - t)_+}{(a-1)\lambda}\, I(t > \lambda) \Big\} \quad \text{for } t > 0,$$
for a constant a usually equal to 3.7. An analog of (9), equipped with the SCAD penalty, can be expressed as
$$l_\alpha(\alpha_0, \alpha) - \sum_{j} p_\lambda(|\alpha_j|). \qquad (16)$$
Zou and Li (2008) proposed a one-step estimation procedure, which is based on a linear approximation of the SCAD penalty function. Thus, (16) can be written as a standard LASSO problem, that is, a quadratic approximation of the log-likelihood penalized by $\sum_j p'_\lambda\big(|\hat\alpha_j^{\mathrm{MLE}}|\big)\, |\alpha_j|$. As will be seen in Section 5, in our application, the estimates using the group bridge penalty and the SCAD penalty are comparable.
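The one-step procedure only needs the SCAD derivative evaluated at the MLE; a minimal sketch of $p'_\lambda$ (names are ours):

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """SCAD penalty derivative for t >= 0 (Fan and Li 2001): constant at lam
    on [0, lam], linearly decaying on (lam, a*lam], and zero beyond a*lam."""
    t = np.asarray(t, dtype=float)
    return lam * ((t <= lam) +
                  np.maximum(a * lam - t, 0.0) / ((a - 1.0) * lam) * (t > lam))

# Per-coefficient LASSO weights for the one-step estimator of Zou and Li (2008):
w = scad_deriv(np.abs(np.array([0.0, 0.5, 1.0, 2.0, 5.0])), lam=1.0)
```

Coefficients with large MLE magnitudes receive zero weight and are left unpenalized, which is how the one-step estimator avoids the bias of the plain LASSO.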

L 1 Regularization via Coordinate Descent Algorithm
As outlined in Section 4.4, parameter estimation with the group bridge penalty and the SCAD penalty can be reduced to the LASSO problem. LASSO has an efficient implementation using a variation of the LARS algorithm (Efron et al. 2004), but that algorithm is not well suited for operating on a design matrix of high dimension. Another implementation was suggested by Friedman et al. (2007) and Friedman, Hastie, and Tibshirani (2010). The elastic-net penalty, of which LASSO is a special case, can be computed by coordinate descent and soft thresholding. The algorithm is applicable to linear models or generalized linear models, in which case it can be integrated with the IRLS steps.
The idea of soft thresholding was proposed by Donoho and Johnstone (1995); it was designed to recover a signal in a noisy environment by providing an adaptable threshold. This approach is motivated by a simple optimization problem: $\min_t \{ \tfrac{1}{2}(t - t_0)^2 + \lambda|t| \}$, where λ > 0 and $t_0$ is a fixed constant. The solution can be written as $t_{\min} = S(t_0, \lambda) \equiv \mathrm{sign}(t_0)(|t_0| - \lambda)_+$. This observation motivated the coordinate descent algorithm for the LASSO solution: starting from any given parameter values, the parameters are updated one at a time using soft thresholding; see Friedman, Hastie, and Tibshirani (2010) for the detailed algorithm. When implementing this algorithm, only matrix-vector products of the form $X^T y$ are necessary, and they can be carried out efficiently under the map-reduce framework.
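Coordinate descent with soft thresholding for the Gaussian working-response LASSO can be sketched as follows; an illustrative implementation (names are ours), assuming no column of X is identically zero:

```python
import numpy as np

def soft_threshold(t0, lam):
    """S(t0, lam) = sign(t0) * (|t0| - lam)_+ ."""
    return np.sign(t0) * np.maximum(np.abs(t0) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for
    (1/(2n)) ||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X**2).sum(axis=0) / n
    r = y - X @ beta                       # full residual
    for _ in range(n_sweeps):
        for j in range(p):
            r += X[:, j] * beta[j]         # add coordinate j back in
            t0 = X[:, j] @ r / n
            beta[j] = soft_threshold(t0, lam) / col_sq[j]
            r -= X[:, j] * beta[j]
    return beta
```

Each inner update touches the data only through a single column-residual product, which is why the algorithm fits naturally into the row-chunked map-reduce scheme of Section 4.7.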

Tuning Parameters
The performance of parameter estimation relies on the choice of λ and γ. Here, the parameter γ is fixed at 0.5 following recommendations in Huang et al. (2009). The common procedure for selecting λ is to proceed with model identification for multiple values of the coefficient and choose one according to a selected criterion, for example, Akaike's information criterion (AIC; Akaike 1973), Schwarz's Bayesian criterion (BIC; Schwarz 1978), or, especially for larger models, the extended BIC (EBIC; Chen and Chen 2008). Here, we use the smallest EBIC value as the selection criterion for λ. EBIC adds an extra component to BIC, namely the logarithm of the number of models of the given size. Such criterion-based methods are suitable for selection in the first-order model, but more problematic for the second-order model, or for any large dataset in general. The selection of λ for the second-order model requires multiple identification efforts, and the time cost (a major concern for big data analytics) does not allow an exhaustive search. For the second-order model, we suggest searching for the tuning parameter among a predetermined increasing sequence of candidates starting from the value selected for the first-order model.
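The EBIC value used for selecting λ can be computed directly from the fitted model's log-likelihood and size; a minimal sketch (names are ours), with the EBIC weight on the model-count term left as a parameter:

```python
import numpy as np
from scipy.special import gammaln

def ebic(loglik, k, n, p, gamma_ebic=1.0):
    """Extended BIC (Chen and Chen 2008): BIC plus a term proportional to
    the log of the number of candidate models with k of p parameters."""
    log_n_models = gammaln(p + 1) - gammaln(k + 1) - gammaln(p - k + 1)
    return -2.0 * loglik + k * np.log(n) + 2.0 * gamma_ebic * log_n_models
```

With k = 0 the extra terms vanish and EBIC coincides with the unpenalized deviance; for a fixed fit quality, larger models are penalized both through k log n and through the log binomial coefficient.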

Big Matrix Processing
Computations discussed in Sections 4.3 and 4.4 rely on just three types of matrix-vector operations, specifically $X^T W y$, $X\beta$, and $X^T W X\beta$. Here, X is the N × p design matrix, W is the diagonal N × N matrix of weights, y is an N × 1 vector, and β is a p × 1 vector. In our setting, N = n − m, and both N and p are large.
We perform all matrix-vector operations in the map-reduce framework; see Dean and Ghemawat (2008) for details. Map-reduce is an algorithmic concept based on the division of tasks (between mappers) and the assembly of results (by reducers). Any matrix-vector operation can be divided into K smaller tasks executed by mappers, with the results of these tasks collected by reducers. For instance, denote by $X_{[k]}$ the kth part of matrix X, consisting of $n_k$ consecutive rows. Also, denote by $W_{[k]}$ the diagonal matrix of size $n_k \times n_k$, with weights corresponding to the rows of $X_{[k]}$, and by $y_{[k]}$ a vector of size $n_k \times 1$ with the elements of y corresponding to the rows of $X_{[k]}$. One can easily verify the following:
$$X^T W y = \sum_{k=1}^{K} X_{[k]}^T W_{[k]} y_{[k]}. \qquad (17)$$
These tasks can be divided between C servers, or C cores, where C ≪ N. The distinct processes or threads will execute all K tasks, with at most C executed in parallel. The design matrix X described in Section 4.2 is too large to fit in the memory of a single computer. An established approach is to distribute the matrix row-wise between servers in a Hadoop cluster or a Spark cluster. Hadoop and Spark are Apache Software Foundation implementations of map-reduce in the distributed environment. Both impose a steep penalty in terms of computing time, primarily due to the distributed communications. Spark is the better choice for statistical computations, which often require hundreds of operations on the same matrix. The performance results, presented in Table 1, are not reassuring, especially if the cluster is small.
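The identity behind the row-chunked tasks is easy to check numerically; a minimal sketch (names are ours) of the mapper/reducer split for $X^T W y$:

```python
import numpy as np

def xtwy_chunked(X, w, y, n_chunks=4):
    """Compute X^T W y as a sum over row chunks: each 'mapper' handles a
    block of consecutive rows, and the 'reducer' adds the partial results."""
    parts = []
    for Xk, wk, yk in zip(np.array_split(X, n_chunks),
                          np.array_split(w, n_chunks),
                          np.array_split(y, n_chunks)):
        parts.append(Xk.T @ (wk * yk))     # task for chunk k
    return np.sum(parts, axis=0)           # reduce step
```

The same row-wise split applies verbatim to $X\beta$ and $X^T W X\beta$, since each depends on the data only through per-row contributions.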
Here, we propose a much more efficient computational strategy than the established approach based on distributed computations. Our method supports map-reduce computations on a single, multicore server. As shown in Section 4.2, X consists of two column-wise parts, specifically X_1, which corresponds to the first-order kernels, and X_2, which corresponds to the second-order kernels. In addition, as demonstrated in (12), each row of X_2 is a modified Kronecker product of a part of the corresponding row of X_1 (excluding duplicate entries and the columns corresponding to ϕ^(1)_{d+1}). When a matrix-vector operation is executed on a multicore server, we set K = N, so that X^[k] = x_k^T, and each simple task involves a single row of the matrix X. It is easy to see that, for instance,

x_k^T β = (x_k^(1))^T β^(1) + (x_k^(1) ⊗ x_k^(1))^T β^(2),    (18)

where x_k^(1) is the part of x_k corresponding to the first-order components, and β^(1) and β^(2) are the parts of β corresponding to the first- and second-order components, respectively. Also, ⊗ is a modified Kronecker product, which contains only the unique components of the product and does not contain the elements of x_k corresponding to ϕ^(1)_{d+1}. Each of the K tasks can thus be executed without having x_k^(2) available, and the entire operation can be executed without X_2 ever being present in memory. In our application, X_1, with over 5 million rows and 120 columns, requires around 9 gigabytes of storage. The matrix X_1 fits in the memory of a single computer, but some degree of parallelization of tasks is needed to offset the additional complexity of (18). If X_1 is too large for a single computer, it can be placed in a distributed filesystem; the single-machine strategy will not apply, but the methods described here will still be applicable.

Table 1. Performance results for the second-order model with map-reduce for different hardware configurations: a multicore server with varying numbers of cores, a Hadoop cluster with four servers, and a Spark cluster running on top of the Hadoop distributed file system (HDFS).
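A minimal sketch of rebuilding a second-order row on the fly, as in (18): here the "modified" Kronecker product simply keeps the unique products x_i x_j with i ≤ j, and the handling of the columns corresponding to ϕ^(1)_{d+1} is omitted for brevity (names are ours):

```python
import numpy as np
from itertools import combinations_with_replacement

def modified_kron(x1_row):
    """Unique entries x_i * x_j (i <= j) of the Kronecker product of a
    first-order row with itself -- the corresponding row of X2."""
    pairs = combinations_with_replacement(range(len(x1_row)), 2)
    return np.array([x1_row[i] * x1_row[j] for i, j in pairs])

def row_linear_predictor(x1_row, beta1, beta2):
    """x_k^T beta computed from the first-order row alone, so the
    second-order matrix X2 never has to be held in memory."""
    return x1_row @ beta1 + modified_kron(x1_row) @ beta2
```

Each per-row task recreates its slice of X_2 from X_1, trading a few extra multiplications for never storing the much wider second-order matrix.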
The tests are performed for the first-order matrix with dimensions N = 10^6, p_1 = 120. The dimensions of the second-order matrix are N = 10^6, p_2 = 5580. The operations listed in the first column are the only big data operations used in our proposed method. The next three columns show the approximate number of big data operations per method used in this article: conjugate gradient (CG), iteratively reweighted least squares (IRLS), and coordinate descent (CD).


Analysis of Neural Spikes
In this section, we analyze the spike train data described in Section 2. When discussing the model in Section 3, we introduced several system parameters, for example, the system memory. Here, the system memory length is chosen to be M = 400 msec. Based on earlier studies (e.g., Song et al. 2013), the memory length is generally less than 1 sec. In our data analysis, the results for M = 400 msec and M = 1000 msec are very similar; for illustration purposes, we present only the results with M = 400 msec.
Another system parameter is J, the number of B-spline basis functions in the approximations (5) and (6). We choose cubic splines with 10 equally spaced knots on the interval (0, M), plus two additional knots near zero, where, according to neuroscientists, the signal is strong. These knot choices result in J = 15. The number of equally spaced knots and the number of additional near-zero knots are selected from a small grid using EBIC for the first-order model. All observations are used to fit the first- and second-order models. The first-order model contains only the intercept and the first-order Volterra kernels; no interaction components are included. The second-order model contains all components listed in (2). Both models estimate first-order Volterra kernels, which can be compared in Figures 3-6.
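For concreteness, such a basis can be evaluated with the Cox-de Boor recursion. The near-zero knot positions below (and the resulting basis count) are illustrative placeholders, not the values used in the analysis:

```python
import numpy as np

def bspline_design(x, knots, k=3):
    """Evaluate all degree-k B-spline basis functions at the points x
    via the Cox-de Boor recursion; returns a (len(x), J) design matrix
    with J = len(knots) - k - 1."""
    t = np.asarray(knots, dtype=float)
    x = np.asarray(x, dtype=float)
    # degree-0 indicator functions on each knot interval
    B = np.array([(t[i] <= x) & (x < t[i + 1]) for i in range(len(t) - 1)],
                 dtype=float)
    for d in range(1, k + 1):
        Bn = np.zeros((len(t) - d - 1, len(x)))
        for i in range(len(t) - d - 1):
            if t[i + d] > t[i]:
                Bn[i] += (x - t[i]) / (t[i + d] - t[i]) * B[i]
            if t[i + d + 1] > t[i + 1]:
                Bn[i] += (t[i + d + 1] - x) / (t[i + d + 1] - t[i + 1]) * B[i + 1]
        B = Bn
    return B.T

M = 400.0
interior = np.sort(np.concatenate([
    [5.0, 15.0],                    # extra knots near zero (hypothetical)
    np.linspace(0.0, M, 12)[1:-1],  # 10 equally spaced interior knots
]))
knots = np.concatenate([[0.0] * 4, interior, [M] * 4])  # clamped cubic knots
Phi = bspline_design(np.linspace(0.0, 399.0, 50), knots)
```

With clamped boundary knots, the basis functions sum to one on [0, M), which is a convenient sanity check on the construction.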
Only the second-order model estimates second-order Volterra kernels, which are presented in Figure A of the supplementary materials.

First-Order Model
In this section, we model the probability of neural spikes as a GLM defined in (10) with the matrix X = X_1 and its elements defined in (11). Even though the number of observations n is over five million, the matrix X_1 can be computed and stored in memory for all calculations. Here, we assume that ϕ(mδ) = 0. We also considered the constraint ϕ(0) = 0, because the impact of an input spike is expected to have a positive delay. However, the delay might not be detectable in the data if the sampling rate is not high enough, in which case the value of ϕ(0) may correspond to the maximum of the functional. In fact, this turns out to be the case for all seven input neurons. Figures 3 and 4 show the results of the first-order model fit, with the unpenalized estimate depicted by a solid line and three penalized fits depicted by dashed (group bridge), dotted (group SCAD), and dash-dotted (LASSO) lines. The zoomed-in view in each image shows the differences in sparsity estimation among the four fits. The group bridge fit exhibits local sparsity in four out of seven neurons; the group SCAD and LASSO fits do not achieve local sparsity in the examined range. The group bridge estimate was calculated with λ_n = 40, selected using the EBIC criterion. The group bridge fit has a lower EBIC value than the group SCAD and LASSO fits, so we employ it as the regularization method in the second-order model. Figure 5 depicts the unpenalized and group bridge estimates of the feedback kernel. The negative value at 0 confirms the expected refractory period following a recent spike. The positive value close to 0 could be an indication of a clustering effect in neural spikes, which can also be observed in Figure 1. Local sparsity is not detected by either estimator.

Second-Order Model
In this section, we model the probability of neural spikes as a GLM defined in (10) with the matrix X = [X_1, X_2] and its elements defined in (11) and (12). Only the matrix X_1 is computed; the matrix X_2 is created dynamically to perform the necessary matrix operations, as explained in Section 4.7. Here, we compute two estimates, the maximum likelihood estimate and the penalized maximum likelihood estimate using the group bridge penalty. Recall that this penalty function outperforms the others in the first-order analysis based on the EBIC criterion.
In Figure 6, the functional estimates of the first-order kernels ϕ^(1)_i are very close to their first-order-model counterparts depicted in Figures 3-5, but the local sparsity in the second-order model is more pronounced. The estimated interaction components, that is, the second-order kernels, are presented in Figure A of the supplementary materials. Only four second-order kernels are shown: ϕ_11, ϕ_37, ϕ_57, and ϕ_67. The remaining kernels exhibit global sparsity, which suggests that no interaction exists between the corresponding input neurons, or within the same input neuron at different time lags.
To examine the computational performance in a big data context, we implemented our proposed method in different hardware environments. Computing times for the second-order model fit are presented in Table 1. Columns 2-4 show the approximate number of matrix-vector operations required to achieve an acceptable numerical approximation for each method used in this article. These numbers are data dependent but are reported to show the overall computational complexity. Columns 5-10 provide exact timings for the matrix-vector operations in different hardware configurations. The results are rather unfavorable for the Hadoop cluster. This may be partially because the performance test was conducted in R, which is not a very efficient computing platform. The Spark test uses Python, which is known to be more efficient at matrix operations, but even those results cannot match the in-memory multicore operations, which are executed using R and C. As the results make clear, the penalty for distributing data across multiple servers is severe: the computational cost of recreating the data (e.g., recreating the matrix columns that constitute a Kronecker product of the existing columns) is much smaller than the cost of distributed communication. The clusters we used are, however, very small, and larger clusters would drive the computing times down. On the other hand, multicore servers with 250 or more cores are becoming available, which should make the in-memory identification of second- and higher-order models both faster and feasible.

Model Validation
Figure 6. Estimated first-order kernels in the second-order model: unpenalized estimation (solid) and group bridge estimation (dashed) of the ϕ functionals corresponding to the input neurons x_1, x_3, x_4, x_5, x_6, x_7. Local sparsity is visible for the penalized version in all neurons; the start of sparsity is marked with an arrow. Neuron x_2 exhibits global sparsity and is not shown.

Model validation for neuron spike train data in functional dynamical settings is a challenging task and is still an area of active research. Brown et al. (2002) proposed to use the time-rescaling theorem to evaluate the goodness of fit of a neural spike model. The method relies on an estimate of the spike rate (a.k.a. intensity) function in continuous time. The intensity function, integrated between two adjacent observed output spikes, is assumed to follow an exponential distribution with unit rate. Thus, after an appropriate transformation to a uniform distribution, the Kolmogorov-Smirnov (KS) test can be used to compare against a reference uniform distribution and thus test the goodness of fit.
It is more practical to consider time as a discrete rather than a continuous variable. Haslinger, Pipa, and Brown (2010) focused on a discretized model for neuron spikes, in which time is divided into short intervals, or bins, and a spike train is encoded as a binary sequence of 0s and 1s according to whether a spike was observed in a given bin. Those authors pointed out that the KS test in Brown et al. (2002) may exhibit biases under discretization. The biases are caused by physical and numerical constraints: neural spikes are not instantaneous events (they are believed to last about 1 msec), and the bin length creates a lower bound on the interspike interval. Under discretization, the distribution of interspike intervals is geometric rather than exponential, and the exponential approximation may not be valid for large firing probabilities. Haslinger, Pipa, and Brown (2010) therefore suggested an improved rescaling of the estimated firing probabilities for bias correction.
In our analysis, we adopt the discrete time-rescaling theorem (Haslinger, Pipa, and Brown 2010) and transform the estimated probabilities θ(t) to the uniform distribution as suggested by these authors. The resulting KS plot is shown as Figure B in the supplementary materials; the overall model fit is satisfactory. The transformed data, depicted as a dashed black line, fall within the 95% confidence bounds of a standard uniform distribution in all but a short interval near zero. A small bias is visible for low probabilities, which is expected due to the discretization of the time interval and the dependence on the spike history. This phenomenon was previously reported in Haslinger, Pipa, and Brown (2010); in fact, such biases are visible even in the KS plot generated from the true model.
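A bare-bones version of this check can be sketched as follows (without the final-bin randomization that Haslinger, Pipa, and Brown use for bias correction; names are ours): accumulate q_t = -log(1 - θ_t) between consecutive output spikes and map each rescaled interval z to u = 1 - exp(-z), which should be approximately Uniform(0, 1) under a correct model.

```python
import numpy as np

def discrete_time_rescaling(spikes, theta):
    """Map interspike intervals to approximately Uniform(0,1) values
    under the fitted per-bin spiking probabilities theta.
    Returns the sorted u values and their KS distance from uniform."""
    theta = np.clip(np.asarray(theta, dtype=float), 0.0, 1.0 - 1e-12)
    q = -np.log1p(-theta)                    # -log(1 - theta) per bin
    cq = np.concatenate([[0.0], np.cumsum(q)])
    spike_bins = np.flatnonzero(spikes)
    z = np.diff(cq[spike_bins + 1])          # rescaled interspike intervals
    u = np.sort(1.0 - np.exp(-z))
    n = len(u)
    ks = np.max(np.abs(u - (np.arange(1, n + 1) - 0.5) / n))
    return u, ks                             # compare ks to 1.36 / sqrt(n)
```

In a KS plot, the sorted u values are drawn against the uniform quantiles, with 95% bounds of roughly ±1.36/√n around the diagonal.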

Supplementary Materials
Document: This document includes pseudo-code for computations and additional images for the second-order model and model validation.