Explainable Multimodal Deep Dictionary Learning to Capture Developmental Differences From Three fMRI Paradigms

Abstract—Objective: Multimodal-based methods show great potential for neuroscience studies by integrating complementary information, yet there has been less multimodal work focused on brain developmental changes. Methods: We propose an explainable multimodal deep dictionary learning method to uncover both the commonality and specificity of different modalities, which learns the shared dictionary and the modality-specific sparse representations based on the multimodal data and their encodings from a sparse deep autoencoder. Results: Regarding three fMRI paradigms collected during two tasks and resting state as modalities, we apply the proposed method to the multimodal data to identify brain developmental differences. The results show that the proposed model not only achieves better reconstruction performance, but also yields age-related differences in reoccurring patterns. Specifically, both children and young adults prefer to switch among states during the two tasks while staying within a particular state during rest; the difference is that children possess more diffuse functional connectivity patterns whereas young adults have more focused ones. Conclusion and Significance: To uncover the commonality and specificity of three fMRI paradigms with respect to developmental differences, multimodal data and their encodings are used to train the shared dictionary and the modality-specific sparse representations. Identifying brain network differences helps to understand how neural circuits and brain networks form and develop with age.

I. INTRODUCTION

Normal brain development is a complex process, from the establishment of basic cognitive functions in childhood to the gradual maturation of more complex self-regulatory functions throughout adolescence [1], [2], [3]. Functional magnetic resonance imaging (fMRI) can capture hemodynamic responses to neuronal activities by measuring the blood oxygenation level-dependent (BOLD) signal, based on which the changes in neural interaction and integration between functionally interconnected regions during development can be revealed [4]. Compared with BOLD signals, dynamic functional connectivity (dFC) measured by a sliding-window approach can reflect time-varying dependencies between spatially separated brain regions.
It helps to quantify the changes in correlation strength between the functional activities of paired brain regions over time. Thus, there has recently been growing interest in identifying recurring whole-brain functional connectivity patterns (i.e., states) based on dFC. These studies aim to divide the whole-brain dFC profiles into distinct states observed reliably across subjects throughout the fMRI scans [5], [6], [7], [8]. This enables us to investigate the differences in states related to brain development, capture the transition mechanism among these states, and provide insights into neural brain dynamics from the perspective of functional connectivity [4], [5].
Compared with single-modality methods for fMRI analysis, multimodal-based methods can take advantage of the complementary information provided by different modalities. Studies have shown that integrating multimodal priors or combining complementary information from diverse modalities can promote model enhancement and diagnosis [9], [10], [11]. Many methods have been extended for multimodal data integration, including multitask learning, linear regression, neural networks, support vector machines, and dictionary learning [9], [10], [11], [12], [13], [14]. Owing to its ability to reduce dimensionality and identify the reoccurring patterns of dFC [4], multimodal dictionary learning has attracted considerable attention. For example, Li et al. [11] proposed a multimodal discriminative dictionary learning (mSCDDL) method based on a weighted combination strategy and applied it to fuse information from structural magnetic resonance imaging and fluorodeoxyglucose positron emission tomography for Alzheimer's disease classification. In [15], an ℓ1-norm regularized dictionary learning approach was proposed to identify epilepsy-related dFC states, where time courses representative of epileptic activity extracted from electroencephalography are incorporated into the fMRI for dFC state analysis. In [16], multimodal dictionary learning was applied to the diagnosis of schizophrenia, embedding the correlation information of multimodal data into the learning model. Additionally, to capture the nonlinearity or higher-level features of data, Li et al. [10] improved mSCDDL with a multi-feature kernel trick to obtain nonlinear representations of the data. D'Souza et al. [17] proposed a framework for the diagnosis of Autism Spectrum Disorder, which couples a structurally-regularized dynamic dictionary learning model (sr-DDL) with a deep network to predict behavioral scores, where the dFCs of fMRI are decomposed by sr-DDL while the decomposition is constrained by the FCs of diffusion tensor imaging.
Of particular note, the aforementioned methods either fail to uncover both the commonality and specificity of different modalities, overlook the nonlinear higher-level features of data, or lack explainability (i.e., they fail to identify the reoccurring patterns of dFC, or the brain regions and FCs related to development or disease). To address these issues, we propose an explainable multimodal deep dictionary learning (EMDDL) method, which connects multimodal dictionary learning in the original space and the encoding space through a sparse deep autoencoder (sDAE). Within this framework, all modalities share the same dictionary to reveal the inherent commonality. To achieve the specificity of each modality, a Fisher cost is used to constrain the sparse representations, owing to its ability to learn modality-specific features by avoiding overlap of neighboring pairs between different modalities. Moreover, the shared dictionary and the modality-specific sparse representations are learned based on the multimodal data and their encodings from the sDAE. In this way, multimodal dictionary learning can attain the nonlinear higher-level features while reconstructing the original data for identifying the reoccurring patterns or functional connectivity related to development. To maintain the complex relationships among subjects, a hypergraph Laplacian regularization is used, which helps to enhance the learning ability through prior knowledge.
We apply EMDDL to the multimodal data from Philadelphia Neurodevelopmental Cohort (PNC) to recognize the developmental differences between children and young adults, where the three fMRI paradigms collected during two tasks and resting state are regarded as modalities. We found that both children and young adults tend to switch frequently among states during two tasks and stay within a particular state during rest. The main difference is that children have more diffuse functional connectivity patterns while young adults possess more focused functional connectivity patterns under three fMRI paradigms. Besides, the differences in functional connectivity between children and young adults are mainly related to information processing, cognition, emotion, and working memory under three fMRI paradigms.

II. PRELIMINARY WORK
In this section, some preliminary work is presented including hypergraph learning to preserve the higher-order relationships among subjects and Fisher cost to extract modality-specific features.

A. Hypergraph Learning
Given that traditional graph learning inevitably loses information by squeezing complex relationships into pairwise ones, hypergraphs have been widely applied to identify high-order relationships among subjects [18], [19]. Generally, a hypergraph G = (V, E, W) consists of three parts, namely, the vertex set V = {V_i | i = 1, 2, ..., N_v}, the hyperedge set E = {E_i | i = 1, 2, ..., N_e}, and the hyperedge weights W = {W_i | i = 1, 2, ..., N_e}. To represent the relationships between hyperedges and vertices, the incidence matrix H ∈ R^(N_v×N_e) of hypergraph G is defined by

H(i, j) = 1 if V_i ∈ E_j, and H(i, j) = 0 otherwise,

where the (i, j)-th entry of H denotes whether the i-th vertex belongs to the j-th hyperedge. Based on the incidence matrix H, the degree of the i-th vertex is d(V_i) = Σ_j W_j H(i, j) and the degree of the j-th hyperedge is δ(E_j) = Σ_i H(i, j); placing these degrees on the diagonals yields the vertex degree matrix D_v ∈ R^(N_v×N_v) and the hyperedge degree matrix D_e ∈ R^(N_e×N_e). To construct a hypergraph, the k-nearest-neighbor strategy is usually applied because the geometric structure of the data can be approximately represented by the nearest-neighbor graph [18], [20]. Specifically, for a chosen vertex, the distances between the chosen vertex and all other vertices are calculated, and then the k nearest vertices are connected by a hyperedge; the weight W_i of the i-th hyperedge is computed from the distances among its vertices. To obtain the diagonal matrix W_h ∈ R^(N_e×N_e), the hyperedge weight W_i is arrayed as the i-th diagonal element of W_h. By analogy with the definition of the simple graph Laplacian matrix [21], the hypergraph Laplacian matrix is defined as

L_h = D_v − S,   (1)

where S = H W_h D_e^(−1) H^T is the similarity matrix defining the similarity between each pair of vertices.
Compared with the traditional graph Laplacian regularization, hypergraph Laplacian regularization has the characteristics of preserving complex local geometric structure and incorporating the higher-order relationships among subjects, which are conducive to classification or clustering tasks in FC or dFC analysis [19].
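To make the construction above concrete, the following sketch builds a k-nearest-neighbor hypergraph and its unnormalized Laplacian in NumPy. Uniform hyperedge weights are a simplifying assumption here (distance-based weights are also common), and the function name is illustrative:

```python
import numpy as np

def knn_hypergraph_laplacian(X, k=2):
    """Build a kNN hypergraph from the columns of X and return its Laplacian.

    Each vertex spawns one hyperedge connecting it to its k nearest
    neighbors; hyperedge weights are uniform here (a simplifying assumption).
    """
    p, n = X.shape
    # Pairwise squared Euclidean distances between vertices (columns of X).
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    H = np.zeros((n, n))                      # incidence: vertices x hyperedges
    for j in range(n):
        nbrs = np.argsort(d2[:, j])[:k + 1]   # vertex j plus its k nearest
        H[nbrs, j] = 1.0
    W_h = np.eye(n)                           # uniform hyperedge weights
    D_v = np.diag(H @ np.diag(W_h))           # vertex degrees
    D_e = np.diag(H.sum(axis=0))              # hyperedge degrees
    S = H @ W_h @ np.linalg.inv(D_e) @ H.T    # similarity matrix
    return D_v - S                            # unnormalized hypergraph Laplacian
```

A useful sanity check is that the rows of L_h = D_v − S sum to zero and that L_h is symmetric positive semidefinite, as for a simple graph Laplacian.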

B. Fisher Cost
The Fisher discrimination criterion clusters the samples within the same modality and keeps the samples in different modalities as far from each other as possible, which helps to extract features corresponding to a specific modality [22], [23], [24]. Assume that the multimodal data X = (x_1, x_2, ..., x_N) ∈ R^(p×N) contains M modalities, with N_m samples belonging to the m-th modality and Σ_{m=1}^{M} N_m = N, where the p-dimensional vector x_n is the n-th sample of X. The within-modality scatter matrix S_w and the between-modality scatter matrix S_b of the samples are defined as

S_w = Σ_{m=1}^{M} Σ_{x_n ∈ modality m} (x_n − μ_m)(x_n − μ_m)^T,
S_b = Σ_{m=1}^{M} N_m (μ_m − μ)(μ_m − μ)^T,

where μ_m = (1/N_m) Σ_{x_n ∈ modality m} x_n and μ = (1/N) Σ_{n=1}^{N} x_n are the modality mean and the overall mean, respectively. Then, the Fisher cost is

F(X) = tr(S_w) − tr(S_b) + ||X||_F^2,

in which the Frobenius norm || · ||_F is added to ensure the convexity of the cost function [24].
To get a more concise expression and facilitate calculation [22], the Fisher cost F(X) can be rewritten as

F(X) = tr(X F X^T),   (2)

where F = 2I − 2F_1 + F_2 ∈ R^(N×N), with I ∈ R^(N×N) being the identity matrix, F_1 ∈ R^(N×N) being the block diagonal matrix whose m-th diagonal block is of size N_m × N_m with all entries equal to 1/N_m, and F_2 ∈ R^(N×N) having all entries equal to 1/N.
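The factorized form of the Fisher cost can be checked numerically. The sketch below builds F = 2I − 2F_1 + F_2 from the modality sample counts; the helper name is ours, not the paper's:

```python
import numpy as np

def fisher_cost_matrix(sizes):
    """Return F = 2I - 2F1 + F2 for modality sample counts `sizes`.

    F1 is block diagonal with constant blocks 1/N_m; F2 is the constant
    matrix 1/N.  Then tr(X F X^T) equals tr(S_w) - tr(S_b) + ||X||_F^2.
    """
    N = sum(sizes)
    F1 = np.zeros((N, N))
    start = 0
    for Nm in sizes:
        F1[start:start + Nm, start:start + Nm] = 1.0 / Nm
        start += Nm
    F2 = np.full((N, N), 1.0 / N)
    return 2 * np.eye(N) - 2 * F1 + F2
```

Evaluating tr(X F X^T) with this matrix and comparing it against the scatter-matrix form computed directly confirms the identity.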

III. METHODOLOGY
The details of EMDDL and the corresponding optimization algorithm are presented in this section, which can learn the shared dictionary and modality-specific sparse representations in both the original space and the encoding space.

A. Explainable Multimodal Deep Dictionary Learning
Multimodal dictionary learning methods can not only embed the high-dimensional features into low-dimensional space, but also boost learning performance with the combination of multiple modalities [12]. However, most of the existing methods either cannot simultaneously reveal the inherent commonality and specificity of different modalities, or overlook the nonlinear higher-level features of data, or have difficulty in explainability. To address these problems, we propose EMDDL which couples multimodal dictionary learning with sDAE. Specifically, by sharing the same dictionary through all modalities to capture the inherent commonality and constraining sparse representations with Fisher cost to obtain the specificity of each modality, the inherent commonality and the specificity of different modalities can be concurrently achieved in multimodal dictionary learning. Moreover, to achieve the nonlinear higher-level features of data and reconstruct the original data to identify the developmental differences in reoccurring patterns or FCs, both the shared dictionary and the modality-specific sparse representations are learned not only in the original space, but also in the encoding space of the sDAE at the same time. By alternating minimization algorithms, the sDAE, dictionary, and sparse representations can be sequentially obtained. The flowchart of EMDDL is shown in Fig. 1.
Suppose that there are M modalities, with N_m samples belonging to the m-th modality, and the training data X = (x_1, x_2, ..., x_N) ∈ R^(p×N) collects all Σ_{m=1}^{M} N_m = N samples. Besides, the sDAE contains 2L + 1 layers with r^(l) neurons in the l-th layer, and r^(2L−l) = r^(l) holds for l = 0, 1, ..., L.
EMDDL contains two parts, J_sDAE and J_MDL, where J_sDAE is to efficiently learn the nonlinear higher-level features of data and J_MDL is to train the multimodal dictionary learning in both the original space and the encoding space. The objective function of EMDDL is defined as

min_{{W^(l)}, D, V}  J_sDAE + J_MDL,  s.t. ||d_k||_2 = 1, k = 1, 2, ..., K,   (3)
where J_sDAE and J_MDL are defined as

J_sDAE = J_recon + λ_1 J_KL + λ_2 J_WF,   (4)
J_MDL = J_MDL_O + J_MDL_E + λ_3 J_Fisher + λ_4 J_hyperL + λ_5 J_VF + λ_6 J_V1,   (5)

with V^(m) being the sparse representation corresponding to the m-th modality in both the original space and the encoding space. W^(l) and b^(l) are the connection weight matrix and bias of the sDAE between the l-th layer and the (l − 1)-th layer, respectively. As defined in Appendix I-A, the Kullback-Leibler divergence KL(ρ || ρ̂_j^(l)) measures the difference between two Bernoulli distributions with means ρ and ρ̂_j^(l), where ρ is a sparsity hyperparameter and ρ̂_j^(l) is the average activation of neuron j in the l-th layer of the sDAE. Similar to the definition of F in (2), H ∈ R^(N×N) is given by the same construction applied to the modality sizes N_1, ..., N_M, so that the Fisher cost on the sparse representations is tr(V H V^T). The block diagonal matrix L = diag(L_h^(1), ..., L_h^(M)) consists of the hypergraph Laplacian matrices of all modalities, where L_h^(m) ∈ R^(N_m×N_m), the hypergraph Laplacian matrix of the m-th modality, is defined by (1).
In (4), J_recon is to train the sDAE by minimizing the error between the original data and its reconstruction. J_KL is to prevent overfitting of the sDAE by controlling the activation of neurons. Compared with the ℓ1-norm and ℓ2-norm, the Kullback-Leibler divergence has better sparsity-inducing ability, which helps to improve model performance; the details can be seen in Appendix I-A. J_WF is to prevent overfitting of the sDAE by controlling the weights. In (5), J_MDL_O is to learn the shared dictionary of all modalities and the modality-specific sparse representations based on the original data. Meanwhile, by encoding the original data and the shared dictionary through the sDAE, J_MDL_E is to achieve the multimodal dictionary learning in the encoding space for capturing the nonlinear higher-level features of data. Inspired by [25], we use the same sparse representations to characterize the local geometric relationships between data and dictionary in the original space as between the encoded data and the encoded dictionary in the encoding space. In other words, our objective is to use the sparse representations to capture the intrinsic local geometric relationships between data and dictionary. This helps to learn a locality-sensitive dictionary, resulting in improved generalization ability in reconstruction or classification. By clustering the samples within modalities and separating the samples between modalities, J_Fisher helps to learn the modality-specific representations. J_hyperL is designed to retain the complex neighborhood relationships of samples hidden in each modality. J_VF guarantees the convexity of the Fisher cost and J_V1 ensures the sparsity. The constraint on dictionary atoms is to prevent the sparse representations from becoming too small due to a large dictionary. In addition, the positive parameters λ_1, λ_2, λ_3, λ_4, λ_5, and λ_6 balance the network fitting, the dictionary learning, and the complexity of the model.

B. Optimization
The alternating minimization algorithm is applied to solve problem (3) for the parameters {W^(l)}_{l=1}^{2L}, D and V, which involves three parts, i.e., the training of the sDAE, the learning of the dictionary, and sparse representations learning.

1) The Training of sDAE: Denote

h_n^(l) = φ(W^(l) h_n^(l−1) + b^(l)),  g_k^(l) = φ(W^(l) g_k^(l−1) + b^(l)),

where φ is a differentiable activation function, chosen as the sigmoid function in this paper; h_n^(0) = X_n and g_k^(0) = d_k, where X_n is the n-th column of the multimodal data X and d_k is the k-th atom of the dictionary D. Accordingly, X^(l) = (h_1^(l), ..., h_N^(l)) and D^(l) = (g_1^(l), ..., g_K^(l)) collect the layer-l activations of the data and the dictionary. To update the parameters of the sDAE {W^(l)}_{l=1}^{2L}, the backpropagation algorithm with the gradient descent method is applied. The gradient of W^(l) is obtained by the chain rule, where the error term δ_n^(l) of the n-th sample is propagated backward layer by layer and the operation ⊙ denotes element-wise multiplication.

I(·) is an indicator function, which equals 1 when its condition holds and 0 otherwise. The weights are then updated by

W^(l) ← W^(l) − η_1 ∇_{W^(l)} J,

where η_1 is the learning rate.
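As an illustration of the forward pass that produces h_n^(l) and g_k^(l), the following minimal sketch propagates both the data columns and the dictionary atoms through a tiny symmetric sigmoid autoencoder. Layer sizes and variable names are hypothetical, and the backpropagation step is omitted:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, weights, biases):
    """Propagate the columns of X through the sDAE layers.

    Returns the activations of every layer; layer L (the middle one)
    is the encoding space used by the dictionary-learning terms.
    """
    h = [X]
    for W, b in zip(weights, biases):
        h.append(sigmoid(W @ h[-1] + b[:, None]))
    return h

# Hypothetical tiny sDAE: layer sizes 6 -> 4 -> 2 -> 4 -> 6 (L = 2).
rng = np.random.default_rng(2)
sizes = [6, 4, 2, 4, 6]
weights = [0.1 * rng.standard_normal((sizes[l + 1], sizes[l]))
           for l in range(len(sizes) - 1)]
biases = [np.zeros(sizes[l + 1]) for l in range(len(sizes) - 1)]

X = rng.standard_normal((6, 10))      # 10 data samples
D = rng.standard_normal((6, 3))       # 3 dictionary atoms
h = forward(X, weights, biases)       # data activations h^(0..2L)
g = forward(D, weights, biases)       # dictionary activations g^(0..2L)
```

Note that the same network encodes both the data and the dictionary, which is what allows the encoding-space term J_MDL_E to compare X^(L) with D^(L) V.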
2) The Learning of Dictionary: To update the dictionary D with fixed {W^(l)}_{l=1}^{2L} and V, problem (3) reduces to the dictionary-related terms of J_MDL. The gradient descent method is used to optimize this subproblem, where the gradient of D contains a term ΔR whose k-th column propagates the encoding-space reconstruction error back through the sDAE to the atom d_k. The update formula for D is

D ← D − η_2 ∇_D J_MDL,

where η_2 is the learning rate. Considering the constraint on the dictionary, each column of the updated dictionary D is normalized to unit length by d_k ← d_k / ||d_k||_2.
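A minimal sketch of one dictionary update follows, keeping only the original-space reconstruction term for brevity (the full gradient in the paper also includes the encoding-space correction ΔR); the helper name is ours:

```python
import numpy as np

def update_dictionary(D, X, V, eta):
    """One gradient step on ||X - D V||_F^2 followed by column
    normalization.  Simplified sketch: encoding-space terms omitted."""
    grad = -2.0 * (X - D @ V) @ V.T        # gradient of the original-space term
    D = D - eta * grad                     # gradient descent step
    # Project each atom back onto the unit sphere, as the constraint requires.
    return D / np.linalg.norm(D, axis=0, keepdims=True)
```

After every update the atoms have unit length, so the scale of the reconstruction is carried entirely by the sparse representations V.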

3) Sparse Representations Learning:
With fixed {W^(l)}_{l=1}^{2L} and D, the sparse representations can be obtained by solving the following optimization problem

min_V  f(V) + g(V),   (11)

where f(V) and g(V) are

f(V) = ||X − DV||_F^2 + ||X^(L) − D^(L) V||_F^2 + λ_3 tr(V H V^T) + λ_4 tr(V L V^T) + λ_5 ||V||_F^2,
g(V) = λ_6 ||V||_1.

To ensure the convexity of f(V), λ_5 > λ_3 ≥ 0 holds, and the details can be seen in Appendix I-B. In problem (11), f(V) is convex and differentiable, while g(V) is convex but non-differentiable. Thus, the Fast Iterative Shrinkage Thresholding Algorithm (FISTA) [26] is adopted to optimize V. The gradient of f(V) is

∇f(V) = 2D^T (DV − X) + 2(D^(L))^T (D^(L) V − X^(L)) + 2VS,

in which S = λ_3 H + λ_4 L + λ_5 I. The Lipschitz constant of the gradient ∇f(V) is given by (13) in Appendix I-C. Besides, the soft thresholding function in FISTA is defined as

T_θ(v) = sign(v) · max(|v| − θ, 0),

with | · | representing the absolute value function. The total optimization process is described in Algorithm 1.
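The soft thresholding step and the FISTA iteration can be sketched as follows for the simplified problem min_V 0.5||X − DV||_F² + λ||V||_1; the paper's full f(V) would add the encoding-space and regularization terms, but the algorithmic skeleton is the same:

```python
import numpy as np

def soft_threshold(v, theta):
    """T_theta(v) = sign(v) * max(|v| - theta, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def fista_lasso(D, X, lam, n_iter=200):
    """FISTA for min_V 0.5 ||X - D V||_F^2 + lam ||V||_1 -- a simplified
    stand-in for the full f(V) + g(V) subproblem."""
    Lip = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of grad f
    V = Z = np.zeros((D.shape[1], X.shape[1]))
    t = 1.0
    for _ in range(n_iter):
        grad = D.T @ (D @ Z - X)               # gradient at the momentum point
        V_new = soft_threshold(Z - grad / Lip, lam / Lip)
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        Z = V_new + ((t - 1) / t_new) * (V_new - V)   # Nesterov extrapolation
        V, t = V_new, t_new
    return V
```

On synthetic data generated from a sparse code, the iterates drive the objective well below its value at V = 0 and recover a sparse solution.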

IV. RESULTS AND ANALYSIS
In this section, EMDDL is utilized to explore the dynamic functional connectivity changes of the brain during two tasks and resting state.

A. Data Acquisition and Preprocessing
PNC is a large-scale collaborative project between the Brain Behaviour Laboratory at the University of Pennsylvania and the Children's Hospital of Philadelphia, which contains data collected using three fMRI paradigms from nearly 900 youths aged 8 to 22, i.e., two tasks covering emotion identification (Emoid fMRI) and working memory (Nback fMRI), and resting state (Rest fMRI) [27]. All fMRI scans were collected on a single 3 T Siemens TIM Trio whole-body scanner using a single-shot, interleaved multi-slice, gradient-echo, echo planar imaging sequence. The Emoid fMRI, Nback fMRI, and Rest fMRI scan durations were 10.5 minutes (210 TR), 11.6 minutes (231 TR), and 6.2 minutes (124 TR) respectively. During the Emoid task, subjects were asked to identify 60 faces with neutral, happy, sad, angry, or fearful expressions. During the Nback task, which probes working memory, subjects were required to respond only when a presented fractal was the same as the one presented in the previous trial. During the resting-state scan, subjects were instructed to stay awake, keep their eyes open, fixate on the displayed crosshair, and remain still. Of these, 123 children and 146 young adults completed all three paradigms. Using Statistical Parametric Mapping 12, motion correction, co-registration, spatial normalization to standard Montreal Neurological Institute space (spatial resolution of 3 × 3 × 3 mm), and spatial smoothing with a 3 mm full-width-at-half-maximum Gaussian kernel were performed. Then, a regression procedure was used to remove the influence of motion, and the functional time series were band-pass filtered using a 0.01 Hz to 0.1 Hz frequency range. According to the Power coordinates with a sphere radius of 5 mm [28], 264 regions of interest (ROIs) containing 21384 voxels were extracted. The details of the 264 ROIs are shown in Table 2 of the Supplementary material.
Every subject file can be reduced to a 264 × T matrix by averaging the time series of all voxels in the same brain region, where the time point T is 210, 231, and 124 for Emoid, Nback, and Rest fMRIs respectively.
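The reduction from voxel to ROI time series described above amounts to a label-wise average; a sketch with an illustrative helper name:

```python
import numpy as np

def roi_time_series(voxels, labels):
    """Average voxel time series within each ROI.

    voxels : (n_voxels, T) BOLD time series
    labels : (n_voxels,) ROI index of each voxel, in 0..n_roi-1
    returns (n_roi, T) matrix of ROI-averaged time series.
    """
    n_roi = labels.max() + 1
    out = np.zeros((n_roi, voxels.shape[1]))
    for r in range(n_roi):
        out[r] = voxels[labels == r].mean(axis=0)
    return out
```

With 264 ROIs this produces the 264 × T matrix per subject, where T is 210, 231, or 124 depending on the paradigm.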
We divided 264 ROIs into 13 functional networks to facilitate the understanding of functional connectivity relationships between the ROIs [28]. Among them, 12 functional networks including sensory/somatomotor network (SSN), cinguloopercular task control network (COTCN), auditory network (AN), default mode network (DMN), memory retrieval network (MRN), visual network (VN), frontoparietal task control network (FPTCN), salience network (SN), subcortical network (SCN), ventral attention network (VAN), dorsal attention network (DAN), and cerebellar network (CN), are mainly associated with the perception of movement, memory, language, vision, cognition and other functions of the brain, while there are 28 ROIs unrelated to any of the above functional networks which belong to the uncertain network (UN).
To capture the dynamic characteristics of the brain, dFC is obtained by calculating the Pearson correlation between the time courses of the BOLD signals of paired regions within a sliding window [29], [30], [31]. The details of obtaining the multimodal data can be seen in the Supplementary material. By grid search, we choose window lengths w_l of 14, 17, and 33 for the Emoid, Nback, and Rest fMRIs respectively, and the sliding step s_l is 1 for all three modalities. Thus, each subject provides a dFC matrix M_dFC ∈ R^(C(264,2)×S_l) for each of the Emoid, Nback, and Rest fMRIs, where C(264,2) = 34716 is the number of region pairs. To reduce the computational complexity, systematic sampling is used to select 20 sub-sequences from the dFC matrix corresponding to each modality of each subject [4]. The training data contains 80% of the subjects and the remaining subjects form the test data.
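A sliding-window dFC computation of this kind might look as follows; the window length and step are parameters, and vectorizing the upper triangle of each correlation matrix yields the C(264,2) = 34716-dimensional features for 264 ROIs:

```python
import numpy as np

def sliding_window_dfc(ts, w_len, step=1):
    """Sliding-window dFC: for each window, Pearson-correlate every pair
    of ROI time courses and keep the vectorized upper triangle.

    ts : (n_roi, T) ROI time series
    returns (n_roi*(n_roi-1)/2, n_win) dFC matrix, one column per window.
    """
    n_roi, T = ts.shape
    iu = np.triu_indices(n_roi, k=1)          # indices of the upper triangle
    cols = []
    for s in range(0, T - w_len + 1, step):
        C = np.corrcoef(ts[:, s:s + w_len])   # Pearson correlation in window
        cols.append(C[iu])
    return np.stack(cols, axis=1)
```

Each column is one windowed whole-brain connectivity profile; stacking the columns across subjects gives the data matrix the dictionary is learned from.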

B. Experimental Results
To evaluate the performance of the algorithm, the signal-to-noise ratio (SNR) [32] is used as the evaluation index, defined as

SNR = 10 log_10 ( ||X||_F^2 / ||X − X̂||_F^2 ),

where X̂ is the reconstruction of X. Given that the grid search method can make a complete search over a given hyperparameter space and can easily be parallelized to find a stable optimum [33], [34], it is used to select appropriate hyperparameters. Specifically, one hyperparameter is selected by grid search while the other hyperparameters are fixed. By repeating this process, all hyperparameters are optimized, and the results are shown in Fig. 1 of the Supplementary material. The sDAE has 7 layers with 34716, 10000, 6000, 1000, 6000, 10000, and 34716 units respectively. The number of atoms K is 300, the sparsity parameter ρ is 0.1, the regularization coefficients λ_1, λ_2, λ_3, λ_4, λ_5, and λ_6 are 0.0001, 0.0005, 0.0003, 0.0001, 0.0005, and 0.001 respectively, and the number of nearest neighbors k for hypergraph construction is 9. Because the subproblems for updating {W^(l)}_{l=1}^{2L} and D are nonconvex, the RMSProp algorithm is used for these updates owing to its better generalization ability and lower susceptibility to overfitting [35]. For the RMSProp algorithm, the learning rates η_1 and η_2 are 0.00005 and 0.00008 respectively, and the squared-gradient decay rates ξ_1 and ξ_2 are both 0.9.
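Assuming the usual definition of reconstruction SNR in decibels (the exact form used in [32] may differ slightly), it can be computed as:

```python
import numpy as np

def snr_db(X, X_hat):
    """Reconstruction SNR in dB: 10 log10(||X||_F^2 / ||X - X_hat||_F^2)."""
    return 10.0 * np.log10((X ** 2).sum() / ((X - X_hat) ** 2).sum())
```

For example, a reconstruction whose residual energy is 1% of the signal energy scores 20 dB; higher is better.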
Based on the optimal hyperparameters, we apply EMDDL to the training data to obtain the dictionary and sparse representations. The learning curves of the loss functions and the SNR evaluation on both training and testing data are shown in Fig. 2. The results testify that the sparse representations can characterize the local geometric relationships between data and dictionary in the two spaces. The SNRs of EMDDL, multimodal dictionary learning (MDL) [36], and sparse deep dictionary learning (SDDL) [4] on testing data are shown in Table I. They show that the multimodal-based methods have better reconstruction ability than the single-modality methods, and that the generalization ability in reconstruction of EMDDL is better than that of the other two methods. This testifies that integrating the multimodal prior or combining the complementary information from diverse modalities can promote model enhancement.

C. States Analysis of Multimodal Data
To find the differences in reoccurring patterns of dFC (i.e., states) between children and young adults, the k-means clustering method with the cityblock distance metric is used to obtain the reoccurring patterns of each group in each modality [37]. Specifically, the sparse representations of each group in each modality are clustered, and then the states are obtained by multiplying the dictionary and the cluster centroids. We use the elbow criterion, defined on the within-cluster sums of distances, to estimate the optimal number of dFC states; the optimal numbers for the Emoid, Nback, and Rest fMRIs are 5, 5, and 4 respectively. To test whether the clustering results are consistent across multiple subgroups, we use the kappa coefficient as the indicator [38]; the details can be seen in the Supplementary material. The results indicate that the clustering results obtained from two different subgroups show substantial or almost perfect agreement with high probability.
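Since standard k-means minimizes squared Euclidean distance, a cityblock variant assigns points by L1 distance and updates each centroid as the component-wise median (the L1 minimizer). A small sketch, with the within-cluster sum of distances returned as the elbow-criterion quantity; the function name is illustrative:

```python
import numpy as np

def kmeans_cityblock(X, k, n_iter=50, seed=0):
    """k-means with the cityblock (L1) metric: assignment by L1 distance,
    centroid update by the component-wise median."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]           # initial centroids
    for _ in range(n_iter):
        d = np.abs(X[:, None, :] - C[None]).sum(axis=2)   # (n, k) L1 distances
        lab = d.argmin(axis=1)                            # nearest centroid
        for j in range(k):
            if np.any(lab == j):
                C[j] = np.median(X[lab == j], axis=0)     # L1-optimal update
    inertia = np.abs(X - C[lab]).sum()   # within-cluster sums of distances
    return lab, C, inertia
```

Running this for a range of k and plotting the inertia gives the elbow curve used to pick the number of states.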
To further investigate the time-occupancy divergence of each state, the dwell time (DT) and fraction of time (FT) are estimated from the state transition vector [7]. DT represents how long an individual spends in a given state on average, and FT describes the total share of time spent in a given state. For a subject i, the DT and FT of the k-th state are defined by

DT_k^i = (1 / n_k) Σ_j (TR_end,j − TR_start,j),
FT_k^i = (number of windows in state k) / (total number of windows),

where n_k is the number of visits to state k, and TR_start and TR_end mark the windows at which subject i enters and leaves state k, detected as the "1" and "−1" transitions of the binarized state vector indicating whether a specific window of the i-th subject belongs to state k or not.

To study the changes in reoccurring patterns over time under the two tasks and resting state, we define the transition probability P_ij from time t to time t + 1 as

P_ij = Σ_n Σ_t I(S_t^n = i, S_{t+1}^n = j) / Σ_n Σ_t I(S_t^n = i),

where S^n = (S_1^n, S_2^n, ..., S_T^n) ∈ R^(1×T) is the state vector of the n-th subject, and S_t^n = i for i = 1, 2, ..., s (s is 5, 5, and 4 for Emoid, Nback, and Rest respectively) represents that the n-th subject is in state i at time t. I(·) is an indicator function, which is 1 when the condition is true and 0 otherwise. The probability of each state at the initial time is defined as the fraction of subjects starting in that state. Specifically, for a given state i_t at time t, we can calculate the transition probabilities P_{i_t j} for j = 1, 2, ..., s from time t to time t + 1. Then we record the maximum transition probability and the corresponding state at time t + 1, denoted P_{i_t i_{t+1}} and i_{t+1} respectively. By repeating the above steps, we obtain the state transition curve with maximum state transition probability, which is shown in Figs. 3-5. It indicates that the reoccurring patterns of the three paradigms are similar for a subject. The same conclusion can also be found in previous research [39], which reveals that whether in resting state or task, the basic structure of the brain functional network remains relatively consistent. The finding testifies that the brain has a shared functional architecture during resting and many directed tasks, and that this shared functional architecture only modulates the connectivity pattern in response to task demands.
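The dwell time, fraction of time, and transition probabilities described above can be sketched for a single subject's state vector as follows (helper names are ours):

```python
import numpy as np

def state_metrics(S):
    """Dwell time, fraction of time, and transition probabilities from a
    1-D state vector S (one subject; arbitrary state labels)."""
    S = np.asarray(S)
    T = len(S)
    states = np.unique(S)
    # Fraction of time: share of windows spent in each state.
    FT = {k: np.mean(S == k) for k in states}
    # Dwell time: mean length of the consecutive runs of each state.
    runs = {k: [] for k in states}
    start = 0
    for t in range(1, T + 1):
        if t == T or S[t] != S[start]:
            runs[S[start]].append(t - start)
            start = t
    DT = {k: float(np.mean(runs[k])) if runs[k] else 0.0 for k in states}
    # Transition probabilities P[i, j] = P(S_{t+1} = j | S_t = i).
    idx = {k: n for n, k in enumerate(states)}
    P = np.zeros((len(states), len(states)))
    for a, b in zip(S[:-1], S[1:]):
        P[idx[a], idx[b]] += 1
    P /= np.maximum(P.sum(axis=1, keepdims=True), 1)
    return DT, FT, P
```

Averaging these quantities over the subjects of a group yields the group-level DT, FT, and transition matrices compared between children and young adults.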
In other words, the overlapping functional connectivity patterns between Rest fMRI and two task fMRIs suggest a shared functional architecture underlying and even shaping brain function, and a potential explanation of overlap is that the functional connectivity during resting constrains the activation of brain regions in response to task demands [40].
Although the brain shares the basic functional architecture during task and resting state, the basic functional organization of children and young adults differs. The number of within- or between-network pairs in which children exhibit high-strength functional connections is 43 in state 1 and 2 in state 3 under the Emoid task, 55 in state 1 and 10 in state 2 under the Nback task, and 13 in state 2 under resting state. For young adults, the corresponding number is 9 in state 1 under the Emoid task, 20 in state 1 under the Nback task, and 2 in state 2 under resting state. For all three fMRI paradigms, we found that children have many high-strength functional connections distributed widely among the 13 functional networks, whereas young adults have high-strength functional connections only within and between some functional networks. This is consistent with previous studies showing that children exhibit more diffuse functional connectivity patterns while young adults show more focused functional connectivity patterns, and the changes in functional connectivity patterns between children and young adults explain how brain function changes from an undifferentiated system to a specialized system with age [3], [4]. A brain organization with distinct and stronger within-network communication can promote precise modulation efficiently because it can transfer more information in a short time [41]. Thus, compared with children with more diffuse functional connectivity patterns, the brain organization of young adults with more focused functional connectivity patterns can transmit information more efficiently and facilitate precise modulation during resting and the two tasks.
Additionally, the functional connectivity among DMN, SCN, MRN, CN, AN, FPTCN, and SN is decreased in most reoccurring patterns for the Emoid, Nback, and Rest fMRIs during development. DMN, the so-called task-negative network, is broadly deactivated across tasks and is closely related to numerous key brain functions such as integration of autobiographic information, self-monitoring, and social cognition [28]. It is reported that the functional activity in DMN never stops but is regulated during the resting state [42]. SCN participates in memory, attention, perception, and consciousness, and dominates the motivation- and emotion-state-independent control of cortical functions [43]. MRN is reported to be engaged during autobiographical memory retrieval, which involves strategic search processes guided by self-knowledge and current goals, memory recovery associated with a rich sense of re-experience, monitoring, and other control processes [44]. CN is not just considered the domain of motor control, receiving information from widespread regions to affect the generation and control of movement, but is also thought to be involved in cognition and visuospatial reasoning [45]. AN, innervated by autonomic nerves, is involved in activities related to sound information, including collection, conduction, and processing [46], [47]. FPTCN, involved in working memory maintenance, predictive perceptual coding, and cognitive tasks, is thought to play an important part in mediating the allocation of attentional resources that compete for auditory information under varying degrees of perceptual demand [48]. SN is thought to regulate attention and behavior adaptively through the physical characteristics and the relevant information of the task, and is also considered a key interface for cognitive, homeostatic, motivational, and affective systems [49].
Both resting-state and task fMRIs suggest that the brain functions for processing information, working memory, and cognition are not yet mature in children compared with young adults [3].
The functional connectivity between SSN, COTCN, DAN, and some other functional networks increases in some reoccurring patterns for resting and task fMRIs during development. The SSN participates in the processing of emotional feeling and in cognitive activities [50]. The COTCN is key to coordinating information transmission and is involved in many complex cognitive tasks [51]. The DAN controls external, attention-demanding cognitive functions [52]. The three fMRI paradigms indicate that brain functions related to emotional feeling, cognition, and information transmission are still developing with age.

2) The Developmental Differences of Each fMRI Paradigm: Figs. 3-5 B show the divergence in time occupancy between children and young adults during task and resting state. Both children and young adults have lower DT and FT in each state for the two tasks while having higher DT and FT in each state during rest. This indicates that subjects, both children and young adults, tend to switch frequently among states during tasks and prefer to stay in a particular state while resting. It reveals that spontaneous functional activity is stable during the resting state, whereas functional activity changes quickly to meet task demands when the participant is required to perform a task [53].
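FT (fraction time) and DT (dwell time) are commonly derived from the window-wise state sequence of each subject. A minimal sketch under the usual dFC definitions (the paper's exact computation may differ): FT is the fraction of windows spent in each state, and DT is the mean length of uninterrupted visits to each state.

```python
import numpy as np

def fraction_and_dwell_time(states, n_states):
    """Fraction time (FT) and mean dwell time (DT) per state from a
    window-wise state sequence.

    states   : 1-D sequence of state labels, one per sliding window
    n_states : total number of states

    FT[k] = fraction of windows spent in state k
    DT[k] = mean length of uninterrupted visits to state k (0 if never visited)
    """
    states = np.asarray(states)
    ft = np.array([np.mean(states == k) for k in range(n_states)])

    dt = np.zeros(n_states)
    # split the sequence into runs of identical consecutive labels
    change = np.flatnonzero(np.diff(states)) + 1
    runs = np.split(states, change)
    for k in range(n_states):
        lengths = [len(r) for r in runs if r[0] == k]
        dt[k] = np.mean(lengths) if lengths else 0.0
    return ft, dt

# toy sequence over 3 states
seq = [0, 0, 1, 1, 1, 0, 2]
ft, dt = fraction_and_dwell_time(seq, 3)
print(ft)  # fractions 3/7, 3/7, 1/7
print(dt)  # state 0: runs of length 2 and 1 -> 1.5; state 1: 3; state 2: 1
```

Low DT/FT per state (as during the two tasks) reflects frequent switching; high DT/FT (as during rest) reflects a preference to remain in one state.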
For the Emoid fMRI, both children and young adults stay in states 2, 3, and 4 for about the same time, but children stay longer in state 1 while young adults stay longer in state 5. Under the Emoid task, whether the initial state is 2, 3, 4, or 5, the children group eventually switches to state 4 at time 9 and then switches back and forth between states 2 and 4. When the initial state is 1, the children group stays in state 1 for most of the time and then switches to state 3. Regardless of the initial state, the young adult group eventually switches to state 5 at time 5, stays in state 5 for a long while, and then switches to state 4 at time 18. The result of the Emoid task indicates that children have more frequent state transitions between states 2 and 4, and that the strength of functional connections within or between functional networks changes over time. Compared with children, the strength of functional connectivity within or between functional networks decreases at an early stage for young adults, who then prefer to stay in state 5.
For the Nback fMRI, both children and young adults stay in states 2, 3, and 4 for about the same time, but children stay longer in state 1 while young adults stay longer in state 5. Under the Nback task, regardless of the initial state, the children group eventually switches to state 4 at time 9 and then stays in state 4 until switching to state 3 at time 18. The young adult group switches between states 4 and 5 after time 7 from any initial state. The result of the Nback task indicates that the strength of functional connectivity for children changes over time during the frequent state transitions at early times, after which they stay in state 4 for a while and finally switch to state 3. Unlike children, young adults prefer to remain in a state for a while after switching to state 4 or 5, and the strength of functional connectivity within or between functional networks first decreases, then increases, and then decreases again during the transitions between states 4 and 5.
For the Rest fMRI, both children and young adults stay in state 3 for about the same time. Children stay longer in states 1 and 2, whereas young adults prefer to stay in state 4. During the resting state, both children and young adults prefer to stay in a particular state, with little change in the strength of functional connectivity within or between functional networks. We found that children prefer to switch among states with diffuse functional connectivity patterns during the two tasks and to stay in states with diffuse patterns during rest. In contrast, young adults switch among states with focused functional connectivity patterns in the two task fMRIs and stay in states with focused patterns during rest.
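The tendency to switch among states versus stay within one can also be summarized by a state transition probability matrix, a standard dFC summary (not described as such in the paper, so the following is an illustrative sketch): large diagonal entries indicate a preference to remain in a state, as during rest, while small diagonals indicate frequent switching, as during the tasks.

```python
import numpy as np

def transition_matrix(states, n_states):
    """Row-normalized state transition probability matrix.

    T[i, j] = probability of moving from state i at window t to
    state j at window t + 1, estimated from consecutive window pairs.
    """
    T = np.zeros((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):
        T[a, b] += 1
    row_sums = T.sum(axis=1, keepdims=True)
    # rows of states never left keep all-zero probabilities
    return np.divide(T, row_sums, out=np.zeros_like(T), where=row_sums > 0)

seq = [0, 0, 0, 1, 0, 0, 1, 1]
T = transition_matrix(seq, 2)
print(T)  # row 0: [0.6, 0.4]; row 1: [0.5, 0.5]
```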
For the Emoid fMRI, along with the enhanced functional connectivity among SSN, COTCN, and DAN with age in states 4 and 5, the functional connectivity in the remaining states declines to various degrees. For the Nback fMRI, enhanced functional connections exist within and between the 13 functional networks in state 3 during development, while the functional connectivity in the remaining states decreases with age. For the Rest fMRI, states 1, 2, and 4 contain not only weakened functional connections, mainly among SCN, MRN, CN, DMN, AN, FPTCN, and SN, but also strengthened functional connections, mainly among SSN, COTCN, and DAN. In state 3 of the Rest fMRI, the functional connections within and between the 13 functional networks are enhanced during development. We found that, compared with children, the functional connectivity of young adults increases or decreases over time for the resting fMRI while generally decreasing for the two tasks. This indicates that the changes in functional connectivity with age are more complex during rest, and that the brain functions related to emotion and working memory become more mature and efficient during development [4], [41].

B. Future Work
In this paper, the functional connections of three fMRI paradigms were used to learn the modality-shared dictionary and modality-specific sparse representations. Interestingly, the dynamic functional connections of multiple modalities from multiple subjects can be treated as a higher-order tensor by considering the time, subject, or modality dimension. A higher-order tensor can preserve the structural relationships among the different dimensions of the data, which may be lost in a low-dimensional form. Besides, by considering brain regions as the nodes of a graph, the functional connections can be viewed as the weight matrix of the graph's edges. Compared with treating the functional connections as a feature vector, treating them as a graph feature can help to discover the structural relationships among brain regions. Hence, a meaningful direction for future work is to incorporate the structural information of the different dimensions of the data, or the graph structure of the brain, into EMDDL, which may improve model performance and capture the intrinsic topological structure of the brain.
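The tensor view mentioned above can be sketched by stacking the vectorized dynamic FC of every (subject, modality, window) combination into a single multi-way array. All dimensions below are illustrative, not the paper's actual sizes:

```python
import numpy as np

# Illustrative sizes: subjects x modalities x sliding windows x regions
n_subj, n_modal, n_win, n_reg = 5, 3, 10, 8

# random stand-in for dynamic FC matrices (one per subject/modality/window)
rng = np.random.default_rng(0)
dfc = rng.standard_normal((n_subj, n_modal, n_win, n_reg, n_reg))
dfc = (dfc + dfc.transpose(0, 1, 2, 4, 3)) / 2   # symmetrize each FC matrix

# keep only the upper triangle of each FC matrix (no duplicated entries),
# yielding a 4th-order tensor: subject x modality x window x edge
iu = np.triu_indices(n_reg, k=1)
tensor = dfc[..., iu[0], iu[1]]
print(tensor.shape)   # (5, 3, 10, 28)
```

Keeping the subject, modality, and time modes separate in this way is what would allow tensor decompositions to exploit structure that a flattened feature vector discards.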

VI. CONCLUSION
In this paper, we present an explainable multimodal deep dictionary learning method to capture the developmental differences between children and young adults from three fMRI paradigms. Specifically, the shared dictionary and the modality-specific sparse representations are learned based on the multimodal data and their encodings from the sDAE to simultaneously reveal the commonality and specificity of different paradigms. By applying the proposed method to the three fMRI paradigms from PNC, we found that children share a diffuse functional connectivity pattern while young adults share a focused functional connectivity pattern during both resting state and the two tasks. The three fMRI paradigms reveal that, compared with children, young adults possess more mature and efficient functional networks for processing information. Children and young adults rarely transition between states during rest, while preferring to switch among states over time during a task.

A. The Comparison of Sparsity of Activations Among Penalty Functions
The penalty functions based on the L1-norm and L2-norm are defined as Σ_j |ρ̂_j| and Σ_j ρ̂_j², respectively, where ρ̂_j denotes the average activation of the j-th hidden unit. The sparsity and SNR with the L1-norm, L2-norm, and Kullback-Leibler divergence penalties have been compared, and the results are shown in Fig. 9 of the Supplementary material. The results show that the sparsity obtained with the Kullback-Leibler divergence is better than that with the L1-norm in most hidden layers, and the sparsity with the L2-norm is the worst among the three penalty functions.
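The three penalties can be compared directly on a vector of average hidden activations. A minimal sketch using the standard sparse-autoencoder forms of these penalties (the paper's exact constants and weighting may differ):

```python
import numpy as np

def l1_penalty(rho_hat):
    """L1 sparsity penalty on the average hidden activations."""
    return np.sum(np.abs(rho_hat))

def l2_penalty(rho_hat):
    """L2 (squared) sparsity penalty on the average hidden activations."""
    return np.sum(rho_hat ** 2)

def kl_penalty(rho_hat, rho=0.05):
    """Kullback-Leibler divergence between a target sparsity level rho and
    the average activation rho_hat of each hidden unit (the standard
    sparse-autoencoder penalty); rho = 0.05 is an illustrative target."""
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

rho_hat = np.array([0.04, 0.06, 0.10])   # illustrative average activations
print(l1_penalty(rho_hat), l2_penalty(rho_hat), kl_penalty(rho_hat))
```

Unlike the two norms, the KL penalty grows steeply as any unit's average activation departs from the target level, which is what drives activations toward a fixed, low firing rate.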
The SNR evaluation of EMDDL on the multimodal data in both the original space and the encoding space shows that EMDDL with the Kullback-Leibler divergence penalty has better reconstruction ability than with the L1-norm or L2-norm.
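A common way to score reconstruction quality as SNR is the ratio of signal power to residual power in decibels; the following sketch uses that common definition, which may differ in detail from the paper's formula:

```python
import numpy as np

def snr_db(x, x_hat):
    """Reconstruction SNR in dB: signal power over residual power."""
    return 10 * np.log10(np.sum(x ** 2) / np.sum((x - x_hat) ** 2))

x = np.array([1.0, 2.0, 3.0])       # original data
x_hat = np.array([1.1, 1.9, 3.0])   # reconstruction
print(round(snr_db(x, x_hat), 2))   # 28.45
```

Higher SNR means the dictionary and sparse representations reproduce the data more faithfully, in either the original space or the encoding space.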

B. The Proof of the Convexity of f (V )
The convexity of f (V ) depends on whether its Hessian matrix ∇ 2 f (V ) is positive definite, i.e., whether λ min (∇ 2 f (V )) > 0, where λ min (·) denotes the smallest eigenvalue of a matrix. To ensure the positive definiteness of the Hessian matrix ∇ 2 f (V ), λ min (∇ 2 f (V )) should be greater than 0. Thus, f (V ) is convex when λ 3 < λ 5 holds.

C. The Lipschitz Constant of the Gradient ∇V
For every V 1 , V 2 ∈ R K×N , the difference of the gradients can be bounded linearly in the difference of V 1 and V 2 . Thus, the gradient ∇V is Lipschitz continuous, and its Lipschitz constant is given in terms of λ max (·), where λ max (·) denotes the largest eigenvalue of a matrix.
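In quadratic sparse-coding objectives of the form 0.5 · ‖X − DV‖²_F, the gradient with respect to V is Lipschitz with constant L = λ max (DᵀD), and 1/L is the standard step size for proximal-gradient (ISTA-style) updates. The sketch below illustrates this role of λ max for that simplified objective; the paper's full objective adds further quadratic terms to the constant:

```python
import numpy as np

def lipschitz_constant(D):
    """Lipschitz constant of the gradient of 0.5 * ||X - D V||_F^2 with
    respect to V: the largest eigenvalue of D^T D (equivalently, the
    squared spectral norm of D)."""
    return np.linalg.eigvalsh(D.T @ D)[-1]   # eigenvalues ascend; take last

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 5))   # illustrative dictionary
L = lipschitz_constant(D)
step = 1.0 / L                      # standard proximal-gradient step size
print(L > 0)                        # True
```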