Latent Similarity Identifies Important Functional Connections for Phenotype Prediction

Objective: Endophenotypes such as brain age and fluid intelligence are important biomarkers of disease status. However, brain imaging studies to identify these biomarkers often encounter limited numbers of subjects and high dimensional imaging features, hindering reproducibility. Therefore, we develop an interpretable, multivariate classification/regression algorithm, called Latent Similarity (LatSim), suitable for small sample size, high feature dimension datasets. Methods: LatSim combines metric learning with a kernel similarity function and softmax aggregation to identify task-related similarities between subjects. Inter-subject similarity is utilized to improve performance on three prediction tasks using multi-paradigm fMRI data. A greedy selection algorithm, made possible by LatSim's computational efficiency, is developed as an interpretability method. Results: LatSim achieved significantly higher predictive accuracy at small sample sizes on the Philadelphia Neurodevelopmental Cohort (PNC) dataset. Connections identified by LatSim gave superior discriminative power compared to those identified by other methods. We identified 4 functional brain networks enriched in connections for predicting brain age, sex, and intelligence. Conclusion: We find that most information for a predictive task comes from only a few (1-5) connections. Additionally, we find that the default mode network is over-represented in the top connections of all predictive tasks. Significance: We propose a novel algorithm for small sample, high feature dimension datasets and use it to identify connections in task fMRI data. Our work should lead to new insights in both algorithm design and neuroscience research. Code and demo are available at https://github.com/aorliche/LatentSimilarity/.


I. INTRODUCTION
FUNCTIONAL magnetic resonance imaging (fMRI) provides a non-invasive estimate of brain activity by exploiting the blood oxygen level-dependent (BOLD) signal [1]. This high-acuity imaging data can be used to predict variables like age, sex, intelligence, and disease status [2] [3] [4] [5]. Interestingly, the gap between fMRI-predicted brain age and biological age can identify Alzheimer's disease patients prior to the onset of symptoms [6]. Prediction is hindered, however, by the combination of small sample size and very high feature number. This results in models that have poor reproducibility and generalizability [7].
Studies with small sample size only have the power to detect very large effects. Many effects that are found in small studies may be due to noise. When identifying regions that are associated with in-scanner tasks, it was found that the average minimum cohort size needed to reproducibly identify the same region 50% of the time in independent samples was N=36 [8]. In contrast, models deployed clinically use thousands of subjects for training and validation [9]. In 2017 and 2018, the median cohort sizes for published experimental and clinical MRI studies were 23 and 24 subjects, respectively, and less than 1% of the 272 papers surveyed reported cohort sizes greater than 100 [10]. This may be attributed to both cost, at $500-$1000 per subject, and the difficulty of collecting the data, stemming from long scan times, subject discomfort in the scanner, and experimental design [10].
Additionally, for fMRI-based predictions to be useful clinically, they must be interpretable. There is a large literature on the interpretability of machine learning in medical imaging [11] [12]; however, there is often a tradeoff between model accuracy and interpretability. This raises questions about robustness in the clinical setting [13]. For example, Zhang et al. show that different processing methods can yield similar accuracy in a sex prediction task, but with different discriminative features identified by each method [14]. Identifying a minimal set of valid functional connections may increase model robustness, and make inroads into causal analysis of brain networks [15].
Finally, many recent studies in the deep learning field shift their focus to integrating data from multiple omics [16], or multiple omics and imaging [17]. This is done for two purposes: to improve prediction accuracy and to learn novel interactions between different modalities. CCA-based models have been proposed that use response variable-guided feature alignment [18] [19]. However, these models do not consider inter-subject relationships and cannot control disentanglement between different predictive tasks.
In this paper, we introduce LatSim (Figure 1), a model in the spirit of metric learning [20], that is both robust and interpretable. Traditional machine learning (ML) models in fMRI, which work directly on functional connectivity (FC) [21], are vulnerable to noise or random confounders like scanner drift or head motion [22]. Graph neural networks (GNNs) use inter-subject information as an adjunct to calculations performed directly on FC [23]. However, graph edges may be ambiguous or non-binary, requiring additional degrees of freedom for their estimation [24] [25]. In contrast, LatSim learns an inter-subject similarity metric, d(x_a, x_b), and uses the inter-subject similarity, without a self-loop, to make predictions.
The contribution of our work is three-fold. First, we propose a novel metric learning-based model, LatSim, which is robust, interpretable, computationally efficient, multi-view, and multi-task. Second, we use LatSim and a greedy selection algorithm to identify the most discriminative connections for age, sex, and intelligence prediction among adolescents in the Philadelphia Neurodevelopmental Cohort (PNC) dataset [26]. We show that such connections are superior to those identified by saliency maps. Third, we give a justification for why LatSim performs better than traditional ML models with low sample sizes and high feature dimensionality.
The rest of this paper is organized as follows. Section II gives the mathematical foundations of LatSim and its relationship to other models. Section III provides simulation and experimental results. Section IV discusses significant brain networks and reasons why LatSim performs better in the low sample-size, high-dimensionality regime. Section V concludes with a recapitulation of the work. We make the code publicly available at the link given in the abstract.

II. METHODS

TABLE I: Notation

X_{i,j}: The (i,j)-th entry of matrix X
X_{i,:}: The i-th row of matrix X
X_i, X^{(i)}: The i-th matrix in a set of matrices
X^T: The transpose of matrix X
A, B: Random variables
x_i: The i-th element of a set
y_i: The i-th entry of vector y
∘: The Hadamard product
1: A matrix of ones
diag(a): A square matrix with the elements of a on the main diagonal, 0s elsewhere
‖·‖_2: The l2 norm

A. Kernel CCA
To compute similarity between subjects, we utilize ideas from canonical correlation analysis (CCA) [27] [28]. Conventional CCA seeks to find relationships between the features of two different views of a dataset. It aligns the two views, X_1 ∈ R^{N×d_1} and X_2 ∈ R^{N×d_2}, by finding canonical variables w_1 and w_2 that maximize the correlation between X_1 w_1 and X_2 w_2:

    maximize_{w_1, w_2}  (w_1^T X_1^T X_2 w_2) / (‖X_1 w_1‖_2 ‖X_2 w_2‖_2),

where N is the number of subjects and d_1 = d_2 = d is the feature dimension. Kernel CCA (kCCA) [29] [30] transforms features into a reproducing kernel Hilbert space (RKHS) and finds the alignment between the transformed features K_1 and K_2. The similarity in the RKHS is k(X_{i,:}, X_{j,:}) = φ(X_{i,:})^T φ(X_{j,:}), where φ : R^d → R^d is the feature transformation. LatSim learns a linear kernel A ∈ R^{d×d}; however, this still allows detection of nonlinear relationships.
The main idea behind CCA and kCCA is to maximize the similarity between two or more signals after some constrained transformation.This constrained transformation moves the data to a latent space, which may be of lower dimension.The limitation of CCA and kCCA is that they are unsupervised learning techniques that must account for every similarity between the signals, not just those relevant for a particular application, although recent work is tackling this problem [19].
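As a concrete illustration, the first canonical correlation in the objective above can be computed by whitening each view's covariance and taking the top singular value of the whitened cross-covariance. The following numpy sketch shows the idea; the function name and the ridge term `reg` are our own additions, not part of the paper's method:

```python
import numpy as np

def cca_first_corr(X1, X2, reg=1e-6):
    """First canonical correlation between views X1 (N x d1) and X2 (N x d2).

    After whitening, the top singular value of S11^{-1/2} S12 S22^{-1/2}
    equals the maximal corr(X1 w1, X2 w2). A small ridge term keeps the
    Cholesky factorizations well-conditioned.
    """
    X1 = X1 - X1.mean(axis=0)
    X2 = X2 - X2.mean(axis=0)
    n = len(X1)
    S11 = X1.T @ X1 / n + reg * np.eye(X1.shape[1])
    S22 = X2.T @ X2 / n + reg * np.eye(X2.shape[1])
    S12 = X1.T @ X2 / n
    L1 = np.linalg.cholesky(S11)              # S11 = L1 @ L1.T
    L2 = np.linalg.cholesky(S22)
    M = np.linalg.solve(L1, S12) @ np.linalg.inv(L2).T
    return np.linalg.svd(M, compute_uv=False)[0]
```

For two views that are (noisy) linear transforms of one another, the returned value approaches 1; for independent views it stays small.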

B. Latent similarity
In contrast to unsupervised learning, LatSim maximizes similarity of subjects relative to a response variable of interest, such as age, sex, or intelligence. First, similarities are computed as the inner product of the low-dimensional projections of subject features, based on a learned kernel:

    sim(x_a, x_b) = (A x_a)^T (A x_b) = x_a^T A^T A x_b,    (1)

where A ∈ R^{d×d} is the kernel matrix and x_a, x_b ∈ R^d are feature vectors for subjects a and b, respectively. These similarities are then adjusted by passing them through a softmax activation function while masking each subject's self-similarity. The entire model for a single predictive task and a single fMRI paradigm is as follows:

    E = S_Row(M ∘ (XA)(XA)^T − diag(∞)),  M = 1 − I,    (2)

    S(z)_i = exp(z_i / τ) / Σ_j exp(z_j / τ),    (3)

where E ∈ R^{N×N} is the final similarity matrix, M ∈ R^{N×N} is a mask to remove self-loops in predictions, ∞ ∈ R^N is a vector of infinite-valued elements, 1 ∈ R^{N×N} is a matrix of ones, X ∈ R^{N×d} is the feature matrix, A ∈ R^{d×d} is the kernel taking connectivity features to a lower latent dimension, N is the number of subjects, d is the number of features (FCs), S(z)_i is the softmax function with temperature τ, and S_Row(Z) is a function applying the softmax to each row of the input matrix. A high or low temperature τ makes the subject-subject similarity matrix E more dense or sparse, respectively. The final similarity matrix of training and test set subjects is multiplied by the training set response variable to yield the prediction:

    ŷ = E y_train.    (4)

The model is trained, using gradient descent, by minimizing the following objective function. Here we assume for brevity the existence of two fMRI feature matrices X_a and X_b, and two predictive tasks, one regression (1) and one classification (2), for which we identify four kernel matrices A_{1a}, A_{1b}, A_{2a}, and A_{2b}:

    minimize_{A_{1a}, A_{1b}, A_{2a}, A_{2b}}  Σ_{p∈{a,b}} [ γ_1 ‖E^{(1p)} y^{(1)} − y^{(1)}‖_2^2 + γ_2 CE(E^{(2p)} Y^{(2)}, Y^{(2)}) + Σ_i λ_i ‖A_{ip}‖_1 + Σ_i α_i ‖A_{1p}^T A_{2p}‖_1 ] + Σ_i β_i ‖X_a A_{ia} − X_b A_{ib}‖_F^2,    (5)

where E^{(1a)} ∈ R^{N×N}, for example, is the similarity matrix for task 1 and fMRI paradigm a, y^{(1)} ∈ R^N (numeric) and Y^{(2)} ∈ R^{N×C} (one-hot categorical) are the stacked response variables for tasks 1 and 2, respectively, N is the number of subjects, C is the number of classes in task 2, γ_i is a task importance weight, λ_i is a sparsity-inducing hyperparameter, α_i is a hyperparameter promoting feature disentanglement, and β_i is a hyperparameter promoting alignment between fMRI paradigms. Note that our experiments on the PNC dataset in Section III-B.1 used precomputed vectorized functional connectivity matrices as the input, i.e., X is a matrix in which each row is the vectorized FC of one subject.

The greedy selection algorithm of Section II-C, summarized in Fig. 2A, proceeds as follows:

1) Set the residual r = y_train and the feature set F = {}.
2) Calculate the distance between every pair of elements in r, giving D.
3) Find the FC feature that is most highly correlated with D across all pairs of subjects.
4) Add the feature to F.
5) Run the predictive model on the current features: y'_train = LatSim(X_F, y_train).
6) Calculate the new residual r = y_train − y'_train.
7) Go to step 2.
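To make the forward pass concrete, below is a minimal numpy sketch of the model described above: kernel projection, row softmax with masked self-similarity, and similarity-weighted prediction. The function name, the particular masking implementation (setting the diagonal to −∞ before the softmax), and the use of numpy rather than the authors' PyTorch code are our own choices:

```python
import numpy as np

def latsim_predict(X_train, y_train, X_test, A, tau=1.0):
    """Sketch of the LatSim forward pass for one task and one paradigm.

    Subjects are projected with the learned kernel A, pairwise inner
    products go through a row softmax with temperature tau (training
    subjects' self-similarity masked out), and each prediction is the
    similarity-weighted average of y_train.
    """
    Z_tr, Z_te = X_train @ A, X_test @ A       # latent projections

    def soft_rows(S):
        S = S / tau
        S = S - S.max(axis=1, keepdims=True)   # numerical stability
        E = np.exp(S)
        return E / E.sum(axis=1, keepdims=True)

    S_tr = Z_tr @ Z_tr.T
    np.fill_diagonal(S_tr, -np.inf)            # mask self-loops
    E_tr = soft_rows(S_tr)                     # train-train similarities
    E_te = soft_rows(Z_te @ Z_tr.T)            # test rows attend to train
    return E_tr @ y_train, E_te @ y_train
```

Note that test subjects never contribute their own response variable: each prediction is a convex combination of training responses.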
In the conventional image domain, Zheng et al. have proposed a similar metric learning approach using softmax aggregation for image classification [31]. However, their work makes use of a pre-trained backbone, is semi-supervised, and does not provide all of the possibilities for feature selection, disentanglement, and alignment that LatSim does (see Equation 5).

C. Greedy selection algorithm and model interpretability
A greedy selection algorithm was developed to compare with other interpretability methods [32]. The algorithm selects connections one at a time by ranking their ability to separate dissimilar subjects, i.e., their ability to minimize similarity between subjects that are "far apart" with regards to the current residual:

    f_{i+1} = argmax_j | ρ( X_{a,j} X_{b,j}, D_{a,b} ) |,    (6)

where r_a^{(i)} is the residual at iteration i for subject a, D ∈ R^{N×N} is a centered matrix of differences between residuals, F_i = {f_0, . . ., f_i} is the set of selected connections at iteration i, X ∈ R^{N×d} is the vectorized FC matrix for all subjects, and y ∈ R^N is the response variable. A summary of the algorithm is presented in Figure 2. We describe feature selection results in Section III-B.3.
The greedy algorithm can select the several dozen most relevant features given a single predictive task. To select discriminative features using the fully trained model, we find the correlation between subject similarities and residual distances, as in Equation 6 above, except the FCs are multiplied by the learned model weights:

    F = rank_j | ρ( w_j X_{a,j} · w_j X_{b,j}, D_{a,b} ) |,    (7)

where the residual is set to the response variable, D is calculated as before, A ∈ R^{d×d} is the set of model weights (with w_j aggregating the weights for feature j across latent dimensions), and F is the resulting set of ranked features. Except for greedy feature selection, we optimized prediction of all three response variables (age, sex, and intelligence) at the same time in the same LatSim model. Greedy selection required optimizing a single task at a time, as the best feature for age prediction may not be the best feature for sex or intelligence prediction. LatSim was trained using PyTorch on an NVIDIA Titan Xp with CUDA support.
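The greedy loop can be sketched as follows. This is a simplified stand-in, not the authors' implementation: ordinary least squares replaces LatSim as the inner predictive model, and a plain Pearson correlation between pairwise feature products and centered residual distances plays the role of the selection criterion:

```python
import numpy as np

def greedy_select(X, y, n_feats=5, fit=None):
    """Greedy FC selection (simplified sketch of Fig. 2).

    Each iteration picks the feature whose pairwise products across
    subjects correlate most strongly (in absolute value) with the
    centered squared residual distances, then refits and updates the
    residual. `fit` maps (X_F, y) to predictions; least squares stands
    in for LatSim here.
    """
    if fit is None:
        fit = lambda Xf, t: Xf @ np.linalg.lstsq(Xf, t, rcond=None)[0]
    r, F = y.astype(float), []
    iu = np.triu_indices(len(y), k=1)            # unique subject pairs
    for _ in range(n_feats):
        D = (r[:, None] - r[None, :]) ** 2       # residual distance matrix
        d = D[iu] - D[iu].mean()                 # centered, vectorized
        P = (X[:, None, :] * X[None, :, :])[iu]  # pairwise feature products
        P = P - P.mean(axis=0)
        corr = np.abs(d @ P) / (np.linalg.norm(d) * np.linalg.norm(P, axis=0) + 1e-12)
        if F:
            corr[F] = -1.0                       # never re-pick a feature
        F.append(int(np.argmax(corr)))
        r = y - fit(X[:, F], y)                  # update residual
    return F
```

Because the criterion works on the O(n²) subject pairs rather than the n subjects, a feature driving the response stands out even in small cohorts.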

D. Spurious correlation
We hypothesize that overfitting occurs due to feature noise or confounds, such as scanner motion, whose effects are more severe for smaller cohorts. These confounds may create spurious correlations in a subset of the cohort.
We define a spuriously correlated feature X to be one that appears to be highly correlated with response variable Y for only a subset of subjects:

    ρ(X_u, Y_u) = ρ_S for u ∈ S,    ρ(X_u, Y_u) = 0 for u ∈ C \ S,

where ρ_S is the value of the spurious correlation, C is the study cohort, and S ⊆ C is a subset of the cohort such that C \ S is maximized. Note that a spurious correlation may actually be a true correlation identifying subgroups, but we hypothesize that a spurious correlation is more likely to be false as |S| decreases. We conduct simulation experiments in Section III-A that suggest LatSim is more robust against spurious correlation than traditional feature-based models. When |S| is close to |C| and the effect is systematic, we cannot tell whether the correlation is true or false.

III. RESULTS
We first demonstrate the superior performance of LatSim in a simulation study, then apply it to brain development fMRI data consisting of children and adolescents. We use both full-model and greedy feature selection to identify important functional connections for age, sex, and intelligence prediction.

A. Simulation experiment
We performed a simulation experiment to test LatSim in the presence of a ground truth dataset. A set of N_train = 40, N_val = 120, and N_test = 120 subjects with 10,000 normally distributed features x_ni was generated, where n and i refer to subject and feature, respectively. Each subject was also associated with a response variable y_n. The data generation process for each subject was as follows:

    y_n ~ N(0, 1),  ε_ni ~ N(0, 1),
    x_ni = r y_n + 2 ε_ni,                             1 ≤ i ≤ 1,000,
    x_ni = r_S y_n + 2 ε_ni if n ∈ S, else 2 ε_ni,     1,001 ≤ i ≤ 2,000,
    x_ni = 2 ε_ni,                                     2,001 ≤ i ≤ 10,000,

where r and r_S are correlation-generating parameters for non-spurious and spurious correlations, respectively, and S contains half of the subjects. In other words, the first 1,000 features were correlated with the response variable at level ρ, the next 1,000 features of half of the subjects were correlated at level ρ_S (and had 0 correlation for the other half of subjects), and the remaining 8,000 features were left uncorrelated. We varied r from 0.2 to 1 while keeping r_S = 1. It can be seen that the final feature-to-response-variable correlation is ρ = r/√(r² + 4) for correlated features for all subjects, and ρ_S ≈ r_S/√(r_S² + 4) for spuriously correlated features for half of the subjects.
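This generative process can be checked numerically. The sketch below uses our reconstruction of the noise scaling (a noise term with standard deviation 2, chosen to be consistent with the stated ρ = r/√(r² + 4)); the variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 280, 10_000               # all splits pooled (40 + 120 + 120)
r, r_s = 0.8, 1.0                # correlation-generating parameters

y = rng.standard_normal(n)
X = 2.0 * rng.standard_normal((n, d))         # baseline: pure noise
X[:, :1000] += r * y[:, None]                 # correlated features
S = rng.permutation(n)[: n // 2]              # spurious half of the cohort
X[np.ix_(S, np.arange(1000, 2000))] += r_s * y[S, None]

# Empirical mean correlation of the first block vs. theory r / sqrt(r^2 + 4)
Xc = X[:, :1000] - X[:, :1000].mean(axis=0)
yc = y - y.mean()
corrs = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
theory = r / np.sqrt(r**2 + 4)
```

With r = 0.8 the theoretical correlation is about 0.37, and the empirical mean over the first 1,000 features lands close to it.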
The simulation showed that LatSim performs better than both a GCN [33] and a Ridge Regression model in the presence of the spurious correlation ρ_S (see Figure 3). Additionally, LatSim was the only model identifying the three types of features: correlated, spuriously correlated, and uncorrelated. All results are on the test split. We believe insensitivity to spurious correlation is one of the reasons that LatSim performs well in the low-sample, high-dimensionality regime (see Section IV-C). A multi-layer perceptron (MLP) with L1 regularization performed as well as Ridge Regression (not shown). The GCN model was not interpretable via either weight magnitude or gradient-based saliency. The MLP model identified only sparse features and selected features in the non-informative range. In contrast, LatSim was able to consistently identify the full range of informative features.
Notably, the weights are smaller for correlated features than for non-correlated features. This is an artifact of taking the absolute value of weights in order to average them across latent dimensions. Conversely, the spuriously correlated weights are, on average, smaller than the constantly correlated weights. To explain, suppose there are 2 sets of features, A and B, which are correlated and non-correlated, respectively. The similarity between two subjects will be:

    sim(x_1, x_2) = Σ_{i∈A} w_i² x_{1,i} x_{2,i} + Σ_{i∈B} w_i² x_{1,i} x_{2,i},

hence it does not matter what magnitude the weights on B have, because the expectation of the B terms is zero due to independence and the standard normal distribution of the features. Conversely, if there is a subset of features A that are spuriously correlated, it is beneficial to reduce the spurious weights compared to the non-spurious ones.
B. Brain development study

1) Dataset: We trained and validated our model on the publicly available PNC dataset. The PNC dataset contains multi-paradigm fMRI data, neurocognitive assessments, and psychiatric evaluations for 1,445 healthy adolescents ages 8-23. We chose 620 subjects from the cohort who had both working memory paradigm (nback) and emotion identification paradigm (emoid) fMRI scans, along with results from the 1-hour Wide Range Achievement Test (WRAT) [34] to measure general intelligence. fMRI was performed using a 3T Siemens TIM Trio whole-body scanner with a single-shot, interleaved multi-slice, gradient-echo, echo-planar imaging sequence. The resolution was set to be 3x3x3 mm with 46 slices. The imaging parameters were TR = 3000 ms, TE = 32 ms, and flip angle = 90 degrees. Gradient magnitude was 45 mT/m, with a maximum slew rate of 200 T/m/s. The duration of the nback scan was 11.6 minutes (231 TRs), during which time subjects were asked to conduct the n-back memory task, which is related to working memory and lexical processing [35]. The duration of the emoid scan was 10.5 minutes (210 TRs), during which time subjects viewed faces displaying different emotions and gave an indication of what emotion was displayed. The demographics of our study cohort are given in Table II and the distribution is visualized in Figure 4.
Data was pre-processed with SPM12. This included using multiple regression for motion correction, as well as spatial normalization and smoothing by a 3mm Gaussian kernel [36]. Pre-processing was similar to [37]. The Power template [38] was used to parcellate the BOLD signal among 264 regions of interest, from which a 264×264 symmetric connectivity matrix was constructed using Pearson correlation. The unique d = 34,716 entries in the upper triangle, excluding the main diagonal, were vectorized and taken as the FC features for each subject.
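The FC feature construction described above amounts to a few lines of numpy. In this sketch (function name is ours), `bold` stands for a preprocessed, ROI-averaged time series of shape (T, n_rois):

```python
import numpy as np

def fc_features(bold):
    """Vectorized FC: Pearson correlation between ROI time series,
    keeping the unique upper-triangle entries (diagonal excluded)."""
    fc = np.corrcoef(bold.T)                  # (n_rois, n_rois), symmetric
    iu = np.triu_indices(fc.shape[0], k=1)    # strictly upper triangle
    return fc[iu]

# Power atlas: 264 ROIs -> 264 * 263 / 2 = 34,716 features per subject
```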
The goal of the experiment was to predict subject age, sex, and intelligence as measured by WRAT score. Prediction performance was measured by root mean squared error (RMSE) for age and intelligence prediction, and by accuracy for sex prediction. LatSim was compared against simple linear models (Least Squares and Logistic Regression), a Graph Convolutional Network (GCN), a Multi-Layer Perceptron (MLP), and a Multimodal Graph Convolutional Network (M-GCN). M-GCN is a recent deep learning model for functional connectome analysis [39] based on the CNN [40] architecture.
The inputs to all models were the nback, emoid, and arithmetic sum of nback and emoid task-based vectorized FC matrices, from which separate predictions were made and averaged as part of an ensemble. The sum of nback and emoid FC was used to increase ensemble size. Standardization (Z-score normalization) was performed on the vectorized FC matrices using statistics from the training dataset, applied to the training, validation, and test datasets. Z-score normalization was used only for the LatSim model, since the other models sometimes did not converge for Z-score normalized data. All predictive and feature selection experiments were carried out using 10-fold cross validation (CV), with an 80% training, 10% validation, and 10% test split. Hyperparameters were selected using random grid search (see Table III for LatSim hyperparameters and Table IV for those of other models). The search grid was initialized to be a 5-decade window around prior assumptions of optimal hyperparameters, with search points occurring at decade intervals for all models. A total of 100 grid points were evaluated with three repetitions. The only exceptions were dropout, which was sampled at 0.1 intervals, the latent/hidden dimension, which was set heuristically, and the number of training epochs, which was set to just past the maximum best validation epoch for each model individually. Hyperparameters were estimated for the largest training set size (N = 496) and subsequently used for all training set sizes, with the belief that over-optimization would give a distorted view of the models and reduce reproducibility.
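The leakage-free standardization described above, training-set statistics applied to every split, can be sketched as follows (the helper name is ours):

```python
import numpy as np

def zscore_splits(X_train, *other_splits, eps=1e-8):
    """Z-score all splits using the training split's mean/std only."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0) + eps            # eps guards constant features
    return tuple((X - mu) / sd for X in (X_train, *other_splits))
```

Computing the statistics once on the training split, rather than per split, prevents test-set information from leaking into the model's inputs.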
2) Prediction: LatSim achieved superior predictive performance on the PNC dataset in all three predictive tasks, especially at low sample sizes. The result of the entire experiment is given in Figure 5, and the low and high sample size results are given in Table V.
At N=30, close to the previously reported threshold of N=36 for modestly reproducible fMRI results, we see that LatSim is the only model not to overfit. It surpassed the other models by a significant margin in two of three predictive tasks. Interestingly, LatSim performed much better at small sample sizes than the simple linear models, which we attribute to its use of O(n²) inter-subject connections rather than the n subjects themselves. LatSim remains the best performing model until about N=100, at which point it is only slightly better than the next best predictive model, the GCN. We note that the GCN model performs almost as well as LatSim, except at low sample sizes. We also note that with a categorical response variable such as sex, the performance of both LatSim and the GCN is reduced. We believe the advantage of both LatSim and the GCN model lies in utilizing inter-subject similarities and differences. This is hindered by a lack of granularity in the response variable.
Based on the prediction results and training times, LatSim can fit a dataset quickly (Table VI). This makes it possible to perform large-scale bootstrapping, mixture-of-experts, and ensembling that is not possible with traditional ML models. It also allows for the use of greedy selection.
3) Significant FCs in prediction: The most important FCs for all prediction tasks are given in Table VII. All connections are given with Automated Anatomical Labeling (AAL) region names [41] and with Montreal Neurological Institute (MNI) region coordinates. For age prediction, the most important connections were Insula R to Putamen R and Temporal Inf R to Frontal Med Orb R, being present in the top 10 connections for both the nback and emoid paradigms. For sex prediction, the Precentral L to Temporal Pole Mid R FC was found in the top 10 connections for the emoid paradigm. For intelligence prediction, the Postcentral L to Postcentral R FC was found in the top 10 connections for both the nback and emoid paradigms. In addition, for sex prediction, we identified the Left Inferior Frontal Gyrus (Precentral L) as a region making multiple top 10 connections, as shown in Figure 6.
Using only the first few connections gives half of the predictive power of using the full set of d = 34,716 connections. In particular, Figure 7 shows that the first 3 connections, if properly chosen, can contain more information than the next 50 connections, chosen in the same manner. Specifically, 10 FCs can explain 21% of variance for age, 50 FCs can explain 27%, whereas with the full set of FCs the GCN model can explain 35% and LatSim can explain 38%. The selected connections were chosen using the greedy feature selection algorithm. Figure 7 shows that the FCs chosen by greedy selection are superior to those chosen by gradient-based saliency, as well as to random FCs. Additionally, we compared the FCs chosen by greedy selection to the next-best FCs that would be chosen by it. We believe this helps validate the significance of our identified connections, since, for small numbers of connections, we could not find a minimal combination of FCs that performed as well as that found by greedy selection.
Selecting connections with the fully trained LatSim model corroborated the trend found by greedy selection. As seen in Figure 8, we identified a very few "core" connections that were disproportionately important to the prediction task. The rest of the connections were interchangeable in terms of discriminative ability. Note, for instance, the rapid increase in accuracy for the 3 best FCs and the subsequent plateau in Figure 7. Likewise, almost all of the connections found in the top 50 connections by greedy selection were also found in the top 50 connections of the full model.

IV. DISCUSSION

A. Significant functional networks
The top connections identified by this study contain regions that fall into the default mode (DMN), subcortical (SUB), fronto-parietal task control (FRNT), and sensory/somatomotor (SMT) brain functional networks (FNs). Abbreviations are given as a footnote to Table VII. Regions that belong to the same FN (within-module) tend to be more synchronized than regions from different FNs (between-module) [42]. In Figure 8C, blocks on the main diagonal of the FC matrices represent within-module connections, while blocks off the main diagonal represent between-module connections. Recently, Jiang et al. found that, in an older population, connections between the DMN, SMT, and SUB networks were highly predictive for age [43]. They also found that a DMN-SUB connection was correlated with high cognitive performance.
The DMN was overrepresented in the top 10 connections for all predictive tasks; 36% of regions identified were part of the DMN, whereas DMN regions constitute 22% of the Power atlas. Robust developmental changes have been identified in the DMN, and DMN connectivity has been positively correlated with high cognitive performance [44]. Fan et al. found that DMN connectivity increases from childhood until young adulthood [45]. Pan et al. identified FCs which included DMN regions to be more important in predicting intelligence than FCs which did not [46].
The SMT network was overrepresented in top 10 connection regions for intelligence prediction. In that task, 43% of top 10 connection regions belonged to the SMT network, whereas SMT regions constitute 13% of the Power atlas. It is known that dysfunction in the SMT network is correlated with depression [47]. However, FC represents synchronization between brain regions, and the cause of altered FC may not lie in the region itself. Table VII shows that the top SMT connections involve the CB network, leading to the idea that complex motor control is related to intelligence.
Many of the most important connections we identified for each predictive task are not recognized as part of an FN and are classified as unknown-network (UNK). 24% of regions identified in top 10 connections are labeled UNK, whereas UNK regions constitute 10% of ROIs in the Power atlas. These connections include cerebellar regions; some cerebellar regions are not included in the CB network because they contribute to functions other than motor function, including social thinking and emotion [48].

Fig. 8. As with greedy selection, we show that the first several connections are far more important than the remaining ones (A). Notably, the DMN is highly represented in the top 10 connections for all predictive tasks and modalities (B). The DMN as a whole seems to be important for intelligence prediction (C). Importance was averaged over 50 repetitions of an 80-10-10 train/validation/test split. Discriminative power was calculated as in Equation 7. Correlation was greater than zero for all connections. See Table VII for definitions of abbreviations.

B. Significant FCs
Greedy selection identified 4 FCs present in more than 8 out of 10 CV splits for one of the predictive tasks:

• Insula R to Putamen R (Age). The Insula R has many functions in humans dealing with low-level sensation, emotion, and high-level cognition [50]. Mazzola et al. hypothesized that the Insula R participates in the social brain and found increased activation when participants watched scenes of joyful or angry actors [51]. Increased Putamen R volume has been linked to autism spectrum disorder [52], and reduced amygdala-Putamen R FC has been linked to ADHD [53].
• Temporal Inf R to Frontal Med Orb R (Age). The Temporal Inf R region is associated with language processing [54]. Temporal Inf R FC was found to be decreased in adolescent schizophrenia patients [55]. The Frontal Med Orb R region is part of the prefrontal cortex and is associated with dysfunctional connectivity in major depressive disorder [56]. Notably, AAL regions extend over a large area, and Power atlas ROIs do not correspond exactly to AAL regions.

C. Robustness to spurious correlation
In this section, we argue that LatSim is robust to spurious correlation because it identifies features based on O(n²) inter-subject connections, rather than the number of subjects in the cohort.
Assume feature X is spuriously correlated with response variable Y on a subset of the cohort S ⊆ C, with s = |S|, n = |C|, and X, Y ~ N(0, 1). That is, for each subject u:

    ρ(X_u, Y_u) = ρ_S if u ∈ S, and 0 otherwise.    (11)

LatSim uses the weighted inner product similarity wX_1 · wX_2 between the features of two subjects as input, where w is a learned weight. The correlation between wX_1 wX_2 and D = (Y_1 − Y_2)² determines how well this feature pair predicts the response variable; it is nonzero only when both subjects belong to S:

    ρ(wX_1 wX_2, D) ∝ ρ_S if 1, 2 ∈ S, and 0 otherwise.    (12)

Since expectation is a linear operator, we can find the average value over the entire cohort:

    ρ_{XX,D} = − [s(s − 1) / (n(n − 1))] ρ_S.    (13)

Conversely, in a traditional model, feature X is correlated with response variable Y in proportion to the size of the subset S:

    ρ_{X,Y} = (s/n) ρ_S.    (14)

A plot of the functions in Equations 13 and 14 is given in Figure 9. The maximum reduction in spurious correlation occurs at s/n = 0.5 and is about 1/4 of the value of the spurious correlation. The relative reduction is linear and maximal when s = 0, i.e., when there are no subjects with spurious correlation (not shown). As s/n increases, the reduction in spurious correlation is diminished. This suggests that large model capacity is not the only reason complicated models falter at low sample sizes. We see in our experiments, e.g., in Table V, that the linear models perform worse than both LatSim and some other deep learning models.
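The attenuation predicted by Equations 13 and 14 can be checked numerically. The sketch below uses our own construction of the spurious feature (ρ_S = 0.9 on half of a 200-subject cohort) and compares the subject-level correlation a traditional model sees with the pair-level correlation LatSim sees:

```python
import numpy as np

rng = np.random.default_rng(3)
n, s, rho_s, trials = 200, 100, 0.9, 50       # s/n = 0.5
r_subj, r_pair = [], []
iu = np.triu_indices(n, k=1)                  # unique subject pairs
for _ in range(trials):
    y = rng.standard_normal(n)
    x = rng.standard_normal(n)
    # feature spuriously correlated with y for the first s subjects only
    x[:s] = rho_s * y[:s] + np.sqrt(1 - rho_s**2) * rng.standard_normal(s)
    r_subj.append(np.corrcoef(x, y)[0, 1])            # subject level (Eq. 14)
    prod = (x[:, None] * x[None, :])[iu]              # pairwise similarities
    dist = ((y[:, None] - y[None, :]) ** 2)[iu]       # response distances
    r_pair.append(np.corrcoef(prod, dist)[0, 1])      # pair level (Eq. 13)
```

In this setting the subject-level correlation comes out near (s/n)ρ_S ≈ 0.45, while the pair-level correlation is negative and substantially smaller in magnitude, consistent with the s(s − 1)/(n(n − 1)) scaling.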
Like LatSim, a k-layer GNN model also works on interactions between subjects, but as an adjunct to the prediction from the node self-loop. It also requires either additional degrees of freedom to estimate edge weights, or an arbitrary choice of a distance function and/or threshold. We believe the reason the GCN model did so well in our experiments is that we made it extremely simple: only 2 layers were used, and edge weights were uniform and equal in sum to the self-weights. It was found that expanding the GCN to 3 or 4 layers hurt performance. We believe the performance benefit comes from having a good prior and feature selection, not additional model capacity. Due to the very weak relationships between features and response variables in our data, we believe the advantage of the GCN was in averaging. This strategy breaks down at low sample sizes, where spurious feature correlation still causes large errors to be present at the node self-loop.

V. CONCLUSION
This paper proposes a novel model, LatSim, in the vein of metric learning, that is robust against overfitting at small sample sizes. It is interpretable, computationally efficient, multi-task and multi-view capable, and able to enforce feature disentanglement. First, we showed that LatSim is superior in the small sample size, high dimensionality regime, through both simulation and experiments on real datasets. Second, we identified specific connections within and between the sensory/somatomotor, default mode, fronto-parietal task control, and subcortical networks that are highly discriminative for age, sex, and intelligence in healthy adolescents. Third, we quantified the number of features required to attain a given prediction accuracy. Fourth, we showed that there are several core connections that are more discriminative for each predictive task than other connections. Finally, we found that connections identified by greedy selection were superior to those found by saliency methods. Our model may spur new research into algorithm development and, in turn, lead to new insights into the mechanisms underlying human cognition.
Fig. 1. An overview of the Latent Similarity model. In traditional ML, estimation of response variables is decoupled from inter-subject similarity, whereas GNN models require additional degrees of freedom to estimate edges between subjects. Our model calculates similarity between subjects based on a set of response variables and incorporates multi-modal feature alignment (in addition to ensembling) as well as sparsity and feature disentanglement.
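The prediction step summarized above, a kernel similarity in a learned latent space followed by softmax aggregation of other subjects' labels, can be sketched as follows. This is a simplified stand-in, not the released implementation: the projection W is random here rather than learned, and all names, sizes, and the temperature are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n_train, n_test, d, k = 80, 20, 100, 8  # subjects, features, latent dim (illustrative)

X_train = rng.standard_normal((n_train, d))
y_train = X_train[:, 0] + 0.1 * rng.standard_normal(n_train)
X_test = rng.standard_normal((n_test, d))

W = rng.standard_normal((d, k)) * 0.1  # latent projection (would be learned)
T = 1.0                                # softmax temperature (illustrative)

def latsim_predict(X_tr, y_tr, X_te, W, T):
    # Inner-product similarity between subjects in the learned latent space...
    S = (X_te @ W) @ (X_tr @ W).T
    # ...softmax over training subjects, then label aggregation.
    S = np.exp(S / T)
    S /= S.sum(axis=1, keepdims=True)
    return S @ y_tr

y_hat = latsim_predict(X_train, y_train, X_test, W, T)
```

Because the softmax weights form a convex combination, every prediction lies within the range of the training labels; training would fit W so that latent similarity tracks similarity in the response variable.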

Fig. 2. The greedy feature selection algorithm. A. A summary of the algorithm. B. A flowchart representation. C. Visualization of the residual distance matrices used to choose an FC feature, shown at iterations 1 and 10. D. Histogram of absolute residual distance matrix values at iterations 1 and 10. Since LatSim works on subject pairs, our objective is to fit distances between residuals.
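The greedy loop described in the caption above can be sketched as follows. This is a simplified stand-in, not the released implementation: features are scored by a plain Pearson correlation between pairwise feature products and residual label distances, and residuals are updated with an ordinary least-squares fit (both hypothetical simplifications); sizes, the seed, and the planted informative features are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 60, 500  # subjects, candidate connections (illustrative)
X = rng.standard_normal((n, d))
y = X[:, 3] - 0.5 * X[:, 7] + 0.1 * rng.standard_normal(n)  # two planted features

def greedy_select(X, y, n_pick):
    """Greedily pick features whose pairwise products best explain the
    residual label distances D_ij = (r_i - r_j)**2."""
    chosen, residual = [], y.copy()
    iu = np.triu_indices(len(y), k=1)  # unique subject pairs
    for _ in range(n_pick):
        target = ((residual[:, None] - residual[None, :]) ** 2)[iu]
        best, best_score = None, -np.inf
        for f in range(X.shape[1]):
            if f in chosen:
                continue
            sim = (X[:, f, None] * X[None, :, f])[iu]
            score = abs(np.corrcoef(sim, target)[0, 1])
            if score > best_score:
                best, best_score = f, score
        chosen.append(best)
        # Crude residual update for illustration: OLS on the chosen features.
        coef, *_ = np.linalg.lstsq(X[:, chosen], y, rcond=None)
        residual = y - X[:, chosen] @ coef
    return chosen

picked = greedy_select(X, y, 2)
```

In this synthetic setup the two planted features dominate the pairwise correlation score, so the loop recovers them from among 500 candidates.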

Fig. 3. Results of simulations on synthetic data with spurious correlation. A. Data generated with non-spurious ρ = 1/(2√4.25) ≈ 0.25, present in all subjects, and spurious ρ_S = 1/√5 ≈ 0.45, present in half of subjects. Correlation of the response variable with each feature is shown for the training set. Only the first two thousand features carry any information relevant for prediction. B. Absolute value of learned model weights for the GCN and LatSim models, averaged over the first hidden layer (GCN) or latent dimension (LatSim). Weights are smoothed by a convolution kernel of size 20 to aid visualization. C. Average predictive performance (RMSE between ground truth y_i and predicted ŷ_i) over 6 independent train/validation/test splits, evaluated on the test split.

Fig. 4. Demographics of the 620-subject subset of the PNC study used in our experiments. WRAT score has been adjusted from its raw value by regressing out the effects of age.

Fig. 5. Results of age (A), sex (B), and intelligence (C) prediction experiments on our subset of the PNC dataset. Dashed black lines represent the null model. All models except LatSim performed worse than chance at the N=30 training set size for all tasks.

Fig. 7. Comparison of four connection selection strategies. Dashed black lines represent the null model. Selection of up to 10 connections (A, C, E) was done without dropout, whereas selection of up to 50 connections (B, D, F) was done with 0.5 dropout.

Fig. 8. Important connections identified by running the full model with the entire set of d = 34,716 connections as input. As with greedy selection, we show that the first several connections are far more important than the rest (A). Notably, the DMN is highly represented in the top 10 connections for all predictive tasks and modalities (B). The DMN as a whole appears to be important for intelligence prediction (C). Importance was averaged over 50 repetitions of an 80-10-10 train/validation/test split. Discriminative power was calculated as in Equation 7. Correlation was greater than zero for all connections. See Table VII for definitions of abbreviations.

Fig. 9. A. Spurious correlation per sample in a traditional ML model (dashed lines) versus LatSim (solid lines). B. The absolute reduction in spurious correlation as a function of frequency in the sample.

TABLE I. COMMONLY USED NOTATION.

TABLE II. DEMOGRAPHIC INFORMATION FOR THE SUBSET OF THE PNC DATASET USED IN OUR EXPERIMENTS. WRAT SCORE HAS BEEN ADJUSTED FROM ITS RAW VALUE BY REGRESSING OUT THE EFFECTS OF AGE.

TABLE IV. HYPERPARAMETERS FOR PNC EXPERIMENTS (COMPARISON MODELS).

TABLE V. RESULTS OF PNC EXPERIMENTS.

TABLE VI. TRAINING TIME FOR ALL 10 FOLDS OF 10-FOLD CROSS-VALIDATION.

Fig. 6. A. Identification of an interesting "hub" region, found by emoid-paradigm sex prediction, that was included in 5 separate connections among the top 10 connections across all CV splits. B. Visualization of regions found in the top 10 connections of more than 8 CV splits using the greedy selection algorithm.

TABLE VII. MOST IMPORTANT CONNECTIONS FOR DISCRIMINATING AGE, SEX, AND INTELLIGENCE AMONG HEALTHY ADOLESCENTS. THE # CV SPLITS COLUMN SHOWS THE NUMBER OF CV SPLITS FOR WHICH THE CONNECTION APPEARED IN THE TOP 10 CONNECTIONS OF THE GREEDY SELECTION ALGORITHM.