Prediction of Children’s Reading Skills Using Behavioral, Functional, and Structural Neuroimaging Measures

The ability to decode letters into language sounds is essential for reading success, and accurate identification of children at high risk for decoding impairment is critical for reducing the frequency and severity of reading impairment. We examined the utility of behavioral (standardized tests), and functional and structural neuroimaging measures taken with children at the beginning of a school year for predicting their decoding ability at the end of that school year. Specific patterns of brain activation during phonological processing and morphology, as revealed by voxel-based morphometry (VBM) of gray and white matter densities, predicted later decoding ability. Further, a model combining behavioral and neuroimaging measures predicted decoding outcome significantly better than either behavioral or neuroimaging models alone. Results were validated using cross-validation methods. These findings suggest that neuroimaging methods may be useful in enhancing the early identification of children at risk for poor decoding and reading skills.

The relation between education and cognitive neuroscience is exciting but controversial. It is exciting because noninvasive brain imaging methods are providing unprecedented views of the structural and functional development of the child's brain, and such new views of the maturing brain may provide novel information relevant for enhancing educational practices (Goswami, 2006). The relation between education and neuroscience is controversial because many links represent speculative and potentially flawed interpretations associating animal experimentation with human education (Bruer, 2002). For this reason, the relation between education and neuroscience has been called "a bridge too far" (Bruer, 1997, p. 4).
A critical issue in relating education and cognitive neuroscience is that education involves behavioral goals that are most directly evaluated by behavioral measures. For example, the most direct measures of the effectiveness of reading instruction are behavioral tests of reading comprehension or fluency or of the subskills of reading such as single-word decoding. Cognitive neuroscience studies of children aim to delineate the neural substrates of behaviors, such as reading. For example, a number of studies have found neural correlates of reading in typical-reading or dyslexic children (for reviews, see Dehaene, Cohen, Sigman, & Vinckier, 2005; Eden & Zeffiro, 1998; McCandliss, Cohen, & Dehaene, 2003; Price & Mechelli, 2005; S. E. Shaywitz & Shaywitz, 2005). These studies illuminate the neurobiological substrates of reading, but it is unknown whether such studies provide information that goes beyond behavioral measures with regard to reading itself. Perhaps brain imaging studies, at the optimal limit, can only be as informative about behavior as behavioral measures themselves. By this view, the imaging measures are redundant with behavioral measures, providing neurobiological correlates of behavior (i.e., providing a different level of description of the same phenomenon). Further, some behavioral measures, such as age-standardized language and reading tests, have been optimized for measurement reliability and validity, and measurement reliability and validity are seldom studied in brain imaging research. Finally, measures of a particular kind typically correlate best with measures of the same kind, so that behavioral measures of reading would be expected to be most closely associated with the most important outcome of reading education, that is, the behavior of reading. The preceding points suggest that current brain imaging measures are unlikely to provide insights into reading performance that go beyond behavioral measures.
Alternatively, it may be the case that even now, neuroscience measures of brain structure and function contribute novel, nonredundant information about reading ability.
The goal of this study was to examine directly whether current brain imaging measures can provide novel information for predicting future reading skills in healthy children. We considered prediction to be an important goal because improved prediction of reading skill can facilitate identification of children who may benefit most from intensified or alternative reading instruction so that reading failure is minimized. We focused on one reading skill thought to be essential for effective reading, namely, word decoding skill. Decoding refers to the ability to determine the sound of a word from letters and syllables. Decoding ability is fundamental to reading because learning to read involves learning to relate the sounds of known auditory language (phonology) to letters (orthography). Early and systematic emphasis on decoding leads to superior achievement of reading skills (Adams, 1990; Snow, Burns, & Griffin, 1998). Therefore, improved methods for early identification of young children at risk for impaired decoding abilities hold promise for improving the specificity and effectiveness of early intervention and later achievement of reading skills.
A relatively pure test of word decoding involves reading aloud pronounceable nonsense words, because their proper pronunciation can only be derived from decoding skills (as opposed to words memorized by sight). Such a test also measures phonemic awareness, that is, awareness that words are composed of separable sounds (i.e., phonemes) that are blended to produce words. Phonemic awareness is one of the best predictors of reading success (e.g., Juel, 1988). We therefore used decoding skill as an outcome measure by measuring performance on a widely used test of decoding: the Woodcock Reading Mastery Tests (WRMT) Word Attack subtest. In this test, children attempt to read aloud pronounceable nonwords of successive difficulty.
With advances in neuroimaging, it is possible to examine brain activation and morphometric patterns that are associated with later reading achievement and decoding skills. To date, however, there are only minimal data pertaining to the use of brain measures to predict later reading achievement. All studies in this area have utilized event-related potentials (ERPs) to examine the development of language and reading skills (Espy, Molfese, Molfese, & Modglin, 2004; Molfese, Molfese, & Modglin, 2001). We used data from functional magnetic resonance imaging (fMRI) and structural imaging (VBM) and examined the relations of those measures to future reading skills. One imaging study has suggested that variation in brain morphology, as elucidated by VBM, can be linked to phonetic learning of novel speech sounds in normal-reading adults (Golestani, Paus, & Zatorre, 2002). Although not focused on reading per se, these results are of interest as they suggest that tissue-specific features of particular brain regions (parietal gray and white matter) can, in part, predict the speed or facility of normal, healthy adults in learning novel speech sounds.
In the present study, we investigated whether data obtained from fMRI and VBM can predict later decoding skills and whether fMRI and VBM data can be combined with behavioral data to produce a successful multimodal predictor of future reading skills. We studied 64 healthy children, identified by teachers as at risk for reading difficulty, who were between 8 and 12 years of age and varied in reading ability (see Method section for details on how the children were recruited and characterized). These children were identified as struggling readers by their teachers, but scores on standardized tests ranged widely from poor to average to above average.
We performed an fMRI study using a real-word rhyme judgment task interrogating phoneme awareness at the beginning of the school year (Time 1). At Time 1, optimized VBM analysis (C. D. Good et al., 2001) was also performed with high-resolution anatomical images. Further, a full battery of behavioral measures was obtained before (Time 1) and after one school year (Time 2). We examined (a) how well decoding skills after one school year were predicted by initial fMRI and VBM results; (b) whether the combination of behavioral and neuroimaging results was more predictive than behavioral or neuroimaging results alone using multiple regression; and (c) the validity of the regression models. The model validity check is critical because the residual (or prediction) error of a multiple regression analysis may underestimate the errors found in practice when there are outliers in the data or an excessive number of regressors in the model. We used leave-one-out validation analysis to demonstrate prediction ability and split-half reliability to demonstrate the stability of model estimation (see Supplemental Data online).

Recruitment
All participants were children attending public schools surrounding Pittsburgh in Allegheny County, Pennsylvania, and were recruited from a larger behavioral study of children in the Pittsburgh area. Although most children were within the normal range of reading, all students were initially identified as struggling readers by their teachers. These children were participants in the Power4Kids Reading Initiative, a randomized trial, field study of remedial instruction for children with a wide range of reading difficulties. Parents received explanatory materials about the Power4Kids reading project in the mail, including the fMRI study, and those expressing interest in the fMRI study were recruited. The children gave verbal informed assent in the presence of a parent or guardian, who gave signed informed consent. The children were paid for their participation. A parent questionnaire was used to verify that all participants met inclusion criteria (e.g., right-handed, native English speakers, normal vision and hearing, no brain injury, sensory disorders, psychiatric disorders, attention deficit disorder, medication, claustrophobia, or metal in their bodies). Following recruitment and screening, the children were scanned and baseline measures were administered. All protocols were approved by the University of Pittsburgh and Carnegie Mellon University Institutional Review Boards, and informed assent and consent were obtained for participation from each child and guardian, respectively.

Participants
Children were healthy, right-handed, native English speakers between the ages of 8.2 and 12.4 years old. Out of 95 children tested at Time 1, 73 children returned for Time 2 behavioral assessment and 64 (37 females, 27 males) had complete and usable behavioral and neuroimaging measures. Many of the children underwent one of four different types of reading intervention, but there was no significant effect of intervention on their Time 1 or Time 2 standard scores of decoding (Time 1: p = .92, Time 2: p = .44).

Behavioral Evaluation
Reading ability was assessed with a standard battery of behavioral measures. Behavioral evaluations of reading and reading-related skills were obtained by Mathematica Policy Research (Princeton, NJ).

fMRI Task Design
A real-word rhyme judgment task was used in the scanner with two conditions: rhyme and rest. During the rhyme condition, participants judged whether two visually presented words rhymed (e.g., bait/gate, price/miss) and indicated each response with a button press using their right hand for 'rhyme' and their left hand for 'non-rhyme'. Word pairs were selected so that the visual appearance of the last letters of the two words could not be regularly used to determine whether they rhymed. Stimuli were balanced for frequency of occurrence, number of letters, and syllables between the rhyme and nonrhyme trials and across blocks (Zeno, Ivens, Millard, & Duvvuri, 1995; see Hoeft et al., 2006, for the list of stimuli). Each trial lasted a total of 6 s, consisting of a 4-s period where the two words were presented simultaneously, followed by a 2-s fixation cross. Each task block consisted of a 2-s cue period followed by five trials (32 s total). During the rest block, participants saw a 15-s fixation cross on the screen. The entire scan was 234 s long, including two practice trials at the beginning, and consisted of four rhyme blocks and five rest blocks.

fMRI Data Analysis
Statistical analysis was performed with statistical parametric mapping software (SPM99; Wellcome Department of Cognitive Neurology, London, United Kingdom). After image reconstruction, each participant's data was slice-time corrected (ascending, reference slice 8) and realigned to the first functional volume. Sessions were then normalized with the mean functional volume resampled to 2 × 2 × 2 mm voxels in Montreal Neurological Institute (MNI) stereotaxic space (12 nonlinear iterations, 7 × 8 × 7 nonlinear basis functions, medium regularization, sinc interpolation). Spatial smoothing was done with a Gaussian filter (8-mm full-width half maximum). Each participant's data, which was high-pass filtered at 96 s and globally scaled, was analyzed with a fixed effects model incorporating their 6 motion parameters (x, y, z, pitch, roll, yaw) as regressors. Motion was minimal in these children (Table 1). There were no significant differences in motion between younger and older children (Grades 3 and 5: t(62) = .05, p = .95; Grade 3: n = 26, M = 0.23; Grade 5: n = 38, M = 0.24). Further, there was no significant correlation with age (r = .14, p = .28).
Group analysis was performed with a random effects model with the rhyme versus rest contrast images (one per participant, per contrast identified by fixed effect analysis). One-sample t tests were conducted to identify regions involved in phonological processing (p = .01, false-discovery rate corrected; extent threshold (et) = 10 voxels).
Further, we performed simple regression analysis using Time 2 WRMT Word Attack standard scores for age as a covariate of interest. We identified regions that showed significant positive or negative correlation with contrast values and Time 2 Word Attack standard scores (which we defined as regions of interest, or ROI_fMRI; p = .001, et = 10) and extracted contrast estimates for each participant for further analyses.

Voxel-Based Morphometry Data Analysis
Statistical analysis was performed with SPM2 (Wellcome Department of Cognitive Neurology, London, United Kingdom). After image reconstruction and coregistration with functional images, we used an optimized voxel-based statistical analysis (C. D. Good et al., 2001) with tools modified by Christian Gaser (http://dbm.neuro.uni-jena.de/vbm.html). Images were segmented into gray matter, white matter, and cerebrospinal fluid and normalized to a segmented template using the following parameters for nonlinear normalization: 25-mm cutoff, medium regularization, 16 iterations. Normalization parameters were applied to the initial anatomic volume, and the normalized anatomic images were partitioned into gray matter and white matter. Spatial smoothing was performed at full-width half maximum 12 mm. We performed analyses using both the standard adult template as well as the customized template including all participants and found similar results in terms of location and statistical significance. We also found similar results for modulated and nonmodulated VBM results. Here, we report the results using a customized template without modulation.
We performed multiple regression analysis of gray matter and white matter densities using Time 2 WRMT Word Attack standard scores as a covariate of interest and total gray matter or white matter volume as a nuisance variable. We identified regions that showed significant positive or negative correlation with gray matter or white matter density and Time 2 WRMT Word Attack standard scores (ROI_GM, ROI_WM, respectively, where GM = gray matter and WM = white matter; p = .000001, family-wise error corrected, et = 0) and extracted the average density values for each ROI and for each participant for further analyses.
For both fMRI and VBM, statistical images were overlaid onto the SPM or medical image viewing software MRIcro (http://www.sph.sc.edu/comd/rorden/mricro.html) template image for three-dimensional viewing. Peak coordinates of brain regions with significant effects were converted from MNI to Talairach space with the mni2tal function (http://www.mrc-cbu.cam.ac.uk/Imaging/Common/mnispace.shtml). Brain regions were identified from these x, y, and z coordinates with Talairach Daemon (Research Imaging Center, University of Texas Health Science Center, San Antonio, TX) and confirmed with the Talairach atlas (Talairach & Tournoux, 1988).
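The MNI-to-Talairach step can be illustrated with a short sketch. It assumes the commonly quoted coefficients of the approximate piecewise affine transform associated with the mni2tal script (one set for coordinates at or above the AC plane, another below it); the exact values used in this analysis are not stated in the text, so the numbers below should be treated as an assumption and checked against the script itself. The Python function name simply mirrors the script name.

```python
def mni2tal(x, y, z):
    """Approximate MNI -> Talairach conversion.

    Coefficients are the commonly quoted values for the piecewise
    affine approximation (assumption, not taken from the paper).
    """
    tx = 0.9900 * x
    if z >= 0:
        # at or above the anterior commissure plane
        ty = 0.9688 * y + 0.0460 * z
        tz = -0.0485 * y + 0.9189 * z
    else:
        # below the anterior commissure plane
        ty = 0.9688 * y + 0.0420 * z
        tz = -0.0485 * y + 0.8390 * z
    return (tx, ty, tz)


# Hypothetical peak coordinate in MNI space (not a coordinate from the study)
tal_peak = mni2tal(10.0, 20.0, 30.0)
```

The converted coordinate would then be looked up in an atlas tool such as the Talairach Daemon, as described above.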

Definition of Prediction Models
We performed prediction analyses using a method similar to other studies predicting outcome where there were a number of predicting variables (Poulakis et al., 2004; Woodhouse et al., 2003) (Figure 1a). All analyses were performed with Matlab (MathWorks, Natick, MA). First, simple regression analyses were performed between Time 2 WRMT Word Attack standard scores and each Time 1 behavioral measure. It is possible that we missed some important predictors (suppressor variables) that on their own do not correlate with outcome but may reduce error variance by explaining additional variance. However, we first performed simple regression analyses to select variables that were entered into multiple regression analyses, in order to match the methods used to derive neuroimaging predictors and also to reduce the number of variables objectively.
In multiple regression analyses, $y'_i$ are the Time 2 WRMT Word Attack standard scores ($i = 1, \ldots, N$, where $N$ = total number of participants), and $x_{ki}$ ($k = 1, \ldots, K$, where $K$ = total number of behavioral variables) are the behavioral scores. The predicted Time 2 Word Attack standard scores $Y'_i$ were then denoted with weights $b_k$ for each participant $i$: $Y'_i = b_0 + \sum_{k=1}^{K} b_k x_{ki}$. Using the least squares method (Minotani, 2004), we determined $b_1, \ldots, b_K$ and the constant term $b_0$ to minimize the sum of squared deviations and provide the best fit of the multiple regression model, the best correlation coefficient r² of the model, and the best contribution R² for each variable. The regression residual is represented by $y'_i - Y'_i$. First, we performed multiple regression with all behavioral variables defined in the simple regression analyses using the enter procedure. Next, using the stepwise procedure (criteria: probability-of-F-to-enter ≤ .05, probability-of-F-to-remove ≥ .1, which are the default settings in Matlab), we obtained the behavioral model. We also performed forward and backward procedures and obtained similar results not only for the behavioral model but also for the neuroimaging and combined models. In the multiple regression model obtained from the stepwise procedure, behavioral measures that contributed significantly according to the above criteria were defined as behavioral predictors. $Y'_i$ was defined as the prediction index $\mathrm{PI}_i$ for a given participant $i$.
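The least-squares fit and the resulting prediction index can be sketched in a few lines. This is a minimal illustration in plain Python rather than the Matlab code used in the study; the function names (`solve_linear`, `fit_ols`, `prediction_index`, `r_squared`) and the toy data are hypothetical, and the stepwise selection step is not shown.

```python
def solve_linear(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(x_rows, y):
    """Return [b0, b1, ..., bK] minimizing the sum of squared residuals."""
    design = [[1.0] + list(row) for row in x_rows]  # leading 1 for the constant term
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve_linear(xtx, xty)


def prediction_index(b, row):
    """Constant term plus the weighted sum of one participant's predictor scores."""
    return b[0] + sum(bk * xk for bk, xk in zip(b[1:], row))


def r_squared(b, x_rows, y):
    """Proportion of outcome variance explained by the fitted model."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - prediction_index(b, row)) ** 2 for row, yi in zip(x_rows, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Toy data (hypothetical): outcome is exactly 10 + 2*x1 + 0.5*x2
x_rows = [(1, 4), (2, 1), (3, 7), (4, 2), (5, 9), (6, 3)]
y = [10 + 2 * a + 0.5 * b for a, b in x_rows]
b = fit_ols(x_rows, y)
```

On noise-free toy data the fit recovers the generating weights; with real scores the residuals would be nonzero and r² below 1.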
Prediction indices were plotted against Time 2 WRMT Word Attack standard scores, and r² and p values were computed. Linear regression lines and two types of 95% prediction intervals (Minotani, 2004) were drawn. Prediction intervals were calculated as follows. The predicted value from the regression line, $\hat{Y}_i$, is computed from $b_0$, $b_1$, $b_2$, and so forth, which are the results of the regression model fit, and the residual $\varepsilon_i$ is the difference between the observed Time 2 score and $\hat{Y}_i$. Assuming Gaussian data, two 95% intervals were calculated: one for the range of prediction indices within which there is a 95% probability that the regression line of the next experimental group will fall (95% prediction interval, group), and one for the range within which there is a 95% probability that an individual's Time 2 WRMT Word Attack standard score in the next experiment will fall (95% prediction interval, individual; Minotani, 2004). In these calculations, PI = prediction index, $\hat{Y}_i = \alpha_0 + \alpha_1 \mathrm{PI}$, $N$ = number of participants, $\mathrm{PI}_0$ = a new given PI with which we predicted the confidence interval, $\mathrm{PI}_i$ = the original PIs (the data themselves), and $\overline{\mathrm{PI}}$ = mean value of the $\mathrm{PI}_i$. We defined the neuroimaging and combined (behavioral and neuroimaging) models similarly and subsequently compared different models.
For neuroimaging data, extracted contrast values from the fMRI analysis (ROI_fMRI) and gray matter and white matter density values from VBM analyses (ROI_GM, ROI_WM, respectively) were submitted to multiple regression analyses similar to those used to identify behavioral predictors, and those that contributed significantly were defined as neuroimaging predictors. Prediction indices were calculated similarly, and correlations between prediction indices and Time 2 WRMT Word Attack standard scores were examined (Figure 1a). This was defined as the neuroimaging model.
Additionally, the 12 behavioral variables and 10 neuroimaging predictors that showed correlation with Time 2 WRMT Word Attack standard scores were combined and submitted to similar analyses to identify combined predictors, which yielded eight predictors. Prediction indices were calculated similarly, and correlations between prediction indices and Time 2 Word Attack standard scores were examined (Figure 1a). This was defined as the eight-variable combined model. There was some variation in the number of days between Time 1 and Time 2 testing sessions, but this interval did not correlate with Time 2 WRMT Word Attack standard scores (r = −.06, p = .66). Therefore, time between testing sessions was not considered in further analyses.
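For reference, the textbook forms of these two intervals for a simple linear regression of the outcome on PI can be written as below. This is a sketch under the assumption that the standard t-based formulas apply; the residual standard error $s$, the sum of squares $S_{\mathrm{PI}}$, and the Student t quantile $t_{0.975,\,N-2}$ are introduced here for illustration and are not defined in the text above.

```latex
\begin{align*}
\hat{Y}_0 &= \alpha_0 + \alpha_1 \mathrm{PI}_0, \qquad
s^2 = \frac{1}{N-2}\sum_{i=1}^{N} \varepsilon_i^2, \qquad
S_{\mathrm{PI}} = \sum_{i=1}^{N} \bigl(\mathrm{PI}_i - \overline{\mathrm{PI}}\bigr)^2 \\[6pt]
% 95% interval for the group regression line (mean response) at PI_0
\text{group:}\quad & \hat{Y}_0 \pm t_{0.975,\,N-2}\; s
  \sqrt{\frac{1}{N} + \frac{\bigl(\mathrm{PI}_0 - \overline{\mathrm{PI}}\bigr)^2}{S_{\mathrm{PI}}}} \\[6pt]
% 95% interval for a single new individual's score at PI_0
\text{individual:}\quad & \hat{Y}_0 \pm t_{0.975,\,N-2}\; s
  \sqrt{1 + \frac{1}{N} + \frac{\bigl(\mathrm{PI}_0 - \overline{\mathrm{PI}}\bigr)^2}{S_{\mathrm{PI}}}}
\end{align*}
```

The only difference between the two is the additional 1 under the square root, which accounts for the scatter of a single new observation around the regression line; the individual interval is therefore always the wider of the two.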

Prediction Analyses Controlling for Initial Decoding Skills, Age, or PPVT Standard Scores
We performed separate partial correlation analyses for each model using Time 1 WRMT Word Attack standard scores as covariates of no interest to examine whether results reflected only strong associations between Time 1 Word Attack standard scores and the behavioral or neuroimaging measures, rather than a unique contribution of Time 2 Word Attack standard scores. We also partialed out Time 1 age and PPVT standard scores to avoid bias from these variables. We chose PPVT in place of an IQ measure because IQ was not obtained in this study; PPVT highly correlates with full-scale IQ (.90) in children on the Wechsler Intelligence Scale for Children, Third Edition (WISC-III; Dunn & Dunn, 1997).
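A first-order partial correlation of this kind can be sketched as follows. The example assumes a single covariate and the standard formula based on the three pairwise Pearson correlations; the function names and the toy vectors are hypothetical, with `x` standing in for a prediction index, `y` for the Time 2 score, and `z` for the covariate being partialed out (for example, the Time 1 score).

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_r(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy data (hypothetical): x and y are related only through the shared covariate z
z = [1.0, 1.0, -1.0, -1.0]
x = [2.0, 0.0, 0.0, -2.0]   # z plus a component unrelated to z
y = [2.0, 0.0, -2.0, 0.0]   # z plus a different unrelated component
raw = pearson_r(x, y)
adjusted = partial_r(x, y, z)
```

In this toy case the raw correlation is entirely due to the covariate, so the partial correlation drops to zero; a predictor that carries information beyond the covariate would keep a nonzero partial correlation.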

Leave-One-Out Cross-Validation Method
In the leave-one-out validation analysis, we tested whether single participant Time 2 WRMT Word Attack standard scores were predicted from the remaining 63 participants in the behavioral, neuroimaging, or eight-variable combined models (Figure 1b). We first performed a multiple regression analysis with 63 participants, leaving out the single participant to be tested. The 63 participants were resampled 64 times, giving the best-fitting $b_i$'s for each sample. The $b_i$'s were then applied to the omitted participant, yielding a prediction index. The 64 predicted values were plotted against Time 2 WRMT Word Attack standard scores. Mean prediction indices of the 64 trained sets (i.e., predicted value) were correlated with WRMT Word Attack Time 2 standard scores, and a linear regression line and 95% prediction intervals of individual expected Time 2 Word Attack standard scores (see above Definition of Prediction Models for definition) were drawn for each model. Absolute differences between the predicted values and the actual Time 2 Word Attack standard score values of the omitted participant were calculated. One-way repeated measures analysis of variance (ANOVA) and post hoc comparisons were performed between models.
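The leave-one-out procedure can be sketched as a loop that refits the regression without one participant and then predicts that participant. This is a minimal plain-Python illustration under the assumption of an ordinary least-squares fit with a fixed set of predictors (variable selection inside the loop is not shown); the function names and the toy data are hypothetical and unrelated to the study's scores.

```python
def fit_ols(x_rows, y):
    """Least-squares coefficients [b0, b1, ..., bK] via the normal equations."""
    design = [[1.0] + list(r) for r in x_rows]
    p = len(design[0])
    # Augmented matrix [X'X | X'y]
    m = [[sum(r[i] * r[j] for r in design) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(design, y))] for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, p), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        pivot_val = m[c][c]
        m[c] = [v / pivot_val for v in m[c]]
        for r in range(p):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[p] for row in m]


def predict(b, row):
    """Prediction index for one participant from fitted weights."""
    return b[0] + sum(bk * xk for bk, xk in zip(b[1:], row))


def leave_one_out(x_rows, y):
    """Predict each participant from a model fitted to all other participants."""
    preds = []
    for i in range(len(y)):
        train_x = x_rows[:i] + x_rows[i + 1:]
        train_y = y[:i] + y[i + 1:]
        b = fit_ols(train_x, train_y)
        preds.append(predict(b, x_rows[i]))
    return preds


# Toy data (hypothetical): linear signal plus small fixed perturbations
x_rows = [(1.0, 4.0), (2.0, 1.0), (3.0, 7.0), (4.0, 2.0),
          (5.0, 9.0), (6.0, 3.0), (7.0, 8.0)]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1]
y = [10 + 2 * a + 0.5 * b + e for (a, b), e in zip(x_rows, noise)]

preds = leave_one_out(x_rows, y)
abs_errors = [abs(p - t) for p, t in zip(preds, y)]  # one error per omitted participant
```

The list `abs_errors` corresponds to the absolute differences described above; computing it separately for each model gives the per-participant values that a repeated measures comparison between models would use.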

Demographic and Behavioral Measures
Demographic information is presented in Table 1. WRMT Word Attack standard scores were the critical outcome measure, and scores ranged from well above (138) to well below (66) the expected mean of 100. Time 1 and Time 2 Word Attack standard scores correlated highly (r = .68, p < .001).
Among these 12 Time 1 behavioral measures that correlated significantly with Time 2 WRMT Word Attack standard scores, 3 variables remained as significant predictors when multiple regression analysis was performed, multiple r² = .65, F(3, 60) = 36.59, p < .001. These 3 remaining variables were defined as behavioral predictors (WRMT Word Attack, t = 5.70, p < .001; Woodcock Johnson Spelling, t = 3.05, p = .003; Woodcock Johnson Calculation, t = 2.53, p = .014) and combined into prediction indices, calculated by summing the constant and the products of the 3 variables with their respective coefficients (see Supplemental Figure A1 online). Thus, this behavioral model was predictive of later decoding skills.

Neuroimaging Model: Predicting Later Decoding Skills Using Functional Magnetic Resonance Imaging and Voxel-Based Morphometry Measures
We compared fMRI activation for real-word rhyme judgments versus rest state (Figure 2, Table 2). We then performed whole-brain regression analyses correlating Time 1 fMRI activation and gray matter or white matter VBM densities with Time 2 WRMT Word Attack standard scores and found 10 brain regions that showed significant positive or negative correlations (ROI_fMRI, ROI_GM, and ROI_WM regions, respectively; Figure 3 and Table 3). Mean contrasts (effect size calculated as the linear combination of beta parameters) or density information were extracted from these ROIs for each participant. Consistency maps from permutation analyses of fMRI and VBM multiple regression analyses showed consistent activation and morphometric patterns for these ROIs (see Supplemental Figure B and text online). Using multiple regression, four ROIs were found to contribute significantly, which were defined as neuroimaging predictors (multiple r² = .57, F(4, 59) = 19.41, p < .001; ROI_fMRI right fusiform ~ middle occipital gyri (RFG/MOG), t = 4.30, p < .001; ROI_fMRI left middle temporal gyrus (LMTG), t = 3.21, p = .002; ROI_fMRI right middle frontal gyrus (RMFG), t = −2.36, p = .021; ROI_GM right posterior fusiform gyrus (RFGp), t = 3.90, p < .001; Figure 3 and Table 3). Prediction indices were calculated from these four predictors as described above (Supplemental Figure A2 online). Thus, the neuroimaging model was also predictive of later decoding skills.

Combined Model: Predicting Later Decoding Skills Combining Behavioral and Neuroimaging Measures
We repeated multiple regression and prediction analyses using the 12 behavioral and 10 neuroimaging variables that showed significant correlation with Time 2 WRMT Word Attack standard scores (Figure 1a). The goal was to achieve an index that would best predict Time 2 Word Attack standard scores while minimizing the total number of predictors. Using multiple regression, we found that 8 variables contributed significantly, which were then defined as combined predictors (multiple r² = .81, F(8, 55) = 30.18, p < .001; standard scores on Word Attack, t = 4.16, p < .001; Woodcock Johnson Calculation, t = 2.94, p = .005; Woodcock Johnson Spelling, t = 2.79, p = .007; ROI_fMRI RFG/MOG, t = 3.95, p < .001; ROI_GM RFGp, t = 2.74, p = .008; ROI_GM right anterior frontal gyrus (RFGa), t = 2.94, p = .005; ROI_WM left inferior parietal lobule (LIPL), t = 2.26, p = .028; ROI_WM left superior temporal lobe (LSTL), t = 2.14, p = .037; Figure 3 and Table 3). Prediction indices were calculated as described above (see Supplemental Figure A3 online). Thus, the combined model was also predictive of Time 2 Word Attack standard scores.
When behavioral predictors were entered first, neuroimaging predictors explained 23% more variance in addition to the variance explained by behavioral predictors, F(4, 56) = 6.42, p < .001. When neuroimaging predictors were entered first, behavioral predictors explained 15% of the variance in addition to the variance explained by neuroimaging predictors, F(3, 56) = 14.78, p < .001. Thus, the combined model was significantly better than the behavioral model or the neuroimaging model (see Supplemental Figure A3 and text online).

Prediction Analyses Controlling for Initial Decoding Skills, Age, or PPVT Scores
There is a possibility that the results reported thus far may merely be a result of strong associations between Time 1 WRMT Word Attack standard scores and the behavioral scores or prediction indices, rather than a unique contribution of Time 2 Word Attack standard scores. Therefore, we performed partial correlation analyses for the three models, partialing out Time 1 Word Attack standard scores, and found that the results remained significant (behavioral r² = .21, neuroimaging r² = .40, combined r² = .59; all ps < .001; details are provided in the Supplemental text online).
We also partialed out Time 1 age and PPVT standard scores. The results remained significant (for age, behavioral r² = .63, neuroimaging r² = .56, and combined r² = .81; for PPVT, behavioral r² = .63, neuroimaging r² = .57, and combined r² = .81; all ps < .001). Hierarchical regression analyses of the three models entering Time 1 WRMT Word Attack standard scores, age, or PPVT standard scores first and examining the remaining variance showed similarly significant results.

Validation of Predictability of Models
The following tests use permutation approaches by effectively reformulating the question on the differences between the predictability of the models as measured by classifier performance in the traditionally used framework of hypothesis testing (P. Good, 1994). The model validity check is critical because the residual (or prediction) error of a multiple regression analysis may underestimate the errors found in practice when there are outliers in the data or an excessive number of regressors in the model. We used leave-one-out cross-validation analysis to suppress possible effects of outliers. We further performed bootstrap and split-half reliability analyses to demonstrate the stability and predictability of model estimation (Supplemental Figure C and text online).
Using leave-one-out cross-validation analyses, we tested whether single participant Time 2 WRMT Word Attack standard scores (validation participant) can be predicted from the remaining 63 participants (training set) in the behavioral, neuroimaging, or combined models, which allows for testing of generalization error (Figure D4 online). There was a significant effect of models, F(1, 63) = 5.33, p = .024, which was driven by the significantly greater accuracy of predicting the validation participant's Time 2 Word Attack standard scores (i.e., less deviation) of the combined model compared with the behavioral, t(63) = 2.31, p = .024, and the neuroimaging models, t(63) = 3.02, p = .004. There was no significant difference between the neuroimaging and behavioral models, t(63) = 0.78, p = .44.
One might expect that the combined model performs better simply as a result of the increased number of predictors in the model. Hence, we repeated the analyses including the same number of predictors (eight or three) for each model (see Supplemental text online). Eight variables per model were chosen to match the combined model, which had the largest number of variables. To avoid bias toward the combined model, we also tested with three variables per model, which were chosen to match the behavioral model, which contains the smallest number of variables; this choice biases toward the behavioral model. With the number of predictors held constant, the combined model performed better than the behavioral or the neuroimaging model with either eight or three variables (see Supplemental Figure E and text online).

Discussion
We examined how well behavioral and brain measures taken at the beginning of the school year predicted a critical ability for reading, that is decoding skills, at the end of the school year for children 8 -12 years of age. Standardized behavioral measures of reading and language yielded a behavioral model that accounted for 65% of the variance in end-of-the-year performance on the WRMT Word Attack subtest, a standardized test of decoding. Brain imaging measures, comprised of both functional (fMRI) and gray and white matter morphological (VBM) scores, yielded a neuroimaging model that accounted for 57% of later variance in Regions showing a relation between white matter (WM) density and Time 2 Word Attack standard scores. The numeral 1 (in green) indicates predictors included in the neuroimaging model, and the numeral 2 (in red) indicates predictors included in the combined model (note: these models were derived independently; hence, the slight differences in predictors that were included in the models). Scaling bars indicate Tesla values. VBM ϭ voxel-based morphometry. decoding ability (which was nonsignificantly less than the behavioral model). Most importantly, the combined model of behavioral and neuroimaging measures was most predictive of later decoding skills, and explained 81% of the variance. The combined model was significantly better than either the behavioral or the neuroimaging model, as indicated by direct comparisons with multivariate analyses and validation tests. Thus, neuroimaging provided a unique kind of predictive information that was not merely redundant with behavioral measures. The combination of behavioral and neuroimaging most accurately predicted how much a year of education would influence a fundamental reading skill. Subsidiary analyses supported the reliability and validity of all three models. 
The leave-one-out cross-validation, bootstrap, and split-half reliability analyses indicated that the findings were not due to either outlier values or too many regressors in the models, and that the models were stable. The findings held when the number of regressors was equated across the models or when initial WRMT Word Attack standard scores were used as a covariate (although this reduced the explained variance of all models). Thus, the behavioral and neuroimaging measures were similar in their validity and robustness as predictors.
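As a rough illustration of the leave-one-out logic described above, the following sketch refits an ordinary least-squares model with each child held out in turn and records the absolute prediction error for that child. It is a minimal sketch under stated assumptions, not the analysis pipeline used in the study: the data are synthetic, the predictor counts and weights are invented, and the function name `loocv_abs_errors` is hypothetical.

```python
import numpy as np


def loocv_abs_errors(X, y):
    """Leave-one-out absolute prediction errors for an OLS model (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    design = np.column_stack([np.ones(n), X])  # intercept column plus predictors
    errors = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i  # every participant except the held-out one
        coef, *_ = np.linalg.lstsq(design[train], y[train], rcond=None)
        errors[i] = abs(design[i] @ coef - y[i])
    return errors


# Synthetic data for illustration only; these are NOT the study's measures.
rng = np.random.default_rng(0)
n_children = 60
behavior = rng.normal(size=(n_children, 3))  # assumed behavioral predictors
imaging = rng.normal(size=(n_children, 3))   # assumed neuroimaging predictors
outcome = (
    100
    + behavior @ np.array([4.0, 3.0, 2.0])
    + imaging @ np.array([3.0, 2.0, 1.0])
    + rng.normal(scale=3.0, size=n_children)
)

err_behavioral = loocv_abs_errors(behavior, outcome)
err_combined = loocv_abs_errors(np.hstack([behavior, imaging]), outcome)
print("mean abs error, behavioral model:", err_behavioral.mean())
print("mean abs error, combined model:  ", err_combined.mean())
```

Because each prediction is made for a child who was excluded from model fitting, adding predictors that carry no real signal would tend to increase these errors rather than reduce them, which is why this kind of check speaks to the concern about too many regressors.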
In agreement with previous studies that have repeatedly shown phonological awareness to be one of the best predictors of reading success (e.g., Juel, 1988), many behavioral tests that showed significant correlation with Time 2 WRMT Word Attack standard scores were related to decoding and phoneme awareness. Time 1 Word Attack scores, which should be most predictive of Time 2 Word Attack scores (the outcome measure in our study), accounted for 49% of the variance on their own. Further, grapheme-phoneme knowledge (spelling, Time 1 Woodcock Johnson Spelling standard scores) was another strong predictor. One rather surprising predictor, however, was the children's ability to calculate (Time 1 Woodcock Johnson Calculation standard scores). This is, however, in agreement with what has been found previously (Nairoo, 1972). The prediction of later decoding skills from the ability to calculate may also be related to the higher prevalence of dyscalculia (a condition with a specific disturbance of arithmetic ability) in dyslexia (a developmental disorder characterized by difficulties with accurate and/or fluent word recognition and by poor spelling and decoding abilities that is often unexpected in relation to other cognitive abilities and the provision of effective classroom instruction; Lyon, Shaywitz, & Shaywitz, 2003), with prevalence estimates ranging from 17% to 64% (e.g., Badian, 1999; Gross-Tsur, Manor, & Shalev, 1996). These studies suggest that there may be a relationship between calculation and reading skills. In addition, neuroimaging studies have shown relationships between language or phonological processing and calculation in language-related brain regions, including the left parietal region (Dehaene, Spelke, Pinel, Stanescu, & Tsivkin, 1999; Simon, Mangin, Cohen, Le Bihan, & Dehaene, 2002). However, children with reading difficulties and children with comorbid reading and mathematics difficulties progress at about the same rate in reading achievement (Jordan, Kaplan, & Hanich, 2002).
The exact role of calculation abilities in predicting later decoding skills needs further investigation. Other reading measures tested in these children did not remain as predictors in the behavioral model, which may partly be due to collinearity effects of Time 1 Word Attack standard scores and other reading tests.
Some have questioned the utility of behavioral tests for accurately predicting risk for poor decoding skills (Hammill, Mather, Allen, & Roberts, 2002). In a behavioral study examining 200 children between 1st and 6th grades with multiple regression analyses, phonology composites accounted for 40% of the variance in younger children and 42% of the variance when all children were combined in predicting word identification skills. The authors of that study concluded, however, that none of the composites studied met criteria that are considered to be practically useful. More recent multivariate studies, however, show better predictive values (e.g., Bowey, 2005) and are more consistent with our results.

Neuroimaging predictors of later superior decoding skills included greater brain activation in the right fusiform~middle occipital and left middle temporal gyri, lesser activation in the right middle frontal gyrus, greater gray matter density in right fusiform gyrus, and greater white matter density in the left superior temporal and inferior parietal regions. Findings that greater left temporal-lobe activation and white matter are associated with superior decoding skill are consistent with prior studies in normal and dyslexic readers (e.g., structural studies: Deutsch et al., 2005; Klingberg et al., 2000; Silani et al., 2005; functional studies: Turkeltaub, Gareau, Flowers, Zeffiro, & Eden, 2003). The relation between lesser right frontal activation and later superior decoding skill may be related to findings indicating that the development of reading ability involves a reduction of right-hemisphere activation and a growth of left-hemisphere activation. Less expected were the relations of greater right fusiform activation and gray matter density with later superior decoding. Some studies have reported reduced VBM gray matter (W. E. Brown et al., 2001; Eckert et al., 2005) and activation (Aylward et al., 2003) in the right occipitotemporal region in dyslexia, which may imply that greater activation leads to better reading outcome. Other studies, however, have reported a developmental decrease in right fusiform activation associated with increasing age or gains in reading and language tasks in cross-sectional studies of healthy children and adults (T. T. Brown et al., 2005; Turkeltaub et al., 2003); these studies may indicate that less activation (more like that of adults) leads to better outcome and, hence, may be inconsistent with our results.
It is difficult to directly relate our findings, which are derived from a sample of children with a broad range of reading ability, to the prior imaging studies mentioned above, which examined severely dyslexic participants, highly skilled typically reading participants, or both. Speculatively, it may also be that the development of reading ability in the age range of our study depends transiently on mechanisms supported by the right fusiform gyrus before becoming dependent on left-hemisphere mechanisms in the adult reader. This possibility is supported not only by findings of right fusiform activation in children performing reading tasks (Aylward et al., 2003; T. T. Brown et al., 2005; Turkeltaub et al., 2003) but also by evidence that effective remediation for dyslexia involves not only increased activation in left-hemisphere language areas but also increased activation in many right-hemisphere areas (Aylward et al., 2003; B. A. Shaywitz et al., 2004; Temple et al., 2003).
The combined model predicted later decoding skills significantly better than either the behavioral or neuroimaging models, but our study has several limitations. First, the sample is not epidemiologically representative and, therefore, generalizability of the findings is unknown. Second, participants were followed for only one school year, and evaluation of longer term predictive models will be important. Third, we focused on one measure of word decoding, a pseudoword reading task that measures phonological processing, as the outcome measure. We chose this measure because decoding accounts for most of the variance in reading comprehension, the development of language-specific phonology is essential for reading success, and early and systematic emphasis on decoding leads to better achievement of reading skills (Adams, 1990; Hulme & Snowling, 2005; E. Richardson, DiBenedetto, & Adler, 1982; Shankweiler et al., 1999; Snow et al., 1998; Snowling, 1987). There may, however, be better measures or combinations of measures that will be more suitable as an outcome measure (Leonard, Eckert, Given, Virginia, & Eden, 2006), because reading comprehension and reading fluency involve many processes beyond single-word decoding. In addition, one might argue that gains in reading ability may be a more appropriate outcome measure than Time 2 scores. In our preliminary analyses (results not shown), however, the brain regions whose activation predicted gains in WRMT Word Attack scores were similar to those of this study.
Fourth, behavioral and neuroimaging predictors used here were selected from univariate and multivariate analyses with the total sample and were applied to the validation tests; that is, for each permutation, ROIs were not re-identified, and the same contrast estimates and gray and white matter volume were used throughout. In addition, there may be important predictors that do not show significant correlation in a univariate analysis but contribute significantly when included in multivariate analyses. Fifth, we deliberately chose a purely empirical approach so that we could optimize both behavioral and brain predictions of Time 2 scores. Such a purely empirical approach may highlight brain regions that are not yet well understood in reading, and that may merit a hypothesis-based approach in the future, but it reduces our ability to interpret why certain brain measures predicted future decoding skill. Future studies with a predefined set of brain regions (a more theoretical approach) or utilizing multi-voxel pattern analysis (MVPA) will be of interest (Norman, Polyn, Detre, & Haxby, 2006). Sixth, the models created examined linear relationships, and an increasing number of studies show nonlinear effects of development (e.g., Shaw et al., 2006).

Whether this approach has true clinical utility is unknown. Although the leave-one-out cross-validation analyses showed significantly greater prediction for the combined as compared with the behavioral model, the gain was only 1.17 Word Attack standard score points on average (4.16 vs. 5.33). It is thought that the sensitivity index, specificity index, and positive predictive value should all reach at least 75% in order for a measure to be considered acceptable for practical use and suitable for screening purposes (Gredler, 2000).
In our sample, the behavioral, neuroimaging, and combined models each achieved sensitivity, specificity, and positive predictive value greater than 75% in classifying children with reading disability at Time 2 (data not shown); reading disability was defined by Time 2 Word Attack standard scores of 85 or below, which is a common threshold. Ultimately, it will be necessary to identify a predefined set of predictors, independent of the sample, that passes the above threshold (a minimum of 75%) in a prospective study with a longer follow-up period.
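The three screening indices discussed above can be computed from a simple two-by-two classification, as in the following minimal sketch. The cutoff of 85 comes from the text; applying the same cutoff to model-predicted scores is an assumption made here for illustration, since the text does not state how predictions were converted into classifications. The scores are invented, and `screening_metrics` is a hypothetical helper name.

```python
import numpy as np


def screening_metrics(observed, predicted, cutoff=85.0):
    """Sensitivity, specificity, and positive predictive value (hypothetical helper).

    A child counts as having reading disability when the observed Time 2 score
    is at or below the cutoff, and as flagged when the predicted score is at or
    below the same cutoff (an assumption, not a detail taken from the study).
    """
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    actual = observed <= cutoff
    flagged = predicted <= cutoff

    tp = int(np.sum(actual & flagged))    # correctly flagged
    fn = int(np.sum(actual & ~flagged))   # missed cases
    tn = int(np.sum(~actual & ~flagged))  # correctly not flagged
    fp = int(np.sum(~actual & flagged))   # false alarms

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
    }


# Invented standard scores for eight children, for illustration only.
observed_scores = [80, 84, 90, 100, 110, 85, 95, 70]
predicted_scores = [82, 88, 91, 97, 105, 83, 84, 75]
metrics = screening_metrics(observed_scores, predicted_scores)
print(metrics)
```

In this toy example each index equals 0.75, which sits exactly at the minimum level cited from Gredler (2000).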
More generally, our findings relate to one potential practical use of neuroimaging, namely, the prediction of future health or behavior. Neuroimaging has been used to predict the outcome of treatment for depression (e.g., Canli et al., 2005; Siegle, Carter, & Thase, 2006) and the conversion from healthy aging to Alzheimer's disease (e.g., Apostolova et al., 2006; Bookheimer et al., 2000; de Leon et al., 2001). These studies have often used smaller samples and a single imaging modality, and only one study has examined the validity of a model with a permutation test (Apostolova et al., 2006). Another study, however, examined preoperative behavior, brain volume, and fMRI in 10 temporal-lobe epilepsy patients to predict postoperative memory and compared sensitivity, specificity, and positive predictive value (M. P. Richardson et al., 2004). They found that left-right hippocampal encoding activity difference showed reasonable sensitivity, specificity, and positive predictive value (20-100%) for predicting the amount of pre- to postoperative memory decline. Over time, perhaps in combination with an individual's genetic information, neuroimaging may contribute to increasingly accurate predictions of future behaviors. In the case of reading difficulties, identification of children at risk as early as possible seems desirable so that interventions may be implemented prior to reading failure and perhaps prior to the development of disadvantageous reading habits that may slow the effectiveness of interventions. Judicious use of predictive measures would require consideration of predictive accuracy at the individual level, such as sensitivity, specificity, and cost-benefit balances. Ethical considerations will also be important to avoid abuses of neuropredictive measures (although these ethical considerations are not fundamentally different from those of other kinds of predictive measures; Illes & Raffin, 2005).
In this vein, it will also be important to recognize that brain dysfunction in dyslexia can be altered by remediation (e.g., Temple et al., 2003), indicating that effective education can guide beneficial plasticity.
Taken together, these findings indicate that neuroimaging measures predict decoding skill after a year of school almost as well as do current standardized tests and that behavioral tests and neuroimaging measures in combination predict decoding skill significantly better than either kind of measure alone. The significantly greater predictive accuracy of the combined behavioral-neuroimaging model than of either model alone shows that neuroimaging is measuring brain functions and structures relevant to reading that are not fully measured by their behavioral correlates in standardized testing. There are still many steps to be taken to show that neuroimaging measures have sufficient value before such measures ought to be considered for practical prediction of the need for educational intervention. Combined behavioral and neuroimaging measures, however, hold promise for improving the specificity and effectiveness of early intervention and later achievement of reading skills. The present findings, therefore, suggest a point where a useful bridge can be built between cognitive neuroscience and education.