IRT Modeling of Tutor Performance to Predict End-of-Year Exam Scores

Interest in end-of-year accountability exams has increased dramatically since the passage of the No Child Left Behind Act in 2001. With this increased interest comes a desire to use student data collected throughout the year to estimate student proficiency and predict how well students will perform on end-of-year exams. This article uses student performance on the Assistment System, an online mathematics tutor, to show that replacing percentage correct with an Item Response Theory estimate of student proficiency leads to better-fitting prediction models. In addition, it uses other tutor performance metrics to further increase prediction accuracy. Prediction error bounds are also calculated to provide an absolute standard against which the models can be compared.

With the recent push toward standardized testing in the United States, there has been increased interest in predicting student performance on end-of-year exams from work done throughout the year (Olson, 2005). This has led to an increase in formative assessment and a growth of companies that provide assessment and prediction services (e.g., Pearson, http://www.pearson.com, and 4Sight, http://www.cddre.org/Services/4Sight.cfm). When predicting end-of-year exam performance, one of the most commonly used sources of student work is benchmark exams. Benchmark exams are typically paper-and-pencil exams given periodically throughout the year so teachers can get a snapshot of student knowledge at that time. A popular measure of student understanding for many researchers is percentage or number correct (e.g., Maccini & Hughes, 2000; Nuthall & Alton-Lee, 1995). Many popular prediction methods use a simple percentage correct or number of correct problems on the exams as a factor in prediction models (Bishop, 1998; Haist, Witzke, Quinlivan, Murphy-Spencer, & Wilson, 2003). This leads to linear prediction models of the form

$$Z_i = \alpha + \beta X_i + \sum_m \gamma_m Y_{im} + \epsilon_i, \qquad (1)$$

where Z_i is student i's score on the end-of-year exam, X_i is the percentage (or fraction) correct on the benchmark exam, and the Y_im are other variables used in the regression, such as subject- or school-level background variables and other measures of performance. However, one drawback of this method is that it does not take into account the difficulty of the problems. For example, if two students see 10 different problems and both correctly answer 7, we should be cautious about using percentage (or number) correct to compare the students: if one set of problems is much harder than the other, the two students likely differ in ability even though their scores are identical.
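For illustration, the model in equation 1 can be fit with ordinary least squares. The sketch below assumes a hypothetical data frame and hypothetical column names (none of these appear in the original study).

    ## Minimal sketch of the percentage-correct prediction model (equation 1).
    ## The data frame 'benchmark' and its column names are hypothetical.
    fit_pc <- lm(mcas_score ~ pct_correct + school + pct_hints,
                 data = benchmark)  # pct_correct is X_i; school, pct_hints play the role of Y_im
    summary(fit_pc)                 # estimated alpha, beta, and gamma coefficients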
As a solution to this problem, one can use Item Response Theory (IRT; e.g., van der Linden & Hambleton, 1997), which relates student and problem characteristics to item responses. By separating problem difficulty from student ability, we can estimate the student's true underlying ability no matter what set of problems is given. One of the simplest IRT models is the Rasch model (Fischer & Molenaar, 1995), which models student i's dichotomous response (0 = wrong; 1 = correct) to problem j, X_ij, in terms of student proficiency (θ_i) and problem difficulty (b_j) as

$$P_j(\theta_i) = P(X_{ij} = 1 \mid \theta_i, b_j) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)}. \qquad (2)$$

Replacing percentage correct in equation 1 with the estimated IRT proficiency yields the prediction model

$$Z_i = \alpha + \beta \theta_i + \sum_m \gamma_m Y_{im} + \epsilon_i, \qquad (3)$$

where Z_i and Y_im are the same as in equation 1 and θ_i is student i's estimated IRT proficiency. This approach is similar to the IRT-based errors-in-variables regression model used by Schofield (2007) in public policy.
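For instance, equation 2 is a one-line function in R (a sketch; plogis is the standard logistic distribution function):

    ## Equation 2: Rasch probability of a correct response.
    p_rasch <- function(theta, b) plogis(theta - b)
    p_rasch(theta = 1, b = 0.5)  # e.g., an able student on a moderately hard problem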
In this article, we illustrate the steps described above using data from an online mathematics tutor known as the Assistment System (Heffernan, Koedinger, & Junker, 2001; Junker, in press). During the 2004-2005 school year, more than 900 eighth-grade students in Massachusetts used the tutor to prepare for the Massachusetts Comprehensive Assessment System (MCAS) exam. The MCAS exam is part of the accountability system that Massachusetts uses to evaluate schools and satisfy the requirements of the 2001 No Child Left Behind Act (see http://www.doe.mass.edu/mcas). In this analysis, the benchmark exams are the unique sets of tutor problems that each student received, and the other variables, Y_im, in equation 3 are other manifest measures of student performance, such as the number of hints asked for and time spent answering problems.
We also compare the prediction models we construct with one another. To compare models, we computed the 10-fold cross-validation mean absolute prediction error, or mean absolute deviation (MAD),

$$\text{MAD} = \frac{1}{N} \sum_{i=1}^{N} \left| \text{MCAS}_i - \widehat{\text{MCAS}}_i \right|. \qquad (4)$$

MAD is used because the Assistment developers consider it more interpretable. We also report the cross-validation mean square error (MSE).
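In R, both criteria are one-liners (a sketch with our own function names):

    ## Equation 4 (MAD) and the companion mean square error.
    mad_score <- function(obs, pred) mean(abs(obs - pred))
    mse_score <- function(obs, pred) mean((obs - pred)^2)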
In equation 3, there are many different variables Y im that can be used and many different choices of IRT models to estimate student proficiency. By comparing the prediction error of these models, we can tell when one model is doing better than another, but we cannot tell whether any one model is doing well or poorly in an absolute sense. We will use classical test theory (Lord & Novick, 1968) to obtain approximate best-case bounds on the prediction error in terms of the reliabilities of the individual benchmark tests taken by the students. This gives us an absolute criterion against which to compare the prediction error of various models. If the prediction error of a model is larger than the upper bound, we know to throw out the model and search for a better one.
Each year, the 200-to-280 reporting scale used to communicate MCAS results to the public is recalculated by first using a standard-setting procedure to set the achievement levels in terms of the raw number correct and then using piecewise linear transformations to turn the number-correct scores into values within the 200-to-280 range. This second step is done so that the reporting-scale achievement-level cut points remain the same from year to year (Rothman, 2001). We predict the raw number-correct score, 0 to 54, to avoid the artificial year-to-year variation introduced by this standard-setting and transformation process.
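The transformation step can be pictured as linear interpolation between cut points. The sketch below uses made-up cut points purely for illustration; the actual MCAS cut points are set by the standard-setting procedure each year.

    ## Hypothetical piecewise linear map from raw (0-54) to scaled (200-280) scores.
    raw_cuts    <- c(0, 20, 34, 46, 54)        # made-up achievement-level cut points
    scaled_cuts <- c(200, 220, 240, 260, 280)  # fixed reporting-scale cut points
    raw_to_scaled <- approxfun(raw_cuts, scaled_cuts)
    raw_to_scaled(40)  # scaled score for a raw score of 40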
The study and data that this article uses are described in the following section. We then describe the statistical methods used to model student proficiency and summarize the results. Classical test theory is then used to discuss how well we expect to do with predictions. Next, we look at several MCAS exam score prediction models and compare results. Finally, we offer some overall conclusions.

The Assistment Project Design
During the 2004-2005 school year, more than 900 eighth-grade students in Massachusetts used the Assistment System. Eight teachers from two middle schools participated, with students using the system for 20 to 40 minutes every 2 weeks. Almost 400 main questions were randomly given to students in the Assistment System. The pool of main questions was restricted in various ways, for example, by the rate at which questions in different topic areas were developed for the tutor by the Assistment Project team and by teachers' needs to restrict the pool to topics aligned with current instruction. Thus, coverage of topics was not uniform, and students might see the same Assistment tasks more than once.

Data
Students using the Assistment System are presented with problems that are either previously released MCAS exam questions or prima facie equivalent "morphs" of released MCAS exam questions; these are called "main questions." In other contexts (e.g., Embretson, 1999), item morphs are called "item clones." If students correctly answer a main question, they move on to another main question.
If students incorrectly answer the main question, they are required to complete scaffolding questions that break the problem down into simpler steps. Students may make only one attempt on the main question each time it is presented but may take as many attempts as needed for each of the scaffolds. Students may also ask for hints if they get stuck answering a question.
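This item flow can be summarized schematically (a sketch only; attempt and offer_hint are hypothetical placeholders, not the tutor's actual functions):

    ## Schematic of the Assistment item flow described above.
    run_item <- function(main_q) {
      if (attempt(main_q)) return(invisible(NULL))  # one attempt allowed on the main question
      for (s in main_q$scaffolds) {                 # incorrect: work through each scaffold
        while (!attempt(s)) offer_hint(s)           # unlimited attempts; hints on request
      }
    }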
The analysis in this article includes only those students who have MCAS exam scores recorded in the database. This narrows the sample size to 683 students. There are 354 different main questions seen by these students. Individual students saw between 1 and 252 problems; however, the distribution is right skewed, with a median of 71 problems and first and third quartiles of 39 and 107 problems, respectively. Previously, Farooque and Junker (2005) found evidence that skills behave differently in Assistment main questions and scaffolds. Because we want to make comparisons to the MCAS exam, the only Assistment data used in the IRT models are of performance (correct or incorrect) on Assistment main questions.

IRT Model Estimation
Because performance on any particular problem depends on both student proficiency and problem difficulty, we use IRT models to factor out student proficiency and directly model problem difficulty. MCAS multiple-choice questions are scaled (Massachusetts Department of Education, 2004) for operational use with the 3-Parameter Logistic (3PL) model, and short-answer questions are scaled using the 2-Parameter Logistic (2PL) model from IRT (van der Linden & Hambleton, 1997). We know that Assistment main questions are built to parallel MCAS exam questions, so it might be reasonable to model Assistment main questions using the same IRT models. However, for simplicity the Rasch model, equation 2, was used. There is evidence that student proficiencies and problem difficulties have similar estimates under the 3PL and the Rasch model (Wright, 1995), so we are not losing much information by starting with the Rasch model. Note that in the Rasch model, the problem difficulty parameters b_j are not constrained in any way.
In our analysis, we consider N = 683 students' dichotomous answers to up to J = 354 Assistment main questions. There are many missing values because no student saw all of the problems. We treat these missing values as missing completely at random (MCAR) because problems were assigned to students randomly by the Assistment software from a "curriculum" of possible questions designed for all students by their teachers in collaboration with project investigators.
The dichotomous responses X_ij are modeled as Bernoulli trials,

$$X_{ij} \sim \text{Bernoulli}(P_j(\theta_i)), \qquad i = 1, \ldots, N; \; j = 1, \ldots, J,$$

where P_j(θ_i) is given as above by equation 2. Under the usual IRT assumption of independence between students and between responses, given the model parameters, the complete data likelihood can be written as

$$L(\mathbf{X} \mid \boldsymbol{\theta}, \mathbf{b}) = \prod_{i=1}^{N} \prod_{j=1}^{J} P_j(\theta_i)^{X_{ij}} \left[ 1 - P_j(\theta_i) \right]^{1 - X_{ij}}. \qquad (5)$$

We estimated the student proficiency (θ_i) and problem difficulty (b_j) parameters using Markov chain Monte Carlo (MCMC) methods with the program WinBUGS (Bayesian inference Using Gibbs Sampling; Spiegelhalter, Thomas, & Best, 2003; WinBUGS and R code available from the authors on request). The Rasch model, equations 2 and 5, was estimated using the data with the priors θ_i ∼ N(μ_θ, σ²_θ) and b_j ∼ N(μ_b, σ²_b). We placed a weak normal hyperprior on μ_b and a weak inverse-Gamma hyperprior on σ²_b. In item response models, the location and scale of the latent variable, and hence of the problem difficulty parameters, are not fully identified, which can undermine comparisons between fits on different data sets. We therefore fixed the (prior) mean and variance of the student proficiency (θ) at 0.69 and 0.758, values found by a preliminary analysis using weak hyperpriors on these parameters. All estimates mentioned herein refer to the posterior means of the parameters.
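A minimal WinBUGS sketch of this model is shown below, with the data in long (student, item) format; the variable names are ours, and the authors' actual code is available on request. Note that BUGS parameterizes the normal by its precision, so the fixed proficiency variance of 0.758 enters as 1/0.758 ≈ 1.32, and the gamma prior on the difficulty precision corresponds to an inverse-gamma prior on the difficulty variance.

    ## Rasch model (equations 2 and 5) in BUGS syntax, held as a string in R.
    rasch_model <- "
    model {
      for (n in 1:N_obs) {                    # one record per observed response
        X[n] ~ dbern(p[n])
        logit(p[n]) <- theta[stu[n]] - b[item[n]]
      }
      for (i in 1:N) { theta[i] ~ dnorm(0.69, 1.32) }  # fixed prior mean and variance
      for (j in 1:J) { b[j] ~ dnorm(mu.b, tau.b) }
      mu.b ~ dnorm(0, 0.001)        # weak hyperprior on the difficulty mean
      tau.b ~ dgamma(0.001, 0.001)  # weak hyperprior on the difficulty precision
    }"
    writeLines(rasch_model, "rasch.bug")
    ## fit <- R2WinBUGS::bugs(data, inits, c("theta", "b"), "rasch.bug",
    ##                        n.chains = 3, n.iter = 5000)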
Before moving on, it is worth mentioning three reasons why we chose MCMC estimation over maximum likelihood methods. First, in a research setting where we are combining IRT and regression methods, as in equation 3, we are willing to trade the speed of joint maximum likelihood (JML) or marginal maximum likelihood (MML) estimation for the ease of implementation of MCMC: to set up the MCMC estimation, we only need to specify the likelihood and the prior distributions for the parameters, rather than calculate the first and second derivatives of the likelihood. Second, we can make more complete uncertainty calculations within the Bayesian framework. Finally, depending on how the MCMC output is used, the asymptotic properties of the MCMC estimates will behave like either JML or MML estimates (Patz & Junker, 1999). As a comparison, we calculated the MML estimates using ConQuest (Wu, Adams, & Wilson, 1999). For the majority of the problems, the 95% confidence intervals for the MCMC and MML estimates overlapped quite well. However, for 4.5% of the problems, the ConQuest estimates were unreasonably extreme (indicating lack of MML convergence) compared with the MCMC estimates.
To explore the fit of the Rasch model, we looked at the per-problem standardized residuals,

$$r_j = \frac{n_j - E(n_j)}{\sqrt{\text{Var}(n_j)}},$$

where $n_j = \sum_{i:\, i \text{ saw } j} X_{ij}$ is the number of correct answers to problem j, E(n_j) is its expected value estimated from fitting the model in equation 2, and Var(n_j) is its variance estimated from the same model. Because these residuals are standardized, we expect the majority to fall between -3 and 3. The plot on the left in Figure 1 shows these standardized residuals; they fall between -0.6 and 1.4, indicating a good fit.
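Given posterior mean estimates theta_hat and b_hat and the N x J response matrix X (with NA for unseen problems), these residuals can be computed as follows (a sketch with our variable names):

    ## Per-problem standardized residuals under the fitted Rasch model.
    P_hat <- plogis(outer(theta_hat, b_hat, "-"))  # fitted P_j(theta_i)
    seen  <- !is.na(X)                             # which (student, problem) pairs were observed
    n_obs <- colSums(X, na.rm = TRUE)              # n_j: correct answers to problem j
    n_exp <- colSums(P_hat * seen)                 # E(n_j) under the model
    n_var <- colSums(P_hat * (1 - P_hat) * seen)   # Var(n_j) under the model
    std_resid <- (n_obs - n_exp) / sqrt(n_var)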
We also calculated the per-problem outfit statistics (van der Linden & Hambleton, 1997, p. 113),

$$U_j = \frac{1}{N_j} \sum_{i:\, i \text{ saw } j} \frac{(x_{ij} - E_{ij})^2}{W_{ij}},$$

where N_j is the number of students who saw problem j, x_ij is student i's response on problem j, E_ij is the expected value of X_ij conditional on the parameter vector φ, and W_ij is the variance of X_ij, also conditional on φ. To check the per-problem fit of each model, the posterior predictive p (PPP) value (Gelman, Carlin, Stern, & Rubin, 2004), that is, the expected value of the classical p value over the posterior distribution of the parameter vector given the model and the observed data, was estimated using

$$\text{PPP}_j \approx \frac{1}{M} \sum_{m=1}^{M} I\left\{ U_j\!\left(\mathbf{x}^{\text{rep},(m)}; \phi^{(m)}\right) \geq U_j\!\left(\mathbf{x}; \phi^{(m)}\right) \right\}$$

over M MCMC draws (Gelman, Meng, & Stern, 1996, p. 790). PPP values need not be exactly uniformly distributed even under a correctly specified model; however, we can still expect the PPP values to aggregate around zero if there is serious misfit for some of the problems. The histogram of PPP values on the right in Figure 1 is roughly uniform, which we would expect if the model fit is acceptable. We also considered the Linear Logistic Test Model (LLTM; Fischer, 1974) and a random-effects LLTM (Janssen & De Boeck, 2006), but the fit of both models was inadequate in comparison with the Rasch model. More information can be found in previous work (Ayers & Junker, 2006).

Figure 1. Standardized Residuals and Posterior Predictive p Values
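For reference, the PPP values plotted in Figure 1 can be estimated from stored MCMC draws roughly as follows (a sketch; theta_draws, b_draws, and the other names are ours):

    ## Posterior predictive p values for the per-problem outfit statistic.
    outfit_j <- function(x, p, j, seen) {
      i <- seen[, j]  # students who saw problem j
      mean((x[i, j] - p[i, j])^2 / (p[i, j] * (1 - p[i, j])))
    }
    ppp <- numeric(J)
    for (m in 1:n_draws) {  # loop over posterior draws
      P_m   <- plogis(outer(theta_draws[m, ], b_draws[m, ], "-"))
      X_rep <- matrix(rbinom(length(P_m), 1, P_m), nrow = nrow(P_m))  # replicated data
      for (j in 1:J) {
        ppp[j] <- ppp[j] + (outfit_j(X_rep, P_m, j, seen) >=
                            outfit_j(X, P_m, j, seen)) / n_draws
      }
    }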

Reliability and Predictive Accuracy
Before exploring the predictive accuracy of our models using the MAD measure defined in equation 4, it is important to ask how well Assistment scores could predict MCAS scores under ideal circumstances. Let us begin by assuming the MCAS exam and the Assistment System to be two parallel tests of the same underlying construct. Following classical test theory (Lord & Novick, 1968), we have

$$X_{it} = T_i + e_{it}, \qquad t = 1, 2,$$

where the true score of student i is T_i, X_it is student i's observed score on test t, and e_it is the error on test t.
We have followed the usual assumptions that the expected values of the error terms are zero, the error terms are uncorrelated with each other, and the error terms and the true score are uncorrelated. The expected mean square error (MSE) between the tests is then

$$E\left[(X_{i1} - X_{i2})^2\right] = \sigma_{e_1}^2 + \sigma_{e_2}^2.$$

Because the reliability of test t (t = 1 or 2) is defined as

$$\rho_t = \frac{\sigma_T^2}{\sigma_{X_t}^2} = 1 - \frac{\sigma_{e_t}^2}{\sigma_{X_t}^2},$$

some algebra then shows that the root mean square error (RMSE) is

$$\text{RMSE} = \sqrt{E\left[(X_{i1} - X_{i2})^2\right]} = \sqrt{\sigma_{e_1}^2 + \sigma_{e_2}^2}.$$

This can be converted into lower and upper bounds on the MAD score as follows. Using the Cauchy-Schwarz inequality for Euclidean spaces (Protter & Morrey, 1991, p. 130) with $x_i = |\text{MCAS}_i - \widehat{\text{MCAS}}_i|$ and $y_i = 1$,

$$\sum_{i=1}^{N} x_i \le \sqrt{\sum_{i=1}^{N} x_i^2}\; \sqrt{N}.$$

We can then scale both sides by 1/N to achieve MAD ≤ RMSE:

$$\text{MAD} = \frac{1}{N} \sum_{i=1}^{N} x_i \le \sqrt{\frac{1}{N} \sum_{i=1}^{N} x_i^2} = \text{RMSE}.$$

We can also bound the MAD from below. First, let $x_i = \text{MCAS}_i - \widehat{\text{MCAS}}_i$ and $|x_{\max}|$ denote the absolute maximum deviation between the true and predicted MCAS scores. Because $\sum_i x_i^2 \le |x_{\max}| \sum_i |x_i|$, dividing both sides by $N |x_{\max}|$ gives MAD ≥ RMSE²/|x_max|. Then,

$$\frac{\text{RMSE}^2}{|x_{\max}|} \;\le\; \text{MAD} \;\le\; \text{RMSE}, \qquad (8)$$

where

$$\text{RMSE} = \sqrt{\sigma_{X_1}^2 (1 - \rho_1) + \sigma_{X_2}^2 (1 - \rho_2)} \qquad (9)$$

and ρ₂ is the reliability of the Assistment score. However, because each student completes a unique set of Assistment main questions, we could not calculate ρ₂ directly. Instead, we calculated reliability separately for each student. For this purpose, we considered a reduced data set of 616 students who had completed 10 or more problems and for whom all pairs of correlations were available. To estimate the per-student reliability, we used Cronbach's alpha coefficient (Cronbach, 1951),

$$\alpha_i = \frac{n_i \bar{r}_i}{1 + (n_i - 1)\, \bar{r}_i}. \qquad (10)$$

In equation 10, n_i is the number of problems that student i saw and r̄_i is the average interitem correlation for the problems seen by student i. Once the per-student reliabilities were calculated, the per-student estimated RMSE values were computed using equation 9. Figure 2 shows the estimated reliabilities for the students who met the criteria explained above. It is interesting to note that the estimated RMSE is never lower than 4.44.
To have a single set of approximate bounds for the MAD score in equation 8, we found the median Assistment reliability, 0.8080, and the corresponding RMSE of 6.529 from equation 9. The largest deviation, |x_max|, between the true and predicted MCAS scores among the models in Table 1 was 40.5. Substituting these values for RMSE and |x_max| in equation 8, we find the approximate bounds 1.053 ≤ MAD ≤ 6.529.
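The arithmetic behind these bounds is easy to verify (equations 8 and 10, with the numbers reported above):

    ## Equation 10: per-student standardized alpha from n_i items with
    ## average interitem correlation rbar_i.
    alpha_i <- function(n_i, rbar_i) n_i * rbar_i / (1 + (n_i - 1) * rbar_i)
    ## Equation 8 with the median-reliability RMSE and the largest deviation:
    rmse  <- 6.529
    x_max <- 40.5
    c(lower = rmse^2 / x_max, upper = rmse)  # 1.053 <= MAD <= 6.529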

MCAS Exam Score Prediction
We now combine student proficiencies estimated from a successful IRT model with other Assistment performance metrics to produce an effective prediction function, following the work of Anozie and Junker (2006) and using an errors-in-variables regression approach similar to that of Schofield (2007). The linear model is equation 3,

$$Z_i = \alpha + \beta \theta_i + \sum_m \gamma_m Y_{im} + \epsilon_i,$$

where θ_i is the proficiency of student i as estimated by the IRT model and Y_im is the performance of student i on manifest measure m. WinBUGS was again used to find Bayesian estimates of the linear regression coefficients. In practice, Assistment items will be calibrated in advance and only θ will need to be estimated. Thus, when estimating each of the following models, the IRT item parameters were fixed at their earlier estimates, but student proficiency was reestimated. Because the measurement error for each student is small (ranging from 0.18 to 0.86 for the 683 students), it is tempting to plug in the θ̂s from before as well. However, a simulation study by Zwinderman (1991) showed increased bias when θ is replaced by θ̂ as a dependent variable, and we expect the same results whenever θ is used as an independent variable.
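In BUGS terms, the joint estimation amounts to appending the regression in equation 3 to the Rasch model block sketched earlier, so that θ is informed by both the item responses and the regression. This is again a sketch with our own names; the b_j values are held fixed at their calibrated estimates.

    ## Regression block added to the BUGS model (equation 3); mcas[i] is the
    ## observed raw score and Y[i, 1:M] holds the manifest measures.
    regression_block <- "
      for (i in 1:N) {
        mcas[i] ~ dnorm(mu[i], tau.e)
        mu[i] <- alpha + beta * theta[i] + inprod(gamma[], Y[i, ])
      }
      alpha ~ dnorm(0, 0.001)
      beta ~ dnorm(0, 0.001)
      for (m in 1:M) { gamma[m] ~ dnorm(0, 0.001) }
      tau.e ~ dgamma(0.001, 0.001)
    "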
To compare the prediction models, we calculated the 10-fold cross-validation (CV) MAD score (equation 4). In K-fold CV, the data set is randomly divided into K subsets of approximately equal size. One subset (the testing set) is omitted, and the remaining K - 1 subsets (the training set) are used to fit the model. The fitted model is then used to predict the MCAS exam scores for the testing set, and the desired statistic, in this case the MAD score, is calculated on the testing set. The process is repeated K times (the folds), with each of the K subsets used exactly once as the testing set, and the K results are averaged to produce a single estimate of the MAD score. By using cross-validation, we avoid using the same data both to fit the model and to estimate its fit. (A code sketch of this procedure appears at the end of this section.) We also report the 10-fold cross-validation root mean square error (RMSE) of the models.

Table 1 shows results from several prediction models. Column 2 lists which variables are in the model. (For a full list and description of the variables used in each model, see Table 2.) Column 3 states the number of variables in the model. Columns 4 and 5 give the CV MAD score and the CV RMSE, respectively. Column 6 offers some important notes about the model.

Historically, and in particular within the Assistment Project, percentage correct on main questions has been used as a proxy for student ability. To see whether any information was gained by simply using the Rasch estimate of student proficiency, we compared the two models containing only these variables. Model 1 is the simple linear regression using only percentage correct on main questions and has a MAD score of 7.18. Model 2 uses only the Rasch student proficiency and gives a MAD score of 5.90. By simply using IRT to account for problem difficulty in estimating student proficiency, we can drop the MAD score a full point. Accounting for problem difficulty gives a more efficient estimate of how well a student is doing and leads to better predictions.

Model 3, from Anozie and Junker (2006), uses as predictors monthly summaries from October to April of percentage correct on main questions and four other manifest measures of student performance. Model 4 uses the year-end aggregates of the same variables and substitutes Rasch student proficiency for percentage correct on main questions. We see that Model 4 gives a slightly lower MAD score. Thus, by using Rasch student proficiency in place of percentage correct on main questions, we can use fewer measures of student performance on Assistment problems.

Model 5 was optimized (for MAD score) over Rasch student proficiency and year-end aggregates of student performance measures using backwards variable selection implemented in WinBUGS and R (R Development Core Team, 2004; WinBUGS and R code available from the authors on request). To start, we used the same 12 variables as Anozie and Junker (2006), excluding percentage correct on main questions. The fitted model using Rasch student proficiency alone (Model 2) is

$$\widehat{\text{MCAS}}_i = 18.289 + 10.425\, \hat{\theta}_i. \qquad (11)$$

From this, we see that there is a baseline MCAS exam score prediction of about 18 points, and for each additional unit of estimated Rasch student proficiency, we add 10.425 points to the exam score prediction. As a student's proficiency increases, so does his or her exam score prediction. In equation 12, the fitted version of Model 5, the increase in MCAS score for each unit of increase in Rasch proficiency is about the same as in equation 11.
However, the baseline of 18.289 has been decomposed into a new baseline of about 8.5 points, incremented or decremented according to various measures of response efficiency. The largest increment, 8.928, comes from the rate at which scaffolding questions are completed, and the largest decrement, 2.696, comes from time spent answering main questions incorrectly.

Now that we have compared the models to one another, we need to compare them to the bounds calculated above. Recall from the previous section that 1.053 ≤ MAD ≤ 6.529. From Table 1, one can see that Model 5 has a MAD score of 5.24, which is well below the upper bound.
Moreover, the RMSE reported for Model 5, 6.46, is similar to our estimated optimal RMSE of 6.53. It should also be noted that with perfect Assistment reliability in equation 9, the estimated RMSE would be 5.576 and the bound would be 0.768 ≤ MAD ≤ 5.576. Again, the MAD score of Model 5 is below this upper bound. Using a split-half reliability calculation on the MCAS exam itself, Feng, Heffernan, and Koedinger (2006) found an average MAD score of 5.94. Because we are achieving MAD scores less than this and less than the two upper bounds above, we do not expect to do much better without an increase in the reliability of the MCAS exam.
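As promised above, here is a sketch of the 10-fold cross-validation MAD computation; fit_model and predict_model stand in for the WinBUGS estimation and prediction steps and are placeholders, not actual functions from the study.

    ## 10-fold cross-validation MAD (equation 4).
    cv_mad <- function(data, K = 10) {
      fold <- sample(rep(1:K, length.out = nrow(data)))  # random fold assignment
      mads <- numeric(K)
      for (k in 1:K) {
        train <- data[fold != k, ]
        test  <- data[fold == k, ]
        fit   <- fit_model(train)          # placeholder: fit on the K - 1 training folds
        pred  <- predict_model(fit, test)  # placeholder: predict the held-out fold
        mads[k] <- mean(abs(test$mcas - pred))
      }
      mean(mads)  # average MAD over the K folds
    }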

Discussion
In this article, we have developed a framework for creating prediction functions for end-of-year exam scores using an IRT estimate of student ability based on work done throughout the school year. Although this framework was illustrated using data from an online mathematics tutor, other benchmark work, such as homework or paper-and-pencil exams, could be used to predict end-of-year exam scores as well.
In addition to developing this general framework, our research generated an additional finding: prediction using IRT scores is more effective than prediction using number-correct scores. For example, the predictions based on our Rasch model always produced lower MAD and RMSE prediction errors than the corresponding predictions based on number-correct scores. Moreover, the IRT-based predictions were essentially as good as one could do with parallel tests, even though the Assistment System was not constructed to be parallel (in the classical test theory sense) to the MCAS exam.