Disentangling Different Aspects of Change in Tests with the D-Diffusion Model

Abstract: Diffusion-based item response theory models are measurement models that link parameters of the diffusion model (drift rate, boundary separation) to latent traits of test takers. Similar to standard latent trait models, they assume the invariance of the test takers' latent traits during a test. Previous research, however, suggests that traits change as test takers learn or decrease their effort. In this paper, we combine the diffusion-based item response theory model with a latent growth curve model. In the model, the latent traits of each test taker are allowed to change during the test until a stable level is reached. As different change processes are assumed for different traits, different aspects of change can be separated. We discuss different versions of the model that make different assumptions about the form (linear versus quadratic) and rate (fixed versus individual-specific) of change. In order to fit the model to data, we propose a Bayes estimator. Parameter recovery is investigated in a simulation study. The study suggests that parameter recovery is good under certain conditions. We illustrate the application of the model to data measuring visuo-spatial perspective-taking.

Psychological testing consists of assigning trait levels to test takers on the basis of their responses to a series of standardized test stimuli (test items). The assignment is normally based on a measurement model that represents the relation between the possible responses to the test items and the target traits the items were designed to measure. As measurement models establish a connection between the observable responses and the traits, they provide a basis for inferring the trait levels of the test takers from their responses to the items (Birnbaum, 1968). When tests are computer assisted, response times are often collected in addition to the responses. In this case, the mental speeds of the test takers may be estimated in addition to their levels on the target traits. This requires either a second measurement model for the response times (e.g., van der Linden, 2006) or a combined measurement model for the responses and the response times (e.g., van der Linden, 2007).
In standard measurement models (van der Linden, 2016), it is assumed that the test outcomes (responses, response times) depend on traits (target traits, mental speed) of the test takers and on characteristics of the item, but not on the position of the item in the test. This implies a strong form of invariance. It requires that the trait levels of the test takers, the characteristics of the items, and the functional relation between the traits and the test outcomes are fixed throughout the test. In practice, the assumption of invariance may be violated. It is a common finding that a test taker's probability of solving an item and his/her response speed depend on the position at which the item is presented in the test (e.g., Hartig & Buchholz, 2012; Schweizer, 2012; Trendtel & Robitzsch, 2018). Possible causes are warm-up effects, learning, changes in the response strategy, fatigue, or declining test motivation (e.g., Ulitzsch et al., 2022; Schweizer et al., 2020; Weirich et al., 2017; Wu et al., 2019). This has several consequences for psychological assessment. First, any unmodeled heterogeneity causes a violation of the conditional independence assumption. This may threaten the validity of inferences about the latent traits.
Second, effects of the item position are problematic when items are presented in different orders. This is, for example, the case in computer adaptive testing. Third, as the target trait of a test is typically assumed to be stable (i.e., it does not change across the test), any change in test performance reflects the influence of interfering factors. Trait recovery requires the elimination of these interfering factors. Fourth, when trait levels change during the test and test takers differ in change, a single latent trait score mingles rate of change and level. The two quantities might have different predictive power, and it is still unclear which quantity should be used for assessment. For all these reasons, changes in test performance should not be ignored.
In order to account for changes in the solution probability or the response speed during the test, the standard measurement models have been extended into dynamic latent trait models. A thorough overview of the proposed extensions is given in Appendix A.
Here, we focus on the extension that is most relevant for the present paper. In this extension, a standard latent trait model is combined with a latent growth model. The item parameters of the latent trait model are considered as fixed. The levels of the latent traits of the test takers, however, are modeled by the latent growth model. In the simplest case, the latent traits are modeled by a linear regression model with random effects (e.g., Cao & Stokes, 2008; Fox & Marianti, 2016). Note that alternatively, one could assume the latent traits to be fixed and model the item parameters as a function of the item position (Kubinger, 2008; Li et al., 2012). This approach is not pursued here.
Dynamic latent trait models are capable of accounting for changes in the latent traits throughout the test. They do, however, have two limitations. First, when a standard latent growth model is used, the change is modeled by a polynomial (e.g., linear) function (Fox & Marianti, 2016). Polynomial functions are not bounded and do not limit the amount of change. This contradicts common sense and is in conflict with findings on practice effects, which are bounded and nonlinear (Heathcote et al., 2000). Second, the dynamic latent trait models are descriptive in the sense that they model the course of the trait levels a test taker operates on during the test. The operative level of the ability to solve an item and the operative level of the response speed, however, are the results of different change processes. Response times, for example, may decrease during the test as a consequence of learning, but also as a consequence of declining motivation (Dutilh et al., 2009). A simple description of the overall effect of all change processes mingles different aspects of change and confounds the underlying causes. Instead of modeling the effects of change, one should model the different processes underlying change directly.
Recently, new measurement models have become popular that are more closely related to the response process (Bunji & Okada, 2019; Ranger et al., 2015; Tuerlinckx & De Boeck, 2005; van der Maas et al., 2011). These measurement models are adaptations of process models from cognitive psychology to the field of psychological assessment. One popular representative is the diffusion-based item response theory (DiffIRT) model (Molenaar et al., 2015; Tuerlinckx & De Boeck, 2005; van der Maas et al., 2011). The DiffIRT model distinguishes cognitive and motivational components of the solution process (De Boeck & Jeon, 2019). As a consequence, the test takers are not characterized in terms of operative ability and speed, but in terms of proficiency and carefulness, which determine the operative levels of ability and speed. This is different from standard latent trait models, which do not go beyond a description of a test taker's performance (operative ability or speed). As the DiffIRT model compartmentalizes different causes of performance, it provides an ideal basis for a more detailed description of how test takers change during a test.
In this paper, we propose a dynamic version of the DiffIRT model. For this purpose, we combine the DiffIRT model with a latent growth model. The latent growth model describes how the latent traits change as a function of the item position. The change in the two latent traits, proficiency and carefulness, is modeled separately. This makes it possible to separate change in capability from change in motivation. We consider different forms of change, namely linear or quadratic change with or without an upper boundary. With this conception of change, we intend to account for systematic forms of change that occur when test takers learn or gradually reduce their effort during the test (e.g., Schweizer et al., 2020; Verguts & De Boeck, 2000; Wu et al., 2019). The model allows a separation of the rate of change from the final level of a trait. Each test taker is described in terms of his/her initial level, his/her rate of change, and his/her final level of proficiency and carefulness. This detailed description of a test taker might improve psychological assessment, as the different aspects might have different predictive value.
The outline of the paper is as follows. First, we introduce the model considered in the paper, the dynamic D-diffusion model. Then, we investigate the performance of a Bayes estimator in model fitting. Finally, we apply the model to empirical data and illustrate the insights that can be gained from it.

The dynamic D-diffusion model
The dynamic D-diffusion model combines the D-diffusion model and a growth curve model. We first review the basic diffusion model (without latent traits), then its extension into the D-diffusion model (with latent traits), and finally present the dynamic D-diffusion model.

The basic diffusion model
Due to space limitations, we give only a condensed review of the simplest variant of the basic diffusion model without many mathematical details. A more technical description is given in Appendix B. For further details concerning the general diffusion model with parameter variability and decision bias, we refer to Ratcliff and McKoon (2008), Wagenmakers (2009), or Alexandrowicz (2020).
The diffusion model is a model for responses and response times in tasks with two response options. In this paper, we denote the two response options as the correct and the incorrect response. According to the diffusion model, the response and response time of a test taker in the task are the result of a process of preference formation. It is assumed that test takers develop a preference for one response option over the other when working on the task. The diffusion model describes how the process of preference formation evolves over time after the test taker has started working. The underlying process of preference formation is latent and cannot be observed directly. It, however, determines the observable response and response time as follows. A test taker stops working and responds by choosing the preferred response option as soon as the momentary preference exceeds a critical limit. Only the responses and response times are observed; see Appendix A for more details.
In the diffusion model, the process of preference formation is governed by two model parameters, the drift rate d (d ∈ ℝ) and the boundary separation a (a ∈ ℝ⁺). The absolute value of the drift rate d determines how fast a test taker develops a preference on average. The sign of the drift rate d determines the direction of the preference, that is, whether the correct response is preferred over the incorrect response on average or vice versa. The boundary separation a determines the critical limit, that is, the level of preference that is necessary to respond.
The drift rate and the boundary separation affect the response process differently. The drift rate is related to a test taker's cognitive capacity (or cognitive efficiency) in a task. Increasing the drift rate increases the probability of a correct response and decreases the expected response time. The drift rate is interpreted as a manifestation of a test taker's capability in a task, as it determines both the correctness and the speed of information processing. The boundary separation determines the amount of information a test taker accumulates before responding and sets the speed-accuracy tradeoff a test taker is working on. Increasing the boundary separation increases the probability of a correct response (provided that the drift rate is positive), but also the expected response time. The boundary separation is interpreted as a manifestation of the effort a test taker is willing to spend on the item and represents motivational aspects of the response process. In summary, in the diffusion model, the probability of a correct response and the expected response time, that is, the actual performance of a test taker in a test, depend on both the drift rate and the boundary separation. The drift rate reflects the capability of a test taker, which is separated from his/her willingness to invest effort in the task.
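For the simplest symmetric variant described here (unbiased starting point midway between the boundaries, unit diffusion coefficient), the effects of the two parameters can be checked against the classic closed-form expressions for accuracy and mean decision time. The sketch below assumes exactly this simplest parameterization; the function names are our own.

```python
import math

def p_correct(v, a):
    # Probability of reaching the correct (upper) boundary for a
    # diffusion process with drift rate v, boundary separation a,
    # unbiased start at a/2, and unit diffusion coefficient.
    return 1.0 / (1.0 + math.exp(-a * v))

def mean_decision_time(v, a):
    # Expected duration of the preference formation under the same
    # assumptions (non-decision time not included).
    return (a / (2.0 * v)) * math.tanh(a * v / 2.0)

# A higher drift rate raises accuracy and lowers the expected decision time;
# a wider boundary separation raises accuracy but also the decision time.
acc_low, acc_high = p_correct(1.0, 1.5), p_correct(2.0, 1.5)
rt_low, rt_high = mean_decision_time(1.0, 1.5), mean_decision_time(2.0, 1.5)
acc_wide, rt_wide = p_correct(1.0, 2.5), mean_decision_time(1.0, 2.5)
```

Note that the accuracy expression has the form of a two-parameter logistic item response function, which is what makes the diffusion model attractive as a measurement model.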
In practice, the observed response time is longer than the time that is needed for reaching the decision. Some extra time is required for reading the item or giving the response. This extra time is termed the non-decision time nd (nd ∈ ℝ⁺). The non-decision time is the third model parameter of the diffusion model. The non-decision time and the duration of the preference formation both contribute to the observed response time in a task. It is assumed that the observed response time is the sum of the non-decision time and the time needed for preference formation.
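The response process just described can be mimicked with a simple Euler discretization of the diffusion process. The sketch below assumes the simplest variant reviewed here (unbiased start, unit diffusion coefficient); all names and parameter values are illustrative.

```python
import random

def simulate_trial(drift, boundary, nd, dt=0.001, rng=random):
    """Euler simulation of one trial: the preference starts midway
    between the boundaries (unbiased start, an assumption) and drifts
    with Gaussian noise until it hits 0 (incorrect) or `boundary`
    (correct); the non-decision time `nd` is added to the decision time."""
    pref = boundary / 2.0
    t = 0.0
    step_sd = dt ** 0.5  # noise scale per Euler step (unit diffusion coefficient)
    while 0.0 < pref < boundary:
        pref += drift * dt + rng.gauss(0.0, step_sd)
        t += dt
    response = 1 if pref >= boundary else 0  # 1 = correct, 0 = incorrect
    return response, t + nd

random.seed(1)
trials = [simulate_trial(drift=1.5, boundary=2.0, nd=0.8) for _ in range(2000)]
accuracy = sum(r for r, _ in trials) / len(trials)
mean_rt = sum(t for _, t in trials) / len(trials)
```

With these illustrative values, the simulated accuracy is close to the closed-form value 1/(1 + e^(−a·v)) ≈ 0.95, and every observed response time exceeds the non-decision time.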

The D-diffusion model for responses and response times on tests
Tests consist of several items (tasks) that differ in difficulty and time demand. Test takers also differ in their capability to solve the items and to respond fast. This requires an extension of the diffusion model in order to take account of individual differences between the test takers and characteristics of the items. Such an extension can be achieved by embedding the basic diffusion model in a higher-order latent trait model (Vandekerckhove et al., 2011). These models are known as diffusion-based item response theory (DiffIRT) models. DiffIRT models decompose the parameters of the diffusion model into item effects and latent traits. One popular representative of DiffIRT models is the D-diffusion model (van der Maas et al., 2011; Molenaar et al., 2015), which is used in this paper.
In the following, we assume that a sample consists of the responses x_ij and the response times t_ij of i = 1, …, N test takers on j = 1, …, G items of a test. The responses x_ij are scored as either correct (x_ij = 1) or incorrect (x_ij = 0). Responses and response times were generated according to a diffusion model with drift rates v_ij and boundary separations a_ij. In the D-diffusion model, the two parameters are related to latent traits of the test takers. According to the model, each test taker can be characterized by two latent traits, the proficiency θ_i (θ_i ∈ ℝ) and the carefulness ω_i (ω_i ∈ ℝ). The proficiency θ_i of test taker i determines the drift rate in item j via a linear model:

v_ij(θ_i) = b_0j + b_1j · θ_i.   (1)

Parameter b_0j (b_0j ∈ ℝ) is an intercept parameter that determines the drift rate of a baseline test taker with θ_i = 0. It takes account of all characteristics of the item (e.g., item difficulty) that have an effect on information processing. As to its role, the parameter is similar to the intercept in item response models. Parameter b_1j (b_1j ∈ ℝ⁺) is a discrimination parameter that determines the influence of θ_i on the drift rate. Parameter b_1j is similar to the loading in factor analysis; note that in contrast to the original version of the D-diffusion model, we assume that the discrimination parameter is item-specific. As θ is the characteristic of the test taker that determines the drift rate, the trait can be interpreted in terms of the capability to solve the items of the test. Higher levels of θ imply higher solution probabilities and faster response times in all items.
The carefulness ω_i of test taker i determines the boundary separation in item j via a log-linear model:

a_ij(ω_i) = exp(a_0j + a_1j · ω_i).   (2)

The log-link guarantees that a_ij(ω_i) ∈ ℝ⁺, as is required for the boundary separation. The interpretation of the item parameters a_0j and a_1j parallels the interpretation of b_0j and b_1j. As ω is the characteristic of the test taker that determines the boundary separation, it is related to the effort a test taker is willing to spend on the item. The trait can thus be interpreted as the carefulness of a test taker. When the drift rate is positive, higher levels of ω imply higher solution probabilities and slower response times in all items.
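A minimal sketch of the two link functions, with the linear and log-linear forms implied by the description above (function and argument names are illustrative):

```python
import math

def drift_rate(theta, b0, b1):
    # Linear link: the intercept b0 gives the drift rate of a baseline
    # test taker (theta = 0); b1 is the item-specific discrimination.
    return b0 + b1 * theta

def boundary_separation(omega, a0, a1):
    # Log-linear link: the exponential keeps the boundary separation
    # positive for any value of the carefulness trait omega.
    return math.exp(a0 + a1 * omega)
```

Higher proficiency raises the drift rate; higher carefulness widens the boundary separation, which stays positive even for very low trait levels.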
The non-decision time is not related to the solution process. For this reason, it is not of interest in psychological assessment. It is also usually small in comparison to the time required for responding. We therefore follow the suggestion of Molenaar et al. (2015) and consider the non-decision time as an item-specific quantity that does not differ over test takers. This is an assumption that simplifies the model. We denote the item-specific non-decision time as nd_j.
In the D-diffusion model, test takers are characterized by two different traits, the latent proficiency θ and the latent carefulness ω. Both traits determine the responses and the response times in the items. This is different from standard models for responses and response times in tests. In the hierarchical model of van der Linden (2007), for example, the latent traits represent ability and mental speed. Ability and speed, however, are descriptions of the actual performance of a test taker. The traits of the diffusion model are determinants of the performance that are related to different constituents of the response process; see the description of the basic diffusion model above.
The levels of the traits are considered to be realizations of the random variables Θ and Ω that result when test takers are sampled from the population of potential test takers. Each response and response time in the data is the realization of a random variate whose distribution can be characterized by the diffusion process with the parameters given in Eqs. (1) and (2). We refer to the density of this distribution as f(x_ij, t_ij | θ_i, ω_i). For ease of notation, the dependency on the item parameters is not indicated explicitly. In the DiffIRT model, it is assumed that conditionally on the latent traits, the responses and response times on different items are independent. This implies that Θ and Ω are the only systematic influences on the response process. Assuming conditional independence, the joint distribution of the responses X_i = (X_i1, …, X_iG) and the response times T_i = (T_i1, …, T_iG) of test taker i on the G items of the test can be factored as

f(x_i, t_i | θ_i, ω_i) = ∏_{j=1}^{G} f(x_ij, t_ij | θ_i, ω_i).   (3)

The dynamic D-diffusion model
In the D-diffusion model, the test takers' levels of the latent traits (proficiency, carefulness) are assumed to be fixed throughout the test. This assumption rules out the case that test takers reduce their effort (lower carefulness) or improve their capability (higher proficiency) during the test. In the following, we propose a dynamic D-diffusion model that is capable of accounting for changes in the latent traits during the test. We integrate the D-diffusion model into a latent growth curve model. Contrary to standard latent growth models, we use the item position as the time unit, not the testing time. We assume that the operative trait levels of a test taker change from one item to the next during the test, but are constant within an item. The changes are represented by change functions that determine the relation between the item position and the operative levels of the traits (proficiency and carefulness) a test taker is working on. Denote the item position by s (s ∈ {1, …, G}) and the operative levels of proficiency and carefulness of test taker i at item position s by θ_i(s) and ω_i(s). The change function maps s to θ_i(s) and ω_i(s). In contrast to the latent traits, the item parameters of the D-diffusion model (Eqs. (1) and (2)) are assumed to be fixed during the test. Throughout the paper, we assume that each test taker worked on the items in a different sequence. This is necessary in order to separate the average amount of change in the traits from effects of the items.
The change functions of each test taker are determined by six latent traits. A test taker operates on the first item with proficiency θ_Fi (θ_Fi ∈ ℝ) and carefulness ω_Fi (ω_Fi ∈ ℝ). During the test, the operative trait levels of a test taker increase or decrease until reaching the final levels θ_Li (θ_Li ∈ ℝ) and ω_Li (ω_Li ∈ ℝ). The positions in the test where the final levels are reached may depend on the test taker. We denote these positions by the saturation point s_θi when we refer to proficiency and by the saturation point s_ωi when we refer to carefulness. Before the saturation point s_θi, a test taker's operative level of proficiency is between θ_Fi and θ_Li. After the saturation point s_θi, the test taker is working on the fixed operative level θ_Li. For carefulness, we make the same assumptions. In the following, we assume that the saturation points are within the test, that is, s_θi ∈ [1, G] and s_ωi ∈ [1, G], although this assumption can be relaxed. Note that the saturation points can take any value between 1 and G and do not necessarily have to coincide with the actual position s ∈ {1, …, G} of an item. We consider two different forms of change, namely linear and quadratic change.

Linear change
In a first version of the model, the operative levels of the latent traits change linearly from θ_Fi to θ_Li and from ω_Fi to ω_Li, respectively. The operative trait levels θ_i(s) and ω_i(s) of test taker i at position s are determined by the following change functions:

θ_i(s) = θ_Fi + (θ_Li − θ_Fi) · (min(s, s_θi) − 1)/(s_θi − 1),   (4)
ω_i(s) = ω_Fi + (ω_Li − ω_Fi) · (min(s, s_ωi) − 1)/(s_ωi − 1).
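In code, the linear change function can be sketched as a direct implementation of the interpolate-then-clamp behavior described above (names and example values are illustrative):

```python
def linear_change(s, first, last, s_sat):
    """Operative trait level at item position s: linear interpolation
    from `first` (position 1) to `last` (position s_sat), constant
    after the saturation point."""
    if s >= s_sat:
        return last
    return first + (last - first) * (s - 1.0) / (s_sat - 1.0)

# Trajectory over a 24-item test with a saturation point at position 13.
trajectory = [linear_change(s, 0.0, 0.5, 13.0) for s in range(1, 25)]
```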

Quadratic change
In a second version of the model, the operative levels of the latent traits change quadratically during the test from θ_Fi to θ_Li and from ω_Fi to ω_Li, respectively. We make the additional assumption that the saturation points are extremal points of the change functions such that the first derivatives at s_θi and s_ωi are zero. This guarantees that the change functions are monotone and continuous and have continuous first derivatives. These assumptions uniquely determine the change functions:

θ_i(s) = θ_Li + (θ_Fi − θ_Li) · ((s_θi − min(s, s_θi))/(s_θi − 1))²,   (5)
ω_i(s) = ω_Li + (ω_Fi − ω_Li) · ((s_ωi − min(s, s_ωi))/(s_ωi − 1))².

Exemplary progressions of the operative levels θ_i(s) and ω_i(s) of the traits over a test of 24 items are illustrated for three exemplary test takers in Figure 1 for the two forms of change. The saturation points s_θi and s_ωi of the test takers are indicated via dotted lines. In Figure 1, the operative level of the latent proficiency increases and the operative level of the latent carefulness decreases throughout the test. With other final levels, the progression could be reversed.
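A sketch of the quadratic change function, implementing the stated constraints (start at the initial level at position 1, reach the final level with zero slope at the saturation point, constant thereafter); the algebraic form is reconstructed from these constraints, and the names are illustrative:

```python
def quadratic_change(s, first, last, s_sat):
    """Operative trait level at item position s: quadratic change from
    `first` (position 1) to `last`, with zero slope at the saturation
    point s_sat and a constant level afterwards."""
    if s >= s_sat:
        return last
    frac = (s_sat - s) / (s_sat - 1.0)  # 1 at position 1, 0 at s_sat
    return last + (first - last) * frac * frac

# Trajectory over a 24-item test with a saturation point at position 13.
trajectory = [quadratic_change(s, 0.0, 0.5, 13.0) for s in range(1, 25)]
```

In contrast to the linear version, the change is steep at the start of the test and flattens out as the saturation point is approached.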
The initial level θ_Fi and the final level θ_Li of proficiency are regarded as realizations of the random variables Θ_F and Θ_L when sampling test takers from the population of potential test takers. The two random variables are assumed to have a bivariate normal distribution with expectations of 0 and μ_ΘL, variances of 1 and σ²_ΘL, and a correlation of ρ_1. The initial level ω_Fi and the final level ω_Li of carefulness are assumed to be realizations of the random variables Ω_F and Ω_L. These two random variables are likewise assumed to have a bivariate normal distribution with expectations of 0 and μ_ΩL, variances of 1 and σ²_ΩL, and a correlation of ρ_2. In correspondence with the standard D-diffusion model (Molenaar et al., 2015), the traits related to the drift rate (Θ_F, Θ_L) are assumed to be independent of the traits related to the boundary separation (Ω_F, Ω_L). The standardization of Θ_F and Ω_F is required in order to identify the model by removing scale indeterminacies.
In the dynamic D-diffusion model, each test taker is characterized by two saturation points s_θi and s_ωi that determine how fast a test taker reaches his/her final levels θ_Li and ω_Li; see Eqs. (4) and (5) or Figure 1. These saturation points are latent traits like the initial and final levels of proficiency and carefulness. The saturation points are on a different scale than proficiency and carefulness, as they denote an item position in the test. To have all traits on the same scale, we suggest a reparameterization of the saturation points in terms of two alternative latent traits θ_Si (θ_Si ∈ ℝ) and ω_Si (ω_Si ∈ ℝ) in the form of:

s_θi = 1 + (G − 1) / (1 + exp(−(c_0 + c_1 · θ_Si))),   (6)
s_ωi = 1 + (G − 1) / (1 + exp(−(d_0 + d_1 · ω_Si))).

In Eq. (6), the parameters c_0 and c_1 as well as d_0 and d_1 are test-specific parameters. Equation (6) is a simple reparameterization that transforms the traits θ_Si and ω_Si with range (−∞, ∞) into the saturation points s_θi and s_ωi with range (1, G). We assume that the latent traits θ_Si and ω_Si of a test taker are realizations of standard normally distributed random variables Θ_S and Ω_S. The parameters c_0 and d_0 thus determine the saturation points of an average test taker, and the parameters c_1 and d_1 the variability of the saturation points over the test takers. Values of c_1 and d_1 near zero imply that all test takers have roughly the same saturation point. Large values of c_1 and d_1 imply that a proportion of the test takers has an early saturation point and the remaining test takers a late saturation point. This mimics a latent class model in which some test takers operate on trait levels θ_Fi and ω_Fi while others operate on trait levels θ_Li and ω_Li in most of the items.
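One transform with the stated properties is a scaled logistic function, and the sketch below assumes this form. It reproduces the values reported in the simulation study: with c_0 = 0.09 an average test taker has a saturation point near item 13 of 24, and with c_1 = 0.50 about 95% of the test takers fall between roughly items 8 and 18.

```python
import math

def saturation_point(trait, c0, c1, G=24):
    """Map a standard-normal latent trait to a saturation point in the
    open interval (1, G) via a scaled logistic transform (assumed form
    of the reparameterization)."""
    return 1.0 + (G - 1.0) / (1.0 + math.exp(-(c0 + c1 * trait)))

# Values used in the simulation study on a 24-item test.
average = saturation_point(0.0, 0.09, 0.50)   # average test taker
early = saturation_point(-1.96, 0.09, 0.50)   # 2.5th percentile
late = saturation_point(1.96, 0.09, 0.50)     # 97.5th percentile
```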
Equation (6) implies individual-specific saturation points and provides the most general version of the model. The model can be simplified by the assumption that the saturation points are the same for all test takers. This can be implemented by the restriction that c_1 = 0 and d_1 = 0. In this case, the saturation points are uniquely determined by c_0 and d_0, and all test takers reach their final levels at exactly the same item positions (s_θi = s_θ, s_ωi = s_ω, i = 1, …, N). An even stronger restriction is the assumption that saturation occurs at a specific item. In this case, the saturation points are set to a specific value and not considered as free parameters. A typical value would be, for example, the last item (s_θi = G, s_ωi = G). Restricting the saturation points simplifies the model. This may be necessary when the model is fit to small data sets, where issues of non-convergence and large standard errors of estimation may occur; see the next section for results on parameter recovery.
Identification of the model parameters requires that the items are presented in different orders. This is necessary as otherwise effects of the item position would be confounded with item characteristics. This signifies that items are either presented in random order or grouped in test forms with different orderings. Randomizing item order is common in many examinations for the purpose of security and the reduction of position effects on parameter estimation (Li et al., 2012; Wu, 2010). It is also possible to use the model for a test with a fixed form. In this case, however, one has to make rather strong assumptions, such as that the average change is zero. Apart from this, there are no requirements for model identification beyond the restrictions that are made in order to fix the scales of the latent traits. As different values of the item parameters imply different moments (Ranger et al., 2020), the item parameters could in theory be estimated with the first two moments of the responses and response times in the different items and item positions.
The dynamic D-diffusion model has similarities with, but also differences from, dynamic extensions of standard latent trait models for responses and response times in tests (e.g., Cao & Stokes, 2008; Fox & Marianti, 2016). In both approaches, one considers the item parameters as fixed and models the progression of the operative levels of the latent traits over the test. The model complexity is similar when complexity is defined as the number of item parameters. In the dynamic D-diffusion model, there are intercepts and loadings for the drift rate and boundary separation and the parameters of the growth model. In the extended standard latent trait models, there are intercepts and loadings in the response and response time model and the parameters of the growth model. The main difference is that we model the progression of proficiency and carefulness, not the progression of ability and speed. Our approach might allow for more specific conclusions about the test takers. In case test takers reduce the effort spent on the solution process, they reduce the boundary separation and will respond faster and more inaccurately. An extreme form is rapid guessing, where no information is accumulated at all before responding. On the other hand, the test takers might actually get better. In this case, they are capable of solving items faster and more accurately. As both change processes might work at the same time, they are confounded in the ability to solve an item and the mental speed. Our model allows us to disentangle them. In line with previous publications on the item position effect, we use the item position as the time unit, not the total testing time. We parameterized the growth model in terms of the initial levels (θ_Fi, ω_Fi), the final levels (θ_Li, ω_Li), and the saturation points (s_θi, s_ωi). Standard growth curve models are parameterized in terms of an intercept and a slope parameter, but do not assume a saturation point. We chose our parameterization as we consider its interpretation more natural in the context of psychological assessment, specifically as we get the initial and final levels of the trait as well as the position in the test where the final level is reached.

Simulation study on parameter recovery
We conducted a simulation study to investigate whether the parameters of the dynamic D-diffusion model can be recovered well. We simulated data for the two forms of change (linear, quadratic) and three versions of the model that differed in the way the saturation point was specified; see the description below. We considered two amounts of change (moderate, large) and two sample sizes (N = 250, N = 1000). Fully crossing all factors defined 2 × 3 × 2 × 2 simulation conditions, for each of which 10 replication samples were generated. A higher number of replications was not possible due to the time demand of fitting the model.
Data were generated for a test of G = 24 items. The dynamic D-diffusion model has two sets of parameters, the parameters b_0j, b_1j, a_0j, a_1j, and nd_j of the D-diffusion model (Eqs. (1) and (2)) and the parameters of the added growth model (Eqs. (4)-(6)). For the parameters of the D-diffusion model, we used the same values in all simulation conditions. We did not vary the item parameters systematically, as we were mainly interested in the recovery of the growth parameters under the different conditions of change. The values of b_0j were taken from the set {−1.0, −0.5, 0.5, 1.0}, the values of a_0j from the set {0.37, 0.47, 0.61}, the values of b_1j from the set {0.8, 1.2}, the values of a_1j from the set {0.2, 0.4}, and the values of nd_j from the set {2.0, 3.0}. Due to the large number of item parameters, we did not fully cross them, but used a fractional factorial design instead. A list containing the values of the item parameters in all items is added to this paper as an electronic supplement. Note that these values have already been used in the simulation study of Ranger et al. (2020). They idealize the parameter values typically found in real data. For the standard D-diffusion model, the parameter values imply solution probabilities between 0.2 and 0.8 and expected response times between 2.4 and 3.8. As the parameter values have been used before in order to compare different estimators, they provide a standard scenario for evaluating parameter recovery of our extended model.
Change was simulated as follows.In the condition with a moderate amount of change, we assumed an increase in the expected level of operative proficiency from l H F ¼ 0:0 to l H L ¼ 0:5 and a decrease in the expected level of operative carefulness from l X F ¼ 0:0 to l X L ¼ À0:5: These effects resemble the item position effects that were previously found with dynamic item response models (Bulut et al., 2017;Hartig & Buchholz, 2012;Trendtel & Robitzsch, 2018).We assumed an increase in proficiency (drift rate) and a decrease in carefulness (boundary separation) as this pattern might reflect the effect of practice and demotivation on response accuracy and response speed (Dutilh et al., 2009).Note that for a standardized latent trait, an increase of 0.5 is a moderate effect in terms of Cohen's d (Cohen, 1988).In the condition with a large amount of change, we assumed an increase in the expected level of proficiency from l H F ¼ 0:0 to l H L ¼ 1:5 and a decrease in the expected level of carefulness from l X F ¼ 0:0 to l X L ¼ À1:5: For a standardized latent trait, this is a large effect size in terms of Cohen's d (Cohen, 1988).The variances of the operative latent traits after the saturation points were set to r 2 H L ¼ 1:25 and r 2 X L ¼ 1:25: The correlation between H F and H L was set to q 1 ¼ 0:9 and the correlation between X F and X L was set to q 2 ¼ 0:9: We simulated data for the dynamic D-diffusion model and two restricted versions that were nested in the dynamic D-diffusion model.In the dynamic Ddiffusion model (denoted as Version C in the following), the saturation points are specific for each test taker.We generated the saturation points according to Eq. 
(6) with c_0 and d_0 set to 0.09 and c_1 and d_1 set to 0.50. With these values, 95% of the test takers had a saturation point between Item 8 and Item 18. We then restricted the dynamic D-diffusion model (Version B). We set the saturation points to the same values for all test takers (s_θi = s_θ, s_ξi = s_ξ). This can be achieved by setting the parameters c_1 and d_1 of Eq. (6) to zero. The saturation points are then fully determined by the parameters c_0 and d_0, and all test takers reach their final trait values at the same item positions s_θ and s_ξ. In contrast to Version C, in Version B there are no individual differences with respect to how fast the trait levels of a test taker change. When generating the data, we set c_0 and d_0 to 0.09. These values correspond to saturation points at item position s = 13. We further restricted the D-diffusion model (Version A) by setting the saturation points to the last item of the test (s_θi = s_ξi = G). This signifies that the operative levels of the traits change during the whole test. In Version A, the saturation points were considered as known and not estimated. All three versions of the model were implemented with a linear change function (Eq. (4)) and with a quadratic change function (Eq. (5)).
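As Eqs. (4)-(6) are not reproduced in this section, the following Python sketch uses a hypothetical piecewise form that captures the stated assumptions: the operative trait level moves from its initial value toward its final value, linearly or quadratically, until the saturation point and is constant afterwards. The function name and exact functional form are illustrative, not the paper's equations.

```python
def operative_level(s, theta_F, theta_L, s_sat, form="linear"):
    """Operative trait level at item position s: the trait changes from
    its initial value theta_F toward its final value theta_L until the
    saturation point s_sat and stays constant afterwards (a hypothetical
    stand-in for Eqs. (4)/(5), which are not reproduced here)."""
    frac = min(s, s_sat) / s_sat   # progress toward saturation, in [0, 1]
    if form == "quadratic":
        frac = frac ** 2           # change is slower early in the test
    return theta_F + (theta_L - theta_F) * frac
```

With theta_F = 0.0, theta_L = 0.5 and s_sat = 13, the level equals 0.5 at Item 13 and remains there for all later item positions, which mirrors Version B of the simulation design.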
Responses and response times were generated as follows. For each test taker, we sampled values for θ_Fi and θ_Li from a first bivariate normal distribution. Values for ξ_Fi and ξ_Li were drawn from a second bivariate normal distribution; expectations and covariance matrices were set as described above. In Version C, we drew the latent traits θ_Si and ξ_Si underlying the saturation points (Eq. (6)) independently from a standard normal distribution. We then simulated the responses and the response times. For each test taker, we randomly assigned each item to an item position s (s ∈ {1, ..., 24}). We then determined the operative trait levels of a test taker at the item position either via Eq. (4) or via Eq. (5), depending on the simulation condition. The saturation points s_θi and s_ξi were either set to 24 in Version A, set to 13 in Version B, or set to the value implied by Eq. (6) in Version C. We then determined the values of the drift rate and the boundary separation according to Eqs.
(1) and (2). We finally generated responses and response times for these parameters according to the diffusion model. We used the package RWiener (Wabersich, 2014) from the software environment R (R Core Team, 2021). Depending on the simulation condition, responses and response times were generated either for 250 or for 1000 test takers.
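The authors generated the data with the RWiener package; as a rough, language-agnostic stand-in, first-passage times of the underlying Wiener process can be simulated by Euler discretization. The sketch below (all names our own) assumes an unbiased starting point at half the boundary separation and a diffusion coefficient fixed to 1.

```python
import numpy as np

def simulate_diffusion(drift, boundary, ndt, dt=1e-3, rng=None):
    """Simulate one response (1 = upper boundary, 0 = lower boundary) and
    one response time from a Wiener diffusion process via Euler steps.
    Absorbing boundaries at 0 and `boundary`, start at boundary/2,
    diffusion coefficient 1; `ndt` is the non-decision time."""
    rng = rng if rng is not None else np.random.default_rng()
    x, t = boundary / 2.0, 0.0
    sqdt = np.sqrt(dt)
    while 0.0 < x < boundary:       # accumulate evidence until absorption
        x += drift * dt + sqdt * rng.standard_normal()
        t += dt
    return int(x >= boundary), ndt + t
```

Euler simulation slightly biases first-passage times at coarse step sizes, which is why dedicated samplers such as RWiener are preferable in practice; the sketch only illustrates the data-generating mechanism.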
We estimated the parameters of the model on the basis of Bayes' theorem (Sorensen & Gianola, 2002). When fitting the diffusion model, it is common to use partly informative priors, as the sampler has difficulties exploring regions of low density. Previous Bayesian estimators of the diffusion model employed relatively narrow priors (van der Maas et al., 2011) that had a sufficiently wide range to cover typical ranges of plausible parameter values according to past findings (Kang et al., 2022; Wiecki et al., 2013). Following this practice, we specified the priors as follows. In a first step, we fit a standard D-diffusion model to the data with the minimum distance estimator proposed by Ranger et al. (2020). In that way, we determined preliminary, although possibly biased, estimates of b_0j, b_1j, a_0j, a_1j and nd_j. We then defined a range of plausible values. For a parameter, we considered as plausible all values within the range defined by its preliminary estimate plus/minus three times its standard error of estimation plus/minus a term accounting for the possible bias. The term accounting for the possible bias was set to 2.0 for b_0j and a_0j and to 1.0 for b_1j and a_1j. The range of plausible values of b_1j and a_1j was truncated at zero in order to prevent scale reflections. The range of plausible values of nd_j was truncated at zero and at the minimal response time in an item. The prior of the D-diffusion model parameters was the uniform distribution over the range of the plausible values. The priors of the parameters from the growth model were uniform distributions over a fixed range. For the expected amount of change (μ_ΘL, μ_ΞL), we used uniform distributions with support [−2.0, 2.0]. This corresponds to a change of at most two standard deviations. For the standard deviations (σ_ΘL, σ_ΞL), we used uniform distributions with support [0.1, 3.0]. For the correlations (ρ_1, ρ_2), we used uniform distributions with support [−0.99, 0.99]. In the version of the model with
fixed saturation points (Version A), the saturation points s_θi and s_ξi were set to 24 and not estimated. In the version of the model with test-specific saturation points (Version B), we used uniform distributions with support [−4, 4] for c_0 and d_0. The parameters c_1 and d_1 were set to zero, and no values were estimated for θ_Si and ξ_Si. In the version of the model with individual-specific saturation points (Version C), we used uniform distributions with support [−4, 4] for c_0 and d_0 and uniform distributions with support [0, 4] for c_1 and d_1. For all latent traits, we used normal priors. Our prior specification was motivated by the desire to narrow the difference between Bayes estimation and maximum likelihood estimation. We used uniform priors for the model parameters because in this case the expectation of the posterior distribution coincides with the maximum likelihood estimate in the limit (Mislevy, 1986); note that the marginal posterior distribution of the model parameters is proportional to the marginal likelihood function when the traits are integrated out. With uniform priors, however, one restricts the parameter space. This can hide multimodality of the posterior distribution and may cause problems with the assessment of posterior variability, as a reviewer pointed out. Such problems can be detected by a close inspection of the posterior distribution.
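The construction of the uniform prior supports for the D-diffusion parameters reduces to a few lines. The sketch below (function name ours) widens a preliminary minimum distance estimate by three standard errors plus a bias allowance and truncates the result where the text prescribes, e.g., at zero for b_1j and a_1j.

```python
def plausible_range(est, se, bias, lower=None, upper=None):
    """Support of the uniform prior for one D-diffusion parameter:
    preliminary estimate +/- 3 standard errors +/- a bias allowance,
    optionally truncated (e.g., at zero to prevent scale reflections,
    or at the minimal response time for the non-decision time)."""
    lo = est - 3.0 * se - bias
    hi = est + 3.0 * se + bias
    if lower is not None:
        lo = max(lo, lower)
    if upper is not None:
        hi = min(hi, upper)
    return lo, hi
```

For a slope parameter such as b_1j with a preliminary estimate of 1.0 and a standard error of 0.1, the bias allowance of 1.0 and the truncation at zero yield the support [0.0, 2.3].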
The posterior distribution was approximated by Markov chain Monte Carlo. We generated the chain with the NUTS sampler that is available in the probabilistic modeling language Stan. We accessed Stan via the package RStan (Stan Development Team, 2020). For each data set, we generated one Markov chain with 25,000 steps. We did not generate multiple chains, as running multiple chains requires either much memory and CPU capacity when the chains are run in parallel or much running time when the chains are run serially. The first 12,500 steps were considered as warm-up and discarded. This is probably somewhat conservative. Having generated the Markov chain, we estimated the values of the parameters by their conditional expectations (EAP estimator). Results of the simulation study are summarized in Figure 2. There, we report the average EAP estimates for the simulation conditions (dot) as well as confidence intervals for the expected EAP estimates at α = 0.05 (bar). The true values of the item parameters are denoted by grey lines. More details on parameter recovery (e.g., standard errors, coverage frequencies of credibility intervals) are given in Table S2 and Table S3 of the electronic supplement to this paper. Due to space restrictions, only the results for the parameters related to change are reported. The results for the parameters of the D-diffusion model (see Eqs. (1) and (2)) can be obtained from the authors on request.
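The two summaries used here, the EAP estimate from a single chain and the Wald interval for the expected EAP estimate across replications, can be sketched as follows (function names ours):

```python
import numpy as np

def eap(draws, warmup=0):
    """EAP estimate: mean of the posterior draws after discarding warm-up."""
    return float(np.mean(np.asarray(draws, dtype=float)[warmup:]))

def wald_ci(eap_estimates, z=1.96):
    """Wald confidence interval for the expected EAP estimate across
    simulation replications (the bars reported in Figure 2)."""
    est = np.asarray(eap_estimates, dtype=float)
    se = est.std(ddof=1) / np.sqrt(est.size)   # standard error of the mean
    return est.mean() - z * se, est.mean() + z * se
```

In the study, `warmup` would be 12,500 of the 25,000 steps, and `eap_estimates` collects the EAP estimate of one parameter over the replications of a simulation condition.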
In the condition with fixed saturation points (Version A), the values of the parameters related to change (μ_ΘL, σ_ΘL, μ_ΞL, σ_ΞL) could be recovered well.
The EAP estimator had a slight bias in the conditions with 250 test takers, but was virtually unbiased in the conditions with 1000 test takers. The standard errors of estimation did not seem to depend much on the amount or form of change.
In the condition with test-specific saturation points (Version B), the standard errors of estimation were higher than those in the condition with fixed saturation points (Version A). The values of the parameters related to change (μ_ΘL, σ_ΘL, μ_ΞL, σ_ΞL) were estimated well in samples of 1000 test takers. In samples of 250 test takers, they tended to be too small. The parameters determining the saturation points (c_0, d_0) tended to have higher standard errors of estimation. The saturation points were estimated most precisely when the sample size was 1000, the amount of change was large, and the form of change was linear; in this case, the average estimate of c_0 was 0.07 and the standard error of estimation was 0.03. This corresponds to a credibility interval of [12.56, 13.23] for the saturation point. The saturation points were estimated least precisely when the sample size was N = 250, the amount of change was small, and the form of change was quadratic. Here, the average estimate of c_0 was −0.24 (instead of the true value c_0 = 0.09), and the standard error of estimation was 0.71. This corresponds to a credibility interval of [4.76, 18.47] for the saturation point.
In the condition with individual-specific saturation points (Version C), the standard errors of estimation were largest. The values of the parameters related to change (μ_ΘL, σ_ΘL, μ_ΞL, σ_ΞL) were still estimated well, especially with N = 1000. The parameters related to the saturation points (c_0, c_1, d_0, d_1), however, had large standard errors of estimation. In samples with 1000 test takers, parameter recovery was still acceptable when the amount of change was large and the form of change was linear. In samples of 250 test takers, the recovery was poor, especially for the parameters c_1 and d_1.

Empirical example
We illustrate the use of the model for analyzing change with an analysis of real data. The data consisted of the responses and response times on a test of visuo-spatial perspective-taking with 16 items (Erle & Topolinski, 2015). In each item, the test takers are shown a picture that displays a person sitting at a round table from a bird's-eye perspective. A book and a banana lie on the table next to the person's left and right arm, or vice versa. The person's position at the table is rotated through either 120°, 160°, 200°, or 240° from the test taker's point of view. Test takers have to indicate whether the book lies closer to the target person's left or right arm. Items were scored as either right or wrong. Data were collected during a computer-assisted assessment that was conducted within a study on cheating on tests (Wolgast et al., 2020). Test takers were recruited from a cohort of education undergraduate students who were enrolled in an introductory lecture on psychology at a German university. The test items were presented on laptops in a large assessment center. The positions of the items were randomized. In addition to the test, the students had to complete further tests and questionnaires. As these are not relevant for the present purpose, we refer to Wolgast et al.
(2020) for further details. Data were collected for 460 students. The data of 31 students had to be discarded because the responses or response times were not recorded properly or the students did not finish the test. We further excluded the data of students who responded by rapid guessing or did not follow the instructions. We classified test takers as rapid guessers when their solution probability was below chance level and their response times were below a certain threshold. The threshold was set as the maximum of the minimal time demand required for a solution probability above 0.5 and the non-decision time determined by a preliminary estimator of the diffusion model. This reduced the number of valid cases to N = 409. The original data set is available at https://osf.io/nmxq7/. The reduced data set for the current analyses can be downloaded from the homepage of the corresponding author.
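The exclusion rule can be written as a small predicate. The sketch below is our reading of the rule (names hypothetical): a test taker is flagged when accuracy falls below the chance level of a two-alternative item and the response times fall below the threshold described above.

```python
def is_rapid_guesser(accuracy, median_rt, min_time_above_chance,
                     nondecision_time, chance_level=0.5):
    """Flag a test taker as a rapid guesser: solution probability below
    chance level AND response times below the threshold, where the
    threshold is the maximum of the minimal time demand for above-chance
    accuracy and a preliminary non-decision time estimate."""
    threshold = max(min_time_above_chance, nondecision_time)
    return accuracy < chance_level and median_rt < threshold
```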
The relative frequency of a correct response and the median response time in each item are depicted in Figure 3 as a function of the position at which the item was presented. Figure 3 illustrates that the response accuracies are high, irrespective of the position at which the items were presented. In contrast, the median response times decrease with increasing item position during the first half of the test. The decrease is nonlinear and consists in a reduction of the median response times from about 4 s to about 1 s. After Item 8, the median response times appear constant. This suggests that there might be a saturation point at Item 8, after which the test takers operate on constant trait levels. The combination of decreasing response times and high response accuracy suggests that the test takers increase their proficiency during the test. They might, however, also reduce their carefulness to some extent.
We analyzed the responses and response times with the dynamic D-diffusion model described above. We fit the three versions of the model that were considered in the simulation study. In the most restricted version (Version A), the saturation point was fixed to the last item; in a less restricted version (Version B), the saturation point was assumed to be invariant over test takers; and in the least restricted version (Version C), the saturation point was individual-specific. All versions were implemented for linear and for quadratic change. Parameters were estimated as in the simulation study, namely with a Bayes estimator and uniform priors. Starting values of the basic diffusion model parameters (see Eqs. (1) and (2)) were determined as described before, by fitting a standard D-diffusion model to the data with a minimum distance estimator. In contrast to the simulation study, we used five different sets of starting values for the parameters of the growth model (see Eqs. (4)-(6)). The starting values implied a small or a large amount of change and an early or a late saturation point. With the different starting values, we ran five chains with 10,000 iterations each. We carefully checked the Markov chains for convergence. Trace plots suggested that the chains converged well. Divergent transitions were no issue. Having ensured convergence, we ran an additional chain with 25,000 iterations from the preliminary estimates and discarded the first half of the iterations as burn-in in order to improve precision. For model selection, we determined the AIC and BIC indices (Gelman et al., 2014). When determining AIC and BIC, we proceeded as Bolsinova et al. (2017) and Kang et al.
(2022) did. We used the EAP estimates of the parameters and the latent traits when evaluating the likelihood of the data. The penalty term was based on the number of estimated item parameters and trait values. We chose the AIC index for model selection, as this index performed well in previous studies (Kang et al., 2022). Furthermore, in psychological assessment one typically uses just the point estimates of the parameters, not the whole posterior distribution. This practice is in line with AIC, which estimates the performance of the plug-in predictive density (Gelman et al., 2014).¹ We determined the conditional version of the AIC, in which the latent traits are considered as model parameters (Merkle et al., 2019). The decision to consider the latent traits as model parameters rather than as nuisance parameters was mainly due to computational reasons. Eliminating the nuisance parameters by integration would have required the approximation of a six-fold integral. This would not have been feasible in reasonable time. All scripts used for the analysis can be downloaded from the homepage of the corresponding author.
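Under these choices, the model selection indices reduce to simple plug-in formulas. The sketch below (names ours) computes the conditional AIC and BIC from the log-likelihood evaluated at the EAP estimates, with the penalty counting the estimated item parameters plus the trait values.

```python
import math

def information_criteria(log_lik, n_params, n_obs):
    """Conditional AIC and BIC from the plug-in log-likelihood evaluated
    at the EAP estimates; n_params counts the estimated item parameters
    plus the trait values (latent traits treated as model parameters),
    and n_obs is the number of observations entering the BIC penalty."""
    aic = -2.0 * log_lik + 2.0 * n_params
    bic = -2.0 * log_lik + n_params * math.log(n_obs)
    return aic, bic
```

Lower values indicate better expected predictive performance; the model with the smallest AIC was retained, as described in the text.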
The values of the model selection indices as well as the values of the model parameters related to change are given in Table 1 for the 3 × 2 versions of the model.
For the visuo-spatial perspective-taking data, the best model with respect to the information criteria was the dynamic D-diffusion model with linear change and a saturation point that was identical for all test takers (Linear, Version B). This version of the model also provided a good representation of the data. Posterior predictive checks indicated that the empirical distributions of the response times at different item positions were similar to the predicted distributions. More precisely, the empirical quantiles of the response times in the data were similar to the average quantiles of data generated by the dynamic D-diffusion model with parameters that were randomly drawn from the posterior distribution. More details, e.g., the plots of the empirical quantiles against the predicted quantiles at different item positions, can be found in an electronic supplement to this paper. According to the model, the expected operative level of proficiency increased from μ_ΘF = 0.00 to μ_ΘL = 0.93 during the test. The standard deviation of proficiency increased from σ_ΘF = 1.00 to σ_ΘL = 1.43. The saturation point was c_0 = 0.81 or s_θ = 11.41, respectively. The expected operative level of carefulness, in contrast, decreased from μ_ΞF = 0.00 to μ_ΞL = −1.44. The standard deviation of carefulness remained roughly the same. The saturation point was d_0 = −2.22 or s_ξ = 2.46, respectively. At the suggestion of a reviewer, we repeated the analysis with normal and inverse gamma priors. The results, however, were virtually the same.

¹ Note that when considering the full posterior distribution in an analysis, the WAIC may be used instead of the AIC (Vehtari et al., 2017).
The test of Erle and Topolinski (2015) is a test of visuo-spatial perspective-taking. As the solution probabilities are high, differences in performance are reflected predominantly in the response times of the test takers. The time to respond, however, declines during the test. This raises the question of whether it is justified to characterize a test taker by a single latent trait reflecting his or her capability in perspective-taking. And if so, it remains unclear which aspect of the time profile should be used for trait assignment: the initial level, the rate of change, or the final level. How to proceed is even less clear given that drops in response time do not necessarily have to be related to a test taker's capability, but could have many causes, such as decreasing carefulness or changes in the speed-accuracy tradeoff. The fitted dynamic DiffIRT models give some insights into the possible causes of the changes. First, the findings suggest that the change in response time is due to two different processes. The first process is a fast reduction of carefulness by which test takers adjust the effort spent on the items to the difficulty of the task. As accuracy is high from the start, the effort can be reduced without impairing the performance much. This is in accordance with the interpretation of carefulness as a quantity that is used to control the efficiency of the response process by making an adequate speed-accuracy tradeoff. The second process is a slow increase in proficiency. This change process might reflect a more time-demanding learning process. It could reflect the gradual switch from controlled to automatic processing. In summary, the findings suggest that the change in response time cannot unequivocally be related to the capability of a test taker. Response times should therefore not be used directly for psychological assessment. Second, in the proposed dynamic D-diffusion model, each test taker is characterized by six different traits: the initial level, the saturation
point and the final level of capability and carefulness. In the restricted version used here, there are no individual differences with respect to the saturation point. The initial levels of the test takers were highly correlated with the final levels. The correlation was ρ_1 = 0.79 for Θ_F and Θ_L and ρ_2 = 0.82 for Ξ_F and Ξ_L. This signifies that the ranking of the test takers with respect to proficiency and carefulness rarely changes during the test. Test takers can therefore be ordered uniquely with respect to their proficiency in perspective-taking. In the more general case, when individuals differ in their saturation points and the traits are not highly correlated, one has to decide which aspect is most closely related to the construct one intends to measure. In the present case, the most relevant trait would probably be Θ_L, as this is the level of perspective-taking a test taker is able to reach. Individual differences in the learning rate are probably construct-irrelevant. This might be different in tests of general intelligence, where adaptation to new situations is an integral part of the construct.

Discussion
In psychological assessment, it is common practice to employ latent trait models. These models conceive the outcomes of test taking (e.g., responses, response times) as the product of invariant characteristics of a test taker (traits) and random noise (error). Test taking, however, is not as invariant as these models imply. When test takers become familiar with the test material, they approach the problems less tentatively, are able to reuse problem-solving strategies, or learn to separate relevant from irrelevant information. Test takers also might become bored, tired, or distracted and reduce their persistence and problem-solving capacity (e.g., Molenaar et al., 2016; Schweizer et al., 2020; Zeller et al., 2017). Standard latent trait models that assume fixed latent traits are not capable of fully taking account of these changes. Latent trait models are also just descriptive models. They allow for a parsimonious representation of the data in the form of person parameters (traits) and item parameters (item effects). They do not, however, allow the separation of the different psychological processes that generate the test outcomes. This is problematic when different elements of the response processes have different effects on test performance. In this paper, we have introduced a model that aims at disentangling different processes of change during the test. The model combines the D-diffusion model with a latent growth curve model. In this way, it becomes possible to model change in a test taker's levels of proficiency and carefulness separately. As both traits have a different interpretation (proficiency is generally viewed as representing the actual construct being measured, while carefulness is usually supposed to reflect motivational and strategic aspects; Molenaar et al., 2015), the model allows one to separate construct-irrelevant from construct-relevant aspects of change.
In the paper, we investigated parameter recovery with a simulation study. The simulation study demonstrated that the parameters of the model can be recovered well under certain conditions. Parameter recovery is good when the sample size is at least 1000, the change is linear, and the amount of change is large. Under other conditions, the saturation point cannot be estimated well. In such cases, it might be better to fit a more restricted version of the model. We applied the model to an empirical data set on visuo-spatial perspective-taking. In the data set, we found evidence for change during the test. This is in line with previous findings on item position effects in tests consisting of items with homogeneous content (e.g., Nagy et al., 2018). The change was best described by a linear change function and a test-specific saturation point. The change consisted in a fast reduction of carefulness and a slower increase of proficiency. The fast reduction of carefulness is clearly a warm-up effect that consists in an adjustment of the response process to the current task. The slower increase in proficiency might reflect learning, as it is more gradual and requires more time. In both processes of change, the initial and the final levels of the latent traits were considerably correlated. This facilitates psychological assessment, as it probably does not matter whether the initial or the final levels of the traits are used for psychological assessment. More generally, the study can be seen as a contribution to the literature on item position effects. Our findings suggest that the item position effect may be a product of different change processes during test taking: a slow learning process that improves the capability to process information and a fast adaptation that adjusts the effort to the demands of the task. Whether these findings can be extended to other test material, however, requires further research.
Even though the model was capable of representing the data well, some limitations have to be mentioned. At the core of the model are change functions that specify how proficiency and carefulness change during the test. The change functions imply that the traits change from the first item on until some final level is reached. The final level may be higher, but also lower, than the initial level at the first item. This form of change is probably most adequate for the effects of warm-up and learning that manifest themselves in an increase of proficiency and a decrease of carefulness during the first part of a test. Effects of time pressure or demotivation, however, require a different form of change. As time pressure and demotivation may affect later items more strongly than earlier items, the change function should be constant over the first items and decrease or increase over the last items (Boughton et al., 2007). Such forms of change can be implemented with a slight modification of Eqs. (4) and (5). One simply has to reflect the scale of the item position by replacing s with G − s + 1. A second limitation is the focus on linear and quadratic forms of change. One alternative would be to use an exponential function or a power function (Heathcote et al., 2000). Another alternative would be to use an S-shaped growth function that is constant in the first third of the test, changes in the second third of the test, and reaches a limit in the last third of the test. More general versions with two switching points, where trait levels first increase due to practice but then decrease due to demotivation, could also be considered. Although such alternative forms of change could easily be implemented, it would be very difficult to estimate them precisely.
Our model considers change as a continuous function of the item position, not of the total testing time. This is, of course, a simplification. The item position has different implications with respect to the total testing time for a fast test taker than for a slow test taker. The item position is probably important for learning effects, as we do not learn by spending time but from experience, that is, from the number of similar problems a test taker has processed before. For the effects of rapid guessing when time runs out, the total testing time might be more relevant. In the case of fatigue or demotivation, it might be the number of problems, the total testing time, or both that is relevant. We also did not consider reinforcement learning, where change depends on the responses to previous items (Fontanesi et al., 2019). This can be justified by the fact that in psychological testing, the test takers usually do not receive feedback during the test. We also did not consider qualitative differences in change, for example by assuming that trait levels change only in a subset of the test takers (e.g., Bolt et al., 2002; Meyer, 2010). An alternative to modeling change with a growth curve model would be to use a latent-trait latent-state model (e.g., Fox, 2014; Kelava & Brandt, 2019; Molenaar et al., 2016). Such models might be attractive when traits are neither stable nor change systematically in one direction.
Several aspects of the model may be criticized for being too restrictive. We considered the non-decision time, for example, as fixed and hence as not affected by the item position. This excludes the possibility that learning also occurs for aspects of test taking behavior that are unrelated to problem solving. This limitation is probably not too serious. As non-decision times are usually relatively small, the effect of the item position on them cannot be large. In the implementation of the dynamic D-diffusion model, the latent traits determining the saturation points were not correlated with the initial and final levels of the traits. This was done in order to simplify the model. Parameter recovery was already difficult for the parameters determining the saturation points in the case of uncorrelated traits. The decision to assume uncorrelated traits can thus be considered a tradeoff between bias and variance. More problematic is that we focused exclusively on the drift rate and the boundary separation. In addition to the drift rate and the boundary separation, the diffusion model has a further parameter, the variance of the Brownian motion (Resnick, 2002). The variance is usually set to a fixed value in order to identify the model and is thus not considered explicitly. The invariance of the variance over the test, however, might not hold when the solution process becomes more standardized and stable during the test. When the variance is fixed to a constant in the model but changes during the test, the change manifests itself in changes in the drift rate and the boundary separation. This is problematic, as it implies that the diffusion model does not disentangle the different processes entirely.
Finally, a last issue is the question of whether the fine-grained decomposition of a test taker's performance into initial level, saturation point, and final level of proficiency and carefulness isolates more relevant components of the construct the test is intended to measure than a single score from a standard analysis does. As a reviewer remarked, it might not always be the case that interindividual differences and intraindividual changes in carefulness are construct-irrelevant and need to be accounted for. There is no general answer to this question. It depends on how the construct is defined. If one intends to assess performance, there is probably no need for the proposed decomposition. Often, however, one is interested in the capability of a test taker and seeks a characterization of a test taker that goes beyond the test context. This is especially important, for example, when individuals differ in intrinsic motivation, as in low-stakes tests (e.g., PISA). A test score that mingles different characteristics of a test taker might not be adequate for statements about competence levels that are meant to generalize beyond the general motivational level of the test takers. More research, however, is necessary in order to evaluate the significance of the different latent traits distinguished by the dynamic D-diffusion model.

Figure 1. Illustration of the progression of the operative trait levels over a test with 24 items. The black lines trace the operative trait levels of proficiency θ_i(s) and carefulness ξ_i(s) of three test takers at different item positions s in a test with G = 24 items. The dotted straight lines indicate the saturation points s_θi and s_ξi.

Figure 2. Average EAP estimates (dot) and Wald-based confidence intervals (bar) at α = 0.05 for the expected EAP estimates in the simulation study for different forms of change, different versions of the model, different amounts of change, and different sample sizes. The form of change is denoted by L (linear) versus Q (quadratic), the version of the model by A, B, and C, and the amount of change by ME (moderate) versus LE (large). The sample size was either N = 250 or N = 1000.

Figure 3. Accuracy of the response and median response time on the different items of the test on visuo-spatial perspective-taking as a function of the position at which an item was presented. The test had 16 items. Each line represents one item.

Table 1.
Expectation (standard deviation) of the posterior distribution for the parameters related to change and values of the information criteria for six versions of the dynamic D-diffusion model in the data on visuo-spatial perspective-taking. Version A: fixed saturation points; Version B: test-specific saturation points; Version C: individual-specific saturation points.