Tests for goodness of fit in ordinal logistic regression models

ABSTRACT We examine three approaches for testing goodness of fit in ordinal logistic regression models: an ordinal version of the Hosmer–Lemeshow test (C_g), the Lipsitz test, and the Pulkstenis–Robinson (PR) tests. The properties of these tests have previously been investigated for the proportional odds model. Here, we extend the tests to two other commonly used models: the adjacent-category and the constrained continuation-ratio models. We use a simulation study to assess null distributions and power. All three tests work well and can detect several types of lack of fit under both the adjacent-category and constrained continuation-ratio models. The C_g and Lipsitz tests are best suited to detect lack of fit associated with continuous covariates, whereas the PR tests excel at detecting lack of fit associated with categorical covariates. We illustrate the use of the tests with data from a study of aftercare placement of psychiatrically hospitalized adolescents. Based on the results here and previous research, we can make a joint recommendation for testing goodness of fit in proportional odds, adjacent-category, and constrained continuation-ratio logistic regression models: because the tests may detect different types of lack of fit, a thorough assessment of goodness of fit requires the use of all three approaches.


Introduction
Several regression models can be used to describe the relationship between an ordinal response variable and one or more explanatory variables. [1,2] The models differ in which response levels are compared and how. The proportional odds logistic regression model, perhaps the most frequently used ordinal model, compares the probability of an equal or smaller response with the probability of a larger response. The model is constrained by the requirement that the log odds of each explanatory variable does not depend on the response category. Two other constrained models are the adjacent-category and the constrained continuation-ratio models. The adjacent-category model compares the probability of each response with the probability of the next lower response. The constrained continuation-ratio model compares the probability of each response with the probability of all higher responses. We may also use an unconstrained continuation-ratio model, which provides greater flexibility by allowing the effect of each explanatory variable to depend on the response level.
The choice between these models should be guided by which comparisons between response levels are most informative for the problem at hand as well as an assessment of model adequacy.
One important step in the assessment of model adequacy is an evaluation of overall goodness of fit. Fagerland and Hosmer [3] considered the proportional odds model and investigated the performance of three methods to test for goodness of fit. One of these methods was based on a generalization of the Hosmer-Lemeshow goodness-of-fit test [4] to multinomial logistic regression. [5,6] The other methods were a likelihood ratio test by Lipsitz et al., [7] partly based on the Hosmer-Lemeshow approach for grouping observations, and an extension of the Pearson χ² and deviance statistics to allow for continuous covariates. [8] The recommendation in [3] was to use all three methods to assess goodness of fit.
All three methods are based on a strategy of grouping the observations according to estimated probabilities following model fit and are not tied to a particular ordinal model. The derivations of the test statistics depend only on an ordinal response variable and a model that can estimate probabilities for each response level. Because different ordinal models impose different structures on the data, it may be that the distributions of the test statistics, under both the null and alternative hypotheses, also differ between the models. So far, goodness-of-fit tests based on the three methods above have only been investigated under the proportional odds model. The purpose of this article is to examine whether these tests can also be used following the fit of adjacent-category and continuation-ratio models. We aim to establish sufficient empirical evidence so that solid recommendations for practical use of goodness-of-fit tests in these models are possible. Section 2 sets the notation and defines the ordinal models. We derive the test statistics in Section 3. The null distribution (Section 4) and power (Section 5) of the tests are investigated using stochastic simulations. We set the tests to work on data from a study of aftercare placement for psychiatrically hospitalized adolescents in Section 6. We discuss the findings and provide recommendations in Section 7.

Four ordinal regression models
Let Y denote an ordinal response variable with c levels (1, ..., c), and let x = (x_1, x_2, ..., x_p) denote a vector of p explanatory variables. The relationship between the response variable and the explanatory variables is described through c − 1 logits, which relate a set of intercepts (αs) and regression coefficients (βs) to the probabilities of the response levels. We may interpret exp(β) as the odds ratio for a one-unit increase in a given explanatory variable, comparing two response levels or two sets of response levels, depending on the model-specific logit equations. We use the notation OR(2, 1) to denote the odds ratio comparing response level 2 with response level 1. In like manner, OR(4-5, 1-3) denotes the odds ratio comparing response levels 4 and 5 with response levels 1-3.
Assume that we have a sample of n independent observations, denoted by (x_i, y_i), i = 1, ..., n. Let π_ij = Pr(Y = j | x_i) denote the conditional probability of a response equal to j for observation i given x_i. After the fit of the model, we denote the estimated probabilities by π̂_ij.
In the following, we consider three constrained models and one unconstrained model that differ in which response levels are compared and how. The constrained models impose more structure on the data than the unconstrained model. They are more parsimonious and simpler to interpret, having a single regression coefficient for each explanatory variable, compared with c − 1 regression coefficients per variable in the unconstrained model.

The proportional odds model
The proportional odds model, also called the constrained cumulative logit model, compares the probability of an equal or smaller response with the probability of a larger response:

g_j(x) = log[Pr(Y ≤ j | x) / Pr(Y > j | x)] = α_j − βx, j = 1, ..., c − 1, (1)

where β = (β_1, β_2, ..., β_p) is a vector of p regression coefficients. The log odds does not depend on the response level, and the regression coefficients β_1, ..., β_p are constant across the logits. The intercepts satisfy the relationship α_1 < α_2 < ... < α_{c−1}. The negative sign of βx is included to allow for the usual interpretation that a positive value of β_k means that as x_k increases, the probability of higher values of Y also increases. This is consistent with the way most statistical software packages implement the proportional odds model. For a four-level response variable, the following interpretations apply:

e^{β_k} = OR(2-4, 1) = OR(3-4, 1-2) = OR(4, 1-3),

where β_k is the coefficient of the kth explanatory variable. If we want to estimate the odds ratio of a smaller response versus a larger response, we can change the sign of the coefficient:

e^{−β_k} = OR(1, 2-4) = OR(1-2, 3-4) = OR(1-3, 4).

The conditional probability of each response level given x is derived from Equation (1) by using the equality Pr(Y = j | x) = Pr(Y ≤ j | x) − Pr(Y ≤ j − 1 | x). Solving Equation (1) for the cumulative probabilities gives

Pr(Y ≤ j | x) = exp(α_j − βx) / [1 + exp(α_j − βx)], j = 1, ..., c − 1. (2)

It follows from Equation (2) that

π_1 = Pr(Y = 1 | x) = exp(α_1 − βx) / [1 + exp(α_1 − βx)]

and

π_j = Pr(Y = j | x) = exp(α_j − βx) / [1 + exp(α_j − βx)] − exp(α_{j−1} − βx) / [1 + exp(α_{j−1} − βx)], j = 2, ..., c,

where, by convention, α_c = ∞ so that Pr(Y ≤ c | x) = 1.
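To make the cumulative-logit calculation concrete, the following Python sketch computes the response-level probabilities from a set of intercepts and coefficients. This is an illustration, not code from the study; the intercepts, coefficients, and covariate values are arbitrary example numbers.

```python
import math

def expit(z):
    """Inverse logit function."""
    return 1.0 / (1.0 + math.exp(-z))

def prop_odds_probs(alphas, beta, x):
    """Response-level probabilities pi_1, ..., pi_c under the proportional
    odds model logit Pr(Y <= j | x) = alpha_j - beta'x.
    alphas: increasing intercepts alpha_1 < ... < alpha_{c-1};
    beta, x: vectors of equal length p."""
    bx = sum(b * xi for b, xi in zip(beta, x))
    # cumulative probabilities Pr(Y <= j | x) for j = 1..c-1, plus 1.0 for j = c
    cum = [expit(a - bx) for a in alphas] + [1.0]
    # differencing the cumulative probabilities gives the level probabilities
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

# Arbitrary illustrative values: c = 4 levels, p = 2 covariates
probs = prop_odds_probs(alphas=[-1.0, 0.5, 2.0], beta=[0.8, -0.3], x=[1.2, 0.4])
```

The probabilities sum to one by construction, and increasing a covariate with a positive coefficient shifts probability mass towards the higher response levels, matching the sign convention above.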

The adjacent-category model
The (constrained) adjacent-category model compares the probability of each response with the probability of the next lower response:

g_j(x) = log[Pr(Y = j + 1 | x) / Pr(Y = j | x)] = α_j + βx, j = 1, ..., c − 1. (3)

The regression coefficients β_1, ..., β_p are constant across the logits, whereas the intercepts are not. As was the case with the proportional odds model, the effect of a particular explanatory variable on the response can be described by a single coefficient or odds ratio. With a four-level response variable, we can make the following interpretations for the kth explanatory variable:

e^{β_k} = OR(2, 1) = OR(3, 2) = OR(4, 3).

The conditional probability of each response level given x is derived and shown in Appendix 1.
An adjacent-category model may be fitted via a constrained multinomial model. A rationale is given in [1, p.290-291]. The necessary constraints are β_jk = (j − 1) · β_2k, for j = 3, ..., c and k = 1, ..., p, where β_jk is the coefficient of the multinomial logit comparing Y = j with Y = 1 (the reference category) for the kth explanatory variable. The estimated probabilities π̂_ij then derive from the multinomial model. A command for fitting adjacent-category models in Stata has recently been made available. [9]
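The constrained multinomial formulation also gives a direct way to compute the response-level probabilities: log π_j is, up to a normalizing constant, the sum of the first j − 1 adjacent logits. A small Python sketch (with hypothetical parameter values, not from the article):

```python
import math

def adjacent_category_probs(alphas, beta, x):
    """Probabilities pi_1, ..., pi_c under the adjacent-category model
    log(pi_{j+1} / pi_j) = alpha_j + beta'x, j = 1, ..., c-1.
    Each log pi_j is (up to a constant) the cumulative sum of the adjacent
    logits, which is exactly the constrained multinomial formulation."""
    bx = sum(b * xi for b, xi in zip(beta, x))
    log_num = [0.0]  # reference level j = 1
    for a in alphas:
        log_num.append(log_num[-1] + a + bx)  # add the next adjacent logit
    denom = sum(math.exp(v) for v in log_num)
    return [math.exp(v) / denom for v in log_num]

# Arbitrary illustrative values: c = 4 levels, one covariate
probs = adjacent_category_probs(alphas=[-0.5, 0.2, 0.4], beta=[0.3], x=[1.0])
```

By construction, the ratio of consecutive probabilities equals exp(α_j + βx), so the adjacent-logit structure can be verified directly on the output.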

Continuation-ratio models
The constrained continuation-ratio model compares the probability of each response with the probability of a higher response:

g_j(x) = log[Pr(Y > j | x) / Pr(Y = j | x)] = α_j + βx, j = 1, ..., c − 1. (4)

Again, the effect of a particular explanatory variable on the response can be described by a single coefficient or odds ratio. The interpretations for a four-level response variable are

e^{β_k} = OR(2-4, 1) = OR(3-4, 2) = OR(4, 3).
The conditional probability of each response level given x is derived and shown in Appendix 2. The continuation-ratio model is most useful in situations where Y measures the number of attempts to attain a binary outcome. [1, p.296-297] In such a sequential process, [2, p.96-97] each subject has to pass through each response level in turn, starting with the lowest one, to reach a higher response level. Constrained continuation-ratio models can be fitted with a generalized linear model formulation. Two commands for fitting constrained continuation-ratio models are available in Stata. [9,10] We may modify the model in Equation (4) such that the regression coefficients depend on the logits:

g_j(x) = log[Pr(Y > j | x) / Pr(Y = j | x)] = α_j + β_j x, j = 1, ..., c − 1, (5)

where β_j = (β_j1, β_j2, ..., β_jp) for j = 1, ..., c − 1. The model in Equation (5) is called the unconstrained continuation-ratio model. The effect of a particular explanatory variable on the response depends on the response level and is described by c − 1 regression coefficients or odds ratios. The unconstrained continuation-ratio model may be fitted using c − 1 binary logistic regression models, where the jth binary response variable is defined, for the observations with Y ≥ j, as

z_j = 1 if Y > j and z_j = 0 if Y = j, for j = 1, ..., c − 1. (6)

A command for fitting unconstrained continuation-ratio models in Stata is available. [9] The unconstrained continuation-ratio model has the same number of parameters as the multinomial logistic regression model; however, assessing the fit of unconstrained continuation-ratio models is a more complicated task due to the complexities introduced by the separate binary fits. Here, we do not consider the unconstrained continuation-ratio model further and leave that topic for a future article.
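The sequential-process interpretation suggests a direct way to compute the response probabilities: at each level the subject either stops or continues to a higher level. The Python sketch below is an illustration, not the article's code; it assumes the logit compares the probability of a higher response with the probability of the current response (the convention implied by the odds ratio interpretations above), and the parameter values are arbitrary.

```python
import math

def expit(z):
    """Inverse logit function."""
    return 1.0 / (1.0 + math.exp(-z))

def continuation_ratio_probs(alphas, beta, x):
    """Probabilities pi_1, ..., pi_c under a constrained continuation-ratio
    model with log[Pr(Y > j | x) / Pr(Y = j | x)] = alpha_j + beta'x,
    j = 1, ..., c-1. The response is built up sequentially: at each level j
    the subject either stops (Y = j) or continues to a higher level."""
    bx = sum(b * xi for b, xi in zip(beta, x))
    probs, remaining = [], 1.0  # remaining = Pr(Y >= j | x)
    for a in alphas:
        p_continue = expit(a + bx)            # Pr(Y > j | Y >= j, x)
        probs.append(remaining * (1.0 - p_continue))
        remaining *= p_continue
    probs.append(remaining)                   # level c receives the leftover mass
    return probs

# Arbitrary illustrative values: c = 4 levels, two covariates
probs = continuation_ratio_probs(alphas=[-1.5, -1.0, 0.0], beta=[0.25, 0.5], x=[2.0, 1.0])
```

The telescoping product guarantees that the c probabilities sum to one.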

Derivation of test statistics
In this section, we present three approaches for testing goodness of fit in ordinal logistic regression models. The test statistics resulting from these approaches are not new. Fagerland and Hosmer [3] showed that these test statistics have satisfactory properties for testing goodness of fit in proportional odds logistic regression models; however, their null distribution and power properties have not yet been investigated for the adjacent-category and constrained continuation-ratio models.

An ordinal version of the Hosmer-Lemeshow test
We consider a test based on the Hosmer-Lemeshow test [4] for binary logistic regression. A generalization of the Hosmer-Lemeshow test to multinomial logistic regression was suggested by Fagerland et al. [5] and recently adapted to the proportional odds model. [3] Here, we make use of the same test statistic as in [3] to derive tests for use with the adjacent-category and the constrained continuation-ratio models.
Following the fit of the model, calculate the estimated probabilities π̂_ij. Assign an ordinal score (s_i) to each observation:

s_i = Σ_{j=1}^{c} j · π̂_ij. (7)

Sort the observations according to the ordinal score and arrange them into g groups, such that group 1 contains the n/g observations with the lowest ordinal scores and group g contains the observations with the highest ordinal scores. In theory, any number of groups can be used, although g = 10 is commonly used for the binary Hosmer-Lemeshow test (the deciles of risk). Denote the sums of the observed and estimated frequencies in each group for each of the c response levels by O_kj and E_kj, respectively. The values of O_kj and E_kj can be summarized in a g × c contingency table, as shown in Table 1 (observed (O_kj) and estimated (E_kj) frequencies sorted and summed into g groups). The ordinal version of the Hosmer-Lemeshow test statistic is the Pearson χ² statistic

C_g = Σ_{k=1}^{g} Σ_{j=1}^{c} (O_kj − E_kj)² / E_kj. (8)

As shown in [3], the distribution of C_g is well approximated by the χ² distribution with (g − 2)(c − 1) + (c − 2) degrees of freedom under a correctly fitted proportional odds model. C_g was originally derived for use after fitting a multinomial logistic regression model. [5] The degrees of freedom for the multinomial test is (g − 2)(c − 1). This difference in degrees of freedom reflects the reduced number of regression coefficients of the proportional odds model when compared with the multinomial model. The adjacent-category and the constrained continuation-ratio models have the same number of regression coefficients as the proportional odds model. We thereby posit that the degrees of freedom under the adjacent-category and the constrained continuation-ratio models are also equal to (g − 2)(c − 1) + (c − 2).
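The grouping step and the C_g statistic can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; it assumes the estimated probabilities π̂_ij are already available from a fitted model, and the toy data at the end are arbitrary.

```python
def ordinal_scores(pihat):
    """Ordinal score s_i = sum_j j * pihat_ij for each observation."""
    return [sum((j + 1) * p for j, p in enumerate(row)) for row in pihat]

def cg_statistic(y, pihat, g=10):
    """Ordinal Hosmer-Lemeshow statistic C_g.
    y: list of responses coded 1..c; pihat: n x c estimated probabilities.
    Compare with the chi-square distribution on (g-2)(c-1) + (c-2) df."""
    n, c = len(y), len(pihat[0])
    s = ordinal_scores(pihat)
    order = sorted(range(n), key=lambda i: s[i])       # sort by ordinal score
    stat = 0.0
    for k in range(g):
        grp = order[k * n // g:(k + 1) * n // g]       # near-equal-sized groups
        for j in range(c):
            O = sum(1 for i in grp if y[i] == j + 1)   # observed frequency
            E = sum(pihat[i][j] for i in grp)          # estimated frequency
            stat += (O - E) ** 2 / E
    return stat

# Toy illustration: 20 observations, c = 3, all fitted probabilities equal
y = [(i % 3) + 1 for i in range(20)]
pihat = [[1 / 3, 1 / 3, 1 / 3] for _ in range(20)]
stat = cg_statistic(y, pihat, g=4)
```

In practice the probabilities come from the fitted ordinal model, so the scores differ across observations and the grouping is informative.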

The Lipsitz test
The test by Lipsitz et al. [7] is partly based on the Hosmer-Lemeshow approach and is suitable for a general ordinal regression model. As in Section 3.1, we sort the observations according to the ordinal score (Equation (7)) and partition them into g groups. Define g − 1 binary indicator variables I_1, ..., I_{g−1} such that I_k = 1 if the observation belongs to group k and I_k = 0 otherwise; group g serves as the reference group. Fit a new ordinal regression model that includes the indicator variables:

g_j(x) = α_j + βx + γ_1 I_1 + γ_2 I_2 + · · · + γ_{g−1} I_{g−1}. (9)

When the fitted model is the correct model, γ_1 = · · · = γ_{g−1} = 0. Denote the log likelihoods of the fitted models with and without the indicator variables by L_0 and L_1, respectively. The Lipsitz test statistic is the likelihood ratio statistic −2(L_1 − L_0). A P-value is obtained by comparing the observed value of the test statistic with the χ² distribution with g − 1 degrees of freedom. To make sure all expected counts are greater than 1.0, and at least 80% are greater than 5.0, Lipsitz et al. [7] propose that g is chosen such that 6 ≤ g < n/(5c).
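The construction of the group indicator variables, and the likelihood ratio statistic once the two models have been fitted, can be sketched as follows. This is a simplified illustration, not the authors' code; the actual refit of the augmented ordinal model must be done with an ordinal regression routine, which is not shown here.

```python
def lipsitz_indicators(scores, g):
    """Group-membership indicators I_1, ..., I_{g-1} for the Lipsitz test.
    Observations are sorted by ordinal score and split into g near-equal
    groups; the last group serves as the reference and gets all-zero
    indicators. Returns an n x (g-1) 0/1 design matrix."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    group = [0] * n
    for k in range(g):
        for i in order[k * n // g:(k + 1) * n // g]:
            group[i] = k  # 0-based group label
    return [[1 if group[i] == k else 0 for k in range(g - 1)] for i in range(n)]

def lipsitz_statistic(L0, L1):
    """Likelihood ratio statistic -2(L1 - L0), where L0 and L1 are the log
    likelihoods with and without the indicator variables; compare with the
    chi-square distribution on g-1 df."""
    return -2.0 * (L1 - L0)

# Toy illustration: 12 observations with distinct scores, g = 3 groups
indicators = lipsitz_indicators(scores=list(range(12)), g=3)
```

Each row of the returned matrix would be appended to the covariate vector of the corresponding observation before refitting the model.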

The Pulkstenis and Robinson tests
The Pearson χ² and deviance statistics can be used to assess goodness of fit in models where the number of possible covariate patterns is limited. We can then construct a contingency table of observed and estimated frequencies, where the covariate patterns form the rows and the response levels form the columns. When the number of covariate patterns is large, such as when continuous covariates are present, the expected cell counts become too small for the χ² asymptotics to hold. Pulkstenis and Robinson [8] suggest a grouping of covariate patterns that can be used when continuous covariates are present. Start by determining the covariate patterns using only the categorical covariates (ignoring any unobserved patterns). Calculate the ordinal scores (Equation (7)). Split each covariate pattern in two based on the median ordinal score within the pattern. We can now construct a contingency table of observed and estimated frequencies based on the cross-classification of covariate patterns with response levels. The two Pulkstenis and Robinson (PR) test statistics are the Pearson χ² and deviance statistics on that table:

PR(χ²) = Σ_{k=1}^{K} Σ_{l=1}^{2} Σ_{j=1}^{c} (O_klj − E_klj)² / E_klj (10)

and

PR(D²) = 2 Σ_{k=1}^{K} Σ_{l=1}^{2} Σ_{j=1}^{c} O_klj log(O_klj / E_klj), (11)

where l indexes the two subgroups based on the ordinal scores, K is the number of observed covariate patterns due to the categorical covariates, and c is the number of response levels. Let p_cat denote the number of dichotomous variables needed to model all the categorical covariates (i.e. after dummy variables have been substituted for categorical covariates with more than two categories). The reference distribution for both PR(χ²) and PR(D²) is the χ² distribution with (2K − 1)(c − 1) − p_cat − 1 degrees of freedom.
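A Python sketch of the PR grouping and the two statistics follows. It is an illustration under the description above, not the authors' implementation; in particular, the median split here assigns observations with scores equal to the median to the lower half, and the toy data at the end are arbitrary.

```python
import math
from statistics import median

def pr_statistics(y, pihat, cat_pattern):
    """Pulkstenis-Robinson Pearson chi-square and deviance statistics.
    y: responses coded 1..c; pihat: n x c estimated probabilities;
    cat_pattern[i]: hashable pattern of the categorical covariates for obs i.
    Each observed pattern is split in two at its median ordinal score."""
    n, c = len(y), len(pihat[0])
    scores = [sum((j + 1) * p for j, p in enumerate(row)) for row in pihat]
    chi2 = dev = 0.0
    for pat in set(cat_pattern):
        members = [i for i in range(n) if cat_pattern[i] == pat]
        m = median(scores[i] for i in members)
        low = [i for i in members if scores[i] <= m]
        high = [i for i in members if scores[i] > m]
        for idx in (low, high):
            for j in range(c):
                O = sum(1 for i in idx if y[i] == j + 1)  # observed frequency
                E = sum(pihat[i][j] for i in idx)         # estimated frequency
                if E > 0:
                    chi2 += (O - E) ** 2 / E
                    if O > 0:  # 0 * log(0/E) contributes nothing
                        dev += 2.0 * O * math.log(O / E)
    return chi2, dev

# Toy illustration: two categorical patterns, c = 2, varying fitted probabilities
y = [1, 1, 1, 1, 2, 2, 2, 2]
pihat = [[1 - i / 10, i / 10] for i in range(8)]
pattern = ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']
chi2, dev = pr_statistics(y, pihat, pattern)
```

With K = 2 patterns each split in two, the table has 2K = 4 rows, as in the cross-classification described above.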

Assessing the null distribution of the test statistics
We set up a simulation study to investigate the null distribution of the test statistics. For each ordinal model, we considered scenarios with three, four, and five response levels. Each scenario is characterized by particular values of the intercepts (αs) and regression coefficients (βs), the distribution of the covariates, and the sample size (n). Let x denote a continuous covariate, distributed as U(0, 10) (the uniform distribution on the interval [0, 10]) or N(5, 3) (the normal distribution with μ = 5 and σ = 3). Let d denote a dichotomous covariate distributed as Bernoulli(0.5). We used four sample sizes: n = 100, 200, 400, and 800. The intercepts and regression coefficients were selected to represent scenarios that one might encounter in practice. We assessed this by plotting the probability distributions of the response levels as functions of the continuous covariate, stratified by the dichotomous covariate. The plots for the final models that were used in the simulation study are shown as Web Figures 1-6 in the supplemental online material (available for downloading at the journal website). The number of simulated replications was 10,000 for all scenarios. We used Stata 12.1 (StataCorp LP, College Station, TX) to fit adjacent-category models with constrained multinomial logistic regression. Constrained continuation-ratio models were fitted with the Stata package sg86 by Wolfe. [10] We used Matlab R2013a (The Mathworks, Inc., Natick, MA) to perform the simulations.
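The Matlab simulation code is not reproduced here, but the core sampling step, drawing an ordinal response from the model-implied probabilities by inverting the cumulative distribution, can be sketched in Python (the probability vector and seed below are arbitrary example values):

```python
import random

def sample_response(probs, rng):
    """Draw one response level 1..c by inverting the CDF of probs."""
    u = rng.random()
    cum = 0.0
    for j, p in enumerate(probs):
        cum += p
        if u < cum:
            return j + 1
    return len(probs)  # guard against floating-point round-off

rng = random.Random(12345)
probs = [0.2, 0.3, 0.4, 0.1]  # would come from the model at covariate x_i
draws = [sample_response(probs, rng) for _ in range(10000)]
```

In the actual study, the probability vector would be recomputed for each observation from its simulated covariates before drawing the response, and the model would then be refitted to each replicated data set.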

Adjacent-category model
We defined three adjacent-category models, one for three response levels (Equation (12)), one for four response levels (Equation (13)), and one for five response levels (Equation (14)). Web Figures 1-3 show the probability distributions for each of the three models as functions of the covariates. We present the full results of the simulations in Web Tables 1-3. The results for three, four, and five response levels are practically equal, as are the results for the uniform and normal distributions of the continuous covariate. We therefore give a summary of the results in Table 2 (simulated rejection rates (%) averaged over two continuous distributions (uniform and normal) and three adjacent-category models (Equations (12)-(14))), which is an average of the results in Web Tables 1-3, stratified by sample size. The rejection rates of the C_g tests are quite close to the nominal level, whereas the rejection rates of the Lipsitz and PR tests are generally above the nominal level, particularly for small sample sizes. Figure 1 shows an example of the distribution of the test statistics and their adherence to the reference distributions. Details of the tail sections of the histograms are given in Figure 2. Figures 1 and 2 indicate excellent fit for the C_g tests and slightly heavy tails for the Lipsitz and PR tests.
An additional simulation was carried out to assess the performance of the tests on an adjacent-category model with three continuous covariates and no categorical covariates. The number of response levels was four (Equation (15)), where x_1, x_2, and x_3 are continuous covariates distributed as U(0, 10). We give the full results in Web Table 4 and discuss the results here only briefly. Compared with the results for the models in Equations (12)-(14), the 1% and 5% rejection rates of the C_g tests for the model in Equation (15) are slightly higher, especially for n = 100 and n = 200. For n = 400 and n = 800, the 5% and 10% rejection rates are quite close to the nominal level. The rejection rates of the Lipsitz test are generally too high, even for large sample sizes, with 5% rejection rates of 5.98 (n = 400) and 5.73 (n = 800). No simulations were done for the two PR tests, because the PR tests can only be used when both continuous and categorical covariates are present.

Constrained continuation-ratio model
We defined three constrained continuation-ratio models, one for three response levels (Equation (16)), one for four response levels (Equation (17)), and one for five response levels (Equation (18)). Web Figures 4-6 show the probability distributions for each of the three models as functions of the covariates. We present the full results of the simulations in Web Tables 5-7.
The results of the null distribution simulations for the constrained continuation-ratio model were similar to those for the adjacent-category model (Section 4.1). A summary is given in Table 3, which is an average of the results in Web Tables 5-7. The rejection rates of the C_g tests were slightly lower than the nominal level, but not by much. As in Section 4.1, the rejection rates of the Lipsitz and the PR tests were above the nominal level. Figure 3 shows histograms of the six test statistics for one of the scenarios with four response levels (Equation (17)), with details of the tail sections shown in Figure 4. Figures 3 and 4 give the same impression as Figures 1 and 2: a very good fit for the C_g tests and somewhat heavy tails for the Lipsitz and PR tests.
An additional simulation was carried out to assess the performance of the tests on a constrained continuation-ratio model with three continuous covariates and no categorical covariates. The number of response levels was four (Equation (19)), where x_1, x_2, and x_3 are continuous covariates distributed as U(0, 10). We give the full results in Web Table 8 and discuss the results here only briefly. The rejection rates of the C_g and Lipsitz tests are quite similar to those for the models in Equations (16)-(18), with slightly lower rejection rates for the C_g tests and higher rejection rates for the Lipsitz test, compared with the nominal level. No simulations were done for the two PR tests, because the PR tests can only be used when both continuous and categorical covariates are present.

Assessing the power of the tests
The previous section showed that while the rejection rates of the C_g tests were close to the nominal level, the rejection rates of the Lipsitz and PR tests were, in general, above the nominal level. We therefore expect the power of the Lipsitz and PR tests to be greater than that of the C_g tests. One might argue that it is unreasonable to compare the power of tests that do not have similar null distribution approximations; however, in the case of goodness-of-fit tests, we are not too concerned about small inflations of the significance level. A small increase in the probability of false positive results from goodness-of-fit tests merely means that the occasional model is investigated for possible lack of fit in more detail than necessary. Goodness-of-fit tests are tools to detect lack of fit, and as such, their power properties are more important than strict control of type I error rates. Sections 5.1-5.8 describe and report simulations used to estimate the power of the tests to detect several different types of lack of fit. All scenarios used a four-level response variable. Let x be a continuous covariate, distributed as U(0, 10), and let d be a dichotomous covariate, distributed as Bernoulli(0.5). As in Section 4, we used four sample sizes: n = 100, 200, 400, and 800. The number of replications was 5000.

Omission of a quadratic term in a continuous covariate
For the adjacent-category model, the correct model was defined in Equation (20), and for the constrained continuation-ratio model, in Equation (21). In both cases, the fitted model excluded the quadratic term (β_3 = 0). The coefficient of the quadratic term was β_3 = 0.02, 0.04, and 0.06, corresponding to three scenarios with increasing difference between the correct and fitted models. Web Figures 7 and 8 show the probability distributions of the responses as functions of the covariates using β_3 = 0.04 and the models in Equations (20) and (21), respectively. We give the full results of the simulations in Web Tables 9 and 10. An excerpt is shown in Table 4. The C_g tests had low to moderate power to detect a missing quadratic term. The power was greater for the adjacent-category model than for the constrained continuation-ratio model; however, that could be attributed to differences in the definitions of the actual models (i.e. the values of the parameters and their relation to the response) used in the simulations (Equations (20) and (21)). The power of the Lipsitz test was noticeably greater than that of the C_g tests, whereas the PR tests had poor power.

Omission of an interaction term between a continuous and a dichotomous covariate
For the adjacent-category model, the correct model was defined in Equation (22), and for the constrained continuation-ratio model, in Equation (23). In both cases, the fitted model excluded the interaction term (β_3 = 0). The coefficient of the interaction term was β_3 = 0.2, 0.4, and 0.6, corresponding to three scenarios with increasing difference between the correct and fitted models. Web Figures 9 and 10 show the probability distributions of the responses as functions of the covariates using β_3 = 0.4 and the models in Equations (22) and (23), respectively. We give the full results of the simulations in Web Tables 11 and 12. An excerpt is shown in Table 4. All tests had excellent power to detect a missing interaction term between a continuous and a dichotomous covariate under both the adjacent-category and the constrained continuation-ratio models. For small sample sizes or small values of β_3, the Lipsitz test had greater power than the PR tests, which, in turn, had greater power than the C_g tests.

Omission of an interaction term between two dichotomous covariates
Let d_1 and d_2 denote two dichotomous covariates, distributed as Bernoulli(0.5). For the adjacent-category model, the correct model was defined in Equation (24), and for the constrained continuation-ratio model, in Equation (25). In both cases, the fitted model excluded the interaction term (β_4 = 0). The coefficient of the interaction term was β_4 = 1.0, 2.0, and 3.0, corresponding to three scenarios with increasing difference between the correct and fitted models. Web Figures 11 and 12 show the probability distributions of the responses as functions of the covariates using β_4 = 2.0 and the models in Equations (24) and (25), respectively. We give the full results of the simulations in Web Tables 13 and 14. An excerpt is shown in Table 4. Under the adjacent-category model, all tests had good, but not great, power to detect a missing interaction term between two dichotomous covariates. The PR(χ²) test had the highest power, followed by the Lipsitz test. The PR(D²) test was sometimes better and sometimes worse than the C_g tests. Under the constrained continuation-ratio model, we observed excellent power with both PR tests, whereas some unexpected results were observed with the C_g and Lipsitz tests: the power for β_4 = 2.0 was smaller than the power for β_4 = 1.0 (Web Table 14). The greatest power was observed for β_4 = 3.0 (as expected), although neither the C_g tests nor the Lipsitz test had power near the level of the PR tests.

Wrong functional form of a continuous covariate
For the adjacent-category model, the correct model was defined in Equation (26), and for the constrained continuation-ratio model, in Equation (27). In both cases, the fitted model included the term x instead of log(x). Web Figures 13 and 14 show the probability distributions of the responses as functions of the covariates using the models in Equations (26) and (27), respectively. We give the full results of the simulations in Web Tables 15 and 16. An excerpt is shown in Table 4. Under the adjacent-category model, the C_g tests had low to moderate power, the Lipsitz test had moderate to good power, and the PR tests had poor power to detect the wrongly specified functional form of x. For the constrained continuation-ratio model, the C_g tests had low power, the Lipsitz test had low to moderate power, whereas the PR tests had power only slightly above the nominal significance level.

Detection of a nominal (unordered) response variable
We generated a four-level response variable based on the multinomial logits in Equations (28)-(30). We then fitted the two ordinal models with x and d as covariates. Web Figure 15 shows the probability distributions of the model in Equations (28)-(30) as functions of the covariates. We give the full results of the simulations in Web Tables 17 and 18. An excerpt is shown in Table 5. The PR tests had excellent power to detect an unordered response variable under both the adjacent-category and the constrained continuation-ratio models. The C_g tests had good power for both models, whereas the power of the Lipsitz test was poor.

Omission of an independent covariate
In this scenario, we generated data from models with two continuous covariates (x_1 and x_2) and one dichotomous covariate (d), then fitted models without one of the continuous covariates (x_2). The adjacent-category model was given in Equation (31) and the constrained continuation-ratio model in Equation (32). We give the full results of the simulations in Web Tables 19 and 20. An excerpt is shown in Table 5. The power of all tests was low. The highest power was observed for the two PR tests under the constrained continuation-ratio model when n = 800; however, even with such a large sample size, the 5% rejection rates were below 20%.

Detection of constrained continuation-ratio data by the adjacent-category model
In this section, we generated data from the constrained continuation-ratio model and fitted them with the adjacent-category model. The models consisted of one continuous covariate (x) and one dichotomous covariate (d). We considered two scenarios, one in which the continuous covariate was the most influential, and one in which the dichotomous covariate was the most influential. The two constrained continuation-ratio models used to generate the data that were fitted by the adjacent-category model were

g_j(x) = α_j + 0.25x + 0.5d, α_j = [−1.5, −1.0, 0], j = 1, 2, 3, (33)

and

g_j(x) = α_j + 0.05x + 1.0d, α_j = [−1.5, −1.0, 0], j = 1, 2, 3. (34)
Web Figure 16 shows the probability distributions of the models in Equations (33) and (34) as functions of the covariates. We give the full results of the simulations in Web Table 21. An excerpt is shown in Table 5. When the continuous covariate was more influential than the dichotomous covariate, the C_g tests had good power and the PR tests had very good power to detect that the wrong model had been fitted. The power of the PR tests was also quite good in the scenario in which the dichotomous covariate was the most influential, whereas the power of the C_g tests was low in this scenario. The Lipsitz test had poor power in both scenarios.

Detection of adjacent-category data by the constrained continuation-ratio model
This section considers a scenario similar to that of the previous section; however, we now generated data from the adjacent-category model and fitted them with the constrained continuation-ratio model. Again, two scenarios were considered: one in which the continuous covariate was the most influential, and one in which the dichotomous covariate was the most influential. The two adjacent-category models used to generate the data that were fitted by the constrained continuation-ratio model were given in Equations (35) and (36); the second was

g_j(x) = α_j + 0.075x + 1.75d, α_j = [−0.5, 0.5, 1.0], j = 1, 2, 3. (36)
Web Figure 17 shows the probability distributions of the models in Equations (35) and (36) as functions of the covariates. We give the full results of the simulations in Web Table 22. An excerpt is shown in Table 5. We observe a similar pattern of results here as in the previous section: the PR tests have good power in both scenarios; the C_g tests have better power when the continuous covariate is the most influential than when the dichotomous covariate is the most influential, although this difference is only noticeable for small sample sizes; and the Lipsitz test has poor power in both scenarios. The simulated power values are higher in Web Table 21 (detection of constrained continuation-ratio data by the adjacent-category model) than in Web Table 22 (detection of adjacent-category data by the constrained continuation-ratio model), as can also be seen in Table 5; however, this may be due to the choices of model parameters. The effect of the continuous covariate appears to be stronger in Web Figure 16 (detection of constrained continuation-ratio data by the adjacent-category model) than in Web Figure 17 (detection of adjacent-category data by the constrained continuation-ratio model), which may affect the power values.

Example
In [3], the proportional odds model was fitted to data from a study of determinants of aftercare services for psychiatrically hospitalized adolescents, [11] and goodness of fit was assessed with the C_g, Lipsitz, and PR tests. The data set for that study, called the Adolescent Placement Study, is available from the website of the University of Massachusetts (http://www.umass.edu/statdata/statdata). In short, the data set consists of the medical records of 508 adolescents admitted to three psychiatric hospitals and includes sociodemographics, clinical and family characteristics, service history, and treatment characteristics. A description of the variables in the data set and details on fitting, interpretation of model parameters, and model-building strategies can be found in [1]. Here, we expand on the analyses in [1,3] to illustrate testing goodness of fit for the adjacent-category and constrained continuation-ratio models.

We first consider the response variable Neuro (neuropsychiatric disturbance: 1 = none, 2 = mild, 3 = moderate, 4 = severe). As described in [3], there was no evidence of lack of fit for a proportional odds model with Age (in years, centred about the sample mean age of 14.3 years), Age², Gender (0 = female, 1 = male), Race (0 = white, 1 = non-white), Emot (emotional disturbance: 0 = mild, 1 = severe), and Custd (state custody: 0 = no, 1 = yes) as covariates (p ≥ .39 for all tests). We now fit adjacent-category and constrained continuation-ratio models of Neuro on the six covariates and calculate the C_8, C_10, C_12, Lipsitz, PR(χ²), and PR(D²) tests. Web Tables 23 and 24 show the results of the model fits, which are similar to the fit of the proportional odds model in [3]. We observe no evidence of lack of fit for any of the models with any of the tests: p-values from .39 to .98 for the adjacent-category model and p-values from .18 to .92 for the constrained continuation-ratio model (Web Table 25).
Next, we consider the response variable Danger (danger to others: 1 = unlikely, 2 = possibly, 3 = probably, and 4 = likely). In [1], the purposeful selection procedure was used to obtain a proportional odds model for Danger, which resulted in four significant covariates (at the 5% level): Weeks (length of stay in hospital, measured in weeks), Behav (behavioural symptom score 0-9), Gender, and Custd. The final model was evaluated for fit using Brant's test [12] (p < .001), an approximate likelihood ratio test (p < .001), and the C_10 test (p = .001), all of which indicated that the proportional odds assumption was violated. Using the same covariates, we fit adjacent-category (Table 6) and constrained continuation-ratio (Table 7) models to Danger and calculate the goodness-of-fit tests (Table 8). All three C_g tests and both PR tests show clear signs of lack of fit for both the adjacent-category and constrained continuation-ratio models. The Lipsitz test, on the other hand, does not indicate lack of fit for either of the two models.
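For readers who wish to reproduce this type of analysis, the constrained continuation-ratio model can be fitted with standard binary logistic regression software after a person-level expansion of the data. The sketch below is ours, not the article's code, and the function and variable names are hypothetical; the comparison modelled, P(Y = j) versus P(Y > j) given Y ≥ j, follows the description of the model in the Introduction.

```python
import numpy as np

def expand_continuation_ratio(y, X):
    """Expand (y, X) for a constrained continuation-ratio fit.

    For a subject with response y in 1..c, create one record per level
    j = 1..min(y, c - 1): the binary outcome is 1 if y == j (the subject
    'stops' at level j given y >= j), else 0. Level dummies supply the
    cut-point intercepts; the shared covariate columns give the common
    (constrained) slopes."""
    c = int(np.max(y))
    rows, outcomes, levels = [], [], []
    for i, yi in enumerate(y):
        for j in range(1, min(int(yi), c - 1) + 1):
            rows.append(X[i])
            outcomes.append(1.0 if yi == j else 0.0)
            levels.append(j)
    level_dummies = (np.asarray(levels)[:, None] == np.arange(1, c)).astype(float)
    return np.asarray(outcomes), np.hstack([level_dummies, np.asarray(rows)])

# Simulated stand-in data: 300 subjects, 2 covariates, 4 response levels
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = rng.integers(1, 5, size=300)
out, design = expand_continuation_ratio(y, X)
```

The expanded pair (out, design) can then be passed to any binary logistic regression routine: the coefficients of the level dummies estimate the cut-point intercepts, and the remaining coefficients estimate the common (constrained) slopes.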
A secondary tool to examine lack of fit is to tabulate the observed and estimated frequencies, as in Table 1. Using g = 10 groups, corresponding to the C_10 test, the contingency table for the adjacent-category model of Danger on Weeks, Behav, Gender, and Custd is shown in Table 9. The agreement between the observed and estimated frequencies is mostly fair, although there are some large discrepancies, particularly in the row for group 1 and in the columns corresponding to response levels Danger = 3 and Danger = 4, that suggest lack of fit. Because we are unable to relate these discrepancies to specific values of the covariates, and because case-wise diagnostic tools have not been developed for ordinal regression models, this is as far as we can go with the assessment of fit for this particular model.

So far, we have considered proportional odds, adjacent-category, and constrained continuation-ratio models for the relationship between Danger and Weeks, Behav, Gender, and Custd. These three models all constrain the regression coefficients to be constant across the logits, and they all exhibit lack of fit. The best way forward may be to use a model that allows the regression coefficients to vary with the logits. The unconstrained continuation-ratio model does this without ignoring the ordinal nature of the response. Unfortunately, there are as yet no methods available to assess the goodness of fit of such models. An alternative is to use a multinomial logistic regression model, for which a multinomial C_g test and a normalized version of the ungrouped Pearson χ² test are available. [5] The disadvantage of this approach is that the response is treated as unordered, thus ignoring the inherent information in the ordinal response.
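As an illustration of how a table like Table 9 can be constructed, here is a minimal sketch (ours, not the article's code) of grouping observations for an ordinal Hosmer-Lemeshow-type comparison of observed and estimated frequencies. The grouping score used below, the estimated mean response level, is one common choice and an assumption on our part, not necessarily the article's exact definition.

```python
import numpy as np

def cg_table(y, probs, g=10):
    """Observed vs. estimated frequency table for an ordinal
    Hosmer-Lemeshow-type test (a sketch; the grouping score is the
    expected response level under the fitted model).

    y: (n,) responses coded 1..c; probs: (n, c) fitted probabilities."""
    n, c = probs.shape
    score = probs @ np.arange(1, c + 1)          # expected response level
    order = np.argsort(score, kind="stable")     # sort subjects by score
    groups = np.array_split(order, g)            # g near-equal ordered groups
    obs = np.zeros((g, c))
    exp = np.zeros((g, c))
    for i, idx in enumerate(groups):
        for j in range(c):
            obs[i, j] = np.sum(y[idx] == j + 1)  # observed counts
        exp[i] = probs[idx].sum(axis=0)          # summed fitted probabilities
    return obs, exp

# Stand-in fitted probabilities and responses for 200 subjects, 4 levels
rng = np.random.default_rng(42)
probs = rng.dirichlet(np.ones(4), size=200)
y = rng.integers(1, 5, size=200)
obs, exp = cg_table(y, probs, g=10)
```

A Pearson-type statistic, sum((obs - exp)**2 / exp), computed over this g × c table is the basis of the C_g statistic; its reference chi-square distribution and degrees of freedom should be taken from the article's definitions rather than from this sketch.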

Discussion
This article has investigated the null distribution and power properties of three approaches for testing goodness of fit under two ordinal logistic regression models: the adjacent-category and the constrained continuation-ratio models. An ordinal version of the Hosmer-Lemeshow test (C_g) had excellent null distribution results and good, but not great, power. The Lipsitz test [7] had null rejection rates slightly above the nominal level and higher power than the C_g test for four of the eight considered scenarios (Table 10). Two tests by Pulkstenis and Robinson [8] had null rejection rates somewhat above the nominal level under both ordinal models. The power of the PR tests was better than that of the C_g test for five of the eight considered scenarios and better than that of the Lipsitz test for four of the eight scenarios (Table 10). The C_g and Lipsitz tests work best when lack of fit is associated with continuous covariates, whereas the PR tests excel when categorical covariates drive the lack of fit, such as when an interaction term that includes a categorical covariate is missing. Another useful feature of the PR tests is the ability to assess which covariate patterns contribute to the lack of fit by way of a contingency table of observed and estimated frequencies. An example where scrutiny of the contingency table suggests inclusion of an interaction term between two dichotomous covariates can be seen in the Example section of [8].
The simulation results for the adjacent-category and the constrained continuation-ratio models in this article resemble those in Fagerland and Hosmer, [3] who evaluated the C_g, Lipsitz, and PR approaches for testing goodness of fit under the proportional odds regression model. The only noticeable difference was that the power to detect that a continuous covariate was modelled as x instead of log(x) was higher under the adjacent-category model than under the proportional odds model. The consistency of results across this article and Fagerland and Hosmer [3] allows us to make a joint recommendation for testing goodness of fit in these three ordinal logistic regression models: because the different tests may detect different types of lack of fit, we recommend use of all three approaches.
Three versions of the C_g test were assessed: C_8, C_10, and C_12, which correspond to grouping observations into 8, 10, and 12 groups, respectively. The three tests had similar null distribution properties; however, the power of C_8 was usually higher than that of C_10, which in turn was higher than that of C_12. This result is consistent with the results for the multinomial C_g test in [5] and the C_g test for the proportional odds model in [3]. Even though the power of the C_8 test seems to be higher than that of the C_10 test, we recommend the C_10 test because using 10 groups is an established practice for the binary Hosmer-Lemeshow test.
In some cases, regardless of the type of ordinal model used, one or more of the tests may not be computable. The Lipsitz test is not recommended if the number of groups (g) cannot satisfy 6 ≤ g < n/(5c), where n is the sample size and c is the number of response levels. The PR tests can only be calculated if both continuous and categorical covariates are present. If the number of covariate patterns is small, which happens when there are no continuous covariates and only a few categorical covariates are present, neither the C_g tests nor the PR tests can be calculated. In this case, the standard Pearson χ² statistic can be used to assess goodness of fit.
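The feasibility condition for the Lipsitz test is easy to check in practice. The helper below is a hypothetical convenience function, not part of the article; in particular, it reads the article's "n/5c" as n/(5c), which is our interpretation.

```python
def lipsitz_feasible(n, c, g=10):
    """Return True if the recommended condition 6 <= g < n/(5c) holds
    for a Lipsitz test with g groups, n observations, and c response
    levels. (Reading 'n/5c' as n/(5*c) is our interpretation.)"""
    return 6 <= g < n / (5 * c)

# With n = 508 subjects and c = 4 response levels, as in the example
# data set, g = 10 groups satisfies 6 <= 10 < 25.4
feasible = lipsitz_feasible(508, 4, g=10)
```

With small samples the condition fails quickly: for example, n = 100 and c = 4 give n/(5c) = 5, so no g ≥ 6 is admissible and the Lipsitz test is not recommended.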
The most important limitations of this article are associated with the limited range of the scenarios and parameters of the simulation study: (i) the number of covariates has been kept small and their distributions simple; (ii) the relationship between the covariates and the response has been described using a limited set of regression coefficients; and (iii) power has been assessed using eight specific scenarios for lack of fit. These restrictions were imposed to keep the simulation study manageable to perform, interpret, and report. We believe the results show a consistency that allows for general, albeit cautious, recommendations.
In summary, we recommend that all three approaches (C_g, Lipsitz, and PR) be used whenever possible to assess goodness of fit in ordinal logistic regression models. Together, these tests have reasonable power with moderate to large sample sizes to detect several types of lack of fit: omission of a quadratic term in a continuous covariate, omission of an interaction term between a continuous and a dichotomous covariate, omission of an interaction term between two dichotomous covariates, wrong functional form of a continuous covariate, an unordered response variable, constrained continuation-ratio data fitted by the adjacent-category model, and adjacent-category data fitted by the constrained continuation-ratio model. Because power is generally low for small sample sizes, we recommend using a nominal significance level of 10% whenever the sample size is smaller than, say, 400.

Disclosure statement
No potential conflict of interest was reported by the authors.