The Log Multinomial Regression Model for Nominal Outcomes with More than Two Attributes

An estimate of the risk or prevalence ratio, adjusted for confounders, can be obtained from a log binomial model (binomial errors, log link) fitted to binary outcome data. We propose a modification of the log binomial model to obtain relative risk estimates for nominal outcomes with more than two attributes (the “log multinomial model”). Extensive data simulations were undertaken to compare the performance of the log multinomial model with that of an expanded data multinomial logistic regression method based on the approach proposed by Schouten et al. (1993) for binary data, and with that of separate fits of a Poisson regression model based on the approach proposed by Zou (2004) and Carter, Lipsitz and Tilley (2005) for binary data. Log multinomial regression resulted in “inadmissable” solutions (out‐of‐bounds probabilities) exceeding 50% in some data settings. Coefficient estimates by the alternative methods produced out‐of‐bounds probabilities for the log multinomial model in up to 27% of samples to which a log multinomial model had been successfully fitted. The log multinomial coefficient estimates generally had lesser relative bias and mean squared error than the alternative methods. The practical utility of the log multinomial regression model was demonstrated with a real data example. The log multinomial model offers a practical solution to the problem of obtaining adjusted estimates of the risk ratio in the multinomial setting, but must be used with some care and attention to detail. (© 2007 WILEY‐VCH Verlag GmbH & Co. KGaA, Weinheim)


Introduction
Fitting a generalized linear model (McCullough and Nelder, 1989) with a non-canonical logarithmic link to binomial data makes it possible to directly estimate risk ratios in prospective studies and prevalence ratios in cross-sectional studies. This is the log binomial model (Wacholder, 1986;Blizzard and Hosmer, 2006). Its use is advocated (Zochetti, Consinni and Bertazzi, 1995;Skov et al., 1998;McNutt et al., 2003;Deddens, Petersen and Lei, 2003;Spiegelman and Hertzmark, 2005; Barros and Hirakata, 2006;Lumley, Kronmal and Ma, 2006) particularly when the binary outcome is not rare. The odds ratio produced by logistic regression approximates the risk or prevalence ratio, but the approximation is adequate only when the outcome is rare in all exposure and confounder categories (Greenland, 1987) and is liable to misinterpretation and misreporting when it is not (Katz, 2006).
For nominal outcome data with more than two levels, the choice of modelling methods is limited presently to a modification of the binary logistic regression model. An early development in the business and econometric literature in this respect was the discrete choice model proposed by McFadden (1974). It is termed the multinomial, polychotomous or polytomous logistic regression model in the health and life sciences. For example, consider a follow-up study of children into adulthood with daily tobacco smoking as the outcome. The outcome categories are never-smoker, former smoker and current smoker. Possible covariates include sex, childhood attitudes to smoking, and continuous factors such as age. Multinomial logistic regression makes it possible to approximate the relative risk of being a former smoker and the relative risk of being a current smoker at follow-up with adjustment for continuous covariates.
In principle, it should be possible to modify the log binomial model to obtain relative risk estimates for data of this type. We are not aware of this having been attempted previously. In this paper, we examine some of the technical issues that arise when one attempts to fit a model with a log link to multinomial data. We refer to it as the log multinomial model. The paper is organised as follows. In Section 2, the log multinomial regression model is introduced and likelihood-based estimation methods are reviewed. Two alternative methods, one based on the approach of Schouten et al. (1993) for binary data, and another involving separate fits of a Poisson regression model, are proposed in Section 3. In Section 4, we present the results of a simulation study of the performance of the estimation methods. An application to real data is provided in Section 5. Finally, the implications are summarised in Section 6.

The Log Multinomial Regression Model
Suppose that individuals in an infinitely large population possess only one of the ðJ þ 1Þ attributes A 0 , A 1 , A 2 , . . . A J . Let random variable Y denote the outcome of sampling from this population, where Y ¼ j if attribute A j is observed, and let Y j , j ¼ 0; 1; 2; . . . J denote binary (0/1) random variables indicating which attribute is observed, with P J j¼0 Y j ¼ 1. Assume the mean p j of each Y j depends upon a linear combination of the observed values x 1 ; x 2 ; . . . x K of K non-constant covariates X 1 ; X 2 ; . . . X K . Denote the linear predictor as . . . J are parameters to be estimated. Assume further that the p j ¼ p j ðxÞ follow the simple exponential form p j ðxÞ ¼ exp ðx 0 b j Þ. The requirement P J j¼0 p j ¼ 1 renders as redundant the modelling of one outcome category (say the first, corresponding to j ¼ 0). Then the log multinomial regression model is: with Y 0 ¼ 1 À P J j¼1 Y j and Pr ðY 0 ¼ 1 j xÞ ¼ 1 À P J j¼1 Pr ðY j ¼ 1 j xÞ, and with the random and systematic components linked by the logarithmic transform ln ½p j ðxÞ ¼ x 0 b j . Finally we assume there are n independent observations of the multinomial outcome variable Y and the covariates, and denote the observed data as ðy i ; x i Þ; i ¼ 1; 2; . . . n where y 0 i ¼ ðy i0 ; y i1 ; y i2 ; . . . ; y iJ Þ. The conditional log likelihood function is: The likelihood equations for the parameters are found by taking the first partial derivatives of LðbÞ with respect to the JðK þ 1Þ parameters b jk ; j ¼ 1; 2; . . . J, k ¼ 0; 1; 2; . . . K. These JðK þ 1Þ Â 1 score equations may be written as the sum of n gradient vectors g i ðbÞ of dimension JðK þ 1Þ Â 1 containing the scores for i-th subject: Under standard maximum likelihood theory, the observed information matrix Iðb bÞ is the negative of the JðK þ 1Þ Â JðK þ 1Þ matrix of second partial derivatives evaluated at the likelihood-maximising solutionb b, and an estimator of the covariance matrix of the maximum likelihood estimator is its inverse: The elements of b V Vðb bÞ depend on the observed value of the outcome variable, its fitted values and the covariate values. In contrast, the elements of the observed information matrix for the multinomial logistic model are independent of the observed values of the outcome variable. This difference is a consequence of using the log link in place of the logit link. An "information sandwich" estimator is used to provide robustness to model misspecification. The "meat" of the sandwich is the JðK þ 1Þ Â JðK þ 1Þ matrixM M that is based on the outer product of the gradient vector evaluated atb b: The robust version using the observed information matrix is: An issue in fitting the log multinomial regression model is that the permissible parameter space for a maximum likelihood estimator of b is constrained to the subspace of R JðKþ1Þ that generates fitted probabilities that lie between 0 and 1. Two of the adverse outcomes in attempting to fit the model in practice are that the iterative methods used by the software package may fail to converge, and that a converged solution may not be admissible. Here we define an admissible solution as a value ofb b such that x 0 b iĵ j < 0; i ¼ 1; 2; . . . ; n; j ¼ 1; 2; . . . ; J and hence 0 <p p j ðx i Þ < 1, i ¼ 1; 2; . . . n, j ¼ 0; 1; 2; . . . J. This requires that P J j¼1p p j ðx i Þ < 1, i ¼ 1; 2; . . . n. We discuss this more extensively in later sections.
3 Two Alternative Models for Modelling Multinomial Outcomes 3.1 The expanded data logistic regression approach of Schouten et al.
To estimate relative risk for binomial data using logistic regression, Schouten et al. (1993) proposed expanding the data so that comparison group includes all subjects in the sample rather than the "noncases" only. To do this for binary outcomes Y ¼ 0 or 1, a new binary indicator variable D is defined with value d ¼ y, and the data for subjects with y ¼ 1 are replicated but with the additional observations in the augmented data assigned the value d ¼ 0. This logistic regression model applied to the expanded data yields consistent estimators of the log binomial slope coefficients. They also derive a consistent "information sandwich" type of estimator of the covariance matrix of the estimated coefficients. Modifying this approach for multinomial data is equivalent to maximising over the original sample the quasi-likelihood: . þ a jK x iK , x 0 i a 0 ¼ 0 and exp ðx 0 i a 0 Þ ¼ 1, and where the q j ðxÞ follow the multinomial logistic regression model: . . . ; n; j ¼ 1; 2; . . . J : An estimator of the covariance matrix of the estimated coefficientsâ a j can be obtained using an "information sandwich" analogous to that described by Schouten et al. (1993) for the binary case. The estimator of the variance-covariance matrix of the expanded data multinomial logistic model is: whereĤ H denotes the outer product of the gradient vectors. In this case,Ĥ H has the form: where r i ðâ aÞ is a gradient vector of dimension JðK þ 1Þ containing the estimated scores for i-th subject computed over the original sample with typical element: . . . n; j ¼ 1; 2; . . . J; k ¼ 0; 1; 2; . . . K : Estimatesq q j ðx i Þ of the parameters of the log multinomial model, and of their estimated standard errors, can be obtained in this way from nearly any set of data. The solution is not guaranteed to be admissible for the log multinomial model, however. This is an issue we investigate in Section 4.

Separate Poisson regression models
To estimate relative risk for binomial data, McNutt et al. (2003) investigated the use of the Poisson regression model with log link and Poisson errors that has the same form as the log binomial model: The coefficient estimatesĝ g consistently estimate the coefficients from the log binomial model (Carter et al., 2005). This approach requires no data modification, and can be easily performed using widely available software such as SAS and Stata. It produces confidence intervals that are too wide because the Poisson errors are over-estimates of the binomial errors when the outcome is not rare. Zou (2004) proposed correcting for this by using the information sandwich estimator to obtain variance estimates that are robust to the error misspecification. Carter et al. (2005) showed that by using the information sandwich estimator of the covariance matrix of the estimated coefficients following the Poisson regression fit, one obtains a consistent estimator of the covariance matrix of estimated coefficients from a log binomial fit.
Whilst there is no Poisson model for multinomial data, it is possible to fit separate binary Poisson regression models to the data for J of the J þ 1 outcomes. If outcome j ¼ 0 is chosen as the redundant category, this involves fitting the separate Poisson regression models: . . . n; j ¼ 1; 2; . . . J : Whilst cumbersome, this procedure is straightforward and we used it to provide a further benchmark against which to compare the estimates from the log multinomial model. Because the mean of the Poisson regression model is not constrained to be less than one, however, the resulting estimated coefficients may yield inadmissible fitted values for the log binomial model probabilities. This is investigated in the simulations of Section 4.

Simulation Study
In this section, we describe simulation studies conducted to assess practical aspects of fitting the log multinomial model and to evaluate its performance. Results are presented using three summary statistics. The first two are average percent relative bias, ð100=LÞ . . . K, and average percent mean squared error, ð100=LÞ jk is the coefficient estimated from the l-th of L replications of the data and b jk is its design ("true") value. The third is the observed coverage of confidence intervals calculated asb b . . . L where z 1Àa=2 is the 100ð1 À a=2Þ percentile of the standard normal distribution, and where a ¼ 0:05 for 95% confidence intervals. The estimated standard errors c SE SE ðb b ðlÞ jk Þ were produced by the robust version (4) of the observed information matrix for the log multinomial model, the covariance matrix (5) robust to duplication of observations for the expanded data multinomial logistic model, and the information sandwich estimator for the separate Poisson regression fits.
All calculations were performed using Stata (Stata Corp., 2003) with modified Newton-Raphson optimisation performed using the ml command (Gould, Pitbaldo and Sribney, 2003). In standard implementation of its estimation algorithm, evaluation of the terms y ij Â ln ½p j ðx i Þ in the log likelihood: To avoid this, the algorithm was written in a manner that ensured each y ij Â ln ½p j ðx i Þ was evaluated only when y ij ¼ 1. This reduced non-convergence of the estimation algorithm to a very rare occurrence, and replaced non-convergent solutions with convergent but inadmissible solutions withp p 0 ðx i Þ ¼ 1 À P J j¼1p p j ðx i Þ 0. This simple alteration enabled us to study and report results for inadmissible solutions.

Simulations I
The basic design was a univariable model for a three-category (J ¼ 3) multinomial outcome variable Y and a single (K ¼ 1) covariate X distributed uniformly on the interval ½À6; a. The settings are summarised in Table 1.
Binary (0/1) indicator variables Y j ; j ¼ 0; 1; 2 were used to denote the outcomes with Y 0 the redundant category. The intercept (b 20 ) and slope (b 21 ) coefficients in the model for Y 2 were chosen such that Pr ðY 2 ¼ 1 j x ¼ À6Þ ¼ 0:01 and Pr ðY 2 ¼ 1 j The upper bound a of the uniform distribution was then chosen so that the marginal probability of response: Biometrical Journal 49 (2007) 6  893 Table 1 Design of the first set of simulations.

Setting
Outcome Finally, outcomes were assigned by drawing a uniform random deviate (u) from a Uð0; 1Þ distribution, and setting We drew samples of size 500 because datasets of size n ¼ 500 are commonly encountered in practice, with an adequate number of replications to produce at least 1000 inadmissable solutions in settings 2, 4 and 6.
The results of fitting the three models are shown in Table 2. We expected convergence and/or admissibility problems in settings 2, 4 and 6 where the greater of the maximum design probabilities Pr ðY 2 ¼ 1 j x ¼ aÞ was 0.93 or greater, and the sum of the maximum design probabilities was at least 0.95, but not in the remaining settings where the sum of the maximum design probabilities was lower (around 0.75). As expected, fitting problems occurred in settings 2, 4 and 6 and their frequency increased with the size and sum of the maximum design probabilities. The samples that produced fitting problems for each model did not correspond exactly, with the expanded data multinomial logistic regression model and the separate Poisson regression models producing inadmissable probabilities for the log multinomial model in 19-28 percent of samples successfully fitted by a log multinomial regression model.
The results of estimating the coefficient of the uniformly distributed, continuous covariate are shown in Table 3. Bias and mean squared error were small in the admissible samples and, whilst the differences between models were minor, the log multinomial estimator generally had least bias and mean squared error. The expanded data multinomial logistic estimator had the poorest performance in these respects. The confidence interval coverage for each model straddled 95 percent, with those produced by the log multinomial estimator most often closest to the target. Given the design settings, the inadmissible samples for settings 2, 4 and 6 were produced by inflated estimates of the slope coefficient of the continuous covariate in the second outcome category (b 21 ). In consequence, the bias and mean squared error of the estimates of that parameter were substantially higher in the inadmissi- § Percentage of the samples producing admissable fits of a LnMM for which the estimated coefficients from the EMLM or the separate Poisson regression models produced at least one inadmissable log multinomial model probability.
ble samples than in the admissible samples, and bias reached nearly 10 percent for the log multinomial estimator in setting 2. With only modest differences in the standard error estimates, the increased bias was the main contributor to poorer confidence interval coverage. There was little to choose between the three models. Results are not provided for settings 1, 3, 5 and 7 because inadmissable samples were rare in those settings.

Simulations II
A second set of simulations was performed with models for a three-category (J ¼ 3) multinomial outcome Y with two covariates: a dichotomous covariate D $ BðpÞ having Bernoulli probability p ¼ 0:20; 0:35 and 0:50 in settings 8-10 respectively, and a uniformly distributed continuous covariate U associated with D via density function f ðU j D ¼ dÞ ¼ UðÀ6 þ 2 Â d; 2 þ 2 Â dÞ. In the model for Y 2 , we set the intercept coefficient as b 20 ¼ ln ð0:3Þ and the slope coefficient for the dichotomous  was as large as possible whilst the maximum log binomial probability: was at least 90 percent. In the model for Y 1 , we set Pr ðY 1 ¼ 1 j d ¼ 0; u ¼ À6Þ ¼ 0:5, 0:4 and 0:3 and Pr ðY 1 ¼ 1 j d ¼ 0; u ¼ 2Þ ¼ 0:12, 0:15 and 0:18 in settings 8-10 respectively, constrained Pr ðY 1 ¼ 1 j d ¼ 1; u ¼ 4Þ ¼ 0:5 Â ½1 À Pr ðY 2 ¼ 1 j d ¼ 1; u ¼ 4Þ, and solved for values of the parameters (b 10 ; b 11 ; b 12 ) that satisfied these conditions. These designs are summarised in Table 4. The results for the dichotomous and continuous covariates are shown in Table 5. In the admissible samples, the highest values of average percent relative bias occurred for the coefficient of the dichotomous covariate in the first outcome category (b 11 ) of setting 8. The high values of bias occurred in occasional data replications with a low proportion of d ¼ 1 values among observations with y 1 ¼ 1. This produced a right-skewed distribution of bias in the estimates of b 11 , with some very large values particularly for the log multinomial model. The median bias for this model was only 1.33, however. Otherwise bias for the log multinomial estimates did not exceed 5 percent and was generally less than that of the corresponding estimates from the alternative estimators. The log multinomial model estimator also had an advantage in mean squared error. Coverage straddled the 95 percent target and did not favour any model.
The inadmissible samples were produced by inflated estimates of the slope coefficient of the continuous covariate in the second outcome category (b 22 ). The bias and mean squared error of the estimates of that parameter were very high in consequence, and confidence interval coverage was substantially less than the target. This was particularly the case for the log multinomial estimator. For the other slope coefficients, the log multinomial model had slightly superior performance than the alternative models.

Example of Fitting a Log Multinomial Model to Real Data
As an example, we use data from a follow-up of participants in the 1985 Australian Schools Health and Fitness Survey. In that survey, an extensive battery of measurements of cardiovascular health, physical activity and fitness was conducted on a nationally representative sample of 7-15 year old Australian schoolchildren. Those aged 9-15 years (n ¼ 6414) completed a questionnaire that included questions on attitudes to tobacco smoking. One of those questions required respondents to indicate how important they regarded being a non-smoker. Twenty years later in 2004-05, those remaining in follow-up (n ¼ 3662) reported whether they had taken up daily smoking. To illustrate the methodology with a sample of around n ¼ 500, we restricted the analyses to male participants at follow-up who were aged 10 or 11 years in 1985. This produced the summary data in Table 6: The purpose of the analysis was to obtain estimates of the relative risk of each category of adult smoking status in 2004-05 for not regarding non-smoking as being very important in 1985, when subjects were 10 or 11 years of age. Further, these estimates were to be adjusted for exact age in 2004-05 (range 27.3 years to 32.2 years). The outcome variable for the analysis of these data was daily_smoking (Y) with its three categories coded as Y ¼ 0 (never smoked), Y ¼ 1 (former smoker) and Y ¼ 2 (current smoker). The covariates were a binary (0/1) indicator for not regarding being a non-smoker as very important (ambivalent), and a continuous predictor for age at follow-up (age). The values of age were centred by subtracting from them the mean age for the sub-sample (28.8 years).
The results of using Stata's ml command to fit a log multinomial regression model to these data are shown in Table 7. The intercept (_cons) and slope (ambivalent) in the model without adjustment for age correctly estimate the log baseline risk and log relative risk respectively, and yield values of risk and relative risk that match those able to be calculated from the summary data in Table 6. The standard errors in Table 7 are robust estimates calculated from the observed information using the information sandwich estimator (4). When estimated from the Hessian (3), the estimates are c SE SEðb b 10 Þ ¼ 0:114370 and c SE SEðb b 11 Þ ¼ 0:298436 for former smoking, and c SE SEðb b 20 Þ ¼ 0:081391 and c SE SEðb b 21 Þ ¼ 0:151489 for current smoking. These match exactly those able to be calculated using standard formula for the standard error of the logarithm of risk and of relative risk (Katz et al., 1978). For a model with a single dichotomised covariate model, the standard errors are unchanged if the expected information matrix is used in place of the observed information, and the coefficients and standard errors are identical to those produced by fitting log binomial models to the outcome categories separately (data not shown).
To adjust for age at follow-up, the continuous covariate age was added to the log multinomial model. The results are shown in the lower panel of Table 7. Doing so slightly reduced the estimated log relative risk of being a former smoker for having an ambivalent attitude to smoking in childhood, and slightly increased the estimated log relative risk of being a current smoker. This was as expected because the mean age at follow-up of subjects with an ambivalent attitude (28.9 years) was higher than the mean age of those who regarded non-smoking as very important (28.8 years), but only slightly.

Discussion
In this paper, we proposed an extension of the log binomial model to the multinomial outcome setting and investigated the feasibility of fitting the model. The performance of the model was assessed by a simulation-based comparison of the estimators of its coefficients and standard errors with those of two alternative approaches. The first of these was an extension to multinomial outcomes of the expanded   Schouten et al. (1993) for binary data. The second involved separate fits of a Poisson regression model to all but one of the outcome categories. Finally, we investigated a real data example to verify the veracity of the model estimates.
While currently there is no standard software to fit the model, specifying the multinomial likelihood and fitting the model by modified Newton-Raphson optimisation of the log likelihood was straightforward with the ml command in the Stata software package (Gould, Pitbaldo and Sribney, 2003). The real data example demonstrated that the log multinomial model produced coefficient and standard error estimates that matched exactly those calculated from the summary data using standard formulae. Furthermore, as should be the case in the case of a single dichotomous covariate, they matched exactly those from fitting binary log binomial models to the relevant outcome categories. The results of adjusting for a continuous covariate were as expected.
The simulations of log multinomial data demonstrated the occurrence of inadmissible solutions with this model, but the two alternative methods produced out-of-bounds solutions with similar frequency. Even when a log multinomial model was successfully fitted, the alternative models fitted to the same data produced inadmissible values for the log multinomial model on up to 28 percent of fits by the expanded data multinomial logistic regression model and up to 24 percent of fits by the separate Poisson regression models.
Of the three estimators, the log multinomial estimator provided slope estimates with the smallest average percent relative bias and smallest average percent mean squared error in the majority of admissible samples from the data settings we investigated. The expanded data multinomial logistic estimator had the poorest performance in these respects, but the differences were really quite small. Thus whilst the log multinomial regression model had slightly superior performance on these criteria, the expanded data multinomial logistic estimator and -in particular -the separate Poisson regression estimator each provides a close approximation to it. Only in one setting did average bias of a coefficient estimated by a log multinomial regression model exceed 5 percent, but the distribution of bias was right-skewed and median bias was just one percent. The large values in its right-skewed distribution arose from appropriate fits of the log multinomial regression model to datasets that were extreme but reflected the inherent variability in the distribution of log multinomial data for that setting.
In the inadmissible samples, the bias and mean squared error of one or more of the slope estimates was unacceptably high in three of the settings we investigated. These were settings with models having two covariates, which were the most realistic we investigated. This cautions against drawing inferences from models with inadmissible solutions.
For each estimator, the confidence interval coverage of the slope parameters straddled 95 percent in the admissible samples. The variance estimator was an information sandwich estimator based on the observed information matrix. We could have used instead the non-robust version of it, or substituted the expected value of the observed information matrix when calculating either. In comparisons of results for these four variance estimators (data not reported), coverage was more often closer to 95 percent when calculated with the robust estimates than when calculated with the non-robust estimates, and when based on the observed information matrix than when based on the expected information matrix. Whilst these differences were slight, they suggest that -at least in samples of this size and for simple models of the types we considered -the robust estimates based on the observed information matrix have superior sampling properties.
There are several limitations of our research that are worth keeping in mind. Firstly, we focused on outcomes with three categories, but the model extends readily to nominal outcome data of higher dimensions. Secondly, our results are based primarily on data simulations, and we did not simulate multivariable models that are as complicated as those typically encountered in practice. Thirdly, we used results based on maximum likelihood theory to compute Wald-type confidence intervals. The log multinomial model places a theoretical constraint on estimators that the linear predictor must be negative for all subjects in the sample, and hence assumptions about the parameter space required for distributional theory do not hold. Finally, we have not addressed the issue of data having a solution that lies on the boundary of the permissible parameter space. In the binary context, Deddens and colleagues (Deddens, Petersen and Lei, 2003;Deddens and Petersen, 2004) suggest estimating the model on an expanded dataset formed by combining (c À 1) copies of the original data with one version of the data that has the values reversed for the binary outcome indicator. They refer to this as the COPY algorithm. For their data example, our method produces a convergent, admissible solution that is a close approximation to the maximum likelihood solution (Blizzard and Hosmer, 2007). Lumley, Kronmal and Ma (2006) reviewed this and other approaches to the problem, and recommended an alternative solution. We verified that the COPY method can be extended with the desired result to simple examples of trinomial outcome data but did not investigate further.
In summary, we feel that the log multinomial model does offer a practical solution to the problem of obtaining adjusted estimates of the risk ratio in the multinomial setting. Particularly when the sum of the fitted probabilities is high, however, the model must be used with some care and attention to detail.