Psychometric validation of the Danish version of the Major Depression Inventory using data from the Lolland-Falster health study (LOFUS)

Abstract Purpose The Major Depression Inventory (MDI) is a widely used self-rating depression scale, commonly used in primary care in Denmark. It has not been subject to robust psychometric validation in a general population setting. The aim of this study was to evaluate the psychometric measurement properties of the MDI when applied in the general population. Methods We evaluated statistical psychometric validity using modern test theory (confirmatory factor analysis, item response theory models and Rasch measurement theory), testing local independence and differential item functioning across groups defined by gender, age, education, and chronic disease status. Separate analyses across different strata and across different statistical models were employed. Results Regarding structural validity, we consistently identified local dependence for the two item pairs (MDI2,MDI3) and (MDI4,MDI5) across strata. This result was confirmed by bifactor CFA models and item screening. We further identified substantial differential item functioning with respect to age group and with respect to chronic disease, and we quantified the magnitude of this lack of measurement invariance. Conclusion The MDI is psychometrically valid in homogeneous subpopulations, but the disclosed evidence of local dependence means that published estimates of its reliability cannot be trusted. The lack of measurement invariance means that the instrument cannot be used to compare individuals or groups unless they are similar in terms of age group and chronic disease status.


Introduction
During the past two to three decades, Patient Reported Outcome Measures (PROMs) have been widely accepted as important patient-relevant outcomes in the field of health research. In some medical specialties, e.g. orthopaedics, psychiatry, general practice and public health, PROMs are even essential outcomes [1][2][3]. In these specialties, cause-specific or all-cause mortality is seldom sufficient to cover all relevant outcomes in intervention studies, prognostic studies or cross-sectional studies. This is because outcomes such as patients' somatic and mental health may be a more relevant target for an intervention than death, or at least just as important. Patients' somatic and mental health are often measured as, e.g., level of daily activity, functionality, impairment, illness, mental status and quality of life. However, such constructs are not always as easily measured as, for example, numbers of diagnoses and mortality. This means that the question about the validity of PROMs must be addressed: does the PROM actually measure what it claims to measure?
To answer questions about a PROM's measurement adequacy, several measurement properties can be explored, with validity as the first and most important area of exploration. Validity is the extent to which the MDI measures depression. Our focus is on statistical validity, operationalized as evaluation of fit of psychometric models, but theoretically several types of validity exist that address different aspects. These include: (i) content validity: the extent to which the content of the MDI represents the entire domain it purports to measure, (ii) criterion validity: the extent to which the MDI is associated with or predicts an external criterion variable, (iii) construct validity: the degree to which the MDI measures an abstract trait or construct (e.g. intelligence, motivation) that cannot be directly observed. For a more thorough discussion see [4]. Content validity is the most important measurement property because items need to be relevant and comprehensible with respect to the construct of interest in the target population. The statistical validity tested here does not test the items themselves, but rather examines whether the scoring algorithm is reasonable. If the MDI is construct valid, but the score does not adequately reflect the state of respondents [5], this will be manifested as misfit or the disclosure of statistical anomalies in the validation. Most PROMs encompass one or more scales or domains that are operationalized using scalar-valued scores, corresponding to implicit assumptions of unidimensionality. These should be tested using modern test theory (MTT) models like confirmatory factor analysis (CFA; [6]), item response theory (IRT; [7]), or Rasch measurement theory (RMT; [8]). A related concept is the absence of local dependence (LD; [9]). This is the assumption that observed items are conditionally independent given the value of the underlying latent variable, and it is an underlying assumption in all MTT models. It means that the latent variable explains why the observed items are related to one another. However, this assumption is not always met in PROM data, and for this reason it must be tested as part of a psychometric validation. Finally, items in a scale might possess differential item functioning (DIF; [10]) if they do not function equally in different groups (e.g. across gender, age group, disease status etc.). It is a necessary part of psychometric validation to test measurement invariance across, e.g., gender groups or age groups.
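The two assumptions described above can be written compactly. The notation is introduced here for illustration only and is not taken from the original text: $X_1, \dots, X_k$ denote the item responses, $\theta$ the latent variable, and $Z$ a covariate such as age group.

```latex
% Local independence: items are conditionally independent given the latent variable
P(X_1 = x_1, \dots, X_k = x_k \mid \theta) = \prod_{i=1}^{k} P(X_i = x_i \mid \theta)

% No DIF for item i with respect to the covariate Z
P(X_i = x_i \mid \theta, Z = z) = P(X_i = x_i \mid \theta) \quad \text{for all } z
```

Local dependence corresponds to a violation of the first equality for some pair of items; DIF corresponds to a violation of the second for some item and covariate.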
The Major Depression Inventory (MDI) is a self-rating depression scale commonly used in primary care in Denmark. It was developed in the late 1990s to detect major depression [11]. It has been used in clinical settings as a diagnostic tool and as an outcome measure in research projects [12]. It has been validated against other instruments [13][14][15][16], in different patient groups [11,17,18], and by item response theory analyses in a variety of patients [16,19,20,21]. Studies of general population data using the MDI exist [22][23][24][25][26][27], but only the two most recent of these ([26,27]) address psychometric validation using state-of-the-art methodology.
For the item pair 8a (Have you felt very restless?) and 8b (Have you felt subdued or slowed down?) only the highest response is used. The same is true for the item pair 10a and 10b, where the items address reduced and increased appetite, respectively. The items are based on the ICD-10 diagnostic criteria for depressive disorder [28]. This guarantees content validity. The scoring is a straightforward procedure where, after exclusion of the lowest score on the item pair addressing increased/decreased restlessness and on the item pair addressing increased/decreased appetite, the total sum score ranges from zero to 50. Values below 20 are interpreted as showing that depression does not exist or that its existence is doubtful, values from 21-25 indicate mild symptoms of depression, 26-30 moderate symptoms of depression, and 31-50 severe symptoms of depression [14]. The instrument is designed in such a way that it can also be applied diagnostically by assessing the number of core and accompanying symptoms present, in accordance with ICD-10 and DSM-IV. A diagnosis of depression cannot be made from a sum score of the MDI (or any other PROM).
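The scoring rule described above can be sketched in a few lines of Python. This is an illustration only, not the instrument's official scoring software. The item labels are hypothetical, and the 0-5 response range is an assumption inferred from the stated 0-50 total over ten scored items. The text leaves a score of exactly 20 unassigned; the sketch places it in the lowest band, which is likewise an assumption.

```python
from typing import Dict

# Hypothetical item labels; each response is assumed to be scored 0-5,
# which is what a ten-item total ranging from 0 to 50 implies.
PAIRED_ITEMS = (("8a", "8b"), ("10a", "10b"))
SINGLE_ITEMS = ("1", "2", "3", "4", "5", "6", "7", "9")


def mdi_total(responses: Dict[str, int]) -> int:
    """Sum the eight single items plus the higher response from each item pair."""
    for key, value in responses.items():
        if not 0 <= value <= 5:
            raise ValueError(f"item {key}: response {value} outside 0-5")
    total = sum(responses[item] for item in SINGLE_ITEMS)
    total += sum(max(responses[a], responses[b]) for a, b in PAIRED_ITEMS)
    return total


def mdi_category(total: int) -> str:
    """Map a total score to the severity bands quoted in the text."""
    if not 0 <= total <= 50:
        raise ValueError("total must lie in 0-50")
    if total <= 20:  # assumption: a score of exactly 20 joins the lowest band
        return "none or doubtful"
    if total <= 25:
        return "mild"
    if total <= 30:
        return "moderate"
    return "severe"
```

As the text stresses, such a band is a description of symptom level and not a diagnosis.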
Two language versions of the instrument have been psychometrically validated in small patient samples [29,30], and one in a population with participants aged 13-24 years in rural Kenya [25]. An analysis using RMT in 263 females diagnosed with chronic widespread pain and referred for rehabilitation revealed problems with the rating scale properties of the MDI [20]. In a general practice setting the MDI was appropriate, but problems with misfit (items 9 (sleep) and 10 (appetite)) were disclosed [16,21], and a need for changing the item scoring was identified [21,27]. Beyond psychometric validation, this may also be important for capturing atypical depression [12]. The authors did not recommend using the MDI scores for screening purposes, but recommended a diagnostic approach counting core and accompanying symptoms. An earlier study in population data used Mokken analysis to evaluate the MDI [24]. The reported Mokken analysis did not evaluate DIF and local response dependence.
The Danish version of the MDI has not been subject to robust psychometric validation in a general population setting. Therefore, the aim of this study was to evaluate the psychometric measurement properties of the MDI when applied in the general population.

Methods
The Lolland-Falster Health Study (LOFUS) is a population survey conducted in a socioeconomically deprived area of Denmark, a 1½-2 h drive south of the capital Copenhagen, in the municipalities of Lolland and Guldborgsund [31]. In the national ranking of all 98 municipalities, these two were ranked the most deprived and the 6th most deprived municipalities in 2020 [32]. In the LOFUS questionnaire all participants aged 18 or older were asked to complete the MDI.
Educational attainment was measured and classified as follows: no post-secondary education if the respondent did not complete any post-secondary education; 1-3 years post-secondary education for vocational or academy/professional graduates of 1-3 years; 3+ post-secondary education for baccalaureate matriculants who completed 3-4 years; and academic for those who completed graduate study of ≥5 years. Regarding self-reported chronic disease status, we classified respondents based on their response to the item 'Do you have any prolonged illness, prolonged aftereffect after injury, handicap or other long-term health-related problem? Long-term means at least six months.' We stratified the respondents according to the four variables gender, age group (below 60; 60 or over), education group (short education; long education), and chronic disease (yes; no), yielding a total of 2^4 = 16 strata. From each of these strata we sampled 200 respondents using simple random sampling.
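The stratified design described above can be illustrated with a short sketch. It is not the authors' code: the field names and category labels are hypothetical, and the seed is arbitrary. It enumerates the 2^4 = 16 strata and draws a simple random sample of 200 without replacement from each.

```python
import itertools
import random
from collections import defaultdict

# Hypothetical field names and labels for the four binary stratification variables.
STRATIFIERS = {
    "gender": ("male", "female"),
    "age_group": ("below 60", "60 or over"),
    "education": ("short", "long"),
    "chronic_disease": ("no", "yes"),
}

ALL_STRATA = list(itertools.product(*STRATIFIERS.values()))  # 2**4 = 16 combinations


def sample_per_stratum(respondents, n_per_stratum=200, seed=0):
    """Draw a simple random sample without replacement from every stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for person in respondents:
        key = tuple(person[name] for name in STRATIFIERS)
        by_stratum[key].append(person)
    samples = {}
    for key in ALL_STRATA:
        members = by_stratum[key]
        if len(members) < n_per_stratum:
            raise ValueError(f"stratum {key} has only {len(members)} respondents")
        samples[key] = rng.sample(members, n_per_stratum)
    return samples
```

Sampling the same number from every stratum gives equally sized, homogeneous subsamples, so that model fit can be compared across strata without differences in statistical power.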
We evaluated structural validity in each of these 16 homogeneous subsamples by performing a psychometric evaluation using MTT (CFA, IRT and RMT), following the recommendations in [33][34][35][36]. Model fit was evaluated for CFA models without LD and for bifactor CFA models incorporating LD. Bifactor models were based on modification indices (MI). The derived models were confirmed using IRT (using bifactor graded response models) and RMT (applying graphical Rasch models [37,38]). The latter of these models incorporate local dependence. For this, item screening [39] was used.
In the instances where a suitable measurement model could be identified, measurement invariance was assessed using multiple groups CFA. We used the approach described by Svetina et al. [40] and evaluated model fit for bifactor CFA models with configural, metric, and scalar invariance. Invariance across gender groups was assessed in the eight strata defined by age group, education group and chronic disease status. Invariance across age groups was assessed in the eight strata defined by gender, education group and chronic disease status. Invariance across education groups was assessed in the eight strata defined by gender, age group and chronic disease status. Invariance across chronic disease status was assessed in the eight strata defined by gender, age group and education.
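The three nested levels of invariance can be summarised as constraints on a factor model. The notation below is an illustrative simplification and not the exact specification used in the analysis: it shows a linear one-factor model with intercepts $\nu$ and loadings $\lambda$ and omits the bifactor terms. For ordinal items, thresholds take the place of the intercepts.

```latex
\begin{align*}
x_{ig} &= \nu_{ig} + \lambda_{ig}\,\eta_g + \varepsilon_{ig}
  && \text{(item } i \text{, group } g\text{)} \\
\text{configural:}\quad & \text{same factor structure in every group; } \nu_{ig},\ \lambda_{ig} \text{ free} \\
\text{metric:}\quad & \lambda_{ig} = \lambda_i \ \text{for all } g \\
\text{scalar:}\quad & \lambda_{ig} = \lambda_i \ \text{and}\ \nu_{ig} = \nu_i \ \text{for all } g
\end{align*}
```

Each level adds equality constraints to the previous one, so a marked loss of fit when moving from one level to the next indicates lack of invariance at that level.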
In a final step, graphical Rasch models [37,38] incorporating LD were used to test for DIF and to derive translation tables where evidence of DIF was found. The test for DIF was performed by combining evidence from all the strata. This resulted in a total of 40 statistical tests (10 items and four DIF variables), and P-values were adjusted using the Benjamini-Hochberg [41] correction to keep the false discovery rate at 5%. Furthermore, in interpreting the evidence of DIF we go beyond evaluation of statistical significance and interpret the magnitude of the change in the total score resulting from the DIF. This quantifies the total impact of DIF on MDI scores. For reporting, we focus on values of the latent variable corresponding to scores of 20, 25, and 30, respectively, in a reference group.
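The Benjamini-Hochberg adjustment mentioned above can be sketched as follows. This is a generic textbook implementation of the step-up procedure, not the software used for the analysis; the function name and interface are chosen for illustration.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return BH-adjusted p-values and rejection flags, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    rejected = [q <= fdr for q in adjusted]
    return adjusted, rejected
```

In the setting of the text, `p_values` would hold the 40 DIF test results (10 items by four covariates), and an item-covariate pair would be flagged when its adjusted value is at most 0.05.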

Results
A total of 44,209 adults (18+) were invited to LOFUS and 16,084 of these participated (response rate 36.4%). Additionally, 53 persons invited before they were 18 are included in the final adult sample. Thus 16,137 respondents were eligible, and we included 12,701 respondents with no missing values in the data relevant to this study. Among men 90.3% had no missing data, while for women 91.6% had no missing data. This constituted a statistically significant difference of 1.4 (95% CI: 0.5 to 2.2) percentage points. The median age for those with no missing data was 59 (IQR: 48 to 68) years, while the median for those with missing data was 58 (IQR: 36 to 72) years (Kruskal-Wallis P-value < 0.0001). The distribution of the MDI score in the 16 strata is illustrated in Figure 2.

Structural validity
We evaluated the fit of a CFA model (Figure 3, panel (a)) and found that model fit was rejected in 15 of the 16 strata (Supplementary Table 1). In nearly all strata modification indices were high, and they consistently identified the item pairs (MDI2, MDI3) (highest MI value in six strata) and (MDI4, MDI5) (highest MI value in three strata). Adding these yielded bifactor CFA models (Figure 3, panel (b)) with better fit (Supplementary Table 2).
This result was confirmed by item screening, where evidence of LD for the item pair (MDI2, MDI3) was found in five strata and evidence of LD for the item pair (MDI4, MDI5) was found in four strata.

Measurement invariance across gender groups
Measurement invariance across gender groups was evaluated using multiple groups bifactor CFA models in the eight strata defined by age group, education and chronic disease status. Substantial evidence of lack of measurement invariance was seen (Supplementary Table 3). There was significant evidence of DIF with respect to gender for the items MDI5, MDI8, and MDI9 (Table 1). The impact of this DIF was unsubstantial (results not shown).

Measurement invariance across age groups
Measurement invariance across age groups in the eight strata defined by gender, education and chronic disease is reported in Supplementary Table 4. Again, evidence of lack of measurement invariance was seen. There was significant evidence of DIF with respect to age group for four items (MDI2, MDI5, MDI6, and MDI9; Table 1). The impact of this DIF was substantial. Figure 4 illustrates that respondents with the same value of depression are assigned quite different MDI scores. For those in the reference group (males with short education, and without chronic disease) the three vertical dashed lines in each of the eight panels indicate latent trait values corresponding to scores of 20, 25, and 30, respectively. In those with values corresponding to a score of 20, those over 60 on average have scores that are 1.1 points lower. The impact of DIF was larger for values of the latent trait corresponding to a score of 25 in the reference group, where those over 60 on average have scores that are 1.7 points lower, while for values of the latent trait corresponding to a score of 30 in the reference group the average difference was 2.1 points.

Measurement invariance across education groups
The evaluation of measurement invariance across education groups used strata defined by gender, age group, and chronic disease status. Again, evaluation of the fit of different multiple groups CFA models indicated lack of measurement invariance (Supplementary Table 5). A single item, MDI9, had DIF with respect to education group (Table 1). The impact of this DIF was unsubstantial (results not shown).

Measurement invariance across chronic disease groups
Measurement invariance across chronic disease groups was evaluated by assessing the fit of different multiple groups bifactor CFA models in the eight strata defined by gender, age group, and education group. Substantial evidence of lack of measurement invariance was seen (Supplementary Table 6). There was significant evidence of DIF with respect to chronic disease group for the items MDI3, MDI5, and MDI9 (Table 1). The impact of this DIF was also noticeable. Figure 5 illustrates that respondents with the same value of depression are assigned different MDI scores.
The impact is smaller than for the age group DIF and manifests itself at lower levels of the depression continuum. Here the reference group consists of males under 60 with short education, and for values of the latent trait corresponding to scores of 20, 25, and 30, respectively, those with chronic disease score 0.6, 0.5 and 0.4 points lower.

Discussion
We evaluated the measurement properties of the MDI using state-of-the-art psychometric validation. Three distinct but related statistical models (CFA models, IRT models, Rasch models) were applied and yielded very similar results. The statistical psychometric validation showed that the structural validity is sound, and this, in combination with the fact that the MDI is based on diagnostic criteria for depressive disorder, shows that it is a valid measurement instrument. In all 16 strata, bifactor CFA models and graphical Rasch models showed good fit after LD was taken into account. The LD disclosed was also found in a recent Danish study using data from general practice [27] and matches items that DSM-IV combines in its criteria and an item that was reclassified as an accompanying item in ICD-11. Thus, our findings do not necessarily identify problematic items; rather, they illustrate difficulties in capturing the latent phenomena. However, a consequence of LD is that estimates of reliability cannot be trusted, and thus published estimates of reliability in population studies using the MDI should be interpreted with caution.
As pointed out by Fried and Nesse [5, p. 5], the assumption in psychometric models that there is no LD, i.e. that underlying latent variables fully explain all correlation between manifest indicators, is rarely met, and ignoring this can substantially bias inferences. The evidence of LD disclosed here confirms this, and the CFA and IRT models that take LD into account illustrate that a (statistically) valid MDI score can still be derived due to the content validity, provided that the scoring makes sense. However, the evidence can also be taken to indicate that network analysis, which goes beyond the assumption that symptoms are manifestations of a common underlying factor, is more reasonable [42].
An earlier study in population data used Mokken analysis and recommended the use of the MDI in population studies [24]. The reported Mokken analysis completely disregarded the concepts of DIF and local response dependence. Here we disclosed strong evidence of DIF and must conclude that comparisons of MDI scores across age groups should be interpreted with caution. This does not invalidate the use of the MDI per se. MDI scores still yield a valid ranking of people of similar age, and change scores still have an interpretation as change within a person. Furthermore, MDI scores for people without chronic disease should not be uncritically compared to MDI scores for people with one or more chronic diseases. It makes sense that the latter group responds differently to some items, but this does not invalidate the use of MDI scores for within-group ranking or evaluation of change. The impact was noticeable for both variables, but was strongest for age group towards the higher end of the depression continuum.
Lack of invariance may occur for a number of reasons. Some items may yield different scores in some groups, e.g. those with chronic disease, for reasons other than depression. Similarly, items may be interpreted differently in atypical depression [12]. Thus, when the item content is fixed due to the requirement of content validity, some DIF will occur. Here we identified lack of invariance and quantified the magnitude of the total impact on scale scores.

Implication for research
Our results show that the MDI is a unidimensional measure in the present LOFUS cohort, where data were collected in a population study. Some will argue that the MDI scoring that generates a single number is inherently invalid, and that single symptoms should be used [5]. Two studies in a Danish working population indicated that single symptoms predict risk of long-term sickness absence among employees who are free of clinical depression, but also that for the MDI score a clear dose-response relationship exists, with the adverse effect of non-clinical depressive symptoms manifesting itself at relatively low scores [43,44]. However, several of the MDI items possess DIF, making comparison across different covariates problematic. These non-invariant measurement properties of the MDI must be explored in future studies because DIF can bias comparisons between sub-groups, especially in non-randomised trials. In future studies using LOFUS MDI data the identified DIF must be taken into account. This can be done by generating translation tables yielding translated scores similar to those reported in Figures 4 and 5. Researchers using the MDI with other data sources may also want to adjust for this source of bias. It should be noted that, due to the nonlinear nature of the DIF effect, simply adjusting for the effect of age or chronic disease in regression models will not remove the bias.
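The idea behind a translation table can be illustrated with a deliberately simplified sketch. It uses a plain partial credit model without LD terms, which is a simplification of the graphical Rasch models used in this study, and all item parameters are hypothetical and not estimated from LOFUS data. For a given score in the reference group the sketch finds the matching latent value and then computes the expected score at that same latent value in a group where some items function differently.

```python
import math


def expected_item_score(theta, thresholds):
    """Expected score of one partial credit item with the given step thresholds."""
    # Cumulative sums of (theta - threshold) are the log-numerators for categories 0..K.
    log_weights = [0.0]
    for delta in thresholds:
        log_weights.append(log_weights[-1] + theta - delta)
    peak = max(log_weights)  # subtract the maximum for numerical stability
    weights = [math.exp(w - peak) for w in log_weights]
    return sum(k * w for k, w in enumerate(weights)) / sum(weights)


def expected_total(theta, items):
    """Expected sum score over all items at latent value theta."""
    return sum(expected_item_score(theta, thresholds) for thresholds in items)


def theta_for_score(score, items, lo=-10.0, hi=10.0):
    """Bisection for the latent value whose expected total equals `score`."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if expected_total(mid, items) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical parameters: 10 items with five steps each (responses 0-5).
reference_items = [[-1.0, -0.5, 0.0, 0.5, 1.0] for _ in range(10)]
# Hypothetical uniform DIF: the first two items are 0.8 logits harder in the focal group.
focal_items = [[d + 0.8 for d in item] if i < 2 else item
               for i, item in enumerate(reference_items)]

# Expected focal-group score at the latent value matching each reference-group score.
translation = {}
for score in (20, 25, 30):
    theta = theta_for_score(score, reference_items)
    translation[score] = expected_total(theta, focal_items)
```

Because the expected score is a nonlinear function of the latent value, the gap between a reference score and its translated counterpart generally varies along the scale, which is why a constant covariate adjustment in a regression model does not remove the bias.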

Implication for practice
Our results reveal that the MDI is not invariant across sub-groups and that its reliability might be lower than previous results indicate. Therefore, we encourage caution when using MDI scores in practice, taking into account the measurement problems we have identified.

Methodological limitations
It is a limitation that we did not pursue evaluation of a network method. Whether single symptoms should be used in place of a scale score [5] cannot be tested with our approach, but we have identified strengths and weaknesses associated with using a scale score.

Figure 1. The items in the MDI.

Figure 2. The distribution of the MDI score in the 16 strata.

Figure 3. The factor structure of the MDI: (a) CFA model, (b) CFA bifactor model with two bifactors B1 and B2.

Figure 4. The impact of the differential item functioning with respect to age groups. Dashed horizontal lines indicate expected scores for respondents with the same value of depression.

Figure 5. The impact of the differential item functioning with respect to chronic disease groups. Dashed horizontal lines indicate expected scores for respondents with the same value of depression.

Table 1. Evidence of differential item functioning. Boldface indicates P-values that are significant after Benjamini-Hochberg correction.