Item response theory analysis of the Amyotrophic Lateral Sclerosis Functional Rating Scale-Revised in the Pooled Resource Open-Access ALS Clinical Trials Database.

Abstract Our objective was to examine dimensionality and item-level performance of the Amyotrophic Lateral Sclerosis Functional Rating Scale-Revised (ALSFRS-R) across time using classical and modern test theory approaches. Confirmatory factor analysis (CFA) and Item Response Theory (IRT) analyses were conducted using data from patients with amyotrophic lateral sclerosis (ALS) in the Pooled Resource Open-Access ALS Clinical Trials (PRO-ACT) database with complete ALSFRS-R data (n = 888) at three time-points (Time 0, Time 1 (6 months), Time 2 (1 year)). In this population of 888 patients, mean age was 54.6 years, 64.4% were male, and 93.7% were Caucasian. The CFA supported a four individual-domain structure (bulbar, gross motor, fine motor, and respiratory domains). IRT analysis within each domain revealed misfitting items and overlapping item response category thresholds at all time-points, particularly in the gross motor and respiratory domain items. These results indicate that many ALSFRS-R items may sub-optimally distinguish among the levels of disability assessed by each domain, particularly in patients with less severe disability. Measure performance improved across time as patient disability severity increased. In conclusion, modifications to select ALSFRS-R items may improve the instrument's specificity to disability level and sensitivity to treatment effects.


Introduction
In clinical trials of amyotrophic lateral sclerosis (ALS), commonly used endpoints include assessments of muscle strength (1), respiratory function (2) and mortality (3). However, these assessments can be time-consuming and can require patient cooperation and specialized equipment to administer. The Amyotrophic Lateral Sclerosis Functional Rating Scale (ALSFRS) (4) and its revised form (ALSFRS-R) (5) were developed as functional outcome assessments to evaluate clinical aspects of ALS, including bulbar, fine motor, and gross motor function, and respiratory disability. The ALSFRS-R has been, and continues to be, used as a primary or secondary outcome measure in multiple ALS clinical trials (6-11). Psychometric assessments of the ALSFRS-R using classical test theory methods have provided support for the measure's acceptable internal consistency, reliability, construct validity, and responsiveness (5,12-14). Change in the ALSFRS-R is longitudinally associated with change in other functional measures such as assessments of muscle strength and Schwab and England scores (4,12).
Modern psychometric methods such as Rasch (15) and Graded Response Model (GRM) (16), both forms of Item Response Theory (IRT) (17), can provide valuable information beyond what can be learned using classical test theory methods. For example, these techniques provide information on the scaling metrics and item-level performance of an instrument, as well as specific characteristics of the scale (e.g. response options) that could potentially be modified to provide better measurement.
Using Rasch analysis in cross-sectional data, Franchignoni et al. (18) found substantial misfit in many ALSFRS-R items, including overlapping response options and low person-to-item fit. A subsequent analysis by Franchignoni et al. similarly demonstrated the need for refinement of the ALSFRS-R (19). However, GRM analysis has not been applied to the ALSFRS-R. The development of the longitudinal Pooled Resource Open-Access ALS Clinical Trials (PRO-ACT) Database (20) provides access to large quantities of ALS patient data from clinical trials, many of which included the ALSFRS-R as a study endpoint. The objective of this study was to conduct a secondary analysis of the PRO-ACT Database to investigate the psychometric properties of the ALSFRS-R over time using both Rasch and GRM approaches, including assessments of instrument dimensionality, item fit, item response ordering, and differential item functioning (DIF).

Data source
Data used in the preparation of this article were obtained from the PRO-ACT Database (20). In 2011, Prize4Life, in collaboration with the Northeast ALS Consortium, and with funding from the ALS Therapy Alliance, formed the PRO-ACT Consortium. The data available in the PRO-ACT Database were volunteered by PRO-ACT Consortium members and contain more than 8500 fully de-identified records from ALS patients who participated in 17 publicly and privately conducted international ALS clinical trials. Both placebo and treatment arm patients are included in the database.
The ALSFRS-R properties were evaluated using data at three time-points. Time 0 included patients at their first assessment in the trial. Time 1 included all patients who had a follow-up assessment and were administered the ALSFRS-R approximately six months (±45 days) from Time 0. Time 2 included all patients who had a follow-up assessment and were administered the ALSFRS-R approximately six months (±45 days) after Time 1. To capture variability in trial follow-up assessment schedules, a 90-day (±45 days) window was applied at each six-month follow-up. The primary analytic population included all patients in the PRO-ACT database who completed an assessment using the ALSFRS-R at all three time-points, designated as 'completers'; patients without data at all three time-points were designated as 'non-completers'.
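The window logic above can be sketched as a small helper. This is an illustrative reconstruction, not code from the study: the function name, target offsets (183 and 366 days as approximate 6- and 12-month anchors), and dates are all assumptions.

```python
from datetime import date
from typing import Optional

def assign_timepoint(baseline: date, visit: date, window_days: int = 45) -> Optional[str]:
    """Assign a follow-up visit to Time 1 (~6 months) or Time 2 (~12 months
    from baseline) using a +/-45-day window around each target date.
    Hypothetical helper; the contributing trials followed their own schedules."""
    targets = {"Time 1": 183, "Time 2": 366}  # approximate 6 and 12 months, in days
    delta = (visit - baseline).days
    for label, target in targets.items():
        if abs(delta - target) <= window_days:
            return label
    return None  # visit falls outside both windows

# A visit 200 days after baseline lands inside the Time 1 window (|200 - 183| <= 45)
print(assign_timepoint(date(2010, 1, 1), date(2010, 7, 20)))  # Time 1
```

Patients with an assigned assessment at Time 0, Time 1, and Time 2 would then form the 'completer' population.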

ALSFRS-R
The ALSFRS-R is composed of 12 items used to assess daily functioning in four domains (three items per domain): bulbar, gross motor, fine motor, and respiratory function. Each item has five question-specific response options ranging from 0 (complete dependence) to 4 (normal function), resulting in a total sum score ranging from 0 to 48. Although designed as a clinician rating scale, the ALSFRS-R instructions have been adapted to allow patients to respond and caregivers to serve as proxy-respondents (21). Responses from all three sources were included in the database.

Statistical analysis
Descriptive statistics. Descriptive statistics were used to describe the clinical and sociodemographic characteristics of all patients at Time 0, by completer and non-completer status. Independent samples t-tests and χ2 tests were performed to determine if completers and non-completers differed systematically on any baseline characteristics. Descriptive statistics were calculated for all items of the ALSFRS-R, including mean (SD), median, range, and percentage of items with responses at the floor and ceiling (defined as 25% or greater responses of 0 or 4, respectively).
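The floor/ceiling rule stated above can be sketched as follows; the data and function name are hypothetical, not taken from PRO-ACT.

```python
from collections import Counter

def floor_ceiling(responses, threshold=0.25):
    """Flag an item as showing a floor effect if >= 25% of responses are 0,
    or a ceiling effect if >= 25% of responses are 4 (the rule described above)."""
    counts = Counter(responses)
    n = len(responses)
    return {
        "floor": counts.get(0, 0) / n >= threshold,
        "ceiling": counts.get(4, 0) / n >= threshold,
    }

# Hypothetical single-item responses: 6 of 10 patients score 4 (normal function),
# so the item is flagged for a ceiling effect but not a floor effect.
item_scores = [4, 4, 4, 3, 2, 4, 1, 4, 3, 4]
print(floor_ceiling(item_scores))  # {'floor': False, 'ceiling': True}
```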
Confirmatory factor analysis. Confirmatory factor analysis (CFA) was used to assess fit of the ALSFRS-R as a unidimensional single scale, as well as the fit of the four ALSFRS-R sub-domains individually. All CFAs were conducted using the weighted least squares mean and variance adjusted (WLSMV) estimator in Mplus version 7.11 (22). Model fit was assessed with the comparative fit index (CFI), root mean square error of approximation (RMSEA), and weighted root mean square residual (WRMR). A CFI > 0.90, RMSEA < 0.07, and WRMR close to 1 were all used to indicate acceptable model fit (23,24).

Rasch analysis and Graded Response Model
The psychometric properties of all ALSFRS-R domains were assessed separately using RUMM2030 (25) for the Rasch (15) modeling and IRTPRO version 2.1 (26) for the GRM (16) at each time-point. Differences between the Rasch model and GRM have been described in detail elsewhere (27) and are briefly described here.
The GRM estimates the probability of correct response (i.e. ability level of the patient accurately corresponding to ability as assessed by the ALSFRS-R) on an item in terms of item difficulty and discrimination. Item difficulty describes the trait level required for a patient to have a 0.50 probability of endorsing a particular response category and higher response categories. Item discrimination indicates the strength of an item score in relation to the latent construct measured by the whole instrument. Thus, items with high discrimination are better able to distinguish between individuals at the response difficulty level.
Through the GRM analysis, item slope values were examined to assess item discrimination. Higher values are associated with items that are better able to discriminate between adjacent trait levels, and provide greater information about a patient than less discriminating items. However, slope parameters > 4.0 generally indicate an item is too discriminating, pointing to a high level of dependency between the item score and latent construct or between two item scores (28). In the Rasch model, the discrimination parameter is fixed (does not vary between items) as it is assumed the relationship to the construct is constant for all items. Both approaches are appropriate for ordered categorical item responses as in the ALSFRS-R, but may yield different results in terms of fit and ordering latent scores.
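The GRM machinery described above can be made concrete with a short sketch of Samejima's graded response model, where the cumulative probability of scoring in category k or higher is P*(k) = 1 / (1 + exp(-a(θ - b_k))), and adjacent differences give each category's probability. The slope and threshold values below are hypothetical, chosen only for illustration.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Category response probabilities under the graded response model:
    the logistic term gives P(score >= k); differencing adjacent cumulative
    probabilities yields the probability of each specific category."""
    # P(score >= 0) is 1 by definition; P(score > highest category) is 0.
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum.append(0.0)
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]

# Hypothetical 5-category item (scored 0-4) with slope a = 2.0 and ordered
# thresholds for categories 1-4; evaluated at an average trait level (theta = 0).
probs = grm_category_probs(theta=0.0, a=2.0, thresholds=[-2.0, -0.5, 0.5, 2.0])
print([round(p, 3) for p in probs])  # five probabilities summing to 1
```

With symmetric thresholds, the middle category is most probable at theta = 0; raising the slope a sharpens the curves, which is why very large slopes (> 4.0) suggest near-redundancy with the latent construct.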
For the Rasch model, the ordering of item thresholds is used to determine whether patients with higher levels of the measured attribute (e.g. disability due to ALS) consistently endorse higher scored response options, and vice versa. When this does not happen, disordered thresholds occur, for a variety of reasons: the item or response phrasing is confusing; the instructions provided to the patient are unclear; or the item responses as stated are indistinguishable from one another in terms of severity. Threshold ordering was assessed by examining category threshold estimates and order, and item characteristic curves (ICCs). In particular, the range of disability for which a particular score (response option) is more probable than the other response options was examined; if that range was narrow, or non-existent, the validity of the response category was questioned.
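The disordered-threshold check reduces to asking whether estimated category thresholds increase monotonically. A minimal sketch, with hypothetical threshold estimates (under a Rasch partial-credit-style parameterization, where estimated thresholds can come out disordered):

```python
def thresholds_ordered(thresholds):
    """Return True if category thresholds increase strictly monotonically;
    a violation is the 'disordered thresholds' pattern described above."""
    return all(b1 < b2 for b1, b2 in zip(thresholds, thresholds[1:]))

print(thresholds_ordered([-2.1, -0.4, 0.6, 1.9]))  # True: categories ordered
print(thresholds_ordered([-1.5, 0.8, 0.2, 2.0]))   # False: middle categories disordered
```

A disordered pair means there is no interval of the latent trait on which the intermediate category is the most probable response, which is exactly the narrow-or-non-existent range flagged in the ICC inspection above.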
The fit of each item to model requirements was also assessed using both Rasch and GRM. Item fit under the Rasch model was assessed using a χ2 statistic and the fit residual (misfit indicated when the χ2 p-value < 0.001). The fit residual considers the fit of the data in the population (observed data) to the Rasch model. A large negative fit residual value (< -3.0) demonstrates that the information provided by the item does not add additional value to the measurement. A large positive fit residual value (> 3.0) demonstrates that the item is under-fitting, indicating that the item is not detecting differences in severity. For the GRM analysis, item fit was similarly assessed using the S-X2 fit statistic (misfit indicated when p < 0.001) (29), which assesses the difference between observed and model-based predicted values.
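The Rasch-side decision rules above can be expressed as a small flagging helper. This is an illustrative restatement of the cut-offs described in the text (p < 0.001; fit residual beyond ±3.0), not part of the RUMM2030 or IRTPRO output.

```python
def flag_item_fit(fit_residual, chi2_p, alpha=0.001):
    """Apply the item-fit rules described above: a chi-square p-value below
    alpha indicates misfit; a fit residual < -3.0 suggests the item adds no
    additional information (over-fit); > +3.0 suggests under-fitting."""
    flags = []
    if chi2_p < alpha:
        flags.append("chi2 misfit")
    if fit_residual < -3.0:
        flags.append("over-fit (redundant)")
    elif fit_residual > 3.0:
        flags.append("under-fit")
    return flags or ["acceptable"]

# Hypothetical item with a significant chi-square and a large negative residual
print(flag_item_fit(fit_residual=-3.8, chi2_p=0.0004))  # ['chi2 misfit', 'over-fit (redundant)']
```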
Finally, DIF of each of the ALSFRS-R items, separately by domain, was assessed across sub-groups created using site of symptom onset (bulbar versus limb) using a total χ2 test. An assessment of DIF is needed to examine the stability of item calibration across different patient populations (28). A significant DIF χ2 test indicates that the response probabilities differ by type of symptom onset.

Sample characteristics
A total of 1736 patients had ALSFRS-R data. Of these, 1730 (99.6%) patients had an ALSFRS-R assessment at Time 0, and 888 (51.1%) had an ALSFRS-R assessment at all three time-points. Baseline demographic and clinical characteristics were calculated for all patients with complete ALSFRS-R data at Time 0, and stratified by completer status (Table I). The majority of patients were male (63.4%) and Caucasian (93.2%), with a mean age of 55.2 years. Baseline characteristics were similar between groups. All further analyses report the results of the completer group at all three time-points to enable a comparison of instrument characteristics across time. Parallel analyses were conducted with the non-completer group, with no substantive differences in findings (results not reported).

ALSFRS-R descriptive analyses
Descriptive statistics for the ALSFRS-R were calculated at all time-points, with Time 0 and Time 2 scores presented in Tables IIa and IIb. At all time-points, scale scores spanned the entire range. At Time 0, mean scores ranged from 2.7 (moderate difficulty) to 4.0 (performs without difficulty). No items displayed floor effects; however, 10 of 12 items displayed ceiling effects (indicating a high percentage of patients with normal function). At Time 2, mean scores on every item were lower, indicating increased disability in patients at approximately 12 months following Time 0. At Time 2, floor effects were observed for one item, and only six items displayed ceiling effects, consistent with increasing disability over time.

ALSFRS-R dimensionality
Confirmatory factor analysis of the total scale score of the ALSFRS-R as one dimension demonstrated poor fit to the data at all time-points (CFI range = 0.75-0.86; RMSEA range = 0.28-0.33; WRMR = 0.29-0.31). However, CFA analyses of each of the four domains individually demonstrated acceptable fit at all time-points (CFI = 1.00; RMSEA = 0.00; WRMR = 0.00, for all domains); note that a one-factor model with only three indicators is just-identified, so perfect fit indices are expected by construction. Standardized coefficients for item loadings within each factor ranged from 0.68 to 0.99.

Rasch and GRM analysis
The Rasch model fit within each domain was poor. Items within each of the four domains demonstrated disordered thresholds at all time-points, and fit indices indicated item misfit. It was clear that the data did not fit the modeling assumptions of Rasch measurement, thus the results of this analysis are not presented (available upon request). Items modeled with the GRM performed better in terms of fit than when they were modeled with the Rasch model. However, select items within each domain still demonstrated poor fit (Tables IIIa, b) and DIF (Table IV). The results presented below are of the GRM analyses only.
Bulbar domain. Figure 1 shows the ICCs for the bulbar domain. An examination of the ICCs for Item 2 ('salivation') indicated that response category 1 ('marked excess of saliva with some drooling') did not distinguish well between patients in this and either of the adjacent score categories ('marked drooling' and 'moderately excessive saliva').
Fine motor domain. Figure 2 shows the ICCs for the fine motor domain. At all time-points the slope value for Item 5 ('cutting food and handling utensils') was large (~10), far exceeding the threshold for acceptable item discrimination. This finding indicates that Item 5 is potentially redundant with the latent construct of disability severity being measured by the ALSFRS-R and may be over-discriminating between individuals with different levels of severity as assessed by this item. All items displayed acceptable fit, with the exception of Item 6 ('dressing and hygiene') at Time 0. Item 5 also displayed DIF at all time-points, while Items 4 and 6 displayed DIF only at Time 0 and Time 1, respectively.
Gross motor domain. Item 7 ('turning in bed and adjusting bed clothes') response category 0 ('helpless') at Time 0 was not distinct from response category 1 ('can initiate but cannot adjust sheets alone'). Item 9 ('climbing stairs') was the most problematic, as response category 2 ('mild unsteadiness or fatigue') was not distinct from response categories 1 and 3. While S-X2 values were acceptable for all items at all time-points, the slope values for Items 8 and 9 exceeded the recommended discrimination threshold. Items 7 and 8 showed DIF at Time 0, but no items displayed DIF at Time 2.
Respiratory domain. Figure 4 shows the ICCs for the respiratory domain. The fit was very poor for Items R2 and R3 at Time 0, and the lowest threshold could not be estimated because of the lack of responses in that score category (Tables IIIa, b). An examination of the ICCs supported this finding at Time 0.

Discussion
Using the large, publicly accessible PRO-ACT Database, the current study builds upon previous work (5,12-14,18,19) to provide a more complete understanding of the scaling metrics of the ALSFRS-R. The IRT analyses conducted contribute valuable information to indicate which items and response categories of the ALSFRS-R are performing well, as defined by an IRT model, and which could be revised, in an effort to improve assessment of disability in ALS patients.
Dimensionality assessments in the current study further supported the finding that the ALSFRS-R is better treated as an instrument assessing four separate constructs (4,18,30). GRM analyses demonstrated that multiple item response categories within the bulbar, gross motor, and respiratory domains did not adequately distinguish between the varying levels of disability represented by each category; items within the respiratory domain were particularly problematic. The current analysis is unique in that ALSFRS-R performance was assessed at multiple time-points. Item fit and parameter performance were better at Time 2 than Time 0, perhaps suggesting the instrument performs better over the course of disease progression. In the respiratory domain, for example, changes were evident in the ICCs for the R1 and R2 respiratory items, indicating that response options for these items are more appropriate for assessing respiratory function in a more severe patient population. With the exception of Item 9 ('climbing stairs'), threshold spacing and individual item fit for all items showed varying levels of improvement over time.
Findings from this analysis can be used to consider revisions to the ALSFRS-R. In its current form, the ALSFRS-R may not be optimized to support sensitive assessment of treatment effects across time in a clinical trial population, as change in some response levels may not accurately correspond to changing severity. Some items (i.e. gross motor domain) could be modified by clarifying or rewording existing response options. Other items (i.e. respiratory domain) might be improved by presenting additional response options to capture more subtle changes in less disabled patients, allowing for more robust assessment of change, given that considerable ceiling effects were present for all but one item at Time 0. These recommendations are supported by previous conclusions that revising the ALSFRS-R may improve its performance both within clinical trial (18) and non-clinical trial populations (19). However, if items of the ALSFRS-R are modified in any way, a full validation study should be conducted to assess the reliability and validity of the modified measure, with administration following current guidelines.
Potential study limitations should be noted. First, the ALSFRS-R was administered by clinicians or caregivers, or self-administered by patients; however, the PRO-ACT Database did not contain a variable to indicate the reporter (e.g. patient, caregiver, investigator). Instructions used for administering the ALSFRS-R were also not recorded in the database (31). Thus, the impact of mode of administration and instructions for administration could not be analytically examined or controlled. Inconsistent administration and rater variability could reduce measurement reliability, independent of patient disability severity. Consistent administration could be promoted by developing and implementing standard examples to guide individuals administering or responding to the scale. Other limitations of the PRO-ACT database include the omission of variables for disease duration and clinical site. In addition, analyses were restricted to patients who had ALSFRS-R data at all three time-points; the analysis is therefore subject to attrition bias and exclusion of the most severe patients, as data would be missing for patients who died during the period analyzed. These findings may thus not be representative of instrument performance in the most severe patients, nor of patients in a community-based population, as all patients within this sample were from clinical trials. In addition, the windows chosen to represent Time 1 and Time 2 were broad (±45 days), an interval over which change may occur, even within a single patient. The broad visit windows may have accounted for some of the overlapping scaling observed within items. However, a descriptive assessment using a ±15-day window versus the ±45-day window demonstrated that very few patients fell outside the ±15-day window, and the mean span of time over which the ALSFRS-R assessment was conducted was almost identical under either window.
Finally, it should be noted that different analytical techniques (IRT versus CFA, for example) may lead to different conclusions regarding item performance and scaling metrics in the same population (32).
Findings from the current analysis indicate the ALSFRS-R may benefit from revision for use in a clinical trial population; however, the performance of the measure did improve over time as the study population progressed, suggesting a need to focus on improving response options for patients with milder disability. Modification of response options may increase sensitivity of the instrument across the spectrum of ALS progression, potentially improving the likelihood of detecting true treatment effects. In addition, implementation of standard administration practices should decrease measurement variability arising from poor inter-rater reliability. Such revisions, together with established standards for training evaluators on administration of the instrument, may increase the ability of the measure to accurately detect change at all disability levels. Any revision to the ALSFRS-R or establishment of administration standards must adhere to current guidelines and best practices (33).

Acknowledgement
Funding for this work was provided by Biogen.
Data used in the preparation of this article were obtained from the Pooled Resource Open-Access ALS Clinical Trials (PRO-ACT) Database. As such, the following organizations and individuals within the PRO-ACT Consortium contributed to the design and implementation of the PRO-ACT Database and/or provided data, but did not participate in the analysis of the data or the writing of this report: Neurological Clinical Research Institute, MGH; Northeast ALS Consortium; Novartis; Prize4Life; Regeneron Pharmaceuticals, Inc.; Sanofi; Teva Pharmaceutical Industries, Ltd.

Declaration of interest
KSC, EDB, DS are employed by Evidera, which provides consulting and other research services to pharmaceutical, device, government and nongovernment organizations. As Evidera employees, they work with a variety of companies and organizations and are expressly prohibited from receiving any payment or honoraria directly from these organizations for services rendered. JC, SB and LAW are employed by Biogen. KSC, EDB, DS were paid consultants to Biogen in connection with the development of this manuscript.