Psychometric properties of the Action Research Arm Test using decision rules for skipping items in hemiparetic patients after stroke: a retrospective study

Abstract
Purpose: Important properties have been studied using the Action Research Arm Test (ARAT) in patients with stroke. However, whether the ARAT subtests constitute a Guttman scale (i.e., items hierarchically ordered according to difficulty) remains unclear. Guttman scales can define decision rules for skipping items in patients with low endurance. This study investigated the psychometric properties of the ARAT when applying decision rules for post-stroke hemiparetic patients.
Methods: A retrospective, single-institution study was conducted between 2020 and 2021. Datasets of 30 patients with stroke-induced hemiparesis were collected from a previous study that employed the ARAT without decision rules, the Fugl-Meyer assessment (FMA), the Box and Block Test (BBT), and the Motor Activity Log (MAL). The ARAT was rescored with decision rules for this study, and inter-rater reliability/agreement, parallel forms reliability, and construct validity were assessed.
Results: Parallel forms reliability (Spearman’s rho) was 0.99 (95% CI, 0.99–0.99) for both raters. The lower 95% CI limits of the sum and individual item scores in the reliability analysis exceeded the planned value (0.8). Construct validity values exceeded the planned value (0.8) for FMA, BBT, and MAL.
Conclusion: Decision rules can be used to skip ARAT items when assessing upper extremity motor function in stroke patients.
Implications for rehabilitation: The Action Research Arm Test with decision rules for skipping items was valid and reliable for measuring upper extremity motor function in hemiparetic patients after stroke. The decision rules may reduce the burden on both patients and evaluators by decreasing the number of Action Research Arm Test items to be administered.


Introduction
Upper extremity motor function has received considerable recent attention in the field of stroke rehabilitation. Clinical trials frequently use the Fugl-Meyer assessment (FMA) and the Action Research Arm Test (ARAT) to assess arm motor function and activity capacity in hemiparetic patients after stroke [1]. Recent large-scale, rigorous clinical trials of post-stroke motor rehabilitation have used the ARAT to determine their primary endpoints [2] because sufficient studies have been conducted on the test's psychometric properties, confirming its suitability for clinical use due to its reliability, validity, responsiveness, and interpretability [3]. Among such studies of high clinical value, the ARAT is used more frequently than the FMA as a primary endpoint tool, which implies that the ARAT could be a more appealing outcome measure in stroke upper limb rehabilitation intervention studies. The ARAT is also preferred because of its scoring system and its content, which relates to upper extremity capacity for everyday life tasks.
If all 19 items of the ARAT are performed, it takes at least 20 min to prepare and administer. Therefore, the ARAT was organized into four subtests, each of which uses a Guttman scale (i.e., the items are ordered hierarchically according to the difficulty of the task) [4]. These subtests are important for routine clinical assessment as well as for patients with low endurance because a patient's performance on a particular item predicts their performance on all easier or all harder items. Therefore, a patient's score can sometimes be assessed by administering only a few items of the ARAT. This reduction of the ARAT subtests is based on the premise that they constitute a Guttman scale; however, insufficient research has examined whether this premise is true. To our knowledge, one study has questioned this assumption. van der Lee et al. showed that, compared to scoring without decision rules, the sum score when using the decision rules proposed by Lyle led to lower scores in 16 patients (25%) and to higher scores in 12 patients (19%) [5]. However, no reports have since addressed this issue. Furthermore, the reduction rate in the number of items performed when the Guttman scale (which provides decision rules for skipping items) is employed during ARAT administration is unclear. Attempts to reduce the burden of ARAT assessment have also received considerable attention in recent years, and several shortened versions have been reported, with a focus on making the test easier to use in clinical practice [6-8]. However, these studies relied on an extreme reduction in the number of items; because each ARAT item may carry a different level of difficulty, a thorough and appropriate assessment is still needed.
The present study aimed to determine the reduction rate resulting from the use of decision rules for skipping items during the ARAT and investigate the test's psychometric properties (parallel forms reliability, inter-rater reliability, and construct validity) by scoring hemiparetic patients in acute-to-chronic stages of stroke with and without the decision rules.

Materials and methods
A retrospective, single-institution study was conducted between October 2020 and June 2021 in Japan. The data were collected in a cross-sectional study conducted at a college hospital between June 2016 and March 2017 [9]. The current study's protocol conformed to the principles of the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines (the completed checklists are provided in the Supplemental online material) and was approved by the institutional human research review boards of the Niigata University of Health and Welfare (provided number, 18503-200904) and Hyogo College of Medicine (provided number, 3631). The study adhered to the approved protocol and was registered in the University Hospital Medical Information Network Clinical Trial Registry in November 2020 as a preinitiation condition (registered number, UMIN000042514). The study received ethical approval from the ethics committee of the Hyogo College of Medicine for the use of an opt-out methodology.

Participants
The 30 recruited participants were adults aged 20 years or older with a physician-confirmed history of ischemic or hemorrhagic stroke and stroke-induced hemiparesis. The key inclusion criterion was the presence of incomplete upper extremity paresis resulting from stroke. The following exclusion criteria were applied: (1) clear signs of dementia; (2) aphasia or mental disorders that may lead to failures in the measurements; (3) excessive pain during medical treatment; and (4) serious uncontrolled medical conditions. Regarding participant characteristics, the following information was collected: age, sex, side of hemiparesis, dominant hand, and presence of sensory impairment and unilateral spatial neglect (Table 1).

Measurement instruments
The Japanese ARAT, developed in accordance with the recommended principles of translation, including double translation (i.e., forward translation, back translation) [10], had been used for the datasets. The previous study trained the raters on standardized administration and scoring procedures. The rater training included confirming the test administration and scoring procedures with 30 stroke patients. The Japanese version of the ARAT score was highly reliable: the intraclass correlation coefficient (ICC) ranged from 0.974 to 0.990 for non-simultaneous evaluation and from 0.994 to 0.998 for simultaneous evaluation [10]. The construct validity of the ARAT was also supported: for motor-related measures such as the FMA, Spearman's rho values were higher than 0.904. By contrast, for the sensation and joint motion/pain domains of the FMA, Spearman's rho values ranged from 0.268 to 0.470.
The ARAT [4,9,11,12] employs an ordinal 4-point scale (0, 1, 2, 3) for each of its 19 items. The sum score ranges from 0 to 57 points. Items are ordered in four subtests, which are thought to represent the main aspects of upper extremity capacity: grasp, grip, pinch, and gross movement. Each subtest (except the Gross subtest) was arranged such that the most difficult item was administered first. The scoring procedures with decision rules were as follows. If a patient received the highest score (3 points) for the first item, no additional items were tested, and the remaining items in the subtest were scored as 3 points. The second item in each subtest was the easiest item. If a patient received the lowest score (0 points) for both the first and second items in a subtest, no additional items were tested, and the remaining items in the subtest were scored as 0 points. If the decision rules were not applied, all items in the subtest were administered, because no specific pattern of difficult/easy tasks has been assumed after the first two items. In the current study, the ARAT was scored using two methods, with and without decision rules for skipping items in the subtests.
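The decision rules described above can be written as a short rescoring routine. The following Python sketch is illustrative only and is not the procedure used in the study: the function name apply_decision_rules and the example scores are hypothetical, the sketch assumes that each subtest's scores are listed in administration order (hardest item first, easiest item second), and it applies the same rule to every subtest, which may not hold for the Gross subtest.

```python
def apply_decision_rules(full_scores):
    """Rescore one ARAT subtest from its fully administered item scores.

    full_scores: item scores (0-3) in administration order, with the
    hardest item first and the easiest item second (assumed ordering).
    Returns (rescored_items, number_of_items_administered).
    """
    n_items = len(full_scores)
    first, second = full_scores[0], full_scores[1]

    # Rule 1: full marks on the hardest item -> skip the rest, score all as 3.
    if first == 3:
        return [3] * n_items, 1

    # Rule 2: zero on both the hardest and the easiest item -> skip the rest,
    # score all as 0.
    if first == 0 and second == 0:
        return [0] * n_items, 2

    # Otherwise every item is administered and scored as observed.
    return list(full_scores), n_items


# Hypothetical six-item subtest: 3 points on the first item but 2 points on
# a later item, so the rule changes the subtest sum from 17 to 18.
rescored, administered = apply_decision_rules([3, 3, 2, 3, 3, 3])
print(sum(rescored), administered)
```

A hypothetical case of this kind shows one way in which a sum score could differ between the two scoring methods while only a single item of the subtest is administered.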
The two scoring methods were used to investigate the reduction rate resulting from using the decision rules as well as parallel forms reliability and inter-rater reliability. Construct validity was investigated using the upper extremity section of the FMA [4,10-12], the Box and Block Test (BBT) [11,12], and the Motor Activity Log (MAL) [13].

FMA
The FMA assesses the movement, sensation, and balance functions of the extremities and trunk [11]. In this study, we used the upper extremity component of the FMA (score range, 0-126 points), which comprises the motor domain (three-level ordinal scale, 33 items, score range, 0-66 points), the sensation domain (three-level ordinal scale, 6 items, score range, 0-12 points), and the passive joint motion/pain domain (three-level ordinal scale, 24 items, score range, 0-48 points). The FMA motor domain has good psychometric properties in patients with stroke [10,12]; therefore, it has also been used as the "gold standard" against which other upper extremity measures of ICF body function and structure are compared.

Box-and-Block Test
The BBT score is the number of wooden blocks (2.5 × 2.5 × 2.5 cm) transported from one compartment of a box to another in one minute [11,12]. The BBT is easy to obtain, learn, and apply and is therefore frequently used in stroke rehabilitation of adults [1].

Motor Activity Log-14
The upper-extremity MAL-14 is a structured interview that elicits information about 14 activities of daily living (ADLs) using an 11-point Likert scale ranging from 0 to 5. Patients are asked to rate how well (quality of movement scale) and how much (amount of use scale) they use their affected upper extremity to accomplish each ADL in their real-world environment [13].

Evaluation process
The original data of the ARAT without decision rules, FMA, BBT, and MAL were collected within a two-day interval. The ARAT datasets used were video data with a prospectively standardized recording procedure. Two occupational therapists (raters A and B) scored the patients' performance in the ARAT videos independently, and each rater was blinded to the scores of the other rater during the evaluation process. They had completed a rater training program for the ARAT administration and scoring procedures in our previous study [9]. In the current study, the decision rules were applied to the ARAT scores that both raters had obtained without the decision rules, yielding ARAT scores with the decision rules. Intragroup (parallel forms reliability) and intergroup (inter-rater reliability) comparisons were performed, and the reduction rate was determined (Figure 1). The raters' scores were averaged to examine the construct validity of the ARAT against the FMA, BBT, and MAL (Figure 1).

Statistical analysis
The inter-rater reliability of the raters' sum scores was assessed using ICCs with a two-way random approach to achieve absolute agreement. Weighted kappa coefficients were used to assess the inter-rater reliability of the individual scores. A Bland-Altman plot (including mean difference and limits of agreement) was used to determine the inter-rater agreement of the sum scores. The parallel forms reliability and construct validity were assessed using Spearman's rank correlation coefficient. Scatterplots and spline curves were used to assess the relationships between the variables. All statistical analyses were performed using the programming language R (version 3.5.1; available at: http://www.r-project.org) [14]. The packages used in R were "irr" for ICC analysis, "BlandAltmanLeh" for the Bland-Altman plot, "vcd" for kappa coefficient analysis, and "DescTools" for correlation analysis.
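To make the ICC for absolute agreement concrete, the following Python sketch computes a two-way random-effects, single-rater, absolute-agreement coefficient from the mean squares of a two-way ANOVA. It is an illustration under stated assumptions, not the study's analysis: the study used the R package "irr", the function name icc_agreement_single is hypothetical, the single-rater form is assumed because the text does not specify it, and the example scores are invented.

```python
def icc_agreement_single(ratings):
    """ICC for absolute agreement, two-way random effects, single rater.

    ratings: list of rows, one row per patient and one column per rater.
    """
    n = len(ratings)     # number of patients
    k = len(ratings[0])  # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares of the two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Absolute agreement: systematic rater differences (ms_cols) count
    # against the coefficient.
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical sum scores: rater B is always one point above rater A.
example = [[1, 2], [2, 3], [3, 4]]
print(round(icc_agreement_single(example), 3))
```

In this invented example the two raters rank the patients identically, yet the constant one-point offset lowers the coefficient to about 0.67, which is why an absolute-agreement ICC is stricter than a consistency-type coefficient.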

Sample size calculation
The sample size was calculated using Bonett's method so that the planning value could be compared with the reliability results in a retrospective power analysis [15]. With an ICC planning value of 0.8 and a desired precision width of 0.15, a sample size of 28 participants was determined.

Results
Table 1 presents the participants' characteristics. Using the decision rules reduced the total number of items (570 items) by approximately 58% for rater A (to 240 items) and 61% for rater B (to 222 items) in the entire group of participants. The decision rules were actually used for 23 patients (≈77%) by rater A and for 22 patients (≈73%) by rater B. The median (first quartile-third quartile) numbers of skipped items for each participant were 10.0 (2.5-14.3) and 9.0 (0.5-12.0) for raters A and B, respectively. Using the decision rules for skipping items resulted in different scores for 2 of the 30 patients scored by rater A (i.e., 54 to 55 and 52 to 54) and 1 of the 30 patients scored by rater B (i.e., 46 to 47).
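The reduction rates can be checked with simple arithmetic. The Python sketch below assumes that the item counts reported in parentheses (240 and 222) are the numbers of items still administered under the decision rules, which is the reading that reproduces the reported percentages; the variable names are illustrative.

```python
N_PATIENTS = 30
N_ITEMS = 19
total_items = N_PATIENTS * N_ITEMS  # 570 item administrations without rules

# Assumption: these counts are the items still administered under the rules.
administered = {"A": 240, "B": 222}

reduction_pct = {
    rater: 100 * (1 - count / total_items)
    for rater, count in administered.items()
}
for rater, pct in reduction_pct.items():
    print(f"rater {rater}: {pct:.1f}% fewer items")
```

Under this assumption the rates round to 58% for rater A and 61% for rater B.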
The parallel forms reliability assessed by Spearman's rho was 0.99 (p value < 0.01; 95% confidence interval [CI]: 0.99-0.99) for both raters. The rho coefficients indicate how closely the ARAT scores with decision rules conform to the scores without them; higher values mean that the scores with decision rules appropriately represent the scores without decision rules. The scatterplots and spline curves for parallel forms reliability are presented in Figure 2; they show visually the extent of conformity between the ARAT scores with and without decision rules. The inter-rater reliability of the sum scores assessed by the ICC was 0.99 (p value < 0.001; 95% CI: 0.99-0.99). The scatterplot and spline curve for inter-rater reliability are presented in Figure 3. The Bland-Altman plot illustrating the inter-rater agreement of the sum scores is shown in Figure 4. Such a plot can highlight outliers, and if one rater consistently gave lower scores than the other, most of the points would lie on one side of the zero line. The mean difference between the raters was 0.80, and the limits of agreement were −2.22 and 3.82. The limits of agreement were calculated as the mean of the differences ± 1.96 standard deviations of the differences; they represent the range within which 95% of the differences between the two raters would be expected to fall. The inter-rater reliability of the individual item scores of the ARAT with decision rules is summarized in Table 2. The construct validity of the ARAT with decision rules is shown in Table 3. The scatterplots and spline curves for construct validity are presented in Figure 5. The validity coefficients and plots show the degree to which two clinical tools measure the same or distinct concepts.
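The limits-of-agreement calculation described above can be sketched in a few lines of Python. This is an illustration rather than the study's analysis, which used the R package "BlandAltmanLeh"; the function name limits_of_agreement and the paired scores are hypothetical, and the sample standard deviation is assumed.

```python
import statistics


def limits_of_agreement(scores_a, scores_b):
    """Return (mean difference, lower limit, upper limit) for paired scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD of the differences (assumed)
    return mean_diff, mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff


# Hypothetical sum scores from two raters for four patients.
rater_a = [10, 20, 30, 40]
rater_b = [9, 21, 29, 38]
mean_diff, lower, upper = limits_of_agreement(rater_a, rater_b)
print(f"{mean_diff:.2f} [{lower:.2f}, {upper:.2f}]")
```

Because the limits are symmetric about the mean difference, their midpoint equals the mean difference; the reported limits of −2.22 and 3.82 are centered on 0.80, which matches the reported mean difference.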
The results of the above reliability coefficients exceeded the planned value of 0.8, indicating that the sample size of 30 was sufficient for this study.

Discussion
This study examined the psychometric properties of the Japanese ARAT by comparing its scoring with and without decision rules for skipping items. The results show that the ARAT administered with decision rules is valid and reliable for measuring upper extremity motor function in post-stroke hemiparetic patients.

Contribution of the decision rules for the ARAT
The use of the decision rules resulted in a reduction rate of more than 50% for both raters. This result agrees with that reported by Lyle, who developed the ARAT [4]. Considering the benefit for patients with low endurance and for routine assessments, reducing the number of items to be administered by more than half is meaningful for clinical use. This reduction will further reduce the mental burden on patients with severe arm paralysis by avoiding unnecessary evaluation of tasks they cannot perform.

Conformity of the ARAT with decision rules
The Spearman's rho values and 95% CIs for the parallel forms reliability were very close to 1.00 for both raters, indicating strong reliability. The relationship between the ARAT scores with and without decision rules is shown in Figure 2, where the spline curves are essentially straight lines and the scatterplots are linear. The scores of the ARAT with decision rules are therefore comparable to the ARAT scores without decision rules.

Psychometric assessment of the ARAT with decision rules
Terwee et al. proposed the following criteria to identify acceptable measurements [16]: (1) ICC or weighted kappa ≥0.70 for inter-rater reliability; (2) minimal important change outside the limits of agreement for inter-rater agreement; and (3) correlation with gold standard ≥0.70 for construct validity. All results of this study met these criteria.
In the reliability analysis, even the lower limit of the 95% CI exceeded the planned value of 0.8 for the sum score and the individual item scores. Furthermore, a linear relationship between the raters is shown by the spline curve and scatterplot in Figure 3. Therefore, the ARAT with decision rules reliably assessed the affected upper extremities in post-stroke hemiparesis patients. This study produced very high reliability coefficients, which may be due to the standardized manual and careful rater training.
Based on clinical experience and estimates reported for similar outcome measures of upper extremity function [17,18], the minimally important change was set as 10% of the full range of the scale. In this study, the limits of agreement show the degree of agreement between the ARAT scores of the two raters using decision rules. The limits of agreement were −2.22 and 3.82 points, which indicates that a score difference of −3.02 to +3.02 points ((|−2.22| + 3.82)/2) could occur by chance. This 3.02-point deviation is approximately 5% of the highest score, which is less than the 10% criterion (5.70 points). Therefore, the ARAT with decision rules may be able to distinguish clinically important changes from measurement errors.
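The comparison between measurement error and the minimally important change can be reproduced from the reported values. The Python sketch below only restates the arithmetic in the text; the variable names are illustrative.

```python
ARAT_MAX = 57                       # full range of the ARAT sum score
loa_lower, loa_upper = -2.22, 3.82  # reported limits of agreement

# Half-width of the limits of agreement: the score difference that could
# occur by chance between the two raters.
half_width = (abs(loa_lower) + loa_upper) / 2

# Minimally important change, set as 10% of the full range of the scale.
mic = 0.10 * ARAT_MAX

measurement_error_below_mic = half_width < mic
print(f"half-width {half_width:.2f}, MIC {mic:.2f}, "
      f"{100 * half_width / ARAT_MAX:.1f}% of the maximum score")
```

The half-width of 3.02 points is about 5% of the maximum score and lies below the 5.70-point criterion.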
The construct validity coefficients exceeded the value of 0.8 (even at the lower limit of the 95% CI) for the FMA upper extremity motor scale (gold standard), BBT, and MAL. By contrast, values below 0.5 were confirmed for the FMA scales of sensation, passive range of motion, and pain. Moreover, the spline curves for these scales were non-linear. These results suggest that these scales express functions different from those measured by the ARAT.

Limitations
This study had several limitations. First, the results of the ARAT scoring in this study are generalizable only within the range of ARAT scores included in the analysis; they may not apply to patients outside this range. Second, this study's good reliability results cannot be expected without careful rater training on the ARAT administration and scoring procedures. Third, the ICC (based on the principles of analysis of variance [ANOVA]) was used for the reliability analysis, although it is less suitable for ordinal data. We employed the ICC because it is the most commonly used reliability parameter and ANOVAs are generally robust against distribution violations. However, a statistical method specifically designed for ordinal data, such as the Svensson method [19], should be used in future studies. Fourth, we assumed that each item contributes equally to the sum score. Item response theory analysis with a larger sample size would be useful to further investigate whether this assumption is correct. Furthermore, we did not control for the severity of motor paresis. If more patients with moderate motor paralysis had been included, the reliability coefficients and reduction rate might have decreased due to the difficulty of differentiating between 2- and 3-point scores for each ARAT item. The difference between 1- and 2-point scores depends largely on the patient's ability to lift the item from the platform, but the difference between 2- and 3-point scores lies in the evaluator's subjective assessment of "great difficulty" in the target performance. Therefore, including more patients with severe or mild arm paralysis may improve the reliability of the ARAT assessment. Little difference was observed among the patients with moderate arm paralysis in our results (Figure 2), but this should be confirmed in future studies with larger samples of such patients.

Conclusion
The present study supports the use of decision rules for skipping ARAT items when testing upper extremity motor function in stroke patients: the rules may reduce the burden on patients and evaluators by decreasing the number of ARAT items to be administered, and the examined psychometric properties are appropriate for clinical use.

Figure 2. Scatterplots and spline curves of the ARAT scores with and without decision rules for parallel forms reliability.

Figure 3. Scatterplot and spline curve of the ARAT scores with decision rules for inter-rater reliability.

Figure 4. Bland-Altman plot of the ARAT scores with decision rules.

Figure 5. Scatterplots and spline curves of the ARAT scores with decision rules for construct validity.

Table 1. Characteristics of the participants.

Table 2. Inter-rater reliability of the ARAT individual item scores with decision rules.

Table 3. Construct validity of the ARAT score with decision rules.