Measuring mealtime performance in older adults with suspected oropharyngeal dysphagia: an updated systematic review of psychometric properties

Abstract
Purpose: To update a previous review of psychometric properties of performance-based outcome measurement instruments (PerFOMs) for task performance in the context of meal activity of older adults (≥65 years) with suspected oropharyngeal dysphagia (OD).
Materials and methods: Systematic searches were conducted in PubMed, CINAHL, EMBASE, SCOPUS, and Web of Science. Studies on PerFOMs that cover items reflecting skills in the pre-oral, oral, and pharyngeal stages of ingestion during meals were included. Two review authors independently screened, extracted, and evaluated the methodological rigour and quality of the reported psychometric properties in the included studies using the guidelines of the COnsensus-based Standards for the Selection of health Measurement INstruments (COSMIN).
Results: Twenty-three articles featuring nine original PerFOMs and five translated versions were included. PerFOM development and content validity were rated with inadequate or doubtful methodological quality across all studies. The quality of the evidence across the additional psychometric properties of the PerFOMs was very low for two, ranged from very low to moderate for six, and from very low to high for five.
Conclusions: There is limited evidence of the psychometric properties of available PerFOMs for measuring task performance during meals in older adults with OD, and further validation is warranted.
Implications for rehabilitation
Assessing the mealtime performance of older adults with oropharyngeal dysphagia (OD) provides important information.
Performance-based outcome measurement instruments (PerFOMs) need to be valid and reliable.
Clinicians need to be careful when choosing PerFOMs to assess the mealtime performance of older adults with OD as there is insufficient evidence on the quality of available instruments.
Established guidelines and standards should be used when developing and investigating psychometric properties of PerFOMs assessing mealtime performance of older adults with OD.


Introduction
Oropharyngeal dysphagia (OD) impairs the efficiency and safety of ingestion functions and is highly frequent in older people aged 65 years and above [1,2]. OD is caused by acute and chronic diseases common to advanced age (e.g., stroke, Parkinson's disease, dementia, COPD) [1]. In recent years, OD in older adults has also been related to a decline in the strength, mass, and function of the masticatory and swallowing muscles (i.e., sarcopenia), which might affect ingestion efficiency and safety [2]. OD increases the risk of malnutrition, dehydration, aspiration pneumonia [1,2], depression, and anxiety [3] as well as decreased quality of life [4], and is a serious, costly, current, and future healthcare issue [5]. A recent meta-analysis reports that OD affects about 30% of community-dwelling older adults, almost 50% of geriatric patients, and above 50% of nursing home residents [6]. The high frequency and serious consequences of OD in old age necessitate systematic screening, which typically involves simple pass-fail procedures to identify those at risk of OD [2,7]. If individuals are screened positive for OD, further assessment is required to identify possible causes of the underlying problem, the risk of aspiration, and recommendations for appropriate consistencies for oral intake, and to establish baseline data for evaluation of the effectiveness of interventions [7].
The assessment process for suspected OD frequently involves an instrumental assessment and/or a clinical swallow assessment. Instrumental assessment includes fibre-optic endoscopic evaluation of swallowing (FEES) or videofluoroscopic evaluation of [...] reproducible measurements [16] when administered by clinicians involved in OD management (e.g., nurses, speech-language pathologists, caregivers, occupational therapists). A review by Hansen et al. [11] concluded that, of eight identified PerFOMs for evaluating mealtime task performance in older adults with suspected OD, only two (Minimal-Eating Observation Form-version II and McGill Ingestive Skills Assessment-version I) displayed adequate quality of psychometric properties. Development and validation of PerFOMs are, however, ongoing processes [16], and it is expected that the evidence base has increased since the publication by Hansen et al. [11]. Recently, two reviews addressing PerFOMs for evaluating mealtime task performance in older adults have been published [17,18]. Spencer et al. [17] identified three PerFOMs, of which one is new (Feeding Difficulty Index) and two (Feeding Behaviour Inventory and Edinburgh Feeding Evaluation in Dementia) were excluded by Hansen et al. as they did not cover all stages of ingestion [11]. Jung et al. [18] identified two PerFOMs, of which one (Eating Behaviour Scale) was included by Hansen et al. [11] and one (Edinburgh Feeding Evaluation in Dementia) was excluded as it did not cover all stages of ingestion [11]. Unfortunately, these two reviews exclusively include PerFOMs targeting older adults with dementia and do not evaluate the quality of the available evidence.
Note that Hansen et al. [11] evaluated the quality of the psychometric properties of the identified PerFOMs before the publication of the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) initiative [19], which provides a guideline for systematic reviews and quality appraisal of outcome measurement instruments. Standardised criteria for evaluating the psychometric quality of PerFOMs of mealtime task performance are needed in systematic reviews to provide evidence-based recommendations for purposeful selection by clinicians and researchers [20]. Therefore, the objective of this study was to update the review by Hansen et al. [11] using the COSMIN methodology [19,20] to critically appraise, compare, and summarise the psychometric properties of previous and newly identified PerFOMs that include items reflecting skills in the pre-oral, oral, and pharyngeal stages of ingestion for evaluating mealtime task performance in older adults with suspected OD.

Materials and methods
This updated systematic review is guided by the PRISMA statement [21] and the Guideline for Systematic Reviews of Outcome Measurement Instruments by COSMIN [9,14,20,22-24]. The methodology consists of three consecutive phases: (1) conducting a systematic literature search; (2) evaluating the methodological quality of the included studies using the COSMIN Risk of Bias checklists [22-24]; and (3) rating the quality of each psychometric property using pre-defined criteria [20]. Though the COSMIN approach was originally intended for evaluating patient-reported outcome measures (PROMs), the checklists have been elaborated to fit clinician-reported and performance-based outcomes and measures [20].
The study was pre-registered with the International Prospective Register of Systematic Reviews (PROSPERO) in February 2022 (CRD42022309662). There is one minor deviation from the pre-registration: information on interpretability is not presented, since most of the included studies did not report on the distribution of scores in the study populations, the percentage of missing items and the percentage of missing total scores, or floor and ceiling effects.

Literature search
The literature search aimed to identify studies published after the literature search conducted by Hansen et al. [11], who included studies no later than January 2010. The online databases PubMed, CINAHL, and EMBASE were searched. The main search terms were: (("oropharyngeal dysphagia" OR "dysphagia" OR "deglutition disorders" OR "swallowing difficulties" OR "aspiration" OR "feeding difficulties" OR "eating difficulties" OR "drinking difficulties") AND ("observation" OR "test meal" OR "scale") AND ("eating" OR "functional oral intake" OR "mealtime performance" OR "ingestive skills" OR "task performance" OR "feeding" OR "drinking" OR "occupational performance" OR "performance" OR "swallowing competency" OR "meal")). For finding studies on measurement properties, filters developed by Terwee et al. [25] were applied. Each database-specific filter was piloted and adjusted before the final searches, which were restricted to age ≥65 years and the English language. The search was performed on 25 February 2022 and included publications since 1 January 2010. After completion of the entire review process, the original search was updated to cover the period from 25 February 2022 to 30 May 2022. The inclusion criteria for study selection were:
a. The instrument aims to measure the quality of mealtime task performance by items covering skills in the pre-oral, oral, and pharyngeal stages of ingestion (the construct).
b. More than 50% of the study population is older adults (≥65 years of age) with suspected OD (the target population).
c. The PerFOM is administered as observation during a meal activity, allowing numeric ratings of the quality of mealtime task performance (the type of measurement instrument).
d. The main aim of the study is the development of a PerFOM or an explicit evaluation of one or more of its psychometric properties (the outcome).
e. Original research published in peer-reviewed journals.
Studies were excluded if the PerFOM was used to measure an outcome (e.g., in randomised controlled trials) or served as a gold standard for validation of another instrument; was used for trial swallows during videofluoroscopy or fibre-endoscopic evaluation of swallowing; was used for investigating oesophageal dysphagia or psychogenic dysphagia; or was integrated as part of the wider measurement of general ADL performance. We excluded reviews, guidelines, letters, editorials, commentaries, conference abstracts, unpublished manuscripts, and grey literature. All eligibility criteria, apart from criterion (d), align with Hansen et al. [11], who included all studies that used the PerFOM as an outcome measurement.
An additional search was conducted, which involved checking reference lists of included studies to find additional articles, as well as citation searches in PubMed, CINAHL, EMBASE, SCOPUS, and Web of Science using the name or acronym of the identified PerFOMs. All searches were performed by one author (SAFR) and transferred to the Covidence software tool [26]. After the removal of duplicates, potentially eligible studies were identified from screening titles and abstracts and then assessed through full text [21]. Study selection was undertaken independently by two authors (TH and SAFR). Disagreements were resolved through discussion; any remaining uncertainties were resolved by discussion with two of the authors (JF, IS).
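As an illustration only, the main search string reported in this section can be assembled programmatically, which makes its three concept blocks explicit and easier to adapt per database. The sketch below is a minimal Python example; `or_block` and `build_query` are hypothetical helper names, and the database-specific filters, age limit, and language limit described above are not reproduced.

```python
# Term groups copied from the main search terms reported in this section.
CONDITION_TERMS = [
    "oropharyngeal dysphagia", "dysphagia", "deglutition disorders",
    "swallowing difficulties", "aspiration", "feeding difficulties",
    "eating difficulties", "drinking difficulties",
]
METHOD_TERMS = ["observation", "test meal", "scale"]
ACTIVITY_TERMS = [
    "eating", "functional oral intake", "mealtime performance",
    "ingestive skills", "task performance", "feeding", "drinking",
    "occupational performance", "performance", "swallowing competency",
    "meal",
]


def or_block(terms):
    """Quote each term and join the terms with OR inside parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(*term_groups):
    """Join the OR blocks with AND and wrap the result in parentheses."""
    return "(" + " AND ".join(or_block(group) for group in term_groups) + ")"


if __name__ == "__main__":
    print(build_query(CONDITION_TERMS, METHOD_TERMS, ACTIVITY_TERMS))
```

Keeping the term groups as separate lists means a single synonym can be added or removed without retyping the full Boolean expression.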

Data extraction
Characteristics of the PerFOMs were extracted and included information on the target population; the purpose and context of use (i.e., discriminative, evaluative, and/or predictive); the number of (sub)scales and items, which were linked to the pre-oral, oral and pharyngeal stages of ingestion (if items related to influencing factors or consequences, this was also noted); tasks; response options; the range of scores; and target user group, user instructions, and training courses. Characteristics of the included studies involved information on the addressed psychometric properties, country, and language version of the PerFOM, sample size, age, diagnosis, including information on OD, setting, and assessors. Characteristics of the psychometric properties were extracted and arranged according to the order proposed by COSMIN, which is content validity, internal structure, and remaining psychometric properties [23]. For content validity of the PerFOMs, data extraction included the development or translation process, the conceptual model used, the definition and operationalization of the construct, the item generation, the type of data analyzed, how relevance, comprehensiveness, and comprehensibility of items were addressed, and how the PerFOM was pilot-tested. Information on internal structure included structural validity/unidimensionality, internal consistency, and cross-cultural validity/measurement invariance. Information on the remaining psychometric properties included reliability (inter- and intra-rater reliability, and measurement error), criterion validity (concurrent and predictive validity), construct validity (hypothesis testing for convergent and known-groups validity), and responsiveness. If data were not reported in the articles, thus resulting in incomplete data extraction, this was recorded, but no further actions were taken. The data extraction procedure was piloted by TH and implemented by SAFR. The completeness and correctness of data extraction were confirmed by TH. Discrepancies were resolved by consensus.

Risk of bias (quality) evaluation
The risk of bias and quality evaluation of each psychometric property involved three steps and was performed independently by two of the five authors, who worked in random pairs. Disagreements were discussed, and in the event of no consensus, a third author was consulted. First, assessment of the risk of bias according to psychometric property per study was undertaken using the COSMIN Risk of Bias checklists [22-24], which contain a series of standards referring to study design requirements and preferred statistical methods. Each standard is rated as very good, adequate, doubtful, or inadequate quality with a "worst score counts" approach, meaning that each psychometric property category gets the lowest rating achieved for any of the standards within that category. In the event of doubtful or inadequate ratings, the main reason(s) was/were reported.
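The "worst score counts" principle described above can be expressed in a few lines. The following is a minimal sketch, not the COSMIN spreadsheet itself; the function name `worst_score` and the list-based input are assumptions made for illustration.

```python
# Rating labels from best to worst, as named in the text.
RATING_ORDER = ["very good", "adequate", "doubtful", "inadequate"]


def worst_score(standard_ratings):
    """Return the lowest (worst) rating among the standards of one category."""
    if not standard_ratings:
        raise ValueError("at least one standard rating is required")
    # A higher index in RATING_ORDER means a worse rating.
    return max(standard_ratings, key=RATING_ORDER.index)


if __name__ == "__main__":
    # One doubtful standard determines the rating of the whole category.
    print(worst_score(["very good", "adequate", "doubtful", "very good"]))
```

Under this rule a single poorly rated standard determines the rating for the whole psychometric property category, however well the other standards were met.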
Second, the studied psychometric properties for each PerFOM were rated as sufficient (+), insufficient (−), or indeterminate (?) according to the COSMIN updated criteria for good psychometric properties [20,23]. In the event of multiple studies on a psychometric property for a PerFOM, an overall rating was given by summarizing the ratings of each study into sufficient (+) or insufficient (−) if ≥75% of the results displayed the same rating. If ratings across multiple studies were inconsistent (i.e., <75% of the results displayed the same rating), this was rated as (±). If the results per study were all indeterminate (?), the overall ratings were also indeterminate [23]. If a study presented more than one result for a psychometric property (e.g., assessed both inter- and intra-rater reliability or addressed structural validity using both classical test theory (CTT) and item response theory (IRT)), the results were handled as sub-studies and an overall rating was determined with the same procedures as for multiple studies [23].
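The pooling rule for multiple (sub-)studies can be sketched as follows. This is an illustrative Python example under stated assumptions: the function name `overall_rating` is hypothetical, the ASCII hyphen "-" stands in for the insufficient symbol, and the share is computed over all results, including indeterminate ones, because the text does not say how a mix of determinate and indeterminate results is handled.

```python
def overall_rating(study_ratings, threshold=0.75):
    """Pool per-study ratings ("+", "-", "?") into one overall rating.

    Returns "+" or "-" when at least `threshold` of the results agree,
    "?" when every result is indeterminate, and "±" otherwise.
    """
    if not study_ratings:
        raise ValueError("at least one study rating is required")
    if all(rating == "?" for rating in study_ratings):
        return "?"
    total = len(study_ratings)
    for symbol in ("+", "-"):
        # Assumption: indeterminate results stay in the denominator.
        if study_ratings.count(symbol) / total >= threshold:
            return symbol
    return "±"


if __name__ == "__main__":
    print(overall_rating(["+", "+", "+", "-"]))  # three of four results agree
    print(overall_rating(["+", "-"]))            # results conflict
```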
In assessing the quality of hypothesis testing for construct validity, the review team acknowledged that there is no gold standard per se for measuring the construct in question. Therefore, a set of hypotheses was formulated [23]. It was expected to find at least a fair correlation (i.e., ≥0.30) [27] between the PerFOMs under review and comparators covering related constructs such as mental function, physical function, oral motor skills, cranial nerve integrity, and/or swallowing function (i.e., convergent validity) [7]. The comparison between subgroups (i.e., discriminative or known-groups validity) was related to the ability of the PerFOM to distinguish between individuals with/without dysphagia or dysphagia-related conditions [1-4,6]. It was expected to find at least medium differences according to Cohen's d ≥ 0.5 [27], which was either reported in the studies or calculated by the review team.
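Where Cohen's d is not reported, it can be derived from group summary statistics. The sketch below assumes the common pooled-standard-deviation form of Cohen's d for two independent groups; the text does not state which variant the review team used, and the function names are hypothetical. Comparing absolute values against the thresholds is also an assumption, made because the scoring direction may differ between instruments.

```python
from math import sqrt


def cohens_d(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Cohen's d for two independent groups using the pooled standard deviation."""
    pooled_variance = (
        (n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2
    ) / (n_1 + n_2 - 2)
    return (mean_1 - mean_2) / sqrt(pooled_variance)


def meets_known_groups_hypothesis(d, threshold=0.5):
    """True when the group difference is at least medium (|d| >= 0.5)."""
    return abs(d) >= threshold


def meets_convergent_hypothesis(r, threshold=0.30):
    """True when the correlation with the comparator is at least fair (|r| >= 0.30)."""
    return abs(r) >= threshold


if __name__ == "__main__":
    # Hypothetical summary statistics for two subgroups, not data from any study.
    d = cohens_d(mean_1=10.0, sd_1=2.0, n_1=20, mean_2=8.0, sd_2=2.0, n_2=20)
    print(d, meets_known_groups_hypothesis(d))
```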
The third step of the evaluation involved grading the quality of evidence for each psychometric property of the PerFOMs. For this, the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) approach adopted by COSMIN [23] was used. The quality of evidence is graded as high, moderate, low, or very low according to a set of criteria for risk of bias, inconsistency, imprecision, and indirectness [23], listed below Table 4.
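The grading step can be sketched as a simple downgrading calculation. The example below assumes the COSMIN convention that evidence starts at high quality and is lowered one level per downgrade point, never falling below very low; the function name `grade_quality` and the use of non-positive integers for the four factors are illustrative choices.

```python
LEVELS = ["high", "moderate", "low", "very low"]


def grade_quality(risk_of_bias=0, inconsistency=0, imprecision=0, indirectness=0):
    """Return the evidence level after applying the four downgrade factors.

    Each factor is 0 (no downgrade) or a negative integer (e.g., -1, -2, -3).
    """
    downgrades = [risk_of_bias, inconsistency, imprecision, indirectness]
    if any(value > 0 for value in downgrades):
        raise ValueError("downgrade factors must be zero or negative")
    steps = -sum(downgrades)
    # Evidence cannot be graded lower than "very low".
    return LEVELS[min(steps, len(LEVELS) - 1)]


if __name__ == "__main__":
    print(grade_quality())                                  # no downgrades
    print(grade_quality(risk_of_bias=-1, imprecision=-1))   # two levels down
    print(grade_quality(risk_of_bias=-3, imprecision=-2))   # floor reached
```

The downgrade patterns in these calls are chosen to resemble combinations of the kind reported in Table 4.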
For the risk of bias and quality evaluation, an adapted version of a standard COSMIN Excel spreadsheet for PROMs was used. The adaptations involved modifying the wording of some of the standard criteria to relate them to PerFOM development and content validity. Translations of existing PerFOMs were considered part of the development phase [22], and the translation standard criteria from the COSMIN study design checklist for patient-reported outcome measurement instruments were added to the standard for the developmental phase [28]. Standards addressing the involvement of representatives from the target population in the instrument development were extended to include professionals, and content validity was extended by adding judgement of relevance, comprehensiveness, and comprehensibility by professionals instead of patients. In addition, the evaluation of the content validity was exclusively based on the quality and results of the available studies according to the COSMIN criteria for good content validity. Due to limited information on and restricted access to some of the included PerFOMs (scoring sheets and instruction manuals), it was not possible to supplement the evaluation with reviewer ratings of the content as recommended by COSMIN [22]. Where some information was available on the development, but not on content validity per se, the methodological quality was rated as inadequate, and the overall quality of content validity was rated as indeterminate (?). Finally, if reliability or internal consistency was applied in the development phase of a PerFOM for item reduction or refinement, the methodology and results were not considered for evaluating the final existing PerFOM.

Results
The study selection process is presented in the PRISMA flow diagram in Figure 1. The literature search identified 931 records after the removal of duplicates. Following the screening of titles and abstracts, 31 full-text articles were retrieved. With the 17 studies included in the previous version of the review [11], 48 full-text articles were assessed for eligibility, of which 16 met the selection criteria. Ten studies included in the previous review were excluded because of the wrong outcome. Seven full-text articles were identified during the additional search and were added, resulting in the inclusion of 23 studies. The updated search did not provide additional studies on PerFOMs. Studies excluded after full-text reading are listed in the supplemental online material (Table S2), along with reasons for exclusion.

Characteristics of included studies and addressed psychometric properties
Table 2 presents the characteristics of the included studies. Sixteen studies relate to original versions of the PerFOMs, and seven studies relate to translated versions. All studies included participants aged 65 years and older with OD-related conditions. The presence of OD was reported in six studies, of which two used instrumental assessments and five used non-instrumental assessments. Information on the development was reported to some degree for all nine original PerFOMs, and content validity was addressed for three. For the four translated versions, the translation process was described for three and content validity was addressed for two (Table 2 and supplemental online material, Table S3). For all translations, except KT-Index, psychometric properties were investigated in the target populations. Across all studies, internal consistency and hypothesis testing for construct validity were most frequently investigated. Responsiveness and measurement invariance were least addressed (Table 2 and supplemental online material, Tables S4-S8). Cross-cultural validity in terms of differential item functioning (DIF) by language versions was not reported for any of the translated versions.

Methodological quality of included studies
In general, the applied methods for investigating the psychometric properties of the PerFOMs were predominantly within CTT, and only a few studies used methods within IRT. The methodological quality of the included studies and the ratings of the addressed psychometric properties are summarized in Table 3. Detailed information is provided as supplemental online material (Tables S3-S8).

Development and content validity
Of the nine original PerFOMs, the methodological quality for development and content validity was rated inadequate for six [31-33,35,38,39] and doubtful for two [37,41]. One PerFOM received an inadequate rating for development and a doubtful rating for content validity [29]. These ratings were primarily driven by the fact that professionals, including the clinicians who are expected to administer the PerFOM, were not asked about relevance, comprehensiveness, or comprehensibility, or too few were asked, during the development and/or content validation. With regard to the translated versions, three received inadequate ratings on both aspects [33,46,51], one received doubtful ratings on both aspects [45], and one received an inadequate rating for (continued) development and a doubtful rating for content validity [47] (Table 3 and supplemental online material, Table S3).

Internal structure (structural validity, internal consistency, and measurement invariance)
For four original PerFOMs and four translated versions, structural validity was addressed and obtained adequate to very good methodological quality ratings. Exploratory Factor Analysis (EFA) was most frequently applied. According to the COSMIN criteria, this resulted in an indeterminate rating for structural validity of four original PerFOMs (Ch-FDI [29], MAT [37], MEOF-I [38], MEOF-II [39,40]) and one translated version (MEOF-II-Ch [45]). For the Danish translations of MEOF-II [46], MISA1 [48,49], and MISA2 [49,51], unidimensionality was also addressed using the Rasch model within IRT. For MEOF-II-DK [46] and MISA2-DK [49,51], model fit was realized using a Testlet approach, which resulted in sufficient ratings. For MISA1-DK, an inconsistent rating was given, since most items in two of six subscales misfitted the Rasch model (Table 3 and supplemental online material, Table S4).
Internal consistency was addressed for seven of nine original PerFOMs and four translated versions. According to COSMIN, the methodological quality for calculating internal consistency is dependent on the quality of evidence for structural validity and unidimensionality [23]. Across all studies, the methodological quality for addressing internal consistency was rated doubtful for nine studies, relating to seven original PerFOMs and one translated version. Of these, there was no available information on structural validity or unidimensionality for three original PerFOMs (EDAS [32], KT-Index [33,34], and MAS [36]), resulting in indeterminate ratings for internal consistency. The studies related to Ch-FDI [29], MAT [37], MEOF-I [38], MEOF-II [39,40], and one translated version of MEOF-II [45] calculated Cronbach's alpha or Omega for sub-scales determined by EFA, which, however, does not provide evidence of unidimensionality. Accordingly, it is not possible to obtain a clear interpretation of the internal consistency parameters [23], and all were rated of doubtful methodological quality. For the three Danish translations of the PerFOMs (MEOF-II-DK [46], MISA1-DK [48,49], and MISA2-DK [49,51]) for which the Rasch model was used to evaluate unidimensionality, the methodological quality for addressing internal consistency was rated very good, and the estimates were rated sufficient (Table 3 and supplemental online material, Table S4). As part of the analysis using the Rasch model, measurement invariance by gender, age, or setting was also investigated for the three translated PerFOMs [46,48,49,51], with sufficient ratings for the property but inadequate or doubtful methodological quality due to small sample sizes in the different groups (Table 3 and supplemental online material, Table S5).
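For readers unfamiliar with the statistic, Cronbach's alpha can be computed from item-level scores as sketched below. This is a generic textbook calculation using sample variances, not the procedure of any included study, and the function name and data layout are assumptions. As noted above, a high alpha does not by itself demonstrate unidimensionality.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha from a list of items.

    Each item is a list of scores for the same persons in the same order.
    """
    n_items = len(item_scores)
    if n_items < 2:
        raise ValueError("at least two items are required")
    n_persons = len(item_scores[0])
    if any(len(item) != n_persons for item in item_scores):
        raise ValueError("all items must be scored for the same persons")
    # Total score per person across all items.
    totals = [sum(person_scores) for person_scores in zip(*item_scores)]
    sum_item_variances = sum(variance(item) for item in item_scores)
    return (n_items / (n_items - 1)) * (1 - sum_item_variances / variance(totals))


if __name__ == "__main__":
    # Hypothetical scores for two items rated on four persons.
    print(cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4]]))
```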

Remaining psychometric properties
Reliability was addressed for seven original PerFOMs and two translated versions in terms of inter-rater reliability (N = 7), intra-rater reliability (N = 5), and test-retest reliability (N = 3). The methodological quality was rated doubtful in the studies related to EDAS [32], KT-Index [33], and MAT [37], and inadequate in the studies related to Ch-FDI [29] and MEOF-II-Ch [45], mainly due to unclear description of the administration or use of improper statistics. The studies evaluating MEOF-I [39], MEOF-II [39], MISA1 [43], and MISA1-DK [50] showed adequate and very good methodological quality, and sufficient reliability, although inconsistent results across the items in MEOF-I and II were found [39]. Measurement error was addressed for five original PerFOMs but stated as reliability in the studies [29,32,36,39]. One study on MISA1-DK [50] addressed this property explicitly. For all the PerFOMs, the measurement error estimates were rated indeterminate because minimal important change (MIC) was not defined, which is a requirement according to the COSMIN criteria [23,24] (Table 3 and supplemental online material, Table S6). Criterion validity in terms of concurrent and predictive validity was addressed for five original PerFOMs. One study related to MISA1 [42] obtained very good methodological quality, but with insufficient parameter estimates for predictive validity. The studies related to EDAS [32] and MAT [37] obtained doubtful methodological quality ratings, and the studies related to Ch-FDI [30] and MEOF-II [39] achieved inadequate methodological quality because of inappropriate gold standards, no available evidence on psychometric properties, or improper statistics (Table 3 and supplemental online material, Table S7).
Hypothesis testing for construct validity containing a mixture of convergent and known-groups validity was addressed for seven original PerFOMs and two translated versions, and most obtained sufficient ratings for this property. The methodological quality was rated doubtful in the studies related to EDAS [32], KT-Index [33,34], MEOF-II-Ch [45], MISA1 [43,44], and MISA1-DK [48], mainly due to unclear description of procedures in terms of time intervals and measurement administration as well as characteristics of subgroups. For MISA1 [43] and MISA1-DK [48], both convergent and known-groups validity were addressed in one study. Since these properties obtained different methodological quality ratings, they were handled as two sub-studies for each PerFOM. For MAS [36], the quality ratings were adequate for both aspects, and they were therefore combined (Table 3 and supplemental online material, Table S7).
Responsiveness was only addressed for MAS in one study [36], which obtained an indeterminate rating and an inadequate methodological rating due to inappropriate statistical methods (Table 3 and supplemental online material, Table S8).

Quality of evidence according to the COSMIN modified GRADE approach
The overall quality of evidence for the psychometric properties of each PerFOM is presented in Table 4. Most psychometric properties were addressed in only one study for each PerFOM. Across all PerFOMs, indirectness was not downgraded, whereas it was necessary to downgrade for risk of bias due to methodological flaws for almost all studied psychometric properties. For some of the PerFOMs, it was also necessary to downgrade for imprecision due to small sample sizes and inconsistency due to conflicting results across (sub)studies (Table 4).
None of the included PerFOMs obtained sufficient psychometric properties with high-quality evidence for all possible psychometric properties. The overall content validity of one original PerFOM (Ch-FDI) and one translation (MISA1-DK) was rated sufficient, but with low-quality evidence. For the remaining original PerFOMs and translated versions, overall content validity was rated indeterminate with low- or very low-quality evidence (Table 4). The methodological quality of the studies on structural validity revealed indeterminate ratings and moderate-quality evidence for five PerFOMs. The two translated PerFOMs (MEOF-II-DK, MISA2-DK) evaluated by the Rasch model obtained sufficient ratings and high-quality evidence. Across all included PerFOMs, appropriate interpretation of internal consistency parameters was only possible for these translations, which obtained sufficient ratings with moderate- to high-quality evidence for this property (Table 4). For the MISA1-DK, the risk of bias was downgraded by −1 for internal consistency, resulting in moderate-quality evidence. The quality of evidence for internal consistency cannot be higher than the quality of evidence for structural validity [23]. Although structural validity was rated adequate for the two studies on MISA1-DK [48,49], they are in fact one study based on the same sample. Therefore, it is not possible to use the criterion "multiple studies of at least adequate quality" for rating no risk of bias for structural validity. Though measurement invariance was sufficient for the three translations, the quality of evidence was low or very low (Table 4). For the remaining psychometric properties, reliability parameters were rated sufficient with high-quality evidence for MISA1 and MISA1-DK, and hypothesis testing for construct validity was rated sufficient with moderate-quality evidence for Ch-FDI, KT-Index, MAS, MAT, and MISA1-DK, whereas parameters for the criterion validity of Ch-FDI, EDAS, MAT, MEOF-II, and MISA1 and for the responsiveness of the MAS obtained low or very low-quality evidence.
[...] excluded, of which four studies relate to two of the included PerFOMs (EDAS and MEOF-II) and six studies relate to four PerFOMs not included in this updated review. The reason for exclusion was the wrong outcome: the main aim of the study was neither the development of a PerFOM nor an explicit evaluation of one or more of its psychometric properties. This eligibility criterion was not part of the study selection in Hansen et al. [11].
All the identified PerFOMs have been assessed for some psychometric properties, except for the translated version of KT-Index [33]. However, evidence of a rigorous construct theory as well as the content validity of the PerFOMs was generally absent, and when provided, it was rated with doubtful or inadequate methodological quality. In psychometrics, content validity is regarded as a fundamental property for ensuring that the items and their scoring truly reflect the construct to be measured [16,52]. The general definition of validity refers to the extent to which the instrument measures what it intends to measure, and content validity is an important step for supporting the validity of an instrument [16]. A PerFOM intended to measure mealtime task performance in older adults with OD that has insufficient content validity produces invalid and inaccurate item scores that might distort the PerFOM's purpose, for example, identification of impaired ingestion functions for rehabilitation planning. A fundamental aspect of validity in broad terms is also the requirement of specific objectivity (i.e., comparisons between individuals become independent of which particular items have been used, and vice versa), which requires unidimensionality [52]. To provide measures on a continuum of mealtime task performance, the PerFOM must be unidimensional in the sense that each item measures some aspect of the same construct [52]. This was only confirmed for the Danish translations of MEOF-II [46], MISA1 [48,49], and the further developed MISA2-DK [49,51] by applying the Rasch model. For the remaining five PerFOMs addressing structural validity, EFA was applied, which, however, is inadequate for determining unidimensionality [53]. The consequence of using a summated total score or subscale scores of PerFOMs that are not unidimensional is that wrong conclusions might be drawn about the measured construct [53].
Internal consistency was the most frequently addressed psychometric property. However, appropriate interpretation of this psychometric property was only possible in relation to the Danish translations of MEOF-II [46], MISA1 [48,49], and MISA2 [49,51]. Measurement error and responsiveness were investigated in a limited number of studies, and with insufficient methodology. It is, however, important to have some indication of the MIC over time as well as the minimal important difference (MID) in scores of a measure. Without such information, it is not possible to understand whether changes in the levels of mealtime task performance are meaningful for older adults with OD. This is important when evaluating rehabilitation efforts in clinical practice and in research [54].
a Item response theory/the Rasch model applied for structural validity and internal consistency in addition to exploratory factor analysis and then treated as a sub-study.
b Hypothesis testing for construct validity used convergent validity.
c Hypothesis testing for construct validity used known-groups validity.
d The Rasch model was applied to the same sample from the same overall study.

[39,40] NA ? 0 0 0 0 High
Internal consistency (n = 2) [39,40] 8556 [39] 50 ? −1 0 −1 0 Low
Criterion validity (n = 1) [39] 2600 ? −3 0 0 0 Very low
MEOF-II-Ch
Content validity (n = 1) [45] NA ? −2 0 0 0 Low
Structural validity (n = 1) [45] NA ? −1 0 0 0 Moderate
Internal consistency (n = 1) [45] 125 + −2 0 0 0 Low
Reliability (n = 1) [45] 20 ? −3 0 −2 0 Very low
Hypotheses testing for construct validity (n = 1) [45] 125

a Risk of bias: 0 = none (multiple studies of at least adequate quality, or one study of very good quality); −1 = serious (multiple studies of doubtful quality available, or there is only one study of adequate quality); −2 = very serious (multiple studies of inadequate quality, or there is only one study of doubtful quality available); −3 = extremely serious (only one study of inadequate quality available).
b Inconsistency: rated as 0 = none for all PerFOMs because inconsistency has been taken into account during the QP ratings across multiple studies or sub-studies within a main study.
c Imprecision: 0 = none; −1 = serious (n = 50-100; not for content, structural, and cross-cultural validity, since it is included in the COSMIN Risk of Bias evaluation); −2 = very serious (n < 50; not for content, structural, and cross-cultural validity, since it is included in the COSMIN Risk of Bias evaluation).
d Indirectness relates to evidence from different populations than the population of interest in the review. No studies were downgraded (i.e., 0 = none) since all selected studies included study populations of which >50% were older adults (>65 years of age) at risk of OD.
e COSMIN grades of evidence: High certainty: We are very confident that the estimate of the psychometric property lies close to that of the true psychometric property. Moderate certainty: We are moderately confident in the psychometric property estimate: the true psychometric property is likely to be close to the estimate of the psychometric property, but there is a possibility that it is substantially different. Low certainty: Our confidence in the psychometric property estimate is limited: the true psychometric property may be substantially different from the estimate of the psychometric property. Very low certainty: We have very little confidence in the psychometric property estimate: the true psychometric property is likely to be substantially different from the estimate of the psychometric property [23].
f The quality of evidence for internal consistency cannot be higher than the quality of evidence for structural validity [23], which was rated adequate for a sample size of 100-199 using the Rasch model in references [48,49], which are based on the same sample and study. Therefore, the risk of bias is downgraded by −1 for internal consistency.
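The downgrading scheme described in the table footnotes can be sketched in a few lines of Python. This is an illustrative sketch only: the function and variable names are hypothetical, and it assumes that the quality of evidence starts at "High" and drops one level per downgrade point, with "Very low" as the floor.

```python
# Illustrative sketch of the downgrading logic; names are hypothetical
# and not part of COSMIN or GRADE.
LEVELS = ["High", "Moderate", "Low", "Very low"]


def grade_quality(risk_of_bias, inconsistency, imprecision, indirectness):
    """Return the quality of evidence for one psychometric property.

    Each argument is 0 (no downgrade) or a negative integer, as in the
    table footnotes (for example -1 = serious, -2 = very serious).
    """
    factors = (risk_of_bias, inconsistency, imprecision, indirectness)
    if any(f > 0 for f in factors):
        raise ValueError("downgrade values must be 0 or negative")
    steps = -sum(factors)  # total number of levels to drop from "High"
    return LEVELS[min(steps, len(LEVELS) - 1)]


if __name__ == "__main__":
    # Risk of bias -1 and imprecision -1 together drop two levels.
    print(grade_quality(-1, 0, -1, 0))
```

Under these assumptions, the sketch reproduces the pattern visible in the table rows: a single serious downgrade gives "Moderate", and three or more downgrade points give "Very low".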
Information on training in administering the assessment was only reported for Ch-FDI [29], EDAS [32], and MISA2-DK [49]. Although easy access to PerFOMs might contribute to their clinical utility, it is recommended that clinicians involved in the assessment of individuals with OD be provided with sufficient training to ensure standardization and consistency in ratings [7].
Compared to Hansen et al. [11], who recommended MEOF-II [39] and MISA1 [41-43] for use in clinical practice, the findings in this updated review make such specific recommendations difficult. This conflicting result might be due to the fact that Hansen et al. [11] did not evaluate the methodological quality of the included studies, but only considered the number of psychometric properties addressed for the PerFOMs, the magnitude of the parameter estimates, and the sample size. Nevertheless, positive evidence for the internal structure using methods within IRT was found for the translated versions MEOF-II-DK [46], MISA1-DK [48,49], and MISA2-DK [49,51]. It could therefore be recommended to investigate the cross-cultural validity of the original PerFOMs and their translations by using the Rasch model to address unidimensionality, item fit, monotonicity of items and scale scores, and DIF by language versions [52]. In general, the identified gaps across all the included PerFOMs suggest that further, more rigorous research is needed to establish high-quality PerFOMs for measuring mealtime task performance in older adults in the field of OD. In addition, the reporting of psychometric studies should provide more details on the applied methodology. For reliability testing, detailed information on methods for the independent administering of the instrument and scores of repeated measurements in the same patients is required. Regarding hypothesis testing for construct validity and criterion validity, clear descriptions of the comparator measurements, how they are performed, and their psychometric properties are essential.

Methodological considerations
Although this updated systematic review included a comprehensive literature search in five electronic databases, we acknowledge that we might have excluded some published PerFOMs due to the restriction of studies to peer-reviewed journal articles published in English, and the exclusion of grey literature.
In contrast to Hansen et al. [11], this updated review applied the COSMIN standard criteria for evaluating the quality of methodology and psychometric properties of PerFOMs measuring mealtime task performance. However, many of the included studies were not designed according to the COSMIN criteria [28], which made the risk of bias and quality evaluation very complex. In addition, the COSMIN methodology was originally developed for systematic reviews of PROMs, and it was necessary to adapt some of the standard criteria for risk of bias and quality evaluation. It might be argued that the content validity of such adaptations should have been established before our critical appraisal [53].
Although the review team covers a broad range of research expertise within measurement theory and psychometrics as well as the GRADE approach, this update has been challenging. For rating the methodological quality, the COSMIN methodology uses the "worst score counts" approach to acknowledge that insufficient methodological aspects of a study cannot be compensated by good aspects [23]. However, how do we distinguish between deficient quality and deficient reporting? In addition, the indeterminate ratings of the criteria for good psychometric properties appear unclear in the sense that on some occasions they relate to insufficient information on a specific psychometric property and on other occasions they relate to the behaviour of the review team, e.g., "No hypothesis defined (by review team) for Responsiveness". Consequently, some COSMIN standard criteria appear vague and unclear, leaving reviewers to make subjective decisions [53]. To be as transparent as possible, we have provided supplemental online material (Tables S3-S8) with information on our ratings, especially for the doubtful and inadequate ratings.
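The "worst score counts" principle can be illustrated with a minimal Python sketch. The function name and the example ratings are hypothetical; the four-point scale (very good, adequate, doubtful, inadequate) is the one used for the methodological quality ratings in this review.

```python
# Illustrative sketch of the "worst score counts" principle; the function
# name and example ratings are hypothetical.
ORDER = ["inadequate", "doubtful", "adequate", "very good"]  # worst to best


def worst_score_counts(ratings):
    """Return the overall rating of a study on one psychometric property.

    `ratings` holds the ratings given to the individual standards; the
    overall rating is the lowest one, so good aspects cannot compensate
    for a poorly rated standard.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    # ORDER.index raises ValueError for any label outside the scale.
    return min(ratings, key=ORDER.index)


if __name__ == "__main__":
    # One doubtful standard determines the overall rating.
    print(worst_score_counts(["very good", "adequate", "doubtful"]))
```

The sketch also makes the difficulty raised above concrete: a standard rated "doubtful" because a detail was not reported lowers the overall rating just as much as one rated "doubtful" because of a real design flaw.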

Conclusions
This updated review reflects an increased emphasis on the necessity of measuring ingestion functions in the context of a meal activity during the comprehensive assessment of OD in older adults. Thus, several new PerFOMs have been developed and existing PerFOMs have been further validated in the last decade. However, this area of research is in its infancy and there is therefore still limited evidence of the psychometric properties of available performance-based instruments. Further research and validation are needed. It is especially relevant to establish the unidimensionality of PerFOMs where item scores are summated into composite measures. Given the results of this review, the application of methods within IRT could make valuable contributions to achieving this goal.

Figure 1. PRISMA flow diagram of study selection (PRISMA refers to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses).
Assessment of Swallowing Ability; OD: Oropharyngeal dysphagia; OTs: Occupational therapists; SALTs: speech-language pathologists; VE: video-endoscopy; WST: Water swallowing test.
a The analyses of unidimensionality and internal consistency by the Rasch model in references 48 and 49 were applied to the same population. Reference 49 elaborated on the analysis performed in reference 48.

Table 1. Characteristics of the included PerFOMs.
items: 16 items related to the preoral stage (e.g., self-feeding skills, alertness, restlessness, behaviour). 2 items related to the oral stage (food dribbles out from the mouth, does not initiate swallowing). 1 item related to the pharyngeal stage (chokes or gags on food).
Ch-FDI: Chinese Feeding Difficulty Index; EBS: Eating Behaviour Scale; EDAS: Eating Disabilities Assessment Scale; KT-Index: Kuchi-Kara Taberu Index; MAS: Mealtime Assessment Scale; MAT: Mealtime Assessment Tool; MEOF-I: Minimal-Eating Observation Form-version I; MEOF-II: further developed version II; MISA1: McGill Ingestive Skills Assessment-version I; MISA2-DK: further developed Danish translation of the MISA1-DK; OTs: Occupational therapists.
Notes: Assessment tools with a limited number of items are listed by item name; assessment tools with multiple items are only listed by subscales. Italic: Review authors' comments.

Table 2. Characteristics of included studies.

Table 3. Methodological quality of the included studies and ratings of psychometric properties.

Table 4. Overall results for the quality of the evidence on psychometric properties.