Moving beyond Likert and Traditional Forced-Choice Scales: A Comprehensive Investigation of the Graded Forced-Choice Format

Abstract The graded forced-choice (FC) format has recently emerged as an alternative that may preserve the advantages and overcome the issues of dichotomous FC measures. The current study presented the first large-scale evaluation of the performance of three types of FC measures (FC2, FC4, and FC5 with 2, 4, and 5 response options, respectively) and compared their performance to their Likert (LK) counterparts (LK2, LK4, and LK5) on (1) psychometric properties, (2) respondent reactions, and (3) susceptibility to response styles. Results showed that, compared to LK measures with the same number of response options, the three FC scales provided better support for the hypothesized factor structure, were perceived as more faking-resistant and cognitively demanding, and were less susceptible to response styles. FC4/5 and LK4/5 demonstrated similarly good reliability, while LK2 provided more reliable scores than FC2. When compared across the three FC measures, FC4 and FC5 displayed comparable psychometric performance and respondent reactions. FC4 exhibited a moderate presence of extreme response style, while FC5 had a weak presence of both extreme and middle response styles. Based on these findings, the study recommends the use of graded FC over dichotomous FC and LK formats, particularly FC5 when extreme response style is a concern.

Existing FC measurement requires respondents to choose among or rank two or more statements (see Figure 1b); either way, the responses must be recoded into dichotomous paired comparisons for scoring. Despite all these promising features, compared to their LK counterparts, dichotomous FC formats suffer from two major issues: lower reliability and less favorable respondent reactions, both of which are likely to bias estimates of key parameters, lower statistical power, and hamper the accuracy of selection/diagnostic decisions. Therefore, it is imperative for researchers to develop ways to improve the reliability of and respondent reactions to FC measures without substantially sacrificing their existing advantages.
The graded FC format (see Figure 1c; Brown & Maydeu-Olivares, 2018) is a promising candidate for improving reliability and respondent reactions because, compared to the traditional dichotomous FC format, it allows respondents to express finer differentiations regarding their preference for each statement. However, only two studies have provided a preliminary investigation of this new format (Brown & Maydeu-Olivares, 2018; Dalal et al., 2021). Beyond these initial findings, many other critical aspects of graded FC measures, such as psychometric performance, respondent reactions, the degree to which response styles are reintroduced, and the impact of the inclusion/exclusion of a middle response option, remain largely unknown. Thus, a systematic investigation is needed before we can confidently embrace this promising FC format.
Using data from two samples of over 4,000 respondents, the present study contributes to the literature by providing the first systematic and comprehensive investigation of the performance and the potential pros and cons of the graded FC format. Specifically, we compared two versions of a graded FC measure with 4 and 5 response options to the traditional dichotomous FC measure and their LK counterparts with 2, 4, and 5 response options on the following dimensions: (1) seven aspects of psychometric performance, (2) seven aspects of respondents' reactions, and (3) the extent to which the two graded FC scales reintroduce extreme and/or middle response styles. The Five-Factor Model of Personality was used as the theoretical measurement framework.

Dichotomous forced-choice measurement
The scientific study of many noncognitive constructs, such as personality, heavily relies on self-reported measures. The LK format has no doubt been the most widely used. In a LK measure, respondents are presented with a series of single statements describing typical behaviors, feelings, or thoughts and asked to indicate their degree of agreement with each statement on a graded scale (e.g., 1 = "Strongly Disagree", …, 5 = "Strongly Agree"). Despite the ease of development and scoring, LK measures have been known to suffer from various rating biases that may render the validity of scores derived from them questionable (Podsakoff et al., 2012). LK measures are also prone to faking when used in high-stakes situations such as personnel selection (Hu & Connelly, 2021), which also undermines the validity of scores. To overcome these issues, researchers came up with an alternative: the forced-choice format (Sisson, 1948).
In traditional FC measures, respondents are presented with one block of statements at a time. Block size, or the number of statements per block, often ranges from two to five. When the block size is two (see Figure 1a), respondents are required to choose the statement that is "more like me". When the block size is greater than two, respondents are either asked to choose one statement that is "most like me" and another that is "least like me" or to rank the statements from the most to the least descriptive of them. Responses to a block of size n (e.g., Jane ranks four statements as A > B > D > C) need to be recoded into n(n-1)/2 pseudo binary items representing all unique pairwise comparisons (e.g., for Jane's responses, the corresponding six pseudo items are AB = 1, AC = 1, AD = 1, BC = 1, BD = 1, CD = 0; "1" indicates that the first statement in the pair is preferred over the second one, and "0" indicates the opposite). All pseudo items are then subjected to a second-order ordinal factor analysis model (Brown & Maydeu-Olivares, 2011). In the model, each pseudo item loads on two first-order factors with factor loadings fixed to 1 and −1 and uniqueness fixed to 0; the two factors represent the latent utilities of the two statements being compared. Each first-order factor then loads on its designated second-order factor (e.g., the Big Five factors) in the same way as items from a LK scale load on latent factors. Normative factor scores for each second-order factor can be obtained from this model. Readers are encouraged to refer to Brown and Maydeu-Olivares (2011) for more details.
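To make the recoding step concrete, the following Python sketch (our own minimal illustration; the function name `recode_ranking` is not from any published scoring software) converts a within-block ranking into the n(n-1)/2 pseudo binary items described above.

```python
from itertools import combinations

def recode_ranking(ranking):
    """Recode a within-block ranking into n(n-1)/2 pseudo binary items.

    `ranking` lists statement labels from most to least descriptive of
    the respondent, e.g. ["A", "B", "D", "C"] for A > B > D > C.
    Each pseudo item equals 1 if the first statement in the pair is
    preferred over the second, and 0 otherwise.
    """
    # A lower index in the ranking means a more preferred statement.
    position = {statement: i for i, statement in enumerate(ranking)}
    return {
        a + b: 1 if position[a] < position[b] else 0
        for a, b in combinations(sorted(ranking), 2)
    }

# Jane ranks the four statements as A > B > D > C:
print(recode_ranking(["A", "B", "D", "C"]))
# {'AB': 1, 'AC': 1, 'AD': 1, 'BC': 1, 'BD': 1, 'CD': 0}
```

These pseudo items would then enter the Thurstonian factor model described above; the sketch covers only the recoding step, not the model estimation.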
Unlike responding to statements in a LK measure, where respondents make an absolute judgment about their degree of agreement, a FC measure requires respondents to make a relative judgment regarding the degree of match between statements and themselves. For example, even if a respondent (dis)agrees with both statements when presented individually, one statement may still describe the respondent relatively better than the other. Therefore, the respondent is asked to make a comparison and choose the one they agree/disagree with more, relatively. Due to such a forced nature, rating biases such as leniency or acquiescence can be avoided. If statements within the same block are matched on social desirability, FC measures are also faking-resistant (Cao & Drasgow, 2019). Given these unique advantages of FC measurement and methodological advances (Brown & Maydeu-Olivares, 2011; Bunji & Okada, 2022; Frick et al., 2023; Joo et al., 2020, 2023; Morillo et al., 2016; Stark et al., 2005), FC measurement is now playing important roles in many contexts. For example, the Tailored Adaptive Personality Assessment System (TAPAS; Drasgow et al., 2012) has helped the US military select soldiers from millions of applicants. Similarly, the Occupational Personality Questionnaire (OPQ32i; Bartram, 2013) and the Adaptive Employee Personality Test (ADEPT-15; Boyce et al., 2015) have also been widely used for talent management globally.

Issues of dichotomous FC measurement
Despite all the advantages discussed above, dichotomous FC measurement is limited by at least two important issues: (1) relatively low reliability in comparison to its LK counterpart with identical statements and equal length (e.g., 60 pairs in a FC measure and 60 single statements in a LK measure), and (2) suboptimal respondent reactions. Unlike most LK measures, where responses to each statement are on a graded scale (1 = "Strongly disagree", …, 5 = "Strongly agree"), responses to traditional FC measures are dichotomous in nature. Compared to graded responses in LK measures, the dichotomous response format is psychometrically suboptimal because it contains less information. Additionally, because dichotomous FC measurement sometimes forces respondents to choose one statement over the other even if the two statements are similarly descriptive of them (Sass et al., 2020), people may simply respond randomly (McCloy et al., 2005), bringing in another potential source of measurement error. When scale length is held constant, the combined effects of less information and more random error jointly contribute to the relatively lower reliability of dichotomous FC measures relative to their LK counterparts (e.g., Brown & Maydeu-Olivares, 2011; Lee, Lee et al., 2018; Lee et al., 2019; Zhang et al., 2020). When used for research purposes, lower reliability leads to more biased estimation of key parameters and lower statistical power; when used for clinical diagnosis, less reliable measures may increase the risk of inaccurate diagnosis; when used in selection contexts, less reliable measures may lead to problematic decisions such that truly qualified respondents are screened out while those not qualified are selected. Therefore, users always seek good reliability (at least as good as that of LK scales). As reliability is positively related to scale length, longer FC scales are needed to achieve reliability comparable to their LK counterparts. However, longer FC scales are more time-consuming and cognitively demanding, which is undesirable in many contexts (as detailed in the next paragraph). Given that FC measurement is becoming increasingly popular across different contexts, it is imperative to find ways to improve the reliability of FC measures without increasing scale length.
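The reliability-length relationship invoked here is commonly approximated by the classical Spearman-Brown prophecy formula. The Python sketch below uses hypothetical reliability values purely for illustration; it is not based on the study's data.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a scale is lengthened by `length_factor`
    times (classical Spearman-Brown prophecy formula)."""
    k, r = length_factor, reliability
    return k * r / (1 + (k - 1) * r)

# Hypothetical example: if a 60-pair dichotomous FC scale has a
# reliability of .70, doubling its length to 120 pairs would be
# expected to raise reliability to about .82, at the cost of a
# longer and more cognitively demanding assessment.
print(round(spearman_brown(0.70, 2), 2))  # 0.82
```

This is exactly the tradeoff noted above: the prophecy formula shows that reliability gains from lengthening come slowly, so format-level improvements are more attractive than simply adding blocks.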
Regarding respondent reactions, we located five studies that examined various aspects of respondent reactions to dichotomous FC measures (Bowen et al., 2002; Converse et al., 2008; Dalal et al., 2021; Sass et al., 2020; Zhang et al., 2020). Responding to FC scales requires respondents to retrieve multiple pieces of information from long-term memory and hold them in working memory. Moreover, compared to LK scales, responding to FC scales involves an additional cognitive process of weighing statements within a block against each other (Sass et al., 2020). Due to these additional cognitive demands, across the studies, respondents consistently found FC measures to be more difficult than LK measures (Bowen et al., 2002; Converse et al., 2008; Sass et al., 2020; Zhang et al., 2020). Also, though there were some inconsistencies (Sass et al., 2020; Zhang et al., 2020), previous work reported evidence that respondents found FC measures to be less interesting and more confusing than LK measures (Bowen et al., 2002). Additionally, respondents also experienced lower levels of positive affect during the assessment when FC measures were used (Converse et al., 2008). In sum, there is some evidence suggesting that dichotomous FC measures may elicit less favorable respondent reactions than their LK counterparts. The issue of negative respondent reactions can be problematic across settings. For example, in low-stakes contexts, if respondents dislike a particular scale or find it useless, they may respond inattentively, jeopardizing the validity of scores derived from the scale. In high-stakes contexts such as personnel selection, it has been meta-analytically established that respondents' unfavorable reactions to selection procedures can be harmful for organizations because such negative reactions may contribute to a lower probability of accepting an offer, a lower likelihood of recommending others to apply, and a higher probability of filing legal complaints (Hausknecht et al., 2004). Therefore, improving respondent reactions to FC measurement is also worth more research attention.
Taken together, dichotomous FC measurement has unique advantages that make it a promising alternative to LK measurement. However, compared to its LK counterparts of the same length, dichotomous FC measurement is also prone to the issues of lower reliability and suboptimal respondent reactions, potentially jeopardizing its application. Therefore, it would be ideal if there were an alternative format that could maintain the advantages and overcome the drawbacks of the dichotomous FC format. Graded FC measurement appears to be such a candidate.

Graded forced-choice measurement
In graded FC measurement, respondents are presented with two statements and asked to indicate their degree of preference (Brown & Maydeu-Olivares, 2018; see Figures 1b and 1c). Specifically, instead of just making a dichotomous choice to indicate whether statement A or statement B is more like them as in a dichotomous FC measure, graded FC measures allow respondents to indicate the degree to which statement A is more like them (e.g., "much more like me" vs. "slightly more like me") when compared to statement B. This way, respondents can provide more refined information regarding the relative degree of match between statements and themselves, which is likely to contribute to improved reliability (Bürkner, 2022). Moreover, allowing respondents to provide more refined information may also increase their sense of autonomy and make them feel less "forced" (Dalal et al., 2021), leading to improved respondent reactions. While this graded comparison paradigm has been adopted in marketing research to study customer preferences under the names of ordinal paired comparisons (Böckenholt & Dillon, 1997), graded paired comparisons (De Beuckelaer et al., 2015), and constant sum paired comparisons (Skedgel et al., 2015), its application to the assessment of noncognitive individual differences, such as personality, is still at its inception. To date, only two published studies have provided preliminary evaluations of the psychometric performance of graded FC measures of personality (Brown & Maydeu-Olivares, 2018) and respondent reactions (Dalal et al., 2021). Many critical properties of the graded FC format remain to be evaluated. The next section expands on four critical aspects of graded FC measurement that were investigated in the current study.

Four critical aspects of graded FC measurement
The broad questions that we are interested in include (1) the psychometric performance of graded FC measures compared to that of dichotomous FC measures and their LK counterparts, (2) respondents' general reactions to graded FC measures compared to the reactions to dichotomous FC measures and their LK counterparts, (3) the degree to which extreme and/or middle response styles are reintroduced into graded FC measures compared to their LK counterparts, and, relatedly, (4) whether a middle response option should be included in graded FC measures. Answers to these questions are critical for the application and future development of graded FC measurement.

Psychometric performance
It is important to evaluate the following aspects of the psychometric performance of graded FC measures, as these are properties fundamental to the accuracy and effectiveness of measurement: (1) fit of the hypothesized factor structure, (2) structural validity (factor loading patterns and sizes), (3) reliability, (4) convergent validity, (5) discriminant validity (correlations with theoretically distinct factors), (6) criterion-related validity (correlations with theoretically relevant external variables), and (7) agreement between self- and other-ratings. Meanwhile, in addition to evaluating against an absolute standard, it is also critical to assess the performance of the graded FC measures relative to the dichotomous FC measure and their LK counterparts.
Thus far, no study has systematically compared graded vs. dichotomous FC scales on psychometric performance. As graded FC measures allow respondents to make finer differentiations regarding their preference than dichotomous FC measures, we expect scores from graded FC measures to be more reliable than scores from traditional dichotomous FC measures when statements and scale length are held constant (Brown & Maydeu-Olivares, 2018; Maydeu-Olivares et al., 2009). Higher reliability is likely to lead to higher convergent and criterion-related validity. Previous studies suggest that, due to increased power to detect model misfit, more response options may lead to worse observed model fit (Maydeu-Olivares et al., 2017). On the other hand, more response options may also result in better respondent reactions and thus higher data quality, which may improve model fit. Therefore, we do not have directional hypotheses regarding the performance of graded vs. dichotomous FC scales in terms of model fit. Similarly, we also do not have a directional hypothesis on the impact of the number of response options on structural validity and discriminant validity.
Although FC and LK measures have been compared to each other in some studies (Brown & Maydeu-Olivares, 2011; Guenole et al., 2018; Wetzel & Frick, 2020; Zhang et al., 2020), most of them only compared dichotomous FC measures with graded LK measures and did not keep the number of response options constant across formats. Such a study design makes it hard to conclude whether the observed differences are attributable to response format (FC vs. LK) or to different numbers of response options, especially given that the number of response options has been shown to substantially impact model fit (Xia & Yang, 2019) and the accuracy of item parameter estimates (DiStefano & Morgan, 2014). Brown and Maydeu-Olivares (2018) is the only study that compared a graded FC measure to a graded LK measure with 5 response options included in both. Despite the mostly favorable evidence for graded FC measures on model fit, reliability, and convergent and discriminant validity, the study design, in which all respondents completed the LK measure after the graded FC measure, could be a potential factor that negatively impacted the accuracy of responses to the LK measure. As it has been consistently reported that respondents find FC measures cognitively demanding (Zhang et al., 2020), in the study reported by Brown and Maydeu-Olivares (2018), the resulting fatigue from the FC measure could partially explain the worse fit and lower discriminant validity of the LK measure in comparison to the graded FC measure. Also, when comparing criterion-related validity between dichotomous FC measures and graded LK measures, most studies only used a limited number of subjective criterion variables measured in a graded LK format, thus giving LK measures an unfair advantage due to the potential impact of common method variance shared between the LK focal measures and the LK criterion measures. A more robust comparison requires the use of more objective criterion variables measured in non-LK formats (Wetzel et al., 2020). All told, although previous studies have compared traditional dichotomous FC measures with graded LK measures, due to the abovementioned limitations, we do not have directional hypotheses regarding the comparisons between FC and LK measures with the same number of response options.
Aside from the above-mentioned six aspects of psychometric performance that are commonly examined, when evaluating noncognitive measurement tools, especially personality measures, another important but less studied aspect of psychometric performance should also be considered: the agreement between self- and other-ratings. Multiple-rater designs are very common in both low- and high-stakes contexts. In low-stakes situations, researchers often collect self- and other-reported data on the focal person's personality traits to obtain a deeper understanding of the phenomenon being studied (Kim et al., 2019). In high-stakes situations, 360-degree ratings are routinely employed to enhance the accuracy of hiring/promotion decisions (Oh & Berry, 2009). In both cases, we would like to achieve at least a moderate degree of consistency across raters to ensure that different raters tap into the same psychological construct/process (e.g., the same personality trait). However, meta-analytic findings based on LK measures depict a slightly gloomy picture: the self-other rating correlations range from 0.32 to 0.43 for the Big Five personality factors (Connelly & Ones, 2010). The relatively low interrater correlations are at least partially due to various rater-specific rating biases. According to Brown et al. (2017) and Wetzel and Frick (2020), dichotomous FC measures, on the other hand, seem to have the potential to yield higher cross-rater consistency than graded LK measures, suggesting that dichotomous FC measures enjoy a higher level of immunity from various rating biases than graded LK measures. Given that scores from graded FC measures can potentially be more reliable than those from dichotomous FC measures, and the resulting smaller measurement error can lead to less attenuated estimation of bivariate correlations (e.g., correlations between raters), we expect graded FC measures to yield even higher cross-rater consistency than dichotomous FC measures. If this hypothesis holds, we may improve the effectiveness of multiple-rater designs by using graded FC measures. However, no studies have explored this possibility.
In the present study, to systematically evaluate the seven aspects of psychometric performance of graded FC measures, we first compared two graded FC measures to a dichotomous FC measure, and then compared graded FC measures with their LK counterparts with the same number of response options.

Respondent reactions
In addition to psychometric performance, respondent reactions to measures are also crucial to assessment effectiveness. As of now, only one study has examined respondents' reactions to graded FC measures of personality. Specifically, Dalal et al. (2021) found that, compared to the dichotomous FC scale, MTurk respondents considered a graded FC measure of the Big Five personality factors with 4 response options as allowing them more opportunity to perform (d = 0.11) and as more appropriate for making hiring decisions (d = 0.10). Respondents were also more attracted to (d = 0.14) and more likely to recommend (d = 0.13) organizations that used graded rather than dichotomous FC measures. However, some design features of this study may render the conclusions less informative than we would hope. First, a computerized adaptive version of the FC measures was used, in which each respondent was exposed to different statement pairs (there were 350,000 unique statement pairs in the pool). This design makes it hard to distinguish the effects of response format (dichotomous vs. graded FC) from the effects of statement content. Second, according to the description of the study procedure, most dimensions of respondent reactions assessed concerned job application experiences (e.g., withdrawal intentions, recommendation intentions, organizational attraction), despite all test-takers answering the FC scales in a research-only context. The mismatch between the measured variables and the data collection context may limit the external validity of the findings. Even though more general reactions (such as motivation and enjoyment) were measured alongside recruitment-specific reactions, different dimensions of the general reactions were combined into a composite score rather than being analyzed separately. Finally, given that LK is still the mainstream format, it would be desirable to also include respondent reactions to LK measures for comparison.
Therefore, in the current study, for respondent reactions, we not only compared graded FC with dichotomous FC measures but also examined their LK counterparts with the same number of response options. To rule out the confounding effect of item content, we ensured the same item content across formats. Moreover, we focused on seven specific aspects of respondents' general reactions that were well aligned with the data collection context to ensure the external validity of the findings. Given the multidimensional nature of respondent reactions, we analyzed each aspect separately.

Response styles
When evaluating graded FC measures, response styles are also a potential concern, as they may undermine validity. Response styles refer to individuals' idiosyncratic ways of using response options regardless of item content (Rorer, 1965). For example, some people may have a general tendency to use extreme options (extreme response style [ERS]: "strongly agree" or "strongly disagree"), while others may have a general tendency to use the middle option (middle response style [MRS]: "neither agree nor disagree"), regardless of the item content. Response styles are often considered noise for the measurement of focal traits (Plieninger, 2017). Specifically, previous studies have shown that the confounding effects of response styles can (1) deteriorate model fit, (2) undermine structural validity by biasing item parameter estimates, (3) artificially inflate reliability estimates, (4) compromise discriminant validity by inflating correlations among different traits, (5) reduce agreement between self- and other-ratings, and (6) bias latent trait estimates, especially when response styles are related to the focal traits (e.g., those who are higher on extraversion may also be more likely to display ERS; Baumgartner & Steenkamp, 2001; Johnson & Bolt, 2010; Plieninger, 2017). Moreover, unmodelled effects of response styles may lead to violations of measurement invariance (Bolt & Johnson, 2009; Morren et al., 2012). Thus, the confounding effects of response styles pose a substantial threat to measurement validity (Cronbach, 1950).
In the dichotomous FC format, given that only two response options are available, response styles are not a concern, as there is no room for respondents to display their general response tendencies. However, in the graded FC format, due to the availability of multiple response options, Lee et al. (2022) raised the concern that ERS or MRS might be reintroduced. We note that this is in fact not a yes-or-no question but a question of degree: to what degree are response styles present? The degree of the presence of response styles can be quantified by the average correlation among response style indicators and the strength of the first eigenvalue (as detailed in the Data Analysis section). In the present study, we examined the absolute strength of ERS and/or MRS, as well as their strength relative to those found in the LK scales with the same number of response options. If the magnitude is small, response styles may not have a large impact and their effects can be safely ignored; if the magnitude is moderate to large, researchers should be cautious and reconsider the use of graded FC measures. The present study provides the first piece of empirical evidence to help researchers gauge the magnitude of response styles in graded FC measures in comparison to LK measures.
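As a rough illustration of how such indices can be computed, the Python sketch below (our own simplification; in practice, response style indicators are typically built from item sets balanced across traits, as detailed in the Data Analysis section) recodes responses into style indicators and summarizes their shared variance via the mean inter-indicator correlation and the first eigenvalue.

```python
import numpy as np

def style_strength(responses, target_options):
    """Gauge the presence of a response style (e.g., ERS or MRS).

    `responses` is a respondents-by-items array of option codes.
    Each item is recoded into a 0/1 indicator of whether one of the
    `target_options` was chosen; the function returns (a) the mean
    off-diagonal correlation among the indicators and (b) the first
    eigenvalue of their correlation matrix. Larger values suggest a
    stronger, trait-like response style.
    """
    indicators = np.isin(responses, target_options).astype(float)
    corr = np.corrcoef(indicators, rowvar=False)
    off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]
    first_eigenvalue = np.linalg.eigvalsh(corr)[-1]  # eigvalsh sorts ascending
    return off_diagonal.mean(), first_eigenvalue
```

For a 5-point scale, an ERS index would use `target_options=(1, 5)` and an MRS index `target_options=(3,)`; the variable and function names here are ours, not the study's.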

Middle response option or not?
A particularly important question about graded FC measurement is whether a middle response option (e.g., "both statements are equally like me") should be included. This issue is closely intertwined with the three aspects discussed above. Specifically, if a middle response option is provided, respondents may not feel "forced" to make a choice when they evaluate the two statements as describing them with equal accuracy. Such a more user-friendly and less demanding response process may lead to more positive respondent reactions (Jenadeleh et al., 2023). Meanwhile, as a byproduct, the amount of random error may be reduced to some degree because respondents do not have to make a choice when they truly have no preference, leading to increased reliability. On the other hand, it is also possible that respondents may use the middle response option as a safety net when they feel indifferent or undecided (Nadler et al., 2015), particularly in the case of a cognitively demanding paired comparison, where choosing the middle option may be an effective way to avoid exerting effort (Johns, 2005; Sturgis et al., 2014). Such unintended uses of the middle option can lead to worse model fit, biased parameter estimates, and compromised reliability and validity (Jin & Wang, 2018; Liu et al., 2017; Soland & Kuhfeld, 2021). Furthermore, the presence of a middle option may allow MRS to return, which would not be an issue if there were no middle option at all. In sum, there may be a tradeoff between respondent reactions and psychometric performance in the use of a middle option in graded FC measures, and empirical evidence is needed to examine the degree of this tradeoff. Therefore, in this study, we compared graded FC scales with 5 vs. 4 response options to investigate the impact of including/excluding the middle option.

The present study
The present study aims to fill the gaps in the literature and address limitations of previous studies to gain a comprehensive understanding of the performance and pros and cons of graded FC measures. Data from two samples with over 4,000 participants were used. In Sample 1, we focused on psychometric performance, respondent reactions, and response styles. Data from Sample 2 were specifically collected to examine the agreement between self- and other-ratings. The performance of graded FC measures was compared to that of dichotomous FC measures, and the FC measures (both dichotomous and graded) were also compared to their LK counterparts with the same number of response options. Across the two samples, a fully crossed 2 (response format: FC, LK) × 3 (number of response options: 2, 4, 5) mixed design was adopted. Specifically, response format was a within-participants factor such that all respondents completed FC and LK measures of the Big Five personality traits with the same number of response options. The number of response options was a between-participants factor such that each respondent was only exposed to one of the three levels. In total, there were six versions of personality scales (FC2, FC4, FC5, LK2, LK4, and LK5) that shared identical statements. To control for the confounding effect of scale length, we ensured that all focal scales had a length of 60. The presentation order of the FC and the LK measures was randomized within each condition to counterbalance potential order effects. For criterion measures, both subjective and objective criterion variables measured in different ways (e.g., LK format, checklist, and resource allocation paradigm) were included to better compare the criterion-related validity of graded FC measures to that of the dichotomous FC measures and their LK counterparts. This work is exploratory in nature, as we are stepping into promising but largely uncharted territory.
Although faking-resistance is an appealing feature for applying FC measurement in the personnel selection context, FC measures have other advantages that make them useful in both low- and high-stakes contexts. As reviewed previously, FC measurement is largely immune to various response biases/styles (Kreitchmann et al., 2019), features that are equally attractive for research and practical use in general (e.g., construct structure, predictive validity of constructs, moderating effects of constructs, cross-cultural comparisons, developmental trajectories). We believe that it is critical to first establish foundational knowledge about the overall pros and cons of graded FC measures in general contexts before examining their application in high-stakes contexts specifically. Therefore, to maintain a reasonable scope, the current study only focused on general research contexts and did not delve into the application of graded FC measures in high-stakes contexts (e.g., the issue of fakability).

Sample 1
Students from 3 sessions of the Psychometrics course taught by the third author in the Spring and Fall semesters of 2021 formed groups of 3-5 people and were required to collect empirical data from at least 30 respondents per group for their final projects (exceptions could be made upon reasonable justification). Each group was encouraged to reach out to friends and colleagues across various geolocations and majors for participant diversity. When their friends or colleagues agreed to participate, students sent them the link containing all survey items. If participants wanted to receive personalized feedback (learning how to interpret results was one of the course objectives), they were asked to provide their email addresses at the end of the survey. No monetary incentive was provided. Three advantages of recruiting participants in this way were that (1) participants were diverse in terms of geolocations and areas of specialty, (2) participants were more likely to provide authentic responses because most respondents were not professional "survey-takers" like those from MTurk or Prolific, and (3) participants were more likely to be the target audience of personality assessment because most of them would be on the job market in a year or two. Data from all groups within the same session were compiled into a single file as the final dataset for the final project. Email addresses were stripped before the dataset was distributed to students.
There were three conditions, distinguished by the number of response options for the focal scales. Specifically, in Condition 1, respondents were presented with the traditional dichotomous FC scale and its LK counterpart with 2 response options; in Condition 2, a graded FC scale and its LK counterpart with 4 response options; and in Condition 3, a graded FC scale and its LK counterpart with 5 response options. As we had to ensure that each group of students within a session had the same task for the sake of consistency and fairness, we could not randomly assign respondents to the three conditions. Instead, students from the first, second, and third sessions worked with Conditions 1, 2, and 3, respectively. Within each condition, we counterbalanced administration order by randomly having about 50% of the participants finish the FC measure first and the LK measure second, with the other 50% completing the surveys in the reverse order. After finishing each format, respondents were immediately presented with the respondent reactions measures, with instructions tailored to that format. All respondents across the three conditions also answered identical demographic questions at the beginning and completed the same criterion measures in the final section.
Given that no monetary incentive was provided, we expected some inattentive responses that might negatively impact our findings (Huang et al., 2015). Therefore, we inserted 8 quality control items across the survey to screen out potentially inattentive respondents: 5 directed responses (e.g., "please choose response Agree") and 3 impossible questions (e.g., "I eat cement sometimes"; Huang et al., 2015). Respondents who passed at least 6 of the quality control items were retained in the final samples. In total, we received 1,162, 2,260, and 868 responses for Conditions 1-3, respectively. After excluding inattentive respondents, we had 1,059 (91.13%), 1,869 (82.37%), and 757 (87.21%) valid responses, with average ages of 19.42 ± 2.36, 26.77 ± 9.08, and 19.77 ± 0.96 in the three conditions. Among the retained participants, there were 61%, 62%, and 59% females in Conditions 1-3. Participants in each condition came from diverse geographic locations (over 70 cities) and areas of specialty (over 30).
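The attention screen described above reduces to a simple retention rule: keep a respondent only if they pass at least 6 of the 8 quality-control items. A minimal sketch in Python (the function name and the 0/1 data layout are our own, not from the article):

```python
def keep_respondent(qc_passed, min_pass=6):
    """Return True if the respondent passes the attention screen.

    qc_passed: list of 0/1 flags, one per quality-control item
    (1 = the directed-response or impossible item was answered correctly).
    """
    return sum(qc_passed) >= min_pass

# A respondent who misses two of the eight checks is still retained:
print(keep_respondent([1, 1, 1, 1, 1, 1, 0, 0]))  # True
# One who misses three is screened out:
print(keep_respondent([1, 1, 1, 0, 1, 0, 0, 1]))  # False
```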

Sample 2
Data from Sample 2 were collected to examine the agreement between self- and other-ratings for the graded FC personality measures. The same three conditions were set up in Sample 2. Specifically, respondents in Conditions 1 and 2 were undergraduate and graduate students recruited through on-campus advertisements during the Fall semester of 2021. The advertisement stated that we were looking for pairs of paid participants who knew each other well to participate in a study examining how similar their own personality profiles were to their partners' and how well they knew each other's personality. Interested participants contacted our research assistants to sign up for the study. Data collection from each pair of participants was conducted in two sessions. In Session 1, each participant finished the self-report versions using the same scales as in Sample 1, and answered the same demographic questions and criterion measures as those used for Sample 1. About one week later, in Session 2, each pair of participants was invited back to finish the other-report versions of the personality scales to assess their partners' personality. Respondent reactions were not assessed in Sample 2. Participants were randomly assigned to Conditions 1 and 2.
For Condition 3, participants were students in a campus-wide General Psychology course taught by the third author in the Spring semester of 2021. Interested students were also required to find someone who knew them well to participate in the survey. Participants first completed the self-report personality surveys and then, about one week later, completed the other-report surveys to rate their partners' personality. Across the three conditions, all questionnaires were presented using the online platform used in Sample 1. We implemented 6 and 4 quality control questions for the self-rating and other-rating surveys, respectively. Respondents were allowed to miss one quality control item on each survey. As with Sample 1, we counterbalanced the administration order of the FC and LK measures within each condition in each session. Each participant in Conditions 1 and 2 was compensated $5.00 for participation. Participants in Condition 3 received course credit.
All data collection procedures were approved by the Institutional Review Board at Beijing Normal University under the title "Approaches to Reducing Measurement Error in Self-Reported Personality Assessment" (202104260023).

Focal personality measures
Forced-choice measure of personality (Samples 1 & 2)
We adopted the Forced-Choice Five Factor Markers (Brown & Maydeu-Olivares, 2011) as the focal measure of the Big Five personality factors. There were 12 statements per factor (8 positively worded and 4 negatively worded). The original scale consisted of 20 triplet blocks (statements A, B, and C within each block) in which respondents were asked to choose the statement that was most like them and the one that was least like them. Each block would then be recoded into three paired comparisons (AB, AC, and BC) for modeling and scoring. As the graded FC format requires paired comparisons, we presented the 60 paired comparisons (20 ABs, 20 ACs, and 20 BCs) separately, so that each statement was shown twice in two different pairs. To minimize potential interference from repeated statements, we first presented the 20 AB pairs, then the 20 AC pairs, and then the 20 BC pairs (see Figure 1 for the response options in the different conditions). When used for the other-ratings in the second sample, we rephrased the items and response options to refer to a third person. We chose this scale because it was designed to minimize the degree of ipsativity in person scores by including 30 pairs of statements keyed in the same direction (both with positive or both with negative factor loadings) and 30 pairs keyed in opposite directions (one with a positive loading and the other with a negative loading). Several studies have shown that ipsativity is not an issue for this scale and that it works well in Spain, America, and China (Brown & Maydeu-Olivares, 2011, 2018; Lee et al., 2021; Zhang et al., 2022).
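The block-to-pair recoding described above can be sketched as follows. This is a minimal illustration of the standard Thurstonian IRT convention (a pair is coded 1 when the first statement is preferred); the function name and rank-based input layout are our own:

```python
from itertools import combinations

def block_to_pairs(ranking):
    """Recode a most/least-like-me triplet response into three binary
    paired comparisons (AB, AC, BC): 1 if the first statement in the
    pair is preferred over the second.

    ranking: dict mapping statement labels to preference ranks
    (1 = most like me, 3 = least like me).
    """
    pairs = {}
    for s1, s2 in combinations(sorted(ranking), 2):  # ('A','B'), ('A','C'), ('B','C')
        pairs[s1 + s2] = 1 if ranking[s1] < ranking[s2] else 0
    return pairs

# A respondent who picks A as "most like me" and C as "least like me":
print(block_to_pairs({"A": 1, "B": 2, "C": 3}))  # {'AB': 1, 'AC': 1, 'BC': 1}
```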

Criterion measures
Among the criterion variables, self-rated health, subjective well-being, and depressive symptoms were considered more subjective because they were measured with LK scales. The other variables were considered more objective because they were measured with formats other than LK. These objective criterion variables served to ensure fairer comparisons of the criterion-related validity of personality scores derived from FC and LK measures. We chose these criteria because they have consistently been shown to be substantially related to at least one of the Big Five factors.
Self-rated health (Samples 1 & 2)
Responses were reverse coded so that a higher score indicated better health. Self-rated health has been shown to be positively related to conscientiousness and extraversion, and negatively related to neuroticism (Luo et al., 2022).

Social value orientation (Samples 1 & 2)
We used the 6 core scenarios of the resource allocation task developed by Murphy et al. (2011) to measure social value orientation, operationalized as the magnitude of concern an individual has for others. Given the way in which the scores were calculated, we could not compute Cronbach's alpha for this scale. Social value orientation has been found to be moderately related to agreeableness (Hilbig et al., 2014).

Leadership experience (Samples 1 & 2)
We assessed college leadership experience by asking respondents whether they had engaged in the following activities (1 = Yes; 0 = No): (1) taking on a leadership role in your class; (2) taking on a leadership role in a department-level student organization; (3) taking on a leadership role in a university-level student organization; (4) taking on a leadership role in any other student organization; (5) taking a group leader role in any course project; (6) organizing dormitory activities. Responses across the 6 items were averaged as a proxy for leadership experience. As we consider leadership experience a formative construct (Diamantopoulos et al., 2008), Cronbach's alpha is not appropriate. Leadership experience has been found to be positively related to extraversion, agreeableness, and conscientiousness, and negatively correlated with neuroticism (Judge et al., 2002; Wilmot et al., 2019).

Charity behaviors (Samples 1 & 2)
Charity behaviors were measured by asking respondents whether they had done the following in the past year (1 = Yes; 0 = No): (1) donating money; (2) volunteering; (3) donating blood; (4) donating personal protective equipment to frontline workers fighting COVID-19; (5) donating clothes; (6) buying food for homeless people or giving them money. Responses across the 6 items were averaged as a proxy for charity behaviors. As we consider charity behaviors a formative construct, Cronbach's alpha is not appropriate. Charity behaviors have been shown to be positively related to agreeableness (Habashi et al., 2016).

GPA (Samples 1 & 2)
We asked students to report their GPA, which has been shown to be highly correlated with actual GPA obtained from the Office of the Registrar (Kuncel et al., 2005). As different universities may use different GPA systems, respondents also reported the theoretical maximum of their GPA system. We then divided each GPA by its theoretical maximum to transform all GPAs to the same 0-1 scale. McAbee and Oswald (2009) meta-analytically showed that GPA is positively related to conscientiousness and openness, and negatively related to neuroticism.

Number of friends on social media (Samples 1 & 2)
Respondents reported the number of friends they had on social media. Extraversion has been shown to be positively related to the number of friends on social media (Wetzel et al., 2020).

Data analysis
Most analyses were conducted in R (version 4.1.0; R Core Team, 2021). For the three FC measures, the TIRT model was first fitted using the R package lavaan (version 0.6-9; Rosseel, 2012) because it provides unbiased estimates of the Standardized Root Mean Squared Residual (SRMR) and the confidence interval of SRMR (Shi et al., 2020), which are not yet available in Mplus. However, personality score estimates provided by the "predict" function in lavaan do not come with conditional standard errors, which are necessary for estimating empirical reliability. Therefore, we switched to Mplus 8.5 (Muthén & Muthén, 1998-2017) to obtain maximum a posteriori (MAP) personality score estimates and the associated conditional standard errors (we fitted the same model in Mplus 8.5 and obtained estimates identical to those from lavaan). As recommended by Brown and Maydeu-Olivares (2018), we used the unweighted least squares means and variances adjusted (ULSMV) estimator with theta parameterization. For the three LK scales, we fitted a five-factor confirmatory factor analysis model treating items as ordinal, using the same ULSMV estimator and theta parameterization. To make personality scores more comparable across the FC and LK scales, we also estimated MAP scores and the associated conditional standard errors instead of using average scores. For the sake of simplicity, respondent reactions and criterion measures were scored by taking the average across items after proper reverse coding. Data cleaning and simple analyses, such as the calculation of descriptive statistics, were conducted using the R package psych (version 2.1.9; Revelle, 2021). Note that all item parameters were estimated in Sample 1. We did not re-estimate the models in Sample 2 due to its smaller sample size; instead, we used the item parameters obtained from Sample 1 to derive personality scores for Sample 2.
To examine whether the two graded FC formats reintroduced response styles, we first recoded each original response to each paired comparison into two types of response style indicators representing ERS and MRS. Each MRS indicator (applying only to FC5 and LK5) was scored "1" if respondents chose the middle option ("A and B are equally like me") and "0" otherwise. Each ERS indicator was scored "1" if respondents chose one of the two extreme options ("A is much more like me" or "B is much more like me"), and "0" if they chose one of the two non-extreme options ("A is slightly more like me" or "B is slightly more like me"). Responses to LK4 and LK5 were recoded into the corresponding indicators following the same scheme. To examine the strength of each response style within each scale, we first estimated the tetrachoric correlations among the 60 indicators using the lavCor function in lavaan. The average of these correlations was computed for each type of response style within each scale, serving as a crude measure of the strength of that response style. We also calculated the first eigenvalue of each tetrachoric correlation matrix as a more refined measure: while average correlations provide little information about the dimensionality of the data, the size of the first eigenvalue better indicates whether there is a general response style effect and how strong it is. We then estimated the correlations between the sums of each response style indicator across the LK and FC measures to test whether the same response style effect was present in both formats.
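The recoding scheme above can be sketched in a few lines. The numeric category codes (1-5) and the treatment of the middle option as missing for the ERS indicator are our assumptions, since the exact coding conventions are not spelled out in the text:

```python
# Assumed category codes for a 5-option graded pair:
# 1 = "A is much more like me"      (extreme)
# 2 = "A is slightly more like me"  (non-extreme)
# 3 = "A and B are equally like me" (middle; FC5/LK5 only)
# 4 = "B is slightly more like me"  (non-extreme)
# 5 = "B is much more like me"      (extreme)

def mrs_indicator(response):
    """1 if the middle option was chosen, 0 otherwise."""
    return 1 if response == 3 else 0

def ers_indicator(response, n_options=5):
    """1 for an extreme option, 0 for a non-extreme one. The middle
    option (5-option scales only) is returned as None here, i.e.
    treated as missing for ERS -- an assumption, not a stated detail."""
    if n_options == 5 and response == 3:
        return None
    return 1 if response in (1, n_options) else 0

responses = [1, 3, 4, 5, 2]
print([mrs_indicator(r) for r in responses])  # [0, 1, 0, 0, 0]
print([ers_indicator(r) for r in responses])  # [1, None, 0, 1, 0]
```

Summing each indicator across the 60 paired comparisons then gives the per-respondent response style scores whose cross-format correlations are reported below.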
For model fit evaluation, we reported the robust chi-square, degrees of freedom, robust Comparative Fit Index (CFI), robust Tucker-Lewis Index (TLI), and robust Root Mean Square Error of Approximation (RMSEA). However, the chi-square test of exact fit is very sensitive to the minor misfit that is almost unavoidable for personality scales (Hopwood & Donnellan, 2010) when the sample size is large (Barrett, 2007). In addition, CFI, TLI, and RMSEA have consistently been shown to perform poorly for ordinal models (Nye & Drasgow, 2011; Xia & Yang, 2019).
The most promising measure is the unbiased SRMR (Shi et al., 2020), which can be obtained from lavaan using the "lavResiduals" function. Therefore, while we reported all the indices mentioned above, we mainly relied on the unbiased SRMR for the interpretation of model fit. While we were primarily interested in comparing the relative fit of the same Big Five model across formats, we include here a guideline on interpreting the unbiased SRMR so that readers can gauge the absolute degree of fit. Previous studies showed that the size of SRMR is related to average item communality (R²) when holding model misspecification constant (Shi et al., 2020; Ximénez et al., 2022). Therefore, the substantive interpretation of SRMR should be contingent on the average item communality. Specifically, SRMR/R² values of 0.05 and 0.10 were suggested as cutoffs for "close" and "adequate" fit (Ximénez et al., 2022). As shown later, in the current study the average standardized factor loadings were similar across the 6 versions (M = 0.665), which corresponds to an R² of 0.442. Thus, in the current study, the SRMR cutoffs for close and adequate fit would be 0.022 and 0.044, respectively.
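The cutoff arithmetic above reduces to multiplying the suggested SRMR/R² thresholds by the average item communality:

```python
mean_loading = 0.665        # average standardized loading across the 6 versions
r2 = mean_loading ** 2      # average item communality (R^2)
print(round(r2, 3))         # 0.442
print(round(0.05 * r2, 3))  # 0.022 -> cutoff for "close" fit
print(round(0.10 * r2, 3))  # 0.044 -> cutoff for "adequate" fit
```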
Given the large sample sizes and the practical nature of the study, we relied on effect sizes instead of p-values to evaluate potential differences in respondent reactions.Specifically, we consider Cohen's d smaller than 0.20 as practically ignorable.

Descriptive statistics
Means and standard deviations for the LK-measured personality factors and all criterion variables in the three conditions of the two samples can be found in Table 1. As can be seen, in Sample 1, respondents in the three conditions were quite similar to one another on most variables, except that respondents in Condition 2 had more social media friends (d CON1-2 = −0.65; d CON3-2 = −0.59), their parents on average had lower education levels (mother: 25), and they had relatively higher self-reported academic ranking (d CON1-2 = −0.27; d CON3-2 = −0.30).
In Sample 2, respondents in Conditions 1 and 2 were also very similar to each other in most respects, except that respondents in Condition 1 had relatively higher self-reported GPA (d CON1-2 = 0.20) and their fathers had relatively lower levels of education (d CON1-2 = −0.20). Respondents in Condition 3 had fewer social media friends (d CON1-3 = 0.52; d CON2-3 = 0.51), lower self-reported GPA (d CON1-3 = 0.32; d CON2-3 = 0.21), and parents with relatively lower levels of education (father: d CON1-3 = 0.20; d CON2-3 = 0.38; mother: d CON1-3 = 0.22; d CON2-3 = 0.39) than respondents in Conditions 1 and 2. Except for the difference in the number of social media friends, all other differences were relatively small.

Psychometric performance
In this section, we focus on comparisons between the graded FC and the dichotomous FC formats, as well as comparisons between the FC formats and their LK counterparts, on seven aspects of psychometric performance: (1) fit of the Five-Factor Model, (2) structural validity, (3) reliability, (4) convergent validity, (5) discriminant validity, (6) criterion-related validity, and (7) the agreement between self- and other-ratings. Sample 1 was used to examine psychometric properties 1-6 and Sample 2 was used to examine property 7.

Fit of the Five-Factor model
The same Five-Factor model was fitted to the three FC scales and the three LK scales in Sample 1. Model fit information is shown in Table 2. Two patterns are clear. First, among the three FC formats, the two graded formats outperformed the dichotomous format, as indicated by the non-overlapping confidence intervals of the unbiased SRMR. The two graded FC formats showed similarly acceptable fit with overlapping confidence intervals (SRMR FC4 = 0.061, 90% CI [0.059, 0.063]; SRMR FC5 = 0.064, 90% CI [0.061, 0.067]). Second, all fit indices unanimously suggested that the Five-Factor model fitted responses to the FC scales better than responses to their LK counterparts with the same number of response options.
Thus, regarding the fit of the Five-Factor model, the two graded FC formats were overall superior to the traditional dichotomous FC format, and across formats, FC measures outperformed LK measures.

Structural validity
Standardized factor loadings for the three FC scales and the three LK scales are shown in Table 3. The most important pattern is that factor loadings were very similar across the six versions (M FC2 = 0.64 ± 0.14; M FC4 = 0.69 ± 0.18; M FC5 = 0.63 ± 0.13; M LK2 = 0.69 ± 0.16; M LK4 = 0.64 ± 0.12; M LK5 = 0.70 ± 0.16). We further calculated Tucker's Congruence Coefficient (TCC; Tucker, 1951) across the six versions for each of the five factors. According to Lorenzo-Seva and Ten Berge (2006), a TCC greater than 0.95 can be interpreted as evidence for identical factor solutions. In our sample, the lowest of the 5 × (6 × 5 / 2) = 75 possible TCCs was 0.95, for the extraversion factor measured by FC5 and LK5.
In sum, these findings provide strong evidence that the graded FC formats can almost perfectly retain the factor solutions originally identified in LK measures.Users of the graded FC formats apparently do not have to worry about sacrificing structural validity.

Reliability
As the standard errors of person score estimates are conditional on latent trait levels under the IRT framework, the most appropriate way to evaluate reliability is to plot person score estimates against their standard errors. We present such plots in Figure 2. Across the 10 plots, the X-axis represents the person score estimates and the Y-axis the standard errors. Blue, red, and gray dots correspond to scales with 2, 4, and 5 response options, respectively. Four patterns are clear. First, for both the FC and LK formats, the more response options, the lower the standard errors. Second, the two graded FC formats clearly outperformed the dichotomous FC format in measuring individuals at the two ends of the latent trait continuum, who are often the focus of personnel selection and clinical diagnosis. Third, LK measures displayed slightly smaller standard errors than their FC counterparts with the same number of response options. Fourth, the distribution of standard errors was more uniform across the trait continuum for the two graded FC formats than for their LK counterparts, perhaps because the Five-Factor model fitted responses to the graded FC formats better. Empirical reliability was also computed using the formula presented in Brown and Maydeu-Olivares (2018) to evaluate the overall reliability of scores derived from each format (see Table 4). The overall pattern is consistent with Figure 2: scales with 2 response options had substantially lower reliability than scales with 4 or 5 response options, and this discrepancy was particularly salient for the FC scales. For example, the empirical reliabilities of agreeableness and openness scores obtained from the dichotomous FC scale were 0.59 and 0.61, which leaped to 0.76 and 0.81 when assessed with the graded FC scale with 4 response options. Again, LK scales had somewhat higher reliability than their FC counterparts with the same number of response options.
In sum, the two graded FC formats provided substantially more reliable assessments than the dichotomous FC format. Personality scores estimated from the LK scales were somewhat more reliable than those from the FC formats with the same number of response options; however, the differences were small when there were 4 or 5 (rather than 2) response options.
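As a rough sketch of how empirical reliability can be computed from MAP score estimates and their conditional standard errors (this is our reading of the general IRT formula; see Brown & Maydeu-Olivares, 2018, for the exact expression used in the study):

```python
import numpy as np

def empirical_reliability(theta_hat, se):
    """Empirical reliability: the ratio of the variance of the trait
    estimates to that variance plus the mean squared conditional
    standard error."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    se = np.asarray(se, dtype=float)
    var_hat = theta_hat.var()
    return var_hat / (var_hat + np.mean(se ** 2))

# Toy example: estimates with variance 2/3 and unit standard errors
print(round(empirical_reliability([-1.0, 0.0, 1.0], [1.0, 1.0, 1.0]), 2))  # 0.4
```

Intuitively, wider conditional standard errors (as with the dichotomous FC scales) shrink this ratio, which is why FC2 lags FC4/FC5 in Table 4.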

Convergent and discriminant validity
As shown in Table 4, the convergent validity of personality scores estimated from the FC and LK scales was generally high and did not differ substantially across numbers of response options (M FC2-LK2 = 0.78; M FC4-LK4 = 0.77; M FC5-LK5 = 0.81). Moreover, across the three conditions, extraversion and agreeableness consistently showed the highest and lowest levels of convergent validity (M EXT = 0.85; M AGR = 0.73), respectively. Results for discriminant validity can be found in Table 5. It is clear that the Big Five factors were substantially more distinct from each other when measured with the FC formats than with their LK counterparts, whereas the differences between the dichotomous and graded FC formats were relatively small (M FC2 = 0.18 vs. M LK2 = 0.29; M FC4 = 0.29 vs. M LK4 = 0.40; M FC5 = 0.19 vs. M LK5 = 0.26). Across the six versions, the highest correlations were observed between extraversion and agreeableness (M = 0.48) and between extraversion and openness (M = 0.41).
In sum, the two graded FC formats displayed convergent validity on a par with the dichotomous FC format, and the three FC scales displayed better discriminant validity than their LK counterparts.

Criterion-related validity
Bivariate correlations between personality scores estimated from the six versions of the measures and each criterion variable are shown in Table 6. As we tested 13 criterion variables in total and were equally interested in all of them, we calculated double-entry intraclass correlations to quantify the similarity among criterion-related validity profiles (see Table 7). The double-entry intraclass correlation is appropriate because it takes the shape and elevation (mean) of profiles into consideration simultaneously (Furr, 2010). In Table 7, we also present the overall predictive validity of the five personality factors assessed by each of the six versions in the form of the coefficient of multiple correlation (the square root of R²; Kutner et al., 2004). It is clear that the criterion-related validity profiles were very similar across the six versions for all five factors (M ICC-E = 0.93; M ICC-A = 0.91; M ICC-C = 0.92; M ICC-N = 0.98; M ICC-O = 0.90). Regarding the coefficient of multiple correlation, the five personality factors also displayed very similar levels of overall validity across the six versions (M FC2 = 0.29; M FC4 = 0.25; M FC5 = 0.27; M LK2 = 0.31; M LK4 = 0.26; M LK5 = 0.28).
Among the criterion variables, leadership experience, charity behaviors, academic ranking, the number of friends on social media, and social value orientation were relatively more objectively measured, by either checklists or a resource allocation paradigm, whereas self-rated health, subjective well-being, and depression were more subjectively measured by LK scales. First, across the three versions of the FC measures, the two graded FC measures showed criterion-related validity comparable to the dichotomous FC measure. Second, the three FC scales did not consistently show superiority over the three LK scales in predicting objectively measured outcomes. Third, the three LK scales also did not show any systematic advantage over the three FC scales in predicting the three subjectively measured outcomes. Overall, FC4 and FC5 showed levels of criterion-related validity similar to FC2, and the three FC scales displayed levels of criterion-related validity similar to the three LK measures. Thus, response format (FC vs. LK) and the number of response options had little impact on the criterion-related validity of personality scores. This is very reassuring because it indicates that findings based on the traditional LK or dichotomous FC formats can be generalized to the graded FC formats.
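The double-entry intraclass correlation used for Table 7 can be sketched as follows: the two validity profiles are stacked in both orders and a single Pearson correlation is computed, which penalizes differences in both shape and elevation (our minimal reading of Furr, 2010; the function name is our own):

```python
import numpy as np

def double_entry_icc(profile_a, profile_b):
    """Double-entry ICC: Pearson r between the two profiles
    concatenated in both orders (a,b vs. b,a)."""
    a = np.asarray(profile_a, dtype=float)
    b = np.asarray(profile_b, dtype=float)
    return np.corrcoef(np.concatenate([a, b]), np.concatenate([b, a]))[0, 1]

# Identical 13-element validity profiles (hypothetical values) yield 1:
profile = [0.10, -0.05, 0.22, 0.31, -0.18, 0.04, 0.12,
           0.27, -0.09, 0.15, 0.08, -0.21, 0.19]
print(round(double_entry_icc(profile, profile), 2))  # 1.0
```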

Self-other rating agreement
Using data from Sample 2, we examined the convergence between self- and other-ratings of the five personality traits on three aspects: the bivariate correlation, which captures rank-order information; the intraclass correlation (ICC1), which captures absolute agreement between self- and other-ratings; and Cohen's d, which captures mean differences. Results are shown in Table 8. Several patterns are worth noting. First, although the average cross-rater bivariate correlations were very similar across the six versions (M FC2 = 0.37; M FC4 = 0.36; M FC5 = 0.36; M LK2 = 0.32; M LK4 = 0.36; M LK5 = 0.36), they differed substantially across the five factors. For example, across the six versions, the highest self-other correlations were observed for extraversion (r = 0.49-0.64), while agreeableness showed the lowest correlations in four of the six versions (r = 0.17-0.34). Second, relatively speaking, the magnitude of the self-other correlations varied less across the factors in FC5 than in the other five versions. These results were largely consistent with previous meta-analytic findings (Connelly & Ones, 2010). Third, similar patterns were observed for ICC1, except that the absolute agreement between self- and other-ratings was relatively weaker for LK2 than for the other five versions (M FC2 = 0.31; M FC4 = 0.31; M FC5 = 0.32; M LK2 = 0.25; M LK4 = 0.31; M LK5 = 0.30).
The mean differences between self- and other-ratings were consistent across the six versions in that others perceived the focal persons to be more extraverted (M Cohen's d = −0.55) and more open to new experiences (M Cohen's d = −0.48) than the focal persons' self-ratings indicated. Others also perceived the focal persons to be less neurotic in all versions (d FC2 = 0.21; d FC4 = 0.25; d LK2 = 0.31; d LK4 = 0.32; d LK5 = 0.25) except FC5 (d FC5 = 0.02). As for agreeableness, others perceived the focal persons to be more agreeable only when FC4 and FC5 were used (d FC4 = −0.34; d FC5 = −0.23). Others perceived the focal persons to be more conscientious only when FC2 and LK2 were used (d FC2 = −0.30; d LK2 = −0.44). Averaging the absolute values of the effect sizes across the five factors, FC5 showed a slightly smaller difference between self- and other-ratings (M FC5 = 0.28) than the other five versions (M FC2 = 0.34; M FC4 = 0.35; M LK2 = 0.37; M LK4 = 0.33; M LK5 = 0.33). The direction of these findings is consistent with meta-analytic findings (Kim et al., 2019), except for openness.
In sum, compared to the other five versions, FC5 yielded more consistent bivariate correlations and ICCs across the five factors and displayed relatively smaller mean differences across raters. FC4 performed similarly to FC2 and the three LK measures on all three aspects of convergence.

Respondent reactions
Descriptive statistics and paired-comparison results for respondent reactions are presented in Table 9. As our sample sizes were large and there were many paired comparisons (60 pairs), we focused on effect sizes instead of p-values. Cohen's d with magnitude smaller than 0.20 was considered not practically meaningful, to further simplify interpretation.
Comparing across the three FC versions (between-person comparisons), no meaningful differences were found on six of the seven respondent reactions (|Cohen's d| < 0.20), suggesting that the number of response options did not substantially affect respondent reactions to FC measures. The only exception was perceived difficulty: respondents considered FC4 easier than FC2 and FC5 (d FC2-4 = 0.39; d FC5-4 = 0.37). No meaningful difference in perceived difficulty was found between FC2 and FC5 (d FC2-5 = 0.01).
When comparing the FC and LK measures with the same number of response options (within-person comparisons), participants almost uniformly found the FC formats equally accurate but more difficult, more cognitively demanding, and more faking-resistant than the LK formats. Respondents reported better concentration when completing FC measures than LK measures in the conditions with 2 and 5 response options (d FC2-LK2 = 0.32; d FC5-LK5 = 0.44). In addition, respondents reported slightly less positive affect toward FC measures than LK measures in the conditions with 2 and 4 response options (d FC2-LK2 = −0.22; d FC4-LK4 = −0.25).
In sum, we found consistent evidence that, for FC measures, the number of response options had very limited impact on most aspects of respondent reactions, except that FC4 was perceived as slightly easier than FC2 and FC5. Comparing FC scales to their LK counterparts with the same number of response options, we found strong evidence that respondents generally perceived FC measures to be more faking-resistant, more difficult, and more cognitively demanding. They also reported better concentration when completing FC scales. These findings were consistent across the different numbers of response options.

Response styles
Results regarding extreme response style (ERS) and middle response style (MRS) are presented in Table 10. As for MRS, the average tetrachoric correlations among the 60 MRS indicators were 0.12 ± 0.09 and 0.15 ± 0.15 for FC5 and LK5, indicating no more than a weak MRS effect for either response format. The first eigenvalues were 8.52 and 9.52 for FC5 and LK5, corresponding to 14.2% and 15.87% explained variance, respectively, which argued against the presence of a general MRS effect. Moreover, the correlation between the numbers of endorsed middle options in FC5 and LK5 was only 0.19, providing little evidence for a general MRS effect.
Regarding ERS for the FC and LK scales with 4 response options, the average correlations were 0.30 ± 0.12 and 0.35 ± 0.12. The first eigenvalues were 19.18 and 21.92 for FC4 and LK4, with the first factors explaining 31.97% and 36.53% of the variance, respectively. These findings indicated the existence of a moderately strong ERS effect in both FC4 and LK4. The correlation between the numbers of endorsed extreme options for FC4 and LK4 was 0.52, further suggesting that the ERS effects identified in FC4 and LK4 overlapped substantially.
When there were five response options, ERS had a weak presence in FC5 but was moderately strong in LK5. Specifically, the average correlation was only 0.21 ± 0.12 for FC5 but 0.32 ± 0.13 for LK5. Similarly, the first eigenvalue was 13.63 for FC5 but 20.25 for LK5, indicating that the first factors explained 22.7% and 34.12% of the variance, respectively. The correlation between ERS scores for FC5 and LK5 was 0.54. These findings indicated that the ERS effects identified in FC5 and LK5 overlapped substantially, despite the weaker ERS effect in FC5.
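The first-eigenvalue diagnostic used above can be sketched as follows, under the standard convention that the total variance of a correlation matrix equals the number of indicators; the uniform-correlation matrix is a hypothetical illustration, not the study's data:

```python
import numpy as np

def first_eigenvalue_share(corr):
    """Largest eigenvalue of a correlation matrix and the proportion of
    total variance it explains (total variance = number of indicators)."""
    n = corr.shape[0]
    eigvals = np.linalg.eigvalsh(corr)  # eigenvalues in ascending order
    return eigvals[-1], eigvals[-1] / n

# Hypothetical example: 60 indicators with a uniform inter-indicator
# correlation of 0.30 (similar in size to the FC4 average reported above)
n = 60
corr = np.full((n, n), 0.30)
np.fill_diagonal(corr, 1.0)
lam1, share = first_eigenvalue_share(corr)
# For a uniform matrix, lam1 = 1 + (n - 1) * 0.30 = 18.7, so share ≈ 0.31
```

A dominant first eigenvalue (a large share of total variance) is what signals a general response-style factor running through the indicators.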
Taken together, these findings suggest that MRS is not a substantial concern for either FC or LK measures, as its magnitude was weak in both FC5 and LK5. However, ERS had a relatively stronger presence in both FC4 and LK4, and the ERS effects across the two response formats shared 27.04% of their variance. While the ERS in LK5 was almost as strong as that in FC4 and LK4, FC5 was less susceptible to ERS. Overall, FC5 seemed to be the least contaminated by ERS and MRS among the four graded versions (FC4, FC5, LK4, and LK5).
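The "numbers of endorsed extreme options" on which the ERS correlations were based can be computed with a simple per-respondent count; the category codes and response data below are hypothetical:

```python
import numpy as np

def ers_score(responses, extreme_codes):
    """Count endorsed extreme options per respondent.
    responses: (n_respondents, n_items) integer matrix of chosen categories.
    extreme_codes: category codes treated as extreme (assumed here to be the
    lowest and highest categories)."""
    responses = np.asarray(responses)
    return np.isin(responses, list(extreme_codes)).sum(axis=1)

# Two hypothetical respondents answering four 5-category items (codes 1..5)
resp = [[1, 5, 3, 2],
        [2, 3, 3, 4]]
scores = ers_score(resp, {1, 5})  # first respondent endorses 2 extremes, second 0
```

Correlating such counts across two formats (e.g., FC4 vs. LK4) is what yields the r = 0.52 overlap statistic reported above.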

Middle response or not
Although we speculated that there might be a trade-off between psychometric performance and respondent reactions regarding the inclusion/exclusion of a middle response option, the empirical evidence presented above did not support this speculation. Including a middle response option did not substantially improve respondent reactions, nor did it impair psychometric performance. It seems that people use the middle response option as intended. Surprisingly, FC4 was in fact more susceptible to ERS than FC5.

Summary
Given the large volume of results presented above, we summarize the main findings in Table 11. Specifically, we compared FC4 and FC5 to FC2, as well as FC4 vs. LK4 and FC5 vs. LK5. As shown in Table 11, FC5 generally outperformed FC4. The main issue with FC4 was that it was substantially impacted by ERS. When compared to the dichotomous FC measure, FC5 demonstrated better model fit, higher reliability, and slightly higher agreement between self- and other-ratings, while performing similarly in other aspects.
Overall, FC5 performed the best among the three FC formats. When compared to LK measures with the same numbers of response options, the two graded FC scales displayed better model fit, slightly higher agreement between self- and other-ratings, higher perceived faking-resistance, higher perceived difficulty, and higher perceived cognitive burden. FC5 was also less susceptible to ERS than LK5. It should be noted that all conclusions regarding the comparisons between FC and LK scales are conditional on the same scale length and statement content. They should not be generalized unconditionally (e.g., conclusions like "dichotomous FC scales are less reliable than LK scales" should be avoided).

Discussion
The current study presented the first large-scale systematic evaluation of the performance of two graded FC measures (FC4 and FC5), as well as their relative performance in comparison to the dichotomous FC measure (FC2) and their LK counterparts (LK2, LK4, and LK5). Our findings show that, compared to FC2, the two graded FC measures had better psychometric performance, particularly in terms of model fit and reliability. As for respondent reactions, we did not find substantial differences among the three FC scales, except that respondents found FC4 easier than FC2 and FC5. When it comes to response styles, while we found that MRS and ERS were not major concerns for FC5, ERS had a moderately strong presence in FC4. Compared to LK4 and LK5, FC4 and FC5 provided better support for the hypothesized five-factor model and were very similar to the LK measures in other aspects of psychometric performance. Respondents perceived FC4 and FC5 to be more faking-resistant, more difficult, and more cognitively demanding than LK4 and LK5. FC4 and FC5 were also less susceptible to response styles than their LK counterparts. These findings provide a balanced view of the pros and cons of graded FC measurement, thus laying a foundation for future research.

Graded FC measures possess satisfactory psychometric properties
In the current study, one of the most important findings regarding the psychometric performance of the graded FC measures was that person scores estimated from the two graded FC scales were substantially more reliable than scores from the dichotomous FC scale. This is consistent with our theoretical prediction: more response options provide more information and may reduce random responding, both of which jointly translate into more accurate measurement. It is also interesting to note that the conditional standard error of measurement curves of the graded FC measures were flatter than those of their LK counterparts. This is very encouraging because users of personality tests are often interested in measuring people at the two ends of the trait continuum (e.g., selecting top candidates or screening out bottom candidates). However, it is commonly recognized that people at the two ends are much harder to measure accurately using the LK format. Based on our findings, the graded FC scales have the potential to improve measurement accuracy at the two ends.
Another important pattern emerging from our results was that the hypothesized Five-Factor Model fitted better when tested with the FC than with the LK measures. This pattern was consistently observed across different numbers of response options. Although the same pattern was observed in previous studies (Brown & Maydeu-Olivares, 2011, 2018), our findings were more robust because we held the number of response options constant across the FC and LK measures. This is important in study design because the number of response options has been shown to impact the observed values of model fit indices given the same level of model misspecification (Xia & Yang, 2019). Two factors may jointly account for the improved model fit in FC measures when compared to their LK counterparts. First, the impact of omitted cross-loadings (the Thurstonian IRT model assumes cross-loadings to be zero), which commonly occur in many scales (Zhang et al., 2023), may be ameliorated with paired comparison data. For example, if items 1 and 2 have same-sized cross-loadings on a third factor (g_3) aside from their designated focal factors (g_1 and g_2), the factor model can be expressed as follows:

y_1 = t_1 + k_1 g_1 + k_3 g_3 + e_1
y_2 = t_2 + k_2 g_2 + k_3 g_3 + e_2

where t, k, and e refer to intercept, factor loading, and statement residual, respectively. When paired together in models for FC measures, the effects of the omitted cross-loadings cancel each other out according to the following formula (see Brown & Maydeu-Olivares [2011] for details of the TIRT model):

y_1 − y_2 = (t_1 − t_2) + k_1 g_1 − k_2 g_2 + (k_3 − k_3) g_3 + (e_1 − e_2) = (t_1 − t_2) + k_1 g_1 − k_2 g_2 + (e_1 − e_2)

Therefore, even if these cross-loadings are unmodeled in the Thurstonian IRT model, their impacts can cancel each other out at least to some degree. However, when left unmodeled in models for LK measures, their impact will be reflected in reduced model fit. Second, according to our results, respondents reported higher concentration when working on FC measures than when working on LK measures. Higher levels of concentration could lead to less noise in their responses, which resulted in better fit of the hypothesized model (Kam & Meyer, 2015). Among the three FC scales, FC4 and FC5 also outperformed FC2 in supporting the Five-Factor Model. Respondents likely had the freedom to express their degree of preference when responding to the two graded FC versions but were "forced" to make a binary choice on FC2 even when they did not have a strong preference. The latter is likely to produce more random error in the responses than the former, thus causing relatively worse fit.

Note (Table 11). A "+" sign indicates that the graded FC was advantageous; an "=" sign indicates that the graded FC performed equally to the others; a "−" sign indicates that the graded FC performed worse than the others.
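The cancellation of equal-sized cross-loadings in paired comparisons can be verified numerically; the parameter values below are hypothetical, and residual terms are omitted for clarity:

```python
# Hypothetical statement parameters: both statements carry the same
# cross-loading k3 on a third factor g3.
t1, k1, k3 = 0.5, 0.8, 0.3
t2, k2 = 0.2, 0.7

def utilities(g1, g2, g3):
    """Latent utilities of two statements under the factor model."""
    y1 = t1 + k1 * g1 + k3 * g3
    y2 = t2 + k2 * g2 + k3 * g3
    return y1, y2

# The paired comparison depends only on y1 - y2, so the shared k3 * g3
# term cancels: varying g3 leaves the difference unchanged.
g1, g2 = 1.2, -0.4
y1a, y2a = utilities(g1, g2, g3=0.0)
y1b, y2b = utilities(g1, g2, g3=2.5)
```

Because only the utility difference enters the comparison, the unmodeled third factor drops out entirely when the cross-loadings are equal, which is the mechanism the text describes.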
Aside from the above-mentioned psychometric advantages of the graded FC measures over the traditional dichotomous FC and LK measures, we also found the first piece of evidence showing that the graded FC measures had structural, convergent, and criterion-related validity as desirable as that of their LK counterparts. These findings are very reassuring because they mean that users of the graded FC measures can enjoy their advantages without substantial sacrifice. It is also worthwhile to note that in the current study, structural validity was found to be almost identical across the FC and LK measures, whereas previous studies often reported divergent structural validity across the two measurement formats (Ackerman et al., 2016; Dueber et al., 2019; Guenole et al., 2018). This discrepancy is likely due to differences in how the FC scales were designed. In those previous studies, most blocks involved equally keyed statements, while the FC scales used in our study included several mixed-keyed blocks. The proportion of mixed-keyed blocks is key to the statistical performance of the TIRT model (Lee et al., 2022). Together with the evidence for high convergent validity and similar criterion-related validity relative to their LK counterparts, we extended previous findings on the equivalence of construct validity between dichotomous FC and graded LK measures (Zhang et al., 2020) to graded FC and graded LK measures.

Graded FC measure does not substantially improve respondent reactions
Contrary to our expectations, graded FC measures did not lead to substantially improved respondent reactions compared to the dichotomous FC measure. One possibility is that less favorable respondent reactions do not primarily originate from the feeling of "being forced" or decreased autonomy. Instead, the amount of cognitive processing involved may be more relevant. In the current study, the FC scales were consistently perceived as more cognitively demanding than their LK counterparts across different numbers of response options (d = 0.55–0.69). When responding to an LK measure, respondents need to keep only one statement in working memory and evaluate how well the content of that item matches their typical feelings, thoughts, or behaviors. However, when responding to an FC block, respondents need to keep two statements in working memory simultaneously, evaluate the degree of match for each, and then compare and decide which statement is a better match in a relative sense. Clearly, more cognitive processing is involved in responding to an FC block, which may lead to less enjoyment, higher perceived cognitive burden, and higher perceived difficulty compared to responding to LK counterparts. As responding to dichotomous FC and graded FC measures involves similar levels of complex cognitive processing, it is not surprising that the graded FC measures did not substantially improve respondent reactions over the dichotomous FC measure.
Since paired comparison is the essence of all FC formats, we suspect it will be hard to improve these aspects of respondent reactions; this may simply be a cost researchers have to accept.
Although the graded FC measures did not demonstrate substantial advantages over the dichotomous FC measure in terms of respondent reactions, they did not show any significant disadvantages either. For example, similar to the dichotomous FC measure, respondents found the graded FC measures to be more faking-resistant and to evoke higher levels of concentration compared to their LK counterparts. In addition, respondents perceived the graded FC measures to be as accurate and useful as their LK counterparts. Overall, respondent reactions to the graded FC measures were very similar to those to the dichotomous FC measure.
FC measure with 5, but not with 4, response options is less susceptible to response styles

One of the advantages of using a dichotomous FC measure is that it is inherently resistant to various response styles, as the dichotomous response format does not allow response styles to manifest. In contrast, graded response scales run the risk of reintroducing response styles, which can render the estimated person scores less valid. However, our findings showed that FC5 was largely free from such negative impacts. On the other hand, FC4, which was assumed to be impacted by ERS, was indeed moderately impacted. Specifically, we did not find evidence for a general MRS effect in FC5, as the correlation between the MRS sum scores of FC5 and LK5 was low (r = 0.19). It is possible that the middle response option of FC5 ("Both statements A and B are equally like me") has a clearer and different meaning than that of LK5 ("Neither agree nor disagree"). Meanwhile, we did find evidence for the existence of a general ERS effect in FC4 and FC5, as there were moderate positive correlations (r = 0.52 and 0.54) between the ERS sum scores of the two graded FC measures and those of their LK counterparts. However, the existence of an ERS effect does NOT naturally indicate that the two scales were impacted by ERS to the same degree. According to the average correlations and the size of the first eigenvalues, FC5 was less impacted by ERS than FC4. Conceptually, we can think of the strength of impact in a way similar to the size of factor loadings: the ERS effect indicators showed moderate factor loadings in FC4 but small loadings in FC5. Therefore, while the ERS effect was present in both FC4 and FC5, its impact on FC5 was substantially smaller than that on FC4. Another consistent finding is that the graded FC measures were less impacted by response styles than their LK counterparts. This is encouraging because it suggests that scores estimated from graded FC measures are generally less contaminated by response styles and more valid than scores derived from their LK counterparts. However, we do not have a strong theory to explain why FC5 was less impacted by ERS than FC4, or why graded FC measures in general were less susceptible to response styles than their LK counterparts. We speculate that factors related to the way respondents interpret the response options might contribute to the observed pattern. A think-aloud study with an interview component is needed to further explore these patterns.

Yes, keep the middle option
One of the most important take-home messages from the current study is that the graded FC format is a viable alternative to the dichotomous FC measure because it can substantially improve the reliability of person score estimates (the mean empirical reliabilities for FC2, FC4, and FC5 were 0.66, 0.82, and 0.82, respectively) without sacrificing any other desirable properties. In terms of choosing among different forms of the graded FC measure (e.g., FC4 vs. FC5), we recommend the use of FC5, with a middle response option, for the following reasons. First, across the six versions of the personality scales, both the smallest average absolute mean difference between self- and other-ratings and the highest consistency in self-other ratings were found for FC5, suggesting that graded FC measures with 5 response options may be particularly useful in scenarios where higher interrater agreement is required (e.g., 360-degree assessment). Second, compared to FC4, where ERS had a moderately strong presence (just as in LK4 and LK5), FC5 was substantially less impacted by ERS (though not completely unaffected), indicating that person score estimates from FC5 are more accurate than those from FC4. Third, FC5 had substantially better discriminant validity than FC4 (M_FC4 = 0.29; M_FC5 = 0.19). Fourth, respondents displayed similar levels of positive affect toward FC5 and LK5, while they found FC4 to be less enjoyable than LK4. Aside from these advantages, we also showed that MRS was not a major issue for FC5, which is reassuring because one of the major concerns with odd-numbered response options is the introduction of MRS. Even though we found similar performance of FC4 and FC5 in terms of respondent reactions, we speculate that the availability of a middle response option still has the potential to improve respondent reactions in scales where statements within more blocks are matched on social desirability. Taken together, based on our findings, we recommend the use of the graded FC measure with 5 response options, at least in low-stakes research settings.

Limitations and future directions
Despite its many strengths (e.g., comprehensiveness, large samples, multi-source ratings), the current study is still limited in the following aspects. First, we collected data from relatively well-educated samples to gather the first piece of evidence on the performance of this new format. However, as respondents found the FC measures to be more cognitively demanding than the LK measures, we do not know whether the FC format (including both dichotomous and graded FC measures) can show equally satisfactory performance when applied to less-educated respondents. Future research is strongly encouraged to examine the applicability of the graded FC measures in such populations. Second, due to practical constraints, we did not randomly assign respondents to different conditions. Although participants did not differ substantially on most aspects (e.g., leadership experience, depression, life satisfaction) across conditions, future efforts with random assignment are encouraged to replicate our findings. Third, a few standardized factor loadings across the three FC measures were abnormally large (>0.95) and likely reflected estimation difficulties. Although we believe this issue did not distort our key findings, we still encourage future studies to develop a Bayesian estimator that can estimate the TIRT model parameterized as a second-order factor model. This way, such estimation issues could be eliminated by incorporating reasonable priors. Fourth, we only examined the graded FC format in a low-stakes situation, which is critical for establishing baseline knowledge but may not provide sufficient evidence for the application of this new format in high-stakes situations. Future studies are encouraged to examine whether our findings can be replicated in high-stakes settings. Last, despite its many advantages, the FC format is still underutilized. We suspect that one of the main reasons is the high complexity of constructing good FC scales and the lack of empirically supported guidelines on optimal practices. For example, how should social desirability matching be performed to develop psychometrically sound and faking-resistant FC scales for high-stakes situations? What is the appropriate block size to strike a good balance among psychometric properties, testing time, and cognitive burden? Future studies are encouraged to investigate these empirical questions using experimental designs. Together with accessible scoring and automatic test assembly software programs (Bürkner et al., 2019; Li et al., 2022), we believe such efforts could promote wider adoption of the FC format.

Conclusions
The present study comprehensively evaluated the pros and cons of the graded FC format in comparison to the dichotomous FC format and their LK counterparts. It was found that while the graded FC format did not improve respondent reactions to a meaningful extent, FC5 substantially improved the reliability of person score estimates, remained largely immune to response styles, and maintained the other desirable psychometric properties of the dichotomous FC format and its LK counterparts. In sum, researchers and practitioners can switch from the dichotomous FC to the FC5 format with gains in reliability and no substantial loss.

Figure 1 .
Figure 1. Examples of a Likert measure and different types of forced-choice measures.

Figure 2 .
Figure 2. Conditional standard errors for person score estimates obtained from the three forced-choice formats (first row) and the three Likert formats (second row). FC = Forced-Choice; LK = Likert; E = Extraversion; A = Agreeableness; C = Conscientiousness; N = Neuroticism; O = Openness; CAT = the no. of response options. Blue dots represent conditional SEs from scales with 2 response options; red dots represent conditional SEs from scales with 4 response options; gray dots represent conditional SEs from scales with 5 response options.

Table 1 .
Descriptive statistics. The Big Five factor scores were computed as the mean of responses. They are NOT on the same metric across conditions due to the use of different numbers of response options and are therefore not comparable. Depression was not measured in Sample 2.
Note. FC = Forced-choice; LK = Likert. The numbers appended to "FC" or "LK" refer to the no. of response options.

Table 4 .
Reliability and convergent validity.
Note. FC = Forced-choice scale; LK = Likert scale; E = Extraversion; A = Agreeableness; C = Conscientiousness; N = Neuroticism; O = Openness. The numbers appended to "FC" or "LK" refer to the no. of response options. Cat = the no. of response options.

Table 5 .
Discriminant validity. Note. E = Extraversion; A = Agreeableness; C = Conscientiousness; N = Neuroticism; O = Openness. The numbers appended to "FC" or "LK" refer to the no. of response options.

Table 8 .
Convergence between self-and other-ratings.

Table 10 .
Results for Extreme and Middle Response Styles.