On diagnostic accuracy measure with cut-points criterion for ordinal disease classification based on concordance and discordance

The accuracy of a diagnostic test is essential for detecting disease stages. Many diagnostic accuracy measures are designed for binary diagnostic tests, and some apply to multi-stage diagnosis. Yet their implementation has limitations, and their performance depends heavily on the distribution of diagnostic outcomes. Another essential aspect of medical diagnostic testing with biomarkers is finding an optimal cut-point that categorizes a patient as diseased or healthy; this problem has been extended to diseases with more than two stages. We propose a diagnostic accuracy measure and optimal cut-point selection criterion (CD) based on concordance and discordance for k-stage diseases. The CD measure uses the classification agreement and disagreement between test outcomes and disease stages. Power simulations suggest that, in some scenarios, CD can detect differences between the null and alternative hypotheses that other methods cannot. Simulation results indicate that selecting optimal cut-points with the CD measure can provide relatively higher correct classification rates than existing measures and more balanced correct classification rates than the generalized Youden index (GYI). An illustration is provided using the ADNI data to choose biomarkers for diagnosing Alzheimer's disease (AD) and to select optimal cut-points for the chosen biomarkers.


Introduction
Along with assessing the patient's signs and symptoms, diagnostic tests are used to identify if a patient has a disease or not [1]. Diagnostic accuracy is the diagnostic test's ability to discriminate between non-diseased and diseased or non-diseased and different stages of a particular disease state. Diagnostic tests play an essential role in health care, especially in providing information to discriminate between diseased and non-diseased subjects. Tests with high accuracy provide a decent understanding of patients' health conditions for decision-making in treatment plan designs.
A perfect diagnostic test, one that completely discriminates between diseased and non-diseased subjects, rarely exists. A diagnostic test can usually only partially distinguish between subjects with and without the disease. Furthermore, cut-point(s) are required for a continuous biomarker to make a diagnostic decision for binary or ordinal disease stages. For instance, in a binary (two-stage) disease with non-diseased and diseased states, a test value above the cut-point, which tests positive, does not always indicate the presence of disease; such positive results in non-diseased subjects are false positives (FPs). Likewise, a test value below the cut-point, which tests negative, does not guarantee freedom from disease; such negative results in diseased subjects are false negatives (FNs). Thus, the cut-point of a diagnostic test categorizes the examined subjects into four subgroups based on the test results and the actual condition status (determined by a gold standard or reference method): True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).
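As a minimal sketch of the four subgroups described above, the following Python snippet counts TP, FP, TN, and FN for a given cut-point, assuming that values above the cut-point test positive. The function name `confusion_counts` and the toy data are hypothetical and serve only as an illustration.

```python
def confusion_counts(values, diseased, cut_point):
    """Count (TP, FP, TN, FN), assuming values above cut_point test positive."""
    tp = fp = tn = fn = 0
    for x, is_diseased in zip(values, diseased):
        positive = x > cut_point
        if positive and is_diseased:
            tp += 1          # diseased subject, positive test
        elif positive:
            fp += 1          # non-diseased subject, positive test
        elif is_diseased:
            fn += 1          # diseased subject, negative test
        else:
            tn += 1          # non-diseased subject, negative test
    return tp, fp, tn, fn


# Hypothetical biomarker readings and gold-standard disease status.
values = [1.2, 2.5, 3.1, 0.7, 2.9, 1.8]
diseased = [False, True, True, False, False, True]
tp, fp, tn, fn = confusion_counts(values, diseased, cut_point=2.0)
print(tp, fp, tn, fn)
```

Moving the cut-point trades false positives against false negatives, which is why the choice of cut-point matters for any accuracy measure built on these counts.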
A diagnostic test result does not always accurately represent the patient's condition because diagnostic tests rarely have perfect accuracy. Here, accuracy refers to the probability of a correct test result. It is therefore essential to develop quantitative methods to measure diagnostic accuracy. Some well-established before-test measures (which relate to the inherent discriminatory accuracy of a test given the true disease condition) are sensitivity (Se or TPR), specificity (Sp or TNR), the Youden index, the area under the ROC curve (AUC), and the diagnostic odds ratio (DOR). In addition, after-test measures give a person's chance of having the disease given the test result; examples are likelihood ratios (LR) and positive and negative predictive values (PPV and NPV). These measures can be used to decide whether to accept a diagnosis of disease, rule one out, or order more testing [1].
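The measures listed above can all be computed from the four cell counts of a binary test. The sketch below uses the standard textbook formulas; the function name `accuracy_measures` and the counts are hypothetical, and the code assumes all denominators are non-zero (for example, the DOR is undefined when FP or FN is zero).

```python
def accuracy_measures(tp, fp, tn, fn):
    """Standard binary accuracy measures from the 2 x 2 table counts."""
    se = tp / (tp + fn)              # sensitivity (TPR)
    sp = tn / (tn + fp)              # specificity (TNR)
    return {
        "Se": se,
        "Sp": sp,
        "Youden": se + sp - 1,       # Youden index at this cut-point
        "LR+": se / (1 - sp),        # positive likelihood ratio
        "LR-": (1 - se) / sp,        # negative likelihood ratio
        "DOR": (tp * tn) / (fp * fn),
        "PPV": tp / (tp + fp),       # depends on the prevalence in the sample
        "NPV": tn / (tn + fn),
    }


# Hypothetical counts for 100 diseased and 100 non-diseased subjects.
measures = accuracy_measures(tp=80, fp=10, tn=90, fn=20)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Unlike Se and Sp, the predictive values change with the proportion of diseased subjects in the tested population, which is one reason the measures must be interpreted with the study design in mind.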
Depending on the purpose of diagnostic procedures, different measures of diagnostic accuracy have different uses. Some diagnostic accuracy measures evaluate the ability to discriminate between the healthy and the diseased, and others measure its predictive ability. Diagnostic accuracy measures are also susceptible to the spectrum of the diseases and the tested population. It is important to know which measure to use under what conditions and interpret those measures carefully.
In practice, many diagnostic tests deal with non-binary cases, such as discriminating among more than two conditions or measuring the magnitude of a substance [1]. For example, some diseases such as liver cancer (LC) and Alzheimer's disease (AD) have an intermediate or transitional stage between the non-diseased and diseased states [2]. Those diseases therefore often have a three-class classification: non-diseased, early diseased, and fully diseased. Another example is the classification of cancer stages, which determines appropriate treatment and prognosis. It is necessary to detect the disease early so that timely medical interventions can reduce cost and improve patients' quality of life. Therefore, diagnostic tests that can identify multiple stages are highly valuable and desirable, and methodologies that measure diagnostic accuracy for such tests are needed.
A natural practice for measuring the diagnostic accuracy of non-binary scale tests (i.e. nominal-, ordinal-, and continuous-scale tests) is to dichotomize the results of a gold standard and then use the traditional ROC curve. However, this method sometimes ignores crucial clinical information (the stages of the disease) and results in biased estimates of the correct classification rates of the disease stages, which affects the diagnostic accuracy [3].
Binary ROC analysis has been extended to three-class ROC analysis in various ways. One approach is to analyze binary ROC curves for all pairs of classes [4]. However, this method does not assess the overall accuracy of the diagnostic test. Another approach is to define a three-way ROC surface on three-dimensional coordinates and use the volume under the ROC surface (VUS) as a summary index of diagnostic accuracy [5]. Multiple studies have discussed such diagnostic test methods for three ordinal outcomes [6,7]. VUS has been extended to k-stage classification problems through approaches including the hypervolume under the manifold (HUM), multiple-class ROC analysis, and optimal ROC hypersurfaces [8,9,10,11,12].
As one of the most popular diagnostic accuracy measures, the Youden index has been intensively studied and extended to three-class classification problems. There are a few different versions of the extended Youden index, and the most commonly used one is defined as the maximum of the sum of correct classification rates minus 1 [13]. The generalized Youden index, defined as the maximum of the total correct classification rates minus one, is directly extended from the three-class setting and has become the most widely used method to measure diagnostic accuracy for k-stage diseases.
Despite their popularity in applications, the Youden index and the generalized Youden index use only partial classification information, namely the correct classification rates. They may therefore lose vital information about misclassification and may produce unbalanced classification rates. Most recently, the maximum absolute determinant (MADET) method, which maximizes the absolute determinant of the classification probability matrix, was proposed to measure the overall diagnostic accuracy for k-stage classification problems [14]. MADET uses all the classification probabilities and, in some scenarios of the power study, performed better than GYI and VUS in discriminating among classes and capturing differences in classification probabilities.
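The point about partial information can be illustrated numerically. In the sketch below, two hypothetical three-class classification probability matrices share the same correct classification rates, so the quantity maximized by GYI is identical, while the absolute determinant used by MADET differs because the misclassification patterns differ. The matrices and the helper names `det3` and `gyi_objective` are assumptions made for illustration; both published measures additionally maximize over cut-points, which is not shown here.

```python
def det3(P):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = P
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def gyi_objective(P):
    """Sum of correct classification rates minus 1, at fixed cut-points."""
    return sum(P[i][i] for i in range(len(P))) - 1


# Same diagonal (0.8, 0.8, 0.8), different misclassification patterns.
P_a = [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]
P_b = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]

print(gyi_objective(P_a), gyi_objective(P_b))   # identical GYI objective
print(abs(det3(P_a)), abs(det3(P_b)))           # different absolute determinants
```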
In diagnostic studies, determining the optimal cut-point(s) for making diagnoses is another crucial task besides measuring the diagnostic accuracy of continuous biomarkers. Typically, a criterion for selecting cut-point(s) is developed from a measure of diagnostic accuracy, such as the Youden index or the ROC curve. However, VUS and HUM are defined over all possible cut-points and cannot be used as criteria for selecting cut-points. Moreover, some criteria for selecting optimal cut-point(s), such as the closest-to-(0, 1) criterion, are not used for measuring diagnostic accuracy because they lack a probabilistic interpretation [15]. The selected optimal cut-point(s) are expected to maximize the correct classification rates and minimize the misclassification rates.
For multi-stage diseases, measures such as the generalized Youden index (GYI), the closest-to-perfection or minimum distance (MD) method [2], the maximum volume (MV) method [2], and Dong's maximum absolute determinant (MADET) also serve as criteria for selecting optimal cut-point(s).
It is worth mentioning that the optimal cut-point(s) or decision threshold is sometimes subjective, depending on the focus of the diagnostic test. For example, some tests, such as mammograms for breast cancer, must be interpreted subjectively by a human reader, who follows his or her established decision threshold to identify cases as positive or negative. Many factors can affect how the observer adopts the decision threshold, including the assessed likelihood of the health condition in the patient before testing (such as family history), the observer's estimate of the consequences of misdiagnoses, and the observer's style [16].
Motivated by Dong's MADET method, which uses all the classification information and provides more balanced correct classification rates, this paper proposes a new diagnostic accuracy measure based on the difference between concordance and discordance (CD) for general k-stage diseases. This new measure, CD, uses all the classification information, including the classification agreement and disagreement between test outcomes and patients' disease stages, to achieve higher correct classification rates. Moreover, maximizing the CD measure also serves as a criterion for selecting optimal cut-points. Section 2 introduces the new diagnostic accuracy measure (CD) and its use as a criterion for selecting optimal cut-points in the general k-stage setting. Section 3 presents simulation studies, including the power study, that compare the performance of CD in measuring diagnostic accuracy and selecting cut-points. An illustration using real data is presented in Section 4. Final remarks and discussion are in Section 5.

Proposed New Measure of Diagnostic Accuracy (CD)
Many methods of measuring diagnostic accuracy and selecting cut-point(s) have been developed over the years. Consider a continuous biomarker used as a diagnostic test for a disease with $k$ ordinal stages. Let $T$ denote the diagnostic test result for a subject based on the biomarker, and let $S$ denote the true disease status of the same subject according to a gold standard test. A subject with true disease status $i$ ($i = 1, 2, \ldots, k$) can be classified into disease stage $j$ ($j = 1, 2, \ldots, k$) based on the biomarker reading. The classification rate is defined as the conditional probability of being classified into a class given the true disease status and can be written as

$$p_{ij} = P(T = j \mid S = i), \qquad i, j = 1, 2, \ldots, k.$$
When $i = j$, $p_{ij}$ is the probability that a patient with true disease status $i$ is correctly diagnosed by the test. When $i \neq j$, $p_{ij}$ denotes the probability that a patient with true disease status $i$ is incorrectly diagnosed as class $j$.
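Define the $k \times k$ classification probability matrix as

$$P = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1k} \\ p_{21} & p_{22} & \cdots & p_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ p_{k1} & p_{k2} & \cdots & p_{kk} \end{pmatrix}.$$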
The P-matrix includes all $k$ correct classification rates and all $k(k-1)$ incorrect classification rates for the $k$ classes. A subject with true disease status $i$ must be classified into one of the $k$ classes; hence,

$$\sum_{j=1}^{k} p_{ij} = 1, \qquad i = 1, 2, \ldots, k.$$

Therefore, the P-matrix is a stochastic matrix in which each row sums to 1. For a k-stage disease, $k-1$ cut-points $(c_1, c_2, \ldots, c_{k-1})$ with $c_0 < c_1 < c_2 < \cdots < c_{k-1} < c_k$ are required to discriminate among the disease stages based on a continuous biomarker. Here $c_0$ and $c_k$ are defined as $-\infty$ and $\infty$, respectively. Let $X$ denote the measurement of the diagnostic biomarker. A subject is classified into stage $j$ (i.e. $T = j$) if $c_{j-1} < X < c_j$, for $j = 1, \ldots, k$. Let $X_i$ denote the biomarker value for the $i$th disease stage, with PDF $f_i(x)$ and CDF $F_i(x)$, for $i = 1, 2, \ldots, k$. The corresponding classification probability $p_{ij}$ can then be written in terms of the CDFs as

$$p_{ij} = P(c_{j-1} < X_i < c_j) = F_i(c_j) - F_i(c_{j-1}), \qquad i, j = 1, 2, \ldots, k.$$
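The relation between the cut-points and the classification probabilities, $p_{ij} = F_i(c_j) - F_i(c_{j-1})$ with $c_0 = -\infty$ and $c_k = \infty$, can be sketched in a few lines of Python. The example assumes normally distributed biomarkers with hypothetical parameters and uses the standard-library `statistics.NormalDist`; the function name `classification_matrix` is illustrative and does not come from the paper.

```python
from statistics import NormalDist


def classification_matrix(dists, cuts):
    """Return P with p_ij = F_i(c_j) - F_i(c_{j-1}), for ascending cut-points."""
    P = []
    for dist in dists:
        # CDF values at c_0 = -inf, c_1, ..., c_{k-1}, c_k = +inf
        cdf = [0.0] + [dist.cdf(c) for c in cuts] + [1.0]
        P.append([upper - lower for lower, upper in zip(cdf, cdf[1:])])
    return P


# Hypothetical three-stage biomarker: N(0, 1), N(2, 1), N(4, 1).
stages = [NormalDist(0, 1), NormalDist(2, 1), NormalDist(4, 1)]
P = classification_matrix(stages, cuts=[1.0, 3.0])
for row in P:
    print([round(p, 4) for p in row], "row sum =", round(sum(row), 4))
```

Each row sums to one by construction, which reflects the stochastic-matrix property of $P$.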
For example, when $k = 3$, the probability matrix $P$ for a pair of thresholds $(c_1, c_2)$ can be written as

$$P = \begin{pmatrix} F_1(c_1) & F_1(c_2) - F_1(c_1) & 1 - F_1(c_2) \\ F_2(c_1) & F_2(c_2) - F_2(c_1) & 1 - F_2(c_2) \\ F_3(c_1) & F_3(c_2) - F_3(c_1) & 1 - F_3(c_2) \end{pmatrix}.$$

Note that $p_{i1} + p_{i2} + p_{i3} = 1$ for $i = 1, 2, 3$.

This paper proposes a new diagnostic accuracy measure for general k-stage diseases that uses the concordance and discordance (CD) of the classification probability matrix defined above. Concordance and discordance are calculated for ordinal (ordered) variables and indicate whether two sets of scores agree or disagree. Because the stages in a k-stage disease classification have an inherent ordinal ordering, a pair of subjects is concordant if the subject ranking higher on $S$ (disease stage) also ranks higher on $T$ (test result), and discordant if the subject ranking higher on $S$ ranks lower on $T$. Therefore, the CD measure uses all the classification information (both the correct classification and the misclassification rates) and aims to achieve higher correct classification rates. Moreover, maximizing the CD measure also serves as a criterion for selecting optimal cut-points.
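For the k-stage classification probability matrix $P$, let $C$ and $D$ denote the concordance and discordance of $P$ ([17], p. 57):

$$C = \sum_{i < i'} \sum_{j < j'} p_{ij}\, p_{i'j'}, \qquad D = \sum_{i < i'} \sum_{j > j'} p_{ij}\, p_{i'j'}.$$

The CD measure is then defined as the maximum, over the cut-points, of the difference between concordance and discordance:

$$CD = \max_{c_1 < c_2 < \cdots < c_{k-1}} (C - D).$$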
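As a sketch of how $C$ and $D$ can be evaluated for a given classification probability matrix, the snippet below sums products $p_{ij}\,p_{i'j'}$ over row pairs $i < i'$, counting the product as concordant when $j < j'$ and discordant when $j > j'$. This follows the usual concordance and discordance definitions for ordinal contingency tables and is an assumption about the exact form used in the paper; the function name is hypothetical.

```python
def concordance_discordance(P):
    """Return (C, D) for a k x k classification probability matrix P."""
    k = len(P)
    C = D = 0.0
    for i in range(k):
        for i2 in range(i + 1, k):          # row pairs with i < i'
            for j in range(k):
                for j2 in range(k):
                    if j < j2:              # test ranks agree with true stages
                        C += P[i][j] * P[i2][j2]
                    elif j > j2:            # test ranks disagree
                        D += P[i][j] * P[i2][j2]
    return C, D


perfect = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
useless = [[1 / 3, 1 / 3, 1 / 3]] * 3       # every stage classified at random

print(concordance_discordance(perfect))
print(concordance_discordance(useless))
```

For a perfect three-class test every row pair is fully concordant, whereas for an uninformative test concordance and discordance cancel, so $C - D = 0$.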
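The optimal cut-points $(c_1, c_2, \ldots, c_{k-1})$ are obtained by maximizing the absolute difference between concordance and discordance:

$$(c_1^{*}, c_2^{*}, \ldots, c_{k-1}^{*}) = \arg\max_{c_1 < c_2 < \cdots < c_{k-1}} |C - D|.$$

For binary diseases, i.e. $k = 2$,

$$C - D = p_{11}\,p_{22} - p_{12}\,p_{21} = Sp(c)\,Se(c) - \{1 - Sp(c)\}\{1 - Se(c)\} = Se(c) + Sp(c) - 1,$$

where $c$ is the cut-point, and $Sp$ and $Se$ are the specificity and sensitivity of the diagnostic test, respectively.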
The CD measure is the same as the Youden index and the MADET measure for the binary case. Hence the optimal cut-point from each measure is the same as well.
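For three-class diseases, i.e. $k = 3$,

$$C - D = p_{11}(p_{22} + p_{23} + p_{32} + p_{33}) + p_{12}(p_{23} + p_{33}) + p_{21}(p_{32} + p_{33}) + p_{22}\,p_{33} - p_{12}(p_{21} + p_{31}) - p_{13}(p_{21} + p_{22} + p_{31} + p_{32}) - p_{22}\,p_{31} - p_{23}(p_{31} + p_{32}).$$

After rearranging the terms,

$$C - D = \sum_{1 \le i < i' \le 3}\ \sum_{1 \le j < j' \le 3} \begin{vmatrix} p_{ij} & p_{ij'} \\ p_{i'j} & p_{i'j'} \end{vmatrix}.$$

That is, for the three-class case, $C - D$ is the sum of all nine $2 \times 2$ minors of the probability matrix $P$. This result holds for general $k$.
Rearranging the terms in the definitions of $C$ and $D$, we have

$$C - D = \sum_{i < i'} \sum_{j < j'} \left( p_{ij}\,p_{i'j'} - p_{ij'}\,p_{i'j} \right) = \sum_{i < i'} \sum_{j < j'} \begin{vmatrix} p_{ij} & p_{ij'} \\ p_{i'j} & p_{i'j'} \end{vmatrix}.$$

For the k-class case ($k > 2$), $C - D$ is therefore the sum of all the minors of order 2 of $P$. This provides a general computational method for calculating CD when $k > 2$.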
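The minor-based computation can be sketched as follows. The snippet sums all order-2 minors of a hypothetical $3 \times 3$ matrix and compares the result with a direct evaluation of concordance minus discordance; the function names and the matrix are assumptions made for illustration, and the objective is evaluated at fixed cut-points only.

```python
from itertools import combinations


def cd_from_minors(P):
    """Sum of all 2 x 2 minors of P (rows i < i', columns j < j')."""
    k = len(P)
    total = 0.0
    for i, i2 in combinations(range(k), 2):
        for j, j2 in combinations(range(k), 2):
            total += P[i][j] * P[i2][j2] - P[i][j2] * P[i2][j]
    return total


def cd_direct(P):
    """Concordance minus discordance, summed term by term."""
    k = len(P)
    return sum(
        (1 if j < j2 else -1) * P[i][j] * P[i2][j2]
        for i in range(k) for i2 in range(i + 1, k)
        for j in range(k) for j2 in range(k) if j != j2
    )


# Hypothetical three-class classification probability matrix.
P = [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7]]
print(cd_from_minors(P), cd_direct(P))
```

For a $k \times k$ matrix there are $\binom{k}{2}^2$ such minors, so the cost of one evaluation grows only polynomially in $k$.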

Power study
The proposed diagnostic measure, the CD measure, uses all the classification information, and the previous numerical example shows that it provides relatively balanced classification rates across the three classes. A simulation-based power study of detecting differences in the biomarker distributions across disease stages is conducted to evaluate the CD measure's performance. This power study compares the newly proposed diagnostic measure with three other measures: MADET, GYI, and VUS. In the power study, $H_0$ (the null hypothesis) is rejected if the estimated statistic under $H_a$ (the alternative hypothesis) is greater than its 95th percentile obtained under $H_0$. The power of each measure is calculated as the proportion of the 2000 replications in which $H_0$ is rejected.
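The rejection rule described above can be sketched as a small Monte Carlo procedure. The code below is a simplified illustration under stated assumptions, not the simulation used in the paper: it uses unit-variance normal stages with hypothetical means, an empirical estimate of $C - D$ maximized over a thinned grid of observed values, 200 replications instead of 2000 to keep the run short, and it assumes larger biomarker values indicate later stages. All function names are illustrative.

```python
import random
from bisect import bisect_right
from itertools import combinations
from statistics import quantiles

PAIRS = list(combinations(range(3), 2))  # index pairs (0, 1), (0, 2), (1, 2)


def empirical_cd(samples):
    """Maximum of C - D over candidate cut-points, using empirical CDFs."""
    ordered = [sorted(s) for s in samples]
    pooled = sorted(x for s in samples for x in s)
    best = 0.0  # C - D equals 0 when all subjects fall into a single class
    for c1, c2 in combinations(pooled[::3], 2):  # thinned grid for speed
        P = []
        for s in ordered:
            a = bisect_right(s, c1) / len(s)     # share classified as stage 1
            b = bisect_right(s, c2) / len(s)     # share classified as stage 1 or 2
            P.append((a, b - a, 1.0 - b))
        cd = sum(P[r1][k1] * P[r2][k2] - P[r1][k2] * P[r2][k1]
                 for r1, r2 in PAIRS for k1, k2 in PAIRS)
        best = max(best, cd)
    return best


def simulate_power(means_h0, means_ha, n=20, reps=200, seed=1):
    """Share of replications under H_a exceeding the 95th percentile under H_0."""
    rng = random.Random(seed)

    def draw(means):
        return [[rng.gauss(mu, 1.0) for _ in range(n)] for mu in means]

    null_stats = [empirical_cd(draw(means_h0)) for _ in range(reps)]
    critical = quantiles(null_stats, n=20)[-1]   # 95th percentile under H_0
    alt_stats = [empirical_cd(draw(means_ha)) for _ in range(reps)]
    return sum(stat > critical for stat in alt_stats) / reps


power_alt = simulate_power((0, 0, 0), (0, 2, 4))   # well-separated stages
size_null = simulate_power((0, 0, 0), (0, 0, 0))   # H_a identical to H_0
print(power_alt, size_null)
```

When the alternative coincides with the null, the rejection proportion estimates the type I error and should be close to the nominal 0.05, up to Monte Carlo error.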
Consider diseases with three stages (i.e. $k = 3$): non-diseased, early diseased, and fully diseased. Assume that the biomarker follows a given underlying distribution in each of the three stages. Table 1 presents the distributional scenarios selected for the power study. Six scenarios are tested for each of the Normal and Gamma distributions, representing symmetric and skewed distributions, respectively. For each distributional setting (Normal and Gamma), scenarios 1 and 2 have a null hypothesis under which the distributions of the three stages are identical. Such a setting tests whether the biomarker has any diagnostic accuracy. Scenario 3, with the same underlying distributions for $H_0$ and $H_a$, is used to check the type I error probability. For example, the power study result should be around 0.05 under scenario 3 if the probability of type I error is predetermined as 0.05. Scenarios 4, 5, and 6 refer to changes in location, scale, and a combination of location and scale for the alternative hypotheses. Scenario 4 shifts stage 1 and stage 2 closer together, scenario 5 shifts stage 2 and stage 3 closer together, and scenario 6 moves stage 2 and stage 3 away from stage 1.
In each scenario, the power study is conducted under three sample sizes (20, 40, and 80), and the three classes are assumed to have equal sample sizes. That is, $n_1 = n_2 = n_3 = 20$, $n_1 = n_2 = n_3 = 40$, and $n_1 = n_2 = n_3 = 80$. Such settings help us explore the changes in power as the sample size increases. Tables 2 and 3 present the simulated power of CD, MADET, GYI, and VUS for underlying Normal and Gamma distributions. In general, when the sample size increases from 20 to 80, the power increases for all the scenarios. Under scenario 1 and scenario 2, where the distributions of the three stages are identical under the null hypothesis, the power for every setting is greater than 0.05, indicating that the biomarker has some diagnostic accuracy. Under scenario 3, where the underlying distributions for $H_0$ and $H_a$ are the same, the power of all the measures is around 0.05, which verifies that the probability of type I error is controlled, since the 95% quantile is chosen as the critical value.
There is no dominant winner among CD, MADET, GYI, and VUS in power under the Normal distribution scenarios, and the tests have similar power for all the methods. For Normal 2 and Normal 5, CD has slightly higher power than the others. However, under the Gamma distribution scenarios, the CD method outperforms all other methods (for Gamma 5 and Gamma 6, CD is as powerful as VUS). In particular, for Gamma 4 and Gamma 5, CD shows much higher power than both MADET and GYI. While CD performs comparably to MADET and VUS in Gamma 6, GYI has very small power. The results show that the CD measure performs better than or at least as well as the other measures, implying that CD can detect differences between the null and alternative hypotheses that other methods cannot. This is particularly true when comparing CD and GYI.
Similar results can be seen in Table S7 in the supplementary data file, which reports the power analysis for the cases in which stages 1 and 2 have the same distribution and in which stages 2 and 3 have the same distribution. The cases in which our proposed measure is more powerful than the other measures are highlighted in red.
Moreover, in the Supplementary data (section 2) we compared the CD measure with several existing methods regarding cut-points selection. An intensive simulation with multiple scenarios is carried out to evaluate their relative performance in selecting the optimal cut-points for a three-stage setting (i.e. k = 3). For the general case with k (k ≥ 2) classes, MADET, the generalized Youden index, the minimum distance (MD), the maximum volume (MV), and the proposed CD measure can be used to select optimal cut-points.
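A CD-based cut-point selection for a three-stage setting can be sketched as a simple grid search that maximizes $C - D$ over ordered pairs of candidate cut-points. The example assumes known normal stage distributions with hypothetical parameters and a hypothetical grid; in practice the distributions would be estimated from data, and the paper's simulations may use a different optimizer. The function names are illustrative.

```python
from itertools import combinations
from statistics import NormalDist


def cd_objective(dists, cuts):
    """C - D of the P matrix induced by ascending cut-points."""
    P = []
    for dist in dists:
        cdf = [0.0] + [dist.cdf(c) for c in cuts] + [1.0]
        P.append([upper - lower for lower, upper in zip(cdf, cdf[1:])])
    idx = list(combinations(range(len(P)), 2))
    return sum(P[r1][k1] * P[r2][k2] - P[r1][k2] * P[r2][k1]
               for r1, r2 in idx for k1, k2 in idx)


def select_cut_points(dists, grid):
    """Return the ascending (k - 1)-tuple from grid that maximizes C - D."""
    k = len(dists)
    return max(combinations(grid, k - 1),
               key=lambda cuts: cd_objective(dists, cuts))


# Hypothetical three-stage biomarker and a grid from -2.0 to 6.0 in steps of 0.1.
stages = [NormalDist(0, 1), NormalDist(2, 1), NormalDist(4, 1)]
grid = [i / 10 for i in range(-20, 61)]
best = select_cut_points(stages, grid)
print(best, round(cd_objective(stages, best), 4))
```

An exhaustive grid is adequate for $k = 3$, but the number of candidate tuples grows quickly with $k$, so a numerical optimizer may be preferable for more stages.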
We used several judging rules or indexes to evaluate the performance of the selection criteria [14]. Relative bias (RBias) and root mean square error (RMSE) are two performance indexes for the estimated cut-points compared with the true cut-points. The total correct classification rate (TCCR), the loss of TCCR, and the balance of CCRs among classes are used to evaluate the overall performance of the selection criteria at their selected optimal cut-points (see section 2 of the Supplementary data).
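The two cut-point indexes can be sketched with their conventional definitions: relative bias as the mean estimation error divided by the true value, and RMSE as the square root of the mean squared error. These formulas are an assumption, since the exact definitions are given in [14] and the supplementary material rather than in this section; the function names and numbers are hypothetical.

```python
from math import sqrt


def relative_bias(estimates, true_value):
    """(mean of estimates - true value) / true value; undefined if true value is 0."""
    mean_estimate = sum(estimates) / len(estimates)
    return (mean_estimate - true_value) / true_value


def rmse(estimates, true_value):
    """Root mean square error of the estimates around the true value."""
    return sqrt(sum((e - true_value) ** 2 for e in estimates) / len(estimates))


# Hypothetical cut-point estimates from four simulation replications.
estimates = [0.9, 1.1, 1.2, 1.0]
print(relative_bias(estimates, true_value=1.0))
print(rmse(estimates, true_value=1.0))
```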
In scenario 1, when the three stages are evenly spaced in the mean, all the methods perform consistently across the different distributional settings. CD has smaller total CCRs, and GYI gives the highest $p_{11}$ and $p_{33}$. Moreover, MADET has the smallest MMDIF, indicating the best balance of correct classification rates.
When stage 1 and stage 2 overlap heavily, as in scenario 2 (Normal 2, Gamma 2, and Exp 2), the results from the different methods are similar for each distributional setting. CD has the highest $p_{11}$, suggesting a good ability to rule out disease. MD has the best balance.
All methods perform similarly in scenario 3 (Normal 3, Gamma 3, and Exp 3), in which stage 2 and stage 3 overlap heavily. GYI has the highest $p_{11}$, and MADET has the best balance.
The CD measure generally has a smaller or comparable loss of total CCRs compared with MADET, MD, and MV. For example, in the setting of $n_1 = n_2 = n_3 = 100$, CD has a smaller loss of total CCRs than MADET, MD, and MV in the scenarios Normal 2, Gamma 1 and 3, and Exp 1 and 2. Its loss of total CCRs in the other scenarios (Normal 1, Normal 3, Gamma 2, and Exp 3) is very close to the smallest one. For instance, in Normal 1, the loss of total CCRs for CD is 2.11%, while the smallest loss of total CCRs is 1.8096%, from MD.
Moreover, MADET, MD, and MV generally have smaller MMDIFs, but CD has smaller MMDIFs than GYI for all the scenarios except Exp 2 and Exp 3. This result indicates that CD achieves a better balance in CCRs than GYI, but not as good a balance as MADET, MD, and MV. However, even where CD has larger MMDIFs than MADET, MD, or MV, the difference is small in some cases. For example, consider scenario 1 with $n_1 = 100$, $n_2 = 50$, $n_3 = 20$. CD has an MMDIF of 0.1559, close to the MMDIFs from MD (0.1885) and MV (0.1605), while GYI has a much larger MMDIF of 1.1891.
In conclusion, not only can the CD measure provide relatively high total correct classification rates with less loss than MADET, MD, and MV, but it also provides more balanced correct classification rates than GYI. It is a measure that retains advantages of the other measures.

Application to ADNI data
This section applies the CD measure to the ADNI data to choose biomarkers for diagnosing Alzheimer's disease (AD) and to select optimal cut-points for the chosen biomarkers. MADET, GYI, VUS, MD, and MV are also used for comparison.

Introduction
Alzheimer's disease (AD) is a progressive disease that damages important mental functions such as memory through the deterioration of brain tissue. According to the Alzheimer's Association [18], AD is one of the most common forms of dementia among seniors. Approximately 5.7 million Americans are affected by dementia due to AD, of whom 5.5 million are over the age of 65. AD is the sixth leading cause of death in the United States, and AD-related deaths increased by 123% from 2000 to 2015. In the United States, the number of people living with Alzheimer's disease is expected to increase in the coming years, as are the costs of caring for those patients.
Not all people with Mild Cognitive Impairment (MCI) develop Alzheimer's disease. However, people with MCI tend to have a higher risk of developing Alzheimer's disease or other dementia [19]. The progression of Alzheimer's disease can last more than a decade and pass through several stages. Alzheimer's disease typically develops slowly and gradually in four general stages: preclinical, mild (early stage), moderate (middle stage), and severe (late stage). Some institutions, such as the Mayo Clinic (2020), consider one more transitional stage, giving five stages: preclinical Alzheimer's disease, mild cognitive impairment (MCI), mild dementia, moderate dementia, and severe dementia. In this application, we consider three clinical disease stages for Alzheimer's disease, as suggested by ADNI [20]: Cognitively Normal (non-diseased), MCI (early diseased), and dementia (fully diseased). The gold standard for determining the AD stage is the global clinical dementia rating (CDGLOBAL), which is obtained from individual ratings by experienced clinicians in multiple domains. CDGLOBAL values of 0, 0.5, and 1 indicate cognitively normal (non-diseased), MCI (early diseased), and dementia (fully diseased), respectively. Here, patients with CDGLOBAL values equal to or greater than 1 (that is, 1, 2, or 3) are combined into the fully diseased group.
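The grouping rule for the gold standard can be written as a small mapping from CDGLOBAL to the three stages used in this application. The function name and the numeric stage coding (1, 2, 3) are hypothetical conveniences and are not part of the ADNI data dictionary.

```python
def stage_from_cdglobal(cdglobal):
    """Map a CDGLOBAL rating to stage 1 (CN), 2 (MCI), or 3 (dementia)."""
    if cdglobal == 0:
        return 1                      # cognitively normal (non-diseased)
    if cdglobal == 0.5:
        return 2                      # MCI (early diseased)
    if cdglobal >= 1:
        return 3                      # ratings 1, 2, and 3 are combined
    raise ValueError(f"unexpected CDGLOBAL value: {cdglobal!r}")


ratings = [0, 0.5, 1, 2, 3]
print([stage_from_cdglobal(r) for r in ratings])
```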

Data Analysis
The data used in the preparation of this paper were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). ADNI is a global longitudinal multicenter study that aims to track the progression of AD, detect AD at the earliest possible stage (pre-dementia), and improve clinical trials to prevent and treat the disease. ADNI researchers collect, validate, and use data, including cerebrospinal fluid (CSF), MRI and PET images, blood biomarkers, cognitive tests, and genetics, as predictors of the disease. Figure 4 presents five core biomarkers as indicators of AD over the clinical disease stages. The curves depict changes from normal to abnormal in the following five biomarkers [20] over AD's progression. Changes in biomarkers 1-3 are indicators that can be observed before a dementia diagnosis, while changes in biomarkers 4-5 are the classic indicators of a dementia diagnosis.
Current treatments for Alzheimer's disease treat only the symptoms and are indicated for patients with Alzheimer's dementia. Should disease-modifying treatments become available in the future, early and accurate diagnosis of Alzheimer's disease will be necessary to initiate treatment [21]. Biomarker assessment will likely be crucial for identifying early markers of brain changes before the onset of cognitive impairment [22]. For the biochemical diagnosis of AD, the medical community has developed cerebrospinal fluid (CSF) biomarkers, including the standardization and harmonization of (pre-)analytical procedures [23]. Currently, three core CSF biomarkers are used to confirm AD: the 42-amino-acid-long amyloid-beta peptide (Aβ1-42), tau phosphorylated at threonine 181 (p-tau181), and total tau protein (t-tau). However, CSF is obtained by invasive lumbar puncture, which has limitations in practice. Diagnostic biomarkers that can be obtained with less invasive methods are desirable for detecting AD early. An appropriate source of biomarkers should allow frequent testing, cause minimal discomfort to patients, make follow-up easy, and improve consent to clinical trials. Blood serum and plasma can serve as such a source. The ADNI study has collected data from various sources, including blood biomarkers, cerebrospinal fluid (CSF), cognitive tests, MRI and PET images, and genetics. It is essential to determine which biomarker measures, and which associated cut-points, best predict the presence of the disease.
Based on the ADNI biomarker core report [24], the biomarkers we are interested in are mainly CSF variables: Aβ1-42, p-tau181, t-tau, the ratio of t-tau to Aβ1-42 (t-tau/Aβ1-42), and the ratio of p-tau181 to Aβ1-42 (p-tau181/Aβ1-42). We also include hippocampus volume and brain volume in the analysis since those two biomarkers are related to the severity of cognitive impairment [25].

Analysis of ADNI data
We apply the four diagnostic accuracy measures discussed in the previous sections (CD, MADET, GYI, and VUS) to the ADNI-1 data collected between September 2005 and August 2007. We also apply five criteria for cut-point selection (CD, MADET, GYI, MD, and MV) to the dataset to find the corresponding optimal cut-points. Our primary goal is to evaluate the seven biomarkers of interest: hippocampus volume (Hippocampus), brain volume (WholeBrain), total tau (TAU), Aβ1-42 (ABETA142), p-tau181 (PTAU181P), the ratio of t-tau to Aβ1-42 (Ratio.TAU), and the ratio of p-tau181 to Aβ1-42 (Ratio.PTAU). Another goal is to compare the performance of the different diagnostic accuracy measures and cut-point selection criteria. We use the correct classification rate (CCR) for each stage, the total correct classification rate (Total CCRs) over the three stages, the loss of total CCRs, and the maximum and minimum difference of CCRs (MMDIF) to evaluate the performance of the above methods.
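The per-stage CCRs and their summaries can be sketched for a sample classified with given cut-points. The snippet assumes that larger biomarker values indicate later stages, that every stage is represented in the sample, and that MMDIF is the difference between the largest and smallest CCR; this reading of MMDIF is an assumption, since this section does not give its formula. The data, cut-points, and function names are hypothetical.

```python
def classify(x, cuts):
    """Stage index (1..k) for a value x, given ascending cut-points."""
    return 1 + sum(x > c for c in cuts)


def evaluate(values, stages, cuts):
    """Return per-stage CCRs, their total, and max CCR minus min CCR."""
    k = len(cuts) + 1
    correct = [0] * k
    totals = [0] * k
    for x, s in zip(values, stages):
        totals[s - 1] += 1
        correct[s - 1] += classify(x, cuts) == s
    ccrs = [c / t for c, t in zip(correct, totals)]
    return ccrs, sum(ccrs), max(ccrs) - min(ccrs)


# Hypothetical biomarker values with gold-standard stages 1, 2, 3.
values = [0.2, 0.8, 1.4, 1.1, 1.9, 2.4, 2.6, 3.0, 3.5]
stages = [1, 1, 1, 2, 2, 2, 3, 3, 3]
ccrs, total_ccr, mmdif = evaluate(values, stages, cuts=(1.0, 2.3))
print(ccrs, total_ccr, mmdif)
```

For a biomarker whose values decrease as the disease progresses, one option is to negate the values (and the cut-points) before applying the same rule.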
The ADNI-1 data collected between September 2005 and August 2007 include 114, 256, and 45 subjects in the non-diseased (Cognitively Normal, or CN), early diseased (Mild Cognitive Impairment, or MCI), and fully diseased (Dementia) groups, respectively. However, the actual sample sizes for individual biomarkers may be smaller than the group sizes because of missing values. Table 4 presents summary descriptive statistics of the seven biomarkers of interest in the ADNI data.
Density plots are provided in Figure 1 to visualize the distributions of those biomarkers, as well as the gold standard (CDGLOBAL). Smaller values of the biomarkers Hippocampus, WholeBrain, and ABETA142 indicate a more severe disease stage. In contrast, the values of the remaining biomarkers increase as the disease progresses to later stages. Both Hippocampus and WholeBrain are approximately normally distributed for the non-diseased, early diseased, and fully diseased groups. The density plots of PTAU181P, TAU, Ratio.TAU, and Ratio.PTAU are skewed, approximately following Gamma distributions, for all three classes. While the three classes overlap for all the biomarkers, Hippocampus and WholeBrain seem to separate the early diseased and the fully diseased more than the remaining biomarkers do. The density plots of ABETA142 for the early diseased population (stage 2) and the fully diseased population (stage 3) are alike, and the plots for all three classes overlap almost completely. Those are warning signs that ABETA142 might not be a good biomarker for discriminating among the three AD diagnosis stages.
CD, MADET, GYI, VUS, MD, and MV are applied to the ADNI-1 dataset. The optimal statistics of the accuracy measures, the estimated optimal cut-points ($c_1$ and $c_2$), the correct classification rates ($p_{11}$, $p_{22}$, and $p_{33}$), and the total correct classification rates (Total CCRs) of the seven biomarkers of interest are calculated and presented in Table 5. Among the six methods, VUS cannot be used for selecting optimal cut-points, and therefore only its optimal statistic is given. MD and MV are cut-point selection criteria and can only be used to select optimal cut-points, so their optimal statistics should not be used to measure diagnostic accuracy even though they are reported.

Analysis results
Based on the estimated statistics, the seven biomarkers are ranked in Table 6. For example, with the CD measure, the best biomarker is Hippocampus, followed by Ratio.PTAU, Ratio.TAU, and ABETA142. Among the seven biomarkers, Hippocampus has the highest estimated statistic for all four diagnostic accuracy measures. Thus, Hippocampus is the best biomarker for discriminating subjects among the three stages of Alzheimer's disease. All four methods agree that WholeBrain is the least favorable biomarker. While ABETA142, PTAU181P, and TAU are ranked very differently by the four methods, the rankings of Hippocampus, Ratio.PTAU, and Ratio.TAU are consistent. Some literature has shown that the ratios of total tau to Aβ1-42 and of p-tau181 to Aβ1-42 outperformed Aβ1-42 alone with the INNOTEST and INNO-BIA CSF platforms, using ROC analyses and disease-independent mixture modeling [24].
In this sense, the CD method performs better than the other methods at selecting biomarkers for measuring overall diagnostic accuracy. For all seven biomarkers, the cut-point $c_1$ estimated by GYI is larger than those estimated by the other methods (CD, MADET, MD, and MV), implying that GYI tends to diagnose more subjects into the non-diseased group. However, the performance of all five criteria for selecting cut-points is similar in terms of $p_{11}$, the correct classification rate for the first class (the non-diseased group). Except for GYI, the CD method has a slightly higher $p_{11}$ than the other methods for the biomarkers Hippocampus, WholeBrain, and TAU. The intervals between $c_1$ and $c_2$ with GYI are shorter than those obtained with the other methods, which indicates that fewer subjects are diagnosed into the early diseased group with the GYI criterion than with the other criteria. For ABETA142 and PTAU181P, the estimated cut-points $c_1$ and $c_2$ using GYI are identical. The corresponding $p_{22}$s, the correct classification rates for the second class (the early diseased group), are then necessarily zero, since no subject can fall between two identical cut-points.
For most of the biomarkers, the cut-point $c_2$, which distinguishes the early diseased from the fully diseased, is generally larger when estimated by CD than by GYI but smaller than when estimated by the other methods. As a result, the correct classification rates $p_{33}$ with CD are smaller than those with GYI but larger than those with the other methods. This shows that the CD method has the potential to rule in patients. Although GYI produces the largest $p_{33}$s for most biomarkers, the CD method can be a good alternative when the GYI criterion may not be suitable, such as for ABETA142. Except for GYI, the CD method also gives larger or comparable total correct classification rates than the other methods for the biomarkers of interest.

Discussion
In diagnostic studies, two essential topics are measuring diagnostic accuracy and selecting the optimal cut-point(s). Most diagnostic accuracy measures and optimal cut-point selection criteria apply only to diseases with two or three stages. In practice, it is vital to detect the early stage of a disease so that timely medical interventions can reduce cost and improve patients' quality of life. Diagnostic tests that can identify multiple stages are therefore highly valuable and much needed. In recent years, some measures for diseases with a general k-stage structure have been developed, such as the hypervolume under the manifold (HUM), the generalized Youden index (GYI), and the maximum absolute determinant (MADET).
Motivated by Dong's MADET method, which utilizes all the classification information and provides more balanced correct classification rates, this paper proposes a new diagnostic accuracy measure based on the difference between concordance and discordance (CD) for general k-stage diseases. This new measure, the CD measure, uses all the classification information and aims to achieve higher correct classification rates for the stages of interest. The associated criterion selects the optimal cut-points as those that maximize CD in the general k-stage setting. A power study and simulations were conducted under the three-class setting to compare CD with VUS and GYI in measuring diagnostic accuracy and in selecting cut-points. The power study shows that, in some scenarios, CD can detect differences between the null and alternative hypotheses that other methods cannot. The simulation results indicate that selecting optimal cut-points with the CD measure provides relatively high correct classification rates with less loss than MADET, MD, and MV, and more balanced correct classification rates than GYI.
The CD measure and cut-point selection criterion were then applied to the ADNI data as an example to compare their performance with that of the other methods. For this dataset, the CD method performs best at measuring the overall diagnostic accuracy of the selected biomarkers, and its results are consistent with the literature. MADET, MD, and MV generally achieve a better balance, correctly classifying relatively equal numbers of subjects into the three stages. CD and GYI, on the other hand, tend to correctly classify more subjects into the non-diseased (stage 1) or fully diseased (stage 3) groups.
There is no clear winner among the methods. Each method has its own advantages for particular diagnostic tests and biomarkers, depending on the focus and purpose of the test. For example, in the ADNI application, the Youden index has a much higher probability of correctly identifying the non-diseased population with the biomarker ABETA142. This suggests that the Youden index is a better choice for ruling out non-diseased subjects using ABETA142, even though it fails to identify any patient in the second stage.
The CD measure is defined as the absolute value of the difference between the concordance and discordance probabilities. A larger C − D means better agreement, but a larger |C − D| does not necessarily do so. The observed C − D could be negative by chance, but this is unlikely if the diagnostic test is informative about the true ordinal class. The absolute value is used for convenience, to remove the dependence on the direction of the measure.
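The role of the absolute value can be illustrated with a small Python sketch. This section does not restate the exact definitions of C and D, so the sketch assumes a Kendall-style pairwise definition: among pairs of subjects from different true stages, C is the proportion whose predicted stages are ordered the same way and D the proportion ordered the opposite way. The function name `c_minus_d` is hypothetical, and the sketch is not the estimator used in the study.

```python
from itertools import combinations


def c_minus_d(true_stages, predicted_stages):
    """Return C - D over all pairs of subjects with different true stages.

    Assumption (illustrative only): a pair is concordant when the predicted
    stages are ordered like the true stages, discordant when they are
    ordered the opposite way, and neither when the predictions are tied.
    """
    concordant = discordant = comparable = 0
    pairs = combinations(zip(true_stages, predicted_stages), 2)
    for (t1, p1), (t2, p2) in pairs:
        if t1 == t2:
            continue  # same true stage: the pair carries no ordering
        comparable += 1
        direction = (t1 - t2) * (p1 - p2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    if comparable == 0:
        raise ValueError("need subjects from at least two true stages")
    return (concordant - discordant) / comparable


true_stages = [1, 2, 3]
perfect = c_minus_d(true_stages, [1, 2, 3])   # every pair concordant
reversed_ = c_minus_d(true_stages, [3, 2, 1])  # every pair discordant
print(perfect, reversed_)            # 1.0 -1.0
print(abs(perfect), abs(reversed_))  # 1.0 1.0
```

A perfectly reversed classification has the same |C − D| as a perfect one, which is why a large absolute value signals strong discrimination but not necessarily agreement in the intended direction.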
This study focuses on diseases with ordinal stages. However, many applications, such as genomic studies, involve multiple nominal classes, and the CD measure is not applicable in those cases. Dong et al. (2017) claimed that their MADET measure can handle such situations, since the absolute value of the determinant of the classification probability matrix (P-matrix) does not change when columns (or rows) of the P-matrix are switched.
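The invariance property cited for MADET follows from a basic fact of linear algebra: swapping two columns (or rows) of a matrix flips the sign of its determinant but leaves its absolute value unchanged. The Python sketch below checks this for a 3 × 3 case. The matrix entries are made-up illustrative values, not results from the study, and the helper names `det3` and `swap_columns` are hypothetical.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def swap_columns(m, j, k):
    """Return a copy of m with columns j and k exchanged (0-based indices)."""
    swapped = [list(row) for row in m]
    for row in swapped:
        row[j], row[k] = row[k], row[j]
    return swapped


# Made-up classification probability matrix: row = true class,
# column = predicted class, each row sums to 1.
P = [
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
]
P_relabelled = swap_columns(P, 0, 1)  # relabel predicted classes 1 and 2

# The sign flips, but the absolute determinant is unchanged.
print(math.isclose(det3(P), -det3(P_relabelled)))
print(math.isclose(abs(det3(P)), abs(det3(P_relabelled))))
```

Because the labels of nominal classes carry no order, a measure that is unchanged under such relabelling does not depend on how the classes happen to be numbered.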
We intentionally chose heavily overlapping distributions for the simulations and the data application, because previous diagnostic accuracy measures do not distinguish such classes well. Although the CD measure works better than most measures when stages 2 and 3 overlap more than stages 1 and 2, it has similar difficulty distinguishing classes when two or more of the ordinal stages overlap heavily. In that case, it may be necessary to merge classes before further calculation.