Classification and severity progression measure of COVID-19 patients using pairs of multi-omic factors

Early detection and effective treatment of severe COVID-19 patients remain two major challenges of the current pandemic. Analysis of molecular changes in blood samples of severe patients is one of the promising approaches to this problem. From thousands of proteomic, metabolomic, lipidomic, and transcriptomic biomarkers selected in other research, we identify several pairs of biomarkers that, after an additional nonlinear spline transformation, are highly effective in classifying and predicting severe COVID-19 cases. The performance of these pairs is evaluated in-sample, in a cross-validation exercise, and in an out-of-sample analysis on two independent datasets. We further improve our classifier by identifying complementary pairs using hierarchical clustering. As a result, we achieve 96–98% AUC on the validation data. Our findings can help medical experts identify small groups of biomarkers that, after a nonlinear transformation, can be used to construct a cost-effective test for patient screening and for predicting severity progression.


Introduction
The ongoing global COVID-19 pandemic is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). As of 21 March 2022, more than 471 million cases of COVID-19 had been reported globally, resulting in more than 6.09 million deaths. Prior to the rollout of the vaccination programs, about 80% of patients infected with SARS-CoV-2 were classified as mild or moderate COVID-19 cases because they displayed mild symptoms with a good prognosis, and they usually recovered with, or even without, conventional medical treatment (see [66]). The remaining 20% of patients suffered from respiratory distress and required immediate oxygen therapy or other inpatient interventions, including mechanical ventilation (see [47,72]). Unfortunately, many people worldwide remain unvaccinated because of the lack of access to vaccines (especially in underdeveloped countries), preexisting medical conditions that prevent vaccination, age restrictions for vaccination, or their own personal choice. Hence, the prediction of severe cases remains one of the major challenges of this pandemic.
Results in [49,61] suggest that information encoded in the biomarkers in patients' blood can lead to a medical test that predicts who is likely to progress to a severe case and who will recover from COVID-19 without progression. The datasets in these two papers were derived from serum and plasma, and from serum alone, respectively.
McElvaney et al. [44] study the relationship between patient severity and the ratio of two small proteins (cytokines), interleukin (IL)-6 divided by IL-10. They create a simple 5-point linear score predictor of clinical outcome, the so-called Dublin-Boston score. It is the first and only COVID-19-specific severity prognostic score designed to guide clinical decision-making. The Dublin-Boston score is easy to calculate and can be applied to a spectrum of hospitalized COVID-19 patients.
Overmyer et al. [49] analyzed more than seventeen thousand multi-omic factors that differ in people with COVID-19 compared to other non-COVID patients. Of those, the final fourteen thousand variables were selected and used in an extra trees classification model to predict the progression of COVID-19 patients. Importantly, for most of the patients, the blood samples for the analysis were collected before the symptoms of these patients developed to a severe level.
Finally, Shen et al. [61] analyzed more than 1600 proteins and metabolites of COVID-19 patients. Of those, levels of about 75 were associated with disease severity, and the final 29 variables were selected via a random forest model to significantly predict severe cases among the patients in their sample.
Tree-based classification methods employed in these two studies are robust and nonlinear classifiers (see [28]). As with many other machine learning methods, when trained on a large enough dataset, they have small generalization errors and perform well in out-of-sample prediction. However, in the final models in [49,61] there are 100 and 31 observations, and 1113 and 29 decision variables, respectively. The proportion of decision variables to the number of observations in these two papers carries an extreme risk of overfitting during the training of the model. In fact, the out-of-sample classification error in [49] is 20%. Similarly, Shen et al. [61] report that their model wrongly classifies 3 patients out of 10 in an out-of-sample analysis. This corresponds to a classification error rate of 30%.
Moreover, the classifiers developed in both papers employ a large number of biomarkers from patients' blood. The complexity of measuring many quantities, and the elevated risk of measurement error, prevent these tests from being used in mass medical testing. Finally, the random forest is not a probabilistic classifier: it does not provide the probability that a given patient will progress to a severe state.
The approach of the aforementioned Dublin-Boston score from [44] is different. Instead of a data-intensive statistical procedure prone to overfitting, the authors select only two factors using their medical expertise and form the score. The use of two factors allows them to provide a practical measure of severity that is recommended by medical doctors to quickly assess patients' severity progression and risk. Despite its benefits, the Dublin-Boston score lacks statistical formality and validation.
In this paper, we propose a classification method that, similarly to the Dublin-Boston score, utilizes only a few factors, and that significantly outperforms the two tree-based classifiers from Refs [49,61]. Our method consists of four components: (i) preselection of important factors using log2-fold-change and Benjamini-Hochberg-adjusted p-values (a standard approach in biostatistical research); (ii) bivariate logistic regression models; (iii) nonlinear spline transformations; and (iv) clustering analysis.
Given the small sample size and the empirical characteristics of the data, in step (ii) we use only bivariate logistic regression models. The combination of a parametric probabilistic classifier and an exhaustive search through all possible pairs of biomarkers makes our model applicable even in small samples and allows us to identify pairs of significant factors. Many of the bivariate models already have very high in-sample likelihood values. In order to further reduce the risk of overfitting, we include a Ridge penalty in our logistic regression. Even among these simple bivariate models, we cannot determine a unique best-performing one. Recommending one model or one pair of biomarkers would be wrong, misleading, and possibly dangerous. Besides good statistical performance, there are many other objectives that need to be taken into account when selecting the best factors for a COVID-19 severity test. The ability, time, and cost required to collect the data from patients' blood can disqualify some of the models, and the difficulty of measuring a given multi-omic factor accurately needs to be considered when selecting the model for real testing; e.g. data derived from different depots (serum, plasma, and white blood cells) might pose a challenge in developing a clinically useful biomarker. These characteristics can only be assessed by medical experts, lab engineers, and technicians. Therefore, instead of a unique best-performing model, we report a set of bivariate models with the largest empirical likelihood values (comparing their information criteria is equivalent because the models have the same number of parameters). Finally, we provide a list of references (see Table 5) that select our biomarkers as severity progression predictors in their studies. This is independent evidence that the selected models identify useful biomarkers, and it demonstrates that, most likely, there is no unique combination of biomarkers that can accurately predict patient severity.
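The exhaustive search over bivariate models described above can be sketched as follows. This is a minimal illustration on toy data, not the paper's implementation: it fits a Ridge-penalized logistic regression to every pair of biomarkers by plain gradient descent and ranks the pairs by in-sample log-likelihood. All data and function names are hypothetical.

```python
import itertools
import numpy as np

def fit_ridge_logistic(X, y, lam=1.0, lr=0.1, n_iter=2000):
    """Fit a logistic regression with an L2 (Ridge) penalty by gradient descent.

    Returns the weights, intercept, and in-sample log-likelihood."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) + lam * w) / n
        b -= lr * np.mean(p - y)
    p = np.clip(1.0 / (1.0 + np.exp(-(X @ w + b))), 1e-12, 1 - 1e-12)
    ll = float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))
    return w, b, ll

def rank_biomarker_pairs(X, y, lam=1.0, top=5):
    """Exhaustively fit every bivariate model and sort pairs by log-likelihood."""
    scores = []
    for i, j in itertools.combinations(range(X.shape[1]), 2):
        _, _, ll = fit_ridge_logistic(X[:, [i, j]], y, lam=lam)
        scores.append(((i, j), ll))
    return sorted(scores, key=lambda t: -t[1])[:top]

# Toy data: biomarkers 0 and 1 jointly separate the classes; biomarker 2 is noise.
rng = np.random.default_rng(0)
n = 60
y = np.repeat([0.0, 1.0], n // 2)
X = rng.normal(size=(n, 3))
X[:, 0] += 2.0 * y
X[:, 1] -= 2.0 * y
best = rank_biomarker_pairs(X, y)
print(best[0][0])
```

As in the paper, the ranking by log-likelihood is equivalent to ranking by information criteria here, since all bivariate models have the same number of parameters.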
In the next step, encouraged by the results from the linear classifiers, we investigate a bivariate logistic regression model with spline-transformed features. The motivation for the nonlinear transformation of the features in the bivariate models is that we want to allow nonlinearity in the decision boundary. Models with linear factors extrapolate the decision boundary linearly, whereas spline transformations allow for more flexibility. In particular, the case when one factor has slightly abnormal levels while the other factor is in its normal range needs to be treated more carefully. We confirm that using a linear or quadratic spline transformation improves the performance of many of the models in terms of both better in-sample fit and lower out-of-sample classification error in the cross-validation.
In the final step, we use combinations of the best pairs of factors to form a new logistic regression. Given the number of factors, an exhaustive search over triples and quadruples of (nonlinearly transformed) factors would be computationally too intensive and prone to overfitting. Combining only the factors from the top-performing pairs would be ineffective because the top-performing pairs tend to capture the same characteristics of the patients, and, as we show empirically, merging them into one model does not increase the performance. Instead, we apply a hierarchical clustering algorithm to the classification results of individual patients from the set of best-performing bivariate models to find complementary pairs of factors that can correctly classify different subsets of patients. As a result, on the datasets from [49,61], our classification algorithm achieves very good in-sample and out-of-sample performance using only triples or quadruples of biomarkers.
The remainder of this article is organized as follows. Section 2 introduces our class of models and the selection procedure. Section 3 discusses the two datasets used for the empirical analysis. Section 4 presents the empirical results in terms of in-sample fit, cross-validation results, and out-of-sample evaluation. Section 5 summarizes the main findings and provides ideas for further research. Finally, the Online Appendix provides detailed results for one of the datasets and the robustness results against the control group of non-COVID patients for both datasets.

Problem statement
Let ξ_1, . . . , ξ_n be the n labeled patients, where ξ_i = (x_i, y_i), with x_i ∈ R^k the vector of blood biomarkers for patient i, and y_i ∈ {0, 1} the binary label of the patient's severity, where y_i equals 1 for an ICU (severe) case and 0 for a non-ICU (nonsevere) case. We assume that {ξ_i}_{i=1}^n are independent realizations from some distribution such that y_i | x_i ∼ F_θ, and that the severity of a COVID-19 patient can be modeled by a binary classifier function θ : R^k → {0, 1}. The function θ is the unknown target feature of the distribution F_θ, and it belongs to some set Θ of possible functions. Using the existing labeled patient data, we estimate θ by an approximation θ̂ ∈ Θ. The quality of θ̂ is measured by its loss L(θ̂) = E_{F_θ}[γ(θ̂; ξ)], where γ : Θ × Ξ → [0, ∞) is a contrast function. The loss L(θ̂) is assumed to be minimal for θ̂ = θ; it measures the average discrepancy between θ̂ and a new observation ξ with distribution F_θ. The maximum likelihood estimator θ̂ for a given model m maximizes the empirical log-likelihood function −L_F̂^m(θ̂; ξ), where the contrast function is the negative log of the conditional density of y | x described by the model m ⊂ Θ. The empirical log-likelihood {−L_F̂^m(θ̂)} has expectation {−L_{F_θ}^m(θ̂)}, which is maximal for θ̂ = θ. If θ ∈ m, then the parameter estimates characterizing θ̂ converge to the true parameters describing θ. If θ ∉ m, then, under some regularity conditions, the parameter estimates characterizing θ̂ converge to the parameters characterizing the function θ*_KL ∈ m such that θ*_KL = arg min_{θ̂ ∈ m} KL(F_θ; F_θ̂), where KL(·;·) denotes the Kullback-Leibler (KL) divergence. In other words, the maximum likelihood estimator estimates the distribution in our model that is closest, with respect to KL divergence, to the true distribution F_θ, even when θ ∉ m.
However, even after the preselection of explanatory variables via log2-fold-change and p-values from the univariate models, the dimension k of the multi-omic factors x that can potentially explain the severity of the patients in our model is very large. Selecting a unique best-performing model using purely statistical techniques would lead to a multiple comparisons problem. In order to avoid this, we search for sets of best-performing models with a limited number of covariates and leave the final selection of the best model to the medical experts.
More formally, denote by {x_·,j}_{j=1}^k the set of all biomarkers, and by P({x_·,j}_{j=1}^k) the set of all subsets of them. Let M_ℓ denote all subsets of the biomarkers with at most ℓ elements, and M(M′) the subset of all models from a given class of models M with the covariates from the set M′. Define M_{L*}(M_ℓ) = { m ∈ M(M′), M′ ∈ M_ℓ : L(θ̂_m) < L* } (1), the set of models from a given class M with at most ℓ factors and such that the optimization criterion at the optimal solution is less than a threshold L*. This subset of models consists of the best-performing models (and the corresponding factors) that provide the best fit to our classification problem. Our problem is to find the empirical version of the set of models defined in (1): given n observations and a threshold L*, we look for the analogous set with the loss L replaced by its empirical counterpart. In the empirical analysis in Section 4, we set ℓ = 2 and report the best models sorted according to their log-likelihood values −L_F̂^m(θ̂; ξ).

Class of logistic regression models
The logistic regression allows us to model θ using the contrast function γ(θ̂; ξ) = −[ y log σ(x) + (1 − y) log(1 − σ(x)) ] (3), where σ(x) = 1/(1 + exp(−S(x))), and S is a function on R^ℓ, called a score function. A simple example of S is a linear function, S(x) = w⊤x + b, where w ∈ R^ℓ and b ∈ R; for more general S, see Section 2.2 below.
The triple (S, w, b) characterizes all the models that we consider, M ⊆ Θ. For a given S, the parameters (w, b) ∈ R^{ℓ+1} can be estimated by the maximum likelihood method (see, for instance, Hosmer and Lemeshow [29]).

Generalized additive model with spline-transformed biomarkers
In addition to a simple logistic regression with a linear function S, we use a spline transformation to capture the nonmonotonic relation between every biomarker x_j and the response y. This is a special case of the Generalized Additive Models introduced by Hastie and Tibshirani [25]. In particular, for every feature x_j, we define a partition a = t_0 < t_1 < · · · < t_n = b, where the end points a and b are the minimal and the maximal observed values of the biomarker. A spline s on [a, b] is a piecewise polynomial function s(x) = Σ_{i=1}^n P_i(x) 1_{[t_{i−1}, t_i)}(x), where each P_i is a polynomial, and 1_A(x) is an indicator function equal to 1 if x ∈ A and 0 everywhere else. The points t_i, i = 0, 1, . . . , n, are called knots and are specified by splitting the interval [a, b] into n subintervals containing (approximately) an equal number of observations (see [2] for a typical definition of a spline).
In the analysis below, we consider three cases: (i) linear splines with one knot; (ii) linear splines with two knots; and (iii) quadratic splines with one knot; i.e. the P_i are linear and quadratic polynomials, respectively. These are special cases of spline interpolation that, given our sample size, offer the best trade-off between the flexibility of the model and the estimation error. They are also very often used in practice.
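The knot placement and the linear-spline construction described above can be sketched as follows, assuming the standard truncated-power representation of a continuous piecewise-linear spline (the paper's PSG-based fitting may parameterize the spline differently):

```python
import numpy as np

def quantile_knots(x, n_knots):
    """Interior knots chosen so that each subinterval between consecutive
    knots contains (approximately) an equal number of observations."""
    qs = np.linspace(0.0, 1.0, n_knots + 2)[1:-1]
    return np.quantile(x, qs)

def linear_spline_basis(x, knots):
    """Truncated-power basis of a continuous piecewise-linear spline:
    columns [x, (x - t1)_+, (x - t2)_+, ...]."""
    cols = [x] + [np.maximum(x - t, 0.0) for t in knots]
    return np.column_stack(cols)

x = np.arange(10, dtype=float)
knots = quantile_knots(x, 1)            # one interior knot at the sample median
B = linear_spline_basis(x, knots)       # basis with one slope-change column
print(knots, B.shape)
```

Each `(x - t)_+` column lets the slope change at the corresponding knot, which is exactly the extra flexibility in the decision boundary discussed above.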
The corresponding fitting procedure can be reduced to convex programming and is easily implemented. For spline fitting and logistic regression, we have used the Portfolio Safeguard (PSG) package, which is free for academic purposes (download from www.aorda.com). PSG is a general-purpose nonlinear optimization package for solving optimization and statistical problems. Nonlinear transformations, such as log, exp, or polynomial, are commonly used for transforming features. However, with the splines implemented in PSG, it is possible to perform an optimal transformation rather than relying on a trial-and-error approach. The mathematical description of the one-dimensional spline as an argument of the logistic regression likelihood function is provided there. The mathematical programming problem for finding the optimal parameters of the spline is a convex nonlinear programming problem that can be solved very efficiently with standard nonlinear optimization algorithms.
Each biomarker vector x ∈ R^k is transformed via k separate logistic regression problems, where the j-th problem utilizes only the j-th biomarker x_j to perform a univariate prediction. After estimating the splines, say s_j, for every x_j, j = 1, . . . , k, we employ the transformed features s_j(x_j) as the new features of the score function, S(x) = Σ_{j=1}^ℓ w_j s_j(x_j) + b. This transformation does not affect the complexity of the subsequent optimization problem, i.e. the maximization of the log-likelihood function of the logistic regression.
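The two-stage procedure — univariate spline-logistic fits producing transformed features s_j(x_j), followed by a bivariate logistic regression on those features — can be sketched on toy data as follows. This is a minimal gradient-descent illustration with hypothetical data; the paper solves the corresponding convex programs in PSG.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=3000):
    """Maximum-likelihood logistic regression via gradient descent.
    Returns coefficients followed by the intercept."""
    Xb = np.column_stack([X, np.ones(len(y))])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ w)
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def spline_score(x, y, knot):
    """Univariate score s(x) from a one-knot linear-spline logistic fit."""
    B = np.column_stack([x, np.maximum(x - knot, 0.0)])
    w = fit_logistic(B, y)
    return B @ w[:-1] + w[-1]

# Toy data: two biomarkers with opposite shifts between the classes.
rng = np.random.default_rng(1)
n = 80
y = np.repeat([0.0, 1.0], n // 2)
x1 = rng.normal(size=n) + 1.5 * y
x2 = rng.normal(size=n) - 1.5 * y

# Stage 1: univariate spline transformations s_j(x_j).
s1 = spline_score(x1, y, np.median(x1))
s2 = spline_score(x2, y, np.median(x2))

# Stage 2: bivariate logistic regression on the transformed features.
w = fit_logistic(np.column_stack([s1, s2]), y)
pred = (sigmoid(np.column_stack([s1, s2, np.ones(n)]) @ w) > 0.5).astype(float)
print(float((pred == y).mean()))
```

As stated above, the second-stage problem has the same form (and complexity) as an ordinary bivariate logistic regression; only the inputs have been transformed.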

Clustering and selection of complementary pairs
Based on the empirical results from the logistic regression in Section 4, we have identified that, for both datasets considered, there are clusters of pairs of factors such that within a cluster all the models classify individual patients similarly, but between the clusters they are complementary, i.e. models in one cluster correctly classify patients on which models from the other cluster make a classification error (see the clustergram plots in Section 4). Therefore, in the final step of the analysis, we employ a hierarchical clustering algorithm to identify such clusters of models. Given the clusters, we select complementary pairs of factors by taking the best-performing (in terms of in-sample log-likelihood values) bivariate models from each cluster. These pairs-of-pairs of factors consist of three (one factor can occur in both models) or four different factors. In the final stage, we fit a logistic regression to these factors jointly. Importantly, the use of complementary factors provides much better out-of-sample results than combinations of only top-performing factors.
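A minimal sketch of this clustering step, assuming per-model, per-patient correctness indicators as the clustering features and Hamming distance with average linkage (the text does not specify the metric or linkage, so these are illustrative choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical per-patient correctness indicators for 4 bivariate models
# (rows = models, columns = patients; 1 = patient correctly classified).
correct = np.array([
    [1, 1, 1, 0, 0, 1],   # model A
    [1, 1, 1, 0, 0, 1],   # model B (errs on the same patients as A)
    [0, 0, 1, 1, 1, 1],   # model C (complementary to A and B)
    [0, 1, 1, 1, 1, 1],   # model D
])
loglik = np.array([-10.0, -11.0, -12.5, -12.0])  # in-sample log-likelihoods

Z = linkage(correct, method="average", metric="hamming")
labels = fcluster(Z, t=2, criterion="maxclust")   # top two clusters

# Pick the best model (highest log-likelihood) from each cluster; their
# union of factors would then be refit in one joint logistic regression.
best_per_cluster = {c: int(np.argmax(np.where(labels == c, loglik, -np.inf)))
                    for c in np.unique(labels)}
print(sorted(best_per_cluster.values()))
```

Here models A and B fall into one cluster and C and D into the other, so the complementary pair-of-pairs combines model A (index 0) with model D (index 3).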

Model selection algorithm
The following algorithm summarizes the steps of our model selection and estimation.

Datasets
In our empirical analysis, we consider the datasets from two papers [49,61]. The dataset from the first paper (DS-1) was collected using RNA-seq and high-resolution mass spectrometry. It consists of biomarkers from the blood of 50 non-ICU and 50 ICU COVID-19-positive patients, and 10 non-ICU and 15 ICU non-COVID-19 patients (control group). There are 14531 biomarkers (including 517 proteomic, 105 metabolomic, 646 lipidomic, and 13263 transcriptomic), from which we select, as in [49], 231 factors using the log2-fold-change and the t-test value (|log2-fold-change| > 1.5 and Benjamini-Hochberg-adjusted p-values < 0.05). The left panel in Figure 1 shows statistical significance (p-value) versus magnitude of change (fold change) for all the factors in the data. The outer data points correspond to the selected factors.
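The preselection rule can be sketched as follows on simulated abundances (a toy illustration; the exact t-test variant and preprocessing in [49] may differ):

```python
import numpy as np
from scipy import stats

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    adj = p[order] * n / np.arange(1, n + 1)
    adj = np.minimum.accumulate(adj[::-1])[::-1]      # enforce monotonicity
    out = np.empty(n)
    out[order] = np.minimum(adj, 1.0)
    return out

def preselect(X_icu, X_non, fc_thresh=1.5, alpha=0.05):
    """Keep factors with |log2-fold-change| > fc_thresh and BH-adjusted
    two-sample t-test p-value < alpha (abundances assumed positive)."""
    log2fc = np.log2(X_icu.mean(axis=0) / X_non.mean(axis=0))
    pvals = stats.ttest_ind(X_icu, X_non, axis=0).pvalue
    keep = (np.abs(log2fc) > fc_thresh) & (bh_adjust(pvals) < alpha)
    return np.flatnonzero(keep)

# Simulated abundances: only factor 0 truly differs between the groups.
rng = np.random.default_rng(2)
X_non = rng.lognormal(size=(50, 20))
X_icu = rng.lognormal(size=(50, 20))
X_icu[:, 0] *= 8.0                       # |log2-fold-change| = 3 for factor 0
sel = preselect(X_icu, X_non)
print(sel)
```

Requiring both a large fold change and a BH-significant p-value is what produces the volcano-plot shape in Figure 1, with the selected factors in the outer regions.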
In order to fit the models and evaluate their out-of-sample performance on the DS-1 data, we use a 10-fold cross-validation. We train each of the models on 90% of the observations, evaluate the out-of-sample performance on the remaining 10% of the data, and repeat these steps 10 times for different partitions of the data. The log-likelihood values are computed on the whole sample, and the out-of-sample classification errors are the averages over the 10 testing folds.

Algorithm 1 Model Selection Algorithm
Input: the labeled patient data, a collection of score functions S, and a given threshold L* for the likelihood value (one can pick L* based on the results in the Tables).
Output: collection of bivariate models from M_{L*}(M_2); and the models with the complementary pairs of factors.
Feature Preselection Step: out of a large number of multi-omic factors, select those factors that have |log2-fold-change| > 1.5 and Benjamini-Hochberg-adjusted p-values < 0.05. As a result, there are k factors.
1: for each score function S do
2: if S is nonlinear then
3: Estimate the univariate spline transformation of each of the k factors.
4: end if
5: for j = 1, 2, . . . , (k choose 2) (all bivariate models) do
6: Estimate the j-th bivariate logistic regression model with S-transformed factors, where the contrast γ is defined in (3).
7: end for
8: Select the top models with the highest likelihood values.
9: Perform hierarchical clustering on the selected models and their fit to individual patients.
10: Determine complementary pairs of models by selecting the best-performing model in each of the top two (in the dendrogram) clusters.
11: Estimate the logistic regression model with the complementary pairs of factors as explanatory variables, and evaluate its out-of-sample performance in cross-validation.
12: end for
13: Report the results for all S, for the top-performing bivariate models, and for the complementary pairs, so that medical doctors and other experts can select the preferred biomarkers and model.
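The 10-fold cross-validation scheme can be sketched generically as follows; the `fit`/`predict` pair stands in for any of the classifiers above, and the toy threshold classifier and data are purely illustrative.

```python
import numpy as np

def kfold_cv_error(X, y, fit, predict, k=10, seed=0):
    """Average classification error over k folds: train on (k-1)/k of the
    observations, test on the held-out fold, repeat for every fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errs.append(np.mean(predict(model, X[test]) != y[test]))
    return float(np.mean(errs))

# Toy classifier: threshold the first biomarker at the training-class midpoint.
fit = lambda X, y: 0.5 * (X[y == 0, 0].mean() + X[y == 1, 0].mean())
predict = lambda thr, X: (X[:, 0] > thr).astype(int)

rng = np.random.default_rng(3)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 2))
X[:, 0] += 3.0 * y
err = kfold_cv_error(X, y, fit, predict)
print(err)
```

The reported cross-validation classification error is exactly this average over the held-out folds, while the log-likelihood values are computed on the whole sample.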
The second dataset (DS-2), from [61], consists of four cohorts of patients (C1, C2, C3, and a control group), described below. The biomarker data for cohorts C1 and C2 and for the control group were obtained using a mass spectrometry method that allows for an untargeted metabolomics approach to analyze the sera samples. There are 41 proteomic and 34 metabolomic biomarkers from the patients in these cohorts. The cohorts have the following details:
(C1) The first cohort consists of 31 patients with COVID-19. Among them, 13 patients were diagnosed with a severe stage of the disease (see the criteria listed in the Introduction above), and 18 cases were not severe. It is important to note that among the severe patients, 6 were diagnosed as severe after the blood samples were obtained. Of the remaining severe cases, 4 patients had the blood test done on the same day as, or one day after, being classified as severe, and the remaining 3 had the blood test done a few days after being classified as severe.
(C2) The second cohort consists of 10 patients: 6 nonsevere and 4 severe. Among the severe, 3 patients had a blood test done before being classified as severe.
(C3) The third cohort, reported in [61], consists of an additional 19 patients. The data for this cohort were generated by a targeted proteomics and metabolomics method; therefore, it consists of only the 29 biomarkers selected in [61]. This does not allow us to search through all the 75 biomarkers from C1 and C2. Moreover, in order to use C3, we need to normalize the data before training the models, because the data have been collected using a different method and have a different range than in the other cohorts.
(Control) Finally, there are two control groups: healthy subjects and non-COVID-19 patients. For these subjects, we have the data for all the biomarkers, and we use them to evaluate the out-of-sample false-positive rate of the proposed classifiers.
Our empirical analysis of the DS-2 data replicates the approach in [61]. The DS-2 dataset consists of over 1600 proteins and metabolites of sera from COVID-19 patients and the control group. We select 75 factors via log2-fold-change (see the right panel in Figure 1). Using the data in C1 and C2, we identify the best pairs of biomarkers that can classify the patients into severe and nonsevere cases, and we evaluate their performance on C3 and the two control groups. In particular, our empirical analysis of the DS-2 data consists of four steps: (DS-2: C1, C2) In Section 4.1, we perform linear classification using logistic regression estimated only on the C1 cohort. We select the best pairs of biomarkers using in-sample (on C1) log-likelihood values. Further, for the best pairs of biomarkers, we evaluate their performance in a cross-validation exercise (on C1) and in terms of out-of-sample performance on the independent cohort C2.

Linear classification
In the first step, we estimate the logistic regression models from Section 2.1 on all pairs of biomarkers for both datasets. Table 1 summarizes the results for the best-performing (in terms of in-sample likelihood values) 20 pairs of biomarkers from the dataset DS-1. All the pairs have very high likelihood values and significant coefficients for the factors. The last column in Table 1 summarizes the in-sample and cross-validation errors. The two errors are very close, suggesting a proper fit of the models. Moreover, all of the pairs have a lower cross-validation error than the out-of-sample error of the extra trees model used in the original study in [49]. Figure 2 shows the contour plots of the logistic regression for the top pairs from Table 1. The linear classification using logistic regression captures the categories of the majority of the patients correctly. However, the nonlinear transformation improves many of the results, as also indicated by the out-of-sample error reported in the title of each plot. Table 1A summarizes the results for the DS-2 dataset, for the best (in terms of the log-likelihood values) 72 pairs of biomarkers. We can see that the top 8 pairs in Table 1A, with the highest log-likelihood values (above −10⁻⁵), have p-values equal to 1 for all the coefficients. This indicates that the standard errors of these estimates are extremely large. Therefore, the C1 cohort is not sufficient to properly fit these pairs of biomarkers, even with a classifier as simple as the linear logistic regression.
The remaining pairs in Table 1A also have very high log-likelihood values, but the corresponding p-values are less extreme. Some of these pairs perform well in terms of classification error in leave-one-out cross-validation (CV-C1 columns in Table 1A). Many of them also perform well in terms of classification error in an out-of-sample analysis using the C2 cohort (C2 columns in Table 1A). Figure 3 shows contour plots for the fit of various bivariate logistic models to the DS-2 dataset. The first column of plots in Figure 3 shows the contour plots of the logistic regression function estimated on C1 for a sample of pairs of biomarkers from Table 1A.
The remaining columns of plots in Figure 3 are linear and nonlinear classifiers estimated on the joint dataset C1+C2. Similarly to the results for the DS-1 dataset, the standard logistic regression models fitted to the C1 or C1+C2 data already perform very well in terms of classification. The results for the nonlinear classifiers are discussed in Section 4.2 below.

Linear vs. Nonlinear classification
Encouraged by the results from the linear classifier, and motivated by the fact that models with linear factors extrapolate the decision boundary linearly, we introduce spline-transformed factors that allow for more flexibility in the decision boundary. In particular, when one factor has slightly abnormal levels and the other factor is in its normal range, the spline-transformed models adapt the shape of the decision boundary.
The plots in Figure 4 show the nonlinear transformations of the biomarkers from the best-performing models in Table 1 for the DS-1 dataset. Some of these transformations are even nonmonotonic. Hence, not only abnormal but also typical values of a factor can be classified differently, conditional on the level of the second factor in the pair.
The effect of these nonlinear transformations in a bivariate model is well summarized by the two right columns in Figures 2 and 3. The nonlinear transformation generates nonlinear decision boundaries, and, for most of the pairs of biomarkers, the nonlinear models improve the out-of-sample performance.
For the DS-1 dataset, we have data for 100 patients, which is enough to fit both linear and nonlinear models. The results for the nonlinear classifiers with linear one-knot spline-transformed features, linear two-knot spline-transformed features, and quadratic one-knot spline-transformed features are summarized in Tables 2, 3, and 4, respectively. These more flexible models lead to higher likelihood values than the simple logistic regression models. More importantly, for many models the cross-validation classification error is lower than it was for the linear classifier reported in Table 1.
For the DS-2 dataset, it is difficult to improve on the linear logistic regression using a nonlinear classifier without increasing the sample size. Therefore, we join the two cohorts (C1+C2) and estimate models on the whole dataset. The consequence is that we cannot verify our results on an independent dataset. Instead, for the best 72 pairs listed in Table 1A, we report only the results from leave-one-out cross-validation to detect possible overfit of the nonlinear models. We also report the out-of-sample classification errors of the selected pairs on the data from the C3 cohort. The results are incomplete because of the aforementioned lack of data (see Section 3). Moreover, the classification errors are higher than in the cross-validation because of the different methods used to collect the data and the necessary rescaling of the data. Table 2A summarizes the performance of the pairs of biomarkers in a logistic regression fitted on the joint dataset (C1+C2).

Figure 2. Bivariate contour plots for the best-performing pairs of biomarkers from Table 1. The dots represent individual patients. Column-wise for different models: (i) linear classification using logistic regression; (ii) nonlinear classification using linear one-knot spline-transformed biomarkers and logistic regression; (iii) nonlinear classification using linear two-knot spline-transformed biomarkers and logistic regression; (iv) nonlinear classification using quadratic one-knot spline-transformed biomarkers and logistic regression.

Figure 3. Bivariate contour plots. Row-wise: best-performing pairs of biomarkers. Column-wise for different models: (i) linear classification using logistic regression fitted on the C1 cohort; (ii) linear classification using logistic regression fitted on the C1+C2 cohorts; (iii) nonlinear classification using linear spline-transformed biomarkers and logistic regression fitted on the C1+C2 cohorts; (iv) nonlinear classification using quadratic spline-transformed biomarkers and logistic regression fitted on the C1+C2 cohorts.

The results in Table 2A are sorted according to the in-sample log-likelihood value. The first pair of biomarkers has a very high log-likelihood value, but all the p-values of the coefficients are equal to 1. This suggests that even with this larger dataset, we do not have enough data to fit this pair of biomarkers. In order to reduce the overfit and provide more realistic model parameter estimates, we estimate the models for DS-2 by logistic regression with a Ridge penalty (see Tables 5A-7A). All the other pairs also perform very well in terms of in-sample log-likelihood value and can be compared in terms of in-sample classification error (IS-C1+C2 CE) and leave-one-out cross-validation error in Table 2A. Interestingly, many of the pairs of biomarkers perform very well. Table 3A summarizes the performance of the pairs of biomarkers in a logistic regression with linear-spline one-knot transformed features fitted on the joint dataset (C1+C2); the results are again sorted according to the in-sample log-likelihood value. In terms of the overlap between the factors in our best-performing models and the original research of [61], the following proteins occur with very high frequency both in our top 20 models and in [61]: SAA1, SAA2, ALB, SERPING1, F5, HABP2, CPN1, and CLEC3B.

Figure 4. Spline transformations of the biomarkers from the best-performing models in Table 1 from the DS-1 dataset. The top two rows show linear splines; the bottom two rows show quadratic splines. All the spline transformations use 1/2/3 knots. Different colors of the parts of the curves correspond to the subintervals separated by the knots, and a change of color indicates the location of a knot.

The results in Tables 2A and 3A can help medical experts further identify the best pairs of biomarkers for COVID-19 severity prediction. We provide statistical results that a medical expert can screen through to identify the best combination of biomarkers based on her expertise. For the quadratic spline with one knot, even the joint sample C1+C2 is too small to fit such a flexible model correctly; the right columns in Figure 3 show the poor fit of the quadratic spline model. For completeness, we nevertheless report the estimation results for the quadratic spline with one knot for the top 72 pairs in Table 4A. The support vector machine (SVM) is another (non)linear classifier that is considered more robust than logistic regression against outliers in the data. We have applied SVM to DS-1 and DS-2 with Gaussian, linear, and polynomial-of-order-two kernels; the corresponding results are reported in Table 19A in the Online Appendix. Importantly, for the majority of the pairs of biomarkers, the logistic regression model with features transformed by linear splines with two knots has a lower cross-validation classification error and higher AUC values than any of the SVM models.

Table 3. Summary of the performance of the 20 bivariate logistic regression models with the highest in-sample log-likelihood values. The biomarkers are transformed using a linear spline with two knots. The results are sorted according to the in-sample log-likelihood value obtained after fitting the model on the DS-1 dataset. The columns are the same as in Table 1.

Table 4. Summary of the performance of the 20 bivariate logistic regression models with the highest in-sample log-likelihood values. The biomarkers are transformed using a quadratic spline with one knot. The results are sorted according to the in-sample log-likelihood value obtained after fitting the model on the DS-1 dataset. The columns are the same as in Table 1.

Robustness analysis
Many of the biomarkers that we analyze are also elevated in other diseases. The analysis in this section helps to find COVID-19 specific pairs of biomarkers. We evaluate the performance of our models out-of-sample on the control groups of healthy subjects and non-COVID-19 patients with similar symptoms to COVID-19 cohorts.
For the DS-1 dataset, all the models correctly classify most of the non-COVID non-ICU patients as non-ICU; the corresponding results are reported in the Online Appendix in Tables 11A, 12A, 13A, and 14A. However, the non-COVID ICU patients from DS-1 are often misclassified as COVID ICU patients, i.e., our bivariate models (trained only on COVID patients) cannot separate COVID ICU patients from non-COVID ICU patients. One can argue that this is less of a problem because these patients require medical attention regardless of whether they have COVID.
For the DS-2 dataset, the results are reported in the Online Appendix in Tables 8A, 9A, and 10A. The last two columns are the false-positive classification errors for the two control groups (healthy subjects and non-COVID patients). Most of the pairs correctly classify healthy people as not being severe COVID-19 patients. However, many pairs misclassify non-COVID patients as severe COVID-19 patients: the false-positive error in the last column of Table 8A is always above 20.8%, suggesting that the linear classifier cannot identify COVID-19-specific biomarkers.
Similarly, Tables 9A and 10A report the results for the models with linear spline-transformed and quadratic spline-transformed features, respectively. The same conclusions can be drawn for these nonlinear models, except that the last column in Tables 9A and 10A shows that some pairs have a very low false-positive error for non-COVID-19 patients. This suggests that some of the pairs, after the nonlinear transformation, are robust against patients who have other diseases.
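The robustness check described here can be sketched as follows: a classifier trained to separate severe from non-severe COVID-19 patients is applied to subjects known to be non-severe (e.g., a healthy control group), and the fraction flagged as severe is the false-positive error. All the data below are synthetic stand-ins, and the 0.5 probability threshold is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def false_positive_rate(clf, X_control, threshold=0.5):
    """Fraction of control subjects (all known non-severe) flagged as severe."""
    p = clf.predict_proba(X_control)[:, 1]
    return float(np.mean(p >= threshold))

rng = np.random.default_rng(2)
# hypothetical training data: severe patients have elevated biomarker levels
X_mild = rng.normal(0.0, 1.0, size=(100, 2))
X_severe = rng.normal(2.0, 1.0, size=(100, 2))
clf = LogisticRegression().fit(np.vstack([X_mild, X_severe]),
                               np.r_[np.zeros(100), np.ones(100)])

# a healthy control group with low biomarker levels should rarely be flagged
X_healthy = rng.normal(-0.5, 1.0, size=(50, 2))
fpr = false_positive_rate(clf, X_healthy)
```

A high false-positive rate on a non-COVID control group, as in Table 8A, indicates that the selected biomarkers are elevated by other conditions as well, which is exactly what this check is designed to reveal.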

Complementary pairs of biomarkers
Encouraged by the results obtained with the pairs of biomarkers, we extend the analysis to models with more explanatory variables. Many models built on pairs of biomarkers misclassify the same individual patients; hence, if combined into a larger model, they will not improve the out-of-sample performance. An exhaustive search over all possible triples and quadruples of the (nonlinear) factors would be computationally very intensive. To reduce the computations and also the risk of overfitting, we look only for complementary pairs of biomarkers.
We look for pairs of biomarkers that, in our bivariate models, achieve a high level of in-sample fit, so each of them classifies the majority of the patients correctly. At the same time, they need to be complementary, i.e., the subsets of patients they misclassify should be disjoint. The patient population is never homogeneous: gender, age, race, and additional (patient-specific) risk factors are all important variables that can increase or decrease the severity risk. A single factor, or even a pair of factors, from a patient's blood will not accurately predict the progression for all patients.
To account for the heterogeneity of the patient population, we use a clustering algorithm to identify groups of bivariate models that work best for different clusters of patients. This approach has a higher probability of performing well both in-sample and out-of-sample than including triples and quadruples of factors from the top-performing pairs, and it does not require defining the subgroups of patients prior to the analysis. Figure 5 is a heat map of the severity probabilities predicted for each patient by each of the top 100 linear classifiers, i.e., bivariate logistic regression models, on dataset DS-1. These top 100 models are selected by in-sample log-likelihood value from the results in Table 1. The x-axis dendrogram shows the clustering of ICU and non-ICU patients based on the predictions from all the models. The majority of the patients are classified correctly by the majority of the models; however, some individual patients are misclassified by the majority of the models. Using the dendrogram for the models on the y-axis in Figure 5, we select the two top clusters and, from each cluster, the best-performing model in terms of in-sample likelihood value. The corresponding pairs of biomarkers are complementary because they lie in two separate clusters, and they classify the majority of the patients correctly because they are the best-performing models in their clusters. We combine these biomarkers to form one larger model.
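The model-clustering step can be sketched as follows: given a matrix of per-patient predicted probabilities (one row per bivariate model), we cluster the models hierarchically, cut the dendrogram into two top clusters, and keep from each cluster the model with the highest in-sample log-likelihood. The prediction matrix and log-likelihoods below are synthetic stand-ins, and Ward linkage is an assumed choice, since the text does not specify the linkage method.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(3)
# P: (n_models x n_patients) predicted severity probabilities, one row per
# bivariate model; two synthetic groups of models that fit different patients
P = np.vstack([rng.beta(2, 5, size=(10, 40)),    # models for one patient subgroup
               rng.beta(5, 2, size=(10, 40))])   # models for the complement
loglik = rng.normal(size=20)                     # in-sample log-likelihoods

# cluster the models by the similarity of their per-patient predictions
labels = fcluster(linkage(P, method="ward"), t=2, criterion="maxclust")

# from each of the two top clusters, keep the model with the best in-sample fit
best = {c: max(np.flatnonzero(labels == c), key=lambda i: loglik[i])
        for c in np.unique(labels)}
```

Because the two selected models come from different clusters, they disagree on which patients they predict well, which is the complementarity property exploited in the combined model.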
Using the complementary pairs instead of an exhaustive search over all possible triples and quadruples of biomarkers allows us to reduce the risk of overfitting. From the heat map in Figure 5, one can see that the selected models serve different subgroups of patients. Nevertheless, we cannot completely rule out the possibility of overfitting, because the clusters are constructed using all of the data. Therefore, prior to selecting the model, we recommend a consultation with medical experts.
We repeat this analysis for all the bivariate models used on the DS-1 dataset: the linear classifier (simple logistic regression) and three nonlinear classifiers (linear spline with one and with two knots, and quadratic spline with one knot). The results are summarized in Table 6. For comparison, we also include models that combine the best-performing pairs without the separation into clusters. It is interesting to observe that the out-of-sample classification errors for the best bivariate (linear or nonlinearly transformed) logistic regression models reported in Tables 1, 2, 3, and 4 are all between 10% and 15%, while the complementary pairs of pairs reduce the error to 6%-10%, depending on the model. Moreover, the out-of-sample (10-fold cross-validation) classification error of the complementary pairs of pairs is lower than the error of any of the pairs of top pairs in Table 6. Also, the AUC achieved by the models with complementary pairs is higher than for the models with pairs of top pairs. The corresponding ROC curves are shown in Figure 6: the complementary pairs perform better than all the other models. In particular, we reproduced the results for the extra trees model from the original study in [49], and our proposed models outperform it.
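The intuition that complementary pairs improve out-of-sample performance can be illustrated with synthetic data in which each pair of features is informative for a different half of the population; joining the two pairs in one logistic regression then raises the 10-fold cross-validation AUC above that of either pair alone. Everything below is an illustrative construction, not the paper's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
n = 300
group = rng.integers(0, 2, size=n)          # hidden patient subgroup
pair1 = rng.normal(size=(n, 2))             # hypothetical biomarker pair 1
pair2 = rng.normal(size=(n, 2))             # hypothetical biomarker pair 2
# each pair drives severity only in its own subgroup of patients
signal = np.where(group == 0, pair1.sum(axis=1), pair2.sum(axis=1))
y = (signal + rng.normal(scale=0.5, size=n) > 0).astype(int)

auc_pair1 = cross_val_score(LogisticRegression(), pair1, y,
                            cv=10, scoring="roc_auc").mean()
auc_joint = cross_val_score(LogisticRegression(), np.hstack([pair1, pair2]), y,
                            cv=10, scoring="roc_auc").mean()
```

Each pair alone is informative for only half of the heterogeneous population, so the joint four-feature model achieves a clearly higher cross-validation AUC, mirroring the gains reported for the complementary pairs of pairs.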
The same analysis for the DS-2 dataset produces qualitatively very similar results, confirming the value added by the complementary pairs: the complementary pairs selected via the clustering analysis significantly reduce the classification error. However, the sample size in DS-2 is much smaller than in DS-1, so any model with triples or quadruples of biomarkers is too flexible and overfits substantially. Therefore, the quantitative results for DS-2 are available upon request.

Conclusions
We have analyzed a large set of biomarkers from patients' blood sera from [49] and [61]. We have identified multiple pairs of factors that correctly predict the severity of COVID-19 patients. There is no unique set of factors that can predict severity. We have found evidence in the literature that most of the selected biomarkers have also been confirmed by other research. The advantages of our methods are small dimensionality, flexibility, interpretability, and ease of estimation. Moreover, our classification methods outperform the tree-based methods from the original studies. Based on the clustering analysis, we have identified complementary pairs of biomarkers that further improve the performance and classify correctly even a heterogeneous sample of patients. Based on our robustness analysis, some of these pairs (after the nonlinear transformation) are even COVID-19 specific and do not misclassify patients with other diseases.

Figure 6. ROC curves for four different classification models. Each plot includes the results for the random forest model from [49] as a benchmark; a triple or quadruple of complementary biomarkers selected via our model clustering method; and three triples or quadruples of best-performing models selected without the clustering step. Top left: linear classifier using logistic regression. Top right: nonlinear classifier using linear one-knot spline-transformed biomarkers and logistic regression. Bottom left: nonlinear classifier using linear two-knot spline-transformed biomarkers and logistic regression. Bottom right: nonlinear classifier using quadratic one-knot spline-transformed biomarkers and logistic regression.
However, the presented results have to be interpreted with caution. Only medical experts can decide which pairs of biomarkers are the best predictors of the severity of patients.
The results in this paper cannot provide a long-term prediction of the severity of a given patient. In order to provide more conclusive results regarding which pair should be used and what is the best (perhaps nonlinear) decision boundary for the classification, one would need to involve a medical expert and ideally a larger sample of data.
Finally, logistic regression is a probabilistic classifier that allows us not only to classify patients but also to estimate the probability that a given patient will progress to a severe state. This probability can be used to construct a severity measure for the patients, which can help to monitor patients' progression and assist in providing the best treatment.
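As a minimal sketch of this severity measure, assuming synthetic biomarker data and scikit-learn as the implementation, the fitted model's predicted probability of the severe class can be read off directly:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 2))               # two hypothetical biomarkers
y = (X.sum(axis=1) + rng.normal(scale=0.5, size=150) > 0).astype(int)
clf = LogisticRegression().fit(X, y)

def severity_measure(clf, biomarkers):
    """Estimated probability that a patient progresses to the severe state."""
    return float(clf.predict_proba(np.atleast_2d(biomarkers))[:, 1][0])

low = severity_measure(clf, [-2.0, -2.0])   # patient with low biomarker levels
high = severity_measure(clf, [2.0, 2.0])    # patient with elevated levels
```

Because the output is a probability rather than a hard label, the same fitted model can be used both for classification (thresholding) and for monitoring a patient's severity over repeated measurements.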