Prediction models for in-hospital deaths of patients with COVID-19 using electronic healthcare data

Abstract

Objective: Many models for predicting disease prognoses achieve high performance without laboratory test results; however, whether laboratory test results can improve performance remains unclear. This study aimed to investigate whether laboratory test results improve model performance for coronavirus disease 2019 (COVID-19).

Methods: Prediction models were developed using data from an electronic healthcare record database in Japan. Patients aged ≥18 years hospitalized for COVID-19 after February 11, 2020, were included. Age, sex, comorbidities, laboratory test results, and the number of days elapsed since February 11, 2020, were collected. We developed logistic regression, XGBOOST, random forest, and neural network models and compared their performance with and without laboratory test results. The performance in predicting in-hospital death was evaluated using the area under the curve (AUC).

Results: Data from 8,288 hospitalized patients (46.5% female) were analyzed. The median patient age was 71 years. A total of 6,630 patients were included in the training dataset, of whom 312 (4.7%) died. In the logistic regression model, the AUC was 0.88 (95% confidence interval [CI] = 0.83-0.93) with laboratory test results and 0.75 (95% CI = 0.68-0.81) without them. Performance was not fundamentally different between model types, and laboratory test results improved performance in all cases. The variables most useful for prediction were blood urea nitrogen, albumin, and lactate dehydrogenase.

Conclusions: Laboratory test results, such as blood urea nitrogen, albumin, and lactate dehydrogenase levels, along with background information, helped estimate the prognosis of patients hospitalized for COVID-19.


Introduction
Studies have attempted to develop models to predict the future onset of diseases or the prognosis of patients. Indeed, scoring systems derived from such prediction-model studies, such as the Framingham coronary heart disease prediction score and the quick sequential organ failure assessment (qSOFA) score for sepsis, are used in clinical practice 1,2.
Recently, studies developing prediction models have been increasing, including models built on real-world data sources such as electronic healthcare records (EHR) or claims data 3-8. In general, EHR and claims databases include large amounts of data on patients with various backgrounds. EHR databases usually also include outcome-related information such as laboratory test results. However, when conducting a study using a real-world database, it is difficult for researchers to use data not included in the selected database. Therefore, it is important to select a database that contains sufficient data for each study.
Although laboratory test results are used to assess the diagnosis or prognosis of diseases in clinical practice, various prediction models achieve high performance without laboratory test results, for example, in atrial fibrillation and chronic obstructive pulmonary disease 3,4. However, this does not mean that laboratory test results are unnecessary for high performance in all diseases. It remains unclear whether laboratory test results can improve the performance of prediction models.
We hypothesized that laboratory test results can improve prediction model performance, and we tested this hypothesis using a real-world database and COVID-19. Accordingly, we aimed to quantitatively measure how much the information gained from laboratory test results improves COVID-19 prediction model performance.

Study design and setting
The RWD database was used in this study. The RWD database is owned by the Health, Clinic, and Education Information Evaluation Institute (HCEI, Kyoto, Japan) and operated by Real World Data Co., Ltd. (Kyoto, Japan). The RWD database includes the demographic data, diagnoses, laboratory test results, medication prescriptions, and medical procedures of approximately 20 million patients (both inpatients and outpatients) from approximately 190 medical institutions across Japan as of July 2021 19,20. Data collection began in 2015, and the database is continuously updated from the electronic medical records of each medical institution. The RWD database does not include information identifying individuals. This study was conducted in accordance with the principles of the Declaration of Helsinki. The Investigation and Ethics Committee of Kyoto University approved this study (Approval No. R2895-1; a study on prediction models for in-hospital deaths of patients with COVID-19). The study was initially approved on May 6, 2021, with a modified application approved on May 22, 2023.

Study population
Patients hospitalized for COVID-19 and aged ≥18 years were included. Patients with COVID-19 were defined as those assigned confirmed disease names corresponding to the International Classification of Diseases 10th Revision (ICD10) codes U071 or U072 on or after February 11, 2020, and before June 7, 2021, the administrative end of the database. Hospitalization due to COVID-19 was defined as hospitalization within 7 days before or after the ICD10 codes were assigned. Patients who met the inclusion criteria more than once were included in the study only once. Patients without any prescriptions, laboratory test results, or medical procedure data during hospitalization were excluded, as were patients without a documented discharge date and those who died on the day of admission.
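The inclusion logic above can be sketched in pandas. The table layouts and column names (`patient_id`, `icd10`, `diagnosis_date`, `admission_date`, `age`) are hypothetical illustrations, not the actual RWD schema.

```python
# Hypothetical sketch of the cohort selection; not the actual RWD schema.
import pandas as pd

def select_cohort(dx: pd.DataFrame, adm: pd.DataFrame) -> pd.DataFrame:
    """Return one row per adult patient hospitalized for COVID-19."""
    # Confirmed COVID-19 diagnoses in the study window
    covid = dx[dx["icd10"].isin(["U071", "U072"])
               & (dx["diagnosis_date"] >= "2020-02-11")
               & (dx["diagnosis_date"] < "2021-06-07")]
    merged = covid.merge(adm, on="patient_id")
    # Hospitalization within 7 days before or after code assignment
    gap = (merged["admission_date"] - merged["diagnosis_date"]).dt.days
    hosp = merged[gap.abs() <= 7]
    hosp = hosp[hosp["age"] >= 18]
    # Each patient enters the cohort only once
    return hosp.sort_values("admission_date").drop_duplicates("patient_id")
```

Exclusions (no prescriptions/labs/procedures during the stay, no discharge date, death on the admission day) would be applied to the resulting cohort in the same style.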

Variables and outcomes
We used baseline background features, namely age, sex, smoking status, body mass index, comorbidities, laboratory test results at the time of admission, and the number of days elapsed since the World Health Organization named COVID-19, as explanatory variables. The detailed codes used to define comorbidities are summarized in Supplementary Table S1. We selected laboratory test items that are often measured during off-duty hours. To make the model usable at the time of admission (or very early in the admission period), only explanatory variables that could be identified at that time were selected. Laboratory test results at the time of admission were defined, for each item, as the result of the test performed closest to the date of admission among tests performed from 7 days before to 3 days after admission. Body mass index data were identified using the Diagnosis Procedure Combination (DPC) data. DPC is a payment approach used only in Japan and is based on case-mix classification 21. Only acute care hospitals can choose DPC, and most acute care hospitals in Japan have adopted this payment scheme. The outcome of the primary analysis was in-hospital death, identified using a combination of the date of death, dates of admission and discharge, and DPC data.
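The "closest test to admission" rule above can be sketched as follows; the record layout (a list of date–value pairs per item) is a hypothetical illustration.

```python
# Sketch of the baseline-lab rule: keep results from 7 days before to
# 3 days after admission and take the one nearest the admission date.
# The record layout is hypothetical.
from datetime import date

def baseline_lab(results, admission):
    """results: list of (test_date, value) for one item; returns the
    value measured closest to admission within [-7, +3] days, or None."""
    window = [(d, v) for d, v in results
              if -7 <= (d - admission).days <= 3]
    if not window:
        return None  # treated as a missing value, later imputed
    return min(window, key=lambda dv: abs((dv[0] - admission).days))[1]
```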
In the secondary analysis, the outcomes were admission to the intensive care unit (ICU), use of extracorporeal membrane oxygenation (ECMO), and use of invasive mechanical ventilation.

Model development
Descriptive statistics were calculated to summarize the baseline background features, comorbidities, laboratory test results, and characteristics of the hospital in which each patient was hospitalized. The number of missing values for each variable was calculated.
We developed models to predict the outcomes of the primary and secondary analyses. We used all of the variables described above, except for the hospital characteristics, as explanatory variables. Patients without smoking data were regarded as non-smokers, and missing values in the continuous variables were imputed using chained equations. In addition, to address extreme outliers, values below the 0.05 quantile and above the 0.95 quantile of each continuous variable were rounded to the 0.05 and 0.95 quantiles, respectively. We randomly split the data into training and test data at a ratio of 4:1. We developed logistic regression, XGBOOST, random forest, and neural network models to predict the outcomes of the primary and secondary analyses with and without laboratory test values. The logistic regression model has been frequently used in studies developing prediction models with binary outcomes, and we assumed that this model could serve as a benchmark.
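The outlier handling (clipping each continuous variable to its 0.05 and 0.95 quantiles, i.e. winsorization) and the 4:1 split can be sketched in a few lines; this is a minimal illustration, not the study's actual preprocessing code.

```python
# Minimal sketch of the preprocessing described above (assumed, not the
# authors' actual code): quantile clipping and a random 4:1 split.
import numpy as np

def winsorize(x: np.ndarray) -> np.ndarray:
    """Round values outside the [0.05, 0.95] quantiles back to them."""
    lo, hi = np.quantile(x, [0.05, 0.95])
    return np.clip(x, lo, hi)

def split_4_to_1(n: int, seed: int = 0):
    """Randomly partition n row indices into training and test sets (4:1)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    cut = int(n * 0.8)
    return idx[:cut], idx[cut:]  # train indices, test indices
```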
In the logistic regression model, the boundary between one class and another is assumed to be linear. However, this boundary is often non-linear in real-world settings. Machine learning methods such as XGBOOST, random forest, and neural networks can solve classification problems with non-linear boundaries. Because the outcomes were expected to be rare, down-sampling of the training data combined with random forest, the synthetic minority oversampling technique (SMOTE) combined with XGBOOST, and SMOTE combined with random forest were also used in the analysis. The variables were standardized to develop the logistic regression and neural network models. In addition, for XGBOOST and the neural network, we split all data except the test data at a ratio of 4:1 into training data and data used to evaluate early stopping, which prevents overfitting. We trained the models on the training data, and the hyperparameters were tuned by grid search and random search using k-fold cross-validation (k = 10). In k-fold cross-validation, the training data are divided into k parts, one of which is used as validation data while the remaining k-1 parts are used as training data; the model performance for each set of hyperparameters is evaluated by iterating this process k times and integrating the results. Grid search evaluates all combinations of the enumerated parameters, whereas random search samples combinations of parameters to evaluate. After tuning, except for the models using down-sampling or SMOTE, the models were retrained on the entire training data using the hyperparameters that achieved the best performance during tuning.
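The k-fold grid search described above can be sketched generically. Here `fit` and `score` are placeholders for any model's training and evaluation routines; the study itself used models such as XGBOOST and random forest, so this is a schematic of the tuning loop only.

```python
# Schematic k-fold cross-validated grid search (illustration only).
# `grid` is a list of hyperparameter settings; `fit(X, y, params)` returns
# a fitted model and `score(model, X, y)` returns a performance value.
import numpy as np

def kfold_grid_search(X, y, grid, fit, score, k=10, seed=0):
    """Return the hyperparameter setting with the best mean CV score."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    best, best_score = None, -np.inf
    for params in grid:
        scores = []
        for i in range(k):
            val = folds[i]  # fold i is the validation data
            trn = np.concatenate([folds[j] for j in range(k) if j != i])
            model = fit(X[trn], y[trn], params)
            scores.append(score(model, X[val], y[val]))
        mean = float(np.mean(scores))  # integrate the k results
        if mean > best_score:
            best, best_score = params, mean
    return best, best_score
```

Random search would differ only in how `grid` is built: a random sample of settings rather than the full enumeration.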
We evaluated model performance using the area under the receiver operating characteristic (ROC) curve (AUC) on the test data and its 95% confidence interval (CI). Moreover, the calibration curves of all models developed in the primary analysis are shown, and Shapley additive explanation (SHAP) values are shown to visualize the contribution of each variable to the results. SHAP is a popular framework for interpreting predictions and evaluating the predictive importance of each variable 22. One of the main criticisms of machine learning is that it is unclear which variables contributed to a prediction. SHAP is a recently proposed method to overcome this problem, analogous to the odds ratios of logistic regression. In general, the higher the absolute SHAP value, the more important the variable 22.
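The AUC can be computed directly from the Mann-Whitney rank formulation. The paper does not state how its 95% CIs were obtained, so the bootstrap shown here is an assumption included only to illustrate one common approach.

```python
# AUC via the Mann-Whitney statistic, with an (assumed) bootstrap 95% CI.
import numpy as np

def auc(y_true, y_score):
    """Probability that a random positive outscores a random negative."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # all positive-negative pairs
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum())
                 / (len(pos) * len(neg)))

def auc_ci(y_true, y_score, n_boot=1000, seed=0):
    """Percentile bootstrap 95% CI for the AUC (illustrative method)."""
    rng = np.random.default_rng(seed)
    stats, n = [], len(y_true)
    while len(stats) < n_boot:
        idx = rng.integers(0, n, n)
        yb = y_true[idx]
        if yb.min() == yb.max():  # resample must contain both classes
            continue
        stats.append(auc(yb, y_score[idx]))
    return tuple(np.quantile(stats, [0.025, 0.975]))
```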
Finally, subgroup analyses were performed according to age and sex (females aged <50 years, males aged <50 years, females aged ≥50 years, and males aged ≥50 years), and we tried to define cutoff values of laboratory test items for each group. Laboratory test items ranking among the top five absolute SHAP values in at least one type of model were selected. The cutoff value was defined as the value corresponding to the point on the ROC curve nearest to the coordinate (0, 1) (upper left corner) for each laboratory test item. In addition, for each part of the split data, the sensitivity and specificity corresponding to the number of laboratory test items exceeding the cutoff values were calculated.
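The nearest-corner cutoff rule can be sketched as follows: for a single laboratory item, pick the threshold whose ROC point (1 - specificity, sensitivity) lies closest to the top-left corner (0, 1). This is an illustrative sketch, not the authors' code.

```python
# Sketch of the cutoff rule: choose the threshold whose ROC point is
# nearest to (0, 1). `higher_is_abnormal` would be False for albumin.
import numpy as np

def nearest_corner_cutoff(values, outcome, higher_is_abnormal=True):
    """Return (cutoff, distance) minimizing distance to the (0, 1) corner."""
    best_c, best_d = None, np.inf
    for c in np.unique(values):
        pred = values >= c if higher_is_abnormal else values <= c
        sens = np.mean(pred[outcome == 1])   # true positive rate
        spec = np.mean(~pred[outcome == 0])  # true negative rate
        d = np.hypot(1 - spec, 1 - sens)     # distance to (0, 1)
        if d < best_d:
            best_c, best_d = c, d
    return best_c, best_d
```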
Python 3.7.6 was used for all data analyses. Additional information regarding the Python code used is provided in Supplementary Appendix 1.

Characteristics of the study population
Patients with a diagnosis of COVID-19 between March 7, 2020, and May 27, 2021, were included. In total, 9,065 patients fulfilled the inclusion criteria, and 777 patients were excluded. Finally, 8,288 patients were included in the analysis, of whom 3,854 (46.5%) were female; the median age was 71 years (interquartile range = 51-82 years). The number of patients for whom data were available after 2021 was 5,338 (64.4%). A summary of the explanatory variables and hospital characteristics is presented in Table 1.

Results of the primary analysis
For XGBOOST and the neural network, data from 5,304 patients were assigned to the training data (1,326 to the validation data and 1,658 to the test data), and 250 (4.7%) patients in the training dataset died. For the other models, data from 6,630 patients were assigned to the training data (1,658 to the test data), and 312 (4.7%) patients in the training dataset died.
First, we present the results of the models that included laboratory test results. The AUC was 0.88 (95% CI = 0.83-0.93) for logistic regression. The AUC for each model is shown in Table 2, and the ROC curve for each model is shown in Figure 1. The calibration curves for each model are shown in Figure 2. The SHAP values of each variable for logistic regression, XGBOOST, random forest, and the neural network are shown in Figure 3.
Second, we present the results of the models without laboratory test results as explanatory variables. The AUC was 0.75 (95% CI = 0.68-0.81) for logistic regression, and the AUC for each model is shown in Table 3.

Results of the secondary analysis
Finally, the results of the secondary analysis are presented. The number of patients assigned to the training data was 6,630; of these, 5 (0.07%) were admitted to the ICU and 25 (0.37%) received invasive mechanical ventilation. None of the patients received ECMO. For predicting admission or transfer to the ICU and the initiation of invasive mechanical ventilation, the AUCs with logistic regression were 0.92 (95% CI = 0.54-1.00) and 0.81 (95% CI = 0.61-1.00), respectively.

Cut-off values of laboratory test results
Laboratory test items with the top five absolute SHAP values in at least one type of model were blood urea nitrogen (BUN), albumin, lactate dehydrogenase (LDH), blood glucose, aspartate aminotransferase (AST), D-dimer, prothrombin time, and C-reactive protein (CRP). The calculated cutoff values for these items in each age and sex subgroup are shown in Table 4. In addition, the ROC curve for each laboratory test item is shown in Supplementary Figure S1 in Supplementary Appendix 1. The sensitivity and specificity were calculated for the case in which three items among BUN, albumin, LDH, blood glucose, AST, D-dimer, prothrombin time, and CRP exceeded the cutoff. For females aged <50 years, males aged <50 years, females aged ≥50 years, and males aged ≥50 years, the sensitivity and specificity were 1.00 and 0.70; 1.00 and 0.69; 0.89 and 0.51; and 0.93 and 0.49, respectively. When six items exceeded the cutoff, the sensitivity and specificity were 0.75 and 0.98; 0.75 and 0.97; 0.47 and 0.93; and 0.41 and 0.91, respectively. The remaining results are shown in Supplementary Table S2.

Discussion
This study compared the performance of various prediction models (logistic regression, XGBOOST, random forest, and neural network) with and without laboratory test values, using mortality prediction for COVID-19 as a case study. The models with laboratory test results performed better than the models without them, regardless of model type. In addition, some laboratory test results previously reported as risk factors for mortality in patients with COVID-19, including blood urea nitrogen (BUN), albumin, and lactate dehydrogenase (LDH), showed high absolute SHAP values in this study.
The AUCs of the models with laboratory test results as explanatory variables were higher than those of the models without them in all cases: XGBOOST, random forest, neural network, down-sampling combined with random forest, SMOTE combined with random forest, and SMOTE combined with XGBOOST. Although an excessive number of variables can cause overfitting to the training data, hyperparameters were tuned using cross-validation to prevent overfitting, and the models with laboratory test data achieved higher performance than the models without laboratory test results even on the test data. In clinical practice, obtaining laboratory test results can change the clinician's impression of a patient's diagnosis or prognosis, and the results of this study support this intuition. Unlike other background information, such as age, sex, smoking, body mass index, and comorbidities, laboratory test results can change markedly within a short period. Thus, laboratory test results may capture changes in patients' medical conditions more sensitively, and this sensitivity may lead to high prediction performance.
CURRENT MEDICAL RESEARCH AND OPINION

Albumin, age, LDH, and BUN levels, in addition to cardiovascular disease, were among the variables with the top five absolute SHAP values in the logistic regression analysis. Although in different orders, BUN, albumin, and LDH were among the variables with the top five absolute SHAP values in all cases of logistic regression, XGBOOST, random forest, and neural network. In XGBOOST with grid search, XGBOOST with random search, and random forest, age was among the top five variables in all cases, and the remaining ones were blood glucose, aspartate aminotransferase (AST), and D-dimer. The remaining two of the top five variables in the neural network were prothrombin time and C-reactive protein (CRP). High BUN, low albumin, high LDH, high blood glucose, high AST, high D-dimer, and high CRP levels have been reported as risk factors for mortality in patients with COVID-19, and the directions of change in values (higher or lower than the normal range) were consistent with those previously reported 17,18. The median patient age was 71 years, and the mortality rate in the training data was 4.7%. This mortality rate was higher than that of all patients in Japan reported in a survey by the Japanese Ministry of Health, Labor, and Welfare (approximately 0.2% as of September 10, 2022) 23. Restriction to inpatients only and the high proportion of older patients may explain the high mortality rate observed in this study.
Because death was a rare outcome in the training data (4.7%, 312/6,630), we combined down-sampling with random forest, SMOTE with random forest, and SMOTE with XGBOOST, approaches generally suited to rare outcomes 24,25. However, the performance of these methods did not materially differ from that of XGBOOST alone or random forest alone, and the calibration curves of these combined strategies deviated from the line representing accurate probability predictions. This may be because under-sampling can change the prior distribution of the outcome and distort the probability estimates 26,27. In general, bagging algorithms such as random forest have been reported to work better than boosting when combined with re-sampling such as down-sampling or SMOTE 25; however, no substantial difference was found in this study.
Some international collaborative studies have been conducted in settings similar to this study 28-31. These studies reported that patients' clinical features differed between the first and second waves of the pandemic. Such differences could change which factors are important for predicting the prognosis of patients with COVID-19. However, a study that developed prediction models for death in patients with COVID-19 using only data from 2020 showed that age, albumin level, AST level, creatinine level, CRP level, and white blood cell count were important predictors of mortality. These factors overlapped with those in our study even though we included later data, implying that the factors important for predicting COVID-19 mortality may not have changed over time. Some studies have focused on trends in laboratory test values and on symptoms. Only laboratory test results at admission were used in this study, and the RWD database does not contain information about symptoms; whether these factors improve model performance is a topic for future examination.
This study has several limitations. First, selection bias may exist because not all medical institutions in Japan participate in the RWD database; thus, external validity should be tested using data collected in different settings. Second, it was difficult to collect items not included in the database, such as vital signs, radiological imaging results, and patients' social backgrounds, because this study used existing data. However, high performance could be achieved using only the available data.

Conclusions
The prediction models of COVID-19 mortality showed better performance when laboratory test results were included in the model than when they were not. Moreover, high BUN, low albumin, high LDH, high blood glucose, high AST, high D-dimer, prolonged prothrombin time, and high CRP levels were identified as risk factors that can significantly contribute to mortality in patients with COVID-19. It may be possible to estimate the prognosis of patients hospitalized for COVID-19 with higher performance if laboratory test results, such as BUN, albumin, and LDH levels, are used along with background information.

Declaration of funding
This research did not receive any specific grants from funding agencies in the public, commercial, or not-for-profit sectors.

Declaration of financial/other relationship

Figure 1. ROC curves of the mortality prediction models with laboratory test data in the case of (a) logistic regression, (b) XGBOOST alone with grid search, (c) XGBOOST alone with random search, (d) random forest alone, (e) down-sampling and random forest, (f) SMOTE and XGBOOST, (g) SMOTE and random forest, and (h) neural network. Abbreviations: SMOTE, synthetic minority oversampling technique; ROC, receiver operating characteristic.

Figure 2. Calibration curves of the mortality prediction models with laboratory test data in the case of (a) logistic regression, (b) XGBOOST alone with grid search, (c) XGBOOST alone with random search, (d) random forest alone, (e) down-sampling and random forest, (f) SMOTE and XGBOOST, (g) SMOTE and random forest, and (h) neural network.

Figure 3. SHAP values of the mortality prediction models with laboratory test data in the case of (a) logistic regression, (b) XGBOOST with grid search, (c) XGBOOST with random search, (d) random forest, and (e) neural network.

Table 2. AUC values of models with laboratory test data: estimated value (95% confidence interval). Abbreviations: AUC, area under the curve; SMOTE, synthetic minority oversampling technique.

Table 4. Cut-off values of laboratory test results. For albumin, the value in the cell is the maximum value of the abnormal range; for the other items, it is the minimum value of the abnormal range. For example, for females aged <50 years, BUN ≥17.2 mg/dL and albumin ≤3.4 g/dL fall in the abnormal ranges.

Kenichi Hiraga is a paid consultant of Real World Data Co., Ltd. Masato Takeuchi declares no conflicts of interest. Takeshi Kimura is an employee of Real World Data Co., Ltd. Satomi Yoshida was employed by the Department of Digital Health and Epidemiology, an Industry-Academia Collaboration Course supported by Eisai Co., Ltd., Kyowa Kirin Co., Ltd., Real World Data Co., Ltd., and Mitsubishi Corporation. Satomi Yoshida has also received consulting fees from Real World Data Co., Ltd. Koji Kawakami has received research funds from Eisai Co., Ltd., Kyowa Kirin Co., Ltd., Mitsubishi Corporation, OMRON Corporation, Real World Data Co., Ltd., Sumitomo Pharma Co., Ltd., and Toppan Inc.; consulting fees from Advanced Medical Care Inc., JMDC Inc., LEBER Inc., and Shin Nippon Biomedical Laboratories Ltd.; executive compensation from Cancer Intelligence Care Systems, Inc.; honoraria from Chugai Pharmaceutical Co., Ltd., Kaken Pharmaceutical Co., Ltd., Mitsubishi Chemical Holdings Corporation, Mitsubishi Corporation, Pharma Business Academy, and Toppan Inc.; and has held stock in Real World Data Co., Ltd. The data used in this study were provided free of charge by Real World Data Co., Ltd. Peer reviewers on this manuscript have no relevant financial or other relationships to disclose.