Machine Learning Analysis to Identify Data Entry Errors in Prehospital Patient Care Reports: A Case Study of a National Out-of-Hospital Cardiac Arrest Registry

Abstract

Background: The objective of this study was to develop and validate machine learning models for data entry error detection in a national out-of-hospital cardiac arrest (OHCA) prehospital patient care report database. Methods: Adult OHCAs of presumed cardiac etiology were included. Data entry errors were defined as discrepancies between the coded data and the free-text note documenting the intervention or event; for example, information that was recorded as “absent” in the coded data but “present” in the free-text note. Machine learning models using the extreme gradient boosting, logistic regression, extreme gradient boosting outlier detection, and K-nearest neighbor outlier detection algorithms for error detection within nine core variables were developed and then validated for each variable. Results: Among 12,100 OHCAs, the proportion of cases with at least one error type was 16.2%. The area under the receiver operating characteristic curve (AUC) of the best-performing model (the model with the highest AUC for each outcome variable) was 0.71–0.95. Machine learning models detected errors most efficiently for place and initial rhythm errors; 82.6% of place errors and 93.8% of initial rhythm errors could be detected while checking only 11% and 35% of the data, respectively, compared to the strategy of checking all data. Conclusion: Machine learning models can detect data entry errors in care reports of emergency medical services (EMS) clinicians with acceptable performance and likely can improve the efficiency of the data quality control process. EMS organizations that provide more prehospital interventions for OHCA patients could have higher error rates and may benefit from the adoption of error-detection models.


Introduction
The increasing use of electronic health records and the advances in information technology have led to a rapid growth in registry-based medical research (1,2). One study identified over 150 national clinical registries in the United States and another found over 100 in Sweden (3,4). Although randomized controlled trials remain the gold standard of hypothesis testing in evidence-based medicine, clinical registries play an essential role in generating research hypotheses, monitoring diseases, and assessing the effects of health care interventions (2,5).
Data quality is one of the most important aspects of a clinical registry and is commonly measured by accuracy and completeness (6,7). However, the quality of data in clinical registries is not always satisfactory. A review article reported that the accuracy and completeness of diagnostic data varied significantly, ranging from 67 to 100% and from 30.7 to 100%, respectively (8). Low-quality data may generate biased results, leading to potentially harmful clinical decisions. To improve the quality of the data in registries, continuous efforts including the adequate training of the involved personnel, monitoring the data for errors, and providing feedback are recommended (7,8). While it is relatively easy to monitor the completeness of the data, it is much more challenging to detect data entry errors and thus monitor the accuracy of the data.
Recently, attempts to use machine learning models for the detection of medical errors have shown promising results (9,10). While traditional data quality control processes can be time-consuming and exhausting, efficiency can be increased with the assistance of machine learning models. These models are better than traditional methods at handling complex relationships between data, and thus have the potential to detect errors that cannot be easily found by humans (9). For example, Valko et al. showed that machine learning models were able to detect medical decision errors after being trained to identify anomalous patterns in the data (11). However, whether machine learning tools can accurately detect errors in real-world clinical registries has not yet been thoroughly validated.
Out-of-hospital cardiac arrest (OHCA) is a global health concern, and many countries have developed national registries for research on this issue (12,13). Most OHCA registries collect data using the Utstein style, a set of guidelines for the uniform reporting of data in OHCA cases (14). A substantial portion of data in OHCA registries is recorded by emergency medical services (EMS) personnel, and the overall error rate in EMS patient care reports was found to be over 25% (15).
In this study, we aimed to develop and validate machine learning models for the detection of data entry errors in EMS-recorded data of core Utstein variables using a national OHCA registry. We hypothesized that machine learning models would be able to accurately detect data entry errors that are not easily detected by conventional data quality control methods and would improve the efficiency of these processes. Another aim of this study was to compare the characteristics of OHCA cases treated by EMS clinicians based in fire stations with high error rates with those treated by EMS clinicians based in fire stations with low error rates.

Study Design
This cross-sectional study used a random sample of data from the Korean OHCA Registry (KOHCAR). This study was approved by the institutional review board of Seoul National University Hospital. We followed the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement in reporting the results of this study.

Data Source
The KOHCAR is a nationwide prospective registry that includes all EMS-treated OHCA cases in Korea. The EMS patient care reports, EMS OHCA registry, dispatcher OHCA registry, and hospital medical record reviews were merged into a single registry. Information in EMS patient care reports and EMS OHCA registries was recorded through an electronic health record system by EMS personnel immediately after transporting each patient to the emergency department (ED). An EMS patient care report contains information about patient characteristics, place and time of OHCA, and the management by EMS. The EMS OHCA registry contains in-depth data based on the Utstein style, including witness state, bystander cardiopulmonary resuscitation (CPR), initial rhythm, prehospital shocks delivered, airway interventions, and drug delivery. Data on hospital outcomes were retrospectively collected by trained reviewers of medical records.
Quality control for the KOHCAR data was performed by the Korea Centers for Disease Control and Prevention data quality control team. Education programs were held to train EMS personnel to record EMS data. Medical oversight for OHCA cases was performed by the EMS medical directors. A monthly audit of the entered data was performed, and feedback was provided. Detailed information on the quality control methods for the KOHCAR data has been previously published (16,17).

Study Setting
Korea has a fire-station-based, single-tiered EMS system for the entire country. An ambulance crew usually comprises two to three emergency medical technicians (EMTs). Korean EMTs are classified as levels 1 and 2, similar to the United States' Advanced EMT (AEMT) and EMT, respectively (18). All EMTs can provide basic life support, including the use of automated external defibrillators (AEDs). Level 1 EMTs can perform advanced cardiac life support (ACLS) interventions, including advanced airway management, intravenous (IV) access, and epinephrine injections under direct medical oversight. Level 2 EMTs are not licensed to perform ACLS interventions. As an ambulance crew cannot declare death in the field unless there are definite signs of irreversible death, all patients with OHCA are transported to the nearest ED while undergoing CPR.

Study Population
Adult (18 years or older) EMS-treated OHCA patients with presumed cardiac etiologies who presented between January 2017 and December 2019 were screened from the KOHCAR registry. A random sample of 12,100 patients was used in the study. OHCA patients with noncardiac etiologies, including trauma, asphyxia, drowning, poisoning, or burns, were excluded. Data from 2017 to 2018 were used as the training set to train the machine learning models, and data from 2019 were used as the test set to evaluate the performance of the models.

Outcomes
In addition to entering structured coded data in KOHCAR, EMS personnel document each event in a free-text note. To search for errors in the KOHCAR, three medical record reviewers with experience with KOHCAR data reviewed the free-text notes of the cases in the study. They were instructed to extract information from the free-text notes in the same format as the coded data. To avoid confirmation bias, reviewers were blinded to the coded data. When there was a discrepancy between the coded data and the free-text note in which a certain intervention or event was recorded as "absent" in the coded data but "present" in the free-text note, it was defined and recorded as a data entry error.
In this study, we aimed to detect errors in documenting the place of the event, witnessed status, bystander CPR, initial rhythm, EMS defibrillation, advanced airway, IV access, epinephrine use, and any prehospital return of spontaneous circulation (ROSC). Place error was defined as a case where the place of arrest was coded as "nonpublic" or "unknown or missing" in KOHCAR but was found to be a "public" place in the free-text note. Witnessed status error was defined as a case where the witnessed status was coded as "unwitnessed" or "unknown or missing" in KOHCAR but was found to be a "witnessed" OHCA case in the free-text note. Initial rhythm error was defined as a case where the initial rhythm was recorded as "nonshockable" or "unknown or missing" in KOHCAR but was documented as "shockable" in the free-text note. Any prehospital ROSC error was defined as a case where "achieved prehospital ROSC at least once" was coded as "no" or "unknown or missing" in KOHCAR but was documented as "yes" in the free-text note. For outcome variables other than place, witnessed status, and initial rhythm error, an error was defined as a case where the record in KOHCAR was "no" or "unknown or missing," but the free-text note was documented as "yes." We also attempted to detect cases with at least one of the abovementioned errors, defining them as cases with "any error."

Variables and Preprocessing
Nineteen variables were used for the error detection models: age, sex, day of the week, time of day, urban or rural area, place of arrest, witnessed status, type of witness, bystander CPR, bystander AED use, initial rhythm, EMS defibrillation, advanced airway, IV access, epinephrine use, response interval, scene interval, transport interval, and prehospital ROSC. Data with unknown or missing values were not imputed but were retained as a separate group because being missing may be a significant predictor for detecting errors. Variables relating to EMS interventions (EMS defibrillation, advanced airway, IV access, and epinephrine use) were mandatory fields and had no unknown or missing values in the KOHCAR.
Negative values or values over 60 min in response interval, scene interval, and transport interval were considered implausible. Continuous variables were categorized into groups for input into the model (age: 18-64, 65-79, 80-130 years; response interval, scene interval, and transport interval: 0-4, 5-9, 10-60 minutes, missing values). Therefore, all the variables used in the model were categorical. Subsequently, one-hot encoding was applied to all the variables. One-hot encoding, also known as dummy coding, is a data preprocessing technique for converting a categorical variable into several binary variables (19).
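The categorization and encoding steps above can be sketched as follows. This is a minimal illustration using pandas with toy values, not the study's actual pipeline; the column names are hypothetical:

```python
import pandas as pd

# Toy records mimicking two of the categorized variables (values hypothetical)
df = pd.DataFrame({
    "age_group": ["18-64", "65-79", "80-130", "65-79"],
    "response_interval": ["0-4", "5-9", "10-60", "missing"],
})

# One-hot encode each categorical column into binary indicator columns
encoded = pd.get_dummies(df, columns=["age_group", "response_interval"])
print(sorted(encoded.columns))
```

With three age groups and four interval groups, the two columns expand into seven binary indicator columns; a "missing" category simply becomes one more indicator, which is how missingness can act as a predictor.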

Model Development
Machine learning models for the prediction of different outcome variables were developed using four different algorithms: extreme gradient boosting (XGB), logistic regression (LR), extreme gradient boosting outlier detection (XGBOD), and K-nearest neighbor outlier detection (KNN). XGB, a gradient-boosted tree ensemble algorithm, and LR are supervised learning algorithms that require labeled data for training (20). KNN is an unsupervised learning algorithm that detects outliers based on the relationship between neighboring data points. XGBOD uses an ensemble of supervised (XGB) and unsupervised (KNN) learning algorithms to detect outliers. The unsupervised learning algorithm extracts useful representations of the data that are fed into the supervised classifier to enhance the performance of the model (21). We used the Python modules xgboost version 1.6.0, scikit-learn version 0.24.1, and PyOD version 0.9.3 to develop the XGB, LR, XGBOD, and KNN models (22). The optimal hyperparameters of each model were tuned by five-fold cross-validation and grid searches to maximize the area under the receiver operating characteristic curve (AUC) within the training set. The details of the hyperparameter tuning results for each model are presented in Online Supplementary Table 1.
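The tuning procedure can be sketched as follows: a five-fold cross-validated grid search maximizing AUC, shown here for an LR model on synthetic data. The parameter grid and data are illustrative assumptions, not the study's actual search space:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.random((200, 5))                            # stand-in for one-hot encoded predictors
y = (X[:, 0] + rng.random(200) > 1.0).astype(int)   # stand-in error labels

# Five-fold cross-validated grid search, scored by AUC as in the study
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # hypothetical grid
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```

The same `GridSearchCV` wrapper works unchanged around an `xgboost.XGBClassifier`, with a grid over parameters such as tree depth and learning rate.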

Statistical Analysis
Sample size was estimated based on previous literature (23). Assuming a sensitivity of 0.90, disease prevalence (proportion of each type of error) of 1%, and a 95% confidence interval (CI) width of 10%, the required sample size in the test set was 3,458. Since we had planned to use approximately one-third of the data as the test set, we determined that a total sample size of 12,100 would be sufficient for this study.
All categorical variables were represented by numbers and percentages, and inter-group comparisons were performed using the chi-square test. Continuous variables were represented by medians and interquartile ranges, and inter-group comparisons were performed using the Wilcoxon rank-sum test. Statistical significance was set at p < 0.05.
Inter-rater agreement between the three medical record reviewers was examined using Cohen's kappa to assess the reliability of the information extracted from the free-text notes. Among the 12,100 cases, 400 cases were randomly assigned to the three reviewers and used to evaluate inter-rater agreement. The remaining 11,700 cases were equally divided among the three reviewers.
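Cohen's kappa for a pair of reviewers can be computed as sketched below; the reviewer labels are hypothetical, for illustration only:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical extractions of witnessed status by two reviewers on shared cases
reviewer_a = ["witnessed", "unwitnessed", "witnessed", "witnessed", "unwitnessed"]
reviewer_b = ["witnessed", "unwitnessed", "unwitnessed", "witnessed", "unwitnessed"]

# Kappa corrects the raw agreement (4/5 here) for agreement expected by chance
kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(round(kappa, 2))  # → 0.62
```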
The performance of each model was assessed by calculating the AUC, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and 95% CIs. The cutoff thresholds for each model were determined by fixing the sensitivity of the models to 0.9 in the training set. The calibration of each model was assessed using the Brier score. The Brier score ranges from 0 to 1, where 0 indicates perfect agreement and 1 indicates perfect disagreement between the predicted probabilities and the actual outcomes. Although there is no definitive criterion for a good Brier score, a score lower than 0.25 (which can be achieved by assigning a probability of 0.5 to all predictions) is considered acceptable (24). The variable importance of the best-performing model (the model with the highest AUC) for each outcome variable was determined by the permutation importance score, which is defined as the mean decrease in the accuracy of the model when each variable is permuted (25,26). The top 10 variables with the highest importance scores for each model were presented using box plots.
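Fixing the cutoff so that training-set sensitivity reaches 0.9, and computing the Brier score, can be sketched as follows with synthetic labels and predicted probabilities (all values illustrative):

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_curve

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 500)                                   # stand-in error labels
y_prob = np.clip(0.4 * y_true + 0.6 * rng.random(500), 0.0, 1.0)   # stand-in predicted probabilities

# Choose the largest threshold whose sensitivity (TPR) still reaches 0.90;
# roc_curve returns thresholds in decreasing order, tpr in increasing order
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
cutoff = thresholds[np.argmax(tpr >= 0.90)]

sensitivity = ((y_prob >= cutoff) & (y_true == 1)).sum() / (y_true == 1).sum()
brier = brier_score_loss(y_true, y_prob)   # mean squared error of the probabilities
print(round(float(sensitivity), 2), round(float(brier), 3))
```

In practice the cutoff is fixed on the training set and then carried over to the test set, which is why the study's test-set sensitivities drift away from 0.90.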
The utility of the developed machine learning models was assessed by assuming a quality control method in which only the data predicted to contain errors by the best-performing machine learning models were screened for errors (quality control with machine learning). For example, an audit of witnessed status data would be performed only for OHCA cases that were predicted to have witnessed status errors. This was compared with a baseline method that screens all the data for the possibility of error (quality control without machine learning). Because of our study's definition of data entry errors, some data have no possibility of being an error. For example, if an OHCA case was coded as "witnessed" in the KOHCAR, it would be impossible for it to be a witnessed status error. Therefore, to identify witnessed status errors, all cases coded as "unwitnessed" or "unknown or missing" in the KOHCAR would be screened in quality control without machine learning.
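The comparison reduces to two fractions: how much of the eligible data the model flags for screening, and how many of the true errors that screening catches. A toy illustration, with hypothetical flags and labels:

```python
# Each entry is one eligible record (e.g. a case coded "unwitnessed" or missing).
flagged  = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]   # model predicts "possible error"
is_error = [1, 0, 0, 1, 0, 0, 0, 1, 0, 1]   # true data entry errors

# Fraction of eligible records a human auditor must review
screened_fraction = sum(flagged) / len(flagged)
# Fraction of true errors caught by auditing only the flagged records
errors_caught = sum(f and e for f, e in zip(flagged, is_error)) / sum(is_error)
print(screened_fraction, errors_caught)  # screen 40% of records, catch 75% of errors
```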
We examined the characteristics and clinical outcomes (ROSC at ED arrival and survival to discharge) of OHCA patients with respect to the error rate of the KOHCAR data for each fire station. "ROSC at ED arrival" is defined as the ROSC state (ROSC vs. no ROSC) at the time of a patient's arrival at the hospital; this data point is acquired from hospital medical record reviews. For this analysis, OHCA cases with missing clinical outcomes and OHCA cases treated by fire stations that encountered fewer than 10 OHCA cases during the entire study period were excluded. The fire stations were grouped into quartiles based on the error rate for each station. The characteristics of the patients were compared between the groups using the chi-square test or Kruskal-Wallis test, as appropriate. Multivariable LR analysis, with adjustment for potential confounders (age, sex, urban/rural status, place of arrest, witnessed status, bystander CPR, initial rhythm, EMS defibrillation, advanced airway, epinephrine use, response interval, scene interval, and transport interval), was performed to examine the association between the error rates of the groups and clinical outcomes.
All statistical analyses were performed using Python version 3.9.7 and SAS software.

Results
Among the 62,990 OHCA cases with presumed cardiac etiologies, 12,100 were randomly sampled and split into the training (N = 7,828) and test sets (N = 4,272) (Figure 1). The median age of the OHCA cases in both the training and test sets was 74 years. The proportions of males in the training and test sets were 63.1% and 63.7%, respectively. The proportion of cases with at least one error type (any error) was 16.2% (16.2% in the training set and 16.1% in the test set). Among the different types of errors, witnessed status error was the most prevalent (7.8% in the training set and 8.1% in the test set), whereas EMS defibrillation error was the least prevalent (0.4% in the training set and 0.6% in the test set) (Table 1). The characteristics of the OHCA cases in the randomly sampled cohort were similar to those of the total cohort before sampling (Online Supplementary Table 2).
The inter-rater agreement of the extracted information from the free-text notes between the reviewers is presented in Online Supplementary Table 3. Cohen's kappa ranged from 0.67 to 0.97, indicating substantial to almost perfect agreement for all the investigated variables (27). The strength of agreement for the place of arrest and witnessed status (Cohen's kappa 0.67-0.76) was relatively lower than that for the other variables (Cohen's kappa 0.75-0.97).
The error detection performance of the machine learning models for different outcomes in the training and test sets is shown in Online Supplementary Table 4 and Online Supplementary Table 5, respectively. The performance of the best-performing machine learning models in the test set is shown in Table 2. Although the sensitivity of all models was fixed to 0.90 in the training set, the sensitivity of the models ranged from 0.60 to 1.00 in the test set. The Brier scores for all the XGB, LR, and XGBOD models were below 0.13, and the Brier scores for all the KNN models were below 0.22. The receiver operating characteristic curves of the machine learning models for the prediction of different outcomes in the training and test sets are presented in Online Supplementary Figure 1 and Figure 2, respectively. XGB was the best-performing model in the test set for the outcomes of bystander CPR error, advanced airway error, and prehospital ROSC error. LR was the best-performing model for the outcomes of place error, witnessed status error, initial rhythm error, and epinephrine use error. XGBOD was the best-performing model for the outcomes of EMS defibrillation and IV access errors. The AUCs of the best-performing models for different outcomes ranged from 0.71 to 0.95 in the test set. The top 10 features with the highest importance scores of the best-performing models are shown in Online Supplementary Figure 2. The most important features for each outcome were those known to be closely associated with the outcome variable. For example, the initial rhythm of an OHCA patient was strongly associated with whether the patient received defibrillation. In addition, whether the patient received bystander CPR was closely related to the witnessed status.
Compared to quality control without machine learning, quality control with machine learning detected different types of errors more efficiently in KOHCAR. Detection efficiency was highest for place error and initial rhythm error. Using machine learning, 82.6% of place errors and 93.8% of initial rhythm errors were detected while screening only 11% and 35% of the data, respectively, compared to quality control without machine learning (Table 3).
With the exclusion of the fire stations that treated fewer than 10 OHCA cases during the study period, 201 fire stations treated 10,944 OHCA cases. The proportions of OHCA cases that occurred in urban areas (77.2% vs. 57.1%), received advanced airway placement (83.4% vs. 78.9%), and received epinephrine treatment (20.1% vs. 13.4%) were greater in the quartile with the highest error rate than in the quartile with the lowest error rate. In terms of clinical outcomes, the proportions of cases that achieved ROSC at ED arrival (10.4% vs. 7.6%) and survived to discharge (10.6% vs. 9.3%) were greater in the quartile with the highest error rate than in the quartile with the lowest error rate (Table 4). After adjusting for potential confounders, the quartile with the highest error rate had higher odds of ROSC at ED arrival (adjusted odds ratio 1.38; 95% CI 1.05-1.80) than the quartile with the lowest error rate; however, this did not translate into higher odds of survival to discharge.

Discussion
In this study, machine learning models were developed and validated to detect data entry errors in prehospital patient care reports recorded in the national OHCA registry. We found that machine learning models detected data entry errors in the patient care reports with acceptable performance. Overall, supervised machine learning models outperformed unsupervised machine learning models in detecting errors, with AUCs of the best-performing models for different types of errors ranging from 0.71 to 0.95 in the test set. We demonstrated that a data audit with the aid of the machine learning models is more efficient than an audit without them. Given that auditing the entire dataset of a large-scale clinical registry is time-consuming and almost impossible, machine learning models have the potential to increase the efficiency of data quality control processes. By reviewing and comparing the free-text notes with the coded registry data for the events, we found that nearly 16% of OHCA cases had at least one data entry error in the registry. In other words, the proportion of cases with accurate data was 84%, which is comparable to previous reports on the accuracy of computerized medical records (8). Errors of witnessed status constituted nearly half of all the errors in our study. This is not surprising, given that witnessed status was found to be one of the Utstein variables with the lowest inter-rater agreement (28). A "witnessed" OHCA is defined as an arrest that was monitored, seen, or heard by another person (29). However, quite a few OHCA cases in KOHCAR were found to be erroneously coded as "unwitnessed" when the arrest was heard from nearby but not seen.
Labeled data were acquired to train the supervised machine learning models, and the performance of different types of models was evaluated. However, acquiring clinical registry data labeled for errors may be difficult in some situations. Considering this, the performance of an unsupervised machine learning algorithm was also evaluated. Unsupervised algorithms are frequently used for anomaly detection problems such as the detection of prescription errors (30,31). Although KNN has the advantage of not requiring labeled data, its performance was unsatisfactory in our study. The reason for this may be that using unsupervised anomaly detection algorithms on high-dimensional categorical data is challenging (32). There are several potential benefits beyond improved data accuracy in coupling quality control audits with machine learning models. Efficiency may improve by reducing the number of person-hours needed to conduct the review, which can also decrease the associated personnel cost. A previous study calculated the hidden cost of a clinical audit by analyzing the time the staff spent on the audit and found that the cost was substantial (33). The monthly audit of KOHCAR is also performed at high cost, requiring at least a few hours of time from multiple emergency medicine staff members. Even then, it is impossible to examine individual data because of the large number of cases. Therefore, a data audit of KOHCAR was previously performed by comparing the data of fire stations and medical record reviewers. Our study presents an efficient solution for identifying errors at an individual case level.
Collecting data in challenging prehospital environments may lead to data inaccuracies (34). We found that the proportion of OHCA cases that received advanced airway and epinephrine treatment was greater in fire stations with the highest error rates than in those with the lowest error rates. Fire stations that perform more interventions may be busier, resulting in greater inaccuracies when recording data. The developed error detection models may be of great benefit to busy fire stations.
The adjusted odds of ROSC at ED arrival were significantly higher in the highest error rate group than in the lowest error rate group. In our study, fire stations with the lowest error rate showed the lowest frequency of prehospital interventions (Table 4). If prehospital interventions are performed infrequently, the possibility of errors may be lower because EMS clinicians rarely record that interventions were performed, and there would be a minimal amount of information in the free-text record. Infrequent prehospital interventions may also reflect situations in which EMS clinicians are less trained or less active in patient care. While we adjusted for EMS interventions including defibrillation, advanced airway, and epinephrine use in our multivariable LR analysis, significant differences in outcome remained between the highest and lowest error rate groups. There may have been unobserved factors related to the quality and activeness of the EMS clinicians. Further research is required to verify this hypothesis.

Limitations
There are several limitations in our study that need to be addressed. First, this study is limited by its retrospective design; thus, it needs to be validated through prospective application in the real world. When error detection models are prospectively applied in the real world and feedback is provided to data recorders, the frequency and pattern of data errors may change. In this case, model updates may be required to prevent a decline in model performance. Second, whether the results of this study can be generalized to registries in countries other than Korea needs to be assessed. Third, we were not able to capture all data errors in this study because errors were identified through the review of free-text notes. For example, if the variable "epinephrine use" was erroneously coded as "yes" in KOHCAR when epinephrine was not used, it was not counted as an error in this study. These errors could not be captured because of a lack of documentation of negative findings in the free-text notes: while most positive findings are written in the free-text notes, negative findings such as "epinephrine was not used" often are not.

Conclusion
In conclusion, we developed machine learning models to detect data entry errors in the core variables of a national OHCA registry with acceptable performance. In addition, the characteristics of fire stations with the highest error rates were identified, and these stations may benefit most from using the developed models. The adoption of machine learning models will likely improve the efficiency of the registry quality control process. Further prospective studies are required to examine the effectiveness of machine learning tools for data quality control in the real world.

Figure 2. Receiver operating characteristic curves of machine learning models for the prediction of different types of errors in the test set. Abbreviations: XGB, extreme gradient boosting; LR, logistic regression; XGBOD, extreme gradient boosting outlier detection; KNN, K-nearest neighbor outlier detection; CPR, cardiopulmonary resuscitation; EMS, emergency medical services; IV, intravenous; ROSC, return of spontaneous circulation; AUC, area under the receiver operating characteristic curve.

Table 1. Comparison of baseline characteristics and outcomes between the training set and test set.

Table 2. Error detection performance of the best-performing machine learning models for different types of errors in the test set.

Table 3. Comparison of quality control with and without machine learning in the test set.

Table 4. Characteristics and clinical outcomes of different error rate groups. The adjusted odds ratios (95% confidence intervals) of survival to discharge compared to the lowest error rate group (reference) were 1.09 (0.84-1.41) for the low error rate group, 1.02 (0.79-1.32) for the high error rate group, and 1.02 (0.79-1.33) for the highest error rate group, respectively. The adjusted odds ratios were calculated with a multivariable logistic regression model after adjusting for age, sex, urban/rural status, place of arrest, witnessed status, bystander CPR, initial rhythm, EMS defibrillation, advanced airway, epinephrine use, response time, scene time, and transport time.