Cross-camera Performance of Deep Learning Algorithms to Diagnose Common Ophthalmic Diseases: A Comparative Study Highlighting the Feasibility of Portable Fundus Camera Use

Abstract

Purpose: To compare the inter-camera performance and consistency of deep learning (DL) diagnostic algorithms applied to fundus images taken with a desktop Topcon camera and a portable Optain camera.

Methods: Participants over 18 years of age were enrolled between November 2021 and April 2022. Pair-wise fundus photographs were collected from each patient in a single visit: once with a Topcon camera (used as the reference camera) and once with a portable Optain camera (the new target camera). Images were analyzed by three previously validated DL models for the detection of diabetic retinopathy (DR), age-related macular degeneration (AMD), and glaucomatous optic neuropathy (GON). Ophthalmologists manually graded all fundus photographs for the presence of DR, and these gradings served as the ground truth. Sensitivity, specificity, area under the curve (AUC), and agreement between cameras (estimated by Cohen's weighted kappa, κ) were the primary outcomes of this study.

Results: A total of 504 patients were recruited. After excluding 12 photographs with matching errors and 59 photographs of low quality, 906 pairs of Topcon-Optain fundus photographs were available for algorithm assessment. The Topcon and Optain cameras showed excellent consistency (κ = 0.80) for the referable DR algorithm, whereas consistency was moderate for AMD (κ = 0.41) and poor for GON (κ = 0.32). For the DR model, Topcon and Optain achieved sensitivities of 97.70% and 97.67% and specificities of 97.92% and 97.93%, respectively, with no significant difference between the two camera models (McNemar's test: χ² = 0.08, p = .78).

Conclusion: The Topcon and Optain cameras had excellent consistency for detecting referable DR, although performance of the AMD and GON models was unsatisfactory. This study demonstrates a method of using pair-wise images to evaluate DL models across reference and new fundus cameras.


Introduction
The use of fundus photography has become a popular noninvasive method for observing the structures of the retina and diagnosing various ocular diseases. This growth is largely owed to deep learning (DL) algorithms, which can now automatically segment and interpret fundus photographs with diagnostic specificity comparable to that of trained ophthalmologists. [1-5] Numerous DL models have been developed for the detection of eye diseases, including age-related macular degeneration (AMD), glaucomatous optic neuropathy (GON), [1,3,6-11] and diabetic retinopathy (DR). For DR in particular, a DL model can achieve an area under the curve (AUC) of 0.98-0.99, greater than most ophthalmologists. [2,12-15] For the development of diagnostic DL algorithms, fundus photographs must be manually segmented for algorithm training and validation. Because microscopic image features may be distorted by minute differences in pixel size and resolution, an algorithm should ideally also be tested on images taken with different fundus cameras. [16] If such confounding exists, it can adversely affect diagnostic accuracy, cause over- or under-diagnosis of vision-threatening disease, and lead to the initiation of unnecessary treatments. Despite this risk, most existing DL models were trained on photographs taken by a single camera, without external validation on different camera models.
Recently, Tsai et al. conducted the first external cross-camera validation for the screening of DR and reported high sensitivity and specificity across three different cameras, although the cameras had similar fields of view and only one disease model was tested. [17] To date, few studies have examined cross-camera validation, and those that have were conducted with desktop fundus cameras, leaving the agreement between portable and desktop cameras unknown. Therefore, this study sought to verify the consistency of DL model performance on fundus photographs taken by Topcon and Optain cameras for DR, AMD, and GON, using pair-wise images taken of the same individuals.

Image data
Patients who attended outpatient ophthalmic clinics at Guangdong Provincial People's Hospital from November 2021 to April 2022 were invited to participate in this study. Patients with DR, GON, and AMD were eligible for inclusion, while individuals under 18 years of age, adults under guardianship or unable to provide consent, and patients who had undergone posterior segment surgery were excluded. Fundus photographs of each eye were taken twice without pupil dilation in one sitting by professionally trained staff. The first photograph was captured using a Topcon TRC-NW8 camera with a 45° field of view, and the second using an Optain OPTFC01 camera with a 50° field of view. A single camera of each model was used to ensure internal consistency, and each was calibrated every morning at the clinic. The Optain OPTFC01 captures binocular fundus images fully automatically and outputs 15-megapixel images in JPEG and DICOM formats. Data were excluded if photographs were deemed of poor quality by manual grading or if a matching Optain/Topcon pair was missing. Detailed specifications of the Topcon and Optain cameras are presented in Table 1.
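The pairing and exclusion steps described above can be sketched as follows. This is an illustrative sketch only: the record layout, patient IDs, and `quality` flag are our assumptions, not the study's actual data pipeline.

```python
def pair_images(topcon, optain):
    """Pair Topcon and Optain photographs of the same patient and eye.

    Each record is (patient_id, eye, quality), quality in {"gradable", "poor"}.
    A pair is excluded if either image is poor quality or the match is missing.
    """
    optain_index = {(pid, eye): quality for pid, eye, quality in optain}
    pairs, excluded = [], []
    for pid, eye, quality in topcon:
        match = optain_index.get((pid, eye))
        if match is None:
            excluded.append(((pid, eye), "no matching Optain image"))
        elif quality == "poor" or match == "poor":
            excluded.append(((pid, eye), "poor quality"))
        else:
            pairs.append((pid, eye))
    return pairs, excluded

# Hypothetical records for illustration
topcon = [("001", "OD", "gradable"), ("001", "OS", "poor"), ("002", "OD", "gradable")]
optain = [("001", "OD", "gradable"), ("001", "OS", "gradable")]
pairs, excluded = pair_images(topcon, optain)  # one valid pair, two exclusions
```

In the study itself this filtering reduced the collected photographs to the 906 analyzable Topcon-Optain pairs.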
This study was approved by the Guangdong Provincial People's Hospital Institutional Review Board (KY-Q-2021-032-01) and adhered to the tenets of the Declaration of Helsinki. Informed consent was obtained from all participants before entering the study.

Deep learning model
Three diagnostic models were included in this study: algorithms screening for referable DR, AMD, and GON. The development and validation of each algorithm are described in detail in previous studies. [6,7,13,18,19] In brief, each algorithm was trained on more than 200,000 fundus photographs acquired across different ophthalmic clinics and institutions in China using various fundus camera models (Topcon, Canon, Heidelberg, and Digital Retinography System). Data were stored on a web-based cloud resource platform (www.labelme.org). The DL models for each disease were developed using the Inception-v3 architecture and comprised disease classification, image quality assessment, and macular region detection. The referable DR model classified all images as "positive," "negative," or "ungradable." The AMD model graded images as "absent," "early or intermediate," "late-dry," or "late-wet." The GON model graded images as "low risk," "medium risk," or "high risk." In the original studies, all three models achieved AUCs greater than 0.98 on internal validation datasets.

Manual grading for diabetic retinopathy
All 906 pairs of Topcon-Optain fundus photographs underwent manual annotation for the presence of DR. Images were randomly assigned to one of four trained retina specialists. The degree of diabetic retinopathy was graded according to the National Health Service (NHS) Diabetic Eye Screening guidelines, which classify images as R0 (no DR), R1 (background DR), R2 (pre-proliferative DR), R3 (proliferative DR), or U (unclassifiable). Referable DR was defined as pre-proliferative or proliferative DR (R2 or R3). If grading results were inconsistent among the graders, the images were reviewed by a designated senior ophthalmologist, whose adjudicated results served as the ground truth of this study.
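As a minimal illustration of the grading scheme just described, the mapping from NHS grade to referral decision can be written as follows (the function and constant names are ours, not from the study):

```python
# NHS Diabetic Eye Screening grades, as described above:
# R0 = no DR, R1 = background DR, R2 = pre-proliferative DR,
# R3 = proliferative DR, U = unclassifiable.
REFERABLE_GRADES = {"R2", "R3"}  # referable DR per the study's definition

def referral_decision(grade: str) -> str:
    """Map a single-eye DR grade to a screening outcome."""
    if grade == "U":
        return "ungradable"
    return "referable" if grade in REFERABLE_GRADES else "non-referable"
```

For example, `referral_decision("R2")` yields `"referable"`, while R0 and R1 map to `"non-referable"`; the 59 images graded U were excluded from analysis.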

Statistical analysis
Statistical analyses were conducted using Stata V.15.0 software (StataCorp, College Station, Texas, USA) and Python 3.6. We compared the performance of the three deep learning models on images captured by the Topcon and Optain cameras and calculated p-values using the chi-square or Fisher's exact test. Sensitivity, specificity, and area under the curve (AUC), the primary outcomes of this study, were derived by comparing the performance of the DR model with the ground truth (outcomes evaluated by ophthalmologists). Pair-wise analyses by McNemar's test compared the sensitivity between the two cameras for each primary outcome. Each pair of fundus photographs contained images of the same eye of the same patient taken with the Topcon and Optain cameras, and we calculated the agreement between the outputs of each DL model on photographs taken with the two cameras. Cohen's linearly weighted kappa (κw) rated the inter-rater and intra-rater reliabilities of the DL algorithms. [20] Cohen's kappa ranges from −1 to 1 and is often interpreted as follows: 0.40-0.60 moderate, 0.60-0.80 substantial, and 0.80-1.00 almost perfect agreement. Cohen's kappa takes class imbalance into account, giving a realistic view of model performance on imbalanced data. A p value less than .05 was considered statistically significant.
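The two key statistics above, the linearly weighted kappa and McNemar's chi-square, can be computed from first principles as sketched below. This is an illustrative re-implementation of the standard formulas, not the study's own Stata/Python code.

```python
from itertools import product

def weighted_kappa(conf):
    """Cohen's linearly weighted kappa from a k x k confusion matrix conf[i][j]
    (rows: camera A's assigned class, columns: camera B's assigned class)."""
    k = len(conf)
    n = sum(map(sum, conf))
    rows = [sum(conf[i]) for i in range(k)]
    cols = [sum(conf[i][j] for i in range(k)) for j in range(k)]
    w = lambda i, j: 1 - abs(i - j) / (k - 1)  # linear agreement weights
    # Weighted observed and chance-expected agreement
    po = sum(w(i, j) * conf[i][j] for i, j in product(range(k), repeat=2)) / n
    pe = sum(w(i, j) * rows[i] * cols[j] for i, j in product(range(k), repeat=2)) / n ** 2
    return (po - pe) / (1 - pe)

def mcnemar_chi2(b, c):
    """McNemar's chi-square with continuity correction; b and c are the two
    discordant-pair counts from a paired 2 x 2 table."""
    return (abs(b - c) - 1) ** 2 / (b + c)
```

Note that McNemar's statistic depends only on the discordant pairs, so near-equal counts b and c yield a small chi-square and a non-significant p value, as observed for the DR comparison in this study.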

Results
A total of 504 patients were recruited. After excluding 12 photographs with matching errors (those that did not come from the same eye) and 59 photographs of low quality (those with ungradable manual grading results), 906 pairs of Topcon-Optain fundus photographs were available for assessment by each algorithm. A flow diagram of participant inclusion is illustrated in Figure 1. Figure 2 shows examples of fundus photographs taken using the Topcon and Optain cameras. Table 2 lists the DL model results for the Topcon and Optain cameras, respectively. All enrolled images (100%) were judged "gradable" by the DL models for all conditions. For the referable DR model, the number of referable DR images was slightly higher for the Topcon camera (n = 102) than for the Optain (n = 101). AMD was the most prevalent disease within the dataset (16.8% for Topcon, 14.5% for Optain), followed by GON (2.54% for Topcon, 8.06% for Optain) (all p < .001). Table 3 compares the sensitivity, specificity, and AUC of the referable DR model for the Topcon and Optain cameras, with manually graded DR images as the ground truth. For the DR algorithm, the sensitivity of the Topcon and Optain cameras was 97.70% and 97.67%, specificity was 97.92% and 97.93%, and AUC was 0.978 and 0.977, respectively. There was no significant difference between the two camera models (McNemar's test: χ² = 0.08, p = .78). The confusion matrix for each eye disease model is shown in Supplementary Figures 1 and 2, and inter-camera agreement for each model is reported in Table 4.

Discussion
This study investigated inter-camera differences between Topcon and Optain cameras using pair-wise images to evaluate diagnostic DL algorithms for DR, AMD, and GON. The results showed that inter-camera agreement was poor to moderate for the GON and AMD models, although consistency for identifying referable DR was near-perfect. Given that accuracy across camera models is rarely investigated in studies assessing DL algorithms, this study calls for greater surveillance of inter-camera differences to ensure that the application of DL algorithms does not have negative consequences in clinical practice.
The inter-camera differences found in this analysis are problematic, considering that camera models are rarely examined by publications claiming diagnostic accuracy. The original articles publishing these algorithms reported AUCs of 0.989, 0.995, and 0.986 for DR, AMD, and GON, respectively, [6,7,13,18,19] and subsequent publications have since consolidated these findings in other models based on clinically common cameras, leaving performance on emerging cameras an open question. [18,21-24] Although this study found AUCs over 0.95 for each model, with adequate sensitivity and specificity, Cohen's kappa coefficients showed that the cameras had poor consistency for the diagnosis of GON and AMD. These inter-camera differences could be extremely concerning in real-world settings, where using portable cameras as-is could result in unreliable diagnoses and under- or over-reported cases of GON and AMD. Certain features of the photographs taken by the Topcon and Optain cameras may explain the inter-camera disagreement observed here. Accurate AMD diagnosis depends on detecting lesions in the macular area, while GON assessment relies on the optic disk area. Although both the Optain and Topcon cameras capture the macula and optic disk, overexposure, low light, halos, and shadows can easily distort these regions and affect the accuracy of segmentation and interpretation. [25,26] For example, images taken with the Optain camera were overexposed in the optic disk region compared with the Topcon, which may have caused variations in the performance of the GON model (Figure 3). In the future, an autocorrection mechanism, such as Generative Adversarial Network technologies, may need to be integrated to prevent overexposure and assist the commercialization and generalization of these algorithms. [27-30]

Despite this, our results show that the cameras had almost perfect agreement for the referable DR algorithm and achieved AUCs over 0.97 for both cameras, an improvement over the external verification results of the original model (AUC = 0.955). [18] This may be because the manifestations of DR are widely distributed across the fundus, making them easier to capture with different cameras. In addition, the referable DR model had only two gradable classifications, which may have simplified decision-making by the algorithm and narrowed the potential range of error. The strong performance of the referable DR model indicates that this algorithm is adaptable to the Optain camera and supports its suitability as a screening tool in clinical practice.

Investigating cross-camera consistency is of utmost importance should low-cost portable cameras like the Optain become common in general medical practice. To date, however, most published studies claiming the accuracy of DL models trained or validated them using Topcon, Canon, or Zeiss camera models, with no validation performed on portable cameras. Being low-cost, one-ninth the weight of the Topcon camera, and small enough to carry in a suitcase, Optain cameras are likely to become an alternative to bulky desktop fundus cameras, enabling medical practice in rural areas (Figure 3). This study therefore verified the performance and cross-camera consistency of DL models on photographs taken by portable Optain cameras, to bolster the perceived reliability of portable low-cost cameras and ensure they are safe for clinical use should they become mainstream.
The findings from this study suggest that warnings should be issued to users of AI algorithms whose generalizability and adaptation to new camera models, including inter-camera reliability, have not been proven by peer-reviewed research. [31] Although our models achieved good performance upon external validation, that validation used datasets captured with the same camera, [6,7,18] and until now these algorithms were only assumed to generalize well across camera models. Because of this assumption, it is likely that the same error exists in many other published algorithms. The poor agreement found here suggests that the performance of all DL models should be re-evaluated across a variety of camera models, and that domain adaptation and active learning should be considered to ensure the development of reliable and adaptable DL algorithms. [21,32] For now, regulatory bodies should consider warnings that educate users about possible incompatibility with other camera models to prevent malpractice. In addition, future DL algorithms should expand the diversity of their training sets to include retinal images from different camera models, a consideration as important as validating for factors such as ethnicity, sample size, and geographic diversity.
To the best of our knowledge, this is the first study to evaluate inter-camera differences between desktop and portable cameras and their effect on diagnostic DL algorithm performance. This study applied three previously validated algorithms to pair-wise images taken of the same individuals to test comparative accuracy for DR, GON, and AMD. These results provide novel findings for the ophthalmic and artificial intelligence communities and call for a response from fundus camera companies and algorithm software holders regarding the suitability of warnings, to ensure responsibility and integrity in clinical environments. Despite these implications, this study has several limitations. First, agreement was assessed between only two camera models, so inter-camera consistency across more camera models requires investigation. Second, the data volume of this study was limited, and applying this methodology to a larger sample would be ideal to confirm the accuracy of these findings. Third, the data were collected from a single Chinese ophthalmology clinic; given that the prevalence of AMD, DR, and GON differs geographically and between general practice clinics, real-world assessment in other countries and healthcare settings is warranted.

Note (Table 4): Inter-rater reliability as measured by exact agreement (%) and Cohen's kappa coefficient (and its associated p value) for each model.

Conclusion
The Topcon and Optain cameras had near-perfect agreement in diagnostic accuracy for the referable DR model; however, the AMD model showed moderate consistency and the GON model poor consistency. This highlights the importance of evaluating the cross-camera performance of DL models before real-life application and calls for regulatory warnings on algorithms that have not undergone testing with other cameras.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
The present work was supported by the Fundamental Research Funds of the State Key Laboratory of Ophthalmology and the National Natural Science Foundation of China (82101173, 81870663, 82171075).