Unsupervised clustering to differentiate rheumatoid arthritis patients based on proteomic signatures

Objective: Patients with rheumatoid arthritis (RA) have different presentations and prognoses. Cluster analysis based on proteomic signatures creates independent phenogroups of patients with different pathophysiological backgrounds. We aimed to identify distinct pathophysiological clusters of RA patients based on circulating proteomic biomarkers.
Method: This was a cohort study including 399 RA patients. Clustering was performed on 94 circulating proteins (92 CVDII Olink®, high-sensitivity troponin T, and C-reactive protein). Unsupervised clustering was performed using a partitioning cluster algorithm.
Results: The clustering algorithm identified two distinct clusters: cluster 1 (n = 223) and cluster 2 (n = 176). Compared with cluster 1, cluster 2 included older patients with a higher burden of comorbidities (cardiovascular and RA related), more erosive disease and longer RA duration, more dyspnoea and fatigue, a shorter distance walked in the Six-Minute Walk Test, more severe diastolic dysfunction, and a 4.5-fold higher risk of death or hospitalization for cardiovascular reasons. Tumour necrosis factor (TNF) receptor superfamily-related pathways were mainly responsible for the model’s discriminative ability.
Conclusion: Using unsupervised cluster analysis based on proteomic phenotypes, we identified two clusters of RA patients with distinct biomarker profiles, clinical characteristics, and outcomes that could reflect different pathophysiological backgrounds. TNF receptor superfamily-related proteins may be used to distinguish subgroups.

Rheumatoid arthritis (RA) affects approximately 1% of the worldwide population (1–3). Systemic inflammation plays a pivotal role in the underlying mechanism of RA, which may also add to the high cardiovascular risk of some patients with RA. Therefore, identifying such patients with high risk is important to better understand the pathophysiology of the disease, ascertain prognosis, and potentially develop targeted therapies.
Unsupervised machine-learning algorithms may help in identifying independent phenogroups of patients with different disease pathophysiology and prognoses (4,5). For example, in patients with heart failure with preserved ejection fraction, cluster analysis based on proteomic patterns enabled the identification of distinct disease phenotypes, such as those at a high risk of cardiovascular events or those who benefited from specific disease treatments (6–8).
In the present study we aimed to identify mutually exclusive phenogroups of RA patients based on a wide range of inflammatory and cardiovascular biomarkers. These biomarker-based phenogroups were then compared with regard to their clinical characteristics, biomarker profiles, and outcomes.

Study population
This prospective single-centre study included consecutive patients with RA aged 18 years or older followed in the Autoimmune Disease Outpatient Unit of Centro Hospitalar Universitário do Porto, Portugal, from June 2016 to June 2018 (ClinicalTrials.gov identifier: NCT03960515) (9). RA was diagnosed based on the 2010 American College of Rheumatology (ACR)/European League Against Rheumatism (EULAR) classification criteria (10). Patients who had an active neoplasm or a short life expectancy (< 6 months), severe dementia, or severe frailty (e.g. inability to walk or total dependence on a third person) were excluded.
Cardiovascular hospitalizations and deaths were prospectively recorded by regularly calling patients and their families, and then cross-checking the given information with medical records and hospital registries. Events were adjudicated by two authors (MBF and TF) using a standardized event definition form.
This study was conducted following the principles of the Declaration of Helsinki and approved by the hospital ethics committee under number 2016-023 (020DEFI/020CES). All patients signed written informed consent prior to entry into the study.
We systematically recorded the clinical data of all patients in a prespecified case report form built explicitly for this project. Independent external data cleaning, consolidation, and verification were performed to ensure data accuracy.

Patient evaluation, echocardiogram, and routine laboratory tests
At baseline/first study visit, we collected information regarding medical history, physical examination, treatments, and an RA-specific questionnaire. The Six-Minute Walk Test was performed according to the American Thoracic Society guidelines (11). The echocardiogram was performed by an experienced echocardiographer, blinded to clinical data, following international recommendations (12,13).
Routine blood laboratory tests [including complete blood count, C-reactive protein (CRP), glucose, blood electrolytes, lipid profile, creatinine, high-sensitivity troponin T (hsTnT), and N-terminal pro-brain natriuretic peptide (NT-proBNP)] were collected at baseline/first study visit and analysed at the central laboratory of the hospital. CRP was measured by enzyme-linked immunosorbent assay (Olympus CRP Latex Calibrator Normal Set®), hsTnT by the Elecsys (Roche Diagnostics), and NT-proBNP by the Gen 5 STAT test (Roche Diagnostics). NT-proBNP, hsTnT, and CRP showed intra-assay and interassay coefficients of variation of 10% at a concentration lower than the 99th percentile cut-off. At baseline/first study visit, plasma was frozen at −80°C in the central laboratory of the hospital and then sent to Uppsala, Sweden, for Olink® analysis.

Plasma Olink® biomarkers
We measured a large protein biomarker panel (Olink CVDII panel) that comprised 92 biomarkers from a wide range of pathophysiological domains. The Olink CVDII panel was purposefully selected because it contains several known human circulating proteins associated with cardiovascular and inflammatory diseases. An overview of the 92 circulating proteins is shown in Online Supplementary Table S1, including their full names and UniProt ID. In Online Supplementary Table S2, the proteins are described according to their main biological functions. The full description and technical details of the analytical assay are described in the Online Supplementary Addenda. The CVDII Olink panel displayed mean intra-assay and interassay coefficients of variation between 9.1% and 11.7%.

Statistical analysis
Cluster analysis was performed based on 94 biomarkers (92 Olink plus hsTnT and CRP), which were centred and unit scaled to allow each biomarker to contribute equally to cluster discovery. Moreover, k-means partitioning was used to unravel the biomarker clustering latent structure (Online Supplementary Figure S1). This procedure minimized the within-cluster variation. The average silhouette statistics were used to find the optimal number of clusters. As shown in Figure 1, two clusters were identified, almost perfectly separated. The starting centroid was set at random (with 50 randomization starts to improve cluster stability) and k-means was run with 300 bootstrapped samples.
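The clustering steps described above (centring and unit scaling, k-means with several random starts, and selection of the number of clusters by the average silhouette) can be illustrated with a minimal Python sketch. This is not the authors' R code: the data are synthetic, all function names are hypothetical, and the 300-sample bootstrap is omitted for brevity.

```python
import math
import random

def standardize(rows):
    """Centre each column to mean 0 and scale it to unit standard deviation."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) or 1.0
           for j in range(p)]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_once(rows, k, rng, max_iter=100):
    """One k-means run from random starting centroids; returns (labels, within-cluster SS)."""
    centroids = rng.sample(rows, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(r, centroids[c])) for r in rows]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    wss = sum(sq_dist(r, centroids[lab]) for r, lab in zip(rows, labels))
    return labels, wss

def kmeans(rows, k, n_starts=50, seed=0):
    """Keep the run with the smallest within-cluster sum of squares."""
    rng = random.Random(seed)
    runs = (kmeans_once(rows, k, rng) for _ in range(n_starts))
    return min(runs, key=lambda res: res[1])[0]

def mean_silhouette(rows, labels):
    """Average silhouette width over all observations."""
    scores = []
    for i, (r, li) in enumerate(zip(rows, labels)):
        dists = {}
        for j, (s, lj) in enumerate(zip(rows, labels)):
            if i != j:
                dists.setdefault(lj, []).append(math.sqrt(sq_dist(r, s)))
        if li not in dists:  # singleton cluster: silhouette taken as 0
            scores.append(0.0)
            continue
        a = sum(dists[li]) / len(dists[li])
        b = min(sum(d) / len(d) for lab, d in dists.items() if lab != li)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Synthetic stand-in for the biomarker matrix: two well-separated groups, 3 "biomarkers".
rng = random.Random(1)
data = ([[rng.gauss(0, 1) for _ in range(3)] for _ in range(30)]
        + [[rng.gauss(6, 1) for _ in range(3)] for _ in range(30)])
scaled = standardize(data)
silhouettes = {k: mean_silhouette(scaled, kmeans(scaled, k, n_starts=10)) for k in (2, 3, 4)}
best_k = max(silhouettes, key=silhouettes.get)
print(silhouettes, "-> chosen k =", best_k)
```

Standardizing first matters because k-means relies on Euclidean distance, so an unscaled biomarker with a large numerical range would otherwise dominate the partition.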
After identifying the patient clusters, logistic regression was used to assess the influence of each biomarker on the separation of the clusters. The contribution of each protein biomarker to cluster separation was evaluated by its added value to the classification, quantified using the Akaike information criterion (AIC) (Online Supplementary Figure S2).
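To make the AIC comparison concrete, the following Python sketch fits a one-biomarker logistic regression by Newton-Raphson and compares its AIC (2k − 2 log L, where k is the number of fitted parameters) with that of an intercept-only model. The data are simulated and the function names are hypothetical; the exact model specification used in the study is described only in the supplementary material, so this should be read as an illustration of the principle rather than a reproduction.

```python
import math
import random

def log_likelihood(b0, b1, x, y):
    """Bernoulli log-likelihood of a logistic model, written in a numerically stable form."""
    ll = 0.0
    for xi, yi in zip(x, y):
        eta = b0 + b1 * xi
        ll += yi * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return ll

def fit_logistic(x, y, with_slope=True, n_iter=25):
    """Newton-Raphson for intercept (+ optional slope); returns (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient components
            g1 += (yi - p) * xi
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        if with_slope:
            det = h00 * h11 - h01 * h01
            b0 += (h11 * g0 - h01 * g1) / det
            b1 += (h00 * g1 - h01 * g0) / det
        else:
            b0 += g0 / h00
    return b0, b1

def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik

def model_aic(x, y, with_slope):
    b0, b1 = fit_logistic(x, y, with_slope)
    return aic(log_likelihood(b0, b1, x, y), 2 if with_slope else 1)

# Simulated example: one biomarker related to cluster membership, one unrelated.
rng = random.Random(42)
informative = [rng.gauss(0, 1) for _ in range(200)]
noise = [rng.gauss(0, 1) for _ in range(200)]
cluster = [1 if rng.random() < 1 / (1 + math.exp(-1.5 * v)) else 0 for v in informative]

aic_null = model_aic(informative, cluster, with_slope=False)
aic_informative = model_aic(informative, cluster, with_slope=True)
aic_noise = model_aic(noise, cluster, with_slope=True)
print(aic_null, aic_informative, aic_noise)
```

A lower AIC indicates a better trade-off between fit and complexity, so a biomarker that lowers the AIC substantially relative to the model without it contributes more to separating the clusters.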
The primary outcome was the composite of cardiovascular mortality or hospitalization for cardiovascular reasons (i.e. heart failure, acute myocardial infarction, stroke, or sudden death). Survival probabilities were estimated using a Cox proportional hazards regression model and the Kaplan-Meier method stratified according to the patient clusters and adjusting for age, sex, diabetes, RA duration, and estimated glomerular filtration rate (eGFR) (owing to their clinical and prognostic value and for consistency across reports from this cohort) (9). No data imputation was performed.
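For readers less familiar with the Kaplan-Meier method, a minimal Python sketch of the product-limit estimator, computed separately per cluster, is given below. The follow-up times and event indicators are invented for illustration, the function name is hypothetical, and the covariate-adjusted Cox model is not covered by this sketch.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of survival.

    times  : follow-up time for each patient
    events : 1 if the outcome occurred at that time, 0 if the patient was censored
    Returns a list of (time, survival probability) at each distinct event time.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        n_events = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        n_leaving = sum(1 for ti in times if ti == t)  # events plus censored at t
        if n_events > 0:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve

# Invented follow-up data (years) for two groups; not the study data.
groups = {
    "cluster_1": ([0.5, 1.0, 1.5, 2.0, 2.3], [0, 0, 1, 0, 0]),
    "cluster_2": ([0.3, 0.7, 0.7, 1.2, 1.9, 2.1], [1, 0, 1, 1, 0, 1]),
}
curves = {name: kaplan_meier(t, e) for name, (t, e) in groups.items()}
for name, curve in curves.items():
    print(name, [(t, round(s, 3)) for t, s in curve])
```

Censored patients leave the risk set without reducing the survival estimate, which is how the method accommodates unequal follow-up times.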
All statistical analyses were performed using R version 4.0.3 (R Development Core Team, Vienna, Austria). A p-value < 0.05 was considered statistically significant.

Results
In total, 399 patients with RA were included in the present analysis; the median age was 61 years, 77% were female, 46% had hypertension, and the median left ventricular ejection fraction was 61%. The median duration of RA in these patients was 9 years; 39% had articular erosion and 45% took corticosteroids.
The unsupervised cluster analysis identified two groups with distinct proteomic profiles, labelled as cluster 1 (n = 223) and cluster 2 (n = 176). The comparison of the baseline characteristics between the two clusters is presented in Table 1. There were significant between-cluster differences in age, several cardiovascular risk factors and comorbidities (i.e. diabetes, hypertension, obesity, and heart failure history), RA duration and erosive disease, dyspnoea, fatigue, walked distance, systolic and diastolic dysfunction, and renal function, with all of the worse clinical conditions observed in cluster 2.

Proteomic biomarkers independently expressed between clusters
Of the 94 measured proteins, 86 (91%) were differentially expressed between clusters, with higher expression levels in cluster 2 (Online Supplementary Table S3). The adjusted logistic regression model showed that TRAILR2 (TNF-receptor superfamily member 10b) and TNFRSF11A (TNF-receptor superfamily member 11a) were the two biomarkers with the highest impact on the model's predictive ability (Online Supplementary Figure S3a and S3b). Compared with all biomarkers, these two proteins showed a good level of accuracy for separating our clusters (accuracy = 0.83) (Figure 2 and Table 2). This simpler model with similar accuracy is preferred over a more complex model, because it reduces the risk of overfitting and makes the predictions easier to implement in a clinical setting.

Outcome
During a median follow-up period of 1.5 (25th-75th percentile 0.7-2.3) years, 41 patients had the primary outcome (combined primary cardiovascular endpoint). After adjustment for age, sex, a history of diabetes, RA duration, and eGFR, cluster 2 was independently and significantly associated with a higher risk of the primary outcome compared with cluster 1 [adjusted hazard ratio 4.5 (95% confidence interval 2.2-9.1), p < 0.0001] (Figure 3). The cluster risk separation clearly identified two independent phenotypes, with almost all events occurring in cluster 2 patients, while in cluster 1 almost no events were recorded within 2 years of follow-up.
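The reported hazard ratio, confidence interval, and p-value can be checked for internal consistency with a short calculation. The Python sketch below assumes a Wald-type interval that is symmetric on the log-hazard scale (an assumption, as the interval type is not stated) and recovers the implied standard error, z statistic, and two-sided p-value; the function name is hypothetical.

```python
import math

def wald_from_hr(hr, ci_low, ci_high, z_crit=1.959964):
    """Recover log(HR), its standard error, the z statistic, and the two-sided
    p-value from a hazard ratio and its 95% confidence interval, assuming the
    interval is exp(log(HR) +/- z_crit * SE)."""
    log_hr = math.log(hr)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = log_hr / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return log_hr, se, z, p

log_hr, se, z, p = wald_from_hr(4.5, 2.2, 9.1)
print(f"log(HR) = {log_hr:.3f}, SE = {se:.3f}, z = {z:.2f}, p = {p:.1e}")
```

Under this assumption the implied standard error is about 0.36 and the z statistic about 4.2, which is compatible with the reported p < 0.0001; small discrepancies are expected because the published values are rounded.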

Discussion
An unsupervised machine-learning approach based on 94 proteomic biomarkers, applied to 399 patients with RA, identified two distinct clusters: cluster 2 (the high-risk cluster) and cluster 1 (the low-risk cluster). Cluster separation was unsupervised and agnostic to the cardiovascular status of the patients, i.e. it was based on circulating proteomic markers only. These two mutually exclusive clusters were organized mainly by a heightened expression of TNF-receptor superfamily-related proteins (TRAILR2 and TNFRSF11A), reinforcing the relevance of the TNF superfamily-related pathways in identifying subgroups of patients with RA with different characteristics, symptom severity, and prognoses. Supporting these findings, long-term anti-TNF-α therapy for the treatment of RA has been shown to reduce the risk of RA-related complications, and improve cardiovascular outcomes and quality of life (14–16).
To the best of our knowledge, the present study is the first to use multiple circulating proteins to identify phenogroups of RA patients with distinct characteristics and prognoses. Other studies in RA patients performed cluster analysis based on clinical characteristics but not on biomarker profiles (17,18). For example, one study used 14 clinical features to identify four subgroups with different risk profiles and potential for the initiation of biological disease-modifying anti-rheumatic drugs (18). Another study tried to distinguish patients with 'true refractory RA' from 'non-adherent dissatisfied patients' based on clinical data (17).
Proteomic analysis leveraging proteins involved in diverse disease pathways allows us to identify subgroups of patients with the same disease but with a different molecular and pathophysiological fingerprint, and different prognoses. Whether the proteomic footprints identified herein may help in identifying treatment responders requires adequate testing in prospective trials.
The identification in our study of two circulating proteins that are members of the TNF receptor superfamily (i.e. TNFRSF11A and TRAILR2) could facilitate the clinical applicability of our results. These two proteins may help in patient screening and prognostic stratification, and potentially identify patients' responses to therapeutic interventions. Thus, the TNF superfamily pathway was central to our cluster identification. The TNF superfamily system leads to a vast communication network among various cell types and tissues, which mediates signalling that controls the survival, proliferation, and differentiation of cells (19).
Regarding the two 'top' proteins identified in our study, their biological activity may explain their centrality in cluster separation. Specifically, TRAILR2 exacerbated autoimmune arthritis by enhancing both cellular and humoral immune responses, leading to the amplification of inflammation, hyperproliferation of synovial cells and arthritogenic lymphocytes, and increased production of cytokines (interleukin-2 and interferon-gamma). Furthermore, increased levels of TRAILR2 were associated with a poor cardiovascular prognosis (20–24). TNFRSF11A has an important role in bone metabolism regulation and appears to be involved in the immune response (25). Like TRAILR2, TNFRSF11A is also associated with poor cardiovascular outcomes (9,26).
This study is innovative, and these biomarkers (particularly TRAILR2 and TNFRSF11A) could be used to differentiate RA patients according to pathophysiological types and have potential for tailoring treatment if adequately tested in prospective randomized trials.
The limitations of this study include the lack of external validation of the presented proteomic phenotypes in other RA cohorts. This analysis was exploratory, and no power calculation was performed. Given the relatively small sample size and short follow-up time with few events, replication in larger cohorts with longer follow-up is required. In addition, the observational nature of this study precludes us from inferring any causality. Our RA population had a long history of disease (median of 9 years) and many were still being treated with corticosteroids and methotrexate; therefore, these data may not necessarily be applicable to more contemporary cohorts.

Conclusion
Using unsupervised cluster analysis based on proteomic phenotypes, we identified two clusters of RA patients with distinct biomarker profiles, clinical characteristics, and outcomes, which could reflect different pathophysiological backgrounds. TNF-receptor superfamily-related proteins may be used to distinguish subgroups.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Figure 1. Factorial distribution of patients according to two-cluster k-means partitioning. Separation of patients into two independent and mutually exclusive groups was based on k-means clustering, an unsupervised machine-learning technique. Dim, dimension.

Figure 2. Factorial distribution of patients according to two-cluster k-means partitioning using TRAILR2 and TNFRSF11A protein biomarkers.Separation of patients into two independent and mutually exclusive groups was based on k-means clustering using only two biomarkers: TRAILR2 and TNFRSF11A.The results are superimposable on those shown in Figure 1 using 94 biomarkers.

Table 1. Characteristics of the study population according to the identified clusters.