Ultra-High Dimensional Quantile Regression for Longitudinal Data: An Application to Blood Pressure Analysis

Abstract Despite major advances in research and treatment, identifying important genotype risk factors for high blood pressure remains challenging. Traditional genome-wide association studies (GWAS) focus on one single nucleotide polymorphism (SNP) at a time. We aim to select among over half a million SNPs along with time-varying phenotype variables via simultaneous modeling and variable selection, focusing on the most dangerous blood pressure levels at high quantiles. Taking advantage of rich data from a large-scale public health study, we develop and apply a novel quantile penalized generalized estimating equations (GEE) approach, incorporating several key aspects including ultra-high dimensional genetic SNPs, the longitudinal nature of blood pressure measurements, time-varying covariates, and conditional high quantiles of blood pressure. Importantly, we identify interesting new SNPs for high blood pressure. Besides, we find blood pressure levels are likely heterogeneous, where the important risk factors identified differ among quantiles. This comprehensive picture of conditional quantiles of blood pressure can allow more insights and targeted treatments. We provide an efficient computational algorithm and prove consistency, asymptotic normality, and the oracle property for the quantile penalized GEE estimators with ultra-high dimensional predictors. Moreover, we establish model-selection consistency for high-dimensional BIC. Simulation studies show the promise of the proposed approach. Supplementary materials for this article are available online.


Introduction
High blood pressure continues to be a major risk factor for morbidity and mortality worldwide. High levels of blood pressure are linked to diagnoses such as stroke and cardiovascular disease (e.g., Ettehad et al. 2016). Forouzanfar et al. (2017) found millions of deaths were linked to high levels of systolic blood pressure in 2015: 7.8 million deaths were related to systolic blood pressure above 140 mmHg. Lately, high blood pressure has even been associated with COVID-19 complications (Caillon et al. 2021).
This continuing burden of high blood pressure is partially because there are still unknown mechanisms and risk factors due to the complexities of high blood pressure. Despite major research progress and improvement in treatment, it remains a challenging problem to uncover blood pressure risk factors, since blood pressure is known to be a complex disorder "involving multiple interlinked physiological systems and organs, numerous cell types, myriad subcellular processes, and innumerable signaling pathways and networks" (Touyz and Schiffrin 2021). Due to the increased damaging consequences of high blood pressure, there is an urgent and continuing need to investigate the association of risk factors and high blood pressure levels through novel approaches.
With recent advances, genotyping technology can quickly and accurately collect genotype data consisting of hundreds of thousands of single nucleotide polymorphisms (SNPs) for participants in the study. These genotype data can greatly enhance studies of complex conditions such as blood pressure, since genotype data can be analyzed to uncover molecular genetic pathways that regulate factors of interest to help inform personalized treatment and develop medicines to lessen the burden of disease (Padmanabhan and Dominiczak 2021). Traditionally, genome-wide association studies (GWAS) have detected SNPs associated with a wide variety of complex conditions, including in blood pressure research (e.g., Evangelou et al. 2018). This widely used method implements a simple univariate linear regression approach which adjusts for multiple testing of the hundreds of thousands of SNPs. Although great strides have been made using the univariate GWAS approach, two aspects constrain its performance: it can only capture marginal associations, and it leads to an extremely high threshold of genome-wide significance levels (usually 10 −8 ) after Bonferroni correction. Such a high bar in testing still does not guarantee a low false positive rate for SNPs as many common variants have weak effects, especially in blood pressure (Levy et al. 2009).
Figure 1. This figure shows the longitudinal systolic blood pressure (SBP) measurements for five randomly selected participants with high blood pressure over waves 3 through 6 of the Framingham dataset. For each participant, the systolic blood pressure averaged over these waves is shown by a diamond of the same color located above the "averaged" marker on the horizontal axis.
In this article, we make the first attempt to identify important SNPs from hundreds of thousands of SNPs via simultaneous variable selection and modeling, focusing on high quantiles of blood pressure measurements that are of primary interest. We also use a rich dataset, the ongoing Framingham Heart Study (Dawber, Meadors, and Moore Jr 1951), consisting of over 500,000 SNPs from about 1500 participants, time-varying covariates such as age, body mass index (BMI) etc., and longitudinal blood pressure measurements over four waves spanning 17 years (see Figure 1). Altogether, we tackle this important problem with the following complex characteristics: (a) ultrahigh dimensional genotype variables where the dimension of SNPs p is much larger than the number of participants n; (b) time-varying phenotype covariates that may be important; (c) longitudinal measurements of blood pressure; and (d) conditional high quantiles of blood pressure are most of interest. For such a challenging problem, we develop and apply a novel quantile generalized estimating equations (GEE) approach with a penalty term for ultra-high dimensional predictor variables, aiming to identify important SNPs along with time-varying phenotype covariates.
Our novel approach to blood pressure analysis identifies new SNPs for high blood pressure and confirms SNPs previously identified in literature. Specifically, some newly selected SNPs warrant further investigation. Several of these newly identified SNPs may have a plausible association with high blood pressure. As an example, selected at the 90th conditional quantile for systolic blood pressure, rs10853618, rs10853619, and rs17747515 are located in chromosome 18 colorectal carcinoma (DCC) genes that have been found to be significantly associated with hypertension via the netrin signaling pathway in a hydrochlorothiazide blood pressure study (Shahin et al. 2016). As DCC is one receptor of netrin-1, it is possibly related to the function of netrin-1 on innervation of peripheral resistance arteries and thus regulates the blood supply to organs (Lasram et al. 2015). Moreover, many SNPs selected at the 90th conditional quantile are verified by the literature. For example, studies confirm the correlation between systolic blood pressure and rs1275988, located near the KCNK3 gene, a potassium channel, which is known to regulate vascular tone and found to be associated with pulmonary hypertension when mutated (Ganesh et al. 2014;Kato et al. 2015;Manichaikul et al. 2016). Rs2384550 is located in the TBX3/TBX5 gene, which is associated with diastolic blood pressure regulation (Levy et al. 2009).
A fundamental characteristic of our data is that the SNPs are ultra-high dimensional with hundreds of thousands of SNPs but with only thousands of participants. "Ultra-high dimensionality" (Fan and Lv 2008) refers to the situation when p can grow exponentially with the sample size (i.e., log p = O(n c ) for some 0 < c < 1). Most popular GWAS focus only on genetic variants one at a time, using a simple univariate linear regression. However, univariate regression can only capture simple marginal associations. In addition, mean regression may not be appropriate for blood pressure analysis. Instead of average blood pressure, high or abnormal blood pressure measurements are a major concern as these levels tend to cause the most negative consequences.
We advocate a penalized quantile regression approach to simultaneously modeling and selecting important SNPs and phenotype variables as a potentially useful alternative tool to GWAS. Even though there is little work beyond GWAS for blood pressure research, simultaneous variable selection and modeling in mean regression outside of blood pressure research have successfully identified important genetic variants for crosssectional data (e.g., He and Lin 2011;Li et al. 2011;Kohannim et al. 2012;Jiang, He, and Zhang 2016). Furthermore, ultrahigh dimensional datasets commonly exhibit extra heterogeneity, which is often scientifically important (Wang, Wu, and Li 2012). Since the seminal work Koenker and Bassett Jr (1978), quantile regression has shown great promise in various fields. Our quantile regression approach to blood pressure analysis also allows uniquely identified sets of risk factors for different blood pressure quantiles, which is ideal since treatment plans may require separate or calibrated methods for different response quantiles to have the greatest impact on disease.
In summary, in our challenging setting of ultra-high dimensional SNPs for high quantiles of blood pressure, we use variable selection and simultaneous modeling through a penalized model to identify important genotype and phenotype risk factors. Moreover, through a simulation study mimicking the real data, we show that the traditional univariate GWAS approach tends to incur a high false positive rate especially with a very large set of SNPs, as observed in Table 4 in Section 3. And, we find superior variable selection performance of our penalized quantile modeling approach overall.
Finally, the rich data we have consist of longitudinal continuous blood pressure measurements over four waves spanning 17 years. To see this, Figure 1 presents repeated measurements of blood pressure values and the cross-sectional average for five randomly selected participants with high blood pressure in our data. Simply taking the average of longitudinal blood pressure measurements, using only one wave, or using crosssectional analysis will likely miss valuable information and the nonignorable correlation within participants. As demonstrated in the large literature using GEE for longitudinal data analysis (e.g., Liang and Zeger 1986;Zeger and Liang 1992;Diggle et al. 2002;Wang, Zhou, and Qu 2012;Green et al. 2021), GEE can lead to increased estimation efficiency to uncover associations compared to approaches that assume independent observations. To further incorporate the longitudinal nature of our data, we apply a novel quantile generalized estimating equations (GEE) approach with a penalty term for ultra-high dimensional predictor variables that we develop for the blood pressure application. Besides identifying important SNPs and phenotype covariates, we can provide a comprehensive picture of conditional quantiles of repeated blood pressure measurements, which can potentially allow more insights and targeted treatments.
Our proposed approach is the first to select important genotype and phenotype variables from hundreds of thousands of SNPs and time-varying phenotype covariates simultaneously, incorporate the longitudinal nature of the data, and importantly, focus on high quantiles that are of most interest. Even for crosssectional data, this is the first work in blood pressure research to identify important SNPs and phenotype covariates through simultaneous variable selection and modeling for high quantiles of blood pressure, to the best of our knowledge.
In addition to our main contributions to the application of blood pressure analysis, we also provide methodological, computational, and theoretical contributions. Methodologically, this is the first paper to develop quantile penalized GEE in ultrahigh dimensions for simultaneous variable selection and modeling for longitudinal data. Computationally, we develop efficient algorithms along with detailed implementations. Theoretically, we establish challenging asymptotic theory including consistency, asymptotic normality, and the oracle property in ultrahigh dimensions for the quantile penalized GEE estimators.
Moreover, we establish a model-selection consistency theory for high-dimensional BIC (HBIC) in our challenging setting with a non-differentiable quantile objective function and penalized GEE.
The rest of this article is organized as follows. In Section 2, we detail the longitudinal data we use and provide our penalized generalized estimating equations model for conditional quantiles of blood pressure. We establish challenging theoretical properties for our approach in ultra-high dimensions. In addition, we propose HBIC for penalty parameter selection and prove the model-selection consistency of HBIC. We also provide computational implementation details. In Section 3, we present our application to systolic and diastolic blood pressure. We also describe newly identified SNP associations to blood pressure at the 90th conditional quantile and confirm previous associations. In Section 4, we present a simulation study to further investigate our approach's performance. We conclude and discuss in Section 5.

Model
To explore the relationship between quantiles of blood pressure measurements and phenotype and ultra-high dimensional genotype risk factors, we employ data from the continuing Framingham Heart Study (FHS) (Dawber, Meadors, and Moore Jr 1951). The FHS is a large-scale ongoing study that measures blood pressure, cardiovascular disease, and other vital health measurements of participants over multiple examinations and has been used to investigate various diseases (e.g., Haider et al. 2003). The FHS provides an opportunity to study associations between blood pressure and risk factors of blood pressure in a longitudinal setting. From this data, we include both systolic and diastolic blood pressure types, since high levels of both types of blood pressure are linked to diseases (Law, Morris, and Wald 2009). For our analysis, we use a subset of the data with n = 1577 participants over four waves of the study.
Recently, this study also collects ultra-high dimensional genetics risk factors, SNPs, from participants. One of the main challenges stems from having 1577 participants, but the number of SNPs is around 550,000. For more information on this data, see the website https: //www.ncbi.nlm.nih.gov/projects/gap/cgibin/study.cgi?study_id=phs000007.v32.p13. Specifically, in our blood pressure analysis, the response is longitudinal systolic or diastolic blood pressure measured over multiple waves. That is, for each participant i = 1, . . . , n, let the blood pressure measurements be Y i = (Y i1 , Y i2 , . . . , Y im i ) T measured over m i time points. The covariates X i = (X i1 , X i2 , . . . , X im i ) T include both genotype and phenotype risk factors, where X i1 = (X i11 , X i12 , . . . , X i1p n ) T and so on. Here p n is the dimension of the covariates, which can diverge and even be ultra-high dimensional. In particular, the seven time-varying phenotype covariates consist of age (AGE), total cholesterol (TC), smoking status (SMK), body mass index (BMI), ventricular rate (VR), triglyceride (TRIG), and high density lipoprotein cholesterol (HDL). The non-time-varying genetic covariates consist of 550,568 SNPs. The longitudinal observations within each participant are dependent, and observations between different participants are assumed independent.
We propose a novel quantile estimation and variable selection approach specifically for our blood pressure application with ultra-high dimensional covariates. Given τ ∈ (0, 1), we consider estimation and variable selection for longitudinal data in ultra-high dimensions for the τ th conditional quantile, ( 1 ) Here we allow different α τ at different quantile levels τ . For notational simplicity, we omit the subscript τ throughout. We denote the coefficient vector as α = (α 1 , α 2 , . . . , α p n ) T .
As the data is longitudinal in nature, ignoring the correlated measurements of participants likely leads to less efficient estimation. To capture this correlation, generalized estimating equations are a well-proven approach that can capture dependence without explicitly estimating the covariance matrix (e.g., Liang and Zeger 1986;Wang, Zhou, and Qu 2012).
We adopt the smoothly clipped absolute deviation (SCAD) penalty, which can be defined by its first derivative as q λ (ζ ) = 1 In our application, we use the identity matrix for i for simplicity of computation, and it works well. If the errors are highly heteroscedastic, it can be estimated empirically as discussed in Wang, Zhu, and Zhou (2009). We also implemented the empirical estimates using their Equation (3.1), and the results are similar in our application. We keep the term i here for generality. 2 The working correlation matrix R i (ρ) describes the correlation within participant i. We adapt and extend the most common working correlation structures in mean regression to our quantile penalized GEE modeling, following the classical literature of GEE. In particular, we implement independent, exchangeable, and autoregressive AR-1 working correlation structures as examples listed in Section 4 of Liang and Zeger (1986). Specifically, the independent working correlation structure simply takes the identity matrix. The exchangeable working correlation structure assumes a common correlation ρ between any pair of measurements. The AR-1 working correlation structure allows correlation to diminish exponentially over time. Again, as in Liang and Zeger (1986) and classical GEE literature, to estimate ρ, we borrow strength over all n subjects. We refer readers to detailed discussions in Liang and Zeger (1986).
Here λ is a regularization penalty parameter, q λ (0) = 0, and a = 3.7 as recommended by Fan and Li (2001). The SCAD penalty, originally proposed in the seminal work of Fan and Li (2001) for fixed-dimensional and cross-sectional data, has been a popular penalty for its nice oracle property for the corresponding penalized estimators. In particular, Fan and Li (2001) show that the SCAD penalty can achieve nearly unbiasedness as it is bounded by a constant so that large coefficients are not excessively penalized. It shrinks small estimated coefficients to zero automatically due to singularity at the origin. Meanwhile, the SCAD penalty is continuous in estimation, which can improve the stability of model selection. We refer readers to Fan and Li (2001) for more details.
To tackle the challenge due to the nondifferentiable quantile check function, we employ a version of induced smoothing (Brown and Wang 2005) and define smoothed estimating equations Here h is set as n − 1 2 , which achieves the same order as in induced smoothing. For simplicity, we do not treat h as a separate tuning parameter. And is the cumulative distribution function (CDF) of the standard normal distribution. This works well in our application.

Theoretical Properties
Given more than half a million SNPs in our blood pressure analysis, we need an effective approach that can cope with ultrahigh-dimensional data. We also strive to have good theoretical guarantees for both estimation and variable selection. However, as demonstrated in Fan and Peng (2004), approaches with fixeddimensional theoretical properties do not address the complications brought by diverging high-dimensional data. While providing theoretical guarantees is important, it is challenging to establish theoretical properties for our ultra-high dimensional method due to the nonsmooth quantile function in combination with the nonconvex penalty function with longitudinal data. Previous studies provided theory for quantile regression with longitudinal data only in a fixed-dimensional setting (e.g., Zhao, Lian, and Liang 2017), which cannot easily extend to ultra-high dimension.
We provide theoretical properties for the quantile penalized generalized estimating equation estimator of S P (α) = S(α) − nq λ (|α|)sgn(α) of Equation (2) in Theorem 1 and that of S P (α) = S (α) − nq λ (|α|)sgn(α) of Equation (3) in Theorem 2, when p n is ultra-high dimensional and the true p 1n , the number of important variables, can diverge. More specifically, we prove the convergence rate, oracle property, and asymptotic normality for the penalized estimators from our approach.
We use S P (1) and S P (2) to denote the first p 1n and last p n − p 1n equations, respectively. More generally, subscripts (1) and (2) are used to denote the first p 1n and the last p n − p 1n components of a p n -dimensional vector. For simplicity, with some abuse of notation, all estimators computed from S and S are denoted by α, while α 0 is the true parameter with its last p n − p 1n components being zero. Let R 0 and V 0 be the true correlation matrix and the true covariance matrix of I{Y i ≤ X i α}, respectively. For theoretical convenience, we assume R i and i are both known, although in practice they often need to be estimated. C is a constant that is positive which may take different values at various places. ∨ is the supremum of two or more elements. For ease of notation, we omit the subscript n in p n and p 1n . We need the following conditions: are bounded away from zero and infinity.
, together with its first and second derivatives, is bounded. f ij (0) = f (0|X ij ) is bounded away from zero for any X ij in the support of its distribution. (C3) The eigenvalues of R i and R 0 are bounded away from zero and infinity.
We say α is an n -approximate minimizer if S( α) ≤ inf α∈R p 1 S(α) +O p ( n ). If the infimum is taken in a neighborhood of α, it is called an n -approximate local minimizer. α is an We establish the following theorems.
Remark 1. Our assumption that log(p ∨ n)/n << min j≤p 1 |α 0j | implicitly imposed a constraint on the size of p. In particular, if we assume the signal strength min j≤p 1 |α 0j | >> n −a for some a > 0, then we require log p = o(n 1−2a ). In other words, we allow p to grow exponentially with the sample size, which is consistent with the results in Kim, Choi, and Oh (2008) and Fan and Lv (2011). Furthermore, the assumption np 2 1 r n log n + p , constraining the diverging number of important variables.
Theorem 2. Under the same assumptions of Theorem 1 and that (1) has the asymptotic normality property as in Theorem 1. (iii) α (2) has the zero crossing property: The sketch of the main propositions and the detailed proofs are provided in the supplementary materials.
To the best of our knowledge, this work is the first to establish theoretical properties for the longitudinal penalized quantile regression model in ultra-high dimension. Given the increased disease burden from high blood pressure, theoretical guarantees for our longitudinal quantile approach may help provide a solid foundation to identify different sets of important variables for different high quantiles of blood pressure.

High-Dimensional BIC
The penalty parameter λ is an important tuning parameter to select for our simultaneous variable selection and estimation using quantile penalized GEE for longitudinal data. We propose high-dimensional BIC for selecting λ. We further establish a model-selection consistency property of HBIC in Theorem 3 for quantile regressions for longitudinal data, advancing the literature from the important works of Leng (2009), Wang, Kim, and, and Chen and Chen (2008) in mean regressions for cross-sectional data.
Specifically, the best penalty parameter λ is the minimizer of the following high-dimensional BIC (HBIC): where p 1,λ is the number of nonzero parameters. C n is set as log log(p) as suggested by the literature. Furthermore, we establish the model-selection consistency of HBIC as follows.
Theorem 3. In addition to the assumptions used in Theorems 1 and 2, assume that we search over λ with the resulting support S λ ⊆ {1, . . . , p} having size bounded by a constant p max which is a known constant upper bound of p 1 . Furthermore, we assume p = O(n κ ) for some constant κ > 0 and min j≤p 1 |α 0j | >> C n log n/n. Then for λ that minimizes HBIC, we have P(S λ = S 0 ) → 1.
Remark 2. Establishing the theoretical justification of highdimensional BIC is challenging even for mean regression for cross-sectional data (Chen and Chen 2008;Wang, Li, and Leng 2009;Wang, Kim, and Li 2013). A main reason is due to the extremely large number of model combinations that must be considered as the dimensionality of covariates grows. Our quantile penalized GEE model for longitudinal data is more complicated due to the nondifferentiable quantile check function and the quasi-likelihood GEE setting. The current approaches for mean regression with squared loss can only handle p at the polynomial order of n, which is the same order we can achieve here.

Implementation
Estimating and selecting the important variables using our quantile penalized generalized estimating equation approach in (2) are challenging, because the quantile check function is nondifferentiable; the SCAD penalty function is nonconvex; and the data are longitudinal and more importantly ultra-high dimensional. To tackle these challenges, we propose to combine smoothing for the nondifferentiable check function and the minorization-maximization algorithm (Hunter and Li 2005) with a local linear approximation for the discontinuity of the first derivative of the nonconvex SCAD penalty function. The quantile penalized estimating equations (2) can be approximated as where is a small positive constant. We use = 10 −6 in practice.
To estimate α using this combination of smoothing and minorization-maximization, we employ an iterative algorithm that uses the idea in the Newton-Raphson algorithm. The updating step of the algorithm is defined as where φ(·) is the density function of the standard normal distribution.

High Blood Pressure Analysis
Our main goal is to identify important genotype and phenotype risk factors linked to high blood pressure while in a longitudinal setting with ultra-high dimensional SNPs. For this problem, we develop and apply a quantile approach for correlated data in ultra-high dimensions from the Framingham Heart Study. Our investigation consists of 1577 participants over four waves of the study spanning 17 years. Table 1 provides summary statistics for both systolic blood pressure (SBP) and diastolic blood pressure (DBP) at various quantiles along with means for other phenotype variables. Obviously, we are more interested in subjects at high quantiles who develop hypertension. At the 90th percentile of SBP, we observe a clear increase from 139 mmHg at exam 3 to 150 mmHg at exam 6. Using averages alone is not able to reflect the alarming hypertension levels for those of most interest. By examining the marginal correlation of blood pressure measures across each exam wave, we find that they are correlated with ρ ranging from 0.57 to 0.71, suggesting nonignorable correlation. We also extend and implement several recommended methods in the literature and find that an exchangeable working correlation is preferred slightly over AR-1, while the independence working correlation structure is clearly ruled out. 3 We conduct two separate analyses: systolic blood pressure as the response and diastolic blood pressure as the response, respectively. For each type, we separately perform preprocessing for the SNPs as discussed in supplementary materials. Since we are most interested in identifying risk factors for high risk participants, as an example we focus on the 90th conditional quantile, an abnormally high blood pressure level, as the conditional response. If even higher blood pressure groups are of interest, our approach is certainly applicable for various quantiles. To benchmark, we also apply penalized GEE for mean regression (Wang, Zhou, and Qu 2012) to our longitudinal blood pressure data. To provide the standard error of coefficient estimates, we use 500 bootstrap samples on the subject level.
3 There is very limited research for working correlation selection even in mean regression. We extend and apply two recommended methods in the literature, that is, the quasi-likelihood information criterion (QIC) (Pan 2001) and correlation information criterion (CIC) (Hin and Wang 2009). We also extend a modified version of AIC or BIC on quantile GEE (Fu, Wang, and Zhu 2015) to our model. All methods select the exchangeable working correlation structure over the AR-1 working correlation structure with a minor margin. Developing a rigorous method to choose the best working correlation structure in our complex model setting could be an interesting future study.
NOTE: The parameter estimates and standard errors (s.e.) are presented for the selected phenotypes from each approach. Selected genotypes are ordered in such a way that commonly selected SNPs are shown first for ease of comparison. At the 90th conditional quantile, the symbol * indicates newly detected SNPs and the symbol * * indicates newly detected SNPs with a plausible association.

Systolic Blood Pressure
Our main goal is to identify important SNPs and phenotype risk factors linked to longitudinal measurements of systolic blood pressure at abnormally high measurements. To this end, we use our proposed penalized generalized estimating equations quantile approach at the 90th conditional quantile of systolic blood pressure. The unconditional 90th percentile of systolic blood pressure corresponds to an abnormal reading of about 145 mmHg. Research shows this high level of systolic blood pressure to be very harmful (Forouzanfar et al. 2017). We present the 90th conditional quantile as an example to save space, and our approach can easily apply to various quantiles of interest. 4 Table 2 presents the results from our approach at the 90th conditional quantile, the median, as well as mean penalized 4 When data is ultra-high dimensional, the computational time may inhibit practical usage as detailed in Fan and Lv (2008). Ultra-high dimensional data, therefore, often requires screening in practice. In our case study, we follow the literature and preprocess SNPs with classic quality control for GWAS analysis and dimension reduction as discussed in the supplementary materials along with sure independence screening (SIS) (Fan and Lv 2008). One can also use a quantile screening approach such as QC-SIS (Ma and Zhang 2016). We find the results are similar in our case study. GEE (PGEE) for systolic blood pressure. This table shows that six out of seven phenotypes except smoking status are selected throughout, and literature also confirms the importance of these phenotypes (Dua et al. 2014;Kannel 1989). More interestingly, new SNPs are identified using our proposed approach at the 90th conditional quantile but are not detected in previous blood pressure literature. 5 We find that several of these may have a plausible association with high blood pressure and are marked with * * in Table 2. Specifically, rs4523751 is located in chromosome 12 LINC00615 gene, and LINC00615 has been found as one of top genes that are associated with blood pressure in Kim et al. (2017). The SNPs rs10853618, rs10853619, and rs17747515 are found in colorectal carcinoma (DCC) genes that have been associated with hypertension by possibly impacting hydrochlorothiazide blood pressure response, innervating peripheral resistance arteries, and regulating the supply of blood to organs (Storkebaum and Carmeliet 2011;Lasram et al. 2015;Shahin et al. 2016). It is worth noting that the new plausible SNP rs17747515 is only identified at the 90th quantile but not in the median nor in the mean model. These new findings may be potentially interesting for researchers to further understand the genetic risk factors of high blood pressure.
The rest of the SNPs selected at the 90th conditional quantile are verified by the scientific literature. In particular, rs13149993, located within gene FGF5, was shown to have a strong association with mean arterial pressure for trans-ethnic replication in Kelly et al. (2013). Rs1957563 near CLINT1 was shown to have a nominally significant association with mean arterial pressure as in Wain et al. (2011). Rs16893776, located within C8orf38, was also reported as an important SNP for blood pressure in Wain et al. (2011). The association of high systolic blood pressure and rs1275988, located near gene KCNK3, has been confirmed by multiple studies (Ganesh et al. 2014;Kato et al. 2015;Franceschini et al. 2016;Manichaikul et al. 2016). The gene KCNK3, a potassium channel, helps regulate vascular tone, and when mutated, it has been shown to be related to pulmonary hypertension (Ma et al. 2013). Rs7467853, located within gene LINC00474, was reported among the top 100 SNPs with associations to mean arterial pressure in de las Fuentes et al. (2012). Rs9874923, located within gene KAT2B, was identified as a novel blood pressure loci in Simino et al. (2013). The coded allele of rs9874923 was shown to have protective effects of blood pressure depending on alcohol consumption. Rs2384550, located in gene TBX3/TBX5, was found to be associated with the regulation of diastolic blood pressure in Levy et al. (2009) andHo et al. (2011). Rs956006 has been identified in the meta-analysis on Genetic Epidemiology Research on Adult Health and Aging (GERA) cohorts combined with the International Consortium for Blood Pressure (ICBP) dataset in Hoffmann et al. (2017). Rs16823124 and rs4987082 were reported in UK Biobank as important SNPs for blood pressure, which were cross validated by studies for the Uganda population (Lule et al. 2019) and Chinese populations (Guo et al. 2012).
Moreover, most of the SNPs selected only in the mean PGEE model in the last column of Table 2 are not commonly mentioned in the literature such as rs12739022, but most of the SNPs selected only in the 90th quantile are either verified by the scientific literature or found with a plausible association. This further indicates possible heterogeneity in the data and suggests a clear need to focus on high quantiles, especially when abnormally high blood pressure is of primary interest.

Diastolic Blood Pressure
High diastolic blood pressure has also been linked to serious diseases. For example, researchers show that reducing systolic blood pressure by 10 mmHg or diastolic blood pressure by 5 mmHg is linked to a decrease in coronary heart disease and stroke (Law, Morris, and Wald 2009). We perform a similar analysis as in Section 3.1, where we seek to identify risk factors linked to longitudinal measurements of diastolic blood pressure at the 90th percentile. The unconditional 90th percentile of diastolic blood pressure is about 89 mmHg, an irregularly high measurement. Table 3 shows the important risk factors selected for diastolic blood pressure using our approach at the 90th conditional quantile, the median, and mean PGEE regression. While most of the phenotype variables except smoking status are selected as important throughout, age is only selected in the 90th conditional quantile model. More specifically, age is positively associated with diastolic blood pressure at the 90th conditional quantile. This suggests that age is an important risk factor for high diastolic blood pressure levels as is commonly identified by previous literature such as Shi et al. (2009).
For genotype variable selection, Table 3 indicates some interesting new SNPs are identified by our proposed approach at the 90th conditional quantile. We find that rs1873691 may have a plausible association with high blood pressure. Specifically, rs1873691, located in chromosome 10, is associated with the pore-forming α subunit gene KCNMA1. KCNMA1 has been found to have an association with severe hypertension in Tomás et al. (2008). This is possibly due to the β 1 regulatory subunit of the Ca 2+ -dependent potassium channel that modulates the renin-angiotensin-aldosterone system. These findings are interesting to further explore in future research.
The rest of the SNPs selected at the 90th conditional quantile are verified by the scientific literature, including the verified SNPs that have already been discussed in the SBP analysis (rs13149993, rs1957563, rs1275988, rs7467853, rs16823124, rs9874923, rs956006, rs16893776, rs4987082). SNPs specifically selected by DBP but not SBP at the 90th quantile, for example, rs1036477, have been verified by multiple studies including Wain et al. (2017) and Hamrefors et al. (2012). Rs1327235 was identified in Ehret et al. (2011) to have an association to hypertension based on a combined discovery in individuals of European ancestry. 6 Overall, we identify interesting SNPs from over half a million SNPs along with time-varying phenotype covariates through simultaneous variable selection and modeling for longitudinal rs16948048 rs16948048 rs4987082 rs11222084 rs2714017 rs5905498 * NOTE: The parameter estimates and standard errors (s.e.) are presented for the selected phenotypes from each approach. Selected genotypes are ordered in such a way that commonly selected SNPs are shown first for ease of comparison. At the 90th conditional quantile, the symbol * indicates newly detected SNPs and the symbol * * indicates newly detected SNPs with a plausible association.
data focusing on the high quantiles of systolic and diastolic blood pressure, respectively, unlike traditional GWAS focusing on marginal associations only. The analysis also indicates that important SNPs and phenotype risk factors may vary across different conditional quantiles. Both blood pressure types are likely heterogeneous. Our quantile regression approach in ultrahigh dimension may provide a better view of the heterogeneity of blood pressure to help provide new insights that can inform more targeted treatments for high blood pressure.

Remarks on GWAS
To empirically compare the performance of our approach to standard univariate GWAS analysis, we conduct a simulation study that mimics the real-world genetic data. First, n = 400 subjects and p = 2000 or 100,000 SNPs along with a phenotype control variable, age, are randomly sampled from our Framingham study data to form predictors X i similar to Bao and Wang (2017). The response variable is then generated under model (1) with errors from a multivariate normal distribution that has an exchangeable correlation structure with ρ = 0.8. The coefficients of age and four important SNPs are set as 0.5 and 0 otherwise.
We summarize the variable selection results in Table 4. To evaluate variable selection performance, we report for 500 simulation runs the average number of true important variables that our approach selected, "TP"; the average number of nonimportant variables that our approach selected as important, "FP"; and F 1 = 2TP 2TP+FP+FN , "F 1 " to measure the overall accuracy of variable selections. While the true positive values are fairly comparable, our proposed method provides much lower average false positive values and substantially higher F1 measures, especially for the scenario with 100,000 SNPs. This clearly indicates a superior overall variable selection performance compared to the standard univariate GWAS. This limited simulation study also confirms the finding in blood pressure literature that there is a high false positive rate using GWAS (Levy et al. 2009).
We note that GWAS and the proposed method are not exclusive but complementary. In blood pressure research, we emphasize that our work is the first available approach to employ simultaneous variable selection and estimation through a penalized approach for ultra-high dimensional SNPs and time-varying phenotype variables to model conditional quantiles of blood pressure. Moreover, since participants' blood pressure is measured repeatedly over multiple waves and the correlation is non-ignorable, we incorporate the longitudinal nature of the data through GEE. We hope that our approach can add a new useful alternative to the challenging problem of SNP selection for blood pressure research.

Numerical Simulations
To further evaluate the variable selection and estimation performance of our quantile approach developed for longitudinal data in ultra-high dimensions for blood pressure analysis, we perform a simulation study. The simulated data has aspects including homogeneity, heterogeneity, and high dimensionality in a longitudinal setting.
We consider the following simulation model: where i = 1, . . . , n, and κ is either 0 or 1 corresponding to two settings: the homoscedastic setting and heteroscedastic setting.
The covariates X i for both settings are drawn from the standard normal distribution, and α 0 = (1, 1, 1, 1, 1, 0, . . . , 0) T . We draw the errors e i from a multivariate normal distribution that has an exchangeable correlation structure with ρ = 0.8. For both settings, we present cases with n = 200 observations and numbers of covariates p = 10, 200, 2000. Under both the homoscedastic setting and heteroscedastic setting, the number of repeated measures is m = 5. Further, to assess performance at various quantiles, we implement our approach using different τ levels including τ ∈ (0.1, 0.5, 0.9).
To evaluate variable selection performance, besides the number of true positives and false positives, we also report the percentage of simulation runs that correctly selected the true important variables only, "%". To further evaluate estimation performance, we report average mean squared error MSE = We use HBIC to choose the penalty parameter λ for the SCAD penalty as discussed in Section 2.3. Following common practice for ultra-high dimensional data, we reduce the dimension using SIS (Fan and Lv 2008). We evaluate various working correlation structures, including independence, AR-1, and exchangeable, to assess robustness when working correlation matrices are incorrectly specified. Table 5 shows the simulation results for the homoscedastic setting. This table indicates that for each quantile level our approach identifies the correct number of predictors a large percentage of the simulation runs. Even in a high dimensional setting when p = 2000, the amount of true positives on average achieves the correct amount, the amount of false positives on average is zero, and the correct percentage is high. This table also confirms the importance of incorporating correlation of the longitudinal data in analysis. The MSE and MAD are smaller under the true exchangeable working correlation structure than the independence structure that does not capture correlation. This holds true for various quantile levels.
For the heteroscedastic setting, Table 6 confirms the satisfactory variable selection performance of our approach in various quantiles. Again, Table 6 indicates that capturing the correlation among observations is beneficial for estimation performance. The MSE and MAD are lowest with the true exchangeable working correlation structure compared to the independence structure that does not incorporate correlation.
In the supplementary materials, we provide an additional simulation study for unbalanced data. We also conduct an additional simulation study to demonstrate our quantile penalized GEE approach incorporating correlation in the longitudinal data is advantageous over averaging the waves or only using one wave.

Discussion
High blood pressure is always a major health problem, and there are multiple recent calls to action to lessen the burden of this harmful condition including from the 2020 U.S. surgeon general NOTE: "p" is the total number of covariates in each setting where n = 200 and m = 5 in all settings. The true coefficient vector includes five coefficients with a value of 1 and the remaining p − 5 coefficients are zeros. We generate this data using the exchangeable correlation structure and set the correlation coefficient ρ = 0.8. "τ "is the quantile level used in each model. "%"is the fraction that the true model is identified. The mean number of true positives and false positives selected over the simulation runs are denoted by "TP" and "FP, " respectively. "MSE" is the average mean squared error || α − α 0 || 2 2 for all simulation runs, and "MAD" is the average mean absolute deviation || α − α 0 || 1 for all simulation runs. Standard errors are provided in parentheses. NOTE: "p" is the total number of covariates in each setting where n = 200 and m = 5 in all settings. The true coefficient vector includes five coefficients with a value of 1 and the remaining p − 5 coefficients are zeros. We generate this data using the Exchangeable correlation structure and set the correlation coefficient ρ = 0.8. "τ "is the quantile level used in each model. "%"is the fraction that the true model is identified. The mean number of true positives and false positives selected over the simulation runs are denoted by "TP" and "FP, " respectively. "MSE" is the average mean squared error || α − α 0 || 2 2 for all simulation runs, and "MAD" is the average mean absolute deviation || α − α 0 || 1 for all simulation runs. Standard errors are provided in parentheses.
as discussed in Touyz and Schiffrin (2021). In this application, we capitalize on the information-rich data of a large-scale longitudinal public health study consisting of over half a million genotype risk factors, time-varying phenotype risk factors, and longitudinal blood pressure measurements. Our approach is also motivated by the destructive effects of high blood pressure quantiles as opposed to average blood pressure levels.
We aim to identify important SNPs along with time-varying phenotype covariates through simultaneous variable selection and modeling for longitudinal data focusing on high quantiles of blood pressure. Our analysis uncovers new SNPs plausibly related to high blood pressure and identifies important SNPs confirmed in previous research. Some newly identified SNPs deserve future careful investigation. In a simulation study mimicking the real data, we confirm the superior performance of the proposed approach in comparison to traditional GWAS that analyzes one SNP at a time. Furthermore, the proposed approach can help grasp a comprehensive picture of blood pressure along with simultaneous modeling and variable selection from hundreds of thousands of SNPs. This is a challenging task because of the ultra-high dimensional SNPs, longitudinal nature of the data, and the non-smooth objective function of quantile regression. In addition to the blood pressure application, we further make methodological, computational, and theoretical contributions for quantile penalized GEE in ultra-high dimensions for simultaneous variable selection and modeling for longitudinal data.
Our analysis indicates both blood pressure types are likely heterogeneous, and our penalized quantile approach to ultrahigh dimensional longitudinal data is well suited to exploit the rich and diverse data in the ongoing large-scale Framingham Heart Study. This analysis may help provide new insights for more targeted treatments for high blood pressure. Especially since people who have the highest quantiles of blood pressure also tend to suffer from other comorbidities and health challenges, our findings of different risk factors at different quantiles of blood pressure have potentially important implications for subgroups and high-risk populations. This proposed approach can potentially provide another vehicle to study other conditions where abnormal levels are of most interest, especially when the rich longitudinal data have ultra-high dimensional risk factors.

Supplementary Materials
Supplementary materials include details on data preprocessing, a sketch of theoretical proofs, detailed technical proofs along with supporting lemmas and propositions, and additional simulation studies.

Disclosure Statement
No potential competing interest was reported by the authors.