Likelihood Inferences on Semiparametric Odds Ratio Model

A flexible semiparametric odds ratio model has been proposed to unify and to extend both the log-linear model and the joint normal model for data with a mix of discrete and continuous variables. The semiparametric odds ratio model is particularly useful for analyzing biased sampling designs. However, statistical inference of the model has not been systematically studied when more than one nonparametric component is involved in the model. In this article, we study the maximum semiparametric likelihood approach to estimation and inference of the semiparametric odds ratio model. We show that the maximum semiparametric likelihood estimator of the odds ratio parameter is consistent and asymptotically normally distributed. We also establish statistical inference under a misspecified semiparametric odds ratio model, which is important when handling weak identifiability in conditionally specified models under biased sampling designs. We use simulation studies to demonstrate that the proposed approaches have satisfactory finite sample performance. Finally, we illustrate the proposed approach by analyzing multiple traits in a genome-wide association study of high-density lipid protein. Supplementary materials for this article are available online.


INTRODUCTION
The log-linear model (Bishop, Fienberg, and Holland 1975) and the joint normal model are the two most frequently used modeling approaches to discrete and continuous data, respectively. For data with a mix of discrete and continuous variables, a conditional normal model combined with a log-linear model is often used for the data analysis. Such approaches, however, are not flexible in model specification, nor robust against model misspecification. For example, it cannot accommodate possible interaction between two continuous variables. In addition, the normal modeling approach cannot address the problem of asymmetry and/or the long tail of the distribution of continuous variables (Azzalini and Valle 1996). Two versions of semiparametric odds ratio model have been proposed, respectively, by Chen (2004Chen ( , 2007Chen ( , 2010 and Osius (2004Osius ( , 2009 to extend both the log-linear model and the joint normal model to address these issues in a unified way. Chen parameterized the model by odds ratio functions and conditional distributions at fixed conditions, while Osius parameterized the model by an odds ratio function and marginal distributions. The parameterization adopted by Chen (2004Chen ( , 2007Chen ( , 2010 has a closed-form expression for the joint density and is easier to study and generalize.
The semiparametric odds ratio model is particularly useful for handling biased sampling designs (Chen 2003(Chen , 2007(Chen , 2011Osius 2009), such as case-control design, matched case-control design, and extreme-value sampling design widely used in epidemiological studies. Traditionally, the Zhang 1995, 1996;Zhang and Risch 1996;Chen and Li 2011), it is unclear what supplemental information is needed to achieve strong identifiability and computation stability. Osius (2009) studied the asymptotic properties of the likelihood estimator of his version of the semiparametric odds ratio model for the case where outcome-dependent samples are taken on a fixed number of outcome values. A systematic theoretical study of the likelihood approach to semiparametric odds ratio model has not been done, especially when more than one nonparametric component are involved in the model. In this article, we develop a maximum semiparametric likelihood approach to estimation and inference of the semiparametric odds ratio model (Chen 2004(Chen , 2007(Chen , 2010(Chen , 2011 aiming at addressing the issue of weakly identifiability in applications. Our study reveals that when all the parameters are strongly identifiable, such as in the generalization of log-linear model for discrete data and joint normal model for continuous data, the maximum likelihood estimator of parameters in the semiparametric odds ratio model is consistent and asymptotically normally distributed. When weakly identifiable parameters are involved in the odds ratio function, which is often induced by nonmultiplicative model or conditionally specified models under biased sampling designs, directly maximizing the likelihood is computationally challenging and the theoretical results may not apply. An approximate semiparametric odds ratio model is proposed to circumvent the inherent computational problem associated with estimating the weakly identifiable parameters. In this case, statistical inference on the model parameters requires us to study the asymptotic properties of the maximum likelihood estimator under the misspecified semiparametric odds ratio model that has not been done previously.
The remainder of this article is organized as follows. In Section 2, the semiparametric odds ratio modeling framework is briefly reviewed. In Section 3, theories for the maximum likelihood estimation and inference on both correctly specified and misspecified semiparametric odds ratio model are, respectively, established. Section 4 formulates applications of the developed theory through approximate likelihood for problems with weakly identifiable parameters frequently encountered in genetic association studies involving conditionally specified models under biased sampling designs. Section 5 deals with a specific application of the proposed approach to the genome-wide association study of lipoproteins, which involves multiple traits under extreme-value sampling design. Extensive simulation studies are also conducted in this section to assess the performance of the proposed approach. Section 6 concludes the article with a discussion of the proposed approach and further research. Proofs of the theoretical results are provided in the online supplementary appendix.

SEMIPARAMETRIC ODDS RATIO MODEL
Let Y denote the data from a sampling unit. Suppose that variables in Y are divided into t groups as Y 1 , . . . , Y t , where Y j taking values in Y j is a vector of d j ≥ 1 dimension for j = 1, . . . , t. Let p(y 1 , . . . , y t ) be the density of (Y 1 , . . . , Y t ) under an appropriate product of count measures and Lebesgue measures. Assume the positivity condition (Besag 1974(Besag , 1996 holds, that is, where p j (y j ) denotes the marginal density of Y j . Chen (2007Chen ( , 2010 showed that, on one hand, any given joint density can be represented by odds ratio functions and conditional densities at a fixed condition. On the other hand, the semiparametric model that models the odds ratio functions parametrically and the conditional densities at a fixed condition nonparametrically is very flexible and unifies many existing parametric and semiparametric models. To keep this article self-contained, we give a brief review of these results here. Without loss of generality, we mainly focus on t = 3 in the remainder of this article. Let (y 10 , y 20 , y 30 ) be a fixed reference point in the sample space. The conditional odds ratio function of y 1 versus y 2 given y 3 with the reference point (y 10 , y 20 ) is defined as η(y 1 , y 10 ; y 2 , y 20 | y 3 ) = p(y 1 , y 2 | y 3 )p(y 10 , y 20 | y 3 ) p(y 10 , y 2 | y 3 )p(y 1 , y 20 | y 3 ) .
Chen (2011) showed that the maximum semiparametric likelihood estimator of Q 3 exists and takes values only on observed values of Y 3 . Moreover, the profile likelihood with Q 3 profiled out is exactly the same as the conditional likelihood based on p(y 1 , y 2 | y 3 , S = 1). In the following, we concentrate on this conditional likelihood in studying the asymptotic properties of the maximum likelihood estimator of γ = (γ 1 , γ 2 ).

The Kullback-Leibler information exists and is finite. That
is, The identifiability condition 1 is an important requirement, which may fail when weak identifiability occurs. Conditions 2 and 3 are technique regularity conditions for applying uniform law of large number and central limit theorem to obtain the asymptotic results. We did not explore the best possible conditions, and as a result, these conditions may be relaxed. Condition 4 is required for consistency and condition 5 for √ nconvergence rate of the parameter estimator.
Theorem 1. Under Conditions 1-5, the maximum likelihood estimator of γ from maximizing (8) exists and is almost surely consistent. Furthermore, the maximum likelihood estimator of γ , denoted byγ n , is asymptotically normally distributed. That is, with σ defined in the Appendix and h 0 being a vector having the same dimension as γ . The variance function σ can be consistently estimated by the sampling version of the inverse of the information operator or by the inverse of the information matrix of the profile likelihood for γ from (8).

Misspecified Semiparametric Odds Ratio Model
To account for possible misspecification involved in the odds ratio functions, we prove a generalized version of Theorem 1 under modified conditions. Let where p 0 (y 1 , y 2 , y 3 ) is the true density generating the observed data. The modification keeps Conditions 1-3 and replaces Conditions 4 and 5 by the following conditions: 4a. The Kullback-Leibler information exists and is finite, that is, where E 0 is the expectation computed under f 0 . 5a. The minimizer in (9) exists and is unique. Furthermore, let H 0 be a bounded set in R p . Let H k be the collection of functions of y k with bounded variations on the support of Q k0 Condition 4a is the same as Condition 4 except that the true density f 0 (y 1 , y 2 , y 3 ) generated the data is different from the best possible model f (y 1 , y 2 , y 3 | γ 0 , Q 10 , Q 20 ) in the semiparametric odds ratio family. The second part of Condition 5a is a strengthened version of an inequality that can be derived from the first part of Condition 5a. The modified conditions ensure that the density minimizing the Kullback-Leibler divergence within the semiparametric odds ratio model exists, the minimizer is asymptotically unique, and the matrix of the second derivatives is invertible. Theorem 2 establishes that the estimator from maximizing the likelihood under misspecified model converges and is asymptotically normally distributed. Proof of Theorem 2 is given in the online supplementary appendix.
Theorem 2. Under Conditions 1-3, 4a, and 5a, the maximum likelihood estimator of γ from maximizing (8) exists and is almost surely converges to γ 0 , which is the value of γ minimizing the Kullback-Leibler divergence. Furthermore, the maximum likelihood estimator of γ is asymptotically normally distributed. That is, Theorem 1 is a special case of Theorem 2. To see this, note that Conditions 4 and 5 imply Conditions 4a and 5a. It is easy to see that Condition 4 implies Condition 4a. Under Conditions 4 and 5, by Jensen's inequality. The equality holds only when f (y 1 , y 2 | y 3 , γ, Q 1 , Q 2 ) = f 0 (y 1 , y 2 | y 3 ), which implies the minimizer in (9) exists, unique, and is the same as the true parameter values by the identifiability Condition 1. Note that the identifiability here refers to the misspecified semiparametric odds ratio model, not the odds ratio model corresponding to the family of distributions containing the truth, which may be weakly identifiable only. See the next section for more details. Furthermore, f 0 (y 1 , y 2 , y 3 ) = f (y 1 , y 2 , y 3 | γ 0 , Q 10 , Q 20 ) implies that If, for some (h 0 , h 1 , h 2 ), Similarly, we can show that h 2 (Y 2 ) = 0. By plugging in the results back, we see that It follows from Condition 5 that h 0 = 0. Therefore, if (h 0 , h 1 , h 2 ) = 0, the strict inequality in Condition 5a holds. These results show that Condition 5a is true. Strictly speaking, Theorems 1 and 2 cannot be directly applied to data generated by a normal model because of the compactness requirement of Y j , j = 1, 2, 3. However, we did not run into any problem in such applications. If this becomes a concern, it can be easily circumvented by trimming, that is, discarding any observations with very large values in any of Y 1 , Y 2 , Y 3 . Trimming data in this way do not affect the odds ratio function within the trimming range because trimmed data are a biased subsample depending only on the marginal values of Y's.

Conditionally Specified Models
Joint model is often conveniently specified and easily interpreted through consecutive conditional models. With a biased sampling design, the odds ratio functions may contain weakly identifiable parameters (Chen 2011) in such models. As a result, Condition 1 required by Theorem 1 may not be satisfied. Consider the trivariate case where p(y 1 , y 2 , y 3 ) = p 1 (y 1 |y 2 , y 3 )p 2 (y 2 |y 3 )p 3 (y 3 ).
The density q 3 is usually unspecified. In this case, the conditional likelihood corresponding to (8) becomes The identifiability result in Chen (2011) guarantees that η 1 , η 2 , and q j (y j ), j = 1, 2, 3, are identifiable from the biased sampling design (5). However, some parameters in η t and η * 2 may be unidentifiable or only weakly identifiable from the biased sample. This can lead to poor performance of the estimator for the model parameters with practically attainable sample sizes when the parameters take values close to an unidentifiable point. We approach this problem using the following approximation.

Approximation to Odds Ratio Functions
For simplicity of discussion, we assume in the following that both q * 2 and q * 3 are unconstrained density functions so that both q 2 and q 3 are unconstrained. In this case, the primary problem lies in the estimation of parameters in η t . The parameters in η t is determined by η 1 that is identifiable from the biased sampling design, and by p 1 (y 1 |y 20 , y 30 ), which may or may not be identifiable from the biased sampling design. We propose an approximation to η t (y 2 , y 3 ) to circumvent the problem.

Models and Approximation for Multiple Traits Analysis
The odds ratio approximation framework can be applied to many problems with biased sampling designs using consecutively conditionally specified models, for example, in efficiency gain by exploiting the gene-environment independence assumption (Chatterjee and Carroll 2005;Lin and Zeng 2006;Chen, Chatterjee, andCarroll 2008, 2009;, and in secondary traits analysis (Monsees et al. 2009;Lin and Zeng 2009;Chen, Reilly, and Li 2013) of a case-control sample. Details of these applications are discussed elsewhere (Chen, Kittles, and Zhang 2013;Chen, Reilly, and Li 2013). Here, we focus on an application to identifying genes that are associated with multiple phenotypes with data from a biased sampling design. One biological mechanism that favors the joint analysis of multiple phenotypes is the pleiotropy, a single gene that affects multiple phenotypes.
To avoid lengthy discussion in a general treatment, we concentrate on the case where the primary outcome Y 1 is a quanti-tative trait and the extreme-value sampling design is employed. Let Y 2 denote other phenotypes of interest and Y 3 = G denote the genotype. The extreme-value sampling design has sampling probability P (S = 1 | Y 1 , Y 2 , G) = π (Y 1 ), where S is the sampling indicator and the exact functional form of π is assumed unknown. Under this sampling design, a simple retrospective analysis based on p(G | Y 1 , Y 2 ) yields estimates of the effects of genotype G on phenotypes Y 1 and Y 2 . However, since the sampling design depends on Y 1 only, the simple retrospective likelihood approach is inefficient. More efficient analysis uses the retrospective likelihood based on One parametric approach to this problem models where ∼ N (0, σ 2 ) and e ∼ N (0, ω 2 ). Note that we include the interaction term in the first model to make it more flexible. Under the normal model, it can be shown that both β 0 and σ 2 are identifiable if β 12 = 0 or β 2 = 0. However, when β 12 = β 2 = 0, which suggests that neither β 0 nor σ 2 is identifiable, where p * is an unrestricted distribution for G. This means that β 0 and σ 2 are weakly identifiable parameters. The foregoing discussion indicates that the more restrictive normal model assumption may not help to remove the weakly identifiable parameters in the extreme-value sampling design. For robustness consideration and ease of presentation, we switch to the semiparametric odds ratio models in the following. Suppose that the parametric odds ratio function (13) holds. Under the normal model for Y 1 , (θ 1 , θ 2 , θ 12 ) = (β 1 , β 2 , β 12 )/σ 2 . Suppose that we model Y 2 by a semiparametric odds ratio model with Under the semiparametric odds ratio models (10) and (11), the retrospective likelihood based on p(Y 2 , G | Y 1 ) is equivalent to the prospective likelihood based on p(Y 1 , G | Y 2 , S = 1), which has the form η(Y 1 ; G; Y 2 )q 1 (Y 1 )q 2 (G) G η(Y 1 ; G; Y 2 )q 1 (Y 1 )q 2 (G)dY 1 , with q 1 (Y 1 ) and q 2 (G) unspecified, and The odds ratio functions are approximation when the semiparametric odds ratio models for Y 1 and Y 2 are correctly specified. If the normal model for Y 1 is correctly specified, the odds ratio functions are exact.

Simulation Study
In the following, we conduct a simulation study to assess the performance of the proposed approach in an extremevalue sampling design. We simulated a minor allele with frequency 0.3 in the population. The simulated genotype G follows the Hardy-Weinberg equilibrium. That is, P (G = 2) = p 2 , P (G = 1) = 2p(1 − p), and P (G = 0) = (1 − p) 2 . A quantitative trait Y 2 is simulated to follow a normal distribution with mean α 0 + α 1 G and variance ω 2 = 1. Another quantitative trait Y 1 is simulated conditional on Y 2 and G to follow a normal model with mean β 0 + β 1 G + β 2 Y 2 + β 12 GY 2 and unit variance. The extreme-value sampling design is employed to select subjects whose quantitative trait Y 1 is outside mean plus-minus 2 standard deviations. A range of parameter values for α 1 and β = (β 1 , β 2 , β 12 ) are used in the simulation. We set α 0 and β 0 to zero in generating the data. In the analysis of the data, a linear odds ratio model is assumed for the G-Y 2 association. However, neither the sampling probability is assumed known, nor the Hardy-Weinberg equilibrium is assumed to hold in the analysis. Three methods of analysis are employed. The first fits a simple linear regression model with G-Y 2 interaction in the model.  Two estimators are obtained by this method: one is the regression parameter estimators (LMR) and the other is the estimator for the odds ratio parameter (LMO) defined as the regression coefficient over the residual variance. The second method fits a semiparametric odds ratio model (SOR) for P (Y 1 |G, Y 2 , S = 1) with the G-Y 2 interaction included. The third method fits a joint odds ratio model (JOR) for P (Y 1 , G|Y 2 , S = 1) with the odds ratio function specified in the previous subsection. When the interaction is strong, two additional estimators are obtained: one is the weighted least-square combination (WLS) of estimated odds ratio parameters, and the other is the unweighted least-square combination (LS).
The simulation results based on 1000 repetitions of a sample size 400 are given in Table 1 for the case with no G-Y 2 interaction and in Tables 2 and 3 for the case with strong G-Y 2 interaction. When there is no G-Y 2 interaction, the least-square combination is infeasible even if we did not assume we know there is no G-Y 2 interaction. When more extreme-valued samples are taken, for example, those beyond 3 standard deviations of the mean, our simulation results not shown here also indicate more efficiency is gained. It can be seen from Table 2 that both LMR and LMO may be subject to very large bias and their variance estimates may also be very different from the empirical variances. These observations suggest that inference based on the linear model under the extreme-value sampling design can be misleading. In comparison, the odds ratio parameter estimator based on SOR or JOR has small bias and the variance estimates are close to the empirical variances. The efficiency of the odds ratio parameter estimator based on JOR is higher than  that based on SOR. When the G-Y 2 interaction is very strong, the least-square combination approach can be applied. Even though theory predicts that the weighted least square is more efficient, our simulation results cannot confirm it. For both WLS and LS, it appears that the estimated variances are substantially different from the empirical variance. This is due to the intrinsic difficulty in estimating the weakly identifiable parameters. Table 3 shows the estimates of the weakly identifiable parameters when G-Y 2 interaction is strong. The estimates appear to be reasonably well. When the interaction effect is not strong, we found that the feasibility of the least-square combination becomes an issue in our simulation. This most likely ties to the weak identifiability Table 3. Estimation of the difficult-to-estimate parameters (α 1 − θ 12 μ, τ 2 ) by the least-square combination when there is gene-environment interaction (θ 12 = 0.5) of the parameters. Overall, the simulation results suggest that the JOR estimator without the least-square combination for the odds ratio parameters is favorable. It is more efficient and stable, and relatively easy to obtain. The least-square combination is of value only when the actual G-Y 2 interaction is very strong.

Analysis of the High-Density Lipoprotein Study
High-density lipoprotein (HDL) is one of the five groups of lipoproteins that enable cholesterol to be transported in the water-based blood stream. It is measured in blood test by the HDL-C level, which is the amount of cholesterol contained in the HDL particles. HDL particles can remove cholesterol from within the artery atheroma and transport it back to liver for excretion or reutilization. As a result, high HDL-C level is associated with lower risk of cardiovascular diseases. The Upenn HDL cholesterol study is a cross-sectional study of the genetic factors associated with elevated HDL-C levels. Subjects of European ancestry with HDL-C level less than 30 percentile or greater than 90 percentile in their age and sex group are sampled as cases and controls, respectively. The study genotyped 625 cases and 606 controls using the IBC 50 K SNP array. This genome-wide association study had been analyzed previously as a case-control study in Edmondson et al. (2011) and as an illustrative example in Chen and Li (2011) for extreme-value sampling design.
In addition to the HDL, other relevant phenotypes such as the low-density lipoprotein (LDL), the total cholesterol (TC), and the triglycerides (TR) levels are measured. The correlation between HDL and LDL, TR, and TC estimated from the biased sample are, respectively, 0.22, −0.46, and 0.61. In this analysis, we investigate the effects of single-nucleotide polymorphism (SNP) genotypes on these phenotypes. As suggested by the theoretical discussion and the simulation studies, the marginal effects of G on individually phenotype is unidentifiable or requires unknown weakly identifiable parameter for identification. We analyze the conditional associations here. Two sets of analysis are performed for comparison. The first fits a semiparametric odds ratio model using the inefficient retrospective likelihood approach based on p(G | HDL, LDL, TC, TR), which models the odds ratio function as log η{G; (HDL, LDL, TR, TC)} = θ 1 G * HDL + ψ 11 G * LDL + ψ 12 G * TR + ψ 13 G * TC.
The study genotyped more than 44, 000 SNPs. There are 28, 202 left after SNPs with minor allele frequency less than 5% are excluded. The two methods are applied to each SNP. The estimates are displayed in Table 4 if either test for nonzero association between a genotype and HDL adjusted for the measurements on LDL, TR, and TC is significant at the family error level 0.05 after Bonferroni adjustment for the multiple tests performed with either method. We choose this criterion because other traits are less likely to be detected as significant. All the detected SNPs are within CETP, a gene located on chromosome 16 that is well known to be associated with HDL. The result also shows that, overall, the complex OR model tends to be more efficient than the simple OR model. But it is not uniform across all SNP genotypes. For some genotypes, the simple OR model yields stronger or equal evidence against the hypothesis of no association. The evidence of genetic association for traits other than HDL are weak. However, the direction of the association is fairly consistent in that those identified SNPs have either positive associations with HDL, LDL and TC, and negative association with TR, or negative associations with HDL, LDL, and TC, and positive association with TR. The implication of the result needs to be examined further.

DISCUSSION
We have established the asymptotic properties for the likelihood inference on the odds ratio parameters in the semiparametric odds ratio model with possible misspecification of the odds ratio functions. The results can be applied to handling the weakly identifiable parameter in the analysis of biased sampling designs. We emphasize that the issue of weakly identifiability is common in nonmultiplicative models, such as probit model for case-control design or conditionally specified models under biased sampling designs, and is not a consequence of using semiparametric odds ratio model. The applications of the proposed approximation approach include among others, efficiency gain by exploiting gene-environment independence, secondary trait analysis, and the analysis of family-based designs, in addition to the extreme-value sampling designs discussed in this article.

SUPPLEMENTARY MATERIALS
A supplemental appendix for the proof of Theorem 2 is available online. [Received June 2013. Revised April 2014