A test for detecting etiologic heterogeneity in epidemiological studies

Current statistical methods for analyzing epidemiological data with disease subtype information allow us to acquire knowledge not only for risk factor-disease subtype association but also, on a more profound account, heterogeneity in these associations by multiple disease characteristics (so-called etiologic heterogeneity of the disease). Current interest, particularly in cancer epidemiology, lies in obtaining a valid p-value for testing the hypothesis whether a particular cancer is etiologically heterogeneous. We consider the two-stage logistic regression model along with pseudo-conditional likelihood estimation method and design a testing strategy based on Rao's score test. An extensive Monte Carlo simulation study is carried out, false discovery rate and statistical power of the suggested test are investigated. Simulation results indicate that applying the proposed testing strategy, even a small degree of true etiologic heterogeneity can be recovered with a large statistical power from the sampled data. The strategy is then applied on a breast cancer data set to illustrate its use in practice where there are multiple risk factors and multiple disease characteristics of simultaneous concern.


Introduction
One central goal of epidemiological studies of cancer is to study the etiologic heterogeneity in its subtypes. Etiologic heterogeneity in the subtypes is explained as follows. Consider breast cancer which is characterized by its histological type (ductal/lobular/tubular/mixed carcinoma), tumor size ( ≤ 2 cm, 2 cm), tumor grade (1/2/3), nodal status (+/−), estrogen receptor (ER) status (+/−), and progesterone receptor (PR) status (+/−). Subtypes are the breast cancer classifications according to these characteristics. The subtypes are etiologically heterogeneous, if the effect of exposures are different for different subtypes. Etiologic heterogeneity of breast cancer, in particular, has been under investigation, see [11][12][13]18,19,29,30] for recent findings. Such studies have also long been concerned with other types of cancer including ovarian cancer, colorectal cancer, and non-Hodgkins lymphoma as in [20,22,23,28]. In epidemiological case-control studies, statistical approaches to study etiologic heterogeneity of the disease, where reference is the disease-free situation represented by the study controls, commonly consist of (i) logistic regression analysis for each disease characteristic, (ii) polychotomous logistic regression analysis for a categorical variable whose levels are the disease subtypes. In the first approach, etiological heterogeneity in the disease subtypes is studied by means of heterogeneity in disease's defining characteristics in terms of their relation with the risk factors. This approach estimates the odds ratios for each disease characteristic separately ignoring the relation between them. This in fact hampers direct inference on etiological heterogeneity with respect to disease subtypes. On the other hand, the latter approach provides direct inference on etiological heterogeneity with respect to disease subtypes as desired. However, it suffers from high-dimension problem for diseases with large number of subtypes, for example, number of breast cancer subtypes constructed by grouping the disease according to the characteristics described in the first paragraph is 4 × 2 × 3 × 2 × 2 × 2 = 192 leading to a polychotomous logistic regression with 384 parameters (including the intercepts) for a single explanatory variable in the model. Obviously, the dimension of the parameter space will increase substantially with the inclusion of all risk factors of interest. In such situations, not only does the analysis suffer from computational burden, but it also results in decreased statistical power and inflated bias. The bias increases with increasing p/n ratio, where p is the total number of regression coefficients and n is the number of study patients [5]. Cordeiro and McCullagh [7] show for logistic models that the bias is equal to pβ/n, where β is the vector of unknown regression coefficients. Also, this approach appears to be inefficient when odds ratios by disease characteristics is desired in addition to the odds ratios by disease subtypes. Challenges encountered in the statistical analysis of epidemiological data concerned with cancer subtypes are reviewed by Troester and Swift-Scanlan [26]. Power analysis of a subtype based approach for etiologic heterogeneity is given in [1].
Chatterjee [6] developed a two-stage logistic regression (TS-LR) addressing the issues described above. This model tames the high-dimensionality issue of approach (ii) (e.g. for the example above, number of association coefficients is 192 in approach (ii) versus 10 if the twostage approach was used), readily estimates etiologic heterogeneity in each disease characteristic adjusted for the others, and provides inference on etiologic heterogeneity in terms of disease subtypes. The method has been widely used to understand etiologic heterogeneity of colorectal cancer [3,[15][16][17]21] and breast cancer [10,24,27]. As the emphasis on etiologic heterogeneity increases in cancer epidemiology studies, the need for formal statistical tests for it emerges. Testing etiological heterogeneity can be accomplished using the TS-LR model given its advantages summarized above and detailed in [6]. In the current article, a strategy based on score test (ST) statistic is developed to test etiological heterogeneity with respect to the disease defining characteristics. The main advantages of Rao's ST are that it is invariant to different formulations of the null hypothesis and requires maximum likelihood estimation on a lower dimension parameter space [2,4]. Statistically significant etiologic heterogeneity in terms of the disease defining characteristics indicates etiologic heterogeneity between the disease subtypes with respect to the corresponding characteristics. Detecting etiologic heterogeneity in disease subtypes ultimately leads to a better understanding of risk profiles by subtypes (1) and improves effective treatment options.
In this article, we investigate the statistical properties of the ST in TS-LR and illustrate its use. We hope that the results of our simulation study ultimately guide the researcher in determining the sufficient number of study cases from each subtype and use a powerful test for detecting etiological heterogeneity. Next section reviews the TS-LR model while articulating additional interpretations and lays out the testing strategy devised. Section 3 gives the results of the simulation experiment carried out to assess the performance of the testing strategy in terms of its type I error and power. Section 4 employs the devised strategy to test for etiologic heterogeneity in breast cancer for Turkish patients and also shows how it is used in the presence of multiple covariates. A brief discussion is given in the last section.

Review of TS-LR
The TS-LR model is a hierarchical model consisting of the following two stages: where i = 1, 2, . . . , N indexing the study subjects and m = 1, 2, . . . , M indexing the disease subtypes, N and M are the number of study subjects and disease subtypes, respectively, α m (m = 1, 2, . . . , M ) determine the baseline prevalence of the different disease subtypes, and β m denotes the P × 1 vector of regression coefficients associated with P covariates X i . Below, we expound on the model basics and advise the reader to refer to [6], for further elaborations. For the time being, assume a single covariate for ease in illustrating the idea. First stage is the response model in which multi-level categorical disease subtype variable, Y i , is modeled using a standard logistic regression given the covariate X i . In model (1), Y i = m means that disease subtype of subject i is the one coded by m, while m = 0 refers to the disease-free category. Second stage consists of models for the first-stage regression coefficients (i.e. for the β coefficients). In the current article, we concentrate on the diseases in which etiologic heterogeneity with respect to one characteristic does not depend on the others and thus β m (m = 1, . . . , M ) is modeled as in model (2). Otherwise, interaction terms are included in this model but this is a concern of another paper. Each β m can also be denoted by β i 1 i 2 ···i K , where K is the number of categorical disease defining characteristics and each of i k , k = 1, 2, . . . , K indexes the specific level of each characteristic. Model (2) is a design-like deterministic model consisting of a linear combination of contrast parameters denoted by θs. Each θ (1) . parameter in the model is a first-order contrast parameter. Here, θ (1) k 1 (i k 1 ) is the logOR associated with having i k 1 th level of the characteristic k 1 relative to the reference level of the same characteristic for a one unit change in the particular covariate. To illustrate simply, suppose a cancer described by the following two characteristics; tumor size (small(0)/medium(1)/large(2)) and nodal status (yes(1)/no(0)). Cross-classifying the levels of the two characteristics results in a six-level subtype with levels being (small,yes), (small,no), (medium,yes), (medium,no), (large,yes), and (large,no) of which the reference level is (small,no). Assuming a single risk factor, first-stage slope parameters are β 00 , β 01 , β 10 , β 11 , β 20 , and β 21 . Reparameterization of those in terms of θ s are given in Table 1. The contrast arrays ) T are related with tumor size and nodal status, respectively. The reference level contrasts are set at 0 for estimability (i.e. θ (1) 1(1) = θ (1) 2(1) = 0). Each nonreference contrast parameter represents the association between the covariate and the odds of a certain disease characteristic being at a certain level in relation to its reference level. For instance, θ (1) 1 (2) represents the association between the covariate and the odds of tumor being medium relative to tumor being small. Also, under the estimability condition, θ (0) is the coefficient specific to the reference disease subtype and represents the logOR associated with probability of the disease at the reference level relative to absence of the disease. In the case of multiple covariates, above ideas are replicated for each covariate with the proper indexing. Proper notations for multi covariate situations are given in [6]. The TS-LR model is especially beneficial when logOR estimates by cancer characteristics is desired. The θ parameters readily deliver logOR estimates by disease characteristics, whereas one would have to go through extra calculations to obtain them when a standard polychotomous logistic regression (approach ii) is used.

ST for etiologic heterogeneity in TS-LR
Source of etiologic heterogeneity in disease subtypes in terms of their associations with the risk factors is the etiologic heterogeneity in each disease characteristic. Etiologic heterogeneity in terms of the disease defining characteristics indicates etiologic heterogeneity between the disease subtypes with respect to the corresponding characteristics. Therefore, determining etiologic heterogeneity of the disease boils down to testing the difference between the related θ parameters for each disease characteristic. For instance, admitting to the illustrative example above, assuming a single covariate for the time being, the null hypothesis for testing etiological heterogeneity in tumor size is H 0 : . The generalization is that for the disease characteristic k with m k levels, testing its etiological heterogeneity is accomplished by setting We derive a ST to test this hypothesis under the models (1) and (2). For a recent informative account of Rao's ST refer to [2]. Result of the test sheds light on risk factors with significant differential effects with respect to disease characteristic levels.
For parameter estimation, pseudo-conditional likelihood (PCL) method, the basics of which are recapitulated here, is used. Please refer to [6] for further methodological and theoretical details. This methodology uses a likelihood function where the building blocks are conditional probabilities. PCL is free of the intercept parameters, namely α m 's and depends only on the regression coefficients β m 's. In the resulting PCL, β m 's are replaced by the forms in model (2) and score equations for θ parameters are obtained. Resulting maximum likelihood estimators of θ parameters were proved to be asymptotically normal.
ST statistics, denoted by T s , is given in Equation (3).
Following [2,25], one can easily confirm that asymptotic distribution of T s in the TS-LR model is χ 2 with degrees of freedom m k − 1. This implies that when subtype frequencies are sufficiently large, one can use suitable χ 2 quantiles for the test. Then two customary crucial questions follow: How large is sufficiently large? How should the test proceed if subtype frequencies are not large enough? Our Monte Carlo-based investigation sheds light on these questions. The results of the simulation study infer a condition for using χ 2 (the asymptotic distribution) for performing the test and the condition is based on minimum of the expected subtype frequencies, that is, minimum of the subtype frequencies that would have been expected if the disease characteristic of interest was non-heterogeneous with respect to its association with the particular risk factor of interest. For study designs in which the condition for asymptotic test is not satisfied, a permutation-based testing procedure is recommended. We derive the permutation algorithm and provide an outline for it here using the following example. For a 3 × 2 × 2 disease where there are 3 characteristics with number of levels being 3, 2, and 2, respectively, binary dummy variables for each characteristic are constructed. To test for etiological heterogeneity of the disease characteristic -1, for example, rows of its corresponding dummy variables are permuted (reshuffled) holding the rows of the other characteristics and the covariates untouched. A new column holding the disease subtype information is then constructed based on the resulting reshuffled levels. We used 10,000 such permuted data sets. The idea behind the permutation method is that the finite sample distribution of a test statistic under the null hypothesis can be realized by randomly shuffling the data several times obeying the condition presented in the null hypothesis and calculating the test statistic for each data set obtained as such (see [9,14]). The basis for our choice of resampling procedure over Bootstrap is that the rationale for permutation method appears to be clearer for hypothesis testing in linear models. For performing a level-α ST, critical points are set at (1 − α)th sample quantile of the resulting empirical distribution of the ST statistic. In case of multiple covariates, the method above is repeated in a similar manner for each covariate.

Simulation study
Our purpose is to get a rough idea about the minimum expected subtype frequencies necessary for asymptotic ST and power of permutation-based ST when expected subtype frequency does not meet the minimum condition. We try to address those via a simple Monte Carlo simulation study. To study type I error and power, two representative sample sizes are considered, N = 500 and N = 1000 with equal number of cases and controls. Two separate scenarios are considered for disease subtypes. First one concerns with a disease with 3 categorical characteristics each having 2 levels, resulting in 8 subtypes (2 × 2 × 2). The second scenario concerns with a disease with 3 categorical characteristics having 4, 2, and 2 levels respectively, resulting in 16 subtypes (4 × 2 × 2). For convenience, a single covariate is considered and covariate data are generated from the standard normal distribution. To generate the responses (i.e. disease subtypes), true values for θ parameters are set as explained in the following two paragraphs and then true β parameters are obtained using model (2). Then, given βs and X i s, Y i s (i.e. disease subtypes including disease-free case) are generated from a multinomial distribution with cell probabilities given in Equation (1). Hypothesis of interest is etiologic heterogeneity of a particular disease characteristic, for example, the first one, and denoted by H 0 : θ (1) 1(2) = 0 and H 0 : θ (1) 1(2) = θ (1) 1(3) = θ (1) 1(4) for the first and second scenarios, respectively. The aim of the simulation study is to investigate the statistical properties of the asymptotic-and permutation-based STs for testing this type of hypotheses. Nominal significance level is set at 0.05.
We designed the following two-part Monte Carlo experiment to profile empirical false positive rates (type I errors) of asymptotic-and permutation-based ST for different minimum expected subtype frequencies. The aim of the first part is to determine true values for θ parameters so that they lead to a certain minimum expected average subtype frequency under H 0 of interest. Then in the second part, data are generated as explained in the previous paragraph using the θ values predetermined by the first part so that we have a control over the minimum expected subtype frequencies in our experiment. In the first part, the sample size N and number of simulated data sets J are set at large numbers namely 10,000 and 20,000, respectively. Letp mi denote the probabilities for each disease subtype for subject i in the jth simulated data set which is estimated using models (1) and (2) under the null hypothesis. Then, for each simulated data set, the probabilities associated with each subtype are averaged over the subjects to obtain an average probability of having a certain disease subtype. For the jth simulated data set, the average probability of being in subtype-m isp m· . In order to obtain the average expected subtype frequencies, resulting values are multiplied by the number of the cases n case as n case ×p m· . Minimum of these frequencies are recorded. This work provided an idea about the θ values and the minimum subtype frequency they eventually lead to in the long run. These particular θ values are then used in the data generation process to study empirical type I error rates and their relationship with minimum subtype frequencies for the scenarios considered.
For the power study, true θ parameters are chosen to satisfy H 1 . The first column of Table 4 shows the alternative values for θ (1) 2(2) that correspond to varying degrees of etiologic heterogeneity most of which are local alternatives. For the case of 4 × 2 × 2, the first two columns of Table 5 imply a different alternative hypothesis each time. Here, 1 and 2 , respectively, are θ (1)  Results: The aim of the study in Section 3.1. is to investigate the finite sample properties of the tests when there are subtype categories with particularly low expected frequencies. Tables 2 and 3 reported the empirical type I error rates for the cases 2 × 2 × 2 and 4 × 2 × 2, respectively. In these tables, the proportion of the minimum expected subtype frequencies vary between about 1% and 5% of the cases. The results show that, when the minimum subtype frequency is as low as 1 − 5% of the cases, asymptotic ST is liberal in terms of empirical type I error rate and fails to maintain the nominal significance level. Practical implication of this finding is that asymptotic ST should be used when the proportion of study patients in each disease subtype expected under the null hypothesis is larger than the ones viewed here. Otherwise, the analyst is likely to result in false positive decision about etiologic heterogeneity. On the other hand, the same tables show that permutation-based ST maintains the nominal significance level better. The results for 4 × 2 × 2  case with n = 500 is disregarded as they did not add any further information. These results altogether imply that analyst should prefer the permutation-based test when minimum subtype frequency does not meet the guidelines produced here for the asymptotic χ 2 (1) to hold. Power comparisons are given in Tables 4 and 5. From Table 4, when there is a subtype category with expected frequency as low as 5% of the cases (that is the block with 25) in a moderate situation such as 2 × 2 × 2, asymptotic ST offers slightly larger power than the permutationbased ST, but at the cost of higher type I error rate. The second block of the table shows that when there is a subtype category with expected frequency as low as 2% of the cases, the preferred permutation-based ST achieves 92% power to detect even a small effect size such as 0.4 towards etiologic heterogeneity (i.e. towards 0). From Table 5, when there is a subtype category with expected frequency as low as 1% or 2% of the cases, the permutation-based ST achieves about at least 80% power to detect an effect of size at least (0.2,0.2) towards etiologic heterogeneity.

Illustrative example
The testing procedure is applied to understand the heterogeneity in risk factor-breast cancer subtype associations in the Turkish female breast cancer patients. The data set is obtained from a case-control study in Ankara Oncology Research and Education Hospital. There are 500 females in the study, of whom 249 have breast cancer. Reader should refer to [8] for further study details. Major tumor characteristics considered for our illustration are tumor size (extended to chest wall or skin, > 50, > 20 and ≤ 50, ≤ 20 mm), tumor type (invasive ductal, invasive lobular, and tubular), NA status (metastatis in axillary nodes or not), ER status (+, −), PR status (+, −), and Her2/neu receptor status ( + , − ) where last levels are the reference levels. Cross-classifying these levels correspond to 4 × 3 × 2 × 2 × 2 × 2 = 192 breast cancer subtypes. Covariates in this study consist of the risk and the adjusting factors. Risk factors, namely age at menopause, age at first menstruation, number of births, age at first birth, age at last birth, duration of breast feeding, and duration of smoking are ascertained and indexed by p = 1, . . . , 7. Adjusting factors typical to breast cancer studies, namely age and body mass index, are included in the model and indexed by p = 8, 9. A TS-LR model and maximum PCL estimation is used. Second stage main contrast parameters for each of these covariates are estimated and the ones of the main concern are given in Tables 6 and 7.
Developed testing strategy is employed to investigate the etiologic heterogeneity of the disease subtypes by means of heterogeneity in disease characteristic-risk factor association. Null Table 6. Estimates (standard errors) of the second stage parameters.
Covariate  Permutation-based ST is applied as it is proved to be more guarded against false discovery compared to its asymptotic counterpart. The reshuffling procedure is repeated 10,000 times and at each time, a ST statistic corresponding to each risk factor and each hypothesis is calculated and stored. These values are then used to obtain the required empirical quantile of distribution of the ST statistic under the null hypothesis for each risk factor. The observed test statistics and the permutation-based critical points are presented in Tables 8 and 9 herein and Tables 1-4  supplemental file, each corresponding to a particular disease characteristic as seen. The results reveal that at the 5% significance level, there is statistically significant evidence in the data set that PR status and Her2/neu receptor status are etiologically heterogeneous with respect to their relationship with smoking and breastfeeding durations, respectively (see supplemental material). That means, smoking has a different effect on PR positive and PR negative breast cancer. Similarly, duration of breastfeeding-breast cancer association is significantly different for Her2/neu receptor positive and negative cancers. Looking back at Table 7, estimate ( − 0.0488) given in (duration of smoking, θ (1) 5(2) ) suggests that the duration of smoking is more strongly associated with PR negative-type breast cancers. Also estimate (0.0138) given in (duration of breastfeeding, θ (1) 6(2) ) suggests that the duration of breastfeeding is more strongly associated with Her2/neu receptor positive-type breast cancers.

Discussion
The suggested testing strategy consists of two stages: Stage 1: Estimate subtype frequencies that would have been expected if the null hypothesis (etiologically non-heterogeneous) was true. The decision as to whether use asymptotic-or permutation-based test is then made given the minimum of the expected frequencies. The rule of thumb to which we are led following the results of our simulation study is that avoid using asymptotic test for etiologic heterogeneity and prefer permutation-based test instead if minimum expected subtype frequency is less than about 5% of the study cases.
Stage 2: Apply the test chosen at stage 1. Use χ 2 tables with appropriate degrees of freedom for asymptotic test. For permutation test, carry out the permutations as outlined to obtain exact distribution of the ST statistic under the null hypothesis; use the appropriate empirical quantile of this distribution to finalize the test.
We considered single covariate in the simulations for clear interpretation of the results. Inclusion of multiple covariates in the model surely would have changed parameter and information matrix estimates. However, there is no reason to believe that it would have altered the conclusion derived from the simulation results regarding its power properties should we had multiple covariates. Because each hypothesis is concerned with etiological heterogeneity related with each covariate separately. Nevertheless, applications with multiple covariate case is addressed in the application section.
The sample sizes and the number of subtypes utilized may be considered as a semblance of limitation. However, it does not curb the intuition obtained from this study as to whether to proceed with asymptotic-or permutation-based test, nor does it preclude generalizing the results and what this paper aims to contribute to the epidemiological data analysis.