Outlier Detection and False Discovery Rates for Whole-Genome DNA Matching

We define a statistic, called the matching statistic, for locating regions of the genome that exhibit excess similarity among cases when compared to controls. Such regions are reasonable candidates for harboring disease genes. We find the asymptotic distribution of the statistic while accounting for correlations among sampled individuals. We then use the Benjamini and Hochberg false discovery rate (FDR) method for multiple hypothesis testing to find regions of excess sharing. The p values for each region involve estimated nuisance parameters. Under appropriate conditions, we show that the FDR method based on p values and with estimated nuisance parameters asymptotically preserves the FDR property. Finally, we apply the method to a pilot study on schizophrenia.


INTRODUCTION
During the past decade, scientists have had phenomenal success in discovering the genes responsible for simple genetic disorders. The success is due in part to the fact that these disorders are generally caused by one or at most a small number of defective genes and, as in Mendel's peas, these genes act in a manner that is straightforward to model. In contrast, complex disorders are those for which there is clearly a genetic basis, but the inheritance pattern is not apparent. For complex disorders, certain alleles (particular versions of a gene) enhance the risk of contracting the disorder but are neither necessary nor suf cient to cause the disorder. An allele associated with increased risk of disease is called a liability allele. Furthermore, there may be many genes at various locations (loci) that have liability alleles.
To discover the liability loci that affect the risk of human diseases, scientists exploit the fact that the DNA in a region bracketing a liability allele will tend to be passed down along with the liability allele itself from generation to generation. Through the process of recombination, each pair of chromosomes usually breaks into a few pieces, and the genetic material is exchanged between the pair. This process causes the length of the chromosomal segment shared among affected individuals to diminish, helping localize the position of the liability locus within a particular chromosome. Genetic linkage analysis looks for an unusually large amount of sharing of a particular chromosomal segment among the affected members of a family.
Although linkage analysis has been a powerful tool for the discovery of simple genetic disorders, it has not experienced a similar level of success for complex disorders, presumably because the power is insuf cient (Risch and Merikangas 1996). To gain more power, one can exploit the fact that within a region bracketing a liability allele, the genetic material can be conserved for hundreds of generations. Analysis that looks for unusual sharing of chromosomal segments among affected members of a population, rather than among Jung-Ying Tzeng is a doctoral candidate, Kathryn Roeder is Professor, and Larry Wasserman is Professor, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213 (E-mail: larry@stat.cmu.edu affected individuals within an extended family (pedigree), is a form of association analysis (e.g., McPeek and Strahs 1999).
Typically, geneticists do not initially sequence chromosomal segments. Rather, they measure alleles at particular loci known as genetic markers at regular intervals, either over the entire genome or at targeted regions of particular interest. For the sampled set of genetic markers all lying within a single chromosomal segment, the ordered string of alleles de nes a haplotype. An excess of a certain form of haplotype in affected individuals versus unaffected individuals is consistent with the presence of a liability allele in the region de ned by the haplotype. The correlation between certain alleles in a region bracketing a liability allele is called linkage disequilibrium. In general, the stronger the linkage disequilibrium, the easier it is to discover the approximate location of the liability allele.
Not all affected individuals share the same haplotype in a region bracketing a liability locus for various reasons, including (a) many individuals will have the disorder for other reasons, either genetic or environmental; (b) even among those individuals whose liability traces in part to a common locus, not all will have inherited this liability allele from a common ancestor; and (c) even for those who inherited the liability allele from a common ancestor, recombinations may have occurred within the haplotype since the introduction of the liability allele into the population. For these reasons, only modest differences in haplotype frequencies are expected between the affected individuals (cases) and the unaffected individuals (controls) even if there were a liability allele in the region under investigation.
Compounding the statistical challenge, it is often unreasonable to assume that the sampled haplotypes are independent. Individuals with common ancestry are more likely to share haplotypes throughout the genome than would be predicted due to chance. In genetic studies, it is expected that some individuals who share a genetic disorder also share a common ancestor, but this common ancestry is far enough in the past that it is often unknown. Furthermore, population substructure also induces correlation among individuals from the same ethnic group. Overall, the data can have a complex correlation structure that is dif cult to model directly.
In this article we propose a matching statistic to measure the difference between the haplotype distribution of cases and controls. In our derivation of the distribution of the matching statistic, we incorporate the correlation among haplotypes in a simple way. In particular, we show that the distribution of the matching statistic is well approximated by its distribution, assuming that the haplotypes are independent and identically distributed, multiplied by a constant that re ects the perturbation due to correlation. Because the correlation structure is not directly estimable based on a sample of haplotypes obtained from a single region of the genome, a key step in the development of the matching statistic involves demonstrating that the correlation induces an effect that is constant across the genome. Being constant, this factor is estimable, provided that multiple regions of the genome have been sampled.
In genetic studies, haplotypes are typically obtained for many regions, K, across the genome. Within each region, a test for association is performed. There are many methods for deciding which hypotheses to reject while maintaining control over the probability of false positives at level . For example, the well-known Bonferroni method rejects a hypothesis if the p value P < =K. This guarantees that the familywise error rate (FWE)-the probability of at least one false rejection-will be no larger than . When K is large relative to the sample size, the Bonferroni method (and other FWE controlling methods) have power tending toward 0. Benjamini and Hochberg (1995) argued that when testing many hypotheses, protecting against a single false rejection is too stringent. Instead, they suggest controlling the false discovery rate (FDR), which is the fraction of false positives. For wholegenome matching, in which the number of hypotheses can be potentially very large, we nd their argument compelling, and so this is the approach that we take here. This research was motivated by an ongoing study of schizophrenia on Palau, a remote island in Micronesia (Devlin, Roeder, Otto, Tiobech, and Byerley 2001a). Schizophrenia is a complex disease that appears to have a substantial genetic basis. A noteworthy feature of the Palauan population is that it exhibits an elevated rate of schizophrenia relative to the worldwide rate. Palau has a unique history that makes it potentially amenable to gene discovery via an association study. Linguistic analyses and ethnographic studies suggest that the Palauan population developed in relative isolation, even from other Micronesian populations; nonetheless, this population shows some evidence of immigration from surrounding populations (Devlin et al. 2001a and references therein). Being settled about 2,000 years ago, presumably by Asian islanders, the population is both young and small in number, currently numbering 21,000. Epidemics of contagious disease originating from American and European contact reduced the population to a low of 4,000 about 100 years ago. These reductions enhanced the linkage disequilibrium and presumably increased the population's suitability for an association study. In this article we illustrate the matching statistic for detecting association in a genome scan by analyzing a pilot sample of patients and controls obtained from Palau.
The article is organized as follows. Section 2 motivates the matching statistic and derives its distribution. Section 3 proves the validity of the FDR procedure for detecting outliers using the matching statistic. Section 4 describes a simulation study, and Section 5 gives the results of the Palauan data analysis. Finally, Section 6 presents discussion and conclusions.

THE MATCHING STATISTIC: QUANTIFYING HAPLOTYPE SHARING
In this section we develop a test statistic for association, initially by ignoring correlations due to relatedness among sampled individuals (Sec. 2.1) and then taking the correlation into account (Sec. 2.2). We assume that the test statistic will be computed at each of K regions of interest across the genome.

Independent Samples
Consider K regions of interest with n case and m control haplotypes sampled from each region. Each individual contributes two haplotypes to the sample. Let H i4k5 denote the ith sampled haplotype in region k. Assume that there are R k distinct haplotypes in segment k. Let al4k5 D Pr4H i4k5 D l5 for a haplotype sampled from an affected individual, i D 11 : : : 1 n and l D 11 0 0 0 1 R k , and ul4k5 D Pr4H i4k5 D l5 for a haplotype sampled from an unaffected individual, i D 11 : : : 1 m and l D 11 0 0 0 1 R k . The original data consist of two matrices, one matrix for the cases of dimension n K and one for the controls of dimension m K. The 4i1 k5 entry of the case matrix is the form of the ith haplotype at region k. The 4i1 k5 entry of the control matrix is arranged similarly. Within each column of each matrix, assume that the haplotypes are a sample from a multinomial distribution.
An omnibus chi-squared test with R k ƒ 1 degrees of freedom between column k of the cases and column k of the controls offers one possible test to determine whether the haplotype distribution differs across cases and controls at locus k. In this report we investigate a statistic that tests for association using only 1 degree of freedom. A 1-degree-of-freedom test is of interest for three reasons: (1) it has the potential to exhibit greater power, at least in some portions of the parameter space; (2) it is likely to achieve its asymptotic distribution with a smaller sample size; and (3) it permits a natural extension that incorporates correlation among haplotypes. We develop these points throughout the remainder of the article and summarize them in Section 6.
To measure the degree of matching in region k, suppose that we draw two case haplotypes at random. The chance likelihood that they will have the same version of the haplotype is P l 2 al4k5 . One minus this quantity is called the heterozygosity index. The heterozygosity is maximized when al4k5 D 1=R k , l D 11 : : : 1 R k and is minimized when al4k5 D 1 for some l. A mutation leading to increased risk of disease occurring in the population will most likely be embedded within a relatively common haplotype. If a cluster of the cases traces back to this common ancestor, then the cases will have diminished heterozygosity relative to the controls. The degree of matching at locus k in the cases versus the controls can be measured by the difference in the heterozygosity indexes, This measure tends to be large if substantial clusters of case haplotypes derive from one (or at most several) common ancestor(s), such as would be anticipated under the alternative hypothesis of association. In this article we develop a test statistic based on this measure of association, but note that this is just one of many possible measures of association. Methods similar to those presented here could be developed for other measures as well.
Let O al4k5 be the maximum likelihood estimator for al4k5 in the cases, and let O ul4k5 be the corresponding quantity for the controls. Thus O al4k5 is the observed proportion of haplotype l in the case sample at locus k, and O ul4k5 is the observed proportion of haplotype l at locus in control samples. Then T k is the maximum likelihood estimator of OE k , We call T k the unstandardized matching statistic for region k.
De ne ç a4k5 D 4 a14k5 1 : : : 1 aR k 4k5 5 and ç u4k5 D 4 u14k5 1 : : : 1 uR k 4k5 5, and let b ç a4k5 and b ç u4k5 denote the corresponding estimated quantities. The variance of T k can be computed directly by noting that The variance of T k is computed in Appendix A. We call this variance, ' 2 k , the multinomial variance, because it is computed assuming that the sample of haplotypes follows the multinomial distribution. Provided that ç a4k5 and ç u4k5 are not equal to 1 R k 411 : : : 1 15, T k is approximately distributed as N 4OE k 1' 2 k 5 and ' 2 k is estimable. Most regions will not harbor a liability allele. The level of matching is assumed to be a constant value OE across these "null" regions. Because the cases may differ somewhat in their ethnic origin from the controls, OE is not assumed to be 0. However, based on genetic theory, we anticipate ç a4k5 º ç u4k5 in null regions for any complex disease (Devlin, Roeder, and Wasserman 2001b) and hence OE is close to 0.
According to the association hypothesis, those regions harboring liability alleles are likely to exhibit in ated matching. The goal is to nd the regions where OE k > OE. We have now reduced the problem to the following. We have T k ¹ N 4OE k 1' 2 k 5 for k D 11 : : : 1 K. There is a real number OE and a subset S 811 : : : 1 K9 such that OE k D OE for k 2 S and OE k > OE for k y S. The goal is to identify S c .

Correlated Samples
In the previous section we formulated a matching statistic that measures the degree of sharing observed in each measured region of the genome. However, in so doing, we computed the multinomial variance ignoring the correlation between haplotypes due to relatedness among the individuals in our sample.
In reality, the variance of the matching statistic depends on this correlation. To address the correlation, we extend an approach pioneered by Devlin and Roeder (1999) to this setting. These authors demonstrated that when the data consist of 2 2 casecontrol tables computed for each of k D 11 : : : 1 K biallelic markers, the variance of O a14k5 ƒ O u14k5 equals the binomial variance times a constant multiplier ' 2 that accounts for the correlation among subjects in the study. This is a useful result because one can estimate ' 2 , provided that K is large.
Here we demonstrate that there exists a variance in ating/de ating factor, ', such that var4T k 5 is proportional to the multinomial variance, ' 2 k under certain assumptions, that is, var4T k 5 D ' 2 ' 2 k . Thus, following Devlin and Roeder, we also can obtain the true variance by estimating ' 2 and then scaling the multinomial variance ' 2 k by ' 2 . We rst provide motivation for our assumptions, then establish the result.
Recall that case (control) haplotypes are indexed as i D 11 : : : 1 n 4i D 11 : : : 1 m) for the sample of n=2 (m=2) individuals, ignoring the pairing of haplotypes within an individual. Haplotype i, obtained from an affected individual, can be encoded in a binary vector of length i4k5 5, consisting of R k ƒ 1 0s and a single 1; for example, type 2 is coded as 401 11 01 : : : 1 05, and type R k is coded as 401 01 : : : 1 01 15. Let X i4k5 denote the corresponding quantity for the ith control haplotype. To compute the variance under the null hypothesis of no extra matching at locus k, we assume that ç a4k5 0 i4k5 1 X 2 i4k5 1 : : : 1 X R k i4k5 5 identically, but not independently, follow a multinomial distribution with sample size 1 and probability vector ç 4k5 . For a xed observation i, let lh 4k5 denote the usual multinomial correlation between two forms of a haplotype (type l and type h), that is, Note that lh 4k5 is independent of i but depends on the region k, and is the same for cases and controls under the null hypothesis.
For haplotype form l, let f 4Y 5 ij denote the correlation between two case haplotypes, f 4X5 ij denote the correlation between two control haplotypes, and f 4XY 5 ij denote the correlation between a case and a control haplotype. We assume these do not depend on k and l, that is, Next, consider the pairwise correlation between different forms of haplotypes (l 6 D h) and different measured haplotypes (i 6 D j). Assume that the correlation coef cients can be expressed as Genetic theory supports these assumptions. The correlation f ij in (1) is assumed to be independent of k and l because it is determined by the ancestry of the chromosomal segments themselves and is not a function of the haplotype form l or the haplotype segment k. Indeed, based on genetic theory, the correlation between two individuals f ij is equal to the probability two chromosomal segments are inherited from a common ancestor, perhaps many generations in the past. Extant chromosomal segments deriving from a common ancestor are said to be identical by descent (ibd). The term "ibd" emphasizes that the two chromosomal segments match because they are from a common ancestor rather than matching due to chance. Based on this, we obtain From this, it follows that corr4Y l i4k5 1 Y l j4k5 5 D f ij . By the same manner, we obtain cov4Y l i4k5 1 Y h j4k5 5 D 4ƒ l4k5 h4k5 5f ij ; that is, and de ne ' 2 k to be the variance of T k obtained assuming that the n C m haplotypes are independent and identically distributed from a multinomial 4ç 4k5 5 distribution. If the correlation structure implied by (1) and (2) Proof. Because the analytical form of the exact variance of correlated T k is intractable, we study the relationship between the variance of the correlated sample and the variance of an independent and identically distributed multinomial sample via the delta method approximation. By the delta method, the multinomial variance is approximately equal to Similarly, the approximate var4T k 5, based on a correlated sample, is approximately equal to Plugging in the quantities given in (1) and (2), the result follows. So far, we have veri ed the variance factorization (4) and can approximate the distribution of the matching statistic Z k D 4T k ƒ OE5=4'' k 5 under the null hypothesis with a standard normal. With the analytic formula of ' 2 k , the only remaining requirements in standardization are estimates of OE and '. Note that ' 2 does not depend on k; that is, no matter which region we are considering, the correlation among sampled haplotypes affects the variance of T k multiplicatively in the same way. The most important consequence of this result is that ' can be estimated from the sample of T k 's.
A U statistic can be used to estimate '. For 1 µ k < g µ K, Because T k º N 4OE1 ' 2 ' 2 k 5, it follows that for pairs of null loci, U kg º N 401 ' 2 5. This suggests using the sample variance of the U kg as an estimate of ' 2 , which thus is a U statistic because the U kg depend on the pairs 4T k 1 T g 5. Alternatively, one can use a robust scale estimator applied to 8U kg 9, such as O ' D median4 -U kg -5=0 65. Similarly, OE can be estimated with either the mean or median of the T k 's. When estimating OE and ', it is preferable to use robust estimators, because OE and ' re ect quantities de ned for the null loci.

False Discovery Rate
We identify the outliers by testing H 0k 2 OE k D OE versus H 1k 2 OE k > OE for k D 11 : : : 1 K. To correct for multiple testing, we use the FDR method of Benjamini and Hochberg (1995). We begin this section by reviewing their method. We then show how the method can be used in our setting.
Consider testing a set of null hypotheses H 01 1 : : : 1 H 0K . Let P 1 1 : : : 1 P K be the p values associated with the K tests. Suppose that we reject some subset of these hypotheses. The realized FDR, Q, is de ned to be the number of false rejections divided by the total number of rejections. Q is taken to be 0 if no hypotheses are rejected. The Benjamini and Hochberg method for controlling E4Q5 is as follows. First, order the p values, P 415 µ ¢ ¢ ¢ µ P 4K5 , and then de ne d D max8j 2 P 4j5 < j =K9. Finally, reject all hypotheses whose p values are less than or equal to P 4d5 . Benjamini and Hochberg proved that this procedure ensures that E4Q5 µ , no matter how many nulls are false and no matter what the distribution of the p values when the null is false. The operating characteristics of the method have been discussed by Genovese and Wasserman (2001). In particular, the method has much higher power than Bonferroni when K is large.
This method can be adapted to our setting. Let T 1 1 : : : 1 T K be test statistics where T k º N 4OE k 1' 2 k ' 2 5. For the moment, suppose that OE, ', and ' 2 k are known and that T k has exactly a normal distribution. The Benjamini and Hochberg procedure can be applied to test H 0k 2 OE k D OE using the usual p values de ned by P k D 1 ƒ ê4Z k 5, where Z k D 4T k ƒ OE5=4' k '5 and ê is the standard normal cumulative distribution function.
There are several complications in our setting. First, T k is only asymptotically normal in the number of cases and controls. For large n, we have found (by simulation) that this is not a serious problem; see also the example in Section 5. Second, the T k may be correlated. Benjamini and Yekutieli (2001) showed that the inequality E4Q5 µ still holds for correlated This leads to a more conservative procedure. However, our experience is that the types of correlation in our problem are mild and do not in ate the FDR. Thus we have ignored the correction. Finally, OE, ', and ' 2 k are unknown and must be estimated. We now show that, at least asymptotically, the FDR property is preserved even when the p value is estimated by inserting consistent estimates of these parameters.

False Discovery Rate With Nuisance Parameters
For simplicity, we take ' k D 1 and known in what follows and concentrate on OE and '. The extension to unknown ' k is straightforward, albeit tedious. Let S4z5 D 1 ƒ ê4z5. The p value de ned earlier for testing H 0k 2 OE k D OE can be written as Remark. The estimators in Section 3 are asymptotically biased, as is the case in general for both robust and nonrobust estimators in the presence of outliers. However, if we let the fraction of outliers tend to 0 as K grows, then this bias disappears asymptotically. This assumption is realistic, because it re ects the fact that the fraction of liability genes is small relative to the size of the genome.
Proofs for these results can be found in Appendix B.

SIMULATION
We conducted a simulation study to investigate the FDR and the power of the procedure under various settings. In each experiment, we used the robust estimators of OE and '. We generated data from K D 100 regions with sample size n D m D 500 (or 250 individual cases and controls, each with a pair of haplotypes). The number of distinct haplotypes was set at R k D 32, and the nominal level of signi cance was set at .05. To obtain the average performance of the procedure, we generated 1,000 datasets for each con guration under investigation. In this case, power is de ned as the average fraction of alternative hypotheses rejected using the FDR method.
To validate the theorems concerning the detection of outliers, we generated data under the null hypothesis in three different ways. To investigate the procedure under the simplest setting, we set ç u D ç a D ç 0 , in which the xed haplotype distribution ç 0 was chosen to have four levels of haplotype probabilities 4001251 00251 003751 0055, with eight consecutive repetitions of each level to obtain a total of R k D 32 types. We set D 005 in the FDR procedure. For this setting, the mean FDR was .026. Next, we allowed ç u4k5 to vary randomly as a function of k by sampling ç u4k5 from the Dirichlet432 ç 0 5. By setting the haplotype frequencies of cases equal to the controls, ç a4k5 ² ç u4k5 , we xed OE D 0. For this setting, the mean FDR was .046. Finally, the model allows for general location shifts, but ç u is assumed to be equal to ç a when computing the variance. We investigated the robustness of the procedure to this approximation when OE 6 D 0. This time, we sampled both ç u4k5 and ç a4k5 from the Dirichlet432 ç 0 5. The choice of 32 corresponds to at least as much variation as is likely to be observed between case and control populations, under the null hypothesis (Devlin et al. 2001b). For this setting, the mean FDR was .010. We conclude that the procedure performs well under the null hypothesis. Indeed, it is somewhat conservative.
To investigate the power under various types of outlier models, we preset ç u4k5 at ç 0 for all of the null loci and then perturb this distribution for a set of alternative loci. We considered two different levels of contamination: H D 5 outliers and H D 20 outliers. For the remaining regions, k D H C 11 : : : 1 100, ç a4k5 ² ç 0 , as de ned earlier for the controls. We perturb ç 0 in four ways to obtain ç a4k5q , k D 11 : : : 1 H , q D 11 : : : 1 4: ç a4k5q D 41 ƒ a5ç 0 C a spike q . Let spike q q D 11 : : : 1 4 denote probability vectors of length R k . Let spike 1 be a point mass at l D 1, let spike 2 be a point mass at l D 25, let spike 3 be a mass of .5 at l D 1 and 9, and let spike 4 be a mass of .5 at l D 17 and 25 with a D 015 (Fig. 1). The rst and second conditions simulate the performance when a fraction a of the haplotypes trace back to a single ancestral haplotype, and the third and fourth conditions simulate the performance when a fraction a=2 of the haplotypes trace back to one of a pair of ancestral haplotypes.
The size of the deviation from the null, as measured by OE kq , k D 11 : : : 1 H , q D 11 : : : 1 4, depends greatly on the relative frequency of the haplotype associated with the disease. For q D 11 : : : 1 4, OE kq equals .015, .025, .006, and .012. Clearly, the biggest deviation occurs when the associated haplotype is also common in the controls, and the least detectable deviation occurs when two haplotypes are associated with the disease and these haplotypes are relatively rare in the controls. Not surprisingly, the latter condition (q D 3) exhibits considerably lower power than the other three conditions (Table 1). Overall, the relative power is correlated with the size of the true deviation, OE k , but the power also depends on ' k . For instance, OE 1 º OE 4 , but the power is considerably less for the latter conguration, because the variance is larger for this haplotype frequency distribution.
The number of regions deviating from the null, H , also affects the overall performance of the method (see Table 1). When 20% of the regions deviate from the null, both O OE and O ' are positively biased, and the bias in O ' is substantial. Although this bias de ates the power, the FDR rate is maintained at a conservative level (considerably less than .05, the nominal level) for all eight scenarios investigated. The procedure is more conservative than the nominal level, due to the bias in the estimates of OE and '. In most practical settings, H is a very small fraction of K, and, consequently, the bias will be considerably less than observed for this simulation.
To obtain a sense of how powerful the matching statistic is relative to competing methods, we compared it to the omnibus chi-squared test with R k ƒ 1 degrees of freedom (Table 2) using Holm's correction (Holm 1979) for both methods. The null distribution of the goodness-of-t statistic was obtained using a permutation test. This test is a valid competitor only when ' D 1. With the exception of condition q D 3, the matching statistic was either more powerful or roughly equivalent to the goodness-of-t statistic in performance.
In the four simulated conditions investigated thus far, none of the alleles dominates in frequency under the null hypothesis. To complete our investigation, we also considered a scenario with two common alleles ( ü 1 D ü 2 D 0155) and three types of rare alleles (00161 00231 003), each with 10 copies, so that R k D 32. Under the alternative hypothesis, ç a4k5 D 41 ƒ a5ç ü C a spike 1 . In this setting, the matching test clearly dominates the goodness-of-t statistic, with power equal to 67.3% versus 24.6%.
The general principle appears to be that the matching statistic is more powerful when the associated haplotype(s) is (are) relatively common in the control population. When the associated haplotype(s) is (are) relatively rare, then OE k tends to be near 0, and the goodness-of-t test is more powerful. In fact, under some conditions OE k º 0, and the matching statistic has power equal to the size of the test. The relative strength of the matching statistic to the goodness-of-t statistic is also greater when R k is large.
It is also worth noting that the size of the matching test in all comparisons was smaller than the nominal size of the test. Apparently OE and ' are estimated with some bias, which leads to a conservative test.

DATA ANALYSIS
To illustrate the proposed methods, we analyze a small sample of schizophrenia patients and controls sampled from the island nation of Palau. As part of an ongoing linkage study of schizophrenia on Palau, seven extended pedigrees have been ascertained and genotyped. We use a portion of these data for a pilot study of association. From the extended pedigrees, we selected 22 cases and 27 controls for further analysis. In a future study we anticipate supplementing our sample by obtaining a larger sample of both cases and controls.
Schizophrenia is a mental illness characterized by disordered thoughts, behaviors, and language. It is identi ed from a collection of "positive" symptoms together with "negative" symptoms. The positive symptoms include hallucinations (e.g., hearing voices or seeing things that do not exist), delusions (e.g., holding false beliefs, such as that one is being watched, spied on, or plotted against), and disorganization or incoherence of thought or speech; negative symptoms include lack of normal emotional response, withdrawal from others, neglect of grooming and hygiene, and poor work performance.
By the time the illness is diagnosed, brain structure and chemistry have been altered. Ultimate causes appear to involve both acquired and genetic factors. We concentrate on the genetic factors, attempting to identify variants in genes that generate higher risk for schizophrenia. To date the scienti c community has made only limited progress toward identifying the genetic basis for this challenging, complex disease.
A remote island nation in Micronesia, Palau covers an archipelago of more than 200 islands scattered over 125 miles of the South Paci c. The islands lie 600 miles north of New Guinea and 550 miles east of the Phillipines. Palau exhibits a slightly elevated rate of schizophrenia, 2.77% in males and 1.24% in females, compared with the sex-averaged estimate of .5%-1% worldwide.
Carbon dating (Takayama 1981) suggests that Palau was rst populated about 2,000 years ago and that the initial population size was small, probably less than 50. The population grew to 20,000 by 227 years ago but decreased to 4,000 about 100 years ago because of disease epidemics. The current population is around 21,000. The Palauan population is thought to have developed in isolation compared to, say, European populations, but it experienced a surprising level of immigration for such a remote region (Simmons, Graydon, Gajdusek, and Brown, 1965;Devlin et al. 2001a). Genetic analysis suggests that the original population appears to have migrated from an island in Southeast Asia, with some later migrations from Melanesia. Studies on Palau (Devlin et al. 2001a) indicate that substantial linkage disequilibrium (i.e., haplotype-sharing) exists in this population, making it ideal for an association study.
The population history of Palau facilitates the search for schizophrenia liability genes. First, the linkage disequilibrium for Palauan people is enlarged by the recent population bottleneck and extends potentially as far as 10 to 20 cM (Devlin et al. 2001a). Second, the isolation of Palau makes it easier to detect any schizophrenia genes introduced by recent migration. This is because these foreign chromosomes will be relatively prominent compared to the general Palauan chromosomes. Based on these facts, we believe that it is instructive to search for association even with the 10-cM marker grid available from the Palauan linkage study. Nevertheless, this 10-cM grid is much sparser than generally required for detecting association between markers and disease genes (Ott 2000). In follow-up studies, we hope to have a denser grid of markers.
We assume that most Palau natives are at least distantly related, due to their population history. Of the 22 cases in this study, some are close relatives, such as cousins, and other pairs are not obviously related. For controls, we chose 27 people who were either not known to be closely related to schizophrenics or were from a pedigree containing schizophrenics but were themselves less likely to carry a disease allele. As a group, the controls are not as closely related to one another as are the cases. Nevertheless, some of the controls are closely related to some of the cases. Overall, the genetic relationships among study subjects generates a complex correlation structure among the sampled haplotypes. This is clearly not a standard case-control study, and we anticipated that the correlation among haplotypes would have some effect on the variance of the test statistic.
To explore the relationship between familial relationship and matching of genetic material, we compared all cases with a known degree of familial relationship. Figure 2 shows the relationship between the degree of relatedness and the level of haplotype similarity for these case pairs. The y-axis indicates the degree of relatedness between two cases; the lower the value, the more closely they are related to each other. The x-axis is the average proportion of markers with matching allele types; the average is taken over four possible combinations because each person has two chromosomes. The plot show that as two people are more closely related to each other, they share more common genetic material. The correlation coef cient is extremely high (corr D ƒ0 85).
A critical selection criterion for the 22 cases and 27 controls was the ability to determine unambiguous haplotypes for these individuals, which were obtained using the linkage program Simwalk2 (Sobel and Lange 1996). Thus for each individual, we have haplotypes for 22 pairs of autosomal chromosomes, with 37 markers on the largest chromosome (chromosome 1),   7  2  6  11  3  1  1  3  0  6  3  5  10  3  6  2  2  0  5  4  3  10  3  5  2  2  0  4  3  0  10  4  3  ¢  ¢  ¢  22  3  8  3  3  4  14  1  3  22  3  8  4  3  3  descending to 8 markers on the smallest chromosome (chromosome 22). The genetic markers are STRs, and the allele type represents the number of repeats observed at a particular locus. Because chromosomes occur in pairs, each individual contributes two haplotypes to the dataset. Table 3 displays a portion of the haplotype data for chromosome 22. Each row records the alleles of markers on a chromosome. For example, from the last row of Table 3, the 27th control has the haplotype (5,8,7,2,3,11,3,5) on one of its 22nd chromosomes. We de ned regions using a moving bin encompassing adjacent pairs of markers. With this de nition, we obtained K D 453 regions. The haplotype frequency distribution varied by region, but R k º 32, and one or more forms were generally considerably more common than the others.

% of Matching Chromosomes
From these 453 observations, we obtain O OE D 0007 and O ' D 09046. Surprisingly, even though we had anticipated ' > 1 due to the strong positive correlation between cases, two factors offset this expectation; the cases are also correlated with the controls, which reduces the variance of T k , and for small samples, O ' k has a slight positive bias. To compensate, O ' has a slight negative bias. We observed the same phenomenon in simulated data for small samples (results not shown). This phenomenon did not in ate the FDR in our simulations.
To evaluate the assumption of normality, we examine the normal scores plot of the standardized matching statistics Z k (Fig. 3). The test statistics show a surprising degree of consistency with a normal distribution considering the size of the dataset. From this gure, it is also clear that none of the statistics appears to be unusually large relative to the remainder of the sample. In fact, none of the regions indicates a signi cant association with the disease using either the FDR or a Bonferroni procedure.
Plotting the test statistics as a function of the approximate relative location of the haplotypes on the 22 autosomal chromosomes indicates that several statistics tend to approach signi cance in regions that have shown promising signals in other studies of schizophrenia (Fig. 4). Considering the size of the sample and the coarseness of the marker grid, the test undoubtedly has low power. Follow-up studies will have better power.

DISCUSSION
In this article we have de ned a statistic, called the matching statistic, for locating regions of the genome that exhibit excess similarity of haplotypes within case haplotypes relative to the controls. This statistic is of interest because it identies regions that are reasonable candidates for locating disease genes. It is a practical alternative to a statistic developed by Devlin et al. (2000) that tested for excess ibd sharing assuming an extremely dense grid of genetic markers.
In many case-control association studies, the sampled haplotypes are correlated because the subjects are obviously related, cryptically related (related, but the relationship is unknown), or related due to common ethnic background. We nd the asymptotic distribution of the matching statistic while accounting for correlations among sampled individuals. The approach taken here could potentially be extended to many other statistics that measure haplotype sharing or, more generally, linkage disequilibrium. In fact, the performance of the matching statistic depends very strongly on the way in which the distribution of the case haplotypes deviates from the distribution of the control haplotypes. It would be interesting to investigate the performance of other statistics sensitive to linkage disequilibrium as well.
In motivating a 1-degree-of-freedom test, we noted that such a statistic was likely to have greater power for some alternatives than an omnibus test. In Section 4 we identi ed the type of alternative for which the statistic obtains a competitive advantage, and it appears that this type of alternative arises frequently in practice. Naturally, there are alternative haplotype distributions for which the omnibus test has greater power than the matching statistic. The advantage of the matching statistic is that it achieves its asymptotic distribution for a modest sample size, and hence is amenable to corrections for correlated samples, such as the one described in Section 2.2. Although in principle the goodness-of-t test could also be corrected in  a similar fashion, this test does not achieve its asymptotic distribution for moderate-sized samples if R k is large.
In our treatment of correlation among haplotypes, we ignore the known relationships among subjects, allowing ' to adjust for correlations due to known familial relationships as well as unknown relationships. It is likely that greater power could be obtained if the known relationships were modeled overtly in a manner such as that described by Slager and Schaid (2001).
Determining which regions in the genome exhibit signicant association involves a large number of hypothesis tests. In problems of this nature, cogent arguments can be made that it is more appropriate to control the FDR rather than the FWE, because the FDR method offers a more powerful option for discovering genomic regions that potentially have liability alleles. In its formulation, the Benjamini and Hochberg procedure controls the FDR for given a set of independent p values. In our application, the p values for each region involve esti-mated nuisance parameters. We have shown that under appropriate conditions, the FDR method based on p values with estimated nuisance parameters asymptotically preserves the FDR property. These results should be useful for other applications as well.

APPENDIX A: VARIANCE OF T k
We compute the variance of T k under the assumption of iid haplotypes, and hence n b ç a4k5 is distributed multinomial 4n3 ç a4k5 5 and m b ç u4k5 is distributed multinomial 4m3 ç u4k5 5. We suppress the subscript 4k5 to simplify the notation.
Here we express P R lD1 O 2 al in a quadratic form, b ç T a A b ç a , with A being the R-dim identity matrix. Without extra complexity, we obtain the general variance formula of the quadratic form b where diag4ç a 5 is the diagonal matrix of 4 a1 1 : : : 1 aR 5, diag4ç 425 a 5 is the diagonal matrix of 4 2 a1 1 : : : © 4 E T a Aç a C ç T a 4dA54dA5 T ç a C 2 ç T a A 425 ç a ª C 1 n 3 4dA 425 5 T ç a ƒ £ tr6Aè n 7 C ç T a Aç a ¤ 2 0 Using the same general form, var4 b ç T u A b ç u 5 is obtained.