What’s in a name? Y chromosomes, surnames and the genetic genealogy revolution

Heritable surnames are highly diverse cultural markers of coancestry in human populations. A patrilineal surname is inherited in the same way as the non-recombining region of the Y chromosome and there should, therefore, be a correlation between the two. Studies of Y haplotypes within surnames, mostly of the British Isles, reveal high levels of coancestry among surname cohorts and the influence of confounding factors, including multiple founders for names, non-paternities and genetic drift. Combining molecular genetics and surname analysis illuminates population structure and history, has potential applications in forensic studies and, in the form of ‘genetic genealogy’, is an area of rapidly growing interest for the public.

Cultural markers of ancestry
Before Darwin, humans accorded themselves a special place in the kingdom of life. Now, 150 years after the publication of the Origin of Species, we can appreciate that we are part of the continuum of the evolution of all species, but our unique qualities remain undeniable. Homo sapiens literally means 'knowing man', but Linnaeus might equally have called us Homo nominans ('naming man') because of our capacity for complex language and our innate need to apply names to things, and indeed to ourselves. Some of these names are heritable, and are recorded and persist through the generations. So, uniquely among organisms, many of us carry a cultural marker of coancestry, a surname, to go with the biological marker of coancestry common to all organisms, DNA.
In this review, we examine the relationship between these two kinds of information: surnames and DNA. Because most heritable surnames pass from father to son, we focus on the relationships between surnames and paternally inherited Y-chromosomal haplotypes. Together with the recent revolution in the power of DNA analysis, the internet has introduced a new dimension in the way that this power can be made easily available to the public and the way that surname information can be shared, exploited and understood. Most studies have focused on surnames in the developed world, and the British Isles in particular [1][2][3][4], and even reliable data on surname diversity are difficult to come by for many countries. Although this leads to an inevitable geographical and cultural bias, we hope that our description of principles and case studies will help to stimulate studies of a greater diversity of populations in the future.

History, inheritance and diversity of surnames
In human societies, having a name, and thus being identifiable, is essential. The addition of a heritable element facilitates identification and also marks lineages, providing a label of regional and familial membership. Although some societies (such as that of Iceland) continue to eschew heritable surnames, governments like them and, in some countries, have imposed them quite recently. For example, in Turkey all citizens were obliged to adopt a heritable surname in 1934 and in Mongolia a compulsory surname law was introduced in 1997. The earliest heritable surnames are those of China, dating back ~5000 years; time-depths for other nations vary (Table 1).
The diversity of heritable surnames also varies considerably; in China it is inconveniently low, as anyone who has carried out a PubMed search for a particular Li (the world's commonest surname) can testify [5], but in most countries it is amazingly high, with the mean number of bearers of any one surname well below 100 (Table 1). Some populations have high surname diversity because of a long history of admixture; this is certainly true of the USA. The current population of Great Britain has ~1.6 million surnames, but this value is much greater than that in the past, owing to recent immigration: the number listed in the 1881 census of England and Wales was only some 420 000. Though the derivations of surnames are often debatable, many fall into a limited number of classes, including patronyms (son of...) and those related to occupation, status or place-names (Box 1).

Patrilineal surnames and the Y chromosome
Given that DNA passes down to us from our ancestors together with surnames, people sharing surnames should have a greater than average chance of sharing segments of DNA by descent than the general population. Although most DNA is inherited from both parents, there is one segment, the non-recombining region of the Y chromosome, which is only passed down from father to son [6]. We might therefore expect that a surname should correlate with a type of Y chromosome, inherited from a shared paternal ancestor, perhaps the surname's original founder. A plethora of polymorphic DNA markers for distinguishing between Y chromosomes enables this idea to be tested; the types of marker and their properties are described in Box 2.
The simple expectation of a correlation between Y chromosome type and surname is complicated by several confounding factors. Some surnames are likely to have been founded independently more than once (Figure 1); this will result in more than one Y type being associated with a given surname. Non-paternity events, the adoption of male children and deliberate surname change will have the same consequence (Figure 1).
Mutation also acts to diversify the Y chromosome types associated with a particular surname, but, unlike the factors described earlier, its impact is quite predictable. The mutation rates of single nucleotide polymorphisms (SNPs) are low so, within the time-depths of surnames in most populations (typically ~500-1000 years in Europe, for example), the widely typed SNPs are not expected to undergo mutations. By contrast, short tandem repeats (STRs) mutate rapidly, so mutations are relatively likely to be observed; indeed, our knowledge of their rates comes from identifying mutations within pedigrees [7] and father-son pairs [8]. The probability of detecting mutations within lineages depends on the number of STRs analysed and also their individual properties.
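The dependence on the number of STRs can be made concrete with a short calculation. The sketch below is an illustration, not a method taken from the cited studies: it assumes that every STR mutates independently at the same rate, and it borrows the pedigree rate of about 2 × 10⁻³ per STR per generation quoted later in this review as a default. The function name is hypothetical.

```python
def prob_at_least_one_mutation(n_strs: int, generations: int,
                               rate: float = 2e-3) -> float:
    """Chance that at least one of n_strs markers mutates along a
    patrilineal chain of `generations` father-son transmissions.

    Assumption (not from the source): every STR mutates independently
    at the same per-generation rate.
    """
    if n_strs < 0 or generations < 0:
        raise ValueError("n_strs and generations must be non-negative")
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be a probability")
    transmissions = n_strs * generations
    # Probability that no marker mutates in any transmission, then complement.
    return 1.0 - (1.0 - rate) ** transmissions


if __name__ == "__main__":
    for n_strs in (4, 17, 67):
        for generations in (1, 20):
            p = prob_at_least_one_mutation(n_strs, generations)
            print(f"{n_strs:>2} STRs, {generations:>2} generations: {p:.3f}")
```

Under these assumptions a 17-STR haplotype has roughly a 3% chance of changing in a single transmission, but close to an even chance of changing somewhere along a 20-generation line, which is why mutated haplotypes are expected within old surnames.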
Genetic drift (the random changes in haplotype frequencies over the generations) is the final factor that acts against the influences described earlier by reducing the diversity of haplotypes within surnames. For example, the stochastic variation in the number of sons fathered by different men can, over many generations, lead to the extinction of some Y chromosome lineages and the increase in the frequency of others within surname cohorts. Indeed, genetic drift (known in genealogical circles as 'daughtering out') is responsible for the complete extinction of some British surnames (such as Campinot) that had persisted for many generations [9].
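The effect of stochastic variation in the number of sons can be illustrated with a simple branching-process simulation. This is a minimal sketch under stated assumptions, not the simulation used in Ref. [2]: each man is assumed to father a Poisson-distributed number of sons with a mean of one, lineages are independent, and all function names and parameter values are hypothetical.

```python
import math
import random


def poisson(lam: float, rng: random.Random) -> int:
    """Draw a Poisson variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def lineage_survives(generations: int, mean_sons: float,
                     rng: random.Random) -> bool:
    """Follow one founder's male line; True if any male descendant remains."""
    males = 1
    for _ in range(generations):
        males = sum(poisson(mean_sons, rng) for _ in range(males))
        if males == 0:
            return False  # the lineage has 'daughtered out'
    return True


def surviving_fraction(n_founders: int, generations: int,
                       mean_sons: float = 1.0, seed: int = 0) -> float:
    """Fraction of founder lineages still present after `generations`.

    Assumption: mean_sons close to 1 (a stable population); much larger
    values make the simulated lineages grow very quickly.
    """
    if n_founders <= 0:
        raise ValueError("n_founders must be positive")
    rng = random.Random(seed)
    survivors = sum(lineage_survives(generations, mean_sons, rng)
                    for _ in range(n_founders))
    return survivors / n_founders


if __name__ == "__main__":
    for g in (5, 10, 20):
        print(g, surviving_fraction(2000, g))
```

In a model of this kind most founding lineages disappear within a few tens of generations, so the haplotypes seen in a surname today can be a small and unrepresentative subset of those carried by its founders.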

Y chromosome diversity within surnames of the British Isles
Most detailed studies have focused on surnames of the British Isles. The pioneering and eponymous study of the surname Sykes [4] indicated low Y-haplotype diversity among unrelated carriers of the name, suggesting that this was compatible with a single founder. However, its haplotype resolution (four Y-STRs) was low.
The availability of more STRs and haplogroup-defining SNPs (Box 2) has enabled higher-resolution studies to be performed. A general link between surnames and Y haplotypes was revealed in a study of 150 pairs of randomly ascertained men, each pair sharing a different British surname [1]. Sixteen of the 150 pairs shared identical 17-STR haplotypes and 20 more pairs shared sufficiently similar haplotypes to suggest coancestry within the past 700 years, the average time since British surnames were established. Overall, the link is stronger the rarer the surname, with all pairs that show a strong signal of coancestry being found among the less common surnames (<5600 bearers); this suggests that the commoner surnames had large numbers of founders relative to rarer surnames.

Box 1.
Many surnames have one or more spelling variants; most of these were fixed recently, when spellings were formalized [2,54].
In Iceland, surnames are not heritable, but patronymic: the surnames of a son or daughter of the father Stefán, for example, will be Stefánsson and Stefánsdottir, and in the next generation the surnames will change again. Many heritable surnames in other countries have evolved from previously non-heritable patronymic systems.

Box 2. Markers for Y-chromosome diversity
Two types of polymorphic marker are commonly used to distinguish Y chromosomes from one another [6]. Binary markers such as SNPs have low mutation rates, typically ~10⁻⁸ per base per generation [67], and mostly represent unique events in human evolution. STRs are multiallelic markers, new alleles arising largely by single-step mutation at a typical rate of ~10⁻³ per STR per generation [8].
Binary markers are used in combination to define monophyletic haplotypes ('haplogroups'), which are arranged into a maximum parsimony tree [38,68] containing major clades labelled A-T (Figure Ia). Each clade is further subdivided into alphanumerically named subclades (Figure Ib), the whole tree currently comprising 586 markers defining 311 haplogroups [38]. Application of new sequencing technologies (www.1000genomes.org) will yield thousands of new markers and serious nomenclature problems because the current system will become impossibly unwieldy. Some haplogroups are frequent in particular populations and, therefore, provide little discriminatory power.
The majority of widely used Y-STR markers are tri- and tetranucleotide repeats, of which there are >200 on the chromosome [69].
Combinations of Y-STRs (typed in PCR multiplexes) define more informative haplotypes within the haplogroups. Relationships among Y-STR haplotypes are often displayed in median-joining networks [70] (Figure Ic), which can also incorporate haplogroup information. Closely related sets of haplotypes (typically found within surnames) define 'descent clusters' and, given an estimate of average STR mutation rates, TMRCA for a cluster can be estimated [71].
Typing an STR multiplex is a highly efficient way both to distinguish between Y chromosomes and to indicate haplotype relationships, and can even be used to predict a haplogroup [72]. Each new haplogroup-defining SNP arose on a single chromosome, carrying a single Y-STR haplotype. Over time, mutation led to a limited repertoire of variation among the Y-STR haplotypes within this haplogroup, deriving from the founding haplotype [73]. The power of haplogroup prediction depends on the number of STRs typed and, in some cases, specific diagnostic STR alleles. Distinguishing between closely related haplogroups is usually difficult and, indeed, they can share identical Y-STR haplotypes, even when many STRs are typed. In such cases, SNP typing is essential.
Two studies, in Britain [2] and Ireland [3], have collected and analysed larger groups of men with fewer surnames, using the same set of 17 Y-STRs, plus several haplogroup-defining SNPs. Both studies used networks to display and analyse diversity, with different approaches to defining 'descent clusters' of related haplotypes (Box 2). Both also estimated the time to most recent common ancestor (TMRCA) for clusters, finding ages compatible with the known time-depths of surname establishment. British control males carrying different surnames show very little haplotype sharing (Figure 2a) and the same is true of men carrying the commonest surname, Smith (Figure 2b). However, less common names show decreasing haplogroup diversity and increasing degrees of STR haplotype sharing (Figure 2c,d): rare names (such as Attenborough) can be dominated by a single descent cluster (Figure 2d), which might indicate a single founder. However, the shallow time depth of many clusters within names, the absence of an effect of surname type on diversity and computer simulations together suggest a strong influence of genetic drift, such that current diversity is a poor reflection of the initial founder number [2].
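One simple way to obtain a TMRCA from the STR haplotypes of a descent cluster is the average squared distance (ASD) from an assumed founder haplotype, which under a single-step mutation model grows in expectation by the mutation rate each generation. The sketch below is offered only as an illustration: it is not necessarily the estimator used in Refs [2,3], it treats the modal haplotype as a stand-in for the founder, it assumes one rate for all STRs, and the function names and toy haplotypes are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple

Haplotype = Sequence[int]  # repeat counts, one entry per STR, in a fixed order


def modal_haplotype(haplotypes: Sequence[Haplotype]) -> Tuple[int, ...]:
    """Most frequent allele at each STR, used here as a proxy for the founder."""
    n_loci = len(haplotypes[0])
    if any(len(h) != n_loci for h in haplotypes):
        raise ValueError("all haplotypes must have the same number of STRs")
    return tuple(Counter(h[i] for h in haplotypes).most_common(1)[0][0]
                 for i in range(n_loci))


def tmrca_generations(haplotypes: Sequence[Haplotype],
                      rate: float = 2e-3) -> float:
    """ASD-based TMRCA in generations.

    Assumptions (not from the source): single-step mutation model,
    a star-like genealogy, and the same rate at every STR.
    """
    modal = modal_haplotype(haplotypes)
    total_squared = sum((h[i] - modal[i]) ** 2
                        for h in haplotypes for i in range(len(modal)))
    asd = total_squared / (len(haplotypes) * len(modal))
    return asd / rate


if __name__ == "__main__":
    # Invented toy cluster: four men typed at three STRs.
    cluster = [(14, 12, 30), (14, 12, 30), (14, 13, 30), (14, 12, 30)]
    print(tmrca_generations(cluster))
```

Multiplying the result by an assumed generation time converts it to years. Because the estimate is inversely proportional to the rate, the choice of mutation rate changes the answer directly.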
Irish Y chromosomes show much lower haplogroup diversity than those of Britain, ~90% belonging to a single haplogroup, so most information is provided by Y-STRs [3]. Based on the same set of 17 STR markers [2], Irish controls carrying different surnames (like British ones) show very few shared haplotypes. However, within surname cohorts descent clusters are again evident, with an average of 61% of haplotypes within a surname lying in descent clusters, a very similar value to the British proportion of 62% [2]. Most of the variation between names was attributed to differences in founder numbers.
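The proportion of haplotypes lying in descent clusters depends on how a cluster is defined. As a rough sketch of one possible rule, and not the criteria applied in either study, the code below links any two haplotypes that differ by no more than a chosen number of single repeat steps and counts every linked group of at least two men as a descent cluster. The thresholds, function names and toy data are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

Haplotype = Sequence[int]  # repeat counts at the same STRs, in the same order


def step_distance(a: Haplotype, b: Haplotype) -> int:
    """Total number of single repeat steps separating two haplotypes."""
    return sum(abs(x - y) for x, y in zip(a, b))


def descent_clusters(haplotypes: Sequence[Haplotype], max_steps: int = 1,
                     min_size: int = 2) -> List[List[int]]:
    """Single-linkage clusters, returned as lists of sample indices."""
    parent = list(range(len(haplotypes)))

    def find(i: int) -> int:
        # Follow parent links to the group root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(haplotypes)):
        for j in range(i + 1, len(haplotypes)):
            if step_distance(haplotypes[i], haplotypes[j]) <= max_steps:
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(len(haplotypes)):
        groups.setdefault(find(i), []).append(i)
    return [members for members in groups.values() if len(members) >= min_size]


def proportion_in_clusters(haplotypes: Sequence[Haplotype],
                           max_steps: int = 1, min_size: int = 2) -> float:
    """Share of sampled men whose haplotype falls in any descent cluster."""
    clusters = descent_clusters(haplotypes, max_steps, min_size)
    return sum(len(c) for c in clusters) / len(haplotypes)


if __name__ == "__main__":
    # Invented toy surname sample: five men typed at three STRs.
    sample = [(14, 12, 30), (14, 12, 30), (14, 13, 30),
              (16, 10, 28), (15, 15, 25)]
    print(descent_clusters(sample), proportion_in_clusters(sample))
```

A rule this simple will also group unrelated men who happen to carry a common haplotype, so in practice the threshold would need calibrating, for example within haplogroups that are rare in the population.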
Comparison of the two studies reveals a striking difference between these neighbouring islands: the surname frequency-dependence of coancestry proportions evident in British names is absent from Ireland [2]. Some common Irish names such as Ryan (Figure 2e), borne by as much as 1% of the population, are dominated by single descent clusters and, unlike in Britain, there is no significant correlation between the rarity of a surname and the diversity of the Y chromosomes within it. The difference could be due to an amplification of genetic drift in Ireland, as a result of the prevalence of medieval patrilineal dynasties that linked male social and reproductive success in the past (discussed further later), but could also reflect other demographic historical differences, such as greater urbanization in Britain and different impacts of epidemic disease.

Figure 1. Current Y diversity within a surname is influenced by founder numbers, non-paternity, genetic drift and mutation. In this hypothetical genealogy all males share a patrilineal surname, which originated 20 generations ago in two unrelated founding men carrying different Y haplogroups (hgs), T and R1a. These also share common ancestry ~1600 generations [38] ago. Subsequently, further diversity was introduced by non-paternity events, adoptions or surname changes (shown by stars and the different haplogroup colours) or STR mutations (different shades of haplogroup colours). Diversity was reduced by genetic drift: all current hg T chromosomes within the surname descend from the original founder, whereas all current hg R1a chromosomes have a most recent common ancestor (MRCA) only nine generations ago. In each case, white dots and bold lines indicate genealogical connections between current chromosomes and their MRCA. Current haplogroup diversity within the surname is very different from that in the general population [2] (pie charts to right, with sectors proportional to haplogroup frequency); in particular, hg T is not found in the general population sample, but represents 35% of the chromosomes in the surname sample.

Review
Trends in Genetics Vol.xxx No.x TIGS-748; No of Pages 10
These studies also highlight several factors that should be considered when systematic surname studies are carried out in other populations: (i) sampling strategy needs to be planned carefully to avoid sampling related individuals; (ii) geographical structure could affect diversity within sampled surnames and its extent needs to be assessed [3]; (iii) use of a standard set of Y-STRs would facilitate comparisons between studies and, because of their convenience and high resolution, the commercially available profiling kits such as Y-filer (ABI) seem appropriate; and (iv) the criteria for membership of descent clusters need careful consideration because the boundary of a cluster is often not obvious. Our recommendation is to type binary markers and STRs, which will enable the definition of clusters within haplogroups that are rare in the population, and which therefore have relatively clear boundaries [2]. The observed pattern of STR divergence within such clusters can be used to define a set of rules for cluster definition that can be more widely applied to common haplogroups. In some populations (e.g. Ireland) haplogroup diversity is currently inconveniently low for this approach [3], but new marker discovery should soon alleviate this problem; (v) deduction of relevant generation times [10], perhaps from genealogical research in the populations under study, would aid in the accuracy of dating; and (vi) standardization of Y-STR mutation rates would help in the estimation of TMRCAs across studies. The mutation rate derived from direct observation in father-son pairs (the 'pedigree rate'; ~2 × 10⁻³ per STR per generation [8]) is approximately threefold greater than that derived from consideration of accumulated diversity within populations (the 'evolutionary rate' [11]), and studies have differed in which of these they apply, which makes their results difficult to compare [2,3].

Applications of surname studies
The first application of surnames in genetics was in 'isonymy' studies, a field originated by Charles Darwin's son George, in which surnames were used to estimate the degree of inbreeding in populations, based on the frequency of same-surname marriages [12] or on surname frequencies alone [13]. The underlying assumption that a shared surname implies shared ancestry has not been tested in most of the surveyed populations and, as our previous discussion indicates, is often likely to be incorrect [14]. Despite such objections, the field of isonymy studies remains active; for a review, see Ref. [15].
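For readers unfamiliar with isonymy, the calculation from surname frequencies alone can be sketched in a few lines. The relation used here, that the inbreeding coefficient is approximately one quarter of the isonymy, is the classical textbook form and is not stated in this review; it is included as an assumption, along with the hypothetical function names and the invented toy list of surnames.

```python
from collections import Counter
from typing import Iterable


def random_isonymy(surnames: Iterable[str]) -> float:
    """Chance that two bearers drawn at random (with replacement)
    share a surname: the sum of squared surname frequencies."""
    counts = Counter(surnames)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one surname is required")
    return sum((n / total) ** 2 for n in counts.values())


def inbreeding_from_isonymy(isonymy: float) -> float:
    """Classical approximation F = I / 4 (an assumption, see lead-in).

    It presumes that every surname has a single origin and is always
    transmitted patrilineally, which is the premise questioned above.
    """
    return isonymy / 4.0


if __name__ == "__main__":
    # Invented toy population with two equally common surnames.
    population = ["NameA", "NameA", "NameB", "NameB"]
    i = random_isonymy(population)
    print(i, inbreeding_from_isonymy(i))
```

The single-origin premise built into this calculation is exactly what Y-chromosome data allow to be checked: where a surname has several founders, shared names overstate shared ancestry.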
Here, we focus on three areas in which surname information has been combined with molecular genetic analysis to yield new insights.

Past population structure and history
Surnames tend to be specific to particular indigenous populations and to show geographical specificity within regions. This property means that they find wide application as convenient proxies for ethnic origin [16] in healthcare [17], epidemiological studies [18] and directed marketing [19]. However, combining surnames with Y-chromosome analysis has also enabled them to be used in genetic studies of historical migrations and admixture.
Much of this work has been carried out in the Irish population. For example, removal of individuals with non-Gaelic surnames in an analysis of Irish Y chromosomes leads to a significant change in haplogroup frequencies [20] and probably also to access to a more 'indigenous' sample and its population structure. A further link with the distant past is suggested by a common haplotype [21], interpreted to reflect the demographic impact of a medieval patrilineal dynasty, the Uí Néill. This 17-STR haplotype accounts for ~17% of Y chromosomes in the northwest of Ireland and is proposed to be the Y lineage of a 5th century warlord, Niall of the Nine Hostages. This interpretation is supported by the over-representation of a descent cluster centred on the haplotype in 25 Irish surnames thought to originate in the Uí Néill dynasty.
The high reproductive success of this lineage seems to provide support for the idea of an amplification of genetic drift through social selection in the history of Ireland, adduced earlier to explain differences in haplotype diversity between Irish and British surnames. However, studies of multiple surname groups thought to descend from two other patrilineal clans (Eóganacht and Dál Cais) show much less evidence of coancestry within either clan [22] and this suggests either that not all clans were really established by eponymous founders or that, in these cases, the link between modern surnames and early origins has been severed. A broken link might also be suggested by an analysis of males with names of Norse Viking derivation (e.g. Thunder, Doyle and Hanrick), which reveals no difference from a general Irish sample [23], although this could also simply indicate that the Norse contribution in the Viking period (800-1200 CE) was very low.
The geographical differentiation of Y haplotypes is particularly marked in intercontinental comparisons. An association of a clearly African Y lineage with a rare English surname [24] provides evidence of a past African presence in Britain, and genealogical research connecting men carrying both the surname and the exotic chromosome enables a lower limit to be placed on its time-depth (the mid-eighteenth century). In a different geographical context, observation of the low diversity of Y haplotypes in surname groups in Colombia demonstrates the powerful male-specific founder effects caused by Spanish and Portuguese colonization [25].
Most population studies of Y-chromosome diversity categorize donors into local sub-populations on the basis of at least two generations of residence. However, this is compromised by migration in preceding generations. The geographical specificity of surnames suggests surname-based sampling as a means to choose modern Y chromosomes in a way that reflects their past population distributions [26]. This was done in a study of the Viking contributions to the Wirral peninsula and West Lancashire, in northwest England [27]. Historical and other evidence suggests colonization by Norse Vikings, beginning in 902 CE. Independent samples were recruited for each place: the 'modern' sample, based simply on two generations of residence; and the 'medieval' sample, based on a history of residence plus the possession of a surname known from documentary evidence to have been present in the region before 1572 CE. The distributions of Y haplotypes in the two sample types were significantly different, and this could be accounted for by a greater Norse contribution to the 'medieval' samples, as judged by admixture analysis. This supports the idea that surname-based ascertainment provides a sample that more closely reflects past populations, before immigration from elsewhere.
Several studies of surnames and Y haplotypes have used the diversity present within surnames to make inferences about the past rates of non-paternity [2-4,25]. The assumptions and methods vary, but there is agreement that rates are < 5% per generation and, in some cases, < 1% [25]. These rates are therefore consistent with modern estimates where there is no previous suspicion of non-paternity [28] and contradict the oft-quoted 'urban mythical' figure of 10% per generation.
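The practical meaning of these per-generation rates is easier to see when they are compounded over the lifetime of a surname. The following is a back-of-envelope sketch, not the inference method of the cited studies: it assumes a constant, independent rate in every generation, and the choice of 20 generations is an illustrative value borrowed from the hypothetical genealogy of Figure 1.

```python
def unbroken_fraction(rate_per_generation: float, generations: int) -> float:
    """Probability that a patrilineal line passes through `generations`
    transmissions with no non-paternity event.

    Assumption: the same independent rate applies in every generation;
    adoption, surname change and drift are ignored.
    """
    if not 0.0 <= rate_per_generation <= 1.0:
        raise ValueError("rate must be a probability")
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return (1.0 - rate_per_generation) ** generations


if __name__ == "__main__":
    for rate in (0.01, 0.05, 0.10):
        kept = unbroken_fraction(rate, 20)
        print(f"rate {rate:.0%}: {kept:.0%} of lines still carry the founder's Y")
```

On these assumptions a 1% rate leaves about four-fifths of lines intact after 20 generations, a 5% rate about one-third and a 10% rate about one-eighth, so large descent clusters within old surnames are hard to reconcile with the 10% figure.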

Forensic application
The link between surname and Y-chromosomal haplotype suggests the idea of predicting a surname in forensic investigations [29]. In a case in which an autosomal DNA profile yields no matches in a DNA database, a list of surnames with associated Y-STR haplotypes could enable a Y profile to be matched with one or more surnames. This would provide a means to prioritize a suspect list; the surname prediction would act only as an investigative tool because autosomal profiling could be used to exclude or match individuals once they were identified. The validity of this approach has been confirmed in principle [1], but has yet to be used in practice; it might be compromised in the mixed and urban populations commonly encountered in criminal investigations. The link between surname and Y haplotype is weak for common names (Figure 2), and including all rare ones is impractical, so the approach would be most useful for intermediate frequency surnames. In some cases, sharing of common haplotypes across surnames could result in many surnames being returned. In a sample of 1814 men carrying 164 names, the commonest 17-STR haplotype was shared across 16 different surnames (T.E. King, PhD thesis, University of Leicester, 2007).
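The lookup described here can be sketched as a small index from Y-STR haplotypes to the surnames recorded with them. This is an illustrative toy and not an existing forensic tool: the record format, function names and match threshold are assumptions, and the surnames and haplotypes in the example are invented.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple

Haplotype = Tuple[int, ...]  # repeat counts at a fixed, ordered set of Y-STRs


def build_index(records: Iterable[Tuple[str, Sequence[int]]]
                ) -> Dict[Haplotype, Counter]:
    """Map each haplotype to a tally of the surnames recorded with it."""
    index: Dict[Haplotype, Counter] = defaultdict(Counter)
    for surname, haplotype in records:
        index[tuple(haplotype)][surname] += 1
    return index


def candidate_surnames(index: Dict[Haplotype, Counter],
                       profile: Sequence[int],
                       max_steps: int = 0) -> List[Tuple[str, int]]:
    """Surnames whose recorded haplotypes lie within max_steps repeat
    steps of the profile, most frequently matched first."""
    query = tuple(profile)
    tally: Counter = Counter()
    for haplotype, surnames in index.items():
        if len(haplotype) != len(query):
            continue  # typed with a different marker set; not comparable
        steps = sum(abs(a - b) for a, b in zip(haplotype, query))
        if steps <= max_steps:
            tally.update(surnames)
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(tally.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Invented toy database: (surname, haplotype at three STRs).
    database = [("NameA", (14, 12, 30)), ("NameA", (14, 12, 30)),
                ("NameB", (14, 12, 30)), ("NameC", (15, 12, 30)),
                ("NameD", (16, 10, 28))]
    index = build_index(database)
    print(candidate_surnames(index, (14, 12, 30)))
    print(candidate_surnames(index, (14, 12, 30), max_steps=1))
```

Allowing near matches tolerates STR mutation between relatives but lengthens the returned list, which mirrors the trade-off noted above for haplotypes shared across many surnames.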
Although surname prediction might have useful forensic applications, it also has the potential to infringe the privacy of those contributing DNA anonymously for medical research. For example, the surnames of the donors of the European members of the HapMap [30] DNA collection could be guessed at using published genotyping data and public databases of names and haplotypes [31]. In a highly publicized case, a 15-year-old boy conceived by anonymous sperm donation traced his biological father by surname prediction through testing of his own Y chromosome and exploiting public databases, together with information on the father's date and place of birth [32].

Genetic genealogy and the rise of recreational genetics
Without doubt, the most active area of exploitation of the link between surnames and Y haplotypes is in the area of genetic genealogy, driven by the massive popular interest in family history, the availability of commercial DNA testing and the ease of communication afforded by the internet. Many companies offer Y-chromosome analysis, which is done using DNA extracted from buccal samples received from customers by post. More broadly, genetic genealogy forms part of 'recreational genetics', which includes the use of genome-wide markers to assess personal ancestry, relatedness and disease susceptibility, and this growing activity is also providing useful information for surname studies.
Directed commercial Y testing is usually seen as an adjunct to the traditional methods of genealogical research [33] and can, for example, show that two men with the same surname share a haplotype and, therefore, a recent common ancestor [34,35]. Estimates of the time during which that ancestor lived [36] might also be offered, subject to considerable uncertainty. More broadly, a group of men sharing a surname can collaborate to have their Y chromosomes analysed, which can lead to the refinement of family trees, or the inclusion or rejection of branches for further genealogical investigation. Thousands of such 'surname projects' are currently in existence (Box 3). Datasets containing Y-haplotypes and associated surnames, often made freely available online by customers, are large (Table 2). Despite the possibly biased ascertainment of samples, these represent a very useful general resource and give opportunities for collaboration between the academic and amateur communities. One recent example is the characterization of a set of novel SNPs within the rare hg G, which was facilitated by the easy identification and recruitment of DNA donors carrying hg G chromosomes via public genetic genealogy databases [37].
The interpretation of the relationships among customers' Y haplotypes depends on the haplotype resolution. Although some companies offer Y-SNP analysis, most offer only Y-STR typing because this is highly discriminating and universally applicable; the number of STRs typed varies from 15 to 67. Notably, 67 STRs is far more than are analysed in most academic studies, which are commonly restricted by budgetary considerations to 20 or fewer. Generally speaking, the more markers typed the better (aside from the increasing probability of typing errors) because this reduces ambiguity in the interpretation of shared haplotypes. However, as the number of STRs increases, so too does the probability of detecting an STR mutation between close relatives [29], and this needs to be taken into account.
Companies offering broader recreational genetics services use microarray-based methods to type up to ~1 million SNPs genome-wide and return information to customers. The relevance for surname studies is that a proportion of the SNPs typed in these analyses are annotated as Y-linked (for example, 858 SNPs on the commercially typed Illumina 1M chip) and so they provide potential information about Y lineages. However, the SNP validation status and the correspondence with well-studied Y-SNPs [38] is in many cases unclear and this is being resolved through the sharing of genotypic data from SNP chips among genetic genealogists [39,40]. For example, haplogroup R1b1b2 is the commonest Y-lineage in western Europe, reaching >90% in Ireland, and it has been difficult to find SNPs to subdivide it for population studies. The SNP rs34276300, known as S116, has been identified through comparing SNP chip results as a useful marker to subdivide hg R1b1b2, and is now being incorporated into academic studies. This is an area in which closer collaborations between amateurs and academics could prove particularly useful. Genetic genealogy enthusiasts often display an impressive level of knowledge about aspects of molecular evolution, population genetics and statistics; some of this is evinced in the quarterly online Journal of Genetic Genealogy (www.jogg.info). Although it lacks the standard scientific peer-review system of traditional journals, it is nonetheless attracting academic geneticists among its authors and is an interesting model for public involvement in scientific publication. Other resources for genetic genealogy are listed in Box 3. Thanks to the advances in DNA technology and the power of the internet, genetics is now joining astronomy as a science in which amateurs can make useful discoveries.
Genetic genealogy is fun, fascinating and has much to contribute to academic science, but does it have any drawbacks? One obvious problem is the danger of detecting unexpected past non-paternities or of having cherished oral histories disproven, both of which happened in the case of a family who believed themselves to be descendants of President Thomas Jefferson [41]. Although the Y chromosome is notoriously lacking in robust disease associations [6], some interstitial Y-chromosomal deletions (with incidences up to ~1 in 4000 males [42]) are certainly associated with male infertility [43] and can be signalled by the absence of specific Y-STRs and SNPs [44]. Beyond the genealogical aspects, the assignment of Y-lineages to particular geographical origins or ethnic groups can be misleading [45,46]. None of these potential pitfalls seem likely to put off the customers of DNA typing companies, however.

Future developments
Sampling of a wider variety of populations and their surnames will help to alleviate the current geographical bias and should lead to interesting new insight into social and demographic history. However, most new advances will arise from exploitation of recent technological developments. Improvements to the methods of analysis of ancient DNA should enable the testing of genealogical links between living individuals and putative patrilineal ancestors and also among archaeological human remains [47,48]. High-resolution Y typing and mitochondrial DNA sequencing, together with whole-genome SNP analyses, should enable reliable reconstructions of genealogies de novo, at least for the past few generations; this will include the establishment of links across the sexes, which cannot be achieved by the analysis of uniparentally inherited markers alone. In terms of relatedness, surname-ascertained cohorts of men who share Y-chromosomal coancestry lie between the traditional pedigree and the population, and application of whole-genome typing to such groups could be useful in understanding the history of recombination [49] and for genetic epidemiological purposes.
Recent application of conventional and 'next-generation' sequencing [50] technologies has revealed a large number of putative Y-SNPs in two named individuals, Craig Venter [51] and James Watson [52]. Such 'celebrity genomics' [53] projects will add further famous names to the webpages of genealogical geneticists, to join the motley crew of Genghis Khan, Thomas Jefferson, Marie Antoinette, Jesse James et al. (www.isogg.org/famousdna.htm). As the cost of sequencing continues to fall, private individuals will fund their own genome projects and it seems inevitable that SNPs that are specific to particular surnames or their branches will be identified, providing powerful resources for genealogical research.