Human genetic clustering

Human genetic clustering analysis applies mathematical cluster analysis to the degree of genetic similarity between individuals and groups in order to infer population structure and assign individuals to groups. These groupings often, but not always, correspond to the individuals' self-identified geographical ancestry. A similar analysis can be done with principal components analysis, which was popular in earlier research[1] and continues to be used in many recent studies.

Studies

Clusters by Rosenberg et al. (2006)

In 2004, Lynn Jorde and Stephen Wooding argued that "Analysis of many loci now yields reasonably accurate estimates of genetic similarity among individuals, rather than populations. Clustering of individuals is correlated with geographic origin or ancestry."[2]

Gene clusters from Rosenberg (2006) for K=7 clusters. (Cluster analysis divides a dataset into any prespecified number of clusters.) Individuals have genes from multiple clusters. The cluster prevalent only among the Kalash people (yellow) only splits off at K=7 and greater.
Human population structure can be inferred from multilocus DNA sequence data (Rosenberg et al. 2002, 2005). Individuals from 52 populations were examined at 993 DNA markers. These data were used to partition individuals into K = 2, 3, 4, 5, or 6 gene clusters. In this figure, the average fractional membership of individuals from each population is represented by horizontal bars partitioned into K colored segments.

Studies such as those by Risch and Rosenberg use a computer program called STRUCTURE to find human populations (gene clusters). It is a statistical program that places individuals into one of an arbitrary number of clusters based on their overall genetic similarity; many possible pairs of clusters are tested per individual to generate multiple clusters.[3] These populations are based on multiple genetic markers that are often shared between different human populations, even over large geographic ranges. The notion of a genetic cluster is that people within the cluster share, on average, more similar allele frequencies with each other than with those in other clusters (A. W. F. Edwards, 2003; see also the infobox "Multi Locus Allele Clusters"). In a test of idealised populations, STRUCTURE was found to consistently underestimate the number of populations in the data set when high migration rates between populations and slow mutation rates (such as those of single-nucleotide polymorphisms) were considered.[4]

Nevertheless, the Rosenberg et al. (2002) paper shows that individuals can be assigned to specific clusters with a high degree of accuracy. One of the underlying questions about the distribution of human genetic diversity concerns the degree to which genes are shared between the observed clusters. It has been observed repeatedly that the majority of variation in the global human population is found within populations. This variation is usually calculated using Sewall Wright's fixation index (FST), which estimates the proportion of total genetic variation that is attributable to differences between groups. The degree of human genetic variation differs somewhat depending on the type of gene studied, but in general it is common to claim that ~85% of genetic variation is found within groups, ~6–10% between groups within the same continent, and ~6–10% between continental groups. For example, the Human Genome Project states "two random individuals from any one group are almost as different [genetically] as any two random individuals from the entire world."[5] Sarich and Miele, however, have argued that estimates of genetic difference between individuals of different populations fail to take human diploidy into account.

The point is that we are diploid organisms, getting one set of chromosomes from one parent and a second from the other. To the extent that your mother and father are not especially closely related, then, those two sets of chromosomes will come close to being a random sample of the chromosomes in your population. And the sets present in some randomly chosen member of yours will also be about as different from your two sets as they are from one another. So how much of the variability will be distributed where? First is the 15 percent that is interpopulational. The other 85 percent will then split half and half (42.5 percent) between the intra- and interindividual within-population comparisons. The increase in variability in between-population comparisons is thus 15 percent against the 42.5 percent that is between-individual within-population. Thus, 15/42.5 is 32.5 percent, a much more impressive and, more important, more legitimate value than 15 percent.[6]
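The arithmetic in the quoted passage can be retraced in a few lines of Python. This is only a sketch of the calculation as the quotation describes it, and the variable names are illustrative. Note that 15/42.5 evaluates to about 35.3 percent, slightly more than the 32.5 percent given in the quotation as reproduced here.

```python
# Shares of total variation as stated in the quoted passage (percent).
between_populations = 15.0
within_populations = 100.0 - between_populations  # 85.0

# The passage splits the within-population share equally between
# within-individual and between-individual comparisons.
between_individuals = within_populations / 2  # 42.5

# Between-population share relative to the between-individual share.
ratio_percent = 100.0 * between_populations / between_individuals
print(round(ratio_percent, 1))
```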

Additionally, Edwards (2003) claims in his essay "Lewontin's Fallacy" that: "It is not true, as Nature claimed, that 'two random individuals from any one group are almost as different as any two random individuals from the entire world'" and Risch et al. (2002) state "Two Caucasians are more similar to each other genetically than a Caucasian and an Asian." These two statements are not equivalent. Risch et al. simply state that two indigenous individuals from the same geographical region are more similar to each other than either is to an indigenous individual from a different geographical region, a claim few would argue with. Jorde et al. put it like this:

The picture that begins to emerge from this and other analyses of human genetic variation is that variation tends to be geographically structured, such that most individuals from the same geographic region will be more similar to one another than to individuals from a distant region.[2]

Edwards, by contrast, claims that it is not true that the differences between individuals from different geographical regions represent only a small proportion of the variation within the human population (that is, he claims that within-group differences between individuals are not almost as large as between-group differences). Bamshad et al. (2004) used the data from Rosenberg et al. (2002) to investigate the extent of genetic differences between individuals within continental groups relative to genetic differences between individuals from different continental groups. They found that although these individuals could be classified very accurately into continental clusters, there was a significant degree of genetic overlap at the individual level: using 377 loci, individual Europeans were more genetically similar to East Asians than to other Europeans about 38% of the time.

A 2009 study by the HUGO Pan-Asian SNP Consortium, using the similar method of principal components analysis, found that East Asian and South-East Asian populations clustered together, and suggested a common origin for these populations. At the same time the authors observed a broad discontinuity between this cluster and South Asia, commenting that "most of the Indian populations showed evidence of shared ancestry with European populations". They also noted that "genetic ancestry is strongly correlated with linguistic affiliations as well as geography".[7]

Criticism

The Rosenberg study has been criticised on several grounds.

The existence of allelic clines and the observation that the bulk of human variation is continuously distributed have led some scientists to conclude that any categorization schema attempting to partition that variation meaningfully will necessarily create artificial truncations (Kittles & Weiss 2003). It is for this reason, Reanne Frank argues, that attempts to allocate individuals into ancestry groupings based on genetic information have yielded varying results that are highly dependent on methodological design.[8] Serre and Pääbo (2004) make a similar claim:

The absence of strong continental clustering in the human gene pool is of practical importance. It has recently been claimed that “the greatest genetic structure that exists in the human population occurs at the racial level” (Risch et al. 2002). Our results show that this is not the case, and we see no reason to assume that “races” represent any units of relevance for understanding human genetic history.

In a response to Serre and Pääbo (2004), Rosenberg et al. (2005) make three relevant observations. Firstly, they maintain that their clustering analysis is robust. Secondly, they agree with Serre and Pääbo that membership of multiple clusters can be interpreted as evidence for clinality (isolation by distance), though they add that it may also be due to admixture between neighbouring groups (small island model). Thirdly, they comment that evidence of clusteredness is not evidence for any concept of "biological race".[9]

Risch et al. (2002) state that "two Caucasians are more similar to each other genetically than a Caucasian and an Asian", but Bamshad et al. (2004)[10] used the same data set as Rosenberg et al. (2002) to show that Europeans are more similar to Asians 38% of the time than they are to other Europeans when only 377 microsatellite markers are analysed.

Percentage similarity between two individuals from different clusters when 377 microsatellite markers are considered.[11]
Cluster               Africans  Europeans  Asians
Europeans             36.5
Asians                35.5      38.3
Indigenous Americans  26.1      33.4       35

In agreement with the observation of Bamshad et al. (2004), Witherspoon et al. (2007) have shown that many more than 326 or 377 microsatellite loci are required in order to show that individuals are always more similar to individuals in their own population group than to individuals in different population groups, even for three distinct populations.[5]

Witherspoon et al. (2007) have argued that even when individuals can be reliably assigned to specific population groups, it may still be possible for two randomly chosen individuals from different populations/clusters to be more similar to each other than to a randomly chosen member of their own cluster. They found that many thousands of genetic markers had to be used in order for the answer to the question "How often is a pair of individuals from one population genetically more dissimilar than two individuals chosen from two different populations?" to be "never". This assumed three population groups separated by large geographic ranges (European, African and East Asian). The entire world population is much more complex and studying an increasing number of groups would require an increasing number of markers for the same answer. Witherspoon et al. conclude that "caution should be used when using geographic or genetic ancestry to make inferences about individual phenotypes."
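The effect of marker count on this question can be illustrated with a small simulation. The sketch below is not the procedure used by Witherspoon et al.; it assumes two hypothetical populations whose allele frequencies differ modestly (0.4 versus 0.6) at every locus, and the function names, sample sizes and distance measure are illustrative choices. It estimates how often a within-population pair is more dissimilar than a between-population pair, once with few loci and once with many.

```python
import random
from itertools import combinations, product

def simulate_population(rng, allele_freqs, n_individuals):
    """Draw diploid genotypes (0, 1 or 2 copies of an allele per locus)."""
    return [
        [(rng.random() < p) + (rng.random() < p) for p in allele_freqs]
        for _ in range(n_individuals)
    ]

def distance(a, b):
    """Genetic distance as the summed per-locus difference in allele counts."""
    return sum(abs(x - y) for x, y in zip(a, b))

def dissimilarity_fraction(pop_a, pop_b):
    """Fraction of comparisons in which a within-population pair is more
    dissimilar than a between-population pair."""
    within = [distance(x, y) for pop in (pop_a, pop_b)
              for x, y in combinations(pop, 2)]
    between = [distance(x, y) for x, y in product(pop_a, pop_b)]
    larger = sum(w > b for w in within for b in between)
    return larger / (len(within) * len(between))

def run(n_loci, seed=1, n_individuals=20, shift=0.1):
    rng = random.Random(seed)
    # Hypothetical allele frequencies: 0.4 in one population, 0.6 in the other.
    pop_a = simulate_population(rng, [0.5 - shift] * n_loci, n_individuals)
    pop_b = simulate_population(rng, [0.5 + shift] * n_loci, n_individuals)
    return dissimilarity_fraction(pop_a, pop_b)

few_loci = run(n_loci=10)
many_loci = run(n_loci=500)
print(few_loci, many_loci)
```

Under these assumptions the fraction falls as loci are added but does not reach zero at a few hundred loci, which is the qualitative pattern the paragraph above describes.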

Clustering does not particularly correspond to continental divisions. Depending on the parameters given to their analytical program, Rosenberg and Pritchard were able to construct divisions of between 4 and 20 clusters of the genomes studied, although they excluded analyses with more than 6 clusters from their published article. Probability values for the various cluster configurations varied widely; the single most likely configuration had 16 clusters, although other 16-cluster configurations had low probabilities. Overall, "there is no clear evidence that K=6 was the best estimate" according to geneticist Deborah Bolnick (2008:76-77).[12] The number of genetic clusters is set by the user of the software, and the number emphasized in the study was chosen arbitrarily: although the original research used different numbers of clusters, the published study emphasized six. Rosenberg later revealed that his team used pre-conceived numbers of genetic clusters from six to twenty “but did not publish those results because Structure [the computer program used] identified multiple ways to divide the sampled individuals”. Dorothy Roberts, a law professor, asserts that “there is nothing in the team's findings that suggests that six clusters represent human population structure better than ten, or fifteen, or twenty.”[13] When instructed to find two clusters, the program identified two populations anchored by Africa and by the Americas. In the case of six clusters, the entirety of the Kalash people, an ethnic group living in northern Pakistan, was added to the previous five.[14][15]
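The dependence on a user-chosen number of clusters can be illustrated with a much simpler algorithm than STRUCTURE. The sketch below uses plain k-means on simulated genotypes; it is a stand-in, not the Bayesian model that STRUCTURE implements, and the populations, allele frequencies and function names are hypothetical. Whatever value of k is passed in, every individual receives one of k labels, so the output alone does not indicate whether that k was appropriate.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Plain k-means with deterministic farthest-point initialisation.
    Returns one cluster label (0..k-1) per point."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from its nearest existing centre.
        centers.append(max(points,
                           key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def simulate(rng, freq, n_individuals, n_loci):
    """Diploid genotypes: 0, 1 or 2 copies of an allele at each locus."""
    return [[(rng.random() < freq) + (rng.random() < freq)
             for _ in range(n_loci)] for _ in range(n_individuals)]

rng = random.Random(0)
# Two hypothetical, strongly differentiated populations of 15 individuals.
genotypes = simulate(rng, 0.1, 15, 50) + simulate(rng, 0.9, 15, 50)

labels_k2 = kmeans(genotypes, k=2)  # user asks for two clusters
labels_k3 = kmeans(genotypes, k=3)  # same data, user asks for three
print(labels_k2)
print(labels_k3)
```

With k=2 the two simulated populations are separated; with k=3 the same data are still partitioned, now using up to three labels, even though only two populations were simulated.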

Dorothy Roberts asserts that “the study actually showed that there are many ways to slice the expansive range of human genetic variation.” In a 2005 paper, Rosenberg and his team acknowledged that the findings of a study on human population structure are highly influenced by the way the study is designed.[15][16]

They reported that the number of loci, the sample size, the geographic dispersion of the samples and assumptions about allele-frequency correlation all affect the outcome of the study. Rosenberg stated that their findings “should not be taken as evidence of our support of any particular concept of biological race (...). Genetic differences among human populations derive mainly from gradations in allele frequencies rather than from distinctive 'diagnostic' genotypes.”[17] The study's overall results confirmed that between 93 and 95% of genetic variation is found within populations. Only 5% of genetic variation is found between groups.[15]
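The apportionment of variation within and between groups discussed in this article is commonly summarised by the fixation index (FST). As a rough illustration, the sketch below computes FST for a single biallelic locus from expected heterozygosities, using FST = (HT − HS)/HT. The function name and the allele frequencies are hypothetical values chosen for illustration, not data from any cited study, and equal subpopulation sizes are assumed.

```python
def fst_biallelic(freqs):
    """F_ST for one biallelic locus, given the frequency of one allele
    in each subpopulation (equal subpopulation sizes assumed)."""
    if not freqs:
        raise ValueError("need at least one subpopulation")
    # Mean within-subpopulation expected heterozygosity, H_S.
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    # Expected heterozygosity of the pooled population, H_T.
    p_bar = sum(freqs) / len(freqs)
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # locus is fixed everywhere; no variation to apportion
    return (h_t - h_s) / h_t

# Hypothetical allele frequencies in two populations.
example = fst_biallelic([0.3, 0.7])
print(round(example, 3))      # between-population share of variation
print(round(1 - example, 3))  # within-population share of variation
```

With these illustrative frequencies the between-population share comes to about 16% and the within-population share to about 84%, which is of the same order as the figures quoted in the text.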

Controversy over genetic clustering and its association with “race”

In the late 1990s, Harvard evolutionary geneticist Richard Lewontin stated that “no justification can be offered for continuing the biological concept of race. (...) Genetic data shows that no matter how racial groups are defined, two people from the same racial group are about as different from each other as two people from any two different racial groups.”[18]

Lewontin's statement came under attack when new genomic technologies permitted the analysis of gene clusters. In 2003, British statistician and evolutionary biologist A. W. F. Edwards faulted Lewontin for basing his conclusions on a simple comparison of genes rather than on the more complex structure of gene frequencies. Edwards charged that Lewontin had made an “unjustified assault on human classification, which he deplored for social reasons.”[19]

According to Roberts, “Edwards did not refute Lewontin's claim: that there is more genetic variation within populations than between them, especially when it comes to races. (...) Lewontin did not ignore biology to support his social ideology (...). To the contrary, he argued that there is no biological support for the ideological project of race.” “The genetic differences that exist among populations are characterized by gradual changes across geographic regions, not sharp, categorical distinctions. Groups of people across the globe have varying frequencies of polymorphic genes, which are genes with any of several differing nucleotide sequences. There is no such thing as a set of genes that belongs exclusively to one group and not to another. The clinal, gradually changing nature of geographic genetic difference is complicated further by the migration and mixing that human groups have engaged in since prehistoric times. Race [however defined] collapses infinite diversity into a few discrete categories that in reality cannot be demarcated genetically.”[15]

Genetic clustering was also criticized by Penn State anthropologists Kenneth Weiss and Brian Lambert. They asserted that understanding human population structure in terms of discrete genetic clusters misrepresents the path that produced diverse human populations that diverged from shared ancestors in Africa. “Ironically, by ignoring the way population history actually works as one process from a common origin rather than as a string of creation events, structure analysis that seems to present variation in Darwinian evolutionary terms is fundamentally non-Darwinian.”[20]

In 2006, Lewontin wrote that any genetic study requires some a priori concept of race or ethnicity in order to package human genetic diversity into a defined, limited number of biological groupings. Informed by genetics, zoologists have long discarded the concept of race for dividing up groups of non-human animal populations within a species. Depending on the criteria used, widely varying numbers of races could be distinguished within the same species. Lewontin notes that genetic testing revealed that “because so many of these races turned out to be based on only one or two genes, two animals born in the same litter could belong to different 'races'”.[21]

Studies that seek to find genetic clusters are only as informative as the populations they sample. For example, Risch and Burchard relied on two or three local populations from five continents, which together were supposed to represent the entire human race.[15] Another genetic clustering study used three sub-Saharan population groups to represent Africa; Chinese, Japanese, and Cambodian samples for East Asia; and Northern European and Northern Italian samples to represent “Caucasians”. Entire regions, subcontinents, and landmasses are left out of many studies. Furthermore, social geographical categories such as “East Asia” and “Caucasians” were not defined. “A handful of ethnic groups to symbolize an entire continent mimic a basic tenet of racial thinking: that because races are composed of uniform individuals, anyone can represent the whole group” notes Roberts.[15][22][23]

The “Big Few” model fails when overlooked geographical regions such as India are included. A 2003 study that examined fifty-eight genetic markers found that Indian populations traced their ancestral lineages to Africa, Central Asia, Europe, and southern China.[24][25] Reardon, from Princeton University, asserts that flawed sampling methods are built into many genetic research projects. The Human Genome Diversity Project (HGDP) relied on samples which were assumed to be geographically separate and isolated.[26] The relatively small samples of indigenous populations in the HGDP do not represent the human species' genetic diversity, nor do they portray the migrations and mixing of population groups that have been occurring since prehistoric times. Geographic areas such as the Balkans, the Middle East, North and East Africa, and Spain are seldom included in genetic studies.[15][27] East and North African indigenous populations, for example, are never selected to represent Africa because they do not fit the profile of “black” Africa. The sampled indigenous populations of the HGDP are assumed to be “pure”; Roberts claims that “their unusual purity is all the more reason they cannot stand in for all the other populations of the world that [are] marked by intermixture from migration, commerce, and conquest.”[15]

King and Motulsky, in a 2002 Science article, state that “While the computer-generated findings from all of these studies offer greater insight into the genetic unity and diversity of the human species, as well as its ancient migratory history, none support dividing the species into discrete, genetically determined racial categories”.[28] Cavalli-Sforza asserts that classifying clusters as races would be a “futile exercise” because “every level of clustering would determine a different population and there is no biological reason to prefer a particular one.” Bamshad, in a 2004 paper published in Nature, asserts that a more accurate study of human genetic variation would use an objective sampling method. Such a method would choose populations randomly and systematically across the world, including populations characterized by historical intermingling, instead of cherry-picking population samples that fit an a priori concept of racial classification. Roberts states that “if research collected DNA samples continuously from region to region throughout the world, they would find it impossible to infer neat boundaries between large geographical groups.”[10][15][29][30]

Anthropologists such as C. Loring Brace,[31] philosophers Jonathan Kaplan and Rasmus Winther,[32][33][34] and geneticist Joseph Graves[35] have argued that while it is certainly possible to find biological and genetic variation that corresponds roughly to the groupings normally defined as "continental races", this is true for almost all geographically distinct populations. The cluster structure of the genetic data is therefore dependent on the initial hypotheses of the researcher and the populations sampled. When one samples continental groups the clusters become continental; if one had chosen other sampling patterns the clustering would be different. Weiss and Fullerton have noted that if one sampled only Icelanders, Mayans and Maoris, three distinct clusters would form and all other populations could be described as being clinally composed of admixtures of Maori, Icelandic and Mayan genetic materials.[36] Kaplan and Winther therefore argue that, seen in this way, both Lewontin and Edwards are right in their arguments. They conclude that while racial groups are characterized by different allele frequencies, this does not mean that racial classification is a natural taxonomy of the human species, because multiple other genetic patterns can be found in human populations that crosscut racial distinctions. Moreover, the genomic data underdetermine whether one wishes to see subdivisions (i.e., splitters) or a continuum (i.e., lumpers). In Kaplan and Winther's view, racial groupings are objective social constructions (see Mills 1998 [37]) that have conventional biological reality only insofar as the categories are chosen and constructed for pragmatic scientific reasons.

Commercial ancestry testing and individual ancestry

Commercial ancestry testing companies, which use genetic clustering data, have also been heavily criticized. The limitations of genetic clustering are intensified when inferred population structure is applied to individual ancestry. The type of statistical analysis conducted by scientists translates poorly into individual ancestry because it looks at differences in frequencies, not absolute differences between groups. Commercial genetic genealogy companies are guilty of what Pilar Ossorio calls the “tendency to transform statistical claims into categorical ones”.[38] Not only individuals of the same local ethnic group but even two siblings may end up being assigned to different continental groups or “races”, depending on the alleles they inherit.[15]

Many commercial companies use data from HapMap's initial phase, in which population samples were collected from four ethnic groups: Han Chinese, Japanese, Yoruba Nigerian, and Utah residents of Northern European ancestry. If a person has ancestry from a region for which the computer program has no samples, it will compensate with the closest sample, which may have nothing to do with the customer's actual ancestry: “Consider a genetic ancestry testing performed on an individual we will call Joe, whose eight great-grandparents were from southern Europe. The HapMap populations are used as references for testing Joe's genetic ancestry. The HapMap's European samples consist of “northern” Europeans. In regions of Joe's genome that vary between northern and southern Europe (such regions might include the lactase gene), the genetic ancestry test using the HapMap reference population is likely to incorrectly assign the ancestry of that portion of the genome to a non-European population because that genomic region will appear to be more similar to the HapMap's Yoruba or Han Chinese samples than to Northern European samples.”[39] Likewise, a person with East African ancestors may be classified as having part North European and part Western African ancestry.[40] “Telling customers that they are a composite of several anthropological groupings reinforces three central myths about race: that there are pure races, that each race contains people who are fundamentally the same and fundamentally different from people in other races, and that races can be biologically demarcated.” Many companies base their findings on inadequate and unscientific sampling methods. Researchers have never sampled the world's populations in a systematic and random fashion.[15]
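The reference-panel problem described above can be sketched in a few lines. The example below is a simplified illustration and not any company's actual method: it assigns a set of genotypes to whichever reference population makes them most probable, assuming Hardy-Weinberg proportions and independent loci. The panel names, allele frequencies and genotypes are all hypothetical. Because the answer must be one of the panel's populations, a person from an unsampled population is still assigned to some reference group.

```python
import math

def log_likelihood(genotype, freqs):
    """Log-probability of diploid genotypes (allele counts 0/1/2 per locus)
    given per-locus allele frequencies, assuming Hardy-Weinberg proportions
    and independent loci."""
    total = 0.0
    for g, p in zip(genotype, freqs):
        prob = {0: (1 - p) ** 2, 1: 2 * p * (1 - p), 2: p ** 2}[g]
        total += math.log(prob)
    return total

def best_reference(genotype, panel):
    """Name of the reference population with the highest likelihood.
    The answer is always one of the populations in the panel."""
    return max(panel, key=lambda name: log_likelihood(genotype, panel[name]))

# Hypothetical allele frequencies at four loci for a three-population panel.
panel = {
    "reference_A": [0.9, 0.8, 0.2, 0.1],
    "reference_B": [0.2, 0.3, 0.8, 0.9],
    "reference_C": [0.5, 0.6, 0.5, 0.4],
}

# A genotype typical of reference_A ...
typical_a = [2, 2, 0, 0]
# ... and one from a hypothetical unsampled population whose frequencies
# match none of the panel entries closely.
unsampled = [1, 2, 1, 0]

print(best_reference(typical_a, panel))
print(best_reference(unsampled, panel))
```

The second individual is reported as belonging to one of the three reference populations even though, by construction, none of them is the true source.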

Geographical and continental groupings

Roberts argues against the use of broad geographical or continental groupings: “molecular geneticists routinely refer to African ancestry as if everyone on the continent is more similar to each other than they are to people of other continents, who may be closer both geographically and genetically.”[15] Ethiopians have closer genetic affinity with Armenians and Norwegians than with Bantu populations.[41] Similarly, Somalis are genetically more similar to Gulf Arab populations than to other populations in Africa.[42] Braun and Hammonds (2008) assert that the misperception of continents as natural population groupings is rooted in the assumption that populations are natural, isolated, and static. Populations came to be seen as “bounded units amenable to scientific sampling, analysis, and classification”.[43] Human beings are not naturally organized into definable, genetically cohesive populations.

Usage in scientific journals

Some scientific journals have addressed previous methodological errors by requiring more rigorous scrutiny of population variables. Since 2000, Nature Genetics has required its authors to “explain why they make use of particular ethnic groups or populations, and how classification was achieved.” The editors of Nature Genetics say that “[they] hope that this will raise awareness and inspire more rigorous design of genetic and epidemiological studies.”[44]

References

  1. Lua error in package.lua at line 80: module 'strict' not found.
  2. Lynn B Jorde & Stephen P Wooding, 2004, "Genetic variation, classification and 'race'", Nature Genetics 36, S28–S33
  3. Lua error in package.lua at line 80: module 'strict' not found.
  4. Lua error in package.lua at line 80: module 'strict' not found.
  5. 5.0 5.1 Lua error in package.lua at line 80: module 'strict' not found.
  6. Sarich VM, Miele F. Race: The Reality of Human Differences. Westview Press (2004). ISBN 0-8133-4086-1
  7. Mapping Human Genetic Diversity in Asia, The HUGO Pan-Asian SNP Consortium, 2009
  8. Back with a Vengeance: the Reemergence of a Biological Conceptualization of Race in Research on Race/Ethnic Disparities in Health Reanne Frank
  9. Lua error in package.lua at line 80: module 'strict' not found.
  10. 10.0 10.1 Lua error in package.lua at line 80: module 'strict' not found.
  11. The table gives the percentage likelihood that two individuals from different clusters are genetically more similar to each other than to someone from their own population when 377 microsatellite markers are considered from Lua error in package.lua at line 80: module 'strict' not found., original data from Rosenberg (2002).
  12. Lua error in package.lua at line 80: module 'strict' not found.
  13. Lua error in package.lua at line 80: module 'strict' not found.
  14. Lua error in package.lua at line 80: module 'strict' not found.
  15. 15.00 15.01 15.02 15.03 15.04 15.05 15.06 15.07 15.08 15.09 15.10 15.11 Lua error in package.lua at line 80: module 'strict' not found.
  16. Lua error in package.lua at line 80: module 'strict' not found.
  17. Lua error in package.lua at line 80: module 'strict' not found.
  18. Lua error in package.lua at line 80: module 'strict' not found.
  19. Lua error in package.lua at line 80: module 'strict' not found.
  20. Lua error in package.lua at line 80: module 'strict' not found.
  21. Lua error in package.lua at line 80: module 'strict' not found.
  22. Lua error in package.lua at line 80: module 'strict' not found.
  23. Lua error in package.lua at line 80: module 'strict' not found.
  24. Lua error in package.lua at line 80: module 'strict' not found.
  25. Lua error in package.lua at line 80: module 'strict' not found.
  26. Lua error in package.lua at line 80: module 'strict' not found.
  27. Lua error in package.lua at line 80: module 'strict' not found.
  28. Lua error in package.lua at line 80: module 'strict' not found.
  29. Lua error in package.lua at line 80: module 'strict' not found.
  30. Lua error in package.lua at line 80: module 'strict' not found.
  31. Loring Brace, C. 2005. Race is a four letter word. Oxford University Press.
  32. Kaplan, Jonathan Michael (January 2011) 'Race': What Biology Can Tell Us about a Social Construct. In: Encyclopedia of Life Sciences (ELS). John Wiley & Sons, Ltd: Chichester
  33. Winther, Rasmus Grønfeldt (2011) ¿La cosificación genética de la 'raza'? Un análisis crítico in C López-Beltrán (ed.) Genes (&) Mestizos. Genómica y raza en la biomedicina mexicana. Ficticia editorial http://philpapers.org/archive/WINLCG.1.pdf
  34. Kaplan, Jonathan Michael, Winther, Rasmus Grønfeldt (2012). Prisoners of Abstraction? The Theory and Measure of Genetic Variation, and the Very Concept of 'Race' Biological Theory 7 http://philpapers.org/archive/KAPPOA.14.pdf
  35. Graves, Joseph. 2001. The Emperor's New Clothes. Rutgers University Press
  36. Lua error in package.lua at line 80: module 'strict' not found.
  37. Mills CW (1998) "But What Are You Really? The Metaphysics of Race" in Blackness visible: essays on philosophy and race, pp. 41-66. Cornell University Press, Ithaca, NY
  38. Lua error in package.lua at line 80: module 'strict' not found.
  39. Lua error in package.lua at line 80: module 'strict' not found.
  40. Lua error in package.lua at line 80: module 'strict' not found.
  41. Lua error in package.lua at line 80: module 'strict' not found.
  42. Lua error in package.lua at line 80: module 'strict' not found.
  43. Lua error in package.lua at line 80: module 'strict' not found.
  44. Lua error in package.lua at line 80: module 'strict' not found.