We now have the tools to describe the pattern of genetic variation1,2 across the whole genome and its relationship to the history of human origins and the differential distribution of diseases across populations and geography1,2,3,4,5. We can begin to dissect common complex diseases1,2,3,4,5 and devise new therapeutic strategies to reduce adverse drug reactions, a key public health problem ranking between the fourth and sixth leading cause of death in the US6,7,8. At the social level, the new genomic tools can help us to better appreciate the fluidity of social identity, including 'race', 'ethnicity' and the more complex notion of ancestry9,10,11,12,13. Challenges surrounding the design of large-scale genotyping projects such as the international HapMap initiative and their future applications illustrate the complexities and ambiguities associated with the use of group labels in genomic research. Depending on how we use this information, the potential exists to describe simultaneously our similarities and differences without reaffirming old prejudices.

Genetic basis of common diseases and the HapMap project

Researchers have identified the genetic basis underlying several mendelian (single-gene) disorders using linkage studies in families with affected individuals. These success stories created an unrealistic expectation for the resolution power of linkage studies to unravel the genetic basis of common diseases in human populations3. In contrast to mendelian disorders, no single factor is either necessary or sufficient for describing the etiology of common diseases14. An individual's risk is the result of the complex interplay between an unknown constellation of genetic variants, environmental factors, lifestyle characteristics and some stochastic processes15. We are beginning to appreciate the complexity of common diseases, and our initial optimism, driven to some extent by candidate gene and genome-wide linkage approaches, has been tempered by the modest success rate. Notwithstanding this humbling reality, we continue to intensify our efforts, and as a result, we have achieved some success with asthma16,17,18, cancer19, diabetes20,21,22, Alzheimer disease23, deep vein thrombosis24, inflammatory bowel disease25,26, schizophrenia27 and stroke28.

The international HapMap project explores patterns of DNA sequence variation in the human genome4,5. Successful completion of the HapMap project will furnish scientists with powerful new tools for identifying genetic variants that contribute to common diseases and to differential drug response and for developing new diagnostic tools4,5,29,30. The HapMap project is predicated on the common disease–common variant (CDCV) hypothesis, which assumes that complex diseases are influenced by genetic variants (single-nucleotide polymorphisms, SNPs) that are relatively common in human populations30. If this hypothesis holds true, the HapMap project has the potential to advance the field of genetic epidemiology by facilitating association studies of candidate genes, chromosome regions or the whole genome without knowing the function of putative variants4,5.

Association studies offer greater statistical power than linkage studies in examining the genetic basis of complex diseases when the risk variant is common in the population29,31. The HapMap project will help develop an efficient and comprehensive catalog of common variants by contributing to our understanding of the patterns of linkage disequilibrium (LD) in multiple human populations4. LD is the tendency for alleles (in the case of the HapMap, SNPs) at separate sites in the genome to be found together more frequently than would be expected by chance32.
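The definition of LD above can be made concrete with a small calculation. The following is a minimal sketch, not part of the HapMap methodology: it assumes two biallelic sites with phased haplotype counts, and the function name `ld_stats` and the counts are hypothetical illustrations. It computes the classic disequilibrium coefficient D (observed haplotype frequency minus the frequency expected if the alleles were independent) and the commonly used summary r².

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r_squared) for two biallelic sites.

    Arguments are counts of the four possible haplotypes
    (hypothetical, illustrative inputs; alleles A/a at site 1, B/b at site 2).
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes observed")

    p_AB = n_AB / total              # observed frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / total      # frequency of allele A at site 1
    p_B = (n_AB + n_aB) / total      # frequency of allele B at site 2

    # D: excess of the A-B haplotype over the frequency expected by chance
    d = p_AB - p_A * p_B

    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("a site is monomorphic; r^2 is undefined")
    return d, d * d / denom


# Alleles found together more often than chance predicts (made-up counts)
d_linked, r2_linked = ld_stats(40, 10, 10, 40)

# Alleles distributed independently (made-up counts)
d_indep, r2_indep = ld_stats(25, 25, 25, 25)

print(f"linked:      D = {d_linked:.3f}, r^2 = {r2_linked:.3f}")
print(f"independent: D = {d_indep:.3f}, r^2 = {r2_indep:.3f}")
```

When the two alleles are independent, D and r² are both zero. The further the observed haplotype frequency departs from the chance expectation, the larger both become.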

Some geneticists and statisticians have reservations regarding whether the HapMap will have sufficient resolution to be useful for understanding the genetics of common diseases across multiple populations14,15,33,34,35,36,37,38. Reasons for the skepticism include what may be called a 'rush to judgment'. Some believe that the initial findings that seem to support the CDCV hypothesis may be the exception and not the rule14,15,33,34,35,36,37,38. For example, some common diseases may result from the effects of multiple variants that are individually rare14,29,36,37,38. This scenario, if true, has multiple implications for study design and analytic strategies for understanding common diseases. Because rare, as opposed to common, variants are probably differentially distributed in populations, a project like the HapMap will have to sample many more populations and develop much more detailed genetic maps than currently recommended. And some have argued that this problem may be magnified several times for African populations with more divergent patterns of genetic variation35,37,38. Indeed, all bets will be off for the CDCV hypothesis and most other current analytic strategies if the manifestation of complex diseases is the result of the interacting effects of a set of rare variants (<1%) on other sets of rare (or even rarer) variants in the context of a changing environment33.

Ethical issues and nonmedical uses of the HapMap project

Although the HapMap project is designed to answer medical questions, it would be naive to think that results of future studies based on the HapMap data will not be applied to the emotional and volatile issues surrounding group identity and their subsequent correlation with health and social outcomes5,39,40. As discussed by Morris Foster, “the proposed haplotype map project cannot be considered in isolation from the more general, ongoing discussion of the implications of using socially constructed identities in genetic research. Nor can it be considered apart from prior efforts to catalogue human genetic diversity and the controversies that surrounded them.” This is particularly relevant because some may attempt to use the HapMap data to validate old notions of 'race' and its correlation with multiple phenotypes (e.g., behavioral characteristics), making use of population identifiers recorded in the HapMap project.

The inclusion of population and ethnic labels in the HapMap project (Yoruba, Han Chinese, Japanese and Americans of northern and western European descent)4,5 is a source of considerable debate5. Although the project used population samples rather than racial or ethnic groupings, by concentrating on common variants, the project ran the risk that this first approximation of human population structure might be subsequently used to reinforce existing racial or ethnic categories or even taken as evidence for a new categorization of human stereotypes. Why did the designers of the HapMap take this risk? Were guidelines put in place to reduce potential harm to individual participants and their communities? Could the HapMap project have been implemented without this population and ethnic information? How would this have affected the team's goal of including the broadest spectrum of common human genetic variations?

Several rationales supported the decision to include population and ethnic labels. First, available data show differences in haplotype structure and frequency across populations (e.g., African populations tend to have shorter haplotype lengths than non-African populations do)35,41. Second, having the population information will make it possible to choose the most efficient sets of SNPs for association studies. Third, removing population labels may create a false sense of protection from collective risks (e.g., stereotyping) because this information can easily be reconstructed given publicly available information including the names of the researchers and institutions involved in the project. Also, it would not be difficult to discern the identity of participating populations from previously collected data sets. Fourth, identifying the populations will allow HapMap researchers and ethicists to provide better context for interpreting the biological importance of genetic findings that are associated with particular population identities5.

I argue that the HapMap team of investigators selected the most appropriate design for the specific hypothesis of CDCV and have gone to great lengths to address most of these issues. For example, extensive strategies for community engagement were used to discuss potential harms and benefits. Scientists have been given descriptive guidelines for interpreting group findings and are advised to present their data in ways that avoid stigmatizing groups, conveying an impression of genetic determinism or attaching inappropriate levels of biological importance to largely social constructs such as race5. But the challenges associated with correct and ethical use of the HapMap results are ongoing and will probably manifest themselves in unanticipated ways.

Genetic variation and social identity

To reap the full benefits of the Human Genome Project and spin-offs like the HapMap project, we must be willing to move beyond old and simplistic interpretations of differential frequencies of disease variants by poorly defined social proxies of genetic relatedness like 'race'. We should allow the genome to teach us the extent of our evolutionary history without abbreviating it with preconceived notions of population boundaries and social identities. We must recognize that social identities are formed in various ways—ancestry, ethnic and tribal background, geopolitical boundaries, language, and other social and behavioral activities42,43. Identities change over time and from one context to another. Their use as markers of 'relatedness' in genetic research without appreciation for how they were formed is likely to produce misleading information concerning the distribution of genetic variation7.

We all have a common birthplace somewhere in Africa44,45, and this common origin is the reason why we share most of our genetic information46,47. Our common history also explains why contemporary African populations have more genetic variation than younger human populations that migrated out of Africa 100,000 years ago to populate other parts of the world, carrying with them a subset of the existing genetic information44,45.

Given this shared history, why do we interpret human genetic variation data as though our differences rise to the level of subspecies? Two facts are relevant: (i) as a result of different evolutionary forces, including natural selection, there are geographical patterns of genetic variations that correspond, for the most part, to continental origin46,47,48; and (ii) observed patterns of geographical differences in genetic information do not correspond to our notion of social identities, including 'race' and 'ethnicity'46,47,48. In this regard, no matter what categorical framework is applied, we cannot consistently use genetics to define racial groups without classifying some human populations as exceptions10. Our evolutionary history is a continuous process of combining the new with the old, and the end result is a mosaic that is modified with each birth and death. This is why the process of using genetics to define 'race' is like slicing soup: “You can cut wherever you want, but the soup stays mixed”49.

How can we grasp the population structure of our species? I believe this requires universal awareness that genomic information cannot be used either to confirm or to refine old social, political and economic classifications such as 'race'. In particular, we should understand the following points: (i) individuals in genetics studies may have membership in more than one biogeographical cluster; (ii) the borders of these clusters are not distinct; and (iii) population clusters are influenced by sampling strategies47,48. For example, the inference drawn from a study with one or two African populations will probably be very different from that drawn from a study with 100 African populations sampled from north, east, west, central and south Africa. As Steve Olson observed, “Not only do all people have the same set of genes, but all groups of people also share the major variants of those genes. Geneticists have never found a genetic marker that is of one type in all the members of one large group and of a different type in all the members of another large group”50. Furthermore, because most alleles are widespread, genetic differences among human populations are the result of gradations in allele frequencies rather than distinctive diagnostic genotypes46,48.

Differential distribution of disease genes across populations

Genetic variations in human populations are distributed in a nonrandom manner1. For example, a greater degree of genetic variation is seen in present-day African populations, resulting in more haplotypes, lower levels of LD, more divergent patterns of LD and more complex patterns of population substructure35. As observed by Reich and collaborators, LD in a sample of Yoruba individuals from Nigeria extends only to an average distance of 5 kb, compared with 60 kb in a Mormon population of European descent51. Similarly, Gabriel et al. report an average haplotype block size of 11 kb in their Yoruba and African American samples, compared with 22 kb in European and Asian samples52. The nonrandom pattern of genetic variation by populations has implications for mapping disease genes and for understanding how population and genomic diversity have influenced evolution, differentiation and adaptation of humans35,41.

The impact of the forces of evolution, including adaptation (natural selection), on the differential distribution of disease genes is currently better understood in the context of the worldwide distribution of monogenic traits41. Gene variants that cause monogenic diseases are more common in some populations than others53. But the worldwide distribution of these genetic variants does not always follow our usual definition of continental populations or social groups. For example, Tay-Sachs disease is more frequent among individuals of Ashkenazi Jewish ancestry53, but people without known Jewish ancestry also have mutations in the gene responsible for the disease53. Cystic fibrosis, though more common in people of European ancestry, is found in other groups, including those of African and Arab ancestry53. A notable example of the signature of selection on the genome is provided by the human need to survive the mosquito-borne disease malaria53. Because gene variants of hemoglobin (HbS and HbE) and glucose-6-phosphate dehydrogenase provide survival advantages for populations who lived in the area where malaria was endemic, they have been maintained at high frequency, despite the fact that they cause multiple hemopathologies41,53,54,55.

Sickle cell anemia is a good example by which we can evaluate some consequences of ethnic labeling of genetic traits. Though more frequently observed in populations of African descent, it is found in a wide range of people including Hispanic people and inhabitants of northwestern India and areas around the Mediterranean56. The label 'black disease', however, rendered the distribution of sickle cell anemia invisible in other populations56, leading to erroneous understanding of the geographical distribution of the underlying genetic variants. This is one reason why many people, including physicians, are unaware that the town of Orchomenos in central Greece has a rate of sickle cell anemia that is twice that of African Americans and that black South Africans do not carry the sickle-cell trait56,57.

As the following case illustrates, labeling this disease on the basis of phenotype (skin color) resulted in serious health consequences to individuals who are not phenotypically 'black' but have the relevant genetic variants. An 8-year-old boy, phenotypically European, presented with acute abdominal pain and anemia (hematocrit 0.21). Although his body temperature was only 37.9 °C, surgery was considered. A technician found red corpuscles with hemolytic characteristics on a smear. Surgery was canceled after the results of a subsequent sickle preparation were found to be positive, and the child was treated for previously undiagnosed sickle cell anemia. His parents were from Grenada and were of Indian, northern European and Mediterranean ancestry58. This case highlights the idea that ancestry is a better indicator than 'race' or 'ethnicity' of whether one carries the markers for sickle cell anemia, Tay-Sachs disease, cystic fibrosis or other genetic diseases.

The geographical distribution of genes associated with common diseases is less skewed by populations and by geographic origin than that of monogenic diseases59. For example, the ε4 allele of APOE is found in all populations, albeit at varying frequencies60,61. Those carrying a variant of the ε4 allele have a greater risk of developing Alzheimer disease. The frequency of the ε4 allele ranges from 9% in Japanese individuals to 14% in populations of European descent to 19% in African Americans60. There is no evidence supporting the view that common functional variants are organized in discrete racial or ethnic categories. In contrast, available data show that coding sequences are conserved across populations and common polymorphisms are usually old and are therefore shared62. The take-home message is that variation is continuous, it is discordant with race, and the future categorization of groups for drug development and treatment will probably not correspond to our current sociopolitical group definitions.
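To see what these allele frequencies imply at the level of individuals, here is a minimal sketch using the three ε4 frequencies quoted above. It rests on one explicit assumption that the text does not make: Hardy-Weinberg proportions within each group, under which the expected fraction of people carrying at least one copy of an allele with frequency p is 1 − (1 − p)². The function name `carrier_fraction` is a hypothetical illustration.

```python
def carrier_fraction(allele_freq):
    """Expected fraction of individuals carrying at least one copy of an allele.

    ASSUMPTION (not from the article): genotypes follow Hardy-Weinberg
    proportions, so non-carriers occur with probability (1 - p)^2.
    """
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    return 1.0 - (1.0 - allele_freq) ** 2


# APOE e4 allele frequencies as quoted in the text (ref. 60)
reported = {
    "Japanese": 0.09,
    "European descent": 0.14,
    "African American": 0.19,
}

for group, p in reported.items():
    print(f"{group}: allele frequency {p:.2f}, "
          f"expected carriers {carrier_fraction(p):.1%}")
```

Under this assumption, roughly 17%, 26% and 34% of individuals in the three groups would carry at least one ε4 allele. The groups differ by degree, and a substantial share of every group carries the variant, so a group label alone says little about any one person's genotype.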

Genetic variation: implications for drug development

Studies of genetic variation among population groups have implications for development of drugs aimed to reduce adverse reactions. As high-throughput genotyping methodologies are applied to large populations, the opportunity exists to develop genetic tests that will allow scientists and physicians to tailor medicine to individuals and to groups defined by a collection of specific genetic variants63. Unfortunately, the new genomic information is being interpreted along old familiar social labels such as 'race' and 'ethnicity'. An important side effect of this phenomenon is the promotion of 'ethnicity'- or 'race'-based medicine. In recent years, there has been a flurry of newspaper articles with titles like “Shouldn't a Pill Be Color Blind?” and “Are 'racialized drugs' a marketing ploy or part of medical progress?” (ref. 64).

This heated debate reached a boil with the announcement of the first and only trial to test the efficacy of a drug, BiDil, in treating heart failure in a sample consisting only of African Americans (ref. 64). BiDil is a combination drug (two vasodilators, hydralazine and isosorbide dinitrate) designed to restore low or depleted nitric oxide levels in the blood to treat or prevent cases of congestive heart failure65. The trial, which was cosponsored by the Association of Black Cardiologists and supported by the National Medical Association and members of the Congressional Black Caucus, was recently halted because the drug was so effective66. The seeming success of an 'ethnic drug' and the support for this trial by leading African Americans highlight the confusion surrounding race and biology. They also illustrate the potential of genomic research to contribute inadvertently to the harmful effects of using 'race' as a variable in research. Similar thinking on the part of some physicians led to the uncritical acceptance of the decision not to treat African Americans suffering from chronic heart failure with inhibitors of angiotensin-converting enzyme67. Despite the favorable preliminary results of the BiDil trial in African Americans, earlier concerns about the development of 'ethnicity'- or 'race'-based drugs, including the potential to exacerbate health disparity, remain valid68,69,70. 'Race' or 'ethnicity' is an inadequate proxy for the subset of human populations that are likely to benefit from a certain drug.

We have come full circle in biomedical research. Due to social, political and economic forces, biomedical research was almost exclusively conducted in people, especially men, of European descent. The results of such studies were then extrapolated to other groups. In the end, the BiDil story will have a similar outcome; if the drug continues to be effective in the treatment of heart failure, the subset of individuals with heart failure who will benefit from the drug will not be accurately described by the label 'African American'. It is reasonable to assume that the distribution of genetic variants underlying the effectiveness of this drug will not be limited to African Americans. Moreover, it is important to note that the label used to designate the African American population in studies like the clinical trial for BiDil is too imprecise to be relevant for individual therapy. Some members of this population 'supergroup' with heart failure will benefit from this drug, and others will not. More importantly, some members of other ethnic groups will probably benefit from this drug as well. In this regard, and because of ethnic binning, a drug like BiDil, which may help other ethnic groups, could never achieve its full potential.

'Race'-based hypotheses in biomedical research sell. Reporting the nuances underlying group differences does not and, more importantly, will probably not receive the same attention in the popular press. Unfortunately, instant notoriety can be attained by reporting genetic explanations for 'racial' differences in disease, at least in North America. For example, the College of Medicine at Howard University71 made the front page of the New York Times and received a considerable amount of air time in other venues when it was erroneously reported that the university was developing a 'black' genetic biobank to understand the genetic basis of health disparity. Although it is the intention of Howard University to facilitate the development of a biobank to house demographic, epidemiologic, clinical and genetic materials in populations of the African diaspora to study the complex interplay between environmental and genetic factors in the etiology of diseases, only the genetic component of this huge infrastructural development was reported. Given the polarized atmosphere of race relations in the US, it is easy to see why the Howard University story gained that much currency. The story was too good to resist: “Howard University, Race and Genetics!”, a readymade controversy.


Future clinical trials may be driven by the delineation of subpopulations using DNA polymorphisms as opposed to current imprecise classifications such as 'race', 'ethnicity' or skin color63. Polymorphism-based stratification of populations is expected to reduce adverse reaction to drugs and facilitate the identification of genetic variants that confer resistance or predispositions to many diseases63. In this regard, and if successful, genomic data in the context of drug development may contribute to the deconstruction of 'race' and other imprecise group definitions as currently applied. Until we achieve the ultimate goal of genotype-based medicine, however, drug developers and health-care providers will struggle with how to interpret differential drug response by groups when group definition is imprecise, fluid and time-dependent2,56,63. Similarly, they will struggle with whether an individual's response to a drug or other medical interventions can be inferred from group data.