Introduction

The completion of the human genome project has generated new enthusiasm and opportunities in the life sciences. It has provided the tools needed to understand the genetic basis of diversity among individuals, the most common familial traits, evolutionary processes, and complex and common diseases such as diabetes, obesity, hypertension and psychiatric disorders, and to develop genome-based medicinal drugs (Emilien et al. 2000). It is generally thought that the genomes of two randomly selected individuals differ by approximately 0.1%. This variation is called polymorphism, and it arises because of mutations. Several comparative studies on identical and fraternal twins (Martin et al. 1997) and siblings suggest that DNA polymorphism is one of the factors associated with susceptibility to many common diseases (Table 1), with human traits such as curly hair, with individuality and with inter-individual differences in drug response. DNA sequence variation is also considered to be responsible for genome evolution. Based on these observations, it has been proposed that by cataloging the DNA polymorphisms in different populations and in different species, it may be possible (a) to develop genome-based knowledge on the susceptibility of an individual to many common diseases, (b) to develop safer and more effective individualized diets and medications for patients, and (c) to understand evolutionary processes. However, many experts believe that single nucleotide polymorphism (SNP) technology faces several challenges before it can make an impact on medicine. What follows is a brief discussion of the above three aspects, with emphasis placed on evolution.

Table 1 A partial list of diseases associated with single nucleotide polymorphisms

Detection and analysis of DNA polymorphism

The simplest form of DNA variation among individuals is the substitution of one single nucleotide for another. This type of change (Fig. 1A) is called an SNP. It is estimated that SNPs occur at a frequency of 1 in 1,000 bp throughout the genome. These simple changes can be transitions (purine to purine or pyrimidine to pyrimidine) or transversions (purine to pyrimidine or vice versa). According to one report (Halushka et al. 1999), approximately 50% of SNPs are in noncoding regions, 25% lead to missense mutations (coding SNPs or cSNPs), and the remaining 25% are silent mutations (they do not change the encoded amino acids). These silent SNPs are called synonymous SNPs, and it is most likely that they are not subject to natural selection (but see below). On the other hand, nonsynonymous SNPs (nSNPs, which change the encoded amino acids) may produce pathology and may be subject to natural selection. SNPs (both synonymous and nonsynonymous) can influence promoter activity and pre-mRNA conformation (or stability). They can also alter the ability of a protein to bind its substrate or inhibitors (Kimchi-Sarfaty et al. 2007) and change the subcellular localization of proteins (nSNPs). Therefore, they may be responsible for disease susceptibility, medicinal drug disposition and genome evolution. Although several of them affect the functions of genes, many of them are not deleterious to organisms and must have escaped selection pressure. To identify SNPs, several private and public organizations have undertaken massive efforts to develop high-throughput SNP genotyping methods over the past 20 years (reviewed in Shastry 2002, 2005). As a result of these efforts, a large collection of SNPs is now available from the human genome project (http://www.ncbi.nlm.nih.gov/SNP/snp_summary.cgi).
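
The categories described above can be illustrated with a short sketch. It is not taken from any of the cited studies: the function names are hypothetical, the standard nuclear genetic code is assumed, and codons are assumed to be given as uppercase DNA on the coding strand.

```python
from itertools import product

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def substitution_type(ref, alt):
    """Return 'transition' or 'transversion' for a single-base change."""
    if ref == alt:
        raise ValueError("reference and alternative bases are identical")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def coding_effect(codon, position, alt):
    """Classify a change at a 0-based position within a codon."""
    mutated = codon[:position] + alt + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        return "synonymous"
    if after == "*":
        return "nonsense"
    return "nonsynonymous"


# GAA (Glu) -> GAG (Glu) is silent; GAA (Glu) -> GAT (Asp) is not.
print(substitution_type("A", "G"), coding_effect("GAA", 2, "G"))
print(substitution_type("A", "T"), coding_effect("GAA", 2, "T"))
```

A real annotation pipeline would also have to handle strand, reading frame and splice sites, which this sketch ignores.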

Fig. 1A–C

A schematic representation of single nucleotide polymorphism (SNP) (A), a haplotype (B), and the relationship between the genotype and variation in drug response among three individuals (C). In A, strings of nucleotides at which individuals A and B differ are shown. In B, a long stretch of DNA with distinctive patterns of SNPs at a given location of a chromosome is shown. Haplotype diversity may be generated by new SNP alleles. In C, two horizontal lines denote a pair of homologous genes and the symbol X indicates polymorphism in the gene

SNPs in gene discovery

Because some diseases are hereditary, one immediate goal of the human genome project is to find out which genes predispose people to various disorders and how the sequence variation in a gene affects the functions of its product. As mentioned previously, SNPs occur frequently throughout the genome. Therefore, they can be used as markers to identify disease-causing genes by an association study (Gray et al. 2000). In such studies, it is assumed that two closely located alleles (gene and marker) are inherited together. Therefore, a simple comparison of patterns of genetic variations between patients and normal individuals may provide a method of identifying the loci responsible for disease susceptibility (Hirschhorn and Daly 2005). One advantage of this method is that it does not need a large family. However, several limitations such as population structure, different levels of linkage disequilibrium (LD) (see below) in loci, and epistatic interaction of alleles may impose difficulties. Despite these limitations, there has been some success in identifying the association between polymorphisms and diseases (Table 1).
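
The comparison of allele frequencies between patients and controls can be sketched as a chi-square test on a 2 x 2 table of allele counts. This is a minimal illustration and not the procedure of any cited study; the function name and the counts are hypothetical, and it ignores the population structure and epistasis mentioned above.

```python
import math


def allelic_association(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 degree of freedom) on allele counts.

    Returns (chi_square, p_value, odds_ratio).
    """
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    cases = case_alt + case_ref
    controls = ctrl_alt + ctrl_ref
    alt = case_alt + ctrl_alt
    ref = case_ref + ctrl_ref
    if min(cases, controls, alt, ref) == 0:
        raise ValueError("every row and column needs at least one count")

    cross = case_alt * ctrl_ref - case_ref * ctrl_alt
    chi_square = n * cross ** 2 / (cases * controls * alt * ref)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    denominator = case_ref * ctrl_alt
    odds_ratio = (case_alt * ctrl_ref) / denominator if denominator else math.inf
    return chi_square, p_value, odds_ratio


# Hypothetical counts: 100 patients and 100 controls, two alleles each.
chi_square, p_value, odds_ratio = allelic_association(60, 140, 40, 160)
print(f"chi2={chi_square:.2f} p={p_value:.3f} OR={odds_ratio:.2f}")
```

In a genome-wide setting the p-value threshold would need correction for the number of markers tested.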

Unfortunately, however, this type of whole-genome approach to mapping requires the genotyping of thousands of samples. Although several high-throughput methods are available for these studies, they are expensive and laborious and cannot be undertaken by many laboratories. Therefore, a different procedure called haplotype analysis (a haplotype is a collection of SNPs on a single chromosome at a locus that is inherited in blocks, Fig. 1B) has been used to identify common disease genes (Hirschhorn and Daly 2005). Because the genome undergoes recombination involving large stretches of DNA, there may be several SNPs linked together in such a large region of DNA. These closely linked SNPs may then be cotransmitted from generation to generation in these large blocks (Reich et al. 2001). This phenomenon is called LD. In this type of analysis, multiple genotypes are reduced to haplotypes and hence only a small number of SNPs are required to map the disease gene. Therefore, this method can be more effective in gene mapping and can also provide substantial statistical power in association studies (McVean et al. 2005). However, for conducting genetic association studies, a putative polymorphism must be validated. The deleterious effect of SNPs should be evaluated in the context of a relevant haplotype because this is more accurate than evaluating a single SNP (Fig. 2). Additionally, previous studies have shown that in the human genome many variants may be common to all populations and others may have a very restricted distribution (Salisbury et al. 2003). Hence, the use of haplotypes in disease gene mapping requires additional research. In this regard, the recently characterized second type of DNA variation, called copy number variants (CNVs), which show marked variations among populations and individuals, may be helpful (Sharp et al. 2005; Locke et al. 2006).
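
The strength of LD between two biallelic SNPs is commonly summarized by the statistics D, D' and r². The sketch below computes them from phased haplotype counts. The function name and the counts are hypothetical, and phased haplotypes are assumed to be available, which is often not the case with raw genotype data.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D_prime, r_squared) for two biallelic sites.

    Arguments are counts of the four haplotypes, where A/a are the
    alleles at the first SNP and B/b the alleles at the second.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total

    variance_term = p_A * (1 - p_A) * p_B * (1 - p_B)
    if variance_term == 0:
        raise ValueError("both sites must be polymorphic")

    # D is the departure of the AB frequency from independence.
    d = p_AB - p_A * p_B
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max else 0.0
    r_squared = d * d / variance_term
    return d, d_prime, r_squared


# Hypothetical counts: two SNPs that are always inherited together.
print(linkage_disequilibrium(50, 0, 0, 50))
# Hypothetical counts: two SNPs that segregate independently.
print(linkage_disequilibrium(25, 25, 25, 25))
```

When r² between two SNPs is close to 1, genotyping one of them carries nearly the same information as genotyping both, which is why a small number of SNPs can represent a haplotype block.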

Fig. 2

A hypothetical haplotype and drug response. Individual A shows no response, individuals B and D show an increased response, whereas individual C shows a decreased response to a drug. The horizontal arrows denote no change in gene activity, whereas upward and downward arrows represent increased and decreased gene activity, respectively. Haplotype analysis may give a more accurate prediction of drug response

Pharmacogenetics and pharmacogenomics

Single nucleotide polymorphism technologies are also applicable in the development of individualized medicine. Over the past 20 years it has become increasingly clear that genetic polymorphism in genes encoding drug-metabolizing enzymes, drug transporters and receptors contributes, at least in part, to the inter-individual variability in drug response (reviewed in Shastry 2003, 2004; Evans and Johnson 2001). These factors affect drug absorption, distribution, metabolism and excretion. As a result, some drugs work better in some patients than in others, and some drugs may be highly toxic to certain patients (Ansari and Krajinovic 2007). This type of adverse drug reaction has been observed in several diseases, such as pulmonary hypertension, epilepsy, cardiac arrhythmia, renal cell carcinoma, leukemia and liver cancer (reviewed in Shastry 2006a, 2006b; Roses 2000). To understand the relationship between heritable changes in genes and inter-individual variations in drug response, two related fields, namely pharmacogenetics and pharmacogenomics (Dervieux and Bala 2006), emerged and gained popularity in the late 1990s. Researchers in these fields have undertaken massive studies on the genetic personalization of drug response (McLeod and Evans 2001). As a result, there are several high-density SNP maps of genes encoding proteins of medical importance (Iida et al. 2002, 2003), and there is now strong evidence (Table 2) that links SNPs to inter-individual differences in drug response (Nothen and Cichon 2002; Ansari and Krajinovic 2007). Patients with more active drug-metabolizing enzymes may require higher doses of the drug, and those who do not have an active enzyme may exhibit toxicity (Fig. 1C).

Table 2 A partial list of genetic polymorphisms associated with drug response

However, it should be noted that there are many negative results regarding the association of gene polymorphism and drug efficacy and toxicity. In addition to these negative results there are other problems. For example, drug metabolism or variation in drug response includes dozens of genes, and many of these often have multiple polymorphisms. Moreover, there are inducible genes, signaling molecules and environmental factors that may also contribute to variable drug response. The greatest challenge for the future (Roden et al. 2006) is to understand the genotypic–environmental factor interaction, ethnicity, inheritance patterns in drug response and how genetic variance responds to medicine. If the goals of pharmacogenetics and pharmacogenomics are fulfilled, it may allow clinicians to genetically subdivide and profile individual patients and treat each patient according to their genetic make-up. This type of medical practice may gradually replace the current trial-and-error-based selection of medicine in the future. However, at present, studies do not unambiguously prove the clinical value of pharmacogenetic testing.

Nutrigenetics and nutrigenomics

It is well known that certain monogenic disorders are associated with the interaction between variant genes and nutrients. The best example is phenylketonuria. Several recent population-based and intervention studies (Subbiah 2007; Ordovas and Mooser 2004; Ordovas et al. 2002a) support this gene–nutrient interaction. A polymorphism on its own may not have an effect, but nutrients may modulate the expression of genes. For instance, a significant variation in the low-density lipoprotein cholesterol level has been shown to be associated with the A-204C variant in the CYP7 gene (Couture et al. 1999), and the high-density lipoprotein cholesterol concentration is determined by the C-514T polymorphism in the hepatic lipase gene (Zhang et al. 2005; Couture et al. 2000). Similarly, the high-density lipoprotein cholesterol level is modulated by a genetic polymorphism (-75G/A) in the promoter region of the apolipoprotein A1 (APOA1) gene (Ordovas et al. 2002b). Therefore, an understanding of the genetic make-up of an individual may lead to the development of an individualized diet. This may reduce diet-related disease risk more efficiently in some common multifactorial disorders (Ordovas and Mooser 2004). Because of this important relationship between gene–nutrient interactions and human health, two recently developed multidisciplinary fields, namely nutrigenetics and nutrigenomics, are exploring the possibility of developing personalized diets based on the genetic make-up of an individual.

SNPs in evolution

Genetic variants are considered to be responsible not only for disease risk and inter-individual differences, but also for molecular evolution. Genetic evolution depends in part upon a balance between natural selection and environmentally driven mutation. Natural selection maintains the amino acid type and position among species because these amino acids are critical for protein function. Therefore, in a given set of homologous genes, certain amino acids are highly conserved, even among distantly related species that diverged hundreds of millions of years ago (Fig. 3). These conserved residues are evolving under strong selective pressure. Deleterious mutations that affect the biological functions of proteins are effectively eliminated from the gene pool by natural selection. The selection pressure against deleterious SNPs depends upon the molecular functions of proteins, and genes that encode transcriptional regulatory proteins are generally found to be under the strongest selective pressure (Ramensky et al. 2002).

Fig. 3A–B

Protein sequence alignment of the mutant part of human frizzled-4 with that of other species. A conservative change in codon 256 causes pathology in the patient (A), whereas a radical change in a less conserved residue is nonpathogenic (B). h, human; m, mouse; r, rat; X, Xenopus; Z, zebrafish; and g, chicken

Because SNPs are present at all levels of evolution, including the branch point of speciation, they can be used to study sequence variation among species. Additionally, the rate, type and site of substitution as well as the selection pressure on codons are not uniform throughout the given gene. Therefore, if genetic variants are fixed during evolution, then they may have either selection advantages for the organism, they may be neutral regarding the fitness, or they may be deleterious and thus cause pathology. Hence, a comparative genomic study of disease-associated SNPs can be used to understand the relationship between the pathology and evolution.

Evolutionarily conserved regions more frequently contain disease-associated SNPs

The retention of variants by natural selection is considered to be an important step in evolution. According to the neutral theory of evolution (Kimura 1983), those amino acids that vary among species or those SNPs that do not occur in protein coding regions are either not subjected to natural selection or are under less selective pressure. This is because such amino acid changes can be tolerated and they only minimally affect protein functions. This may imply that such amino acids are more stable and less mutable. However, nSNPs in the coding regions of human genes may have phenotypic effects (Bao and Cui 2005) and may undergo natural selection. Hence, by comparing the rate ratio (omega) of nonsynonymous to synonymous changes (which is considered to be a measure of selective pressure on amino acid replacement mutations) in several proteins from several different species, raw evolutionary data can be generated.
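
As a rough illustration of how omega can be estimated, the sketch below assumes that the numbers of nonsynonymous and synonymous differences and sites between two aligned coding sequences have already been counted, and applies the Jukes-Cantor correction for multiple hits at the same site. This simplified counting approach is a choice made here for illustration, not a method named in the text above; the function names and input numbers are hypothetical.

```python
import math


def jukes_cantor(p):
    """Correct a proportion of differing sites for multiple hits."""
    if not 0 <= p < 0.75:
        raise ValueError("proportion must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def omega(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Return dN/dS from counts of differences and sites."""
    d_n = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    d_s = jukes_cantor(syn_diffs / syn_sites)
    if d_s == 0:
        raise ValueError("dS is zero, so omega is undefined")
    return d_n / d_s


def interpret(w, tolerance=0.05):
    """Conventional reading of omega; the tolerance is arbitrary."""
    if w < 1 - tolerance:
        return "purifying (negative) selection"
    if w > 1 + tolerance:
        return "positive selection"
    return "consistent with neutrality"


# Hypothetical counts for one pair of orthologous genes.
w = omega(nonsyn_diffs=10, nonsyn_sites=700, syn_diffs=30, syn_sites=300)
print(round(w, 3), interpret(w))
```

Values of omega well below 1 indicate that amino acid replacements are being removed by selection, whereas values above 1 suggest that replacements are favored.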

In protein coding genes, patterns of selection can be inferred from amino acid substitution patterns (Jiang and Zhao 2006). One interesting example is the distribution of disease-associated and nonpathogenic mutations in human genes. For instance, by using disease-associated mutation data and multiple species of a phylogenetic lineage, it has been shown that disease-associated substitutions (DAS) occur more frequently in evolutionarily conserved positions (nonrandom distribution) than in positions that are undergoing variation (Subramanian and Kumar 2006; Miller and Kumar 2001). On the other hand, the opposite trend has been observed for silent and polymorphic mutations, which are randomly distributed. These patterns reinforce the logic that the conserved region of the protein is under evolutionary pressure because these amino acids are critical to the proper functioning of a given gene. Silent mutations, by contrast, have minimal effects on the organism and may not be subjected to natural selection, which is consistent with their random distribution. However, polymorphic mutations of variable (nonconserved) amino acids may have moderately deleterious effects on the organism, and it is likely that such effects are tolerated and hence evolution may be more relaxed at these nucleotides. Similarly, a comparative study of the human and chimpanzee genomes indicates that some human-specific traits could be due to positive selection, whereas loci for complex disorders could involve negative selection (Kehrer-Sawatzki and Cooper 2007; Patterson et al. 2006).
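
One simple way to identify evolutionarily conserved positions in a multiple sequence alignment is to score each column by its Shannon entropy, where an entropy of zero means that every species carries the same residue. The sketch below uses this measure as an assumption, since the cited studies may have used other conservation scores. The toy alignment and function names are hypothetical, and gap characters are treated like any other symbol.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def conserved_positions(aligned, max_entropy=0.0):
    """Return 0-based columns whose entropy does not exceed max_entropy."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    return [
        index
        for index, column in enumerate(zip(*aligned))
        if column_entropy(column) <= max_entropy
    ]


# Hypothetical four-residue alignment from three species.
toy_alignment = ["MKCW", "MRCW", "MKCF"]
print(conserved_positions(toy_alignment))
```

Disease-associated substitutions could then be tallied separately for conserved and variable columns to test whether they are distributed nonrandomly.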

Another simple example is the myostatin gene (a negative regulator of skeletal muscle growth). When two different human populations and other mammals are compared for the myostatin gene (Saunders et al. 2006), the number of highly conserved replacement mutations over the evolutionary time scale is greater (five) than the number of silent mutations (three). These data suggest a positive natural selection in the highly conserved region of the myostatin gene because, according to the neutral model of molecular evolution, the ratio of replacement to silent changes does not differ within and between species. However, at present it is not known what types of specific traits are associated with these five replacement changes and what kinds of selection advantages they may have for the species. Additional studies in the future may provide some answers to these questions.
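
The comparison of replacement and silent changes within and between species is usually framed as a 2 x 2 contingency test (often called the McDonald-Kreitman test, a name not used in the text above). The sketch below uses the five replacement and three silent fixed differences mentioned in the text, but the within-species polymorphism counts are hypothetical placeholders, because they are not given here. The function names are also hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = prob(a)
    total = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= observed * (1 + 1e-9)
    )
    return min(1.0, total)


def neutrality_index(fixed_repl, fixed_silent, poly_repl, poly_silent):
    """(Pn/Ps) / (Dn/Ds); values below 1 suggest excess fixed replacements."""
    return (poly_repl / poly_silent) / (fixed_repl / fixed_silent)


# Fixed differences from the text; polymorphism counts are hypothetical.
fixed_repl, fixed_silent = 5, 3
poly_repl, poly_silent = 1, 9
print(neutrality_index(fixed_repl, fixed_silent, poly_repl, poly_silent))
print(fisher_exact_two_sided(fixed_repl, fixed_silent, poly_repl, poly_silent))
```

Under the neutral model the two ratios should be similar, so a significant excess of fixed replacement changes is read as evidence of positive selection.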

Substitution patterns in the regulatory regions of DNA and noncoding RNA

Natural selection not only operates on protein coding genes but also at the RNA level and on the noncoding regions (regulatory regions) of the DNA. A similar distribution pattern to that discussed above has also been observed for DAS and SNPs in regulatory regions of genes and in RNAs that do not code for proteins (Keightley and Gaffney 2003). For example, it is estimated that at the genomic level, the deleterious point mutation rate is similar between noncoding and coding DNA. Moreover, deleterious mutations in noncoding DNA have quantitative effects, which means that these variations can produce complex genetic diseases (Keightley and Gaffney 2003). Similarly, using SNP genotypic data, it has been shown that negative selection in humans is stronger on conserved microRNA (miRNA binds to the target sites in the 3′-untranslated region of mRNA to repress the translation) binding sites than on other conserved sequence motifs in the 3′-untranslated regions (Saunders et al. 2007). This illustrates the importance of miRNAs to Darwinian fitness (Chen and Rajewsky 2006). Interestingly, a comparison between the miRNA and target sites shows a relatively low level of variation in the functional regions of miRNA and an appreciable level of variation in target sites. Some of these SNPs create novel target sites for miRNA and are found at relatively high frequencies in human populations. If some of these variants have functional effects, they may be involved in phenotypic differences and hence may undergo positive selection. Similarly, an evolutionary comparison using entire classes of mammalian sequences has provided other evidence for the relationship between the pathogenicity of RMRP (RNA component of the mitochondrial RNA processing ribonuclease) mutations and evolutionarily conserved sequences (Bonafe et al. 2005). Although this RNA does not code for a protein, some regions of the RNA are critical to protein binding. 
The encoding gene is remarkably conserved between species, but disease-causing mutations are once again found in highly conserved nucleotides whereas nonpathogenic variants are located in the nonconserved positions (evolution is more relaxed in these nucleotides). This is consistent with other examples discussed above for the protein coding genes.

Selection pressure is not uniform at amino acid sites

It should also be noted that the types of amino acid substitutions observed between species differ from those observed in disease (Yang et al. 2000). For instance, among species glutamic acid is most commonly replaced by aspartic acid (which is very similar) and phenylalanine is most commonly replaced by tyrosine. However, this trend has not been observed in disease. When a total of 4,236 mutations in 436 genes causing Mendelian disease (monogenic etiology) and 1,037 synonymous and nSNPs in 313 human genes are compared, a significantly larger contribution at arginine and glycine (and to some extent lysine) is observed in human genetic diseases (Vitkup et al. 2003). This is not the type of change accepted by natural selection. Additionally, a random mutation at a tryptophan or cysteine residue has the highest probability of causing a disease. This agrees with our understanding of the high evolutionary contribution of these residues, namely the involvement of tryptophan (Trp) and cysteine (Cys) in determining protein stability. Thus, selection pressure is not uniform among codons (Arbiza et al. 2006), and in many cases the mutation of a highly conserved codon causes pathology (Fig. 3A).

Radical and less radical SNPs cause early- and late-onset diseases, respectively

Just as the types of amino acid substitutions differ between species and diseases, selection pressure also varies between species and disease depending on the properties of the amino acids involved. Substitutions that involve a larger chemical difference (radical) are more likely to produce disease phenotypes than those that involve a smaller chemical difference (less radical). Substitutions with smaller chemical differences are mostly observed among species. As mentioned above, mutations with a larger chemical difference are the most likely to be removed from the population over a long period of time because they are likely to be deleterious. On the other hand, radical changes in variable positions (Fig. 3B) are more likely to be tolerated (they may not have large effects on protein functions) than those in highly conserved positions. Hence these positions do not undergo strong selection. Interestingly, early-onset diseases (which are more damaging) are found to be associated with more radical amino acid mutations, and as a consequence these positions are expected to undergo strong selection. In the same way, late-onset diseases are associated with less radical amino acid mutations, which are not abundant in evolutionarily conserved positions. These less radical amino acid mutations are often associated with common diseases such as diabetes and hypertension. Because they are involved in late-onset diseases, they may have a smaller effect on fertility and hence these positions may not undergo strong natural selection. In short, comparative genomic studies of homologous gene sequences from both closely and distantly related species predict that evolution and DAS (pathology) are interrelated. Those residues that evolve under strong selective pressure are likely to be significantly associated with human disease (Arbiza et al. 2006). These types of studies also give us some understanding of the types of variations that can be tolerated in a given gene over time.
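
The distinction between radical and less radical substitutions can be sketched by assigning each amino acid to a physicochemical class and asking whether a substitution crosses a class boundary. The grouping below is one of several possible schemes and is an assumption made for illustration. It is not the classification used in the cited studies, which may rely on quantitative distance measures instead.

```python
# Hypothetical physicochemical grouping of the 20 standard amino acids
# (one-letter codes). Other groupings would give different answers.
GROUPS = {
    "aliphatic": "GAVLIMP",
    "aromatic": "FWY",
    "polar": "STCNQ",
    "positive": "KRH",
    "negative": "DE",
}
CLASS_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def substitution_class(ref, alt):
    """Label a substitution 'less radical' if both residues share a class."""
    if ref == alt:
        raise ValueError("no substitution: residues are identical")
    return "less radical" if CLASS_OF[ref] == CLASS_OF[alt] else "radical"


# Glu -> Asp and Phe -> Tyr stay within a class; Arg -> Trp does not.
for ref, alt in [("E", "D"), ("F", "Y"), ("R", "W")]:
    print(ref, alt, substitution_class(ref, alt))
```

Under this grouping, the glutamic acid to aspartic acid and phenylalanine to tyrosine replacements that are common among species both count as less radical.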

Substitution patterns and rates at the chromosome level

Although a lengthy discussion on this subject is not intended in this article, it is relevant to add that the evolutionary rates across the human chromosome are also not constant (Prendergast et al. 2007). Previous studies have predicted a relatively constant mutation rate across mammalian genomes. However, a recent analysis of human–mouse alignment suggests an approximately threefold difference in substitution rates across chromosomes. One of the factors that are found to be associated with mutation rates is the chromatin structure. The human genome contains two types of chromatin structures—closed and open. The open regions of genome are gene-dense and closed regions are relatively gene-poor (Gilbert et al. 2004). Housekeeping and tissue-specific genes are generally found in the more open and most closed regions of the genome, respectively. According to a recent study, the density of SNPs is higher in the most closed regions of the human genome, and genes in these regions also show the highest level of selection at synonymous sites. In fact, the average rate of nonsynonymous changes (dN) observed in human–mouse alignments is much higher in the most closed chromatin region of the genome than in the most open regions. Similarly, the ratio of nonsynonymous to synonymous substitution rates (dN/dS) is also higher, which indicates a strong selection. On the other hand, genes in the regions of open chromatin display the lowest mutation rates and the least constraints at the synonymous sites. However, the average synonymous rate (dS) for genes in relatively open chromatin is higher than that for genes in a closed chromatin structure. One of the explanations suggested by researchers for the lesser constraint in the regions of open chromatin is that open regions may be more accessible to repair mechanisms. On the other hand, as mentioned earlier, changes at synonymous sites do not affect the encoded amino acids. 
Therefore, a synonymous site would have to undergo relatively strong selection to evolve in a non-neutral condition. It is also possible that synonymous sites may experience constraints because they may have a role in RNA stability or splicing.

Fitness, gene pool and functional redundancy

These types of SNP studies (comparison of the relative fixation rates of silent and nSNPs) may allow us to trace the branch point of an evolutionary tree. At this branch point, the variants must have become advantageous for the species and fixed in the gene pool (Zhang et al. 2006). According to comparative genomics, those sequences that contribute to the fitness of an organism evolve slowly. For example, selenoproteins play an important role in antioxidant defense. When polymorphisms of six genes, namely glutathione peroxidases (GPX1, GPX2, GPX3, GPX4), thioredoxin reductase 1 (TXNRD1) and selenoprotein P (SEPP1), were compared in 102 individual populations representing four major ethnic groups, evidence for positive selection was found at the GPX1 locus (Foster et al. 2006). However, in the remaining five genes there was no strong evidence for selection, and hence they appear to follow the neutral equilibrium model of evolution. This may imply that they are functionally redundant. It is not clear at present whether this selective pressure on GPX1 is exerted to protect the genome from damaging oxidants or to reduce susceptibility to oxidative stress in erythrocytes, where it is mostly expressed, or both.

Similarly, the ability to digest lactose (present in milk) usually disappears in childhood in most human populations. However, in European-derived populations, lactase activity persists into adulthood. This type of lactase persistence could be due to multiple causes and it may also depend on the population under study. One interesting finding, however, is that when a region of 3.2 Mbp around the lactase gene, consisting of 101 SNPs, was typed in northern European and African populations, two alleles were found to be tightly associated with lactase persistence (Tishkoff et al. 2007; Coelho et al. 2005; Bersaglieri et al. 2004). This association could be due to strong positive selection resulting from animal domestication and adult milk consumption (advantages to the organism), and hence the alleles are fixed in the gene pool. In contrast, the human mannose binding lectin (MBL-2) allele (MBL is a member of the collectin protein family that binds a broad range of microorganisms) occurs at a high frequency worldwide (Verdu et al. 2006). This allele produces little or no protein and was shown to result from human migration and genetic drift. This evolutionary neutrality (with respect to fitness) of MBL-2 may also suggest that the MBL-2 allele is functionally redundant in human host defense.

Additional factors that may also contribute to the evolution of the human genome may include DNA methylation, genome duplication, deletions, insertions and the presence of introns (Tang et al. 2006). Insertions and deletions are collectively known as indels and they are approximately 300 bp in length. Because of their high frequency and wide distribution, indels are considered to be the strong driving force of evolution. In addition, the distribution patterns of DAS and nSNPs also show that positions that have many indels in other species contain more nSNPs than DAS. This is not due to the mutation rate, because an excess of nSNPs would be expected in positions with many indels if it was, and that is not found to be the case. Future studies using a recently characterized second type of DNA variation, called CNVs, which show marked variations among populations and individuals, may be helpful (Sharp et al. 2005; Locke et al. 2006) in understanding genetic diversity and evolution.

Concluding remarks

After the First International Meeting on SNPs in 1998, it was realized that SNP technologies may have an impact on healthcare. There is no doubt that clinicians, geneticists, patients and the public will benefit from the identification of genes underlying polygenic diseases and adverse drug reactions. Over the past ten years, tremendous progress has been made in cataloging human sequence variations since this high-density map will provide the necessary tools to develop genetically based diagnostic and therapeutic tests. When more functional polymorphisms have been identified, it may be possible to develop useful genetic markers as well as personalized medicines. If the concept of individualized medicine becomes more realistic, every newborn child in the neonatal unit may be genotyped in a routine procedure (similar to a blood transfusion procedure) for improved treatment. The newly developed fields of toxicogenomics, pharmacogenetics and nutrigenetics are rapidly advancing to achieve their goals.

Another interesting aspect of SNPs is that they can also be used to understand the molecular mechanisms of sequence evolution. Natural selection will maintain the amino acid type and retain the amino acid position among species because these amino acids are critical to protein function. Deleterious mutations that affect the biological functions of proteins are effectively being eliminated by natural selection from the gene pool. As discussed above, there is a clear evolutionary relationship between the positions and types of neutral and DAS in the human genome. Residues that evolve under strong selective pressure are found to be significantly associated with human diseases. These patterns are clearly different among species. In short, nucleotide substitutions that are fixed during evolution are either in some way advantageous for the organism, remain neutral regarding fitness, or become deleterious and thus cause pathology. Therefore, evolution and disease-causing nucleotide substitutions can be considered to be related to one another. In the future, it is hoped that research will uncover methods of making SNP markers useful tags for medical testing. Finding out how SNPs affect the health of an individual and then transforming this knowledge into the development of new medicines will undoubtedly revolutionize the treatments of the most common devastating disorders. At the same time, this knowledge will also help us to uncover the secrets of human genome evolution.