Introduction

Lactose is the major carbohydrate in milk and to be absorbed it needs to be broken down by hydrolysis in the intestinal tract. Lactase, an enzyme located in the brush border of the small intestine, catalyses the hydrolysis of lactose to monosaccharides, which are absorbed and serve as a source of energy.1 Milk is the major source of energy in infancy, and lactase activity at this age is universal. Lactase activity declines after the age of weaning in most mammals, although in humans this pattern is variable. Although most populations of the world have a low prevalence of lactase persistence, Northern European populations tend to have high prevalences.2 Lactose tolerance is traditionally thought of as an autosomal-dominant genetic trait, although the actual levels of lactase in the intestinal mucosa show a trimodal distribution, with very low levels in the people homozygous for the lactase non-persistence variant.3 The selective advantage of lactase persistence in milk-producing dairy farming populations, and the consequent expansion of milk production in successful populations, could have allowed for rapid selective pressures or an inceptive niche construction event,4 leading to increases in the prevalence of a genetic variant associated with lactase persistence.5 This is supported by molecular genetic evidence, which suggests strong recent selection occurring in the vicinity of the lactase gene.6, 7

Recently, a genetic variant associated with lactase persistence was identified,8 and the use of genotype data to avoid time-consuming and potentially misleading lactose tolerance tests has rendered study of lactase persistence more straightforward.9 Previous work has presented evidence of a complete correlation between lactase persistence and a common variant located 14 kb upstream of the lactase coding region (LCT) in the MCM6 gene.8 Further to this, extensive linkage disequilibrium for 1 Mb across LCT has been reported in a North European population.10 This work places the correlated allele on an extended common haplotype background and there is redundancy with respect to the genotyping of further variants associated with lactase persistence. Therefore, genotyping of the putatively causal C/T−13910 variant (rs4988235) will effectively capture the genetic variation highlighted by previous association studies.8, 10 The T allele of this single nucleotide polymorphism (SNP) is associated with lactase persistence and its prevalence has been shown to vary across Europe, being 70–80% in Northern European populations and 5–10% in Southern European populations.6

The geographic variation in the prevalence of lactase persistence raises the possibility that C/T−13910 may also vary geographically within a country such as Britain, as part of the north–south gradient seen across much of Europe.2

Such phenotypic correlations are important for the understanding of both why such geographical patterning may exist and also how such variation in genotype frequencies can influence the interpretation of gene–disease associations, given the potential for population stratification to generate spurious allelic association.11, 12, 13, 14, 15

Methods

We have used data from the British Women's Heart and Health Study. Full details of the selection of participants and measurements have been reported earlier.16, 17 Women aged 60–79 years were randomly selected from primary care lists in 23 British towns. A total of 4286 women participated and baseline data were collected between April 1999 and March 2001.

Women were asked whether they drank milk and what type of milk they drank (never drink milk or usually drink: full cream, semi-skimmed and skimmed). They were asked whether they had ever been diagnosed by a doctor as having osteoporosis and whether they had ever fractured their hip or wrist. Details of the methods for other phenotype assessment, determination of LCT C/T−13910 variant (rs4988235) genotypes and the study ethics are provided in the Supplementary web methods.

Statistical analyses

Logistic regression was used to assess the association between a self-report of never drinking milk versus drinking milk with osteoporosis and fractures. Prevalences (for dichotomous variables) and means (for continuous variables) are presented by genotype. Linear and logistic regressions were used for testing differences between genotypes for continuous and dichotomous variables, respectively. Further details of the statistical models are provided in the Supplementary web methods.

Results

Of the 4278 participants who gave consent for genetic testing, 15 (5 Afro-Caribbean, 8 South Asian and 2 other) were defined by the examining nurse as not being white and were excluded from further analysis. Of the remaining 4263 women, 3553 (83%) had DNA available for genotyping, and for 3344 (94%) of these women, the genotypic primary fluorescence data fell into one of three distinct clusters with positive signal for at least one allele. There was no difference in mean age (68.9 (5.5) versus 69.0 (5.7) years, P=0.4) and no difference in the prevalence of never drinking milk (2.8 versus 2.5%, P=0.6) between those with and without genotypic data. Similarly, mean longitude and latitude for both place of birth and place of residence were the same in those with and without genotypic data (all P>0.7). Supplementary web table 1 (see journal website) shows the characteristics of the participants included in the analyses presented in this paper. Table 1 shows genotype and allele frequencies for the LCT C/T−13910 variant (rs4988235) in this study sample. Observed frequencies matched those reported earlier for North European origin populations2 and were consistent with Hardy–Weinberg equilibrium in the total sample (P=0.2) and among those aged 60–69 years (P=0.2) and those aged 70–79 years (P=0.6).
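The Hardy–Weinberg check reported above can be illustrated with a short sketch. It assumes a chi-square goodness-of-fit test with one degree of freedom, which is a common choice but is not stated in this paper, and the genotype counts are illustrative values, not those of Table 1. The function name `hwe_chi_square` is hypothetical.

```python
import math

def hwe_chi_square(n_tt, n_ct, n_cc):
    """Chi-square goodness-of-fit test (1 df) for Hardy-Weinberg equilibrium.

    Returns the C allele frequency, the chi-square statistic and its P-value.
    """
    n = n_tt + n_ct + n_cc
    q = (2 * n_cc + n_ct) / (2 * n)  # C allele frequency from genotype counts
    p = 1 - q                        # T allele frequency
    # Expected genotype counts under Hardy-Weinberg proportions p^2, 2pq, q^2
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_tt, n_ct, n_cc)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail probability of a chi-square variable with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return q, chi2, p_value

# Illustrative counts only (assumption): NOT the study's Table 1 values
q, chi2, p_value = hwe_chi_square(n_tt=1879, n_ct=1238, n_cc=227)
print(f"C allele frequency={q:.3f}, chi2={chi2:.2f}, P={p_value:.2f}")
```

A large P-value indicates no evidence of departure from the expected proportions; an exact test would be preferable when any genotype count is small.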

Table 1 Genotype and allele frequency for rs4988235 among study participants

Milk consumption, bone health, lifestyle, anthropometric, vascular, metabolic and socioeconomic characteristic associations with genotype

Among women with genotypic data, 3143 (94%) responded to the question concerning milk consumption, and of these, 89 (2.8%) reported never drinking milk. Never drinking milk was not associated with socioeconomic position in childhood or adulthood, vegetarianism, frequency of consumption of fruit and vegetables, red or processed meat, type of cooking fat, physical activity or smoking (all P-values >0.5). However, women who reported never drinking milk were more likely than milk drinkers to have osteoporosis (age-adjusted odds ratio (95% confidence interval): 2.07 (1.10, 3.88), P=0.01) and were more likely to have a history of having had either a hip or wrist fracture (1.74 (1.09, 2.77), P=0.02).

Table 2 shows the characteristics of participants by genotype status and the P-values for the three genetic models of association that we tested. There was no evidence that those with either one or two copies of the lactase non-persistence (C) allele were different from homozygotes for the persistence (T) allele with respect to milk consumption, osteoporosis, fractures or calcium supplementation. Anthropometric, vascular and metabolic traits, socioeconomic position, lifestyle and fertility characteristics were largely unrelated to genotype. However, women who carried either one or two lactase non-persistence (C) alleles (ie, were CT or CC) had higher HDLc than those who were homozygous for the T allele (ie, TT): mean difference of 0.04 mmol/l (95% CI: 0.01, 0.07). Women who were homozygous for the C allele (CC) reported a higher prevalence of their general health being poor or fair than all other women (CT or TT), with an odds ratio of 1.38 (1.04, 1.84). There was no difference in the effect of genotype on HDLc levels or general health when stratified by drinking milk or not (P-value for interaction=0.48 for HDLc outcome and 0.85 for general health outcome). As anticipated, genotype was not associated with age. Therefore, although many of the characteristics presented in Table 2 are related to age, any association of genotype with these characteristics could not be explained by age effects. This is demonstrated in Supplementary Table 2 (see journal website), which presents the associations of genotype with participant characteristics after adjustment for age; the results are essentially identical to those presented in Table 2 in this paper.

Table 2 Prevalence and means (95% CI) of participant characteristics by genotype status among British women described as white by examining nurse and aged 60–79 years

Gene, place of birth and area of residence association

The prevalence of homozygotes for the non-persistence (C) allele in the whole population was 6.8% (95% CI: 6.0, 7.7), which is similar to the prevalence among those women (N=3224) who were born in Britain, that is 6.1% (95% CI: 5.3, 7.0). The non-persistence (C) allele frequency was also similar in the whole sample (0.253 (95% CI: 0.242, 0.263)) and in those born in Britain (0.246 (95% CI: 0.236, 0.257)).
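As a rough check on the intervals quoted above, the sketch below computes normal-approximation (Wald) confidence intervals. The interval method is an assumption, since the paper does not state which was used, and the helper name `wald_ci` is hypothetical. The sample size of 3344 genotyped women is taken from the Results; the allele frequency interval treats the two alleles carried by each woman as independent observations, which is reasonable under Hardy–Weinberg equilibrium.

```python
import math

def wald_ci(proportion, n, z=1.96):
    """Normal-approximation (Wald) 95% CI for a proportion based on n observations."""
    se = math.sqrt(proportion * (1 - proportion) / n)
    return proportion - z * se, proportion + z * se

N_WOMEN = 3344  # women with usable genotype data, as reported in the Results

# Genotype prevalence: one observation per woman
cc_low, cc_high = wald_ci(0.068, N_WOMEN)
# Allele frequency: two alleles per woman (assumes independence of alleles)
c_low, c_high = wald_ci(0.253, 2 * N_WOMEN)

print(f"CC prevalence 95% CI: {cc_low:.3f} to {cc_high:.3f}")
print(f"C allele frequency 95% CI: {c_low:.3f} to {c_high:.3f}")
```

Under these assumptions, both intervals fall within about 0.001 of the reported values; small discrepancies are expected because the inputs are themselves rounded.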

Of those women with genotype data, 3184 (95%) were born in Britain and had data on place of birth and all of these had data on area of residence. All results in this section are based on these 3184 women. Women with either one or two copies of the non-persistence (C) allele were more likely to have been born in the south of England and least likely to have been born in Scotland (Figure 1 and Table 3). A similar marked geographical variation for adult area of residence was also seen (Tables 4 and 5). The odds ratio (95% CI) of carrying a non-persistence (C) allele comparing those born in the south of England with all other areas was 1.63 (1.40, 1.91) and for comparing those living in the south of England with those living in all other areas in adulthood was 1.38 (1.19, 1.61). When both area of birth and area of adult residence were included simultaneously in a logistic regression model, the association between area of birth and possession of a minor allele remained largely unchanged (1.65 (1.33, 2.04)), whereas that with area of residence attenuated to the null (0.98 (0.80, 1.22)).
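The contrast between the two mutually adjusted odds ratios can be made concrete by recovering approximate test statistics from the reported intervals. This is a sketch that assumes the intervals are symmetric on the log odds scale, as is usual for logistic regression output; the function name `z_from_or_ci` is hypothetical, and the results are approximations from rounded figures rather than values reported by the study.

```python
import math

def z_from_or_ci(odds_ratio, lower, upper, z_crit=1.96):
    """Recover the SE of log(OR) from a 95% CI; return (se, z, two-sided P)."""
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(odds_ratio) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return se, z, p

# Mutually adjusted estimates quoted in the text above
se_birth, z_birth, p_birth = z_from_or_ci(1.65, 1.33, 2.04)  # area of birth
se_res, z_res, p_res = z_from_or_ci(0.98, 0.80, 1.22)        # area of residence

print(f"birth: z={z_birth:.2f}, P={p_birth:.1e}")
print(f"residence: z={z_res:.2f}, P={p_res:.2f}")
```

The statistic for area of birth lies far from zero, whereas that for area of residence is close to zero, which is what attenuation to the null means in this model.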

Figure 1 C (lactase non-persistence) allele frequency of rs4988235 by area of birth for participants in the British Women's Heart and Health Study.

Table 3 rs4988235 genotype and allele frequency by area of birth among women aged 60–79 years described by examining nurse as being white and who have complete data for genotype, area of birth and area of residence
Table 4 rs4988235 genotype and allele frequency by area of residence among women aged 60–79 years described by examining nurse as being white and who have complete data for genotype, area of birth and area of residence
Table 5 Non-persistence C allele frequency by town of residence in adulthood

When we examined frequency of the non-persistence (C) allele by town of residence in adulthood, we found a gradient of increasing frequency in the more southerly and easterly regions (Table 5). Because the women were born in a large number of different towns and for many towns there were only one or two women from the study born there, we are unable to present allele frequencies by town of birth. However, when we examined these associations using indicators of latitude and longitude for area of birth, there was a linear trend of women with a non-persistence (C) allele being from more southerly and easterly areas (Table 2). Women with this allele were also more likely to be from areas that experienced a greater number of hours of sunlight. The mean difference in distance north (latitude) among women with either one or two non-persistence (C) alleles as compared with homozygotes for the persistence (T) allele was −379.43 km (95% CI: −502.43, −256.42), and the mean difference in distance east (longitude) among these groups was 225.60 km (95% CI: 154.42, 296.80). When we adjusted these differences for sunlight exposure, they both attenuated markedly to −95.18 km (95% CI: −178.34, −12.01) for the association with distance north (latitude) and 108.76 km (95% CI: 47.85, 169.67) for the association with distance east (longitude). The mean difference in sunshine (as percent of total daylight) comparing non-persistence (C) allele carriers with non-carriers was 0.48 (95% CI: 0.33, 0.64). After adjustment for latitude and longitude, this attenuated to 0.13 (95% CI: 0.02, 0.25).
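The degree of attenuation described above can be expressed as a percentage of the unadjusted difference. The short sketch below does this arithmetic using only the point estimates quoted in the text; the helper name `percent_attenuation` is hypothetical, and the calculation ignores the uncertainty in each estimate.

```python
def percent_attenuation(unadjusted, adjusted):
    """Percentage by which an estimate shrinks towards zero after adjustment."""
    return 100 * (1 - adjusted / unadjusted)

# Point estimates quoted in the text (carriers of a C allele versus TT homozygotes)
latitude = percent_attenuation(-379.43, -95.18)   # adjusted for sunlight
longitude = percent_attenuation(225.60, 108.76)   # adjusted for sunlight
sunshine = percent_attenuation(0.48, 0.13)        # adjusted for latitude and longitude

print(f"latitude: {latitude:.0f}%, longitude: {longitude:.0f}%, sunshine: {sunshine:.0f}%")
```

On this scale, each association shrinks by roughly one half to three quarters, which reflects how closely sunlight, latitude and longitude track one another in these data.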

Many of the characteristics examined in this study are related to geography (see Supplementary web Table 2). Given the geographic differences in allele frequency, together with geographical variation in the characteristics we have examined, we repeated all of the analyses presented in Table 2 with additional adjustment for latitude and longitude of birth. Supplementary web Table 4 (see journal website) gives these adjusted results. The association of non-persistence (C) allele homozygosity or heterozygosity with HDLc remained unchanged with additional adjustment for latitude and longitude of place of birth: 0.04 mmol/l (0.01, 0.07), P=0.01. The association of non-persistence (C) allele homozygosity with self-rated poor or fair health attenuated from an odds ratio of 1.38 (1.04, 1.84) to one of 1.19 (0.87, 1.64) with adjustment for latitude and longitude of place of birth.

Discussion

Prevalence and geographical distribution

In a representative sample of white British women aged 60–79 years, we found a prevalence of the genotype associated with lactase non-persistence of 6.8%; among the 96% of these women who were born in Britain, the prevalence was 6.1%. The prevalence of phenotypically defined lactase non-persistence in participants described as ‘British born’ or ‘white’ has been estimated as ranging from 3 to 7%.18, 19, 20 Until recently, no large-scale studies with genotype data have been reported from Britain, but two small studies reported prevalences of 5.4 and 9.0% of non-persistence (C) allele homozygosity.21, 22 The latter study recruited participants from the south of England, where the prevalence of non-persistence (C) allele homozygosity in our study was higher than in the north of Britain. There was no evidence of an age difference by genotype, and there was no strong evidence of departure from the Hardy–Weinberg equilibrium either at younger or older ages, suggesting that in this population genotype is not related to survival.

The Wellcome Trust Case Control Consortium used the availability of both genomewide genotypic data and geographic information to assess evidence of unequal geographical distribution of genetic markers.23 This work noted a very small proportion (less than 1%) of loci that showed evidence of a difference in allele frequency by geographic region, the most marked of these being rs1042712, a marker proximal to the MCM6 locus (r² with rs4988235: 0.24; D′: 0.88; derived from the HAPMAP phase 2 CEU samples). This work did not link this variation to patterns of phenotypic difference that might account for these differences.
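The two linkage disequilibrium measures quoted above, r² and D′, can differ considerably for the same pair of markers. The sketch below applies their standard definitions to hypothetical haplotype and allele frequencies (not the HAPMAP values) to show how D′ can be high while r² remains modest when the two markers have different allele frequencies. The function name `ld_stats` is an assumption for illustration.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two biallelic markers.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b
    # Largest |D| attainable given the allele frequencies and the sign of D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2

# Hypothetical frequencies: allele A (10%) always occurs with allele B (50%)
d, d_prime, r2 = ld_stats(p_ab=0.10, p_a=0.10, p_b=0.50)
print(f"D={d:.3f}, D'={d_prime:.2f}, r2={r2:.3f}")
```

In this example the rarer allele is found only on haplotypes carrying the common allele, so D′ reaches its maximum, yet one marker predicts the other poorly and r² stays low.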

The prevalence of lactase persistence shows strong associations with latitude, generally increasing with latitudes greater than 25° north or south.5 This association has usually been shown between countries rather than within countries. Here, we demonstrate that such a gradient exists within Britain; individuals with the lactase non-persistence (C) allele were more likely to have been born in the most southeasterly areas. These results are supported by the aforementioned findings from the Wellcome Trust Case Control Consortium. In this study, we show that this gradient depends on latitude of place of birth rather than on place of residence, suggesting that it reflects prevalence differences between peoples who have lived in different latitudes within Britain for many generations. These graded and substantial differences in prevalence of a genetic variant for a population all born in the United Kingdom and identified as being ‘white’ demonstrate that potentially important levels of population substructure of genetic variants do exist within an apparently homogeneous population. The finding of an association between homozygosity for the lactase non-persistence allele and self-reported poor or fair health, which attenuated substantially on adjustment for latitude and longitude, further demonstrates that apparent associations with health outcomes can be generated by such stratification. Our findings are consistent with those of a recent study of an apparently homogeneous European American population, all of whom had described themselves as ‘white’ or ‘Caucasian’ and all of whom had grandparents born in either the United States or Europe.24 In that study, an apparent strong association between the LCT variant examined here and height was shown to be largely explained by population stratification. However, lactase persistence – a trait for which there is evidence of strong and recent selection6, 7 – may be an exception to a general rule of low levels of population stratification.

Lactase persistence, drinking milk and bone health

Between populations, there is a very strong association between the prevalence of lactase persistence and drinking milk.5 However, within populations, the association is much less strong. We found no strong association of milk consumption with lactase persistence, although our data on this issue were crude, consisting of a comparison of never drinking milk with drinking any amount, as opposed to a more graded assessment of amount consumed. Findings from previous studies have been mixed.25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35 The power of many previous studies has been low because of small sample sizes given the time-consuming nature of phenotypic testing for lactase non-persistence. Lactase non-persistence tends to influence the amount of milk that can be tolerated rather than preventing milk drinking altogether; our lack of data on the quantity of milk consumed is therefore an important limitation in examining this issue.

There has been considerable discussion of the degree to which lactase non-persistence leads to ill-health. Recently, Matthews et al36 have demonstrated that lactase non-persistence variant homozygosity is associated with higher rates of many symptoms of ill-health in a UK population. We did not have questions to examine the symptoms for which strongest influences were found in that study – abdominal pain, gut distension, flatulence, diarrhoea, headache and generalised tiredness and pain36 – but questions on depression, asthma, gastric ulcer and nonspecific chest pain showed no strong association with genotype (data not shown). Our finding of a higher prevalence of poor or fair self-rated health appears to be generated, at least in part, by population substructure (as discussed above). Given the large number of possible statistical contrasts, any weak associations with genotype – for example, adult social class – need to be treated with considerable caution. The work by Matthews et al suggests that those who are lactase non-persistent but continue to drink milk are the ones most likely to report symptoms, but we found no difference in the effect of genotype on our general health rating when we stratified by whether the woman drank milk or not. However, this analysis would be better performed taking into account amount of milk drunk, and we do not have data on this. Several studies have suggested associations between milk consumption and vascular and metabolic traits,37, 38, 39, 40, 41 and we have previously reported that women in this study who reported never drinking milk had lower levels of homoeostasis model assessment insulin resistance.42

We found no evidence that the LCT variant examined here was associated with insulin resistance and it was largely unrelated to vascular and metabolic traits. However, the lack of association with milk consumption in this population means that LCT cannot be used as an instrument for examining the causal effect of milk consumption on these outcomes. The association between possession of a C (lactase non-persistence) allele and HDLc remained even with the adjustment for latitude and longitude of birth, and hence may not be fully explained by population stratification. However, we have undertaken a number of statistical tests in this study, and replication of this finding would be required before one could consider this as anything other than a chance association.

Implications for the origin of lactase persistence

As a trait, lactase persistence shows strong evidence of selection, and there has been considerable discussion regarding the basis of this. The genetic haplotype associated with lactase persistence of which one perfect tag was investigated here also shows molecular evidence of selection, as does the variant recently identified in African populations that underlies a second origin of the trait.6, 7 There are several non-exclusive hypotheses regarding the transition from presumed near-universal lactase non-persistence in original Homo sapiens to high prevalences of lactase persistence in many populations today. They are all predicated on the notion that lactase persistence has evolved in parallel with the use of milk-producing animals in animal husbandry and the resultant potential for there to be large supplies of milk. Indeed, it has been shown that diversity in milk protein genetic variants among cattle is distributed across Europe in parallel to both current day lactase persistence and the Neolithic distribution of European cattle pastoralist societies.43

The two current main hypotheses regarding the origin of lactase persistence can be summarised as (1) the ‘culture historical’ hypothesis,44, 45, 46 which concentrates on the general nutritional (and survival) advantage of milk consumption in populations that have milk availability, do not process milk into low-lactose foods such as cheese and are subject to other (non-dairy) dietary stresses and (2) the calcium absorption hypothesis,5, 47 which considers the ability to use milk as of particular importance for high-latitude populations with low ultraviolet light exposure who are thus subject to potential vitamin D deficiency and poor calcium absorption and for whom the calcium absorption-stimulating effect of lactose would increase fitness. There are other less formally articulated hypotheses that can be identified in the literature, which are as follows:5, 48, 49 (1) a reduced diarrhoeal disease mortality hypothesis that considers that, in populations that have become high consumers of milk, this consumption will increase risk of diarrhoeal disease in individuals who are not lactase persistent and thus select for lactase persistence; (2) an auxiliary water/electrolyte hypothesis specific to the aberrant high-lactase persistence populations in Africa that considers that, in arid regions with animal husbandry practices allowing access to milk, the ability to use milk has a selective advantage through the provision of water and electrolytes; (3) the enhanced fertility by early weaning hypothesis that postulates that lactase persistence leads to earlier weaning and that earlier cessation of breastfeeding reduces the infertile period following each birth; and finally, (4) that the gradients observed could be due to simple genetic drift.

It is possible that dairy farming became more strongly established at an earlier time in the north of Britain, although evidence for this is weak.50 When we adjusted the latitude association for hours of sunlight at the area of birth, we found marked attenuation of the latitude associations. In areas with lower sunlight exposure (and hence vitamin D levels), lactase persistence could provide a means of enhancing calcium absorption. However, the strong association of hours of sunlight with latitude renders statistical separation of these two influences difficult.

The reduced diarrhoeal disease hypothesis is dependent on the lack of a strong link between lactase persistence and milk consumption in cultures with high milk consumption and in this sense receives support from our findings. However, the consequences of prolonged childhood diarrhoeal disease that might be expected in survivors – shorter body height, leg length and perhaps higher blood pressure51, 52 – were not seen in our data (Table 2). Our study does not of course address the auxiliary water/electrolyte hypothesis. Finally, even if the enhanced fertility hypothesis applied to historical populations, it is unlikely to be reflected in recent fertility patterns in the United Kingdom, where total offspring number will not depend on periods of breastfeeding-induced fertility reductions. In our data, there was no association between genotype and parity (Table 2).

Further genotype-based studies across disparate populations and relating genotype to latitude, sunlight exposure, milk consumption, bone health, calcium absorption and diarrhoeal disease history could help progress understanding. For example, the finding that lactase persistence-associated alleles are in different haplotypes in African as compared with European populations53 demonstrates a separate origin in these two continents and may provide some support for the auxiliary water/electrolyte intake hypothesis for high lactase persistence populations in Africa.

Implications for population stratification

Our finding of gradients in lactase persistence across Britain could relate to population origin (eg, a relatively greater contribution of high lactase persistence Norsemen in the north and of lower lactase persistence Germanic Anglo Saxons and Normans in the south50) or to differential strength of selection. Whatever the origins of these patterns, they demonstrate that common genetic variants may show considerable differences in prevalence between subgroups within apparently homogeneous populations and thus that population substructure could generate bias both in estimates of the precision of genetic variant–outcome associations and in the strength of these associations. A similar gradient in allele frequency within a country has recently been demonstrated within Italy.54 Lactase persistence would clearly be a useful variant for detecting population substructure and for use in statistical approaches for controlling the effects of such stratification.55 A recent study of European American populations has also identified this variant as one that is highly sensitive to population substructure24 and a subsequent paper using the same sample proposed a new, computationally simple method for correcting population stratification.56
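The way in which substructure can generate a spurious association may be illustrated with a minimal numerical sketch. All counts below are hypothetical: two regions differ in both carrier frequency and outcome prevalence, but within each region genotype and outcome are independent. A Mantel–Haenszel stratified odds ratio is used here as a simple stand-in for adjustment; it is not the regression adjustment for latitude and longitude used in this study, and the function names are assumptions.

```python
def odds_ratio(a, b, c, d):
    """OR from a 2x2 table: a, b = carrier cases / non-cases; c, d = non-carrier cases / non-cases."""
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """Mantel-Haenszel summary OR across strata of (a, b, c, d) tables."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Hypothetical regions of 1000 people each (not study data).
# North: 40% carriers, 20% outcome prevalence in carriers and non-carriers alike.
north = (80, 320, 120, 480)
# South: 55% carriers, 30% outcome prevalence in carriers and non-carriers alike.
south = (165, 385, 135, 315)

pooled = tuple(n + s for n, s in zip(north, south))
crude = odds_ratio(*pooled)                     # ignores region
adjusted = mantel_haenszel_or([north, south])   # stratified by region

print(f"within-region ORs: {odds_ratio(*north):.2f}, {odds_ratio(*south):.2f}")
print(f"crude OR: {crude:.3f}, region-adjusted OR: {adjusted:.3f}")
```

Although there is no association within either region, pooling the regions yields a crude odds ratio above 1, and stratifying by region removes it. The size of the bias in real data would depend on how strongly both allele frequency and outcome vary across subgroups.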

Conclusions

Among a population of British-born women identified as being white, we show that there is considerable population stratification for a common genetic variant, to a level that could produce spurious findings in genetic association studies. The lactase persistence variant could be useful in exploring the existence of, and then statistically controlling for, population stratification in genetic association studies. Lactase persistence is a classic genetic polymorphism identifiable through its phenotypic effect and as such has been included in the banks of genetic polymorphisms utilised for studying the origin and distribution of human genes.57, 58 Other such polymorphisms, some of which, like lactase persistence, can now be easily studied as SNPs, could also provide valuable data on hidden population subdivision.