Childhood acute leukaemia is the most common paediatric cancer in developed societies, accounting for one-third of all cases, with a variable incidence rate of 10–45 per 10^6 children per year and a cumulative risk of ~1 in 2,000 up to the age of 15 years1. The most common paediatric leukaemia, acute lymphoblastic leukaemia (ALL), is an intrinsically lethal cancer, as evidenced by a universally adverse clinical outcome before effective therapy was developed2. Currently, however, cure rates for ALL using combination chemotherapy are around 90%3, making this one of the real success stories of oncology.

While this is a cause for celebration, the current treatment remains toxic, traumatic for young patients and their families, and carries some long-term health consequences4,5. It is unfortunate that we have remained ignorant as to the cause of ALL. The open question as to whether this cancer is potentially preventable is therefore important.

Environmental exposures possibly linked to ALL are numerous6, but in many cases, these associations are weak, inconsistent or lacking in biological plausibility7. Large and multidisciplinary nationwide studies or international consortia8,9 have provided a more enabling framework for addressing this question, but to date, the only accepted causal agent for ALL, albeit under exceptional circumstances, is ionizing radiation10,11,12. The causes of ALL might be best understood by using biological insights into the cancer itself as the foundation for designing, testing and validating hypotheses.

Childhood ALL includes a number of subtypes defined by cell lineage (B cell or T cell), differentiation status and genetics (Fig. 1a). These differ by age distribution (Fig. 1b) and clinical outcome (see Fig. 1 legend) and could have distinctive aetiologies. In this Review, I focus on the body of evidence — epidemiological, biological and genetic — that has accumulated, particularly over the past decade, and supports a causal mechanism that is selective for the common, or B cell precursor, subtype of childhood ALL (designated here as BCP-ALL)7. This is suggested to be a multifactorial mix of infectious exposure, inherited or constitutive genetics and chance, with patterns or timing of common infection in early life identified as the critical component and a potential route for preventive intervention.

Fig. 1: Major subtypes of childhood ALL.

a | Immunophenotype screens in the 1970s and early 1980s established that acute lymphoblastic leukaemia (ALL) could be divided into subsets corresponding to early developmental compartments of the B and T cell lineages, as indicated in the key. Common or B cell precursor ALL (BCP-ALL) is genetically diverse (as illustrated), with the two most prevalent alterations being ETS translocation variant 6 (ETV6)–runt-related transcription factor 1 (RUNX1) fusion and hyperdiploidy. The rare (~2%) subtype with a mature B cell immunophenotype (and frequent IGH–MYC rearrangements) was subsequently recognized (and treated) not as ALL but as B cell lymphoblastic lymphoma. For more detailed descriptions of genomic diversity in ALL, see refs179,180,181. b | The age distribution of ALL subtypes from a cohort of 1,184 patients with ALL entered into MRC-UKALL clinical trials182 (1975–1984) is shown. This pattern of age-associated ALL subtypes was validated in a later cohort of MRC-UKALL trials (1991–1996; 1,088 patients up to 14 years of age)8. It had been known that childhood ALL had a very marked incidence peak at 2–5 years of age throughout the developed world. But this peak appeared to be diminished or absent in less developed societies and appeared in particular countries and ethnic groups at different times182. Immunophenotypic screens, linked to clinical trials in the UK183, and an international collaborative group study182 documented that the peak in incidence was selective for common or BCP-ALL. Recent epidemiological data indicate that the incidence of this subtype of leukaemia in Europe has continued to increase at ~1% per year184,185,186. This suggested that the increase over time was real, rather than ascertainment bias, and that BCP-ALL had a distinctive aetiology. The BCP-ALL subtype was also found to have a much more favourable clinical outcome183,187,188, emphasizing its distinct biology. MLL, mixed-lineage leukaemia; Pro-B, B cell progenitor; TCR, T cell receptor. 
Part b is adapted with permission from ref.182, Elsevier.

Infection hypotheses

The idea that infections might play a causal role in childhood ALL is ~100 years old13. When it became clear that leukaemia in a number of animal species — chickens, mice, cattle and cats — was viral in origin14, there was an expectation that a similar transforming virus might be responsible for childhood ALL, as well as for other blood cell cancers. To date, all attempts to molecularly identify or otherwise incriminate a leukaemogenic virus in ALL have failed7.

In 1988, two hypotheses were presented that suggested a new perspective on this problem. The two models are sometimes considered as alternative or competing explanations. I believe they portray the same picture through different lenses. Both propose that childhood leukaemia may arise as a consequence of an abnormal immune response to common infection(s). One model advanced by epidemiologist Leo Kinlen was based on transient and localized increases in the incidence of childhood leukaemia that could be ascribed, epidemiologically, to population mixing15,16 (Box 1).

The other model, which I proposed and which is the focus of this article, was dubbed the ‘delayed infection’ hypothesis. It was more biological than epidemiological in its origins and was applied specifically to BCP-ALL7,17. Central to this were two propositions.

First was the idea that the immune system, in both its innate and adaptive arms, had evolved to both anticipate and require microbial infectious exposure perinatally or in infancy18. The dynamics and composition of the microbiome and virome of infants are highly variable19, and early microbial exposures have lasting impacts on immune function and health20,21. Metabolites of commensal bacteria promote regulatory T cells and affect subsequent inflammatory signalling pathways22. Deficits of this natural microbial experience, especially in modern societies, result in an unmodulated or distorted immune network23. A consequence of an under-exposed immune network in infancy was predicted to be subsequent dysregulated responses to common infections that could promote or trigger BCP-ALL. The increased incidence of childhood ALL in developed societies was therefore considered to be a paradox of progress, and the link to infection was considered to be inverted: the problem might be lack of infection. An equivalent mismatch of evolutionary adaptations and modern lifestyles may underlie the causation of several common adult cancers in the developed world24.

The second proposition related to the natural history of the disease. The speculation was that ALL most likely developed by two critical steps: first, an initiating event in utero and second, a postnatal mutational event that promotes clinical leukaemia development. The prediction was then that an abnormal immune response to infection(s) indirectly triggered the requisite secondary mutational events. No specific infection was proposed in relation to either protection in early life or postnatal promotion and the immunological mechanism was considered to be indirect and therefore not akin to a transforming virus.

Common infections were therefore proposed to have two opposing impacts on risk of ALL that depend on timing — antagonistic (early) or promotional (late). A parallel would be with the divergent roles of microbial infections and chronic inflammation in gastrointestinal and other common cancers in adults25,26.

The two-hit model of childhood ALL

With a few informative exceptions (Supplementary Box 1), ALL has a clinically silent natural history before diagnosis. My colleagues and I developed three different tactics for backtracking the origins of this covert process to before birth. These exploited the fact that common fusion genes in ALL (for example, ETS translocation variant 6 (ETV6)–runt-related transcription factor 1 (RUNX1) (also known as TEL–AML1), mixed-lineage leukaemia (MLL)–AF4 (also known as KMT2A–AFF1) and BCR–ABL1) have uniquely variable or idiosyncratic breakpoints within the intronic, breakpoint cluster regions of the two partner fusion genes involved. The genomic sequences at the gene fusion junction provide stable, sensitive and clone-specific markers27,28.

Comparative genomics of concordant ALL in monozygotic twins

Studies on monozygotic twins have been especially informative29. The possibility that concordance of leukaemia in identical twins might be attributable not to co-inheritance of genetic susceptibility but to an in utero origin in one twin was first proposed in 1962 (ref.30) and elaborated on in 1971 (ref.31). This idea was based on previous understanding that monochorionic or single placentae have vascular anastomoses permitting twin–twin blood transfusion with consequent blood cell chimerism32.

The prediction was then that ALL in both twins should originate in one twin and therefore be monoclonal31. Unambiguous evidence that this was the case derived from the finding of identical, but non-constitutive, clone-specific fusion gene breakpoints and sequences in a series of twin pairs29,33 (Fig. 2). That evidence is strengthened by the observation of shared, clone-specific immunoglobulin heavy chain (IGH) diversity–joining (DJ) or variable–diversity–joining (V(D)J) genomic sequences of concordant BCP-ALL in twins34.

Fig. 2: Summary of comparative genomics of ALL in identical twin pairs.

The figure is based on analysis of 12 monozygotic twin pairs (the number of pairs with each founder lesion is noted in parentheses) with concordant acute lymphoblastic leukaemia (ALL)29,33,37,40,48. The sharing of a patient-specific and clone-specific fusion gene that is not inherited in the twins indicates that in such cases of concordant ALL, the leukaemia must have been initiated in a single cell, in one twin of the pair in utero, and the clonal progeny of that cell then disseminated to the co-twin via intra-placental anastomoses. In further support of this notion, it was noted that concordance of ALL occurred only when the twins shared a single or monochorionic placenta, providing a route for cellular transmission29. CNA, copy number alteration; ETV6, ETS translocation variant 6; MLL, mixed-lineage leukaemia; RUNX1, runt-related transcription factor 1; SNV, single nucleotide variant. Figure adapted with permission from ref.189, Elsevier.

With further exploration of ALL genomes in twins, it has become clear that while patients with concordant ALL share the identical and singular fusion gene event, other genetic alterations present, including copy number alterations (CNAs) and single nucleotide variants (SNVs), were different in twin pairs35,36 (Fig. 2). This then suggested that such distinctive mutational events reflected independent and divergent subclonal evolution postnatally. Similarly, the majority of ongoing V(D)J rearrangements in IGH are subclonal and distinctive in twin pairs34. In one twin pair with concordant ETV6–RUNX1+ ALL, whole-genome sequencing revealed that the fusion gene was the only shared or clonal genetic lesion37. These data endorsed the likelihood that ETV6–RUNX1 fusion was an initiating event or founder mutation for ALL.

The concordance rate in monozygotic twins varies according to age and ALL subtype. In infants (<18 months) with B cell progenitor ALL (pro-B ALL) and MLL fusions, the rate approximates to 100% for those with a monochorionic or single placenta29. This suggested that MLL fusion-driven leukaemogenesis in such infants was essentially completed in utero and that the fusion gene, or a single mutation, was sufficient for leukaemogenesis. Subsequent genomic sequencing of these patients is compatible with this possibility even though other subclonal mutations, for example, in RAS family genes, do occur38,39.

Pre-leukaemic clones in healthy co-twins

The concordance rate in older children with BCP-ALL was calculated to be around 10–15%, which is lower than that in infants29. A prediction for those pairs of twins with a monochorionic placenta, where only one twin develops ALL, is that the healthy co-twin should have a population of covert pre-leukaemic cells harbouring the same initiating lesion as his or her co-twin with ALL, that is, the twins are discordant for the critical postnatal secondary genetic event. This has been confirmed in three twin pairs with BCP-ALL with hyperdiploidy40, BCR–ABL1 fusion36 or ETV6–RUNX1 fusion35,41. In this context, the healthy co-twin provides a rare ‘experiment of nature’ and unique access to the pre-leukaemic clone. Isolation of putative pre-leukaemic cells from the blood of one healthy co-twin, and their propagation in vitro and in vivo (in NOD–severe combined immunodeficient (SCID) mice), established that these cells have both self-renewal capacity and intact B cell differentiation capacity41, features commensurate with a pre-leukaemic status. Equivalent pre-leukaemic stem cells for acute myeloid leukaemia (AML) have now been identified in patients with AML42 and healthy adults43.

Backtracking early genetic events in ALL to neonatal blood spots

Less than 1% of childhood BCP-ALL cases occur in twins. However, ALL in twins is no different in its biological and clinical features or age incidence to that in singletons. This suggests that many or most childhood ALL cases in singletons are also initiated in utero.

To validate this proposition, my colleagues and I exploited the fact that neonatal blood spots, also known as Guthrie cards, contain reasonably intact DNA. Archived blood spots from patients with ALL were screened for clone-specific fusion gene sequences identified at diagnosis. This was first carried out in three infants with ALL with MLL–AF4 fusion, and blood spots from all three patients evaluated were positive44. Subsequent studies with samples from children with ETV6–RUNX1+ ALL found that ~75% were positive45. These results have been independently confirmed46,47. Negative blood spot results are uninterpretable, as this could reflect either a postnatal origin or an inadequate number of leukaemic cells in the sample. The conclusion drawn from these screens was therefore that the majority of childhood ALL cases, though possibly not all, were prenatal in origin.

The twin and blood spot studies also provided insight into persistence of pre-leukaemic stem cells and postnatal latency in ALL. The oldest twin with concordant ALL originating in utero was 14 years at diagnosis, and her twin sibling had been diagnosed with ALL some 9 years earlier48. The oldest non-twin patient with ALL and a positive neonatal blood spot to date was diagnosed at 9 years and 4 months old49.

Frequency of ALL initiation in utero

The data on discordant, monozygotic twins suggested that some or possibly most individuals harbouring a prenatally generated, covert pre-leukaemic clone never progress to overt ALL. This raises the important question, relevant to aetiology, of the frequency of initiation of ALL in utero and the frequency of its transition to overt leukaemia.

To address this issue, my colleagues and I screened a large cohort of unselected cord blood samples for ETV6–RUNX1 fusion mRNA (data summarized in Supplementary Fig. 1). The striking result was that ~1% of newborn babies (6/567) had a covert and modest-sized, putative pre-leukaemic population of ~1 in 10^4 B lineage cells50. This result, initially challenged51, has been independently confirmed52,53,54. An ~1% incidence for ETV6–RUNX1 in relation to incidence of the leukaemia itself reflects a low transition probability of ~1%, with 99% of pre-leukaemic clones initiated during fetal development never progressing to clinical ALL. This low transition probability could reflect either lack of persistence of the pre-leukaemic stem cells after birth or a severe bottleneck in acquisition of the necessary secondary genetic changes.
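
The arithmetic behind the ~1% transition probability can be sketched as follows. This is only an illustrative back-of-envelope calculation, not the analysis used in the cited studies. The carrier count (6/567) and the cumulative risk (~1 in 2,000) are taken from the text; the share of cases carrying ETV6–RUNX1 (`assumed_fusion_share`, set here to 25%) is an assumption that is not stated in the text, and the Wilson interval is added simply to show how much uncertainty a count of six positives carries.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z = 1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Figures quoted in the text
carriers, screened = 6, 567        # cord blood samples positive for ETV6-RUNX1
cumulative_risk = 1 / 2000         # cumulative risk up to the age of 15 years

# ASSUMPTION (not from the text): fraction of cases that are ETV6-RUNX1 positive
assumed_fusion_share = 0.25

carrier_freq = carriers / screened
low, high = wilson_interval(carriers, screened)

# Risk of ETV6-RUNX1+ disease, then the fraction of carriers who progress
disease_risk = cumulative_risk * assumed_fusion_share
transition_probability = disease_risk / carrier_freq

print(f"carrier frequency: {carrier_freq:.2%} (95% CI {low:.2%} to {high:.2%})")
print(f"implied transition probability: {transition_probability:.2%}")
```

Under these inputs the implied transition probability is on the order of 1%, in line with the figure quoted above. The estimate scales directly with the assumed fusion share, and the interval on the carrier frequency is wide, so it is best read as an order of magnitude.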

These data suggest that initiation of leukaemia in utero is far more common than indicated by the incidence of disease and carry implications for causation. The same may apply to some other paediatric cancers. Histological evidence and some genetic data suggest that the frequency of precursor lesions for neuroblastoma55 and Wilms tumour56 is also some 100 times the incidence of clinical cancer50.

Other subtypes of ALL

These lines of investigation were pursued using fusion genes as the predominant clonal markers of early genetic events in ALL. The most frequent subtype of BCP-ALL is, however, characterized by chromosomal hyperdiploidy, which is harder to track than fusion genes. There is evidence, however, that the key findings described above for the ETV6–RUNX1 subset are likely to apply to hyperdiploid ALL. Monozygotic, monochorionic twin pairs concordant for hyperdiploid ALL are described with identical karyotypes40, and neonatal blood spots of children with hyperdiploid ALL have clone-specific IGH sequences57,58,59. In one case of hyperdiploid ALL, the child’s cord blood had been frozen at birth. Retrieval of this sample led to the identification of putative pre-leukaemic cells in the cord blood with the same triploid chromosomes as in the child’s subsequent ALL60. Hyperdiploidy, generated by a one-off abnormal mitosis resulting in trisomies61, can therefore occur in utero as an alternative initiating event to gene fusion for BCP-ALL.

Further genomic exploration of ALL

Whole-genome sequencing of a cohort of 57 patients with ETV6–RUNX1+ BCP-ALL provided an audit of all mutational changes62. This confirmed the previous finding that the most common recurrent events were CNAs, primarily gene deletions63. SNVs were also present but with low or undetectable recurrency62.

Genomic sequencing in cancer can reveal mutational signatures of relevance to aetiology64. Almost 50% of CNAs in ETV6–RUNX1+ BCP-ALL had partial or complete V(D)J recombination-activating protein (RAG) heptamer-nonamer recognition motifs within 20 bp of the breakpoints62. This finding may explain the observation that highly recurrent CNAs in BCP-ALL are reiteratively present in subclones of individual patients65,66. A comparison with CNAs in ~14,000 patients with breast, prostate or pancreatic cancer revealed none with RAG motifs62. SNVs in ETV6–RUNX1+ BCP-ALL had two main mutated signatures: one was C>T transitions at CpGs and C>G and C>T mutations at TpCs, and a second was transitions and transversions in a TpC context at NpCpG trinucleotides62. This second signature is very common in cancer and generally reflects apolipoprotein B mRNA-editing enzyme catalytic subunit (APOBEC) cytidine deaminase activity64,67.

These genomic studies indicate that BCP-ALL has very restricted but informative mutational signatures and a low level of background or neutral genetic alterations62. This makes it less likely that BCP-ALL is caused by genotoxic exposures, which generally precipitate more widespread genomic instability with multiple distinctive signatures64.

The other genetic subtype of BCP-ALL — hyperdiploid ALL — also has recurrent CNAs that may be RAG-mediated in genes including PAX5, IKZF1 and ETV6. In contrast to ETV6–RUNX1+ ALL, however, hyperdiploid ALL has recurrent mutations in receptor tyrosine kinase (RTK)–RAS pathways and histone modifiers61,63.

Collectively, these data provide a firm cellular, genetic and mechanistic framework for the two-step model for BCP-ALL and highlight both critical time windows, prenatally and postnatally, and mutational mechanisms. Any proposed causative mechanism should accommodate this natural history profile. The initiating role of ETV6–RUNX1 and the postulated sequence of events in BCP-ALL are endorsed by modelling within both human and animal cells (Box 2 and below). The timing and tissue site of BCP-ALL initiated by ETV6–RUNX1 or hyperdiploidy in utero is uncertain but may involve transformation of a unique, fetal liver progenitor cell (Box 3).

Inherited susceptibility

Childhood ALL only very rarely runs in families, but this observation may underplay inherited genetic risk because the disease itself is rare. Twin concordance is unhelpful in this respect because the risk has a mostly non-genetic basis: blood cell chimerism in utero. The risk of ALL in non-identical twins is unknown, but sibling risk has been estimated to be ~3.0 times higher than the risk in the general population, which provides evidence of a modest but definite contribution of constitutive, genetic variation to risk68.

Earlier, targeted gene screening approaches suggested that inherited allelic variants encoding proteins involved in DNA repair, carcinogen metabolism or the folic acid pathway are linked to risk of childhood ALL69,70. Unfortunately, most of these studies were under-powered to detect small effects or have not been consistently replicated, so their significance remains uncertain70.

Genome-wide association studies (GWAS) have provided unambiguous evidence for multiple gene variants that affect the risk of ALL70,71 (Table 1). The individual alleles described to date have a significant but somewhat modest impact (see odds ratio in Table 1) and appear to be functionally additive rather than synergistic. The functional logic of these associations is unclear, but as most of the relevant single nucleotide polymorphisms (SNPs) lie outside coding regions, they are likely to be regulatory, affecting levels of proteins72,73.

Table 1: Inherited alleles and risk of childhood B cell precursor ALL from genome-wide association studies

It is striking that in ALL, as in many other cancers, most of the candidate risk genes implicated in GWAS (Table 1) are the same genes that have acquired (non-constitutive) mutations in the same cancer type. One interpretation of this is that the inherited allelic variants interact functionally (or epistatically) with the mutated alleles to increase vulnerability of cells to transformation. A low functioning inherited allele, for example, would render a deletion in the other allele functionally homozygous, with a potentially increased impact on cellular fitness. A prediction that follows from this is that there should be a preferential loss, by deletion, of the non-risk allele (in heterozygotes for that allele), as only that deletion would increase clonal fitness. Evidence for this has been presented with respect to risk variants of CDKN2A74,75. However, for ARID5B, there is preferential gain of the risk allele (via trisomy 10) in heterozygotes72.

To date, GWAS have provided no evidence implicating immune response gene variants, as might have been anticipated from infection-based hypotheses for the aetiology of ALL. However, previous studies examining major histocompatibility complex (MHC, also known as HLA) genes76,77,78, interferon-γ (IFNG)79, Toll-like receptor 6 (TLR6)80 or the presence and/or absence of specific killer-cell immunoglobulin-like receptor (KIR) family genes81 did record significant associations with particular allelic variants. Notably, TLR6 variants and KIR genes were associated with decreased risks of all childhood ALL types. It remains unclear whether the large GWAS multi-cohort studies invalidate these data or whether the SNP screens in GWAS adequately detect the relevant variants. This is an important discrepancy to resolve.

Childhood ALL can also arise in a rare familial syndrome context, with inherited mutations in genes also implicated as acquired mutations in leukaemia, including PAX5 (ref.82) and ETV6 (ref.83). The relatively infrequent low hypodiploid subset of BCP-ALL is strongly associated with inherited TP53 mutations or Li–Fraumeni syndrome84. Further rare risk alleles, but with intermediate-to-high penetrance, are likely to be uncovered in ongoing, large-scale studies. Children with Down syndrome have an approximately 20-fold to 30-fold increased risk of BCP-ALL85. Trisomy 21 in Down syndrome is associated with overexpression of the nucleosome-remodelling protein high mobility group nucleosome-binding domain-containing protein 1 (HMGN1) and enhanced self-renewal of B cell progenitors86. In patients with ALL, this is complemented by secondary genetic changes, including those in CRLF2, JAK2, NRAS and KRAS85. All told, however, familial syndromes and Down syndrome-associated ALL are likely to account for only a small fraction of patients diagnosed with childhood ALL.

The general conclusion to be drawn from these genetic studies is that inherited susceptibility does contribute to risk of BCP-ALL. The attributable risk or quantitative contribution is, however, unclear. A sibling risk of threefold for ALL, seen against a background risk of 1:2,000, suggests that the genetic component, though real, is minor, at least compared with some other common adult cancers (prostate and breast)87. On the other hand, there could be a complex interplay between genes and environmental exposures73,88 in which genetic background makes a more substantial difference. This has yet to be fully explored.
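
As a rough worked figure, the two numbers quoted above can be combined directly. This assumes, for illustration only, that the threefold relative risk simply multiplies the background cumulative risk; the symbol R_sibling is introduced here and does not appear in the cited studies.

```latex
R_{\text{sibling}} \;\approx\; 3 \times \frac{1}{2000} \;=\; \frac{3}{2000} \;=\; 1.5 \times 10^{-3} \;\approx\; \frac{1}{670}
```

On this reading, the absolute risk to a sibling would remain below 0.2%, which is consistent with the view that the inherited component is real but small.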

Possible causes of initiating events

No epidemiological studies to date have clearly implicated exposures during pregnancy, linked to risk of ALL, that might explain how the initiating mutations for BCP-ALL arise. ETV6–RUNX1+ BCP-ALL has no mutational signatures that might implicate any particular type of aetiological exposure. There is no evidence for RAG involvement, but in common with IGH rearrangements89, the recombination event appears to involve non-homologous end joining via microhomologies27.

If ~1% of unselected newborn babies have an in-frame ETV6–RUNX1 fusion in an expanded clone derived from the appropriate cell type for BCP-ALL (B cell lineage progenitor), then considerably more newborn babies should have acquired this or other fusion genes in the wrong cell types or out of frame for a functional protein. It therefore seems likely that whatever causes this genetic alteration could be very common or possibly ubiquitous.

The original proposition17 was that BCP-ALL was initiated in utero by a spontaneous mutation or with no external exposure involvement. Developmental, endogenous factors such as proliferative and oxidative stress or the profound apoptotic signalling in early lymphopoiesis could be involved. Spontaneous mutations or mutations caused by endogenous processes are common during fetal development90. Endogenously driven double-strand breaks, required for fusion gene recombinants, occur at ~50 per cell cycle in human cells91. It has been suggested that most paediatric cancers arise during embryonic or fetal life and can similarly be considered as developmental errors92. In the absence of evidence to the contrary, this remains the most plausible explanation for initiation of BCP-ALL and focuses attention on the postnatal triggering of promotional events, which are required for clinical disease. There is clearly scope for more research on the origins and mechanisms involved in fusion gene formation, and hyperdiploidy, in utero.

Epidemiological evidence

Epidemiological evidence suggests that patterns of infection after birth have a causal role in triggering ALL. The delayed infection hypothesis lends itself to epidemiological evaluation in a case–control setting. A prediction of the model was that common infections in infancy should be protective against BCP-ALL. There is no prior reason to implicate any particular infectious agent (for example, bacteria, virus or parasite), and the relevant infections need not be symptomatic or pathological. A longstanding need for microbial, immune network modulation might reflect common, commensal, or ‘old friend’ organisms, such as gut microbiota, soil mycobacteria or helminth parasites93. In this context, a surrogate of overall infectious exposure during infancy could be considered an appropriate variable. Quantifiable surrogates include social exposures of infants in the home, related to the number of siblings and birth order, or in day care centres, and breastfeeding. These variables have been investigated in epidemiological case–control or cohort studies for risk associations with ALL overall and, in some instances, selectively for the major BCP-ALL subset.

Impact of day care attendance in infancy

In the 1990s, the UK Children’s Cancer Study Group (UKCCS) was set up to test the delayed infection hypothesis, in a case–control context, in addition to analysing other exposures including ionizing and non-ionizing (such as that from electromagnetic fields) radiation and chemicals8. Day care attendance was chosen as one surrogate for infectious exposure because this was well documented as a context for increased social contacts facilitating spread of common infections94. The UKCCS involved almost 1,300 patients with ALL (all subtypes) and over 6,000 matched controls. Although only a relatively small number of controls experienced day care in the first 12 months of life, the data showed a significant protective impact on risk of ALL overall and on BCP-ALL95.

This association has been documented in additional studies in California96, Scandinavia97 and France98 and in an international consortium99, and has been endorsed by a meta-analysis100. The meta-analysis noted significant between-study heterogeneity, and one early, large-scale study failed to detect any impact of day care on risk of ALL101. No protection is afforded against childhood AML by day care attendance or, to date, any other paediatric cancer, which increases the confidence that the associations seen in ALL were not confounded by social or other variables.

It was anticipated that assessing, in a case–control fashion, actual infections in infancy would be informative. This, however, is fraught with difficulties and has provided mixed results. Parental recall is known to be suspect or inaccurate in this respect102. Medical records are more reliable, particularly in the UK, with nationwide registration of children with general practitioners and a free health service. One such analysis found more, rather than fewer, infections reported for children who subsequently developed ALL than for controls102. The main difficulty here, other than possible bias in use of general practitioner services, is that we do not know whether the relevant modulating infectious exposures in infancy are necessarily symptomatic; they might well not be. In this sense, the surrogate of day care could be considered preferable. However, several studies have reported, in accord with the hypothesis, an inverse relationship between common infections in early life, including inner ear infections, and risk of ALL98,101,103,104,105,106.

Birth order and risk of ALL

A further surrogate measure of infectious exposures in infancy is the number of siblings cohabiting and, in particular, birth order. The prediction was that firstborns would be more at risk than laterborns, who would, as infants, benefit from protective exposures via older siblings. One large UK-based study (with >3,000 patients with ALL and the same number of matched controls) found a striking association of birth order with risk of ALL, but not of AML107. Other case–control studies in France98 and California100 also found a significantly increased risk of ALL for firstborn (versus thirdborn) children, as did a recent international cohort analysis (O. Paltiel, personal communication).

If natural infections early in life reduce risk of ALL, then it might be expected that some vaccinations would have an effect. The data on vaccination histories have produced null or inconsistent results. However, there is one exception: immunization against Haemophilus influenzae type B in infancy appears to confer a degree of protection against ALL108.

If the natural microbiota is part of a longstanding and critical interaction with the developing immune system, then antibiotic use in infancy might increase risk of ALL. This has not been systematically evaluated to date, though an earlier report from China did suggest an increased risk associated with exposure to chloramphenicol109.

Mode of delivery, breastfeeding and risk of ALL

The mode of delivery at birth influences the early exposure of newborn babies to benign microbiota110, as caesarean delivery deprives newborn babies of the microbial exposures associated with vaginal passage. Cohort and case–control studies have reported a significantly increased risk of ALL associated with caesarean delivery111,112,113,114. No such increased risk was observed for brain cancer or lymphoma111.

Breastfeeding during infancy provides nutritional support, maternal antibodies, anti-inflammatory molecules, some maternal cells, microorganisms (lactobacilli) and oligosaccharides that nourish the infant’s intestinal microbiome (Bifidobacteria spp.)115. It might be anticipated that long-term breastfeeding would have a modulating effect on the immune system of infants and reduce the risk of ALL. Seventeen case–control studies of the impact of breastfeeding on ALL risk have been published116. In the largest of these, from the USA117 and UK118, there was a reduced risk of ALL (10–20%) associated with breastfeeding of 6 months or more. Five meta-analyses have now been published with concordant conclusions, and the latest of these indicated a reduced ALL risk of ~20% for breastfeeding of 6 months or more116.

Clusters of ALL

Although no specific microbial agent or a unique transforming virus is suspected in ALL, there might be one circumstance where a single type of infection is involved: the very rare cases of space–time clusters. A prediction of the hypothesis would be that a single cluster of affected patients might be associated with a single infection or microorganism species, but independent, space–time clusters could involve different infectious triggers.

Many putative clusters of childhood leukaemia have been reported, but two stand out. The first was in Niles, a suburb of Chicago, Illinois, USA, in 1957–1960 — where there were eight patients (relative risk (RR) 4.3) diagnosed with ALL or ‘stem cell’ leukaemia119. All patients and/or their older siblings attended the same school. The second cluster involved 13 patients with BCP-ALL over 4 years (2000–2004; RR 12.0), but 10 of those patients were diagnosed within just 10 months in 2001, in the small town of Fallon, Nevada, USA120.
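As a rough illustration of what the quoted relative risks imply, the sketch below back-calculates the expected number of cases in each cluster and the Poisson probability of seeing at least the observed number. It assumes that RR here is simply observed divided by expected cases and that case counts follow a Poisson distribution; neither assumption is taken from the original cluster reports, and the function name `poisson_tail` is illustrative.

```python
from math import exp, factorial


def poisson_tail(observed: int, expected: float, terms: int = 200) -> float:
    """P(X >= observed) for X ~ Poisson(expected), summed directly over the upper tail."""
    term = exp(-expected) * expected ** observed / factorial(observed)
    total = 0.0
    k = observed
    for _ in range(terms):
        total += term
        k += 1
        term *= expected / k  # next Poisson term derived from the previous one
    return total


# Case counts and relative risks as quoted in the text: (observed, RR).
clusters = {"Niles": (8, 4.3), "Fallon": (13, 12.0)}

results = {}
for name, (observed, rr) in clusters.items():
    expected = observed / rr  # assumption: RR = observed / expected
    results[name] = (expected, poisson_tail(observed, expected))
    print(f"{name}: expected {expected:.2f} cases, "
          f"P(X >= {observed}) = {results[name][1]:.2g}")
```

A probability computed this way makes no allowance for the many places and periods in which a cluster might have been noticed, so it overstates the statistical significance and should not be read as a formal test.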

A neglected aspect of these two clusters is that the patients, though resident in the cluster area at the time of diagnosis, were mostly born outside of that area7. Additionally, the clusters involved children diagnosed with ALL at different ages (2–11 years) but within a narrow time frame. Given what we now know of the natural history of ALL, these data indicate that any causal exposure linked to the cluster would have to be proximal in time to diagnosis (rather than prenatal) and, therefore, promotional. The Niles cluster was linked, observationally, to an outbreak of streptococcal fever119. The cause(s) of the Fallon cluster of ALL remain unresolved and contentious, though a possible role of adenovirus was hypothesized120.

A significant space–time cluster of BCP-ALL in Milan, Italy has recently been recorded121. Seven patients were diagnosed in a 4-week period; four of these lived within one small residential area, and three of these four attended a single school. The Kulldorff scan method122 identified this as a significant space–time cluster (P = 0.017). Given the narrow time window of the diagnoses (4 weeks) and the age range of the patients (2–11 years), any causal exposure, as in Niles and Fallon, would be proximal to diagnosis, promoting overt ALL evolution from a prior and covert pre-leukaemic state. The Milan cases sparked substantial public anxiety, particularly in relation to the school, and a detailed epidemiological investigation was launched. No link was found with ionizing radiation, non-ionizing radiation or chemicals. There was, however, an association with a particular common infection. All seven patients had been infected with endemic influenza A H1N1 swine flu virus during the epidemic that preceded the ALL cluster by 3–6 months. The infection frequency in children in Milan during the same period was relatively high, at around one-third, but this still indicated that the link with patients was significantly different from expected (P = 0.01)121. Six of the seven patients were firstborn children, and none attended day care in the first year of life.

Proof of a causal role for infections in these situations is not possible, and clustering of cases by chance cannot be excluded. But the observations accord with predictions of the infection hypothesis and highlight that influenza viruses are potential promoting agents for ALL. A previous study in the UK observed peaks in the incidence of childhood ALL ~6 months after seasonal influenza epidemics123. A final piece of epidemiological evidence indirectly supporting a role of common infection in childhood ALL comes from anecdotal but striking observations of rapid changes in the incidence of ALL that were preceded by major social changes in Germany and Hong Kong (Supplementary Box 2).

There is no compelling reason for postulating an exclusive role for influenza viruses or, indeed, for viruses. A role for cytomegalovirus (CMV) in ALL has been proposed but as an early, in utero modulator of immunity rather than as a proximal trigger124.

In some respects, it is surprising that the epidemiological data are as consistent as they are for individual factors related to infection in ALL, because many variables will interact to influence patterns of microbial exposures in early life (Supplementary Box 3).

Modelling the missing link in ALL

There are limits to what epidemiological studies can achieve and to the robustness of the findings. Nevertheless, the associations described are compatible with the infection model proposed, and their selectivity for BCP-ALL versus AML is striking. But associations are not necessarily causal. Functional components of the infection hypothesis are best addressed by modelling studies in mice (Box 2). These have proved very informative. One inflammatory cytokine — transforming growth factor-β (TGFβ) — was found to induce preferential expansion of ETV6–RUNX1-driven pre-leukaemic cells of both mouse and human origin125. Normal B cell precursor proliferation is inhibited by TGFβ via activation of the cell cycle (cyclin-dependent kinase) inhibitor p27. ETV6–RUNX1 blocks this activity, giving pre-leukaemic cells a fitness advantage in the presence of TGFβ125.

The missing link in the chain of events between infection, inflammatory responses and promotion of BCP-ALL may be activation-induced cytidine deaminase (AID; also known as AICDA)126,127. As noted above, genomic sequencing in BCP-ALL revealed that the recurrent CNAs have signatures of RAG activity62. Physiological RAG activity in germinal centre immunoglobulin class switching or hypermutation requires AID128, as does illegitimate recombination between the IGH locus and oncogenes129,130. In B cell precursors, AID is not normally co-expressed with RAGs but is inducible by infection-driven cytokine signals131. This suggested that one route by which infection or chronic inflammation triggers RAG-mediated CNAs and ALL is via activation of AID expression in pre-leukaemic stem cells.

A mouse model has tested the requirement for RAGs and AID in the transition from ETV6–RUNX1 pre-leukaemia to overt ALL126. The data revealed that lentiviral transfection of ETV6–RUNX1 into progenitor cells leads to BCP-ALL when those cells are treated with a surrogate inflammatory signal (bacterial lipopolysaccharide binding to TLR4) that activates AID. Critically, mice did not develop ALL if the same experiment was conducted in a Rag1–/– genetic background. More recently, my colleagues and I have screened a series of inflammatory cytokines for their ability to trigger AID expression in human B cell precursors. The most potent was TGFβ (V. Cazzaniga, A. M. Ford and M. Greaves, unpublished observations). TGFβ is known to promote other cancers, often in the context of chronic inflammation132. In ALL, its role may include not only selective expansion of pre-leukaemic cells125 and activation of AID but compromise of natural killer cell-based immune-surveillance133.

If the aetiological hypothesis is correct, then it should be possible to influence risk or penetrance of ALL in murine models by timed exposure of the immune system to natural infections. Using a model of BCP-ALL, it was shown that ALL developed if Pax5+/– mice were switched from a germ-free environment to one providing exposure to common microbial pathogens134. Similarly, another study found that ~10% of mice with ETV6–RUNX1 developed BCP-ALL after exposure to common pathogens135. These experiments provide evidence, albeit in murine models, that common infections can have, as predicted, a promotional role in ALL.

Another mouse model has provided evidence that early stimulation of the immune system can be protective. Exposure of mice with transgenic Eμ-Ret or E2A (also known as TCF3)–PBX1 to oligodeoxynucleotides (which bind to TLR9) at 4 weeks depleted both normal and pre-leukaemic precursors and both delayed and diminished the risk of progression to ALL136. This effect was dependent upon IFNγ. By contrast, binding of polyinosinic:polycytidylic acid (poly(I:C)), a TLR3 ligand that does not induce IFNγ, resulted in an expansion of the pre-leukaemic cell pool. These data hint that the nature of infectious exposures in infancy and responses of the innate immune system may influence not only subsequent immune responses but also the fate of pre-leukaemic cells.

Conclusions: paradoxes of progress

We incline on our evidence to the belief that the solution of the problem of leukaemia lies rather in some peculiar reaction to infection than in the existence of some specific infective agent.

F. J. Poynton, H. Thursfield and D. Paterson, Great Ormond Street Hospital for Sick Children, 1922 (ref.137)

Collectively, the accumulated evidence derived from epidemiological studies, GWAS, genome sequencing, biological scrutiny of the natural history and molecular pathogenesis of BCP-ALL and mechanistic and modelling studies provides us with a more substantive and credible version of the original7,17 two-hit model for childhood ALL, as summarized in Fig. 3. The model applies selectively to the common, B cell precursor subset of ALL, although the evidence is currently more compelling for the ETV6–RUNX1+ subset of BCP-ALL than for the hyperdiploid subset. The rarer pro-B ALL in infants appears likely to have a different causation and molecular pathogenesis, as do childhood AML and childhood lymphoma. There are insufficient data for thymic or T cell precursor ALL (T-ALL) in this respect. Other causal associations in leukaemia and cancer in general might be revealed or strengthened by a focus on well-defined subtypes, as suggested for breast cancer138.

Fig. 3: Summary of the two-hit model for role of infections in B cell precursor ALL.

Genetic, inherited risk alleles are depicted (top of figure) as having effects at any or several stages of the stepwise process of acute lymphoblastic leukaemia (ALL) development. Step 1 is the prenatal initiation lesion (ETS translocation variant 6 (ETV6)–runt-related transcription factor 1 (RUNX1) or hyperdiploidy), which is common (~100 times clinical ALL frequency) and postulated to arise as a spontaneous, developmental error. This generates a clinically silent pre-leukaemic clone that can persist for up to 14 years. Step 2 is that in a small fraction (~1%) of patients with a covert pre-leukaemic clone, an abnormal immune response to one or more common infections triggers (probably via transforming growth factor-β (TGFβ) and possibly other cytokines) activation-induced cytidine deaminase (AID) activation, which, in combination with V(D)J recombination-activating protein 1 (RAG1) and/or RAG2, induces secondary genetic changes (predominantly copy number alterations (CNAs)). This occurs in patients who carry a covert pre-leukaemic clone and have a deficit of infectious exposures in infancy. The postulated immune deficit in infancy may increase the risk of Step 2 either by failure of immune network modulation and/or by affecting the persistence of a pre-leukaemic clone. SNV, single nucleotide variant.
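The two approximate proportions in this legend can be checked against each other with a short calculation. The sketch below is illustrative only: it assumes that the cumulative risk of ~1 in 2,000 quoted earlier in this Review for childhood acute leukaemia can stand in for the clinical frequency of ALL, and the variable names are not taken from the source.

```python
# Approximate figures quoted in the text; combining them this way is illustrative.
clinical_risk = 1 / 2000      # cumulative risk to age 15 (assumption: proxy for clinical ALL)
initiation_multiplier = 100   # Step 1: initiating lesion ~100 times clinical frequency
progression_fraction = 0.01   # Step 2: ~1% of carriers progress to overt ALL

# Implied frequency of children born with a covert pre-leukaemic clone.
carrier_frequency = clinical_risk * initiation_multiplier

# Applying Step 2 to that carrier pool should recover the clinical risk.
implied_clinical_risk = carrier_frequency * progression_fraction

print(f"Implied carrier frequency at birth: {carrier_frequency:.1%}")
print(f"Implied clinical risk: about 1 in {round(1 / implied_clinical_risk):,}")
```

Under these assumptions the two estimates are mutually consistent: a carrier pool about 100 times larger than the number of clinical cases, with about 1% progressing, returns the starting clinical risk.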

The causal mechanism proposed here is multifactorial, involving patterns of infection, inherited genetics and other modulators of risk including chance and, probably, diet (Box 4). It has a logical coherence139 and is grounded in the fundamental biology of leukaemia and the evolutionary logic of the immune system's network functions. The central thesis posits BCP-ALL as a paradox of progress in developed societies contingent upon a mismatch between the historical or evolutionary programming of the immune system and contemporary lifestyles that restrain opportunities for early-life microbial exposures. Childhood ALL is probably not the only unanticipated, deleterious health consequence of diminished infectious exposure in infancy93. Similar epidemiological associations exist for Hodgkin lymphoma in young adults140 as well as for childhood allergies and autoimmune disease141 (Supplementary Box 4). In all these clinical situations, the common theme is that acquisition of common microbial infections in early life has an impact on later responses of the immune system to challenge and the subsequent presence or absence of pathology93,141,142. Diminished exposure early in life to pathogenic microorganisms has been highly beneficial, reducing infant mortality, but it seems plausible that a suite of illnesses prevalent now in young people in more developed societies, including BCP-ALL, could be due to an unanticipated consequence of this advance93,141.

The infection hypothesis would benefit from further scrutiny, including validation and extension of the animal modelling, but its public health implication is clear. Most cases of childhood ALL are potentially preventable. But how? Lifestyle changes including day care attendance or protracted breastfeeding in the first year of life can be advocated but would be difficult to achieve. A more realistic prospect might be to design a prophylactic vaccine that mimics the protective impact of natural infections in infancy, correcting the deficit in modern societies. Reconstitution or manipulation of the natural microbiome143,144,145,146 or helminth injections147,148 are strategies under consideration for early-life immune disorders in modern societies, including autoimmune and allergic conditions. Oral administration of benign synbiotics (bacteria species such as Lactobacillus spp. and oligosaccharides) can have profound and beneficial modulating effects on the developing immune system149. The results of those endeavours might inform approaches for preventing BCP-ALL. Cross collaboration of scientists working in disparate fields of early-life immune dysfunction — allergy, autoimmune disease and ALL — would be beneficial.