Main

The COVID-19 pandemic has caused considerable morbidity and mortality, and has resulted in the death of over a million people to date3. The clinical manifestations of the disease caused by the virus, SARS-CoV-2, vary widely in severity, ranging from no or mild symptoms to rapid progression to respiratory failure4. Early in the pandemic, it became clear that advanced age is a major risk factor, as are male sex and some co-morbidities5. These risk factors, however, do not fully explain why some people have no or mild symptoms whereas others have severe symptoms. Thus, genetic risk factors may have a role in disease progression. A previous study1 identified two genomic regions that are associated with severe COVID-19: one region on chromosome 3, which contains six genes, and one region on chromosome 9 that determines ABO blood groups. Recently, a dataset was released by the COVID-19 Host Genetics Initiative in which the region on chromosome 3 is the only region that is significantly associated with severe COVID-19 at the genome-wide level (Fig. 1a). The risk variant in this region confers an odds ratio for requiring hospitalization of 1.6 (95% confidence interval, 1.42–1.79) (Extended Data Fig. 1).

Fig. 1: Genetic variants associated with severe COVID-19.

a, Manhattan plot of a genome-wide association study of 3,199 hospitalized patients with COVID-19 and 897,488 population controls. The dashed line indicates genome-wide significance (P = 5 × 10⁻⁸). Data were modified from the COVID-19 Host Genetics Initiative2 (https://www.covid19hg.org/). b, Linkage disequilibrium between the index risk variant (rs35044562) and genetic variants in the 1000 Genomes Project. Red circles indicate genetic variants for which the alleles are correlated to the risk variant (r² > 0.1) and the risk alleles match the Vindija 33.19 Neanderthal genome. The core Neanderthal haplotype (r² > 0.98) is indicated by a black bar. Some individuals carry longer Neanderthal-like haplotypes. The locations of the genes in the region are indicated below using standard gene symbols. The x axis shows hg19 coordinates.

The genetic variants that are most associated with severe COVID-19 on chromosome 3 (45,859,651–45,909,024 (hg19)) are all in high linkage disequilibrium (LD), that is, they are all strongly associated with each other in the population (r² > 0.98), and span 49.4 thousand bases (kb) (Fig. 1b). This ‘core’ haplotype is furthermore in weaker linkage disequilibrium with longer haplotypes of up to 333.8 kb (r² > 0.32) (Extended Data Fig. 2). Some such long haplotypes have entered the human population by gene flow from Neanderthals or Denisovans, extinct hominins that contributed genetic variants to the ancestors of present-day humans around 40,000–60,000 years ago6,7. We therefore investigated whether the haplotype may have come from Neanderthals or Denisovans.
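As an illustration of the linkage disequilibrium measure used above, the following minimal sketch computes the squared correlation between two biallelic sites from phased haplotypes coded as 0 (reference allele) and 1 (alternative allele). The function name and the toy haplotypes are hypothetical and are not taken from the 1000 Genomes Project data; the sketch shows only the standard definition of the statistic, not the LDlink implementation.

```python
def r_squared(site_a, site_b):
    """Squared correlation between two biallelic sites.

    site_a and site_b are equal-length lists of 0/1 alleles, one entry
    per phased haplotype (hypothetical toy input, not real data).
    """
    n = len(site_a)
    if n == 0 or n != len(site_b):
        raise ValueError("sites must be non-empty and of equal length")
    p_a = sum(site_a) / n                                  # frequency of allele 1 at site A
    p_b = sum(site_b) / n                                  # frequency of allele 1 at site B
    p_ab = sum(a * b for a, b in zip(site_a, site_b)) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                                   # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined for a monomorphic site")
    return d * d / denom


# Toy example: ten haplotypes, three of which carry allele 1 at site_1.
site_1 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
site_2 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # always co-occurs with site_1
site_3 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # co-occurs on only two haplotypes

print(r_squared(site_1, site_2))
print(r_squared(site_1, site_3))
```

Two sites whose alleles always occur together give the maximum value of 1, so a threshold of 0.98 selects variants that are almost always inherited together.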

The index variants of the two studies1,2 are in high linkage disequilibrium (r² > 0.98) in non-African populations (Extended Data Fig. 3). We found that the risk alleles of both of these variants are present in a homozygous form in the genome of the Vindija 33.19 Neanderthal, an approximately 50,000-year-old Neanderthal from Croatia in southern Europe8. Of the 13 single-nucleotide polymorphisms constituting the core haplotype, 11 occur in a homozygous form in the Vindija 33.19 Neanderthal (Fig. 1b). Three of these variants occur in the Altai9 and Chagyrskaya 8 (ref. 10) Neanderthals, both of whom come from the Altai Mountains in southern Siberia and are around 120,000 and about 60,000 years old, respectively (Extended Data Table 1), whereas none of the variants occurs in the Denisovan genome11. In the 333.8-kb haplotype, the alleles associated with risk of severe COVID-19 similarly match alleles in the genome of the Vindija 33.19 Neanderthal (Fig. 1b). Thus, the risk haplotype is similar to the corresponding genomic region in the Neanderthal from Croatia and less similar to the Neanderthals from Siberia.

We next investigated whether the core 49.4-kb haplotype might be inherited by both Neanderthals and present-day people from the common ancestors of the two groups that lived about 0.5 million years ago9. The longer a present-day human haplotype shared with Neanderthals is, the less likely it is to originate from the common ancestor, because recombination in each generation will tend to break up haplotypes into smaller segments. Assuming a generation time of 29 years12, the local recombination rate13 (0.53 cM per Mb), a split between Neanderthals and modern humans of 550,000 years9 and interbreeding between the two groups around 50,000 years ago, and using a published equation14, we can exclude the possibility that the Neanderthal-like haplotype derives from the common ancestor (P = 0.0009). For the 333.8-kb-long Neanderthal-like haplotype, the probability of an origin from the common ancestral population is even lower (P = 1.6 × 10⁻²⁶). The risk haplotype thus entered the modern human population from Neanderthals. This is in agreement with several previous studies, which have identified gene flow from Neanderthals in this chromosomal region15,16,17,18,19,20,21 (Extended Data Table 2). The close relationship of the risk haplotype to the Vindija 33.19 Neanderthal is compatible with this Neanderthal being closer to the majority of the Neanderthals who contributed DNA to present-day people than the other two Neanderthals10.
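The reasoning above can be sketched numerically. The code below assumes that the published equation takes a commonly used form: the expected length of a segment shared since the common ancestor is L = 1/(r × t), where r is the recombination rate per base pair per generation and t is the branch length in generations, and the probability of a segment of at least length m is 1 − GammaCDF(m, shape = 2, rate = 1/L). Both this form and the branch length used here (the split time plus the split time minus the time of interbreeding) are assumptions made for illustration; the function and variable names are hypothetical. The results should therefore land near the reported values without necessarily reproducing them exactly.

```python
import math


def ils_probability(length_bp, recomb_cm_per_mb, branch_years, generation_years):
    """Probability that a shared segment of at least length_bp persists
    through incomplete lineage sorting (assumed Gamma, shape 2, model)."""
    r = recomb_cm_per_mb * 1e-8            # cM per Mb -> crossovers per bp per generation
    t = branch_years / generation_years    # branch length in generations
    x = length_bp * r * t                  # segment length divided by expected length 1/(r*t)
    # Survival function of a Gamma distribution with shape 2 and rate r*t
    return (1.0 + x) * math.exp(-x)


SPLIT_YEARS = 550_000          # Neanderthal-modern human split (from the text)
INTERBREEDING_YEARS = 50_000   # time of interbreeding (from the text)
# Assumption: total branch length separating the two lineages
BRANCH_YEARS = SPLIT_YEARS + (SPLIT_YEARS - INTERBREEDING_YEARS)

p_core = ils_probability(49_400, 0.53, BRANCH_YEARS, 29)
p_long = ils_probability(333_800, 0.53, BRANCH_YEARS, 29)

print(f"core 49.4-kb haplotype:  P = {p_core:.2g}")
print(f"long 333.8-kb haplotype: P = {p_long:.2g}")
```

Because the probability falls roughly exponentially with segment length, the 333.8-kb haplotype is many orders of magnitude less likely than the 49.4-kb core to have survived from the common ancestral population.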

A Neanderthal haplotype that is found in the genomes of the present human population is expected to be more similar to a Neanderthal genome than to other haplotypes in the current human population. To investigate the relationships of the 49.4-kb haplotype to Neanderthal and other human haplotypes, we analysed all 5,008 haplotypes in the 1000 Genomes Project22 for this genomic region. We included all positions that are called in the Neanderthal genomes and excluded variants found on only one chromosome and haplotypes seen only once in the 1000 Genomes Project data. This resulted in 253 present-day haplotypes that contained 450 variable positions. Figure 2 shows a phylogeny relating the haplotypes that were found more than 10 times (see Extended Data Fig. 4 for all haplotypes). We find that all risk haplotypes associated with severe COVID-19 form a clade with the three high-coverage Neanderthal genomes. Within this clade, they are most closely related to the Vindija 33.19 Neanderthal.

Fig. 2: Phylogeny relating the DNA sequences that cover the core Neanderthal haplotype in individuals from the 1000 Genomes Project and Neanderthals.

The coloured area highlights the haplotypes that carry the risk allele at rs35044562—that is, the risk haplotypes for severe COVID-19. Arabic numbers indicate bootstrap support (100 replicates). The phylogeny is rooted with the inferred ancestral sequence of present-day humans. The three Neanderthal genomes carry no heterozygous positions in this region. Scale bar, number of substitutions per nucleotide position.

Among the individuals in the 1000 Genomes Project, the Neanderthal-derived haplotypes are almost completely absent from Africa, consistent with the idea that gene flow from Neanderthals into African populations was limited and probably indirect20. The Neanderthal core haplotype occurs in south Asia at an allele frequency of 30%, in Europe at an allele frequency of 8%, among admixed Americans with an allele frequency of 4% and at lower allele frequencies in east Asia23 (Fig. 3). In terms of carrier frequencies, we find that 50% of people in south Asia carry at least one copy of the risk haplotype, whereas 16% of people in Europe and 9% of admixed American individuals carry at least one copy of the risk haplotype. The highest carrier frequency occurs in Bangladesh, where more than half the population (63%) carries at least one copy of the Neanderthal risk haplotype and 13% is homozygous for the haplotype. The Neanderthal haplotype may thus be a substantial contributor to COVID-19 risk in some populations in addition to other risk factors, including advanced age. In apparent agreement with this, individuals of Bangladeshi origin in the UK have about twice the risk of dying from COVID-19 compared with the general population24 (hazard ratio of 2.0, 95% confidence interval, 1.7–2.4).
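The relationship between allele frequencies and carrier frequencies can be illustrated with a short sketch. It assumes Hardy–Weinberg proportions, under which a haplotype at allele frequency p is carried by 1 − (1 − p)² of individuals and is homozygous in p² of them. This is an assumption made for illustration only: the carrier frequencies given above are those observed in the 1000 Genomes Project, so the values computed here only approximate them. The function names are hypothetical.

```python
def carrier_frequency(allele_freq):
    """Fraction of individuals with at least one copy of the haplotype,
    assuming Hardy-Weinberg proportions."""
    return 1.0 - (1.0 - allele_freq) ** 2


def homozygote_frequency(allele_freq):
    """Fraction of individuals with two copies of the haplotype,
    assuming Hardy-Weinberg proportions."""
    return allele_freq ** 2


# Allele frequencies of the core haplotype as given in the text
allele_freqs = {
    "south Asia": 0.30,
    "Europe": 0.08,
    "admixed Americans": 0.04,
}

for region, p in allele_freqs.items():
    print(f"{region}: carriers ~ {carrier_frequency(p):.1%}, "
          f"homozygotes ~ {homozygote_frequency(p):.1%}")
```

Small departures between these expectations and the observed carrier frequencies are unsurprising, because each continental group pools several populations with different allele frequencies.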

Fig. 3: Geographical distribution of the Neanderthal core haplotype that confers risk for severe COVID-19.

Pie charts show the minor allele frequency at rs35044562. Frequency data were obtained from the 1000 Genomes Project22. Map source data were obtained from OpenStreetMap23.

It is notable that the Neanderthal risk haplotype occurs at a frequency of 30% in south Asia whereas it is almost absent in east Asia (Fig. 3). This extent of difference in allele frequencies between south and east Asia is unusual (P = 0.006, Extended Data Fig. 5) and indicates that it may have been affected by selection in the past. Indeed, previous studies have suggested that the Neanderthal haplotype has been positively selected in Bangladesh25. At this point, we can only speculate about the reason for this—one possibility is protection against other pathogens. It is also possible that the haplotype has decreased in frequency in east Asia owing to negative selection, perhaps because of coronaviruses or other pathogens. In any case, the COVID-19 risk haplotype on chromosome 3 is similar to some other Neanderthal and Denisovan genetic variants that have reached high frequencies in some populations owing to positive selection or drift14,26,27,28, but it is now under negative selection owing to the COVID-19 pandemic.

It is currently not known what feature in the Neanderthal-derived region confers risk for severe COVID-19 and whether the effects of any such feature are specific to SARS-CoV-2, to other coronaviruses or to other pathogens. Once the functional feature is elucidated, it may be possible to speculate about the susceptibility of Neanderthals to relevant pathogens. However, with respect to the current pandemic, it is clear that gene flow from Neanderthals has tragic consequences.

Methods

Linkage disequilibrium was calculated using LDlink 4.1 (ref. 29) and alleles were compared to the archaic genomes8,9,10,11 using tabix30 (HTSlib 1.10). Haplotypes were constructed from the phase 3 release of the 1000 Genomes Project22 as described. Phylogenies were estimated with phyML 3.3 (ref. 31) using the Hasegawa–Kishino–Yano-85 (ref. 32) substitution model with a gamma shape parameter and the proportion of invariant sites estimated from the data. The probability of observing a haplotype of a particular length or longer owing to incomplete lineage sorting was calculated as previously described14. The inferred ancestral states at variable positions among present-day humans were taken from Ensembl33. The distribution of frequency differences of Neanderthal haplotypes between east and south Asia was computed by filtering diagnostic Neanderthal variants (fixed positions in the three high-coverage Neanderthal genomes and the Neanderthal allele missing in 108 Yoruba individuals) using a published introgression map20, followed by pruning using PLINK 1.90 (ref. 34) (r² cut-off of 0.5 in a sliding window of 100 variants) and allele frequency assessment in the 1000 Genomes Project. Maps displaying allele frequencies and linkage disequilibrium in different populations were made using Mathematica 11.0 (Wolfram Research) and OpenStreetMap data.

For the meta-analysis carried out by the COVID-19 Host Genetics Initiative2, participants consented and ethical approvals were obtained (https://www.covid19hg.org/partners/). The following eight studies contributed to the meta-analysis of hospitalization versus population controls: Genetic modifiers for COVID-19-related disease ‘BelCovid’ (Université Libre de Bruxelles, Belgium), Genetic determinants of COVID-19 complications in the Brazilian population ‘BRACOVID’ (University of Sao Paulo, Brazil), deCODE (deCODE Genetics, Iceland), FinnGen (Institute for Molecular Medicine Finland, Finland), GEN-COVID (University of Siena, Italy), Genes & Health (Queen Mary University of London, UK), COVID-19-Host(age) (Kiel University and University Hospitals of Oslo and Schleswig-Holstein, Germany and Norway) and the UK Biobank (UK).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.