Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program

The Trans-Omics for Precision Medicine (TOPMed) program seeks to elucidate the genetic architecture and disease biology of heart, lung, blood, and sleep disorders, with the ultimate goal of improving diagnosis, treatment, and prevention. The initial phases of the program focus on whole genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here, we describe TOPMed goals and design as well as resources and early insights from the sequence data. The resources include a variant browser, a genotype imputation panel, and sharing of genomic and phenotypic data via dbGaP. In 53,581 TOPMed samples, >400 million single-nucleotide and insertion/deletion variants were detected by alignment with the reference genome. Additional novel variants are detectable through assembly of unmapped reads and customized analysis in highly variable loci. Among the >400 million variants detected, 97% have frequency <1% and 46% are singletons. These rare variants provide insights into mutational processes and recent human evolutionary history. The nearly complete catalog of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and non-coding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and extends the reach of nearly all genome-wide association studies to include variants down to ~0.01% in frequency.


Summary paragraph
The Trans-Omics for Precision Medicine (TOPMed) program seeks to elucidate the genetic architecture and disease biology of heart, lung, blood, and sleep disorders, with the ultimate goal of improving diagnosis, treatment, and prevention. The initial phases of the program focus on whole genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here, we describe TOPMed goals and design as well as resources and early insights from the sequence data. The resources include a variant browser, a genotype imputation panel, and sharing of genomic and phenotypic data via dbGaP. In 53,581 TOPMed samples, >400 million single-nucleotide and insertion/deletion variants were detected by alignment with the reference genome. Additional novel variants are detectable through assembly of unmapped reads and customized analysis in highly variable loci. Among the >400 million variants detected, 97% have frequency <1% and 46% are singletons. These rare variants provide insights into mutational processes and recent human evolutionary history. The nearly complete catalog of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and non-coding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and extends the reach of nearly all genome-wide association studies to include variants down to ~0.01% in frequency.
Advancing DNA sequencing technologies and decreasing costs are enabling researchers to generate collections of human genetic variation at an unprecedented scale 1,2 . For these advances to improve understanding of human health and disease, they must be deployed in well phenotyped human samples and used to build resources such as variation catalogs 2-4 , control collections 5,6 and imputation reference panels 7-9 that will enhance all ongoing human genetic studies. Here, we describe high-coverage whole genome sequencing (WGS) of 53,831 TOPMed samples for which data are now available to qualified researchers through dbGaP.
Long-term goals of the TOPMed program include (1) characterizing the genetic architecture of phenotypic variation in heart, lung, blood, and sleep (HLBS) disorders and related phenotypes; (2) identifying causal genetic variants and assessing how they may interact with environmental factors; (3) characterizing the spectrum of disease types; (4) understanding ethnic differences in these disorders; and (5) establishing a foundation for personalized interventions for disease prediction, prevention, diagnosis, and treatment. We are pursuing these goals through generation of whole genome sequence (WGS) and "omics" data for research participants of diverse backgrounds and with deep phenotypic characterization through ongoing studies, and by developing analytical tools to effectively mine the resulting data.
Here we describe WGS from the first 53,831 TOPMed samples selected from data sets that are now available via dbGaP controlled-access (Supplementary Tables 1 and 2); additional data will be made available as quality control (QC), variant calling and dbGaP curation are completed. Our work identifies and characterizes the rare variants that comprise the majority of human genomic variation 7,12-14 and extends previous efforts that relied on genotyping arrays [15][16][17] , low-coverage WGS 7,8 , exome sequencing 2,12,18 , or analyses of smaller sample collections [19][20][21][22] . Since rare variants represent more recent and potentially more deleterious changes that can impact protein function, gene expression, or other biologically important elements, their discovery and study are crucial for understanding the genetics and biology of human health and disease 13,23,24 . The TOPMed WGS dataset is the largest collection of human genetic variants that is now broadly available to researchers (through the BRAVO browser 25 ) and has substantially more ancestral diversity than other whole genome datasets 2,26 (Supplementary Figure 3).

TOPMed WGS
TOPMed WGS data processing is performed periodically to produce genotype data "Freezes" that include all samples available at a given time. The Freeze 5 genotype call set was based on 65,000 samples, of which~55,000 have now been curated and released on dbGaP. The 53,831 samples described here are drawn from this set.
To evaluate the reproducibility of our genotype calls, non-reference allele discordance was estimated for 378 samples sequenced in duplicate. For each pair of duplicates, discordances were counted as any difference in called genotypes, regardless of read depth. For single nucleotide variants (SNVs), the median discordance rate across duplicate pairs was 0.040% for passing and 33% for variants failing site-level quality filtering. For indels, the median discordance rate was 0.61% for passing and 23% for failing variants. Considering only singleton variants in a set of unrelated samples, the median discordance was 0.23% for passing, 34% for failing SNVs, 0% for passing indels, and 47% for failing indels. These reproducibility estimates indicate the high quality of the genotype calls, although they may be optimistic because members of each duplicate sample pair processed and called jointly.
We also evaluated potential benefits from high coverage WGS relative to exome sequencing (depth >30X) 18 and low coverage WGS (depth >6X) 22 in 430 Framingham Heart Study samples. TOPMed WGS identified 23.8 million variants in these samples, compared with 20.5 million variants in low coverage sequencing analysis (a 16% increase). Essentially all the newly observed variants had minor allele frequency <0.5%. When we restricted the analysis to coding regions, TOPMed identified~17% more coding variants than both low-coverage WGS and exome sequencing ( Supplementary Table 3 Table 4). Conversely, we found lower proportions of singletons in putative transcriptional repressor CTCF (45.4%) and transcription factor binding sites (45.7%), suggesting these regions are relatively tolerant of variation.
We identified an average of 3.78 million variants in each studied genome. Among these, an average of 30,207 were novel (0.8%) and 3,510 were singletons (0.1%). Thus while there are vast numbers of rare variants in humans, only a few of these are present in each genome and large numbers of genomes must be studied to draw inferences about their function. Also striking is the observation that while, among all variants, we observed 3.17 million non-synonymous and 1.53 million synonymous variants (a 2.1:1 ratio), individual genomes contained similar numbers of non-synonymous and synonymous variants (11,743 non-synonymous, 11,768 synonymous, on average, Table 1). The difference can be explained if more than half of non-synonymous variants are removed from the population before they become common and, indeed we observed a relative paucity of non-synonymous variants at higher frequencies (56,428 non-synonymous to 52,889 synonymous at >0.5% frequency; 21,150 non-synonymous to 22,018 synonymous at >5% frequency).

Protein loss of function variants
A particularly interesting class of variants are the 228,966 putative loss of function (pLoF) variants, which include premature stop codons, frameshifts, and mutations in splice acceptor and donor sites ( Table 2). This class includes the highest proportion of singletons among all variant classes we examined. An average individual carried 2.5 unique pLoF variants; across all individuals, we observed pLoF variants in 18,493 (95.0%) of GENCODE v29 27 genes. Significantly, we identified more pLoF variants per individual than in previous surveys based on exome sequencing --an increase that was mainly driven by the identification of additional frameshift variants (Supplementary Table 5) and by more uniform and complete coverage of protein coding regions (for example, whereas~88.9% of protein coding bases are covered at depth >10X in the Exome Aggregation Consortium 2 (ExAC) data,~98.8% of such bases are covered at depth >10X in TOPMed; Supplementary Figures 4 and 5). Our catalog of variants is available to researchers at the BRAVO browser 28 and through dbSNP. Both are important resources for studies of Mendelian disorders, which must find proverbial "needles-in-a-haystack" by distinguishing novel, likely deleterious variants from other variants segregating in the population.

The distribution of genetic variation
We examined the distribution of variant sites across the genome by counting variants in 1Mb of contiguous sequence ond in 1Mb segments containing ordered concatenations of sequence with similar conservation level (indicated by CADD score) or coding versus non-coding status ( Figure 1). Using 40,772 unrelated samples, we find that the vast majority of human genomic variation is rare 12,13 and located in putatively neutral, non-coding regions of the genome ( Figure  1A). While coding regions have lower average levels of both common and rare variation, we find ultra-conserved non-coding regions with even lower levels of genetic variation, consistent with previous findings 33 ( Figure 1A).
After filtering to focus on regions of the genome that are accessible through short read sequencing, most contiguous 1 Mb segments show similar levels of common (5,141土1,298 variants with MAF≥0.5%) and rare variation (120,414土19,862 variants with MAF<0.5%) ( Figure  1B). However, outlier segments with notably high or low levels of variation do exist, and likely represent biologically interesting genomic regions that can serve as candidates of interest in follow-up analyses. One region of interest on chromosome 8p (GRC 38 positions 1,000,001-7,000,000 bp) has the highest overall levels of variation ( Figure 1B). This region is estimated to have the highest mutation rate across the human genome (with the possible exception of the Y chromosome) 34 . Interestingly, high levels of variation in this region are driven primarily by non-coding variation.
While levels of common and rare variation within segments are significantly correlated (R 2 = 0.462, p-value ≤ 2x10 -16 , Supplementary Figure 6), there are outliers. For example, segments overlapping the Major Histocompatibility Complex (MHC) genes have the highest levels of common variation across the genome while also having some of the lowest levels of rare variation, consistent with the evolutionary consequences of balancing selection [35][36][37] . Segments with a high proportion of coding bases are negatively correlated with overall variation, and feature a strong depletion in the number of common variants and a more modest depletion in rare variants. This is consistent with the expectation that rare variants observed in very large samples (such as ours ) have not yet been effectively pruned by natural selection (Supplementary Figure 7).

Insights into mutation processes
The distribution of singleton variants in large studies such as ours has been only modestly shaped by purifying selection and largely reflects underlying mutation processes 38 . Thus, studying singletons and other rare variants in our sample offers opportunities to dissect the mutation processes that generate extant human variation. We explored the spatial clustering of genetic variants, focusing on 32,058,136 singleton SNVs ascertained in a subset of 2,000 unrelated individuals (equal numbers of African and European ancestry). We particularly tried to dissect multi-nucleotide mutation events where two or more closely-spaced mutations arise simultaneously as a result of error-prone replication and repair mechanisms 39 .
If neighboring singletons were the result of independent mutation events, fewer than 0.07% of the singletons in an individual in this sample should be <100bp apart, while in fact, 2% of our singletons meet this criterion. We suspect closely spaced singletons reflect the consequences of translesion synthesis, an error-prone replication process in which specialized polymerases bypass DNA lesions at stalled replication forks 40 , often causing multiple mutations within very short spans 39 .
To explore the relative importance of different mutation processes and how they influence clustering of genetic variants, we modeled the spatial distribution of singletons in each individual as a mixture of exponential processes (Supplementary Figure 8). Each component represents a distinct mutational process, resulting in unique spacing and mutation patterns. We found clear evidence for four classes of clustered singletons, consistent across individuals and ancestries ( Figure 2A). Class 1 represents singletons occurring an average of~2 -8 bp apart accounted for~1.5% of singletons in each sample These appear to arise through translesion synthesis and are substantially enriched for A>T and C>A transversions ( Figure 2B), consistent with known signatures of translesion synthesis errors 41,42 . Class 2 singletons occurring~500 -5,000 bp apart accounted for~12-24% of singletons. Some of these variants likely result from hypermutability of single-stranded DNA intermediates during repair of double-strand breaks 43,44 and are enriched for C>G transversions. They show prominent subtelomeric concentrations on chromosomes 8p, 9p, 16p, and 16q 43,44 (Figures 2B and 2C, Supplementary Figure 9), consistent with evidence that subtelomeric regions are enriched for double-strand breaks in eukaryotic genomes 45 . Class 3 singletons occurring~30,000 -50,000 bp apart accounted for ~43-49% of all singletons. Class 4 singletons occurring~125,000 -170,000 bp apart accounted for~31-37% of all singletons. Singletons in classes 3 and 4 show the same mutational spectrum as the genomic background, suggesting that these represent independent mutation events whose spacing is mainly driven by variation in local ancestry. Since multi-nucleotide mutations are uniquely implicated in the genetic architecture of complex diseases 46 and the evolutionary history of the genome 47 , our findings illustrate how the TOPMed sample will facilitate more refined and robust approaches to interpreting patterns of genetic diversity.

Beyond SNVs and Indels
To evaluate the potential of our data to generate even more comprehensive variation datasets, we developed and applied a method based on de novo assembly of unmapped and mismapped read pairs, enabling us to assemble sequences present in a sample but absent or improperly represented in the reference.
We identified~200 -500kb of non-reference hominid sequence per sequenced individual (N50 length of 954 bp, range: 200bp to 9,330bp). Merging across~17,000 samples included in this analysis, we collapsed this to 1,627 scaffolds containing 1,932 contigs and spanning 2,179,874bp. From this set, we fully resolved the insertion sequence and precise breakpoint locations on the human genome for 737 contigs (totalling 1,160,253bp, of which 548,749 were inserted bases; largest contig: 12,488bp) and were able to locate one breakpoint for an additional 397 contigs (453,888bp; 323,675 putatively inserted/hanging bases) (Supplementary File 1).
In line with prior observations suggesting that the vast majority of human non-reference sequence is present in the assembled genomes of non-human primates 48,49 , we find that our assemblies likely represent retained ancestral sequences that have been deleted in some human lineages, including on the reference haplotype. Consistent with this, the frequencies of the newly assembled alleles (Supplementary Figure 10) are higher than those observed for SNVs and indels, with 78.3% of the events present in >5% of the samples and only 6% having a frequency <0.5%. Comparing our findings to two previous studies on different smaller datasets 48,49 , 243 sequences (164,099bp retained sequence) are wholly novel. Additionally, we have resolved length and both breakpoints for 137 events (170,133bp) for which only one breakpoint was previously known ( Figure 3D).
Overall, retained ancestral sequences appear evenly distributed across the genome ( Figure  3A), occurring at near expected frequencies with respect to genes. Of the 737 fully resolved events, 331 are intergenic and 406 overlap with GENCODE v29 27 genes (15 exonic, 370 intronic, 21 in promoters, Figure 3B). Interestingly, we identified a sequence, present in all individuals, that fully contains exons of the UBE2QL1 gene (including the translation start site). The sequence is absent from current human annotation but present in Mus musculus and Rattus norvegicus , suggesting this common gene is not currently represented in the reference genome (Supplementary Figure 11).
On average, African ancestry individuals have both more non-reference sequences (350.4 sequences/sample vs 297.7 sequences/sample) and greater total assembly size (226.2kb/sample vs 171.6kb/sample, respectively) than European-ancestry individuals ( Figure  3C). This is consistent with a loss of genetic variation in the out-of-Africa bottleneck. Overall, 5 sequences spanning 15,653bp are found only in African samples, whereas no single fully resolved sequence is unique to Europeans; we do identify one partially resolved 605bp break end in a single European. All these ancestry-specific events are rare (present in <0.5% of samples), whereas the overwhelming majority of events are shared by multiple continental ancestry groups.

Variation in CYP2D6
A complementary approach to de novo genome assembly is to develop approaches that combine multiple types of information -including previously observed haplotype variation, SNVs, indels, copy number, and homology information -to identify and classify haplotypes in interesting regions of the genome. One such region is the CYP2D6 gene locus, which encodes an enzyme that metabolizes approximately 25% of drugs and whose activity varies substantially among individuals [50][51][52] . More than 100 CYP2D6 haplotypes have been described to date, some involving a gene conversion with its nearby nonfunctional but highly similar paralog CYP2D7 . These haplotypes are typically labelled using star alleles (e.g., CYP2D6*1 , *2 ), each defined by SNVs, indels, structural variants (SVs), or some combination of these.
We were particularly interested in understanding whether current variation catalogs for CYP2D6 were complete, especially in non-European individuals. Therefore, we performed CYP2D6 haplotype analysis for 3,418 African American individuals from the Jackson Heart Study (JHS) using the Stargazer program 50 . We called a total of 56 star alleles (51 known and 5 novel) representing increased function, decreased function, and loss of function (Supplementary Table  7). The five novel star alleles were comprised of gene duplications ( CYP2D6*29x3 , *34x2 , and *42x2 ) and multiplications ( CYP2D6*1x3 and *2x3 ). We observed various CYP2D6/CYP2D7 hybrids and extensive copy number variation ranging from zero to five gene copies (Supplementary Figure 12). Based on the CYP2D6 genotypes from 1,923 unrelated JHS individuals, we estimated that 26% of these individuals carry SVs, and that 2%, 9%, 83%, 4%, and 2% of the individuals are poor, intermediate, normal, ultrarapid, and unknown metabolizers 53 , respectively.
Heterozygosity and rare variant sharing among ancestrally diverse sampled individuals The TOPMed variation data also present an opportunity to examine expectations about rare variation studies -for example, which types of studies share the same rare variants and might be expected to provide the most concordant results? Which studies show unique or distinct patterns of variation and might be expected to provide unique insights? We grouped TOPMed participants by study and by self-reported ancestry and calculated genetically determined ancestry components, heterozygosity, number of singletons, and rare variant sharing (Figure 4, Supplementary Table 8).
We find that African ancestry studies have the greatest heterozygosity 7,54 , followed by Hispanic/Latino, European, Amish, East Asian, and then Samoan studies --consistent with a gradual loss of heterozygosity tracking recent African origin of modern humans and subsequent migration from Africa to the rest of the globe. Interestingly, while the East Asian ancestry studies have among the lowest heterozygosity in our sample (even lower than the Amish, a European ancestry founder population with notably low heterozygosity 55,56 ), the East Asian ancestry studies have the greatest singleton counts (in contrast to the Amish, who have the lowest). These observations are consistent with a prolonged bottleneck in East Asian populations followed by extreme recent exponential growth, but could also reflect a more modest sampling of East Asian variation in TOPMed. The Amish on the other hand have a more recent bottleneck and have not experienced as great a growth since founding 55,56 (see Supplementary Information section 1.5).
Using rare variation, we are also able to distinguish fine-scale patterns of population structure (Figure 4, Supplementary Figure 13 ; Supplementary Information section 1.6). Broadly, we observe sharing between studies with shared continental ancestry (whether African, European, Asian, or Hispanic/Latino). Nevertheless, additional patterns emerge. The Amish are unique among the included studies: they exhibit little rare variant sharing with outside groups and also the greatest rare variant sharing within study -consistent with a marked founder effect. Further, we observe~4x greater rare variant sharing between African ancestry studies than between European ancestry studies, even after correcting for sample size differences (Supplementary Figure 14). We also observe that the MESA and GALAII Hispanic/Latino studies share~2x more rare variants with the African ancestry studies than the SAFS and WHI Hispanic/Latino studies, consistent with ADMIXTURE analysis results and the expectation that these studies contain individuals from populations with greater African admixture (Figure 4, Supplementary Figure 13).

Haplotype sharing
A corollary to rare variant sharing is rare haplotype sharing through segments inherited from a recent common ancestor. These identical-by-descent (IBD) segments show similar patterns of within and between study sharing to the rare variants ( Supplementary Figures 15 and 16). The Amish study shows the greatest average within study IBD sharing levels (Supplementary Figure  15), consistent with a founder event 14 generations ago and a subsequently closed society 56,57 . In the Amish and other samples, the distribution of IBD segments allows estimates of effective population sizes over the last 300 generations ( Figure 5). The demographic histories are broadly similar between ancestry groups, with the exception of the Amish, who experienced a more extreme bottleneck when moving to the New World, and the Samoans, who have had a smaller effective population size than the East Asian populations from which they separated ~5000 years ago 58,59 . Both non-Amish European ancestry and African ancestry populations appear to have experienced a bottleneck~5-10 generations ago, consistent with moving to the New World, whether through colonization or forced migration.
Large sample sizes alleviate impact of selection at linked sites The relative numbers of singletons, doubletons and other very rare variants contained at neutral sites can be used to infer human demographic history 13,60,61 . While most demographic inference is currently based on fourfold degenerate sites in protein sequences, these sites evolve under the influence of strong selection at nearby linked protein coding sites 62,63 , which can affect the inferred timing and magnitude of population size changes. WGS enabled us to access intergenic regions of the genome that are minimally affected by selection and complement the analyses of fourfold degenerate sites in protein coding regions. In addition, the recent expansion of the human population has led to an increase in the number of rare mutations in the human genome 12,64 that may not have had time to be influenced by selection at linked sites 64 . While these recent mutations will dominate rare variant portions of the site frequency spectrum (SFS) and may lessen differences caused by selection at linked sites, they can only be identified with extremely large sample sizes. We measured how the site frequency spectrum and demographic inference changes as a function of sample size and an index of selection at linked sites (McVicker's B statistic 65 ). Increasing the sample size demonstrated a more limited impact of selection at linked sites on singleton and doubleton bins of the SFS ( Figure 6A; Supplementary  Figure 17). Estimates of European effective population size based on the 1% of the genome with the weakest effect of selection at linked sites consistently yielded~1.1M million individuals. In contrast, inference at fourfold degenerate synonymous sites was sensitive to sample size until >3,000 chromosomes were included in the analysis, at which point estimates using the two datasets converged ( Figure 6B ; Supplementary Figure 18; Supplementary Table 9). These results suggest demographic inference of recent human history will be facilitated by large sample size and the ability to focus on regions of the genome least affected by selection at linked sites.

Human adaptations
When adaptive mutations occur their frequencies quickly rise, bringing surrounding linked variation to higher frequencies as well. If selection is strong enough, we expect to see regions of low diversity haplotypes flanking these events. We used the Integrated Haplotype Score 66 (iHS) to search for regions where positive selection occurred or is ongoing. Overall, we find three regions with strong evidence of selection in all ancestry groups: one on chromosome 3 and two on chromosome 4. These regions center on three protein coding genes: PIGG 67 , SLC30A9 68 , and STXBP5L 69 (Supplementary Figure 19, Supplementary Table 10).

TOPMed imputation resource
In addition to enabling detailed analysis of the combined genomic and health data for sequenced samples, TOPMed can enhance analyses of any genotyped samples. We constructed a TOPMed-based imputation reference panel of 60,039 individuals, including 239,756,147 SNVs and indels (see Supplementary Table 11 for breakdown by frequency). It is the first imputation reference panel based exclusively on deep WGS in diverse samples and greatly exceeds prior alternatives such as the Haplotype Reference Consortium (HRC) 8 (39,635,008 autosomal SNVs) and the 1000 Genomes Project 7 phase 3 (49,143,605 SNVs and indels) in resolution and accuracy (Supplementary Figure 21). For example, the average imputation quality for variants with frequency 0.001 increased from~0.15 (1000 Genomes, HRC) to 0.91 (TOPMed) in African ancestry genomes and from 0.26 (1000 Genomes) and 0.59 (HRC) to 0.92 (TOPMed) for European ancestry genomes. Similar improvements were observable in other ancestries we considered except South Asians, which would be expected to improve in a future version of the panel that has more representation from this group. The minimum allele frequency at which variants could be well-imputed (r 2 >0.3) decreased from ~0.13-0.21% (1000 Genomes, European or African ancestry) to 0.05% (HRC, Europeans only) to~0.004 -0.006% (TOPMed, European or African ancestry) (Supplementary Figure 20). This means that 93% of the~84,000 rare variants with minor allele frequency < 0.5% in an average African ancestry genome can be recovered through genotype imputation using the TOPMed panel (in comparison to 47% using the 1000G panel). Similarly, 95% of the~69,000 variants with frequency <0.5% in an average European ancestry genome can be imputed using the TOPMed panel (in comparison to 49% using 1000G and 72% using HRC), enabling many human genetic studies to approximate the benefits of WGS for a modest investment in array genotyping and computation.
To illustrate the possibilities, we imputed TOPMed variants in 409,694 White British UK Biobank participants 1 , and performed association analyses for >1,400 binary phenotypes, defined as composites of ICD-10 billing codes 70 . We focused an initial analysis on 49,892 rare (frequency ≤0.5%) pLoF variants that were imputed (compared with 17,727 pLoF variants available from HRC imputation and 3,759 pLoF variants examined by Emdin et al. 71 ). Among other findings, our analysis enabled several rare variant associations with breast cancer to be observed: we found a frameshift variant in CHEK2 (chr22:28695868:AG:A; OR=2.04; P=6.98x10 -22 ; minor allele count=2,111) and a stop gain variant in PALB2 (chr16:23621362:C:T; OR=4.48; P=6.92x10 -14 ; minor allele count=316) to be associated with breast cancer (12,636 cases and 200,417 controls). In addition to these two individual rare variants, we also found that a burden of rare pLoF variants in BRCA2 (comprised of 23 rare pLoF variants; P=2x10 -8 ; cumulative allele frequency cases vs. controls= 0.11% vs. 0.05%), the well-established risk gene for breast cancer, was significantly associated. Variants in these three genes are present in the ClinVar 32 database as potentially pathogenic for familial breast cancer, but this is the first time these variants have been identified through a population of generally healthy adults not ascertained for cancer (Supplementary Table 12). In addition, identifying variant carriers in this sample of 500,000 well-characterized adults will help carefully dissect and understand the natural history and consequences of these variants. Other examples of rare variant association signals included associations with the burden of rare pLoF variants in USH2A and retinal dystrophies (83 cases; 34 rare pLoF variants; P=4x10 -8 ; cumulative allele frequency cases vs. controls= 4% vs. 0.2%) and IFT140 and acquired kidney cysts (1,257 cases; 15 rare pLoF variants; P=3x10 -8 ; cumulative allele frequency cases vs. controls= 0.6% vs. 0.1%). These are all examples of signals that could previously only be studied through direct sequencing.

Conclusion and future prospects
The first set of 53,831 sequenced genomes from TOPMed is now available to the community. The samples are deeply phenotyped and enable discovery of important biology 72-76 for heart, lung, blood, and sleep disorders. In addition, they provide a rich resource for developing and testing methods for surveying human variation, for inference of human demography, and for exploring functional constraint in the genome. Beyond these uses, we expect TOPMed data will improve nearly all ongoing studies of common and rare disorders by providing both a deep catalog of variation in healthy individuals and an imputation resource that enables array-based studies to achieve completeness previously only attainable through direct sequencing.
The WGS and phenotypic resources for TOPMed studies are currently being enriched by applying transcriptomic, epigenomic, metabolomic, and proteomic assays to tens of thousands of samples selected through an open, peer-reviewed process 77 . Investigators affiliated with TOPMed studies share data and collaborate on cross-study analyses in Working Groups, from which many abstracts 78 and publications 79 are emerging. Members of the broader scientific community are using TOPMed resources through the WGS data available on dbGaP, the BRAVO variant server, and soon as an imputation reference panel on the Michigan Imputation Server.
Full utilization of the program's resources by the scientific community will require new approaches to deal with the large size of the "omics" data (petabytes for read alignments, terabytes for genotype call sets); the diversity of phenotypic data types and structures; and the need for data sharing in an environment that supports privacy of participant data and respect for consent. These issues are currently being addressed in partnerships with the NIH Data Commons 80 and NHLBI Data STAGE 81 cloud-computing programs, which have both selected TOPMed as an initial data resource to be used for developing infrastructure and applications. Through these partnerships, investigators will be able to access the full range of TOPMed datasets, search and retrieve specific data types, and perform integrated analyses within a secure cloud environment that also supports data sharing among investigators within collaborative working groups.

DNA samples
Whole genome sequencing (WGS) for the 53,831 samples reported here was performed on samples previously collected and consented from research participants in 33 NHLBI funded research projects. All sequencing was done from DNA extracted from whole blood, with the exception of 17 Framingham samples (cell lines), and HapMap samples NA12878 and NA19238 (cell line) used periodically as sequencing controls.
Whole genome sequencing WGS targeting a mean depth of at least 30X (paired end, 150-bp reads) using Illumina HiSeq X Ten instruments occurred over several years at six sequencing centers (Supplementary Table  13). All sequencing used PCR-free library preparation kits purchased from KAPA Biosystems, equivalent to the protocol in the Illumina TruSeq PCR-Free Sample Preparation Guide (Illumina cat# FC-121-2001). Center-specific details are available from the TOPMed website 82 . In addition, 30X coverage WGS for 1,606 samples from four contributing studies were sequenced prior to the start of the TOPMed sequencing project and are included in this data set. These were sequenced at Illumina using HiSeq 2000 or 2500 instruments, have 2 x 100 bp or 2 x 125 bp paired end reads and sometimes used PCR amplification.
Sequence data processing and variant calling Sequence data processing was performed periodically to produce genotype data "Freezes" that include all samples available at a given time. All sequence was remapped using BWA-MEM 83 to the hs38DH 1000 Genomes build 38 human genome reference including decoy sequences, following the protocol 84 published by Regier et al. 2018 85 . Variant discovery and genotype calling was performed jointly, across TOPMed Parent studies, for all samples in a given Freeze using the GotCloud 86 pipeline. This procedure results in a single, multi-study, genotype call set. A support vector machine (SVM) quality filter for variant sites was trained using a large set of site specific quality metrics and known variants from arrays and the 1000 Genomes Project as positive controls and variants with Mendelian inconsistencies in multiple families as negative controls (see online documentation 87 for more details). After removing all sites with minor allele count less than 2, the genotypes with minimal depth >10X (minDP10) were phased using Eagle 2.4 88 . Sample-level quality control included checks for pedigree errors, discrepancies between self-reported and genetic sex, and concordance with prior genotyping array data. Any errors detected were addressed prior to dbGaP submission. Details regarding WGS data acquisition, processing, and quality control vary among the TOPMed data Freezes. Freeze-specific methods are described on the TOPMed website 89

Identifying putative loss of function variants
Putative loss of function (pLoF) variants were identified using Loss Of Function Transcript Effect Estimator (LOFTEE) v0.3-beta 92 and Variant Effect Predictor (VEP) v94 93 . The genomic coordinates of coding elements were based on GENCODE v29 27 . Only stop-gained, frameshift, and splice site disturbing variants annotated as high-confidence (HC) pLoF were used in the analysis. The pLoF variants with AF > 0.5% or within regions masked due to poor accessibility ( Supplementary Information 1.2).
We evaluated enrichment and depletion of pLoF variants (AF < 0.5%) in gene sets (i.e. terms) from Gene Ontology (GO) 94,95 . For each gene annotated with a particular GO term we computed number of pLoF variants per protein coding base pair, L, and proportion of singletons, S . Then, we tested for lower/higher average L and S in a GO term using bootstrapping (1,000,000 samples) with adjustment for protein-coding gene length (CDS): (1) sort all genes by their CDS length in ascending order and divide them into equal-size bins (1,000 genes each); (2) count how many genes from a GO term are in each bin; (3) from each bin, sample with replacement same number of genes and compute average L and S ; (4) count how many times sampled L and S were lower/higher than observed values. The p-values were computed as the proportion Identification of CYP2D6 alleles using Stargazer's genotyping pipeline Details of the Stargazer genotyping pipeline were described previously 50 . Briefly, SNVs and indels in CYP2D6 were assessed from a VCF file generated using GATK-HaplotypeCaller 97 .
The VCF file was phased using the program Beagle 98 and the 1000 Genomes Project haplotype reference panel. Phased SNVs and indels were then matched to star alleles. In parallel, read depth was calculated from BAM files using GATK-DepthOfCoverage 97 . Read depth was converted to copy number by performing intra-sample normalization 50 . Following normalization, SVs were assessed by testing all possible pairwise combinations of pre-defined copy number profiles against the sample's observed copy number profile. For novel SVs, breakpoints were statistically inferred using changepoint 99 . Information regarding novel SVs was stored and used to identify subsequent SVs in copy number profiles. Output data included individual diplotypes, copy number plots, and a VCF of SNVs and indels that were not used to define star alleles.

Novel genetic variants in unmapped reads
Analysis of unmapped reads was performed using~18,000 samples from TOPMed data Freeze 3. From each sample, we extracted and filtered all read-pairs with at least one umapped mate and used them to discover human sequences absent from the reference. The pipeline included four steps: (i) per-sample de novo assembly of unmapped reads and selection of hominid contigs using the Pan paniscus , Pan troglodyte s, Gorilla gorilla and Pongo abelii genome references; (ii) hominid-reference-based merging and scaffolding of sequences pooled together from all samples; (iii) reference placement and breakpoint calling; and (iv) variant calling. The detailed description of each step is provided in Supplementary Information section 1.4.

Contiguous segment analysis
We excluded indels and multi-allelic variants, and categorized remaining variants as common (AF≥0.005) or rare (AF<0.005), and as coding or noncoding based on protein coding exons from Ensembl 94 100 . Variant counts were analyzed across 2,739 non-empty (i.e. with at least one variant) contiguous 1 Mbp chromosomal segments, and counts in segments at the end of chromosomes with length L <10 6 bp were scaled up proportionally by the factor 10 6 x L -1 . For each segment, the coding proportion, C , was calculated as the proportion of bases overlapping protein coding exons. The distribution of C is fairly narrow, with 80% of segments having C ≤ 0.0195, 99% of segments have C ≤0.067, and only 3 segments having C ≥0.10. Due to the significant negative correlation between C and the number of variants in a segment, and potential mapping effects, we use linear regression to adjust the variant counts per segment according to the model count = ß x C + A + count_adj, where A is the proportion of segment bases overlapping accessibility mask ( Supplementary Information 1.2). Unless otherwise noted, we present analyses and results that use these adjusted count values.

Concatenated segment analysis
Distinct base classifications were defined by both coding and noncoding annotation (based on Ensembl 94 100 ) and CADD in silico prediction scores 101 (downloaded from the CADD server for all possible SNVs). For each base we used maximum possible CADD score (when using the minimum CADD score, results were qualitatively the same). Bases beyond the final base with a CADD score per chromosome were excluded. This resulted in six distinct types of concatenated segments: high (CADD≥20), intermediate (10≤CADD<20), and low (CADD<10) CADD scores for coding and similarly for non-coding. Common (AF≥0.005) and rare (AF<0.005) variant counts were then calculated across these concatenated segments. Multi-allelic variants and those in regions masked due to accessibility were excluded. Counts in segments at the end of chromosomes were scaled up as in the contiguous analysis.

Multi-nucleotide mutations Data
From the TOPMed Freeze 5 dataset, we selected a subset of 1,000 unrelated individuals of African ancestry and 1,000 unrelated individuals of European ancestry, based on the global ancestry proportions inferred using RFMIX 102 . In each sample of 1,000 individuals, we recalculated the allele counts of each SNV and extracted SNVs that were singletons within that sample. Note that a singleton defined here is not necessarily a singleton in the entire TOPMed freeze 5 dataset. We chose to limit the size of each sample to N=1,000 for two reasons: first, to ensure each subsample shared homogenous ancestry, and second, to limit the incidence of recurrent mutations at hypermutable sites, which can alter the underlying mutational spectrum of singleton SNVs in large samples 103 .

Mixture model parameter estimation
For each individual , we collected the set of singletons unique to that individual (with i S singleton status determined relative to other individuals from the same population subsample). Assuming singletons occur independently at a constant rate , we can model the probability of ϕ i observing singletons in individual as a Poisson variable with mean , where is the size in bp of the mappable autosomal regions. G Then, for singleton , let be the distance in base pairs to its nearest neighboring singleton in j d i,j individual . These distances thus follow an exponential distribution with rate : Now suppose the set of singletons are generated by independent Poisson processes, S i K > 1 each with a different rate. Then the distribution of inter-singleton distances across all S i singletons is parameterized as a mixture of exponential component distributions, given by: K and is the proportion of singletons resulting from process ..
We estimate the parameters of this mixture ( ) using the , .., , , .., λ i,1 . λ i,K θ i,1 . θ i,K expectation-maximization (EM) algorithm as implemented in the mixtools R package 104 . To identify an optimal number of mixture components, we iteratively fit mixture models for increasing values of K and calculated the log-likelihood of observed data D given the parameter estimates ( , stopping at K components if the p-value of the likelihood ratio , .., , , .., test between K-1 and K components was >0.01 (chi-squared test with 2 degrees of freedom). The goodness-of-fit plateaued at four components for the majority of individuals, so we used the 4-component parameter estimates from each individual in all subsequent analyses.
Now let indicate which of the four processes generated singleton in individual . We k i,j j i calculated the probability of being generated by process as: k Where we then classified the process-of-origin for each singleton per the following optimal decision rule:

Identification of cluster class 2 hotspots
After assigning singletons to the most likely cluster class, we counted the frequencies in non-overlapping 1Mb windows throughout the genome, and defined hotspots as the top 5% of 1Mb bins containing the most singletons within each ancestry group.

Simulations
To quantify the effects of external branch length heterogeneity on singleton clustering patterns, we simulated singletons under the following model. We performed simulations using a per-site, per-generation mutation rate ranging from 1x10 -8 to 2x10 -8 . Because our aim was to compare these simulated singletons to unphased singletons in the TOPMed data, we randomly assigned each of the 2,000 haploid samples into one of 1,000 diploid pairs, and recalculated the inter-singleton distances per diploid sample, ignoring the chromosome on which each simulated singleton originated.

Evolutionary genetics of diverse ancestry individuals
Rare variant sharing We used 39,722 unrelated individuals that had consent for population genetics research. Each individual was grouped into their TOPMed study, except for individuals from the AFGen project, which were treated as one study ( Supplementary Tables 1 and 2). FHS and ARIC individuals, which overlapped with the AFGen project, remained in their respective studies and were not grouped into the AFGen project. Individuals for whom self-described ancestry was either missing or "other" were removed from the analysis. We then removed all indels, multi-allelic variants, and singletons from the remaining 39,168 individuals. Each study was then split by self-described ancestry. We excluded studies that had <19 samples from the analysis, however all 39,168 samples were used to define singleton filtering. We used the Jaccard Index 106 , J , to determine the intersection of rare variants (2 ≤ sample count ≤ 100) between two individuals divided by the union of that pairs' rare variants, where sample count are the number of individuals with either heterozygote or homozygote. We then determined the average J value between and within each study.
To confirm that J is not biased by sample size, we randomly sampled 500 individuals from each of two European (AFGen, FHS) and two African (COPDGene, JHS) ancestry studies in TOPMed data Freeze 3, without replacement. We then recalculated J between and within these randomly sampled studies from AC = 2 to 100, with AC calculated from these 2,000 individuals.

Identity-by-descent sharing
We used the RefinedIBD program 107 to call segments of IBD having length ≥2 cM on the autosomes using passing SNVs with MAF>5%. All 53,831 samples were included in this analysis, and we used genotype data which had been phased with Eagle2 88 . Since IBD LOD scores are often deflated in populations with strong founding bottlenecks, such as the Amish, we used a LOD score threshold of 1.0 instead of the default 3.0. To account for possible phasing and genotyping errors, we filled gaps between IBD segments for the same pair of individuals if the gap had length at most 0.5 cM and at most one discordant genotype. As a result of the lower LOD threshold, regions with low variant density can have an excess of apparent IBD segments. We therefore identified regions with highly elevated levels of detected IBD using the procedure of Browning and Browning (2015) 108 , and removed any IBD segments falling wholly within these regions.
We divided the data by study and by self-identified ancestry within study. In the analyses of IBD sharing levels and recent effective size, we did not include studies without appropriate consent or ancestry groups with <80 individuals within a study. We calculated the total length of IBD segments for each pair of individuals, and we averaged these totals within each ancestry group in a study and between each pair of ancestry-by-study groups. We also estimated recent effective population size for each group using IBDNe 108 .

Demographic estimation under selection at linked sites
We selected 2,416 samples from the TOPMed data Freeze 3 which (a) had a high percentage of European ancestry; (b) were unrelated; and (c) gave consent for performing population genetics research. More detailed information about ancestry estimation and filters is provided in Supplementary Information 1.7.
We also performed several steps to filter the genome for high-quality neutral sites, which were based off of the ascertainment scheme used by Torres et al. 2018 109 ( Supplementary  Information 1.7). After filtering, positions in the genome were annotated for how strongly affected they are by selection at linked sites by using the background selection (BGS) coefficient McVicker's B statistic 65 . We used all sites annotated with a B value for performing general analyses. However, when performing demographic inference, we limited our analyses to regions of the genome within the top 1% of the genome-wide distribution of B ( B >= 0.994). These sites correspond to regions of the genome inferred to be under the weakest amount of BGS (i.e., under the weakest effects of selection at linked sites). Sites in the genome were also polarized to ancestral and derived states using ancestral annotations called with high-confidence from the GRCh37 e71 ancestral sequence. After keeping only polymorphic di-allelic sites, we had 20,324,704 sites, 191,631 with B >= 0.994. We also identified 91,177 fourfold degenerate synonymous sites that were polymorphic (di-allelic) and had high-confidence ancestral/derived states.
We performed demographic inference with the moments 110 program by fitting a model of exponential growth with three parameters ( N Eur0 , N Eur , T Eur ) to the site-frequency spectrum (SFS). This included two free parameters: the starting time of exponential growth ( T Eur ) and the ending population size after growth ( N Eur ). The ancestral size parameter (i.e, the population size when growth begins), N Eur0 , was kept constant in our model such that the relative starting size of the population was always 1. We applied the inference procedure to either fourfold degenerate sites or sites with B >= 0.994. The SFS used for inference was unfolded and based on the polarization step described above. The inference procedure was fit using sample sizes (2N) of 1,000, 2,000, 3,000, 4,000, and 4,832 chromosomes. To convert the scaled genetic parameters output by the inference procedure to physical units, we used the resulting theta (also inferred by moments ) and a mutation rate of 1.66 x 10 -8 111 to generate corresponding effective population sizes ( N e ). To convert generations to years, we assumed a generation time of 25 years. 95% confidence intervals were generated by resampling the SFS 1,000 times and using the Godambe Information Matrix to generate parameter uncertainties 112 . A more detailed description is available in Supplementary Information 1.7.

Selection
We used 8,377 unrelated individuals selected from the TOPMed data Freeze 3 for which we had consent for population genetic analyses (Supplementary Table 16). We assigned each individual to one of six population labels using k-means clustering on the first 7 PCs as described in Supplementary Information 1.8. Then, we analyzed each population separately. Only bi-allelic sites with unambiguous ancestral state, inferred using WGSA pipeline, were used. Sites near chromosome boundaries and with minor allele frequency <0.05 were also excluded (Supplementary Table 17). We used the program selscan 113 to perform all iHS scans 66 in each population using default parameters. Then we normalized raw iHS scores within 20 frequency bins. We followed Voight et al. 2006 66 to find regions with the largest proportion of extreme iHS scores --putatively selected regions ( Supplementary Information 1.8). A full list of genes found in significant regions, and the populations in which they were found significant, is given in Supplementary File 2. A full list of regions identified as significant in each population can be found in Supplementary Files 3-8.

TOPMed Imputation Panel Construction
We divided each autosomal chromosome and the X chromosome into overlapping chunks (with chunk size 1Mb each and with 0.1Mb overlap between consecutive chunks), and then phased each of the chunks using Eagle v2.4 88 . We removed all singleton sites and compressed the haplotype chunks into m3vcf format 114 . Afterwards, we ligated the compressed haplotype chunks for each chromosome to generate the final reference panel.

Evaluation of imputation accuracy
For all TOPMed individuals, genetic ancestries were estimated using the top four principal components (PCs) projected onto the PC space of 938 Human Genome Diversity Project (HGDP) individuals using verifyBamID2 115 . For each TOPMed individual, we identified the 20 closest individuals from 2,504 individuals from the 1000 Genomes Project Phase 3 (1000G) based on Euclidean distances in the PC space estimated by verifyBamID2. If all of the 20 closest 1000G individuals belonged to the same super-population -among African (AFR), Admixed American (AMR), East Asian (EAS), European (EUR), South Asian (SAS) -we estimated that the TOPMed individual also belongs to that super-population. Among the 60,039 reference panel individuals, 55,365 (92%) were assigned to a super-population, with the following breakdown: AFR=14,764, AMR=4,347, EUR=31,686, EAS=4,429, SAS=139. Of 5,504 additionally sequenced individuals for the BioMe study but not included in the TOPMed reference panel, 4,725 were assigned to a single super-population using this method. We randomly selected 100 individuals from each super-population, and selected markers on chromosome 20 present on the Illumina HumanOmniExpress (8v1-2_A) array. The selected genotypes were phased with Eagle 2.4.1 88 , using the 1000G (n=2,504), Haplotype Reference Consortium (HRC, n=32,470) 8 , and TOPMed (n=60,039) reference panels. The phased genotypes were imputed using Minimac4 116 from each reference panel, and the imputation accuracy was estimated as the squared correlation coefficient (r 2 ) between the imputed dosages and the genotypes calls from the sequence data. The allele frequencies were estimated among all TOPMed individuals estimated to belong to the same super-population, and the r 2 values were averaged across variants in each MAF category. Variants present in 100 sequenced individuals but absent from the reference panels were assumed to have r 2 =0 for the purposes of computing the average r 2 . The minimum MAF to achieve r 2 >0.3 was calculated from the average r 2 in each MAF category by finding the MAF that crosses r 2 =0.3 using linear interpolation. The average number of rare variants (MAF<0.5%) and the fraction of imputable rare variants (r 2 >0.3) were calculated based on the number of non-reference alleles in imputed samples above and below the minimum MAF, assuming HWE.

Imputation of the UK Biobank to the TOPMed panel and association analyses
After phasing the UK Biobank genetic data ( carried out on 81 chromosomal chunks using Eagle v2.4. ), the phased data were converted from GRCh37 to GRCh38 using LiftOver 117 . Imputation was performed using Minimac4 116 .
We tested single putative loss of function (pLoF) nonsense, frameshift or essential splice site variants (determined as described previously) for association with 1,403 traits: PheCodes constructed from composites of ICD-10 codes to define cases and controls. Construction of the PheCodes has been previously described 70 . To perform the association analyses, we used a logistic mixed model test implemented in SAIGE 70 with birth year and the top four PCs (computed from the White British subset) as covariates. For the pLoF burden tests, for each gene with at least two rare pLoF variants (N=8,636 genes) a burden variable was created in which dosages of rare pLoF variants were summed for each individual. This sum of dosages was tested for association with the 1,403 traits using SAIGE. The same covariates used in the single-variant tests were included. For both the single-variant and the burden tests we used 5.0x10 -8 as the genome-wide significance threshold. data for TOPMed. The full study specific acknowledgments are detailed in Supplementary Information section 2.
The UK Biobank analyses were conducted using the UK Biobank Resource under application number 24460.
Other acknowledgments are detailed in Supplementary Information section 3.
The views expressed in this manuscript are those of the authors and do not necessarily represent the views of the National Heart, Lung, and Blood Institute; the National Institutes of Health; or the U.S. Department of Health and Human Services. Tables   Table 1. Coverage, sequencing depth and number of variants.

Total
Singletons (%) Per Individual Figure 1. Distribution of genetic variants across the genome. A. Common (allele frequency ≥ 0.5%) and rare (allele frequency < 0.5%) variant counts are shown above and below the x-axis, respectively, and each 1MB concatenated segment is sorted based on their number of rare variants. Non-coding regions of the genome with CADD scores below 10 (lower predicted function) have the largest levels of common and rare variation (dark and light blue, respectively), followed by low CADD coding regions (black and grey). Overall, the vast majority of human genomic variation is comprised of rare variation.  Each study label is shaded based on their self-described ancestry. From the outside moving inwards each track represents: the unrelated sample size of each study used in these calculations, average ADMIXTURE values, average number of heterozygous sites in each individual's genome, average number of singleton variants in each individual's genome, and the average within study rare variant sharing (RV Sharing). The links depict the 75th percentile of between study rare variant sharing. All between study rare variant sharing comparisons can be found in Supplementary Figure 13. The sample size, heterozygosity, and singleton average values can be found in Supplementary Table 8.   legend). B) The population size inferred in the last generation of an exponential growth model for Europeans. Demographic inference was conducted with different sample sizes and by using different parts of the genome (see legend). Whiskers show 95% confidence intervals (see Supplementary Table 9 for parameter values).