A pangenome reference of 36 Chinese populations

Human genomics is witnessing an ongoing paradigm shift from a single reference sequence to a pangenome form, but populations of Asian ancestry are underrepresented. Here we present data from the first phase of the Chinese Pangenome Consortium, including a collection of 116 high-quality and haplotype-phased de novo assemblies based on 58 core samples representing 36 minority Chinese ethnic groups. With an average 30.65× high-fidelity long-read sequence coverage, an average contiguity N50 of more than 35.63 megabases and an average total size of 3.01 gigabases, the CPC core assemblies add 189 million base pairs of euchromatic polymorphic sequences and 1,367 protein-coding gene duplications to GRCh38. We identified 15.9 million small variants and 78,072 structural variants, of which 5.9 million small variants and 34,223 structural variants were not reported in a recently released pangenome reference1. The Chinese Pangenome Consortium data demonstrate a remarkable increase in the discovery of novel and missing sequences when individuals are included from underrepresented minority ethnic groups. The missing reference sequences were enriched with archaic-derived alleles and genes that confer essential functions related to keratinization, response to ultraviolet radiation, DNA repair, immunological responses and lifespan, implying great potential for shedding new light on human evolution and recovering missing heritability in complex disease mapping.


S2. Populations and Samples
For the Phase I of CPC, we selected 68 samples from 731 individuals with genomes deep-sequenced using next-generation sequencing. It turned out that we applied a procedure similar to that of HPRC to select the representative samples of a subpopulation 1 . After quality control of the HiFi data, eventually 58 samples were retained for all the further analyses in this study. These include Achang (n=1), Bai

Blood collection and lymphocyte separation
For each participant, 3ml peripheral blood were collected in vacutainer tubes containing EDTA. The blood was diluted with 3mL RPMI1640 and spun at 1200G for 20min at room temperature (24℃). The opaque lymphocyte layer was collected down to the erythrocyte pellet and transferred to another 15-ml centrifuge tube.
The cells were washed with centrifugation in 7mL RPMI1640 at 800G for 8min and then the clear top plasma layer was discarded.

Immortalization of lymphocytes
Lymphocytes were shaken at 120RPM for 1 hour with 4 ml EBV stock from Epstein-Barr virus (EBV)-transformed B95-8 marmoset cell line. The cells were grown at 37℃ in RPMI1640 medium, supplemented with 20% fetal bovine serum and cyclosporine A with a final concentration of 2µg/ml. Every 4 days of incubation, the cultures were examined and photographed by microscopy. The cell aggregates of proliferative lymphoblast cells indicated successful transformation.
These LCLs were limited in their growth to minimize genetic changes that may occur during extended cell culture. The LCLs were also examined for mycoplasma contamination using a polymerase chain reaction-based method and a culture assay; the results showed no infestation. The whole procedure was done in the Immortalize Cell Bank of Chinese Ethnics Groups hosted in the Institute of Medical Biology, CAMS.

DNA preparation for NGS
Genomic DNA was extracted from the blood samples using QIAGEN DNeasy  Table 2) and reads data were quality controlled for ensuring that 80% of the bases achieved at least a base quality score of 30 (Supplementary Table 2).

DNA preparation for TGS
Genomic DNA was extracted from the peripheral blood samples and cells line using the optimized Cetyl Trimethyl Ammonium Bromide (CTAB) based protocol.
Take frozen finely powdered tissues in a new 50 mL Falcon tube and mixed with the pre-heated extraction buffer (10 ml) for one sample. Briefly, vertex for 30sec to ensure sample is fully mixed with buffer. Fine grind is a key to obtaining high DNA quantity with lesser artifacts of resin. Next the falcon tube was kept into the 65°C incubator or water bath and mix gently by inversion after every 10 min till 45 min. After incubation, place the tube at room temperature for five min to reach to room temperature environment. Centrifuge the 50ml falcon tube for 5 min at 3000×g on room temperature. For the Third-Generation Sequencing (TGS), the greater the genome size, lower is speed of initial centrifugation. Transfer 1ml of supernatant to each 2ml Eppendorf tubes already containing 1 ml of chloroform: isoamyl alcohol (24:1). Mix supernatant and chloroform: isoamyl alcohol by gentle inversions for 10 min and subsequently place the tube on ice for 10 min. Centrifuge the tube for 10 min at 5000 × g at 4°C. Transfer the upper aqueous phase into new 2ml tubes and add 5μl RNase A (10 mg/mL). Place the tube for 30 min at 37°C.
After RNase treatment, add one volume of chloroform: isoamyl alcohol again and mix by inversions for 5 min. The tube is centrifuged for 10 min at 5000×g at 4°C.
Transfer the clear supernatant into new 2ml tube and add half volume of 5 M NaCl to the sample and mix gently by inversions. Add 2 volumes of cold 95% ethanol and mix gently by inversion. The tubes are incubated at for 45 min in -20°C freezer.
It should not be for more time as some remaining phenolics and resin may also precipitate with DNA. After incubation, the tubes were centrifuged at 5000 × g for 10 min at 4°C and the supernatant was gently removed. The pellet is washed two times with 1ml of 70% ethanol and the DNA is pellet by 5000 × g at 4°C for only 5 min. The supernatant is discarded and the pellet is air-dried (10 min). The pellet is allowed to re-suspend in 50μL of TE (10 mM Tris. HCl pH 8.0; 1 mM EDTA pH 8.0).

S6. DNA Quantification & Qualification
Total DNA was qualified and quantified as follows: (1) DNA purity and concentration were then examined using NanoDrop 2000; (2) Size distribution of and degradation degree DNA were measured by Pulsed-field Electrophoresis; (3) Accurate quantification of DNA were measured by Qubit Fluorometric Quantitation.

S7. Library construction and sequencing
The library of 15 kb was constructed using a SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, CA, USA). The construction includes DNA shearing, damage repair, end repair, hairpin adapter ligation, size selection and purification of the library. The process is as follows.

DNA shearing.
After examination of the quality of isolated DNA, dilute 2.0 µg gDNA into 200 µL 1X Elution Buffer to a final concentration of 10 ng/µL. Transfer gDNA to g-TUBE and centrifuge at 2400 x g (6000 rpm in the Eppendorf MiniSpin Plus) for 2 minutes. Then repeat spin until entire gDNA sample has passed through the orifice. (This may take 2-3 spins). Transfer the sheared gDNA to a fresh 1.5 mL Lo-bind microfuge tube. After that, concentrate DNA using AMPure PB beads to obtain the aimed DNA fragments for the construction of ＞15kb HiFi libraries.

DNA damage repair and end-repair/A-tailing
Use the following table to prepare the DNA damage repair reaction (Table 1).
Firstly, pipette mix 10 times with wide-bore pipette tips. It is important to mix well.
And spin down contents of tube with a quick spin in a microfuge. The incubate at 37°C for 30 minutes, then return the reaction to 4°C.
Use the following table to prepare the next end-repair/A-tailing processes (Table 2). Pipette mix 10 times with wide-bore pipette tips and spin down contents of tube with a quick spin in a microfuge. Then incubate at 20°C for 10 minutes and at 65°C for 30 minutes, then return the reaction to 4°C. The obtained samples are used to the next step.

Adapter ligation
Use the following table to prepare your reaction, adding the components below in the order listed. Pipette mix 10 times with wide-bore pipette tips, and spin down contents of tube with a quick spin in a microfuge. Then incubate at 20°C for 60 minutes, then return the reaction to 4°C.
After quality control test, a SMRTbell library was obtained. The library was sequenced using a single 8 M SMAT Cell on the PacBio Sequel II/Sequel IIe platform (Pacific Biosciences, CA, USA).

Size selection and purification of SMRTbell Templates
When constructing large insert SMRTbell libraries for whole genome sequencing of complex organisms, it is beneficial to remove small insert SMRTbell templates by performing size selection with the BluePippin System (which collects fragments above a size cut-off threshold). And bring up the volume of eluted, sizeselected DNA SMRTbell templates to 100 µL with 1X Elution Buffer to perform the purification of SMRTbell templates after size selection.

Anneal and Bind SMRTbell Library Templates and sequencing.
Use SMRT Link Sample Setup for instructions for primer annealing and

Sequence quality checking and filtering
The PacBio SMRT-Analysis package (https://www.pacb.com) was used for the quality control of the raw polymerase reads; we removed the following types of polymerase reads: (1) Polymerase reads with length less than 50bp; (2) Polymerase reads with the quality value lower than 0.8; (3) Polymerase reads containing the adaptor ligation to itself;(4)Removed the adaptor sequence in the polymerase reads. HiFi reads were generated by SMRTLink 9.0 software with parameters --min-passes=3 --min-rq=0.99.

S8. Hi-C sequencing
To anchor hybrid scaffolds onto the chromosome, genomic DNA was extracted for the Hi-C library from blood. We followed the standard protocol described previously with certain modifications (PMID: 19815776 and 25497547).
In brief, we constructed the Hi-C library and obtained sequencing data via the Illumina Novaseq-6000 platform. About 1 ml blood sample was washed by PBS buffer and the pellet was resuspended and fixed in 2% formaldehyde in room temperature. Crosslinking was stopped by adding 0.125M glycine. The fixed sample was then resuspending and lysed in Hi-C lysis buffer. The purified nuclei were digested with 100 units of DpnII and marked by incubating with biotin-14-dATP. Biotin-14-dATP from non-ligated DNA ends was removed owing to the exonuclease activity of T4 DNA polymerase. The ligated DNA was sheared into 300-500 bp fragments, and then was blunt-end repaired and A-tailed, followed by purification through biotin-streptavidin-mediated pull down. Finally, the Hi-C libraries were quantified and sequenced using the Illumina Novaseq-6000 platform.

Genomic DNA preparation
DNA degradation and contamination were monitored on 1% agarose gels.

Library preparation and Illumina sequencing
A total amount of 1.5µg DNA per sample was used as input material for the DNA sample preparations. Sequencing libraries were generated using Truseq Nano DNA HT Sample preparation Kit (Illumina USA) following manufacturer's recommendations and index codes were added to attribute sequences to each sample. Briefly, the DNA sample was fragmented by sonication to a size of 350bp, then DNA fragments were end polished, A-tailed, and ligated with the full-length adapter for Illumina sequencing with further PCR amplification. At last, PCR products were purified (AMPure XP system) and libraries were analyzed for size distribution by Agilent2100 Bioanalyzer and quantified using real-time PCR.
These libraries constructed above were sequenced by Illumina NovaSeq-6000 platform and 150bp paired-end reads were generated with insert size around 350bp.

Genomic DNA Preparation
High molecular weight genomic DNA was prepared by the CTAB method and followed by purification with QIAGEN® Genomic kit (Cat#13343, QIAGEN) for regular sequencing, according to the standard operating procedure provided by the manufacturer. The DNA degradation and contamination of the extracted DNA was monitored on 1% agarose gels. DNA purity was then detected using NanoDrop™ One UV-Vis spectrophotometer (Thermo Fisher Scientific, USA), of which OD260/280 ranging from 1.8 to 2.0 and OD 260/230 is between 2.0-2.2. At last, DNA concentration was further measured by Qubit® 4.0 Fluorometer (Invitrogen, USA).

Library preparation and sequencing
SMRTbell target size libraries were constructed for sequencing according to

Genomic DNA preparation
Samples were collected, and high molecular weight genomic DNA was prepared by the CTAB method and followed by purification with QIAGEN® Genomic kit (Cat#13343, QIAGEN) for regular sequencing, according to the standard operating procedure provided by the manufacturer. Ultra-long DNA was extracted by the SDS method without purification step to sustain the length of DNA.
The DNA degradation and contamination of the extracted DNA was monitored on 1% agarose gels. DNA purity was then detected using NanoDrop ™ One UV-Vis spectrophotometer (Thermo Fisher Scientific, USA), of which OD260/280 ranging from 1.8 to 2.0 and OD 260/230 is between 2.0-2.2. At last, DNA concentration was further measured by Qubit® 4.0 Fluorometer (Invitrogen, USA).

Library preparation and sequencing
A total amount of 3-4 µg DNA per sample was used as input material for the ONT library preparations. After the sample was qualified, size-select (>50k and >100k) of long DNA fragments were performed using the PippinHT system (Sage Science, USA). Next, the ends of DNA fragments were repaired, and A-ligation reaction were conducted with NEBNext Ultra II End Repair/dA-tailing Kit (Cat# E7546). The adapter in the SQK-LSK109 (Oxford Nanopore Technologies, UK) was used for further ligation reaction and DNA library was measured by Qubit® 4.0 Fluorometer (Invitrogen, USA).

S14. Pangenome growth
We covert the CPC graph to the vcf file by "vg deconstruct", and calculated the haplotype depth of non-reference segments (presenting as SNPs and insertions in vcf file). The haplotype depth is defined as the frequency of non-reference sequences presenting in the assembled haplotypes. We counted the frequency of segments based on the GT information from the vcf file and calculated the segment length based on the ALT and REF sequences. We removed the super-large segments (> 3 Mb) from the results since they may be caused by graph constructing.

S15. Statistics on variants
We used vg deconstruct (https://github.com/vgteam/vg) to infer the variants and the genotype of assemblies from the graph, and variants with reference allele longer than 10 Mb were removed using vcfhub v1.0 (https://github.com/pangenome/vcfbub) because they are likely to be misjudged as deletions. Variants were classified as small variants (< 50 bp) and SVs (≥ 50 bp), and then were converted to per-sample biallelic VCF files using bcftools v1.14 12 separately to apply individual and haplotype level statistics.

S19. Archaic introgression segments detection and functional annotation
We performed ArchaicSeeker v2.0 23 (AS2) to identify archaic introgression segments (AIS) in the CPC and HPRC genomes. Prior to the analysis, we converted the genomic coordinates in the phased VCFs from human genome assembly GRCh38 to GRCh37 as required by AS2, using LiftoverVcf embedded in Picard toolkit v1.117 (http://broadinstitute.github.io/picard/), and then removed the multiallelic SNPs. We focused on the Neanderthal-like or Denisovan-like AISs identified by AS2, and then used GFF3 (Generic Feature Format Version 3) of GRCh37 version to annotate these AIS. To detect the novel AIS in CPC compared to HPRC, we applied BEDtools 24 to remove the overlapping regions in the AISs identified in CPC and HPRC. The sharing of archaic-introgression segments in any two given populations named as ancestry-sharing ratio was calculated as follows: where ) represents the genome-wide archaic introgression proportion of population i, )! denotes whether there is archaic introgression of population i at position k, ! is the segment length, and = ∑ ! # !+, is the total length of the genome. )* measures the ratio of archaic-introgression-sharing to the random archaic-introgression-sharing of any two populations. We then used pheatmap1.0.12 to generate a heatmap for the ancestry-sharing visualization.
Alternatively, we applied vg giraffe to map the reads of Altai Neanderthal and Denisovan to the graph genome, and then use vg view to get the nodes from the mapped reads. We compared the graph nodes covered by the archaic reads across assemblies, and include those unaligned to the African genomes as potential archaic introgression SNVs/SVs. As linkage disequilibrium was not taken into consideration in this strategy, the estimated archaic introgression proportion would be much smaller than that obtained by ArchaicSeeker 2.0.

S20. Testing for neutrality by estimating Tajima's D
We measured the chromosome-wide genetic diversity by the estimator of Tajima's D 25 which may indicate possible signatures of natural selection. Tajima's D was calculated along the chromosomes with a sliding window of 20 kb in size, advanced by 10 kb, using in-house scripts. The theoretical P-values of Tajima's D were estimated following the β distribution, and were adjusted for multiple comparisons with the FDR method.

S22. Supplementary Tables
Supplementary Table 1  * Uncertain due to inconsistency of self reported sex and that based on genetic data.

Supplementary
The presence/absence of the CNV-related gene in HPRC is indicated with " ○ "/"×"; Archaic proportion is calculated as the length of archaic introgressed segments (AIS) in each gene devided by the total AIS in the genome. The Tajima's D estimates with FDR-adjusted P < 0.05 are highlighted in red fonts. Note: GeneRatio denotes the ratio of input genes that are annotated in a term, and BgRatio denotes the ratio of all genes that are annotated in a term. The P-values are obtained by one-sided Fisher's exact test.

Supplementary
They are further adjusted for multiple comparisons using the Benjamini-Hochberg procedure (denoted as the BH-adjusted P-value), and are also adjusted for the false discovery rate (denoted as the q-value). The complexity of a graph is calculated by the ratio of edges to nodes (edges/nodes). Note: To make it comparable to GRCh38-based gnomAD and GIAB, the number of small variants was calculated from the joint MC pangenome vcf file which used CHM13 as the reference (the first assembly) and the GRCh38 (the second assembly) as the coordinate with parameters "--reference CHM13v2 --vcfReference GRCh38" in the "cactus-graphmap-join" step. Note: GeneRatio denotes the ratio of input genes that are annotated in a term, and BgRatio denotes the ratio of all genes that are annotated in a term. The P-values are obtained by one-sided Fisher's exact test.

Supplementary
They are further adjusted for multiple comparisons using the Benjamini-Hochberg procedure (denoted as the BH-adjusted P-value), and are also adjusted for the false discovery rate (denoted as the q-value). Note: GeneRatio denotes the ratio of input genes that are annotated in a term, and BgRatio denotes the ratio of all genes that are annotated in a term. The P-values are obtained by one-sided Fisher's exact test.

Database Term ID Description GeneRatio BgRatio Odds ratio BH-adjusted P-value q-value
They are further adjusted for multiple comparisons using the Benjamini-Hochberg procedure (denoted as the BH-adjusted P-value), and are also adjusted for the false discovery rate (denoted as the q-value). Note: reference assembly GRCh38 is with 2 copy of RASA4(B) and CHM13 is with 2 copy and a 14.9

Covered by CPC Archaic Introgression Segments
They are further adjusted for multiple comparisons using the Benjamini-Hochberg procedure (denoted as the BH-adjusted P-value), and are also adjusted for the false discovery rate (denoted as the q-value). Note: GeneRatio denotes the ratio of input genes that are annotated in a term, and BgRatio denotes the ratio of all genes that are annotated in a term. The P-values are obtained by one-sided Fisher's exact test.

Supplementary
in the CPC assemblies.
Tajima's D was calculated for the 58 CPC samples (116 haploid assemblies) with a sliding window spanning 20 kb in size, stepping 10 kb. The estimate for each gene was represented by the top window overlapping that gene. The horizontal dashed line shows the threshold of P = 0.05, which is obtained by the two-tailed significance test based on the beta distribution, and is further adjusted for multiple testing using the FDR method; the vertical dashed line indicates the CNV counts ≥ 5.