An integrated personal and population-based Egyptian genome reference

A small number of de novo assembled human genomes have been reported to date, and few have been complemented with population-based genetic variation, which is particularly important for North Africa, a region underrepresented in current genome-wide references. Here, we combine long- and short-read whole-genome sequencing data with recent assembly approaches into a de novo assembly of an Egyptian genome. The assembly demonstrates well-balanced quality metrics and is complemented with variant phasing via linked reads into haploblocks, which we associate with gene expression changes in blood. To construct an Egyptian genome reference, we identify genome-wide genetic variation within a cohort of 110 Egyptian individuals. We show that differences in allele frequencies and linkage disequilibrium between Egyptians and Europeans may compromise the transferability of European ancestry-based genetic disease risk and polygenic scores, substantiating the need for multi-ethnic genome references. Thus, the Egyptian genome reference will be a valuable resource for precision medicine.

It is interesting to read through this manuscript and the following positive points are made:
• De novo human genome assemblies are a rare resource; this project is even more eminent as it is from a region whose populations are poorly represented in global genome sequencing initiatives.
• Very good data quality, sequenced at an overall coverage of about 270x.
• Robust comparison in terms of data quality with the published de novo Korean (AK1) and Yoruba genomes.
• Availability of phasing information, resulting in 98.99% of variants being phased.
• Overall, these are high-quality data from an under-represented population, and the analyses performed are extensive and in line with those of the published de novo genome sequencing studies.
Suggestions for further analysis
The following suggestions are made for authors to address:
• Regional Middle Eastern population genome sequence data were not used for comparison, for example in the PCA analysis. Recently, there have been quite a few additions in terms of genome resources from the Middle Eastern populations in Kuwait, Qatar and the GME database. The GME reportedly has a number of Egyptian samples. However, the authors neither mentioned nor used these data, which are geographically closest to the Egyptians.
• Analysis on Runs of Homozygosity and IBD regions is missing, taking into consideration the fact that rate of consanguinity is generally very high among the Arabs.
Reviewer #3 (Remarks to the Author): The manuscript by Wohlers et al. presents a de novo assembly of human genome from an Egyptian individual with their descriptive parameters, plus short-read sequencing data of ten additional Egyptian individuals. The manuscript is descriptive and, therefore, it is not clear the added value of providing a new de novo assembly of a human genome. The authors should stress what is the added value to the reference genome; how this new de novo assembly provides new information to the reference genome besides providing the description of new variants, which could be obtained by a resequencing process. The sequencing of some Egyptian individuals is not an added value since recent whole genome data from Egyptians at decent coverage (~30X) is already available ( As stated above, most of the analyses are descriptive, not performed in depth. An example of that is the European and African admixture approach based on a PC analysis and a description of mitochondrial lineages, which is very basic when dealing with whole-genome sequences that can provide more refined information (nonetheless, most of these refined analyses are already performed in the original paper from Pagani et al 2015 where most of the present data was already published). Another example is the tag-SNP analysis that is merely descriptive.
In sum, the authors should make an effort to explain the added value of a de novo assembly of a human genome and refine the analyses beyond the description of variants.
1. Line 23-24; Authors claim variants are phased to maternal and paternal haplotypes. In Line 67, authors claim they generated a "phased de novo assembly". In fact, the authors generated a collapsed assembly and identified / phased small variants using 10X Genomics linked-reads using GRCh38 as the reference, not the assembly. Linked-reads enable local phasing on short variants. Without having the parental genotypes, it is not possible to identify the paternal / maternal haplotypes. This needs to be corrected. For claiming a 'phased de novo assembly', one would expect to see SINE/LINE repeats when comparing both haplotypes to each other, which is not shown in this manuscript. In fact, a fully phased assembly requires phasing large structural variations, including the sex chromosomes. I understand obtaining true haplotype is out of scope. It would be better to be focused on variants commonly shared among Egyptian population. This needs to be better communicated.
Response: We fully agree with the reviewer's comments and apologize for using the misleading terms maternal/paternal haplotypes and phased assembly. As the reviewer states, we generated a collapsed assembly and performed short variant phasing for the same individual without considering parental information. We agree that this should have been made clearer, especially in light of recent advances towards obtaining fully phased assemblies. We rephrased the abstract (line 24-25) and the main text ("phased" removed on line 71, line 254) accordingly.

2. Is the assembled individual a male? Or a female? How are the sex chromosomes assembled? If the sex chromosomes weren't manually curated, it needs to be mentioned at least once in the main text that the analysis was only performed on the autosomes.
Response: The assembly individual is male. We added this information to the main text (line 81). Sex chromosomes have been assembled, but were not manually curated. This information is added to the main text (line 90).
3. The authors chose WTDBG2-based assembly as the base line, "because it performs comparable or better according to various quality control (QC) measures". Which metrics did the authors consider specifically? The EGYPT_falcon assembly seems to have better k-mer based completeness. Continuity (NG50 and above) also seems better in the falcon version. The num. of misassemblies and k-mer misjoins could be an artifact caused by real structural differences compared to the reference (GRCh38).

4. When comparing the meta-assembly to GRCh38, were there any novel insertions / complex sequences found that are commonly shared among the Egyptian population? All variant analysis is using GRCh38 as the reference. Not using the newly assembled Egyptian individual seems rather odd. Including variants that can be found only from the newly assembled genome, which are typically hard to call from short reads using only GRCh38 as the reference, will significantly improve the overall impact of this manuscript.
Response: This is a very good question and remark. To address this point, which is also made by reviewer #3, we searched for novel sequences in our assembly that are absent from the GRCh38 reference. The corresponding paragraph now added to the main text reads:
==================================================================
Using the EGYPT de novo assembly, we searched for unique insertions that are common in Egyptians. Towards this, we first mapped all short-read data against the GRCh38 reference genome and to other decoy or alternative haplotype sequences from the GATK bundle. All reads that could not be mapped were subsequently mapped against the EGYPT de novo assembly. A similar approach was recently applied to identify novel, unique insertions in de novo assemblies of 17 individuals from 5 populations using 10x Genomics sequencing 36. Altogether we identified 40 unique insertions longer than 500 bp with a total length of 40 kb, requiring every base in the identified region to have a minimal coverage of 5 reads in at least 10 Egyptian individuals (Suppl. Table 9). Of these sequences, 28 have been mentioned before by Wong et al. 36, and 10 more in different studies within the last 15 years 37, 38, 39, 40. Two out of the 40 insertions are most likely novel. In addition, one region contains three unique insertions, of which two contain additional, novel sequences longer than 500 bases. Closer inspection reveals that these sequences are located within a region of two 50 kb gaps. This large reference genome region, which contains the largest gap-covering sequence reported for AK1 2, is not resolved yet. Overall, we identified 330 single nucleotide variants and indels in 36 of 40 non-reference sequences (Suppl. Table 10). The percentage of reads that could not be mapped to GRCh38 or GATK bundle sequences, but which were mappable against the de novo assembly, is on average 8.6%, but for some individuals up to 34.2% (cf. Suppl. Fig. 30).
Previously unmapped short reads of 110 Egyptians covered positions for more than 19 Mb of the Egyptian de novo assembly. Unique sequences that are commonly shared among Egyptians illustrate that additional reference genomes are needed to capture the genetic diversity that is assessable neither by short-read sequencing nor with the current human reference genome. In addition, the large number of assembly positions covered by short reads that could not be mapped to the reference genome GRCh38 (including widely used supplementary sequences included in the GATK bundle) indicates a need for further assembly-based reference data and for new approaches to better capture genetic diversity.
==================================================================
This corresponding section was added to the Methods in the main manuscript:
==================================================================
Unique inserted sequences
We trimmed Illumina short sequencing reads of 110 Egyptian individuals using FASTP 0.20.0 with default parameters, mapped the output reads to GRCh38 and GATK bundle sequences using BWA 0.7.15-r1140 and sorted by chromosomal position using SAMTOOLS 1.3.1. Subsequently, we extracted reads that did not map to GRCh38 using SAMTOOLS with parameter F13 (i.e. read paired, read unmapped, mate unmapped) and repeated the mapping and sorting using the Egyptian de novo assembly. We merged the read-group specific BAM files for each sample and calculated the per-base read depth using SAMTOOLS. Afterwards, we aggregated the results via custom scripts and extracted uniquely inserted sequences from the Egyptian de novo assembly. Insertions were defined as contiguous regions of at least 500 bp having a coverage of more than 5 reads per base in 10 or more samples. Lastly, we BLASTed the obtained sequences against the standard databases (option nt) for highly similar sequences (option megablast) using a custom script.
For the uniquely inserted sequences that we identified, we created a pileup over all BAM files containing the reads that did not map to GRCh38 using SAMTOOLS. Based on these pileups, we then called the variants using BCFTOOLS. Variants with quality of more than 10 were kept.
==================================================================
5. Table 1 shows initial results from QUAST-LG. Additional validation needs to be provided regarding the number of misassemblies, as QUAST-LG does not account for population- or individual-specific variation that could be counted as misassemblies when aligning an assembly that is structurally different from the reference. This will penalize genomes more divergent from the current GRCh38, of which >70% represents one individual of African-European ancestry (Schneider et al, 2017).
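The insertion criterion and the read-extraction flag described in the Methods excerpt above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' custom scripts: the function names and the in-memory depth matrix (one list of per-base depths per sample) are hypothetical, and the depth threshold is a parameter because the main text ("minimal coverage of 5 reads") and the Methods ("more than 5 reads") word it slightly differently. The flag decoder only shows that the value 13 is the sum of the SAM bits for "read paired", "read unmapped" and "mate unmapped".

```python
from typing import List, Sequence, Tuple

# SAM flag bits as defined in the SAM specification.
SAM_FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_sam_flag(flag: int) -> List[str]:
    """Return the names of all bits set in a SAM flag value."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


def supported_mask(
    depths: Sequence[Sequence[int]],
    depth_above: int = 5,   # "more than 5 reads per base" (Methods wording)
    min_samples: int = 10,  # "in 10 or more samples"
) -> List[bool]:
    """Mark positions where at least `min_samples` samples exceed `depth_above`."""
    n_pos = len(depths[0]) if depths else 0
    mask = []
    for pos in range(n_pos):
        n_ok = sum(1 for sample in depths if sample[pos] > depth_above)
        mask.append(n_ok >= min_samples)
    return mask


def contiguous_regions(mask: Sequence[bool], min_length: int = 500) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals of consecutive True values
    that are at least `min_length` positions long."""
    regions = []
    start = None
    for i, ok in enumerate(mask):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start >= min_length:
                regions.append((start, i))
            start = None
    if start is not None and len(mask) - start >= min_length:
        regions.append((start, len(mask)))
    return regions
```

In practice the depth matrix would be built contig by contig from per-base depth output rather than held in memory for a whole assembly; the sketch only makes the two thresholds and the length cutoff explicit.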

Minor comments 1. Reference 8 and 9 are swapped
Response: We corrected the swapped references 8 and 9.

Supp. 3-7 needs improvements. Sort the GRCh38 by chromosome numbers and note in the label.
Response: We adjusted the tables accordingly.

Response:
We generated the suggested graphical view using the 10x Genomics visualization software LOUPE and added it to the Supplement (Suppl. Fig. 48), referring to it in the caption of Fig. 4. The phase block in which the BRCA2 gene lies is about 8 Mb long (chr13:25,831,216-33,523,430). However, when zooming out to see the phase block boundaries, variants are no longer displayed. Thus, we do not display the phase block boundaries.

Reviewer #2 (Remarks to the Author):
It is interesting to read through this manuscript and the following positive points are made:
• De novo human genome assemblies are a rare resource; this project is even more eminent as it is from a region whose populations are poorly represented in global genome sequencing initiatives.
• Very good data quality, sequenced at an overall coverage of about 270x.
• Robust comparison in terms of data quality with the published de novo Korean (AK1) and Yoruba genomes.
• Availability of phasing information, resulting in 98.99% of variants being phased.
• Overall, these are high-quality data from an under-represented population, and the analyses performed are extensive and in line with those of the published de novo genome sequencing studies.

Suggestions for further analysis
The following suggestions are made for authors to address:
• Regional Middle Eastern population genome sequence data were not used for comparison, for example in the PCA analysis. Recently, there have been quite a few additions in terms of genome resources from the Middle Eastern populations in Kuwait, Qatar and the GME database. The GME reportedly has a number of Egyptian samples. However, the authors neither mentioned nor used these data, which are geographically closest to the Egyptians.

Response:
We fully agree with reviewer #2 that it is very interesting to resolve Egyptian admixture on a finer scale with respect to geographically close populations. So far, however, we had restricted our population genetic analysis to whole genome sequencing-based data. Until last year the 1000 Genomes data set was the only available source, which, however, does not contain North African or Middle Eastern populations or individuals. For the revision, we thus included additional population genetic analyses for world-wide populations, comprising the currently largest number of diverse populations. Towards this, we included a very recent paper (March 2020) which published whole genome sequencing-based variant data covering 929 individuals from 54 diverse human populations (Bergström et al., 2020). This allowed us to add variant data from geographically close populations, which are either based on SNP arrays (Fernandes et al., 2019) or whole exome sequencing (Kuwait, Qatar, GME). In fact, we obtained the Greater Middle East (GME) data set, the most comprehensive of the region, from dbGaP and performed admixture analysis. Unfortunately, the data downloaded from dbGaP did not include sufficiently detailed population annotation (e.g. samples are annotated only with "Arabic"). We contacted the corresponding author on January 30 and again on March 11 asking for the annotation that was used in the GME paper, but have not received a reply so far and had to leave out this data set for now. Alternatively, we obtained the largest number of geographically close populations by combining all available whole genome data  Table 12. We now also included an admixture analysis, which identifies world-wide genetic components that occur in Egyptians.
The summary of this more sophisticated population genetic analysis replaces in the main section the previous population genetics paragraph that used only 1000G data and reads as follows:   Fig. 2c). To preclude a technical bias when intersecting WGS with SNP array data, we compared the analysis results when using whole genome data only or when intersecting WGS data with SNP arrays and found comparable results in both cases (Suppl. Fig. 38). The Egyptian PCA location is further supported by an admixture analysis. Our analysis specifies k=24 as the optimal number of genetic components for the entire data set, i.e. having the smallest cross-validation error (see Suppl. Fig. 39 for results for k=10 to k=25). Accordingly, the genetics of Egyptian individuals comprises four distinct population components that sum up to 75% on average. Egyptians have a Middle Eastern, a European / Eurasian, a North African and an East African component with 27%, 24%, 15% and 9% relative influence, respectively (see Fig. 2a). According to our cohort, Egyptians show little genetic heterogeneity, with little variance in the proportion of individual components between the individuals (Suppl. Figs 40 and 41). With a focus on populations from the Horn of Africa, the four components we identified have been described before by Hodgson et al. 44 in a cohort of 2,194 individuals from 81 populations (mainly 1000 Genomes and HGDP) and substantially fewer variants (n=16,766), but also including 31 Egyptians. They and others hypothesize that most non-African ancestry, i.e. the Eurasian / European and Middle Eastern components in the populations from North Africa and the Horn of Africa, results from prehistoric back-to-Africa migration 44, 24. Recently, Serra-Vidal et al. described North Africa as a "melting pot of genetic components", also attributing most genetic variation in the region to prehistoric times 45.
Here, we confirm previously identified genetic components, yet using 2.5 times as many individuals, and using WGS data for the majority of them. This is thus the most comprehensive data set to date on genetic diversity world-wide and in this region.
==================================================================
The corresponding Methods section has been changed to
==================================================================

Population genetics
For population genetic analyses, we compared the Egyptian variant data with variant data from five additional sources specified in Suppl. Methods contain additional details on the analyses. We slightly changed the paragraph on mitochondrial haplogroups, now stating that these support our admixture results.
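The choice of k by smallest cross-validation error mentioned in the revised main-text paragraph can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes log lines of the form `CV error (K=24): 0.5` as printed by ADMIXTURE-style tools (the response does not name the software), and the function names are hypothetical.

```python
import re
from typing import Dict, Iterable

# Assumed log line format, e.g. "CV error (K=24): 0.5"
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.eE+-]+)")


def parse_cv_errors(lines: Iterable[str]) -> Dict[int, float]:
    """Collect the cross-validation error reported for each K."""
    errors: Dict[int, float] = {}
    for line in lines:
        match = CV_LINE.search(line)
        if match:
            errors[int(match.group(1))] = float(match.group(2))
    return errors


def best_k(errors: Dict[int, float]) -> int:
    """Return the K with the smallest CV error (ties go to the smaller K)."""
    if not errors:
        raise ValueError("no cross-validation errors found")
    return min(errors, key=lambda k: (errors[k], k))
```

Applied to the logs of runs for k=10 to k=25, `best_k(parse_cv_errors(lines))` would return the k with the lowest reported error, which the response states is k=24 for the combined data set.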
• Analysis on Runs of Homozygosity and IBD regions is missing, taking into consideration the fact that rate of consanguinity is generally very high among the Arabs.

Response:
Reviewer #1 also mentioned that current variant analyses use GRCh38 as a reference and that "Including variants that can be found only from the newly assembled genome -typically hard to call from short-reads only using GRCh38 as the reference -will improve the overall impact significantly of this manuscript.". To address this common and valid point, we performed additional analyses to characterize and identify genetic variation that cannot be obtained from GRCh38 or a resequencing process. Please refer to our answer to remark 4 of reviewer #1.
Expansion in North Africa" in November 2019, after the submission of our manuscript; their study includes 2 WGS Egyptians and a few SNP array-based Egyptians from the Human Origins data set, which are also contained in our admixture analysis. In the revised manuscript, we cite them and state that they attribute most genetic variation in the region to prehistoric times. In summary, to the best of our knowledge, the data set we compiled constitutes the most comprehensive Egyptian whole genome sequencing-based variant data to date.
In fact, most of the analyses performed in the manuscript use the large dataset available from Pagani et al 2015.

Response:
We respectfully disagree. The paper of Pagani et al. is titled "Tracing the Route of Modern Humans out of Africa by Using 225 Human Genome Sequences from Ethiopians and Egyptians" and it relates to the field of archaeogenetics. The only type of analysis that the paper of Pagani et al. and our paper share is the genetic characterization of the Egyptians with respect to other populations using principal component analysis. For this we chose all individuals from the African and European populations in 1000G in the first version of the paper, and many more individuals and data sets in the revised version, while Pagani et al. selected 1000G and further population data they deemed relevant to trace the route of modern humans out of Africa.
As stated above, most of the analyses are descriptive, not performed in depth. An example of that is the European and African admixture approach based on a PC analysis and a description of mitochondrial lineages, which is very basic when dealing with whole-genome sequences that can provide more refined information (nonetheless, most of these refined analyses are already performed in the original paper from Pagani et al 2015 where most of the present data was already published). Another example is the tag-SNP analysis that is merely descriptive.

Response:
We believe that our descriptive analyses of a population for which very little genetic information has been assessed so far form the basis of many ensuing in-depth analyses. This is particularly true when using the latest sequencing and assembly techniques together with novel genome analysis methods. We provide the data used for the descriptive summaries within Supplementary Tables, which can be the starting point for many in-depth analyses. For example, the data underlying the tag SNP analysis are provided in Suppl. Table 17, which lists for every tag SNP the associated GWAS catalog data including the linked diseases, and we discuss the Alzheimer's disease SNP rs2075650 as an example. We deliberately omitted lengthy discussions of individual findings within the main manuscript, because we believe this is beyond the scope of a paper that introduces a comprehensive Egyptian genome reference for the first time.
With respect to reviewer #3's impression that particularly the population genetics analysis is not performed with sufficient depth, this analysis is now significantly extended and includes geographically closer populations, as suggested by reviewer #2. Additionally, we used SNP array-based variant data from 398 additional Egyptian individuals to replicate population genetics analyses using a different cohort and different set of variants. For details on the updated population genetics analysis, please refer to our answers for reviewer #2.
In sum, the authors should make an effort to explain the added value of a de novo assembly of a human genome and refine the analyses beyond the description of variants.

Response:
We addressed this point with two extensive additional analyses: (1) by targeting genetic variation that can be obtained from the de novo assembly, but not from the reference genome GRCh38.
(2) by a comprehensive and more sophisticated population genetics analysis comprising more than 4000 individuals.