The QChip1 knowledgebase and microarray for precision medicine in Qatar

Risk genes for Mendelian (single-gene) disorders (SGDs) are consistent across populations, but pathogenic risk variants that cause SGDs are typically population-private. The goal was to develop “QChip1,” an inexpensive genotyping microarray to comprehensively screen newborns, couples, and patients for SGD risk variants in Qatar, a small nation on the Arabian Peninsula with a high degree of consanguinity. Over 10^8 variants in 8445 Qataris were evaluated for inclusion in a genotyping array containing 165,695 probes for 83,542 known and potentially pathogenic variants in 3438 SGDs. QChip1 had a concordance with whole-genome sequencing of 99.1%. Testing of QChip1 with 2707 Qatari genomes identified 32,674 risk variants, an average of 134 pathogenic alleles per Qatari genome. The most common pathogenic variants were those causing homocystinuria (1.12% risk allele frequency) and Stargardt disease (2.07%). The majority (85%) of Qatari SGD pathogenic variants were not present in Western populations such as European American, South Asian American, and African American in New York City and European and Afro-Caribbean in Puerto Rico, and only 50% were observed in a broad collection of data across the Greater Middle East including Kuwait, Iran, and United Arab Emirates. This study demonstrates the feasibility of developing accurate screening tools to identify SGD risk variants in understudied populations, and the need for ancestry-specific SGD screening tools.


INTRODUCTION
A major goal of precision medicine is to optimize medical care for subgroups of patients based on genetic and/or molecular profiling 1 . A challenge in widespread adoption of genetic profiling is the genome variability among different population groups 2 . One example is the identification of pathogenic variants in Mendelian single-gene disorders (SGDs). While the same genes are responsible, there is considerable variability across populations in the specific causative pathogenic variants 3 . For example, while all pathogenic variants causing cystic fibrosis affect the CFTR gene, the common pathogenic variant observed in Puerto Rico 4 is different from the variant observed in Qatar 5 and both are different from the pathogenic variants common in European populations 6 . A recent analysis of ClinVar, the main NCBI database of pathogenic variants causative of SGDs, shows a significant bias towards pathogenic variants observed in European ancestry individuals 2 . As is the case for Hispanics, Blacks, and other non-European groups, SGD pathogenic variants found in Greater Middle Eastern populations are under-reported. Since screening technologies depend on public resources such as ClinVar 7 , OMIM 8 , and the 1000 Genomes Project 9 for source data, there are limited screening platforms to assess SGD pathogenic variants in the Greater Middle East 10 .
A striking example of this is the Qatari population 11,12 . The inhabitants of Qatar include approximately 300 thousand Qataris and 2.5 million expatriates 13 . The Qataris comprise distinct genetic subgroups 11,14 . The proportion of consanguineous marriage among Qataris is high 15 , leading to longer runs of homozygosity 16 . In addition, the tribal nature of marriages, where individuals select a mate from a limited gene pool consisting of members of the same tribe, contributes to a higher chance of homozygosity for a pathogenic founder variant derived from a common ancestor, such as the well-known p.Arg336Cys CBS variant linked to homocystinuria 17 .
In prior studies, we and others have identified SGD pathogenic variants that are common in the Qatari population 3 and in other Greater Middle East populations 18 , including many pathogenic variants that are only observed in Qatari genomes or are at an enriched (higher) risk allele frequency compared to populations outside of the Greater Middle East 14 . At present, there is limited screening of the Qatari population for inherited pathogenic variants 19 .
The focus of this study is to develop "QChip1," a genotyping microarray designed as a research and screening tool capable of enabling precision medicine for Qataris. The aim for QChip1 was to enable accurate and comprehensive screening for SGD pathogenic variants in Qatari newborns, premarital couples, and patients presenting to the clinic. First, we analyzed genetic data from 8445 Qataris, including whole-genome sequence (WGS), whole-exome sequence (WES), and clinical pathology case reports from affected families. Using these data, a Qatari Genome Knowledgebase was constructed, containing known and predicted pathogenic variants in SGDs. Second, with this knowledgebase, QChip1 was designed to assess the Qatari genome for SGD pathogenic variants in the knowledgebase. Third, QChip1 accuracy was confirmed by comparison of QChip1 genotypes to WGS data for a batch of Qatari genomes. Fourth, genomes from Qataris and from residents of New York City (NYC) and Puerto Rico (PR) were genotyped on QChip1 to determine the prevalence of SGD pathogenic variants in Qataris and to compare this to other populations. The analysis demonstrated that QChip1 is highly accurate in identifying deleterious variants in Qataris, and that the majority of pathogenic variants among Qataris are Qatari-specific or Qatari-enriched. Overall, this study demonstrates the value of a custom genotyping array for precision medicine identification of pathogenic variants that cause single-gene disorders in human populations absent from or underrepresented by common knowledgebases used for pathogenic variant screening assay design 7-9,20,21 .
In the interest of the advancement of science and open data sharing, a list of variants on the array, the genes and disorders with a known or potential link to the variants, and the prevalence of these variants in Qatar, Kuwait, NYC, and PR will be made available to the public through the QChip Browser (http://qchip.biohpc.cornell.edu), as well as through our 3rd party data sharing repositories at FigShare (https://figshare.com/projects/QChip1/120108) and NCBI BioProject (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA774497).

Construction of the Qatari Genome Knowledgebase
The Qatari Genome Knowledgebase of single gene coding sequence pathogenic and potentially pathogenic variants was based on sequence data from 8416 Qataris, including 6218 whole-genome sequences of Qataris recruited by the Qatar BioBank (QBB) 22,23 and sequenced by the Qatar Genome Program (QGP) 24,25 , 180 whole-genome sequences 12,26 and 1297 exome sequences 11 (Table 1). After filtering to remove variants observed in multiple cohorts, the analysis yielded 104,473,390 total variants in 20,069 genes in the Qatari population, including 87,813,560 single nucleotide variants (SNVs) and 16,659,829 indels (Table 1); below we refer to this dataset as the Qatar Genome Knowledgebase (QGK). Assessment of QGK for ClinVar pathogenic variants and genes yielded a list of 10,490,820 variants in 3770 genes known to ClinVar. Parallel assessment of QGK for moderate or high impact variants in protein coding genes using SnpEff identified 805,649 variants in 19,770 genes (Table 1, Supplementary Table 2). The SnpEff list of moderate/high impact predicted variants was intersected with the ClinVar list of known variants and known genes to generate a final list of 207,370 pathogenic variants in 3770 genes, including 196,855 SNVs in 3769 genes and 10,515 indels in 1897 genes. This final list of variants included 13,891 (7%) predicted high impact (e.g., nonsense, frameshift, and other loss-of-function variants) and 193,479 (93%) predicted moderate impact (e.g., missense variants).
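The intersection step can be illustrated with a minimal Python sketch. It assumes that each variant record carries a gene symbol and a SnpEff-style impact label, and it keeps a variant when the impact is HIGH or MODERATE and the gene belongs to a set of ClinVar genes. The `Variant` type, the `select_candidates` function, and the toy coordinates are hypothetical and are not taken from the study's pipeline.

```python
from typing import Iterable, List, NamedTuple, Set


class Variant(NamedTuple):
    chrom: str
    pos: int      # hypothetical coordinate, for illustration only
    ref: str
    alt: str
    gene: str
    impact: str   # SnpEff-style label: HIGH, MODERATE, LOW, or MODIFIER


def select_candidates(variants: Iterable[Variant],
                      clinvar_genes: Set[str]) -> List[Variant]:
    """Keep moderate/high impact variants located in ClinVar genes."""
    wanted = {"HIGH", "MODERATE"}
    return [v for v in variants
            if v.impact in wanted and v.gene in clinvar_genes]


# Toy input (made-up records, not real study data).
toy = [
    Variant("21", 1000, "C", "T", "CBS", "MODERATE"),   # kept
    Variant("3", 2000, "G", "A", "BTD", "LOW"),         # impact too low
    Variant("1", 3000, "A", "G", "GENE_X", "HIGH"),     # gene not in ClinVar set
]
selected = select_candidates(toy, clinvar_genes={"CBS", "BTD"})
print([v.gene for v in selected])
```

In this toy input only the CBS record passes both conditions. A real pipeline would also need to handle multiple transcripts per variant, which this sketch ignores.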

Design of QChip1
For each variant in the Axiom QChip design, one or more probesets were added, depending on the computationally predicted difficulty of obtaining a high-quality genotype, the priority of the variant, and available space on the array. QChip0 consisted of a total of 184,713 probes organized in 159,377 probesets for genotyping 91,942 variants in 3540 genes (Table 2). The additional probesets represent variants not previously genotyped by Thermo Fisher (formerly Affymetrix) arrays. For these novel variants (67,435 or 73.3% of 91,942), 2 or more probes were included in the probeset, while for known variants (24,507 or 26.7%) a single probe was included in the probeset.
QChip0 was then tested on 26 Qatari genomes for which WGS was available. Concordance was 99.7% ± 0.002 for n = 61,592 of n = 91,942 variant sites with non-missing genotypes in both WGS and QChip0 for all n = 26 samples. This high-confidence dataset consisted of 70,715 probes in 61,592 probesets for genotyping of 61,592 variants in 3438 genes (61,195 SNV probesets for 61,195 variants in 3476 genes, and 397 indel probesets for 397 variants in 300 genes), resulting in the final design of QChip1 (Table 2). Of these probesets, 61,565 were autosomal and a small proportion (n = 27; 0.04%) were non-autosomal (located in ChrX, ChrY, or MtDNA).

Testing of QChip1
The single nucleotide variants and indels represented on QChip1 were tested with an additional 473 Qatari genomes for which whole-genome sequencing was available 24 . After selection of the top performing probeset for each variant, probesets that were consistently top-performing across batches were compared to WGS genotypes. A total of 27,850 ± 0.75 variant sites where a high-confidence genotype was obtained for both QChip and WGS were compared, and concordance was 99.1% ± 0.00034 (Table 3). Concordance was high for indels (92.4% ± 0.0057) and SNVs (99.2% ± 0.00034). QChip1 was then used to determine the prevalence in the Qatari population and in non-Qatari populations for variants of interest for SGD pathogenicity research and screening in Qatar. Genotyping of n = 2708 Qatari, n = 226 European-American, South Asian American and African-American New York City (NYC) residents, and n = 51 European and Afro-Caribbean Puerto Rico (PR) residents was conducted and analyzed as a single batch, including data from the first two (QChip0/QChip1) batches described above and a third batch with the rest of the samples. Probesets were again filtered based on performance, and variants were retained if they had a missing genotype rate <10%, concordance with WGS in batches 1 and 2 >90%, and minor allele frequency <5%. The final set of variants for analysis included n = 32,674 SNVs. In order to assess the utility of QChip1 for use in other populations of the Greater Middle East (GME), the allele frequency of these variants was obtained for n = 540 Kuwaiti exomes and each variant was checked for presence in the Centre for Arab Genomic Studies (CAGS) database (http://cags.org.ae).
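As an illustration of the concordance comparison, the following Python sketch computes a concordance rate over sites that have a non-missing genotype on both platforms. The genotype encoding (0, 1, or 2 alternate alleles, with `None` for a missing call) and the `concordance` function are assumptions made for this example and do not reproduce the study's actual software.

```python
from typing import Optional, Sequence, Tuple

Genotype = Optional[int]  # 0, 1, or 2 alternate alleles; None = missing call


def concordance(array_calls: Sequence[Genotype],
                wgs_calls: Sequence[Genotype]) -> Tuple[Optional[float], int]:
    """Return (concordance rate, number of sites compared).

    Sites with a missing genotype on either platform are skipped.
    The rate is None when no site could be compared.
    """
    if len(array_calls) != len(wgs_calls):
        raise ValueError("call lists must cover the same variant sites")
    compared = 0
    matched = 0
    for a, w in zip(array_calls, wgs_calls):
        if a is None or w is None:
            continue  # not comparable: missing on at least one platform
        compared += 1
        if a == w:
            matched += 1
    rate = matched / compared if compared else None
    return rate, compared


# Toy example: five sites, two of which are missing on one platform.
rate, n_compared = concordance([0, 1, 2, None, 1], [0, 1, 1, 0, None])
print(rate, n_compared)
```

In the toy example three sites are comparable and two of them agree, so the rate is 2/3. Per-sample rates computed this way could then be averaged across samples to obtain a mean and standard deviation.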

Use of QChip1
Among the 2,708 Qatari genomes tested, QChip1 identified a median of 2 homozygotes and 130 heterozygotes for SNVs of interest for SGD pathogenicity research and screening (Table 4). When assessed by Qatari subpopulations 25 , the highest median number (n = 205) of SNVs were identified in the Peninsular Arab subpopulation, 1.6-fold greater than the average median for the General Arab (109), Arabs of Western Eurasia and Persia (132), South Asian Arabs (137) and African Arab (129) subpopulations.
To help validate that QChip1 accurately detects known Qatari pathogenic variants, n = 140 variants identified as pathogenic either by the Hamad Medical Corporation (HMC) or by ClinVar were assessed in 2708 Qatari genomes by QChip1 (Table 5). There were n = 140 QChip1 pathogenic variants, including n = 140 (100%) present in ClinVar, n = 25 (18%) present in HMC, and n = 27 (19%) present in CAGS. Among these n = 140, n = 94 were only present in ClinVar, n = 19 were present in both HMC and ClinVar but not CAGS, n = 21 were present in ClinVar and CAGS but not HMC, and n = 6 were present in all three pathogenic variant databases (ClinVar, HMC, CAGS). Among the n = 140 pathogenic variants, n = 3 were classified as "suspicious" based on high allele frequency (greater than 0.005) 27 . The three variants were previously reported in CAGS, HMC, or both, and appear to be truly pathogenic variants that are enriched in the Qatari population due to founder effects, tribalism, consanguinity, or a combination of these factors. One of these, NM_000071.2(CBS):c.1006C>T (p.Arg336Cys) linked to homocystinuria, is a well-documented founder variant in Qatar that was experimentally validated and is a priority for screening in the population 17,28 .
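The database overlap counts reported above can be cross-checked with a short Python sketch. The four disjoint groups and their sizes are taken from the text; the `partition` dictionary and the `database_total` helper are illustrative names introduced here.

```python
# Disjoint groups of the 140 pathogenic variants, keyed by the set of
# databases in which each group is present (counts as reported in the text).
partition = {
    frozenset({"ClinVar"}): 94,
    frozenset({"ClinVar", "HMC"}): 19,
    frozenset({"ClinVar", "CAGS"}): 21,
    frozenset({"ClinVar", "HMC", "CAGS"}): 6,
}

total = sum(partition.values())


def database_total(name: str) -> int:
    """Number of variants present in the named database."""
    return sum(n for dbs, n in partition.items() if name in dbs)


for db in ("ClinVar", "HMC", "CAGS"):
    count = database_total(db)
    print(f"{db}: {count} ({round(100 * count / total)}%)")
```

Summing the groups recovers 140 variants in total, with 140 in ClinVar, 25 in HMC, and 27 in CAGS, in agreement with the per-database totals and percentages given in the text.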
A major question for the future of QChip is the applicability of the variant list in other GME populations. In order to begin to answer this question, the QChip1 variant list was queried against four datasets, including sequencing data from CAGS, Kuwait, Iran, and a collection across the GME (GME Variome) 29-32 . Out of the n = 140 pathogenic variants in Qatar genotyped by QChip1, 50% (n = 70) were observed in one or more of the 4 GME datasets, including n = 28 (20%) in Kuwait, n = 32 (23%) in Iran, and n = 37 (26%) in the GME Variome. As expected, only n = 8 (6%) were observed in Puerto Rico and n = 16 (13%) were observed in NYC (Table 6). Based on these data, the utility of QChip1 was higher in GME than in the Americas; however, half the variants were unique to Qatar, and thus each GME nation (such as Kuwait and Iran) could benefit from a custom design.
All 140 of the pathogenic variants were accurately detected by QChip1 and are described in Table 5; for additional variants of interest for SGD research on QChip1 assessed on 2,708 Qatari genomes, see Supplementary Table 3. The pathogenic variants in Table 5 included variants in CBS, a gene linked to homocystinuria (rs398123151 and rs121964972, 1 homozygote and 32 heterozygotes combined, 0.62% genomes), and variants linked to nemaline myopathy (rs886041851, 16 heterozygotes, 0.3% genomes) and factor XI deficiency (rs121965063, 0.13% genomes). Relevant to these observations, all 2708 genomes tested were from the general medical clinic and general population, not from referrals to genetic disease clinics, and hence these data were interpreted as representative of the general population of Qatar.
Examination of the distribution of types of functional variants identified by QChip1 in the Qatari genome showed that the majority of variants of interest for research that were computationally predicted to have "high impact" were involved in structural interaction, which currently would be considered "benign" or "uncertain significance" by ACMG standards and ClinVar. The most common class of variants of interest for research that were computationally predicted to have "moderate" impact were missense variants (Supplementary Table 4). In some cases, the SnpEff annotation was different from the ClinVar annotation for a pathogenic variant, typically in situations where multiple transcripts lead to multiple alternative annotations for a variant and SnpEff is not aware of the "canonical" annotation in the literature. One example is NM_000071.2(CBS):c.1006C>T (p.Arg336Cys), which SnpEff correctly annotated on the transcript as c.1006C>T but, rather than providing the amino-acid change, annotated as "structural_interaction_variant".
The applicability of QChip1 was assessed across populations, including those directly genotyped using the array and others not genotyped on the array but of relevant Greater Middle Eastern ancestry. Of the 32,674 variants of interest for SGD research and screening that were observed by QChip1 in at least 1 Qatari, 77% were at a frequency higher than any of the non-Qatari populations genotyped on the array (Fig. 1A). Among the Qatari genomes, the highest proportion of SGD risk alleles were in the Arabs of Western Eurasia and Persia, and African Arab subpopulations (Fig. 1A). As predicted, the majority (76%) of the Qatari genome pathogenic variants were not present in non-Qatari populations (Fig. 1B). QChip1 assessment of NYC and Puerto Rico residents demonstrated only rare detection of Qatari pathogenic variants in populations that included (based on genetic analysis of population clusters, Supplementary Fig. 1) European-American, South Asian-American, and African-American populations (Table 5, Supplementary Table 3).
Within the subset of the variants that are known pathogenic and of interest for screening (n = 140), similar results were observed for Western populations, with only 6% of QChip1 pathogenic variants observed in Puerto Rico and only 13% found in NYC. Within Arab populations, the results were better but still not sufficient to justify the use of the array, with only 24% of QChip1 pathogenic variants observed in Kuwait and 15% reported in the Centre for Arab Genomic Studies (CAGS) database.
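The cross-population enrichment comparison described in this section can be sketched in Python as follows. The sketch assumes one dictionary of risk allele frequencies per variant, restricts the analysis to variants observed in the target population, and counts a variant as enriched when its frequency in the target is strictly higher than the maximum frequency in the other populations. The `fraction_enriched` function, the strict inequality, and the toy frequencies are assumptions for illustration, not the study's code or data.

```python
from typing import Dict, List


def fraction_enriched(freqs: List[Dict[str, float]], target: str) -> float:
    """Fraction of variants observed in `target` whose risk allele
    frequency there exceeds the maximum over all other populations."""
    observed = [f for f in freqs if f[target] > 0]
    if not observed:
        raise ValueError("no variant observed in the target population")
    enriched = 0
    for f in observed:
        others = [af for pop, af in f.items() if pop != target]
        if f[target] > max(others, default=0.0):
            enriched += 1
    return enriched / len(observed)


# Toy allele frequencies (made up for illustration).
toy_freqs = [
    {"Qatar": 0.020, "NYC": 0.000, "PR": 0.000},  # private to Qatar
    {"Qatar": 0.001, "NYC": 0.010, "PR": 0.000},  # higher in NYC
    {"Qatar": 0.005, "NYC": 0.001, "PR": 0.002},  # enriched in Qatar
    {"Qatar": 0.000, "NYC": 0.010, "PR": 0.000},  # not observed in Qatar
]
result = fraction_enriched(toy_freqs, target="Qatar")
print(result)
```

The same function could be applied within Qatar by passing frequencies for the five subpopulations and choosing one subpopulation as the target, which mirrors the comparison of each subpopulation against the maximum of the other four.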

Array performance
Using NGS data as the gold standard, the authors calculated the analytical sensitivity, specificity, accuracy, positive predictive value, and negative predictive value of QChip1. Using data from WGS and QChip1 for n = 140 (mostly rare) pathogenic variants in n = 473 Qatari, comparison was conducted for n = 66,220 genotypes. Of these, n = 39,286 could not be compared due to a missing genotype in one of the two platforms (99.8% were missing in WGS only), and among the remaining n = 26,934 there were n = 26,781 true negatives, n = 132 true positives, n = 21 false negatives, and n = 0 false positives. Based on these data, the sensitivity was 86.3%, the specificity was 100%, the accuracy was 99.9%, the positive predictive value was 100%, and the negative predictive value was 99.9%. This performance is very high relative to recently published evaluations of SNP chip performance on rare pathogenic variants 33 .
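The performance figures follow from the four confusion-matrix counts using the standard definitions, as the Python sketch below shows. The `performance` function is an illustrative helper introduced here; the counts are those reported in the text.

```python
from typing import Dict


def performance(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Standard diagnostic metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),               # true positive rate
        "specificity": tn / (tn + fp),               # true negative rate
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "ppv": tp / (tp + fp),                       # positive predictive value
        "npv": tn / (tn + fn),                       # negative predictive value
    }


# Counts reported for the 26,934 comparable genotypes.
metrics = performance(tp=132, fn=21, tn=26781, fp=0)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")
```

With these counts the sensitivity is 132/153, or 86.3%, and the accuracy and negative predictive value both round to 99.9%, matching the values stated above. The function would raise a division error if a denominator were zero, which does not occur for these counts.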

DISCUSSION
This report described the design, testing, and application of QChip1, the first genotyping microarray specifically designed for precision medicine in the Greater Middle Eastern population. QChip was designed for and determined to be suitable for SGD research, clinical screening of newborns or couples planning children, and for genetic diagnosis of SGD patients in the country and in the region.
The main hypothesis of this project was confirmed: variants of interest for SGD pathogenicity research and screening within known genes vary considerably across populations, as the majority of the QChip1 variants observed in Qatar were either Qatar-private or Qatar-enriched, and were absent from other GME populations and databases of SGD pathogenic variants specific to GME populations. In addition, the majority of QChip1 variants were absent from the Thermo Fisher database, one of the largest knowledgebases in the world of genetic disease variants used in clinical genetics and research genetics. Given the low cost (<$100 per array) and ease of use of QChip1, it provides an accessible and sustainable alternative to extensive sequencing and interpretation of variants of unknown significance 34 for the implementation of precision medicine in countries such as Qatar.
The development of QChip1 included the following steps: (1) assessment of the Qatari population to identify Qatari variants and genes of interest for SGD pathogenicity research and screening; (2) design and manufacture of genotyping probesets for inclusion in the QChip1 microarray; (3) refinement and testing of QChip1 by analysis of data from 469 Qataris also sequenced using WGS; and (4) use of the refined QChip1 for quantification of variants of interest for SGD pathogenicity research and screening in 2708 Qatari genomes, with a focus on (a) variants specific-to or enriched-in Qatar relative to non-Qatari DNA samples also genotyped using QChip1 and (b) variants known to be pathogenic.
Table 3. In order to assess the quality of QChip1 data, genotypes were generated for n = 473 Qataris for all QChip1 sites and compared to whole-genome sequencing data. Genotypes were generated for all sites, including both reference and variant genotypes in whole-genome sequencing. QChip1 and whole-genome sequencing genotypes were compared for indels and single nucleotide variants (SNVs). Shown for all and for each population and for each variant class (indel, SNV, both) are the average, standard deviation, sample size, and 95% confidence interval for the number of concordant variants, the number of discordant variants, the total number of variants compared, and the concordance rate.
The key findings of this study were that out of over 104 million variants in Qatar, extensive analysis both in silico and in vitro identified with over 99% accuracy over 32 thousand variants in the Qatari population that are known or predicted to alter the function of genes with a known role in SGDs. The majority of these 32 thousand variants were only observed in Qatar, including 103 of 140 (64%) known pathogenic variants previously observed in Qatari clinical case reports and in ClinVar. Of those variants also observed in Kuwait, the CAGS database of GME variants, NYC or Puerto Rico, the majority were enriched in Qatar, at a higher risk allele frequency. These observations confirm the hypothesis that a considerable proportion of SGD risk variants are population-private founder variants or population-enriched variants that drifted to elevated allele frequency in Qatar. Surprisingly, this hypothesis holds even when compared to neighboring GME populations. This observation justifies the effort invested by this research team in developing QChip1 and in producing a framework for the development of similar SGD clinical and research arrays for other understudied populations in the GME, the Americas, and beyond. The population genetic analysis presented here, together with the high diversity of the Qatari population, points to the limited applicability of this array in the Greater Middle East region, which from a genetic perspective spans from Africa to Southern Europe, the Near East, Central Asia, and South Asia. The population-specificity of the variants on the array is a confirmation of the uniqueness and genetic isolation of the Qatari population as previously described by this research team.
The majority of genotyping arrays in use today were designed for coverage of the whole genome, and provide limited coverage of rare variants in genes known and potentially pathogenic in genetic disorders 35 . Screening arrays do exist, including arrays designed for detection of cytogenetic defects in newborns 36 , arrays designed for pre-natal screening 37 , and exome arrays designed for exome-wide association studies (ExWAS) 38 . Exome sequencing is growing in popularity for the detection of risk variants, and a number of companies offer it as a service, including variant interpretation 39 . The challenge with exome sequencing for clinical use is how to deal with the identification of variants of unknown significance 40 .
In contrast, the concept of the QChip1 array is that all variants in the array were annotated prior to genotyping, hence circumventing the issue of variants of unknown significance while still covering rare variants. In this sense, the QChip1 knowledgebase is of great value, as it can be used to aid the interpretation of genetic data produced by targeted sequencing or genotyping of a panel of variants of interest for carrier screening, similar to the Plain Insight Panel 41 .
The challenge for array design is the selection of variants. There are over 7 million known missense and loss of function variants 42 , and no array can fit all. Unlike arrays designed for ExWAS, genome-wide association study (GWAS) and population genetics, limiting the array to common variants is not useful for screening for pathogenic variants, as common variants are less likely to be pathogenic, and rare variants are difficult to impute using reference panels and common variant genotype data 43 . In order to focus on pathogenic rare variants, arrays custom-tailored to a population are a better fit for individuals sampled from that population, as rare variants are more likely to be population-specific 44 .
This study provides advances in both knowledge and technology for the field of genomic medicine for a specific genetic population. On the knowledge front, it contains the largest knowledgebase of variants of interest for genetic disease research and screening in a Greater Middle Eastern population. While the consequences of many of the variants on QChip1 are unknown, the array provides a paradigm for clinical screening of this population and a platform for future genetic disease research in Greater Middle Eastern populations. The variants included in the design and validated in a batch of n = 2708 Qatari were as rare as 1 in 5000 (minor allele frequency of 0.0002), and future whole-genome sequencing of Qataris is expected to yield thousands of additional variants of interest. Confidence in the true existence of such rare SGD risk variants in the Qatari population was boosted by this study, as the variants were discovered by WGS and verified by QChip genotyping.
The QChip1 array did not include short tandem repeats, other repetitive variants, copy number variants, or structural variants. A small proportion of probes on QChip1 were designed for indel detection, but the concordance with whole-genome sequencing for the indels was inadequate. This may be due to inadequate probeset design and should be a focus for future QChip designs. The main limitation of arrays is the space for probes, and in this case the majority of variants were novel to the Axiom platform and hence required multiple probesets. In future iterations, the highest performing probesets identified in this study can be used, and poor performing probesets can be eliminated, thus making additional space on the array for additional variants. Thus, multiple iterations of QChip are needed to produce a high-quality design that genotypes a variety of variants. Another strategy that is frequently used by genotyping array manufacturers is to spread a design across multiple arrays that are genotyped together; manufacturers can advertise an array with up to 5 million variants when in reality the "array" consists of 4 or more individual arrays 45 . Another limitation of this study is the cis/trans phase of variants, which is also a challenge for exome sequencing. For example, multiple pathogenic variants in BTD can occur in the same genome, and hence screening for these variants includes a second step to determine phase 46 . In the case of this study, there were three pathogenic variants in BTD (rs397514369, rs13078881, rs138818907). Among those individuals with a BTD pathogenic variant, there were n = 5 heterozygotes for rs397514369, n = 4 homozygotes and n = 135 heterozygotes for rs13078881, and n = 5 heterozygotes for rs138818907. Zero individuals were positive for more than one BTD pathogenic variant, which rules out the possibility of two pathogenic variants in trans. However, were it the case that multiple BTD variants were observed in the same genome, follow-up validation of phase by Sanger sequencing would be needed. This is a disadvantage of exome sequencing and exome-focused array genotyping, as insufficient coverage of intergenic regions is available for phase inference. Follow-up sequencing is needed until genome-wide technologies such as WGS are widely available. Plans for QChip2 include broad coverage of sufficient variants for phase inference.
Table 4. In order to compare the precision medicine value of QChip1 for pathogenic variant screening and research across Qatari subpopulations, n = 2708 Qatari genomes were assessed by QChip1 for the number of variants of interest for SGD research and screening in the Qatari genetic subpopulations. After exclusion of common variants (minor allele frequency >0.05), variants in genes not containing ClinVar pathogenic variants, variants with a batch effect, and variants not observed in Qatar, n = 32,674 variants of interest were analyzed. Population genetic analysis was conducted as described in Fig. 3. The Qatari individuals genotyped on QChip1 were stratified based on dominant ancestry cluster, without exclusion of admixed individuals. Shown is (left-to-right) each population with sample size, the median number of QChip1 variants per individual (homozygous, heterozygous, wild type, and missing) and median number of genes with one or more variants per individual. b Populations include: Qatari (all Qatari) and subpopulations: QGP_PAR (Peninsular Arabs); QGP_GAR (General Arabs); QGP_WEP (Arabs of Western Eurasia and Persia); QGP_SAS (South Asian Arabs); and QGP_AFR (African Arabs). c Not included: QGP_ADM, Admixed Arabs, see Table 3.
J.L. Rodriguez-Flores et al.
QChip1 was designed to be competitive relative to sequencing and existing arrays, hence there was a focus on achieving a platform that could provide data for under $100 per DNA sample, including reagents and labor. This is a price point that should remain competitive compared to alternative options for up to a decade, and remains the objective of major manufacturers of sequencing instruments 47 . A major saving is the small data footprint of QChip1 relative to exome or genome sequencing, where orders of magnitude more data storage is needed. In particular, if the objective is to apply QChip1 on a national scale, the infrastructure investment is considerably more manageable for the prospect of running hundreds of thousands of arrays relative to sequencing hundreds of thousands of genomes or exomes. To put this in perspective, the total Qatari population is approximately 300,000, so the entire Qatari population could be screened for all known and potentially pathogenic variants for approximately $30 million. As presented by the chair of the Qatar Foundation, HH Sheikha Moza bint Nasser, at the WISH 2018 summit in Doha, such a precision medicine objective is under consideration for the next decade 48 .
Assessment of 2708 Qatari genomes provided novel insight into the Qatari population. As predicted from our prior assessments of the Qatari population 3,11 , the majority of the pathogenic and predicted pathogenic variants were Qatari-specific and underrepresented in non-Greater Middle Eastern genomes. The most common known and predicted high-severity pathogenic variants were structural interaction variants and stop-gain loss-of-function variants. The most pathogenic variants per genome were observed in the General Arab population, a finding that has implications for other Greater Middle East populations such as Kuwait, United Arab Emirates, and Saudi Arabia that share considerable ancestry with Qatar 18,49-51 . The median Qatari genome had 134 known or computationally predicted pathogenic alleles of interest for SGD research or screening. Of the known pathogenic alleles that were both previously observed in Qatar and known to the ClinVar database, the most common were causative of biotinidase deficiency, Stargardt disease, and homocystinuria. Among these 3 variants with risk allele frequency above 0.5% in Qatar, one, NM_000060.2(BTD):c.[470G>A;1330G>C] linked to biotinidase deficiency, was not previously known to either the CAGS or the HMC database. This is unusual, given the high frequency of the pathogenic variant at 0.0265, and could be an indication that either biotinidase deficiency is under-diagnosed in Qatar, or that the variant should be re-classified as "uncertain significance". Of the other two variants with elevated risk allele frequency, one, NM_000350.2(ABCA4):c.[5512C>G;5882G>A] linked to Stargardt disease (risk allele frequency 0.0207), was reported in the CAGS but not the HMC database. Again, it is unusual that the variant was not previously observed in the HMC database, although it is a known pathogenic variant in Arabs and quite possibly enriched in a subset of the Qatari population due to drift.
In order to quantify the utility of QChip1 for single gene (Mendelian) disorder screening outside of Qatar, the presence and (when available) allele frequency of each variant in Table 5 was checked in seven datasets, including three produced by this research team (HMC, NYC, PR) and four externally obtained [CAGS (http://cags.org.ae/), Dasman Diabetes Institute (DDI), GME Variome (http://igm.ucsd.edu/gme/data-browser.php), Iranome (http://www.iranome.ir/)]. Only the DDI, GME, and Iranome datasets had allele frequency data. Shown is the name of the knowledgebase, the sample size when available, and the QChip1 pathogenic variants found in the knowledgebase, including number and percentage of 140 total on QChip1 (Table 5). For datasets where allele frequency is available, the variant is counted as "present" if the frequency was greater than zero. For datasets where allele frequency is not available, the variant is counted as "present" if a query of the dataset found the variant. The bottom two rows show aggregate data, where the "anywhere" row indicates variants present in any of the seven datasets (HMC, CAGS, Kuwait, GME, Iran, NYC, PR), and the "Middle East" row indicates variants present in the Middle Eastern datasets (CAGS, DDI, GME, Iran).
In order to demonstrate the population-specific value of QChip1, the risk alleles that were discovered by genome/exome sequencing, prioritized in the knowledgebase, included in the array design, successfully genotyped, and observed in array data for at least one of n = 2,708 Qataris are provided for download in Supplementary Table 1 and online at the Qatar Genome Browser (http://qchip.biohpc.cornell.edu). Shown is a summary of the population enrichment of these variants. A Enrichment of potentially pathogenic variants on QChip1 in Qatari subpopulations. In order to determine if Mendelian disease risk alleles were enriched in single Qatari subpopulations, a cross-population allele frequency comparison was conducted for five ancestries observed in Qatar (k1, QGP_PAR, Peninsular Arabs; k2, QGP_GAR, General Arabs; k4, QGP_WEP, Arabs of Western Eurasia and Persia; k5, QGP_SAS, South Asian Arabs, and k3, QGP_AFR, African Arabs). Not shown, QGP_ADM, Admixed Arabs. For each subpopulation, the risk allele frequency was compared to the maximum of the other four subpopulations. Shown is the proportion that was highest in the subpopulation for (left-to-right) QGP_PAR, QGP_GAR, QGP_WEP, QGP_SAS, and QGP_AFR. B Enrichment of potentially pathogenic variants on QChip1 in the Qatari genome relative to non-Qatari. The non-Qatari genomes were residents of New York City (total n = 226) and Puerto Rico (n = 51). The ancestry proportions of these 226 non-Qatari genomes in 5 clusters (k1 to k5) were calculated as described in Fig. 2 (combined analysis of non-Qataris and Qataris using ADMIXTURE 68 ); the lowest cross-validation error was for k = 5, with the non-Qataris falling in 3 clusters (African-Americans from NYC, n = 60, k3; European-Americans from NYC, n = 153, k4; South Asian-Americans from NYC, n = 13, k5; Puerto Ricans of European Ancestry, k4; and Puerto Ricans of Afro-Caribbean Ancestry, k3). More details of the population structure were made available in Fig. 2 (Qataris) and Supplementary Fig. 1 (non-Qataris). Shown is the percentage of n = 32,674 potentially pathogenic variants in Mendelian (single gene) disorder genes that were observed in at least one Qatari and have a risk (minor) allele frequency in Qatar higher than in non-Qatari populations. The proportion of variants was calculated that were at elevated minor allele frequency (enriched) in the Qatari genome relative to the genomes of the 5 non-Qatari population clusters tested: USA African-American (k3), USA European-American (k4), USA South-Asian American (k5), PR Afro-Caribbean (k3), PR European (k4). Shown from left-to-right is the proportion that are enriched in Qatar relative to the maximum of all 5 populations, followed by the proportion enriched relative to each individual population.
The NM_000071.2(CBS):c.1006C > T (p.Arg336Cys) variant linked to homocystinuria is a well-known variant that is present in both the HMC and CAGS databases, and is known to be an enriched founder variant in the population. It was notable that this variant was incorrectly annotated by SnpEff as "structural interaction", and only manual review based on the rsID identified the known function (Arg336Cys). This is an issue with annotation software that is not exclusive to SnpEff, where multiple transcripts overlap a variant (4 in the case of CBS), and the annotation for the "canonical" experimentally validated function of the variant in disease is buried among other annotations. This is a general problem in variant annotation, and computationally predicted annotations are to be considered an estimate that needs to be validated both by manual review of the literature and experimental validation in vitro. Other known pathogenic variants found using QChip1 included a Factor XI deficiency variant that was previously observed in both Arabs and in ancestral Jewish populations 52 .
QChip1 was designed to assess for pathogenic variants in SGDs, with the aim of genomic medicine for Qatari newborns, premarital couples and clinical genetics patients. A likely future strategy for QChip2 and beyond will be to produce multiple arrays for different purposes, including (1) genome-wide association array designed for genotyping of common variants and calculation of polygenic risk scores for multifactorial disorders 53 ; (2) imputation of rare variants based on a Qatari genome imputation reference; (3) population-specific variants that influence drug kinetics and adverse effects; (4) structural variants and repeats; (5) expansion of the QChip1 SGD variants based on a larger sample of Qatari genomes; and (6) variants relevant to autoimmune disease and infectious disease in HLA 54 and non-autosomal chromosomes, such as ChrX variants in the ACE2 receptor used by the SARS-CoV-2 virus to infect human cells 55 .
In addition to future versions of the array, the QChip knowledgebase and browser (Qatar Genome Browser) will continue to expand and be updated as more public data from Qatar and literature data on known SGD variants and genes become available. The knowledgebase, array, and browser produced by this project were intended as a first and enabling step towards advancing the state of the art of genomic medicine in Qatar and in populations that share ancestry with Qatar, as demonstrated in the population genetics analysis presented in this study. The intent is to demonstrate this approach as a framework for the development of precision medicine in populations of countries in continents such as Africa 56 , where a per-sample genome analysis cost beyond $100 is out of reach. Given the low cost of sequencing data production, the availability of cloud-based genome analysis infrastructure that does not require large capital investment, and the ease of rapid array design using the Axiom platform, a nation or population that currently has no prior knowledge of genetic variation could take the approach presented here and produce a genetic disease screening program in under a year, potentially saving thousands of lives at risk of unknowingly being affected by a genetic disorder.
The applicability of the QChip1 technology in the Qatari national population is clear, as all of the variants genotyped were previously observed in Qatari nationals, and we know from current and prior studies that the Qatari population sample used as the source of genetic variation for the QChip is also very diverse, with contributions of ancestry from Africa, Europe, and Asia 11,12 . The applicability to expatriates both living within Qatar and those outside of Qatar will depend on shared ancestry between the expatriate individual and the Qatari population. An expatriate coming from one of the populations that contribute to Qatari ancestry will be more likely to have one or more pathogenic variants in QChip. More distantly related individuals would see less benefit from QChip for screening. Confirming that hypothesis, only 6% of the known pathogenic variants were observed in Puerto Ricans, hence an expatriate from Puerto Rico in Qatar would not benefit as much from QChip1 screening as an expatriate from Kuwait, where 20% of QChip1 pathogenic variants were observed. Across the Greater Middle East region, a total of 50% of the QChip1 variants were observed. This study provides a strong argument for ancestry inference as a standard part of precision medicine, to determine the appropriate screening tool and allele frequency reference database for SGDs.

Subject recruitment and sample collection
All research participants were recruited using IRB-approved protocols and informed consent. Recruitment sites included Doha, Qatar (Weill Cornell Medicine-Qatar Institutional Review Board); New York, New York, USA (Weill Cornell Medicine Institutional Review Board); and Mayaguez, Puerto Rico, USA (Institutional Review Board, University of Puerto Rico at Mayagüez). Every research participant received and understood the information in the consent document and other written materials, and gave permission to take part in the research by signing the informed consent form. No plan was put in place for recontacting participants with information on actionable findings. DNA extracted from whole blood 57 was tested by RUCDR Infinite Biologics (Piscataway, New Jersey) and confirmed to be of sufficient quality for array genotyping 58 .

Strategy to design and assess QChip1
QChip1 was developed in steps (Fig. 2). Step 1. Pathogenic variants (known and predicted) in the coding regions of single genes in the Qatari genome were cataloged. Step 2. Using these data, QChip0 (the precursor of QChip1) was designed on the Axiom platform, tested using Qatari genomes and refined with optimal probes, variants and genes to create QChip1.
Step 3. QChip1 was tested for concordance with whole-genome sequencing.
Step 4. QChip1 was used to evaluate the prevalence and Qatari specificity of pathogenic variants by assessing genomes from Qatari and non-Qatari populations.
Step 1: Identification of variants of interest for research or screening in the Qatari Genome (Table 1) The identification of variants of interest for SGD research and screening in the Qatari genome was carried out in a 3 step process: (1) establishing a list of genes with a known link to Mendelian SGDs described in the ClinVar (version 7/21/20) database; (2) identification of Qatari variants computationally predicted to alter the function of SGD genes in a pathogenic manner, which are primarily of interest for SGD pathogenicity research; and (3) identification of Qatari variants known to be pathogenic in SGDs, based on being classified as such by the ClinVar database or by the HMC case reports.
Establishing a list of genes. A list of genes was compiled from ClinVar with the following criteria: (i) protein coding gene in the human genome that (ii) has a known link to an SGD and (iii) contains one or more variants in ClinVar that are classified with a "clinical significance" value of "pathogenic" (Supplementary Table 2), as recommended by the American College of Medical Genetics (ACMG) for variants interpreted for Mendelian disorders 59 .

Identification of variants of interest for SGD pathogenicity research in Qataris. Single nucleotide variants (SNV) and indel variants in the Qatar Genome Knowledgebase were annotated using data from public and private sources. First, the allele frequency for each variant in Qataris and non-Qataris was calculated. Variants with a minor allele frequency above 5% in either Qataris or non-Qataris were excluded, per ACMG guidelines 59 . Second, variants were annotated with respect to impact on protein-coding genes in the ENSEMBL database 60 using SnpEff 61 . Variants that did not affect the function of a ClinVar SGD gene, identified as described above, were excluded. Third, variants predicted to be missense or loss-of-function (LoF) were kept; these variants are classified by SnpEff as having "High" or "Moderate" potential impact on protein function. The resulting collection spans known pathogenic variants, variants of unknown significance, and benign variants.
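The three filtering criteria above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and is not the authors' pipeline: the `Variant` record, its field names, and the `prioritize` helper are hypothetical, and only the 5% minor allele frequency cutoff and the "High"/"Moderate" SnpEff impact classes come from the text.

```python
from dataclasses import dataclass

MAF_CUTOFF = 0.05                    # 5% threshold stated in the text
KEPT_IMPACTS = {"HIGH", "MODERATE"}  # SnpEff impact classes kept in the text


@dataclass
class Variant:
    """Hypothetical record holding the annotations needed for filtering."""
    variant_id: str       # illustrative identifier, e.g. "chr1:12345:A:G"
    gene: str             # gene symbol assigned by the annotation step
    qatari_af: float      # alternate allele frequency in Qataris
    non_qatari_af: float  # alternate allele frequency in non-Qataris
    impact: str           # SnpEff putative impact (HIGH, MODERATE, LOW, MODIFIER)


def minor_af(af: float) -> float:
    """Convert an allele frequency to a minor allele frequency."""
    return min(af, 1.0 - af)


def keep_variant(v: Variant, sgd_genes: set) -> bool:
    """Apply the three criteria: rare in both groups, in an SGD gene, High/Moderate impact."""
    if minor_af(v.qatari_af) > MAF_CUTOFF or minor_af(v.non_qatari_af) > MAF_CUTOFF:
        return False
    if v.gene not in sgd_genes:
        return False
    return v.impact.upper() in KEPT_IMPACTS


def prioritize(variants, sgd_genes):
    """Return the variants of interest for SGD pathogenicity research."""
    return [v for v in variants if keep_variant(v, set(sgd_genes))]
```

Because the frequency check is applied to both groups, a variant that is rare in Qataris but common elsewhere is still excluded, which matches the "either Qataris or non-Qataris" wording above.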
Identification of pathogenic variants for SGD screening. Among the variants defined in step 1.2, a subset is known pathogenic variants, including those classified by ClinVar as pathogenic or those previously observed in HMC case reports of SGDs. These variants can be used for screening of Qataris in a Precision Medicine setting.
Step 2: Design of QChip1 The microarray platform for the QChip was based on the Axiom custom array platform, capable of accommodating 1.3 × 10^6 probe features, each consisting of DNA probes covalently linked to a silicon wafer and designed to hybridize DNA from the genomic sample. Multiple probes designed to hybridize to a genomic segment can be included in a single "probeset", and one or more probesets designed to genotype a single variant can be included in the design, such that the performance of probesets can be compared. The initial design was named "QChip0" and the final (post-quality-filtering) version "QChip1". The array design contained 693,652 probes in 597,049 probesets. A subset of n = 184,713 of the probes (27%), the focus of this report, was designed to assess variants of interest for SGD pathogenicity research and screening. These variants are computationally predicted or are known to affect the function of ClinVar SGD genes found in the variant knowledgebase. The remaining 73% of probes on QChip0, not the subject of this report, were designed for research purposes focused on population genetics, pharmacogenomics, and multifactorial disease research, and will be described in future publications based on future versions of QChip.
The probesets included probes complementary to reference and variant alleles, plus flanking sequence of 35 bases in both 5' and 3' directions. Note that this manuscript refers to reference GRCh38 and variant alleles from a genome sequencing perspective. However, in microarray genotyping, there is no "reference" allele, as both alleles are treated as equal by the technology, potentially reducing false genotype calls attributable to reference bias 62 . Variants that were already present in the ThermoFisher (previously Affymetrix) knowledgebase, and thus previously validated to provide accurate genotypes for an SNV or indel, were assessed using a single probeset, while novel variants were assayed using two or more probesets.
Once the array was manufactured, it was tested on an initial batch of genomic DNA samples, including n = 26 Qataris from the Weill Cornell Medicine cohort WGS data. Genotypes were generated from the WGS data for these n = 26 using GATK Haplotype Caller 3.8 63,64 , configured to output genotypes for all sites on the QChip list, including homozygous reference calls. Comparison of QChip and WGS genotypes was conducted for sites where both WGS and QChip produced a non-missing (sufficient quality) genotype.
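The comparison rule described above (only sites where both platforms produced a non-missing genotype are compared) can be sketched as follows. This is an illustrative outline and not the authors' concordance script: the VCF-style genotype strings, the dictionary inputs keyed by site, and the function names are assumptions made for the example.

```python
def normalize_gt(gt):
    """Return an unordered genotype as a sorted tuple of alleles, or None if missing.

    Accepts VCF-style strings such as "0/1", "1|0", or "./.".
    Phasing is ignored, so "0|1" and "1/0" are treated as the same genotype.
    """
    alleles = gt.replace("|", "/").split("/")
    if any(a in (".", "") for a in alleles):
        return None
    return tuple(sorted(alleles))


def concordance(array_calls, wgs_calls):
    """Compare array and WGS genotypes at sites called in both datasets.

    Both arguments map a site key (hypothetical, e.g. "chr1:12345") to a
    genotype string. Returns (sites compared, sites matching, concordance rate).
    """
    compared = 0
    matched = 0
    for site, array_gt in array_calls.items():
        a = normalize_gt(array_gt)
        w = normalize_gt(wgs_calls.get(site, "./."))
        if a is None or w is None:
            continue  # skip sites missing in either dataset
        compared += 1
        if a == w:
            matched += 1
    rate = matched / compared if compared else float("nan")
    return compared, matched, rate
```

Requiring homozygous reference calls in the WGS output, as described above, matters here: without them a site absent from the WGS file could not be distinguished from a site that was confidently called as reference.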
Fig. 2 Strategy to design and assess QChip1. Step 1. Qatari Genome Knowledgebase. The catalog of single gene (Mendelian) pathogenic variants and genes in protein coding regions of the Qatari genome was generated using whole-genome sequencing, exome sequencing and clinical reports (see Table 1). After cataloging all variants and respective genes, the pathogenic variants and genes were identified using ClinVar and SnpEff. Step 2. Using this list, QChip0 (the precursor of QChip1) was designed on the Axiom platform, which was then tested with 25 Qatari DNA samples for which whole-genome sequencing was available. Step 3. Elimination of poor performance probes and variants led to the final design of QChip1, which was tested for concordance with genome sequencing using DNA samples from Qataris. Step 4. Use of QChip1 to assess the prevalence of pathogenic variants and genes among Qataris, New York City residents and Puerto Ricans. Panel titles: Step 1: Establish Qatari genome knowledgebase of pathogenic and potentially pathogenic variants and genes; Step 2: Design of QChip1; Step 3: Testing of QChip1; Concordance of QChip1 compared to whole genome sequencing; Testing with 473 Qatari DNA samples to assess concordance; Step 4: Use of QChip1.
In order to exclude poorly performing probesets, two rounds of filtering were applied, including a primary filter to select the highest performing probeset for each variant with multiple probesets, and a secondary filter to exclude variants with a high rate (>10%) of missing genotypes or high rate of discordant genotypes. Excluding poorly performing probes and variants led to the final design of QChip1 with 166,695 probes designed to detect 83,542 variants of 3438 genes. Concordance and filtering analysis were performed using Python 65 scripts. The concordance analysis script takes as input two single-sample VCF files 66 , one with QChip1 genotypes and a second with WGS genotypes for all QChip1 sites (including reference and variant genotypes) produced by GATK 3.8 64 .
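The two filtering rounds used to arrive at QChip1 (a primary filter that keeps the best probeset per variant, then a secondary filter on missing and discordant genotype rates) might be sketched as below. Only the 10% missing-genotype threshold is taken from the text. The discordance threshold `MAX_DISCORDANT`, the ranking rule used to define the "highest performing" probeset, and all names are assumptions for illustration, not the authors' settings.

```python
from dataclasses import dataclass

MAX_MISSING = 0.10     # ">10%" missing-genotype threshold stated in the text
MAX_DISCORDANT = 0.05  # hypothetical value; the text gives no numeric threshold


@dataclass
class ProbesetStats:
    """Hypothetical per-probeset summary from the test genotyping batch."""
    probeset_id: str
    variant_id: str
    missing_rate: float     # fraction of samples with no genotype call
    discordant_rate: float  # fraction of compared calls disagreeing with WGS


def primary_filter(stats):
    """Keep one probeset per variant.

    Assumption: "highest performing" means lowest discordance, with ties
    broken by lowest missing rate.
    """
    best = {}
    for s in stats:
        current = best.get(s.variant_id)
        key = (s.discordant_rate, s.missing_rate)
        if current is None or key < (current.discordant_rate, current.missing_rate):
            best[s.variant_id] = s
    return best


def secondary_filter(best):
    """Drop variants whose best probeset still has too many missing or discordant calls."""
    return {
        variant: s
        for variant, s in best.items()
        if s.missing_rate <= MAX_MISSING and s.discordant_rate <= MAX_DISCORDANT
    }


def final_design(stats):
    """Run both rounds and return a variant -> probeset_id mapping."""
    return {v: s.probeset_id for v, s in secondary_filter(primary_filter(stats)).items()}
```

Running the primary filter first means a variant is dropped only if none of its probesets performs adequately, which is the practical benefit of designing two or more probesets for novel variants.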
Step 3: Test of QChip1 The concordance of genes and variants of QChip1 with whole-genome sequencing data was calculated for a second array genotyping batch of n = 443 Qatari genomic DNA samples previously sequenced using WGS by the Qatar Genome Program. Concordance was calculated using the same method as for the first batch of n = 26, described above.
Step 4: Use of QChip1 QChip1 was then used to determine the prevalence of variants of interest for SGD research and screening in the Qatari population (n = 2708) compared to genomes for European-American, South Asian-American and African-American New York City (NYC) residents (n = 226) and European and Afro-Caribbean Puerto Rico (PR) residents (n = 51). In addition to assessment of variant prevalence in Qataris as a single population, the population structure of Qataris was quantified as described previously 67 , and the prevalence of each variant was quantified for each known Qatari population cluster [Peninsular Arab (QGP_PAR), General Arab (QGP_GAR), Admixed Arab (QGP_ADM), Arabs of Western Eurasia and Persia (QGP_WEP), South Asian Arabs (QGP_SAS) and African Arabs (QGP_AFR); this nomenclature has replaced our prior nomenclature for these subgroups of Q1a, Q1b, Admixed, Q2a, Q2b and Q3, respectively, used in prior publications; Fig. 3] 11 . The population structure was quantified using ADMIXTURE 68 for both Qataris and non-Qataris (Supplementary Fig. 1) using QChip1 data that was filtered to exclude indels, singletons, and variants in linkage disequilibrium (window 1000, step 25, maximum r^2 0.1). Each genome was assigned to an inferred population cluster based on the k value with lowest cross-validation error (k = 5). Rather than classify individuals as admixed/non-admixed, each individual genome was assigned to the cluster (k) with the highest proportion of ancestry 69 . The results were visualized in a plot of principal components (PCs) calculated using PLINK 70 , with visualization in R 71 . Outliers were excluded if they fell more than 2 standard deviations from the median PC value for PCs 1 to 5. Each genome was color-coded by the inferred ancestry (1 to 5) and the country of origin (Qatar, US, PR).
Sites and samples that failed QC based on variant batch effects or PC outliers were excluded. After QC, ADMIXTURE analysis was conducted on the remaining n = 37,674 variants and n = 2985 samples of Qataris (n = 2708) and non-Qataris (n = 277) for a range of K from 3 to 12. The lowest cross-validation error was observed for k = 5 for the full dataset. After analysis, the Qatari and non-Qatari samples were plotted separately; the panels here show the Qatari samples from the joint analysis. A Admixture (k = 5) proportions. Shown is a plot of the admixture proportions (% k from 0 to 100%, y axis), with each column representing one genome, sorted from left-to-right by dominant (highest %) k, and decreasing % k1 to k5. Genomes are color-coded by the dominant (largest %) ancestry (QGP_PAR, Peninsular Arabs, red; QGP_GAR, General Arabs, orange; QGP_WEP, Arabs of West Eurasia and Persia, bright green; QGP_SAS, South Asian Arabs, olive green; and QGP_AFR, African Arabs, light blue). Samples from prior studies of Qatar population structure (Qatar Genome public samples from Fakhro et al. 11 and Rodriguez-Flores et al. 12 ) genotyped on QChip1 were included in the clustering analysis and were used to assign the clusters. B Principal components analysis of Qataris. Shown is a PC1 × PC2 plot of Qatari genomes in squares color-coded by cluster of largest proportion of inferred ancestry. Not shown, QGP_ADM, Admixed Arabs.
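Two of the steps in this section lend themselves to a short sketch: assigning each genome to the cluster with the highest ancestry proportion, and flagging PC outliers. The code below is a minimal illustration, not the authors' implementation. The k-to-label mapping follows the figure legends; the input layouts (a list of five proportions per genome, a dictionary of PC values per sample), the use of the sample standard deviation, and the function names are assumptions.

```python
import statistics

# Cluster labels for k1..k5 as given in the figure legends
K_LABELS = ["QGP_PAR", "QGP_GAR", "QGP_AFR", "QGP_WEP", "QGP_SAS"]


def dominant_cluster(proportions):
    """Return the label of the cluster with the highest ancestry proportion.

    `proportions` is assumed to hold the five ADMIXTURE proportions of one
    genome in k1..k5 order. Ties resolve to the lowest-numbered cluster.
    """
    best_index = max(range(len(proportions)), key=lambda i: proportions[i])
    return K_LABELS[best_index]


def pc_outliers(pcs, n_sd=2.0):
    """Flag samples lying more than `n_sd` standard deviations from the median on any PC.

    `pcs` maps a sample name to its list of PC values (for example PCs 1 to 5).
    """
    n_pcs = len(next(iter(pcs.values())))
    flagged = set()
    for j in range(n_pcs):
        values = [coords[j] for coords in pcs.values()]
        median = statistics.median(values)
        sd = statistics.stdev(values)
        for sample, coords in pcs.items():
            if abs(coords[j] - median) > n_sd * sd:
                flagged.add(sample)
    return flagged
```

Assigning by the largest proportion gives every genome exactly one label, at the cost of hiding how mixed a genome is; a genome with proportions of 0.35 and 0.33 in two clusters is labeled the same way as one at 0.95.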

Data analysis
The final set of QChip1 data included SNV variants with high-quality genotypes and genomes with known ancestry that are of interest for research and screening of SGDs in Qataris. Analysis of these data included quantification and comparison across populations of the following parameters: (1) individual burden of variants; (2) prevalence of variants; (3) enrichment of variants among Qatari subpopulations; and (4) enrichment of variants in Qataris compared to non-Qatari populations.
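The enrichment comparisons in items (3) and (4) follow the rule stated in the figure legend: a variant counts as enriched in a population when its risk allele frequency there is higher than the maximum across the comparison populations. A minimal sketch of that rule is shown below; the dictionary layout, the treatment of unobserved variants as frequency zero, and the function name are assumptions for illustration.

```python
def enriched_fraction(target_af, other_afs):
    """Fraction of variants whose frequency in the target exceeds the maximum elsewhere.

    `target_af` maps variant -> risk allele frequency in the target population.
    `other_afs` maps population name -> {variant: risk allele frequency}.
    A variant absent from a comparison population is assumed to have frequency 0.
    """
    total = 0
    enriched = 0
    for variant, af in target_af.items():
        max_other = max(
            (pop.get(variant, 0.0) for pop in other_afs.values()),
            default=0.0,
        )
        total += 1
        if af > max_other:  # strictly higher, matching "higher than" in the legend
            enriched += 1
    return enriched / total if total else float("nan")
```

Passing a single comparison population gives the per-population bars of the figure, while passing all of them gives the "maximum of all populations" bar.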

Performance
Once a final set of pathogenic variants screened using QChip1 was identified, the performance of the array was quantified. Data for QChip1 and WGS was compared on n = 140 pathogenic variants for n = 472 genomes. Using WGS as a "gold standard", the number of true negative

Utility beyond Qatar
In order to assess the potential utility of QChip1 beyond Qatar, the number of QChip1 pathogenic variants was quantified in internal and external knowledgebases. The internal knowledgebases included the QChip1 data for Qatar, NYC, Puerto Rico, and the Hamad Medical Corporation (https://www.hamad.qa/EN/Pages/default.aspx) list of pathogenic variants. The external knowledgebases included ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/), the Center for Arab Genetics Studies (https://www.cags.org.ae/en), the Iranome (http://www.iranome.ir/), the GME Variome (http://igm.ucsd.edu/gme/), and a set of exomes sequenced by the Dasman Diabetes Institute in Kuwait (https://www.dasmaninstitute.org/). Among the external databases, allele frequency was available for Iran (n = 800), GME (n = 886), and Kuwait (n = 540). The subset of variants present in one or more of the knowledgebases, as well as the subset present in one or more external knowledgebase focusing on the Greater Middle East region (CAGS, Iran, GME, Kuwait) was also quantified.
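The counting rule used for this comparison (a variant is "present" if its allele frequency is greater than zero, or, where no frequency is available, if a query of the knowledgebase finds it) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: each knowledgebase is represented as a dictionary whose values are either an allele frequency or `True` for a query hit, and the dataset keys and function names are hypothetical.

```python
# Greater Middle East knowledgebases named in the text (keys are illustrative)
MIDDLE_EAST = {"CAGS", "Kuwait", "GME", "Iran"}


def is_present(value):
    """Apply the presence rule to one lookup result.

    `value` is None (not found), True/False (query hit for datasets without
    allele frequency), or a float allele frequency.
    """
    if value is None:
        return False
    if isinstance(value, bool):
        return value
    return value > 0


def summarize(variants, datasets):
    """Return {row name: (count, percent of variants)} for each dataset and both aggregates.

    `variants` is assumed to be a non-empty list of variant identifiers.
    """
    per_dataset = {
        name: {v for v in variants if is_present(table.get(v))}
        for name, table in datasets.items()
    }
    rows = dict(per_dataset)
    rows["anywhere"] = set().union(*per_dataset.values())
    rows["Middle East"] = set().union(
        *(hits for name, hits in per_dataset.items() if name in MIDDLE_EAST)
    )
    total = len(variants)
    return {name: (len(hits), 100.0 * len(hits) / total) for name, hits in rows.items()}
```

Treating a reported frequency of exactly zero as absent keeps datasets with and without frequency data comparable, since a variant listed with zero observed alleles carries no more evidence than one that is not listed.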

QChip genome browser
In order to provide researchers and clinicians access to annotation and allele frequency data in Qatar and USA for the QChip1 Qatar SGD pathogenicity research and screening variants and genes, a web browser was constructed. The Qatar Genome Browser architecture consisted of a searchable table with a user interface implemented in a Shiny RStudio 72 application frontend, running within a Docker (docker.com) container instance installed on a CentOS Linux (centos.org) server backend. The server was custom built by Red Barn (thinkredbarn.com) and configured by Cornell BioHPC 73 . In order to maintain security, the development version was accessible only within the Cornell campus network or via Cornell VPN, with plans for a public release after publication of this report. Testing of the server was conducted to confirm that the URL (http://qchip.biohpc.cornell.edu) was accessible from both Weill Cornell Medicine New York and Weill Cornell Medicine Qatar.

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

DATA AVAILABILITY
Public datasets not produced by the authors and used in this study that describe disease genes, variants in disease genes, and their prevalence in Greater Middle East populations are available from ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/), the Center for Arab Genetics Studies (https://www.cags.org.ae/en), the Iranome (http://www.iranome.ir/), the GME Variome (http://igm.ucsd.edu/gme/), and the Thanaraj Lab at the Dasman Diabetes Institute in Kuwait (https://research.dasmaninstitute.org/en/persons/alphonse-thangavel-thanaraj). The data produced by the authors and used in this study can be divided into three categories: (1) sequence and genotype data used to produce the QChip knowledgebase of variants, (2) QChip genotype data, and (3) summaries of variants in QChip. For the sake of scientific reproducibility, availability and access to these three categories of data is described here. Category 1 data includes WGS data produced either by the Qatar Genome Program (QGP), Qatar BioBank (QBB) or by Weill Cornell Medicine, WES data produced by Weill Cornell Medicine (WCM), and a table of pathogenic variants previously observed at Hamad Medical Corporation (HMC). The QGP/QBB WGS data is described in Mbarek et al 24 ; sharing of these data outside of Qatar is prohibited and is not consented by the IRB protocol. However, external access to QBB/QGP genotype and phenotype data can be obtained through an established ISO-certified process by submitting a project request at https://www.qatarbiobank.org.qa/research/how-apply which is subject to approval by the QBB IRB committee. The data management infrastructure for QBB was described in detail previously 22 . The data and biosamples collected or generated by QBB are available to researchers at public and private institutions that conduct scientific research and that meet the requirements detailed in the Qatar Biobank Research Access policy.
Approved Users are given access to QBB's Research Data and/or Biosamples for the period agreed upon in the approved Access Agreement, with the possibility of subsequent renewal. For more information on what meets the requirements, researchers can request the Qatar Biobank Research Access policy from qbbrpsupport@qf.org.qa. This policy has enabled data sharing and collaboration in multiple studies, including a population genetics analysis of over 6000 Qataris 25 and the latest results of the COVID-19 Host Genetics Initiative 74 . Category 1 data also includes WGS and WES data produced by Weill Cornell Medicine; these data are available for sharing with researchers. The majority of these data was described in prior publications and is available for download from NCBI SRA (see SRP060765 for published WGS data, and SRP061943 and SRP061463 for published WES data). Unpublished WGS data from this study is accessible through NCBI BioProject PRJNA774497. Category 1 data also includes an unpublished list of variants identified by HMC; these data are available from a FigShare repository created for this project (https://figshare.com/projects/QChip1/120108). Category 2 data consists of QChip array genotypes for Qataris recruited by WCM, Qataris recruited by QBB, New Yorkers recruited by WCM, and Puerto Ricans recruited by UPRM. Consent for data sharing is not possible for Qataris recruited by QBB as well as for Puerto Ricans recruited by UPRM. QChip array genotypes for Qataris and New Yorkers recruited by WCM were deposited at NCBI (project accession PRJNA774497) and are included in the FigShare repository (https://figshare.com/projects/QChip1/120108).
Category 3 data consists of summaries of QChip variants, including annotation from Thermo Fisher (Affymetrix) on the QChip contents, annotation produced by the authors on QChip contents including allele frequency, a list of QChip variants of interest for SGD research, and a list of QChip variants of interest for SGD screening. All four datasets are available through the FigShare repository (https://figshare.com/projects/QChip1/120108). A browsable version of the list of variants with allele frequency data is in development and will be available at the project website (http://qchip.biohpc.cornell.edu). Variants of interest for screening in Qatar on QChip1 were deposited to dbSNP in a batch submission, are expected to be a part of dbSNP build 156, and were assigned the following accessions: ssID 2137544269 and ssIDs 5314393773 through 5314393911. The batch submission is available online at https://www.ncbi.nlm.nih.gov/SNP/snp_viewBatch.cgi?sbid=1063269.

CODE AVAILABILITY
Software code consisting of Python, Bash, and R scripts used to produce and analyze the data presented in this manuscript is available on GitHub (https://github.com/juansearch/qchip1) and on the project website (http://qchip.biohpc.cornell.edu).