Identifying genetic factors that contribute to the increased risk of congenital heart defects in infants with Down syndrome

Atrioventricular septal defects (AVSD) are a severe congenital heart defect present in individuals with Down syndrome (DS) at a > 2000-fold increased prevalence compared to the general population. This study aimed to identify risk-associated genes and pathways and to examine a potential polygenic contribution to AVSD in DS. We analyzed a total cohort of 702 individuals with DS with or without AVSD, with genomic data from whole exome sequencing, whole genome sequencing, and/or array-based imputation. We utilized sequence kernel association testing and polygenic risk score (PRS) methods to examine rare and common variants. Our findings suggest that the Notch pathway, particularly NOTCH4, as well as genes involved in the ciliome including CEP290 may play a role in AVSD in DS. These pathways have also been implicated in DS-associated AVSD in prior studies. A polygenic component for AVSD in DS has not been examined previously. Using weights based on the largest genome-wide association study of congenital heart defects available (2594 cases and 5159 controls; all general population samples), we found PRS to be associated with AVSD with odds ratios ranging from 1.2 to 1.3 per standard deviation increase in PRS and corresponding liability r2 values of approximately 1%, suggesting at least a small polygenic contribution to DS-associated AVSD. Future studies with larger sample sizes will improve identification and quantification of genetic contributions to AVSD in DS.


Imputed samples
There originally were 459 samples with Down syndrome (DS), including 211 atrioventricular septal defect (AVSD) cases (which we refer to as DS+AVSD cases) and 248 controls with structurally normal hearts (which we refer to as DS+NH controls), all of whom had available Affymetrix Genome-Wide Human SNP 6.0 array genotype data.
These 459 samples included the 210 cases and 242 controls analyzed in the prior genome-wide association study (GWAS) of DS-associated AVSD. 1 Using PLINK1.9 (version 1.90b6.6) 2,3 and R (version 3.4.1), 4 we applied standard GWAS quality control (QC) procedures, excluding subjects for sex discordance, outlier heterozygosity rates For these samples, we then performed genotype imputation using the Michigan Imputation Server. 7 Prior to imputation, all alleles were aligned to the (+) strand, and we used a program 8  Mean correlation between true and imputed genotypes for the ~600,000 genotyped SNPs was 0.990, suggesting high quality imputation. Considering all post-imputation variants, those with MAF ≥ 0.05 (5,349,403 variants) had mean imputation r 2 = 0.971, those with 0.01 ≤ MAF < 0.05 (2,300,344 variants) had mean r 2 = 0.882, and those with MAF < 0.01 (30,946,655 variants) had mean r 2 = 0.180. This indicates good imputation quality for variants with common or moderate MAF. We decided to drop variants with MAF < 0.01, those missing for more than 2% of samples, those with a maximum imputed genotype probability < 0.80, and those with imputation r 2 < 0.80.
We then applied standard GWAS QC to the imputed dataset. We dropped one sample with an outlying heterozygosity rate (> 3 SDs below the mean). No samples were dropped for excess missing genotypes (all had < 1% missingness). Following removal of the single sample, we again excluded variants missing for > 2% of individuals and those with MAF < 0.01, and also dropped variants with HWE mid-pvalue < 0.00001 and those with significant differences in missing genotype rate between cases and controls (p < 0.05). We also removed variants with A/T, T/A, C/G, and G/C alleles which can be difficult to match between datasets due to strand ambiguity. This left a dataset with 440 samples (206 DS+AVSD cases, 234 DS+NH controls) and 5,079,537 autosomal SNPs.

Whole genome sequencing samples
Starting from the previously described post-QC whole genome sequencing (WGS) dataset, which included 175 samples (148 DS+AVSD cases, 27 DS+NH controls), we applied additional variant filters in order to more closely match the variant QC procedures which had been applied to the imputed dataset. We removed variants with MAF < 0.01, those missing for > 2% of samples, and indels, leaving a WGS dataset with 175 samples and 4,173,676 autosomal SNPs (excluding chromosome 21).

Merging WGS and imputed samples
Coordinates for the WGS dataset were based on human genome build 38 (hg38), while those for the imputed dataset were based on human genome build 19 (hg19). Prior to merging the datasets, we used the University of California Santa Cruz (UCSC) Genome Browser 10 LiftOver tool to convert the WGS data coordinates from hg38 to hg19, and also modified dbSNP Reference numbers (rsIDs) for each variant as needed using an external file based on HRC panel variants containing hg19 rsIDs and coordinates. We chose to convert the WGS data to hg19 rather than converting the imputed data to hg38 as a matter of convenience, given the polygenic risk score (PRS) training files we used had hg19 coordinates.
As one additional step prior to merging the WGS and imputed datasets, we compared allele frequencies for SNPs in each dataset in order to identify any instances where allele frequency for a SNP in one dataset differed considerably from its allele frequency in the other dataset, which could indicate genotyping error for the variant. We identified and removed 77 SNPs with allele frequencies that differed by at least 0.20 between the WGS and imputed datasets.
We then merged the WGS and imputed datasets on rsID, position, and alleles

Target dataset for secondary PRS analysis
Our secondary PRS analyses examined the contribution by alleles on the trisomic chromosome 21 to a polygenic component for DS-associated AVSD. To do this, we compared PRS results based on polygenic scores generated using all autosomes (including chromosome 21) to PRS results based on scores using all autosomes except for chromosome 21.
We analyzed the same set of target samples as for the primary analyses (245 DS+AVSD cases, 242 DS+NH controls), 158 of whom had WGS data for chromosome 21, and 329 of whom had Affymetrix Genome-Wide Human SNP 6.0 array genotype data for chromosome 21 (given the complexities of imputing trisomic genotypes, we did not have imputed data for these 329 samples). Given that trisomic data cannot be represented by the PLINK1.9 binary format, we handled these chromosome 21 data separately from the other chromosomes. Prior to merging chromosome 21 data for these WGS and array samples, we applied certain QC filters. None of the 158 WGS samples nor the 329 array samples had an excess of missing genotypes for chromosome 21 (all had approximately 5% or less missingness). For variant QC, we excluded SNPs missing for > 5% of samples, as well as SNPs with A/T, T/A, C/G, and G/C alleles which can be difficult to match between datasets due to strand ambiguity.
We also removed SNPs with substantially different allele frequencies between the WGS and array datasets (we determined that a frequency difference of ≥ 0.125 was an appropriate threshold for these chromosome 21 datasets). Post-merger, we removed SNPs with excess missingness specifically among cases or controls (missing for > 3% of cases or > 3% of controls), and we also excluded SNPs that were monoallelic in the full sample. These steps yielded a merged chromosome 21 dataset with 487 samples and 3,984 SNPs.
We then took the dataset used for the primary analyses (487 samples and 2,351,951 autosomal SNPs, excluding chromosome 21), and limited it to SNPs on the Affymetrix Genome-Wide Human SNP 6.0 array, leaving 389,544 SNPs. This was done since the chromosome 21 data were also necessarily limited to the array SNPs. We used these SNP-array-level genotype data, both with and without the chromosome 21 data, in order to perform the secondary PRS analyses. Supplementary Table S1: PRS results using discovery GWAS of 2,594 mixed CHD cases and 5,159 controls and SNPs with MAF ≥ 0.35. 'Threshold' indicates that SNPs with discovery GWAS p-values below the threshold were used for PRS construction, and 'No. SNP' is the corresponding number of SNPs used for scoring. OR: Odds ratio per standard deviation increase in PRS, CI: Confidence interval (corresponding to uncorrected p-value), Nag. r 2 : Nagelkerke's r 2 , Punadj: Uncorrected p-value, Padj: P-value corrected for multiple correlated tests.

SUPPLEMENTARY FIGURES
Supplementary Figure S1: Flowchart showing the multiple steps involved in generating the final data set for the primary PRS analyses.
Supplementary Figure S2. Representative SKAT-O Manhattan plot and QQ plot of common variants, generated from analyses using the whole exome sequencing dataset. Each dot represents a gene in the SKAT-O analysis, ordered by chromosome. No gene reached Bonferroni significance (red horizontal line), however 30 genes showed a nominal significance level of p < 0.001 (blue horizontal line).
Supplementary Figure S3: PRS results using discovery GWAS of 2,594 mixed CHD cases and 5,159 controls and various MAF thresholds. MAF thresholds were applied to the discovery GWAS; SNPs with MAF below the threshold were excluded from PRS construction. Top row: Each plot displays odds ratio per standard deviation in PRS and the corresponding 95% confidence interval (y-axis) for PRS constructed based on particular discovery GWAS p-value thresholds (x-axis). Padj are adjusted p-values (corrected for multiple correlated tests). 95% CIs correspond to unadjusted p-values. Bottom row: Each plot displays Nagelkerke's r 2 (y-axis) for PRS constructed based on particular discovery GWAS p-value thresholds (x-axis). Numbers above each r 2 bar are the number of SNPs used to construct PRS at that particular p-value threshold and MAF filter combination.
Supplementary Figure S6: PRS results for all autosomes excluding chromosome 21. These analyses used the discovery GWAS of 2,594 mixed CHD cases and 5,159 controls. Various MAF thresholds were applied to the discovery GWAS; SNPs with MAF below the threshold were excluded from PRS construction. Top row: Each plot displays odds ratio per standard deviation in PRS and the corresponding 95% confidence interval (y-axis) for PRS constructed based on particular discovery GWAS p-value thresholds (x-axis). Padj are adjusted p-values (corrected for multiple correlated tests). 95% CIs correspond to unadjusted p-values. Bottom row: Each plot displays Nagelkerke's r 2 (y-axis) for PRS constructed based on particular discovery GWAS pvalue thresholds (x-axis). Numbers above each r 2 bar are the number of SNPs used to construct PRS at that particular p-value threshold and MAF filter combination.
Supplementary Figure S7: PRS results for all autosomes including chromosome 21. These analyses used the discovery GWAS of 2,594 mixed CHD cases and 5,159 controls. Various MAF thresholds were applied to the discovery GWAS; SNPs with MAF below the threshold were excluded from PRS construction. Top row: Each plot displays odds ratio per standard deviation in PRS and the corresponding 95% confidence interval (y-axis) for PRS constructed based on particular discovery GWAS p-value thresholds (x-axis). Padj are adjusted p-values (corrected for multiple correlated tests). 95% CIs correspond to unadjusted p-values. Bottom row: Each plot displays Nagelkerke's r 2 (y-axis) for PRS constructed based on particular discovery GWAS pvalue thresholds (x-axis). Numbers above each r 2 bar are the number of SNPs used to construct PRS at that particular p-value threshold and MAF filter combination.
Supplementary Figure S8: Maximum variance in target phenotype that can be explained by PRS (y-axis: liability scale r 2 ) given a range of training sample sizes (x-axis: number of cases in thousands). Assumptions: training sample with case:control ratio of 1:2 (same as ratio for larger of the two independent CHD discovery datasets); target sample with case:control ratio of 1:1 (same as ratio for DS target dataset); prevalence of CHD in training population is 1%; prevalence of AVSD in DS target population is 20%; 100,000 independent variants in the training SNP panel; genetic effects for training and target samples are identical (correlation = 1); proportion of SNPs in the training set panel that affect the training phenotype is 1%, 10% or 100%. For plot A, amount of variance in the training phenotype explained by the training set SNP panel (Vgtrain) is 15%; for plot B Vgtrain is 25%; for plot C Vgtrain is 35%. Solid black horizontal line marks the maximum r 2 that can be explained by PRS using an infinitely large training sample size (given the assumed parameters). Vertical orange line marks the number of CHD cases in the larger of the two independent discovery datasets (2,594 cases). A.