Genetic predisposition to hypertension is associated with preeclampsia in European and Central Asian women

Preeclampsia is a serious complication of pregnancy, affecting both maternal and fetal health. In a genome-wide association meta-analysis of European and Central Asian mothers, we identify sequence variants that associate with preeclampsia in the maternal genome at ZNF831/20q13 and FTO/16q12. These are previously established variants for blood pressure (BP) and the FTO variant has also been associated with body mass index (BMI). Further analysis of BP variants establishes that variants at MECOM/3q26, FGF5/4q21 and SH2B3/12q24 also associate with preeclampsia through the maternal genome. We further show that a polygenic risk score for hypertension associates with preeclampsia. However, comparison with gestational hypertension indicates that additional factors modify the risk of preeclampsia.


Supplementary Information
Tables
Supplementary Table 1 Studies included in meta-analyses, follow-up and downstream analyses
Supplementary Table 7 Results for associated variants in offspring and maternal and discovery meta-analyses
Supplementary Table 8 Samples included in preeclampsia subgroup analysis
Supplementary Table 9 Heritability of preeclampsia
Supplementary Figure 1 Manhattan plots of population-specific meta-analyses. Panels: a. Central Asian preeclampsia offspring, b. European preeclampsia offspring, c. Central Asian preeclampsia mothers, d. European preeclampsia mothers.
Supplementary Figure 2 Power to detect genome-wide significant association using the current meta-analyses. The plots show the allelic odds ratio (OR) required for 80% power to detect association with preeclampsia at genome-wide significance ($P < 4 \times 10^{-9}$ after adjusting for 12 million variants tested) in meta-analysis of offspring or maternal subjects from Europe, Central Asia, or Europe and Central Asia combined. Results are shown for a causal SNP with minor allele frequency (MAF) in the range 0.01-0.5.
a. Offspring meta-analyses. Effective sample size for the Central Asian, European and Combined meta-analysis is 2,064, 7,259 and 9,323 cases respectively and an equal number of controls. In the combined analysis we have 80% power to detect an OR > 1.15 at MAF = 0.5. b. Maternal meta-analyses. Effective sample size for the Central Asian, European and Combined meta-analysis is 2,137, 10,255 and 12,392 cases respectively and an equal number of controls. In the combined analysis we have 80% power to detect an OR > 1.13 at MAF = 0.5.
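The detectable odds ratios quoted in this caption can be approximated with a short calculation. The sketch below is not necessarily the calculation used to draw the figure: it assumes a two-sided allelic test, a normal approximation for the log odds ratio, and a variance evaluated under the null with the same allele frequency in cases and controls. The function name `min_detectable_or` is hypothetical.

```python
from math import exp, sqrt
from statistics import NormalDist


def min_detectable_or(n_cases, n_controls, maf, alpha=4e-9, power=0.80):
    """Approximate smallest allelic OR detectable with the given power.

    Assumptions (not taken from the source): two-sided test, normal
    approximation for log(OR), variance computed under the null from the
    expected 2x2 table of allele counts.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)  # significance threshold
    z_power = std_normal.inv_cdf(power)              # quantile for the target power
    # Each subject contributes two alleles; 1/a + 1/b + 1/c + 1/d simplifies to:
    var_log_or = (1.0 / (2 * n_cases) + 1.0 / (2 * n_controls)) / (maf * (1.0 - maf))
    return exp((z_alpha + z_power) * sqrt(var_log_or))


if __name__ == "__main__":
    for label, n in (("offspring", 9323), ("maternal", 12392)):
        print(label, round(min_detectable_or(n, n, 0.5), 3))
```

Under these assumptions the combined offspring and maternal sample sizes give thresholds close to the OR > 1.15 and OR > 1.13 stated in the caption.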
Supplementary Figure 3 Forest plots of variants associating with preeclampsia through fetal or maternal genome.
The plots show that variants that are common (MAF > 10%) in all 1000 Genomes regional populations and in the 1000 Genomes accessible genomic regions are reliably discovered in our dataset.
Supplementary Figure 10 Efficacy of Genotype Refinement. The plot shows the efficacy of the genotype refinement using Beagle and the 1000 Genomes reference panel.

Supplementary Note 1: Construction of Central Asia Haplotype Reference Panel
Extensive efforts have been made to provide reference panels of whole genome sequencing data from diverse populations through initiatives such as 1000 Genomes, but Central Asian populations have not so far been included in these panels. Central Asia lies at the centre of the Silk Road, the historic trading routes between Asia and Europe. The traditional nomadic lifestyle of much of the population, and repeated past invasions by surrounding powers, created a population of mixed ethnicity. During the Soviet era large numbers of Russians were resettled in Central Asia, and ethnic Russians made up a significant proportion of the population. The movement of ethnic Russians has reversed since the Central Asian republics gained independence in the early 1990s, and the two largest ethnic groups are now Kazakhs and Uzbeks.
Whilst reference panels for genotype imputation are readily available for European populations, no such panel was available for the Kazakh and Uzbek populations of Central Asia. We therefore undertook whole genome sequencing (WGS) of 100 Kazakh and 100 Uzbek individuals, equally divided between males and females, recruited from Kazakhstan and Uzbekistan respectively. The ancestry of each Central Asian volunteer was determined by the ethnicity of all four grandparents. Of interest, Kazakhs are represented by three hordes, or zhus (older, middle and younger), and information about grandparental zhus was recorded for each Kazakh subject.

Construction of Central Asia Haplotype Reference Panel
200 Central Asian individuals (100 Kazakh and 100 Uzbek) were whole genome sequenced at a coverage of approximately 4-5X. Variant calling discovered 11,870,850 single nucleotide polymorphisms (SNPs) and 1,013,884 indels, including over 2 million variants not detected in 1000 Genomes Phase 1. Phased genotypes were used to create a haplotype reference panel, and genotype imputation performance was assessed in 1600 chip-genotyped subjects. Combining reference data from Central Asian WGS and 1000 Genomes Phase 3 yielded better imputation quality than using either reference panel alone.

Population Structure
Principal component analysis (PCA) of the combined Central Asian WGS and 1000 Genomes Phase 3 (Europe, South Asia and East Asia) data indicates that the Kazakh and Uzbek populations lie on a cline between East Asia and Europe, with the Uzbek population exhibiting a greater affinity with Europe and South Asia (Supplementary Figure 6). Importantly, the analysis shows the two Central Asian populations clustering separately from any of the Eurasian 1KGP3 populations.

Imputation Quality
We assessed the imputation quality using the internal IMPUTE2 leave-one-out measurements of the squared correlation, $r^2$, between the genotype dosage of directly genotyped variants and the expected dosage of the corresponding imputed variants. We randomly selected 800 unrelated GWAS samples from each population and pre-phased each group of samples separately. The phased samples were then imputed using four different reference panels: 1000 Genomes Phase 3 + Central Asia (1KGP3+CA), 1000 Genomes Phase 3 (1KGP3), 1000 Genomes Phase 1 (1KGP1) and Central Asia (CA). The imputation quality was assessed at sites that are present both on the chip and in the intersection of the tested reference panels. We find that both Uzbek and Kazakh samples exhibit better imputation quality across the full range of Central Asian allele frequencies with the combined panel (1KGP3+CA) than with any of the other panels (Supplementary Figure 7). Performance relative to allele frequencies calculated in other regions exhibits a similar pattern (Supplementary Figures 12, 13 and 14).
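The squared-correlation measure described above can be illustrated for a single variant. This is a minimal sketch of the general metric, not a reimplementation of IMPUTE2's internal leave-one-out procedure; the function names `expected_dosage` and `dosage_r2` are hypothetical, and genotype probabilities are assumed to be ordered as 0, 1 and 2 copies of the counted allele.

```python
def expected_dosage(probs):
    """Expected allele count from genotype probabilities (P0, P1, P2)."""
    _p0, p1, p2 = probs
    return p1 + 2.0 * p2


def dosage_r2(genotyped, imputed):
    """Squared Pearson correlation between chip genotypes and imputed dosages."""
    n = len(genotyped)
    if n != len(imputed) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_x = sum(genotyped) / n
    mean_y = sum(imputed) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotyped, imputed))
    sxx = sum((x - mean_x) ** 2 for x in genotyped)
    syy = sum((y - mean_y) ** 2 for y in imputed)
    if sxx == 0.0 or syy == 0.0:
        return float("nan")  # correlation is undefined for a monomorphic site
    return (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    chip = [0, 0, 2, 2]
    imputed_probs = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
    dosages = [expected_dosage(p) for p in imputed_probs]
    print(dosage_r2(chip, dosages))
```

Averaging this quantity over variants within allele-frequency bins gives curves of the kind compared across reference panels.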

Ethics Statement
This study was approved by the Central Commission on Ethics of the Republic of Kazakhstan, the National Ethics Committee of the Ministry of Health of the Republic of Uzbekistan, and the Medical School Research Ethics Committee of the University of Nottingham. Volunteers in each Central Asian country gave informed consent and provided an irreversibly anonymised sample of venous blood for DNA extraction.

200 Kazakh and 200 Uzbek subjects for WGS were recruited from healthy volunteers in Almaty, Kazakhstan, and Tashkent, Uzbekistan respectively. Only 100 of the 200 subjects from each country were selected for whole genome sequencing in order to enhance subject anonymity and protect the identity of the subjects whose genomes were sequenced. The ethnicity of all four grandparents of each subject was recorded to minimise ethnic admixture. Kazakhs belong to one of three Zhus (hordes); the Zhu of each subject was recorded at the time of recruitment and subjects were selected for WGS to reflect the approximate composition of Zhus in Kazakhstan. There were no corresponding ethnic strata in the Uzbek subjects, who were therefore randomly selected. Subjects selected for WGS were equally split between males and females for each country. DNA was extracted in the country of origin and transferred to the Wellcome Trust Sanger Institute, UK, where it was subjected to quality control measures prior to sequencing. These included measurement of DNA concentration both by absorbance at 260 nm/280 nm and by the PicoGreen method; gel electrophoresis to check for DNA degradation; and Sequenom genotyping at 30 SNPs, including four sex-specific variants. Samples which failed sex checks, or where Sequenom genotyping was unsuccessful at 10 or more SNPs, were not selected for WGS.

Library Preparation and Sequencing
Approximately 1 µg of genomic DNA for each subject was fragmented to an average size of 500 base pairs (bp) and subjected to DNA library creation using established Illumina paired-end protocols.

Construction of haplotype reference panels
The Central Asia panel was created from the above filtered variants as follows:
• The novel variants (i.e. those not in 1000 Genomes Phase 3 (1KGP3)) were merged with 1KGP3 biallelic and multiallelic sites.
• The resulting genotype likelihoods were processed as follows:
• An initial genotype refinement for 1KGP3 variants was carried out with Beagle (v4 (r1399.jar)) using the 1KGP3 panel (Europe+East Asia+South Asia groups only) downloaded from the Beagle website.
• The genotype probabilities generated by this initial genotype refinement were then fixed as hard-called genotypes and merged with the genotype likelihoods of novel variants.
• Beagle was then run again without a reference panel on this merged dataset in order to obtain refined calls for the novel variants.
• The refined genotype probabilities for the novel variants were then extracted and merged with the refined genotype probabilities for 1KGP3 variants obtained from the first run of Beagle.
• The Beagle VCF files were converted into Oxford gen format and then hard-called and filtered using plink2 (--hard-call-threshold 0.1 --geno 0.05) and saved again in Oxford gen format.
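The effect of the two plink2 options in the final step can be illustrated in a few lines. The sketch below is an assumption-laden illustration, not PLINK's code: it takes the uncertainty of a call to be one minus the largest genotype probability (one common convention, which may differ from the exact rule in the PLINK build that was used), treats calls with uncertainty above 0.1 as missing, and drops variants whose missing-call rate exceeds 0.05. The function names are hypothetical.

```python
def hard_call(probs, threshold=0.1):
    """Return the most likely genotype (0, 1 or 2), or None if too uncertain.

    Assumed convention: uncertainty = 1 - max(genotype probabilities).
    """
    best = max(range(3), key=lambda g: probs[g])
    return best if 1.0 - probs[best] <= threshold else None


def filter_variants(variant_probs, threshold=0.1, max_missing=0.05):
    """Hard-call every sample, then keep variants with few missing calls.

    variant_probs maps a variant ID to a list of (P0, P1, P2) tuples,
    one tuple per sample.
    """
    kept = {}
    for variant_id, sample_probs in variant_probs.items():
        calls = [hard_call(p, threshold) for p in sample_probs]
        missing_rate = calls.count(None) / len(calls)
        if missing_rate <= max_missing:  # mirrors the --geno 0.05 cut-off
            kept[variant_id] = calls
    return kept


if __name__ == "__main__":
    confident = [(0.97, 0.03, 0.0)] * 18 + [(0.0, 0.02, 0.98)] * 2
    uncertain = [(0.97, 0.03, 0.0)] * 16 + [(0.6, 0.4, 0.0)] * 4
    print(sorted(filter_variants({"var_a": confident, "var_b": uncertain})))
```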

Concordance of Chip Genotypes
The WGS samples were also chip genotyped: 69 samples on Illumina OmniExpress 2.5.8 and 134 samples on Illumina OmniExpress (some samples were genotyped on both platforms). Basic QC was carried out on these chip genotypes (call rate > 98% and heterozygosity within 3 s.d. of the mean). Supplementary Figures 10 and 11 demonstrate the accuracy of the refined genotype probabilities.

Imputation Benchmarking
The chip-genotyped samples from Uzbekistan and Kazakhstan were separately QC'd using the following protocol. Quality control analysis was conducted using PLINK (http://zzz.bwh.harvard.edu/plink/) and SMARTPCA. Briefly, the quality control included the following subject-level exclusion criteria: individual call rate < 98%; heterozygosity > 3 s.d. from the mean; any of the first three HapMap (based on CEU, YRI, CHB, JPT and GIH populations) principal axes of variation > 4 s.d. from the mean; and sex mismatch. Related individuals (identity by descent (IBD) > 0.1) with the lowest call rates were preferentially removed. The variant-level exclusion criteria were as follows: call rate < 98%; exact Hardy-Weinberg equilibrium $P < 1 \times 10^{-6}$; minor allele frequency (MAF) < 1%; and non-random missingness of uncalled genotypes (PLINK --test-mishap) with Bonferroni-corrected P < 0.05.
From each of the above datasets, 800 female samples that passed the subject-level QC were selected, together with the 123,250 chip-genotyped chromosome 1 SNPs that were present in both the KAZ and UZB post-QC datasets.
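The subject-level exclusion criteria listed above can be sketched as a simple filter. This is an illustration rather than the PLINK/SMARTPCA workflow itself: the `Subject` record and its field names are hypothetical, the per-sample statistics are assumed to have been computed already, and means and standard deviations are assumed to be taken over the cohort being filtered.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Subject:
    sample_id: str
    call_rate: float       # fraction of genotypes called (0-1)
    heterozygosity: float  # observed heterozygosity rate
    pcs: tuple             # coordinates on the first three principal axes
    reported_sex: str
    genetic_sex: str


def _is_outlier(value, values, n_sd):
    """True if value lies more than n_sd standard deviations from the mean."""
    sd = stdev(values)
    return sd > 0 and abs(value - mean(values)) > n_sd * sd


def passes_subject_qc(subject, cohort):
    """Apply the call-rate, heterozygosity, ancestry and sex checks."""
    if subject.call_rate < 0.98:
        return False
    if _is_outlier(subject.heterozygosity, [s.heterozygosity for s in cohort], 3):
        return False
    for axis in range(3):
        if _is_outlier(subject.pcs[axis], [s.pcs[axis] for s in cohort], 4):
            return False
    return subject.reported_sex == subject.genetic_sex


if __name__ == "__main__":
    cohort = [Subject(f"s{i}", 0.99, 0.30, (0.0, 0.0, 0.0), "F", "F") for i in range(29)]
    cohort.append(Subject("low_call", 0.95, 0.30, (0.0, 0.0, 0.0), "F", "F"))
    print(sum(passes_subject_qc(s, cohort) for s in cohort))
```

Relatedness pruning and the variant-level filters are not covered by this sketch.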

Introduction
In [1] a method is introduced for inferring maternal, fetal, imprinting and maternal-fetal interaction effects from family genotype data. This method, EMIM, fits a genetic model using a multinomial likelihood for the possible family genotype combinations, constrained by Mendelian inheritance. Taking advantage of recent improvements in statistical haplotype phasing, a recent extension to the method [2] resolves heterozygous genotype cells to improve inference of parent-of-origin effects.
One difficulty in using EMIM in mildly heterogeneous populations is that it does not provide a means to control for population stratification. Here we present an extension that incorporates cohort indicator variables and ancestry principal components as covariates into the method. We provide an R package implementation of the method (remim).

Methods
Following the notation in [1], we assume the following model for penetrance. Let A2 be the risk allele, with allele frequency $A_2$ (the alternate allele A1 has allele frequency $A_1 = 1 - A_2$).
• $i$ copies of A2 in the child multiply penetrance by $R_i$
• $i$ copies of A2 in the mother multiply penetrance by $S_i$
• a paternally transmitted copy of A2 multiplies penetrance by $I_p$
• a maternally transmitted copy of A2 multiplies penetrance by $I_m$
• $i$ maternal copies of A2 and $j$ fetal copies of A2 multiply penetrance by $\gamma_{ij}$
This gives the following equation for the penetrance:
$$P(D \mid mU, mT, pT; \theta) = \alpha \, R_{mT+pT} \, S_{mU+mT} \, I_p^{pT} \, I_m^{mT} \, \gamma_{mU+mT,\, mT+pT}$$
where $\theta = \{R_i, S_i, I_p, I_m, \gamma_{ij}, A_2\}$; $mU$, $mT$ and $pT$ are the counts of the maternal untransmitted, maternal transmitted and paternal transmitted risk alleles; and $\alpha$ is the disease prevalence.
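The multiplicative rules above can be written as a short function. This is a sketch under stated assumptions rather than code from the remim package: the function name and argument layout are hypothetical, and the baseline factors for zero copies (`R[0]`, `S[0]`) are assumed to be supplied by the caller, typically as 1.

```python
def penetrance(mU, mT, pT, alpha, R, S, I_m, I_p, gamma):
    """Penetrance for one mother-child pair under the multiplicative model.

    mU, mT, pT : 0/1 counts of the maternal untransmitted, maternal
                 transmitted and paternal transmitted risk alleles.
    R, S       : length-3 sequences indexed by copies in child / mother.
    gamma      : 3x3 nested sequence indexed [mother copies][child copies].
    """
    child_copies = mT + pT    # the child carries the two transmitted alleles
    mother_copies = mU + mT   # the mother carries both of her own alleles
    return (
        alpha
        * R[child_copies]
        * S[mother_copies]
        * (I_m ** mT)
        * (I_p ** pT)
        * gamma[mother_copies][child_copies]
    )


if __name__ == "__main__":
    no_interaction = [[1.0] * 3 for _ in range(3)]
    value = penetrance(
        mU=0, mT=1, pT=1, alpha=0.05,
        R=(1.0, 1.5, 2.25), S=(1.0, 1.2, 1.44),
        I_m=1.0, I_p=1.0, gamma=no_interaction,
    )
    print(value)
```

In the usage example the child carries two risk alleles and the mother one, so the penetrance is the prevalence multiplied by the child factor for two copies and the maternal factor for one copy.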
We use the central equation in [1] and assume random mating. In the case where we have genotype data but are unable to infer transmitted and untransmitted alleles (e.g. for sparsely genotyped datasets), it is necessary to sum over the values of $mU$, $mT$ and $pT$ that are consistent with the genotype data. The parameters are then fitted by maximising the resulting likelihood, and statistical significance is assessed by likelihood ratio tests on nested models or by estimating standard errors in the usual way from the inverse Hessian matrix. Note that this model fits a single population allele frequency.
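The set of transmission patterns that the likelihood sums over can be enumerated directly. The sketch below covers only a mother-child pair genotyped at a biallelic site and is an illustration of the consistency constraint, not the package's implementation; the function name `consistent_transmissions` is hypothetical.

```python
from itertools import product


def consistent_transmissions(mother_genotype, child_genotype):
    """List (mU, mT, pT) triples compatible with the observed genotypes.

    Genotypes are risk-allele counts (0, 1 or 2). The mother carries
    mU + mT risk alleles and the child carries mT + pT.
    """
    return [
        (mU, mT, pT)
        for mU, mT, pT in product((0, 1), repeat=3)
        if mU + mT == mother_genotype and mT + pT == child_genotype
    ]


if __name__ == "__main__":
    # Heterozygous mother and heterozygous child: the transmitted allele is ambiguous.
    print(consistent_transmissions(1, 1))
    # A homozygous risk mother cannot have a child with no risk alleles.
    print(consistent_transmissions(2, 0))
```

When both mother and child are heterozygous, two patterns remain, and the likelihood contribution is the sum of the terms for each.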
Here, rather than fitting a single allele frequency, we propose modelling individual-specific genotype/allele probabilities (allele frequencies) using a logit transform of the weighted sum of continuous covariates, where $c^{(m)}$ and $c^{(p)}$ are the maternal and paternal covariates respectively (e.g. principal components or cohort indicator variables). In the case of duos, the missing parent's covariates can be estimated by assuming that the child covariates $c_i^{(c)}$ (available by assumption) are the average of the parental covariates. For lone cases (mothers, fathers or offspring) we do not necessarily have values for the missing parents. Under these circumstances we are forced either to remove the samples or to make an assortative mating assumption and set the unknown covariates to the same values as those of the sampled family member. For lone controls the covariates of the individual sample are sufficient.
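The covariate-dependent allele frequency and the duo imputation rule can be sketched as follows. This is an illustrative reading of the description above, not the remim code: the function names are hypothetical, the weights are assumed to be shared between mothers and fathers, and any intercept is assumed to enter as a constant or cohort-indicator covariate.

```python
from math import exp


def allele_frequency(beta, covariates):
    """Individual-specific risk-allele frequency from a logistic model.

    The linear predictor is the weighted sum of the covariates, and the
    frequency is its inverse-logit, so the result always lies in (0, 1).
    """
    linear_predictor = sum(b * c for b, c in zip(beta, covariates))
    return 1.0 / (1.0 + exp(-linear_predictor))


def impute_missing_parent(child_covariates, known_parent_covariates):
    """Covariates of the unsampled parent in a duo.

    Assumes the child's covariates are the average of the two parents',
    so missing = 2 * child - known.
    """
    return [2.0 * c - k for c, k in zip(child_covariates, known_parent_covariates)]


if __name__ == "__main__":
    beta = [-1.0, 0.8]    # hypothetical weights: intercept, first principal component
    mother = [1.0, 0.2]   # constant term and the mother's PC1
    child = [1.0, 0.5]
    father = impute_missing_parent(child, mother)
    print(allele_frequency(beta, mother), allele_frequency(beta, father))
```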
We incorporate these new variables into $\theta$ as $\theta = \{R_i, S_i, I_p, I_m, \gamma_{ij}, \beta\}$, so that the $\beta$ coefficients are jointly estimated with the parameters of the EMIM genetic model by maximising the composite log-likelihood for the set of parameters.

Discussion
Including cohort indicator variables and principal components extends the scope of the method, allowing the joint analysis of multiple cohorts and/or genetically heterogeneous populations. The cost of considering continuous covariates is that individuals must be included separately in the likelihood equation rather than grouped into cells containing the same genotype combinations. This makes the approach less computationally efficient and so less appropriate for analysing whole genome data.
We provide an R-based implementation (remim) in which we have made use of explicit expressions for both the likelihood and the likelihood gradient to facilitate the rapid convergence of R's "optim" function using the method "L-BFGS-B". The likelihood and gradient functions are implemented in C.