Genetic polymorphisms associated with adverse pregnancy outcomes in nulliparas

Adverse pregnancy outcomes (APOs) affect a large proportion of pregnancies and represent an important cause of morbidity and mortality worldwide. Yet the pathophysiology of APOs is poorly understood, limiting our ability to prevent and treat these conditions. To search for genetic markers of maternal risk for four APOs, we performed multi-ancestry genome-wide association studies (GWAS) for pregnancy loss, gestational length, gestational diabetes, and preeclampsia. We clustered participants by their genetic ancestry and focused our analyses on three sub-cohorts with the largest sample sizes: European, African, and Admixed American. Association tests were carried out separately for each sub-cohort and then meta-analyzed together. Two novel loci were significantly associated with an increased risk of pregnancy loss: a cluster of SNPs located downstream of the TRMU gene (top SNP: rs142795512), and the SNP rs62021480 near RGMA. In the GWAS of gestational length we identified two new variants, rs2550487 and rs58548906 near WFDC1 and AC005052.1, respectively. Lastly, three new loci were significantly associated with gestational diabetes (top SNPs: rs72956265, rs10890563, rs79596863), located on or near ZBTB20, GUCY1A2, and RPL7P20, respectively. Fourteen loci previously correlated with preterm birth, gestational diabetes, and preeclampsia were found to be associated with these outcomes as well.

included biospecimens, clinical measurements, ultrasounds, behavior (through interviews and questionnaires), physical activity assessment, and dietary content.
By precisely characterizing different aspects of over 10,000 pregnancies, the nuMoM2b cohort has already yielded valuable insights into factors that contribute to APOs 11,12 .Additionally, the availability of biospecimens provides a unique opportunity to study the genetic underpinnings of APOs.The objective of this study was to test for association between common variants across the maternal genome and four APOs in the nuMoM2b cohort: gestational length (as a proxy for preterm birth), preeclampsia, gestational diabetes mellitus (GDM), and pregnancy loss.Our investigation leverages ancestrally diverse populations to further isolate potential genetic factors involved in these APOs.

Participants
The participants of the analysis were enrolled in the nuMoM2b cohort (https:// www.nichd.nih.gov/ resea rch/ suppo rted/ nuMoM 2b), a longitudinal, multiethnic cohort study of nulliparous individuals.All participating centers, documented in Haas et al. 10 , obtained approval by the local Institutional Review Boards (IRBs) of their corresponding recruitment institutions.The genotype analysis is covered by Indiana University's IRB, which was Protocol Study number 1008-08, approved on 9/28/2010.This study was conducted in accordance with the ethical principles of the Declaration of Helsinki.All participants included in this study provided informed consent.The study enrolled 10,038 nulliparous people from the first trimester of their pregnancy to participate in three study visits during pregnancy.Collection of health status and biomarkers were conducted at regular intervals, and documentation of pregnancy outcomes was performed by medical record abstraction using a priori definitions (details of this process were described by Haas et al. 9,10 ).

Phenotype definitions
Pregnancy loss: All subjects who had a pregnancy loss, regardless of gestational age, were considered as cases (Table 1).A pregnancy loss occurs when the fetus dies during gestation.This categorization includes all cases of fetal demise occurring under 20 weeks of gestation, and all stillbirths (defined as a fetal death occurring from 20 weeks of gestation onward).Individuals who underwent termination of pregnancy were excluded from the pregnancy loss analysis.Subjects who had a live birth were treated as controls.
Gestational length: We opted to use a quantitative phenotype, gestational length, instead of a binary preterm/ full term outcome to gain additional information and statistical power by using a more granular phenotype.Gestational length was determined from an estimated due date established by a first-trimester ultrasound crownrump length measurement and was recorded in weeks 10 .Preterm birth was defined as any live birth that occurred before 37 weeks gestational age.Cases of stillbirth, fetal demise, and termination (elective and indicated) were all excluded from this phenotype group (Table 1).Cases of preterm birth that were medically indicated (e.g. for preeclampsia), were excluded.Term births that had labor induced or were by planned cesarean delivery were included as they occurred at term and because gestational age at delivery is a right-bordered outcome.Thus, we did not believe that their inclusion would alter this result and excluding these cases would limit the sample size of deliveries > 37 weeks, potentially skewing the data toward lower gestational age means.
Gestational diabetes (GDM): GDM was diagnosed through clinical evaluation, from fasting blood sugar, sequential 1-h glucose challenge test followed by a 3-h glucose tolerance test (GTT), or a single step 2-h 75-g GTT 13 .We excluded individuals diagnosed with pregestational diabetes from GDM analyses, and all other individuals were treated as controls.
Preeclampsia: Cases are individuals with a diagnosis of preeclampsia (with and without severe features), eclampsia, and chronic hypertension with superimposed preeclampsia.A detailed description of nuMoM2b study definitions of hypertensive disorders of pregnancy was published in the supplement to the paper by Facco et al. 13 .All other individuals were treated as controls.We did not exclude individuals with hypertension antepartum, as we are interested in unearthing markers specific to gestational disease, not general hypertension.

Genotyping
We genotyped all participants who had adequate samples and agreed to be genotyped (n = 9,757).We used the Infinium Multi-Ethnic Global Array (MEGA; Illumina, USA), which is designed to adequately query individuals of multiple genetic ancestries 14 (a known issue in genotyping studies), enriched for variants of clinical importance 15 , and has been successfully used to study recent admixed populations 14,16 .The MEGA allowed us to genotype > 1 million variants that are on average 1.4 Kb apart, effectively covering the entire genome.DNA extractions from whole blood were carried out on a QIAsymphony instrument (from Qiagen; extraction kit DSP DNA Midi Kit #937,355, protocol Blood_1000_V7_DSP) at the Center for Genomics and Bioinformatics (Indiana University, Bloomington), and genotyping was completed at the Van Andel Institute (Grand Rapids, MI, USA).We imposed standard quality control filters at this stage, all involving technical measurements of the raw intensity data (cluster separation < 0.3, normalized R-value mean < 0.2 for all genotypes, 10th percentile of the GenCall scores < 0.3) using GenomeStudio v2.4 (Illumina).Genotype calls for the ~ 1.7 million loci that passed initial quality control (98.3% of all markers in the array) were made with Beeline autoconvert (Ilumina).

Quality control pipeline
We leveraged a multi-step quality control (QC) pipeline, depicted in Fig. 1, to adequately address the heterogeneous nature of the dataset.The pipeline is broken into five separate modules that integrate current best practices in GWAS QC, which are further described in the text below.Unless otherwise stated, both the quality control steps and analyses were carried out using PLINK1.9 17.

Module 1: preprocessing and QC
The initial preprocessing of the dataset removed poorly genotyped individuals and SNPs according to the following criteria: (1) minor allele frequency (MAF) < 1%, (2) missingness of genotyping per individual and per marker > 5%, (3) Hardy-Weinberg Equilibrium (HWE) test p-value < 5 × 10 -6 , and (4) heterozygosity F-statistic within 3 standard deviations from the mean heterozygosity across all subjects using autosomes (Supplementary Figure S1).Significantly reduced heterozygosity may be indicative of high levels of consanguinity and subjects with excessive heterozygosity are suggestive of sample contamination, thus we excluded these subjects from the downstream dataset.

Module 2: cryptic relatedness
The self-reported racial and ethnic diversity of the nuMoM2b cohort necessitates a careful approach in assessing relatedness.As the pairwise identity-by-descent (IBD) estimation implemented in PLINK assumes a homogeneous subset, we used the KING-Robust algorithm, a pairwise kinship estimator for GWAS that is robust to the presence of unknown population substructure 18 .We inferred kinship estimates between all pairs of subjects in the cohort (Supplementary Figure S2), randomly removing one subject from the pairs of subjects with first-or second-degree relatedness (kinship coefficient > 0.0884) such that we minimized the number of subjects removed.

Module 3: subpopulation stratification
With the goal of minimizing spurious genetic associations in downstream analyses driven by population stratification and ancestry-based allele frequency differences, we clustered the cohort by genetic ancestry (Supplementary Figure S3).We determined the ancestry of each subject using SNPweights v.2.1 and a set of approximately 40,000 ancestry-informative markers curated by the 1000 Genomes Consortium 19 .Samples were clustered into five ancestry groups concordant with the 1000 Genomes Consortium 19 subpopulations: African (AFR, n = 1425), Admixed American (AMR; n = 846), East Asian (EAS; n = 323), European (EUR, n = 6082), and South Asian (SAS; n = 112).Membership in each ancestry subpopulation was assigned based on having ≥ 80% ancestry in the specific subpopulation (Supplementary Figure S4).We observed a large fraction of highly admixed individuals; thus, we also established a sixth group (ADM, n = 891) of subjects who have no percent ancestry > 50% in any single ancestry group.Not all ancestry groups contained enough subjects to power a genome-wide association.Accordingly, we only proceeded with imputing the European, African, and Admixed American ancestry subcohorts.From this module onward, all QC steps are performed at the sub-cohort level.
Next, we checked the consistency of reported sex with sex assignments imputed from X chromosome breeding coefficients.This step was performed after subjects were grouped into subpopulations because F-statistic approximation for the X-chromosome relies on accurate MAF estimates, which vary at the subpopulation level.The sex check consists of four steps: (1) unambiguously re-coding the pseudo-autosomal region of the X-chromosome, (2) performing LD-based pruning on the set of markers used for the F-estimate, (3) confirming that all F estimates yield female calls using the threshold F < 0.6 (PLINK suggests a cutoff of F > 0.8 for a male call), and (4) removing any subjects with discordant results.Lastly, population structure was determined in PLINK using pruned SNPs from the data (linkage disequilibrium pruning r 2 < 0.1).The top ten principal components were computed for each of the three subpopulations (EUR, AFR, AMR).

Module 4: phasing and imputation
Subjects in these sub-cohorts were phased with Eagle2 and imputed by Minimac3 20 using the TOPMED Imputation Server 21,22 .Prior to phasing, TOPMED partitions the data into 20 megabase length chunks and removes SNPs that are: (1) duplicates, (2) indels, (3) strand ambiguous (C/G and A/T), ( 4) not included in the Haplotype Reference Consortium (HRC) panel, and (5) mismatched between the reference panel and study.Data were then imputed using version R2 of the TOPMED panel, currently the largest panel of sequenced human genomes, and containing representation from the ancestry groups observed in nuMoM2b.

Genome-wide associations
Association testing was carried out using regression models implemented in PLINK v1.9 17 .The model was adjusted for each subject's rank-transformed age, body mass index (BMI), and the first ten principal components www.nature.com/scientificreports/from population structure analysis.As maternal age has a nonlinear effect on pregnancy 23,24 , we transformed each subject's age into the distance from the median percentile in all subjects.The suggestive association threshold was P < 1 × 10 -5 and the threshold for genome-wide significance was P < 5 × 10 -8 .Any subjects with missing age or phenotype information were excluded from analysis.The multi-ancestry results from the sub-cohorts were fixed-effect meta-analyzed using GWAMA 25 and result plots were displayed using R libraries.As the majority of the subjects in nuMoM2b are of European ancestry, we only considered variants shared by this group, as the African and Admixed American sub-cohorts were underpowered in detecting signal in a standalone GWAS.Due to the smaller cohort size and large class imbalance (7 cases, 768 controls) observed in the Admixed American cohort for pregnancy loss outcomes, we did not perform a pregnancy loss GWAS for this sub-cohort.
Using the results of the genome-wide associations for each APO, we inferred SNP-based heritability of these traits using LDscore 26 .We scanned for putative regulatory effects by carrying out a transcriptome-wide association study (TWAS) as implemented in the Fusion software, and using the available expression reference weights for whole blood and adipose tissue 27 , as well as liver, pancreas, vagina, and uterus, and whole blood from the GTEx v7 multi-tissue RNA-seq data set.For TWAS, P-values were Bonferroni-corrected by the number of genes in each panel.Conditional expression analyses were performed using R scripts from the Fusion software.As both SNP-based heritability and TWAS inferences rely on reference panels, and relevant panels for these were developed with EUR cohorts, we used only EUR individuals in these two analyses.

Variant fine mapping and annotation
We utilized the previously defined fine-mapping approach 28 to delineate the 95% credible set for each significant meta-analysis locus containing more than one SNP of interest.In this approach, the posterior probability that each variant is causal is computed using approximations of the necessary Bayes factors.Only variants with a cumulative posterior probability above 95% are included in the credible set.
In characterizing the putative causal role of our variants of interest, we further annotated each SNP using Variant Effect Predictor 29 (VEP) to identify the corresponding gene function, and RegulomeDB 30 to understand the regulatory context.VEP enables genome interpretation by providing the gene and transcript level context, while also describing the regulatory location within the gene and type of point mutation for each variant.Regu-lomeDB heuristically scores variants on a scale of 1a (high confidence) to 7 (low confidence) given the presence of regulatory function as assessed by various functional screens, including annotations from ENCODE 31 and other sources.

Pregnancy loss
The multi-ancestry meta-analysis revealed a set of 12 novel SNP associations across two loci associated with pregnancy loss (Table 2 and Supplementary Table S1).Significant SNPs were located within or near the genes TRMU and RGMA (Table S1, Supplementary Figures S10-S12).While most of the associated SNPs appeared in both the European and African cohorts, two SNPs (rs143149726 and rs142372194) were only found in the European cohort (Supplementary Table S2).Additionally, fine-mapping the eleven SNPs identified on or near TRMU eliminated rs147382049 from the 95% credible set for the locus.For the nine remaining multi-ancestry variants, the direction of effect was concordant across the two ancestry cohorts and suggestive of an increased odds of pregnancy loss in carriers of the alternative allele.The genomic inflation factor was suggestive of minimal confounding effects in the meta-analysis (Supplementary Figure S5).This trait showed the highest SNP-based heritability (h 2 = 0.30, SE = 0.075) of the four APOs studied.

Table 2.
The top genome-wide SNPs across the three genome-wide associations with significant results.Odds ratio (OR) is computed for the pregnancy loss, and GDM GWAS, while beta is computed for the gestational length (GWAS)."Effect Direction" reflects the direction of the odds ratio or beta of the SNP in each sub-cohort containing the variant (EUR, AFR, AMR; + indicates positive association,-indicates negative association).www.nature.com/scientificreports/Further investigation of the variants using annotation tools pinpoints potential functional effects.Regu-lomeDB scores all the associated variants as "regulatory" (scoring < 5 on the RegulomeDB scale).rs143149726, a downstream gene variant of TRMU, (TRNA 5-Methylaminomethyl-2-Thiouridylate Methyltransferase), scored the lowest score across all associated variants (indicating the greatest amount of evidence for the variant to be in a functional region).The variant had a RegulomeDB rank of 1f. with a probability of 1 (Table S1), indicating that it was "likely to affect binding and linked to expression of a gene target" given evidence of an expression quantitative trait locus (eQTL) and transcription factor binding or DNase peak.rs62021480, the top SNP association at the RGMA locus, encodes a 3 prime UTR variant for the gene, which is a glycoprotein that guides developing axons and may act as a tumor suppressor.

Gestational length
One multi-ancestry and one European-specific variant were implicated in the quantitative meta-analysis of gestational length (Supplementary Figures S13-S14).rs58548906 is an intergenic variant located on the X chromosome near AC005052.1 and appears to be associated with a reduced gestational length (beta = − 1.43, CI = [− 1.94, − 0.921]) across African, Admixed American, and European ancestries.rs2550487, located on the 3 prime UTR of WFDC1, was found in the European sub-cohort only, and individuals carrying the minor allele are associated with lower gestational length (beta = − 1.28, CI = [0.215,− 1.70] ).Both SNPs show "minimal binding evidence" in their RegulomeDB ranking, with evidence of transcription factor binding or a DNase peak.

Gestational diabetes mellitus
Three significant loci were identified in the GWAS meta-analysis for GDM (Table 2, Table S2, Supplementary Figures S15-S19).A set of four SNPs appear in the intergenic region near RPL7P20, Ribosomal Protein L7 Pseudogene 20.The most significantly associated SNP in this set is the intergenic variant, rs79596863 (Table 2).The second significant locus includes intronic variants rs61167087 and rs72956265 in ZBTB20, a Zinc Finger and BTB Domain Containing 20.Lastly, rs10890563, a 3'UTR variant on the gene GUCY1A2, Guanylate Cyclase 1 Soluble Subunit Alpha 2, had a significant association with GDM as well.Several loci show suggestive associations in the meta-analysis, which we report in the supplementary text (Fig. 2; Table S2).We did not find genomic inflation in any of the ancestry groups or in the meta-analysis (λ = 1.001,Supplementary Figure S7), and we estimated SNP-based heritability to be h 2 = 0.13 (SE = 0.085).
We also found a positive association with gestational diabetes for four previously reported variants (P < 0.05; Table 3).These SNPs were mapped near four genes: Hexokinase HKDC1, transcription factor TCF7L2, melatonin receptor MTNR1B, and major histocompatibility complex HLADQB1.Table 3. Previously identified variants that were replicated in this study (association with relevant APO P < 0.05).The odds ratio (OR; standard error in parentheses), significance (P-value), and direction of the effect are shown for the meta-analysis of three sub-populations, as well as the closest gene(s) to the relevant marker.www.nature.com/scientificreports/

Discussion
We present a set of multi-ancestry GWAS of four of the most ubiquitous APOs.Overall, we identified variants at six novel loci associated with our APOs of interest, while also replicating 14 signals from previous studies.Results from the European-ancestry analyses also reveal two secondary signals at the TRMU locus, and the discovery of a new locus association with gestational length on WFDC1.Using a Bayesian fine-mapping approach, we further narrowed the list of putative causal variants to those within a credible posterior probability of at least 95%.The candidate genetic markers identified in the present study provide a starting point for molecular exploration of each APO of interest.The cluster of 11 significant variants at the TRMU locus in the pregnancy loss GWAS are characterized as having a significant regulatory role in binding and gene expression given high-throughput experimental data from RegulomeDB.Indeed, the expression quantitative trait locus (eQTL) annotation from RegulomeDB and VEP indicate that the SNPs in the aforementioned cluster (Table S1) mediate gene expression across several two genes located on chromosome 22q13.31:TRMU and CELSR1.Additionally, TWAS analysis implicates the gene TTC38 in pregnancy loss across three tissue types: liver, ovary, and uterus.A top SNP in this locus, rs114058457 (Table S1), was previously found by Jansen et al. as an eQTL for TTC38 in blood in a cohort of 4,896 subjects of European ancestry (FDR-corrected P < 1.34 × 10 -5 ) 32 .Overall, there appears to be consistent evidence mapping the top SNPs from the chromosome 22q13.31locus to biologically meaningful roles.
While the preliminary annotations of the chromosome 22q13.3SNP cluster seem to be promising, further work is necessary to pinpoint the mechanistic contribution of the four target genes.Previous work has shown that TRMU encodes a mitochondria-specific tRNA-modifying enzyme and plays a key part in mitochondrial translation 33 .Aberrant expression of TRMU is likely pathogenic to humans early in life, as variants in TRMU have been linked to several disease phenotypes, including infantile liver failure 33 , and infantile reversible cytochrome c oxidase deficiency 34 .Next, TTC38, implicated through the TWAS, has been posited as a factor associated with kidney development 35 , while CELSR1 is highly prevalent in embryonic tissues 36,37 and linchpins embryonic development across humans and other vertebrates 38,39 .The genes we have outlined here span a diverse set of functions, however we speculate that an underlying commonality is that they act as important contributors to early development.
Rs62021480, located near RGMA and associated with greater risk of pregnancy loss, plays a role in transcription factor binding and lies in an open chromatin region, however the evidence for a regulatory role is minimal.Still, rs62021480 lies within the 3' UTR of RGMA, and therefore may have post-transcriptional regulatory function.RGMA is a repulsive guidance molecule for axons of the retina, which is a pivotal step in the developing brain 41 .
Interpretation of the results from the remaining two GWAS is less straightforward as the associated SNPs have greater ambiguity from a functional perspective.The top GDM SNPs identified localize to three genes: RPL7P20, ZBTB20, and GUCY1A2.While RPL7P20 codes for a pseudogene, several studies have made associations between SNPs at this locus and increased heart rate 42 as well as general cognition 43,44 .The zinc finger ZBTB20 has been linked to Primrose syndrome 45 , which may include insulin-resistant diabetes and other metabolic disruptions.This transcription factor regulates beta-cell function in mice 46 and has been hypothesized to be involved in the control of glucose metabolism.GUCY1A2 (a guanylate cyclase, GTP-binding protein) has shown high expression levels in placenta and was differentially expressed in rat placentas responding to hyperglycemia 47 .In addition, GUCY1A2 has been inferred to be associated with diabetes through exposure to several toxins (reported in the Comparative Toxicogenomics Database 48 ).
Two significant SNPs map to two separate loci in the GWAS of gestational length: AC005052.1 and WFDC1.While the gene profile of AC005052.1 is relatively unknown, WFDC1 modulates inflammatory response 49 .Inflammation has a clearly established link to preterm birth 50 and dysregulation of the immune system through inflammation may lead to harmful effects on pregnancy 51 .These initial findings suggest potential mechanisms through which the SNP findings may shorten gestational length during pregnancy.
In conjunction with the GWAS findings, nuMoM2b offers a uniquely valuable resource for genetic studies on two levels.First, the dataset spans over 4,600 features per subject 12 , representing a multimodal array of clinical, biological, and environmental factors influencing pregnancy.The deep, longitudinal phenotyping of the cohort presents varied opportunities for additional GWAS and downstream analyses.Second, another advantage of this study and any future genetic studies performed using nuMoM2b's multi-ancestry population is the potential for better genetic signal triangulation in the face of confounds.As nuMoM2b contains several distant ancestral populations, meta-analyses of GWAS from each ancestry group takes advantage of naturally occurring differences in linkage disequilibrium between SNPs to disentangle false positives from the most probable causal variant.In particular, having an African ancestry sub-cohort is not only inherently beneficial to study, but also helpful in that African populations contain a more diverse set of haplotypes 52 that can hone in on causal variant signals from a more homogeneous population.The preprocessing pipeline established in this study can be used in future studies to study additional variants in an ancestry-stratified manner.
There are still important limitations.When considering many of the disease phenotypes in nuMoM2b, there is an imbalanced breakdown of subjects in terms of both case-control status and genetic ancestry.This imbalance persists in the analyses presented here-indeed, the pregnancy GWAS meta-analysis excluded the Admixed American sub-cohort due to extreme class imbalance, and the meta-analysis results contain > 70% of individuals of European ancestry, despite the inclusion of diverse ancestry populations.Our results would therefore benefit from further validation cohorts, particularly of individuals of non-European genetic ancestry, to improve with broad applicability of the findings.Future studies may want to study quantitative traits to improve statistical power given modest sample sizes.
This study confirmed previously reported associations between several SNPs and APOs but failed to do so for most markers queried.This result is not unexpected, however, since low levels of replication in GWAS are not uncommon 53,54 .While we investigated the functional role of our variants of interest through fine-mapping approaches and annotation tools, the APOs studied in this paper are ultimately defined by complex genetic architectures.The effect sizes observed for individual SNPs suggest that the APOs studied are highly polygenic.These findings imply that, rather than aiming for genetic screening at one or a few markers, better prediction of these APOs will come through the modeling of genetic signals genome-wide (e.g., through polygenic risk scoring) together with non-genetic factors.Examination of gene-by-environment interactions, which disentangle the nonlinear interplay of variants and external factors on phenotype, may further illuminate drivers of APOs 55 .Such efforts are already underway, with auspicious results for the prediction of GDM through the combination of polygenic risk scores and behavioral data 56 .It is also possible that these complex phenotypes are largely controlled by rare genetic variants 57 for which GWAS are not adequate.Identifying rare causative variants would require sequencing efforts including whole-genome and whole-exome sequencing, as well as larger study populations of diverse genetic ancestry.
The results of our multi-ancestry study highlight the role of previously unknown loci across several APOs: gestational length, GDM, pregnancy loss, and preeclampsia.We unearth potential contributors to preterm birth at multiple levels of granularity-SNP, gene, and tissue.We also confirm previously identified genetic associations for preterm birth, GDM, and preeclampsia.These findings broaden our understanding of the complex polygenic nature of the APOs studied, informing further research directions and enabling downstream genetic analyses.

Figure 1 .
Figure 1.Overview of the quality control (QC) pipeline used to preprocess nuMoM2b subjects.

Table 1 .
Number of subjects in the nuMoM2b cohort used for the GWAS.Each row represents the number of subjects remaining after applying the step.Bolded numbers represent the final number of subjects used for each separate GWAS (GDM: gestational diabetes, GL: gestational length, PEC: preeclampsia, PL: pregnancy loss).