Introduction

Uncovering the mechanisms underlying ovarian function and follicle growth is essential for our understanding of human reproduction and female (in)fertility. Central to our understanding of these traits is spontaneous multiple birth: the conception and development of two or more independent zygotes in one pregnancy, which might indicate increased fertility. In addition, multiple birth is associated with increased risks for both the mother and her offspring, such as increased risk of preterm birth [1] and increased maternal morbidity [2]. By improving our knowledge of the genetic basis and physiological mechanisms underlying this trait, we can improve outcomes for mother and offspring and reveal novel possibilities for fertility treatment.

Among multiple births, twinning is the most common outcome, with the prevalence of twins varying over time and geographic location. In Western Europe, the twinning rate is about 15–18 per 1000 maternities, and this number can rise to as many as 40 per 1000 maternities in Africa [3]. The incidence of monozygotic (MZ) births is relatively stable across the world (3–4 per 1000 births) [4], although variation is seen, for example in isolated populations. The incidence of twin births has increased substantially since the 1970s for two reasons. First, the advent of assisted reproductive technologies (ARTs), such as in-vitro fertilisation (IVF), where multiple embryos are often transferred to increase the likelihood of pregnancy, has resulted in higher rates of multiple gestation [5]. IVF is also associated with increased MZ twinning, though the mechanisms are poorly understood. Second, the age at which women have children has increased, a factor that is associated with higher rates of twinning [6].

While the evidence that MZ twinning is influenced by genetic factors is complex, it has long been established that dizygotic (DZ) twinning is a heritable trait. Earlier studies on the inheritance of DZ twinning examined the twinning rate in families and found that relatives of mothers of twins report higher twinning rates compared with the general population [7, 8]. Although having DZ twins was at some point considered to be most consistent with a monogenic model [9], it is now established that DZ twinning is a complex polygenic trait. The search for twinning genes began with the first genome-wide association (GWA) study [10, 11]. GWA analyses were performed in 1980 mothers of spontaneous DZ twins and 12,953 controls, which led to the identification of two genetic variants increasing the chance of spontaneous DZ twinning. One of these variants (rs11031006) is near FSHB, a gene involved in the release of follicle-stimulating hormone (FSH) that controls ovarian folliculogenesis and ovulation. The other variant (rs17293443) is located in SMAD3, which regulates the response of the ovaries to FSH.

While the identification of the first genetic variants influencing spontaneous DZ twinning is an important development in the field of human reproduction, much remains unknown about the mechanisms and genetic pathways underlying this trait. Most importantly, previous GWA analyses showed a number of other loci that reach suggestive levels of significance and lie near likely multiple birth candidate genes (e.g. near/in INHB and SMAD4) [10]. Yet, to establish whether these suggestive loci are associated with multiple births, larger sample sizes are needed. UK Biobank [12] (UKB), a cohort-based health resource, contains multiple birth and genotype data for an extensive number of participants. With these data, genetic variants associated with being part of a multiple birth can be used as a proxy for the genetic variants for giving birth to multiple birth offspring. The aim of this study was to perform a GWA study (GWAS) in “spontaneous” (see Materials and methods) DZ twins to identify genetic variants influencing multiple births and to genetically correlate our findings across a broad range of fertility-related traits.

Materials and methods

Discovery cohort—UKB

For this study, we analysed data from UKB release 2. The UKB cohort contains data for 488,377 participants from across the United Kingdom aged between 40 and 69 years, collected between 2006 and 2010. The database was established to power investigations of the genetic and non-genetic determinants of human disease. Participants filled out questionnaires on many socioenvironmental, health, and lifestyle variables, and, additionally, provided blood, saliva, and urine samples. Detailed information on data collection and protocol is publicly available on the UKB website (http://www.ukbiobank.ac.uk/). Data access permission was granted under UKB application 25472 (PI Bartels).

Genetic data in UKB

An extensive description of the genetic data in UKB is provided in Bycroft et al. [13]. In brief, participants were genotyped on two similar genotyping arrays (95% overlap of markers): the Applied Biosystems UK BiLEVE Array by Affymetrix (807,411 markers) and the Applied Biosystems UK Biobank Axiom Array (825,927 markers). The genotype data were subjected to a standardised quality control (QC) pipeline that was designed to address challenges specific to this dataset, described in the paper by Winkler et al. [14]. For marker QC, the thresholds were set such that only strongly deviating markers fail the tests, allowing researchers to apply their own QC procedures on the remaining markers. In sample QC, only duplicates and laboratory mishandled samples were removed. Other dubious samples were kept in the dataset, but information on these samples was made available to researchers. Single nucleotide polymorphisms (SNPs) were imputed from both the 1000 Genomes Phase 3 and the haplotype reference consortium (HRC) reference panels; when SNPs were present in both panels, HRC imputation was the preferred option.

Subject selection

To avoid bias due to population stratification, we limited our GWA analysis to Caucasian participants (total N = 8962 cases). We divided the sample into three separate groups where, within each group, none of the participants had a closer genetic relationship than fourth cousin with one another (see Bycroft et al. [13], for more details on the kinship coefficients). We first selected all participants who self-reported as “White British” and had similar genetic ancestry based on principal component analysis (PCA) (see Bycroft et al. [13], for more details on the PCA) and took the maximal set of unrelated participants (N = 7036 cases, N = 325,773 controls). Next, from the remaining set of Caucasian UK participants, we selected a second maximal group of unrelated subjects (N = 1364 cases, N = 56,507 controls). Finally, we selected all Caucasian participants who did not self-report as “White British” and did not show genetic similarity to UK participants (N = 562 cases, N = 27,311 controls).

Multiple birth in UKB

To assess whether participants were part of a multiple birth, they were presented with the question: “Are you a twin, triplet or other multiple birth?” (UKB questionnaire field ID 1777). The options were: “Yes”, “No”, “Do not know”, and “Prefer not to answer”. In our analyses, we focused on 8962 participants with Caucasian ancestry that reported being part of a multiple birth, and 409,591 controls.

Identity-by-state (IBS) information (UKB questionnaire field ID 22013) is available for a subgroup of UKB participants that are identified as being genetically related (UKB questionnaire field ID 22011), and can be used to assess the genetic relationship. A kinship coefficient, reflecting the probability that two alleles sampled at random from two individuals are identical-by-descent, is also available for each pair of participants (UKB questionnaire field ID 22012). We identified and removed MZ twins from the same pair by plotting their kinship coefficient against the proportion of IBS-0 (proportion of no IBS sharing). If only one individual of a pair participated in UKB, it was not possible to identify that individual as an MZ or DZ twin based on kinship and IBS information. Moreover, to remove twins potentially conceived as a result of clomifene, IVF or other ART, we removed twins born after 1967, the year clomifene was introduced in the UK. In total, we excluded 358 cases based on zygosity and 174 cases based on year of birth.
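The pair-level exclusion of MZ twins can be illustrated with a minimal sketch. It is not the procedure used in the study, where MZ pairs were identified from a plot of kinship against IBS-0. The sketch instead assumes a fixed kinship cut-off of 0.354 (the conventional KING threshold for duplicates/MZ twins) and a hypothetical IBS-0 ceiling; the class, function names and example pairs are invented for illustration.

```python
from dataclasses import dataclass

# Assumed thresholds (not taken from the study): the conventional KING
# cut-off for duplicates/MZ twins and an illustrative IBS-0 ceiling.
MZ_KINSHIP_MIN = 0.354
MZ_IBS0_MAX = 0.001


@dataclass(frozen=True)
class RelatedPair:
    id1: str
    id2: str
    kinship: float  # kinship coefficient of the pair
    ibs0: float     # proportion of markers with zero alleles shared IBS


def is_mz_pair(pair):
    """Return True when a pair looks like MZ twins (or duplicates)."""
    return pair.kinship > MZ_KINSHIP_MIN and pair.ibs0 <= MZ_IBS0_MAX


def mz_individuals(pairs):
    """Collect the IDs of all individuals belonging to an MZ-like pair."""
    ids = set()
    for pair in pairs:
        if is_mz_pair(pair):
            ids.update((pair.id1, pair.id2))
    return ids


# Hypothetical example pairs.
example_pairs = [
    RelatedPair("A1", "A2", kinship=0.499, ibs0=0.0000),  # MZ-like
    RelatedPair("B1", "B2", kinship=0.251, ibs0=0.0021),  # sibling/DZ-like
    RelatedPair("C1", "C2", kinship=0.248, ibs0=0.0000),  # parent-offspring-like
]
flagged = mz_individuals(example_pairs)
```

As in the study, such a rule can only flag twins whose co-twin also participated; a single MZ twin without a genotyped co-twin remains undetected.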

Statistical analyses

Genome-wide association analyses

We performed GWA analyses in PLINK [15] using logistic regression under an additive genetic model with adjustment for age, sex, genotyping chip, and 40 principal components reflecting genetic ancestry, supplied by UKB. The results from these analyses were followed up by post-GWA QC procedures, where we excluded structural variants, indels, monomorphic SNPs, SNPs with minor allele frequency (MAF) < 0.005, and SNPs with missing or invalid data. Next, we aligned all SNPs to a reference file (http://www.uni-regensburg.de/medizin/epidemiologie-praeventivmedizin/genetische-epidemiologie/software/) and removed SNPs with allele mismatches and SNPs where the absolute difference between the reported effect allele frequency (EAF) and reference EAF was larger than 0.2. Finally, to meta-analyse the summary statistics from the three groups, we performed an N-weighted GWA meta-analysis (GWAMA), correcting for relatedness between the two UK samples based on the linkage disequilibrium (LD) score cross-trait intercept [16] (N = 8962 cases, N = 409,591 controls, NSNPs = 8,532,721).
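The idea behind an N-weighted meta-analysis can be sketched for a single SNP as follows. This is a simplified illustration under stated assumptions, not the GWAMA implementation used here: it treats the three groups as independent (it omits the correction for relatedness based on the LD score cross-trait intercept), and it weights by the effective case-control sample size, which is our assumption about the weights. The per-group z-scores are hypothetical; only the group sizes come from the text.

```python
import math


def effective_n(n_cases, n_controls):
    """Effective sample size of a case-control group: 4 / (1/cases + 1/controls)."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


def n_weighted_z(z_scores, sample_sizes):
    """Combine per-group z-scores with weights proportional to sqrt(N).

    Assumes the groups are independent; no overlap correction is applied.
    """
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    return numerator / denominator


def two_sided_p(z):
    """Two-sided p-value of a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Group sizes (cases, controls) as reported for the three subsamples.
groups = [(7036, 325773), (1364, 56507), (562, 27311)]
n_eff = [effective_n(cases, controls) for cases, controls in groups]

# Hypothetical per-group z-scores for one SNP.
z_per_group = [4.8, 2.1, 1.5]
z_meta = n_weighted_z(z_per_group, n_eff)
p_meta = two_sided_p(z_meta)
```

Weighting by the square root of the sample size lets the largest group dominate the combined statistic while smaller groups still contribute in proportion to the information they carry.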

For functional annotation of our GWAS results, we used FUMA [17] (FUnctional Mapping and Annotation) to define genomic risk loci. SNPs with p < 5 × 10−8 were considered genome-wide significant. Genome-wide significant SNPs that were independent from each other at r2 < 0.6 were considered independent significant SNPs, and those independent at r2 < 0.1 were considered lead SNPs. Independent significant SNPs that were closer than 250 kb to each other were merged into one genomic risk locus. SNPs in LD with these independent significant SNPs were considered candidate SNPs, and these determined the borders of the genomic risk loci. We used LocusZoom to plot the genomic risk loci [18].
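The merging step can be illustrated with a short sketch. It is a simplification and not FUMA's code: distances are measured between SNP positions rather than between LD blocks, chromosomes are assumed to be coded as integers, and all positions except that of rs428022 are hypothetical.

```python
MERGE_DISTANCE_BP = 250_000  # merge window from the text: closer than 250 kb


def merge_into_loci(snps, max_gap=MERGE_DISTANCE_BP):
    """Merge independent significant SNPs into genomic risk loci.

    snps: iterable of (chromosome, position) tuples.
    Returns a list of (chromosome, start, end, n_snps) tuples.
    """
    loci = []
    for chrom, pos in sorted(snps):
        same_chrom = bool(loci) and loci[-1][0] == chrom
        if same_chrom and pos - loci[-1][2] < max_gap:
            # Extend the current locus to include this SNP.
            c, start, _, n = loci[-1]
            loci[-1] = (c, start, pos, n + 1)
        else:
            # Start a new locus.
            loci.append((chrom, pos, pos, 1))
    return loci


# The first position is rs428022 (hg19); the others are hypothetical.
example_snps = [
    (15, 68_249_135),
    (15, 68_400_000),
    (15, 69_000_000),
    (11, 30_000_000),
]
example_loci = merge_into_loci(example_snps)
```

In this example the first two chromosome 15 SNPs lie about 151 kb apart and form one locus, whereas the third is 600 kb further away and starts a new one.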

Gene-based test

In addition to single SNP analyses, we performed a gene-based GWA analysis (GWGAS), which combines SNP p-values within a gene into a gene test-statistic to increase power when the effects of individual markers are too weak to detect. We used the MAGMA (Multi-marker Analysis of GenoMic Annotation) function implemented in FUMA to perform a gene-based test [19]. We used the SNP-based p-values as input and annotated them to 18,187 known protein-coding genes. The Bonferroni-corrected significance threshold was defined at α = 0.05/18,187 = 2.75 × 10−6.
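The threshold arithmetic can be checked with a few lines of Python. The helper name is ours; the numbers are those reported in the text.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


# Gene-based test: 18,187 protein-coding genes at alpha = 0.05.
gene_threshold = bonferroni_threshold(0.05, 18_187)
print(f"{gene_threshold:.2e}")  # prints 2.75e-06

# The gene-based p-value reported for FSHB/ARL14EP passes this threshold.
fshb_is_significant = 1.17e-7 < gene_threshold
```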

Gene mapping

To map the associated variants to genes, we made use of the three mapping strategies in FUMA: (1) positional mapping, where we mapped SNPs to genes that are a maximum of 10 kb distance from the genomic locus, (2) expression quantitative trait locus (eQTL) mapping, where we mapped SNPs to genes whose RNA expression level they influence, and (3) chromatin interaction mapping, where we mapped SNPs to genes when there is a three-dimensional (3D) DNA–DNA interaction between a SNP region and another gene region. These interactions result from the packaging of the genome in the 3D nucleus, which brings genomic regions on the same, or even on different, chromosomes into contact. If the SNP region interacts with a region that contains multiple genes, the SNP was mapped to all genes.
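Of the three strategies, positional mapping is the simplest to illustrate. The sketch below maps a locus to every gene within 10 kb of its borders; it is not FUMA's implementation, and the gene names and coordinates are hypothetical placeholders rather than real annotations.

```python
from typing import NamedTuple

WINDOW_BP = 10_000  # maximum distance from the text: 10 kb


class Gene(NamedTuple):
    symbol: str
    chrom: int
    start: int
    end: int


def genes_near_locus(locus_chrom, locus_start, locus_end, genes, window=WINDOW_BP):
    """Return symbols of genes overlapping the locus or within `window` bp of it."""
    hits = []
    for gene in genes:
        if gene.chrom != locus_chrom:
            continue
        # The distance is zero when the gene overlaps the locus.
        distance = max(gene.start - locus_end, locus_start - gene.end, 0)
        if distance <= window:
            hits.append(gene.symbol)
    return hits


# Hypothetical genes around a hypothetical locus on chromosome 15.
example_genes = [
    Gene("GENE_A", 15, 68_100_000, 68_195_000),  # ends 5 kb before the locus
    Gene("GENE_B", 15, 68_250_000, 68_260_000),  # inside the locus
    Gene("GENE_C", 15, 68_320_000, 68_400_000),  # starts 20 kb after the locus
    Gene("GENE_D", 11, 68_250_000, 68_260_000),  # different chromosome
]
mapped = genes_near_locus(15, 68_200_000, 68_300_000, example_genes)
```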

Genetic correlations

To quantify the shared genetic contribution between multiple birth and several other traits, we performed exploratory genetic correlation analyses in LD Hub [20]. We included publicly available data from 687 traits, based on multiple published GWASs and the available traits in UKB. LD Hub calculates genetic correlations between user-defined summary statistics of a trait of interest and predefined categories of other traits using LD score regression [21]. This method distinguishes inflation of test statistics caused by confounding bias from inflation caused by a true polygenic signal by examining the relationship between linkage disequilibrium and test statistics. For the health-related and anthropometric traits (N = 70), the two categories most likely related to fertility and reproduction, we calculated the false discovery rate (FDR)-adjusted p-values as a means of assessing significance (threshold = 0.05). The genetic correlations were visualised using the ggplot2 [22] package in R [23].
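FDR adjustment of a set of p-values can be sketched as follows. The text does not state which FDR procedure was applied; the sketch assumes the widely used Benjamini-Hochberg step-up procedure, and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg FDR-adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genetic correlations.
example_p = [0.01, 0.04, 0.03, 0.005]
example_fdr = benjamini_hochberg(example_p)
significant = [q <= 0.05 for q in example_fdr]
```

Each adjusted value is the smallest FDR level at which the corresponding test would be declared significant, so comparing it with 0.05 reproduces the thresholding rule described above.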

Results

Genome-wide association analyses

We carried out a GWAS for being part of a multiple birth in the UKB discovery samples including a total of 8962 cases and 409,591 controls (see Fig. 1 and Supplementary Figure 1). We identified one region on chromosome 15 containing a genome-wide significant SNP, rs428022 (hg19 chr15:g.68249135A>G, p < 5 × 10−8). The region contains 33 candidate SNPs, with one independent significant lead SNP (see Table 1 and Supplementary Table 1). The strongest signal, rs428022 (p = 2.84 × 10−8, odds ratio (OR) = 1.04), is an intergenic SNP flanked by PIAS1 [OMIM: 603566] and SKOR1 [OMIM: 611273] (see Fig. 2). The gene-based test identified another genome-wide significant gene, FSHB/ARL14EP (p = 1.17 × 10−7) [OMIM: 612295] (see Supplementary Figures 2 and 3).

Fig. 1 Manhattan plot of the genome-wide association study (GWAS) of multiple birth in 8962 cases and 409,591 controls

Table 1 Genome-wide significant SNPs in the GWAS of multiple birth (n = 8962 cases versus 409,591 controls)

Fig. 2 Regional association plot for the top SNP rs428022

In a GWAS of spontaneous DZ twinning, we previously identified FSHB (p = 1.54 × 10−9) and SMAD3 (p = 1.57 × 10−8) as maternal susceptibility loci for DZ twinning [10]. Both loci were replicated in this study at a Bonferroni-corrected significance threshold (p < 0.05/2) and with effects in the same direction (see Supplementary Table 2).

Gene mapping

The positional, eQTL, and chromatin interaction gene-mapping results for the new signal at 15q23 (rs428022) can be found in Supplementary Tables 3, 4 and 5, respectively. Using positional mapping, we identified nine genes that are a maximum of 10 kb up- or downstream of the genomic locus. Four genes were found through eQTL mapping. Finally, we found the same nine genes in chromatin interaction mapping as in positional mapping. Two genes were identified by all three mapping methods: PIAS1 and CALML4.

Genetic correlation analyses

We calculated genetic correlations between being part of a multiple birth and all available traits in LD Hub. Supplementary Table 6 shows the complete results for all traits. Table 2 shows the 70 associations with an FDR equal to or lower than 0.05. Of these associations, 32 can be classified as anthropometric traits (Fig. 3). We found positive genetic associations with measures of body mass, both with whole-body mass measures such as body mass index (BMI) (rg = 0.20, FDR = 0.017), and with measures for specific body parts such as leg fat mass (left: rg = 0.20, FDR = 0.017; right: rg = 0.20, FDR = 0.017). We identified negative associations between multiple birth and impedance measures, again both whole-body (impedance of whole body rg = −0.22, FDR = 0.017) and body-part specific (e.g., impedance of left arm rg = −0.22, FDR = 0.017). We also found genetic associations with 38 other health-related traits (Fig. 4). These include a variety of traits, including cardiovascular measures (e.g., acute myocardial infarction; rg = 0.24, p = 0.007), fertility-related measures (e.g., age at menarche; rg = −0.22, FDR = 0.017), and glucose-related traits (e.g., diabetes rg = 0.27, FDR = 0.017).

Table 2 Genetic correlations with an FDR threshold < 0.05

Fig. 3 Genetic correlations (rg + 95% confidence interval (CI)) with anthropometric traits

Fig. 4 Genetic correlations (rg + 95% confidence interval (CI)) with other health-related traits

Discussion

In this study, we replicated the association between FSHB, SMAD3, and twinning, which has been established in previous GWA analyses in mothers with DZ twin offspring. In addition, we report a novel genetic variant associated with multiple births, rs428022 at 15q23.

It is important to note that the significant association observed for the FSHB gene reconfirms the important role it plays in both male and female fertility. Recently, Rull and colleagues [24] proposed that FSHB -211 G>T variant (association p-value in this study = 2.02 × 10−5) represents a key genetic modulator of circulating gonadotropin, leading to various possible downstream effects on reproductive physiology. The novel genome-wide hit on chromosome 15 identified in this study, rs428022, is an intergenic SNP flanked by PIAS1 and SKOR1, and was additionally mapped to CALML4 in all three mapping strategies. PIAS1 (protein inhibitor of activated STAT 1) acts as a regulator of the androgen receptor (AR), dysregulation of which might lead to prostate cancer [25, 26]. In line with this role, it has been shown that PIAS1 is upregulated upon androgenic stimulation [26] and in prostate cancer tumours [27]. The AR plays an important role in male fertility as testosterone exerts its action through this receptor, and variants in the AR gene may cause male infertility [28, 29]. In addition, it has been shown that protein inhibitors of activated signal transducers and activators of transcription (PIAS) proteins interact with the transforming growth factor (TGF)-beta pathway and regulate SMAD-mediated transcriptional activity [30, 31].

The other gene close to rs428022, SKOR1, also known as Fussel-15 (functional SMAD suppressing element on chromosome 15), interacts with Smad1, Smad2, and Smad3 molecules and has been identified as a molecular regulator of bone morphogenetic protein (BMP) signalling [32]. The BMP family of proteins regulates many aspects of reproductive system development and biology. Animal studies showed that variants in two BMP genes (GDF9 and BMP15) in sheep were associated with increased ovulation rate [33], and this was recapitulated in the marmoset [34], which has a high rate of twinning. Reduced activity of the BMP signalling system in the ovary leads to decreases in granulosa cell mitosis and in its inhibiting action on FSH sensitivity. This in turn leads to selection of more follicles, increased ovulation rate, and multiple births.

One other gene was implicated in all three mapping strategies: CALML4 (calmodulin-like 4), a protein-coding gene that encodes calmodulin-like protein 4. However, very little information is present in the literature concerning the role of this gene. Although the role of our identified SNP in relation to PIAS1 and SKOR1 is in need of further investigation, a better understanding of the role of these genes and their interaction with FSHB and SMAD3 in twinning and fertility could lead to new insights in basic and clinical reproductive physiology research.

We also identified possible genetic correlations between being part of a multiple birth and other phenotypes. The genetic correlations between multiple birth and several anthropometric traits such as BMI, weight, and hip circumference are in line with previous epidemiological studies that linked these traits to a higher relative risk of having twins [3]. We replicated findings from previous investigations into the genetics of twinning with several fertility-related traits, such as a negative genetic association with age at menarche [10]. The negative association with age at death relates to theories on lifespan and fertility stating that greater longevity comes at the cost of lower reproductive success and vice versa [35], for which we now also found genetic evidence. The genetic associations we found with cardiovascular traits are interesting in light of a recent paper by Byars and colleagues, who examined whether coronary artery disease (CAD)-linked selection signals are linked to traits influencing reproduction [36, 37]. They found that CAD loci are enriched for effects on female lifetime reproductive success. The positive association we found also points to such antagonistic pleiotropic effects, as a higher genetic risk for twinning is associated with a higher genetic risk for myocardial infarction. Yet, it should be kept in mind that these analyses were exploratory and that these correlations do not survive a Bonferroni-corrected threshold for multiple testing.

While this study provides important new insights into the genetic aetiology of multiple births, further work is required to establish the functional/biological pathway through which the identified genetic variant influences multiple births. At the moment, PIAS1 and SKOR1 seem likely candidate genes through which the SNP influences multiple births. The findings may be somewhat limited because of the constraints of the UKB database. Although we did what we could to exclude MZ twins (based upon identical by descent (IBD) = 2 for complete pairs), we could not exclude all single MZ twins. In addition, Yengo et al. [38] recently pointed out the limitation of potential intercept inflation of bivariate LD score regression in large samples. Therefore, owing to the large sample size, sample overlap might have been overestimated in the current study.

To summarise, in this study we replicated previous findings for multiple birth and identified new potential genes influencing multiple birth (PIAS1/SKOR1). In addition, we examined the genetic overlap between being part of a multiple birth and several other traits and identified many possible genetic associations with diverse health and anthropometric traits. While this study provides new insights into the genetic aetiology of multiple birth, further work is required to establish the functional pathway through which the identified genetic variant influences multiple birth.