Landscape of pathogenic mutations in premature ovarian insufficiency

Premature ovarian insufficiency (POI) is a major cause of female infertility due to early loss of ovarian function. POI is a heterogeneous condition, and its molecular etiology is unclear. To identify genetic variants associated with POI, here we performed whole-exome sequencing in a cohort of 1,030 patients with POI. We detected 195 pathogenic/likely pathogenic variants in 59 known POI-causative genes, accounting for 193 (18.7%) cases. Association analyses comparing the POI cohort with a control cohort of 5,000 individuals without POI identified 20 further POI-associated genes with a significantly higher burden of loss-of-function variants. Functional annotations of these novel 20 genes indicated their involvement in ovarian development and function, including gonadogenesis (LGR4 and PRDM1), meiosis (CPEB1, KASH5, MCMDC2, MEIOSIN, NUP43, RFWD3, SHOC1, SLX4 and STRA8) and folliculogenesis and ovulation (ALOX12, BMP6, H1-8, HMMR, HSD17B1, MST1R, PPM1B, ZAR1 and ZP3). Cumulatively, pathogenic and likely pathogenic variants in known POI-causative and novel POI-associated genes contributed to 242 (23.5%) cases. Further genotype–phenotype correlation analyses indicated that genetic contribution was higher in cases with primary amenorrhea compared to that in cases with secondary amenorrhea. This study expands understanding of the genetic landscape underlying POI and presents insights that have the potential to improve the utility of diagnostic genetic screenings.

In total, DNA extraction and WES was performed for 1,030 cases. Variant calling and annotation were conducted as described in Methods. Multiple sequence quality parameters were used to remove artifacts, and common variants (minor allele frequency (MAF) > 0.01 either in public controls from the gnomAD database 9 or in-house controls from the HuaBiao project 10 ) were filtered out (Methods).

Identification of pathogenic variants in known POI-causative genes
We first quantified the contribution yield (defined as the percentage of cases) to POI attributable to pathogenic variants in 95 well-characterized POI-causative genes (Supplementary Table 2). Variant pathogenicity in these known causative genes was evaluated by manual review following guidelines of the American College of Medical Genetics and Genomics (ACMG) 11 or by ClinVar (Methods and Supplementary Table 3). Pathogenic (P) and likely pathogenic (LP) variants were prioritized for contribution analysis (Extended Data Fig. 1). Because variants of uncertain significance (VUSs) were likely to be upgraded to P/LP by introducing PS3 evidence via functional studies, we experimentally validated 75 VUSs from seven common POI-causal genes involved in homologous recombination (HR) repair (BLM, HFM1, MCM8, MCM9, MSH4 and RECQL4) and folliculogenesis (NR5A1). Fifty-five variants were confirmed to be deleterious (Extended Data Fig. 2), among which 38 were upgraded to LP from VUS. The two P/LP heterozygous mutations occurring in the same gene in the same individual were confirmed to be in trans via T-clone or 10x Genomics approaches (Extended Data Figs. 3 and 4). The combined data, including the distribution of 4,730 variants detected in known genes, are shown in Extended Data Fig. 5.
Ultimately, 195 P/LP variants were identified across 59 known genes (Fig. 2), including 108 (55.4%) loss-of-function (LoF) variants, 81 (41.5%) missense, four (2.1%) inframe deletions or insertions and two (1.0%) splice regions. Specifically, LoF variants consisted of 38 frameshift deletions or insertions, 44 nonsense, 23 canonical splice site and three start-loss. Most P/LP variants (119, 61.0%), spanning 45 genes, were previously undocumented (Fig. 2a), including 76 LoF variants and 38 missense or inframe variants functionally verified in the present study (Extended Data Fig. 2). Among the 195 variants, 184 (94.4%) had PHRED-scaled CADD scores 12 of greater than 20; nine (4.6%) with scores between 10 and 20; and just two with scores lower than 10. CADD is reasonably accurate in predicting the pathogenicity of variations, with both >10 and >20 having a similar predictive value. Among those genes, EIF2B2 had the highest prevalence of pathogenic alleles in cases (16, 0.8%), due to the most recurrent variant p.Val85Glu (four heterozygotes, five homozygotes and one in trans with another LP variant p.Lys273Arg), which was described to cause SA in a Japanese patient due to compromised GDP/GTP exchange activity 13 .
Recent advances in high-throughput sequencing have greatly expanded understanding of the pathogenesis of POI, with approximately 90 genes now linked to either isolated or syndromic POI [5][6][7] . However, variants in these known genes account for only a small fraction of patients, indicating the high genetic heterogeneity of POI 8 . Furthermore, limited sample sizes and inadequate controls in previous studies have prevented establishment of statistically robust single-gene associations and identification of novel causative genes. In this study, we performed, to our knowledge, the largest-scale whole-exome sequencing (WES) study in patients with POI to date and conducted a case-control analysis to systematically explore the genetic landscape of POI.

Patient cohort
We recruited 1,790 unrelated patients with POI from the Reproductive Hospital Affiliated to Shandong University for initial evaluation. Diagnosis of POI was based on the European Society of Human Reproduction and Embryology (ESHRE) guidelines: (1) oligomenorrhea or amenorrhea for at least 4 months before 40 years of age and (2) an elevated follicle stimulating hormone (FSH) level >25 IU L −1 on two occasions >4 weeks apart. Patients with chromosomal abnormalities and other known non-genetic causes of POI (including autoimmune diseases, ovarian surgery, chemotherapy and radiotherapy) were excluded. The final cohort included 1,030 unrelated patients with POI, consisting of 120 cases with primary amenorrhea (PA) and 910 cases with secondary amenorrhea (SA) (Fig. 1)  To validate potential associations between genotype and phenotype, we determined the contributions of each causative gene to PA and SA. The results showed that FSHR was most prominently involved in PA (4.2% in PA versus 0.2% in SA), whereas putative pathogenic variants in AIRE, BLM and SPIDR were observed only in patients with SA in our cohort (none in PA versus 0.7% in SA) (Fig. 2g). Among them, SPIDR was previously reported in only a consanguineous family with PA 14 . The other 11 genes were not linked to a specific type of amenorrhea, including mutations in three genes (HFM1, MSH4 and POLG) previously reported in SA and eight genes described in both PA and SA (Supplementary Table 4). These findings extended the phenotypic spectrum of known POI-causative genes.

Association studies identified 20 novel POI candidate genes
To further investigate enriched genes and potential genetic defects associated with POI, we performed case-control association analyses at the variant and gene levels against in-house controls. The in-house control cohort was obtained from the HuaBiao project 10 , including 5,000 unrelated individuals, generated using the same exome capture kit as the cases and which had similar sequencing statistics (Methods and Supplementary Table 5).
To this end, we first screened a manually curated list of 703 genes (including known POI-causative genes) implicated in ovarian function (Methods and Supplementary Table 6). Most of these genes were involved in different stages of follicle initiation and development, including gonadogenesis, oogenesis, folliculogenesis, oocyte maturation and ovulation. To minimize bias, we removed genes with a mean coverage of informative reads in the coding region <30× in either the cases or the controls, and a final set of 646 genes was included in further association analyses (Supplementary Table 7). Furthermore, the burden tests of synonymous variants showed that there was no significant inflation of background rate between the case and control cohorts (Extended Data Fig. 6).
In the coding model (exonic and splice region variants), all qualifying variants that met specific criteria (Methods) were subjected to association analyses using one-sided Fisher's exact tests, which identified 41,046 variants across 639 genes in the control cohort and 11,981 variants across 628 genes in the case cohort. As a result, EIF2B2 p.Val85Glu was the only variant that stood out in variant-level association tests (Extended Data Fig. 7).
A gene-based collapsing approach was then used to identify novel candidate genes. In brief, we identified rare (MAF < 0.001) likely LoF variants, or likely damaging missense (D-mis) variants, and we then aggregated the qualifying alleles into genes and tested for differences between cases and controls using one-sided Fisher's exact test. The LoF model included 1,439 variants across 433 genes identified in the total cohort (cases and controls). After multiple test correction using the Benjamini-Hochberg method, a false discovery rate (FDR) of 0.3 was set as the threshold. Finally, 32 genes passed the threshold for significantly higher burden of LoF alleles in cases (Fig. 3a). Among these, 20 genes were not previously implicated in patients with POI. In particular, ZAR1 exhibited the greatest enrichment, followed closely by ZP3, both of which had an FDR of 1.1% (Fig. 3a).
Additionally, in the D-mis model, we used multiple algorithms to identify genes with significantly more detrimental missense variants in cases than in controls (Methods). Only three POI genes (NR5A1, FSHR and EIF2B2) were enriched for more variants predicted to be damaging by at least four criteria with FDR < 0.3 (Extended Data Fig. 8). However, all three genes are well known to cause POI. The absence of additional pathogenic genes in this analysis may be due to difficulties in evaluating the pathogenicity of missense variants.
Taken together, 20 novel candidate genes were identified by gene-based collapsing analysis in the LoF model. We further investigated their functions (Fig. 3b) and evaluated the pathogenicity of each variant according to ACMG guidelines. The distribution of LoF variants across different locations and the key functional domains involved are shown in Extended Data Fig. 9. The function of novel candidate genes and the LoF variants' potential deleterious effects are detailed in Supplementary Table 8.

ZAR1 and ZP3 had the strongest associations with POI
ZAR1 had the highest probability of association (P = 1.1 × 10 −4 ), largely driven by the presence of seven LoF alleles in cases but only two in controls (Fig. 3a,b). ZAR1 was one of the first maternal effect genes identified in mammals 15 . It is abundantly expressed in human oocytes in growing and pre-ovulatory follicles, where it performs multiple roles in folliculogenesis, oocyte maturation and embryonic development 16 . Zar1-null female mice have a normal number of follicles until meiotic maturation and zygotic genome activation are blocked 17 . In contrast, Zar1 −/− zebrafish exhibit early arrest of oogenesis resulting from aberrant de-repression of the zona pellucida (ZP) gene mRNA translation 16 . However, despite extensive research in various animal models, no deleterious variant of ZAR1 has been reported thus far in women with infertility. In the present study, six patients were identified to carry ZAR1 variants, including one with compound heterozygous and five with heterozygous. All of the LoF variants were predicted to disrupt the conserved C-terminal ZNF domain, through which ZAR1 interacts with ZP or other target mRNAs, thereby impeding its ability to regulate target gene translation.
ZP3 had the second most significant association with POI (P = 1.5 × 10 −4 ), with five LoF alleles in cases and none in controls. ZP3 exerts pleiotropic effects on ovarian development because it is a critical component of the zona pellucida starting from the primordial follicle stage 18 . Interestingly, only missense variants or in-frame deletions in ZP3 have been reported in patients with defects in oocyte maturation 19,20 . However, these LoF mutations, which were identified in the current POI study, tend to induce more severe protein defects, possibly due to loss of the conserved ZP domain and transmembrane domain.
Moreover, ZAR1 and ZP3 appearing as the strongest signals in enrichment analysis illustrates the crucial role that genes involved in follicle development and oocyte maturation play in POI. Among the 20 novel candidate genes, HMMR 21,22 , HSD17B1 (ref. 23 ) and BMP6 (refs. 24,25 ) have been implicated in follicle development through their regulation of granulosa cell division or steroidogenesis, whereas H1-8 (ref. 26 ), PPM1B 27,28 , ALOX12 (ref. 29 ) and MST1R 30 are involved in oocyte maturation or ovulation through various mechanisms, such as lipid metabolism and inflammatory response.

Enriched gonadal development and meiosis-related genes in POI
The establishment of ovarian reserve relies on the well-orchestrated development of female gonads and meiosis with HR repair proceeding correctly. PRDM1 encodes a crucial transcriptional regulator required for specification and migration of primordial germ cells (PGCs) 31,32 . Three heterozygous LoF variants were identified in PRDM1 (Fig. 4a), and functional experiments were performed to validate their pathogenicity. Western blotting revealed that p.Gly11Valfs*14 and p.Tyr622* resulted in truncated proteins, whereas p.Leu776Valfs*19 resulted in substantially reduced expression (Fig. 4b). Furthermore, in contrast with the uniform nuclear distribution observed in the wild-type (WT) GFP fusion protein, the p.Gly11Valfs*14 variant was expressed in both the Article https://doi.org/10.1038/s41591-022-02194-3 nucleus and cytoplasm, more similar to the pEGFP empty vector group, whereas p.Tyr622* was concentrated in large, distinct puncta, possibly attributable to protein self-aggregation resulting from abolished DNA binding (Fig. 4c). In addition, p.Tyr622* and p.Leu776Valfs*19 exhibited reduced PRDM1 protein stability after cycloheximide (CHX) treatment compared with the WT (Fig. 4d). By contrast with the PGC-associated candidate, the novel candidate gene LGR4 was shown to participate in gonadal development by regulating pre-granulosa cell specialization 33 .
In addition, meiotic genes were strikingly enriched (9/20, 45%) among these candidates. STRA8 is well known to be responsible for triggering meiotic entry and transcriptional activation of meiotic prophase-related genes 34,35 . One homozygous splice site variant c.258 + 1 G > A was found in the present POI cohort (Fig. 4e), whereas no biallelic LoF variant was identified in our in-house controls, and no homozygous LoF variant was found in any public population databases. Mini-gene assays verified that STRA8 c.258 + 1 G > A caused exon2  skipping, leading to a 66-amino acid in-frame deletion (p.Leu21_Lys-86del) in the highly conserved nuclear localization and DNA-binding region of STRA8 ( Fig. 4f-g) 36 . Further immunofluorescence staining revealed that this STRA8 mutant was restricted to the cytoplasm (Fig. 4h), which was consistent with observations in N-terminal deleted Stra8 Δ121/Δ121 mice 35 , indicating impairment of its transcriptional activation functions and abolished capacity to initiate meiosis. MCMDC2, which promotes homologue alignment and crossover formation during meiosis prophase I, is preferentially expressed in the gonads 37 . One homozygous (p.Gln229*) and one heterozygous (p.Ala69Leufs*18) variant were identified, both of which eliminate the critical MCM and AAA-lid domains 37 (Fig. 4i). Further GFP-based HR repair efficiency assays (Methods) verified that variants exhibited an HR efficiency 20% that of WT (Fig. 4j-k), potentially impeding HR progression during oocyte meiosis.
The functional annotations of these 20 genes suggest their considerable relevance to POI, with all 20 having a significantly higher burden of LoF variants that could alter gene expression or biological function, as exemplified by experimental validation of PRDM1, STRA8 and MCMDC2 (Fig. 4). These collective data strongly suggest the likelihood that these 20 genes may be previously unrecognized POI-causative genes.

Stepwise increases in genetic contribution of POI
In the present study, we followed a pipeline through different lines of evidence to identify and validate pathogenic variants and increased the scope of understanding about the contribution of genetic defects in the pathogenesis of POI (Fig. 5a). Known causative genes accounted for 18.7% (193/1030) of cases, of which 86 cases were attributable to 76 variants previously described in ClinVar or published studies, and an additional 107 cases were explained by 119 variants that were reported as damaging in this work. Furthermore, the discovery of novel POI-associated genes introduced 59 P/LP variants, all of which were LoF variants, found in 61 cases. Among these patients carrying variants of novel POI-associated genes, 49 had no P/LP variants in known genes, yielding an additional contribution of 4.8%. Consequently, the rate of contribution to POI by genetic variations reached 23.5% (242/1030) in this study.
We generated an integrated matrix of pathogenic variants identified in known causative genes and novel POI-associated genes ( Fig. 5b and Supplementary Table 9). Among the different functional gene groups, no significant clusters were observed in modes of inheritance or mutation load. Genes involved in gonadogenesis tended to have high probability of LoF intolerance (pLI) scores, corresponding to the identification of these genes with LoF variants in the cases. Missense Z-score (Mis-Z) did not appear to be associated with any functional gene groups. Six genes had Mis-Z exceeding 1.96; however, no P/LP missense variants were observed in TP63 or CPEB1. Missense variants were the predominant mutation type in EIF2B2 and POLG although with relatively low Z-scores. Overall, both LoF and missense variants substantially contributed to POI, and pLI could serve as a rough guide for prioritizing new POI genes, whereas Mis-Z were relatively uninformative. It should be noted that the high heterogeneity of POI makes detailed, gene-specific analyses indispensable.

Gene sets associated with POI
Even if a single gene-level association does not reach statistical significance under the constraints of limited sample size, as is the case with most known POI genes, the cumulation of non-significant genes but having mild trends may still be informative when prioritizing candidate genes as relevant or increasing susceptibility to POI. A combination of gene signals in gene-set-level analysis can provide insight into the pathogenesis of POI or give clues toward the detection of novel genes. Therefore, we performed some preliminary analyses in 36 gene sets potentially relevant to ovarian function (Methods and Supplementary  Table 10), and the results are shown in Extended Data Fig. 10. Genes implicated in meiosis and DNA repair had significant set-level associations (P = 1.3 × 10 −6 and P = 4.8 × 10 −6 , respectively), revealing their likely non-trivial roles on POI [47][48][49] . Among the subgroups of DNA repair, nucleotide excision repair (P = 0.054) also emerged as potentially relevant pathway, in addition to the well-established HR repair 50 (P = 8.5 × 10 −7 ) and Fanconi anemia (FA) pathways 51 (P = 1.5 × 10 −3 ). This large-scale human genetic study also confirmed the roles of oxidoreductase activity 52 (P = 2.0 × 10 −5 ) and fatty acid metabolism (P = 5.8 × 10 −4 ) in POI. Moreover, the finding that LoF variants in Mendelian mitochondrial disorder-related genes (P = 2.1 × 10 −3 ) are also significantly enriched in POI provides a link between mitochondrial and ovarian dysfunction. This discovery indicates the possible benefits of monitoring ovarian function in patients with mitochondrial disease.

Discussion
Here we report, to our knowledge, the largest-scale WES study of POI conducted to date, and we provide a detailed characterization of its genetic landscape. To ensure the reliability of our data, we excluded patients with aberrant karyotypes or other acquired etiologies, adopted ACMG standards to classify variant pathogenicity and used uniformly processed data from a large control population to minimize false positives from possible subpopulations. Upon integrating the 59 known causative genes with 20 novel POI-associated genes identified in the present study, a total of 254 P/LP variants were ultimately identified in 23.5% of the patients with POI. Additionally, P/LP variants in top-ranked genes were detected in less than 1.2% of cases, highlighting its remarkably high genetic heterogeneity.
The substantial contribution of genetic variants to POI in this cohort should prompt reconsideration of routine mutation screening in diagnosed patients, which is not currently recommended in POI management guidelines due to the presumed rarity of monogenic causes 53 . The findings of this large cohort investigation indicate that at least 18~23% of patients have genetic abnormalities, thus supporting implementation of routine clinical WES in POI. Such screening could facilitate determination of patient etiologies and guide genetic counseling for the proband's female relatives. Genetic testing is particularly beneficial for patients with SA, because their development of POI could be a gradual process, spanning occult (normal FSH level with reduced fecundity), biochemical (elevated FSH level with regular menses) and overt (irregular menses) stages 1 . For women at high risk, alerted by genetic mutations, modest alteration on ovarian reserve indicates the need for timely fertility guidance, including family planning, fertility preservation or assisted reproduction technology.
The link between clinical manifestation of PA or SA and genotype has long presented a challenge to understanding the basis of POI. Previous genetic studies suggest that patients with PA are more likely to have biallelic defects in a single gene 54 , which is confirmed by phenotypic analyses in this report (Fig. 2f). In addition, previous studies have inferred that patients with SA are likely to have multiple defects across various genes coupled with environmental interactions. However, we found a higher frequency of multi-het variants in PA (2.5%) than in SA (1.2%), suggesting that oligogenic models 55 should be considered in the etiology of POI, regardless of the age at onset of amenorrhea. Comprehensively, our findings support the likelihood that the accumulation of multiple genetic defects may result in a more severe phenotype. Additionally, because FSHR mutations are notably    with PA may have great clinical importance for individualized therapeutic interventions.
The novel candidate genes identified in this study are involved in several processes that were previously unrecognized to play a role in human POI, such as PGC specification, meiosis initiation and maternal mRNA metabolism. Pathogenic variants of ZAR1, a maternal effect gene, were reported in human patients with POI and have a remarkably high prevalence. Identification of POI-associated variants in ZP3 broadens the phenotypic spectrum of this gene beyond its currently understood role in empty follicle syndrome 20,57 . The discovery of variants in PRDM1 demonstrates that the pathophysiology of POI may begin as early as PGC specification. Although STRA8 and MEIOSIN are well known for their roles in initiating meiosis in animal models, our findings highlight their critical roles in human ovarian disease. Additionally, mutations in SHOC1 and KASH5 have been reported in men with non-obstructive azoospermia 58 , and this report described their variants in women with POI, implying that these meiotic genes are shared genetic determinants in both female and male human gametogenesis.
Due to the limitations of WES, systematic analyses of non-coding regions, copy number variations and structural variations in POI were not conducted here. Beyond these technical limitations, our sample size in this cohort still lacks sufficient statistical power to detect much rarer associated genes due to the high genetic heterogeneity of POI. One indication of this heterogeneity is the lack of significance in a large proportion of known POI-causative genes. In addition, the stringent criteria in ACMG guidelines resulted in exclusion of a proportion of missense variants from classification as causative in the absence of experimental data. Despite the larger size of this study compared to previous efforts, the genetic contribution to POI reported here is likely to represent an underestimation.
In summary, this study provides a detailed characterization of pathogenic variants in POI, broadening the scope of known POI-associated genes, to depict the genetic landscape of this disease. Larger cohort size, parent-proband trio sequencing, advanced genomic technologies and international collaborative studies are critical for overcoming the limitations of this study to further determine the genetic etiologies of POI. In addition, mapping the genetic landscape of individuals with differences in ovarian function, such as decreased ovarian reserve, early menopause and POI, may aid in understanding the common genetic factors in reproductive aging.

Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41591-022-02194-3.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons. org/licenses/by/4.0/. Article https://doi.org/10.1038/s41591-022-02194-3

Participants
Patients. All procedures involving patients in this study were approved by the institutional review board of Reproductive Medicine of Shandong University (approval number 2014IRB52). A total of 1,030 female patients with POI from the Reproductive Hospital Affiliated to Shandong University were recruited in the present study. Written informed consent was obtained from each participant. The diagnosis criteria for POI were oligo/amenorrhea for at least 4 months and elevated serum FSH levels >25 IU L −1 on two occasions (>4 weeks apart) before 40 years of age 53 . Each patient underwent chromosomal analysis, pelvic ultrasound and a thorough examination of the patientʼs medical history. Individuals with etiologies such as chromosomal abnormalities, histories of ovarian surgery or chemotherapy, radiotherapy or autoimmunity disorders were excluded. Patients with POI were further categorized to PA and SA for phenotypic analysis. PA is defined as the absence of menarche before age 16 years, and SA refers to a spontaneous menstrual cycle at least once. The age at recruitment of patients in this study ranged from 16 years to 40 years. All patients self-reported as females, and ultrasound and karyotype confirmed their sex.

Controls.
For association tests, we applied WES data of 5,000 unrelated individuals (including 2,739 females and 2,261 males, age range 16−83 years) from the HuaBiao project 10 as a human population control in this study. This project was approved by the Human Ethics Committee of Fudan University. All participants provided written consent for the extraction and storage of their DNA samples and future usage of their DNA data for research. Their sex and age are self-reported.
All the participants voluntarily involved in the study, and no compensation was provided.

WES and variant calling
Exome sequencing was performed on genomic DNAs extracted from peripheral blood samples of all 1,030 patients with POI, captured with AIExome V1-CNV (iGeneTech) and sequenced on NovaSeq platforms (Illumina) with 150-bp paired-end reads. Sequence reads were aligned to the human reference genome GRCh37/hg19 using Burrows-Wheeler Aligner (BWA 0.7.17) MEM 59 . Removal of duplicate reads and variant calling of single-nucleotide variants and small indels were used (Genome Analysis Toolkit (GATK 4.1.8.1)) 60 . Annotation of variants was used (Ensembl Variant Effect Predictor (VEP 100)) 61 with the RefSeq database. Variants with a genotype call rate >95%, MAF > 1% and LD-pruned based on maximum r 2 = 0.2 (parameters: -indep-pairwise 50 5 0.2) were selected for identity-by-descent analyses using PLINK 1.9 (ref. 62 ). All participants were confirmed to be unrelated to each other, with the PI-HAT value of less than 0.185.

Interpretation of variants
The pathogenicity of variants in this study was manually determined according to ACMG guidelines 11 , and the details of criteria we used are listed in Supplementary Table 3. Those variants were classified as P/LP in the ClinVar database with criteria provided by multiple submitters, and no conflicts or those that were reviewed by an expert panel were also considered 63 . Only variants classified as P or LP were reported here, and all of them were confirmed by Sanger sequencing.

Gene list determination
Known causative genes. Human genes that were considered known POI causal genes were all previously identified deleterious variants in patients with syndromic or isolated POI, and their causative association was evaluated by functional verification in animal models or in vitro studies or by observing co-segregation with POI in large families or co-occurrence in multiple unrelated patients. We generated the list of known POI genes by searching the PubMed and OMIM databases for articles published up to December 2021, using terms related to genes (for example, 'gene', 'genetic', 'mutation' or 'variant') in conjunction with terms related to POI (for example, 'ovarian insufficiency', 'ovarian failure', 'ovarian dysgenesis', 'ovarian aging', 'ovarian dysfunction', 'gonadal failure', 'gonadal dysgenesis', 'reproductive dysfunction' or 'hypogonadism'). The roles of genes in POI etiology were then carefully evaluated for each unique search result. In total, a list of 95 known causative genes was compiled as well, and the associated phenotypes references for each known POI gene are listed in Supplementary Table 2.
Candidate gene list for collapsing analyses. We manually curated a list of 703 genes with established associations with ovarian function as follows: (1) genes whose mutations had been implicated in the development of isolated or syndromic reproductive diseases caused by abnormal ovarian function, such as POI and oocyte maturation defect; (2) genes whose disruptions in mouse models yielded impairment of ovarian function according to Mouse Genome Informatics (MGI; http:// www.informatics.jax.org/) and the International Mouse Phenotyping Consortium (https://www.mousephenotype.org/) database; and (3) genes with functional verification by in vitro studies or in other animal models (for example, zebrafish and flies). Gene lists compiled by refs. 6,64 were referenced.
Gene sets. Meiosis and meiotic prophase I gene sets were curated based on the functional association per MGI and literature review. The DNA repair gene set and its subsets were curated from the updated Human DNA Repair Genes 65 (https://www.mdanderson.org/documents/Labs/Wood-Laboratory/human-dna-repair-genes.html) and the REPAIRtoire dabatase 66 , whereas genes in the FA set containing 22 FA genes and nine FA-associated genes were compiled by ref. 51 and ref. 67 . The mitochondrial function gene set consisting of 255 nuclear genes reported to cause mitochondrial disease was curated from ref. 68 , ref. 69 and ref. 70 Table 10.

Statistical analysis
The case cohort and the control cohort used in this study were captured with the same exome enrichment kit, and the same standardized bioinformatics pipeline was applied. Cases and controls showed similar sequencing metrics (Supplementary Table 5). To minimize bias, only the genes with mean coverages of coding regions greater than 30× both in the two cohorts were included in our association analyses (Supplementary Table 7).
Qualifying coding variants were defined based on the following criteria: (1) exonic or splice region; (2) mean read depth (DP) > 10; (3) alternative allele read frequency ≥25%; (4) mean quality by depth (QD) < 20; (5) mean phred quality (QUAL) < 30; and (6) mean genotype Nature Medicine Article https://doi.org/10.1038/s41591-022-02194-3 phred quality (GQ) < 20. We used a MAF cutoff of 0.001 either in the global population or the East Asian population from the gnomAD database (http://gnomad.broadinstitute.org/). Synonymous variants, most of which are presumed benign, were usually used to determine whether there is a preferential inflation of background variation. To further confirm the lack of preferential inflation of background variation, we assessed the tallies of rare qualifying synonymous variants per individual and burden tests of each synonymous variant. As a result, both of them did not show a significant difference between the case and control cohorts (Extended Data Fig. 6).
For the gene-level collapsing analysis, we ran two models: the LoF and the D-mis model. The LoF model included only LoF variants (start-loss, canonical splice site, frameshift and nonsense) removing at least the last 2% of amino acids. For the D-mis model, multiple algorithms were used to predict deleteriousness of missense variants, and we set six criteria to define D-mis: (1) SIFT < 0.05, Polyphen2 > 0. 15 and MutationTaster predicted as deleterious; (2)  For gene set enrichment analysis, we first constructed a comparison set consisting of the 50 nearest neighbors in the genome of each gene within the gene set. The genes in the gene set and the corresponding matched genes were combined and were applied the same quality control criteria as above. Gene-level associations of LoF variants were calculated for each gene between cases and in-house controls. The gene-level P values were then ranked, and a one-sided Wilcoxon rank-sum test was performed in each gene set to assess whether the genes in the gene set ranked significantly higher than the comparison genes.

Phasing of two heterozygous variants
Multiple approaches were applied to confirm the phase of two variants detected in one gene of a single individual in circumstances where parental DNA samples were unavailable. Variants located within the 150-bp region (POI-572: NR5A1) were phased using GATK Haplotype-Caller and reviewed manually using Integrative Genomics Viewer software 74 .
For variants ranging in size from 150 bp to ~10 kb pairs (POI-506: AARS2; POI-910: RECQL4; POI-991: AARS2; POI-1151: EIF2B2; and POI-1660: ZAR1), phasing was determined using TA cloning sequencing. A genomic DNA fragment covering the two heterozygous variants was amplified through LA Taq DNA polymerase (Takara). PCR products were purified with a gel extraction kit (BioTeke) and cloned into T-Vector pMD20 (Takara). The construct products were transformed into Escherichia coli competent cells. At least four bacterial colonies were collected and cultured overnight in LB medium containing 100 μg ml −1 of ampicillin (Solarbio). Plasmid DNA was isolated and verified by Sanger sequencing. Phasing of the two variants was determined by analyzing whether they occurred in the same or distinct clones. The primers, antibodies and other commercial reagents used in the present study are listed in Supplementary Tables 11 and 12. For variants with a distance over ~10 kb pairs (POI-169: NR5A1; POI-516: HFM1; POI-841: MCM9; POI-1228: MCM9; and POI-1453: MSH4), phasing was determined using 10x Genomics as described previously 75 . High-molecular-weight genomic DNA (>50-kb pairs) was extracted using the Magnetic Blood Genomic DNA Kit (TIANGEN) from the peripheral blood. The sample-indexed sequencing libraries were prepared via the GemCode platform (10× Genomics) and sequenced on NovaSeq platforms (Illumina) with 150-bp paired-end reads according to the manufacturerʼs protocol. Average coverage of each sample was around 30×, and the sequence data were 128 Gb.

Plasmids and mutagenesis
The full-length cDNA of PRDM1 was purchased and cloned into pEGFP-C1. The mutant PRDM1 overexpression plasmids were generated by overlap extension PCR. The methods to construct plasmids used in the minigene splicing assay of STRA8 are described in detail below. To validate the function of exon2 deletion of STRA8, the full-length cDNA of STRA8 was PCR amplified from human transcriptome cDNA and cloned into p3 × FLAG-CMV7.1 as the WT plasmids, and the exon2 deletion mutant STRA8 overexpression plasmid was generated using overlap extension PCR.
The WT overexpression plasmids of BLM, HFM1, MCMDC2, MCM8, MCM9, MSH4 and RECQL4 cloned in pcDNA3.1-3 × FLAG-C were purchased from YOUBAO Biology. The WT overexpression plasmid NR5A1 in pENTER was purchased from Vigene Biosciences. All the mutant overexpression plasmids were generated through QuickChange Lightning Site-Directed Mutagenesis Kit (Agilent Technologies) according to the manufacturerʼs protocol.
Cell culture and plasmid transfection HEK293 (Procell), 293T (China Center for Type Culture Collection), HeLa (Procell) and CHO (Procell) cell lines used for in vitro experiments in this study were all derived from females. Cells were cultured at 37 °C and 5% CO 2 and grown in DMEM (Gibco) or Ham's F-12K medium (Gibco), supplemented with 10% FBS (Gibco) and 1% penicillin-streptomycin (Gibco). When cells reached the appropriate confluence, they were transfected with plasmids using Lipofectamine 3000 (Invitrogen) according to the manufacturerʼs protocol in the absence of antibiotics, and, 6 hours later, the media were replaced with fresh complete DMEM or F12-K culture media containing FBS and penicillin-streptomycin.

Protein blotting and CHX chase assay
HEK293 cells were transfected with wild or mutant pEGFP-C1-PRDM1 overexpression plasmids. pEGFP-C1 vector was also transfected as the negative control group. At 48 hours after transfection, cells were harvested and lysed in RIPA lysis buffer (Beyotime) with 1% Protease Inhibitor Cocktail (Cell Signaling Technology). The total protein was quantitated using a BCA protein assay kit (Thermo Fisher Scientific) according to the manufacturerʼs instructions. Total protein (20 μg) of each sample was loaded, separated on an SDS-PAGE gel and transferred to a polyvinylidene fluoride membrane (MilliporeSigma). The membrane was blocked, incubated with GFP antibody (1:10,000 dilution, Abcam) and anti-rabbit secondary antibodies (1:5,000 dilution, Proteintech) and with β-actin antibody (1:5,000 dilution, Proteintech) and anti-mouse secondary antibody (1:5,000 dilution, Proteintech). The membranes were scanned using a Chemidoc MP Imaging System (Bio-Rad). Two independent experiments were carried out.
For the CHX chase assay, CHX (Beyotime) was added to the culture medium 24 hours after transfection at a concentration of 0.01 mM. Upon treating with CHX for 0 hours, 4 hours, 8 hours and 12 hours, cell samples were collected and stored at −80 °C until western blot analysis was performed. For each timepoint, three independent experiments were carried out.

Minigene splicing assay
A minigene splicing assay was conducted to examine the function of splicing site mutation c.258 + 1 G > A identified in STRA8. WT STRA8 fragment with restriction sites (XhoI and BamHI) encompassing 3′ terminal intron1 (220 bp), exon2 (198 bp) and 5′ terminal intron2 (382 bp) was obtained from human genomic DNA through nested PCR amplification. c.258 + 1 G > A located in the donor site of intron2 was introduced by overlap extension PCR. WT or mutant STRA8 fragment were digested and then ligated into pcMINI vector containing ExonA-IntronA-multiple cloning site-IntronB-ExonB (Bioeagle Biotech Company). The pcMINI-STRA8 constructs were transfected into HeLa and 293T cell lines, and cells were harvest after 48 hours. Total RNA https://doi.org/10.1038/s41591-022-02194-3 was extracted using TRIzol reagent (Invitrogen) and then reverse transcribed to cDNA. cDNA was PCR amplified using primers flanking the minigene. PCR products were separated by agarose gel electrophoresis. Each band was gel purified and then sequenced to determine the transcripts of WT and mutant constructs. It is worth noting that this experiment was repeated in two cell lines, HeLa and 293T, with each cell line being transfected once with WT and mutant constructs.

Immunofluorescence microscopy
To evaluate the effects of variants detected in PRDM1 and STRA8 on protein location or expression profiles, immunofluorescence microscopy was conducted according to standard techniques as described previously 76 . In brief, HeLa cells were cultivated on glass coverslips in 12-well plates and transfected with expression plasmids at the appropriate density. At 36 hours after transfection with WT and mutant pEGFP-N1-PRDM1 constructs, cells were fixed with 4% paraformaldehyde, mounted and stained for nuclei using antifade reagent containing DAPI (Beyotime). After transfection with WT and mutant p3 × FLAG-CMV7.1-STRA8 constructs, fixation, permeabilization and blocking of non-specific antibody binding (in 1× PBS containing 0.3% Triton X-100 and 10% BSA) were performed. Cells were then stained with FLAG antibody (1:300 dilution, Cell Signaling Technology) and goat anti-rabbit secondary antibody conjugated with Alexa Fluor 488 (1:800 dilution, Invitrogen) before mounting and staining for nuclei. Sealed coverslips were visualized under a confocal microscope (ANDOR Technology), and immunofluorescence pictures were captured by performing z-axis scan at 5-μm intervals. Three independent experiments were performed.

HR repair efficiency assay
To investigate the effects of variants detected in genes implicated in the HR repair pathway (BLM, HFM1, MCM8, MCM9, MCMDC2, MSH4 and RECQL4), a stable HEK293 cell line carrying a GFP-based I-SceI-cleavable reporter (provided by Fengli Wang from Huazhong University of Science and Technology and Hailong Wang from Capital Normal University) was adopted as previously described 77 . Lentiviral I-SecI expression plasmid was infected into cells to generate double-strand breaks (DSBs). Around 24 hours later, WT or mutant HR gene overexpression plasmids were transfected to cells. pcDNA3.1 vector was also transfected as the negative control group. After culturing for 48 hours, cells were harvest for flow cytometry analysis on an LSR-Fortessa Cell Analyzer (BD Biosciences) to quantitate the number of GFP + cells and total cells. If the DSBs were repaired by the means of HR, the GFP would be expressed; thus, HR repair efficiency was evaluated by the percentage of GFP + cells in total cells. Three independent experiments were conducted, with a minimum of 30,000 cells counted for each group.

Luciferase assays
Luciferase assays were used to determine the effect of variants identified in NR5A1 on transcriptional activity. Chinese hamster ovary (CHO) cells were seeded in 24-well plates. The cells were co-transfected with WT or mutant overexpression plasmids (pENTER-NR5A1) and pEZX-PG04.1-CYP19A1 reporter plasmids. Simultaneously, pENTER was transfected as a negative control group. The cell culture medium was collected 48 hours after transfection, and the luciferase activity was determined using the Secrete-Pair dual luminescence kit (Gene-Copoeia) according to the manufacturerʼs protocol. The luminescent activity of GLuc and SEAP were measured by the Enspire luminometer reader (PerkinElmer). Results were normalized to the activity of SEAP luciferase. Three individual experiments were carried out.

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
The raw sequencing data of 1,030 patients with POI reported in this study have been deposited in the Genome Sequence Archive (GSA) in the National Genomics Data Center, China National Center for Bioinformation/Beijing Institute of Genomics, Chinese Academy of Sciences, under accession number HRA003245 (project: PRJCA012479), which can be accessed at https://ngdc.cncb.ac.cn/gsa-human/. These data are available under restricted access, as individual genomic sequencing data are protected owing to patient privacy and Regulations on the Management of Human Genetics Resources of China. The raw data can be requested via the GSA-Human System and can be authorized for downloading by the Data Access Committee for research and non-commercial use only. Detailed guidance on data access requests can be found in the repository's document (https://ngdc.cncb.ac.cn/ gsa-human/document/GSA-Human_Request_Guide_for_Users_us.pdf). Accession requests are typically responded to within 2 weeks. The processed genotype dataset in VCF format (including the position, reference allele, mutated allele, allele frequencies and qualities of all variants) is open-accessed via the National Omics Data Encyclopedia and can be freely and publicly downloaded under accession number OEP003709. Variants of the control cohort used in this study were generated by the HuaBiao project and can be obtained from https:// www.biosino.org/wepd/.
Article https://doi.org/10.1038/s41591-022-02194-3 Extended Data Fig. 4 | Phasing results of two P/LP variants in ultra long distance via 10×Genomics. a, Haplotype browser of c.842 G > A and c.1198 G > A detected in NR5A1 in case POI-169. HAPLOTYPE 1 and HAPLOTYPE 2 are displayed as two separate tracks for multiple variations. The small icons represent SNPs (solid blue circles), small insertions (solid green triangles) and deletions (solid yellow rectangles). The GENES track displays annotated reference genes and the direction of each gene is indicated with arrows. Each vertical green bar in the COVERAGE track shows the average coverage-per-base for the area of the genome under the bar. The two variants were phased to different haplotype tracks and confirmed to be in trans. b, Haplotype browser of c.1792G > C and c.3784 G > A detected in HFM1 in case POI-516. The two variants were phased to different haplotype tracks and confirmed to be in trans. c, Haplotype browser of c.398 C > G and c.1306 A > G detected in MCM9 in case POI-841. The two variants were phased to the same haplotype track and confirmed to be in cis. d, Haplotype browser of c.322 G > T and c.1151-1 G > A detected in MCM9 in case POI-1228. The two variants were phased to different haplotype tracks and confirmed to be in trans. e, Haplotype browser of c.1094 T > A and c.1345 T > G detected in MSH4 in case POI-1453. The two variants were phased to different haplotype tracks and confirmed to be in trans.