Genome-wide analyses of non-syndromic cleft lip with palate identify 14 novel loci and genetic heterogeneity

Non-syndromic cleft lip with palate (NSCLP) is the most serious sub-phenotype of non-syndromic orofacial clefts (NSOFC), which are the most common craniofacial birth defects in humans. Here we conduct a GWAS of NSCLP with multiple independent replications, totalling 7,404 NSOFC cases and 16,059 controls from several ethnicities, to identify new NSCLP risk loci, and explore the genetic heterogeneity between sub-phenotypes of NSOFC. We identify 41 SNPs within 26 loci that achieve genome-wide significance, 14 of which are novel (RAD54B, TMEM19, KRT18, WNT9B, GSC/DICER1, PTCH1, RPS26, OFCC1/TFAP2A, TAF1B, FGF10, MSX1, LINC00640, FGFR1 and SPRY1). These 26 loci collectively account for 10.94% of the heritability for NSCLP in Chinese population. We find evidence of genetic heterogeneity between the sub-phenotypes of NSOFC and among different populations. This study substantially increases the number of genetic susceptibility loci for NSCLP and provides important insights into the genetic aetiology of this common craniofacial malformation.

O rofacial clefts (OFCs) are the most common craniofacial malformations in humans and present a major public health burden, imposing substantial health care and financial burdens on the individual, their family and society 1 . In general, the highest birth prevalence rates of OFCs are reported in Asia (especially in China and Japan), often as high as 1 in 500 and affecting more than 2.6 million people in China 2 . According to whether the patients have other malformations or anomalies, OFCs can be divided into syndromic and non-syndromic forms. Approximately 70% of cleft lip (CL) with or without cleft palate (CP) cases and 50% of CP only (CPO) cases occur, as isolated entities with no other abnormal phenotypes are considered to be non-syndromic (referred to as NSOFC) [1][2][3] . NSOFC is further classified into non-syndromic cleft lip with palate (NSCLP), non-syndromic CL only (NSCLO) and non-syndromic CPO (NSCPO), based on the anatomical morphology 4 . As they share common epidemiological patterns and occur during the same embryological period, NSCLP and NSCLO are often grouped together as non-syndromic CL with or without CP (NSCL/P), differing only in severity 5 . However, there is some evidence showing that NSCLP and NSCLO might harbour different genetic aetiologies [6][7][8][9] .
To facilitate the understanding of the genetic architecture and gain a better understanding of the genetic basis underlying the sub-phenotypes of NSOFC, here we perform a NSCLP GWAS using two independent case-control samples from China and replicate interesting markers in a total of 23,463 samples from sub-phenotypes of NSOFC and multiple ethnic groups. We identify 14 new loci and confirm 12 previously reported ones for NSCLP. These susceptibility variants identified in the current study collectively account for 10.94% of the heritability for NSCLP in Chinese population. In addition, evidence of genetic heterogeneity is observed between the three sub-phenotypes of NSOFC and among different populations.

Results
Identification of 26 NSCLP-associated loci. In the discovery stage, we genotyped 900,015 single-nucleotide polymorphisms (SNPs) using the Illumina HumanOmniZhongHua-8 BeadChip in 2,096 cases and 4,051 controls of Chinese ancestry (cohort 1). After quality control, 803,202 SNPs (call rate495% and minor allele frequency (MAF)41%) in 2,033 NSCLP cases and 4,051 controls of Chinese ancestry were used in the GWAS discovery analysis ( Fig. 1 and Supplementary Table 1). The Manhattan plot of P-values using Cochran-Armitage trend test with adjustment for gender is shown in Supplementary Fig. 1. All cases and controls were assessed by principal components analysis for population stratification and were confirmed to be of Chinese ancestry ( Supplementary Fig. 2). Quantile-quantile plots were constructed and genomic control values were calculated (l GC ¼ 1.04) ( Supplementary Fig. 3). Both of these results indicate negligible inflation of the genome-wide association signals caused by population stratification, further suggesting that the deviated tail of the P-values' distribution reflects some true genetic associations with NSCLP. We then conducted logistic regression analysis to assess the genotype-phenotype association.
To perform a fast-track replication study, we selected and genotyped 152 SNPs (Po1 Â 10 À 4 ) within 79 loci for a follow-up analysis in an additional 1,346 NSCLP cases and 4,542 controls of Chinese ancestry (cohort 2). Of the 146 successfully genotyped SNPs, 64 showed nominal association (Po0.05 using logistic regression) in the validation stage and 61 of them showed a consistent direction in their estimated effects on risk between the discovery (cohort 1) and validation (cohort 2) stages (Supplementary Table 2). A fixed-effects meta-analysis of the combined cohorts 1 and 2, totalling 3,379 NSCLP cases and 8,593 controls, identified 14 new loci (20 SNPs) (Po5.00 Â 10 À 8 using Cochran-Mantel-Haenszel test), namely 2p25. 1 Table 5). After reviewing the published GWASs of NSCL/P and the present study, we summarize the susceptibility loci identified in different populations in Supplementary Fig. 4.
One novel (9q22.32) and two reported loci (1q32.2 and 8q24.21) showed significant associations with NSCPO (Supplementary Table 7). The marker in 1q32.2 showed opposite directions of association between the NSCLP and NSCPO groups (rs9430019; OR NSCLP ¼ 1.25 and OR NSCLO ¼ 0.66), whereas the markers in the 8q24.21 and 9q22.32 loci were concordant in the estimated direction of association with NSCLP (Supplementary  Tables 3, 4 and 7). It is worth mentioning that the recent GWAS 17 and sequencing study 18 revealed an aetiological missense variant in GRHL3 (1p36.11) for NSCPO. The additional locus 9q22.33 (FOXE1) was identified potentially accounting for linkage to both NSCL/P and NSCPO 19 . The markers at 1p36.11 and 9q22. 33 were not significant at the GWAS stage in our study and thus were not replicated in NSCLP and NSCPO.
Interestingly, gender stratification analysis revealed that one previously identified locus (1q32.2) showed strong evidence of heterogeneity (P ¼ 1.38 Â 10 À 4 ) in the evidence of association with NSCLP from male and female cases. The marker on 8q21.3 was observed to exhibit significant evidence of heterogeneity in its estimated effect between older mothers (435 years) and the reference age of mothers (25-35 years) (Supplementary Table 8).
Replications of 16 NSCLP loci in multi-ethnic groups. We further checked for associations of the 14 novel loci and two recently reported NSCL/P loci (16p13.3 in China 15 and 2p24.2 in a multi-ethnic study 16 ) using cohorts 5-7. Different loci showed evidence of association in different cohorts (Po0.05 using logistic regression test), specifically three loci in Central Europeans, seven loci in Asians and five loci in European ancestry (Table 1  and Supplementary Table 9). For the majority of the 16 loci mentioned, the direction and magnitude of the effect of ORs were consistent across Chinese and non-Chinese samples. However, we observed an apparent difference in risk allele frequencies (AFs) for most of these 16 Figure 1 | Study design. We first conducted a GWAS study in 2,096 Chinese NSCLP cases and matched 4,051 controls using Illumina HumanOmniZhonghua-8 BeadChip. After quality control, 803,202 SNPs were remained and analysed in 2,033 NSCLP cases and 4,051 controls, and 1,151 SNPs showed Po1.0 Â 10 À 4 using logistic regression in the discovery stage. One hundred and fifty-two SNPs with Po1.0 Â 10 À 4 were selected for replication in an independent Chinese cohort including 1,346 NSCLP cases and 4,542 controls. After quality control, 146 SNPs remained, of which 61 SNPs showed Po0.05 using logistic regression and consistent direction in their estimated effects on risk in the discovery and validation samples. Then, a fixed-effects meta-analysis of the combined cohorts 1 and 2 samples identified 14 novel loci (20 SNPs) and confirmed 12 previously reported ones (21 SNPs) associated at genome-wide significance (P meta o5 Â 10 À 8 using Cochran-Mantel-Haenszel test). We genotyped 41 top SNPs in further 1,104 NSCLO, 1,104 NSCPO patients and 3,312 shared controls in Chinese Han population, respectively. As a result, ten and three loci showed significant associations in cohort 3 and 4 samples (P Bonferroni o1.25 Â 10 À 3 using logistic regression and Bonferroni correction). The 24 SNPs (20 from the 14 novel loci and 4 from two newly reported NSCL/P loci) out of the 41 SNPs were also evaluated in Central European, Asian and European ancestry populations, and 3, 7 and 5 loci showed evidence of association in the different cohorts, respectively (Po0.05 using logistic regression). We additionally explored the molecular functionalities of risk variants and their related genes using several complementary methods. regulate transcription for 50 reference genes (Supplementary Data 1). Expression profiles of the 49 genes within 500 kb and showing strong linkage disequilibrium (LD) with index SNPs (r 2 Z0.7) in the 26 loci based on craniofacial gene expression patterns in mouse (EMAGE database) showed that a total of these 20 genes were related to embryonic development, among which 12 were expressed in the related tissue for OFC (Supplementary  Table 11). Twelve genes were found to produce CL/CP malformations in mutant mouse models (Supplementary Table 12). Seven genes were reported to be associated with nine recognized malformation syndromes including OFC as a clinical phenotype (Supplementary Table 13). By manually reviewing the literature related to these 49 genes and their functions, annotated systematically using several databases, 28 of these genes were classified into different biological categories according to morphogenesis, development, molecular and cellular function, and 17 of these categories contained 3 or more genes. The most relevant catalogue was the morphogenesis-and development-related traits, which included the bone, limb, brain, ear and other organs. Several important signalling pathways were also reviewed, among which epidermis/epithelium development and morphogenesis, and the fibroblast growth factor receptor (FGFR) signalling pathway involved the most genes, whereas the bone morphogenic proteins (BMP), WNT and Notch signalling pathways have also been reported to be associated with lip and palate development (Supplementary Table 14).

Discussion
In this study, we identified 14 novel risk loci associated with NSCLP and confirmed 12 previously reported loci in the Chinese population. Several lines of evidence support the hypothesis that these newly identified loci contribute to orofacial clefting. At 4p16.2, two markers in moderate LD with one another (r 2 ¼ 0.65) are located 50 kb upstream of MSX1 ( Supplementary Fig. 5b). Mutations in MSX1 contribute to CL/P and tooth agenesis in humans 20 . Msx1 À homozygotic mice exhibit CP, deficiency of the alveolar mandible and maxilla, and failure of tooth development 21 . The orthologous Spry1 of the optimized gene SPRY1 at 4q28.1 has been shown to exhibit facial clefting and CP in transgenic mice 22 . The marker at 5p12 is adjacent to FGF10, whose encoded protein has been reported to play roles in the epithelial-mesenchymal transition, and Fgf10 knockout mice show the symptoms of CP 23 . The 6p24.3 region harbours two candidate genes for the OFC: OFCC1 was reported as a potential OFC susceptibility gene in humans, based on the observation of three unrelated OFC patients all harbouring a chromosomal break within or close to the gene 24 . TFAP2A mutations have been found in patients diagnosed with branchio-oculo-facial syndrome, which is characterized by branchial defects, ocular anomalies and facial defects, including CL/P 25 .
The SNP at 8p11.23 is located in an enhancer within FGFR1. In humans, mutations of FGFR1 have been found in syndromic CL/P 26 , whereas in mice a homozygous hypomorphic allele at Fgfr1 caused CP 27 . A marker at 9q22.32 is located at the intronic region of PTCH1, mutations of which have been identified in patients of Gorlin syndrome with CL/P as one of its clinical manifestations 28 . Ptch1 DL mouse models showed craniofacial defects, including underdeveloped palatal shelves and clefting of the secondary palate 29 . The marker rs705704 is located at the locus 12q13.2, which includes the gene RPS26, which may be relevant to orofacial clefting, as mutations of RPS26 have been reported in Diamond-Blackfan anaemia (DBA) with CPO as one of its clinical phenotypes 30 . Two highly correlated markers (r 2 ¼ 0.99) at 14q32.13 are located between GSC and DICER1. GSC modulates the epithelial-mesenchymal transition and mutations in GSC lead to a syndrome defined by short stature, auditory canal atresia, mandibular hypoplasia and skeletal abnormalities 31 , whereas DICER1 mutations have been reported to cause pleuropulmonary blastoma and multinodular goiter-1, with or without Sertoli-Leydig cell tumours (MNG1) 32 , and Dicer1 conditional knockout mice exhibit secondary palate clefting and other severe craniofacial dysmorphisms 33 . The excellent candidate gene WNT9B at 17q21.32 has already been functionally implicated in craniofacial development, as mice with Wnt9b targeted mutation were described as presenting CL/P phenotypes 34 . Among the other new signals, two markers at 2p25.1 are in perfect LD (r 2 ¼ 1) with one another and are located 12 kb upstream of TAF1B, encoded protein of which is important for polymerase (Pol) I transcription 35 . At 8q22.1, a synonymous codon SNP rs957448 (KIAA1429) is correlated (r 2 ¼ 0.65) with rs12681366 (an intronic SNP of RAD54B). Human RAD54B was first identified as a homologue of RAD54, which plays an important role in DNA repair 36 . The strongest associated marker at 12q13.13 is located 500 bp downstream of KRT18, which encodes a protein in the large family of cytoskeletal proteins with specific expression in epithelial cells 37 . At 12q21.1, the signals are near the TMEM19 gene, involving the SNPs rs2304269 and rs7967428, which are in strong LD with each other (r 2 ¼ 0.98). Rs2304269 and rs7967428 are respectively located at one active promoter and five strong enhancers in epidermal keratinocytes according to ENCODE data. The 14q22.1 signal is close to LINC00640, a gene of unknown function. In addition, to gain further insight into the possible involvement of genes at some   Fig. 6).
Notably, in our study, two members of the FGF signalling pathway, including FGF10 at 5p12 and FGFR1 at 8p11. 23 We performed a network analysis of notable genes in the 26 NSCLP associated loci, which showed that several FGF signalling related genes such as FGFR1 and FGF10 are connected (Fig. 3). The FGF signalling pathway was proposed to contribute to NSCL/P 41 and previous candidate gene studies have provided evidence in humans and animal models 41,42 . The findings of our association study strengthen the hypothesis that the FGF signalling pathway might play important roles in craniofacial development. Intriguingly, we also found a potential link between ribosomopathies and the genes in our NSCLP-associated loci, including RPS26, RAD54B and TAF1B. Mutations in RPS26 were reported to affect the functions of the proteins in ribosomal RNA processing in DBA patients and DBAs belong to a class of diseases called ribosomopathies 30,43 . Moreover, RPS26 and RAD54B were reported to regulate p53 (refs 44,45) and the p53 pathway is importantly involved in ribosome biogenesis 43 . In addition, TAF1B was reported as a component of RNA Pol I basal transcription factor, which is essential for Pol I recruitment to the ribosomal RNA gene promoter 35 .
For the 12 significant associated loci in the study that had been previously reported, the strongest signals occurred for 2 SNPs in near-perfect LD (r 2 ¼ 0.92) in 2p24.2, located in the 3 0 -untranslated region of FAM49A. It is worth mentioning that both the LD block and ±500 kb on either side of the index SNPs in this region only contain the single gene FAM49A, although a few non-coding RNA genes are located in this region. FAM49A is a protein-coding gene whose paralogue, FAM49B, is located in a previously reported susceptibility locus near the gene desert region of 8q24, which shows a strong association with the risk of NSCL/P in European populations 10,46,47 . ENCODE data indicate that SNP rs7552 alters the regulatory motifs of TBX5 and BRCA1, and the highly correlated SNP rs4832651 (r 2 ¼ 0.98) lies within a conserved enhancer for mammary epithelial cell activity. Although Myc-oncogene has been reported as the probable target effect gene in the 8q24 region for NSCL/P 47 , the functions of both FAM49A and FAM49B remain poorly defined. These genes might play a role in the aetiology of NSCL/P and whether their functions vary across different populations is clearly worth further investigation. In addition, as expected, the second strongest signals were near IRF6 at 1q32.2 and this association signal has been independently replicated in numerous GWAS studies and candidate gene studies 2,6,10,12,14-16 . Of the remaining ten loci, 1p36.13 and3q12.1 approached genome-wide significance, 15q22.2 showed suggestive evidence of association and the additional seven loci were only analysed in the NSCLP GWAS stage (Supplementary Table 4).
Comparisons of NSCLP, NSCLO and NSCPO have yielded clear evidence of genetic heterogeneity among the three subgroups of NSOFC. The two sub-groups (NSCLP and NSCLO) generally grouped together appeared to share more genetic risk factors, which is consistent with previous findings 4, 6,48 , and these results argue for distinct origins of development of the lip and primary palate versus the secondary palate 1,49 . In addition, although 1p36.11 and 9q22.33 were not confirmed in NSCPO in OR is calculated based on minor allele; alleles are shown as minor allele/major allele. *P-value below 1.25 Â 10 À 3 (0.05out of 40, the P-value using logistic regression test and Bonferroni correction) was considered to be statistically significant.
our study, 1q32.2, 8q24.21 and 9q22.32 were first demonstrated to have an effect on NSCPO in the Chinese population. Importantly, our study provided evidence that 1q32.2 exhibits an overlapping effect on all three sub-phenotypes of NSCLP, NSCLO and NSCPO. The evidence of association at 1q32.2 was stronger among males than females, which may reflect the higher prevalence rate of NSCLP among males (male:female ¼ 2.6:1 in our study). Stratification of the results by maternal gestational age revealed that older mothers may have a higher risk of having a child with NSCLP, as suggested by some previous studies of congenital disorders such as NSOFC 50,51 . Plausible reasons for the failure to replicate all of the associated loci in different ethnic groups could be due to the limited sample size, the combined sub-groups of NSCL/P used in previous analyses, the differential tagging of unobserved causal variants across ethnic groups or the existence of true genetic heterogeneity across ethnic groups. Further studies using larger sample sizes or analytical approaches, such as a trans-ethnic genome-wide meta-analysis approach 52 with more detailed classification of subphenotypes, are warranted to further investigate this hypothesis.
Of the 26 genetic risk factors, 19 had reported associations with a total of 34 other diseases/traits. These associations could mainly be categorized into six different groups, including developmental, immune, metabolic, neoplastic, endocrine and degenerative categories (Supplementary Data 2). To further assess the possible independence among these various birth defects/diseases/traits within these particular SNPs, we examined LD patterns between these SNPs in Asian, African and European populations using data from the 1,000 Genomes Project. As a result, three susceptibility loci were identified to be shared by NSCLP and other diseases/traits, including schizophrenia at 8p11.23, asthma, polycystic ovary syndrome, rheumatoid arthritis, vitiligo, type 1 , which showed significant and non-independent association of the risk of NSCLP and other diseases/traits. Interestingly, NSCLP and adult height shared the same index SNP (rs10512248 at PTCH1 in 9q22.32), suggesting that some shared genetic factors might underlie these two very distinct phenotypes. Furthermore, some reported GWAS loci had susceptibility genes shared between NSCLP and other diseases/traits, such as MAFB at 20q12 for Dupuytren's disease, low-density lipoprotein cholesterol and total cholesterol, which showed a clearly independent association with NSCLP index SNPs and suggested pleiotropic effects of these genes on other diseases/traits. Overall, our current study has advanced the understanding of the genetic architecture controlling the risk of NSOFC by substantially increasing the number of genetic risk factors and has highlighted potential candidate genes through subsequent genetic and biological analyses. This study has also provided further insight into the possible pleiotropic effects of genetic risk factors on different sub-phenotypes, in different populations and among different diseases/traits. Through a comprehensive analysis of cases and controls from a Chinese population, we have identified 14 new genetic risk factors and validated associations in a large majority of previously reported loci. Further sequencing and functional investigations will probably identify causal mutational events and true susceptibility genes in or near these tagging SNPs and further elucidate the disease pathogenesis of these common congenital birth defects.

Methods
Samples. In the current study, we carried out a two-stage GWAS and further replications of NSOFC. The discovery stage included 2,096 NSCLP cases and 4,051 controls (cohort 1). Replication studies were performed in an additional 1,346 unrelated NSCLP cases and 4,542 controls (cohort 2). Further replications consisted of cohort 3 (1,104 NSCLO cases versus 3,312 controls), cohort 4 (1,104 NSCPO cases versus 3,312 controls shared with cohort 3), cohort 5 (399 NSCL/P cases versus 1,318 controls), cohort 6 (861 NSCL/P case-parent trios) and cohort 7 (557 NSCL/P case-parent trios). Samples of cohorts 1-4 were recruited from the Chinese population through collaboration with multiple hospitals in Hubei, Henan and Anhui province. All cases were interviewed and clinically assessed by at least two experienced clinicians, and a detailed questionnaire was completed to identify any further anomalies, such as congenital heart disease, hypospadias, accessory auricle, lip pits and so on, which would suggest an underlying syndrome. We collected clinical information from the subjects through a full clinical checkup and additional demographic information from the cases was obtained through a structured questionnaire that mainly included four parts: basic information, clinical feature, maternal situation and life style during the first trimester of pregnancy, and genetic background of the patients. All controls were healthy individuals without OFC or family history of OFC (including first-, second-and third-degree relatives). Peripheral blood samples were collected after the written informed consents were obtained from all the participants or their guardians. The study was approved by the institutional ethics committee of each hospital (Hospital of Stomatology Wuhan University, The Second Charity Hospital of Henan Province, Stomatological Hospital of Nanyang, Stomatological Hospital of Xiangyang and The First Affiliated Hospital of Anhui Medical University) and was conducted according to the Declaration of Helsinki principles. The replication data in cohort 5 from the GWAS in Central Europeans was provided by Mangold et al. 13 , whereas the replication data in cohorts 6 and 7 were from the GWAS of case-parent trios of Asians and European ancestry provided by Beaty et al. 12 . All the controls and cases for each replication cohort were sampled from the same locality and the same population in each study, to assure minimal population stratification effects for each replication.
DNA extraction. Approximately 4 ml EDTA anticoagulated venous blood sample was collected from each participant. Genomic DNAs of the cases were extracted from peripheral blood lymphocytes using the standard SDS-proteinase K-phenol/chloroform method. For the controls, DNAs were isolated by standard procedures using Flexi Gene DNA kits (Qiagen) according to the manufacturer's protocol. After quality control, DNAs were diluted to working concentrations of 45-55 ng ml À 1 for genome-wide genotyping and 20-30 ng ml À 1 for the validation studies, respectively.
Genotyping and quality controls in GWAS. The discovery-stage genotyping was conducted according to the Infinium HD protocol using the Illumina HumanOmniZhongHua-8 v1.1 BeadChip (Illumina, San Diego, CA, USA) at the Key Laboratory of Dermatology at Anhui Medical University (Ministry of Education), Hefei, Anhui, China. Genotyping was performed as described in the Infinium HD protocol from Illumina 53 .
In the GWAS stage, a total of 900,015 SNPs were genotyped in 2,096 cases and 4,051 controls. A standard quality-control criterion was applied to select SNPs and samples for further analysis. SNPs were excluded if they had (i) a call rate o95% in cases or controls; (ii) an MAF of o1% in the population; or (iii) significant deviation from Hardy-Weinberg equilibrium (HWE) in the controls (Pr10 À 4 ). In addition, all the SNPs on the X, Y and mitochondrial chromosomes, as well as the copy number variation-related SNPs and probes, were excluded from statistical analysis. Meanwhile, samples were removed if they (i) had an overall genotyping rate of o98%; (ii) were duplicates or showed familial relationships based on pairwise identity by state using PLINK 1.07 (ref. 54), the sample with higher call rate was left between the related samples (PI_HAT40.025); (iii) showed inconsistent genetic gender with epidemiological or clinical data; (iv) and were ancestry outliers or heterozygosity outliers. Samples were assessed for population stratification using the software package EIGENSTRAT 55 . The original script from EIGENSTRAT was modified to extract the principal components for plotting. In total, 63 samples were removed from analysis. After quality control, the genotype data of 803,202 autosomal SNPs in 2,033 cases and 4,051 controls remained for further analysis.
SNP selection for replication studies. SNPs were selected for NSCLP replication according to the following steps: (i) we first picked out all the top SNPs with Po1.0 Â 10 À 4 in the initial stage and excluded the SNPs with ambiguous genotype scatter plots; (ii) then we selected at least one SNP with the lowest P-values in each of the novel loci, which defined by using the PLINK option '-indep-pairwise 50 5 0.2'; (iii) in addition, one to four top SNPs were chosen from the previously reported loci; (iv) we also selected SNPs that were located within or close to the susceptibility genes with gene expression profiling evidence for OFC or for syndromes with OFC symptoms. In total, 152 SNPs were selected for the NSCLP replication stage. Furthermore, all the promising SNPs were selected for the NSCLO and NSCPO replications. These SNPs had the lowest P-values in NSCLP-meta stage and showed Po0.05, as well as with call rate 490%, MAF40.01 and P HWE 410 À 4 in NSCLP replication stage; thus, 41 SNPs reached genome-wide significance. The above 41 SNPs were distributed in 12 previously reported NSCL/P associated loci and 14 novel loci. In cohort 5-7 replications, 24 SNPs were selected for replication in 3 GWAS data sets from Central Europeans, Asians and European ancestry groups. Of the 24 SNPs, 20 were picked out from the 14 novel loci and 4 were from 2 newly reported NSCL/P loci (16p13. 3 (ref. 15) and 2p24.2 (ref. 16)) and all of them were from the 41 significant SNPs.
Genotyping and quality control in replication studies. Genotyping analyses of replications in cohorts 2-4 were conducted by using the Sequenom MassARRAY system, at the Key Laboratory of Dermatology at Anhui Medical University (Ministry of Education), Hefei, Anhui, China. Locus-specific PCR primers (Supplementary Table 15) were designed using MassARRAY Assay Design 3.0 software, following the manufacturer's instructions (Sequenom) 53 . Quality control was performed in each data set separately using PLINK 1.07. In each case-control replications (cohorts 2-4), we excluded SNPs with a call rate o90% in cases or controls, or deviation from HWE proportions (Pr1 Â 10 À 4 ) in the controls.
To evaluate the quality of the genotype data for the validation analyses, 100 randomly selected samples from the GWAS stage were re-genotyped using the Sequenom system. The concordance rate between the genotypes from the Illumina HumanOmniZhongHua-8 v1.1 BeadChip and the Sequenom MassARRAY assay analyses was 499%. The cluster plots from the Illumina and Sequenom analyses were checked to confirm their good quality. After quality control, 146 SNPs were remained for NSCLP replication and 40 SNPs were left for further replications in cohort 3 and 4 analyses, respectively.
Statistical analyses. In the GWAS stage, we examined potential genetic relatedness based on pairwise identity by state for all of the successfully genotyped samples using the PLINK 1.07 software. For the duplicated samples and all pairs of first-, second-and third-degree relatives detected, the subject from each pair with the lower call rate was removed from further analysis. All cases and controls were assessed by principal components analysis for population stratification and were confirmed to be of Chinese ancestry. Quantile-quantile plots were constructed and calculations of genomic control values (l GC ¼ 1.04 indicated a negligible inflation of the genome-wide statistical results due to population stratification) were performed by using the software R (http://www.r-project.org/) to evaluate the overall significance of genome-wide association results and the potential impact of population stratification, respectively, in the discovery stage.
Association of GWAS and replication analysis were performed using the Cochran-Armitage trend test. Single-marker association analyses were performed to test for disease-SNP associations using logistic regression in each stage. Fixed effects meta-analyses of cohorts 1 and 2 (NSCLP combine) were performed using the Cochran-Mantel-Haenszel test, where P-values and heterogeneity index Q-values from Cochran's Q statistics were also obtained. Assessment of heterogeneity across studies was carried out by evaluating the P het values from Cochran's Q statistics (Bonferroni-corrected heterogeneity Q-values P het ofo0.05 were considered significant) 52,56 . OR values were measured as OR per allele and presented for the minor allele of a SNP, unless otherwise stated. A threshold of Po5 Â 10 À 8 was adopted to define novel loci with genome-wide significance. The regional association plots for each susceptibility locus were generated in R using information from the HapMap project (CHB and JPT samples). After applying quality control and removing those SNPs with MAFo1%, HWEo0.0001 and call rate o95% from GWAS data set, 2,033 cases and 4,051 controls with 803,202 SNPs were used for the disease variation assessment in the genome-wide level. Furthermore, the samples passed quality control from the discovery and NSCLP replication (3,379 cases and 8,593 controls) with the 41 markers attaining genomewide significance (Po5 Â 10 À 8 ) were used for disease variation estimating of the 26 NSCLP risk loci. The proportion of variance in NSCLP risk was examined via the residual maximum likelihood method in the program genome-wide complex trait analysis and estimated assuming a disease prevalence of 0.001 (1 out of 1,000) and log additive risk 52,57 . All power calculations were performed using the genetic power calculator assuming a disease prevalence of 0.001 and log-additive risk. We carried out conditional analyses to identify additional association signals after accounting for the effects of known and newly discovered susceptibility loci. To investigate more than two association signals per locus, we used a stepwise procedure in which additional SNPs were added to the model according to their conditional P-values, as programmed in EMMAX. We estimated the LD metrics r 2 and D' using 6,084 individuals from METSIM, who passed genotyping quality control. To replicate associations of the 24 SNPs in different ethnicities, GWAS data from three previously published NSCL/P populations (Central Europeans, Asians and European ancestry groups) were extracted. Replication in the Central European NSCL/P samples was based on a data set published in Mangold et al. 13 SNPs that had not been genotyped in this study were imputed using IMPUTE2 software 7 . Genotype imputation for the case-parent trios described in Beaty et al. 12 was run by the GENEVA Coordinating Center 58 , using a worldwide 1,000 Genomes Project reference panel and the IMPUTE2 software in 2012. Imputed genotypes and accompanying marker annotation and quality metrics files are available through the authorized access portion of the dbGaP posting.
Stratification analyses. Genotype-phenotype stratification analyses were conducted by using PLINK 1.07 software for the 41 significant associated markers in NSCLP-meta stage. Genotype data were extracted from GWAS and NSCLP NATURE COMMUNICATIONS | DOI: 10.1038/ncomms14364 ARTICLE replication stages. Then, we performed stratification analyses on gender and maternal gestational age in NSCLP. P-value below 1.22 Â 10 À 3 using logistic regression (0.05 out of 41, Bonferroni correction) was considered to be statistically significant.
Heterogeneity analyses among NSCLP, NSCLO and NSCPO were performed by using PLINK 1.07 software based on the 40 significant associated markers in NSCLP meta-stage. Genotype data were extracted from discovery and replication stages of NSCLP, NSCLO and NSCPO (cohorts 1-4). We first divided the cases into three sub-phenotypes NSCLP, NSCLO and NSCPO, then extracted genotype of each case from the above four cohorts and calculated the association between each combination of two sub-phenotypes. P-value below 1.25 Â 10 À 3 using logistic regression (0.05 out of 40, Bonferroni correction) was considered to be statistically significant.
Locus annotation and candidate gene prioritization. To prioritize candidate genes, besides the nearest genes to the index SNPs, the following methods were used to help prioritize potential causal genes in each associated region. All genes located in the same LD block as the index SNPs (r 2 Z0.7) were selected 52 and annotated for function in molecular, cellular, animal model and tissue/organ levels using several databases, including PubMed (http://www.ncbi.nlm.nih.gov/pubmed/), EMAGE (http://www.emouseatlas.org/emage/home.php), MGI (http:// www.informatics.jax.org/), OMIM (http:/www.omim.org/), Gene (http:// www.ncbi.nlm.nih.gov/gene/), UCSC (http://genome.ucsc.edu/) and Ensembl (http://www.ensembl.org/index.html). The nearest genes on both sides of the index SNP were annotated when no gene was located within the LD block. A total of 135 SNPs at these 26 NSCLP risk loci (all with r 2 Z0.7 with the SNPs found to be genome-wide significant here) with MAF40.05 and P HWE 41 Â 10 À 4 were annotated by using the following methods: regulatory features from ENCODE Consortium/ENCODE/Roadmap Epigenomics Project (http://www. roadmapepigenomics.org/) 59,60 .
Network analysis. We expanded the global network by including the Human Net protein interaction database 61 and literature-curated interactions from STRING 62,63 to derive an expanded global network based on known protein-protein interactions using the notable genes of the 26 NSCLP associated loci from the present study (Fig. 3).
GWAS catalogue reviews. We evaluated all the SNPs within ±500 kb of the index SNPs (from the 26 loci) and with Po5 Â 10 À 8 recorded in National Human Genome Research Institute GWAS catalogue (http://www.genome.gov/gwastudies) updated on 20 February 2015. The LD patterns of the index SNPs and the recorded SNPs in GWAS catalogue were inquired using SNAP version 2.2 in Asian (CHB þ JPT) and European (CEU) populations using data from the 1,000 Genomes Project Pilot 1.
Expression studies in the mouse. Eight-to 14-week-old wild-type Kunming mice were housed in approved specific pathogen-free conditions and mated for 12 h, the presence of a vaginal plug was designated as E0.5. Pregnant mice were randomly divided into four groups and killed at embryonic stages E13.5-E16.5, respectively. Embryos with death or other malformations were ruled out. Normal fetuses were harvested and fixed in 4% paraformaldehyde overnight at 4°C for IHC. The 4 mm paraffin sections were deparaffinized, rehydrated and subjected to antigen retrieval with high pressure method. A mixture of 30% H 2 O 2 and methanol (1/9, v/v) was performed to inhibit endogenous peroxidase activity. The rabbit polyclonal antibodies to Fam49a (LS-C167900, LSbio; 1:100 dilution), Rad54b (orb100108, Biorbyt; 1:200 dilution), Rps26 (14909-1-AP, Proteintech; 1:800 dilution), Taf1b (12818-1-AP, Proteintech; 1:600 dilution) and Thap2 (orb186252, Biorbyt; 1:200 dilution) were incubated with the sections at 4°C overnight, respectively, and were then detected with the Rabbit SP kit (SP9001, Zhongshan Golden Bridge Biotech). The sections were then counterstained with haematoxylin. The results were assessed by an investigator who was blinded to the group allocation. All experimental procedures were carried out in accordance with the Institutional Animal Care and Use Committee of the Laboratory Animal Center of Wuhan University, China. The study was approved by the Ethics Committee, School and Hospital of Stomatology of Wuhan University, China.
Data availability. The data that support the findings of this study are available from the corresponding author on request.