Periodontitis (PD) is a common inflammatory disease of the oral cavity that leads to the resorption of alveolar bone, making it the major cause of tooth loss in adults above 40 years [1]. The inflammation is often chronic and can affect large areas of the gingival tissues. Because periodontal disease is painless, the inflammation usually persists over long periods, and PD has commonly reached an advanced degree of severity before it is diagnosed and treatment is started. Recurrent and persistent inflammations caused by bacteria are also recognized as continually renewing reservoirs for the systemic spread of bacterial antigens, cytokines, and other proinflammatory mediators and may place a burden on the rest of the body [2]. Accordingly, researchers have hypothesized an etiologic role of PD in the pathogenesis of various systemic illnesses such as diabetes mellitus [3], cardiovascular disease [4], and osteoporosis [5], bridging the once-wide gap between medicine and dentistry.

Severe forms of PD have a prevalence of > 9% in adults aged 30 years and older [6]. PD is classified into the widespread form chronic periodontitis (CP) and the rare, early-onset and much more severe disease phenotype aggressive periodontitis (AgP), which has a prevalence of <0.1% globally [7]. Both forms have an estimated heritability of 50%, with AgP having a stronger and better-established heritable component compared to CP [8,9,10]. CP and AgP have a similar etiology and histopathology and can be considered as parts of the same disease spectrum. The different disease manifestations develop as a consequence of individual combinations of genetic risk loci that determine the individual immune response in conjunction with external factors like smoking. In this view, the different disease manifestations of a complex disease such as PD are not considered as confined entities [11].

To date, several genome-wide association studies (GWAS) for CP have been published [9, 12,13,14,15,16,17] but in these studies no common allele reached genome-wide significance, likely related to an insufficiently large number of available case-control samples of this late-onset disease phenotype. Previously, we performed a GWAS of AgP and identified a single nucleotide polymorphism (SNP) within the gene GLT6D1 to be associated with AgP at a genome-wide significance level [18]. This association was replicated in a Sudanese-African case-control sample of AgP, but could not be validated for the more moderate late-onset form CP [19]. This suggests a role in the etiology of early-onset very severe disease phenotypes of AgP alone. In an expanded AgP case-control sample, we have recently performed a second GWAS and validated the most significant associations with CP in a German CP case-control sample. In this study, two common variants at SIGLEC5 and DEFA1A3 reached genome-wide significance in the combined sample [20]. Currently, these variants are the only shared risk variants of CP and AgP identified by GWAS, together with two haplotype blocks at PLG (plasminogen) and at the gene cluster PF4 (platelet factor 4)/PPBP (pro-platelet basic protein)/CXCL5 (C-X-C motif chemokine ligand 5), which were identified by different approaches [21,22,23]. In the current study, we aimed to identify additional common susceptibility variants for AgP and CP by conducting a meta-analysis with our North-West European GWAS of AgP and a European-American GWAS of CP. Suggestive associations were validated in imputed genotypes from our previously used German CP case-control sample. We identified two novel loci, which showed genome-wide significant associations with PD.

Materials and methods

Participating studies

The meta-analysis samples consisted of case-control GWA studies of German and Dutch AgP [20] and of European American [9] and German CP [17] patients (Supplementary Table 1).

The German AgP sample (AgP-Ger) included 680 cases and 3,973 controls. Cases were recruited across Germany by the biobank Popgen [24], University-Hospital Schleswig-Holstein, Germany. Controls originate from North-Germany and West-Germany and were recruited from the Competence Network “FoCus–Food Chain Plus” [25], the Dortmunder Gesundheitsstudie–DOGS [26] and the Heinz Nixdorf Recall Studies 1–3 [27]. The Dutch AgP sample (AgP-NL) consisted of 171 cases and 2607 controls. The Dutch cases were recruited from the ACTA (Academisch Centrum Tandheelkunde Amsterdam) and the Dutch controls were recruited from Rotterdam and Wageningen by the B-Proof Study [28]. Inclusion criteria for AgP were ≥2 affected teeth with ≥ 30% bone loss in patients <36 years of age; disease phenotype was diagnosed by full mouth dental radiographs. Genotyping was performed on Illumina Omni Bead Chips for German and Dutch AgP cases and subsequent genotype imputation was performed on the 1000 Genomes Phase 3 reference.

The European American CP (CP-EA) sample included 958 severe (sev) CP cases, 2293 moderate (mod) CP cases and 1909 controls from the Atherosclerosis Risk in Communities (ARIC) Study, which were described before [29]. In brief, the patients were classified by the Centers for Disease Control/American Academy of Periodontology (CDC/AAP) consensus three-level classification system [30]. The CDC/AAP taxonomy uses clinical attachment loss (CAL) and probing depth criteria to define three CP categories: healthy-mild, moderate, and severe CP, with the healthy-mild category serving as the control group. Genotyping was carried out using the Affymetrix Genome-Wide Human SNP Array 6.0 and the subsequent genotype imputation was performed on the HapMap Phase II reference with individuals of Northern and Western European (CEU) ancestry.

The German CP (CP-Ger) sample consisted of 993 cases and 1419 controls from a meta-analysis of SHIP and SHIP-TREND cohorts [31,32,33]. In brief, subjects within the first and the third tertile of proportion of proximal sites with attachment loss (AL) ≥ 4 mm were contrasted after stratification by sex and 10-year age groups. Age-specific tertiles were defined to include severely diseased cases within each age stratum. Thus, also young and severely diseased subjects were captured and included in the third tertile. Otherwise, those in the third tertile would have been the older ones and those in the first tertile would have included the younger ones only. Individuals aged > 60 years were excluded. The identification of these subjects is important as we assume that genetic predispositions might manifest especially in younger age, while in older subjects these effects are overlaid with risk factor associated disease progression. This case-control sample was previously described in detail [17]. Cases and controls were genotyped either with the Affymetrix Genome-Wide Human SNP Array 6.0 or the Illumina Human Omni 2.5 array and imputed on the 1000 Genomes Phase 1 reference.

Data availability

Genotype data for the aggressive periodontitis samples and the chronic periodontitis sample of German descent are available upon request from the biobanks PopGen and SHIP. Summary statistics for CP-EA can also be downloaded (see ref. [9]).


In the post-imputation QC processing, variants that met the following cut-off criteria were excluded: AgP: Hardy-Weinberg equilibrium P-value (PHWE) < 10−4, imputation quality (INFO) < 0.8 [20]; CP (Germany): PHWE ≤ 0.001, imputation quality (r2HAT) ≤ 0.3 [17]; CP (European-US): PHWE < 10−5 (ARIC) and PHWE < 10−6 (Health ABC), imputation quality < 0.8 [9].

Moreover, we filtered out variants with a minor allele frequency (MAF) < 0.05 because our study lacked statistical power to analyze rare variants. The variant sets did not completely overlap between the different genotype data sets which was mainly due to the different human genome references that were used to impute the different data sets. Only variants with genotype data available in each study were analyzed in the meta-analysis.
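The filtering and overlap step described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function name, the dictionary layout, and the choice to apply the MAF threshold separately in each dataset are assumptions made for the example.

```python
from typing import Dict, Optional, Set


def common_variants(datasets: Dict[str, Dict[str, float]],
                    maf_min: float = 0.05) -> Set[str]:
    """Return the variant IDs that pass the MAF filter in every dataset.

    `datasets` maps a (hypothetical) sample name to {variant_id: MAF}.
    Assumption: the MAF threshold is applied per dataset.
    """
    kept: Optional[Set[str]] = None
    for mafs in datasets.values():
        # Drop variants below the MAF threshold in this dataset.
        passing = {vid for vid, maf in mafs.items() if maf >= maf_min}
        # Keep only variants that are also present in all earlier datasets.
        kept = passing if kept is None else kept & passing
    return kept or set()
```

A variant survives only if it is present in every input and is common in each of them, which is why the shared variant set is much smaller than any single imputed dataset.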


Firstly, hypotheses on candidate gene regions were generated by meta-analyzing the genome-wide association scans of AgP-Ger, AgP-NL, CP-EA-sev, and CP-EA-mod based on the additive model using logistic regression. At this stage, we did not include CP-Ger, because the complete genome-wide summary statistics were not available to us. Subsequently, all variants with Pmeta < 10−5 were selected and clustered into genomic loci. Our clustering algorithm compared pairs of variants and assigned pairs with a maximum distance of 200 kilo base pairs (kb) to the same locus. For each locus, we then queried the CP-Ger genotypes of all variants within +/− 500 kb of the variant with the lowest P-value. However, for four of the 35 loci that we intended to meta-analyze with the CP-Ger sample, no genotype data were available from this sample.
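The locus clustering and the follow-up window can be illustrated with a short sketch. It assumes that the pairwise rule is applied transitively, so that chains of variants each within 200 kb of a neighbor form one locus (single-linkage clustering); the source does not state this explicitly. The tuple layout and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (variant ID, chromosome, position in bp, P-value); layout is hypothetical.
Variant = Tuple[str, str, int, float]


def cluster_into_loci(variants: List[Variant],
                      max_gap: int = 200_000) -> List[List[Variant]]:
    """Group variants on the same chromosome whose neighbors lie within max_gap."""
    by_chrom: Dict[str, List[Variant]] = defaultdict(list)
    for v in variants:
        by_chrom[v[1]].append(v)
    loci: List[List[Variant]] = []
    for chrom in sorted(by_chrom):
        ordered = sorted(by_chrom[chrom], key=lambda v: v[2])
        current = [ordered[0]]
        for v in ordered[1:]:
            if v[2] - current[-1][2] <= max_gap:
                current.append(v)      # close enough: same locus
            else:
                loci.append(current)   # gap too large: start a new locus
                current = [v]
        loci.append(current)
    return loci


def lead_window(locus: List[Variant], flank: int = 500_000) -> Tuple[str, int, int]:
    """Return (lead variant ID, window start, window end) around the lowest P-value."""
    lead = min(locus, key=lambda v: v[3])
    return lead[0], max(0, lead[2] - flank), lead[2] + flank
```

On sorted positions, comparing each variant only with its predecessor gives the same groups as comparing all pairs, because any pair within 200 kb implies that all intermediate gaps are also within 200 kb.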

Secondly, we ran a validation meta-analysis for the prioritized candidate risk loci by pooling CP-Ger with the other samples and filtered them again. We considered variants to be significantly associated with PD, if the pooled P-value in this meta-analysis was genome-wide significant (i.e., P < 5 × 10−8). Additionally, we considered variants as suggestively associated with PD if the P-value was <10−6 in the final meta-analysis. By default, we applied a fixed effects model. However, for variants with a high degree of heterogeneity, i.e., a P-value of Cochran’s Q P(Q) < 0.05 and a heterogeneity index I2 > 0.5, we applied a random effects model instead. We used the P-value correction method for shared controls by Zaykin and Kozbur to account for inflation when pooling summary statistics of CP-EA-mod and CP-EA-sev [34].
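The model-selection rule above can be sketched in a few lines. The sketch assumes inverse-variance weighting of per-study log odds ratios for the fixed effects model and the DerSimonian-Laird estimator for the random effects model; the source names neither estimator, so both are illustrative choices, and the function names are hypothetical. The correction for shared controls is not included.

```python
import math
from typing import Dict, List


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square distribution with integer df."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, sf = 1.0, math.exp(-y)
    else:
        a, sf = 0.5, math.erfc(math.sqrt(y))
    # Recurrence Q(a + 1, y) = Q(a, y) + y^a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        sf += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return sf


def meta_analyze(betas: List[float], ses: List[float]) -> Dict[str, object]:
    """Pool per-study log odds ratios; switch to random effects if heterogeneous."""
    w = [1.0 / se ** 2 for se in ses]          # inverse-variance weights (assumed)
    sw = sum(w)
    beta_fe = sum(wi * b for wi, b in zip(w, betas)) / sw
    q = sum(wi * (b - beta_fe) ** 2 for wi, b in zip(w, betas))  # Cochran's Q
    df = len(betas) - 1
    p_q = chi2_sf(q, df)
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0               # heterogeneity index
    model, beta, se = "fixed", beta_fe, math.sqrt(1.0 / sw)
    if p_q < 0.05 and i2 > 0.5:
        # DerSimonian-Laird between-study variance (assumed estimator)
        tau2 = max(0.0, (q - df) / (sw - sum(wi ** 2 for wi in w) / sw))
        w_re = [1.0 / (s ** 2 + tau2) for s in ses]
        beta = sum(wi * b for wi, b in zip(w_re, betas)) / sum(w_re)
        se = math.sqrt(1.0 / sum(w_re))
        model = "random"
    p = math.erfc(abs(beta / se) / math.sqrt(2.0))              # two-sided P-value
    return {"model": model, "beta": beta, "se": se, "or": math.exp(beta),
            "p": p, "q": q, "p_q": p_q, "i2": i2}
```

Under these assumptions a variant would count as genome-wide significant when the returned `p` is below 5e-8 and as suggestive when it is below 1e-6.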

Functional annotation

Variants were annotated using the Genehopper database (DB) [35]. Genehopper DB integrates data of many public sources by applying periodically executed extraction, transformation and loading (ETL) processes. Specifically, we used integrated datasets of linkage disequilibrium (LD), expression quantitative trait loci (eQTL) mappings, topologically associated domain (TAD) boundaries and GWAS to annotate our findings. Identified loci were annotated using LD information (correlation measures r2 and D’) from the European reference population (EUR) of 1000 Genomes Phase 3. eQTL mapping information was gathered from the Genotype-Tissue Expression project (GTEx) [36], Haploreg v4.1 [37], GRASP v2.0 [38], SCAN [39], seeQTL [40], and Blood eQTL Browser [41]. eQTLs are genetic variants that contribute to the variation of gene expression levels. Information about TAD boundaries derived by Hi-C experiments was taken from Dixon et al. [42]. A TAD is a genomic region defined by interactome boundaries that represents a spatial compartment in the genome. TADs are essentially cell-type independent and physical interactions occur more frequently inside a TAD. We utilized TAD boundaries to separate genes that reside in the same TAD as the associated LD block, i.e., genes in cis, from genes that reside outside the TAD of the associated LD block, i.e., genes in trans. The dataset used contained TADs with a length of ~ 853 kb on average (maximum length = 4.44 mega base pairs, minimum length = 0.8 kb). Variant consequence information was taken from Ensembl Variation DB and additionally, we annotated variants using the combined annotation dependent depletion (CADD) score. The CADD score combines several annotations into a single Phred-scaled pathogenicity score between 1 and 99 by applying the formula −10 × log10(variant_rank/number_of_variants) to the list of variants sorted by their pathogenicity. Accordingly, a score ≥ 10 indicates that the variant belongs to the 10% most deleterious substitutions in the human genome, and a score ≥ 20 indicates that the variant belongs to the 1% most deleterious substitutions. To illuminate the relationship to other traits and diseases, we took information on phenotype associations and links from the NHGRI-EBI Catalog of published GWAS [43].
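The Phred scaling used for the CADD score can be checked with a short worked example. The function below only evaluates the formula quoted above; the function name is hypothetical, and rank 1 is assumed to denote the most deleterious variant.

```python
import math


def phred_scaled(variant_rank: int, number_of_variants: int) -> float:
    """Evaluate -10 * log10(variant_rank / number_of_variants).

    Assumption: variants are ranked from most (rank 1) to least deleterious.
    """
    if not 1 <= variant_rank <= number_of_variants:
        raise ValueError("rank must lie between 1 and the number of variants")
    return -10.0 * math.log10(variant_rank / number_of_variants)
```

A variant at the boundary of the top 10% (rank/total = 0.1) receives a score of 10, and one at the boundary of the top 1% (rank/total = 0.01) receives a score of 20, which matches the interpretation given in the text.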



The genetic datasets of the four samples AgP-Ger (6,416,838 variants), AgP-NL (6,319,518 variants), CP-EA-mod (2,135,233 variants), and CP-EA-sev (1,809,893 variants) shared a total of 1,722,107 variants that were present in all datasets. In the first meta-analysis comprising 4102 cases and 8489 controls (because CP-EA-sev and CP-EA-mod share their controls, this control group was counted only once), 35 distinct loci met the pre-specified criterion of Pmeta < 10−5 (Fig. 1a, Supplementary Table 2).

Fig. 1

Description of the analysis workflow. a A two-step meta-analysis was performed because for the German CP case-control sample (SHIP cohorts), the complete GWAS dataset was not available to be directly included in the meta-analysis. Thus, in the first step, hypotheses on candidate gene regions were generated with the genome-wide summary statistics of the N.-W.-EU AgP-GWAS and the EU-USA CP-GWAS data alone. Next, genotypes of the prioritized candidate risk loci were obtained from the SHIP cohorts and added to the dataset of the first meta-analysis. This gave the validation meta-analysis dataset, which provided the highest statistical power. This increase in statistical power resulted in genome-wide significance of two loci. (N.W. EU North-West Europeans, EU-USA European-US Americans, NL The Netherlands, PD Periodontitis, AgP Aggressive Periodontitis). b In small samples such as the Dutch AgP case sample, chance effects are inherent and can skew the results, e.g., by random depletion of risk alleles in a particular sample. The misrepresentation of the allele frequency in a small case sample can have strong effects on the inter-sample heterogeneity. However, a low inter-sample heterogeneity is the prerequisite for applying the fixed effects model, in which the effects of each study sample are weighted by the sample size. Otherwise the random effects model must be applied, in which all study samples are weighted equally, irrespective of the sample size. Therefore, the two-step meta-analysis of a was repeated without the Dutch AgP case-control sample. In this analysis, we re-discovered the association of SNP rs729876 with a nearly identical association P-value. Additionally, a second association, of SNP rs16870060, passed the genome-wide significance threshold (P < 5 × 10−8)

Subsequently, these 35 loci were meta-analyzed including the additional sample of CP-Ger, which raised the overall sample size to 5,095 cases and 9,908 controls. In this meta-analysis, SNP rs729876, located downstream of SHISA9 on chromosome (chr) 16, intronic to the long intergenic noncoding RNA (lincRNA) LOC107984137, reached genome-wide significance (P = 1.21 × 10−8, OR = 1.23, 95% CI = [1.15–1.32], Table 1, Fig. 2). Three more variants had a P < 10−6. SNP rs11084095, located at SIGLEC5 (sialic acid binding Ig-like lectin 5), reached the second highest significance level in this study with P = 5.09 × 10−8 (OR = 1.17, 95% CI = 1.11–1.24). SIGLEC5 was earlier identified by us as a susceptibility locus for PD with genome-wide significance, tagged by the GWAS SNP rs4284742 [20]. SNP rs4284742 was not included in the variant set of CP-EA. The other variants were SNP rs9982623, intronic to MCM3AP (minichromosome maintenance complex component 3 associated protein; chr21), with P = 8.65 × 10−7 (OR = 1.23, 95% CI = 1.13–1.33) and SNP rs9984417, located between the processed pseudogene MAPK6PS2 and lincRNA AP000959.2 on chr21, with P = 9.31 × 10−7 (OR = 1.16, 95% CI = 1.09–1.23).

Table 1 Six variants were identified to be associated with AgP and CP at the significance level P < 10−6 in the meta-analyses using either all samples (all) or after excluding the small sample AgP-NL (no AgP-NL)
Fig. 2

Regional association plots of the identified loci with P < 10−6 for the lead variants. Because only a small number of variants were present in all samples, this plot is based on the results of the first-stage meta-analysis including AgP-Ger, AgP-NL, CP-EA-sev and CP-EA-mod and excluding CP-Ger. a SNP rs729876 at 16p13.12, b SNP rs11084095 at 19q13.41, c SNP rs9982623 at 21q22.3, and d SNP rs9984417 at 21q22.1

Meta-analysis without smallest study sample

In addition, we performed a meta-analysis without the smallest study sample, AgP-NL, for 1,749,472 variants (Fig. 1b). In the first step of this analysis (3931 cases and 5882 controls), 39 loci met our pre-specified selection criteria and were selected for follow-up in the second step (4924 cases and 7301 controls, Supplementary Table 3). With this approach we detected two variants with genome-wide significance. SNP rs729876 (P = 9.77 × 10−9, OR = 1.24, 95% CI = 1.15–1.34) had already been identified in the first meta-analysis with the Dutch sample included. Additionally, we detected the association of SNP rs16870060 with P = 3.69 × 10−9 (OR = 1.36, 95% CI = 1.23–1.51, Table 1, Fig. 3). SNP rs16870060 is intronic to the pseudogene MTND1P5 (mitochondrially encoded NADH: ubiquinone oxidoreductase core subunit 1 pseudogene 5, chr8). Three other variants were associated with P < 10−6. Two of them (SNPs rs4801882 and rs9982623) were concordant with the top variants, or with their high LD (r2 > 0.8) variants, of the above-mentioned analysis based on all samples, and their P-values differed only slightly. The third variant, SNP rs2064712, was associated with P = 5.29 × 10−7 (OR = 1.24, 95% CI = 1.14–1.35) and is located on chromosome 6 between the unprocessed pseudogene AL109933.3 and the lincRNA AL391361.2, 42 kb downstream of PLG (plasminogen). The PLG locus had earlier been identified by us as a susceptibility locus for PD in the same analysis samples in a candidate-gene association study, where it was tagged by the GWAS SNP rs1247559 (r2 < 0.2 with rs2064712) [22]. Rs1247559 had a P-value of 2.25 × 10−5 in the meta-analysis of AgP-Ger, CP-EA-mod and CP-EA-sev and was not pursued further in the current study. For the six variants described in this section, we compared the pooled P-value for CP-EA-mod and CP-EA-sev with and without correction for shared controls. A maximum difference of less than one order of magnitude indicated an acceptable level of inflation (Supplementary Table 4).

Fig. 3

Regional association plots of the loci with P < 10−6 for the lead variants identified after removing the smallest sample, AgP-NL. Because only a small number of variants were present in all samples, this plot is based on the results of the first-stage meta-analysis including AgP-Ger, CP-EA-sev, and CP-EA-mod and excluding CP-Ger. a SNP rs16870060 at 8q22.3 and b SNP rs2064712 at 6q26

In-silico characterization of selected variant effects

For the lead variants in the six distinct loci, we defined LD blocks by incorporating their high LD variants (r2 > 0.8) resulting in 52 variants, including the lead variants (Supplementary Table 5). These variants were examined for known associations with other traits, putative effects on genes and their impact on the pathogenesis of PD.

eQTL. Examination of the 52 variants indicated cis-regulatory and trans-regulatory effects on multiple genes for all loci (Supplementary Table 8). The subset of eQTLs reported in gastrointestinal tissue, bone tissue and blood is summarized in Table 2. The LD block tagged by the genome-wide significant lead SNP rs729876 at 16p13.12 indicated a trans-effect on gene expression of HOXC10 (homeobox C10, chr12), with P = 7 × 10−6 in peripheral blood monocytes. For the LD block of the second SNP with genome-wide significant association with PD, rs16870060 at 8q22.3, a trans-eQTL effect on the gene ORM1 (orosomucoid 1; chr9) was reported with P = 2 × 10−6 in peripheral blood monocytes.

Table 2 For each locus, cis and trans eQTLs of the lead variant and the high LD variants were compiled from public resources using the web application at

GWAS Catalog

To assess the relationships of the identified loci with other traits and diseases, we screened the LD blocks for entries in the NHGRI-EBI GWAS Catalog with P < 10−5. None of the LD blocks was associated with other phenotypes. However, the regions defined by the lead variant +/− 500 kb are associated with several other phenotypes with genome-wide significance, including lipoprotein levels (SNP rs118039278, P = 1 × 10−396, 231 kb distance to lead variant, r2 < 0.2) and coronary artery disease (SNP rs55730499, P = 3 × 10−154, 211 kb distance to lead variant, r2 < 0.2) at 6q26, blood protein levels (SNP rs12459419, P = 3 × 10−165, 399 kb distance to lead variant, r2 < 0.2), plasma plasminogen levels (SNP rs10412972, P = 8 × 10−9, 28 kb distance to lead variant, r2 < 0.2) and high density lipoprotein (HDL) cholesterol levels (SNP rs17695224, P = 7 × 10−16, 197 kb distance to lead variant, r2 < 0.2) at 19q13.41. Moreover, the previously reported association of GWAS SNP rs4284742 with AgP (4.7 kb distance to lead variant, r2 = 0.2) is also located at 19q13.41. The complete list of associations is shown in Supplementary Table 6.


We could identify 20 genes at 5q31.1, 29 genes at 8q22.3, 7 genes at 13q21.1, 19 genes at 16p13.12, 65 genes at 19q13.41, and 31 genes at 21q22.3 to be located inside the same TAD as the associated LD block at each locus (Supplementary Table 7).


We annotated the LD blocks with the CADD score (Supplementary Table 9) and compared the pathogenicity of the variants in the high LD blocks as shown in Supplementary Fig. 1.


This study identified two novel susceptibility loci of PD with genome-wide significance.

SNP rs729876 showed genome-wide significant association with PD in both meta-analysis samples, and the association was consistent with the same risk allele in all samples. The variant is located within the intronic region of the lincRNA LOC107984137, the function of which is unknown. Currently, it is not clear whether this SNP affects the function of this lincRNA and/or of other genes. Regarding expression effects, eQTL data indicated tissue-specific effects on the expression of the genes HOXC10 in blood monocytes and ZC3H7A (zinc finger CCCH-type containing 7A) and MYH11 (myosin, heavy chain 11) in the brain. Experimental work suggests that the chimeric protein β/MYH11 inhibits the function of RUNX1 (runt-related transcription factor 1) [44]. RUNX1 plays a role in hematopoiesis and bone formation [45, 46]. These eQTL effects did not pass the study-wide significance thresholds. We therefore point out the suggestive nature of these observations and emphasize the need for more extensive experimental follow-up before a mechanism that links the effect of the associated SNPs with the cis-regulation or trans-regulation of these genes can be proposed.

We were aware that the very small size of the Dutch AgP case sample (N = 171 cases) in conjunction with the comparably large Dutch control sample (N = 2607 controls) might have a potentially large impact on the inter-sample heterogeneity. Small samples are generally affected by random depletion, as well as enrichment, of risk alleles due to chance effects. This random misrepresentation of the allele frequency in the case sample can have strong effects on the results, especially if the control sample is large, as in the Dutch AgP case-control sample, and might skew the results of the meta-analysis. Therefore, we performed the meta-analysis again without the AgP-NL GWAS sample. In this analysis, SNP rs16870060 showed the strongest association in our meta-analysis, with P = 3.7 × 10−9. In the AgP-NL case sample, the frequency of the effect allele was lower than in the controls, which was the reverse of the allele frequencies in the other AgP and CP samples. This, together with the large size of the Dutch AgP control sample, neutralized the association results. This may indicate a false positive association finding, but we consider it more likely that either the small number of cases diluted the association, or rs16870060, which is with high likelihood not the causative variant of the association, shows a different recombination pattern in the Dutch case sample. SNP rs16870060 is located within an intron of the processed pseudogene MTND1P5, 13 kb downstream of the protein-coding gene ATP6V1C1 (ATPase H + transporting V1 subunit C1). eQTL data indicated no cis-effect of variants of this haplotype block on these genes, but suggested tissue-specific trans-effects on ARHGEF28 (rho guanine nucleotide exchange factor 28) in the liver and on ORM1 in monocytes. The biological function of ARHGEF28 is not yet clearly described. However, ORM1 is an interesting novel candidate gene for PD because it encodes a key acute phase plasma protein, the level of which increases during acute inflammation. The specific function of ORM1 has not yet been determined; however, it is probably involved in aspects of immunosuppression [47, 48]. In addition, ORM1 was experimentally shown to interact with PAI-1 (plasminogen activator inhibitor-1), and the binding of PAI-1 to ORM1 results in significant stabilization of its inhibitory activity toward plasminogen activators [49]. Accordingly, ORM1 may play a significant role in the regulation of fibrinolysis. We consider this relevant, because variants with putative cis-effects on the expression of PLG were previously found to be associated with both AgP and CP [21, 22] and an association with the PLG locus was among the top 6 associations in the current study. SNP rs16870060 showed an increased CADD pathogenicity score ( > 8), indicating some biological effect. However, the regulatory effect of rs16870060 or the associated putative causative variants on the expression of ORM1 in blood monocytes requires further experimental validation.

We note that among the suggestive associations with P < 10−6, the chromosomal region at 21q22.3 (tagged by lead variant rs9982623) spans a particularly broad area, which covers the genes LSS (lanosterol synthase), YBEY (ybeY metallopeptidase), and PCNT (pericentrin). Because this locus encompasses a cluster of SNPs with strong associations, we consider it less likely to be a false positive association than the other suggestive loci tagged by rs10491294 and rs11084095. eQTL data indicated strong cis-effects of variants of this LD block on these genes, as well as on the genes proximal and distal to this associated region, SPATC1L (spermatogenesis and centriole associated 1-like) and PRMT2 (protein arginine N-methyltransferase 2), respectively.

A limitation of this study was that not all case-control samples were employed in the initial meta-analysis and that the samples were imputed on different genome references. Future studies are required to address these shortcomings and to increase the number of investigated variants and the statistical power, which may allow the identification of additional susceptibility loci. Although we could have identified the genome-wide significant loci by excluding the smallest sample, AgP-NL, completely from the analysis, we decided to make use of the Dutch case sample because it can provide some additional statistical power. Thus, we included the Dutch case-control sample in the first meta-analysis, because this approach gave the highest statistical power.

In conclusion, the current study identified two novel susceptibility variants for PD, increasing the number of genome-wide significant associations of common susceptibility variants to five (previously identified variants are located at DEFA1A3, SIGLEC5 [20], and GLT6D1 [18]). These novel variants are associated at a genome-wide significance level, providing statistical evidence for the relevance of these loci in the disease etiology. The putative causative variants underlying the associations, as well as their target genes, need to be identified. For the latter, the lincRNA LOC107984137 and ORM1 could provide the first targets. In addition, we could replicate the previously reported associations of SIGLEC5 and PLG with PD, using a set of different genetic variants, adding evidence to these previously reported risk genes of PD. We further suggest that the chromosomal regions at 21q21.1 (tagged by rs9984417) and 21q22.3, spanning the genes LSS, YBEY, and PCNT, harbor putative susceptibility variants for PD.