A computational and structural analysis of germline and somatic variants affecting the DDR mechanism, and their impact on human diseases

DNA-Damage Response (DDR) proteins are crucial for maintaining the integrity of the genome by identifying and repairing errors in DNA. Variants affecting their function can have severe consequences since failure to repair damaged DNA can result in cells turning cancerous. Here, we compare germline and somatic variants in DDR genes, specifically looking at their locations in the corresponding three-dimensional (3D) structures, Pfam domains, and protein–protein interaction interfaces. We show that somatic variants in metastatic cases are more likely to be found in Pfam domains and protein interaction interfaces than are pathogenic germline variants or variants of unknown significance (VUS). We also show that there are hotspots in the structures of the ATM and BRCA2 proteins where pathogenic germline variants, and recurrent somatic variants from primary and metastatic tumours, cluster together in 3D. Moreover, in the ATM, BRCA1 and BRCA2 genes from prostate cancer patients, the distributions of germline benign, pathogenic, VUS, and recurrent somatic variants differ across Pfam domains. Together, these results provide a better characterisation of the most recurrently affected regions in DDRs and could help in the understanding of individual susceptibility to tumour development.

IM 20 and OmniPath 21 databases using the Metascape tool 22. This search resulted in a DDR interaction network (NetDDR) with 229 DDR-hits and 1,182 non-DDR-hits, joined by 38,494 edges, which corresponded to the more connected network according to Metascape 22 (for a detailed description of NetDDR see "Materials and methods").
We first used the NetDDR interaction network to investigate the importance of the DDR protein-coding genes in terms of their protein interactions and accumulation of genetic variants. We found that recurrent somatic variants (i.e. occurring in more than one sample) annotated in COSMIC are found in more than 93% of both DDR and non-DDR interacting proteins in the NetDDR network (Fig. 1, panel a). However, germline variants annotated in ClinVar were identified in DDR genes (40%) more frequently than in non-DDR genes (21%) (Fig. 1, panel b). This finding suggests there is some bias towards the most studied genes, human variations, and phenotypes with supporting evidence annotated in the ClinVar database, whose primary source of information is published studies. In contrast, the COSMIC database documents somatic mutations in human cancers not only from published studies, but also from whole genome and exome sequencing experiments that identify variants more homogeneously across all human genes.
In addition, the largest set of ClinVar annotations are VUS variants, followed by pathogenic and benign, thus showing another example of bias in the dataset. Other authors have observed some of the biases associated with ClinVar, in particular the inflated pathogenic variant profiles that make it difficult to study variant penetrance and disease prevalence in endocrine tumour syndromes 23. In the COSMIC database we found fewer variants identified in metastatic tumours than in primary tumours (Fig. 1, panel a), probably due to the low number of individuals studied with advanced cancer.
Also shown in Fig. 1 are the numbers of somatic and germline variants in the cancer-predisposition genes ATM, BRCA1, BRCA2, MLH1, MSH2, and MSH6 (Fig. 1, panels c and d). These DDR genes are among the most affected in the COSMIC and ClinVar datasets (Supplementary Fig. S5). The majority of recurrent somatic variants, observed in more than two samples, were from primary tumours. In addition, most germline variants in these cancer-predisposition and DDR genes correspond to pathogenic and VUS, although their ratios varied depending on the gene analysed. Moreover, DDR cancer-predisposition genes are highly connected and central in the NetDDR interaction network. Indeed, the average node degree (AvrNodDeg) for these six genes is 166, while in the complete DDR-hits AvrNodDeg = 87. As expected, closeness centrality (ClossCentl), which indicates how close a node is to all other nodes in the network, is higher on average for the DDR-hits (ClossCentl = 0.46) compared to the complete NetDDR interaction network (ClossCentl = 0.43) (for more details see Supplementary Table S1). It has been suggested that mutations in regulatory regions of highly connected protein-coding genes in protein-protein interaction and regulatory networks have a higher functional impact than those targeting peripheral genes in the network 24.
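The two network measures quoted above, node degree and closeness centrality, can be illustrated on a toy graph. The following is a minimal standard-library sketch, not the Metascape computation: the edge list and node names are hypothetical, and closeness is taken as (n − 1) divided by the sum of shortest-path lengths, which assumes a connected, unweighted graph.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def node_degree(graph, node):
    """Number of direct interaction partners of a node."""
    return len(graph[node])


def closeness_centrality(graph, node):
    """(reachable nodes) / (sum of shortest-path lengths), via breadth-first search."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        cur = queue.popleft()
        for nb in graph[cur]:
            if nb not in dist:
                dist[nb] = dist[cur] + 1
                queue.append(nb)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


# Hypothetical toy edges, for illustration only (not taken from NetDDR).
EDGES = [("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")]
GRAPH = build_graph(EDGES)
```

In this toy graph the hub node "A" has both the highest degree and the highest closeness, which is the pattern reported for the six cancer-predisposition genes.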
In the sections below we also investigated the effects of protein length, domain composition, and protein 3D structure on the accumulation of germline and somatic variants in DDR protein-coding genes.
DDR genes show a different pattern of accumulation of germline and somatic variants as a function of protein length.
First, we studied the distribution of the protein length in the DDR and non-DDR groups (Fig. 2, panel a). Although the average protein lengths for DDR (745 a.a.) and non-DDR (675 a.a.) are very similar, we compared the accumulation of somatic and germline variants in the most affected cancer-predisposition genes ATM, BRCA1, BRCA2, MLH1, MSH2, and MSH6 as a function of protein length (Supplementary Fig. S6). These cancer-predisposition genes show a similar pattern of accumulation of somatic and germline variants with and without normalization by protein length.
We also analysed the distribution of germline and somatic variants, normalized by protein length, using the complete dataset of DDR and non-DDR proteins (Fig. 2 and Fig. 3, respectively). The proteins in our dataset range in length from 44 a.a. (TMSB4X) to 7,968 a.a. (OBSCN). The boxplot in Fig. 2, panel b, shows that the mean number of germline variants per protein length in the DDR proteins, according to the ClinVar database, is 0.287, which is significantly higher than the mean of 0.106 for the non-DDR proteins (P-value = 5.0 × 10⁻⁴).
Panel c shows the differences when the variants are grouped by the different categories (i.e., benign/likely benign, pathogenic/likely pathogenic, and VUS). A few DDR proteins appear above the 75th percentile, indicating accumulation of a high number of germline variants per protein length (i.e. BRCA1, BRCA2, MLH1, MSH2, MSH6) (Supplementary Fig. S5). We therefore hypothesize that the accumulation of a large number of ClinVar annotations in these DDRs could bias research findings or limit the generalizability of the results. Moreover, Fig. 2, panel c, shows a greater variability in the DDR distributions and also shows that the medians (at 95% confidence interval) tend to be larger than for the non-DDR group, with P-values being 3.0 × 10⁻² for benign, 9.3 × 10⁻³ for pathogenic, and 6.6 × 10⁻⁶ for VUS germline variants.
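The length normalization and group comparison described above can be sketched as follows. This is a hedged illustration only: the figure legends report a T-test, whereas this sketch swaps in a two-sided permutation test on the difference in means so that it needs nothing beyond the standard library. The function names and any input values are hypothetical and are not the study data.

```python
import random
from statistics import mean


def variants_per_residue(n_variants, protein_length):
    """Variant count normalized by protein length (variants per amino acid)."""
    return n_variants / protein_length


def permutation_p_value(group_a, group_b, n_perm=10000, seed=0):
    """Two-sided permutation test on the absolute difference in group means.

    Stand-in for the T-test named in the figure legends (an assumption of
    this sketch, chosen to avoid external dependencies).
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

A typical use would compute `variants_per_residue` for every protein in each group and pass the two resulting lists to `permutation_p_value`.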
In addition, we analysed the accumulation of germline and somatic variants in the different pathways where DDRs are involved. We found that the Homologous Recombination, Fanconi Anaemia, and Mismatch Repair pathways are the most affected by germline mutations (Fig. 2, panel d). The analysis of recurrent somatic variants annotated in COSMIC also indicated the same affected pathways (Fig. 3, panel c).
The boxplots in Fig. 3 (panels a and b) show similar medians and dispersion of the somatic variants, according to the COSMIC database, per protein length in DDR and non-DDR groups (P-value = 0.36; metastatic: P-value = 0.35, primary: P-value = 0.35). However, this pattern of variability is different to the one previously observed in the germline variants annotated in ClinVar. In this case, ATM, ATRX, CHEK2, ERCC2, MLH1, MSH6, and SMARCA4, among other genes, are representative DDRs above the 75th percentile that indicate accumulation of somatic variants from the COSMIC database. Moreover, according to panel c in Fig. 3, only the Nucleotide Excision Repair (NER) pathway appears to be more affected by somatic variants in primary tumours than by germline pathogenic variants (Fig. 2, panel d). Therefore, our analyses of accumulation of somatic variants in the NER pathway coincide with the results of other authors in that this pathway shows an increased contribution of a somatic mutational pattern (COSMIC mutational signature 8) recurrently observed in various cancer types 25. The disruption of this pathway could therefore potentially drive carcinogenesis and accelerate aging.

DDR germline and somatic variants occur differently within Pfam domains and protein interaction interfaces.
We also investigated the occurrence of DDR germline and somatic variants, according to the ClinVar and COSMIC databases, respectively, in Pfam domains and protein interaction interfaces. The curated lists of germline and recurrent (≥ 2 samples) somatic variants were annotated using the Structure-PPi system 26 and the results are summarized in Fig. 4 and in Supplementary File 2 and Supplementary Table S2.
www.nature.com/scientificreports/
The numbers of variants annotated in the different classes were: for germline variants, (1) 10,301 pathogenic and likely pathogenic, (2) 1,117 benign and likely benign, and (3) 28,248 VUS; for somatic variants, (1) 5,795 variants in primary tumours and (2) 2,030 variants in metastasis. These results indicate that germline variants in the pathogenic and VUS classes have a similar distribution across Pfam domains (P-value = 0.13) (Fig. 4, panel a). In particular, they have a very similar percentage of variants mapped onto protein interaction interfaces (12.9% and 11.4%, respectively), which is higher than for benign variants (7.2%). Interestingly, the percentage of somatic variants in metastatic tumours found in both Pfam domains and protein interaction interfaces (Fig. 4, panel b) is significantly higher than in germline pathogenic (P-value < 0.0001) and VUS variants (P-value < 0.0001) (Fig. 4, panel a). The 47.6% of metastatic variants affecting Pfam domains is nearly double the value of 25% for germline pathogenic and VUS variants (Fig. 4, panels a and b). Likewise, the percentage of metastatic variants affecting protein interaction interfaces (36.4%) is about threefold higher than the values of 13% and 11% in pathogenic and VUS germline variants, respectively. The Pfam domains most affected by germline pathogenic variants were: MutS_III (Pfam code: PF05192, 371 variants), MutS domain V (Pfam code: PF00488, 323 variants), and BRCA2 (Pfam code: PF00634, 224 variants). MutS domains are found in the human MSH2/MSH6 proteins implicated in hereditary non-polyposis colorectal carcinoma (HNPCC), while BRCA2 is a known tumour suppressor gene. For the BRCA2 or FANCD1 proteins, their association with the Fanconi anaemia (FANC) protein complex is well known 27. Other Pfam domains affected by germline pathogenic variants are: BRCA2-helical (Pfam code: PF09169, 159 variants), BRCA2-OB1 (Pfam […]
On the other hand, Pfam domains affected by somatic variants identified in metastatic tumours also included the known tumour suppressor genes P53 and PTEN: P53 DNA-binding (Pfam code: PF00870, 408 variants), PTEN-C2 (Pfam code: PF10409, 44 variants) and P53 tetramer (Pfam code: PF07710, 27 variants). A description of the affected Pfam domains discussed in this article and the associated functions is presented in Table 1.
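The comparison above rests on two steps: deciding whether a variant position falls inside an annotated region, and comparing the resulting fractions between variant classes. The sketch below illustrates both under stated assumptions. The text does not name the test behind the reported P-values, so a pooled two-proportion z-test is used here as a stand-in; the interval coordinates and counts in any call are hypothetical, and the Structure-PPi annotation itself is not reproduced.

```python
import math


def in_any_interval(position, intervals):
    """True if a 1-based residue position lies inside any (start, end) interval."""
    return any(start <= position <= end for start, end in intervals)


def fraction_in_domains(positions, intervals):
    """Fraction of variant positions that fall inside the given intervals."""
    hits = sum(in_any_interval(p, intervals) for p in positions)
    return hits / len(positions)


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for equality of two proportions (pooled variance).

    Assumed stand-in: the source does not state which test was used.
    """
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, `fraction_in_domains([5, 15, 25, 120], [(10, 30), (100, 150)])` counts three of the four hypothetical positions as lying inside a domain.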
Overall, these observations agree with the hypothesis of two-hit events: first a germline variant produces a predisposition to develop a tumour, while a second somatic event increases this probability in a specific organ and triggers tumour development.
Co-localization of pathogenic germline and somatic variants in protein sequence and 3D structure in DDR genes ATM, BRCA1, BRCA2, and MUTYH.
In a previous work, we described ATM, BRCA1, BRCA2, and MUTYH as recurring genes with mutations in the PROREPAIR-B PCa cohort 14. Here, as use cases in DDR research, we studied these genes and analysed more pathogenic germline variants identified in different cohorts of advanced PCa reported in the literature (Table 2), recurrent somatic and germline variants in PCa, hotspot positions in different tumour types collected from cBioPortal (https://www.cbioportal.org) (Supplementary File 1), and the TCGA-PanCancer study of pathogenic germline variants in 10,389 adult cancers 28.
Figure 5 shows the distribution of these different types of variants across each protein sequence. We observe that the vast majority of variants in ATM, BRCA1, and BRCA2 are localized in flexible and/or intrinsically disordered regions (IDRs), outside Pfam domains (Fig. 5, panel a). IDRs are common in protein interaction interfaces. In fact, the NetDDR interaction network revealed a high number of interactions involving these specific DDRs: ATM (AvrNodDeg = 228), BRCA1 (AvrNodDeg = 299), BRCA2 (AvrNodDeg = 94), and MUTYH (AvrNodDeg = 17). Interestingly, the high number of interactions in ATM and BRCA1, in comparison with the moderate number in BRCA2 and MUTYH, coincides with a higher predicted consensus disorder content in ATM and BRCA1 (16.4% and 82.3%, respectively) than in BRCA2 and MUTYH (3.2% and 7.5%, respectively). The percentages of disordered regions are as given in the MobiDB database (https://mobidb.bio.unipd.it).
Moreover, we observed a few cases where PCa germline and recurrent somatic variants either co-localize at the same amino acid position, or are close neighbours (e.g., red and cyan dots in Fig. 5). Based on this finding, we expanded the analysis and compared the distribution of germline and somatic variants from different datasets as shown in Fig. 6 (i.e., germline variants identified in different cohorts of mCRPC collected from the literature (PubMed), germline variants from TCGA, germline variants from ClinVar, and somatic variants from COSMIC). Germline and somatic variants are distributed along the full length of the protein sequence, although the shape […] Table 3 shows significant differences (P-value < 0.05) between the pathogenic germline variants versus their benign and VUS counterparts in BRCA2. The same significant tendency was observed in BRCA2 for the distribution of VUS versus benign, both in the ClinVar dataset. No significant differences were observed in the distribution of somatic variants between primary and metastatic tumours.
In ATM (Fig. 6 […] the FATC domain (Pfam code: PF02260). The density of benign variants in this interaction region is different from that of pathogenic and VUS variants. Somatic variants from COSMIC in primary and metastatic tumours also tend to accumulate in the same interaction region. A different scenario is observed in BRCA1 (Fig. 6, panel b) and BRCA2 (Fig. 6, panel c). In BRCA1, the pathogenic germline variants from ClinVar show a uniform distribution, but in different cohorts of PCa they accumulate over a flexible region connecting a RING-type zinc finger (zf-C3HC4; Pfam code: PF00097) and the serine-rich region associated with BRCT (BRCT_assoc; Pfam code: PF12820) N-terminal domains. By contrast, pathogenic variants from TCGA accumulate in the C-terminus, which is the interaction region with FANCJ (IntAct accessions: EBI-3509650 and EBI-349905). In BRCA2, pathogenic variants in different PCa cohorts, TCGA, and ClinVar datasets show a peak over the central flexible region including the BRCA2 repeats (Pfam code: PF00634), which constitute the interaction region with RAD51 (IntAct accessions: EBI-79792 and EBI-15557721). Somatic variants in metastatic tumours accumulate in the linker and flexible region between the central BRCA2 repeats and the C-terminus of the protein. These findings reinforce the idea of a synergistic effect between germline and somatic variants, and that somatic events tend to accumulate in protein interaction regions such as IDRs. Also, according to the mutational data available, for some DDR members it is possible to identify protein regions that accumulate pathogenic variants rather than benign and VUS variants. Remarkably, disrupting or impairing these protein interactions is likely to have a marked impact on their function, since these cancer-predisposition genes are highly connected and central in the NetDDR interaction network.
Table 3. Statistical analysis of the distribution of the variants across the protein sequence. Statistical significance of the differences between the density distributions in Fig. 6, as calculated using the GLDEX package (https://CRAN.R-project.org/package=GLDEX). Statistically significant differences (shown in bold) are indicated where the P-value is less than 0.05.
We also studied the co-localization of variants that are discontinuous along the sequence but proximal in the protein 3D structure. The pathogenic germline variants in mCRPC, recurrent germline and somatic variants, and hotspot positions in different tumour types were mapped onto the available ATM, BRCA1, BRCA2, and MUTYH structures (Fig. 7), in order to find 3D-clusters of variants. We identified different clusters containing at least three individual positions within spheres with a 15-30 Å diameter that accommodate germline and somatic variants (Fig. 7 and Table 4). Variants included in the same 3D-cluster can be considered part of a continuum of cancer-promoting variants, each with a relatively small but additive effect.
In the ATM protein (Fig. 7, panel a), according to the low-resolution Cryo-EM structures (PDB ID: 5np0, 5np1 at 5.70 Å resolution; a.a. 1-3056) (Table 5), we identified four 3D-clusters which are listed in Table 4. We found other pairs of pathogenic and recurrent variants that were close to each other, but without a significant P-value (P-value ≥ 0.05) as computed by the mutation3D method 29. The mutation3D program computes significance by using an iterative bootstrapping algorithm to calculate a background distribution of cluster sizes arising from a random placement of an equivalent number of substitutions in the selected protein structure. For each cluster in the input data, P-values are computed empirically as the percentile rank of its "maximum cluster diameter or CL" among all randomized clusters containing the same number of amino acid substitutions (see "Materials and methods"). As an example, the PCa pathogenic germline variant p.R3047*, recurrent somatic variants p.N2875K and p.N2875S, and the hotspot position R3008 in different tumour types, are within a sphere of diameter = 25.6 Å in the ATM protein 3D-structure (Fig. 7, panel a).
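The idea of an empirical P-value for a cluster diameter can be sketched as below. This is a deliberately simplified illustration and not the mutation3D implementation: it draws random residue sets of the same size and asks how often they are at least as compact as the observed set, without re-running any clustering on the randomized data. The coordinate dictionary, function names and iteration count are assumptions of the sketch.

```python
import math
import random
from itertools import combinations


def cluster_diameter(coords):
    """Largest pairwise Euclidean distance among a set of 3D points."""
    return max(math.dist(a, b) for a, b in combinations(coords, 2))


def empirical_p_value(residue_coords, cluster_residues, n_iter=1000, seed=0):
    """Estimate how often a random residue set is as compact as the observed one.

    residue_coords maps residue number -> (x, y, z), e.g. C-alpha positions.
    Returns (observed diameter, empirical P-value with add-one correction).
    """
    rng = random.Random(seed)
    observed = cluster_diameter([residue_coords[r] for r in cluster_residues])
    all_residues = sorted(residue_coords)
    k = len(cluster_residues)
    at_least_as_tight = 0
    for _ in range(n_iter):
        sample = rng.sample(all_residues, k)
        if cluster_diameter([residue_coords[r] for r in sample]) <= observed:
            at_least_as_tight += 1
    return observed, (at_least_as_tight + 1) / (n_iter + 1)
```

With real data, `residue_coords` would be read from a PDB entry such as those listed in Table 5; parsing is left out here to keep the sketch self-contained.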
In the MUTYH protein, no 3D-clusters of somatic and germline variants were identified. However, according to the MUTYH crystal structure (PDB ID: 3n5n at 2.30 Å resolution; a.a. 76-362) and the solution NMR structure (PDB ID: 1x51; a.a. 356-497), pairs of variants are located in the same spatial region (Fig. 7, panel d).
In the MUTYH crystal structure (PDB ID: 3n5n at 2.30 Å resolution; a.a. 76-362) the pathogenic germline variant p.Y176C is located in the same spatial region as the recurrent somatic variant p.S252T in PCa (diameter = 26.4 Å). The solution NMR structure (PDB ID: 1x51; a.a. 356-497) indicated that germline variant p.G393D is located in the same surface region as the hotspot positions G393 and R423 in different tumour types (diameter = 24.5 Å) (see Fig. 7, panel d).
These findings suggest that the accumulation of variants in these spatial regions impairs protein interaction interfaces, and hence the biological function of the protein. The co-localization and 3D-clustering of germline and somatic variants onto the protein 3D-structure have also been applied by other authors to link rare predisposition variants to functional consequence 30.
[…], which compares the mutational profiles of genes across cancer genomes with their natural germline variation across healthy individuals. DiffMut 31 uses the 1000 Genomes data as a background mutation rate. Data about DiffMut uEMD score and q-value for DDR and non-DDR genes are provided as Supplementary Files S3 and S4, respectively. Figure 8 shows the DiffMut results for DDR and non-DDR genes in 33 different cancer types. The non-DDR genes have significantly higher uEMD scores than DDR genes in most cancer types according to a Wilcoxon rank-sum test (Fig. 8, panels a and b). This could mean either that non-DDR genes have more cancer somatic mutations compared to the background rates of mutation in healthy individuals or that they have fewer background variants. As noted above, non-DDR genes accumulate fewer germline variants than DDR genes, which is in agreement with the DiffMut data.
We also show a heatmap with DiffMut uEMD scores for the cancer-predisposition genes ATM, BRCA1, BRCA2, MLH1, MSH2, and MSH6 (Fig. 8, panel c). For comparison, we included the TP53 gene, which has a uEMD score greater than 1 in 25 out of 33 tumour types. It is important to mention that TP53 was excluded from the study of the selected somatic variants identified in DDR genes extracted from the COSMIC database (Supplementary Fig. S5), and from the analysis of their accumulation in different biological pathways (Fig. 3, panel c), due to its high number of variants and for better visualization. Genes that score 0 are either those that were never observed to have a somatic mutation across the tumour samples or those that had higher background mutation rates than somatic mutation rates.
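The behaviour of the score described above (a value of 0 when a gene has no somatic mutations or when its background profile dominates) is what a one-directional earth mover's distance between two histograms produces. The sketch below is a simplified illustration of that idea and is not the DiffMut code: the bin layout, the normalization to unit mass and the function names are all assumptions, and the histograms are expected to share the same bins ordered from low to high mutation burden.

```python
def normalise(hist):
    """Scale a histogram to unit mass; an all-zero histogram stays all zero."""
    total = sum(hist)
    return [h / total for h in hist] if total else [0.0] * len(hist)


def one_directional_emd(tumour_hist, healthy_hist):
    """Sum, from the highest bin downwards, the positive part of the running
    (tumour - healthy) tail difference.

    Only tumour mass lying to the right of the healthy distribution adds to
    the score, so a tumour profile that is never shifted right scores 0.
    """
    t = normalise(tumour_hist)
    h = normalise(healthy_hist)
    score = 0.0
    carry = 0.0  # running tail difference, may be negative
    for t_i, h_i in zip(reversed(t), reversed(h)):
        carry += t_i - h_i
        score += max(carry, 0.0)
    return score
```

In this simplified form, swapping the two arguments generally changes the result, which is what distinguishes it from a symmetric earth mover's distance.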
Overall, DiffMut data is in agreement with our previous results for DDR and non-DDR genes, and with the assumption that genic regions with a high ratio of rare variants to common ones are more intolerant to functional variation, so changes in these regions are more likely to be responsible for diseases.

Discussion
DNA repair pathways protect cells against genomic damage; disruption of these pathways can contribute to the development of cancer. In this study we show that an integrative structural analysis of affected regions in the DDR protein-coding genes can help identify susceptibility to tumour development. Based on a combined analysis of the NetDDR interaction network and the curated list of germline and somatic variants mapped onto 1,411 nodes, we first observed that the percentages of DDR and non-DDR hits with annotations in COSMIC are very similar, while in ClinVar the percentage of DDR-hits with annotations is twofold higher than in non-DDR. This first observation suggested some bias in the ClinVar annotations towards the most studied genes reported in the literature, which may bias research findings or limit generalizability of the results.
A second observation from this analysis was the importance of the highly connected DDR genes associated with cancer-predisposition: ATM, BRCA1, BRCA2, MLH1, MSH2, and MSH6.These DDR genes are among the most affected in the COSMIC and ClinVar datasets, and their encoded proteins are central in the interaction network.
Proteins in the NetDDR network range in length from 44 to 7,968 amino acid residues; we therefore analysed the accumulation of germline and somatic variants as a function of protein length. We found that DDR hits accumulate a significantly higher number of germline variants per protein length than non-DDR hits (P-value = 5.0 × 10⁻⁴). Indeed, accumulation of variants is not uniform along the complete list of DDR proteins, showing a high number of germline variants accumulated in a few DDR hits. These DDR hits (BRCA1, BRCA2, MSH2, MLH1, and MSH6) coincide with highly connected and central proteins in the NetDDR interaction network. On the other hand, the numbers of somatic variants per protein length are similar in the DDR and non-DDR groups (P-value = 0.36). However, the genes ATM, ATRX, CHEK2, ERCC2, MLH1, MSH6, and SMARCA4 accumulate somatic variants above the 75th percentile of the distribution. Our analysis of germline and somatic variants in the different pathways where DDRs are involved showed that the Homologous Recombination, Fanconi Anaemia, and Mismatch Repair pathways are the most affected by both types of mutations, whereas the Nucleotide Excision Repair pathway appears to be more affected by somatic variants in primary tumours than by germline pathogenic variants. The latter finding is in agreement with the results of other authors in that it shows an increased contribution of a somatic mutational pattern 25.
Relatively few articles have investigated the structural and co-localization relationships between germline and somatic variants, and none have focused on a specific protein family 5. Hence, in this article, we analysed different structural features: Pfam domains, 3D protein interfaces, protein flexible and/or intrinsically disordered regions (IDRs), and 3D clustering of variants. Interestingly, we discovered that 47.6% of somatic variants in metastatic tumours occur within Pfam domains, which is nearly double the value of 25% for germline variants. The percentage of somatic variants in metastatic tumours affecting 3D protein interfaces (36.4%) is threefold higher than the value of 11% to 13% in germline variants. Furthermore, it appears that accumulation of both germline and somatic variants within Pfam domains and 3D protein interfaces results in a synergistic effect that damages protein function.
In this article, as use cases in the DDR study, we investigated the ATM, BRCA1, BRCA2 and MUTYH genes, previously characterized in the mCRPC PROREPAIR-B cohort 14. We analysed pathogenic germline variants identified in different cohorts of advanced PCa reported in the literature, recurrent somatic and germline variants in PCa, as well as hotspot positions in different tumour types collected from cBioPortal and the TCGA-PanCancer study of pathogenic germline variants in 10,389 adult cancers 28. Using this large dataset, we observed that the vast majority of variants in ATM, BRCA1 and BRCA2 are located in flexible and/or IDRs, which are common in protein interaction interfaces. Moreover, it is possible to identify protein regions where germline pathogenic variants accumulate more than benign and VUS variants. These results together reinforce the hypothesis that there is a synergistic effect between germline and somatic variants affecting protein function and interactions 32.
It is worth noting that a recent study into aggressive PCa, not limited to DDR genes, proposed 90 (out of 266, 34%) functionally related genes containing both germline and somatic variants 33. The analysis used germline variants from genome-wide association studies (GWAS) and somatic variants identified in a cohort of 305 patients with aggressive tumours downloaded from The Cancer Genome Atlas (TCGA). Only 11 genes (i.e., ATM, BRCA1, CCNH, CHEK2, FANCC, GADD45A, HERC2, NSMCE2, PARG, RAD23B, RAD51B) are common to both our and their analyses. Note that ATM was not in the list of 266 genes with germline variants but in the differentially expressed genes associated with aggressive PCa and having somatic variants (see Supplementary Tables SA and S1A by Mamidi et al. 33). These authors also identified different signalling pathways enriched for germline and somatic variants in which ATM is involved (i.e., PCa (P < 5.81 × 10⁻⁶), MSP-RON (P < 1.54 × 10⁻⁵), and P53 (P < 1.24 × 10⁻⁴)). Interestingly, these authors claim that genes containing germline variation did not have a high frequency of somatic variants, which is opposite to the findings we present here for the DDR genes. In our NetDDR network study, 118 (out of 229, 52%) protein-coding genes, including ATM, contain germline and somatic variants (see Supplementary File 1). The contradictory results are explained, in part, by the large number of non-DDR genes considered in the analysis done by Mamidi et al. 33.
Co-localization and 3D-clustering of germline and somatic variants on the protein 3D-structure have previously been used to link rare predisposition variants to functional consequences 30. Here we identified different clusters of variants within spheres with a 15-30 Å diameter that accommodate germline and somatic variants in ATM, BRCA1 and BRCA2, and propose that variants in the same 3D-cluster are part of a continuum of cancer-promoting changes, each with a relatively small but additive effect. The integrative structural analysis discussed in this article provides a comprehensive characterisation of affected regions in DDRs and can help in the understanding of an individual's susceptibility to tumour development. One limitation of this approach is that some proteins may have incomplete, or no, structural coverage in the PDB, even when considering structures of homologous proteins, so no 3D cluster information can be obtained.
Overall, our findings indicate a synergistic effect between germline and somatic variants affecting protein domains and interactions in the DDR family genes. In particular, Pfam domains and protein interaction interfaces are more likely to be affected by somatic variants than by pathogenic germline variants or VUS, suggesting that the emergence of a second somatic "hit" damages the protein's function. On the other hand, we documented 3D clusters of pathogenic germline variants, recurrent somatic variants from primary and metastatic tumours, and hotspot positions in ATM and BRCA2. Proper structural characterization of germline and somatic variants is needed to better stratify cancer patients with affected driver genes.

Materials and methods
Protein-protein interaction network of DDRs.
The aggregated protein-protein interaction (PPI) network for the 276 DDRs (DDR-hits) was constructed based on data from the BioGRID 19, InWeb_IM 20 and OmniPath 21 databases. These databases were searched using the Metascape tool 22. The initial PPI network generated in the first phase comprised 1,466 nodes (276 DDR-hits and 1,191 non-DDR-hits) and 38,494 edges. However, after applying a filtering step implemented in Metascape 22, only 268 of the 276 DDR proteins had at least one interaction among themselves, and 229 of these 268 were considered to have enriched interactions or be "over-connected". According to Metascape, an "over-connected" protein has more than two interactions with other DDR-hits and an over-connection p-value < 0.01 22. In brief, Metascape produces an initial interaction map which is then pruned in a filtering step to detect. Later, for each connected component, the MCODE algorithm is iteratively applied to identify densely connected elements, excluding less connected proteins to identify the more relevant DDR hits, reduce false positives and prevent needless expansion and ran- […]
[…] 1n0w, 3n5n, 3n5n) and solution structures (PDB id: 6hka, 1jm7, 1oqa, 1x51, 5dpk) of different regions of these proteins are available. Furthermore, low-resolution electron microscopy models (PDB id: 5np0, 5np1) covering the complete sequence of ATM have been published. Additional data about the resolution, protein chains, and amino acid regions are shown in Table 5.
Comparison of germline and somatic mutational profiles across 33 different cancer types.
We retrieved mutational data from TCGA in MAF format, generated by the MuTect2 workflow, through the NCI Genomic Data Commons data portal (https://portal.gdc.cancer.gov/). Then, we ran DiffMut (https://diffmut.princeton.edu/) 31 with default parameters to compare the germline and somatic mutational profiles across 33 tumour types annotated in TCGA. The 1000 Genomes data was used as a background mutation rate. All TCGA mutations, with DiffMut uEMD score and q-value annotations, were filtered to retain the 229 DDR and 1,182 non-DDR genes. A Wilcoxon rank-sum test was used to assess statistical differences between germline and somatic mutational profiles in tumour samples.
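The Wilcoxon rank-sum test mentioned above can be sketched with the standard library alone. The version below is a minimal illustration using the normal approximation, without tie or continuity corrections, so its P-values may differ slightly from those of a statistics package; the function names are assumptions, and the inputs would be, for example, uEMD scores of DDR and non-DDR genes in one tumour type.

```python
import math


def rank_with_ties(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_rank_sum(x, y):
    """Return (z, two-sided P-value) from the normal approximation."""
    n_x, n_y = len(x), len(y)
    ranks = rank_with_ties(list(x) + list(y))
    w = sum(ranks[:n_x])  # rank sum of the first sample
    mean_w = n_x * (n_x + n_y + 1) / 2
    sd_w = math.sqrt(n_x * n_y * (n_x + n_y + 1) / 12)
    z = (w - mean_w) / sd_w
    return z, math.erfc(abs(z) / math.sqrt(2))
```

A negative z indicates that the first sample tends to have lower values than the second.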

Figure 1. Accumulation of germline and recurrent somatic variants in DDR and non-DDR interactors in the NetDDR network. Percentages of protein-coding genes with variant annotations in COSMIC and ClinVar are shown in panels (a,b), respectively. Patterns of recurrent somatic (≥ 2 samples) and germline variants in the highly mutated DDR genes are illustrated in panels (c,d), respectively. An additional analysis of the variant distributions, normalized by protein length, is shown in Supplementary Fig. S6.

Figure 2. Distributions of germline variants in DDR and non-DDR interactors. Panel (a) shows the distribution of protein lengths for DDR and non-DDR proteins. Their average lengths are similar, being 745 a.a. and 675 a.a., respectively. Boxplots show the contrasting patterns of germline variants as a function of protein length in the DDR and non-DDR proteins (panel b), and as categorized into benign, pathogenic, and VUS (panel c). Outliers in panels (b,c) were removed here for clarity, but included in the t-test statistical analysis. Some genes occur in two or three of the categories in panel (c), hence the sum of the N values shown is higher than the N in the corresponding plot in panel (b), which is the number of unique DDR genes. All boxplots depict the first and third quartiles as the lower and upper bounds of the box, with a thicker band inside the box showing the median value, and whiskers representing 1.5 × the interquartile range. Panel (d) shows the accumulation of germline variants in biological pathways related to the 229 DDR genes.

Figure 3. Distributions of somatic variants in DDR and non-DDR interactors. Boxplots show the contrasting patterns in somatic variants as a function of protein length in DDR and non-DDR proteins (panel a), and as categorized into metastatic and primary tumours (panel b). Outliers in panels (a,b) were removed for clarity, but retained for the t-test statistical analysis. Some genes occur in one or two of the categories in panel (b), hence the sum of the N values is higher than in the corresponding plot in panel (a). The N in panel (a) indicates the number of unique DDR genes. All boxplots depict the first and third quartiles as the lower and upper bounds of the box, with a thicker band inside the box showing the median value and whiskers representing 1.5 × the interquartile range. Panel (c) shows the accumulation of somatic variants in different biological pathways associated with the 229 DDR genes. For better visualization of the barplot we excluded TP53, which has a large number of recorded variants.

Figure 4. Accumulation of germline and somatic variants in Pfam domains and protein interaction interfaces. Panel (a) shows the percentage of pathogenic, VUS and benign germline variants extracted from the ClinVar database across interfaces and Pfam domains. Panel (b) shows the percentage of metastatic and primary somatic variants extracted from COSMIC across interfaces and Pfam domains.
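The percentages summarised in Figure 4 rest on a simple interval test: whether a variant position falls within a domain or interface region. A minimal sketch of that calculation is given below, assuming regions are stored as inclusive (start, end) residue ranges. The function name `percent_in_regions` and all coordinates are hypothetical and do not correspond to real Pfam annotations.

```python
def percent_in_regions(positions, regions):
    """Percentage of variant positions lying inside any inclusive (start, end) region."""
    if not positions:
        return 0.0
    inside = sum(
        any(start <= pos <= end for start, end in regions)
        for pos in positions
    )
    return 100.0 * inside / len(positions)


# Hypothetical domain boundaries and variant positions (illustration only).
domains = [(100, 250), (400, 520)]
pathogenic_positions = [120, 130, 300, 410, 515, 600]
benign_positions = [10, 260, 405, 700]

pathogenic_pct = percent_in_regions(pathogenic_positions, domains)
benign_pct = percent_in_regions(benign_positions, domains)
print(f"pathogenic: {pathogenic_pct:.1f}%  benign: {benign_pct:.1f}%")
```

Each position is counted once even if regions overlap, so the percentage cannot exceed 100.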
, panel a), the pathogenic germline variants from different mCRPC cohorts, TCGA, and ClinVar datasets show a tendency to accumulate in the interaction region with the NF-κB essential modulator IKBKG (NEMO) (a.a. 1960-2565; IntAct accessions: EBI-495465 and EBI-81279), that overlaps a flexible region flanking

Table 2. Pathogenic germline variants identified in mCRPC cohorts and affecting the PCa relevant genes ATM, BRCA1, BRCA2 and MUTYH. (*) Numbers indicate the total variants identified in each study and, in parentheses, the non-identical germline variants. (a) From 11 PCa studies in the non-redundant dataset at cBioPortal. (b) From 176 different studies in the non-redundant dataset at cBioPortal. Details about the PCa populations in cBioPortal are provided in Supplementary File 1.

Figure 5. Lollipop diagrams of germline and somatic variants in ATM, BRCA1, BRCA2 and MUTYH. The locations of different types of variants along the protein sequences are indicated by the coloured "lollipops". The variants are coloured: red for pathogenic germline variants identified in different cohorts of advanced PCa, cyan for recurrent germline and somatic variants in PCa, and green for hotspot positions in different tumour types.

Figure 6. Co-localization of germline and somatic variants in ATM, BRCA1, BRCA2 and MUTYH. Panel (a): Histograms and density distributions for different types of variants in ATM. The histograms in the left-hand column show the counts of each variant type in bins of 75 residues along the sequence. They show: germline variants identified in different cohorts of mCRPC collected from the literature (PubMed), germline variants from TCGA, germline variants from ClinVar, and somatic variants from COSMIC. Where more than one type of variant is plotted, the widths of the bars in each bin are reduced accordingly. To the right of each histogram are the equivalent density distributions, as computed by smoothing the histogram bars. A small version of the Pfam domain layout is also shown beneath the histograms and density distributions. Panels (b-d) show the data for BRCA1, BRCA2, and MUTYH, respectively.
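The binning behind the histograms in Figure 6 can be sketched as follows. This is an illustrative example only: the convention that bin k covers residues 75k + 1 to 75(k + 1) is an assumption, the function name `bin_counts` is hypothetical, and the positions are invented. The smoothing used for the density distributions is not specified in the caption and is therefore not reproduced here.

```python
def bin_counts(positions, protein_length, bin_width=75):
    """Count variant positions in consecutive bins of `bin_width` residues.

    Assumes 1-based residue numbering; bin k covers residues
    k * bin_width + 1 .. (k + 1) * bin_width (the last bin may be shorter).
    """
    n_bins = -(-protein_length // bin_width)  # ceiling division
    counts = [0] * n_bins
    for pos in positions:
        if not 1 <= pos <= protein_length:
            raise ValueError(f"position {pos} outside 1..{protein_length}")
        counts[(pos - 1) // bin_width] += 1
    return counts


# Invented variant positions on a hypothetical 300-residue protein.
toy_positions = [1, 75, 76, 150, 151, 299, 300]
toy_counts = bin_counts(toy_positions, protein_length=300)
print(toy_counts)
```

Plotting several variant types side by side would then amount to calling the function once per variant type with the same protein length, so that all histograms share the same bins.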

Figure 7. Mapping of germline and somatic variants onto protein 3D structures. The spatial 3D clusters in ATM (panel a), BRCA1 (panel b), BRCA2 (panel c), and MUTYH (panel d) are highlighted. Pathogenic germline variants are represented in red, recurrent somatic and germline variants in PCa in cyan, and hotspot positions in different tumour types in green. The spatial 3D clusters were calculated using the Mutation3D program (http://mutation3d.org) 29.

Figure 8. Study of DDR and non-DDR as cancer drivers across 33 tumour types. Panel (a) shows the DiffMut uEMD scores for the DDR and non-DDR genes across 33 tumour types studied in TCGA. Boxplots indicate the 25th and 75th percentiles (box extent) and the median (centre line of each box). The whiskers extend from the hinge to the largest value no further than 1.5 × interquartile range from the hinge. Dots represent uEMD scores. The uEMD scores higher than 3 were excluded from the representation for clarity. Panel (b) shows P-values comparing DDR versus non-DDR genes based on the Wilcoxon rank-sum test. Panel (c) shows a heatmap of uEMD scores obtained by the different studies for the cancer-predisposition genes ATM, BRCA1, BRCA2, MLH1, MSH2, MSH6, and TP53.
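The boxplot convention described in the caption (hinges at the 25th and 75th percentiles, whiskers reaching the most extreme value within 1.5 × the interquartile range of the hinge) can be written out as a short sketch. The quartile interpolation rule (the "inclusive" method of Python's `statistics.quantiles`) is an assumption, since plotting packages differ on this point; the function name `boxplot_stats` and the values are hypothetical.

```python
import statistics


def boxplot_stats(values):
    """Hinges, whiskers and outliers following the 1.5 x IQR convention."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers stop at the most extreme data points inside the fences.
        "lower_whisker": min(v for v in values if v >= low_fence),
        "upper_whisker": max(v for v in values if v <= high_fence),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }


# Invented values with one extreme point (illustration only).
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 50])
print(stats)
```

Points listed under `outliers` are those that would be drawn individually or hidden for clarity, while still being available for any statistical test run on the full data.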

Table 1. List of Pfam domains affected by germline and somatic mutations in the DDR family.
Pfam accession | Domain (description) | Function
PF02259 | FAT (A novel domain in PIK-related proteins; i.e. FRAP, ATM, and TRRAP) | Still need to be elucidated experimentally. Usually involved in protein-protein interactions
PF00454 | PI3_PI4_kinase (Phosphatidylinositol 3- and 4-kinase) | Involved in cell growth, proliferation, differentiation, motility, survival and intracellular trafficking, which in turn are involved in cancer
PF02260 | FATC (A novel motif at the extreme C-terminus of PIK-related proteins) | Still need to be elucidated experimentally. Usually involved in protein-protein interactions
PF00730 | HhH-GPD (HHH-GPD superfamily base excision DNA repair protein) | Involved in DNA repair functions (i.e. endonuclease III, DNA glycosylase, and methyl-CPG binding protein)
PF00633 | HHH (Helix-hairpin-helix motif) | DNA-binding domain
PF14815 | NUDIX_4 (NUDIX domain) | A/G-specific adenine glycosylases. Involved in DNA base-pair mismatch repair

Scientific Reports | (2021) 11:14268 | https://doi.org/10.1038/s41598-021-93715-6

of histograms and density plots is not symmetrical and shows different peaks, which indicate accumulation of variants in defined regions of the protein. Table

Table 4. List of 3D clusters identified in the protein structure of ATM, BRCA1, BRCA2 and MUTYH.
Here, we also compare the distribution of germline and somatic variants in 229 DDR and 1182 non-DDR components of the NetDDR network across mutation data in 33 different cancer types available from TCGA. First, we retrieved TCGA data in MAF format, generated by the MuTect2 workflow, through the NCI Genomic Data Commons data portal (https://portal.gdc.cancer.gov/). Then, we applied the DiffMut method
Classification of DDR and non-DDR as cancer drivers across 33 tumour types.

Table 5. List of the 3D structures of the key DDR proteins available from the PDB database.

Table 6. List of 3D models annotated in the ModBase database.