Mutational signatures in GATA3 transcription factor and its DNA binding domain that stimulate breast cancer and HDR syndrome

Transcription factors (TFs) play important roles in many biochemical processes. Many human genetic disorders have been associated with mutations in the genes encoding these transcription factors, and so those mutations became targets for medications and drug design. In parallel, since many transcription factors act either as tumor suppressors or oncogenes, their mutations are mostly associated with cancer. In this perspective, we studied the GATA3 transcription factor when bound to DNA in a crystal structure and assessed the effect of different mutations encountered in patients with different diseases and phenotypes. We generated all missense mutants of GATA3 protein and DNA within the adjacent and the opposite GATA3:DNA complex models. We mutated every amino acid and studied the new binding of the complex after each mutation. Similarly, we did for every DNA base. We applied Poisson-Boltzmann electrostatic calculations feeding into free energy calculations. After analyzing our data, we identified amino acids and DNA bases keys for binding. Furthermore, we validated those findings against experimental genetic data. Our results are the first to propose in silico modeling for GATA:DNA bound complexes that could be used to score effects of missense mutations in other classes of transcription factors involved in common and genetic diseases.

The family of proteins that code for transcription factors is considered the largest family among all proteins types (about 10%). Specifically, 2600 proteins in the human genome contain DNA-binding domains, and most of them code for transcription factors 1 . The GATA3 TF is encoded in humans by the GATA3 gene and it controls the expression of a wide range of biologically and clinically important genes. GATA3 belongs to the GATA family of zinc finger transcription factors, which are named according to their DNA binding subsequence 'GATA ' .
Many studies confirmed that GATA3 mutations are involved in the development of certain types of breast cancer in humans 2 . GATA3 was shown to be one of the three genes mutated in > 10% of breast cancers 3 . Some studies on mice indicated that GATA3 is critical for the normal development of breast tissue and directly regulates luminal cell differentiation, whereas other studies indicated that it is integral to the expression of estrogen receptor alpha and to the signaling of androgen receptor [4][5][6] . Approximately one-half of the GATA3 mutations identified in patients with breast cancer are clustered in exons 5 and 6, which encode ZnF2 and the C terminal domain of the protein 7 . Experimental evidence showed that ZnF2 of GATA3 is required for DNA binding 8,9 . 15% of the mutations published in male breast cancer are present in GATA3, with hotspots recorded at residues S308 and S407 in luminal A and luminal B subtypes, respectively 10 .
Besides the different associations of GATA3 with different forms of breast cancer, GATA3 has been associated with hypoparathyroidism, deafness, and renal dysplasia (HDR) syndrome. The first described missense mutation (Leu348Arg) in HDR patients does not alter DNA binding or the affinity but likely alters the conformational change that occurs during binding in the DNA major groove 11 . Other mutations of GATA3 TF, causing the HDR dysplasia syndrome, include: Two nonsense mutations Glu-228 to Stop and Arg-367 to Stop, one acceptor splice site mutation that leads to a frameshift from codon 351, a premature termination at codon 367, and two missense mutations: Cys-318 to Arg and Asn-320 to Lys. Mutations involving GATA3 ZnF2 or adjacent basic amino acids result in a loss of DNA binding, but those involving GATA3 ZnF1 either lead to a loss of interaction with FOG2 (Friend of GATA , cofactor) or alter DNA-binding affinity 12 www.nature.com/scientificreports/ In this work, we studied all possible amino acids and DNA bases missense mutations on two protein-DNA complexes (GATA3:DNA complex models with PDB ID: 3DFV and 3DFX). We applied Poisson-Boltzmann electrostatic study for the analysis of those mutations. The original method has been applied on many protein-protein complexes, and for ionic-only amino acids. In this context, we studied the role of every amino acid, and the role of every DNA base in regard to binding, and that is by mutating each amino acid and mutating each of Subsequence 1 _GATA _Subsequence 2 DNA bases (i.e., Subsequence 1 _GATA _Subsequence 2 represents the DNA sequence where the GATA3 TF protein binds, and where Subsequence 1 and Subsequence 2 can be any of the four DNA bases: Adenine (A), Cytosine (C), Thymine (T), or Guanine (G)) that might lead to malfunction in transcription. Unlike previous studies, we hereby assessed the role of all amino acids (ionic and non-ionic) of the GATA3 protein and the role of all DNA bases of the Subsequence 1 _GATA _Subsequence 2 DNA sequence during non-specific recognition (between amino acids and DNA backbone) and in specific binding (between amino acids and DNA bases).
This paper is organized as follows: We first described the computational method applied to the study of amino acids and DNA bases mutations in Section II. We then illustrated in details the results of applying the method in Section III; we revealed key amino acids and key DNA bases for complex binding, in addition to revealing the amino acids and DNA bases that play neutral roles in binding. Afterwards, we discussed those results, linked them to disease phenotypes, and validated them with published experimental data. Finally, we drew conclusions in Section IV and presented related future work.

Methods
Recognition and binding form the two major steps of electrostatic association between protein and DNA molecules 19 . Nonspecific long-range electrostatic interactions characterize recognition, whereas specific favorable local short-range electrostatic interactions, such as hydrogen bonds, salt bridges, medium-range coulombic interactions, in addition to hydrophobic and van der Waals interactions, characterize binding. An accelerated weak encounter complex is formed during recognition. Contrariwise, the protein and DNA are locked into their final bound conformation during binding, and this occurs after local side change rearrangements, and exclusion of solvent atoms from their binding interface.
Several diseases were revealed through the effect of charged amino acids 20 . Similar diseases include the eye disease known as Age-related Macular Degeneration (AMD) 21 , the kidney disease known as atypical Hemolytic Uremic Syndrome (aHUS), the Dense Deposit Disease (DDD), also known as membranoproliferative glomerulonephritis 22 , and immune system disorder (over-activity or under-activity) 23 . The electrostatic type of interactions was shown in complexes like C3d-CR2 [24][25][26][27] and C3d-EfbC/Ehp association [26][27][28][29] , and in interactions with viral proteins VCP/SPICE 30,31 and Kaposica 32 . The functional properties of each subunit of the E1 heterodimer activating-enzyme for NEDD8, UBA3, and APPBP1 was studied electrostatically in 33 . Hierarchical clustering analysis of electrostatic potentials and charges of V3 loop of HIV-1, which plays a crucial role in viral entry into cells, was performed in 34 and was mainly mediated by electrostatics. Single-alanine mutants of charged residues in the complexes CD46(SCR1-2)-Ad11k and CD46(SCR1-2)-Ad21k were computationally generated to mark out specific interfacial electrostatic interactions that are critical for association 35 .
SUMO4, a type 1 diabetes susceptibility gene, was found amenable to SENP2-a protease enzyme that processes SUMO into conjugatable form-processing via a single amino acid mutation through electrostatic computational modeling, and a combination of two amino acid mutations makes it highly accessible to SENP2 substrate 36 . Electrostatic detailed investigation of factor H (FH) complement control protein (CCP) modules, in which mutations are linked to autoimmunity, revealed three binding sites in binding to complement protein C3b, thus increasing the affinity of FH for host surfaces 37 . Similar to FH, mutations in the MAC complex (C5b6) can lead to autoimmune diseases. Correspondingly, an electrostatic study of the interaction between C5b and C6 complement proteins was completed in 38 . Electrostatic similarity methods applied to perturbed structures of C3d and Cr2 revealed electrostatic "hot-spots" at the two functional sites of C3d and a lack of electrostatic "hot-spots" at the surface of Cr2, despite its excessively positive nature 39 . Additional electrostatic computational approaches were used to gain insight into the binding mode of the C3d:CR2 complex 40 . Theoretical alaninescanning mutagenesis and validation with experimental data was completed on five protein complexes in order to discern the role of individual ionized amino acids to protein association 41 .
Protein-DNA interactions: framework description. Binding free energy calculations of many protein-protein interactions were implemented using the integrated Analysis of Electrostatic Similarities Of Proteins (AESOP) framework 26,27,29,[33][34][35][36][37][38][39][40][41][42] . We based the electrostatic study of protein-DNA interactions on the same framework because protein-DNA interactions and protein-protein interactions share the same types of interactions; they both comprise bonded (bond, angle, torsion) and non-bonded (short-range and long-range electrostatic, van der Waals (vdW), and hydrogen bonds) interactions, as depicted in Table 1. 43 Intra-molecular represents interactions within the same molecule, whereas inter-molecular represents interactions between different molecules, like protein and DNA molecules. We then expanded AESOP to study all types of amino acids (ionic and non-ionic), in addition to all DNA bases and DNA backbone.
The Expanded-AESOP framework encompasses the following steps: Preparation of mutants from all GATA3 protein amino acids and from all nucleotides of the DNA sequences. R scripts were implemented to generate all types of mutants. We used as input the two forms of GATA3:DNA crystal structures (3DFV and 3DFX) from the Protein Data Bank (PDB). We replaced every amino acid-expected to be charged at physiological pH-one at a time, with each of the other nineteen amino acids. Along, we also Calculation of Poisson-Boltzmann electrostatic potentials. The electrostatic potentials were calculated using APBS software 43 , which is based on the linearized Poisson-Boltzmann equation, as illustrated in previous studies [25][26][27]29 . The atomic radii and charges, needed for APBS calculations, were calculated using PDB2PQR 44 program and AMBER force field parameters 45,46 .
Calculation of electrostatic free energies of complex binding. The calculations of the electrostatic potentials were fed into the calculations of the electrostatic free energies of binding based on a thermodynamic cycle, as described in 26,27,29 , and in the form of the following equations: where Eq. (1a) presents the binding free energy of the complex in solvent. Equation (1b) presents the binding free energy of the complex after eliminating artifacts. Equation Data visualization. We used data visualization to help scientists understand the significance of data by placing it in a visual context, such as patterns, trends, and correlations that might go undetected in text-based forms. Swiss PDB Viewer and Chimera represent the visualization software we used 47,48 .

Results and discussion
Real data. We studied the binding free energy calculations of GATA3:DNA complex (Fig. 1a, b). GATA3:DNA is available under two different conformations with PDB IDs: (a) 3DFX (GATA3 binding to DNA in an opposite manner) and (b) 3DFV (GATA3 binding to DNA in an adjacent manner) 16 . The two crystal models share the same types of interactions. Accordingly, we used the available interactions of the Opposite model (OPP) to elaborate on the role of some amino acids hubs in the Adjacent model (ADJ). For electrostatic calculations, we used the crystal structure 3DFV, in which GATA3 comprises the coordinates of amino acids Arg311-Arg366 for each of Chain-D and Chain-C, and in which the DNA module comprises the coordinates of nucleic acids from T1 to C20 for Strand-Y and from A1 to G20 for Strand-Z, as follows: DNA Strand-Z:  Fig. 1b, Chain-D binds to 'GATA' subsequence on Strand-Z and Chain-C binds to 'GATA' subsequence (in reverse) on Strand-Y. Table 1. Types of protein-DNA interactions. Non-bonded Specific refers to interactions between Amino Acid (AA) and DNA bases and Non-bonded Non-specific refers to interactions between AA and DNA backbone.

Intra-molecular
Bonded Bond Angle Torsion

Mutational analysis from computational results.
In order to detect the effect of each mutated perturbation on the overall binding ability of the complex GATA3:DNA, we performed the following steps: First, we generated a family of GATA3 mutants and a family of DNA mutants from the crystallographic structure GATA3:DNA 43 at atomic detail, and that is by replacing each amino acid with each of the other nineteen amino acids and replacing each base of the DNA sequence Subsequence 1 _GATA _Subsequence 2 with each of the other three DNA bases. Second, we performed Poisson-Boltzmann electrostatic calculations on each of those mutant The current dataset consists of one family of GATA3 protein mutants and one family of DNA sequence mutants. Since each chain consists of 56 amino acids (numbered 311-366) and each amino acid is mutated to 19 other amino acids, we have a total of 1064 protein mutants per one chain of the 3DFV structure, or 2128 per both chains (Chain-C and Chain-D are symmetrical). The dataset also consists of a family of Subsequence 1 _GATA _Subsequence 2 DNA sequence mutants, where each DNA base is mutated to each of the three other DNA bases, composing a total of 60 mutants for each strand (Strand-Y and Strand-Z are complementary), or 120 DNA mutants per both strands.
We superimposed the structures of the GATA3 protein mutants and those of the Subsequence 1 _GATA _Subsequence 2 DNA sequence using the backbone Cα atoms and centered them on the same grid used for the parent/ wild structure (GATA3:DNA).  1a), and compared against the parent/wild protein-DNA complex solvation free energy. Such comparison serves as a physicochemical classifier of binding ability. The free energy difference of the difference ΔΔG is computed by Eq. (1b) based on the thermodynamic cycle described in section 2 of 26,27,29 . For binding, an increase in solvation binding free energy ΔΔG is considered favorable, whereas a decrease is considered unfavorable.
As shown in Figs. 2, 3, 4, 5, 6, 7, 8, 9 and 10, the strength of each perturbation (mutation) is illustrated in terms of ΔΔG free energy calculation (Eq. 1a). Mutations of acidic GATA3 residues or DNA bases have free energy values higher than the parent/wild free energy and are called enhancers because they enhance binding, whereas mutations of basic GATA3 residues or DNA bases have free energy values lower than the parent/wild free energy and are called inhibitors because they inhibit binding. The lower the computed ΔΔG for a specific mutation, the more crucial the corresponding residue or DNA base is to binding, compared to other residues or DNA bases respectively, and the reasoning behind this is as follows: If the computed solvation binding free energy ΔΔG of a mutated residue or DNA base is lower than the initial solvation binding free energy ΔΔG of the same residue or DNA base before mutation, this implies that this specific residue/DNA base binds better before being mutated, and so this residue/DNA base has a significant impact on binding and is considered a hotspot (case of an inhibitor). Conversely, if the computed solvation binding free energy ΔΔG of a mutated residue or DNA base is higher than the initial solvation binding free energy ΔΔG of the same residue or DNA base before mutation, this implies that this specific residue/DNA base binds better after being mutated, and so this residue/DNA base does not have a significant impact on binding (case of an enhancer).  The results showed the following DNA bases mutants to impact binding (inhibitors shown below with green highlights). The lowest free energy values correspond to inhibitors with highest impact (strongest inhibitors), and those are: G6T, G6A, and G6C on Strand-Z, and G5T, G5A, and G5C on Strand-Y. Accordingly, we can see how the bases in the 'GATA ' subsequence are definitely the first to impact binding, and then a few bases in the neighborhood of the 'GATA ' subsequence, which are not necessarily the first adjacent base on the left and/or the first adjacent base on the right of the 'GATA ' subsequence. The rest of the DNA bases mutants that lay around the parent/wild region (Fig. 10)    www.nature.com/scientificreports/ Comparatively, all inhibitors are presented together in order to reveal the ones with the highest impact on protein-DNA binding. Hence, Fig. 11 shows that Arg and Lys are the most influential amino acids for efficient binding. It also shows the minor impact of mutated DNA bases, which lay around the parent/wild region, when compared to amino acid mutants. This result uncovers the crucial role of DNA backbone in the interactions with the GATA3 protein amino acids, unlike the specific role of the DNA bases, which appears to be minimal in comparison. Similarly, Fig. 12 shows all enhancers of amino acids, highlighting the crucial role of Gln mutants in binding.
Computationally, we detected the hydrogen bonds (listed in detail in in Supplementary Figure SF12 and Supplementary Table ST12) between Chain-C and the DNA, revealing the vital role to binding of the following amino acids: Arg312, Arg329, Arg330, Arg364, Arg366, Lys346, and Asn339. In addition, we computationally detected the salt bridges (listed in detail in in Supplementary Figure SF13 and Supplementary Table ST13) between Chain-C and the DNA for the following amino acids: Arg312, Arg330, Arg352, and Lys358.
Mutating amino acids to Arg or Lys make them enhancers due to adding more positive to the binding with DNA, which is initially more negatively charged. Figures 2, 3, 4, 5, 6, 7, 8 and 9 show the effect on binding when mutating any amino acid to Arg or Lys. Conversely, when any amino acid is mutated to Asp/Glu, it is implied as an inhibitor that impedes binding. In particular, when Arg or Lys are changed to Asp/Glu (Fig. 2/Table 2 or SF6/ ST6 respectively), those mutants are called strong inhibitors and the impeding effect on binding is multiplied. In this case, we are losing more positive charges (property of Arg and Lys) and adding more negative charges           www.nature.com/scientificreports/ Both models (OPP and ADJ) are similar in that Chain-B and Chain-C both bind to 'GATA ' DNA subsequence. The only minor difference between Chain-A and Chain-D lies in that they bind to slightly different DNA subsequences, 'GATT ' and 'GATA ' respectively. Hence, we can use some of the experimental results performed on the OPP model to validate some of the computational results performed on the ADJ model. As shown in Fig. 1a Interactions and hubs. Based on the results of our Expanded-AESOP method (as shown in Figs. 2, 3, 4, 5, 6, 7, 8, 9 and 10), we validated experimentally the predicted enhancers and inhibitors, described them in details in Supplementary Material, and summarized them in Supplementary Tables ST14, ST15, ST16, and ST17. Experimental evidence showed extensive protein-protein interactions (between the two GATA3 molecules) and protein-DNA interactions (between GATA3 and the DNA molecule) of many enhancers (Ala340, Asn351, Asn349,  Asn339, Pro353, Leu354, Leu347, Leu327, Leu343, Glu359, His348, Ile350, Ile361, Thr355, Thr326, Gln362, Met356, Tyr344) and many inhibitors (Ala340, Arg364, Arg329, Arg352, Arg312, Arg330, Lys357, Lys358,  Lys346, Asn351, Asn349, Asn339, Pro353, Leu354, Leu347, Leu327, Leu343, His348, Ile350, Ile361, Thr355,  Thr326, Met356, Tyr344), revealing strong hubs of interactions comprising all different types of interactions (hydrogen bonds, salt bridges, van der Waals, etc.).
On the other hand, we detected computationally hydrogen bonds between Chain-C amino acids (Arg312, Arg329, Arg330, Arg364, Arg366, Lys346, Asn339) and the DNA (shown in Supplementary Figure SF12 and summarized in Supplementary Table ST12). In addition, we detected computationally salt bridges between     www.nature.com/scientificreports/ Chain-C amino acids (Arg312, Arg330, Arg352, Lys358) and the DNA (shown in SF13 and ST13), again bringing to light the role of those amino acids to binding. Thus, the experimental and computational validations substantiate the crucial role of the predicted amino acids and of the DNA backbone, in addition to demonstrating the effectiveness of our developed approach Expanded-AESOP.
Gene-disease associations/phenotypes. Identifying gene-disease associations from experimental methods can be expensive and time consuming. Yet, this process is highly needed to design therapeutic strategies against diseases. Accordingly, in silico methods were developed to predict those associations from available experimental data and other types of data. In this section, we validated the results predicted by our computational method by elaborating on the diseases already witnessed as a result of the disruptions caused by our studied mutations.
The enhancer C317R (as listed in the GATA3:DNA .pdb file from the Protein Data Bank, but numbered as C318R in 12 -with an offset of one position) caused by a missense mutation as shown in Fig. 5, leads to a disruption of the second zinc finger (ZnF2) that is manifested in HDR syndrome, where the loss of ZnF2 coordination was marked as haploinsufficiency (HI) 12 . Another missense mutation, leading to a disruption of the second zinc finger (ZnF2) and manifesting in HDR syndrome, is the enhancer N319K (numbered as N320K in 12 ), as shown in Fig. 3. Again, this specific missense mutation has been noticeable as HI in the HDR syndrome 12 . A different missense mutation caused by the enhancer L347R (numbered as L348R in 11 ), is shown in Fig. 7, and has been observed in the HDR syndrome 11 . This mutation affects the basic region and is likely to disturb the DNA conformational change. The mutation R352S 4 , despite being in the parent/wild region, has a major effect. As shown in Fig. 2, it is on the far end of the parent/wild region, and that makes it an approximate inhibitor. It is predicted to disrupt the helical turn and thus change the angle between the C-terminal zinc finger and the adjacent C-terminal tail; this phenomenon has been visible in the HDR syndrome within the Chinese population 15 .
On the other hand, GATA3 TF is one of the most frequently mutated genes in Breast Cancer. The mutation R366L (numbered as R367L in 49 ) which lies in the parent/wild region, turns to be of major effect. As shown in Fig. 2, it is considered an approximate inhibitor due to its distant location from the parent/wild complex. Such missense mutation in Exon5 of ZnF2 was seen in Breast Tumor Ull-011 49 , resulting in high expression of GATA3, and leading to Breast Neoplasm disease 49 . Also, the missense mutation L343F (numbered as L344F in 49 ) in Exon 5 of ZnF2 showed high expression of GATA3 in Breast Tumor BR00-0587, causing the same disease 49 . On the other hand, the mutations M356K (numbered as M294K in 50 and is the only Methionine in ZnF2) and N333K (numbered as N334K in 50 ), in Exon 4 and Exon 5 respectively, were witnessed in Breast Cancer on the molecular and clinical levels 50 . A heterozygous mutation (frameshift) D335Gfs (numbered as D336Gfs in 51 ) was seen in Breast Cancer 51 and the frameshift mutation R329fs (numbered as R330fs in 52,53 ) was associated with Breast Cancer. The previous two results show the importance of the specified amino acids (D335 and R329) which were predicted as strong enhancer and inhibitor respectively in our results (Figs. 2 and 4 respectively). Lastly, the mutational frameshift P490Afs, which occurred in several Breast Cancer cases 54 , shows the importance of studying all GATA3 mutants (charged and non-charged amino acids). Yet, we could not verify this specific amino acid (P490) due to its position outside ZnF2 (not covered in .pdb input file range).

Conclusion
We started from the AESOP 26,27,29,33-42 framework, which predicts ionic residues with major effect to binding, and modified it to the Expanded-AESOP framework, which predicts all types of residues (ionic and non-ionic) and DNA bases, affecting binding in a biomolecular protein-DNA complex. Unlike previous GATA work 51-55 where we tackled only charged amino acids, we modified the method here to cover all mutations types of all amino acids and all mutations types of all DNA bases. We applied the new method to the structural information of two models of the GATA3:DNA complex 43 . After computing the electrostatic potential calculations using APBS, and feeding them into the free energy calculations in view of a two-step model, we detected key residues and key DNA bases crucial for the complex intermolecular interactions, and therefore for binding. Analysis of the corresponding free energy calculations showed that the DNA backbone plays a more critical role in binding than the DNA bases, and that was confirmed by the related interactions listed computationally and experimentally.
The results showed that some non-ionic amino acids do play a major role in binding, and that may be rationalized to many factors, such as the position of the non-ionic amino acid in the complex (i.e., too close to the interface or too close to many other charged amino acids) or the contribution of this non-ionic amino acid to some favorable conformation.
Future work will include studying key amino acids and key DNA bases in the crystal structure of GATA4:DNA. Such studies will form the basis for designing future experiments and biopharmaceutical studies that will assist in understanding better the biochemical pathways involved in GATA:DNA binding, for enhanced regulation of GATA target genes.