ProtCID: A tool for hypothesis generation of the structures of protein interactions


Abstract
Interaction of proteins with other molecules is central to their ability to carry out biological functions. These interactions include those in homo- and heterooligomeric protein complexes as well as those with nucleic acids, lipids, ions, and small molecules. Structural information on the interactions of proteins with other molecules is plentiful, and for some proteins and protein domain families, there may be hundreds or even thousands of available structures. While it is possible for any biological scientist to search the Protein Data Bank (PDB) and investigate individual structures, it is virtually impossible for a scientist who is not trained in structural bioinformatics to access this information across all of the available structures of any one extensively studied protein or protein family. Furthermore, the annotation of biological assemblies is correct for only about 80% of entries in the PDB, and there is therefore a great deal of biological information that is present in protein crystals but not annotated as such. We present ProtCID (the Protein Common Interface Database) as a webserver and database that makes comprehensive, PDB-wide structural information on the interactions of proteins and individual protein domains with other molecules accessible to scientists at all levels. In particular, we describe the utility of ProtCID in generating structural hypotheses from available crystal structures that may not be readily apparent even to experienced structural biologists. One example is a homodimer of HRAS, KRAS, and NRAS (the α4-α5 dimer) that has recently been experimentally validated for all three proteins, and for which we are able to present structural data from 16 different crystal forms and 108 PDB entries, including a structure of the NRAS dimer for the first time. We also present a provocative hypothesis for the dimer structure of the second bromodomain of BET histone-binding proteins (including human BRD2, BRD3, BRD4, and BRDT) that is present in every structure of these proteins available in the PDB. Finally, we show how ProtCID can be utilized to extend existing experimental data on some proteins to many other proteins in the same family or even much larger superfamilies and to identify structural information for multi-protein complexes and hub proteins.
All proteins function via interactions with other molecules, including nucleic acids, small molecular ligands, ions, and other proteins in the form of both homo- and heterooligomers. Determining how such interactions occur and defining their role in protein function are the central goals of structural biology. For many proteins and protein families, there is abundant structural information available in the form of structures determined by X-ray crystallography, nuclear magnetic resonance, and cryo-electron microscopy and deposited in the Protein Data Bank (PDB). The number of structures for a protein and its homologues can reach into the hundreds or thousands. Some proteins occur in different structural forms during their functional lifetimes, depending on conditions such as pH, phosphorylation state, interaction partners, and cofactors. For any protein system of interest, it is most valuable to understand the structure in all of its structural and functional states, including different homooligomeric states and functional interactions with nucleic acids, ligands, and other proteins. To accomplish this, it is often necessary to examine many available structures of a protein and even structures of its homologues. However, the process is very challenging and time-consuming even for scientists trained in bioinformatics. It is virtually impossible when there are dozens or hundreds of available structures.
To address this issue, we have previously developed PDBfam 1 , which assigns protein domains defined by Pfam 2 to all structures in the PDB. It enables a user to browse a page of PDB entries for any particular domain family. Each PDB entry listed on a PDBfam page includes the Uniprot identifier, species, and Pfam architecture of the chain that contains the query domain as well as the same features of other protein partners in the same entries. It also includes information on other bound partners such as peptides, nucleic acids, and ligands. In this way, it is possible to rapidly identify structures within a protein family that will provide information on interactions that are critical to biological function.
Of central importance to the utility of experimental structures is the accuracy of annotations. Authors of crystal structures are required to deposit a 'biological assembly' into the PDB, which is what they believe to be the biologically relevant oligomeric form present in the crystal. This is in contrast to the asymmetric unit, which is the set of coordinates used to model the unit cell and the crystal lattice when copied and placed with rotational and translational symmetry operators. The author-deposited biological assembly is different from the asymmetric unit (ASU) for about 40% of crystal structures in the PDB 3 . When it is different, roughly half the time the biological assembly is larger than the ASU (i.e., made from parts or all of multiple copies of the ASU), and half the time it is smaller than the ASU (i.e., a sub-assembly of the ASU). Various authors have estimated the accuracy of the biological assemblies in the PDB in the range of 80-90% 4,5 .
For some entries, the PDB provides additional biological assemblies derived from the Protein Interfaces, Surfaces and Assemblies (PISA) 4 server from the EBI that are predicted to be stable in solution based on calculations of thermodynamic stability of the complexes. Several servers analyze interfaces in either the asymmetric units and/or the deposited biological assemblies of PDB entries [6][7][8][9][10][11] , and some are intended to predict which interfaces may be biologically relevant by measuring conservation scores and physical features and using machine learning predictors [12][13][14] . Interactions with peptides, nucleic acids, and ligands have also been analyzed and presented in several webservers and databases 9, 15-21 .
A common approach for identifying biological interactions of molecules within protein crystals is to compare multiple crystal forms of the same or related proteins. Each crystal form will have different symmetry and different non-biological interfaces between proteins and with crystallization ligands (sulfate ion, glycerol, etc.), while in most cases the biological interactions will be shared between them. We have shown that if a homodimeric or heterodimeric interface is present in a number of crystal forms, especially when the proteins in the different crystals are homologous but not identical, then such interfaces are very likely to be part of biologically relevant assemblies 5 . To enable this form of analysis, we previously developed the Protein Common Interface Database (ProtCID), which compares and clusters the interfaces of pairs of full-length protein chains with defined domain architectures in different entries and crystal forms in the PDB 22 (e.g., homodimers of one Pfam architecture such as (SH2)_(Pkinase), or heterodimers of two different Pfam architectures such as (C1_set) and (MHC_I)_(Ig_3) present in Class I MHC proteins). ProtCID has been very useful in identifying biologically relevant interfaces and assemblies within crystals [23][24][25][26] , including those that may not have been annotated in the biological assemblies provided to the PDB by the authors of each structure. ProtCID allows users to download coordinates and PyMOL scripts for visualizing all available interfaces. In this paper, we extend the ProtCID approach to clustering the interactions of individual domains within and between multi-domain proteins and between protein domains and peptides, nucleic acids, and small-molecule ligands. We focus in particular on how ProtCID can be used to generate hypotheses of which molecular interactions within crystals provide biologically relevant information that can be tested experimentally.
The inclusion of interactions between individual domains greatly extends our ability to generate hypotheses about the functional interactions of proteins. We show examples of both chain-level and domain-level ProtCID clusters for some experimentally validated, biologically relevant protein-protein interactions that were in some way challenging to identify in the biological literature. This is especially true of weaker interactions within homooligomers, which are very difficult to distinguish from crystallization-induced interactions. There have been several cases where structures containing homodimers of proteins had existed in the Protein Data Bank (PDB) for many years before the homodimers were recognized as biologically relevant structures. Examples of this include the asymmetric homodimerization of the EGFR kinase domain involved in kinase activation 27 , the homodimer of cytosolic sulfotransferases involved in half-sites reactivity 28,29 , and very recently a homodimer common to both H-RAS 30 and K-RAS 31 . We show cases where at least one protein in a ProtCID cluster has been well validated as a homodimer but the cluster contains other proteins in the same family which have not been described as such. In this way, we generate hypotheses of how these proteins function as oligomers.
Many proteins interact with peptide segments from other proteins, often when these segments are from intrinsically disordered regions 32 . Thus, when attempting to define how two proteins interact, it is important to identify domains in one or both partners that might bind short segments of the other protein via a known peptide-binding domain. We have clustered peptide/Pfam-domain interactions in the PDB to identify a subset of domain families that we refer to as "professional peptide-binding domains" (PPBDs). The function of PPBDs in most contexts is to bind peptide segments from other proteins. Peptides bind to PPBDs in a similar manner across each family, often with a specific binding motif. Examples include SH2 33 , BRCT 34 , and PDZ domains 35 . These structurally characterized PPBDs are explicitly identified in ProtCID during searches of how two or more proteins might interact structurally.
Many protein domain families bind to nucleic acids, usually in ways that are well conserved across each family. Analysis of the available structures in large families has enabled the identification of how DNA or RNA sequence specificity is encoded in the protein sequence and structure 36,37 . This can even be extended to domain families within larger superfamilies, where some member families do not have any structures bound to DNA. Like other classifications of protein domains, Pfam is organized into superfamilies (called clans in Pfam), which sometimes contain very remotely related protein families of similar structure. Thus, the existence in ProtCID of domain/nucleic-acid interactions for one protein family can be used to develop hypotheses for the structures of other protein families within the same superfamily. We show some examples of this approach.
Binding sites of biological ligands are usually conserved within protein families, and the bioinformatics analysis of these sites with bound ligands has enabled the development of specific inhibitors for many proteins [38][39][40] . We have clustered ligands bound to each Pfam domain in the PDB using a metric based on volume overlap of the ligands. This enables the rapid identification of common binding sites of biological ligands and inhibitors as well as pockets in proteins that bind molecular fragments present in the solutions used to crystallize proteins.
Many proteins act as hub proteins, and it can be difficult to develop hypotheses of how such proteins might function. We have enabled a new search feature in ProtCID with which a user can identify domain-domain and PPBD/peptide interactions that are possible between a hub protein and its partners. Interacting partners can be obtained from several databases, such as IntAct and STRING, and input to ProtCID. Many proteins participate in large complexes of many different proteins. ProtCID identifies the Pfam domains in a list of input UniProt identifiers and then provides a list of potential domain-domain and PPBD/peptide interactions among these proteins that are represented in homologous proteins in the PDB. These utilities might enable the identification of direct protein-protein interactions that occur in large multi-subunit protein complexes.

Generating hypotheses for oligomeric protein assemblies with ProtCID
The ProtCID database contains information on four types of interactions: protein-protein  A primary goal of ProtCID is to generate hypotheses of the structures of oligomeric protein assemblies that may not be readily obvious to authors of crystal structures. Many such structures are due to weakly interacting dimers that are facilitated by attachment to the membrane or by scaffolding by other proteins or nucleic acids. One verified case that inspired the development of ProtCID is the small dimer interface of cytosolic sulfotransferases involved in half-sites reactivity of these enzymes 29 , and initially identified by cross-linking, protease digestion, and mass spectrometry 42 Table 3). As a testament to how difficult it is for biophysical calculations and sequence conservation in interfaces to determine biological assemblies, PISA predicts the asymmetric dimer of these structures as biological in only 6 of the 98 EGFR entries and all 3 of the heterodimer entries.
EPPIC predicts only the heterodimers as biological assemblies; it predicts that all of the homodimers in this ProtCID cluster are monomers 46 . These and other methods typically do not support the prediction of asymmetric homooligomeric assemblies (those that are not Cn or Dn symmetric) 3 . ProtCID may therefore be a suitable tool for identifying such assemblies. EPPIC does not recognize the α4-α5 dimer as part of a biological assembly for any of these PDB entries. We also find a smaller cluster (5 crystal forms) of a beta dimer, which has been studied by Muratcioglu et al. 48 By contrast, we do not find the α3-α4 dimer implicated in the same study in any PDB entry.
The α4-α5 dimer in our ProtCID cluster has been implicated as a biologically relevant assembly for NRAS, KRAS, and HRAS. Spencer-Smith et al. found that a nanobody to the α4-α5 surface, as determined by a co-crystal structure of the nanobody with HRAS (PDB: 5E95), disrupted HRAS nanoclustering and signaling through RAF 49 . Via manual search, they identified the α4-α5 dimer in 74 of 80 active structures of HRAS but in none of 33 inactive structures of HRAS. Our α4-α5 cluster contains 92 structures of HRAS, 81 of them (88%) with GTP or guanosine triphosphate analogs, 9 of them with GDP, and two without a ligand (Supplementary Table 4 In addition to motif B, the first bromodomain of BET proteins has also been shown to dimerize in solution through dynamic light scattering, cross-linking, and co-immunoprecipitation experiments of tagged BD1 constructs 55 . The same authors determined the crystal structure of BD1 of human BRD2, and identified a head-to-head symmetric dimer as the likely biological assembly, which was verified by mutation of residues in the interface. Mutants unable to form the dimer bound to H4K12ac with drastically reduced affinity compared to the wild type. This dimer is distinct from the BD2 dimer we found in the largest chain-based cluster in ProtCID.

Extending dimers with established biological activity to other family members whose crystals contain the same dimer
The observation of similar interfaces in crystals of homologous proteins can be used to utilize experimental data available on one protein to generate hypotheses for other members of the same family. ProtCID enables this kind of inference in an easily accessible way. One  Interleukin-2-inducible kinase (ITK) plays an important role in the activation of T-cells in the immune response. ITK has been found to form homodimers at the plasma membrane, which does not require the PH, proline-rich, SH2, or SH3 domains 66 . Although catalytic activity is not required for dimerization 66 , the only remaining domain that may be involved in dimerization is the kinase domain. The ITK kinase dimers, found in five different crystal forms in the BRAF ProtCID cluster, are a reasonable hypothesis for the mechanism of homodimerization of ITK at the membrane. These ITK dimers are not discussed or annotated by the authors of these

Peptide-binding domains in ProtCID
The function of many protein domains is to bind peptides from other proteins. Some of these peptide-binding domains are catalytic, carrying out proteolysis or amino acid modifications such as phosphorylation and methylation. Others serve to bring other protein domains into contact or to regulate the activity of the bound partner. We define a peptide as a protein chain  Figure 6).

Ligand/protein and nucleic-acid/protein interactions in ProtCID
We define all non-polymer molecules except water in the PDB as ligands. A total of 6,514 Pfams have contacts with 23,156 ligand types in the PDB (Supplementary Table 2). The ligands are clustered based on the extent to which they share the Pfam HMM positions that they contact. ProtCID provides coordinates for two different views of Pfam-ligand interactions: (1) one ligand and its interacting Pfams; (2) one Pfam and its interacting ligands. Figure 4a Table 7).   Figure 7).

Modeling of protein complexes in ProtCID
Protein-protein interaction studies have identified protein complexes that consist of subunits expressed by many different protein-coding genes. Other proteins function in homooligomeric complexes larger than dimers. ProtCID can be used to develop hypotheses of the structures of these large complexes. We provide several examples.
4HBT proteins have a 'hot dog' fold and form different oligomers. In ProtCID, the 4HBT Pfam has 7 common clusters with at least 10 crystal forms (Figure 6a). The largest cluster has 144 entries in 83 distinct crystal forms. After identifying common PDB entries in these 7 clusters, there are three distinct combinations with no overlapping structures: clusters 1 and 2, clusters 1 and 5, and clusters 1 and 6, which form a β-sheet tetramer, an α-helix tetramer, and a hexamer respectively. Clusters 3, 4, and 7 are smaller interfaces that are present in most of the 1-2, 1-5, and 1-6 structures respectively. A phylogenetic tree shows that 4HBT proteins that are more closely related also tend to form the same oligomers (Supplementary Figure 8). We added the remaining four human proteins (BCHL, ACO11, ACO12 and ACOT9) with 4HBT domains. These proteins are located in the branch of the hexamers, and we can therefore hypothesize that the correct structure of their oligomers resembles the hexamers in Figure 6a Figure 6c shows these six common clusters of (ANF_receptor), (Lig_chan-Glu_bd) and (Lig_chan) from different species which can be used to assemble the tetramer, even if the full-length structures were not known.
ATG16 exists as a homodimer in solution and crystals, which further assembles the autophagosome into higher-order hetero-oligomers 79 . The ATG5-ATG12-ATG16-ATG3 complex is essential in autophagosome formation, in which a disordered region of ATG3 binds to ATG12 80 . The ProtCID cluster of ATG16 homodimers and two ProtCID clusters of ATG16 interacting with ATG5 can be used to build a heterooctameric structure of two copies each of ATG16, ATG5, ATG12, and ATG3 (Figure 6b bottom). This octameric structure is present in the crystal of ATG16-ATG5-ATG12-ATG3 (PDB: 4NAW) but was not identified by the authors as the biological assembly (they only deposited and showed the structure of the ATG16-ATG5-ATG12-ATG3 heterotetramer).

Modeling of hub proteins in ProtCID
Large-scale studies of protein-protein interactions have identified "hub" proteins that participate in a large number of interactions with other proteins [81][82][83] . Structural analysis of hub proteins is often hampered by the complexity of structural information in the PDB that might be utilized to develop hypotheses of how these proteins may interact with a large number of other proteins. We have enabled searches on ProtCID designed to provide hypothetical interactions of hub proteins and their partners and among the subunits in protein complexes. The hub protein search page in ProtCID allows a user to upload a list of proteins that are likely to interact with a specific hub protein; the server then returns a list of potential domain-domain and PPBD/peptide interactions that may explain how the hub protein interacts with each of its partners. Figure 6d shows the interactions in the PDB between P53 and its interactors based on Pfams and PPBDs. ProtCID identifies PPBDs (defined in Supplementary Table 4) in both hub and partner proteins so that this mode of binding is also presented to the user as a viable hypothesis.
For example, studies show that a peptide segment of the p53 C-terminus binds to 14-3-3 proteins 84, 85 ; however, there is no structure of this interaction in the PDB. A potential interaction between a peptide of human P53 and the PPBD of 14-3-3 proteins is identified by the edge in the network in Figure 6d.

ProtCID web site
ProtCID is composed of data pipelines, databases, and a web site. On the ProtCID web

Discussion
Correctly identifying biologically relevant interfaces and assemblies in protein crystals remains a challenging task in structural biology. Automated methods for doing this depend on the biophysical properties 4 and sequence conservation of interfaces 46 , but their accuracy is only about 85% on large benchmarks 86 . An alternative approach is to take advantage of multiple crystal forms of the same protein or related proteins 87,88 and to identify similar interfaces and assemblies in the crystals. We previously demonstrated that this approach performs very well on benchmarks 5 , and developed a database of clusters of similar interfaces of homologous protein chains 22 . We have extended this resource by clustering protein structures at the level of individual domains, which greatly increases the number of independent crystal forms for a large fraction of the protein families in the PDB. We further extended ProtCID by clustering domain-peptide and domain-ligand interactions.
We showed the power of this approach for generating hypotheses for the structures of biologically relevant interfaces and assemblies in cases where identification of the correct assembly is particularly challenging. This occurs for low-affinity dimers whose ability to form complexes is due to the concentration effect of confinement. This is true of proteins in the membrane and of small protein domains adjacent to other dimerization motifs and domains. We showed several examples of this, including the asymmetric dimer of ErbB proteins, a symmetric dimer of the BD2 bromodomains of BET proteins, and a symmetric homodimer of HRAS, KRAS, and NRAS. Notably, programs which aim to identify biological assemblies within crystals from biophysical properties or sequence conservation usually assume that assemblies must be symmetric. The extent of asymmetric assemblies such as the ErbB dimer that are not filamentous (like actin) is unknown. The identification of such complexes is difficult to perform manually for a large number of structures. ProtCID provides a very useful resource to systematically analyze structural data from the PDB and generate hypotheses with strong structural evidence. The BD2 dimer of BET proteins is found in all crystal forms that contain BD2 domains, and a dimer of these domains may be formed due to the increased concentration caused by the dimerization of the neighboring BD1 domains and the ET dimerization motif.
There is significant evidence in favor of the α4-α5 dimer of RAS proteins, particularly for HRAS. which is equal to a weighted count of the common contacting residue pairs in two interfaces divided by the total number of unique pairs. A Q value of 1 means that two interfaces interact in an identical way; a Q value of 0 means that they share no contacts. We cluster domain interfaces with surface area > 100 Å² by a hierarchical average-linkage clustering algorithm. Initially, each interface is placed in its own cluster. At each step, the two clusters with the highest average Q-score between them are merged, provided that Q_avg ≥ 0.30. The algorithm stops when no two clusters can be merged with Q_avg ≥ 0.30.
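The merging rule described above can be illustrated with a minimal Python sketch. It is not the ProtCID implementation: the function names (`q_score`, `cluster_interfaces`) are hypothetical, the Q-score here is an unweighted ratio of shared to unique residue pairs (the weighting scheme is not given in this section), and residue pairs are assumed to be already mapped to a common numbering across homologues.

```python
from itertools import combinations


def q_score(pairs_a, pairs_b):
    """Unweighted stand-in for Q: shared contacting residue pairs / unique pairs.

    Assumption: the real Q-score uses a weighted count; weights are omitted here.
    """
    union = pairs_a | pairs_b
    if not union:
        return 0.0
    return len(pairs_a & pairs_b) / len(union)


def cluster_interfaces(interfaces, q_min=0.30):
    """Average-linkage agglomerative clustering of interfaces.

    interfaces: dict mapping an interface name to its set of contacting
    residue pairs. Returns a list of clusters (lists of interface names).
    """
    names = list(interfaces)
    # Pre-compute all pairwise Q-scores.
    q = {
        frozenset((a, b)): q_score(interfaces[a], interfaces[b])
        for a, b in combinations(names, 2)
    }
    # Each interface starts in its own cluster.
    clusters = [[n] for n in names]

    def avg_q(c1, c2):
        total = sum(q[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the pair of clusters with the highest average Q between them.
        best_q, i, j = max(
            (avg_q(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if best_q < q_min:
            break  # no remaining pair reaches the threshold
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical toy data: A and B share 3 of 5 unique pairs; C shares none.
example = {
    "A": {(10, 50), (11, 51), (12, 52), (13, 53)},
    "B": {(10, 50), (11, 51), (12, 52), (14, 54)},
    "C": {(80, 120), (81, 121)},
}
clusters = cluster_interfaces(example)
```

In this toy case A and B merge (Q = 0.6) while C remains a singleton, since its average Q with the merged cluster is 0.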

Pfam-Peptide interfaces.
A peptide is defined as any protein chain with length less than 30 amino acids in the PDB. The chain type is based on the attribute "Polymer Type" defined in the PDB mmCIF files. A protein chain has the type defined as "polypeptide". A domain-peptide interface is defined as an interaction with ≥ 10 Cβ-Cβ contacts with distance ≤ 12 Å, or ≥ 5 atomic contacts with distance ≤ 5 Å. If a peptide is contacting several chains in a biological assembly, the interface with ≥ 75% of the atomic contacts is used as the peptide-interacting interface; otherwise, we keep all the interfaces. For any two Pfam-peptide interfaces in the same Pfam, the number of shared Pfam HMM positions interacting with the peptides is counted as N_hmm. RMSDs of peptides (RMSD_pep) are calculated after superposing the domain structures. We used PyMOL to generate the coordinates by using the "pair_fit" command to align the domain structures via their Pfam HMM positions, then calculated the minimum RMSD by applying a linear least-squares fit on the domain-interacting regions of the two peptides, that is, the region between the first residue interacting with the domain and the last residue interacting with the domain. We clustered Pfam-peptide interfaces using a hierarchical average-linkage clustering algorithm based on N_hmm and RMSD_pep. In this method, each interface is initialized as its own cluster. At each step, the two clusters with the highest number of common interacting Pfam HMM sites are merged, as long as N_hmm ≥ 3 and RMSD_pep ≤ 10 Å.
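The thresholds in this paragraph can be written down as a few small predicates. The following Python sketch is illustrative only: all function names are hypothetical, contacts are assumed to be counted as coordinate pairs within the cutoff, and the peptide RMSD is taken as a precomputed input rather than derived from a PyMOL superposition.

```python
import math

PEPTIDE_MAX_LENGTH = 30  # chains shorter than this are treated as peptides


def is_peptide(chain_length):
    """A peptide is a protein chain with fewer than 30 residues."""
    return chain_length < PEPTIDE_MAX_LENGTH


def count_contacts(coords_a, coords_b, cutoff):
    """Count coordinate pairs (one from each set) within `cutoff` angstroms.

    Assumption: a 'contact' is any pair of points within the cutoff.
    """
    return sum(
        1 for p in coords_a for q in coords_b if math.dist(p, q) <= cutoff
    )


def is_domain_peptide_interface(domain_cb, peptide_cb, domain_atoms, peptide_atoms):
    """>= 10 C-beta contacts within 12 A, or >= 5 atomic contacts within 5 A."""
    cb_contacts = count_contacts(domain_cb, peptide_cb, 12.0)
    atomic_contacts = count_contacts(domain_atoms, peptide_atoms, 5.0)
    return cb_contacts >= 10 or atomic_contacts >= 5


def shared_hmm_positions(positions_a, positions_b):
    """N_hmm: number of Pfam HMM positions contacted in both interfaces."""
    return len(set(positions_a) & set(positions_b))


def may_merge(positions_a, positions_b, rmsd_pep, min_shared=3, max_rmsd=10.0):
    """Merge criterion: N_hmm >= 3 and RMSD_pep <= 10 A."""
    n_hmm = shared_hmm_positions(positions_a, positions_b)
    return n_hmm >= min_shared and rmsd_pep <= max_rmsd
```

For instance, `may_merge({1, 2, 3, 4}, {2, 3, 4, 9}, rmsd_pep=2.5)` is true because three HMM positions are shared and the RMSD is under 10 Å, whereas two interfaces sharing only one position would not be merged regardless of RMSD.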
We used several criteria to define Professional Peptide Binding Domains (PPBDs):
1. The primary function of the domain must be peptide-binding in most proteins that contain the domain. Some domains primarily perform other functions such as binding DNA or other folded protein domains, and their functions are modified by peptide binding; these are excluded from PPBDs.
2. In most cases, there is a common motif, often confined to one amino acid position, that demonstrates that peptide binding was a function of the common ancestor of proteins that contain the domain.
3. There must be at least 3 human proteins that contain the domain.
4. We exclude repeat proteins (e.g. TPR repeats) that have evolved the ability to bind peptides multiple times in a manner consistent with convergent evolution.
We exclude domains for which peptide-binding includes catalytic modification of the peptide, which includes proteases and enzymes that add or remove post-translational modifications. least missing coordinates of a PDB entry, marked by "pdb" (e.g. PK_pdb.tar.gz); (ii) the best domain of a unique sequence, marked by "unp" (e.g. PK_unp.tar.gz); (iii) the best domain of a crystal form, marked by "cryst" (e.g. PK_cryst.tar.gz). The "unp" and "cryst" files contain fewer a PyMOL script to study the interaction between a Pfam and this ligand.
Interactions of user sequences. A user can input one or two sequences on the ProtCID web site to retrieve common interfaces between them or between homologues with the same Pfams
Protein-protein interactions. A user can input a list of UniProt codes to identify interactions among them (http://dunbrack2.fccc.edu/ProtCiD/Search/Uniprots.aspx). There are two ways to identify interactions: "First to All" and "All to All". "First to All" refers to the interactions between the first protein in a list and all other proteins in the list (proteins 2 to N). These results may be of interest if the first protein is a hub protein with many protein interactors. "All to All" means the pairwise interactions among a list of input proteins, which is more useful for large protein complexes or complicated pathways with uncertain connections.
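The difference between the two search modes comes down to which protein pairs are queried. A minimal Python sketch follows; the function `query_pairs` and the placeholder identifiers are hypothetical, and self-pairs (homodimers) are assumed to be excluded, which the text does not specify.

```python
from itertools import combinations


def query_pairs(uniprot_ids, mode="First to All"):
    """Enumerate the protein pairs implied by each search mode.

    Assumption: self-pairs are not included in either mode.
    """
    if not uniprot_ids:
        return []
    if mode == "First to All":
        # The first protein (e.g. a hub) against proteins 2 to N.
        hub, *partners = uniprot_ids
        return [(hub, partner) for partner in partners]
    if mode == "All to All":
        # Every unordered pair among the N input proteins.
        return list(combinations(uniprot_ids, 2))
    raise ValueError(f"unknown mode: {mode!r}")


# Placeholder identifiers, not real UniProt accessions.
ids = ["HUB", "PARTNER_1", "PARTNER_2", "PARTNER_3"]
first_to_all = query_pairs(ids, "First to All")
all_to_all = query_pairs(ids, "All to All")
```

For N input proteins, "First to All" yields N - 1 pairs while "All to All" yields N(N - 1)/2, which is why the latter suits complexes whose internal connections are unknown.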
A user can choose interface types to search for: either "Interfaces on Pfams" or "Interfaces on Structures". "Interfaces on Structures" means ProtCID only returns structures that contain the actual user-input UniProt sequences (i.e. not their homologues as defined by Pfam). This may be useful as a first search to determine whether the query sequences are actually in the PDB in heterocomplex structures. The "Interfaces on Pfams" search means ProtCID will return interfaces from the PDB between any pairs of homologous proteins of the query sequences. The UniProts are assigned to Pfams either by the ProtCID database, or by running itself. An interaction is represented as an edge, with the number of structures as the weight.
Second, in Pfam interaction networks (http://dunbrack2.fccc.edu/ProtCiD/Browse/PfamBioNet.aspx) a Pfam interacts with at least one other Pfam, where an interaction refers to a domain interface cluster with at least two crystal forms and minimum sequence identity < 90%. A graph XML file is generated, in which a node is a Pfam and an edge is the interaction between two Pfams. The edge weight is the number of crystal forms of the biggest interface cluster. The label of an edge gives the number of crystal forms of the biggest cluster (numerator), the number of crystal forms of the interaction (denominator), and the minimum sequence identity in parentheses.
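As an illustration of how such edges could be derived from cluster records, here is a minimal Python sketch. It does not reproduce the graph XML file or the ProtCID schema: the record keys, the function `build_pfam_network`, the label format, and the example numbers are all assumptions, and "biggest cluster" is taken to mean the biggest cluster that passes the two filters.

```python
def build_pfam_network(clusters, min_forms=2, identity_cutoff=90.0):
    """Build Pfam-Pfam edges from domain interface cluster records.

    clusters: iterable of dicts with hypothetical keys
      pfam_a, pfam_b      - the two interacting Pfams
      cluster_forms       - crystal forms in this cluster
      interaction_forms   - crystal forms containing this Pfam pair at all
      min_identity        - minimum sequence identity in the cluster (percent)
    Returns a dict mapping a sorted Pfam pair to its edge weight and label.
    """
    edges = {}
    for c in clusters:
        # Keep clusters with >= 2 crystal forms and minimum identity < 90%.
        if c["cluster_forms"] < min_forms or c["min_identity"] >= identity_cutoff:
            continue
        key = tuple(sorted((c["pfam_a"], c["pfam_b"])))
        best = edges.get(key)
        # The edge is described by the biggest qualifying cluster.
        if best is None or c["cluster_forms"] > best["weight"]:
            label = (
                f'{c["cluster_forms"]}/{c["interaction_forms"]} '
                f'({c["min_identity"]:.0f}%)'
            )
            edges[key] = {"weight": c["cluster_forms"], "label": label}
    return edges


# Hypothetical records; the counts and identities are invented for illustration.
records = [
    {"pfam_a": "MHC_I", "pfam_b": "C1_set", "cluster_forms": 12,
     "interaction_forms": 15, "min_identity": 35.0},
    {"pfam_a": "C1_set", "pfam_b": "MHC_I", "cluster_forms": 3,
     "interaction_forms": 15, "min_identity": 40.0},
    {"pfam_a": "SH2", "pfam_b": "Pkinase", "cluster_forms": 1,
     "interaction_forms": 4, "min_identity": 50.0},
    {"pfam_a": "SH2", "pfam_b": "SH2", "cluster_forms": 5,
     "interaction_forms": 6, "min_identity": 95.0},
]
network = build_pfam_network(records)
```

Only the (C1_set, MHC_I) pair survives in this example: the SH2/Pkinase cluster has a single crystal form, and the SH2/SH2 cluster has a minimum sequence identity above the cutoff.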