Abstract
Initial protein structural comparisons were sequence-based. Since amino acids that are distant in the sequence can be close in the 3-dimensional (3D) structure, 3D contact approaches can complement sequence approaches. Traditional 3D contact approaches study 3D structures directly and are alignment-based. Instead, 3D structures can be modeled as protein structure networks (PSNs). Then, network approaches can compare proteins by comparing their PSNs. These can be alignment-based or alignment-free. We focus on the latter. Existing network alignment-free approaches have drawbacks: 1) They rely on naive measures of network topology. 2) They are not robust to PSN size. They cannot integrate 3) multiple PSN measures or 4) PSN data with sequence data, although this could improve comparison because the different data types capture complementary aspects of the protein structure. We address this by: 1) exploiting well-established graphlet measures via a new network alignment-free approach, 2) introducing normalized graphlet measures to remove the bias of PSN size, 3) allowing for integrating multiple PSN measures, and 4) using ordered graphlets to combine the complementary PSN data and sequence (specifically, residue order) data. We compare synthetic networks and real-world PSNs more accurately and faster than existing network (alignment-free and alignment-based), 3D contact, or sequence approaches.
Introduction
Motivation and related work
Proteins perform important cellular functions. While understanding protein function is clearly important, doing so experimentally is expensive and time-consuming^{1,2}. Because of this, the functions of many proteins remain unknown^{2,3}. Consequently, computational prediction of protein function has received attention. In this context, protein structural comparison (PC) aims to quantify similarity between proteins with respect to their sequence or 3-dimensional (3D) structural patterns. Then, functions of unannotated proteins can be predicted based on functions of similar, annotated proteins. By “function”, we mean traditional notions of protein function, such as its biological process, molecular function, or cellular localization^{4}, or any protein characteristic (e.g., length, hydrophobicity/hydrophilicity, or folding rate), as long as the given characteristic is expected to correlate well with the protein structure. In this study, we propose a new PC approach, which we evaluate in an established way: by measuring how accurately it captures expected (dis)similarities between known groups of structurally (dis)similar proteins^{5}, such as protein structural classes from Class, Architecture, Topology, Homology (CATH)^{6,7}, or Structural Classification of Proteins (SCOP)^{8}. Application of our proposed PC approach to protein function prediction is out of the scope of the current study and is the subject of future work.
Early PC relied on sequence analyses^{9,10,11}. Due to advancements in high-throughput sequencing technologies, rich sequence data are available for many species, and thus, comprehensive sequence pattern searches are possible. However, amino acids that are distant in the linear sequence can be close in 3D structure. Thus, 3D structural analyses can reveal patterns that might not be apparent from the sequence alone^{12}. For example, while high sequence similarity between proteins typically indicates their high structural and functional similarity^{3}, proteins with low sequence similarity can still be structurally similar and perform similar functions^{13,14}. In this case, 3D structural approaches, unlike sequence approaches, can correctly identify structurally and thus functionally similar proteins. On the other extreme, proteins with high sequence similarity can be structurally dissimilar and perform different functions^{15,16,17,18,19}. In this case, 3D structural approaches, unlike sequence approaches, can correctly identify structurally and thus functionally different proteins.
3D structural approaches can be categorized into traditional 3D contact approaches, which are alignment-based, and network approaches, which can be alignment-based or alignment-free. By alignment-based (3D contact or network) approaches, we mean approaches whose main goal is to map amino acid residues between the compared proteins in a way that conserves the maximum amount of common substructure. In the process, alignment-based approaches can and typically do quantify similarity between the compared protein structures, and they do so under their resulting residue mappings. Given this, alignment-based approaches can be and have been used in the task of PC as we define it^{5,20}, even though they are not necessarily directly designed for this task. On the other hand, by alignment-free (network) approaches, we mean approaches whose main goal is to quantify similarity between the compared protein structures independent of any residue mapping, typically by extracting from each structure some network patterns (also called network properties, network features, network fingerprints, or measures of network topology) and comparing the patterns between the structures. Alignment-free approaches are directly designed for the task of PC as we define it. We note that there exist approaches that are alignment-free but not network-based^{21}, which are out of the scope of our study. Below, we discuss 3D contact alignment-based PC approaches, followed by network alignment-based PC approaches, followed by network alignment-free PC approaches.
3D contact alignment-based PC approaches are typically rigid-body approaches^{22,23}, meaning that they treat proteins as rigid objects. Such approaches aim to identify alignments that satisfy two objectives: 1) they maximize the number of mapped residues and 2) they minimize deviations between the mapped structures (with respect to, e.g., Root Mean Square Deviation). Different rigid-body approaches mainly differ in how they combine these two objectives. There exist many approaches of this type^{5,24,25,26}. Prominent ones are DaliLite^{27} and TMalign^{28}. These two approaches have explicitly been used and evaluated in the task of PC as we define it^{5,20}, and are thus directly relevant for our study.
Network alignment-based PC approaches are typically flexible alignment methods, meaning that they treat proteins as flexible (rather than rigid) objects, because proteins can undergo large conformational changes. These approaches align local protein regions in a rigid-body manner but account for flexibility by allowing for twists between the locally aligned regions^{5,29}. Also, these approaches are typically of the Contact Map Overlap (CMO) type. That is, first, they represent a 3D protein structure consisting of n residues as a contact map, i.e., an n × n matrix C, where position C_{ij} has a value of 1 if residues i and j are close enough and are thus in contact, and it has a value of 0 otherwise. Note that contact maps are equivalent to protein structure networks (PSNs), in which nodes are residues and edges link spatially close amino acids^{30}. Second, CMO approaches aim to compare contact maps of two proteins, in order to align a subset of residues in one protein to a subset of residues in another protein in a way that maximizes the number of common contacts and also conserves the order of the aligned residues^{31}. Prominent CMO approaches are Apurva, MSVNS, AlEigen7, and GRAlign^{5}. When evaluated in the task of PC as we define it, i.e., when used to compare proteins labeled with structural classes of CATH or SCOP, GRAlign outperformed Apurva, MSVNS, and AlEigen7 in terms of both accuracy and running time^{5}. So, we consider GRAlign to be the state-of-the-art CMO (i.e., network alignment-based) approach. In addition to these network alignment-based approaches, GRAlign was evaluated in the same manner against the existing 3D contact alignment-based approaches (DaliLite and TMalign mentioned above, as well as three additional approaches, MATT, Yakusa, and FAST)^{5}. In terms of running time, GRAlign was the fastest. In terms of accuracy, GRAlign was superior to MATT, Yakusa, and FAST, but it was inferior or comparable to DaliLite and TMalign. So, while GRAlign remains the state-of-the-art network alignment-based PC approach, DaliLite and TMalign remain state-of-the-art 3D contact alignment-based PC approaches, and we continue to consider all three in our study.
Network alignment-free approaches also deal with PSN representations of compared proteins, but they aim to quantify protein similarity without accounting for any residue mapping. We propose a novel network alignment-free PC approach (see below). We first compare our approach to its most direct competitors, i.e., existing alignment-free approaches. Then, we compare our approach to existing alignment-based approaches. We recognize that evaluation of alignment-free against alignment-based approaches should be taken with caution^{32,33}, because the two approach types quantify protein similarity differently (see above). Yet, as we show later in our evaluation, existing alignment-based PC approaches are superior to existing alignment-free PC approaches and are thus our strongest (though not necessarily fairest) competitors.
Next, we discuss existing network alignment-free PC approaches. Such approaches have already been developed^{14,34} to compare network topological patterns within a protein or across proteins, for example to study differences in network properties between transmembrane and globular proteins, analyze the packing topology of structurally important residues in membrane proteins, or refine homology models of transmembrane proteins^{35,36,37}. Existing network alignment-free PC approaches, however, have the following limitations:

1) They rely on naive measures of network topology, such as average degree, average or maximum distance (diameter), or average clustering coefficient of a network, which capture the global view of a network but ignore complex local interconnectivities that exist in real-world networks, including PSNs^{38,39,40}.

2) They can bias PC by PSN size: networks of similar topology but different sizes can be mistakenly identified as dissimilar by the existing approaches because of their size differences alone.

3) Because different network measures quantify the same PSN topology from different perspectives^{41}, and because each existing approach uses a single measure, PC could be biased towards the perspective captured by the given measure.

4) They ignore valuable sequence information (also, the existing sequence approaches ignore valuable PSN information).
Our contributions
We present a new network alignment-free PC framework that overcomes the above limitations. Specifically:

1. We extend graphlets^{42,43}, a sensitive measure of local network topology, to improve PC. Graphlets have already been used for alignment-free comparisons of protein-protein interaction networks^{44,45,46}, via graphlet degree distribution agreement (GDDA)^{43}, relative graphlet frequency distance (RGFD)^{42}, and graphlet correlation distance (GCD)^{47} approaches. While these three graphlet approaches can trivially be used in the task of PC (which we do in our study), we use graphlets differently. Namely, we summarize the topology of a PSN (i.e., the corresponding protein) into a graphlet vector. Then, we use graphlet vectors of all considered PSNs as input to principal component analysis (PCA) to reduce the dimensionality of the vectors in a way that keeps only the most important (discriminative) information from the vectors. Finally, we quantify the similarity between the PSNs by comparing their PCA dimensionality-reduced graphlet vectors. The combination of graphlets and PCA is a novel PC approach.

2. We perform graphlet normalization to address the bias of PSN size.

3. We allow for integrating different and complementary network topological measures within our PCA framework.

4. We extend the idea of ordered graphlets^{5} to integrate the PSN amino acid interconnectivity information with sequence (i.e., residue order) information. Ordered graphlets were already used by GRAlign^{5}. However, even though GRAlign is also a network PC approach and also uses ordered graphlets, one key difference between GRAlign and our work is that GRAlign is alignment-based, while our graphlet PCA framework is alignment-free. This is expected to provide drastic speed-up compared to GRAlign (and it does, as we verify in our evaluation), because our approach does not need to actually map residues between compared proteins. Yet, if one wished to actually map residues, using an alignment-free approach such as ours would not suffice. Another key difference between GRAlign and our work is that GRAlign uses only up to 3-node ordered graphlets. Using larger graphlets can be beneficial in many real-world contexts^{41,44,45,46}. Hence, we extend the existing notion of 3-node ordered graphlets both theoretically and implementation-wise to be able to deal with larger graphlets. Additionally, we introduce a novel concept of “long-range(K)” ordered graphlets to give higher importance to amino acids that are close enough in the protein 3D structure but are at least K amino acids apart in the protein sequence, because such longer-range interactions could help distinguish protein structures better^{12,48}. We include the extended idea of (larger size, normalized, and “long-range(K)”) ordered graphlets into our software (available at: http://nd.edu/~cone/PSN/).
We refer to our entire PC framework as GRAFENE (graphlet-based alignment-free network comparison), which has nine different approach versions, depending on graphlet size, whether graphlets are normalized, whether graphlets are ordered, and whether the “long-range(K)” constraint is considered (see Fig. 1 and Methods).
We study two network types: synthetic networks (in order to illustrate wide applicability of our approach across many domains) and real-world PSNs (in order to illustrate a specific application of our approach in the task of PC). For each network type, we analyze multiple data sets. In each data set, each network has a known label, meaning that we know that networks having the same label should be identified as similar, while networks having different labels should be identified as dissimilar. For synthetic networks, labels correspond to different random graph models. For real-world PSNs, labels correspond to structural classes from CATH or SCOP, where we study all four levels of CATH and SCOP hierarchies (see Methods). We study 21 network approaches (Fig. 1): the nine versions of our proposed GRAFENE approach and 12 existing approaches. Of the 12 existing network approaches, four use graphlets, namely GDDA, RGFD, GCD, and GRAlign, and eight do not use graphlets. Also, of the 12 existing network approaches, only GRAlign is alignment-based, and the others are alignment-free. In addition to the 21 network approaches, we also study the two prominent 3D contact alignment-based approaches (DaliLite and TMalign) and a sequence-based approach (namely, amino acid composition, AAComposition, which has been used in several tasks, including PC^{9}, i.e., discriminating between different structural classes and folding types^{49}, as well as protein function prediction^{50}). Note that not all of the 21 + 2 + 1 = 24 approaches are applicable to synthetic networks. This is because nine of the 24 approaches require sequence residue order or 3D structural information, which synthetic networks do not contain. All 24 approaches are applicable to real-world PSNs.
Given a data set and an approach, we compute similarity/distance between each pair of networks/proteins. We evaluate each approach by measuring how accurately it can identify networks/proteins of the same label as similar and networks/proteins of different labels as dissimilar. We measure this by computing the area under the precision-recall curve (AUPR) and the area under the receiver operating characteristic curve (AUROC), which is an established PC evaluation framework that was used by, e.g., GRAlign^{5}. Also, we evaluate each method’s running time. For details, see Methods.
Our key findings are as follows. For synthetic networks, our GRAFENE approach (the larger size, normalized, and regular, i.e., non-ordered, graphlet version, NormGraphlet35) is superior to all eight evaluated existing alignment-free network approaches that have been used in the task of PC, none of which use graphlets. This already justifies the introduction of our approach. Also, GRAFENE (the same version, NormGraphlet35) is superior to all three evaluated existing graphlet alignment-free approaches that are general-purpose network comparison tools (GDDA, RGFD, and GCD), indicating that our approach is advantageous for general-purpose network comparison tasks. Similarly, for real-world PSNs, GRAFENE (the larger size, normalized, ordered graphlet, and “long-range(K)” version, NormOrderedGraphlet34(K)) is superior to all evaluated approaches. For details, see Results.
Methods
Because of space constraints, we provide only the most relevant methodological information here. All additional information, in enough detail to ensure the reproducibility of our results, is presented in the Supplementary Information.
Data
We collect from the Protein Data Bank (PDB)^{51} 3D atomic structures of all 17,036 nonredundant proteins and denote this data set as ProteinPDB (Supplementary Section S1). A protein is typically composed of one or more domains (a domain refers to a part of a protein structure that can fold and often function independently). Class, Architecture, Topology, Homology (CATH) and Structural Classification of Proteins (SCOP) are independent databases of categorized (annotated) protein domains^{6,7,8}.
Forming networks
We evaluate the considered approaches in the task of PC on: 1) synthetic networks, i.e., artificially generated networks for which we know the topology-based ground truth categorization, and 2) real-world PSNs, for which we know CATH- or SCOP-based categorizations that we hypothesize correlate well with the PSNs’ topology-based characteristics.
Synthetic networks
We generate synthetic networks by using different network models. A good general approach should identify networks from the same model (i.e., with the same label) as similar, and it should identify networks from different models (i.e., with different labels) as dissimilar. We use three well-established models: Erdös-Rényi random graphs (ER), geometric random graphs (GEO), and scale-free random graphs (SF)^{38,40}.
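As an illustration only, the three model families could be generated along the following lines. This is a minimal sketch, not the procedure used in this study: the function names, the parameters (`p`, `radius`, `m`), the unit-square embedding for GEO, and the preferential-attachment construction for SF are all assumptions made for the example; the actual model parameters are those given in Supplementary Section S2.

```python
import math
import random
from itertools import combinations

def erdos_renyi(n, p, rng):
    """ER sketch: include each possible edge independently with probability p."""
    return {(u, v) for u, v in combinations(range(n), 2) if rng.random() < p}

def geometric(n, radius, rng):
    """GEO sketch: place nodes uniformly in the unit square (assumed embedding)
    and link pairs whose Euclidean distance is at most `radius`."""
    pts = [(rng.random(), rng.random()) for _ in range(n)]
    return {(u, v) for u, v in combinations(range(n), 2)
            if math.dist(pts[u], pts[v]) <= radius}

def scale_free(n, m, rng):
    """SF sketch via preferential attachment (an assumed construction):
    each new node links to m distinct existing nodes chosen with
    probability proportional to their current degree."""
    edges = set(combinations(range(m + 1), 2))   # seed clique on m + 1 nodes
    targets = [v for e in edges for v in e]      # each node listed once per incident edge
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(targets))
        for t in chosen:
            edges.add((t, new))
            targets += [t, new]
    return edges

rng = random.Random(0)
sf_edges = scale_free(50, 2, rng)
er_full = erdos_renyi(10, 1.0, rng)
er_empty = erdos_renyi(10, 0.0, rng)
geo_full = geometric(10, 2.0, rng)
```

Each generator returns an edge set over nodes `0..n-1`, so a network's label is simply the name of the function that produced it.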
First, we aim to analyze a synthetic network set in which all networks are of the same size but have different labels. We study individually three network sets of this type, which differ from each other in terms of the sizes of their networks. The three sets are: Synthetic100, Synthetic500, and Synthetic1000, reflecting the number of nodes in the network (Supplementary Section S2 and Supplementary Table S1).
Second, we aim to analyze networks of different sizes and different labels, to check whether an approach can correctly identify: 1) as similar networks from the same model despite the networks being of different sizes, and 2) as dissimilar networks from different models despite the networks being of the same size. To generate a synthetic network set of different sizes, we combine Synthetic100, Synthetic500, and Synthetic1000, resulting in Syntheticall (Supplementary Table S1).
The synthetic network sets allow us to evaluate our proposed GRAFENE approach against existing network comparison approaches without having available any sequence or 3D structural information.
Forming real-world PSNs
Each protein in ProteinPDB (defined above) is composed of 3D coordinates of the heavy atoms of its amino acids. Given a protein, we use its 3D coordinate information to construct its PSN in which nodes represent amino acid residues and edges connect pairs of amino acid residues that are sufficiently close (i.e., within a given Euclidean distance cutoff) in the protein’s 3D structure. Clearly, PSN construction depends on two parameters: 1) the atom type that is considered to represent an amino acid residue as a node in the PSN, and 2) the distance cutoff threshold that determines whether two amino acids (i.e., their atoms defined in step 1 above) are close enough and thus form an edge in the PSN. Different choices of these parameters can result in different PSNs and consequently affect the performance of a network-based PC approach. To evaluate this effect, we consider four PSN construction strategies (i.e., combinations of the above two parameters), as follows.
In three out of the four PSN construction strategies that we use, regarding the choice of atom type, we consider for the given amino acid any of its heavy atoms, as is often done^{30}. Regarding the choice of distance cutoff threshold, while effective definitions of contact between amino acids may differ from fold to fold^{52}, we use suggested distance cutoffs in the 4 Å–6 Å range (because when considering any atom type, cutoffs below 4 Å result in highly disconnected PSNs, while cutoffs beyond 6.5 Å result in random-like PSN structures^{30}). Specifically, in this range, we consider cutoffs of 4 Å, 5 Å, and 6 Å.
The remaining (fourth) PSN construction strategy that we use is the default strategy of GRAlign^{5}, which (as we show later) is the best existing network PC approach in our evaluation. Specifically, we use the α-carbon atom type and the 7.5 Å distance cutoff. For additional discussion about these parameter choices, see Supplementary Section S3.
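The construction described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than our implementation: the function name `build_psn`, the input layout (one list of atom coordinates per residue, in sequence order), and the use of an inclusive cutoff (distance ≤ cutoff) are hypothetical choices made for the example.

```python
import math
from itertools import combinations

def build_psn(residue_atoms, cutoff):
    """Return PSN edges (i, j), i < j, for residues with any atom pair within `cutoff`.

    residue_atoms: list (in sequence order) of lists of (x, y, z) coordinates.
    For the any-heavy-atom strategies, pass all heavy atoms of each residue;
    for the alpha-carbon strategy, pass a single coordinate per residue.
    """
    edges = set()
    for i, j in combinations(range(len(residue_atoms)), 2):
        if any(math.dist(a, b) <= cutoff          # inclusive cutoff is an assumption
               for a in residue_atoms[i] for b in residue_atoms[j]):
            edges.add((i, j))
    return edges

# Toy coordinates (in angstroms), one atom per residue.
toy = [[(0.0, 0.0, 0.0)], [(3.8, 0.0, 0.0)], [(7.6, 0.0, 0.0)], [(3.8, 3.0, 0.0)]]
psn_4 = build_psn(toy, 4.0)    # sparser network
psn_75 = build_psn(toy, 7.5)   # denser network from the same coordinates
```

The toy example shows how the same coordinates yield different PSNs under different cutoffs, which is the effect the four construction strategies are meant to probe.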
Real-world PSNs with CATH categorization
We analyze all 9,509 protein domains (i.e., their corresponding PSNs) from ProteinPDB that have a CATH categorization (i.e., label) (Supplementary Section S4 and Supplementary Table S2). At each of the four CATH hierarchy levels, for each CATH category (which we refer to as a PSN set), we test how well each considered PC approach can compare PSNs between the different lower-level subcategories. We analyze all \(1+3+9+6=19\) CATH PSN sets across the four levels (Fig. 2, Supplementary Section S4, and Supplementary Tables S3–S5).
Real-world PSNs with SCOP categorization
Also, we analyze all 11,451 protein domains (i.e., their corresponding PSNs) from ProteinPDB that have a SCOP categorization (Supplementary Section S5 and Supplementary Table S2). Just as with CATH, we analyze all four hierarchy levels of SCOP. This results in \(1+5+6+4\,=\,16\) SCOP PSN sets across the four levels (Fig. 2, Supplementary Section S5, and Supplementary Tables S3–S5).
Real-world PSN set groups
Over both CATH and SCOP, we analyze \(19+16=35\) PSN sets. We partition the 35 PSN sets into four PSN set groups (Fig. 2): group 1 (all \(1+1=2\) PSN sets in which we compare PSNs between the top-level categories of CATH or SCOP), group 2 (all \(3+5=8\) PSN sets in which we compare PSNs between the second-level categories of CATH or SCOP), group 3 (all \(9+6=15\) PSN sets in which we compare PSNs between the third-level categories of CATH or SCOP), and group 4 (all \(6+4=10\) PSN sets in which we compare PSNs between the fourth-level categories of CATH or SCOP). Also, we denote by “all groups” the PSN set group that contains all 35 PSN sets of CATH and SCOP.
Real-world PSNs of the same size
To study the bias of PSN size, we need a data set with PSNs of the same (or similar) network size. Hence, focusing on PSNs of α and β categories from the CATHprimary PSN set (Fig. 2), we infer three such same-size PSN sets, denoted as CATH95, CATH99, and CATH251265 (Supplementary Section S6). We denote by “equal size” the group consisting of these three PSN sets.
Our GRAFENE framework
The PCA component of GRAFENE
The novelty of GRAFENE comes from combining graphlet-based measures of network topology with principal component analysis (PCA) (Fig. 1). Yet, GRAFENE is generalizable, as it can use any measure(s). For a given measure, given a network set, GRAFENE first computes one vector per network. Then, it performs PCA (a standard dimension reduction technique) on the resulting vectors to compute principal components for each network. We pick the first r principal components, where the value of r is at least two and as low as possible so that the r components account for at least 90% of variation in the data. For every pair of networks \({N}_{i}\) and \({N}_{j}\), we compute their cosine similarity, \({s}^{cos}({N}_{i},{N}_{j})\), based on the networks’ first r principal components. We convert the similarity into distance as \({d}^{cos}({N}_{i},{N}_{j})=1-{s}^{cos}({N}_{i},{N}_{j})\). We hypothesize that, under these PCA-based distances, same-label networks will be close in the PCA space while networks of different labels will be distant. Like most of the network approaches from Fig. 1, GRAFENE performs alignment-free network comparison.
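The PCA step and the cosine distance can be sketched as follows. This is a minimal illustration, assuming NumPy and assuming that features are mean-centered but not otherwise rescaled before PCA (the text above does not specify this); the function name `pca_cosine_distances` and its parameters are hypothetical.

```python
import numpy as np

def pca_cosine_distances(vectors, min_components=2, variance_target=0.90):
    """Project one feature vector per network onto the first r principal
    components and return (pairwise cosine distance matrix, r)."""
    X = np.asarray(vectors, dtype=float)
    Xc = X - X.mean(axis=0)                       # center each feature (assumption)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)               # variance fraction per component
    cumulative = np.cumsum(explained)
    # Smallest r whose components explain at least the target variance ...
    r = int(np.searchsorted(cumulative, variance_target)) + 1
    # ... but at least `min_components`, and no more than are available.
    r = max(min_components, min(r, len(S)))
    scores = U[:, :r] * S[:r]                     # coordinates in PCA space
    norms = np.linalg.norm(scores, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                       # guard against all-zero rows
    unit = scores / norms
    return 1.0 - unit @ unit.T, r                 # d_cos = 1 - s_cos

# Toy vectors: networks 0 and 1 resemble each other, as do networks 2 and 3.
vectors = [[0.0, 0.0], [1.0, 0.1], [10.0, 10.0], [11.0, 10.1]]
D, r = pca_cosine_distances(vectors)
```

In the toy input, `D[0, 1]` is close to 0 and `D[0, 2]` is close to 2, which is the behavior one would hope for between same-label and different-label networks.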
Our graphlet measures
Graphlets are small connected induced subgraphs (Fig. 3). They have been shown to be sensitive and superior measures of topology when studying protein-protein interaction networks^{5,41,42,43,53,54}. We use graphlets as PSN measures for PC, as follows.
Graphlet counts
We count occurrences of each graphlet on up to n nodes in the given network. To investigate the best choice for n, we use counts for 3-4-node (Fig. 3) and 3-5-node graphlets, resulting in Graphlet34 and Graphlet35 measures (i.e., GRAFENE versions), respectively. Graphlet counts typically vary by orders of magnitude in real-world networks^{42}. Hence, we take the logarithms of the graphlet counts. Here, we do not consider 3-node-only graphlets, because there are only two 3-node graphlets, which may not be suitable for our PCA-based GRAFENE framework, and also because using up to 4- or 5-node graphlets improves accuracy upon using only 3-node graphlets^{44,45,46}.
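For intuition, 3-4-node graphlet counting can be sketched by brute force. This is an illustrative sketch only, not the counting method used in our software and far too slow for large networks: it enumerates all node subsets and classifies each induced subgraph by its sorted degree sequence, which happens to identify every connected graph on three or four nodes uniquely. The descriptive shape names and the `log(count + 1)` handling of zero counts are assumptions made for the example.

```python
import math
from itertools import combinations

# Sorted degree sequence of the induced subgraph -> descriptive graphlet name.
SHAPES = {
    (1, 1, 2): "path3", (2, 2, 2): "triangle",
    (1, 1, 2, 2): "path4", (1, 1, 1, 3): "star4", (2, 2, 2, 2): "cycle4",
    (1, 2, 2, 3): "tailed_triangle", (2, 2, 3, 3): "diamond",
    (3, 3, 3, 3): "clique4",
}

def count_graphlets(n_nodes, edges):
    """Count induced 3- and 4-node graphlets in an undirected graph."""
    adj = {v: set() for v in range(n_nodes)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    counts = dict.fromkeys(SHAPES.values(), 0)
    for size in (3, 4):
        for nodes in combinations(range(n_nodes), size):
            members = set(nodes)
            degs = tuple(sorted(len(adj[v] & members) for v in nodes))
            if degs in SHAPES:                    # disconnected subsets never match
                counts[SHAPES[degs]] += 1
    return counts

def log_counts(counts):
    """Log-scale the counts; the +1 offset for zero counts is an assumption."""
    return {name: math.log(c + 1) for name, c in counts.items()}

# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
counts = count_graphlets(4, [(0, 1), (1, 2), (2, 0), (2, 3)])
scaled = log_counts(counts)
```

The resulting eight values (two 3-node and six 4-node graphlets) form one vector per network, in the spirit of the Graphlet34 measure.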
Normalization of graphlet counts
Networks with similar topology can have dissimilar graphlet counts simply because of their dissimilar network sizes (see Results). To remove the bias of PSN size, we normalize graphlet counts by scaling them to the range between 0 and 1. Formally, given a network, let \({g}_{1},{g}_{2},\ldots ,{g}_{n}\) be counts of n graphlets \({G}_{1},{G}_{2},\ldots ,{G}_{n}\), respectively (\(n=8\) for 3-4-node graphlets and \(n=29\) for 3-5-node graphlets). We normalize count \({g}_{i}\) of graphlet \({G}_{i}\) as \({g}_{i}/{\sum }_{j=1}^{n}{g}_{j}\). We denote the normalized Graphlet34 and Graphlet35 measures as NormGraphlet34 and NormGraphlet35, respectively.
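The normalization formula can be illustrated with a short sketch. The function name and the toy counts below are hypothetical, and returning zeros for an all-zero count vector is an assumption made to keep the example well defined.

```python
import math

def normalize_counts(counts):
    """Scale graphlet counts g_1..g_n to g_i / sum_j g_j."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)     # assumption: empty vector stays all-zero
    return [c / total for c in counts]

# Two hypothetical networks with the same graphlet proportions but a
# tenfold difference in size map to the same normalized vector.
small = normalize_counts([20, 5, 10, 15])
large = normalize_counts([200, 50, 100, 150])
```

Here the raw count vectors differ by an order of magnitude, yet `small` and `large` coincide, which is the size bias that the normalization is meant to remove.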
Integration of graphlets with residue order in the protein sequence: ordered graphlet counts
While amino acids appear in a particular order in the sequence, graphlets were not originally designed to capture this node order information. For example, nodes in graphlet G_{1} can appear in three different orders (Fig. 3), but G_{1} cannot differentiate between them. To take advantage of both network and sequence data, ordered graphlets were recently proposed^{5}, which embed the relative order of nodes onto graphlets. For example, the three different orders of graphlet G_{1} were formulated as three different ordered graphlets: O_{1}, O_{2}, and O_{3} (Fig. 3). This way, Malod-Dognin and Pržulj^{5} defined all four possible 3-node ordered graphlets for the two possible 3-node “regular” (i.e., original non-ordered) graphlets. We denote the measure consisting of the existing four counts for 3-node ordered graphlets as OrderedGraphlet3, and we denote our normalized counterpart of OrderedGraphlet3 as NormOrderedGraphlet3 (normalization is done in the same way as explained above). Unlike for regular (non-ordered) graphlets, we do consider 3-node-only ordered graphlets within GRAFENE. We do this to compare as fairly as possible our GRAFENE approach with the existing GRAlign approach that can support only 3-node ordered graphlets^{5}. To benefit from larger graphlets, we extend this idea to include within GRAFENE all 38 possible 4-node ordered graphlets for all six possible 4-node regular graphlets on top of the existing four 3-node ordered graphlets. We denote the resulting measure consisting of 42 ordered graphlet counts for 3-4-node graphlets (Fig. 3) as OrderedGraphlet34 and its normalized counterpart as NormOrderedGraphlet34. Inclusion of ordered graphlets on five nodes would cause the number of graphlets to grow significantly (e.g., graphlet G_{9} can be formulated as 60 different ordered graphlets). Since using too many measures often causes overfitting, which can eventually lead to increased error rate^{55}, we do not consider 5-node ordered graphlets. Note that ordered graphlet counts do not vary by orders of magnitude in our data as regular graphlet counts do, so we do not take their logarithms.
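To make the notion of node order concrete, the four 3-node ordered graphlets can be counted with the following brute-force sketch. It assumes that nodes are numbered by sequence position; the function name and the descriptive keys (which indicate where the center of the 3-node path falls in the sequence order) are hypothetical and are not claimed to correspond to particular O_{i} labels in Fig. 3.

```python
from itertools import combinations

def count_ordered_3node(n_residues, edges):
    """Count induced 3-node ordered graphlets; node index = sequence position."""
    edge_set = {frozenset(e) for e in edges}

    def linked(a, b):
        return frozenset((a, b)) in edge_set

    counts = {"path_center_first": 0, "path_center_middle": 0,
              "path_center_last": 0, "triangle": 0}
    for i, j, k in combinations(range(n_residues), 3):   # guarantees i < j < k
        ij, jk, ik = linked(i, j), linked(j, k), linked(i, k)
        if ij and jk and ik:
            counts["triangle"] += 1
        elif ij and jk:
            counts["path_center_middle"] += 1   # path centered on j
        elif ij and ik:
            counts["path_center_first"] += 1    # path centered on i
        elif jk and ik:
            counts["path_center_last"] += 1     # path centered on k
    return counts

# Toy PSN: backbone-like contacts 0-1 and 1-2, plus a contact between 0 and 3.
ordered = count_ordered_3node(4, [(0, 1), (1, 2), (0, 3)])
```

A regular (non-ordered) count would report two 3-node paths for this toy network; the ordered count distinguishes them by where the path's center lies in the sequence.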
Even though ordered graphlets capture relative sequence positions of interacting amino acids, they do not capture how far apart those amino acids are in the sequence. While there have been conflicting findings regarding the effect of long-range interactions on secondary structure prediction accuracy^{12}, we hypothesize that amino acids that are close enough in the protein 3D structure but are far away in the protein sequence are more important than amino acids that are close enough in the 3D structure simply because they are also close in the sequence. To evaluate this hypothesis, we propose a novel concept of “long-range(K)” ordered graphlets, where the “long-range(K)” constraint is introduced so that a given ordered graphlet is identified in the given PSN if and only if: 1) the same ordered graphlet would also be identified in the above described analysis, and 2) every pair of amino acids that are linked by an edge in the graphlet are at least K positions apart in the sequence (that is, the absolute difference between the sequence positions of the two amino acids is at least K). Supplementary Fig. S1 illustrates this concept. Clearly, all graphlets identified under the “long-range(K)” ordered graphlet approach will also be identified under the traditional ordered graphlet approach, but the opposite is not necessarily true. As a proof of concept, we apply the concept of “long-range(K)” ordered graphlets to NormOrderedGraphlet34 (which, as we show in Results, is the best of all graphlet features), and we denote the new measure as NormOrderedGraphlet34(K). To evaluate the performance of NormOrderedGraphlet34(K), we vary K from one to 10 in increments of one and from 10 to 35 in increments of five. Then, for each considered data set, we report results for the value of K that results in the best PC accuracy (for details, see Results).
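The two conditions of the “long-range(K)” constraint can be sketched for 3-node ordered graphlets as follows. As before, this is an illustrative brute-force sketch with hypothetical names, assuming that node indices are sequence positions. Note that the graphlet is first identified as an induced subgraph of the full PSN and only then checked against K; this is not the same as deleting short-range edges from the PSN before counting.

```python
from itertools import combinations

def count_long_range_ordered_3node(n_residues, edges, K):
    """Count 3-node ordered graphlets whose every edge spans >= K sequence positions."""
    edge_set = {frozenset(e) for e in edges}
    counts = {"path_center_first": 0, "path_center_middle": 0,
              "path_center_last": 0, "triangle": 0}
    for i, j, k in combinations(range(n_residues), 3):   # i < j < k
        present = [(a, b) for a, b in ((i, j), (j, k), (i, k))
                   if frozenset((a, b)) in edge_set]
        if len(present) < 2:
            continue                      # condition 1 fails: not a connected graphlet
        if any(b - a < K for a, b in present):
            continue                      # condition 2 fails: some edge is too short-range
        if len(present) == 3:
            counts["triangle"] += 1
        elif (i, k) not in present:
            counts["path_center_middle"] += 1
        elif (j, k) not in present:
            counts["path_center_first"] += 1
        else:
            counts["path_center_last"] += 1
    return counts

toy_edges = [(0, 1), (1, 2), (0, 5), (1, 5), (0, 3), (3, 5)]
k1 = count_long_range_ordered_3node(6, toy_edges, K=1)   # same as unconstrained counting
k2 = count_long_range_ordered_3node(6, toy_edges, K=2)   # drops graphlets with adjacent-residue edges
```

With `K=1` the constraint is vacuous, and increasing K can only remove graphlets, in line with the observation that every “long-range(K)” graphlet is also a traditional ordered graphlet.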
Existing approaches
We use 15 existing network, 3D contact, and sequence approaches in the task of PC (Fig. 1). Due to space constraints, we describe these approaches in Supplementary Section S7.
Evaluation of PC accuracy
Given a set of objects (proteins or networks) with known labels, the distance between objects of the same label should ideally be small, while the distance between objects of different labels should ideally be large. To evaluate this, we rely on an established unsupervised strategy^{32}. By “unsupervised”, we mean that we rely on object labels only in the phase of evaluating a method’s output. That is, we do not use any label information to train the given method or produce its output, as a supervised (classification) approach would do. Specifically, for each considered approach, we first compute the distance between each pair of objects. Then, we sort all object pairs in order of increasing distance and consider the k closest object pairs, where we vary k from 0% to 100% of all pairs in increments of 0.1%. Next, we compute the accuracy in terms of precision and recall, where precision is the fraction of label-matching object pairs out of the considered object pairs, and recall is the fraction of the considered label-matching object pairs out of all label-matching object pairs. To summarize the precision and recall results over the whole [0–100%] range of k, we measure overall accuracy of the given approach by computing AUPR. Alternatively, we compute the accuracy in terms of sensitivity and specificity, where sensitivity is the fraction of the considered label-matching object pairs out of all label-matching object pairs, and specificity is the fraction of the considered non-label-matching object pairs out of all non-label-matching object pairs. To summarize the sensitivity and specificity results over the whole [0–100%] range of k, we measure overall accuracy of the given approach by computing AUROC. Given a data set, we compare different approaches by comparing their AUPR or AUROC values.
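The evaluation strategy can be sketched as follows. This is a minimal illustration, not our evaluation code: it sweeps the threshold one pair at a time rather than in 0.1% increments, does not treat tied distances specially, integrates the precision-recall curve step-wise and the ROC curve by the trapezoid rule, and requires at least one label-matching and one non-label-matching pair. The function name `aupr_auroc` is hypothetical.

```python
import math

def aupr_auroc(distances, same_label):
    """distances[p]: distance of object pair p; same_label[p]: True if its labels match."""
    order = sorted(range(len(distances)), key=lambda p: distances[p])
    n_pos = sum(1 for s in same_label if s)
    n_neg = len(same_label) - n_pos
    tp = fp = 0
    prev_recall = prev_fpr = 0.0
    aupr = auroc = 0.0
    for p in order:                      # grow the set of "k closest pairs" by one
        if same_label[p]:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / n_pos              # also the sensitivity
        fpr = fp / n_neg                 # considered non-matching / all non-matching
        aupr += (recall - prev_recall) * precision
        auroc += (fpr - prev_fpr) * (recall + prev_recall) / 2
        prev_recall, prev_fpr = recall, fpr
    return aupr, auroc

# Best case: both label-matching pairs are closer than both non-matching pairs.
best = aupr_auroc([0.1, 0.2, 0.8, 0.9], [True, True, False, False])
# Worst case: the ranking is reversed.
worst = aupr_auroc([0.1, 0.2, 0.8, 0.9], [False, False, True, True])
```

A perfect ranking yields AUPR and AUROC of 1, while a fully reversed ranking yields an AUROC of 0.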
Results
Comparison of synthetic networks
We first analyze synthetic networks to demonstrate the general applicability of our PCA-driven GRAFENE approach, compared to the 15 existing network-only approaches (i.e., no 3D structural, sequence, or node order information available), some of which have been used in tasks different from our considered task of PC. Because we show that these approaches cannot successfully cope with synthetic networks of different sizes, we develop a normalized version of GRAFENE, as follows.
Evaluation of non-normalized network measures
We evaluate non-normalized versions of our GRAFENE approach (Graphlet34 and Graphlet35), existing graphlet approaches (GDDA, RGFD, and GCD), and existing non-graphlet approaches (average degree, average distance, maximum distance, average closeness centrality, average clustering coefficient, intra-hub connectivity, assortativity, and Existingall), on synthetic network data sets in which all networks are of the same size (Synthetic100, Synthetic500, and Synthetic1000); see Methods.
For each data set, both non-normalized GRAFENE versions outperform the existing graphlet and non-graphlet approaches, as the former two always achieve 100% accuracy (Fig. 4 and Supplementary Tables S6,S7). Some existing methods also achieve 100% accuracy on some of the data sets, but only one (RGFD) does so on all three data sets. However, RGFD loses its comparable performance in other tests (see below). Note that the graphlet (PCA and non-PCA) approaches outperform all seven existing non-graphlet approaches that have been used for PC (Fig. 4 and Supplementary Tables S6,S7). Even so, combining the seven measures into Existingall and using Existingall in our PCA framework improves upon the accuracy of each individual non-graphlet measure. Although Existingall is comparable to our GRAFENE approach (Fig. 4 and Supplementary Tables S6,S7), it also loses its comparable performance in other tests (see below).
Network size affects comparison via non-normalized measures
To test whether the considered approaches are robust to the sizes of compared networks, we evaluate them on the Syntheticall set that contains networks of different topologies and different sizes. For these evaluations, we observe a decline in accuracy for each approach (Fig. 4 and Supplementary Tables S6,S7).
Normalization of graphlet measures improves comparison
Given that the accuracy appears biased by network size, we use the normalized GRAFENE versions (NormGraphlet34 and NormGraphlet35), which we hope will preserve maximum (100%) accuracy for the three equal-size network sets (Synthetic100, Synthetic500, and Synthetic1000) while improving accuracy for Syntheticall that contains networks of different sizes. Indeed, this is exactly what we observe (Fig. 4 and Supplementary Tables S6,S7). Now our NormGraphlet35 GRAFENE version outperforms each of the three existing general-purpose graphlet (non-PCA) approaches, including RGFD, suggesting that henceforth GRAFENE should be used for general-purpose network comparison. Also, now our NormGraphlet35 GRAFENE version outperforms the non-graphlet Existingall approach under the same PCA framework.
Comparison of PSNs
Unlike the previous synthetic data, protein structure networks (PSNs) have 3D structural and sequence-based information. Recall that we use four PSN construction strategies (Section Methods), to evaluate whether the performance of the network-based PC approaches depends on the choice of PSN construction strategy. Below, we show that this is not the case, i.e., the method comparison results are robust across the different choices. Yet, to give the best-case advantage to each PC approach, unless otherwise noted, for each PC approach and each PSN set, we report results for the best PSN construction strategy.
With this in mind, first, we evaluate all approaches (i.e., their existing versions that are non-normalized in terms of network size) on the three PSN sets that form the “equal size” PSN set group, each of which contains PSNs of the same size (Section Methods). Second, we test the approaches on all 35 PSN sets that form the “all groups” PSN set group, each of which contains PSNs of different sizes (Section Methods). Third, we test whether graphlet normalization improves PC on the different-size PSN sets. Fourth, to investigate whether the integration of network topology with protein sequence (i.e., residue order) data can improve PC, we test the ordered graphlet version of our GRAFENE approach, including the effect of the “longrange(K)” constraint. Here, we compare the accuracy of the best of our GRAFENE versions (the one with normalized ordered graphlets and the “longrange(K)” constraint, NormOrderedGraphlet34(K)) against that of the existing network, 3D contact, and sequence approaches. Fifth, in order to see whether the results vary across the four different levels of the CATH or SCOP hierarchies, we break down the above analyses (that are for the “all groups” PSN set group) into per-level analyses, i.e., into analyses for each of PSN set groups 1–4 (Fig. 2 and Section Methods). All of the analyses up to this point are for the best of the four PSN construction strategies. So, sixth, we evaluate whether the results vary across the four different PSN construction strategies. Seventh, we compare the considered PC approaches in terms of their running times (rather than accuracy, as up to this point). Eighth, we summarize our key findings. The eight items are discussed in the following eight subsections.
Evaluation of non-normalized measures
Here, we benchmark the non-normalized versions of our PCA-driven GRAFENE approach, existing graphlet (non-PCA) network approaches, existing non-graphlet network approaches, existing 3D contact approaches, and the existing sequence approach on all PSN sets for which the networks within the given set are of the same size, i.e., CATH95, CATH99, and CATH251265. For each PSN set, just as for the synthetic networks, the non-normalized versions of GRAFENE (Graphlet34 and Graphlet35) are superior to all existing approaches (Fig. 5 and Supplementary Tables S8,S9). Again, combining the seven existing non-graphlet measures into Existingall typically improves upon the accuracy of each individual measure (Fig. 5 and Supplementary Tables S8,S9). Existingall is superior to the non-normalized GRAFENE versions on only one of the three “equal size” PSN sets.
Network size affects comparison via non-normalized measures
Next, we evaluate the same non-normalized PC approaches on all 35 real-world PSN sets of different network sizes (Section Methods). As with the synthetic data, we observe a decline in accuracy for most of the PC approaches (Fig. 6) compared to their performance on the “equal size” PSN sets (Fig. 5 and Supplementary Tables S11 and S12). Even so, the non-normalized GRAFENE versions remain superior or comparable to all existing methods except GRAlign, Existingall, and DaliLite (Fig. 6). However, as we show below, these three existing approaches lose their superiority in other tests.
Normalization of graphlet measures improves comparison
Motivated by the above results, we next evaluate the normalized versions of GRAFENE. As with the synthetic network comparisons, we hope to ideally improve or at least preserve the accuracy for the “equal size” PSN sets (CATH95, CATH99, and CATH251265) while improving the accuracy for the 35 PSN sets of different sizes, compared to the accuracy of the non-normalized counterparts. Indeed, this is exactly what we observe (Figs 5, 6(B), and Supplementary Tables S11 and S12).
Integration of network and sequence (i.e., residue order) data via ordered graphlets
Thus far, we considered GRAFENE versions that are based on regular (non-ordered) graphlets. These versions already perform better than the considered existing AAComposition sequence approach (Fig. 6 and Supplementary Tables S10–S12). Integration of network data with sequence data may further improve the performance. To test this, we rely on ordered graphlets. GRAlign already used 3-node-only ordered graphlets to impose the sequence-based order of amino acid residues onto nodes in regular graphlets. We adopt this existing idea but: 1) we do so within our alignment-free GRAFENE framework as opposed to the alignment-based GRAlign approach; 2) we extend this idea into larger, 3-4-node ordered graphlets (Fig. 3); 3) we normalize ordered graphlets; and 4) we add the “longrange(K)” constraint. We find that each of these four extensions improves PC performance, as follows.
First, when we consider 3-node-only ordered graphlets, which makes the comparison with GRAlign as fair as possible, GRAFENE (version OrderedGraphlet3) is superior to GRAlign for the three “equal size” PSN sets (Fig. 5). At the same time, GRAlign is ~25 times slower than OrderedGraphlet3 (Supplementary Table S10).
Second, when we also consider larger ordered graphlets, this further improves the performance of GRAFENE, i.e., OrderedGraphlet34 is superior to OrderedGraphlet3 (Figs 5 and 6, and Supplementary Table S10).
Third, considering only non-normalized graphlet measures, the ordered graphlet version of GRAFENE (OrderedGraphlet34) on average improves upon its regular graphlet counterpart (Graphlet34) for the three “equal size” PSN sets as well as for the 35 PSN sets of different sizes (Figs 5 and 6). Considering normalized graphlet measures, the normalized ordered graphlet version of GRAFENE (NormOrderedGraphlet34) also improves upon its non-ordered counterpart (NormGraphlet34) (Figs 5 and 6). Hence, using ordered graphlets to integrate network and residue order data is always beneficial in our evaluations.
Fourth, adding the “longrange(K)” constraint, i.e., considering the NormOrderedGraphlet34(K) version of GRAFENE, further improves accuracy (Figs 5, 6, and Supplementary Tables S11, S12). Recall that in these tests, we vary K (see Methods). The best value of K is data set-dependent. Of the three same-size PSN sets and 35 different-size PSN sets, increasing K to at least two (i.e., considering the “longrange(K)” ordered graphlet approach) improves accuracy compared to K = 1 (i.e., the traditional ordered graphlet approach) for the majority (25) of the PSN sets (Supplementary Tables S13–S36). In particular, accuracy improves for most of the PSN sets at the lower hierarchy levels of CATH or SCOP (i.e., PSN sets from groups 2–4). For the 25 PSN sets, the best value of K ranges from two to 35. Since even a value of K as high as 35 can yield better accuracy than smaller values of K, these results exemplify the importance of long-range interactions in the task of PC. Note that for the \(35-25=10\) PSN sets where increasing K to at least two does not improve accuracy, i.e., where K = 1 is superior, NormOrderedGraphlet34(K) results in the same performance as NormOrderedGraphlet34.
In terms of the comparison of our GRAFENE approach against the existing ones, the best GRAFENE version, i.e., NormOrderedGraphlet34(K), is statistically significantly superior to all considered existing network, 3D contact, and sequence approaches (with paired t-test p-values between \(4.6\times {10}^{-4}\) and \(2.46\times {10}^{-13}\) for AUPR and between \(2.44\times {10}^{-3}\) and \(7.08\times {10}^{-15}\) for AUROC; Supplementary Table S10). The fact that within our PCA-driven framework ordered graphlets beat regular graphlets alone and the considered sequence approach alone confirms that PSN data and sequence (residue order) data are complementary and should thus be integrated. We consider this to be one of our key contributions. Another of our key contributions is that our GRAFENE approach (and especially its larger-size normalized ordered graphlet “longrange(K)” version) is superior to traditional 3D contact approaches, even though both approach types (network vs. 3D contact) use 3D structural information. This highlights the usefulness of network analyses of protein structures.
Performance comparison of PC approaches is similar across different PSN set groups
The above analyses have encompassed all 35 PSN sets that form the “all groups” PSN set group. Recall that we divide the 35 PSN sets into groups 1–4, which correspond to the four hierarchy levels of CATH and SCOP (Fig. 2 and Section Methods). Here, we evaluate whether the above key results (e.g., the statistically significant superiority of our best-performing GRAFENE version, NormOrderedGraphlet34(K)) vary across the different PSN set groups. We find that this is not the case. That is, NormOrderedGraphlet34(K) still significantly outperforms (with p-values < 0.05 according to the paired t-test) all other approaches for each of the PSN set groups, except that for group 4, our approach is comparable to GRAlign (Fig. 7 and Supplementary Fig. S3).
Performance comparison of PC approaches is similar across different PSN construction strategies
Thus far, for each approach and each PSN set, we have selected the best of the four considered PSN construction strategies.
Instead, here, first, for each PC approach, we evaluate whether any one of the four PSN construction strategies is consistently better than the other three over all (or at least the majority) of the PSN sets. We find that this is not the case, i.e., the choice of the best PSN construction strategy is heavily PSN set-dependent (Supplementary Figs S8–S13). The only exceptions are the GDDA, GRAlign, and average closeness centrality approaches, which favor the fourth strategy (α-carbon atom type, 7.5 Å distance cutoff). Importantly, for the given PC approach, the performances of the different PSN construction strategies are typically within 5% of each other (Supplementary Figs S8–S13) and can thus be considered similar.
Second, we evaluate whether the above results (statistically significant superiority of our best GRAFENE version, namely NormOrderedGraphlet34(K)) vary across the different PSN construction strategies. We find that the choice of PSN construction strategy has almost no effect on the results. That is, our NormOrderedGraphlet34(K) approach still significantly outperforms (with p-values < 0.05 according to the paired t-test) all other approaches for each of the PSN construction strategies, with only two exceptions, as follows. 1) For the fourth PSN construction strategy, NormOrderedGraphlet34(K)’s performance is comparable to GRAlign’s performance with respect to AUPR (Fig. 8). However, with respect to AUROC, NormOrderedGraphlet34(K) keeps its significantly superior performance over GRAlign (Supplementary Fig. S14). 2) For the first PSN construction strategy, NormOrderedGraphlet34(K)’s performance is only marginally better than DaliLite’s performance with respect to AUROC (Supplementary Fig. S14). However, with respect to AUPR, NormOrderedGraphlet34(K) keeps its significantly superior performance over DaliLite (Fig. 8).
Running time comparison
In terms of running time, all alignment-free network approaches (including ours) are comparable to each other (running times between 0.38 and 2.41 hours) as well as to the sequence approach (running time of 0.24 hours). They are followed by the only alignment-based network approach, GRAlign (running time of 9.49 hours). All of these approaches are more than an order of magnitude faster than the two 3D contact approaches, DaliLite and TMAlign (running times of 2,021 and 168 hours, respectively). See Supplementary Table S10 for more details.
Regarding the running times of our regular (unordered) graphlet approach Graphlet34 and its ordered counterpart OrderedGraphlet34, one might intuitively expect the running time to be lower for counting ordered graphlets than for counting regular graphlets, because the order constraint decreases the number of possible graphlets that can be found in a network. Our results reveal that ordered graphlet counting (OrderedGraphlet34) is actually 5.5 times slower than regular graphlet counting (Graphlet34). While this result might seem counterintuitive, it is actually expected. This is because each regular graphlet that is found in a network corresponds to some ordered graphlet, and thus, counting ordered graphlets entails: 1) counting regular graphlets and 2) on top of that, determining, for each identified (regular) graphlet, the order of its nodes.
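The two-step nature of ordered graphlet counting can be illustrated with a brute-force Python sketch for 3-node graphlets only. It is not the counting algorithm used in the study, and it says nothing about the measured 5.5-fold slowdown. The function name `count_3node`, the adjacency-dictionary input, and the descriptive labels for the ordered types are hypothetical. We assume that nodes are integers giving residue positions and that an ordered 3-node path is characterized by whether its center node comes first, in the middle, or last in that order.

```python
from itertools import combinations


def count_3node(adj):
    """Count regular and ordered 3-node graphlets by brute force.

    adj maps each node (an integer residue position) to the set of its
    neighbours. Returns two dictionaries of counts.
    """
    regular = {"path": 0, "triangle": 0}
    ordered = {"path_center_first": 0, "path_center_middle": 0,
               "path_center_last": 0, "triangle": 0}
    for a, b, c in combinations(sorted(adj), 3):  # guarantees a < b < c
        ab, bc, ac = b in adj[a], c in adj[b], c in adj[a]
        n_edges = ab + bc + ac
        if n_edges == 3:
            # Step 1: a regular graphlet (triangle) is found.
            regular["triangle"] += 1
            # Step 2: all node orders of a triangle are equivalent.
            ordered["triangle"] += 1
        elif n_edges == 2:
            # Step 1: a regular graphlet (induced 3-node path) is found.
            regular["path"] += 1
            # Step 2 (extra work): locate the center node and its rank.
            if ab and ac:
                ordered["path_center_first"] += 1   # a is the center
            elif ab and bc:
                ordered["path_center_middle"] += 1  # b is the center
            else:
                ordered["path_center_last"] += 1    # c is the center
    return regular, ordered


# Toy network: residues 1-2-3 form a chain, residues 3, 4, 5 form a triangle.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
regular, ordered = count_3node(adj)
```

Every graphlet counted in `regular` is also counted exactly once in `ordered`, so the ordered counts always require at least the work of the regular counts plus the per-graphlet ordering step.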
Summary of results for PSNs
We use graphlet measures within the PCA framework in the task of PC. By normalizing the graphlet measures, we improve upon our non-normalized graphlet measures (Fig. 6 and Supplementary Tables S11 and S12). By imposing sequence order onto nodes via ordered graphlets, we further improve the accuracy. By distinguishing between shorter- and longer-range amino acid interactions via “longrange(K)” ordered graphlets, we additionally improve the performance. The best version of our GRAFENE approach, NormOrderedGraphlet34(K), is significantly superior to all other considered approaches in terms of its accuracy, and it is comparable to or faster than the considered approaches in terms of running time (Fig. 9, Supplementary Table S10, and Supplementary Figs S19–S24).
Note that for all network-based PC approaches, we observe better performance (higher AUPR/AUROC values) on the synthetic networks than on the real-world PSNs. One potential explanation of this behavior could be that the network categories are more precisely defined (in the sense that they correlate better with network topology) for the synthetic data than for the real-world PSN data. That is, in the case of the synthetic networks, network categories are defined by well-understood, artificial models, where networks of different categories are defined by rules of clearly distinct network models, while networks of the same category follow the same rules. Consequently, for the synthetic networks, it is highly expected that networks that belong to different categories will be topologically dissimilar, that networks that belong to the same category will be topologically similar, and that a good network approach will capture these network (dis)similarities well. On the other hand, the PSNs are categorized based on the protein structural domain categories to which they belong, which is a non-artificial and consequently likely (theoretically) imperfect network categorization process, which possibly results in lower AUPR/AUROC values compared to the synthetic networks.
Also, note that we provide in Supplementary Section S8 additional results regarding an occasional difference between the performance of: 1) different PC approaches on the same PSN sets (e.g., some approaches showing higher accuracy on SCOPα than on SCOPβ, but other approaches showing higher accuracy on SCOPβ than on SCOPα), and 2) the same PC approach on different PSN sets (e.g., the given approach’s accuracy being higher for CATHα than for CATHβ, but the same approach’s accuracy being lower for SCOPα than for SCOPβ). We provide these results and their discussion in Supplementary Section S8 rather than in the main paper due to space constraints.
Validation of graphlet PCA measures in revealing biochemically interesting PSN patterns
We aim to identify graphlet patterns that lead to successful distinction of different CATH or SCOP label categories from the PSN data, focusing as an illustration on the PSN sets containing networks of the same size (CATH95, CATH99, and CATH251265) from α or β protein domain labels. Such graphlets that are significantly (Mann-Whitney U test; p-value < 0.05) represented in α but not in β, or vice versa, could be linked to the functionality of the given domain label.
For the 3-5-node regular graphlet measure (Graphlet35), graphlets represented in α tend to be denser than those represented in β (Fig. 10). For example, all of the complete graphlets (i.e., G_{2}, G_{8}, G_{29}, which are the densest graphlets) are represented in α, while all of the path-like graphlets (i.e., G_{1}, G_{3}, G_{9}, which are the sparsest graphlets) are represented in β.
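As a worked illustration of “denser” versus “sparser”, assuming the standard definition of edge density (this arithmetic is ours and is not taken from Fig. 10): a graphlet with \(n\) nodes and \(m\) edges has density \(d=\frac{2m}{n(n-1)}\). A complete graphlet has \(m=\frac{n(n-1)}{2}\) and hence \(d=1\) for any \(n\), whereas a path on \(n\) nodes has \(m=n-1\) and hence \(d=\frac{2}{n}\), i.e., \(\frac{2}{3}\), \(\frac{1}{2}\), and \(\frac{2}{5}\) for paths on 3, 4, and 5 nodes, respectively.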
For the 3-4-node ordered graphlet measure (OrderedGraphlet34), in ordered graphlets represented in α (e.g., O_{1}), there is typically a node-order-respecting path through the graphlet, unlike in most of the ordered graphlets represented in β (e.g., O_{2} and O_{3}) (Supplementary Fig. S25). Note that for the PSN sets from this section (CATH95, CATH99, and CATH251265), the “longrange(K)” constraint does not improve accuracy, and so we do not consider NormOrderedGraphlet34(K) here.
The above trends, especially the α category being enriched in denser graphlets and the β category being enriched in sparser graphlets, are somewhat expected, given the formulations of α-helix^{56} and β-sheet^{57} 3D configurations. The fact that our graphlet PCA framework uncovers these biochemically relevant PSN patterns in the different protein structural categories further validates the approach. Given this, we hypothesize that our approach can be used to identify biochemically interesting PSN patterns in other applications, such as studying functionally (rather than structurally) different protein categories. Testing this hypothesis is the subject of our future work.
Conclusions
We present GRAFENE, a general PCA-based computational framework for alignment-free network comparison, which can use any measure(s) of network topology. In the task of general-purpose network comparison, when using graphlets as state-of-the-art measures, GRAFENE outperforms the existing state-of-the-art general-purpose alignment-free network comparison approaches, which are also graphlet-based but not PCA-based. This validates GRAFENE as a new state-of-the-art general-purpose alignment-free network comparison approach. At the same time, in the more specific task of PC, we use ordered graphlets, along with several additional methodological improvements (e.g., normalization or the “longrange(K)” constraint), to integrate complementary protein 3D structural data with sequence (i.e., residue order) data. The resulting GRAFENE version significantly outperforms all existing state-of-the-art PC approaches in our evaluations.
Additional information
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
 1.
Ashburner, M. et al. Gene ontology: tool for the unification of biology. Nature Genetics 25, 25–29 (2000).
 2.
Kasabov, N. K. Springer Handbook of Bio/NeuroInformatics, 1 edn (Springer, 2013).
 3.
Lee, D., Redfern, O. & Orengo, C. Predicting protein function from sequence and structure. Nature Reviews Molecular Cell Biology 8, 995–1005 (2007).
 4.
Blake, J. A. et al. Gene ontology consortium: going forward. Nucleic Acids Research 43, D1049 (2015).
 5.
MalodDognin, N. & Pržulj, N. GRAlign: fast and flexible alignment of protein 3D structures using graphlet degree similarity. Bioinformatics 30, 1259–65 (2014).
 6.
Sillitoe, I. et al. CATH: comprehensive structural and functional annotations for genome sequences. Nucleic Acids Research 43, D376–D381 (2015).
 7.
Orengo, C. A. et al. The CATH database provides insights into protein structure/function relationships. Nucleic Acids Research 27, 275–279 (1999).
 8.
Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology 247, 536–540 (1995).
 9.
Ofran, Y. & Margalit, H. Proteins of the same fold and unrelated sequences have similar amino acid composition. Proteins: Structure, Function, and Bioinformatics 64, 275–279 (2006).
 10.
Dai, Q. & Wang, T. Comparison study on kword statistical measures for protein: From sequence to ‘sequence space’. BMC Bioinformatics 9, 394 (2008).
 11.
Mu, Z., Wu, J. & Zhang, Y. A novel method for similarity/dissimilarity analysis of protein sequences. Physica A: Statistical Mechanics and its Applications 392, 6361–6366 (2013).
 12.
Kihara, D. The effect of longrange interactions on the secondary structure formation of proteins. Protein Science 14, 1955–1963 (2005).
 13.
Krissinel, E. On the relationship between sequence and structure similarities in proteomics. Bioinformatics 23, 717–723 (2006).
 14.
Gao, J. & Li, Z. Conserved network properties of helical membrane protein structures and its implication for improving membrane protein homology modeling at the twilight zone. Journal of ComputerAided Molecular Design 23, 755–763 (2009).
 15.
Tuinstra, R. L. et al. Interconversion between two unrelated protein folds in the lymphotactin native state. Proceedings of the National Academy of Sciences 105, 5057–62 (2008).
 16.
Kosloff, M. & Kolodny, R. Sequencesimilar, structuredissimilar protein pairs in the PDB. Proteins 71, 891–902 (2008).
 17.
Burmann, B. M. et al. An α helix to β barrel domain switch transforms the transcription factor RfaH into a translation factor. Cell 150, 291–303 (2012).
 18.
Clarke, T. F. & Clark, P. L. Rare codons cluster. Plos One 3, e3412 (2008).
 19.
Sander, I. M., Chaney, J. L. & Clark, P. L. Expanding Anfinsen’s principle: contributions of synonymous codon selection to rational protein design. Journal of the American Chemical Society 136, 858–861 (2014).
 20.
Holm, L. & Sander, C. Protein structure comparison by alignment of distance matrices. Journal of Molecular Biology 233, 123–138 (1993).
 21.
Bachar, O., Fischer, D., Nussinov, R. & Wolfson, H. A computer vision based technique for 3d sequenceindependent structural comparison of proteins. Protein Eng. 6, 279–288 (1993).
 22.
Kufareva, I. & Abagyan, R. Methods of Protein Structure Comparison, 231–257 (Humana Press, Totowa, NJ, 2012).
 23.
Lancia, G. & Istrail, S. Protein Structure Comparison: Algorithms and Applications, 1–33 (Springer Berlin Heidelberg, 2003).
 24.
Ma, J. & Wang, S. Algorithms, applications, and challenges of protein structure alignment. Advances in Protein Chemistry and Structural Biology 94, 121–175 (2014).
 25.
Hasegawa, H. & Holm, L. Advances and pitfalls of protein structural alignment. Current Opinion in Structural Biology 19, 341–348 (2009).
 26.
Godzik, A. The structural alignment between two proteins: Is there a unique answer? Protein Science 5, 1325–1338 (1996).
 27.
Holm, L. & Rosenström, P. Dali server: conservation mapping in 3D. Nucleic Acids Research 38, W545–W549 (2010).
 28.
Zhang, Y. & Skolnick, J. TMalign: a protein structure alignment algorithm based on the TMscore. Nucleic Acids Research 33, 2302–09 (2005).
 29.
Ye, Y. & Godzik, A. Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics 19, ii246–ii255 (2003).
 30.
Milenković, T., Filippis, I., Lappe, M. & Pržulj, N. Optimized null model for protein structure networks. PLoS ONE 4, e5967 (2009).
 31.
Andonov, R., MalodDognin, N. & Yanev, N. Maximum contact map overlap revisited. Journal of Computational Biology 18, 27–41 (2011).
 32.
Yaveroglu, O. N., Milenković, T. & Pržulj, N. Proper evaluation of alignmentfree network comparison methods. Bioinformatics 31, 2697–2704 (2015).
 33.
Yaveroglu, O. N., MalodDognin, N., Milenković, T. & Pržulj, N. Rebuttal to the letter to the editor in response to the paper: proper evaluation of alignmentfree network comparison methods. Bioinformatics 33, 1107–1109 (2017).
 34.
Emerson, I. A. & Gothandam, K. M. Residue centrality in alpha helical polytopic transmembrane protein structures. Journal of Theoretical Biology 309, 78–87 (2013).
 35.
Pabuwal, V. & Li, Z. Network pattern of residue packing in helical membrane proteins and its application in membrane protein structure prediction. Protein Engineering, Design and Selection 21, 55–64 (2008).
 36.
Pabuwal, V. & Li, Z. Comparative analysis of the packing topology of structurally important residues in helical membrane and soluble proteins. Protein Engineering, Design and Selection 22, 67–73 (2009).
 37.
Emerson, I. A. & Gothandam, K. M. Network analysis of transmembrane protein structures. Physica A 391, 905–916 (2012).
 38.
Milenković, T., Lai, J. & Pržulj, N. GraphCrunch: a tool for large network analyses. BMC Bioinformatics 9 (2008).
 39.
Memisević, V., Milenković, T. & Pržulj, N. An integrative approach to modeling biological networks. Journal of Integrative Bioinformatics 7, 120 (2010).
 40.
Kuchaiev, O., Stevanović, A., Hayes, W. & Pržulj, N. GraphCrunch 2: Software tool for network modeling, alignment and clustering. BMC Bioinformatics 12 (2011).
 41.
Faisal, F. E. & Milenković, T. Dynamic networks reveal key players in aging. Bioinformatics 30, 1721–1729 (2014).
 42.
Pržulj, N., Corneil, D. G. & Jurisica, I. Modeling interactome: Scalefree or geometric? Bioinformatics 20, 3508–3515 (2004).
 43.
Pržulj, N. Biological network comparison using graphlet degree distribution. Bioinformatics 23, e177–e183 (2007).
 44.
Hulovatyy, Y., Solava, R. & Milenković, T. Revealing missing parts of the interactome via link prediction. PLoS ONE 9, e90073 (2014).
 45.
Hulovatyy, Y., Chen, H. & Milenković, T. Exploring the structure and function of temporal networks with dynamic graphlets. Bioinformatics 31, i171–i180 (2015).
 46.
Solava, R., Michaels, R. & Milenković, T. Graphletbased edge clustering reveals pathogeninteracting proteins. Bioinformatics 18, i480–i486 (2012).
 47.
Yaveroglu, O. N. et al. Revealing the Hidden Language of Complex Networks. Scientific Reports 4, 4547 (2014).
 48.
Gromiha, M. M. & Selvaraj, S. Interresidue interactions in protein folding and stability. Progress in Biophysics and Molecular Biology 86, 235–277 (2004).
 49.
Taguchi, Y.h & Gromiha, M. M. Application of amino acid occurrence for discriminating different folding types of globular proteins. BMC Bioinformatics 8, 404 (2007).
 50.
Lee, B. J., Shin, M. S., Oh, Y. J., Oh, H. S. & Ryu, K. H. Identification of protein functions using a machinelearning approach based on sequencederived properties. Proteome Science 7, 27 (2009).
 51.
Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Research 28, 235–242 (2000).
 52.
Yuan, C., Chen, H. & Kihara, D. Effective interresidue contact definitions for accurate protein fold recognition. BMC Bioinformatics 13, 292 (2012).
 53.
Milenković, T. & Pržulj, N. Uncovering biological network function via graphlet degree signatures. Cancer Informatics 6, 257–273 (2008).
 54.
Milenković, T., Memišević, V., Bonato, A. & Pržulj, N. Dominating biological networks. PLoS ONE 6, e23016 (2011).
 55.
Aggarwal, C. C. Data Mining: The Textbook (Springer, 2015).
 56.
Pauling, L. & Corey, R. B. Atomic coordinates and structure factors for two helical configurations of polypeptide chains. Proceedings of the National Academy of Sciences 37, 235–240 (1951).
 57.
Pauling, L. & Corey, R. B. The pleated sheet, a new layer configuration of polypeptide chains. Proceedings of the National Academy of Sciences 37, 251–256 (1951).
Acknowledgements
This work was supported by the Air Force Office of Scientific Research (YIP FA95501610147), National Institutes of Health (1R01GM120733, 1R21AI111286, and R01GM074807), and Clare Boothe Luce Graduate Research Fellowship.
Author information
Author notes
Fazle E. Faisal and Khalique Newaz contributed equally to this work.
Affiliations
Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, 46556, USA
Fazle E. Faisal, Khalique Newaz, Scott J. Emrich & Tijana Milenković
Department of Chemistry and Biochemistry, University of Notre Dame, Notre Dame, IN, 46556, USA
Julie L. Chaney & Patricia L. Clark
Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN, 46556, USA
Jun Li
Department of Chemical and Biomolecular Engineering, University of Notre Dame, Notre Dame, IN, 46556, USA
Patricia L. Clark
Interdisciplinary Center for Network Science and Applications, University of Notre Dame, Notre Dame, IN, 46556, USA
Fazle E. Faisal, Khalique Newaz & Tijana Milenković
Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN, 46556, USA
Fazle E. Faisal, Khalique Newaz, Patricia L. Clark & Tijana Milenković
Contributions
F.E.F., K.N., J.L.C., P.L.C., and T.M. designed the study. F.E.F. implemented all of the proposed methodology, with the following exceptions: K.N. implemented the “longrange(K)” approach within the proposed methodology, and J.L. suggested the use of PCA within the proposed methodology. F.E.F. and K.N. equally contributed to carrying out all of the computational experiments. J.L.C. assembled the protein structure data sets. All authors analyzed the results. F.E.F., K.N., J.L.C., P.L.C., and T.M. wrote the manuscript. All authors read and approved the manuscript. P.L.C. supervised all applied aspects of the study. T.M. supervised all computational aspects of the study.
Competing Interests
The authors declare that they have no competing interests.
Corresponding author
Correspondence to Tijana Milenković.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.