A Structurally Validated Sequence Alignment of All 497 Typical Human Protein Kinase Domains

Protein kinases are important in a large number of signaling pathways. Their dysregulation is involved in a number of human diseases, especially cancer. Studies on the structures of individual kinases have been used to understand the functions and phenotypes of mutations in other kinases that do not yet have experimental structures. The key factor in accurate inference by homology is an accurate sequence alignment. We present a parsimonious structure-based sequence alignment of 497 human protein kinase domains excluding atypical kinases, even those with related but somewhat different folds. Starting with a computed multiple sequence alignment, the alignment was manually refined in Jalview based on pairwise structural superposition onto a single kinase (Aurora A), followed by sequence alignment of the remaining kinases to their closest relatives with known structures. The alignment is arranged in 17 blocks of conserved regions and unaligned blocks in between that contain insertions of varying lengths present in only a subset of kinases. The aligned blocks contain well-conserved elements of secondary structure and well-known functional motifs, such as the DFG and HRD motifs. We validated the multiple sequence alignment by a pairwise, all-against-all alignment of 272 human kinases with known crystal structures. Our alignment has true-positive rate (TPR) and positive predictive value (PPV) accuracies of 97%. The remaining inaccuracy in our alignment comes from a few structures with shifted elements of secondary structure, and from the boundaries of aligned and unaligned regions, where compromises need to be made to encompass the majority of kinases. A new phylogeny of the protein kinase domains in the human genome based on our alignment indicates that 14 kinases previously labeled as “OTHER” can be confidently placed into the CAMK group. These kinases comprise the Aurora kinases, Polo kinases, ULK kinases, Calcium/calmodulin-dependent kinase kinases, and STK36.

Protein kinases catalyze the transfer of a phosphoryl group from an ATP molecule to substrate proteins [1], and are crucial for cellular signaling pathways [2]. Mutations in kinases that lead to gain of function are frequently observed in many cancer types [3,4], while mutations may also result in drug resistance, rendering existing drugs inefficient [3]. Humans have over 500 genes encoding proteins that catalyze the phosphorylation of proteins, collectively called the 'kinome' [5].

Protein kinase activity is found in a number of protein families and superfamilies in the human proteome. The vast majority of human kinases come from one very large, diverse family that shares a common fold consisting of an N-terminal lobe, composed of five β-sheet strands and an α-helix called the C-helix, and a C-terminal lobe comprising six α-helices [6]

Structure alignment with Aurora A indicated that four of the atypical kinase families are homologous to typical kinases (Figure 2), containing some elements of the typical kinase fold but with changes and additions in elements of secondary structure. These are the ADCK, Alpha-type, PI3-PI4-related, and RIO kinases. The ADCK (aarF-domain containing) kinases consist of 5 proteins: ADCK1, ADCK2, COQ8A (ADCK3), COQ8B (ADCK4), and ADCK5. Only the structure of COQ8A is available (PDB: 4PED [8]). The structure consists of 384 residues, 13 helices, and 8 beta sheet strands. Structure alignment with FATCAT aligned 192 residues with an RMSD of 3.92 Å, covering the N-terminal domain, the HRD and DFG motifs, and the E and F helices of the C-terminal domain. COQ8A's N-terminal domain contains an additional subdomain of five alpha helices, three of which precede the typical kinase domain and two of which are inserted between beta strand B3 and the C-helix. Instead of the activation loop leading into the F-helix, the DFG motif leads into a bundle of four alpha helices that precede COQ8A's F-helix, which is followed by one additional helix.

There are six kinases in the Alpha-type kinase family: ALPK1, ALPK2, ALPK3, EEF2K, TRPM6, and TRPM7. The structure of mouse TRPM7 (PDB: 1IAJ [26]) has been determined; only the N-terminal domain and the E helix could be aligned to AURKA, with an RMSD of 5.8 Å over 120 residues.
The remainder of the C-terminal domain of TRPM7 consists of two beta sheet strands, large coil regions, and a short helix. The human PI3/PI4 kinases consist of nine genes: ATM, ATR, MTOR, PIK3CA, PIK3CB, PIK3CD, PIK3CG, PRKDC, and SMG1. All of these except SMG1 have known structures (PIK3CB is represented by a structure of mouse PIK3CB). The structure of PIK3CA (PDB: 4L2Y [27]) aligns with Aurora A with an RMSD of 6.0 Å over 168 residues. The structures of two of the three RIO kinases (RIOK1, RIOK2, and RIOK3) are

domain approximate and partial. Therefore, we did not include any atypical kinase sequence in our multiple sequence alignment of human protein kinases.

The creation of the MSA was a multi-step process. The initial alignment of all the kinase domain sequences was done using ClustalOmega [37], which aligned the main conserved regions in a majority of the sequences up to the beginning of the activation loop. Because of very large insertions in the activation loop and in the C-terminal domain in some kinases, the C-terminal domain was aligned only within some families. For example, the AGC-family Greatwall kinase (AGC_MASTL) has a 548 amino acid insertion in the activation loop that caused the entire AGC family to be misaligned in the C-terminal domain with respect to the other families.

A list of aligned and unaligned blocks is provided in Table 2

lengths. The longest of these are the B3~HC and HG~HH regions. We left the B3~HC region unaligned because in most of the 62 AGC and 96 CAMK kinases the region forms a helix called the B helix, while in the other families it takes on a coil form. The HG~HH unaligned region is highly divergent in structure because of the variation in the position of the G helix.
We have created sequence logos with the program WebLogo [45] to visualize the conservation of residues in all the aligned blocks of our MSA (Figure 6). The logos show the well-known conserved motifs, including the HRD motif in the catalytic loop and the DFG and APE motifs in the activation loop, as well as hydrophobic positions in the beta sheet strands and alpha helices.

For instance, positions 6, 7, and 10 in the G helix contain predominantly hydrophobic amino

Structural validation of the MSA

As described above, the MSA was guided by pairwise alignments of kinase structures to a single kinase (AURKA). However, to determine the accuracy of our MSA we have compared it with the sequence alignments derived from pairwise structure alignments of 272 human kinases in the PDB. Because changes in conformation of the activation loop or movement of the C-helix may affect the corresponding alignment, we used structures that carry an inward disposition of the C-helix as often as possible, as determined by our recent classification of the active and inactive states of kinases [46]. The resulting structure alignments from FATCAT were read by SE to print the unaligned blocks in lower case letters. A residue pair in any two kinases is assumed to be correctly aligned in the MSA if it is also aligned in the pairwise structural alignment of the two kinases. To do the validation we have computed three quantities as described in Methods:

pairs that are identically aligned in both the MSA and the structure alignments divided by the total number of unique residue pairs aligned in the MSA or the structure alignments or both (counting each pair only once). The Jaccard index shows the overlap between MSA and structural alignments, and penalizes both under- and over-prediction of aligned residues in the MSA.
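The three quantities can be written compactly. As a sketch, let $S_{\mathrm{MSA}}$ and $S_{\mathrm{struct}}$ be the sets of residue pairs aligned for a given pair of kinases in the MSA and in the structure alignment. These set symbols are introduced here for illustration only; the counts $N_{\mathrm{correct}}$, $N_{\mathrm{struct}}$ and $N_{\mathrm{seq}}$ are those of the Methods, where certain pairs are additionally skipped before counting.

```latex
\begin{align}
N_{\mathrm{correct}} &= \lvert S_{\mathrm{MSA}} \cap S_{\mathrm{struct}} \rvert, \qquad
N_{\mathrm{seq}} = \lvert S_{\mathrm{MSA}} \rvert, \qquad
N_{\mathrm{struct}} = \lvert S_{\mathrm{struct}} \rvert \\
\mathrm{TPR} &= \frac{N_{\mathrm{correct}}}{N_{\mathrm{struct}}} \\
\mathrm{PPV} &= \frac{N_{\mathrm{correct}}}{N_{\mathrm{seq}}} \\
J &= \frac{N_{\mathrm{correct}}}{\lvert S_{\mathrm{MSA}} \cup S_{\mathrm{struct}} \rvert}
   = \frac{N_{\mathrm{correct}}}{N_{\mathrm{seq}} + N_{\mathrm{struct}} - N_{\mathrm{correct}}}
\end{align}
```

Written this way, a missed structural pair lowers TPR and a spurious MSA pair lowers PPV, while either one enlarges the union and so lowers $J$.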

The average values and distributions of these quantities are presented in Table 3

We calculated the 'gappiness' of each alignment, which we define as the average number of gap regions in each sequence in the MSA. These are also contained in Table 3. Our alignment is the least gappy, with an average of 19 gap regions per sequence. While we have 16 unaligned regions, three of the aligned regions contain short internal gapped regions to accommodate one or more kinases with an unusual insertion in the aligned region. The Möbitz and ClustalOmega alignments are slightly more gappy than ours, while the Manning and Kwon alignments are substantially gappier, with 45 and 140 gap regions per sequence respectively.

In their paper on the human kinome, Manning et al. provided a phylogenetic tree and classified the human protein kinases into nine groups, extending the early Hanks [49] and Hunter [50] schemes. These groups consisted of AGC, CAMK, CK1, CMGC, NEK, RGC, STE, TKL and TYR. A total of 83 protein kinases were placed in an OTHER category because no significant relationship to any of the nine groups was recognized. However, this classification was done with only a limited amount of structural information and, as shown above, the Manning multiple sequence alignment was only 80% correct. We have revisited the phylogeny and classification of kinases to see whether we can assign groups to some of the OTHER kinases by taking advantage of our structure-guided multiple sequence alignment.

Because using aligned blocks tends to result in better phylogenies [51], we built the tree with the 17 conserved blocks from the alignment (Figure 8). The tree was created using the neighbor-joining algorithm in the software MegaX [52] and was visualized using the webserver iTOL [53]. The resulting tree clusters most of the kinases into the previously recognized nine groups.
Uniprot includes the NEK kinases as a separate group, which also appears in our tree. In our tree, the RGC kinases form a small sub-branch within the TKL group, but we have retained the designation.

Among the kinases which are assigned to a group by Uniprot, we observed that eight

Similarly, six sequences consisting of the second domains of RPS6KA1, RPS6KA2, RPS6KA3, RPS6KA4, RPS6KA5, and RPS6KA6, which were previously annotated to be in the AGC group by Uniprot, also cluster in CAMKs (as they do in the Manning tree). The first domains of these kinases are AGC members.

Because of their remote homology, most of the OTHER kinases branch out into separate clades in between the major groups. Due to the smaller size of these clades and the relatively low similarity between their members, we have not classified them as individual kinase groups. In our tree there are seven OTHER kinases in Uniprot that are correctly assigned to groups by Manning. Four kinases (STK32A, STK32B, STK32C, and RSKR) form a branch within the AGC group. These kinases were also classified as AGCs

, and STK36.

We have identified a set of 14 kinases from the OTHER category in both Manning and Uniprot that can be appropriately assigned to the CAMK group. One branch is formed in the middle of the CAMK group by nine kinases consisting of AURKA, AURKB, AURKC, CAMKK1, CAMKK2, PLK1, PLK2, PLK3, and PLK4 (Figure 8). PLK5 is a pseudokinase consisting only of the C-terminal domain, although mouse PLK5 is full-length [54]. We have included it in the CAMK group because of its close sequence relationship with the other PLKs. The second branch, which is on the periphery of the CAMK group, comprises four kinases: ULK1, ULK2, ULK3, and STK36 (labeled Fused in Manning). ULK4 is more distantly related, and we have left it in the OTHER group.
To confirm the changes in group membership, we created HMM profiles for each of the nine groups of kinases, omitting the kinases potentially misclassified by Uniprot (MAPKs in STE and second domain of RPS6KA1-6 in AGC). We then scanned each of the 497 kinase sequences against the nine group HMM profiles. A cutoff score of 200 was consistent with the assignments by Uniprot except for the changes described above. The novel assignments are the 14 kinases that we can confidently move from OTHER to CAMK, described above. The HMM scores clearly assign them to CAMK rather than AGC or OTHER, since the new CAMKs cluster with the other CAMK kinases (Figure 9), well above the diagonal line where CAMK=AGC.

Using the new assignments in the AGC, CAMK, CMGC and TKL groups, we created new HMM profiles to determine whether any OTHER sequences could also be reassigned. However, in the second iteration none of the OTHER category kinases exhibited high scores against any group HMM profile.

insertions. Our aim was to create a parsimonious alignment without unnecessary gaps; the residues in low similarity regions were therefore not aligned but formatted as left-justified blocks of lower case letters to distinguish them from aligned regions. It is reminiscent of the first multiple sequence alignment of kinases produced by Hanks et al. in 1988 [49], which the authors also described as "parsimonious."

Alignments are only useful if they are accurate. While several multiple sequence alignments of human kinases have been published and are available online [5,16,19,23], none of them has been structurally validated. We assessed the accuracy of our alignment with a set of all-against-all pairwise structural alignments of 272 human kinases, and calculated true positive rates (TPR), positive predictive values (PPV), and the Jaccard similarity index.
In a large-scale benchmark of sequence alignment methods, we referred to them as f_D (for developer) and f_M (for modeler) for TPR and PPV respectively [56]. Yona and Levitt subsequently used these values (renamed Q_D and Q_M) to benchmark profile-profile sequence alignments, and added Q_C or Q_Combined [57], which is simply the Jaccard index. The Jaccard index penalizes both overprediction and underprediction in our sequence alignments. We used all three values for our alignment (0.97, 0.97, and 0.94 respectively) to demonstrate that our alignment is more accurate than the others available.

The errors in our MSA of kinases are mostly limited to the boundaries of conserved blocks, where the variability of residue positions across kinases makes their unambiguous placement in aligned blocks difficult. However, structure alignments do not always align every homologous pair of residues in two proteins. This occurs when residues are disordered in one of the structures or where there is significant conformational change. In a small number of kinases the only structure available has a significantly rotated C-helix. Structure alignment therefore sometimes does not align the homologous residues of the C-helix in two kinases. The same is true for the G-helix in some kinases, which may be positioned in different locations within the C-terminal domain but retains a homologous sequence and structure, and thus is aligned differently in our MSA than in the structure alignments.

hidden Markov model of CAMK kinases and a phylogenetic tree based on our MSA, these kinases fit clearly into the CAMK group. Experimental data confirm that these assignments are correct. CAMKK1 and CAMKK2 (Ca2+/calmodulin-dependent kinase kinase 1 and 2) both phosphorylate Ca2+/calmodulin-dependent kinases and bind calmodulin [58-60].
There is also direct evidence of calmodulin binding to PLK1 [61] and of a calmodulin homologue, calcium-and-integrin-binding protein (CIB), to both PLK2 (Snk in the kinome poster) and PLK3 (Fnk) [62]. There is evidence suggesting the role of calcium ions in the activity of ULK1 through its phosphorylation by the CAMK kinase AMPK [63].

Manning et al. put the Aurora kinases on the same branch but not in the AGC group, which is closely related to the CAMK group. Both groups possess a B helix that is not present in the other families. The HMM and the phylogenetic tree show that the three Aurora kinases fit better into the CAMK group than the AGC group. In an earlier study with colleagues, we have shown experimental evidence that Aurora A binds calmodulin [64], supporting its assignment to the CAMK group. Calmodulin also binds to Aurora B kinase (AURKB), preventing its degradation via the E3 ligase FBXL2 subunit [65].

Our MSA provides the benefit of a common numbering scheme using the columns of the alignment, facilitating comparison across all the kinase sequences. The identification of equivalent residue positions helps in generalizing experimental data from one kinase to another.

For example, substrate specificity is highly correlated with the amino acid type at a small number of positions within the substrate binding site [66]. Creixell found that specificity could also be modulated by more remote sites [19], based on a multiple sequence alignment of kinases derived with Clustal W. It is likely that our more accurate alignment would facilitate this analysis and produce more reliable predictions. Other areas where an accurate alignment and phylogeny may be useful are in predicting inhibitor specificity [67], regulatory mechanisms through protein-protein interactions, and computational protein design of kinases with altered functionality [68].
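The common numbering scheme amounts to a lookup through alignment columns. The following is a minimal sketch, not the authors' code: it assumes two rows of the MSA are available as equal-length strings, with `-` or `.` for gaps and lower-case letters for unaligned blocks, and that the residue number of the first residue in each row is known. The function names and the toy sequences are hypothetical.

```python
def residue_numbers(aligned_seq, first_residue=1):
    """Return, for each alignment column, the residue number or None for a gap."""
    numbers = []
    residue = first_residue
    for char in aligned_seq:
        if char in "-.":
            numbers.append(None)
        else:
            numbers.append(residue)
            residue += 1
    return numbers


def equivalent_residue(seq_a, start_a, residue_a, seq_b, start_b):
    """Find the residue of kinase B that shares a column with residue_a of kinase A.

    Returns None if B has a gap in that column, or if the column lies in an
    unaligned (lower-case) block in either row, where no equivalence is claimed.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("rows of an MSA must have the same length")
    nums_a = residue_numbers(seq_a, start_a)
    nums_b = residue_numbers(seq_b, start_b)
    for col, number in enumerate(nums_a):
        if number == residue_a:
            if seq_a[col].islower() or seq_b[col].islower():
                return None
            return nums_b[col]
    return None


# Toy rows (not real kinase sequences), numbered from 10 and 100.
row_a = "MK-LVDFG"
row_b = "MKALV-FG"
partner = equivalent_residue(row_a, 10, 12, row_b, 100)
```

Returning `None` for lower-case columns is a deliberate choice in this sketch, so that an annotation is transferred from one kinase to another only within aligned blocks.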

Our alignment is included as supplemental data and on our website, and will be updated as new structures are determined. We hope that it will be of use in kinase biology and therapeutic development.

The list of human typical and atypical protein kinases was obtained from the Uniprot website (https://www.uniprot.org/docs/pkinfam). To identify any unlisted kinases, we searched human sequences in Uniprot with PSI-BLAST using the typical and atypical protein kinases listed on the Uniprot page as queries. PSI-BLAST was also used to identify structures of human kinases or their closest homologues in the PDB. The structures of atypical kinases (or homologues thereof) were compared to Aurora A using FATCAT pairwise alignments and structural superposition by CEalign in Pymol. Four of the atypical kinase families are visibly related to typical kinases, but contain significant fold differences. The other two families are not homologous to typical kinases.

A total of 497 typical kinase domain sequences from 484 kinase genes (13 genes have two kinase domains each) were used to create the MSA. These sequences were initially divided into 9 phylogenetic groups as per the Uniprot nomenclature: AGC, CAMK, CK1, CMGC, NEK, RGC, TKL, TYR, and STE, and a tenth group of diverse kinases designated OTHER. Gene names were retrieved from the Human Gene Nomenclature Committee website (http://genenames.org) [34]. Each kinase sequence was labeled by group name, underscore, HGNC gene name, for example AGC_PRKACA for KAPCA_HUMAN. The 13 kinases that have two kinase domains in the polypeptide chain were labeled with an additional underscore and domain number, for example TYR_JAK1_1 and TYR_JAK1_2. The boundaries were determined with PSI-BLAST of the full-length Uniprot sequence against the PDB.
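The labeling convention described above is simple enough to capture in a few lines. This is an illustrative sketch only: the function names are hypothetical, and the parser assumes that gene symbols contain no underscore, which the text does not state.

```python
GROUPS = {"AGC", "CAMK", "CK1", "CMGC", "NEK", "RGC", "TKL", "TYR", "STE", "OTHER"}


def kinase_label(group, gene, domain=None):
    """Build a label such as AGC_PRKACA, or TYR_JAK1_2 for a second kinase domain."""
    if group not in GROUPS:
        raise ValueError(f"unknown kinase group: {group}")
    label = f"{group}_{gene}"
    if domain is not None:
        label += f"_{domain}"
    return label


def parse_label(label):
    """Split a label into (group, gene, domain); domain is None for single-domain kinases.

    Assumes gene symbols contain no underscore (an assumption of this sketch).
    """
    parts = label.split("_")
    if len(parts) == 2:
        return parts[0], parts[1], None
    if len(parts) == 3 and parts[2].isdigit():
        return parts[0], parts[1], int(parts[2])
    raise ValueError(f"unexpected label format: {label}")


single = kinase_label("AGC", "PRKACA")
second_domain = kinase_label("TYR", "JAK1", 2)
```

Keeping the group as a prefix means that sorting the labels alphabetically also groups the sequences by kinase group.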
The kinase sequences (except some with very long insertions like GWL) were aligned using ClustalOmega [69] to prepare an initial alignment. This was manually edited using Jalview to make sure that conserved motifs such as the DFG and HRD motifs were aligned across most of the sequences [70]. The sequences with low sequence similarity to most of the other kinases and those containing long insertions were difficult to align. To improve the accuracy of the alignment, pairwise structural alignments of the kinases that have a crystal structure were performed against the structure of Aurora A kinase (3E5A_A) using the program FATCAT [71]. For the kinases where a structure was not known, the alignment of the closest known structure to Aurora A was used to edit the alignment with Jalview [72]. In a few cases, the most closely related structures were not human or even mammalian kinases. In these cases, the non-human kinase was structurally aligned to Aurora A and the target kinase was added to the MSA by transitive alignment. For a few distant kinases where a closely related structure or sequence was not known, HHPred was used to identify similarity to another kinase [73].

Structural validation of the MSA

The MSA was structurally validated using a set of pairwise structural alignments as a

were created using FATCAT in rigid mode and optimized using SE [36]. For kinases with multiple known structures, the structure used for validation was selected based on its conformational state using our previously published nomenclature [46]. The active state BLAminus conformation was preferred over others, followed by different kinds of DFGin inactive states (ABAminus, BLAplus, BLBminus, BLBplus, BLBtrans) and DFGout-BBAminus.
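The selection of one structure per kinase can be sketched as a ranked lookup. The state names below are those listed in the text; treating the order in which they are listed as a strict ranking, and ranking any other state last, are assumptions of this sketch, and the structure identifiers are placeholders.

```python
# Conformational states in the order they are named in the text (assumed to be a strict ranking).
STATE_PREFERENCE = [
    "BLAminus",
    "ABAminus",
    "BLAplus",
    "BLBminus",
    "BLBplus",
    "BLBtrans",
    "DFGout-BBAminus",
]


def select_structure(candidates):
    """Pick one structure for a kinase from (structure_id, state) pairs.

    The structure whose state ranks highest in STATE_PREFERENCE is returned;
    states not in the list rank last, and ties keep the first candidate given.
    """
    if not candidates:
        raise ValueError("no candidate structures")
    rank = {state: index for index, state in enumerate(STATE_PREFERENCE)}
    best = min(candidates, key=lambda item: rank.get(item[1], len(rank)))
    return best[0]


# Placeholder identifiers, not real PDB entries.
chosen = select_structure([
    ("structure_1", "BLBplus"),
    ("structure_2", "BLAminus"),
    ("structure_3", "ABAminus"),
])
```

A real pipeline would probably break ties on resolution or coverage; the text does not say how ties were handled, so this sketch simply keeps input order.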
A residue pair between two kinases in the MSA was considered to be aligned if it was also aligned in the benchmark pairwise structural alignment of the same kinases. Using this information, the accuracy of the MSA was assessed by computing three quantities: TPR, PPV and the Jaccard similarity index. For each pair of sequences, we first calculate the number of aligned residue pairs that are present in both the sequence alignment and the structure alignment (N_correct). The TPR is the ratio of N_correct and the number of residue pairs aligned in the structure alignment (N_struct). For computation of the TPR, residue pairs in the structure alignment are skipped if either or both residues are contained in the unaligned (lower-case) blocks of the sequence alignment. This takes care of situations that occur when two kinases have identical length segments between two of our aligned blocks; the structure alignment program would align them but they would be indicated as unaligned in our sequence alignment. The alignment of Kwon et al. also includes unaligned regions in lowercase and is treated in the same way.

PPV is the ratio of N_correct and the number of aligned residue pairs in the sequence alignment (N_seq). For the PPV, residue pairs in the aligned blocks of the sequence alignment are skipped if one or both residues are aligned to gap characters in the structure alignment. This is usually either because the residues are disordered (no coordinates) in one of the structures or because there is a significant conformational change of a loop and the residues are aligned to gaps. The Jaccard similarity index is the ratio of N_correct and the number of unique aligned pairs in either the structure alignment or the sequence alignment (counting each only once). For the Jaccard index, all the pairs skipped in TPR and PPV are also skipped. A script for calculating these values is available on
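As an illustration of the definitions above (a sketch written for this text, not the authors' script), the three quantities can be computed for one pair of kinases from sets of residue-index pairs. The function name, the input representation, and the reading of "aligned to gap characters" as "residue does not appear in any structurally aligned pair" are assumptions.

```python
def alignment_accuracy(msa_pairs, struct_pairs, unaligned_a, unaligned_b):
    """Return (TPR, PPV, Jaccard) for one pair of kinases A and B.

    msa_pairs    : (i, j) residue pairs aligned in the aligned blocks of the MSA
    struct_pairs : (i, j) residue pairs aligned in the structure alignment
    unaligned_a/b: residues of A/B lying in unaligned (lower-case) MSA blocks
    """
    msa_pairs, struct_pairs = set(msa_pairs), set(struct_pairs)

    # TPR rule: skip structure pairs that touch an unaligned MSA block.
    struct_kept = {
        (i, j) for i, j in struct_pairs
        if i not in unaligned_a and j not in unaligned_b
    }

    # PPV rule: skip MSA pairs in which either residue has no structural partner.
    struct_a = {i for i, _ in struct_pairs}
    struct_b = {j for _, j in struct_pairs}
    msa_kept = {(i, j) for i, j in msa_pairs if i in struct_a and j in struct_b}

    n_correct = len(msa_kept & struct_kept)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    tpr = ratio(n_correct, len(struct_kept))             # N_correct / N_struct
    ppv = ratio(n_correct, len(msa_kept))                # N_correct / N_seq
    jaccard = ratio(n_correct, len(msa_kept | struct_kept))
    return tpr, ppv, jaccard


# Toy example: residue 5 of A lies in a lower-case block; residue 6 has no structural partner.
tpr, ppv, jaccard = alignment_accuracy(
    msa_pairs={(1, 1), (2, 2), (3, 3), (4, 5), (6, 6)},
    struct_pairs={(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)},
    unaligned_a={5},
    unaligned_b=set(),
)
```

In the toy example, the structural pair (5, 5) is skipped because residue 5 of A is unaligned in the MSA, and the MSA pair (6, 6) is skipped because residue 6 is absent from the structure alignment, so neither counts against the alignment.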

We also compared our MSA accuracy with the previously published alignments. These alignments did not contain residue ranges in the Uniprot sequences, and used different nomenclature for the protein names. To identify a correspondence between the sequences in previously published alignments and our MSA, we performed PSI-BLAST searches of each of their sequences against Uniprot and renamed them according to our scheme (groupname_genename).

Phylogenetic tree

The phylogenetic tree of human protein kinases was created from an MSA obtained by deleting the unaligned regions from our MSA. A distance matrix was created using the p-distance in the program MegaX [52] and used to build a phylogenetic tree with the neighbor-joining algorithm. The tree was saved in Newick format and uploaded to the iTOL webserver for visualization, where each clade was colored according to its kinase group

ron TR, Clark WT, Bankapur AR, D'Andrea D, Lepore R, Funk CS, Kahanda I, Verspoor KM, Ben-Hur A. An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome biology

Braschi B, Denny P, Gray K, Jones T, Seal R, Tweedie S, Yates B, Bruford E. Genenames.org: the HGNC and VGNC resources in 2019. Nucleic Acids Res 47:D786-D792 (2018).

35. Ye Y, Godzik A. FATCAT: a web server for flexible structure comparison and structure similarity searching. Nucleic Acids Res 32:W582-585 (200

. de Cárcer G, Manning G, Malumbres M. From Plk1 to Plk5: functional evolution of polo-like kinases. Cell cycle 10:2255-2262 (2011).

55. Needham EJ, Parker BL, Burykin T, James DE, Humphrey SJ. Illuminating the dark phosphoproteome. Sci Signal

lexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics 19 Suppl 2:246-255 (2003).

72. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of database programs. Nucleic Acids Res 25:3389-3402 (1997).