Sequence evidence for common ancestry of eukaryotic endomembrane coatomers

Eukaryotic cells are defined by compartments through which the trafficking of macromolecules is mediated by large complexes, such as the nuclear pore, transport vesicles and intraflagellar transport. The assembly and maintenance of these complexes are facilitated by endomembrane coatomers, long suspected to be divergently related on the basis of structural and, more recently, phylogenomic analysis. By performing supervised walks in sequence space across coatomer superfamilies, we uncover subtle sequence patterns that have remained elusive to date, ultimately unifying eukaryotic coatomers by divergent evolution. The conserved residues shared by 3,502 endomembrane coatomer components are mapped onto the solenoid superhelix of nucleoporin and COPII protein structures, thus determining the invariant elements of coatomer architecture. This ancient structural motif can be considered a universal signature connecting eukaryotic coatomers involved in multiple cellular processes across cell physiology and human disease.

homologous proteins, matching the ACE1-equivalent segment, returning the third most numerous set of 'new' hits. At step 4, only a few additional hits are found at the required significance level: this step represents a bottleneck that is subsequently surpassed. At step 5, more Nup96 members are detected along with a Sec16-like ACE1-containing molecule from Branchiostoma floridae (GI:260816342). At step 6, remarkably, significant hits are enriched in yet more Nup96 homologs, with the appearance of Sec31A members. At step 7, the search yields the most numerous set of 'new' hits, including a multitude of Sec31 homologs, as well as a few Nic96 members and a handful of poorly characterized WDR17 proteins 51. At step 8, new hits are primarily represented by WDR17 proteins, some Sec31A and Sec31B homologs, a few unannotated Nup107s, and the first IFT140 member, again from Harpegnathos saltator (GI:307206999). At step 9, the second most numerous set of 'new' hits is generated, covering a multitude of homologs from a wider range, including primarily Nic96 members, IFT140 members, and Sec31 subunits. At step 10, multiple entries belonging to previously found families are detected, as well as a large group of IFT140 homologs, including many uncharacterized members of this superfamily. With the 3000-homolog milestone surpassed, it is quite surprising that there are very few (if any) false positives above threshold at this point. At step 11, the detection of ACE1-like regions is consolidated across superfamilies, with the detection of the first members of the IFT172 superfamily, e.g. a 1735-residue long protein from Camponotus floridanus (GI:307182081); interestingly, this species is represented by one member per family detected at this stage (step 11). At step 12, the number of new findings starts to drop relative to previous steps, and a certain consolidation takes place, with more remote members of all previous superfamilies, especially non-nucleoporins, being detected.
At step 13, a similar situation prevails, with the exception of a most remarkable hit, the Sec31 homolog of known structure (Figure 2b). At step 14, annotated homologs of Nup107 (i.e. detected by domain databases) are admitted, and the entire sequence space locality is sufficiently covered. It should be noted that the Nup107 structure is not detected by the process; it is added manually into the alignment by profile matching against the PDB (Figure 2b). A number of other interesting, marginally non-significant hits include IFT-A components WDR19/IFT144 (e.g. H. saltator, GI:307199281) and IFT122 (e.g. Micromonas pusilla CCMP1545, GI:303272293); Sec16 is found once (Wallemia ichthyophaga EXF-994, GI:505759425) at positions 323-785, matching the structure with PDB code 3mzkB 15-441 and by extension 2pm6A at positions 150-280, consistent with the alignment; Clathrin heavy chain from Schizosaccharomyces pombe 972h- is also detected (GI:19115060), at positions 735-1186, indicating a remote profile affinity with the clathrin heavy chain repeat region. These tantalizing hits are detected correctly below threshold, as verified by reverse searches of the corresponding regions, thus closing the gap of the sequence space locality with Y-Nups 20, Nic96 1,52, Sec31 53 and IFT140/IFT172 26. Thus, the elusive sequence signature that connects by divergent evolution all known ACE1-containing coatomer systems (e.g. Y-Nups), their variants (e.g. IFTs) as well as a number of previously uncharacterized proteins (e.g. WDR17) is described by a very limited number of residues, predicted to play a role in the stability of this structural motif 54,55. Despite highly integrated structural interpretations of the NPC 56 and transport vesicles 6, recent functional studies continue to unveil unexpected cellular roles for individual components and their interactions, e.g. in tRNA transcription 57 or pH regulation 58, respectively.
Furthermore, the puzzling interplay between the NPC and the CPC to form a diffusion barrier 27,59 can now be illuminated by an evolutionary relationship of shared structural motifs. Parallel advances in the structural characterization of IFT proteins 60 and their interactions 61 might uncover the structure of the detected domains in some of the longest IFT proteins, e.g. IFT140 and IFT172. The biochemical basis of ciliary function in connection to signalling 62 coupled with novel structural insights might resolve the role of specific mutations in various ciliopathies 30,63.

Figure S1: Statistical measures of the sequence space walk. On the x-axis the thirteen steps are shown; on the left y-axis the number of entries corresponding to new hits (blue line) and the cumulative sum of hits (green line). Estimates for precision (red line) and recall (green line, corresponding to percentage points) are shown on the right y-axis (for actual values, see Table S2). While precision is never below 99%, coverage is slowly climbing from approximately 10% at step 1 to over 85% at step 13; the total number of available sequence entries containing the ACE1-like motif is tentatively estimated at 4000.

Figure S2: Global superposition of five available ACE1-like alpha-solenoid motif-containing structures. Viewpoint is maintained according to Figure 2; color scheme as in Figure 1. Despite the complexity of the five superimposed motifs, it is evident that they all exhibit a high degree of structural similarity, with the exception of Sec31's last helix hairpin, at the C-terminal region. Note that this structural superposition is purely sequence-driven with conserved residues (see main text for details).

Figure S3: Index of the five eukaryotic coatomer structures with superfamily labels below. Viewpoint is maintained according to Figure 1; color scheme as in Figure 1. The purpose of this representation is solely to assist visualization of the complex global superposition in Figure S2.

Supplementary Video
Video S1: A video representation of the sequence space walk capturing endomembrane coatomer superfamilies with increasing sensitivity. Top panel: a heat-map representation (yellow color corresponding to low values, blue color to high values) is shown for successive KMAP tables for all thirteen steps: on the x-axis, the twenty residue types are displayed and on the y-axis, the full sequence alignment positions (for display purposes only). Bottom panel: a sequence of screenshots for Figure 1 is shown, where the highlight marks the relevant step and the corresponding structure when available. Values change because the search does not converge, as the profile is typically enriched at each step, notably for those steps where high numbers of related sequences are admitted.

Supplementary Tables

Table S1
Distribution of putative false positive cases excluded from sequence profile constructions across the 13 steps described.
Columns: Step; Unique; Total; List of identifiers; Sum.
Column names and explanations: Step: corresponding sequence profile search; Unique: sum of unique false positives; Total: sum of all false positives including duplicates; List of identifiers: list of false positive identifiers; Sum: cumulative sum of unique false positives across steps. The § mark signifies a possible true positive, unconfirmed (Sec31 homolog), which does not impact the search. The last row contains grand sums. Gray-colored labels signify identifiers which have been observed more than once.
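The bookkeeping behind the columns of Table S1 can be illustrated with a minimal Python sketch. It is not the procedure used in this work: the function name `summarize_false_positives` and the example identifiers are hypothetical. It also assumes that "Unique" counts distinct identifiers within a step and that "Sum" counts distinct identifiers seen up to and including that step.

```python
from collections import Counter


def summarize_false_positives(per_step):
    """Tabulate excluded hits per step (hypothetical helper, not from the paper).

    per_step: list with one entry per search step; each entry is the list of
    excluded identifiers at that step, duplicates included.
    Returns (rows, repeated): one dict per step, plus the sorted identifiers
    observed more than once over the whole walk.
    """
    seen = set()        # distinct identifiers encountered so far
    counts = Counter()  # occurrences of each identifier over all steps
    rows = []
    for step, identifiers in enumerate(per_step, start=1):
        counts.update(identifiers)
        unique_here = set(identifiers)
        seen |= unique_here
        rows.append({
            "step": step,
            "unique": len(unique_here),          # assumed: distinct within this step
            "total": len(identifiers),           # all exclusions, duplicates included
            "identifiers": sorted(unique_here),
            "sum": len(seen),                    # assumed: cumulative distinct count
        })
    repeated = sorted(i for i, n in counts.items() if n > 1)
    return rows, repeated


# Hypothetical identifiers, for illustration only.
example = [["id_A", "id_B"], ["id_B", "id_C", "id_C"], []]
rows, repeated = summarize_false_positives(example)
for row in rows:
    print(row)
print("observed more than once:", repeated)
```

Under these assumptions, the identifiers returned in `repeated` correspond to those that would carry gray-colored labels.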

Table S2
Description of 13 steps of profile sequence searches and relevant statistics.
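As an illustration of how per-step statistics of this kind can be derived, the following Python sketch computes the cumulative number of hits together with precision and recall estimates. It rests on assumptions that are not confirmed by the text: precision is taken per step as admitted hits over admitted hits plus excluded false positives, and recall as the cumulative number of hits over the tentative total of 4000 entries quoted in the legend of Figure S1. The function name `walk_statistics` and the example counts are hypothetical.

```python
def walk_statistics(new_hits, false_positives, estimated_total=4000):
    """Per-step summary of a sequence space walk (hypothetical helper).

    new_hits: number of newly admitted hits at each step.
    false_positives: number of excluded false positives at each step.
    estimated_total: assumed size of the full family (4000 is the tentative
    estimate quoted in the Figure S1 legend).
    Returns a list of (step, new, cumulative, precision, recall) tuples.
    """
    if len(new_hits) != len(false_positives):
        raise ValueError("one false-positive count is needed per step")
    rows = []
    cumulative = 0
    for step, (new, fp) in enumerate(zip(new_hits, false_positives), start=1):
        cumulative += new
        reported = new + fp
        # Assumed definition: fraction of reported hits that were admitted.
        precision = new / reported if reported else 1.0
        # Assumed definition: coverage of the estimated family size.
        recall = cumulative / estimated_total
        rows.append((step, new, cumulative, precision, recall))
    return rows


# Hypothetical counts, for illustration only; not the values of Table S2.
for step, new, cumulative, precision, recall in walk_statistics([400, 100], [0, 1]):
    print(f"step {step}: new={new} cumulative={cumulative} "
          f"precision={precision:.3f} recall={recall:.3f}")
```

A different definition of precision (for example, cumulative rather than per step) would only change the two lines marked as assumed.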