There is growing evidence for the involvement of Y-complex nucleoporins (Y-Nups) in cellular processes beyond the inner core of nuclear pores of eukaryotes. To comprehensively assess the range of possible functions of Y-Nups, we delimit their structural and functional properties by high-specificity sequence profiles and tissue-specific expression patterns. Our analysis establishes the presence of Y-Nups across eukaryotes with novel composite domain architectures, supporting new moonlighting functions in DNA repair, RNA processing, signaling and mitotic control. Y-Nups associated with a select subset of the discovered domains are found to be under tight coordinated regulation across diverse human and mouse cell types and tissues, strongly implying that they function in conjunction with the nuclear pore. Collectively, our results unearth an expanded network of Y-Nup interactions, thus supporting the emerging view of the Y-complex as a dynamic protein assembly with diverse functional roles in the cell.
Coat nucleoporins form the inner core of nuclear pores of eukaryotes, protein supercomplexes responsible for the regulated transport of macromolecules between the nucleus and the cytoplasm. The Y-shaped Nup84/Nup107-160 subcomplex (Y-complex) forms the outer ring scaffold, is evolutionarily conserved, and is composed of certain key proteins referred to as outer ring coat nucleoporins (Y-Nups – 9 in vertebrates and 7 in yeast)1,2, with common structural features yet elusive sequence similarities3,4. Although the functional capacities of coat nucleoporins are primarily connected with the nuclear pore, where they play a key role in maintaining the integrity of the outer ring, there is growing evidence for their involvement in other processes5,6, including mitotic spindle assembly7 and transcription regulation8. Few other nucleoporins bind directly to the outer rings, rendering the experimental detection of their interacting partners highly challenging1.
In order to identify potential novel functional associations for coat nucleoporins, we thus set out to characterize the nine families of Y-Nups across eukaryotes. In particular, we examined in detail their multi-domain architectures and their membership in co-expression groups in human and mouse that further support functional interactions. By integrating information from these analyses with previous knowledge, we significantly extend the emerging evidence for Y-Nup roles outside the nuclear pore, defined as ‘moonlighting’ roles in this broader context9.
To augment the limited set of known protein associations for Y-Nups across eukaryotes10, we deploy computational and experimental sequence analysis involving extensive sequence comparisons, RNA-seq expression profiling across diverse human and mouse tissues, protein domain detection and inference of protein interactions11, using the Drosophila melanogaster Y-Nups as queries. Using established protocols for low-complexity masking, sensitive iterative sequence profile searches, consistent labeling and annotation of homologs, automated sequence clustering and visualization of sequence similarity, we unambiguously assign the initially detected homologies (Supplementary Fig. 1) into nine Y-Nup families (Figure 1). In the resulting multiple sequence alignments, certain members share less than 10% identity with their homologs (p < 10^-4, see Methods and Supplementary Fig. 2)12.
Clustering of Y-Nups into protein families
We identified ~3000 proteins as Y-Nups (Supplementary Table 1), many of which are reported here for the first time, especially for lower eukaryotes – including the previously undetected presence of Nup43 in fungal species (see Supplementary Text). These results confirm the universal distribution of the Y-Nup nuclear pore subcomplex in eukaryotes13. In particular, it is noteworthy that many of the protein sequences we detect here have not been reported previously in annotation efforts, due to the presence of subtle sequence similarities that are confounded by extensive low-complexity regions or repeats (e.g. WD40): of the 2962 entries in the resulting Y-Nup compendium, there are 1813 characterized and 1149 newly discovered Y-Nups, thus increasing the level of characterization by more than 63%. It should be pointed out that without low-complexity, compositionally biased region detection, the majority of these similarities are lost, mostly due to the presence of WD40 repeats, particularly for shorter query protein sequences. Based on our workflows (see Methods and Data Supplements DS01-09), we assigned detected homologs automatically into similarity clusters12 following detailed validation, essentially replicating our meticulous manual characterization in a highly consistent, reproducible manner. Of the 22,033 off-diagonal hits (i.e. excluding self-hits) in an all-against-all sequence comparison, 5,403 (24.5%) and 557 (2.5%) Y-Nups exhibit pairwise sequence identity <30% and ≥80% respectively (Supplementary Fig. 2). The nine independently derived clusters detected by the automated procedure correspond to all known classes of Y-Nups with the Nup37/SEH1/SEC13 families merging into the largest group (1077 members, minimum identity 7%), while the two smallest clusters represent distant sub-families of Nup75 (70 members from Ascomycota, minimum identity 8%) and Nup107 (10 members from Trypanosomatidae) (Figure 1).
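The proportions quoted in the paragraph above follow directly from the reported counts. A minimal arithmetic check, using only the numbers stated in the text (the variable names are illustrative):

```python
# Counts quoted in the text above.
total_ynups = 2962
characterized = 1813
newly_discovered = 1149

off_diagonal_hits = 22033
hits_below_30 = 5403       # pairwise identity < 30%
hits_at_least_80 = 557     # pairwise identity >= 80%


def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# The two categories account for the whole compendium.
assert characterized + newly_discovered == total_ynups

# Increase in characterization relative to previously characterized entries.
increase = percent(newly_discovered, characterized)
low_identity = percent(hits_below_30, off_diagonal_hits)
high_identity = percent(hits_at_least_80, off_diagonal_hits)

print(f"increase: {increase:.1f}%")
print(f"hits <30% identity: {low_identity:.1f}%")
print(f"hits >=80% identity: {high_identity:.1f}%")
```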
Coordinated tissue-specific gene expression
The rigorous delineation of Y-Nup structural features drawn from evolutionary relationships and multi-domain architecture is a prerequisite for the inference of genome-wide functional relationships both at the gene expression and protein interaction levels14. We thus examine Y-Nup gene expression tissue specificity patterns15, via RNA-seq data across a wide range of tissues and cell lines in human and mouse16 (Figure 2). There is a remarkable consistency of Y-Nup expression patterns across the two species17 (see Methods), the most prominent feature being the over-expression of Nup98 (Nup98-96) and SEC13 in testis. In addition, SEC13 is more highly expressed than SEH1 in muscle, liver, kidney, heart and neural tissue; Nup43 and Nup37 are significantly more expressed in mouse than in human testis; and mouse SEC13 has a higher expression in heart tissue compared to human (Figure 2). Exon skipping is found to be limited, with subtle tissue specificity patterns and minor alternative exon splicing events for Nup98 (Nup98-96) observed in both species (not shown), indicating a tight, evolutionarily conserved regulation at the transcriptional level (Figure 2).
Having established precise protein family relationships across Y-Nups and their coordinated gene expression patterns in two mammalian species, we then proceeded to the identification of domain associations and the extraction of their corresponding expression profiles. Domain associations can be used to infer the range of cellular functions that certain Y-Nup subunits might be performing, previously undetected by more traditional approaches18. These implied moonlighting functions19 for the homologous single-domain counterparts strongly point to the association of the Y-complex with other fundamental, yet transient processes at a given timepoint during the cell cycle20 and nuclear pore reorganization21,22. In fact, as mentioned above, the presence of common-repeat patterns in Y-Nups has occasionally confounded their detailed structural and functional characterization23, which is delineated with greater accuracy in this study.
Multi-domain architectures of Y-Nups
Following the above reasoning, we are thus able to detect 27 novel multi-domain architectures for Y-Nups (Supplementary Table 2), using an adaptive length threshold for the manual inspection of thousands of sequence alignments (see Methods), which in principle might involve genuine domain associations for Y-Nups11,18. These domains correspond to a wide range of functional categories, not directly related to nuclear pore formation, and thus warrant further investigation, using criteria for genome structure, gene expression and phylogenetic distribution. To validate the detected associations, we have first performed genomic sequence comparisons, using linker sequences of the corresponding multi-domain molecules as queries for genome and expression nucleotide sequence databases (see Methods and Data Supplements DS05-06): eight cases are supported by these exhaustive genomic searches (Supplementary Table 2, ‘by Genome’). Although all homologs derive from complete genome sequences or assemblies (not shown), represented by over 300,000 genes, there are quality issues that require independent experimental confirmation. We subsequently validate these architectures using the homology-based RNA-seq expression data from human and mouse (Supplementary Table 3): six cases are supported by this extensive genome-wide coverage (over 4 billion reads per species, Supplementary Table 4), across tissues and cell lines (Supplementary Table 2, ‘by Expression’). Genes that display coordinated expression across diverse cell and tissue types tend to share common functions, and the property of co-regulation has been used to predict gene function: herein, we use coordinated gene expression patterns as an additional level of validation for the discovered domain associations.
Remarkably, while there are three cases supported by both genomic and expression evidence, there are another three cases supported by either of the above, as well as presence in multiple species ('by Frequency') (Supplementary Table 2). While cases with variable support will require further experimental probing, six strongly supported cases (Table 1) can be unambiguously connected with coat nucleoporin function (Figure 3): five of these are found in more than one species.
Given the scarcity of known functional relationships for Y-Nups, partly due to technical limitations, the detection of novel genome-wide associations can expand their possible roles beyond the nuclear pore6, to include transient processes rarely detectable by targeted experiments. Thus, when validated by exhaustive functional genomics evidence, the inferred associations pointing to moonlighting roles of Y-Nups are highly consistent with the limited experimental evidence available both for gene expression (Figure 4) and protein interactions (Figure 5), in the broader context of biological processes as indicated by the associated domains (see also Data Supplement DS11). Beyond nuclear pore formation and maintenance5,6,7, the Y-Nups found associated with the strongly supported architectures (Figure 3, Table 1) can be linked to cellular processes that have also been reported previously, namely RNA processing and transport (cf. Rae124), DNA repair25 (cf. RAD5226), chromosome maintenance (cf. Sir4p27) and centrosome control28 (cf. Cenp-F29).
Certain domain configurations with limited support might be due to sequencing artifacts, gene prediction or short-read assembly errors. Of those, four cases deserve further discussion although they are not admitted in our final list. The association of Nup75 (Nup153) from Naegleria gruberi (GI:290983204) with FG-repeats30 might represent a genuine case (see Supplementary Text). Another intriguing, low-support architecture is an association of SMC domains6 (condensins) with Nup75 of Chlorella variabilis (GI:307108886): both pairwise correlations (Figure 3B) and rank correlation clustering (Figure 4) indicate a co-expression of human paralogs with Nup75, SMC1A being the most Nup75-coordinated paralog across human tissues (Supplementary Table 3); independent observations from stem cell Oct4 interactions provide additional evidence31, although the particular Chlorella instance will need to be further validated. The third case with no counterpart elsewhere is the co-occurrence of Nup107 with DUF1767 (domain of unknown function) in the flatworm Clonorchis sinensis (GI:358337287); moreover, DUF1767 is found in Rmi1, a protein controlling genome stability in yeast32 and exhibits the highest pairwise correlation of coordinated gene expression with Nup107 (Figure 3B). Finally, while not adequately supported, the co-occurrence of acetyl-CoA carboxylase with Nup75 in Rhodotorula glutinis (GI:342319109) provides clues for a suspected role of lipids in nuclear pore formation33,34. While all other cases are indeed tantalizing (including, e.g. aminopeptidase35), we conclude that more experimental and phylogenetic evidence is required and thus do not deem them strong candidates for functional association with Y-Nups.
Validation and discovery of Y-Nup moonlighting functions
Strong functional genomics evidence for association with Y-Nups is detected for six domains (Table 1). Using the enriched Y-Nup group discovered by tissue-specific expression (middle block in light green, Figure 4), further substantial support for the six novel discoveries is obtained from high-throughput experiments (Figure 5), via a composite query to GeneMANIA36 (see Methods). By partitioning this network into two sub-networks, with the known cases and discovered multi-domain architectures deemed as positives (25 in number, average network connectivity 18) and all other nodes as negatives (depicted in light blue and grey, 52 in number, average network connectivity 12), the inferred nucleoporin-induced network exhibits a striking difference in topological complexity, thus placing the newly discovered multi-domain architectures pointing to moonlighting roles into a functionally coherent context.
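The comparison of the two sub-networks above rests on an average connectivity per node group. A minimal sketch of such a comparison follows, assuming that "average network connectivity" means the mean number of distinct neighbours per node; that reading, the function name and the toy edge list are illustrative assumptions and do not reproduce the GeneMANIA network itself.

```python
from collections import defaultdict


def average_degree(edges, nodes):
    """Mean number of distinct neighbours for `nodes` in an undirected edge list."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbours[a].add(b)
            neighbours[b].add(a)
    nodes = list(nodes)
    return sum(len(neighbours[n]) for n in nodes) / len(nodes)


# Toy network (hypothetical edges, for illustration only).
edges = [
    ("Nup160", "Nup107"),
    ("Nup160", "RAD51"),
    ("Nup107", "RAD51"),
    ("RAD51", "X1"),
    ("X1", "X2"),
]
positives = ["Nup160", "Nup107", "RAD51"]  # known cases and discovered domains
negatives = ["X1", "X2"]                   # all other nodes

avg_positive = average_degree(edges, positives)
avg_negative = average_degree(edges, negatives)
print(avg_positive, avg_negative)
```

In the toy example the positive group is more densely connected than the negative group, which mirrors the kind of contrast reported in the text (18 versus 12).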
The RAD51-Nup160 composite protein found in two fungal species, Metarhizium anisopliae ARSEF 23 (GI:322708659) and Phaeosphaeria nodorum SN15 (GI:169623440), annotated automatically in the corresponding sequence records, is strongly supported by gene expression data for human tissues, tightly coordinated not only with Nup160 but also Nup107, Nup133 and Nup43 (Figure 4). Interestingly, this association is also observed as a tandem gene cluster in Fusarium oxysporum lycopersici supercontig 2.1 (genes Foxg00234/5: https://img.jgi.doe.gov/cgi-bin/imgm_hmp/main.cgi?section=ScaffoldGraph&page=alignment&scaffold_id=2507525031,supercontig_2.1&coord1=779427&coord2=779510), further detected as a conserved pattern in multiple species where Nup160 remains unidentified (see https://img.jgi.doe.gov/cgi-bin/imgm_hmp/main.cgi?section=GeneNeighborhood&page=geneOrthologNeighborhood&gene_oid=2508346114&show_checkbox=1&cog_color=yes&use_bbh_lite=1). Additional experimental evidence is provided by Oct4 interactions (e.g. RAD50)31 as well as DNA damage response (DDR) studies, e.g. an siRNA-based microscopy screening of ionizing radiation responses, pointing out the critical role that nucleoporins might play in genome maintenance37. The adjacent configuration of Nup160-CcmE in four uncharacterized proteins of Amoebozoa (Figure 3A), namely Dictyostelium discoideum AX4 (GI:166240053), D. purpureum (GI:330796511), D. fasciculatum (GI:328873820) and Polysphondylium pallidum PN500 (GI:281210825), provides strong comparative genomics evidence, in the absence of solid expression data: the role of CcmE in this context remains unknown at present38. NUP98 (Nup98-96) is found in association with a SET domain in three fungal species, namely in annotated Fusarium oxysporum Fo5176 (GI:342873147), and two uncharacterized proteins in Metarhizium acridum CQMa 102 (GI:322698664) and M. anisopliae ARSEF 23 (GI:322711125).
Curiously, a similar configuration is found in patients with acute myeloblastic leukemia, with the fusion protein N-terminal NUP98-MLL acquiring an H3K4 methyltransferase ability through the SET domain present in MLL39. Similar observations support the association of SET with Nup9840, for instance the fusion of Nup98 to NSD1 (another SET-containing histone methyltransferase)41. The Nup43-DHX15 helicase association found in uncharacterized proteins of multiple insect species, for instance Nasonia vitripennis (GI:345482402), is consistent with the presence of a Werner helicase interacting protein in the Y-complex42 and DDX10 in leukemia43, while it is also detected in Oct4 interactions along with Nup4331 and shows very strong correlations with multiple Y-Nups (Supplementary Table 3, Figure 4). Most importantly, the interaction of DEAD-box helicases with other nucleoporins, for instance Ddx19 with Nup159, has been reported at the molecular level44. The association of the TAF9 domain linked to the Chs5p-Arf1p-binding domain and SEH1 in Ogataea parapolymorpha (GI:320581285) is supported by coordinated expression in human (TAF9, Figure 4) and the known involvement of TAF9 in the SAGA complex for chromatin remodelling6. Finally, centrosomal protein 192 (CEP192 in human) with a role in both centrosome maturation and spindle assembly45 is detected at the C-terminus of SEH1-specific WD40 repeats in multiple vertebrate species including rodents and marsupials, with high support by correlation clustering (Figure 4) and the presence of CEP192 with other centrosomal proteins and, curiously, Nup160 (figure 2 of cited work)28. Remarkably, this gene pair is also conserved in tandem organization across several vertebrate genomes (not shown). A set of complex patterns of variable functions is thus suggested by domain association analysis and validation (Table 1).
We have demonstrated the presence of particular domains with a wide range of functional roles in four Y-Nup instances (Table 1), indicating the association of those domains with the nuclear pore as unraveled by functional genomics evidence and evolutionary conservation. While issues of sequencing or assembly artifacts remain a possibility and will pose a continuing challenge for whole-genome analysis of this kind, there is strong evidence supporting our findings in recent experimental studies5,27. In this work, we encountered those issues arising from short-read genome assemblies which required the use of independently derived information to support domain association analysis: our approach can thus be regarded as a proposed framework for function prediction, which could be further automated and made available for the wider community. In particular, comparative genomics reveals the extent to which the discovered domain relationships are conserved and can point towards species-specific adaptations rather than artifacts. These instances can be assessed experimentally in situ, with advances in novel imaging and molecular technologies46: indeed, further experimental analysis will shed light on these multi-domain associations.
Our analysis exemplifies how genome sequence and functional genomics data can be coupled to unravel intricate associations of key supramolecular complexes known to defy biochemical characterization at present. Although our results cannot prove the discovered associations definitively, they direct future experimental efforts. As recently articulated, domain association inference (if properly executed) can yield low-coverage yet high-precision functional relationships and might supplement interaction proteomics18. Herein, we augment substantially the set of known interactions for Y-Nups, contributing evidence for new instances of functionally diverse molecules that are omnipresent in different taxonomic categories. Our results indicate that the structural and functional characterization of Y-Nups thus obtained represents a step towards a better understanding of the functional versatility of this key nuclear pore subcomplex. In summary, our results are consistent with the emerging view that Y-Nups, rather than serving as inert components of the nuclear pore, are actually functionally diverse and possess unexpected moonlighting functions5,46,47.
Proteins from the D. melanogaster nuclear pore complex considered as stoichiometrically assembled Y-Nups (i.e. explicitly excluding ELYS) were collected and tabulated (Supplementary Table 1). We maintain the order according to previous reports10.
Sequence filtering & searching
All sequences were masked using CAST48 with score ≥15 and otherwise default parameters, to exclude subtle compositional bias, including well-known repeats found in these proteins (Data Supplement DS01). In total, 160 regions were filtered out for such elements. These low-complexity, compositionally biased regions are provided separately, for further study (Data Supplement DS02).
The masked sequences were used as queries against the non-redundant protein sequence database (NRDB) at NCBI (15,052,178 entries)49 with BLAST (e-value cut-off threshold 10^-6)50. Furthermore, these searches were manually executed with PSI-BLAST with a variable number of iterations until convergence (PSI-BLAST parameters: e-value cut-off threshold 10^-4, 500 alignments, CAST score ≥15) (Supplementary Table 1), in particular to delineate possible anomalies such as multi-domain structure (Data Supplement DS03). Results from the above searches were evaluated (and confirmed with reverse sequence searches, not shown) and multi-domain similarities were extracted for subsequent analysis (for similarity distributions see Data Supplement DS04). Validity of domain associations was assessed by searching with linker sequences (Data Supplement DS05) against nucleotide databases, as a proxy for visual inspection of genome browser tracks; linkers were extracted and searched against these data collections within boundaries of ±20 amino acid residues where possible (Data Supplement DS06) and associated domains were separately extracted (Data Supplement DS07) and examined for taxonomic distribution (Data Supplement DS08). Multiple alignments were extracted and visualized by JalView51, using redundancy elimination interactively until the production of visually appealing multiple alignments (Data Supplement DS03).
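The linker extraction step described above can be sketched as follows. This is a minimal illustration, assuming that the ±20 residue boundaries are flanks added on each side of the inter-domain linker and clipped at the sequence ends ("where possible"); the function name, the coordinate convention and the toy sequence are hypothetical and are not taken from the Data Supplements.

```python
def extract_linker(sequence, domain_a_end, domain_b_start, flank=20):
    """Return the inter-domain linker plus up to `flank` residues on each side.

    Coordinates are 0-based and half-open: the first domain ends at
    `domain_a_end` (exclusive) and the second starts at `domain_b_start`.
    Flanks are clipped at the sequence boundaries.
    """
    if domain_a_end > domain_b_start:
        raise ValueError("domains overlap; there is no linker to extract")
    start = max(0, domain_a_end - flank)
    end = min(len(sequence), domain_b_start + flank)
    return sequence[start:end]


# Toy protein: a 30-residue domain, a 4-residue linker, a 30-residue domain.
toy = "A" * 30 + "GSGS" + "W" * 30
query = extract_linker(toy, 30, 34)
print(len(query), query)
```

The resulting fragment would then serve as a query against nucleotide databases (for example with a translated search), so that a hit spanning the whole fragment supports a contiguous gene model.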
Clustering & annotation
All detected homologs labeled accordingly were compared using BLAST in an all-against-all mode (e-value cut-off threshold 0.01), following CAST masking as above. The pairwise similarity list was submitted to MCL sequence clustering using an inflation value of 1.2; each cluster was assigned an incremental integer identifier12. Clusters are sorted by their size (number of members in a cluster, Data Supplement DS09); thus, the largest clusters have the smallest integer identifiers (groups with two or fewer members are omitted, namely 12 instances). These cases (12/2962 or 0.4%) yield a sensitivity level of 99.6%. Conversely, two ‘false’ positives in clusters C1 (Nup98, GI:262118708) and C2 (Nup75, GI:307191801) yield a specificity level of 99.9% (under further investigation – Promponas et al., in preparation).
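The cluster numbering and the quoted rates can be reproduced with a few lines. This is a sketch under stated assumptions: clusters are taken as plain lists of sequence identifiers (for instance parsed from MCL output), the minimum cluster size of three follows from the omission of groups with two or fewer members, and the two rates are computed as simple proportions of the 2962-entry compendium, which is how the text appears to derive them. The function name and the toy identifiers are hypothetical.

```python
def number_clusters(clusters, min_size=3):
    """Label clusters C1, C2, ... by decreasing size, dropping small groups."""
    kept = [members for members in clusters if len(members) >= min_size]
    kept.sort(key=len, reverse=True)  # largest cluster gets the smallest id
    return {f"C{i}": members for i, members in enumerate(kept, start=1)}


# Toy input: three clusters, one of which is too small to be retained.
toy_clusters = [["s1", "s2", "s3"], ["s4", "s5"], ["s6", "s7", "s8", "s9"]]
numbered = number_clusters(toy_clusters)
print(numbered)

# Rates as quoted in the text, taken as proportions of the compendium.
total = 2962
omitted = 12           # members of groups with two or fewer sequences
false_positives = 2    # one each in clusters C1 and C2
sensitivity = 100.0 * (total - omitted) / total
specificity = 100.0 * (total - false_positives) / total
print(f"sensitivity {sensitivity:.1f}%, specificity {specificity:.1f}%")
```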
Expression profiles and protein interactions
Next-generation sequencing (NGS) data for a wide range of human and mouse tissues and cell lines were extracted from multiple available sources (Supplementary Table 4 – for other species, data are not as rich). Expression data for each instance were measured using cRPKM units [corrected (for mappability) Reads Per Kilobase per Million mapped reads], calculated as previously described16,52. The orthologs from human and mouse were analyzed for tissue-specific gene expression across all samples16,17 (Data Supplement DS10). Identification of cassette alternative exons and quantification of their transcript inclusion levels across samples was performed as previously described53 (see also: Hon et al., submitted). Both sequence clusters and gene expression profiles (Figures 1 and 2) were visualized with Circos54.
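For orientation, the cRPKM unit can be illustrated with the standard RPKM formula in which transcript length is replaced by the number of uniquely mappable positions. That substitution is an assumption made here for illustration; the exact correction is the one described in the cited references, and the function name and the numbers below are hypothetical.

```python
def crpkm(read_count, mappable_positions, total_mapped_reads):
    """Reads per kilobase of mappable sequence per million mapped reads.

    Assumes the mappability correction amounts to using the count of
    uniquely mappable positions in place of the full transcript length.
    """
    if mappable_positions <= 0 or total_mapped_reads <= 0:
        raise ValueError("positions and total reads must be positive")
    per_kilobase = read_count / (mappable_positions / 1e3)
    return per_kilobase / (total_mapped_reads / 1e6)


# Hypothetical gene: 500 reads over 2,000 mappable positions,
# in a sample with 20 million mapped reads.
value = crpkm(500, 2000, 20_000_000)
print(value)
```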
Gene expression data for Y-Nups and associated domain homologs in human (Supplementary Table 4) were subject to bootstrap rank correlation statistics (Supplementary Table 3). Expression patterns from 300 randomly selected human genes were systematically sampled 500 times with replacement for bootstrapping, in subsets of 100 expression patterns. Each subset was merged with Y-Nup and associated domain homolog expression profiles for Spearman rank correlation analysis, and average ranks were recorded (Supplementary Table 3). The complete gene expression dataset (human Y-Nups, human homologs of the 27 associated domains, random sample of 300 human genes) was clustered based on Spearman rank correlation coefficients (Figure 4).
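The bootstrap procedure above can be sketched in plain Python. This is a minimal illustration rather than the pipeline used in the study: it assumes that the recorded quantity is the rank of a candidate domain homolog among a resampled background when genes are ordered by their Spearman correlation with a given Y-Nup profile, averaged over rounds. The function names and toy profiles are hypothetical; the defaults of 500 rounds and subsets of 100 follow the text.

```python
import random
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either profile is constant."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0


def spearman(x, y):
    """Spearman rank correlation as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def bootstrap_mean_rank(target, candidate, background,
                        n_rounds=500, subset_size=100, seed=0):
    """Average rank of `candidate` (1 = best) against resampled background."""
    rng = random.Random(seed)
    candidate_rho = spearman(target, candidate)
    background_rho = [spearman(target, profile) for profile in background]
    observed = []
    for _ in range(n_rounds):
        sample = rng.choices(background_rho, k=subset_size)  # with replacement
        observed.append(1 + sum(1 for rho in sample if rho > candidate_rho))
    return mean(observed)


# Toy expression profiles across five tissues (hypothetical values).
target = [1, 2, 3, 4, 5]
candidate = [2, 4, 6, 8, 10]
background = [[5, 1, 4, 2, 3], [3, 3, 1, 2, 5], [1, 2, 3, 5, 4]]
print(bootstrap_mean_rank(target, candidate, background,
                          n_rounds=20, subset_size=5))
```

Correlations with the target are computed once per background gene and then resampled, which is equivalent to resampling the profiles and recomputing the correlations in each round.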
Known protein interactions were extracted from the PINA database55,56 and annotated appropriately; these data were augmented by the discovered domain associations (Data Supplement DS11a), and are made available in BioLayout57 format for visual exploration. Coordinated tissue-specific gene expression data enriched in Y-Nups were used as a composite query to GeneMANIA36 resulting in supporting evidence from high-throughput experiments (Data Supplement DS11b). Note that the PINA results are used only to reflect the current status of knowledge for Y-Nup interactions while the GeneMANIA results are used to discover and provide support for the novel findings reported here.
Entire Y-Nup sequence compendium
All 2989 sequences detected by the above analysis (2962 Y-Nups plus 27 external domains) are labeled by property and domain association and provided in FASTA format for further study and as a possible basis for a more consistent nomenclature (Data Supplement DS12).
All results (in 12 Data Supplements) are available as a ZIP archive (58.3 MBytes) on http://dx.doi.org/10.6084/m9.figshare.840452
Parts of this work have been supported by the FP7 Collaborative Projects MICROME (grant agreement # 222886-2) and CEREBRAD (grant agreement # 295552), both funded by the European Commission. C.A.O. thanks the Department of Biological Sciences at the University of Cyprus for their kind hospitality during 2012. All authors wish to thank Dr. Valérie Doye (Université Paris Diderot, Sorbonne Paris) for critical comments on the manuscript and for suggesting the term Y-Nups. M.I. is the recipient of a HFSP Long Term Fellowship, and B.J.B. gratefully acknowledges funding from the Canadian Institutes for Health Research.