Introduction

The availability of the human genome sequence1,2 provided the blueprint for the diverse elements encoding the proteome. The exciting opportunity of comprehensively deciphering the function of these sequences remains a challenge. Traditionally, translating knowledge of a linear nucleic acid (amino acid) sequence into mechanistic insights requires a mixture of phenotypes obtained through genetic investigation, reconstituted biochemical assays, and structural determination. Though these studies may prove technically challenging for any gene, they are particularly so for membrane proteins at the cell surface or in intracellular organelle bilayers. Membrane proteins include receptors, ion channels, transporters, and enzymes. Constituting a significant fraction (20%–30%) of human genes3, membrane proteins represent the targets of over half of known drugs4,5. As the lipid membrane of the cell constitutes only 6%–12% of the cytosolic volume, with the plasma membrane representing only 2%–5% of this total6, the biochemical environment necessary for transmembrane protein function is highly specialized. Furthermore, the chemical compositions of the two sides of the membrane are physiologically different; a membrane protein is thus theoretically situated in three biochemically distinct environments. In addition to the critical requirement for a lipid environment, functional characterization of membrane proteins, like that of soluble proteins, faces other challenges including functional redundancy, macromolecular organization, and dependence on physiological conditions7,8,9,10,11,12,13,14.

Because of these challenges of characterizing a membrane protein, studies to understand the role of novel genes would benefit from the ability to narrow the potential number of candidates. In one scenario, molecular determinants are sought for a specific physiological process or disease phenotype that is hypothesized to involve membrane receptors, such as ion flux. Here, 'de-orphanizing' involves finding those genes whose presence or function correlates with this phenotype, through reverse genetics, transcriptional profiling, and other methods15,16,17. Alternatively, the phenotype of interest may not be known beyond a general category such as ion channel, and the challenge is to identify a plausible collection of uncharacterized genes that may share general functional similarity with known families18,19. Consequently, 'de-orphanization' may also involve identifying the native ligand for novel receptors, ionic substrate for orphan channels or transporters, and physiological protein-protein interactions. Thus, it aids in the definition of their functional phenotype20,21,22. In all cases, it is helpful to leverage data on the functionally characterized portion of the genome to infer the biological roles of the unannotated set based on existing information. Traditionally, this idea is demonstrated by the use of nucleic acid (amino acid) sequence similarity to infer possible functional homology. A popular heuristic algorithm for this problem is the Basic Local Alignment Search Tool (BLAST), which detects statistically significant matches between a query sequence and a database using a reference distribution of randomized sequence alignments as the 'null' comparison23. More complex approaches have also been proposed, such as hidden Markov models (HMM), which accommodate variations in insertion/deletion probability in different domains of a protein, instead of the position-agnostic gap penalty used by BLAST24,25. 
Furthermore, innovations in statistical 'machine learning' models allow sequence data to be combined with other protein features and annotations to make functional predictions. As more information from large-scale functional and interaction studies becomes available, this kind of data integration will likely play an increasing role in prioritizing candidate lists of functionally uncharacterized genes as potential molecular determinants for phenotypes of interest.

In this perspective we review the roles that bioinformatics can play in deorphanizing the uncharacterized membrane proteins in the human genome. This task is outlined in Figure 1, which involves the two strategies outlined above. In the first scenario, a phenotype of interest is known, and genome-wide screens are used to generate candidate orphan genes which may be the molecular determinant for the process of interest. Here, bioinformatics approaches such as topology prediction are used to filter the results, with overall predictive accuracy of less concern than the detection of a single validated determinant for the phenotype. Alternatively, the objective is to identify the unknown function of these novel genes, with the only reference as their similarity to known proteins. In the first step of this class of investigation, genomic databases are used as a basis for global prediction of membrane proteins through topological models. Secondly, these membrane proteins are further clustered into functionally related groups, based on sequence homology, conserved motifs, and existing annotations. As discussed in the following sections, this analysis has traditionally consisted of solely in silico approaches, where accuracy is judged through retrospective analyses predicting the class of previously characterized proteins. However, one may speculate that such methods could effectively narrow the search space for novel membrane proteins in experimental studies, particularly in cases where a phenotype of interest is not known or well characterized. After reviewing examples of the methodologies and results from both approaches, we provide an analysis of the current landscape of characterized and orphan membrane proteins in the human genome that might be utilized as a broad guide for de-orphanization efforts. 
Finally, we discuss future challenges, particularly in integrating experimental and bioinformatics approaches in cases where the phenotype of a novel transmembrane protein is not known in advance.

Figure 1

Deorphanization strategies. Left (Blue): In silico analyses of genomic sequences, topological prediction, and functional prediction. Right (Red): Phenotype of interest followed by genomic screen, bioinformatics evaluation of candidate list topology, and experimental validation.

Genomic prediction and validation of transmembrane protein function

The ability to survey the expression and activity of a large number of genes through microarrays, large-scale proteomics, and functional genetics screens has greatly aided the molecular characterization of diseases and signaling pathways26,27,28,29,30. Because these processes may involve cascades that begin at the plasma or organelle membrane, transmembrane proteins that are the primary drivers initiating these pathways may have a similar readout to downstream components in these assays. Thus, bioinformatics that can identify transmembrane proteins helps to narrow the number of candidates in which to invest follow-up experimental effort. Going further, bioinformatics may narrow the set of genes initially screened by using existing datasets to identify candidates for novel functions. Both cases are illustrated by recent examples summarized in Table 1.

Table 1 Novel experimentally validated membrane proteins.

The discovery of Leucine zipper-EF-hand containing transmembrane protein 1 (Letm1) as a mitochondrial Ca2+/H+ antiporter demonstrates the use of bioinformatics to refine a list of candidates from genomic functional assays15. To identify proteins implicated in calcium transport across the inner mitochondrial membrane, the authors used a genome-wide RNA interference (RNAi) screen in Drosophila cells using fluorescent calcium and membrane potential-sensitive dyes to identify genes whose loss affected the ion homeostasis of the mitochondrial compartment. Having identified a list of candidates, they further filtered the results to include only those with predicted transmembrane segments, as soluble proteins might be members of signaling pathways that indirectly modulate but are not themselves directly implicated in ion transport. A subsequent homology search for related mammalian sequences yielded Letm1 as a Ca2+/H+ antiporter.

Similarly, the chloride-conductive 'swell' Drosophila Bestrophin 1 (dBest1) channel was identified using a fluorescence anion-sensitive dye in a flux assay combined with RNAi knockdown31. As with the Letm1 study, bioinformatics was used to eliminate candidate genes regulating cell volume and chloride conductance lacking predicted transmembrane spanning segments. A challenging aspect of this study is that chloride channels, unlike other better characterized channels, such as voltage-gated potassium channels32, currently lack a signature sequence motif that might help to restrict the search space of possible membrane proteins involved in chloride conductance31. Thus, an unbiased genomic screen using a very specific phenotypic outcome was used to perform the bulk of candidate selection, with bioinformatics refining the hit list rather than defining the initial experimental scope.

These two studies used functional genomics to identify candidate genes whose loss is causally linked to the phenotype of interest. A related approach is to find genes that are correlated, through expression level, with this phenotype. This method was used in one of three studies reporting discovery of the calcium-sensitive chloride channel transmembrane protein 16 (TMEM16a)11,16,19. Here, the authors used microarray analysis of bronchial epithelial cells, which display increased calcium-activated chloride current following interleukin 4 (IL-4) treatment16. After genes differentially expressed following IL-4 treatment were identified, topological predictions were used to filter the hit list, guiding subsequent identification of TMEM16a16. A similar strategy of identifying differentially expressed genes correlated with a phenotype of interest was used to identify channels involved with mechanosensation. Unlike other tissues examined by the authors, mouse neuro 2a (N2A) neural crest cells displayed a mechanosensitive current, leading to the hypothesis that pressure-sensitive channels would be represented among transcripts enriched in this cellular population33. Experimental studies of the resulting candidates identified Piezo1 and Piezo2 as mechanosensitive channels33. As with functional genomics approaches, the success of these studies appears to require very specific phenotypic queries that may be compared to large genomic space using profiling methodologies such as microarrays.

An example of integrated genomic analysis is the discovery of the mitochondrial calcium uniporter component MCU17. Here, the authors leveraged previous mass spectrometric profiling of the mitochondrial proteome34, phylogenetic conservation of genes along an evolutionary tree, and tissue coexpression35 to identify genes with similar profiles across these three parameters compared to the uniporter regulator mitochondrial calcium uptake 1 (MICU1)17. This analysis identified MCU as a top candidate across all three parameters, a prediction verified by subsequent functional experiments. Unlike the previously described studies, bioinformatics played a key role in forming the initial 'hit list' for experimental validation, rather than refining a list that was primarily generated through unbiased screening with reference to a phenotype of interest. Another striking example of this purely bioinformatic discovery is the identification of the Ciona intestinalis voltage-sensitive phosphatase (Ci-VSP)18. In a 'perfect storm' of sequence homology, this gene was found to contain both a well-defined voltage sensor similar to ion channels and a phosphatase region. Thus, even though such a combination of modular units might not have been anticipated based on existing knowledge of these two protein families, unbiased computational screening allowed discovery of this novel transmembrane protein.

In most of the examples described, bioinformatic techniques have been utilized after unbiased, genome-wide analyses to filter candidate lists of potential membrane proteins underlying a phenomenon of interest, rather than identifying an initial, limited set for experimental evaluation. Also noteworthy is the fact that many of these studies utilize differences in tissue phenotypes, such as ionic currents sensitive to particular stimuli, rather than computational motifs to identify candidate genes. As noted above in the analysis of chloride channels, the lack of well-defined functional motifs that might be used as an in silico filter necessitates this sort of approach. However, in the absence of a well-defined phenotype, how might novel membrane proteins be prioritized for characterization? How might the natural ligands, substrates and protein interaction partners of otherwise well-characterized orphan proteins be elucidated? We examine these questions by first describing topological prediction algorithms, then methods for functional inference.

Prediction of membrane proteins and topology

The first level of discrimination in the bioinformatics analysis depicted in Figure 1 is to separate putative membrane proteins from soluble proteins. A number of algorithms have been reported for this task, as illustrated in Figure 2 and summarized in Table 2.

Figure 2
Algorithms for topological and functional prediction. Primary amino acid sequence (top left) is employed to predict secondary structure topology motifs (transmembrane helices, cytosolic loops, signal peptides) (top right), while secondary descriptors describing composition or substitution patterns of amino acids (bottom) are used for functional prediction for membrane proteins.

Table 2 Algorithms for predicting membrane proteins.

Some of the early studies in this field identified simple and effective heuristics for topology prediction. This is demonstrated by 'rules-based' methods such as Topology Prediction (TOPRED), which score each amino acid using the mean hydrophobicity of its surrounding residues, and calculate putative transmembrane regions and topology using the 'positive inside' rule36 in which positively charged residues have a bias to face the cytoplasm37. Thus, topological predictions are generated in a manner analogous to a Kyte-Doolittle plot38, by finding a threshold for hydrophobicity that will divide a protein's hydrophobicity profile into transmembrane and cytosolic elements. Alignment methods such as dense alignment surface (DAS) and transmembrane multiple alignment prediction (TMAP) extend this idea by seeking supporting information across multiple proteins: they generate consensus dot plots comparing the hydropathy profile of the protein of interest to a collection of background reference sequences or to multiple sequence alignments with homologs39,40.
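The hydropathy-threshold idea described above can be sketched in a few lines of Python. This is a minimal illustration and not the TOPRED implementation: the window width, threshold, and minimum segment length are assumed values chosen for the example, the function names are hypothetical, and the 'positive inside' step is reduced to comparing lysine/arginine counts on either side of a single segment. The hydropathy values are the standard Kyte-Doolittle scale.

```python
# Kyte-Doolittle hydropathy scale (standard published values).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=9):
    """Mean hydropathy of the window centred on each residue (truncated at the ends)."""
    half = window // 2
    profile = []
    for i in range(len(seq)):
        lo, hi = max(0, i - half), min(len(seq), i + half + 1)
        values = [KD[aa] for aa in seq[lo:hi]]
        profile.append(sum(values) / len(values))
    return profile


def candidate_segments(profile, threshold=1.6, min_len=15):
    """Half-open (start, end) runs where the profile stays at or above the threshold.

    threshold and min_len are illustrative assumptions, not published parameters.
    """
    segments, start = [], None
    # A trailing sentinel closes a run that reaches the end of the sequence.
    for i, score in enumerate(profile + [float("-inf")]):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    return segments


def cytoplasmic_side(seq, segment):
    """Simplified 'positive inside' rule for one segment: the flank with more K/R is cytoplasmic."""
    start, end = segment
    n_flank = sum(seq[:start].count(aa) for aa in "KR")
    c_flank = sum(seq[end:].count(aa) for aa in "KR")
    return "N-terminal" if n_flank >= c_flank else "C-terminal"


# Toy sequence: charged flanks around a 21-residue poly-leucine core.
toy = "KKRDE" * 4 + "L" * 21 + "DEDSG" * 4
segments = candidate_segments(hydropathy_profile(toy))
```

On the toy sequence, the thresholded profile yields one candidate segment inside the leucine core, and the lysine/arginine-rich N-terminal flank is assigned to the cytoplasmic side.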

Later developments have further explored the use of patterns present across databases of known proteins to identify useful statistical patterns for topological analysis. As with homology searches in genomic databases, hidden Markov models (HMMs) are a popular method to model the statistical properties of biological sequences. HMMs were developed for automated speech processing41,42, in which an observed audiogram is produced by a set of unknown words correlated with certain tonal patterns. The algorithm then statistically reconstructs the most likely word producing a given pattern of sounds over each time interval given these input properties. Similarly, an observed distribution of amino acids may be considered as an observed 'signal,' with the hidden states being topological descriptions (such as transmembrane helix or cytoplasmic loop), which produce different distributions of observed amino acids43. This process, which cycles through each amino acid to find the optimal series of 'states' that explain the observed pattern, resembles earlier dynamic programming approaches which sought to find an optimal topological prediction by iteratively building up predictions from sub-sequences44. Additional complexity arises from the fact that type I membrane proteins possess a signal peptide directing them to the secretory pathway45, a motif that resembles a transmembrane helix and thus may be mis-identified by the algorithm. Thus, HMMs may be improved by incorporating 'signal peptide' as one of their hidden states, as implemented in the Phobius and signal peptide obtainer of correct topologies for uncharacterized sequences (SPOCTOPUS) programs46,47. Other variations of this approach are possible, such as the scale-based method for prediction of integral membrane proteins (SCAMPI) program, which uses the predicted free energy of amino acids as the 'observed state' instead of the amino acids themselves48. 
Taken together, these prediction methods have demonstrated remarkable accuracies of 80%–97% in discriminating soluble from membrane proteins and predicting transmembrane helices in retrospective analyses47,48. Additionally, as demonstrated by previously described studies, these methods' practical utility has been proven in successful filtration of hits lists from unbiased screens.
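The hidden-state decoding described above (finding the series of topological states that best explains the observed residues) is usually done with the Viterbi algorithm. The sketch below is a toy two-state model, not TMHMM, Phobius, or any published predictor: the state names, the reduction of residues to hydrophobic/polar symbols, and every probability are hypothetical values chosen only to make the recursion concrete.

```python
import math

HYDROPHOBIC = set("AILMFVWC")  # assumed residue grouping for this toy model

STATES = ("loop", "helix")
# All probabilities below are made-up illustrative values.
START = {"loop": 0.9, "helix": 0.1}
TRANS = {
    "loop": {"loop": 0.9, "helix": 0.1},
    "helix": {"loop": 0.1, "helix": 0.9},
}
EMIT = {
    "loop": {"h": 0.2, "p": 0.8},   # loops emit mostly polar residues
    "helix": {"h": 0.9, "p": 0.1},  # helices emit mostly hydrophobic residues
}


def viterbi(seq):
    """Most probable state path for seq under the toy HMM, computed in log space."""
    if not seq:
        return []
    obs = ["h" if aa in HYDROPHOBIC else "p" for aa in seq]
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    backpointers = []
    for symbol in obs[1:]:
        prev, score, pointer = score, {}, {}
        for s in STATES:
            # Best predecessor state for landing in state s at this position.
            best = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][symbol])
            pointer[s] = best
        backpointers.append(pointer)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return path[::-1]


path = viterbi("DDKK" + "L" * 8 + "KKDD")
```

A signal-peptide-aware model of the kind described for Phobius and SPOCTOPUS would add further states with their own emission distributions; the recursion itself would be unchanged.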

These successes are particularly notable given the inherently challenging nature of the problem they tackle. Indeed, one of the complexities of predicting protein topology from biological sequence is the inherent dependency between position and structure. Neural networks (NNs) are another approach that seeks to represent these nonlinearities by mapping a set of inputs [such as position-specific scoring matrices (PSSMs) representing the likelihood of residues in particular positions over a sliding window of a protein structure] to topological states such as transmembrane helices49. This mapping is performed by connecting the input data to the output through a series of 'neurons,' a set of logistic functions whose sigmoidal behavior in response to their inputs resembles activation thresholds in the mammalian nervous system. The observed topological states of a protein are thus modeled as a weighted combination of nonlinear activation functions, and the weights connecting the units are optimized to best reconstruct the desired output. This approach may be used independently, or combined with other algorithms. For example, the SPOCTOPUS program combines an NN and an HMM, using the output from the NN as an input to the HMM47, thus improving inference of the 'hidden states' in the HMM.
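The forward pass of such a network, in which window inputs feed logistic 'neurons' whose outputs are combined by a final logistic unit, can be sketched as follows. This is a deliberately reduced illustration: the two window features stand in for the PSSM inputs used by published predictors, and the weights are hand-set hypothetical values rather than weights learned from data.

```python
import math

HYDROPHOBIC = set("AILMFVWC")  # assumed residue groupings for the toy features
CHARGED = set("DEKR")


def sigmoid(x):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def window_features(window):
    """Two toy inputs: fraction of hydrophobic and of charged residues in the window."""
    n = len(window)
    return [
        sum(aa in HYDROPHOBIC for aa in window) / n,
        sum(aa in CHARGED for aa in window) / n,
    ]


# Hand-set, hypothetical (weights, bias) pairs; a real network would learn these.
HIDDEN_UNITS = [([6.0, 0.0], -3.0), ([0.0, 6.0], -3.0)]
OUTPUT_UNIT = ([4.0, -4.0], 0.0)


def neuron(inputs, weights, bias):
    """Weighted sum of inputs passed through the logistic function."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def helix_score(window):
    """Output in (0, 1); higher values favour a transmembrane-helix label."""
    hidden = [neuron(window_features(window), w, b) for w, b in HIDDEN_UNITS]
    return neuron(hidden, *OUTPUT_UNIT)


def scan(seq, width=9):
    """Score every full-width window along the sequence."""
    return [helix_score(seq[i:i + width]) for i in range(len(seq) - width + 1)]
```

Training would adjust the weights in HIDDEN_UNITS and OUTPUT_UNIT to minimize the mismatch between these scores and known topologies; that step is omitted here.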

In addition to NNs, Support Vector Machines (SVMs) have also been utilized to perform a nonlinear mapping from input sequences to topological states. The 'support vector' in the name is derived from the fact that only a small subset of the data used to develop the model is used to generate parameters for future prediction. These 'support vectors' lie at the boundary between the classes of data, such as transmembrane helices and cytosolic loop sequences, which the algorithm seeks to classify. A strength of this approach is that the SVM may use a similarity function (kernel), such as a Gaussian or a polynomial, to find a boundary separating classes that are intermingled in their original vector space. Like NNs, SVMs applied to topology prediction may also utilize as input a position-specific scoring matrix (PSSM) for a sliding window over the protein sequence50. Generating this prediction over the whole length of the protein thus yields a predicted topology.
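To make the role of the kernel and the support vectors concrete, the sketch below evaluates an SVM decision function with a Gaussian kernel on a one-dimensional toy problem in which one class sits between two clusters of the other, so no single threshold separates them. The support vectors, coefficients, and bias are hypothetical hand-picked values, not the result of training, and the inputs are plain numbers rather than PSSM windows.

```python
import math


def gaussian_kernel(x, z, gamma=1.0):
    """Similarity exp(-gamma * ||x - z||^2) between two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


# Hypothetical support vectors with class labels (+1 or -1); only these
# points, not the full training set, enter the prediction.
SUPPORT_VECTORS = [((-2.0,), -1), ((0.0,), +1), ((2.0,), -1)]
ALPHAS = [1.0, 1.0, 1.0]  # assumed coefficients, normally found by training
BIAS = 0.0


def decision_value(x, gamma=1.0):
    """Signed score: sum_i alpha_i * y_i * K(sv_i, x) + bias."""
    return BIAS + sum(
        alpha * label * gaussian_kernel(sv, x, gamma)
        for alpha, (sv, label) in zip(ALPHAS, SUPPORT_VECTORS)
    )


def classify(x):
    """Predicted class label for the input vector x."""
    return 1 if decision_value(x) > 0 else -1
```

With these values the point at 0 is assigned to the positive class and the points at -2 and 2 to the negative class, a split that a linear boundary on the raw coordinate could not produce.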

Given that each algorithm discussed above may have scenarios in which it performs better or worse, it seems reasonable to infer that combining these methods may overcome some of their individual shortcomings. This sort of combination has the benefit of offsetting weakness in a single method, and of potentially pooling weak evidence from multiple predictions to yield stronger collective evidence. For example, the consensus prediction (ConPred) algorithm uses a heuristic rules system to average inputs from multiple topology prediction methods to derive a consensus51. Similarly, Bayesian prediction of membrane protein topology (BPROMPT) uses a Bayesian belief network to combine evidence from five methods into a consensus52, modeling this consensus as a 'child' node that receives weighted inputs from the five 'parent' methods.
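The simplest form of such pooling is a per-residue majority vote over the label strings produced by several predictors, sketched below. This is a stand-in chosen for brevity; it is neither the rule system used by ConPred nor a Bayesian belief network. The label alphabet ('i' inside, 'M' membrane, 'o' outside) and the '?' marker for ties are assumptions of the example.

```python
from collections import Counter


def consensus_topology(predictions):
    """Per-residue majority vote over equal-length topology strings.

    Positions where no single label has the most votes are marked '?'.
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("need at least one prediction, all of the same length")
    consensus = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        winners = [label for label, count in counts.items() if count == top]
        consensus.append(winners[0] if len(winners) == 1 else "?")
    return "".join(consensus)


# Three hypothetical predictor outputs for the same eight-residue stretch.
merged = consensus_topology(["iiMMMMoo", "iiiMMMoo", "iMMMMMoo"])
```

A weighted variant would multiply each method's vote by a reliability estimate, which is closer in spirit to the weighted parent-to-child inputs described for the Bayesian approach.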

The previously described algorithms, whether they employ amino acid frequencies, hydropathy, or folding free energy, primarily use information derived from the linear, primary structure of amino acid sequences. The resulting topology gives a 'flat' inference for tertiary or quaternary structure, but little guidance as to how the resulting helices are organized in a three-dimensional space. Such challenges have prompted the development of algorithms building on two dimensional topological predictions to infer three dimensional coordinates based on linear amino acid sequences, utilizing the population of previously solved x-ray crystal structures of membrane proteins to generate homology-based predictions53,54,55. In the absence of gold-standard structural data for most membrane proteins and channels, such techniques may represent the next-best option for tasks such as virtual small-molecule docking that require three-dimensional coordinates.

Functional sub-classification of transmembrane protein classes

After membrane proteins are identified and separated from soluble proteins using the topology prediction programs outlined above, the second level of classification in Figure 1 involves grouping the population of membrane proteins into individual functional classes and prospectively identifying the function of uncharacterized genes. Several methods have been reported to accomplish this task, which are summarized in Table 3 and visually diagrammed in Figure 2.

Table 3 Algorithms for predicting functional class of membrane proteins.

As with many topology prediction algorithms, these methods often require the amino acid sequence to be summarized in a quantitative fashion to compare two proteins. One such descriptor that has been successfully utilized is the fraction of a protein's sequence comprised of each of the twenty naturally occurring amino acids, a vector of length twenty that sums to one and is termed the 'amino acid composition'56,57. The intuition behind this descriptor is that distinct classes of membrane proteins have a bias to include particular amino acids at greater frequency due to the structural requirements or constraints for their function. Refinements of the amino acid composition descriptor have also been proposed, such as using the un-normalized count of the twenty amino acids in a protein sequence, a method reported to be more effective as it also captures differences in the characteristic length of a protein family58. Similarly, expanding the normalized amino acid composition to a vector of length sixty (twenty elements for the composition of the whole protein, and twenty elements each for the composition of the transmembrane and non-transmembrane segments) has also allowed better discrimination59. Like amino acid composition, dipeptide frequencies have also been successfully utilized as descriptors to discriminate membrane proteins of different classes56,57,60. The previously mentioned PSSMs, here derived from Position-Specific Iterative BLAST (PSI-BLAST), measure the likelihood of a substitution from the observed to an alternate amino acid at a particular position based on substitution patterns between a protein and its homologous neighbors, and have also been found to have high sensitivity as a descriptor61. More abstractly, numerical descriptors of folding energetics have also been employed in predictive models61.
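The composition-based descriptors described above are straightforward to compute. The sketch below builds the twenty-element amino acid composition, the 400-element dipeptide frequency vector, and the sixty-element vector that concatenates whole-protein, transmembrane, and non-transmembrane compositions. The function names, the alphabetical residue ordering, and the use of a per-residue topology string with 'M' marking membrane residues are assumptions made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # fixed ordering assumed for all vectors


def composition(seq):
    """Fraction of the sequence made up by each of the twenty amino acids."""
    if not seq:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def dipeptide_frequencies(seq):
    """Frequencies of the 400 ordered residue pairs among adjacent positions."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = len(seq) - 1
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    if total <= 0:
        return [0.0] * len(pairs)
    return [counts[p] / total for p in pairs]


def segment_composition(seq, topology):
    """Sixty-element vector: whole protein, then TM residues, then non-TM residues.

    topology is a per-residue label string in which 'M' marks membrane residues.
    """
    tm = "".join(aa for aa, label in zip(seq, topology) if label == "M")
    non_tm = "".join(aa for aa, label in zip(seq, topology) if label != "M")
    return composition(seq) + composition(tm) + composition(non_tm)
```

Replacing the division in composition with raw counts gives the un-normalized variant, which additionally carries information about protein length.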

Just as the input descriptors to these algorithms are varied, so are the kinds of functional predictions produced in these studies. Several methods have been used to predict a query gene's family membership, such as classifying channels, transporters, and carriers from one another58. In greater detail, these methods have also been used to predict a protein's substrate, such as different metal ions for channels or protein/nucleic acids for transporters62. Predictions have also been targeted for functional parameters specific to particular classes of membrane proteins. For example, amino acid sequence has been used to predict the half-maximal activation potential of voltage gated channels63, discriminate between channels based on their electrophysiological parameters64, or identify channels that may serve as promising therapeutic targets65.

These previously described methods, in essence, rely on the proximity of a query protein to a neighborhood of known proteins in the space of the descriptor used. Further refinements have been proposed, where this proximity measurement may be combined with other features such as Gene Ontology terms describing the biological processes, molecular functions, subcellular localization of a protein66, presence of class-associated protein families (Pfam) domains67, or the number of predicted transmembrane domains68. The resulting combination of annotated and raw sequence information may then be used in a prediction algorithm such as the previously discussed SVM68. Indeed, the ability of amino acid profiles to serve as relevant features for identifying functionally related proteins may suggest that families share specific motifs, and specific structural fragments and motifs have also been identified in related studies69,70.

Expanding these predictions based on two-dimensional structure correlated with classifications or functional parameters, methods have also been developed to directly infer function based on a three dimensional conformation. For example, the SLITHER program uses molecular modeling simulations to predict whether a putative substrate molecule may permeate the cavities or channels in a protein structure71. In cases where the existence of a channel in a protein is unverified, the MolAxis program can be used to predict whether they exist using computational geometry72. Obviously, both of these methodologies require three-dimensional protein coordinates which are experimentally unavailable for most channels or other membrane proteins, but might be combined with homology-based three dimensional structure predictions described in the previous section to generate functional predictions for inferred three dimensional structures.

A related functional prediction is to identify the natural ligand, ion substrate, or protein interaction partner of a novel protein. The challenge is highlighted by the large number of seven-transmembrane receptors that remain to be deorphanized, including otherwise well-characterized receptors such as BRS-3 whose natural binding partner(s) remain unknown73. Though not specifically developed to identify peptide-receptor interactions in silico, large-scale predictions of protein-protein interactions have been described using two- and three-dimensional information74,75,76. Conceivably, such algorithms might be employed to identify novel interactions between peptide ligands and the subset of peptide-binding receptors. Direct bioinformatics identification of ligands such as neuropeptide precursors has also benefited from the increased availability of genome-wide proteomic and nucleotide data, as demonstrated by the computational prediction of more than 200 novel neuropeptides in the honeybee Apis mellifera, of which 100 were validated using peptidomics77. Related studies of the red flour beetle Tribolium castaneum have employed homology analysis to validate 30/41 predicted neuropeptide genes using mass spectrometry data, encoding 71 peptides78. Given the accuracy of the predictions in these studies using large genomics datasets, we speculate that such methods and information provide a promising pool of potential novel ligands that might be screened in functional assays against putative peptide-binding receptors.

A reference map of uncharacterized membrane proteins

In the previous sections we have provided an overview of experimental and computational methodologies used to de-orphanize uncharacterized membrane proteins. Here we quantify how much of the transmembrane proteome has been characterized, and whether the coverage of the characterized regions is biased toward proteins with a particular topology by generating a reference map of the human transmembrane proteome.

This analysis is based on 35 879 unique human RefSeq protein sequences downloaded from NCBI as GenBank records. To reduce bias in our analysis resulting from proteins with multiple isoforms, we collapsed this collection into unique gene symbols by retaining only the entry (for a given gene symbol) with the greatest number of annotated transmembrane segments among annotated sites in their GenBank fields, under the hypothesis that the sequence with the most annotated segments represents the most studied and highest-quality record for a particular gene. In cases where the gene has no transmembrane helices we simply kept the first occurring entry. Applying this filter left 19 977 sequences. Because uncharacterized membrane proteins may lack annotated transmembrane segments, we utilized several of the previously described topology prediction programs to generate an estimated transmembrane segment count for these orphan proteins. The three programs used were TMHMM2.043, SCAMPI_multi79, and PHOBIUS1.0.146, and the weights used to average the predictions were estimated using a linear regression against a count of known transmembrane segments.
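The isoform-collapsing rule and the weighted averaging of predictor outputs described above can be sketched as follows. The record layout (dictionaries with 'id', 'gene', and 'tm_annotated' keys), the accession strings, and the weights are hypothetical placeholders; in the analysis described in the text the weights come from a linear regression against known transmembrane segment counts, which is not reproduced here.

```python
def collapse_isoforms(records):
    """Keep one record per gene symbol: the one with the most annotated TM segments.

    The strict '>' comparison means ties, including genes with no annotated
    segments at all, keep the first record encountered.
    """
    best = {}
    for record in records:
        gene = record["gene"]
        if gene not in best or record["tm_annotated"] > best[gene]["tm_annotated"]:
            best[gene] = record
    return list(best.values())


def weighted_tm_count(predicted_counts, weights, intercept=0.0):
    """Linear combination of per-program TM segment counts."""
    return intercept + sum(w * c for w, c in zip(weights, predicted_counts))


# Hypothetical records; accessions and gene symbols are invented for the example.
records = [
    {"id": "NP_1", "gene": "G1", "tm_annotated": 0},
    {"id": "NP_2", "gene": "G1", "tm_annotated": 7},
    {"id": "NP_3", "gene": "G2", "tm_annotated": 0},
    {"id": "NP_4", "gene": "G2", "tm_annotated": 0},
]
kept = [r["id"] for r in collapse_isoforms(records)]

# Assumed weights for three programs' segment counts for one protein.
estimate = weighted_tm_count([7, 6, 8], [0.5, 0.25, 0.25])
```

In this example the isoform with seven annotated segments is retained for the first gene, and the first-listed record is retained for the gene without annotated segments.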

We estimated the number of membrane proteins using a criterion of one or more predicted or annotated transmembrane segments. This analysis yielded 4991 of the 19 977 sequences for unique genes passing this filter, corresponding to ∼25% of the genome, a value in reasonable alignment with previous estimates3. To determine which of these 4991 membrane proteins were previously unannotated, we used two approaches. First, we selected a list of all RefSeq sequences lacking a Gene Reference into Function (GeneRIF) annotation, giving a set of 5723 unique proteins. While this filter can identify sequences that have previously been annotated for function, the lack of a hierarchy or sub-classification of these annotations by strength of evidence means that some of the annotated sequences may actually be effectively uncharacterized. By manually examining many entries, we have indeed found that some GeneRIF entries describe presumed or inferred function without experimental support. While these may be useful for generating hypotheses, this ambiguity complicates our estimate of the number of uncharacterized membrane proteins. Thus, we also utilized the independent annotation in the Gene Ontology (GO) database. Following a similar methodology used to identify uncharacterized proteins in Arabidopsis thaliana80, we identified all proteins either lacking GO annotation (2983 proteins) or having no data (ND) evidence code for Molecular Function (MF) annotation at the root node (the default assignment in the GO for uncharacterized proteins) (597 proteins), giving a total of 3580. These intersect with the GeneRIF-based set by 2431. The union of the uncharacterized sets gives 6872 proteins, of which ∼25% (1533) are transmembrane. In contrast, only 216 of the intersecting set of 2431 are in our estimated transmembrane set, so we used the union of the estimated uncharacterized sets as a less conservative approach. A summary of all filters applied is given in Figure 3A. 
Many of the 4991 estimated membrane proteins in this analysis (3791, ∼76%) have GO annotations for MF, including 1479 unique terms (as a single protein may have more than one MF annotation). The distribution of all MF terms assigned to more than ten proteins (167 terms) is shown in Figure 3B, indicating that G-protein coupled receptors, olfactory receptors, nucleotide binding receptors, and calcium interacting proteins dominate this list. To independently evaluate the quality of our inference, we used the same approach to predict the number of transmembrane proteins in Saccharomyces cerevisiae. The localization of approximately 75% of the yeast genome has been experimentally assessed using Green Fluorescent Protein (GFP)-tagged fusion proteins to determine presence/absence at twenty-two organelle sites81, and we used this information to assess the accuracy of TM protein predictions. These analyses, shown in Figure 4, demonstrate that the predicted transmembrane proteins, which constitute ∼20% of the yeast genome, are experimentally localized in the Endoplasmic Reticulum (ER), secretion pathway (vacuole) and cell periphery at higher rates (18, 13, and 7-fold respectively) than predicted soluble proteins, whose localization records are biased for the cytoplasm and nucleus (5 fold and 7.5-fold enrichment, respectively). While the discrimination is not perfect, the population of predicted TM proteins in yeast obtained using the predictive methodology from the human analysis is enriched for experimentally annotated localization at membrane sites, supporting the use of these topological predictions as a proxy for TM localization.
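The counts reported for the uncharacterized sets follow from simple set arithmetic, which the short sketch below reproduces from the figures quoted in the text. The variable and function names are illustrative only; the numbers are taken directly from the preceding paragraphs.

```python
def union_size(size_a, size_b, size_intersection):
    """Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|."""
    return size_a + size_b - size_intersection


# Counts quoted in the text.
unique_genes = 19977
membrane_proteins = 4991
no_generif = 5723           # proteins lacking a GeneRIF annotation
no_go_annotation = 2983     # proteins lacking any GO annotation
go_no_data = 597            # proteins with the ND evidence code at the MF root
overlap = 2431              # proteins in both uncharacterized sets
membrane_with_mf = 3791     # membrane proteins carrying a GO MF annotation

go_uncharacterized = no_go_annotation + go_no_data
uncharacterized_union = union_size(no_generif, go_uncharacterized, overlap)

membrane_fraction = membrane_proteins / unique_genes
annotated_membrane_fraction = membrane_with_mf / membrane_proteins
```

The GO-based set and the union come out at the 3580 and 6872 proteins stated in the text, and the two fractions round to the quoted ∼25% and ∼76%.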

Figure 3

Estimating the number of uncharacterized human membrane proteins. (A) Human RefSeq protein sequences are collapsed to unique genes. Three topology prediction algorithms are averaged to generate a list of predicted membrane proteins, which is merged with membrane proteins derived from GenBank transmembrane helix annotations to yield a combined population of estimated membrane proteins. Previous functional annotations are evaluated using GeneRIF fields and Gene Ontology (GO) records, which are merged to yield a combined population of estimated uncharacterized proteins. The intersection of the membrane and uncharacterized populations represents uncharacterized membrane proteins. (B) Distribution of top GO molecular function (MF) categories for all membrane proteins.

Figure 4

Subcellular localization of predicted membrane and soluble proteins in S. cerevisiae. GFP fusions of individual yeast proteins have been expressed, localized and annotated81. (A) Analytical pipeline for prediction of yeast membrane proteins, beginning with 5909 RefSeq entries that are filtered to yield 4973 unique gene names. Topology algorithms applied to the unique genes yield 920 putative membrane proteins and 4053 putative soluble proteins. Fractions of both groups possess experimentally determined subcellular locations. (B) The distribution of experimentally determined localization(s) for the predicted membrane and soluble proteins in (A) among 22 cellular sites. Bar lengths are normalized to the total number of subcellular location sites available for predicted membrane and soluble proteins from (A).

To gain a global overview of the distribution of the annotated and unannotated membrane proteins identified in our analysis, we generated a vector description of each sequence to allow systematic comparison. The first twenty elements of this vector contain the count of each of the twenty naturally occurring amino acids in the protein sequence. The next twenty contain these counts restricted to the transmembrane regions, while the last twenty contain the counts for the cytosolic loops, giving a total length of sixty. For consistency, all counts were calculated using the transmembrane segment predictions of TMHMM2.0. The resulting vectors were then embedded in a low-dimensional map using t-Stochastic Neighbor Embedding (t-SNE), an algorithm that produces coordinate maps of high-dimensional data which represent the pairwise similarity between objects82. Compared to other nonlinear methods, this algorithm has been shown to better separate images of handwritten digits and facial photographs into distinct clusters in two-dimensional space82. To separate the resulting map into regions, we clustered the t-SNE coordinates by affinity propagation83, using the squared Euclidean distance between the t-SNE coordinates and the maximum pairwise distance as the input preference for each datapoint to be a cluster center. The resulting map for membrane proteins with previously annotated function is displayed in Figure 5A, with colors representing the clusters defined by affinity propagation. The distribution of the estimated set of uncharacterized membrane proteins is shown in Figure 5B. While the range of space covered by the characterized and uncharacterized sets is comparable, the density of the uncharacterized membrane proteins is concentrated in a region of space occupied by seven transmembrane segment receptors (Figures 5D and 6), reflecting orphan olfactory receptors. This is reflected quantitatively by the modest correlation coefficient of 0.20 between the grid-cell counts of the uncharacterized and characterized sequences in Figure 6.
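The sixty-element descriptor and the grid-cell comparison can be sketched as below. Several details are assumptions made for illustration rather than taken from the original analysis: the per-residue topology string (with 'M' for membrane, 'i' for inside/cytosolic and 'o' for outside), the function names, the square grid, and the use of the Pearson coefficient for the correlation. The t-SNE embedding and affinity propagation steps are not reproduced here.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty standard residues


def composition_vector(sequence, topology):
    """Return 60 counts: whole protein, TM segments, cytosolic loops.

    `topology` holds one label per residue ('M', 'i' or 'o'); this encoding
    is an assumption of the sketch.
    """
    if len(sequence) != len(topology):
        raise ValueError("sequence and topology must have equal length")
    whole = Counter(sequence)
    tm = Counter(aa for aa, t in zip(sequence, topology) if t == "M")
    cyto = Counter(aa for aa, t in zip(sequence, topology) if t == "i")
    return [c[aa] for c in (whole, tm, cyto) for aa in AMINO_ACIDS]


def grid_counts(points, n_bins, lo, hi):
    """Count 2D points per cell of an n_bins x n_bins grid over [lo, hi]."""
    counts = [0] * (n_bins * n_bins)
    width = (hi - lo) / n_bins
    for x, y in points:
        # Clamp so that points on or beyond the edges fall in the border cells.
        ix = max(0, min(int((x - lo) / width), n_bins - 1))
        iy = max(0, min(int((y - lo) / width), n_bins - 1))
        counts[iy * n_bins + ix] += 1
    return counts


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Toy example: a four-residue "protein" with two membrane residues.
vec = composition_vector("MKLV", "iMMo")
print(len(vec), sum(vec))  # 60 7
```

Given embedded coordinates for the characterized and uncharacterized sets, `pearson(grid_counts(a, n, lo, hi), grid_counts(b, n, lo, hi))` yields a single number summarizing how similarly the two sets occupy the map; the result depends on the grid resolution chosen.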

Figure 5

Landscape of human membrane protein diversity. Two-dimensional embedded coordinates are generated from vectors counting the number of each of the twenty amino acids in the whole protein sequence, transmembrane segments, and cytosolic segments for 4991 estimated human membrane proteins, using the t-stochastic neighbor embedding (t-SNE) algorithm. Colors represent groups identified by applying affinity propagation clustering to the embedded coordinates. (A) Embedded coordinates and cluster identity of the subset of human membrane proteins with previous functional annotation. (B) As in (A), for uncharacterized membrane proteins. (C) As in (A), for TMEM proteins. (D) As in (A), for sequences with seven transmembrane segments as denoted by RefSeq annotations or averaged predictions of three topology algorithms.

Figure 6

Density profile of the landscape of human membrane proteins. Embedded two-dimensional coordinates are generated from vectors containing counts of each of the twenty amino acids in the whole protein sequence, membrane segments, and cytosolic segments, using the t-stochastic neighbor embedding (t-SNE) algorithm for 4991 estimated human membrane proteins. (A) Count per coordinate grid cell, representing the number of sequences (colorbar), for the subset of human membrane proteins with previous annotation. (B) As in (A), for uncharacterized membrane proteins.

While this analysis can distinguish broad structural classes of membrane proteins, as shown by the spatial localization of seven transmembrane segment proteins (Figure 5D), it does not resolve all functional classes: voltage-gated sodium and calcium channels are intermingled with transporters in the lower left quadrant. While these proteins might be topologically similar, they are clearly functionally distinct. It thus remains unclear which class of sequence descriptors, if any, can best capture functional differences in this kind of analysis, and how to evaluate the accuracy of such features. From the perspective of future de-orphanization, it appears encouraging that the TMEM class of proteins is broadly distributed across the sequence space, suggesting that membrane proteins of many functional or topological classes may yet be elucidated.

Perspective

Review of the literature suggests a 'gap' between experimental and computational methods. In silico functional predictions are primarily verified through retrospective accuracy, whereas experimental studies with unbiased genomics approaches use bioinformatics as a way to pare down a candidate list, rather than to restrict and guide the initial search space. Thus, we anticipate there are unrealized opportunities for predictive algorithms to be used to identify novel membrane proteins and suggest possible phenotypes for functional validation. Furthermore, such studies will help computational researchers to better understand which models and descriptions of protein structure are most successful in predicting the results of these experimental validations, and thus iteratively improve the underlying bioinformatics algorithms.

We also speculate that predictions of three-dimensional structure have not been fully exploited for these kinds of studies. Indeed, the small fraction of transmembrane drug targets with crystal structures derived from DrugBank84, indicated in Table 4, suggests that this is a role in which bioinformatics may fill a large existing knowledge gap. As more membrane proteins are crystallized and homology-based three-dimensional coordinate prediction methods become more mature, it is intriguing to speculate that tertiary structure predictions might generate functional predictions using substrate docking, in a manner similar to virtual screening of small molecule ligands. Such approaches might complement existing predictors based on amino acid sequence alone.

Table 4 Structural characterization of human drug targets.

An additional challenge comes from the fact that deorphanization often involves identification of unknown functions. While many of the experimental studies discussed here have sought to generate candidate lists based on a specific phenotype, the challenge may often lie in assessing a completely unknown function. Indeed, even in cases where bioinformatics has perfectly identified a novel protein, such as Ci-VSP, which contains both a voltage-sensing domain and a phosphatase catalytic domain18, the substrate of this enzyme, and thus its biological role, was not immediately apparent from the initial characterization. Therefore, in the absence of functional knowledge (for example, when the stimulus presumed to trigger a channel's current is unknown), modified screening approaches will be needed to probe the function of unannotated membrane proteins. Ion channels as a class share the property of conducting ionic currents, a general feature that may be exploited. Recent innovations in high-throughput patch clamping may allow a matrix of different potential stimuli and buffer conditions to be tested, allowing rapid functional profiling. Such approaches, combined with high-throughput imaging to determine localization, have the potential to systematize the characterization of novel membrane proteins.

In summary, while the characterization of the transmembrane genome has witnessed many informatics and experimental successes, our analysis shows that almost one-third of the membrane proteins still lack functional annotation. Given the apparent lack of overlap between bioinformatics and unbiased screening approaches, we speculate there are opportunities for predictive algorithms to further refine screening studies, and for new profiling technologies to validate these predictive algorithms. This combination of smarter analytics and broader experimental methodology may thus help deorphanize the remaining membrane proteins in the genome, offering potential drug targets as well as greater understanding of these genes' biological roles.