User's guide
Nature Genetics  35, 57 - 62 (2003)
doi:10.1038/ng1198

Question 10
For a given protein, how can one determine whether it contains any functional domains of interest? What other proteins contain the same functional domains as this protein? How can one determine whether there is a similarity to other proteins, not only at the sequence level, but also at the structural level?

To demonstrate how to find functional domains within a protein, the human testis-determining factor TDF, also known as the sex-determining protein SRY, will be used as an example.

Although the search could be started from the Entrez search box on the NCBI home page, a better way to perform the initial search is from LocusLink (ref. 10). One advantage of LocusLink is its standardization of gene and protein names with appropriate cross-referencing, which makes it more likely that the correct protein will be found on the first attempt. From the NCBI home page at http://www.ncbi.nlm.nih.gov/, choose LocusLink from the pull-down menu in the upper left corner, type the gene name, 'TDF', into the query box, and click Go. Four loci are returned (Fig. 10.1). The first column gives the Locus ID, which is a stable identifier associated with that gene locus. Clicking on the Locus ID produces a LocusLink report view; more detailed information on the report view can be found in the LocusLink Help feature and in the literature (ref. 15). The second column, marked Org, gives a shorthand version of the organism name. Here, there is one entry from Drosophila (Dm), one from mouse (Mm), one from human (Hs) and one from rat (Rn). A series of alphabet blocks shown to the right of each entry provide jumping-off points to other database resources. The locus of interest here is the fourth entry in the list, because that is the one for the human form of TDF/SRY. To find additional information on the protein, click on the second P (in green) on that line. This takes the user to the protein entries corresponding to that particular LocusLink entry (Fig. 10.2). At this point, the user can click on any of the hyperlinks to look at the raw database information available on any of the proteins listed.
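As a small illustration of reading the Org column, the sketch below maps the shorthand codes named above to organism names and filters a hit list for the human entry. The hit records are hypothetical: only the Org column is modeled, and the order of the first three rows is an assumption (the text states only that the human entry is fourth).

```python
# Organism shorthand codes as described for the Org column.
ORG_CODES = {
    "Dm": "Drosophila",
    "Mm": "mouse",
    "Hs": "human",
    "Rn": "rat",
}


def pick_by_organism(hits, org_code):
    """Return the hits whose 'org' field equals org_code."""
    if org_code not in ORG_CODES:
        raise ValueError(f"unknown organism code: {org_code!r}")
    return [hit for hit in hits if hit["org"] == org_code]


# Hypothetical hit list: four loci, human last (order of the others assumed).
hits = [{"org": "Dm"}, {"org": "Mm"}, {"org": "Rn"}, {"org": "Hs"}]

human_hits = pick_by_organism(hits, "Hs")
print(ORG_CODES["Hs"], len(human_hits))
```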

Figure 10.1

Figure 10.2

Consider the last entry in the list, an NCBI Reference Protein sequence with accession number NP_003131. To the right of the accession number is a series of hyperlinks. Clicking on the link labeled BLink will take the user to the BLink page for the protein of interest (Fig. 10.3). BLink stands for 'BLAST Link' and provides the graphical results of pre-computed BLAST searches that have been performed not just for this protein sequence, but for every protein sequence within the Entrez Proteins data domain. The pre-computed BLAST results for TDF/SRY are shown in the section beginning with the label '204 aa'. Across the top are a number of buttons that allow the user to ask a series of questions regarding the protein of interest. As the object of this question is to find the protein domains present within the TDF/SRY protein, the user can click on CDD-Search (Conserved Domain Database Search; ref. 18). Doing this will produce a graphical overview of any domains present within the protein, as well as a sequence alignment of those domains with the query sequence (Fig. 10.4). In this case, one functional domain is found: an HMG box, which is a DNA-binding domain found in many nuclear proteins. The domain was found in all of the databases comprising CDD (Pfam, SMART and COG), as can be seen by looking at the accession numbers in the hit list.
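To make the last point concrete, the following sketch infers the source database of a CDD hit from its accession prefix (pfam..., smart..., COG...). It is an illustration only: the function names are hypothetical, and the accession strings in the usage show the format and are not transcribed from Fig. 10.4, so they should be checked against the actual hit list.

```python
import re

# Accession patterns for the three source databases named in the text.
SOURCE_PATTERNS = {
    "Pfam": re.compile(r"^pfam\d+$", re.IGNORECASE),
    "SMART": re.compile(r"^smart\d+$", re.IGNORECASE),
    "COG": re.compile(r"^COG\d+$", re.IGNORECASE),
}


def source_database(accession):
    """Return the source database implied by the accession prefix, or None."""
    cleaned = accession.strip()
    for name, pattern in SOURCE_PATTERNS.items():
        if pattern.match(cleaned):
            return name
    return None


def sources_reporting(accessions):
    """Return the set of source databases represented in a hit list."""
    return {source_database(acc) for acc in accessions} - {None}


# Illustrative accessions (format examples; verify against the real hit list).
hit_accessions = ["pfam00505", "smart00398", "COG5648"]
print(sorted(sources_reporting(hit_accessions)))
```

If all three names appear in the result, the domain was reported by every component database, which is the check the text performs by eye.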

Figure 10.3

Figure 10.4

To determine which other proteins contain this same HMG-box domain, click on the box labeled Show, right under the graphical view near the top of the page. This will invoke the domain architecture retrieval tool (DART). DART shows functional domains within a protein and, more importantly, other proteins with a similar domain architecture (Fig. 10.5). The query (the HMG box) is shown at the top of the page in red. Every other protein in the NCBI's non-redundant sequence database having that same domain is then shown below the query, with the HMG box again colored red. Other domains within the found proteins are also shown, in various colors and shapes, with a key appearing at the bottom of the web page. Clicking on any of the links to the left provides additional information about these new proteins.
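The idea behind this kind of retrieval can be sketched with a toy lookup: each protein is represented by an ordered list of its domains, and the query returns every protein that contains the domain of interest. This is not DART's actual algorithm or data; the protein names and the extra domain ("domain X") are hypothetical placeholders.

```python
def proteins_with_domain(architectures, domain):
    """Return names of proteins whose domain list includes `domain`, in input order."""
    return [name for name, domains in architectures.items() if domain in domains]


# Toy domain architectures (hypothetical entries, listed N- to C-terminal).
architectures = {
    "query (TDF/SRY)": ["HMG box"],
    "hypothetical protein A": ["HMG box", "HMG box"],
    "hypothetical protein B": ["domain X", "HMG box"],
    "hypothetical protein C": ["domain X"],
}

for name in proteins_with_domain(architectures, "HMG box"):
    # Print each matching protein with its full architecture, as DART displays them.
    print(name, "->", " + ".join(architectures[name]))
```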

Figure 10.5

Although a protein domain has now been identified within the query protein, no in-depth information has yet been provided about the function of that domain. Whereas a circuitous path could be followed from the DART page to find this information, an easier method is to use another web-based resource, called InterPro. InterPro is an integrated resource for information about protein families, domains and functional sites, bringing together information from a number of protein domain-based resources, such as PROSITE, PRINTS, Pfam and ProDom (ref. 19). The InterPro Simple Search engine can be accessed from the InterPro home page, at http://www.ebi.ac.uk/interpro. Clicking on Text Search, on the left, brings the user to the search page; for this search, type "HMG Box" (with quotes) into the text box and hit Search. Two hits are returned (Fig. 10.6). For purposes of this example, follow the link from the first hit, for high mobility group proteins HMG1 and HMG2 (IPR000135). The resulting InterPro summary page (Fig. 10.7) provides information on the function, intracellular location and, most importantly, metabolic role of this particular protein within the cell, in an executive summary format. References are provided at the bottom of the web page for users who want more in-depth information about the domain. Users can also retrieve all of the full-length sequences containing the domain; the reader is referred to the InterPro documentation for more details.

Figure 10.6

Figure 10.7

The final part of this question asks whether similarity to the query protein can be found at the structural as well as the sequence level. Answering this question requires a new search against NCBI Structures. From the NCBI home page, change the pull-down menu in the query box at the top of the page to Structure, type 'SRY' in the box and hit Go. Seven three-dimensional structures are returned, one of which is 1HRY, the structure of the human SRY–DNA complex solved by NMR. Clicking on the 1HRY hyperlink takes the user to the Structure Summary page for 1HRY. The summary links to more detailed information about chain A, the protein component of the structure, chain B, the nucleotide component of the structure, and the conserved domain (CD) in the protein, obtained through a CDD search. Click on the chain A graphic to get a list of proteins whose known structures have, using a method called VAST, been deemed similar to that of the original SRY protein; more information on the method and on interpreting the data within the tables can be found elsewhere (ref. 15). Here, the SRY protein is shown to have some structural similarity to a fasciculin 2–mouse acetylcholinesterase complex (1MAH), a protein named V-1 Nef (1AVZ), a heat-shock protein of 70 kD (1QQN), and a myosin motor-domain complex (1BR1) (Fig. 10.8). The VAST program quite often reveals similarities between proteins that are not evident from simple BLAST or FASTA searches, so readers are encouraged to employ this and similar tools when trying to answer questions related to protein families.
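The Structure search described above is performed through the web form. For readers who prefer a scripted query, a rough equivalent can be sketched with NCBI's E-utilities ESearch service, which is not mentioned in the article and is introduced here as an assumption. The sketch only builds the request URL and does not contact the server; the helper name `esearch_url` is hypothetical, and the number of hits returned today may differ from the seven reported in the text.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service (assumed, not from the article).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for an Entrez database and a query term."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Same query as the web form: database 'structure', term 'SRY'.
url = esearch_url("structure", "SRY")
print(url)
```

The returned identifiers would still need to be inspected by hand, or through the Structure Summary pages, to locate 1HRY and its VAST neighbors.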

Figure 10.8

REFERENCES
1. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
2. Collins, F.S. & McKusick, V.A. Implications of the Human Genome Project for medical science. J. Am. Med. Assoc. 285, 540–544 (2001).
3. Watson, J.D. & Crick, F.H.C. Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature 171, 737–738 (1953).
4. Green, E.D. Strategies for the systematic sequencing of complex genomes. Nature Rev. Genet. 2, 573–583 (2001).
5. Ouellette, B.F.F. & Boguski, M.S. Database divisions and homology search files: a guide for the perplexed. Genome Res. 7, 952–955 (1997).
6. Bairoch, A. & Apweiler, R. The SWISS-PROT Protein Sequence Database and its supplement TREMBL in 2000. Nucleic Acids Res. 28, 45–48 (2000).
7. Hubbard, T. et al. The Ensembl Genome Database Project. Nucleic Acids Res. 30, 38–41 (2002).
8. Kent, W.J. BLAT—the BLAST-like Alignment Tool. Genome Res. 12, 656–664 (2002).
9. Stein, L. Genome annotation: from sequence to biology. Nature Rev. Genet. 2, 493–503 (2001).
10. Pruitt, K.D. & Maglott, D.R. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29, 137–140 (2001).
11. Burge, C.B. & Karlin, S. Finding the genes in genomic DNA. Curr. Opin. Struct. Biol. 8, 346–354 (1998).
12. Schuler, G.D. Electronic PCR: bridging the gap between genome mapping and genome sequencing. Trends Biotechnol. 16, 456–459 (1998).
13. Sherry, S.T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).
14. Hamosh, A. et al. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 30, 52–55 (2002).
15. Baxevanis, A.D. & Ouellette, B.F.F. (eds.) Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins (John Wiley & Sons, New York, 2001).
16. Solovyev, V.V., Salamov, A.A. & Lawrence, C.B. Identification of human gene structure using linear discriminant functions and dynamic programming. Proc. Int. Conf. Intell. Syst. Mol. Biol. 3, 367–375 (1995).
17. Yeh, R.F., Lim, L.P. & Burge, C.B. Computational inference of homologous gene structures in the human genome. Genome Res. 11, 803–816 (2001).
18. Marchler-Bauer, A. et al. CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res. 30, 281–283 (2002).
19. Apweiler, R. et al. InterPro—an integrated documentation resource for protein families, domains and functional sites. Bioinformatics 16, 1145–1150 (2000).
20. Rebhan, M., Chalifa-Caspi, V., Prilusky, J. & Lancet, D. GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 14, 656–664 (1998).
21. Blake, J.A., Richardson, J.E., Bult, C.J., Kadin, J.A. & Eppig, J.T. The Mouse Genome Database (MGD): the model organism database for the laboratory mouse. Nucleic Acids Res. 30, 113–115 (2002).
22. Hudson, T.J. et al. A radiation hybrid map of mouse genes. Nature Genet. 29, 201–205 (2001).
23. Bateman, A. et al. The Pfam protein families database. Nucleic Acids Res. 30, 276–280 (2002).
24. Letunic, I. et al. Recent improvements to the SMART domain–based sequence annotation resource. Nucleic Acids Res. 30, 242–244 (2002).
25. Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
26. Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge Univ. Press, Cambridge, 1998).
27. Peri, S., Ibarrola, N., Blagoev, B., Mann, M. & Pandey, A. Common pitfalls in bioinformatics-based analyses: look before you leap. Trends Genet. 17, 541–545 (2001) [erratum Trends Genet. 18, 218 (2002)].
28. Ponting, C. Issues in predicting protein function from sequence. Brief. Bioinform. 2, 19–29 (2001).
29. Aparicio, S.A.J.R. How to count ... human genes. Nature Genet. 25, 129–130 (2000).
30. Beadle, G.W. & Tatum, E.L. Genetic control of biochemical reactions in Neurospora. Proc. Natl Acad. Sci. USA 27, 499–506 (1941).
31. Jeffery, C.J., Bahnson, B.J., Chien, W., Ringe, D. & Petsko, G.A. Crystal structure of rabbit phosphoglucose isomerase, a glycolytic enzyme that moonlights as neuroleukin, autocrine motility factor, and differentiation mediator. Biochemistry 39, 955–964 (2000).
32. Wistow, G. & Piatigorsky, J. Recruitment of enzymes as lens structural proteins. Science 236, 1554–1556 (1987).
33. Jeffery, C.J. Moonlighting proteins. Trends Biochem. Sci. 24, 8–11 (1999).
34. Chothia, C. Proteins. One thousand families for the molecular biologist. Nature 357, 543–544 (1992).
35. Hegyi, H. & Gerstein, M. The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J. Mol. Biol. 288, 147–164 (1999).
36. Jansen, R. & Gerstein, M. Analysis of the yeast transcriptome with structural and functional categories: characterizing highly expressed proteins. Nucleic Acids Res. 28, 1481–1488 (2000).
37. Brenner, S.E. Errors in genome annotation. Trends Genet. 15, 132–133 (1999).
38. Smith, R.F. Perspectives: sequence data base searching in the era of large-scale genomic sequencing. Genome Res. 6, 653–660 (1996).
Nature Genetics
ISSN: 1061-4036
EISSN: 1546-1718
© 2003 Nature Publishing Group