User's Guide
Published: September 2003

Question 4 A user wishes to find all the single nucleotide polymorphisms that lie between two sequence-tagged sites. Do any of these single nucleotide polymorphisms fall within the coding region of a gene? Where can any additional information about the function of these genes be found?

Nature Genetics volume 35, pages 29–32 (2003)

The starting point for this search is the web site for the Database of Single Nucleotide Polymorphisms (dbSNP) at the NCBI¹³, located at http://www.ncbi.nlm.nih.gov/SNP. A series of links on the page allows the user to search using either information about the database submission itself or information about genes and gene loci.

For this particular search, assume that the region of interest is known and defined by two STS markers, RH79657 and RH45644. Begin by scrolling to the section labeled Between Markers at the bottom of the page. Enter the STS marker names 'RH79657' and 'RH45644' into the two text boxes, and click on Submit STS Markers. This will produce a display showing SNPs 1–25 out of the total of 56 within the region of interest.
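The logic behind a Between Markers query and its paged display can be sketched in a few lines. This is only an illustration under stated assumptions, not dbSNP's implementation: the `Snp` type, the function names, the identifiers and every coordinate are invented for the example, and the page size of 25 is inferred from the "1–25 of 56" display described above.

```python
from typing import List, NamedTuple, Tuple


class Snp(NamedTuple):
    snp_id: str    # made-up identifier, not a real dbSNP cluster ID
    position: int  # hypothetical base-pair coordinate on one chromosome


def snps_between(snps: List[Snp], marker_a: int, marker_b: int) -> List[Snp]:
    """Return the SNPs lying between two marker positions, sorted by position."""
    lo, hi = sorted((marker_a, marker_b))  # accept the markers in either order
    inside = [s for s in snps if lo <= s.position <= hi]
    return sorted(inside, key=lambda s: s.position)


def page(items: List[Snp], page_number: int, page_size: int = 25) -> Tuple[List[Snp], str]:
    """Return one page of results and a 'first-last of total' label."""
    start = (page_number - 1) * page_size
    chunk = items[start:start + page_size]
    if not chunk:
        return chunk, f"0 of {len(items)}"
    return chunk, f"{start + 1}-{start + len(chunk)} of {len(items)}"


# Made-up data: 60 SNPs spaced 10 bases apart, 56 of them inside the interval.
demo_snps = [Snp(f"demo{i}", 1000 + 10 * i) for i in range(60)]
hits = snps_between(demo_snps, 1000, 1550)

first_page, label = page(hits, 1)
print(label)             # 1-25 of 56
print(page(hits, 3)[1])  # 51-56 of 56
```

In the real database the marker positions come from the mapped STS records; here they are passed in directly as integers to keep the sketch self-contained.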

The resulting page (Fig. 4.1) illustrates most of the possible types of result one would find on a typical dbSNP results page. In the table, starting from the left, the first column gives the individual dbSNP cluster IDs (all starting with 'rs'). The second column, labeled Map, shows whether a particular SNP has been mapped to a unique position in the genome (illustrated by a single green arrow, as in the first row of the example) or to multiple positions (not shown here).

The next set of columns, labeled Gene, indicates whether these SNPs are associated with particular features, such as genes, mRNAs or coding regions. The three columns (L, T and C) are either lit up or appear gray in every row. Taking each in order:

If the L (for locus) appears in blue, part or all of the marker position lies either within 2 kilobases (kb) of the 5′ end of a gene feature or within 500 bases of the 3′ end of a gene feature.

If the T (for transcript) appears in green, part or all of the marker position overlaps with a known mRNA. This does not mean, however, that the SNP marker necessarily falls within a coding region.

If the C (for coding) appears in orange, part or all of the marker position overlaps with a coding region.
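The three rules above can be expressed as a small classifier. The sketch below is one reading of them, not dbSNP's own code: it assumes that a SNP is reduced to a single position, that the L flag covers the gene's span extended by 2 kb beyond its 5′ end and 500 bases beyond its 3′ end (so positions inside the gene also count), and that strand determines which end is 5′. The `GeneModel` type and all coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end), start <= end

UPSTREAM_WINDOW = 2000   # 2 kb beyond the 5' end, as stated in the text
DOWNSTREAM_WINDOW = 500  # 500 bases beyond the 3' end, as stated in the text


@dataclass(frozen=True)
class GeneModel:
    strand: str                 # "+" or "-"
    gene: Interval              # full extent of the gene feature
    mrna_exons: List[Interval]  # exons of a known mRNA
    cds: List[Interval]         # coding segments only


def overlaps(position: int, intervals: List[Interval]) -> bool:
    """True if the position falls inside any of the intervals."""
    return any(start <= position <= end for start, end in intervals)


def gene_flags(position: int, model: GeneModel) -> Dict[str, bool]:
    """Return which of the L, T and C flags would be lit for one position."""
    start, end = model.gene
    if model.strand == "+":
        lo, hi = start - UPSTREAM_WINDOW, end + DOWNSTREAM_WINDOW
    elif model.strand == "-":
        # On the minus strand the 5' end is the higher coordinate.
        lo, hi = start - DOWNSTREAM_WINDOW, end + UPSTREAM_WINDOW
    else:
        raise ValueError("strand must be '+' or '-'")
    return {
        "L": lo <= position <= hi,
        "T": overlaps(position, model.mrna_exons),
        "C": overlaps(position, model.cds),
    }


# Hypothetical three-exon gene on the plus strand.
exons = [(10000, 10500), (15000, 15400), (19500, 20000)]
coding = [(10200, 10500), (15000, 15400), (19500, 19800)]
plus_gene = GeneModel("+", (10000, 20000), exons, coding)

print(gene_flags(8500, plus_gene))   # upstream of the gene: L only
print(gene_flags(10100, plus_gene))  # 5' untranslated region: L and T
print(gene_flags(15200, plus_gene))  # coding exon: L, T and C
```

The second call shows the caveat made in the text: a position can overlap an mRNA (T) without falling in a coding region (C).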

The next column, labeled Het, indicates the average heterozygosity observed for this marker, on a scale of 0–100%. A reading of zero means that no information is available for that particular marker, whereas the pink bars show a 95% confidence interval for the marker. The Validation column indicates whether the marker has been validated (shown by a star) or is unvalidated (shown by light blue boxes). Validated markers have been verified by independent re-analysis of the sequence. All of the unvalidated markers shown in Fig. 4.1 are denoted by three blue boxes, which, according to the scale at the top of the column, means that there is a >95% success rate in validation. This success rate, defined as 1 – false-positive rate, indicates the probability that the marker is real.
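The two quantities in this paragraph are simple to compute by hand. The sketch below shows the success rate as defined in the text, together with the textbook expected heterozygosity (one minus the sum of squared allele frequencies). The latter is an assumption for illustration: the text does not say how dbSNP derives its average heterozygosity, and the allele frequencies used here are invented.

```python
from typing import Sequence


def success_rate(false_positive_rate: float) -> float:
    """Success rate as defined in the text: 1 - false-positive rate."""
    if not 0.0 <= false_positive_rate <= 1.0:
        raise ValueError("false_positive_rate must be between 0 and 1")
    return 1.0 - false_positive_rate


def expected_heterozygosity(allele_frequencies: Sequence[float]) -> float:
    """Textbook expected heterozygosity: 1 - sum of squared allele frequencies.

    Assumption: dbSNP's displayed average heterozygosity may be estimated
    differently; this is only the standard population-genetics formula.
    """
    if abs(sum(allele_frequencies) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_frequencies)


# A false-positive rate below 5% corresponds to the three-box (>95%) category.
print(success_rate(0.03) > 0.95)             # True

# Invented frequencies for a two-allele SNP.
print(expected_heterozygosity([0.5, 0.5]))   # 0.5
```

For a SNP with two alleles the expected heterozygosity cannot exceed 50%, which is reached when both alleles are equally common.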

For some entries, the penultimate column contains the symbol TT (not shown in this example) indicating that individual genotypes are available for this marker. Finally, the Linkout Avail column indicates which markers are linked to other databases; a P in this column indicates that the variation has been mapped to a known protein structure. For a complete description of all the features within this display, click on any part of the header above the columns.

Returning to the original question, one of the SNPs displayed on this page does indeed fall within a coding region, as indicated by an orange C. To obtain more information on any particular SNP, simply click on the hyperlinked SNP Cluster ID. Clicking on rs1801973, for example, produces a new page with all available information on that SNP (Fig. 4.2). Under the header marked Submitter records for this RefSNP Cluster is a list of the individual SNPs (in this case, only one SNP) that have been clustered together to form this single reference SNP. The sequence of the SNP is shown under the next header. Under the header marked NCBI Resource Links are the GenBank and NCBI RefSeq entries that are associated with this SNP. Scrolling further down on the SNP page (Fig. 4.3), the gene whose coding region contains this SNP is indicated in the LocusLink Analysis section (ADAM10, a disintegrin and metalloproteinase domain 10). The SNP allele is G/T, a non-synonymous change leading to replacement of the Gly residue in the reference sequence by an unspecified residue. Links are also provided to the NCBI Map Viewer, Ensembl map and UCSC genome assembly in the section labeled Integrated Maps. The sections labeled Variation Summary and Validation Summary (not shown) give the raw data on this particular SNP.
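Whether a coding SNP is synonymous or non-synonymous depends on the codon it falls in and on the position within that codon. The sketch below classifies a single-base substitution using the standard genetic code. The codon GGT is a hypothetical example of a glycine codon: the text gives neither the actual codon for rs1801973 nor the replacement residue, so nothing here should be read as the real annotation of that SNP.

```python
from itertools import product
from typing import Tuple

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon: str, offset: int, new_base: str) -> Tuple[str, str, str]:
    """Classify a single-base change within one codon.

    Returns (reference amino acid, variant amino acid, label), using
    one-letter amino acid codes and '*' for a stop codon.
    """
    codon = codon.upper()
    if len(codon) != 3 or not 0 <= offset <= 2:
        raise ValueError("need a three-base codon and an offset of 0, 1 or 2")
    variant = codon[:offset] + new_base.upper() + codon[offset + 1:]
    ref_aa, var_aa = CODON_TABLE[codon], CODON_TABLE[variant]
    label = "synonymous" if ref_aa == var_aa else "non-synonymous"
    return ref_aa, var_aa, label


# Hypothetical glycine codon GGT with a G-to-T change at each G position.
print(classify_substitution("GGT", 0, "T"))  # ('G', 'C', 'non-synonymous')
print(classify_substitution("GGT", 1, "T"))  # ('G', 'V', 'non-synonymous')

# A third-position change in the same codon leaves glycine unchanged.
print(classify_substitution("GGT", 2, "C"))  # ('G', 'G', 'synonymous')
```

Because all four GGN codons encode glycine, a G/T change can only alter the residue when it falls at the first or second codon position, which is consistent with the non-synonymous annotation described above.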

Answering the final part of this question requires jumping from dbSNP to LocusLink¹⁰. To do so, click on the ADAM10 link in the line marked LocusLink at the top of the page (Fig. 4.3). This brings the user to the LocusLink page for ADAM10, which provides numerous jumping-off points to the NCBI and affiliated resources through the boxed links at the top of the page. More information on these resources can be found by following the LocusLink FAQ link in the left-hand column of the page. By simply examining the LocusLink page itself, one sees that the ADAM10 protein belongs to a family of cell surface proteins that have potential adhesion and protease domains, and that this particular member of the family proteolytically processes pro-TNF alpha.

One often-overlooked source of information on genes and gene products is OMIM¹⁴. This is an electronic version of the catalog of human genes and genetic disorders developed by Victor McKusick at The Johns Hopkins University. OMIM provides the user with concise textual information from the published literature on most human disorders with a genetic basis, and links back to the primary literature as appropriate. An OMIM entry includes the gene symbol, alternate names for the disease, a description of the disease (including clinical, biochemical and cytogenetic features), details of the mode of inheritance (including mapping information) and a clinical synopsis. These entries are manually curated, ensuring that the 'executive summary' is up to date and accurate. Although OMIM can be searched directly, many LocusLink entries also link to the OMIM record for the gene. The OMIM entry page for the ADAM10 protein is shown in Fig. 4.4. The page is fully hyperlinked to PubMed, GenBank and other related databases.

References

International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
Collins, F.S. & McKusick, V.A. Implications of the Human Genome Project for medical science. J. Am. Med. Assoc. 285, 540–544 (2001).
Watson, J.D. & Crick, F.H.C. Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature 171, 737–738 (1953).
Green, E.D. Strategies for the systematic sequencing of complex genomes. Nature Rev. Genet. 2, 573–583 (2001).
Ouellette, B.F.F. & Boguski, M.S. Database divisions and homology search files: a guide for the perplexed. Genome Res. 7, 952–955 (1997).
Bairoch, A. & Apweiler, R. The SWISS-PROT Protein Sequence Database and its supplement TREMBL in 2000. Nucleic Acids Res. 28, 45–48 (2000).
Hubbard, T. et al. The Ensembl Genome Database Project. Nucleic Acids Res. 30, 38–41 (2002).
Kent, W.J. BLAT—the BLAST-like Alignment Tool. Genome Res. 12, 656–664 (2002).
Stein, L. Genome annotation: from sequence to biology. Nature Rev. Genet. 2, 493–503 (2001).
Pruitt, K.D. & Maglott, D.R. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29, 137–140 (2001).
Burge, C.B. & Karlin, S. Finding the genes in genomic DNA. Curr. Opin. Struct. Biol. 8, 346–354 (1998).
Schuler, G.D. Electronic PCR: bridging the gap between genome mapping and genome sequencing. Trends Biotechnol. 16, 456–459 (1998).
Sherry, S.T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).
Hamosh, A. et al. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 30, 52–55 (2002).
Baxevanis, A.D. & Ouellette, B.F.F. (eds.) Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins (John Wiley & Sons, New York, 2001).
Solovyev, V.V., Salamov, A.A. & Lawrence, C.B. Identification of human gene structure using linear discriminant functions and dynamic programming. Proc. Int. Conf. Intell. Syst. Mol. Biol. 3, 367–375 (1995).
Yeh, R.F., Lim, L.P. & Burge, C.B. Computational inference of homologous gene structures in the human genome. Genome Res. 11, 803–816 (2001).
Marchler-Bauer, A. et al. CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res. 30, 281–283 (2002).
Apweiler, R. et al. InterPro—an integrated documentation resource for protein families, domains and functional sites. Bioinformatics 16, 1145–1150 (2000).
Rebhan, M., Chalifa-Caspi, V., Prilusky, J. & Lancet, D. GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 14, 656–664 (1998).
Blake, J.A., Richardson, J.E., Bult, C.J., Kadin, J.A. & Eppig, J.T. The Mouse Genome Database (MGD): the model organism database for the laboratory mouse. Nucleic Acids Res. 30, 113–115 (2002).
Hudson, T.J. et al. A radiation hybrid map of mouse genes. Nature Genet. 29, 201–205 (2001).
Bateman, A. et al. The Pfam protein families database. Nucleic Acids Res. 30, 276–280 (2002).
Letunic, I. et al. Recent improvements to the SMART domain–based sequence annotation resource. Nucleic Acids Res. 30, 242–244 (2002).
Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge Univ. Press, Cambridge, 1998).
Peri, S., Ibarrola, N., Blagoev, B., Mann, M. & Pandey, A. Common pitfalls in bioinformatics-based analyses: look before you leap. Trends Genet. 17, 541–545 (2001) [erratum Trends Genet. 18, 218 (2002)].
Ponting, C. Issues in predicting protein function from sequence. Brief. Bioinform. 2, 19–29 (2001).
Aparicio, S.A.J.R. How to count ... human genes. Nature Genet. 25, 129–130 (2000).
Beadle, G.W. & Tatum, E.L. Genetic control of biochemical reactions in Neurospora. Proc. Natl Acad. Sci. USA 27, 499–506 (1941).
Jeffery, C.J., Bahnson, B.J., Chien, W., Ringe, D. & Petsko, G.A. Crystal structure of rabbit phosphoglucose isomerase, a glycolytic enzyme that moonlights as neuroleukin, autocrine motility factor, and differentiation mediator. Biochemistry 39, 955–964 (2000).
Wistow, G. & Piatigorsky, J. Recruitment of enzymes as lens structural proteins. Science 236, 1554–1556 (1987).
Jeffery, C.J. Moonlighting proteins. Trends Biochem. Sci. 24, 8–11 (1999).
Chothia, C. Proteins. One thousand families for the molecular biologist. Nature 357, 543–544 (1992).
Hegyi, H. & Gerstein, M. The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J. Mol. Biol. 288, 147–164 (1999).
Jansen, R. & Gerstein, M. Analysis of the yeast transcriptome with structural and functional categories: characterizing highly expressed proteins. Nucleic Acids Res. 28, 1481–1488 (2000).
Brenner, S.E. Errors in genome annotation. Trends Genet. 15, 132–133 (1999).
Smith, R.F. Perspectives: sequence data base searching in the era of large-scale genomic sequencing. Genome Res. 6, 653–660 (1996).

Question 4 A user wishes to find all the single nucleotide polymorphisms that lie between two sequence-tagged sites. Do any of these single nucleotide polymorphisms fall within the coding region of a gene? Where can any additional information about the function of these genes be found? Nat Genet 35 (Suppl 1), 29–32 (2003). https://doi.org/10.1038/ng1192

Issue Date: September 2003
DOI: https://doi.org/10.1038/ng1192
