User's Guide
Published: September 2003

Question 5 Given a fragment of mRNA sequence, how would one find where that piece of DNA mapped in the human genome? Once its position has been determined, how would one find alternatively-spliced transcripts?

Nature Genetics volume 35, pages 33–39 (2003)

For the purpose of this example, the fragment of mRNA of interest is contained within GenBank accession number BG334944. First, retrieve the nucleotide sequence of this EST using the NCBI's Entrez interface, at http://www.ncbi.nlm.nih.gov/Entrez/. Type 'BG334944' into the text box at the top of the page, change the pull-down menu to Nucleotide and press Go. The resulting page shows one entry, corresponding to accession number BG334944. To retrieve this sequence in FASTA format (a common format for bioinformatics programs), change the pull-down menu on this page to FASTA and then press Send to Text (Fig. 5.1). A new web page containing only the sequence, in FASTA format, is produced (Fig. 5.2); copy the resulting sequence.
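The same retrieval can also be scripted. The following is a minimal sketch, not part of the original walkthrough: it assumes NCBI's E-utilities `efetch` endpoint as a programmatic alternative to the web form described above, and the helper names `efetch_fasta_url` and `parse_fasta` are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities efetch endpoint, used here in place of the
# Entrez web form described in the text.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession):
    """Build a URL that requests one nucleotide record as plain-text FASTA."""
    params = {
        "db": "nucleotide",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def parse_fasta(text):
    """Map each FASTA header (without '>') to its sequence joined across lines."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}
```

Passing the result of `efetch_fasta_url("BG334944")` to `urllib.request.urlopen` and decoding the response would give text suitable for `parse_fasta`; the network call is left out of the sketch so that it runs offline.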

To determine where this sequence maps within the genome, use UCSC's BLAT tool⁸. Begin this search by pointing your web browser to the UCSC Genome Browser home page, at http://genome.ucsc.edu. From this page, select Human from the Organism pull-down menu in the blue bar on the side of the page, and then click Blat. Paste the FASTA-formatted sequence obtained from Entrez (above) into the large text box on the BLAT search page (Fig. 5.3), change the Freeze pull-down menu to Nov. 2002, change the Query type pull-down menu to DNA and then press Submit. The server will (very quickly) return the search results; in this case, a single match of length 636 is found on the forward strand of chromosome 9 (Fig. 5.4).
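BLAT can also report its hits in the tab-separated PSL format, which is convenient when the match needs to be processed by a script rather than read from the results page. The sketch below assumes the standard 21-column PSL layout with any header lines already removed; the sample line is made up for illustration and is not the actual hit for BG334944, and the names `PslHit`, `parse_psl_line` and `target_gaps` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class PslHit(NamedTuple):
    query: str
    strand: str
    target: str
    target_start: int  # 0-based start on the chromosome
    target_end: int    # end-exclusive
    blocks: List[Tuple[int, int]]  # (start, end) of each aligned block on the target


def parse_psl_line(line):
    """Parse one headerless PSL line into a PslHit."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 21:
        raise ValueError("expected 21 PSL columns, got %d" % len(fields))
    # blockSizes and tStarts are comma-separated lists with a trailing comma.
    sizes = [int(x) for x in fields[18].split(",") if x]
    t_starts = [int(x) for x in fields[20].split(",") if x]
    blocks = [(start, start + size) for start, size in zip(t_starts, sizes)]
    return PslHit(fields[9], fields[8], fields[13],
                  int(fields[15]), int(fields[16]), blocks)


def target_gaps(blocks):
    """Return the genomic gaps between consecutive aligned blocks."""
    return [(blocks[i][1], blocks[i + 1][0]) for i in range(len(blocks) - 1)]


# Made-up example: a 30-base query aligned to chr9 in two blocks.
SAMPLE = "\t".join([
    "30", "0", "0", "0", "0", "0", "1", "70", "+", "query1", "30", "0", "30",
    "chr9", "1000", "100", "200", "2", "10,20,", "0,10,", "100,180,",
])

hit = parse_psl_line(SAMPLE)
print(hit.target, hit.strand, hit.blocks, target_gaps(hit.blocks))
```

For an EST aligned to genomic sequence, the gaps between blocks are the candidate introns, which is the same information the details page conveys with its lower-case gap regions.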

To obtain more details on this hit, click the details link, to the left of the entry. A long web page is returned, with three major sections: the mRNA sequence (Fig. 5.5, top), the genomic sequence (Fig. 5.5, middle) and an alignment of the mRNA sequence against the genomic sequence (see Fig. 5.9 for an example). In the alignment in Fig. 5.5, matching bases in the cDNA and genomic sequences are colored in darker blue and capitalized. Gaps are indicated in lower-case black type. Light blue upper-case bases mark the boundaries of aligned regions on either side of a gap and are often splice sites.

Returning to the BLAT summary page for this search (Fig. 5.4), click on browser. This will produce a graphic representation of where this particular mRNA sequence aligns to the genome (Fig. 5.6). The track labeled Chromosome Band indicates that the mRNA maps to 9q33.3. The query sequence itself is represented on the line labeled Your Sequence from BLAT Search (arrow, Fig. 5.6). The sequence is shown as being discontinuous: regions of similarity are shown as vertical lines, gaps are shown as thin horizontal lines, and the direction of the alignment is indicated by the arrowheads. The aligned regions of the EST query correspond to the exons of a known gene, shown on the line immediately below (Known Genes, here RAB9P40). Typing the EST name, BG334944, directly into a UCSC search box would have generated a similar result to that shown in Fig. 5.6, but part of the purpose of this example is to illustrate the use of BLAT.

Approximately halfway down the graphic is a track labeled Human ESTs That Have Been Spliced. This track is at first shown in dense mode, with all the ESTs condensed onto a single line. To see all of the ESTs that align with the genome in this region, potentially representing differentially spliced transcripts, click on the track's label. This will expand this area of the figure so that each EST occupies a single line (Fig. 5.7). The ESTs are of varying length, but most contain the same exons as the known gene and are (presumably) spliced in the same way. Close inspection indicates that some of the ESTs are missing one or more exons compared with the known gene. Consider the lines marked BE798864 and W52533: the former appears to be missing the fifth exon, whereas the latter is missing the fourth, fifth and sixth exons.
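The visual comparison described above, checking which exons of the known gene have no counterpart in an EST, can be expressed as a simple interval test. This is a sketch under stated assumptions: the coordinates are invented and do not correspond to RAB9P40, BE798864 or W52533, and the function names are hypothetical. An exon is reported as skipped only if it lies entirely within the span covered by the EST, so that exons beyond the ends of a short EST are not counted.

```python
def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def skipped_exons(gene_exons, est_blocks):
    """Return 1-based indices of gene exons inside the EST span with no aligned block."""
    if not est_blocks:
        return []
    span_start = min(start for start, _ in est_blocks)
    span_end = max(end for _, end in est_blocks)
    skipped = []
    for index, exon in enumerate(gene_exons, start=1):
        inside_span = span_start <= exon[0] and exon[1] <= span_end
        if inside_span and not any(overlaps(exon, block) for block in est_blocks):
            skipped.append(index)
    return skipped


# Invented coordinates for a seven-exon gene (half-open genomic intervals).
gene = [(100, 200), (300, 400), (500, 600), (700, 800),
        (900, 1000), (1100, 1200), (1300, 1400)]

# One EST lacking the fifth exon, another lacking the fourth to sixth.
est_a = [(150, 200), (300, 400), (500, 600), (700, 800), (1100, 1200), (1300, 1350)]
est_b = [(100, 200), (300, 400), (500, 600), (1300, 1400)]

print(skipped_exons(gene, est_a))
print(skipped_exons(gene, est_b))
```

A missing exon found this way is only a candidate for alternative splicing; it still needs the alignment-quality checks discussed in the text.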

Any of the ESTs can be examined in more detail by clicking on that particular line. Here, click on the line for BE798864 (arrow, Fig. 5.7) to reach the information page for this EST (Fig. 5.8). The EST is 99.8% identical to the genomic sequence; clicking anywhere on the hyperlinked line in the section marked EST/Genomic Alignments returns the alignment. To view the actual side-by-side alignment, click on the together link in the left sidebar (Fig. 5.9). Differences exist at the ends of the EST, but the sequences are identical in the region surrounding the putative missing exon.

An alternatively spliced mRNA is more likely to be of biological significance when it changes the sequence of the encoded wild-type protein. To determine whether EST BE798864 could encode a protein different from that of the known gene (RAB9P40), compare the two sequences directly against each other using the NCBI's BLAST 2 Sequences tool. First, open a new web browser window, because information from the above search will be needed here; this avoids excessive use of the browser's Back and Forward buttons and is a good general rule when using multiple web tools. Then access the BLAST home page, at http://www.ncbi.nlm.nih.gov/BLAST. Select BLAST 2 Sequences, under the header labeled Pairwise BLAST. On this page, the user can enter accession numbers rather than cutting and pasting sequences into the text boxes. For the EST, enter its accession number (BE798864) into the box marked Enter accession or GI for Sequence 1. Obtaining the accession number of RAB9P40 requires going back to the graphic shown in Fig. 5.6 and clicking on the gene's track. Once this has been done, input the gene's accession number (NM_005833) into the box marked Enter accession or GI for Sequence 2. Make sure that the Program pull-down is set to blastn (to compare a nucleotide sequence against another nucleotide sequence, hence the n in blastn) and click the Align button at the bottom of the page to generate the alignment (Fig. 5.10).

The sequence corresponding to sequence 1 (the EST) is denoted as the query, whereas the sequence corresponding to sequence 2 (the known gene) is denoted as the subject. The known gene's protein translation is also shown, starting at the end of the third row of the alignment. Examination of the alignment shows that the EST is missing 153 nt (nt 360–512 of the mRNA), which corresponds to the fifth exon that is missing in BE798864. Because 153 is a multiple of three, this gap is in frame, so the EST could encode a homologous yet shorter protein.
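The in-frame conclusion follows from simple arithmetic on the gap coordinates given above. The sketch below works through it; the function names are hypothetical, and the check only tests whether the gap length is a multiple of three. It does not examine whether a stop codon is created at the new exon junction, and it assumes the gap lies within the coding region.

```python
def gap_length(first_nt, last_nt):
    """Length of a gap given inclusive, 1-based first and last positions."""
    return last_nt - first_nt + 1


def is_in_frame(length):
    """A deletion preserves the downstream reading frame if its length is a multiple of 3."""
    return length % 3 == 0


# Gap reported in the text: nt 360-512 of the known mRNA.
length = gap_length(360, 512)
codons = length // 3

print(length, is_in_frame(length), codons)
```

With these coordinates the gap is 153 nt long, which corresponds to 51 whole codons.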

Because of the nature of EST sequencing, ESTs often contain sequencing errors at a much higher rate than finished or even draft genomic sequence. It is certainly encouraging that EST BE798864 aligns well with the genomic sequence and that its encoded protein could be in the same frame as that produced from the known gene. In addition, it appears from the UCSC graphic (Fig. 5.7) that other ESTs in this region, such as BE779110, are also missing the fifth exon of RAB9P40. All these predictions must, however, be tested computationally by looking at the quality of the EST–genomic alignment as shown above. Final proof of alternative splicing can, of course, only be generated at the laboratory bench.

Accession codes

GenBank/EMBL/DDBJ: BG334944

References

1. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
2. Collins, F.S. & McKusick, V.A. Implications of the Human Genome Project for medical science. J. Am. Med. Assoc. 285, 540–544 (2001).
3. Watson, J.D. & Crick, F.H.C. Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature 171, 737–738 (1953).
4. Green, E.D. Strategies for the systematic sequencing of complex genomes. Nature Rev. Genet. 2, 573–583 (2001).
5. Ouellette, B.F.F. & Boguski, M.S. Database divisions and homology search files: a guide for the perplexed. Genome Res. 7, 952–955 (1997).
6. Bairoch, A. & Apweiler, R. The SWISS-PROT Protein Sequence Database and its supplement TREMBL in 2000. Nucleic Acids Res. 28, 45–48 (2000).
7. Hubbard, T. et al. The Ensembl Genome Database Project. Nucleic Acids Res. 30, 38–41 (2002).
8. Kent, W.J. BLAT—the BLAST-like Alignment Tool. Genome Res. 12, 656–664 (2002).
9. Stein, L. Genome annotation: from sequence to biology. Nature Rev. Genet. 2, 493–503 (2001).
10. Pruitt, K.D. & Maglott, D.R. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29, 137–140 (2001).
11. Burge, C.B. & Karlin, S. Finding the genes in genomic DNA. Curr. Opin. Struct. Biol. 8, 346–354 (1998).
12. Schuler, G.D. Electronic PCR: bridging the gap between genome mapping and genome sequencing. Trends Biotechnol. 16, 456–459 (1998).
13. Sherry, S.T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).
14. Hamosh, A. et al. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 30, 52–55 (2002).
15. Baxevanis, A.D. & Ouellette, B.F.F. (eds.) Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins (John Wiley & Sons, New York, 2001).
16. Solovyev, V.V., Salamov, A.A. & Lawrence, C.B. Identification of human gene structure using linear discriminant functions and dynamic programming. Proc. Int. Conf. Intell. Syst. Mol. Biol. 3, 367–375 (1995).
17. Yeh, R.F., Lim, L.P. & Burge, C.B. Computational inference of homologous gene structures in the human genome. Genome Res. 11, 803–816 (2001).
18. Marchler-Bauer, A. et al. CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res. 30, 281–283 (2002).
19. Apweiler, R. et al. InterPro—an integrated documentation resource for protein families, domains and functional sites. Bioinformatics 16, 1145–1150 (2000).
20. Rebhan, M., Chalifa-Caspi, V., Prilusky, J. & Lancet, D. GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 14, 656–664 (1998).
21. Blake, J.A., Richardson, J.E., Bult, C.J., Kadin, J.A. & Eppig, J.T. The Mouse Genome Database (MGD): the model organism database for the laboratory mouse. Nucleic Acids Res. 30, 113–115 (2002).
22. Hudson, T.J. et al. A radiation hybrid map of mouse genes. Nature Genet. 29, 201–205 (2001).
23. Bateman, A. et al. The Pfam protein families database. Nucleic Acids Res. 30, 276–280 (2002).
24. Letunic, I. et al. Recent improvements to the SMART domain–based sequence annotation resource. Nucleic Acids Res. 30, 242–244 (2002).
25. Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
26. Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge Univ. Press, Cambridge, 1998).
27. Peri, S., Ibarrola, N., Blagoev, B., Mann, M. & Pandey, A. Common pitfalls in bioinformatics-based analyses: look before you leap. Trends Genet. 17, 541–545 (2001) [erratum Trends Genet. 18, 218 (2002)].
28. Ponting, C. Issues in predicting protein function from sequence. Brief. Bioinform. 2, 19–29 (2001).
29. Aparicio, S.A.J.R. How to count ... human genes. Nature Genet. 25, 129–130 (2000).
30. Beadle, G.W. & Tatum, E.L. Genetic control of biochemical reactions in Neurospora. Proc. Natl Acad. Sci. USA 27, 499–506 (1941).
31. Jeffery, C.J., Bahnson, B.J., Chien, W., Ringe, D. & Petsko, G.A. Crystal structure of rabbit phosphoglucose isomerase, a glycolytic enzyme that moonlights as neuroleukin, autocrine motility factor, and differentiation mediator. Biochemistry 39, 955–964 (2000).
32. Wistow, G. & Piatigorsky, J. Recruitment of enzymes as lens structural proteins. Science 236, 1554–1556 (1987).
33. Jeffery, C.J. Moonlighting proteins. Trends Biochem. Sci. 24, 8–11 (1999).
34. Chothia, C. Proteins. One thousand families for the molecular biologist. Nature 357, 543–544 (1992).
35. Hegyi, H. & Gerstein, M. The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J. Mol. Biol. 288, 147–164 (1999).
36. Jansen, R. & Gerstein, M. Analysis of the yeast transcriptome with structural and functional categories: characterizing highly expressed proteins. Nucleic Acids Res. 28, 1481–1488 (2000).
37. Brenner, S.E. Errors in genome annotation. Trends Genet. 15, 132–133 (1999).
38. Smith, R.F. Perspectives: sequence data base searching in the era of large-scale genomic sequencing. Genome Res. 6, 653–660 (1996).

Question 5 Given a fragment of mRNA sequence, how would one find where that piece of DNA mapped in the human genome? Once its position has been determined, how would one find alternatively-spliced transcripts?. Nat Genet 35 (Suppl 1), 33–39 (2003). https://doi.org/10.1038/ng1193

Issue Date: September 2003
DOI: https://doi.org/10.1038/ng1193