Move over, genome: here comes the transcriptome. Last week, 120 researchers from around the world gathered in Tokyo to assemble the core of a transcriptome database, which they hope will one day hold all of the expressed sequences in the human genome.
The database, which should be up and running by December, will be a universal resource for biological research and drug discovery, say the meeting's organizers. “We want to know exactly where the genes are and what they do,” says Sumio Sugano, a researcher from the University of Tokyo's Institute of Medical Science.
As the first step in producing proteins, information in genes is transcribed into messenger RNA (mRNA). This process separates the coding sequences of genes from the rest of the genome — often called 'junk' DNA. The transcriptome is the complete set of transcribed mRNA. For years, researchers have studied these transcripts in the form of complementary DNAs (cDNAs), which are made using the mRNA taken from cells as a template. cDNAs represent the mRNA present in the cell, but they are much easier to work with than mRNA itself.
Now researchers want to incorporate the sequences of all of the human cDNAs into a single database, to be run by the Japan Biological Information Research Center in Tokyo and the DNA Data Bank of Japan (DDBJ) in Mishima. At the Tokyo meeting, researchers analysed cDNA data representing over 20,000 genes — covering more than half of the transcriptome — for inclusion in the database.
Trying to find genes within the human genome sequence often means guessing at which parts are expressed by looking for certain patterns in the sequence. cDNAs, made from mRNA expressed in cells, offer a more direct route. “This will be a real human-gene catalogue — not predicted from the human genome sequence. These are real transcripts,” says meeting organizer Takashi Gojobori, director of the DDBJ in Mishima.
Most of the cDNAs are already publicly available, but many exist as fragments of the complete cDNA. In addition, the lack of proper categorization and inconsistencies between the various databases limit the usefulness of the sequences for research.
“The data will be well-defined and quality controlled through the checks and balances of over a hundred scientists,” says Ranajit Chakraborty, director of the Center for Genome Information at the University of Cincinnati Medical Center in Ohio.
To create the data set, the researchers mapped 42,000 cDNAs, collected from six databases around the world, to some 23,000 different regions on the human genome. The overlap of many cDNAs at the same regions will shed light on one of the mysteries of the genome — how so few genes can make the range of proteins that carry out the many functions in human development, and also produce so much variety in people's genetically determined features.
One explanation is that the genes undergo alternative splicing, whereby various mRNAs are produced from the same genomic sequence. By comparing the many slightly different cDNAs that cover the same gene regions, researchers say that they will find many examples of these alternative forms of mRNA.
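The idea of spotting alternative forms among cDNAs that share a genomic region can be illustrated with a minimal sketch. It is not the method used at the meeting; the alignment records, coordinates and function names below are invented for illustration. It groups cDNAs whose mapped spans overlap on the same chromosome and strand into one region, then counts how many distinct intron patterns appear in each region.

```python
from collections import defaultdict

# Hypothetical toy alignments, invented for illustration:
# (cdna_id, chromosome, strand, exons), with exons as (start, end) pairs
# sorted by genomic position.
ALIGNMENTS = [
    ("cdna_A", "chr1", "+", [(100, 200), (300, 400), (500, 600)]),
    ("cdna_B", "chr1", "+", [(100, 200), (500, 600)]),              # skips the middle exon
    ("cdna_C", "chr1", "+", [(120, 200), (300, 400), (500, 580)]),  # same introns as cdna_A
    ("cdna_D", "chr1", "+", [(5000, 5200), (5400, 5600)]),
    ("cdna_E", "chr2", "-", [(100, 250)]),
]


def cluster_into_loci(alignments):
    """Group cDNAs whose genomic spans overlap on the same chromosome and strand."""
    by_strand = defaultdict(list)
    for cdna_id, chrom, strand, exons in alignments:
        span_start, span_end = exons[0][0], exons[-1][1]
        by_strand[(chrom, strand)].append((span_start, span_end, cdna_id, exons))

    loci = []
    for key in sorted(by_strand):
        current, current_end = [], None
        # Sweep left to right; a gap between spans starts a new locus.
        for start, end, cdna_id, exons in sorted(by_strand[key]):
            if current and start > current_end:
                loci.append(current)
                current, current_end = [], None
            current.append((cdna_id, exons))
            current_end = end if current_end is None else max(current_end, end)
        if current:
            loci.append(current)
    return loci


def intron_chain(exons):
    """Introns implied by consecutive exons; differences at transcript ends are ignored."""
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))


def splice_forms(locus):
    """Map each distinct intron chain in a locus to the cDNAs that share it."""
    forms = defaultdict(list)
    for cdna_id, exons in locus:
        forms[intron_chain(exons)].append(cdna_id)
    return dict(forms)


loci = cluster_into_loci(ALIGNMENTS)
for locus in loci:
    ids = [cdna_id for cdna_id, _ in locus]
    print(ids, "distinct splice forms:", len(splice_forms(locus)))
```

In this toy data, cdna_A and cdna_C count as the same form because they imply identical introns, whereas cdna_B skips an exon and so counts as a second form. Comparing intron chains rather than whole exon lists is a deliberate simplification: it keeps truncated, fragmentary cDNAs from being miscounted as separate variants merely because their ends differ.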
The meeting also offered a large data set, and a platform for debate, concerning non-coding RNA, which does not make protein. Some researchers believe such non-coding RNA has a major role in regulating gene expression, but the idea remains controversial (see Nature 418, 122–124; 2002).