Original Article

Relevance and limitations of public databases for microarray design: a critical approach to gene predictions

The Pharmacogenomics Journal volume 3, pages 235–241 (2003)


In conjunction with the completion of the human genome sequence, microarray technology offers a strategy complementary to the traditional methodologies used to search for genetic determinants of multifactorial diseases such as Alzheimer's disease. To benefit from this strategy, we designed home-made microarrays to compare the expression of all ORFs located within loci of interest defined by genome scanning in Alzheimer family studies. Two approaches were selected, using either probes amplified by PCR from a cDNA bank or specific oligonucleotides. Here, we report the challenging task of validating, prioritising and selecting the best ORFs derived from the genome sequence. The initial inventory from the NCBI website allowed us to select 5849 ORFs within nine loci. Half of them resulted from prediction models using the GenomeScan software. However, our data showed that predicted ORFs may not be representative of exonic sequences, or even of real genes. These observations led us to exclude these ORFs from our study, decreasing the number of ORFs from 5849 to 2748. Microarrays may be only ‘snapshots’ of our current knowledge of the human genome.


The completion of the human genome sequence has provided basic structural information on all human genes. Functional techniques, such as cDNA microarrays, serial analysis of gene expression and proteomic analyses, make it possible to analyse the expression levels of thousands of genes and proteins at the same time. The development of these high-throughput screening techniques is now changing biomedical research in a crucial way and is made possible by updated and comprehensive databases such as those of the National Center for Biotechnology Information (NCBI; http://www.ncbi.nlm.nih.gov/). NCBI has already referenced more than 40 000 distinct open reading frames (ORFs), classified into six categories depending on the type of evidence used to construct the gene model. However, validating, prioritising and selecting the best sequences from tens of thousands of putative candidate ORFs is a challenging task.

This is clearly important when projects are developed in order to characterise new genetic determinants in multifactorial diseases such as Alzheimer's disease (AD), which result from an interaction of multiple genetic and environmental factors. Microarray technology therefore appears to offer a good way of developing a complementary approach along with linkage and association studies, methodologies ‘traditionally’ used to search for new candidate genes.

As a tool for screening new candidate genes, microarray technology may allow genes of interest to be selected on a functional basis, by comparing the patterns of gene expression observed in case and control brain tissues, rather than through the statistical associations on which family or population studies rely. This assumption results from two major observations: (i) the expression of numerous genes is modified during AD aetiology;1 (ii) apart from qualitative variations (i.e. coding mutations) in genes already involved in the disease, quantitative variations in the expression of these genes have been shown to be genetic determinants of disease. For example, functional polymorphisms within the promoter sequences of the APOE, PS1 and PS2 genes are associated with an increased risk of developing AD.2,3,4 A similar involvement of the APP gene has been discussed.5

Consequently, we argued that genes exhibiting differential expression between cases and controls, and located in one of the loci of interest defined by previous genome scans, could constitute potential candidate genes for AD. We planned to make home-made microarrays to screen all ORFs contained in the risk-associated loci (over nine different chromosomes) previously identified in genome scan studies.6,7,8,9 Two approaches were defined: (i) the first is based on the development of specific PCR probes from a bank of 12 000 unique cDNAs; (ii) the second consists of developing specific oligonucleotides. The aim of this report is to show the relevance and limitations of the bioinformatics analysis, which we developed from the NCBI database, used to select ORFs of interest for this project.


Development of ORF Data Files from NCBI Database

A list of all ORFs contained in the loci of interest was determined using the NCBI human map viewer (July 2002). The sequences of these ORFs were extracted from the NCBI database, which made it possible to build a file for each chromosomal region of interest. These files also contained the ORF symbol, its full name when known, its position (cytogenetic and in base pairs), its transcriptional orientation and, finally, the type of evidence used to construct the gene model. There were six evidence codes: (i) the ‘C’ code is for a confirmed gene model: there is a clean alignment between a RefSeq or GenBank mRNA sequence and the genomic sequence, or there is an exact match between the protein product entered in the mRNA sequence record and the conceptual translation of the genomic sequence gene model; (ii) the ‘?’ code reflects some discrepancies between the mRNA sequence and the gene model, for instance gaps or the alignment of an mRNA to two or more genomic regions; (iii) the ‘I’ code is representative of a model based on the alignment of mRNA or mRNA plus ESTs to the genome. However, these models may be paralogs, duplications caused by assembly errors, or pseudogenes; (iv) the ‘E’ code corresponds to models based only on EST evidence; (v) the ‘P’ and ‘PE’ codes are for models predicted using the GenomeScan software only or predicted using GenomeScan and EST evidence, respectively.
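The six evidence codes can be represented as a small lookup table. The sketch below is illustrative only: the record layout (a dict with `symbol` and `evidence` keys) and the function name `split_by_evidence` are assumptions, not part of the software described in this article. The grouping into models backed by mRNA evidence (‘C’, ‘?’, ‘I’) and models relying on ESTs and/or GenomeScan (‘E’, ‘PE’, ‘P’) is the one used later in the text.

```python
# Evidence codes as described in the text (descriptions abbreviated).
EVIDENCE_CODES = {
    "C": "confirmed model: clean mRNA/genome alignment or exact protein match",
    "?": "mRNA and gene model show discrepancies (gaps, multiple regions)",
    "I": "model from alignment of mRNA, or mRNA plus ESTs, to the genome",
    "E": "model based on EST evidence only",
    "PE": "GenomeScan prediction supported by EST evidence",
    "P": "GenomeScan prediction only",
}

# Grouping used later in the article.
MRNA_BASED = {"C", "?", "I"}
PREDICTED = {"E", "PE", "P"}


def split_by_evidence(orfs):
    """Split ORF records into (mRNA-based, predicted) lists.

    Each record is a dict with hypothetical keys 'symbol' and 'evidence'.
    """
    unknown = {orf["evidence"] for orf in orfs} - set(EVIDENCE_CODES)
    if unknown:
        raise ValueError(f"unknown evidence codes: {sorted(unknown)}")
    mrna_based = [orf for orf in orfs if orf["evidence"] in MRNA_BASED]
    predicted = [orf for orf in orfs if orf["evidence"] in PREDICTED]
    return mrna_based, predicted
```

For example, `split_by_evidence([{"symbol": "APOE", "evidence": "C"}, {"symbol": "X1", "evidence": "P"}])` places the first record in the mRNA-based list and the second (a made-up symbol) in the predicted list.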

We defined nine databases for ORFs, each one representing a locus of interest (Table 1). All RefSeq accession numbers for the relevant mRNA sequences were retrieved using LocusLink (NCBI). ORF databases were then created by downloading the selected mRNA sequences from a file containing all human mRNAs in FASTA format. This file was initially downloaded from NCBI and regularly updated to follow database developments.
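Extracting a subset of sequences from a multi-record FASTA file can be sketched as follows. This is a minimal illustration, not the program used in this study: it assumes that the first whitespace-delimited token of each header line is the accession number, optionally followed by a version suffix such as `.1`. Real NCBI headers of the period may use a different layout, and the function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_by_accession(lines, wanted):
    """Return {accession: sequence} for records whose accession is wanted.

    Assumption: the accession is the first token of the header, and any
    version suffix after a dot is ignored.
    """
    selected = {}
    for header, sequence in read_fasta(lines):
        accession = header.split()[0].split(".")[0]
        if accession in wanted:
            selected[accession] = sequence
    return selected
```

Because `read_fasta` accepts any iterable of lines, an open file handle can be passed directly, so a large mRNA file does not need to be loaded into memory at once.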

Table 1: Chromosomal position of the locus of interest defined by genome scan studies

Selection of the Clones

The methodology we used is summarised in Figure 1. A software program used to select clones of interest was developed by the Bioinformatics Integrated Centre of the Lille Génopole. Clone sequences available from the Centre National de Séquençage (CNS) were aligned with the NCBI mRNA sequences of the ORFs using the Basic Local Alignment Search Tool (BLAST). During this initial stage, clones were selected if the percentage of homology was higher than or equal to 50%. This percentage was calculated as the following ratio: length of alignment with an mRNA/total length of the clone sequence. This low minimum value was chosen mainly because most cDNAs originated from full-length mRNAs and were cloned from a poly-T oligonucleotide. As a consequence, 5′- and 3′-UTR sequences were almost systematically available. However, these regions were not always fully documented in the NCBI sequence pool, which reduced the possible lengths of alignment.
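The selection ratio defined above (alignment length divided by total clone length, with a 50% minimum) can be written as a short sketch. The function names are hypothetical and the alignment length is assumed to have already been taken from the BLAST output; this is not the program developed for the study.

```python
def homology_ratio(alignment_length, clone_length):
    """Ratio used in the text: alignment length / total clone length."""
    if clone_length <= 0:
        raise ValueError("clone_length must be positive")
    if alignment_length < 0:
        raise ValueError("alignment_length must not be negative")
    return alignment_length / clone_length


def passes_first_stage(alignment_length, clone_length, minimum=0.50):
    """True when the clone meets the 'higher than or equal to 50%' rule."""
    return homology_ratio(alignment_length, clone_length) >= minimum
```

Passing `minimum=0.0` reproduces the second BLAST stage described later, in which no minimum was applied.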

Figure 1: Method used to select ORFs of interest for microarray design.

The results obtained were compared with the initial data available from the CNS, and the following three files emerged: (i) ORF-clone pairs for which the assignment by BLAST was the same as that reported by the CNS; (ii) ORF-clone pairs found only by BLAST. This possibility was not surprising, as the assignment of the clones by the CNS was not yet complete; (iii) ORF-clone pairs initially assigned by the CNS and not found by BLAST. In order to understand the latter discrepancies, a second BLAST was performed for these ORFs, this time without any minimum homology limit. During this second stage, we were able to find ORF-clone pairs for which the homology percentage was in fact less than 50%.
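The three files described above correspond to the intersection and the two differences of two sets of ORF-clone pairs. A minimal sketch, assuming each pair is represented as an `(orf, clone)` tuple (a representation chosen here for illustration, with a hypothetical function name):

```python
def compare_assignments(blast_pairs, cns_pairs):
    """Partition ORF-clone pairs into the three files described in the text."""
    blast_pairs, cns_pairs = set(blast_pairs), set(cns_pairs)
    return {
        "blast_and_cns": blast_pairs & cns_pairs,  # file (i)
        "blast_only": blast_pairs - cns_pairs,     # file (ii)
        "cns_only": cns_pairs - blast_pairs,       # file (iii)
    }
```

The pairs in `cns_only` are the ones that would be submitted to the second BLAST run without a minimum limit.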

Specificity of Selected ORF-clone Pairs

As a third step, selected ORF-clone pairs were tested for their specificity. The specificity of a clone was estimated against the whole NCBI human mRNA database. The minimum similarity limit was set at 30% homology: a clone sharing at least 30% homology with a human mRNA other than the one identified during the first two steps was considered unspecific. This method of selection was applied to each ORF data file. The results were finally analysed and sorted in order to obtain a list of specific ORF-clone pairs, representative of one part of our ORFs of interest. In cases where a clone was considered to be unspecific because it aligned with two or more ORFs of interest in the same locus, a manual study was performed.
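The specificity rule can be sketched as a single predicate. The input format (a mapping from mRNA identifier to the homology fraction of the clone against that mRNA) and the function name are assumptions made for illustration; the fractions are taken to be computed as in the first selection stage.

```python
def is_specific(homology_by_mrna, assigned_mrna, limit=0.30):
    """Return True if no other human mRNA reaches the homology limit.

    homology_by_mrna: {mrna_id: homology fraction of the clone against it}
    assigned_mrna:    the mRNA paired with the clone in the earlier steps
    """
    return not any(
        mrna != assigned_mrna and fraction >= limit
        for mrna, fraction in homology_by_mrna.items()
    )
```

A clone with hits `{"orfA": 0.9, "orfB": 0.3}` and assigned to `orfA` is unspecific under this rule, because `orfB` reaches the 30% limit.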

List of ORFs of Interest

The bibliographic study and NCBI prediction allowed us to select 5849 ORFs in nine loci of interest (Table 2). A data file of all selected ORF mRNAs was made and compared with the available CNS clone sequences. After the various steps of selection and specificity testing of clone-ORF pairs, only 15.7% (919 ORFs) remained.

Table 2: Number of ORFs selected within the region of interest following the different steps of selection

In order to analyse this low final number of ORFs paired with available clones, all 5849 ORFs were classified according to their evidence code (Table 3). Interestingly, the distribution observed was relatively homogeneous among the various categories; the lowest percentages were obtained for the ‘I’ and ‘E’ codes (7.7 and 6.3% of all 5849 ORFs, respectively). A similar classification was performed for the final 919 clone-ORF pairs but, surprisingly, a very different pattern was observed (Table 4). Almost 97% of clone-ORF pairs were classified in the ‘C’, ‘?’ or ‘I’ categories, while only 3% fell under the ‘E’, ‘PE’ or ‘P’ codes. However, the ORFs characterised by strong biological evidence (‘C’, ‘?’ and ‘I’) represented only about 50% of all 5849 ORFs, the other half consisting of ORFs predicted via software and/or ESTs (‘E’, ‘PE’ or ‘P’). Since the clone bank was randomly established from a large panel of tissues, if the predictive value of each category were equivalent and correct, one would expect the distribution to be similar for the 919 clone-ORF pairs and the 5849 ORFs. These observations may show that the prediction of genes using the GenomeScan software was not sufficiently efficient, and that we could not be certain that oligonucleotides designed from this ORF assembly would be representative of an exonic sequence or even of a real gene. As a consequence, we decided not to include ORFs with an ‘E’, ‘PE’ or ‘P’ evidence code in the design of our home-made oligonucleotide microarrays, restricting this design to 2748 ORFs (vs 5849) (Table 3). As a result of these analyses, two types of microarrays will be made, either from the 919 specific clone-ORF pairs or from 2748 oligonucleotides, the latter including both ORFs represented by the clones and ORFs with no corresponding clone.

Table 3: Distribution of selected ORFs according to their evidence code
Table 4: Distribution of specific ORF-clone pairs according to their evidence code

Biological Relevance of Selected ORFs

Using the gene ontology website (http://www.geneontology.org), we performed a systematic search for the biological functions of our ORFs of interest. Among the 2748 selected genes, 1246 were referenced (45.4%). Approximately 50% of them showed at least two different biological activities (Figure 2). To simplify the information available, only the first level of the gene ontology hierarchy of biological activities is shown (Table 5). At this first level, 20 classes were defined. At the second level, the number of classes increased to 143, finally reaching 1791 at the last level. Consequently, the biological functions or abilities associated with these 20 classes were not very specific (Table 5).

Figure 2: Number of biological activities attributed to each ORF by the gene ontology website. (a) Classification from selected ORFs for microarray design. (b) Classification from selected genes using bibliographical research.

Table 5: Biological function or ability attributed to ORFs having at least one gene ontology entry (a gene may have several associated activities; see Figure 2)

Using PubMed, we performed a systematic search for published reports of AD genetic determinants and observed that at least 98 genes had already been tested in case–control studies. However, these numerous reports provided no consensus evidence of any predisposing risk alleles in genes other than APOE. Only 24/98 genes (24.5%) were located within the defined loci of interest (Table 1), and all were included in our ORF pool. Not surprisingly, the chromosome 10 and 12 loci had been the most studied (6 and 7 genes, respectively), reflecting the intense recent interest of the AD community in these two chromosomal regions. However, these 24 genes represented less than 1% of the ORFs we had selected for our microarray study. Finally, the range of biological functions or abilities of these 24 published candidate genes did not display any noticeable differences when compared with our ORF pool (Table 5).


The human genome is by far the largest genome to have been sequenced, and its size and complexity pose many challenges for sequence assembly. Public databases such as NCBI constitute an essential tool for free access to this human genome assembly, and therefore for the development of high-throughput genomic and postgenomic screening techniques. In order to develop microarrays to search for new AD genetic determinants, we used the NCBI database to select the most relevant genes located within the loci of interest defined by previous genome scan studies. The human map viewer software allowed us to select 5849 ORFs. As described in the section on equipment and methods, six categories of ORFs were available according to their evidence code, and more than 50% of our selected ORFs were based on predictions using software and/or ESTs. The mRNA sequences of these ORFs were compared with the sequences of 12 000 cDNAs available from the CNS in order to select specific ORF-clone pairs and finally to produce PCR probes from these cDNAs. However, we observed that among the 919 specific ORF-clone pairs we had selected from the CNS bank, only 3% corresponded to these predicted ORFs.

Several explanations linked to our selection approach may account for this unexpectedly low number of clones being assigned to predicted ORFs: (i) these predicted genes may contain a higher proportion of interesting novel genes that are perhaps expressed at too low a level, or in too restricted a set of tissues or conditions, to be efficiently sampled in the CNS bank; (ii) alternative splicing is increasingly recognised as an important and widespread form of gene regulation believed to affect more than half of human genes. However, we were not able to determine in which splice form each gene was cloned, so that some ORF-clone pairs may not have been selected during the first BLAST step because of too low a percentage of similarity; (iii) the 30% limit we set for the specificity of BLAST was low. This level was defined to limit the risk of cross-hybridisation when using the full-length cDNA amplified by PCR as a probe for microarrays. We therefore rejected 35.8% of all clone-ORF pairs we had initially selected. In particular, it was likely that we risked rejecting ORFs belonging to a protein family exhibiting extensive sequence homology. This may be particularly relevant, as the GenomeScan software program was developed to build gene structures from the detection of protein sequence homologies.10 However, the distribution of the rejected clone-ORF pairs according to their evidence codes did not reveal any significant difference compared with the distribution we reported for the specific ones (data not shown), suggesting that the specificity limit we used did not account for the low percentage of clones assigned to predicted ORFs.

Apart from these limitations inherent to our strategy, the low number of clone-predicted ORF pairs may indicate that the GenomeScan software program is not systematically efficient in defining gene structure, and needs to be used with care. The GenomeScan software program used by NCBI for the annotation of the human genome combines searches for sequence homology with ab initio predictions.10,11 As a first step, homologies between genomic and protein sequences are determined. Then, if a protein homology is detected, an ab initio software program is run on the genomic sequence. The ab initio software program was developed to take into account the structural characteristics of genes (their density, number of exons, distribution and size of these exons, location of the TATA box, polyadenylation signal or splice sites) and their statistical properties, such as the probability for a nucleotide to be coding or not, depending on its environment. Systematic tests of the accuracy of GenomeScan have shown that it is more accurate than the existing ab initio and similarity-based algorithms across a broad range of similarity levels. Consequently, these predictions permit a systematic large-scale annotation of the human genome. In the initial report describing the GenomeScan software, the authors reported that among 22 607 genes predicted, 49.3% were partial genes.10 These data may at least partly explain why we found it difficult to associate clones with predicted ORFs. The sequences defined for these partial genes may not overlap the available clone sequences, leading to the loss of some clone-ORF pairs. However, 32.1% of predicted ORFs matched complete genes with at least three exons.10 If this distribution held in our study, we would expect to have 487 predicted (P+PE+E) ORFs among the 919 pairs and to pair at least 156 clones with predicted ORFs (32.1% of 487), representing 17% of all ORF-clone pairs, whereas we only reported 30 such pairs. Finally, the observation that 24 of these pairs (2.6%; Table 4) were derived from EST evidence (E) only vs 58 expected (6.3%; Table 3), and four from the combination of ESTs and GenomeScan (PE) (0.4%; Table 4) vs 187 expected (20.4%; Table 3), suggests that the GenomeScan software program is not associated with any significantly better understanding of the fine gene structure compared with EST evidence alone. This point is supported by the fact that ORFs derived from EST evidence (E) also result from ab initio predictions, although they are not based on a search for homology of protein sequences. Therefore, the GenomeScan software may help to detect exons, while the characterisation of a complete and complex gene structure still seems to be difficult. This point is particularly important for the design of home-made microarrays using oligonucleotides, since these oligonucleotides must (i) match an mRNA sequence and (ii) be specific to a gene of interest. However, when using ORFs defined by the GenomeScan software as targets, we cannot be certain that these two conditions are met. This is the reason why we decided not to include these ORFs in our study, which led to a decrease in the number of ORFs from 5849 to 2748. However, we have to remember that our selection strategy is likely to be very stringent, leading to the rejection of numerous ORFs of potential interest that are not yet sufficiently defined.
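The expected counts quoted in this paragraph can be re-derived from figures stated in the article. The sketch below assumes, as the argument in the text does, that evidence codes would be distributed among the 919 pairs in the same proportions as among the 5849 ORFs; the variable names are illustrative only.

```python
TOTAL_ORFS = 5849          # ORFs selected within the nine loci
MRNA_BASED_ORFS = 2748     # ORFs kept after excluding E, PE and P codes
TOTAL_PAIRS = 919          # specific clone-ORF pairs

# Share of predicted (E + PE + P) ORFs in the initial inventory.
predicted_share = (TOTAL_ORFS - MRNA_BASED_ORFS) / TOTAL_ORFS

# Expected number of predicted ORFs among the 919 pairs.
expected_predicted = round(TOTAL_PAIRS * predicted_share)

# 32.1% of GenomeScan predictions matched complete genes (reference 10).
expected_complete = round(0.321 * expected_predicted)

# Expected E and PE pairs from the shares reported in Table 3.
expected_e = round(0.063 * TOTAL_PAIRS)
expected_pe = round(0.204 * TOTAL_PAIRS)

print(f"predicted share of ORFs: {predicted_share:.1%}")
print(f"expected predicted pairs: {expected_predicted}")
print(f"expected complete-gene pairs: {expected_complete} "
      f"({expected_complete / TOTAL_PAIRS:.0%} of pairs)")
print(f"expected E pairs: {expected_e}, expected PE pairs: {expected_pe}")
```

Under these assumptions the script reproduces the values 487, 156 (17%), 58 and 187 given in the text, against the 30, 24 and 4 pairs actually observed.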

Finally, the information available in LocusLink or RefSeq is regularly updated. For instance, the number of human sequences corresponding to known genes increased from 6000 to more than 10 000 between 1999 and 2000.11 Between May and July 2002, we observed that more than 50% of predicted ORFs were no longer referenced, whereas the proportion of ORFs with a ‘P’ evidence code had increased from 20.6 to 26.3%. It is likely that this increase is due to the combination of a systematic use of the GenomeScan software and progress in genome sequencing. This strong instability reinforced our choice not to take these predicted ORFs into account in the design of our home-made microarrays.
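The kind of comparison between two database releases described above can be sketched as follows. Each snapshot is assumed to be a mapping from ORF identifier to evidence code; this representation, the function names and the toy data are hypothetical and do not reproduce the May and July 2002 figures.

```python
PREDICTED = {"E", "PE", "P"}


def predicted_dropout(old, new):
    """Fraction of predicted ORFs in `old` that are absent from `new`."""
    old_predicted = {orf for orf, code in old.items() if code in PREDICTED}
    if not old_predicted:
        return 0.0
    return len(old_predicted - set(new)) / len(old_predicted)


def code_share(snapshot, code):
    """Proportion of ORFs in a snapshot carrying the given evidence code."""
    if not snapshot:
        return 0.0
    return sum(1 for c in snapshot.values() if c == code) / len(snapshot)


# Toy snapshots with made-up identifiers.
old_release = {"orf1": "P", "orf2": "PE", "orf3": "C", "orf4": "E"}
new_release = {"orf1": "P", "orf3": "C", "orf5": "P"}

print(predicted_dropout(old_release, new_release))
print(code_share(old_release, "P"), code_share(new_release, "P"))
```

Matching ORFs by identifier is itself an assumption: if identifiers are reassigned between releases, a comparison by genomic position or sequence would be needed instead.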

In conclusion, the design of microarrays in the search for new genetic determinants involved in complex disorders such as Alzheimer's disease depends on the development and quality of the genome assembly. The list of ORFs we made was therefore not a definitive listing of the genes located within the chromosomal regions of interest. By rejecting predicted ORFs as potential targets for screening, we minimised the impact of the instability we observed between two successive database updates. Finally, even though future relocalisations and annotations of newly confirmed genes following any update will not be taken into account in our experiment (indicating that microarrays are only snapshots of our current knowledge of the human genome), our approach seems to be valid and of interest for the systematic screening of AD candidate genes. Only 1% of our selected ORFs have previously been studied in case–control association studies within the loci of interest defined by genome scanning.


Selection of Regions of Interest

Loci likely to contain genes of interest for AD were defined from a bibliographic study6,7,8,9 based on the analysis of previously published genome scan results obtained from late-onset familial forms of AD (Table 1).

Bank of cDNAs from the CNS

This bank contains approximately 12 000 unique cDNAs, cloned in a pCMV·SPORT6 vector. Approximately 70% of them are full length, and they were randomly obtained from different biological tissues. Two data files are available: (i) the putative chromosomal location of each cDNA and, when possible, its attribution to a known gene; (ii) the cDNA sequences. Most 3′- and 5′-UTR sequences are available, as well as internal sequences. This file also specifies the tissue from which the clone was obtained. For the cDNA library construction, different human tissues such as neuroblastoma, placenta, fetal and adult brain, fetal liver, T and B cell lines and the HeLa cell line were used. Libraries were constructed by Life Technologies, a division of Invitrogen Corporation (Full-length cDNA libraries and normalisation: Li WB, Gruber C, Jessee J, Polayes D, unpublished). In brief, first-strand cDNA was primed with a NotI-oligo(dT) primer. The 5′ ends were enriched, and double-stranded cDNA was digested with NotI and cloned into the NotI and EcoRV sites of the pCMV·SPORT6 vector. Some of the libraries were normalised. The sequences of the cDNA clones were determined using either Li-Cor 4200 or ABI3700 analyzers.


  1. A gene expression profile of Alzheimer's disease. DNA Cell Biol 2001; 20: 683–695.
  2. A new polymorphism in the APOE promoter associated with the risk of developing Alzheimer's disease. Hum Mol Genet 1998; 7: 533–540.
  3. Transcriptional regulation of Alzheimer's disease genes: implications for susceptibility. Hum Mol Genet 2000; 9: 2383–2394.
  4. Regulatory region variability in the human presenilin-2 (PSEN2) gene: potential contribution to the gene activity and risk for AD. Mol Psychiatry 2002; 7: 891–898.
  5. Genetic variability in the amyloid-beta precursor protein locus may contribute to the risk of late-onset Alzheimer's disease. Neurosci Lett 1999; 269: 67–70.
  6. A full genome scan for late onset Alzheimer's disease. Hum Mol Genet 1999; 8: 237–245.
  7. Full genome screen for Alzheimer disease: stage II analysis. Am J Med Genet 2002; 114: 235–244.
  8. Identification of novel genes in late-onset Alzheimer's disease. Exp Gerontol 2000; 35: 1343–1352.
  9. A second locus for very-late-onset Alzheimer disease: a genome scan reveals linkage to 20p and epistasis between 20p and the amyloid precursor protein region. Am J Hum Genet 2002; 71: 154–161.
  10. Computational inference of homologous gene structures in the human genome. Genome Res 2001; 11: 803–816.

  11. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res 2001; 29: 137–140.



We thank Dr Aline Meirhaeghe-Hurez and Dr David Mann for their helpful discussion. This work was supported by INSERM and the ‘Génopole de Lille’.

Author information


  1. Unité INSERM 508, Institut Pasteur de Lille, 1 rue du professeur Calmette, Lille cédex, France

    • J-C Lambert, E Testa & P Amouyel
  2. Centre intégré de bio-informatique de la Génopole de Lille, Cité Scientifique, 59655 Villeneuve d'Ascq cédex, France

    • V Cognat & J Soula
  3. Laboratoire de biopuces, Institut Pasteur de Lille, 1 rue du professeur Calmette, 59019 Lille cédex, France

    • D Hot & Y Lemoine
  4. Centre national de séquençage, 2 rue Gaston Crémieux, CP 5706, 91057 Evry cédex, France

    • G Gaypay



Corresponding author

Correspondence to J-C Lambert.



