Introduction

The study of DNA polymorphism has become an active area of research in all important crops and in several model plant species such as Arabidopsis thaliana and Brachypodium distachyon. This involves the development and use of molecular markers, which have proved useful not only for marker-assisted selection during plant breeding, but also for understanding crop domestication and plant evolution. To resolve the pattern of DNA polymorphism in any crop, the ultimate approach would be to sequence or resequence the entire genome (or a part of it) in a large number of accessions. This was, however, unimaginable during the 1980s and still remains cost ineffective. DNA-based molecular markers (for example, restriction fragment length polymorphisms, random amplified polymorphic DNAs, simple sequence repeats (SSRs) and amplified fragment length polymorphisms (AFLPs)) have therefore largely been employed for the study of DNA polymorphism (Collard et al., 2005). Most of these molecular markers are based on restriction digestion of genomic DNA followed by hybridization of electrophoresed DNA, and/or on visualization of the products of PCR carried out using suitably designed primers. More recently, however, single nucleotide polymorphisms (SNPs), whose discovery was largely based on sequence information, became the markers of choice due to their abundance and uniform distribution throughout a genome. Once SNPs are discovered, genotyping can be done using any of the dozens of available methods. For SSR/SNP genotyping, some efforts were made in the past to provide the desired high throughput and cost effectiveness through the use of PCR tetrad machines (handling 384 PCR reactions at a time), multiplexing, multiple loading of gels and the use of automatic sequencers. However, this proved inadequate, and there has been an increasing demand to develop ultra-high-throughput, low-cost assays for a variety of novel marker systems including SNPs. 
These new methods will allow ultra-high-throughput genotyping of either one or a few individuals for hundreds of thousands of markers, or of thousands of individuals for one or a few markers. High-density oligonucleotide arrays, which are now becoming available in several crops, provide a means of achieving this goal of low-cost ultra-high-throughput genotyping. These arrays may also be custom made according to specific needs and have therefore also allowed the development of novel marker systems such as single feature polymorphisms (SFPs) (including gene-specific hybridization polymorphisms and gene expression markers), diversity array technology (DArT) and restriction site-associated DNA (RAD) markers, which have now become the markers of choice. Technologies that make use of tag arrays for detection of the products of genotyping reactions have also been developed. These novel array- or chip-based markers are useful for a variety of purposes, including genome-wide association studies, population studies, bulk segregant analysis, quantitative trait loci (QTL) interval mapping, whole genome profiling and background screening (Steinmetz et al., 2002; Winzeler et al., 2003; Wenzl et al., 2004, 2007a; Hazen et al., 2005; Kim et al., 2006). A brief account of the development and use of these high-throughput array-based molecular markers in plants is presented in this review.

Single nucleotide polymorphisms

Single nucleotide polymorphisms are DNA polymorphisms at the level of individual base pairs and constitute ∼90% of the genetic variation in any organism; DNA-based markers are therefore often classified into SNP and non-SNP markers. SNPs are generally discovered in silico from genomic or expressed sequence tag (EST) sequences available in the databases, or through sequencing or resequencing of candidate genes, PCR products or whole genomes in more than one genotype (see later). Once SNPs are discovered, genotyping for these markers can be done using any one of more than 30 available methods, although only a few of these are microarray based and provide the desired ultra-high throughput (for reviews, see Khlestkina and Salina, 2006; Kim and Misra, 2007). The choice of genotyping method largely depends upon the nature of the study. In some cases, one or more individuals need to be scanned for SNPs ranging in number from dozens to several thousands, whereas in other cases, thousands of individuals may need to be allelotyped for a specific locus. In either case, ultra-high-throughput and low-cost techniques are needed.
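The in silico discovery step mentioned above can be illustrated with a minimal sketch. It assumes that sequences from several genotypes have already been aligned to equal length; the function name find_snps, the genotype labels and the toy sequences are hypothetical and are not taken from any of the cited studies. Columns containing gaps or ambiguous bases are skipped, as these would represent InDels or low-quality calls rather than SNPs.

```python
from collections import Counter


def find_snps(aligned):
    """Return (position, allele_counts) for every polymorphic alignment column.

    aligned: dict mapping a genotype name to its aligned sequence.
    All sequences must have the same length (an assumption of this sketch).
    """
    seqs = list(aligned.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")

    snps = []
    for pos in range(length):
        column = [s[pos].upper() for s in seqs]
        # Skip gaps and ambiguous bases: these are InDels or poor calls, not SNPs.
        if any(base not in "ACGT" for base in column):
            continue
        counts = Counter(column)
        if len(counts) > 1:  # more than one allele observed at this position
            snps.append((pos, dict(counts)))
    return snps


# Toy alignment of three hypothetical genotypes.
example = {
    "genotype_A": "ATGCCGTA",
    "genotype_B": "ATGACGTA",
    "genotype_C": "ATGACGTT",
}
snps = find_snps(example)
for position, alleles in snps:
    print(position, alleles)
```

Real SNP discovery pipelines additionally weigh base quality, read depth and paralogous sequences; none of that is modelled here.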

High-density platforms for SNP genotyping

Several high-density platforms are now available for genotyping one or more genomic DNA samples for dozens to thousands of SNPs in parallel (Table 1). These platforms have been discussed in detail in several earlier reviews (Syvänen, 2005; Fan et al., 2006), so only an overview is included here.

Table 1 Micro-array-based high-throughput SNP genotyping systems

GoldenGate genotyping

Illumina's GoldenGate assay, based on the company's BeadArray/BeadChip technology, is one of the most popular platforms. It provides a cost-effective assay (per genotype cost $0.03) for genotyping a fairly large collection of samples for a customized pool of SNP markers in parallel (a pool may range from 96 to 1536 SNPs). Each assay involves a multiplexed SNP genotyping reaction that uses two allele-specific oligonucleotides and a locus-specific oligonucleotide for each SNP, the locus-specific oligonucleotide carrying an anti-tag sequence for detection by the BeadArray. At a particular level of multiplexing (pooling), 96 or 384 BeadArrays (each bead carrying a tag oligo for detection of the product of the SNP genotyping reaction) are arranged in a matrix, called the Sentrix Array Matrix, so that up to 384 samples can be processed in one reaction, thus permitting simultaneous genotyping of each of these 384 samples for as many as 1536 SNPs. Multiple pools can also be used to increase the number of SNPs further (for details with illustrations, consult Fan et al., 2003; Hyten et al., 2008). However, beyond a certain level of multiplexing, the Infinium assay, which is discussed later in this review, provides a better alternative.

GoldenGate technology is now being used for several crops. For instance, at the Southern California Genotyping Consortium, Oligonucleotide Pool Assays are being developed for several plant systems including Arabidopsis, barley, wheat and maize, which will be used in the future for high-throughput SNP genotyping in these plant systems. In barley, a pilot oligonucleotide pool assay for detection of SNPs in 1524 unigenes was initially developed at SCRI, UK, and is being extended to an oligonucleotide pool assay for SNPs in as many as ∼3000 unigenes. This platform is being used by the US barley Coordinated Agricultural Project to allelotype 3840 genotypes and by Association Genetics of UK Elite Barley (AGOUEB) to allelotype ∼1500 genotypes. The genotypes will also be assessed for about 40 traits that are pertinent to barley breeding, so the information may be used for a study of haplotype-trait associations (Hayes and Szücs, 2006). In maize also, genome-wide SNP genotyping has been initiated as a large collaborative effort, which has already made use of Illumina's GoldenGate assay platform to genotype 7200 RILs for 1300 SNPs using a customized Illumina BeadArray matrix (Yu and Buckler, 2006; Jones et al., 2007; Buckler, personal communication). In soybean, a custom-made 384-SNP GoldenGate assay was recently designed successfully for genotyping of three RIL mapping populations; these 384 SNPs were discovered through resequencing of five diverse accessions (the parents of the three RIL mapping populations) and were selected such that each SNP segregated in at least one mapping population (Hyten et al., 2008). Similarly, in wheat, a programme on whole genome SNP genotyping using Illumina's GoldenGate and ABI's SNaPshot is being executed at the University of California (http://wheat.pw.usda.gov/SNP/new/index.shtml). 
The above genotyping activity may certainly be extended further through the use of Solexa's ultra-high throughput and low-cost resequencing technology (for reviews, on ultra-high-throughput DNA sequencing, see Gupta, 2008; Mardis, 2008).

Infinium assay for whole genome genotyping

Illumina's BeadChip-based Infinium assay, which is considered to be a more global approach for genotyping, also utilizes BeadArray technology and is a direct approach allowing parallel detection of most SNPs in a genome. More importantly, it eliminates the multiplexing bottleneck in sample preparation (needed in GoldenGate), making assay scalability mainly dependent on array-feature density (Fan et al., 2006). The randomly amplified fragments representing total genomic sequence are hybridized to these BeadArrays available in the form of a BeadChip on a microscope slide with 12 sections/stripes (each section containing 1.1 million beads, carrying decoded oligonucleotides). The 12 stripes may be used either for loading 12 different bead pools for 720 000 assays for a single sample, or alternatively one can load a single bead pool 12 times for 60 000 assays across each of 12 different samples. Also, as many as 24, 48 or 96 BeadChips can be used simultaneously in a temperature-controlled chamber rack to allow robotic-assisted automated assay of multiple genomic DNA samples simultaneously, thus permitting genotyping of hundreds of thousands of SNPs in dozens of genotypes simultaneously (Syvänen, 2005; Gunderson et al., 2006; Steemers and Gunderson, 2007).
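The throughput figures quoted for the GoldenGate and Infinium assays follow from simple multiplication, which the short sketch below makes explicit. Only the numbers stated in this review are used (384 samples, 1536 SNPs per pool, $0.03 per genotype, 12 stripes and 60 000 assays per stripe); the function and constant names are illustrative assumptions, and the cost is kept in whole cents to avoid rounding error.

```python
# Figures quoted in the text for the GoldenGate assay.
GOLDENGATE_MAX_SAMPLES = 384
GOLDENGATE_MAX_SNPS_PER_POOL = 1536
GOLDENGATE_COST_CENTS_PER_GENOTYPE = 3  # $0.03 per genotype

# Figures quoted in the text for the Infinium BeadChip.
INFINIUM_STRIPES = 12
INFINIUM_ASSAYS_PER_STRIPE = 60_000


def goldengate_genotypes(samples=GOLDENGATE_MAX_SAMPLES,
                         snps=GOLDENGATE_MAX_SNPS_PER_POOL):
    """Number of genotype calls produced by one fully loaded run."""
    return samples * snps


def infinium_assays_per_sample(n_samples):
    """Assays per sample on one BeadChip.

    The text describes two layouts only: one sample across 12 different
    bead pools, or 12 samples each receiving the same single bead pool.
    """
    if n_samples not in (1, 12):
        raise ValueError("only the 1-sample and 12-sample layouts are described")
    return (INFINIUM_STRIPES // n_samples) * INFINIUM_ASSAYS_PER_STRIPE


genotypes = goldengate_genotypes()
dollars, cents = divmod(genotypes * GOLDENGATE_COST_CENTS_PER_GENOTYPE, 100)
print(f"GoldenGate: {genotypes} genotypes per run, about ${dollars}.{cents:02d}")
print("Infinium, 1 sample:", infinium_assays_per_sample(1), "assays")
print("Infinium, 12 samples:", infinium_assays_per_sample(12), "assays each")
```

The comparison shows why the Infinium layout scales with array-feature density: the number of assays per sample is fixed by the beads available on the chip rather than by the level of multiplexing in the reaction.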

GenomeLab SNPstream genotyping system

Like GoldenGate assay, this genotyping system of Beckman Coulter combines solid-phase primer extension assay and universal tags for SNP genotyping. The instrument designed for this system allows processing of 4600–3 000 000 genotypes per day. DNA samples are used for either 12 or 48 multiplex PCR in a 384 plate using tagged extension primers that are extended using single fluorescence-labelled nucleotide terminator reactions. The PCR-amplified fragments are resolved by hybridization to the complementary tags available on SNPware Tag Array plate having tag arrays in 384-well microplate format, each well with 16 or 52 unique tags that are complementary to the tags of the 12 or 48 extension primers, plus four controls to ensure accuracy. An individual SNP associated with a PCR-amplified fragment is identified by the position of the hybridizing tag in the well. This allows genotyping of 384 samples for either 12 or 48 SNPs per array, as against genotyping of relatively fewer samples for one thousand to more than a million SNPs in other high-throughput microarray-based SNP genotyping systems (Figure 1). This genotyping system has already been used for human blood/saliva samples for forensic purposes and in several plant systems including corn, canola, cotton (Shah et al., 2003) and poplar (Meirmans et al., 2006).

Figure 1

Comparison of SNP multiplexing levels and number of samples analyzed per array in microarray-based SNP genotyping systems (reproduced with permission from ref. Syvänen, 2005).

MegAllele genotyping system (molecular inversion probe or MIP technology)

MegAllele genotyping system of Affymetrix is based on ParAllele's Molecular Inversion Probe (MIP) Technology. It allows tens of thousands of genotyping reactions in each of four reaction tubes that are used for each assay. Possible addition of a single specific nucleotide is allowed in each of the four reaction tubes by adding in each tube only one of the four dNTPs, which differ for the four tubes. In contrast to this, GoldenGate uses allele-specific primer extension to score SNPs. Furthermore, MIP uses a single circularizable probe (called padlock probe), whereas in GoldenGate assay, both upstream and downstream probes are separate (Nilsson et al., 1994; Syvänen, 2005; Fan et al., 2006). However, MIP resembles GoldenGate assay in using tags (available in the form of glass GenFlex Tag Array in MIP and BeadArrays in case of GoldenGate assay) that are used for detection of the products of SNP genotyping reactions. A technology making use of modified padlock probes like those used in MegAllele system has been recently utilized in bread wheat (Reid et al., 2007), and in the future, it may be utilized in other crops also.

GeneChip and allele-specific oligonucleotide tiling arrays

Tiling arrays developed by Affymetrix GeneChip platform on the basis of known sequences in several organisms have also been used for SNP discovery and detection. These tiling arrays may be either designed for resequencing (sequencing by hybridization or SBH) of specific genomic regions for SNP genotyping or may be designed for interrogating every individual nucleotide in a template genomic sequence by multiple probes available on the array. In the latter case, probes are available in the form of probesets, so that for each SNP allele, there is one probeset with multiple probe pairs (each pair with a perfect match and a mismatch), the probes of the two alleles at a locus differing only at a specified position. In Affymetrix assay for SNP genotyping, a genomic representation generated through complexity reduction is labelled and hybridized to the above tiling array, which may contain up to 900 000 different oligos, each present in millions of copies (http://keck.med.yale.edu/affymetrix/technology.htm). The hybridization patterns are used for inferring SNP genotypes.

Tiling arrays have already been used in some crops for genome-wide discovery and detection of SNPs. In rice, where genome sequence of one genotype (Nipponbare) is already available, SNP discovery in 100 Mb of the rice genome has been undertaken at International Rice Research Institute using tiling microarrays that are based on allele-specific oligonucleotides from the non-repetitive regions of the genome. This approach allowed identification of 260 000 non-redundant SNPs by Perlegen's model-based (MB) algorithms (McNally et al., 2006; Collard et al., 2008). These SNPs are being currently validated (Collard, personal communication), and the collection of these rice SNPs is being extended through use of machine-learning (ML)-based techniques. The ML approach was used earlier in Arabidopsis, where in one study 20 diverse strains of this weed were genotyped for more than a million non-redundant SNPs for analyzing the patterns of DNA polymorphism (Clark et al., 2007), and in another study, a genotyping array based on a subset of 250 000 ‘tag’ SNPs was used to study genome-wide pattern of LD (Kim et al., 2007). An ‘InDel array’ representing 240 unique InDel markers was also recently developed and used in Arabidopsis to genotype InDels of one or more nucleotides in an ultra-high-throughput manner (Salathia et al., 2007).

SNP genotyping/allelotyping for a specific locus

During marker-assisted selection in plant breeding, one may be interested in microarray-based genotyping of thousands of plants for a specific gene of interest. This can be done by arraying PCR products from all segregating individual plants on a glass, followed by hybridization of this array with labelled probes representing alternative alleles of the gene. The utility of this technique described as tagged microarray marker approach has been demonstrated for humans and pea (Flavell et al., 2003; Ji et al., 2004). However, Affymetrix GeneChips can also be used for SNP genotyping of a number of samples for one or more genes of interest as done in Arabidopsis, where the array AT412 was used for the study of variation in several genes (for example, Eds16 (Cho et al., 1999), Rsf1 (Spiegelman et al., 2000) and FRI (Nordborg et al., 2002)).

The ultra-high-throughput resequencing technologies like 454/Roche, Solexa, AB SOLiD and HeliScope single molecule sequencer that have recently become available, and the SEQUENOM's MassArray genotyping system involving matrix-assisted laser desorption/ionization–time of flight mass spectroscopy can also be used for SNP discovery and detection. These technologies, however, do not fall within the scope of the present discussion on microarray-based markers. Efforts are also being made for microarray-based genome-wide capturing of exons for selective resequencing that may allow SNP genotyping (Hodges et al., 2007).

Single feature polymorphisms

When labelled genomic DNA from two or more genotypes is separately hybridized to the same high-density oligonucleotide array, SFPs are detected as significant differences in hybridization signals among the genotypes used (Figure 2). In this assay, the oligonucleotides available on the array function as probes and are described as features, hence the term ‘single feature polymorphism’. For visualizing the signal, whole genome DNA is either DNase I treated and then end labelled with γ-p32 dCTP, or else is directly labelled with biotin-14-dCTP. Sometimes, genomic DNA with reduced complexity or complementary RNA (cRNA) is also used to improve the power of SFP detection, although the use of cRNA also has some limitations (see later). The quantity of DNA needed for successful hybridization depends on the size of the genome, so more DNA is needed for complex genomes (for example, 300 ng for Arabidopsis, 420 ng for rice and 6 μg for wheat). SFPs detected in this manner represent allelic variations ranging from SNPs to large deletions, although in the majority of studies, most SFPs have been found to be SNPs.

Figure 2

Schematic representation of (a) the procedure used for detecting SFPs and ELPs using conventional oligonucleotide expression GeneChip, and (b) difference in (i) hybridization intensities of the reference and the genotype under investigation due to deletion (upper panel), SFP (lower panel) and (ii) pattern of hybridization between the reference and the genotype under investigation due to ELP.

Designing of microarrays and their use for detection of SFPs/expression level polymorphisms

Expression oligonucleotide arrays that are often used for SFP technology include Affymetrix (http://www.affymetrix.com) GeneChips and Nimblegen (http://www.nimblegen.com) arrays, which may be either catalogue microarrays meant for a variety of uses or custom made for the intended use. However, only microarrays with short probes (25 bp) are used for SFP technology, as these detect sequence polymorphisms with a high level of specificity; probes that are 200–1000 bp long are not suitable, as they tolerate mismatches in molecular hybridization (Zhu and Salmeron, 2007). Also, to ensure that only a single locus hybridizes to each feature, sequences that are likely to hybridize to multiple loci are eliminated (although repeat sequences and multicopy genes have also been used for SFP detection).

The oligonucleotide probes (features) needed for SFP analysis can be designed either from genomic sequences (as done in Arabidopsis) or from cDNA/EST sequences (as done in barley). In an Affymetrix GeneChip often used for SFP analysis, each of a large number of genes is represented by a variable number of probes covering the entire gene sequence, thus constituting a probeset (the entire GeneChip has as many probesets as there are genes to be sampled). Such a strategy provides an opportunity to assay multiple loci within each gene for detecting polymorphisms (that is, SFPs). However, when cRNA is used as a surrogate for genomic DNA, multiple tissues may have to be sampled to take care of the spatial/temporal expression of genes. In addition, when cRNA is used, SFPs need to be distinguished from expression level polymorphisms (ELPs), as two genotypes may have no sequence variation (SFP) in a particular gene but may still differ in its expression level (ELP). In an ELP, all probes of a probeset give the same hybridization affinity with a single cRNA sample, and the hybridization intensity differs only between cRNA samples (probeset level polymorphism). In an SFP, by contrast, the affinity of a particular probe differs from that of all other probes in the probeset for the same cRNA sample (probe level polymorphism; see Figure 2).
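The distinction between probeset level polymorphism (ELP) and probe level polymorphism (SFP) can be sketched in a few lines. The example below is a deliberately simplified illustration and is not the statistical procedure used in any of the cited studies: it assumes log2 hybridization intensities for the probes of one probeset in a reference and a test genotype, takes the median probe difference as the probeset-level shift, and flags individual probes whose difference departs from that shift. The function name classify_probeset and both thresholds are arbitrary assumptions.

```python
from statistics import median


def classify_probeset(ref_log2, test_log2, elp_threshold=1.0, sfp_threshold=1.0):
    """Separate a probeset-wide shift (ELP) from single-probe outliers (SFP).

    ref_log2, test_log2: log2 intensities of the same probes in two genotypes.
    Thresholds are illustrative values, not published cut-offs.
    """
    if len(ref_log2) != len(test_log2):
        raise ValueError("both genotypes must be measured on the same probes")

    diffs = [r - t for r, t in zip(ref_log2, test_log2)]
    shift = median(diffs)  # change shared by the whole probeset
    residuals = [d - shift for d in diffs]  # what is left for each probe

    sfp_probes = [i for i, res in enumerate(residuals) if abs(res) > sfp_threshold]
    return {"elp": abs(shift) > elp_threshold, "sfp_probes": sfp_probes}


reference = [10.0, 10.2, 9.8, 10.1, 10.0]

# All probes drop together: expression differs, sequence need not.
lower_expression = [8.0, 8.2, 7.8, 8.1, 8.0]
# Only the third probe drops: a candidate sequence polymorphism under it.
one_probe_affected = [10.0, 10.2, 7.3, 10.1, 10.0]

print(classify_probeset(reference, lower_expression))
print(classify_probeset(reference, one_probe_affected))
```

Using the median rather than the mean keeps a single strongly affected probe from being mistaken for a shift of the whole probeset, which is the same intuition behind analysing the probeset level first and the probe level second.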

SFPs in simple genome of yeast

In the very first study involving SFPs, Winzeler et al. (1998) initially identified 3714 SFPs between two different strains of budding yeast. Subsequently, SFPs in yeast were put to a variety of uses including fine mapping and positional cloning of a QTL for high-temperature growth (Steinmetz et al., 2002) and also for assessment of genealogical relationship among 14 strains of budding yeast (Winzeler et al., 2003). In place of genomic DNA, cDNA/cRNA was also used to study SFPs in yeast, which allowed parallel identification of ELPs and SFPs (Brem et al., 2002; Ronald et al., 2005).

SFPs in complex genomes of seed plants

In the initial phase of SFP development, it was thought that the technique is suitable for only small genomes, as an increase in genome size leads to significant reduction in signal-to-noise ratio. However, it was later recognized that even in complex genomes, SFPs may be detected with reasonable accuracy, provided genome complexity is reduced during sample preparation, and replicating arrays involving multiple tissues/development stages are used in cases of hybridization with cRNAs (see later for details). Consequently, the technique was later used in a number of seed plants including those with moderately complex genomes (for example, Arabidopsis and rice) and also those with large and highly complex genomes (for example, maize, soybean, tomato, lettuce, barley and wheat). The results of SFP analysis in these plant species with complex genomes are summarized in Table 2. However, there are some limitations that make this technology less competitive with some of the recently developed ultra-high-throughput SNP technologies discussed earlier in this review (see later for details).

Table 2 List of SFP studies conducted in various seed plants

As evident from the information presented in Table 2, SFPs have been used for a variety of purposes including detection of marker-trait associations, which sometimes involved construction of a molecular map followed by QTL interval mapping. For instance, in Arabidopsis, bulk segregant analysis has been used for mapping of circadian and developmental genes (Hazen et al., 2005), ion accumulation genes (Gong et al., 2004) and light-responsive QTLs (Wolyn et al., 2004). Similarly, in tomato, 17 SFPs were identified, which were tightly linked to a disease resistance locus (T Zhu, unpublished data; reported by Zhu and Salmeron, 2007). Other important examples of the utility of SFPs are the development of a map of 34 000 SFP loci representing 11 000 genes in maize (Zhu et al., 2006) and that of 8500 SFP loci from 6000 genes in tomato (Salmeron and Zhu, 2007).

Strengths and limitations of SFP technology

It has been argued that for crop plants with large genomes carrying a high proportion of repetitive DNA, SNP genotyping for association studies at the whole genome level becomes prohibitive. For instance, in maize, an estimated one million SNP markers are needed for genome-wide association studies (Gore et al., 2007), although this number may be substantially reduced if tagSNPs are used. Genotyping on this scale seems unnecessary, because a large proportion of these SNPs would belong to the non-coding regions. Under these circumstances, an attractive alternative is provided by SFP technology, which also allows coupling of genotyping with gene expression analysis. However, to make SFP technology useful in crops with complex genomes, a complexity reduction and gene enrichment method needs to be developed for the preparation of the target DNA that is used for hybridization. Unfortunately, initial results involving four different gene-enrichment methods in maize are not very encouraging, and further work seems to be needed to develop not only improved gene-enrichment methods, but also custom-designed arrays for SFP detection/genotyping (Gore et al., 2007).

Several other limitations of using SFP technology in crops with complex genomes that have been recognized during the use of this technology include the following: First, while using cRNA as surrogate for genomic DNA, extensive replications with samples from multiple tissues are needed, thus increasing the cost per data point. Second, for achieving reasonable SFP detection power in a crop like maize, 20% or higher false discovery rate (FDR) needs to be allowed (Gore et al., 2007), although for several inbred plant species with smaller genomes, a much lower FDR may be used without loss of power. Third, in an array-hybridization experiment in maize, the detection of probeset polymorphism (ELPs) was found to be more effective than the detection of probe polymorphism (SFP), thus limiting the resolution to gene level rather than to nucleotide level (Gore et al., 2007; Zhu and Salmeron, 2007); even in Arabidopsis, it was shown that ELPs rather than SFPs are the major source of variation in SFP technology (Kliebenstein et al., 2006). Fourth, SFP technology often fails to detect polymorphisms due to SNPs occurring at the edges of the oligonucleotide probes and mainly detects only those polymorphisms that are due to internal SNPs (at positions 6–15, as observed in maize and barley).

Efforts to improve SFP technology

As mentioned above, complexity reduction and gene enrichment of the target DNA is one approach to make SFP technology suitable for complex genomes. This has often been achieved through the use of cRNAs, as done in Arabidopsis and several crop plants, including barley, rice, maize and wheat. This allows simultaneous acquisition of data for SFP genotyping and expression analysis, and thus facilitates development of two different types of markers, SFPs and gene expression markers; the latter recorded as large differences in transcript levels between the parents of a mapping population. In Arabidopsis, both these types of markers were used for construction of a high-density SFP/gene expression marker map and for mapping of expression QTLs or ELPs (West et al., 2006).

The other complexity reduction and gene-enrichment methods include methylation filtration, C0t filtration and AFLP, but these methods offered only a modest improvement in power to detect SFPs, when used with maize genome (cf. Gore et al., 2007). Recently, Salmeron and Zhu (2007) proposed a new enzyme-mediated genome complexity reduction method to detect what are described as ‘gene-specific hybridization polymorphisms’. In this method, separate libraries of fragments produced using methylation sensitive/insensitive enzymes were used for hybridization and led to the detection of gene-specific hybridization polymorphism markers. This method has not yet been widely tested, but its utility in maize suggested that it may also prove useful for other crops with large and complex genomes.

Single feature polymorphism detection/genotyping has also been improved through the use of improved statistical tools. For instance, the use of robustified projection pursuit involved differentiation of signal intensities between two genotypes, first at the level of the probeset and then at the level of individual probes. This should help in removing copy number effects and should also allow distinction between SFPs and ELPs. Several other statistical tools and software packages are available that should lead to improvement in SFP detection.

Methods have also been suggested to reduce the number of false positives and false negatives observed during SFP studies. The false positives are believed to result from alternative splicing or polyadenylation, gene duplications, chance alignments with RNA from another region, gene expression markers (resulting from polymorphism(s) at trans-acting regulators and secondary structures in target DNA) or SNPs that occur immediately adjacent to the position of a 25mer probe (cf. Luo et al., 2007; Zhu and Salmeron, 2007; Potokina et al., 2008). The number of these false positives can be reduced by (i) sampling more than one tissue per genotype (Rostoks et al., 2005; Luo et al., 2007); (ii) excluding sequence duplications during array design; (iii) studying the segregation pattern in the progeny of the cross made from the genotypes in question (Luo et al., 2007); (iv) adjusting stringency parameters (Luo et al., 2007); and (v) using replicate arrays. Similarly, false negatives can be attributed to the position (in nucleotides) of known SNP(s) on the corresponding array feature, and can be reduced by increasing feature density per gene (Rostoks et al., 2005).

Diversity array technology

Diversity array technology is a high-throughput microarray hybridization-based technique that allows simultaneous typing of several hundred polymorphic loci spread over a genome without any previous sequence information about these loci (Jaccoud et al., 2001; Wenzl et al., 2004). The technique has also been shown to be reproducible and cost effective. Generally, 50–100 ng of genomic DNA is used for genotyping nearly 5000–8000 genomic loci in parallel in a single-reaction assay to discover polymorphic markers. The same platform is used for both discovery and scoring of markers, so that no specific assay for genotyping needs to be developed after marker discovery, except an initial assembly of all polymorphic markers (detected in a metagenome) into a single genotypic array. The genotyping array with only polymorphic markers thus developed is routinely used for genotyping (Huttner et al., 2005).

How are DArT markers developed and scored?

Diversity array technology markers are polymorphic segments of genomic DNA that are present in a particular genomic representation and are identified through differential hybridization on a diversity ‘genotyping array’ that is developed specifically for this purpose. These markers are biallelic and dominant (presence vs absence) or co-dominant (two doses vs one dose vs absent). DArT technology involves initial development of a ‘discovery array’, which is then used to identify polymorphic DArT markers that are assembled into a ‘genotyping array’. The ‘discovery array’ is developed from a metagenome (pool of genomes representing the germplasm of interest) that is subjected to complexity reduction to reduce the level of repetitive DNA, as repetitive sequences interfere with DArT assays and do not contribute to polymorphic clones (Kilian et al., 2005). Individual clones from genomic representation are amplified and spotted onto glass slides to give the desired ‘discovery array’ (www.DiversityArrays.com). Labelled genomic representations of individual genomes that were earlier included in the metagenome pool are then hybridized to this discovery array, and polymorphic clones (called DArT markers) thus detected are assembled into a ‘genotyping array’ for routine genotyping work (Figure 3). The software DArTsoft is used for analysis of hybridization intensities. The efficiency of identification of polymorphic DArT markers depends on the level of genetic diversity available within the metagenome pool that is used for developing the discovery array. For instance, only 5–10% of wheat and barley DArT clones and 25–30% of cassava DArT clones were found to be polymorphic (Huttner et al., 2005).
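The step from a ‘discovery array’ to a ‘genotyping array’ amounts to scoring each clone as present or absent in every genotype and keeping only the clones that differ among genotypes. The sketch below illustrates this for dominant (presence vs absence) scoring. It is not a description of DArTsoft: the fixed intensity threshold, the function names and the toy intensities are assumptions made purely for illustration.

```python
def score_clones(intensities, threshold=0.5):
    """Convert relative hybridization intensities to presence (1) / absence (0).

    intensities: dict mapping a clone name to a list of relative intensities,
    one per genotype. The fixed threshold is an illustrative assumption.
    """
    return {
        clone: [1 if value >= threshold else 0 for value in values]
        for clone, values in intensities.items()
    }


def polymorphic_clones(scores):
    """Keep clones that are present in some genotypes and absent in others."""
    return [
        clone for clone, calls in scores.items()
        if 0 < sum(calls) < len(calls)
    ]


# Toy discovery array: four clones hybridized with three genotypes.
discovery_array = {
    "clone_1": [0.90, 0.80, 0.85],  # present everywhere: monomorphic
    "clone_2": [0.90, 0.10, 0.80],  # absent in the second genotype
    "clone_3": [0.05, 0.10, 0.07],  # absent everywhere: monomorphic
    "clone_4": [0.10, 0.90, 0.95],  # absent in the first genotype
}

scores = score_clones(discovery_array)
genotyping_array = polymorphic_clones(scores)
fraction = len(genotyping_array) / len(discovery_array)
print(genotyping_array, f"{fraction:.0%} polymorphic")
```

In this toy case half of the clones are retained; the much lower proportions reported for wheat and barley reflect the limited diversity within the metagenome pools used there.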

Figure 3

Schematic representation of steps involved in the development of ‘genotyping array’ for DArT technology.

Molecular basis of DArT polymorphisms

Diversity arrays generally detect polymorphisms due to single base-pair changes (SNPs) at the restriction sites of endonucleases, and InDels/rearrangements within restriction fragments (Jaccoud et al., 2001). The complexity reduction method applied to DNA samples to be used for DArT analysis will determine the type of polymorphisms detected by DArT markers. For instance, if we use a methylation-sensitive restriction enzyme like PstI, it will presumably identify polymorphisms due to both sequence variation (SNP and InDel) and DNA methylation (Kilian et al., 2005). Although the dynamic nature of methylation states may suggest instability of some of these markers, in barley, the majority (97%) of DArT markers from a PstI/BstNI representation were found to be stable (Wenzl et al., 2004). Similarly, in Arabidopsis, while comparing 107 genome sequences of Ler strain with those of a control Col strain, ∼90% of DArT markers from a representation generated with methylation-sensitive PstI detected SNP variation and only the remaining <10% were attributed to methylation polymorphism (Wittenberg et al., 2005).

Present status of DArT development in different crop plants

Diversity array technology markers have already been developed for a fairly large number of plant species, including some orphan crops for which no molecular information is available (Huttner et al., 2006; for details, see Table 3). The initial proof of concept was provided using rice, which is one of the major cereal crops and has a relatively simple genome (Jaccoud et al., 2001). This study was supplemented by a study of about 30 000 rice genomic DNA clones involving 14 different complexity reduction methods, each involving a different frequent cutter restriction enzyme along with methylation-sensitive PstI, which is a rare cutter (Kilian et al., 2005). Later, barley and other crops having more complex genomes were also used (Wenzl et al., 2004; also see Table 3). As is evident from Table 3, DArT markers have now been developed on a large scale (for instance, >3000 markers in wheat) and have been extensively utilized for the study of genetic diversity, the preparation of integrated framework linkage maps and association mapping (White et al., 2008). DArT markers have also been used for QTL mapping for Fusarium head blight in wheat and for leaf pubescence in barley (Rheault et al., 2007; Wenzl et al., 2007a).

Table 3 A list of DArT studies conducted in crop plants

RAD markers

More recently, a variety of microarrays (including tiling/cDNA/oligonucleotide arrays) have also been used to develop the so-called RAD markers for the study of genome-wide variations associated with restriction sites for individual restriction enzymes. For this purpose, a genome-wide library of RAD tags is first developed from genomic DNA and is then hybridized to the chosen microarray to detect all restriction-site-associated variations in a single assay. The development of RAD tags involves the following steps: (i) digestion of genomic DNA with a specific restriction enzyme; (ii) ligation of biotinylated linkers to the digested DNA; (iii) random shearing of ligated DNA into fragments smaller than the average distance between restriction sites, leaving small fragments with restriction sites attached to the biotinylated linkers; (iv) immobilization of these fragments on streptavidin-coated beads; and (v) release of DNA tags from the beads by digestion at the original restriction sites. This process specifically isolates DNA tags directly flanking the restriction sites of a particular restriction enzyme throughout the genome. The RAD tags from each of a number of samples, when hybridized to a microarray, allow high-throughput identification and/or typing of differential hybridization patterns. These markers have a clear advantage over the existing marker systems (for example, restriction fragment length polymorphisms, AFLPs and DArT markers) that could assay only a subset of SNPs that disrupt restriction sites. RAD markers were successfully developed in a number of organisms including fruit fly, zebrafish, threespine stickleback and Neurospora (Lewis et al., 2007; Miller et al., 2007a, 2007b) and will certainly find their way into most laboratories working on higher plants.
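The principle that RAD tags are the sequences directly flanking each restriction site can be illustrated in silico. The Python sketch below is only a toy model under stated assumptions: it uses the EcoRI recognition sequence GAATTC as an example enzyme, a fixed tag length in place of random shearing, and two invented sequences. The function name `rad_tags` is hypothetical, and no linker ligation, bead capture or array hybridization is modelled.

```python
def rad_tags(sequence, site="GAATTC", tag_len=8):
    """Return the set of tag_len-base sequences directly flanking each `site`."""
    tags = set()
    start = sequence.find(site)
    while start != -1:
        left = sequence[max(0, start - tag_len):start]
        right = sequence[start + len(site):start + len(site) + tag_len]
        # Keep only full-length tags (sites near the sequence ends give short ones).
        tags.update(tag for tag in (left, right) if len(tag) == tag_len)
        start = sequence.find(site, start + 1)
    return tags


# Invented sequences: the variant carries a SNP (A -> G) inside the second site.
reference = "ACGTACGTTT" + "GAATTC" + "GGCCAATTAGCTAG" + "GAATTC" + "TTTTCCCCAAAA"
variant = "ACGTACGTTT" + "GAATTC" + "GGCCAATTAGCTAG" + "GAGTTC" + "TTTTCCCCAAAA"

# Tags present in the reference but missing from the variant.
lost_tags = rad_tags(reference) - rad_tags(variant)
print(sorted(lost_tags))
```

The two tags flanking the disrupted site are absent from the variant, which is the kind of difference that would appear as a differential hybridization signal when the tags from two samples are compared on an array.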

Statistical tools for microarray-based markers

The data generated due to hybridization of DNA on microarrays that are used for microarray-based DNA markers are invariably subjected to detailed statistical analysis. Therefore, a brief discussion of statistical tools used for this purpose is in order. For instance, before accepting an oligonucleotide probe as a marker during SNP/SFP/DArT genotyping assays, one would need to find out whether or not the difference between hybridization signals obtained due to two genotypes is significant. Different methods have been used for this purpose. For instance, methods are available for the estimation of expected hybridization intensities (Î), which can be compared with observed intensities (I), so that a significant difference (as shown by t-test) between the means of ratio Î/I for two genotypes involving the same probe will suggest that the probe is a polymorphic marker. Computer programs have been developed for this purpose (Ronald et al., 2005).
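The comparison of mean expected-to-observed intensity ratios between two genotypes can be sketched as follows. This is a minimal illustration and not the program of Ronald et al. (2005): it assumes Welch's unequal-variance form of the t-test, the function name `welch_t` is hypothetical, and the replicate ratios are invented. Only the t statistic and its approximate degrees of freedom are computed; converting them to a P-value requires the t distribution, which is not part of the Python standard library.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Invented expected/observed intensity ratios for one probe,
# three replicate hybridizations per genotype.
ratios_genotype_a = [1.02, 0.98, 1.05]
ratios_genotype_b = [1.90, 2.10, 2.00]

t_stat, dof = welch_t(ratios_genotype_a, ratios_genotype_b)
print(f"t = {t_stat:.2f}, df = {dof:.2f}")
```

A large absolute t value for a probe, as in this invented example, would flag it as a candidate polymorphic marker; in practice the test is repeated for every probe, which is why a correction for multiple testing is needed.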

The statistical criterion most commonly used for the analysis of data for array-based markers is the FDR threshold, which is used to control type I error (Benjamini and Hochberg, 1995, 2000) and to minimize the number of false positives during SNP/SFP genotyping. This is achieved by first ranking the test statistics by their P-values/Q-values and then rejecting or retaining hypotheses on the basis of the chosen FDR value (FDRs are calculated following Benjamini and Hochberg, 1995). This approach provides more leverage and is less stringent than other methods, and it therefore also reduces the number of false negatives significantly.
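The ranking-and-thresholding step can be made concrete with a short Python sketch of the Benjamini and Hochberg (1995) step-up procedure. The function name `benjamini_hochberg` and the P-values are illustrative assumptions; the sketch assumes one P-value per probe and independent tests.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans, True where the hypothesis is rejected at `fdr`."""
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest P-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * fdr.
    cutoff = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * fdr:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for index in order[:cutoff]:
        rejected[index] = True
    return rejected


# Invented P-values for ten probes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
calls = benjamini_hochberg(p_values, fdr=0.05)
print(calls)
```

With these invented values, only the first two probes pass at an FDR of 0.05. Because the procedure is step-up, a probe can be accepted even when its own P-value exceeds its rank threshold, provided that a higher-ranked P-value meets its threshold.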

The utility of FDR as a statistical tool also depends on whether MB or ML method is used for genotyping. For instance, during SNP genotyping in A. thaliana, it was shown that at the same FDR, ML algorithm identifies significantly more true SNPs than MB methods (Clark et al., 2007). It was inferred, therefore, that ML method can be used to complement and extend SNP predictions made through MB approach. Furthermore, ML algorithm helps us to detect polymorphic regions containing InDels and variation hotspots, where MB SNP detection algorithms generally fail to identify individual SNPs.

Other statistical methods that are specifically dedicated to SFP/SNP detection include ‘robustified projection pursuit’ (Cui et al., 2005), GeSNP (Greenhall et al., 2007) and two other methods developed by Luo et al. (2007). In a recent study, three methods were compared by testing their relative efficiencies in detecting SFPs associated with known barley SNPs (already mapped using the same population). It was concluded that different methods have their own advantages/disadvantages, and each is useful only under specific circumstances (Luo et al., 2007). For instance, the method developed by Winzeler et al. (1998) is appropriate for SFP prediction from genomic DNA microarray data only, because it assumes that DNA molecules hybridize uniformly onto a microarray chip across all genes. The method developed by Ronald et al. (2005), in contrast, can utilize cRNA microarray data, but does not take into account the possible large variation in the abundance of transcripts belonging to different genes. The method developed by Cui et al. (2005), in turn, relies solely on the data obtained from perfect match probes for SFP detection. It may, therefore, be important to take into account the assumptions associated with each method when selecting an analytical method to predict SFPs. Two methods recently developed by Luo et al. (2007) address most of these issues and can be used for SFP analysis in seed plants. A web-based program ‘GeSNP’, which was initially designed to detect SNPs from microarray data, was recently tested for its suitability to predict SFPs in mice, humans and chimpanzees (Greenhall et al., 2007), but its use for plants has yet to be tried.

Perspectives and conclusions

The use of microarrays for the development of DNA-based markers, such as SNP, SFP, DArT and RAD markers, has provided technology platforms for medium- to ultra-high-throughput genotyping at a low cost (Table 4). These technologies have also made it possible to access polymorphic regions on a genome-wide scale at a low cost, and have been shown to be particularly useful for genomes in which the level of polymorphism is low and large-scale genome sequencing is still time consuming and expensive (for example, soybean, tomato and bread wheat). In the future, these array-based marker technologies and other non-array-based high-throughput genotyping platforms (for example, SNPlex and MassArray) will be used for a variety of studies including the development of high-density molecular maps, which may then be used for QTL interval mapping and LD-based association studies for functional and evolutionary studies.

Table 4 A summary of differences among different array-based techniques for detecting DNA polymorphisms

It should also be recognized that most SFP, DArT and RAD markers are actually SNPs; therefore, these marker systems have all the merits of SNPs without the requirement of sequence-based discovery of SNPs. For several crops, GeneChips for SFP genotyping and the ‘diversity microarrays’ for DArT markers have already been developed, making their subsequent use cost effective. Currently, the major competitors for microarray-based genotyping are Illumina and Affymetrix, and the genotyping platforms offered by them have their own merits and demerits, which have been briefly outlined in this review.

Genome-wide SNP genotyping has also been improved through the development of SNPchips, where only tagSNPs are genotyped instead of all known SNPs (a tagSNP is a representative SNP in a genomic region with high LD, and therefore allows identification of genetic variation without genotyping every SNP). It is also recognized that in many studies, one would like to examine many samples for fewer SNPs rather than a few samples for thousands of SNPs, unless the primary objective of such a study is to examine the genome-wide pattern of nucleotide diversity. This will also bring down the cost of genotyping where specific regions or candidate genes are examined for allelotyping.
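The idea of a tagSNP standing in for its neighbours in high LD can be illustrated with a small greedy selection sketch in Python. It is only one of several possible selection strategies and is not taken from any particular SNPchip design. It assumes phased haplotypes coded 0/1, polymorphic SNPs only, the squared correlation r² as the LD measure, and an arbitrary threshold of 0.8; the function names and data are hypothetical.

```python
from statistics import mean


def r_squared(x, y):
    """Squared correlation (r^2) between two biallelic SNPs coded 0/1 per haplotype."""
    mean_x, mean_y = mean(x), mean(y)
    cov = mean((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = mean((a - mean_x) ** 2 for a in x)
    var_y = mean((b - mean_y) ** 2 for b in y)
    return cov ** 2 / (var_x * var_y)  # assumes both SNPs are polymorphic


def greedy_tag_snps(haplotypes, min_r2=0.8):
    """Greedily pick tagSNPs; each pick covers all untagged SNPs with r^2 >= min_r2."""
    untagged = set(haplotypes)
    tags = {}
    while untagged:
        def coverage(snp):
            return sum(r_squared(haplotypes[snp], haplotypes[other]) >= min_r2
                       for other in untagged)
        # Sorting makes ties resolve deterministically (first name wins).
        best = max(sorted(untagged), key=coverage)
        covered = {other for other in untagged
                   if r_squared(haplotypes[best], haplotypes[other]) >= min_r2}
        tags[best] = sorted(covered)
        untagged -= covered
    return tags


# Invented haplotype data: four SNPs scored on six haplotypes.
haplotypes = {
    "snp1": [0, 0, 1, 1, 0, 1],
    "snp2": [0, 0, 1, 1, 0, 1],  # identical to snp1
    "snp3": [1, 1, 0, 0, 1, 0],  # complement of snp1, so r^2 = 1
    "snp4": [0, 1, 0, 1, 0, 1],  # weakly correlated with the others
}
tag_snps = greedy_tag_snps(haplotypes)
print(tag_snps)
```

In this toy example two tagSNPs (snp1 and snp4) are sufficient to represent all four SNPs, which is how a tagSNP-based chip reduces the number of assays per sample.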

It has also been realized by many institutions and establishments that instead of having a specific high-throughput genotyping system installed, one may prefer to get the genotyping done either at a regional or national facility created in the public sector or at a commercial undertaking providing this service. This is desirable because an individual institution would not have enough genotyping work for optimum utilization of the instrument and would also lose the option of choosing any one of the several available genotyping platforms. Also, not many institutions that would like to have a genotyping facility can afford to have ultra-high-throughput genotyping platforms installed on site, although they have the funds for getting the genotyping done on contract.