The advent of next-generation sequencing (NGS) has revolutionized genomic and transcriptomic approaches to biology. These new sequencing tools are also valuable for the discovery, validation and assessment of genetic markers in populations. Here we review and discuss best practices for several NGS methods for genome-wide genetic marker development and genotyping that use restriction enzyme digestion of target genomes to reduce the complexity of the target. These new methods, which include reduced-representation sequencing using reduced-representation libraries (RRLs) or complexity reduction of polymorphic sequences (CRoPS), restriction-site-associated DNA sequencing (RAD-seq) and low-coverage genotyping, are applicable both to model organisms with high-quality reference genome sequences and, excitingly, to non-model species with no existing genomic data.
New methods that make use of high-throughput sequencing are enabling the simultaneous discovery and sequencing of thousands of genetic markers across whole genomes.
These methods can be used to study wild populations of tens or hundreds of individuals for which genomic resources were not previously available.
They also enable the rapid genotyping of hundreds of individuals in a mapping cross, for quantitative trait locus (QTL) mapping and marker-assisted selection.
We describe best practices and make recommendations for a group of methods involving the use of restriction enzymes, namely reduced-representation libraries, complexity reduction of polymorphic sequences, restriction-site-associated DNA sequencing, multiplexed shotgun genotyping and genotyping by sequencing.
We discuss the impact of several factors — such as the availability of genomic resources, the levels of polymorphism, the pooling of samples and the choice of restriction enzyme — on the design and implementation of high-throughput marker discovery and genotyping experiments.
Analysing data from these methods can be challenging, and we describe new methods for processing high-throughput marker data.
At present these methods are far more economical than whole-genome sequencing. We discuss how this situation is likely to change over the next few years, as sequencing costs continue to fall rapidly.
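To make the point about restriction enzyme choice concrete, the following is a minimal sketch of how one might estimate the expected number of cut sites for an enzyme before designing an experiment. It assumes that bases occur independently along the genome and that the only compositional parameter is overall GC content; real genomes deviate from this, so the result is a rough planning figure only. The function names `site_probability` and `expected_cut_sites` are hypothetical and are not taken from any of the methods reviewed here.

```python
def site_probability(site: str, gc_content: float = 0.5) -> float:
    """Probability that a random genomic position starts an occurrence of `site`.

    Assumes independent bases, with G and C each at frequency gc_content / 2
    and A and T each at frequency (1 - gc_content) / 2.
    """
    if not 0.0 <= gc_content <= 1.0:
        raise ValueError("gc_content must be between 0 and 1")
    p_gc = gc_content / 2.0
    p_at = (1.0 - gc_content) / 2.0
    prob = 1.0
    for base in site.upper():
        if base in "GC":
            prob *= p_gc
        elif base in "AT":
            prob *= p_at
        else:
            raise ValueError(f"unsupported base in recognition site: {base!r}")
    return prob


def expected_cut_sites(genome_size_bp: int, site: str, gc_content: float = 0.5) -> float:
    """Expected number of recognition sites in a genome of the given size."""
    return genome_size_bp * site_probability(site, gc_content)


if __name__ == "__main__":
    genome_size = 500_000_000  # illustrative genome size in base pairs (assumption)
    for name, site in [("EcoRI", "GAATTC"), ("SbfI", "CCTGCAGG")]:
        print(name, round(expected_cut_sites(genome_size, site)))
```

Under these assumptions a six-base cutter is expected to cut about 16 times more often than an eight-base cutter at 50% GC content, which is one reason why the choice of enzyme sets the approximate number of markers and therefore the sequencing effort needed per sample.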
We are grateful to P. Andolfatto, E. Buckler, W. Cresko, R. Elshire, E. Johnson, S. Mitchell, D. Stern and four anonymous referees for reviewing and discussing drafts of this manuscript. We thank S. Bassham, S. Baxter, C. Eland, K. Gharbi, M. Liu, J. Taggart, and P. Fuentes Utrilla for discussions that have improved our understanding of these methods. J.W.D. and M.L.B. are funded by the UK Natural Environment Research Council, grant NE/H019804/1. P.A.H. and J.M.C. received funding support from the US National Institutes of Health (NIH) grant 1R24GM079486-01A1, the US National Science Foundation grant IOS-0843392 and a Keck Foundation grant to W. Cresko. J.M.C. was also funded by the NIH National Research Service Award Ruth L. Kirschstein postdoctoral fellowship 1F32GM095213-01. P.D.E. was supported by grants R21HG003834 and R21HG006036 from the US National Human Genome Research Institute awarded to E. Johnson.
- Quantitative trait locus
(QTL). A locus that controls a quantitative phenotypic trait, identified by showing a statistical association between genetic markers surrounding the locus and phenotypic measurements.
- Marker-assisted selection
The use of genetic markers to predict the inheritance of alleles at a closely linked trait locus.
- Restriction fragment length polymorphism
(RFLP). A fragment-length variant that is generated through the presence or absence of a restriction enzyme recognition site. Restriction sites can be gained or lost by base substitutions, insertions or deletions.
- Amplified fragment length polymorphism
(AFLP). A marker method in which genomic DNA from different strains is digested with restriction enzymes, adaptors are ligated to the resulting fragments and a subset of the fragments is selectively PCR amplified. DNA fragments that are amplified in one strain, but not the other, are used as polymorphic markers.
- Microsatellite
A class of repetitive DNA that is made up of repeats that are 2–8 nucleotides in length. They can be highly polymorphic and are frequently used as molecular markers in population genetics studies.
- Optical mapping
A method for creating a map of a genome by stretching DNA in microfluidic channels on a slide for visualization on a fluorescent microscope. The DNA is then digested by restriction enzymes and the sizes of these fragments are inferred by the integrated intensity of the fluorescent intercalator dye.
- FST
(Wright's fixation index). The fraction of the total genetic variation that is distributed among subpopulations in a subdivided population.
- Imputation
A statistical method for handling missing data in which the missing values are replaced by estimated values.
- Recombinant inbred lines
(RILs). A population of fully homozygous individuals that is obtained through the repeated selfing of F1 hybrids, and that is composed of 50% of each original parental genome in different combinations.
- Hidden Markov model
A statistical approach that is used to estimate a series of hidden states (for example, ancestry at loci along a chromosome). The method is based on observations of the states that have uncertainty (for example, the ancestral assignment of sequence reads) and the expected probability of transitions between states (for example, recombination breakpoints).
- Soft ancestry calls
Assigning probabilities to ancestral (for example, parental or grandparental) genotypes, rather than making explicit, 'hard' calls. This approach appropriately propagates uncertainty (which often arises around recombination breakpoints) in individual ancestry assignments, thus enabling a more accurate inference of breakpoint location.
- Scaffold
A genomic unit composed of one or more contigs that have been ordered and orientated using end-read information.
- Sliding window averaging
The averaging of statistics, such as nucleotide diversity or FST, for all markers in a chosen size of overlapping genomic region (window). When applied across the genome, this method smoothes out variation within regions so that genome-wide patterns can be observed.
- lod score
(Base 10 'logarithm of the odds' or 'log-odds'). A statistical estimate of whether two loci are likely to lie near each other on a chromosome and are therefore likely to be inherited together. A lod score of three or more is generally considered to indicate that the two loci are close.
- Major histocompatibility complex
(MHC). A complex locus on human chromosome 6p, which comprises numerous genes, including the human leukocyte antigen genes, which are involved in the immune response. MHC molecules bind peptide fragments that are derived from pathogens and display them on the cell surface for recognition by the appropriate T cells. The organizations of the MHC gene clusters are similar in many species.
- Solid-phase reversible immobilization
(SPRI). The purification of nucleic acids using magnetic beads, thus avoiding gel extraction, filtration and centrifugation.
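As an illustration of the sliding window averaging entry in the glossary above, the following is a minimal sketch that averages a per-marker statistic (for example, FST or nucleotide diversity) in overlapping windows along one chromosome. It uses a simple unweighted mean; published pipelines may instead use kernel smoothing or weight markers differently. The function name `sliding_window_mean` and its parameters are hypothetical choices made for this example.

```python
from typing import List, Optional, Sequence, Tuple


def sliding_window_mean(
    positions: Sequence[int],
    values: Sequence[float],
    window_size: int,
    step: int,
) -> List[Tuple[int, Optional[float]]]:
    """Average `values` in overlapping windows along one chromosome.

    Each window covers the half-open interval [start, start + window_size).
    Windows overlap whenever `step` is smaller than `window_size`.
    Returns (window_start, mean) pairs; the mean is None for empty windows.
    """
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    if len(positions) != len(values):
        raise ValueError("positions and values must have the same length")
    if not positions:
        return []

    markers = sorted(zip(positions, values))
    last_position = markers[-1][0]

    results: List[Tuple[int, Optional[float]]] = []
    start = 0
    while start <= last_position:
        end = start + window_size
        in_window = [v for p, v in markers if start <= p < end]
        mean = sum(in_window) / len(in_window) if in_window else None
        results.append((start, mean))
        start += step
    return results


if __name__ == "__main__":
    # Illustrative marker positions (bp) and per-marker statistic values.
    marker_positions = [100, 400, 900, 1200]
    marker_values = [0.1, 0.3, 0.5, 0.7]
    for window_start, mean in sliding_window_mean(marker_positions, marker_values, 1000, 500):
        print(window_start, mean)
```

The step size controls how much neighbouring windows overlap: a smaller step gives a smoother profile at the cost of more strongly correlated window values.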