Review Article

Genome-wide genetic marker discovery and genotyping using next-generation sequencing

Nature Reviews Genetics volume 12, pages 499–510 (2011)

Abstract

The advent of next-generation sequencing (NGS) has revolutionized genomic and transcriptomic approaches to biology. These new sequencing tools are also valuable for the discovery, validation and assessment of genetic markers in populations. Here we review and discuss best practices for several NGS methods for genome-wide genetic marker development and genotyping that use restriction enzyme digestion of target genomes to reduce the complexity of the target. These new methods — which include reduced-representation sequencing using reduced-representation libraries (RRLs) or complexity reduction of polymorphic sequences (CRoPS), restriction-site-associated DNA sequencing (RAD-seq) and low coverage genotyping — are applicable to both model organisms with high-quality reference genome sequences and, excitingly, to non-model species with no existing genomic data.

Key points

  • New methods that make use of high-throughput sequencing are enabling the simultaneous discovery and sequencing of thousands of genetic markers across whole genomes.

  • These methods can be used to study wild populations of tens or hundreds of individuals for which genomic resources were not previously available.

  • They also enable the rapid genotyping of hundreds of individuals in a mapping cross, for quantitative trait locus (QTL) mapping and marker-assisted selection.

  • We describe best practices and make recommendations for a group of methods involving the use of restriction enzymes, namely reduced-representation libraries, complexity reduction of polymorphic sequences, restriction-site-associated DNA sequencing, multiplexed shotgun genotyping and genotyping by sequencing.

  • We discuss the impact of several factors — such as the availability of genomic resources, the levels of polymorphism, the pooling of samples and the choice of restriction enzyme — on the design and implementation of high-throughput marker discovery and genotyping experiments.

  • The analysis of data from these methods can be challenging and new methods for processing high-throughput marker data are described.

  • At present these methods are far more economical than whole-genome sequencing. We discuss how this situation is likely to change over the next few years, as sequencing costs continue to fall rapidly.
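The influence of restriction enzyme choice on marker density can be illustrated with a rough calculation. The following is a minimal sketch, not a method taken from the article: it counts recognition sites in a sequence and compares this with the expectation under the simplifying assumption that all four bases are equally frequent and independent, in which case a recognition site of length n occurs about once every 4^n bases. The function names and the example recognition sequences (EcoRI, GAATTC; SbfI, CCTGCAGG) are chosen here for illustration only.

```python
def count_sites(genome: str, site: str) -> int:
    """Count occurrences of a recognition site on one strand (overlaps allowed)."""
    if not site:
        raise ValueError("recognition site must not be empty")
    count = 0
    start = genome.find(site)
    while start != -1:
        count += 1
        start = genome.find(site, start + 1)
    return count


def expected_sites(genome_length: int, site_length: int) -> float:
    """Expected number of sites, assuming equal and independent base frequencies.

    This is an idealisation: real genomes have skewed base composition,
    so observed counts can differ substantially from this value.
    """
    return genome_length / 4 ** site_length


if __name__ == "__main__":
    toy = "AAGAATTCGGGAATTCAA"  # toy sequence, illustrative only
    print(count_sites(toy, "GAATTC"))
    # A six-base cutter versus an eight-base cutter in a hypothetical 1 Gb genome
    print(expected_sites(1_000_000_000, 6))
    print(expected_sites(1_000_000_000, 8))
```

Under these assumptions an eight-base cutter yields 16-fold fewer sites than a six-base cutter, which is one reason why enzyme choice affects how many markers are sampled and how deeply each can be sequenced.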



Acknowledgements

We are grateful to P. Andolfatto, E. Buckler, W. Cresko, R. Elshire, E. Johnson, S. Mitchell, D. Stern and four anonymous referees for reviewing and discussing drafts of this manuscript. We thank S. Bassham, S. Baxter, C. Eland, K. Gharbi, M. Liu, J. Taggart, and P. Fuentes Utrilla for discussions that have improved our understanding of these methods. J.W.D. and M.L.B. are funded by the UK Natural Environment Research Council, grant NE/H019804/1. P.A.H. and J.M.C. received funding support from the US National Institutes of Health (NIH) grant 1R24GM079486-01A1, the US National Science Foundation grant IOS-0843392 and a Keck Foundation grant to W. Cresko. J.M.C. was also funded by the NIH National Research Service Award Ruth L. Kirschstein postdoctoral fellowship 1F32GM095213-01. P.D.E. was supported by grants R21HG003834 and R21HG006036 from the US National Human Genome Research Institute awarded to E. Johnson.

Author information

Affiliations

  1. Institute of Evolutionary Biology, University of Edinburgh, Ashworth Laboratories, King's Buildings, West Mains Road, Edinburgh, EH9 3JT, UK.

    • John W. Davey
    • Mark L. Blaxter

  2. Institute of Ecology and Evolution, University of Oregon, Pacific Hall, Eugene, Oregon 97403-5289, USA.

    • Paul A. Hohenlohe
    • Julian M. Catchen

  3. Institute of Molecular Biology, University of Oregon, Klamath Hall, Eugene, Oregon 97403-1229, USA.

    • Paul D. Etter

  4. Floragenex, Inc., 1900 Millrace Drive, Eugene, Oregon 97403, USA.

    • Jason Q. Boone

  5. The GenePool Genomics Facility, University of Edinburgh, Ashworth Laboratories, King's Buildings, West Mains Road, Edinburgh, EH9 3JT, UK.

    • Mark L. Blaxter

Competing interests

J.Q.B. is an employee of Floragenex, Inc., an organization that offers RAD-seq and associated consulting as a commercial service. The other authors declare no competing interests.

Corresponding author

Correspondence to John W. Davey.

Glossary

Quantitative trait locus

(QTL). A locus that controls a quantitative phenotypic trait, identified by showing a statistical association between genetic markers surrounding the locus and phenotypic measurements.

Marker-assisted selection

The use of genetic markers to predict the inheritance of alleles at a closely linked trait locus.

Restriction fragment length polymorphism

(RFLP). A fragment-length variant that is generated through the presence or absence of a restriction enzyme recognition site. Restriction sites can be gained or lost by base substitutions, insertions or deletions.

Amplified fragment length polymorphism

(AFLP). A mapping method in which genomic DNA from different strains is PCR amplified using arbitrary primers. DNA fragments that are amplified in one strain, but not the other, are cloned, sequenced and used as polymorphic markers.

Microsatellite

A class of repetitive DNA that is made up of repeats that are 2–8 nucleotides in length. They can be highly polymorphic and are frequently used as molecular markers in population genetics studies.

Optical mapping

A method for creating a map of a genome by stretching DNA in microfluidic channels on a slide for visualization on a fluorescent microscope. The DNA is then digested by restriction enzymes and the sizes of these fragments are inferred by the integrated intensity of the fluorescent intercalator dye.

FST

(Wright's fixation index). The fraction of the total genetic variation that is distributed among subpopulations in a subdivided population.
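As an illustration of this definition, the sketch below uses one common heterozygosity-based formulation, FST = (HT − HS) / HT, where HT is the expected heterozygosity of the pooled population and HS is the mean expected heterozygosity within subpopulations. It assumes a single biallelic locus, equally sized subpopulations and known allele frequencies; it is not necessarily the estimator used in any study discussed in the article, and the function name is hypothetical.

```python
def fst_biallelic(subpop_freqs):
    """Return FST = (H_T - H_S) / H_T for one biallelic locus.

    subpop_freqs: frequency of one allele in each subpopulation.
    Assumes equally sized subpopulations and known (not estimated) frequencies.
    """
    if not subpop_freqs:
        raise ValueError("at least one subpopulation is required")
    if any(p < 0.0 or p > 1.0 for p in subpop_freqs):
        raise ValueError("allele frequencies must lie between 0 and 1")

    n = len(subpop_freqs)
    p_bar = sum(subpop_freqs) / n
    # Expected heterozygosity of the total (pooled) population
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    # Mean expected heterozygosity within subpopulations
    h_within = sum(2.0 * p * (1.0 - p) for p in subpop_freqs) / n

    if h_total == 0.0:
        # Locus is fixed everywhere: no variation to partition
        return 0.0
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    print(fst_biallelic([0.5, 0.5]))  # identical frequencies
    print(fst_biallelic([0.2, 0.8]))  # moderately differentiated
    print(fst_biallelic([1.0, 0.0]))  # fixed for alternative alleles
```

Identical allele frequencies give a value of 0, and subpopulations fixed for alternative alleles give a value of 1.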

Imputation

A statistical method for handling missing data in which the missing values are replaced by estimated values.

Recombinant inbred lines

(RILs). A population of fully homozygous individuals that is obtained through the repeated selfing of F1 hybrids, and that is comprised of 50% of each original parental genome in different combinations.

Hidden Markov model

A statistical approach that is used to estimate a series of hidden states (for example, ancestry at loci along a chromosome). The method is based on observations of the states that have uncertainty (for example, the ancestral assignment of sequence reads) and the expected probability of transitions between states (for example, recombination breakpoints).
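The idea can be sketched with a two-state model in which the hidden states are the parental ancestries 'A' and 'B' at successive markers along a chromosome, the observations are noisy ancestry assignments of sequence reads, and transitions between states correspond to recombination breakpoints. This is a minimal illustration using the standard forward-backward algorithm, not the implementation of any published tool; the error rate, switch probability and function name are assumptions made for the example.

```python
def posterior_ancestry(observations, error=0.1, switch=0.05):
    """Posterior probability of ancestry 'A' or 'B' at each marker.

    observations: list of 'A', 'B' or None (no reads at that marker).
    error:  assumed probability that a read is assigned to the wrong parent.
    switch: assumed probability of a recombination between adjacent markers.
    """
    states = ("A", "B")

    def emit(state, obs):
        if obs is None:  # missing data carries no information
            return 1.0
        return 1.0 - error if obs == state else error

    def trans(src, dst):
        return 1.0 - switch if src == dst else switch

    def normalise(probs):
        total = sum(probs.values())
        return {k: v / total for k, v in probs.items()}

    if not observations:
        return []

    # Forward pass: probability of each state given observations so far
    forward = [normalise({s: 0.5 * emit(s, observations[0]) for s in states})]
    for obs in observations[1:]:
        prev = forward[-1]
        forward.append(normalise({
            t: emit(t, obs) * sum(prev[s] * trans(s, t) for s in states)
            for t in states
        }))

    # Backward pass: likelihood of later observations given each state
    backward = [{s: 1.0 for s in states}]
    for obs in reversed(observations[1:]):
        nxt = backward[0]
        backward.insert(0, normalise({
            s: sum(trans(s, t) * emit(t, obs) * nxt[t] for t in states)
            for s in states
        }))

    # Combine both passes into per-marker posterior probabilities
    return [normalise({s: f[s] * b[s] for s in states})
            for f, b in zip(forward, backward)]


if __name__ == "__main__":
    reads = ["A", "A", "A", "B", "A", "A", None, "B", "B", "B", "B"]
    for obs, post in zip(reads, posterior_ancestry(reads)):
        print(obs, round(post["A"], 3))
```

Because the output is a probability for each state at each marker rather than a single call, markers with missing or conflicting reads near a breakpoint retain their uncertainty.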

Soft ancestry calls

Assigning probabilities to ancestral (for example, parental or grandparental) genotypes, rather than making explicit, 'hard' calls. This approach appropriately propagates uncertainty (which often arises around recombination breakpoints) in individual ancestry assignments, thus enabling a more accurate inference of breakpoint location.

Scaffold

A genomic unit composed of one or more contigs that have been ordered and orientated using end-read information.

Sliding window averaging

The averaging of statistics, such as nucleotide diversity or FST, for all markers in a chosen size of overlapping genomic region (window). When applied across the genome, this method smoothes out variation within regions so that genome-wide patterns can be observed.
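A minimal sketch of this procedure is shown below, assuming that markers are given as base-pair positions on a single chromosome with one statistic per marker. The function name, the half-open window convention and the decision to skip windows that contain no markers are choices made for this example rather than details from the article.

```python
def sliding_window_mean(positions, values, window, step):
    """Average a per-marker statistic in overlapping genomic windows.

    positions: marker coordinates (bp) on one chromosome.
    values:    statistic for each marker (e.g. nucleotide diversity or FST).
    window:    window width in bp; step: offset between window starts in bp.
    Returns a list of (window_start, mean) tuples; empty windows are skipped.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if len(positions) != len(values):
        raise ValueError("positions and values must have the same length")

    markers = sorted(zip(positions, values))
    if not markers:
        return []

    results = []
    last_position = markers[-1][0]
    start = 0
    while start <= last_position:
        # Windows are half-open intervals [start, start + window)
        in_window = [v for p, v in markers if start <= p < start + window]
        if in_window:
            results.append((start, sum(in_window) / len(in_window)))
        start += step
    return results


if __name__ == "__main__":
    pos = [10, 20, 60, 110]
    stat = [0.1, 0.3, 0.5, 0.9]
    for start, mean in sliding_window_mean(pos, stat, window=100, step=50):
        print(start, round(mean, 3))
```

Choosing a step smaller than the window width makes adjacent windows overlap, which is what produces the smoothing effect.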

lod score

(Base 10 'logarithm of the odds' or 'log-odds'). A statistical estimate of whether two loci are likely to lie near each other on a chromosome and are therefore likely to be inherited together. A lod score of three or more is generally considered to indicate that the two loci are close.
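For a concrete illustration, the sketch below computes a two-point lod score in the simplest textbook setting: phase-known meioses in which each offspring can be scored as recombinant or non-recombinant between the two loci. The likelihood of the data at recombination fraction θ is compared with the likelihood under free recombination (θ = 0.5). The function name and example counts are hypothetical.

```python
import math


def lod_score(recombinants, meioses, theta):
    """Two-point lod score for phase-known data.

    lod = log10( L(theta) / L(0.5) ), with
    L(theta) = theta**recombinants * (1 - theta)**(meioses - recombinants).
    """
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must be in the interval (0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must be between 0 and meioses")

    non_recombinants = meioses - recombinants
    log_l_theta = (recombinants * math.log10(theta)
                   + non_recombinants * math.log10(1.0 - theta))
    log_l_null = meioses * math.log10(0.5)  # free recombination
    return log_l_theta - log_l_null


if __name__ == "__main__":
    # 1 recombinant in 20 meioses, evaluated at theta = 0.05
    print(round(lod_score(1, 20, 0.05), 2))
    # Half of the meioses recombinant: no evidence for linkage
    print(round(lod_score(10, 20, 0.5), 2))
```

In practice the score is evaluated over a range of θ values and the maximum is reported; that search is omitted here for brevity.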

Major histocompatibility complex

(MHC). A complex locus on human chromosome 6p, which comprises numerous genes, including the human leukocyte antigen genes, which are involved in the immune response. MHC molecules bind peptide fragments that are derived from pathogens and display them on the cell surface for recognition by the appropriate T cells. The organizations of the MHC gene clusters are similar in many species.

Solid-phase reversible immobilization

(SPRI). The purification of nucleic acids using magnetic beads, thus avoiding gel extraction, filtration and centrifugation.

About this article

DOI

https://doi.org/10.1038/nrg3012
