Deriving genotypes from RAD-seq short-read data using Stacks

Rochette, Nicolas C; Catchen, Julian M

doi:10.1038/nprot.2017.123

Protocol
Published: 30 November 2017

Nature Protocols volume 12, pages 2640–2659 (2017)

Abstract

Restriction site-associated DNA sequencing (RAD-seq) allows for the genome-wide discovery and genotyping of single-nucleotide polymorphisms in hundreds of individuals at a time in model and nonmodel species alike. However, converting short-read sequencing data into reliable genotype data remains a nontrivial task, especially as RAD-seq is used in systems that have very diverse genomic properties. Here, we present a protocol to analyze RAD-seq data using the Stacks pipeline. This protocol will be of use in areas such as ecology and population genetics. It covers the assessment and demultiplexing of the sequencing data, read mapping, inference of RAD loci, genotype calling, and filtering of the output data, as well as providing two simple examples of downstream biological analyses. We place special emphasis on checking the soundness of the procedure and choosing the main parameters, given the properties of the data. The procedure can be completed in 1 week, but determining definitive methodological choices will typically take up to 1 month.

**Figure 1: Genotype calling greatly depends on coverage.**

**Figure 2: Selection of assembly parameters in a *de novo* analysis.**

**Figure 3: Evolution of the catalog as more samples are added.**

**Figure 4: Transitioning to population genetics.**

Acknowledgements

We thank J. Paris and N. Rayamajhi for their help in testing the procedure and for discussion of the manuscript.

Author information

Authors and Affiliations

Department of Animal Biology, University of Illinois at Urbana–Champaign, Urbana, Illinois, USA
Nicolas C Rochette & Julian M Catchen

Contributions

N.C.R. and J.M.C. designed the protocol, performed experiments, and wrote the manuscript.

Corresponding author

Correspondence to Julian M Catchen.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Observed per-sample coverages.

The median number of reads for the 78 samples of the demonstration dataset is 1,209,782. The last two samples, ‘sj_1483.05’ and ‘sj_181931’, have almost no reads, likely because of a mistake at the bench at the multiplexing step that caused them not to be represented in the DNA library.

Supplementary Figure 2 Comparison between the reference-based and de novo approaches.

The two approaches yield very similar overall results, yet differ in their treatment of specific subsets of the data.

(A) Number of loci (solid lines) and polymorphic loci (dashed lines) shared by 80% of samples in the reference-based (blue) and de novo (black, same numbers as in Figure 2) approaches, for 12 representative samples. The reference-based analysis yields 35,792 loci shared by 80% of samples, whereas the de novo analysis (with M=n=4) yields 35,277 such loci. Furthermore, mapping the consensus sequences of these de novo loci to the reference genome using BWA (not shown) shows that 32,843 of them (93.1%) have a one-to-one relationship with loci in the reference-based analysis, whereas 96 (0.3%) appear under-merged (pairs of loci map to the same genomic location) and 1,748 (5.0%) cannot be mapped to the reference. The remaining 590 de novo loci (1.7%) correspond to in-between cases in which the de novo loci partly exist in the reference-based analysis but are not part of the filtered set of loci present in 80% of samples. Conversely, 2,903 of the filtered loci of the reference-based analysis are missing from the filtered de novo set.

(B) Distribution of the number of SNPs per locus for the reference-based (blue) and de novo (yellow-red, same numbers as in Figure 2) approaches. We note that the total number of SNPs is higher in the reference-based analysis than in the de novo analysis with M=n=4 (57,872 vs. 53,051), but that the rate of SNPs with implausibly high heterozygosities (>60%) is also slightly higher (2.0% vs. 1.8%).

(C) PCA of 76 individuals, computed in the same way as in Figure 4B, but using the genotypes resulting from the reference-based analysis. For comparability with Figure 4B, the Y-axis is PC3, not PC2. Inset: Percentage of the variance explained by the first ten components.
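
The percentages quoted in panel (A) can be rechecked from the counts given in the legend. The following is a minimal sketch, not part of the published protocol: the variable and category names are hypothetical labels chosen for illustration, and only the counts themselves come from the legend.

```python
# Counts as reported in the legend of Supplementary Figure 2A.
# The names below are illustrative labels, not Stacks terminology.
DE_NOVO_TOTAL = 35_277  # de novo loci shared by 80% of samples (M=n=4)

de_novo_breakdown = {
    "one_to_one": 32_843,   # one-to-one match with a reference-based locus
    "under_merged": 96,     # pairs of loci mapping to the same location
    "unmapped": 1_748,      # consensus could not be mapped to the reference
    "in_between": 590,      # partly present, outside the filtered set
}


def percentages(counts, total):
    """Return each category as a percentage of total, rounded to one decimal."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


shares = percentages(de_novo_breakdown, DE_NOVO_TOTAL)
categories_sum = sum(de_novo_breakdown.values())

for name, pct in shares.items():
    print(f"{name}: {de_novo_breakdown[name]} loci ({pct}%)")
print("categories cover all de novo loci:", categories_sum == DE_NOVO_TOTAL)
```

Under these assumptions, the four categories add up to the 35,277 de novo loci, and the rounded shares agree with the 93.1%, 0.3%, 5.0% and 1.7% stated in the legend.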

Supplementary information

Supplementary Figures

Supplementary Figures 1 and 2. (PDF 376 kb)

About this article

Cite this article

Rochette, N., Catchen, J. Deriving genotypes from RAD-seq short-read data using Stacks. Nat Protoc 12, 2640–2659 (2017). https://doi.org/10.1038/nprot.2017.123

Published: 30 November 2017
Issue Date: December 2017
DOI: https://doi.org/10.1038/nprot.2017.123
