Abstract
The identification of genetic variation with next-generation sequencing is confounded by the complexity of the human genome sequence and by biases that arise during library preparation, sequencing and analysis. We have developed a set of synthetic DNA standards, termed 'sequins', that emulate human genetic features and constitute qualitative and quantitative spike-in controls for genome sequencing. Sequencing reads derived from sequins align exclusively to an artificial in silico reference chromosome, rather than the human reference genome, which allows them to be partitioned for parallel analysis. Here we use this approach to represent common and clinically relevant genetic variation, ranging from single nucleotide variants to large structural rearrangements and copy-number variation. We validate the design and performance of sequin standards by comparison to examples in the NA12878 reference genome, and we demonstrate their utility during the detection and quantification of variants. We provide sequins as a standardized, quantitative resource against which human genetic variation can be measured and diagnostic performance assessed.
Acknowledgements
We thank our colleagues M. Cowley and M. Pinese for useful discussions and K. Ying for assistance in bioinformatic pipelines. The authors would like to thank the following funding sources: Australian National Health and Medical Research Council (NHMRC) Australia Fellowship (1062470 to T.R.M. and 1062606 to W.Y.C.). I.W.D. and S.A.H. are supported by Australian Postgraduate Award scholarships. The contents of the published material are solely the responsibility of the administering institution, a participating institution or individual authors, and they do not reflect the views of NHMRC.
Author information
Contributions
T.R.M. conceived the project, designed the sequins and the synthetic chromosome, and conceived the experiments. W.Y.C. and S.B.A. prepared sequins and performed experiments. I.W.D., T.W. and T.R.M. performed data analysis. I.W.D., S.A.H., L.K.N., J.S.M. and T.R.M. prepared the manuscript.
Ethics declarations
Competing interests
The Garvan Institute of Medical Research has filed patent applications on some techniques described in this study.
Integrated supplementary information
Supplementary Figure 1 Workflow for use and analysis of DNA sequin standards in genome sequencing.
(From top) Sequins are added to a genomic DNA sample and undergo concurrent library preparation and sequencing. Reads representing genomic DNA or synthetic sequin DNA are distinguished by their alignment to the reference human genome (left) or the artificial in silico reference chromosome (colored), respectively. Sequin-derived alignments undergo independent analysis in parallel to assess the performance of variant detection and the accuracy of allele frequency quantification, and to facilitate inter-sample comparisons and/or normalization. Multiple aspects of the workflow, including both laboratory (library preparation and sequencing) and bioinformatic (alignment and analysis) processes, can be assessed using sequins, with examples indicated (dashed boxes).
Supplementary Figure 2 Design workflow for artificial in silico chromosome and DNA sequins
(a) An artificial in silico chromosome sequence was designed to represent features of a real human chromosome with artificial, non-homologous primary sequences. To generate a background sequence into which synthetic variants could be incorporated, the human genome (Hg38) sequence was binned into 100 kb windows, which were ranked according to gene density, repeat density and GC content, then systematically sampled to ensure proportional representation. To abolish homology to Hg38, sequences were inverted and, where necessary, subjected to local shuffling, nucleotide substitution and/or manual curation. Annotated instances of small and large scale genetic variation from human populations were systematically sampled and incorporated, along with their flanking genome sequences, into the artificial chromosome, with homology being abolished as above. (b) Artificial sequences (background and genetic variants) were assembled into in silico chromosome sequences. Three in silico chromosomes were assembled; the first sequence (chrIS_D) contains no genetic variation and corresponds to the consensus reference sequence, analogous to the Hg38 reference sequence. The remaining two sequences emulate diploid human genotypes, with homozygous variation encoded throughout both sequences and heterozygous variation encoded in only one of the sequences. These form the template for synthetic DNA sequin synthesis. (c) Sequins are short (<10 kb) stretches of synthetic DNA that encode mutations with respect to the sequence of chrIS_D, whose sequence remains in silico. Sequin sequences were flanked by SapI Type II restriction digest sites to enable their excision and purification from the host (pMA) vector. Purified sequins were combined at desired concentrations before library preparation and sequencing. The resulting sequencing reads align to chrIS_D and identify instances of genetic variation with respect to this reference sequence.
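The windowing, ranking and inversion steps in panel (a) can be illustrated with a minimal sketch. It assumes that "inverted" means reading the bases in reverse order without complementing, which the caption does not state explicitly. The function names, the toy sequence and the 7-base window size are hypothetical, and only GC content is ranked here (gene and repeat density are omitted).

```python
def gc_fraction(seq):
    """Fraction of G or C bases in a sequence (0.0 for empty input)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def windows(seq, size):
    """Split a sequence into consecutive, non-overlapping windows."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def invert(seq):
    """Reverse base order WITHOUT complementing (assumed meaning of 'inverted')."""
    return seq[::-1]


# Toy stand-in for a genome; the caption describes 100 kb windows of Hg38.
toy_genome = "ACGTGGCCAATTGGCCGCGCATATATAT"

# Rank windows by GC content, then invert each one.
ranked = sorted(windows(toy_genome, 7), key=gc_fraction)
inverted = [invert(w) for w in ranked]

for original, flipped in zip(ranked, inverted):
    print(original, "->", flipped, f"GC={gc_fraction(flipped):.2f}")
```

Reversal leaves base composition unchanged, so each inverted window keeps the GC content of the window it was drawn from while its primary sequence no longer matches the source.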
Supplementary Figure 3 Representing human genetic variation within DNA sequins
(a) Large scale (1.5 Mb) genome browser view of sequins mapping within chrIS_D, in which eight individual sequins are visible, as well as a closer view of a single sequin, encoding six synthetic variants. (b) Histograms indicate spacing between synthetic variants encoded within sequins, with a density distribution for bona fide variants in the NA12878 genome provided for comparison. (c,d) Common genetic variation within human populations was proportionately sampled and incorporated into chrIS_D. Common genetic variation was also supplemented with an additional set of InDels that pose a challenge to detection via short-read sequencing. Histograms indicate the relative frequency of nucleotide substitutions (c) and the distribution of InDel sizes (d) within sequins (blue) and human populations (dbSNP 141; grey). (e) Examples of variants residing in low-complexity and simple-repeat sequences that could not be represented in sequins due to synthesis difficulties and/or cross-alignment to the human genome.
Supplementary Figure 4 Assessment of cross-alignment between sequin and human DNA libraries
When aligned to a combined reference index containing Hg38 and chrIS_D, all libraries (simulated and experimental) show no cross-alignment contamination. Forcing cross-alignment requires (i) performing alignment in the absence of either chrIS_D or Hg38 and (ii) reducing the mapping penalty associated with mismatching bases (BWA mem default penalty = 4). This also results in a high rate of erroneous alignment. Histograms illustrate the proportion of unique alignments (blue), multiple/ambiguous alignments (grey) and unmapped reads (orange) from simulated (a) and real (b) libraries when aligned to either chrIS_D or Hg38, as mismatch-penalty rates are reduced from 4 (default) to 0.
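Partitioning reads after alignment to a combined index amounts to grouping them by the contig they map to. The sketch below uses a hypothetical record layout of (read name, contig) tuples rather than a real SAM/BAM parser. The contig name chrIS_D comes from the captions; everything else is an assumption for illustration.

```python
from collections import defaultdict

SEQUIN_CONTIG = "chrIS_D"  # name from the captions; record layout below is hypothetical


def partition_alignments(alignments, sequin_contig=SEQUIN_CONTIG):
    """Split (read_name, contig) records into sequin, human and unmapped groups.

    A contig of None stands for an unmapped read.
    """
    groups = defaultdict(list)
    for read_name, contig in alignments:
        if contig is None:
            groups["unmapped"].append(read_name)
        elif contig == sequin_contig:
            groups["sequin"].append(read_name)
        else:
            groups["human"].append(read_name)
    return dict(groups)


example = [
    ("read1", "chr1"),
    ("read2", "chrIS_D"),
    ("read3", None),
    ("read4", "chrX"),
]
print(partition_alignments(example))
```

Each group can then be passed to its own downstream analysis, which is the parallel workflow the captions describe.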
Supplementary Figure 5 Sources of variability in observed sequin abundances within and between mixtures
Sequence-specific biases during PCR amplification, library preparation, sequencing and alignment contribute to systematic variation (assumed to be reproducible between independent mixtures) in the observed abundances of sequins, while different mixtures exhibit independent variation, resulting from aliquot variability (pipetting error). (a) The observed abundances (median per-base coverage) of individual sequin standards (normalized to median abundance of all standards within a mixture) are compared between five independent mixtures, in which all standards were theoretically incorporated at equimolar concentrations. The vector directions associated with either systematic or independent variation are indicated. (b) Histograms indicate observed abundances (normalized to median) of sequin standards in five independent mixtures (matched order), before and after correction to remove systematic biases. Standard deviations indicate the magnitude of variability within each mixture, which is considerably reduced following systematic-bias correction.
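The caption does not specify how systematic bias was corrected, so the following is only one plausible sketch. It assumes the systematic component for each sequin can be estimated as the median of its normalized abundance across mixtures, and that correction means dividing by that factor. Function names and the toy numbers are hypothetical.

```python
from statistics import median, pstdev


def normalize_to_median(abundances):
    """Scale abundances so that the median standard in a mixture equals 1."""
    m = median(abundances)
    return [a / m for a in abundances]


def correct_systematic_bias(mixtures):
    """Remove per-sequin bias shared across mixtures (assumed approach).

    mixtures: list of lists, one per mixture, with sequins in matched order.
    """
    normalized = [normalize_to_median(m) for m in mixtures]
    # Per-sequin systematic factor: median across mixtures (an assumption).
    factors = [median(column) for column in zip(*normalized)]
    return [[v / f for v, f in zip(row, factors)] for row in normalized]


# Toy data: three mixtures, three sequins intended to be equimolar.
mixtures = [[10, 20, 40], [11, 20, 38], [9, 20, 42]]
corrected = correct_systematic_bias(mixtures)

for raw, fixed in zip(mixtures, corrected):
    before = pstdev(normalize_to_median(raw))
    after = pstdev(fixed)
    print(f"SD before={before:.3f} after={after:.3f}")
```

What remains after correction is the mixture-specific component, which the caption attributes to aliquot variability.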
Supplementary Figure 6 Sequencing and alignment quality for sequenced reads derived from sequin or NA12878 genomic DNA
(a, b) FastQC reports for NA12878 DNA and sequin-derived reads, showing Phred quality scores as a function of nucleotide position within sequenced reads (a) and on a per-read basis (b). (c) Population distributions show the frequency of single-base substitution errors in sequenced reads from sequins or NA12878 genomic DNA.
Supplementary Figure 7 Calibrating sequencing coverage of sequins to match NA12878 genome
(a) Alignments to chrIS_D were subsampled to achieve coverage matched to the accompanying NA12878 genome. Frequency distributions illustrate coverage (per nucleotide) for sequins (i) before (right; blue) and (ii) after (left; blue) subsampling and (iii) for the accompanying NA12878 genome at maximum sequencing depth (red). Box-whisker plots (above) summarize median and interquartile ranges for frequency distributions. (b) Plots show mean coverage (+/- SD) across the length of sequin standards (left; n=36) in comparison to coverage across matched ~1 kb windows in the NA12878 genome (right; n=36). Terminal regions of sequins where edge effects impact coverage are indicated (dashed lines) and were excluded during coverage calibration and analysis.
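Coverage matching by subsampling reduces to keeping each sequin alignment with probability equal to the ratio of target to observed coverage. The sketch below is a generic illustration, not the authors' pipeline. The function names, the fixed seed and the coverage values are assumptions.

```python
import random


def subsample_fraction(sequin_coverage, target_coverage):
    """Fraction of sequin alignments to keep so coverage matches the target.

    Capped at 1.0 because subsampling cannot increase coverage.
    """
    if sequin_coverage <= 0:
        raise ValueError("sequin_coverage must be positive")
    return min(1.0, target_coverage / sequin_coverage)


def subsample(reads, fraction, seed=0):
    """Keep each read independently with probability `fraction`."""
    rng = random.Random(seed)
    return [r for r in reads if rng.random() < fraction]


# Hypothetical numbers: sequins at 500-fold, accompanying genome at 50-fold.
fraction = subsample_fraction(500, 50)
kept = subsample(list(range(10_000)), fraction)
print(fraction, len(kept))
```

Because reads are kept independently, the achieved coverage fluctuates around the target rather than matching it exactly.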
Supplementary Figure 8 Reference scale to measure allele frequencies
Sequin standards representing variant alleles were titrated at a two-fold serial dilution relative to reference allele standards to establish a ladder of allele frequencies from 1:1 (i.e., heterozygous) to 1:4096. (a) Scatter plot illustrates the mixture structure used to establish the variant allele frequency reference scale. (b,c) Observed alignment counts are plotted against expected allelic fractions for synthetic sequin variants, with SNVs (b) and InDels (c) shown separately. (d) Genome browser views provide illustrative examples of synthetic variants at 0.5, 0.25, 0.125 and 0.0625 allelic fractions, with alignments (grey) containing the variant indicated.
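A two-fold dilution ladder can be written out directly. This sketch assumes twelve steps with allelic fractions of 1/2, 1/4 and so on down to 1/4096. The step count is inferred from the stated endpoints and from the fractions listed for panel (d), not given explicitly in the caption.

```python
from fractions import Fraction


def allele_fraction_ladder(steps=12):
    """Expected variant allelic fractions for a two-fold serial dilution.

    steps=12 is an assumption inferred from the endpoints 1/2 and 1/4096.
    """
    return [Fraction(1, 2 ** k) for k in range(1, steps + 1)]


ladder = allele_fraction_ladder()
for fraction in ladder:
    print(f"{fraction}  ({float(fraction):.6f})")
```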
Supplementary Figure 9 Impact of library depth on variant detection sensitivity
Sequins constitute an internal standard that can be used to inform the filtering stringency required to achieve sensitivity and specificity according to sample-specific variation, such as library depth. (a, b) Read counts supporting reference (grey) and variant alleles (red and blue) are plotted against expected allelic fractions (1/2 – 1/4096) at 5,000-, 1,000- and 250-fold sequencing depth. Variants were called at a minimum confidence threshold (blue; p-value indicated) that was empirically determined within each sampled library in order to maximize sensitivity without compromising precision. (c, d) Frequency distributions illustrate variant detection sensitivity at different library depths. Whilst a linear relationship between library depth and detection sensitivity is apparent for SNVs (c), a more complicated relationship is apparent for InDels (d), owing to the confounding impact of erroneous alignments.
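Why depth limits sensitivity at low allelic fractions follows from simple sampling arithmetic. The sketch below assumes idealized binomial sampling of reads at a site and ignores sequencing and alignment errors, so it gives an upper bound on what a caller could detect rather than a model of the results in the figure. The function names and the `min_reads` parameter are hypothetical.

```python
from math import comb


def expected_variant_reads(depth, allele_fraction):
    """Mean number of variant-supporting reads at a site."""
    return depth * allele_fraction


def prob_at_least(depth, allele_fraction, min_reads=1):
    """P(at least `min_reads` variant reads) under binomial sampling.

    Idealized: no sequencing error, no alignment artefacts (assumption).
    """
    f = allele_fraction
    p_below = sum(
        comb(depth, k) * f ** k * (1 - f) ** (depth - k)
        for k in range(min_reads)
    )
    return 1 - p_below


for depth in (5000, 1000, 250):
    for denominator in (2, 64, 4096):
        f = 1 / denominator
        print(
            f"depth={depth:>5} fraction=1/{denominator:<5}"
            f" expected={expected_variant_reads(depth, f):8.2f}"
            f" P(>=3 reads)={prob_at_least(depth, f, 3):.3f}"
        )
```

At 250-fold depth a variant at 1/4096 is expected to contribute fewer than one read, which is why the lowest rungs of the ladder drop out first as depth falls.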
Supplementary Figure 10 Evaluating filtering strategies for somatic variant detection
Filtering strategies based on multiple features are commonly used to distinguish true-positive variants (red) and false-positive variants (blue). Cumulative frequency plots illustrate the distribution of (a) read frequencies, (b) quality score at identified variant nucleotide(s) and (c) confidence p-value ascribed by VarScan2 (derived from Fisher exact test on reference and variant alignment counts). (d) Box-whisker plot shows the fraction (+/- min and max range) of alignments in forward relative to reverse orientation. (e) Read counts supporting reference (grey) and variant alleles (blue and purple) for synthetic InDels are plotted against expected variant allele fractions. InDels in blue were called with a minimum confidence of p < 10^-6 (Fisher exact test), while green variants failed to reach this threshold. (f) Frequency distributions describe the sensitivity (blue) and precision (red) of synthetic InDel detection and the accumulation of false positive calls (grey) as a function of variant calling confidence threshold (Fisher exact test). (g) Limit of detection plot indicates the confidence with which synthetic InDels were detected relative to their expected allele frequencies. False positive InDel calls (grey) are included for comparison.
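Because every synthetic variant is known in advance, sensitivity and precision can be computed at any confidence threshold, as in panel (f). The sketch below assumes each call is summarized as a (p-value, is-true-variant) pair, which is a hypothetical layout. The example calls are invented for illustration and are not data from the study.

```python
def evaluate_threshold(calls, n_true_variants, threshold):
    """Sensitivity, precision and false-positive count at a p-value cutoff.

    calls: list of (p_value, is_true_variant) pairs (hypothetical layout).
    A call passes if its p-value is below the threshold.
    Precision is None when no call passes.
    """
    passed = [is_true for p_value, is_true in calls if p_value < threshold]
    true_positives = sum(passed)
    false_positives = len(passed) - true_positives
    sensitivity = true_positives / n_true_variants if n_true_variants else 0.0
    precision = true_positives / len(passed) if passed else None
    return sensitivity, precision, false_positives


# Invented example: four known synthetic variants, three detected at some level.
calls = [(1e-9, True), (1e-7, True), (1e-5, False), (1e-4, True), (1e-2, False)]

for threshold in (1e-6, 1e-3, 1e-1):
    print(threshold, evaluate_threshold(calls, 4, threshold))
```

Sweeping the threshold in this way is what allows a cutoff to be chosen empirically per library rather than fixed in advance.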
Supplementary Figure 11 Large deletions in the in silico chromosome and NA12878 genome
(a) Genome browser examples of large deletions illustrating characteristic coverage profiles for discordant-pair (orange; alignment position/orientation of paired reads violates expected relationship) and split-read (green; partial alignment of single read to two positions) alignments at synthetic breakpoint sites in chrIS_D and deletion sites previously annotated in the NA12878 genome (Personalis 1000 Genomes). (b) Normalized density of 3’ split-alignment termini, in forward (blue) and reverse (red) orientation, is plotted with respect to synthetic variant breakpoint sites (mean, n=3). (c) The accumulation of split-read (green) or discordant-pair (yellow) alignments is plotted against sequencing depth (x-fold coverage). Mean indicated, error bars = SD, n = 3.
Supplementary Figure 12 Large novel insertions in the in silico chromosome and NA12878 genome
(a) Genome browser examples of large novel insertions illustrating characteristic coverage profiles for hanging-read alignments (partnered read is unmapped) in forward (blue) and reverse (red) orientation at synthetic insertion sites in chrIS_D and insertion sites previously annotated in the NA12878 genome (Spiral Genetics). (b) Normalized density of 3’ hanging-alignment termini, in forward (blue) and reverse (red) orientation, is plotted with respect to synthetic insertion breakpoint sites (mean, n=3 for each class). (c) The accumulation of hanging-reads in forward and reverse orientations is plotted against sequencing depth (x-fold coverage).
Supplementary Figure 13 Mobile element insertions in the in silico chromosome
Sequins were constructed to represent examples of mobile-element insertion events (n = 3; based on Alu, MIR and L2 repeat classes) at non-annotated positions within chrIS_D. (a) Genome browser examples show characteristic coverage profiles for discordant-pair alignments in forward (blue) and reverse (red) orientation at synthetic mobile insertion sites. (b) Normalized density of 3’ discordant-pair alignment termini, in forward (blue) and reverse (red) orientation, is plotted with respect to synthetic mobile element insertion breakpoint sites (mean, n = 3 for each class). (c) The accumulation of hanging-reads in forward and reverse orientation is plotted against sequencing depth (x-fold coverage).
Supplementary Figure 14 Copy-number amplification in the in silico chromosome and NA12878 genome
Sequins were designed to represent copy number amplifications of different sizes and amplification numbers (6 × 95 bp, 4 × 202 bp and 2 × 621 bp) with respect to a single copy encoded in the artificial in silico chromosome. (a) Genome browser profile illustrates characteristic coverage by standard (grey), discordant- (orange) and split- (green) alignments at breakpoint (dashed line) in chrIS_D. (b) Scatter-plots indicate normalized standard alignment coverage within amplified region (green) and flanking regions (grey), relative to sequencing depth. The fold-difference in coverage observed between amplified region and flanking regions scales proportionally to copy number, as indicated below.
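The proportionality between coverage fold-difference and copy number can be expressed as a ratio of medians. This is a minimal sketch, assuming per-base coverage values are available for the amplified and flanking regions and that the flanks are single copy. The function name and the coverage numbers are hypothetical.

```python
from statistics import median


def estimate_copy_number(amplified_coverage, flanking_coverage, flanking_copies=1):
    """Copy number as the ratio of median coverage, amplified over flanking.

    Assumes flanking regions are present at `flanking_copies` copies.
    """
    flank = median(flanking_coverage)
    if flank <= 0:
        raise ValueError("flanking coverage must be positive")
    return flanking_copies * median(amplified_coverage) / flank


# Invented per-base coverage values for a four-copy and a two-copy element.
print(estimate_copy_number([118, 122, 120], [29, 30, 31]))
print(estimate_copy_number([60, 62, 58], [30, 30, 30]))
```

Medians are used here so that a few outlying positions near the breakpoints do not dominate the estimate.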
Supplementary Figure 15 Using sequins as copy-number reference scales
The t-statistic is determined by a one-way t-test between coverage of repeats (A, B, C and D at 1, 2, 4 and 8 copies, respectively) relative to background (calibrated to B repeat). (a) Frequency distribution illustrates the performance of the t-statistic in relation to an 8-fold shift in mean coverage established by the copy-number sequin scale. (b) Frequency distribution illustrates the t-statistic assigned to previously annotated variants in the NA12878 genome for comparison to the copy-number sequin scale above. (d) Scatter plot illustrates the observed abundance of the quantitative DNA scale. The inflection point (dashed line) indicates the lower limit (orange) beyond which copy-number elements are not detected or are inaccurately measured. Above this line sequins form a quantitative DNA scale (red). For comparison, the corresponding abundances of representative repeat DNA features are indicated (right blue/green panel).
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–15 (PDF 2963 kb)
Cite this article
Deveson, I., Chen, W., Wong, T. et al. Representing genetic variation with synthetic DNA standards. Nat Methods 13, 784–791 (2016). https://doi.org/10.1038/nmeth.3957
DOI: https://doi.org/10.1038/nmeth.3957