Representing genetic variation with synthetic DNA standards

Deveson, Ira W; Chen, Wendy Y; Wong, Ted; Hardwick, Simon A; Andersen, Stacey B; Nielsen, Lars K; Mattick, John S; Mercer, Tim R

doi:10.1038/nmeth.3957

Article
Published: 08 August 2016

Ira W Deveson^1,2,na1,
Wendy Y Chen^1,3,na1,
Ted Wong^1,
Simon A Hardwick^1,3,
Stacey B Andersen^4,
Lars K Nielsen^4 (ORCID: orcid.org/0000-0001-8191-3511),
John S Mattick^1,3 (ORCID: orcid.org/0000-0002-7680-7527) &
…
Tim R Mercer^1,3

Nature Methods volume 13, pages 784–791 (2016)

Abstract

The identification of genetic variation with next-generation sequencing is confounded by the complexity of the human genome sequence and by biases that arise during library preparation, sequencing and analysis. We have developed a set of synthetic DNA standards, termed 'sequins', that emulate human genetic features and constitute qualitative and quantitative spike-in controls for genome sequencing. Sequencing reads derived from sequins align exclusively to an artificial in silico reference chromosome, rather than the human reference genome, which allows them to be partitioned for parallel analysis. Here we use this approach to represent common and clinically relevant genetic variation, ranging from single nucleotide variants to large structural rearrangements and copy-number variation. We validate the design and performance of sequin standards by comparison to examples in the NA12878 reference genome, and we demonstrate their utility during the detection and quantification of variants. We provide sequins as a standardized, quantitative resource against which human genetic variation can be measured and diagnostic performance assessed.

**Figure 1: Sequin standards representing genetic variation.**

**Figure 2: Assessing quantitative variability within and between sequin mixtures.**

**Figure 3: Validation of DNA sequins against the NA12878 reference genome.**

**Figure 4: Using sequins to measure somatic variant-allele frequency.**

**Figure 5: Representing structural variation with DNA sequins.**

**Figure 6: Representing copy-number variation and repeat DNA with sequins.**

Accession codes

Primary accessions

BioProject

PRJNA321434

Referenced accessions

European Nucleotide Archive

ERS179577

NCBI Reference Sequence

Acknowledgements

We thank our colleagues M. Cowley and M. Pinese for useful discussions and K. Ying for assistance in bioinformatic pipelines. The authors would like to thank the following funding sources: Australian National Health and Medical Research Council (NHMRC) Australia Fellowship (1062470 to T.R.M. and 1062606 to W.Y.C.). I.W.D. and S.A.H. are supported by Australian Postgraduate Award scholarships. The contents of the published material are solely the responsibility of the administering institution, a participating institution or individual authors, and they do not reflect the views of NHMRC.

Author information

Ira W Deveson and Wendy Y Chen: These authors contributed equally to this work.

Authors and Affiliations

Genomics and Epigenetics Division, Garvan Institute of Medical Research, New South Wales, Australia
Ira W Deveson, Wendy Y Chen, Ted Wong, Simon A Hardwick, John S Mattick & Tim R Mercer
School of Biotechnology and Biomolecular Sciences, Faculty of Science, The University of New South Wales, New South Wales, Australia
Ira W Deveson
St. Vincents Clinical School, Faculty of Medicine, The University of New South Wales, New South Wales, Australia
Wendy Y Chen, Simon A Hardwick, John S Mattick & Tim R Mercer
Australian Institute for Bioengineering and Nanotechnology, University of Queensland, Queensland, Australia
Stacey B Andersen & Lars K Nielsen

Contributions

T.R.M. conceived the project, designed sequins and synthetic chromosome and conceived experiments. W.Y.C. and S.B.A. prepared sequins and performed experiments. I.W.D., T.W. and T.R.M. performed data analysis. I.W.D., S.A.H., L.K.N., J.S.M. and T.R.M. prepared the manuscript.

Corresponding author

Correspondence to Tim R Mercer.

Ethics declarations

Competing interests

The Garvan Institute of Medical Research has filed patent applications on some techniques described in this study.

Integrated supplementary information

Supplementary Figure 1 Workflow for use and analysis of DNA sequin standards in genome sequencing.

(From top) Sequins are added to a genomic DNA sample and undergo concurrent library preparation and sequencing. Reads representing genomic DNA or synthetic sequin DNA are distinguished by their alignment to the reference human genome (left) or the artificial in silico reference chromosome (colored), respectively. Sequin-derived alignments undergo independent analysis in parallel to assess the performance of variant detection and the accuracy of allele-frequency quantification, and to facilitate inter-sample comparisons and/or normalization. Multiple aspects of the workflow, including both laboratory (library preparation and sequencing) and bioinformatic (alignment and analysis) processes, can be assessed using sequins, with examples indicated (dashed boxes).
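The read-partitioning step of this workflow can be illustrated with a minimal Python sketch. It assumes that alignments against a combined Hg38 and chrIS_D index have already been reduced to (read name, reference name) pairs; the function name `partition_alignments` and the input shape are hypothetical and are not taken from the authors' software.

```python
from typing import Iterable, Optional, Tuple, Dict, List


def partition_alignments(
    alignments: Iterable[Tuple[str, Optional[str]]],
    sequin_contig: str = "chrIS_D",
) -> Dict[str, List[str]]:
    """Split read names by the reference sequence they aligned to.

    Reads on `sequin_contig` are treated as sequin-derived, reads with no
    reference name (None, or "*" as used in SAM records) as unmapped, and
    everything else as genomic DNA.
    """
    parts: Dict[str, List[str]] = {"sequin": [], "genome": [], "unmapped": []}
    for read_name, contig in alignments:
        if contig is None or contig == "*":
            parts["unmapped"].append(read_name)
        elif contig == sequin_contig:
            parts["sequin"].append(read_name)
        else:
            parts["genome"].append(read_name)
    return parts
```

Once the two read sets are separated in this way, the sequin-derived set can be analyzed on its own without touching the genomic alignments.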

Supplementary Figure 2 Design workflow for artificial in silico chromosome and DNA sequins

(a) An artificial in silico chromosome sequence was designed to represent features of a real human chromosome with artificial, non-homologous primary sequences. To generate a background sequence into which synthetic variants could be incorporated, the human genome (Hg38) sequence was binned into 100 kb windows, which were ranked according to gene density, repeat density and GC content, then systematically sampled to ensure proportional representation. To abolish homology to Hg38, sequences were inverted and, where necessary, subjected to local shuffling, nucleotide substitution and/or manual curation. Annotated instances of small and large scale genetic variation from human populations were systematically sampled and incorporated, along with their flanking genome sequences, into the artificial chromosome, with homology being abolished as above. (b) Artificial sequences (background and genetic variants) were assembled into in silico chromosome sequences. Three in silico chromosomes were assembled; the first sequence (chrIS_D) contains no genetic variation and corresponds to the consensus reference sequence, analogous to the Hg38 reference sequence. The remaining two sequences emulate diploid human genotypes, with homozygous variation encoded throughout both sequences and heterozygous variation encoded in only one of the sequences. These form the template for synthetic DNA sequin synthesis. (c) Sequins are short (<10 kb) stretches of synthetic DNA that encode mutations with respect to the sequence of chrIS_D, whose sequence remains in silico. Sequin sequences were flanked by SapI Type II restriction digest sites to enable their excision and purification from the host (pMA) vector. Purified sequins were combined at desired concentrations before library preparation and sequencing. The resulting sequencing reads align to chrIS_D and identify instances of genetic variation with respect to this reference sequence.
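The windowing and inversion steps in (a) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: "inverted" is taken here to mean reversing the base order without complementing, ranking is shown for GC content only (gene and repeat density would need annotation tracks), and all function names are hypothetical.

```python
from typing import Iterator, List, Tuple


def gc_content(seq: str) -> float:
    """Fraction of G/C among the A, C, G and T bases (other symbols ignored)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else 0.0


def bin_windows(seq: str, size: int = 100_000) -> Iterator[Tuple[int, str]]:
    """Yield (start, window) for consecutive non-overlapping windows."""
    for start in range(0, len(seq), size):
        yield start, seq[start:start + size]


def rank_by_gc(seq: str, size: int = 100_000) -> List[Tuple[int, float]]:
    """Return (window start, GC fraction) pairs sorted by ascending GC."""
    scored = [(start, gc_content(window)) for start, window in bin_windows(seq, size)]
    return sorted(scored, key=lambda item: item[1])


def invert(seq: str) -> str:
    """Reverse the base order without complementing (assumed meaning of 'inverted')."""
    return seq[::-1]
```

Reversal leaves base composition, and therefore GC content, unchanged while altering the primary sequence, which fits the stated aim of keeping human-like features while removing homology.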

Supplementary Figure 3 Representing human genetic variation within DNA sequins

(a) Large scale (1.5 Mb) genome browser view of sequins mapping within chrIS_D, in which eight individual sequins are visible, as well as a closer view of a single sequin, encoding six synthetic variants. (b) Histograms indicate spacing between synthetic variants encoded within sequins, with a density distribution for bona fide variants in the NA12878 genome provided for comparison. (c,d) Common genetic variation within human populations was proportionately sampled and incorporated into chrIS_D. Common genetic variation was also supplemented with an additional set of InDels that pose a challenge to detection via short-read sequencing. Histograms indicate the relative frequency of nucleotide substitutions (c) and the distribution of InDel sizes (d) within sequins (blue) and human populations (dbSNP 141; grey). (e) Examples of variants residing in low-complexity and simple-repeat sequences that could not be represented in sequins due to synthesis difficulties and/or cross-alignment to the human genome.

Supplementary Figure 4 Assessment of cross-alignment between sequin and human DNA libraries

When aligned to a combined reference index containing Hg38 and chrIS_D, all libraries (simulated and experimental) show no cross-alignment contamination. Forcing cross-alignment requires (i) performing alignment in the absence of either chrIS_D or Hg38 and (ii) reducing the mapping penalty associated with mismatching bases (BWA mem default penalty = 4). This also results in a high rate of erroneous alignment. Histograms illustrate the proportion of unique alignments (blue), multiple/ambiguous alignments (grey) and unmapped reads (orange) from simulated (a) and real (b) libraries when aligned to either chrIS_D or Hg38, as mismatch-penalty rates are reduced from 4 (default) to 0.

Supplementary Figure 5 Sources of variability in observed sequin abundances within and between mixtures

Sequence-specific biases during PCR amplification, library preparation, sequencing and alignment contribute to systematic variation (assumed to be reproducible between independent mixtures) in the observed abundances of sequins, while different mixtures exhibit independent variation, resulting from aliquot variability (pipetting error). (a) The observed abundances (median per-base coverage) of individual sequin standards (normalized to median abundance of all standards within a mixture) are compared between five independent mixtures, in which all standards were theoretically incorporated at equimolar concentrations. The vector directions associated with either systematic or independent variation are indicated. (b) Histograms indicate observed abundances (normalized to median) of sequin standards in five independent mixtures (matched order), before and after correction to remove systematic biases. Standard deviations indicate the magnitude of variability within each, with this being considerably reduced following systematic-bias correction.
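One way to picture the correction described in this legend is the sketch below. It normalizes each mixture to its own median and then divides out a per-sequin bias estimated across mixtures. The legend does not state how the systematic component was estimated; using the median normalized abundance across mixtures is an assumption made here for illustration, and the function names are hypothetical.

```python
from statistics import median
from typing import Dict, List


def normalize_to_median(coverage: Dict[str, float]) -> Dict[str, float]:
    """Scale per-sequin coverage by the median coverage of the mixture."""
    centre = median(coverage.values())
    return {name: value / centre for name, value in coverage.items()}


def correct_systematic_bias(mixtures: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Remove the component of variation shared by all mixtures.

    The per-sequin bias is estimated as the median normalized abundance of
    that sequin across mixtures (an assumption, not the published method).
    """
    normalized = [normalize_to_median(m) for m in mixtures]
    bias = {name: median(n[name] for n in normalized) for name in normalized[0]}
    return [{name: n[name] / bias[name] for name in n} for n in normalized]
```

After correction, what remains in each mixture is the independent component, which the legend attributes to aliquot variability.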

Supplementary Figure 6 Sequencing and alignment quality for sequenced reads derived from sequin or NA12878 genomic DNA

(a, b) FastQC reports for NA12878 DNA and sequin-derived reads, showing phred quality scores as a function of nucleotide position within sequenced reads (a) and on a per-read basis (b). (c) Population distributions show the frequency of single-base substitution errors in sequenced reads from sequins or NA12878 genomic DNA.

Supplementary Figure 7 Calibrating sequencing coverage of sequins to match NA12878 genome

(a) Alignments to chrIS_D were subsampled to achieve coverage matched to the accompanying NA12878 genome. Frequency distributions illustrate coverage (per nucleotide) for sequins (i) before (right; blue) and (ii) after (left; blue) subsampling and for (iii) the accompanying NA12878 genome at maximum sequencing depth (red). Box-whisker plots (above) summarize median and interquartile ranges for frequency distributions. (b) Plots show mean coverage (+/- SD) across the length of sequin standards (left; n=36) in comparison to coverage across matched ~1 kb windows in the NA12878 genome (right; n=36). Terminal regions of sequins where edge effects impact coverage are indicated (dashed lines) and were excluded during coverage calibration and analysis.
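The calibration in (a) amounts to keeping a fraction of sequin alignments equal to the ratio of genome coverage to sequin coverage. The following sketch shows that arithmetic with a seeded random subsample; the function names and the list-based representation of alignments are hypothetical simplifications.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample_fraction(sequin_coverage: float, genome_coverage: float) -> float:
    """Fraction of sequin alignments to keep so that coverage matches the genome."""
    if sequin_coverage <= 0:
        raise ValueError("sequin coverage must be positive")
    # Never keep more than everything: cap at 1.0 when sequins are under-covered.
    return min(1.0, genome_coverage / sequin_coverage)


def subsample(alignments: Sequence[T], fraction: float, seed: int = 0) -> List[T]:
    """Keep each alignment independently with probability `fraction`."""
    rng = random.Random(seed)
    return [aln for aln in alignments if rng.random() < fraction]
```

For example, sequins covered at 300-fold alongside a 30-fold genome would be subsampled with a fraction of 0.1.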

Supplementary Figure 8 Reference scale to measure allele frequencies

Sequin standards representing variant alleles were titrated at a two-fold serial dilution relative to reference allele standards to establish a ladder of allele frequencies from 1:1 (i.e., heterozygous) to 1:4096. (a) Scatter plot illustrates the mixture structure used to establish the variant allele frequency reference scale. (b,c) Observed alignment counts are plotted against expected allelic fractions for synthetic sequin variants, with SNVs (b) and InDels (c) shown separately. (d) Genome browser views provide illustrative examples of synthetic variants at 0.5, 0.25, 0.125 and 0.0625 allelic fractions, with alignments (grey) containing the variant indicated.
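The two-fold ladder can be written out explicitly. The sketch below follows the allelic fractions quoted in panel (d) (0.5, 0.25, 0.125, ...) and therefore treats each rung as a fraction of 1/2^k, ending at 1/4096; reading "1:4096" as a fraction rather than a ratio is an assumption, and both function names are hypothetical.

```python
from fractions import Fraction
from typing import List


def allele_fraction_ladder(steps: int = 12) -> List[Fraction]:
    """Expected variant allele fractions 1/2, 1/4, ..., 1/2**steps."""
    return [Fraction(1, 2 ** k) for k in range(1, steps + 1)]


def observed_fraction(variant_reads: int, reference_reads: int) -> float:
    """Observed variant allele fraction from supporting read counts."""
    total = variant_reads + reference_reads
    return variant_reads / total if total else 0.0
```

Plotting `observed_fraction` against the corresponding rung of `allele_fraction_ladder()` gives the kind of expected-versus-observed comparison shown in panels (b) and (c).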

Supplementary Figure 9 Impact of library depth on variant detection sensitivity

Sequins constitute an internal standard that can be used to inform the filtering stringency required to achieve sensitivity and specificity according to sample-specific variation, such as library depth. (a, b) Read counts supporting reference (grey) and variant alleles (red and blue) are plotted against expected allelic fractions (1/2 – 1/4096) at 5,000-, 1,000- and 250-fold sequencing depth. Variants were called at a minimum confidence threshold (blue; p-value indicated) that was empirically determined within each sampled library in order to maximize sensitivity without compromising precision. (c, d) Frequency distributions illustrate variant detection sensitivity at different library depths. Whilst a linear relationship between library depth and detection sensitivity is apparent for SNVs (c), a more complicated relationship, due to the confounding impact of erroneous alignments, is apparent for InDels (d).
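The dependence of sensitivity on depth follows from simple arithmetic: the expected number of variant-supporting reads is the depth multiplied by the allele fraction. The sketch below makes that explicit for the three depths named in the legend. The `min_reads` cut-off is a hypothetical stand-in for a caller's statistical threshold and is not a value from the study.

```python
from typing import Optional


def expected_variant_reads(depth: float, allele_fraction: float) -> float:
    """Expected number of reads carrying the variant at a given depth."""
    return depth * allele_fraction


def lowest_supported_fraction(
    depth: float, min_reads: float = 2.0, steps: int = 12
) -> Optional[float]:
    """Smallest ladder fraction 1/2**k expected to yield at least `min_reads` reads.

    Returns None when even the 1/2 rung falls below `min_reads`.
    """
    supported = [
        1 / 2 ** k for k in range(1, steps + 1) if depth / 2 ** k >= min_reads
    ]
    return min(supported) if supported else None


if __name__ == "__main__":
    for depth in (5000, 1000, 250):
        print(depth, lowest_supported_fraction(depth))
```

Under this toy threshold, lower depths lose the rarest rungs of the ladder first, which is the qualitative trend the legend describes for SNVs.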

Supplementary Figure 10 Evaluating filtering strategies for somatic variant detection

Filtering strategies based on multiple features are commonly used to distinguish true-positive variants (red) and false-positive variants (blue). Cumulative frequency plots illustrate the distribution of (a) read frequencies, (b) quality score at identified variant nucleotide(s), (c) confidence p-value ascribed by VarScan2 (derived from Fisher exact test on reference and variant alignment counts). (d) Box-whisker plot shows the fraction (+/- min and max range) of alignments in forward relative to reverse orientation. (e) Read counts supporting reference (grey) and variant alleles (blue and purple) for synthetic InDels are plotted against expected variant allele fractions. InDels in blue were called with a minimum confidence of p < 10^-6 (Fisher exact test), while green variants failed to reach this threshold. (f) Frequency distributions describe the sensitivity (blue) and precision (red) of synthetic InDel detection and the accumulation of false-positive calls (grey) as a function of variant calling confidence threshold (Fisher exact test). (g) Limit of detection plot indicates the confidence with which synthetic InDels were detected relative to their expected allele frequencies. False-positive InDel calls (grey) are included for comparison.
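Because the true set of synthetic variants is known, sensitivity and precision at any confidence threshold, as in panel (f), can be computed directly. The following sketch assumes each call is represented as a (p-value, is-true-variant) pair; this representation and the function name are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def sensitivity_precision(
    calls: Iterable[Tuple[float, bool]], n_expected: int, threshold: float
) -> Tuple[float, Optional[float]]:
    """Sensitivity and precision of calls passing a p-value threshold.

    `calls` holds (p_value, is_true_variant) pairs; `n_expected` is the number
    of synthetic variants known to be present. Precision is None when no call
    passes the threshold, since it is undefined in that case.
    """
    passed = [is_true for p_value, is_true in calls if p_value <= threshold]
    true_pos = sum(passed)
    sensitivity = true_pos / n_expected if n_expected else 0.0
    precision = true_pos / len(passed) if passed else None
    return sensitivity, precision
```

Sweeping `threshold` over a range of values traces out the trade-off between the two curves.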

Supplementary Figure 11 Large deletions in the in silico chromosome and NA12878 genome

(a) Genome browser examples of large deletions illustrating characteristic coverage profiles for discordant-pair (orange; alignment position/orientation of paired reads violates expected relationship) and split-read (green; partial alignment of single read to two positions) alignments at synthetic breakpoint sites in chrIS_D and deletion sites previously annotated in the NA12878 genome (Personalis 1000 Genomes). (b) Normalized density of 3’ split-alignment termini, in forward (blue) and reverse (red) orientation, is plotted with respect to synthetic variant breakpoint sites (mean, n=3). (c) The accumulation of split-read (green) or discordant-pair (yellow) alignments is plotted against sequencing depth (x-fold coverage). Mean indicated, error bars = SD, n = 3.

Supplementary Figure 12 Large novel insertions in the in silico chromosome and NA12878 genome

(a) Genome browser examples of large novel insertions illustrating characteristic coverage profiles for hanging-read alignments (partnered read is unmapped) in forward (blue) and reverse (red) orientation at synthetic insertion sites in chrIS_D and insertion sites previously annotated in the NA12878 genome (Spiral Genetics). (b) Normalized density of 3’ hanging-alignment termini, in forward (blue) and reverse (red) orientation, is plotted with respect to synthetic insertion breakpoint sites (mean, n=3 for each class). (c) The accumulation of hanging-reads in forward and reverse orientations is plotted against sequencing depth (x-fold coverage).

Supplementary Figure 13 Mobile element insertions in the in silico chromosome

Sequins were constructed to represent examples of mobile-element insertion events (n = 3; based on Alu, MIR and L2 repeat classes) at non-annotated positions within chrIS_D. (a) Genome browser examples show characteristic coverage profiles for discordant-pair alignments in forward (blue) and reverse (red) orientation at synthetic mobile insertion sites. (b) Normalized density of 3’ discordant-pair alignment termini, in forward (blue) and reverse (red) orientation, is plotted with respect to synthetic mobile element insertion breakpoint sites (mean, n = 3 for each class). (c) The accumulation of hanging-reads in forward and reverse orientation is plotted against sequencing depth (x-fold coverage).

Supplementary Figure 14 Copy-number amplification in the in silico chromosome and NA12878 genome

Sequins were designed to represent copy number amplifications of different sizes and amplification numbers (6 × 95 bp, 4 × 202 bp and 2 × 621 bp) with respect to a single copy encoded in the artificial in silico chromosome. (a) Genome browser profile illustrates characteristic coverage by standard (grey), discordant- (orange) and split- (green) alignments at breakpoint (dashed line) in chrIS_D. (b) Scatter-plots indicate normalized standard alignment coverage within amplified region (green) and flanking regions (grey), relative to sequencing depth. The fold-difference in coverage observed between amplified region and flanking regions scales proportionally to copy number, as indicated below.
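The proportionality described in (b) suggests a simple estimate: copy number is approximately the coverage within the amplified region divided by the coverage in the flanking single-copy regions. The sketch below uses medians of per-base coverage, which is an assumption chosen for robustness to outliers; the function name is hypothetical.

```python
from statistics import median
from typing import Sequence


def estimate_copy_number(
    amplified_coverage: Sequence[float],
    flanking_coverage: Sequence[float],
    flanking_copies: float = 1.0,
) -> float:
    """Estimate copy number from the coverage ratio of amplified to flanking regions.

    `flanking_copies` is the copy number assumed for the flanking sequence
    (1 for a single copy, as encoded in the in silico chromosome).
    """
    flank = median(flanking_coverage)
    if flank <= 0:
        raise ValueError("flanking coverage must be positive")
    return median(amplified_coverage) / flank * flanking_copies
```

For instance, a median coverage of 120 inside the amplified region against 30 in the flanks corresponds to an estimated four copies.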

Supplementary Figure 15 Using sequins as copy-number reference scales

The t-statistic is determined by a one-way t-test between coverage of repeats (A, B, C and D at 1, 2, 4 and 8 copies, respectively) relative to background (calibrated to B repeat). (a) Frequency distribution illustrates the performance of the t-statistic in relation to an 8-fold shift in mean coverage established by the copy-number sequin scale. (b) Frequency distribution illustrates the t-statistic assigned to previously annotated variants in the NA12878 genome for comparison to the copy-number sequin scale above. (d) Scatter plot illustrates the observed abundance of the quantitative DNA scale. The inflection point (dashed line) indicates the lower limit (orange) beyond which copy-number elements are not detected or are inaccurately measured. Above this line sequins form a quantitative DNA scale (red). For comparison, the corresponding abundances of representative repeat DNA features are indicated (right blue/green panel).
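As a rough illustration of comparing repeat coverage with background coverage, the sketch below computes a two-sample t-statistic. The legend does not specify the exact form of the test, so Welch's unequal-variance statistic is used here as a stand-in, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def welch_t_statistic(repeat_cov: Sequence[float], background_cov: Sequence[float]) -> float:
    """Welch's t-statistic for the difference in mean coverage.

    t = (mean_r - mean_b) / sqrt(var_r / n_r + var_b / n_b), using sample
    variances. Each input needs at least two observations.
    """
    n_r, n_b = len(repeat_cov), len(background_cov)
    standard_error = sqrt(variance(repeat_cov) / n_r + variance(background_cov) / n_b)
    return (mean(repeat_cov) - mean(background_cov)) / standard_error
```

A larger positive statistic indicates coverage in the repeat that exceeds background by more than the sampling noise would explain, which is the quantity the sequin scale is used to calibrate.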

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–15 (PDF 2963 kb)

About this article

Cite this article

Deveson, I., Chen, W., Wong, T. et al. Representing genetic variation with synthetic DNA standards. Nat Methods 13, 784–791 (2016). https://doi.org/10.1038/nmeth.3957

Received: 23 March 2016
Accepted: 28 June 2016
Published: 08 August 2016
Issue Date: September 2016
DOI: https://doi.org/10.1038/nmeth.3957
