Genome-wide profiling of heritable and de novo STR variations

Abstract

Short tandem repeats (STRs) are highly variable elements that play a pivotal role in multiple genetic diseases, population genetics applications, and forensic casework. However, it has proven problematic to genotype STRs from high-throughput sequencing data. Here, we describe HipSTR, a novel haplotype-based method for robustly genotyping and phasing STRs from Illumina sequencing data, and we report a genome-wide analysis and validation of de novo STR mutations. HipSTR is freely available at https://hipstr-tool.github.io/HipSTR.

Figure 1: Performance of variant callers in STR regions.
Figure 2: Experimental validation of de novo STR mutations.

Accession codes

Accessions

European Nucleotide Archive

Sequence Read Archive

Acknowledgements

Y.E. holds a Career Award at the Scientific Interface from the Burroughs Wellcome Fund. This study was supported by NIJ grant 2014-DN-BX-K089 (T.W., D.Z., A.G., M.G., and Y.E.) and a generous gift by A. Heafy and P. Heafy. Research reported in this publication was supported by the National Human Genome Research Institute of the National Institutes of Health under award number UM1HG008901. We thank Kailos Genetics for providing the 300-bp Illumina sequencing data. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

Contributions

T.W., M.G., and Y.E. designed the HipSTR algorithm and subsequent analyses. T.W. and A.G. implemented the HipSTR software. T.W. and J.Y. performed the analyses. D.Z. experimentally validated the de novo mutations and analyzed the long MiSeq reads. T.W. and Y.E. wrote the manuscript.

Corresponding authors

Correspondence to Thomas Willems or Yaniv Erlich.

Ethics declarations

Competing interests

Y.E. is a consultant for Arc Bio, a company interested in DNA forensics.

Integrated supplementary information

Supplementary Figure 1 Overview of the HipSTR algorithm

In step 1, an expectation maximization algorithm learns the PCR stutter model for the STR of interest. Step 2 utilizes well-anchored alignments that span the STR to identify candidate alleles. HipSTR then builds haplotypes using these alleles and the flanking sequences upstream and downstream of the repeat. In step 3, the PCR stutter model and an HMM are used to align every read to every candidate haplotype and determine the corresponding alignment likelihoods. Step 4 analyzes the likelihoods for all of a sample’s reads to determine its maximum likelihood (ML) genotype. After realigning every read to its sample’s ML genotype, step 5 uses commonly observed stutter artifacts in the alignments to identify new candidate alleles. If any are found, HipSTR returns to step 3 and repeats the process. This iterative procedure continues until no new candidate alleles are identified, at which point the ML genotypes are output to a VCF file. Red dashes and red boxed bases in the rightmost panel indicate deletions or mismatches relative to the candidate haplotype, respectively.
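Step 4 of the overview can be illustrated with a minimal sketch. It assumes that step 3 has already produced, for each read, a log-likelihood under every candidate haplotype, and that a read is equally likely to originate from either haplotype of a diploid genotype. The function name `ml_genotype` and the input layout are hypothetical and are not taken from the HipSTR code base.

```python
import itertools
import math

def ml_genotype(read_logliks):
    """Return the diploid genotype (pair of haplotype indices) that
    maximizes the total read log-likelihood, together with that value.

    read_logliks[i][h] is log P(read i | candidate haplotype h).
    Assumption: each read comes from either haplotype with probability 0.5.
    """
    n_haps = len(read_logliks[0])
    best, best_ll = None, -math.inf
    for a, b in itertools.combinations_with_replacement(range(n_haps), 2):
        total = 0.0
        for row in read_logliks:
            # log(0.5 * P(read | a) + 0.5 * P(read | b)), computed stably
            m = max(row[a], row[b])
            total += m + math.log(0.5 * math.exp(row[a] - m)
                                  + 0.5 * math.exp(row[b] - m))
        if total > best_ll:
            best, best_ll = (a, b), total
    return best, best_ll

# Toy input: five reads favor haplotype 0 and five favor haplotype 2.
supports_0 = [math.log(0.9), math.log(0.05), math.log(0.05)]
supports_2 = [math.log(0.05), math.log(0.05), math.log(0.9)]
genotype, loglik = ml_genotype([supports_0] * 5 + [supports_2] * 5)
print(genotype)  # (0, 2)
```

The exhaustive search over haplotype pairs is adequate for the handful of candidate alleles at a single locus; it is a design simplification for illustration only.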

Supplementary Figure 2 HipSTR alignment models

HipSTR uses two distinct types of alignment models to obtain alignment likelihoods. In regions flanking the STR (top), a pairwise hidden Markov model is used to account for Illumina sequencing errors. In contrast, within STR regions (bottom), HipSTR assumes that the main source of artifacts is stutter. If no stutter error occurs, the likelihood of observing a sequence of characters is given by the agreement between the bases in the read (blue) and the corresponding bases on the haplotype (green). Otherwise, HipSTR assumes that a single stutter indel occurs (orange), that its size is a multiple of the motif length M, and that it is equally likely to have arisen at each position. The alignment likelihood for these scenarios is then obtained by marginalizing over all configurations.
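The STR-block model described in this caption can be sketched as follows. This is a simplified illustration, not HipSTR's implementation: the per-base error rate `err`, the `stutter_probs` table (stutter size in bp mapped to its probability) and the choice to leave inserted bases unscored are all assumptions made for the example.

```python
def match_likelihood(read, hap, err=0.01):
    """Probability of the read bases given equally long haplotype bases."""
    lik = 1.0
    for r, h in zip(read, hap):
        lik *= (1 - err) if r == h else err / 3
    return lik

def str_likelihood(read, hap, motif_len, stutter_probs, err=0.01):
    """Likelihood of the read's STR block given the haplotype's STR block.

    stutter_probs maps a stutter size in bp (negative = deletion) to its
    probability; the remaining probability mass means "no stutter".
    """
    d = len(read) - len(hap)
    if d == 0:
        # No stutter: score base-by-base agreement.
        return (1.0 - sum(stutter_probs.values())) * match_likelihood(read, hap, err)
    if d % motif_len != 0 or d not in stutter_probs:
        # Only a single stutter that is a multiple of the motif length is modeled.
        return 0.0
    if d < 0:
        # Deletion: drop |d| haplotype bases at every possible position.
        liks = [match_likelihood(read, hap[:i] + hap[i - d:], err)
                for i in range(len(hap) + d + 1)]
    else:
        # Insertion: skip d read bases at every possible position
        # (assumption: the inserted bases themselves are not scored).
        liks = [match_likelihood(read[:i] + read[i + d:], hap, err)
                for i in range(len(hap) + 1)]
    # Every position is equally likely, so marginalizing is a plain average.
    return stutter_probs[d] * sum(liks) / len(liks)

stutter = {-2: 0.05, 2: 0.02}
lik_del = str_likelihood("ACACAC", "ACACACAC", 2, stutter)
lik_same = str_likelihood("ACACACAC", "ACACACAC", 2, stutter)
```

In a perfect repeat, every placement of the indel yields the same aligned sequence, so the average equals any single term; the positions only matter when the repeat is interrupted or the read contains mismatches.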

Supplementary Figure 3 Physical phasing of STRs onto SNP scaffolds

When provided with phased SNP haplotypes, HipSTR analyzes reads that overlap heterozygous SNPs to phase the STR genotypes (blue) onto the SNP haplotypes (red). The schematic provides a conceptual outline of how this would work for an AC repeat flanked by two heterozygous SNPs with known phase. For a quantitative description of HipSTR’s phasing methodology, please refer to the Online Methods.
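The idea of physical phasing can be illustrated with a small sketch; the quantitative method is the one given in the Online Methods, not this code. The sketch assumes that each informative read has already been assigned to SNP haplotype 0 or 1 with a fixed confidence, and that its likelihood under each of the two STR alleles (labeled A and B here) is known. All names and the `confidence` parameter are hypothetical.

```python
import math

def phase_str(reads, confidence=0.99):
    """Choose between the phasings A|B and B|A.

    reads: list of (snp_hap, lik_a, lik_b), where snp_hap is 0 or 1 and
    lik_a / lik_b are the read's likelihoods under STR alleles A and B.
    Returns the preferred phasing and its posterior probability,
    assuming both phasings are equally likely a priori.
    """
    ll_ab = ll_ba = 0.0
    for snp_hap, lik_a, lik_b in reads:
        # Probability that the read really comes from SNP haplotype 0 or 1.
        p0, p1 = (confidence, 1 - confidence) if snp_hap == 0 else (1 - confidence, confidence)
        ll_ab += math.log(p0 * lik_a + p1 * lik_b)  # A on haplotype 0, B on haplotype 1
        ll_ba += math.log(p0 * lik_b + p1 * lik_a)  # B on haplotype 0, A on haplotype 1
    phasing = "A|B" if ll_ab >= ll_ba else "B|A"
    posterior = 1.0 / (1.0 + math.exp(-abs(ll_ab - ll_ba)))
    return phasing, posterior

# Reads on SNP haplotype 0 support allele A; reads on haplotype 1 support B.
reads = [(0, 0.9, 0.01), (0, 0.9, 0.01), (1, 0.01, 0.9), (1, 0.01, 0.9)]
phasing, posterior = phase_str(reads)
print(phasing)  # A|B
```

Reads that do not overlap a heterozygous SNP carry no phase information and would simply be left out of the input list.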

Supplementary Figure 4 HipSTR in action

Example alignments for sample NA12878 to a particular Marshfield STR before (top) and after (bottom) processing by HipSTR. While the input alignments have many different indel sizes and mismatches with the reference genome, HipSTR’s algorithm resolves the alignments into two parsimonious STR insertions of 4 and 16 base pairs. Bases highlighted in yellow and red denote mismatches and insertions relative to the reference genome, respectively.

Supplementary Figure 5 Performance of variant callers in STR regions

The accuracy of each tool’s calls is shown as a function of sensitivity for the Marshfield STR panel. Solid and dashed lines denote tools run using default settings and settings optimized for STR genotyping, respectively. This figure provides a zoomed-out version of the comparison in Figure 1.

Supplementary Figure 6 Effect of coverage on STR genotyping accuracy

The accuracy of HipSTR and GATK-HC calls is shown as a function of sensitivity for the Marshfield STR panel. Each line denotes a tool’s performance when run on sequencing data from the Simons Genome Diversity Project downsampled to the indicated coverage. 41x was the median coverage of the original SGDP dataset.

Supplementary Figure 7 An example of STR homoplasy

The figure depicts an STR located at chr6:16429779 in the hg19 reference genome with an AGAT repeat followed by an ACAT repeat. Each subsequent row depicts the maximum likelihood alignment for reads from sample NA12878 after processing with HipSTR. While all of the reads support a 4bp deletion (red dashes), they support two different STR sequences. In the top 9 reads, one AGAT unit is deleted while the ACAT perfectly matches the reference. In the bottom 8 reads, two copies of AGAT are inserted followed by a deletion of three ACAT copies. HipSTR reports genotypes of (AGAT)8(ACAT)9 and (AGAT)11(ACAT)6 for NA12878 at this locus. Bases highlighted in yellow denote mismatches relative to the reference genome.
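The homoplasy in this example can be checked with a few lines of code. The two allele structures are those reported in the caption; the reference structure (AGAT)9(ACAT)9 is inferred from the caption's description of the changes and is an assumption of this sketch, as is the helper name `expand`.

```python
def expand(blocks):
    """Expand [(motif, copies), ...] into the full repeat sequence."""
    return "".join(motif * copies for motif, copies in blocks)

reference = expand([("AGAT", 9), ("ACAT", 9)])   # inferred, see lead-in
allele_1 = expand([("AGAT", 8), ("ACAT", 9)])    # one AGAT unit deleted
allele_2 = expand([("AGAT", 11), ("ACAT", 6)])   # +2 AGAT, -3 ACAT

# Both alleles are 4 bp shorter than the reference ...
print(len(allele_1) - len(reference), len(allele_2) - len(reference))  # -4 -4
# ... yet their sequences differ.
print(allele_1 == allele_2)  # False
```

A genotyper that reports only allele length would call this sample homozygous for a 4bp deletion, whereas a sequence-aware caller can separate the two alleles.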

Supplementary Figure 8 Identification of de novo mutations in the Illumina Platinum Genomes CEPH trio

The figure highlights the pedigree and analyses we used to identify de novo STR mutations. We first searched for STR loci where NA12878 (bold, arrow) had an STR allele not observed in her parents. We then validated these loci using orthogonal Illumina datasets and genotyped a small subset of them using Sanger sequencing for further validation.
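The first filtering step described here, finding loci where the child carries an allele absent from both parents, can be sketched as below. Alleles are written as base-pair differences from the reference, and the trio genotypes are the ones described in the captions of Supplementary Figures 9 to 12. The function name is hypothetical, and the sketch ignores genotype quality, which a real screen would need to consider.

```python
def novel_child_alleles(child, father, mother):
    """Return the child's alleles that are seen in neither parent."""
    parental = set(father) | set(mother)
    return sorted(set(child) - parental)

# (child, father, mother) genotypes in bp relative to the reference.
trios = {
    "chr20:45121567": ((0, -1), (0, 0), (0, 0)),
    "chr19:37201975": ((0, 2), (0, 0), (0, 0)),
    "chr18:27043336": ((3, 6), (0, 3), (3, 3)),
    "chr1:221172558": ((-4, -8), (-8, -8), (-8, -8)),
}
candidates = {locus: novel_child_alleles(*genotypes)
              for locus, genotypes in trios.items()}
print(candidates["chr18:27043336"])  # [6]
```

Loci that pass this check are only candidates; the caption notes that they were then validated with orthogonal Illumina datasets and, for a subset, Sanger sequencing.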

Supplementary Figure 9 High-confidence de novo mutation detected by HipSTR

The figure depicts an STR located at chr20:45121567 in the hg19 reference genome with an A repeat (top row). Each subsequent row depicts a read’s maximum likelihood alignment after processing with HipSTR for the child (NA12878), father (NA12891) and mother (NA12892) in the trio. While the alignments strongly support that both the father and mother match the reference allele, the alignments for NA12878 support the reference allele and a 1bp deletion (red dashes). Only a subset of the alignments for NA12878 (12.5%), NA12891 (50%) and NA12892 (50%) are displayed to facilitate visualization. Bases highlighted in yellow and red denote mismatches and insertions relative to the reference genome, respectively.

Supplementary Figure 10 High-confidence de novo mutation detected by HipSTR

The figure depicts an STR located at chr19:37201975 in the hg19 reference genome with a TG repeat (top row). Each subsequent row depicts a read’s maximum likelihood alignment after processing with HipSTR for the child (NA12878), father (NA12891) and mother (NA12892) in the trio. While the alignments strongly support that both the father and mother match the reference allele, the alignments for NA12878 support both the reference allele and a 2bp insertion. Only a subset of the alignments for NA12878 (12.5%) and NA12892 (50%) are displayed to facilitate visualization. Bases highlighted in yellow and red denote mismatches and insertions relative to the reference genome, respectively.

Supplementary Figure 11 High-confidence de novo mutation detected by HipSTR

The figure depicts an STR located at chr18:27043336 in the hg19 reference genome with an ATC repeat (top row). Each subsequent row depicts a read’s maximum likelihood alignment after processing with HipSTR for the child (NA12878), father (NA12891) and mother (NA12892) in the trio. While the alignments strongly support a 0bp/3bp heterozygous insertion for the father and a 3bp homozygous insertion for the mother, the alignments for NA12878 support a heterozygous 3bp/6bp insertion. Only a subset of the alignments for NA12878 (12.5%), NA12891 (50%) and NA12892 (50%) are displayed to facilitate visualization. Bases highlighted in yellow and red denote mismatches and insertions relative to the reference genome, respectively.

Supplementary Figure 12 High-confidence de novo mutation detected by HipSTR

The figure depicts an STR located at chr1:221172558 in the hg19 reference genome with an ATCT repeat (top row). Each subsequent row depicts a read’s maximum likelihood alignment after processing with HipSTR for the child (NA12878), father (NA12891) and mother (NA12892) in the trio. While the alignments strongly support an 8bp homozygous deletion for both parents (red dashes), the alignments for NA12878 support both a 4bp and 8bp deletion. Only 25% of the alignments for NA12878 are displayed to facilitate visualization. Bases highlighted in yellow denote mismatches relative to the reference genome.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–12 and Supplementary Tables 1–6 (PDF 3371 kb)

Cite this article

Willems, T., Zielinski, D., Yuan, J. et al. Genome-wide profiling of heritable and de novo STR variations. Nat Methods 14, 590–592 (2017). https://doi.org/10.1038/nmeth.4267
