Abundant contribution of short tandem repeats to gene expression variation in humans

Journal name: Nature Genetics
Volume: 48
Pages: 22–29
DOI: 10.1038/ng.3461

Abstract

The contribution of repetitive elements to quantitative human traits is largely unknown. Here we report a genome-wide survey of the contribution of short tandem repeats (STRs), which constitute one of the most polymorphic and abundant repeat classes, to gene expression in humans. Our survey identified 2,060 significant expression STRs (eSTRs). These eSTRs were replicable in orthogonal populations and expression assays. We used variance partitioning to disentangle the contribution of eSTRs from that of linked SNPs and indels and found that eSTRs contribute 10–15% of the cis heritability mediated by all common variants. Further functional genomic analyses showed that eSTRs are enriched in conserved regions, colocalize with regulatory elements and may modulate certain histone modifications. By analyzing known genome-wide association study (GWAS) signals and searching for new associations in 1,685 whole genomes from deeply phenotyped individuals, we found that eSTRs are enriched in various clinically relevant conditions. These results highlight the contribution of STRs to the genetic architecture of quantitative human traits.

Figures

  1. Figure 1: eSTR discovery and replication.

    (a) eSTR discovery pipeline. An association test using linear regression was performed between STR dosage and expression level for every STR within 100 kb of a gene. WGS, whole-genome sequencing; RNA-seq, RNA sequencing; C, covariates; H0, null hypothesis; H1, alternative hypothesis. (b) Quantile-quantile plot showing results of association tests. The gray line gives the expected P-value distribution under the null hypothesis of no association. Black dots give P values for permuted controls, and red dots give the results of the observed association tests. (c) Comparison of eSTR effect sizes as Pearson correlations in the discovery data set versus the replication data set. Red points denote eSTRs whose directions of effect were concordant in the two data sets, and gray points denote eSTRs with discordant directions of effect.
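The association test in panel a can be sketched in a few lines. This is a minimal illustration rather than the authors' pipeline: it assumes one numeric STR dosage per individual, omits the covariates C, and replaces the analytic P value with a permutation test that mimics the permuted controls of panel b. The function names and the toy data are hypothetical.

```python
import random
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("constant input has no defined correlation")
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(dosage, expression, n_perm=1000, seed=0):
    """Two-sided empirical P value for the dosage-expression correlation.

    Shuffling the expression values breaks any true association, which
    mimics the null hypothesis H0 of no association.
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(dosage, expression))
    shuffled = list(expression)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(dosage, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy data (invented for illustration): expression rises with dosage.
dosage = [-4, -2, -2, 0, 0, 0, 2, 2, 4, 6]
expression = [0.1, 0.4, 0.3, 0.9, 1.1, 1.0, 1.6, 1.4, 2.1, 2.4]
r = pearson_r(dosage, expression)
p = permutation_p_value(dosage, expression)
```

The Pearson correlation is used here as the effect size because panel c reports effects in that form; a real analysis would regress out covariates before testing.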

  2. Figure 2: Variance partitioning using linear mixed models.

    (a) The normalized variance in the expression of gene Y was modeled as the contribution to variance of the best eSTR and all common biallelic markers in the cis region (±100 kb from the gene boundaries). (b,c) Heat maps show the joint distributions of the variance explained by eSTRs (x axis) and by the cis region (y axis). Gray lines denote the median variance explained. (b) Variance partitioning across genes with a significant eSTR in the discovery set. (c) Variance partitioning across genes with moderate cis heritability.
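One way to write the decomposition described in panel a is as a linear mixed model. The notation below is an assumption introduced for illustration, since the caption gives no symbols: y is the normalized expression vector (unit variance), x_STR is the standardized dosage of the best eSTR, K is a relatedness matrix built from the common biallelic markers in the cis region and scaled to have unit mean diagonal, and the subscripts STR, b and e label the STR, cis-region and residual components.

```latex
\begin{aligned}
\mathbf{y} &= \beta\,\mathbf{x}_{\mathrm{STR}} + \mathbf{u} + \boldsymbol{\varepsilon},
\qquad \mathbf{u} \sim \mathcal{N}\!\left(\mathbf{0},\, \sigma_b^{2}\mathbf{K}\right),
\qquad \boldsymbol{\varepsilon} \sim \mathcal{N}\!\left(\mathbf{0},\, \sigma_e^{2}\mathbf{I}\right),\\[4pt]
\operatorname{Var}(y) &= \underbrace{\beta^{2}}_{h^{2}_{\mathrm{STR}}}
 + \underbrace{\sigma_b^{2}}_{h^{2}_{b}} + \sigma_e^{2} = 1 .
\end{aligned}
```

Under these assumptions the STR enters as a fixed effect and the remaining cis variants as a random effect, so the two variance components can be read off separately.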

  3. Figure 3: eSTR associations in the context of eSNPs.

    (a) Schematic of the eSTR effect versus the effect when conditioning on the genotype for the lead eSNP. Under the null expectation (H0), the original association (red line) comes from mere tagging of the eSNP. Thus, the eSTR effect disappears when restricting to a group of individuals (dots) with the same eSNP genotype (colored backgrounds). Under the alternative hypothesis (H1), the effect is concordant between the original and conditioned associations. (b) Original eSTR effect versus conditioned eSTR effect. Red points denote eSTRs whose directions of effect were concordant in both data sets, and gray points denote eSTRs with discordant directions of effect. (c) Quantile-quantile plot of P values from ANOVA testing of the explanatory value of eSTRs beyond that of eSNPs. (d) STK33 is an example of a gene for which the eSTR (red rectangle) has strong explanatory value beyond that of the lead eSNP (blue circle), according to ANOVA. When conditioning on individuals who are homozygous for the C allele of the eSNP (bottom left; green dots), the STR dosage still shows a significant effect (bottom right). (e) C11orf24 is an example of a gene for which the eSTR was part of the discovery set but did not pass the ANOVA threshold. After conditioning on individuals who are homozygous for the G allele of the eSNP (bottom left; green dots), the STR effect is lost (bottom right).
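The conditioning logic of panel a can be illustrated with a small sketch. It is a simplified stand-in, not the authors' code: it estimates the STR effect as a least-squares slope within each lead-eSNP genotype class, and the function names and toy data are hypothetical. Under H0 the within-class slopes vanish; under H1 they match the original slope.

```python
from collections import defaultdict
from statistics import mean


def slope(x, y):
    """Least-squares slope of y on x; None when x does not vary."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        return None
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx


def conditioned_slopes(str_dosage, snp_genotype, expression):
    """STR effect on expression within each eSNP genotype class."""
    groups = defaultdict(lambda: ([], []))
    for d, g, e in zip(str_dosage, snp_genotype, expression):
        groups[g][0].append(d)
        groups[g][1].append(e)
    return {g: slope(xs, ys) for g, (xs, ys) in groups.items()}


# Toy data (invented): STR dosage is correlated with the eSNP genotype.
snp = [0, 0, 0, 1, 1, 1, 2, 2, 2]
dosage = [0, 1, 2, 2, 3, 4, 4, 5, 6]

# H0: expression depends only on the eSNP, so the STR merely tags it.
expr_tagging = [1, 1, 1, 2, 2, 2, 3, 3, 3]
# H1: expression depends on the STR dosage itself.
expr_causal = [0.5 * d for d in dosage]

original_h0 = slope(dosage, expr_tagging)
within_h0 = conditioned_slopes(dosage, snp, expr_tagging)
within_h1 = conditioned_slopes(dosage, snp, expr_causal)
```

In the tagging scenario the unconditioned slope is positive but every within-genotype slope is zero, which is the pattern described for C11orf24 in panel e.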

  4. Figure 4: Conservation and epigenetic analysis of eSTR loci.

    (a) Median phyloP conservation score as a function of distance from the STR. Red, eSTR loci; gray, matched control STRs. The inset shows the difference in the phyloP conservation score between eSTRs and matched control STRs as a function of window size around the STR. *P < 0.05, **P < 0.01, ***P < 0.001. (b) Probability that an STR scores as an eSTR in the discovery set as a function of distance from the TSS. eSTRs show clustering around the TSS (black line). Conditioning on the presence of a histone mark (colored lines) significantly modulated the probability that an STR is an eSTR. (c) Enrichment of eSTRs in different chromatin states.

  5. Figure 5: Association of eSTRs with clinical phenotypes.

    (a) Overlap between eSTRs and Crohn's disease GWAS genes (red) versus random subsets of genes (gray) matched to the disease-associated genes on the basis of expression and heritability profiles in LCLs. (b) Quantile-quantile plots of eSTR associations in the TwinsUK data. Only traits with significant (FDR < 0.1) associations are plotted. Closed circles, significant; open circles, not significant. A, albumin; C, C-reactive protein; D, diastolic blood pressure; F, FVC; M, mean corpuscular volume; P, phosphate; U, urea; Ua, uric acid.
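The FDR < 0.1 threshold in panel b implies a multiple-testing correction. The caption does not say which procedure was used; the sketch below assumes the Benjamini-Hochberg step-up procedure, a common choice, and the P values are invented for illustration.

```python
def benjamini_hochberg(p_values, fdr=0.1):
    """Flag tests that are significant at the given FDR (BH step-up)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted P value is <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is called significant.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant


p_values = [0.001, 0.04, 0.012, 0.6, 0.03, 0.9]  # invented values
calls = benjamini_hochberg(p_values, fdr=0.1)
```

In the plot's terms, tests flagged True would correspond to closed circles and the rest to open circles.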

  6. Supplementary Fig. 1: STR genotype errors reduce power to detect eSTR associations.

    (a) Power to detect associations and (b) estimated variance explained for different simulated values of variance explained by the STR (black, observed capillary electrophoresis genotypes; blue, lobSTR genotypes).

  7. Supplementary Fig. 2: Number of STRs tested per gene.

    The histogram gives the number of STRs within 100 kb of each gene that passed quality filters and were included in the eSTR analysis.

  8. Supplementary Fig. 3: Unlinked controls follow the null.

    QQ plot of association tests between random unlinked STRs and genes.
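The coordinates of such a QQ plot can be computed as follows. This is a generic sketch, not the authors' plotting code: it assumes the expected value of the rank-th smallest of n uniform P values, rank / (n + 1), as the null quantile, and the function name is hypothetical.

```python
import math


def qq_points(p_values):
    """Pair each observed -log10(P) with its expectation under a uniform null."""
    n = len(p_values)
    observed = sorted(p_values)
    points = []
    for rank, p in enumerate(observed, start=1):
        expected = rank / (n + 1)  # mean of the rank-th uniform order statistic
        points.append((-math.log10(expected), -math.log10(p)))
    return points


# P values that match the null quantiles exactly fall on the diagonal.
points = qq_points([0.5, 0.25, 0.75])
```

Points that follow the null lie near the diagonal, whereas true associations lift the upper tail above it.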

  9. Supplementary Fig. 4: Validation of eSTR analysis using high-coverage genotype calls.

    (a) Comparison of STR dosage in low-coverage 1000 Genomes calls versus calls from high-coverage targeted sequencing of promoter STRs. Bubble area represents the number of calls at each data point. For reference, the bubble at (−20, −20) represents 176 calls. “0” denotes the reference allele. The transparent bubble in the center represents calls that are homozygous reference in both data sets. (b) Distribution of the sizes of errors for discordant allele calls. The majority of errors (89.4%) are off by one or two repeat units. (c) Comparison of eSTR effect sizes between the low- and high-coverage data sets. Red dots denote eSTRs with concordant effect directions.
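The error-size summary in panel b amounts to a simple tally. The sketch below assumes each allele call is expressed in repeat units relative to the reference allele (0 = reference) and that the two call lists are in matching order; both the representation and the toy calls are assumptions made for illustration.

```python
from collections import Counter


def allele_error_sizes(low_cov_calls, high_cov_calls):
    """Count absolute differences (in repeat units) for discordant allele calls."""
    sizes = Counter()
    for low, high in zip(low_cov_calls, high_cov_calls):
        diff = abs(low - high)
        if diff != 0:  # concordant calls are not errors
            sizes[diff] += 1
    return sizes


def fraction_within(sizes, max_units=2):
    """Share of discordant calls that are off by at most max_units."""
    total = sum(sizes.values())
    if total == 0:
        return 0.0
    close = sum(n for size, n in sizes.items() if size <= max_units)
    return close / total


low_cov = [0, 1, -2, 3, 0, 5]    # invented low-coverage calls
high_cov = [0, 2, -2, 1, 1, 0]   # invented high-coverage calls
sizes = allele_error_sizes(low_cov, high_cov)
share = fraction_within(sizes, max_units=2)
```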

  10. Supplementary Fig. 5: Expression values are moderately reproducible across platforms.

    (a) Distribution of Spearman rank correlation coefficients between gene expression profiles of individuals measured on microarray versus RNA sequencing platforms. (b) Distribution of Spearman rank correlation coefficients between the order of individuals ranked by expression levels across transcripts measured using microarray versus RNA sequencing platforms.
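For reference, a Spearman rank correlation between two platforms can be computed as the Pearson correlation of the ranks. The sketch below is a plain-Python illustration with tied values given their average rank; the expression values are invented.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / (sxx * syy) ** 0.5


microarray = [5.1, 7.3, 6.0, 9.2]  # invented intensities
rna_seq = [10, 80, 35, 400]        # invented read-based values
rho = spearman_rho(microarray, rna_seq)
```

Because only ranks matter, the differing scales of the two platforms do not affect the coefficient.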

  11. Supplementary Fig. 6: Variance partitioning simulations with a single causal SNP.

    Plots show variance partitioning results from simulations in which each gene has a single causal eSNP. (a,b) The distributions of the estimated variance explained by SNPs in the cis region. Black points denote the true value of the variance explained by the causal SNP. (c,d) The distributions of the estimated variance explained by the lead STR. (a,c) The LMM simulations with STRs as fixed effects. (b,d) The LMM simulations with STRs as random effects. (a–d) Red dots denote the average value of the estimator. Red bars denote the median value of the estimator. The figure shows that the median estimates for the lead STRs are largely insensitive to the presence of a strong SNP eQTL.

  12. Supplementary Fig. 7: Variance partitioning simulations with two causal SNPs.

    Plots show variance partitioning results from simulations in which each gene has two causal eSNPs. (a) The distributions of the estimated variance explained by SNPs in the cis region. Black points denote the true value of the variance explained by the causal SNPs. (b) The distributions of the estimated variance explained by the lead STR. Red dots denote the average value of the estimator. Red bars denote the median value of the estimator.

  13. Supplementary Fig. 8: STR genotype errors cause underestimation of the variance explained by the STR.

    The distribution of the observed variance explained by the STR for each simulated value is shown for an LMM analysis conducted using true genotypes (black) versus observed genotypes (blue). In the presence of genotyping errors, the variance explained by the STR is strongly underestimated.

  14. Supplementary Fig. 9: Partitioning variance when treating the STR as a random effect.

    The heat map shows the joint distribution of the variance explained by the STR and by the cis region for each gene. Gray lines give the medians of each distribution.

  15. Supplementary Fig. 10: Enrichment of eSTRs at promoters and enhancers.

    For each distance bin around (a) the TSS and (b) center of H3K27ac peaks, the plot shows the percentage of STRs that were analyzed in that bin that were called as significant eSTRs. (c,d) The number of STRs in each distance bin. Black lines show the number of STRs that were included in our analysis (meaning that they showed sufficient variability and are near genes). Red lines show the number of all STRs in the genome in each bin. Black lines were smoothed by averaging sliding windows of three consecutive data points. In a and b, bins were 10 kb; in c and d, bins were 500 bp.
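The binning and smoothing described in this caption can be sketched as below. The representation is an assumption made for illustration: distances are signed offsets in base pairs from the TSS (or peak center), bins are labeled by their left edge, and the function names and toy data are hypothetical.

```python
def percent_estr_by_bin(distances, is_estr, bin_size):
    """Percentage of analyzed STRs called as eSTRs in each distance bin."""
    totals, hits = {}, {}
    for dist, flag in zip(distances, is_estr):
        # Floor division gives the correct left edge for negative offsets too.
        left_edge = (dist // bin_size) * bin_size
        totals[left_edge] = totals.get(left_edge, 0) + 1
        hits[left_edge] = hits.get(left_edge, 0) + (1 if flag else 0)
    return {edge: 100.0 * hits[edge] / totals[edge] for edge in sorted(totals)}


def smooth3(values):
    """Average each sliding window of three consecutive points."""
    return [sum(values[i:i + 3]) / 3 for i in range(len(values) - 2)]


distances = [-15000, -2000, 500, 3000, 12000, 18000]  # invented offsets (bp)
is_estr = [False, True, True, False, False, True]
percentages = percent_estr_by_bin(distances, is_estr, bin_size=10000)
smoothed = smooth3([1, 2, 3, 4, 5])
```

Normalizing by the number of analyzed STRs in each bin, rather than plotting raw eSTR counts, separates enrichment from the uneven density of STRs shown in panels c and d.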

  16. Supplementary Fig. 11: STRs modulate epigenetic signatures.

    (a) Schematic of the application of GERV to predict histone modification signatures for different STR alleles. For each eSTR (red) and control STR (gray), we measured the magnitude of the slope between the STR allele and the GERV score and then tested whether the magnitudes were significantly different between the two sets. (b) Comparison of the distribution of slope magnitudes for eSTRs (red) and controls (gray).

  17. Supplementary Fig. 12: Enrichment of eSTR genes in GWAS.

    Number of eSTR genes (red dashed line) overlapping GWAS genes for each trait. Gray bars give the distribution of the number of overlapping genes from 1,000 control sets of STRs matched on the basis of expression in LCLs and cis heritability. (RA, rheumatoid arthritis; CAD, coronary artery disease; T1D, type 1 diabetes; T2D, type 2 diabetes.)
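The comparison against control sets can be expressed as an empirical P value. The sketch below is deliberately simplified: control sets are drawn uniformly at random from a gene pool, whereas the caption describes controls matched on expression in LCLs and cis heritability, and that matching is not implemented here. Gene names and the function name are hypothetical.

```python
import random


def empirical_enrichment_p(estr_genes, gwas_genes, control_pool,
                           n_sets=1000, seed=0):
    """One-sided empirical P value for the eSTR/GWAS gene overlap.

    Assumption: controls are sampled uniformly, without the matching on
    expression and cis heritability used in the original analysis.
    """
    rng = random.Random(seed)
    gwas = set(gwas_genes)
    observed = len(set(estr_genes) & gwas)
    size = len(set(estr_genes))
    pool = sorted(set(control_pool))
    at_least = 0
    for _ in range(n_sets):
        control = rng.sample(pool, size)
        if len(gwas.intersection(control)) >= observed:
            at_least += 1
    # Add-one correction avoids reporting an empirical P value of zero.
    return observed, (at_least + 1) / (n_sets + 1)


pool = [f"g{i}" for i in range(200)]          # invented gene universe
gwas = [f"g{i}" for i in range(10)]           # invented GWAS genes
estr = [f"g{i}" for i in range(5)] + [f"g{i}" for i in range(100, 105)]
observed, p_value = empirical_enrichment_p(estr, gwas, pool)
```

The observed overlap plays the role of the red dashed line, and the overlaps of the sampled sets form the gray distribution.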

Accession codes

Referenced accessions

ArrayExpress

Author information

Affiliations

  1. Whitehead Institute for Biomedical Research, Cambridge, Massachusetts, USA.

    • Melissa Gymrek,
    • Thomas Willems,
    • Barak Markus &
    • Yaniv Erlich
  2. Harvard-MIT Division of Health Sciences and Technology, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.

    • Melissa Gymrek
  3. Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA.

    • Melissa Gymrek,
    • Mark J Daly &
    • Alkes L Price
  4. New York Genome Center, New York, New York, USA.

    • Melissa Gymrek,
    • Thomas Willems &
    • Yaniv Erlich
  5. Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.

    • Thomas Willems
  6. Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA.

    • Audrey Guilmatre &
    • Andrew J Sharp
  7. Department of Pediatric Hematology, Robert Debré Hospital, Paris, France.

    • Audrey Guilmatre
  8. Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.

    • Haoyang Zeng
  9. Department of Genetics and Biology, Stanford University, Stanford, California, USA.

    • Stoyan Georgiev &
    • Jonathan K Pritchard
  10. Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts, USA.

    • Mark J Daly
  11. Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA.

    • Alkes L Price
  12. Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA.

    • Alkes L Price
  13. Howard Hughes Medical Institute, Chevy Chase, Maryland, USA.

    • Jonathan K Pritchard
  14. Department of Computer Science, Fu Foundation School of Engineering, Columbia University, New York, New York, USA.

    • Yaniv Erlich
  15. Center for Computational Biology and Bioinformatics, Columbia University, New York, New York, USA.

    • Yaniv Erlich

Contributions

M.G. and Y.E. conceived the study. M.G., T.W., H.Z., B.M. and Y.E. performed analyses. A.G. performed experimental work to generate high-coverage sequencing data for promoter STRs. S.G., M.J.D., A.L.P. and J.K.P. provided statistical input. A.J.S. contributed data and analyses. M.G., T.W. and Y.E. wrote the manuscript.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

PDF files

  1. Supplementary Text and Figures (1,869 KB)

    Supplementary Figures 1–12, Supplementary Note and Supplementary Tables 1–9.

Other

  1. Supplementary Data Set 1: Significant eSTRs (18,436 KB)

    A table of all STR × gene associations at a gene-level FDR of 5%.
