The contribution of repetitive elements to quantitative human traits is largely unknown. Here we report a genome-wide survey of the contribution of short tandem repeats (STRs), which constitute one of the most polymorphic and abundant repeat classes, to gene expression in humans. Our survey identified 2,060 significant expression STRs (eSTRs). These eSTRs were replicable in orthogonal populations and expression assays. We used variance partitioning to disentangle the contribution of eSTRs from that of linked SNPs and indels and found that eSTRs contribute 10–15% of the cis heritability mediated by all common variants. Further functional genomic analyses showed that eSTRs are enriched in conserved regions, colocalize with regulatory elements and may modulate certain histone modifications. By analyzing known genome-wide association study (GWAS) signals and searching for new associations in 1,685 whole genomes from deeply phenotyped individuals, we found that eSTRs are enriched in various clinically relevant conditions. These results highlight the contribution of STRs to the genetic architecture of quantitative human traits.
- Supplementary Figure 1: STR genotype errors reduce power to detect eSTR associations.
(a) Power to detect associations and (b) estimated variance explained for different simulated values of variance explained by the STR (black, observed capillary electrophoresis genotypes; blue, lobSTR genotypes).
- Supplementary Figure 2: Number of STRs tested per gene.
The histogram gives the number of STRs within 100 kb of each gene that passed quality filters and were included in the eSTR analysis.
- Supplementary Figure 3: Unlinked controls follow the null.
QQ plot of association tests between random unlinked STRs and genes.
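As a minimal sketch of how the points of such a QQ plot can be computed, the Python snippet below pairs each sorted observed P value with its expected quantile under the uniform null. The plotting position (i + 1)/(n + 1) is an assumed convention, and the function name `qq_points` is hypothetical; neither is taken from the study.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) -log10 P value pairs, smallest P value first.

    Assumes every P value is strictly greater than zero.
    """
    n = len(p_values)
    observed = sorted(p_values)
    return [
        (-math.log10((i + 1) / (n + 1)), -math.log10(p))
        for i, p in enumerate(observed)
    ]


# P values that sit exactly on the uniform quantiles fall on the diagonal.
for expected, observed in qq_points([0.5, 0.25, 0.75]):
    print(f"expected={expected:.3f} observed={observed:.3f}")
```

Under the null, the points should track the diagonal; systematic departure above it would indicate inflated test statistics.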
- Supplementary Figure 4: Validation of eSTR analysis using high-coverage genotype calls.
(a) Comparison of STR dosage in low-coverage 1000 Genomes calls versus calls from high-coverage targeted sequencing of promoter STRs. Bubble area represents the number of calls at each data point. For reference, the bubble at (−20, −20) represents 176 calls. “0” denotes the reference allele. The transparent bubble in the center represents calls that are homozygous reference in both data sets. (b) Distribution of the sizes of errors for discordant allele calls. The majority of errors (89.4%) are off by one or two repeat units. (c) Comparison of eSTR effect sizes between the low- and high-coverage data sets. Red dots denote eSTRs with concordant effect directions.
- Supplementary Figure 5: Expression values are moderately reproducible across platforms.
(a) Distribution of Spearman rank correlation coefficients between gene expression profiles of individuals measured on microarray versus RNA sequencing platforms. (b) Distribution of Spearman rank correlation coefficients between the order of individuals ranked by expression levels across transcripts measured using microarray versus RNA sequencing platforms.
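The cross-platform comparison rests on the Spearman rank correlation, which is the Pearson correlation of the ranks. The following standard-library Python sketch illustrates the calculation on made-up expression values; the helper names and the numbers are hypothetical and are not drawn from the study data.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical expression of five genes in one individual on two platforms.
microarray = [5.1, 7.3, 2.2, 9.8, 4.0]
rna_seq = [12.0, 30.5, 1.1, 80.2, 9.0]
print(spearman(microarray, rna_seq))
```

Because only ranks enter the calculation, the statistic is unaffected by the different measurement scales of the two platforms.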
- Supplementary Figure 6: Variance partitioning simulations with a single causal SNP.
Plots show variance partitioning results from simulations in which each gene has a single causal eSNP. (a,b) The distributions of the estimated variance explained by SNPs. Black points denote the true value of the variance explained by the causal SNP. (c,d) The distributions of the estimated variance explained by the lead STR. (a,c) The LMM simulations with STRs as fixed effects. (b,d) The LMM simulations with STRs as random effects. (a–d) Red dots denote the average value of the estimator. Red bars denote the median value of the estimator. The figure shows that the median estimates for the lead STRs are largely insensitive to the presence of a strong SNP eQTL.
- Supplementary Figure 7: Variance partitioning simulations with two causal SNPs.
Plots show variance partitioning results from simulations in which each gene has two causal eSNPs. (a) The distributions of the estimated variance explained by SNPs. Black points denote the true value of the variance explained by the causal SNPs. (b) The distributions of the estimated variance explained by the lead STR. Red dots denote the average value of the estimator. Red bars denote the median value of the estimator.
- Supplementary Figure 8: STR genotype errors cause underestimation of the variance explained by the STR.
The distribution of the observed variance explained by the STR for each simulated value is shown for an LMM analysis conducted using true genotypes (black) versus observed genotypes (blue). In the presence of genotyping errors, the variance explained by the STR is strongly underestimated.
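The attenuation described here can be illustrated with a small simulation. The Python sketch below is not the LMM analysis used in the study: it substitutes the R² of a single-predictor linear fit, and the allele distribution, the error model (each allele call off by one repeat unit with a fixed probability) and all parameter values are assumptions chosen for illustration.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation, equal to R^2 of a one-predictor linear fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def simulate(n=5000, h2=0.3, error_rate=0.5, seed=1):
    """Return (R^2 with true genotypes, R^2 with error-prone genotypes)."""
    rng = random.Random(seed)
    alleles = [-2, -1, 0, 1, 2]  # repeat counts relative to the reference
    true_dosage, observed_dosage = [], []
    for _ in range(n):
        a1, a2 = rng.choice(alleles), rng.choice(alleles)
        # Assumed error model: each allele call is off by one repeat unit
        # (up or down) with probability error_rate.
        e1 = rng.choice([-1, 1]) if rng.random() < error_rate else 0
        e2 = rng.choice([-1, 1]) if rng.random() < error_rate else 0
        true_dosage.append(a1 + a2)
        observed_dosage.append(a1 + e1 + a2 + e2)
    # Var(dosage) = 4 for two uniform alleles on {-2..2}, so this beta makes
    # the STR explain a fraction h2 of a unit-variance expression trait.
    beta = (h2 / 4) ** 0.5
    expression = [beta * d + rng.gauss(0, (1 - h2) ** 0.5) for d in true_dosage]
    return r_squared(true_dosage, expression), r_squared(observed_dosage, expression)


r2_true, r2_observed = simulate()
print(f"true genotypes: {r2_true:.3f}, observed genotypes: {r2_observed:.3f}")
```

Because genotyping error adds noise to the predictor but not to the trait, the estimate from observed genotypes is expected to fall below the estimate from true genotypes.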
- Supplementary Figure 9: Partitioning variance when treating the STR as a random effect.
The heat map shows the distribution of the variance explained by the STR and the variance explained by SNPs for each gene. Gray lines give the medians of each distribution.
- Supplementary Figure 10: Enrichment of eSTRs at promoters and enhancers.
For each distance bin around (a) the TSS and (b) center of H3K27ac peaks, the plot shows the percentage of STRs that were analyzed in that bin that were called as significant eSTRs. (c,d) The number of STRs in each distance bin. Black lines show the number of STRs that were included in our analysis (meaning that they showed sufficient variability and are near genes). Red lines show the number of all STRs in the genome in each bin. Black lines were smoothed by averaging sliding windows of three consecutive data points. In a and b, bins were 10 kb; in c and d, bins were 500 bp.
- Supplementary Figure 11: STRs modulate epigenetic signatures.
(a) Schematic of the application of GERV to predict histone modification signatures for different STR alleles. For each eSTR (red) and control STR (gray), we measured the magnitude of the slope between the STR allele and the GERV score and then tested whether the magnitudes were significantly different between the two sets. (b) Comparison of the distribution of slope magnitudes for eSTRs (red) and controls (gray).
- Supplementary Figure 12: Enrichment of eSTR genes in GWAS.
Number of eSTR genes (red dashed line) overlapping GWAS genes for each trait. Gray bars give the distribution of the number of overlapping genes from 1,000 control sets of STRs matched on the basis of expression in LCLs and cis heritability. (RA, rheumatoid arthritis; CAD, coronary artery disease; T1D, type 1 diabetes; T2D, type 2 diabetes.)
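The comparison of an observed overlap against matched control sets can be sketched as an empirical test. In the Python snippet below, the gene identifiers are hypothetical and the add-one correction in the P value is a common convention assumed here; the caption does not state how significance was computed.

```python
def overlap_count(genes, gwas_genes):
    """Number of genes shared between a gene set and the GWAS gene list."""
    return len(set(genes) & set(gwas_genes))


def empirical_p(observed, control_counts):
    """One-sided empirical P value with the common add-one correction."""
    at_least = sum(1 for c in control_counts if c >= observed)
    return (at_least + 1) / (len(control_counts) + 1)


# Hypothetical gene identifiers, for illustration only.
gwas_genes = {"G1", "G2", "G3", "G4"}
estr_genes = {"G1", "G2", "G3", "G9"}
control_sets = [
    {"G1", "G7", "G8", "G9"},
    {"G5", "G6", "G7", "G8"},
    {"G2", "G4", "G6", "G9"},
]

observed = overlap_count(estr_genes, gwas_genes)
controls = [overlap_count(s, gwas_genes) for s in control_sets]
print(observed, controls, empirical_p(observed, controls))
```

With 1,000 control sets, the smallest P value this estimator can return is 1/1,001, which sets the resolution of the test.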
- Supplementary Text and Figures
Supplementary Figures 1–12, Supplementary Note and Supplementary Tables 1–9.
- Supplementary Data Set 1: Significant eSTRs
A table of all STR × gene associations at a gene-level FDR of 5%.