High-throughput mapping of regulatory DNA

Journal name:
Nature Biotechnology
Volume:
34,
Pages:
167–174
Year published:
DOI:
doi:10.1038/nbt.3468

Abstract

Quantifying the effects of cis-regulatory DNA on gene expression is a major challenge. Here, we present the multiplexed editing regulatory assay (MERA), a high-throughput CRISPR-Cas9–based approach that analyzes the functional impact of the regulatory genome in its native context. MERA tiles thousands of mutations across ~40 kb of cis-regulatory genomic space and uses knock-in green fluorescent protein (GFP) reporters to read out gene activity. Using this approach, we obtain quantitative information on the contribution of cis-regulatory regions to gene expression. We identify proximal and distal regulatory elements necessary for expression of four embryonic stem cell–specific genes. We show a consistent contribution of neighboring gene promoters to gene expression and identify unmarked regulatory elements (UREs) that control gene expression but do not have typical enhancer epigenetic or chromatin features. We compare thousands of functional and nonfunctional genotypes at a genomic location and identify the base pair–resolution functional motifs of regulatory elements.

At a glance

Figures

  1. Multiplexed editing regulatory assay (MERA).
    Figure 1: Multiplexed editing regulatory assay (MERA).

    (a) In MERA, a genomically integrated dummy gRNA is replaced with a pooled library of gRNAs through CRISPR-Cas9–based homologous recombination such that each cell receives a single gRNA. Guide RNAs are tiled across the cis-regulatory regions of a GFP-tagged gene locus, and cells are flow cytometrically sorted according to their GFP expression levels. Deep sequencing on each population is used to identify gRNAs preferentially associated with partial or complete loss of gene expression. (b) Zfp42GFP mESCs show uniformly strong GFP expression. After bulk gRNA integration, a subpopulation of cells lose GFP expression partially or completely. These cells are flow cytometrically isolated for deep sequencing. (c,d) Bulk reads for gRNAs are highly correlated between replicates from the Tdgf1 (c) or Zfp42 libraries (d), indicating consistent and replicable integration rates.

  2. MERA enables systematic identification of required cis-regulatory elements for Tdgf1.
    Figure 2: MERA enables systematic identification of required cis-regulatory elements for Tdgf1.

    (a) A genomic view of the 40-kb Tdgf1 proximal regulatory region showing the following in track order from top to bottom. (i) The locations of all 3,621 integrated gRNAs in any one of three biological replicates. (ii) gRNAs enriched in GFPneg cells (red) in any one of three replicates at P < 10−10 using a binomial test as described in the Online Methods; bar height is proportional to the mean log-ratio of GFPneg to bulk reads across replicates. (iii) gRNAs enriched in GFPmedium cells (cyan) in any one of three replicates at P < 10−10 using a binomial test as described in the Online Methods; bar height is proportional to the mean log-ratio of GFPneg to bulk reads across replicates. (iv) Annotated genes. (v) Predicted enhancers (cyan, weak; red, strong). (vi) DNase I hotspot regions. (vii) Transcription factor binding density based on ChIP-seq data. (viii) H3K4me3 ChIP-seq data (blue). Several active regulatory elements coincide with dense clusters of overlapping gRNAs. Numerous gRNAs significantly enriched in the GFPneg population are also observed in regions devoid of regulatory element features (UREs). Genomic regions of interest are shaded, annotated above the plot, and described in further detail in the text. (b) Individual validation of specific gRNAs detected as enriched in the GFPneg population in the MERA assay using the self-cloning CRISPR system. The proportion of cells undergoing GFP loss upon incorporation of a particular gRNA divided by the proportion of cells undergoing GFP loss upon incorporation of GFP-targeting positive control gRNA is plotted against the actual genomic location of the gRNA. Negative controls defined as gRNAs showing no reads in either GFPneg or GFPmedium populations but present in the bulk population are highlighted in red. Error bars indicate experimental variability in two replicates. (c) Correlation of gRNAs significantly enriched in the GFPneg population in fixed-size bins varying from 100 bp to 1 kb for biological replicates in Tdgf1 libraries. (d) Fraction of GFPneg-enriched gRNAs among the different functional genomic categories surrounding the Tdgf1 gene. Error bars show variability due to the three biological replicates.
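
The Figure 2 caption calls a gRNA enriched when a binomial test gives P < 10−10, with the exact procedure deferred to the Online Methods, which are not reproduced here. The following is only a minimal sketch of one plausible reading: it assumes the null hypothesis is that a gRNA's share of the sorted (for example GFPneg) reads equals its share of the bulk reads, and that the test is one-sided. The function names and the read-count dictionaries are hypothetical.

```python
import math

def log_binom_pmf(k, n, p):
    """Log of P(X = k) for X ~ Binomial(n, p)."""
    if p <= 0.0:
        return 0.0 if k == 0 else -math.inf
    if p >= 1.0:
        return 0.0 if k == n else -math.inf
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coeff + k * math.log(p) + (n - k) * math.log1p(-p)

def binom_upper_tail(k, n, p):
    """One-sided P(X >= k), summed in log space to limit underflow."""
    if k <= 0:
        return 1.0
    logs = [log_binom_pmf(i, n, p) for i in range(k, n + 1)]
    peak = max(logs)
    if peak == -math.inf:
        return 0.0
    return min(1.0, math.exp(peak) * sum(math.exp(v - peak) for v in logs))

def enriched_grnas(sorted_reads, bulk_reads, alpha=1e-10):
    """Return {gRNA: (log2 ratio of sorted to bulk read share, p-value)}.

    Assumption: the null success probability is the gRNA's bulk read share.
    """
    n_sorted = sum(sorted_reads.values())
    n_bulk = sum(bulk_reads.values())
    hits = {}
    for grna, bulk in bulk_reads.items():
        if bulk == 0:
            continue  # no bulk reads, so no null proportion can be estimated
        p_null = bulk / n_bulk
        k = sorted_reads.get(grna, 0)
        p_value = binom_upper_tail(k, n_sorted, p_null)
        if p_value < alpha:
            hits[grna] = (math.log2((k / n_sorted) / p_null), p_value)
    return hits
```

A call such as `enriched_grnas(gfp_neg_counts, bulk_counts)` would return only the gRNAs passing the threshold, together with a log-ratio comparable in spirit to the bar heights described in the caption.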

  3. MERA enables systematic identification of required cis-regulatory elements for Zfp42.
    Figure 3: MERA enables systematic identification of required cis-regulatory elements for Zfp42.

    (a) A genomic view of the Zfp42 proximal regulatory region showing the following in track order. (i) The location of all 1,643 integrated gRNAs. (ii) gRNAs enriched in GFPneg cells (red) in any one of four replicates at P < 10−10 using a binomial test as described in the Online Methods; bar height is proportional to the mean log-ratio of GFPneg to bulk reads. (iii) gRNAs enriched in GFPmedium cells (cyan) in any one of four replicates at P < 10−10 using a binomial test as described in the Online Methods; bar height is proportional to the mean log-ratio of GFPneg to bulk reads. (iv) Annotated genes. (v) Predicted enhancers (cyan, weak; red, strong). (vi) DNase I hotspot regions. (vii) Transcription factor binding density based on ChIP-seq data. (viii) H3K4me3 ChIP-seq data. Several active regulatory elements coincide with dense clusters of overlapping gRNAs. Genomic regions of interest are shaded, annotated above the plot and described in further detail in the text. (b) Correlation of gRNAs significantly enriched in the GFPneg population in fixed-size bins varying from 100 bp to 1 kb for biological replicates in Zfp42. (c) Fraction of GFPneg-enriched gRNAs among the different functional genomic categories surrounding the Zfp42 gene. Error bars show variability due to the four biological replicates.
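
The replicate comparison in fixed-size bins (100 bp to 1 kb) can be illustrated with a short sketch. It assumes that each replicate is summarized as a list of genomic positions of its significantly enriched gRNAs and that the correlation is Pearson's; the captions do not state which coefficient was used, and all names here are hypothetical.

```python
import math
from collections import Counter

def bin_counts(positions, bin_size, start, end):
    """Count gRNA positions falling in each fixed-size bin over [start, end)."""
    n_bins = math.ceil((end - start) / bin_size)
    counts = Counter((pos - start) // bin_size
                     for pos in positions if start <= pos < end)
    return [counts.get(i, 0) for i in range(n_bins)]

def pearson(x, y):
    """Pearson correlation; returns NaN when either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

def replicate_correlation(rep_a, rep_b, start, end,
                          bin_sizes=(100, 200, 500, 1000)):
    """Correlate per-bin counts of enriched gRNAs between two replicates."""
    return {size: pearson(bin_counts(rep_a, size, start, end),
                          bin_counts(rep_b, size, start, end))
            for size in bin_sizes}
```

Larger bins tolerate small positional differences between replicates, which is one reason a sweep over bin sizes is informative.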

  4. Functional motif discovery analysis of region-specific mutant genotypes at enhancers reveals required regulatory motifs.
    Figure 4: Functional motif discovery analysis of region-specific mutant genotypes at enhancers reveals required regulatory motifs.

    (a) A schematic of the procedure involved in finding mutations induced by a particular gRNA. (b) Plot showing the genomic regions surrounding two gRNAs at a proximal Tdgf1 enhancer region (gRNAs are shaded) showing overlap with DNase I hotspot and predicted enhancer regions, and transcription factor binding sites for Stat3, Tcfcp2l1 and Sox2. (c) ROC curve for fivefold classification of GFPneg and GFPpos genotypes using mutations within −20 to +20 bp of the gRNA along left and right paired-end reads as features. (d) Motif logo for region mutated by gRNAs with base scores computed as log-ratios of the Hellinger distance of the GFPneg genotypes at a base to the reference base to the Hellinger distance of the GFPpos genotypes at a base to the reference base, caused by Tdgf_gRNA_1 and Tdgf_gRNA_2 along the left paired end read. The location of the Stat3 binding site with its positive-strand motif is shown along the length of the gRNA.
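
The base scores behind the motif logos are described as log-ratios of two Hellinger distances, each measured between the observed base distribution at a position and the reference base. A minimal sketch of that calculation follows. It assumes that genotypes are aligned strings of the same length as the reference, that "-" marks a deleted base, and that a small pseudocount `eps` guards against division by zero; none of these details is given in the captions, and the function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT-"  # "-" stands for a deleted base (assumption)

def base_distribution(genotypes, position):
    """Frequency of each symbol at one aligned position.

    Symbols outside ALPHABET are ignored, so frequencies may sum to < 1.
    """
    counts = Counter(g[position] for g in genotypes)
    total = len(genotypes)
    return [counts.get(symbol, 0) / total for symbol in ALPHABET]

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 to 1)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(p, q)) / 2.0)

def motif_scores(reference, neg_genotypes, pos_genotypes, eps=1e-3):
    """Per-base log2 ratio of GFPneg to GFPpos distance from the reference."""
    scores = []
    for i, ref_base in enumerate(reference):
        ref_dist = [1.0 if symbol == ref_base else 0.0 for symbol in ALPHABET]
        d_neg = hellinger(base_distribution(neg_genotypes, i), ref_dist)
        d_pos = hellinger(base_distribution(pos_genotypes, i), ref_dist)
        scores.append(math.log2((d_neg + eps) / (d_pos + eps)))
    return scores
```

Under this reading, a positive score marks a base that is disrupted more often in GFPneg genotypes than in GFPpos genotypes, which is the signal a functional motif position would be expected to show.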

  5. Functional motif discovery analysis of a URE reveals critical base positions involved in gene regulation.
    Figure 5: Functional motif discovery analysis of a URE reveals critical base positions involved in gene regulation.

    (a) Plot showing the genomic regions surrounding two gRNAs (gRNAs are shaded) showing their lack of active histone modifications, known transcription factor binding sites, predicted enhancers or DNase I hotspots. (b) Receiver operating characteristic (ROC) curve for fivefold classification of GFPneg and GFPpos genotypes using mutations on the right paired-end read within −20 to +20 bp of Tdgf_URE_gRNA2. Unweighted classification (in blue) counts each unique genotype in the test set only once, while weighted classification (red) counts each unique genotype in the test set as many times as the number of reads assigned to it, for calculating sensitivity and specificity. (c) Fraction of unique genotypes in GFPneg and GFPpos populations with mutations at bases along the right paired-end read reveals pattern of cleavage around Tdgf_URE_gRNA2. (d) Motif logo for the region mutated by Tdgf_URE_gRNA2 along the right paired-end read with base scores computed as log-ratios of the Hellinger distance of the GFPneg genotypes at a base to the reference base to the Hellinger distance of the GFPpos genotypes at a base to the reference base.
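
The difference between the unweighted and weighted ROC curves can be made concrete with a small sketch. It assumes each unique test-set genotype is represented by a classifier score, its true label and the number of reads assigned to it; the record layout and function names are hypothetical, and the classifier itself is not shown.

```python
def sensitivity_specificity(records, threshold, weighted):
    """records: iterable of (score, is_gfp_neg, read_count) per unique genotype.

    Unweighted: every genotype counts once.
    Weighted: every genotype counts as many times as it has reads.
    """
    tp = fn = tn = fp = 0
    for score, is_gfp_neg, read_count in records:
        weight = read_count if weighted else 1
        predicted_neg = score >= threshold
        if is_gfp_neg and predicted_neg:
            tp += weight
        elif is_gfp_neg:
            fn += weight
        elif predicted_neg:
            fp += weight
        else:
            tn += weight
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity

def roc_points(records, weighted):
    """Return (false positive rate, sensitivity) pairs, one per distinct score."""
    records = list(records)
    points = []
    for threshold in sorted({score for score, _, _ in records}, reverse=True):
        sensitivity, specificity = sensitivity_specificity(records, threshold, weighted)
        points.append((1.0 - specificity, sensitivity))
    return points
```

When a few genotypes carry most of the reads, the weighted curve is dominated by how well those abundant genotypes are classified, whereas the unweighted curve treats rare and common genotypes alike.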

  6. Local genotypes at an enhancer and a URE dictate Tdgf1 expression phenotype.
    Figure 6: Local genotypes at an enhancer and a URE dictate Tdgf1 expression phenotype.

    (a) Tdgf1 MERA screen ratio of GFPmedium/neg/bulk reads for each gRNA at an upstream enhancer (left) and a downstream URE (right) region. (b) Flow cytometric measurement of Tdgf1GFP expression in clonal cell lines following CRISPR-induced deletion of the shaded regions from a show loss of GFP (first and third plots from left). CRISPR-mediated homology-directed repair (HDR) back to the wild-type genotype induced robust GFP recovery at both loci (second and fourth plots from left). (c) Tdgf1 RNA expression in wild-type mESCs (WT; left), clonal mESC lines with deletions of the enhancer and URE shaded in a (second and third from left), and bulk mESC lines following HDR back to the wild-type genotype (fourth and fifth from left), all normalized to wild-type expression level in two replicates.

  7. Effect of the length of homology arms of guide RNA on background cutting due to unintegrated guide RNA PCR fragments.
    Supplementary Fig. 1: Effect of the length of homology arms of guide RNA on background cutting due to unintegrated guide RNA PCR fragments.

    Homology constructs with a GFP-targeting gRNA were introduced into the cell. In the absence of a dummy-cleaving guide RNA, the homology construct would not be integrated into the ROSA locus, hence any loss of GFP in the cell would be due to this unintegrated construct cutting the target sequence in the GFP gene. Thus, we were able to measure the effects of different lengths of homology arms as percentage GFP-loss due to cutting the target site by unintegrated guide RNA.

  8. CRISPR-Cas9-mediated mutation following homologous recombination into a genomically integrated gRNA cassette.
    Supplementary Fig. 2: CRISPR-Cas9–mediated mutation following homologous recombination into a genomically integrated gRNA cassette.

    Tdgf1GFP mESCs with a ROSA26-integrated U6 promoter dummy gRNA expression cassette express uniformly strong GFP. After electroporation of Cas9, dummy gRNA-targeting gRNA plasmid, and a PCR fragment comprising a gRNA targeting GFP flanked by 120–140 bp homology arms, 30% of cells lose GFP expression. Omission of the dummy gRNA-targeting gRNA plasmid results in minimal GFP loss, showing that homologous recombination of the PCR fragment is required for proper gRNA targeting.

  9. MERA followed by flow cytometry enables isolation of GFP cells at four mESC loci.
    Supplementary Fig. 3: MERA followed by flow cytometry enables isolation of GFP cells at four mESC loci.

    NanogGFP fusion, Rpp25GFP fusion, and Tdgf1GFP mESCs express GFP. After bulk gRNA integration followed by flow cytometry, highly enriched GFPmedium/neg populations can be purified. These populations are then deep sequenced.

  10. Predicting enrichment of gRNAs in GFPneg or GFPmedium populations.
    Supplementary Fig. 4: Predicting enrichment of gRNAs in GFPneg or GFPmedium populations.

    a.) Correlation between bulk reads at all integrated gRNAs in two biological replicates for the NanogGFP line. b.) Reads in the GFPneg population are highly correlated with the bulk reads per GFP-targeting gRNA in a particular replicate of the Tdgf1 population. c.) Reads in the GFPmedium population are correlated with the bulk reads per GFP-targeting gRNA in a particular replicate of the Tdgf1 population. d,e.) Distribution of the log10 ratio of GFPneg to bulk reads for all integrated gRNAs. Blue bars indicate gRNAs not significantly enriched for GFP loss, while red bars indicate the gRNAs predicted as significant. Black asterisks show the positions of the GFP-targeting gRNAs on the x-axis, which tend to lie toward the far right. The black dot shows the position of the dummy gRNA on the x-axis. As examples, the distribution is shown for d.) Tdgf1 Replicate 1 and e.) Zfp42 Replicate 2. f.) The distribution of gRNAs in a 1-kb window centered at the Tdgf1 promoter. Black indicates gRNAs that are integrated but do not cause any significant GFP loss, red indicates gRNAs that are significantly enriched in the GFPneg population, and green indicates gRNAs significantly enriched in the GFPmedium population.

  11. MERA enables systematic identification of required cis-regulatory elements and their relative importance irrespective of putative off-target effects of a few individual guide RNAs in Tdgf1GFP.
    Supplementary Fig. 5: MERA enables systematic identification of required cis-regulatory elements and their relative importance irrespective of putative off-target effects of a few individual guide RNAs in Tdgf1GFP.

    a.) A genomic view of the gRNA distribution along the Tdgf1 proximal regulatory region showing reads for individual gRNAs for a replicate before and after filtering for guide RNAs with off-target effects, DNase I hotspot regions, predicted enhancers (green=weak, red=strong), transcription factor binding density based on ChIP-seq data and histone modifications. Predicted off-target cutting sites for each guide RNA are also shown as a panel (black). Guide RNAs predicted to cause significant GFP loss upon introduction into the Zfp42GFP line (red) are seen to be much fewer and not clustered as in the Tdgf1 library. b,c.) Guide RNAs enriched for GFP loss at the b.) external promoter Lrrc2 and c.) unmarked regulatory region are shown before and after filtering for off-target effects.

  12. MERA enables systematic identification of required cis-regulatory elements and their relative importance irrespective of putative off-target effects of a few individual guide RNAs in Zfp42GFP.
    Supplementary Fig. 6: MERA enables systematic identification of required cis-regulatory elements and their relative importance irrespective of putative off-target effects of a few individual guide RNAs in Zfp42GFP.

    a.) A genomic view of the gRNA distribution along the Zfp42 proximal regulatory region showing reads for individual gRNAs before and after filtering for guide RNAs with off-target effects, DNase I hotspot regions, predicted enhancers (green=weak, red=strong), transcription factor binding density based on ChIP-seq data and histone modifications. Predicted off-target cutting sites for each guide RNA are also shown as a panel (black). Guide RNAs predicted to cause significant GFP loss upon introduction into the Tdgf1GFP line (red) are seen to be much fewer and not clustered as in the Zfp42 library. b.) The Trim12 promoter >150 kb away from the Zfp42 gene shows clusters of guide RNAs significantly enriched in GFPneg and GFPmedium populations even after filtering for off-target effects. c.) Relative importance of various functional categories (Figure 3d) as measured by fraction of GFPneg-enriched gRNAs is invariant upon filtering out gRNAs with off-target effects.

  13. Deriving rules for off-target prediction using the GUIDE-seq assay and MERA-generated data.
    Supplementary Fig. 7: Deriving rules for off-target prediction using the GUIDE-seq assay and MERA-generated data.

    a. Fraction of off-target sequences predicted from GUIDE-seq (total=442, number of guide RNAs=13) with a particular number of mismatches in non-seed (1–8 bp), seed (9–20 bp) or PAM sequence. A maximum of 4 mismatches in the seed and non-seed sequence and a maximum of one mismatch in the PAM (NNG/NGN) can be tolerated. b. Fraction of GUIDE-seq-derived off-target sequences containing 3 adjacent mismatches along the bases of the guide RNA. For an NGG PAM sequence, no triple mismatches are tolerated in the seed region, and for an NGN/NNG PAM sequence, no triple mismatches are tolerated beyond the fifth base of the gRNA. c. Total number of mismatches tolerated is proportional to the total GC content of the guide RNA sequence. For gRNAs with intermediate GC content (10 to 15), seed GC content determines mismatches tolerated. d. True positive rate for rules of off-target prediction with or without GC content adjustment, evaluated as percentage of accurately predicted off-target sites per guide RNA. e. False positive rate for rules of off-target prediction with or without GC content adjustment, evaluated as the number of guide RNAs observed to have no enrichment in the GFPneg population with a predicted off-target effect on a gRNA significantly enriched in the GFPneg population. Analysis is shown within the same library in Tdgf1 and Zfp42 and also in cross-library situations.
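
The mismatch rules summarized in panels a and b can be sketched as a simple filter. This is an illustration under stated assumptions rather than the authors' implementation: it reads "a maximum of 4 mismatches in the seed and non-seed sequence" as a limit on the total across the 20-nt protospacer, treats a triple mismatch as disallowed only when all three adjacent positions lie inside the restricted stretch, and omits the GC-content adjustment of panel c. The function names and the 23-nt site layout (20-nt protospacer followed by a 3-nt PAM) are hypothetical.

```python
def mismatch_positions(guide, site):
    """0-based protospacer positions (bases 1-20) where guide and site differ."""
    return [i for i in range(20) if guide[i] != site[i]]

def pam_class(pam):
    """'NGG' for the canonical PAM, 'alt' for NGN or NNG, None otherwise."""
    second_is_g, third_is_g = pam[1] == "G", pam[2] == "G"
    if second_is_g and third_is_g:
        return "NGG"
    if second_is_g or third_is_g:
        return "alt"  # exactly one PAM mismatch
    return None

def has_triple_run(mismatches, first_index):
    """True if three adjacent mismatches lie entirely at or after first_index."""
    found = set(mismatches)
    return any(i >= first_index and {i, i + 1, i + 2} <= found for i in found)

def is_candidate_off_target(guide, site, max_mismatches=4):
    """Apply the PAM, mismatch-count and triple-mismatch rules to one site."""
    pam = pam_class(site[20:23])
    if pam is None:
        return False
    mismatches = mismatch_positions(guide, site)
    if len(mismatches) > max_mismatches:
        return False
    # NGG: no triple run within the seed (bases 9-20, index 8 onward).
    # NGN/NNG: no triple run beyond the fifth base (index 5 onward).
    restricted_from = 8 if pam == "NGG" else 5
    return not has_triple_run(mismatches, restricted_from)
```

Each candidate genomic site would be passed through `is_candidate_off_target` together with the guide sequence; sites that pass are the ones such a rule set would flag as potential off-targets.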

  14. Effect of introducing a mismatched gRNA library on GFP loss for a particular gene.
    Supplementary Fig. 8: Effect of introducing a mismatched gRNA library on GFP loss for a particular gene.

    Introduction of the sgTdgf1 library into the Tdgf1GFP line, introduction of the sgZfp42 library into the Tdgf1GFP line, introduction of the sgTdgf1 library into the Zfp42GFP line, and introduction of the sgZfp42 library into the Zfp42GFP line (from left to right clockwise).

  15. Comparison of cis-regulatory programs across two genes, Nanog and Rpp25.
    Supplementary Fig. 9: Comparison of cis-regulatory programs across two genes, Nanog and Rpp25.

    a.) A genomic view of the gRNAs designed for various regions expected to be involved in regulation of Nanog, including two distal regions predicted from PolII ChIA-PET data. b.) A genomic view of the gRNAs designed for various regions expected to be involved in regulation of Rpp25, including two distal regions predicted from PolII ChIA-PET data. c,d.) Fraction of significant gRNAs among the different functional genomic categories involved in the regulation of c.) Nanog and d.) Rpp25.

  16. Distribution of start and end positions of contiguous mutations or “disruptions” of various lengths along the sequenced read.
    Supplementary Fig. 10: Distribution of start and end positions of contiguous mutations or “disruptions” of various lengths along the sequenced read.

    a-d.) Left and right ends of mutations caused by a gRNA along the length of a read are plotted along the y and x-axes respectively. Each point is a set of genotypes with that particular position of the ends of the disruption. The ratio of GFPneg to GFPpos reads corresponding to a particular point is shown as blue to bluish-yellow (<1, GFPpos biased), or yellow to red (>1, GFPneg biased). gRNAs are shown as black rectangles along the axes and their boundaries are indicated by thin black lines. Thick black boundaries show the region within -20 to +20 bp selected for further analysis and classification. Disruptions are shown for a.) Left paired end read for Tdgf1 proximal enhancer with two gRNAs, b.) Right paired end read for Tdgf1 proximal enhancer with two gRNAs, c.) Left paired end read for Zfp42 enhancer with a single gRNA, d.) Right paired end read for Zfp42 enhancer with a single gRNA.

  17. Deep sequencing of two gRNAs within a Tdgf1 proximal enhancer region validates their role in regulation of Tdgf1 and reveals patterns of mutation and functional motifs in the region.
    Supplementary Fig. 11: Deep sequencing of two gRNAs within a Tdgf1 proximal enhancer region validates their role in regulation of Tdgf1 and reveals patterns of mutation and functional motifs in the region.

    a,b.) Fraction of unique genotypes in GFPneg and GFPpos populations with mutations at bases along the gRNAs reveals the pattern of cleavage around the gRNA for a.) Left paired end read, b.) Right paired end read. c,d.) Fraction of unique genotypes in GFPneg and GFPpos with insertions between bases along the gRNA for c.) Left paired end read, d.) Right paired end read. e.) Motif logo for region mutated by gRNAs with base scores computed as log-ratios of the Hellinger distance of the GFPneg genotypes at a base to the reference base to the Hellinger distance of the GFPpos genotypes at a base to the reference base, caused by Tdgf_gRNA_1 and Tdgf_gRNA_2 in the right paired end read. f,g.) Motif logo for insertions showing entropic gain upon GFP loss in intervening base positions in f.) Left paired end read, g.) Right paired end read.

  18. Deep sequencing of two gRNAs within a Zfp42 enhancer region validates their role in regulation of Zfp42 and reveals functional motifs associated with gene activity.
    Supplementary Fig. 12: Deep sequencing of two gRNAs within a Zfp42 enhancer region validates their role in regulation of Zfp42 and reveals functional motifs associated with gene activity.

    a.) Two gRNAs at a Zfp42 enhancer region in the genomic context showing their overlap with DNase I hotspot and predicted enhancer regions and transcription factor binding sites. b,c.) ROC curve for 5-fold classification of GFPneg and GFPpos genotypes using mutations within -20 to +20 bp of the gRNA as features for b.) Zfp_gRNA_1 using mutations on the left paired end read, c.) Zfp_gRNA_2 using mutations on the right paired end read. Unweighted classification (in blue) counts each unique genotype in the test set only once, while weighted classification (red) counts each unique genotype in the test set as many times as the number of reads assigned to it, for calculating sensitivity and specificity. d,e.) Motif logo for region mutated by gRNAs with base scores computed as log-ratios of the Hellinger distance of the GFPneg genotypes at a base to the reference base to the Hellinger distance of the GFPpos genotypes at a base to the reference base, d.) Left paired end read with Zfp_gRNA_1, e.) Right paired end read with Zfp_gRNA_2.

  19. Deep sequencing of two gRNAs within a Zfp42 enhancer region reveals differences in mutational spectrum associated with loss of gene expression.
    Supplementary Fig. 13: Deep sequencing of two gRNAs within a Zfp42 enhancer region reveals differences in mutational spectrum associated with loss of gene expression.

    a,b.) Fraction of unique genotypes in GFPneg and GFPpos populations with mutations at bases along the gRNAs reveals the pattern of cleavage around the gRNA for a.) Left paired end read, b.) Right paired end read. c,d.) Fraction of unique genotypes in GFPneg and GFPpos with insertions between bases along the gRNA for c.) Left paired end read, d.) Right paired end read. e,f.) Motif logo for insertions showing entropic gain upon GFP loss in intervening base positions in e.) Left paired end read, f.) Right paired end read.

  20. Deep sequencing of two gRNAs within a Tdgf1 URE validates its role in regulation of Tdgf1 and reveals patterns of mutation and functional motifs in the region.
    Supplementary Fig. 14: Deep sequencing of two gRNAs within a Tdgf1 URE validates its role in regulation of Tdgf1 and reveals patterns of mutation and functional motifs in the region.

    a,b.) Left and right ends of mutations caused by a gRNA along the length of a read are plotted along the y and x-axes respectively. Each point is a set of genotypes with that particular position of the ends of the disruption. The ratio of GFPneg to GFPpos reads corresponding to a particular point is shown as blue to bluish-yellow (<1, GFPpos biased), or yellow to red (>1, GFPneg biased). gRNAs are shown as black rectangles along the axes and their boundaries are indicated by thin black lines. Thick black boundaries show the region within -20 to +20 bp selected for further analysis and classification. Disruptions are shown for a.) Left paired end read for Tdgf1 URE with a single gRNA, b.) Right paired end read for Tdgf1 URE with another gRNA. c.) ROC curve for 5-fold classification of GFPneg and GFPpos genotypes using mutations on the left paired end read within -20 to +20 bp of Tdgf_URE_gRNA1. Unweighted classification (in blue) counts each unique genotype in the test set only once, while weighted classification (red) counts each unique genotype in the test set as many times as the number of reads assigned to it, for calculating sensitivity and specificity. d.) Fraction of unique genotypes in GFPneg and GFPpos populations with mutations along the left paired end read within -20 to +20 bp of Tdgf_URE_gRNA1 reveals the pattern of cleavage. e.) Motif logo for region mutated by gRNAs with base scores computed as log-ratios of the Hellinger distance of the GFPneg genotypes at a base to the reference base to the Hellinger distance of the GFPpos genotypes at a base to the reference base for the left paired end read containing Tdgf_URE_gRNA1.

  21. Deep sequencing of two gRNAs within a Tdgf1 URE validates its role in regulation of Tdgf1 and reveals patterns of mutation and functional motifs in the region.
    Supplementary Fig. 15: Deep sequencing of two gRNAs within a Tdgf1 URE validates its role in regulation of Tdgf1 and reveals patterns of mutation and functional motifs in the region.

    a,b.) Fraction of unique genotypes in GFPneg and GFPpos with insertions between bases along the gRNA for a.) Left paired end read, b.) Right paired end read. c,d.) Motif logo for insertions showing entropic gain upon GFP loss in intervening base positions in c.) Left paired end read, d.) Right paired end read. e.) Vertebrate PhastCons score along bases of Tdgf_URE_gRNA1 reveals a highly conserved left half of the sequence.

Accession codes

Primary accessions

Gene Expression Omnibus


Author information

Affiliations

  1. Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.

    • Nisha Rajagopal,
    • Sharanya Srinivasan,
    • Yuchun Guo,
    • Matthew D Edwards,
    • Tahin Syed &
    • David K Gifford
  2. Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts, USA.

    • Sharanya Srinivasan,
    • Kameron Kooshesh,
    • Budhaditya Banerjee,
    • Bart J M Emons &
    • Richard I Sherwood
  3. Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, Massachusetts, USA.

    • Kameron Kooshesh
  4. Program in Cancer, Stem Cells, and Developmental Biology, University of Utrecht, Utrecht, the Netherlands.

    • Bart J M Emons

Contributions

Experiments were designed by N.R., R.I.S. and D.K.G. MERA experiments were performed by R.I.S., S.S., K.K., B.B. and B.J.M.E. N.R. and D.K.G. performed the computational analysis. Y.G., T.S. and M.D.E. helped with the computational analysis.

Competing financial interests

A patent application on MERA has been filed by the authors’ institutions.

Supplementary information

Supplementary Figures

  1. Supplementary Figure 1: Effect of the length of homology arms of guide RNA on background cutting due to unintegrated guide RNA PCR fragments. (198 KB)

    Homology constructs with a GFP-targeting gRNA were introduced into the cell. In the absence of a dummy-cleaving guide RNA, the homology construct would not be integrated into the ROSA locus; hence, any loss of GFP in the cell would be due to this unintegrated construct cutting the target sequence in the GFP gene. Thus, we were able to measure the effect of different homology arm lengths as the percentage of GFP loss due to cutting of the target site by unintegrated guide RNA.

  2. Supplementary Figure 2: CRISPR-Cas9–mediated mutation following homologous recombination into a genomically integrated gRNA cassette. (171 KB)

    Tdgf1GFP mESCs with a ROSA26-integrated U6 promoter dummy gRNA expression cassette express uniformly strong GFP. After electroporation of Cas9, dummy gRNA-targeting gRNA plasmid, and a PCR fragment comprising a gRNA targeting GFP flanked by 120–140 bp homology arms, 30% of cells lose GFP expression. Omission of the dummy gRNA-targeting gRNA plasmid results in minimal GFP loss, showing that homologous recombination of the PCR fragment is required for proper gRNA targeting.

  3. Supplementary Figure 3: MERA followed by flow cytometry enables isolation of GFP cells at four mESC loci. (170 KB)

    NanogGFP fusion, Rpp25GFP fusion, and Tdgf1GFP mESCs express GFP. After bulk gRNA integration followed by flow cytometry, highly enriched GFPmedium/neg populations can be purified. These populations are then deep sequenced.

  4. Supplementary Figure 4: Predicting enrichment of gRNAs in GFPneg or GFPmedium populations. (264 KB)

    a.) Correlation between bulk reads at all integrated gRNAs in two biological replicates for the NanogGFP line. b.) Reads in the GFPneg population are highly correlated with the bulk reads per GFP-targeting gRNA in a particular replicate of the Tdgf1 population. c.) Reads in the GFPmedium population are correlated with the bulk reads per GFP-targeting gRNA in a particular replicate of the Tdgf1 population. d,e.) Distribution of the log10 ratio of GFPneg to bulk reads for all integrated gRNAs. Blue bars indicate gRNAs not significantly enriched for GFP-loss, while red bars indicate the gRNAs predicted as significant. Black asterisks show the positions of the GFP-targeting gRNAs on the x-axis, which tend to lie towards the far right. The black dot shows the position of the dummy gRNA on the x-axis. As examples, the distribution is shown for d.) Tdgf1 Replicate 1 and e.) Zfp42 Replicate 2. f.) The distribution of gRNAs in a 1 kb window centered at the Tdgf1 promoter. Black indicates gRNAs that are integrated but do not cause any significant GFP-loss, red indicates gRNAs that are significantly enriched in the GFPneg population, and green indicates gRNAs significantly enriched in the GFPmedium population.

  5. Supplementary Figure 5: MERA enables systematic identification of required cis-regulatory elements and their relative importance irrespective of putative off-target effects of a few individual guide RNAs in Tdgf1GFP. (606 KB)

    a.) A genomic view of the gRNA distribution along the Tdgf1 proximal regulatory region showing reads for individual gRNAs for a replicate before and after filtering for guide RNAs with off-target effects, DNase-I hotspot regions, predicted enhancers (green = weak, red = strong), transcription factor binding density based on ChIP-seq data, and histone modifications. Predicted off-target cutting sites for each guide RNA are also shown as a panel (black). Guide RNAs predicted to cause significant GFP-loss upon introduction into the Zfp42GFP line (red) are seen to be much fewer and not clustered as in the Tdgf1 library. b,c.) Guide RNAs enriched for GFP-loss at b.) the external promoter Lrrc2 and c.) the unmarked regulatory region are shown before and after filtering for off-target effects.

  6. Supplementary Figure 6: MERA enables systematic identification of required cis-regulatory elements and their relative importance irrespective of putative off-target effects of a few individual guide RNAs in Zfp42GFP. (388 KB)

    a.) A genomic view of the gRNA distribution along the Zfp42 proximal regulatory region showing reads for individual gRNAs before and after filtering for guide RNAs with off-target effects, DNase-I hotspot regions, predicted enhancers (green = weak, red = strong), transcription factor binding density based on ChIP-seq data, and histone modifications. Predicted off-target cutting sites for each guide RNA are also shown as a panel (black). Guide RNAs predicted to cause significant GFP-loss upon introduction into the Tdgf1GFP line (red) are seen to be much fewer and not clustered as in the Zfp42 library. b.) The Trim12 promoter >150 kb away from the Zfp42 gene shows clusters of guide RNAs significantly enriched in GFPneg and GFPmedium populations even after filtering for off-target effects. c.) Relative importance of various functional categories (Figure 3d), as measured by the fraction of GFPneg-enriched gRNAs, is invariant upon filtering out gRNAs with off-target effects.

  7. Supplementary Figure 7: Deriving rules for off-target prediction using the GUIDE-seq assay and MERA-generated data. (276 KB)

    a. Fraction of off-target sequences predicted from GUIDE-seq (total = 442, number of guide RNAs = 13) with a particular number of mismatches in the non-seed (1–8 bp), seed (9–20 bp) or PAM sequence. A maximum of 4 mismatches in the seed and non-seed sequence and a maximum of one mismatch in the PAM (NNG/NGN) can be tolerated. b. Fraction of GUIDE-seq-derived off-target sequences containing 3 adjacent mismatches along the bases of the guide RNA. For an NGG PAM sequence, no triple mismatches are tolerated in the seed region, and for an NGN/NNG PAM sequence, no triple mismatches are tolerated beyond the fifth base of the gRNA. c. The total number of mismatches tolerated is proportional to the total GC content of the guide RNA sequence. For gRNAs with intermediate GC content (10 to 15), seed GC content determines the mismatches tolerated. d. True positive rate for rules of off-target prediction with or without GC content adjustment, evaluated as the percentage of accurately predicted off-target sites per guide RNA. e. False positive rate for rules of off-target prediction with or without GC content adjustment, evaluated as the number of guide RNAs observed to have no enrichment in the GFPneg population with a predicted off-target effect on a gRNA significantly enriched in the GFPneg population. Analysis is shown within the same library for Tdgf1 and Zfp42 and also for cross-library situations.

  8. Supplementary Figure 8: Effect of introducing a mismatched gRNA library on GFP loss for a particular gene. (266 KB)

    Introduction of the sgTdgf1 library into the Tdgf1GFP line, introduction of the sgZfp42 library into the Tdgf1GFP line, introduction of the sgTdgf1 library into the Zfp42GFP line, and introduction of the sgZfp42 library into the Zfp42GFP line (from left to right clockwise).

  9. Supplementary Figure 9: Comparison of cis-regulatory programs across two genes, Nanog and Rpp25. (335 KB)

    a.) A genomic view of the gRNAs designed for various regions expected to be involved in regulation of Nanog, including two distal regions predicted from PolII ChIA-PET data. b.) A genomic view of the gRNAs designed for various regions expected to be involved in regulation of Rpp25, including two distal regions predicted from PolII ChIA-PET data. c,d.) Fraction of significant gRNAs among the different functional genomic categories involved in the regulation of c.) Nanog and d.) Rpp25.

  10. Supplementary Figure 10: Distribution of start and end positions of contiguous mutations or “disruptions” of various lengths along the sequenced read. (353 KB)

    a-d.) Left and right ends of mutations caused by a gRNA along the length of a read are plotted along the y and x-axes respectively. Each point is a set of genotypes with that particular position of the ends of the disruption. The ratio of GFPneg to GFPpos reads corresponding to a particular point is shown as blue to bluish-yellow (<1, GFPpos biased), or yellow to red (>1, GFPneg biased). gRNAs are shown as black rectangles along the axes and their boundaries are indicated by thin black lines. Thick black boundaries show the region within -20 to +20 bp selected for further analysis and classification. Disruptions are shown for a.) Left paired end read for Tdgf1 proximal enhancer with two gRNAs, b.) Right paired end read for Tdgf1 proximal enhancer with two gRNAs, c.) Left paired end read for Zfp42 enhancer with a single gRNA, d.) Right paired end read for Zfp42 enhancer with a single gRNA.

  11. Supplementary Figure 11: Deep sequencing of two gRNAs within a Tdgf1 proximal enhancer region validates their role in regulation of Tdgf1 and reveals patterns of mutation and functional motifs in the region. (366 KB)

    a,b.) Fraction of unique genotypes in GFPneg and GFPpos populations with a mutation at bases along the gRNAs reveals the pattern of cleavage around the gRNA for a.) Left paired end read, b.) Right paired end read. c,d.) Fraction of unique genotypes in GFPneg and GFPpos with an insertion between bases along the gRNA for c.) Left paired end read, d.) Right paired end read. e.) Motif logo for the region mutated by Tdgf_gRNA_1 and Tdgf_gRNA_2 in the right paired end read, with base scores computed as log-ratios of the Hellinger distance of the GFPneg genotypes at a base to the reference base to the Hellinger distance of the GFPpos genotypes at a base to the reference base. f,g.) Motif logo for insertions showing entropic gain upon GFP-loss in intervening base positions in f.) Left paired end read, g.) Right paired end read.

  12. Supplementary Figure 12: Deep sequencing of two gRNAs within a Zfp42 enhancer region validates their role in regulation of Zfp42 and reveals functional motifs associated with gene activity. (280 KB)

    a.) Two gRNAs at a Zfp42 enhancer region in the genomic context, showing its overlap with DNase-I hotspot and predicted enhancer regions and transcription factor binding sites. b,c.) ROC curve for 5-fold classification of GFPneg and GFPpos genotypes using mutations within -20 to +20 bp of the gRNA as features for b.) Zfp_gRNA_1 using mutations on the left paired end read, c.) Zfp_gRNA_2 using mutations on the right paired end read. Unweighted classification (in blue) counts each unique genotype in the test-set only once, while weighted classification (red) calculates sensitivity and specificity counting each unique genotype in the test-set as many times as the number of reads assigned to it. d,e.) Motif logo for the region mutated by gRNAs, with base scores computed as log-ratios of the Hellinger distance of the GFPneg genotypes at a base to the reference base to the Hellinger distance of the GFPpos genotypes at a base to the reference base, d.) Left paired end read with Zfp_gRNA_1, e.) Right paired end read with Zfp_gRNA_2.

  13. Supplementary Figure 13: Deep sequencing of two gRNAs within a Zfp42 enhancer region reveals differences in mutational spectrum associated with loss of gene expression. (342 KB)

    a,b.) Fraction of unique genotypes in GFPneg and GFPpos populations with a mutation at bases along the gRNAs reveals the pattern of cleavage around the gRNA for a.) Left paired end read, b.) Right paired end read. c,d.) Fraction of unique genotypes in GFPneg and GFPpos with an insertion between bases along the gRNA for c.) Left paired end read, d.) Right paired end read. e,f.) Motif logo for insertions showing entropic gain upon GFP-loss in intervening base positions in e.) Left paired end read, f.) Right paired end read.

  14. Supplementary Figure 14: Deep sequencing of two gRNAs within a Tdgf1 URE validates its role in regulation of Tdgf1 and reveals patterns of mutation and functional motifs in the region. (379 KB)

    a,b.) Left and right ends of mutations caused by a gRNA along the length of a read are plotted along the y and x-axes respectively. Each point is a set of genotypes with that particular position of the ends of the disruption. The ratio of GFPneg to GFPpos reads corresponding to a particular point is shown as blue to bluish-yellow (<1, GFPpos biased), or yellow to red (>1, GFPneg biased). gRNAs are shown as black rectangles along the axes and their boundaries are indicated by thin black lines. Thick black boundaries show the region within -20 to +20 bp selected for further analysis and classification. Disruptions are shown for a.) Left paired end read for Tdgf1 URE with a single gRNA, b.) Right paired end read for Tdgf1 URE with another gRNA. c.) ROC curve for 5-fold classification of GFPneg and GFPpos genotypes using mutations on the left paired end read within -20 to +20 bp of Tdgf_URE_gRNA1. Unweighted classification (in blue) counts each unique genotype in the test-set only once, while weighted classification (red) counts each unique genotype in the test-set as many times as the number of reads assigned to it, for calculating sensitivity and specificity. d.) Fraction of unique genotypes in GFPneg and GFPpos populations with mutations along the left paired end read within -20/+20 bp of Tdgf_URE_gRNA1 reveals the pattern of cleavage. e.) Motif logo for the region mutated by gRNAs, with base scores computed as log-ratios of the Hellinger distance of the GFPneg genotypes at a base to the reference base to the Hellinger distance of the GFPpos genotypes at a base to the reference base, for the left paired end read containing Tdgf_URE_gRNA1.

  15. Supplementary Figure 15: Deep sequencing of two gRNAs within a Tdgf1 URE validates its role in regulation of Tdgf1 and reveals patterns of mutation and functional motifs in the region. (290 KB)

    a,b.) Fraction of unique genotypes in GFPneg and GFPpos with insertions between bases along the gRNA for a.) Left paired end read, b.) Right paired end read. c,d.) Motif logo for insertions showing entropic gain upon GFP-loss in intervening base positions in c.) Left paired end read, d.) Right paired end read. e.) Vertebrate PhastCons scores along the bases of Tdgf_URE_gRNA1 reveal a highly conserved left half of the sequence.
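
The motif-logo captions for Supplementary Figures 11, 12 and 14 describe base scores as log-ratios of two Hellinger distances: that of the GFPneg genotypes to the reference base, over that of the GFPpos genotypes to the reference base. The Python sketch below is one possible reading of that description, not the authors' implementation. The function names, the `-` symbol for a deleted base, the pseudocount, and the representation of the reference base as a one-hot distribution are all assumptions introduced for illustration.

```python
import math

# Assumed alphabet: four bases plus '-' for a deleted base (hypothetical encoding).
BASES = "ACGT-"

def base_distribution(genotypes, position, pseudocount=1.0):
    """Symbol frequencies at one aligned position across a set of genotypes.

    The pseudocount is an assumption; it keeps every frequency non-zero so the
    log-ratio below is always defined.
    """
    counts = {b: pseudocount for b in BASES}
    for genotype in genotypes:
        counts[genotype[position]] += 1
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}

def hellinger(p, q):
    """Hellinger distance between two discrete distributions over BASES."""
    bhattacharyya = sum(math.sqrt(p[b] * q[b]) for b in BASES)
    return math.sqrt(max(0.0, 1.0 - bhattacharyya))

def base_score(neg_genotypes, pos_genotypes, reference, position):
    """log( H(GFPneg, reference) / H(GFPpos, reference) ) at one position."""
    # Reference base modelled as a one-hot distribution (assumption).
    ref = {b: 1.0 if b == reference[position] else 0.0 for b in BASES}
    d_neg = hellinger(base_distribution(neg_genotypes, position), ref)
    d_pos = hellinger(base_distribution(pos_genotypes, position), ref)
    return math.log(d_neg / d_pos)

# Toy data: GFPneg genotypes are mutated at position 1, GFPpos genotypes are not.
reference = "ACGT"
gfp_neg = ["AAGT", "ATGT", "ACGT"]
gfp_pos = ["ACGT", "ACGT", "ACGT"]
scores = [base_score(gfp_neg, gfp_pos, reference, i) for i in range(len(reference))]
```

Under this reading, a positive score marks a base where GFPneg genotypes deviate from the reference more than GFPpos genotypes do, which is the kind of position a motif logo would emphasize.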

PDF files

  1. Supplementary Text and Figures (5,548 KB)

    Supplementary Figures 1–15, Supplementary Discussion, Supplementary Methods and Supplementary Tables 5–7

Excel files

  1. Supplementary Table 1 (264 KB)

    Genomic locations targeted by gRNAs in the Tdgf1 library and sequenced read counts per gRNA from bulk, GFP-negative and GFP-medium populations

  2. Supplementary Table 2 (278 KB)

    Genomic locations targeted by gRNAs in the Zfp42 library and sequenced read counts per gRNA from bulk, GFP-negative and GFP-medium populations

  3. Supplementary Table 3 (177 KB)

    Genomic locations targeted by gRNAs in the Nanog library and sequenced read counts per gRNA from bulk and GFP-negative populations

  4. Supplementary Table 4 (103 KB)

    Genomic locations targeted by gRNAs in the Rpp25 library and sequenced read counts per gRNA from bulk and GFP-negative populations
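
The per-gRNA read counts in these tables are the kind of input needed for the log10 ratio of GFPneg to bulk reads shown in Supplementary Figure 4d,e. The Python sketch below illustrates that ratio on made-up counts. The function name, the input layout, the pseudocount, and the normalization by total read depth are assumptions rather than details given in the captions, and the significance test used to call enriched gRNAs is not reproduced.

```python
import math

def log10_neg_to_bulk(counts, pseudocount=1.0):
    """log10 ratio of GFPneg to bulk read fractions for each gRNA.

    counts: {gRNA_id: (bulk_reads, gfp_neg_reads)}  (hypothetical layout)
    Reads are converted to fractions of their population so that differences
    in sequencing depth do not dominate the ratio (assumption).
    """
    n = len(counts)
    bulk_total = sum(bulk for bulk, _ in counts.values()) + pseudocount * n
    neg_total = sum(neg for _, neg in counts.values()) + pseudocount * n
    ratios = {}
    for grna, (bulk, neg) in counts.items():
        bulk_frac = (bulk + pseudocount) / bulk_total
        neg_frac = (neg + pseudocount) / neg_total
        ratios[grna] = math.log10(neg_frac / bulk_frac)
    return ratios

# Made-up example: one GFP-targeting gRNA and two neutral gRNAs.
example = {
    "gfp_targeting": (100, 900),
    "neutral_1": (100, 50),
    "neutral_2": (100, 50),
}
ratios = log10_neg_to_bulk(example)
```

In this toy example the GFP-targeting gRNA gets a positive ratio and the neutral gRNAs get negative ratios, mirroring the caption's observation that GFP-targeting gRNAs tend to sit at the far right of the distribution.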

Additional data