Introduction

It has been over 40 years since the discovery of the first DNA sequence capable of enhancing the transcription of a reporter gene1. Since then, many cis-regulatory DNA elements known as enhancers have been identified, and their biochemical and functional properties have been extensively investigated. A central role of enhancers is to regulate gene expression by binding transcription factors (TFs) and other regulators able to modulate the transcription of target genes2. Enhancers can act independently of the distance and orientation to their target genes3 and they have specific chromatin features that aid in their genome-wide identification through high-throughput methods. Common techniques used for enhancer identification are DNase I hypersensitivity sequencing (DNase-seq)4 and the Assay for Transposase-Accessible Chromatin sequencing (ATAC-seq)5 which exploit the fact that enhancers are free from nucleosomes and are more sensitive to enzymatic treatment. One powerful tool for identifying active enhancers is Chromatin immunoprecipitation-sequencing (ChIP-seq), which utilizes antibodies that recognize specific histone modifications, such as H3K27Ac and H3K4me1, in enhancers flanking nucleosomes3,6,7. Additionally, active enhancers were found to be actively transcribed into enhancer RNAs (eRNAs)8,9, and to form loops with target promoters to exert their regulatory effects on gene expression. In recent years, several high-throughput 3D chromatin conformation techniques, such as Hi-C10,11, ChiA-PET12,13, and micro-C14, have been used to identify physical interactions between potential enhancers and promoters, elucidating the spatial organization of the genome.

While these approaches are highly informative and allow identifying the location of potential enhancers, none of these represent the definitive proof of their activity. To address this, reporter assays have been widely used to confirm the activity of predicted enhancers by testing their ability to drive the expression of a heterologous reporter gene. Previously, this method was limited to testing one sequence at a time using transgenic technologies15,16, but the development of high-throughput massively parallel reporter assays such as STARR-seq17 and MPRA18,19,20,21 has revolutionized the field, enabling the simultaneous testing of thousands of DNA sequences for enhancer activity in a single experiment. More in detail, in the MPRA, candidate regulatory sequences, including promoters or enhancers, are introduced in a construct upstream of a minimal promoter, a gene reporter, and a unique barcode. Once the MPRA construct is introduced into cells, the tested sequences drive the expression of the associated unique barcode. The ratio between the number of transcribed barcodes in the RNA fraction (RNA counts) and the number of barcodes in the DNA fraction (DNA counts) provides a quantitative measurement of the regulatory activity of the tested sequences.

Recently, MPRA experiments have been employed to understand transcriptional patterns driving neuronal differentiation through perturbation of transcription factor binding motifs within regulatory elements22,23 and for studying the effect of thousands of human specific substitutions on the functional activity of enhancer regions implicated in human brain evolution24,25,26, or neuronal-associated diseases27,28,29.

Here, we used a genomic-integrated MPRA, the Lentiviral-based Massively Parallel Reporter Assay—LentiMPRA30—to evaluate the activity of approximately 7000 putative gene-linked enhancers. These enhancers were previously identified in human induced pluripotent stem cells (iPSC)-derived forebrain organoids and human fetal cortex combining ChIP-seq and Hi-C data31 and are potentially involved in early human neurodevelopment.

Results

Enhancer selection and lentiMPRA experimental design

To characterize the activity of a set of enhancers putatively involved in early human neurodevelopment, we used lentiMPRA in forebrain organoids to test the activity of a subset of the 96,375 gene-linked enhancers identified by ChIP-seq in our previous study conducted in human fetal cortex and human forebrain organoids31. In the original dataset, enhancers were discovered by ChIP-seq, annotated by ChromHMM and linked to putative target genes by proximity and/or external DNA conformation (Hi-C) dataset from stem cells and fetal brains31. We generated an MPRA library selecting the most active (e.g., with highest H3K27Ac peak signal) 6989 enhancers in organoids as measured by ChIP-seq (Fig. 1A). Additionally, we included a total of 122 positive control sequences from three different datasets: (i) 87 enhancers from human embryonic stem cells (hESCs) validated using the ChIP-STARR-seq approach32; (ii) 21 MPRA-validated enhancers from hESC-derived neuronal progenitor cells (NPCs)23; (iii) 35 human brain enhancers from the Vista Enhancer Browser validated by mouse transgenesis33. We also included 150 negative controls generated by shuffling the nucleotides of 150 randomly selected candidate enhancer regions. To address the length limitation of oligonucleotide synthesis, we selected a minimal enhancer region of 270 bp by intersecting our enhancers with synonymous information from other datasets, such as p300 ChIP-seq peaks from neuronal cell lines and fetal brain34,35, DNA hypersensitivity peaks36 and CAGE analysis from fetal brain and neuronal cell types35,37,38,39. The goal was to identify regions with the highest overlap across these datasets which likely correspond to the core active enhancer regions. If the resulting core regions still exceeded the length of 270 bp, we used FIMO40 to refine them by identifying a subregion with the highest number of known transcription factor binding sites (TFBSs) (see Supplementary Fig. S1 and Methods). In total, the library included 7261 sequences synthetized along with 15 bp adapters on either side. The lentiMPRA library was amplified, and a minimal promoter and a 15 bp random barcode were placed downstream of each synthesized sequence and cloned into a lentiMPRA vector upstream of the GFP coding sequence (Fig. 1A).

Figure 1
figure 1

Experimental design and overall lentiMPRA results. (A) ~ 7000 candidate enhancer regions were selected among the 96,000 gene-linked enhancers identified by Amiri et al.31. The lentiMPRA library was synthesized on a custom array and cloned into a lentiMPRA vector, packaged into lentivirus, and introduced into 3 iPSC lines before organoid differentiation. (B) At 3 different time points, DNA/RNA sequencing was used for estimating the enhancer activity from the ratio of corresponding barcodes. Enhancers were considered active by MPRA if their activity was significantly above the background (blue curve) derived from a gaussian mixture model of the activity of negative controls. (C) The Venn diagram shows counts of MPRA-active enhancers across time points. BC, barcode; LTR, long terminal repeat; mP, minimal promoter; TD, terminal differentiation.

To investigate whether our candidate enhancers can elicit a time-dependent transcriptional response during early neurodevelopment, we infected three induced pluripotent stem cell (iPSC) lines with our lentiMPRA library and measured enhancer activity at the iPSC stage and at two terminal differentiation (TD) time points during forebrain organoid differentiation: an earlier stage (TD0) when organoids were almost exclusively composed of proliferating progenitors and a more mature stage (TD30) when organoids were still harboring progenitors and actively generating layer 5–6 cortical postmitotic neurons (Supplementary Fig. S2 and S3; Methods).

LentiMPRA identifies active enhancers during forebrain organoids differentiation

Using DNA sequencing, we discovered that 95.1% (6907/7261) of the tested enhancer sequences in the original library were successfully recovered, with an average of 39.9 unique barcodes associated with each enhancer sequence (Supplementary Fig. S4). Out of the total 7261 sequences, 94.9% passed stringent quality control (Methods). In the MPRA experiment, enhancer activity is measured as the ratio of transcribed barcode reads (obtained by RNA-seq) to integrated genomic barcode reads (evaluated by DNA-seq). To identify active enhancers in the tested set, we first defined the background distribution for the ratio of RNA/DNA barcodes. We reasoned that the presence of a multitude of TFs in each cell, combined with the non-deterministic (i.e., probabilistic) nature of TF binding motifs and the largely unstudied effects of TF cooperativity, could result in enhancer activity even in randomly shuffled negative control sequences. Consequently, the distribution of activity for the negative control sequences was approximated as a sum of two gaussian distributions, one representing the true background signal and one representing an actual signal from potentially active sequences (Fig. 1B and Supplementary Fig. S5; Methods). Such a bimodal approximation precisely described the observed experimental distribution with an average signal from likely active sequences being roughly 55% stronger than the average background (Fig. 1B). Such an approximation also described the signal distribution for positive controls, suggesting that some of the positive controls are not active, albeit enriched for active enhancers as compared to negative controls (Supplementary Fig. S5).

MPRA-active enhancers were defined as those having a signal significantly above the background distribution (Fig. 1B). Altogether, 34.8% of the enhancers were active in at least one time point (Supplementary Table S1). Of the core validated enhancers, 1755 (74.6%) exhibited activity at a specific time point, reflecting a stage-specific epigenomic regulation: 193 were active only in iPSCs, 552 at TD0, and 1010 at TD30. More than 25% of active enhancers were shared between time points, with a large proportion being active at both TD0 and TD30 (449), indicating that some regulatory elements may play a role in both early and late stages of neurodevelopment (Fig. 1C).

The background and the actual signal distribution were wide and overlapped significantly, limiting a clear discrimination between active and inactive enhancers, and reducing the power of the MPRA approach (Fig. 1B). Based on this overlap and a selected p-value threshold, we estimated that we were only powered to validate about 35% of truly active enhancers. To elaborate on the missing fraction of true enhancers, we clustered all tested enhancers by their activity profile (RNA/DNA ratio) across samples in the MPRA experiment. This clustering approach revealed that almost all enhancers fell into two large clusters, cluster 1 and 2 (Fig. 2A). Cluster 1 had just a few MPRA-active enhancers, while the cluster 2 encompassed almost all MPRA-active enhancers. We interpreted the data as suggesting that the inactive enhancers in cluster 2 were likely validated, but they did not formally pass the significance test due to the low sensitivity of the MPRA assay. This result would also explain the limited reproducibility of MPRA-active enhancers across samples at TD0 and TD30 (Supplementary Fig. S6).

Figure 2
figure 2

Characteristics of active enhancers. (A) Heatmap of enhancer activity (RNA/DNA) of all candidate enhancers tested in MPRA experiment. Clustering was performed using the Ward variance minimization algorithm. The right bar annotates MPRA-active (red) or inactive (blue) enhancers. (B) Upset plot showing the number of active (red) and inactive (blue) enhancers overlapping external datasets. In the box, bars represent the fraction of overlapping active and inactive enhancers relative to the total number of active and inactive enhancers. p-values were calculated using Fisher’s exact test (*p-value < 0.05; **p-value < 0.01). (C) Cumulative plot of number of transcription factor binding sites identified by FIMO in positive controls, negative controls, and tested enhancers. No statistically significant difference was observed between active and inactive enhancers. (D) Cumulative plot of the expression (RPKM) of genes predicted to be regulated by tested enhancers in TD0 and TD30 organoids in the endogenous genomic context by Amiri et al.31. Genes regulated by MPRA-active enhancers have significantly higher expression than inactive enhancers. The background curve represents the expression of all ~ 96,375 gene-linked enhancers from Amiri et al.31.

We next intersected MPRA-tested enhancers with external ATAC-seq-derived peaks obtained from bulk and single cell data from fetal brains and forebrain-directed organoids41,42 as well as from whole-brain organoids43. A large proportion (79.1%) of the tested enhancers were present in these external datasets. Consistent with our expectations, we observed that MPRA-active enhancers were enriched in fetal brains and forebrain-directed organoid datasets from Trevino et al. and Ziffra et al. (Fig. 2B inset, Supplementary Fig. S7 and Supplementary Table S2).

To further qualify the nature of MPRA-active enhancers, we compared the numbers of TFBSs within active and inactive enhancers. Our analysis revealed that while positive control enhancers had significantly more TFBSs compared to both scrambled negative controls and tested enhancers (Fig. 2C), we did not observe any significant difference in the number of TFBSs or in the expression of cognate TFs between MPRA-active and -inactive tested enhancers.

We next took advantage of the fact that in the original dataset of Amiri et al. we identified putative target genes for the tested enhancers in their native DNA location, by either proximity in linear DNA and/or 3D DNA conformation datasets31. We then compared the MPRA-active and -inactive enhancers with regard to the expression of their linked genes. To increase the statistical power of our analysis, we incorporated bulk RNA-seq data from 88 additional samples from a parallel experiment using genetically distinct iPSC lines (collected at TD0 and TD30, using the same forebrain differentiation protocol) in addition to the samples used in the MPRA experiment. Our analysis revealed that the genes linked to MPRA-active enhancers exhibited significantly higher expression levels than those associated with inactive enhancers (Fig. 2D)31. Genes linked to enhancers in both categories had higher expression than “background”, i.e., all gene-linked enhancers from Amiri et al.31. This likely reflects the selection of the most active enhancers for the MPRA experiment. The fact that MPRA-inactive enhancers had higher target gene expression than the “background” expression was consistent with the low sensitivity of MPRA experiment estimated above, implying that a significant fraction of MPRA-inactive enhancers could actually be active in a different experimental setting.

We then correlated the difference in enhancer activity (measured by lentiMPRA) across time points with the difference in expression of the gene(s) linked to the enhancer in the endogenous genomic context. Comparing enhancer activity and gene expression at TD0 versus the iPSC stage, we found that MPRA-active enhancers upregulated or downregulated at TD0 or TD30 versus the iPSC stage were typically correlated (positively or negatively) with the difference in expression of their endogenous linked genes (Fig. 3A, B and Supplementary Fig. S8, S9). In contrast, MPRA-inactive enhancers rarely showed a correlation with the expression of their corresponding genes when comparing either TD0 (Fig. 3A and B) or TD30 with iPSCs (Supplementary Figs. S9 and S10). These observations demonstrate that the MPRA-active enhancers, compared to the inactive ones, were those that, in their appropriate genomic context, were linked to highly expressed genes and that differential MPRA activity predicted differential gene expression during differentiation. Given the nature of the MPRA assay, it can be inferred that the activity of these MPRA enhancers is less dependent on the genomic context. Among the enhancers exclusively active in iPSCs, we found that the most active one (Fig. 3A, Supplementary Table S3) is located upstream of the GTPase-activating protein (SH3 domain)-binding protein 2 (G3BP2) gene, which encodes an RNA binding protein involved in maintaining pluripotency by regulating the transcription factors Oct-4 and Nanog44 (Fig. 3C). The G3BP2 gene was also upregulated in iPSCs compared to the TD30 stage (Supplementary Fig. S9), consistent with the TD0 MPRA results and with its biological role in maintaining pluripotency. The 270 bp tested enhancer region is potentially bound by seven TFs (Supplementary Table S3), including ZNF263, a TF expressed in pluripotent stem cells45 and computationally predicted to have the highest affinity for the binding motif (Fig. 3C).

Figure 3
figure 3

MPRA active enhancers correlate with differences in expression of linked genes across different time points. (A) Scatter plots of change in MPRA activity for an MPRA-active enhancer (y-axis) and change of expression for the gene(s) linked to the enhancer in the endogenous genomic context. Activity and expression are compared for TD0 and iPSC. Each dot represents a gene-enhancer pair. iPSC or TD0 active enhancers are shown as red circles in the upper panel; inactive enhancers are represented in the bottom panel as blue circles. (B) More correlated enhancer-gene pairs are observed in MPRA-active enhancers compared to inactive enhancers. The x-axis represents the cutoff for the change in enhancer activity and gene expression. The y-axis represents the fraction of enhancer-gene pairs passing cutoff for the change in the enhancer activity and gene expression. The dashed line represents cutoff at log2FC = 0.8, which is the same as marked in panel (A). (C) Example of an iPSC-only active enhancer (MPRA activity shown in the heatmap) and the expression of the linked gene (G3BP2), with ZNF263 as the predicted binding transcription factor. (D) Example of a TD0 active enhancer (MPRA activity shown in the heatmap) and expression of the linked gene (CUX1) with ASCL1, a TF predicted to bind to this enhancer. For comparison, a mildly active enhancer at TD0 for this gene is identified downstream of the active enhancer. TSS, Transcription Start Site.

Within the enhancers displaying the highest log2FC activity at TD0 (Fig. 3A), several are linked to genes particularly noteworthy in the context of brain development, suggesting an important role of these enhancers in neuronal functions. The enhancer with the highest differential activity at TD0 compared to iPSC is linked to the Regulator of G Protein Signaling 4 (RGS4) gene. This gene is expressed in the developing nervous system and adult brain, although its function remains unclear46. Other examples include ANKRD11, a chromatin regulator modulating gene expression in neural precursor cells47, and the receptor tyrosine kinase ErbB448, which regulates various neuronal processes such as migration, proliferation, and differentiation. Other genes of interest are FAM107A, contributing to neural cell survival, migration, and spine formation49, and P2RX4, a purinoceptor for ATP involved in key CNS functions50.

Among the top 50 differentially active enhancers at TD0 versus iPSC, the one displaying the greatest enhancer activity across all the TD0 samples (Supplementary Table S3) partially overlapped with an exon located ~ 400 kb downstream of the transcription start Site (TSS) of the Cut Homeobox 1 (CUX1) gene, a TF playing a critical role in upper layer cortical neuron differentiation, dendrite branching, and synapse formation51,52. Despite its role in late differentiation, other works indicated an increase in CUX1 mRNA levels in early progenitors, including radial glia. This suggests that CUX1 could potentially serve as a determinant, in early progenitors, of the identity of the upper cortical layers53,54. CUX1-linked enhancer differential activity was also confirmed by comparing TD30 vs iPSC (Supplementary Fig. S9 and Supplementary Table S3). Among the four TFs predicted to bind this active enhancer (Supplementary Table S3) there is ASCL1, which has been previously identified to promote CUX1 expression55. Additionally, a second intronic enhancer linked to CUX1, bound by ZNF436, showed mild MPRA activity only at TD0 (Fig. 3D).

Discussion

Spatial and temporal regulation of gene expression through regulatory regions, specifically enhancers, is essential for shaping the structure and function of the human brain during development. A number of biochemical techniques, such as ChIP-seq, ATAC-seq and DNase-seq have been used for identifying enhancers and characterizing their activity in neurodevelopment. Among those, the MPRAs have been adopted to test an enhancer’s ability to activate a synthetic reporter in a high throughput and context-independent manner, providing an orthogonal readout of enhancer activity. Most of the MPRA-based studies have primarily concentrated on investigating the functionality of potential regulatory sequences in 2D iPSCs-derived neuronal cells at very early stages of differentiation22,23 or in human-derived neuronal stem cells24,25,26. In this study, we used lentiMPRA in forebrain organoids to evaluate, over the course of brain development, the activity of ~ 7000 gene-linked enhancers previously identified in human fetal tissues and brain organoids from a combination of histone-mark ChIP-seq and DNA conformational studies. Besides validating 2352 enhancers, our analysis of the MPRA-active enhancers in relation to upstream binding TFs and downstream targets suggested important implications for the developmental biology of the human brain.

By analyzing RNA-seq data from forebrain organoids collected at the same time points, we found that while there was no relation with number of potential TFBSs or TFs expression levels, MPRA-active enhancers are associated with highly expressed genes in the endogenous genomic context. Furthermore, differential enhancer activity, as determined by the lentiMPRA assay, tended to be correlated, in a positive or negative fashion, with differential gene expression across time points of neural differentiation. This provides an interesting insight into the capability of MPRA to detect enhancers associated with genes dynamically regulated during different stages of neurodevelopment.

On a technical side, our study highlights a few essential limitations of MPRA techniques. Perhaps the major limitation of MPRA is its low sensitivity (poor discrimination of signal from noise), estimated to be about 35% in our experiments, which likely explains limited reproducibility across replicates. For the later differentiation time points this could also be due to lentivirus regulatory element transgene silencing, which was previously reported to be an issue over differentiation of stem cells into other cell types56. More data are required to precisely assess the variability of this assay across technical replicates and genetically different iPSC lines. In addition, there may be some false negative results in the MPRA assay, since the assay tests a shorter version of the original enhancer, which may not include all the necessary elements required for transcription initiation. For example, MPRA-tested enhancers may lack crucial co-activators that are provided by DNA loop conformation, or may be present in the flanking sequences of the tested region. This may explain why, while detecting a significant enrichment of MPRA-active enhancers with external ATAC-seq datasets obtained in fetal brain or forebrain organoids, a considerable percentage of MPRA-inactive enhancers also overlapped with those external datasets.

Similarly to any bulk assay, using MPRA in a heterogenous system such as organoids, which are characterized by cellular diversity, reveals another caveat of the technique. Given the absence of cell-type information coupled to enhancer activity, it is likely that MPRA may be strongly biased towards revealing those enhancers present in the most abundant cell types. Indeed, when intersecting with scATAC-seq data obtained from the same organoids preparations at TD0 and TD30, there was a significant enrichment of MPRA-active enhancers only in the most abundant cell population, radial glial cells (Supplementary Table S4)57. Future studies using single-cell MPRA-seq approaches58,59 in combination with existing single-cell biochemical assays such as scATAC-seq, could open new avenues to gain a comprehensive understanding of the fine regulatory dynamics occurring within each cell type in complex neuronal differentiation systems.

Materials and methods

Human subjects

Human subjects were recruited through several research projects at the Yale Child Study Center. Written informed consent was obtained from each participant enrolled in the study, and all research was approved by the Yale University Institutional Review Board (HIC# 1104008337) and Yale Center for Clinical Investigation at Yale University and was performed in accordance with the Declaration of Helsinki. Human participants' names and other HIPAA identifiers were removed from all sections of the manuscript, including supplementary information. The participants agreed to data sharing of genomic unidentified data using controlled data access.

All methods were performed according to relevant regulations and guidelines of Yale University, including the Biological Safety Committee and ESCRO committee of Yale University.

LentiMPRA library design

Candidate enhancers were identified in Amiri et al. 201831 by chromatin segmentation analyses using H3K27Ac, H3K27me3 and H3K4me3 ChIP-seq datasets obtained in cortical organoids and cerebral cortical tissue from postmortem fetal human brains. Enhancers were linked to putative target genes by proximity and/or by association with promoters using fetal brain 3D chromatin conformation data. From an initial dataset of > 300,000 enhancers, 96,375 were found to be associated with genes and termed gene-linked enhancers (GLE). From the GLE dataset, approximately 7000 enhancers were selected for MPRA based on top activity, as determined by H3K27Ac peak signal. As ChIP-seq enhancers can encompass hundreds to thousands of bases, for the purpose of oligonucleotide synthesis for the MPRA assay, a core of 270 bp region was identified. To achieve this, our chosen enhancers were intersected with enhancers from other datasets to select the region with the highest number of overlaps, and therefore potentially more active. In detail, we used: (i) p300 ChIP-seq peaks from human neuronal cell lines39 and human fetal cortex60, (ii) CAGE analysis from brain tissues37, and (iii) DNase hypersensitivity peaks from neuronal progenitor cells and brain tissue36. Finally, if there was no overlap or if the overlapping enhancer region was still too large, we used FIMO40 to further reduce the size to 270 bp by selecting the region with the highest number of TFBSs, as TF binding is a reliable predictor of enhancer activity. The list of tested enhancers is outlined in Supplementary Table S1.

LentiMPRA library generation

The lentiMPRA plasmid library was constructed as previously described in Gordon et al., 202030. Briefly, the oligonucleotide pool of the 7261 enhancers was synthetized by Twist Bioscience and amplified via two rounds of PCR, first to add the minimal promoter and then to add the barcode, using two sets of adaptors primers, 5BC-AG-f01/r01 and 5BC-AG-f02/r02 respectively (Supplementary Table S5). The amplified fragments were cloned via Gibson assembly (using NEBuilder HiFi DNA Assembly Master Mix; New England BioLabs, cat. No. E2621L) into the SbfI/AgeI site of the pLS-SceI vector (Addgene, cat. No. 137725, a gift from Ahituv’s lab) to construct the library. The resulting library was digested with I-SceI (New England BioLabs, cat. No. R0694S) to remove any vector that did not receive an insert. The recombination products were then electroporated into electrocompetent cells (NEB 10-beta, New England BioLabs, cat. No. C3020K) and plated onto Carbenicillin plates. Sanger sequencing of 32 colonies, using n40.dn.F and EGFP.up.R primers (Supplementary Table S5), was then used to confirm the proper assembly of the library. The library was purified using a number of colonies needed to achieve the desired number of Barcodes (n = 50) to be associated at each sequence. Barcode-associated fragments were amplified using P7-pLSmp-ass-gfp (100 µM) and P5-pLSmP-ass-i741 primers (Supplementary Table S5), purified using Plasmid plus midi kit (QIAGEN) and tested for its quality via sequencing on a MiSeq (see below section).

MiSeq

The association between enhancer sequences and the barcodes was ascertained using Illumina MiSeq v2.0 sequencing with pair-ended 150 bp read length. The reads overlap for 30 bp in the middle of the enhancer sequences. Three MiSeq libraries were sequenced to obtain enough number of barcodes covering each tested enhancer sequence with the total number of 168 million reads. The barcodes with 15 bp length were sequenced at the same time with the same read names as the enhancer sequences with the same batch of MiSeq.

Lentivirus packaging and MOI

The lentiMPRA library was packaged into lentivirus using the plasmid library, psPAX2 (RRID: Addgene_12260) and pMD2.G (RRID: Addgene_12259) and its titration was determined as previously described30. In brief, iPSCs were plated at 0.150 million cells/well in 24-well plates and incubated for 24 h. Serial volumes (0, 2, 4, 8, 16, 32 ul) of the lentivirus were added. The infected cells were cultured for 3 days and washed with PBS three times before genomic DNA extraction. Genomic DNA was extracted using the Wizard SV genomic DNA purification kit (Promega). Virus titer and copy number of viral particles per cell were measured by qPCR as previously described30.

iPSCs reprogramming and maintenance

Two iPSC lines were used for the lentiMPRA experiment: ACE1815 (two technical replicates, named ACE_a and ACE_b in the figures) and 11251. Lines were generated from human skin fibroblasts obtained from a skin biopsy of normal individuals using a viral-free episomal reprogramming method57,61 at the Yale Stem Cell Reprogramming Core and passaged for 18 or 20 passages before differentiation. All iPSC lines used in this study fulfilled standard reprogramming criteria, including (i) immunocytochemical expression of pluripotency markers (NANOG; SSEA4; TRA1-60); (ii) expression of known hESC/iPSC markers (SOX2, NANOG, LIN-28, GDF3, OCT4, DNMT3B) by semi-quantitative RT-PCR; (iii) downregulation of exogenous reprograming factors. The iPSC lines were cultured on Matrigel (Corning Matrigel Matrix Basement Membrane Growth Factor Reduced)-coated dishes with mTeESR1 media (StemCell Technologies) and propagated using Dispase (StemCell Technologies).

Lentiviral infection

The lentiMPRA library was transduced into iPSCs at the undifferentiated stage. To provide 75%-80% confluency the next day, iPSCs were seeded on Matrigel-coated 10 cm dishes in mTeSR1 media supplemented with 10 uM Y-27632. After 24 h, the cells were infected with the lentivirus library at an average of Multiplicity of Infection (MOI) of 4–5 and incubated for 3 days with daily media changes to remove non-integrated virus. The experiment involved three independent replicate cultures, including two lines from different individuals (ACE1815 and 11251), and one technical replicate (ACE1815_a and ACE1815_b). All cultures were infected at the same time and using the same lentiviral library.

Forebrain organoid differentiation

The iPSC lines were differentiated into forebrain organoids as described in Jourdon et al. 202357 based on modifications of previous protocols62,63. Briefly, undifferentiated iPSC colonies were treated with 5 µM of the Y27632 compound and dissociated to single cells with Accutase (Millipore, 1:2 dilution in PBS 1X). Four million dissociated cells were seeded in each well of a 6-well plate and cultured in mTeSR1 with 10 µM Y27632 compound on an orbital shaker at a speed of 95 rpm. Forebrain neural induction was triggered by dual SMAD inhibition in mTeSR1 media supplemented with 10 µM SB431542, 1 µM LDN193189 and 5 µM Y-27632 (day1). At day 2, embryoid bodies were cultured in KSR medium (DMEM supplemented with 15% Knockout Serum Replacement, 1% L-Glutamine, 1% NEAA, 1% Pen/Strep and 55 µM of 2β-Mercaptoethanol, 2-ME) with the addition of SB431542, LDN193189, XAV939 and Y-27632. The neural induction with dual SMAD inhibition was maintained until day 7 after which organoids were gradually adapted to NIM medium (DMEM/F12, 1% N2 supplement, 2% B27 without vitamin A, 1% NEAA, 1% Pen/Strep, 0.15% Glucose and 1% Glutamax) through a dilution series of KSR and NIM. Neural progenitors proliferation was induced at day 9 in NIM 75% and KSR 25% supplemented with FGF2 (10 ng/ml) and EGF (10 ng/ml) and organoids were maintained in proliferative medium until day 16 in 100% NIM. Terminal differentiation was initiated at day 17 (TD0) in terminal differentiation medium (Neurobasal medium supplemented with 1% N2, 2% B27 without vitamin A, 15 mM HEPES, 1% Glutamax, 1% NEAA and 55 µM 2-ME) with the addition of the neurotrophic factors BDNF (10 ng/ml) and GDNF (10 ng/ml) until TD30. In the differentiation phase, half of the medium was changed twice a week. Organoids were transferred from wells of a 6-well plate to a 10 cm dish between TD5 and TD10 and the speed of the orbital shaker was decreased to 80 rpm.

Cell harvesting, library preparation and DNA/RNA extraction and sequencing for barcodes count

The infected cells were harvested at three different time points: as iPSCs, after 3 days from the infection before starting organoid differentiation (iPSC stage), and as forebrain organoids at earlier (TD0) and more mature (TD30) terminal differentiation points. Genomic DNA and total RNA were simultaneously extracted using the AllPrep DNA/RNA Mini Kit (Qiagen, cat. No. 80204) following the manufacturer’s protocol. RNA samples were treated with Turbo DNase (Life Technologies, cat. No. AM1907) to remove any residual DNA contamination. Sequencing libraries were prepared as previously described30. Briefly, at least 60 µg total RNA per sample was used for reverse transcription with SuperScript II (Life Technologies, cat. No. 18064–071) using the primer P7-pLSmP-ass16UMI-gfp (Supplementary Table S5) to add a 16-bp UMI and a P7 flowcell sequence downstream of the barcode. PCR steps were performed on the DNA and RNA samples in order to amplify barcodes, adding P5 flowcell sequence and sample index upstream, and P7 flowcell sequence and UMI downstream to the barcode. Finally, the sequencing libraries were pooled and subjected to paired-end sequencing with UMI, and sample index read.

Immunostaining

Organoids were randomly selected and fixed in 4% PFA in PBS for 2–4 h. The organoids were then cryopreserved in 25% sucrose overnight, embedded in O.C.T. (Sakura), and frozen on dry ice before being stored at − 80 °C. Serial cryosections were obtained with a thickness of 12–16 µm. Immunostaining was performed by incubating the sections in blocking solution (PBS, 10% Donkey Serum, 1% Triton-100) for 1 h, followed by incubation with primary antibodies (overnight, 4 °C) and secondary antibodies (1–2 h, from Jackson ImmunoResearch or ThermoFisher Scientific). The slides were then mounted with coverslips using VECTASHIELD (Vector Labs) and imaged on a Zeiss microscope equipped with an apotome module and ZEN 3.3 (ZEN pro) software. Three cell lines were used for immunocytochemical analyses, and a minimum of four organoids per line were analyzed. Images were acquired randomly to cover the entire extent of the organoid. Antibody list: FOXG1 (rabbit, 1:200, Takara), PAX6 (mouse, 1:200, BD Bioscience), EOMES (rabbit, 1:1000, Abcam), FOXP2 (goat, 1:200, Santa Cruz), GAD1 (mouse, 1:200, Chemicon), HuC/D (mouse, 1:200, Invitrogen), SOX1 (goat, 1:100, R&D Systems), CTIP2 (rat, 1:500, Abcam).

For quantification, 4 organoids per sample were analyzed. Images were acquired from multiple sections to cover the entire extent of the organoid. Quantification of the average number of PAX6-, FOXG1- and CTIP2-positive cells was performed using Fiji software with the BioVoxxel plug-in under the Fiji analysis software platform. The relative number of PAX6 + , FOXG1 + or CTIP2 + cells was calculated as a percentage of total DAPI + cells across the 4 organoids in each iPSC line/time point.

LentiMPRA computational pipeline

Pre-processing using MPRAflow and MPRAnalyze

The association between enhancer sequences and barcodes was identified using the MPRAflow association package30 with the following command:

nextflow run association.nf –fastq-insert "R1.fastq.gz" –fastq-bc "barcode.fastq.gz" –design "design.fa" –name "MPRAflow" –fastq-insertPE "R2.fastq.gz" -w {workdir} –labels "labels.txt" –cigar 270 M.

The count for number of reads for each RNA and DNA barcodes in all 9 samples for all enhancers were calculated using MPRAflow count package30 with the following command:

nextflow run count.nf –dir "DNA_RNA" –e "experiments.csv" –design "design.fa" –association "MPRAflow_filtered_coords_to_barcodes.pickle" –labels "labels.txt" –umi-length 15 –name "countUMI" –outdir {workdir} –mpranalyze.

MPRAnalyze64 was then performed to normalize the count for RNA barcodes and DNA barcodes in all 9 samples to generate the enhancer activity (RNA/DNA ratio) using the standard pipeline.

Identification of negative control distribution using negative references

A mixed Gaussian distribution was applied to calculate the distribution of negative controls using the 135 negative control activity in 9 samples. The curve fit function in Scipy package65 was applied to estimate the mean and standard deviation of the negative control distribution. The distribution accurately described the negative control data with Kolmogorov–Smirnov test p-value = 0.33 and Anderson–Darling test p-value = 0.12.

Candidate regulatory sequences activity quantification and validation

The p-value for the activity of an enhancer to be higher than the negative control distribution was calculated using the enhancer activity (RNA/DNA) and the negative control distribution. The significantly active enhancers were then identified using the p-value of 0.05 after Bonferroni correction using the total number of enhancers tested.

Estimating sensitivity of MPRA experiment to define an enhancer as an active one

Using the negative control Gaussian distribution, we defined a cutoff for the RNA/DNA ratio that corresponds to a significant p-value. We then calculated the area under the positive control Gaussian and above the cutoff. The sensitivity was then estimated as a ratio of the area and the total area under the positive control Gaussian distribution.

Clustering of enhancer activity for candidate regulatory sequences

Enhancer activity (RNA/DNA) of candidate regulatory sequences and positive reference sequences was calculated by MPRAnalyze. Subsequently, Seaborn 0.11.066 clustermap was employed for clustering the enhancer activity values. Ward's minimum variance method was utilized for clustering. The resulting heatmap displayed the clustered enhancer activity values. The color used for enhancer activity larger than 4 was the same as that used for the value of 4.

Transcription factor binding site identification and expression of the transcription factors

FIMO40 was employed to identify TFBSs from all 7289 sequences, including candidate regulatory sequences, positive reference sequences, and negative reference sequences. The transcription factor binding site annotation was downloaded from JASPAR202267 with 1252 Homo Sapiens annotations. The command for running FIMO40 is as follows:

fimo –o {out_dir} JASPAR2022_tfbs input.fa.

TFBSs identified by FIMO with FDR <  = 0.05 were taken for further analyses. We quantified the number of TFBSs in each sequence and compared the distribution between validated and non-validated candidate regulatory sequences using Kolmogorov–Smirnov test. We further calculated the expression (RPKM) of the TFs which bind to these TFBSs in 97 TD0/TD30 organoid samples. The expression of TFs in validated and non-validated candidate regulatory sequences was compared using the Kolmogorov–Smirnov test.

Genes regulated by candidate regulatory sequences and expression

Genes regulated by candidate enhancers were identified by Amiri et al.31 taking confident_set1, confident_set2, and proximity genes. The expression (RPKM) of the genes was calculated from 97 TD0/TD30 organoid samples.

Intersection with external datasets

Intersection with external datasets were performed using BedTools68 and the resulting data was plotted using the 'UpSetR' package in R69.

Single cell ATAC-seq analysis

For each 10X scATACseq sample, fastq files were first processed by cellranger-atac v2.0.0 with default parameters and 10X prebuilt reference arc-GRCh38-2020-A-2.0.0. The resulted cell-by-peak count matrix was first processed and filtered by Signac following online vignette (https://stuartlab.org/signac/articles/pbmc_vignette.html). Cell types were annotated using our annotated scRNAseq dataset and the label transfer method implemented in Seurat following online vignette (https://satijalab.org/seurat/articles/atacseq_integration_vignette.html). scATACseq peaks were then called using all reads in a sample or subsets of reads from each annotated cell type by running Signac function CallPeaks with default parameters.