To generate the competition training data, we conducted a high-throughput experiment to measure the regulatory effect of millions of random DNA sequences (Methods). Prior research has shown that random DNA can display activity levels akin to genomic regulatory DNA because of the incidental occurrence of numerous TF-binding sites (TFBSs)13,22,26. Here, we cloned 80-bp random DNA sequences into a promoter-like context upstream of a yellow fluorescent protein (YFP), transformed the resulting library into yeast, grew the yeast in Chardonnay grape must and measured expression by fluorescence-activated cell sorting (FACS) and sequencing13,27,28 (Methods). This resulted in a training dataset of 6,739,258 random promoter sequences and their corresponding mean expression values.

We provided these data to the competitors, who could use them to train their model, with two key restrictions. First, competitors were not allowed to use external datasets in any form to ensure that all models are trained on the same dataset. Second, ensemble predictions were also disallowed as they would almost certainly provide a boost in performance but without providing any insight into the best model types and training strategies.

We evaluated the models on a set of ‘test’ sequences designed to probe the predictive ability of the models in different ways. The measured expression levels driven by these sequences were quantified in the same way as the training data but in a separate experiment with more cells sorted per sequence (~100), yielding more accurate estimated expression levels compared to the training data measurements and providing higher confidence in the challenge evaluation. The test set consisted of 71,103 sequences from several promoter sequence types. We included both random sequences and sequences from the yeast genome to get an estimate of performance difference between the random sequences in the training domain and naturally evolved sequences. We also included sequences designed to capture known limitations of previous models trained on similar data, namely sequences at the high-expression and low-expression extremes and sequences designed to maximize the disagreement between the predictions of a previously developed CNN and a physics-informed NN (‘biochemical model’)13,22. We previously found that predicting changes in expression between closely related sequences (that is, nearly identical DNA sequences) is substantially more challenging; hence, we included subsets where models had to predict changes that result from single-nucleotide variants (SNVs), perturbations of specific TFBSs and tiling of TFBSs across background sequences13,22. Each test subset was given a different weight when scoring the submissions, proportional to the number of sequences in the set and how important we considered it to be (Table 1). For instance, predicting the effects of SNVs on gene expression is a critical challenge for the field because of its relevance to complex trait genetics29. Accordingly, a substantial number of SNV sequence pairs were included in the test set and SNVs were given the highest weight. 
Within each sequence subset, we determined model performance using Pearson’s r² and Spearman’s ρ, which captured the linear correlation and monotonic relationship between the predicted and measured expression levels (or expression differences), respectively. The weighted sum of each performance metric across test subsets yielded our two final performance measurements, which we called the Pearson score and Spearman score.
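As an illustration of how the two scores combine per-subset metrics, the following is a minimal sketch using only the Python standard library. The subset names, the weights and the choice to normalize by the total weight are assumptions made for the example (the actual weights are those listed in Table 1), and the function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

def weighted_scores(subsets, weights):
    """Combine per-subset metrics into a Pearson score and a Spearman score.

    subsets: name -> (predicted, measured); for variant subsets these would be
             predicted and measured expression differences.
    weights: name -> weight (hypothetical values in this sketch).
    Dividing by the total weight is an assumption of this sketch.
    """
    total = sum(weights[name] for name in subsets)
    pearson_score = sum(
        weights[name] * pearson_r(pred, meas) ** 2
        for name, (pred, meas) in subsets.items()
    ) / total
    spearman_score = sum(
        weights[name] * spearman_rho(pred, meas)
        for name, (pred, meas) in subsets.items()
    ) / total
    return pearson_score, spearman_score
```

Calling `weighted_scores({"random": (pred_r, meas_r), "SNVs": (pred_s, meas_s)}, {"random": 1.0, "SNVs": 3.0})` returns the two scores as a tuple. A subset with zero variance in either vector would raise a division error and would need separate handling.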

Table 1 Summary of the test subsets

Our DREAM Challenge ran for 12 weeks in the summer of 2022 and included two evaluation stages: the public leaderboard phase and the private evaluation phase (Fig. 1a). The leaderboard opened 6 weeks into the competition and allowed teams to submit up to 20 predictions on the test data per week. At this stage, we used 13% of the test data for leaderboard evaluation and displayed only the overall Pearson’s r², Spearman’s ρ, Pearson score and Spearman score to the participants, while keeping the performance on the promoter subsets and the specific sequences used for the evaluation hidden. The participating teams achieved increasing performance each week (Extended Data Fig. 1), showcasing the effectiveness of such challenges in motivating the development of better machine learning models. Over 110 teams across the globe competed in this stage. At the end of the challenge, 28 teams submitted their models for final evaluation. We used the remaining test data (~87%) for the final evaluation (Fig. 1b,c and Extended Data Fig. 2).

Fig. 1: Overview of the challenge. a, Left, competitors received a training dataset of random promoters and corresponding expression values. Middle, they continually refined their models and competed for dominance in a public leaderboard. Right, at the end of the challenge, they submitted a final model for evaluation using a test dataset consisting of eight sequence types: (i) high expression, (ii) low expression, (iii) native, (iv) random, (v) challenging, (vi) SNVs, (vii) motif perturbation and (viii) motif tiling. b,c, Bootstrapping provides a robust comparison of the model predictions. Distribution of ranks in n = 10,000 samples from the test dataset (y axes) for the top-performing teams (x axes) for Pearson score (b) and Spearman score (c). d,e, Performance of the top-performing teams in each test data subset. Model performance (color and numerical values) of each team (y axes) in each test subset (x axes) for Pearson’s r² (d) and Spearman’s ρ (e). Heat map color palettes are min–max-normalized by column. f,g, Performance disparities observed between the best and worst models (x axes) in different test subsets (y axes) for Pearson’s r² (f) and Spearman’s ρ (g). The calculation of the percentage difference is relative to the best model performance for each test subset.
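The bootstrapped rank comparison in Fig. 1b,c can be sketched roughly as follows. This is a simplified illustration under stated assumptions: it resamples test sequences with replacement and ranks teams by a single squared Pearson correlation rather than by the weighted Pearson or Spearman score, and the function names and team labels are hypothetical.

```python
import random
from collections import Counter

def pearson_r2(x, y):
    """Squared Pearson correlation; returns 0.0 for a degenerate resample."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

def bootstrap_rank_distribution(predictions, measured, n_boot=10000, seed=0):
    """Count how often each team attains each rank across bootstrap resamples.

    predictions: team name -> list of predicted values (same order as measured).
    Returns: team name -> Counter mapping rank (1 = best) to number of resamples.
    """
    rng = random.Random(seed)
    n = len(measured)
    teams = list(predictions)
    rank_counts = {team: Counter() for team in teams}
    for _ in range(n_boot):
        # Draw n test sequences with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        y = [measured[i] for i in idx]
        scores = {
            team: pearson_r2([predictions[team][i] for i in idx], y)
            for team in teams
        }
        # Rank teams from best (1) to worst within this resample.
        ordered = sorted(teams, key=lambda team: scores[team], reverse=True)
        for rank, team in enumerate(ordered, start=1):
            rank_counts[team][rank] += 1
    return rank_counts
```

A team whose rank distribution is concentrated on a single rank across resamples is separated from its neighbors more reliably than one whose ranks are spread out.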

Innovative model designs surpass the state of the art

We retrained the transformer model architecture of Vaishnav et al.12, the previous best-performing model for this type of data, on the challenge data and used it as a reference in the leaderboard (‘reference model’). The overall performance of top submissions, all NNs, was substantially better than that of the reference model. Despite the recent prominence of attention-based architectures22, only one of the top five submissions in the challenge used transformers, placing third. The best-performing submissions were dominated by fully convolutional NNs, which took first, fourth and fifth places. The best-performing solution was based on the EfficientNetV2 architecture30,31 and the fourth and fifth solutions were based on the ResNet architecture32. Moreover, all teams used convolutional layers as the starting point in their model design. An RNN with bidirectional long short-term memory (Bi-LSTM) layers33,34 placed second. While the teams broadly converged on many similar training strategies (for example, using Adam35 or AdamW36 optimizers), they also had substantial differences (Table 2).

Table 2 Breakdown of the top-performing models into key components

The competing teams introduced several innovative approaches to solve the expression prediction problem. Autosome.org, the best-performing team, transformed the task into a soft-classification problem by training their network to predict a vector of expression bin probabilities, which was then averaged to yield an estimated expression level, effectively recreating how the data were generated in the experiment. They also used a distinct data-encoding method by adding channels to the traditional four-channel one-hot encoding (OHE) of the DNA sequence used by most teams. The two additional channels indicated (1) whether the sequence provided as input was likely measured in only one cell (which results in an integer expression value) and (2) whether the input sequence is being provided in the reverse complement orientation. Furthermore, Autosome.org’s model, with only 2 million parameters, was the model with the fewest parameters among the top ten submissions, demonstrating that efficient design can considerably reduce the necessary number of parameters. Autosome.org and BHI were distinct in training their final model on the entirety of the provided training data (that is, no sequences withheld for validation) for a prespecified number of epochs (determined beforehand by cross-validation on held-out validation subsets). Unlock_DNA, the third best team, took a novel approach by randomly masking 5% of the input DNA sequence and having the model predict both the masked nucleotides and gene expression. This approach used the masked nucleotide predictions as a regularizer, adding a reconstruction loss to the model loss function, which stabilized the training of their large NN. BUGF, the ninth best team, used a somewhat similar strategy where they randomly mutated 15% of the sequence and calculated an additional binary cross-entropy loss predicting whether any base pair in the sequence had been mutated.
The fifth best team, NAD, used GloVe37 to generate embedding vectors for each base position and used these vectors as inputs for their NN, whereas the other teams used traditional OHE DNA sequences. Two teams, SYSU-SAIL-2022 (11th) and Davuluri lab (16th), attempted to train DNA language models38 by pretraining a BERT (bidirectional encoder representations from transformers) language model39 on the challenge data and subsequently using the BERT embeddings to train an expression predictor.
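To make the six-channel input encoding and the soft-classification readout more concrete, the sketch below shows one possible way to express them in plain Python. It is not the team's implementation: the channel order, the treatment of ambiguous bases, the use of bin indices as bin values and all function names are assumptions made for illustration.

```python
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

def reverse_complement(seq):
    """Reverse complement of a DNA string (N is kept as N)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def looks_like_singleton(expression):
    """Integer-valued expression suggests a sequence measured in a single cell."""
    return float(expression).is_integer()

def encode_six_channel(seq, is_singleton, is_reverse):
    """Encode a sequence as a list of per-position 6-element vectors.

    Channels 0-3: one-hot A, C, G, T (any other symbol becomes all zeros).
    Channel 4:    1.0 if the sequence was likely measured in only one cell.
    Channel 5:    1.0 if the sequence is presented as its reverse complement.
    In this sketch the function itself applies the reverse complement when
    is_reverse is set; that is a design choice of the example.
    """
    if is_reverse:
        seq = reverse_complement(seq)
    flags = [float(is_singleton), float(is_reverse)]
    return [[1.0 if base == b else 0.0 for b in BASES] + flags for base in seq]

def expected_expression(bin_probs, bin_values=None):
    """Collapse predicted bin probabilities into one expression estimate.

    bin_values defaults to the bin indices 0, 1, 2, ... (an assumption here).
    """
    if bin_values is None:
        bin_values = range(len(bin_probs))
    return sum(p * v for p, v in zip(bin_probs, bin_values))
```

For example, `encode_six_channel(seq, looks_like_singleton(y), is_reverse=False)` yields the network input for a training pair `(seq, y)`, and `expected_expression` turns the network's softmax output back into a scalar prediction.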

Test sequence subsets reveal model disparities

Analysis of model performance on the different test subsets revealed distinct and shared challenges for the different models. The top two models were ranked first and second (sometimes with ties) for each test subset regardless of score metric, showcasing that their superior performance could not be attributed to any single test subset (Fig. 1d,e). Furthermore, the rankings within each test subset sometimes differed between the Pearson score and Spearman score, reinforcing that these two measures capture performance in distinct ways (Fig. 1d,e).

While the ranking of models was similar for both random and native sequences, the differences in model performance were greater for native yeast sequences than random sequences. Specifically, performance differed between models by as much as 17.6% for native sequences but only 5% for random sequences (Pearson’s r², Fig. 1f). Similarly, this difference was 9.6% (native) versus 2.7% (random) for Spearman’s ρ (Fig. 1g). This suggests that the top models learned more of the regulatory grammar that evolution has produced. Furthermore, the substantial discrepancy between performance on native and random sequences suggests that there is yet more regulatory logic to learn (although the native DNA has lower sequence coverage, presumably because of its higher repeat content, likely reducing data quality and predictability of this set; Extended Data Fig. 3).
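The percentage differences quoted here are expressed relative to the best model in each subset (Fig. 1f,g). A minimal sketch of that calculation, with a hypothetical function name and made-up scores, is:

```python
def percent_gap_to_best(scores):
    """Percentage difference between best and worst score, relative to the best."""
    best, worst = max(scores), min(scores)
    return 100.0 * (best - worst) / best
```

With illustrative per-model scores of 0.80, 0.70 and 0.66 on one subset, the gap is (0.80 − 0.66) / 0.80, or 17.5%.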

Models were also highly variable in their ability to accurately predict variation within the extremes of gene expression. The cell sorter had a reduced signal-to-noise ratio at the lowest expression levels and the sorting bin placement could truncate the tails of the expression distribution6,12. Overall, model performance was most variable across teams in these subsets, suggesting that the challenge models were able to overcome these issues to varying degrees. For example, the median difference in Pearson’s r² between the highest and lowest performance was ~48% for high-test and low-test subsets and 16% for the others (Fig. 1f,g).

The models also varied in their ability to predict expression differences between closely related sequences (Fig. 1d,e, ‘SNVs’, and Extended Data Figs. 4 and 5), with more substantial differences in model performance for subtler changes. Specifically, the percentage differences between best and worst in Pearson’s r² and Spearman’s ρ were 6.5% and 4% for motif perturbation, 17.7% and 7% for motif tiling and 14.6% and 9.6% for SNVs, respectively, suggesting that the top-performing models better captured the subtleties of cis-regulation. This is consistent with our understanding of the subtlety of the impact; perturbing TFBSs (motif perturbations, where we mutate sequences strongly matching the cognate motif for an important TF or vary the number of binding sites) represented a comparatively large perturbation and could be predicted with simple models that capture the binding of these TFs and can count TFBS instances. However, when TFBSs are tiled across a background sequence, the same TFBS is present in every sequence and the model must have learned how its position affects its activity, in addition to capturing all the secondary TFBSs that are created or destroyed as the motif is tiled13. Lastly, SNVs are even harder to predict because nearly everything about the sequence is identical but for a single nucleotide that may affect the binding of multiple TFs in potentially subtle ways.

Prix Fixe framework reveals optimal model configurations

The top three solutions from the DREAM Challenge were distinguished both by their substantial improvement in performance compared to other models and their distinct approaches to data handling, preprocessing, loss calculations and diverse NN layers, encompassing convolutional, recurrent and self-attention mechanisms. To identify the factors underlying their performances, we developed a Prix Fixe framework that broke down each solution into distinct modules and, by selecting one of each module type, tested arbitrary combinations of the modules from each solution (Fig. 2a). We reimplemented the top three solutions within this framework and found that 45 of 81 possible combinations were compatible. We removed specific test time processing steps unique to each solution that were not comparable across solutions. Lastly, we retrained all compatible combinations using the same training and validation data, addressing the issue that some original solutions had used the entire dataset for training. Our approach facilitated a systematic and fair comparison of the individual contributions of different components to overall performance.
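The combinatorial search space can be illustrated with a short sketch: three solutions and four module types (the types named in Fig. 2b) give 3⁴ = 81 candidate combinations. The compatibility predicate below is purely hypothetical and is not the rule used in the study, where compatibility depended on whether the reimplemented modules could actually be connected and trained.

```python
from itertools import product

TEAMS = ("Autosome.org", "BHI", "UnlockDNA")
MODULE_TYPES = (
    "data_processor_and_trainer",
    "first_layer_block",
    "core_layer_block",
    "final_layer_block",
)

def enumerate_combinations(is_compatible=None):
    """Return (all combinations, compatible combinations).

    Each combination maps a module type to the team whose module is used.
    is_compatible is a user-supplied predicate; by default all are accepted.
    """
    combos = [
        dict(zip(MODULE_TYPES, choice))
        for choice in product(TEAMS, repeat=len(MODULE_TYPES))
    ]
    if is_compatible is None:
        return combos, list(combos)
    return combos, [combo for combo in combos if is_compatible(combo)]

def toy_rule(combo):
    """Hypothetical rule for illustration only (not from the study)."""
    return (
        combo["core_layer_block"] != "UnlockDNA"
        or combo["first_layer_block"] == "UnlockDNA"
    )

all_combos, kept = enumerate_combinations(toy_rule)
print(len(all_combos), "combinations,", len(kept), "kept by the toy rule")
```

Representing each model as a mapping from module type to source solution keeps the retraining loop uniform: every compatible entry can be built, trained on the same data split and scored in the same way.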

Fig. 2: Dissecting the optimal model configurations through a Prix Fixe framework. a, The framework deconstructs each team’s solution into modules, enabling modules from different solutions to be combined. b, Performance in Pearson score from the Prix Fixe runs for all combinations of modules from the top three DREAM Challenge solutions. Each cell represents the performance obtained from a unique combination of core layer block (major rows, left), data processor and trainer (major columns, top), first layer block (minor rows, right) and final layer block (minor columns, bottom) modules. Gray cells denote combinations that were either incompatible or did not converge during training. c, Performance (Pearson score, y axis) of the three data processor and trainer modules (x axis and colors) for each Prix Fixe model including the respective module (individual points). Original model combinations are indicated by white points, while all other combinations are in black. d, Number of parameters (x axis) for the top three DREAM Challenge models (Autosome.org, BHI and UnlockDNA) along with their best-performing counterparts (based on core layer block), DREAM-CNN, DREAM-RNN and DREAM-Attn, in the Prix Fixe runs (y axis). e, As in d, but showing each model’s Pearson score (x axis).

Our analysis revealed both the source of Autosome.org’s exceptional performance and the interplay of different model components, along with their potential for further optimization. The BHI and UnlockDNA NNs saw a notable improvement in performance when retrained using Autosome.org’s data processor and trainer (Fig. 2b,c and Extended Data Figs. 6 and 7). Moreover, each team’s model architecture could be optimized further, resulting in models that achieved better performance (Fig. 2c) using the same core blocks but with similar or fewer parameters (Fig. 2d). However, except for Autosome.org’s data processor and trainer module, no other module component dominated the others and their performance appeared to depend on what other modules they were combined with (Supplementary Fig. 1). For each core block of Autosome.org, BHI and UnlockDNA, we named the optimal Prix Fixe model as DREAM-CNN, DREAM-RNN and DREAM-Attn, respectively. The DREAM models learned a very similar view of the cis-regulatory logic as shown by the similar attribution scores (Extended Data Fig. 8) using in silico mutagenesis (ISM). Interestingly, in addition to agreeing on the large effects where recognizable consensus TFBSs were altered, the models also agreed on the smaller effects that varied in sign over 1–3 bp, which is too short to correspond to consensus TFBSs40, supporting the notion that the abundance of low-affinity binding sites has an important role in many cis-regulatory elements (CREs)7,13,41.
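In silico mutagenesis scores each position by substituting every alternative base and recording the change in predicted expression. The sketch below illustrates the idea with a stand-in predictor; the motif-counting `toy_predict` function is hypothetical and exists only so the example runs, and any trained model exposing a sequence-to-number function could take its place.

```python
BASES = "ACGT"

def in_silico_mutagenesis(seq, predict):
    """Return, for each position, the predicted effect of each substitution.

    The result is a list with one dict per position, mapping each alternative
    base to predict(mutant) - predict(reference).
    """
    reference = predict(seq)
    effects = []
    for pos, ref_base in enumerate(seq):
        row = {}
        for alt in BASES:
            if alt == ref_base:
                continue
            mutant = seq[:pos] + alt + seq[pos + 1:]
            row[alt] = predict(mutant) - reference
        effects.append(row)
    return effects

def toy_predict(seq):
    """Hypothetical stand-in model: counts occurrences of the motif TATA."""
    return float(seq.count("TATA"))

effects = in_silico_mutagenesis("GGTATAGG", toy_predict)
for pos, row in enumerate(effects):
    print(pos, row)
```

Per-position attribution scores can then be derived from these effects, for example by averaging the three substitution effects at each position; the exact aggregation is a choice this sketch leaves open.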

Optimized models outperform the state of the art for other species and data types

To determine whether the model architectures and training strategies we optimized on yeast data would generalize to other species, we next applied them to Drosophila melanogaster and human datasets on a diverse set of tasks. First, we tested their ability to predict gene regulatory activity measured in D. melanogaster (in the context of a developmental and a housekeeping promoter) in a self-transcribing active regulatory region sequencing (STARR-seq) massively parallel reporter assay (MPRA). This fundamentally represents the same sequence-to-expression problem the models were designed to solve, despite the different organism (Drosophila versus yeast), experimental measurement approach (RNA sequencing versus cell sorting), longer sequence (249 bp versus 150 bp), smaller datasets (~500,000 versus 6.7 million) and the transition from a single-task to a multitask framework (two promoter types). We compared the DREAM-optimized models to DeepSTARR42, a state-of-the-art CNN model based on the Basset20 architecture and specially developed for predicting the data we used in this benchmark (STARR-seq with unique molecular identifier integration (UMI-STARR-seq)43 in D. melanogaster S2 cells42,44). For a robust comparison, we trained the models using cross-validation and always evaluated on the same held-out test data (Methods). Our models consistently outperformed DeepSTARR across both developmental and housekeeping transcriptional programs (Fig. 3a), with the DREAM-RNN’s model performance surpassing that of DREAM-CNN and DREAM-Attn.

Fig. 3: DREAM Challenge models beat existing benchmarks on Drosophila and human datasets. a, D. melanogaster STARR-seq42 prediction. Pearson’s correlation for predicted versus actual enhancer activity for held-out data (y axis) for two different transcriptional programs (x axis) for each model (colors). b, Human MPRA45 prediction. Pearson correlation for predicted versus actual expression for held-out data (y axis) for MPRA datasets from three distinct human cell types (x axis) for each model (colors). c,d, Human accessibility (bulk K562 ATAC-seq)46,49 prediction. For each model (x axis and colors), model performance (y axes) is shown in terms of both Pearson’s correlation for predicted versus actual read counts per element (c) and 1 − median Jensen–Shannon distance for predicted versus actual chromatin accessibility profiles across each element (d). In a–d, points represent folds of cross-validation, performance is evaluated on held-out test data and P values determined by t-tests (paired, two-sided) comparing the previous state-of-the-art model to the optimized models are shown above the model performance distributions. e, Comparison of the number of parameters (x axis) for different models used in chromatin accessibility prediction task.

To further validate the generalizability of our models, we next trained the DREAM-optimized models on lentivirus-based MPRAs (lentiMPRAs) that tested CREs across three human cell types: hepatocytes (HepG2), lymphoblasts (K562) and induced pluripotent stem cells (WTC11)45. Here, our models had to capture more complex regulatory activity from vastly smaller datasets (~56,000–226,000 versus 6.7 million). We compared the models against MPRAnn45, a CNN model optimized for these specific datasets (Methods). All models were trained using cross-validation and evaluated on held-out test data in the same way that MPRAnn was originally trained45. The DREAM-optimized models substantially outperformed MPRAnn, with the performance difference widening with more training data (Fig. 3b). The only exception was DREAM-Attn, which did not outperform MPRAnn on the smallest dataset (WTC11; 56,000 sequences). Again, DREAM-RNN demonstrated the best performance among our models, especially for larger datasets.

To assess the models on a distinct prediction task that still relates to CRE function, we applied our optimized models to the task of predicting open chromatin. Specifically, we compared our optimized models to ChromBPNet46,47,48, a BPNet-based16 model that predicts assay for transposase-accessible chromatin with sequencing (ATAC-seq) signals across open chromatin regions. Here, the input DNA sequences were ~14 times longer than the yeast promoters on which the DREAM models were optimized (2,114 versus 150 bp) and the models were now tasked with simultaneously predicting the overall accessibility (read counts) and accessibility profile (read distribution) for a central 1,000-bp section, rather than predicting a single expression value. While DREAM-Attn could not be trained because the memory requirement for the attention block became too large with such a long input sequence, we trained and evaluated the other DREAM-optimized models and ChromBPNet on K562 bulk ATAC-seq data49 (Methods). DREAM-RNN outperformed ChromBPNet substantially in predictions of both read count and chromatin accessibility (Fig. 3c,d), highlighting the adaptability of our models even on substantially different cis-regulatory data types. DREAM-CNN, on the other hand, performed on par with ChromBPNet46 in predictions of read count (Fig. 3c) but was less effective in predicting chromatin accessibility profiles (Fig. 3d).
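The profile metric reported in Fig. 3d, 1 − median Jensen–Shannon distance, can be sketched as follows. This is an illustrative standard-library version rather than the evaluation code used in the study; in particular, the base-2 logarithm (which bounds the distance between 0 and 1) and the function names are assumptions.

```python
import math
from statistics import median

def _normalize(counts):
    """Convert non-negative counts into a probability distribution."""
    total = float(sum(counts))
    if total <= 0:
        raise ValueError("profile must contain at least one read")
    return [c / total for c in counts]

def _kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p == 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def jensen_shannon_distance(predicted_counts, observed_counts):
    """Square root of the Jensen-Shannon divergence between two profiles."""
    p = _normalize(predicted_counts)
    q = _normalize(observed_counts)
    m = [(a + b) / 2 for a, b in zip(p, q)]
    divergence = 0.5 * _kl_divergence(p, m) + 0.5 * _kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding

def profile_score(profile_pairs):
    """1 - median Jensen-Shannon distance over (predicted, observed) profiles."""
    distances = [jensen_shannon_distance(p, o) for p, o in profile_pairs]
    return 1.0 - median(distances)
```

Because each profile is normalized before comparison, this metric measures only the shape of the read distribution across a region; the overall accessibility is assessed separately through the read-count correlation.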

Notably, the architectures and training paradigms of the DREAM-optimized models were changed minimally for these evaluations (Extended Data Fig. 9). The components that could not accommodate the data were discarded (for example, the input-encoding channel denoting singleton observations was not compatible with STARR-seq, MPRA and ATAC-seq data; Methods). The only other modifications made were required for the prediction head to predict the new task (for example, the final layer block architecture and using task-specific loss functions; Methods) or to adapt to the smaller number of training sequences compared to the DREAM dataset (reducing the batch size and/or maximum learning rate (LR); Methods). Importantly, DREAM-RNN outperformed the other Prix Fixe optimized models in all of these secondary benchmarks (Fig. 3a–d), highlighting its excellent generalizability.