Rosenberg, A.B. et al. Cell 163, 698–711 (2015).

Splicing is a complex and highly regulated process that generates enormous transcriptional diversity. Researchers have tried various approaches to learn the sequence-based determinants of splicing activity, with limited success. Rosenberg et al. identified universal splicing elements by training a machine-learning model on more than 2 million data points from a synthetic mini-gene assay. Each barcoded mini-gene carried two closely spaced, fully degenerate 25-nucleotide sequences in either the donor or the acceptor region of an intron separating two exons. The authors transfected pooled plasmids into HEK293 cells and sequenced the RNA to determine splice status. A model trained only on these synthetic data accurately predicted the effects of human genetic variants on isoform ratios. The authors also showed that splicing occurs preferentially at upstream donor sites and that exon inclusion is regulated cooperatively by cis elements.
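
To make the notion of an isoform ratio concrete, the following minimal Python sketch tallies, for each barcode, the fraction of sequenced reads assigned to each splice outcome. It is an illustration only and not the authors' analysis pipeline: the function name, the isoform labels and the toy reads are all hypothetical, and it assumes each read has already been assigned to a barcode and a splice outcome.

```python
from collections import Counter, defaultdict


def isoform_fractions(reads):
    """Return {barcode: {isoform_label: fraction of that barcode's reads}}.

    `reads` is an iterable of (barcode, isoform_label) pairs, one per
    sequenced RNA read. Both the input shape and the labels are
    assumptions made for this illustration.
    """
    counts = defaultdict(Counter)
    for barcode, isoform in reads:
        counts[barcode][isoform] += 1

    fractions = {}
    for barcode, isoform_counts in counts.items():
        total = sum(isoform_counts.values())
        fractions[barcode] = {
            isoform: n / total for isoform, n in isoform_counts.items()
        }
    return fractions


# Hypothetical toy reads: barcode "ACGT" is seen three times, "TTAG" once.
toy_reads = [
    ("ACGT", "upstream_donor"),
    ("ACGT", "upstream_donor"),
    ("ACGT", "downstream_donor"),
    ("TTAG", "unspliced"),
]
result = isoform_fractions(toy_reads)
```

In an assay of this kind, per-barcode fractions like these would be the quantities a sequence-based model is trained to predict from the degenerate region linked to each barcode.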