Interpreting neural networks for biological sequences by learning stochastic masks

A preprint version of the article is available at bioRxiv.

Abstract

Sequence-based neural networks can learn to make accurate predictions from large biological datasets, but model interpretation remains challenging. Many existing feature attribution methods are optimized for continuous rather than discrete input patterns and assess individual feature importance in isolation, making them ill-suited for interpreting nonlinear interactions in molecular sequences. Here, building on work in computer vision and natural language processing, we developed an approach based on deep learning—scrambler networks—wherein the most important sequence positions are identified with learned input masks. Scramblers learn to predict position-specific scoring matrices where unimportant nucleotides or residues are scrambled by raising their entropy. We apply scramblers to interpret the effects of genetic variants, uncover nonlinear interactions between cis-regulatory elements, explain binding specificity for protein–protein interactions, and identify structural determinants of de novo-designed proteins. We show that scramblers enable efficient attribution across large datasets and result in high-quality explanations, often outperforming state-of-the-art methods.
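The idea of scrambling a position by raising its entropy can be illustrated with a toy masking operator. The sketch below is an assumption made for illustration, not the operator defined in the paper: the function `scramble_position`, the uniform background and the way the importance score scales the one-hot logit are all hypothetical choices. It only shows that a score of zero leaves a PSSM column at the background distribution, while a large score recovers the original letter.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a short list of logits
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def scramble_position(one_hot, background, score):
    # Hypothetical masking operator: the importance score scales the one-hot
    # logit that is added to the log-background. A score of 0 returns the
    # background (fully scrambled); a large score returns the original letter.
    logits = [math.log(b) + score * x for b, x in zip(background, one_hot)]
    return softmax(logits)

def entropy_bits(p):
    # Shannon entropy of one PSSM column, in bits
    return -sum(v * math.log2(v) for v in p if v > 0)

background = [0.25, 0.25, 0.25, 0.25]  # assumed uniform background over A, C, G, T
letter_a = [1.0, 0.0, 0.0, 0.0]        # one-hot encoding of the letter A

for score in (0.0, 2.0, 8.0):
    column = scramble_position(letter_a, background, score)
    print(score, [round(v, 3) for v in column], round(entropy_bits(column), 3))
```

In this toy setting, the entropy of the column falls from 2 bits towards 0 as the score grows, which is the sense in which low-importance positions stay "scrambled".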

Fig. 1: Scrambler architecture and masking operator.
Fig. 2: MNIST feature attribution.
Fig. 3: APA feature attribution.
Fig. 4: 5′ UTR feature attribution.
Fig. 5: Protein heterodimer feature attribution.
Fig. 6: Protein structure feature attribution.

Data availability

The data analysed in this study originated from several previous publications and data repositories. The data for the MNIST prediction task (Fig. 2) are available at https://keras.io/api/datasets/mnist/. The data from Bogard and colleagues (ref. 7) for the polyadenylation prediction task (Fig. 3) are available on Gene Expression Omnibus at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE113849. The data from Sample and co-workers (ref. 10) for the 5′ UTR prediction task (Fig. 4) are available on Gene Expression Omnibus at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE114002. The full data from Chen et al. (ref. 60) for the protein–protein interaction prediction task (Fig. 5) are available from the corresponding author on request. The subset of sequences used in the benchmark comparisons is included in the GitHub repository (http://www.github.com/johli/scrambler). The de novo-designed sequences from Anishchenko and colleagues (ref. 62) for the protein structure prediction task (Fig. 6) are available from the corresponding author on request.

Code availability

All code (ref. 68) is available at http://www.github.com/johli/scrambler (https://doi.org/10.5281/zenodo.5676173), licensed under the MIT License. External software used in this study is listed in the Methods.

References

  1. Alipanahi, B., Delong, A., Weirauch, M. & Frey, B. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).

  2. Avsec, Z. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021).

  3. Eraslan, G., Avsec, Z., Gagneur, J. & Theis, F. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403 (2019).

  4. Movva, R. et al. Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays. PLoS ONE 14, e0218073 (2019).

  5. Zhou, J. & Troyanskaya, O. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).

  6. Arefeen, A., Xiao, X. & Jiang, T. DeepPASTA: deep neural network based polyadenylation site analysis. Bioinformatics 35, 4577–4585 (2019).

  7. Bogard, N., Linder, J., Rosenberg, A. & Seelig, G. A deep neural network for predicting and engineering alternative polyadenylation. Cell 178, 91–106 (2019).

  8. Cheng, J. et al. MMSplice: modular modeling improves the predictions of genetic variant effects on splicing. Genome Biol. 20, 48 (2019).

  9. Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535–548 (2019).

  10. Sample, P. et al. Human 5’ UTR design and variant effect prediction from a massively parallel translation assay. Nat. Biotechnol. 37, 803–809 (2019).

  11. Senior, A. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020).

  12. Yang, J. et al. Improved protein structure prediction using predicted interresidue orientations. Proc. Natl Acad. Sci. USA 117, 1496–1503 (2020).

  13. Talukder, A., Barham, C., Li, X. & Hu, H. Interpretation of deep learning in genomics and epigenomics. Brief. Bioinform. 22, bbaa177 (2020).

  14. Lanchantin, J., Singh, R., Wang, B. & Qi, Y. Deep motif dashboard: visualizing and understanding genomic sequences using deep neural networks. In 2017 Pacific Symposium on Biocomputing 254–265 (2017); https://doi.org/10.1142/9789813207813_0025

  15. Schreiber, J., Lu, Y. & Noble, W. Ledidi: designing genome edits that induce functional activity. Preprint at bioRxiv https://doi.org/10.1101/2020.05.21.109686 (2020).

  16. Norn, C. et al. Protein sequence design by conformational landscape optimization. Proc. Natl Acad. Sci. USA 118, e2017228118 (2021).

  17. Kelley, D., Snoek, J. & Rinn, J. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).

  18. Zeng, W., Wu, M. & Jiang, R. Prediction of enhancer–promoter interactions via natural language processing. BMC Genomics 19, 13–22 (2018).

  19. Kelley, D. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).

  20. Zeng, W., Wang, Y. & Jiang, R. Integrating distal and proximal information to predict gene expression via a densely connected convolutional neural network. Bioinformatics 36, 496–503 (2020).

  21. Singh, S., Yang, Y., Póczos, B. & Ma, J. Predicting enhancer–promoter interaction from genomic sequence with deep neural networks. Quant. Biol. 7, 122–137 (2019).

  22. Calvo, S., Pagliarini, D. & Mootha, V. Upstream open reading frames cause widespread reduction of protein expression and are polymorphic among humans. Proc. Natl Acad. Sci. USA 106, 7507–7512 (2009).

  23. Araujo, P. et al. Before it gets started: regulating translation at the 5’ UTR. Comp. Funct. Genomics https://doi.org/10.1155/2012/475731 (2012).

  24. Whiffin, N. et al. Characterising the loss-of-function impact of 5′ untranslated region variants in 15,708 individuals. Nat. Commun. 11, 2523 (2020).

  25. Simonyan, K., Vedaldi, A. & Zisserman, A. Deep inside convolutional networks: visualising image classification models and saliency maps. Preprint at https://arxiv.org/abs/1312.6034 (2013).

  26. Zeiler, M. & Fergus, R. Visualizing and understanding convolutional networks. In European Conference on Computer Vision 818–833 (Springer, 2014); https://doi.org/10.1007/978-3-319-10590-1_53

  27. Springenberg, J., Dosovitskiy, A., Brox, T. & Riedmiller, M. Striving for simplicity: the all convolutional net. Preprint at https://arxiv.org/abs/1412.6806 (2014).

  28. Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. In International Conference on Machine Learning 3319–3328 (PMLR, 2017).

  29. Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. In International Conference on Machine Learning 3145–3153 (PMLR 2017).

  30. Lundberg, S. & Lee, S.-I. A unified approach to interpreting model predictions. In Proc. 31st International Conference on Neural Information Processing Systems 4768–4777 (NIPS, 2017).

  31. Ribeiro, M., Singh, S. & Guestrin, C. Why should I trust you? Explaining the predictions of any classifier. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 1135–1144 (ACM, 2016).

  32. Fong, R. & Vedaldi, A. Interpretable explanations of black boxes by meaningful perturbation. In 2017 IEEE International Conference on Computer Vision 3449–3457 (IEEE, 2017); https://doi.org/10.1109/ICCV.2017.371

  33. Fong, R., Patrick, M. & Vedaldi, A. Understanding deep networks via extremal perturbations and smooth masks. In 2019 IEEE/CVF International Conference on Computer Vision 2950–2958 (IEEE, CVF, 2019); https://doi.org/10.1109/ICCV.2019.00304

  34. Dabkowski, P. & Gal, Y. Real time image saliency for black box classifiers. Preprint at https://arxiv.org/abs/1705.07857 (2017).

  35. Chen, J., Song, L., Wainwright, M. & Jordan, M. Learning to explain: an information-theoretic perspective on model interpretation. In International Conference on Machine Learning 883–892 (PMLR, 2018).

  36. Yoon, J., Jordon, J. & van der Schaar, M. INVASE: instance-wise variable selection using neural networks. In International Conference on Learning Representations (ICLR, 2018).

  37. Chang, C., Creager, E., Goldenberg, A. & Duvenaud, D. Explaining image classifiers by counterfactual generation. Preprint at https://arxiv.org/abs/1807.08024 (2018).

  38. Zintgraf, L., Cohen, T., Adel, T. & Welling, M. Visualizing deep neural network decisions: prediction difference analysis. In 2018 International Conference on Learning Representations. Preprint at https://arxiv.org/abs/1702.04595 (2017).

  39. Carter, B., Mueller, J., Jain, S. & Gifford, D. What made you do this? Understanding black-box decisions with sufficient input subsets. In Proc. 22nd International Conference on Artificial Intelligence and Statistics 567–576 (AISTATS, 2019).

  40. Carter, B. et al. Critiquing protein family classification models using sufficient input subsets. J Comput. Biol. 27, 1219–1231 (2020).

  41. Covert, I., Lundberg, S. & Lee, S.-I. Explaining by removing: A unified framework for model explanation. Journal of Machine Learning Research 22, 1-90 (2021).

  42. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).

  43. Chung, J., Ahn, S. & Bengio, Y. Hierarchical multiscale recurrent neural networks. Preprint at https://arxiv.org/abs/1609.01704 (2016).

  44. Jang, E., Gu, S. & Poole, B. Categorical reparameterization with gumbel-softmax. Preprint at https://arxiv.org/abs/1611.01144 (2016).

  45. Ancona, M., Ceolini, E., Öztireli, C. & Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In Workshop at International Conference on Learning Representations. Preprint at https://arxiv.org/abs/1711.06104 (2018).

  46. Lecun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).

  47. Giammartino, D. D., Nishida, K. & Manley, J. Mechanisms and consequences of alternative polyadenylation. Mol. Cell 43, 853–866 (2011).

  48. Shi, Y. Alternative polyadenylation: new insights from global analyses. RNA 18, 2105–2117 (2012).

  49. Elkon, R., Ugalde, A. & Agami, R. Alternative cleavage and polyadenylation: extent, regulation and function. Nat. Rev. Genet. 14, 496–506 (2013).

  50. Tian, B. & Manley, J. Alternative polyadenylation of mRNA precursors. Nat. Rev. Mol. Cell Biol. 18, 18–30 (2017).

  51. Li, Z. et al. DeeReCT-APA: prediction of alternative polyadenylation site usage through deep learning. Genomics Proteomics Bioinformatics https://doi.org/10.1016/j.gpb.2020.05.004 (2021).

  52. Wylenzek, M., Geisen, C., Stapenhorst, L., Wielckens, K. & Klingler, K. A novel point mutation in the 3′ region of the prothrombin gene at position 20221 in a lebanese/syrian family. Thromb. Haemost. 85, 943–944 (2001).

  53. Danckwardt, S. et al. The prothrombin 3′ end formation signal reveals a unique architecture that is sensitive to thrombophilic gain-of-function mutations. Blood 104, 428–435 (2004).

  54. Takagaki, Y. & Manley, J. RNA recognition by the human polyadenylation factor CstF. Mol. Cell. Biol. 17, 3907–3914 (1997).

  55. Stacey, S. et al. A germline variant in the TP53 polyadenylation signal confers cancer susceptibility. Nat. Genet. 43, 1098–1103 (2011).

  56. Medina-Trillo, C. et al. Rare foxc1 variants in congenital glaucoma: identification of translation regulatory sequences. Eur. J. Hum. Genet. 24, 672–680 (2016).

  57. Altay, C. et al. A mild thalassemia major resulting from a compound heterozygosity for the IVS-11-1 (G → A) mutation and the rare T → C mutation at the polyadenylation site. Hemoglobin 15, 327–330 (1991).

  58. Garin, I. et al. Recessive mutations in the ins gene result in neonatal diabetes through reduced insulin biosynthesis. Proc. Natl Acad. Sci. USA 107, 3105–3110 (2010).

  59. Maguire, J., Boyken, S., Baker, D. & Kuhlman, B. Rapid sampling of hydrogen bond networks for computational protein design. J. Chem. Theory Comput. 14, 2751–2760 (2018).

  60. Chen, Z. et al. Programmable design of orthogonal protein heterodimers. Nature 565, 106–111 (2019).

  61. Ford, A., Weitzner, B. & Bahl, C. Integration of the Rosetta suite with the python software stack via reproducible packaging and core programming interfaces for distributed simulation. Protein Sci. 29, 43–51 (2020).

  62. Anishchenko, I. et al. De novo protein design by deep network hallucination. Nature 600, 547–552 (2021).

  63. Alford, R. et al. The rosetta all-atom energy function for macromolecular modeling and design. J. Chem. Theory Comput. 13, 3031–3048 (2017).

  64. Parrini, C. et al. Glycine residues appear to be evolutionarily conserved for their ability to inhibit aggregation. Structure 13, 1143–1151 (2005).

  65. Krieger, F., Möglich, A. & Kiefhaber, T. Effect of proline and glycine residues on dynamics and barriers of loop formation in polypeptide chains. J. Am. Chem. Soc. 127, 3346–3352 (2005).

  66. Linder, J. & Seelig, G. Fast activation maximization for molecular sequence design. BMC Bioinform. 22, 1–20 (2021).

  67. Chaudhury, S., Lyskov, S. & Gray, J. PyRosetta: a script-based interface for implementing molecular modeling algorithms using Rosetta. Bioinformatics 26, 689–691 (2010).

  68. Linder, J. et al. johli/scrambler: v1.0.0. Zenodo https://doi.org/10.5281/zenodo.5676173 (2021).

Acknowledgements

This work was supported by NIH award R21HG010945 and NSF award 2021552 to G.S., and by NSF awards 1908003 and 1703403 to S.K.

Author information

Contributions

J.L., A.L.F., S.K. and G.S. conceived and developed the project. J.L. and A.L.F. performed the computational analyses with input from A.L. J.L., A.L.F., S.K. and G.S. wrote the paper with input from A.L., Z.C. and D.B.

Corresponding author

Correspondence to Johannes Linder.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Ahmed Alaa and Jinsung Yoon for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Scrambler Neural Networks.

(a) Scrambler network architecture. The network is based on groups of residual blocks. This particular network configuration has 5 groups of residual blocks, with 32 channels, filter width 3 and varying dilation factor (1x, 2x, 4x, 2x, 1x). There is a skip connection (single convolutional layer with 32 channels and filter width 1) before each residual group. All skip connections are added together with the output of the final residual group. A softplus activation is applied to the final tensor in order to get importance scores that are strictly larger than 0. (b) Each residual group consists of 4 identical residual blocks connected in series. (c) Each residual block consists of 2 dilated convolutions, each preceded by batch normalization and ReLU activations. A skip connection adds the input tensor to the output of the final convolution.
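The numbers in this caption allow a quick back-of-the-envelope check of the network's depth and receptive field. The sketch below is a worked calculation, not the authors' code, and it rests on two assumptions that the caption does not state: every convolution has stride 1, and the dilation factor of a group applies to both convolutions in each of its residual blocks. The width-1 skip convolutions do not widen the receptive field, so they are left out.

```python
import math

DILATIONS = [1, 2, 4, 2, 1]  # one dilation factor per residual group (from the caption)
BLOCKS_PER_GROUP = 4         # residual blocks per group (from the caption)
CONVS_PER_BLOCK = 2          # dilated convolutions per residual block (from the caption)
FILTER_WIDTH = 3             # filter width (from the caption)

def receptive_field(dilations, blocks_per_group, convs_per_block, width):
    # Each stride-1 dilated convolution widens the field by (width - 1) * dilation
    field = 1
    for d in dilations:
        field += blocks_per_group * convs_per_block * (width - 1) * d
    return field

def softplus(x):
    # log(1 + exp(x)), arranged to stay stable for large |x|
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

total_blocks = len(DILATIONS) * BLOCKS_PER_GROUP
field = receptive_field(DILATIONS, BLOCKS_PER_GROUP, CONVS_PER_BLOCK, FILTER_WIDTH)
print(total_blocks)    # 20 residual blocks in total
print(field)           # 161 positions under the stated assumptions
print(softplus(-5.0))  # small but strictly positive
```

Under these assumptions each output score can depend on 161 input positions. The final softplus maps any real-valued activation to a positive importance score, as the caption describes.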

Extended Data Fig. 2 Additional MNIST Attributions and Comparisons.

(a) Comparison of attribution methods on the ‘Inclusion’-benchmark of Fig. 2b (Perturbing the input patterns by keeping the top X% most important features according to each method and replacing all other features with random samples from a background distribution, n=1,888). Left table: Median KL-divergence between original predictions and predictions made on perturbed input patterns (lower is better). Right table: Classification accuracy of the predictor using the perturbed input patterns (higher is better). The ‘100%’-case refers to the original (non-perturbed) input pattern. The best method(s) are highlighted in green. (b) Uniformly random mask dropout training procedure, which teaches the Scrambler to find alternative salient feature sets. For each input pattern, we sample a random dropout pattern containing squares of varying width. The pattern is multiplied with the predicted importance scores, effectively zeroing out certain regions (forcing the background distribution to be used). The dropout pattern is also passed as additional input to the Scrambler (it is concatenated along the channel dimension), allowing the network to learn which other feature set to choose. (c) Biased dropout training procedure. Instead of randomly sampling dropout patterns, we first let the Scrambler predict importance scores with an all-ones dropout pattern (no dropout), which we use to form an importance sampling distribution. We then sample a dropout pattern and use it for training the same way we trained on uniformly random patterns. (d) Another biased dropout training approach. We first use the Scrambler to predict the importance scores given the all-ones dropout pattern as input (no dropout). The top 5% most important features are subtracted from the all-ones pattern. Then, with a certain probability, we either re-run the Scrambler on this updated pattern (repeating the previous steps), or we end the loop and choose this as our final dropout pattern to train on. 
(e) Procedure for training the Scrambler to dynamically change the entropy of its solutions. Instead of fitting the network to a constant KL-divergence of its scrambled input distribution, we here randomly sample KL-divergence values and use them both as input to the network and as the target for the conservation penalty. The bit value is broadcast and concatenated along the input channel dimension. (f) Example attributions of MNIST digits ‘2’ and ‘4’ with dynamically resized feature sets, obtained by passing increasingly large target ‘lum’ values as input to the Scrambler (‘lum’ values are normalized KL-divergence bits, see Methods).
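The 'Inclusion' benchmark in panel (a) can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the benchmark code from the repository: `inclusion_perturb`, `kl_divergence`, the toy pixel values and the uniform background sampler are all hypothetical. In a real run, the predictor would be called on both the original and the perturbed pattern, and the KL-divergence would be computed between its two output distributions.

```python
import math
import random

def inclusion_perturb(features, scores, keep_fraction, sample_background, rng):
    # Keep the top-scoring fraction of features and resample all the others
    n_keep = max(1, round(keep_fraction * len(features)))
    ranked = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [features[i] if i in keep else sample_background(rng)
            for i in range(len(features))]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) in nats between two discrete prediction vectors
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

rng = random.Random(0)
pixels = [0.9, 0.1, 0.8, 0.0, 0.7, 0.2, 0.0, 0.3]  # toy input pattern
scores = [5.0, 0.1, 4.0, 0.0, 0.2, 0.3, 0.0, 0.1]  # toy importance scores
perturbed = inclusion_perturb(pixels, scores, 0.25, lambda r: r.random(), rng)
print(perturbed[0], perturbed[2])  # the two top-scoring features are kept

# Stand-ins for predictor outputs on the original and the perturbed pattern
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))
```

A lower divergence after perturbation means the retained features were enough to reproduce the original prediction.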

Extended Data Fig. 3 Additional APARENT Attributions and Comparisons.

(a) Top: Example attribution of a polyadenylation signal sequence from the APARENT test set, using Inclusion-Scramblers trained with increasingly large tbits of conservation. Known regulatory motifs are annotated. Bottom: Two additional example attributions, showing only the results for the tbits = 1.0 case. (b) A non-trainable convolutional filter with a 1D Gaussian kernel (filter width 6) is prepended to the final softplus activation function of the Scrambler. (c) APARENT isoform predictions of original sequences and of corresponding sampled sequences from the PSSMs predicted by an Occlusion-Scrambler trained with tbits = 1.8, with and without the Gaussian filter. (d) Example attributions of Occlusion-Scramblers trained with tbits = 1.8 and tbits = 1.5, with and without the Gaussian filter. (e) Example attributions of a polyadenylation signal sequence, comparing different methods. (f) Comparison of attribution methods on the ‘Inclusion’-benchmark of Fig. 3b (perturbing the input patterns by keeping the top X% most important features according to each method and replacing all other features with random samples from a background distribution, n=1,737). Median KL-divergences are computed between original predictions and predictions made on perturbed input patterns (lower is better). These predictions were made using the APARENT model. Also shown are the Spearman r correlation coefficients between original and perturbed predictions using the DeeReCT-APA model (higher is better). The default Scrambler network for the APA task uses 4 residual blocks, while the ‘Deep’ architecture uses a total of 20 residual blocks. The best method(s) are highlighted in green. (g) Attributions of four human polyadenylation signals which are associated with known deleterious variants, comparing the Perturbation method to the reconstructive Inclusion-Scrambler (tbits = 0.25 target bits) on hypothetical variants which have not been found in the population. Gene names and the clinical condition associated with each PAS are annotated above each sequence. In each of the four examples, the Scrambler correctly detects the loss of the presumed RBP binding site or otherwise important motif due to each respective variant (loss of the CstF binding motif in FOXC1, TP53; loss of the SRSF10 binding motif in INS; loss of the T-rich DSE motif in HBB). (h) Left: Example attributions of a medium-strength polyadenylation signal sequence, using three Scramblers which have been optimized for different objectives: (Reconstructive features) reconstructing the original prediction, (Negative features) minimizing the prediction, and (Positive features) maximizing the prediction. Right: APARENT isoform predictions of original sequences and of corresponding sampled sequences from the PSSMs predicted by the Negative-feature and Positive-feature Scrambler respectively.

Extended Data Fig. 4 Additional Optimus 5-Prime Attributions and Comparisons.

(a) Benchmark comparison on the synthetic Start / Stop test sets, where input patterns are perturbed by keeping the most important features according to each method (6, 9 or 12 nt) and replacing all other features with random samples from a background distribution (n=512). Mean squared errors are computed between original predictions and predictions made on perturbed input patterns using the Optimus 5-Prime model (lower is better). We trained two Scramblers, one with a low entropy penalty (tbits = 0.125, λ = 1) and one with a higher penalty (λ = 10). The best method(s) are highlighted in green. (b) Average recall for finding one of the start codons and one of the stop codons in the 6 most important nucleotides, as identified by each method, measured across the synthetic test sets. (c) Additional benchmark comparison for L2X and INVASE, when using the full 260,000 5’ UTR dataset for training the interpreter model. Shown are the mean squared errors between predictions of original and perturbed input patterns, average recall for finding start and stop codons, and example visualizations on the synthetic start / stop test sets. (d) Left: Attribution of a ClinVar variant, rs779013762, in the ANKRD26 5’ UTR, which is predicted by Optimus 5-Prime to be a functionally silent mutation. The variant creates an IF uORF overlapping an existing IF uORF. The per-example fine-tuning step (which starts from the Low entropy penalty-Scrambler scores) finds a minimal salient feature set in the variant sequence (one IF uORF), while the per-example optimization (which starts from randomly initialized scores) gets stuck in a local minimum. Middle: Attribution of a ClinVar variant, rs201336268, in the TARS2 5’ UTR, which destroys two overlapping IF uORFs and is predicted to lead to upregulation. Both the fine-tuning step and the independent per-example optimization find that no features are important in the variant sequence (both IF uORFs were removed by the variant and a fully random sequence has on average the same predicted MRL as the variant sequence). The Perturbation method has trouble explaining either of these variants due to saturation effects of the multiple IF stop codons. Right: Attribution of a rare variant, rs886054324, in the C19orf12 5’ UTR, which creates two IF uORFs overlapping a strong OOF uAUG (hence a silent mutation). All attribution methods identify the OOF uAUG as the major determinant; however, the Low entropy penalty-Scrambler incorrectly marks an (unmatched) stop codon in the wildtype sequence as important. Both the High entropy penalty-Scrambler and the fine-tuning step based on the Low penalty-Scrambler correctly filter out the stop codon. (e) Benchmarking results on the 1 Start / 2 Stop dataset, comparing the Low entropy penalty-Scrambler network to running per-example fine-tuning of those scores and to the baseline method of optimizing each example from randomly initialized scores. Reported are the mean squared error between predictions on original and scrambled sequences (‘MSE’), the error rate (1 - Accuracy) of not finding one Start codon and one Stop codon in the top 6 nt (‘Error Rate’), and the mean per-nucleotide KL-divergence between the scrambled PSSM and the background PSSM (‘Conservation’). (f) Example attributions using a Scrambler network trained with the mask dropout procedure (see Methods for details). By dropping different parts of the importance score mask, the Scrambler learns to discover alternative salient feature sets. In the example on the right: Finding alternative IF uORF regions by separately dropping each of the Start and Stop codons. (g) Example Scrambler attributions with the mask dropout mechanism on two native human 5’ UTRs.
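The start/stop recall of panel (b) and the 'Error Rate' of panel (e) both ask whether the 6 highest-scoring nucleotides cover one start codon and one stop codon. The sketch below is one possible reading of that criterion, not the authors' evaluation code: the helper names, the toy scores and the rule that a codon counts as found only when all three of its nucleotides are in the top set are assumptions.

```python
def top_k_positions(scores, k):
    # Indices of the k highest-scoring positions
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])

def codon_found(codon_starts, top_positions):
    # Assumed criterion: all three nucleotides of some codon are in the top set
    return any(all(start + offset in top_positions for offset in range(3))
               for start in codon_starts)

def start_stop_hit(scores, start_codons, stop_codons, k=6):
    top = top_k_positions(scores, k)
    return codon_found(start_codons, top) and codon_found(stop_codons, top)

# Toy 20-nt example: one start codon at position 2, stop codons at positions 8 and 14
start_codons, stop_codons = [2], [8, 14]
good_scores = [0.0] * 20
for pos in (2, 3, 4, 14, 15, 16):
    good_scores[pos] = 1.0        # highlights the start codon and one stop codon
poor_scores = [0.0] * 20
for pos in (2, 3, 4):
    poor_scores[pos] = 1.0        # highlights the start codon only

hits = [start_stop_hit(s, start_codons, stop_codons) for s in (good_scores, poor_scores)]
recall = sum(hits) / len(hits)
print(hits, recall, 1 - recall)
```

Averaging the hit indicator over a test set gives the recall, and one minus that average corresponds to the error rate.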

Extended Data Fig. 5 Additional Protein-Protein Interaction Attributions and Comparisons.

(a) Protein heterodimer binder RNN predictor, which was trained on computationally designed (dimerizing) pairs for positive data and randomly paired binders as negative data (see Methods for details). The RNN consists of a shared GRU layer, a dropout layer, and two fully-connected layers applied to the concatenated GRU output vectors. The final output (sigmoid activation) is treated as the Bind / No Bind classification probability. (b) Supplemental benchmark of Gradient Saliency, Integrated Gradients and DeepSHAP, using only the positive-valued importance scores. Left: Prediction KL-divergence of scrambled sequences compared to original test set sequences when either replacing all but the top X% most important amino acid residues with random samples (inclusion) or, conversely, when replacing the top X% residues with random samples and keeping the remaining sequence fixed (occlusion). Right: Mean ddG Difference for the top 8 most important residues according to each method, measured across the test set, and HBNet Average Precision based on each method’s importance scores. (c) Supplemental comparison of different versions of the Scrambling Neural Network (see Methods for a full description of each version). Left: KL-divergence benchmark based on the predictor RNN. Right: Mean ddG Differences and HBNet Discovery Precisions. (d) Supplemental comparison of other methods that optimize objectives similar to that of the Scrambler (see Methods for a full description of each method). Left: KL-divergence benchmark based on the predictor RNN. Right: Mean ddG Differences and HBNet Discovery Precisions. (e) Supplemental comparison between Scrambling Neural Networks and Sufficient Input Subsets (SIS) with ‘hot-deck’ sampled masking (the number of samples used at each iteration is varied from 1 to 32; see Methods for details). Left: KL-divergence benchmark based on the predictor RNN, with Mean ddG Differences and HBNet Discovery Precisions annotated on top of the bar chart. Right: Average number of predictor queries used to interpret a single input pattern (for the Scrambler, this is the amortized cost of training divided by the number of test patterns interpreted).

Extended Data Fig. 6 Example Heterodimer Binder Attributions.

Example attributions of a designed heterodimer binder pair, for a selection of benchmarked methods.

Extended Data Fig. 7 Additional trRosetta Attributions and Comparisons.

(a) Four different Inclusion-PSSMs optimized to reconstruct the structural trRosetta prediction of a Sensor Histidine Kinase. Each PSSM is optimized for increasingly larger tbits. The bottom sequence logo represents the Rosetta score function breakdown per residue (-REU). Spearman r ranged between 0.25 and 0.32 when comparing the absolute numbers of Rosetta energy values to the optimized importance scores. Shown is also the average structure prediction for 512 samples. (b) Inclusion-Scrambled PSSMs of the Hen Egg-white Lysozyme. The PSSM was re-optimized for three different target conservation bits. Spearman r ranged between 0.25 and 0.33 compared to the Rosetta score function. (c) Architecture for per-example scrambling of a single protein sequence according to the contact distributions predicted by trRosetta. Here, we do not use a Multiple Sequence Alignment (MSA), but instead pass the Gumbel-sampled sequence to the PSSM input and an all-zeros matrix to the DCA input. Total KL-divergence between trRosetta-predicted distributions (distance and angle-grams) of the original sequence and samples drawn from the scrambled PSSM is either minimized or maximized (inclusion or occlusion respectively). (d) Reference sequence and predicted contact distribution for a hairpin protein engineered by Activation Maximization. (e) Top: Inclusion-PSSM of the engineered hairpin protein, obtained after optimization with a highly conserved background distribution based on the MSA. Bottom: Inclusion-PSSM of the engineered hairpin protein with a less conserved background distribution (smoothed with pseudo counts).
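Panel (c) refers to Gumbel-sampled sequences drawn from a scrambled PSSM. The sketch below shows the discrete Gumbel-max trick, which draws one letter per position by adding Gumbel noise to the log-probabilities and taking the argmax. It is a swapped-in illustration rather than the differentiable Gumbel-softmax or straight-through estimator that would be needed for training, and the function names, the four-letter alphabet and the toy PSSM are hypothetical.

```python
import math
import random

def gumbel_noise(rng):
    # Standard Gumbel(0, 1) sample via the inverse transform of a uniform draw
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def sample_sequence(pssm, alphabet, rng):
    # Gumbel-max: argmax over (log p + Gumbel noise) is a draw from p
    letters = []
    for column in pssm:
        keys = [math.log(p) + gumbel_noise(rng) if p > 0 else float("-inf")
                for p in column]
        letters.append(alphabet[keys.index(max(keys))])
    return "".join(letters)

rng = random.Random(0)
alphabet = "ACGT"               # toy alphabet; a protein PSSM would use 20 letters
pssm = [
    [1.0, 0.0, 0.0, 0.0],       # fully conserved position, always 'A'
    [0.25, 0.25, 0.25, 0.25],   # fully scrambled position
    [0.0, 0.0, 0.0, 1.0],       # fully conserved position, always 'T'
]
samples = [sample_sequence(pssm, alphabet, rng) for _ in range(5)]
print(samples)
```

Conserved positions come out identical in every sample, while scrambled positions vary, which is why averaging predictions over many samples probes how much the predictor relies on the conserved positions alone.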

Supplementary information

Supplementary Information

Supplementary Information and Supplementary Table 1.

Reporting summary

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Linder, J., La Fleur, A., Chen, Z. et al. Interpreting neural networks for biological sequences by learning stochastic masks. Nat Mach Intell 4, 41–54 (2022). https://doi.org/10.1038/s42256-021-00428-6

  • DOI: https://doi.org/10.1038/s42256-021-00428-6
