
Improving lab-of-origin prediction of genetically engineered plasmids via deep metric learning

A preprint version of the article is available at Research Square.

Abstract

Genome engineering is undergoing unprecedented development and is now becoming widely available. Genetic engineering attribution can make sequence–lab associations and assist forensic experts in ensuring responsible biotechnology innovation and reducing misuse of engineered DNA sequences. Here we propose a method based on metric learning to rank the most likely labs of origin while simultaneously generating embeddings for plasmid sequences and labs. These embeddings can be used to perform various downstream tasks, such as clustering DNA sequences and labs, as well as using them as features in machine learning models. Our approach employs a circular shift augmentation method and can correctly rank the lab of origin 90% of the time within its top-10 predictions. We also demonstrate that we can perform few-shot learning and obtain 76% top-10 accuracy using only 10% of the sequences. Finally, our approach can also extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model’s outputs.
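The circular shift augmentation mentioned above exploits the circular topology of plasmids: rotating the point at which the sequence string starts yields the same physical molecule. A minimal sketch of the idea (the function name and random-offset choice are illustrative, not the authors' implementation):

```python
import random

def circular_shift(seq, offset=None):
    """Rotate a plasmid sequence by `offset` positions.

    Plasmids are circular molecules, so every rotation of the
    sequence string represents the same physical plasmid.
    """
    if offset is None:
        offset = random.randrange(len(seq))
    return seq[offset:] + seq[:offset]

# Shifting by 3 moves the first three bases to the end.
print(circular_shift("ATGCGT", 3))  # -> "CGTATG"
```

At training time, sampling a fresh offset per epoch exposes the model to many equivalent "linearizations" of each plasmid, which acts as a label-preserving augmentation.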


Fig. 1: Embeddings analysis.
Fig. 2: FSL results.
Fig. 3: FSL top-10 accuracy for unknown labs (two to ten plasmids).
Fig. 4: Two-dimensional visualization of models and effect of point mutations.
Fig. 5: Two-dimensional visualization of t-SNE triplet network predictions.
Fig. 6: Interpreting plasmid feature importance using integrated gradients.

Data availability

The pAAV-Syn-SomArchon sequence (126941, https://www.addgene.org/126941/) and AAEL010097-Cas9 sequence (100707, https://www.addgene.org/100707/) are available on Addgene. The custom sequence developed by Alley et al. is available in their GitHub repository (https://github.com/altLabs/attrib/blob/master/sequences/custom_drive.fasta). All other data to reproduce experiments are available in our Code Ocean capsule64. Source data for Figs. 1, 2, 4c and 6 and Extended Data Figs. 3 and 4 are provided with this paper. The remaining figures require large raw data files, so their source data are available in the capsule64.

Code availability

All code necessary to reproduce experiments and generate figures is available in our Code Ocean capsule64.

References

  1. Alley, E. C. et al. A machine learning toolkit for genetic engineering attribution to facilitate biosecurity. Nat. Commun. 11, 6293 (2020).

  2. Nielsen, A. A. K. & Voigt, C. A. Deep learning to predict the lab-of-origin of engineered DNA. Nat. Commun. 9, 3135 (2018).

  3. Wang, Q., Kille, B., Liu, T. R., Elworth, R. A. L. & Treangen, T. J. PlasmidHawk improves lab of origin prediction of engineered plasmids using sequence alignment. Nat. Commun. 12, 1167 (2021).

  4. Kamens, J. The Addgene repository: an international nonprofit plasmid and data resource. Nucleic Acids Res. https://doi.org/10.1093/nar/gku893 (2014).

  5. Kulis, B. Metric learning: a survey. Found. Trends Mach. Learn. 5, 287–364 (2013).

  6. Koch, G., Zemel, R. & Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop Vol. 2 (2015).

  7. Hoffer, E. & Ailon, N. Deep metric learning using triplet network. In Similarity-Based Pattern Recognition. SIMBAD 2015. Lecture Notes in Computer Science Vol. 9370 (eds Feragen, A. et al.) 84–92 (Springer, 2015).

  8. Fink, M. Object classification from a single example utilizing class relevance metrics. In Proc. 17th International Conference on Neural Information Processing Systems (eds Saul, L. et al.) 449–456 (MIT Press, 2005).

  9. Fei-Fei, L., Fergus, R. & Perona, P. One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell. 28, 594–611 (2006).

  10. Wang, Y., Yao, Q., Kwok, J. T. & Ni, L. M. Generalizing from a few examples. ACM Comput. Surveys 53, 1–34 (2020).

  11. Kim, Y. Convolutional neural networks for sentence classification. In Proc. 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) 1746–1751 (Association for Computational Linguistics, 2014).

  12. Caruana, R., Lawrence, S. & Giles, C. L. Overfitting in neural nets: backpropagation, conjugate gradient, and early stopping. In Proc. 13th International Conference on Neural Information Processing Systems (eds Leen, T. K. et al.) 381–387 (MIT Press, 2000).

  13. Ying, X. An overview of overfitting and its solutions. J. Phys. Conf. Ser. 1168, 022022 (2019).

  14. Gage, P. A new algorithm for data compression. C Users J. 12, 23–38 (1994).

  15. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. & Dean, J. Distributed representations of words and phrases and their compositionality. In Proc. 26th International Conference on Neural Information Processing Systems (eds Burges, C. J. C. et al.) 3111–3119 (Curran Associates, 2013).

  16. Lipton, Z. C. A critical review of recurrent neural networks for sequence learning. Preprint at https://arxiv.org/abs/1506.00019 (2015).

  17. Sherstinsky, A. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Preprint at https://arxiv.org/abs/1808.03314 (2018).

  18. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).

  19. Wang, Q., Elworth, R., Liu, T. R. & Treangen, T. J. Faster pan-genome construction for efficient differentiation of naturally occurring and engineered plasmids with Plaster. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019) 19:1–19:12 (Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019).

  20. Lv, Z., Ding, H., Wang, L. & Zou, Q. A convolutional neural network using dinucleotide one-hot encoder for identifying DNA n6-methyladenine sites in the rice genome. Neurocomputing 422, 214–221 (2021).

  21. Choong, A. C. H. & Lee, N. K. Evaluation of convolutionary neural networks modeling of DNA sequences using ordinal versus one-hot encoding method. In 2017 International Conference on Computer and Drone Applications (IConDA) 60–65 (IEEE, 2017).

  22. Karim, M. R. et al. Deep learning-based clustering approaches for bioinformatics. Brief Bioinformatics https://doi.org/10.1093/bib/bbz170 (2020).

  23. Xu, D. & Tian, Y. A comprehensive survey of clustering algorithms. Ann. Data Sci. https://doi.org/10.1007/s40745-015-0040-1 (2015).

  24. Omran, M., Engelbrecht, A. & Salman, A. An overview of clustering methods. Intell. Data Anal. 11, 583–605 (2007).

  25. Chakraborty, S. et al. Interpretability of deep learning models: a survey of results. In 2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI) 1–6 (IEEE, 2017).

  26. Doshi-Velez, F. & Kim, B. Towards a rigorous science of interpretable machine learning. Preprint at https://arxiv.org/abs/1702.08608 (2017).

  27. Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. Preprint at https://arxiv.org/abs/1703.01365 (2017).

  28. Hsieh, C.-K. et al. Collaborative metric learning. ACM Digital Library https://doi.org/10.1145/3038912.3052639 (2017).

  29. Yu, J., Gao, M., Rong, W., Song, Y. & Xiong, Q. A social recommender based on factorization and distance metric learning. IEEE Access https://doi.org/10.1109/access.2017.2762459 (2017).

  30. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Vol. 1 4171–4186 (Association for Computational Linguistics, 2019).

  31. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. Distributed representations of words and phrases and their compositionality. Preprint at https://arxiv.org/abs/1310.4546 (2013).

  32. Pennington, J., Socher, R. & Manning, C. GloVe: global vectors for word representation. In Proc. 2014 Conference on Empirical Methods in Natural Language Processing 1532–1543 (Association for Computational Linguistics, 2014).

  33. Peters, M. E. et al. Deep contextualized word representations. In Proc. 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Vol. 1, 2227–2237 (Association for Computational Linguistics, 2018).

  34. Garg, S. Computational methods for chromosome-scale haplotype reconstruction. Genome Biol. https://doi.org/10.1186/s13059-021-02328-9 (2021).

  35. Vaswani, A. et al. Attention is all you need. In Proc. 31st International Conference on Neural Information Processing Systems 6000–6010 (Curran Associates, 2017).

  36. Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M. & Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 20, 61–80 (2009).

  37. Lin, T.-Y. et al. Microsoft COCO: common objects in context. Preprint at http://arxiv.org/abs/1405.0312 (2014).

  38. Deng, J. et al. ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition 248–255 (IEEE, 2009).

  39. Andriluka, M., Pishchulin, L., Gehler, P. & Schiele, B. 2D human pose estimation: new benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition 3686–3693 (2014).

  40. Krizhevsky, A., Nair, V. & Hinton, G. The CIFAR-10 Dataset (Canadian Institute for Advanced Research, 2010); http://www.cs.toronto.edu/~kriz/cifar.html

  41. Rajpurkar, P., Zhang, J., Lopyrev, K. & Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In Proc. 2016 Conference on Empirical Methods in Natural Language Processing 2383–2392 (Association for Computational Linguistics, 2016).

  42. Wang, A. et al. GLUE: a multi-task benchmark and analysis platform for natural language understanding. Preprint at http://arxiv.org/abs/1804.07461 (2018).

  43. Williams, A., Nangia, N. & Bowman, S. A broad-coverage challenge corpus for sentence understanding through inference. In Proc. 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Vol. 1, 1112–1122 (Association for Computational Linguistics, 2018).

  44. Zellers, R., Bisk, Y., Schwartz, R. & Choi, Y. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing 93–104 (Association for Computational Linguistics, 2018).

  45. Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction 2nd edn (Springer, 2009).

  46. Kaufman, S., Rosset, S. & Perlich, C. Leakage in data mining: formulation, detection, and avoidance. In Proc. 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 556–563 (Association for Computing Machinery, 2011).

  47. Berger, B., Waterman, M. & Yu, Y. Levenshtein distance, sequence comparison and biological database search. IEEE Trans. Inform. Theory 67, 3287–3294 (2021).

  48. Chicco, D. Ten quick tips for machine learning in computational biology. BioData Mining https://doi.org/10.1186/s13040-017-0155-3 (2017).

  49. Mikolajczyk, A. & Grochowski, M. Data augmentation for improving deep learning in image classification problem. In 2018 International Interdisciplinary PhD Workshop (IEEE, 2018).

  50. Moshkov, N., Mathe, B., Kertesz-Farkas, A., Hollandi, R. & Horvath, P. Test-time augmentation for deep learning-based cell segmentation on microscopy images. Sci. Rep. https://doi.org/10.1038/s41598-020-61808-3 (2020).

  51. Zou, Q., Xing, P., Wei, L. & Liu, B. Gene2vec: gene subsequence embedding for prediction of mammalian n6-methyladenosine sites from mRNA. RNA 25, 205–218 (2018).

  52. Klambauer, G., Unterthiner, T., Mayr, A. & Hochreiter, S. Self-normalizing neural networks. Preprint at https://arxiv.org/abs/1706.02515 (2017).

  53. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).

  54. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Proc. 33rd International Conference on Neural Information Processing Systems (eds Wallach, H. et al.) 8024–8035 (Curran Associates, 2019).

  55. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2014).

  56. Smith, L. N. & Topin, N. Super-convergence: very fast training of residual networks using large learning rates. Preprint at https://arxiv.org/abs/1708.07120 (2017).

  57. Hermans, A., Beyer, L. & Leibe, B. In defense of the triplet loss for person re-identification. Preprint at https://arxiv.org/abs/1703.07737 (2017).

  58. Musgrave, K. et al. PyTorch metric learning. Preprint at https://arxiv.org/abs/2008.09164 (2020).

  59. Wang, X., Zhang, H., Huang, W. & Scott, M. R. Cross-batch memory for embedding learning. Preprint at https://arxiv.org/abs/1912.06798 (2019).

  60. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).

  61. Arthur, D. & Vassilvitskii, S. K-means++: the advantages of careful seeding. In Proc. Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms 1027–1035 (SIAM, 2007).

  62. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  63. Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).

  64. Soares, I., Camargo, F., Marques, A. & Crook, O. Improving lab-of-origin prediction of genetically engineered plasmids via deep metric learning. Code Ocean https://doi.org/10.24433/CO.9853572.V1 (2022).

  65. Hinton, G. & Roweis, S. Stochastic neighbor embedding. In Proc. 15th International Conference on Neural Information Processing Systems 857–864 (MIT Press, 2002).


Acknowledgements

I.M.S., F.H.F.C. and A.M. acknowledge Amalgam and XNV for providing the necessary infrastructure and financial support to develop the study and design the experiments. O.M.C. acknowledges funding from a Todd-Bird Junior Research Fellowship from New College, Oxford, as well as Open Philanthropy. I.M.S. acknowledges K. A. Assis for helping with graphical design.

Author information

Authors and Affiliations

Authors

Contributions

I.M.S. and F.H.F.C. conceived the study and designed the experiments. F.H.F.C. developed the triplet model training algorithm. A.M. developed the circular shift augmentation. O.M.C. split the data. I.M.S. developed the interpretation method. F.H.F.C. did the few-shot experiments. All authors wrote the manuscript, and reviewed and approved the final manuscript. I.M.S. managed the project.

Corresponding authors

Correspondence to Igor M. Soares, Fernando H. F. Camargo, Adriano Marques or Oliver M. Crook.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Shilpa Garg and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Kaitlin McCardle, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Triplet method illustration.

The triplet is composed of an anchor (DNA), a positive (the lab of origin), and a negative example (another lab). a In the beginning, the anchor might be closer to the negative than it is to the positive. During training, we pull the anchor and positive towards each other while pushing the negative away. b In the end, labs and their DNA sequences will be nearer to each other, forming groups. Each group is represented by a different color, indicating that groups occupy separate regions of the embedding space. We can also expect both labs and DNA sequences to be closer to other similar ones.
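The pull/push dynamic described in this caption is what the standard triplet margin loss optimizes. A minimal, framework-free sketch (margin value and helper names are illustrative; the authors' training code uses PyTorch and more elaborate mining):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(d(a, p) - d(a, n) + margin, 0).

    The loss is zero once the negative sits at least `margin`
    farther from the anchor than the positive does, so training
    pulls anchor/positive together and pushes the negative away.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)

# A well-separated triplet incurs zero loss:
a, p, n = [0.0, 0.0], [0.1, 0.0], [3.0, 0.0]
print(triplet_loss(a, p, n))  # -> 0.0
```

Because the loss depends only on distances, the same objective can embed both plasmid sequences and labs into one shared metric space, which is what enables the ranking and clustering uses described in the abstract.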

Extended Data Fig. 2 Applying convolutional neural networks with metric learning to plasmids leads to state-of-the-art performance.

Panels a to f illustrate our method for both models, which differ in training output and methodology (c). In g, we compare the top-10 prediction accuracy of both approaches on the test set with those of Nielsen and Voigt2, Alley et al.1, a BLAST baseline18 and PlasmidHawk3.

Extended Data Fig. 3 Plasmid feature importance for unknown sequences.

a NTI for an unknown sequence. This may help to investigate specific patterns. b The NTI for Bernard Moss's lab, which our model predicted as the sequence's lab of origin. c Comparison between tokens highlighted for the sequence and important tokens from the predicted lab. The red line represents the sequence. The first bar under the graph represents the laboratory, while the second represents the sequence. d Plotting the difference between the NTI of the sequence and the NTI of the predicted laboratory shows that few tokens stand out in this sequence beyond those usually presented by the laboratory, and the bar is almost entirely black.
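The NTI scores in these panels are derived from integrated gradients27. A generic sketch of that attribution method on a toy differentiable function (the toy function, gradient, and step count are illustrative; this is not the authors' NTI pipeline):

```python
def integrated_gradients(grad_fn, x, baseline, steps=100):
    """Approximate integrated gradients with a midpoint Riemann sum.

    Attribution_i = (x_i - baseline_i) * integral over alpha in [0, 1]
    of dF/dx_i evaluated at baseline + alpha * (x - baseline).
    """
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_fn(point)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    return [(xi - b) * ag for xi, b, ag in zip(x, baseline, avg_grad)]

# Toy model F(x) = x0^2 + 2*x1, with analytic gradient (2*x0, 2).
grad = lambda p: [2 * p[0], 2.0]
attr = integrated_gradients(grad, x=[3.0, 1.0], baseline=[0.0, 0.0])
# Completeness axiom: attributions sum to F(x) - F(baseline) = 11.
print([round(a, 4) for a in attr])  # -> [9.0, 2.0]
```

Applied to a plasmid model, the per-token attributions obtained this way can be aggregated into an importance profile along the sequence, which is what the NTI plots visualize.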

Source data

Extended Data Fig. 4 NTI for the custom sequence designed by Alley et al.1 and its two closest labs.

a NTI for the custom sequence. Tokens in the 600–700 range appear to be the most substantial for this stretch. b Top 30 most important tokens. The BPE generates the tokens by grouping characters, and those tokens require further investigation to identify the biological features within them. c Comparison between tokens highlighted for the sequence and important tokens from the predicted Omar Akbari lab. They share a similar level of importance for the first 700 tokens. The first bar under the graph represents the Akbari laboratory, while the second represents the custom sequence. d Comparison between tokens highlighted for the sequence and important tokens from the second-ranked predicted Edward Boyden lab. They are similar only in the first tokens of the BPE dictionary, and neither the Boyden lab nor the Akbari lab shows the same importance for the last tokens as the NTI of the sequence. The first bar under the graph represents the Boyden laboratory, while the second represents the custom sequence.

Source data

Supplementary information

Supplementary Information.

Supplementary Tables 1–5 and Algorithm 1.

Peer Review File

Source data

Source Data Fig. 1.

Zip file containing four statistical source data files (a–d).

Source Data Fig. 2.

CSV file with statistical source data.

Source Data Fig. 4.

CSV file with statistical source data for Fig. 4c.

Source Data Fig. 6.

Zip file containing eight statistical source data files (a-1, a-2, b-1, b-2, c-1, c-2, d-1, d-2).

Source Data Extended Data Fig. 3.

Zip file containing four statistical source data files (a–d).

Source Data Extended Data Fig. 4.

Zip file containing four statistical source data files (a–d).


About this article


Cite this article

Soares, I.M., Camargo, F.H.F., Marques, A. et al. Improving lab-of-origin prediction of genetically engineered plasmids via deep metric learning. Nat Comput Sci 2, 253–264 (2022). https://doi.org/10.1038/s43588-022-00234-z

