Abstract
Genome engineering is undergoing unprecedented development and is now widely accessible. Genetic engineering attribution associates engineered DNA sequences with their labs of origin, helping forensic experts ensure responsible biotechnology innovation and reduce the misuse of engineered DNA sequences. Here we propose a method based on metric learning that ranks the most likely labs of origin while simultaneously generating embeddings for plasmid sequences and labs. These embeddings support various downstream tasks, such as clustering DNA sequences and labs, or serving as features in other machine learning models. Our approach employs a circular shift augmentation method and correctly ranks the lab of origin within its top-10 predictions 90% of the time. We also demonstrate few-shot learning, obtaining 76% top-10 accuracy using only 10% of the sequences. Finally, our approach can extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model's outputs.
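The circular shift augmentation mentioned above exploits the fact that plasmids are circular DNA molecules: any rotation of the linearized sequence string represents the same physical plasmid, so rotations are label-preserving training examples. A minimal sketch of the idea (the function name and details are illustrative, not the authors' implementation):

```python
import random

def circular_shift(sequence: str) -> str:
    """Rotate a plasmid sequence by a random offset.

    Plasmids are circular molecules, so every rotation of the
    linearized string encodes the same physical DNA; shifting is
    therefore a label-preserving data augmentation.
    """
    offset = random.randrange(len(sequence))
    return sequence[offset:] + sequence[:offset]

# Every rotation of a circular plasmid is equivalent to the original:
plasmid = "ATGCGTAC"
augmented = circular_shift(plasmid)
assert len(augmented) == len(plasmid)
assert augmented in plasmid + plasmid  # rotation membership check
```

Applied at training time, this exposes the model to many linearizations of the same plasmid, discouraging it from overfitting to an arbitrary sequencing start point.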
Data availability
The pAAV-Syn-SomArchon sequence (126941, https://www.addgene.org/126941/) and AAEL010097-Cas9 sequence (100707, https://www.addgene.org/100707/) are available on Addgene. The custom sequence developed by Alley et al. is available in their GitHub repository (https://github.com/altLabs/attrib/blob/master/sequences/custom_drive.fasta). All other data to reproduce experiments are available in our Code Ocean capsule64. Source data for Figs. 1, 2, 4c and 6 and Extended Data Figs. 3 and 4 are provided with this paper. The other figures need large raw data, so their source data are available in the capsule64.
Code availability
All code necessary to reproduce experiments and generate figures is available in our Code Ocean capsule64.
References
Alley, E. C. et al. A machine learning toolkit for genetic engineering attribution to facilitate biosecurity. Nat. Commun. 11, 6293 (2020).
Nielsen, A. A. K. & Voigt, C. A. Deep learning to predict the lab-of-origin of engineered DNA. Nat. Commun. 9, 3135 (2018).
Wang, Q., Kille, B., Liu, T. R., Elworth, R. A. L. & Treangen, T. J. PlasmidHawk improves lab of origin prediction of engineered plasmids using sequence alignment. Nat. Commun. 12, 1167 (2021).
Kamens, J. The Addgene repository: an international nonprofit plasmid and data resource. Nucleic Acids Res. https://doi.org/10.1093/nar/gku893 (2014).
Kulis, B. Metric learning: a survey. Found. Trends Mach. Learn. 5, 287–364 (2013).
Koch, G., Zemel, R. & Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop Vol. 2 (2015).
Hoffer, E. & Ailon, N. Deep metric learning using triplet network. In Similarity-Based Pattern Recognition. SIMBAD 2015. Lecture Notes in Computer Science Vol. 9370 (eds Feragen, A. et al.) 84–92 (Springer, 2015).
Fink, M. Object classification from a single example utilizing class relevance metrics. In Proc. 17th International Conference on Neural Information Processing Systems (eds Saul, L. et al.) 449–456 (MIT Press, 2005).
Fei-Fei, L., Fergus, R. & Perona, P. One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell. 28, 594–611 (2006).
Wang, Y., Yao, Q., Kwok, J. T. & Ni, L. M. Generalizing from a few examples. ACM Comput. Surveys 53, 1–34 (2020).
Kim, Y. Convolutional neural networks for sentence classification. In Proc. 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) 1746–1751 (Association for Computational Linguistics, 2014).
Caruana, R., Lawrence, S. & Giles, C. L. Overfitting in neural nets: backpropagation, conjugate gradient, and early stopping. In Proc. 13th International Conference on Neural Information Processing Systems (eds Leen, T. K. et al.) 381–387 (MIT Press, 2000).
Ying, X. An overview of overfitting and its solutions. J. Phys. Conf. Ser. 1168, 022022 (2019).
Gage, P. A new algorithm for data compression. C Users J. 12, 23–38 (1994).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. & Dean, J. Distributed representations of words and phrases and their compositionality. In Proc. 26th International Conference on Neural Information Processing Systems (eds Burges, C. J. C. et al.) 3111–3119 (Curran Associates, 2013).
Lipton, Z. C. A critical review of recurrent neural networks for sequence learning. Preprint at https://arxiv.org/abs/1506.00019 (2015).
Sherstinsky, A. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Preprint at https://arxiv.org/abs/1808.03314 (2018).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Wang, Q., Elworth, R., Liu, T. R. & Treangen, T. J. Faster pan-genome construction for efficient differentiation of naturally occurring and engineered plasmids with plaster. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019) 19:1–19:12 (Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019).
Lv, Z., Ding, H., Wang, L. & Zou, Q. A convolutional neural network using dinucleotide one-hot encoder for identifying DNA n6-methyladenine sites in the rice genome. Neurocomputing 422, 214–221 (2021).
Choong, A. C. H. & Lee, N. K. Evaluation of convolutionary neural networks modeling of DNA sequences using ordinal versus one-hot encoding method. In 2017 International Conference on Computer and Drone Applications (IConDA) 60–65 (IEEE, 2017).
Karim, M. R. et al. Deep learning-based clustering approaches for bioinformatics. Brief Bioinformatics https://doi.org/10.1093/bib/bbz170 (2020).
Xu, D. & Tian, Y. A comprehensive survey of clustering algorithms. Ann. Data Sci. https://doi.org/10.1007/s40745-015-0040-1 (2015).
Omran, M., Engelbrecht, A. & Salman, A. An overview of clustering methods. Intell. Data Anal. 11, 583–605 (2007).
Chakraborty, S. et al. Interpretability of deep learning models: a survey of results. In 2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI) 1–6 (IEEE, 2017).
Doshi-Velez, F. & Kim, B. Towards a rigorous science of interpretable machine learning. Preprint at https://arxiv.org/abs/1702.08608 (2017).
Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. Preprint at https://arxiv.org/abs/1703.01365 (2017).
Hsieh, C.-K. et al. Collaborative metric learning. ACM Digital Library https://doi.org/10.1145/3038912.3052639 (2017).
Yu, J., Gao, M., Rong, W., Song, Y. & Xiong, Q. A social recommender based on factorization and distance metric learning. IEEE Access https://doi.org/10.1109/access.2017.2762459 (2017).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Vol. 1 4171–4186 (Association for Computational Linguistics, 2019).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. Distributed representations of words and phrases and their compositionality. Preprint at https://arxiv.org/abs/1310.4546 (2013).
Pennington, J., Socher, R. & Manning, C. GloVe: global vectors for word representation. In Proc. 2014 Conference on Empirical Methods in Natural Language Processing 1532–1543 (Association for Computational Linguistics, 2014).
Peters, M. E. et al. Deep contextualized word representations. In Proc. 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Vol. 1, 2227–2237 (Association for Computational Linguistics, 2018).
Garg, S. Computational methods for chromosome-scale haplotype reconstruction. Genome Biol. https://doi.org/10.1186/s13059-021-02328-9 (2021).
Vaswani, A. et al. Attention is all you need. In Proc. 31st International Conference on Neural Information Processing Systems 6000–6010 (Curran Associates, 2017).
Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M. & Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 20, 61–80 (2008).
Lin, T.-Y. et al. Microsoft COCO: common objects in context. Preprint at http://arxiv.org/abs/1405.0312 (2014).
Deng, J. et al. Imagenet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition 248–255 (IEEE, 2009).
Andriluka, M., Pishchulin, L., Gehler, P. & Schiele, B. 2D human pose estimation: new benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition 3686–3693 (2014).
Krizhevsky, A., Nair, V. & Hinton, G. The CIFAR-10 Dataset (Canadian Institute for Advanced Research, 2010); http://www.cs.toronto.edu/~kriz/cifar.html
Rajpurkar, P., Zhang, J., Lopyrev, K. & Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In Proc. 2016 Conference on Empirical Methods in Natural Language Processing 2383–2392 (Association for Computational Linguistics, 2016).
Wang, A. et al. GLUE: a multi-task benchmark and analysis platform for natural language understanding. Preprint at http://arxiv.org/abs/1804.07461 (2018).
Williams, A., Nangia, N. & Bowman, S. A broad-coverage challenge corpus for sentence understanding through inference. In Proc. 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Vol. 1, 1112–1122 (Association for Computational Linguistics, 2018).
Zellers, R., Bisk, Y., Schwartz, R. & Choi, Y. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing 93–104 (Association for Computational Linguistics, 2018).
Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction 2nd edn (Springer, 2009).
Kaufman, S., Rosset, S. & Perlich, C. Leakage in data mining: formulation, detection, and avoidance. In Proc. 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 556–563 (Association for Computing Machinery, 2011).
Berger, B., Waterman, M. & Yu, Y. Levenshtein distance, sequence comparison and biological database search. IEEE Trans. Inform. Theory 67, 3287–3294 (2021).
Chicco, D. Ten quick tips for machine learning in computational biology. BioData Mining https://doi.org/10.1186/s13040-017-0155-3 (2017).
Mikolajczyk, A. & Grochowski, M. Data augmentation for improving deep learning in image classification problem. In 2018 International Interdisciplinary PhD Workshop (IEEE, 2018).
Moshkov, N., Mathe, B., Kertesz-Farkas, A., Hollandi, R. & Horvath, P. Test-time augmentation for deep learning-based cell segmentation on microscopy images. Sci. Rep. https://doi.org/10.1038/s41598-020-61808-3 (2020).
Zou, Q., Xing, P., Wei, L. & Liu, B. Gene2vec: gene subsequence embedding for prediction of mammalian n6-methyladenosine sites from mRNA. RNA 25, 205–218 (2018).
Klambauer, G., Unterthiner, T., Mayr, A. & Hochreiter, S. Self-normalizing neural networks. Preprint at https://arxiv.org/abs/1706.02515 (2017).
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Proc. 33rd International Conference on Neural Information Processing Systems (eds Wallach, H. et al.) 8024–8035 (Curran Associates, 2019).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2014).
Smith, L. N. & Topin, N. Super-convergence: very fast training of residual networks using large learning rates. Preprint at https://arxiv.org/abs/1708.07120 (2017).
Hermans, A., Beyer, L. & Leibe, B. Defense of the triplet loss for Person re-identification. Preprint at https://arxiv.org/abs/1703.07737 (2017).
Musgrave, K. et al. PyTorch metric learning. Preprint at https://arxiv.org/abs/2008.09164 (2020).
Wang, X., Zhang, H., Huang, W. & Scott, M. R. Cross-batch memory for embedding learning. Preprint at https://arxiv.org/abs/1912.06798 (2019).
van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Arthur, D. & Vassilvitskii, S. K-means++: the advantages of careful seeding. In Proc. Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms 1027–1035 (SIAM, 2007).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).
Soares, I., Camargo, F., Marques, A. & Crook, O. Improving lab-of-origin prediction of genetically engineered plasmids via deep metric learning. Code Ocean https://doi.org/10.24433/CO.9853572.V1 (2022).
Hinton, G. & Roweis, S. Stochastic neighbor embedding. In Proc. 15th International Conference on Neural Information Processing Systems 857–864 (MIT Press, 2002).
Acknowledgements
I.M.S., F.H.F.C. and A.M. acknowledge Amalgam and XNV for providing the necessary infrastructure and financial support to develop the study and design the experiments. O.M.C. acknowledges funding from a Todd-Bird Junior Research Fellowship from New College, Oxford, as well as Open Philanthropy. I.M.S. acknowledges K. A. Assis for helping with graphical design.
Author information
Authors and Affiliations
Contributions
I.M.S. and F.H.F.C. conceived the study and designed the experiments. F.H.F.C. developed the triplet model training algorithm. A.M. developed the circular shift augmentation. O.M.C. split the data. I.M.S. developed the interpretation method. F.H.F.C. did the few-shot experiments. All authors wrote the manuscript, and reviewed and approved the final manuscript. I.M.S. managed the project.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Computational Science thanks Shilpa Garg and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Kaitlin McCardle, in collaboration with the Nature Computational Science team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Triplet method illustration.
The triplet is composed of an anchor (a DNA sequence), a positive example (its lab of origin) and a negative example (another lab). a At the start of training, the anchor may lie closer to the negative than to the positive. During training, we pull the anchor and positive towards each other while pushing the negative away. b Eventually, labs and their DNA sequences lie near each other, forming groups. Each group is shown in a different color, indicating that it occupies a separate region of the embedding space. We can also expect similar labs, and similar DNA sequences, to lie closer to one another.
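The pull/push dynamic described in this caption is the standard triplet margin loss: the anchor–positive distance is minimized while the anchor–negative distance is pushed beyond a margin. A minimal sketch (the embeddings, distance and margin here are illustrative; the paper's training uses the PyTorch Metric Learning library):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss: zero once the anchor (sequence embedding)
    is closer to the positive (its lab) than to the negative (another
    lab) by at least `margin`; positive otherwise."""
    return max(0.0, euclidean(anchor, positive)
                    - euclidean(anchor, negative) + margin)

# A well-separated triplet incurs no loss:
loss = triplet_loss([0.0, 0.0], [0.1, 0.0], [5.0, 0.0])  # -> 0.0
```

Minimizing this loss over many (sequence, lab, other-lab) triplets produces the grouped embedding space the caption describes.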
Extended Data Fig. 3 Plasmid feature importance for unknown sequences.
a NTI for an unknown sequence, which may help to investigate specific patterns. b NTI for Bernard Moss's lab, which our model predicted as the sequence's lab of origin. c Comparison between tokens highlighted for the sequence and important tokens from the predicted lab. The red line represents the sequence. The first bar under the graph represents the laboratory; the second represents the sequence. d Plotting the difference between the sequence's NTI and the predicted laboratory's NTI shows that few tokens stand out in this sequence beyond those typical of the laboratory; the bar is almost entirely black.
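The comparison in panel d reduces to an element-wise difference between two normalized token importance (NTI) profiles. A hedged sketch of that comparison (function names, the threshold and the toy profiles are illustrative; the paper's NTI values come from attribution methods such as integrated gradients):

```python
def nti_difference(seq_nti, lab_nti):
    """Element-wise difference between a sequence's NTI profile and
    the predicted lab's NTI profile, as visualized in panel d.
    Large positive values flag tokens that matter for this sequence
    beyond the lab's usual signature."""
    return [s - l for s, l in zip(seq_nti, lab_nti)]

def standout_tokens(seq_nti, lab_nti, threshold=0.5):
    """Indices of tokens that stand out for the sequence relative to
    the lab (an illustrative cutoff, not the paper's)."""
    diffs = nti_difference(seq_nti, lab_nti)
    return [i for i, d in enumerate(diffs) if d > threshold]

# Toy profiles: only token 0 is unusually important for the sequence.
print(standout_tokens([0.9, 0.1, 0.8], [0.1, 0.1, 0.7]))  # -> [0]
```

When this difference is near zero everywhere, as in panel d, the sequence carries little signal beyond the lab's typical signature.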
Extended Data Fig. 4 NTI for the custom sequence designed by Alley et al.1 and its two closest labs.
a NTI for the custom sequence. Tokens in the 600–700 range appear to be the most influential for this sequence. b Top 30 most important tokens. The BPE generates tokens by grouping characters, and these tokens require further investigation to identify the biological features within them. c Comparison between tokens highlighted for the sequence and important tokens from the top-ranked predicted lab, that of Omar Akbari. They share a similar level of importance over the first 700 tokens. The first bar under the graph represents the Akbari laboratory; the second represents the custom sequence. d Comparison between tokens highlighted for the sequence and important tokens from the second-ranked predicted lab, that of Edward Boyden. They are similar only in the first tokens of the BPE dictionary, and neither the Boyden lab nor the Akbari lab shows the same importance for the last tokens seen in the sequence's NTI. The first bar under the graph represents the Boyden laboratory; the second represents the custom sequence.
Supplementary information
Supplementary Information.
Supplementary Tables 1–5 and Algorithm 1.
Source data
Source Data Fig. 1.
Zip file containing statistical source data for four panels (a–d).
Source Data Fig. 2.
CSV file with statistical source data.
Source Data Fig. 4.
CSV file with statistical source data for Fig. 4c.
Source Data Fig. 6.
Zip file containing statistical source data for eight panels (a-1, a-2, b-1, b-2, c-1, c-2, d-1, d-2).
Source Data Extended Data Fig. 3.
Zip file containing statistical source data for four panels (a–d).
Source Data Extended Data Fig. 4.
Zip file containing statistical source data for four panels (a–d).
Rights and permissions
About this article
Cite this article
Soares, I.M., Camargo, F.H.F., Marques, A. et al. Improving lab-of-origin prediction of genetically engineered plasmids via deep metric learning. Nat Comput Sci 2, 253–264 (2022). https://doi.org/10.1038/s43588-022-00234-z
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s43588-022-00234-z
This article is cited by
- Analysis of the first genetic engineering attribution challenge. Nature Communications (2022)