PlasmidHawk improves lab of origin prediction of engineered plasmids using sequence alignment

With advances in synthetic biology and genome engineering comes a heightened awareness of potential misuse related to biosafety concerns. A recent study employed machine learning to identify the lab-of-origin of DNA sequences to help mitigate some of these concerns. Despite their promising results, this deep learning based approach had limited accuracy, was computationally expensive to train, and wasn’t able to provide the precise features that were used in its predictions. To address these shortcomings, we developed PlasmidHawk for lab-of-origin prediction. Compared to a machine learning approach, PlasmidHawk has higher prediction accuracy; PlasmidHawk can successfully predict unknown sequences’ depositing labs 76% of the time and 85% of the time the correct lab is in the top 10 candidates. In addition, PlasmidHawk can precisely single out the signature sub-sequences that are responsible for the lab-of-origin detection. In summary, PlasmidHawk represents an explainable and accurate tool for lab-of-origin prediction of synthetic plasmid sequences. PlasmidHawk is available at https://gitlab.com/treangenlab/plasmidhawk.git.

sequences. These data can be further explored to measure the research diversity within and between labs. Here, we introduce 596 two kinds of diversity measurements: between-lab research diversity score and within-lab research diversity score. Between-lab 597 research diversity score describes the dissimilarities of plasmids from a given lab compared to other labs. It can be quantified by 598 the averaged lab Jaccard distances. The larger the distance is, the more distinct a lab's sequences are from other labs' sequences.

599
The within-lab research diversity score (WRDS) refers to the labs own research diversity -the similarity among sequences 600 within their own labs. WRDS can be quantified by the number of fragments each lab has in the pan-genome. A lab using more 601 pan-genome fragments tends to have more variety among its plasmids. Based on this definition, the Hahn Klaus lab has the 602 largest within-lab research diversity score (Supplementary Table 1).

603
The obvious problem with the definition of WRDS is that labs with a large number of plasmids are more likely to have a 604 large number of fragments (higher WRDS). Therefore, we normalize the WRDS with the number of plasmids each lab has.

605
This is referred to as the normalized within-lab research diversity score (nWRDS). Labs with larger nWRDS values, on average, 606 use more unique subsequences to construct their plasmids. The Zhang YZ lab has the largest nWRDS value. ZHang YZ has a 607 total of 11 plasmids, which align to 129 different fragments. On average, each plasmid can align to 11.7 different fragments 608 (Supplementary Table 2). On the other hand, David Root has a total of 2717 plasmids and all those plasmids are aligned 609 to 108 fragments in the complete pan-genome. Therefore, the David Root lab has the smallest nWRDS value of only 0.04 610 (Supplementary Table 3). As we mentioned previously, most of the Root David plasmids are from single library screening 611 papers. This lowers the total number of fragments each plasmid aligns to, since most plasmids will align to the same set of 612 fragments in the complete pan-genome. In the future, we could collapse plasmids from the same library into a single plasmid to 613 calculate nWRDS value to correct for this.  Figure 1. a Percentage of times that CORRECT mode predicted the true lab-of-origin versus the number of top labs identified by PlasmidHawk MAX mode. The X-axis represents the number of labs predicted by MAX mode. The y-axis represents the percentage of times CORRECT mode's top 1 prediction was correct in that interval. b The distribution of true source lab rankings when it is within the labs identified and ranked by CORRECT mode. The X-axis represents the ranking of the true depositing lab based on the lab scores output by CORRECT mode. The y-axis represents the percentage of times that this occurs out of all cases where the correct source lab is included in the output of CORRECT mode.

28/36
Supplementary Figure 2. Averaged lab Jaccard distances and number of plasmids each lab has. Each dot represents a lab. The x-axis is the averaged lab Jaccard Distance and the y-axis is the number of plasmids each lab has. Blue dots represent labs where all of their plasmids in five experiments are predicted wrong by both the CNN and PlasmidHawk. Red dots represent labs where all of their plasmids are predicted wrong only by PlasmidHawk. Green dots represent labs whose testing plasmids are predicted wrong only by the CNN. White dots represent labs with at least one testing plasmid correctly predicted by the CNN and PlasmidHawk.

29/36
Supplementary Figure 3. a The number of testing plasmids predicted correctly by the CNN and PlasmidHawk (top 1 prediction). The red ball represents the number of plasmids predicted correctly only by the CNN and the green ball represents the number of plasmids predicted correctly only by PlasmidHawk. The yellow region represents the number of plasmids whose lab-of-origin are successfully identified by both the CNN and PlasmidHawk. b Averaged lab Jaccard distances and number of fragments each lab has. Each dot represents a lab. The red dots represent labs with at least one plasmid that is correctly predicted by the CNN approach, but not by PlasmidHawk.

Supplementary Figure 4.
Lab-of-origin prediction accuracy for labs with different averaged lab Jaccard distances. The X-axis shows labs with different averaged lab Jaccard distances and the y-axis is the prediction accuracy of the testing plasmids from those labs. them are intrinsically more likely to be selected when compared with fragments with a small number of labs annotated to them.

627
Despite these limitations, we believe the hypergeometric p-value is a reasonable representation for PlasmidHawk MAX mode 628 significance. Gene enrichment analysis has come across similar problems as well. Although many studies have proposed 629 numerous methods, such as Bayesian-based methods and probabilistic generative models, to overcome these limitations, there 630 is no standard p-value calculation in gene enrichment study other than the hypergeometric p-value. Therefore, we believe, the 631 hypergeometric p-value is a reasonable approach in calculating the significance of predicted labs returned by PlasmidHawk 632 MAX mode. While the p-value could likely be further refined to include a wide array of further confounding factors (such as 633 fragment dependence mentioned above), this is a non-trivial process and is left as an open research question.

634
In particular, the hypergeometric p-value is useful when a handful of labs occupy a large majority of the pan-genome 635 fragments. In this case, the large p-value signifies that labs returned by PlasmidHawk MAX mode are likely by chance. For 636 example, in the case where the pan-genome has 1000 fragments and lab A has 999 out of 1000 fragments in the pangenome. In 637 this situation, for any given query sequence, PlasmidHawk MAX mode is likely to return lab A as the predicted lab. Despite 638 this, we do not suggest users use p-values as the only indicator to decide whether the predictions are accurate or not. The 639 p-value should serve as a general guidance and not the final deciding factor.