Introduction

Advancements in synthetic biology are continuing to rapidly increase both organism engineering capabilities and the accessibility of those capabilities to a broad range of potential actors1. This necessarily increases the potential biosecurity risk from misuse of these capabilities, such as deliberate or accidental use of gene synthesis and gene editing to produce dangerous pathogenic organisms or toxins2,3,4,5,6. Moreover, the gene and protein sequences for dangerous organisms and toxins are readily available, e.g., in the NCBI databases and other similar public scientific resources.

As a consequence, it is becoming increasingly critical to be able to make rapid and accurate determinations of the risk posed by particular nucleic acid sequences or amino acid sequences. Organizations like the International Gene Synthesis Consortium (IGSC) and its members need to make these determinations in order to decide whether the materials a customer has ordered are subject to legal controls and whether to fill that order7,8. Similar concerns are present for organizations that design or edit organisms, for DNA depositories such as Addgene or the iGEM registry, and for many other organizations as well.

At present, the most common means of evaluating the risk posed by a nucleic or amino acid sequence is to use the BLAST algorithm suite9 to determine which sequences in NCBI’s nucleic acid and protein databases (or other equivalent comprehensive sequence databases, such as the ENA and DDBJ databases) are most closely related to the sequence under consideration. BLAST is a heuristic algorithm that estimates a set of best matches for a query sequence against a sequence database based on local alignments. While the heuristic nature of the algorithm means that its exact results may be unstable and may omit good matches, especially in cases where there are many matches, it is well known to be highly effective for a wide variety of bioinformatic use cases.

A typical method for evaluating sequence risk is thus to run BLAST for the sequence against comprehensive databases such as NCBI’s non-redundant protein sequence (nr) and nucleotide (nr/nt) databases. If BLAST finds the sequence to be more closely related to some controlled pathogen or toxin than to any non-controlled sequence, then the sequence is considered controlled, unless an expert judges the sequence to fall into an exception category such as being a common “housekeeping gene”2. Alternatives to this method include a number of other recently developed tools that are specifically tailored for identification of dangerous and/or controlled sequences, including FAST-NA Scanner24, ThreatSEQ25, SeqScreen26, and SecureDNA27. Nevertheless, categorization via BLAST against comprehensive sequence databases remains the de facto standard for determining whether a sequence is controlled.

Figure 1

NCBI nucleic acid and protein databases allow categorization of chimeric material under the taxa most relevant to the study. For example, NCBI protein accession 6PCI_D is classified as Ebola virus since it is the Ebola virus GP2 protein, studied with the aid of an appended twin streptavidin tag. BLAST matching of new material can then misidentify its taxon by matching against the chimeric material. For example, when the twin streptavidin tag is added to the mRuby protein, BLAST on its 3’ end produces a best match with 6PCI_D, since they share their last amino acid before the tag, thus identifying the sequence as controlled Ebola virus material, despite it being completely unrelated. In short, chimeric material can mislead BLAST-based identification of controlled sequences into classifying benign sequences as dangerous or dangerous sequences as benign.

Neither BLAST nor any of the NCBI databases, however, is specifically designed for the purpose of determining whether a sequence is from a dangerous pathogen or toxin. Taxonomic classification ambiguities that are allowed in the NCBI nucleic acid and protein databases to support their intended purposes can result in misclassification, as illustrated in Fig. 1. For example, researchers will often modify proteins or genetic sequences by incorporating sequences with well-understood functions that serve as biotechnological “tools” enabling their studies. If, for instance, an amino acid “tag” sequence is added to a protein in order to enable a crystallization study, then the modified protein will generally be appropriately categorized under the original taxon, as that is the subject of the study. Likewise, a virus modified to include a fluorescent reporter might still be categorized under the original virus. Various forms of horizontal transfer can create similar chimerism naturally as well. These categorizations may reasonably be considered correct, as the sequence is indeed generally most relevant to the taxon to which it has been assigned.

Assigning a chimeric sequence to a non-chimeric taxon, however, necessarily means that the assigned taxon now includes some amount of sequence materials derived from other taxa. These chimeric materials can then be matched by BLAST, identifying both the original taxa and the chimeric taxa. Depending on the specifics of the chimerism and the taxa involved, this can either cause benign material to appear dangerous (as in the example in Fig. 1) or cause dangerous material to appear benign.

For common subjects of study, such as model organisms or important pathogens and toxins, the relative number of sequences assigned to the taxon may be quite high indeed. For example, as of this writing, SARS-CoV-2 sequences in NCBI comprise a massive 61.9% of all viral nucleotide sequences and 76.9% of all viral protein sequences. Other dangerous pathogens are also often highly weighted, if less extreme. For example, currently Influenza A has 2450 times more nucleic acid sequences than the average viral species (8.2% of all viral nucleotide sequences) and 700 times more protein sequences (2.3% of all viral protein sequences). Similarly, Burkholderia pseudomallei (which causes melioidosis) has 2320 times more nucleic acid sequences than the average bacterial species (0.52% of all bacterial nucleotide sequences) and 4600 times more protein sequences (1.03% of all bacterial protein sequences), while Vibrio cholerae (which causes cholera) has 2210 times more nucleic acid sequences (0.50% of all bacterial nucleotide sequences) and 2380 times more protein sequences (0.53% of all bacterial protein sequences). Given these large biases in sequence frequency, chimeric usages of a sequence can come to outweigh the original or even to crowd it entirely out of BLAST results. Moreover, many biotechnology tools are originally categorized as artificial sequences, a polyphyletic category that by its nature cannot be used for determination of potential pathogen or toxin content.

In this case study, we investigate whether such taxonomic ambiguities can, in fact, result in incorrect determinations of sequence risk based on the results of BLAST against the NCBI non-redundant protein database. In this investigation, we are aware of the potential for creating information hazards, e.g., enabling bad actors to evade BLAST-based screening precautions and obtain controlled genetic material. For this reason, in this manuscript we specifically focus on false positives, in which benign biotechnology tools are mis-categorized as dangerous pathogens or toxins.

Results

For this case study, we selected seven protein sequences for common biotechnology tools, ranging in length from 18 to 231 amino acids. One of these is the T4 foldon, a short sequence from an E. coli bacteriophage that has a long history of use for stabilization of proteins (e.g.10,11), including applications in vaccines (e.g.12,13,14). Three others are protein tags used for purification: a peptide signal for secretion15 and the streptavidin tag16 either coupled with a PreScission protease target or in a twin streptavidin configuration17. The final three are reporter sequences, which have greater length: the first 60 amino acids of a GFP fluorescent protein, the NanoLuc luciferase18, and the ZsGreen fluorescent protein19. Sequences are provided in Supplementary S1.

This case study thus covers multiple common classes of protein-based biotechnology tools, with sequence sizes that extend from well above the current standard screening threshold of 200 nucleic acid base pairs2,7 down to just above the screening threshold of 50 base pairs recently proposed by the US DHHS20.

Figure 2

For each test sequence, BLAST analysis found many matches with the same maximum bit-score. Of these tied matches, the fraction of sequences classified as controlled pathogens or toxins varies widely. When the sequence was extended with each of the 20 possible amino acids added to its 3’ or 5’ end, between 5% and 20% (1–4 of the options) had best matches consisting only of controlled pathogens or toxins. For some sequences, only one side was analyzed: the secretion and streptavidin tags are typically used at the 5’ end of a sequence, and the GFP 5’ fragment is constrained to follow the GFP sequence on its 3’ side. The biosecurity-focused methods FAST-NA and SeqScreen perform much better than BLAST: each correctly categorizes all of the original sequences and all but one of the 200 single amino acid extensions.

To analyze the potential impact of chimeric material on determination of controlled pathogen/toxin status for these biotechnology tools, we first ran a protein BLAST against the NCBI non-redundant protein database (nr) for each of the seven amino acid sequences and examined the taxonomic categorization of the results (see “Methods” for details). Since each of these sequences is widely used, BLAST returns many sequences that are equal “best matches” with respect to maximum bit-score (the metric typically used for determination of control status), including some that match material from sequences categorized as controlled pathogens or toxins. The fraction of matches for each sequence to controlled material (Fig. 2 second column) spans a broad range, from 1.6% for NanoLuc to a remarkable 93% for the T4 foldon, with an overall average of 33.8%. Since the results for each sequence also contain non-controlled sequences with equal best-match scores, these sequences would all appropriately be assessed as non-controlled.

The high volume of controlled sequence matches, however, indicates that classification is likely to be fragile. Even a single additional amino acid that matches better to a controlled sequence than to a non-controlled sequence can change the classification, as in the case shown in Fig. 1. To study this fragility, we ran BLAST against sequences extended by a single amino acid at the 5’ and/or 3’ end, systematically testing each of the 20 possible options for an additional canonical amino acid. The protein tags are typically added to the 5’ end, so only 3’ extensions were tested for these sequences, and the GFP fragment contains only the 5’ portion of the protein, so only 5’ extensions were tested for that sequence. As predicted, with the addition of a single amino acid, the previous ties for best match were indeed often broken in favor of controlled sequences (Fig. 2 third column), with an average of 12% of all extensions having a best match only to controlled sequences.
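
The extension procedure described above can be sketched as follows. This is a minimal illustration rather than the tooling actually used in the study: the function name `single_residue_extensions`, the side labels, and the toy input sequence are assumptions introduced here, and the 5’/3’ convention used in the text for the two ends of a protein sequence is retained.

```python
from typing import Dict, Iterable

# The 20 canonical amino acids, one-letter codes.
CANONICAL_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_residue_extensions(
    sequence: str, sides: Iterable[str] = ("5'", "3'")
) -> Dict[str, str]:
    """Return every variant of `sequence` extended by one canonical residue.

    Keys are labels such as "5'+A"; values are the extended sequences.
    `sides` restricts which ends are extended (hypothetical interface).
    """
    variants: Dict[str, str] = {}
    for side in sides:
        for residue in CANONICAL_AMINO_ACIDS:
            if side == "5'":
                variants[f"{side}+{residue}"] = residue + sequence
            elif side == "3'":
                variants[f"{side}+{residue}"] = sequence + residue
            else:
                raise ValueError(f"unknown side: {side!r}")
    return variants


if __name__ == "__main__":
    toy = "ACDEFGHIK"  # placeholder, not one of the study sequences
    print(len(single_residue_extensions(toy)))                 # both ends
    print(len(single_residue_extensions(toy, sides=("3'",))))  # one end only
```

With both ends extended for the T4 foldon, NanoLuc, and ZsGreen (40 variants each) and one end for the three tags and the GFP fragment (20 variants each), this gives 3 × 40 + 4 × 20 = 200 variants, consistent with the two hundred sequences reported below.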

The taxonomic classification of each best match to a controlled sequence is provided in the “1 AA flanker” table in Supplementary S3. The most frequent match was to the SARS-CoV-2 virus, which was a best match for 11 out of the 24 sequences with controlled best matches. Other controlled taxa were mainly viral—high pathogenicity avian influenza, Newcastle disease virus, Vesicular stomatitis virus, Yellow fever virus, and Ebola virus—but one match was to the Clostridium botulinum hemagglutinin protein, part of the botulinum neurotoxin (BoNT) complex21.

For comparison, we also ran the same sequences through two tools specifically designed for identification of dangerous and/or controlled materials, FAST-NA Scanner24 and SeqScreen26 (results in Fig. 2 right-hand columns; the other tools cited above were not available to us). Both FAST-NA Scanner and SeqScreen correctly categorized all of the original sequences as non-controlled. For the single amino acid extensions, each had a false positive on one of the two hundred sequences tested: FAST-NA Scanner mis-identified one of the Streptavidin plus PreScission protease 3’ extensions as the same Clostridium botulinum hemagglutinin protein matched by BLAST, and SeqScreen mis-identified one of the ZsGreen 3’ extensions as the same chimeric Ebola virus sequence matched by BLAST. These results show that while these specifically tailored methods are still imperfect, their performance is far better than that of BLAST against a general-purpose database.

Figure 3

For each short sequence, a single amino-acid extension classified as pathogenic using BLAST was further extended with ten different amino acid sequences, each 60 residues in length. For all except the secretion peptide, the single amino acid strongly predicted the classification of the extended sequence, with more than half of the random sequences also being classified as a controlled pathogen. FAST-NA and SeqScreen perform much better than BLAST: each correctly categorizes all of the extended sequences.

Since some of these sequences are much shorter than the commonly used 200 base pair threshold, one might wonder whether these results would still hold when considering the sequence in a larger context. To test this, we selected one of the controlled sequences from each of the T4 foldon extensions and from the protein tag extensions, then generated ten random extensions for each of the selected sequences. Specifically, for each selected sequence, we generated ten sequences of 60 random amino acids, then concatenated these to the sequence on the same side as its single amino acid extension (the 50 sequences thus produced are provided in Supplementary S2). We then ran BLAST on these 50 extended sequences. The taxonomic classification of each best match to a controlled sequence is provided in the “Random Extension” table in Supplementary S3. In some cases, the random additional material did indeed break the best match relationship (Fig. 3), though in two of these the new best match was a controlled sequence from a different taxon. For all except the secretion peptide, however, the single amino acid extension strongly predicted the classification of the larger sequence, with at least half of the random sequences still best matching to controlled sequences, and overall 27 out of the 50 extended sequences. As before, FAST-NA Scanner24 and SeqScreen26 performed much better than BLAST, in this case finding no incorrect classifications.
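
The random-extension step can be sketched in the same way. The text does not state how the random residues were drawn, so uniform sampling over the 20 canonical amino acids is an assumption made here, as are the function name `random_extensions`, its parameters, and the toy seed sequence.

```python
import random
from typing import List, Optional

CANONICAL_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_extensions(
    seed_sequence: str,
    side: str,
    count: int = 10,
    length: int = 60,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Attach `count` random peptides of `length` residues to one end.

    `side` is "5'" or "3'" and should match the side of the single
    amino acid extension already present in `seed_sequence`.
    Residues are drawn uniformly (an assumption, see lead-in).
    """
    if side not in ("5'", "3'"):
        raise ValueError(f"unknown side: {side!r}")
    if rng is None:
        rng = random.Random()
    extended = []
    for _ in range(count):
        filler = "".join(rng.choice(CANONICAL_AMINO_ACIDS) for _ in range(length))
        extended.append(filler + seed_sequence if side == "5'" else seed_sequence + filler)
    return extended


if __name__ == "__main__":
    # Toy seed: placeholder sequence with a 3' single-residue extension.
    for seq in random_extensions("ACDEFGHIKW", side="3'", rng=random.Random(0)):
        print(len(seq), seq[:15] + "...")
```

Five selected sequences with ten random extensions each give the 50 extended sequences described above; passing a seeded `random.Random` makes a run reproducible.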

In sum, then, these results indicate that when best match BLAST is used as a heuristic to determine whether a sequence is controlled, any inclusion of a biotechnology tool in a sequence classified as a controlled pathogen or toxin results in a significant chance that other uses of the same tool will be classified as controlled by BLAST as well. Other methods specifically designed for identification of dangerous and/or controlled sequences, however, are much less likely to misclassify these sequences.

Discussion

These results demonstrate a serious problem in the use of BLAST against NCBI nucleic acid and protein databases to identify controlled pathogen sequences. Although only a handful of sequences were selected for this case study, in our current biosecurity screening deployment we have seen these same issues commonly arise with a wide range of other common biotechnology tools, including antibodies, sequencing adapters, antibiotic selection markers, plasmid vectors, and promoters. A similar dynamic applies for the more dangerous issue of false negatives, in which a controlled pathogen is misidentified as a benign sequence. Failures of this type can have serious biosecurity consequences in allowing bad or careless actors to obtain dangerous genetic material, as well as serious legal consequences for organizations if classification errors cause them to violate national export control laws or biosafety regulations. We have observed these issues in practice as well, but do not report details here as they could constitute an information hazard (e.g., enabling bad actors to evade BLAST-based screening precautions and obtain controlled genetic material).

Moreover, this situation is unstable, and the scope of the problem appears to be expanding. Inspection of a few of the sequence dates from top matches found that most of the matched sequences, pathogenic or otherwise, were from the last few years. This is unsurprising, given that both the volume of sequencing data and the use of biotechnology tools are continuing to rapidly expand, and implies that the degradation of biosecurity screening based on BLAST against NCBI nucleic acid and protein databases has been worsening and is likely to continue to do so even more rapidly. Worse yet, since the sequences of biotechnology tools have been included in a large number of sequences in the NCBI nucleic acid and protein databases, there are often many sequences with identical or near-identical match scores, and as a consequence the actual set of accessions returned by BLAST may be volatile and change unpredictably.

The impact of this degradation in performance is particularly acute for the practicalities of biosecurity and biosafety operations. The driving dynamics creating the issue mean that it is precisely the most commonly used biotechnology tools that are likely to be misidentified as the pathogens of greatest interest, and vice versa. As a consequence, the cost of false positives and the risks from false negatives are amplified and are likely having a significant but unrecognized impact on the ongoing biosecurity and biosafety operations of industry, government agencies, and other organizations.

There are several potential paths for mitigation of this problem. When sequence analysis is being performed by human experts, one might choose to invest in training regarding the classification hazards posed by chimeric material, but this does not address the problem faced by the automation-assisted workflows used in many biosecurity screening deployments due to the high cost of analysis by human experts2. Alternatively, the rules and curation procedures for taxonomic classification in NCBI and other affiliated databanks might be adjusted to better support pathogen classification, but this would conflict with a wide range of other uses for which these databanks have been designed. Specialized curation methods, such as the FLAN22 and VADR23 systems for viral pathogen curation, help with some aspects of data quality but cannot address the underlying interaction between BLAST and the categorization of chimeric material. One might also consider producing a variant of BLAST that is more suited for pathogen identification, e.g., by taking taxonomic weighting into account, but it is unclear whether this is actually feasible. Another approach is to switch to using existing databases that apply stringent taxonomic standards in curation, such as NCBI RefSeq, but doing so would drastically reduce coverage of variant sequences. Ultimately, however, it will likely be advisable to shift away from BLAST against general-purpose databases and towards emerging methods that have been specifically tailored for pathogen identification, such as FAST-NA Scanner24, ThreatSEQ25, SeqScreen26, or SecureDNA27. Ensemble approaches might be considered as well, though it is unclear how much incremental benefit would be obtained over the already good performance of tools such as FAST-NA Scanner and SeqScreen.

No matter the potential path to mitigation, however, another important challenge illuminated by this study is the lack of a comprehensive standard and test set for evaluating the quality of pathogen identification. Current practice has largely defaulted to the use of BLAST against NCBI nucleic acid and protein databases as a “gold standard” for pathogen identification, despite the fact that results from BLAST vary based on settings and are subject to many different possible interpretation strategies. Even if consensus could be reached on these issues, however, the results presented here demonstrate that continued use of BLAST against general-purpose databases as a standard for identification is simply not sustainable. As a consequence, it will be important for stakeholders in pathogen identification to work together to develop a comprehensive standard for assessing the efficacy of pathogen identification systems.

Finally, although this work has focused specifically on pathogen identification, the underlying problems of taxonomic ambiguity and weighting are likely also affecting other applications that make use of general databases such as the NCBI nucleic acid and protein databases. We therefore recommend that other user communities perform their own case studies to determine whether there are similar issues in need of mitigation in their own applications.

Methods

Sequence analysis with BLAST was run via the NCBI BLAST web interface (https://blast.ncbi.nlm.nih.gov/Blast.cgi) for protein (blastp) using the non-redundant protein sequence (nr) database and NCBI’s standard default parameter values. Taxonomic classification counts were determined using the BLAST taxonomy lineage report, counting all taxonomies whose top result tied for the maximum bit-score. Closest match was determined by maximum bit-score, with a sequence considered non-controlled if no controlled material scored higher than uncontrolled material (i.e., ties go to uncontrolled material).
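
The decision rule described above (maximum bit-score, with ties going to uncontrolled material) can be written compactly. The sketch below is illustrative only: the `Hit` record, its field names, and the accession labels are hypothetical, and it assumes the hits have already been parsed from a BLAST report and annotated with control status.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Hit:
    accession: str     # hypothetical label, not a real accession
    bit_score: float   # as reported by BLAST
    controlled: bool   # True if the assigned taxon is on a control list


def top_tied_hits(hits: Sequence[Hit]) -> List[Hit]:
    """Return all hits that share the maximum bit-score."""
    if not hits:
        return []
    best = max(h.bit_score for h in hits)
    return [h for h in hits if h.bit_score == best]


def controlled_fraction(hits: Sequence[Hit]) -> float:
    """Fraction of tied best matches assigned to controlled taxa."""
    tied = top_tied_hits(hits)
    return sum(h.controlled for h in tied) / len(tied) if tied else 0.0


def is_controlled(hits: Sequence[Hit]) -> bool:
    """Controlled only if controlled material strictly outscores all
    uncontrolled material, i.e. every tied best match is controlled."""
    tied = top_tied_hits(hits)
    return bool(tied) and all(h.controlled for h in tied)


if __name__ == "__main__":
    tie = [Hit("HYP_1", 50.0, True), Hit("HYP_2", 50.0, False)]
    broken = [Hit("HYP_1", 51.0, True), Hit("HYP_2", 50.0, False)]
    print(is_controlled(tie), controlled_fraction(tie))
    print(is_controlled(broken), controlled_fraction(broken))
```

The second example mirrors the fragility discussed in the Results: a small score advantage for a controlled hit is enough to flip the classification once the tie is broken.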

A sequence in NCBI was considered to be controlled if its assigned taxon was a biological agent on either the Australia Group or US Commerce Control List. Sequences assigned to the polyphyletic taxa of “artificial sequences” (NCBI:txid81077) and “plasmids” (NCBI:txid36549) were not used for determination of control status, as these collections intentionally mix controlled and non-controlled materials.
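
One way to apply this exclusion before determining control status is sketched below. It assumes each hit carries the full list of NCBI taxonomy identifiers in its lineage, so that sequences assigned to descendants of the two excluded taxa are dropped as well; the dictionary layout, the function name `drop_polyphyletic`, and the taxids in the example other than the two named above are hypothetical.

```python
from typing import Any, List, Mapping, Sequence

# Polyphyletic taxa named in the Methods: artificial sequences and plasmids.
EXCLUDED_TAXIDS = frozenset({81077, 36549})


def drop_polyphyletic(hits: Sequence[Mapping[str, Any]]) -> List[Mapping[str, Any]]:
    """Keep only hits whose lineage contains none of the excluded taxa.

    Each hit is assumed to expose a "lineage" entry: an iterable of integer
    taxids from the root down to the assigned taxon (hypothetical layout).
    """
    return [h for h in hits if EXCLUDED_TAXIDS.isdisjoint(h["lineage"])]


if __name__ == "__main__":
    hits = [
        {"accession": "HYP_1", "lineage": [1, 900001, 900002]},  # retained
        {"accession": "HYP_2", "lineage": [1, 81077, 900003]},   # under artificial sequences
        {"accession": "HYP_3", "lineage": [1, 36549]},           # plasmids
    ]
    print([h["accession"] for h in drop_polyphyletic(hits)])
```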

Sequence analysis with SeqScreen was run using SeqScreen version 4.0 in a Singularity container with database SeqScreenDB_22.11 and default arguments. A sequence was considered to be controlled if the SeqScreen report identified it as belonging to a taxon that is a biological agent on either the Australia Group or US Commerce Control List.

Sequence analysis with FAST-NA was run using FAST-NA Scanner version 4.1 with database version 1.0.6. A sequence was considered to be controlled if FAST-NA Scanner reported a match to any threat cluster, where threat clusters cover all taxa that are biological agents on either the Australia Group or US Commerce Control List, plus all protein toxins on those lists.