## Introduction

Genetic engineering is becoming increasingly powerful, widespread, and accessible, enabling ever more people to manipulate organisms in increasingly sophisticated ways. As biotechnology advances and spreads, the ability to attribute genetically engineered organisms to their designers becomes increasingly important—both as a means to ensure due recognition and prevent plagiarism, and as a means of holding these designers accountable to the communities their work affects1,2,3,4. While many academic researchers openly claim credit for their strains and sequences, the provenance of other products—including unpublished work, the products of industrial and government labs, and the work of amateur enthusiasts—is often more difficult to establish.

While tools for attributing these products of biotechnology—for genetic engineering attribution (GEA)—have historically lagged behind the pace of scientific development, recent years have seen rapid progress1,2,5,6. Genetic engineers face many design choices when creating an engineered nucleic-acid sequence, and the sum of these choices constitutes a design signature which, in at least some cases, is detectable by GEA algorithms2,5 (Fig. 1a). The more reliably and precisely these algorithms can identify the true designer of a sequence, the greater the potential benefits for accountability and innovation.

Past work on GEA2,5,6 has largely focused on predicting the origin lab of plasmid sequences from the Addgene data repository. Performance on this problem has improved rapidly (Fig. 1b). Most recently, Alley et al. used a Recurrent Neural Network (RNN) approach to achieve an accuracy of 70% and a top-10 accuracy (the frequency with which the true lab-of-origin is within the model’s top-10 predictions) of 85%2.

A recent publication using a non-machine-learning (ML) pan-genome method reported comparable results, with 76% accuracy (henceforth, “top-1 accuracy”) and 85% top-10 accuracy6.

#### The Prediction Track

In the Prediction Track, participants attempted to predict the lab-of-origin of plasmid sequences from the Alley et al. dataset (see below). Participants were given the training-set sequences together with their lab-of-origin labels, while the labels for the leaderboard and holdout test sets were withheld. The top-10 accuracy of each submission on the leaderboard set was reported to the submitting team immediately upon submission, and the best top-10 accuracy achieved by each team on this set was continuously displayed on a public leaderboard during the competition. The top-10 accuracy of each submission on the holdout test set was not reported until after the Prediction Track had closed, and was used to determine the final competition ranking. Prizes were awarded to the four teams that achieved the highest top-10 accuracy scores on this private test set.

#### The Innovation Track

Following closure of the Prediction Track, teams that achieved a top-10 accuracy of at least 75.6% were invited to participate in the Innovation Track. This threshold was based on an earlier estimate of BLAST top-10 accuracy (see below). To compete in this track, participants were asked to submit short reports (maximum 4 pages, maximum 2 figures) describing how their approach would contribute to solving real-world attribution problems; these reports were then reviewed by a team of judges (see below). Prizes were awarded to teams who exhibited novel and creative approaches to the problem, or who demonstrated that their algorithms possessed useful properties other than raw accuracy. The full text of the Innovation Track problem description is available in the Supplementary Note.

Submitted reports were assessed by a team of 12 judges, including experts in synthetic biology, bioinformatics, biosecurity, and machine learning. Each judge reviewed a group of six submissions; assignment of submissions into these groups was performed randomly, with the constraints that each possible pair of submissions must be reviewed by at least two judges and that each individual submission must be reviewed by the same number of judges.

To avoid issues arising from differences in scoring practices between judges, each judge was asked to rank the submissions they received, with a rank of 1 indicating the best submission. Prizes were awarded to the four teams who achieved the smallest average rank across judges. In the event of a two-way tie, the process was repeated using only those judges who reviewed both submissions; this was sufficient to obtain four unique prizewinners in this case.

### Data preparation

Data for the GEAC were provided by Alley et al.2, and comprised all plasmids deposited in the Addgene repository up to July 27th 2018—a total of 81,834 entries. For each plasmid, the dataset included a DNA sequence, along with metadata on growth strain, growth temperature, copy number, host species, bacterial resistance markers, and other selectable markers. Each of these categorical metadata fields was re-encoded as a series of one-hot feature groups (a sketch of this encoding follows the list below):

• Growth strain: growth_strain_ccdb_survival, growth_strain_dh10b, growth_strain_dh5alpha, growth_strain_neb_stable, growth_strain_other, growth_strain_stbl3, growth_strain_top10, growth_strain_xl1_blue

• Growth temperature: growth_temp_30, growth_temp_37, growth_temp_other

• Copy number: copy_number_high_copy, copy_number_low_copy, copy_number_unknown

• Host species: species_budding_yeast, species_fly, species_human, species_mouse, species_mustard_weed, species_nematode, species_other, species_rat, species_synthetic, species_zebrafish

• Bacterial resistance: bacterial_resistance_ampicillin, bacterial_resistance_chloramphenicol, bacterial_resistance_kanamycin, bacterial_resistance_other, bacterial_resistance_spectinomycin

• Other selectable markers: selectable_markers_blasticidin, selectable_markers_his3, selectable_markers_hygromycin, selectable_markers_leu2, selectable_markers_neomycin, selectable_markers_other, selectable_markers_puromycin, selectable_markers_trp1, selectable_markers_ura3, selectable_markers_zeocin
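As an illustration of how such a categorical field can be expanded into a one-hot feature group, a minimal sketch using pandas is shown below; the field names and levels are placeholders for illustration, not the exact raw Addgene schema.

```python
import pandas as pd

# Hypothetical raw metadata (field and level names for illustration only).
metadata = pd.DataFrame({
    "growth_strain": ["dh5alpha", "top10", "other"],
    "growth_temp": ["37", "30", "other"],
})

# pd.get_dummies expands each categorical column into a one-hot feature group,
# yielding columns such as growth_strain_dh5alpha and growth_temp_37.
one_hot = pd.get_dummies(metadata, columns=["growth_strain", "growth_temp"])
print(sorted(one_hot.columns))
```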

In addition to the sequence and the above metadata fields, the raw dataset also contained unique sequence IDs, as well as separate IDs designating the origin lab. For the competition, both sequence and lab IDs were obfuscated through 1:1 replacement with random alphanumeric strings.

The number of plasmids deposited in the dataset by each lab was highly heterogeneous (Supplementary Fig. 21). Many labs only deposited one or a few sequences—too few to adequately train a model to uniquely identify that lab. To deal with this problem, Alley et al. grouped labs with fewer than 10 data points into a single auxiliary category labelled “Unknown Engineered”. This reduced the number of categories from 3751 (the number of labs) to 1314 (1313 unique labs + Unknown Engineered).

In addition to issues with small labs, the dataset also contains “lineages” of plasmids: sequences that were derived by modifying other sequences in the dataset. This could potentially bias accuracy measures by introducing dependencies between entries in the training and test sets. To deal with this issue, Alley et al. inferred lineage networks among plasmids in the dataset, based on information in the complete Addgene database acknowledging sequence contributions from other entries. More specifically, lineages were identified by searching for connected components within the network of entry-to-entry acknowledgements in the Addgene database (see Alley et al.2 for more details).
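As a rough illustration of this step, lineages could be recovered as connected components of an acknowledgement graph along the following lines; this is a minimal sketch using networkx with placeholder entry IDs, not the authors' exact pipeline.

```python
import networkx as nx

# Hypothetical acknowledgement edges between Addgene entries:
# (depositing entry, acknowledged parent entry).
acknowledgements = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]

graph = nx.Graph()
graph.add_edges_from(acknowledgements)

# Each connected component is treated as one plasmid lineage, so that all of
# its members can later be assigned to the same data partition.
lineages = [sorted(component) for component in nx.connected_components(graph)]
print(lineages)  # [['p1', 'p2', 'p3'], ['p4', 'p5']]
```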

The data were partitioned into train, validation, and test sets, with the constraints that (i) every category have at least three data points in the test set, and (ii) all plasmids in a given lineage be assigned to a single dataset. Following the split, the training set contained 63,017 entries (77.0%); the validation set contained 7466 entries (9.1%); and the test set contained 11,351 entries (13.9%).

For the GEAC, these three data partitions were reassigned based on the needs of the competition: the training set was provided to the participants for model development, including the true (though obfuscated, see above) lab IDs. The validation and test sets, meanwhile, were repurposed as the leaderboard and holdout test sets of the competition. One entry with a 1nt sequence was dropped from the leaderboard set, leaving a total of 7465 entries.

The test and leaderboard sets were shuffled together, and provided to participants without the accompanying lab IDs; as described above, participants’ top-10 accuracy on the leaderboard set was used to determine their position on the public leaderboard during the competition, while their top-10 accuracy on the holdout test set was used to determine the final ranking and prizewinners. To avoid overfitting, participants were not shown their results on the holdout test set until the end of the competition, at which point participants were ranked based on the top-10 accuracy of their most recent submission on that test set.

### Data integrity

In order to minimise competitor access to Addgene data during the GEAC, a number of steps were undertaken during the design and execution of the competition, including:

• The source of the data was not disclosed to participants;

• Plasmid and lab IDs were obfuscated in the competition dataset, raising the barrier to potential cheating;

• In order to receive any prize money, high-scoring participants had to submit their model code to DrivenData for independent verification—including visual inspection for obvious cheating, validation of performance on the test dataset, and verification on a separate dataset of Addgene sequences collected after the competition.

### Computing the BLAST benchmark

Previous implementations of GEA using BLAST36 have reported top-1 accuracies of just over 65% and top-10 accuracies of roughly 75%2. During the preparation of this manuscript, we found that a small modification of this attribution algorithm (specifically, replacing use of the quicksort algorithm37 with mergesort38) resulted in equal top-1 accuracy, while substantially increasing top-N accuracy for N > 1 (Supplementary Fig. 3). We have used the results from this modified algorithm in the main text, while presenting both sets of results side-by-side in the supplementary material. Under our implementation, the procedure followed by both algorithms can be summarised as follows:

• Sequences from the training set were extracted into a FASTA file, then used to generate a BLAST nucleotide database.

• Sequences from the test set were extracted into a FASTA file, then aligned to the training-set database, with an E-value threshold of 10.

• Alignments reported by BLAST were sorted in ascending order of E-value. The original implementation used quicksort for this sorting step, while our modified algorithm used mergesort. (In the latter but not the former case, this is equivalent to sorting in descending order of bit score.)

• The lab IDs corresponding to each training-set sequence were identified, and the sorted results were filtered to include only the first result for each lab-ID/test-set-sequence combination. The remaining hits for each test-set sequence were then ranked in order of their occurrence in this sorted, filtered list.

• Finally, top-N accuracy was calculated as the proportion of test-set sequences for which the ID of the true origin lab was assigned a rank less than or equal to N.

BLAST version 2.10.1 was used to generate the baseline.
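A minimal sketch of the post-alignment ranking steps is shown below, assuming the BLAST hits have already been parsed into a pandas DataFrame and joined to (placeholder) lab IDs; it illustrates the procedure rather than reproducing the exact implementation.

```python
import pandas as pd

# Hypothetical parsed BLAST hits: test-set query, the lab ID of the matched
# training-set sequence, and the alignment E-value (all values illustrative).
hits = pd.DataFrame({
    "query_id": ["q1", "q1", "q1", "q2"],
    "lab_id":   ["labA", "labB", "labA", "labC"],
    "evalue":   [1e-50, 1e-30, 1e-20, 1e-10],
})

# Stable (mergesort) sort by ascending E-value; for tied E-values this
# preserves the original hit order, which is the modification described above.
hits = hits.sort_values("evalue", kind="mergesort")

# Keep only the best hit per query/lab combination, then rank the surviving
# labs for each query in order of occurrence in the sorted, filtered list.
best = hits.drop_duplicates(subset=["query_id", "lab_id"]).copy()
best["rank"] = best.groupby("query_id").cumcount() + 1

# Top-N accuracy would then be the proportion of queries whose true lab
# received a rank of N or less (true labels are omitted from this sketch).
print(best)
```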

For the purpose of calculating calibration (Supplementary Fig. 18), these ranks were reversed (so that the best match had the highest rank) and normalised using softmax.

### Other baselines

Predictions on the competition test set for deteRNNt2 and a reproduction of the CNN model developed by ref. [5] were provided by ref. [2]. Top-N accuracy, X-metrics, calibration indices, and other metrics were re-computed from scratch based on these files.

### Post-competition analysis

Demographic information on the competition was collected using Google Analytics (Universal Analytics). Other data were analysed using Python 3.7 and R version 4.1. Figures were plotted using ggplot2 version 3.3.1.

Each submission to the Prediction Track consisted of a J×K prediction matrix, where J is the number of sequences in the holdout test set (11,351) and K is the total number of lab classes in that test set (1314). Each entry in this matrix ostensibly reflected a predicted probability of the corresponding lab being the true lab-of-origin for that sequence, with the entries in each row summing to unity.

To compute accuracy metrics for each team for this analysis, we first generated a rank matrix from their prediction matrix. In this matrix, the lab with the highest predicted probability for a given sequence was assigned rank 1, the second-highest prediction rank 2, and so on. To prevent teams from achieving high scores by giving uniform predictions across large numbers of labs, tied predictions were assigned the maximum rank. Given this rank matrix, the top-N accuracy for any N could then be computed as the proportion of rows for which the true lab was assigned a rank of N or less.
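A minimal sketch of this ranking and scoring step is shown below, assuming the prediction matrix and true labels are held as NumPy arrays (the array contents are placeholders); scipy.stats.rankdata with method="max" reproduces the maximum-rank treatment of ties described above.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical prediction matrix (rows: sequences, columns: labs) and the
# index of the true lab-of-origin for each sequence.
predictions = np.array([[0.7, 0.2, 0.1],
                        [0.3, 0.3, 0.4]])
true_lab = np.array([0, 1])

# Rank labs per sequence so that the highest probability receives rank 1;
# ties are assigned the maximum rank, penalising uniform predictions.
ranks = rankdata(-predictions, method="max", axis=1)

# Top-N accuracy: the proportion of sequences whose true lab has rank <= N.
rank_of_true = ranks[np.arange(len(true_lab)), true_lab]
top_1_accuracy = np.mean(rank_of_true <= 1)
top_10_accuracy = np.mean(rank_of_true <= 10)
```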

Given these accuracy scores, the X99 score could be computed as the minimum positive integer N such that top-N accuracy is at least 99%. This metric can be generalised to other thresholds, where XR is the minimum positive integer N such that top-N accuracy is at least R%. X95, X90 and X80 scores were all computed in this way.
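Building on the rank array from the sketch above, the XR metric could be computed as follows (illustrative only; rank_of_true is the hypothetical array defined there).

```python
import numpy as np

def x_score(rank_of_true, threshold=0.99):
    """Smallest positive integer N such that top-N accuracy is at least `threshold`."""
    # rank_of_true[j] is the rank assigned to the true lab of sequence j,
    # as in the sketch above; X99 corresponds to threshold=0.99.
    for n in range(1, int(rank_of_true.max()) + 1):
        if np.mean(rank_of_true <= n) >= threshold:
            return n

# X95, X90 and X80 would use thresholds of 0.95, 0.90 and 0.80 respectively.
```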

For the purposes of calculating precision and recall, the numbers of true positives, false positives and false negatives were computed separately for each lab class for each submission. For a given class, the number of true positives $${tp}$$ was defined as the number of times in the test set that that class was correctly assigned rank 1 (i.e. assigned rank 1 when it was in fact the true lab-of-origin); the number of false positives $${fp}$$ as the number of times it was incorrectly assigned rank 1; and the number of false negatives $${fn}$$ as the number of times it was incorrectly assigned rank >1. Precision and recall for each class were then calculated as $${tp}/({tp}+{fp})$$ and $${tp}/({tp}+{fn})$$, respectively, and the F1 score for each class as the harmonic mean of its precision and recall. The overall precision and recall for each team were computed as the arithmetic means of its class-specific precisions and recalls, respectively, while the macro-averaged F1 score was computed as the arithmetic mean of its class-specific F1 scores.
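A minimal sketch of these per-class and macro-averaged metrics is shown below, assuming the top-1 predictions and true labels are available as arrays of class indices (the values shown are placeholders).

```python
import numpy as np

# Hypothetical top-1 predicted and true class indices for five test sequences.
top1 = np.array([0, 0, 1, 2, 2])
true = np.array([0, 1, 1, 2, 0])

classes = np.unique(np.concatenate([true, top1]))
precisions, recalls, f1s = [], [], []
for c in classes:
    tp = np.sum((top1 == c) & (true == c))  # correctly assigned rank 1
    fp = np.sum((top1 == c) & (true != c))  # incorrectly assigned rank 1
    fn = np.sum((top1 != c) & (true == c))  # incorrectly assigned rank > 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    precisions.append(precision)
    recalls.append(recall)
    f1s.append(f1)

# Macro averages: arithmetic means of the per-class scores.
macro_precision, macro_recall, macro_f1 = map(np.mean, (precisions, recalls, f1s))
```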

### Calibration

Following Guo et al.15, we checked whether the submitted probabilistic predictions were calibrated in the frequentist sense. To estimate the expected accuracy from finite samples, we grouped predictions into $$M=15$$ interval bins of equal width. We let $${B}_{m}$$ be the set of indices of samples whose prediction confidence falls into the interval $$\left(\frac{m-1}{M},\frac{m}{M}\right]$$. The accuracy of bin $${B}_{m}$$ is then defined as

$${\rm{acc}}\left({B}_{m}\right)=\frac{1}{\left|{B}_{m}\right|}\sum_{i\in {B}_{m}}{\bf{1}}\left({\hat{y}}_{i}={y}_{i}\right)$$
(1)

where $${\hat{y}}_{i}$$ and $${y}_{i}$$ are the (top-1) predicted and true class labels for sequence $$i$$, and $$\left|{B}_{m}\right|$$ is the number of samples in bin $${B}_{m}$$. The average confidence within bin $${B}_{m}$$ is defined as

$${\rm{conf}}\left({B}_{m}\right)=\frac{1}{\left|{B}_{m}\right|}\sum_{i\in {B}_{m}}{\hat{p}}_{i}$$
(2)

where $${\hat{p}}_{i}$$ is the predicted probability assigned to class $${\hat{y}}_{i}$$ for sequence $$i$$. The expected deviation between confidence and accuracy can then be estimated using the expected calibration error (ECE):

$${\rm{ECE}}=\sum_{m=1}^{M}\frac{\left|{B}_{m}\right|}{n}\left|{\rm{acc}}\left({B}_{m}\right)-{\rm{conf}}\left({B}_{m}\right)\right|$$
(3)

where $$n$$ is the total number of samples. The maximum calibration error (MCE) estimates the worst-case deviation between confidence and accuracy across the bins:

$${\rm{MCE}}=\max_{m\in \{1,\ldots,M\}}\left|{\rm{acc}}\left({B}_{m}\right)-{\rm{conf}}\left({B}_{m}\right)\right|$$
(4)
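A minimal NumPy sketch of the ECE and MCE computation under the binning scheme above is given below; it assumes confidence holds each sample's top-1 predicted probability and correct indicates whether the top-1 prediction was right (both are placeholder names).

```python
import numpy as np

def calibration_errors(confidence, correct, n_bins=15):
    """Expected and maximum calibration error over equal-width confidence bins."""
    # confidence[i]: top-1 predicted probability; correct[i]: 1 if top-1 was right.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce, n = 0.0, 0.0, len(confidence)
    for m in range(n_bins):
        # Bin ((m-1)/M, m/M], as in the definition above.
        in_bin = (confidence > edges[m]) & (confidence <= edges[m + 1])
        if not in_bin.any():
            continue
        gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
        ece += in_bin.sum() / n * gap  # weighted by the fraction of samples in the bin
        mce = max(mce, gap)            # worst-case gap across bins
    return ece, mce
```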

### Ensemble

To ensemble the four prizewinning teams from the Prediction Track, the probability assigned to each lab for each plasmid sequence was averaged across the four teams’ models, with equal weight given to each model. That is, the prediction for sequence $$i$$ and lab $$j$$ was given by:

$${p}_{{ij}}=\frac{1}{4}{\sum }_{k=1}^{4}{p}_{{ijk}}$$
(5)

where $$k$$ indexes over the methods and $${p}_{{ijk}}$$ is the prediction score given for sequence $$i$$ to lab $$j$$, by method $$k$$.
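A minimal sketch of this equal-weight ensemble is shown below, using randomly generated placeholder matrices in place of the four teams' actual predictions.

```python
import numpy as np

# Illustrative dimensions; the actual matrices are J = 11,351 by K = 1314.
n_seqs, n_labs = 100, 10
rng = np.random.default_rng(0)

# Hypothetical prediction matrices from the four prizewinning teams,
# each with rows summing to unity.
team_predictions = [rng.dirichlet(np.ones(n_labs), size=n_seqs) for _ in range(4)]

# Equal-weight average over methods: p_ij = (1/4) * sum_k p_ijk.
ensemble = np.mean(np.stack(team_predictions, axis=0), axis=0)
assert np.allclose(ensemble.sum(axis=1), 1.0)  # rows still sum to unity
```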

### Amazon Web Services compute costs

Approximate costs for machine-learning methods were calculated using Amazon EC2 on-demand pricing. We assumed a single-GPU machine with sufficient memory (128 GB) costing $1.14 per hour (g3.8xlarge). This totals $51.30 for 45 h of GPU time. For the CPU-based methods, which required 20 GB of solid-state drive, an x2gd.medium instance, costing $0.08 per hour, would be sufficient. This totals $0.05 for the 0.66 CPU hours used.

### Robustness analysis

To assess the robustness of the winning teams’ ranking to the choice of validation dataset, the lab-of-origin predictions were subsampled so that predictions were retained for only 80% of the sequences. Sampling was performed without replacement for each subsample. The rank order of predictions was re-computed on each subsampled dataset, from which we computed the metrics of interest, including top-1 accuracy, top-10 accuracy and X99 score. To generate a distribution of scores, this resampling strategy was performed 1000 times. Distributions were compared using the Kolmogorov–Smirnov (KS) test; all pairwise comparisons between teams on all metrics (top-1 accuracy, top-10 accuracy, and X99) were significantly different at p < 0.01.
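A minimal sketch of this subsampling procedure is shown below, reusing the hypothetical rank_of_true array from the earlier sketch; function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

def subsampled_scores(rank_of_true, n_resamples=1000, fraction=0.8, seed=0):
    """Top-10 accuracy over repeated 80% subsamples drawn without replacement."""
    rng = np.random.default_rng(seed)
    n = len(rank_of_true)
    k = int(fraction * n)
    scores = []
    for _ in range(n_resamples):
        idx = rng.choice(n, size=k, replace=False)          # subsample without replacement
        scores.append(np.mean(rank_of_true[idx] <= 10))     # top-10 accuracy on the subsample
    return np.array(scores)

# Score distributions from two teams could then be compared with a KS test, e.g.
# ks_2samp(subsampled_scores(ranks_team_a), subsampled_scores(ranks_team_b)).
```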

### Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.