Microbial Typing by Machine Learned DNA Melt Signatures

There is an ongoing demand for a simple, broad-spectrum molecular diagnostic assay for pathogenic bacteria. For this purpose, we developed a single-plex High Resolution Melt (HRM) assay that generates complex melt curves for bacterial identification. Using the internal transcribed spacer (ITS) region as the phylogenetic marker for HRM, we observed more complex melt curve signatures than those of 16S rDNA amplicons, with enhanced interspecies discrimination. We also developed a novel Naïve Bayes curve classification algorithm with statistical interpretation and achieved 95% accuracy in differentiating 89 bacterial species in our library using leave-one-out cross-validation. In a pilot clinical validation on 59 culture-positive mono-bacterial blood culture samples, our method identified the etiologic organisms at the species level with 90% accuracy. Our findings suggest that broad bacterial sequences may be simply, reliably and automatically profiled by the ITS HRM assay for clinical adoption.

genomic DNA calculated based on its genome copies (GC) was amplified in a 40 (a) and a 50 (b)-cycle PCR targeting the ITS region. The PCR was immediately followed by HRM to produce corresponding derivative melt curves. The limit of detection (LOD) was determined to be the concentration where melt curve profile was maintained (arrows).

Naïve Bayes
In this section, we present details of the proposed adaptive Naïve Bayes algorithm. Given $C$ species in the reference dataset, the $i$-th species $C_i$ has $N_i$ training samples. For any new unknown test sample $x$, we aim to calculate the posterior probability via Bayes' theorem:
$$p(C_k \mid x) = \frac{p(C_k)\, p(x \mid C_k)}{\sum_{j=1}^{C} p(C_j)\, p(x \mid C_j)}$$
where $p(C_k)$ is the prior for the $k$-th species, and $p(x \mid C_k)$ is the likelihood function given all the training samples in the $k$-th species.
The prior information is assumed to be homogeneous, so every species receives the same prior:
$$p(C_k) = \frac{1}{C}$$
The likelihood function $p(x \mid C_k)$ is calculated with a Gaussian distribution of the curve distance between the test sample $x$ and the training samples of the $k$-th species.
The essence of the algorithm lies in how this distance is calculated: it measures the similarity between the curve shapes of the test sample and the training reference.
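As a minimal sketch of how a homogeneous prior and a distance-based Gaussian likelihood combine into a posterior, the Python below assumes that the likelihood for a species is the mean of Gaussian kernels over the distances to its training curves. That averaging, the bandwidth `sigma`, and the function names are illustrative assumptions, not necessarily the form used in this work.

```python
import math

def gaussian_likelihood(distances, sigma=1.0):
    # Assumed form: mean Gaussian kernel over the distances between the
    # test curve and each training curve of one species.
    kernels = [math.exp(-(d * d) / (2.0 * sigma * sigma)) for d in distances]
    return sum(kernels) / len(kernels)

def posterior(distances_by_species, sigma=1.0):
    # Homogeneous prior: every reference species is equally likely a priori.
    prior = 1.0 / len(distances_by_species)
    joint = {
        name: prior * gaussian_likelihood(dists, sigma)
        for name, dists in distances_by_species.items()
    }
    evidence = sum(joint.values())
    if evidence == 0.0:
        # All likelihoods underflowed; fall back to the prior.
        return {name: prior for name in joint}
    # Bayes' theorem: normalise the joint probabilities over all species.
    return {name: value / evidence for name, value in joint.items()}

# Hypothetical distances from one test curve to two reference species.
example = {"species_A": [0.2, 0.3], "species_B": [1.5, 1.8]}
print(posterior(example))
```

With a homogeneous prior the posterior is driven entirely by the likelihoods, so the species whose training curves lie closest to the test curve receives the highest probability.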
Assume a test species is denoted $X = \{x_1, x_2, \ldots, x_m\}$, where $m$ is the number of replicates of this species. We want a consensus prediction of whether this species falls into any species category of the reference panel. We assume each replicate is of equal importance, so we average the posterior probabilities of the replicates to obtain the prediction for the test species:
$$p(C_k \mid X) = \frac{1}{m} \sum_{j=1}^{m} p(C_k \mid x_j)$$
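The replicate averaging can be illustrated with a short Python sketch. The function names and the example posteriors are hypothetical; each replicate's posterior is assumed to be a dictionary mapping a species name to a probability.

```python
def average_posteriors(replicate_posteriors):
    # Equal weight for every replicate: take the arithmetic mean
    # of the per-replicate posterior for each species.
    m = len(replicate_posteriors)
    species = replicate_posteriors[0].keys()
    return {k: sum(p[k] for p in replicate_posteriors) / m for k in species}

def predict(replicate_posteriors):
    # Consensus call: the species with the highest averaged posterior.
    consensus = average_posteriors(replicate_posteriors)
    best = max(consensus, key=consensus.get)
    return best, consensus[best]

# Two hypothetical replicates of the same test species.
replicates = [{"A": 0.8, "B": 0.2}, {"A": 0.6, "B": 0.4}]
print(predict(replicates))
```

Averaging probabilities, rather than taking a majority vote over per-replicate calls, lets a confident replicate outweigh an ambiguous one.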

Curve Similarity Calculation
There are three steps in the calculation of curve similarity. First, we align each curve at a temperature of 53°C. This ensures that the curves are well aligned, which improves the accuracy of the subsequent similarity calculation. Second, we apply the Hilbert transformation to the curves. The Hilbert transformation is a convolution of the curve with the kernel $1/(\pi t)$:
$$H(f)(t) = \frac{1}{\pi}\, \mathrm{p.v.} \int_{-\infty}^{\infty} \frac{f(\tau)}{t - \tau}\, d\tau$$
where $f(t)$ denotes the curve. The output is a complex-valued function whose real part is the original input and whose imaginary part is the Hilbert-transformed curve. Third, we calculate the distance between two curves by combining the real and imaginary parts.
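The sketch below illustrates the transform step on sampled curves in plain Python. It builds the discrete analytic signal (real part: the input curve; imaginary part: its Hilbert transform) with a naive DFT, and then combines both parts into one distance. The Euclidean combination in `curve_distance` is an assumption made for illustration, since the exact formula is not reproduced here, and the curves are assumed to be already aligned and sampled on the same temperature grid.

```python
import cmath
import math

def dft(values, inverse=False):
    # Naive O(n^2) discrete Fourier transform; adequate for short curves.
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(values[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out

def analytic_signal(curve):
    # Keep DC (and Nyquist for even n), double positive frequencies,
    # zero negative frequencies, then transform back.
    n = len(curve)
    weights = [0.0] * n
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        upper = n // 2
    else:
        upper = (n + 1) // 2
    for k in range(1, upper):
        weights[k] = 2.0
    spectrum = dft(curve)
    return dft([s * w for s, w in zip(spectrum, weights)], inverse=True)

def curve_distance(f, g):
    # Assumed combination: Euclidean distance over both the real and
    # imaginary parts of the two analytic signals.
    af, ag = analytic_signal(f), analytic_signal(g)
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(af, ag)))

# Two hypothetical curves: a cosine and a shifted copy of it.
n = 16
curve_a = [math.cos(2 * math.pi * t / n) for t in range(n)]
curve_b = [math.cos(2 * math.pi * (t - 2) / n) for t in range(n)]
print(curve_distance(curve_a, curve_b))
```

Including the imaginary part makes the distance sensitive to local shape and phase differences between curves, not only to pointwise differences in amplitude.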

Details in predicting out-of-reference samples
To determine whether the test target belongs to any species in the reference panel, we adapt the original Naïve Bayes algorithm to accommodate the prediction of out-of-reference samples. Assume a test species is denoted $X = \{x_1, x_2, \ldots, x_m\}$, where $m$ is the number of replicates of this species. First, for each replicate, we assign a prior probability of being an out-of-reference sample by examining the curve region between 52.5°C and 53.5°C. This indicates whether the replicate is an outlier, because most outlier curves produce unusual peaks in this temperature region. Further, when we apply Naïve Bayes, we assign the posterior probability of being out-of-reference through a gate function that is triggered when the gated quantity falls below a threshold $\theta$; we set $\theta = 0.3$ in our experiments.
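A gate of this kind can be sketched in a few lines of Python. The sketch assumes, purely for illustration, that the gated quantity is the highest consensus score over the reference species; the quantity actually used is not reproduced here, and the function name and labels are hypothetical.

```python
def classify_with_gate(scores, theta=0.3, label="out-of-reference"):
    # scores: hypothetical mapping from reference species to a consensus score.
    best = max(scores, key=scores.get)
    # Gate: if even the best-matching species scores below theta,
    # report the sample as out-of-reference instead of forcing a call.
    if scores[best] < theta:
        return label
    return best

print(classify_with_gate({"A": 0.70, "B": 0.30}))
print(classify_with_gate({"A": 0.20, "B": 0.25, "C": 0.15}))
```

A threshold gate of this form trades sensitivity for specificity: raising `theta` rejects more ambiguous samples as out-of-reference, at the cost of rejecting some true in-reference samples.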