Introduction

The newly developed discipline of nano-NMR1,2,3,4,5,6,7 aims to reduce the minimal NMR sample size by many orders of magnitude, thus increasing the NMR sensitivity and spatial resolution down to a few molecules8. This is achieved by replacing the macroscopic coil of the NMR setup, which measures the magnetic field, with a single controllable spin or an ensemble of such spins, e.g., NV centers in diamond, which serve as tiny magnetometers. Recent experiments have shown that it is possible to estimate the spectrum of artificial signals and of signals from polarized samples with high resolution9,10,11,12,13. However, the obvious advantage of obtaining spectral information about tiny quantities of molecules is offset by the extra noise that accompanies most configurations of this setup. This extra noise has several sources, including the finite NV coherence time (magnetic noise), the controller noise (laser and microwave operations), and, most importantly, the diffusion-induced noise, which is negligible in the regular NMR setup but is extremely large in the nano-NMR setup, broadening the line-width beyond the required resolution. This noise creates a serious bottleneck, as the crucial information is encoded in the tiny chemical shifts and small energy gaps caused by J-couplings. That is, the nano-NMR setup is usually characterized by a weak measured signal that is masked by strong noise.

Moreover, the precise noise model is usually complex and unknown. Consequently, achieving spectral discrimination between weak, similar signals of nearby frequencies is an intractable data-processing challenge. In particular, because the noise model is complex and unknown, conventional data analysis methods struggle to tackle this noise: they usually reach optimal discrimination only when full knowledge of the noise model is available.

In this work we show that the challenge of spectral discrimination between weak and similar signals in the presence of strong and complex noise can be efficiently confronted by DL algorithms, which effectively learn the noise model. Moreover, we show that DL methods are capable of learning the noise model from a small amount of data, which only needs to be gathered for a few minutes. This means that a DL algorithm can analyze a test signal with the same efficiency as numerically demanding Bayesian methods that rely on precise knowledge of the model. In addition, we show that DL methods can be extremely useful in challenging frequency resolution problems and can possibly outperform Bayesian methods even when the latter are assumed to have full knowledge of the model and unlimited computing power.

DL techniques have been successfully applied to spectral data in the fields of Astronomy, Chemistry, Geosciences, and Bioinformatics14. Spectral data from these disciplines pose similar challenges: (1) high data dimensionality; (2) difficulty of modeling the important features from first principles; (3) dirty environments with many classes of objects that need to be differentiated along with varying signal intensities; (4) importance of subtle differences in the signal. Despite these difficulties, which apply in our context as well, impressive achievements have been made, such as the detection of narcotics in Raman spectroscopy data with a 0.5% error rate15.

DL methods have also been used for the analysis of NMR data, in particular in the context of automated protein structure determination via peak-picking of nuclear magnetic resonance spectra16 and of biological macromolecules17, and recently also in the context of analyzing a variety of spectral images of proteins by using a support vector machine classifier combined with histograms of oriented gradients18 and by using a convolutional neural network19. In addition, deep learning techniques such as long short-term memory networks20 and variational auto-encoder networks21 have been used in NMR applications for material characterization and subsurface characterization.

We believe that the success of DL methods in the analysis of regular NMR data should be amplified in the nano-NMR setup due to the larger amount of noise in this setting, which originates from two main ingredients that are absent in the regular NMR setting. The first ingredient is the origin of the signal. While in regular NMR the signal is created by thermal polarization, in nano-NMR the signal is created by statistical polarization22, which implies that in the nano-NMR setup the noise is stronger. The second ingredient is the quantum projection noise, which is of a Poissonian or a Bernoulli nature and in many cases is the dominant source of noise. Here we provide evidence that DL methods can tackle these noise sources efficiently.

To evaluate the efficiency of DL methods for the spectroscopy of nano-NMR data, we consider two problems: frequency discrimination and frequency resolution. We first examine the ability of DL methods to discriminate between two signals corresponding to two different frequencies. In particular, we consider data from signals that were read by an NV center, which simulate noisy nano-NMR data. Typical data for these two frequencies are shown in Fig. 1, which presents two time traces of the datasets together with their Fourier transforms. It is immediately clear that it is impossible to discriminate between the two frequencies based on the Fourier transform alone, because the signal has strong phase noise on top of the detection noise. In this work we show that DL methods are able to classify the data with the same efficiency as Bayesian methods, which use full knowledge of the signal and noise model and are numerically much more demanding than DL methods. The advantage of DL methods is also indicated by their superior performance in the frequency discrimination of the experimental data, where the signal and noise models are not fully known.

Figure 1

Typical noisy data of the two different frequencies that we aim to discriminate in this work. The oscillating magnetic signals at the two frequencies suffer from strong phase noise and are read by an NV center, which adds quantum noise to the output binary signal (see Eq. 10). (Upper right) The time-trace binary signal of the first frequency, 250 Hz, together with its Fourier transform after subtracting the zero frequency (upper left). (Lower right) The time-trace binary signal of the second frequency, 251.6 Hz, and its Fourier transform (lower left).

We then employ DL methods to tackle the problem of frequency resolution in a noisy environment. We show that DL methods can efficiently discriminate between the signal of a single frequency and the signal of two nearby frequencies that have a strong amplitude and phase noise.

Our results strongly suggest that DL methods can effectively learn the physical and noise models and by that constitute an efficient alternative to Bayesian methods, which require a priori knowledge on the physical and noise models.

Frequency discrimination

The physical model

We consider the problem of discrimination between two signals corresponding to two different frequencies by a single quantum probe. In the nano-NMR setup this corresponds, for example, to the scenario where a single NV center, which serves as a tiny magnetometer, is placed in the proximity of a sample that contains two known molecules between which we wish to discriminate. Specifically, in the presence of a single frequency signal (a single molecule) the Hamiltonian of the spin probe is given by

$$H_{s_i} = g_i \cos(\omega_i t + \phi_i)\, S_z,$$
(1)

where gi, ωi, and ϕi are the amplitude, frequency, and (random) phase of signal i, respectively; this is the standard setting in nano-NMR experiments1,2,3,4,23. The probe, which is initially polarized along \(\hat{x}\), evolves freely according to \({H}_{{s}_{i}}\) for a short duration, Δt, and is then measured along \(\hat{y}\). In the measurement scheme of a single experiment, the sequence of probe operations consists of initialization, evolution, and measurement, and is repeated many times under the constant presence of a signal (Fig. 2(a)). In the case of a single-shot measurement, the measurement result is a sequence of zeros and ones, Fig. 1 (right), and the probability of a successful measurement (one) is given by

$$P(t) = \sin^2\left[\frac{g_i}{2\omega_i}\left(\sin[\omega_i t + \phi_i] - \sin[\omega_i (t - \Delta t) + \phi_i]\right) + \frac{\pi}{4}\right].$$
(2)
Figure 2

The physical model. (a) The probe, which is initially polarized along \(\hat{x}\), freely evolves according to \({H}_{{s}_{i}}\) (Eq. 1) for a short duration, Δt, and then is measured along \(\hat{y}\). In the measurement scheme of a single experiment, the sequence of probe operations consists of initialization, evolution, and measurement, which is repeated N times under the constant presence of a signal. In each experiment, the frequency of the signal is equal to one of two known frequencies, ω1 and ω2. (b) Our aim is to discriminate between the two frequencies. A single experiment results in a string of bits, x = {1, 0, 0, 1, …}. Given x, we want to obtain an estimation of the frequency of the signal, ωest = ω1 or ωest = ω2.

We start by considering an ideal scenario (no noise or inefficiencies) in which Eq. (2) holds. We assume that in each experiment the signal corresponds to one of two known frequencies (ω1 and ω2) and that the amplitudes of the signals are known, but that in each experiment the signal has an unknown, uniformly distributed random phase. A single experiment results in a string of bits, x = {1, 0, 0, 1, …}, where 1 and 0 correspond to a detection of the ms = 0 state or the ms = −1 state of the NV center, respectively. Given x, we want to obtain an estimation of the frequency of the signal, ωest = ω1 or ωest = ω2 (Fig. 2(b)). We quantify the performance of a discrimination method M by the error probability of the frequency estimation, which is defined by

$$P_M^{error} \equiv \frac{1}{2}\sum_{\substack{i=1 \\ j \neq i}}^{2} P_M(\omega_{est} = \omega_j \,|\, \omega_i),$$
(3)

where PM (ωest = ωj|ωi) is the probability of method M to output ωest = ωj given that the frequency of the signal is ωi.
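To make the measurement model concrete, a minimal numpy sketch for drawing bit strings from Eq. (2) could look as follows. This is an illustration under stated assumptions rather than code from the paper: we treat ω as an angular frequency (the values quoted in Hz may therefore need a 2π factor, depending on convention), and the function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def success_probability(t, g, omega, phi, dt):
    """Eq. (2): probability of reading out 1 at the end of the interval [t - dt, t]."""
    acc = (g / (2 * omega)) * (np.sin(omega * t + phi)
                               - np.sin(omega * (t - dt) + phi))
    return np.sin(acc + np.pi / 4) ** 2

def simulate_experiment(g, omega, dt=0.5, n_meas=1000):
    """One experiment: a fresh uniform random phase, then n_meas Bernoulli readouts."""
    phi = rng.uniform(0, 2 * np.pi)
    t = dt * np.arange(1, n_meas + 1)
    p = success_probability(t, g, omega, phi, dt)
    return (rng.random(n_meas) < p).astype(int)
```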

Full Bayesian method

In the ideal scenario considered here, we have full knowledge of the model (Eq. (2)) and the only unknowns are the random phases ϕi. Hence, we can simply utilize a Full Bayesian method known as the likelihood-ratio test and denoted by MFB, where for each frequency we calculate the maximal log-likelihood over the unknown random phases. That is,

$$L_1 = \max_{\phi_k} L(\phi_k | x, \omega_1), \qquad L_2 = \max_{\phi_k} L(\phi_k | x, \omega_2),$$
(4)

where

$$L(\phi_k | x, \omega_i) = \sum_j \left( x_j \log P(t_j, \omega_i, \phi_k) + (1 - x_j) \log\left(1 - P(t_j, \omega_i, \phi_k)\right) \right).$$
(5)

We estimate the frequency according to the larger likelihood; that is

$$\omega_{est} = \begin{cases} \omega_1 & L_1 > L_2 \\ \omega_2 & \text{otherwise}. \end{cases}$$
(6)

As MFB utilizes the maximal information on the signal, it obtains the minimal possible error for an unbiased estimator, and it can therefore serve as a benchmark for evaluating the efficiency of a learning method. Hence, its error probability serves as a lower bound for the DL method. Bayesian methods are known to be optimal given the maximal amount of information, provided that the optimization can be performed efficiently, which is usually not the case, particularly in a noisy environment. In order to verify that we indeed have the optimal method, we compare the results to an analytical calculation of the Fisher information (FI), which is possible in this case.
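For illustration, a minimal sketch of the likelihood-ratio test of Eqs. (4)-(6) is given below; maximizing over a finite phase grid is our implementation choice, and success_probability refers to the Eq. (2) sketch above.

```python
import numpy as np

def log_likelihood(x, omega, phi, g, dt):
    """Eq. (5): Bernoulli log-likelihood of the bit string x."""
    t = dt * np.arange(1, len(x) + 1)
    p = success_probability(t, g, omega, phi, dt)
    eps = 1e-12                                   # guard against log(0)
    return np.sum(x * np.log(p + eps) + (1 - x) * np.log(1 - p + eps))

def discriminate_fb(x, omega1, omega2, g, dt=0.5, n_phases=360):
    """Eqs. (4) and (6): maximize over a phase grid, return the more likely frequency."""
    phases = np.linspace(0, 2 * np.pi, n_phases, endpoint=False)
    L1 = max(log_likelihood(x, omega1, p, g, dt) for p in phases)
    L2 = max(log_likelihood(x, omega2, p, g, dt) for p in phases)
    return omega1 if L1 > L2 else omega2
```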

In general, full knowledge is not available, due either to a lack of knowledge of the noise model in the experiment and detection inefficiencies, or to a lack of knowledge of the signal. In this case, we can utilize a correlation-based method, Mcorr, for frequency discrimination. To this end, we first use a training set of measurement results, Xtrain, for which the frequency of the signal is known. For each x ∈ Xtrain we calculate the correlation vector \({C}_{k}={\langle {x}_{i}{x}_{i+k}\rangle }_{i}\) (here we replace the 0 bit by −1). Then, for each frequency we calculate the averaged correlation vector, \({C}^{{\omega }_{i}}={\langle {C}_{k}\rangle }_{x\in {X}_{train}^{{\omega }_{i}}}\), where \({X}_{train}={X}_{train}^{{\omega }_{1}}\cup {X}_{train}^{{\omega }_{2}}\). To estimate the frequency of an unknown signal, we calculate its correlation vector, Ck, and then the distances

$$D_1 = \| C_k - C^{\omega_1} \|_{L_2}, \qquad D_2 = \| C_k - C^{\omega_2} \|_{L_2},$$
(7)

by the L2 norm. We estimate the frequency according to the smaller distance; that is,

$$\omega_{est} = \begin{cases} \omega_1 & D_1 < D_2 \\ \omega_2 & \text{otherwise}. \end{cases}$$
(8)

This method, however, disregards higher-order correlation functions as well as the finite precision of the correlation functions themselves, which varies considerably between nearest-neighbor and more distant separations. While in the limit where all these effects are taken into account this approach should reach the optimum, it is numerically very challenging, or even impossible, to apply to most problems of interest.
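A minimal sketch of Mcorr (Eqs. (7)-(8)) is given below; X_train_w1 and X_train_w2 are hypothetical lists of training bit strings with known frequency labels.

```python
import numpy as np

def correlation_vector(x):
    """C_k = <x_i x_{i+k}>_i, with the 0 bit mapped to -1 as in the text."""
    s = 2 * np.asarray(x) - 1
    n = len(s)
    return np.array([np.mean(s[:n - k] * s[k:]) for k in range(1, n)])

# training: averaged correlation template per known frequency
C_w1 = np.mean([correlation_vector(x) for x in X_train_w1], axis=0)
C_w2 = np.mean([correlation_vector(x) for x in X_train_w2], axis=0)

def discriminate_corr(x, omega1, omega2):
    """Eqs. (7)-(8): pick the frequency whose template is closer in L2 norm."""
    C = correlation_vector(x)
    D1, D2 = np.linalg.norm(C - C_w1), np.linalg.norm(C - C_w2)
    return omega1 if D1 < D2 else omega2
```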

Deep learning method

To overcome the lack of knowledge of the model, we suggest using a supervised DL model, which we denote by MDL. In particular, we consider a feed-forward fully connected neural network. The main reason for choosing a fully connected neural network is that the signal (frequency) information is encoded in the correlations between the values of different input neurons (measurement results), and in particular far-apart input neurons. Similar to Mcorr, we use a training dataset of measurement results of known signals (known labels) to train MDL. We denote the labels of the two frequencies by 0 and 1. MDL is then applied to a test dataset and outputs estimations of the frequencies of the test measurement results.

Our DL model is a feed-forward fully connected neural network of four layers (two hidden layers), as depicted in Fig. 3. While two hidden layers are sufficient for the scenarios considered in this work, more complex noise models may require deeper neural networks. The first layer is called the input layer. The neurons of the input layer pass the input data, in our case the measurement results x of a single experiment, to the second layer. The output of neuron j in the second (hidden) layer is given by \({f}_{j}(z)=f(\sum _{i}\,{w}_{ij}{x}_{i}+{b}_{j})\), where f is the activation function, and wij and bj are the weights and biases, respectively. For the hidden layers we use the rectified linear unit (ReLU) activation function, f(z) = max(0, z). The output of the second layer is then fed as input to the third layer, and so on. The last layer is called the output layer. In our model the output layer has one neuron, whose low and high activation levels are associated with the two possible labels (frequencies). For the output neuron we use the Sigmoid activation function. We use the mean squared error (MSE) between the outputs of the learning model and the labels of the training set as the loss function, which is minimized during training by optimizing the weights and biases of the model. Note that there is no special reason for choosing the Sigmoid activation function with the MSE loss function; the softmax activation function together with the cross-entropy loss function may be used as well. Overfitting is avoided by restricting the total number of nodes in the network (and hence the number of free variables). In particular, for the examples considered in this work we use a second layer of 20 nodes and a third layer of 35 nodes (a small modification of the number of neurons in each of the two hidden layers would not change the model's accuracy significantly). For the test dataset, after applying the Sigmoid activation function to the output of MDL, we label the output 1 or 0 according to whether it is above or below 0.5. We then calculate \({P}_{{M}_{DL}}\) by the loss function (the MSE) between the output labels and the true labels.
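As an illustration, the architecture described above could be sketched in Keras as follows. The layer sizes and the Sigmoid/MSE choice follow the text; the optimizer is our assumption, as the training procedure is not fully specified.

```python
from tensorflow import keras

def build_mdl(n_measurements=1000):
    """Feed-forward fully connected network: input, two ReLU hidden layers, Sigmoid output."""
    model = keras.Sequential([
        keras.layers.Input(shape=(n_measurements,)),  # one input neuron per measurement result
        keras.layers.Dense(20, activation="relu"),    # first hidden layer
        keras.layers.Dense(35, activation="relu"),    # second hidden layer
        keras.layers.Dense(1, activation="sigmoid"),  # output neuron: label 0 or 1 (frequency)
    ])
    # optimizer is our choice; the loss (MSE) follows the text
    model.compile(optimizer="adam", loss="mse",
                  metrics=[keras.metrics.BinaryAccuracy()])
    return model
```

A test output is then labeled 1 if the Sigmoid activation exceeds 0.5 and 0 otherwise, as described above.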

Figure 3

The MDL neural network is a feed-forward fully connected neural network. The input layer inputs the measurement results x to the second layer (first hidden layer). The output of the last hidden layer is fed to the output layer, which results in the frequency discrimination, ωest = ω1 or ωest = ω2.

Numerical analysis

As a way of testing the performance of MDL in terms of frequency discrimination, we constructed numerical sets of measurement results, x, according to Eq. (2) for two different frequencies, where the phase, ϕi, was chosen randomly (uniformly distributed between 0 and 2π) for each x. The input data were generated with g1 = g2 = ω1 = 10/(2π) Hz, ω2 = ω1 + Δω, Δt = 0.5 sec, and a total measurement time of Ttot = 500 sec (1000 measurements). Part of the datasets was used for training and the remainder for testing the learning model. We compared the performance of MFB to that of MDL and Mcorr. In Fig. 4 we show the discrimination error probabilities, \({P}_{{M}_{FB}}\), \({P}_{{M}_{DL}}\), and \({P}_{{M}_{corr}}\), as a function of the frequency difference, Δω, between the two signals, as well as the corresponding MDL receiver operating characteristic (ROC) curves and areas under the curve (AUC). We considered a first layer of 1000 nodes (1000 measurements), a second layer of 20 nodes, and a third layer of 35 nodes. This choice of the number of nodes limits the free-variable space and allows us to avoid overfitting without resorting to regularization methods. In this ideal scenario, both Mcorr and MDL approach the optimal performance of MFB, even though neither method has any a priori information on the physical model.
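A hypothetical end-to-end wiring of the sketches above, for a single value of the frequency difference, might look as follows; the dataset size, train/test split, number of epochs, and batch size are our assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

omega1 = 10 / (2 * np.pi)           # parameters from the text: g = omega_1 = 10/(2*pi) Hz
omega2 = omega1 + 0.05              # an example value of Delta-omega

# 5000 labelled experiments per frequency (dataset size is our choice)
X = np.array([simulate_experiment(g=omega1, omega=w)
              for w in (omega1, omega2) for _ in range(5000)])
y = np.repeat([0, 1], 5000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = build_mdl(n_measurements=1000)
model.fit(X_tr, y_tr, epochs=30, batch_size=64, verbose=0)
_, acc = model.evaluate(X_te, y_te, verbose=0)
print("estimated error probability:", 1 - acc)   # cf. Eq. (3)
```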

Figure 4

Performance in the ideal model scenario. Left: discrimination error probabilities (Eq. (3)) as a function of the frequency difference, Δω. Full Bayesian, \({P}_{{M}_{FB}}\) (green squares), Deep Learning, \({P}_{{M}_{DL}}\) (red circles), correlations, \({P}_{{M}_{corr}}\) (blue hexagons), and analytical bound on \({P}_{{M}_{FB}}\) (dashed black). The input data were generated according to Eq. (2) with g1 = g2 = ω1 = 10/(2π) Hz, ω2 = ω1 + Δω, Δt = 0.5 sec, and a total measurement time of Ttot = 500 sec (1000 measurements). Right: receiver operating characteristic (ROC) curve and area under the curve (AUC) of MDL for different values of Δω, corresponding to the first, third, fifth and seventh points from left in the left figure.

In order to provide indications of the performance of MDL in real-world noisy scenarios, we further considered a few more noise models and assumed that these noise models are "unknown"; hence, they are not taken into account in the Bayesian methods MFB and Mcorr, which remain unchanged as described above. This serves as an indication of how much better the performance of MDL could be in comparison to MFB and Mcorr in a real-world scenario, when the noise model is truly unknown to some extent. The first noise model is still a phase noise. While previously the random (uniformly distributed) phase of the signal was constant during a single experiment, here the random phase changes once during a single experiment, where the second random phase is also uniformly distributed. Moreover, the time interval at which the phase change occurs is uniformly distributed over the time intervals of a single experiment (1000 time intervals). The discrimination error probabilities, \({P}_{{M}_{FB}}\), \({P}_{{M}_{DL}}\), and \({P}_{{M}_{corr}}\), as a function of the frequency difference, Δω, between the two signals are shown in Fig. 5(a). It is clear that while the phase noise damages the discrimination capability of MFB and Mcorr, MDL is capable of learning the noise model. The second noise model considers a magnetic noise δb, which modifies the Hamiltonian of the probe, Eq. (1), to

$$H_{s_i} = g_i \cos(\omega_i t + \phi_i)\, S_z + \delta b\, S_z.$$
(9)
Figure 5

Performance in noisy scenarios. Discrimination error probabilities (Eq. (3)) as a function of the frequency difference, Δω. Full Bayesian, \({P}_{{M}_{FB}}\) (green squares), Deep Learning, \({P}_{{M}_{DL}}\) (red circles), and correlations, \({P}_{{M}_{corr}}\) (blue hexagons). (a) Phase noise - the random phase of the signal is randomly changed once during a single experiment at a random time interval. (b) Magnetic noise - the probe is subject to a random magnetic field, which is randomly changed once during a single experiment at a random time interval. (c) Amplitude noise - the amplitude of the signal has a different (random) value in each time interval. (d) Mixed noise scenario, which includes all of the above noise models. See text for more details. (e) ROC curve and AUC of MDL for different values of Δω, corresponding to the second, fourth, sixth, eighth and tenth points from left in figure (d).

Similar to the phase noise, we assume that δb changes once during a single experiment and that the time interval at which the change of δb occurs is uniformly distributed over the time intervals of a single experiment. Each of the two values of δb is normally distributed with zero mean and a standard deviation of σ = gi/5 = 2/(2π) Hz. The discrimination error probabilities, \({P}_{{M}_{FB}}\), \({P}_{{M}_{DL}}\), and \({P}_{{M}_{corr}}\), as a function of the frequency difference, Δω, between the two signals are shown in Fig. 5(b). In this case MDL handles the magnetic noise better than MFB and much better than Mcorr. In the third noise model we consider noise in the amplitude of the signal. Specifically, we assume that the amplitude value is different in each time interval and that it is normally distributed with a mean of g = 10/(2π) Hz (the previous value of the noiseless amplitude) and a standard deviation equal to the mean value, that is, σ = g = 10/(2π) Hz. The discrimination error probabilities, \({P}_{{M}_{FB}}\), \({P}_{{M}_{DL}}\), and \({P}_{{M}_{corr}}\), as a function of the frequency difference, Δω, between the two signals are shown in Fig. 5(c). In this case MDL performs slightly better than Mcorr and better than MFB. Lastly, we consider the mixed-noise scenario in which all three of the above noise models are included. The discrimination error probabilities, \({P}_{{M}_{FB}}\), \({P}_{{M}_{DL}}\), and \({P}_{{M}_{corr}}\), as a function of the frequency difference, Δω, between the two signals are shown in Fig. 5(d), and the corresponding MDL ROC curves and AUC are shown in Fig. 5(e). It is apparent that MDL is still capable of learning the noise model, while the performance of MFB and Mcorr is severely degraded when we assume no further knowledge of the noise model. Of course, if we have more knowledge of the noise model, we may be able to modify the Bayesian methods accordingly. However, such a modification implies that the optimization is performed over a larger set of free variables, and therefore longer run times, while the DL run time remains unchanged. Moreover, the above results suggest that Bayesian methods can be very sensitive to the noise model; a minor unknown difference between the true noise model and the assumed noise model could significantly reduce the performance of the Bayesian method (say, for example, that there are three phase changes in a single experiment instead of two).
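To make the mixed-noise scenario concrete, here is a hedged sketch of a data generator combining the three noise models above. The jump times are uniformly random, as in the text; the assumption specific to this sketch is that the magnetic offset δb enters the accumulated phase as δb·Δt/2, mirroring the factor of 1/2 in Eq. (2).

```python
import numpy as np

def simulate_mixed_noise(g, omega, rng, dt=0.5, n_meas=1000):
    """Mixed-noise bit-string generator: one phase jump, one delta_b jump,
    and an independent Normal(g, g) amplitude in every time interval."""
    t = dt * np.arange(1, n_meas + 1)
    idx = np.arange(n_meas)
    # phase noise: one jump between two uniformly random phases
    phi = np.where(idx < rng.integers(n_meas),
                   rng.uniform(0, 2 * np.pi), rng.uniform(0, 2 * np.pi))
    # magnetic noise: one jump between two Normal(0, g/5) offsets
    db = np.where(idx < rng.integers(n_meas),
                  rng.normal(0, g / 5), rng.normal(0, g / 5))
    # amplitude noise: per-interval random amplitude
    gs = rng.normal(g, g, n_meas)
    acc = (gs / (2 * omega)) * (np.sin(omega * t + phi)
                                - np.sin(omega * (t - dt) + phi)) + db * dt / 2
    p = np.sin(acc + np.pi / 4) ** 2
    return (rng.random(n_meas) < p).astype(int)
```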

Experimental verification

The NV center in diamond24,25,26 is one of the leading quantum probe systems for sensing, imaging, and spectroscopy. Here we considered frequency discrimination of measurement results obtained by a single NV center under ambient conditions. Two artificial signals were produced by a signal generator with frequencies ω1 = 2π × 250 Hz and ω2 = 2π × 251.6 Hz. Each signal was measured for a total measurement time of Ttot = 220 sec, with a time interval of Δt = 10 μs. From the raw data, we generated strings of 25000 measurement results (Ttot = 0.25 sec each) such that the phase corresponding to each x can be considered a random phase (no phase relation) and the frequencies cannot be resolved by a Fourier transform (see Fig. 1 (left)). The photon-detection efficiencies of a true detection (ms = 0) and a false detection (ms = −1) were low, ~7.4% and ~5.2% respectively, indicating low SNR and contrast.

In order to obtain a theoretical bound on the discrimination error, we considered a theoretical model with a modified probability of a successful measurement, which is given by

$$Q(t)={\eta }_{true}P(t)+{\eta }_{false}[1-P(t)],$$
(10)

where P(t) is given by Eq. (2), and ηtrue and ηfalse are the true and false detection efficiencies, respectively. Assuming that ηfalse = 0.7 ηtrue, we constructed numerical datasets according to Eq. (10) and set the amplitudes of the signals, g1 and g2, and the efficiency ηtrue for each signal to match the experimental results according to two constraints: (i) the power spectrum at the frequency of the signal of the numerical data was required to be approximately equal to the power spectrum of the experimental data; (ii) the averages of the experimental and numerical signals fulfilled \(\langle x\rangle =\frac{{\eta }_{true}+{\eta }_{false}}{2}\). For the numerical model we achieved \({P}_{{M}_{FB}}\approx 10.8 \% \) and \({P}_{{M}_{DL}}\approx 11.6 \% \) (see Fig. 6 (left), green square and red circle under the diamonds). These results are consistent with the experimental data, for which we obtained \({P}_{{M}_{DL}}^{\exp }\approx 12.1 \% \) (Fig. 6 (left), blue diamond), reaching \({P}_{{M}_{FB}}\) without having any information on the model. Moreover, the Full Bayesian method on the experimental data obtained only \({P}_{{M}_{FB}}^{\exp }\approx 16.2 \% \) (Fig. 6 (left), green diamond). This difference is due to the fact that the experimental statistics differ slightly from our probability function; while this creates a problem for the Bayesian method, the DL method is able to learn this difference and take it into account. The difference is expected to be much more dramatic in real nano-NMR experiments, in which there are many more uncertainties in the model. In addition, we analyzed \({P}_{{M}_{FB}}\) and \({P}_{{M}_{DL}}\) on the numerical data as a function of the frequency difference, Δω. The results are shown in Fig. 6 (left). The ROC curve and AUC of MDL on the experimental data are shown in Fig. 6 (right).
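As a small illustration, the finite-efficiency readout of Eq. (10) amounts to the following transformation of the ideal success probability; the matching of ηtrue described above is a one-dimensional search over this map.

```python
def detection_probability(p, eta_true, eta_false):
    """Eq. (10): Q(t) = eta_true * P(t) + eta_false * (1 - P(t))."""
    return eta_true * p + eta_false * (1 - p)

# constraint (ii) from the text: data generated with Q(t) should satisfy
# <x> = (eta_true + eta_false) / 2, with eta_false = 0.7 * eta_true
```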

Figure 6

Performance in the low-efficiency model scenario. Left: discrimination error probabilities (Eq. (3)). Full Bayesian, \({P}_{{M}_{FB}}\) (green squares), and Deep Learning, \({P}_{{M}_{DL}}\) (red circles), on numeric data; Full Bayesian, \({P}_{{M}_{FB}}^{exp}\) (green diamond), and Deep Learning, \({P}_{{M}_{DL}}^{exp}\) (blue diamond), on the experimental data, as a function of the frequency difference, Δω. The input numeric data were produced according to Eq. (10) with g1 = 12.5 kHz, g2 = 11.25 kHz, ω1 = 250 Hz, ω2 = ω1 + Δω, Δt = 10 μsec, and a total measurement time of Ttot = 0.25 sec (25000 measurements). Right: ROC curve and AUC of MDL on the experimental data, corresponding to the blue diamond in the left figure.

It is worth noting that due to the relatively large window size of 25000, a full analysis of Mcorr is not possible within a reasonable time on a common computer. A partial analysis of Mcorr (taking into account segments of two-point correlations only) on both the numerical model and the experimental data yielded \({P}_{{M}_{corr}}\gtrsim 0.4\). This indicates that DL could indeed be the better choice when there is a lack of knowledge of the model.

Comparison to other machine learning methods

So far we have shown that DL methods are useful for the problem of frequency discrimination in nano-NMR settings. In this section we ask whether other machine learning methods could be useful for this task, and if so, how they perform compared to DL.

Any method that is able to discriminate between two signals of nearby frequencies, as considered in the previous sections, must learn and acquire the information on the signals from the correlations between different measurement results (different xi). Hence, any successful discrimination method should involve some non-linearity. Indeed, a fully connected neural network with only linear layers fails at the considered discrimination problem (the achieved error probability is 1/2). We tested the performance of three other linear learning methods, namely logistic regression (with no interaction terms), K-nearest neighbours, and support vector machines (SVM) with a linear kernel, in the ideal model scenario (Fig. 4). Similarly to a fully connected linear neural network, these methods completely fail to discriminate between the signals and achieve an error probability of 1/2 for all values of Δω.

Regarding non-linear models, we considered two: SVM with the non-linear radial basis function (rbf) kernel, and XGBoost, an implementation of gradient-boosted decision trees (in our case we consider non-linear boosting, as linear boosting fails). We tested these two models in the ideal model scenario (Fig. 4) as well as in the mixed noise model scenario (see Fig. 5(d)). The results are shown in Fig. 7. These two methods achieve accuracies (discrimination error probabilities) very similar to those obtained by DL. However, there is a big difference in terms of the required computational resources, as these two methods consume more memory than DL and require much longer running times, for example, ~20 hours compared to ~20 minutes for DL. Indeed, it is not feasible to use these methods for discrimination on the experimental data, where the inputs are much larger (strings of 25000 measurements compared to strings of 1000). While we have not made an exhaustive study and analysis of machine learning methods, which is beyond the scope of this work, these findings strengthen the possible advantage and benefit of DL methods for the data processing of nano-NMR experimental results.
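For reference, a hypothetical baseline comparison along these lines could be set up as follows, reusing the labelled strings X_tr, y_tr, X_te, y_te from the wiring sketch above; the hyperparameters are our assumptions, as the paper does not list them.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier

baselines = [
    ("logistic regression (linear)", LogisticRegression(max_iter=1000)),  # fails: ~50% error
    ("SVM (rbf kernel)", SVC(kernel="rbf")),                              # comparable to DL
    ("XGBoost", XGBClassifier()),                                         # comparable to DL
]
for name, clf in baselines:
    clf.fit(X_tr, y_tr)
    print(name, "error probability:", 1 - clf.score(X_te, y_te))
```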

Figure 7

Comparison to other Machine Learning methods. Discrimination error probabilities (Eq. (3)) of Deep Learning, \({P}_{{M}_{DL}}\) (red circles), Support Vector Machines, \({P}_{{M}_{SVM}}\) (blue hexagons), and XGBoost, \({P}_{{M}_{XGB}}\) (magenta pentagons). (a) Ideal model (see Fig. 4). (b) Mixed noise model (see Fig. 5(d)).

Frequency resolution

In this section we consider the problem of discrimination between a signal with a single frequency and a signal with two proximal frequencies centred at the value of the single frequency (Fig. 8). We assume that the signals have strong amplitude and phase noise, which we model by the Ornstein-Uhlenbeck (OU) process, motivated by NV-probed statistically polarized nano-NMR experiments4,5,23. The OU process is a stochastic process that is stationary, Gaussian, and Markovian. It is given by \(x(t+dt)=x(t)-\frac{1}{\tau }x(t)dt+\sqrt{c}\,dW(t)\), where τ and c are positive constants called, respectively, the relaxation time and the diffusion constant, and dW(t) is a temporally uncorrelated normal random variable with mean 0 and variance dt. The environmental noise experienced by an NV center is faithfully modelled by an OU process27.
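The OU update quoted above translates directly into an Euler-Maruyama simulation; the following is a standard discretization sketch, not code from the paper.

```python
import numpy as np

def ou_process(n_steps, dt, tau, c, rng, x0=0.0):
    """x(t + dt) = x(t) - x(t) * dt / tau + sqrt(c) * dW(t), with dW ~ N(0, dt)."""
    x = np.empty(n_steps)
    x[0] = x0
    for i in range(1, n_steps):
        x[i] = x[i - 1] * (1 - dt / tau) + np.sqrt(c) * rng.normal(0.0, np.sqrt(dt))
    return x
```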

Figure 8

The problem of frequency resolution. The observed signal could be one of two possible signals that should be resolved. One signal (red) has two nearby frequencies (blue and orange) and corresponds to their sum. The second signal (green) has one frequency, centered between the two nearby frequencies of the first signal. The closer the two frequencies are, the harder it is to resolve the two signals.

Specifically, the Hamiltonian of the probe is given by

$$H = \left( \sum_{i=1}^{n} A_i(t) \cos[\delta_i t] - B_i(t) \sin[\delta_i t] \right) S_z,$$
(11)

where Ai and Bi undergo OU processes due to the amplitude and phase noise, and δi are the frequencies. The probability of a successful measurement (one) is

$$P(t) = \sin^2\left[ \sum_{i=1}^{n} \frac{A_i(t)}{\delta_i}\left(\sin[\delta_i t] - \sin[\delta_i (t - \Delta t)]\right) + \frac{B_i(t)}{\delta_i}\left(\cos[\delta_i t] - \cos[\delta_i (t - \Delta t)]\right) + \frac{\pi}{4} \right],$$
(12)

where n = 2 and δi = δc ± Δ/2. For two frequencies Δ is finite, and for a single frequency Δ = 0.

Numerical analysis

We constructed numerical datasets according to Eq. (12), where Ai(t) and Bi(t) follow OU processes with mean μ = 0, volatility \(\sigma =\frac{\pi }{10}\sqrt{\frac{4}{\pi {T}_{2}}}\), and reversion speed θ = 1/T2, where T2 = 256 sec is the coherence time of the signal. In addition, we fixed Ttot = 2T2 and Δt = 1 sec. We tested the performance of MDL as a function of the frequency difference, Δ, in comparison to MFB and Mcorr. In MFB the maximal log-likelihood was calculated over the random OU processes. For each string of measurement results x, we considered the single-frequency signal with Δ = Δ0 = 0 and the signal of two nearby frequencies with Δ = Δn > 0, where Δn corresponds to the numerical value of the frequency difference between the two frequencies. We generated many sets of random OU processes, denoted by Ok, and calculated

$$L_1 = \max_{O_k} L(O_k | x, \Delta_0), \qquad L_2 = \max_{O_k} L(O_k | x, \Delta_n),$$
(13)

where

$$L(O_k | x, \Delta_i) = \sum_j \left( x_j \log P(t_j, \Delta_i, O_k) + (1 - x_j) \log\left(1 - P(t_j, \Delta_i, O_k)\right) \right).$$
(14)

We estimated the signal as a single frequency signal or as a signal of two frequencies according to the larger likelihood; that is

$$\Delta_{est} = \begin{cases} \Delta_0 & L_1 > L_2 \\ \Delta_n & \text{otherwise}. \end{cases}$$
(15)
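For concreteness, a hedged sketch of the Eq. (12) data generator used in this analysis is given below, reusing ou_process from the previous sketch. The centre frequency δc is not specified in the text and is a free parameter here; we also identify the diffusion constant c with σ² and the relaxation time τ with 1/θ = T2.

```python
import numpy as np

T2, dt = 256.0, 1.0                                 # coherence time and time interval from the text
n_meas = int(2 * T2 / dt)                           # T_tot = 2 * T2
sigma = (np.pi / 10) * np.sqrt(4 / (np.pi * T2))    # OU volatility quoted in the text

def simulate_resolution(delta_c, Delta, rng):
    """Eq. (12) with n = 2 and delta_i = delta_c +/- Delta/2;
    Delta = 0 reproduces the single-frequency signal."""
    t = dt * np.arange(1, n_meas + 1)
    acc = np.zeros(n_meas)
    for d in (delta_c - Delta / 2, delta_c + Delta / 2):
        A = ou_process(n_meas, dt, T2, sigma**2, rng)   # amplitude/phase noise
        B = ou_process(n_meas, dt, T2, sigma**2, rng)
        acc += (A / d) * (np.sin(d * t) - np.sin(d * (t - dt)))
        acc += (B / d) * (np.cos(d * t) - np.cos(d * (t - dt)))
    p = np.sin(acc + np.pi / 4) ** 2
    return (rng.random(n_meas) < p).astype(int)
```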

Figure 9 (left) shows the error probability as a function of the frequency difference. The MDL results were better than those of Mcorr as well as those of MFB. Interestingly, even though MFB has full knowledge of the noise model, it achieves a larger error probability than MDL. We note that increasing the number of OU processes, Ok, in the above likelihood calculation does not improve \({P}_{{M}_{FB}}\). While MDL and Mcorr reached a result within ~45 min, MFB did so within ~7 hours (CPU times, both measured on the same common PC without utilizing a GPU). The MDL ROC curves and AUC are shown in Fig. 9 (right). These numerical results provide a strong indication that DL methods can potentially identify molecules based on their NMR signal extremely fast, which may be a useful tool in probing chemical reactions at the nano scale.

Figure 9

Performance in the noisy frequency resolution scenario. Left: discrimination error probabilities (Eq. (3)). Full Bayesian, \({P}_{{M}_{FB}}\) (green squares), Deep Learning, \({P}_{{M}_{DL}}\) (red circles), and correlations, \({P}_{{M}_{corr}}\) (blue hexagons), as a function of the frequency difference, Δ. The input data were produced according to Eq. (12) with Ttot = 2T2. Right: ROC curve and AUC of MDL for different values of Δ, corresponding to the second, fourth, and sixth points from left in the left figure.

Theoretical implications

While the numerical advantages of machine learning methods have already been demonstrated28,29, their theoretical value has not been demonstrated before. Beyond the practical interest of utilizing machine learning methods in the nano-NMR frequency resolution problem, machine learning methods, and in particular DL methods, could also have considerable theoretical value.

Generally in estimation problems, the MSE of an estimator M for a given unseen test input x can be written as

$${\rm{E}}[{(y-M(x))}^{2}]={({\rm{Bias}}[M(x)])}^{2}+{\rm{Var}}[M(x)]+{\rm{Var}}[\epsilon ],$$
(16)

where y is the true label, M(x) is the estimated label, E is the expectation value with respect to the training set, the bias of M is given by \({\rm{Bias}}[M(x)]={\rm{E}}[M(x)]-y\), the variance of M is given by \({\rm{Var}}[M(x)]={\rm{E}}[M{(x)}^{2}]-{\rm{E}}{[M(x)]}^{2}\), and \({\rm{Var}}[\epsilon ]\) is the irreducible error due to the (zero-mean) noise \(\epsilon \). The error probability, PM, is then obtained as the expectation value of the MSE, \({\rm{E}}[{(y-M(x))}^{2}]\), with respect to the test set.

An unbiased estimator is an estimator M for which \({\rm{Bias}}\,[M(x)]=0\). An optimal unbiased estimator has minimal variance and is known as the minimum variance unbiased (MVU) estimator. However, from Eq. (16) it is seen that an MVU estimator is not necessarily an optimal estimator that minimizes the MSE. Indeed, it is known that biased methods can outperform unbiased ones30,31,32. In such a case the magnitude of the bias is increased, \({({\rm{Bias}}[M(x)])}^{2} > 0\), but the variance \({\rm{Var}}\,[M(x)]\) is significantly decreased, such that the MSE is smaller than the MSE of an MVU estimator. Such error-reduction strategies are used ubiquitously in image restoration33,34 and beamforming applications35,36. Moreover, it is known that biased methods can be superior in various spectral analysis applications37. Despite this superiority, there are only a few structured methods by which such a biased estimator can be constructed31, and in most cases the search for such estimators is extremely challenging, especially since it is unknown whether such an estimator exists.
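A toy numerical illustration of this point (our example, not from the paper): shrinking the unbiased sample mean introduces bias but reduces the variance enough to lower the total MSE.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, n, trials = 1.0, 10, 100_000
samples = rng.normal(mu, 3.0, size=(trials, n))

unbiased = samples.mean(axis=1)    # MVU estimator of mu: MSE = Var = 9/10
shrunk = 0.5 * unbiased            # biased (Bias^2 = 0.25) but Var = 0.25 * 9/10
print(np.mean((unbiased - mu) ** 2))   # ~0.90
print(np.mean((shrunk - mu) ** 2))     # ~0.475 < 0.90
```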

Our numerical analysis of MFB converged to its final result; however, the method resolved the two frequencies with a higher error rate than that of MDL. Since MFB is an MVU estimator, our results indicate that for the model at hand the unbiased full-model Bayesian analysis is not optimal, and that a superior biased method exists. This brings up an extra advantage of DL, as the search for a biased method is usually done in an ad-hoc manner. Moreover, in most cases there is no way of knowing whether a method superior to the unbiased one exists. Hence, our results provide some hope that DL methods could be used as an analytical tool for identifying superior estimators, and in particular, for identifying the ultimate limits of resolution problems.

Conclusion

In conclusion, we showed that DL methods are able to mitigate the effect of the inherent strong noise in nano-NMR settings. In particular, the DL neural networks effectively learn the noise model, even when no prior knowledge of the noise model is assumed. This is a crucial property of DL methods, as in many realistic nano-NMR scenarios the noise model is complex and not accurately known. We investigated the performance of DL methods on the problems of frequency discrimination and frequency resolution. We showed that DL methods can outperform Bayesian methods when full knowledge of the noise model is not available, and that DL methods can analyze a test signal as accurately as numerically demanding Bayesian methods, even though the Bayesian methods have full knowledge of the noise model while the DL methods have no prior knowledge at all.

DL methods can perform better than Bayesian methods when the noise model is not precisely known, or when the noise model is known but complex. In the first case, DL methods can achieve better results than Bayesian methods because DL methods do not assume prior knowledge of the model, whereas Bayesian methods rely on precise knowledge of it. This was demonstrated in the case of frequency discrimination in the noisy scenario, as well as in the analysis of the experimental data. In the second case, the results of both methods may be similar, but the computational resources consumed by Bayesian methods can be much larger than those required by DL methods, as was demonstrated in the problem of frequency resolution of noisy signals.

Our results can be seen as a strong indication that DL methods will turn out to be the method of choice for analyzing spectroscopic nano-NMR data. In addition, our results indicate that DL methods could be utilized as a tool for identifying superior biased estimators and the ultimate limits of resolution problems, which are otherwise difficult to obtain12.