The application of sparse-sampling techniques to NMR data acquisition would benefit from reliable quality measurements for reconstructed spectra. We introduce a pair of noise-normalized measurements, $R_{\mathrm{work}}^{\mathrm{noise}}$ and $R_{\mathrm{free}}^{\mathrm{noise}}$, for differentiating inadequate modelling from overfitting. While $R_{\mathrm{work}}^{\mathrm{noise}}$ and $R_{\mathrm{free}}^{\mathrm{noise}}$ can be used jointly for methods that do not enforce exact agreement between the back-calculated time domain and the original sparse data, the cross-validation measure $R_{\mathrm{free}}^{\mathrm{noise}}$ is applicable to all reconstruction algorithms. We show that the fidelity of reconstruction is sensitive to changes in $R_{\mathrm{free}}^{\mathrm{noise}}$ and that model overfitting results in elevated $R_{\mathrm{free}}^{\mathrm{noise}}$ and reduced spectral quality.
Recent developments in sparse-sampling and iterative reconstruction techniques have made it possible to acquire magnetic resonance spectra using a fraction of the measurement time required by the conventional Nyquist-sampling method, without sacrificing spectral resolution or quality1,2,3,4,5. Despite these remarkable advancements, there are not, as yet, unbiased, quantitative measurements of the fidelity of reconstructed spectra. Instead, the quality of spectral reconstruction is frequently assessed by direct comparison with artifact-free spectra generated from fully sampled datasets, by examination of algorithm-specific parameters, or by estimating the reduction of aliasing artifacts in the reconstructed spectra. Each of these measurements has its own limitations: comparison of reconstructed spectra with artifact-free spectra from fully sampled data is not feasible in real applications; spectra reconstructed with different methods cannot be compared using algorithm-specific parameters; and finally, excessive ‘artifact’ reduction during reconstruction may not correlate with the improvement of spectral fidelity.
Here, we introduce two algorithm-independent measurements for evaluating the quality of nuclear magnetic resonance (NMR) spectra reconstructed from sparsely sampled datasets, demonstrate their utility in differentiating inadequate modelling from overfitting, and discuss the implication of such quality measurements for the fidelity of NMR spectral reconstruction.
Quality measurements for reconstructed NMR spectra
NMR time domain data and frequency domain spectra are connected through the Fourier transform. Conceptually, the quality of a reconstructed spectrum can be measured by computing the inverse Fourier transform of the spectrum and comparing the resulting time domain data with the raw measurements at the sampled positions. However, this alone is inadequate, as the following example illustrates. The spectrum generated from the Fourier transform of the sparsely sampled time domain data would fulfil such a criterion, yet it is not a high-quality reconstruction due to the presence of strong aliasing artifacts, which arise from the lack of modelled signals at the unmeasured positions in the time domain. A more useful quality measure would go further, flagging this as a poor reconstruction.
Similar issues were encountered previously in X-ray crystallography, where the diffraction pattern, which is related to the electron density by a Fourier transform, contains intensity information but not phase information; the phases must be reconstructed iteratively in reciprocal space as the model of the molecular structure is assembled in real space. Initially, it was proposed that the inverse Fourier transform of the modelled electron density could be compared with the diffraction data, and that their correlation, in the form of the ‘R-factor’ $R = \sum \big| |F_{\mathrm{obs}}| - |F_{\mathrm{calc}}| \big| / \sum |F_{\mathrm{obs}}|$, where Fobs and Fcalc represent observed and back-calculated structure factors, would reflect the quality of the modelled electron density map6. If the iterative process of assembling the model is successful, its inverse Fourier transform will come increasingly close to agreeing with the observed data, and R will become progressively smaller.
While R is helpful, it was soon realized that R alone is inadequate, as the incorporation of experimental noise into the model will drive down the R-factor while reducing, rather than improving, the fidelity of the structural model7. To address this, it was suggested that a small percentage of the measurements (5–10%) be set aside and excluded from the reconstruction process, and that the electron density be built from the remaining measurements, known as the working set. As the reconstruction proceeds, the consistencies of the calculated structure factors with the working dataset (Rwork) and the excluded dataset (Rfree) are used together to evaluate the quality of the electron density map. While incorporation of noise into the model improves the agreement with the working set (reducing Rwork), it results in worse fitting to the excluded set (increasing Rfree), thus allowing over-refinement to be detected7.
Despite this conceptual similarity between the data processing of sparsely sampled NMR and crystallography, several issues must be addressed before these quality measurements can be applied to NMR. One important consideration is that some reconstruction algorithms explicitly or implicitly require exact agreement between the back-calculated time domain and the experimental measurements, resulting in an R (or Rwork) value of zero. While this would seem to negate the value of this approach, we show below that if the raw time domain measurements are divided into a working dataset used for reconstruction and a free dataset reserved for cross validation, the Rfree can be used alone as a meaningful measure of reconstruction quality.
The application of R-factor measurements to NMR is additionally confounded by a second distinction between NMR and crystallography. In crystallography, the occupancy of the unit cell by protein does not vary significantly, regardless of the protein under study. Therefore, the R-factors always fall into the same numeric range for crystals of a similar quality. In NMR, however, each position on the directly observed dimension of a spectrum constitutes an independent reconstruction problem, and the set of independent 2-D planes in a 3-D spectrum (or 3-D cubes in a 4-D spectrum) contain vastly different numbers of signals, from pure noise to large arrays of signals with different intensities and lineshapes. For a noise plane, the ideal reconstruction would contain no signal, the back-calculated time domain data would be zero, and both Rwork and Rfree would be 100%; whereas for a plane containing signals, the back-calculated time domain from the ideal reconstruction would be very close to the raw measurements, and the corresponding R-factors would be vanishingly small. In order to obtain a consistent readout that is independent of the number of signals involved in the reconstruction, we introduce noise-normalized quality measurements $R_{\mathrm{work}}^{\mathrm{noise}}$ and $R_{\mathrm{free}}^{\mathrm{noise}}$, which are defined as:
$$
R_{\mathrm{work}}^{\mathrm{noise}} = \frac{\sum_{\mathrm{work}} \left| F_{\mathrm{obs}} - F_{\mathrm{calc}} \right|}{\sum_{\mathrm{work}} \left| F_{\mathrm{noise}} \right|}, \qquad R_{\mathrm{free}}^{\mathrm{noise}} = \frac{\sum_{\mathrm{free}} \left| F_{\mathrm{obs}} - F_{\mathrm{calc}} \right|}{\sum_{\mathrm{free}} \left| F_{\mathrm{noise}} \right|} \tag{1}
$$
In equation (1), $|F_{\mathrm{noise}}|$ is the vector length of the hypercomplex measurements in a reference noise plane or cube, while $|F_{\mathrm{obs}} - F_{\mathrm{calc}}|$ is the length of the vector difference of the observed and back-calculated time-domain signals from the model spectrum. The sums run over the sampled positions of the working set and the free set, respectively.
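As a minimal sketch of how such a measure could be evaluated, the function below assumes that each noise-normalized R-factor is the summed magnitude of the time-domain residual divided by the summed magnitude of a reference noise measurement at the same sampled positions. The function name `r_noise` and the numerical values are hypothetical and serve only to illustrate the arithmetic; ordinary complex numbers stand in for hypercomplex data.

```python
def r_noise(observed, back_calculated, noise_reference):
    """Noise-normalized R-factor over one set of time-domain points.

    All three arguments are equal-length sequences of complex numbers taken
    at the same sampled positions (either the working set or the free set).
    """
    if not (len(observed) == len(back_calculated) == len(noise_reference)):
        raise ValueError("all inputs must cover the same sampled positions")
    # Numerator: summed length of the residual vectors (observed - model).
    residual = sum(abs(o - c) for o, c in zip(observed, back_calculated))
    # Denominator: summed length of the reference noise vectors.
    noise = sum(abs(n) for n in noise_reference)
    return residual / noise


# Hypothetical values, for illustration only.
observed = [3 + 4j, 1 - 1j, -2 + 0j]
noise_ref = [1 + 0j, 0 - 1j, 0.6 + 0.8j]

# An empty model (all zeros) leaves the whole signal in the residual.
r_empty = r_noise(observed, [0j, 0j, 0j], noise_ref)

# A model that misses every point by one noise-sized vector gives unity.
model = [o - n for o, n in zip(observed, noise_ref)]
r_unity = r_noise(observed, model, noise_ref)
```

Calling the same function once with the working-set positions and once with the free-set positions yields the two measures; a value well above unity indicates unmodelled signal, and a value near unity indicates a residual comparable to the noise.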
Application of quality measurements to reconstructed spectra
In order to illustrate the utility of our proposed measurements, we first used the CLEAN algorithm8 to reconstruct the 3-D HNCO spectra of six proteins at a sampling density of 5% (Fig. 1; Supplementary Figs 1–5). Ten per cent of the measurements were excluded from spectral reconstruction and marked for $R_{\mathrm{free}}^{\mathrm{noise}}$ calculation. CLEAN builds a model of the frequency domain spectrum through the iterative identification of signal components, and thus $R_{\mathrm{work}}^{\mathrm{noise}}$ and $R_{\mathrm{free}}^{\mathrm{noise}}$ are naturally suited to monitoring the progress of the reconstruction. All results presented here show the model only, without the inclusion of any residuals.
Two N–CO planes were selected from the HNCO spectrum of GB1 for illustration, one containing strong signals (Fig. 1a,b) and one containing weak signals (due to leakage from the neighbouring plane, Fig. 1c,d). For both cases, as the threshold for inclusion of signals in the model decreased, individual signals were picked up and incorporated (Fig. 1b,d), leading to an initial decrease of $R_{\mathrm{work}}^{\mathrm{noise}}$ and $R_{\mathrm{free}}^{\mathrm{noise}}$ values. When nearly all of the signals were modelled, the difference between the raw data and the back-calculated data was on par with the noise, with the $R_{\mathrm{work}}^{\mathrm{noise}}$ and $R_{\mathrm{free}}^{\mathrm{noise}}$ values approaching unity. When the stopping threshold was set well below the fluctuation of aliasing artifacts, the CLEAN algorithm started to identify noise spikes and model them as signals (Fig. 1b,d). Such excessive modelling continued to diminish $R_{\mathrm{work}}^{\mathrm{noise}}$, whereas $R_{\mathrm{free}}^{\mathrm{noise}}$ went through a minimum and then increased slightly before reaching a plateau. Such an effect was particularly noticeable for the reconstructed planes with a weaker signal-to-noise ratio (Fig. 1c). It can also be appreciated that despite similar reconstruction qualities, conventional R-factors normalized against signals would appear smaller for the N–CO plane containing strong signals, whereas they would seem larger for a plane containing weak signals (compare the right panels of Fig. 1a with Fig. 1c). The introduction of $R^{\mathrm{noise}}$ factors overcomes this limitation and offers consistent measurement of the reconstruction quality: an ideal reconstruction would have $R_{\mathrm{work}}^{\mathrm{noise}}$ and $R_{\mathrm{free}}^{\mathrm{noise}}$ approaching unity regardless of the signal content. Improvement of the reconstruction is reflected by simultaneous reduction of $R_{\mathrm{work}}^{\mathrm{noise}}$ and $R_{\mathrm{free}}^{\mathrm{noise}}$, whereas overfitting results in divergent values.
We next examined whether our proposed quality measurements $R_{\mathrm{work}}^{\mathrm{noise}}$ and $R_{\mathrm{free}}^{\mathrm{noise}}$ can be applied to other reconstruction algorithms beyond CLEAN. In order to demonstrate the general applicability of these quality measurements, we implemented three popular reconstruction algorithms: convex l1-norm minimization9, maximum entropy reconstruction10,11, and iterative soft thresholding (IST)12.
Convex l1-norm minimization, commonly used in compressed sensing, optimizes the frequency domain data to generate the spectrum with the smallest possible l1-norm while having the inverse Fourier transform be consistent with the experimental time domain measurements. Optimizing against both measures is possible through a constrained minimization in which a Lagrangian multiplier λ is introduced to balance the two requirements:
$$
C = \lVert S \rVert_{1} + \lambda \cdot \mathrm{RMSD}(s, m) \tag{2}
$$
In equation (2), C is the composite score, S is the modelled frequency domain spectrum, s is the modelled time domain signals, m is the experimental measurements, and RMSD is the root mean square deviation.
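The balance between the two terms can be illustrated with a short sketch. It assumes that the composite score is the l1-norm of the spectrum plus λ times the RMSD between the back-calculated and measured time-domain points of the working set, and it uses NumPy's default FFT normalization; the function name `composite_l1_score` and the toy data are hypothetical.

```python
import numpy as np


def composite_l1_score(spectrum, work_idx, measured, lam):
    """Assumed composite score: ||S||_1 + lam * RMSD(s, m) on the working set."""
    # Back-calculate the time domain and keep only the sampled positions.
    s = np.fft.ifft(spectrum)[work_idx]
    rmsd = np.sqrt(np.mean(np.abs(s - measured) ** 2))
    return np.sum(np.abs(spectrum)) + lam * rmsd


# Toy example: a constant time-domain signal sampled at four of eight points.
n = 8
work_idx = np.array([0, 1, 3, 6])
measured = np.ones(work_idx.size, dtype=complex)

flat = np.zeros(n, dtype=complex)   # empty spectrum: zero l1-norm, poor fit
peak = np.zeros(n, dtype=complex)   # single peak: l1-norm of 8, exact fit
peak[0] = n

score_flat_small = composite_l1_score(flat, work_idx, measured, lam=2.0)
score_peak_small = composite_l1_score(peak, work_idx, measured, lam=2.0)
score_flat_large = composite_l1_score(flat, work_idx, measured, lam=20.0)
score_peak_large = composite_l1_score(peak, work_idx, measured, lam=20.0)
```

With the small multiplier the empty spectrum scores lower than the single-peak spectrum, and with the large multiplier the order reverses, which is the trade-off that λ controls.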
In order to examine the behaviour of our quality measurements for reconstruction by l1 minimization, we generated a simulated sparsely sampled 1-D time domain dataset, which was Fourier transformed to yield an initial spectrum containing aliasing artifacts (Fig. 2a). A reference spectrum was also generated that contained the fully sampled signals and noise (Fig. 2b). Before reconstruction, the time domain measurements were separated into two parts, the working dataset used for reconstruction and the free dataset reserved for cross validation.
We examined the $R_{\mathrm{work}}^{\mathrm{noise}}$ and $R_{\mathrm{free}}^{\mathrm{noise}}$ values of each reconstruction as a function of the Lagrangian multiplier (λ), which alters the amount of weight placed on regularization versus agreement with the experimental data. At each value of λ, the reconstruction was obtained at the minimum of the composite score. With λ set to zero, the scoring function consists solely of the l1-norm, and minimization of this norm drives the frequency domain spectrum to the baseline, yielding a final score of zero at the end of the reconstruction process (Fig. 2c). As λ is increased, putting more weight on the consistency between the modelled time domain signals s and the experimental measurements m, more signals are retained in the reconstruction and the final composite score increases (Fig. 2c); at the same time, the values for both $R_{\mathrm{work}}^{\mathrm{noise}}$ and $R_{\mathrm{free}}^{\mathrm{noise}}$ begin to decrease (Fig. 2d). While the $R_{\mathrm{work}}^{\mathrm{noise}}$ at the end of reconstruction continues to decrease with ever-increasing values of λ, eventually reaching zero when complete agreement with the working dataset is achieved, the $R_{\mathrm{free}}^{\mathrm{noise}}$ at the end of reconstruction instead decreases to a minimum and then starts to increase slightly (Fig. 2d). Such a trend was consistently observed with cross-validation sets selected from 10% to 30% of the overall measurements (Supplementary Fig. 6), highlighting the robustness of the $R_{\mathrm{work}}^{\mathrm{noise}}$ and $R_{\mathrm{free}}^{\mathrm{noise}}$ measurements. Importantly, the change in $R_{\mathrm{free}}^{\mathrm{noise}}$ is closely mirrored by the change in the λ-dependent RMSD between the reconstruction and reference spectra (Fig. 2e): the minimal difference is achieved when $R_{\mathrm{free}}^{\mathrm{noise}}$ is at or very close to its minimum (Fig. 2f,g), but not when a very large value of λ enforces complete agreement with the experimentally measured working dataset (Fig. 2h,i). These results strongly support the notion that $R_{\mathrm{free}}^{\mathrm{noise}}$ is a valid measure of the reconstruction spectral fidelity in the absence of an external reference spectrum.
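One practical reading of this behaviour is that a reconstruction parameter could be chosen at the minimum of the free-set measure. The sketch below assumes that a sweep over λ has already produced one cross-validation value per setting; the helper name `select_by_min_r_free` and the numbers are hypothetical and are not results from this work.

```python
def select_by_min_r_free(settings, r_free_values):
    """Return the setting (and its value) with the smallest free-set measure."""
    if not settings or len(settings) != len(r_free_values):
        raise ValueError("need one cross-validation value per setting")
    best = min(range(len(settings)), key=lambda i: r_free_values[i])
    return settings[best], r_free_values[best]


# Hypothetical sweep: the free-set value falls, passes a minimum, then rises.
lambdas = [0.1, 1.0, 10.0, 100.0, 1000.0]
r_free = [6.2, 2.4, 1.1, 1.3, 1.4]

best_lambda, best_r_free = select_by_min_r_free(lambdas, r_free)
```

The same selection could be applied to any other swept quantity, such as a stopping threshold or an iteration count, since it relies only on the cross-validation values.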
We next tested the maximum entropy reconstruction method (MaxEnt)10,11 using the same set of 1-D simulated data. The Fourier transforms of the sparsely sampled measurements and full measurements are shown in Fig. 3a,b, respectively. Reconstruction with the maximum entropy method is conceptually similar to l1-norm minimization, except that the regularization term maximizes the information entropy E(S) of the frequency domain spectrum rather than minimizing its l1-norm. Mathematically, the optimal solution is achieved through a constrained minimization of the negative entropy −E(S) and the Lagrangian-multiplier-weighted RMSD of the modelled time domain data s and raw measurements m:
$$
C = -E(S) + \lambda \cdot \mathrm{RMSD}(s, m) \tag{3}
$$
In the limit of the experimental noise being much larger than unity, a common condition encountered experimentally, the behaviour of the maximum entropy method is similar to the convex l1-norm minimization algorithm. As the Lagrangian multiplier λ was increased from zero, progressively more signals were included in the final reconstruction, resulting in an increase in the final composite score (Fig. 3c) and a decrease in the $R_{\mathrm{work}}^{\mathrm{noise}}$ and $R_{\mathrm{free}}^{\mathrm{noise}}$ (Fig. 3d). However, further increases in λ caused more of the experimental noise to be included, resulting in a divergence of $R_{\mathrm{work}}^{\mathrm{noise}}$ and $R_{\mathrm{free}}^{\mathrm{noise}}$ (Fig. 3d). The change of the $R_{\mathrm{free}}^{\mathrm{noise}}$ over λ was mirrored by the change of the RMSD of the reconstruction and reference spectra (Fig. 3e). A very large λ ultimately enforced the agreement between modelled time domain data and the experimental measurements, a scenario that resembles the forward maximum entropy reconstruction13 or maximum entropy interpolation14. It is important to note that the highest quality of reconstruction is achieved when $R_{\mathrm{free}}^{\mathrm{noise}}$ reaches a minimum (Fig. 3f,g), but not at a very large λ (Fig. 3h,i).
As a final example, we applied these measures to the IST method, which enforces exact agreement between the reconstruction and the experimental data12. IST begins with the Fourier transform of the sparsely sampled working dataset. In each iteration, signals above a predefined threshold in the frequency domain are extracted (Fig. 4a), and their inverse Fourier transform is used to update the values of the time domain at all positions except those belonging to the working dataset. These extracted signals are, in effect, a model of the current reconstruction, and can be used to monitor $R_{\mathrm{work}}^{\mathrm{noise}}$ and $R_{\mathrm{free}}^{\mathrm{noise}}$. This process is repeated with a decreasing threshold, and the inverse Fourier transform of the model increasingly converges with the measured data of the working set. Snapshots of the IST reconstruction were taken at every step over 1,500 iterations. Although the model spectrum is typically not presented in iterations of spectral reconstruction by IST, it is intrinsic to the reconstruction process and is produced as an output in our implementation (Fig. 4a, right panel) and in the implementation of hmsIST15, a variant of IST. Such a model spectrum allows meaningful calculation of both $R_{\mathrm{work}}^{\mathrm{noise}}$ and $R_{\mathrm{free}}^{\mathrm{noise}}$ (Fig. 4b).
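The iteration described above can be sketched in a few lines. This is a simplified 1-D illustration under stated assumptions, not the implementation used in this work: the threshold starts at the largest peak of the zero-filled spectrum and shrinks geometrically (`decay` and `n_iter` are arbitrary choices), soft thresholding is applied to complex magnitudes, and the function names, test signal and noise level are all hypothetical.

```python
import numpy as np


def r_noise(observed, back_calc, noise_ref):
    """Summed residual magnitude over summed reference-noise magnitude."""
    return np.sum(np.abs(observed - back_calc)) / np.sum(np.abs(noise_ref))


def soft_threshold(spec, thr):
    """Shrink each complex bin toward zero by thr; smaller bins become zero."""
    mag = np.abs(spec)
    return spec * np.maximum(0.0, 1.0 - thr / np.maximum(mag, 1e-300))


def ist_with_monitoring(fid, work_idx, free_idx, noise_ref, n_iter=200, decay=0.9):
    """Run IST on the working set and record both measures from the model."""
    x = np.zeros(fid.size, dtype=complex)
    x[work_idx] = fid[work_idx]                # zero-filled working data
    thr = np.max(np.abs(np.fft.fft(x)))        # assumed starting threshold
    history = []
    for _ in range(n_iter):
        thr *= decay                           # decreasing threshold
        model = soft_threshold(np.fft.fft(x), thr)
        back = np.fft.ifft(model)              # back-calculated time domain
        history.append((
            r_noise(fid[work_idx], back[work_idx], noise_ref[work_idx]),
            r_noise(fid[free_idx], back[free_idx], noise_ref[free_idx]),
        ))
        x = back.copy()                        # update unmeasured positions
        x[work_idx] = fid[work_idx]            # keep measured working data
    return np.fft.fft(x), np.array(history)


# Hypothetical test signal: two on-grid sinusoids plus complex white noise.
rng = np.random.default_rng(0)
n = 256
t = np.arange(n)
clean = 20 * np.exp(2j * np.pi * 30 * t / n) + 8 * np.exp(2j * np.pi * 97 * t / n)
fid = clean + (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)
noise_ref = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)

sampled = rng.choice(n, size=77, replace=False)    # about 30% of the grid
free_idx, work_idx = sampled[:15], sampled[15:]    # free set never enters IST

recon, hist = ist_with_monitoring(fid, work_idx, free_idx, noise_ref)
r_work_trace, r_free_trace = hist[:, 0], hist[:, 1]
```

Because the free-set positions are never written from measured data, `r_free_trace` provides a cross-validation curve, while `r_work_trace` is driven toward zero as the threshold vanishes.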
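As with other reconstruction methods, the $R_{\mathrm{work}}^{\mathrm{noise}}$ calculated from the model spectrum continued to decrease with an increasing number of IST iterations, eventually reaching zero; whereas $R_{\mathrm{free}}^{\mathrm{noise}}$ decreased to a minimum, then slightly increased and eventually reached a plateau (Fig. 4b). The initial simultaneous decrease in $R_{\mathrm{work}}^{\mathrm{noise}}$ and $R_{\mathrm{free}}^{\mathrm{noise}}$ is consistent with the efficient processing of genuine signals, whereas the ultimate divergence of $R_{\mathrm{work}}^{\mathrm{noise}}$ and $R_{\mathrm{free}}^{\mathrm{noise}}$ reflects overfitting.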
While the modification of IST to produce a model allows both measures of $R_{\mathrm{work}}^{\mathrm{noise}}$ and $R_{\mathrm{free}}^{\mathrm{noise}}$ to be used, we show that it is also possible to use $R_{\mathrm{free}}^{\mathrm{noise}}$ alone as a quality measure for the reconstructions generated by the unmodified IST algorithm. In the unmodified IST algorithm, $R_{\mathrm{work}}^{\mathrm{noise}}$ would not be a useful measurement of the spectral quality: when calculated from the reconstruction rather than the model, the value of $R_{\mathrm{work}}^{\mathrm{noise}}$ remains at zero during the entire run, reflecting the fact that the algorithm enforces exact agreement between the reconstruction and the measured data (Fig. 4c). However, as the reconstruction has no bias toward the free dataset, the $R_{\mathrm{free}}^{\mathrm{noise}}$ calculated from the evolving reconstruction spectrum behaves identically to that of the model spectrum (compare Fig. 4b,c), reinforcing the notion that $R_{\mathrm{free}}^{\mathrm{noise}}$ is a universally applicable cross-validation measurement of reconstruction quality and model overfitting. Importantly, we again observe an excellent correlation between the RMSD curve of the reconstruction and reference spectra and the $R_{\mathrm{free}}^{\mathrm{noise}}$ curve (compare Fig. 4c,d): the highest quality of reconstruction is achieved when $R_{\mathrm{free}}^{\mathrm{noise}}$ is close to its minimum (Fig. 4e,f), and quality decreases with additional iterations (Fig. 4g,h). Further iterations beyond the minimum of $R_{\mathrm{free}}^{\mathrm{noise}}$ model more noise and artifacts than genuine signals (compare Fig. 4e,g), a situation of model overfitting that leads to an increase of $R_{\mathrm{free}}^{\mathrm{noise}}$ (Fig. 4c) and the degradation of reconstruction fidelity (compare Fig. 4f,h).
Sparse sampling and iterative spectral reconstruction techniques are poised to transform magnetic resonance measurements in the post-FT era16, yet the characteristics and relative performance of the various reconstruction algorithms vary dramatically17,18,19, demanding quantitative measurements for estimating reconstruction fidelity, for detecting inadequate modelling and for preventing model overfitting. It is clear from our 1-D simulations that algorithm-specific measures such as the composite score of convex l1-norm minimization or maximum entropy are dependent on the reconstruction parameters (for example, the Lagrangian multiplier λ) and cannot be compared with each other directly for evaluation of reconstruction quality, whereas other algorithms, such as the IST, do not have a regularization score at all. The introduction of the cross-validation parameter $R_{\mathrm{free}}^{\mathrm{noise}}$ provides a benchmark for assessment of the reconstruction fidelity independent of reconstruction algorithm specifics. For algorithms that do not enforce exact agreement at measured time domain positions, $R_{\mathrm{work}}^{\mathrm{noise}}$ and $R_{\mathrm{free}}^{\mathrm{noise}}$ can be used jointly to identify inadequate modelling and overfitting, while $R_{\mathrm{free}}^{\mathrm{noise}}$ is a universally applicable gauge of reconstruction fidelity in the absence of a reference spectrum.
The notion of cross validation has been used previously in compressed sensing for estimating decoding errors20, and the method of using permutations of subsets of the raw data has also been used to search for convergent spectral reconstructions in NMR21,22,23. However, the cross-validation measures $R_{\mathrm{work}}^{\mathrm{noise}}$ and $R_{\mathrm{free}}^{\mathrm{noise}}$ presented in this work are novel, and are uniquely suited for application to NMR spectra.
The issue of model overfitting has been raised by Hoch and colleagues14,18, though no algorithm-independent measurements for such effects have been reported. The development of $R_{\mathrm{free}}^{\mathrm{noise}}$ permits unbiased comparison of reconstruction methods and sampling patterns and a direct measurement of model overfitting. For example, a comparison of the three reconstruction algorithms in our 1-D simulation shows that the lowest $R_{\mathrm{free}}^{\mathrm{noise}}$ score was achieved by the convex l1-norm minimization and maximum entropy methods, but only with the appropriate Lagrangian multipliers. Such a low $R_{\mathrm{free}}^{\mathrm{noise}}$ measurement is accompanied by the lowest RMSD between the reconstruction spectrum and the reference spectrum and thus the highest reconstruction fidelity. Reconstruction with less optimal Lagrangian multipliers leads to deterioration of the reconstruction fidelity due to either inadequate modelling or overfitting. The reconstruction fidelity of the IST method comes very close to that of the convex l1-norm minimization and maximum entropy methods when using an optimal number of iterations. However, the IST reconstruction is significantly worse with an infinite number of iterations, due to spectral overfitting.
As NMR spectral reconstruction is done independently for individual planes or cubes along the directly observed dimension, the most informative assessment of the reconstruction quality would be to calculate the quality factors separately for each position on the directly observed dimension. It is, however, conceivable that an overall quality factor could be calculated for the entire multi-dimensional spectrum, as the mean and standard deviation of the quality factors of the individual reconstructions.
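A whole-spectrum summary of this kind could be computed as follows. This is a minimal sketch that assumes the per-plane quality factors have already been calculated and that the sample standard deviation is the intended spread; the function name `summarize_quality` and the example values are hypothetical.

```python
from statistics import mean, stdev


def summarize_quality(per_plane_values):
    """Mean and sample standard deviation of per-plane quality factors."""
    values = list(per_plane_values)
    if len(values) < 2:
        raise ValueError("need at least two planes to estimate a spread")
    return mean(values), stdev(values)


# Hypothetical free-set values for three planes along the direct dimension.
overall_mean, overall_sd = summarize_quality([1.0, 1.2, 0.8])
```

A large spread would indicate that the reconstruction quality varies considerably between planes, which is why the per-plane values remain the more informative report.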
The increasing sensitivity brought about by innovation in NMR instrumentation and pulse sequence design and the demand for more efficient data collection in biomolecular NMR studies have led to the burgeoning development of sparse sampling and reconstruction methods. The introduction and demonstration of the algorithm-independent reconstruction quality measurement should provide much-needed quality assurance and greatly facilitate the wide adoption of sparse-sampling techniques in magnetic resonance spectroscopy.
NMR measurements, simulation and spectral reconstruction
Three-dimensional sparsely sampled HNCO experiments were recorded on Agilent or Bruker NMR spectrometers using 15N/13C-labelled GB1, FAAP20 UBZ4, foldon, ubiquitin, the UBM1-ubiquitin complex and FKBP12. 2-D cosine-weighted randomized concentric ring sampling patterns24 of 314 points adapted to the 64 × 96 sampling grid or 220 points adapted to the 64 × 64 sampling grid for indirect (N-C) dimensions were used, corresponding to a sampling density of ∼5%. A randomly selected dataset containing 90% of the measurements was used for spectral reconstruction via the CLEAN algorithm8 and for calculation of $R_{\mathrm{work}}^{\mathrm{noise}}$ and Rwork, whereas the remaining 10% of the measurements were excluded from reconstruction and were used for calculation of $R_{\mathrm{free}}^{\mathrm{noise}}$ and Rfree. Modelled components of the CLEAN reconstruction were inverse Fourier transformed for comparison with the time domain measurements and for calculation of R-factors as described in the main text.
One-dimensional simulations were performed using MATLAB (MathWorks). The simulation contained nine exponentially decaying signals with amplitudes from 64 to 1 and frequencies from −4,000 to 4,000 Hz in the presence of white noise. A pure noise dataset was also generated containing white noise of the same amplitude as the reference noise. A sampling grid of 1,024 points was used. A sparse dataset was created by randomly selecting 30% of the sampling points. Of this dataset, 80% of the measurements were used for spectral reconstruction and for calculation of $R_{\mathrm{work}}^{\mathrm{noise}}$, while the remaining measurements (20%) were excluded from spectral reconstruction and were only used for calculation of $R_{\mathrm{free}}^{\mathrm{noise}}$. Reconstruction was carried out using the convex l1-norm minimization algorithm, the maximum entropy method and the iterative soft thresholding method. For assessing the stability of $R_{\mathrm{work}}^{\mathrm{noise}}$ and $R_{\mathrm{free}}^{\mathrm{noise}}$, additional tests were carried out with convex l1-norm minimization using 90–70% of the measurements for spectral reconstruction and 10–30% for cross validation.
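A dataset of this kind can be approximated with the NumPy sketch below. Only the number of signals, the amplitude and frequency ranges, the grid size and the 30% and 80/20 fractions come from the description above; the spectral width, decay rate, noise level, geometric amplitude spacing, even frequency spacing and random seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

n_points = 1024
spectral_width_hz = 10_000.0                    # assumption
t = np.arange(n_points) / spectral_width_hz     # acquisition times in seconds

amplitudes = np.geomspace(64.0, 1.0, 9)         # assumption: geometric spacing
frequencies = np.linspace(-4000.0, 4000.0, 9)   # assumption: even spacing, Hz
decay_rate = 20.0                               # assumption, in s^-1
noise_sd = 1.0                                  # assumption


def complex_white_noise(n):
    """Complex Gaussian white noise with the assumed standard deviation."""
    return noise_sd * (rng.standard_normal(n) + 1j * rng.standard_normal(n))


# Nine exponentially decaying complex sinusoids plus white noise.
signal = sum(
    a * np.exp((2j * np.pi * f - decay_rate) * t)
    for a, f in zip(amplitudes, frequencies)
)
fid = signal + complex_white_noise(n_points)

# Pure-noise dataset of the same amplitude, used for normalization.
noise_reference = complex_white_noise(n_points)

# Randomly select 30% of the grid, then split it 80/20 into work and free.
n_sampled = round(0.30 * n_points)
sampled = rng.permutation(n_points)[:n_sampled]
n_work = round(0.80 * n_sampled)
work_idx = np.sort(sampled[:n_work])
free_idx = np.sort(sampled[n_work:])
```

Only `fid[work_idx]` would be passed to a reconstruction routine, with `fid[free_idx]` held back for cross validation and `noise_reference` supplying the normalization at the matching positions.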
NMR measurements and the software for calculating the Rnoise and R-factors and for 1-D simulations are available upon request from the authors.
How to cite this article: Wu, Q. et al. Unbiased measurements of reconstruction fidelity of sparsely sampled magnetic resonance spectra. Nat. Commun. 7:12281 doi: 10.1038/ncomms12281 (2016).
This work is supported by National Institutes of Health Grants AI094475 and GM115355 to P.Z. We would like to thank Dr Ronald Venters for sharing the FKBP12 data and for insightful comments on the manuscript.