Machine learning assistive rapid, label-free molecular phenotyping of blood with two-dimensional NMR correlational spectroscopy

Translation of findings in basic science and clinical research into routine practice is hampered by large variations in human phenotype. Developments in genotyping and phenotyping, such as proteomics and lipidomics, are beginning to address these limitations. In this work, we developed a new methodology for rapid, label-free molecular phenotyping of biological fluids (e.g., blood) by exploiting recent advances in fast and highly efficient multidimensional inverse Laplace decomposition techniques. We demonstrated that, using two-dimensional T1-T2 correlational spectroscopy on a single drop of blood (<5 μL), a highly time- and patient-specific ‘molecular fingerprint’ can be obtained in minutes. Machine learning techniques were introduced to transform the NMR correlational map into user-friendly information for point-of-care disease diagnosis and monitoring. The clinical utility of this technique was demonstrated through the direct analysis of human whole blood in various physiological (e.g., oxygenated/deoxygenated states) and pathological (e.g., blood oxidation, hemoglobinopathies) conditions.


Introduction
oxygenated state (non-oxidized, diamagnetic state) (Figs. 4a-b). The presence of excessive oxidized Hb in blood causes serious tissue hypoxia, a pathological state known clinically as methemoglobinemia 34.

The marked relaxation enhancement observed was due to the presence of five unpaired electrons in the ferric iron (Fe3+), which acted as the paramagnetic relaxation center 34. The magnetic moment of ferric iron is 1000-fold higher than that of a single proton 34,35. Significantly, owing to the long-range dipolar nucleus-electron interaction, the paramagnetism of the unpaired electrons had a considerable effect on the bulk water molecules (R-peak). In contrast to the oxygenated states (diamagnetic), spin-lattice relaxation in the oxidized states (paramagnetic) appeared to be much more efficient than spin-spin relaxation, and hence the reduction in

ratio of deoxy-Hb (309.7) is distinctively larger than that of its oxy-Hb (167.9) and oxidized Hb (37.5) counterparts (Fig. 4c).

Table 2). An additional six Hb variants (Supplementary Fig. 3b) were received for machine learning and blind test studies (Table 3 and Fig. 7). A limitation of this study is that it involved only the heterozygous HbE phenotype. Given the low prevalence of the homozygous HbE variant phenotype (~0.1%) in our population 37, we were unable to include such subjects during the study period. The Hb variants were first identified by a cation-exchange high-performance liquid chromatography method (Bio-Rad Variant II analyser) and further confirmed by capillary electrophoresis (Sebia CAPILLARYS 2 analyser) and genotyping.

NMR measurements were carried out on the spun-down packed RBCs in their native state (without any chemical treatment).
The Hb genotyping identified a single nucleotide polymorphism in the β-globin gene in the first and second samples, consistent with the HbE (Fig. 5a) and HbD (Fig. 5b) variants. A third, rare Hb variant sample was identified with a G insertion at codon 27 of the β-globin gene (Fig. 5c).

T-peak dispersion for the beta thalassemia variant with a mutated β-globin chain was particularly large with a flat plateau, suggesting that the frame shift mutation causes a greater amount of

compared to the wild-type (Supplementary Fig. 3a). T1-relaxation stretching was observed for the HbE variant (T2 = 0.94 ms, T1 = 9.44 ms) and the beta thalassemia variant (T2 = 0.56 ms, T1 = 10 ms), in agreement with commonly observed clinical phenotypes such as mild haemolysis due to increased oxidative damage. Interaction of Hb variants and other forms of hemoglobinopathies can lead to complex thalassemia syndromes with varying clinical phenotypes (Fig. 2).

Machine learning assisted medical decision. The 32 anonymized subjects consisted of a mixture of non-disease samples (wild-type) and disease samples (details in Supplementary Fig. 4b). The NMR correlational spectroscopy maps (molecular fingerprints) were converted into a machine-readable form for deep image analysis using statistical programming tools (e.g., R, Orange 3.1.2).

The unsupervised learning techniques were used for dimension reduction (e.g., MDS) and classification (e.g., hierarchical clustering) to assist in making medical decisions (Fig. 6). The 2D NMR correlational spectroscopy maps are complex 3D contour plots, and the MDS technique was used to reduce the higher dimensions into a two-dimensional scatter plot, which is more user-friendly for interpretation (Fig. 7a). Each data point (the molecular fingerprint of one subject) was grouped according to its similarity to the other members of its own cluster, as opposed to members of other clusters. Subjects were successfully classified into two clusters (disease (oxidized Hb, blue) and non-disease (healthy wild type, red)) using the MDS technique (P-value<0.05), apart from the mutated counterpart (Hb variants, orange). In addition, the disease subtypes (sub-type 1: oxidized Hb, sub-type 2: partially oxidized Hb) were also observed (Fig. 7a). Distances between subjects are shown in the heat map (Fig. 7b). Hierarchical clustering enabled disease staging, prognosis or risk factor prediction (high/low risk factor) (Fig. 7c). Other techniques (e.g., Isomap,

non-disease samples (wild-type) and disease samples (details in Supplementary Fig. 4b). Supervised learning models (e.g., logistic regression, neural network, k-nearest neighbours (kNN) and naïve Bayes) were used to evaluate their efficiency against human performance. K-fold cross-validation (e.g., k=2, 3, 5) and the leave-one-out method were used for sampling. Five technicians were trained to differentiate between disease and non-disease spectra and were subsequently asked to classify the state of each spectrum as a binary decision (disease, non-disease) in a blinded manner. At the end of the experiment, the results were cross-checked and classified as true positive, true negative, false positive or false negative (Supplementary Fig. 6). On average, the machine learning models (e.g., CA=0.885, sensitivity=0.885, specificity=0.887) outperformed the human operators (e.g., CA=0.794, sensitivity=0.811, specificity=0.720) in many aspects when k=5 (Table 3). The performance of the supervised models generally improved with increasing k and reached its maximum when the leave-one-out method was used to train on the datasets (details in Supplementary Figure 7). Notably, the performance variation between individuals was larger than that between machine learning models, as a result of subjective human judgment. On average, the machine learning models (30 seconds) also took much less time than the human operators (about 10 minutes) to complete the given tasks.
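The evaluation described above (k-fold and leave-one-out sampling, followed by scoring from true/false positives and negatives) can be illustrated with a minimal Python sketch. It is not the Orange workflow used in the study: CA is read here as classification accuracy, the kNN classifier is a simple stand-in, and the two-dimensional toy features, the function names and the choice of three neighbours are all hypothetical.

```python
import math
import random

def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def summary_metrics(tp, tn, fp, fn):
    """Classification accuracy (CA), sensitivity and specificity."""
    return {
        "CA": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds; k == n is leave-one-out."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train_x, train_y, query, n_neighbors=3):
    """Majority vote among the n_neighbors closest training points."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = [train_y[i] for i in order[:n_neighbors]]
    return max(set(votes), key=votes.count)

def cross_validate(x, y, k, n_neighbors=3):
    """Pool the out-of-fold kNN predictions over all folds and score them."""
    y_true, y_pred = [], []
    for fold in k_fold_indices(len(x), k):
        held_out = set(fold)
        train = [i for i in range(len(x)) if i not in held_out]
        train_x = [x[i] for i in train]
        train_y = [y[i] for i in train]
        for i in fold:
            y_true.append(y[i])
            y_pred.append(knn_predict(train_x, train_y, x[i], n_neighbors))
    return summary_metrics(*confusion_counts(y_true, y_pred))

# Hypothetical toy data: 16 "non-disease" (0) and 16 "disease" (1) subjects,
# each described by two made-up features (not real NMR measurements).
x = [(0.1 * i, 0.0) for i in range(16)] + [(3.0 + 0.1 * i, 1.0) for i in range(16)]
y = [0] * 16 + [1] * 16

# k = 2, 3, 5 and leave-one-out (k equal to the number of subjects).
results = {k: cross_validate(x, y, k) for k in (2, 3, 5, len(x))}
for k, scores in results.items():
    print(k, scores)
```

The same `summary_metrics` function could be applied to the technicians' blinded calls, so that human and model results are scored on an identical footing.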

In this work, we showed that the detailed and specific molecular microenvironment of water-proton interactions in blood can be mapped out using two-dimensional T1-T2 correlational spectroscopy. Interestingly, as water is ubiquitous to life forms, water-protein interactions (e.g., protein hydration) have attracted considerable interest from terahertz spectroscopy 40

We demonstrated that a total experimental time of less than 6 minutes is sufficient for high sensitivity and good spectral resolution without losing spectral integrity (details in Supplementary Fig. 1). The 2D correlation maps were processed using the built-in ILT algorithm (FISTA inversion) 47 with 5000 iterations and a smoothing parameter of 1. The inversion typically completed in less than 2 minutes using a desktop computer (Intel Core

either used immediately or kept at 4°C and used within three to four days of collection (unless mentioned otherwise) before the micro MR analysis. To induce the Hb into various derivative states, the blood samples were incubated with the desired chemical as mentioned in the text (e.g., sodium nitrite) and finally washed to remove the chemical residue. Heparinized micro capillary tubes (Fisher Scientific, PA) were used to transfer the processed blood, which was finally spun down at 6000g for 1 minute to obtain packed red blood cells for MR measurements.
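The ILT step mentioned in the data-processing paragraph above (FISTA inversion with a fixed iteration count and a smoothing parameter) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the instrument's built-in algorithm: the inversion-recovery and CPMG kernel forms, the Tikhonov-type penalty weighted by `alpha`, the non-negativity constraint, and all delays, grids and function names are hypothetical choices made for the example.

```python
import numpy as np

def make_kernels(t1_delays, t2_echoes, T1_grid, T2_grid):
    """Separable kernels; inversion-recovery (T1) and CPMG decay (T2) are assumed."""
    K1 = 1.0 - 2.0 * np.exp(-t1_delays[:, None] / T1_grid[None, :])
    K2 = np.exp(-t2_echoes[:, None] / T2_grid[None, :])
    return K1, K2

def fista_ilt_2d(S, K1, K2, alpha=1.0, n_iter=5000):
    """Minimise 0.5*||K1 F K2^T - S||^2 + 0.5*alpha*||F||^2 subject to F >= 0."""
    # Lipschitz constant of the gradient of the smooth objective.
    lipschitz = np.linalg.norm(K1, 2) ** 2 * np.linalg.norm(K2, 2) ** 2 + alpha
    F = np.zeros((K1.shape[1], K2.shape[1]))
    Y = F.copy()
    t = 1.0
    for _ in range(n_iter):
        residual = K1 @ Y @ K2.T - S
        grad = K1.T @ residual @ K2 + alpha * Y
        # Gradient step followed by projection onto the non-negative orthant.
        F_new = np.maximum(Y - grad / lipschitz, 0.0)
        # FISTA momentum update.
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        Y = F_new + ((t - 1.0) / t_new) * (F_new - F)
        F, t = F_new, t_new
    return F

# Synthetic, noise-free demonstration with a single (T1, T2) component.
t1_delays = np.logspace(-1, 2, 24)    # recovery delays in ms (hypothetical)
t2_echoes = np.linspace(0.1, 20, 60)  # echo times in ms (hypothetical)
T1_grid = np.logspace(-1, 3, 20)      # candidate T1 values in ms
T2_grid = np.logspace(-1, 2, 20)      # candidate T2 values in ms
K1, K2 = make_kernels(t1_delays, t2_echoes, T1_grid, T2_grid)

F_true = np.zeros((T1_grid.size, T2_grid.size))
F_true[10, 7] = 1.0
S = K1 @ F_true @ K2.T

F_est = fista_ilt_2d(S, K1, K2, alpha=1.0, n_iter=2000)
print("relative residual:", np.linalg.norm(K1 @ F_est @ K2.T - S) / np.linalg.norm(S))
```

Because the inversion is ill-posed, the recovered peak is generally broader than the true one, and its width depends on `alpha` and on the iteration count.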

Machine learning algorithm and workflow. The NMR-based POCT can be used with or without the assistance of AI (Fig. 2). Machine learning techniques were used to transform the complicated data (e.g., 2D NMR correlational maps) into user-friendly medical decision making following the workflow developed (Fig. 6). The maps were converted into machine-readable feature vectors using image embedding (e.g., SqueezeNet). Dimension reduction was then performed using various techniques (e.g., MDS, t-SNE, Isomap) (Supplementary Fig. 5).
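The unsupervised part of this workflow (pairwise distances between embedded maps, a two-dimensional MDS projection, and hierarchical clustering) can be sketched as follows. The sketch uses classical (Torgerson) MDS and average-linkage agglomerative clustering as stand-ins; the MDS variant and linkage used in the study are not stated here, so both are assumptions. The random feature vectors merely play the role of image embeddings, and all function names are hypothetical.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def classical_mds(D, n_components=2):
    """Classical MDS: eigendecomposition of the double-centred squared distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

def average_linkage(D, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(D.shape[0])]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(D.shape[0], dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

# Hypothetical stand-in for image embeddings: two groups of six subjects each,
# described by five made-up features (real embeddings have many more).
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(0.0, 0.1, size=(6, 5)),
    rng.normal(1.0, 0.1, size=(6, 5)),
])

D = pairwise_distances(embeddings)         # the kind of matrix a distance heat map displays
coords = classical_mds(D, n_components=2)  # 2D scatter-plot coordinates
labels = average_linkage(D, n_clusters=2)  # cluster assignment per subject
print(coords.round(2))
print(labels)
```

Cutting the hierarchy at a different number of clusters (the `n_clusters` argument) is one way a finer grouping, such as disease subtypes, could be explored.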

Blinded test. Supervised learning models (e.g., neural network, k-nearest neighbor, logistic regression, and naïve Bayes) were used to train and predict the data. We first trained 5 human operators to differentiate between disease and non-disease spectra and asked them to classify 32 anonymized

in the frequency domain. The relaxation properties can be used as a highly sensitive and specific molecular probe and provide important information on molecular motion (e.g., correlational relaxation, diffusion properties), which is not readily available from NMR spectra in the frequency domain.

(c) A single drop of whole blood contained in a micro capillary tube was spun using a standard haematocrit centrifuge (6000g, 1 min) to separate and concentrate the RBCs from the plasma.

The capillary tube is then loaded into a permanent magnet. The radio frequency coil (inner

analysis (e.g., hierarchical clustering, dimension reduction) were performed to transform the highly complicated data (e.g., hyper dimension) into human-friendly information to assist in medical decision making (e.g., diagnostic, staging) in real-time mode (Fig. 6 and Fig. 7). (d) Multi-omics analyses (e.g., proteomics, genomics) may be performed simultaneously to confirm the genetic

learning in processing the complicated data into user-friendly medical decisions (e.g., disease subtyping). The maps were converted into machine-readable feature vectors using image embedding (e.g., SqueezeNet). Dimensionality reduction was performed using various unsupervised models (e.g., MDS, t-SNE, Isomap). Supervised learning models (e.g., neural network, logistic regression, naïve Bayes) were used to train and predict the data. The performance of the supervised learning techniques was compared with human performance (Table 3)

indicates (longer, shorter) distance between subjects. Other unsupervised models (e.g., locally linear embedding, Isomap, t-SNE) were also evaluated for comparison (Supplementary Fig. 5). (c) The hierarchical clustering enabled disease staging, prognosis or risk factor prediction (high/low risk subject) with respect to a standard reference. For simplicity, three referencing states (WT and oxidized Hb) were shown. The non-disease state consists of healthy wild-type, and the disease state consists of oxidized Hb and Hb variants. The short forms used were wild type (WT), oxidized Hb (Oxi), and Hb variants (Var). The clustering circles (dotted lines) were drawn as a visual guide.