Near-infrared spectroscopy of blood plasma with chemometrics towards HIV discrimination during pregnancy

Prevention of mother-to-child transmission programs have been one of the hallmarks of success in the fight against HIV/AIDS. In Brazil, access to antiretroviral therapy (ART) during pregnancy has increased, leading to a reduction in new infections among children. Currently, lifelong ART is available to all pregnant women; however, challenges remain in eliminating mother-to-child transmission. In this paper, we focus on the role of near-infrared (NIR) spectroscopy in analysing blood plasma samples to differentiate pregnant women with HIV infection from pregnant women without HIV infection. Seventy-seven samples (39 from HIV-infected patients and 38 from healthy controls) were analysed. Multivariate classification of the resultant NIR spectra facilitated diagnostic segregation of both sample categories in a fast and non-destructive fashion, with the best model reaching 87% accuracy, 83% sensitivity and 91% specificity. This method is simple and low-cost, and can be easily adapted to point-of-care screening, which can be essential to monitor pregnancy risks in remote locations or in the developing world. Therefore, it opens a new perspective to investigate vertical transmission (VT). The approach described here can be useful for the identification and exploration of VT under various pathophysiological conditions of maternal HIV. These findings demonstrate, for the first time, the potential of NIR spectroscopy combined with multivariate analysis as a screening tool for fast and low-cost HIV detection.

www.nature.com/scientificreports/
Methodologies based on spectrometric techniques are an alternative for metabolomic screening of biofluids. Mass spectrometry and nuclear magnetic resonance (NMR) are the most common spectrometric techniques due to their relatively good sensitivity and specificity to detect diseases, although they carry an elevated cost and require a complex experimental setup 6. Near-infrared (NIR) spectroscopy is a spectrometric technique that explores high-energy vibrational modes of molecular chemical bonds, generating a spectrochemical signature of the sample in the region between 750 and 2500 nm [7][8][9]. This technique employs relatively simple and low-cost instrumentation when compared to mass spectrometry, NMR or even mid-infrared spectroscopy, and is widely available in portable handheld devices for in-field analysis. In addition to the simple instrumentation, this technique is non-destructive and reagent-free, so experimental measurements can be performed quickly and with minimal or no sample preparation. Accurate and low-cost techniques of this kind are needed for VT detection. NIR together with chemometric methods has played an increasingly important role in the field of medical and biological analysis by quickly detecting pathological conditions, even at very early stages. Previous studies have demonstrated clinical applications of NIR spectroscopy in biological samples, such as to detect Alzheimer's disease in blood plasma with 93% accuracy, 88% sensitivity and 96% specificity 9; to differentiate strains of Klebsiella pneumoniae that produce carbapenemase from those that do not 10; and to identify Aedes aegypti mosquitoes infected by the Zika virus 11. It has also been used with HIV, although with goals other than the discrimination between infected and healthy groups 7.
A previous study by our group showed that it was possible to differentiate pregnant women living with HIV from those not infected by the virus using attenuated total reflection Fourier-transform infrared (ATR-FTIR) spectroscopy 12. Herein, our aim was to use NIR spectroscopy to distinguish HIV-infected pregnant women from healthy pregnant controls (without co-morbidities). To this end, different chemometric techniques of multivariate classification were tested with the spectral dataset in order to optimize diagnostic results.

Results
Statistical analysis of clinical data. Demographic and epidemiological data of the participants can be observed in Table 1.
There was a significant association between HIV diagnosis and level of schooling, χ²(1) = 4.715, p < 0.05. The proportion of pregnant women with an education level no higher than middle school was greater in the HIV group (69.2%) than in the control group (44.7%) (Table 1). Patients with education up to middle school had 2.78 times higher odds of having HIV (OR = 2.78; 95% CI: 1.09-7.07) than patients who completed high school or a higher degree (Table 1).
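As a worked check of the association statistics above, the following minimal sketch recomputes the Pearson chi-square statistic, the odds ratio and a Wald 95% confidence interval from a 2×2 table. The cell counts are an assumption: they are back-calculated from the reported percentages (69.2% of 39 and 44.7% of 38) rather than read from Table 1, and the function name is illustrative only.

```python
import math


def two_by_two_stats(a, b, c, d):
    """Pearson chi-square (no continuity correction), odds ratio and Wald 95% CI.

    Table layout (rows = exposure, columns = group):
        a = exposed, HIV group      b = exposed, control group
        c = unexposed, HIV group    d = unexposed, control group
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
    return chi2, odds_ratio, (lower, upper)


# Assumed counts: 27 of 39 HIV and 17 of 38 controls with schooling up to middle school.
chi2, odds_ratio, ci = two_by_two_stats(a=27, b=17, c=12, d=21)
print(f"chi2 = {chi2:.3f}, OR = {odds_ratio:.2f}, 95% CI = {ci[0]:.2f}-{ci[1]:.2f}")
# prints: chi2 = 4.715, OR = 2.78, 95% CI = 1.09-7.07
```

Under these assumed counts the sketch reproduces the reported values, which suggests that the chi-square test was applied without continuity correction.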
There was a significant association between HIV diagnosis and drug use, χ²(1) = 4.319, p < 0.05. The proportion of pregnant women who use drugs was greater in the HIV group (25.6%) than in the control group (7.9%) (Table 1).
Patients who use drugs had approximately 4 times higher odds of having HIV (OR = 4.02; 95% CI: 1.01-16.00) than patients who do not use drugs (Table 1). No significant association was observed between the CD4+/CD8+ ratio and the maximum viral load (ρ = −0.346, p = 0.206). In contrast, the CD4+ cell count showed a negative correlation with the maximum viral load (ρ = −0.642, p = 0.010).
NIR spectroscopy. NIR spectroscopy is a valuable tool capable of analysing different types of diseases by measuring biologically-derived samples. Herein, NIR spectroscopy was employed to detect HIV infection from the blood plasma spectra of pregnant patients, and metrics such as diagnostic accuracy, sensitivity and specificity were calculated. Seventy-seven blood plasma samples were analysed, with 39 samples originating from HIV-infected pregnant women and 38 from healthy pregnant controls. Three spectra were collected per sample, resulting in a total of 231 spectra. The spectra were cut to the region between 1850 and 2150 nm, which is responsible for biomolecular-derived spectrochemical signatures. The average raw spectrum for each group of samples is depicted in Fig. 1A. To reduce noise, the raw spectral data were pre-processed by Savitzky-Golay (SG) smoothing and baseline correction (Fig. 1B). There is a high degree of superposition between spectral features among categories; consequently, multivariate analysis tools are necessary to distinguish the categories.
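To illustrate the smoothing step, the following is a minimal sketch of a Savitzky-Golay filter in plain Python. The 5-point window with a quadratic fit is an assumption, since the window size and polynomial order used in the study are not stated here; the edge handling and the example absorbances are likewise illustrative.

```python
# Convolution weights of the 5-point quadratic Savitzky-Golay smoother.
SG_WEIGHTS_5 = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def savgol_smooth_5(spectrum):
    """Smooth a 1-D list of absorbances with a 5-point Savitzky-Golay filter.

    The two points at each edge are copied unchanged (a simplification).
    """
    half = len(SG_WEIGHTS_5) // 2
    smoothed = list(spectrum)
    for i in range(half, len(spectrum) - half):
        window = spectrum[i - half:i + half + 1]
        smoothed[i] = sum(w * y for w, y in zip(SG_WEIGHTS_5, window))
    return smoothed


# Hypothetical absorbances with a single noisy spike at index 3.
noisy = [0.10, 0.11, 0.12, 0.30, 0.14, 0.15, 0.16]
smoothed = savgol_smooth_5(noisy)
```

Because the weights come from a local quadratic fit, a spectrum that is itself locally quadratic passes through unchanged, while isolated spikes are attenuated.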
The spectral data were divided into training (70%) and test (30%) sets using the Kennard-Stone (KS) sample selection algorithm. To predict whether pregnant women are affected by HIV, it is necessary to use supervised classification models capable of extracting spectral features that differentiate the HIV-infected pregnant category spectra from those of healthy pregnant controls. Several supervised classification techniques were tested to discriminate the data; their performances are depicted in Table 2.
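The Kennard-Stone selection mentioned above can be sketched as follows. This is a minimal plain-Python illustration using Euclidean distance, not the routine used in the study; the function name and the toy data are hypothetical.

```python
import math


def kennard_stone(samples, n_select):
    """Return indices of n_select (>= 2) samples chosen by Kennard-Stone."""
    n = len(samples)
    dist = [[math.dist(samples[i], samples[j]) for j in range(n)] for i in range(n)]
    # Start with the two most distant samples.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_select:
        # Add the sample whose nearest already-selected neighbour is farthest away.
        best = max(remaining, key=lambda i: min(dist[i][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy one-variable "spectra"; in practice each row would be a full spectrum.
points = [(0.0,), (1.0,), (2.0,), (10.0,)]
train_idx = kennard_stone(points, 3)
test_idx = [i for i in range(len(points)) if i not in train_idx]
```

The samples left unselected form the test set, so the training set spans the extremes of the measured variability.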
The best discrimination results were found for GA-QDA (accuracy of 87%), followed by GA-LDA (accuracy of 83%). The GA-LDA model selected 3 variables for category discrimination: 1929 nm, 1932 nm and 2151 nm (Fig. 2A), with a discriminant function plot showing 4 samples misclassified in the test set (Fig. 2B). GA-QDA also selected the same 3 variables (Fig. 2C), but with a more powerful discriminant function whereby 3 samples are misclassified in the test set (Fig. 2D). Three spectral wavelengths were therefore responsible for class separation based on GA-LDA/QDA (1929, 1932 and 2151 nm). The QDA algorithm performs better than LDA when categories have different covariance structures: LDA assumes that the categories have similar covariance matrices and hence uses a pooled covariance matrix for calculation, while QDA models each category covariance matrix separately 13. Thus, GA-QDA is the better-suited algorithm for the analysis of complex sample matrices with different internal variances. The first two variables selected by GA-QDA (1929 and 1932 nm) are related to OH stretching in carbohydrates, while the variable at 2151 nm relates to C=O and C-N stretching in proteins 14,15.

Discussion
Few studies have sought new tools for predicting an HIV diagnosis. To our knowledge, our group is the first to use near-infrared spectroscopy with multivariate classification to analyse blood plasma collected from pregnant women with HIV, working towards HIV discrimination during pregnancy. In a previous study, our group found a similar result using the ATR-FTIR technique, where we were able to discriminate the groups with and without HIV 12.
In this study, 77 blood plasma samples taken from pregnant women were analysed using NIR spectroscopy to detect HIV-infected patients. Different chemometric algorithms were tested for category discrimination, but the best classification performance was obtained with GA-QDA. HIV-infected samples were discriminated from the healthy control category with 87% accuracy, 83% sensitivity and 91% specificity using 3 spectral wavelengths selected by the genetic algorithm: 1929, 1932 and 2151 nm. The absorbances at 1929 nm and 1932 nm represent OH stretching of carbohydrates and the absorbance at 2151 nm the C=O and C-N stretching of proteins 14,16 .
From the variables selected by the GA, we can infer that there was a slight difference in protein and carbohydrate content between the tested groups. As is known, blood plasma is mainly composed of water; however, it contains a small proportion of proteins (albumin, fibrinogen and immunoglobulins), metabolites, carbohydrates and lipids, among other molecules 6. This composition normally has pre-established reference values; however, infections can generate metabolic changes that alter these values. NIR spectroscopy is an analytical technique with low chemical specificity due to the high degree of overlap of many spectral features; thus, pure biomarker identification based solely on this technique is not feasible. Although we managed to separate the groups successfully, further studies are necessary to generate a spectral pattern that is characteristic of patients with HIV. Thus, new tests with patients who are not pregnant, patients who are not on antiretroviral therapy and patients with other viral infections are very important for the validation of the technique as a possible diagnostic tool.

Design of study and population. This is a nested case-control study that was conducted in a Reference Obstetrics Center for AIDS at the School Maternity Januario Cicco (MEJC), located in Natal, Rio Grande do Norte, Brazil, between March 2017 and May 2019. A total of 39 pregnant women living with HIV were recruited, all with a singleton pregnancy and a gestational age between 10 and 38 weeks. The patients were selected spontaneously during their prenatal care at the Infectious Diseases Clinic at this reference maternity. All pregnant women living with HIV were monitored by an infectious disease specialist and an obstetrician. Only participants with complete clinical information were included in the analysis. Subjects were excluded if they had chronic medical conditions, including hypertension, type 2 diabetes mellitus, or heart or kidney diseases.
The study was approved by the Ethics Committee of the Federal University of Rio Grande do Norte under the protocol number 1.808.891. Written informed consent was obtained from every participant. All procedures were performed in compliance with the Declaration of Helsinki.
Clinical measurements. Clinical data were collected by medical record review. Women living with HIV were categorized according to antiretroviral therapy (ART) exposure during their prenatal follow-up. Clinical, socio-epidemiological and laboratory characteristics were summarized and submitted to statistical analysis. CD4 cell counts, CD8 cell counts and CD4/CD8 ratios are summarized in Table 3.
Healthy pregnant control group. This study includes 38 healthy pregnant women who attended a low-risk maternity hospital. The pregnant women were between 19 and 44 years old with a gestational age between 10 and 38 weeks. All were HIV negative and did not present with any other co-morbidity.

Blood plasma analysis by NIR.
A blood sample from each patient was collected in EDTA tubes and then centrifuged for 10 min at 1500 rpm at room temperature to obtain blood plasma for subsequent spectroscopic analysis. Maternal serum samples were collected at the time of clinically indicated blood tests and stored at −80 °C for research purposes.
NIR spectroscopy. Plasma aliquots (stored frozen) were left to thaw for several minutes at room temperature before spectrochemical measurement. Measurements were performed in a random order without the analyst having prior knowledge of the samples' category. An equal number of samples were analysed per day. Before analysis, the plasma samples were homogenised for 1 min in a portable vortex mixer (Gilson Inc., USA), and then 25 μL of plasma were collected from each sample using a micropipette. The 25 μL plasma volume was then transferred to a clean enzyme-linked immunosorbent assay (ELISA) microplate (96 wells, U-type bottom). Spectrochemical measurements were carried out using an ARCoptix FT-NIR Rocket spectrometer (ARCoptix S.A., Switzerland) in the 900-2600 nm range using a fibre optic positioned onto each ELISA microwell in transflectance mode. Three replicate spectra were collected per sample using an ambient air spectrum as background.
Table 3. Clinical variables of the pregnant patients infected with HIV (n = 39). The categorical data are expressed by absolute (n) and relative (%) frequencies. The continuous data are expressed by median and percentiles 25 and 75 and by mean and standard deviation.
Computational analysis. The spectral data were organized in a matrix form, where the rows contained the sample spectra and the columns contained the spectral variables (absorbance intensities for each wavelength). The data were processed in the MATLAB R2014b environment (MathWorks Inc., USA) using PLS Toolbox version 7.9.3 (Eigenvector Research Inc., USA) and lab-made routines. For supervised classification, the samples' spectra were divided into training (70%, n = 54 [27 HIV-infected, 27 healthy controls]) and test sets (30%, n = 23 [12 HIV-infected, 11 healthy controls]) using the Kennard-Stone (KS) algorithm 18. The training set is used for model construction and the test set for model validation. The models were built by combining feature extraction/selection algorithms with discriminant analysis. Principal component analysis (PCA) 19 was used for feature extraction, while the successive projections algorithm (SPA) 20 and genetic algorithm (GA) 21 were used for feature selection. Discriminant analysis was performed by linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) combined with PCA (PCA-LDA/QDA), SPA (SPA-LDA/QDA) and GA (GA-LDA/QDA). These algorithms were tested independently in order to find the best classification model.
For PCA-LDA/QDA, the PCA scores are used as input variables for LDA or QDA 13. In SPA-LDA/QDA or GA-LDA/QDA, the variables selected by SPA or GA are used as input variables for LDA or QDA 22. The spectral variables are selected in SPA and GA by minimising the cost function G according to the following equation:
$$G = \frac{1}{N_V} \sum_{n=1}^{N_V} g_n \qquad (1)$$
where $N_V$ represents the number of validation samples and $g_n$ is calculated as follows:
$$g_n = \frac{r^2(x_n, m_{I(n)})}{\min_{I(m) \neq I(n)} r^2(x_n, m_{I(m)})} \qquad (2)$$
where $r^2(x_n, m_{I(n)})$ is the squared Mahalanobis distance between the object $x_n$ and the centre of its true category $m_{I(n)}$; and $r^2(x_n, m_{I(m)})$ is the squared Mahalanobis distance between the object $x_n$ and the centre of the closest wrong category $m_{I(m)}$. GA was performed using 100 generations with 200 chromosomes each, and the mutation and cross-over probabilities were set at 10% and 60%, respectively.
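The cost function described above can be sketched in a few lines. This is a minimal illustration rather than the routine used in the study: the inverse covariance matrix is passed in ready-made (which covariance estimate enters the Mahalanobis distance is an assumption left to the caller), and the class centres and validation points are hypothetical.

```python
def mahalanobis_sq(x, centre, cov_inv):
    """Squared Mahalanobis distance (x - centre)^T cov_inv (x - centre)."""
    diff = [xi - ci for xi, ci in zip(x, centre)]
    return sum(
        diff[i] * cov_inv[i][j] * diff[j]
        for i in range(len(diff))
        for j in range(len(diff))
    )


def cost_g(validation, labels, centres, cov_inv):
    """Average of g_n over the validation set (lower means better separation)."""
    total = 0.0
    for x, true_label in zip(validation, labels):
        own = mahalanobis_sq(x, centres[true_label], cov_inv)
        closest_wrong = min(
            mahalanobis_sq(x, centre, cov_inv)
            for label, centre in centres.items()
            if label != true_label
        )
        total += own / closest_wrong
    return total / len(validation)


# Hypothetical two-variable example with an identity inverse covariance.
centres = {"HIV": (0.0, 0.0), "control": (2.0, 0.0)}
identity = [[1.0, 0.0], [0.0, 1.0]]
g_value = cost_g([(0.5, 0.0), (1.5, 0.0)], ["HIV", "control"], centres, identity)
```

Each validation sample contributes the ratio of its distance to its own class centre over its distance to the nearest wrong centre, so a variable subset that pulls the classes apart drives G towards zero.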
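The LDA ($L_{ik}$) and QDA ($Q_{ik}$) classification scores are calculated in a non-Bayesian form as follows 13,23:
$$L_{ik} = (x_i - \bar{x}_k)^{T} C_{\mathrm{pooled}}^{-1} (x_i - \bar{x}_k) \qquad (3)$$
$$Q_{ik} = (x_i - \bar{x}_k)^{T} C_k^{-1} (x_i - \bar{x}_k) \qquad (4)$$
where $x_i$ is a vector containing the input variables for sample i; $\bar{x}_k$ is the mean spectrum of category k; $C_{\mathrm{pooled}}$ is the pooled covariance matrix; and $C_k$ is the variance-covariance matrix for category k.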
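The difference between a pooled and a per-category covariance matrix can be illustrated with a small sketch. Several things here are assumptions: the scores are taken as the quadratic form of the centred sample with the inverse pooled (LDA) or per-category (QDA) covariance, the pooled matrix is computed as the sample-size-weighted average of the category matrices, a sample is assigned to the category with the lowest score, and the two-variable data are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def covariance(rows):
    """Sample variance-covariance matrix of one category (divisor n - 1)."""
    mean = mean_vector(rows)
    p = len(mean)
    return [
        [sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (len(rows) - 1)
         for j in range(p)]
        for i in range(p)
    ]


def inverse_2x2(c):
    """Closed-form inverse of a 2x2 matrix (toy case with two variables)."""
    det = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    return [[c[1][1] / det, -c[0][1] / det],
            [-c[1][0] / det, c[0][0] / det]]


def score(x, mean, cov_inv):
    """Quadratic form (x - mean)^T cov_inv (x - mean)."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    return sum(d[i] * cov_inv[i][j] * d[j] for i in range(2) for j in range(2))


def lda_qda_scores(classes, x):
    """Return ({label: LDA score}, {label: QDA score}) for one sample x."""
    means = {k: mean_vector(v) for k, v in classes.items()}
    covs = {k: covariance(v) for k, v in classes.items()}
    n_total = sum(len(v) for v in classes.values())
    # Pooled covariance as the n_k-weighted average of the category matrices
    # (an assumed weighting).
    pooled = [
        [sum(len(classes[k]) * covs[k][i][j] for k in classes) / n_total
         for j in range(2)]
        for i in range(2)
    ]
    pooled_inv = inverse_2x2(pooled)
    lda = {k: score(x, means[k], pooled_inv) for k in classes}
    qda = {k: score(x, means[k], inverse_2x2(covs[k])) for k in classes}
    return lda, qda


# Hypothetical training data: one tight category and one spread-out category.
classes = {
    "HIV": [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)],
    "control": [(5.0, 5.0), (9.0, 5.0), (5.0, 9.0), (9.0, 9.0)],
}
lda, qda = lda_qda_scores(classes, (4.0, 4.0))
lda_label = min(lda, key=lda.get)  # lowest score wins
qda_label = min(qda, key=qda.get)
```

In this toy case the two rules disagree: with the pooled matrix the sample at (4, 4) is closest to the tight category, whereas the per-category matrices recognise that it lies well within the spread of the other one.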
Model validation. The models' validation performance was evaluated by calculating the accuracy (AC), sensitivity (SENS) and specificity (SPEC) for the test set. AC represents the proportion of all samples correctly classified; SENS represents the proportion of positives correctly classified; and SPEC represents the proportion of negatives correctly classified. These parameters are calculated as follows 24:
$$AC(\%) = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \qquad (5)$$
$$SENS(\%) = \frac{TP}{TP + FN} \times 100 \qquad (6)$$
$$SPEC(\%) = \frac{TN}{TN + FP} \times 100 \qquad (7)$$
where TP stands for true positives, TN for true negatives, FP for false positives, and FN for false negatives.
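As a worked example of these figures of merit, the sketch below computes them from confusion-matrix counts. The counts are an assumption: they are inferred from a test set of 12 HIV-infected and 11 control samples with 3 misclassifications and from the reported GA-QDA percentages, not taken from a published confusion matrix.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity and specificity as percentages."""
    accuracy = 100 * (tp + tn) / (tp + tn + fp + fn)
    sensitivity = 100 * tp / (tp + fn)
    specificity = 100 * tn / (tn + fp)
    return accuracy, sensitivity, specificity


# Assumed GA-QDA test-set counts: 10 of 12 positives and 10 of 11 negatives correct.
ac, sens, spec = classification_metrics(tp=10, tn=10, fp=1, fn=2)
print(f"AC = {ac:.0f}%, SENS = {sens:.0f}%, SPEC = {spec:.0f}%")
# prints: AC = 87%, SENS = 83%, SPEC = 91%
```

With a test set of 23 samples, each misclassification shifts the accuracy by more than 4 percentage points, which is worth keeping in mind when comparing models.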

Statistical analysis.
Descriptive analyses were conducted on the sociodemographic, clinical and biological data of the participants. Categorical variables are reported as absolute and relative frequencies, and continuous variables as mean with standard deviation or median with interquartile range (IQR). The Shapiro-Wilk normality test was applied to verify whether the continuous variables followed a normal distribution. Variables that followed a normal distribution were described by mean and standard deviation (mean ± SD). Student's t-test for independent samples was applied to the continuous variables that showed normality. The chi-square test was used to analyse the association between HIV diagnosis and the categorical variables. Where expected frequencies were below five, Fisher's exact test was applied. The odds ratio with a 95% confidence interval was calculated for the binary categorical variables. The Spearman correlation was used to evaluate the association between the CD4+/CD8+ ratio and the variables CD4+, CD8+ and maximum viral load. The same correlation was also used to associate the CD4+ cell count with the maximum viral load. A significance level of 5% was adopted for all analyses.
Ethical standards. All procedures were performed in compliance with the Declaration of Helsinki.