Fine root lignin content is well predictable with near-infrared spectroscopy

1. Root lignin is a key driver of root decomposition, which in turn is a fundamental component of the terrestrial carbon cycle and increasingly in the focus of ecologists and global climate change research. However, measuring lignin content is labor-intensive and therefore not well-suited to handle the large sample sizes of most ecological studies. To overcome this bottleneck, we explored the applicability of high-throughput near infrared spectroscopy (NIRS) measurements to predict fine root lignin content. 2. We measured fine root lignin content in 73 plots of a field biodiversity experiment containing a pool of 60 grassland species using the Acetylbromid (AcBr) method. To predict lignin content, we established NIRS calibration and prediction models based on partial least square regression (PLSR) resulting in moderate prediction accuracies (RPD = 1.96, R2 = 0.74, RMSE = 3.79). 3. In a second step, we combined PLSR with spectral variable selection. This considerably improved model performance (RPD = 2.67, R2 = 0.86, RMSE = 2.78) and enabled us to identify chemically meaningful wavelength regions for lignin prediction. 4. We identified 38 case studies in a literature survey and quantified median model performance parameters from these studies as a benchmark for our results. Our results show that the combination Acetylbromid extracted lignin and NIR spectroscopy is well suited for the rapid analysis of root lignin contents in herbaceous plant species even if the amount of sample is limited.

roots of mixed grassland communities of the Jena Experiment. The 74 samples spanned a species gradient of 1-16 species from a pool of 61 species with a minimum lignin content of 7.7% and a maximum of 42.8%.
The RSEL of ±3.61% is within the range of other studies reporting RSEL for AcBr based lignin determinations 51,52 . Spectral Pre-processing. We evaluated the effect of six different methods of spectral pre-processing on the accuracies of full spectrum PLSR in 100 repeated independent model validations ( Fig. 1). We based our model evaluation on root mean square error of prediction (RMSEP) and residual predictive deviation (RPD) ( Table 1). Standard normal variate (SNV) pre-processing achieved the overall highest model accuracies with highest RPD and lowest RMSEP values. This pre-processing was therefore selected for further modelling in block II (Fig. 1). Overall accuracies were lowest for multiplicative scatter correction (MSC) yet the high BIAS and RMSEP values indicate a strong offset between predicted and observed lignin content. Irrespective of spectral pre-processing, all full spectrum PLSR-models showed RPD values below 2 and were therefore not promising to accurately predict lignin in quantitative analyses.
Variable selection. The variable selection procedure based on CARS-PLSR significantly improved the accuracies in independent model validation to RPD values above 2.5 ( Table 2). Starting from the full spectrum, www.nature.com/scientificreports www.nature.com/scientificreports/ the RMSE steadily declined during the variable selection process such that the remaining wavelengths are more likely to allow better predictions. Moreover, CARS-PLSR identified wavelength regions, which code for the most important information within the regression procedure and thus potentially facilitate a mechanistic understanding of model results. The best CARS-PLSR model used 16 wavelengths, partly adjacent so that they formed nine clusters over the whole spectral region (as indicated by arrows and black bars in Fig. 2b). All wavelengths with PLS regression coefficients >0 are positively correlated with lignin content, whereas wavelengths with regression coefficients <0 indicate a negative relationship with lignin content and are most likely linkable to other chemical compounds (e.g. cellulose and polyoses). The selected 9 wavelength clusters from the best model can be assigned to local extrema of the regression coefficients obtained through PLSR modeling (Fig. 2a). The relative importance of selected wavelengths in the CARS-PLSR increases from lower to higher wavelengths with local maxima around 1428 nm, 1881 nm and 2274 nm (Fig. 2a) but increasing importance for the prediction of lignin content above the cluster of 1881 nm (Fig. 2b). However, our results also imply that wavelengths below 1881 nm likely have strong supplementary qualities since prediction accuracies strongly decreased if single wavelengths within that region were excluded from our model. Our best model indicated four wavelength clusters, which were positively , spectral analysis (outlined in black boxes and step 5) and statistical analysis. The latter is subdivided in spectral pre-processing (block 1 with steps 6-9) and final analysis (block 2 with steps [10][11][12]. See text for further details.  Table 1. Accuracies for different pre-processing methods achieved in model validation (mean and standard deviation from 100 runs). Given are measures of accuracies as number of latent variables used in PLSR (nLV), residual predictive deviation (RPD), measure of determination between predicted and observed lignin contents (R p ²), root mean squared error of prediction (RMSEP), standard error of prediction (SEP) and deviation from the line of equality of linear regression between predicted and observed values (BIAS). We used raw data (Raw) and six different pre-processing methods: asymmetric least squares baseline offset correction (B.als), iterative restricted least squares baseline offset correction (B.irls), multiplicative scatter correction (MSC), standard normal variate (SNV), first derivative (D1) and second derivative (D2). The best model is highlighted in bold.
The gradient in species richness had no significant negative effect on model residuals in calibration or in validation (Fig. 3).   Table 2. Comparison of accuracies (mean ± SD) achieved for full spectrum PLSR versus best model from variable selection using CARS-PLS resulting from 100 repetitions for the calibration and validation.
Abbreviations are as given in Table 1. In addition we use root mean square error of calibration (RMSE). www.nature.com/scientificreports www.nature.com/scientificreports/ leaves) from the same species (see Supplementary Table S1 for full details of individual studies). The literature survey revealed three critical issues: First, 58% of the case studies analyzed lignin content in woody tissue, where a higher lignin content results in a more pronounced spectral signal. Second, 78% of the case studies were limited to one species and only 8% had more than five species. This minimizes the complexity of spectral signals reported in literature, as interspecific chemical variability is likely to be higher than intraspecific variability in chemical compounds 53,54 . Third, the minimum amount of sample was 100 mg and the median amount 300 mg, which is one order of magnitude higher compared to this study (10 mg). Table 3 summarizes the results of this literature survey as quantile ranges of common measures of accuracy and predictive power (RMSEP, SEP, R 2 and RPD). The median accuracy over these 38 case studies indicates an acceptable quantitative prediction of lignin content in plant tissue with a RPD of 2.45 and R² of 0.83 in validation. In comparison, our best model had a better quantitative prediction (RPD of 2.67) and a higher predictive R² of 0.86.

Discussion
Our study demonstrates that fine root lignin content is well predictable using near infrared spectroscopy. This is true for field samples from monocultures up to 16 species mixtures assembled from a pool of 60 different herbaceous grassland species. Further, we find a higher accuracy in model prediction when combining PLSR with CARS variable selection (RMSEP = 2.82) as compared to models using the full spectrum PLSR (RMSEP = 3.83). All measures of accuracy for lignin prediction in our study were in the range of results from 38 relevant case studies extracted from a literature survey, despite much lower sample volumes and higher sample variation in our study. Thus, we could show that the combination of Acetylbromid extracted lignin and FT-NIR spectroscopy is well suited for the rapid analysis of root lignin contents of herbaceous plant species even if the amount of sample is limited.   Table 3. Summary statistics of a literature survey of studies predicting lignin contents from different plant tissues using spectroscopic methods. Given are minimum, maximum and interquartile ranges of the number of species examined per study (Species), sample volume used for chemical lignin extraction (Lignin) and proportion of samples used for model validation and calibration (Val/Cal). Additional abbreviations as in Table 1. Superscript numbers indicate the number of individual datasets that entered the analysis. See Table S1 for details on individual studies.
www.nature.com/scientificreports www.nature.com/scientificreports/ Fine root lignin content and range. The range of lignin content based on Acetylbromid extraction in our fine root samples (10-43%) exceeds that typically reported in the literature. However, most studies report lignin content in aboveground biomass based on Klason lignin extraction for woody gymnosperms (25-33%) 55 , woody angiosperms (20-25%) 54 or herbaceous angiosperms (3-19%) 51,56 . There are three potential reasons for the discrepancy in lignin ranges between our study and literature values. First, the extraction methods might produce different results. Though Acetylbromid is better correlated to Klason lignin than other extraction methods 51 , there is an offset between both methods for herbaceous species 56 , which might account for part of the range differences we found. Second, roots often have higher lignin content than shoots 57 . The limited data available on lignin content of grass species fine roots suggest a mean of 19.6% ± 1.7%, which is only slightly lower than our 20.5% ± 7.7% for mixed herb species 7,57 . Third, our samples most likely contained a larger proportion of partially degraded fine root debris for which lignin content is known to be higher 47 . The field samples of mixed species were harder to sort for dead roots given the different color and texture of roots of different species. This problem might have been less pronounced for the data on individual species root lignin content published so far.
Spectral pre-processing. We tested six different types of spectral pre-processing and raw spectra to determine the most adequate method for our sample type. We found that standard normal variate (SNV) and multiplicative scatter correction (MSC) obtained the highest prediction accuracies according to residual predictive deviation (RPD) and R² in validation. Both types of spectral pretreatments have been specifically designed for correcting NIR spectra of scattered radiation 58 and have been widely applied in NIR spectroscopy 32 . Yet, Yet, error terms of prediction (RMSEP and BIAS) were higher in MSC compared to SNV in MSC compared to SNV. This difference can be traced back to the fact that we performed an independent spectral pre-processing for the validation and calibration set. Since MSC relies mathematically on the mean spectrum of the entire dataset 59 , any pre-processing results diverge for calibration and validation datasets. Thus, prediction accuracies widely change for individual random draws of validation and calibration data. In contrast, SNV acts on the individual spectrum only, which makes it less dependable on pre-processing results 60 . In addition, our results reveal that first and second derivative standardization, which have been found valuable in previous studies 45,46,61 did not result in enhanced accuracies compared to using unprocessed spectra (raw spectra). This might indicate relatively minor contributions of spectral signal distortion in the data. While it remains difficult to a priori select the best pre-processing method 43 , the case of MSC revealed that it is possible to choose an incapable one.

Selection of key wavelength-clusters based on CARS-PLS.
We compared model accuracies between partial least square regression (PLSR) models and the combination of PLSR models with a predictor selection algorithm (competitive adaptive reweighted sampling, CARS). Our results indicate a significant increase in model validation accuracies through variable selection with CARS for all measures of model accuracy. Most likely, CARS predictor variable selection reduces the noise caused by non-informative variables for each individual model run. These findings are in line with other studies combining PLS based models with the CARS framework for variable selection 27,37,38,40 , also reporting an enhanced model performance compared to using the full set of predictors from PLSR models alone.
In addition to increased model accuracies, the selection of 9 clusters of wavelengths from the best model might also enhance our mechanistic understanding of spectral lignin prediction. The finding that wavelength cluster 1, 3, 6 and 8 (1243 nm, 1428 nm, 1881 nm and 2274 nm) are positively correlated with lignin content (indicated by positive regression coefficients in PLS) is in line with previous studies where these wavelengths seemed indicative of lignin or lignified structures. Li et al. 50 also found the wavelength at 1243 nm to be relevant for predicting lignin content, but in contrast to our study, identified a larger set of important wavelengths (20), many of which located in the spectral region between 1450 nm and 1700 nm. The absorption at 1428 nm is often related to the first overtone of phenolic O-H stretching in lignin [62][63][64] . The cluster at 1881 nm is ascribed to C-H 62 and O-H stretching and deformation 65 and is associated with lignin in general 62 . The cluster around 2270 nm may also arise from O-H stretching overtones in lignin 63,64 . Thus, all selected positively correlated wavelength clusters can be directly linked to lignin or chemical structures related to lignin, which further strengthens the applicability of near-infrared prediction of fine root lignin content.
Similarly, the wavelength clusters 2, 4, 5, 7 and 9 (1409 nm, 1715 nm, 1735 nm, 2035 nm and 2437 nm) which were negatively correlated with lignin content, have previously been linked to chemical groups with some ecological counterpart or trade-off towards fine root lignin content.
During lignification, the protein containing, holocellulosic fiber scaffolding of the compound middle lamella and the secondary cell wall are incrusted with lignin 66 . This process increases the mechanical stability of lignified tissue against high compression and capillary forces. In addition, lignification enhances the recalcitrance of plant tissue against the alkaline soil solution, root exudates and exoenzymes from bacterial and fungal communities 5,67 . According to Schwanninger et al. 30 wavelengths around 1715 nm and 1735 nm result from overtone C-H stretching vibrations in polyoses and cellulose. The wavelength cluster around 2035 nm could be connected to amide-1 stretching in peptides and proteins 63,64,68 or similarly to C=O stretching vibration from acetyl groups in polyoses 50 . Bands at 2437 nm probably indicate α-helical protein structures 63 . Thus, the wavelength clusters 4, 5 and 7, which are negatively correlated to root lignin in our study, are indicative of cellulose, polyoses and proteins, indicating a trade-off well known from literature 69 . This trade-off may be explained by a substrate competition for the biosynthesis of cellulose and protein in a living cell, versus biosynthesis of lignin and the subsequent death of the cell 70 .
A potentially competing chemical group, providing similar functionality to the cell as lignin, are suberins. Suberins and associated waxes are important hydrocarbon-containing biopolymers that also increase the mechanical stability of plant tissue yet without impairing the water absorption capacity and plasticity of the root (2019) 9:6396 | https://doi.org/10.1038/s41598-019-42837-z www.nature.com/scientificreports www.nature.com/scientificreports/ tissue. Our wavelength cluster 2 (around 1409 nm) has been connected to C-H stretching and deformation in aliphatic hydrocarbons 64,65 and hydrocarbon-containing plant waxes 68 as well as to O-H deformations in extractable alcohols 30,64,65 . Again, there might be a trade-off in allocating carbon to enhance tissue stability via either lignin or suberin and associated waxes. Suberins occur in the secondary walls of endodermal and hypodermal cells of primary roots, while the incorporation of waxes into the suberin polymer is limited to root tissues with secondary growth 71 . Yet, monocotyledonous roots, such as grasses, have no secondary growth and should invest into higher lignin concentration at the expanse of wax contents. Indeed, Zeier and Schreiber 72,73 could show an inverse relationship between endodermal lignin and suberin content in five monocotyledonous species. In addition, Chen et al. 11 could show that grass containing plots from the same experiment had significantly higher lignin contents and lower nitrogen contents in fine roots than legume containing plots.
Overall, our results show that wavelength assignments in NIRS have the potential to reveal insightful information regarding the underlying chemistry. Wavelengths with high explanatory power in the regression model provide us with background knowledge on the chemical groups or bindings, which absorb at distinct wavelengths 74 . However, this is not a trivial task in complex lignocellulosic materials 30 since exact band positions depend on different equilibrium moisture contents 75 and thus intrinsically on the chemical composition of constituents.
Model design and best model outcome. The aim of our study was to develop a model to predict fine root lignin content, which is well applicable to a larger range of herbaceous species. For this reason, we put a strong focus on evaluating the predictive power of our models. Consequently, we assigned a much larger than usual proportion of samples to model validation. In addition, we used repeated outer validation instead of single cross-validation, as this allows for a more realistic determination of uncertainty when predicting unknown samples 76 . Both procedures increase the reliability of our models yet at the cost of enhanced error terms especially given the overall limited number of samples.
Despite this conservative procedure our best model has a better quantitative prediction (higher RPD value of 2.67) and a higher predictive R² of 0.86 compared to the median of 38 case studies extracted from our literature survey (RPD = 2.45, R² = 0.83). In contrast, our measures of error of prediction (RMSE and SEP) are higher than the 75% quantile range of literature studies. This discrepancy is due to the fact that our data set comprised a much larger sample variability due to the higher number of species. We are aware of only one other study 47 with comparable sample complexity using leaves, leaf litter and organic residue of 31 tree species to predict lignin content via near infrared spectroscopy (SEP = 5, R² = 0.76 and RPD = 2.1). Similar results (RMSEP = 5.5 and R² = 0.5) were achieved by Kelly et al. 77 when predicting lignin content for 14 agricultural species used for fiber production, indicating that in fact sample heterogeneity might well determine the error of model prediction for lignin. This phenomenon is well known for model prediction via spectral analysis for biological applications 78 . However, Petisco et al. 46 found that lignin content in tree leafs from 17 species was well predictable (SEP < 1), but in that study lignin content was spanning a smaller range and standard deviation then realized in our study. Overall, our results reveal that root lignin in a large number of herbaceous species is at least as well predictable as the median of studies predicting lignin content in one or few mostly woody or crop species.

Conclusion
Our study explored -for the first time -the applicability of NIR spectroscopy to determine AcBr extracted root lignin content in mixtures of up to 60 grassland species. We used CARS-PLS to select the most relevant wavelengths for root lignin prediction and concurrently increased prediction accuracy compared to PLS models based on full spectrum information. Despite the large number of species and the limited amount of sample available in our study, we found higher prediction accuracies than the median of 38 case studies extracted from a literature survey. Thus, we conclude that predicting AcBr extracted root lignin from NIRS spectroscopy shows great potential to overcome the limitation of large sample sizes as commonly examined in ecological studies.

Material and Methods
Sampling. The Jena Experiment is a biodiversity experiment studying the effects of plant species and functional group richness on ecosystem functions. 60 mesophilic central European grassland species were classified into four functional groups: grasses (16 species), legumes (12 species), small herbs (12 species) and tall herbs (20 species). Detailed information about the design and main results of 15 years Jena Experiment is given in Weisser et al. 79 . For the current study, we collected roots of plant communities from 76 experimental plots, spanning the existing gradient of species richness (1,2,4,8,16) and functional group richness (1,2,3,4). Community root collection is explained in detail in Chen et al. 11 . In brief, we excavated soil volumes with a surface area ranging from 20 × 10 cm to 40 × 15 cm and a depth of 20 cm depending on the spatial extent of the respective root biomass. Roots were soaked in tap water prior to washing them over a 630 µm-mesh sieve and removing coarse roots (>2 mm), debris and highly decomposed roots before drying (65 °C). Three samples did not contain enough fine root material and had to be excluded from further analysis, leaving 73 samples for this study. In addition, we included one sample of standardized Lolium perenne L. roots. We used this as a reference sample of young and fully clean grass roots contrasting our field root samples with low diameter but potentially higher age (and lignification) as well as partly decaying roots or soil residue. L. perenne plants were cultivated in aeroponics in the greenhouse of the Botanical Garden of Leipzig University over 4 months. Newly grown roots and shoots were cut to 3 cm length every 4 weeks and roots were dried (65 °C) and stored. Dry L. perenne roots from all harvests were shredded and thoroughly mixed before grinding multiple smaller portions of the large sample volume with a vibratory ball mill (MM 400, Retsch Technology GmbH, Germany). After grinding, the powder was again carefully mixed and oven dried again (70 °C, 48 hours). Our overall sample number was therefore 74 (73 field samples plus L. perenne). Moreover, we used homogenized root biomass as internal reference standard to test replicability and recovery rates of chemical analysis. This reference sample was comprised of a mixed field community root www.nature.com/scientificreports www.nature.com/scientificreports/ sample bulked from the 50 largest field samples described above. This sample contained a large variety of grassland species and was extracted and measured with each analytical run. Fig. 1 depicts the chemical analysis where individual steps are indicated with numbers in square brackets. We refer to these numbers in the following text. Dried root samples were ground with a vibratory ball mill (MM 400, Retsch Technology GmbH, Germany) and oven dried again (70 °C, 48 hours) before further processing ([1], Fig. 1). For liquid-solid pre-extraction we extracted up to 50 mg of sample with 12 mL solvent (acetone: ethanol: water; 5:3:2 volume) at 70 °C for 150 minutes, turning the tubes regularly. The extracted samples were centrifuged and washed three times before drying (70 °C, 48 hours). For lignin extraction, we used the acetyl bromide (AcBr) extraction as described by Iiyama & Wallis 24 , but avoided to use 70% perchloric acid that causes the formation of hydrobromic acid and unwanted acid catalyzed, chromophor-forming oxidation of polysaccharides [2]. In brief, we extracted 10 mg sample with 5 mL 25% (vol:vol) solution of AcBr in glacial acetic acid and heated the vials in an oil bath (70 °C, 60 min) with regular shaking to promote sample digestion. We chilled samples on ice (15 min), equilibrated to room temperature (30 min) and centrifuged. To mask strongly absorbing polybromide anions, 1 mL of the supernatant was diluted in 1 mL of 2 N NaOH and 8 mL glacial acetic acid. We included microcrystalline cellulose (Sigma-Aldrich, USA) as control 80 . Finally, we measured 3 mL of the sample solution at 280 nm in a spectrophotometer (Jasco V730, Jasco Labor-u. Datentechnik GmbH, Germany) to determine the specific absorption coefficients (SAC) [3]: To translate SAC into lignin content we used the reference sample from mixed field roots described above. Two subsamples individually underwent the same procedure as described above. In addition, we purified and isolated these reference samples as described in detail by Fukushima & Dehority 81 . For the calibration curve we diluted 10-750 µL extracted lignin aliquots in 8 mL masking solution, made up to 10 mL with blank solution (25% AcBr in acetic acid) and measured at 280 nm as detailed above. The lignin content (L) of all samples was calculated using regression eq. (2) [4]. . Transmission (T) was converted to absorbance via log 10 (1/T). Each sample was measured 5 times; spectra were subsequently averaged. Samples were shaken between each replicate to ensure a spectral recording of the sample-inherent variability [5].

Chemical analysis. The left box of
Statistical analysis. We used partial least squares regression (PLSR) to relate measured spectra of 74 collected samples (each with a total of 662 spectral bands as predictor variables) to the measured lignin contents of these samples (response variable) [5]. We used PLSR as the most widespread multivariate calibration method. PLSR is capable of handling a high degree of collinearity in the predictor variables 82 as well as a small number of response samples in relation to the number of predictors 33 . In addition, we combined full spectrum-PLSR with the CARS algorithm, introduced and described in detail by Li et al. 37 . In short, CARS, aims at selecting key wavelengths in a computationally efficient way. The selection procedure in CARS is largely based on the PLS regression coefficients and consists of two major steps. In the first step an exponential decreasing function (EDF) defines the number of selected spectral variables. As a result, the number of kept variables decreases exponentially from one sampling run to the next and wavelengths linked to small absolute PLS regression coefficients are removed before the second step in each sampling run. This second step, also referred to as adaptive reweighted sampling (ARS), uses the absolute regression coefficients to define the probability of a single variable to be drawn in random sampling procedure. As a result, variables with low probabilities might not be drawn and are thus also removed from potential variable space. After a defined number of sampling runs the algorithm choses the overall best model. In this study, PLSR modelling was divided into two blocks [block I and II, Fig. 1]. In the first block, we used random sampling to divide the 74 spectra into 44 spectra for model building (calibration data) and 30 spectra for an outer model validation [6]. We sampled randomly 100 times, saved these data splits and reused them in all following steps. To mirror the operational application of NIR spectroscopy for predicting fine-root lignin from unknown field samples, using an already established prediction model, calibration and prediction data sets were pre-processed separately [6]. To test the effects of different pre-processing procedures, we used raw data as well as six pre-processing methods: asymmetric least squares baseline offset correction (B.als) 83 , iterative restricted least squares baseline offset correction (B.irls) 84 , multiplicative scatter correction (MSC) 58 , standard normal variate (SNV) 85 , first derivative (D1) and second derivative (D2) 43 [7]. Based on the differently pre-processed calibration data set the PLS regression models were computed and cross validated (ten-fold) to avoid model overfitting and to determine the optimal number of latent variables (nLV) 33 [8]. Then all final models were applied to predict the respective prediction data [9]. From Block I we analyzed the obtained prediction accuracies averaged from 100 data splits and selected the pre-processing method with the highest accuracy for further analysis in Block II [10].
www.nature.com/scientificreports www.nature.com/scientificreports/ In the second block of model building, we used the CARS-PLSR approach as described above. We used the same data splits as in Block I to allow a direct comparison with the results of full spectrum-PLSR [10].
We describe model quality using the following indices: coefficient of determination for prediction (R P 2 ) , the root mean square error of cross validation (RMSECV), the root mean square error of prediction (RMSEP), the standard error of prediction (SEP) and the residual predictive deviation (RPD), calculated from standard deviation of the prediction data set and SEP. We consider RPD values >1.5-2 as good for preliminary screenings and initial predictions 86 , RPD values >2-2.5 as acceptable for quantitative predictions, values > 2.5-3 as good and values >3 as excellent for predictions 50 . We conducted all statistical analyses in the statistics software R (version 3.3.1) 87 using the packages: baseline for B.als/B.irls corrections 84 , pls 88 , carspls 37 (downloadable at: https://code. google.com/archive/p/carspls/downloads) and prospectr for SNV and MSC 89 .

Data Availability
The data used in this article is accessible via https://doi.pangaea.de/10.1594/PANGAEA.895501.