Abstract
Highthroughput phenotyping (HTP) technologies can produce data on thousands of phenotypes per unit being monitored. These data can be used to breed for economically and environmentally relevant traits (e.g., drought tolerance); however, incorporating highdimensional phenotypes in genetic analyses and in breeding schemes poses important statistical and computational challenges. To address this problem, we developed regularized selection indices; the methodology integrates techniques commonly used in highdimensional phenotypic regressions (including penalization and rankreduction approaches) into the selection index (SI) framework. Using extensive data from CIMMYT’s (International Maize and Wheat Improvement Center) wheat breeding program we show that regularized SIs derived from hyperspectral data offer consistently higher accuracy for grain yield than those achieved by standard SIs, and by vegetation indices commonly used to predict agronomic traits. Regularized SIs offer an effective approach to leverage HTP data that is routinely generated in agriculture; the methodology can also be used to conduct genetic studies using highdimensional phenotypes that are often collected in humans and model organisms including body images and wholegenome gene expression profiles.
Introduction
Highthroughput phenotyping (HTP) technologies have been adopted at a fast pace in agriculture; applications range from the use of HTP in highly controlled environments (e.g., growth chambers^{1}) to extensive HTP using sensing devices mounted on aerial (e.g., hyperspectral cameras mounted on aerial vehicles^{2}) and terrestrial equipment such as tractors and combine harvesters^{3}. Modern agricultural production systems use HTP data to optimize management practices^{4}, forecast agricultural outputs^{5} and to assess the quality (e.g., protein content) of agricultural commodities^{6}. HTP data can also be a valuable input for breeding programs. For instance, extensive HTP may enable an expansion of genetic testing that can lead to higher intensity of selection and faster genetic progress. Moreover, HTP data may offer opportunities to improve traits such as drought tolerance that are otherwise difficult to measure and breed for.
Sensors can generate data on hundreds or thousands of phenotypes per unit being monitored. For example, hyperspectral cameras can generate reflectance of electromagnetic power at hundreds of wavelengths in the visible and infrared spectrum. These measurements can be considered as indicator phenotypes that can be used to predict other traits. An extensive body of research deals with the use HTP data to predict phenotypes such as grain yield^{5,7,8,9}, dry matter^{3}, oil and protein content^{10,11}. However, there has been much less research on how to integrate HTP data in genetic studies and in breeding schemes. In genetics, the problem of predicting the genetic merit of a target trait given a set of correlated phenotypes was first addressed by Smith^{12} and Hazel^{13} who introduced the concept of selection index (SI) in plant and animal breeding, respectively.
A SI seeks to improve a target trait y_{i} (e.g., grain yield) using information from another set of measured traits (e.g., hyperspectral image data). A linear SI is a weighted sum of the measured phenotypes with weights derived to maximize the correlation between the genetic merit for the selection target and the SI. Thus, the SI methodology offers a natural framework for integrating HTP data into breeding decisions. However, when the measured phenotype is highdimensional, the naïve application of the SI can lead to overfitting and suboptimal accuracy of indirect selection.
To address this problem, we developed regularized selection indices (including penalized and reducedrank methods) that are tailored to achieve accurate prediction of genetic values using highdimensional phenotypes. The proposed methodology integrates into the SI framework methods often used to prevent overfitting in highdimensional phenotypic regressions^{14}. Using extensive multienvironment crop imaging data from CIMMYT’s wheat breeding program we show that regularized SIs offer improved accuracy of indirect selection in both optimal and stress environments.
Results
A selection index is a linear combination of \(p\) measured phenotypes, x_{i} = (x_{i1}, …, x_{ip})′, of the form \({I}_{i}={{\boldsymbol{x}}}_{i}^{{\rm{{\prime} }}}{\boldsymbol{\beta }}\), where β = (β_{1}, …, β_{p})′ is a vector of regression coefficients whose entries define the weights of each of the measured phenotypes in the SI. In a standard SI the weights are derived by minimizing the expected squared deviation between the genetic merit for the selection goal (\({g}_{{y}_{i}}\), e.g., the genetic merit for grain yield of the \({i}^{{\rm{th}}}\) genotype) and the index, that is:
The solution to this optimization problem is (see Methods section):
where \({{\boldsymbol{G}}}_{x,y}={\mathbb{E}}({{\boldsymbol{x}}}_{i}\,{g}_{{y}_{i}})=({G}_{{x}_{1},y},\ldots ,{G}_{{x}_{p},y}){}^{{\rm{{\prime} }}}\) is a vector containing the genetic covariances between the selection objective (\({y}_{i}\)) and each of the measured traits (\({{\boldsymbol{x}}}_{i}\)), and \({{\boldsymbol{P}}}_{x}\) is the (population) phenotypic variancecovariance matrix of the measured phenotypes, that is, \({{\boldsymbol{P}}}_{x}={\mathbb{E}}({{\boldsymbol{x}}}_{i}\,{{\boldsymbol{x}}}_{i}^{{\rm{{\prime} }}})={\rm{C}}{\rm{o}}{\rm{v}}({{\boldsymbol{x}}}_{i},{{\boldsymbol{x}}}_{i}^{{\rm{{\prime} }}})\). Thus, a standard SI takes the form \({I}_{i}={{\boldsymbol{x}}}_{i}^{{\rm{{\prime} }}}{{\boldsymbol{P}}}_{x}^{1}{{\boldsymbol{G}}}_{x,y}\). The theory underlying the derivation of SIs and response to indirect selection is well established^{15,16}.
The SI is by construction the best linear predictor (BLP) of the genetic merit for the selection goal; this property holds when \({{\boldsymbol{G}}}_{x,y}\) and \({{\boldsymbol{P}}}_{x}\) are known. However, when the number of measured phenotypes is large, errors in the estimation of \({{\boldsymbol{P}}}_{x}\) and \({{\boldsymbol{G}}}_{x,y}\) may lead to overfitting and suboptimal accuracy of indirect selection.
Regularized selection indices
Reducedrank (e.g., principal components methods) and penalized regression^{14} are two approaches commonly used to confront overfitting in highdimensional regression problems. These methodologies were developed for regression problems involving an observable phenotype (\({y}_{i}\)). In the SI, the response (\({g}_{{y}_{i}}\)) is unobservable; however, the same principles that are applied in phenotypic reducedrank and penalized regressions can be integrated into the SI framework.
Reducedrank selection indices
In principal components (PC) regression, the response is regressed on a reduced number (\(q < p\)) of PCs extracted from a set of predictors (\({{\boldsymbol{x}}}_{i}\)); the same concept can be used to derive a reducedrank SI. For instance, one can extract a reduced number of PCs from a crop image and the resulting PCs can be used as ‘measured traits’ in Eq. (1). The solution of Eq. (1) will render estimates of the regression coefficients for the PCs, which can be transformed back to coefficients applicable to the measured traits (see Methods). Thus, a reducedrank SI (referred to as PCSI) can be derived following these steps: (i) extract, using the singular value decomposition, \(q\) PCs from the matrix containing the measured phenotypes, (ii) estimate the genetic covariances between the first \(q\) PCs and the selection objective, (iii) use these estimated (co)variances to derive coefficients associated with the top \(q\) PCs; finally, (iv) transform these coefficients into coefficients for the measured phenotypes. This process can be done using \(q=1,2,\ldots ,p\) PCs (\(q=p\) renders the standard SI). For the sequence of estimates (\({\hat{{\boldsymbol{\beta }}}}^{(1)},{\hat{{\boldsymbol{\beta }}}}^{(2)},\ldots ,{\hat{{\boldsymbol{\beta }}}}^{(p)}\)), one can evaluate the accuracy of indirect selection of the resulting SI and an optimal rank for the PCSI can be chosen to maximize the accuracy of indirect selection.
Penalized selection indices
In a penalized regression, regularization is achieved by including in the objective function a penalty on model complexity. In the context of a SI, we have
where \(\lambda \) is a penalty parameter (\(\lambda =0\) yields the coefficients for the standard SI) and \(J({\boldsymbol{\beta }})\) is a penalty function. Commonly used penalties include the L2 \(({{\boldsymbol{\beta }}}_{2}^{2}=\mathop{\sum }\limits_{j=1}^{p}{\beta }_{j}^{2})\) and L1 \(({{\boldsymbol{\beta }}}_{1}=\mathop{\sum }\limits_{j=1}^{p}{\beta }_{j})\) norms^{17}, or a weighted sum of the two^{18}.
Using \(J({\boldsymbol{\beta }})=\frac{1}{2}\mathop{\sum }\limits_{j=1}^{p}{\beta }_{j}^{2}\) in Eq. (3) renders a Ridgeregressiontype PSI (RRPSI, see Methods):
where \({\boldsymbol{I}}\) is a \(p\times p\) identity matrix. The RRPSI (referred to as the L2PSI) yields shrunken estimates of the regression coefficients.
In many applications, variable selection (i.e., a SI that is a function of a subset of the measured phenotypes) may be desirable. This property can be obtained using penalties involving the L1norm, either alone, \(J({\boldsymbol{\beta }})=\mathop{\sum }\limits_{j=1}^{p}{\beta }_{j}\) (LASSO^{19}), or in combination with the L2norm, \(J({\boldsymbol{\beta }})=\frac{1}{2}(1\alpha )\mathop{\sum }\limits_{j=1}^{p}{\beta }_{j}^{2}+\alpha \mathop{\sum }\limits_{j=1}^{p}{\beta }_{j}\) (elasticnet^{18}). Unlike the L2PSI, the LASSO and elasticnet SIs (hereinafter referred to as L1PSI and ENPSI, respectively) do not have a closedform solution^{14}. However, solutions for PSIs involving an L1penalty can be obtained using iterative procedures such as the coordinate descent^{20} and the least angle regression^{21} (LARS) algorithms (see Methods). As with the PCSI, an optimal PSI can be obtained by choosing the values of the regularizing parameters \((\lambda ,\alpha )\) that maximize the accuracy of indirect selection.
Accuracy of indirect selection
Indirect selection accuracy is defined as the correlation between the index used to rank genotypes and the genetic merit of the selection objective, that is, \(Acc(I)=cor({I}_{i},{g}_{{y}_{i}})\). This parameter is equal to the product of the square root of the heritability of the SI (\({h}_{I}\)) times the genetic correlation between the SI and the selection target, \(cor({g}_{{I}_{i}},{g}_{{y}_{i}})\)^{16}. To avoid estimation bias \(Acc(I)\) must be estimated using data that was not used to derive the coefficients of the index (Fig. 1); therefore, in the application presented below we: (i) partitioned the data into training and testing sets, (ii) derived the coefficients of the SI in the training set, (iii) applied these coefficients to image data of the testing set (\({I}_{i}={{\boldsymbol{x}}}_{i}^{{\rm{{\prime} }}}{\boldsymbol{\beta }}\)), and (iv) estimated \({h}_{I}\), \(cor({g}_{{I}_{i}},{g}_{{y}_{i}})\), and \(Acc(I)\) in the testing set. Furthermore, we quantified the efficiency of indirect selection relative to mass phenotypic selection (RE) using \(RE=\frac{{h}_{I}}{{h}_{y}}cor({g}_{{I}_{i}},{g}_{{y}_{i}})\)^{16}.
Regularized selection indices for wheat grain yield using hyperspectral image data
We applied the methodology described in the previous section to data (n = 3,276) from the CIMMYT’s Global Wheat Program consisting of grain yield (ton ha^{−1}) and hyperspectral image data. The data were collected at CIMMYT’s experimental station in Ciudad Obregon, Sonora, Mexico (27°20′N, 109°54′W, 38 masl) from 39 yield trials in which a total of 1,092 genotypes were tested. Rainfall in Obregon is very limited; therefore, four different environments were generated representing a combination of planting methods (Flat or Bed), controlled irrigation (minimal, 2 or 5 irrigations), and planting dates (optimum or earlyheat). As expected, average yield decreased as drought stress intensity increased (see Table 1 and Supplementary Fig. S1 for boxplots of yield by environment).
Image data was collected using an infrared and an hyperspectral camera and consisted of reflectance of electromagnetic power at 250 wavebands within the visible and nearinfrared spectrums (392–850 nm). Separate images were collected at 9 timepoints covering vegetative (VEG), grain filling (GF), and maturity (MAT) stages of the crop (see Supplementary Fig. S2). Grain yield and image data were preadjusted using mixedeffects model that accounted for genotype, trial, replicate, and subblock (see Methods section).
Regularization improves the heritability and the accuracy of the index
To assess the effect of regularization on the accuracy of indirect selection we fitted an L1PSI over a grid of values of the regularization parameter (\({\lambda }^{(1)} > {\lambda }^{(2)} > \ldots > 0\) in Eq. (3), using \(\lambda =0\) renders a standard SI). For each of the solutions (\(\hat{{\boldsymbol{\beta }}}({\lambda }^{(1)})\), \(\hat{{\boldsymbol{\beta }}}({\lambda }^{(2)})\), …) we estimated the heritability of the resulting index and the genetic correlation between the index and the selection target, and from those estimates we derived the accuracy of indirect selection. The same approach was used to evaluate the accuracy of indirect selection of PCSIs with a varying number (1, 2, …) of PCs.
We first fitted PSIs and PCSIs using data from a single timepoint; the results from the latest timepoint (corresponding to MAT or late GF stages depending on the environment) are presented in Fig. 2 (see Supplementary Figs. S3S5 for other timepoints). The heritability of the L1PSI (Fig. 2A) decreased as more bands became active in the index. Likewise, the heritability of PCSI (Fig. 2B) decreased with the number of PCs used. However, the genetic correlation increased as either more bands become active in the L1PSI or more PCs were used in the PCSI. Consequently, the maximum accuracy of indirect selection was achieved with a SI of intermediate complexity (with anywhere between 20 and 60 of the 250 bands being active in the L1PSI, and between 20–60 PCs in the PCSI). Results for other timepoints and environments (Supplementary Figs. S3–S5) exhibited similar patterns with some differences between environments. The accuracy of indirect selection of the optimal L1PSI was always close to that of the optimal PCSI and that of the optimal L2PSI (Supplementary Table S1). Importantly, in all cases the accuracy of indirect selection of the optimal regularized SIs was considerably higher than that of the standard SI, which is the one corresponding to 250 active bands or 250 PCs (i.e., the rightmost results in the plots in Fig. 2).
Figure 3 displays the accuracy of indirect selection across timepoints for the optimal (i.e., the one with the highest accuracy of indirect selection) L1PSI and PCSI. For comparison we also display in the plot the accuracy of indirect selection achieved by a standard SI (in green). The estimated 95% confidence intervals of the accuracy of the regularized SIs (either PCSI or L1PSI) are all above (and do not overlap) with the confidence intervals for the accuracy of the standard SI, except for a single timepoint (57 DAS in environment Bed2IR). Results from Tukey’s Honest Significance Difference confirmed that the accuracy of the regularized SIs is statistically different (higher) than the standard SI at a 5% of significance (Supplementary Table S1) for all but one (57 DAS in environment Bed2IR) timepoint/environment. Regularization increased the selection accuracy across timepoints and environments. Regularized SIs (either PCSI or L1PSI) had an accuracy of indirect selection that was in average 10–40% higher than the accuracy achieved by a standard SI. These gains in accuracy were stronger in the optimal environment (Bed5IR with a median of 36%) and smaller in the stressed environments (FlatDrought and BedEHeat with a median of 16%). Interestingly, there were no sizable differences between the accuracy of indirect selection achieved with the optimal L1PSI and that of the optimal PCSI. Compared with a standard SI, regularized SIs had higher heritability (Supplementary Fig. S6); this was achieved without compromising the genetic correlation (Supplementary Fig. S7), thus leading to a higher accuracy of indirect selection achieved by either penalization or rankreduction strategies.
Using data from multiple timepoints further improves selection accuracy
The results presented above were based on data from a single timepoint. We also generated selection indices using data from multiple timepoints (in this case, \({{\boldsymbol{x}}}_{i}\) was a vector containing 2,250 traits, corresponding to 250 wavebands measured at each of 9 timepoints). Integrating data from multiple timepoints further increased the accuracy of L1PSI by a margin that ranged from 1 to 8 points on the correlation scale (Table 2). The gains in selection accuracy obtained using data from multiple timepoints were more evident in environments with lower accuracy; similar results were obtained for the PCSI and L2PSI (Supplementary Table S1).
L1penalization leads to sparse selection indices
Figure 4 shows a heatmap for the solutions of the optimal L1PSI that integrated data from the 9 timepoints. Each panel represents an environment, horizontal bands represent timepoints. Within each timepoint wavebands not entering in the solution are in grey and nonzero coefficients are represented in a yellowred scale (red indicates large absolutevalue coefficients). The wellirrigated environments (Bed5IR and BedEHeat) had considerably sparser indices with only a reduced number of wavebands in the solutions; these were mostly located in the violet, blue and red regions of the spectrum. In stressed environments (FlatDrought and Bed2IR) there were also a few wavebands in the green and infrared regions that were active. In all the indices, there were wavebands from several timepoints that were active in the optimal solution, suggesting that data from both early and late phenological stages are informative about the genetic merit for grain yield.
Comparison with phenotypic prediction
We compared the accuracy of indirect selection of the PSI and PCSI with vegetation indices and penalized phenotypic prediction. Vegetation indices are often used to predict yield^{22}, biomass, and chlorophyll content^{23,24}. We considered two vegetation indices: the Red and Green Normalized Difference Vegetation Indices (RNDVI^{25} and GNDVI^{26} respectively). For each of these indices we estimated the genetic correlation with grain yield, as well as their heritability and accuracy of indirect selection (Supplementary Table S1). Overall, the accuracy of indirect selection of the GNDVI and RNDVI was lower than the one achieved with a PSI (the average difference in accuracy between RNDVI and the L1PSI varied by environment from 0.02 to 0.14 points in correlation, Supplementary Table S1, in favor of the L1PSI). The heritability of the GNDVI and RNDVI was similar and superior in some cases to that of the L1PSI (Supplementary Fig. S6); however, the genetic correlation between the vegetation indices and grain yield was (in most timepoints and environments) lower than the genetic correlation between the L1PSI and grain yield (Supplementary Fig. S7). Thus, the main driver of the difference in accuracy between the L1PSI and the vegetation indices was the difference in genetic correlation.
We also fitted L1penalized phenotypic prediction (L1Phen) and compared the accuracy of indirect selection of these phenotypic prediction methods with that of penalized SIs. Overall, the L1Phen achieved an accuracy of indirect selection very close to that of the L1PSI (Supplementary Table S1); however, in a few environments at some timepoints, the L1PSI achieved a higher accuracy of indirect selection than the phenotypic prediction.
Discussion
Highthroughput phenotyping has been extensively adopted in agricultural research and commercial production. Extracting interpretable information from HTP data poses important statistical challenges. The clear majority of research in this area has focused on calibrating equations to predict phenotypes (e.g., total biomass, grain yield) using HTP data as inputs. This approach is wellsuited for phenotypic prediction; however, the same approach can be suboptimal for selection because the best predictor of a phenotype is not always the best predictor of the genetic merit of the same trait.
The best (linear) phenotypic predictor is the sum of the best linear predictor of the genetic merit (\({g}_{y}\)) plus the best linear predictor of the environmental term (\({\varepsilon }_{y}\)), that is, \({\mathbb{E}}(y{\boldsymbol{x}})={\mathbb{E}}({g}_{y}{\boldsymbol{x}})+{\mathbb{E}}({\varepsilon }_{y}{\boldsymbol{x}})\). The first term, \({\mathbb{E}}({g}_{y}{\boldsymbol{x}})\), is the SI and it is, by construction, maximally correlated with the genetic merit. The second term, \({\mathbb{E}}({\varepsilon }_{y}{\boldsymbol{x}})\), is relevant for phenotypic prediction but represents noise when the problem is that of selecting the best genotypes.
Selection indices exploit genetic covariances, while phenotypic prediction relies on phenotypic covariances between the selection target and the measured phenotype (e.g., crop imaging). Thus, the two methods yield different results whenever the patterns of phenotypic correlations are sufficiently different from the patterns of genetic correlations. In our data set, environmental conditions were highly controlled, with relatively low uncontrolled withintrial variability in environmental conditions. Consequently, the patterns of phenotypic and genetic correlations were very similar (see Supplementary Fig. S8). This was true for many timepoints and environments but not in others (e.g., 80, 85 and 93 DAS in FlatDrought, and 90 and 98 DAS in Bed2IR); it was exactly in those timepoints and environments that the L1PSI achieved higher accuracy of indirect selection than the L1Phen method (Supplementary Table S1).
A standard SI (Eq. (1)) is, by construction, maximally correlated with the genetic merit of the selection objective. This optimality property holds when the genetic and phenotypic (co)variance matrices that are needed to derive the coefficients of the SI (see Eq. (2)) are known without error. However, when the measured phenotype is highdimensional, estimation errors in the phenotypic (co)variance matrix (\({{\boldsymbol{P}}}_{x}\)), as well as in the genetic covariances (\({{\boldsymbol{G}}}_{x,y}\)), can make the standard SI suboptimal. Our empirical results confirm this: standard SIs overfitted the data; this leads to a SI with low heritability and low accuracy of indirect selection.
To prevent overfitting, we considered integrating ideas commonly used in highdimensional regression into the SI methodology. Our empirical results show that regularization consistently improves the accuracy of indirect selection relative to standard SIs. We verified this for various environmental conditions and for crop imaging data collected at 9 different timepoints. The optimal PSI and the optimal PCSI achieved almost the same accuracy of indirect selection for all the environments and timepoints, suggesting that either type of regularization can be effective.
Reducedrank selection indices are appealing because after dimension reduction the problem of deriving a SI is trivial and can be dealt with methods commonly used to derive standard SIs. Moreover, after HTP has been reduced to a few derivedtraits (say the top 10 PCs), these traits can be integrated into genetic evaluations (either pedigreebased^{27} or genomicenabled^{28}) using standard multitrait models.
Principal componentsbased methods have been considered before in the analysis of Fouriertransformed infrared (FTIR) spectra derived from milk samples. For instance, Soyeurt, Misztal & Gengler^{29} used a reduced number of FTIRderived PCs to estimate variance components for selection objectives (e.g., fat or protein content in milk). Building upon this idea, Dagnachew, Meuwissen & Ådnøy^{30} suggested predicting the genetic merit for milk fatty acids using FTIRderived PCs as ‘traits’ in a genetic evaluation. However, when mapping from genetic predictions of PClodgings onto genetic predictions for the selection objective the authors used coefficients derived from a phenotypic (partial least squares) regression. This does not guarantee that the resulting index is maximally correlated with the genetic merit of the selection target. The penalized and PCSI presented in this study address that problem by using coefficients that are derived using genetic (and not phenotypic) covariances.
A disadvantage of the PCSI is that the methodology does not naturally provide variable selection, a feature that may be desirable when the measured phenotype is highdimensional.
Penalized selection indices can perform variable selection based on genetic covariances. While the derivation of a PSI is a bit more challenging than that of the PCSI, the computational burden involved in the derivation of a PSI is not extremely high.
Integration of PSI and PCSI into genetic evaluations
The SIs considered here predict genetic merit for a selection target from a set of traits measured on an individual (\({I}_{i}={{\boldsymbol{x}}}_{i}^{{\rm{{\prime} }}}{\boldsymbol{\beta }}\)); such indices exploit borrowing of information between traits within an individual. Borrowing of information between individuals increases selection accuracy; we envision two ways in which regularized SIs can be integrated into pedigree or genomicbased genetic evaluations.
One possibility is to use a twosteps approach whereas in the first step a PSI or a PCSI is used to predict the genetic merit using withinindividual information. This step can be considered as a task where patterns attributable to genetic covariances are extracted and those attributable to environmental covariances are smoothedout. Then, in a second step, the resulting indexdata \(\{{I}_{1},\ldots ,{I}_{n}\}\) could be used as a trait in a genetic evaluation.
Our study shows that the use of a regularized SI leads to a derivedphenotype that has higher genetic accuracy than standard SIs, and that of best phenotypic prediction. In principle, using a more accurate phenotype should lead to a higher accuracy of the predicted breeding values in the second step. However, further studies are needed to determine whether the gains in accuracy at the level of the HTPderived phenotype will fully translate into a higher accuracy of the predicted breeding values in a twosteps procedure.
A onestep approach is conceptually feasible and statistically more efficient as it offers the possibility of considering correlations between traits, relationships between genotypes, and the effects of nongenetic factors jointly; however, the implementation of the onestep approach using highdimensional phenotypes can be computationally challenging. To implement a onestep approach, the optimization problem of Eq. (3) can be modified by replacing \({{\boldsymbol{x}}}_{i}\), the vector with the measured phenotypes on the \({i}^{{\rm{th}}}\) individual, with a vector \({\boldsymbol{x}}={({{\boldsymbol{x}}}_{1}^{{\rm{{\prime} }}},{{\boldsymbol{x}}}_{2}^{{\rm{{\prime} }}},\ldots ,{{\boldsymbol{x}}}_{n}^{{\rm{{\prime} }}})}^{{\rm{{\prime} }}}\) that contains all the available HTP data (measured on all \(n\) individuals); after expanding the squared error loss and taking expectations we get
where \({{\boldsymbol{G}}}_{gx}\) is a \(pn\times 1\) vector of genetic covariances including betweentraitswithinindividual (co)variances and betweensubjects covariances. In standard genetic models, \({{\boldsymbol{G}}}_{gx}\) takes a Kronecker form \({{\boldsymbol{G}}}_{gx}={{\boldsymbol{A}}}_{i}\,\circ \,{{\boldsymbol{G}}}_{x,y}\), where \({{\boldsymbol{A}}}_{i}\) are genetic (either DNA or pedigreederived) relationships between the candidate for selection and each of the individuals in the training set, and \({{\boldsymbol{G}}}_{x,y}\) is, as before, a vector of genetic covariances between the selection objective and the measured traits (\({\boldsymbol{X}}\)). Likewise, \({{\boldsymbol{P}}}_{x}\) is a \(pn\times pn\) phenotypic (co)variance matrix. Estimating \({{\boldsymbol{P}}}_{x}\) would require estimating all the genetic and environmental covariances among the measured traits.
Impact of the use of highthroughput phenotypes in breeding programs
According to breeders’ equation^{16,31}, the rate of genetic gain from selection is directly proportional to selection accuracy and selection intensity. Thus, relative to the use of standard SIs, the use of regularized SIs is expected to increase selection gains by a factor equal to the gains observed in accuracy, that is between 10% and 40%. Relative to mass phenotypic selection, the PSIs had efficiencies, RE, ranging from 60% to 90%; therefore, relative to direct phenotypic selection, selection decisions based on PSI derived from images are expected to yield lower genetic gains than the ones that could be achieved via direct mass selection. However, the use of HTP technologies (e.g., crop monitoring using hyperspectral cameras mounted in drones) may enable the expansion of the number of genotypes tested/measured as well as the number of locations where those genotypes are tested. This could lead to an increase in selection intensity which will in turn increase selection gains. For instance, if the use of HTP enables doubling the number of genotypes tested, the increase in selection gains that could be achieved with HTP may range from 20% (in the case where the PSI has RE of 60%) to 80% (for the traits/environments with RE of 90%).
The discussion in the preceding paragraph is entirely based on breeders’ equation, which does not contemplate the longterm impacts of selection in genetic diversity. A more accurate and more intensive selection may affect diversity and longterm response to selection. To address this problem, attention to diversity will be needed with regularized SIs as with any other selection criteria.
Regularized selection indices can also be a valuable tool in genetic research
Highdimensional phenotypes are also becoming increasingly available in genetic studies involving human subjects and model organisms. Performing genetic studies (e.g., genomewide association analyses) on highdimensional phenotypes is challenging and the burden of multiple testing across hundreds or thousands of phenotypes (e.g., RNAabundance across thousands of genes) may critically compromise power. The PSI and PCSI presented in this study could be used to extract genetic patterns from highdimensional phenotype data such as brain imaging or wholegenome gene expression profiles and these patterns can then be used as traits in genetic studies.
Conclusion
We proposed two novel methods for predicting the genetic merit for selection objectives from highdimensional phenotypes. These phenotypes are becoming increasingly available as the adoption of HTP in crop and animal production increases. The proposed methods integrate regularization procedures commonly used in highdimensional regressions into the SI methodology. Regularization prevents overfitting and increases the accuracy of the index. The methods proposed here can be used to extract genetic patterns from almost any kind of highdimensional phenotype, including not only HTP data emerging in agriculture but also highdimensional phenotypes that emerge in genetic studies involving human subjects and model organisms.
Methods
Standard selection index
The weights on a SI are derived as the solution to the optimization problem of Eq. (1):
The righthand side can be expressed as \({\mathbb{E}}{({g}_{{y}_{i}}{{\boldsymbol{x}}}_{i}^{{\rm{{\prime} }}}{\boldsymbol{\beta }})}^{2}={\mathbb{E}}({g}_{{y}_{i}}^{2})2\,{\mathbb{E}}({g}_{{y}_{i}}{{\boldsymbol{x}}}_{i}{)}^{{\rm{{\prime} }}}{\boldsymbol{\beta }}+{{\boldsymbol{\beta }}}^{{\rm{{\prime} }}}{\mathbb{E}}({{\boldsymbol{x}}}_{i}{{\boldsymbol{x}}}_{i}^{{\rm{{\prime} }}}){\boldsymbol{\beta }}\). The first term, \({\mathbb{E}}({g}_{{y}_{i}}^{2})\), does not involve \({\boldsymbol{\beta }}\); therefore, it can be dropped from the objective function. Furthermore, if \({{\boldsymbol{x}}}_{i}\) has null mean, and assuming that the environmental effects on \({{\boldsymbol{x}}}_{i}\,\)are orthogonal to \({g}_{{y}_{i}}\), then \({\mathbb{E}}({g}_{{y}_{i}}{{\boldsymbol{x}}}_{i})={{\boldsymbol{G}}}_{x,y}\) is a vector containing the genetic covariances between the selection target and each of the measured phenotypes. Likewise, \({\mathbb{E}}({{\boldsymbol{x}}}_{i}{{\boldsymbol{x}}}_{i}^{{\rm{{\prime} }}})={{\boldsymbol{P}}}_{x}\) is the phenotypic (co)variance matrix of \({{\boldsymbol{x}}}_{i}\). Therefore, the problem in Eq. (1) can be written as
Differentiating the righthand side with respect to vector \({\boldsymbol{\beta }}\) and setting the derivatives equal to zero leads to the first order conditions: \({{\boldsymbol{P}}}_{x}\hat{{\boldsymbol{\beta }}}={{\boldsymbol{G}}}_{x,y}\); therefore,
Reducedrank selection index
Recall that the singular value decomposition of a realvalued matrix, \({\boldsymbol{X}}=[{{\boldsymbol{x}}}_{1},{{\boldsymbol{x}}}_{2},\ldots ,{{\boldsymbol{x}}}_{n}{]}^{{\rm{{\prime} }}}\) (individuals in rows, phenotypes in columns) takes the form \({\boldsymbol{X}}={\boldsymbol{U}}{\boldsymbol{D}}{{\boldsymbol{V}}}^{{\rm{{\prime} }}}\), where \({\boldsymbol{U}}=[{{\boldsymbol{u}}}_{1},\ldots ,{{\boldsymbol{u}}}_{p}]\) is the matrix containing the leftsingular vectors that span the rowspace of \({\boldsymbol{X}}\), \({\boldsymbol{V}}=[{{\boldsymbol{v}}}_{1},\ldots ,{{\boldsymbol{v}}}_{p}]\) is the matrix with the rightsingular vectors, and \({\boldsymbol{D}}={\rm{diag}}({d}_{1},\ldots ,{d}_{p})\) is a diagonal matrix with positive or zero elements. The PCs \({\boldsymbol{W}}={\boldsymbol{XV}}={\boldsymbol{UD}}\) are linear combinations of the measured phenotypes. A reducedrank regression uses the first \(q\) PCs (\(\tilde{{\boldsymbol{W}}}=[{{\boldsymbol{w}}}_{1},\ldots ,{{\boldsymbol{w}}}_{q}]\), \(q\le p\)) as ‘measured phenotypes’ in the SI:
where \({\mathop{{\boldsymbol{w}}}\limits^{ \sim }}_{i}\) is a vector containing the scores for the \({i}^{{\rm{th}}}\) observation on the first \(\,q\) PCs. The solution to the optimization problem takes the form \({\hat{{\boldsymbol{\gamma }}}}^{(q)}={{\boldsymbol{P}}}_{\mathop{w}\limits^{ \sim }}^{1}{{\boldsymbol{G}}}_{\mathop{w}\limits^{ \sim },y}\), where \({{\boldsymbol{P}}}_{\tilde{w}}\) is the phenotypic (co)variance matrix of the first \(q\) PCs and \({{\boldsymbol{G}}}_{\tilde{w},y}\) is a vector containing the genetic covariances between each of the top \(q\) PCs and the selection objective. Since the leftsingular vectors are orthonormal (i.e., \({{\boldsymbol{u}}}_{j}^{{\boldsymbol{{\prime} }}}{{\boldsymbol{u}}}_{j}=1\) and \({{\boldsymbol{u}}}_{j}^{{\boldsymbol{{\prime} }}}{{\boldsymbol{u}}}_{k}=0\), for \(j\ne k\)), then \({{\boldsymbol{W}}}^{{\rm{{\prime} }}}{\boldsymbol{W}}={{\boldsymbol{D}}}^{2}={\rm{d}}{\rm{i}}{\rm{a}}{\rm{g}}({d}_{1}^{2},\ldots ,{d}_{p}^{2})\). Hence, a methodofmoments estimate of the phenotypic (co)variance matrix of \(\tilde{{\boldsymbol{W}}}\) contains only the first \(q\) elements \({\tilde{{\boldsymbol{D}}}}^{2}={\rm{diag}}({d}_{1}^{2},\ldots ,{d}_{q}^{2})\); this is
Using \({\hat{{\boldsymbol{P}}}}_{\tilde{w}}\) makes the coefficients of the PCs to be proportional to the genetic covariance between each of the PCs and the selection objective: \({\hat{{\boldsymbol{\gamma }}}}^{(q)}=(n1){({\tilde{{\boldsymbol{D}}}}^{2})}^{1}{{\boldsymbol{G}}}_{\tilde{w},y}\). This solution can be mapped to coefficients for the measured traits using \({\hat{{\boldsymbol{\beta }}}}^{(q)}=(n1)\mathop{{\boldsymbol{V}}}\limits^{ \sim }{({\mathop{{\boldsymbol{D}}}\limits^{ \sim }}^{2})}^{1}{{\boldsymbol{G}}}_{\mathop{w}\limits^{ \sim },y}\), where \(\tilde{{\boldsymbol{V}}}\) is the matrix containing only the first \(q\) rightsingular vectors.
Penalized selection indices
The objective function of the penalized SI is given by Eq. (3). Here we considered PSIs using either L1 or L2norms or a combination of the two.
L2PSI
Using an L2norm as penalty, \(J({\boldsymbol{\beta }})=\frac{1}{2}\mathop{\sum }\limits_{j=1}^{p}{\beta }_{j}^{2}=\frac{1}{2}{\boldsymbol{\beta }}{}^{{\rm{{\prime} }}}{\boldsymbol{\beta }}\), in Eq. (3) leads to the following optimization problem:
Therefore:
The second and third righthand side terms can be combined to obtain:
where \({\boldsymbol{I}}\) is a \(p\times p\) identity matrix. Differentiating with respect to \({\boldsymbol{\beta }}\) and setting the derivatives equal to zero, we obtain the firstorder conditions: \(({{\boldsymbol{P}}}_{x}+\lambda {\boldsymbol{I}}){\hat{{\boldsymbol{\beta }}}}^{L2}={{\boldsymbol{G}}}_{x,y}\); therefore:
ENPSI
The coefficients for the elasticnet family are obtained by considering an objective function as in Eq. (3), with \(J({\boldsymbol{\beta }})=\frac{1}{2}(1\alpha )\mathop{\sum }\limits_{j=1}^{p}{\beta }_{j}^{2}+\alpha \mathop{\sum }\limits_{j=1}^{p}{\beta }_{j}\); therefore,
The L1PSI and L2PSI are particular cases corresponding to \(\alpha =1\) and \(\alpha =0\), respectively. When \(\alpha =0\) the solution has a closedform (see L2PSI above). If \(\alpha > 0\), no closedform solution exists; however, a solution can be obtained using the same iterative algorithms that are used to solve elasticnet regressions (e.g., LARS and coordinate descent^{14}). These algorithms can be implemented either by ‘partial residuals’ or using ‘covariance updates’^{32}. In our case, the objective function is entirely based on (co)variance terms. The objects \({{\boldsymbol{P}}}_{x}\) and \({{\boldsymbol{G}}}_{x,y}\) enter in the objective function of the PSI in the same way that \({{\boldsymbol{X}}}^{{\rm{{\prime} }}}{\boldsymbol{X}}\) and \({{\boldsymbol{X}}}^{{\rm{{\prime} }}}{\boldsymbol{y}}\) enter in a standard elasticnet regression. Therefore, to obtain solutions, we implemented the standard LARS algorithm (e.g., Hastie et al.^{14}) entirely based on covariance updates.
Data
The data set consists of 1,092 inbred wheat lines grouped into 39 trials and grown during the 2013–2014 season at the Norman Borlaug experimental research station in Ciudad Obregon, Sonora, Mexico. Each trial consisted of 28 breeding lines that were arranged in an alphalattice design with three replicates and six subblocks. The trials were grown in four different environments: FlatDrought (sowing in flat with irrigation of 180 mm through drip system), Bed2IR (sowing in bed with 2 irrigations approximately 250 mm), BedEHeat (bed sowing 30 days before optimal planting date with 5 normal irrigations approximately 500 mm), and Bed5IR (bed sowing with 5 normal irrigations). In 2013, all the trials were planted by midNovember (optimal planting date), on the \({21}^{{\rm{st}}}\) (Bed2IR and Bed5IR) and on the \({26}^{{\rm{th}}}\) for FlatDrought. Trials for BedEHeat were planted on October \({30}^{{\rm{th}}}\). Grain yield (ton ha^{−1}, total plot yield after maturity) was recorded.
Reflectance data were collected from the fields using both infrared (A600 series Infrared camera, FLIR, Wilsonville, OR) and hyperspectral (Aseries, MicroHyperspec, VNIR Headwall Photonics, Fitchburg, MA) cameras mounted on a Piper PA16 Clipper aircraft on 9 different dates (timepoints) between January \({10}^{{\rm{th}}}\) and March \({27}^{{\rm{th}}}\), 2014. During each flight, data from \(p=250\) wavebands ranging from 392 to 850 nm were collected for each pixel in the pictures. Using ArcMap software (ESRI, CA), the average reflectance of all the pixels within each georeferenced trial plot was calculated and reported as a single datapoint for each genotype for each band. Days to heading were recorded as the number of days from the date of sowing/first irrigation until 50% of spike emergence in each plot. Heading of about 50–80% of the total number of plots was used as criterion to distinguish between vegetative (VEG) and grain filling (GF) stages. The crop was considered to be at maturity (MAT) stage when the average RNDVI decreased to ~0.4.
Phenotype preprocessing
Within each environment, grain yield phenotypic records were preadjusted by fitting the following mixed model,
where \({y}_{jklm}\) is the grain yield phenotype value for the \({j}^{{\rm{th}}}\) genotype, \({k}^{{\rm{th}}}\) trial, \({l}^{{\rm{th}}}\) replicate (within trial), \({m}^{{\rm{th}}}\) subblock (within trial and replicate), \(\mu \) is the overall mean and \({g}_{j}\), \({t}_{k}\), \({r}_{l(k)}\), and \({b}_{m(kl)}\,\)are the genotype, trial, replicate, and subblock effects, respectively (all assumed to be random) and \({\varepsilon }_{jklm}\) is an error term. Random effects were assumed to be independently and identically distributed (iid) normal with null mean and effectspecific variances. Likewise, the error terms were assumed to be iid with null mean and common error variance.
Grain yield data were preadjusted by subtracting from the phenotypic record (\({y}_{jklm}\)) the mean (\(\hat{\mu })\) plus BLUPs of trial, replicate, and subblock effects; this is
Reflectance data was preadjusted by fitting the above model, using reflectance at individual bands as phenotype expanded with the inclusion of a timepoint effect. Separate models were fitted to each of the wavebands. As with grain yield, reflectance data were preadjusted by subtracting from the measured reflectance the estimated mean and predicted timepoint, trial, replicate, and subblock effects.
For quality control, preadjusted grain yield and reflectance phenotypes were removed for those grain yield scores lying beyond 3 times the interquantile region from the 0.25 and 0.75 quantiles.
After preadjusting, all phenotypes were standardized (to have unit variance); for ease of exposition, hereinafter we refer to the adjustedscaled phenotypes (including grain yield and the image data) simply as phenotypes.
Heritability estimation
After preadjusting standardization, we analyzed phenotypes using a mixed model of the form
where \({y}_{ij}\) is the phenotype for the \({i}^{{\rm{th}}}\) observation (\(i\) here is a single index for indices \(k\), \(l\), and \(m\) in Eq. (4)) of the \({j}^{{\rm{th}}}\) genotype; the genetic values are \({g}_{j}\mathop{ \sim }\limits^{iid}N(0,{\sigma }_{{g}_{y}}^{2})\), where \({\sigma }_{{g}_{y}}^{2}\) is the genetic variance; and the environmental terms are \({\varepsilon }_{ij}\mathop{ \sim }\limits^{iid}N(0,{\sigma }_{{\varepsilon }_{y}}^{2})\). Plotbasis heritability was calculated from variance components estimates using
Trainingtesting partitions
The data set contains information from 39 trials with 84 observations each. To assess the accuracy of indirect selection, we randomly assigned complete trials to testing sets. The training set comprised all the data from the trials not assigned to the testing set. This approach guarantees that no data from a single trial is present in both training and testing sets. This approach aims at representing a situation where one has calibrated the coefficients of the index using historical trials and apply these coefficients to image data of future trials. A similar validation scheme has been used (using herdyearseason groups instead of trials) in validation problems in previous studies involving milk spectra data^{33}. In each trainingtesting partition, out of the 29 trials available, 26 trials (\({n}_{trn}\approx 2,184\) observations) were randomly assigned to the training set, and the data from the remaining 13 trials (\({n}_{tst}\approx 1,092\)) was used for testing set. The regression coefficients of the indices (the \({\boldsymbol{\beta }}\)’s for the standard SI, PSI, and PCSI) were calculated using grain yield and reflectance data of the training set. Estimates of the coefficients and reflectance data were then used to calculate the SI, \({I}_{ij}={{\boldsymbol{x}}}_{ij}^{{\rm{{\prime} }}}\hat{{\boldsymbol{\beta }}}\), for each observation \(i\) in the testing set (\(i=1,\ldots ,{n}_{tst}\)). The heritability of the index and the genetic correlation between the index and the selection goal were estimated in the testing set.
The trainingtesting procedure described above was repeated 100 times by randomly assigning trials to training and testing sets. From these analyses, we reported the mean of heritability, genetic correlation, and selection accuracy; and their standard deviation across trainingtesting partitions.
Estimation of phenotypic and genetic parameters
The population phenotypic (co)variance matrix \({{\boldsymbol{P}}}_{x}\) was estimated within the training set using the unbiased sample (co)variance matrix given by \({\hat{{\boldsymbol{P}}}}_{x}=\frac{1}{n1}\mathop{\sum }\limits_{i=1}^{{n}_{trn}}({{\boldsymbol{x}}}_{i}\bar{{\boldsymbol{x}}})({{\boldsymbol{x}}}_{i}\bar{{\boldsymbol{x}}}{)}^{{\rm{{\prime} }}}\), where \(\bar{{\boldsymbol{x}}}\) is the vector containing the sample mean of each waveband. Since reflectance data are centered and standardized, this reduces to \({\hat{{\boldsymbol{P}}}}_{x}=\frac{1}{n1}{{\boldsymbol{X}}}^{{\rm{{\prime} }}}{\boldsymbol{X}}\), where \({\boldsymbol{X}}=[{{\boldsymbol{x}}}_{1},{{\boldsymbol{x}}}_{2},\ldots ,{{\boldsymbol{x}}}_{n}{]}^{{\rm{{\prime} }}}\) is the matrix containing all measured traits in the training set.
The genetic covariance \(({G}_{{x}_{j},y})\) between grain yield and the \({j}^{{\rm{th}}}\) measured trait (\(j=1,\ldots ,p\)) was estimated using a sequence of univariate genetic models as in Eq. (5). We fitted that model with grain yield phenotypes as response, then for each of the reflectance bands and then for the sum of grain yield and each of the bands. The genetic covariances between the bands and grain yield were then estimated using
where \({\hat{\sigma }}_{{g}_{y}}^{2}\), \({\hat{\sigma }}_{{g}_{{x}_{j}}}^{2}\) and \({\hat{\sigma }}_{{g}_{y+{x}_{j}}}^{2}\) are the estimated genetic variances for grain yield, the \({j}^{{\rm{th}}}\) band, and the sum of grain yield and the \({j}^{{\rm{th}}}\) band, respectively.
Estimation of the accuracy of indirect selection
To assess the accuracy of indirect selection we applied the regression coefficients derived in the training set to image data from the testing set to derive \({I}_{ij}={{\boldsymbol{x}}}_{ij}^{{\rm{{\prime} }}}\hat{{\boldsymbol{\beta }}}\). Then, using a mixed model analysis like that described in the previous section we estimated the heritability of the SI (\({h}_{I}^{2}\)), the heritability of grain yield (\({h}_{y}^{2}\)), and the genetic correlation between the SI and grain yield (\(cor({g}_{{I}_{i}},{g}_{{y}_{i}})\)). From these estimates, we derived the accuracy of indirect selection, \(Acc(I)={h}_{I}\,cor({g}_{{I}_{i}},{g}_{{y}_{i}})\), and the relative efficiency, \(RE=\frac{{h}_{I}}{{h}_{y}}cor({g}_{{I}_{i}},{g}_{{y}_{i}})\).
Software
All the aforementioned analyses were implemented in the R software environment^{34}, version 3.5.1. Linear mixed models were implemented using the ‘lmer‘ function from the LME4^{35} Rpackage. The software that implements the LARS and coordinate descent algorithms are available through the SFSI Rpackage (https://github.com/MarcooLopez/SFSI).
Data availability
The data used in this study are publicly available by CIMMYT (https://www.cimmyt.org/) who owns all rights in the data. The data is also included in the SFSI Rpackage. The Rscripts needed to perform the analyses presented in this study can be found in the documentation of the SFSI Rpackage.
References
Nagel, K. A. et al. GROWSCREENRhizo is a novel phenotyping robot enabling simultaneous measurements of root and shoot growth for plants grown in soilfilled rhizotrons. Funct. Plant Biol. 39, 891–904 (2012).
Araus, L. & Cairns, J. E. Field highthroughput phenotyping: the new crop breeding frontier. Trends Plant Sci. 19, 52–61 (2014).
Montes, J. M. et al. Nearinfrared spectroscopy on combine harvesters to measure maize grain dry matter content and quality parameters. Plant Breed. 125, 591–595 (2006).
White, J. W. et al. Field Crops Research Fieldbased phenomics for plant genetics research. F. Crop. Res. 133, 101–112 (2012).
Ferrio, J. P. et al. Assessment of durum wheat yield using visible and nearinfrared reflectance spectra of canopies. F. Crop. Res. 94, 126–148 (2005).
Spielbauer, G. et al. Highthroughput nearinfrared reflectance spectroscopy for predicting quantitative and qualitative composition phenotypes of individual maize kernels. Cereal Chem. 86, 556–564 (2009).
Garriga, M. et al. Assessing wheat traits by spectral reflectance: do we really need to focus on predicted traitvalues or directly identify the elite genotypes group? Front. Plant Sci. 8, 1–12 (2017).
Weber, V. S. et al. Prediction of grain yield using reflectance spectra of canopy and leaves in maize plants grown under different water regimes. F. Crop. Res. 128, 82–90 (2012).
Aguate, F. M. et al. Use of hyperspectral image data outperforms vegetation indices in prediction of maize yield. Crop Sci. 57, 2517–2524 (2017).
Garnsworthy, P. C., Wiseman, J. & Fegeros, K. Prediction of chemical, nutritive and agronomic characteristics of wheat by near infrared spectroscopy. J. Agric. Sci. 135, 409–417 (2000).
Oblath, E. A. et al. Development of nearinfrared spectroscopy calibrations to measure quality characteristics in intact Brassicaceae germplasm. Ind. Crop. Prod. 89, 52–58 (2016).
Smith, H. F. A discrimant function for plant selection. Ann. Eugen. 7, 240–250 (1936).
Hazel, L. N. The genetic basis for constructing selection indexes. Genetics 28, 476–490 (1943).
Hastie, T., Tibshirani, R. & Friedman, J. H. The elements of statistical learning: data mining, inference, and prediction. (Springer (2009).
Bulmer, M. G. The mathematical theory of quantitative genetics. (Oxford University Press (1985).
Falconer, D. S. & Mackay, T. F. C. Introduction to quantitative genetics. (Prentice Hall (1996).
Fu, W. J. Penalized regressions: the Bridge versus the LASSO. J. Comput. Graph. Stat. 7, 397–416 (1998).
Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. B 67, 301–320 (2005).
Tibshirani, R. Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. B 58, 267–288 (1996).
Friedman, J., Hastie, T., Höfling, H. & Tibshirani, R. Pathwise coordinate optimization. Ann. Appl. Stat. 1, 302–332 (2007).
Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. Least angle regression. Ann. Stat. 32, 407–499 (2004).
Tattaris, M., Reynolds, M. P. & Chapman, S. C. A direct comparison of remote sensing approaches for highthroughput phenotyping in plant breeding. Front. Plant Sci. 7, 1–9 (2016).
Babar, M. A. et al. Spectral reflectance to estimate genetic variation for inseason biomass, leaf chlorophyll, and canopy temperature in wheat. Crop Sci. 46, 1046–1057 (2006).
Haboudane, D., Miller, J. R., Tremblay, N., ZarcoTejada, P. J. & Dextraze, L. Integrated narrowband vegetation indices for prediction of crop chlorophyll content for application to precision agriculture. Remote Sens. Environ. 81, 416–426 (2002).
Tucker, C. J. Red and photographic infrared linear combinations for monitoring vegetation. Remote Sens. Environ. 8, 127–150 (1979).
Gitelson, A. A., Kaufman, Y. J. & Merzlyak, M. N. Use of a green channel in remote sensing of global vegetation from EOSMODIS. Remote Sens. Environ. 58, 289–298 (1996).
Henderson, C. R. & Quaas, R. L. Multiple trait evaluation using relatives’ records. J. Anim. Sci. 43, 1188–1197 (1976).
Meuwissen, T. H. E., Hayes, B. J. & Goddard, M. E. Prediction of total genetic value using genomewide dense marker maps. Genetics 157, 1819–1829 (2001).
Soyeurt, H., Misztal, I. & Gengler, N. Genetic variability of milk components based on midinfrared spectral data. J. Dairy Sci. 93, 1722–8 (2010).
Dagnachew, B. S., Meuwissen, T. H. E. & Ådnøy, T. Genetic components of milk Fouriertransform infrared spectra used to predict breeding values for milk composition and quality traits in dairy goats. J. Dairy Sci. 96, 5933–5942 (2013).
Lush, J. L. Animal breeding plans. (Iowa State College, Ames (1937).
Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010).
Ferragina, A. de los Campos, G., Vazquez, A. I., Cecchinato, A. & Bittante, G. Bayesian regression models outperform partial least squares methods for predicting milk components and technological properties using infrared spectral data. J. Dairy Sci. 98, 8133–8151 (2015).
R Core Team. R: A Language and environment for statistical computing. (2018).
Bates, D., Mächler, M., Bolker, B. M. & Walker, S. C. Fitting linear mixedeffects models using lme4. J. Stat. Softw. 67, 1–48 (2015).
Acknowledgements
We acknowledge CIMMYT’s Global Wheat Program that provided both experimental field and HTP data used in this work. M.L.C. was supported by the Monsanto’s BeachellBorlaug International Scholarship Program (MBBISP).
Author information
Authors and Affiliations
Contributions
R.S., S.D., J.C. and S.M. were involved in the design of the field experiments and data collection. S.M. performed the HTP data correction and georeferencing. M.L.C. and G.D.L.C. conceived the idea, performed the analyses and produced a first draft. E.O. and G.R along with all the authors contributed to the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
LopezCruz, M., Olson, E., Rovere, G. et al. Regularized selection indices for breeding value prediction using hyperspectral image data. Sci Rep 10, 8195 (2020). https://doi.org/10.1038/s41598020650112
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598020650112
This article is cited by

Combining novel feature selection strategy and hyperspectral vegetation indices to predict crop yield
Plant Methods (2022)

Trichome density and pod damage rate as the key factors affecting soybean yield under natural infestation of Helicoverpa armigera (Hübner)
Journal of Plant Diseases and Protection (2022)

Genomic selection for morphological and yieldrelated traits using genomewide SNPs in oil palm
Molecular Breeding (2022)

Multigeneration genomic prediction of maize yield using parametric and nonparametric sparse selection indices
Heredity (2021)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.