Relationship between gene regulation network structure and prediction accuracy in high dimensional regression

The least absolute shrinkage and selection operator (lasso) and principal component regression (PCR) are popular methods of estimating traits from high-dimensional omics data, such as transcriptomes. The prediction accuracy of these estimation methods is highly dependent on the covariance structure, which is characterized by gene regulation networks. However, the manner in which the structure of a gene regulation network together with the sample size affects prediction accuracy has not yet been sufficiently investigated. In this study, Monte Carlo simulations are conducted to investigate the prediction accuracy for several network structures under various sample sizes. When the gene regulation network is a random graph, a sufficiently large number of observations are required to ensure good prediction accuracy with the lasso. The PCR provided poor prediction accuracy regardless of the sample size. However, a real gene regulation network is likely to exhibit a scale-free structure. In such cases, the simulation indicates that a relatively small number of observations, such as N = 300, is sufficient to allow the accurate prediction of traits from a transcriptome with the lasso.

Technological advancements have enabled the collection of highly multidimensional data from biological systems [1–4]. For example, RNA sequencing quantifies the expression levels of thousands of genes. Such omics data are useful in predicting organismal traits, with empirical applications including the diagnosis and classification of diseases and the prediction of patient survival [5–8], and possible future applications in predicting crop yields 9 , insecticide resistance 10 , and environmental adaptation 11 .
A common challenge in predicting traits from omics data is that the dimension of the data far exceeds the sample size (known as high-dimensional regression). For example, to apply least-squares estimation in multiple regression (e.g., trait ≈ β_0 + β_1 gene_1 + β_2 gene_2 + ···) to predict a trait value from a transcriptome, the sample size needs to be (at least) larger than the number of model parameters. However, because transcriptome studies typically observe thousands of genes, a sample size exceeding the number of genes is not realistic at present. In this case, high-dimensional regression modeling must be considered.
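As a toy illustration of why ordinary least squares breaks down in this regime (this sketch is not from the paper; the dimensions are illustrative assumptions), one can check the rank of the normal-equations matrix when p > n:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200          # more "genes" than samples, as in transcriptome data
X = rng.normal(size=(n, p))

# The Gram matrix X^T X is p x p but has rank at most n < p, so the
# least-squares normal equations have no unique solution.
gram = X.T @ X
rank = np.linalg.matrix_rank(gram)
print(rank)             # at most n = 50, far below p = 200
```

Because the system is rank-deficient, infinitely many coefficient vectors fit the training data perfectly, which is why regularized methods such as the lasso are needed.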
The least absolute shrinkage and selection operator (lasso 12 ) is one of the most frequently used methods for high-dimensional regression. It simultaneously achieves variable selection and parameter estimation. Theoretically, the prediction accuracy of the lasso is highly dependent on the correlation structure among the explanatory variables; it is high under certain strong conditions, such as the compatibility condition 13 . However, in practice, it is not easy to check whether the compatibility condition holds. Another popular estimation method for high-dimensional regression is principal component regression (PCR 14 ). PCR is a two-stage procedure: first, principal component analysis is conducted on the predictors, following which a regression model in which the principal components are used as predictors is fitted. This method may perform well when the explanatory variables are highly correlated.
It is reasonable to assume that gene regulation networks will result in conditional independence among the levels of gene expression [15–17]. Here, two variables are conditionally independent when they are independent given all of the other variables.

Prediction methods for high-dimensional data
Suppose that we have n observations {(x_i, y_i) | i = 1, . . . , n}, where x_i is a p-dimensional vector of explanatory variables and y_i is the response (i = 1, . . . , n). Let X = (x_1, . . . , x_n)^T and Y = (y_1, . . . , y_n)^T. Consider the linear regression model

Y = Xβ + ε,

where β is a p-dimensional coefficient vector and ε is an n-dimensional noise vector.

Lasso. The lasso minimizes a loss function that consists of the quadratic loss with a penalty based on the L_1 norm of the parameter vector:

β̂ = argmin_β { ‖Y − Xβ‖_2^2 + λ‖β‖_1 },

where λ > 0 is a regularization parameter. Because of the nature of the L_1 norm in the penalty term, some of the coefficients are estimated to be exactly zero. Thus, variable selection is conducted, and only the variables that correspond to nonzero coefficients affect the response.
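A minimal sketch of lasso-based variable selection, using scikit-learn's `Lasso` rather than the authors' code; the dimensions, the sparse true coefficient vector, and the value of the regularization parameter `alpha` (playing the role of λ) are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 100, 500
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]   # a few large-effect "genes"
y = X @ beta + rng.normal(scale=0.5, size=n)

# alpha controls the L1 penalty; larger values yield sparser estimates.
model = Lasso(alpha=0.1).fit(X, y)
nonzero = int(np.sum(model.coef_ != 0))
print(nonzero)   # only a small subset of the 500 coefficients is nonzero
```

Even though p = 500 exceeds n = 100, the fit is well-defined, and the zero pattern of `model.coef_` performs the variable selection described above.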

PCR.
In some cases, the first few largest eigenvalues of the covariance matrix of the predictors (i.e., the proportional contributions of the principal components) can be considerably large (e.g., the spiked covariance model 26 ). In such a case, the lasso may not function effectively in terms of both prediction accuracy and consistency in model selection, because the conditions for its effective performance (e.g., the compatibility condition 27 ) may not be satisfied. This issue can be addressed by PCR, because it transforms a large number of highly correlated variables into a few principal components. In the first stage of PCR, principal component analysis is applied to the predictors: the ith observation x_i is projected onto a d-dimensional subspace as z_i = A^T (x_i − x̄), where the p × d projection matrix A is obtained by the following least squares optimization problem 28 :

A = argmin_{A : A^T A = I_d} Σ_{i=1}^n ‖(x_i − x̄) − A A^T (x_i − x̄)‖_2^2.

Here, x̄ is the sample mean vector, that is, x̄ = Σ_{i=1}^n x_i / n. In this work, the number of projected dimensions, d, was chosen such that d principal components collectively explain 90% or more of the variance (and d − 1 principal components do not). Then, in the second stage, a regression analysis is conducted in which the principal components, {z_1, . . . , z_n}, are used as predictors. Here, the regression coefficients in the second stage are estimated by the lasso.
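The two-stage PCR procedure can be sketched as follows with scikit-learn; the 90% explained-variance rule mirrors the choice of d described above, while the factor-driven data-generating model and the lasso `alpha` are assumed toy settings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 200, 1000
# Correlated predictors: a few latent factors drive many "genes".
factors = rng.normal(size=(n, 5))
loadings = rng.normal(size=(5, p))
X = factors @ loadings + 0.1 * rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=n)

# Stage 1: keep the smallest d whose components explain >= 90% of the variance.
pca = PCA(n_components=0.90, svd_solver="full")
Z = pca.fit_transform(X)
d = pca.n_components_
print(d)   # far fewer dimensions than p

# Stage 2: regress y on the d principal components with the lasso.
pcr = Lasso(alpha=0.1).fit(Z, y)
```

Passing a float in (0, 1) as `n_components` makes scikit-learn select the smallest number of components reaching that explained-variance threshold, which matches the rule used in this work.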

Simulation framework
An overview of the simulation is presented in Fig. 1. First, the model that defines the relationship between the trait and the levels of gene expression was parameterized. This was done using the empirical data 11 , which quantified the transcriptome of wild Arabidopsis halleri subsp. gemmifera weekly for two years in its natural habitat, as well as bihourly on the equinoxes and solstices (p = 17,205 genes for n = 835 observations). Three types of simulated data were generated using different covariance matrices of the genes, denoted as Σ_j (j = 1, 2, 3). Σ_1 is the sample covariance matrix of the genes. Generally, none of the elements of the inverse of a sample covariance matrix are exactly zero, implying that each gene interacts with all the other genes. Such a fully connected network is ineffective in terms of interpreting the mechanism of gene regulation. Thus, two other covariance matrices were produced to simulate sparse networks based on the sample covariance matrix Σ_1. Σ_2 is generated by the graphical lasso 29 and corresponds to a random graph. Although the graphical lasso is widely used because of its computational efficiency, real networks are often scale-free. Therefore, Σ_3, which corresponds to a scale-free network, was also generated; the estimation of scale-free networks is achieved by the reweighted graphical lasso 30 . Based on these three covariance matrices Σ_j (j = 1, 2, 3), the simulated transcriptome data were generated from the multivariate normal distribution, and the simulated trait data were generated from the simulated transcriptome data. Finally, the lasso and the PCR were applied to these simulated data to compare their prediction accuracies. The sample sizes of the simulated data were varied to investigate the relationship between prediction accuracy and sample size.
Evaluation of the estimation procedure. The performance of the estimation procedure is investigated by the following expected prediction error:

E[ ‖Y* − X* β̂‖_2^2 ],

where the rows of X* follow N(0, Σ_j) (j = 1, 2, or 3) and Y* ∼ N(X* β, σ^2 I). The estimator β̂ is obtained using the current observations, while X* and Y* correspond to future observations. The Σ_j (j = 1, 2, 3), β, and σ^2 are true values but unknown. In practice, these parameters are defined using the actual dataset, (X, Y). The details of the settings of these parameters are presented in the next subsection.
To estimate the expected prediction error, a Monte Carlo simulation is conducted. We first randomly generate training and test data, (X̃_train, Ỹ_train) and (X̃_test, Ỹ_test), respectively. Here, the rows of X̃_train follow a multivariate normal distribution with mean vector µ_X and variance-covariance matrix Σ_j, where µ_X is the sample mean of X. Then, Ỹ_train is generated by Ỹ_train = X̃_train β + ε, where ε is a random sample from N(0, σ^2 I), with I being an identity matrix. The test data, (X̃_test, Ỹ_test), are generated by the same procedure as (X̃_train, Ỹ_train), but independently of (X̃_train, Ỹ_train). The numbers of observations for the training and test data are N (N = 50, 100, 200, 300, 500, 1000) and 1000, respectively. The lasso and the PCR are fitted with the training data (X̃_train, Ỹ_train), following which the RMSE is calculated as in (10). The above process, from the random generation of data to the RMSE calculation, was performed 100 times.
www.nature.com/scientificreports/
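The Monte Carlo loop described above can be sketched as follows; this is a scaled-down illustration (small p, identity covariance, few replications, fixed `alpha`) rather than the published simulation code:

```python
import numpy as np
from sklearn.linear_model import Lasso

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def one_replication(rng, n_train, n_test, cov, beta, sigma):
    # Draw training and independent test data with the same covariance.
    p = cov.shape[0]
    X_tr = rng.multivariate_normal(np.zeros(p), cov, size=n_train)
    X_te = rng.multivariate_normal(np.zeros(p), cov, size=n_test)
    y_tr = X_tr @ beta + rng.normal(scale=sigma, size=n_train)
    y_te = X_te @ beta + rng.normal(scale=sigma, size=n_test)
    fit = Lasso(alpha=0.1).fit(X_tr, y_tr)
    return rmse(y_te, fit.predict(X_te))

rng = np.random.default_rng(3)
p = 50
cov = np.eye(p)
beta = np.zeros(p)
beta[:3] = [2.0, -1.0, 1.0]          # sparse true coefficients
errors = [one_replication(rng, n_train=100, n_test=200, cov=cov,
                          beta=beta, sigma=0.5) for _ in range(20)]
print(np.mean(errors))   # approaches the noise level sigma as n_train grows
```

Replacing `cov` with an estimated covariance matrix and sweeping `n_train` over the grid of N values reproduces the structure of the experiment described above.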

Parameter setting. Covariance structures.
Here, the characterization of the network structure of the predictors by conditional independence is considered. When the predictors follow a multivariate normal distribution, the network structure based on conditional independence corresponds to the nonzero pattern of the inverse covariance (precision) matrix. In other words, the network structure is characterized by the inverse covariance matrix of the predictors. Let S be the sample covariance matrix of the predictors, that is, S = Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^T / n, and define

Σ_1 = S + δI.

Here, δ is a small positive value (in this simulation, δ = 10^{-5}). The term δI ensures that Σ_1 is invertible, so that Ω_1 = Σ_1^{-1} exists. Note that because Ω_1 is not sparse, it leads to the complete graph, which is of no use in interpreting gene regulation networks. To generate covariance matrices whose inverse matrices are sparse, L_1 penalization is employed for the estimation of Σ_2 and Σ_3 as follows:

Ω̂ = argmax_Ω { log det Ω − tr(SΩ) − P_j(Ω) },   (2)

where Ω = (ω_ij) is the inverse covariance matrix and P_j(Ω) (j = 2, 3) are penalty terms that enhance the sparsity of the inverse covariance matrix. To estimate a sparse inverse covariance matrix, the lasso penalty is typically used:

P_2(Ω) = ρ Σ_{i≠j} |ω_ij|,   (3)

where ρ > 0 is a regularization parameter. The problem (2) with (3) is referred to as the graphical lasso 29 , and there exist several efficient algorithms to obtain the solution [31–33]. The estimator of (2) with (3) corresponds to Ω_2, and Σ_2 = Ω_2^{-1}. The lasso penalty (3) does not enhance scale-free networks: it penalizes all edges equally, so that the estimated graph is likely to be a random graph, that is, the degree distribution becomes a binomial distribution. To enhance scale-free networks (i.e., a power-law degree distribution), the log penalty 30 is used:

P_3(Ω) = ρ Σ_{i=1}^p log(‖ω_(·,−i)‖_1 + a_i),   (4)

where ω_(·,−i) = (ω_1i, ω_2i, . . . , ω_(i−1)i, ω_(i+1)i, . . . , ω_pi)^T and a_i > 0 are tuning parameters. We note that the penalty (4) is slightly different from the original definition 30 , denoted (5), which does not assume symmetry; when we do not assume that ω_ij = ω_ji, the estimate of the inverse covariance matrix with (5) is not symmetric.
Since the original graphical lasso algorithm does not assume that ω_ij = ω_ji 31,34 , we slightly modify the penalty as in Eq. (4). Notably, P_3(Ω) in (4) coincides with (5) when ω_ij = ω_ji. From a Bayesian viewpoint, the prior distribution that corresponds to the log penalty is the power-law distribution 30 ; thus, the penalty (4) is likely to estimate scale-free networks. The estimator of (2) with (4) corresponds to Ω_3, and Σ_3 = Ω_3^{-1}.
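A small sketch of estimating a sparse precision matrix with the (unweighted) graphical lasso, using scikit-learn's `GraphicalLasso`; the chain-graph truth and the value of `alpha` (playing the role of ρ) are illustrative assumptions:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(4)
p = 10
# Sparse true precision matrix: a chain graph (each gene interacts only
# with its neighbours), so most partial correlations are exactly zero.
prec = np.eye(p)
for i in range(p - 1):
    prec[i, i + 1] = prec[i + 1, i] = 0.4
cov_true = np.linalg.inv(prec)
X = rng.multivariate_normal(np.zeros(p), cov_true, size=500)

# alpha plays the role of rho; larger values give a sparser estimated network.
gl = GraphicalLasso(alpha=0.1).fit(X)
est_prec = gl.precision_
n_edges = int(np.sum(np.abs(est_prec[np.triu_indices(p, k=1)]) > 1e-8))
print(n_edges)   # far fewer than the p(p-1)/2 = 45 possible edges
```

The nonzero pattern of `est_prec` is the estimated conditional-independence graph, which is how Σ_2 is constructed from the sample covariance in this work (with ρ selected by the EBIC, as described below).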
Because the log penalty (4) is nonconvex, it is not easy to directly optimize (2). To implement the maximization problem (2), a minorize-maximization (MM) algorithm 35 has been constructed 30 , in which the log penalty is replaced at each iteration by a weighted lasso penalty of the form

P(Ω | Ω̂^(t)) = ρ Σ_{i=1}^p ‖ω_(·,−i)‖_1 / (‖ω̂^(t)_(·,−i)‖_1 + a_i),   (6)

where Ω̂^(t) denotes the estimate at the tth iteration. In general, Ω̂ must be symmetric. The algorithm proceeds as follows:

1. Set t = 0 and choose an initial value Ω̂^(0).
2. Compute the weights in (6) from the current estimate Ω̂^(t).
3. Update Ω̂^(t+1) via the weighted graphical lasso (2) with the penalty (6).
4. t ← t + 1. Repeat steps 2–4 until convergence or until a maximum number of iterations T is reached.
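The majorization step can be sketched as follows; this derivation assumes the log-penalty form P_3 given above and uses the standard tangent-line bound on the concave logarithm, so it is a reconstruction of the usual MM argument rather than the paper's exact display:

```latex
% The logarithm is concave, so it lies below its tangent line at any x_0 > 0:
\log x \;\le\; \log x_0 + \frac{x - x_0}{x_0}.
% Applying this bound to each term of P_3 at the current iterate
% \widehat{\Omega}^{(t)} majorizes the penalty by a weighted lasso penalty:
P_3(\Omega) \;\le\; \rho \sum_{i=1}^{p}
  \frac{\|\omega_{(\cdot,-i)}\|_1}
       {\|\widehat{\omega}^{(t)}_{(\cdot,-i)}\|_1 + a_i} + \mathrm{const},
% so each MM iteration maximizes a minorizer of (2), i.e., it solves a
% weighted graphical lasso with edge weights
\rho^{(t)}_{i} = \frac{\rho}{\|\widehat{\omega}^{(t)}_{(\cdot,-i)}\|_1 + a_i}.
```

Because the majorizer touches the penalty at Ω̂^(t), each iteration is guaranteed not to decrease the objective in (2), which is the defining property of an MM algorithm.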
To obtain Σ_2 and Σ_3, the tuning parameters a_i (i = 1, . . . , p) and ρ must be determined. Following the experiments in 30 , a_i = 1 was set for i = 1, . . . , p. To select the value of the regularization parameter ρ, several candidates were first prepared; in our simulation, the candidates were ρ = 0.3, 0.4, 0.5, 0.6, 0.7. From these, the value of ρ was selected such that the extended Bayesian information criterion (EBIC 36,37 ),

EBIC = −2ℓ(Ω̂) + q log n + 4qγ log p,

was minimized, where ℓ(Ω̂) is the maximized log-likelihood. Here, q is the number of nonzero parameters of the upper triangular matrix of Ω̂, and γ ∈ [0, 1) is a tuning parameter. As the value of γ increases, a sparser graph is generated. Note that γ = 0 corresponds to the ordinary BIC 38 . We set γ = 0.5 because this value is shown to yield good performance in both simulated and real data analyses 37 . As a result, the EBIC selected ρ = 0.5.
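The EBIC computation can be sketched as follows; the additive likelihood constant and the exact counting convention for q (here, nonzero upper-triangular entries including the diagonal) are assumptions of this illustration, not taken from the paper:

```python
import numpy as np

def ebic(S, prec, n, gamma=0.5):
    # Gaussian log-likelihood of a precision matrix (up to an additive constant).
    _, logdet = np.linalg.slogdet(prec)
    loglik = (n / 2.0) * (logdet - np.trace(S @ prec))
    p = prec.shape[0]
    # q: nonzero parameters in the upper triangle (diagonal included here;
    # the paper's exact convention may differ).
    q = int(np.sum(np.abs(np.triu(prec)) > 1e-8))
    return -2.0 * loglik + q * np.log(n) + 4.0 * q * gamma * np.log(p)

# Toy comparison: at (nearly) equal fit, the denser model is penalized more.
S = np.eye(3)
sparse = np.eye(3)
denser = np.eye(3)
denser[0, 1] = denser[1, 0] = 0.01   # one extra, essentially useless edge
print(ebic(S, sparse, 100) < ebic(S, denser, 100))   # True: sparser model wins
```

In the selection procedure above, this criterion is evaluated for each candidate ρ and the minimizer is kept.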
The upper triangular matrix of Ω_3 must be estimated with the reweighted graphical lasso problem. A value of p = 17,205 results in p(p + 1)/2 ≈ 148 million parameters. As a result, with the machine used in this study (Intel Xeon 3 GHz, 128 GB memory), it would take several days to conduct the reweighted graphical lasso approach, even with a small number of iterations such as T = 5. For this reason, T = 5 iterations were employed to produce Σ_3 here. Finally, Σ_2 and Σ_3 were scaled such that their signal-to-noise ratio became 1. Figure 2 depicts the logarithm of the largest 30 eigenvalues of Σ_j (j = 1, 2, 3). The first few largest eigenvalues of Σ_3 are significantly larger than those of Σ_2, implying that scale-free networks tend to produce predictors with large correlations.
In addition, the root mean squared error (RMSE) is calculated as follows:

RMSE = sqrt( ‖Ỹ_test − X̃_test β̂‖_2^2 / N_test ),   (10)

where N_test = 1000 is the number of test observations.

Results
The box-and-whisker plots of the RMSE and the coefficient of determination (R^2) are illustrated in Figs. 4 and 5. The horizontal axis is N (the number of observations of the training data) and the vertical axis is the RMSE or R^2 based on 1000 observations of test data. We compared the performance of the lasso with that of the PCR. When Σ_1 and Σ_3 were used, the PCR performed worse than the lasso for small sample sizes. For Σ_2, the prediction performance of the PCR was unsatisfactory even when the sample size N increased. The poor performance of the PCR can be attributed to the response depending on predictors associated with small eigenvalues, which the PCR discards. Figure 6 depicts a scatter plot of the nonzero elements of β and the eigenvector for the maximum eigenvalue of Σ_2. As can be seen, almost no correlation existed; in fact, the correlation coefficient was only 0.068.
The prediction accuracy was compared among the three covariance structures. In all the cases except the PCR with Σ_2, the values of the RMSE decreased and R^2 increased as the value of N increased. Further, R^2 was unstable for small sample sizes in all the cases in which the lasso was applied. For large sample sizes, the R^2 of Σ_1 was better than that of Σ_2 and Σ_3. As described before, Σ_1 was the sample covariance matrix, while Σ_3 (and Σ_2) was estimated using the graphical lasso. Because the lasso-type regularization methods shrink parameters toward zero, the correlations among the explanatory variables are reduced when the graphical lasso is used. Therefore, Σ_2 and Σ_3 resulted in smaller correlations as compared to Σ_1. Consequently, the R^2 may increase with stronger correlations. We also compared the RMSE results of Σ_2 and Σ_3. With Σ_2, we found that a sufficiently large number of observations is required to yield a small RMSE with the lasso. Meanwhile, Σ_3 resulted in a small RMSE with a relatively small number of observations, such as N = 300.
Code availability. The proposed simulation is implemented in the R package simrnet, which is available at https://github.com/keihirose/simrnet. Below is a sample code of simrnet in R: When p = 100, it took less than 12 min to conduct the simulation with 100 replications using the machine employed herein (Intel Xeon 3 GHz, 128 GB memory). For high-dimensional data such as p = 17,205, which was used in the simulation presented in this paper, several days were required to complete the simulation task.

Concluding remarks
In a gene regulation network, a gene regulates a small portion of the genome, not all of the genes in the genome. This indicates that a gene regulation network is expected to be a sparse network rather than a complete graph. Therefore, two covariance matrices corresponding to sparse networks (Σ_2, Σ_3) were prepared in addition to a covariance matrix derived from the empirical data (Σ_1). Generally, although hundreds of genes contribute to defining a trait, their contributions are not equal: it is frequently observed that the genes regulating a trait include a few large-effect genes and many small-effect genes. This property was reflected in the distribution of β (Fig. 3); we considered the case in which a limited number of regression coefficients contributed substantially to the definition of a trait. The Monte Carlo simulation results indicated that, regardless of the network structure, the number of observations should be at least 200 to accurately predict traits from a transcriptome (Σ_1, Σ_3, Figs. 4 and 5). We also found that the lasso generally provided better accuracy than the PCR. In particular, when the gene regulation network was random (Σ_2), the prediction accuracy of the PCR was poor even as the sample size increased. In conclusion, it is important to secure sufficiently large sample sizes when performing regression analysis of data that exhibit either a random graph or a scale-free network. Additionally, we conclude that the lasso is preferable to the PCR to ensure good prediction accuracy. Conventional theory on the relationship between the RMSE and the sample size has been developed under the assumption that the sample size exceeds the number of explanatory variables 39 . However, omics data, which are rapidly being accumulated, result in high-dimensional data with strong correlations. Thus, our simulation study considered more complicated settings than the traditional ones.
Our simulation, or its extension, may be used in the future to find clues about theoretical aspects that may ultimately lead to the development of a sample-size determination technique for omics data. Other than the scale-free network, the small-world network is another notable structure in the networks literature 40 . A small-world network is one in which the shortest path length between two randomly chosen variables is proportional to log p; that is, it is considerably small compared with the network size. Small-world networks have been investigated in various fields of research, including biology [41–43]. Some statistical properties of small-world networks have also been studied [44–46]. The investigation of prediction accuracy in small-world networks would be interesting but is beyond the scope of this research; we would like to take this up as a future research topic. The development of methods that provide better prediction accuracy than the lasso in various network structures with small sample sizes would also be an important future research topic.