TSLRF: Two-Stage Algorithm Based on Least Angle Regression and Random Forest in genome-wide association studies

One of the most important tasks in genome-wide association studies (GWAS) is the detection of single-nucleotide polymorphisms (SNPs) that are related to target traits. With the development of sequencing technology, it has become difficult for traditional statistical methods to analyze the resulting high-dimensional, massive SNP data. Recently, machine learning methods have become popular in high-dimensional genetic data analysis because of their fast computation. However, most machine learning methods suffer from several drawbacks, such as poor generalization ability, over-fitting, unsatisfactory classification, and low detection accuracy. This study proposes a two-stage algorithm based on least angle regression and random forest (TSLRF), which first controls for population structure and polygenic effects, then selects the SNPs potentially related to target traits using least angle regression (LARS), and finally analyzes this variable subset using random forest (RF) to detect quantitative trait nucleotides (QTNs) associated with target traits. The new method showed greater detection power in both simulation experiments and real data analyses. The simulation results showed that, compared with existing approaches, the new method effectively improved QTN detection power and model fit while requiring less computation time; in addition, it clearly separated QTNs from the other SNPs. The new method was then applied to five flowering-related traits in Arabidopsis. The distinction between QTNs and unrelated SNPs was more pronounced than with the other methods; the new method detected 60 genes confirmed to be related to the target traits, significantly more than the other methods, while simultaneously detecting multiple gene clusters associated with the target traits.


Fast multi-locus random-SNP-effect EMMA (FASTmrEMMA)
Let $y_i$ $(i = 1, 2, \ldots, n)$ be the phenotypic value of the $i$th individual in a sample of size $n$ from a natural population. The genetic model can be described as:

$$y = W\alpha + Z\gamma + u + \varepsilon$$

where $y = (y_1, \ldots, y_n)^T$; $\alpha$ is a $c \times 1$ vector of the fixed effects, say the intercept, population structure effect and so on; $\gamma \sim N(0, G)$ is the vector of random QTN effects, with $G = \mathrm{diag}\{\sigma_{\gamma_1}^2, \ldots, \sigma_{\gamma_p}^2\}$; $p$ is the number of putative QTNs; $W$ and $Z$ are the corresponding design matrices for $\alpha$ and $\gamma$; the polygenic effect $u \sim N(0, \sigma_g^2 K)$ is an $n \times 1$ random vector; $K$ is a known $n \times n$ relatedness matrix; $\varepsilon$ is the residual error with an assumed $N(0, \sigma_e^2 I_n)$ distribution; $\sigma_e^2$ is the residual error variance; $I_n$ is an $n \times n$ identity matrix.
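To make the model concrete, the following minimal Python sketch simulates a phenotype vector under this mixed model. All dimensions, genotype codings, and variance values are illustrative assumptions rather than the paper's settings, and the relatedness matrix K is built from random background markers purely for demonstration.

```python
import numpy as np

# Minimal simulation sketch of y = W*alpha + Z*gamma + u + eps.
# All sizes and parameter values below are illustrative assumptions.
rng = np.random.default_rng(0)
n, c, p = 200, 2, 10          # individuals, fixed effects, putative QTNs

W = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one covariate
alpha = np.array([1.0, 0.5])                            # fixed effects

Z = rng.choice([0.0, 1.0, 2.0], size=(n, p))            # QTN genotype codes
sigma2_gamma = rng.uniform(0.0, 0.5, size=p)            # per-QTN variances (diag of G)
gamma = rng.normal(0.0, np.sqrt(sigma2_gamma))          # gamma ~ N(0, G)

M = rng.choice([0.0, 1.0, 2.0], size=(n, 500))          # background markers
Mc = M - M.mean(0)
K = Mc @ Mc.T / M.shape[1]                              # simple relatedness matrix
sigma2_g = 0.4
L = np.linalg.cholesky(K + 1e-8 * np.eye(n))            # jitter: K is rank-deficient
u = sigma2_g ** 0.5 * (L @ rng.normal(size=n))          # u ~ N(0, sigma2_g * K)

sigma2_e = 1.0
eps = rng.normal(0.0, np.sqrt(sigma2_e), size=n)        # residual error

y = W @ alpha + Z @ gamma + u + eps
```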

Least angle regression
Assume that $\hat{\mu}_A$ is the current least angle regression (LARS) [3] estimate, so that the vector of current correlations is $\hat{c} = X^T(y - \hat{\mu}_A)$. Let $A$ be the set of indices corresponding to the markers with the greatest absolute current correlations; $B = \{1, 2, \ldots, m\}$ the set of all marker indices, with markers $X_B = \{x_1, x_2, \ldots, x_m\}$; and $k$ the iteration step. The specific algorithm for LARS is as follows:

Step 1: Standardize the markers, initialize $\hat{\mu}_A = 0$, compute the current correlations $\hat{c} = X^T(y - \hat{\mu}_A)$, and set $A = \{j : |\hat{c}_j| = \max_j |\hat{c}_j|\}$.

Step 2: Determine the minimum angle (equiangular) direction $u_A$. Let the unit column vector $1_A = (1, \ldots, 1)^T$, whose length is $|A|$. Let $X_A = (\cdots\ s_j x_j\ \cdots)_{j \in A}$, where $s_j = \mathrm{sign}(\hat{c}_j)$ is the sign of component $j$ of the correlation vector $\hat{c}$. Let $G_A = X_A^T X_A$, $A_A = (1_A^T G_A^{-1} 1_A)^{-1/2}$, and $w_A = A_A G_A^{-1} 1_A$, so that $u_A = X_A w_A$ satisfies $X_A^T u_A = A_A 1_A$ and $\|u_A\|^2 = 1$; that is, $u_A$ makes equal angles with all active markers. Let $a = X_B^T u_A$, which is the correlation of the minimum angle direction with the gene markers in $B$.
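For illustration, the following minimal Python sketch (an assumption-laden illustration, not the paper's implementation) computes the Step 2 quantities for one iteration, given standardized markers X, phenotype y, and the current estimate:

```python
import numpy as np

def lars_equiangular_step(X, y, mu_hat, tol=1e-12):
    """One LARS Step-2 computation: the minimum angle (equiangular) direction.

    X      : (n, m) standardized marker matrix
    y      : (n,) phenotype vector
    mu_hat : (n,) current LARS estimate
    Returns the active set A, direction u_A, and correlations a = X_B^T u_A.
    """
    c_hat = X.T @ (y - mu_hat)                     # current correlations
    C = np.max(np.abs(c_hat))
    A = np.flatnonzero(np.abs(c_hat) >= C - tol)   # most-correlated markers
    s = np.sign(c_hat[A])
    XA = X[:, A] * s                               # sign-adjusted active columns
    GA = XA.T @ XA
    ones = np.ones(len(A))
    GA_inv_ones = np.linalg.solve(GA, ones)
    AA = (ones @ GA_inv_ones) ** -0.5              # A_A = (1' G^-1 1)^(-1/2)
    wA = AA * GA_inv_ones                          # w_A = A_A G^-1 1
    uA = XA @ wA                                   # equiangular direction, ||u_A|| = 1
    a = X.T @ uA                                   # correlation with all markers
    return A, uA, a
```

A quick check against the definitions: $X_A^T u_A = G_A w_A = A_A 1_A$ and $\|u_A\|^2 = w_A^T G_A w_A = A_A^2\, 1_A^T G_A^{-1} 1_A = 1$, as required.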

Constructing random forest models
The specific steps of the algorithm for constructing random forests [4] are as follows: Step 1: Set the number of trees in the random forest, ntree. ntree new bootstrap sample sets are generated from the original training data set by using the Bagging algorithm. The samples not drawn in each round form the ntree out-of-bag (OOB) data sets.
Step 2: One CART tree is grown from each bootstrap sample set. If there are k features, mtry features (mtry << k) are randomly selected at each node of each tree (generally, when dealing with regression problems, the default is mtry = k/3). Taking minimum node impurity, that is, the minimum Gini index, as the criterion, the feature with the minimum Gini index and its corresponding split point are selected from the mtry features as the optimal feature and optimal split point. Two child nodes are then generated from the current node, and the training data are assigned to the two child nodes according to this feature. Each tree is grown to its maximum extent without pruning.
The Gini index of node $t$ is expressed as $\mathrm{Gini}(t) = 1 - \sum_{j=1}^{k} p(j|t)^2$, where $k$ is the number of categories of the test output under the current attribute and $p(j|t)$ is the probability that the test output of a sample in node $t$ takes class $j$.
Step 3: A random forest is constructed from the ntree CART trees. For classification problems, the random forest takes the class with the largest number of votes over the trees as the prediction for new data. For regression problems, the random forest takes the average output of all CART trees as the prediction, as illustrated in the sketch below.
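As a sketch of how such a forest could be configured, the snippet below uses scikit-learn's RandomForestRegressor as a stand-in for the paper's implementation, mapping ntree onto n_estimators and mtry onto max_features; the data shapes and values are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Illustrative data: n individuals, k SNP features (assumed shapes).
rng = np.random.default_rng(0)
n, k = 200, 1000
X = rng.choice([0.0, 1.0, 2.0], size=(n, k))   # genotype codes
y = X[:, :5] @ rng.normal(size=5) + rng.normal(size=n)

forest = RandomForestRegressor(
    n_estimators=500,        # ntree: number of CART trees
    max_features=1 / 3,      # mtry = k/3, the regression default noted above
    bootstrap=True,          # bagging: each tree sees a bootstrap sample
    oob_score=True,          # keep OOB predictions for error estimation
    random_state=0,
)
forest.fit(X, y)
print("OOB R^2:", forest.oob_score_)           # OOB-based fit measure
print("prediction:", forest.predict(X[:1]))    # average over all trees
```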

Variable importance assessment of random forests
Variable importance assessment is an important feature of random forests. The importance ranking of variables is obtained by scoring the importance of each variable.
Random forests use variable importance as the variable selection standard, which increases the interpretability of the model. The variable importance measure (VIM) methods commonly used in random forests fall into two categories: one is calculated from the Gini index, the other from the OOB error rate. For the OOB-error-based VIM, the classical random forest gives two importance indices: mean decrease accuracy for classification data and %IncMSE for regression data; the larger the mean decrease accuracy or %IncMSE, the more important the variable. For the Gini-based VIM, the classical random forest gives two importance indices: mean decrease Gini for classification data and IncNodePurity for regression data; the larger the mean decrease Gini or IncNodePurity, the more important the variable. In this paper, the OOB-error-based method was used as the importance scoring method, with %IncMSE as the importance index.
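As a rough analogue of scoring by %IncMSE, the following sketch uses scikit-learn's permutation_importance; note that it permutes variables on the supplied data rather than on each tree's OOB sample, so it only approximates %IncMSE. All data here are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.choice([0.0, 1.0, 2.0], size=(200, 50))      # assumed genotype matrix
y = X[:, 0] * 0.8 + rng.normal(size=200)             # variable 0 is informative

forest = RandomForestRegressor(n_estimators=300, oob_score=True,
                               random_state=0).fit(X, y)

# Shuffle each variable in turn and measure the increase in MSE.
result = permutation_importance(forest, X, y, n_repeats=10,
                                scoring="neg_mean_squared_error",
                                random_state=0)
print("most important variable:", np.argmax(result.importances_mean))
```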
Assume the sample size of the original sample set is n and the number of variables is k; the generated random forest has ntree trees and hence ntree OOB data sets. The specific steps of the calculation of %IncMSE are as follows:

Step 1: For each tree in the random forest, the OOB prediction error is calculated from the corresponding OOB data. For classification data, the OOB prediction error is the misclassification rate; for regression data, it is the mean squared error (MSE). The ntree OOB prediction errors obtained are denoted $MSE_1, MSE_2, \ldots, MSE_{ntree}$.

Step 2: Randomly permute the values of variable $X_j$ $(j = 1, 2, \ldots, k)$ in each OOB data set and recompute the ntree OOB prediction errors, denoted $MSE'_{1j}, MSE'_{2j}, \ldots, MSE'_{ntree,j}$.

Step 3: The importance of a prediction variable is the difference between the permuted prediction error and the original one, averaged over the trees and scaled by its standard error. The importance score of $X_j$ is expressed as:

$$VIM(X_j) = \frac{\frac{1}{ntree}\sum_{t=1}^{ntree}\left(MSE'_{tj} - MSE_t\right)}{SE_j}, \quad (1 \le j \le k)$$

where $SE_j = \hat{\sigma}_j / \sqrt{ntree}$ and $\hat{\sigma}_j$ is the standard deviation of $(MSE'_{tj} - MSE_t)$ across the trees in the random forest.
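A minimal numeric sketch of this scaled importance score, assuming the per-tree OOB errors before and after permutation have already been collected (the numbers below are made up):

```python
import numpy as np

def vim_score(mse_orig, mse_perm):
    """Scaled permutation importance for one variable X_j.

    mse_orig : (ntree,) OOB prediction error of each tree
    mse_perm : (ntree,) OOB error of each tree after permuting X_j
    """
    diff = mse_perm - mse_orig                   # per-tree increase in error
    ntree = diff.shape[0]
    se = diff.std(ddof=1) / np.sqrt(ntree)       # standard error of the mean
    return diff.mean() / se                      # mean increase scaled by its SE

# Illustrative per-tree errors for ntree = 4 trees (made-up numbers).
mse_orig = np.array([1.0, 1.2, 0.9, 1.1])
mse_perm = np.array([1.5, 1.6, 1.4, 1.7])
print(vim_score(mse_orig, mse_perm))
```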

Supplementary Software
Software S1: The program code for the two-stage algorithm based on least angle regression and random forest (TSLRF). This file for the program code of TSLRF includes the following files: 1) "input files": simulation analysis data: Supplementary Data S1.csv contains the genotypic values of each SNP marker for all individuals; Supplementary Data S2.csv contains the phenotypic values for all individuals.
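A possible way to load these input files in Python (the column layout is an assumption; consult the files themselves for the actual structure):

```python
import pandas as pd

# Illustrative loading of the supplementary input files; the exact
# column layout of each CSV is an assumption.
genotypes = pd.read_csv("Supplementary Data S1.csv")   # SNP genotype values
phenotypes = pd.read_csv("Supplementary Data S2.csv")  # phenotypic values
print(genotypes.shape, phenotypes.shape)
```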