Machine learning random forest for predicting oncosomatic variant NGS analysis

Since 2017, we have used the Ion Torrent NGS platform in our hospital to support cancer diagnosis and treatment. Analyzing the variants from each run requires considerable time, and we still struggle with variants that at first appear correct on the metrics but are found to be negative upon further investigation. Can a machine learning (ML) algorithm help us classify NGS variants? This question led us to investigate which ML approach fits our NGS data and to develop a tool that can be routinely implemented to help biologists. One of the greatest current challenges in medicine is processing very large quantities of data. This is particularly true in molecular biology since the advent of next-generation sequencing (NGS) for molecular tumor profiling and treatment selection. In addition to bioinformatics pipelines, artificial intelligence (AI) can be valuable in analyzing mutation variants. Generating sequencing data from patient DNA samples has become easy to perform in clinical trials. However, analyzing the massive quantities of genomic or transcriptomic data and extracting the key biomarkers associated with a clinical response to a specific therapy requires a formidable combination of scientific expertise, biomolecular skills, and a panel of bioinformatic and biostatistical tools, a setting in which artificial intelligence is now helping to shape future routine diagnostics. Moreover, cancer genome complexity and technical artifacts make identifying real variants challenging. We present a machine learning method for classifying pathogenic single nucleotide variants (SNVs), single nucleotide polymorphisms (SNPs), multiple nucleotide variants (MNVs), insertions, and deletions detected by NGS in different tumor types: colorectal, melanoma, lung, and glioma.
We compared different machine learning algorithms on our NGS data using k-fold cross-validation, together with neural networks (deep learning), to measure their performance and determine which one is a valid model for confirming NGS variant calls in cancer diagnosis. We trained the models on 70% of our data samples, extracted from our local database (each record has 7 parameters: chromosome, position, exon, variant allele frequency, minor allele frequency, coverage, and protein description), and validated them on the remaining 30%. The model offering the best accuracy was chosen and implemented in the routine NGS analysis. The models were developed in R version 3.6.0. We trained on 70% of 102,011 variants. Our best error rate (0.22%) was obtained with a random forest (ntree = 500 and mtry = 4), with an AUC of 0.99. Neural networks also achieved good scores: the final trained neural network reached an accuracy of 98% and a ROC-AUC of 0.99 on the validation data. We then used our RF model to interpret more than 2,000 variants from our NGS database: 20 variants were misclassified (error rate < 1%). The errors were nomenclature problems and false positives. After adding the false positives to our training database and implementing the RF model routinely, our error rate remained < 0.5%. The RF model shows excellent results for oncosomatic NGS interpretation and can easily be implemented in other molecular biology laboratories. AI is becoming increasingly important in molecular biomedical analysis and can be very helpful in processing medical data. Neural networks show good capacity for variant classification and may, in the future, be useful for predicting more complex variants.
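The 70%/30% split and the k-fold cross-validation scheme described above can be sketched in plain Python. This is illustrative only: the authors' models were built in R, and the integer records below are stand-ins for the 7-field variant records.

```python
import random

def train_valid_split(records, train_frac=0.7, seed=1):
    """Shuffle and split records into a training set and a validation set."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def k_fold_splits(records, k=5):
    """Yield (train, validation) pairs for k-fold cross-validation."""
    for fold in range(k):
        valid = [r for i, r in enumerate(records) if i % k == fold]
        train = [r for i, r in enumerate(records) if i % k != fold]
        yield train, valid

# Stand-in for the variant records described above.
variants = list(range(1000))
train, valid = train_valid_split(variants)
print(len(train), len(valid))  # 700 300
```

Each candidate model is fitted on the training part of every fold and scored on the held-out part; the model with the best averaged accuracy is then retrained on the full 70% training set.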


Figure S11. Artificial Neural Network (ANN)
The larger the input value, the closer the output value will be to 1. Conversely, the more negative the input value, the closer the output value will be to zero. This result tells us how strongly the neural network is activated: at 0 the neuron is inactive, at 1 it is fully activated, and it can take any intermediate state. To gain more control over this output value, we introduce the bias b. The bias is the threshold at which we consider the neuron activated. For example, our sigmoid function returns a 50% activated neuron when it receives an input of 0; if we add a bias to the equation, the neuron is activated more easily. This value allows us to shift the activation function. We include the bias b in the input value of our function, and it is adjusted, together with all the weights, during the network training phase. After this calculation, we obtain an output value. This process, in which a value is propagated forward until a result is delivered, is called feedforward. Information about the initial variant propagates through the hidden layers according to the weight values, and we finally obtain a result in the output layer. In our model, there are 2 output neurons. Each represents a number between zero and one (NO or YES). The degree of activation of each neuron represents the probability that the variant passed as input is pathogenic: 1 (YES) or 0 (NO) (according to our training dataset). For example, for the variant chr12:25398280 / KRAS / 24.41% / 1962 / c.35G>T / p.Gly12Val, the best possible answer would be all neurons shut off (NO = 0) except the one representing YES (= 1).
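The sigmoid activation, the role of the bias, and a single feedforward step can be sketched in a few lines of Python (illustrative only; the weights and inputs are arbitrary examples, not values from the trained network):

```python
import math

def sigmoid(x):
    """Squash any real input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs plus bias, then sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

print(sigmoid(0.0))                         # 0.5: the neuron is "half activated"
print(neuron([0.0, 0.0], [1.0, 1.0], 2.0))  # > 0.5: the positive bias shifts activation
```

With a zero input the sigmoid alone returns exactly 0.5; adding a positive bias shifts the activation curve so the same input activates the neuron more easily, which is the shifting effect described above.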
At the beginning, when the network tries to produce a solution for the first time (the first epoch: an epoch is when the entire dataset is passed forward and backward through the neural network exactly once), all of its output neurons are more or less activated; it predicts randomly because it has no information about the expected result. To improve the result, we have to correct the parameters of the neural network. The only way to alter the behavior of the network is to vary its weights. The learning process consists of finding the most accurate possible weights so as to provide a correct answer. To do this, we compare the expected result with the predicted result and obtain a value called the loss. The higher the loss, the farther the network is from the correct answer. From the loss, we can deduce which weights contributed most to the incorrect answer; we then adjust those weights to minimize the loss and get closer to the correct answer. On the next epoch, we try with another variant. The network will be wrong again, but closer to the expected answer. The loss value represents the network's error for each of its output neurons. Among our parameters, we then have to identify which combination of weights yields the smallest possible loss. To find these values, we use the gradient descent algorithm (the optimizer). Gradient descent uses the derivatives of a function: the derivative gives the gradient of the tangent to the curve at a given point. From that tangent, we can determine how to change the input weight to get closer to the result we are looking for. If the derivative is positive, we have to lower the weight, and vice versa. We can also deduce that the larger the derivative, the farther we are from the correct result.
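The sign rule above (positive derivative, lower the weight; negative derivative, raise it) is exactly what gradient descent implements. A minimal sketch on a toy quadratic loss (the loss function and learning rate here are arbitrary illustrations, not the network's actual loss):

```python
def loss(w):
    """Toy loss curve with its minimum at w = 3."""
    return (w - 3.0) ** 2

def derivative(w):
    """Slope of the tangent to the loss curve at w."""
    return 2.0 * (w - 3.0)

w = 0.0
learning_rate = 0.1
for _ in range(200):
    # Positive derivative -> lower the weight; negative -> raise it.
    w -= learning_rate * derivative(w)

print(round(w, 4))  # 3.0: the derivative is (almost) zero, so the loss is minimal
```

Each update moves the weight against the slope; as the minimum approaches, the derivative shrinks and the steps become smaller, until the derivative is effectively zero.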
We then update the weight, recalculate the derivative, and repeat these steps until we reach a zero derivative, which means we have reached the minimum loss. Gradient descent has a parameter named the learning rate. Initially the steps are larger, meaning the effective learning rate is higher; as the point descends, the steps get shorter and the effective learning rate becomes smaller. Backpropagation is the process of correcting, layer by layer, the weights that led to the error, applying gradient descent. We simply repeat these feedforward and backpropagation steps again and again until we obtain the optimal result. We repeat them thousands of times with many somatic variants, adjusting the weights each time. After many epochs, the network is able to recognize oncosomatic variants. Progressively, the network builds its own representation of each variant presented at the inputs. After the training process, we can test the network on mutation variants it has never seen before. If the learning has been done correctly, the network delivers the expected answer. All these calculations are automated by the neuralnet R package.
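The feedforward/backpropagation loop over epochs can be condensed into a single-neuron training sketch. This is a didactic toy, not the authors' network: the one-feature dataset, squared-error gradient, learning rate, and epoch count are all arbitrary choices for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy training set: label 1 when the single feature is positive.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

rng = random.Random(0)
w, b = rng.uniform(-1, 1), 0.0   # random start: first predictions are arbitrary
learning_rate = 0.5

for epoch in range(1000):            # one epoch = one full pass over the data
    for x, y in data:
        out = sigmoid(w * x + b)     # feedforward
        # Backpropagation for a single neuron: squared-error gradient
        # pushed through the sigmoid via the chain rule.
        dz = (out - y) * out * (1.0 - out)
        w -= learning_rate * dz * x
        b -= learning_rate * dz

print(all((sigmoid(w * x + b) > 0.5) == bool(y) for x, y in data))  # True
```

In a multi-layer network the same gradient is propagated backwards through every layer, which is what the neuralnet package automates.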

Figure S1: Error rate (out of bag) of the random forest (model1$err.rate[,1]) as a function of the number of trees (Index). The plot shows that the error rate stabilized at ntree = 500; values before 500 oscillate excessively, and beyond 500 the extra trees only increase computation time. The number of trees used for feature selection is defined by the lowest error rate in the initial random forest of 500 trees, with 7 variables (chromosome, position, exon, variant allele frequency, minor allele frequency, coverage, amino acid change) randomly sampled as candidates at each split. The out-of-bag (OOB) error rate, used to assess the quality of the random forest's prediction of oncosomatic variants, is shown as a function of the number of decision trees generated during learning with mtry = 4.

Figure S2: The optimal number of variables per split (mtry) is reached at mtry = 4, where the model has the lowest out-of-bag (OOB) error. The tuneRF R function was used to search for the optimal value of mtry. Parameters of tuneRF: ntree, the number of trees grown for tuning = 500; stepFactor, the factor by which mtry is inflated or deflated at each step = 0.5; improve, the minimum improvement in OOB error required for the search to continue = 0.05.

Figure S11: The smaller the mean minimal depth, the more important the variable, and the higher up the y-axis it appears. The rainbow gradient shows the minimum and maximum minimal depth for each variable. The larger the proportion of trees with minimal depth zero (red blocks), the more frequently the variable is used for splitting. The x-axis ranges from zero to the maximum number of trees for the feature. chr: chromosome; MAF: minor allele frequency; POS: position; coverage; exon; protdesc: amino acid change.

Downsample To Coverage
Reduce coverage in over-sampled locations to this value. Allowed values: Integers >= 1.

Filter Unusual Predictions
Filter out predictions that are distorted to fit the data by more than this distance (relative to the size of the variant). Enable the position bias filter.
True False

Indel As HPindel
Apply indel filters to non-HP indels.
True False

Do MNP Realignment
Realign reads in the vicinity of candidate MNP variants.
True False

FD Nonsnp Min Var Cov
Override min_var_coverage of the flow-disrupted variants that are not SNPs (0 to disable the override).
Suggested trial value between 0 and 10. Impact: decreasing values make variant calls less specific but more sensitive. Allowed range: 0 <= value <= 10 (default 1).

Read Mismatch Limit
Do not use reads whose number of mismatches (where 1 gap open counts as 1) is above this value. Allowed values: integers >= 0, where 0 disables this filter. Suggested trial value: 5 (TagSeq), 0 (other). Allowed range: value >= 0 (default 0).

Min Cov Fraction
Do not use reads whose fraction of coverage of the best-assigned unmerged target region is below this value. Allowed values: decimal numbers between 0 and 1. Suggested trial value: 0.9 (TagSeq), 0 (other). Default 0.

Use Input Allele Only
Only consider alleles provided in the hotspots file. Allowed values: 0 = generate de novo candidates, 1 = hotspots only. Allowed range: 0 <= value <= 1 (default 0).

Min Family Size
Minimum number of reads with the same UID required to form a functional family. Suggested value between 3 and 7. Impact: increasing values make variant calls less sensitive but more specific.

SNP Min Var Coverage
Minimum number of variant-supporting functional families required to make a SNP call. Suggested trial value between 2 and 10. Impact: increasing values make variant calls less sensitive but more specific. Allowed range: 2 <= value <= 10 (default 2).

MNP Min Var Coverage
Minimum number of variant-supporting functional families required to make an MNP call. Suggested trial value between 2 and 10. Impact: increasing values make variant calls less sensitive but more specific.
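Parameters like those above are typically stored in a Torrent Variant Caller parameter file. The fragment below is a hedged sketch only: the snake_case key names are inferred from the parameter labels above and should be verified against your Torrent Suite version; the values are the defaults or suggested values quoted above, except downsample_to_coverage and the boolean flags (shown as 0/1), which are arbitrary examples.

```json
{
  "torrent_variant_caller": {
    "downsample_to_coverage": 2000,
    "indel_as_hpindel": 0,
    "do_mnp_realignment": 1,
    "fd_nonsnp_min_var_cov": 1,
    "read_mismatch_limit": 0,
    "min_cov_fraction": 0,
    "use_input_allele_only": 0,
    "min_family_size": 3,
    "snp_min_var_coverage": 2,
    "mnp_min_var_coverage": 2
  }
}
```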