Prioritizing non-coding regions based on human genomic constraint and sequence context with deep learning

Elucidating functionality in non-coding regions is a key challenge in human genomics. It has been shown that intolerance to variation of coding and proximal non-coding sequence is a strong predictor of human disease relevance. Here, we integrate intolerance to variation, functional genomic annotations and primary genomic sequence to build JARVIS: a comprehensive deep learning model to prioritize non-coding regions, outperforming other human lineage-specific scores. Despite being agnostic to evolutionary conservation, JARVIS performs comparably or outperforms conservation-based scores in classifying pathogenic single-nucleotide and structural variants. In constructing JARVIS, we introduce the genome-wide residual variation intolerance score (gwRVIS), applying a sliding-window approach to whole genome sequencing data from 62,784 individuals. gwRVIS distinguishes Mendelian disease genes from more tolerant CCDS regions and highlights ultra-conserved non-coding elements as the most intolerant regions in the human genome. Both JARVIS and gwRVIS capture previously inaccessible human-lineage constraint information and will enhance our understanding of the non-coding genome.

1. Differences in AUROCs have been used to support the advantages of the authors' methods. However, the authors did not provide p-values for these differences. Thus, it is unclear to me whether the authors' methods are statistically significantly better than alternative approaches. Following a common practice in statistics, the authors should use DeLong's test to compute a p-value whenever two AUROCs are compared, and discuss the statistical significance of their findings accordingly.
2. Figure 2a in the main text shows that gwRVIS has limited power to identify constrained regions when the number of variants is small. It may be explained by the heteroscedasticity in the data, as evident by the increase of residual variance with the total number of variants in Figure 2a. I suggest the authors fit a linear regression model with heteroscedasticity to the same data, and define a new RVIS score as the ratio of the residual and the estimated standard deviation from the heteroscedastic model. This modification will increase the variance of the RVIS score when the estimated standard deviation is small, and hence mitigate the undesired dependence of gwRVIS score on the total number of variants.
3. The authors used non-coding pathogenic variants from ClinVar and putatively benign variants from denovo-db as positive and negative data, respectively. Because most ClinVar variants are proximal to protein-coding genes, a machine learning method could "cheat" in the prediction of pathogenic variants by using features correlated with distances to protein-coding genes. While the authors didn't use distances to protein-coding genes as a feature in their models, I expect that gwRVIS scores will, nevertheless, be correlated with the distances due to its large window-size (3kb). To fully rule out the impact of the distances to variant prediction, it is better to compare different methods on a set of positive and negative data with matched distances to protein-coding genes. This strategy was originally used in the GWAVA paper (PMID: 24487584).
4. In the prioritization of pathogenic structural variants (SVs), the authors considered SVs in intergenic regions as benign and SVs in other regions as pathogenic. Thus, the author's model should be interpreted as a tool to discriminate between genic and intergenic SVs, not a tool for unbiased prioritization of pathogenic SVs. The authors should revise this part of their manuscript accordingly.
5. I am wondering what convolutional neural networks (CNNs) learned from raw sequences in this paper. If the CNNs learned meaningful patterns for variant prioritization, their convolutional filters should correspond to motifs for transcription factor binding. The authors can use existing methods (e.g., PMID: 26213851 & 27896980) to visualize their convolutional filters and check if they correspond to any known motifs.
6. The authors should also compare their methods with two other tools, EIGEN (PMID: 26727659) and DeepSEA (PMID: 26301843), which I found highly competitive in predicting non-coding pathogenic variants.

7. It is unclear to me if the authors used any regularization techniques, such as early stopping, weight decay, etc., in the training of deep neural networks. If not, these models may suffer from severe overfitting.
Reviewer #2 (Remarks to the Author): In their manuscript "Prioritization of non-coding regions based on human genomic constraint and primary sequence context with deep learning", Vitsios et al. present a measure of a genome-wide windowed variant intolerance score (gwRVIS) and an integration of this measure with other annotations in a pathogenicity score (JARVIS) applicable for non-coding sequences. The work highlights that the derived pathogenicity score does not use species conservation and may therefore have advantages over previous measures. While the authors do not explicitly prove such an advantage, they highlight the performance of their score in a number of validation and test data sets. The work is interesting and relevant for a larger audience; however, I have a number of comments and major concerns about the applied cross-validation approach.
Major comments: -ClinVar pathogenic variants correspond to rather small numbers and are highly clustered. While the small numbers are an issue for training complex models (like DNNs), the clustering is an issue for assessing model performance. The authors cannot use a simple cross-validation scheme under these circumstances. They need to make sure that all variants from the same genomic region are in the same fold. Otherwise, their results will be overly optimistic. While this is critical, it does not seem that the authors have done that.
-You are independently optimizing MAF and length parameters. I would expect an interaction as larger windows contain more variants and increase the likelihood of observing more rare alleles. This would require something like a grid-search. Given the rather flat distributions, this effect might be small and I understand if the authors do not want to change the current implementation. However, the authors should mention this limitation and provide the MAF value that they used for optimizing the window size (lines 527-538).

-DNN architecture: You are using very little data to train the DNNs while having several fully connected layers. Have you considered things like drop-out? How did you identify the architectures and how did you optimize hyperparameters like learning rates and number of iterations? If derived from other literature, provide references and values.
-When using the CNN, do you see better base pair level resolution of your scores? What is the resolution? Are your scores useful for fine mapping applications or even pinpointing TF binding sites (probably not)? Do you have means of checking that (e.g. publicly available saturation mutagenesis MPRA data)?
Minor comments:

-Line 75: You are submitting to a journal with a broad readership; consider introducing "studentized residual" at first mention as residual divided by an estimate of its variance. Please note that you are using both "studentized" and "studentised" in your manuscript.

-Line 124: UCNE abbreviation used before it is introduced below.

-Lines 190-191: FANTOM5 is more than just lncRNAs, please rephrase.

-Line 298: Other than ascertainment bias (with respect to conservation in ClinVar), there is no "overfitting" for LINSIGHT, CADD or DANN. I think the problem is the use of the word "overfitting" by the authors (see below).

-Suppl. Fig 8: Is this the mean ROC or is it the ROC of the mean of the model predictions?

-Lines 298/305/309/364…: The term overfitting is not used correctly. Overfitting describes the lack of generalization of a learner when applied outside of its training data set. This is seen by a reduced performance on a test set. What you mean is giving high weights to certain features, potentially due to an ascertainment bias in the training data sets, thereby showing an increased performance. You need to rephrase. In some sentences you want to say that performance estimates are overly optimistic because test and training overlap; in others you want to talk about biases/being geared towards/having dispositions/preferences…

1. Differences in AUROCs have been used to support the advantages of the authors' methods. However, the authors did not provide p-values for these differences. Thus, it is unclear to me whether the authors' methods are statistically significantly better than alternative approaches. Following a common practice in statistics, the authors should use DeLong's test to compute a p-value whenever two AUROCs are compared, and discuss the statistical significance of their findings accordingly.
In the original training set, JARVIS performs significantly better (p<2.7x10-4) than all other scores that are not impacted by ClinVar-informed data leakage or by the integration of TSS-distance information in their training sets (i.e. ncER and LINSIGHT), except for DeepSEA, which performs comparably (AUC=0.945 versus 0.940 for JARVIS; Fig. 4; DeLong p=0.064).
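For reference, the DeLong comparisons reported in our responses can be reproduced with a short NumPy/SciPy sketch of the fast DeLong algorithm (Sun & Xu, 2014). This is illustrative code with hypothetical variable names, not the implementation used in the manuscript:

```python
import numpy as np
from scipy.stats import norm


def midrank(x):
    """Midranks of x (1-based; ties receive the average of their ranks)."""
    order = np.argsort(x)
    z = x[order]
    n = len(x)
    t = np.zeros(n)
    i = 0
    while i < n:
        j = i
        while j < n and z[j] == z[i]:
            j += 1
        t[i:j] = 0.5 * (i + j - 1) + 1  # average of ranks i+1..j
        i = j
    out = np.empty(n)
    out[order] = t
    return out


def delong_test(y, s1, s2):
    """Two-sided DeLong test for two correlated AUROCs.

    y: 0/1 labels; s1, s2: scores of the two classifiers on the same samples.
    Returns (auc1, auc2, p-value).
    """
    y = np.asarray(y)
    pos, neg = y == 1, y == 0
    m, n = int(pos.sum()), int(neg.sum())
    aucs, v10, v01 = [], [], []
    for s in (np.asarray(s1, float), np.asarray(s2, float)):
        x, w = s[pos], s[neg]
        tx, tw = midrank(x), midrank(w)
        tz = midrank(np.concatenate([x, w]))
        aucs.append((tz[:m].sum() - m * (m + 1) / 2) / (m * n))
        v10.append((tz[:m] - tx) / n)        # structural components (positives)
        v01.append(1.0 - (tz[m:] - tw) / m)  # structural components (negatives)
    s10, s01 = np.cov(v10), np.cov(v01)
    var = (s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m \
        + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n
    z = (aucs[0] - aucs[1]) / np.sqrt(var)
    return aucs[0], aucs[1], 2 * norm.sf(abs(z))
```

The paired formulation is essential here: the competing scores are evaluated on the same variants, so the covariance term between the two AUC estimates must be subtracted.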

2. Figure 2a in the main text shows that gwRVIS has limited power to identify constrained regions when the number of variants is small. It may be explained by the heteroscedasticity in the data, as evident by the increase of residual variance with the total number of variants in Figure 2a. I suggest the authors fit a linear regression model with heteroscedasticity to the same data, and define a new RVIS score as the ratio of the residual and the estimated standard deviation from the heteroscedastic model. This modification will increase the variance of the RVIS score when the estimated standard deviation is small, and hence mitigate the undesired dependence of gwRVIS score on the total number of variants.
We have now considered an alternative version of gwRVIS that accounts for the heteroskedasticity observed in the original formulation, which is most pronounced for genomic windows with very low numbers of variants. We have added a supplementary figure that illustrates the heteroskedasticity effect on the original gwRVIS calculation and how this is accommodated by the weighted regression modeling (Suppl. Fig. 15).

We sought to explore if accounting for heteroskedasticity significantly changes the resulting set of gwRVIS scores. Thus, we additionally considered weighted ordinary least squares (WOLS) as a means to correct for the lower gwRVIS variance in windows with fewer variants overall. No substantial difference can be observed between the two versions of gwRVIS, as Pearson's correlation between them is r=0.96 (p<1x10-308). Additionally, residuals from windows with more variants are considerably shrunk compared to those from windows with very few variants (Suppl. Fig. 15d), potentially amplifying signals from noisier windows (those with fewer variants) over windows of potentially higher confidence (those with more variants). Thus, we decided to retain the gwRVIS modeling with the non-weighted linear regression and added a section in Supplemental Methods describing this approach.
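The weighted-regression alternative we evaluated can be sketched as follows. This is a minimal NumPy illustration, assuming residual variance grows roughly linearly with the number of variants per window (so weights are set to 1/total variants); it is not our production code:

```python
import numpy as np


def wls_scaled_residuals(x, y, w):
    """Weighted least squares for y ~ 1 + x with weights w, under the model
    Var(e_i) = sigma^2 / w_i; returns each residual divided by its modeled
    standard deviation (the reviewer's suggested heteroskedastic score)."""
    x, y, w = (np.asarray(a, float) for a in (x, y, w))
    X = np.column_stack([np.ones_like(x), x])
    # Closed-form WLS fit: beta = (X' W X)^{-1} X' W y
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    resid = y - X @ beta
    # sigma^2 estimated from weighted residual sum of squares
    sigma2 = (w * resid**2).sum() / (len(x) - X.shape[1])
    sd = np.sqrt(sigma2 / w)  # modeled per-window residual sd
    return resid / sd
```

In the gwRVIS setting, `x` would be the total number of variants per window, `y` the number of common variants, and `w = 1/x` under the linear-variance assumption above; the scaled residuals then have approximately unit variance regardless of window variant count.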
3. The authors used non-coding pathogenic variants from ClinVar and putatively benign variants from denovo-db as positive and negative data, respectively. Because most ClinVar variants are proximal to protein-coding genes, a machine learning method could "cheat" in the prediction of pathogenic variants by using features correlated with distances to protein-coding genes. While the authors didn't use distances to protein-coding genes as a feature in their models, I expect that gwRVIS scores will, nevertheless, be correlated with the distances due to its large window-size (3kb). To fully rule out the impact of the distances to variant prediction, it is better to compare different methods on a set of positive and negative data with matched distances to protein-coding genes. This strategy was originally used in the GWAVA paper (PMID: 24487584).
We have prepared an alternative version of the training set (referred to as the matched training set) by selecting control variants from denovo-db with a distribution of TSS distances to closest genes similar to that of the pathogenic variants employed in the JARVIS training set (Suppl. Fig. 17). On this matched training set, JARVIS performs significantly better than all other scores (AUC=0.800; Suppl. Fig. 18).
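The kind of distance-matched control selection used for the matched training set can be sketched as a greedy nearest-neighbour matching. The function below is purely illustrative (a hypothetical helper, not our exact procedure):

```python
import numpy as np


def match_controls_by_tss(case_dist, ctrl_dist):
    """For each pathogenic variant's TSS distance, greedily select the
    not-yet-used control variant with the closest TSS distance, so the two
    sets end up with matched distance distributions.

    Returns indices into ctrl_dist, one per case variant.
    """
    ctrl = np.asarray(ctrl_dist, float)
    used = np.zeros(len(ctrl), bool)
    chosen = []
    for d in np.asarray(case_dist, float):
        # mask out already-selected controls, then take the nearest remaining one
        gaps = np.where(used, np.inf, np.abs(ctrl - d))
        j = int(np.argmin(gaps))
        used[j] = True
        chosen.append(j)
    return np.array(chosen)
```

Matching on (log-)TSS distance in this way removes distance-to-gene as a confounder, so any remaining separation between classes must come from the scores themselves.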

4. In the prioritization of pathogenic structural variants (SVs), the authors considered SVs in intergenic regions as benign and SVs in other regions as pathogenic. Thus, the author's model should be interpreted as a tool to discriminate between genic and intergenic SVs, not a tool for unbiased prioritization of pathogenic SVs. The authors should revise this part of their manuscript accordingly.

We re-phrased the respective section in the main text (lines 445-450) as well as the legend in Fig. 6 to reflect that this benchmarking refers to classification of genic vs non-genic SV effects, with the former class expected to be enriched for clinically relevant SVs.

5. I am wondering what convolutional neural networks (CNNs) learned from raw sequences in this paper. If the CNNs learned meaningful patterns for variant prioritization, their convolutional filters should correspond to motifs for transcription factor binding. The authors can use existing methods (e.g., PMID: 26213851 & 27896980) to visualize their convolutional filters and check if they correspond to any known motifs.
We have performed motif analysis on the sequences learnt by the JARVIS CNN module, based on a similar approach followed in another multi-module deep learning network (Angermueller et al., 2017; PMID: 28395661). Specifically, in order to capture the learnt sequence patterns, we look into the most highly activated filters that comprise the feature maps at the output of the first convolutional layer. We selected the most activated k-mers (where k=11, equal to the selected filter size) in the training set of pathogenic sequences, i.e. those with an activation sum of at least 0.9 times the maximum activation sum. We then performed multiple sequence alignment between the most-activated sequences with cd-hit, and identified the most predominant clusters of 11-mers learnt by the CNN's first layer. The identified clusters were then fed to the TomTom web-server against the eukaryote, JASPAR-vertebrate and uniprobe-mouse databases of known motifs (Suppl. File 2). We observed that, despite the fairly small size of the training set, the JARVIS CNN module has managed to learn dozens of sequence patterns that align with known vertebrate, human or mouse motifs, many of which are enriched for long cytosine stretches (Suppl. Fig. 22, 23, 24 & Suppl. File 2). We describe this analysis and its results in the main text (lines 359-367) and in Methods (lines 672-683), and provide the sequence logos and/or consensus sequences of the top learnt motifs in Suppl. Fig. 22, 23 & 24.

6. The authors should also compare their methods with two other tools, EIGEN (PMID: 26727659) and DeepSEA (PMID: 26301843), which I found highly competitive in predicting non-coding pathogenic variants.

We have now added Eigen-PC to our benchmarking (Ionita-Laza et al., 2016). In certain benchmarks, Eigen-PC did not have a sufficient number of annotated nucleotide positions and has therefore been omitted from those specific plots (e.g. the training set matched by TSS distance, and the GWAS and Mendelian validation sets). We found that Eigen-PC performed consistently lower than JARVIS in all applicable tests. DeepSEA was also added across all benchmarks for single-nucleotide variants and performed either significantly lower than JARVIS or, in some cases, comparably with no significant difference based on DeLong's test. We could not install the stand-alone version of DeepSEA, and its web-server accepts inputs of up to 50,000 data points only, so we were not able to extract DeepSEA scores for the structural variants; this score is therefore not included in the SV benchmarking.

7. It is unclear to me if the authors used any regularization techniques, such as early stopping, weight decay, etc., in the training of deep neural networks. If not, these models may suffer from severe overfitting.

We have used dropout as our regularization technique across all modules of the deep neural network. Specifically, a dropout layer was introduced after every max-pooling and/or fully-connected layer to avoid overfitting on the training set (dropout ratio: 0.2). These parameters can be found in the 'feedf_dnn()' and 'cnn2_fc2()' functions of the respective module in the GitHub repository (https://github.com/astrazeneca-cgr-publications/JARVIS/blob/master/modules/jarvis/deep_learn_raw_seq/func_api_nn_models.py). We have also added a section in Methods that describes this regularization approach (lines 641-643).
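For illustration, the effect of such a dropout layer can be sketched in NumPy as inverted dropout (the training-time behaviour of a standard Keras Dropout layer). This is a stand-in for exposition, not our model code:

```python
import numpy as np


def dropout(a, rate=0.2, rng=None, training=True):
    """Inverted dropout: zero a fraction `rate` of activations and rescale
    the survivors by 1/(1-rate), so the expected activation is unchanged.
    At inference time (training=False) the input passes through untouched."""
    if not training or rate == 0.0:
        return a
    rng = rng or np.random.default_rng()
    keep = rng.random(a.shape) >= rate  # Bernoulli keep-mask, P(keep)=1-rate
    return a * keep / (1.0 - rate)
```

Because the rescaling keeps the expected activation constant, no weight adjustment is needed when dropout is switched off at prediction time.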
Major comments: -ClinVar pathogenic variants correspond to rather small numbers and are highly clustered. While the small numbers are an issue for training complex models (like DNNs), the clustering is an issue for assessing model performance. The authors cannot use a simple cross-validation scheme under these circumstances. They need to make sure that all variants from the same genomic region are in the same fold. Otherwise, their results will be overly optimistic. While this is critical, it does not seem that the authors have done that.
We have now added an alternative version of the cross-validation training set, which uses stratification by chromosome during the fold splits. This ensures that variants from the same genomic region cannot be part of both the training and test sets at any cross-validation step, removing the data-circularity bias highlighted by the Reviewer. JARVIS performance dropped marginally (AUC=0.929; Suppl. Fig. 18) but remained significantly higher than the other scores (Suppl. Fig. 19c & Suppl. File 1), except for DeepSEA and CADD, which performed comparably (AUC=0.922 and 0.913; DeLong test p=0.065 and 0.16, respectively), and ncER and LINSIGHT, which performed significantly better (AUC=0.980 and 0.969, respectively). We have added a section in the main text describing this analysis and the respective results (lines 329-337).
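The chromosome-level fold assignment described above can be sketched as a simple greedy balancing scheme (illustrative only; our exact implementation may differ):

```python
import numpy as np


def chromosome_folds(chroms, n_folds=5):
    """Assign whole chromosomes to cross-validation folds, so variants from
    the same chromosome never appear in both the training and test split.

    chroms: per-variant chromosome labels. Returns a per-variant fold index.
    """
    chroms = np.asarray(chroms)
    uniq, counts = np.unique(chroms, return_counts=True)
    fold_of, fold_sizes = {}, [0] * n_folds
    # place the largest chromosomes first, each into the currently smallest fold
    for c in uniq[np.argsort(-counts)]:
        k = int(np.argmin(fold_sizes))
        fold_of[c] = k
        fold_sizes[k] += int(counts[uniq == c][0])
    return np.array([fold_of[c] for c in chroms])
```

Grouping by chromosome is a conservative proxy for "same genomic region": clustered ClinVar variants on one chromosome can then never leak between training and test folds.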

-You are independently optimizing MAF and length parameters. I would expect an interaction as larger windows contain more variants and increase the likelihood of observing more rare alleles. This would require something like a grid-search. Given the rather flat distributions, this effect might be small and I understand if the authors do not want to change the current implementation. However, the authors should mention this limitation and provide the MAF value that they used for optimizing the window size (lines 527-538).
We thank the Reviewer for this remark. In order to reduce the parameter search space, we have indeed performed a two-step optimization approach instead of a full-scale grid search. We initially decided on a MAF value of 0.1% as a nominal threshold for common-variant annotation. This threshold is consistent with our original implementation of RVIS (Petrovski et al., 2013) and has also been selected in cohort studies such as Cirulli et al., 2020, as a useful threshold for defining rare variants (equivalent to being observed once among 1,000 individuals). Using this nominal MAF value, we first optimised the window length parameter and then separately optimised the MAF value, which again converged on 0.1%. We have added a section in Methods describing this analysis and the respective results (lines 576-580).
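For completeness, the full joint search suggested by the Reviewer could be sketched as below, with `score_fn` a hypothetical tuning objective (e.g. AUC on a held-out tuning set); the two-step procedure we used evaluates only one axis of this grid at a time:

```python
import itertools


def joint_grid_search(score_fn, window_sizes, maf_cutoffs):
    """Exhaustively evaluate every (window size, MAF cutoff) pair and return
    the best one, capturing any interaction between the two parameters."""
    return max(itertools.product(window_sizes, maf_cutoffs),
               key=lambda pair: score_fn(*pair))
```

The cost is the product of the two grid sizes, which is why a two-step search is attractive when the objective surface is flat along each axis.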

-DNN architecture: You are using very little data to train the DNNs while having several fully connected layers. Have you considered things like drop-out? How did you identify the architectures and how did you optimize hyperparameters like learning rates and number of iterations? If derived from other literature, provide references and values.
As per our previous response to Reviewer 1, we have used dropout as our regularisation technique across all modules of the deep neural network. Specifically, a dropout layer was introduced after every max-pooling and/or fully-connected layer to avoid overfitting on the training set (dropout ratio: 0.2). These parameters can be found in the 'feedf_dnn()' and 'cnn2_fc2()' functions of the respective module in the GitHub repository (https://github.com/astrazeneca-cgr-publications/JARVIS/blob/master/modules/jarvis/deep_learn_raw_seq/func_api_nn_models.py).
The JARVIS deep learning parameter selection was based on a previously published method that fine-tuned parameters for inputs of similar size (Angermueller et al., 2017; PMID: 28395661). Specifically, the derived parameters employed by JARVIS include the depth of the network (2 layers), the filter sizes (k=11 and 3 in the two layers), the learning rate (lr=0.0001) with an Adam optimizer, and early stopping after 10 epochs without improvement of the validation loss. We have added a section in Methods explaining the selection of parameters used by the JARVIS neural network (lines 637-643).
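The early-stopping rule described above (patience of 10 epochs on the validation loss) amounts to the following minimal sketch, written here as a small stand-alone class rather than our actual training code:

```python
class EarlyStopping:
    """Signal that training should stop once the validation loss has not
    improved for `patience` consecutive epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        """Call once per epoch with the validation loss; returns True to stop."""
        if val_loss < self.best - self.min_delta:
            self.best, self.wait = val_loss, 0
            return False
        self.wait += 1
        return self.wait >= self.patience
```

This is the same logic as a Keras EarlyStopping callback monitoring `val_loss`; it acts as an implicit regulariser by halting optimisation before the network memorises the small training set.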
-When using the CNN, do you see better base pair level resolution of your scores? What is the resolution? Are your scores useful for fine mapping applications or even pinpointing TF binding sites (probably not)? Do you have means of checking that (e.g. publicly available saturation mutagenesis MPRA data)?
The JARVIS model that was trained exclusively with CNNs on raw genomic sequences performed lower than the model employing a feed-forward neural network applied to structured data, such as gwRVIS and functional genomic annotations (Suppl. Fig. 8). The combination of the two has also proven to boost predictive performance in the majority of validation sets, which reflects that the CNN module has indeed learnt additional patterns from the raw genomic sequences.
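One way to inspect what a first convolutional layer has learnt is to read out the k-mers that most strongly activate a filter. The NumPy sketch below uses an arbitrary hand-made filter on a toy sequence, as a stand-in for reading activations out of the trained model:

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq):
    """One-hot encode a DNA string to shape (L, 4)."""
    return np.eye(4)[[BASES.index(b) for b in seq]]


def top_activated_kmers(seq, filt, top_frac=0.9):
    """Slide a convolutional filter of shape (k, 4) over a one-hot sequence
    (valid cross-correlation, as in a Conv1D layer) and return the k-mers
    whose activation is at least `top_frac` of the maximum activation."""
    x = one_hot(seq)
    k = filt.shape[0]
    acts = np.array([(x[i:i + k] * filt).sum()
                     for i in range(len(seq) - k + 1)])
    thr = top_frac * acts.max()
    return [seq[i:i + k] for i in np.flatnonzero(acts >= thr)]
```

In the real analysis the filter weights would come from the trained first layer and the surviving k-mers would be clustered and compared against motif databases, as described in the next response.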
In order to examine if JARVIS' CNN module learnt any biologically meaningful motifs, we performed motif analysis based on a similar approach followed in another multi-module deep learning network (Angermueller et al., 2017; PMID: 28395661). Specifically, in order to capture the learnt sequence patterns, we look into the most highly activated filters that comprise the feature maps at the output of the first convolutional layer. We selected the top 90th percentile of most-activated k-mers (where k=11, equal to the selected filter size) in the training set of pathogenic sequences. We then performed multiple sequence alignment between the most-activated sequences with cd-hit, and identified the most predominant clusters of 11-mers learnt by the CNN's first layer. The identified clusters were then fed to the TomTom web-server against the eukaryote, JASPAR-vertebrate and uniprobe-mouse databases of known motifs (Suppl. File 2). We observed that, despite the fairly small size of the training set, the JARVIS CNN module has managed to learn dozens of sequence patterns that align with known vertebrate, human or mouse motifs, many of which are enriched for long cytosine stretches (Suppl. Fig. 22, 23, 24 & Suppl. File 2). We describe this analysis and its results in the main text (lines 359-367) and in Methods (lines 672-683), and provide the sequence logos and/or consensus sequences of the top learnt motifs in Suppl. Fig. 22, 23 & 24.

Minor comments:

-Line 75: You are submitting to a journal with a broad readership; consider introducing "studentized residual" at first mention as residual divided by an estimate of its variance. Please note that you are using both "studentized" and "studentised" in your manuscript.

We have now added an expanded description of the studentized residual term the first time we introduce it. We have also adopted the "studentized" spelling throughout.

-Line 88: Discuss known effects of GC content
We have added a brief description of known GC effects in main text (lines 89-91).
-Lines 91-92/474ff and Fig 1: Provide numbers/proportion of genome left.

We have added a reference to the total genomic region of higher confidence (~75% of the genome) that resulted after all filtering steps (lines 532-534).

-Line 124: UCNE abbreviation used before it is introduced below
We have added the full name of UCNE at the first mention.
-Line 298: Other than ascertainment bias (with respect to conservation in ClinVar), there is no "overfitting" for LINSIGHT, CADD or DANN. I think the problem is the use of the word "overfitting" by the authors (see below)

We have re-phrased accordingly in the main text to reflect the data leakage and/or ascertainment bias problem in each case (lines 302-304, 313-315 and 328).

-Lines: 482: Provide versions for TOPMed and gnomAD releases
We have now added versions of each resource (line 528).

-Lines 555-556: Provide version numbers for all tools (where applicable)
We have added the versions for the libraries used to fit the models (lines 644-645).

-Line 588: Provide numbers as: X out of Y
We added a reference to a supplementary figure (Suppl. Fig. 12).