Predicting congenital renal tract malformation genes using machine learning

Congenital renal tract malformations (RTMs) are the major cause of severe kidney failure in children. Studies to date have identified defined genetic causes for only a minority of human RTMs. While some RTMs may be caused by poorly defined environmental perturbations affecting organogenesis, it is likely that numerous causative genetic variants have yet to be identified. Unfortunately, the speed of discovering further genetic causes for RTMs is limited by challenges in prioritising candidate genes harbouring sequence variants. Here, we exploited the computer-based artificial intelligence methodology of supervised machine learning to identify genes with a high probability of being involved in renal development. These genes, when mutated, are promising candidates for causing RTMs. With this methodology, the machine learning classifier determines which attributes are common to renal development genes and identifies genes possessing these attributes. Here we report the validation of an RTM gene classifier and provide predictions of the RTM association status for all protein-coding genes in the mouse genome. Overall, our predictions, whilst not definitive, can inform the prioritisation of genes when evaluating patient sequence data for genetic diagnosis. This knowledge of renal developmental genes will accelerate the processes of reaching a genetic diagnosis for patients born with RTMs.

www.nature.com/scientificreports/
A machine learning classifier is then generated by using the properties of the two training sets in an optimal way to separate the groups. Once trained, the classifier can be applied to predict the correct group for a new example.
We have previously used machine learning to identify proteins that constitute drug targets [11][12][13] and to identify genes essential for mammalian embryonic development 14. Others have implemented machine learning to identify genes that drive kidney clear cell cancer 15 and to assign roles of genetic variants to kidney excretory function 16.
Here we exploit supervised machine learning to identify genes with a high probability of being involved in renal development. These genes, when mutated, would therefore be promising candidates for causing RTMs. Due to the limited knowledge of genetic causes of human RTMs, we developed a positive training set of genes known to cause RTMs when mutated in the mouse, and a second training set of genes known not to cause disruptions to renal tract development. We utilised the mouse as a model organism because it is heavily studied, and mouse knockout experiments have proved useful in revealing biological functions of many human genes [17][18][19]. By applying supervised machine learning to the features of the genes in these two training sets, the classifier determines which feature values are common to renal developmental genes, and then identifies genes possessing these attributes from a novel dataset. Here we report the kidney development association status for all genes in the mouse genome as predicted by our classifier. Due to developmental similarities and genetic conservation between mouse and human, the genes we predict to have a role in mouse RTM development will comprise a dataset worthy of further investigation for human genetic diagnosis. Overall, our predictions can inform the prioritisation of candidate genes and accelerate the processes of reaching a genetic diagnosis for individuals affected by RTMs.

Results
Datasets. We first compiled a dataset of genes that are known to cause RTMs when mutated in the mouse, and a dataset of genes that are known not to cause RTMs (non-RTM) (Fig. 1), using data from the Mouse Genome Informatics (MGI) 20 database and data from the IMPC consortium 21. This gave 310 mouse genes that are associated with RTMs when mutated (hereafter called 'RTM genes') and 4752 genes known not to cause documented RTM developmental defects ('non-RTM genes'), based on phenotype annotations of null alleles of targeted single-gene knockouts. RTM genes were also manually verified for their roles in human RT development based on the literature and RTM disease associations 22. Human CAKUT-causing genes 23 are included in our training set. In order to investigate features specific to protein function, we restricted our datasets to protein-coding genes only. As a result, we obtained a total of 174 RTM and 4141 non-RTM mouse genes (Tables S1 and S2).
Figure 1. The workflow for predicting mouse RTM genes, integrating genomic and protein features using a Random Forest classification model. First, features of mouse genes are collated from public databases. Statistical analyses and feature selection were then performed to identify the most informative features differentiating between known RTM and non-RTM genes. A Random Forest classifier was built to predict RTM and non-RTM genes from these features. Finally, this classifier was used to predict RTM association status for all protein-coding genes in the mouse genome not included in the classifier development.

Properties of RTM and non-RTM genes.
We collected data for a wide range of genomic and proteomic features of mouse protein-coding genes 24, including gene and protein length, gene expression, subcellular localisation, and known interaction partners. A total of 106 features were compared between mouse genes linked to RT development and genes not associated with RTMs, to reveal properties linked to RT development. Many features were found to differ significantly in their distributions between the RTM and non-RTM datasets (Table 1).
We found that RTM genes tend to be longer than non-RTM genes (Table 1, Fig. 2a). Additionally, RTM genes tend to have both longer exons and longer introns than non-RTM genes (Table 1, Fig. 2b,c). A greater proportion of RTM genes are expressed at the organogenesis stage of mouse development when compared with non-RTM genes (74.7% vs. 57.8%; Chi-squared P-value: 4.1 × 10⁻³). RTM genes were also highly expressed in eight-week fibroblast and post-juvenile RT tissues (Table 1).
Gene Ontology (GO) 27 is one of the most widely used approaches for annotating gene functions. We found differences in the GO term annotations for the biological process and cellular component classes between RTM and non-RTM gene groups. For biological processes, GO terms enriched in the RTM dataset include 'kidney development', 'ureteric bud morphogenesis', 'ureteric bud development', 'metanephros development', and 'mesonephros development'. Terms enriched in the non-RTM dataset include 'inflammatory response', 'immune system process', 'apoptotic process', and 'ion transport'. For cellular component, terms most frequently associated with RTM genes include 'extracellular region', 'basement membrane', 'cell surface' and 'extracellular matrix'. The non-RTM dataset was enriched for terms including 'glutamatergic synapse', 'membrane', 'cytoplasm', 'plasma membrane', and 'cytosol'. The 20 most enriched GO terms for each class are listed in Tables S3-S6.
Table 1 (row fragment). Gene expression across post-juvenile RTM tissues (male) (FPKM): 6.56, 2.12, 1.0 × 10⁻⁵.
Known protein-protein interaction (PPI) data for mouse proteins were also analysed. This PPI network contains all known literature-curated interactions of mouse proteins from BioGrid 28, BIND 29, Chen PiwiScreen 30, IntAct 31, INNATEDB 32, MGI, DIP 33, MINT 34 and also from a recent study 35. We found three statistically significant properties in the PPI network: the betweenness centrality and bottleneck scores of RTM proteins in the interaction network are significantly higher than those of non-RTM proteins (P-value = 2.8 × 10⁻⁴ and P-value = 4.8 × 10⁻², respectively). In contrast, the eigenvector score, which measures the centrality of a protein in the interaction network, is significantly higher for non-RTM proteins than for RTM proteins (P-value = 7.2 × 10⁻³).
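To make the betweenness centrality measure concrete, the following is a minimal sketch of Brandes' algorithm on an unweighted, undirected toy network. It is an illustration only: the study computed network properties with Cytoscape and Hubba, and the protein names (P1 to P5) and edges below are hypothetical.

```python
from collections import deque

def betweenness(adj):
    """Betweenness centrality for an unweighted, undirected graph
    given as {node: [neighbours]} (Brandes' algorithm)."""
    scores = {v: 0.0 for v in adj}
    for source in adj:
        order = []                                  # nodes in order of discovery
        preds = {v: [] for v in adj}                # predecessors on shortest paths
        n_paths = {v: 0 for v in adj}               # number of shortest paths from source
        dist = {v: -1 for v in adj}
        n_paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:                                # breadth-first search
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        dependency = {v: 0.0 for v in adj}
        for w in reversed(order):                   # accumulate from the farthest nodes back
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                scores[w] += dependency[w]
    # Each unordered pair of endpoints was counted twice in an undirected graph.
    return {v: s / 2 for v, s in scores.items()}

# Hypothetical toy interaction network (not data from the study).
toy_ppi = {
    "P1": ["P2"],
    "P2": ["P1", "P3", "P4"],
    "P3": ["P2"],
    "P4": ["P2", "P5"],
    "P5": ["P4"],
}
scores = betweenness(toy_ppi)
print(scores)
```

In this toy network P2 lies on the shortest path between five pairs of other proteins and P4 on three, so both score higher than the peripheral proteins, which score zero.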
Training and test datasets. Numerous features are significantly different between RTM and non-RTM genes. We therefore sought to develop a machine learning classifier that could categorise a mouse gene as RTM or non-RTM from its features (Fig. 1). We used 106 features as input to generate training datasets for classification. Our original dataset containing 174 RTM and 4141 non-RTM mouse genes had a severely imbalanced class frequency ratio (1:23.8). Imbalanced training datasets pose problems for machine learning strategies 36,37; therefore, class distribution was balanced by oversampling the genes of the RTM (minority) class when training the classifiers. We generated balanced training datasets having 522 genes each from the RTM and non-RTM datasets. The 522 non-RTM genes were randomly selected from the 4141 non-RTM genes. The RTM dataset, which had 174 genes, was increased by an additional 348 genes, synthesized from the existing RTM genes. We applied the Synthetic Minority Oversampling Technique (SMOTE) 38 to generate these synthetic RTM genes. These genes were close in feature space to the existing RTM genes. We then trained our classifier with this class-balanced dataset.
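The arithmetic behind the balanced training set can be checked with a short sketch. Only the three counts are taken from the text; the gene identifiers are placeholders and the random seed is an arbitrary assumption.

```python
import random

N_RTM = 174        # protein-coding RTM genes (from the text)
N_NON_RTM = 4141   # protein-coding non-RTM genes (from the text)
TARGET = 522       # genes per class in each balanced training set (from the text)

# Class imbalance before balancing: non-RTM genes per RTM gene.
imbalance = N_NON_RTM / N_RTM

# Synthetic RTM genes needed to reach the target class size.
n_synthetic = TARGET - N_RTM

# Random subsample of the majority class, using placeholder identifiers.
rng = random.Random(0)  # arbitrary seed so the sketch is reproducible
non_rtm_ids = [f"nonRTM_{i}" for i in range(N_NON_RTM)]
train_non_rtm = rng.sample(non_rtm_ids, TARGET)
held_out_non_rtm = sorted(set(non_rtm_ids) - set(train_non_rtm))

print(f"imbalance 1:{imbalance:.1f}")
print(n_synthetic, len(train_non_rtm), len(held_out_non_rtm))
```

The 3619 non-RTM genes left out of training match the size of the Test 1 dataset, and 348 synthetic genes correspond to two synthetic instances per known RTM gene.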
To evaluate the performance of our machine learning classifier, we assembled test datasets with genes that were not included in the training datasets (Fig. 1). Test 1 dataset (Table S7) contains 3619 genes from our original non-RTM dataset that were not used in classifier training. Test 2 dataset (Table S8) includes 27 mouse genes that are orthologues of those in the critical region involved in DiGeorge Syndrome. This chromosomal disorder occurs due to the deletion of a number of genes on chromosome 22q11.2, and the functions of many of these genes are still unknown. We utilised 22q11.2 deletion region genes as a test dataset because approximately 30% of DiGeorge patients have congenital kidney and/or urinary tract anomalies 39,40. Test 3 dataset (Table S9) includes 31 mouse orthologues of human genes from the non-syndromic vesicoureteric reflux (VUR) candidate region on human chromosome 10q26 41; this region showed strong association with ureter malformation. Test 4 dataset (Table S10) comprises a total of 13,379 mouse protein-coding genes that have no experimental annotations for renal anomalies. The MouseMine 42 database was used to retrieve these genes. Gene and protein features were then collected for the test dataset genes following the same procedure used for the training genes.
Performance of the machine learning classifier. We performed feature selection prior to the training procedure. Feature selection is a useful tool for developing a classifier from a dataset with many features. It selects the most useful features from the training dataset and helps the classifier to learn a more efficient way to make predictions. Here, the Information Gain feature selection method in Weka was used to identify the most important mouse gene features for classification from the training dataset. This method found a subset of 71 informative features amongst the 106 total features (Tables 2 and S11). Most of these selected features were found to be statistically different in values between the RTM and non-RTM genes in this study, confirming their value as discriminators between the training sets.
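As an illustration of what information gain measures, the sketch below ranks two discrete features by how much knowing their value reduces the entropy of the class label. It is not the Weka implementation used in the study: the feature names and values are hypothetical, and numeric features would need discretising first, a step omitted here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy that remains
    after grouping the genes by the value of a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: six genes, two binary features.
labels = ["RTM", "RTM", "RTM", "non-RTM", "non-RTM", "non-RTM"]
features = {
    "expressed_at_organogenesis": [1, 1, 1, 0, 0, 1],
    "has_signal_peptide": [1, 0, 1, 0, 1, 0],
}
gains = {name: information_gain(values, labels) for name, values in features.items()}
ranking = sorted(gains, key=gains.get, reverse=True)
print(ranking, gains)
```

A feature whose values separate the two classes well receives a gain close to the full label entropy, while a feature distributed similarly in both classes receives a gain near zero and would be a candidate for removal.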
To construct our machine learning classifier we used the Random Forest 43 implementation in Weka 44. Random Forest is an ensemble classifier comprising multiple decision tree models, and it has been found to be a highly accurate machine learning method in numerous studies 14,[45][46][47]. We employed tenfold cross-validation to increase the robustness of our classifier and mitigate the potential for classifier overfitting 48. A classifier overfits if its prediction accuracy is higher on the training dataset than on the validation/test dataset. We observed that the cross-validation accuracy of our Random Forest classifier built on 70 selected features is 85.3% (891/1044) with 424 true-positives (TPs) (RTM genes correctly identified as RTM), 98 false-negatives (FNs) (RTM genes identified as non-RTM), 472 true-negatives (TNs) (non-RTM genes identified as non-RTM) and 50 false-positives (FPs) (non-RTM genes identified as RTM). Table 3 demonstrates the robust performance of this classifier by means of several performance metrics. We also compared the performance of this Random Forest classifier with the J48 decision tree 49, Gradient Boosted Tree (XGBoost) 50,51 and Support Vector Machine (SVM) 52 models. The J48 classifier was developed in Weka, and the XGBoost and SVM classifiers were implemented in R with default parameter settings using the tenfold cross-validation method. Table 3 shows the superiority of the Random Forest classifier in predicting RTM genes among all classifiers.
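Common performance metrics can be derived directly from the four confusion-matrix counts quoted above, as in the following sketch. The choice of metrics is an assumption and may not match those reported in Table 3. Note that the four counts as printed give 896 correct predictions out of 1044 (85.8%), slightly above the 891/1044 quoted in the text; the sketch uses the counts as printed.

```python
import math

# Confusion-matrix counts from the tenfold cross-validation, as printed in the text.
tp, fn, tn, fp = 424, 98, 472, 50

total = tp + fn + tn + fp
accuracy = (tp + tn) / total              # fraction of all genes classified correctly
sensitivity = tp / (tp + fn)              # fraction of RTM genes recovered (recall)
specificity = tn / (tn + fp)              # fraction of non-RTM genes recovered
precision = tp / (tp + fp)                # fraction of RTM predictions that are correct
f1 = 2 * precision * sensitivity / (precision + sensitivity)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)                                         # Matthews correlation coefficient

for name, value in [
    ("accuracy", accuracy),
    ("sensitivity", sensitivity),
    ("specificity", specificity),
    ("precision", precision),
    ("F1", f1),
    ("MCC", mcc),
]:
    print(f"{name}: {value:.3f}")
```

On these counts the classifier is more specific than sensitive: it misclassifies 98 of 522 RTM genes but only 50 of 522 non-RTM genes.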
Our classifier showed an accuracy of 84.3% on the Test 1 dataset, which only contains non-RTM genes. We further used this classifier to identify the status of mouse genes in the Test 2 dataset, each of which could be an orthologue of a possible candidate for causing the renal defects associated with DiGeorge Syndrome. Among all genes in this dataset, DiGeorge critical region 14 (Dgcr14), zinc finger DHHC-type palmitoyltransferase 8 (Zdhhc8), CRK like proto-oncogene adaptor protein (Crkl), guanine nucleotide-binding subunit beta-like protein (Gnb1l), KLF transcription factor 8 (Klf8), and DiGeorge critical region 8 (Dgcr2) were predicted as RTM genes. The remaining genes were identified as non-RTM. Moreover, this classifier identified Transforming acidic coiled-coil containing protein 2 (Tacc2) and carboxypeptidase X, M14 family member 2 (Cpxm2) as RTM genes from the Test 3 dataset, which contains mouse orthologues of the VUR candidate region on human chromosome 10q26.
To test whether the Random Forest classifier suffers from overfitting, we generated 9 more balanced training datasets containing different subsets of non-RTM genes. Nine different Random Forest classifiers were trained on these datasets (Table S12). We found that the mean accuracy of these classifiers is 85.9% with a standard deviation (SD) of 0.7%. This low SD indicates that the prediction performances of all these classifiers are very similar. This result indicates that our classifier is not biased by the choice of genes in the training dataset: if the subset of genes chosen for the training dataset strongly affected classifier accuracy, a high SD would have been detected when multiple classifiers were compared.
Prediction of all genes in the mouse genome. We created a fourth test dataset (Test 4) that contains all those protein-coding genes in the mouse genome that were not included in the RTM and non-RTM datasets. From this test dataset, our classifier predicted 19% (2534/13,379) of genes as RTM genes, and the remaining 81% (10,845/13,379) as non-RTM genes. We generated a ranked list of these RTM genes with their likelihood of being associated with RT development (Table S13). The top 10 predicted RTM genes are listed in Table 4. Three genes from our most highly confident predictions, Scube3 53, Sema3c 54,55, and Rspo3 56,57, have been independently experimentally validated as causing renal developmental defects.

RTM gene database.
To provide data on predicted RTM genes, a publicly available database named MoR-TalGene (http://130.88.96.183/) has been created. This database shows the RT/non-RT status for all mouse genes, either from published literature or from our predictions. The confidence scores stating the predicted probabilities of the genes to be RTM can also be obtained from this database. A known or predicted mouse gene can be searched by multiple identifiers such as gene name, MGI ID, Ensembl ID and UniProt ID. Lists of all RTM and non-RTM genes (both known and predicted) within the mouse genome or within a particular chromosome and/or genomic region can also be retrieved from the database. All search results can be downloaded as CSV files.

Discussion
We aimed to facilitate the identification of RTM candidate genes by identifying genes in the mammalian genome that are associated with RT development. This knowledge may accelerate the process of achieving a genetic diagnosis for patients with congenital RTMs, because genes associated with renal tract development are likely to cause congenital RTMs when deleterious variants are present. Using supervised machine learning, we generated a Random Forest classifier that achieved 85% accuracy in tenfold cross-validation trials after feature selection. Additionally, the classifier was 84% accurate when predicting the RTM association status of genes within our Test 1 dataset, which included all known non-RTM genes not used for training the classifier. We examined two genomic regions associated with RTMs, the 22q11.2 deletion region (DiGeorge Syndrome Critical Region) and the non-syndromic vesicoureteric reflux (VUR) candidate region on human chromosome 10q26. Here we present several candidate genes that can be examined through future experimental analysis as likely causative genes for the RTMs associated with these loci. Our database of RTM association status for all protein-coding genes will be of value to researchers and clinicians investigating genetic causes of RTMs.
Our study identified properties of genes required for RT development. Although some of the properties are not surprising, such as high expression levels in the developing RT, others are more difficult to interpret, such as amino acid content. Some of the features more highly represented in RTM genes have also been found to be associated with genes required during mammalian development, such as longer sequence length, high betweenness centrality in the PPI network, high PPI network bottleneck score, and nuclear localisation, and therefore their inclusion in the RTM gene class is likely reflective of a developmental function for RTM genes.
Our genome predictions indicate that approximately 19% of protein-coding genes in the mouse genome may have a role in RT development, while 81% are not predicted to have such a role. These proportions are dissimilar to those of our initial training datasets compiled from the published literature, where we found that 174 protein-coding genes have been shown to be involved in RT development, compared to 4141 genes that have been shown not to cause detectable RT phenotypes when mutated. However, it should be noted that some of the non-RTM genes may have had limited RTM characterisation, and therefore may in future, with additional phenotyping, be found to be RTM genes. Our classifier is not simply recapitulating the input proportions. The higher proportion of genes predicted to have a role in RT development, as compared to those known to have a role in RT development from experimental investigation, indicates that RTM genes have been under-sampled in experimental studies. We therefore propose that further experimental analysis of the genes we predict as highly likely to be associated with RT development will reveal new gene functions and promising new models for congenital RTMs.
Table 4. Top 10 mouse RT genes predicted using our Random Forest classifier. The probability score (Confidence Score) output by our classifier (normalised in the 0-1 range) indicates the confidence level of a prediction result and gives the likelihood of a mouse gene in the test dataset being associated with RT development. The Confidence Score reports the fraction of decision trees in the Random Forest that predict the gene to be associated with RTMs. A score of 1 would reflect that 100% of decision trees classify that gene as RTM associated, corresponding to the strongest possible confidence in the prediction.
Our most highly confident predictions of RTM genes include several genes with links to renal disorders. For example, the Scube3 gene is expressed during kidney development 58. A Scube3 mutant mouse harbouring a missense variant, Scube3 N294K/N294K, has been identified from a mutagenesis screen 53. These mice display alterations in renal function, including increased electrolyte, total protein, albumin, and glucose excretion rates. It has also been recently reported that bi-allelic inactivating variants in SCUBE3 are associated with a skeletal and craniofacial developmental disorder linked to impaired BMP signalling 59. It is unclear if kidney function was evaluated in these patients. However, BMP4 mutations cause defects in kidney development 60, providing support for the hypothesis that altered SCUBE3 function can cause renal tract abnormalities due to the loss of BMP developmental signals. Additionally, SCUBE3 has been identified as a renal cell carcinoma 61 tumour suppressor gene. Erroneous hypermethylation of the promoter of SCUBE3 in renal cell carcinoma leads to a 45% reduction in the expression level of the gene as compared to control kidney cell expression levels. Tumour methylation of SCUBE3 was also associated with a significantly increased risk of death and cancer relapse. Together, these studies support our finding that Scube3 is a gene of relevance to RT development.
Bioinformatic analysis of renal cell carcinoma transcriptome datasets has revealed that PRSS23 displays significant differential expression between tumour and non-tumour datasets 62. Further support for a role for PRSS23 in kidney function comes from transcriptome studies of patients with focal segmental glomerulosclerosis (FSGS), which is a major cause of end stage renal disease. FSGS patients exhibit upregulation of PRSS23, as does the Cd2ap+/-, Fyn-/- mouse model of FSGS 63. It is hypothesised that PRSS23 may promote TGFB signalling and cause renal tissue damage 63. Whether interactions between PRSS23 and TGFB occur during kidney development remains a question for further investigation.
A role for Sema3c in kidney development has been noted in a mutant mouse model, whereby Sema3c mutants showed reduced ureteric bud branching 55. This mouse model incorporated the use of a GFP reporter, and therefore was not included in our training set genes, which are exclusively targeted deletion models 55. Furthermore, a recent study reports that the Sema3c gene is associated with the pathophysiology of acute kidney injury 54. Sema3c knockout mice display decreased renal tissue damage and leukocyte infiltration following acute kidney injury. Sema3c is expressed in the wild type developing mouse kidney, but this expression is no longer detectable in the adult 54,64. However, after surgically induced acute kidney injury, Sema3c expression is upregulated as compared to control uninjured kidneys. Analysis of kidney biopsies from patients with acute injury also confirms upregulation of SEMA3C, indicating conservation of its function. Secretion of Sema3c protein following injury was detected, leading to the hypothesis that damaged kidney tubules produce Sema3c, which causes further renal vascular damage and reduced blood flow.
In a study to identify genes driving early events in the formation of Wilms tumours, or nephroblastomas, the gene WNT5B was found to have upregulated expression in human Wilms tumour blastemal cells as compared to differentiated kidney glomerular cells 65. WNT5B protein expression was detected in human developing kidneys subsequent to renal vesicle formation, with expression in the nuclei of differentiated kidneys and in the cytoplasm in Wilms tumour tissue. Wilms tumours also often display an increase in copy number of WNT5B 66, suggesting this gene may be involved in tumour pathogenesis. These studies indicate that disruption of Wnt signalling, and in particular increased WNT5B expression, may disrupt nephrogenesis.
Tfpi encodes a secreted protease inhibitor produced by kidney myofibroblasts, which likely has a role in the pathology of autosomal dominant polycystic kidney disease 67. Myofibroblast depletion reduces kidney cyst growth and cyst epithelial cell proliferation in an autosomal dominant cystic kidney disease mouse model. It is hypothesised that the secretion of protease inhibitors, such as Tfpi, by myofibroblasts promotes the proliferation of cyst epithelium, leading to worsening renal function and advanced disease progression.
The R-spondin genes Rspo1 and Rspo3 are expressed in the developing mouse kidney from embryonic day (E) 10.5, in an overlapping pattern with Six2+ renal progenitors 57. By late gestation, Rspo3 is strongly expressed in the cortical stroma compartment and stroma cells lining ducts of the renal papilla. Kidney-specific deletion of Rspo3 results in a mild reduction of renal progenitor cells, whereas joint deletion of Rspo1 and Rspo3 results in severe renal hypoplasia. Further characterisation of Rspo3 in the mouse developing kidney stroma revealed a requirement for Rspo3 in the stromal compartment to maintain kidney progenitor cells in late gestation. Additionally, single cell transcriptomics studies have identified Rspo3 as a key marker of the kidney stromal compartment 56. Further investigation of these genes in human renal tract malformations and congenital disease is needed. The Rspo3 knockout experimental studies were performed after the compilation of our RTM training set, and thus the abnormal renal developmental phenotype of the mouse knockout model was not yet known when our computational study was initiated.
Another gene within our top 20 most confident predictions of genes associated with RT development is the gene Slit3. At the start of our study this gene was not annotated in MGI as being associated with RTMs using the molecular phenotyping terms we selected for inclusion as an RTM gene. However, a recent report confirms that SLIT3 is indeed a human RT disease gene, being discovered as a cause of renal agenesis and hypodysplasia 68. Additionally, Slit3 knockout mice have been reported to demonstrate renal agenesis, although this phenotype was only present in 20% of the animals analysed 69.
Within the 22q11.2 deletion region, we have identified Crkl as a candidate kidney development gene. Notably, Crkl protein-altering variants have been found in DiGeorge syndrome patients with congenital urinary abnormalities 40, providing strong support for our classification of this gene as an RT development gene. CRKL has also been found to be one of the key genes for the normal development of both upper and lower genitourinary (GU) tracts, and its deletion at 22q11.2 is shown to cause urogenital birth defects 70. Another gene in the 22q11.2 critical region predicted to be associated with RT development, KLF8, has been shown to be overexpressed in renal cell carcinoma tissue as compared to non-tumour adjacent tissue 71.
Overall, our classifier has identified several predicted RTM genes within our test datasets that have links to kidney or RT development. It is important to note that our classifier, whilst achieving superior accuracy to random guessing, still remains a computational tool which cannot be expected to achieve perfection for every gene status prediction. Looking forward, we propose that experimental analysis of the genes with highly confident RTM predictions will confirm or refute the role of these specific genes in RT development. Exploration of RTM patient exome and/or genome sequence datasets will reveal if these genes harbour deleterious variants in individuals with RTMs. Modelling deleterious variants in cell and animal models will enable deeper understanding of the developmental processes that these variants disrupt. Furthermore, the RTM gene predictions can be of use in determining which genes within an identified RTM genomic critical region or copy number variable region should be considered the most likely genetic candidates for causing disease. Our predictions may be informative for the analysis of sequence variation from RTM patients, to allow prioritisation of variants within genes of currently unknown RTM association status. Combining animal model analysis and RTM patient genome sequence analysis will provide strong evidence that genes with high confidence predictions are indeed linked to human RTMs, expanding our knowledge of the genetic causes of congenital kidney and lower urinary tract disease and expediting genetic diagnosis for RTM patients.

Methods
Data retrieval. We used the MGI database to compile a dataset comprising all mouse genes. Mouse genes were labelled as either RTM or non-RTM using the mutant mouse phenotype information from the IMPC and MGI databases (accessed on 15 October 2016). Only null alleles of mouse genes with known phenotypes resulting from single gene knockout (targeted deletion) experiments were included in this study. We defined the phenotype of a knockout mouse as RTM if the gene was known to be involved in renal development. These genes can potentially cause congenital renal developmental defects when mutated. A total of 10 phenotype terms in MGI were used to classify a single gene knockout phenotype as RTM. These were: abnormal kidney morphology (MP:0002135), abnormal ureter morphology (MP:0000534), abnormal ureteropelvic junction morphology (MP:0011487), abnormal ureterovesical junction morphology (MP:0011488), abnormal urethra morphology (MP:0000537), abnormal urinary bladder morphology (MP:0000538), abnormal urinary system development (MP:0003942), abnormal urothelium morphology (MP:0003630), persistent cloaca (MP:0003129) and vesicoureteral reflux (MP:0001948). The RTM genes were also checked manually to find out which RT abnormalities are associated with them in the mouse. RTM genes were further verified by manually checking whether they are also critical to RT anomalies in humans. Genes with insufficient evidence of an associated RT phenotype in the mouse were excluded. Additionally, renal ciliopathy genes were excluded from our RTM gene dataset. Mouse knockouts with phenotypes unrelated to any of these renal annotations were marked as non-RTM. Our datasets were restricted to protein-coding genes only. We further retrieved the Ensembl 72 gene identifier and UniGene 73,74 expression cluster identifier mapping to each MGI gene symbol. Encoded proteins for each mouse gene were determined from the UniProt database. Only the longest protein isoform was analysed for each gene.
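The term-based labelling step described above can be sketched as a simple set lookup. The ten MP identifiers are those listed in the text; the gene names and the unrelated placeholder term are hypothetical, and the sketch does not model the manual verification, the exclusion of renal ciliopathy genes, or any handling of descendant terms in the phenotype ontology.

```python
# The ten Mammalian Phenotype terms used in the text to define an RTM phenotype.
RTM_TERMS = {
    "MP:0002135",  # abnormal kidney morphology
    "MP:0000534",  # abnormal ureter morphology
    "MP:0011487",  # abnormal ureteropelvic junction morphology
    "MP:0011488",  # abnormal ureterovesical junction morphology
    "MP:0000537",  # abnormal urethra morphology
    "MP:0000538",  # abnormal urinary bladder morphology
    "MP:0003942",  # abnormal urinary system development
    "MP:0003630",  # abnormal urothelium morphology
    "MP:0003129",  # persistent cloaca
    "MP:0001948",  # vesicoureteral reflux
}

def label_gene(mp_terms):
    """Label a knockout as 'RTM' if any of its phenotype annotations
    is one of the ten RTM terms, otherwise as 'non-RTM'."""
    return "RTM" if RTM_TERMS & set(mp_terms) else "non-RTM"

# Hypothetical knockout annotations; "MP:9999999" is a placeholder for
# any phenotype term unrelated to the renal tract.
knockouts = {
    "GeneA": ["MP:0000534", "MP:9999999"],
    "GeneB": ["MP:9999999"],
}
labelled = {gene: label_gene(terms) for gene, terms in knockouts.items()}
print(labelled)
```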

Feature collection.
We collected a number of gene and protein-sequence-based features to differentiate RTM and non-RTM phenotypes. Features including 'gene length', '% of GC content', 'transcript count', 'exon count', 'exon length' and 'intron length' were computed based on the data retrieved from the Ensembl release 103 database of Mus musculus genes, using the Ensembl BioMart 75 data mining tool. Gene expression data as transcripts per million (TPM) were obtained from the UniGene database for 13 embryonic developmental stages. The RNA-seq gene expression data were downloaded from the BGEE 76 database, which included 6 tissue types (11 weeks testis, 8 weeks fibroblast, 8 weeks heart, post-juvenile adult RTM, post-juvenile testis and 2 months skin). The Pepstats 77 program was used to calculate protein length, molecular weight and amino acid composition. UniProt and the WoLF PSORT 78 program were used for subcellular localisation features. Other gene and protein-sequence-based features, including evolutionary age, signal peptides, transmembrane domains and subcellular locations, were obtained from Ensembl, SignalP 79 and UniProt. Mouse protein-protein interaction (PPI) data were downloaded from the I2D 80 v2.3 database, which is a database of known and predicted protein interactions for human, mouse, rat, fly, yeast and worm genomes. The 'network analyser' plugin of Cytoscape 81 v3.1.1 and the Hub object Analyser (Hubba) 82 web-based service were used to compute PPI network properties. GO terms were obtained using the 'Functional Annotation' tool of the web-based application DAVID 83 v6.8. A detailed description of these features has been given in previous studies 14,24. Data on the chromosome location of mouse genes were obtained from Ensembl.

Machine learning classifiers.
A Random Forest classifier was developed using the publicly available Java-based machine learning software Weka (version 3.8.2). The classifier was trained using the tenfold cross-validation method on a training dataset of RTM and non-RTM mouse genes, in which the training dataset was randomly split into 10 equal parts, with 9 parts used for classifier training and the remaining part used for testing. Training datasets with an equal number of RTM and non-RTM genes were used to avoid bias towards the larger gene group. However, data for several features were missing for a number of genes in the training datasets. These features were the 10 features of the PPI network generated from known PPIs, and gene expression across 13 developmental stages. Missing values of these features were replaced with the respective feature mean values. Separate test datasets were also created from genes that had not been included in classifier training. The performance of the classifier was validated by calculating the proportion of correctly predicted genes in the test datasets. The classifier generates a probability score to indicate the confidence level of a prediction outcome. This probability score is calculated by taking the average of all predictions made by the decision trees in the Random Forest. A score of 1 indicates that all trees agree on the same class prediction.
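Two steps in this paragraph, mean imputation of missing feature values and the averaging of tree predictions into a probability score, can be sketched as follows. This is an illustrative sketch rather than Weka's implementation: it assumes missing values are encoded as `None`, and it assumes each tree casts a single hard vote, whereas a given Random Forest implementation may instead average per-tree class probabilities. The function names are hypothetical.

```python
def impute_with_feature_means(rows):
    """Replace None entries with the mean of the observed values in that column.

    rows is a list of equal-length lists (one list per gene, one column per feature).
    A column with no observed values at all falls back to 0.0 (an arbitrary choice).
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if value is None else value
             for j, value in enumerate(row)] for row in rows]


def forest_probability(tree_votes, positive="RTM"):
    """Fraction of trees voting for the positive class (assumes hard votes)."""
    if not tree_votes:
        raise ValueError("no tree votes supplied")
    return sum(1 for vote in tree_votes if vote == positive) / len(tree_votes)


# Toy feature table: second gene lacks a value for the second feature.
print(impute_with_feature_means([[1.0, 4.0], [3.0, None], [5.0, 8.0]]))
# Four hypothetical trees, three of which predict RTM.
print(forest_probability(["RTM", "RTM", "non-RTM", "RTM"]))
```

Under the hard-vote assumption, a score of 1 is returned only when every tree predicts the same class, matching the interpretation given in the text.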
Oversampling technique. Since our RTM and non-RTM datasets varied in the number of genes, we generated balanced training datasets containing an equal number of RTM and non-RTM mouse genes. The data imbalance was overcome by subsampling the non-RTM dataset at random 84 and by generating synthetic instances of the RTM class using SMOTE. SMOTE is one of the most widely used oversampling techniques to solve class imbalances by generating synthetic samples for the minority class based upon the existing minority class samples. Each training dataset contained different subsets of RTM and non-RTM genes as a result of random selection.
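The balancing procedure can be sketched in a few lines. The code below is a minimal, standard-library illustration of the SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours) combined with random subsampling of the majority class. It is not the implementation used in the study; the function names, the `target_size` parameter, the default of `k=5` neighbours and the use of Euclidean distance on unscaled features are all assumptions made for illustration.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points by interpolating between minority samples.

    For each synthetic point: pick a minority sample at random, pick one of its
    k nearest minority neighbours, and place the new point at a random position
    on the line segment joining the two.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [m for j, m in enumerate(minority) if j != i]
        # Sort the remaining minority samples by squared Euclidean distance.
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def balance(rtm, non_rtm, target_size, k=5, seed=0):
    """Return (RTM, non-RTM) sets of equal size target_size (hypothetical helper)."""
    if not len(rtm) <= target_size <= len(non_rtm):
        raise ValueError("target_size must lie between the two class sizes")
    rng = random.Random(seed)
    subsampled = rng.sample(non_rtm, target_size)  # random subsampling
    oversampled = list(rtm) + smote(rtm, target_size - len(rtm), k=k, seed=seed)
    return oversampled, subsampled
```

Calling `balance` with different `seed` values yields different subsets of non-RTM genes, which mirrors the statement that each training dataset contained different subsets as a result of random selection.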
Feature selection. Accurate and reliable classification relies mainly upon the quality of the input features used to build the classifier; not all the features in the training dataset are useful. Using only relevant features can reduce overfitting, optimise classification performance and decrease the training time. Feature selection was performed using the Information Gain method implemented in Weka, which ranks each feature by evaluating its information gain with respect to the classification target and selects only the most informative features for classification in order of significance 85. The higher the information gain, the more important the feature is in determining the classification target.
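Information gain is the reduction in class entropy obtained by knowing the value of a feature, that is, H(class) minus H(class | feature). The sketch below computes it for a feature that has already been discretised into categories; how numeric features are binned beforehand is not covered here and is left as an assumption. The function names and the toy data are illustrative and do not reproduce Weka's implementation.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(labels) - H(labels | feature) for a categorical feature."""
    n = len(labels)
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(table, labels):
    """Sort feature names by decreasing information gain.

    table maps a feature name to its list of (discretised) values per gene.
    """
    gains = {name: information_gain(values, labels) for name, values in table.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Toy example with two hypothetical binned features.
labels = ["RTM", "RTM", "non-RTM", "non-RTM"]
table = {
    "feature_A": ["high", "high", "low", "low"],   # separates the classes fully
    "feature_B": ["high", "low", "high", "low"],   # carries no class information
}
print(rank_features(table, labels))
```

In this toy example the feature that separates the classes receives a gain of 1 bit and is ranked first, while the uninformative feature receives a gain of 0.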
Performance measures. Performance of the predictive classifier was evaluated by several metrics, which include accuracy, confusion matrix, precision and recall. Our classifier scores a prediction as TP (number of RTM genes correctly identified), FP (number of non-RTM genes incorrectly identified as RTM), TN (number of non-RTM genes correctly identified) or FN (number of RTM genes incorrectly identified as non-RTM). Four metrics were estimated from these counts to assess how well our classifier performs in gene prediction: accuracy (proportion of true results); true positive rate (recall or sensitivity), TPR; false positive rate, FPR; and precision, defined by the following equations:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$

Classifier performance was further evaluated from the areas under the receiver operating characteristic (ROC) curve and the precision-recall curve (PRC). The ROC area measures how well a classifier is performing in general, whereas the PRC area measures how well the classifier identifies the samples from an individual group. An area value of 1 represents a perfect prediction; a value of 0.5 represents a random guess.
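As a worked illustration, the four count-based metrics can be computed directly from a confusion matrix. The function name and the counts used below are invented for the example and are not results from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, TPR (recall), FPR and precision from confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),  # proportion of true results
        "tpr": tp / (tp + fn),                        # recall / sensitivity
        "fpr": fp / (fp + tn),                        # false positive rate
        "precision": tp / (tp + fp),
    }


# Invented counts: 8 RTM genes found, 2 missed; 8 non-RTM genes correct, 2 wrong.
print(classification_metrics(tp=8, fp=2, tn=8, fn=2))
```

With these counts, accuracy, TPR and precision are each 0.8 and FPR is 0.2.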

Statistical analysis.
The statistical significance of each feature was determined using the non-parametric Mann-Whitney U test. We also used the Chi-squared (χ²) test to examine whether the frequencies of a feature in the RTM and non-RTM datasets differ from each other. All statistical tests were performed using the statistics software package SPSS v23. Data visualisation was performed using R 86.
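For readers unfamiliar with the Mann-Whitney U test, the sketch below shows how the U statistic is obtained from the ranks of the pooled values, with tied values sharing their average rank. It computes the statistic only; the P-value calculation performed by SPSS is not reproduced. The function name and the toy values are assumptions for illustration.

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two samples, using average ranks for ties."""
    # Pool the samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over the run of values tied with pooled[i].
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[t] = average_rank
        i = j + 1
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n_x, n_y = len(x), len(y)
    u_x = rank_sum_x - n_x * (n_x + 1) / 2
    u_y = n_x * n_y - u_x
    return u_x, u_y


# Toy feature values for two groups of genes.
print(mann_whitney_u([1, 2, 3], [4, 5, 6]))
```

A U value of 0 for one group, as in this toy example, means every value in that group is smaller than every value in the other, which is the most extreme separation possible.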

Figure 2. Distributions of the total gene length, exon length, intron length and protein length in RTM and non-RTM datasets. These violin plots outline the distribution of (a) gene length, (b) exon length, (c) intron length and (d) protein length, with overlaid boxplots. The width of the violin plots represents the proportion of the data located there; the top and bottom of the boxplots denote the upper and lower quartiles; the line inside the box denotes the median of the data. The P-values from the Mann-Whitney U tests are reported below their respective graphs.

Figure 3. Distributions of several amino acid residues (%) between RTM and non-RTM mouse proteins. These violin plots outline the distribution of the proportion of (a) glycine, (b) asparagine, (c) proline, (d) isoleucine, (e) leucine and (f) glutamine residues, with overlaid boxplots. The width of the violin plots represents the proportion of the data located there; the top and bottom of the boxplots denote the upper and lower quartiles; the line inside the box denotes the median of the data. The P-values from the Mann-Whitney U tests are reported below their respective graphs.

Table 1. List of statistically significant features between RTM and non-RTM genes. The median value of each feature is reported. Statistically significant results are listed for P-values less than 0.05.

Table 2. Top 10 features selected from the training dataset using the Information Gain feature selection method. Features are sorted in descending order with respect to the corresponding information gain value, with the most informative feature listed first.

Table 3. Tenfold cross-validation performance of the Random Forest, SVM, XGBoost and J48 classifiers trained and evaluated on the training dataset. Data from before and after feature selection are presented. Here, TP = True Positive; FP = False Positive; ROC = Receiver Operating Characteristic; PRC = Precision-Recall Curve.

Gene class TP rate FP Rate Precision F-Measure ROC area (AUC) PRC area
Scientific Reports | (2023) 13:13204 | https://doi.org/10.1038/s41598-023-38110-z
siRNA knockdown of KLF8 limited cellular growth and invasion capacity of human renal carcinoma cells in vitro. Therefore, KLF8 likely plays a role in proliferation of renal carcinoma cells. Further investigation is needed to determine if KLF8 also plays a role in developmental renal cell proliferation.