Mental disorders are a global health concern, with depression and anxiety disorders costing the global economy $1 trillion in lost productivity each year [1]. In the United States, serious mental disorders cost society $193.2 billion each year [2]. Over 13.1 million adults in the United States experienced serious mental illness in 2019, and 7.7 million minors (aged 6–17) experienced a mental disorder in 2016, according to statistics from the Centers for Disease Control and Prevention and the National Alliance on Mental Illness. Meanwhile, suicide is the second leading cause of death among people aged 10–34, according to the National Institute of Mental Health. Accurate diagnosis is the first and most important step in managing mental disorders and ensuring appropriately tailored therapies; however, the average delay between the onset of symptoms and treatment is 11 years [3], and the misdiagnosis rate is high [4, 5]. Over the past few decades, protocols such as the Diagnostic and Statistical Manual of Mental Disorders (DSM) have significantly improved the accuracy and efficiency of mental disorder diagnosis, but, unlike for many other diseases, objective screening methodologies and lab tests are still lacking, due in part to the underlying disease heterogeneity. In addition, the co-occurrence of different types of mental disorders, e.g., attention deficit hyperactivity disorder (ADHD) and autism [6], makes diagnosis even more challenging. Therefore, alternative diagnostic methods are warranted and could serve as an additional reference in the diagnosis of patients with multiple co-occurring disorders.

Structural variation in the human genome shows a strong association with mental disorders, and certain variants have already been leveraged as drug targets [7]. Non-coding structural variants impacting long non-coding RNAs (lncRNAs) have been shown to influence the entire cell cycle by interacting with DNA, RNA, and proteins [8]. The resulting regulatory effects alter gene expression in many complex diseases, including but not limited to cancers, Alzheimer’s disease, cardiovascular issues, neuronal disorders, immune responses, and hereditary diseases [9, 10]. Variation and dysregulation in lncRNAs may thus contribute to complex human diseases, and the lncRNAs may themselves be potential therapeutic targets, e.g., H19, HOTAIR, LUNAR1, MALAT1, NEAT1, and MaTARs in cancer [11] and PVT1 in diabetic nephropathy [9]. Mutations in untranslated regions (UTRs) or intronic regions may also be potential therapeutic targets, since they may lead to protein instability [12] or to alternative splicing of genes that are critical in signaling pathways, such as those involved in tumorigenesis [13]. Meanwhile, machine learning models, especially deep learning algorithms, have been shown to be of potential value in stratifying mental disorders. Researchers have applied machine learning or deep learning algorithms to mental disorders, usually based on one of four types of features: clinical data, genetic/genomic data, vocal and visual expression data, and social media data [14]. Many studies using genetic/genomic data have focused on prioritizing susceptibility genes and pathways for mental disorders [15, 16]. Of the studies predicting disease phenotype, the majority are limited to a specific disease type, such as bipolar disorder [17] or ADHD [18]. However, patients are commonly diagnosed with more than one type of mental disorder, and studies in African American (AA) populations are lacking.

In this study, we analyzed blood whole genome sequencing (WGS) data from 4179 ethnic minority individuals (AA), including 1384 patients diagnosed with at least one of eight common mental disorders. We created a multi-layer perceptron (MLP) neural network using coding and non-coding structural variation burdens from different genomic regions as feature vectors. This was done to address two questions: first, whether the model could differentiate mental disorder patients from controls; second, whether it could correctly label patients with different types of disorders, especially patients with multiple diagnoses. Prediction accuracy was evaluated using two-fold random shuffle tests. Our results support a strong labeling capacity of the deep learning algorithm, with non-coding structural variation contributing robustly to the classification.


Patient cohorts

The patients in this study are from the Center for Applied Genomics (CAG) at The Children’s Hospital of Philadelphia (CHOP), and the WGS data were generated through the NHLBI Trans-Omics for Precision Medicine (TOPMed) WGS Program. All 4179 AA individuals were selected from the CAG biobank, including 1384 patients with a diagnosis of at least one of eight mental disorders (Fig. 1 and Supplementary Table 1). The patients were approached during regular hospital visits at multiple clinics, including emergency room, ambulatory, surgical, general pediatrics, and specialty pediatric settings. Recruited patients were aged 0–21 years and were receiving healthcare at CHOP. Parental consent was obtained for individuals under 18 years old, and assent was also obtained from subjects aged 7–17 years. The consent allowed samples to be analyzed using the genomic technologies described herein to address the proposed research questions. Parents could opt in to permit regular updates of their child’s electronic health record (EHR) data and to be re-contacted for future studies, which nearly all did.

Fig. 1: Phenotype summary of 4179 African American individuals from the NHLBI Trans-Omics for Precision Medicine (TOPMed) project.

a Age distribution of patients: the majority (~95%) are children under 18 years old. b Number of patients with each of the eight diagnoses: ADHD, depression, anxiety, autism, intellectual disabilities, speech/language disorder, delays in development, and ODD; note that one patient can have multiple diagnoses. c Distribution of the number of diagnoses per patient, ranging from zero (controls with no diagnosed mental disorder) to a maximum of six.

Electronic health record (EHR) data extractions

The CAG at CHOP maintains a de-identified extract of clinical data from the CHOP EHR database for consented patients. This database contains longitudinal information about visits, diagnoses, medical history, prescriptions, procedures, and lab tests. For this study the mental health status of de-identified individuals was classified based on the International Classification of Diseases (ICD) codes (ICD-9 and ICD-10) associated with clinical visits and entered in the medical history record.

Whole genome sequencing (WGS) data processing and variation detection

The WGS variant call format (VCF) files were extracted directly from the TOPMed database. According to the TOPMed description, DNA was isolated from whole blood, and DNA quantity and sex discordance were checked during quality assessment. Libraries for WGS were created using the Illumina TruSeq DNA PCR-Free Library Preparation Kit. WGS was performed on the Illumina HiSeq X Ten platform with paired-end 150 bp reads. The bcl2fastq v2.15.0 package was used to generate individual FASTQ files. The alignment pipeline is publicly available. Common variants with a minor allele frequency greater than 0.05 in the AA population, based on the Exome Aggregation Consortium database [19], were removed.

Genomic feature vector selection for deep learning models

The human genome was divided into 587 pieces (~5 Mb per piece) based on GRCh38 genomic coordinates. For each genomic piece, we counted the occurrences of seven different types of variation: nonsynonymous single nucleotide variants (SNVs), frameshift SNVs, SNVs in UTRs, non-coding RNA SNVs, SNVs in intronic regions, SNVs in intergenic regions, and SNVs producing a stop codon. The per-piece counts were subsequently used as feature vectors in the deep learning model. The process was repeated for all individuals in the study. A random forest algorithm was applied to reduce the number of features by computing the relative importance, or contribution, of each genomic piece to the prediction; the importance scores were then scaled so that they sum to 1. Features with zero relative importance were removed separately for each type of variant. Technically, the random forest model uses Gini impurity (“gini”) to measure the quality of a split, the minimum number of samples required to split an internal node is 2, and nodes are expanded until all leaves are pure or contain fewer than 2 samples. The number of features considered when looking for the best split is the square root of the total number of features, and the number of trees in the forest is 500. The modeling code is based on the Scikit-learn package (version 0.21.3) in Python. Genomic pieces with the highest weights were considered hotspots, and drug target genes within the hotspot regions were explored through the Drug–Gene Interaction Database [20]. Only FDA-approved medications were considered.
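As an illustration of the burden counting described above, the following minimal sketch bins variants into fixed 5 Mb windows and counts them per variant type for one individual. It is not the study's code: the window boundaries are assumed to be 1-based and aligned as in the hotspot coordinates reported in the Discussion (e.g., chr19:50000001-55000000), the variant type labels and the function names `window_of` and `burden_counts` are hypothetical, and shorter windows at chromosome ends are not modeled.

```python
from collections import Counter

WINDOW = 5_000_000  # ~5 Mb per genomic piece, as described in the text

# The seven variant categories named in the text (label strings are hypothetical).
VARIANT_TYPES = (
    "nonsynonymous", "frameshift", "UTR", "ncRNA",
    "intronic", "intergenic", "stop_codon",
)


def window_of(chrom, pos):
    """Return (chrom, start, end) of the 1-based 5 Mb window containing pos."""
    index = (pos - 1) // WINDOW
    start = index * WINDOW + 1
    return (chrom, start, start + WINDOW - 1)


def burden_counts(variants):
    """Count variants per (window, variant type) for one individual.

    variants: iterable of (chrom, pos, variant_type) tuples.
    """
    counts = Counter()
    for chrom, pos, vtype in variants:
        if vtype not in VARIANT_TYPES:
            raise ValueError(f"unknown variant type: {vtype}")
        counts[(window_of(chrom, pos), vtype)] += 1
    return counts


# Toy input for one individual (made-up positions, not study data).
example = [
    ("chr19", 50_000_001, "stop_codon"),
    ("chr19", 54_999_999, "stop_codon"),
    ("chr19", 55_000_001, "frameshift"),
    ("chr17", 36_500_000, "intronic"),
]
counts = burden_counts(example)
```

Each individual's counts for one variant type, laid out over all windows in a fixed order, would then form one row of the feature matrix for that variant type.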

Deep learning parameters and random shuffled two-fold tests

The MLP implementation from the Scikit-learn package (version 0.21.3) was used as the deep learning model for each of the seven types of variants. Two types of prediction were made: binary labeling of patients diagnosed with mental disorders versus controls, and multi-label classification of patients with at least one mental disorder among ADHD, depression, anxiety, autism, intellectual disabilities, speech/language disorder, delays in development, and oppositional defiant disorder (ODD). In the latter case, the phenotype of each of the 1384 patients becomes a 1 × 8 binary matrix instead of a single binary value, with each column corresponding to one of the eight disorders listed above. Parameters of the deep learning model, including the maximum number of iterations, the alpha value for L2 regularization, the activation function, the solver, the learning rate, the number of layers, and the number of neurons per layer, were optimized using the “gp_minimize” function from the scikit-optimize 0.7.2 Python library.
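The 1 × 8 phenotype encoding can be sketched as follows. This is an illustrative example rather than the study's code: the column order is taken from the order stated in the Results section, and the label strings and the function name `encode_phenotype` are hypothetical.

```python
# Column order as stated in the Results section; label strings are hypothetical.
DISORDERS = (
    "ADHD",
    "speech/language disorder",
    "developmental delay",
    "depression",
    "anxiety",
    "ODD",
    "autism",
    "intellectual disability",
)


def encode_phenotype(diagnoses):
    """Encode a set of diagnoses as a 1 x 8 binary row (1 = disorder present)."""
    diagnoses = set(diagnoses)
    unknown = diagnoses - set(DISORDERS)
    if unknown:
        raise ValueError(f"unknown diagnoses: {sorted(unknown)}")
    return [1 if disorder in diagnoses else 0 for disorder in DISORDERS]


# A hypothetical patient diagnosed with ADHD, autism, and ODD.
row = encode_phenotype({"ADHD", "autism", "ODD"})
```

Stacking one such row per patient gives the multi-label target matrix on which the multi-label model is trained and evaluated.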

To test the predictive ability of the selected features, we applied two-fold shuffle testing. More specifically, for case–control labeling, the 1384 patients and 2795 controls were randomly split at a 1:1 ratio in each of 50 rounds, with one set used as training data and the other as an independent test set. The genomic feature vectors were selected on the training data as described above, and the deep learning model was then applied to label each sample in the test data as a mental disorder patient or a control. Similarly, for multi-label classification, the 1384 patients with at least one diagnosis were randomly split at a 1:1 ratio in each of 50 rounds. Instead of a single binary label, the prediction output is a 1 × 8 matrix in which each column corresponds to one of the eight disorders and a value of 1 indicates the presence of that disorder.
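The splitting scheme described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the function name `two_fold_shuffle` and the seed are hypothetical, the split is not stratified by case status, and the two halves are not swapped within a round, since the text does not specify either detail.

```python
import random


def two_fold_shuffle(sample_ids, rounds=50, seed=0):
    """Yield (train_ids, test_ids) for each round of a random 1:1 split."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    ids = list(sample_ids)
    for _ in range(rounds):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        yield shuffled[:half], shuffled[half:]


# 4179 placeholder sample indices (1384 patients plus 2795 controls).
splits = list(two_fold_shuffle(range(4179)))
```

In each round, feature selection and model fitting would use only the training half, and the accuracy would be computed on the held-out half; the reported figures are the mean and standard deviation over the 50 rounds.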


Phenotype prediction accuracy for mental disorders versus controls in 4179 African American (AA) individuals using two-fold shuffle tests

As described in the Methods section, the prediction of mental disorders was assessed with 50 rounds of two-fold random shuffle tests on the genetic variants. The reduced feature vectors, selected with the random forest algorithm, showed a reproducible prediction accuracy of about 65% in classifying mental disorder patients versus controls using the deep learning model with optimized parameters (Table 1). Notably, structural variants in non-coding regions, such as variants in non-coding RNAs and in intronic and intergenic regions, showed a level of predictive accuracy similar to that of structural variants in coding regions, including nonsynonymous SNVs, frameshift SNVs, and SNVs producing stop codons.

Table 1 Prediction accuracy summary (mean ± standard deviation).

Phenotype prediction accuracy for patients with multiple diagnoses in 1384 African American (AA) individuals using two-fold shuffle tests

Unlike labeling patients versus controls, which is a binary problem, labeling patients with multiple diagnoses is a multi-label problem. More specifically, instead of a single binary value representing the presence or absence of a disorder, the phenotype of each patient is a 1 × 8 binary matrix, with each column corresponding to one type of disorder in the following order: ADHD, speech/language disorders, developmental delays, depression, anxiety, ODD, autism, and intellectual disabilities. As a result, prediction accuracy is more complicated to present. We used the Hamming loss, a standard metric for multi-label classification with binary labels. The Hamming loss is the fraction of labels that are incorrectly predicted; it ranges from 0 to 1, and a lower value indicates a better classifier. As shown in Table 1, the Hamming loss is less than 0.3, indicating that a high fraction of labels is correct. We also calculated the rate of exact matches, i.e., whether a patient diagnosed with, for example, ADHD, autism, and ODD receives a predicted phenotype that is exactly the same as the diagnosis. The exact-match accuracy ranged from 7 to 10% depending on the variant type. Given that the accuracy of a random guess over the phenotype is 1/256 (~0.4%), the deep learning model clearly outperforms random guessing. The accuracy and recall for the eight disorders are shown in Tables 2 and 3 for coding and non-coding variants, respectively.
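The two multi-label metrics described above can be illustrated with a small pure-Python sketch. The function names and the toy phenotype rows are hypothetical and are not study data; the columns follow the disorder order given in the text.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual labels that are predicted incorrectly."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same number of rows")
    wrong = 0
    total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        if len(true_row) != len(pred_row):
            raise ValueError("rows must have the same number of labels")
        wrong += sum(t != p for t, p in zip(true_row, pred_row))
        total += len(true_row)
    return wrong / total


def exact_match_rate(y_true, y_pred):
    """Fraction of patients whose full 1 x 8 phenotype is predicted exactly."""
    matches = sum(list(t) == list(p) for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Two toy patients.
y_true = [[1, 0, 0, 0, 0, 1, 1, 0],   # ADHD, ODD, autism
          [1, 0, 0, 1, 0, 0, 0, 0]]   # ADHD, depression
y_pred = [[1, 0, 0, 0, 0, 1, 1, 0],   # exact match
          [1, 0, 0, 0, 1, 0, 0, 0]]   # depression missed, anxiety added

loss = hamming_loss(y_true, y_pred)       # 2 wrong labels out of 16
exact = exact_match_rate(y_true, y_pred)  # 1 of 2 patients matched exactly
chance = 1 / 2 ** 8                       # uniform random guess over 8 binary labels
```

The toy example shows why the two metrics diverge: one patient with two wrong labels yields a Hamming loss of only 0.125 but halves the exact-match rate, so a Hamming loss below 0.3 is compatible with an exact-match rate of 7 to 10%.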

Table 2 Prediction accuracy for specific disorders in patients with at least one diagnosis based on coding variants.
Table 3 Prediction accuracy for specific disorders in patients with at least one diagnosis based on non-coding variants.

Genomic regions with high weights based on the deep learning model

The weight, or contribution, of each genomic region (feature vector) was calculated from the 4179 AA individuals using the random forest algorithm, as described in the Methods section. The genomic regions showed non-uniform weights in both prediction models (case–control and multiple labeling), and the weights of variants in coding regions have larger standard deviations than those of variants in non-coding regions. In other words, genomic regions with non-coding variants (UTR/ncRNA/intronic/intergenic) show a more uniform weight distribution than regions with coding variants (Fig. 2). This suggests that variants in non-coding regions mainly serve as biomarkers of genetic susceptibility to mental disorders, conferred by functional genetic variants in each region. In addition, different chromosomes show different patterns of weights, and, notably, the coding hotspots were almost the same between the case–control classification and multiple labeling models (Fig. 3). In contrast, the hotspot patterns for non-coding variants did not match between the two models (Fig. 4). Enrichment analysis was performed on gene hotspots (>1% weight) using the DAVID Bioinformatics platform [21].

Training the models takes only a few hours on a computer cluster (less than 1 day on a standard PC). Most of the computational time is spent on feature vector extraction and parameter optimization. In the feature vector extraction step, the programs must scan through the WGS data to annotate and categorize the SNVs, which consumes a large amount of computational time and resources (about 5 days on clusters). Parameter optimization using the “gp_minimize” function from the scikit-optimize 0.7.2 Python library takes about 3 days, since many parameters, especially the numbers of neurons and layers, need to be tested.
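The weight handling described in the Methods and used for hotspot selection (scale importances to sum to 1, drop regions with zero importance, flag regions above 1%) can be sketched as follows. The raw importance values, the background region names, and the function names are made up for illustration; only the 1% threshold and the normalization rule come from the text.

```python
def normalize_weights(raw_importance):
    """Scale importances to sum to 1 and drop regions with zero importance."""
    total = sum(raw_importance.values())
    return {
        region: value / total
        for region, value in raw_importance.items()
        if value > 0
    }


def find_hotspots(weights, threshold=0.01):
    """Return regions whose weight exceeds the threshold, highest first."""
    hot = [region for region, weight in weights.items() if weight > threshold]
    return sorted(hot, key=lambda region: weights[region], reverse=True)


# Toy raw importances (not study results): two strong regions, one with zero
# importance, and 100 hypothetical background regions with small importance.
raw = {
    "chr19:50000001-55000000": 30.0,
    "chr17:35000001-40000000": 20.0,
    "chr1:1-5000000": 0.0,
}
raw.update({f"background_{i}": 0.5 for i in range(100)})

weights = normalize_weights(raw)
hotspots = find_hotspots(weights)
uniform_weight = 1 / len(weights)  # reference value if all regions were equal
```

The `uniform_weight` value plays the role of the uniform-weighting reference drawn as a dashed line in the weight distribution figures.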

Fig. 2: Boxplots for weights of 587 genomic regions (feature vectors).

a In prediction of cases versus controls. b In multiple labeling for 1384 mental disorder patients.

Fig. 3: Feature vector weight distribution of three different types of structural variants (nonsynonymous SNVs, frameshift SNVs, and stop codon SNVs) across 22 autosomes.

a In prediction of cases versus controls. b In multiple labeling for 1384 mental disorder patients. The red dashed line indicates the value expected if the genomic regions were uniformly weighted.

Fig. 4: Feature vector weight distribution of four different types of structural variants (SNVs in UTR regions, ncRNA, intronic regions, and intergenic regions) across 22 autosomes.

a In prediction of cases versus controls. b In multiple labeling for 1384 mental disorder patients. The red dashed line indicates the value expected if the genomic regions were uniformly weighted.


Accurate diagnosis of mental disorders can be difficult, and it is even more challenging in patients with comorbid conditions involving more than one type of mental disorder. Although guidelines and standards based on the DSM are helping, the misdiagnosis rate is still high. An assessment of 840 patients in 2011 showed that misdiagnosis rates reached 65.9% for major depressive disorder, 92.7% for bipolar disorder, 85.8% for panic disorder, 71.0% for generalized anxiety disorder, and 97.8% for social anxiety disorder [5]. A more recent study showed that 51% of patients with schizophrenia received a primary diagnosis in the consultation clinic that differed from the diagnosis at subsequent visits [22], and the misdiagnosis rate of ADHD is also high, including both over- and underestimation [23]. Misdiagnosis can result in the prescription of the wrong medications, which can cause side effects without any of the benefits and consequently worsen the condition [24]. The difficulties in diagnosing mental disorders are further compounded by the heterogeneity of comorbid symptoms and the lack of objective standards, such as the imaging and lab testing methodologies that are commonly used for other diseases. For young patients, especially toddlers under 3 years of age who are not able to complete written tests for mental disorders, the delays and misdiagnosis rates are even more serious. This is unfortunate, as early intervention is critical for many types of severe mental disorders. For example, a previous study showed that early intervention before 30 months of age could significantly improve IQ and adaptive behavior in autism [25]. Objective alternative approaches could therefore serve as independent references to help clinicians reduce the misdiagnosis rate and make correct decisions for young patients and toddlers.
Over the past 15–20 years, structural variants in both coding and non-coding regions of the genome have been identified and used as biomarkers to inform the diagnosis and treatment course of mental disorders [26, 27]. In this study, we used genomic variants identified in 4179 AA individuals, 22% of whom were under 3 years of age (Fig. 1a), as feature vectors in two MLP deep learning models, which label mental disorder patients versus controls and patients with multiple mental disorders, respectively.

Among the 4179 AA individuals, we selected 1384 patients who were diagnosed with at least one of eight common mental disorders: ADHD, depression, anxiety, autism, intellectual disabilities, speech/language disorder, delays in development, and ODD (Fig. 1b). The first prediction model, mental disorders versus controls, showed an average accuracy of around 65% based on 50 rounds of two-fold random shuffle tests for variants in coding and non-coding regions (Table 1). This accuracy is lower than that of a previous study labeling ADHD versus controls (~80%) [18]. The likely main reason is that combining eight disorders as cases introduces comorbidity and substantially increases genetic heterogeneity.

The second prediction model addressed a more interesting question: whether we could predict the diagnoses of patients with multiple disorders, where a single patient can belong to multiple categories. The Hamming loss, which is the fraction of labels that are incorrectly predicted and is frequently used as an accuracy standard for multi-label problems, was used to measure multiple labeling accuracy (Table 1). Across 50 rounds of two-fold random shuffle tests, the Hamming loss is less than 0.3, meaning that at least 70% of the binary values in the phenotype matrix are labeled correctly. An alternative measure of accuracy for the second prediction model is the rate of exact matches between the predicted and actual phenotypes, which is 7.2–9.3%. This relatively low accuracy is related to multiple potential factors. First, the number of patients with multiple diagnoses is limited: only 662 patients have more than two diagnoses and 274 patients have more than three (Fig. 1c). Therefore, there may not be enough training data for the models to learn from. Second, the sample size for some disorders is small; for example, the labeling accuracies for ODD and autism are much lower than for other disorders (Tables 2 and 3), and the sample sizes for these two are the smallest among all disorders (Fig. 1b). Third, different mental disorders may share genetic risks [28]. Of note, the probability that a random guess correctly classifies a patient into one or more of the eight types of disorder is 1/256 (0.4%). In contrast, the labeling from our model is far more accurate and serves as a proof of concept that this information could be used as an additional reference in clinical diagnosis and decision making.

Structural variants in non-coding regions, including UTR, ncRNA, intronic, and intergenic regions, predicted no worse than variants in coding regions. However, the weight patterns differ between coding and non-coding variants. The weights of coding variants showed a much larger standard deviation than those of non-coding variants in both prediction models (Fig. 2). The lack of highly weighted genomic regions (hotspots) for non-coding variants suggests that, compared to coding variants, non-coding variants are more likely to act as proxy markers than as causative variants. Also, the weight patterns across the 22 autosomes are highly similar between the two prediction models for coding variants (Fig. 3) but visually different for non-coding variants (Fig. 4). These results indicate that the impact of coding variants is very similar across the eight types of mental disorders, whereas the regulatory effects of non-coding variants could differ substantially among disorders.

Enrichment analysis was performed for genes in hotspots, defined as regions with a weight greater than 1% (Table 4). The top hotspot, at chr19:50000001-55000000, was identified for both stop codon and frameshift SNVs and showed significant enrichment (p < 0.05) for genes involved in immune response, regulation of transcription/nucleic acid binding, osteoclast differentiation pathways, and antigen processing/presentation. A previous study reported that schizophrenia, bipolar disorder, and major depression are characterized by several immune-inflammatory alterations outside the brain [29]. Previous predictions of the effects of mutations at RNA-binding protein target sites also suggest that binding site dysregulation is a principal contributor to an individual’s risk of developing psychiatric disorders [30]. Osteoporosis has been found to co-occur with schizophrenia [31], and auto-antibodies showed higher prevalence in the brain tissue of schizophrenia patients than in that of controls [32]. In addition, another hotspot, at chr17:35000001-40000000, contains 33 genes with stop codon SNVs that are enriched in chemotaxis biological processes and chemokine activity/signaling pathways. Chemokines have been highlighted for their novel brain-specific functions and may represent novel diagnostic and/or therapeutic targets in psychiatric disorders [33]. Genes in the genomic region at chr11:55000001-60000000 contain stop codon SNVs and are significantly enriched in the G-protein-coupled receptor signaling pathway and olfactory transduction. G-protein-coupled receptors have been reported to play critical roles in depression, bipolar disorder, and schizophrenia, as well as in their treatments [34]. Associations have also been reported between olfactory processing and bipolar disorder, major depression, and anxiety [35]. Genes within these hotspots were further explored for potential interactions with FDA-approved medications (Table 5 and Supplementary Table 2).
Medications that may be used to treat mental disorders, and medications that may cause unwanted drug effects, are highlighted where supportive animal or clinical evidence exists. For example, CEPT interacts with the statin family (e.g., cerivastatin and mevastatin). Previous studies suggested that adjuvant treatment with a statin may be beneficial for patients with depression or schizophrenia who are prescribed psychotropic drugs [36, 37]. Risperidone, which interacts with TNF, may improve the rates of response and remission when used as an adjunctive therapy for treatment-resistant depression, based on clinical evidence [38, 39]. MMP2 interacts with paclitaxel, a commonly used chemotherapy medication that induces anxiety-like behavior in mice [40]. Oral dexamethasone, which interacts with SERPINE1, was significantly more effective than placebo when given for 4 days in a randomized, double-blind study of outpatients with depression [41]. Vasopressin, another compound that interacts with SERPINE1, has been linked to an increased risk of stress disorder [42]. Therefore, in addition to their roles as biomarkers for the prediction of mental disorders, the hotspots identified in this study may promote the development of treatments and preventive measures, as well as new drug discoveries.

Table 4 Coding hotspots based on weight of genomic regions and enriched Gene Ontology (GO)/KEGG pathways.
Table 5 Genes in coding hotspots and their interacting medications.

In summary, our deep learning model showed promising accuracy in differentiating patients from controls, as well as potential for labeling patients with multiple disorders. As shown by our study, genetic variants in non-coding regions (e.g., ncRNA, intronic, and intergenic) have labeling capacities comparable to those of variants in coding regions. However, unlike coding variants, non-coding variants do not have genomic hotspots and show much narrower standard deviations, indicating that they probably serve as proxy markers. Genes in the genomic regions with the highest weights showed enrichment in biological pathways involved in immune responses, antigen/nucleic acid binding, chemokine signaling, and G-protein receptor activities, which, with future research, may provide mechanistic insights into these mental disorders based on genetic marker support.