Main

The development of new and cheaper sequencing technology is allowing scientists to search for new approaches to personalized treatment.1 Whole-genome sequencing (WGS) can determine single-nucleotide variants (SNVs), which are private genetic variants, and identify all genetic variants within each person.2 One avenue for studying major depressive disorder (MDD) is to identify its association with gene expression, which suggests a measurable impact of current MDD status on gene expression.3 This approach inspired us to investigate whether single-base variation could also be translated into quantitative measurements. Variations occurring within a DNA sequence influence gene structure and protein function. Furthermore, SNVs in genes correlate with differences in the way individuals respond to a drug treatment or in their susceptibility to a complex disease such as major depression.

Here we propose a novel concept, single-nucleotide variant proportion (SNVP) in genes, to explore MDD based on DNA sequencing data. Investigating SNVP in MDD-associated genes may help identify pathways involved in MDD, and SNVP-MDD associations may reveal hidden genetic structures in this complex disorder. We have recently identified common and rare variations in a total of 46 genes that may confer susceptibility to MDD in a Mexican–American cohort.4 We obtained complete WGS data for 15 participants (10 MDD patients and 5 controls) selected from a Mexican–American cohort.5, 6, 7 We have confirmed that there was no family or population structure among the individuals in the cohort4 and no blood relationship among the 15 selected participants. We also included WGS data from a group of 10 Australians of European ancestry (5 MDD cases and 5 controls) as a comparison group. We performed SNV-calling analysis of high-quality WGS paired-end reads using a previously described pipeline.4

SNVP in a gene is defined as the ratio of the number of SNVs to the total number of nucleotides in the gene sequence. For example, if gene G has L nucleotides in its DNA sequence and a given person carries n SNVs in this gene, the SNVP in gene G for this person is n/L. We calculated the SNVP of 46 genes in 25 human subjects. The difference between two group means was tested using an independent two-sample t-test. As we studied 46 genes, P-values were corrected using the false discovery rate method8 and significance was set at ⩽0.05. We calculated the Euclidean distance between each pair of subjects based on their SNVPs for the 46 genes. After obtaining the distance matrix across individuals, the multi-dimensional scaling (MDS) method9 was used to display distance relationships between individuals in a two-dimensional picture. We then applied the neighbor-joining method10 to the distance matrix to reconstruct the cluster tree, which was drawn using MEGA software.11 For further details see Supplementary Materials and Methods.
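The SNVP definition and the multiple-testing correction can be illustrated with a minimal sketch. It assumes the false discovery rate method is the Benjamini-Hochberg procedure, which the text does not name explicitly; the function names `snvp` and `benjamini_hochberg` and all input numbers are hypothetical and are not taken from the study data.

```python
def snvp(n_snvs, gene_length):
    """Return the SNV proportion n / L for one gene in one subject."""
    if gene_length <= 0:
        raise ValueError("gene_length must be positive")
    if not 0 <= n_snvs <= gene_length:
        raise ValueError("n_snvs must lie between 0 and gene_length")
    return n_snvs / gene_length


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: this is the FDR procedure meant by the text.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical example: 25 SNVs in a gene of 20,000 nucleotides.
example_snvp = snvp(25, 20_000)

# Hypothetical raw p-values for four genes (illustration only).
raw_p = [0.01, 0.04, 0.03, 0.20]
adj_p = benjamini_hochberg(raw_p)
significant = [p <= 0.05 for p in adj_p]
```

In this toy input only the first gene remains significant after correction, which shows how a raw P-value below 0.05 can fail the adjusted threshold.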

Figure 1 profiles SNVP in those 46 genes for 25 human subjects in a heat map. Mexican–American individuals have significantly different SNVPs in many genes when compared to Australian individuals of European ancestry (see Supplementary Table S1). Statistical test results for SNVP across those 46 genes are summarized in Table 1. In the Mexican–American sample, significant differences were found between MDD cases and controls in 9 genes, namely the CNTD1, GRK4, LRRC24, MYH13, PHF21B, SLX4, TBC1D2B, TMEM151A and TRPV4 genes. For the Australian group, significant results were found for the MUC6 and TBC1D2B genes between MDD cases and controls. Specifically, SNVP in the TBC1D2B gene, which is located in Chr 15q24.3–q25.1, shows significant results in both the Mexican–American and Australian groups.

Figure 1

Heat map showing SNVP in 46 genes for 25 human subjects. AU, Australian; C, control; D, depression; MA, Mexican–American; SNVP, single-nucleotide variant proportion. The deeper the color, the higher the SNVP abundance.

Table 1 The statistical test results for SNVP in four groups across the 46 genes

Following the proposed method, we calculated the Euclidean distance matrix of 25 human subjects using their SNVPs of all 46 genes. Then we applied the classical MDS method to map the distance matrix onto a two-dimensional plane. Each point in the plane represents one individual. In Figure 2a there is an apparent separation between the Mexican–American group and the Australian group, as expected. However, there is no clear distinction between MDD cases and controls. As the MDS transformation from distance matrix to two-dimensional space may lose some high-dimensional distance information, we constructed the neighbor-joining tree to represent the clustering results. Figure 2b shows that all 10 Mexican–American MDD patients are grouped together in a cluster, and within that cluster the Mexican–American controls were separated far away from the MDD group. The Australians of European ancestry, as a different population, assembled as an obvious outgroup away from the Mexican–Americans. These results suggest that the SNVP in some genes may be associated with MDD.
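The distance and mapping steps can be sketched as follows. This is a dependency-free illustration of classical MDS (double centering of the squared distances followed by the two leading eigenvectors), not the software used in the study; eigenvectors are found by simple power iteration, and the function names and the four two-gene SNVP profiles are hypothetical.

```python
import math


def euclidean_distance_matrix(profiles):
    """Pairwise Euclidean distances between per-subject SNVP vectors."""
    return [[math.dist(a, b) for b in profiles] for a in profiles]


def _top_eigenpair(matrix, iterations=1000):
    """Dominant eigenvalue and unit eigenvector by power iteration."""
    n = len(matrix)
    vec = [float(i + 1) for i in range(n)]  # arbitrary non-zero start
    for _ in range(iterations):
        nxt = [sum(row[j] * vec[j] for j in range(n)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            return 0.0, vec
        vec = [x / norm for x in nxt]
    # Rayleigh quotient of the converged unit vector.
    value = sum(
        vec[i] * sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)
    )
    return value, vec


def classical_mds(distances, dims=2):
    """Embed subjects in `dims` dimensions from a distance matrix."""
    n = len(distances)
    sq = [[d * d for d in row] for row in distances]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J (D is symmetric).
    b = [
        [-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
         for j in range(n)]
        for i in range(n)
    ]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        value, vec = _top_eigenpair(b)
        scale = math.sqrt(max(value, 0.0))
        for i in range(n):
            coords[i][k] = scale * vec[i]
        # Deflate so the next pass finds the next eigenpair.
        b = [
            [b[i][j] - value * vec[i] * vec[j] for j in range(n)]
            for i in range(n)
        ]
    return coords


# Hypothetical SNVP profiles: four subjects, two genes (illustration only).
profiles = [
    [0.0010, 0.0020],
    [0.0050, 0.0020],
    [0.0010, 0.0030],
    [0.0050, 0.0030],
]
distances = euclidean_distance_matrix(profiles)
coords = classical_mds(distances, dims=2)
```

Because the toy profiles have only two genes, the two-dimensional embedding preserves every pairwise distance. With 46 genes the same projection would generally lose some distance information, which is the motivation the text gives for also building the neighbor-joining tree.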

Figure 2

(a) MDS two-dimensional visualization of 25 human subjects (Mexican–American MDD case, blue point; Mexican–American control, red point; Australian MDD case, green point; Australian control, brown point). (b) Cluster tree for 25 human subjects based on their SNVPs of all the 46 genes. MDD, major depressive disorder; MDS, multi-dimensional scaling; SNVP, single-nucleotide variant proportion.

Our method may be a valuable predictive/diagnostic tool. A subject can be represented by a multi-dimensional feature vector that indicates the SNVP of every candidate gene. In the corresponding multi-dimensional feature space, some binary classifiers such as logistic regression12 and support vector machine13 can be used to distinguish cases and controls, and then prediction accuracy, sensitivity and specificity can be obtained. As an example, the support vector machine classifier separated MDD cases (all SNVP<0.00126) and controls (all SNVP>0.00126) with a 100% prediction accuracy in our Australian sample with a threshold of 0.00126 using the one-dimensional feature vector with the SNVP of the TBC1D2B gene (this SNVP was found to be significantly different between MDD and controls in both the Mexican–American and Australian groups). Our approach could work in a large number of subjects by using a combination of several candidate genes.
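The one-dimensional case described above can be sketched with a simple threshold rule. For linearly separable data on a single feature, a hard-margin linear support vector machine places its boundary midway between the closest points of the two classes, so the sketch below uses that midpoint directly instead of calling an SVM library. The function names and the SNVP values are hypothetical and are not the study measurements.

```python
def fit_threshold(values, labels):
    """Midpoint threshold separating cases (1) from controls (0).

    Assumes cases have lower SNVP than controls and that the two
    classes are separable on this single feature.
    """
    cases = [v for v, y in zip(values, labels) if y == 1]
    controls = [v for v, y in zip(values, labels) if y == 0]
    if max(cases) >= min(controls):
        raise ValueError("classes are not separable by a single threshold")
    return (max(cases) + min(controls)) / 2


def evaluate(values, labels, threshold):
    """Accuracy, sensitivity and specificity of the rule SNVP < threshold."""
    predictions = [1 if v < threshold else 0 for v in values]
    pairs = list(zip(predictions, labels))
    tp = sum(p == 1 and y == 1 for p, y in pairs)
    tn = sum(p == 0 and y == 0 for p, y in pairs)
    fp = sum(p == 1 and y == 0 for p, y in pairs)
    fn = sum(p == 0 and y == 1 for p, y in pairs)
    return {
        "accuracy": (tp + tn) / len(labels),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical SNVP values for one gene: 5 cases, then 5 controls.
values = [0.0010, 0.0011, 0.0009, 0.0012, 0.0008,
          0.0014, 0.0015, 0.0013, 0.0016, 0.0017]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

threshold = fit_threshold(values, labels)
metrics = evaluate(values, labels, threshold)
```

With several candidate genes, each subject becomes a multi-dimensional SNVP vector and the threshold generalizes to a separating hyperplane, as in the classifiers named in the text.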

To the best of our knowledge this is the first study that formulates the concept of SNVP. It may bring a new methodology of quantitative sequencing analysis at the gene level. Our proposed SNVP may also be a quantitative variable similar to gene expression level,14 as differences in SNVP may reflect differences in medication response or disease risk. Our approach may be complementary to genome-wide genotyping data analysis by testing newly found candidate genes. In this study, we used WGS data to calculate the SNVP. However, it would be sufficient to obtain targeted sequencing data of specific regions of interest; this would allow for a more efficient use of funding, bioinformatics resources and time.