Introduction

With the growth of digital devices, there are many complaints nowadays about a variety of eye-related symptoms such as eye strain, blurred vision, and double vision, and it is collectively referred to as digital eye strain or computer vision syndrome (CVS). In addition, it is also known to be associated with non-ocular symptoms such as headaches, and shoulder and neck pain. The American Optometric Association has defined CVS as a combination of eye and vision problems linked to the use of visual display terminals (VDTs)1. The diagnosis is carried out based on a comprehensive examination including visual acuity measurements, testing for refractive error and focusing movement in addition to medical inquiries about symptoms the patient has experienced. Although CVS is associated with a variety of symptoms, Sheedy et al.2 stated that the underlying problem can be identified by the location of the symptoms and categorized into the two groups of internal and external symptoms. According to the general review by Blehm et al.3, there are four types of symptoms (asthenopic, ocular surface-related, visual, and extra-ocular) and they can be subdivided into three main potential pathophysiological causes (ocular surface mechanisms, accommodative mechanisms, and extraocular mechanisms). Numerous studies have been conducted so far to address health and safety issues for VDT users, but the efficacy of proposed treatments for minimizing CVS-related symptoms have not been adequately demonstrated yet3,4. A better understanding of the underlying physiology behind CVS is important to enable more accurate diagnosis and treatment.

The genome-wide association study (GWAS)5 has become a common genetic approach to identify potential susceptibility genes for diseases. To date, a number of GWAS have been successful in identifying the genetic pathways associated with ocular diseases such as myopia, age-related macular degeneration and glaucoma6,7. Although GWAS has brought forth an adequate amount of knowledge on the genetic etiology of ophthalmic traits, through relatively easy to recruit severe cases, more common diseases or syndromes will require an increase in sample size and well-designed analytical methods. No GWAS therefore has been reported for CVS, and we need an elaborate method to integrate multiple symptoms related to CVS, which cannot be directory measured, into a model in order to precisely evaluate genetic effects using GWAS.

With the establishment of cohorts with more detailed phenotypic information, multi-trait GWAS, which is a multivariate analysis on the same subjects, has also become possible. Multivariate methods are generally considered more powerful than univariate methods unless only one trait is associated with a genetic marker or all traits are very highly correlated8. Many methods have been proposed and applied so far, including dimensionality reduction using principal component analysis (PCA), multivariate mixed models as an extension of regression analysis, Bayesian polynomial regression and so on8,9. In recent years, methodologies using structural equation modeling (SEM) have been applied for multi-trait GWAS10,11. SEM is a multivariate statistical method in which the interactions between variables are described as simultaneous equations and their parameters are inferred12. Unlike other methods, the SEM-based method can incorporate latent variables, which are hypothetical quantities that have not been measured and can be inferred from other measured variables, into the model. A SEM-based multi-trait GWAS is attained by assuming a model structure in which a genomic variant is an exogenous predictor that influences observed variables via the latent factor, and then repeating the statistical test genome-wide for each variant. SEM has enabled investigations of complex traits such as psychology and psychiatry, in which their unmeasurable phenotypes are indexed by multiple symptoms or behaviors. Since this feature is similar to CVS, we applied a SEM-based multi-trait GWAS to CVS, which we expected increased power to identify genetic associations.

In this study, a web-based survey about CVS was conducted on 1998 Japanese volunteers. To investigate the biological mechanisms behind CVS, we performed the SEM-based multi-trait GWAS for the CVS-related multiple symptoms for the first time to the best of our knowledge under the assumption of the existence of a potential phenotype behind the CVS-related symptoms.

Methods

Study population and data collection

Among customers of a direct-to-consumer (DTC) genetic testing service (MYCODE), 1998 Japanese volunteers (50.5% male, mean age 49.3 years) were recruited in August 2019 (Table 1). Customers with prior consent in writing for research recruitment were invited to participate in this study via the company website and email. Participation was voluntary and based on informed consent. The web-based questionnaire survey was conducted with the participants. A total of 14 questionnaire items covering the four categories defined by Blehm et al.3 were designed (Table 2). Each item was measured on a scale of one to seven in each of the two conditions: “When you feel like you have used your eyes” and “When you feel like you have not used your eyes very much”. These two conditions were set to simulate the answers before and after the VDT task, and the difference (delta) of these two conditional answers was used in the following analysis. In addition to that, self-reported medical history was obtained on the presence or absence of dry eye syndrome, presbyopia and other common eye diseases. This study received approval from the institutional review board of DeNA Life Science, Inc. and was implemented in accordance with the Declaration of Helsinki.

Table 1 Characteristics of participants in this study.
Table 2 List of CVS-related questionnaires and statistics on responses.

Genotyping and quality control

Participants collected their own saliva samples using the kit sent to them when they used the service before participation in this study, and DNA extraction and genotyping were performed at the laboratory of DeNA Life Science, Inc. Infinium OmniExpress-24 or Human OmniExpress-24 BeadChip (Illumina, Inc.), both of which additionally contained approximately 30,000 custom probes, was used. The stringent quality control (QC) procedures of genotyped results were applied with call rate ≥ 95%, minor allele frequency (MAF) ≥ 0.01, identity by descent < 0.1875, and Hardy–Weinberg equilibrium (p value > 1.0e−6). The gender on the declaration was confirmed to match the genotyped one. All the above procedures were performed using PLINK (ver. 1.9)13. In order to stratify the population, PCA computed the first two principal components based on a sample set of 1000 Genomes Project phase 3 (1KGP; N = 2504) and then projected study samples onto the 2D subspace. Samples falling outside the Japanese cluster were excluded (Supplementary Fig. S1). Eventually, 1966 subjects (999 male) with a total of 540,916 single nucleotide polymorphisms (SNPs) on autosomes ended up passing the QC filter. GRCh37/hg19 (GCF_000001405.13) was the reference for the genome construction information.

Statistical analysis

To implement the multi-trait GWAS based on the SEM methodology, gwsem package (ver. 0.1.17) in R, which was developed by Verhulst et al.14, was introduced. Age, sex, and the top 2 eigenvectors from PCA were added into the model as external covariates. The number of principal components to be incorporated was determined based on examination of the scree plot of their eigenvalues. After running the SEM-based multi-trait association analysis on all variants, p values were subsequently calculated, adjusting each test statistic by a genomic inflation factor. Genotype imputation was conducted on variants which were located within ± 500 kb of GWAS hits, and SEM-based GWAS was rerun on estimated genotypes. Eagle2 (ver. 2.4.1)15 was used for the phasing step referring to the East Asian population (EAS; N = 504) of 1KGP, and then imputation was performed using Minimac3 (ver. 2.0.1)16. The accuracy index of imputation performance, correlation between reference and predicted genotypes (R squared), was 0.942 ± 0.116 for common variants (MAF ≥ 0.05; see Supplementary Fig. S2). Imputed variants with R squared < 0.7 were removed for QC purpose. GWAS hits meeting a suggestive level of significance (p value < 1.0e−5) were considered lead variants, and the locus was defined as a region containing highly correlated variants (R squared > 0.6) with each lead variant in the 1KGP EAS population.

Colocalization between GWAS summary statistics and expressed quantitative trait locus (eQTL) signals was analyzed based on a Bayesian statistical methodology proposed by Giambartolomei et al.17 We utilized GSE115828 data (N = 523)18, the only publicly available retinal tissue eQTL data from the Gene Expression Omnibus (GEO) of the National Center for Biotechnology Information as of May 2020. The whole blood eQTL data in the Genotype-Tissue Expression (GTEx ver. 7; N = 369)19 was analyzed as well. In addition, two data sources for eQTL data (GSE5335120 and hum0099.v121) were added, available from GEO and the National Bioscience Database Center and consisting of more than 100 healthy Japanese subjects. The former is whole blood data (N = 301) and the latter is eQTL data from whole blood cells and five immune cell populations (CD4+ T cells, CD8+ T cells, B cells, NK cells, and monocytes) of 105 subjects. Note that previous studies19,22 have demonstrated that a sample size of about 100 individuals is sufficient to detect major eQTL effects in various cell types collected in a single population. Posterior probability of H4 (PPH4) of colocalization was calculated for each ± 100 kb of the identified loci containing lead variants using the coloc package in R.

Results

Model specification via cause-and-effect graph

It is assumed that the 14 questionnaire items are observed symptoms due to a potential syndrome, that is CVS, and multiple items will be observed in conjunction and correlated to some extent. The results of 1998 respondents were used to calculate pairwise partial correlation coefficients between items, and positive correlations among all items were verified (Supplementary Fig. S3). The PCA calculated from the eigenvalue vectors of the partial correlation matrix showed that all items were in the same direction on the first PC, and they appeared to form clusters corresponding to the four categories to which each item belonged on the second PC, though not completely (Fig. 1a). Next, covariance selection23, a graphical modeling technique, was applied to derive a partial correlation network consisting of only well-fitting edges between items by stepwise removing edges with low partial correlation coefficients (Fig. 1b). Although q5 "watery eyes" was a singleton that did not share edges with any of the items, it could be seen as forming one large network as a whole. Since the above suggests that each of the four categories captures its own characteristics among belonging items, the items were integrated (averaged) and the representative values of each category were used in the following analysis. In addition, we constructed a factor analysis model with one common factor for all categories (Fig. 2). Using OpenMx package (ver. 2.17.3) in R to calculate goodness of model fit, the Chi-squared statistic was 22.5 (degrees of freedom = 9; p value = 7.4e−3) and the root mean square error of approximation (RMSEA) was 0.028 (confidence interval 0.00–0.045). The closer the RMSEA is to zero, the better the fit to SEM, and a value below 0.05 is desirable24. This model was considered to be robust for SEM-based multi-trait GWAS.

Figure 1
figure 1

(a) PCA results for questionnaire items based on the partial correlation matrix. The length of arrows indicates contributions of the variables to the principal axes. The proportion of variance captured is given as a percentage for both the first and second principal components. (b) Partial correlation network estimated by covariance selection. A dense subnetwork of high partial correlated edges is called a clique, and the existence of a clique would suggest the presence of the same number of latent factors. No cliques with edges all having values greater than 0.2 were found in the network. For model optimization, gmm package (ver. 1.6-5) in R was used to minimize Akaike information criterion.

Figure 2
figure 2

Node diagram of SEM for multi-trait GWAS. The observed variables (four symptom categories) are presented as rectangles and the latent variable as ellipses. Note that each symptom category was calculated by the average of component questionnaires. The association between the latent factor and the observed variables are estimated by the factor loadings (γk). GWAS analysis infer the strength of the causal effect (β) from each SNP to the phenotypes via the latent variable. Age and gender, as well as the first two principal components of the PCA performed for population stratification, were added as covariates. The arrows describe the relationships between the variables. Residuals (δk) and variances (σ) are drawn as double headed dashed arrows into an object.

Multi-trait GWAS

SEM is a form of graphical modeling that uses graphs to assume causalities between observed and latent variables and estimates the strength of the causalities. We introduced a latent variable into the model, where the causal effect of each variant on the questionnaire items (observed variables) was transmitted via the latent variable (Fig. 2). Comprehensive association analysis on each variant was performed to infer the strength of association between the variant and latent factor, and the statistical significance was calculated as a p value. A quantile–quartile (Q–Q) plot was generated between the actual observed p values and the p values as theoretically expected, showing no significant population stratification (Supplementary Fig. S4). The output of GWAS can be represented by a diagram called a Manhattan plot, where each variant’s strength of association (− log10 of p value) is relative to the genomic position (Fig. 3). Although there were no hits with the genome-wide significance level (5.0e−8), 12 peaks with variants at the top that met the suggestive level of significance were detected (Table 3). For each of the peaks, regional association plots were generated using LocusZoom (ver. 1.4)25, and they all formed a single peak at each close location on the genome (Supplementary Fig. S5). When GWAS was performed again with the addition of other covariates (daily VDT hours, glasses and/or contact lens use, and eye drop use), there was no noticeable change in the relative trend, although the overall signal weakened slightly.

Figure 3
figure 3

Manhattan plot of multi-trait GWAS results. The more significant the association, the higher the "skyscraper.” The red horizontal line represents the genome-wide significance threshold, while the blue line corresponds to the suggestive threshold.

Table 3 List of identified loci via multi-trait GWAS.

We performed single-trait GWAS on each of the four categories using gwsem, and several peaks detected by the multi-trait GWAS were identified as well (Supplementary Fig. S6). As an example, the peak upstream of the sidekick cell adhesion molecule 1 (SDK1) gene on 7p22.2 are seen in the "ocular surface-related" and "visual" categories, while the others also show a less significant association. Furthermore, since similarly consistent peaks can be found using PLINK, which is a gold standard tool for GWAS, there is little concern about bias due to the statistical processing method.

GWAS and eQTL colocalization analysis

None of the lead variants detected from multi-trait GWAS were located on a protein coding region. We examined the influence of detected variants on gene expression levels by utilizing publicly available eQTL data sources. In order to identify genes for which the expression level is affected by each variant, we focused on colocalization between the distribution of GWAS and eQTL signals. In other words, the p value of the GWAS result for each variant was compared with that of eQTL on a genome scale to see if they both show significant p values at the same position. Calculating a posterior probability (PPH4) when GWAS and eQTL are associated and share a single causal variant, combinations of locus, genes and expression tissue with > 0.5 of PPH4 were comprehensively explored. As a result, four combinations were detected (Table 4). When using a criteria > 0.75 of PPH4 strongly suggesting GWAS and eQTL signals are due to a common causal variant26, a strong colocalization (PPH4 = 0.906) between GWAS signals at the locus (1,625,705–1,627,501) ± 100 kb on chromosome 2 and eQTL signals on the myelin transcription factor 1 like (MYT1L) gene in retinal tissue was detected. As seen in the regional association plots generated by the locuscomparer package (ver. 1.0.0)27 in R, the p values distribution in GWAS and eQTL signals showed a similar shape and the high degree of correlation was visually confirmed as well (Fig. 4). This suggests that rs9677043, which has the strongest p value for both GWAS and eQTL, is a highly possible causal variant. Note that rs9677043 is located approximately 166 kb downstream of the MYT1L gene. We analyzed the association of rs9677043 with other self-reported diseases (presbyopia, dry eye syndrome, severe myopia, cataract, and glaucoma), but no significant associations were found.

Table 4 List of loci with a high colocalization identified via GWAS and eQTL colocalization analysis.
Figure 4
figure 4

Regional plots of GWAS and eQTL signals on MYT1L gene in retina. Variants are colored based on their correlation with the labeled top variant, which has the smallest p value in the region.

Discussion

The MYT1L gene, the expression level of which showed a strong colocalization (PPH4 > 0.9) in retina, is a paralog of MYT1 and encodes a member of the zinc finger superfamily of transcription factors. It is mainly expressed in neural tissue and is known to play an important role in neuronal differentiation by specifically suppressing the expression of non-neuronal genes28. The eQTL data showed rs9677043, detected as a possible causal variant, has a positive effect on MYT1L expression levels in retinal tissue. The path coefficient β estimated by SEM was 0.132, showing that the mutation of rs9677043 worsens the observed symptoms. This suggests that the differentiation of neurons in retinal tissue has a potential to be associated with CVS.

Looking at other identified loci, Chr1:48,233,438–48,233,438 and Chr7:3,319,615–3,715,120 have relatively strong GWAS signals (p value < 1.0e−6), and they were associated with the TraB domain containing 2B (TRABD2B) and SDK1 genes according to the position on the genome, respectively. TRABD2B is a metalloproteinase having the function of a negative regulator of the Wnt signaling pathway, which has been reported to regulate the transdifferentiation of retinal nerves into the ciliary body29. On the other hand, SDK1 encodes a protein in the immunoglobulin superfamily taking a critical part of the immune response, and in addition, is thought to work in selective synapse formation between retinal neurons because of its ability to adhere specifically to another SDK1 on the surface of other cells30. All of these genes are associated with retinal nerves or a wide range of neural tissues. As chronic inflammation is reported to be one factor in neuronal dysfunction in the retina31,32, immune response also might provide clues to investigate the physiology of CVS.

We will describe the limitations of this study from the following four of viewpoints: (1) self-reported questionnaires via a web browser, (2) sufficiency of the level of statistical significance, (3) impact of racial differences in eQTL analysis, (4) need for validation study using different sample groups.

There is no doubt that web-based questionnaires are less clinically reliable than diagnosis by a doctor. Furthermore, it has been pointed out that there are some problems, such as the bias caused by the web-mediated process, compared to traditional laboratory questionnaire interviews33. On the other hand, one of the DTC genetic testing services, 23andMe in the US, has uncovered numerous biological findings utilizing self-reported medical history and web-based questionnaires from a large group of customers34. This is a new research approach that is driven by mass data and this study is a continuation of that trend.

The ordinary GWAS has established a very strict criteria—a genome-wide significance level. This study instead used the suggestive significance level to give a biological interpretation. The main reasons for the low statistical power may be the subjective way in which the trait was assessed, the wide spectrum of traits, and the small number of samples. Considering the original purpose of GWAS, we should find its meaning and value in the discovery of unknown important molecular mechanisms or new markers for diagnosis. In fact, other studies have led to an extended list of associated variants, which can provide a resource for functional studies, by using more exploratory criteria35. This requires sufficiently narrowing the window for the possibility of false positives, and this study used colocalization with eQTL to give additional validity to biological certainty.

It has been noted that the expression levels of many genes are variable between human populations, and it is mainly explained by differences in genotype frequencies20. We used HaploReg (ver. 4.1)36 to check rs9677043 detected from the European retinal eQTL data, and the MAF in the European and East Asian populations were 0.53 and 0.47, respectively. Although there is no large difference compared to this study (0.464), it is desirable to obtain retinal tissue eQTL data from the Japanese population, considering the fact that transcriptional regulation at the gene level cannot be explained by a single variant alone.

GWAS can be positioned as a “forward genetic” experiment to identify the genetic basis of a trait and provide an opportunity for exploratory hypothesis generation. In order to strengthen the hypothesis, it is ideal to conduct external validation on different sample groups, and this study is no exception. On the other hand, the common study design of examining the functional role of genetic variation is to knock out a specific gene in a model organism. Such an experiment based on “reverse genetics” is necessary for hypothesis verification. In recent years, national biobanks and DTC genetic testing services provide a large-scale cohort linked to genetic and rich phenotypic information, mainly obtained from electronic medical records. An approach of extending GWAS to the entire set of phenotypic entities is called the phenome-wide association study (PheWAS)37, and it allows “reverse genetic” experiments to be virtually conducted on human subjects38,39. The proof-of-principle for the hypotheses generated in this study is expected by making use of cohorts in the future. Besides, the Mendelian randomization approach is an extension of PheWAS to test for causal relationships between phenotypes, and the SEM methodology has been used in many cases40,41.

Last of all, we will discuss the benefits of the use of SEM for multi-trait GWAS. The path coefficients estimated by SEM reflect the strength of each causal relationship and quantify the influence of a variant on the target phenotype via intermediate variables (latent factors) and predefined causal paths. For instance, the positive path coefficient from an intermediate variable to an observed variable suggests that an increase in the intermediate variable directly leads to an increase in the observed variable. In the case of this study, the coefficients (γ) from the four categories (asthenopic, ocular surface-related, visual, and extra-ocular) to the intermediate variable were 0.842 ± 0.024, 0.657 ± 0.018, 0.787 ± 0.027, and 0.823 ± 0.022, respectively, and “asthenopic” and “extra-ocular” were found to be relatively strong symptoms for CVS. Thus, we believe that SEM has the potential to enhance our understanding of complex phenotypes, such as syndromes, by genetically unraveling the relationship between observed symptoms. SEM is expected to be a core technique for performing a series of association analyses from GWAS to PheWAS.