The development of common disorders involves a complex interplay of multiple factors, and we may assume that many genes and many environmental factors are involved. Future gene–environmental cohort studies will no longer focus on single-gene abnormalities or on the major environmental risk factors that have been studied to date, such as smoking in the case of lung cancer; rather, they will concentrate on identifying risk factors that have not been discovered by conventional epidemiological methodologies.1 Such risk factors cannot be predicted precisely in advance, because they will be revealed only after detailed analyses of the interactions among multiple factors (between genes and environmental factors, or among genes, for example). We may reasonably expect that combinations of these factors have an important role in the occurrence of a variety of common diseases, such as cancer, diabetes mellitus and cardiovascular diseases. For these reasons, contemporary genomic cohort studies require the extensive collection of environmental and genetic information at the population level rather than from patients with specific disorders. Such a study design will enable researchers to formulate and test a wide range of hypotheses.

Launching a contemporary genomic cohort study requires enormous resources. Developed countries are investing in full-scale efforts to launch studies with cohorts comprising hundreds of thousands of local participants to meet the research objectives mentioned above. Such endeavors, including the ongoing United Kingdom Biobank with 500 000 volunteers to date and the prospective cohort studies being planned in the United States, are termed ‘biobank’ projects.1 In Japan, building a contemporary gene–environmental cohort is an urgent task.

Yamagata University, which has a long history of traditional cohort research projects dating back some 30 years,2, 3 started the Yamagata Molecular Epidemiological Cohort Study when its Global Center of Excellence (G-COE) program, entitled ‘Formation of an International Network for Education and Research of Molecular Epidemiology’ (2008–2012), was approved by the Japanese Government. A summary of our study is shown in Table 1. As of March 2012, 9000 participants (72.6% of the health check-up examinees; 3589 men and 5411 women; median age at baseline 64 years) had donated their DNA to the study. The study shares essentially the same platform as the Japan Multi-institutional Collaborative Cohort Study (J-MICC Study).4

Table 1 Summary of the study

We aim to conduct genome-wide gene–gene and gene–environmental interaction analyses using single-nucleotide polymorphisms (SNPs) in this study. Such analysis is an attractive way to identify genetic components that confer susceptibility to complex human diseases. However, individual hypothesis testing for SNP–SNP pairs, as in common genome-wide association studies (GWASs), makes it difficult to set overall P-values because of the complicated correlation structure among the tests; this multiple testing problem causes unacceptable false-negative results.5, 6 Specifically, it is difficult to set a genome-wide significance level using statistical methods such as Bonferroni correction, which give prohibitively conservative results because they fail to incorporate the correlation structure among the hypotheses. For instance, the total number of hypotheses for gene–gene interaction is about 10^11 to 10^12 in standard GWAS data, so the Bonferroni-corrected significance level must be extremely small; that is, the correction factor is <10^-11. No efficient and universal multiple testing method for such a huge set of hypotheses with a complicated correlation structure has been proposed so far.7 The number of SNP–SNP pairs also exceeds the sample size (the so-called large p small n problem, or curse of dimensionality),8, 9 which precludes simultaneous analysis using multiple regression. To overcome these issues, our research team at Yamagata University developed an approach that uses sure independence screening (SIS), an up-to-date method for ultrahigh-dimensional variable selection, to handle numerous SNP–SNP interactions appropriately by including them as predictor variables in logistic regression.7 The team has also implemented the procedures in a software program, EPISIS, using cost-effective GPGPU (general-purpose computing on graphics processing units) technology. EPISIS can complete an exhaustive search for SNP–SNP interactions in a standard GWAS data set within several hours. We plan to use EPISIS to conduct SNP–SNP and SNP–environmental interaction analyses for associations with diseases, biological phenotypes or biomarkers. Analysis using whole-genome information will be conducted in the future.
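
The scale of the multiple testing problem, and the general idea of screening interaction terms one at a time, can be illustrated with a small sketch. The Python code below is not EPISIS and does not reproduce its statistics. It assumes a hypothetical array of 500 000 SNPs to count the SNP–SNP pairs, and it ranks pairwise product terms by absolute marginal correlation with the phenotype, a simplified stand-in for the marginal logistic regression fits that a screening step for a case–control outcome would more plausibly use. All function names, the 0/1/2 genotype coding and the simulated data are illustrative assumptions.

```python
import itertools
import math

import numpy as np


def n_snp_pairs(n_snps: int) -> int:
    """Number of unordered SNP-SNP pairs among n_snps markers."""
    return math.comb(n_snps, 2)


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level under Bonferroni correction."""
    return alpha / n_tests


def screen_pairs(genotypes, phenotype, keep):
    """Rank SNP-SNP product terms by marginal association and keep the top ones.

    Simplified screening sketch: each pairwise product of genotype columns is
    scored on its own by absolute Pearson correlation with the phenotype, so no
    joint model over all pairs is ever fitted. Returns `keep` column-index pairs.
    """
    g = np.asarray(genotypes, dtype=float)
    y = np.asarray(phenotype, dtype=float)
    y = y - y.mean()
    scores = []
    for i, j in itertools.combinations(range(g.shape[1]), 2):
        x = g[:, i] * g[:, j]          # interaction term for the pair (i, j)
        x = x - x.mean()
        denom = math.sqrt(float(x @ x) * float(y @ y))
        score = abs(float(x @ y)) / denom if denom > 0 else 0.0
        scores.append((score, (i, j)))
    scores.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in scores[:keep]]


if __name__ == "__main__":
    # Assumed array size; real GWAS panels differ.
    pairs = n_snp_pairs(500_000)
    print(f"pairs: {pairs:.3e}, Bonferroni level: {bonferroni_threshold(pairs):.1e}")

    # Toy data: 300 individuals, 8 SNPs coded 0/1/2, with case status driven
    # by the product of SNPs 2 and 5 plus noise (purely simulated).
    rng = np.random.default_rng(1)
    geno = rng.integers(0, 3, size=(300, 8))
    risk = geno[:, 2] * geno[:, 5] + rng.normal(0.0, 1.0, size=300)
    status = (risk > np.median(risk)).astype(float)
    print(screen_pairs(geno, status, keep=3))
```

In a workflow of this kind, the pairs that survive screening would then enter a single logistic regression, which becomes feasible once the number of retained terms is below the sample size. The exhaustive loop here is written for clarity only; at genome-wide scale it would need the kind of parallel hardware the text describes.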

We obtained informed consent from each participant after a group orientation about the study, using participant-friendly materials that are available at our website (http://gcoe.id.yamagata-u.ac.jp/jp/cohort/). The main study protocol was approved by the Ethics Committee at Yamagata University School of Medicine.

We have described the design of a gene–environmental cohort study that aims to unveil how genetic and environmental factors jointly influence the risk of developing common human diseases. The strengths of our study are as follows: the capacity to construct a large cohort of about 20 000 participants; a study platform with high commonality with other study groups; a novel and useful statistical method and its software implementation; and high-quality cancer registration. A study cohort of 20 000 participants is being gathered for contemporary genomic analysis during the G-COE program. After the G-COE program, the cohort is planned to expand to cover all of Yamagata prefecture. The ultimate target is a study cohort comparable in size to the United Kingdom and United States genomic cohorts. For this purpose, we are collaborating with large cohort studies in Japan, such as the J-MICC Study, as well as expanding our own cohort. We also plan to share information on our newly developed genomic analysis7 with other study groups.