Introduction

The fundamental nature of mental disorders remains poorly understood, but genetic factors have an important role.1, 2, 3, 4, 5, 6, 7 Considerable progress in psychiatric genetics has been made in recent years, based on large samples and international collaborations, for example, through the pivotal efforts of the Psychiatric Genomics Consortium.8 We can expect that larger samples will reveal new insights into common and rare variants underpinning mental disorders.9

Many environmental factors influencing pre- and postnatal development are associated with schizophrenia, bipolar affective disorder, autism and attention-deficit/hyperactivity disorder, and furthermore, adverse life circumstances increase the risk of mental disorders. Gene–environment synergism contributes to the aetiology of these disorders, but suitable datasets to explore this important field of research have been lacking. To understand the impact of genes and environments over the life course, large and truly population-based longitudinal cohort studies are required.10, 11

As part of the Lundbeck Foundation Initiative for Integrative Psychiatric Research (iPSYCH: http://iPSYCH.au.dk/), a large case–cohort study has commenced. In most countries, it would not be logistically feasible to compile large, representative population-based samples. In Denmark, the existence of (a) a universal public health care system free of charge, (b) several national longitudinal registers and (c) strict ethical and data protection legislation required to safeguard the privacy of study participants, has provided a remarkable research platform.12 Recent technological developments and a new legal framework for use of bio-banked material for research have created similar possibilities for genetic research.

The vision of iPSYCH was to leverage these combined resources, considering the entire national cohort as our study population. We utilized information on individuals with a diagnosis of selected mental disorders (N=57 377) and a randomly sampled cohort13, 14 of the general population (N=30 000). The sample is known as the iPSYCH Danish case-cohort study (iPSYCH2012). We used neonatal dried blood spots from the Danish Neonatal Screening Biobank to investigate detailed genetic and biomarker information, some of which are markers of environmental exposures. The rich Danish population-based registers were used to add information on all individuals and all their relatives. Thus, we created a comprehensive data source for the combined study of genetic and environmental aetiologies of severe mental disorders. Within the iPSYCH2012 sample, currently around 77 500 individuals have been array genotyped and around 20 000 have been whole exome sequenced. Ten thousand samples have been analysed for ranges of cytokines and neurotrophic factors. Epigenetic and metabolome data from several thousand samples are emerging. For the entire sample and their relatives, detailed longitudinal information related to health, prescribed medicine, social and socioeconomic information exists. This study provides a general overview of the sample design and outlines future research.

The overall design

Individuals diagnosed with schizophrenia, mood disorders, bipolar affective disorder, autism and attention-deficit/hyperactivity disorder were identified through linkage between Danish population-based registers along with a random sample of the same population that supplied the cases.15 Dried blood spots for virtually all individuals were retrieved from the Danish Neonatal Screening Biobank and processed for genotyping. The design includes the ability to efficiently analyse prospectively collected cohort data within the iPSYCH case–cohort sample.15 This particular design provides several advantages: as the cohort is randomly selected from the entire population, we are able to generate unbiased absolute risks and incidence rates and to estimate effect sizes of genetic markers on the risk of mental disorders that are representative of the entire Danish population. To date, most genetic and epidemiological studies have been based on convenience case-control samples, which are prone to biases.15, 16 The iPSYCH2012 sample was preceded by four smaller Danish samples,17, 18, 19, 20, 21, 22, 23, 24 all aiming to investigate the potential interplay between genes and the environment. Collectively, these forerunners informed the choice of study design for the iPSYCH2012 sample (Supplementary Text 1). The following three sections describe the resources and methods used to identify individuals included in the iPSYCH2012 sample.

Selecting the study base

The Danish Civil Registration System was established in 1968,25 when all people alive and living in Denmark were registered. It includes information on the unique personal identification number, sex, date and place of birth, parents’ identifiers and continuously updated information on emigration and death. The personal identification number is used in all national registers, enabling accurate linkage within and between registers. The study base included all singletons with known mothers born between 1 May 1981 and 31 December 2005, who were alive and resident in Denmark on their first birthday (N=1 472 762 persons). Selecting births in this period ensures that individual samples can be retrieved from the Danish Neonatal Screening Biobank and gives a reasonable distribution of cases and cohort members across all birth years. All residents are registered in the Danish Civil Registration System irrespective of health, income, receipt of social benefits, employment and other socioeconomic characteristics.26

Diagnoses of mental disorders

Persons within the study base were linked via their personal identifier to the Danish Psychiatric Central Research Register27 to obtain information on mental disorders. The Danish Psychiatric Central Research Register was computerized in 1969 and contains data on all admissions to Danish psychiatric in-patient facilities. Information on outpatient visits was included from 1995 onwards. From 1994 onwards, the International Classification of Diseases, 10th revision, Diagnostic Criteria for Research was used for diagnostic classification.28 All persons within the study base, who had a diagnosis of schizophrenia, bipolar disorder, affective disorder, autism and attention-deficit/hyperactivity disorder were included (Table 1). At the time of linkage, the Danish Psychiatric Central Research Register contained all psychiatric contacts until 31 December 2012. Table 1 summarizes the number of individuals across the diagnostic groups.

Table 1 Number of persons included in iPSYCH's population-based sample of the Danish population born 1981–2005

Selecting the population-based cohort

Among the 1 472 762 persons included in the study base, a total of 30 000 persons were chosen uniformly at random (Table 1) corresponding to 2.04% of the study base (=30 000/1 472 762). As the cohort members were chosen randomly, some cohort members may also have the disorders of interest.13, 14 Thus, the cohort selected is representative of the entire Danish population born in the same period.26 In addition, the cohort members are at risk of developing the disorder of interest during follow-up, whereas controls are typically conditioned to be healthy until the study ends.29 We have thereby identified the individuals to be included in the iPSYCH2012 sample. Next, we describe the enrichment with genetic and other biomarker data.
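The cohort selection described above can be illustrated with a minimal sketch. It assumes the study base is represented simply by integer indices; the function name and the seed are hypothetical, and the actual selection procedure is not specified beyond being uniform at random.

```python
import random

STUDY_BASE_SIZE = 1_472_762  # persons in the study base (from the text)
COHORT_SIZE = 30_000         # size of the random cohort (from the text)


def draw_cohort(n_base, n_cohort, seed=2012):
    """Draw n_cohort distinct indices uniformly at random, without replacement.

    The seed is an arbitrary illustrative value used only for reproducibility.
    """
    rng = random.Random(seed)
    return rng.sample(range(n_base), n_cohort)


cohort = draw_cohort(STUDY_BASE_SIZE, COHORT_SIZE)
sampling_fraction = COHORT_SIZE / STUDY_BASE_SIZE
print(f"{100 * sampling_fraction:.2f}% of the study base")
```

Because every member of the study base has the same selection probability, persons who later become cases can also appear in the cohort, which is what distinguishes this design from a sample of healthy controls.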

The Danish neonatal screening biobank

Blood spots for individuals included in the iPSYCH2012 sample were retrieved from the Danish Neonatal Screening Biobank within the Danish National Biobank.30 This facility stores, at −20 °C, dried blood spot samples taken from practically all neonates born in Denmark since 1 May 1981. These samples were collected primarily for diagnosis of congenital disorders. The samples are stored for follow-up diagnostics, screening, quality control and research. At the time of blood sampling (4–7 days after birth), parents are informed in writing about the neonatal screening and that the blood spots are stored in the Danish Neonatal Screening Biobank and can be used for research, pending approval from relevant authorities. The parents are also informed about how to prevent or withdraw the sample from inclusion in research studies.

Genotyping was based on two blood spot punches of 3.2 mm, equivalent to 6 μl of whole blood.30 Biological components are generally very well preserved in neonatal dried blood spot samples, in particular if the samples are stored at −20 °C. However, it may be challenging to analyse the samples due to the very limited amount of biological material available, the nature of dried whole blood on filter paper and decades of storage. In particular, the determination of the concentration of biomarkers in dried blood spots is less precise than in serum. This calculation is based on the assumption that one punch 3.2 mm in diameter is equivalent to 3 μl of whole blood, which only applies if the filter paper is fully and evenly saturated. Moreover, measurements are performed on whole blood containing various cell types that may have an influence on the concentration of certain components. The hematocrit, which is usually unknown, is also an important factor for blood components that do not re-distribute into red blood cells. Special highly sensitive assays may be required, and multi-analyte measurements are preferred to get as much information from the limited samples as possible. Neonatal dried blood spots are suitable for next generation sequencing,31 DNA methylation profiling,32 metabolome profiling, vitamin D,33 multiplex measurements of cytokines,34 antibodies to infectious agents19 and whole transcriptome analysis through microarray35 and RNA-seq.36 Importantly, these measurements are made in samples drawn a few days after birth, meaning that case-control differences cannot be ascribed to disease-related confounders such as medication, alcohol or substance use, smoking or the disease state itself.

Systematic comparisons of genomic DNA versus whole-genome-amplified DNA37 reveal increased signal noise. Although this has very little impact on genotype calls, it is problematic for Copy Number Variation detection algorithms such as PennCNV.38 Efforts within the iPSYCH community are making progress towards solving the noise issues. Technical replicates using RNA microarrays reported in Grauholm et al.35 indicated high reproducibility, independently of spot size, and indicated that the critical factor is storage conditions rather than storage length. Ho et al.39 found differences between cerebral palsy cases and matched controls using dried blood spots from the Michigan neonatal screening. Combined, these reports strongly indicate that it is possible to do meaningful transcriptome experiments despite prolonged storage at perceived sub-optimal conditions.

Preparation of samples for genotyping and sequencing from the Danish neonatal screening biobank

DNA was extracted and whole genome amplified at the Statens Serum Institut following previously established procedures.40, 41 The sample flow is described in Figure 1.

Figure 1

The selected samples were matched with their DNSB identifiers and entered into an in-house developed selection database (Steps 1 and 2). Sample identities were then validated and assigned a pseudonymized unique ID (Step 3) before cutting two discs of 3.2 mm of dried blood into a 96-well PCR plate (Step 4). Proteins were washed off the blood spots and stored at −80 °C before DNA was extracted using Extract-N-Amp Blood PCR Kit (Sigma-Aldrich, St Louis, MO, USA) (Step 5). DNA was amplified in triplicate using REPLI-g (Qiagen, Hilden, Germany) and combined to a single sample (Step 6). Finally, concentrations were quantified using Quant-iT picogreen (Invitrogen, Carlsbad, CA, USA) (Step 7) and a genetic fingerprint established using the iPLEX pro Sample ID panel (Agena Bioscience, Hamburg, Germany) (Step 8) before aliquoting a fraction of the sample for genotyping (Step 9).

Array genotyping and quality control

Samples were processed at the Broad Institute (Boston, MA, USA) using the Infinium PsychChip v1.0 array (Illumina, San Diego, CA, USA) in accordance with the manufacturer’s instructions.42 Genotyping was conducted in 25 waves. Variant calls were trained using GenTrain2 (Illumina) on the first wave (4146 samples) using the PsychChip 15048346 B manifest and GenomeStudio version v2011.1. Following autoclustering, loci were manually curated if they had a call frequency below 90%, GenTrain scores below 0.5 or cluster separation below 0.2. During this processing, 3890 loci were excluded and 928 were manually modified. The resulting GenTrain was used to produce GenCall variant calls used for sample level quality control of the entire cohort.43 Samples with call rates below 95% (N=2270) were designated as failing sample quality control (QC). Sex was inferred from heterozygosity on chromosome X (below 20% in males, above 20% in females). Sex obtained from genotyping was compared to the sex recorded in the Danish Civil Registration System and mismatches were excluded. It is extremely unlikely to observe errors in recorded sex in the Danish Civil Registration System.26 About 0.25% (N=224) of the sample did not match the expected sex. Half of the failures (N=119) were due to abnormal structural variation on chromosome X (aneuploidy and loss of heterozygosity). The other half were due to sample mix-ups (N=103). In this study we describe the sample QC only and not the subsequent single-nucleotide polymorphism QC, which varies between studies.
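The two sample-level checks described above (call rate and sex concordance) can be sketched as follows. This is an illustration under stated assumptions, not the pipeline actually used: the `Sample` record, its field names and the example values are hypothetical, and the text does not say how a chromosome X heterozygosity of exactly 20% is handled, so the sketch arbitrarily treats it as female.

```python
from dataclasses import dataclass

CALL_RATE_MIN = 0.95    # samples below this fail QC (from the text)
CHRX_HET_CUTOFF = 0.20  # chrX heterozygosity cut-off between sexes (from the text)


@dataclass
class Sample:
    """Hypothetical per-sample summary; field names are illustrative."""
    sample_id: str
    call_rate: float      # fraction of loci with a genotype call
    chrx_het: float       # fraction of heterozygous calls on chromosome X
    registered_sex: str   # "M" or "F", as recorded in the civil register


def infer_sex(chrx_het):
    # Assumption: a value exactly at the cut-off is classified as female.
    return "M" if chrx_het < CHRX_HET_CUTOFF else "F"


def sample_qc(sample):
    """Return (passed, reason) for one sample."""
    if sample.call_rate < CALL_RATE_MIN:
        return False, "low call rate"
    if infer_sex(sample.chrx_het) != sample.registered_sex:
        return False, "sex mismatch"
    return True, "ok"


# Made-up example values, for illustration only.
samples = [
    Sample("s1", 0.99, 0.01, "M"),
    Sample("s2", 0.93, 0.30, "F"),
    Sample("s3", 0.98, 0.31, "M"),
]
results = {s.sample_id: sample_qc(s) for s in samples}
print(results)
```

A sex mismatch alone does not distinguish a sample mix-up from chromosome X aneuploidy or loss of heterozygosity; separating those causes needs further inspection that this sketch does not attempt.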

Probe remapping

All probe sequences were queried against an hg19 database using a nucleotide version of the Basic Local Alignment Search Tool. The Basic Local Alignment Search Tool results were compared with the original array manifest, an Illumina update to the array manifest, and the Broad Institute updates to the manifest. The genomic coordinates matched between the Basic Local Alignment Search Tool results and the existing manifests for 95.12% of probes. A further 2.23% of probes were updated based on the new Basic Local Alignment Search Tool results, and 2.11% retained their original mapping. For the remaining 0.54%, the mapping was taken from either the Broad Institute reference or the Illumina update, or the probe was removed from the data set (Supplementary Table 1).

Improving variant calls

GenCall,43 Birdseed44 and zCall45 were used in combination to improve variant calls. GenCall and Birdseed are genotype calling algorithms best suited for common variants, while zCall is a post-processing step for GenCall to improve genotype calling for rare variants. Approximately half of the probes on the array are common variants (minor allele frequency ≥0.05), while the other half are rare variants (minor allele frequency <0.05). A large subgroup of the rare variants are non-polymorphic within the cohort. A consensus genotype call was made from the three calling algorithms (Supplementary Text 2) using PLINK.46, 47
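To make the common/rare split and the idea of a consensus call concrete, here is a minimal sketch. The minor allele frequency computation is standard; the majority-vote rule in `consensus_call` is purely an assumption for illustration, since the actual consensus procedure is given in Supplementary Text 2 and may differ. All function names are hypothetical.

```python
from collections import Counter

MAF_CUTOFF = 0.05  # boundary between common and rare variants (from the text)


def minor_allele_frequency(genotypes):
    """genotypes: alternate-allele counts (0, 1 or 2), with None for missing."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def classify_variant(genotypes):
    maf = minor_allele_frequency(genotypes)
    if maf is None:
        return "uncalled"
    if maf == 0:
        # The text counts non-polymorphic probes among the rare variants.
        return "non-polymorphic"
    return "common" if maf >= MAF_CUTOFF else "rare"


def consensus_call(calls):
    """Hypothetical rule: keep a genotype that at least two callers agree on,
    otherwise set the genotype to missing (None)."""
    counts = Counter(c for c in calls if c is not None)
    if not counts:
        return None
    genotype, n_agree = counts.most_common(1)[0]
    return genotype if n_agree >= 2 else None


print(classify_variant([0, 0, 1, 2]))  # alt-allele frequency 3/8
print(consensus_call([0, 0, 1]))       # two of three callers agree on 0
```

Setting discordant genotypes to missing is a conservative choice: it trades call rate for accuracy, which matters most for rare variants where a single miscall can create a spurious carrier.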

Ethical framework

The Danish Scientific Ethics Committee, the Danish Health Data Authority, the Danish Data Protection Agency and the Danish Neonatal Screening Biobank Steering Committee approved this study. This is in keeping with the strict ethical framework and the Danish legislation protecting the use of these samples.30, 48 Permission has been granted to study genetic and environmental factors for the development and prognosis of mental disorders. To unravel the foundation of severe mental disorders, it is central that this rich data source is accessible to the international research community to the largest extent possible. It is also paramount to protect the privacy of the individuals included in the study. Owing to the sensitive nature of these data, individual level data can be accessed only through secure servers where download of individual level information is prohibited.49 iPSYCH encourages national and international collaboration. For details, please contact Professor Preben Bo Mortensen, Scientific Director of iPSYCH.

Baseline characteristics

Table 2 shows baseline characteristics of the 86 189 individuals included in the iPSYCH2012 sample. Among these individuals, 77 639 (90%) passed sample QC. In the cohort group, males constituted 51% in both the initial and the QC’ed sample. The following numbers refer to the initial sample. Overall, 26 380 individuals were included because of an affective disorder. Among individuals with affective disorder, 543 were incidentally also among the cohort members, that is, the 2.04% random sample of the study base. Overall, 28 812 (96.04%) of the 30 000 cohort members had none of the 5 psychiatric diagnoses until 2012. A total of 49 737 (86.68%) cases and 25 159 (83.86%) cohort members were native Danes. The largest second-generation immigrant group was persons having one or both parents born in Europe, followed by persons having one or both parents born in Scandinavia.
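The counts quoted in the text can be reconciled with a few lines of arithmetic. The sketch below uses only figures stated in this article; the variable names are illustrative, and the crude proportion at the end ignores differences in follow-up time between birth years.

```python
cases = 57_377                     # individuals with a selected diagnosis
cohort = 30_000                    # randomly sampled cohort members
cohort_without_diagnosis = 28_812  # cohort members with none of the diagnoses
passed_qc = 77_639                 # individuals passing sample QC

# Cohort members who are also cases are counted once in the total.
overlap = cohort - cohort_without_diagnosis
total = cases + cohort - overlap

print(f"overlap between cases and cohort: {overlap}")
print(f"total individuals: {total}")
print(f"passed sample QC: {100 * passed_qc / total:.0f}%")
print(f"cohort members with a diagnosis: {100 * overlap / cohort:.2f}%")
```

Since the cohort is a uniform random sample, the last figure is a crude estimate of the proportion of the study base with at least one of the selected diagnoses registered by the end of 2012.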

Table 2 Baseline characteristics of the iPSYCH2012 case–cohort

Comparing the percentage of cases included in the initial sample with the percentage of cases passing the QC revealed no systematic deviations across selected baseline characteristics (Table 2). Comparing the percentage of cohort members included in the initial sample with the percentage of cohort members passing QC also revealed no systematic deviations across baseline characteristics.

Visualization of genetic data by foreign parental origin

To visualize population substructure for the genetic data, a principal component analysis was conducted (Supplementary Text 3). There was a clear correspondence between the first two principal components based on the genome-wide single-nucleotide polymorphism genotypic data and parental country of birth as registered in the Danish Civil Registration System (Figure 2). Individuals born in Denmark to parents born in other Scandinavian countries clustered together with Danish-born individuals with Danish-born parents, as expected. Within each foreign parental region of birth, individuals with two foreign parents were, as anticipated, more genetically divergent than those having only one foreign-born parent. This finding provides strong evidence of internal validity for processing of individual samples and the ability to link to information in the rich Danish registers on an individual level.
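As a rough illustration of how a principal component separates ancestry groups, the sketch below computes the first principal component of a toy genotype matrix by power iteration. It is a teaching example only: the procedure actually used is described in Supplementary Text 3, the genotype values are made up, and real analyses use dedicated software on hundreds of thousands of variants.

```python
import random


def first_principal_component(genotypes, n_iter=200, seed=0):
    """Return unit-length PC1 scores (one per sample) via power iteration
    on the sample-by-sample Gram matrix of column-centered genotypes."""
    n = len(genotypes)
    m = len(genotypes[0])
    # Center each variant (column) at its mean allele count.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    x = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # Gram matrix G = X X^T; its leading eigenvector gives PC1 up to scale.
    gram = [[sum(a * b for a, b in zip(x[i], x[k])) for k in range(n)]
            for i in range(n)]
    rng = random.Random(seed)
    v = [rng.uniform(-1, 1) for _ in range(n)]
    for _ in range(n_iter):
        w = [sum(gram[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = sum(c * c for c in w) ** 0.5
        if norm == 0:  # no variation in the data
            return [0.0] * n
        v = [c / norm for c in w]
    return v


# Toy data: rows are samples, columns are allele counts at six variants.
# The first three samples and the last three form two distinct groups.
toy = [
    [0, 0, 0, 0, 2, 2],
    [0, 0, 1, 0, 2, 2],
    [0, 1, 0, 0, 2, 1],
    [2, 2, 2, 2, 0, 0],
    [2, 2, 1, 2, 0, 0],
    [2, 1, 2, 2, 0, 1],
]
pc1 = first_principal_component(toy)
print([round(s, 2) for s in pc1])
```

In this toy example the two groups receive scores of opposite sign on the first component, which mirrors, in miniature, the separation by parental region of birth described in the text.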

Figure 2

Scatterplot of the first two principal components colored according to parental region of birth. Big circles indicate mean values for the given parental group. Crosses indicate that both parents were born abroad within the region indicated by the color. Absence of a cross indicates one Danish-born parent and one parent born in the region indicated by the color. Persons with unknown information on parental region of birth (N=1088) and persons of mixed parentage (N=366) are not shown.

Perspectives

The large iPSYCH2012 sample will provide a solid foundation for a range of studies in the decades ahead. We have completed genotyping, and plans are advanced for a range of other analyses, including an update and major expansion with cases diagnosed since 2012, as well as including new diagnostic case groups. The sample is thus not only a rich database for research in its current version; it also constitutes a logistic and organizational framework for future studies, although each new study will require relevant ethical permissions. Most other genetic studies are based on samples of convenience rather than utilizing true population-based samples. To our knowledge, no large-scale population-based sample with genome-wide association study data exists elsewhere. In particular, we are confident that the iPSYCH2012 sample provides an important resource to explore novel ways to combine genetic, phenotypic and environmental factors. Phenotypic and environmental factors are readily available through record linkage between the numerous Danish registers or can be assayed from neonatal dried blood spots.

Access to high-quality, population-based person-linked registers has enabled major contributions to psychiatric epidemiology. For example, researchers have documented key risk factors within psychiatric epidemiology, such as urban birth,50, 51, 52, 53, 54 paternal age,55, 56, 57 psychiatric family history,58, 59 life-time risk,60 infections,17, 19, 20 neonatal vitamin D deficiency,61 socio-economic adversity,62 treatment resistant schizophrenia,63 pharmacological treatment,64 suicide65 and excess mortality.66 Key features such as the avoidance of selection bias and control of multiple confounders have been important aspects of these studies. However, genetic studies have traditionally not had access to population-based samples, with cases often recruited from multiplex families, or convenience samples of prevalent cases in contact with mental health services. The iPSYCH2012 sample includes a large representative sample of severe mental disorders from a representative sampling frame. The possibility to link the iPSYCH2012 sample to the comprehensive and high quality Danish population-based registers offers researchers unique possibilities to study the interplay between genetic factors, environmental variables and variables related to health,27, 67, 68, 69 mortality, income and social and socioeconomic characteristics.70, 71, 72 Genetic association studies are by default observational studies, subject to many of the same sources of bias and confounding as other epidemiological studies.16 Therefore, we believe our samples can assist in assessing the potential impact of such biases, and especially of their absence, and point toward new avenues of research. For example, it has been shown that the genetic associations with schizophrenia identified in the seminal Psychiatric Genomics Consortium paper3 were stronger in more chronic cases than in first episode cases.73 This may suggest that, in future studies, the genetic architecture of schizophrenia could perhaps be refined to identify genes particularly associated with the risk of developing disease, and genes particularly predicting a chronic course, something that could have important preventive and clinical implications. Such future studies will benefit from the continued dialogue between epidemiological studies such as iPSYCH and the large-scale studies available only through collaboration in international consortia.

The iPSYCH2012 sample will be able to leverage single-nucleotide polymorphism-derived, genome-wide metrics such as disease-specific polygenic risk scores.74, 75, 76 These provide a continuous measure of liability (rather than a categorical measure of family history), which will greatly enhance our ability to combine genetic, environmental and phenotypic data in disease prediction. We have found higher polygenic loading for schizophrenia in both cases and controls with family histories of mental disorders.77 Also, 48% of the effect associated with family history of psychoses was mediated through the polygenic risk score for schizophrenia.78 To further explore the association between the risk of schizophrenia and the polygenic risk score for schizophrenia, we have investigated the interplay with infections,79 treatment resistant schizophrenia,80 chronicity of schizophrenia,73 and mortality and suicidal behaviour.81
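In its simplest form, a polygenic risk score is a weighted sum of risk-allele counts across variants. The sketch below shows only that core computation; the function name, the example weights and the choice to skip missing genotypes are assumptions for illustration, and the scores used in the cited studies involve further steps (such as variant selection) that are not reproduced here.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages for one individual.

    dosages: risk-allele counts per variant (0 to 2), None if missing.
    weights: per-variant effect sizes, e.g. log odds ratios taken from an
             independent association study (hypothetical values below).
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    # Assumption: missing genotypes are skipped rather than imputed.
    return sum(w * d for w, d in zip(weights, dosages) if d is not None)


example_weights = [0.10, 0.20, -0.05]  # made-up effect sizes
example_dosages = [0, 1, 2]            # made-up genotype for one person
score = polygenic_risk_score(example_dosages, example_weights)
print(round(score, 3))
```

Because the score is continuous, it can enter a regression model alongside register-based exposures, which is what allows it to be combined with environmental and phenotypic data as the paragraph above describes.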

Since the initiation of the iPSYCH2012 sample, other related Danish projects have built on the same framework as that used within iPSYCH, for example, anorexia (5703 cases), obsessive-compulsive disorder (7747 cases), conduct disorder (4205 cases), hyperkinetic conduct disorder (3690 cases) and 1546 twin pairs. All samples gain power in utilizing cohort members within the iPSYCH2012 sample, while also contributing to the unique possibilities of the iPSYCH2012 sample.

Strengths and limitations

Identification of cases within the iPSYCH2012 sample is based on contacts to in- and out-patient psychiatric departments and visits to psychiatric emergency care units in a nation where treatment is provided through the government healthcare system free of charge, and where no private psychiatric hospitals exist. Financial factors are thus less likely to influence pathways to healthcare in Denmark compared to many other nations.82 Unlike samples of convenience, the iPSYCH2012 sample is representative of the Danish population irrespective of (a) recall bias, (b) emigration or death before sampling, (c) institutional care, (d) imprisonment, (e) being homeless, (f) health and (g) socioeconomic status.26 In contrast to most genetic studies, the iPSYCH2012 sample also provides the unique possibility to explore the potential impact of the longitudinal trajectory on causes and outcomes of mental disorders.

Register-based studies like the current study cannot identify persons with untreated disorders or disorders treated in primary health care only. Most cases with mild to moderate mental disorders, for example, mild or moderate depression and anxiety disorders, are thus not registered in the Danish Psychiatric Central Research Register.27 The major strength of the iPSYCH2012 sample approach is the comprehensive clinical assessment of all mental disorders treated in secondary healthcare in a nationwide population. Validation of the clinician-derived key diagnoses (schizophrenia, single depressive episode, affective disorder, attention-deficit/hyperactivity disorder and autism) has been carried out with good results.83, 84, 85, 86, 87, 88

Limitations include that, from an ethical point of view, we are not allowed to re-contact individuals for any reason. At present, it is also unclear to what extent it will be possible to enrich the iPSYCH2012 sample with information from cohorts including more detailed information on study participants (for example, see refs 89, 90, 91, 92).

We believe that the iPSYCH2012 sample will help accelerate psychiatric research into preventing and treating severe mental disorders for the benefit of patients, their families and friends, and society.