Introduction

Genetic association studies aim to identify variants that predict disease susceptibility, prognosis or therapy response. Many association studies use geographically matched cases and controls, with controls selected and genotyped for each study. Recent successes in reusing existing controls for newly genotyped cases1, 2 point to more cost-effective designs for the next generation of studies. Pooling controls from different studies can be a cost-efficient way to increase the power to detect or verify loci of modest effect size.

The Nordic Center of Excellence in Disease Genetics (http://www.ncoedg.org), formed by the Joint Committee of the Nordic Medical Research Councils, the Nordic Council of Ministers and the Nordic Research Board, announces the release of the Nordic Control database, ‘NordicDB’, providing high-density genome-wide SNP information for 5000 healthy individuals. At present, NordicDB contains randomly ascertained samples from Finland, Sweden and Denmark. The portal (http://www.nordicdb.org), which is under continual development, provides population statistics and web-based tools for efficient use of this resource. For example, the portal describes the quality control (QC) and imputation methods and provides imputed genotype probabilities (HapMap 3 SNPs). This paper introduces NordicDB and its first release of imputed data.

Materials and methods

The Nordic Control Database, NordicDB

NordicDB pools samples from Finnish, Swedish and Danish studies. The studies were selected by principal investigators (PIs) at NCoEDG sites. The samples are from individuals chosen as controls in the original case–control studies. Table 1 presents the contributing studies with the number of samples and genotyped SNPs, genotyping platform, sample characteristics, sampling location and references to papers describing the respective studies in more detail.

Table 1 GWAS studies contributing controls to the Nordic Control Database

When constructing NordicDB, each data set was individually subjected to unified genotype QC measures. Briefly, SNPs were aligned to top strand and updated to build 36. We removed markers with ambiguous allele coding, and individuals and markers with >5% of data missing, as well as individuals with sex inconsistencies between the genotype data and the indicated sex. First- or second-degree relatives were filtered out on the basis of IBD values >0.2. On the basis of QC, on average, <3% of markers and <4% of individuals were excluded from the data sets.
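The missingness and relatedness filters described above can be sketched as follows. This is a minimal illustration, assuming genotypes are held in an individuals-by-markers NumPy array with NaN for missing calls and that pairwise IBD estimates are already available as a matrix. The function names, the order of filtering (markers first, then individuals) and the greedy removal of relatives are assumptions made for the example, not the pipeline actually used for NordicDB.

```python
import numpy as np


def filter_by_missingness(genotypes, max_missing=0.05):
    """Drop markers, then individuals, whose missing rate exceeds max_missing.

    genotypes: 2-D array (individuals x markers) coded 0/1/2, np.nan = missing.
    Filtering markers before individuals is an assumption of this sketch.
    Returns the filtered matrix and boolean keep-masks for rows and columns.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    missing = np.isnan(genotypes)
    keep_markers = missing.mean(axis=0) <= max_missing
    # Individual missingness is evaluated on the retained markers only.
    keep_individuals = missing[:, keep_markers].mean(axis=1) <= max_missing
    filtered = genotypes[np.ix_(keep_individuals, keep_markers)]
    return filtered, keep_individuals, keep_markers


def drop_relatives(ibd, threshold=0.2):
    """Greedily drop the second member of each pair with IBD > threshold.

    ibd: symmetric (individuals x individuals) matrix of pairwise IBD
    estimates, assumed to be precomputed by other software.
    Returns a boolean mask of individuals to keep.
    """
    ibd = np.asarray(ibd, dtype=float)
    n = ibd.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        if not keep[i]:
            continue
        for j in range(i + 1, n):
            if keep[j] and ibd[i, j] > threshold:
                keep[j] = False
    return keep
```

The 5% and 0.2 defaults mirror the thresholds quoted in the text; which member of a related pair is removed is arbitrary here.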

Database and portal

The relational database and the web-based data management application were built using the Molgenis application generator (Molgenis; http://molgenis.sourceforge.net/).10 The database contains information and statistics on samples, markers, genotype data releases and sampling location. The sample identifiers were anonymized for the purpose of this database and cannot be linked to the original study identifiers. All SNPs are aligned to the top strand and their physical positions refer to build 36. Individual-level data can be accessed through an application process using the application form available on the portal (http://www.nordicdb.org/database/Access.html). Applications will be reviewed by the Nordic Center of Excellence Data Review Board (http://www.nordicdb.org/drb), consisting of the PIs of the studies in NordicDB. At the time of preparing this paper, the Data Review Board members were affiliated with Lund University (Sweden), Karolinska Institutet (Sweden), the Sanger Institute (UK) and the University of Tartu (Estonia). The potential user has to specify the data set(s) being requested and give a brief description of the proposed research use of the requested data. The user must also give the following assurances:

  • The data will be used only for approved research, as follows:

  • As control data for case–control study design or as population set for population genetics analyses

  • As example data for software algorithm development:

    1. Addressing challenges associated with the analysis of sets of genotypic data.

    2. Detecting differences in allele frequency based on phenotypic data.

    3. Development of advanced analysis tools for the genetic community.

  • Data confidentiality will be strictly protected.

  • All applicable laws, local institutional policies and terms and procedures specific to the study's data access policy for handling anonymized population control data will be followed.

  • No attempts will be made to identify individual study participants, either from the genotype data themselves or by combining genotype data with any other information.

  • No information regarding the obtained control data set will be shared with or sold to third parties.

  • The contributing investigator(s) who conducted the original study and the funding organizations involved in supporting the original study will be acknowledged in publications resulting from the analysis of those data. The FIMM Technology Center (FTC) will provide information regarding which investigators should be acknowledged.

  • An annual report on research progress and publications, in which control data have been used, will be submitted to the FTC.

Finally, the control data use agreement must be cosigned by a group, department or institute leader who represents the institution for which the applicant works. As data access policies are still being developed, these requirements and policies may change from what is described herein without notice. For some data sets, the original contributing investigator must also be contacted and must give approval, in addition to application approval by the FTC.

The data can also be accessed through the European Genotype Archive (http://www.ebi.ac.uk/ega). In particular, for each sample, researchers will be able to obtain genotype data and an indication of the study from which the genotypes originate. As samples from different studies have been genotyped with different technologies covering different sets of SNPs, we also provide imputed genotypes (see the ‘Imputed data’ section).

Population structure in the Nordic Control database

Population structure can be measured in terms of differences in allele frequencies and linkage disequilibrium (LD) patterns between sub-populations due to systematic ancestry differences. In genetic association studies, when allele frequencies differ between individuals with different disease/trait status because the groups were sampled from populations of different ancestry, rather than because of the disease or trait itself, the false-positive error rate is inflated.11, 12 Therefore, population structure must be considered carefully when pooling controls that originate from different populations.13 Recent studies have shown that, even for small isolated populations or for populations within restricted areas, stratification should be evaluated and accounted for when assessing genetic association.6, 14

As the NordicDB samples were collected from different Nordic countries, we investigated potential layers of stratification through multidimensional scaling (MDS) analysis in PLINK.15 Before performing the MDS analysis, we removed non-autosomal SNPs, SNPs in known inverted regions,16 SNPs with MAF <0.01 and SNPs that failed the Hardy–Weinberg equilibrium test at the significance threshold of 1e-06. Individuals identified as outliers based on the inbreeding coefficient were also excluded (Supplementary material available at http://nordicdb.org/database/Data.html). The MDS analysis was based on SNPs that were common across platforms (45k SNPs). From this restricted SNP set, we retained only SNPs and individuals with <5% missingness, and only SNPs in low LD with one another.13 To prune SNPs in LD, the pairwise genotypic correlation was calculated between all SNPs within windows of 20 SNPs, and one SNP was excluded from each pair if the LD was >0.1. The window was shifted forward by five SNPs at each step. For the purpose of the MDS assessment, a Finnish reference data set was included. This consists of 81 individuals: 40 individuals collected from the capital area, genetically representing the general population, and 41 individuals from a Finnish isolate in the late-settlement area (LSFIN, described elsewhere17, 18). SNPs from the Illumina Human 1M-Duo chip (Illumina, San Diego, CA, USA) and the Affymetrix Genome-Wide Human SNP Array 6.0 chip (Affymetrix, Santa Clara, CA, USA) were genotyped, resulting in 1 163 280 SNPs after applying QC. The haplotypes in this data set were phased similarly to the HapMap 3 CEU (individuals with NW European ancestry) and TSI (Tuscans in Italy) samples. Figure 1a shows the first two axes of genetic variation in NordicDB, CEU HapMap 3 data and the Finnish reference set. The analysis was based on 4809 samples: 2458 Swedish, 2082 Finnish, 161 Danish and 108 from CEU.
The plot of the first two MDS components closely resembles the geographical placement of the samples (Figure 1b), with a clear NW–SE gradient. To validate the SNP set used in the MDS analysis, we compared patterns of variation based on all available SNPs and on the restricted set, using two studies genotyped on the same chip (CAPS and DGI). The results were similar (data not shown).
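The windowed LD pruning and the MDS projection described above were carried out in PLINK; the Python sketch below only illustrates the two ideas on small in-memory arrays. It assumes complete 0/1/2 genotype dosages, takes the LD measure to be the squared Pearson correlation between dosages, and applies classical (eigendecomposition-based) MDS to a precomputed pairwise distance matrix such as 1 − IBS. All function names and these choices are assumptions for illustration and may differ from what PLINK does internally.

```python
import numpy as np


def ld_prune(genotypes, window=20, step=5, r2_max=0.1):
    """Return a boolean mask of SNPs kept after sliding-window LD pruning.

    genotypes: (individuals x SNPs) array of 0/1/2 dosages, no missing
    values, SNPs in physical order. Within each window, the later SNP of
    any pair with squared correlation above r2_max is dropped.
    """
    g = np.asarray(genotypes, dtype=float)
    n_snps = g.shape[1]
    keep = g.std(axis=0) > 0  # monomorphic SNPs have undefined correlation
    for start in range(0, n_snps, step):
        idx = [k for k in range(start, min(start + window, n_snps)) if keep[k]]
        if len(idx) < 2:
            continue
        r2 = np.corrcoef(g[:, idx], rowvar=False) ** 2
        for a in range(len(idx)):
            if not keep[idx[a]]:
                continue
            for b in range(a + 1, len(idx)):
                if keep[idx[b]] and r2[a, b] > r2_max:
                    keep[idx[b]] = False
    return keep


def classical_mds(distances, n_components=2):
    """Classical MDS: embed points so Euclidean distances approximate the input.

    distances: symmetric (n x n) matrix of pairwise distances.
    Returns an (n x n_components) coordinate matrix.
    """
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering  # double-centred matrix
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0, None))
    return eigvecs[:, order] * scale
```

The default window, step and threshold follow the values given in the text (20 SNPs, shift of five, 0.1); plotting the first two columns of the MDS output against each other gives a figure of the kind shown in Figure 1a.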

Figure 1

(a) Top axes of genetic variation in the Nordic Control Database, NordicDB (4620 samples) contrasted with the HapMap CEU (108 samples) and a Finnish HapMap reference population (81 samples). The MDS analysis was performed on 45 000 SNPs that were common between genotyping platforms. The controls are part of the following studies: Cancer Prostate in Sweden (CAPS) 1 and 2, Cancer and Hormonal Replacement in Sweden (CAHRES), Diabetes Genetics Initiative in Western Finland and Southern Sweden (DGI-FIN and DGI-SWE), SGENE and MS in the Helsinki region, Aneurysm study in the Helsinki region, GenomEUtwin Denmark (GenomEUtwin-DK), GenomEUtwin Sweden (GenomEUtwin-SWE) and GenomEUtwin Finland (GenomEUtwin-FIN). (b) Geographical map of Scandinavia with three countries highlighted to show the origin of the samples in panel a: Finland (red), Sweden (green) and Denmark (yellow).

Table 2 shows summary statistics for allele frequency differences and similarities between study populations. We calculated pairwise FST values using Weir and Cockerham's approach implemented in the R package Geneland19 (see http://www.nordicdb.org). The largest differences were those between Finnish and Swedish studies, with magnitude varying according to the location of the Finnish study.
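For readers who want to see what a pairwise FST calculation of this kind involves, the following is a minimal NumPy sketch of the Weir and Cockerham (1984) estimator for biallelic SNPs, combined across loci as a ratio of summed variance components. It is not the Geneland code used for Table 2; the function name, the input layout (one individuals-by-SNPs array of 0/1/2 allele counts per population) and the assumption of no missing genotypes are choices made for this example.

```python
import numpy as np


def weir_cockerham_fst(populations):
    """Multi-locus Weir and Cockerham FST for biallelic SNPs.

    populations: list of (individuals x SNPs) arrays of 0/1/2 allele
    counts, one array per population, same SNPs in the same order and no
    missing data. At least one SNP must be polymorphic overall.
    """
    pops = [np.asarray(g, dtype=float) for g in populations]
    r = len(pops)                                          # number of populations
    n = np.array([g.shape[0] for g in pops], dtype=float)  # sample sizes
    p = np.array([g.mean(axis=0) / 2 for g in pops])       # allele frequencies
    h = np.array([(g == 1).mean(axis=0) for g in pops])    # heterozygote proportions

    n_bar = n.mean()
    n_c = (n.sum() - (n ** 2).sum() / n.sum()) / (r - 1)
    w = n[:, None]
    p_bar = (w * p).sum(axis=0) / n.sum()
    s2 = (w * (p - p_bar) ** 2).sum(axis=0) / ((r - 1) * n_bar)
    h_bar = (w * h).sum(axis=0) / n.sum()

    core = p_bar * (1 - p_bar) - (r - 1) / r * s2
    a = n_bar / n_c * (s2 - (core - h_bar / 4) / (n_bar - 1))                 # between populations
    b = n_bar / (n_bar - 1) * (core - (2 * n_bar - 1) / (4 * n_bar) * h_bar)  # between individuals
    c = h_bar / 2                                                             # within individuals
    return float(a.sum() / (a + b + c).sum())
```

Passing two study populations at a time gives one entry of a pairwise matrix. Estimates can be slightly negative when the populations are effectively identical.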

Table 2 Pairwise FST values for data sets in the Nordic Control Database

Imputed data

The limited overlap of SNPs across genotyping platforms and chips is a key issue for NordicDB to address. The Illumina (http://www.illumina.com) and Affymetrix (http://affymetrix.com/index.affx) platforms, which differ in terms of genomic coverage, call rate and accuracy, array processing time and ease of use, typically have an SNP overlap of 10%. Thus, to provide a harmonized SNP set, imputation of non-overlapping SNPs is required. We use the IMPUTE software (University of Oxford, Oxford, UK; https://mathgen.stats.ox.ac.uk/impute/impute.html)20 to impute genotypes of the individuals in NordicDB against a common reference set. The reference population was chosen by comparing imputation accuracy in three data sets (namely CAPS1, CAPS2 and CAHRES) using three candidate reference populations (CEU HapMap 2, CEU HapMap 3, and the combined HapMap 3 European populations CEU and TSI) in a subset of SNPs from chromosomes 21 and 22. Genotypes of directly typed SNPs were compared with their calls after imputation. The subset of SNPs was chosen by first selecting all SNPs that were common to the genotyping platforms used in the three studies (see Table 1) and then removing the minimum number of them such that the maximum pairwise r² value among the remaining SNPs was 0.2. Genotypes for SNPs in the selected subset were imputed using genotypes of all other typed SNPs on chromosomes 21 and 22. To assess imputation accuracy, we calculated the root mean square error of prediction (RMSEP) over SNPs and individuals. Writing y_ki to denote the observed genotype for SNP k of individual i, and p_jki to denote the posterior probability of genotype j ∈ {0, 1, 2}, obtained from IMPUTE, for SNP k of individual i, RMSEP was calculated as

where K is the number of SNPs in the subset of imputed SNPs and N the number of individuals in the data set. Accurate imputation results are reflected by low RMSEP values. Without exception, the lowest RMSEP values were achieved for the CEU and TSI populations combined (Table 3). Therefore, we used this reference population to impute all data sets in the database. The imputation procedure is described in more detail on the portal (http://www.nordicdb.org), which also provides information on how to download the imputed data.
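The definitions of the observed genotypes, posterior probabilities, K and N leave some freedom in how the posterior probabilities enter the error. The sketch below assumes one plausible form: the posterior-expected squared error, that is, the sum over j of p_jki (y_ki − j)², averaged over all K SNPs and N individuals before taking the square root. This form, the function name and the array layout are assumptions made for illustration and are not necessarily the authors' exact definition.

```python
import numpy as np


def rmsep(observed, posterior):
    """Root mean square error of prediction under the assumed form.

    observed:  (K x N) array of observed genotypes coded 0/1/2.
    posterior: (3 x K x N) array, posterior[j, k, i] = P(genotype j)
               for SNP k of individual i.
    Assumed form: sqrt( (1 / (K * N)) * sum_k sum_i sum_j p_jki * (y_ki - j)^2 ).
    """
    y = np.asarray(observed, dtype=float)
    p = np.asarray(posterior, dtype=float)
    j = np.arange(3, dtype=float).reshape(3, 1, 1)
    expected_sq_err = (p * (y[None, :, :] - j) ** 2).sum(axis=0)
    return float(np.sqrt(expected_sq_err.mean()))
```

Under this form, a value of 0 means every observed genotype received posterior probability 1, and larger values indicate probability mass placed on genotypes further from the observed one.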

Table 3 Comparison of imputation accuracy for three reference populations

Table 4 presents a summary of imputation accuracy for the Nordic Control database, based on those SNPs that were genotyped in at least 90% of the individuals in the originating study. For chromosome 15, a minimum of 85% of SNPs were called after the imputation at a threshold of 0.9, with a concordance rate of 99% (Table 4, last column).
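The two summary quantities quoted above, the proportion of genotypes called at a posterior-probability threshold and the concordance of those calls with directly typed genotypes, can be sketched as follows. The sketch assumes that a genotype is called when its largest posterior probability is at least the threshold (whether the comparison is inclusive is an assumption) and that concordance is computed over called genotypes only; the function name and array layout are likewise illustrative.

```python
import numpy as np


def call_rate_and_concordance(observed, posterior, threshold=0.9):
    """Call rate and concordance of thresholded imputed genotype calls.

    observed:  (K x N) array of directly typed genotypes coded 0/1/2.
    posterior: (3 x K x N) array of posterior genotype probabilities.
    Returns (fraction of genotypes called, fraction of called genotypes
    that equal the observed genotype).
    """
    y = np.asarray(observed)
    p = np.asarray(posterior, dtype=float)
    best = p.argmax(axis=0)               # most probable genotype
    called = p.max(axis=0) >= threshold   # confident enough to call
    call_rate = float(called.mean())
    if not called.any():
        return call_rate, float("nan")
    concordance = float((best[called] == y[called]).mean())
    return call_rate, concordance
```

Raising the threshold trades call rate for concordance, so the two figures are best reported together, as in Table 4.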

Table 4 Imputation accuracy for the Nordic Control Database

Discussion

We have described an open resource (NordicDB) that pools GWAS samples from the Nordic countries. With population substructure present across the Nordic populations,6, 14 there is an obvious need to assess its impact when using NordicDB with a new study population of cases. In dealing with substructure, one should consider adjusting for the main axes of genetic variation21 or selecting a subset of controls that are ancestrally compatible with the cases. An obvious limitation of NordicDB is that it includes no environmental variables, and therefore users will not be able to adjust for environmental confounders in performing their own association analyses.

The samples in NordicDB were genotyped with different technologies. This called for harmonizing the QC measures and for imputing the non-overlapping markers using the publicly available LD data from HapMap 3. As a result, scientists interested in studying Nordic populations can genotype new cases on their preferred platform and use NordicDB to pick already genotyped controls for their studies.