NordicDB: a Nordic pool and portal for genome-wide control data

Leu, Monica; Humphreys, Keith; Surakka, Ida; Rehnberg, Emil; Muilu, Juha; Rosenström, Päivi; Almgren, Peter; Jääskeläinen, Juha; Lifton, Richard P; Kyvik, Kirsten Ohm; Kaprio, Jaakko; Pedersen, Nancy L; Palotie, Aarno; Hall, Per; Grönberg, Henrik; Groop, Leif; Peltonen, Leena; Palmgren, Juni; Ripatti, Samuli

doi:10.1038/ejhg.2010.112

Article
Published: 28 July 2010

Monica Leu^{1, 2},
Keith Humphreys¹,
Ida Surakka^{2, 3},
Emil Rehnberg¹,
Juha Muilu²,
Päivi Rosenström²,
Peter Almgren⁴,
Juha Jääskeläinen⁵,
Richard P Lifton⁶,
Kirsten Ohm Kyvik⁷,
Jaakko Kaprio^{2, 8, 9},
Nancy L Pedersen¹,
Aarno Palotie^{2, 10, 11},
Per Hall¹,
Henrik Grönberg¹,
Leif Groop⁴,
Leena Peltonen^{2, 3, 10, 11},
Juni Palmgren^{1, 12} &
Samuli Ripatti^{2, 3}

European Journal of Human Genetics volume 18, pages 1322–1326 (2010)

Abstract

A cost-efficient way to increase power in a genetic association study is to pool controls from different sources. The genotyping effort can then be directed to large case series. The Nordic Control database, NordicDB, has been set up as a unique resource in the Nordic area and the data are available for authorized users through the web portal (http://www.nordicdb.org). The current version of NordicDB pools together high-density genome-wide SNP information from ∼5000 controls originating from Finnish, Swedish and Danish studies and shows country-specific allele frequencies for SNP markers. The genetic homogeneity of the samples was investigated using multidimensional scaling (MDS) analysis and pairwise allele frequency differences between the studies. The plot of the first two MDS components showed excellent resemblance to the geographical placement of the samples, with a clear NW–SE gradient. We advise researchers to assess the impact of population structure when incorporating NordicDB controls in association studies. This harmonized Nordic database presents a unique genome-wide resource for future genetic association studies in the Nordic countries.

Introduction

Genetic association studies aim to identify variants that predict disease susceptibility, prognosis or therapy response. Many association studies use geographically matched cases and controls, with controls selected and genotyped for each study. Recent successes in reusing existing controls for newly genotyped cases^{1, 2} suggest that the next generation of studies could be designed more cost-effectively. Pooling controls from different studies can be a cost-efficient way to increase the power to detect or verify loci of modest effect size.

The Nordic Center of Excellence in Disease Genetics (http://www.ncoedg.org), formed by the Joint Committee of the Nordic Medical Research Councils, the Nordic Council of Ministers and the Nordic Research Board, announces the release of the Nordic Control database, ‘NordicDB’, providing high-density genome-wide SNP information for ∼5000 healthy individuals. At present, NordicDB contains randomly ascertained samples from Finland, Sweden and Denmark. The portal (http://www.nordicdb.org), which is under continual development, provides population statistics and web-based tools for efficient use of this resource. For example, the portal describes the quality control (QC) and imputation methods and provides imputed genotype probabilities (HapMap 3 SNPs). This paper introduces NordicDB and its first release of imputed data.

Materials and methods

The Nordic Control Database, NordicDB

NordicDB pools together samples from Finnish, Swedish and Danish studies. The selection of studies came from PIs at NCoEDG sites. These samples are individuals chosen to be controls in the original case–control studies. Table 1 presents the contributing studies with number of samples and genotyped SNPs, genotyping platform, sample characteristics, sampling location and reference to papers describing the respective studies in more detail.

Table 1 GWAS studies contributing controls to the Nordic Control Database

When constructing NordicDB, each data set was individually subjected to unified genotype QC measures. Briefly, SNPs were aligned to top strand and updated to build 36. We removed markers with ambiguous allele coding, and individuals and markers with >5% of data missing, as well as individuals with sex inconsistencies between the genotype data and the indicated sex. First- or second-degree relatives were filtered out on the basis of IBD values >0.2. On the basis of QC, on average, <3% of markers and <4% of individuals were excluded from the data sets.
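
The missingness step of the QC described above can be sketched as follows. This is a minimal illustration, not the pipeline used for NordicDB: the genotype layout (individuals in rows, SNPs in columns, `np.nan` for a missing call), the function name `filter_by_missingness` and the order of the two filters (markers first, then individuals) are all assumptions made for the example.

```python
import numpy as np

def filter_by_missingness(genotypes, max_missing=0.05):
    """Drop markers, then individuals, with more than max_missing missing calls.

    genotypes: 2-D array, rows = individuals, columns = SNPs, coded 0/1/2
    with np.nan for a missing call (hypothetical layout for this sketch).
    Returns the filtered matrix and boolean masks of kept individuals and SNPs.
    """
    g = np.asarray(genotypes, dtype=float)
    # Fraction of missing calls per SNP, computed over all individuals.
    snp_keep = np.isnan(g).mean(axis=0) <= max_missing
    g = g[:, snp_keep]
    # Fraction of missing calls per individual, over the retained SNPs only.
    ind_keep = np.isnan(g).mean(axis=1) <= max_missing
    return g[ind_keep], ind_keep, snp_keep
```

Filtering markers before individuals is one possible convention; reversing the order can retain a slightly different set of samples.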

Database and portal

The relational database and the web-based data management application were built using the Molgenis application generator (Molgenis; http://molgenis.sourceforge.net/).¹⁰ The database contains information and statistics on samples, markers, genotype data releases and sampling location. The sample identifiers were anonymized for the purpose of this database and cannot be linked to the original study identifiers. All SNPs are aligned to the top strand and their physical positions refer to build 36. Individual-level data can be accessed through an application process using the application form available on the portal (http://www.nordicdb.org/database/Access.html). Applications will be reviewed by the Nordic Center of Excellence Data Review Board (http://www.nordicdb.org/drb), consisting of the PIs of the studies in NordicDB. At the time of preparing this paper, the Data Review Board members were affiliated with Lund University (Sweden), Karolinska Institutet (Sweden), the Sanger Institute (UK) and the University of Tartu (Estonia). The prospective user must specify the data set(s) requested and briefly describe the proposed research use of the data. The user must also give the following assurances:

The data will be used only for approved research, as follows:
As control data for case–control study design or as a population set for population genetics analyses
As example data for software algorithm development:
1. Addressing challenges associated with the analysis of sets of genotypic data.
2. Detecting differences in allele frequency based on phenotypic data.
3. Development of advanced analysis tools for the genetic community.
Data confidentiality will be strictly protected.
All applicable laws, local institutional policies and terms and procedures specific to the study's data access policy for handling anonymized population control data will be followed.
No attempts will be made to identify individual study participants from whom genotype data were obtained using genotype data or by trying to combine genotype data with any other information.
No information regarding the obtained control data set will be shared with or sold to third parties.
The contributing investigator(s) who conducted the original study and the funding organizations involved in supporting the original study will be acknowledged in publications resulting from the analysis of those data. The FIMM Technology Center (FTC) will provide information regarding which investigators should be acknowledged.
An annual report on research progress and publications, in which control data have been used, will be submitted to the FTC.

Finally, the control data use agreement must be cosigned by a group/department/institute leader, who represents the institution for which the applicant works. As data access policies are still being developed, these requirements and policies may change from what is described herein without notice. For some data sets, the original contributing investigator must also be contacted and must approve the request, in addition to approval of the application by the FTC.

The data can also be accessed through the European Genotype Archive (http://www.ebi.ac.uk/ega). In particular, for each sample, researchers will be able to obtain genotype data and an indication of the study from which the genotypes originate. As samples from different studies have been genotyped with different technologies and at different SNP locations, we also provide imputed genotypes (see the ‘Imputed data’ section).

Population structure in the Nordic Control database

Population structure can be measured in terms of differences in allele frequencies and linkage disequilibrium (LD) patterns between sub-populations due to systematic ancestry differences. In genetic association studies, when allele frequencies differ between individuals of different disease/trait status because population structure was sampled differently by disease status, the false-positive error rate is inflated.^{11, 12} Therefore, population structure must be considered carefully when pooling controls that originate from different populations.¹³ Recent studies have shown that, even for small isolated populations or for populations within restricted areas, stratification should be evaluated and accounted for when assessing genetic association.^{6, 14}

As the NordicDB samples were collected from different Nordic countries, we investigated potential layers of stratification using the multidimensional scaling (MDS) analysis in PLINK.¹⁵ Before performing the MDS analysis, we removed non-autosomal SNPs, SNPs in known inverted regions,¹⁶ SNPs with MAF <0.01 and SNPs that failed the Hardy–Weinberg equilibrium test at the significance threshold of 1e-06. Individuals identified as outliers based on the inbreeding coefficient were also excluded (Supplementary material available at http://nordicdb.org/database/Data.html). The MDS analysis was based on SNPs that were common across platforms (∼45k SNPs). From the restricted SNP set, only SNPs and individuals with <5% missingness, and only SNPs in low LD, were included.¹³ To prune SNPs in LD, the pairwise genotypic correlation was calculated between all SNPs within windows of 20 SNPs, and one SNP was excluded from each pair whose LD exceeded 0.1. Windows were shifted forward by five SNPs at a time. For the purpose of the MDS assessment, a Finnish reference data set was included. It consists of 81 individuals: 40 collected from the capital area, genetically representing the general population, and 41 from a Finnish isolate in the late-settlement area (LSFIN, described elsewhere^{17, 18}). SNPs from the Illumina Human 1M-Duo chip (Illumina, San Diego, CA, USA) and the Affymetrix Genome-Wide Human SNP Array 6.0 chip (Affymetrix, Santa Clara, CA, USA) were genotyped, resulting in 1 163 280 SNPs after applying QC. The haplotypes in this data set were phased similarly to the HapMap 3 CEU samples (individuals with NW European ancestry) and the Tuscans in Italy (TSI) samples. Figure 1a shows the first two axes of genetic variation in NordicDB, CEU HapMap 3 data and the Finnish reference set. The analysis was based on 4809 samples: 2458 Swedish, 2082 Finnish, 161 Danish and 108 from CEU.
The plot of the first two MDS components shows excellent resemblance to the geographical placement of the samples (Figure 1b), with a clear NW–SE gradient. To validate the SNP set used in the MDS analysis, we compared patterns of variation based on all available SNPs and on the restricted set, using two studies genotyped on the same chip (CAPS and DGI). The results were similar (data not shown).
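
The windowed LD pruning described above (windows of 20 SNPs, a shift of five SNPs, exclusion above 0.1) can be sketched as follows. This is an illustrative stand-in rather than the authors' procedure: it assumes complete genotypes coded 0/1/2, measures LD as the squared Pearson correlation of genotype counts, and drops the later SNP of each offending pair, a choice the text does not specify. The function name `ld_prune` is hypothetical. PLINK offers a comparable windowed option (`--indep-pairwise`), but the text does not state which command was run.

```python
import numpy as np

def ld_prune(genotypes, window=20, step=5, r2_max=0.1):
    """Return a boolean mask of SNPs kept after windowed pairwise LD pruning.

    genotypes: 2-D array, rows = individuals, columns = SNPs in map order,
    coded 0/1/2 with no missing values (assumption for this sketch).
    """
    g = np.asarray(genotypes, dtype=float)
    n_snps = g.shape[1]
    keep = np.ones(n_snps, dtype=bool)
    for start in range(0, n_snps, step):
        # SNPs in the current window that have survived earlier windows.
        idx = [j for j in range(start, min(start + window, n_snps)) if keep[j]]
        for pos, a in enumerate(idx):
            if not keep[a]:
                continue
            for b in idx[pos + 1:]:
                if not keep[b]:
                    continue
                r = np.corrcoef(g[:, a], g[:, b])[0, 1]
                if r * r > r2_max:
                    keep[b] = False  # arbitrary choice: drop the later SNP
    return keep
```

The pairwise loop is quadratic in the window size, which is acceptable for windows of 20 SNPs but would need vectorizing for much larger windows.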

Table 2 shows summary statistics for allele frequency differences and similarities between study populations. We calculated pairwise F_ST values using Weir and Cockerham's approach implemented in the R package Geneland¹⁹ (see http://www.nordicdb.org). The largest differences were those between Finnish and Swedish studies, with magnitude varying according to the location of the Finnish study.
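
To make the quantity in Table 2 concrete, the sketch below computes a basic pairwise F_ST from allele frequencies as (H_T − H_S)/H_T, summed over SNPs. This is deliberately a simplified stand-in: it is not Weir and Cockerham's estimator, it applies no sample-size correction, and it does not reproduce the Geneland interface. The function name `simple_fst` and the equal weighting of the two populations are assumptions of the example.

```python
import numpy as np

def simple_fst(p1, p2):
    """Basic two-population F_ST from per-SNP allele frequencies.

    p1, p2: allele frequencies of the same allele at the same SNPs in two
    populations. At least one SNP must be polymorphic in the pooled sample.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    p_bar = (p1 + p2) / 2.0
    # Expected heterozygosity in the pooled population.
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    # Mean expected heterozygosity within the two populations.
    h_s = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    # Ratio of sums across SNPs rather than a mean of per-SNP ratios.
    return float((h_t.sum() - h_s.sum()) / h_t.sum())
```

Identical frequencies give 0 and fixed opposite alleles give 1; values between closely related populations are expected to lie close to 0.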

Table 2 Pairwise F_ST values for data sets in the Nordic Control Database

Imputed data

The limited overlap of SNPs across genotyping platforms and chips is a key issue for NordicDB to address. The Illumina (http://www.illumina.com) and Affymetrix (http://affymetrix.com/index.affx) platforms, which differ in terms of genomic coverage, call rate and accuracy, array processing time and ease of use, typically have an SNP overlap of ∼10%. Thus, to provide a harmonized SNP set, imputation of non-overlapping SNPs is required. We use the IMPUTE software (University of Oxford, Oxford, UK; https://mathgen.stats.ox.ac.uk/impute/impute.html)²⁰ to impute genotypes of the individuals in NordicDB against a common reference set. The choice of reference population was based on comparing the accuracy of imputation in three data sets (namely CAPS1, CAPS2 and CAHRES) using three reference populations: CEU HapMap 2, CEU HapMap 3, and the combined HapMap 3 European populations CEU and TSI, in a subset of SNPs from chromosomes 21 and 22. Genotypes of directly typed SNPs were compared with their calls after imputing. The subset of SNPs was chosen by first selecting all SNPs that were common to the genotyping platforms used in the three studies (see Table 1) and then removing a minimum number of them such that the maximum pairwise r² value among the remaining SNPs was 0.2. Genotypes for SNPs in the selected subset were imputed using genotypes of all other typed SNPs on chromosomes 21 and 22. To assess imputation accuracy, we calculated the root mean square error of prediction (RMSEP) over SNPs and individuals. Writing y_ki to denote the observed genotype for SNP k of individual i, and p_jki to denote the posterior probability of genotype j∈{0,1,2}, obtained from IMPUTE, for SNP k, individual i, RMSEP was calculated as

where K is the number of SNPs in the subset of imputed SNPs and N the number of individuals in the data set. Accurate imputation results are reflected by low RMSEP values. Without exception, the lowest RMSEP values were achieved for the CEU and TSI populations combined (Table 3). Therefore, we used this reference population to impute all data sets in the database. The imputation procedure is described in more detail on the portal (http://www.nordicdb.org), where information on how to download imputed data is also provided.
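
As an illustration of how an RMSEP of this kind can be computed from the quantities defined above, the sketch below uses the posterior-weighted squared error, that is, the square root of (1/(KN)) Σ_k Σ_i Σ_j p_jki (y_ki − j)². This particular form is an assumption chosen to be consistent with the definitions of y_ki, p_jki, K and N; it should not be read as the exact expression used for Table 3. The function name `rmsep` and the array layout are hypothetical.

```python
import numpy as np

def rmsep(observed, posterior):
    """Root mean square error of prediction under an assumed weighting.

    observed:  (K, N) array of typed genotypes y_ki in {0, 1, 2}.
    posterior: (3, K, N) array of probabilities p_jki, first axis j = 0, 1, 2.
    """
    y = np.asarray(observed, dtype=float)
    p = np.asarray(posterior, dtype=float)
    j = np.arange(3).reshape(3, 1, 1)
    # Posterior-weighted squared distance between each genotype value and y_ki.
    sq_err = (p * (y[None, :, :] - j) ** 2).sum(axis=0)
    # Average over the K SNPs and N individuals, then take the square root.
    return float(np.sqrt(sq_err.mean()))
```

Under this form, a posterior that puts all its mass on the observed genotype contributes zero error, so lower values indicate more accurate imputation, in line with the interpretation given in the text.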

Table 3 Comparison of imputation accuracy for three reference populations^a

Table 4 presents a summary of imputation accuracy for the Nordic Control database, based on those SNPs that were genotyped in at least 90% of the individuals in the originating study. For chromosome 15, a minimum of 85% of SNPs were called after the imputation at a threshold of 0.9, with a concordance rate of ∼99% (Table 4, last column).
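
The two summary quantities quoted above, the proportion of genotypes called at a 0.9 threshold and the concordance of those calls with the typed genotypes, can be sketched as follows. The sketch assumes that a genotype is called when its largest posterior probability is at least the threshold, and that concordance is computed over called genotypes only; both conventions, the function name `call_rate_and_concordance` and the array layout are assumptions of the example.

```python
import numpy as np

def call_rate_and_concordance(posterior, observed, threshold=0.9):
    """Call imputed genotypes at a probability threshold and compare to typed ones.

    posterior: (3, K, N) array of genotype probabilities, first axis = 0/1/2.
    observed:  (K, N) array of directly typed genotypes in {0, 1, 2}.
    Returns (call rate, concordance among called genotypes).
    """
    p = np.asarray(posterior, dtype=float)
    y = np.asarray(observed)
    best = p.argmax(axis=0)                 # most probable genotype
    called = p.max(axis=0) >= threshold     # confident enough to call
    call_rate = float(called.mean())
    if not called.any():
        return call_rate, float("nan")
    concordance = float((best[called] == y[called]).mean())
    return call_rate, concordance
```

Raising the threshold typically trades a lower call rate for higher concordance, so the two figures are best reported together.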

Table 4 Imputation accuracy for the Nordic Control Database^a

Discussion

We have described an open resource (NordicDB) that pools GWAS samples from the Nordic countries. With population substructure present across the Nordic populations,^{6, 14} there is an obvious need to assess its impact when using NordicDB with a new study population of cases. In dealing with substructure, one should consider adjusting for the main axes of genetic variation²¹ or selecting a subset of controls that are ancestrally compatible with the cases. An obvious limitation of NordicDB is that it includes no environmental variables, and therefore users will not be able to adjust for environmental confounders when performing their own association analyses.
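
One way to obtain main axes of genetic variation for such an adjustment is a principal components analysis of the standardized genotype matrix, sketched below. This is a generic illustration in the spirit of the cited approach, not code from NordicDB: it assumes complete genotypes coded 0/1/2 for cases and pooled controls combined, and the function name `top_axes` and the per-SNP scaling by the binomial standard deviation are choices made for the example.

```python
import numpy as np

def top_axes(genotypes, n_axes=2):
    """Return per-individual scores on the leading axes of genetic variation.

    genotypes: 2-D array, rows = individuals, columns = SNPs, coded 0/1/2
    with no missing values (assumption for this sketch).
    """
    g = np.asarray(genotypes, dtype=float)
    p = g.mean(axis=0) / 2.0                   # allele frequency per SNP
    poly = (p > 0.0) & (p < 1.0)               # monomorphic SNPs carry no signal
    # Centre each SNP and scale by its binomial standard deviation.
    z = (g[:, poly] - 2.0 * p[poly]) / np.sqrt(2.0 * p[poly] * (1.0 - p[poly]))
    # Left singular vectors scaled by singular values are the PC scores.
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_axes] * s[:n_axes]
```

The resulting columns could then be included as covariates in the association model, or used to pick controls that fall close to the cases along these axes.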

The samples in NordicDB were genotyped with different technologies. This called for harmonizing the QC measures and for imputing the non-overlapping markers using the publicly available LD data from HapMap 3. This allows scientists interested in studying Nordic populations to use their preferred platform to genotype new cases and use NordicDB to pick readily genotyped controls for their studies.

References

1. Wellcome Trust Case Control Consortium: Genome-wide association study of 14 000 cases of seven common diseases and 3000 shared controls. Nature 2007; 447: 661–678.
2. Wrensch M, Jenkins RB, Chang JS et al: Variants in the CDKN2B and RTEL1 regions are associated with high-grade glioma susceptibility. Nat Genet 2009; 41: 905–908.
3. Zheng SL, Sun J, Wiklund F et al: Cumulative association of five genetic variants with prostate cancer. N Engl J Med 2008; 358: 910–919.
4. Einarsdóttir K, Humphreys K, Bonnard C et al: Linkage disequilibrium mapping of CHEK2: common variation and breast cancer risk. PLoS Med 2006; 3: e168.
5. Diabetes Genetics Initiative of Broad Institute of Harvard and MIT, Lund University, and Novartis Institutes of BioMedical Research, Saxena R, Voight BF, Lyssenko V et al: Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science 2007; 316: 1331–1336.
6. Jakkula E, Rehnström K, Varilo T et al: The genome-wide patterns of variation expose significant substructure in a founder population. Am J Hum Genet 2008; 83: 787–794.
7. Bilguvar K, Yasuno K, Niemelä M et al: Susceptibility loci for intracranial aneurysm in European and Japanese populations. Nat Genet 2008; 40: 1472–1477.
8. McEvoy BP, Montgomery GW, McRae AF et al: Geographical structure and differential natural selection among North European populations. Genome Res 2009; 19: 804–814.
9. Aulchenko YS, Ripatti S, Lindqvist I et al: Loci influencing lipid levels and coronary heart disease risk in 16 European population cohorts. Nat Genet 2009; 41: 47–55.
10. Swertz MA, de Brock EO, van Hijum S et al: Molecular Genetics Information System (MOLGENIS): alternatives in developing local experimental genomics databases. Bioinformatics 2004; 20: 2075–2083.
11. Freedman ML, Reich D, Penney KL et al: Assessing the impact of population stratification on genetic association studies. Nat Genet 2004; 36: 388–393.
12. Tian C, Gregersen PK, Seldin MF: Accounting for ancestry: population substructure and genome-wide association studies. Hum Molec Genet 2008; 17: R143–R150.
13. Yu K, Wang Z, Li Q et al: Population substructure and control selection in genome-wide association studies. PLoS ONE 2008; 3: e2551.
14. Salmela E, Lappalainen T, Fransson I et al: Genome-wide analysis of single nucleotide polymorphisms uncovers population structure in Northern Europe. PLoS ONE 2008; 3: e3519.
15. Purcell S, Neale B, Todd-Brown K et al: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 2007; 81: 559–575.
16. Price AL, Weale ME, Patterson N et al: Long-range LD can confound genome scans in admixed populations. Am J Hum Genet 2008; 83: 132–135.
17. Nevanlinna HR: The Finnish population structure. A genetic and genealogical study. Hereditas 1972; 71: 195–235.
18. Varilo T, Laan M, Hovatta I et al: Linkage disequilibrium in isolated populations: Finland and a young sub-population of Kuusamo. Eur J Hum Genet 2000; 8: 604–612.
19. Guillot G, Santos F, Estoup A: Inference in population genetics with Geneland: a sensitivity analysis to spatial sampling scheme, null alleles and isolation by distance, 2009. Submitted.
20. Marchini J, Howie B, Myers S, McVean G, Donnelly P: A new multipoint method for genome-wide association studies via imputation of genotypes. Nat Genet 2007; 39: 906–913.
21. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 2006; 38: 904–909.

Acknowledgements

Ilkka Lappalainen from EBI is thanked for discussions over the project. NordicDB is financially supported by the Nordic Center of Excellence in Disease Genetics, the Wallenberg Foundation, the FP6 coordinated action PHOEBE (Promoting Harmonization of Epidemiological Biobanks in Europe), the Wallenberg Consortium North, Sweden, the Center of Excellence for Complex Disease Genetics of the Academy of Finland (grants 213506 and 129680), the Biocentrum Helsinki Foundation and the Nordic Centre of Excellence (NCoE) Programme in Molecular Medicine. KH acknowledges support from the Swedish Research Council (grant number 523-2006-972).

Author information

Authors and Affiliations

Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
Monica Leu, Keith Humphreys, Emil Rehnberg, Nancy L Pedersen, Per Hall, Henrik Grönberg & Juni Palmgren
Institute for Molecular Medicine, Finland, FIMM, University of Helsinki, Helsinki, Finland
Monica Leu, Ida Surakka, Juha Muilu, Päivi Rosenström, Jaakko Kaprio, Aarno Palotie, Leena Peltonen & Samuli Ripatti
Public Health Genomics Unit, National Institute for Health and Welfare, Helsinki, Finland
Ida Surakka, Leena Peltonen & Samuli Ripatti
Department of Clinical Sciences, Diabetes and Endocrinology, Lund University Diabetes Centre, Malmö, Sweden
Peter Almgren & Leif Groop
Department of Neurosurgery, Kuopio University Hospital, Kuopio, Finland
Juha Jääskeläinen
Department of Genetics, Howard Hughes Medical Institute, Yale University, Chevy Chase, MD, USA
Richard P Lifton
Department of Epidemiology, Institute of Public Health, University of Southern Denmark, Odense Area, Denmark
Kirsten Ohm Kyvik
Mental Health Problems and Substance Abuse Services Unit, National Institute for Health and Welfare, Helsinki, Finland
Jaakko Kaprio
Department of Public Health, University of Helsinki, Helsinki, Finland
Jaakko Kaprio
The Broad Institute of Harvard and MIT, Cambridge, MA, USA
Aarno Palotie & Leena Peltonen
Department of Human Genetics, Wellcome Trust Sanger Institute, Hinxton, Cambridge, United Kingdom
Aarno Palotie & Leena Peltonen
Department of Mathematical Statistics, Stockholm University, Stockholm, Sweden
Juni Palmgren

Corresponding authors

Correspondence to Monica Leu or Samuli Ripatti.

Ethics declarations

Competing interests

The authors declare no conflict of interest.

About this article

Cite this article

Leu, M., Humphreys, K., Surakka, I. et al. NordicDB: a Nordic pool and portal for genome-wide control data. Eur J Hum Genet 18, 1322–1326 (2010). https://doi.org/10.1038/ejhg.2010.112

Received: 29 October 2009
Revised: 24 March 2010
Accepted: 04 June 2010
Published: 28 July 2010
Issue Date: December 2010
DOI: https://doi.org/10.1038/ejhg.2010.112
