Hypospadias is a common congenital condition in boys in which the urethra opens on the underside of the penis. We performed a genome-wide association study on 1,006 surgery-confirmed hypospadias cases and 5,486 controls from Denmark. After replication genotyping of an additional 1,972 cases and 1,812 controls from Denmark, the Netherlands and Sweden, 18 genomic regions showed independent association with P < 5 × 10−8. Together, these loci explain 9% of the liability to developing this condition. Several of the identified regions harbor genes with key roles in embryonic development (including HOXA4, IRX5, IRX6 and EYA1). Subsequent pathway analysis with GRAIL and DEPICT provided additional insight into possible genetic mechanisms causing hypospadias.
Hypospadias is a birth defect with a prevalence of about 4.5 cases in 1,000 newborn boys in Denmark1. Cases present with a urethral opening located on the ventral side of the penis instead of at the tip of the glans. Hypospadias arises between weeks 8 and 16 of gestation2, when the urethra is formed in a complex process that is not yet fully understood. The penile urethra is formed by the fusion of urogenital folds in a proximal to distal direction, but the formation of the glandular urethra could depend on a different process3. In mild cases of hypospadias, the opening is still on the glans, whereas moderate and severe cases have a more proximal opening (Supplementary Fig. 1), sometimes accompanied by penile curvature. Hypospadias can only be treated by surgical repair, and, for the majority of cases, initial surgery provides a satisfactory result4. However, individuals with hypospadias are more likely to experience negative genital appraisal, sexual inhibition, and erection or ejaculation problems later in life3.
The etiology of hypospadias has been studied widely, with both environmental and genetic risk factors investigated. Environmental factors consistently associated with hypospadias are low birth weight and/or being small for gestational age, placental insufficiency, maternal hypertension, pre-eclampsia and maternal intrauterine diethylstilbestrol exposure3. Syndromic forms of hypospadias can be caused by mutations in genes involved in early genital development, and, overall, almost 200 syndromes are associated with hypospadias, including Smith-Lemli-Opitz syndrome (MIM 270400) and Wilms' tumor, aniridia, genitourinary anomalies and mental retardation syndrome (MIM 194072)5. An epidemiological study observed substantial familial aggregation of isolated hypospadias, with recurrence risk ratios of isolated hypospadias for male twin pairs and the first- and second-degree relatives of a case of 50.8, 11.6 and 3.27, respectively1. However, the only well-established genetic association involves a common variant in DGKK6.
These findings motivated us to perform a genome-wide association study (GWAS) based on 1,006 isolated hypospadias cases and 5,486 controls, all genotyped on Illumina Omni chips. We selected 48 SNPs for replication genotyping in up to 1,972 isolated hypospadias cases and 1,812 controls from Denmark, the Netherlands and Sweden, identifying 18 SNPs that were independently associated (P < 5 × 10−8) with hypospadias, including the earlier finding in DGKK (represented by rs4554617); 4 additional SNPs showed suggestive association (5 × 10−8 < P < 1 × 10−6) in the combined analysis. The 22 genetic variants jointly explain 9.4% of the variance in liability to hypospadias.
GWAS and replication
We analyzed 1,006 boys who underwent surgery for hypospadias and 5,486 individuals (2,390 males and 3,096 females) from other GWAS projects as controls. All individuals were of Danish descent and were genotyped with Illumina Human Omni chips. Genotypes were imputed using phased haplotypes from the integrated Phase I release of the 1000 Genomes Project7, and we tested differences in allele dosages between cases and controls for 8,207,076 variants (see the Online Methods for further details on samples, genotyping, imputation and analysis).
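For imputed variants, the allele dosage is conventionally the expected count of the coded allele given the imputed genotype probabilities. The following Python sketch illustrates that convention only; the function name and the probability triples are hypothetical and are not taken from the study data.

```python
def expected_dosage(p_hom_ref: float, p_het: float, p_hom_alt: float) -> float:
    """Expected number of copies (0 to 2) of the coded allele.

    The arguments are the imputed probabilities of the genotypes that carry
    0, 1 and 2 copies of the coded allele, respectively.
    """
    total = p_hom_ref + p_het + p_hom_alt
    if total <= 0:
        raise ValueError("genotype probabilities must sum to a positive value")
    # Normalise in case the probabilities were rounded in the input file.
    return (p_het + 2.0 * p_hom_alt) / total


# Hypothetical probability triples for three individuals at one variant.
examples = [(1.0, 0.0, 0.0), (0.1, 0.8, 0.1), (0.0, 0.25, 0.75)]
dosages = [expected_dosage(*triple) for triple in examples]
```

A dosage computed this way carries imputation uncertainty into the association test instead of forcing a hard genotype call.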
There was a substantial amount of genetic signal in the GWAS, with 12 loci reaching genome-wide significance (P < 5 × 10−8; see Supplementary Figs. 2 and 3 for the quantile-quantile plot and Manhattan plot, respectively). We selected 48 SNPs (prioritizing genotyped SNPs) representing the top 30 associated loci, functional variants and possible independent secondary signals in these loci, as well as SNPs in additional genomic regions harboring genes with similar function to genes in regions already showing genome-wide significant association.
In a cost-effective approach, we initially genotyped 752 cases and 748 controls from Denmark, and 15 SNPs that did not replicate were excluded from further genotyping (Supplementary Table 1). The complete replication set of 1,972 cases and 1,812 controls from Denmark, the Netherlands and Sweden was genotyped for 33 SNPs (see the Online Methods for details on the replication samples). In the combined analysis of the discovery and replication data, 18 SNPs reached genome-wide significance, and 4 additional SNPs had suggestive P values between 5 × 10−8 and 1 × 10−6 (Table 1; see Supplementary Table 2 for detailed results for the 3 replication study groups). The genomic regions are displayed with LocusZoom8 in Supplementary Figure 4. The remaining SNPs split into four SNPs with P > 1 × 10−6 (two with P < 0.05 in the replication study; Supplementary Table 3) and seven SNPs in linkage disequilibrium (LD) with the SNPs listed in Table 1, which were not genome-wide significant after adjustment for the primary SNP (Supplementary Table 4).
To investigate the functional characteristics of the findings in Table 1, we annotated all variants yielding a GWAS P value of <1 × 10−4 in these genomic regions with ANNOVAR9, a tool that retrieves variant- and region-specific functional annotations from several databases (see the Online Methods and Supplementary Table 5 for results). Five missense variants in HAAO (rs3816183), CCDC59 (rs143136847), HOXA4 (rs6962314), HOXA7 (rs2301721) and EML4 (rs28651764) were among the genotyped SNPs. The latter two variants were unlikely to explain the observed association because they were in moderate LD (r2 = 0.57 and 0.24, respectively) with SNPs showing stronger association, and their association became non-significant after adjustment for the corresponding primary SNP (Supplementary Table 4). The rare variant in CCDC59 had the most potential to be damaging according to statistics from SIFT10, PolyPhen-2 (ref. 11), LRT12 and MutationTaster13 (Supplementary Table 5).
Other traits with GWAS findings in the identified genomic regions
To identify other conditions with GWAS findings in the identified hypospadias risk regions, we downloaded all reported variants with association P < 1 × 10−7 from the GWAS catalog and searched for SNPs in LD (r2 > 0.2) with the associated SNPs listed in Table 1 (Supplementary Table 6). We identified seven studies with associations for tooth development14,15 (ADK region), bone mineral density16 (PKDCC region), blood pressure17,18 (EBF1 region), prostate cancer19 (EEFSEC region) and pulmonary function20 (DAAM2 region). The findings on tooth development in the ADK region are particularly interesting, and rs7924176 was part of our replication study but was not significantly associated after adjustment for rs17747401 (r2 = 0.72 with rs7924176; Supplementary Table 4); the risk allele associated with longer time to first tooth and delayed eruption of primary or permanent teeth in children was in LD with the hypospadias risk allele, suggesting pleiotropic effects of ADK in several developmental processes. The C allele of rs7584262 (r2 = 0.56 with rs988958) that is associated with lower bone mineral density of the femoral neck was also in LD with the hypospadias risk allele. Furthermore, rs2999052 (EEFSEC region) was in perfect LD with rs2687729, a SNP with suggestive association (P = 1 × 10−7) in a meta-analysis of age at menarche21; the risk allele for hypospadias corresponded to delayed menarche in that previous study.
GRAIL pathway analysis
The present study identified four loci close to homeobox genes (HOXA cluster, IRX5, IRX6 and ZFHX3), a class of regulatory genes with a key role in morphogenesis during embryonic development. Because of the functional connection between these genes, we investigated all loci listed in Table 1 systematically with the Gene Relationships Across Implicated Loci (GRAIL) method22, a computational tool that looks for similarities in PubMed entries for genes at identified loci without including information on phenotype. On the basis of textual relationships between genes, GRAIL assigns a P value to each gene and subsequently to each region (Online Methods). Overall, 47 genes were analyzed in the 22 regions, and 8 regions with GRAIL P values of <1 × 10−3 were identified (Supplementary Table 7). The pairwise relationships for genes in the associated loci are illustrated in Figure 1, showing that the close connections extended beyond the above-mentioned regions with homeobox genes. There were also multiple connections for DAAM2, EBF1, EYA1, FOXF1 and GREM1. Among the 20 keywords characterizing the functional relationships between the genes were 'patterning', 'transcription', 'development', 'developing', 'formation', 'specification', 'homeobox' and 'embryogenesis'.
DEPICT pathway analyses
We also applied a newly developed computational tool, Data-Driven Expression-Prioritized Integration for Complex Traits (DEPICT), to analyze functional connections between the identified loci in an even more comprehensive way (Online Methods). In total, we analyzed 76 genes in all 46 independent autosomal regions with association P < 1 × 10−5 in the GWAS with regard to 3 aspects implemented in DEPICT (significance in all analyses was defined by false discovery rate (FDR) ≤ 5%).
In the first step, we analyzed enrichment of expression in particular tissues and cell types by testing whether genes in associated regions were highly expressed in any of 209 Medical Subject Heading (MeSH) annotations on the basis of gene expression data from 37,427 microarrays. We observed 20 significant tissue or cell type annotations among the 209 analyzed categories (Supplementary Table 8), including 5 entries from the genital system and 5 from the skeletal system (Fig. 2a). Five of the six significantly enriched cell types were from connective tissue (Fig. 2b). The cell types with the lowest P values (mesenchymal stem cells, stromal cells and fibroblasts) are closely related to each other and are important for the development of the urethra.
The second step was a gene set enrichment analysis, testing whether genes in associated regions were enriched for reconstituted versions of generic gene sets. Here we identified 183 significantly enriched reconstituted gene sets (Supplementary Table 9). Comparing the names of these sets with the names of the remaining 14,264 gene sets with FDR > 5% showed an over-representation of the common key words 'abnormal', 'morphology', 'development', 'bone' and 'morphogenesis' (Supplementary Table 10). We investigated similarities between the gene sets by clustering them on the basis of the correlation between scores for all genes (Online Methods). Many of the resulting 21 meta gene sets were represented by gene sets that are relevant in development or morphology, and there was also substantial correlation between meta gene sets (Fig. 3a). We show the correlation structure within a meta gene set, using 'morphogenesis of an epithelium' as an example (Fig. 3b) because formation of the urethra is an epithelial tube development process.
Finally, gene prioritization analysis was performed to directly investigate functional similarities among genes from different associated regions. A gene obtained a low prioritization P value if there was high functional similarity to genes from other associated regions (Supplementary Table 11). In total, there were 23 significantly prioritized genes across 19 genomic regions, covering 12 of the genomic regions reported in Table 1 (including the HOXA region represented by 5 genes). Three prioritized genes were not supported in the replication step. The remaining 4 genes were not among the 30 loci with the lowest P values in the GWAS and were therefore not selected for replication genotyping. However, not all the SNPs in Table 1 were represented by genes with low prioritization P values, as illustrated by the color coding in Supplementary Table 11. Genes were also annotated with the referring reconstituted gene sets, the tissue and cell types in which their expression was enriched, and the top cis expression quantitative trait locus (eQTL) in whole blood23.
Gene expression analysis in human foreskin
For the associated locus on the X chromosome, it has been shown that expression of DGKK in preputial tissue is lower in carriers of the hypospadias risk allele6. Therefore, we selected 5 genes (ADK, AHRR, EYA1, HOXA4 and IRX5) in the vicinity of the SNPs from Table 1 for functional studies in the preputial skin samples of up to 94 newborn boys representing controls, as well as mild and moderate to severe hypospadias cases (see the Online Methods for details on subjects and analysis). The results of the quantitative RT-PCR (qRT-PCR) analyses showing increased mRNA levels for HOXA4 dependent on the number of copies of the hypospadias G risk allele at rs1801085 (P = 0.04) are shown in Supplementary Figure 5; all other comparisons did not show statistically significant differences.
Estimation of variance explained
We estimated the variance in the liability to hypospadias explained by the associated SNPs in Table 1 and all genotyped SNPs combined using the genome-wide complex trait analysis (GCTA) tool24,25 (see the Online Methods for further details). Calculating the variance explained by associated SNPs using the results from the combined analysis, the 18 SNPs reaching genome-wide significance jointly explained 8.7%, and adding the 4 SNPs with suggestive P values could explain an additional 0.7% (Table 1). Investigating the potential of all 548,642 genotyped SNPs in the GWAS resulted in an overall estimate of 56.9% of the variance explained. We also calculated estimates for the individual chromosomes and compared the estimates using associated SNPs and all SNPs per chromosome in Supplementary Figure 6. There was a clear correlation between the variance explained and chromosome size, with chromosomes 2, 16 and X performing particularly well.
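Variance estimates for a binary trait from an ascertained case-control sample are usually converted from the observed 0/1 scale to the liability scale. The sketch below implements one widely used conversion that corrects for both the population prevalence and the oversampling of cases. It is an illustration under stated assumptions, not a description of the exact settings used here: the observed-scale value is hypothetical, the prevalence of 0.0045 is taken from the Danish figure quoted in the introduction, and the case fraction is derived from the GWAS sample sizes.

```python
from statistics import NormalDist


def observed_to_liability(h2_observed: float, prevalence: float,
                          case_fraction: float) -> float:
    """Convert observed-scale variance explained to the liability scale.

    prevalence    -- population prevalence K of the condition
    case_fraction -- proportion P of cases in the analysed sample
    """
    if not (0.0 < prevalence < 1.0 and 0.0 < case_fraction < 1.0):
        raise ValueError("prevalence and case_fraction must lie in (0, 1)")
    std_normal = NormalDist()
    # Liability threshold above which an individual is affected.
    threshold = std_normal.inv_cdf(1.0 - prevalence)
    # Height of the standard normal density at that threshold.
    z = std_normal.pdf(threshold)
    k, p = prevalence, case_fraction
    return h2_observed * (k * (1.0 - k)) ** 2 / (z ** 2 * p * (1.0 - p))


# Hypothetical observed-scale estimate; K and P as described in the lead-in.
example = observed_to_liability(h2_observed=0.30,
                                prevalence=0.0045,
                                case_fraction=1006 / 6492)
```

When prevalence and case fraction are both 0.5, the conversion factor reduces to π/2, which is a convenient sanity check for the implementation.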
We investigated all 231 possible pairwise interactions between the associated SNPs in Table 1. None of the interactions reached the Bonferroni threshold of P < 2.2 × 10−4 (data not shown).
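The number of tests and the corresponding Bonferroni threshold follow from simple arithmetic: 22 SNPs give 231 unordered pairs, and a family-wise significance level of 0.05 divided by 231 is about 2.2 × 10−4. The short Python sketch below reproduces that calculation; the level of 0.05 is an assumption, as it is not stated explicitly in the text.

```python
from math import comb


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


n_snps = 22                     # SNPs listed in Table 1
n_pairs = comb(n_snps, 2)       # unordered SNP pairs
pair_threshold = bonferroni_threshold(n_pairs)  # alpha = 0.05 is assumed
print(n_pairs, f"{pair_threshold:.1e}")
```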
For the study groups from the Netherlands and Sweden, information on the location of hypospadias was available (Supplementary Table 12). For individuals from Denmark, information on the location of hypospadias was not available, and there was no epidemiological study describing the distribution of locations for a larger number of cases. We examined the SNPs listed in Table 1 for variation in allele frequency between first-degree (anterior), second-degree (middle) and third-degree (posterior) hypospadias (Online Methods). In the combined analysis, seven SNPs showed a reduced risk allele frequency (P < 0.05) in third-degree compared to first-degree hypospadias (Supplementary Table 13). Five SNPs showed similar or larger risk allele frequencies in third-degree compared to first-degree hypospadias (on the basis of β estimates for the combined group).
To investigate whether the association results were dependent on factors changing over time, we analyzed the SNPs in Table 1 for an effect of year of birth in the Danish study group with cases born between 1982 and 2012 (interquartile range of 1993–2005). The risk allele frequency in cases was analyzed depending on the variable (year of birth − 1980) in models with a linear year-of-birth term as well as with a linear and quadratic term (to allow for nonlinear effects). None of the year-of-birth variables reached the Bonferroni threshold of P < 2.3 × 10−3 (data not shown).
Family studies have indicated a strong genetic component in the etiology of hypospadias. The current study identified 18 loci associated with hypospadias at genome-wide significance as well as 4 suggestively associated loci, which jointly explain 9.4% of the variance in liability to this malformation. The association we observed for rs4554617 replicated the locus identified in a previous genome-wide study based on pooled DNA6, a finding that was also replicated in a US study26. Many studies have investigated SNPs or performed mutation screens in candidate genes3, but these studies do not report association for the SNPs reported in this study. The information on variants with frequencies well below 1% in the general population is poor in data imputed from the Illumina chips we used. Therefore, we lacked power to investigate low-frequency variants from mutation screens.
By means of pathway analysis, we show that these loci are functionally connected and point to genes with multiple roles in embryonic development, the most prominent finding being four loci close to different members of the homeobox gene family (HOXA cluster, IRX5, IRX6 and ZFHX3). Some loci even have a direct link to hypospadias. For the HOXA cluster, mutations in HOXA4 have been found in cases of isolated hypospadias27, and a study of HOX genes in human fetal skin28 showed particularly high expression levels of HOXA4. Mutations in HOXA13 are known to cause hand-foot-genital syndrome (MIM 140000)29, a condition frequently presenting with hypospadias in males. Furthermore, it has been shown in mice that Hoxa13 mutants develop hypospadias owing to loss of Bmp7 and Fgf8 signaling30. Another strong candidate gene is EYA1 from the eyes absent protein family, which was first identified in Drosophila melanogaster as being required in embryonic eye development31. Deletion of Eya1 in mice is associated with multiple genitourinary tract defects, including severe hypospadias32. In humans, mutations in EYA1 cause three genetic syndromes not known to involve hypospadias: branchiootorenal syndrome 1 (MIM 113650), branchiootic syndrome 1 (MIM 602588) and otofaciocervical syndrome 1 (MIM 166780).
Our association analysis segregated by the location of hypospadias in the study groups from the Netherlands and Sweden showed reduced risk allele frequencies in posterior compared to anterior cases for seven SNPs. The difference for DGKK was previously reported for rs1934179, and it has been suggested that posterior forms of hypospadias have a different etiology6. Further studies with larger numbers of posterior cases are needed to investigate to what extent the identified variants confer risk for posterior hypospadias.
We investigated the expression of five candidate genes in preputial skin samples from newborns in relation to their SNP genotypes without clear-cut findings; only HOXA4 levels showed an increase with higher numbers of copies of the hypospadias G risk allele at rs1801085. However, this human study was limited in terms of studied tissue and its postnatal origin. It might be necessary to investigate gene expression within the period of urethral development to identify variation relevant for hypospadias.
A recent study provided a comprehensive investigation of in situ gene expression for the genital tubercle in embryonic day (E) 14 mouse embryos based on the Affymetrix Mouse MOE 430 2.0 microarray chip33. Subsequent validation with whole-mount in situ hybridization was also performed for additional tissues from the lower urinary tract, including urethral epithelium and urethral mesenchyme at E13 and for urogenital sinus and urethra at E14. Here tissue relevant for urethral development was investigated within the time period in which hypospadias occurs in mice. A large number of homeobox and forkhead genes were among the highly expressed probes, including Hoxa1, Hoxa10, Irx5, Zfhx3 and Foxf1. Higher expression was also observed for other mouse homologs of genes in the vicinity of SNPs from Table 1, for example, Eya1, Igfbp3, Ebf1 and Dgkk, strengthening the candidacy of these genes in the etiology of hypospadias.
Our pathway analysis shed some light on potential mechanisms causing hypospadias. The three cell type categories with the lowest P values in the tissue and cell type enrichment analysis are of particular relevance for hypospadias: mesenchymal stem cells develop into stromal cells, and fibroblasts are among the most common stromal cells, having a key role in closing the urethral groove. Among the physiological systems, the urogenital system and the musculoskeletal system were mainly highlighted as important, warranting further study of genes with potential roles in both skeletal and urogenital development in embryos. A large number of gene sets associated with development, morphology and abnormal growth were enriched for genes in loci with P < 1 × 10−5 in the GWAS. Furthermore, the associated loci near ADK and EEFSEC are connected to GWAS findings for tooth development and menarche, respectively, suggesting that these genes remain important after embryogenesis.
Even though our study was based on a relatively modest number of individuals with genome-wide SNP data, we are not aware of any pathway analysis of a GWAS of comparable size identifying such a large number of significant physiological systems, gene sets and pathways. Usually, GWAS meta-analysis based on much larger sample sizes is required to identify pathway networks connected to disease. In line with this, our study shows that the common SNPs underlying the GWAS tag more than 50% of the variance in liability to this malformation, substantially more than is observed in other diseases25. We therefore expect that future GWAS and subsequent meta-analysis will identify many additional hypospadias loci, a scenario previously seen in, for example, breast cancer34, Crohn's disease35 and migraine36. Future studies based on extensive sequencing data might identify causative variants at the identified loci and improve the understanding of the etiology of hypospadias.
Overall, our study provides valuable insight into the genetic architecture of hypospadias by identifying many new risk loci and connecting nearby genes in developmental pathways that could also be important for other conditions.
Denmark. Eligible hypospadias cases were boys identified from the Danish National Hospital Discharge Registry who (i) were diagnosed with hypospadias and underwent surgery in infancy; (ii) were singletons; (iii) did not have any major malformations; and (iv) were of Danish ancestry. In addition, we excluded major malformations at birth according to EUROCAT classification. The GWAS was based on 1,006 successfully genotyped cases. The control group consisted of 5,486 individuals without a record of hypospadias surgery in the Danish National Hospital Discharge Registry, all of Danish ancestry. We also included females as controls to increase power. The controls were mainly genotyped as cases for other ongoing GWAS, including for febrile seizures (n = 1,999; 1,072 males, 927 females), opioid dependence (n = 1,316 cases and controls; 802 males, 514 females), atrial septal defect (n = 1,107; 516 males, 591 females) and postpartum depression (n = 1,064 females). Initial association analysis for all projects was completed, so we could rule out association signals for other conditions leading to false positive signals in our hypospadias scan. For the replication study, we successfully genotyped another 1,006 unrelated cases drawn from the same population using the same case definition. As controls, we genotyped 1,012 boys of Danish ancestry without hypospadias from the Danish National Birth Cohort, who were likewise unrelated to the other cases and controls.
Hypospadias cases were born between 1982 and 2012; individuals from the control group were born between 1957 and 2009 (with only 2.4% born before 1982). The study protocol was approved by the Scientific Ethics Committee of the Capital Region (Copenhagen) and the Danish Data Protection Agency. According to Danish law, the Scientific Ethics Committee can grant exemption from obtaining informed consent for research projects based on biobank material under certain circumstances. For this study, such an exemption was granted (H-1-2011-051).
The Netherlands. The AGORA (Aetiologic Research into Genetic and Occupational/Environmental Risk Factors for Anomalies in Children) project of the Radboud University Medical Center in Nijmegen, the Netherlands, is building a database and biobank with questionnaire data and DNA samples from individuals with congenital malformations or childhood cancer and their parents, as well as from unaffected controls and their mothers. All cases resided in the catchment area of the Radboud University Medical Center, and control children were collected through 39 municipalities providing a random sample of the names and addresses of children born between January 1990 and December 2010. For the current study, we successfully genotyped 736 hypospadias cases and 622 control children (300 boys, 322 girls) born between 1980 and 2011, all of European descent and all controls without major birth defects, as determined from questionnaire information. Medical records of all cases were used to exclude syndromic hypospadias cases and to collect the clinical characteristics of cases, including the anatomical location of the urethral opening as determined by experienced pediatric urologists before or during surgery. Anatomical location was subdivided into three categories: anterior (hypospadias sine hypospadias, glandular and (sub)coronal urethral openings), middle (penile urethral openings) and posterior (cases with penoscrotal, scrotal and perineal urethral openings) hypospadias (Supplementary Fig. 1). The Arnhem-Nijmegen Regional Committee on Research Involving Human Subjects approved the AGORA project. All participants and/or their parents gave written informed consent for participation in the study.
Sweden. Swedish hypospadias cases were boys who underwent surgery for hypospadias, had no major malformations and were of European ancestry. For the majority of cases, detailed phenotype information was available, and cases were classified as anterior, middle or posterior hypospadias, applying the same definition as in the Dutch study. Swedish controls were children without any congenital malformation collected consecutively from the Karolinska University Hospital maternity ward; information on parental origin was used to select only children with European ancestry. DNA from cases was isolated from either whole blood or tissue; controls were sampled from placenta. For the current study, we successfully genotyped 230 hypospadias cases born between 1963 and 2011 and 178 control children (89 boys, 83 girls and 6 missing sex information), all born in 2006. All samples were obtained after informed consent from the parents. The Ethics Committee at the Karolinska Institutet has approved the study.
United States. For the study, 94 subjects were included, comprising 62 cases with isolated hypospadias undergoing surgical repair and 32 healthy age-matched male controls undergoing elective circumcision at the Department of Urology of the University of California, San Francisco. The position of the urethral meatus, associated anomalies and family history of hypospadias were assessed by subject survey and physical exam. Of the 62 cases included, 31 had mild hypospadias, defined as ectopic urethral meatus between the corona and midshaft of the penis, and 31 had severe hypospadias, defined as ectopic urethral meatus between the proximal penile shaft and the perineum. Cases with undescended testis, intersex condition or known endocrine abnormalities were excluded from the study. No cases received preoperative testosterone treatment. For cases, excess preputial tissue not used for reconstruction at the time of hypospadias surgery was obtained; no periurethral tissue was taken. For control subjects, excess preputial tissue was obtained during elective circumcision. Both DNA and RNA were isolated from the preputial tissue samples for genotype and expression analysis, respectively. Genomic DNA was extracted using the QIAamp DNeasy Blood and Tissue kit (Qiagen), and total RNA was extracted using the RNeasy Fibrous Tissue Mini kit (Qiagen). The institutional Committee on Human Research at the University of California, San Francisco approved this study, and all parents gave written informed consent for participation in the study.
Genotyping and quality control.
GWAS. The 1,006 hypospadias cases were genotyped on the Illumina HumanOmniExpressExome-8 v1.1 array; the 5,486 controls were genotyped on the Illumina HumanOmniExpressExome-8 v1.1 array, the Illumina HumanOmniExpress-12v1_H array or the Illumina HumanOmni1-Quad v1.0 array. All samples were drawn from the Danish Newborn Screening Biobank and the Danish National Birth Cohort biobank, both of which are part of the Danish National Biobank. For imputation, we used 548,642 SNPs passing quality control; other SNPs were excluded on the basis of a missing rate of >2%, deviation from Hardy-Weinberg equilibrium (P < 1 × 10−6), minor allele frequency of <1%, differences in genotype call rates between genotyping rounds (P < 1 × 10−8) and discrepancies in allele frequencies between sexes for each genotyping round (P < 1 × 10−6). Analyzing the concordance of genotypes for 24 IDs genotyped on multiple chips showed a proportion of identical-by-descent sharing of >0.997 for each individual to itself.
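The exclusion criteria listed above can be expressed as a single predicate per SNP. The Python sketch below is illustrative only: the record type and its field names are hypothetical, and only the numerical thresholds are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class SnpQc:
    """Per-SNP summary statistics (hypothetical field names)."""
    missing_rate: float      # fraction of samples without a genotype call
    hwe_p: float             # Hardy-Weinberg equilibrium test P value
    maf: float               # minor allele frequency
    call_rate_diff_p: float  # P for call-rate difference between genotyping rounds
    sex_freq_diff_p: float   # P for allele-frequency difference between sexes


def passes_qc(snp: SnpQc) -> bool:
    """Return True if the SNP survives all five exclusion criteria."""
    return (
        snp.missing_rate <= 0.02          # exclude missing rate > 2%
        and snp.hwe_p >= 1e-6             # exclude HWE P < 1e-6
        and snp.maf >= 0.01               # exclude MAF < 1%
        and snp.call_rate_diff_p >= 1e-8  # exclude call-rate difference P < 1e-8
        and snp.sex_freq_diff_p >= 1e-6   # exclude sex difference P < 1e-6
    )


# Hypothetical SNPs: the first passes, the second fails on missingness.
good = SnpQc(0.005, 0.4, 0.23, 0.6, 0.8)
bad = SnpQc(0.035, 0.4, 0.23, 0.6, 0.8)
kept = [snp for snp in (good, bad) if passes_qc(snp)]
```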
Replication. Genotyping for the 48 selected replication SNPs was performed using competitive allele-specific PCR (KASP) chemistry (LGC Genomics). Individuals with more than 10% (5 or more SNPs for samples genotyped for all SNPs, 4 or more SNPs for samples genotyped for 33 SNPs) missing genotypes were excluded from analysis. All SNPs had less than 2% missing genotypes and showed no deviation from Hardy-Weinberg equilibrium (P > 0.05).
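The SNP counts in parentheses follow directly from the 10% rule: the smallest whole number of missing genotypes that exceeds 10% of 48 SNPs is 5, and for 33 SNPs it is 4. A brief Python sketch of that arithmetic, with hypothetical function names, is given below.

```python
def min_missing_for_exclusion(n_snps: int, max_missing_percent: int = 10) -> int:
    """Smallest number of missing genotypes that exceeds the allowed percentage."""
    if n_snps < 1:
        raise ValueError("n_snps must be positive")
    # Integer arithmetic avoids floating-point rounding at exact multiples.
    return (n_snps * max_missing_percent) // 100 + 1


def exclude_sample(n_missing: int, n_snps: int) -> bool:
    """True if a sample has more than 10% missing genotypes."""
    return n_missing >= min_missing_for_exclusion(n_snps)


cutoff_all_snps = min_missing_for_exclusion(48)   # samples typed for all SNPs
cutoff_33_snps = min_missing_for_exclusion(33)    # samples typed for 33 SNPs
```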
Imputation and association analysis. For the 6,492 individuals from the GWAS, we imputed unobserved genotypes using phased haplotypes from the integrated Phase I release of the 1000 Genomes Project. We used logistic regression to analyze the GWAS, testing for differences in allele dosages between cases and controls under an additive genetic model. The modest inflation of the test statistic was adjusted for by applying genomic control37 (λ = 1.055). We selected all imputed SNPs or insertion-deletions with minor allele frequency of >1% in at least 1 of the 2 groups (cases or controls) and a SNPTEST38 info value of >0.8, resulting in an analysis based on 8,207,076 variants. Imputation and association testing were performed with SHAPEIT39, IMPUTE2 (ref. 40) and SNPTEST38 software.
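Genomic control rescales association test statistics by an inflation factor λ, commonly estimated as the median observed 1-d.f. χ2 statistic divided by the median of the χ2 distribution with 1 d.f. (about 0.455). The Python sketch below shows this standard form of the correction; the example statistics are hypothetical, and the sketch is not the software pipeline used in the study.

```python
from math import erfc, sqrt
from statistics import median

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(chi2_stats):
    """Estimate lambda as the ratio of observed to expected median chi-square."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def apply_genomic_control(chi2_stats, lam=None):
    """Divide each statistic by lambda; statistics are never inflated."""
    if lam is None:
        lam = genomic_inflation(chi2_stats)
    lam = max(lam, 1.0)  # by convention, lambda below 1 is not applied
    return [x / lam for x in chi2_stats]


def chi2_to_p(chi2_stat):
    """Upper-tail P value of a chi-square statistic with 1 degree of freedom."""
    return erfc(sqrt(chi2_stat / 2.0))


# Hypothetical test statistics, corrected with the lambda reported in the text.
raw = [0.2, 0.5, 1.3, 4.1, 31.0]
corrected = apply_genomic_control(raw, lam=1.055)
p_values = [chi2_to_p(x) for x in corrected]
```

Because every statistic is divided by the same factor, the ranking of variants is unchanged and only the P values become more conservative.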
Given the large number of genomic regions with suggestive association and several genomic regions in the vicinity of functional candidate genes, we designed a cost-effective two-stage replication approach. In the first step, 48 SNPs were genotyped for 16 additional plates (functional unit at LGC Genomics) from Denmark. After this step, 33 SNPs were also genotyped in additional samples from Denmark, the Netherlands and Sweden and in the expression samples from the United States.
We analyzed the replication study groups in PLINK41 and carried out combined analysis of the discovery and replication data using the inverse variance method as implemented in METAL42. We tested for heterogeneity between the GWAS and the three replication groups by applying the I2 statistic43. In associated regions with multiple SNPs in the replication step, we conditioned on the top SNP to explore possible allelic heterogeneity.
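The inverse variance method weights each study's effect estimate by the reciprocal of its squared standard error, and I2 expresses the share of between-study variation that exceeds what chance alone would produce. The Python sketch below shows the textbook fixed-effect form of both quantities; it is not the METAL implementation, and the effect sizes in the example are hypothetical.

```python
from math import erfc, sqrt


def inverse_variance_meta(betas, standard_errors):
    """Fixed-effect inverse-variance meta-analysis with Cochran's Q and I2."""
    if not betas or len(betas) != len(standard_errors):
        raise ValueError("need one standard error per effect estimate")
    weights = [1.0 / se ** 2 for se in standard_errors]
    weight_sum = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / weight_sum
    se = sqrt(1.0 / weight_sum)
    z = beta / se
    p_value = erfc(abs(z) / sqrt(2.0))  # two-sided normal P value
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I2 in percent, truncated at zero when Q is below its expectation.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return {"beta": beta, "se": se, "p": p_value, "q": q, "i2": i2}


# Hypothetical log odds ratios for the GWAS and three replication groups.
result = inverse_variance_meta(
    betas=[0.25, 0.21, 0.30, 0.18],
    standard_errors=[0.05, 0.07, 0.08, 0.15],
)
```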
Annotation with ANNOVAR. The software tool ANNOVAR9 functionally annotates genetic variants by retrieving relevant information from public databases. Information for the associated SNPs from Table 1 and all nearby SNPs with GWAS P < 1 × 10−4 is shown in Supplementary Table 5; column D indicates the SNPs from Table 1 and Supplementary Table 4. In particular, we retrieved gene-based annotations indicating SNPs that caused protein-coding changes, including the affected amino acids, as well as predictions for the possible effect of these variants and conservation scores44,45; of the region-based annotations, we selected predicted transcription factor binding sites.
Other traits with GWAS findings in identified genomic regions. We extracted all 6,903 entries reported with P < 1 × 10−7 in the GWAS catalog and investigated LD for these variants with the SNPs reported in Table 1 using 1000 Genomes Project data. Findings with r2 > 0.2 for one of the SNPs associated with hypospadias are presented in Supplementary Table 6.
GRAIL analysis. GRAIL22 analyzes associated SNPs in four steps. First, the associated regions were defined (primarily on the basis of LD), and genes in these regions were selected for analysis. Second, the relatedness of each selected gene to all other human genes was assessed using a text-based similarity measure. Third, for each selected gene, the number of independent regions with at least one highly related gene was determined. A P value was assigned to this count, adjusting for the number of genes per region (GRAIL P values in column B of the lower table in Supplementary Table 7). Finally, the gene with the lowest P value within a region was selected as a key gene, and the P value for the associated region was assessed, adjusting for multiple testing in regions with several genes (GRAIL P values in column D of the upper table in Supplementary Table 7). A low GRAIL P value indicates that a gene within an associated region is more related to genes in other associated regions through PubMed abstracts than would be expected by chance.
We analyzed the genomic regions of all 22 SNPs displayed in Table 1 with the following settings: gene size correction, off; HG18 Assembly of the Human Genome; Functional Datasource, text_2012_08; Gene list, default gene list (20,167 genes); Queries and Seed Regions, equal.
DEPICT analyses. A comprehensive pathway analysis was performed applying Data-Driven Expression-Prioritized Integration for Complex Traits (DEPICT; T.H.P., J.M.K., Y. Chan, H.J. Westra & A.R. Wood et al., unpublished data). This method is designed to systematically identify the most likely causal gene in a given region, gene sets that are enriched in genetic associations, and tissues and cell types in which genes from associated loci are highly expressed.
Briefly, the DEPICT method prioritizes genes within a given region on the basis of their functional similarity to genes from other associated regions. Genes that are highly similar to genes from other regions obtain low prioritization P values, and simulated GWAS results are used to adjust for gene length bias as well as for other potential confounders. There can be several prioritized genes in a given region.
DEPICT facilitates gene set enrichment analysis by testing whether genes in associated regions enrich for reconstituted versions of known pathways and gene sets as well as protein-protein interaction subnetworks (collectively referred to as reconstituted gene sets). The aim of gene set reconstitution is to enhance pathway definitions by adding genes to predefined gene sets on the basis of data-driven algorithms and by representing gene sets in a probabilistic framework rather than using a binary indication of whether a given gene belongs to a given gene set. In short, gene set reconstitution is accomplished by identifying genes that are coexpressed with other genes in a given gene set based on a panel of 77,840 gene expression microarrays; genes that are coexpressed with genes from a given gene set are likely to be part of that gene set46. Several types of gene sets were reconstituted: 5,984 protein molecular pathways derived from 169,810 high-confidence experimentally derived protein-protein interactions47; 2,473 phenotypic gene sets derived from 211,882 gene-phenotype pairs from Mouse Genome Informatics48; 737 Reactome database pathways49; 184 Kyoto Encyclopedia of Genes and Genomes (KEGG) database pathways50; and 5,083 Gene Ontology database terms51. These reconstituted gene sets were represented in a probabilistic format in which each gene had a score signifying its likelihood of belonging to a given reconstituted gene set. Consequently, reconstituted gene sets might contain different genes than their generic counterparts. In total, 14,461 gene sets were assessed for enrichment in genes in associated regions. To identify independent biological groupings and for visualization purposes, reconstituted gene sets were clustered on the basis of their degree of similarity, as measured by the Pearson correlation between the scores for all genes in a given pair of reconstituted gene sets.
The Affinity Propagation tool52,53 was used for clustering, and clusters were named by their 'representative' gene set, which was automatically chosen by the Affinity Propagation clustering method. Correlations between meta gene sets were calculated on the basis of the scores for the representative gene sets. The Cytoscape tool54 was used to draw network figures.
DEPICT also facilitated tissue and cell type enrichment analysis by testing whether the genes in associated regions were highly expressed in any of 209 MeSH annotations for 37,427 microarrays on the Affymetrix U133 Plus 2.0 Array platform. R was used to construct the bar plots.
In this work, we first used PLINK41 to retrieve independent sets of loci for all autosomal associations with P < 1 × 10−5, which resulted in 52 SNPs (parameters --clump-p1 1e-5 --clump-kb 500 --clump-r2 0.1). DEPICT then assigned genes to associated regions if the genes overlapped or resided within the associated LD window (r2 > 0.5) of a given associated SNP. After merging overlapping regions and discarding regions that mapped within the extended major histocompatibility complex locus (here we conservatively exclude chromosome 6, 25–35 Mb), we were left with 46 non-overlapping regions covering a total of 76 genes. Finally, we ran DEPICT on the 46 regions to identify tissue and cell type annotations in which genes from associated regions were highly expressed, to identify reconstituted gene sets enriched for genes from associated regions (which therefore may provide insight into the etiology of hypospadias) and to prioritize genes within associated regions.
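The region handling described above (merging overlapping regions and excluding the extended major histocompatibility complex locus) can be sketched as a simple interval merge. The Python code below is an illustration only and is not the DEPICT implementation; the function names and the tuple layout are hypothetical, and we assume that any region overlapping chromosome 6, 25–35 Mb, is discarded.

```python
def merge_regions(regions):
    """Merge overlapping (chrom, start, end) regions on the same chromosome."""
    merged = []
    for chrom, start, end in sorted(regions):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start <= last[2]:
            # Overlap with the previous region: extend it.
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def drop_extended_mhc(regions, chrom="6", lo=25_000_000, hi=35_000_000):
    """Discard regions overlapping the extended MHC window (assumed: any overlap)."""
    return [r for r in regions if not (r[0] == chrom and r[1] < hi and r[2] > lo)]
```

Applied to the LD windows of the clumped SNPs, these two steps would yield the set of non-overlapping regions passed on to the enrichment and prioritization analyses.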
Gene expression analysis in human foreskin. Total RNA from preputial tissue samples was extracted using the RNeasy Fibrous Tissue Mini kit from Qiagen. RNA quantity and purity were measured using a Nanodrop spectrophotometer (Thermo Fisher Scientific), and RNA integrity was visualized in agarose gels by the 28S and 18S rRNA bands. We performed RT-PCR to prepare cDNA from RNA samples according to the standard protocol. Briefly, 2.5 μg of RNA was reverse transcribed in a 20-μl reaction volume, and reverse-transcribed products were diluted fourfold with TE buffer (10 mM Tris-HCl, pH 8, 1 mM EDTA).
For qRT-PCR, primers and TaqMan probes for the ADK, AHRR, EYA1, HOXA4 and IRX5 genes were ordered from Life Technologies. The qRT-PCR assays for the five genes were performed with the StepOnePlus system (Life Technologies) according to the standard protocol. Eukaryotic 18S rRNA was taken as an endogenous control. Relative expression in the case and control groups stratified by genotype was calculated according to the equation derived by Livak and Schmittgen55: relative quantity (RQ) of expression = 2^(−ΔΔCt). The Ct value denotes the number of cycles at which the fluorescent signals in the reaction system are detected by the thermal cycler, ΔCt = Ct (target gene) − Ct (18S rRNA), and ΔΔCt = ΔCt (test group) − ΔCt (reference group). For all RQ analyses, the ΔCt value of the major homozygous genotype group for controls was used as the reference, and RQ was calculated for other genotype groups in cases and controls. Data were processed with STATA 10 (StataCorp) and expressed as mean ± s.e.m. To test the hypothesis that gene expression systematically increases or decreases with every additional risk allele in the genotype, we performed the nonparametric test for trend across ordered groups developed by Cuzick56, as implemented in STATA on the basis of the ranks of the expression results across the ordered genotype groups.
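The relative quantification can be illustrated with a short Python sketch of the Livak and Schmittgen calculation, RQ = 2^(−ΔΔCt), using the definitions of ΔCt and ΔΔCt given above. The function names and the Ct values in the usage note are hypothetical and serve only to show the arithmetic.

```python
def delta_ct(ct_target, ct_18s):
    """Delta Ct: Ct of the target gene minus Ct of the 18S rRNA endogenous control."""
    return ct_target - ct_18s


def relative_quantity(dct_test_group, dct_reference_group):
    """RQ = 2 ** (-ddCt), where ddCt = dCt(test group) - dCt(reference group)."""
    ddct = dct_test_group - dct_reference_group
    return 2.0 ** (-ddct)
```

For example, a test group with ΔCt = 15 compared with a reference group with ΔCt = 16 has ΔΔCt = −1 and therefore RQ = 2, that is, twofold higher expression than in the reference group.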
Estimation of variance explained. To estimate how much of the variance in liability to hypospadias could be explained by the 548,642 genotyped SNPs underlying the GWAS, we performed GCTA24,25 analyses for the complete genome and the individual chromosomes and compared these estimates to heritability estimates based on the same data set for the SNPs presented in Table 1. Additionally, Table 1 gives the heritability estimates for the associated SNPs based on the results from the combined analysis of all study groups. The prevalence of the disease was estimated to be 0.0045 on the basis of data from a recent Danish heritability study1. After iteratively excluding individuals, the analysis was based on 891 hypospadias cases and 5,000 controls with estimated relationship Ajk < 0.0251.
In the analysis of the whole-genome data, the heritability estimate was derived by transforming the estimate on the observed scale to the liability scale. We also performed analyses for the individual chromosomes and display these results together with the results for the SNPs from Table 1 (Supplementary Fig. 6).
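The transformation from the observed scale to the liability scale can be sketched as follows. The Python code assumes the standard transformation for ascertained case-control data, h2(liability) = h2(observed) × [K(1 − K)]^2 / [z^2 × P(1 − P)], where K is the population prevalence, P is the proportion of cases in the sample and z is the height of the standard normal density at the liability threshold; the function name is hypothetical and the code is not taken from GCTA.

```python
from statistics import NormalDist


def observed_to_liability(h2_observed, prevalence, case_fraction):
    """Convert an observed-scale heritability estimate to the liability scale.

    prevalence    -- population prevalence K of the condition
    case_fraction -- proportion P of cases among the analyzed individuals
    """
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - prevalence)  # liability threshold
    z = std_normal.pdf(threshold)                      # normal density at the threshold
    k, p = prevalence, case_fraction
    return h2_observed * (k * (1.0 - k)) ** 2 / (z ** 2 * p * (1.0 - p))
```

For the data described here, the function would be called with prevalence = 0.0045 and case_fraction = 891 / (891 + 5,000), together with the observed-scale estimate.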
Analysis of risk allele frequencies for hypospadias subtypes. Detailed information about the location of hypospadias was available for the majority of Dutch and Swedish cases (Supplementary Table 12). The risk allele frequency in cases was analyzed as a function of study group and degree of hypospadias (Supplementary Table 13). First-degree hypospadias and the Dutch study group were chosen as reference categories; thus, the intercept gives the allele frequency for this group, and the differences for the Swedish study group and second- and third-degree hypospadias are estimated. We present the results for the combined group (all) and separately for cases from the Netherlands and Sweden.
GWAS catalog (accessed 14 February 2014), http://www.genome.gov/gwastudies/; PubMed, http://www.ncbi.nlm.nih.gov/pubmed/; EUROCAT classification, http://www.eurocat-network.eu/; 1000 Genomes Project, http://www.1000genomes.org/; R software, http://www.r-project.org/; Gene Ontology Consortium, http://www.geneontology.org/; Mouse Genome Informatics, http://www.informatics.jax.org/; REACTOME database, http://www.reactome.org/; Kyoto Encyclopedia of Genes and Genomes (KEGG), http://www.genome.jp/kegg/pathway.html.
We thank all study participants (as well as their parents) in Denmark, the Netherlands, Sweden and the United States for their cooperation in this study. We would also like to thank everyone involved in data collection and biological material handling in the four study groups (C.H.W. Wijers, S. van der Velde-Visser, K. Kwak, J. Knoll, R. de Gier, B. Kortmann, A. Paauwen, H.G. Kho, J. Driessen and the anesthesiologists of OR 18 for the Dutch group; data collection in the Netherlands was performed as part of a PhD project supported by the Radboud University Medical Center).
B.F. is supported by an Oak Foundation fellowship. T.H.P. is supported by the Danish Council for Independent Research Medical Sciences (FSS) and the Alfred Benzon Foundation. The study was supported by an FSS grant (0602-01455B), the Novo Nordisk Foundation, the Lundbeck Foundation (421/06), the Swedish Research Council, Foundation Frimurare Barnhuset Stockholm, the Stockholm City Council, the Swedish Society for Medical Research and Karolinska Institutet. Funding support for expression analysis performed at the University of California, San Francisco came from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK)-sponsored K12 Urologic Research (KURe) program (5K12DK083021). The funders had no role in study design, execution or analysis or in manuscript writing.
Supplementary Tables 1–13