Introduction

Large-scale individual genotyping is typically used for genome-wide association studies (GWAS).1 This approach is frequently laborious, expensive, and time consuming, especially in studies requiring large numbers of participants.2 Furthermore, initial theoretical expectations of sample sizes required to detect small effects were not met, and now large sample requirements are proposed.3 DNA pooling and group genotyping can dramatically reduce the cost of large-scale GWAS aimed at mapping disease susceptibility loci.4

In DNA pooling, as opposed to individual genotyping, individual samples from n1 cases and n2 controls are combined into two independently pooled samples.4 In general, cases and controls are defined based on either the presence or absence of a disease trait, or as related to a quantitative trait (for example, by dichotomizing the continuous trait by selecting the two extremes of the distribution).4, 5 Disease-associated single-nucleotide polymorphisms (SNPs) are identified by a statistical test4 after correction by multiple testing.6 To corroborate results, SNPs shown to be disease associated in the DNA pools (as well as other SNPs in the vicinity) are genotyped, preferably in an independent sample for strong replication.4

The pooling strategy described above is very efficient in terms of genotyping costs and for the initial discovery phase. However, for replication, DNA samples from new patients are necessary, which constitutes a crucial limitation, especially in the case of rare diseases.

Here, we describe a new pooling/bootstrap-based GWAS strategy (pbGWAS), aimed at the identification of disease-associated SNPs; this strategy makes use of DNA pooling and resampling (bootstrapping)7 to randomly sort samples from cases and controls in order to generate multiple pairs of new comparable DNA pools. Statistical evidence from multiple comparisons of the pairs is combined in a sequential way using formal meta-analysis (Figure 1).

Figure 1
figure 1

Pooling/bootstrap genome-wide association study (pbGWAS) strategy when considering n1=n2=96 cases and controls. After generating a total of k pairs of randomplates (steps) by bootstrapping DNA samples, m SNPs (single-nucleotide polymorphisms) are tested on pooled samples at each step. Results from pairs 1 to k is combined using meta-analytical methods.

PowerPoint slide

To show the high efficiency of the pbGWAS method, in terms of both genotyping costs and statistical power, we present the mapping of several modifying loci that influence the age of onset of Alzheimer's disease (AD), in individuals carrying the PSEN1 p.Glu280Ala mutation (often referred to as E280A mutation), a unique and fully penetrant variant that segregates in one of the largest and clinically well-characterized multigenerational pedigrees.8, 9 Along with a unique causal mutation, the pedigree inhabits a homogeneous environment, and the mean age of onset of AD falls quite tightly around a mean age of 47 years (s.d.=6.4 years).9 However, there are outliers with onset ages ranging from 34 to 65 years. The signs and symptoms of the entire kindred with early-onset Alzheimer's disease (EOAD) have been rigorously followed,8 as we have identified the vast majority of the p.Glu280Ala mutation carriers.

Using pbGWAS with eight pairs of resampled DNA pools from cases (p.Glu280Ala carriers with EOAD) and controls (p.Glu280Ala carriers with late-onset Alzheimer's disease (LOAD)) at a total genotyping cost <$20,000, we successfully located new loci with associations that exceed standard thresholds for GWAS significance and also replicated other loci previously reported to be associated with AD.

Materials and methods

Subjects

A total of 1784 descendants of AD patients with confirmed PSEN1 p.Glu280Ala mutations were enrolled in the E280A Antioquia cohort study at the University of Antioquia, Colombia, between January 1, 1995 and January 27, 2010. These patients were clinically followed every 2 years. Detailed description of the enrollment, clinical characteristics, and inclusion criteria are reported elsewhere.8 The PSEN1 mutation in this pedigree is the result of a founder effect dating back to the seventeenth century; now, after 20 generations, the pedigree includes >3000 people with several hundred carriers of the mutation. p.Glu280Ala AD patients with known age of onset of AD symptoms were selected for this specific study. Because of the size of the pedigree, any bias of ascertainment and family structure effect may be considered as minimal. Based on a survival analysis,8 age cutoff values were matched to the 40th and 60th percentile of the survival curve, and patients for whom AD symptoms appeared before 48 years of age (EOAD) were classified as cases; those for whom the age of appearance was after 50 years of age (LOAD) were classified as controls8 (Figure 2). A total of 64 cases and 38 controls, all of them PSEN1 p.Glu280Ala carriers, included in this study, were selected from the 1784 participants enrolled in the study by Acosta-Baena et al.8

Figure 2
figure 2

Definition of case and control individuals in our cohort of Alzheimer's disease patients based on a survival analysis of disease progression in PSEN1 p.Glu280Ala carriers. Modified from Acosta-Baena et al.8 MCI, mild cognitive impairment.

PowerPoint slide

DNA extraction and pool preparation

DNA was isolated from whole blood using the traditional phenol/chloroform extraction followed by ethanol precipitation. The Infinium HD10 and GoldenGate11 assays have low DNA loading requirements and offer the flexibility to support diverse experimental designs. DNA extracted from blood samples, using the solid-phase platforms from QIAGEN (QIAGEN Sciences, MD) produces high-quality genotypes and copy number variants (CNV) data. In our study, we went further and performed phenol/chloroform extraction of DNA samples to guarantee the highest purity possible (phenol/chlorophorm extraction is the preferred method of DNA isolation for more sensitive technologies like next-gen sequencing). DNA samples were verified for integrity by agarose gel electrophoresis and samples with 260/280 absorbance ratios <1.8 were again phenol/chloroform-extracted. Once the DNA samples were ready for the pooling experiments, their concentrations were measured fluorometrically to guarantee accuracy. DNA concentration was measured using the Qubit fluorometer (Invitrogen, Carlsbad, CA, USA). To prepare the DNA pools, individual DNA samples were diluted to 75 ng μl−1 and transferred to two 96-well plates, one for EOAD and one for LOAD. In all, 5 μl of each of the selected samples for a particular pool were mixed together in a single tube (see Figure 1) and submitted for genotyping.

In addition, to control the inherent variability attributable to technical issues, we also did try to minimize any variability originated as a consequence of different technical and personal skills. A single senior technician was involved in the generation of the stem pools and subsequent bootstrapped subsets. Furthermore, processing of pools at each step was done simultaneously for cases and controls to avoid any batch effects.

Comparable pools were constructed by selecting half of the individuals’ DNA samples available from each group using a bootstrapping procedure. Though each pair of DNA pools did not contain either duplicated or triplicated individuals, a DNA sample can result to be (randomly) selected to be part of more than one pool as a consequence of the resampling process.

pbGWAS strategy

DNA pools generation

We used the following algorithm for DNA pool generation: let h be the total number of new pairs of DNA pools to be generated using resampling/bootstrapping and n1 and n2 be the total number of individual DNA samples available for cases and controls, respectively. In order to illustrate how the strategy works, let us assume, without loss of generality, that n1=n2=n=96 such that all samples can be easily organized in a 96-well format plate (Figure 1). To guarantee the highest variance in the resampling process, we randomly selected half of the individuals’ DNA samples [nj/2] from each group, j=1,2. Although in practice it is possible to generate a very large number of bootstrap replicates, this is not always feasible and some general alternatives have been proposed.12, 13

Allele frequency estimation

Genotyping was performed at the NHGRI Genome Technology Branch using 370CNV-Quad SNP-chips from Illumina (www.illumina.com). Genotyping was performed using Illumina ‘Infinium assay’ protocol.10 In brief, the DNA was whole-genome amplified, fragmented, hybridized, fluorescently tagged, and scanned. Allele frequency values for each of the eight pairs of DNA pools were calculated, extracted, and exported in suitable formats using the Illumina's GenomeStudio Genotyping Module. Standard quality control was applied.

Determining disease-associated SNPs

Without loss of generality, assume that (i) h SNP chips consisting of s diallelic markers have been genotyped in cases and controls, (ii) the allele frequency of the major allele is estimated within each group using standard methods,4 and (iii) after experimental and statistical quality control, m (m<s) SNPs are available for analysis. Let p1,i and p2,i be the allele frequency for the i-th SNP in cases and controls, respectively, i=1,2,…,m. To test whether the allele frequency in the cases is not statistically different from that in the controls, the null hypothesis is stated as

with the alternative hypothesis H1,i being either (i) p1,ip2,i, (ii) p1,i>p2,i, or (iii) p1,i<p2,i. We do not consider (i) as it does not provide information about the direction of the difference in the allele frequencies.

For two independent pools, the test statistic comparing the allele frequencies for the i-th SNP has the following form4, 14, 15

where 1 and 2 are the estimated allele frequencies of the major allele in the pair of pools for cases and controls, respectively, and is the variance of 1,i2,i. When a two-pool design is used, can be calculated as14

where i=(n1+n2)−1(n11,i+n22,i) is the common allele frequency, τ is the coefficient of variation of the number of DNA molecules of locus A contributed by each individual, and ɛ2 is the variance of the pool-measurement error.14, 16

For testing (1), we propose a modification of (3) by introducing the estimated allele frequencies from both pools so that the estimation of i is no longer required, and we can focus the analysis on the difference of the allele frequencies between cases and controls for the SNPs being tested. Hence, the test statistic is

with

With the current high-throughput genotyping technologies and DNA extraction techniques, it is possible to obtain accurate estimates of the allele frequency, so the sampling variation and random experimental errors are negligible compared with ɛ.16, 17 Now, if ɛ is assumed to be close to zero, (4) can be written as

which follows a χ2 distribution with one degree of freedom under the null hypothesis. When τ and ɛ2 are known, (4) is preferred. Note that high values of Ti*2 indicate a stronger association between the i-th SNP and the disease.

Let Pi be the P-value calculated for the i-th SNP for testing (1). For each of the h new pairs of DNA pools generated, disease-associated SNPs are detected based on the P-values calculated from the T* statistic for each of the m SNPs passing quality control. For m independent tests for each pair of pools, the family-wise error rate, defined as the probability that one or more of the significance tests results in a type I error6 must be controlled, and standard methods can be applied, including Bonferroni,18 false discovery rate (FDR),19 spectral decomposition, which corrects by multiple testing when SNPs being tested are in linkage disequilibrium,20 and a gene-based (GB) multiple testing correction.21 Here we only focus on the FDR as it is particularly well suited for exploratory analyses.22 Thus, for each of the h pairs of new DNA pools, the disease-associated SNPs are those such that Pi,j*α, with Pi,j* the FDR-corrected P-value for i-th SNP when the j-th pair of new DNA pools is being tested, i=1,2,…,m, j=1,2,…,h.

Combining P-values

Up to this point, we have illustrated how to generate new DNA pools for cases and controls by randomly selecting individuals’ DNA samples and rearranging them accordingly, as well as the statistical hypothesis to be tested and how to determine whether the SNPs are disease-associated. Next, we shall describe the method selected for combining the P-values generated for each of the h pairs of new pools being compared.

Meta-analytical methods combine the results of multiple studies addressing a research question.23, 24 The combination of one-tailed P-values is one of the most widely used meta-analytical methods for summarizing information from k independent studies.24 Fisher's combined probability test25 and Stouffer's weighted Z-transform26 are two such methods. In Fisher's method, the P-values from k tests are combined using the test statistic

which follows a χ2 distribution with 2k degrees of freedom under the null hypothesis of no significant effect in any study. Stouffer's method, on the other hand, takes advantage of the one-to-one mapping of the standard normal distribution to the P-value and uses the test statistic27

which follows a standard normal distribution under the null hypothesis (see above). In (8), wl represents the weight and Zl the quantile of the standard normal distribution (SND) associated with the P-value for study l, l=1,2,…,k. When using either method, large values of the test statistic indicate that at least one of the studies being combined can reject its null hypothesis.23, 27 In our context, this means that at least in one of the new pair of pools the allele frequency for the i-th SNP, i=1, 2,…, m, is greater in cases than in controls. However, no information regarding the size of the experimental effect is available.24 Because the Fisher's method is asymmetrically sensitive to small P-values compared with large P-values and is less powerful than Stouffer's method,27 the latter was selected for combining P-values.

As we randomly select individual DNA samples from n1 cases and n2 controls to construct the multiple new DNA pools, the assumption of independence across the multiple pairs does not hold. However, it is possible to use Stouffer's method to combine dependent P-values by introducing some degree of dependence (correlation) between pairs.28 Let −(k+1)−1ρ1 be such correlation and suppose that ρ is known. Thus, (8) can be written as28

with A=∑l=1kwl2 and B=(∑l=1kwl)2. When ρ is not known, it can be estimated as28

where Φ−1(·) is the inverse of the SND, V[·] is the variance operator and Z=(Z1, Z2,…,Zk) is a vector containing the quantiles of the SND for the k studies. It follows28 that the test statistic is

with A and B as in (9), C=*+√2κ (1−*)(k+1)−1/2 and κ0 a parameter regulating the significance level. Note that the one-tailed combined P-value, , can be easily computed as under the null hypothesis of no significant effect in any study, both (9) and (11) follow a SND. If desired, can easily be converted to a two-tailed test.27

Individual genotyping

In order to validate our DNA pooling results, and to test independence of the type of chip used, we individually genotyped 23 samples (13 EOAD, 10 LOAD) using Human610-Quad SNP-chips from Illumina, and 48 samples (27 EOAD, 21 LOAD) using Human370-Quad SNP-chips. These individuals were selected using the same bootstrap strategy described before to generate the DNA pools.

Genetic statistical analysis

For the individually genotyped sample, genotypic (using dominant, recessive, and additive models) and allelic tests were performed as implemented in Golden Helix's Genotype module (Golden Helix, Bozeman, MT, Golden Helix PBAT Software, http://www.goldenhelix.com) with the onset age status (EOAD vs LOAD) as the phenotype. For all analyses performed, significance of markers was determined using the raw P-value for the corresponding test. Further, with this individually genotyped sample, we tested the potential presence of microdifferentiation (genotype stratification) between the EOAD and the LOAD samples, as this would potentially introduce important noise at each bootstrapping step that consequently could invalidate any association result. To evaluate that possibility, we used the principal component analysis as implemented in the Golden Helix PBAT Software.

Gene-based association

First, the P-values for the top 5% of SNPs in our discovery phase, as well as their corresponding P-values in the individual genotyping analysis, were selected and combined using (9) with ρ=0. Second, a GB association test using VEGAS (versatile GB association study)21 was performed on the latter combined P-values. VEGAS is a freely available software used for performing GB association tests, which produces a GB test statistic and then uses a simulation-based approach to calculate an empirical GB P-value. By default, patterns of linkage disequilibrium for each gene are estimated using the HapMap2 CEU population.21 SNPs and/or genes shown to be statistically significant using either test were classified as modifiers of the age of onset in AD.

Results

Discovery phase

A total 102 DNA samples (n1=64 cases and n2=38 controls) were considered in this study. Comparison of the age of appearance of AD symptoms in both groups disclosed statistically significant differences (EOAD: average=45.26 years, s.d.=2.24, range=39–48; LOAD: average=54.89, s.d.=4.64, range=51–75; t=−11.978, d.f.=47.481, P<5 × 10−15). Eight pairs of DNA pools were generated using an algorithm written in R29 as previously described. Allele frequencies for 373,397 variants (CNVs and SNPs) for all DNA pools were estimated using the 370CNV-Quad SNP chip, but only 287,368 were included in the final analysis. CNVs (22 K) were excluded as they were not used for analysis and 64 K SNPs were excluded because at least one of the allele frequencies in at least one of the 16 SNP-chips was <1 or >99%. The call rates for the 16 DNA pools ranged from 65.1 to 66.2% when clustered with the Illumina cluster file, and from 97.6 to 99% when re-clustered against each other. A replication sample, selected ad hoc, was used for quality control demonstrating the high reliability of the allele frequency estimation (Figure 1, see Supplementary Material online). Q-Q plots for the FDR-corrected P-values obtained at each step of comparison depict good control of the type-I error probability, with SNPs with P<10−4 shown in green (Figure 2, see Supplementary Material online). In addition, an empirical evaluation of the type-I error probability was implemented using a simulation procedure that considered different scenarios, for example, different number of cases and controls, markers, and number of steps. Methods used for this simulation process as well as results are presented in the Supplementary Material (see Supplementary Figures 4–8, and Supplementary Table 1 online). In general, it was demonstrated that the type-I error probability was well controlled by our pbGWAS strategy.

Disease-associated SNPs for each pair of new DNA pools were determined using the corresponding FDR-corrected P-value of (6) with n1=64 and n2=38. Further, to obtain , FDR-corrected P-values from the eight pairs of pools were combined using (9) with ρ=0.95 and wl=1, l=1,2,…,8. Figure 3 depicts Manhattan plots for loci modifying the age of onset in AD based on our discovery phase. As this figure shows, SNPs close to or within the Nephronophthisis 1 (NPHP1) gene on chromosome 2, the Arylsulfatase J precursor (ARSJ) gene on chromosome 4, the Calcium-Dependent Secretion Activator 2 (CADPS2) gene on chromosome 7, and the CD44 antigen isoform 1 precursor (CD44) gene on chromosome 11 were found to be statistically significant after combining the information from the generated DNA pools. Although additional loci were found to be significantly associated with the age of AD onset in our discovery phase, these four were selected as they are novel and because patterns of expression, as well as potential function, suggested them as promising candidate loci.

Figure 3
figure 3

(ad) Manhattan plot of the results of applying a pooling/bootstrap genome-wide association study (pbGWAS) strategy with eight pairs of DNA pools generated from patients from a unique and clinically well-characterized multigenerational pedigree affected by Alzheimer's disease with different ages of onset, all of whom carry the PSEN1 p.Glu280Ala (E280A) mutation. At the y axis, the –log10(P-value) for autosomal single-nucleotide polymorphisms are represented by dots; the x axis corresponds to the genomic coordinates. For display purposes, values in the y axis were smoothed by the median as implemented in Golden Helix. Abbreviations follow the same notation as in Figure 1. Though rs10173717 is 20 Kb downstream NPHP1 and 2.5 Kb downstream of NCRNA00116, NPHP1 appears to be a better candidate than NCRNA00116 becauseNPHP1 encodes a protein that interacts with PTK2B and BCAR1.44, 45 PTK2B, in turn, encodes a cytoplasmic protein having an important role as intermediate between neuropeptide-activated receptors or as neurotransmitters that increase calcium flux and the downstream signals that regulate neuronal activity. BCAR1, in turn, is involved in cellular migration, survival, transformation, and invasion.46 Furthermore, the gene-based approach showed that a total of 10 markers inside the NPHP1 gene (chr2: 110 238 202–110 319 928), clustered associated (statistic=177.01, P=2.08 × 10−4). This gene-based association was not present in NCRNA00116.

PowerPoint slide

In Table 1, we present the top 30 SNPs based on the values, the raw P-value for the basic genotypic test, and the combined P-value of both (see Materials and Methods section). Several SNPs shown here to act as modifiers of the age of onset in AD involve genetic loci previously reported to be associated with frank AD (that is, with the presence or absence of AD). For example, the CADPS2 gene (marker rs3757536, Table 1) was found to be associated with AD in a voxelwise GWAS (vGWAS) aimed at detecting genes related to brain structure and function.30 In the same vein, CD44 gene (marker rs187116, Table 1) expression levels in lymphocytes are higher in AD subjects than in healthy subjects.31 In addition, CD44-positive astrocytes have been shown to be involved in shaping normal neuronal mechanisms as well as in neurodegenerative process such as those involved in AD, Huntington, and Parkinson diseases.32, 33

Table 1 Pooling/bootstrap GWAS, individual genotyping, and combined P-values for the top 30 SNPs potentially modifying the age of onset in AD patients carrying the p.Glu280Ala mutation.

Individual genotyping

Call rates for the 23 samples (using the Illumina Human610-Quad SNP-chips) ranged from 37.8 to 99.8% when clustered with the Illumina cluster file, and from 56.2 to 99.6% when re-clustered against each other. For the Human370-Quad SNP-chip (48 samples), the calls ranged from 86.4 to 99.9% when clustered with the Illumina 370CNV-Quad cluster file. As implemented in GenomeStudio, genotypes for all variants were retrieved and further exported in the appropriate format to run genetic models using GoldenHelix. In Table 1, we present the corresponding raw P-values for the top 30 SNPs. Test for microdifferentiation showed no effect of stratification with two principal components extracted, excluding the possibility of type-I error inflation as a consequence of genotype stratification.

Combined statistics

A total of 14,369 SNPs were in the top 5% after combining the P-values from the discovery phase. Table 2 reports the genes shown to be statistically significantly associated with AD age of onset after performing the GB test on the combined P-value from the discovery phase and the individual genotyping using VEGAS. Q-Q plots for the FDR-corrected P-values obtained for the combined values demonstrate good control of the type I-error probability; SNPs with P<10−4 shown in green (Figure 3, see Supplementary Material online).

Table 2 Results of a gene-based association tests (VEGAS) for the combined P-values from the discovery phase and individual genotyping.

Several studies making use of DNA pooling have consistently reported a very high allele frequency estimation reliability and accuracy when it is compared with determination of allele frequencies by individual genotyping.15, 34, 35 These results are even more consistent when the pool involves individuals of similar ethnic background (as it is the case of this study)34 and remain still consistent throughout different platforms (in this study, we use a unique platform).15, 35 In order to evaluate both accuracy and reliability of allele frequencies estimated by DNA pooling in this particular study, we evaluated the correlation between gene frequencies estimated by DNA pooling with those determined by individual genotyping. Thus far, for each one of the eight pools, gene frequencies were determined for the subset of patients that were individually genotyped and making part of the specific pool. Supplementary Table 2 describes the number of DNA samples used in each step of the pbGWAS strategy that also were individually genotyped.

It is worth mentioning that numbers do not add to 100% of the samples subjected to each step of pooling because the subset—that of individuals that were subjected to individual genotyping—was obtained by bootstrap drafting. First, we compared heterozygosity values (calculated as 2pq) of DNA pools with those of the individual genotypes represented in their respective pool. Using the heterozygosity values from both strategies for cases and controls, we further calculated the linear correlation coefficient and 95% confidence intervals associated with it (see Supplementary Table 3 online). Heterozygosity values were significantly correlated. Supplementary Figure 9 depicts the cumulative distribution function of the heterozygosity values in cases and controls for individual genotyping (red dots) and DNA pooling (black dots), demonstrating full consistency of both distributions.

Discussion

In the recent era of GWAS, the identification of any significant association for complex diseases, including neurodegenerative and psychiatric disorders, meeting the rigorous criteria for statistical significance has been shown to be a very expensive and sometimes futile endeavor. For AD and other conditions, empirical data verified that initial sample size estimations from power calculations based on the assumptions of the common disease–common variant hypothesis (for example, n1000 for cases and controls) were not adequate to detect associations, and this extended to the meta-analyses of samples combined across studies to generate sample sizes of 10,000 cases and 10,000 controls. Now, even larger sample sizes are being used, with 50,000 cases and 50,000 controls, which may be required to detect associations using standard GWAS methods.

Based on the use of large numbers of samples, GWAS have been successful in identifying novel loci that confer susceptibility to AD. However, the costs and logistics involved in reaching such definition, resolution and power are highly demanding as well. Furthermore, because of the implicit clinical and genetic heterogeneity of neurodegenerative and psychiatric conditions, collection of sample sizes of this magnitude is almost impossible.

Here we have described a theoretical method, pbGWAS, and applied it empirically to a homogeneous sample of patients with one of the most severe forms of EOAD. These patients carrying the p.Glu208Ala mutation belong to a unique and clinically well-characterized multigenerational pedigree that segregated AD as consequence of a founder effect.8 The pbGWAS method combines the established strategy of DNA from exquisitely well-defined homogenous cohorts, and a new a sequential strategy of resampling (bootstrapping) from small samples of cases and controls. These procedures provided increased statistical power in order to reach the rigorous criteria for statistical significance needed by a GWAS. In contrast to the traditional GWAS approach, we have been able to identify new loci potentially implicated as modifiers of the age of onset of AD using limited and a relatively small sample.

Because DNA samples from cases and controls are repeatedly used to construct the DNA pools and test disease-associated markers, the pbGWAS strategy, as other methods that combine evidence, is more powerful.36 In fact, by constructing eight pairs of comparable replicated DNA pools using 32 EAOD and 19 LOAD individuals, each time, and combining the information, the resulting effective sample size is 408. This sample size is comparable to 471 that, for a power of 80%, is the one required to detect a moderate odds ratio of 2.5, assuming a minimum allele frequency of 0.3, a Bonferroni genome-wide corrected level of significance of 5%, genotype frequencies in cases, controls, and in the combined group in Hardy Weinberg Equilibrium, and using 1 million of SNPs (http://bioinformatics.centrillionbio.com/sscalc/). At an average genotyping cost of US$300 per sample, the total cost would have been at least US$141,300. Therefore, savings for using our approach add up to US$120,000.

Although it is suggested that DNA pooling must be conducted as an initial screening and further individual genotyping is required for confirmation, especially for the most promising loci,15 we argue that by genotyping only the top hits we would miss the opportunity to apply GB (as opposed to marker-based) association analysis. An additional practical argument is that prices of genotyping from thousand to million SNPs have very small differences.

Despite the herein demonstrated cost savings and intriguing results accomplished by the pbGWAS strategy in this particular case, there are some intrinsic limitations associated to the SNP-chip type and technology, inherent variability of phenotyping, sample size, and the presence of population stratification. Along with the problems of phenotype outline, population stratification is probably the more important constraint to large-scale association studies, either using individual genotyping37 or DNA pooling.35

DNA pooling strategies are highly sensitive to these effects, as the real influence of genotype stratification (true microdifferentiation) cannot be discriminated from allelic stratification. Although novel statistical methods have been proposed to deal with this problem, including the use of either background markers,38, 39 genomic controls,40, 41 structured association,42, 43 or hundreds of thousands of markers,37 for this particular type of strategy we strongly recommend the use of methodologies, as the one described by Turakulov and Easteal,39 that not only define a minimum number of highly selected SNPs but which can also be applied under a sequential strategy.

An additional issue refers to the definition of the sample size, not only for the original stem pools, but also for the size of the bootstrapped pools and number of steps. Though in theory, the higher the size of the stem pool, the higher the potential number of bootstrapped replicated sub-samples (higher variance), in practical terms pools of 96 samples that adjust to standard plates of a broad number of current technological robotic supports, will be of choice. Similarly, we have theoretically and empirically observed that bootstrapped replicates of 50% maximize the information. With regards to the number of sequential steps of replication, we empirically found that once reached the forth step of resampling, quantitative fluctuations are small.

Though the need for replication is clear, we considered that the demonstrated technique, as well as the new findings described here, might provide important insights both into the natural history of AD, as well as a novel methodology that can be applied to a wide variety of conditions. These results will feedback from further replications in cohorts of patients with sporadic forms of AD, other type of dementia, and patients with minimal cognitive impairment. Though it is difficult to warrant that the same group of genes might be commonly causal for this wide spectrum of pathologies, the fact that some usual suspects, that is, CADPS2, are showing up, indicates that these findings originated from the use of conspicuous and homogenous forms of a pathology might be proficient for broad applications.