The authors Chang et al1 are to be commended for their efforts to educate the larger scientific public about the impact of data quality on complex disease gene mapping. Specifically, the authors document four types of data error common to most linkage studies: errors in phenotype, pedigree structure, marker information, and marker genotypes. These authors also pose a set of excellent questions regarding the importance of data quality and provide empirical answers drawn from their experience with the GenNet whole-genome linkage studies (part of the Family Blood Pressure Program2).

Among the questions posed are: (i) How much of the genome is covered by a 10 cM linkage scan (once data are removed owing to low genotyping quality or Mendelian inconsistency)? (ii) Do allele-shifting markers (ie, markers at which identical alleles are sometimes called differently because different flanking primers, allele-sizing software, or allele-binning methods have been used when STR genotyping of a data set is performed in multiple batches over several years) affect linkage evidence? (iii) Do family structure errors significantly reduce linkage signal? (iv) Is removal of Mendelian inconsistencies an adequate substitute for comprehensive data cleaning (a point illustrated in the sketch below)? The answers from the GenNet example document that comprehensive data cleaning can result both in the removal of false-positive evidence of linkage and in a potentially substantial increase in linkage evidence for true positives. Perhaps most important, these authors (as other authors have recently done3) provide a comprehensive protocol that helps guarantee good data quality and therefore maximal power to localize disease genes for complex traits.
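To make question (iv) concrete, consider the minimal sketch below (written in Python, with hypothetical genotype data; it does not reproduce the GenNet pipeline itself) of the kind of check that flags Mendelian inconsistencies in parent-offspring trios at a single autosomal STR marker.

```python
# Minimal sketch: flagging Mendelian inconsistencies in parent-offspring trios
# at one autosomal STR marker. Data structures and values are hypothetical;
# genotypes are unordered pairs of allele labels (eg, repeat sizes).

def mendelian_consistent(father, mother, child):
    """True if the child's genotype can be formed by drawing one allele
    from the father and one from the mother."""
    a, b = child
    return ((a in father and b in mother) or
            (b in father and a in mother))

# Example trios: (father, mother, child) genotypes at one marker.
trios = {
    "fam01": ((142, 146), (142, 142), (142, 146)),  # consistent
    "fam02": ((142, 146), (142, 142), (146, 150)),  # allele 150 absent in both parents
}

for fam, (fa, mo, ch) in trios.items():
    if not mendelian_consistent(fa, mo, ch):
        print(f"{fam}: child genotype {ch} is Mendelian-inconsistent; "
              f"set to missing or re-genotype")
```

Note that genotyping errors which happen to remain consistent with the parental genotypes, as well as errors in founders for whom no parental data exist, pass such a filter silently; this is one reason why removing Mendelian inconsistencies alone cannot substitute for comprehensive data cleaning.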

This work raises the larger question of how resources should be allocated in the era of whole-genome mapping for complex diseases. With the advent of genotyping technologies that can produce calls for hundreds of thousands of genotypes across the whole genome4 and a widely publicized successful gene localization for age-related macular degeneration using these technologies,5 there is an understandably strong temptation to think that simply increasing data quantity will be a panacea for the ills of unsuccessful complex disease gene-mapping studies. However, a number of factors are involved in designing successful gene-mapping studies.

We illustrate three major factors and their relationships schematically in Figure 1. The figure is a triangle, with each vertex representing one aspect of a study that must be balanced against the other two. The vertices are: study cost, in terms of time and money; data quantity, which includes both the number of subjects for whom genotype and phenotype (diagnosis) information is obtained and the number of genotypes obtained per individual; and data quality, which represents the accuracy of the genotype, phenotype (or diagnostic), and other information (eg, environmental covariates) for each individual in the study. To illustrate use of this figure, consider some examples. If a research team wants to phenotype and genotype large numbers of individuals and, in addition, wants to ensure high accuracy rates, then the team must increase the time or money invested. Similarly, if a team is working with a fixed budget (of either time or money) and wants to ensure high-quality phenotypes and genotypes, then it will most likely have to reduce the sample size studied.

Figure 1

Graphical representation of the balance among resource-allocation factors in whole-genome mapping studies.

It is critically important to bear these points in mind before embarking on large-scale linkage or association studies. Recent research (including that of Chang et al1) indicates that sacrificing data quality will reduce the power to detect loci (or, equivalently, will increase sample size requirements6, 7, 8, 9, 10 for a fixed power level) and/or will shift the apparent location of susceptibility genes,11, 12 so that, in effect, research teams may end up chasing a ‘moving target’. For example, sample size requirements computed under the assumption of highly accurate data may prove insufficient if accuracy is sacrificed in order to collect individuals within a given time frame.
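To give a sense of the direction of this effect, the following is a minimal Monte Carlo sketch (with illustrative parameter values and a simplified non-differential error model, not drawn from the cited studies) of how random genotyping error attenuates the observed case-control allele-frequency difference and thereby reduces power at a fixed sample size.

```python
# Illustrative sketch only: Monte Carlo estimate of how random, non-differential
# genotyping error erodes the power of a simple case-control allele-frequency
# (2x2 chi-square) test. Parameter values and the error model are assumptions.

import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)

def estimated_power(n_cases, n_controls, p_case, p_control,
                    error_rate, n_sim=2000, alpha=0.05):
    """Power when each allele call is independently replaced by a coin-flip
    allele with probability error_rate (same error rate in cases and controls)."""
    hits = 0
    for _ in range(n_sim):
        case = rng.random(2 * n_cases) < p_case        # True = risk allele
        control = rng.random(2 * n_controls) < p_control
        for alleles in (case, control):
            err = rng.random(alleles.size) < error_rate
            alleles[err] = rng.random(int(err.sum())) < 0.5
        table = [[case.sum(), case.size - case.sum()],
                 [control.sum(), control.size - control.sum()]]
        if chi2_contingency(table)[1] < alpha:   # [1] is the p-value
            hits += 1
    return hits / n_sim

for e in (0.00, 0.05, 0.10):
    print(f"error rate {e:.2f}: estimated power "
          f"{estimated_power(500, 500, 0.30, 0.25, e):.2f}")
```

Under this simplified model, the expected observed allele frequency is p(1 - e) + e/2, so the case-control difference shrinks by a factor of (1 - e) and, to a first approximation, the sample size required for fixed power grows by roughly 1/(1 - e) squared; more realistic error models and family-based designs differ in detail, but the qualitative lesson is the same.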

In summary, the importance of Chang et al's1 work cannot be overstated. Data quality is a critically important factor to consider if research teams are to be successful in isolating complex disease loci in this new era of whole-genome mapping.

As a final thought, this author recommends that, in the balance among the three factors presented in Figure 1, research teams sacrifice neither data quality nor data quantity in their searches for complex trait susceptibility loci. That is, research teams might consider increasing cost, especially in terms of time, when performing their studies. The monetary cost per year can be kept lower by extending the period over which a study is completed. With this type of strategy, teams gain the added advantage that additional samples collected along the way may be used to replicate initial linkage and/or association signals. This strategy has been employed to successfully localize susceptibility genes for diseases previously thought intractable, such as schizophrenia.13