Introduction

Large-scale, high-density genome-wide association studies (GWAS) have greatly facilitated the discovery of genetic variants that affect predisposition to human complex diseases.1 The results of recent GWAS suggest that most susceptibility loci make small contributions to phenotypic variation, which indicates that a large number of susceptibility loci must be involved in the genetic basis of complex disease.2, 3, 4, 5, 6, 7, 8 These findings are consistent with the polygenic model of the genetic etiology of complex diseases, proposed almost a century ago.9 At the same time, cases are often divided into sporadic and familial on the basis of family history. This distinction implies some sort of genetic heterogeneity: either environmental factors are more important in sporadic cases, or sporadic cases arise from new mutations of large effect size. Integrating this distinction into an understanding of the genetic architecture of complex disease requires knowing the frequency of sporadic cases that is consistent with the polygenic model.

In this study, we investigate the relative proportions of sporadic and familial cases expected under the polygenic threshold model. We assume that disease susceptibility is genetically homogeneous in the population and that observed illness results when the cumulative effect of multiple common genetic and environmental factors exceeds a certain threshold.10, 11, 12, 13 We conducted a simulation study to estimate the probability that a proband case has no family history. Furthermore, we investigated the relationship between disease prevalence, heritability of liability, recurrence risk (RR) and family size, and analytically predicted the heritability of liability and the proportion of sporadic cases in human complex diseases under the polygenic model.

Methods and results

We simulated pedigrees with three generations. For each pair of parents in the first and second generations, the number of offspring was independently drawn from a Poisson distribution, as in Cui and Hopper.14 The structure of the simulated pedigree is illustrated in Figure 1. The disease liability (y) of each individual in the pedigree was simulated by a simple genetic model y=a+e, where a is the additive genetic effect; e is the environmental effect, e ∼ N(0, 1−hL2); and hL2 is the narrow-sense heritability of liability, so that liability has unit variance. The additive effects of the individuals in the first generation were drawn from a ∼ N(0, hL2), whereas those in the second and third generations were generated by a=0.5aF+0.5aM+m, where aF and aM are the additive effects of the parents, and m is the effect due to Mendelian segregation, m ∼ N(0, 0.5hL2) (Appendix). A range of heritability levels (hL2=0, 10,…100%), along with three levels of disease prevalence (K=10, 1 and 0.1%), was considered to be representative of human complex diseases. An individual was considered to be diseased if y>T, where T is the threshold on the standard normal distribution that truncates the upper proportion K. A total of 100 000 pedigrees with at least one disease case (proband) in the third generation were generated for each combination of hL2, K and S, where S is the mean family size per couple. A proband case was considered to have a family history of disease under two different definitions: (I) at least one first-degree relative is affected, and (II) at least one first-, second- or third-degree (1°, 2° or 3°, illustrated in Figure 1) relative is affected. The probability of a sporadic case, P(sporadic), was defined accordingly.
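The liability model above can be sketched in a few lines of Python. This is a simplified illustration, not the simulation used in the study: it covers only a two-generation nuclear family (so only parents and siblings count towards family history), counts each family with at least one affected child once, and uses hypothetical function names (`poisson`, `simulate_nuclear_family`, `sporadic_fraction`) chosen for this example.

```python
import math
import random
from statistics import NormalDist


def poisson(rng, mean):
    """Draw a Poisson variate with Knuth's method (adequate for small means)."""
    limit, count, product = math.exp(-mean), 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_nuclear_family(rng, h2, mean_size):
    """Return liabilities of two parents and their Poisson-distributed offspring."""
    sd_a = math.sqrt(h2)          # founders: a ~ N(0, h2)
    sd_e = math.sqrt(1.0 - h2)    # environment: e ~ N(0, 1 - h2)
    sd_m = math.sqrt(0.5 * h2)    # Mendelian segregation: m ~ N(0, 0.5 * h2)
    a_f, a_m = rng.gauss(0.0, sd_a), rng.gauss(0.0, sd_a)
    parents = [a_f + rng.gauss(0.0, sd_e), a_m + rng.gauss(0.0, sd_e)]
    offspring = []
    for _ in range(poisson(rng, mean_size)):
        a = 0.5 * a_f + 0.5 * a_m + rng.gauss(0.0, sd_m)
        offspring.append(a + rng.gauss(0.0, sd_e))
    return parents, offspring


def sporadic_fraction(h2, K, mean_size, n_families=100_000, seed=1):
    """Fraction of ascertained families whose proband has no affected parent or sibling.

    Simplifying assumption of this sketch: a family is ascertained if at least
    one child is affected, and the proband is sporadic if that child is the
    only affected member of the nuclear family.
    """
    T = NormalDist().inv_cdf(1.0 - K)  # threshold truncating the upper proportion K
    rng = random.Random(seed)
    ascertained = sporadic = 0
    for _ in range(n_families):
        parents, offspring = simulate_nuclear_family(rng, h2, mean_size)
        n_affected = sum(y > T for y in offspring)
        if n_affected == 0:
            continue  # no proband in this family
        ascertained += 1
        if n_affected == 1 and all(y <= T for y in parents):
            sporadic += 1
    return sporadic / ascertained if ascertained else float("nan")
```

Calling `sporadic_fraction(0.5, 0.001, 2.0)` and `sporadic_fraction(0.5, 0.1, 2.0)` gives a quick feel for how strongly the sporadic fraction depends on prevalence; extending the sketch to three generations would be needed to mimic definition II.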

Figure 1

Structure of the simulated pedigree. The number of children for each pair of parents is distributed as a Poisson variable. The numerical labels ‘1’, ‘2’ and ‘3’ represent the first-, second- and third-degree relatives of the proband case.

Figure 2 shows that P(sporadic) depends mainly on disease prevalence K. It increases markedly with decreasing K, and increases gradually with decreasing hL2 and S. For a disease with a prevalence of 0.1%, a proband case has a probability of ∼78% (definition I) or ∼70% (definition II) of being sporadic, even if the heritability of disease is 100% and family size is large (S=5). We also investigated the relationship between the underlying and observed scales. In theory, P(sporadic) can be approximated analytically (Appendix) from Φ, the standard normal cumulative distribution function; fn, the number of nth-degree relatives; and i=z/K, where z is the height of the standard normal curve at the truncation point T. For example, when S=2 and for diseases with prevalence ranging from 0.1 to 1%, P(sporadic) will range from 66 to 99% (definition I) or from 52 to 99% (definition II), with hL2 from 100 to 0%. Therefore, for these diseases, the proband cases observed in typical families will more likely seem to be sporadic no matter how heritable the disease is. By contrast, a large proportion of probands would present with a family history for a disease with a high prevalence (K=10%); even if the heritability is extremely low (hL2=10%), 31% (definition I) or 65% (definition II) of probands have a family history. In addition, the nth-degree RR (λn) can be calculated analytically (Appendix). The RRs for relatives from this equation were verified by simulations and are listed in Table 1. In line with the results for P(sporadic), RRs depended largely on disease prevalence K. Although the relationship of hL2, K and λn is theoretically nonlinear, the relationship between log(λ1) and −hL2 log(K) is approximately linear for each K (Figure 3), and is roughly log(λ1)=−0.62 hL2 log(K) with regression R2=99.4% for K from 0.01 to 10%.
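The Appendix equations are not reproduced in this section, so the following Python sketch should be read as an assumption rather than as the authors' exact expressions. It uses a standard truncated-normal approximation in which the liability of an nth-degree relative of a case has mean a·i and variance 1−a²·i(i−T), with a=(1/2)^n·hL2, and it treats relatives as independent when multiplying their probabilities of being unaffected. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def risk_to_relative(K, h2, degree):
    """Approximate probability that an nth-degree relative of a case is affected."""
    T = STD_NORMAL.inv_cdf(1.0 - K)   # liability threshold
    i = STD_NORMAL.pdf(T) / K         # i = z / K, mean liability of affected individuals
    a = (0.5 ** degree) * h2          # assumed additive liability correlation with the proband
    variance = 1.0 - a * a * i * (i - T)
    return 1.0 - STD_NORMAL.cdf((T - a * i) / sqrt(variance))


def recurrence_risk_ratio(K, h2, degree):
    """lambda_n: risk to an nth-degree relative divided by the population prevalence."""
    return risk_to_relative(K, h2, degree) / K


def p_sporadic(K, h2, relatives):
    """Approximate P(sporadic); `relatives` maps degree n to the count f_n.

    Assumption of this sketch: relatives are unaffected independently of one
    another, given that the proband is affected.
    """
    p = 1.0
    for degree, count in relatives.items():
        p *= (1.0 - risk_to_relative(K, h2, degree)) ** count
    return p
```

As a usage example, `p_sporadic(0.01, 1.0, {1: 3})` (two parents and one sibling, K=1%, hL2=100%) comes out at about 0.66 under these assumptions, and `recurrence_risk_ratio(K, 0.0, n)` returns 1 for any degree, as expected when liability is not heritable.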

Figure 2

Proportion of sporadic cases under different combinations of disease prevalence (K), heritability on underlying scale (hL2) and mean family size per couple (S). (a) and (b) refer to the two definitions of family history, respectively.

Table 1 Recurrence risks for relatives under different combinations of heritability of liability (hL2) and disease prevalence (K), from our analytical derivation

Figure 3

Empirical relationship of disease prevalence (K), heritability on underlying scale (hL2) and first-degree recurrence risk (λ1). In the figure, y=log(λ1) and x=−hL2 log(K).

An analytical method to calculate hL2 from K and λ1 is provided in the Appendix. We used this method to predict the heritability of liability for a range of common complex diseases from two observable parameters, namely disease prevalence and first-degree RR (Table 2). Given hL2 and K for a disease, we also predicted P(sporadic) from our analytical approximation and checked all our analytical results by simulation. The predictions of P(sporadic) for these diseases are presented in Table 2. The predictions from simulation agreed well with those from the approximation, with a correlation of 0.998.
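Because the Appendix is not included here, the sketch below assumes a standard closed-form inversion of the truncated-normal recurrence approximation; it may differ in detail from the authors' derivation. The function name `heritability_from_risk` is hypothetical, and T_rel denotes the threshold that truncates the proportion λ·K in relatives.

```python
from math import sqrt
from statistics import NormalDist


def heritability_from_risk(K, lam, degree=1):
    """Heritability of liability implied by prevalence K and recurrence risk ratio lam.

    Assumes additive inheritance only, so the liability correlation between
    nth-degree relatives is (1/2)**n * h2, and requires lam * K < 1.
    """
    std = NormalDist()
    T = std.inv_cdf(1.0 - K)            # threshold in the population
    T_rel = std.inv_cdf(1.0 - lam * K)  # threshold implied by the risk in relatives
    i = std.pdf(T) / K                  # i = z / K
    numerator = T - T_rel * sqrt(1.0 - (T * T - T_rel * T_rel) * (1.0 - T / i))
    denominator = i + T_rel * T_rel * (i - T)
    correlation = numerator / denominator
    return correlation / (0.5 ** degree)
```

For example, `heritability_from_risk(0.01, 3.0)` returns roughly 0.35, and a recurrence risk ratio of 1 gives a heritability of 0.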

Table 2 Prediction of the heritability of liability (hL2) and proportion of sporadic cases in human complex diseases

Discussion

Using the relationship between underlying and observed scale parameters under a polygenic inheritance model, we predicted hL2 for a list of human complex diseases, using K and λ1 collected from the literature. The proportion of sporadic cases and the RRs for relatives depend mainly on disease prevalence. If we consider three-generation pedigrees with S=2, then for diseases with a low prevalence (K<1%), even if they are highly heritable (hL2=90%), the disease cases will more likely seem to be sporadic, P(sporadic II)>63% (Figure 2b). On the other hand, for a disease with a high prevalence (K>10%), a large proportion of disease cases will seem to have a family history, P(sporadic II)<37% (Figure 2b), even when disease heritability is extremely low (hL2=10%). Kendler15 modeled familial versus sporadic schizophrenia and major depression using simple genetic and environmental etiology models. He showed that family history had a high positive predictive value, but a low negative predictive value. Under the liability threshold model, these results still apply: a positive family history implies a high genetic liability to disease, whereas a negative family history implies very little about genetic liability to disease.

Our results necessarily reflect our assumptions in modeling complex genetic disease. First, we assumed that the familiality of complex disease reflects only genetic factors rather than shared family environment. As more distant relatives are less likely to share the same environmental risk factors, this assumption can be tested by comparing RRs for different types of relatives with their expected values under a genetic etiology model. Second, we used an idealized family history, assuming that the true disease status of all relatives is known and ignoring recall errors and age-of-onset effects. This implies that our estimates of the proportion of sporadic cases are conservative, and the proportion observed in practice may be even higher than we predict. Moreover, in some diseases, family history reflects increased severity of the disease,16 which, although not inconsistent with a polygenic model, may imply genetic heterogeneity. Finally, we assumed that a liability threshold model is representative of complex genetic diseases. The model assumes a normally distributed liability to disease, with disease occurring in those who exceed a liability threshold. A normal distribution of liability would be achieved if there are multiple genetic and environmental factors, each making a small contribution to the risk of disease. Comparing the predicted frequency of familial versus sporadic cases with the observed frequency provides some benchmark for the validity of our assumptions. For example, a Swedish population study investigated breast cancer in 1 732 775 sisters from 763 963 families (S=2.3) before the age of 70 years.17 A total of 16 505 proband cases and 714 sisters of probands were identified with breast cancer, which provides estimates of K=1% and λs=3 for breast cancer before the age of 70 years. We predicted an hL2 of 35% by our analytical equation. Given K and hL2, we simulated 1 million nuclear families with S=2.3, and estimated a multiplex proportion of siblings (the number of families with at least two affected siblings divided by the number with at least one affected sibling) of 4.6%, with a 95% CI of 4.2–5.0%, which is consistent with the observed proportion of 4.3%.
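The multiplex proportion defined above can be estimated with a short simulation. The Python sketch below is an illustration under assumptions of its own: sibship size is drawn from an untruncated Poisson distribution (families with no siblings simply contribute nothing), parents are not simulated explicitly, and siblings share a mid-parent genetic component with variance 0.5hL2. Because the family-size distribution of the study is not specified here, the sketch is not expected to reproduce the reported estimate exactly. The function name is hypothetical.

```python
import math
import random
from statistics import NormalDist


def multiplex_proportion(h2, K, mean_sibs, n_families=200_000, seed=1):
    """Families with >= 2 affected siblings divided by families with >= 1."""
    T = NormalDist().inv_cdf(1.0 - K)
    rng = random.Random(seed)
    sd_shared = math.sqrt(0.5 * h2)     # mid-parent additive value shared by siblings
    sd_own = math.sqrt(1.0 - 0.5 * h2)  # segregation plus environment, per sibling
    limit = math.exp(-mean_sibs)
    at_least_one = at_least_two = 0
    for _ in range(n_families):
        # Poisson sibship size via Knuth's method (assumed size distribution)
        n_sibs, product = 0, rng.random()
        while product > limit:
            n_sibs += 1
            product *= rng.random()
        shared = rng.gauss(0.0, sd_shared)
        affected = sum(shared + rng.gauss(0.0, sd_own) > T for _ in range(n_sibs))
        at_least_one += affected >= 1
        at_least_two += affected >= 2
    return at_least_two / at_least_one if at_least_one else float("nan")
```

Running `multiplex_proportion(0.35, 0.01, 2.3)` with several seeds gives a rough sense of the sampling variation behind a confidence interval such as the one quoted above.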

We consider the example of schizophrenia. The prevalence of sporadic versus familial cases has been considered in detail under a range of genetic models15, 18 in the context of etiological heterogeneity of schizophrenia. The latest evidence from GWAS for schizophrenia points to a large polygenic component.19 At the same time, significantly increased rates of de novo copy number variant (CNV) mutations have been reported in sporadic (but not familial) cases of schizophrenia,20 together implying genetic heterogeneity. In combination, these results point to many risk alleles that have a range of frequencies and effect sizes but that are still consistent with a normally distributed underlying liability to disease. Genetic epidemiology parameters of schizophrenia are classically quoted to be K=1%, λ1=8.6 and hL2=81%,21 but results from a Swedish population sample of >9 million have revised these estimates to K=0.4%, λ1=9 and hL2=64%.22, 23 For K=0.4% and λ1=9, we predict hL2 to be 63%, and for K=1% and λ1=8.6, we predict hL2 to be 80%, reflecting that under a threshold liability model, the lower disease prevalence forces a lower estimate of hL2 for the same λ1. For K=0.4% and hL2=63%, we predict the proportion of sporadic schizophrenia cases to be 90% (definition I) and 83% (definition II) in three-generation pedigrees with S=2 (Table 2). The Swedish population study observed a multiplex proportion of 3.8% (the number of families with at least two affected members divided by the number with at least one affected member)17 from nuclear families with ∼3.8 members per family.16 We simulated 1 million nuclear families with S=1.8, and estimated the multiplex proportion as 4.0% (95% CI of 3.7–4.4% from 100 simulation replicates) and P(sporadic I) as 90.6% (95% CI of 89.7–91.5%).

In conclusion, although sporadic cases can arise from nongenetic risk factors and from new mutations with large effects, a large proportion of sporadic cases is expected under the polygenic model, a result that reflects the relatively low prevalence rates (<5%) that are typical of common complex genetic diseases. Therefore, it is not possible to make any inference with regard to the causal mechanism of a complex disease from the observed proportion of sporadic cases alone.