Introduction

The DSM-IV category of Pervasive Developmental Disorders, also termed Autism Spectrum Disorders (ASDs) (MIM 209850), includes autism as well as Pervasive Developmental Disorders Not Otherwise Specified and Asperger's disorder. There is evidence that ASDs are highly heritable,1, 2, 3 although ASDs show wide clinical variability and a heterogeneous genetic architecture.4, 5 Dizygotic and sibling concordance rates are about one-tenth of monozygotic concordance rates, suggesting that ASDs are attributable to complex multigenic interactions rather than to a single susceptibility gene.

The great majority of identified ASD genes, mostly copy number variations (CNVs), show an unexpectedly high frequency of de novo mutation.5, 6, 7, 8, 9, 10 The increasing number of distinct, individually rare genetic causes of ASDs suggests an alternative to the polygenic hypothesis: most cases of ASDs are due to de novo mutations in the parental germ line, which can cause ASDs in most individuals. However, resistant individuals, mostly female, can be relatively asymptomatic carriers yet transmit the mutation and the resulting disorder in a nearly dominant fashion. This hypothesis, which hereafter we refer to as the two-component model (in contrast to the classical polygenic threshold model11), was proposed by Zhao et al. (2007).12 These authors analyzed the susceptibility risk for ASDs in multiplex families using data collected mainly by the Autism Genetic Resource Exchange (AGRE) consortium13 and found only two types of ASD families. One type of family, which comprised approximately 99% of the sample, had a low risk (slightly less than 0.01) of producing a child with ASD. The second type of family, comprising approximately 1% of the sample, had a high risk (about 0.5) of producing a child with ASD. Although these results support a two-component model, the sample analyzed was originally designed for linkage analyses and was not systematically ascertained. This ascertainment scheme could bias the genetic composition of the ASD families.

Previously, we conducted a cohort study of siblings thoroughly ascertained through at least one ASD proband in the catchment area.14 Census data on children in the same area have also become available. Therefore, we undertook this study to estimate the distribution of ASD risk within families in this area based on both our sample and the census data, with the aim of testing the two-component model. In addition, to compare the estimates from our analysis with those from Zhao et al. (2007),12 we also analyzed the same dataset used in their study, which was collected by the AGRE.

Materials and methods

Subjects

The sample used for this study, with informed consent and Institutional Review Board approval, has been described in detail elsewhere.14 Briefly, subjects in this sample were siblings born between 1993 and 2004, who were ascertained through at least one proband affected by ASD, and living in the western region of Nagoya city (this region is administered by the West District Care Center for Disabled Children). In this catchment area, all children with ASDs were ascertained through the regional screening system, which consists of a three-stage system of health check-ups and also captures missed cases through referrals from kindergartens, nursery schools, clinics and hospitals. Given that, during the study, the average participation rates for health check-ups were 95.3% for 18-month-old children and 86.5% for 3-year-old children, and given that 99.7% of the infants in the catchment area attended kindergartens or nursery schools, it is likely that most infants with developmental problems were identified. Thus, the screening for ASDs could be considered thorough.

A consensus diagnosis of ASD, based on the DSM-IV criteria, was made on the basis of all available information prepared in a semi-structured case vignette. This information included medical examination, psychological assessment and a clinical report based on repeated observations by psychologists and pediatric psychiatrists at ages 4 years or above. Inter-rater reliability was assessed by comparing diagnoses made by two raters based on data from 27 subjects with names and ages removed. The kappa coefficient between the two diagnosticians was 0.70 for both ASD and non-ASD cases.

To compare the estimates from our analysis with those from Zhao et al. (2007),12 we also analyzed the same dataset used in their study, which was collected by the AGRE.

A summary of the dataset, including the total number of children, number of affected children and the estimated sibling recurrence risk, is shown in Table 1. Here, sibling recurrence risk was estimated using the proband method.15

Table 1 Sample characteristics

Statistical models

The likelihood of sample-only data

We assumed that a mother–father pair in each family had a characteristic and time-invariant risk, x, of producing a male offspring with ASD. We set the risk of producing a female offspring with ASD equal to p × x, where p (female penetrance) is the ratio of the risk for female offspring to the risk for male offspring. As ASDs are approximately four times more common in males than females, female penetrance, p, ranges from zero to one. Let qm and qf represent the proportions of males and females in the population, respectively, the values of which were 6949/13 568 and 6619/13 568 in the catchment area. Then the probabilities of producing a male child with ASD and a male child without ASD are given by qm × x and qm × (1−x), respectively. Similarly, the probabilities of producing a female child with ASD and a female child without ASD are given by qf × (px) and qf × (1−px), respectively. Note that these probabilities sum to one, that is, qm × x+qm × (1−x)+qf × (px)+qf × (1−px)=1 because qm+qf=1. Let n=(nAM, nUM, nAF, nUF) represent the number of affected males (nAM), unaffected males (nUM), affected females (nAF) and unaffected females (nUF) in a family. If we assume only one risk of producing an affected child, x, the probability of observing a family with composition n is given by the multinomial distribution, f(n∣x, p), as:

where the first term on the right side represents n!/nAM! nUM! nAF! nUF!, and n is the number of siblings in a family, thus n=nAM+nUM+nAF+nUF.
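Written out from the definitions above, with the four per-child probabilities qm × x, qm × (1−x), qf × (px) and qf × (1−px) as the cell probabilities, the multinomial probability would take the following form. This is a reconstruction from the surrounding text, with n = nAM + nUM + nAF + nUF:

```latex
f(\mathbf{n} \mid x, p) =
  \frac{n!}{n_{AM}!\, n_{UM}!\, n_{AF}!\, n_{UF}!}\,
  (q_m x)^{n_{AM}}\,
  \bigl(q_m (1 - x)\bigr)^{n_{UM}}\,
  (q_f\, p\, x)^{n_{AF}}\,
  \bigl(q_f (1 - p x)\bigr)^{n_{UF}}
```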

As suggested by Zhao et al. (2007),12 the family population may consist of more than one family subpopulation, each of which may have a different risk of producing an affected child. In this case, it is desirable to model the distribution of n as a mixture of K components. For i=1, …, K, the parameter denoting the proportion of the family subpopulation, i, is ai, with ∑_{i=1}^{K} ai = 1. Therefore, the distribution of n is given by:

For the two-risk component example, when the first type of family with risk x1 has the proportion of a1 in the population and the second type of family with risk of x2 has the proportion of a2 in the population,
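In the notation defined above, a finite mixture of the single-risk multinomial f(n∣x, p) can be written as below, first for general K and then for the two-component case. This is a sketch assembled from the stated definitions, assuming the female penetrance p is shared across components, as the text states later:

```latex
\begin{align*}
f(\mathbf{n} \mid \theta) &= \sum_{i=1}^{K} a_i \, f(\mathbf{n} \mid x_i, p),
  & \sum_{i=1}^{K} a_i &= 1, \\
f(\mathbf{n} \mid \theta) &= a_1 \, f(\mathbf{n} \mid x_1, p) + a_2 \, f(\mathbf{n} \mid x_2, p),
  & a_1 + a_2 &= 1 \quad (K = 2).
\end{align*}
```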

Based on the assumption that the individual families are independent and the diagnoses of children within a family are also independent, a log-likelihood function LLsample(θ) of our sample, ascertained from at least one case, is given by:

where obs(ni) is the observed number of families with ni children, m is the number of affected children in a family (m=nAM+nAF), N is the number of families in the sample and the parameters are θ=(xi, ai, p) (1⩽i⩽K). For the AGRE sample, we replaced the conditional in equation (1) with m⩾2 to reflect the ascertainment procedure of this sample.
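The ascertainment conditioning can be illustrated with a minimal Python sketch. It assumes that each family's mixture probability is divided by the probability that a sibship of the same size contains at least `min_affected` affected children (1 for our sample, 2 for the AGRE sample). The function names, the dictionary layout of the family counts and this exact form of the conditioning are illustrative assumptions, not the code used in the study.

```python
from math import comb, factorial, log

# Sex proportions reported for the catchment area.
Q_M = 6949 / 13568
Q_F = 6619 / 13568


def multinomial_pmf(counts, x, p, q_m=Q_M, q_f=Q_F):
    """f(n | x, p) for counts = (n_AM, n_UM, n_AF, n_UF)."""
    n_am, n_um, n_af, n_uf = counts
    coef = factorial(sum(counts)) // (
        factorial(n_am) * factorial(n_um) * factorial(n_af) * factorial(n_uf)
    )
    return (
        coef
        * (q_m * x) ** n_am
        * (q_m * (1 - x)) ** n_um
        * (q_f * p * x) ** n_af
        * (q_f * (1 - p * x)) ** n_uf
    )


def mixture_pmf(counts, xs, weights, p):
    """Mixture of multinomials with risks xs and family proportions weights."""
    return sum(a * multinomial_pmf(counts, x, p) for a, x in zip(weights, xs))


def prob_ascertained(n, xs, weights, p, min_affected=1, q_m=Q_M, q_f=Q_F):
    """P(m >= min_affected) for a sibship of size n (assumed conditioning event)."""
    total = 0.0
    for a, x in zip(weights, xs):
        r = (q_m + q_f * p) * x  # per-child risk averaged over sex
        below = sum(
            comb(n, m) * r ** m * (1 - r) ** (n - m) for m in range(min_affected)
        )
        total += a * (1 - below)
    return total


def log_likelihood(families, xs, weights, p, min_affected=1):
    """families maps (n_AM, n_UM, n_AF, n_UF) to the observed number of families."""
    ll = 0.0
    for counts, obs in families.items():
        numerator = mixture_pmf(counts, xs, weights, p)
        denominator = prob_ascertained(sum(counts), xs, weights, p, min_affected)
        ll += obs * log(numerator / denominator)
    return ll
```

Calling `log_likelihood(families, xs, weights, p, min_affected=2)` corresponds to the m⩾2 conditioning described for the AGRE sample.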

Under this model, the prevalence of ASDs, R, is given by:

and the sibling recurrence risk of ASDs, S, is given by:

The notation and modeling of ASD-risk described above is identical to that used in the previous study.12 Details are described in the Supplementary information.
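As an illustration of how prevalence and sibling recurrence risk follow from a risk mixture, the sketch below computes both under one simple assumption: risks are averaged over the sex of the child using qm and qf. The exact forms of equations (2) and (3) may differ (for example, if they are defined for male children only), so these functions are an assumed formulation, not a transcription.

```python
# Sex proportions reported for the catchment area.
Q_M = 6949 / 13568
Q_F = 6619 / 13568


def prevalence(xs, weights, p, q_m=Q_M, q_f=Q_F):
    """Sex-averaged probability that a randomly chosen child is affected."""
    c = q_m + q_f * p  # average of 1 (males) and p (females)
    return c * sum(a * x for a, x in zip(weights, xs))


def sibling_recurrence(xs, weights, p, q_m=Q_M, q_f=Q_F):
    """P(sibling affected | one child affected), children independent given x."""
    c = q_m + q_f * p
    first_moment = sum(a * x for a, x in zip(weights, xs))
    second_moment = sum(a * x * x for a, x in zip(weights, xs))
    return c * second_moment / first_moment
```

With a single risk component the two quantities coincide; heterogeneity in risk between families is what pushes the sibling recurrence risk above the prevalence.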

Including the Supplementary information on prevalence

Although Zhao et al. (2007)12 used equations (2) and (3) as constraints to estimate the parameters, this method does not take into account the statistical uncertainty that accompanies the estimation of R and S. Therefore, instead of using these as constraints, we incorporated the prevalence information into the log-likelihood function LLsample+supp(θ), based on a binomial model as:

where N1 and M1 refer to the number of age-matched children (including affected children) and the number of affected children in the catchment area, respectively, the values of which were 13 568 and 281. This equation for LLsample+supp(θ) implies that the first term, LLsample(θ), contains information regarding the sampled population, and the second term contains information regarding the prevalence of ASDs, which can be supplied from the out-of-sample dataset. For the AGRE sample, the equation of LLsample+supp(θ) is also justified. Details are described in the Supplementary information section.
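Given the description above (a sample term plus a binomial term for M1 affected children among N1 children), the combined log-likelihood would plausibly take the following form, where R = R(θ) is the model prevalence. This is a sketch under that reading; the binomial coefficient does not depend on θ and may have been omitted in the original:

```latex
LL_{\mathrm{sample+supp}}(\theta)
  = LL_{\mathrm{sample}}(\theta)
  + \log\!\left[ \binom{N_1}{M_1}\, R(\theta)^{M_1}\,
      \bigl(1 - R(\theta)\bigr)^{N_1 - M_1} \right]
```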

Identifiability on the number of risk components and model selection

We employed a Bayesian framework to estimate parameters. For finite mixtures of multinomial distributions, identifiability (that is, the mixture having exactly one representation) imposes a restriction relating the number of components, K, to the maximum number of siblings in a family, n. This restriction is given by n⩾2K−1, where n=5 in our sample and n=8 in the AGRE sample.16 Therefore, we considered the model up to three components (K=3) in our sample and up to four components (K=4) in the AGRE sample.
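The arithmetic behind the component cap is short enough to check directly: rearranging n⩾2K−1 gives K⩽(n+1)/2, rounded down to an integer. The helper name below is hypothetical.

```python
def max_identifiable_components(max_sibship_size):
    """Largest integer K satisfying n >= 2K - 1 for maximum sibship size n."""
    return (max_sibship_size + 1) // 2


# Maximum sibship sizes reported for the two samples.
print(max_identifiable_components(5))  # our sample
print(max_identifiable_components(8))  # AGRE sample
```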

Finally, as the limit of the discrete distribution with increasing number of components, we examined a continuous risk model by using a beta distribution, Be(α, β)=x^{α−1} (1−x)^{β−1}/B(α, β), as:

where B(α, β) is the beta function.
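One natural way to write the continuous risk model is to replace the finite sum over components with an integral of the single-risk multinomial against the beta density. The form below is an assumed reading of the model, not a transcription of the original equation:

```latex
f(\mathbf{n} \mid \alpha, \beta, p)
  = \int_{0}^{1} f(\mathbf{n} \mid x, p)\,
    \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)}\, dx
```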

The prior distributions for xi, ai (1⩽i⩽K) and p were all assumed to be uniform on [0, 1], where ∑_{i=1}^{K} ai = 1 and x1⩽x2⩽…⩽xK for the identifiability of mixture models. We used a Markov Chain Monte Carlo method for estimation (details are described in the Supplementary information). To show how well a statistical model fits the observations, the deviance information criterion (DIC) was calculated from Markov Chain Monte Carlo samples.17 DIC is defined as the posterior mean deviance, D̄, plus the ‘effective number of parameters’, pD, where D is the deviance of the model, –2 × log-likelihood. Usually, pD is computed from the difference between D̄ and the deviance at the posterior mean parameter estimates, D(θ̂). However, in the finite mixture model, pD can often be negative because the overdispersion in mixture models leads to D(θ̂)>D̄. Therefore, as proposed by Gelman et al. (2004),18 we computed pD from half the posterior variance of the deviance.
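The DIC calculation described above can be sketched in a few lines, given a list of deviance values (−2 × log-likelihood) evaluated at each retained Markov Chain Monte Carlo draw. The function name is hypothetical, and the use of the sample variance (dividing by the number of draws minus one) is an assumption; with many draws the choice makes little difference.

```python
from statistics import mean, variance


def dic_from_deviances(deviances):
    """Return (DIC, mean deviance, pD) with pD = half the variance of the deviance."""
    d_bar = mean(deviances)
    p_d = variance(deviances) / 2  # sample variance of the posterior deviance draws
    return d_bar + p_d, d_bar, p_d
```

Because pD is half a variance, it cannot be negative, which avoids the problem noted for mixture models when pD is computed as D̄ − D(θ̂).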

Results

Risk components

We first examined the parameters in our sample and the AGRE sample, based on the log-likelihood, LLsample+supp(θ). Table 2 shows the posterior means (s.d.) of the parameters. The first column refers to the model type by the number of components, xi refers to the risk in the ith family type, and ai refers to the proportion of families with risk xi. The second column refers to DIC as a model choice criterion. For model fitting, the two-component model resulted in a substantially poorer fit to the data from both samples than the other models. Irrespective of the model, the female penetrance, p, had posterior means between 0.28 and 0.31, with a narrow s.d. (approximately 0.03) in both samples.

Table 2 Risks, family proportions and case proportions contributed from each risk

The columns aixi/R refer to the means (s.d.) of the proportions of ASD cases contributed from the risk xi, where R is calculated from equation (4). Note that this is the proportion of ASD cases, not the proportion of families. Based on each model, the higher risk component showed larger proportions of ASD cases. For example, based on the three-component model, the highest risk component (x3) had the largest proportion of ASD cases, followed by the intermediate risk component (x2), then by the lowest risk component (x1).
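To see why a small fraction of high-risk families can account for the largest share of cases, the sketch below computes the case shares aixi/∑ajxj. The risks are those shown in Figure 1 (0.010, 0.123 and 0.500); the family proportions are hypothetical values chosen only for illustration and are not estimates from Table 2. Any factor common to all components, such as a shared female penetrance, cancels in the ratio.

```python
def case_shares(xs, weights):
    """Proportion of cases contributed by each risk component: a_i x_i / sum_j a_j x_j."""
    contributions = [a * x for a, x in zip(weights, xs)]
    total = sum(contributions)
    return [c / total for c in contributions]


risks = [0.010, 0.123, 0.500]            # risks from Figure 1
hypothetical_props = [0.90, 0.07, 0.03]  # illustrative family proportions, NOT estimates
shares = case_shares(risks, hypothetical_props)
```

With these illustrative proportions, the highest-risk component makes up only 3% of families yet contributes the largest share of cases.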

Next, we examined the continuous risk model by using a beta distribution with parameters α and β. In Table 2, the bottom rows show the posterior means (s.d.) of the parameters and the DIC values. This continuous model clearly shows higher DIC values, that is, a poorer fit, than any discrete model.

Lastly, we examined discrete risk models in which the highest risk was constrained to 0.5, forcing the assumption that the highest risk corresponds to a dominant risk. This model had a slightly poorer fit than the model without the constraint. The results of the three-component model are shown in Figure 1 and the results based on LLsample are shown in the Supplementary Table A.

Figure 1

Highlighted estimates for our sample based on the three-component model in which the highest risk was constrained to 0.5. (left) The proportions of families with each risk (0.010, 0.123 and 0.500) are represented by gray lines. Highest density regions (HDRs (80%)) are represented by dotted lines. The continuous risk estimated using the beta distribution is represented by a dashed line. (right) The proportions of ASD cases contributed from each risk (0.010, 0.123 and 0.500) are represented by gray lines. HDRs (80%) are represented by dotted lines.

Information content of the sample and the supplementary data on prevalence

Next, we examined the ascertainment bias by comparing the estimates of prevalence, R̂, and sibling recurrence risk, Ŝ, with the expected values from the census data. The values of R̂ and Ŝ obtained from each model are shown in Table 3. The R̂ values estimated from the log-likelihood LLsample(θ), which did not include information on prevalence, were far from the MLE of 2.2%. Of course, the estimates of prevalence, R̂, based on the log-likelihood LLsample+supp(θ) were essentially identical to 2.2%, as indicated by the shaded cells in Table 3 because this likelihood included information on prevalence and constrained R̂ to be equal to the MLE. In contrast, the estimates of sibling recurrence risk, Ŝ, based on LLsample(θ) and LLsample+supp(θ) were both close to the estimate of 18.3%, which was based on the proband method,15 although these likelihoods did not accommodate the information on sibling recurrence risk. This finding indicates that our sample contains sufficient information regarding sibling recurrence risk but not prevalence.

Table 3 Estimates of prevalence and sibling recurrence risk based on different models

For comparison with the estimates from our sample, we analyzed the data used by Zhao et al. (2007) (AGRE sample), and the results are also shown in Tables 2 and 3. The R̂ and Ŝ values obtained from the AGRE sample based on LLsample(θ) were also far from 2.2 and 18.3%, respectively. However, the discrepancies between these figures and the estimates based on LLsample(θ) were much larger than those in our sample. As in our sample, the estimates of prevalence, R̂, based on the log-likelihood LLsample+supp(θ) were essentially identical to the MLE of 2.2%, as indicated by the shaded cells in Table 3. However, incorporation of the prevalence information based on LLsample+supp(θ) resulted in a marked change in the estimates of sibling recurrence risk, Ŝ, in the AGRE sample. This finding may indicate that the AGRE sample, ascertained through at least two probands, does not contain sufficient information regarding prevalence or sibling recurrence risk.

Discussion

Inspired by a previous study suggesting only two risk components of ASDs, here we re-examined this finding using both the dataset previously analyzed by Zhao et al. (2007)12 and our independently collected dataset. The conclusion of Zhao et al. (2007)12 is mainly based on the finding that the risk estimates of male children show one high-risk component, which is near 50%, and two low-risk components, which are both below 1% and are essentially indistinguishable. Because their analysis was based on the MLE method, under some constraints (equations (2) and (3) in this paper), their analysis of male children alone is restricted to at most the three-component model, and the analysis of both male and female children is restricted to the two-component model, because of limited degrees of freedom. Instead of the MLE framework, we employed a Bayesian framework, allowing us to consider the models for both male and female children up to three components in our sample and up to four components in the AGRE sample. Our results show that the estimates of ASD risks are not divided into two parts, one near 50% and one below 1%. Using models with three or more components, we find intermediate risks ranging from 5 to 30% in both samples. Furthermore, our results also demonstrate that the two-component model resulted in a substantially poorer fit to the data than the other models. Therefore, we conclude that models with three or more risk components are preferable to the two-component model.

Although the estimated risks themselves cannot be assumed to be genetic, based on twin studies it is reasonable to assume that the risks are mostly genetic in origin.1, 2, 3 Thus, from a genetic point of view, we can interpret families with the highest risk, close to 50%, as transmitting ASDs in a dominant pattern. The risk estimates for our sample under the model in which the highest risk was constrained to 0.5 are shown in Figure 1. Here, it is worth noting that the highest risk component, which can be regarded as a dominant risk, was associated with the largest proportion of ASD cases in any component model. With regard to the substantial contribution from a dominant risk of ASDs, our finding is in agreement with Zhao et al. (2007).12

Previous twin studies have suggested that the polygenic factors responsible for ASDs may also be responsible for more common social impairments in the general population, the severity of which falls below the threshold for categorical ASD diagnosis.19, 20 Moreover, autosomal recessive genes responsible for autism have been identified by homozygosity mapping in consanguineous pedigrees.21 The risks from polygenic or recessive genetic factors would fall into an intermediate risk class. Therefore, it is likely that a model with three or more components, which includes intermediate risks, is preferable to the two-component model, which does not contain intermediate risks.

In addition, our results show the importance of adding information regarding prevalence to the analysis of ASD risk among families selected through affected probands. Any differences between the results obtained with and without incorporating the information on prevalence reflect the characteristics of the sample determined by the particular ascertainment procedure. Incorporation of the prevalence information resulted in a marked change in the estimates of sibling recurrence risk in the AGRE sample. However, the estimates of sibling recurrence risk in our sample were stable when the prevalence information was incorporated. This finding may indicate that our sample, which was thoroughly ascertained through at least one proband, contains sufficient information regarding sibling recurrence risk, whereas the AGRE sample, which was ascertained through the presence of at least two probands, does not. This conclusion is supported by the extremely high sibling recurrence risk estimate of 0.536 in the AGRE sample, which suggests that the sample is enriched for carrier parents that transmit ASDs in a nearly dominant pattern. Therefore, for the analysis of the proband-ascertained sample, we can empirically justify the incorporation of the prevalence information.

The results of this report should be interpreted in the context of several potential limitations. First, we assume that the distribution of the number of offspring among nuclear families is independent of the risk for ASDs. However, this assumption may be violated because parents may choose to stop having children after the birth of a child with ASD. This situation, referred to as stoppage, can severely bias the estimates of the ASD-risk distribution. Without knowledge of the distribution of the number of offspring among families, a correction for stoppage appears to be almost intractable.22 Therefore, we tested for the existence of stoppage by using the Mann–Whitney U-test, according to a previous study.23 In brief, if U is the number of times a normal child precedes an affected child in all k sibships, ai is the number of affected children in sibship i, and ni is the number of normal children in sibship i, then,

is a unit normal deviate. Mann–Whitney U-tests on our sample and the AGRE sample resulted in z-values equal to 1.70 (P-value=0.058) and 1.65 (P-value=0.099), respectively. Neither result is significant at the 5% level, suggesting that this potential bias is unlikely.
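The stoppage statistic can be sketched as follows. This assumes the standard stratified Mann–Whitney form, in which U is compared with its null expectation ∑aini/2 and null variance ∑aini(ai+ni+1)/12, both summed over sibships. That form, the function name and the string encoding of sibships are assumptions made for illustration.

```python
from math import sqrt


def stoppage_z(sibships):
    """z statistic for stoppage from birth-order strings.

    Each sibship is a string in birth order, 'A' for an affected child and
    'U' for an unaffected child, for example "UUA".
    """
    u_total = 0      # times an unaffected child precedes an affected child
    expected = 0.0   # null expectation of U, summed over sibships
    var = 0.0        # null variance of U, summed over sibships
    for sibship in sibships:
        a = sibship.count("A")
        n = sibship.count("U")
        unaffected_seen = 0
        for child in sibship:
            if child == "U":
                unaffected_seen += 1
            elif child == "A":
                u_total += unaffected_seen
        expected += a * n / 2
        var += a * n * (a + n + 1) / 12
    # var is zero if no sibship contains both affected and unaffected children
    return (u_total - expected) / sqrt(var)
```

A positive z indicates that affected children tend to be born later than their unaffected siblings, which is the pattern expected if parents stop having children after the birth of an affected child.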

Second, our findings largely depend on the approach of adjusting for ascertainment, which uses the conditional distribution of the phenotype of non-probands, given the phenotype of probands. This method is attractive because it does not necessitate correctly modeling the ascertainment process. For singly ascertained data, conditioning on probands should provide an asymptotically unbiased estimator. However, this is not true for multiple ascertainment, that is, ascertaining through multiple probands in each family.24 Thus, serious asymptotic bias can occur when adjusting for at least two affected children in the AGRE sample.

Third, as in previous work,12 the analysis of the AGRE sample is based on an extrapolation of the prevalence information. In contrast, the analysis of our sample uses prevalence information derived from the same data source. Thus, our sample has higher internal validity than the AGRE sample. That is, our sample should be well specified in terms of the geography and time covered, as well as ethnic group (all subjects in our sample are Japanese), as compared with the AGRE sample. For sensitivity analysis of the AGRE sample, we have explored ranges of prevalence, R, from 0.5 to 2.2% and found that the DIC values were insensitive to changes in R (Supplementary Table B).

Fourth, following the previous study,12 we assume the same female penetrance, p, for all risk components. Although this assumption allows us to avoid increasing the number of parameters to be estimated, it may be too restrictive to be justified.

The final limiting issue concerns the restriction on the number of risk components incorporated into the statistical model. For finite mixtures of multinomial distributions, the maximum number of siblings, n, limits the number of components to at most (n+1)/2, because of identifiability. Under typical conditions, in which the maximum number of siblings in a sample is limited (for example, five to eight), the number of components is capped at three to four. Therefore, even if the ‘true’ model comprises more components than this cap, we cannot evaluate it. Even in that case, the estimates from models within the cap imply that dominant genes contribute substantially to the development of ASDs. On the other hand, we can consider many more components (for example, 30 or more) by approximating the discrete distribution with a continuous function, as shown above. In our results, this model was not supported in terms of DIC. Therefore, we conclude that a limited number of risks are involved in producing children with ASDs.

Despite these limitations, which are largely related to the AGRE data, the estimates from our sample are in remarkable agreement with the estimates from the AGRE sample, suggesting that our results are robust. In our results, the largest risk (dominant risk) accounts for the largest proportion of ASD cases in any component model. Recent studies have revealed that submicroscopic CNVs can have a role in ASDs, and frequencies of 7–10% are observed in simplex families.6, 10 However, CNVs in asymptomatic carriers have not yet been fully identified.10 From our results, we predict that the observed frequency of mutations of any kind transmitted from carrier parents will increase significantly once higher resolution genome-scanning methods become available. The identification of de novo CNVs associated with ASDs has progressed considerably in recent years, but detection of mutations transmitted from parents, through examining parent–offspring trios, should become increasingly critical.