Introduction

Crohn's disease (CD, MIM266600) is an inflammatory bowel disease (IBD) that is frequent in the Western world. The annual incidence of CD ranges from 1.7 to 24.3 per 100,000 population.1 The greatest frequency occurs in the third decade.2 To date the aetiology of CD is unknown but it is generally considered to result from a complex interplay between genetic and environmental factors.3

A genetic predisposition to CD was initially suggested by ethnic and familial aggregation of the disease (for review see reference 3). Twin studies, which demonstrated that the concordance rate for the disease was higher in monozygotic twins than in dizygotic twins, also argued for a genetic factor.4,5,6 A genetic predisposition to CD was firmly established when susceptibility loci were mapped using linkage studies. To date, at least seven susceptibility loci have been localised, on chromosomes 1, 5, 6, 12, 14, 16 and 19.7,8,9,10,11,12,13,14 More recently, we identified the IBD1 susceptibility locus as the CARD15/NOD2 gene using a positional cloning approach.15 This result was also obtained by a candidate gene strategy.16 As expected, CARD15 does not explain the entire genetic predisposition, considering that CARD15 variants are observed in no more than 50% of CD patients.

The incidence of CD increased in numerous areas of Europe and North America in the second part of the 20th century.17,18 Furthermore, the concordance rate among monozygotic twins is only 25 to 58.3%. These observations argue for an environmental element in causation.4,5,6 Among the large number of environmental factors that have changed in the Western way of life since the Second World War, diet, infections and early events in childhood, measles vaccination, hygiene, contraceptive pills and tobacco have been proposed as important. However, to date, cigarette smoking is the only risk factor clearly established for CD.19,20,21

It is not known whether environmental and genetic factors interact or whether they have independent effects. We addressed this question by studying sibships previously used to demonstrate linkage in genome-wide searches.11 In a purely genetic disease, affected siblings are expected to be randomly distributed within the sibship. On the other hand, if there is an environmental contribution to the disease, sibs who share many environmental factors are expected to be concordant for the disease more often than sibs who differ substantially in age or parity. We considered birth order as a global estimate of a shared environment, on the hypothesis that consecutive siblings in a sibship are more likely to be in close contact than those more widely spaced. We thus examined the distribution of affected siblings within sibships segregating for CD, and we developed an original procedure to test whether the disease was randomly or non-randomly distributed in the sibships.

Patients and methods

Patients and families

Five hundred multiplex IBD families were identified through a large European consortium on the genetics of IBD. Only families in which all affected members had CD were included in the study; families with at least one case of ulcerative colitis or unclassified colitis were excluded. Part of this family set has been used in linkage studies and allowed us to map the IBD1 gene on chromosome 16 (reference 11) and to further identify CARD15.15 Diagnostic criteria have been previously defined.22

For each family, a pedigree was drawn that included the date of birth, and the disease status (healthy or affected) for each family member. Families with twins were discarded because of completely shared birth order. Similarly, because we looked at a particular distribution of the disease status within sibships, families in which all siblings were affected were excluded. Thus, a total of 102 CD sibships with at least two affected siblings and one healthy sibling remained for this study. Sibships are described in Table 1 and Figure 1.

Table 1 Characteristics of the studied family set
Figure 1

Schematic representation of the sibship sample. Each of the 102 studied sibships is represented by a solid line. The siblings are indicated on the sibship line by open circles (healthy siblings) or black circles (CD patients) drawn at their corresponding dates of birth. (A) Sibships where the affected siblings are not all consecutive for birth order (n=44). (B) Sibships where all the affected siblings are clustering (n=58).

Distribution of the disease status within the sibships and the Clustering of Affected Siblings Test (CAST)

The 102 sibships were classed as ‘consecutive’ or ‘non consecutive’ according to the birth order of the affected siblings: a sibship was said to be ‘consecutive’ if all the affected siblings were born in consecutive order. When one or more healthy siblings separated affected siblings, the sibship was classed as ‘non consecutive’. According to this classification, a random variable (rv) X was attributed to every sibship as follows: X=1 for a ‘consecutive’ sibship, X=0 for a ‘non consecutive’ sibship.
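The classification rule can be illustrated with a minimal Python sketch. The function name and the example sibships are hypothetical and are not taken from the study data; each sibship is assumed to be given as a list of disease statuses ordered from first-born to last-born.

```python
from typing import Sequence


def is_consecutive(affected_by_birth_order: Sequence[bool]) -> int:
    """Return X = 1 if all affected siblings occupy consecutive birth ranks, else X = 0."""
    ranks = [i for i, affected in enumerate(affected_by_birth_order) if affected]
    if len(ranks) < 2:
        raise ValueError("the classification assumes at least two affected siblings")
    # The affected ranks form a single block when their span equals their count.
    return int(ranks[-1] - ranks[0] + 1 == len(ranks))


# Hypothetical sibships, first-born to last-born (True = affected).
print(is_consecutive([False, True, True, False]))  # 1: 'consecutive'
print(is_consecutive([True, False, True]))         # 0: 'non consecutive'
```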

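First consider a given sibship of n siblings with p affected siblings and n-p healthy siblings. This sibship is characterised by the paired values (n,p). For the ith born sibling, let:

$$Y_i = \begin{cases} 1 & \text{if the } i\text{th sibling is affected} \\ 0 & \text{otherwise} \end{cases}$$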

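If the disease status is identically distributed within the sibship, ie if the probability that the ith sibling is affected does not depend on the rank i (the n binary rv Yi are identically distributed Bernoulli rv), then, given p affected siblings, every one of the $\binom{n}{p}$ possible placements of the affected siblings among the n birth ranks is taken as equally likely, and n-p+1 of these placements put the affected siblings at consecutive ranks. The probability for a sibship (n, p) to be classified as ‘consecutive’ is therefore

$$P(X=1) = \frac{n-p+1}{\binom{n}{p}}$$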

and its probability to be classified as ‘non consecutive’ is

$$P(X=0) = 1 - \frac{n-p+1}{\binom{n}{p}}$$

The probability generating function (pgf) of the binary rv X defined as usual as

$$\Phi(X) = \sum_{k=0}^{1} P(X=k)\,X^{k}$$

is a polynomial of the first degree in X equal to

$$\Phi(X) = \left(1 - \frac{n-p+1}{\binom{n}{p}}\right) + \frac{n-p+1}{\binom{n}{p}}\,X$$

Then consider the set of N sibships. The N rv Xj (for j from 1 to N) are assumed to be independent. Their sum

$$S = \sum_{j=1}^{N} X_j$$

is again a rv. It is well known that the pgf of S is the product of the pgf of the N rv Xj. The pgf of S is therefore a polynomial of the Nth degree in S equal to:

$$\Phi(S) = \prod_{(n,p)} \left[\,1 - P(n,p) + P(n,p)\,S\,\right]^{N(n,p)}$$

where $P(n,p) = (n-p+1)/\binom{n}{p}$ and N(n,p) is the number of (n,p) sibships.
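As a sanity check on the closed form P(n,p), the following Python sketch enumerates every possible placement of p affected siblings among n birth ranks and counts those forming a single block. The function names are illustrative only, and the check assumes, as under H0, that all placements are equally likely.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def p_consecutive(n: int, p: int) -> Fraction:
    """Closed form P(n, p) = (n - p + 1) / C(n, p)."""
    return Fraction(n - p + 1, comb(n, p))


def p_consecutive_bruteforce(n: int, p: int) -> Fraction:
    """Enumerate every placement of p affected siblings among n birth ranks."""
    hits = 0
    total = 0
    for ranks in combinations(range(n), p):  # ranks are returned in increasing order
        total += 1
        if ranks[-1] - ranks[0] + 1 == p:    # affected ranks form one block
            hits += 1
    return Fraction(hits, total)


# Compare the two computations for small sibships with at least one healthy sibling.
for n in range(3, 9):
    for p in range(2, n):
        assert p_consecutive(n, p) == p_consecutive_bruteforce(n, p)

print(p_consecutive(4, 2))  # 1/2
```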

The probability of observing k ‘consecutive’ sibships in a given sample of families is equal to the coefficient of the kth degree of the polynomial Φ(S).

Among the subset of ‘consecutive’ sibships it is also possible to calculate the probability of finding T sibships where the oldest (respectively youngest) sibling is affected. This probability P(T=k) is the coefficient of the kth degree of the polynomial Ψ(T) defined as

$$\Psi(T) = \prod_{(n,p)} \left[\,1 - Q(n,p) + Q(n,p)\,T\,\right]^{M(n,p)}$$

where $Q(n,p) = 1/(n-p+1)$ and M(n,p) is the number of (n,p) ‘consecutive’ sibships.
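As a small worked illustration (a hypothetical sample, not the study data), suppose the ‘consecutive’ subset contained only one (3,2) sibship and one (4,2) sibship. A block of p affected siblings can start at any of n-p+1 ranks, so Q(3,2)=1/2 and Q(4,2)=1/3, and

$$\Psi(T) = \left(\tfrac{1}{2} + \tfrac{1}{2}T\right)\left(\tfrac{2}{3} + \tfrac{1}{3}T\right) = \tfrac{1}{3} + \tfrac{1}{2}\,T + \tfrac{1}{6}\,T^{2}$$

In this toy case the probabilities of finding 0, 1 or 2 sibships with the first-born affected would be 1/3, 1/2 and 1/6 respectively, which sum to 1 as required.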

We propose a simple procedure to test whether the disease status is identically distributed within the sibships of a given set of families. An identical distribution of the disease status is equivalent to a random distribution of the affected siblings (patients). The null hypothesis:

H0: ‘the disease status is identically distributed within all the sibships’ which is equivalent to the statement ‘patients are randomly distributed within the sibships’ is tested against the alternative hypothesis:

H1: ‘the affected siblings are not randomly distributed in some sibships’.

The statistic upon which the test is based is the number of ‘consecutive’ sibships. The exact probability can be easily computed with a computer algebra system (e.g. Maple). Because affected siblings are clustered together within a ‘consecutive’ sibship, this new test is called the ‘Clustering of Affected Siblings Test’ and will be referred to by its acronym CAST in the following.
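The exact computation does not strictly require a computer algebra system. The Python sketch below expands the pgf of S by repeated polynomial multiplication with exact rational arithmetic and then sums the upper tail. It is only an illustration under stated assumptions: the function names are hypothetical, the one-sided tail P(S >= observed) is assumed to be the quantity of interest, and the toy counts are invented for the example and are not the Table 1 data.

```python
from fractions import Fraction
from math import comb
from typing import Dict, List, Tuple


def cast_distribution(counts: Dict[Tuple[int, int], int]) -> List[Fraction]:
    """Coefficients of the pgf of S under H0: result[k] = P(S = k).

    counts maps each (n, p) class to N(n, p), its number of sibships.
    """
    coeffs = [Fraction(1)]
    for (n, p), n_sibships in counts.items():
        prob = Fraction(n - p + 1, comb(n, p))  # P(n, p)
        for _ in range(n_sibships):
            # Multiply the running polynomial by (1 - prob) + prob * S.
            nxt = [Fraction(0)] * (len(coeffs) + 1)
            for k, c in enumerate(coeffs):
                nxt[k] += c * (1 - prob)
                nxt[k + 1] += c * prob
            coeffs = nxt
    return coeffs


def cast_p_value(counts: Dict[Tuple[int, int], int], observed: int) -> Fraction:
    """One-sided probability P(S >= observed) under H0."""
    return sum(cast_distribution(counts)[observed:], Fraction(0))


# Hypothetical toy sample (NOT the study data): two (3,2) sibships and one (4,2) sibship.
toy_counts = {(3, 2): 2, (4, 2): 1}
print([str(c) for c in cast_distribution(toy_counts)])  # ['1/18', '5/18', '4/9', '2/9']
print(cast_p_value(toy_counts, 3))                      # 2/9
```

Exact fractions avoid rounding error in the tail; for a sample of about a hundred sibships floating-point arithmetic would also be adequate.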

In the same way, a test on the position of the last (or conversely first) affected sibling among the ‘consecutive’ families is based on the probability P(T=k), which can be computed exactly as well.

Distribution of the proportion of affected siblings according to their date of birth

The global point of view previously assumed may be supplemented by a local analysis of the distribution of the clusters during the century in order to detect a birth cohort effect. Here again the number of clusters (ie consecutive strings of affected sibs) observed at a given period of time has to be compared with the number of clusters expected according to the number and the size of the families living at that period. A robust statistic is the proportion of affected siblings born at a given time and who belong to a cluster. More precisely we introduce for each date of birth (t) a fraction q(t) defined as follows

$$q(t) = \frac{\text{number of affected siblings born at } t \text{ who belong to a cluster}}{\text{number of affected siblings born at } t}$$

Note that q(t) is nothing but the probability that a CD patient born at a given date will belong to a cluster. The values of q(t) observed on a given sample of families, denoted qobs(t), are then compared to the values qcomp(t) computed assuming a random distribution of affected siblings within the sibships. Finally the ratio r(t) defined as

$$r(t) = \frac{q_{obs}(t)}{q_{comp}(t)}$$

is a local measure of an excess of clusters at the time t.

Results

Among the 102 sibships with at least two affected siblings and one healthy sibling, the observed number of ‘consecutive’ sibships was found to be 58 (Table 1). The trend towards an excess of ‘consecutive’ sibships was observed for most of the (n, p) classes (Table 1).

Under the null hypothesis H0 of a uniform distribution of the disease, the expected number of ‘consecutive’ sibships was 46. The computed probability of observing a value equal to or higher than 58 was P=0.005. The observed number of ‘consecutive’ sibships was thus significantly higher than expected, showing that the disease risk was not randomly distributed within the sibships.
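For reference, both quantities follow directly from the pgf of S introduced in the methods. Since S is a sum of independent binary rv, its expectation under H0 and the one-sided probability for the 102 sibships can be written as

$$E(S) = \sum_{(n,p)} N(n,p)\,P(n,p), \qquad P(S \geq 58) = \sum_{k=58}^{102} P(S=k)$$

where P(S=k) is the coefficient of the kth degree of Φ(S). The per-class counts N(n,p) needed to evaluate these sums are those of Table 1 and are not reproduced here.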

In theory, the non-random distribution of the disease within sibships may be explained by the age of the parents at the date of birth of their offspring. The mean maternal (or paternal) age at birth for affected siblings was 27.4 (respectively 31.5) vs 27.8 (respectively 31.9) for healthy siblings (NS). Thus, neither the age of the mother nor the age of the father at the date of birth seemed to influence the disease status.

The non-random distribution may also reflect an over-incidence of the disease in special parity classes such as first- or last-born siblings. Out of the 58 ‘consecutive’ sibships, the last born was affected in 21 sibships and the first born in 25. The expected number of families where the first-born sibling would be affected was found to be 22. By symmetry, this was also the expected number of sibships with the last-born affected. Using the above described test on first- (respectively last-) borns in the set of consecutive sibships we found no significant deviation. Thus, the non-random distribution of the disease in sibships was not related to the birth order of affected siblings. In other words, the observed clusters of affected siblings were uniformly distributed within the consecutive sibships (Figure 1).

We finally looked at the distribution of the clusters over time. Figure 2 reports the observed proportion of patients belonging to a cluster of affected siblings and the corresponding computed value assuming a random distribution of the disease within the sibships. In accordance with the previously obtained results using the CAST statistic, the mean observed value of q(t) was higher than the expected one. More interestingly, an excess of clusters of affected siblings was observed at any point between 1930 and 1980. This observation suggested that affected siblings were not clustering around a specific date within the century and did not argue for a cohort effect.

Figure 2

Local analysis of the distribution of the clusters during the century. qobs(t) (diamonds) is the observed proportion of affected siblings born at time t and belonging to a cluster. qcomp(t) (squares) is the corresponding computed value under the hypothesis of a disease identically distributed among the siblings. r(t) (triangles) is the ratio qobs(t)/qcomp(t). To smooth out the data, qobs(t) and qcomp(t) were averaged over a 5-year window. More precisely

Discussion

In order to detect the contribution of the environment in familial CD, we analysed a large sample of sibships with at least two affected siblings and one healthy sibling. Using a new statistical test which we named the ‘Clustering of Affected Siblings Test’ (acronym CAST), we were able to show that the affected siblings are not randomly distributed within sibships, as would be expected in the case of a purely genetic disorder. Acquired genetic variations in the parental germinal cells are also unlikely, considering that the disease is not associated with the parents' ages at the time of birth. Thus, it can be concluded that the non-random distribution of the disease is very likely related to environmental factors playing a role in families.

In this set of multiplex families we could not find any evidence that the rank of birth was a risk factor. This observation suggests that the environmental exposure is not related to the intrinsic chronological history of the family, but rather to familial exposure to outside factors. The absence of a demonstrated cohort effect in these families suggests that the environmental exposure did not occur at the same time for all the families. On the contrary, starting dates of the exposure appear to be likely different from one family to another.

A gradually increasing incidence of the disease during the second part of the 20th century has been widely reported.17,18 This finding demonstrates that CD phenotype is also modulated by the environmental exposure in sporadic cases. As for familial CD, no cohort effect has been detected. Thus, observations in sporadic and familial CD suggest that the same environmental risk factor(s) may be involved in both presentations of the disease.

The excess of clusters of affected siblings in this family set which contributed to the localisation and identification of the IBD1 gene11,15 suggests that both environmental and genetic factors are involved in the familial CD predisposition. Considering that CARD15 is to date the main known genetic factor involved in CD predisposition, we looked at the proportion of families with one or more CARD15 variation(s) in consecutive and non consecutive sibships. The proportion of families with CARD15 mutation(s) was respectively 0.55 and 0.57 (NS). This finding argues against any distinction of CD families with either a genetic determinant or an environmental exposure. Rather, this observation supports the hypothesis that genetic and environmental risk factors combine to control risk and that CD is a true multifactorial disorder. Multiplex families may thus be considered as an efficient tool for studying environmental predisposing factors and gene-environment interactions.

The new test proposed here, called CAST, is based on the calculation of the probability of observing a given number of clusters of affected siblings in multiplex families. To our knowledge, such a proposal has not been addressed in the literature. This approach is very different from a test which would try to correlate the disease status with the interval between the dates of birth of affected siblings. It thus appears complementary to other correlation statistics which can be proposed to define the exact risk factor involved in the observed phenomenon.

Thus, although the CAST statistic strongly argues for an environmental factor in familial CD, additional studies taking into account specific environmental exposures (e.g. cigarette smoking) are needed in order to determine which environmental factor(s) is (are) involved. However, the CAST statistic is based on the hypothesis of closer contact between consecutive siblings than between more distant siblings. This hypothesis is probably true for environmental exposure in childhood but it appears less valid for environmental exposure later in life. Thus the result of the CAST statistic suggests that the environmental factor(s) play a role during childhood. This hypothesis is in accordance with the young age at onset of the disease, for which the maximum incidence is observed in the third decade,2 and is in line with other works suggesting that hygiene in childhood may be a risk factor for CD.23

Some recruitment strategies may interfere with the test. For example, larger sibships provide more data to test the null hypothesis. Thus, considering that older sibships are usually larger, it can be proposed to recruit older patients. In addition, such older sibships would be less subject to misclassification of younger sibs for diseases with a delayed age of onset. However, these older sibships are less likely to have good data on environmental factors, and their usefulness for further correlation studies is limited.

As reported here, this test can be used in heterogeneous sets of families with various numbers of siblings and various proportions of affected siblings, whatever the familial predisposition to the disease. It is non-parametric and does not depend on the underlying model of the disease inheritance, including the number of susceptibility genes, the number of environmental factors and the interaction between them. If trivial biases (such as the effects of parental age or delayed age of onset) are discarded, it gives evidence for or against environmental risk exposure in families. It is easy to compute using a computer algebra system such as Maple (see Annex). Altogether, these properties of the CAST statistic suggest that it can be applied to a large variety of familial traits for which multiplex family samples are available.