Estimating penetrance from multiple case families with predisposing mutations: extension of the ‘genotype-restricted likelihood’ (GRL) method

Some diseases are due to germline mutations in predisposing genes, such as cancer family syndromes. Precise estimation of the age-specific cumulative risk (penetrance) for mutation carriers is essential for defining prevention strategies. The genotype-restricted likelihood (GRL) method is aimed at estimating penetrance from multiple case families with such a mutation. In this paper, we proposed an extension of the GRL to account for multiple trait disease and to allow for a parent-of-origin effect. Using simulations of pedigrees, we studied the properties of this method and the effect of departures from underlying hypotheses, misspecification of disease incidence in the general population or misspecification of the index case, and penetrance heterogeneity. In contrast with the previous version of the GRL, accounting for multiple trait disease allowed unbiased estimation of penetrance. We also showed that accounting for a parent-of-origin effect allowed a powerful test for detecting this effect. We found that the GRL method was robust to misspecification of disease incidence in the population, but that misspecification of the index case induced a bias in some situations for which we proposed efficient corrections. When ignoring heterogeneity, the penetrance estimate was biased toward that of the highest risk individuals. A homogeneity test performed by stratifying the families according to the number of affected members was shown to have low power and seems useless for detecting such heterogeneity. These extensions are essential to better estimate the risk of diseases and to provide valid recommendations for the management of patients.

7 from underlying hypotheses, misspecification of disease incidence in the general population or 8 misspecification of the index case, and penetrance heterogeneity. In contrast with the previous version 9 of the GRL, accounting for multiple trait disease allowed unbiased estimation of penetrance. We also 10 showed that accounting for a parent-of-origin effect allowed a powerful test for detecting this effect.

11
We found that the GRL method was robust to misspecification of disease incidence in the population, 12 but that misspecification of the index case induced a bias in some situations for which we proposed 13 efficient corrections. When ignoring heterogeneity, the penetrance estimate was biased toward that of INTRODUCTION 1 Some diseases with variable age of onset are due to the presence of predisposing gene mutations, 2 such as mismatch repair (MMR) genes in hereditary nonpolyposis colorectal cancer (HNPCC) or 3 BRCA1 and BRCA2 in breast-ovarian cancer syndrome. These genes may be responsible for hereditary 4 forms of these diseases. Precise estimation of the age-specific cumulative risk (penetrance function) 5 for mutation carriers is essential for defining prevention strategies.

6
Families in which such mutations have been identified can contribute to estimate these risks, as 7 long as adjustment is made for these families generally ascertained because of several affected 8 members 1 . Such families are usually referred by physicians to genetic counsellors who propose 9 genetic testing when specific criteria are fulfilled. For hereditary cancers, for example, most criteria 10 used for recommending genetic testing are based on familial aggregation of specific cancers 2,3 . When 11 a mutation is identified in an affected member (defined as index case), genetic testing is proposed to 12 close relatives who, if carriers, will be offered intensive surveillance, which will improve the 13 prognosis, or prophylactic surgery when possible. 14 An ascertainment-adjusted method, based on maximum likelihood, has been proposed for 15 estimating the age-specific cumulative risk (penetrance) of a given disease associated with a 16 deleterious mutation from families in which such a mutation has been identified 4 . This method, called 17 the "genotype-restricted likelihood" (GRL) method, provides unbiased penetrance estimates whatever 18 criteria used for ascertainment of families and without having to model the ascertainment process. The 19 GRL method corrects for the bias due to the selection on carrier genotype of the index case since only 20 families with an identified carrier individual are informative for penetrance estimation. This method is 21 especially appropriate for hereditary predispositions to common diseases when numerous and complex 22 familial criteria involving several affected relatives are used to recommend genetic testing. It has been 23 shown to be independent of selection criteria, in particular on the number of affected individuals and 24 on the age at diagnosis of affected family members 4 .

25
Beside the numerous advantages mentioned above, the GRL method relies on assumptions which 26 may not be fulfilled in some situations. In order to evaluate its robustness to a departure from these 27 hypotheses, we studied the sensitivity of the GRL in various situations such as disease frequency of 28 4 the general population over-or under-estimated or a genetic heterogeneity not taken into account. 1 Additionally, the previous GRL version allows estimating the penetrance of only one trait once at a 2 time and this might bias penetrance estimates in family syndromes where several different traits may 3 occur (pleiotropy), like different tumor localizations. Therefore, we proposed in this paper to extend 4 the method to account for pleiotropy. We also extended the method to allow for a parent-of-origin 5 effect, i.e. penetrance functions differing according to the gender of the parent who transmitted the 6 deleterious mutation. This effect has been described in some diseases 5-8 but has not been yet 7 accounted for in penetrance estimation. These extensions are essential to better estimate the risk of 8 diseases and to provide valid recommendations for the management of patients. 9 10

METHODS 11
The GRL method 12 The GRL 4,9 uses a retrospective likelihood conditioned on the phenotypes of all family members 13 and on the genotype of the index case. It estimates penetrance parameters for mutation carriers by 14 maximizing the probability of observed genotypes (G) of family members who have been tested for 15 the mutation found in the index case, conditional on observed phenotypes (P) and on index case being 16 a carrier (I). Due to the fact that the index case is always tested, the conditional probability may be 17 written as: 18 Pr(G/P,I) = Pr(G,P) / Pr(P,I)

19
Following the cure model of De Angelis 10 , we considered that a proportion κ of individuals will 20 never be affected and a Weibull model for penetrance for the others. The cumulative risk by age t is: where l is the genotype for the mutation (l = 1, 2, 3 for AA, Aa and aa genotypes respectively, A 23 being the mutated allele), α l and λ l are the Weibull shape and scale parameters, κ l the probability of 24 never being affected given l, and T is the age at disease occurrence.

25
Let y ij =(t ij , δ ij , g ij ) the set of observations on the j th member of the i th family, t ij the age at onset of 26 the disease or the age at censoring (earliest date among dates of prophylactic surgery, death or last 27 5 news), δ ij the indicator of occurrence of the disease before the age at censoring and g ij the observed 1 genotype which is coded as 0 when unknown. Let F(t; θ l ) the cumulative risk by age t for the l th 2 genotype and h(t; θ l ) the corresponding hazard functions with θ l =(α l , λ l , κ l ). The probability of the set 3 of observations (y ij ) given the latent genotype (u ij ) is: where φ(g ij ,l) is the probability of observed genotype knowing the latent genotype : φ(g ij ,l) = 1 if 6 g ij =0 or l and 0 otherwise.

7
Finally it is possible to calculate the probability Pr(G,P) using

( )
Pr ij ij y u l = with the Elston-8 Stewart algorithm 11,12 . The denominator, Pr(P,I), is computed in the same way with g ij being known 9 only for the index individual.

10
The maximization of the conditional likelihood Pr(G/P,I) was performed by the L-BFGS-B 11 algorithm 13 . When analysing a real family sample, confidence intervals of penetrance estimates are 12 obtained by bootstraps.

13
The method assumes that the mutated allele frequency in the general population is known as well 14 as the penetrance function in non carriers which is set equal to the incidence in the general population.

Extensions of the GRL 17
Multiple trait phenotype

18
The contribution to the likelihood of each individual was modified to simultaneously take into 19 account the phenotype of a variable number of possible traits, under the hypothesis that, given 20 genotype, the occurrence of a disease does not modify the risk of developing subsequently other 21 diseases. If t ijk is the age at onset or the age at censoring of the k th disease and δ ijk the indicator of the 22 occurrence of the k th disease before censoring time on the j th member of the i th family, y ij =(t ij1 t ijk .. , δ ij1 23 .. δ ijk .., g ij ) with probability : To take into account a parent-of-origin effect, we modified the likelihood by splitting the 2 heterozygote genotype in two different genotypes according to the paternal or maternal origin of the 3 mutated allele. Four genotypes were considered: AA, Aa, aA and aa, where Aa and aA are the ordered 4 heterozygous genotypes in which the first allele is transmitted by the father and the second one by the 5 mother. The matrix of genotype probabilities from one generation to another was modified 6 accordingly and three penetrance functions instead of two were considered for gene carriers. 7 8

Simulations of pedigrees 9
To study the statistical properties of the method, and in particular its robustness to departures from 10 underlying hypotheses, as well as the interest of the extensions implemented in the method, we 11 simulated four generation families with a fixed number of relatives: starting from the index case, we 12 simulated a couple of his(her) parents and two couples of grand-parents. The grand-parents and 13 parents of the index case (generation 1 and 2) had respectively two and three children. Each of them 14 had two children as well as all their offspring until the fourth generation (figure 1). Genotypes of 15 family members were randomly generated according to the mutated allele frequency in ancestors and 16 spouses and to Mendel's laws for offspring except the index case who had at least one mutated allele.

17
To simulate realistic situations, we chose to compute parameters to fit the data of a large national 18 French survey of 537 families with Lynch syndrome 14 . Age at last news (or age at death) for each 19 individual was assumed to be normally distributed with 63, 60, 50, 35 years for means and 17, 16, 13 20 and 11 years for standard deviations respectively for generations 1 to 4. For each family member 21 (including the index case), age at disease occurrence was randomly generated given penetrance 22 function according to genotype. The individual was set as affected or not by comparing the age at 23 disease occurrence and the age at last news. Finally, a family was selected if the index case was 24 affected and if the family fulfilled selection criteria on the minimal number of affected (MNA) 25 individuals.

26
We considered a dominant genetic model for disease transmission with equal penetrance functions 27 for AA and Aa genotypes which has been often observed in cancer predisposition so far. The A allele 28 7 was set to 0.001, so that most of the mutation carriers have the Aa genotype. The sets of penetrance 1 functions used for the simulations are shown in table 1. Three sets of penetrance fonctions (high, 2 medium and low risks) were used in the case when only one trait may be observed. To simulate 3 realistic situations for the several different traits exercise, parameters for the simulation of families 4 were chosen to fit the risks published for colorectal, endometrial, ovarian, and urologic tract cancers in 5 Lynch syndrome 9,15-18 .

6
As only some family members are usually tested in families, we had also to simulate when 7 genotypes are known or not, with probabilities fitting to realistic situations in cancer genetics. That is, 8 we fixed some individuals as known or unknown for their genotype and generated the availability on 9 the other family members' genotypes using a probability π, subsequently referred as the genotyping 10 rate, of being tested as follows:

11
-The index case and his(her) parents are always tested 12 -Grand-parents and spouses are never tested, as well as the sib of the non carrier parent 13 and his (her) offspring 14 -The sib of the carrier parent is tested with probability π and, if non carrier, his(her) 15 offspring is not tested 16 -The offspring of family members found to be carriers is tested with probability π.

Properties of the GRL 19
Estimates and standard errors of penetrance at different ages between 20 and 80 years were 20 obtained from their average on 200 replicates in different situations according to penetrance among 21 carriers and non carriers, genotyping rate π (0.1, 0.3, 0.5, 0.8, 1.0), number of families in the sample 22 (100,200,500), and MNA (2, 3 or 4) individuals for a family to be ascertained. Maximum likelihood 23 estimation should be asymptotically unbiased but some bias may exist with small sample size. It was 24 measured by the average difference between estimations and the true value.

25
Robustness to departures from underlying hypotheses was investigated by invalidating these 26 hypotheses and by evaluating the bias induced on penetrance estimates when analysing the simulated 27 data. We considered two sources of error when using the GRL method: 1) misspecification of the 28 8 disease incidence in the general population; 2) misspecification of the index case (cf. infra). In 1 addition, the GRL assumes that the probability of phenotypes is the same for all mutation carriers, i.e. 2 no heterogeneity in penetrance. However, heterogeneity of penetrance may exist because of modifier 3 factors and one may expect that families with higher penetrance are likely to have more affected 4 individuals than families with lower penetrance. We considered the possibility of penetrance 5 heterogeneity and studied estimates obtained when ignoring this heterogeneity. We also studied the 6 possibility of detecting heterogeneity using the number of affected individuals in the family as a 7 surrogate. Homogeneity tests were performed using likelihood ratio tests that compare the maximum 8 likelihood value L 1 obtained assuming the same penetrance in the different subgroups to the maximum 9 likelihood value L 2 obtained when allowing for a difference among subgroups. The statistic used, -2 10 Ln [L 1 /L 2 ], follows a Chi square distribution with a number of degrees of freedom equal to the 11 difference between the number of parameters estimated in L 1 and in L 2 respectively.

12
The interest of the extension to a possible parent-of-origin effect was evaluated by simulating such 13 an effect in two scenarios i) strong effect with large difference in penetrance, ii) small effect with 14 moderate difference in penetrance according to the parental origin of the mutation. In these two 15 situations, we estimated the power to detect the effect using a likelihood ratio test with a 0.05 type I 16 error. The interest of the extension to a multiple trait phenotype was evaluated by comparing the 17 results of the joint analysis to those obtained by performing separate analyses for each trait, where 18 individuals are considered as unaffected when affected by another trait than the one under study. 19 20

Bias and efficiency of the GRL 22
The bias on penetrance estimates due to limited sample size is shown in table 2 for various 23 situations. The bias is most often positive (i.e. penetrance is overestimated) but very small. Only in 24 extreme situations when the sample size is small (i.e. 100 families), few individuals are tested in the 25 family (e.g. π=0.1) and penetrance is low, the bias may be non negligible compared to the penetrance.

11
The incidence of common diseases in the general population is known when regional or national 12 registries provide accurate rates. However, these registries are usually available for recent time periods 13 only: e.g. the first cancer registry was established in France in 1978. As cancer incidence has varied 14 with time, the incidence specified in the analysis may not be valid for the past generations. In many 15 other diseases for which there are no registries, the incidence may be estimated with some error. We The index case is defined as the first individual who was tested in the family. Index cases are often 23 incident cases who asked for genetic counselling because one or several relative(s) are also affected.

24
However, when there are only prevalent cases in a family, the geneticist has to choose, among the 25 affected members, the person who will first be tested and will be considered as the index case. In 26 practice, the geneticist chooses the one with the highest prior probability of being a carrier, in general 27 the youngest affected one. Such a choice is expected to induce a bias on penetrance estimated as 28 explained on a specific example given in the appendix. To investigate this potential bias, we defined as 1 the index case the youngest affected one in the analyses. The results are shown in table 4 only for 2 MNA=3 or 4. Indeed, no effect was observed for MNA=2 since, in the simulated data, the index case 3 was often the youngest affected one. In general the bias is small, but may be substantial for high and 4 medium penetrance values, with an underestimation of the penetrance for young ages and an 5 overestimation for old ones.  Table 7 shows the power to detect a parent-of-origin effect in samples of various sizes and 15 variable genotyping rate π, for two sets of differences in penetrance according to the gender of the 16 parent having transmitted the mutated allele: a large difference between penetrances by age 70 of 17 respectively 0.30 and 0.60, and a small one of respectively 0.40 and 0.50. As for penetrance 18 heterogeneity, the power was estimated as the proportion of replicates with a significant homogeneity 19 test. The power to detect the parent-of-origin effect was found to be very high, even for samples of 20 moderate size, when the difference was large, but decreased dramatically when the difference between 21 penetrances decreased.  19,20 . In France, more than 2500 mutations in BRCA1 and BRCA2 were found in 1 index cases and more than 7000 relatives had genetic testing between 2003 and 2007. For MMR genes, 2 a mutation was found in more than 1000 index cases and nearly 3000 relatives were tested 3 (http://www.e-cancer.fr/v1/fichiers/public/synthese_evolution_activite_2003_2007.pdf). Such 4 familial data provide unique information on carrier risk, as long as adjustment for selection criteria of 5 these families is adequately performed. These criteria are complex and have evolved with time.

6
Moreover, these criteria are proposed as guidelines for genetic counsellors and therefore are not purely 7 applied. Therefore, a formal correction for these criteria is completely unfeasible and the GRL method 8 appears particularly appropriate for estimating disease risks in carriers.

9
In this paper, we investigated the statistical properties of the GRL method and proposed some 10 extensions.

11
As any maximum likelihood estimator, the GRL estimator is unbiased only asymptotically.

16
We studied the robustness of the GRL to departures from underlying hypotheses. We found that 17 the method was very robust to a misspecification of disease incidence in the general population, even 18 for important errors. Such an error could be due to a cohort effect as observed in some cancers 23 . Our 19 results indicated that the method is expected to perform well when using an average disease incidence 20 at an intermediate period between the periods of oldest and youngest generation of the families.

21
Regarding the specification of the index case, there is no ambiguity if the index is an incident case 22 who asked for genetic counselling and the GRL should be used as it. If there is any doubt on the index 23 identification in some families, when all affected family members are prevalent cases, our 24 recommendation is to use, for these families only, either the random procedure or the conditioning on 25 the genotype of all individuals affected by a disease under study 16 , provided that they had genetic 26 testing proving their carrier status. In general, the random procedure provides the least biased 27 estimates but Quehenberger's method 16 is a good alternative.

13
Regarding penetrance heterogeneity, we found that the estimated penetrance was close to that of 1 highest risk individuals because of selection on multiple case families. A homogeneity test performed 2 by stratifying the families according to the number of affected members was shown to have low power 3 and appears useless for detecting such heterogeneity. However, one must keep in mind that nearly all 4 families referred to genetic counselling so far are multiple case families and that the average 5 penetrance estimated from a sample of such families may be the most appropriate one to predict 6 disease occurrence and to recommend or not genetic testing in such context.

7
Another source of bias could be that we did not take into account a possible family-specific 8 random effect due to low penetrance genes with multiplicative effects, such as those found by 9 Antoniou et al. 24 in breast cancer, nor modifier genes such as those found to influence cancer risk in 10 BRCA2 carriers 25,26 . Such an effect could affect major gene penetrance estimate 22,27,28 . The GRL will 11 have to be extended to take into account such effects when better identified.

12
The GRL method assumes that the frequency of deleterious mutations is low in the general 13 population. A departure from this hypothesis would invalidate our approximation for risks in non 14 carriers that are set to be equal to the risks in the general population. The sample size needed to 15 estimate the risk in non carriers by the GRL is prohibitive and this approximation is necessary.
16 Therefore, we do not recommend using the GRL for mutations with frequency that would exceed 0.01 17 for a dominant predisposition, or 0.10 for a recessive one, in the general population.

18
Finally, we could evaluate the efficiency of the extensions of the GRL that we proposed, i.e.

19
possible multiple trait phenotype and parent-of-origin effect. Regarding the former, we found that 20 analyzing separately each trait induced a bias which was corrected when using the multiple trait 21 option. The bias is due to the fact that a single trait analysis uses an inaccurate ascertainment 22 correction when conditioning on only one trait whereas several traits led to the ascertainment of 23 families. Indeed, our results showed that the risks of CRC in men were unbiased when performing a 24 single trait analysis, which is most probably due to the fact that CRC is almost the only tumour 25 observed in carrier men. Therefore, this extension, in addition to sparing time by performing only one 26 analysis, avoids bias, particularly for low risk traits. This is most probably the reason why we found a 27 lower risk of endometrial cancer than other studies when we analyzed a sample of 36 French families 28 14 using our first version of the GRL 9 . Among studies that used a retrospective likelihood for estimating 1 cancer risk in Lynch syndrome 9,16,17,29 , note that only Quehenberger et al. 16 used a multiple trait 2 approach. This may partly explain the variability of estimates among studies.

3
The test for a parent-of-origin effect was shown to be very efficient with samples of moderate size 4 when the difference in penetrance was large. It would be interesting to apply the method for testing 5 such an effect in hereditary forms of diseases, such as breast and ovarian cancers associated to 6 mutations of BRCA1 or BRCA2 or Lynch syndrome associated with mutations of the MMR genes, as 7 this hypothesis has not been tested yet.   In 2005, a woman was affected by colorectal cancer at age 52 years and died two years later from 3 ovarian cancer (case 1). Nobody suspected Lynch syndrome at that time although one of her uncle, 4 currently aged 70 years, was affected at age 45 years by colorectal cancer (case 2). In 2009, the 5 brother of the latter, aged 71 years, also developed colorectal cancer (case 3). The sister of case 1 6 suspected a hereditary syndrome and asked for genetic counselling. As case 1 was not available any 7 more and case 3 could be a sporadic case because of late-onset diagnosis, the geneticist proposed 8 genetic testing of case 2. Therefore, case 2 was designed as the index case although case 3 was 9 obviously the incident case that would have been the 'natural' index.

10
Why should this choice induce a bias in the analysis? As the GRL is conditioned not only on the 11 phenotypes of family members, but also on the genotype of the index case, the choice of case 2 as the 12 index case 'cancels' his contribution to the likelihood and replaces it by the contribution of case 3. As 13 case 3 was affected at an older age than case 2, this tends to overestimate age of onset and to decrease 14 the penetrance estimate.