Introduction

As clinical sequence information becomes more available, a more-detailed understanding of genetic risk conferred by clinically tested genes and individual variants observed will become more important and more feasible to obtain [1]. Several recent studies have found novel cancer associations using cohorts of clinically ascertained patients who have had cancer panel testing [2,3,4,5,6,7,8]. In addition, penetrance estimates have often been obtained from families selected for having a high disease burden [9,10,11,12,13,14,15]. Each of these situations has potential to lead to biased overestimates of specific disease risk. When dealing with bias inherent in proband ascertainment, one can simply omit probands who were selected for both genotype and phenotype. This leads to calculating risk from analysis of relatives, who are typically assumed to be ascertained with less bias. However, for research studies that ascertain families based on multiple affected individuals, this assumption is not true. Similarly, in clinically ascertained pedigrees with specific genotype findings there may be an unknown ascertainment bias for both patients and relatives with both disease-specific and non-specific phenotypes being overrepresented because ordering providers may justify diagnostic testing based on either expected or intriguing patterns of familial disease. A system to compare incidence in ascertained families to expected population incidence could help identify unexpected biases.

Similarly, analysis attempting to classify individual variants in genes with established effects may suffer from clinical ascertainment bias. Large clinical data sets may be ideal for defining hypomorphic or low-risk variants, as large samples sizes are needed to prove low level risk. However, in the clinical variant classification scenario, all variants in a gene are typically assumed to either have a known “pathogenic” effect or have no effect [16]. Assumed “pathogenic” variant penetrance may be determined from variants initially used to map or identify the gene. When these family data sets are used to generate risk estimates for genes implicated in hereditary disease risk, all variants thought to have the same functional effect have typically been pooled to have a large enough sample to generate risk estimates with confidence intervals consistent with significant risk [17,18,19]. Comparing clinical risk with established risk estimated for variant classification may be problematic for several reasons if there are differences between clinical and research sampling criteria. Research work regarding variant classification may be biased based on family selection criteria used to discover risk genes and clinical samples may have different bias based on less-clearly defined clinical selection. Because of both of these biases, variants with low-to-moderate risk relative to the assumed risk profile may never be correctly classified regardless of the amount of data. On the other hand, benign variants may be incorrectly classified as “pathogenic” if ascertainment enriches for specific phenotypes relative to population risk as ascertained by other studies.

As a way to identify the existence and magnitude of clinical ascertainment bias, we used ascertained pedigrees to evaluate cancer risk using cumulative population incidence of disease, which for cancer is publicly available from the SEER database (seer.cancer.gov) [20]. Applying these standardized incidence ratios (SIR), or relative risk over SEER, to pedigrees can detect potential ascertainment bias by using a control data set of patients/participant pedigrees with negative panel results from the same cohort where case testing was performed. Although we focus on identifying ascertainment bias, in a cohort without ascertainment bias, the method might also be used to assess a priori variant affect without prior knowledge of penetrance by providing an estimate of the SIR. When used for variant classification, this method is conceptually similar in some ways to the family history weighting algorithm presented by Pruss and colleagues [21, 22]. A major difference is that the pedigree SIR approach does not rely on matching large numbers of case and control pedigrees, as hypothetical “SEER” control pedigrees are created using the same family structure as actual case pedigrees. This approach may be simpler than the family history weighting algorithm and will certainly be easier for others to implement, as we have made all our code publicly available.

To illustrate the need to assess ascertainment bias in clinical samples and to demonstrate the performance characteristics of the pedigree SIR method, we simulated families with varying risks and varying number of individuals genotyped, and use the method to classify them. We also assessed clinically ascertained families from hereditary cancer risk testing at the University of Washington and show clinical ascertainment is likely to have unexpected bias that may explain published associations from genetic risk analysis in clinical data sets [2,3,4,5].

Methods

Overview

Given a set of probands carrying a particular variant and their associated pedigree, we can determine variant effect in relation to cancer by comparing the number of relevant cancers to the theoretical distribution of cancers assuming the pedigrees follow SEER incidence rates. Define X as the number of individuals with the desired phenotype, in this case a specific type of cancer. The probability that an individual has cancer assuming a benign variant would be a Bernoulli random variable with probability, p, equal to the SEER probability that the specific individual has cancer based on their sex and age. This single person distribution would have the following mean and variance.

$$\mu = p$$
$$\sigma ^2 = p\left( {1 - p} \right)$$

When more individuals are included into the analysis, and an independence assumption is added, this distribution changes into a Poisson binomial distribution with mean and variance listed below.

$$\mu = \mathop {\sum }\limits_i p_i$$
$$\sigma ^2 = \mathop {\sum }\limits_i p_i(1 - p_i)$$

Here pi represents the probability that person i has cancer given their sex and age. By comparing the observed number of affected individuals in a test group (e.g., clinically ascertained families with negative results or specific variant carriers), X, to the Poisson binomial distribution with mean and variance assumed to follow SEER incidence, we can determine the probability that X came from that distribution. If it is different than expected, or has a small p-value, then the test group is unlikely to come from the same population as the population defined in the incidence curves (e.g., There is ascertainment bias or the variant of uncertain significance (VUS) has an effect on the phenotype). To perform this comparison, we rely on the R package “poibin”, which generates the Poisson binomial distribution [23].

Alternatively, we can classify a variant by estimating its standard incidence ratios, or similarly relative risk over SEER. A standard incidence ratio greater than one would mean that the variant affects function. Standard incidence ratio is defined as the observed rate of a particular disease to the age adjusted expected rate in the general population. For the Poisson binomial model this becomes

$$\alpha = \frac{X}{{\mathop {\sum }\nolimits_i p_i}}.$$

Here, α is our estimate of the SIR, X is the observed number of cancers, and pi is the individual’s probability of getting cancer based on SEER data. Furthermore, the confidence interval for α can be found by determining the values of α which do not yield significant probabilities, when comparing X to the Poisson binomial with αp in place of each success probability, for the given level of confidence. In other words, the confidence interval would be the values of α that the data does not reject at the given confidence level. Note that a variant which affects function according to the first method will have a SIR with confidence interval greater than 1 in the second as the two methods are related.

SEER incidence

SEER incidence values were taken from seer.cancer.gov [20]. As these were given in 5-year increments, we obtained intermediate, yearly, values using a linear interpolation of the adjacent points. These were done assuming the values given in the table were for the right end points of each interval with the right end point of “85+” set to 120. Bias estimates for different cancers and sites were calculated separately.

Probability of having cancer by sex and age

The probability, p, of having cancer is based on age, sex, and affection status. For an unaffected person, p is equal to the probability that the individual did not get cancer for their sex up to their current age or age at death. For an affected person, p is equal to the product of the probability that the individual did not get cancer for their sex up to the age they became affected. In other words, if k is the individual’s current age, age at death, or age of affection (if affected and Ri is the gender specific yearly incidence rates from SEER for the individual, then \(p = 1 - \mathop {\prod}\nolimits_{i = 1}^k {(1 - R_i)}\).

Weighting individuals by probability of having a genotype for variant classification

In the case of variant classification, in which many individuals will have unknown genotypes, it would be advantageous to count only individuals with the variant or with some chance of having the variant. This can be done by weighting individuals based on the probability they have the variant given individuals with known variant status and familial relationships. In this case we would first find all obligate carriers and then find the probability that the individuals with unknown genotypes have the variant and weight them accordingly. An R script to calculate these weights is included in the Supplementary File.

Evaluating potential bias

To determine if a data set may be biased we can use the method above to compare the number of cancers ascertained in “negative” families to the number theoretically expected in these families, assuming they follow the SEER distributions. Appropriate negative families are those that have the same enrollment criteria and genetic testing performed as case samples that will be used for a proposed analysis, but that have no genetic cause of cancer identified. Significant enrichment in the number of cancers in the data set of control families would indicate ascertainment bias.

Simulations

Using the CoSeg R package, we simulated pedigrees using published United States of America demographics. These demographics included age at marriage, age at death, and number of offspring surviving to adulthood for males and females during each decade since 1900. Our goal was to generate pedigrees that would be similar in size and shape to those of individuals receiving genetic testing for hereditary breast, ovarian, or colon cancer at the clinic. In brief, we began with a seeded age that is sampled from a skewed normal distribution derived from the age distribution of individuals receiving hereditary cancer testing at the University of Washington [24], extended up three generations to create a founder with the variant in question and then expanded the pedigree down with each descendant having 0.5 probability of inheriting the variant of interest from a parent with the variant. We used published population demographic measures for average marriage age, number of offspring living to adulthood, and mortality [25, 26]. Phenotypes were sampled based on age and genotype status using the published MLH1 penetrance [27, 28] and various SIR’s.

We simulated 1000 sets of three-generation families ranging in size from 1 to 15 with relative risk (RR) of 1, 5, 10, and 15 times the SEER incidence and calculated the probability we would see a greater number of cancers than seen in each set using the CoSeg R program [29, 30]. To give some perspective, we also compared these with families simulated with literature-reported MLH1 penetrance [27, 28]. Next, we simulated 1000 sets of five families of varying RR and varied the proportion of relatives with known genotype from 100 to 0% (the proband always had known genotype, of course). We again calculated the probability we would see the number of cancers seen given that we were sampling from the SEER values and plotted them as box and whisker plots. To demonstrate the effect of ascertainment bias, we simulated sets of five families of varying RR and selected 1000 that had more than one affected individual. We again calculated the probability we would see the number of cancers given the SEER distribution and plotted the results.

Clinical sample

The BROCA panel is a pan-cancer risk panel that is optimized for individuals with hereditary breast and ovarian cancer. We evaluated 21 sequential families with BROCA panel and billing codes indicating ordering for breast and ovarian cancer evaluation. The ColoSeq panel is a similar panel designed for evaluating colorectal cancer risk. We evaluated an additional 21 sequential families had ColoSeq panel testing or pan-cancer panel testing where billing codes indicated ordering for colorectal cancer risk. For all families selected in both groups the results of clinical testing were negative (no cancer risk variants or VUS identified in any tested gene). These pedigrees were coded and deidentified for subsequent analysis. Without any pre-existing estimates of potential bias, we coded an arbitrary number, 21, in each group for our initial analysis. None of these families had identified hereditary cancer risk variants. Testing was referred from patients across the country, with a preponderance of orders coming from Washington State. The study was approved by the University of Washington IRB (#00005392).

For cancers not predicted to be part of clinically expected risk profiles, observed cancers should follow the SEER distribution, if there is no bias present. Using the pedigree structure and reported or estimated ages of individuals in these families, we ran 100,000 simulations to generate expected cancer count for breast, colorectal, endometrial, ovarian, pancreatic, and prostate cancer using published SEER age-specific cancer incidence [20]. We excluded probands from actual and expected cancer counts so that data would illustrate bias in relative selection, rather than bias in proband selection. We then created a histogram of the expected cancer count values and compared them to the actual number of cancers seen in the families.

Software availability and timing

All code used for this manuscript is publicly available as a supplementary R file to be used with the freely available R program. Each analysis takes less than a minute to run using 100,000 simulations to create an expected SEER distribution.

Results

Simulated sample performance characteristics: variant effect without ascertainment bias

To demonstrate the performance characteristics of the method in the absence of bias, we used 1000 sets of simulated 3-generation pedigrees with variants which affect function that conferred 1, 5, 10, and 15 times the SEER incidence of colorectal cancer. For these plots, we assumed a 30% genotyping rate apart from the proband and weighted individuals based on the probability they carry the variant. With a single family, it is unlikely that a correct classification will be made for any relative risk regardless of ascertainment bias. At seven families, we are able to correctly classify more than half of the relative risk 15 sets and the MLH1 sets. At 13 families, we are able to correctly classify more than half of the relative risk 10 sets as well. Furthermore, only five families were incorrectly classified as having an effect on risk when using families simulated with relative risk 1, sets across all family sizes (8000 families total). Figure 1 shows a box and whisker plot of these results.

Fig. 1
figure 1

Number of families required for classification of variants with varying relative risks including colon cancer caused by a variant which affects function in MLH1. A horizontal dashed line is drawn at P = 0.05 and a solid line at P = 0.01 for convenience. Here, each set of families is simulated 1000 times to obtain a distribution for the boxplot. The underlying simulation used colon cancer risk from the SEER population as baseline and assumes that 30% of individuals are genotyped

We further explored the performance characteristics of this method. To deal with incomplete genotyping, we experimented with weighting ungenotyped individuals based on the probability that they have the variant. Appendix Fig. 1 shows the results of unweighted, weighted, and inverse-weighted (gives weight based on the probability individuals do not have the variant) analysis for several relative risks including MLH1. The number of individuals genotyped in a pedigree can vary, depending on study design. Having fully genotyped pedigrees naturally provides the most correct classifications, though classification can still be made with proband-only genotypes, which is the scenario that is most common for clinically ascertained pedigrees. Increasing the number of genotyped individuals provides non-linear increases in information as family structure often allows imputation of genotype status of ungenotyped individuals. Figure 2 shows the boxplots obtained from varying the percent of genotyped individuals in the pedigrees.

Fig. 2
figure 2

Effect of genotyping on determining the class of a variant for varying relative risks. Here, each set of 5 families is simulated 1000 times to obtain a distribution. The underlying simulation used colon cancer risk from the SEER population

Simulated sample performance characteristics: detecting ascertainment bias

To evaluate the power of this method to demonstrate bias, we simulated pedigrees across a range of relative risks and selected only those with at least two affected individuals for analysis. We analyzed the data set and show that for known benign data sets the pedigree SIR approach is able to detect bias. Figure 3 shows the result for benign, relative risk 1, variants along with other relative risks. Pedigrees with no risk and biased ascertainment appeared like pedigrees with relative risk of 5. For variants causing increased risk, including bias leads to a p-value for variant effect that would be expected with a higher actual risk. Pedigrees simulated with 5-fold risk and ascertainment bias appeared to have greater risk than pedigrees with 10-fold risk and no ascertainment bias (Fig. 3).

Fig. 3
figure 3

This figure shows boxplots of the probability of correct “pathogenic” classification in unbiased and biased pedigrees, ones which were selected to have more than two affecteds with only one accounted for, for varying relative risks. Here, probabilities were obtained from a sample of five families and repeated 1000 times to obtain distributions. The underlying simulation used colon cancer risk from the SEER population as baseline and assumes that 30% of individuals are genotyped

Detecting ascertainment bias in a clinical sample

To assess whether we could use clinically ascertained families to determine cancer risk associated with specific variants, we analyzed 42 pedigrees from the University of Washington. Half of these pedigrees (21) were sent in with clinical orders suggesting either breast/ovarian and the other half (21) with orders for colorectal/endometrial cancer risk sequencing. No patient or relative in the pedigrees was found to have variants that affect function or VUS in any gene tested. The probability that these pedigrees are unbiased for each cancer type is shown in Fig. 4.

Fig. 4
figure 4

Expected distributions of the number of affected individuals for various cancer types for 21 clinical families with testing ordered for breast or ovarian cancer risk a, and colorectal or endometrial cancer risk b. Also included are the number of observed cancer cases for each type represented by a vertical black line and the corresponding p-value. Note that these histograms of expected cancer cases in the group of families were made by sampling the distribution 100,000 times

The set of pedigrees where breast/ovarian cancer risk was interrogated was significantly biased (P < 0.01) for breast, ovarian, and pancreatic cancer, as expected. Unexpectedly, these families also had trends suggesting selection for colorectal and endometrial cancer in relatives. There was no observed bias for prostate cancer in this small set of families. Had these sets not been biased, estimates of the relative risk would be 4.25, 9.30, 3.91, 1.88, and 2.61 for breast, ovarian, pancreatic, colorectal, and endometrial cancer, respectively (Appendix Table 1).

The set of pedigrees with clinically ascertained for colorectal/endometrial cancer risk was significantly biased (P < 0.01) for colorectal and endometrial cancer, as expected. There was a trend for bias towards breast cancer with no apparent bias for ovarian and prostate cancer in pedigrees. Had these sets not been biased, estimates of the relative risk would be 2.47, 4.55, and 1.55 for colorectal, endometrial, and breast cancer, respectively (Appendix Table 1).

Discussion

Clinical pedigrees are increasingly used for variant classification, genetic association, and penetrance estimates, but these may be biased owing to a higher likelihood of families enriched for expected and unexpected phenotypes being sent to the clinic [2,3,4,5]. We developed a relatively simple method to assess this potential bias. Even a small number of pedigrees showed trends suggesting unexpected bias patterns for pedigrees from clinical testing performed by the University of Washington Department of Laboratory Medicine. In addition to the bias expected based on stated rationale for test orders, these small groups of pedigrees showed trends towards enrichment for specific unrelated phenotypes, with colorectal and endometrial cancer potentially being overrepresented in families being tested for breast cancer risk and breast cancer potentially being overrepresented in families being tested for colorectal cancer risk. Neither group showed enrichment for prostate cancer, indicating that assessments of prostate cancer risk in patients found to have positive genetic findings during this period are less likely to be biased. However, ordering patterns may change in the future after as a result of manuscripts showing variants in BRCA2, which affects function in patients with metastatic prostate cancer [31]. The exact values for the amount of clinical ascertainment bias presented here should not be interpreted as definitive or generalizable. Bias in different samples may be different, so studies of genetic effect should conduct independent evaluations of sample-specific bias.

We evaluated clinically ascertained families seeking to evaluate novel associations of other clinical laboratories and potentially using clinical pedigrees for individual variant assessment. With only a small set of pedigrees we observed a high probability for false positives, which diminished our initial enthusiasm for using clinically ascertained pedigrees to assess novel associations. It was intriguing to us that the level of ascertainment bias observed in clinically ascertained samples sent to the University of Washington might explain recently reported associations of known colorectal cancer risk genes MSH6 and PMS2 with breast cancer [2, 4]. The observed bias is not surprising given national guidelines for medical professionals about appropriate hereditary cancer risk ordering [32]. The question of ascertainment bias depends on what population the inference is to be made. We assumed that risk relative to the general population is ideal. Risk estimates using clinical samples may produce correct comparisons with clinically ascertained “controls”, even if they are not ideal estimates of risk relative to the general population. If risk relative to the general population is measured, larger studies that look for low levels of risk will need larger control samples to check for subtle sources of ascertainment bias.

In the absence of bias, using SEER data to estimate the expected number of cancers in a family might help classify variants of uncertain significance over a range of relative risk factors, known genotypes, and number of families without any prior estimates of penetrance. This method for variant assessment may have some distinct benefits over traditional methods. We developed this method to not rely on previous estimates of penetrance, so it would be capable of identifying variants that are less penetrant than those identified in published literature. In addition, it incorporates data from the entire pedigree. (Avoiding the proband, who is known to be ascertained for both phenotype and genotype, is already generally done.) The last feature, which has turned out to be critically important in pre-study data assessment, is that comparison of baseline pedigrees with SEER incidence can highlight ascertainment bias. This bias may be seen in research ascertainment for cohorts used to define pathogenicity as well as clinical cohorts used to classify VUS or identify new genotype–phenotype correlations.

This method has two major assumptions. The first assumption is that the underlying disease risk in the population studied is well-defined (e.g., SEER incidence). The second is that the individuals in the data set are independent. With different assumptions more complex segregation analysis using software, such as MENDEL, could be used similarly with variant “negative” families to assess ascertain bias. There are several limitations to this method. In populations with different levels of underlying risk, this method will generate skewed results. This could be corrected by using population-based incidence estimates instead of the general population estimates, if these are known. Although we have been able to show there is likely to be bias in one clinically ascertained data set, we have not shown how to account or correct for it. Being able to account for that bias might allow one to use and combine variant data from a variety of ascertainment strategies. Furthermore, the independence assumption is likely not completely correct as individuals within families are correlated with each other. However, as many families are needed for the analysis, the independence of families is likely to outweigh the dependence within them, especially with larger samples of families. Another minor limitation is that for variant classification, whereas we have shown that weighting counts by probability of having the variant gives higher power (See Appendix Fig. 1), this leads to comparing fractions to the discrete valued Poisson binomial distribution. We have taken a conservative approach to this issue by rounding down, as there are clear gains in power by incorporating genotype probability information from ungenotyped individuals. It is worth noting that this method is conceptually similar to the logistic regression approach by Easton et al. [33].

Although convenient, clinically ascertained data sets may have subtle biases. In the future, researchers seeking to explore risk associations with clinical data sets should perform more thorough evaluations of ascertainment bias in their samples. The process of evaluating families that test negative in parallel with those that test positive with the same ordering criteria is simple and could be an important standard for association studies that rely on clinical ascertainment. Comparisons of clinical cases with population controls from public databases that are ascertained using very different criteria will lead to many false positives without this or a similar check for ascertainment bias. Although we have shown how one may effectively ascertain clinical ascertainment bias relative to general population risk, we are not aware of a method to correct for false associations that may occur due to this bias. Appropriately correcting for biased ascertainment will be an important topic for future statistical genetics work.