The case-cohort study design, first proposed by Prentice in 1986, is a commonly used, cost-effective, outcome-dependent design embedded in large cohort studies [1,2,3]. It is used to reduce costs or conserve resources when the rate of the outcome event of interest is low and/or resources to ascertain exposure data are limited. The case-cohort sample consists of a (stratified) random sample of the full cohort supplemented by cases who are not in the random sample. The case-cohort design has several advantages: (1) it reduces the cost and effort of collecting redundant data on non-cases; (2) the random sample can be used to monitor study progress; (3) data collected through a case-cohort design can be used to study the prospective relationship between the exposure and the outcome; and (4) because the random sample is selected independently of the outcome of interest, the collected exposure data can be used to study other outcomes in future studies. The nested case-control study is an alternative to the case-cohort design. In a nested case-control study, controls are selected at each failure time. Consequently, there is no representative random sample from the full cohort, and the data collected from one nested case-control study cannot easily be used to study other outcomes.

The case-cohort study design can be used in transplant research. For example, the Center for International Blood & Marrow Transplant Research (CIBMTR) has two levels of data collection: (1) Transplant Essential Data (TED); and (2) Comprehensive Report Form (CRF) data. Collecting CRF data takes more resources than collecting TED data. Transplant centers designated as CRF centers collect CRF data on some but not all recipients at their center. CRF data include detailed co-variates such as pretransplant conditioning and acute graft-versus-host disease (GvHD).

Consider a study to identify pretransplant co-variates correlated with the risk of developing a central nervous system (CNS) cancer posttransplant [4]. Posttransplant CNS cancers are rare, occurring in <1% of recipients. A case-cohort study can be an efficient way to address this question. At a CRF center one could select a random sample of recipients and, in addition, all recipients developing a CNS cancer. CRF data can then be collected on the selected random sample and on the few subjects with CNS cancer. CRF data could include co-variates such as age at radiotherapy, prior CNS radiation, exposure to anti-cancer drugs crossing the blood-brain barrier, GvHD, corticosteroid exposure and others.

Competing risks are common in studies of transplant recipients; an example is death from leukemia recurrence before developing a CNS cancer. It is important to analyze competing risks data from case-cohort studies properly. In this tutorial we briefly describe the case-cohort study design and the data available from it. We also introduce the commonly used analytic method, the cause-specific hazards model, and software for analyzing data from case-cohort studies with competing risks.

Case-cohort study design and the data structure

Let Ti and Ci be the potential failure and censoring times and µi (= 1, 2, …, or K) denote the cause of failure for subject i (= 1, …, n). Without loss of generality we denote the event of interest as ‘cause 1’ (µi = 1) and refer to it as the ‘cause of interest’ or ‘event of interest’. If there is only one cause of failure (i.e., K = 1) this reduces to the situation with a univariate survival outcome. Let Xi = min(Ti, Ci) denote the observed time and ∆i the failure indicator (∆i = 1 if Ti is observed before Ci and 0 otherwise). Let Zi(t) denote co-variates. For a case-cohort study, we sample a random sub-cohort from all subjects and add all subjects with the event of interest, regardless of whether they are in the selected sub-cohort. Figure 1 illustrates the case-cohort sample. Co-variate information Zi(t) can be decomposed into two parts as Zi(t) = (ZiC(t), ZiE(t)), where ZiC(t) are co-variates available on the entire cohort and ZiE(t) are co-variates available only for subjects in the case-cohort sample. For example, ZiE(t) can include CRF data such as pretransplant radiation dose and ZiC(t) can include TED-level data such as age at transplant and sex. Let ξi be an indicator for subject i being selected into the sub-cohort. The observable data are {Xi, ∆i, ∆iµi, ξi, ZiC(t), ZiE(t)} if subject i is in the case-cohort sample, and {Xi, ∆i, ∆iµi, ξi, ZiC(t)} otherwise.

Fig. 1: Case-cohort design. Illustration of subject selection in the case-cohort design.
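The data structure described above can be sketched as a small record type. This is only an illustration: the class name, field names and the example values in the usage note are hypothetical, and the time-dependent co-variates Zi(t) are simplified to fixed values.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Subject:
    x: float                     # observed time X_i = min(T_i, C_i)
    delta: int                   # 1 if failure is observed before censoring, else 0
    cause: int                   # delta_i * mu_i: 0 if censored, otherwise the cause 1..K
    xi: int                      # 1 if selected into the random sub-cohort, else 0
    z_cohort: Dict[str, float]   # Z_i^C: co-variates available on the entire cohort
    z_expensive: Optional[Dict[str, float]] = None  # Z_i^E: case-cohort sample only


def in_case_cohort_sample(s: Subject) -> bool:
    """True for sub-cohort members and for cases (cause 1) outside the sub-cohort."""
    return s.xi == 1 or s.cause == 1


def observable(s: Subject) -> Subject:
    """Return the record as the analyst sees it: Z^E is missing outside the sample."""
    if in_case_cohort_sample(s):
        return s
    return Subject(s.x, s.delta, s.cause, s.xi, s.z_cohort, None)
```

For instance, a censored subject with `xi = 0` keeps only the cohort-level co-variates after `observable` is applied, whereas a subject with `cause = 1` keeps both parts even when `xi = 0`.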

For example, suppose we are interested in assessing the impact of mutations in ASXL1, EZH2, SRSF2, IDH1, IDH2, and TP53 on death [5]. Collecting these data from stored DNA samples is expensive. To reduce cost and preserve samples we can design a case-cohort study. Assume there are 1000 subjects in the full cohort, 20% die, and we set the selection probability of the sub-cohort at 25%. The sub-cohort then has 250 subjects, of whom about 50 (20%) are expected to die, so the remaining 150 of the 200 deaths occur outside the sub-cohort. The expected size of the case-cohort dataset is therefore 400 subjects: 250 in the sub-cohort and 150 outside the sub-cohort. Overall, 200 of these subjects died and 200 are alive. In this scenario mutation data are collected only on these 400 subjects, whereas survival data and other co-variates such as age and sex are collected from all 1000 subjects in the study.
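The arithmetic in this example can be reproduced with a few lines of Python. This is a rough planning sketch, assuming simple random sampling so that the event rate inside the sub-cohort equals the rate in the full cohort; the function and key names are hypothetical.

```python
def expected_case_cohort_size(n_cohort, event_rate, alpha):
    """Expected composition of a case-cohort sample (rounded to whole subjects).

    n_cohort:   size of the full cohort
    event_rate: proportion of the full cohort with the event of interest
    alpha:      sub-cohort sampling probability
    """
    cases = round(n_cohort * event_rate)
    subcohort = round(n_cohort * alpha)
    # Under random sampling the sub-cohort is expected to contain the same
    # proportion of cases as the full cohort.
    cases_in_subcohort = round(subcohort * event_rate)
    cases_outside = cases - cases_in_subcohort
    return {
        "subcohort": subcohort,
        "cases_outside_subcohort": cases_outside,
        "total": subcohort + cases_outside,
        "cases": cases,
        "non_cases": subcohort - cases_in_subcohort,
    }


design = expected_case_cohort_size(n_cohort=1000, event_rate=0.20, alpha=0.25)
```

With these inputs `design["total"]` is 400, made up of 250 sub-cohort subjects and 150 cases from outside the sub-cohort.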

Models and weights for case-cohort studies

For competing risks data there are in general two commonly used models: (1) the cause-specific proportional hazards model; and (2) the sub-distribution hazards model. The cause-specific hazards model is useful when one’s interest is in studying disease etiology, whereas the sub-distribution hazards model is of greater interest when the emphasis is on estimating actual risk and prognosis. Here we focus on the cause-specific hazards model for case-cohort studies because of the availability of statistical software packages.

The hazard function in the cause-specific hazard model for cause k is given by:

$$\lambda_k\left( t \mid Z(t) \right) = \lambda_{0k}(t)\exp\left\{ \beta_k Z(t) \right\},$$

where \(\lambda _{0k}\left( t \right)\) is an unspecified baseline hazard function and βk is an unknown parameter of interest. The effect of a risk factor on the cause k outcome can be measured by the hazard ratio exp(βk). In the cause-specific hazard model one treats subjects who experience a competing risk as censored. When there is only one cause (i.e., K = 1) the cause-specific hazard model reduces to the Cox proportional hazards model.
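Treating competing events as censored amounts to recoding the failure indicator before fitting the model for a given cause. A minimal sketch, in which the function name and the toy vector of ∆iµi values are hypothetical:

```python
def cause_specific_indicator(delta_mu, cause=1):
    """Event indicator for a cause-specific hazard model.

    delta_mu holds delta_i * mu_i for each subject: 0 if censored, otherwise
    the cause of failure. Subjects who fail from any other cause are coded 0,
    i.e. they are treated as censored at their failure time.
    """
    return [1 if c == cause else 0 for c in delta_mu]


delta_mu = [0, 1, 2, 1, 0, 2]
event_cause1 = cause_specific_indicator(delta_mu, cause=1)  # [0, 1, 0, 1, 0, 0]
event_cause2 = cause_specific_indicator(delta_mu, cause=2)  # [0, 0, 1, 0, 0, 1]
```

The observed times are left unchanged; only the indicator differs between the model for cause 1 and the model for cause 2.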

Because we lack extensive co-variate data outside the case-cohort sample, the estimation method for the Cox proportional hazards model needs to be modified. The so-called weighted partial likelihood is widely used for the case-cohort design. The key to the weighted partial likelihood is understanding how subjects with the event of interest and sub-cohort subjects without the event of interest are weighted. Several weighting functions for the case-cohort design have been proposed [6,7,8]. In this tutorial, we focus on a time-independent weight function which uses the sub-cohort sampling probability, denoted by α. Specifically, the weight for subjects with the event of interest is 1 because all subjects in the full cohort with the event of interest are included in the case-cohort sample, i.e., the cases in the case-cohort sample are all the cases in the full cohort. In contrast, some subjects without the event of interest are not in the case-cohort sample. Consequently, sub-cohort subjects without the event of interest are weighted by 1/α. For example, suppose α is 25%. Then the weight for subjects in the sub-cohort who do not experience the event of interest is 1/0.25 = 4, indicating that one subject in the sub-cohort without the event represents four subjects without the event in the full cohort. In practice the sampling probability α is unknown and needs to be estimated.
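One common way to write this time-independent weight and the resulting weighted estimating function for the cause of interest is sketched below. The symbol \(w_j\) and the at-risk indicator \(Y_j(t) = I(X_j \ge t)\) are notation introduced here for illustration and are not defined elsewhere in this tutorial; the weighting functions in [6,7,8] differ in their details.

$$w_j = \Delta_j I(\mu_j = 1) + \left\{ 1 - \Delta_j I(\mu_j = 1) \right\}\frac{\xi_j}{\alpha},$$

$$U(\beta_1) = \sum_{i=1}^{n} \Delta_i I(\mu_i = 1)\left[ Z_i(X_i) - \frac{\sum_{j=1}^{n} w_j Y_j(X_i) Z_j(X_i)\exp\left\{ \beta_1 Z_j(X_i) \right\}}{\sum_{j=1}^{n} w_j Y_j(X_i)\exp\left\{ \beta_1 Z_j(X_i) \right\}} \right].$$

Cases receive \(w_j = 1\), sub-cohort non-cases receive \(w_j = 1/\alpha\), and subjects outside the case-cohort sample receive \(w_j = 0\), so their unobserved co-variates never enter the sums.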

To analyze case-cohort data using SAS (PHREG procedure), two steps are required: (1) create weights for each subject; and (2) calculate the robust variance to account for the case-cohort data structure. Example SAS code is provided in the Supplementary material. In the PHREG procedure, the “COVS(AGGREGATE)” option together with the “ID” statement allows calculation of a robust sandwich-type variance. The R statistical package provides similar capabilities. An example of R code is in the Supplementary material. We now show how to fit these cause-specific models using CIBMTR data.
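Before turning to the CIBMTR example, the first step (creating weights) and the weighted fit can be illustrated in plain Python. This is not the Supplementary SAS or R code: it is a minimal sketch, assuming a single time-fixed co-variate and Breslow-type handling of tied times, with a hypothetical function name and toy data. It returns the point estimate only; the robust variance of the second step is not computed here and should come from software such as PHREG or R.

```python
import math


def weighted_cox_beta(times, events, z, weights, max_iter=50, tol=1e-10):
    """Newton-Raphson maximiser of a weighted Cox partial likelihood.

    times:   observed times X_i
    events:  1 for the event of interest, 0 otherwise (censored or competing)
    z:       one time-fixed co-variate per subject
    weights: case-cohort weights (1 for cases, 1/alpha for sub-cohort non-cases)
    """
    n = len(times)
    beta = 0.0
    for _ in range(max_iter):
        score, info = 0.0, 0.0
        for i in range(n):
            if events[i] != 1:
                continue  # only events of interest contribute an event term
            s0 = s1 = s2 = 0.0
            for j in range(n):
                if times[j] >= times[i]:  # subject j is still at risk at times[i]
                    r = weights[j] * math.exp(beta * z[j])
                    s0 += r
                    s1 += r * z[j]
                    s2 += r * z[j] ** 2
            mean = s1 / s0
            score += weights[i] * (z[i] - mean)
            info += weights[i] * (s2 / s0 - mean ** 2)
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta


# Hypothetical toy case-cohort sample with sampling probability alpha = 0.25.
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 0, 1, 1, 0, 1, 1]
z = [1, 0, 1, 1, 0, 0, 1, 0]
alpha = 0.25

# Step 1: weight 1 for cases, 1/alpha for sub-cohort subjects without the event.
weights = [1.0 if e == 1 else 1.0 / alpha for e in events]

beta_hat = weighted_cox_beta(times, events, z, weights)
hazard_ratio = math.exp(beta_hat)
```

In this sketch a non-case with weight 4 contributes to each risk set exactly as four identical unweighted subjects would, which is the sense in which one sampled subject represents four subjects in the full cohort.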


Consider the transplant dataset reported by Ustun et al. (2018) of 7128 subjects receiving a first allograft for acute myeloid leukemia, acute lymphoblastic leukemia, or myelodysplastic syndrome from January 2008 to December 2012 [9]. The primary outcome of interest in these data is a fungal infection. Of the 7128 subjects, 589 (8%) had a fungal infection by day 100 and 1059 (15%) died without a fungal infection before day 100. To create a case-cohort sample, we first randomly select about 20% of the 7128 subjects (1434 subjects) to form a sub-cohort. Of these 1434 randomly selected subjects, 115 had a fungal infection before day 100, 163 died before day 100 without a fungal infection and 1156 had neither a fungal infection nor died before day 100. Next, we add the 474 subjects (589 − 115) who had a fungal infection before day 100 and are not in the randomly selected sub-cohort, bringing the number of subjects in the case-cohort sample to 1908 (1434 + 474) (Fig. 2). In this case-cohort sample, 1319 (1434 − 115) did not have a fungal infection before day 100 and were weighted by 1/0.2 = 5, whereas 589 had a fungal infection before day 100 and were weighted by 1.

Fig. 2: Case-cohort example. The case-cohort sample for the fungal infection.
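The counts in this example, and the role of the weights, can be checked with simple arithmetic. The sketch below uses only the numbers quoted in the text; the variable names are hypothetical. It also compares the weighted number of non-cases in the sample with the actual number of non-cases in the full cohort, which the weighting is meant to approximate.

```python
n_full = 7128
n_subcohort = 1434
cases_total = 589               # fungal infection before day 100, full cohort
cases_in_subcohort = 115
deaths_in_subcohort = 163       # died before day 100 without a fungal infection
event_free_in_subcohort = 1156  # neither fungal infection nor death before day 100

cases_added = cases_total - cases_in_subcohort          # 474
sample_size = n_subcohort + cases_added                 # 1908
non_cases_in_sample = n_subcohort - cases_in_subcohort  # 1319

nominal_alpha = 0.20
weight_non_case = 1 / nominal_alpha                     # 5.0

# Each sampled non-case stands in for 1/alpha non-cases in the full cohort.
weighted_non_cases = non_cases_in_sample * weight_non_case  # 6595.0
actual_non_cases = n_full - cases_total                     # 6539

# The realised sampling fraction is slightly above the nominal 20%.
realised_alpha = n_subcohort / n_full
```

The weighted count (6595) is close to, but not exactly, the actual number of non-cases (6539), partly because the realised sampling fraction is a little above the nominal 20%.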

Co-variates of interest in this study were age at transplant, graft-type, GvHD prophylaxis, and year of transplant. We checked the proportional hazards assumption by testing whether the coefficient of the interaction between log-transformed time and each co-variate was equal to zero; all p values were >0.05.

Co-variate frequencies in the full cohort and the sub-cohort (Table 1) indicate reasonable comparability. Next, we fit the cause-specific hazard model using the case-cohort sample and fit the same model using the full cohort to compare results. Note that the full cohort analysis is possible only because we generated the case-cohort sample from the full cohort; it would not be possible in a real case-cohort study. Table 2 shows hazard ratios, 95% confidence intervals and p values. Hazard ratios based on the case-cohort sample are very close to those based on the full cohort. The data indicate that age at transplant, graft-type, GvHD prophylaxis, and year of transplant are significantly associated with the risk of a fungal infection before day 100 in both the full cohort and the case-cohort sample. As expected, the 95% confidence intervals for the case-cohort sample (N = 1908) are wider than those for the full cohort (N = 7128).

Table 1 Frequencies of the cohorts.
Table 2 Analyses using the cause-specific hazards model.


The case-cohort design is an efficient, cost-effective approach when the event(s) of interest is rare and/or when obtaining co-variate data is difficult and/or expensive, and it has great potential in hematopoietic cell transplant research. We provide a brief review of the case-cohort design and show how to properly analyze case-cohort data when there are competing risks using statistical software packages. In this tutorial we considered only cause-specific hazards models for competing risks, but one can easily apply this weighting scheme to sub-distribution hazards models such as the Fine-Gray model [10, 11].

In our example we selected the sub-cohort by simple random sampling, but stratified sampling can also be used to ensure balance for important co-variates. Also, in this tutorial we considered only time-independent weights. Several methods have been proposed to improve efficiency for case-cohort studies using time-dependent weights and extra information such as auxiliary co-variate data, whereby time-dependent weights are calculated among subjects at risk at each time point [12, 13]. The case-cohort design can also be used to analyze multiple outcomes [14,15,16,17,18]. Lastly, sample size estimation is an important first step in designing a study, and formulae for sample size and power calculations are available [19, 20].

We hope readers will find this discussion useful and share it with their center statisticians. We expect increased use of the case-cohort method to tackle important questions in hematopoietic cell transplantation in the near future.