Abstract
Over the last decade, with the widespread use of quantitative analyses in medical research, close co-operation between statisticians and physicians has become essential, from experimental design through all phases of complex statistical analysis. At the same time, easy-to-use statistical packages allow clinicians to perform basic statistical analyses themselves. Since the software they most commonly use does not perform in-depth competing risk analysis, we recommend an add-on package for the R statistical software. We provide instructions for downloading it from the internet and illustrate how to use it to analyse a sample dataset of patients who underwent haematopoietic stem cell transplantation for acute leukaemia.
Introduction
In studies on haematopoietic stem cell transplantation (HSCT), Kaplan–Meier (KM) estimates of survival curves and Cox proportional hazard models are widely used to describe survival trends and identify significant prognostic factors. All these statistical analyses deal with only one type of event, for example death, independently of its cause.
A particular situation arises when interest is focused on a specific cause of failure in the presence of other causes, which alter the probability of experiencing the event of interest. This is the case of competing risk events, a situation in which an individual is exposed to two or more causes of failure and the eventual failure can be attributed to exactly one of them. In this case, the occurrence of one type of event precludes the occurrence of any other event.
In patients who underwent HSCT, the failure events commonly studied are relapse of the original disease (REL) and death from causes related to the transplant (transplant-related mortality, TRM). If the interest is in estimating the probability of relapse, TRM is a competing risk event and the cumulative incidence function (CIF) must be calculated so as to account for it appropriately.1, 2
Until recently, this analysis was erroneously performed as 1-KM, treating the competing events as censored at the time they occurred. This censoring is inappropriate because, once a competing event has occurred, failure from the cause of interest is no longer possible. 1-KM correctly estimates the probability of failure from any cause, whereas the probability of one type of competing event is correctly estimated by the CIF, which partitions the overall probability of failure into the probabilities corresponding to each competing event: at any point in time, the overall 1-KM is equal to the sum of the CIFs for each type of event.3, 4 Moreover, Gray's test5 is one of the appropriate tests for assessing the statistical significance of a prognostic factor in a cumulative incidence analysis.
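The difference between the two approaches can be checked numerically. The following is a minimal sketch in base R, using a small hypothetical dataset (not the one analysed in this paper) and two illustrative helper functions, km_surv() and cum_inc(), which are our own names and not part of any package. It assumes status is coded 0 for censored, 1 for TRM and 2 for relapse.

```r
# Hypothetical toy data (not the paper's dataset):
# status 0 = censored, 1 = TRM, 2 = relapse
ftime  <- c(2, 3, 5, 6, 8, 9, 12, 15, 20, 24)
status <- c(1, 2, 0, 2, 1, 2, 0, 1, 2, 0)

# Kaplan-Meier survival at each distinct event time; 'event' flags the
# observations counted as failures, all others are treated as censored
km_surv <- function(time, event) {
  tj <- sort(unique(time[event]))
  n  <- sapply(tj, function(t) sum(time >= t))          # at risk at tj
  d  <- sapply(tj, function(t) sum(time == t & event))  # failures at tj
  data.frame(time = tj, surv = cumprod(1 - d / n))
}

# Cumulative incidence of one cause: running sum of
# (failures from that cause / at risk) * overall KM just before each time
cum_inc <- function(time, status, cause) {
  overall <- km_surv(time, status != 0)
  tj      <- overall$time
  s_prev  <- c(1, head(overall$surv, -1))
  n       <- sapply(tj, function(t) sum(time >= t))
  d_r     <- sapply(tj, function(t) sum(time == t & status == cause))
  data.frame(time = tj, cif = cumsum(s_prev * d_r / n))
}

overall   <- km_surv(ftime, status != 0)   # failure from any cause
cif_trm   <- cum_inc(ftime, status, cause = 1)
cif_rel   <- cum_inc(ftime, status, cause = 2)
naive_rel <- km_surv(ftime, status == 2)   # TRM wrongly treated as censored

# Values at the last event time
last <- nrow(overall)
c(one_minus_km = 1 - overall$surv[last],
  sum_of_cifs  = cif_trm$cif[last] + cif_rel$cif[last],
  cif_relapse  = cif_rel$cif[last],
  naive_1_km   = 1 - tail(naive_rel$surv, 1))
```

In this toy example the two CIFs add up to the overall 1-KM at every event time, while the naive 1-KM for relapse is larger than the relapse CIF.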
Generally speaking, few of the statistical packages used today in medical statistics implement the CIF for competing risk events, and even fewer provide tools for comparing the cumulative incidence of a particular type of failure among different groups. A ‘competing risks’ analysis is provided by an add-on package for R.6 R is open-source software for statistical computing and graphics, freely available at www.r-project.org. R performs many statistical analyses needed in practical applications: linear and generalized linear models, nonlinear regression models, time-series analysis, parametric and nonparametric tests, clustering, smoothing and survival analysis. R also provides additional packages for specific purposes.
We show how to perform a competing risk analysis in R through an example, using a sample dataset containing survival times of anonymous patients with leukaemia who underwent HSCT. In the analysis we estimate the CIF, test for equality of CIF curves across subgroups, and compute pointwise confidence intervals. A brief overview of the statistical concepts is given in Appendix A.
Obtaining and installing R
R is an open source project that is distributed under the GNU (www.gnu.org) GPL (General Public License). Sources, binaries, documentation and additional packages for R software can be obtained via CRAN, the Comprehensive R Archive Network, at the master site, TU Wien, Austria (http://cran.R-project.org). The R distribution comes with a set of manuals. We suggest beginners read the handbook An Introduction to R. Other books on R include introductory texts, such as Dalgaard,7 Iacus and Masarotto8 and data analysis books, such as Venables and Ripley.9
R is currently developed for the Linux/Unix, Windows and Apple OS families of operating systems. The installer for Windows 95, 98, ME, NT4, 2000 and XP is available at http://cran.r-project.org/bin/windows/base/. The setup program to download is R-x.x.x-win32.exe, where x.x.x is the current version (R-2.4.0 at the time of writing). After downloading the file, run it to install R on your computer in the usual way. Competing risk analysis is available in an add-on package called cmprsk.
Installing the R package cmprsk
Start R in Windows by double clicking on the desktop icon. R issues the symbol >, then expects input commands.
Windows users who are on-line:
select ‘Packages’ from the main menu,
select ‘Install package(s)…’,
choose a CRAN site,
select the cmprsk package to download and install.
Windows users who are not on-line:
download the cmprsk package as a zip file from http://cran.r-project.org/src/contrib/PACKAGES.html,
select ‘Install package from local zip file…’ from the ‘Packages’ menu.
Other information and details of how to install packages for other operating systems are available in the R Installation and Administration manual.
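As an alternative to the menus, the package can also be installed by typing commands at the R prompt, which works in the same way on every operating system. The fragment below is a minimal sketch that assumes an internet connection and that a CRAN mirror can be reached; R may ask you to choose one.

```r
# Install cmprsk from CRAN only if it is not already installed
if (!requireNamespace("cmprsk", quietly = TRUE)) {
  install.packages("cmprsk")
}

# Load the package in the current session
library(cmprsk)
```

Installation is needed only once per computer, whereas library(cmprsk) must be repeated in each new R session.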
Reading data for competing risk analysis in R
As an example of competing risk analysis in R, we analyze data from 35 patients with acute leukaemia who underwent HSCT. We estimate the cumulative risk of relapse and TRM. At the same time we test equality of cumulative incidence curves in patients affected by AML and ALL. Data can be read in R in a variety of formats (including data from SAS, MINITAB, STATA) and these are fully explained in the R Data Import/Export manual. As MS Excel is commonly used for creating clinical databases, we will show how R software reads an Excel file which in our example is named ‘bmt.xls’.
Instructions:
Go to the Excel Save menu,
Save your worksheet file as a CSV (comma separated values) file,
Close Excel.
Start R in Windows by double clicking on the desktop icon. R issues the symbol >, then expects input commands.
In English versions of Excel where the decimal point is used:
Type:
>bmt=read.csv("bmt.csv")
When the comma is used for the decimal point as in the Italian version:
Type:
>bmt=read.csv("bmt.csv", sep=";", dec=",")
If, as above, the filename does not contain a path, the file is assumed to be located in the current working directory.
If the file is NOT in the current working directory, the function file.choose() provides a graphical user interface from which users can search for a file within any folder.
Type:
>bmt=read.csv(file.choose(), sep=";", dec=",")
Once the sample data have been read, R stores them in an object named bmt.
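To see what the sep and dec arguments do without touching your own files, the following sketch writes a tiny hypothetical file (three invented rows, not the real data) to a temporary location and reads it back. It also uses read.csv2(), a standard R function whose defaults are already sep=";" and dec=",".

```r
# Hypothetical three-row file in the semicolon / decimal-comma layout
csv_file <- tempfile(fileext = ".csv")
writeLines(c("dis;ftime;status",
             "0;13,5;2",
             "1;7,25;1",
             "1;30,0;0"), csv_file)

# Explicit separators, as in the text
bmt_demo  <- read.csv(csv_file, sep = ";", dec = ",")
# read.csv2() uses sep = ";" and dec = "," by default
bmt_demo2 <- read.csv2(csv_file)

str(bmt_demo)                   # ftime should be read as numeric
identical(bmt_demo, bmt_demo2)  # both calls should give the same data frame
```

If ftime is reported as character (or as a factor in older versions of R), the decimal separator has probably not been specified correctly.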
To see the data of the 35 patients
Type:
>bmt
dis indicates disease, coded 0 for ALL and 1 for AML; ftime indicates follow-up time in months, that is, the time from transplant to relapse for patients who relapsed, to death for patients who died from TRM, or to the last check-up for survivors; status is coded 0 for a survivor (censored), 1 for death from TRM and 2 for relapse.
For an alternative presentation, which shows the data structure,
Type:
>str(bmt)
Before starting any data analysis
Type:
>attach(bmt)
The function factor() may be used to redefine dis (disease) as a factor with two levels, because in our example there are two diseases (AML and ALL). Each is given its label: 0 is labelled ALL, and 1 is labelled AML.
Type:
>dis=factor(dis, levels=c(0,1), labels= c("ALL", "AML"))
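Note that, after attach(bmt), the assignment above creates a separate object called dis in the workspace and leaves the column bmt$dis unchanged. Users who prefer to keep the labels inside the data frame can recode the column directly. The following is a minimal sketch on a hypothetical stand-in for bmt (five invented rows); the name bmt_demo is ours.

```r
# Hypothetical stand-in for the bmt data frame
bmt_demo <- data.frame(dis    = c(0, 1, 1, 0, 1),
                       ftime  = c(13, 7, 30, 4, 22),
                       status = c(2, 1, 0, 2, 0))

# Recode dis inside the data frame, so the labels travel with the data
bmt_demo$dis <- factor(bmt_demo$dis, levels = c(0, 1),
                       labels = c("ALL", "AML"))

levels(bmt_demo$dis)

# with() evaluates an expression using the columns of the data frame,
# so attach() is not needed
with(bmt_demo, table(dis))
```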
To perform some basic descriptive summaries:
Type:
>table(dis, status)
The resulting table shows the number of observations for each combination of status and disease. A table that reports the mean follow-up times in months for each combination of status and disease is obtained as follows:
Type:
>tapply(ftime, list(dis, status), mean)
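The two kinds of summary described here, counts and mean follow-up for each disease-status combination, can be tried on a small hypothetical dataset (six invented patients, not the 35-patient file). This sketch uses only the base R functions table() and tapply(); the object names demo, counts and means are ours.

```r
# Hypothetical toy data (not the 35-patient dataset)
demo <- data.frame(
  dis    = factor(c(0, 0, 0, 1, 1, 1), levels = c(0, 1),
                  labels = c("ALL", "AML")),
  ftime  = c(4, 10, 30, 6, 12, 40),
  status = c(2, 2, 0, 1, 2, 0)
)

# Number of patients for each disease-status combination
counts <- with(demo, table(dis, status))

# Mean follow-up time (months) for each disease-status combination
means <- with(demo, tapply(ftime, list(dis, status), mean))

counts
means   # combinations with no patients are reported as NA
```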
Fitting the cumulative incidence function
We wrote an R function, called CumIncidence(), which fits cumulative incidence curves with minimal user interaction. Its source code is reported in Appendix B, but it may be more convenient to download the file ‘CumIncidence.R’ from http://www.stat.unipg.it/luca/R.
The function CumIncidence() is simply a wrapper to the package cmprsk, and must be loaded in each work-session. Assuming it is contained in the file ‘CumIncidence.R’ located in your working directory, it can be simply loaded as follows:
Type
>source ("CumIncidence.R")
If the file to be sourced is not in the current working directory, the function file.choose() can be used to select the file within any folder:
>source (file.choose())
The function CumIncidence() estimates cumulative incidence curves. It requires at least the following arguments: the follow-up time (ftime in our example); the indicator variable containing the competing events (status); and the grouping variable (dis). The argument cencode indicates the code for censored observations and is set to 0 by default. Further arguments can be provided, such as xlab for the x-axis label; these are described later. Finally, in our example the output estimates are assigned to an object named fit.
Type:
>fit=CumIncidence (ftime, status, dis, cencode = 0, xlab="Months")
The following output is shown on screen:

First, the printout shows the results of Gray's test5 for equality of CIFs across groups. In our example, the cumulative incidence curves for ALL and AML do not differ significantly for death due to TRM (coded as 1; P-value=0.253915), but they differ significantly for relapse (coded as 2; P-value=0.0077785).
The output reports cumulative incidence estimates, and corresponding standard errors, for each cause of failure (TRM or relapse) in each disease group (ALL or AML). A plot of estimated CIFs for each cause of failure–disease combination is also produced (see Figure 1).
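Since CumIncidence() is a wrapper to cmprsk, the same quantities can also be obtained by calling the package directly. The sketch below uses the cmprsk functions cuminc() and timepoints() on a small hypothetical dataset (not the paper's data); it only illustrates what the package returns and is not a reproduction of the internals of CumIncidence(). The code runs only if cmprsk is installed.

```r
# Hypothetical toy data (not the paper's dataset):
# status 0 = censored, 1 = TRM, 2 = relapse
ftime  <- c(2, 3, 5, 6, 8, 9, 12, 15, 20, 24, 4, 7, 10, 14, 18, 26)
status <- c(1, 2, 0, 2, 1, 2, 0, 1, 2, 0, 2, 0, 1, 2, 0, 0)
dis    <- factor(rep(c("ALL", "AML"), times = c(10, 6)))

if (requireNamespace("cmprsk", quietly = TRUE)) {
  # Estimate one cumulative incidence curve per group and cause
  fit_raw <- cmprsk::cuminc(ftime, status, group = dis, cencode = 0)

  # One list element per group-cause pair, plus "Tests"
  print(names(fit_raw))

  # Gray's test for each cause: statistic, P-value and degrees of freedom
  print(fit_raw$Tests)

  # Estimates and variances at chosen time points (months)
  print(cmprsk::timepoints(fit_raw, times = c(6, 12, 24)))
}
```

With so few observations the test results are of no practical interest; the example is meant only to show the structure of the output.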
By default, cumulative incidence estimates are computed on a suitable grid of time points; in our example, time points are taken from 0 to 70 months in steps of 10. These automatic time points can easily be customized by the user: in this case the call to the function CumIncidence() requires a further argument, t, which provides a vector containing the user-defined time points. Suppose that, in our example, we choose 3-month intervals for the first year and 12-month intervals for years 2, 3 and 4 after transplantation; the call is then the following:
Type:
>fit=CumIncidence (ftime, status, dis, cencode = 0, xlab="Months", t=c(0, 3, 6, 9, 12, 24, 36, 48))
The resulting output shows both estimates of cumulative incidence and s.e. evaluated at the given time points.
Pointwise confidence intervals for competing risk curves
Computing confidence intervals provides useful information about the uncertainty of parameter estimates. A pointwise confidence interval for the CIF at a fixed time point can be obtained using the method proposed by Choudhury10 (see Appendix A for a brief description).
The CumIncidence() function computes pointwise confidence intervals when a further argument, level, is supplied to specify the desired confidence level. For example, we may compute 95% pointwise confidence intervals at our selected time points as follows:
Type:
>fit=CumIncidence (ftime, status, dis, cencode = 0, xlab="Months", t=c(0, 3, 6, 9, 12, 24, 36, 48), level = 0.95)
The first part of the output is equal to that discussed above. For any combination of competing events (TRM/REL) and disease (ALL/AML), the estimated lower and upper confidence limits at given time points are reported. A set of graphs which plot the estimated curves with the corresponding confidence intervals is also produced (see Figure 2).
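For readers who wish to see how such limits arise, the following is a minimal sketch of a pointwise interval based on the ln(−ln) transformation mentioned in Appendix A. The function name cif_ci() and the numerical inputs are hypothetical, and the code is an illustration of the transformation rather than the code used inside CumIncidence().

```r
# Pointwise confidence interval for a cumulative incidence estimate,
# based on the ln(-ln) transformation.
# est: estimate strictly between 0 and 1; var: its estimated variance
cif_ci <- function(est, var, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  w <- z * sqrt(var) / (est * log(est))   # negative, because log(est) < 0
  cbind(lower = est^exp(-w), estimate = est, upper = est^exp(w))
}

# Hypothetical estimates and variances at three time points
ci <- cif_ci(est = c(0.10, 0.25, 0.40), var = c(0.002, 0.004, 0.006))
ci
```

Because the interval is built on the transformed scale, the limits always lie between 0 and 1, which is not guaranteed by the simpler estimate ± z × s.e. interval.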
Final remarks
Before analyzing your own data you might like to perform a competing risks analysis in R with our dataset, so as to confirm results and to practice the instructions we have given you here. The data file, both in CSV and XLS Excel format, used in the example is available at the web address http://www.stat.unipg.it/luca/R.
As you become confident and competent in using the R software, you will be able to exploit all its potential to estimate the effects of covariates in more complicated models of competing risk analysis, as proposed by Fine and Gray.11
References
1. Klein JP, Moeschberger ML. Survival Analysis: Techniques for Censored and Truncated Data, 2nd edn. Springer: New York, 2003, 536pp.
2. Pintilie M. Competing Risks: A Practical Perspective. John Wiley & Sons: New York, 2006, 240pp.
3. Klein JP, Rizzo JD, Zhang MJ, Keiding N. Statistical methods for the analysis and presentation of the results of bone marrow transplants. Part I: unadjusted analysis. Bone Marrow Transplant 2001; 28: 909–915.
4. Satagopan JM, Ben-Porat L, Berwick M, Robson M, Kutler D, Auerbach AD. A note on competing risks in survival data analysis. Br J Cancer 2004; 91: 1229–1235.
5. Gray RJ. A class of K-sample tests for comparing the cumulative incidence of a competing risk. Ann Stat 1988; 16: 1141–1154.
6. R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing 2006, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
7. Dalgaard P. Introductory Statistics with R. Springer: New York, 2002, 267pp.
8. Iacus SM, Masarotto G. Laboratorio Di Statistica Con R. McGraw-Hill: Milano, 2002, 384pp.
9. Venables WN, Ripley BD. Modern Applied Statistics with S, 4th edn. Springer: New York, 2002, 495pp.
10. Choudhury JB. Non-parametric confidence interval estimation for competing risks analysis: application to contraceptive data. Stat Med 2002; 21: 1129–1144.
11. Fine JP, Gray RJ. A proportional hazards model for the subdistribution of a competing risk. J Am Stat Assoc 1999; 94: 496–509.
12. Marubini E, Valsecchi MG. Analysing Survival Data from Clinical Trials and Observational Studies. Wiley: New York, 1995, 430pp.
13. Pepe MS, Mori M. Kaplan–Meier, marginal or conditional probability curves in summarizing competing risks failure time data? Stat Med 1993; 12: 737–751.
14. Lin DY. Non-parametric inference for cumulative incidence functions in competing risks studies. Stat Med 1997; 16: 901–910.
Acknowledgements
We thank Dr Geraldine Anne Boyd, University of Perugia, for editing this paper.
Appendices
Appendix A
Competing risk analysis is dedicated to the study of failure probabilities when each individual may fail due to one of several causes, called competing events. The cumulative incidence function is defined as the probability of failing from cause r (r=1,…, k, where k is the number of causes of failure) up to a certain time point t. Formally, it may be written as
$$I_r(t) = \int_0^t \lambda_r(u)\, S(u)\, du$$
where $\lambda_r(t)$ is the cause-specific hazard rate and $S(t)=\Pr(T \ge t)$ is the survival function. The non-parametric maximum likelihood estimate of the (cause-specific) CIF is computed as follows:
$$\hat I_r(t) = \sum_{j:\, t_j \le t} \frac{d_{rj}}{n_j}\, \hat S(t_{j-1})$$
where $d_{rj}$ is the number of failures at time $t_j$ from cause r, $n_j$ is the number of individuals at risk at time $t_j$, and $\hat S(t_{j-1})$ is the Kaplan–Meier estimate of the overall survival function just before time $t_j$. It is interesting to note that $\sum_{r=1}^{k} \hat I_r(t) = 1-\hat S(t)$, that is, the sum of the cumulative incidences from all causes is equal to 1 minus the Kaplan–Meier estimate of overall survival.
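Confidence intervals can be derived10 based on the ln(−ln) transformation, so that the (1−α)100% confidence interval for the cumulative incidence function at time t for cause r is given by
$$\hat I_r(t)^{\exp\left\{\pm z_{\alpha/2}\, \sigma_r(t) \,/\, \left[\hat I_r(t)\, \ln \hat I_r(t)\right]\right\}}$$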
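where $z_{\alpha/2}$ is the upper α/2 percentile of the standard normal distribution, and $\sigma_r(t)$ is the square root of the estimated variance of $\hat I_r(t)$. This can be calculated as follows (see Marubini and Valsecchi, p 341, eq. 10.12):12
$$\widehat{\operatorname{var}}\left[\hat I_r(t)\right] = \sum_{j:\, t_j \le t} \left[\hat I_r(t) - \hat I_r(t_j)\right]^2 \frac{d_j}{n_j (n_j - d_j)} + \sum_{j:\, t_j \le t} \left[\hat S(t_{j-1})\right]^2 \frac{d_{rj}\,(n_j - d_{rj})}{n_j^3} - 2 \sum_{j:\, t_j \le t} \left[\hat I_r(t) - \hat I_r(t_j)\right] \hat S(t_{j-1})\, \frac{d_{rj}}{n_j^2}$$
where $d_j=\sum_{r=1}^{k} d_{rj}$.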
Finally, comparison of cause-specific CIFs in different groups can be performed using one of the tests proposed, among others, by Gray,5 Pepe and Mori,13 and Lin.14
Appendix B

Scrucca, L., Santucci, A. & Aversa, F. Competing risk analysis using R: an easy guide for clinicians. Bone Marrow Transplant 40, 381–387 (2007). https://doi.org/10.1038/sj.bmt.1705727
Keywords
- competing risk analysis
- R statistical software
- sample data base
- instructions for use