Survival analysis

Uwe Graichen · uwe.graichen@kl.ac.at

Overview

Survival analysis involves statistical methods that compare time to a specific event between groups to estimate the effect of prognostic factors, such as medical treatment or harmful influences. The event can be the death of the patient. However, they can also be any other endpoints, such as healing or the occurrence of a complication. In this script we present three common methods of survival time analysis, the Kaplan Meier survival time curve, the log rank test, and the Cox regression.

Survival analysis — Principles and an illustrative data analysis using Gnu R

Objective and scope of application

Survival analysis is a set of statistical methods that compares the time to a specific event between groups to estimate the effect of prognostic factors, medical treatment or harmful influences. The event can be the death of the patient, but other endpoints such as recovery, disease or the occurrence of a complication can also be considered. Examples of such statistical analysis methods are the Kaplan-Meier estimator, the log rank test or the Cox regression.

Motivating example

In our exemplary survival analysis, we use a dataset of the North Central Cancer Treatment Group (NCCTG). This dataset contains information about the survival of patients with advanced lung cancer, plus some additional predictors, such as age, gender, and assessments of the patient performance status. It was originally presented by (Loprinzi et al., 1994). We want to compare the survival probability between female and male patients, and we want to identify predictors that influence the survival probability.

Analysis script in Gnu R

Used Gnu R toolboxes

For data import, analysis and visualization of the results we use already existing toolboxes for the Gnu R system. The toolboxes in use are described further down in the post. They are integrated by the following code fragment:

 1library(tidyverse)     # use tidy data
 2library(haven)         # import SPSS file
 3library(rstatix)       # pipe friendly statitics 
 4library(RColorBrewer)  # color maps
 5library(kableExtra)    # table output
 6library(ggplot2)       # high quality plots
 7library(pander)        # rendering R objects into Pandoc markdown
 8library(gtsummary)     # publication-ready analytical and summary tables
 9library(survival)      # survival analysis routines
10library(survminer)     # drawing survival curves and tables

Import of the data

First, the data to be analysed is imported into the Gnu R analysis environment. For the exemplary analyses, we use a dataset from the NCCTG that contains information on lung cancer patients. The following data are included in the dataset:

Variable Description
inst Institution code
time Survival time in days
status Censoring status censored = 1 dead = 2
age Age in years
sex Gender male = 1 female = 2
ph.ecog ECOG performance score as rated by the physician. 0=asymptomatic, 1= symptomatic but completely ambulatory, 2= in bed <50% of the day, 3= in bed > 50% of the day but not bedbound, 4 = bedbound
ph.karno Karnofsky performance score (bad=0-good=100) rated by physician
pat.karno Karnofsky performance score as rated by patient
meal.cal Calories consumed at meals
wt.loss Weight loss in last six months

The data are available in SAV format. We use the function read_sav of the toolbox haven for the data import.

1# Import a dataset that is in SAV format
2dataIn <- read_sav("NCCTGLungCancerData.sav")

Data wrangling, data selection

Just for information, we determine the size of the imported data.

1# number of data items and variables 
2dim(dataIn)
3## [1] 228  10

The imported data set includes 10 variables and 228 data items.

In the following step, we

  • select the variables to be analysed,
  • remove data entries with missing values and
  • merge the categories that include bedriddenness into one category.

We store the result of these data pre-processing and selection steps in the tibble dataAnalysis.

 1# Extract data to be used for analysis
 2dataAnalysis <- dataIn %>%
 3  # include all variables except inst (minus sign)
 4  select(-inst) %>%
 5  # Remove records with NA values (not available, or no answer)
 6  na.omit() %>%
 7  # change some variables to factors
 8  mutate(sex = as_factor(sex)) %>%
 9  mutate(ph.ecog = as_factor(ph.ecog)) %>%
10  mutate(status = labelled::remove_labels(status)) %>%
11  # merge the categories that include bedriddenness into one category.
12  mutate(ph.ecog = recode(ph.ecog, 'in bed <50% of the day'='at least partially bedridden',
13                          'in bed > 50% of the day but not bedbound'='at least partially bedridden', 'bedbound'='at least partially bedridden'))

Short data summary

In the table below we give a summary of the data that we include in the further steps of the analysis. The meaning of the censoring status in the table below is:

  • 1 – censored
  • 2 – dead
1# print summary table
2tbl_summary(dataAnalysis)
Characteristic N = 1681
Survival time in days 269 (175, 416)
status
    1 47 (28%)
    2 121 (72%)
Age in years 64 (57, 70)
Gender
    male 104 (62%)
    female 64 (38%)
ECOG performance score as rated by the physician
    asymptomatic 47 (28%)
    symptomatic but completely ambulatory 81 (48%)
    at least partially bedridden 40 (24%)
Karnofsky performance score rated by physician
    50 4 (2.4%)
    60 16 (9.5%)
    70 25 (15%)
    80 47 (28%)
    90 50 (30%)
    100 26 (15%)
Karnofsky performance score as rated by patient
    30 2 (1.2%)
    40 2 (1.2%)
    50 3 (1.8%)
    60 23 (14%)
    70 30 (18%)
    80 38 (23%)
    90 44 (26%)
    100 26 (15%)
Calories consumed at meals 975 (622, 1,156)
Weight loss in last six months 7 (0, 15)
1 Median (IQR); n (%)

The censoring status

A study usually has a clearly defined start and end date. If the event to be investigated occurs during this period (in our example this is the death of the patient), then the survival time and the censoring status dead (encoded as 2) are stored. However, it may happen that the event under investigation does not occur until the end of the study or that the patient decides to leave the study. In this case, the censoring status censored (encoded as 1) is stored and the survival time is the period from the patient’s inclusion in the study until the end of the study or until the patient leaves the study.

Analysis by means of Kaplan-Meier survival plot

For the survival analysis by means of a Kaplan-Meier survival plot we use functions which are provided by the packages survival and survminer. The Kaplan-Meier method is a non-parametric approach that results in a step function. At each time an event occurs, there is a step down. We fit a survival curve to the data using the two functions Surv() and survfit() from the package survival. The form of the survfit() command is survfit(formula, data=dataTibble). The meaning of the two parameters formula and data is:

  • formula is a symbolic description of the model to be fitted to the data within the analysis. The variable on the left-hand side of a tilde (~) is called the outcome variable (or dependent variable), while the variables on the right-hand side of a tilde are called the predictors (or independent variables) and are joined by plus signs +. The outcome variable is generated using the Surv() function.
  • data is a tibble containing the variables in the model.

As a result of applying the ´Surv()´ function we get a vector with time values (in days), a single time value for each patient. Values with a plus sign attached describe the time elapsed until censoring, without a plus sign until the event (dead) occurs. For the purposes of illustration, we apply the Surv() function separately to the two variables time and status of the dataAnalysis data tibble.

 1# create a survival object, which encodes
 2# the survival time and the censoring status
 3Surv(dataAnalysis$time, event=dataAnalysis$status)
 4##   [1]  455   210  1022+  310   361   218   166   170   567   613   707    61 
 5##  [13]  301    81   371   520   574   118   390    12   473    26   107    53 
 6##  [25]  814   965+   93   731   460   153   433   583    95   303   519   643 
 7##  [37]  765    53   246   689     5   687   345   444   223    60   163    65 
 8##  [49]  821+  428   230   840+  305    11   226   426   705   363   176   791 
 9##  [61]   95   196+  167   806+  284   641   147   740+  163   655    88   245 
10##  [73]   30   477   559+  450   156   529+  429   351    15   181   283    13 
11##  [85]  212   524   288   363   199   550    54   558   207    92    60   551+
12##  [97]  293   353   267   511+  457   337   201   404+  222    62   458+  353 
13## [109]  163    31   229   156   329   291   179   376+  384+  268   292+  142 
14## [121]  413+  266+  320   181   285   301+  348   197   382+  303+  296+  180 
15## [133]  145   269+  300+  284+  292+  332+  285   259+  110   286   270   225+
16## [145]  269   225+  243+  276+  135    79    59   240+  202+  235+  239   252+
17## [157]  221+  185+  222+  183   211+  175+  197+  203+  191+  105+  174+  177+

In the next step, we estimate the survival curves for female and male patients using the Kaplan-Meier method. We use the formula Surv(time, event=status) ~ sex.

1# estimate a survival curve using Kaplan-Meier method
2kmGender = survfit(Surv(time, event=status) ~ sex, data=dataAnalysis)

We use the ggsurvplot() function to visualise the result of the Kaplan-Meier analysis. In the upper part, the survival curves for both cohorts (female, male) are shown, the median survival times are marked. The middle part of the figure shows, for selected time points, the percentages of patients still at risk of an event. In the lower part of the figure, the censorship events are visualised.

 1# visualize the result of the Kaplan-Meier analysis
 2ggsurvplot(kmGender, 
 3           # adjust the label
 4           xlab="Survival time in days", 
 5           ylab="Survival probability",
 6           break.time.by = 150,
 7           # use percentage at risk
 8           risk.table=TRUE,
 9           risk.table.pos = "out",
10           risk.table.col = "strata",# Risk table color by groups
11           risk.table.y.text = FALSE,
12           risk.table.height=.18,
13           # include confidencial intervals
14           conf.int=TRUE,
15           # draw horizontal AND vertical line for median
16           surv.median.line="hv",
17           # cumulative events table
18           cumevents = TRUE,
19           cumevents.col = "strata",# Risk table color by groups
20           cumevents.y.text = FALSE,
21           cumevents.height=.18,
22           # cumulative number of censoring
23           cumcensor = TRUE,
24           cumcensor.col = "strata",# Risk table color by groups
25           cumcensor.y.text = FALSE,
26           cumcensor.height=.18,
27           tables.theme = theme_cleantable(),
28           # Change ggplot2 theme
29           ggtheme = theme_bw(), 
30           # set the color map
31           palette = "Set1"
32)
33## Warning in geom_segment(aes(x = 0, y = max(y2), xend = max(x1), yend = max(y2)), : All aesthetics have length 1, but the data has 2 rows.
34## ℹ Please consider using `annotate()` or provide this layer with data containing
35##   a single row.
36## All aesthetics have length 1, but the data has 2 rows.
37## ℹ Please consider using `annotate()` or provide this layer with data containing
38##   a single row.
39## All aesthetics have length 1, but the data has 2 rows.
40## ℹ Please consider using `annotate()` or provide this layer with data containing
41##   a single row.
42## All aesthetics have length 1, but the data has 2 rows.
43## ℹ Please consider using `annotate()` or provide this layer with data containing
44##   a single row.
Visualisation of the results of the Kaplan-Meier analysis

Figure 1: Visualisation of the results of the Kaplan-Meier analysis

For quantitative analysis, parameters of the Kaplan-Meier survival curve can be printed in tabular form (function tbl_survfit()), below the survival probabilities after 30, 90, 180, 360 and 720 days

1tbl_survfit(kmGender,
2            times = c(30, 90, 180, 360, 720),
3            statistic = "{estimate}",
4            label_header = "**{time} days**")
Characteristic 30 days 90 days 180 days 360 days 720 days
Gender




    male 94% 87% 69% 36% 7.4%
    female 98% 91% 83% 56% 20%

and the median survival time.

1tbl_survfit(kmGender, probs=c(0.5),
2            label_header = "**Median survival time (days)**")
Characteristic Median survival time (days)
Gender
    male 284 (229, 353)
    female 426 (345, 641)

log-rank test

We would like to analyse whether the survival times of the two cohorts (female, male) differ significantly. The log-rank test is a common approach for this. In this test, the observations are equally weighted over the entire follow-up time. The log-rank test can be performed using the ´survdiff()` function of the package survival.

 1survdiff(Surv(time, event=status) ~ sex, data=dataAnalysis, rho = 0)
 2## Call:
 3## survdiff(formula = Surv(time, event = status) ~ sex, data = dataAnalysis, 
 4##     rho = 0)
 5## 
 6##              N Observed Expected (O-E)^2/E (O-E)^2/V
 7## sex=male   104       83     69.5      2.60      6.16
 8## sex=female  64       38     51.5      3.52      6.16
 9## 
10##  Chisq= 6.2  on 1 degrees of freedom, p= 0.01

The test shows that there is a significant difference (p = 0.01) between the two cohorts (female, male) in terms of survival times.

Cox Proportional Hazards Survival Regression

Finally, we would like to identify further predictors that have an influence on survival time and quantify the magnitude of their influence. For this type of analysis, the Cox Proportional Hazards Survival Regression (also referred to as Cox Regression) can be used. We apply the Cox proportional hazards model to the data using the two functions Surv() and coxph() from the package survival. The form of the coxph() command is coxph(formula, data=dataTibble). The meaning of the two parameters formula and data is described earlier in this script.

We investigate the influence of the predictors gender, age, ECOG performance score, calories consumed at meals and weight loss in the last six months.

1coxph(Surv(time, status) ~ sex + age + ph.ecog + meal.cal + wt.loss,
2      data = dataAnalysis) %>% 
3  tbl_regression(exp = TRUE) %>%
4  add_significance_stars(hide_ci = FALSE, hide_p = FALSE,
5                         pattern = "{p.value}{stars}")
Characteristic HR1 SE1 95% CI1 p-value2
Gender



    male
    female 0.58 0.201 0.39, 0.86 0.007**
Age in years 1.01 0.011 0.98, 1.03 0.6
ECOG performance score as rated by the physician



    asymptomatic
    symptomatic but completely ambulatory 1.46 0.236 0.92, 2.31 0.11
    at least partially bedridden 2.91 0.296 1.63, 5.19 <0.001***
Calories consumed at meals 1.00 0.000 1.00, 1.00 >0.9
Weight loss in last six months 0.99 0.008 0.97, 1.00 0.12
1 HR = Hazard Ratio, SE = Standard Error, CI = Confidence Interval
2 *p<0.05; **p<0.01; ***p<0.001

If we look at the p-values/star codes, only the two predictors gender and ECOG performance score influence survival time.

We remove the non-significant predictors from the model formula and repeat the analysis.

1coxReg <- coxph(Surv(time, status) ~ sex + ph.ecog,
2                data = dataAnalysis)
3
4coxReg %>% 
5  tbl_regression(exp = TRUE) %>%
6  add_significance_stars(hide_ci = FALSE, hide_p = FALSE,
7                         pattern = "{p.value}{stars}")
Characteristic HR1 SE1 95% CI1 p-value2
Gender



    male
    female 0.60 0.197 0.41, 0.89 0.011*
ECOG performance score as rated by the physician



    asymptomatic
    symptomatic but completely ambulatory 1.38 0.233 0.87, 2.17 0.2
    at least partially bedridden 2.52 0.257 1.52, 4.17 <0.001***
1 HR = Hazard Ratio, SE = Standard Error, CI = Confidence Interval
2 *p<0.05; **p<0.01; ***p<0.001

Now we proceed with the interpretation of the results of the analysis: The relevant parameter of the Cox regression model is the hazard ratio (HR). It represents the ratio of hazards between two cohorts at any time point. A HR < 1 indicates reduced hazard of the event, a HR > 1 indicates an increased hazard of the event.

Let us first look at the predictor gender: Reference are patients of male gender, marked by dashes in the table. The hazard ratio of 0.60 means that the risk of an event for women is 60% compared to men, in other words, it is reduced by 40%.

Next, we consider the predictor ECOG performance score as rated by the physician: The reference in this predictor is asymptomatic patients, marked by dashes. The hazard ratio for symptomatic but completely ambulatory is not significant and should therefore not be interpreted. The hazard ratio of 2.52 means that the risk of an event for at least partially bedridden patients is increased by 152% compared to asymptomatic patients.

Gnu R toolboxes that we used

For the statistical analysis of the data and the visualisation of results and intermediate results, we used a number of toolboxes. Below is a list of these toolboxes with a short description and links to more detailed information:

Information about the hard- and software configuration of the computer on which this post was authored:

1pander(sessionInfo())

R version 4.4.0 (2024-04-24 ucrt)

Platform: x86_64-w64-mingw32/x64

locale: LC_COLLATE=German_Austria.utf8, LC_CTYPE=German_Austria.utf8, LC_MONETARY=German_Austria.utf8, LC_NUMERIC=C and LC_TIME=German_Austria.utf8

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: survminer(v.0.4.9), ggpubr(v.0.6.0), survival(v.3.5-8), gtsummary(v.1.7.2), pander(v.0.6.5), kableExtra(v.1.4.0), RColorBrewer(v.1.1-3), rstatix(v.0.7.2), haven(v.2.5.4), lubridate(v.1.9.3), forcats(v.1.0.0), stringr(v.1.5.1), dplyr(v.1.1.4), purrr(v.1.0.2), readr(v.2.1.5), tidyr(v.1.3.1), tibble(v.3.2.1), ggplot2(v.3.5.1) and tidyverse(v.2.0.0)

loaded via a namespace (and not attached): tidyselect(v.1.2.1), viridisLite(v.0.4.2), farver(v.2.1.2), fastmap(v.1.2.0), blogdown(v.1.19), broom.helpers(v.1.15.0), labelled(v.2.13.0), digest(v.0.6.35), timechange(v.0.3.0), lifecycle(v.1.0.4), magrittr(v.2.0.3), compiler(v.4.4.0), rlang(v.1.1.3), sass(v.0.4.9), tools(v.4.4.0), utf8(v.1.2.4), yaml(v.2.3.8), gt(v.0.10.1), data.table(v.1.15.4), knitr(v.1.46), ggsignif(v.0.6.4), labeling(v.0.4.3), xml2(v.1.3.6), abind(v.1.4-5), withr(v.3.0.0), grid(v.4.4.0), fansi(v.1.0.6), xtable(v.1.8-4), colorspace(v.2.1-0), scales(v.1.3.0), cli(v.3.6.2), rmarkdown(v.2.27), generics(v.0.1.3), rstudioapi(v.0.16.0), km.ci(v.0.5-6), tzdb(v.0.4.0), commonmark(v.1.9.1), cachem(v.1.1.0), splines(v.4.4.0), survMisc(v.0.5.6), vctrs(v.0.6.5), Matrix(v.1.7-0), jsonlite(v.1.8.8), carData(v.3.0-5), bookdown(v.0.39), car(v.3.1-2), hms(v.1.1.3), systemfonts(v.1.1.0), jquerylib(v.0.1.4), glue(v.1.7.0), ggtext(v.0.1.2), stringi(v.1.8.4), gtable(v.0.3.5), munsell(v.0.5.1), pillar(v.1.9.0), htmltools(v.0.5.8.1), R6(v.2.5.1), KMsurv(v.0.1-5), evaluate(v.0.23), lattice(v.0.22-6), highr(v.0.10), markdown(v.1.12), backports(v.1.4.1), gridtext(v.0.1.5), broom(v.1.0.6), bslib(v.0.7.0), Rcpp(v.1.0.12), svglite(v.2.1.3), gridExtra(v.2.3), xfun(v.0.44), zoo(v.1.8-12) and pkgconfig(v.2.0.3)

Theoretical foundations of the survival analysis

Kaplan–Meier estimator

The Kaplan-Meier estimator is used to estimate the probability that a given event will not occur for a patient within a time interval. It is a non-parametric estimate of the survival function in the context of event-time analysis (Kaplan & Meier, 1958).

The Kaplan-Meier estimator \(\hat S(t)\) for the survival function \(S(t)\) is defined by: $$ \hat S(t) = \prod_{t_{i}\leq t} \frac{n_i-d_i}{n_i} = \prod_{t_{i}\leq t}\left( 1 - \frac{d_i}{n_i}\right) $$ with \(\hat{S}(0)=1\), the subjects \(d_{i}\) for which the event occurred at time \(t_{i}\) and the subjects \(n_{i}\) at risk at time \(t_{i}\).

The variance of the Kaplan-Meier estimator in the interval \(t_{k} \le t \le t_{k+1}\) can be determined by: $$\widehat{\operatorname{Var}} \{ \hat S (t) \} \approx [\hat S(t)]^2 \left\{ \sum_{i=1}^k \frac{d_i}{n_i(n_i-d_i)} \right\}\,.$$ Thus, the standard error of the Kaplan-Meier estimator is given by: $$\widehat{\operatorname{SE}} \{ \hat S (t) \} \approx [\hat S(t)] \left\{ \sum_{i=1}^k \frac{d_i}{n_i(n_i-d_i)} \right\}^{\frac{1}{2}}$$ and the \(95\%\) confidence interval is $$\left[ \hat S (t) - 1{,}96 \cdot \widehat{\operatorname{SE}}\{ \hat S (t) \};\hat S (t) + 1{,}96 \cdot \widehat{\operatorname{SE}}\{ \hat S (t) \}\right]\,.$$ A visualisation of an estimated survival function including the 95% confidence interval can be seen in Figure 1.

Hazard function

The hazard function \(h(t)\) is the instantaneous probability of an event occurring at time \(t\) (per unit of time), assuming that no event has happened up to time \(t\): $$h(t) = \lim_{\Delta t \to 0} \frac{S(t) - S(t + \Delta t)}{\Delta t} \cdot \frac{1}{S(t)}\,,$$ see also (Rosner, 2016, sec. 14.8).

Log-rank test

The log-rank test is a non-parametric test with the null hypothesis that there are no differences in the survival times of the groups investigated. All time points of the survival curve are included in the test. Only the influence of one predictor can be assessed.

The hazard ratio \(\exp(\beta)\) is defined by $$\frac{h_{1}(t)}{h_{2}(t)} = \exp(\beta)\,,$$ with the hazards \(h_{1}(t)\) and \(h_{2}(t)\) at time \(t\) for the exposed and unexposed group respectively. For the log-rank test we draw up the hypotheses \(H_{0}: \beta = 0 \;\text{vs.}\; H_{1}: \beta \ne 0\). The entire observation time \(T\) is divided into \(k\) intervals. The following is the contingency table for the i-th interval

Event - Event + Total
Exposure + \(a_i\) \(b_i\) \(n_{i1}\)
Exposure - \(c_i\) \(d_i\) \(n_{i2}\)
Total \(a_i + c_i\) \(b_i + d_i\) \(n_{i}\)

with \(a_i\) exposed subjects whit an event during the i-th time interval and \(b_i\) exposed subjects whitout an event during the i-th time interval. Analogously, \(c_i\) and \(d_i\) are defined for unexposed persons.

Finally, the Mantel-Haenszel test is applied to this collection of \(k\) contingency tables, see also (Rosner, 2016, sec. 14.10) and (Petrie & Sabin, 2005, sec. 44).

Cox proportional hazards regression

Using a proportional-hazard model, the hazard \(h(t)\) is modelled as $$h(t) = h_0(t)\exp(\beta_1 x_1 + \ldots + \beta_n x_n)\,.$$ The coefficients \(\beta_i\) explain the impact of the predictors (covariates) \(x_i\). The baseline hazard \(h_0(t)\) at time \(t\) corresponds to the value of the hazard if all the \(x_i = 0\). The proportional-hazard model can be rewriten as a multiple linear regression of the logarithm of the hazard on the variables \(x_i\). A value of \(\beta_i > 0\) or a hazard ratio \(\exp(b_i) > 1\) means that as the value of the i-th predictor increases, the hazard of an event increases and thus the survival time decreases, see also (Rosner, 2016, sec. 14.11) and (Petrie & Sabin, 2005, sec. 44).

Further reading

Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282), 457–481. https://doi.org/10.1080/01621459.1958.10501452

Loprinzi, C. L., Laurie, J. A., Wieand, H. S., Krook, J. E., Novotny, P. J., Kugler, J. W., Bartel, J., Law, M., Bateman, M., & Klatt, N. E. (1994). Prospective evaluation of prognostic variables from patient-completed questionnaires. North central cancer treatment group. Journal of Clinical Oncology, 12(3), 601–607. https://doi.org/10.1200/JCO.1994.12.3.601

Petrie, A., & Sabin, C. (2005). Medical statistics at a glance (2nd ed.). Blackwell Publishing.

Rosner, B. (2016). Fundamentals of biostatistics (8th ed.). Cengage Learning.