# Survival analysis

## Overview

Survival analysis involves statistical methods that compare time to a specific event between groups to estimate the effect of prognostic factors, such as medical treatment or harmful influences. The event can be the death of the patient. However, they can also be any other endpoints, such as healing or the occurrence of a complication. In this script we present three common methods of survival time analysis, the Kaplan Meier survival time curve, the log rank test, and the Cox regression.

# Survival analysis — Principles and an illustrative data analysis using Gnu R

## Objective and scope of application

Survival analysis is a set of statistical methods that compares the time to a specific event between groups to estimate the effect of prognostic factors, medical treatment or harmful influences. The event can be the death of the patient, but other endpoints such as recovery, disease or the occurrence of a complication can also be considered. Examples of such statistical analysis methods are the Kaplan-Meier estimator, the log rank test or the Cox regression.

## Motivating example

In our exemplary survival analysis, we use a dataset of the North Central Cancer Treatment Group (NCCTG). This dataset contains information about the survival of patients with advanced lung cancer, plus some additional predictors, such as age, gender, and assessments of the patient performance status. It was originally presented by (Loprinzi et al., 1994). We want to compare the survival probability between female and male patients, and we want to identify predictors that influence the survival probability.

## Analysis script in Gnu R

### Used Gnu R toolboxes

For data import, analysis and visualization of the results we use already existing toolboxes for the Gnu R system. The toolboxes in use are described further down in the post. They are integrated by the following code fragment:

```
1library(tidyverse) # use tidy data
2library(haven) # import SPSS file
3library(rstatix) # pipe friendly statitics
4library(RColorBrewer) # color maps
5library(kableExtra) # table output
6library(ggplot2) # high quality plots
7library(pander) # rendering R objects into Pandoc markdown
8library(gtsummary) # publication-ready analytical and summary tables
9library(survival) # survival analysis routines
10library(survminer) # drawing survival curves and tables
```

### Import of the data

First, the data to be analysed is imported into the Gnu R analysis environment. For the exemplary analyses, we use a dataset from the NCCTG that contains information on lung cancer patients. The following data are included in the dataset:

Variable | Description |
---|---|

`inst` |
Institution code |

`time` |
Survival time in days |

`status` |
Censoring status censored = 1 dead = 2 |

`age` |
Age in years |

`sex` |
Gender male = 1 female = 2 |

`ph.ecog` |
ECOG performance score as rated by the physician. 0=asymptomatic, 1= symptomatic but completely ambulatory, 2= in bed <50% of the day, 3= in bed > 50% of the day but not bedbound, 4 = bedbound |

`ph.karno` |
Karnofsky performance score (bad=0-good=100) rated by physician |

`pat.karno` |
Karnofsky performance score as rated by patient |

`meal.cal` |
Calories consumed at meals |

`wt.loss` |
Weight loss in last six months |

The data are available in SAV format. We use the function `read_sav`

of the toolbox `haven`

for the data import.

```
1# Import a dataset that is in SAV format
2dataIn <- read_sav("NCCTGLungCancerData.sav")
```

### Data wrangling, data selection

Just for information, we determine the size of the imported data.

```
1# number of data items and variables
2dim(dataIn)
3## [1] 228 10
```

The imported data set includes 10 variables and 228 data items.

In the following step, we

- select the variables to be analysed,
- remove data entries with missing values and
- merge the categories that include bedriddenness into one category.

We store the result of these data pre-processing and selection steps in the tibble `dataAnalysis`

.

```
1# Extract data to be used for analysis
2dataAnalysis <- dataIn %>%
3 # include all variables except inst (minus sign)
4 select(-inst) %>%
5 # Remove records with NA values (not available, or no answer)
6 na.omit() %>%
7 # change some variables to factors
8 mutate(sex = as_factor(sex)) %>%
9 mutate(ph.ecog = as_factor(ph.ecog)) %>%
10 mutate(status = labelled::remove_labels(status)) %>%
11 # merge the categories that include bedriddenness into one category.
12 mutate(ph.ecog = recode(ph.ecog, 'in bed <50% of the day'='at least partially bedridden',
13 'in bed > 50% of the day but not bedbound'='at least partially bedridden', 'bedbound'='at least partially bedridden'))
```

### Short data summary

In the table below we give a summary of the data that we include in the further steps of the analysis. The meaning of the censoring status in the table below is:

- 1 – censored
- 2 – dead

```
1# print summary table
2tbl_summary(dataAnalysis)
```

Characteristic |
N = 168^{1} |
---|---|

Survival time in days | 269 (175, 416) |

status | |

1 | 47 (28%) |

2 | 121 (72%) |

Age in years | 64 (57, 70) |

Gender | |

male | 104 (62%) |

female | 64 (38%) |

ECOG performance score as rated by the physician | |

asymptomatic | 47 (28%) |

symptomatic but completely ambulatory | 81 (48%) |

at least partially bedridden | 40 (24%) |

Karnofsky performance score rated by physician | |

50 | 4 (2.4%) |

60 | 16 (9.5%) |

70 | 25 (15%) |

80 | 47 (28%) |

90 | 50 (30%) |

100 | 26 (15%) |

Karnofsky performance score as rated by patient | |

30 | 2 (1.2%) |

40 | 2 (1.2%) |

50 | 3 (1.8%) |

60 | 23 (14%) |

70 | 30 (18%) |

80 | 38 (23%) |

90 | 44 (26%) |

100 | 26 (15%) |

Calories consumed at meals | 975 (622, 1,156) |

Weight loss in last six months | 7 (0, 15) |

^{1} Median (IQR); n (%) |

### The censoring status

A study usually has a clearly defined start and end date. If the event to be investigated occurs during this period (in our example this is the death of the patient), then the survival time and the censoring status *dead* (encoded as 2) are stored. However, it may happen that the event under investigation does not occur until the end of the study or that the patient decides to leave the study. In this case, the censoring status *censored* (encoded as 1) is stored and the survival time is the period from the patient’s inclusion in the study until the end of the study or until the patient leaves the study.

### Analysis by means of Kaplan-Meier survival plot

For the survival analysis by means of a Kaplan-Meier survival plot we use functions which are provided by the packages `survival`

and `survminer`

. The Kaplan-Meier method is a non-parametric approach that results in a step function. At each time an event occurs, there is a step down. We fit a survival curve to the data using the two functions `Surv()`

and `survfit()`

from the package survival. The form of the `survfit()`

command is `survfit(formula, data=dataTibble)`

. The meaning of the two parameters **formula** and **data** is:

**formula**is a symbolic description of the model to be fitted to the data within the analysis. The variable on the left-hand side of a tilde (~) is called the outcome variable (or dependent variable), while the variables on the right-hand side of a tilde are called the predictors (or independent variables) and are joined by plus signs +. The outcome variable is generated using the`Surv()`

function.**data**is a tibble containing the variables in the model.

As a result of applying the ´Surv()´ function we get a vector with time values (in days), a single time value for each patient. Values with a plus sign attached describe the time elapsed until censoring, without a plus sign until the event (dead) occurs. For the purposes of illustration, we apply the `Surv()`

function separately to the two variables time and status of the `dataAnalysis`

data tibble.

```
1# create a survival object, which encodes
2# the survival time and the censoring status
3Surv(dataAnalysis$time, event=dataAnalysis$status)
4## [1] 455 210 1022+ 310 361 218 166 170 567 613 707 61
5## [13] 301 81 371 520 574 118 390 12 473 26 107 53
6## [25] 814 965+ 93 731 460 153 433 583 95 303 519 643
7## [37] 765 53 246 689 5 687 345 444 223 60 163 65
8## [49] 821+ 428 230 840+ 305 11 226 426 705 363 176 791
9## [61] 95 196+ 167 806+ 284 641 147 740+ 163 655 88 245
10## [73] 30 477 559+ 450 156 529+ 429 351 15 181 283 13
11## [85] 212 524 288 363 199 550 54 558 207 92 60 551+
12## [97] 293 353 267 511+ 457 337 201 404+ 222 62 458+ 353
13## [109] 163 31 229 156 329 291 179 376+ 384+ 268 292+ 142
14## [121] 413+ 266+ 320 181 285 301+ 348 197 382+ 303+ 296+ 180
15## [133] 145 269+ 300+ 284+ 292+ 332+ 285 259+ 110 286 270 225+
16## [145] 269 225+ 243+ 276+ 135 79 59 240+ 202+ 235+ 239 252+
17## [157] 221+ 185+ 222+ 183 211+ 175+ 197+ 203+ 191+ 105+ 174+ 177+
```

In the next step, we estimate the survival curves for female and male patients using the Kaplan-Meier method. We use the formula `Surv(time, event=status) ~ sex`

.

```
1# estimate a survival curve using Kaplan-Meier method
2kmGender = survfit(Surv(time, event=status) ~ sex, data=dataAnalysis)
```

We use the `ggsurvplot()`

function to visualise the result of the Kaplan-Meier analysis. In the upper part, the survival curves for both cohorts (female, male) are shown, the median survival times are marked. The middle part of the figure shows, for selected time points, the percentages of patients still at risk of an event. In the lower part of the figure, the censorship events are visualised.

```
1# visualize the result of the Kaplan-Meier analysis
2ggsurvplot(kmGender,
3 # adjust the label
4 xlab="Survival time in days",
5 ylab="Survival probability",
6 break.time.by = 150,
7 # use percentage at risk
8 risk.table=TRUE,
9 risk.table.pos = "out",
10 risk.table.col = "strata",# Risk table color by groups
11 risk.table.y.text = FALSE,
12 risk.table.height=.18,
13 # include confidencial intervals
14 conf.int=TRUE,
15 # draw horizontal AND vertical line for median
16 surv.median.line="hv",
17 # cumulative events table
18 cumevents = TRUE,
19 cumevents.col = "strata",# Risk table color by groups
20 cumevents.y.text = FALSE,
21 cumevents.height=.18,
22 # cumulative number of censoring
23 cumcensor = TRUE,
24 cumcensor.col = "strata",# Risk table color by groups
25 cumcensor.y.text = FALSE,
26 cumcensor.height=.18,
27 tables.theme = theme_cleantable(),
28 # Change ggplot2 theme
29 ggtheme = theme_bw(),
30 # set the color map
31 palette = "Set1"
32)
33## Warning in geom_segment(aes(x = 0, y = max(y2), xend = max(x1), yend = max(y2)), : All aesthetics have length 1, but the data has 2 rows.
34## ℹ Please consider using `annotate()` or provide this layer with data containing
35## a single row.
36## All aesthetics have length 1, but the data has 2 rows.
37## ℹ Please consider using `annotate()` or provide this layer with data containing
38## a single row.
39## All aesthetics have length 1, but the data has 2 rows.
40## ℹ Please consider using `annotate()` or provide this layer with data containing
41## a single row.
42## All aesthetics have length 1, but the data has 2 rows.
43## ℹ Please consider using `annotate()` or provide this layer with data containing
44## a single row.
```

For quantitative analysis, parameters of the Kaplan-Meier survival curve can be printed in tabular form (function `tbl_survfit()`

), below the survival probabilities after 30, 90, 180, 360 and 720 days

```
1tbl_survfit(kmGender,
2 times = c(30, 90, 180, 360, 720),
3 statistic = "{estimate}",
4 label_header = "**{time} days**")
```

Characteristic |
30 days |
90 days |
180 days |
360 days |
720 days |
---|---|---|---|---|---|

Gender | |||||

male | 94% | 87% | 69% | 36% | 7.4% |

female | 98% | 91% | 83% | 56% | 20% |

and the median survival time.

```
1tbl_survfit(kmGender, probs=c(0.5),
2 label_header = "**Median survival time (days)**")
```

Characteristic |
Median survival time (days) |
---|---|

Gender | |

male | 284 (229, 353) |

female | 426 (345, 641) |

### log-rank test

We would like to analyse whether the survival times of the two cohorts (female, male) differ significantly. The log-rank test is a common approach for this. In this test, the observations are equally weighted over the entire follow-up time. The log-rank test can be performed using the ´survdiff()` function of the package survival.

```
1survdiff(Surv(time, event=status) ~ sex, data=dataAnalysis, rho = 0)
2## Call:
3## survdiff(formula = Surv(time, event = status) ~ sex, data = dataAnalysis,
4## rho = 0)
5##
6## N Observed Expected (O-E)^2/E (O-E)^2/V
7## sex=male 104 83 69.5 2.60 6.16
8## sex=female 64 38 51.5 3.52 6.16
9##
10## Chisq= 6.2 on 1 degrees of freedom, p= 0.01
```

The test shows that there is a significant difference (p = 0.01) between the two cohorts (female, male) in terms of survival times.

### Cox Proportional Hazards Survival Regression

Finally, we would like to identify further predictors that have an influence on survival time and quantify the magnitude of their influence. For this type of analysis, the Cox Proportional Hazards Survival Regression (also referred to as Cox Regression) can be used. We apply the Cox proportional hazards model to the data using the two functions `Surv()`

and `coxph()`

from the package survival. The form of the `coxph()`

command is `coxph(formula, data=dataTibble)`

. The meaning of the two parameters **formula** and **data** is described earlier in this script.

We investigate the influence of the predictors gender, age, ECOG performance score, calories consumed at meals and weight loss in the last six months.

```
1coxph(Surv(time, status) ~ sex + age + ph.ecog + meal.cal + wt.loss,
2 data = dataAnalysis) %>%
3 tbl_regression(exp = TRUE) %>%
4 add_significance_stars(hide_ci = FALSE, hide_p = FALSE,
5 pattern = "{p.value}{stars}")
```

Characteristic |
HR^{1} |
SE^{1} |
95% CI^{1} |
p-value^{2} |
---|---|---|---|---|

Gender | ||||

male | — | — | — | |

female | 0.58 | 0.201 | 0.39, 0.86 | 0.007** |

Age in years | 1.01 | 0.011 | 0.98, 1.03 | 0.6 |

ECOG performance score as rated by the physician | ||||

asymptomatic | — | — | — | |

symptomatic but completely ambulatory | 1.46 | 0.236 | 0.92, 2.31 | 0.11 |

at least partially bedridden | 2.91 | 0.296 | 1.63, 5.19 | <0.001*** |

Calories consumed at meals | 1.00 | 0.000 | 1.00, 1.00 | >0.9 |

Weight loss in last six months | 0.99 | 0.008 | 0.97, 1.00 | 0.12 |

^{1} HR = Hazard Ratio, SE = Standard Error, CI = Confidence Interval |
||||

^{2} *p<0.05; **p<0.01; ***p<0.001 |

If we look at the p-values/star codes, only the two predictors gender and ECOG performance score influence survival time.

We remove the non-significant predictors from the model formula and repeat the analysis.

```
1coxReg <- coxph(Surv(time, status) ~ sex + ph.ecog,
2 data = dataAnalysis)
3
4coxReg %>%
5 tbl_regression(exp = TRUE) %>%
6 add_significance_stars(hide_ci = FALSE, hide_p = FALSE,
7 pattern = "{p.value}{stars}")
```

Characteristic |
HR^{1} |
SE^{1} |
95% CI^{1} |
p-value^{2} |
---|---|---|---|---|

Gender | ||||

male | — | — | — | |

female | 0.60 | 0.197 | 0.41, 0.89 | 0.011* |

ECOG performance score as rated by the physician | ||||

asymptomatic | — | — | — | |

symptomatic but completely ambulatory | 1.38 | 0.233 | 0.87, 2.17 | 0.2 |

at least partially bedridden | 2.52 | 0.257 | 1.52, 4.17 | <0.001*** |

^{1} HR = Hazard Ratio, SE = Standard Error, CI = Confidence Interval |
||||

^{2} *p<0.05; **p<0.01; ***p<0.001 |

Now we proceed with the interpretation of the results of the analysis: The relevant parameter of the Cox regression model is the hazard ratio (HR). It represents the ratio of hazards between two cohorts at any time point. A HR < 1 indicates reduced hazard of the event, a HR > 1 indicates an increased hazard of the event.

Let us first look at the predictor gender: Reference are patients of male gender, marked by dashes in the table. The hazard ratio of 0.60 means that the risk of an event for women is 60% compared to men, in other words, it is reduced by 40%.

Next, we consider the predictor ECOG performance score as rated by the physician: The reference in this predictor is *asymptomatic* patients, marked by dashes. The hazard ratio for *symptomatic but completely ambulatory* is not significant and should therefore not be interpreted. The hazard ratio of 2.52 means that the risk of an event for *at least partially bedridden* patients is increased by 152% compared to *asymptomatic* patients.

# Gnu R toolboxes that we used

For the statistical analysis of the data and the visualisation of results and intermediate results, we used a number of toolboxes. Below is a list of these toolboxes with a short description and links to more detailed information:

`tidyverse`

— This toolbox includes a number of Gnu R software packages that use**tidy data**data structures and in which the design philosophy and grammar of**tidy data**are implemented, e.g. pipes. Information on**tidy data**: H. Wickham,*Tidy data*, The Journal of Statistical Software, Vol. 59, 2014, Information on the toolbox: https://www.tidyverse.org/.`haven`

— The toolbox`haven`

enables Gnu R to read and write various data formats used by other statistical software systems. The following formats are supported: SAS, SPSS and Stata. Information on the toolbox: https://haven.tidyverse.org/.`rstatix`

— The`rstatix`

toolbox offers a simple, intuitive and coherent with the “tidyverse” design philosophy framework for performing basic statistical tests, e.g. t-test, Wilcoxon test, ANOVA, Kruskal-Wallis and correlation analyses. Information on the toolbox: https://rpkgs.datanovia.com/rstatix/.`RColorBrewer`

— Provides colour schemes designed by Cynthia Brewer, see http://colorbrewer2.org. Information on the toolbox: https://cran.r-project.org/web/packages/RColorBrewer/index.html.`kableExtra`

— This toolbox offers simple methods to create tables from**tidy data**data structures. Information about the toolbox: https://github.com/haozhu233/kableExtra.`ggplot2`

— In the`ggplot2`

toolbox a system for the declarative creation of graphics is realised, which is based on ‘The Grammar of Graphics’. Information on the toolbox: https://ggplot2.tidyverse.org/.`gtsummary`

— The package`gtsummary`

enables the creation of publication-ready analytical and summary tables using the R programming language. Information on the toolbox, website: https://www.danieldsjoberg.com/gtsummary/, cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/gtsummary.pdf`survival`

— This toolbox contains the most important routines for survival analysis, including Kaplan-Meier curves, log-rank tests and Cox models. Information on the toolbox: https://cran.r-project.org/web/packages/survival/.`survminer`

— The`survminer`

toolbox provides functions to support the survival analysis and the visualisation of results. Information on the toolbox: https://rpkgs.datanovia.com/survminer/.

Information about the hard- and software configuration of the computer on which this post was authored:

```
1pander(sessionInfo())
```

**R version 4.4.0 (2024-04-24 ucrt)**

**Platform:** x86_64-w64-mingw32/x64

**locale:**
*LC_COLLATE=German_Austria.utf8*, *LC_CTYPE=German_Austria.utf8*, *LC_MONETARY=German_Austria.utf8*, *LC_NUMERIC=C* and *LC_TIME=German_Austria.utf8*

**attached base packages:**
*stats*, *graphics*, *grDevices*, *utils*, *datasets*, *methods* and *base*

**other attached packages:**
*survminer(v.0.4.9)*, *ggpubr(v.0.6.0)*, *survival(v.3.5-8)*, *gtsummary(v.1.7.2)*, *pander(v.0.6.5)*, *kableExtra(v.1.4.0)*, *RColorBrewer(v.1.1-3)*, *rstatix(v.0.7.2)*, *haven(v.2.5.4)*, *lubridate(v.1.9.3)*, *forcats(v.1.0.0)*, *stringr(v.1.5.1)*, *dplyr(v.1.1.4)*, *purrr(v.1.0.2)*, *readr(v.2.1.5)*, *tidyr(v.1.3.1)*, *tibble(v.3.2.1)*, *ggplot2(v.3.5.1)* and *tidyverse(v.2.0.0)*

**loaded via a namespace (and not attached):**
*tidyselect(v.1.2.1)*, *viridisLite(v.0.4.2)*, *farver(v.2.1.2)*, *fastmap(v.1.2.0)*, *blogdown(v.1.19)*, *broom.helpers(v.1.15.0)*, *labelled(v.2.13.0)*, *digest(v.0.6.35)*, *timechange(v.0.3.0)*, *lifecycle(v.1.0.4)*, *magrittr(v.2.0.3)*, *compiler(v.4.4.0)*, *rlang(v.1.1.3)*, *sass(v.0.4.9)*, *tools(v.4.4.0)*, *utf8(v.1.2.4)*, *yaml(v.2.3.8)*, *gt(v.0.10.1)*, *data.table(v.1.15.4)*, *knitr(v.1.46)*, *ggsignif(v.0.6.4)*, *labeling(v.0.4.3)*, *xml2(v.1.3.6)*, *abind(v.1.4-5)*, *withr(v.3.0.0)*, *grid(v.4.4.0)*, *fansi(v.1.0.6)*, *xtable(v.1.8-4)*, *colorspace(v.2.1-0)*, *scales(v.1.3.0)*, *cli(v.3.6.2)*, *rmarkdown(v.2.27)*, *generics(v.0.1.3)*, *rstudioapi(v.0.16.0)*, *km.ci(v.0.5-6)*, *tzdb(v.0.4.0)*, *commonmark(v.1.9.1)*, *cachem(v.1.1.0)*, *splines(v.4.4.0)*, *survMisc(v.0.5.6)*, *vctrs(v.0.6.5)*, *Matrix(v.1.7-0)*, *jsonlite(v.1.8.8)*, *carData(v.3.0-5)*, *bookdown(v.0.39)*, *car(v.3.1-2)*, *hms(v.1.1.3)*, *systemfonts(v.1.1.0)*, *jquerylib(v.0.1.4)*, *glue(v.1.7.0)*, *ggtext(v.0.1.2)*, *stringi(v.1.8.4)*, *gtable(v.0.3.5)*, *munsell(v.0.5.1)*, *pillar(v.1.9.0)*, *htmltools(v.0.5.8.1)*, *R6(v.2.5.1)*, *KMsurv(v.0.1-5)*, *evaluate(v.0.23)*, *lattice(v.0.22-6)*, *highr(v.0.10)*, *markdown(v.1.12)*, *backports(v.1.4.1)*, *gridtext(v.0.1.5)*, *broom(v.1.0.6)*, *bslib(v.0.7.0)*, *Rcpp(v.1.0.12)*, *svglite(v.2.1.3)*, *gridExtra(v.2.3)*, *xfun(v.0.44)*, *zoo(v.1.8-12)* and *pkgconfig(v.2.0.3)*

## Theoretical foundations of the survival analysis

### Kaplan–Meier estimator

The Kaplan-Meier estimator is used to estimate the probability that a given event will not occur for a patient within a time interval. It is a non-parametric estimate of the survival function in the context of event-time analysis (Kaplan & Meier, 1958).

The Kaplan-Meier estimator `\(\hat S(t)\)`

for the survival function `\(S(t)\)`

is defined by:
$$ \hat S(t) = \prod_{t_{i}\leq t} \frac{n_i-d_i}{n_i} = \prod_{t_{i}\leq t}\left( 1 - \frac{d_i}{n_i}\right) $$
with `\(\hat{S}(0)=1\)`

, the subjects `\(d_{i}\)`

for which the event occurred at time `\(t_{i}\)`

and the subjects `\(n_{i}\)`

at risk at time `\(t_{i}\)`

.

The variance of the Kaplan-Meier estimator in the interval `\(t_{k} \le t \le t_{k+1}\)`

can be determined by:
`$$\widehat{\operatorname{Var}} \{ \hat S (t) \} \approx [\hat S(t)]^2 \left\{ \sum_{i=1}^k \frac{d_i}{n_i(n_i-d_i)} \right\}\,.$$`

Thus, the standard error of the Kaplan-Meier estimator is given by:
`$$\widehat{\operatorname{SE}} \{ \hat S (t) \} \approx [\hat S(t)] \left\{ \sum_{i=1}^k \frac{d_i}{n_i(n_i-d_i)} \right\}^{\frac{1}{2}}$$`

and the `\(95\%\)`

confidence interval is
`$$\left[ \hat S (t) - 1{,}96 \cdot \widehat{\operatorname{SE}}\{ \hat S (t) \};\hat S (t) + 1{,}96 \cdot \widehat{\operatorname{SE}}\{ \hat S (t) \}\right]\,.$$`

A visualisation of an estimated survival function including the 95% confidence interval can be seen in Figure 1.

### Hazard function

The hazard function `\(h(t)\)`

is the instantaneous probability of an event occurring at time `\(t\)`

(per unit of time), assuming that no event has happened up to time `\(t\)`

:
`$$h(t) = \lim_{\Delta t \to 0} \frac{S(t) - S(t + \Delta t)}{\Delta t} \cdot \frac{1}{S(t)}\,,$$`

see also (Rosner, 2016, sec. 14.8).

### Log-rank test

The log-rank test is a non-parametric test with the null hypothesis that there are no differences in the survival times of the groups investigated. All time points of the survival curve are included in the test. Only the influence of one predictor can be assessed.

The hazard ratio `\(\exp(\beta)\)`

is defined by `$$\frac{h_{1}(t)}{h_{2}(t)} = \exp(\beta)\,,$$`

with the hazards `\(h_{1}(t)\)`

and `\(h_{2}(t)\)`

at time `\(t\)`

for the exposed and unexposed group respectively. For the log-rank test we draw up the hypotheses `\(H_{0}: \beta = 0 \;\text{vs.}\; H_{1}: \beta \ne 0\)`

. The entire observation time `\(T\)`

is divided into `\(k\)`

intervals. The following is the contingency table for the i-th interval

Event - | Event + | Total | |
---|---|---|---|

Exposure + |
`\(a_i\)` |
`\(b_i\)` |
`\(n_{i1}\)` |

Exposure - |
`\(c_i\)` |
`\(d_i\)` |
`\(n_{i2}\)` |

Total |
`\(a_i + c_i\)` |
`\(b_i + d_i\)` |
`\(n_{i}\)` |

with `\(a_i\)`

exposed subjects whit an event during the i-th
time interval and `\(b_i\)`

exposed subjects whitout an event during the i-th
time interval. Analogously, `\(c_i\)`

and `\(d_i\)`

are defined for unexposed persons.

Finally, the Mantel-Haenszel test is applied to this collection of `\(k\)`

contingency tables, see also (Rosner, 2016, sec. 14.10) and (Petrie & Sabin, 2005, sec. 44).

### Cox proportional hazards regression

Using a proportional-hazard model, the hazard `\(h(t)\)`

is modelled as `$$h(t) = h_0(t)\exp(\beta_1 x_1 + \ldots + \beta_n x_n)\,.$$`

The coefficients `\(\beta_i\)`

explain the impact of the predictors (covariates) `\(x_i\)`

. The baseline hazard `\(h_0(t)\)`

at time `\(t\)`

corresponds to the value of the hazard if all the `\(x_i = 0\)`

. The proportional-hazard model can be rewriten as a multiple linear regression of the logarithm of the hazard on the variables `\(x_i\)`

. A value of `\(\beta_i > 0\)`

or a hazard ratio `\(\exp(b_i) > 1\)`

means that as the value of the i-th predictor increases, the hazard of an event increases and thus the survival time decreases, see also (Rosner, 2016, sec. 14.11) and (Petrie & Sabin, 2005, sec. 44).

## Further reading

Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations. *Journal of the American Statistical Association*, *53*(282), 457–481. https://doi.org/10.1080/01621459.1958.10501452

Loprinzi, C. L., Laurie, J. A., Wieand, H. S., Krook, J. E., Novotny, P. J., Kugler, J. W., Bartel, J., Law, M., Bateman, M., & Klatt, N. E. (1994). Prospective evaluation of prognostic variables from patient-completed questionnaires. North central cancer treatment group. *Journal of Clinical Oncology*, *12*(3), 601–607. https://doi.org/10.1200/JCO.1994.12.3.601

Petrie, A., & Sabin, C. (2005). *Medical statistics at a glance* (2nd ed.). Blackwell Publishing.

Rosner, B. (2016). *Fundamentals of biostatistics* (8th ed.). Cengage Learning.