Binary Logistic Regression
Overview
Binary logistic regression is used to model the relationship between a binary dependent variable (e.g. therapy successful? Yes/No) and a set of explanatory variables. As a result of the regression analysis, we identify explanatory variables that influence the binary dependent outcome variable. The binary logistic regression equation can also be used to predict the outcome of the binary dependent variable as a function of values of the explanatory variables.
Binary logistic regression – Principles and an illustrative data analysis using Gnu R
Objective and scope of application
Binary logistic regression belongs to the class of generalised linear models. It is used to model the relationship between a binary dependent variable (e.g. therapy successful? Yes/No) and a set of explanatory variables. As a result of the regression analysis, we identify explanatory variables that influence the binary dependent outcome variable. The binary logistic regression equation can also be used to predict the outcome of the binary dependent variable as a function of values of the explanatory variables.
Motivating example
Microaneurysms are an early visible manifestation of diabetic retinopathy. They are characterised by small red dots on the retina.
Healthy retina | Retina with microaneurysms |
---|---|
In this blog post we use logistic regression to identify predictors for the occurrence of retinal microaneurysms. The predictors we include in the analysis are age, gender, systolic blood pressure, oral glucose tolerance test (OGTT), total cholesterol and body mass index (BMI) of the patients. We only include data from people over 18 years of age in the analysis. The dataset we use for the example analysis in this blog post was pre-compiled from data of the survey “National Health and Nutrition Examination Survey (NHANES), 2007-2008” (United States Department of Health and Human Services. Centers for Disease Control and Prevention. National Center for Health Statistics, 2012).
Analysis script in Gnu R
Gnu R toolboxes used
For data import, analysis and visualisation of the results we use existing toolboxes for the Gnu R system. The toolboxes in use are described further down in the post. They are loaded by the following code fragment:
library(haven)       # import SPSS files
library(dplyr)       # functions for dataframe manipulation
library(MASS)        # stepAIC for model selection
library(gtsummary)   # publication-ready analytical and summary tables
library(ggplot2)     # high quality plots
library(ggstatsplot) # high quality statistics plots
library(pander)      # rendering R objects into Pandoc markdown
library(gt)          # output of tibbles as tables
library(gtExtras)    # extension of gt
library(car)         # vif for multicollinearity test
Import of the data
We use the function read_sav from the software package haven to import the data file.
# Import a dataset that is in SAV format
dataIn <- read_sav("retinalMicroaneurysm.sav")
The following variables are included in the dataset we use in the analysis:
BPXSY1
: Systolic blood pressure (first reading), mm Hg

RIAGENDR
: Gender of the sample person

RIDAGEYR
: Age in years of the sample person at time of screening

LBXGLT
: Two Hour Glucose (OGTT), mg/dL

OPDUMA
: Retinal microaneurysms, worse eye

LBDTCSI
: Total cholesterol, mmol/L

BMXBMI
: Body Mass Index, kg/m**2
We use the function as_factor to convert the variables RIAGENDR and OPDUMA into factors, preserving the variable and value label attributes.
# Change some variables to factors
dataAnalysis <- dataIn %>%
  mutate(RIAGENDR = as_factor(RIAGENDR)) %>%
  mutate(OPDUMA = as_factor(OPDUMA))
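To see what as_factor does, here is a minimal, self-contained sketch with an invented labelled vector; read_sav produces vectors of exactly this kind when importing SPSS data:

```r
library(haven)  # labelled() and as_factor()

# A labelled numeric vector, as read_sav() produces when importing SPSS data
x <- labelled(c(1, 2, 2, 1), labels = c(Male = 1, Female = 2))

as_factor(x)  # factor with levels "Male" and "Female"
```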
Short data summary
In the table below we give a summary of the data that we include in the further steps of the analysis.
# Print summary table (table 1)
dataAnalysis %>%
  tbl_summary()
Characteristic | N = 1,0461 |
---|---|
Gender | |
Male | 522 (50%) |
Female | 524 (50%) |
Age in years | 58 (48, 68) |
Systolic: Blood pres (1st rdg) mm Hg | 124 (114, 136) |
Two Hour Glucose(OGTT) (mg/dL) | 122 (99, 158) |
Retinal microaneurysms, worse eye | 55 (5.3%) |
Total Cholesterol( mmol/L) | 5.22 (4.63, 5.95) |
Body Mass Index (kg/m**2) | 27.8 (24.8, 31.4) |
1 n (%); Median (IQR) |
Analysis by means of binary logistic regression
For the analysis by means of logistic regression we use a generalised linear model, which is provided by the package stats as the command glm(). The general form of the glm() command is glm(formula, family = familytype(link = linkfunction), data = dataTibble). The meaning of the three parameters formula, family and data is:
- formula is a symbolic description of the model to be fitted to the data within the analysis. The variable on the left-hand side of a tilde (~) is called the outcome variable (or dependent variable), while the variables on the right-hand side of a tilde are called the predictors (or independent variables) and are joined by plus signs +. Using binary logistic analysis, the outcome variable must be dichotomous (binary).
- family is a description of the distribution of the outcome variable and of the link function to be used in the model. In our analysis problem we are dealing with a binary (also called dichotomous) outcome variable; we therefore use logistic regression (logit link).
- data is a tibble containing the data of the variables in the model.
The aim of the analysis is to identify a compact model that represents the collected data as well as possible. We use the Akaike information criterion (AIC) as a metric for the quality of the model fit (Akaike, 1974; Burnham & Anderson, 2002; Venables & Ripley, 2002). We use a combined forward and backward selection approach. As a first step, we set up a model that includes all available predictors and fit it to the data:
# define the model and fit it to the data
glmlogit <- glm(formula = OPDUMA ~ RIAGENDR + RIDAGEYR + BPXSY1 + LBXGLT +
                  LBDTCSI + BMXBMI,
                family = binomial(link = "logit"),
                data = dataAnalysis)
In the next step, we examine multicollinearity in the predictor variables. Multicollinearity is present when two or more of the predictors in a regression model are correlated. It can distort our analysis and thus limit the conclusions we can draw. We use the variance inflation factor (VIF) to measure the strength of the correlation between the predictor variables in the regression model (Fox & Weisberg, 2011; Hair Jr. et al., 2019). A heuristic for the evaluation of VIF values:
- A value of 1 indicates there is no correlation between a given and all other predictor variables.
- Values between 1 and 5 indicate a moderate correlation between a given and all other predictor variables, but this does not yet require attention.
- Values greater than 5 indicate a strong correlation between a given predictor variable and all other predictor variables; in this case the estimated coefficients and p-values are most likely unreliable.
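To make this heuristic concrete, the VIF of a predictor can be computed by hand as \(1/(1-R^2)\), where \(R^2\) stems from regressing that predictor on all the others. The following base-R sketch uses simulated data with invented variable names:

```r
# VIF by hand: 1 / (1 - R^2), with R^2 from regressing one predictor
# on all the others (simulated data, invented names)
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)                   # independent of x1 -> VIF close to 1
x3 <- 0.9 * x1 + 0.1 * rnorm(n)  # nearly collinear with x1 -> large VIF

vif_by_hand <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vif_by_hand(x2, data.frame(x1, x3))  # close to 1: no cause for concern
vif_by_hand(x1, data.frame(x2, x3))  # far above 5: coefficients unreliable
```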
To check whether there is multicollinearity in the predictors, we use the vif function of the car toolbox.
# multicollinearity test using VIF
glmlogitvif <- vif(glmlogit)
We reformat the result of the multicollinearity test into tibble format to make it easier to use for subsequent outputs in a table and a visualisation.
# reformatting the result of the multicollinearity test
dfglmlogitvif <- tibble(predictors = names(glmlogitvif),
                        vif = glmlogitvif) %>%
  mutate(predictors = case_when(predictors == "OPDUMA" ~ "Retinal microaneurysms, worse eye",
                                predictors == "RIAGENDR" ~ "Gender",
                                predictors == "RIDAGEYR" ~ "Age in years",
                                predictors == "BPXSY1" ~ "Systolic: Blood pres (1st rdg) mm Hg",
                                predictors == "LBXGLT" ~ "Two Hour Glucose(OGTT) (mg/dL)",
                                predictors == "LBDTCSI" ~ "Total Cholesterol( mmol/L)",
                                predictors == "BMXBMI" ~ "Body Mass Index (kg/m**2)",
                                .default = predictors))
# render the result of the multicollinearity test as a table
dfglmlogitvif %>%
  gt() %>%
  cols_label(
    predictors = "Predictors",
    vif = "VIF"
  )
Predictors | VIF |
---|---|
Gender | 1.026555 |
Age in years | 1.223686 |
Systolic: Blood pres (1st rdg) mm Hg | 1.155713 |
Two Hour Glucose(OGTT) (mg/dL) | 1.157670 |
Total Cholesterol( mmol/L) | 1.062960 |
Body Mass Index (kg/m**2) | 1.045360 |
# visualise the result of the multicollinearity test as a bar chart
dfglmlogitvif %>%
  ggplot(aes(y = predictors, x = vif)) +
  geom_col(fill = "gray") +
  theme_bw() +
  labs(y = "Predictors", x = "VIF") +
  geom_vline(xintercept = 1, color = "green", linewidth = 1.0) +
  geom_vline(xintercept = 5, color = "red", linewidth = 1.0)
As a result of the test, no multicollinearity was detected: all VIF values are well below 5.
Subsequently we give a summary of the results of the binary logistic modelling using this “large” model.
# summary of the results of the binary logistic modelling in a table
glmlogit %>%
  tbl_regression(estimate_fun = purrr::partial(style_ratio, digits = 3),
                 pvalue_fun = purrr::partial(style_sigfig, digits = 3)) %>%
  add_significance_stars(hide_ci = FALSE, hide_p = FALSE,
                         pattern = "{p.value}{stars}")
Characteristic | log(OR)1 | SE1 | 95% CI1 | p-value2 |
---|---|---|---|---|
Gender | ||||
Male | — | — | — | |
Female | -0.563 | 0.294 | -1.155, 0.002 | 0.055 |
Age in years | -0.006 | 0.013 | -0.032, 0.019 | 0.623 |
Systolic: Blood pres (1st rdg) mm Hg | 0.004 | 0.008 | -0.012, 0.019 | 0.617 |
Two Hour Glucose(OGTT) (mg/dL) | 0.007 | 0.002 | 0.003, 0.012 | 0.000*** |
Total Cholesterol( mmol/L) | -0.081 | 0.139 | -0.358, 0.186 | 0.560 |
Body Mass Index (kg/m**2) | -0.035 | 0.029 | -0.092, 0.020 | 0.228 |
1 OR = Odds Ratio, SE = Standard Error, CI = Confidence Interval | ||||
2 *p<0.05; **p<0.01; ***p<0.001 |
As can be seen in the column headed p-value, not all predictors contribute significantly to the model.
For model selection and refinement we use the stepAIC function of the MASS toolbox with a stepwise regression approach that combines forward and backward selection of the predictors (Venables & Ripley, 2002). In the following, the function stepAIC is applied; the trace of the model selection steps with the respective AIC values and the finally selected model is displayed.
# model selection using the AIC criterion
glmlogitaic <- glmlogit %>% stepAIC(trace = TRUE, direction = 'both')
## Start: AIC=427.03
## OPDUMA ~ RIAGENDR + RIDAGEYR + BPXSY1 + LBXGLT + LBDTCSI + BMXBMI
##
##            Df Deviance    AIC
## - RIDAGEYR  1   413.27 425.27
## - BPXSY1    1   413.28 425.28
## - LBDTCSI   1   413.38 425.38
## - BMXBMI    1   414.55 426.55
## <none>          413.03 427.03
## - RIAGENDR  1   416.84 428.84
## - LBXGLT    1   424.43 436.43
##
## Step: AIC=425.27
## OPDUMA ~ RIAGENDR + BPXSY1 + LBXGLT + LBDTCSI + BMXBMI
##
##            Df Deviance    AIC
## - BPXSY1    1   413.42 423.42
## - LBDTCSI   1   413.53 423.53
## - BMXBMI    1   414.63 424.63
## <none>          413.27 425.27
## + RIDAGEYR  1   413.03 427.03
## - RIAGENDR  1   417.16 427.16
## - LBXGLT    1   424.52 434.52
##
## Step: AIC=423.42
## OPDUMA ~ RIAGENDR + LBXGLT + LBDTCSI + BMXBMI
##
##            Df Deviance    AIC
## - LBDTCSI   1   413.66 421.66
## - BMXBMI    1   414.77 422.77
## <none>          413.42 423.42
## + BPXSY1    1   413.27 425.27
## + RIDAGEYR  1   413.28 425.28
## - RIAGENDR  1   417.39 425.39
## - LBXGLT    1   425.90 433.90
##
## Step: AIC=421.66
## OPDUMA ~ RIAGENDR + LBXGLT + BMXBMI
##
##            Df Deviance    AIC
## - BMXBMI    1   415.00 421.00
## <none>          413.66 421.66
## + LBDTCSI   1   413.42 423.42
## + BPXSY1    1   413.53 423.53
## + RIDAGEYR  1   413.58 423.58
## - RIAGENDR  1   418.07 424.07
## - LBXGLT    1   426.07 432.07
##
## Step: AIC=421
## OPDUMA ~ RIAGENDR + LBXGLT
##
##            Df Deviance    AIC
## <none>          415.00 421.00
## + BMXBMI    1   413.66 421.66
## + LBDTCSI   1   414.77 422.77
## + BPXSY1    1   414.87 422.87
## + RIDAGEYR  1   414.99 422.99
## - RIAGENDR  1   419.49 423.49
## - LBXGLT    1   426.62 430.62
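As a plausibility check on the trace: for a generalised linear model, the AIC is the residual deviance plus twice the number of estimated parameters. The final model OPDUMA ~ RIAGENDR + LBXGLT has three parameters (intercept, gender, OGTT):

```r
# AIC = deviance + 2 * k, with k the number of estimated parameters.
# Values taken from the stepAIC trace for the final model:
deviance_final <- 415.00
k_final        <- 3            # intercept + RIAGENDR + LBXGLT
aic_final      <- deviance_final + 2 * k_final
aic_final                      # 421, matching "Step: AIC=421" in the trace
```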
Below we give a summary of the results of the reduced model.
# summary of the refined model
glmlogitaic %>%
  tbl_regression(estimate_fun = purrr::partial(style_ratio, digits = 3),
                 pvalue_fun = purrr::partial(style_sigfig, digits = 3)) %>%
  add_significance_stars(hide_ci = FALSE, hide_p = FALSE,
                         pattern = "{p.value}{stars}")
Characteristic | log(OR)1 | SE1 | 95% CI1 | p-value2 |
---|---|---|---|---|
Gender | ||||
Male | — | — | — | |
Female | -0.603 | 0.290 | -1.186, -0.045 | 0.037* |
Two Hour Glucose(OGTT) (mg/dL) | 0.007 | 0.002 | 0.003, 0.011 | 0.000*** |
1 OR = Odds Ratio, SE = Standard Error, CI = Confidence Interval | ||||
2 *p<0.05; **p<0.01; ***p<0.001 |
If we look again at the column with the p-values, the two remaining predictors are significant. In the second column, the coefficients of the model are listed in terms of the logarithm of the odds ratio. An interpretation of the result is easier if we look at the odds ratio itself.
# summary of the refined model using odds ratios
glmlogitaic %>%
  tbl_regression(exponentiate = TRUE,
                 estimate_fun = purrr::partial(style_ratio, digits = 3),
                 pvalue_fun = purrr::partial(style_sigfig, digits = 3)) %>%
  add_significance_stars(hide_ci = FALSE, hide_p = FALSE,
                         pattern = "{p.value}{stars}")
Characteristic | OR1 | SE1 | 95% CI1 | p-value2 |
---|---|---|---|---|
Gender | ||||
Male | — | — | — | |
Female | 0.547 | 0.290 | 0.305, 0.956 | 0.037* |
Two Hour Glucose(OGTT) (mg/dL) | 1.007 | 0.002 | 1.003, 1.011 | 0.000*** |
1 OR = Odds Ratio, SE = Standard Error, CI = Confidence Interval | ||||
2 *p<0.05; **p<0.01; ***p<0.001 |
The odds ratios can be determined from the log(OR) using the exponential function. The odds ratio describes the effect on the output variable if only one of the predictors changes and all others remain constant.
Let us first look at the odds ratio of the OGTT value: the OR is 1.007. This means that with an increase in the OGTT value by one unit (1 mg/dL), the odds of developing a retinal microaneurysm increase on average by a factor of 1.007. In other words: the odds increase by 0.7% per one-unit increase in OGTT.
For the predictor gender, the odds ratio for women compared to men is 0.547. This means that, on average, the odds of developing retinal microaneurysms are lower for women by a factor of 0.547. In other words, the odds for women are 45.3% lower than for men.
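The exponentiation step behind this table can also be carried out directly on a fitted model object with base R. The sketch below uses a small simulated dataset with invented variable names and coefficients, not the NHANES data from the post:

```r
# Sketch with simulated data (invented names, not the NHANES dataset):
# odds ratios and confidence intervals via exponentiation
set.seed(1)
n    <- 400
sex  <- factor(sample(c("male", "female"), n, replace = TRUE),
               levels = c("male", "female"))
ogtt <- rnorm(n, mean = 125, sd = 30)
eta  <- -4 + 0.02 * ogtt - 0.5 * (sex == "female")
y    <- rbinom(n, 1, plogis(eta))

fit <- glm(y ~ sex + ogtt, family = binomial(link = "logit"))

exp(coef(fit))     # odds ratios = exp(log-odds coefficients)
exp(confint(fit))  # confidence intervals on the odds-ratio scale
```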
# visualisation of the odds ratios of the refined model
ggcoefstats(glmlogitaic, exponentiate = TRUE) +
  # further modification with ggplot2 commands
  # note the order in which the labels are entered
  ggplot2::scale_y_discrete(labels = c("(Intercept)", "Gender/female",
                                       "Two Hour Glucose(OGTT) (mg/dL)")) +
  ggplot2::labs(y = "Predictors", x = "Odds ratio")
Gnu R toolboxes that we used
For the statistical analysis of the data and the visualisation of results and intermediate results, we used a number of toolboxes. Below is a list of these toolboxes with a short description and links to more detailed information:
tidyverse — This toolbox includes a number of Gnu R software packages that use tidy data structures and in which the design philosophy and grammar of tidy data are implemented, e.g. pipes. Information on tidy data: H. Wickham, Tidy data, The Journal of Statistical Software, Vol. 59, 2014. Information on the toolbox: https://www.tidyverse.org/.

haven — The toolbox haven enables Gnu R to read and write various data formats used by other statistical software systems. The following formats are supported: SAS, SPSS and Stata. Information on the toolbox: https://haven.tidyverse.org/.

ggplot2 — In the ggplot2 toolbox a system for the declarative creation of graphics is realised, which is based on ‘The Grammar of Graphics’. Information on the toolbox: https://ggplot2.tidyverse.org/.

gtsummary — The package gtsummary enables the creation of publication-ready analytical and summary tables using the R programming language. Information on the toolbox, website: https://www.danieldsjoberg.com/gtsummary/, cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/gtsummary.pdf.

ggstatsplot — The toolbox ggstatsplot can be used to annotate visualisations of data analysis results with the results of statistical tests. Information on the toolbox, website: https://indrajeetpatil.github.io/ggstatsplot/.

MASS — The toolbox MASS provides functions and datasets to support Venables and Ripley, “Modern Applied Statistics with S” (4th edition, 2002) (Venables & Ripley, 2002). Information on the toolbox, website: https://cran.r-project.org/web/packages/MASS/index.html.

car — The toolbox car provides functions and datasets associated with the book Fox and Weisberg, “An R Companion to Applied Regression” (2nd edition, 2011) (Fox & Weisberg, 2011). Information on the toolbox, website: https://cran.r-project.org/web/packages/car/index.html.
Information about the hard- and software configuration of the computer on which this post was authored:
pander(sessionInfo())
R version 4.4.0 (2024-04-24 ucrt)
Platform: x86_64-w64-mingw32/x64
locale: LC_COLLATE=German_Austria.utf8, LC_CTYPE=German_Austria.utf8, LC_MONETARY=German_Austria.utf8, LC_NUMERIC=C and LC_TIME=German_Austria.utf8
attached base packages: stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: car(v.3.1-2), carData(v.3.0-5), gtExtras(v.0.5.0), gt(v.0.10.1), pander(v.0.6.5), ggstatsplot(v.0.12.3), ggplot2(v.3.5.1), gtsummary(v.1.7.2), MASS(v.7.3-60.2), dplyr(v.1.1.4) and haven(v.2.5.4)
loaded via a namespace (and not attached): tidyselect(v.1.2.1), farver(v.2.1.2), statsExpressions(v.1.5.4), fastmap(v.1.2.0), TH.data(v.1.1-2), blogdown(v.1.19), bayestestR(v.0.13.2), broom.helpers(v.1.15.0), labelled(v.2.13.0), digest(v.0.6.35), estimability(v.1.5.1), lifecycle(v.1.0.4), survival(v.3.5-8), magrittr(v.2.0.3), compiler(v.4.4.0), rlang(v.1.1.3), sass(v.0.4.9), tools(v.4.4.0), utf8(v.1.2.4), yaml(v.2.3.8), knitr(v.1.46), labeling(v.0.4.3), xml2(v.1.3.6), multcomp(v.1.4-25), abind(v.1.4-5), withr(v.3.0.0), purrr(v.1.0.2), grid(v.4.4.0), datawizard(v.0.10.0), fansi(v.1.0.6), xtable(v.1.8-4), colorspace(v.2.1-0), paletteer(v.1.6.0), emmeans(v.1.10.2), scales(v.1.3.0), zeallot(v.0.1.0), insight(v.0.19.11), cli(v.3.6.2), mvtnorm(v.1.2-5), rmarkdown(v.2.27), generics(v.0.1.3), performance(v.0.11.0), rstudioapi(v.0.16.0), tzdb(v.0.4.0), parameters(v.0.21.7), commonmark(v.1.9.1), cachem(v.1.1.0), stringr(v.1.5.1), splines(v.4.4.0), effectsize(v.0.8.8), vctrs(v.0.6.5), Matrix(v.1.7-0), sandwich(v.3.1-0), jsonlite(v.1.8.8), bookdown(v.0.39), hms(v.1.1.3), patchwork(v.1.2.0), ggrepel(v.0.9.5), correlation(v.0.8.4), fontawesome(v.0.5.2), tidyr(v.1.3.1), jquerylib(v.0.1.4), glue(v.1.7.0), rematch2(v.2.1.2), codetools(v.0.2-20), stringi(v.1.8.4), gtable(v.0.3.5), prismatic(v.1.1.2), munsell(v.0.5.1), tibble(v.3.2.1), pillar(v.1.9.0), htmltools(v.0.5.8.1), R6(v.2.5.1), evaluate(v.0.23), lattice(v.0.22-6), highr(v.0.10), markdown(v.1.12), readr(v.2.1.5), backports(v.1.4.1), broom(v.1.0.6), bslib(v.0.7.0), Rcpp(v.1.0.12), coda(v.0.19-4.1), xfun(v.0.44), zoo(v.1.8-12), forcats(v.1.0.0) and pkgconfig(v.2.0.3)
Theoretical foundations of the logistic regression
Binary logit regression model
Phenomena such as the presence of a disease (yes/no) can be coded by means of binary (dichotomous) variables. Binary logistic regression is used when the dependence of a binary (dichotomous) outcome variable (dependent variable) on predictors (independent variables) is to be modelled. The variable \(p\) denotes the probability that an event (e.g. presence of the disease) occurs. The odds of having the disease are defined as the quotient of probability and counter-probability, \(\frac{p}{1 - p}\). Taking the natural logarithm of the odds maps probability values in \((0, 1)\) onto the entire range of real numbers \((-\infty, \infty)\). The binary logistic regression model is defined by the equation
$$
\DeclareMathOperator{\logit}{logit}
\logit(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2+ \cdots +\beta_k x_k ,
$$
with

- \(x_i\) the predictors (independent variables)
- \(\beta_0\) the estimated intercept term
- \(\beta_i\) the estimated binary logistic regression coefficients.
If we solve this equation for \(p\), we get
$$
p = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2+ \cdots +\beta_k x_k)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2+ \cdots +\beta_k x_k)}
$$
(Petrie & Sabin, 2005; Rosner, 2016).
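This back-transformation from the linear predictor to a probability is also available in base R as plogis(). A short sketch: the intercept value below is invented for the illustration, while the slope 0.007 is the OGTT coefficient from the example analysis.

```r
# Inverse logit: p = exp(eta) / (1 + exp(eta)), eta = linear predictor
inv_logit <- function(eta) exp(eta) / (1 + exp(eta))

# Illustrative linear predictor: invented intercept -2.5,
# OGTT coefficient 0.007 from the example, OGTT value 150 mg/dL
eta <- -2.5 + 0.007 * 150
inv_logit(eta)                          # predicted probability
all.equal(inv_logit(eta), plogis(eta))  # plogis() is the base-R equivalent
```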
Interpretation of the regression coefficients
Continuous predictors
In a binary logistic regression model, consider a single continuous predictor \(x_i\) whose value differs between two individuals by an amount \(\Delta\), while all other predictors in the model have the same values for both individuals. The odds ratio between the two individuals is then estimated by $$\hat{OR} = \exp(\hat{\beta}_{i}\, \Delta).$$
Dichotomous predictors
Suppose there is a dichotomous predictor \(x_i\), coded 1 if the property is present and 0 if it is absent. If two individuals differ exclusively in this predictor, then the odds ratio can be estimated by $$\hat{OR} = \exp(\hat{\beta}_{i}).$$
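Applied to the coefficients reported for the refined model in the example analysis, these formulas reproduce the odds ratios from the results table:

```r
# Odds ratios from the log(OR) coefficients of the refined model
exp(-0.603)      # gender (female vs male): ~0.547
exp(0.007)       # OGTT, per 1 mg/dL increase: ~1.007
exp(0.007 * 10)  # OGTT, per 10 mg/dL increase: ~1.073
```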
Further reading
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723. https://doi.org/10.1109/TAC.1974.1100705
Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference – a practical information-theoretic approach (2nd ed.). Springer. https://doi.org/10.1007/b97636
Fox, J., & Weisberg, S. (2011). An R companion to applied regression (2nd ed.). Sage. https://socialsciences.mcmaster.ca/jfox/Books/Companion/
Hair Jr., J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2019). Multivariate data analysis (8th ed.). Cengage.
Petrie, A., & Sabin, C. (2005). Medical statistics at a glance (2nd ed.). Blackwell Publishing.
Rosner, B. (2016). Fundamentals of biostatistics (8th ed.). Cengage Learning.
United States Department of Health and Human Services. Centers for Disease Control and Prevention. National Center for Health Statistics. (2012). National health and nutrition examination survey (NHANES), 2007-2008. Inter-university Consortium for Political and Social Research. https://doi.org/10.3886/ICPSR25505.v3
Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). Springer. http://www.stats.ox.ac.uk/pub/MASS4