Binary Logistic Regression

Uwe Graichen · uwe.graichen@kl.ac.at

Overview

Binary logistic regression is used to model the relationship between a binary dependent variable (e.g. therapy successful? Yes/No) and a set of explanatory variables. As a result of the regression analysis, we identify explanatory variables that influence the binary dependent outcome variable. The binary logistic regression equation can also be used to predict the outcome of the binary dependent variable as a function of values of the explanatory variables.

Binary logistic regression – Principles and an illustrative data analysis using Gnu R

Objective and scope of application

Binary logistic regression belongs to the class of generalised linear models. It is used to model the relationship between a binary dependent variable (e.g. therapy successful? Yes/No) and a set of explanatory variables. As a result of the regression analysis, we identify explanatory variables that influence the binary dependent outcome variable. The binary logistic regression equation can also be used to predict the outcome of the binary dependent variable as a function of the values of the explanatory variables.

Motivating example

Microaneurysms are an early visible manifestation of diabetic retinopathy. They appear as small red dots on the retina.

[Figure: Healthy retina (image by Mikael Häggström, used with permission); retina with microaneurysms (image by Edgar Nagel, used with permission).]

In this blog post we use logistic regression to identify predictors for the occurrence of retinal microaneurysms. The predictors we include in the analysis are age, gender, systolic blood pressure, oral glucose tolerance test (OGTT), total cholesterol and body mass index (BMI) of the patients. We only include data from people over 18 years of age in the analysis. The dataset we use for the example analysis in this blog post was pre-compiled from data of the survey “National Health and Nutrition Examination Survey (NHANES), 2007-2008” (United States Department of Health and Human Services. Centers for Disease Control and Prevention. National Center for Health Statistics, 2012).

Analysis script in Gnu R

Used Gnu R toolboxes

For data import, analysis and visualisation of the results we use existing toolboxes for the Gnu R system. The toolboxes in use are described further down in the post. They are loaded by the following code fragment:

library(haven)         # import SPSS files
library(dplyr)         # functions for dataframe manipulation
library(MASS)          # stepAIC for model selection
library(gtsummary)     # publication-ready analytical and summary tables
library(ggplot2)       # high quality plots
library(ggstatsplot)   # high quality statistics plots
library(pander)        # rendering R objects into Pandoc markdown
library(gt)            # output of tibbles as table
library(gtExtras)      # extension of gt
library(car)           # vif for multicollinearity test

Import of the data

We use the function read_sav from the software package haven to import the data file.

# Import a dataset that is in SAV format
dataIn <- read_sav("retinalMicroaneurysm.sav")

The following variables are included in the dataset we use in the analysis:

  • BPXSY1: Systolic blood pressure (first reading) mm Hg
  • RIAGENDR: Gender of the sample person
  • RIDAGEYR: Age in years of the sample person at time of screening
  • LBXGLT: Two Hour Glucose(OGTT) (mg/dL)
  • OPDUMA: Retinal microaneurysms, worse eye
  • LBDTCSI: Total cholesterol (mmol/L)
  • BMXBMI: Body Mass Index (kg/m**2)

We use the function as_factor to convert the variables RIAGENDR and OPDUMA into factors, preserving the variable and value label attributes.

# Change some variables to factors
dataAnalysis <- dataIn %>%
  mutate(RIAGENDR = as_factor(RIAGENDR)) %>%
  mutate(OPDUMA = as_factor(OPDUMA))

Short data summary

In the table below we give a summary of the data that we include in the further steps of the analysis.

# Print summary table (table 1)
dataAnalysis %>%
  tbl_summary()

Characteristic                           N = 1,046¹
Gender
    Male                                 522 (50%)
    Female                               524 (50%)
Age in years                             58 (48, 68)
Systolic: Blood pres (1st rdg) mm Hg     124 (114, 136)
Two Hour Glucose(OGTT) (mg/dL)           122 (99, 158)
Retinal microaneurysms, worse eye        55 (5.3%)
Total Cholesterol( mmol/L)               5.22 (4.63, 5.95)
Body Mass Index (kg/m**2)                27.8 (24.8, 31.4)
¹ n (%); Median (IQR)

Analysis by means of binary logistic regression

For the analysis by means of logistic regression we use a generalised linear model, which is provided by the package stats as command glm(). The form of the glm() command is glm(formula, family=familytype(link=linkfunction), data=dataTibble). The meaning of the three parameters formula, family and data is:

  • formula is a symbolic description of the model to be fitted to the data within the analysis. The variable on the left-hand side of the tilde (~) is called the outcome variable (or dependent variable), while the variables on the right-hand side of the tilde are called the predictors (or independent variables) and are joined by plus signs +. In binary logistic regression, the outcome variable must be dichotomous (binary).
  • family is a description of the distribution of the outcome variable and the link function to be used in the model. In our analysis problem we are dealing with a binary (also called dichotomous) outcome variable. We use logistic regression (logit link).
  • data is a tibble containing the data of the variables in the model.

The aim of the analysis is to identify a compact model that represents the collected data as well as possible. We use the Akaike information criterion (AIC) as a metric for the quality of model fit (Akaike, 1974; Burnham & Anderson, 2002; Venables & Ripley, 2002), together with a combined forward and backward selection approach. In the following, we set up a model that includes all available predictors and fit it to the data:

# define the model and fit it to the data
glmlogit <- glm(formula = OPDUMA ~ RIAGENDR + RIDAGEYR + BPXSY1 + LBXGLT +
                  LBDTCSI + BMXBMI,
                family = binomial(link="logit"),
                data = dataAnalysis)

In the next step, we examine multicollinearity in the predictor variables. Multicollinearity is present when two or more of the predictors in a regression model are correlated. It can affect our analysis and thus limit the conclusions we can draw. We use the variance inflation factor (VIF) to measure the strength of the correlation between the predictor variables in the regression model (Fox & Weisberg, 2011; Hair Jr. et al., 2019). A heuristic for the evaluation of VIF values:

  • A value of 1 indicates there is no correlation between a given and all other predictor variables.
  • Values between 1 and 5 indicate a moderate correlation between a given and all other predictor variables, but this does not yet require attention.
  • Values greater than 5 indicate a strong correlation between a given predictor variable and all other predictor variables; the estimated coefficients and p-values are then most likely unreliable.
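To make this heuristic concrete, here is a minimal sketch (not part of the original analysis, and not the car::vif implementation) of what the VIF measures: for a given predictor, VIF = 1 / (1 − R²), where R² comes from regressing that predictor on all remaining predictors. The data below are simulated purely for illustration.

```r
# Illustrative sketch: computing a VIF by hand on simulated data (base R only)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # x2 correlated with x1
x3 <- rnorm(n)                        # x3 independent of x1 and x2

r2     <- summary(lm(x1 ~ x2 + x3))$r.squared  # R^2 of x1 on the others
vif_x1 <- 1 / (1 - r2)
vif_x1   # clearly above 1, reflecting the x1-x2 correlation
```

With uncorrelated predictors the R² would be near zero and the VIF near 1, matching the first bullet above.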

To check whether there is multicollinearity in the predictors, we use the vif function of the car toolbox.

# multicollinearity test using VIF
glmlogitvif <- vif(glmlogit)

We reformat the result of the multicollinearity test into tibble format to make it easier to use for the subsequent output in a table and a visualisation.

# reformatting the result of the multicollinearity test
dfglmlogitvif <- tibble(predictors = names(glmlogitvif),
                        vif = glmlogitvif) %>%
  mutate(predictors = case_when(predictors == "OPDUMA" ~ "Retinal microaneurysms, worse eye",
                                predictors == "RIAGENDR" ~ "Gender",
                                predictors == "RIDAGEYR" ~ "Age in years",
                                predictors == "BPXSY1" ~ "Systolic: Blood pres (1st rdg) mm Hg",
                                predictors == "LBXGLT" ~ "Two Hour Glucose(OGTT) (mg/dL)",
                                predictors == "LBDTCSI" ~ "Total Cholesterol( mmol/L)",
                                predictors == "BMXBMI" ~ "Body Mass Index (kg/m**2)",
                                .default = predictors))
# render the result of the multicollinearity test as a table
dfglmlogitvif %>%
  gt() %>%
  cols_label(
    predictors = "Predictors",
    vif = "VIF"
  )

Predictors                               VIF
Gender                                   1.026555
Age in years                             1.223686
Systolic: Blood pres (1st rdg) mm Hg     1.155713
Two Hour Glucose(OGTT) (mg/dL)           1.157670
Total Cholesterol( mmol/L)               1.062960
Body Mass Index (kg/m**2)                1.045360
# visualise the result of the multicollinearity test as a bar chart
dfglmlogitvif %>%
  ggplot(aes(y = predictors, x = vif)) +
  geom_col(fill = "gray") +
  theme_bw() +
  labs(y = "Predictors", x = "VIF") +
  geom_vline(xintercept = 1, color = "green", linewidth = 1) +
  geom_vline(xintercept = 5, color = "red", linewidth = 1)

As a result of the test, no multicollinearity was detected.

Subsequently we give a summary of the results of the binary logistic modelling using this “large” model.

# summary of the results of the binary logistic modelling in a table
glmlogit %>%
  tbl_regression(estimate_fun = purrr::partial(style_ratio, digits = 3),
                 pvalue_fun = purrr::partial(style_sigfig, digits = 3)) %>%
  add_significance_stars(hide_ci = FALSE, hide_p = FALSE,
                         pattern = "{p.value}{stars}")

Characteristic                           log(OR)¹   SE¹     95% CI¹          p-value²
Gender
    Male                                 —
    Female                               -0.563     0.294   -1.155, 0.002    0.055
Age in years                             -0.006     0.013   -0.032, 0.019    0.623
Systolic: Blood pres (1st rdg) mm Hg     0.004      0.008   -0.012, 0.019    0.617
Two Hour Glucose(OGTT) (mg/dL)           0.007      0.002   0.003, 0.012     0.000***
Total Cholesterol( mmol/L)               -0.081     0.139   -0.358, 0.186    0.560
Body Mass Index (kg/m**2)                -0.035     0.029   -0.092, 0.020    0.228
¹ OR = Odds Ratio, SE = Standard Error, CI = Confidence Interval
² *p<0.05; **p<0.01; ***p<0.001

As can be seen in the column headed p-value, not all predictors contribute significantly to the model.

For model selection and refinement we use the stepAIC function of the MASS toolbox, a stepwise-regression approach that combines forward and backward selection of the predictors (Venables & Ripley, 2002). In the following, the function stepAIC is applied; the trace of the model selection steps with the respective AIC values and the finally selected model is displayed.

# model selection using the AIC criterion
glmlogitaic <- glmlogit %>% stepAIC(trace = T, direction = 'both')
## Start:  AIC=427.03
## OPDUMA ~ RIAGENDR + RIDAGEYR + BPXSY1 + LBXGLT + LBDTCSI + BMXBMI
## 
##            Df Deviance    AIC
## - RIDAGEYR  1   413.27 425.27
## - BPXSY1    1   413.28 425.28
## - LBDTCSI   1   413.38 425.38
## - BMXBMI    1   414.55 426.55
## <none>          413.03 427.03
## - RIAGENDR  1   416.84 428.84
## - LBXGLT    1   424.43 436.43
## 
## Step:  AIC=425.27
## OPDUMA ~ RIAGENDR + BPXSY1 + LBXGLT + LBDTCSI + BMXBMI
## 
##            Df Deviance    AIC
## - BPXSY1    1   413.42 423.42
## - LBDTCSI   1   413.53 423.53
## - BMXBMI    1   414.63 424.63
## <none>          413.27 425.27
## + RIDAGEYR  1   413.03 427.03
## - RIAGENDR  1   417.16 427.16
## - LBXGLT    1   424.52 434.52
## 
## Step:  AIC=423.42
## OPDUMA ~ RIAGENDR + LBXGLT + LBDTCSI + BMXBMI
## 
##            Df Deviance    AIC
## - LBDTCSI   1   413.66 421.66
## - BMXBMI    1   414.77 422.77
## <none>          413.42 423.42
## + BPXSY1    1   413.27 425.27
## + RIDAGEYR  1   413.28 425.28
## - RIAGENDR  1   417.39 425.39
## - LBXGLT    1   425.90 433.90
## 
## Step:  AIC=421.66
## OPDUMA ~ RIAGENDR + LBXGLT + BMXBMI
## 
##            Df Deviance    AIC
## - BMXBMI    1   415.00 421.00
## <none>          413.66 421.66
## + LBDTCSI   1   413.42 423.42
## + BPXSY1    1   413.53 423.53
## + RIDAGEYR  1   413.58 423.58
## - RIAGENDR  1   418.07 424.07
## - LBXGLT    1   426.07 432.07
## 
## Step:  AIC=421
## OPDUMA ~ RIAGENDR + LBXGLT
## 
##            Df Deviance    AIC
## <none>          415.00 421.00
## + BMXBMI    1   413.66 421.66
## + LBDTCSI   1   414.77 422.77
## + BPXSY1    1   414.87 422.87
## + RIDAGEYR  1   414.99 422.99
## - RIAGENDR  1   419.49 423.49
## - LBXGLT    1   426.62 430.62

Below we give a summary of the results of the reduced model.

# summary of the refined model
glmlogitaic %>%
  tbl_regression(estimate_fun = purrr::partial(style_ratio, digits = 3),
                 pvalue_fun = purrr::partial(style_sigfig, digits = 3)) %>%
  add_significance_stars(hide_ci = FALSE, hide_p = FALSE,
                         pattern = "{p.value}{stars}")

Characteristic                           log(OR)¹   SE¹     95% CI¹          p-value²
Gender
    Male                                 —
    Female                               -0.603     0.290   -1.186, -0.045   0.037*
Two Hour Glucose(OGTT) (mg/dL)           0.007      0.002   0.003, 0.011     0.000***
¹ OR = Odds Ratio, SE = Standard Error, CI = Confidence Interval
² *p<0.05; **p<0.01; ***p<0.001

If we look again at the column with the p-values, both remaining predictors are now significant. In the second column, the coefficients of the model are listed as the logarithm of the odds ratio. Interpretation is easier if we look at the odds ratio itself.

# summary of the refined model using odds ratios
glmlogitaic %>%
  tbl_regression(exponentiate = TRUE,
                 estimate_fun = purrr::partial(style_ratio, digits = 3),
                 pvalue_fun = purrr::partial(style_sigfig, digits = 3)) %>%
  add_significance_stars(hide_ci = FALSE, hide_p = FALSE,
                         pattern = "{p.value}{stars}")

Characteristic                           OR¹        SE¹     95% CI¹          p-value²
Gender
    Male                                 —
    Female                               0.547      0.290   0.305, 0.956     0.037*
Two Hour Glucose(OGTT) (mg/dL)           1.007      0.002   1.003, 1.011     0.000***
¹ OR = Odds Ratio, SE = Standard Error, CI = Confidence Interval
² *p<0.05; **p<0.01; ***p<0.001

The odds ratios can be determined from the log(OR) values using the exponential function. The odds ratio describes the effect on the outcome variable if exactly one of the predictors changes while all others remain constant.

Let us first look at the odds ratio of the OGTT value: the OR is 1.007. This means that with an increase in the OGTT value by one unit (1 mg/dL), the odds of developing a retinal microaneurysm increase on average by a factor of 1.007. Or in other words: the odds increase by 0.7% per unit of OGTT.

For the predictor gender, the odds ratio for women compared to men is 0.547. This means that, on average, women's odds of developing retinal microaneurysms are 0.547 times those of men. Or in other words, the odds for women compared to men are reduced by 45.3%.
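The arithmetic behind these two statements can be checked in a couple of lines; the values below are simply copied from the regression tables above.

```r
# Recompute the odds ratios reported above from the log(OR) estimates
coef_ogtt   <- 0.007    # log(OR) per 1 mg/dL OGTT (from the table above)
coef_female <- -0.603   # log(OR) for female vs. male

or_ogtt   <- exp(coef_ogtt)     # approx. 1.007 -> odds rise by ~0.7% per mg/dL
or_female <- exp(coef_female)   # approx. 0.547 -> ~45.3% lower odds for women
```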

# visualization of the odds ratios of the refined model
ggcoefstats(glmlogitaic, exponentiate = TRUE) +
  # further modification with ggplot2 commands
  # note the order in which the labels are entered
  ggplot2::scale_y_discrete(labels = c("(Intercept)", "Gender/female",
                                       "Two Hour Glucose(OGTT) (mg/dL)")) +
  ggplot2::labs(y = "Predictors", x = "Odds ratio")

Gnu R toolboxes that we used

For the statistical analysis of the data and the visualisation of results and intermediate results, we used a number of toolboxes. Below is a list of these toolboxes with a short description of each:

  • haven: import of SPSS files
  • dplyr: functions for dataframe manipulation
  • MASS: stepAIC for model selection
  • gtsummary: publication-ready analytical and summary tables
  • ggplot2: high quality plots
  • ggstatsplot: high quality statistics plots
  • pander: rendering R objects into Pandoc markdown
  • gt: output of tibbles as tables
  • gtExtras: extensions of gt
  • car: vif for the multicollinearity test

Information about the hard- and software configuration of the computer on which this post was authored:

pander(sessionInfo())

R version 4.4.0 (2024-04-24 ucrt)

Platform: x86_64-w64-mingw32/x64

locale: LC_COLLATE=German_Austria.utf8, LC_CTYPE=German_Austria.utf8, LC_MONETARY=German_Austria.utf8, LC_NUMERIC=C and LC_TIME=German_Austria.utf8

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: car(v.3.1-2), carData(v.3.0-5), gtExtras(v.0.5.0), gt(v.0.10.1), pander(v.0.6.5), ggstatsplot(v.0.12.3), ggplot2(v.3.5.1), gtsummary(v.1.7.2), MASS(v.7.3-60.2), dplyr(v.1.1.4) and haven(v.2.5.4)

loaded via a namespace (and not attached): tidyselect(v.1.2.1), farver(v.2.1.2), statsExpressions(v.1.5.4), fastmap(v.1.2.0), TH.data(v.1.1-2), blogdown(v.1.19), bayestestR(v.0.13.2), broom.helpers(v.1.15.0), labelled(v.2.13.0), digest(v.0.6.35), estimability(v.1.5.1), lifecycle(v.1.0.4), survival(v.3.5-8), magrittr(v.2.0.3), compiler(v.4.4.0), rlang(v.1.1.3), sass(v.0.4.9), tools(v.4.4.0), utf8(v.1.2.4), yaml(v.2.3.8), knitr(v.1.46), labeling(v.0.4.3), xml2(v.1.3.6), multcomp(v.1.4-25), abind(v.1.4-5), withr(v.3.0.0), purrr(v.1.0.2), grid(v.4.4.0), datawizard(v.0.10.0), fansi(v.1.0.6), xtable(v.1.8-4), colorspace(v.2.1-0), paletteer(v.1.6.0), emmeans(v.1.10.2), scales(v.1.3.0), zeallot(v.0.1.0), insight(v.0.19.11), cli(v.3.6.2), mvtnorm(v.1.2-5), rmarkdown(v.2.27), generics(v.0.1.3), performance(v.0.11.0), rstudioapi(v.0.16.0), tzdb(v.0.4.0), parameters(v.0.21.7), commonmark(v.1.9.1), cachem(v.1.1.0), stringr(v.1.5.1), splines(v.4.4.0), effectsize(v.0.8.8), vctrs(v.0.6.5), Matrix(v.1.7-0), sandwich(v.3.1-0), jsonlite(v.1.8.8), bookdown(v.0.39), hms(v.1.1.3), patchwork(v.1.2.0), ggrepel(v.0.9.5), correlation(v.0.8.4), fontawesome(v.0.5.2), tidyr(v.1.3.1), jquerylib(v.0.1.4), glue(v.1.7.0), rematch2(v.2.1.2), codetools(v.0.2-20), stringi(v.1.8.4), gtable(v.0.3.5), prismatic(v.1.1.2), munsell(v.0.5.1), tibble(v.3.2.1), pillar(v.1.9.0), htmltools(v.0.5.8.1), R6(v.2.5.1), evaluate(v.0.23), lattice(v.0.22-6), highr(v.0.10), markdown(v.1.12), readr(v.2.1.5), backports(v.1.4.1), broom(v.1.0.6), bslib(v.0.7.0), Rcpp(v.1.0.12), coda(v.0.19-4.1), xfun(v.0.44), zoo(v.1.8-12), forcats(v.1.0.0) and pkgconfig(v.2.0.3)

Theoretical foundations of the logistic regression

Binary logit regression model

Phenomena such as the presence of a disease (yes/no) can be coded by means of binary (dichotomous) variables. Binary logistic regression is used when the dependence of a binary (dichotomous) outcome variable (dependent variable) on predictors (independent variables) is to be modelled. The variable \(p\) denotes the probability that an event (e.g. presence of the disease) occurs. The odds of having the disease are defined as the quotient of probability and counter-probability, \(\frac{p}{1 - p}\). Taking the natural logarithm of the odds maps the probability range \((0, 1)\) onto the whole range of real numbers \((-\infty, \infty)\). The binary logistic regression model is defined by the equation $$ \DeclareMathOperator{\logit}{logit} \logit(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2+ \cdots +\beta_k x_k , $$ with

  • \(x_i\) the predictors (independent variables)
  • \(\beta_0\) the estimated intercept term
  • \(\beta_i\) the estimated binary logistic regression coefficients.

If we solve this equation for \(p\), we get $$ p = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2+ \cdots +\beta_k x_k)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2+ \cdots +\beta_k x_k)} $$ (Petrie & Sabin, 2005; Rosner, 2016).
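Both directions of this mapping are available in base R: qlogis() computes the logit and plogis() its inverse, so the closed-form expression above can be verified directly. A quick numerical check (not part of the original analysis):

```r
# logit and inverse logit in base R
eta <- c(-3, -1, 0, 1, 3)              # hypothetical linear predictor values
p   <- exp(eta) / (1 + exp(eta))       # inverse logit, as in the formula above
stopifnot(all.equal(p, plogis(eta)))   # matches base R's inverse logit
stopifnot(all.equal(log(p / (1 - p)), qlogis(p)))  # and back via the logit
```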

Interpretation of the regression coefficients

Continuous predictors

In a binary logistic regression model, consider a single continuous predictor \(x_i\) whose value differs between two individuals by an amount \(\Delta\), while all other predictors in the model take the same values for both individuals. The odds ratio between the two individuals is then estimated by $$\hat{OR} = \exp(\hat{\beta}_{i}\,\Delta).$$
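As a worked example, take the OGTT coefficient estimated in the analysis above (\(\hat{\beta} \approx 0.007\) per mg/dL) and a hypothetical difference of 10 mg/dL between two otherwise identical individuals:

```r
# worked example: odds ratio for a 10 mg/dL difference in OGTT
beta_i <- 0.007          # estimated log(OR) per 1 mg/dL (from the analysis above)
delta  <- 10             # hypothetical difference between two individuals, mg/dL
or_10  <- exp(beta_i * delta)   # approx. 1.07, i.e. about 7% higher odds
```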

Dichotomous predictors

Suppose there is a dichotomous predictor \(x_i\), coded 1 if the property is present and 0 if it is absent. If two individuals differ exclusively in this predictor, the odds ratio is estimated by $$\hat{OR} = \exp(\hat{\beta}_{i}).$$

Further reading

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723. https://doi.org/10.1109/TAC.1974.1100705

Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference – a practical information-theoretic approach (2nd ed.). Springer. https://doi.org/10.1007/b97636

Fox, J., & Weisberg, S. (2011). An R companion to applied regression (2nd ed.). Sage. https://socialsciences.mcmaster.ca/jfox/Books/Companion/

Hair Jr., J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2019). Multivariate data analysis (8th ed.). Cengage.

Petrie, A., & Sabin, C. (2005). Medical statistics at a glance (2nd ed.). Blackwell Publishing.

Rosner, B. (2016). Fundamentals of biostatistics (8th ed.). Cengage Learning.

United States Department of Health and Human Services. Centers for Disease Control and Prevention. National Center for Health Statistics. (2012). National health and nutrition examination survey (NHANES), 2007-2008. Inter-university Consortium for Political and Social Research. https://doi.org/10.3886/ICPSR25505.v3

Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). Springer. http://www.stats.ox.ac.uk/pub/MASS4