Parametric hypothesis test for normally distributed unpaired samples

Uwe Graichen · uwe.graichen@kl.ac.at

Overview

In this post, we present the \(t\)-test as a parametric hypothesis test for normally distributed unpaired samples with equal variance. We state the necessary conditions for applying this test and analyze an example data set.

Parametric hypothesis test for normally distributed unpaired samples of equal variance — Principles and an illustrative data analysis in Gnu R

Objective and scope of application

The \(t\)-test for unpaired samples is used to compare the locations of the means of two independent data series. In this way, differences between the samples under consideration can be analyzed and assessed for statistical significance.

The \(t\)-test can be applied two-sided or one-sided. In the two-sided test, only the equality or inequality of the mean values of the two samples is analyzed. The null hypothesis \((H_0)\) of the two-sided test is that the means of the two data series, \(\mu_1\) and \(\mu_2\), are equal, \(H_0: \mu_1 = \mu_2\). The corresponding alternative hypothesis \((H_1)\) is that the two means are not equal, \(H_1: \mu_1 \ne \mu_2\). In the one-sided test, the direction of the inequality (greater than, less than) of the means is also taken into account. The two possible null hypotheses of the one-sided tests are \(H_0: \mu_1 \le \mu_2\) and \(H_0: \mu_1 \ge \mu_2\), with the alternative hypotheses \(H_1: \mu_1 > \mu_2\) and \(H_1: \mu_1 < \mu_2\), respectively.
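In base R, the test direction is chosen via the `alternative` argument of `t.test`. A minimal sketch; the two samples below are simulated purely for illustration:

```r
# Hypothetical simulated samples, purely for illustration
set.seed(42)
x <- rnorm(20, mean = 5.0, sd = 1)
y <- rnorm(20, mean = 5.5, sd = 1)

# Two-sided test: H0: mu_x = mu_y against H1: mu_x != mu_y
t.test(x, y, alternative = "two.sided", var.equal = TRUE)

# One-sided test: H0: mu_x >= mu_y against H1: mu_x < mu_y
t.test(x, y, alternative = "less", var.equal = TRUE)
```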

Requirements

The \(t\)-test can only be used if the two data series to be examined fulfill certain requirements:

  1. The values of the two data series to be analyzed must be normally distributed. This can be checked using the Shapiro-Wilk test or visually using a quantile-quantile plot (Q-Q plot). If the sample size of each data series is greater than 30, proof of normality is usually not necessary; in this case the central limit theorem (Lindeberg-Lévy) comes into effect.
  2. The variances of the two data series to be analyzed must be equal; the \(F\) test or the Levene test can be used for verification.
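Both requirements can be checked with one call each in base R; a minimal sketch on simulated data (the values are made up for illustration; the analysis below uses the pipe-friendly rstatix equivalents):

```r
# Hypothetical simulated samples, purely for illustration
set.seed(1)
x <- rnorm(40, mean = 10, sd = 2)
y <- rnorm(40, mean = 11, sd = 2)

# Requirement 1: normality of each sample (Shapiro-Wilk test);
# p > 0.05 means no evidence against normality
shapiro.test(x)
shapiro.test(y)

# Requirement 2: equality of variances (F test);
# p > 0.05 means no evidence against equal variances
var.test(x, y)
```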

Motivating example

Within the context of a study, the upper arm length (UL) of female and male subjects, aged between 50 and 52 years, is to be compared. The null hypothesis is: the mean values of the upper arm lengths are the same for both groups of subjects, $$H_0: \mu_{\mathrm{UL, f}} = \mu_{\mathrm{UL, m}}$$ and correspondingly the alternative hypothesis $$H_1: \mu_{\mathrm{UL, f}} \ne \mu_{\mathrm{UL, m}} . $$ The data used for the analysis are from the United States Department of Health and Human Services. Centers for Disease Control and Prevention. National Center for Health Statistics. National Health and Nutrition Examination Survey (NHANES), 2007-2008. Inter-university Consortium for Political and Social Research [distributor], 2012-02-22. doi.org/10.3886/ICPSR25505.v3.

Analysis script in Gnu R

Gnu R toolboxes used

For data import, analysis and visualization of the results we use existing toolboxes for the Gnu R system. The toolboxes in use are described further down in the post. They are loaded by the following code fragment:

```r
library(tidyverse)     # tidy data handling
library(haven)         # import of SPSS files
library(rstatix)       # pipe-friendly statistics
library(RColorBrewer)  # color maps
library(kableExtra)    # table output
library(ggplot2)       # high-quality plots
library(ggstatsplot)   # statistical plots
library(pander)        # rendering R objects into Pandoc markdown
library(latex2exp)     # mathematical expressions in plots
```

Import of the data and first exploratory analysis

First, the data to be analysed are imported into the Gnu R analysis environment. The data are available in SAV format; we use the function read_sav from the haven toolbox for the import.

```r
# Import the data set in SAV format
dataIn <- read_sav("25505-0012-Data.sav")
```

The imported dataset contains several variables. We select the two variables that will be considered in the analysis, the gender and the upper arm length of the subjects. We also remove all dataset entries with missing values (NA). Then, using the glimpse(dataAnalysis) statement, we output some information about the data selected for further analysis.

```r
dataAnalysis <- dataIn %>%
  dplyr::filter(RIDAGEYR >= 50 & RIDAGEYR <= 52) %>% # age between 50 and 52
  select(RIAGENDR, BMXARML) %>%                      # selection of the two variables
  mutate(RIAGENDR = as_factor(RIAGENDR)) %>%         # convert to factor
  na.omit()                                          # remove all entries with missing values (NA)

# Overview of the data selected for the analysis
glimpse(dataAnalysis)
## Rows: 316
## Columns: 2
## $ RIAGENDR <fct> Male, Male, Male, Female, Male, Male, Female, Male, Male, Mal…
## $ BMXARML  <dbl> 34.1, 37.2, 37.7, 35.7, 40.5, 36.0, 38.5, 40.0, 42.6, 45.7, 3…
```

For exploratory purposes, we report a few statistical parameters for both groups of subjects.

```r
dataAnalysis %>%
  group_by(RIAGENDR) %>%
  get_summary_stats() %>%
  kable(caption = "Statistical parameters for the upper arm lengths of the two subject groups")
```

| RIAGENDR | variable | n | min | max | median | q1 | q3 | iqr | mad | mean | sd | se | ci |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Male | BMXARML | 172 | 34.0 | 45.7 | 39.0 | 37.3 | 40.925 | 3.625 | 2.669 | 39.045 | 2.350 | 0.179 | 0.354 |
| Female | BMXARML | 144 | 29.5 | 42.4 | 35.7 | 34.2 | 37.150 | 2.950 | 2.224 | 35.933 | 2.354 | 0.196 | 0.388 |

Table 1: Statistical parameters for the upper arm lengths of the two subject groups

The dataset includes 316 complete records of subjects between the ages of 50 and 52; 144 of these subjects are female and 172 are male. The sample sizes of the two groups of subjects are visualized below by a bar chart. We use the functions ggplot and geom_bar from the ggplot2 toolbox for the visualization.

```r
ggplot(dataAnalysis, aes(x = RIAGENDR)) +
  geom_bar(stat = "count", width = 0.7, fill = "steelblue") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = 2, size = 10) +
  labs(x = "Gender", y = "Number") +
  theme_bw() +
  theme(text = element_text(size = 16))
```

In the next step, we generate an overview of the distribution of the data to be statistically analysed. We combine box-whisker and violin plots for this purpose. The box-whisker plot shows the median as well as the lower and upper quartiles. The violin plot shows the distribution of the recorded values for both groups of subjects. We also mark the mean values in the plots using a red diamond.

```r
ggplot(dataAnalysis, aes(x = RIAGENDR,
                         y = BMXARML)) +
  geom_violin() +
  geom_boxplot(width = 0.3) +
  stat_summary(fun = mean, colour = "darkred", geom = "point", shape = 18,
               size = 3, show.legend = FALSE) +
  stat_summary(fun = mean, colour = "red", geom = "text", show.legend = FALSE,
               vjust = -0.7, aes(label = round(after_stat(y), digits = 1))) +
  labs(x = "Gender", y = "Upper arm length / cm") +
  theme_bw() +
  theme(text = element_text(size = 16))
```

Verify requirements for \(t\)-test

Requirement: Normal distribution

The sample sizes of the two groups of subjects are larger than 30, so proof of normality is not strictly required. To illustrate the procedure, we nevertheless perform the verification, using the Shapiro-Wilk test for both groups of subjects.

```r
dataAnalysis %>%
  group_by(RIAGENDR) %>%
  shapiro_test(BMXARML) %>%
  add_significance("p") %>%
  kable(caption = "Results of the Shapiro-Wilk test")
```

| RIAGENDR | variable | statistic | p | p.signif |
|---|---|---|---|---|
| Male | BMXARML | 0.9892286 | 0.2164096 | ns |
| Female | BMXARML | 0.9870128 | 0.1962956 | ns |

Table 2: Results of the Shapiro-Wilk test

The \(p\) value of the Shapiro-Wilk test is \(p > 0.05\) for the data of both subject groups. Thus, the distribution of the data does not differ significantly from a normal distribution. Strictly speaking, a non-significant test result does not prove normality; it only shows that there is no evidence against the assumption, which is sufficient for applying the \(t\)-test.

The Q-Q plot provides a visual way to qualitatively check whether data follow a normal distribution. In the Q-Q plot, the quantiles of the empirical distribution of the collected data are plotted against the quantiles of the normal distribution. The solid line in the Q-Q plot represents the normal distribution. If the samples of a subject group (points in the plot) are normally distributed, the points should lie close to this line.

```r
ggplot(dataAnalysis, aes(sample = BMXARML, color = RIAGENDR)) +
  geom_qq() +
  geom_qq_line(linewidth = 1.5) +
  labs(x = "Theoretical quantiles", y = "Upper arm length / cm") +
  theme_bw() +
  theme(text = element_text(size = 16)) +
  scale_color_brewer(palette = "Paired", name = "Gender")
```

Requirement: Equality of variance

The second requirement for comparing two samples by means of the \(t\)-test is the equality of their variances, which can be checked using the Levene test.

```r
dataAnalysis %>%
  levene_test(BMXARML ~ RIAGENDR) %>%
  add_significance("p") %>%
  kable(caption = "Results of *Levene* test")
```

| df1 | df2 | statistic | p | p.signif |
|---|---|---|---|---|
| 1 | 314 | 0.6347787 | 0.4262099 | ns |

Table 3: Results of Levene test

With \(p > 0.05\), the hypothesis of equal variances is not rejected; this requirement for the \(t\)-test is therefore also fulfilled.

Hypothesis testing by means of the \(t\)-test

The requirements for the parametric hypothesis test are fulfilled. We now apply the \(t\)-test to test the null hypothesis \(H_0\).

```r
dataAnalysis %>%
  t_test(BMXARML ~ RIAGENDR) %>%
  add_significance("p") %>%
  kable(caption = "Results of the $t$-test")
```

| .y. | group1 | group2 | n1 | n2 | statistic | df | p | p.signif |
|---|---|---|---|---|---|---|---|---|
| BMXARML | Male | Female | 172 | 144 | 11.7147 | 304.1166 | 0 | **** |

Table 4: Results of the \(t\)-test

The \(p\) value of the \(t\)-test is far below the significance level of 0.05. The null hypothesis \(H_0\) is rejected and the alternative hypothesis \(H_1\) is accepted: the mean upper arm lengths of the two subject groups differ highly significantly.
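A side note on the fractional degrees of freedom (\(df \approx 304.12\)) in Table 4: t_test from rstatix, like base R's t.test, defaults to Welch's test (var.equal = FALSE), which does not pool the variances. Since equality of variances was confirmed above, the classical Student's form with \(n_1 + n_2 - 2 = 314\) degrees of freedom could equally be requested via var.equal = TRUE. A sketch on simulated data of the same shape (values made up; with the NHANES data one would add var.equal = TRUE to the t_test call):

```r
# Hypothetical simulated data, shaped like the NHANES samples above
set.seed(7)
x <- rnorm(172, mean = 39.0, sd = 2.35)  # stands in for the male group
y <- rnorm(144, mean = 35.9, sd = 2.35)  # stands in for the female group

t.test(x, y)$parameter                    # Welch's test: fractional df
t.test(x, y, var.equal = TRUE)$parameter  # Student's test: n1 + n2 - 2 = 314 df
```

With nearly identical sample standard deviations, as in the NHANES data, the two variants give practically the same statistic and conclusion.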

Visual representation of the results of the analysis

Finally, we present the results of the statistical analysis visually. The distribution of the data of the two groups of subjects is illustrated by a combination of box-whisker and violin plots, the individual measured values by coloured dots. The mean values are marked by labelled red dots. The total number of samples in the data set is given at the top right of the graph \((n_{\mathrm{obs}} = 316)\); the sample sizes of the individual subject groups appear in brackets on the x-axis. The results of the \(t\)-test are shown at the top left of the graph: the statistical test used, the parameter, the calculated statistic and the significance, followed by information on the effect size. We will discuss the properties and calculation of effect sizes in more detail in a future post on this blog.

```r
ggbetweenstats(
  data = dataAnalysis,
  x = RIAGENDR,
  y = BMXARML,
  ggtheme = ggplot2::theme_bw(),
  bf.message = FALSE,
  title = "Gender-specific upper arm length differences, age 50 to 52 years"
) +
  labs(x = "Gender", y = "Upper arm length / cm")
```

Gnu R toolboxes that we used

For the statistical analysis of the data and the visualisation of results and intermediate results, we used a number of toolboxes. Below is a list of these toolboxes with a short description and links to more detailed information:

Information about the hard- and software configuration of the computer on which this post was authored:

```r
pander(sessionInfo())
```

R version 4.4.0 (2024-04-24 ucrt)

Platform: x86_64-w64-mingw32/x64

locale: LC_COLLATE=German_Austria.utf8, LC_CTYPE=German_Austria.utf8, LC_MONETARY=German_Austria.utf8, LC_NUMERIC=C and LC_TIME=German_Austria.utf8

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: latex2exp(v.0.9.6), pander(v.0.6.5), ggstatsplot(v.0.12.3), kableExtra(v.1.4.0), RColorBrewer(v.1.1-3), rstatix(v.0.7.2), haven(v.2.5.4), lubridate(v.1.9.3), forcats(v.1.0.0), stringr(v.1.5.1), dplyr(v.1.1.4), purrr(v.1.0.2), readr(v.2.1.5), tidyr(v.1.3.1), tibble(v.3.2.1), ggplot2(v.3.5.1) and tidyverse(v.2.0.0)

loaded via a namespace (and not attached): tidyselect(v.1.2.1), viridisLite(v.0.4.2), farver(v.2.1.2), statsExpressions(v.1.5.4), fastmap(v.1.2.0), TH.data(v.1.1-2), blogdown(v.1.19), bayestestR(v.0.13.2), digest(v.0.6.35), timechange(v.0.3.0), estimability(v.1.5.1), lifecycle(v.1.0.4), survival(v.3.5-8), magrittr(v.2.0.3), compiler(v.4.4.0), rlang(v.1.1.3), sass(v.0.4.9), tools(v.4.4.0), utf8(v.1.2.4), yaml(v.2.3.8), knitr(v.1.46), labeling(v.0.4.3), xml2(v.1.3.6), abind(v.1.4-5), multcomp(v.1.4-25), withr(v.3.0.0), grid(v.4.4.0), datawizard(v.0.10.0), fansi(v.1.0.6), xtable(v.1.8-4), colorspace(v.2.1-0), paletteer(v.1.6.0), emmeans(v.1.10.2), scales(v.1.3.0), MASS(v.7.3-60.2), zeallot(v.0.1.0), insight(v.0.19.11), cli(v.3.6.2), mvtnorm(v.1.2-5), rmarkdown(v.2.27), generics(v.0.1.3), rstudioapi(v.0.16.0), tzdb(v.0.4.0), parameters(v.0.21.7), cachem(v.1.1.0), splines(v.4.4.0), effectsize(v.0.8.8), vctrs(v.0.6.5), Matrix(v.1.7-0), sandwich(v.3.1-0), jsonlite(v.1.8.8), carData(v.3.0-5), bookdown(v.0.39), car(v.3.1-2), patchwork(v.1.2.0), hms(v.1.1.3), ggrepel(v.0.9.5), correlation(v.0.8.4), systemfonts(v.1.1.0), jquerylib(v.0.1.4), glue(v.1.7.0), rematch2(v.2.1.2), codetools(v.0.2-20), stringi(v.1.8.4), gtable(v.0.3.5), prismatic(v.1.1.2), munsell(v.0.5.1), pillar(v.1.9.0), htmltools(v.0.5.8.1), R6(v.2.5.1), evaluate(v.0.23), lattice(v.0.22-6), highr(v.0.10), backports(v.1.4.1), broom(v.1.0.6), bslib(v.0.7.0), Rcpp(v.1.0.12), svglite(v.2.1.3), coda(v.0.19-4.1), xfun(v.0.44), zoo(v.1.8-12) and pkgconfig(v.2.0.3)

Theoretical foundations of the \(t\)-test

Requirements and fundamental principles of the test

The \(t\)-test can be applied if in both cohorts of a sample the variable under consideration is normally distributed and the variances are equal.

In this test, the difference between the mean values of the two cohorts is analysed. Under the null hypothesis that the means of the two cohorts are equal, this difference is zero. A test statistic based on the \(t\)-distribution is used for the analysis; it is computed from the difference of the two sample means and evaluated against this null hypothesis.

Notation and calculation of the \(t\)-test statistic

The sample sizes of the two analysed cohorts are \(n_1\) and \(n_2\). The sample means are \(\bar{x}_1\) and \(\bar{x}_2\), the sample standard deviations are \(s_1\) and \(s_2\).

The first step is to estimate the pooled standard deviation \(s\), $$ s = \sqrt{\frac{(n_1 - 1)\,s^2_1 + (n_2 - 1)\,s^2_2}{n_1 + n_2 - 2}}\,. $$ Then the test statistic $$ t = \frac{\bar{x}_1 - \bar{x}_2}{s\,\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} $$ is determined, which follows a \(t\) distribution with \((n_1 + n_2 - 2)\) degrees of freedom. The value of \(t\) calculated from the samples is compared with the critical value of the \(t\) distribution determined by the degrees of freedom \(df\) and the significance level \(\alpha\), see the following figure.
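As a plausibility check, the rounded summary statistics from Table 1 can be inserted into these two formulas. Because the two sample standard deviations are almost identical, the result agrees with the (Welch) statistic reported in Table 4 up to rounding:

```r
# Summary statistics from Table 1 (rounded)
n1 <- 172; xbar1 <- 39.045; s1 <- 2.350  # male group
n2 <- 144; xbar2 <- 35.933; s2 <- 2.354  # female group

# Pooled standard deviation
s <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))

# Test statistic with n1 + n2 - 2 = 314 degrees of freedom
t <- (xbar1 - xbar2) / (s * sqrt(1 / n1 + 1 / n2))
round(t, 2)
## [1] 11.71
```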

The respective critical values can be taken from tables in the books listed below or calculated with the Gnu R function qt, e.g. for \(df = 304\) and \(\alpha = 0.05\):

```r
df <- 304             # degrees of freedom
alpha <- 0.05         # significance level
qt(1 - alpha / 2, df) # two-sided test -> alpha / 2
## [1] 1.967798
```
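Conversely, the \(p\) value reported by the test follows from the statistic and the degrees of freedom via the cumulative distribution function pt; using the values from Table 4:

```r
# Statistic and Welch degrees of freedom from Table 4
t_stat <- 11.7147
df <- 304.1166

# Two-sided p value: twice the upper-tail probability; vanishingly small,
# which is why Table 4 shows 0 after rounding
2 * pt(-abs(t_stat), df)
```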

Further reading

Ott, R. L., & Longnecker, M. (2010). An introduction to statistical methods and data analysis (6th ed.). Brooks/Cole, Cengage Learning.

Petrie, A., & Sabin, C. (2005). Medical statistics at a glance (2nd ed.). Blackwell Publishing.

Rosner, B. (2016). Fundamentals of biostatistics (8th ed.). Cengage Learning.